* [PATCH v2] fs: cache the string generated by reading /proc/filesystems
@ 2026-04-22 18:17 Mateusz Guzik
2026-04-22 22:07 ` [PATCH v2] " Andreas Dilger
2026-04-23 13:36 ` [PATCH v2] fs: " Christian Brauner
0 siblings, 2 replies; 7+ messages in thread
From: Mateusz Guzik @ 2026-04-22 18:17 UTC (permalink / raw)
To: brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel, Mateusz Guzik
It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
This is lock-protected pointer chasing over a linked list to pay for
sprintf for every fs (32 on my boxen).
Instead cache the result.
open+read+close cycle single-threaded (ops/s):
before: 442732
after: 1063462 (+140%)
Here the main bottleneck is memcg.
Scalability-wise problems are avoidable lockref trip on open and ref
management for the file on procfs side.
The file looks like a sterotypical C from the 90s, right down to an
open-coded and slightly obfuscated linked list. I intentionally did not
clean up any of it -- I think the file will be best served by a Rust
rewrite when the time comes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
v2:
- drop the procfs bits
- touch up some comments
I posted v1 last year https://lore.kernel.org/linux-fsdevel/20250329192821.822253-1-mjguzik@gmail.com/
but that ran into some procfs issues. the thing can be sped up
regardless of the procfs problem.
fs/filesystems.c | 144 ++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 137 insertions(+), 7 deletions(-)
diff --git a/fs/filesystems.c b/fs/filesystems.c
index 0c7d2b7ac26c..704fc6d49f80 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -34,6 +34,23 @@
static struct file_system_type *file_systems;
static DEFINE_RWLOCK(file_systems_lock);
+#ifdef CONFIG_PROC_FS
+static unsigned long file_systems_gen;
+
+struct file_systems_string {
+ struct rcu_head rcufree;
+ unsigned long gen;
+ size_t len;
+ char string[];
+};
+static struct file_systems_string *file_systems_string;
+static void invalidate_filesystems_string(void);
+#else
+static void invalidate_filesystems_string(void)
+{
+}
+#endif
+
/* WARNING: This can be used only if we _already_ own a reference */
struct file_system_type *get_filesystem(struct file_system_type *fs)
{
@@ -83,10 +100,12 @@ int register_filesystem(struct file_system_type * fs)
return -EBUSY;
write_lock(&file_systems_lock);
p = find_filesystem(fs->name, strlen(fs->name));
- if (*p)
+ if (*p) {
res = -EBUSY;
- else
+ } else {
*p = fs;
+ invalidate_filesystems_string();
+ }
write_unlock(&file_systems_lock);
return res;
}
@@ -115,6 +134,7 @@ int unregister_filesystem(struct file_system_type * fs)
if (fs == *tmp) {
*tmp = fs->next;
fs->next = NULL;
+ invalidate_filesystems_string();
write_unlock(&file_systems_lock);
synchronize_rcu();
return 0;
@@ -235,22 +255,132 @@ int __init list_bdev_fs_names(char *buf, size_t size)
}
#ifdef CONFIG_PROC_FS
-static int filesystems_proc_show(struct seq_file *m, void *v)
+/*
+ * The fs list gets queried a lot by userspace because of libselinux, including
+ * rather surprising programs (would you guess *sed* is on the list?). In order
+ * to reduce the overhead we cache the resulting string, which normally hangs
+ * around below 512 bytes in size.
+ *
+ * As the list almost never changes, its creation is not particularly optimized
+ * for simplicity.
+ *
+ * We sort it out on read in order to not introduce a failure point for fs
+ * registration (in principle we may be unable to alloc memory for the list).
+ */
+static void invalidate_filesystems_string(void)
{
- struct file_system_type * tmp;
+ struct file_systems_string *fss;
- read_lock(&file_systems_lock);
+ lockdep_assert_held_write(&file_systems_lock);
+ file_systems_gen++;
+ fss = file_systems_string;
+ WRITE_ONCE(file_systems_string, NULL);
+ kfree_rcu(fss, rcufree);
+}
+
+static noinline int regen_filesystems_string(void)
+{
+ struct file_system_type *tmp;
+ struct file_systems_string *old, *new;
+ size_t newlen, usedlen;
+ unsigned long gen;
+
+retry:
+ lockdep_assert_not_held(&file_systems_lock);
+
+ newlen = 0;
+ write_lock(&file_systems_lock);
+ gen = file_systems_gen;
+ tmp = file_systems;
+
+ /* pre-calc space for "%s\t%s\n" for each fs */
+ while (tmp) {
+ if (!(tmp->fs_flags & FS_REQUIRES_DEV))
+ newlen += strlen("nodev");
+ newlen += strlen("\t");
+ newlen += strlen(tmp->name);
+ newlen += strlen("\n");
+ tmp = tmp->next;
+ }
+ write_unlock(&file_systems_lock);
+
+ new = kmalloc(offsetof(struct file_systems_string, string) + newlen + 1,
+ GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ new->gen = gen;
+ new->len = newlen;
+ new->string[newlen] = '\0';
+ write_lock(&file_systems_lock);
+ old = file_systems_string;
+
+ /*
+ * Did someone beat us to it?
+ */
+ if (old && old->gen == file_systems_gen) {
+ write_unlock(&file_systems_lock);
+ kfree(new);
+ return 0;
+ }
+
+ /*
+ * Did the list change in the meantime?
+ */
+ if (gen != file_systems_gen) {
+ write_unlock(&file_systems_lock);
+ kfree(new);
+ goto retry;
+ }
+
+ /*
+ * Populate the string.
+ *
+ * We know we have just enough space because we calculated the right
+ * size the previous time we had the lock and confirmed the list has
+ * not changed after reacquiring it.
+ */
+ usedlen = 0;
tmp = file_systems;
while (tmp) {
- seq_printf(m, "%s\t%s\n",
+ usedlen += sprintf(&new->string[usedlen], "%s\t%s\n",
(tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
tmp->name);
tmp = tmp->next;
}
- read_unlock(&file_systems_lock);
+ BUG_ON(new->len != strlen(new->string));
+
+ /*
+ * Paired with consume fence in READ_ONCE() in filesystems_proc_show()
+ */
+ smp_store_release(&file_systems_string, new);
+ write_unlock(&file_systems_lock);
+ kfree_rcu(old, rcufree);
return 0;
}
+static int filesystems_proc_show(struct seq_file *m, void *v)
+{
+ struct file_systems_string *fss;
+
+ for (;;) {
+ scoped_guard(rcu) {
+ /*
+ * Paired with smp_store_release() in regen_filesystems_string()
+ */
+ fss = READ_ONCE(file_systems_string);
+ if (likely(fss)) {
+ seq_write(m, fss->string, fss->len);
+ return 0;
+ }
+ }
+
+ int err = regen_filesystems_string();
+ if (unlikely(err))
+ return err;
+ }
+}
+
static int __init proc_filesystems_init(void)
{
proc_create_single("filesystems", 0, NULL, filesystems_proc_show);
--
2.48.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v2] cache the string generated by reading /proc/filesystems
2026-04-22 18:17 [PATCH v2] fs: cache the string generated by reading /proc/filesystems Mateusz Guzik
@ 2026-04-22 22:07 ` Andreas Dilger
2026-04-22 22:42 ` Al Viro
2026-04-23 14:28 ` Mateusz Guzik
2026-04-23 13:36 ` [PATCH v2] fs: " Christian Brauner
1 sibling, 2 replies; 7+ messages in thread
From: Andreas Dilger @ 2026-04-22 22:07 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: brauner, viro, jack, linux-kernel, linux-fsdevel
On Apr 22, 2026, at 12:17, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
Yeah, it is sad that bloat continually grows to continually make fast
systems slow.
> This is lock-protected pointer chasing over a linked list to pay for
> sprintf for every fs (32 on my boxen).
>
> Instead cache the result.
>
> open+read+close cycle single-threaded (ops/s):
> before: 442732
> after: 1063462 (+140%)
>
> Here the main bottleneck is memcg.
>
> Scalability-wise problems are avoidable lockref trip on open and ref
> management for the file on procfs side.
>
> The file looks like a sterotypical C from the 90s, right down to an
> open-coded and slightly obfuscated linked list. I intentionally did not
> clean up any of it -- I think the file will be best served by a Rust
> rewrite when the time comes.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> v2:
> - drop the procfs bits
> - touch up some comments
>
> I posted v1 last year https://lore.kernel.org/linux-fsdevel/20250329192821.822253-1-mjguzik@gmail.com/
>
> but that ran into some procfs issues. the thing can be sped up
> regardless of the procfs problem.
>
> fs/filesystems.c | 144 ++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 137 insertions(+), 7 deletions(-)
>
> diff --git a/fs/filesystems.c b/fs/filesystems.c
>
> +static noinline int regen_filesystems_string(void)
> +{
> + struct file_system_type *tmp;
> + struct file_systems_string *old, *new;
> + size_t newlen, usedlen;
> + unsigned long gen;
> +
> +retry:
> + lockdep_assert_not_held(&file_systems_lock);
> +
> + newlen = 0;
> + write_lock(&file_systems_lock);
> + gen = file_systems_gen;
> + tmp = file_systems;
> +
> + /* pre-calc space for "%s\t%s\n" for each fs */
> + while (tmp) {
> + if (!(tmp->fs_flags & FS_REQUIRES_DEV))
> + newlen += strlen("nodev");
> + newlen += strlen("\t");
> + newlen += strlen(tmp->name);
> + newlen += strlen("\n");
> + tmp = tmp->next;
> + }
> + write_unlock(&file_systems_lock);
> +
> + new = kmalloc(offsetof(struct file_systems_string, string) + newlen + 1,
> + GFP_KERNEL);
> + if (!new)
> + return -ENOMEM;
This is introducing an error case where there was none before, in a
critical system file. It would also be possible to fall back to
seq_printf() instead of sprinft() if no buffer is available.
> + new->gen = gen;
> + new->len = newlen;
> + new->string[newlen] = '\0';
> + write_lock(&file_systems_lock);
> + old = file_systems_string;
> +
> + /*
> + * Did someone beat us to it?
> + */
> + if (old && old->gen == file_systems_gen) {
> + write_unlock(&file_systems_lock);
> + kfree(new);
> + return 0;
> + }
> +
> + /*
> + * Did the list change in the meantime?
> + */
> + if (gen != file_systems_gen) {
> + write_unlock(&file_systems_lock);
> + kfree(new);
> + goto retry;
> + }
> +
> + /*
> + * Populate the string.
> + *
> + * We know we have just enough space because we calculated the right
> + * size the previous time we had the lock and confirmed the list has
> + * not changed after reacquiring it.
> + */
It doesn't seem prudent to use raw sprintf() in a loop and justify this in
a comment, when the actual string generation is not performance critical.
This could use scnprintf() and be "safer" even if it "should just work" (tm).
> + usedlen = 0;
> tmp = file_systems;
> while (tmp) {
> - seq_printf(m, "%s\t%s\n",
> + usedlen += sprintf(&new->string[usedlen], "%s\t%s\n",
> (tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
> tmp->name);
> tmp = tmp->next;
> }
> - read_unlock(&file_systems_lock);
> + BUG_ON(new->len != strlen(new->string));
Adding a BUG_ON() here seems gratuitous, when there are lots of other
options to handle this case, especially if sprintf() didn't just corrupt
memory for some reason. It could allocate a larger buffer and retry,
or in the worst case return an error instead of crashing the system.
> +
> + /*
> + * Paired with consume fence in READ_ONCE() in filesystems_proc_show()
> + */
> + smp_store_release(&file_systems_string, new);
> + write_unlock(&file_systems_lock);
> + kfree_rcu(old, rcufree);
> return 0;
> }
>
Cheers, Andreas
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] cache the string generated by reading /proc/filesystems
2026-04-22 22:07 ` [PATCH v2] " Andreas Dilger
@ 2026-04-22 22:42 ` Al Viro
2026-04-23 14:28 ` Mateusz Guzik
1 sibling, 0 replies; 7+ messages in thread
From: Al Viro @ 2026-04-22 22:42 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Mateusz Guzik, brauner, jack, linux-kernel, linux-fsdevel
On Wed, Apr 22, 2026 at 04:07:15PM -0600, Andreas Dilger wrote:
> On Apr 22, 2026, at 12:17, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
>
> Yeah, it is sad that bloat continually grows to continually make fast
> systems slow.
libselinux looking for selinuxfs, at a guess, not that I had any idea
why sed might possibly care about that...
<RTFS>
-i support, apparently.
No printable comment...
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] fs: cache the string generated by reading /proc/filesystems
2026-04-22 18:17 [PATCH v2] fs: cache the string generated by reading /proc/filesystems Mateusz Guzik
2026-04-22 22:07 ` [PATCH v2] " Andreas Dilger
@ 2026-04-23 13:36 ` Christian Brauner
2026-04-23 14:38 ` Mateusz Guzik
1 sibling, 1 reply; 7+ messages in thread
From: Christian Brauner @ 2026-04-23 13:36 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: viro, jack, linux-kernel, linux-fsdevel
[-- Attachment #1: Type: text/plain, Size: 963 bytes --]
On Wed, Apr 22, 2026 at 08:17:11PM +0200, Mateusz Guzik wrote:
> It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
>
> This is lock-protected pointer chasing over a linked list to pay for
> sprintf for every fs (32 on my boxen).
>
> Instead cache the result.
>
> open+read+close cycle single-threaded (ops/s):
> before: 442732
> after: 1063462 (+140%)
>
> Here the main bottleneck is memcg.
>
> Scalability-wise problems are avoidable lockref trip on open and ref
> management for the file on procfs side.
>
> The file looks like a sterotypical C from the 90s, right down to an
> open-coded and slightly obfuscated linked list. I intentionally did not
> clean up any of it -- I think the file will be best served by a Rust
> rewrite when the time comes.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
Ugh, can't we at least also get rid of the rwlock stuff?
Something like the appended two (completely untested) patches.
[-- Attachment #2: 0001-proc-add-a-helper-for-marking-files-as-permanent-by-.patch --]
[-- Type: text/x-diff, Size: 1724 bytes --]
From 148061e07d8dfc513df6d190791811ed98e21adc Mon Sep 17 00:00:00 2001
From: Mateusz Guzik <mjguzik@gmail.com>
Date: Sat, 29 Mar 2025 20:28:20 +0100
Subject: [PATCH 1/2] proc: add a helper for marking files as permanent by
external consumers
This avoids refing trips on use, added with VFS in mind but should be
helpful elsewhere as well.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250329192821.822253-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/proc/generic.c | 6 ++++++
include/linux/proc_fs.h | 1 +
2 files changed, 7 insertions(+)
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 8bb81e58c9d8..c9de1c95bbdc 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -677,6 +677,12 @@ struct proc_dir_entry *proc_create_single_data(const char *name, umode_t mode,
}
EXPORT_SYMBOL(proc_create_single_data);
+void proc_make_permanent(struct proc_dir_entry *de)
+{
+ pde_make_permanent(de);
+}
+EXPORT_SYMBOL(proc_make_permanent);
+
void proc_set_size(struct proc_dir_entry *de, loff_t size)
{
de->size = size;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 19d1c5e5f335..360c1b90b546 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -106,6 +106,7 @@ struct proc_dir_entry *proc_create_single_data(const char *name, umode_t mode,
int (*show)(struct seq_file *, void *), void *data);
#define proc_create_single(name, mode, parent, show) \
proc_create_single_data(name, mode, parent, show, NULL)
+void proc_make_permanent(struct proc_dir_entry *);
extern struct proc_dir_entry *proc_create_data(const char *, umode_t,
struct proc_dir_entry *,
--
2.47.3
[-- Attachment #3: 0002-fs-RCU-ify-filesystems-list-and-cache-proc-filesyste.patch --]
[-- Type: text/x-diff, Size: 14523 bytes --]
From 5c5a25c5ab66f666349e80f78b5168a9fb4eebd6 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 23 Apr 2026 14:43:17 +0200
Subject: [PATCH 2/2] fs: RCU-ify filesystems list and cache /proc/filesystems
The drivers list was protected by an rwlock; every mount, every open
of /proc/filesystems and the legacy sysfs(2) syscall walked a
hand-rolled singly-linked list under it. /proc/filesystems is
especially hot because libselinux causes programs as mundane as
mkdir, ls and sed to open and read it on every invocation.
Convert the list to an RCU-protected hlist and switch the writer side
to a plain spinlock. Writers keep their existing non-sleeping
section while readers walk under rcu_read_lock() with no lock traffic:
- register_filesystem()/unregister_filesystem() take
file_systems_lock, publish via hlist_{add_tail,del_init}_rcu()
and invalidate the cached /proc/filesystems string.
unregister_filesystem() keeps its synchronize_rcu() after
dropping the lock so in-flight readers are drained before the
module (and its embedded file_system_type) can go away.
- __get_fs_type(), list_bdev_fs_names() and the
fs_index()/fs_name()/fs_maxindex() helpers walk the list under
rcu_read_lock(). fs_name() continues to drop the read-side
lock after try_module_get() and accesses ->name outside the RCU
section; the module reference pins the embedded file_system_type
across the boundary.
Additionally cache the pre-rendered "[nodev\t]<name>\n" blob so
filesystems_proc_show() becomes a single seq_write() under
rcu_read_lock(). On a cache miss regen_filesystems_string() sizes
the buffer under RCU, allocates GFP_KERNEL and refills under the
writer spinlock; a generation counter bumped under the spinlock lets
it detect concurrent (un)register and retry the allocation.
Finally mark /proc/filesystems as permanent so the per-open refcount
bounce (which can degrade to a spinlock acquire) is removed; this
was the dominant bottleneck once the list walk itself was off the
critical path and is what lifts the multi-threaded numbers in
Mateusz Guzik's measurements:
https://lore.kernel.org/linux-fsdevel/20260422181711.1340269-1-mjguzik@gmail.com/
struct file_system_type::next becomes struct hlist_node list; no
in-tree caller references the old ->next field outside
fs/filesystems.c.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/filesystems.c | 282 +++++++++++++++++++++++++++++----------------
fs/ocfs2/super.c | 1 -
include/linux/fs.h | 2 +-
3 files changed, 181 insertions(+), 104 deletions(-)
diff --git a/fs/filesystems.c b/fs/filesystems.c
index 0c7d2b7ac26c..9b24c309f71e 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -17,22 +17,50 @@
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/fs_parser.h>
+#include <linux/rculist.h>
/*
- * Handling of filesystem drivers list.
- * Rules:
- * Inclusion to/removals from/scanning of list are protected by spinlock.
- * During the unload module must call unregister_filesystem().
- * We can access the fields of list element if:
- * 1) spinlock is held or
- * 2) we hold the reference to the module.
- * The latter can be guaranteed by call of try_module_get(); if it
- * returned 0 we must skip the element, otherwise we got the reference.
- * Once the reference is obtained we can drop the spinlock.
+ * Read-mostly filesystem drivers list.
+ *
+ * Readers walk under rcu_read_lock(); writers take file_systems_lock
+ * and publish via _rcu hlist primitives. unregister_filesystem()
+ * synchronize_rcu()s after unlock so the embedded file_system_type
+ * can't go away under a reader. To keep using a filesystem after
+ * the RCU section ends, take a module reference via try_module_get().
+ */
+static HLIST_HEAD(file_systems);
+static DEFINE_SPINLOCK(file_systems_lock);
+
+#ifdef CONFIG_PROC_FS
+/*
+ * Cache /proc/filesystems - libselinux has mkdir/ls/sed read it on
+ * every invocation. file_systems_gen lets regen detect (un)register
+ * races against the cached generation and retry.
*/
+struct file_systems_string {
+ struct rcu_head rcu;
+ unsigned long gen;
+ size_t len;
+ char string[];
+};
-static struct file_system_type *file_systems;
-static DEFINE_RWLOCK(file_systems_lock);
+static unsigned long file_systems_gen;
+static struct file_systems_string __rcu *file_systems_string;
+
+static void invalidate_filesystems_string(void)
+{
+ struct file_systems_string *old;
+
+ lockdep_assert_held(&file_systems_lock);
+ file_systems_gen++;
+ old = rcu_replace_pointer(file_systems_string, NULL,
+ lockdep_is_held(&file_systems_lock));
+ if (old)
+ kfree_rcu(old, rcu);
+}
+#else
+static inline void invalidate_filesystems_string(void) { }
+#endif
/* WARNING: This can be used only if we _already_ own a reference */
struct file_system_type *get_filesystem(struct file_system_type *fs)
@@ -46,14 +74,15 @@ void put_filesystem(struct file_system_type *fs)
module_put(fs->owner);
}
-static struct file_system_type **find_filesystem(const char *name, unsigned len)
+static struct file_system_type *find_filesystem(const char *name, unsigned len)
{
- struct file_system_type **p;
- for (p = &file_systems; *p; p = &(*p)->next)
- if (strncmp((*p)->name, name, len) == 0 &&
- !(*p)->name[len])
- break;
- return p;
+ struct file_system_type *fs;
+
+ hlist_for_each_entry_rcu(fs, &file_systems, list,
+ lockdep_is_held(&file_systems_lock))
+ if (strncmp(fs->name, name, len) == 0 && !fs->name[len])
+ return fs;
+ return NULL;
}
/**
@@ -64,33 +93,27 @@ static struct file_system_type **find_filesystem(const char *name, unsigned len)
* is aware of for mount and other syscalls. Returns 0 on success,
* or a negative errno code on an error.
*
- * The &struct file_system_type that is passed is linked into the kernel
+ * The &struct file_system_type that is passed is linked into the kernel
* structures and must not be freed until the file system has been
* unregistered.
*/
-
-int register_filesystem(struct file_system_type * fs)
+int register_filesystem(struct file_system_type *fs)
{
- int res = 0;
- struct file_system_type ** p;
-
if (fs->parameters &&
!fs_validate_description(fs->name, fs->parameters))
return -EINVAL;
BUG_ON(strchr(fs->name, '.'));
- if (fs->next)
+ if (!hlist_unhashed_lockless(&fs->list))
return -EBUSY;
- write_lock(&file_systems_lock);
- p = find_filesystem(fs->name, strlen(fs->name));
- if (*p)
- res = -EBUSY;
- else
- *p = fs;
- write_unlock(&file_systems_lock);
- return res;
-}
+ guard(spinlock)(&file_systems_lock);
+ if (find_filesystem(fs->name, strlen(fs->name)))
+ return -EBUSY;
+ hlist_add_tail_rcu(&fs->list, &file_systems);
+ invalidate_filesystems_string();
+ return 0;
+}
EXPORT_SYMBOL(register_filesystem);
/**
@@ -100,94 +123,79 @@ EXPORT_SYMBOL(register_filesystem);
* Remove a file system that was previously successfully registered
* with the kernel. An error is returned if the file system is not found.
* Zero is returned on a success.
- *
+ *
* Once this function has returned the &struct file_system_type structure
* may be freed or reused.
*/
-
-int unregister_filesystem(struct file_system_type * fs)
+int unregister_filesystem(struct file_system_type *fs)
{
- struct file_system_type ** tmp;
-
- write_lock(&file_systems_lock);
- tmp = &file_systems;
- while (*tmp) {
- if (fs == *tmp) {
- *tmp = fs->next;
- fs->next = NULL;
- write_unlock(&file_systems_lock);
- synchronize_rcu();
- return 0;
- }
- tmp = &(*tmp)->next;
+ scoped_guard(spinlock, &file_systems_lock) {
+ if (hlist_unhashed(&fs->list))
+ return -EINVAL;
+ hlist_del_init_rcu(&fs->list);
+ invalidate_filesystems_string();
}
- write_unlock(&file_systems_lock);
-
- return -EINVAL;
+ synchronize_rcu();
+ return 0;
}
-
EXPORT_SYMBOL(unregister_filesystem);
#ifdef CONFIG_SYSFS_SYSCALL
-static int fs_index(const char __user * __name)
+static int fs_index(const char __user *__name)
{
- struct file_system_type * tmp;
+ struct file_system_type *tmp;
char *name __free(kfree) = strndup_user(__name, PATH_MAX);
- int err, index;
+ int index = 0;
if (IS_ERR(name))
return PTR_ERR(name);
- err = -EINVAL;
- read_lock(&file_systems_lock);
- for (tmp=file_systems, index=0 ; tmp ; tmp=tmp->next, index++) {
- if (strcmp(tmp->name, name) == 0) {
- err = index;
- break;
- }
+ guard(rcu)();
+ hlist_for_each_entry_rcu(tmp, &file_systems, list) {
+ if (strcmp(tmp->name, name) == 0)
+ return index;
+ index++;
}
- read_unlock(&file_systems_lock);
- return err;
+ return -EINVAL;
}
-static int fs_name(unsigned int index, char __user * buf)
+static int fs_name(unsigned int index, char __user *buf)
{
- struct file_system_type * tmp;
- int len, res = -EINVAL;
+ struct file_system_type *tmp, *found = NULL;
+ int len, res;
- read_lock(&file_systems_lock);
- for (tmp = file_systems; tmp; tmp = tmp->next, index--) {
- if (index == 0) {
+ scoped_guard(rcu) {
+ hlist_for_each_entry_rcu(tmp, &file_systems, list) {
+ if (index--)
+ continue;
if (try_module_get(tmp->owner))
- res = 0;
+ found = tmp;
break;
}
}
- read_unlock(&file_systems_lock);
- if (res)
- return res;
+ if (!found)
+ return -EINVAL;
/* OK, we got the reference, so we can safely block */
- len = strlen(tmp->name) + 1;
- res = copy_to_user(buf, tmp->name, len) ? -EFAULT : 0;
- put_filesystem(tmp);
+ len = strlen(found->name) + 1;
+ res = copy_to_user(buf, found->name, len) ? -EFAULT : 0;
+ put_filesystem(found);
return res;
}
static int fs_maxindex(void)
{
- struct file_system_type * tmp;
- int index;
+ struct file_system_type *tmp;
+ int index = 0;
- read_lock(&file_systems_lock);
- for (tmp = file_systems, index = 0 ; tmp ; tmp = tmp->next, index++)
- ;
- read_unlock(&file_systems_lock);
+ guard(rcu)();
+ hlist_for_each_entry_rcu(tmp, &file_systems, list)
+ index++;
return index;
}
/*
- * Whee.. Weird sysv syscall.
+ * Whee.. Weird sysv syscall.
*/
SYSCALL_DEFINE3(sysfs, int, option, unsigned long, arg1, unsigned long, arg2)
{
@@ -216,8 +224,8 @@ int __init list_bdev_fs_names(char *buf, size_t size)
size_t len;
int count = 0;
- read_lock(&file_systems_lock);
- for (p = file_systems; p; p = p->next) {
+ guard(rcu)();
+ hlist_for_each_entry_rcu(p, &file_systems, list) {
if (!(p->fs_flags & FS_REQUIRES_DEV))
continue;
len = strlen(p->name) + 1;
@@ -230,30 +238,101 @@ int __init list_bdev_fs_names(char *buf, size_t size)
size -= len;
count++;
}
- read_unlock(&file_systems_lock);
return count;
}
#ifdef CONFIG_PROC_FS
-static int filesystems_proc_show(struct seq_file *m, void *v)
+static __cold noinline int regen_filesystems_string(void)
{
- struct file_system_type * tmp;
+ struct file_system_type *tmp;
+ struct file_systems_string *old, *new;
+ size_t newlen, usedlen;
+ unsigned long gen;
+
+retry:
+ /*
+ * Size under RCU; the spinlock below detects list mutation via
+ * file_systems_gen and we retry. Bounded by (un)register rate.
+ */
+ newlen = 0;
+ scoped_guard(rcu) {
+ gen = READ_ONCE(file_systems_gen);
+ hlist_for_each_entry_rcu(tmp, &file_systems, list) {
+ if (!(tmp->fs_flags & FS_REQUIRES_DEV))
+ newlen += strlen("nodev");
+ newlen += strlen("\t") + strlen(tmp->name) +
+ strlen("\n");
+ }
+ }
+
+ new = kmalloc(struct_size(new, string, newlen), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+ new->gen = gen;
+ new->len = newlen;
+
+ spin_lock(&file_systems_lock);
+ old = rcu_dereference_protected(file_systems_string,
+ lockdep_is_held(&file_systems_lock));
+
+ /* Someone regenerated for our generation already. */
+ if (old && old->gen == file_systems_gen) {
+ spin_unlock(&file_systems_lock);
+ kfree(new);
+ return 0;
+ }
- read_lock(&file_systems_lock);
- tmp = file_systems;
- while (tmp) {
- seq_printf(m, "%s\t%s\n",
+ /* List mutated between size-calc and here; retry. */
+ if (gen != file_systems_gen) {
+ spin_unlock(&file_systems_lock);
+ kfree(new);
+ goto retry;
+ }
+
+ /* Sized for this generation under the lock, so sprintf fits. */
+ usedlen = 0;
+ hlist_for_each_entry(tmp, &file_systems, list) {
+ usedlen += sprintf(&new->string[usedlen], "%s\t%s\n",
(tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
tmp->name);
- tmp = tmp->next;
}
- read_unlock(&file_systems_lock);
+ WARN_ON_ONCE(usedlen != newlen);
+
+ rcu_assign_pointer(file_systems_string, new);
+ spin_unlock(&file_systems_lock);
+ if (old)
+ kfree_rcu(old, rcu);
return 0;
}
+static int filesystems_proc_show(struct seq_file *m, void *v)
+{
+ struct file_systems_string *fss;
+ int err;
+
+ for (;;) {
+ scoped_guard(rcu) {
+ fss = rcu_dereference(file_systems_string);
+ if (likely(fss)) {
+ seq_write(m, fss->string, fss->len);
+ return 0;
+ }
+ }
+
+ err = regen_filesystems_string();
+ if (unlikely(err))
+ return err;
+ }
+}
+
static int __init proc_filesystems_init(void)
{
- proc_create_single("filesystems", 0, NULL, filesystems_proc_show);
+ struct proc_dir_entry *pde;
+
+ pde = proc_create_single("filesystems", 0, NULL, filesystems_proc_show);
+ if (!pde)
+ return -ENOMEM;
+ proc_make_permanent(pde);
return 0;
}
module_init(proc_filesystems_init);
@@ -263,11 +342,10 @@ static struct file_system_type *__get_fs_type(const char *name, int len)
{
struct file_system_type *fs;
- read_lock(&file_systems_lock);
- fs = *(find_filesystem(name, len));
+ guard(rcu)();
+ fs = find_filesystem(name, len);
if (fs && !try_module_get(fs->owner))
fs = NULL;
- read_unlock(&file_systems_lock);
return fs;
}
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index b875f01c9756..4870e680c4e5 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1224,7 +1224,6 @@ static struct file_system_type ocfs2_fs_type = {
.name = "ocfs2",
.kill_sb = kill_block_super,
.fs_flags = FS_REQUIRES_DEV|FS_RENAME_DOES_D_MOVE,
- .next = NULL,
.init_fs_context = ocfs2_init_fs_context,
.parameters = ocfs2_param_spec,
};
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..c37bb3c7de8b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2286,7 +2286,7 @@ struct file_system_type {
const struct fs_parameter_spec *parameters;
void (*kill_sb) (struct super_block *);
struct module *owner;
- struct file_system_type * next;
+ struct hlist_node list;
struct hlist_head fs_supers;
struct lock_class_key s_lock_key;
--
2.47.3
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v2] cache the string generated by reading /proc/filesystems
2026-04-22 22:07 ` [PATCH v2] " Andreas Dilger
2026-04-22 22:42 ` Al Viro
@ 2026-04-23 14:28 ` Mateusz Guzik
1 sibling, 0 replies; 7+ messages in thread
From: Mateusz Guzik @ 2026-04-23 14:28 UTC (permalink / raw)
To: Andreas Dilger; +Cc: brauner, viro, jack, linux-kernel, linux-fsdevel
On Thu, Apr 23, 2026 at 12:07 AM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Apr 22, 2026, at 12:17, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > + new = kmalloc(offsetof(struct file_systems_string, string) + newlen + 1,
> > + GFP_KERNEL);
> > + if (!new)
> > + return -ENOMEM;
>
> This is introducing an error case where there was none before, in a
> critical system file. It would also be possible to fall back to
> seq_printf() instead of sprinft() if no buffer is available.
>
Fair, should be easy to do.
> > + /*
> > + * Populate the string.
> > + *
> > + * We know we have just enough space because we calculated the right
> > + * size the previous time we had the lock and confirmed the list has
> > + * not changed after reacquiring it.
> > + */
>
> It doesn't seem prudent to use raw sprintf() in a loop and justify this in
> a comment, when the actual string generation is not performance critical.
> This could use scnprintf() and be "safer" even if it "should just work" (tm).
>
I don't get how that's supposed help anything. I still have to call
the routine in a loop, except this time I have to keep recalculating
remaining size by hand on top of it.
What this really wants is a string library which handles easy things
like this one, FreeBSD has something like this in
https://man.freebsd.org/cgi/man.cgi?sbuf(9) , but I certainly can't be
bothered to work on an equivalent or porting it over.
If one is to "overflow-proof this" with total disregard for
performance, a series of strlcats would do it.
> > + usedlen = 0;
> > tmp = file_systems;
> > while (tmp) {
> > - seq_printf(m, "%s\t%s\n",
> > + usedlen += sprintf(&new->string[usedlen], "%s\t%s\n",
> > (tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
> > tmp->name);
> > tmp = tmp->next;
> > }
> > - read_unlock(&file_systems_lock);
> > + BUG_ON(new->len != strlen(new->string));
>
> Adding a BUG_ON() here seems gratuitous, when there are lots of other
> options to handle this case, especially if sprintf() didn't just corrupt
> memory for some reason. It could allocate a larger buffer and retry,
> or in the worst case return an error instead of crashing the system.
>
Well nobody is supposed to get the splat, except maybe someone messing
with how the string is generated, so I don't really see your point --
it's purely a bug-catching measure.
However, it can still just print a warn and fallback to the current
pointer chasing.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] fs: cache the string generated by reading /proc/filesystems
2026-04-23 13:36 ` [PATCH v2] fs: " Christian Brauner
@ 2026-04-23 14:38 ` Mateusz Guzik
2026-04-23 21:04 ` Christian Brauner
0 siblings, 1 reply; 7+ messages in thread
From: Mateusz Guzik @ 2026-04-23 14:38 UTC (permalink / raw)
To: Christian Brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, Alexey Dobriyan
On Thu, Apr 23, 2026 at 3:36 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, Apr 22, 2026 at 08:17:11PM +0200, Mateusz Guzik wrote:
> > It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
> >
> > This is lock-protected pointer chasing over a linked list to pay for
> > sprintf for every fs (32 on my boxen).
> >
> > Instead cache the result.
> >
> > open+read+close cycle single-threaded (ops/s):
> > before: 442732
> > after: 1063462 (+140%)
> >
> > Here the main bottleneck is memcg.
> >
> > Scalability-wise problems are avoidable lockref trip on open and ref
> > management for the file on procfs side.
> >
> > The file looks like a sterotypical C from the 90s, right down to an
> > open-coded and slightly obfuscated linked list. I intentionally did not
> > clean up any of it -- I think the file will be best served by a Rust
> > rewrite when the time comes.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
>
> Ugh, can't we at least also get rid of the rwlock stuff?
> Something like the appended two (completely untested) patches.
The procfs thing got protest last time, see
https://lore.kernel.org/all/CACVxJT_eXRNjp-2sEfuxYmzTMPvu7U1R2bsKYjadVfR-W691og@mail.gmail.com/
Alexey wrote his own variant and put my name on it:
https://lore.kernel.org/linux-fsdevel/c58291cd-0775-4c90-8443-ba71897b5cbb@p183/
. I'm not going to protest the patch, but I'm not signing off on the
thing. Perhaps you can pull it while removing my credit? The author
cc'ed.
The wtf list handling cleanup + rwlock removal should certainly be
their own commit. applying all of this on top of my patch should be a
matter of few shell commands, alternatively I can massage this the
other way around.
I did not go for the other changes because a big chunk of the file is
the suspicious sysv syscall marked for removal and which I have 0
intentions of testing.
Finallly, a review did not like the BUG_ON (which you replaced with
WARN_ON_ONCE) and the new failure mode of ENOMEM. Both bogus
calculation and ENOMEM can be handled with a fallback to iterating the
list like it is done now.
I can do the massaging if it sounds like a plan.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] fs: cache the string generated by reading /proc/filesystems
2026-04-23 14:38 ` Mateusz Guzik
@ 2026-04-23 21:04 ` Christian Brauner
0 siblings, 0 replies; 7+ messages in thread
From: Christian Brauner @ 2026-04-23 21:04 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: viro, jack, linux-kernel, linux-fsdevel, Alexey Dobriyan
On Thu, Apr 23, 2026 at 04:38:05PM +0200, Mateusz Guzik wrote:
> On Thu, Apr 23, 2026 at 3:36 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Wed, Apr 22, 2026 at 08:17:11PM +0200, Mateusz Guzik wrote:
> > > It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
> > >
> > > This is lock-protected pointer chasing over a linked list to pay for
> > > sprintf for every fs (32 on my boxen).
> > >
> > > Instead cache the result.
> > >
> > > open+read+close cycle single-threaded (ops/s):
> > > before: 442732
> > > after: 1063462 (+140%)
> > >
> > > Here the main bottleneck is memcg.
> > >
> > > Scalability-wise problems are avoidable lockref trip on open and ref
> > > management for the file on procfs side.
> > >
> > > The file looks like a sterotypical C from the 90s, right down to an
> > > open-coded and slightly obfuscated linked list. I intentionally did not
> > > clean up any of it -- I think the file will be best served by a Rust
> > > rewrite when the time comes.
> > >
> > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > > ---
> >
> > Ugh, can't we at least also get rid of the rwlock stuff?
> > Something like the appended two (completely untested) patches.
>
> The procfs thing got protest last time, see
> https://lore.kernel.org/all/CACVxJT_eXRNjp-2sEfuxYmzTMPvu7U1R2bsKYjadVfR-W691og@mail.gmail.com/
>
> Alexey wrote his own variant and put my name on it:
> https://lore.kernel.org/linux-fsdevel/c58291cd-0775-4c90-8443-ba71897b5cbb@p183/
> . I'm not going to protest the patch, but I'm not signing off on the
> thing. Perhaps you can pull it while removing my credit? The author
> cc'ed.
>
> The wtf list handling cleanup + rwlock removal should certainly be
> their own commit. applying all of this on top of my patch should be a
> matter of few shell commands, alternatively I can massage this the
> other way around.
>
> I did not go for the other changes because a big chunk of the file is
> the suspicious sysv syscall marked for removal and which I have 0
> intentions of testing.
>
> Finallly, a review did not like the BUG_ON (which you replaced with
> WARN_ON_ONCE) and the new failure mode of ENOMEM. Both bogus
> calculation and ENOMEM can be handled with a fallback to iterating the
> list like it is done now.
>
> I can do the massaging if it sounds like a plan.
Yeah, sounds good to me.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-04-23 21:04 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-22 18:17 [PATCH v2] fs: cache the string generated by reading /proc/filesystems Mateusz Guzik
2026-04-22 22:07 ` [PATCH v2] " Andreas Dilger
2026-04-22 22:42 ` Al Viro
2026-04-23 14:28 ` Mateusz Guzik
2026-04-23 13:36 ` [PATCH v2] fs: " Christian Brauner
2026-04-23 14:38 ` Mateusz Guzik
2026-04-23 21:04 ` Christian Brauner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox