All of lore.kernel.org
 help / color / mirror / Atom feed
From: Christian Brauner <brauner@kernel.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: viro@zeniv.linux.org.uk, jack@suse.cz,
	linux-kernel@vger.kernel.org,  linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v2] fs: cache the string generated by reading /proc/filesystems
Date: Thu, 23 Apr 2026 15:36:51 +0200	[thread overview]
Message-ID: <20260423-flohmarkt-immun-9aedcbd5b60f@brauner> (raw)
In-Reply-To: <20260422181711.1340269-1-mjguzik@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 963 bytes --]

On Wed, Apr 22, 2026 at 08:17:11PM +0200, Mateusz Guzik wrote:
> It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
> 
> This is lock-protected pointer chasing over a linked list to pay for
> sprintf for every fs (32 on my boxen).
> 
> Instead cache the result.
> 
> open+read+close cycle single-threaded (ops/s):
> before:	442732
> after:	1063462 (+140%)
> 
> Here the main bottleneck is memcg.
> 
> Scalability-wise problems are avoidable lockref trip on open and ref
> management for the file on procfs side.
> 
> The file looks like a sterotypical C from the 90s, right down to an
> open-coded and slightly obfuscated linked list. I intentionally did not
> clean up any of it -- I think the file will be best served by a Rust
> rewrite when the time comes.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---

Ugh, can't we at least also get rid of the rwlock stuff?
Something like the appended two (completely untested) patches.

[-- Attachment #2: 0001-proc-add-a-helper-for-marking-files-as-permanent-by-.patch --]
[-- Type: text/x-diff, Size: 1724 bytes --]

From 148061e07d8dfc513df6d190791811ed98e21adc Mon Sep 17 00:00:00 2001
From: Mateusz Guzik <mjguzik@gmail.com>
Date: Sat, 29 Mar 2025 20:28:20 +0100
Subject: [PATCH 1/2] proc: add a helper for marking files as permanent by
 external consumers

This avoids refing trips on use, added with VFS in mind but should be
helpful elsewhere as well.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250329192821.822253-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/proc/generic.c       | 6 ++++++
 include/linux/proc_fs.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 8bb81e58c9d8..c9de1c95bbdc 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -677,6 +677,12 @@ struct proc_dir_entry *proc_create_single_data(const char *name, umode_t mode,
 }
 EXPORT_SYMBOL(proc_create_single_data);
 
+void proc_make_permanent(struct proc_dir_entry *de)
+{
+	pde_make_permanent(de);
+}
+EXPORT_SYMBOL(proc_make_permanent);
+
 void proc_set_size(struct proc_dir_entry *de, loff_t size)
 {
 	de->size = size;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 19d1c5e5f335..360c1b90b546 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -106,6 +106,7 @@ struct proc_dir_entry *proc_create_single_data(const char *name, umode_t mode,
 		int (*show)(struct seq_file *, void *), void *data);
 #define proc_create_single(name, mode, parent, show) \
 	proc_create_single_data(name, mode, parent, show, NULL)
+void proc_make_permanent(struct proc_dir_entry *);
  
 extern struct proc_dir_entry *proc_create_data(const char *, umode_t,
 					       struct proc_dir_entry *,
-- 
2.47.3


[-- Attachment #3: 0002-fs-RCU-ify-filesystems-list-and-cache-proc-filesyste.patch --]
[-- Type: text/x-diff, Size: 14523 bytes --]

From 5c5a25c5ab66f666349e80f78b5168a9fb4eebd6 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 23 Apr 2026 14:43:17 +0200
Subject: [PATCH 2/2] fs: RCU-ify filesystems list and cache /proc/filesystems

The drivers list was protected by an rwlock; every mount, every open
of /proc/filesystems and the legacy sysfs(2) syscall walked a
hand-rolled singly-linked list under it. /proc/filesystems is
especially hot because libselinux causes programs as mundane as
mkdir, ls and sed to open and read it on every invocation.

Convert the list to an RCU-protected hlist and switch the writer side
to a plain spinlock. Writers keep their existing non-sleeping
section while readers walk under rcu_read_lock() with no lock traffic:

  - register_filesystem()/unregister_filesystem() take
    file_systems_lock, publish via hlist_{add_tail,del_init}_rcu()
    and invalidate the cached /proc/filesystems string.
    unregister_filesystem() keeps its synchronize_rcu() after
    dropping the lock so in-flight readers are drained before the
    module (and its embedded file_system_type) can go away.

  - __get_fs_type(), list_bdev_fs_names() and the
    fs_index()/fs_name()/fs_maxindex() helpers walk the list under
    rcu_read_lock(). fs_name() continues to drop the read-side
    lock after try_module_get() and accesses ->name outside the RCU
    section; the module reference pins the embedded file_system_type
    across the boundary.

Additionally cache the pre-rendered "[nodev\t]<name>\n" blob so
filesystems_proc_show() becomes a single seq_write() under
rcu_read_lock(). On a cache miss regen_filesystems_string() sizes
the buffer under RCU, allocates GFP_KERNEL and refills under the
writer spinlock; a generation counter bumped under the spinlock lets
it detect concurrent (un)register and retry the allocation.

Finally mark /proc/filesystems as permanent so the per-open refcount
bounce (which can degrade to a spinlock acquire) is removed; this
was the dominant bottleneck once the list walk itself was off the
critical path and is what lifts the multi-threaded numbers in
Mateusz Guzik's measurements:

  https://lore.kernel.org/linux-fsdevel/20260422181711.1340269-1-mjguzik@gmail.com/

struct file_system_type::next becomes struct hlist_node list; no
in-tree caller references the old ->next field outside
fs/filesystems.c.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/filesystems.c   | 282 +++++++++++++++++++++++++++++----------------
 fs/ocfs2/super.c   |   1 -
 include/linux/fs.h |   2 +-
 3 files changed, 181 insertions(+), 104 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 0c7d2b7ac26c..9b24c309f71e 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -17,22 +17,50 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/fs_parser.h>
+#include <linux/rculist.h>
 
 /*
- * Handling of filesystem drivers list.
- * Rules:
- *	Inclusion to/removals from/scanning of list are protected by spinlock.
- *	During the unload module must call unregister_filesystem().
- *	We can access the fields of list element if:
- *		1) spinlock is held or
- *		2) we hold the reference to the module.
- *	The latter can be guaranteed by call of try_module_get(); if it
- *	returned 0 we must skip the element, otherwise we got the reference.
- *	Once the reference is obtained we can drop the spinlock.
+ * Read-mostly filesystem drivers list.
+ *
+ * Readers walk under rcu_read_lock(); writers take file_systems_lock
+ * and publish via _rcu hlist primitives.  unregister_filesystem()
+ * synchronize_rcu()s after unlock so the embedded file_system_type
+ * can't go away under a reader.  To keep using a filesystem after
+ * the RCU section ends, take a module reference via try_module_get().
+ */
+static HLIST_HEAD(file_systems);
+static DEFINE_SPINLOCK(file_systems_lock);
+
+#ifdef CONFIG_PROC_FS
+/*
+ * Cache /proc/filesystems - libselinux has mkdir/ls/sed read it on
+ * every invocation.  file_systems_gen lets regen detect (un)register
+ * races against the cached generation and retry.
  */
+struct file_systems_string {
+	struct rcu_head rcu;
+	unsigned long gen;
+	size_t len;
+	char string[];
+};
 
-static struct file_system_type *file_systems;
-static DEFINE_RWLOCK(file_systems_lock);
+static unsigned long file_systems_gen;
+static struct file_systems_string __rcu *file_systems_string;
+
+static void invalidate_filesystems_string(void)
+{
+	struct file_systems_string *old;
+
+	lockdep_assert_held(&file_systems_lock);
+	file_systems_gen++;
+	old = rcu_replace_pointer(file_systems_string, NULL,
+				  lockdep_is_held(&file_systems_lock));
+	if (old)
+		kfree_rcu(old, rcu);
+}
+#else
+static inline void invalidate_filesystems_string(void) { }
+#endif
 
 /* WARNING: This can be used only if we _already_ own a reference */
 struct file_system_type *get_filesystem(struct file_system_type *fs)
@@ -46,14 +74,15 @@ void put_filesystem(struct file_system_type *fs)
 	module_put(fs->owner);
 }
 
-static struct file_system_type **find_filesystem(const char *name, unsigned len)
+static struct file_system_type *find_filesystem(const char *name, unsigned len)
 {
-	struct file_system_type **p;
-	for (p = &file_systems; *p; p = &(*p)->next)
-		if (strncmp((*p)->name, name, len) == 0 &&
-		    !(*p)->name[len])
-			break;
-	return p;
+	struct file_system_type *fs;
+
+	hlist_for_each_entry_rcu(fs, &file_systems, list,
+				 lockdep_is_held(&file_systems_lock))
+		if (strncmp(fs->name, name, len) == 0 && !fs->name[len])
+			return fs;
+	return NULL;
 }
 
 /**
@@ -64,33 +93,27 @@ static struct file_system_type **find_filesystem(const char *name, unsigned len)
  *	is aware of for mount and other syscalls. Returns 0 on success,
  *	or a negative errno code on an error.
  *
- *	The &struct file_system_type that is passed is linked into the kernel 
+ *	The &struct file_system_type that is passed is linked into the kernel
  *	structures and must not be freed until the file system has been
  *	unregistered.
  */
- 
-int register_filesystem(struct file_system_type * fs)
+int register_filesystem(struct file_system_type *fs)
 {
-	int res = 0;
-	struct file_system_type ** p;
-
 	if (fs->parameters &&
 	    !fs_validate_description(fs->name, fs->parameters))
 		return -EINVAL;
 
 	BUG_ON(strchr(fs->name, '.'));
-	if (fs->next)
+	if (!hlist_unhashed_lockless(&fs->list))
 		return -EBUSY;
-	write_lock(&file_systems_lock);
-	p = find_filesystem(fs->name, strlen(fs->name));
-	if (*p)
-		res = -EBUSY;
-	else
-		*p = fs;
-	write_unlock(&file_systems_lock);
-	return res;
-}
 
+	guard(spinlock)(&file_systems_lock);
+	if (find_filesystem(fs->name, strlen(fs->name)))
+		return -EBUSY;
+	hlist_add_tail_rcu(&fs->list, &file_systems);
+	invalidate_filesystems_string();
+	return 0;
+}
 EXPORT_SYMBOL(register_filesystem);
 
 /**
@@ -100,94 +123,79 @@ EXPORT_SYMBOL(register_filesystem);
  *	Remove a file system that was previously successfully registered
  *	with the kernel. An error is returned if the file system is not found.
  *	Zero is returned on a success.
- *	
+ *
  *	Once this function has returned the &struct file_system_type structure
  *	may be freed or reused.
  */
- 
-int unregister_filesystem(struct file_system_type * fs)
+int unregister_filesystem(struct file_system_type *fs)
 {
-	struct file_system_type ** tmp;
-
-	write_lock(&file_systems_lock);
-	tmp = &file_systems;
-	while (*tmp) {
-		if (fs == *tmp) {
-			*tmp = fs->next;
-			fs->next = NULL;
-			write_unlock(&file_systems_lock);
-			synchronize_rcu();
-			return 0;
-		}
-		tmp = &(*tmp)->next;
+	scoped_guard(spinlock, &file_systems_lock) {
+		if (hlist_unhashed(&fs->list))
+			return -EINVAL;
+		hlist_del_init_rcu(&fs->list);
+		invalidate_filesystems_string();
 	}
-	write_unlock(&file_systems_lock);
-
-	return -EINVAL;
+	synchronize_rcu();
+	return 0;
 }
-
 EXPORT_SYMBOL(unregister_filesystem);
 
 #ifdef CONFIG_SYSFS_SYSCALL
-static int fs_index(const char __user * __name)
+static int fs_index(const char __user *__name)
 {
-	struct file_system_type * tmp;
+	struct file_system_type *tmp;
 	char *name __free(kfree) = strndup_user(__name, PATH_MAX);
-	int err, index;
+	int index = 0;
 
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 
-	err = -EINVAL;
-	read_lock(&file_systems_lock);
-	for (tmp=file_systems, index=0 ; tmp ; tmp=tmp->next, index++) {
-		if (strcmp(tmp->name, name) == 0) {
-			err = index;
-			break;
-		}
+	guard(rcu)();
+	hlist_for_each_entry_rcu(tmp, &file_systems, list) {
+		if (strcmp(tmp->name, name) == 0)
+			return index;
+		index++;
 	}
-	read_unlock(&file_systems_lock);
-	return err;
+	return -EINVAL;
 }
 
-static int fs_name(unsigned int index, char __user * buf)
+static int fs_name(unsigned int index, char __user *buf)
 {
-	struct file_system_type * tmp;
-	int len, res = -EINVAL;
+	struct file_system_type *tmp, *found = NULL;
+	int len, res;
 
-	read_lock(&file_systems_lock);
-	for (tmp = file_systems; tmp; tmp = tmp->next, index--) {
-		if (index == 0) {
+	scoped_guard(rcu) {
+		hlist_for_each_entry_rcu(tmp, &file_systems, list) {
+			if (index--)
+				continue;
 			if (try_module_get(tmp->owner))
-				res = 0;
+				found = tmp;
 			break;
 		}
 	}
-	read_unlock(&file_systems_lock);
-	if (res)
-		return res;
+	if (!found)
+		return -EINVAL;
 
 	/* OK, we got the reference, so we can safely block */
-	len = strlen(tmp->name) + 1;
-	res = copy_to_user(buf, tmp->name, len) ? -EFAULT : 0;
-	put_filesystem(tmp);
+	len = strlen(found->name) + 1;
+	res = copy_to_user(buf, found->name, len) ? -EFAULT : 0;
+	put_filesystem(found);
 	return res;
 }
 
 static int fs_maxindex(void)
 {
-	struct file_system_type * tmp;
-	int index;
+	struct file_system_type *tmp;
+	int index = 0;
 
-	read_lock(&file_systems_lock);
-	for (tmp = file_systems, index = 0 ; tmp ; tmp = tmp->next, index++)
-		;
-	read_unlock(&file_systems_lock);
+	guard(rcu)();
+	hlist_for_each_entry_rcu(tmp, &file_systems, list)
+		index++;
 	return index;
 }
 
 /*
- * Whee.. Weird sysv syscall. 
+ * Whee.. Weird sysv syscall.
  */
 SYSCALL_DEFINE3(sysfs, int, option, unsigned long, arg1, unsigned long, arg2)
 {
@@ -216,8 +224,8 @@ int __init list_bdev_fs_names(char *buf, size_t size)
 	size_t len;
 	int count = 0;
 
-	read_lock(&file_systems_lock);
-	for (p = file_systems; p; p = p->next) {
+	guard(rcu)();
+	hlist_for_each_entry_rcu(p, &file_systems, list) {
 		if (!(p->fs_flags & FS_REQUIRES_DEV))
 			continue;
 		len = strlen(p->name) + 1;
@@ -230,30 +238,101 @@ int __init list_bdev_fs_names(char *buf, size_t size)
 		size -= len;
 		count++;
 	}
-	read_unlock(&file_systems_lock);
 	return count;
 }
 
 #ifdef CONFIG_PROC_FS
-static int filesystems_proc_show(struct seq_file *m, void *v)
+static __cold noinline int regen_filesystems_string(void)
 {
-	struct file_system_type * tmp;
+	struct file_system_type *tmp;
+	struct file_systems_string *old, *new;
+	size_t newlen, usedlen;
+	unsigned long gen;
+
+retry:
+	/*
+	 * Size under RCU; the spinlock below detects list mutation via
+	 * file_systems_gen and we retry.  Bounded by (un)register rate.
+	 */
+	newlen = 0;
+	scoped_guard(rcu) {
+		gen = READ_ONCE(file_systems_gen);
+		hlist_for_each_entry_rcu(tmp, &file_systems, list) {
+			if (!(tmp->fs_flags & FS_REQUIRES_DEV))
+				newlen += strlen("nodev");
+			newlen += strlen("\t") + strlen(tmp->name) +
+				  strlen("\n");
+		}
+	}
+
+	new = kmalloc(struct_size(new, string, newlen), GFP_KERNEL);
+	if (!new)
+		return -ENOMEM;
+	new->gen = gen;
+	new->len = newlen;
+
+	spin_lock(&file_systems_lock);
+	old = rcu_dereference_protected(file_systems_string,
+					lockdep_is_held(&file_systems_lock));
+
+	/* Someone regenerated for our generation already. */
+	if (old && old->gen == file_systems_gen) {
+		spin_unlock(&file_systems_lock);
+		kfree(new);
+		return 0;
+	}
 
-	read_lock(&file_systems_lock);
-	tmp = file_systems;
-	while (tmp) {
-		seq_printf(m, "%s\t%s\n",
+	/* List mutated between size-calc and here; retry. */
+	if (gen != file_systems_gen) {
+		spin_unlock(&file_systems_lock);
+		kfree(new);
+		goto retry;
+	}
+
+	/* Sized for this generation under the lock, so sprintf fits. */
+	usedlen = 0;
+	hlist_for_each_entry(tmp, &file_systems, list) {
+		usedlen += sprintf(&new->string[usedlen], "%s\t%s\n",
 			(tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
 			tmp->name);
-		tmp = tmp->next;
 	}
-	read_unlock(&file_systems_lock);
+	WARN_ON_ONCE(usedlen != newlen);
+
+	rcu_assign_pointer(file_systems_string, new);
+	spin_unlock(&file_systems_lock);
+	if (old)
+		kfree_rcu(old, rcu);
 	return 0;
 }
 
+static int filesystems_proc_show(struct seq_file *m, void *v)
+{
+	struct file_systems_string *fss;
+	int err;
+
+	for (;;) {
+		scoped_guard(rcu) {
+			fss = rcu_dereference(file_systems_string);
+			if (likely(fss)) {
+				seq_write(m, fss->string, fss->len);
+				return 0;
+			}
+		}
+
+		err = regen_filesystems_string();
+		if (unlikely(err))
+			return err;
+	}
+}
+
 static int __init proc_filesystems_init(void)
 {
-	proc_create_single("filesystems", 0, NULL, filesystems_proc_show);
+	struct proc_dir_entry *pde;
+
+	pde = proc_create_single("filesystems", 0, NULL, filesystems_proc_show);
+	if (!pde)
+		return -ENOMEM;
+	proc_make_permanent(pde);
 	return 0;
 }
 module_init(proc_filesystems_init);
@@ -263,11 +342,10 @@ static struct file_system_type *__get_fs_type(const char *name, int len)
 {
 	struct file_system_type *fs;
 
-	read_lock(&file_systems_lock);
-	fs = *(find_filesystem(name, len));
+	guard(rcu)();
+	fs = find_filesystem(name, len);
 	if (fs && !try_module_get(fs->owner))
 		fs = NULL;
-	read_unlock(&file_systems_lock);
 	return fs;
 }
 
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index b875f01c9756..4870e680c4e5 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1224,7 +1224,6 @@ static struct file_system_type ocfs2_fs_type = {
 	.name           = "ocfs2",
 	.kill_sb        = kill_block_super,
 	.fs_flags       = FS_REQUIRES_DEV|FS_RENAME_DOES_D_MOVE,
-	.next           = NULL,
 	.init_fs_context = ocfs2_init_fs_context,
 	.parameters	= ocfs2_param_spec,
 };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..c37bb3c7de8b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2286,7 +2286,7 @@ struct file_system_type {
 	const struct fs_parameter_spec *parameters;
 	void (*kill_sb) (struct super_block *);
 	struct module *owner;
-	struct file_system_type * next;
+	struct hlist_node list;
 	struct hlist_head fs_supers;
 
 	struct lock_class_key s_lock_key;
-- 
2.47.3


  parent reply	other threads:[~2026-04-23 13:36 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-22 18:17 [PATCH v2] fs: cache the string generated by reading /proc/filesystems Mateusz Guzik
2026-04-22 22:07 ` [PATCH v2] " Andreas Dilger
2026-04-22 22:42   ` Al Viro
2026-04-23 14:28   ` Mateusz Guzik
2026-04-23 13:36 ` Christian Brauner [this message]
2026-04-23 14:38   ` [PATCH v2] fs: " Mateusz Guzik
2026-04-23 21:04     ` Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260423-flohmarkt-immun-9aedcbd5b60f@brauner \
    --to=brauner@kernel.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mjguzik@gmail.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.