public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] speed up /proc/filesystems
@ 2025-03-29 19:28 Mateusz Guzik
  2025-03-29 19:28 ` [PATCH 1/2] proc: add a helper for marking files as permanent by external consumers Mateusz Guzik
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Mateusz Guzik @ 2025-03-29 19:28 UTC (permalink / raw)
  To: brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel, Mateusz Guzik

I accidentally found out it is used a lot *and* is incredibly slow.

Part of it is procfs protecting the file from going away on each op,
other part is content generatin being dog slow.

Turns out procfs did not provide an interface to mark files as
permanent. I added easiest hack I could think of to remedy the problem,
I am not going to argue how to do it.

Mateusz Guzik (2):
  proc: add a helper for marking files as permanent by external
    consumers
  fs: cache the string generated by reading /proc/filesystems

 fs/filesystems.c        | 148 +++++++++++++++++++++++++++++++++++++---
 fs/proc/generic.c       |   6 ++
 include/linux/proc_fs.h |   1 +
 3 files changed, 147 insertions(+), 8 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] proc: add a helper for marking files as permanent by external consumers
  2025-03-29 19:28 [PATCH 0/2] speed up /proc/filesystems Mateusz Guzik
@ 2025-03-29 19:28 ` Mateusz Guzik
  2025-03-29 19:28 ` [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems Mateusz Guzik
  2025-04-01 10:34 ` [PATCH 0/2] speed up /proc/filesystems Christian Brauner
  2 siblings, 0 replies; 10+ messages in thread
From: Mateusz Guzik @ 2025-03-29 19:28 UTC (permalink / raw)
  To: brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel, Mateusz Guzik

This avoids refing trips on use, added with VFS in mind but should be
helpful elsewhere as well.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/proc/generic.c       | 6 ++++++
 include/linux/proc_fs.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index a3e22803cddf..ae86554bb271 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -668,6 +668,12 @@ struct proc_dir_entry *proc_create_single_data(const char *name, umode_t mode,
 }
 EXPORT_SYMBOL(proc_create_single_data);
 
+void proc_make_permanent(struct proc_dir_entry *de)
+{
+	pde_make_permanent(de);
+}
+EXPORT_SYMBOL(proc_make_permanent);
+
 void proc_set_size(struct proc_dir_entry *de, loff_t size)
 {
 	de->size = size;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index ea62201c74c4..0b6fcef4099a 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -105,6 +105,7 @@ struct proc_dir_entry *proc_create_single_data(const char *name, umode_t mode,
 		int (*show)(struct seq_file *, void *), void *data);
 #define proc_create_single(name, mode, parent, show) \
 	proc_create_single_data(name, mode, parent, show, NULL)
+void proc_make_permanent(struct proc_dir_entry *);
  
 extern struct proc_dir_entry *proc_create_data(const char *, umode_t,
 					       struct proc_dir_entry *,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems
  2025-03-29 19:28 [PATCH 0/2] speed up /proc/filesystems Mateusz Guzik
  2025-03-29 19:28 ` [PATCH 1/2] proc: add a helper for marking files as permanent by external consumers Mateusz Guzik
@ 2025-03-29 19:28 ` Mateusz Guzik
  2025-03-29 20:53   ` Andreas Schwab
  2025-04-01 10:30   ` Christian Brauner
  2025-04-01 10:34 ` [PATCH 0/2] speed up /proc/filesystems Christian Brauner
  2 siblings, 2 replies; 10+ messages in thread
From: Mateusz Guzik @ 2025-03-29 19:28 UTC (permalink / raw)
  To: brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel, Mateusz Guzik

It is being read surprisingly often (e.g., by mkdir, ls and even sed!).

This is lock-protected pointer chasing over a linked list to pay for
sprintf for every fs (32 on my boxen).

Instead cache the result.

While here mark the file as permanent to avoid atomic refs on each op
(which can also degrade to taking a spinlock).

open+read+close cycle single-threaded (ops/s):
before:	450929
after:	982308 (+117%)

Here the main bottleneck is memcg.

open+read+close cycle with 20 processes (ops/s):
before:	578654
after:	3163961 (+446%)

The main bottleneck *before* is spinlock acquire in procfs eliminated by
marking the file as permanent. The main bottleneck *after* is the
spurious lockref trip on open.

The file looks like a sterotypical C from the 90s, right down to an
open-code and slightly obfuscated linked list. I intentionally did not
clean up any of it -- I think the file will be best served by a Rust
rewrite when the time comes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/filesystems.c | 148 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 140 insertions(+), 8 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 58b9067b2391..5d4649fd8788 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -34,6 +34,23 @@
 static struct file_system_type *file_systems;
 static DEFINE_RWLOCK(file_systems_lock);
 
+#ifdef CONFIG_PROC_FS
+static unsigned long file_systems_gen;
+
+struct file_systems_string {
+	struct rcu_head rcufree;
+	unsigned long gen;
+	size_t len;
+	char string[];
+};
+static struct file_systems_string *file_systems_string;
+static void invalidate_filesystems_string(void);
+#else
+static void invalidate_filesystems_string(void)
+{
+}
+#endif
+
 /* WARNING: This can be used only if we _already_ own a reference */
 struct file_system_type *get_filesystem(struct file_system_type *fs)
 {
@@ -83,10 +100,12 @@ int register_filesystem(struct file_system_type * fs)
 		return -EBUSY;
 	write_lock(&file_systems_lock);
 	p = find_filesystem(fs->name, strlen(fs->name));
-	if (*p)
+	if (*p) {
 		res = -EBUSY;
-	else
+	} else {
 		*p = fs;
+		invalidate_filesystems_string();
+	}
 	write_unlock(&file_systems_lock);
 	return res;
 }
@@ -115,6 +134,7 @@ int unregister_filesystem(struct file_system_type * fs)
 		if (fs == *tmp) {
 			*tmp = fs->next;
 			fs->next = NULL;
+			invalidate_filesystems_string();
 			write_unlock(&file_systems_lock);
 			synchronize_rcu();
 			return 0;
@@ -234,25 +254,137 @@ int __init list_bdev_fs_names(char *buf, size_t size)
 }
 
 #ifdef CONFIG_PROC_FS
-static int filesystems_proc_show(struct seq_file *m, void *v)
+/*
+ * The fs list gets queried a lot by userspace, including rather surprising
+ * programs (would you guess *sed* is on the list?). In order to reduce the
+ * overhead we cache the resulting string, which normally hangs around below
+ * 512 bytes in size.
+ *
+ * As the list almost never changes, its creation is not particularly optimized
+ * for simplicity.
+ *
+ * We sort it out on read in order to not introduce a failure point for fs
+ * registration (in principle we may be unable to alloc memory for the list).
+ */
+static void invalidate_filesystems_string(void)
 {
-	struct file_system_type * tmp;
+	struct file_systems_string *fss;
 
-	read_lock(&file_systems_lock);
+	lockdep_assert_held_write(&file_systems_lock);
+	file_systems_gen++;
+	fss = file_systems_string;
+	WRITE_ONCE(file_systems_string, NULL);
+	kfree_rcu(fss, rcufree);
+}
+
+static noinline int regen_filesystems_string(void)
+{
+	struct file_system_type *tmp;
+	struct file_systems_string *old, *new;
+	size_t newlen, usedlen;
+	unsigned long gen;
+
+retry:
+	lockdep_assert_not_held(&file_systems_lock);
+
+	newlen = 0;
+	write_lock(&file_systems_lock);
+	gen = file_systems_gen;
+	tmp = file_systems;
+	/* pre-calc space for "%s\t%s\n" for each fs */
+	while (tmp) {
+		if (!(tmp->fs_flags & FS_REQUIRES_DEV))
+			newlen += strlen("nodev");
+		newlen += strlen("\t");
+		newlen += strlen(tmp->name);
+		newlen += strlen("\n");
+		tmp = tmp->next;
+	}
+	write_unlock(&file_systems_lock);
+
+	new = kmalloc(offsetof(struct file_systems_string, string) + newlen + 1,
+		      GFP_KERNEL);
+	if (!new)
+		return -ENOMEM;
+
+	new->gen = gen;
+	new->len = newlen;
+	new->string[newlen] = '\0';
+	write_lock(&file_systems_lock);
+	old = file_systems_string;
+
+	/*
+	 * Did someone beat us to it?
+	 */
+	if (old && old->gen == file_systems_gen) {
+		write_unlock(&file_systems_lock);
+		kfree(new);
+		return 0;
+	}
+
+	/*
+	 * Did the list change in the meantime?
+	 */
+	if (gen != file_systems_gen) {
+		write_unlock(&file_systems_lock);
+		kfree(new);
+		goto retry;
+	}
+
+	/*
+	 * Populated the string.
+	 *
+	 * We know we have just enough space because we calculated the right
+	 * size the previous time we had the lock and confirmed the list has
+	 * not changed after reacquiring it.
+	 */
+	usedlen = 0;
 	tmp = file_systems;
 	while (tmp) {
-		seq_printf(m, "%s\t%s\n",
+		usedlen += sprintf(&new->string[usedlen], "%s\t%s\n",
 			(tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
 			tmp->name);
 		tmp = tmp->next;
 	}
-	read_unlock(&file_systems_lock);
+	BUG_ON(new->len != strlen(new->string));
+
+	/*
+	 * Pairs with consume fence in READ_ONCE() in filesystems_proc_show().
+	 */
+	smp_store_release(&file_systems_string, new);
+	write_unlock(&file_systems_lock);
+	kfree_rcu(old, rcufree);
 	return 0;
 }
 
+static int filesystems_proc_show(struct seq_file *m, void *v)
+{
+	struct file_systems_string *fss;
+
+	for (;;) {
+		scoped_guard(rcu) {
+			/*
+			 * Pairs with smp_store_release() in regen_filesystems_string().
+			 */
+			fss = READ_ONCE(file_systems_string);
+			if (likely(fss)) {
+				seq_write(m, fss->string, fss->len);
+				return 0;
+			}
+		}
+
+		int err = regen_filesystems_string();
+		if (unlikely(err))
+			return err;
+	}
+}
+
 static int __init proc_filesystems_init(void)
 {
-	proc_create_single("filesystems", 0, NULL, filesystems_proc_show);
+	struct proc_dir_entry *pde;
+
+	pde = proc_create_single("filesystems", 0, NULL, filesystems_proc_show);
+	proc_make_permanent(pde);
 	return 0;
 }
 module_init(proc_filesystems_init);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems
  2025-03-29 19:28 ` [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems Mateusz Guzik
@ 2025-03-29 20:53   ` Andreas Schwab
  2025-04-04  8:41     ` Christoph Hellwig
  2025-04-01 10:30   ` Christian Brauner
  1 sibling, 1 reply; 10+ messages in thread
From: Andreas Schwab @ 2025-03-29 20:53 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: brauner, viro, jack, linux-kernel, linux-fsdevel

On Mär 29 2025, Mateusz Guzik wrote:

> It is being read surprisingly often (e.g., by mkdir, ls and even sed!).

It is part of libselinux (selinuxfs_exits), called by its library
initializer.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems
  2025-03-29 19:28 ` [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems Mateusz Guzik
  2025-03-29 20:53   ` Andreas Schwab
@ 2025-04-01 10:30   ` Christian Brauner
  2025-04-01 12:12     ` Mateusz Guzik
  1 sibling, 1 reply; 10+ messages in thread
From: Christian Brauner @ 2025-04-01 10:30 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: viro, jack, linux-kernel, linux-fsdevel

On Sat, Mar 29, 2025 at 08:28:21PM +0100, Mateusz Guzik wrote:
> It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
> 
> This is lock-protected pointer chasing over a linked list to pay for
> sprintf for every fs (32 on my boxen).
> 
> Instead cache the result.
> 
> While here mark the file as permanent to avoid atomic refs on each op
> (which can also degrade to taking a spinlock).
> 
> open+read+close cycle single-threaded (ops/s):
> before:	450929
> after:	982308 (+117%)
> 
> Here the main bottleneck is memcg.
> 
> open+read+close cycle with 20 processes (ops/s):
> before:	578654
> after:	3163961 (+446%)
> 
> The main bottleneck *before* is spinlock acquire in procfs eliminated by
> marking the file as permanent. The main bottleneck *after* is the
> spurious lockref trip on open.
> 
> The file looks like a sterotypical C from the 90s, right down to an
> open-code and slightly obfuscated linked list. I intentionally did not
> clean up any of it -- I think the file will be best served by a Rust
> rewrite when the time comes.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---

> +	pde = proc_create_single("filesystems", 0, NULL, filesystems_proc_show);
> +	proc_make_permanent(pde);
>  	return 0;

This all looks good enough for me especially if it's really that much of
a bottleneck for some workloads. But the above part is broken because
proc_create_single() may return NULL afair and that means
proc_make_permanent()->pde_make_permanent() will NULL-deref.

I'll fix that up locally.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] speed up /proc/filesystems
  2025-03-29 19:28 [PATCH 0/2] speed up /proc/filesystems Mateusz Guzik
  2025-03-29 19:28 ` [PATCH 1/2] proc: add a helper for marking files as permanent by external consumers Mateusz Guzik
  2025-03-29 19:28 ` [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems Mateusz Guzik
@ 2025-04-01 10:34 ` Christian Brauner
  2 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2025-04-01 10:34 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: Christian Brauner, viro, jack, linux-kernel, linux-fsdevel

On Sat, 29 Mar 2025 20:28:19 +0100, Mateusz Guzik wrote:
> I accidentally found out it is used a lot *and* is incredibly slow.
> 
> Part of it is procfs protecting the file from going away on each op,
> other part is content generatin being dog slow.
> 
> Turns out procfs did not provide an interface to mark files as
> permanent. I added easiest hack I could think of to remedy the problem,
> I am not going to argue how to do it.
> 
> [...]

Applied to the vfs-6.16.procfs branch of the vfs/vfs.git tree.
Patches in the vfs-6.16.procfs branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.16.procfs

[1/2] proc: add a helper for marking files as permanent by external consumers
      https://git.kernel.org/vfs/vfs/c/6040503a448b
[2/2] fs: cache the string generated by reading /proc/filesystems
      https://git.kernel.org/vfs/vfs/c/9750cdeb327d

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems
  2025-04-01 10:30   ` Christian Brauner
@ 2025-04-01 12:12     ` Mateusz Guzik
  0 siblings, 0 replies; 10+ messages in thread
From: Mateusz Guzik @ 2025-04-01 12:12 UTC (permalink / raw)
  To: Christian Brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel

On Tue, Apr 1, 2025 at 12:31 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Sat, Mar 29, 2025 at 08:28:21PM +0100, Mateusz Guzik wrote:
> > It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
> >
> > This is lock-protected pointer chasing over a linked list to pay for
> > sprintf for every fs (32 on my boxen).
> >
> > Instead cache the result.
> >
> > While here mark the file as permanent to avoid atomic refs on each op
> > (which can also degrade to taking a spinlock).
> >
> > open+read+close cycle single-threaded (ops/s):
> > before:       450929
> > after:        982308 (+117%)
> >
> > Here the main bottleneck is memcg.
> >
> > open+read+close cycle with 20 processes (ops/s):
> > before:       578654
> > after:        3163961 (+446%)
> >
> > The main bottleneck *before* is spinlock acquire in procfs eliminated by
> > marking the file as permanent. The main bottleneck *after* is the
> > spurious lockref trip on open.
> >
> > The file looks like a sterotypical C from the 90s, right down to an
> > open-code and slightly obfuscated linked list. I intentionally did not
> > clean up any of it -- I think the file will be best served by a Rust
> > rewrite when the time comes.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
>
> > +     pde = proc_create_single("filesystems", 0, NULL, filesystems_proc_show);
> > +     proc_make_permanent(pde);
> >       return 0;
>
> This all looks good enough for me especially if it's really that much of
> a bottleneck for some workloads. But the above part is broken because
> proc_create_single() may return NULL afair and that means
> proc_make_permanent()->pde_make_permanent() will NULL-deref.
>
> I'll fix that up locally.

oh huh, indeed my bad. But in that case it should perhaps complain? WARN_ON?

I would argue if this fails then something really went wrong.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems
  2025-03-29 20:53   ` Andreas Schwab
@ 2025-04-04  8:41     ` Christoph Hellwig
  2025-04-05  4:55       ` Mateusz Guzik
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2025-04-04  8:41 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Mateusz Guzik, brauner, viro, jack, linux-kernel, linux-fsdevel

On Sat, Mar 29, 2025 at 09:53:16PM +0100, Andreas Schwab wrote:
> On Mär 29 2025, Mateusz Guzik wrote:
> 
> > It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
> 
> It is part of libselinux (selinuxfs_exits), called by its library
> initializer.

Can we please fix libselinux instead of working around this really
broken behavior in the kernel?


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems
  2025-04-04  8:41     ` Christoph Hellwig
@ 2025-04-05  4:55       ` Mateusz Guzik
  2025-04-05  5:26         ` Mateusz Guzik
  0 siblings, 1 reply; 10+ messages in thread
From: Mateusz Guzik @ 2025-04-05  4:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Schwab, brauner, viro, jack, linux-kernel, linux-fsdevel

On Fri, Apr 4, 2025 at 10:41 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sat, Mar 29, 2025 at 09:53:16PM +0100, Andreas Schwab wrote:
> > On Mär 29 2025, Mateusz Guzik wrote:
> >
> > > It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
> >
> > It is part of libselinux (selinuxfs_exits), called by its library
> > initializer.
>
> Can we please fix libselinux instead of working around this really
> broken behavior in the kernel?
>

That's a fair point, I'm going to ask them about it. If they fix it
then I'll probably self-NAK the patch.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems
  2025-04-05  4:55       ` Mateusz Guzik
@ 2025-04-05  5:26         ` Mateusz Guzik
  0 siblings, 0 replies; 10+ messages in thread
From: Mateusz Guzik @ 2025-04-05  5:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Schwab, brauner, viro, jack, linux-kernel, linux-fsdevel

On Sat, Apr 5, 2025 at 6:55 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Fri, Apr 4, 2025 at 10:41 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Sat, Mar 29, 2025 at 09:53:16PM +0100, Andreas Schwab wrote:
> > > On Mär 29 2025, Mateusz Guzik wrote:
> > >
> > > > It is being read surprisingly often (e.g., by mkdir, ls and even sed!).
> > >
> > > It is part of libselinux (selinuxfs_exits), called by its library
> > > initializer.
> >
> > Can we please fix libselinux instead of working around this really
> > broken behavior in the kernel?
> >
>
> That's a fair point, I'm going to ask them about it. If they fix it
> then I'll probably self-NAK the patch.
>

for interested parties: https://github.com/SELinuxProject/selinux/issues/468

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-04-05  5:26 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-29 19:28 [PATCH 0/2] speed up /proc/filesystems Mateusz Guzik
2025-03-29 19:28 ` [PATCH 1/2] proc: add a helper for marking files as permanent by external consumers Mateusz Guzik
2025-03-29 19:28 ` [PATCH 2/2] fs: cache the string generated by reading /proc/filesystems Mateusz Guzik
2025-03-29 20:53   ` Andreas Schwab
2025-04-04  8:41     ` Christoph Hellwig
2025-04-05  4:55       ` Mateusz Guzik
2025-04-05  5:26         ` Mateusz Guzik
2025-04-01 10:30   ` Christian Brauner
2025-04-01 12:12     ` Mateusz Guzik
2025-04-01 10:34 ` [PATCH 0/2] speed up /proc/filesystems Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox