Linux EXT4 FS development
* [PATCHBOMB v6] e2fsprogs: containerize ext4 for safer operation
@ 2026-06-25 19:33 Darrick J. Wong
  2026-06-25 19:35 ` [PATCHSET 1/4] libext2fs: fix some missed fsync calls Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:33 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Neal Gompa

Hi everyone,

This is the sole remaining part of the gigantic patchset to enable
mounting ext4 filesystems as a systemd-contained fuse server instead of
in the kernel.  The libfuse parts have now been merged upstream, which
means that fuse4fs can now run as a non-root user, with no privilege,
and no access to the network or hardware, etc.  The only connection to
the outside is an ephemeral AF_UNIX socket.  On the other end is a mount
helper program that acquires resources and calls fsmount().

Why would you want to do that?  Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence.  Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.  Runtime filesystem metadata parsing is no longer a
privileged (== risky) operation.

The consequence of a crashed driver is a dead mount, instead of a
crashed or corrupt OS kernel.

Note that contained fuse filesystem servers are no faster than regular
fuse.  The containerization code only requires changes to libfuse and is
ready to go today.

e2fsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-service-container_2026-06-25

Note that I threw in a couple more patchsets: one improves the caching
behavior of libext2fs for better performance, and the other adds the
ability to watch for memory pressure complaints from the kernel so that
we can drop our own cache in response.

e2fsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-memory-reclaim_2026-06-25

--Darrick

Unreviewed patches in this patchbomb:

[PATCHSET 1/4] libext2fs: fix some missed fsync calls
  [PATCH 1/3] libext2fs: always fsync the device when flushing the
  [PATCH 2/3] libext2fs: always fsync the device when closing the unix
  [PATCH 3/3] libext2fs: only fsync the unix fd if we wrote to the
[PATCHSET v6 2/4] fuse4fs: run servers as a contained service
  [PATCH 01/10] libext2fs: make it possible to extract the fd from an
  [PATCH 02/10] libext2fs: fix checking for valid fds in mmp.c
  [PATCH 03/10] unix_io: allow passing /dev/fd/XXX paths to the unixfd
  [PATCH 04/10] libext2fs: fix MMP code to work with unixfd IO manager
  [PATCH 05/10] libext2fs: bump libfuse API version to 3.19
  [PATCH 06/10] fuse4fs: hoist some code out of fuse4fs_main
  [PATCH 07/10] fuse4fs: enable safe service mode
  [PATCH 08/10] fuse4fs: set proc title when in fuse service mode
  [PATCH 09/10] fuse4fs: make MMP work correctly in safe service mode
  [PATCH 10/10] debian: update packaging for fuse4fs service
[PATCHSET v6 3/4] fuse2fs: improve block and inode caching
  [PATCH 1/6] libsupport: add caching IO manager
  [PATCH 2/6] iocache: add the actual buffer cache
  [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses
  [PATCH 4/6] fuse2fs: enable caching IO manager
  [PATCH 5/6] fuse2fs: increase inode cache size
  [PATCH 6/6] libext2fs: improve caching for inodes
[PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure
  [PATCH 1/4] libsupport: add pressure stall monitor
  [PATCH 2/4] fuse2fs: only reclaim buffer cache when there is memory
  [PATCH 3/4] fuse4fs: enable memory pressure monitoring with service
  [PATCH 4/4] fuse2fs: flush dirty metadata periodically


* [PATCHSET 1/4] libext2fs: fix some missed fsync calls
  2026-06-25 19:33 [PATCHBOMB v6] e2fsprogs: containerize ext4 for safer operation Darrick J. Wong
@ 2026-06-25 19:35 ` Darrick J. Wong
  2026-06-25 19:36   ` [PATCH 1/3] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
                     ` (2 more replies)
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:35 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

Hi all,

Fix a few places (like device closing) where we really ought to tell the
block device to flush whatever's dirty to disk, even if we've failed to
flush all our cached buffers out to disk.
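
The error-handling rule described above can be sketched in isolation.
This is a minimal illustration, not the unix_io.c code: sync_and_close()
is a hypothetical helper, and it assumes the earlier cache-flush result
is passed in as an errno-style code (0 on success).

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of libext2fs: given the result of
 * flushing a userspace cache (0 or an errno-style code), always fsync
 * and close the fd, and report the first failure that occurred.
 */
int sync_and_close(int fd, int flush_err)
{
	int ret = flush_err;

	assert(flush_err >= 0);

	/* fsync even if the cache flush failed; keep the earlier error */
	if (fsync(fd) != 0 && !ret)
		ret = errno;

	/* likewise, a close() failure must not mask an earlier error */
	if (close(fd) != 0 && !ret)
		ret = errno;

	return ret;
}
```

The caller sees the earliest error, but the later steps still run, so
whatever did reach the device is pushed to stable storage.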

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=libext2fs-flushing-fixes
---
Commits in this patchset:
 * libext2fs: always fsync the device when flushing the cache
 * libext2fs: always fsync the device when closing the unix IO manager
 * libext2fs: only fsync the unix fd if we wrote to the device
---
 lib/ext2fs/unix_io.c |   83 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 69 insertions(+), 14 deletions(-)



* [PATCHSET v6 2/4] fuse4fs: run servers as a contained service
  2026-06-25 19:33 [PATCHBOMB v6] e2fsprogs: containerize ext4 for safer operation Darrick J. Wong
  2026-06-25 19:35 ` [PATCHSET 1/4] libext2fs: fix some missed fsync calls Darrick J. Wong
@ 2026-06-25 19:35 ` Darrick J. Wong
  2026-06-25 19:37   ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
                     ` (9 more replies)
  2026-06-25 19:35 ` [PATCHSET v6 3/4] fuse2fs: improve block and inode caching Darrick J. Wong
  2026-06-25 19:35 ` [PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure Darrick J. Wong
  3 siblings, 10 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:35 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-ext4

Hi all,

This series packages the newly created fuse4fs server into a systemd
socket service.  This service can be used by the "mount.service" helper
in libfuse to implement untrusted unprivileged mounts.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-service-container
---
Commits in this patchset:
 * libext2fs: make it possible to extract the fd from an IO manager
 * libext2fs: fix checking for valid fds in mmp.c
 * unix_io: allow passing /dev/fd/XXX paths to the unixfd IO manager
 * libext2fs: fix MMP code to work with unixfd IO manager
 * libext2fs: bump libfuse API version to 3.19
 * fuse4fs: hoist some code out of fuse4fs_main
 * fuse4fs: enable safe service mode
 * fuse4fs: set proc title when in fuse service mode
 * fuse4fs: make MMP work correctly in safe service mode
 * debian: update packaging for fuse4fs service
---
 lib/ext2fs/ext2_io.h         |    4 
 lib/ext2fs/ext2fs.h          |    1 
 lib/ext2fs/ext2fsP.h         |    4 
 MCONFIG.in                   |    2 
 configure                    |  303 ++++++++++++++++++++++++++-
 configure.ac                 |  131 +++++++++++
 debian/e2fsprogs.install     |    7 +
 debian/fuse4fs.install       |    3 
 debian/libext2fs2t64.symbols |    1 
 debian/rules                 |    3 
 fuse4fs/Makefile.in          |   42 +++-
 fuse4fs/fuse4fs.c            |  479 ++++++++++++++++++++++++++++++++++++------
 fuse4fs/fuse4fs.socket.in    |   17 +
 fuse4fs/fuse4fs@.service.in  |  102 +++++++++
 lib/config.h.in              |   12 +
 lib/ext2fs/io_manager.c      |    8 +
 lib/ext2fs/mmp.c             |  101 +++++++++
 lib/ext2fs/openfs.c          |    1 
 lib/ext2fs/unix_io.c         |   50 ++++
 util/subst.conf.in           |    3 
 20 files changed, 1177 insertions(+), 97 deletions(-)
 mode change 100644 => 100755 debian/fuse4fs.install
 create mode 100644 fuse4fs/fuse4fs.socket.in
 create mode 100644 fuse4fs/fuse4fs@.service.in



* [PATCHSET v6 3/4] fuse2fs: improve block and inode caching
  2026-06-25 19:33 [PATCHBOMB v6] e2fsprogs: containerize ext4 for safer operation Darrick J. Wong
  2026-06-25 19:35 ` [PATCHSET 1/4] libext2fs: fix some missed fsync calls Darrick J. Wong
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
@ 2026-06-25 19:35 ` Darrick J. Wong
  2026-06-25 19:39   ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
                     ` (5 more replies)
  2026-06-25 19:35 ` [PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure Darrick J. Wong
  3 siblings, 6 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:35 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

Hi all,

This series ports the libext2fs inode cache to the new cache.c hashtable
code that was added for fuse4fs unlinked file support and improves on
the UNIX I/O manager's block cache by adding a new I/O manager that does
its own caching.  Now we no longer have statically sized buffer caching
for the two fuse servers.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-caching
---
Commits in this patchset:
 * libsupport: add caching IO manager
 * iocache: add the actual buffer cache
 * iocache: bump buffer mru priority every 50 accesses
 * fuse2fs: enable caching IO manager
 * fuse2fs: increase inode cache size
 * libext2fs: improve caching for inodes
---
 lib/ext2fs/ext2fsP.h     |   13 +
 lib/support/cache.h      |    1 
 lib/support/iocache.h    |   17 +
 debugfs/Makefile.in      |    8 
 e2fsck/Makefile.in       |   12 -
 fuse4fs/Makefile.in      |   10 -
 fuse4fs/fuse4fs.c        |   15 +
 lib/ext2fs/Makefile.in   |   69 ++--
 lib/ext2fs/inline_data.c |    4 
 lib/ext2fs/inode.c       |  215 ++++++++++---
 lib/ext2fs/io_manager.c  |    3 
 lib/support/Makefile.in  |    6 
 lib/support/cache.c      |   16 +
 lib/support/iocache.c    |  751 ++++++++++++++++++++++++++++++++++++++++++++++
 misc/Makefile.in         |   11 -
 misc/fuse2fs.c           |   11 +
 resize/Makefile.in       |   11 -
 tests/fuzz/Makefile.in   |    4 
 tests/progs/Makefile.in  |    4 
 19 files changed, 1057 insertions(+), 124 deletions(-)
 create mode 100644 lib/support/iocache.h
 create mode 100644 lib/support/iocache.c



* [PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure
  2026-06-25 19:33 [PATCHBOMB v6] e2fsprogs: containerize ext4 for safer operation Darrick J. Wong
                   ` (2 preceding siblings ...)
  2026-06-25 19:35 ` [PATCHSET v6 3/4] fuse2fs: improve block and inode caching Darrick J. Wong
@ 2026-06-25 19:35 ` Darrick J. Wong
  2026-06-25 19:41   ` [PATCH 1/4] libsupport: add pressure stall monitor Darrick J. Wong
                     ` (3 more replies)
  3 siblings, 4 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:35 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

Hi all,

Having a static buffer cache limit of 32MB is very conservative.  When
there's plenty of free memory, evicting metadata from the cache isn't
actually a good idea, so we'd like to let it grow to handle large
working sets.  However, we also don't want to OOM the kernel or (in the
future) the fuse4fs container cgroup, so we need to listen for memory
reclamation events in the kernel.

The solution to this is to open the kernel's memory pressure stall
information (PSI) files, configure an event to fire when too much time
is spent waiting for reclamation, and trim the buffer cache when those
events happen.
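
For readers unfamiliar with the kernel interface, here is a minimal
sketch of arming and polling a pressure stall trigger, following the
usage shown in the kernel's PSI documentation.  It is not the psi.c
added by this series; the function names are hypothetical, and the
thresholds in the usage note are example values only.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Build a trigger string: "<some|full> <stall usec> <window usec>".
 * Returns 0 on success, -1 if the buffer is too small.
 */
int psi_format_trigger(char *buf, size_t len, const char *kind,
		       unsigned int stall_us, unsigned int window_us)
{
	int n;

	assert(buf != NULL && kind != NULL);
	n = snprintf(buf, len, "%s %u %u", kind, stall_us, window_us);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Open a pressure file (e.g. /proc/pressure/memory) and arm a trigger.
 * Returns an fd to poll, or -1 on failure.
 */
int psi_open_trigger(const char *path, const char *trigger)
{
	/* the kernel's example writes the string including its NUL */
	size_t len = strlen(trigger) + 1;
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;
	if (write(fd, trigger, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Wait for one event.  Returns 1 if the stall threshold was crossed,
 * 0 on timeout, -1 on error (including the event source going away).
 */
int psi_wait(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return n;
	if (pfd.revents & POLLERR)
		return -1;
	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A caller would format something like "some 150000 1000000" (150ms of
stall within any 1s window), arm it on /proc/pressure/memory, and trim
its cache each time psi_wait() returns 1.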

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-memory-reclaim
---
Commits in this patchset:
 * libsupport: add pressure stall monitor
 * fuse2fs: only reclaim buffer cache when there is memory pressure
 * fuse4fs: enable memory pressure monitoring with service containers
 * fuse2fs: flush dirty metadata periodically
---
 lib/support/list.h      |    6 +
 lib/support/psi.h       |   66 ++++++
 fuse4fs/Makefile.in     |    3 
 fuse4fs/fuse4fs.c       |  258 +++++++++++++++++++++-
 lib/support/Makefile.in |    4 
 lib/support/iocache.c   |   19 ++
 lib/support/psi.c       |  557 +++++++++++++++++++++++++++++++++++++++++++++++
 misc/Makefile.in        |    3 
 misc/fuse2fs.c          |  189 +++++++++++++++-
 9 files changed, 1091 insertions(+), 14 deletions(-)
 create mode 100644 lib/support/psi.h
 create mode 100644 lib/support/psi.c



* [PATCH 1/3] libext2fs: always fsync the device when flushing the cache
  2026-06-25 19:35 ` [PATCHSET 1/4] libext2fs: fix some missed fsync calls Darrick J. Wong
@ 2026-06-25 19:36   ` Darrick J. Wong
  2026-06-25 19:36   ` [PATCH 2/3] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
  2026-06-25 19:36   ` [PATCH 3/3] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:36 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

When we're flushing the unix IO manager's buffer cache, we should always
fsync the block device, because something could have written to the
block device -- either the buffer cache itself, or a direct write.
Regardless, the callers all want all dirtied regions to be persisted to
stable media.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index abd33ba839f7e9..b6feebef93fa5b 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1531,7 +1531,8 @@ static errcode_t unix_flush(io_channel channel)
 	retval = flush_cached_blocks(channel, data, 0);
 #endif
 #ifdef HAVE_FSYNC
-	if (!retval && fsync(data->dev) != 0)
+	/* always fsync the device, even if flushing our own cache failed */
+	if (fsync(data->dev) != 0 && !retval)
 		return errno;
 #endif
 	return retval;



* [PATCH 2/3] libext2fs: always fsync the device when closing the unix IO manager
  2026-06-25 19:35 ` [PATCHSET 1/4] libext2fs: fix some missed fsync calls Darrick J. Wong
  2026-06-25 19:36   ` [PATCH 1/3] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
@ 2026-06-25 19:36   ` Darrick J. Wong
  2026-06-25 19:36   ` [PATCH 3/3] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:36 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

unix_close is the last chance that libext2fs has to report write
failures to users.  Although it's likely that ext2fs_close already
called ext2fs_flush and told the IO manager to flush, we could do one
more sync before we close the file descriptor.  Also don't override the
fsync's errno with the close's errno.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index b6feebef93fa5b..15d6d55ff7fdd4 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1213,10 +1213,16 @@ static errcode_t unix_close(io_channel channel)
 #ifndef NO_IO_CACHE
 	retval = flush_cached_blocks(channel, data, 0);
 #endif
+#ifdef HAVE_FSYNC
+	/* always fsync the device, even if flushing our own cache failed */
+	if (fsync(data->dev) != 0 && !retval)
+		retval = errno;
+#endif
 
 	unix_funlock(channel);
 
-	if (channel->manager != unixfd_io_manager && close(data->dev) < 0)
+	if (channel->manager != unixfd_io_manager && close(data->dev) < 0 &&
+	    !retval)
 		retval = errno;
 	free_cache(data);
 	free(data->cache);



* [PATCH 3/3] libext2fs: only fsync the unix fd if we wrote to the device
  2026-06-25 19:35 ` [PATCHSET 1/4] libext2fs: fix some missed fsync calls Darrick J. Wong
  2026-06-25 19:36   ` [PATCH 1/3] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
  2026-06-25 19:36   ` [PATCH 2/3] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
@ 2026-06-25 19:36   ` Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:36 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

As an optimization, only fsync the block device fd if we tried to write
to the io channel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |   86 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 67 insertions(+), 19 deletions(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 15d6d55ff7fdd4..f4307db0fb2b05 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -132,10 +132,13 @@ struct unix_cache {
 #define WRITE_DIRECT_SIZE 4	/* Must be smaller than CACHE_SIZE */
 #define READ_DIRECT_SIZE 4	/* Should be smaller than CACHE_SIZE */
 
+#define UNIX_STATE_DIRTY	(1U << 0) /* device needs fsyncing */
+
 struct unix_private_data {
 	int	magic;
 	int	dev;
 	int	flags;
+	unsigned int	state; /* UNIX_STATE_* */
 	int	align;
 	int	access_time;
 	int	unix_flock_flags;
@@ -1198,10 +1201,65 @@ static errcode_t unix_open(const char *name, int flags,
 	return unix_open_channel(name, fd, flags, channel, unix_io_manager);
 }
 
+#ifdef HAVE_FSYNC
+static void mark_dirty(io_channel channel)
+{
+	struct unix_private_data *data =
+		(struct unix_private_data *) channel->private_data;
+
+	mutex_lock(data, CACHE_MTX);
+	data->state |= UNIX_STATE_DIRTY;
+	mutex_unlock(data, CACHE_MTX);
+}
+
+static errcode_t maybe_fsync(io_channel channel, int force_fsync)
+{
+	struct unix_private_data *data =
+		(struct unix_private_data *) channel->private_data;
+	int need_fsync;
+	errcode_t retval = 0;
+
+#ifndef NO_IO_CACHE
+	retval = flush_cached_blocks(channel, data, 0);
+#endif
+
+	mutex_lock(data, CACHE_MTX);
+	need_fsync = force_fsync || (data->state & UNIX_STATE_DIRTY);
+	data->state &= ~UNIX_STATE_DIRTY;
+	mutex_unlock(data, CACHE_MTX);
+
+	if (need_fsync && fsync(data->dev) != 0) {
+		if (!retval)
+			retval = errno;
+	}
+	if (retval) {
+		/* redirty because writeback failed */
+		mark_dirty(channel);
+		return retval;
+	}
+
+	return 0;
+}
+#else
+# define mark_dirty(...)		((void)0)
+
+static errcode_t maybe_fsync(io_channel channel, int force_fsync)
+{
+	struct unix_private_data *data =
+		(struct unix_private_data *) channel->private_data;
+	errcode_t retval = 0;
+
+#ifndef NO_IO_CACHE
+	retval = flush_cached_blocks(channel, data, 0);
+#endif
+	return retval;
+}
+#endif
+
 static errcode_t unix_close(io_channel channel)
 {
 	struct unix_private_data *data;
-	errcode_t	retval = 0;
+	errcode_t	retval;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	data = (struct unix_private_data *) channel->private_data;
@@ -1210,14 +1268,7 @@ static errcode_t unix_close(io_channel channel)
 	if (--channel->refcount > 0)
 		return 0;
 
-#ifndef NO_IO_CACHE
-	retval = flush_cached_blocks(channel, data, 0);
-#endif
-#ifdef HAVE_FSYNC
-	/* always fsync the device, even if flushing our own cache failed */
-	if (fsync(data->dev) != 0 && !retval)
-		retval = errno;
-#endif
+	retval = maybe_fsync(channel, 1);
 
 	unix_funlock(channel);
 
@@ -1388,6 +1439,8 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 	data = (struct unix_private_data *) channel->private_data;
 	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
 
+	mark_dirty(channel);
+
 #ifdef NO_IO_CACHE
 	return raw_write_blk(channel, data, block, count, buf, 0);
 #else
@@ -1512,6 +1565,8 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 	if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0)
 		return errno;
 
+	mark_dirty(channel);
+
 	actual = write(data->dev, buf, size);
 	if (actual < 0)
 		return errno;
@@ -1527,21 +1582,12 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 static errcode_t unix_flush(io_channel channel)
 {
 	struct unix_private_data *data;
-	errcode_t retval = 0;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	data = (struct unix_private_data *) channel->private_data;
 	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
 
-#ifndef NO_IO_CACHE
-	retval = flush_cached_blocks(channel, data, 0);
-#endif
-#ifdef HAVE_FSYNC
-	/* always fsync the device, even if flushing our own cache failed */
-	if (fsync(data->dev) != 0 && !retval)
-		return errno;
-#endif
-	return retval;
+	return maybe_fsync(channel, 0);
 }
 
 static errcode_t unix_set_option(io_channel channel, const char *option,
@@ -1653,6 +1699,7 @@ static errcode_t unix_discard(io_channel channel, unsigned long long block,
 		}
 		return errno;
 	}
+	mark_dirty(channel);
 	return 0;
 unimplemented:
 	return EXT2_ET_UNIMPLEMENTED;
@@ -1734,6 +1781,7 @@ static errcode_t unix_zeroout(io_channel channel, unsigned long long block,
 		}
 		return errno;
 	}
+	mark_dirty(channel);
 	return 0;
 unimplemented:
 	return EXT2_ET_UNIMPLEMENTED;



* [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
@ 2026-06-25 19:37   ` Darrick J. Wong
  2026-06-25 19:37   ` [PATCH 02/10] libext2fs: fix checking for valid fds in mmp.c Darrick J. Wong
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:37 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Make it so that we can extract the fd from an open IO manager.  This
will be used in subsequent patches to register the open block device
with the fuse iomap kernel driver.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2_io.h         |    4 +++-
 debian/libext2fs2t64.symbols |    1 +
 lib/ext2fs/io_manager.c      |    8 ++++++++
 lib/ext2fs/unix_io.c         |   20 ++++++++++++++++++++
 4 files changed, 32 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index 61865d54d82490..c880ea2524f248 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -103,7 +103,8 @@ struct struct_io_manager {
 	errcode_t (*zeroout)(io_channel channel, unsigned long long block,
 			     unsigned long long count);
 	errcode_t (*flock)(io_channel channel, unsigned int flock_flags);
-	long	reserved[13];
+	errcode_t (*get_fd)(io_channel channel, int *fd);
+	long	reserved[12];
 };
 
 #define IO_FLAG_RW		0x0001
@@ -155,6 +156,7 @@ extern errcode_t io_channel_cache_readahead(io_channel io,
 					    unsigned long long count);
 extern errcode_t io_channel_flock(io_channel io, unsigned int flock_flags);
 extern errcode_t io_channel_funlock(io_channel io);
+extern errcode_t io_channel_get_fd(io_channel io, int *fd);
 
 #ifdef _WIN32
 /* windows_io.c */
diff --git a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols
index affe4c27d4e791..555fbbb0c98878 100644
--- a/debian/libext2fs2t64.symbols
+++ b/debian/libext2fs2t64.symbols
@@ -701,6 +701,7 @@ libext2fs.so.2 libext2fs2t64 #MINVER#
  io_channel_discard@Base 1.42
  io_channel_flock@Base 1.47.99
  io_channel_funlock@Base 1.47.99
+ io_channel_get_fd@Base 1.47.99
  io_channel_read_blk64@Base 1.41.1
  io_channel_set_options@Base 1.37
  io_channel_write_blk64@Base 1.41.1
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index 791ec7d14adbba..dff3d73552827f 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -166,3 +166,11 @@ errcode_t io_channel_funlock(io_channel io)
 
 	return io->manager->flock(io, 0);
 }
+
+errcode_t io_channel_get_fd(io_channel io, int *fd)
+{
+	if (!io->manager->get_fd)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io->manager->get_fd(io, fd);
+}
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index f4307db0fb2b05..79bc9219f9515b 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1786,6 +1786,24 @@ static errcode_t unix_zeroout(io_channel channel, unsigned long long block,
 unimplemented:
 	return EXT2_ET_UNIMPLEMENTED;
 }
+
+static errcode_t unix_get_fd(io_channel channel, int *fd)
+{
+	struct unix_private_data *data;
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	data = (struct unix_private_data *) channel->private_data;
+	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+	if (data->offset) {
+		*fd = -1;
+		return EINVAL;
+	}
+
+	*fd = data->dev;
+	return 0;
+}
+
 #if __GNUC_PREREQ (4, 6)
 #pragma GCC diagnostic pop
 #endif
@@ -1808,6 +1826,7 @@ static struct struct_io_manager struct_unix_manager = {
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,
 	.flock		= unix_flock,
+	.get_fd		= unix_get_fd,
 };
 
 io_manager unix_io_manager = &struct_unix_manager;
@@ -1830,6 +1849,7 @@ static struct struct_io_manager struct_unixfd_manager = {
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,
 	.flock		= unix_flock,
+	.get_fd		= unix_get_fd,
 };
 
 io_manager unixfd_io_manager = &struct_unixfd_manager;



* [PATCH 02/10] libext2fs: fix checking for valid fds in mmp.c
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
  2026-06-25 19:37   ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
@ 2026-06-25 19:37   ` Darrick J. Wong
  2026-06-25 19:37   ` [PATCH 03/10] unix_io: allow passing /dev/fd/XXX paths to the unixfd IO manager Darrick J. Wong
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:37 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

File descriptors are non-negative numbers, which means that 0 is a valid
fd.  Fix the code to be consistent with Unix behaviors.
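
To see why "<= 0" is the wrong validity test, here is a small
standalone demonstration (not libext2fs code; the function name is
made up) that open() really can return 0 when stdin is closed.  It
temporarily closes stdin, so it is not safe in a multithreaded program.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Close stdin, open a file, and report which descriptor the kernel
 * handed back.  POSIX requires open() to return the lowest unused fd,
 * so a successful open yields 0.  Stdin is restored before returning.
 */
int open_with_stdin_closed(const char *path)
{
	int saved_stdin = dup(STDIN_FILENO);	/* -1 if already closed */
	int fd;

	assert(path != NULL);
	close(STDIN_FILENO);
	fd = open(path, O_RDONLY);

	/* note the ">= 0": descriptor 0 is valid and must be closed too */
	if (fd >= 0)
		close(fd);
	if (saved_stdin >= 0) {
		dup2(saved_stdin, STDIN_FILENO);
		close(saved_stdin);
	}
	return fd;
}
```

Code that treats 0 as "no fd" would never close such a descriptor, and
would reopen the device on every call.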

Cc: <linux-ext4@vger.kernel.org> # v1.42
Fixes: 0f5eba7501f467 ("ext2fs: add multi-mount protection (INCOMPAT_MMP)")
Fixes: 76a6c8788c79e4 ("mmp: do not use O_DIRECT when working with regular file")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/mmp.c    |    6 +++---
 lib/ext2fs/openfs.c |    1 +
 2 files changed, 4 insertions(+), 3 deletions(-)


diff --git a/lib/ext2fs/mmp.c b/lib/ext2fs/mmp.c
index e2823732e2b6a2..cb15a18fce5547 100644
--- a/lib/ext2fs/mmp.c
+++ b/lib/ext2fs/mmp.c
@@ -59,11 +59,11 @@ errcode_t ext2fs_mmp_read(ext2_filsys fs, blk64_t mmp_blk, void *buf)
 		return EXT2_ET_MMP_BAD_BLOCK;
 
 	/* ext2fs_open() reserves fd0,1,2 to avoid stdio collision, so checking
-	 * mmp_fd <= 0 is OK to validate that the fd is valid.  This opens its
+	 * mmp_fd < 0 is OK to validate that the fd is valid.  This opens its
 	 * own fd to read the MMP block to ensure that it is using O_DIRECT,
 	 * regardless of how the io_manager is doing reads, to avoid caching of
 	 * the MMP block by the io_manager or the VM.  It needs to be fresh. */
-	if (fs->mmp_fd <= 0) {
+	if (fs->mmp_fd < 0) {
 		struct stat st;
 		int flags = O_RDONLY | O_DIRECT;
 
@@ -427,7 +427,7 @@ errcode_t ext2fs_mmp_stop(ext2_filsys fs)
 	retval = ext2fs_mmp_write(fs, fs->super->s_mmp_block, fs->mmp_cmp);
 
 mmp_error:
-	if (fs->mmp_fd > 0) {
+	if (fs->mmp_fd >= 0) {
 		close(fs->mmp_fd);
 		fs->mmp_fd = -1;
 	}
diff --git a/lib/ext2fs/openfs.c b/lib/ext2fs/openfs.c
index 2b8e0e753c46e8..41359d15740881 100644
--- a/lib/ext2fs/openfs.c
+++ b/lib/ext2fs/openfs.c
@@ -148,6 +148,7 @@ errcode_t ext2fs_open2(const char *name, const char *io_options,
 	/* don't overwrite sb backups unless flag is explicitly cleared */
 	fs->flags |= EXT2_FLAG_MASTER_SB_ONLY;
 	fs->umask = 022;
+	fs->mmp_fd = -1;
 
 	time_env = ext2fs_safe_getenv("SOURCE_DATE_EPOCH");
 	if (time_env) {



* [PATCH 03/10] unix_io: allow passing /dev/fd/XXX paths to the unixfd IO manager
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
  2026-06-25 19:37   ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
  2026-06-25 19:37   ` [PATCH 02/10] libext2fs: fix checking for valid fds in mmp.c Darrick J. Wong
@ 2026-06-25 19:37   ` Darrick J. Wong
  2026-06-25 19:37   ` [PATCH 04/10] libext2fs: fix MMP code to work with " Darrick J. Wong
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:37 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Commit 4ccf9e4fe165cf created the "unixfd" IO manager, which lets a
caller open a filesystem from an existing file descriptor by passing a
string containing the fd number as the "device" name to ext2fs_open().

That was an unfortunate choice of naming, however, because that could
be mistaken for a relative path to a file whose name is an integer
number.  Let's improve this by allowing callers to pass /dev/fd/XX
as the filesystem device name.  The upcoming fuse4fs service patches
will employ this method to open a filesystem on a block device fd passed
into the secure container from a mount helper.
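
The name parsing can be sketched as a standalone function.  This is a
hypothetical, deliberately stricter variant and not the code in this
patch: it insists on a leading digit for both forms, because strtol()
on its own skips leading whitespace, accepts a sign, and (at least on
glibc, as far as I can tell) converts an empty string to 0 without
setting errno.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define DEV_FD_PATH	"/dev/fd/"
#define DEV_FD_PATHLEN	(sizeof(DEV_FD_PATH) - 1)

/*
 * Hypothetical standalone parser: accept "/dev/fd/N" or the legacy
 * bare "N" and store the descriptor number in *fd.  Returns 0 on
 * success or an errno-style code.
 */
int parse_fd_name(const char *name, int *fd)
{
	const char *digits = name;
	char *endptr;
	long val;

	assert(name != NULL && fd != NULL);
	if (strncmp(name, DEV_FD_PATH, DEV_FD_PATHLEN) == 0)
		digits = name + DEV_FD_PATHLEN;

	/* reject "", " 7", "+7" and "-1" before strtol() sees them */
	if (!isdigit((unsigned char)digits[0]))
		return EINVAL;

	errno = 0;
	val = strtol(digits, &endptr, 10);
	if (errno)
		return errno;
	if (*endptr != '\0' || val > INT_MAX)
		return EINVAL;

	*fd = (int)val;
	return 0;
}
```

With this variant "/dev/fd/7" and "7" both yield 7, while "/dev/fd/",
"/dev/fd/7x" and "/dev/fd/-1" are rejected with EINVAL.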

Cc: <linux-ext4@vger.kernel.org> # v1.43.2
Fixes: 4ccf9e4fe165cf ("libext2fs: add unixfd_io_manager")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 79bc9219f9515b..a9b1fac62a0250 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -67,6 +67,7 @@
 #ifdef HAVE_SYS_FILE_H
 #include <sys/file.h>
 #endif
+#include <limits.h>
 
 #if defined(__linux__) && defined(_IO) && !defined(BLKROGET)
 #define BLKROGET   _IO(0x12, 94) /* Get read-only status (0 = read_write).  */
@@ -1148,13 +1149,40 @@ static errcode_t unix_open_channel(const char *name, int fd,
 	return retval;
 }
 
+#define DEV_FD_PATH	"/dev/fd/"
+#define DEV_FD_PATHLEN	(sizeof(DEV_FD_PATH) - 1)
+
+static int possible_unixfd_pathname(const char *path)
+{
+	return strncmp(DEV_FD_PATH, path, DEV_FD_PATHLEN) == 0;
+}
+
 static errcode_t unixfd_open(const char *str_fd, int flags,
 			     io_channel *channel)
 {
 	int fd;
 	int fd_flags;
 
-	fd = atoi(str_fd);
+	/*
+	 * The caller should provide a path in the form "/dev/fd/XX",
+	 * but the shorthand form "XX" is allowed for legacy reasons.
+	 */
+	if (possible_unixfd_pathname(str_fd)) {
+		char *endptr;
+		long maybe_fd;
+
+		errno = 0;
+		maybe_fd = strtol(str_fd + DEV_FD_PATHLEN, &endptr, 10);
+		if (errno)
+			return errno;
+		if (*endptr != 0)
+			return EINVAL;
+		if (maybe_fd < 0 || maybe_fd > INT_MAX)
+			return EINVAL;
+		fd = maybe_fd;
+	} else {
+		fd = atoi(str_fd);
+	}
 #if defined(HAVE_FCNTL)
 	fd_flags = fcntl(fd, F_GETFL);
 	if (fd_flags == -1)



* [PATCH 04/10] libext2fs: fix MMP code to work with unixfd IO manager
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
                     ` (2 preceding siblings ...)
  2026-06-25 19:37   ` [PATCH 03/10] unix_io: allow passing /dev/fd/XXX paths to the unixfd IO manager Darrick J. Wong
@ 2026-06-25 19:37   ` Darrick J. Wong
  2026-06-25 19:38   ` [PATCH 05/10] libext2fs: bump libfuse API version to 3.19 Darrick J. Wong
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:37 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

The MMP code wants to be able to read and write the MMP block directly
to storage so that the pagecache does not get in the way.  This is
critical for correct operation of MMP, because it is guarding against
two cluster nodes trying to change the filesystem at the same time.

Unfortunately there's no convenient way to tell an IO manager to perform
a particular IO in directio mode, so the MMP code open()s the filesystem
source device a second time so that it can set O_DIRECT and maintain its
own file position independently of the IO channel.  This is a gross
layering violation.

For unprivileged containerized fuse4fs, we're going to have a privileged
mount helper pass us the fd to the block device, so we'll be using the
unixfd IO manager.  The enhanced security posture provided by the
service definition file (minimal /dev) means that we cannot reopen the
source device.  In this case, MMP can only duplicate the fd and use the
IO channel carefully.

Fix this (sort of) by detecting the unixfd IO manager and duplicating
the open fd if it's in use.  This adds a requirement that the unixfd
originally be opened in O_DIRECT mode if the filesystem is on a block
device, but that's the best we can do here.
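
The reopen-then-dup strategy described above can be sketched outside of
libext2fs roughly as follows.  reopen_or_dup() and its "shared" out
parameter are hypothetical names used only for this illustration; the
patch records the same information in EXT2_FLAG2_MMP_USE_IOCHANNEL.  The
sketch also checks the dup() return value before using it.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE		/* for O_DIRECT on glibc */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef O_DIRECT
# define O_DIRECT 0
#endif

/*
 * Hypothetical helper (illustration only): obtain a second fd for the
 * object behind 'fd', opened with 'flags'.  Returns the new fd or -1.
 * *shared is set to 1 if the new fd came from dup() and therefore
 * shares its file offset and status flags with 'fd'.
 */
static int reopen_or_dup(int fd, int flags, int *shared)
{
	struct stat st;
	char path[64];
	int new_fd, fl;

	*shared = 0;

	/* Regular files do not need (and may not support) directio. */
	if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode))
		flags &= ~O_DIRECT;

	/* First choice: a fresh open file description via /dev/fd/N. */
	snprintf(path, sizeof(path), "/dev/fd/%d", fd);
	new_fd = open(path, flags);
	if (new_fd >= 0)
		return new_fd;

	/* Fallback: duplicate the fd; it inherits the original's flags. */
	new_fd = dup(fd);
	if (new_fd < 0)
		return -1;

	if (flags & O_DIRECT) {
		fl = fcntl(new_fd, F_GETFL);
		if (fl < 0 || !(fl & O_DIRECT)) {
			close(new_fd);
			return -1;
		}
	}

	*shared = 1;
	return new_fd;
}
```

Because a dup()ed descriptor refers to the same open file description, a
seek on one fd moves the other as well, which is why the patch routes MMP
block reads through the IO channel in that case.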

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2fs.h  |    1 +
 lib/ext2fs/ext2fsP.h |    4 ++
 lib/ext2fs/mmp.c     |   95 +++++++++++++++++++++++++++++++++++++++++++++++++-
 lib/ext2fs/unix_io.c |    2 +
 4 files changed, 100 insertions(+), 2 deletions(-)


diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index c4fcb10bea0fb9..02c3cbcea92482 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -225,6 +225,7 @@ typedef struct ext2_file *ext2_file_t;
  * Internal flags for use by the ext2fs library only
  */
 #define EXT2_FLAG2_USE_FAKE_TIME	0x000000001
+#define EXT2_FLAG2_MMP_USE_IOCHANNEL	0x000000002
 
 /*
  * Special flag in the ext2 inode i_flag field that means that this is
diff --git a/lib/ext2fs/ext2fsP.h b/lib/ext2fs/ext2fsP.h
index 428081c9e2ff38..bdc92991e7dda0 100644
--- a/lib/ext2fs/ext2fsP.h
+++ b/lib/ext2fs/ext2fsP.h
@@ -218,3 +218,7 @@ errcode_t ext2fs_remove_exit_fn(ext2_exit_fn fn, void *data);
         (sizeof(array) / sizeof(array[0]))
 
 #define EXT2FS_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2*!!(cond)]))
+
+#ifndef _WIN32
+int possible_unixfd_pathname(const char *path);
+#endif
diff --git a/lib/ext2fs/mmp.c b/lib/ext2fs/mmp.c
index cb15a18fce5547..188cdb68900e97 100644
--- a/lib/ext2fs/mmp.c
+++ b/lib/ext2fs/mmp.c
@@ -26,9 +26,11 @@
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
+#include <limits.h>
 
 #include "ext2fs/ext2_fs.h"
 #include "ext2fs/ext2fs.h"
+#include "ext2fs/ext2fsP.h"
 
 #ifndef O_DIRECT
 #define O_DIRECT 0
@@ -48,6 +50,86 @@ errcode_t ext2fs_mmp_get_mem(ext2_filsys fs, void **ptr)
 	return ext2fs_get_memalign(fs->blocksize, align, ptr);
 }
 
+#ifdef _WIN32
+static int ext2fs_mmp_open_device(ext2_filsys fs, int flags)
+{
+	return open(fs->device_name, flags);
+}
+#else
+static int ext2fs_mmp_open_device(ext2_filsys fs, int flags)
+{
+	struct stat stbuf;
+	char path[64];
+	int maybe_fd = -1;
+	int new_fd;
+	int ret;
+	errcode_t retval = 0;
+
+	/*
+	 * If we can't possibly be using the unixfd IO manager, open the device
+	 * a second time, which is the historical behavior.  This is a huge
+	 * and historic layering violation!
+	 *
+	 * It's also broken if the unixfd IO manager was passed a string with a
+	 * file descriptor number instead of a /dev/fd/XX path, but the
+	 * internet thinks there are no users of the manager outside of Google.
+	 */
+	if (!possible_unixfd_pathname(fs->device_name))
+		return open(fs->device_name, flags);
+
+	/*
+	 * Try to get the fd of the open block device.  If this fails for any
+	 * reason, fall back to the classic open path.
+	 */
+	retval = io_channel_get_fd(fs->io, &maybe_fd);
+	if (retval || maybe_fd < 0)
+		return open(fs->device_name, flags);
+
+	/*
+	 * We extracted the fd from the IO manager.
+	 *
+	 * Skip directio if this is a regular file, just as ext2fs_mmp_read does.
+	 * Note that the O_DIRECT-clearing logic in the caller might not have
+	 * cleared the bit because it is path based.
+	 */
+	if (fstat(maybe_fd, &stbuf) == 0 && S_ISREG(stbuf.st_mode))
+		flags &= ~O_DIRECT;
+
+	/*
+	 * Try to reopen the same file descriptor, but with the new mode flags.
+	 * If that works then we're done.  Note that these magic symlinks do
+	 * not have to resolve anywhere.
+	 */
+	snprintf(path, sizeof(path), "/dev/fd/%d", maybe_fd);
+	new_fd = open(path, flags);
+	if (new_fd >= 0)
+		return new_fd;
+
+	/*
+	 * Reopening didn't work.  Instead, duplicate the file descriptor and
+	 * check that we actually got directio if that's required.  Note that
+	 * we can't change the mode on the IO channel's fd because we already
+	 * set it up for buffered IO.
+	 */
+	new_fd = dup(maybe_fd);
+	if (flags & O_DIRECT) {
+		ret = fcntl(new_fd, F_GETFL);
+		if (ret < 0 || !(ret & O_DIRECT)) {
+			close(new_fd);
+			return -1;
+		}
+	}
+
+	/*
+	 * The MMP fd shadows the io channel fd, so we must use that for all
+	 * MMP block accesses because the two fds share the same file position
+	 * and O_DIRECT state, and the iochannel must know about that.
+	 */
+	fs->flags2 |= EXT2_FLAG2_MMP_USE_IOCHANNEL;
+	return new_fd;
+}
+#endif
+
 errcode_t ext2fs_mmp_read(ext2_filsys fs, blk64_t mmp_blk, void *buf)
 {
 #ifdef CONFIG_MMP
@@ -77,7 +159,7 @@ errcode_t ext2fs_mmp_read(ext2_filsys fs, blk64_t mmp_blk, void *buf)
 		    S_ISREG(st.st_mode))
 			flags &= ~O_DIRECT;
 
-		fs->mmp_fd = open(fs->device_name, flags);
+		fs->mmp_fd = ext2fs_mmp_open_device(fs, flags);
 		if (fs->mmp_fd < 0) {
 			retval = EXT2_ET_MMP_OPEN_DIRECT;
 			goto out;
@@ -90,6 +172,15 @@ errcode_t ext2fs_mmp_read(ext2_filsys fs, blk64_t mmp_blk, void *buf)
 			return retval;
 	}
 
+	if (fs->flags2 & EXT2_FLAG2_MMP_USE_IOCHANNEL) {
+		retval = io_channel_read_blk64(fs->io, mmp_blk, -fs->blocksize,
+					       fs->mmp_cmp);
+		if (retval)
+			return retval;
+
+		goto read_compare;
+	}
+
 	if ((blk64_t) ext2fs_llseek(fs->mmp_fd, mmp_blk * fs->blocksize,
 				    SEEK_SET) !=
 	    mmp_blk * fs->blocksize) {
@@ -102,6 +193,7 @@ errcode_t ext2fs_mmp_read(ext2_filsys fs, blk64_t mmp_blk, void *buf)
 		goto out;
 	}
 
+read_compare:
 	mmp_cmp = fs->mmp_cmp;
 
 	if (!(fs->flags & EXT2_FLAG_IGNORE_CSUM_ERRORS) &&
@@ -428,6 +520,7 @@ errcode_t ext2fs_mmp_stop(ext2_filsys fs)
 
 mmp_error:
 	if (fs->mmp_fd >= 0) {
+		fs->flags2 &= ~EXT2_FLAG2_MMP_USE_IOCHANNEL;
 		close(fs->mmp_fd);
 		fs->mmp_fd = -1;
 	}
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index a9b1fac62a0250..567bbd9493f7f1 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1152,7 +1152,7 @@ static errcode_t unix_open_channel(const char *name, int fd,
 #define DEV_FD_PATH	"/dev/fd/"
 #define DEV_FD_PATHLEN	(sizeof(DEV_FD_PATH) - 1)
 
-static int possible_unixfd_pathname(const char *path)
+int possible_unixfd_pathname(const char *path)
 {
 	return strncmp(DEV_FD_PATH, path, DEV_FD_PATHLEN) == 0;
 }



* [PATCH 05/10] libext2fs: bump libfuse API version to 3.19
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
                     ` (3 preceding siblings ...)
  2026-06-25 19:37   ` [PATCH 04/10] libext2fs: fix MMP code to work with " Darrick J. Wong
@ 2026-06-25 19:38   ` Darrick J. Wong
  2026-06-25 19:38   ` [PATCH 06/10] fuse4fs: hoist some code out of fuse4fs_main Darrick J. Wong
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:38 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

The fuse service container API is only available in 3.19, so we need to
bump FUSE_USE_VERSION up from 3.14 to 3.19.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure    |    8 ++++----
 configure.ac |   10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)


diff --git a/configure b/configure
index d941ff1f1ad900..f24897fcdd4949 100755
--- a/configure
+++ b/configure
@@ -14604,14 +14604,14 @@ fi
 
 if test -n "$FUSE_LIB"
 then
-	FUSE_USE_VERSION=314
+	FUSE_USE_VERSION=319
 	CFLAGS="$fuse3_CFLAGS $CFLAGS"
 	FUSE_LIB="$fuse3_LIBS"
 	       for ac_header in pthread.h fuse.h
 do :
   as_ac_Header=`printf "%s\n" "ac_cv_header_$ac_header" | sed "$as_sed_sh"`
 ac_fn_c_check_header_compile "$LINENO" "$ac_header" "$as_ac_Header" "#define _FILE_OFFSET_BITS	64
-#define FUSE_USE_VERSION	314
+#define FUSE_USE_VERSION	319
 "
 if eval test \"x\$"$as_ac_Header"\" = x"yes"
 then :
@@ -14646,7 +14646,7 @@ printf %s "checking for lowlevel interface in libfuse... " >&6; }
 
 	#define _GNU_SOURCE
 	#define _FILE_OFFSET_BITS	64
-	#define FUSE_USE_VERSION	314
+	#define FUSE_USE_VERSION	319
 	#include <fuse_lowlevel.h>
 
 int
@@ -14826,7 +14826,7 @@ printf %s "checking for cache_readdir support in libfuse... " >&6; }
 
 	#define _GNU_SOURCE
 	#define _FILE_OFFSET_BITS	64
-	#define FUSE_USE_VERSION	314
+	#define FUSE_USE_VERSION	319
 	#include <fuse.h>
 
 int
diff --git a/configure.ac b/configure.ac
index d8f40f5df0946b..38a18de0b67283 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1384,17 +1384,17 @@ AC_SUBST(FUSE_LIB)
 
 dnl
 dnl Set FUSE_USE_VERSION, which is how fuse servers build against a particular
-dnl libfuse ABI.  Currently we link against the libfuse 3.14 ABI (hence 314)
+dnl libfuse ABI.  Currently we link against the libfuse 3.19 ABI (hence 319)
 dnl
 if test -n "$FUSE_LIB"
 then
-	FUSE_USE_VERSION=314
+	FUSE_USE_VERSION=319
 	CFLAGS="$fuse3_CFLAGS $CFLAGS"
 	FUSE_LIB="$fuse3_LIBS"
 	AC_CHECK_HEADERS([pthread.h fuse.h], [],
 		[AC_MSG_FAILURE([Cannot build against fuse3 headers])],
 [#define _FILE_OFFSET_BITS	64
-#define FUSE_USE_VERSION	314])
+#define FUSE_USE_VERSION	319])
 fi
 if test -n "$FUSE_USE_VERSION"
 then
@@ -1413,7 +1413,7 @@ then
 	[	AC_LANG_PROGRAM([[
 	#define _GNU_SOURCE
 	#define _FILE_OFFSET_BITS	64
-	#define FUSE_USE_VERSION	314
+	#define FUSE_USE_VERSION	319
 	#include <fuse_lowlevel.h>
 		]], [[
 	struct fuse_lowlevel_ops fs_ops = { };
@@ -1515,7 +1515,7 @@ then
 	[	AC_LANG_PROGRAM([[
 	#define _GNU_SOURCE
 	#define _FILE_OFFSET_BITS	64
-	#define FUSE_USE_VERSION	314
+	#define FUSE_USE_VERSION	319
 	#include <fuse.h>
 		]], [[
 	struct fuse_file_info fs_ops = {



* [PATCH 06/10] fuse4fs: hoist some code out of fuse4fs_main
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
                     ` (4 preceding siblings ...)
  2026-06-25 19:38   ` [PATCH 05/10] libext2fs: bump libfuse API version to 3.19 Darrick J. Wong
@ 2026-06-25 19:38   ` Darrick J. Wong
  2026-06-25 19:38   ` [PATCH 07/10] fuse4fs: enable safe service mode Darrick J. Wong
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:38 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

In the next patch, we're going to create a separate fuse4fs_main
function when we're running in service mode.  Hoist into separate
helpers the code that will be shared between the two functions.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c |   95 +++++++++++++++++++++++++++--------------------------
 1 file changed, 49 insertions(+), 46 deletions(-)


diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 155cb7332a9b3f..ebf42609c1a739 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -6055,47 +6055,64 @@ static void fuse4fs_com_err_proc(const char *whoami, errcode_t code,
 	fflush(stderr);
 }
 
-static int fuse4fs_main(struct fuse_args *args, struct fuse4fs *ff)
+static int fuse4fs_create_session(struct fuse4fs *ff, struct fuse_args *args,
+				  struct fuse_cmdline_opts *opts)
 {
-	struct fuse_cmdline_opts opts;
-	struct fuse_session *se;
-	struct fuse_loop_config *loop_config = NULL;
-	int ret = 0;
-
-	if (fuse_parse_cmdline(args, &opts) != 0) {
-		ret = 1;
-		goto out;
-	}
-
 	if (ff->debug)
-		opts.debug = true;
+		opts->debug = true;
 
-	if (opts.show_help) {
+	if (opts->show_help) {
 		fuse_cmdline_help();
-		ret = 0;
-		goto out_free_opts;
+		return 0;
 	}
 
-	if (opts.show_version) {
+	if (opts->show_version) {
 		printf("FUSE library version %s\n", fuse_pkgversion());
-		ret = 0;
-		goto out_free_opts;
+		return 0;
 	}
 
-	if (!opts.mountpoint) {
+	if (!opts->mountpoint) {
 		fprintf(stderr, "error: no mountpoint specified\n");
-		ret = 2;
-		goto out_free_opts;
+		return 2;
 	}
 
-	se = fuse_session_new(args, &fs_ops, sizeof(fs_ops), ff);
-	if (se == NULL) {
-		ret = 3;
-		goto out_free_opts;
+	ff->fuse = fuse_session_new(args, &fs_ops, sizeof(fs_ops), ff);
+	return ff->fuse ? 0 : 3;
+}
+
+static int fuse4fs_event_loop(struct fuse4fs *ff,
+			      struct fuse_loop_config *loop_config,
+			      const struct fuse_cmdline_opts *opts)
+{
+	/*
+	 * Since there's a Big Kernel Lock around all the libext2fs code, we
+	 * only need to start four threads -- one to decode a request, another
+	 * to do the filesystem work, a third to transmit the reply, and a
+	 * fourth to handle fuse notifications.
+	 */
+	fuse_loop_cfg_set_clone_fd(loop_config, opts->clone_fd);
+	fuse_loop_cfg_set_idle_threads(loop_config, opts->max_idle_threads);
+	fuse_loop_cfg_set_max_threads(loop_config, 4);
+
+	return fuse_session_loop_mt(ff->fuse, loop_config) == 0 ? 0 : 8;
+}
+
+static int fuse4fs_main(struct fuse_args *args, struct fuse4fs *ff)
+{
+	struct fuse_cmdline_opts opts;
+	struct fuse_loop_config *loop_config = NULL;
+	int ret;
+
+	if (fuse_parse_cmdline(args, &opts) != 0) {
+		ret = 1;
+		goto out;
 	}
-	ff->fuse = se;
 
-	if (fuse_session_mount(se, opts.mountpoint) != 0) {
+	ret = fuse4fs_create_session(ff, args, &opts);
+	if (ret || !ff->fuse)
+		goto out_free_opts;
+
+	if (fuse_session_mount(ff->fuse, opts.mountpoint) != 0) {
 		ret = 4;
 		goto out_destroy_session;
 	}
@@ -6115,7 +6132,7 @@ static int fuse4fs_main(struct fuse_args *args, struct fuse4fs *ff)
 		close(ff->logfd);
 	ff->logfd = -1;
 
-	if (fuse_set_signal_handlers(se) != 0) {
+	if (fuse_set_signal_handlers(ff->fuse) != 0) {
 		ret = 6;
 		goto out_unmount;
 	}
@@ -6126,30 +6143,16 @@ static int fuse4fs_main(struct fuse_args *args, struct fuse4fs *ff)
 		goto out_remove_signal_handlers;
 	}
 
-	/*
-	 * Since there's a Big Kernel Lock around all the libext2fs code, we
-	 * only need to start four threads -- one to decode a request, another
-	 * to do the filesystem work, a third to transmit the reply, and a
-	 * fourth to handle fuse notifications.
-	 */
-	fuse_loop_cfg_set_clone_fd(loop_config, opts.clone_fd);
-	fuse_loop_cfg_set_idle_threads(loop_config, opts.max_idle_threads);
-	fuse_loop_cfg_set_max_threads(loop_config, 4);
+	ret = fuse4fs_event_loop(ff, loop_config, &opts);
 
-	if (fuse_session_loop_mt(se, loop_config) != 0) {
-		ret = 8;
-		goto out_loopcfg;
-	}
-
-out_loopcfg:
 	fuse_loop_cfg_destroy(loop_config);
 out_remove_signal_handlers:
-	fuse_remove_signal_handlers(se);
+	fuse_remove_signal_handlers(ff->fuse);
 out_unmount:
-	fuse_session_unmount(se);
+	fuse_session_unmount(ff->fuse);
 out_destroy_session:
+	fuse_session_destroy(ff->fuse);
 	ff->fuse = NULL;
-	fuse_session_destroy(se);
 out_free_opts:
 	free(opts.mountpoint);
 out:



* [PATCH 07/10] fuse4fs: enable safe service mode
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
                     ` (5 preceding siblings ...)
  2026-06-25 19:38   ` [PATCH 06/10] fuse4fs: hoist some code out of fuse4fs_main Darrick J. Wong
@ 2026-06-25 19:38   ` Darrick J. Wong
  2026-06-25 19:38   ` [PATCH 08/10] fuse4fs: set proc title when in fuse " Darrick J. Wong
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:38 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Make it possible to run fuse4fs as a safe systemd service, wherein the
fuse server only has access to the fds that we pass in.
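
For readers unfamiliar with how one process hands an open fd to another,
the standard mechanism on an AF_UNIX socket is an SCM_RIGHTS control
message.  The sketch below shows only that generic mechanism; send_fd()
and recv_fd() are hypothetical helpers, and whether libfuse's service API
frames its messages this way is an assumption, since its wire protocol is
not shown in this patch.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one fd. */
union fd_cmsg {
	struct cmsghdr hdr;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send 'fd' over AF_UNIX socket 'sock'. */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* at least one data byte must accompany it */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd from 'sock'; returns it or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new descriptor number that refers to the same open
file, so a sandboxed process can use a device it could never have opened
by path.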

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 MCONFIG.in                  |    2 
 configure                   |  186 +++++++++++++++++++++++++++++++
 configure.ac                |  108 ++++++++++++++++++
 fuse4fs/Makefile.in         |   40 ++++++-
 fuse4fs/fuse4fs.c           |  254 +++++++++++++++++++++++++++++++++++++++++--
 fuse4fs/fuse4fs.socket.in   |   17 +++
 fuse4fs/fuse4fs@.service.in |  102 +++++++++++++++++
 lib/config.h.in             |    6 +
 util/subst.conf.in          |    3 +
 9 files changed, 703 insertions(+), 15 deletions(-)
 create mode 100644 fuse4fs/fuse4fs.socket.in
 create mode 100644 fuse4fs/fuse4fs@.service.in


diff --git a/MCONFIG.in b/MCONFIG.in
index d66e2f3bc1d552..7a17778b6da67f 100644
--- a/MCONFIG.in
+++ b/MCONFIG.in
@@ -42,6 +42,8 @@ HAVE_CROND = @have_crond@
 CROND_DIR = @crond_dir@
 HAVE_SYSTEMD = @have_systemd@
 SYSTEMD_SYSTEM_UNIT_DIR = @systemd_system_unit_dir@
+HAVE_FUSE_SERVICE = @have_fuse_service@
+HAVE_FUSE4FS_SERVICE = @have_fuse4fs_service@
 
 @SET_MAKE@
 
diff --git a/configure b/configure
index f24897fcdd4949..87960ad2cae3c3 100755
--- a/configure
+++ b/configure
@@ -645,6 +645,7 @@ enable_year2038=no
 ac_subst_vars='LTLIBOBJS
 LIBOBJS
 OS_IO_FILE
+have_fuse4fs_service
 systemd_system_unit_dir
 have_systemd
 systemd_LIBS
@@ -697,6 +698,9 @@ UNI_DIFF_OPTS
 SEM_INIT_LIB
 FUSE4FS_CMT
 FUSE2FS_CMT
+fuse_service_socket_perms
+fuse_service_socket_dir
+have_fuse_service
 FUSE_LIB
 fuse3_LIBS
 fuse3_CFLAGS
@@ -929,6 +933,8 @@ with_libiconv_prefix
 with_libintl_prefix
 enable_largefile
 with_libarchive
+with_fuse_service_socket_dir
+with_fuse_service_socket_perms
 enable_fuse2fs
 enable_fuse4fs
 enable_lto
@@ -1652,6 +1658,11 @@ Optional Packages:
   --with-libintl-prefix[=DIR]  search for libintl in DIR/include and DIR/lib
   --without-libintl-prefix     don't search for libintl in includedir and libdir
   --without-libarchive    disable use of libarchive
+  --with-fuse-service-socket-dir[=DIR]
+                          Create fuse3 filesystem service sockets in DIR.
+  --with-fuse-service-socket-perms[=MODE]
+                          Create fuse3 filesystem service socket with these
+                          permissions.
   --with-multiarch=ARCH   specify the multiarch triplet
   --with-udev-rules-dir[=DIR]
                           Install udev rules into DIR.
@@ -14598,7 +14609,7 @@ else
         fuse3_LIBS=$pkg_cv_fuse3_LIBS
         { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
 printf "%s\n" "yes" >&6; }
-        FUSE_LIB=-lfuse3
+        FUSE_LIB=-lfuse3 ; have_fuse3_pkg=yes
 fi
 
 
@@ -14680,6 +14691,155 @@ printf "%s\n" "#define HAVE_FUSE_LOWLEVEL 1" >>confdefs.h
 
 fi
 
+have_fuse_service=
+fuse_service_socket_dir=
+if test -n "$have_fuse_lowlevel"
+then
+
+# Check whether --with-fuse_service_socket_dir was given.
+if test ${with_fuse_service_socket_dir+y}
+then :
+  withval=$with_fuse_service_socket_dir;
+else case e in #(
+  e) with_fuse_service_socket_dir=yes ;;
+esac
+fi
+
+	if test "x${with_fuse_service_socket_dir}" != "xno"
+then :
+
+		if test "x${with_fuse_service_socket_dir}" = "xyes"
+then :
+
+			if test "x$have_fuse3_pkg" = "xyes"
+then :
+
+				with_fuse_service_socket_dir="$($PKG_CONFIG --variable=service_socket_dir fuse3)"
+
+else case e in #(
+  e)
+				with_fuse_service_socket_dir=""
+			   ;;
+esac
+fi
+
+fi
+		{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for fuse3 service socket dir" >&5
+printf %s "checking for fuse3 service socket dir... " >&6; }
+		fuse_service_socket_dir="${with_fuse_service_socket_dir}"
+		if test -n "${fuse_service_socket_dir}"
+then :
+
+			{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: ${fuse_service_socket_dir}" >&5
+printf "%s\n" "${fuse_service_socket_dir}" >&6; }
+
+else case e in #(
+  e)
+			{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; }
+		   ;;
+esac
+fi
+
+fi
+
+# Check whether --with-fuse_service_socket_perms was given.
+if test ${with_fuse_service_socket_perms+y}
+then :
+  withval=$with_fuse_service_socket_perms;
+else case e in #(
+  e) with_fuse_service_socket_perms=yes ;;
+esac
+fi
+
+	if test "x${with_fuse_service_socket_perms}" != "xno"
+then :
+
+		if test "x${with_fuse_service_socket_perms}" = "xyes"
+then :
+
+			if test "x$have_fuse3_pkg" = "xyes"
+then :
+
+				with_fuse_service_socket_perms="$($PKG_CONFIG --variable=service_socket_perms fuse3)"
+
+else case e in #(
+  e)
+				with_fuse_service_socket_perms=""
+			   ;;
+esac
+fi
+
+fi
+		fuse_service_socket_perms="${with_fuse_service_socket_perms}"
+
+fi
+fi
+if test -n "$FUSE_USE_VERSION"
+then
+	{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for fuse_service_accept in libfuse" >&5
+printf %s "checking for fuse_service_accept in libfuse... " >&6; }
+	cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+	#define _GNU_SOURCE
+	#define _FILE_OFFSET_BITS	64
+	#define FUSE_USE_VERSION	319
+	#include <fuse_lowlevel.h>
+	#include <fuse_service.h>
+
+int
+main (void)
+{
+
+	struct fuse_service *moo;
+	fuse_service_accepted(moo);
+
+  ;
+  return 0;
+}
+
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+  have_fuse_service_accept=yes
+	   { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+else case e in #(
+  e) { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; } ;;
+esac
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+    conftest$ac_exeext conftest.$ac_ext
+
+	{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for fuse3 service support" >&5
+printf %s "checking for fuse3 service support... " >&6; }
+	if test -n "${fuse_service_socket_dir}" && test "${have_fuse_service_accept}" = "yes"
+then :
+
+		{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+		have_fuse_service="yes"
+
+else case e in #(
+  e)
+		{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; }
+	   ;;
+esac
+fi
+fi
+
+
+
+if test "$have_fuse_service" = yes
+then
+
+printf "%s\n" "#define HAVE_FUSE_SERVICE 1" >>confdefs.h
+
+fi
+
 FUSE2FS_CMT=
 # Check whether --enable-fuse2fs was given.
 if test ${enable_fuse2fs+y}
@@ -16595,6 +16755,30 @@ esac
 fi
 
 
+
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for fuse4fs service support and systemd" >&5
+printf %s "checking for fuse4fs service support and systemd... " >&6; }
+if test "${FUSE4FS_CMT}${have_fuse_service}${have_systemd}" = "yesyes"
+then :
+
+           { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+
+printf "%s\n" "#define HAVE_FUSE4FS_SERVICE 1" >>confdefs.h
+
+           have_fuse4fs_service=yes
+
+else case e in #(
+  e)
+           { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; }
+           have_fuse4fs_service=no
+
+ ;;
+esac
+fi
+
+
 OS_IO_FILE=""
 case "$host_os" in
   mingw*)
diff --git a/configure.ac b/configure.ac
index 38a18de0b67283..381bb15d920a0f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1376,7 +1376,7 @@ dnl Check to see if the FUSE library is -lfuse3 or -losxfuse
 dnl
 FUSE_LIB=
 dnl osxfuse.dylib supersedes fuselib.dylib
-PKG_CHECK_MODULES([fuse3], [fuse3], [FUSE_LIB=-lfuse3],
+PKG_CHECK_MODULES([fuse3], [fuse3], [FUSE_LIB=-lfuse3 ; have_fuse3_pkg=yes],
 [
 	AC_CHECK_LIB(osxfuse, fuse_main, [FUSE_LIB=-losxfuse])
 ])
@@ -1428,6 +1428,96 @@ then
 		  [Define to 1 if fuse supports lowlevel API])
 fi
 
+dnl
+dnl Check if the FUSE library tells us where to put fs service sockets
+dnl
+have_fuse_service=
+fuse_service_socket_dir=
+if test -n "$have_fuse_lowlevel"
+then
+	AC_ARG_WITH([fuse_service_socket_dir],
+	  [AS_HELP_STRING([--with-fuse-service-socket-dir@<:@=DIR@:>@],
+		  [Create fuse3 filesystem service sockets in DIR.])],
+	  [],
+	  [with_fuse_service_socket_dir=yes])
+	AS_IF([test "x${with_fuse_service_socket_dir}" != "xno"],
+	  [
+		AS_IF([test "x${with_fuse_service_socket_dir}" = "xyes"],
+		  [
+			AS_IF([test "x$have_fuse3_pkg" = "xyes" ],
+			  [
+				with_fuse_service_socket_dir="$($PKG_CONFIG --variable=service_socket_dir fuse3)"
+			  ], [
+				with_fuse_service_socket_dir=""
+			  ])
+		  ])
+		AC_MSG_CHECKING([for fuse3 service socket dir])
+		fuse_service_socket_dir="${with_fuse_service_socket_dir}"
+		AS_IF([test -n "${fuse_service_socket_dir}"],
+		  [
+			AC_MSG_RESULT(${fuse_service_socket_dir})
+		  ],
+		  [
+			AC_MSG_RESULT(no)
+		  ])
+	  ],
+	  [])
+	AC_ARG_WITH([fuse_service_socket_perms],
+	  [AS_HELP_STRING([--with-fuse-service-socket-perms@<:@=MODE@:>@],
+		  [Create fuse3 filesystem service socket with these permissions.])],
+	  [],
+	  [with_fuse_service_socket_perms=yes])
+	AS_IF([test "x${with_fuse_service_socket_perms}" != "xno"],
+	  [
+		AS_IF([test "x${with_fuse_service_socket_perms}" = "xyes"],
+		  [
+			AS_IF([test "x$have_fuse3_pkg" = "xyes" ],
+			  [
+				with_fuse_service_socket_perms="$($PKG_CONFIG --variable=service_socket_perms fuse3)"
+			  ], [
+				with_fuse_service_socket_perms=""
+			  ])
+		  ])
+		fuse_service_socket_perms="${with_fuse_service_socket_perms}"
+	  ],
+	  [])
+fi
+if test -n "$FUSE_USE_VERSION"
+then
+	AC_MSG_CHECKING(for fuse_service_accept in libfuse)
+	AC_LINK_IFELSE(
+	[	AC_LANG_PROGRAM([[
+	#define _GNU_SOURCE
+	#define _FILE_OFFSET_BITS	64
+	#define FUSE_USE_VERSION	319
+	#include <fuse_lowlevel.h>
+	#include <fuse_service.h>
+		]], [[
+	struct fuse_service *moo;
+	fuse_service_accepted(moo);
+		]])
+	], have_fuse_service_accept=yes
+	   AC_MSG_RESULT(yes),
+	   AC_MSG_RESULT(no))
+
+	AC_MSG_CHECKING([for fuse3 service support])
+	AS_IF([test -n "${fuse_service_socket_dir}" && test "${have_fuse_service_accept}" = "yes"],
+	  [
+		AC_MSG_RESULT(yes)
+		have_fuse_service="yes"
+	  ],
+	  [
+		AC_MSG_RESULT(no)
+	  ])
+fi
+AC_SUBST(have_fuse_service)
+AC_SUBST(fuse_service_socket_dir)
+AC_SUBST(fuse_service_socket_perms)
+if test "$have_fuse_service" = yes
+then
+	AC_DEFINE(HAVE_FUSE_SERVICE, 1, [Define to 1 if fuse supports service])
+fi
+
 dnl
 dnl Check if fuse2fs is actually built.
 dnl
@@ -2101,6 +2191,22 @@ AS_IF([test "x${with_systemd_unit_dir}" != "xno"],
   ])
 AC_SUBST(have_systemd)
 AC_SUBST(systemd_system_unit_dir)
+
+AC_MSG_CHECKING([for fuse4fs service support and systemd])
+AS_IF([test "${FUSE4FS_CMT}${have_fuse_service}${have_systemd}" = "yesyes"],
+      [
+           AC_MSG_RESULT(yes)
+           AC_DEFINE(HAVE_FUSE4FS_SERVICE, 1,
+                     [Define to 1 if fuse4fs should be built with fuse service support])
+           have_fuse4fs_service=yes
+      ],
+      [
+           AC_MSG_RESULT(no)
+           have_fuse4fs_service=no
+      ]
+)
+AC_SUBST(have_fuse4fs_service)
+
 dnl Adjust the compiled files if we are on windows vs everywhere else
 dnl
 OS_IO_FILE=""
diff --git a/fuse4fs/Makefile.in b/fuse4fs/Makefile.in
index cecee2b2554f82..67b8afd54493b0 100644
--- a/fuse4fs/Makefile.in
+++ b/fuse4fs/Makefile.in
@@ -17,6 +17,13 @@ UMANPAGES=
 @FUSE4FS_CMT@UPROGS+=fuse4fs
 @FUSE4FS_CMT@UMANPAGES+=fuse4fs.1
 
+ifeq ($(HAVE_FUSE4FS_SERVICE),yes)
+SERVICE_FILES	+= fuse4fs.socket fuse4fs@.service
+INSTALLDIRS_TGT	+= installdirs-systemd
+INSTALL_TGT	+= install-systemd
+UNINSTALL_TGT	+= uninstall-systemd
+endif
+
 FUSE4FS_OBJS=	fuse4fs.o journal.o recovery.o revoke.o
 
 PROFILED_FUSE4FS_OJBS=	profiled/fuse4fs.o profiled/journal.o \
@@ -54,7 +61,7 @@ DEPEND_CFLAGS = -I$(top_srcdir)/e2fsck
 @PROFILE_CMT@	$(Q) $(CC) $(ALL_CFLAGS) -g -pg -o profiled/$*.o -c $<
 
 all:: profiled $(SPROGS) $(UPROGS) $(USPROGS) $(SMANPAGES) $(UMANPAGES) \
-	$(FMANPAGES) $(LPROGS)
+	$(FMANPAGES) $(LPROGS) $(SERVICE_FILES)
 
 all-static::
 
@@ -71,6 +78,14 @@ fuse4fs: $(FUSE4FS_OBJS) $(DEPLIBS) $(DEPLIBBLKID) $(DEPLIBUUID) \
 		$(LIBFUSE) $(LIBBLKID) $(LIBUUID) $(LIBEXT2FS) $(LIBINTL) \
 		$(CLOCK_GETTIME_LIB) $(SYSLIBS) $(LIBS_E2P)
 
+%.socket: %.socket.in $(DEP_SUBSTITUTE)
+	$(E) "	SUBST $@"
+	$(Q) $(SUBSTITUTE_UPTIME) $< $@
+
+%.service: %.service.in $(DEP_SUBSTITUTE)
+	$(E) "	SUBST $@"
+	$(Q) $(SUBSTITUTE_UPTIME) $< $@
+
 journal.o: $(srcdir)/../debugfs/journal.c
 	$(E) "	CC $<"
 	$(Q) $(CC) -c $(JOURNAL_CFLAGS) -I$(srcdir) \
@@ -93,11 +108,15 @@ fuse4fs.1: $(DEP_SUBSTITUTE) $(srcdir)/fuse4fs.1.in
 	$(E) "	SUBST $@"
 	$(Q) $(SUBSTITUTE_UPTIME) $(srcdir)/fuse4fs.1.in fuse4fs.1
 
-installdirs:
+installdirs: $(INSTALLDIRS_TGT)
 	$(E) "	MKDIR_P $(bindir) $(man1dir)"
 	$(Q) $(MKDIR_P) $(DESTDIR)$(bindir) $(DESTDIR)$(man1dir)
 
-install: all $(UMANPAGES) installdirs
+installdirs-systemd:
+	$(E) "	MKDIR_P $(SYSTEMD_SYSTEM_UNIT_DIR)"
+	$(Q) $(MKDIR_P) $(DESTDIR)$(SYSTEMD_SYSTEM_UNIT_DIR)
+
+install: all $(UMANPAGES) installdirs $(INSTALL_TGT)
 	$(Q) for i in $(UPROGS); do \
 		$(ES) "	INSTALL $(bindir)/$$i"; \
 		$(INSTALL_PROGRAM) $$i $(DESTDIR)$(bindir)/$$i; \
@@ -110,13 +129,19 @@ install: all $(UMANPAGES) installdirs
 		$(INSTALL_DATA) $$i $(DESTDIR)$(man1dir)/$$i; \
 	done
 
+install-systemd: $(SERVICE_FILES) installdirs-systemd
+	$(Q) for i in $(SERVICE_FILES); do \
+		$(ES) "	INSTALL_DATA $(SYSTEMD_SYSTEM_UNIT_DIR)/$$i"; \
+		$(INSTALL_DATA) $$i $(DESTDIR)$(SYSTEMD_SYSTEM_UNIT_DIR)/$$i; \
+	done
+
 install-strip: install
 	$(Q) for i in $(UPROGS); do \
 		$(E) "	STRIP $(bindir)/$$i"; \
 		$(STRIP) $(DESTDIR)$(bindir)/$$i; \
 	done
 
-uninstall:
+uninstall: $(UNINSTALL_TGT)
 	for i in $(UPROGS); do \
 		$(RM) -f $(DESTDIR)$(bindir)/$$i; \
 	done
@@ -124,9 +149,16 @@ uninstall:
 		$(RM) -f $(DESTDIR)$(man1dir)/$$i; \
 	done
 
+uninstall-systemd:
+	for i in $(SERVICE_FILES); do \
+		$(RM) -f $(DESTDIR)$(SYSTEMD_SYSTEM_UNIT_DIR)/$$i; \
+	done
+
 clean::
 	$(RM) -f $(UPROGS) $(UMANPAGES) profile.h \
 		fuse4fs.profiled \
+		$(SERVICE_FILES) \
+		fuse4fs.socket \
 		profiled/*.o \#* *.s *.o *.a *~ core gmon.out
 
 mostlyclean: clean
diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index ebf42609c1a739..97e668fadc2398 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -42,6 +42,10 @@
 # define _FILE_OFFSET_BITS 64
 #endif /* _FILE_OFFSET_BITS */
 #include <fuse_lowlevel.h>
+#ifdef HAVE_FUSE4FS_SERVICE
+# include <sys/mount.h>
+# include <fuse_service.h>
+#endif
 #ifdef __SET_FOB_FOR_FUSE
 # undef _FILE_OFFSET_BITS
 #endif /* __SET_FOB_FOR_FUSE */
@@ -140,6 +144,10 @@
 
 #define FUSE4FS_ATTR_TIMEOUT	(0.0)
 
+#ifndef O_DIRECT
+# define O_DIRECT	(0)
+#endif
+
 static inline uint64_t round_up(uint64_t b, unsigned int align)
 {
 	unsigned int m;
@@ -285,8 +293,21 @@ struct fuse4fs {
 #endif
 	struct fuse_session *fuse;
 	struct cache inodes;
+#ifdef HAVE_FUSE4FS_SERVICE
+	struct fuse_service *service;
+	int bdev_fd;
+#endif
 };
 
+#ifdef HAVE_FUSE4FS_SERVICE
+static inline bool fuse4fs_is_service(const struct fuse4fs *ff)
+{
+	return fuse_service_accepted(ff->service);
+}
+#else
+# define fuse4fs_is_service(...)		(false)
+#endif
+
 #define FUSE4FS_CHECK_HANDLE(req, fh) \
 	do { \
 		if ((fh) == NULL || (fh)->magic != FUSE4FS_FILE_MAGIC) { \
@@ -1270,6 +1291,118 @@ static errcode_t fuse4fs_check_support(struct fuse4fs *ff)
 	return 0;
 }
 
+#ifdef HAVE_FUSE4FS_SERVICE
+static int fuse4fs_service_connect(struct fuse4fs *ff, struct fuse_args *args)
+{
+	int ret;
+
+	ret = fuse_service_accept(&ff->service);
+	if (ret)
+		return ret;
+
+	if (!fuse4fs_is_service(ff))
+		return 0;
+
+	return fuse_service_append_args(ff->service, args);
+}
+
+static bool fuse4fs_service_should_drop_kernel_mode(const struct fuse4fs *ff)
+{
+	return ff->kernel && fuse4fs_is_service(ff) &&
+	       !fuse_service_can_allow_other(ff->service);
+}
+
+static void fuse4fs_service_close_bdev(struct fuse4fs *ff)
+{
+	if (ff->bdev_fd >= 0)
+		close(ff->bdev_fd);
+	ff->bdev_fd = -1;
+}
+
+static int fuse4fs_service_exit(struct fuse4fs *ff, int exitcode)
+{
+	if (!fuse4fs_is_service(ff))
+		return exitcode;
+
+	fuse_service_send_goodbye(ff->service, exitcode);
+	fuse_service_release(ff->service);
+	close(ff->bdev_fd);
+	ff->bdev_fd = -1;
+
+	return fuse_service_exit(exitcode);
+}
+
+static int fuse4fs_service_open_bdev(struct fuse4fs *ff)
+{
+	double deadline = init_deadline(FUSE4FS_OPEN_TIMEOUT);
+	const int open_flags = O_EXCL | (ff->directio ? O_DIRECT : 0);
+	int open_mode = O_RDWR;
+	int fd;
+	int ret;
+
+	do {
+		ret = fuse_service_request_file(ff->service, ff->device,
+						open_mode | open_flags, 0, 0);
+		if (ret)
+			return ret;
+
+		ret = fuse_service_receive_file(ff->service, ff->device, &fd);
+		if (ret)
+			return ret;
+
+		if ((fd == -EPERM || fd == -EACCES || fd == -EROFS) &&
+		    open_mode == O_RDWR) {
+			/* Try readonly, but force the loop to run once more */
+			open_mode = O_RDONLY;
+			ret = 1;
+		}
+	} while (ret == 1 || (fd == -EBUSY && retry_before_deadline(deadline)));
+
+	if (fd < 0) {
+		err_printf(ff, "%s %s: %s.\n", _("opening device"), ff->device,
+			   strerror(-fd));
+		return -1;
+	}
+
+	if (!ff->ro && open_mode == O_RDONLY)
+		ff->ro = 1;
+
+	ff->bdev_fd = fd;
+	return 0;
+}
+
+static int fuse4fs_service_get_config(struct fuse4fs *ff)
+{
+	int ret, ret2;
+
+	ret = fuse4fs_service_open_bdev(ff);
+
+	/* Always prevent further fds from being added to our file table */
+	ret2 = fuse_service_finish_file_requests(ff->service);
+	if (ret2 && !ret)
+		ret = ret2;
+
+	return ret;
+}
+
+static errcode_t fuse4fs_service_openfs(struct fuse4fs *ff, char *options,
+					int flags)
+{
+	char path[64];
+
+	snprintf(path, sizeof(path), "/dev/fd/%d", ff->bdev_fd);
+	return ext2fs_open2(path, options, flags, 0, 0, unixfd_io_manager,
+			    &ff->fs);
+}
+#else
+# define fuse4fs_service_connect(...)		(0)
+# define fuse4fs_service_should_drop_kernel_mode(...)	(false)
+# define fuse4fs_service_close_bdev(...)	((void)0)
+# define fuse4fs_service_exit(fctx, ret)	(ret)
+# define fuse4fs_service_get_config(...)	(EOPNOTSUPP)
+# define fuse4fs_service_openfs(...)		(EOPNOTSUPP)
+#endif
+
 static errcode_t fuse4fs_acquire_lockfile(struct fuse4fs *ff)
 {
 	char *resolved;
@@ -1340,6 +1473,8 @@ static void fuse4fs_unmount(struct fuse4fs *ff)
 				   uuid);
 	}
 
+	fuse4fs_service_close_bdev(ff);
+
 	if (ff->lockfile)
 		fuse4fs_release_lockfile(ff);
 }
@@ -1395,8 +1530,11 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff)
 	 */
 	deadline = init_deadline(FUSE4FS_OPEN_TIMEOUT);
 	do {
-		err = ext2fs_open2(ff->device, options, flags, 0, 0,
-				   unix_io_manager, &ff->fs);
+		if (fuse4fs_is_service(ff))
+			err = fuse4fs_service_openfs(ff, options, flags);
+		else
+			err = ext2fs_open2(ff->device, options, flags, 0, 0,
+					   unix_io_manager, &ff->fs);
 		if ((err == EPERM || err == EACCES) &&
 		    (!ff->ro || (flags & EXT2_FLAG_RW))) {
 			/*
@@ -1741,6 +1879,10 @@ static int fuse4fs_setup_logging(struct fuse4fs *ff)
 	if (logfile)
 		return fuse4fs_capture_output(ff, logfile);
 
+	/* systemd already hooked us up to /dev/ttyprintk */
+	if (fuse4fs_is_service(ff))
+		return 0;
+
 	/* in kernel mode, try to log errors to the kernel log */
 	if (ff->kernel)
 		fuse4fs_capture_output(ff, "/dev/ttyprintk");
@@ -5962,14 +6104,13 @@ static const char *get_subtype(const char *argv0)
 }
 
 static void fuse4fs_compute_libfuse_args(struct fuse4fs *ff,
-					 struct fuse_args *args,
-					 const char *argv0)
+					 struct fuse_args *args)
 {
 	char extra_args[BUFSIZ];
 
 	/* Set up default fuse parameters */
 	snprintf(extra_args, BUFSIZ, "-osubtype=%s,fsname=%s",
-		 get_subtype(argv0),
+		 get_subtype(args->argv[0]),
 		 ff->device);
 	if (ff->no_default_opts == 0)
 		fuse_opt_add_arg(args, extra_args);
@@ -5986,6 +6127,15 @@ static void fuse4fs_compute_libfuse_args(struct fuse4fs *ff,
 #endif
 	}
 
+	/*
+	 * If we're mounting as a systemd service but the mount helper told us
+	 * that allow_other isn't allowed, then disable -okernel.  This mount
+	 * option gets special consideration because it's hardcoded in the
+	 * service unit file.
+	 */
+	if (fuse4fs_service_should_drop_kernel_mode(ff))
+		ff->kernel = 0;
+
 	if (ff->kernel) {
 		/*
 		 * ACLs are always enforced when kernel mode is enabled, to
@@ -6097,6 +6247,69 @@ static int fuse4fs_event_loop(struct fuse4fs *ff,
 	return fuse_session_loop_mt(ff->fuse, loop_config) == 0 ? 0 : 8;
 }
 
+#ifdef HAVE_FUSE4FS_SERVICE
+static int fuse4fs_service_main(struct fuse_args *args, struct fuse4fs *ff)
+{
+	struct fuse_cmdline_opts opts;
+	struct fuse_loop_config *loop_config = NULL;
+	int ret;
+
+	/*
+	 * Service initialization doesn't fork or change stdout/stderr so we
+	 * can drop the extra logfd right now.
+	 */
+	if (ff->logfd >= 0)
+		close(ff->logfd);
+	ff->logfd = -1;
+
+	ret = fuse_service_parse_cmdline_opts(args, &opts);
+	if (ret != 0) {
+		ret = 1;
+		goto out;
+	}
+
+	ret = fuse4fs_create_session(ff, args, &opts);
+	if (ret || !ff->fuse)
+		goto out_free_opts;
+
+	loop_config = fuse_loop_cfg_create();
+	if (loop_config == NULL) {
+		ret = 7;
+		goto out_destroy_session;
+	}
+
+	if (fuse_set_signal_handlers(ff->fuse) != 0) {
+		ret = 6;
+		goto out_loopcfg;
+	}
+
+	ret = fuse_service_session_mount(ff->service, ff->fuse, S_IFDIR, &opts);
+	if (ret) {
+		ret = 4;
+		goto out_signals;
+	}
+
+	fuse_service_send_goodbye(ff->service, 0);
+	fuse_service_release(ff->service);
+
+	ret = fuse4fs_event_loop(ff, loop_config, &opts);
+
+out_signals:
+	fuse_remove_signal_handlers(ff->fuse);
+out_loopcfg:
+	fuse_loop_cfg_destroy(loop_config);
+out_destroy_session:
+	fuse_session_destroy(ff->fuse);
+	ff->fuse = NULL;
+out_free_opts:
+	free(opts.mountpoint);
+out:
+	return ret;
+}
+#else
+# define fuse4fs_service_main(...)		(8)
+#endif
+
 static int fuse4fs_main(struct fuse_args *args, struct fuse4fs *ff)
 {
 	struct fuse_cmdline_opts opts;
@@ -6168,18 +6381,28 @@ int main(int argc, char *argv[])
 		.bfl = (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER,
 		.oom_score_adj = -500,
 		.opstate = F4OP_WRITABLE,
+#ifdef HAVE_FUSE4FS_SERVICE
+		.bdev_fd = -1,
+#endif
 	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;
 	int ret;
 
+	ret = fuse4fs_service_connect(&fctx, &args);
+	if (ret) {
+		ret = 1;
+		goto out_exit;
+	}
+
 	ret = fuse_opt_parse(&args, &fctx, fuse4fs_opts, fuse4fs_opt_proc);
 	if (ret)
-		exit(1);
+		goto out_exit;
 	if (fctx.device == NULL) {
 		fprintf(stderr, "Missing ext4 device/image\n");
 		fprintf(stderr, "See '%s -h' for usage\n", argv[0]);
-		exit(1);
+		ret = 1;
+		goto out_exit;
 	}
 
 	/* /dev/sda -> sda for reporting */
@@ -6209,6 +6432,14 @@ int main(int argc, char *argv[])
 		goto out;
 	}
 
+	if (fuse4fs_is_service(&fctx)) {
+		ret = fuse4fs_service_get_config(&fctx);
+		if (ret) {
+			ret = 2;
+			goto out;
+		}
+	}
+
 	try_set_io_flusher(&fctx);
 	try_adjust_oom_score(&fctx);
 
@@ -6264,9 +6495,12 @@ int main(int argc, char *argv[])
 	/* Initialize generation counter */
 	get_random_bytes(&fctx.next_generation, sizeof(unsigned int));
 
-	fuse4fs_compute_libfuse_args(&fctx, &args, argv[0]);
+	fuse4fs_compute_libfuse_args(&fctx, &args);
 
-	ret = fuse4fs_main(&args, &fctx);
+	if (fuse4fs_is_service(&fctx))
+		ret = fuse4fs_service_main(&args, &fctx);
+	else
+		ret = fuse4fs_main(&args, &fctx);
 	switch(ret) {
 	case 0:
 		/* success */
@@ -6308,6 +6542,8 @@ int main(int argc, char *argv[])
 	if (fctx.device)
 		free(fctx.device);
 	pthread_mutex_destroy(&fctx.bfl);
+out_exit:
+	ret = fuse4fs_service_exit(&fctx, ret);
 	fuse_opt_free_args(&args);
 	return ret;
 }
diff --git a/fuse4fs/fuse4fs.socket.in b/fuse4fs/fuse4fs.socket.in
new file mode 100644
index 00000000000000..99e391bcc6787e
--- /dev/null
+++ b/fuse4fs/fuse4fs.socket.in
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (C) 2025-2026 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+[Unit]
+Description=Socket for ext4 Service
+
+[Socket]
+ListenSequentialPacket=@fuse_service_socket_dir@/ext2
+ListenSequentialPacket=@fuse_service_socket_dir@/ext3
+ListenSequentialPacket=@fuse_service_socket_dir@/ext4
+Accept=yes
+SocketMode=@fuse_service_socket_perms@
+RemoveOnStop=yes
+
+[Install]
+WantedBy=sockets.target
diff --git a/fuse4fs/fuse4fs@.service.in b/fuse4fs/fuse4fs@.service.in
new file mode 100644
index 00000000000000..38434c383c7be3
--- /dev/null
+++ b/fuse4fs/fuse4fs@.service.in
@@ -0,0 +1,102 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (C) 2025-2026 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+[Unit]
+Description=ext4 Service
+
+# Don't leave failed units behind; systemd does not clean them up!
+CollectMode=inactive-or-failed
+
+[Service]
+Type=exec
+ExecStart=@bindir@/fuse4fs -o kernel
+
+# Try to capture core dumps
+LimitCORE=infinity
+
+SyslogIdentifier=%N
+
+# No realtime CPU scheduling
+RestrictRealtime=true
+
+# Don't let us see anything in the regular system, and don't run as root
+DynamicUser=true
+ProtectSystem=strict
+ProtectHome=true
+PrivateTmp=true
+PrivateDevices=true
+PrivateUsers=true
+
+# No network access
+PrivateNetwork=true
+ProtectHostname=true
+RestrictAddressFamilies=none
+IPAddressDeny=any
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+RestrictFileSystems=
+
+# Hide everything in /proc, even /proc/mounts
+ProcSubset=pid
+
+# Lock the process to the default Linux personality
+LockPersonality=true
+
+# No memory pages that are both writable and executable
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+
+SystemCallFilter=~@clock
+SystemCallFilter=~@cpu-emulation
+SystemCallFilter=~@debug
+SystemCallFilter=~@module
+SystemCallFilter=~@reboot
+SystemCallFilter=~@swap
+
+SystemCallFilter=~@mount
+
+# libfuse io_uring wants to pin cores and memory
+SystemCallFilter=mbind
+SystemCallFilter=sched_setaffinity
+
+# Leave a breadcrumb if we get whacked by the system call filter
+SystemCallErrorNumber=EL3RST
+
+# Log to the kernel dmesg, just like an in-kernel ext4 driver
+StandardOutput=append:/dev/ttyprintk
+StandardError=append:/dev/ttyprintk
+
+# Run with no capabilities at all
+CapabilityBoundingSet=
+AmbientCapabilities=
+NoNewPrivileges=true
+
+# fuse4fs doesn't create files
+UMask=7777
+
+# No access to hardware /dev files at all
+ProtectClock=true
+DevicePolicy=closed
+
+# Don't mess with set[ug]id anything.
+RestrictSUIDSGID=true
+
+# Don't let OOM kills of processes in this containment group kill the whole
+# service, because we don't want filesystem drivers to go down.
+OOMPolicy=continue
+OOMScoreAdjust=-1000
diff --git a/lib/config.h.in b/lib/config.h.in
index abba5e2c625b24..15b99c6d28c59e 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -142,6 +142,9 @@
 /* Define to 1 if you have the 'ftruncate64' function. */
 #undef HAVE_FTRUNCATE64
 
+/* Define to 1 if fuse4fs should be built with fuse service support */
+#undef HAVE_FUSE4FS_SERVICE
+
 /* Define to 1 if fuse supports cache_readdir */
 #undef HAVE_FUSE_CACHE_READDIR
 
@@ -151,6 +154,9 @@
 /* Define to 1 if fuse supports lowlevel API */
 #undef HAVE_FUSE_LOWLEVEL
 
+/* Define to 1 if fuse supports service */
+#undef HAVE_FUSE_SERVICE
+
 /* Define to 1 if you have the 'futimes' function. */
 #undef HAVE_FUTIMES
 
diff --git a/util/subst.conf.in b/util/subst.conf.in
index 5af5e356d46ac7..3d0ec5cc39eabd 100644
--- a/util/subst.conf.in
+++ b/util/subst.conf.in
@@ -24,3 +24,6 @@ root_bindir		@root_bindir@
 libdir			@libdir@
 $exec_prefix		@exec_prefix@
 pkglibexecdir		@libexecdir@/e2fsprogs
+bindir			@bindir@
+fuse_service_socket_dir	@fuse_service_socket_dir@
+fuse_service_socket_perms	@fuse_service_socket_perms@


* [PATCH 08/10] fuse4fs: set proc title when in fuse service mode
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
                     ` (6 preceding siblings ...)
  2026-06-25 19:38   ` [PATCH 07/10] fuse4fs: enable safe service mode Darrick J. Wong
@ 2026-06-25 19:38   ` Darrick J. Wong
  2026-06-25 19:39   ` [PATCH 09/10] fuse4fs: make MMP work correctly in safe " Darrick J. Wong
  2026-06-25 19:39   ` [PATCH 10/10] debian: update packaging for fuse4fs service Darrick J. Wong
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:38 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

When in fuse service mode, set the process title so that we can identify
fuse servers by mount arguments.  When the service ends, amend the title
again to say that we're cleaning up.  This is done to make ps aux a bit
more communicative as to what is going on.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure           |  109 +++++++++++++++++++++++++++++++++++++++++++++++++++
 configure.ac        |   13 ++++++
 fuse4fs/Makefile.in |    2 -
 fuse4fs/fuse4fs.c   |   47 ++++++++++++++++++++++
 lib/config.h.in     |    6 +++
 5 files changed, 176 insertions(+), 1 deletion(-)
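
For readers unfamiliar with the approach: the title is built from the
argument list and handed to libbsd's setproctitle().  The sketch below shows
one way such a command line string could be assembled.  join_args() is a
hypothetical stand-in written for illustration only; it is not
fuse_service_cmdline() and makes no claim about how libfuse formats its
string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: join argv[0..argc-1] with single spaces into a newly
 * allocated string.  The caller frees the result.  Returns NULL if the
 * allocation fails.
 */
static char *join_args(int argc, const char *const *argv)
{
	size_t len = 1;	/* room for the trailing NUL */
	char *buf, *p;
	int i;

	assert(argc >= 0 && (argc == 0 || argv != NULL));

	for (i = 0; i < argc; i++)
		len += strlen(argv[i]) + 1;	/* argument plus separator */

	buf = malloc(len);
	if (!buf)
		return NULL;

	p = buf;
	for (i = 0; i < argc; i++) {
		size_t n = strlen(argv[i]);

		if (i > 0)
			*p++ = ' ';
		memcpy(p, argv[i], n);
		p += n;
	}
	*p = '\0';
	return buf;
}
```

With libbsd available, the result could be passed as setproctitle("-%s", s);
a format string that begins with '-' tells libbsd not to prepend the
executable name.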


diff --git a/configure b/configure
index 87960ad2cae3c3..b0531eb58b2b64 100755
--- a/configure
+++ b/configure
@@ -696,6 +696,7 @@ gcc_ranlib
 gcc_ar
 UNI_DIFF_OPTS
 SEM_INIT_LIB
+LIBBSD_LIB
 FUSE4FS_CMT
 FUSE2FS_CMT
 fuse_service_socket_perms
@@ -15022,6 +15023,114 @@ printf "%s\n" "#define HAVE_FUSE_CACHE_READDIR 1" >>confdefs.h
 
 fi
 
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for setproctitle in -lbsd" >&5
+printf %s "checking for setproctitle in -lbsd... " >&6; }
+if test ${ac_cv_lib_bsd_setproctitle+y}
+then :
+  printf %s "(cached) " >&6
+else case e in #(
+  e) ac_check_lib_save_LIBS=$LIBS
+LIBS="-lbsd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.
+   The 'extern "C"' is for builds by C++ compilers;
+   although this is not generally supported in C code supporting it here
+   has little cost and some practical benefit (sr 110532).  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char setproctitle (void);
+int
+main (void)
+{
+return setproctitle ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+  ac_cv_lib_bsd_setproctitle=yes
+else case e in #(
+  e) ac_cv_lib_bsd_setproctitle=no ;;
+esac
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS ;;
+esac
+fi
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_bsd_setproctitle" >&5
+printf "%s\n" "$ac_cv_lib_bsd_setproctitle" >&6; }
+if test "x$ac_cv_lib_bsd_setproctitle" = xyes
+then :
+  LIBBSD_LIB=-lbsd
+fi
+
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for setproctitle_init in -lbsd" >&5
+printf %s "checking for setproctitle_init in -lbsd... " >&6; }
+if test ${ac_cv_lib_bsd_setproctitle_init+y}
+then :
+  printf %s "(cached) " >&6
+else case e in #(
+  e) ac_check_lib_save_LIBS=$LIBS
+LIBS="-lbsd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.
+   The 'extern "C"' is for builds by C++ compilers;
+   although this is not generally supported in C code supporting it here
+   has little cost and some practical benefit (sr 110532).  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char setproctitle_init (void);
+int
+main (void)
+{
+return setproctitle_init ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+  ac_cv_lib_bsd_setproctitle_init=yes
+else case e in #(
+  e) ac_cv_lib_bsd_setproctitle_init=no ;;
+esac
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS ;;
+esac
+fi
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_bsd_setproctitle_init" >&5
+printf "%s\n" "$ac_cv_lib_bsd_setproctitle_init" >&6; }
+if test "x$ac_cv_lib_bsd_setproctitle_init" = xyes
+then :
+  LIBBSD_LIB=-lbsd
+fi
+
+
+if test "$ac_cv_lib_bsd_setproctitle" = yes ; then
+
+printf "%s\n" "#define HAVE_SETPROCTITLE 1" >>confdefs.h
+
+fi
+if test "$ac_cv_lib_bsd_setproctitle_init" = yes ; then
+
+printf "%s\n" "#define HAVE_SETPROCTITLE_INIT 1" >>confdefs.h
+
+fi
+
 { printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for PR_SET_IO_FLUSHER" >&5
 printf %s "checking for PR_SET_IO_FLUSHER... " >&6; }
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
diff --git a/configure.ac b/configure.ac
index 381bb15d920a0f..8a5e95cd4eb866 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1622,6 +1622,19 @@ then
 		  [Define to 1 if fuse supports cache_readdir])
 fi
 
+dnl
+dnl see if setproctitle exists
+dnl
+AC_CHECK_LIB(bsd, setproctitle, [LIBBSD_LIB=-lbsd])
+AC_CHECK_LIB(bsd, setproctitle_init, [LIBBSD_LIB=-lbsd])
+AC_SUBST(LIBBSD_LIB)
+if test "$ac_cv_lib_bsd_setproctitle" = yes ; then
+	AC_DEFINE(HAVE_SETPROCTITLE, 1, [Define to 1 if setproctitle present in libbsd])
+fi
+if test "$ac_cv_lib_bsd_setproctitle_init" = yes ; then
+	AC_DEFINE(HAVE_SETPROCTITLE_INIT, 1, [Define to 1 if setproctitle_init present in libbsd])
+fi
+
 dnl
 dnl see if PR_SET_IO_FLUSHER exists
 dnl
diff --git a/fuse4fs/Makefile.in b/fuse4fs/Makefile.in
index 67b8afd54493b0..bb859369914a36 100644
--- a/fuse4fs/Makefile.in
+++ b/fuse4fs/Makefile.in
@@ -76,7 +76,7 @@ fuse4fs: $(FUSE4FS_OBJS) $(DEPLIBS) $(DEPLIBBLKID) $(DEPLIBUUID) \
 	$(E) "	LD $@"
 	$(Q) $(CC) $(ALL_LDFLAGS) -o fuse4fs $(FUSE4FS_OBJS) $(LIBS) \
 		$(LIBFUSE) $(LIBBLKID) $(LIBUUID) $(LIBEXT2FS) $(LIBINTL) \
-		$(CLOCK_GETTIME_LIB) $(SYSLIBS) $(LIBS_E2P)
+		$(CLOCK_GETTIME_LIB) $(SYSLIBS) $(LIBS_E2P) @LIBBSD_LIB@
 
 %.socket: %.socket.in $(DEP_SUBSTITUTE)
 	$(E) "	SUBST $@"
diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 97e668fadc2398..5fa51569a1167f 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -45,6 +45,9 @@
 #ifdef HAVE_FUSE4FS_SERVICE
 # include <sys/mount.h>
 # include <fuse_service.h>
+# ifdef HAVE_SETPROCTITLE
+#  include <bsd/unistd.h>
+# endif
 #endif
 #ifdef __SET_FOB_FOR_FUSE
 # undef _FILE_OFFSET_BITS
@@ -295,6 +298,9 @@ struct fuse4fs {
 	struct cache inodes;
 #ifdef HAVE_FUSE4FS_SERVICE
 	struct fuse_service *service;
+# ifdef HAVE_SETPROCTITLE
+	char *svc_cmdline;
+# endif
 	int bdev_fd;
 #endif
 };
@@ -1291,6 +1297,35 @@ static errcode_t fuse4fs_check_support(struct fuse4fs *ff)
 	return 0;
 }
 
+#if defined(HAVE_FUSE4FS_SERVICE) && defined(HAVE_SETPROCTITLE)
+static void fuse4fs_service_set_proc_cmdline(struct fuse4fs *ff, int argc,
+					     char *argv[],
+					     struct fuse_args *args)
+{
+#ifdef HAVE_SETPROCTITLE_INIT
+	setproctitle_init(argc, argv, environ);
+#endif
+
+	ff->svc_cmdline = fuse_service_cmdline(argc, (const char * const *)argv, args);
+	if (!ff->svc_cmdline)
+		return;
+
+	setproctitle("-%s", ff->svc_cmdline);
+}
+
+static void fuse4fs_service_finish_proc_cmdline(struct fuse4fs *ff)
+{
+	if (!ff->svc_cmdline)
+		return;
+
+	setproctitle("-%s [cleaning up]", ff->svc_cmdline);
+	free(ff->svc_cmdline);
+}
+#else
+# define fuse4fs_service_set_proc_cmdline(...)		((void)0)
+# define fuse4fs_service_finish_proc_cmdline(...)	((void)0)
+#endif
+
 #ifdef HAVE_FUSE4FS_SERVICE
 static int fuse4fs_service_connect(struct fuse4fs *ff, struct fuse_args *args)
 {
@@ -1324,6 +1359,8 @@ static int fuse4fs_service_exit(struct fuse4fs *ff, int exitcode)
 	if (!fuse4fs_is_service(ff))
 		return exitcode;
 
+	fuse4fs_service_finish_proc_cmdline(ff);
+
 	fuse_service_send_goodbye(ff->service, exitcode);
 	fuse_service_release(ff->service);
 	close(ff->bdev_fd);
@@ -6395,6 +6432,16 @@ int main(int argc, char *argv[])
 		goto out_exit;
 	}
 
+	/*
+	 * For fuse services, make the /proc title include the arguments that
+	 * we got from the mount helper.  Do this before parsing argc/argv
+	 * because that may overwrite the argv area.  Note that the procfs
+	 * listing might not reflect the options that actually get enabled,
+	 * just like regular fuse4fs.
+	 */
+	if (fuse4fs_is_service(&fctx))
+		fuse4fs_service_set_proc_cmdline(&fctx, argc, argv, &args);
+
 	ret = fuse_opt_parse(&args, &fctx, fuse4fs_opts, fuse4fs_opt_proc);
 	if (ret)
 		goto out_exit;
diff --git a/lib/config.h.in b/lib/config.h.in
index 15b99c6d28c59e..0973413b5c11e2 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -379,6 +379,12 @@
 /* Define to 1 if you have the 'setmntent' function. */
 #undef HAVE_SETMNTENT
 
+/* Define to 1 if setproctitle present in libbsd */
+#undef HAVE_SETPROCTITLE
+
+/* Define to 1 if setproctitle_init present in libbsd */
+#undef HAVE_SETPROCTITLE_INIT
+
 /* Define to 1 if you have the 'setresgid' function. */
 #undef HAVE_SETRESGID
 


* [PATCH 09/10] fuse4fs: make MMP work correctly in safe service mode
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
                     ` (7 preceding siblings ...)
  2026-06-25 19:38   ` [PATCH 08/10] fuse4fs: set proc title when in fuse " Darrick J. Wong
@ 2026-06-25 19:39   ` Darrick J. Wong
  2026-06-25 19:39   ` [PATCH 10/10] debian: update packaging for fuse4fs service Darrick J. Wong
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:39 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Normally, the libext2fs MMP code open()s a completely separate file
descriptor to read and write the MMP block so that it can have its own
private open file with its own access mode and file position.  However,
if the unixfd IO manager is in use, it will reuse the io channel, which
means that MMP and the unixfd share the same open file and hence the
access mode and file position.

MMP requires directio access to block devices so that changes are
immediately visible on other nodes.  Therefore, we need the IO channel
(and thus the filesystem) to be running in directio mode if MMP is in
use.

To make this work correctly with the sole unixfd IO manager user
(fuse4fs in unprivileged service mode), we must set O_DIRECT on the
bdev fd and mount the filesystem in directio mode.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c |   51 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 3 deletions(-)
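
As background for the fcntl() dance below, here is a minimal sketch of
turning on a file status flag for an fd that is already open.  The helper
name fd_add_status_flag() is hypothetical and does not exist in fuse4fs.
The sketch checks the F_GETFL result before testing any bits, because a
failed call returns -1, which has every bit set.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: make sure @flag is set in the file status flags of
 * @fd.  Returns 0 on success (including when the flag was already set) or a
 * negative errno on failure.
 */
static int fd_add_status_flag(int fd, int flag)
{
	int fl;

	assert(flag >= 0);

	fl = fcntl(fd, F_GETFL);
	if (fl < 0)
		return -errno;

	/* Nothing to do if every requested bit is already present. */
	if ((fl & flag) == flag)
		return 0;

	if (fcntl(fd, F_SETFL, fl | flag) < 0)
		return -errno;

	return 0;
}
```

For the MMP case this would be called with O_DIRECT (which needs
_GNU_SOURCE on glibc); whether the kernel accepts that flag depends on the
file and the filesystem behind the fd.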


diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 5fa51569a1167f..fdd4327a4c0907 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -1423,12 +1423,57 @@ static int fuse4fs_service_get_config(struct fuse4fs *ff)
 }
 
 static errcode_t fuse4fs_service_openfs(struct fuse4fs *ff, char *options,
-					int flags)
+					int *flags)
 {
+	struct stat statbuf;
 	char path[64];
+	errcode_t retval;
+	int ret;
 
+	ret = fstat(ff->bdev_fd, &statbuf);
+	if (ret)
+		return errno;
+
+	/*
+	 * Open the filesystem with SKIP_MMP so that we can find out if the
+	 * filesystem actually has MMP.
+	 */
 	snprintf(path, sizeof(path), "/dev/fd/%d", ff->bdev_fd);
-	return ext2fs_open2(path, options, flags, 0, 0, unixfd_io_manager,
+	retval = ext2fs_open2(path, options, *flags | EXT2_FLAG_SKIP_MMP, 0, 0,
+			      unixfd_io_manager, &ff->fs);
+	if (retval)
+		return retval;
+
+	/*
+	 * If the fs doesn't have MMP then we're good to go.  Otherwise close
+	 * the filesystem so that we can reopen it with MMP enabled.
+	 */
+	if (!ext2fs_has_feature_mmp(ff->fs->super))
+		return 0;
+
+	retval = ext2fs_close_free(&ff->fs);
+	if (retval)
+		return retval;
+
+	/*
+	 * If the filesystem is not on a regular file, MMP will share the same
+	 * fd as the unixfd IO channel.  We need to set O_DIRECT on the bdev_fd
+	 * and open the filesystem in directio mode.
+	 */
+	if (!S_ISREG(statbuf.st_mode)) {
+		int fflags = fcntl(ff->bdev_fd, F_GETFL);
+
+		if (!(fflags & O_DIRECT)) {
+			ret = fcntl(ff->bdev_fd, F_SETFL, fflags | O_DIRECT);
+			if (ret)
+				return EXT2_ET_MMP_OPEN_DIRECT;
+		}
+
+		ff->directio = 1;
+		*flags |= EXT2_FLAG_DIRECT_IO;
+	}
+
+	return ext2fs_open2(path, options, *flags, 0, 0, unixfd_io_manager,
 			    &ff->fs);
 }
 #else
@@ -1568,7 +1613,7 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff)
 	deadline = init_deadline(FUSE4FS_OPEN_TIMEOUT);
 	do {
 		if (fuse4fs_is_service(ff))
-			err = fuse4fs_service_openfs(ff, options, flags);
+			err = fuse4fs_service_openfs(ff, options, &flags);
 		else
 			err = ext2fs_open2(ff->device, options, flags, 0, 0,
 					   unix_io_manager, &ff->fs);


* [PATCH 10/10] debian: update packaging for fuse4fs service
  2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
                     ` (8 preceding siblings ...)
  2026-06-25 19:39   ` [PATCH 09/10] fuse4fs: make MMP work correctly in safe " Darrick J. Wong
@ 2026-06-25 19:39   ` Darrick J. Wong
  9 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:39 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Update the Debian packaging code so that we can create fuse4fs service
containers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 debian/e2fsprogs.install |    6 +++++-
 debian/fuse4fs.install   |    3 +++
 debian/rules             |    3 +++
 3 files changed, 11 insertions(+), 1 deletion(-)
 mode change 100644 => 100755 debian/fuse4fs.install


diff --git a/debian/e2fsprogs.install b/debian/e2fsprogs.install
index 17a80e3922dcee..808474bcab1717 100755
--- a/debian/e2fsprogs.install
+++ b/debian/e2fsprogs.install
@@ -50,4 +50,8 @@ usr/share/man/man8/resize2fs.8
 usr/share/man/man8/tune2fs.8
 etc
 [linux-any] ${deb_udevudevdir}/rules.d
-[linux-any] ${deb_systemdsystemunitdir}
+[linux-any] ${deb_systemdsystemunitdir}/e2scrub@.service
+[linux-any] ${deb_systemdsystemunitdir}/e2scrub_all.service
+[linux-any] ${deb_systemdsystemunitdir}/e2scrub_all.timer
+[linux-any] ${deb_systemdsystemunitdir}/e2scrub_fail@.service
+[linux-any] ${deb_systemdsystemunitdir}/e2scrub_reap.service
diff --git a/debian/fuse4fs.install b/debian/fuse4fs.install
old mode 100644
new mode 100755
index 17bdc90e33cb67..56048136c2b28b
--- a/debian/fuse4fs.install
+++ b/debian/fuse4fs.install
@@ -1,2 +1,5 @@
+#!/usr/bin/dh-exec
 usr/bin/fuse4fs
 usr/share/man/man1/fuse4fs.1
+[linux-any] ${deb_systemdsystemunitdir}/fuse4fs.socket
+[linux-any] ${deb_systemdsystemunitdir}/fuse4fs@.service
diff --git a/debian/rules b/debian/rules
index b680eb33ceac9e..d629e9d6915cfe 100755
--- a/debian/rules
+++ b/debian/rules
@@ -173,6 +173,9 @@ override_dh_installinfo:
 ifneq ($(DEB_HOST_ARCH_OS), hurd)
 override_dh_installsystemd:
 	dh_installsystemd -p e2fsprogs --no-restart-after-upgrade --no-stop-on-upgrade e2scrub_all.timer e2scrub_reap.service
+ifeq ($(SKIP_FUSE4FS),)
+	dh_installsystemd -p fuse4fs fuse4fs.socket
+endif
 endif
 
 override_dh_makeshlibs:


* [PATCH 1/6] libsupport: add caching IO manager
  2026-06-25 19:35 ` [PATCHSET v6 3/4] fuse2fs: improve block and inode caching Darrick J. Wong
@ 2026-06-25 19:39   ` Darrick J. Wong
  2026-06-25 19:39   ` [PATCH 2/6] iocache: add the actual buffer cache Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:39 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Start creating a caching IO manager so that we can have better caching
of metadata blocks in fuse2fs.  For now it's just a passthrough cache.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/support/iocache.h   |   17 +++
 lib/ext2fs/io_manager.c |    3 
 lib/support/Makefile.in |    6 +
 lib/support/iocache.c   |  304 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 329 insertions(+), 1 deletion(-)
 create mode 100644 lib/support/iocache.h
 create mode 100644 lib/support/iocache.c
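
To illustrate what "passthrough" means here: the wrapper exposes the same
operations as the backing manager and forwards each call unchanged, which
leaves a single place to add caching later.  The sketch below uses a
deliberately simplified, hypothetical interface (struct blk_ops, struct
passthru, and the toy fill_ops backend); none of these names are libext2fs
types, and the real io_manager has many more methods.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for an IO manager's method table. */
struct blk_ops {
	int (*read_blk)(void *priv, unsigned long blk, void *buf, size_t len);
};

/* Wrapper state: remembers the backing ops and counts forwarded reads. */
struct passthru {
	const struct blk_ops *backing_ops;
	void *backing_priv;
	unsigned long nr_reads;
};

/* Forward the read to the backing ops without touching the data. */
static int passthru_read_blk(void *priv, unsigned long blk, void *buf,
			     size_t len)
{
	struct passthru *pt = priv;

	assert(pt && pt->backing_ops && pt->backing_ops->read_blk);

	pt->nr_reads++;
	return pt->backing_ops->read_blk(pt->backing_priv, blk, buf, len);
}

static const struct blk_ops passthru_ops = {
	.read_blk = passthru_read_blk,
};

/* Toy backend: fill the buffer with the low byte of the block number. */
static int fill_read_blk(void *priv, unsigned long blk, void *buf, size_t len)
{
	(void)priv;
	memset(buf, (int)(blk & 0xff), len);
	return 0;
}

static const struct blk_ops fill_ops = {
	.read_blk = fill_read_blk,
};
```

A caller sets backing_ops to &fill_ops and then issues reads through
passthru_ops; a later caching layer would consult its cache inside
passthru_read_blk() before forwarding.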


diff --git a/lib/support/iocache.h b/lib/support/iocache.h
new file mode 100644
index 00000000000000..502eede08aadc5
--- /dev/null
+++ b/lib/support/iocache.h
@@ -0,0 +1,17 @@
+/*
+ * iocache.h - IO cache
+ *
+ * Copyright (C) 2025-2026 Oracle.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License.
+ * %End-Header%
+ */
+#ifndef __IOCACHE_H__
+#define __IOCACHE_H__
+
+errcode_t iocache_set_backing_manager(io_manager manager);
+extern io_manager iocache_io_manager;
+
+#endif /* __IOCACHE_H__ */
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index dff3d73552827f..57beb0368c2a8d 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -16,9 +16,12 @@
 #if HAVE_SYS_TYPES_H
 #include <sys/types.h>
 #endif
+#include <stdbool.h>
 
 #include "ext2_fs.h"
 #include "ext2fs.h"
+#include "support/list.h"
+#include "support/cache.h"
 
 errcode_t io_channel_set_options(io_channel channel, const char *opts)
 {
diff --git a/lib/support/Makefile.in b/lib/support/Makefile.in
index d20d6a984b7679..22242758b4e618 100644
--- a/lib/support/Makefile.in
+++ b/lib/support/Makefile.in
@@ -15,6 +15,7 @@ all::
 
 OBJS=		bthread.o \
 		cstring.o \
+		iocache.o \
 		mkquota.o \
 		plausible.o \
 		profile.o \
@@ -46,7 +47,8 @@ SRCS=		$(srcdir)/argv_parse.c \
 		$(srcdir)/thread.c \
 		$(srcdir)/dict.c \
 		$(srcdir)/devname.c \
-		$(srcdir)/cache.c
+		$(srcdir)/cache.c \
+		$(srcdir)/iocache.c
 
 LIBRARY= libsupport
 LIBDIR= support
@@ -200,3 +202,5 @@ devname.o: $(srcdir)/devname.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(srcdir)/devname.h $(srcdir)/nls-enable.h
 cache.o: $(srcdir)/cache.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(srcdir)/list.h $(srcdir)/cache.h
+iocache.o: $(srcdir)/iocache.c $(top_builddir)/lib/config.h \
+ $(srcdir)/iocache.h $(srcdir)/cache.h $(srcdir)/list.h
diff --git a/lib/support/iocache.c b/lib/support/iocache.c
new file mode 100644
index 00000000000000..2148a9d93a4285
--- /dev/null
+++ b/lib/support/iocache.c
@@ -0,0 +1,304 @@
+/*
+ * iocache.c - caching IO manager
+ *
+ * Copyright (C) 2025-2026 Oracle.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License.
+ * %End-Header%
+ */
+#include "config.h"
+#include "ext2fs/ext2_fs.h"
+#include "ext2fs/ext2fs.h"
+#include "ext2fs/ext2fsP.h"
+#include "support/iocache.h"
+
+#define IOCACHE_IO_CHANNEL_MAGIC	0x424F5254	/* BORT */
+
+static io_manager iocache_backing_manager;
+
+struct iocache_private_data {
+	int			magic;
+	io_channel		real;
+};
+
+static struct iocache_private_data *IOCACHE(io_channel channel)
+{
+	return (struct iocache_private_data *)channel->private_data;
+}
+
+static errcode_t iocache_read_error(io_channel channel, unsigned long block,
+				    int count, void *data, size_t size,
+				    int actual_bytes_read, errcode_t error)
+{
+	io_channel iocache_channel = channel->app_data;
+
+	return iocache_channel->read_error(iocache_channel, block, count, data,
+					   size, actual_bytes_read, error);
+}
+
+static errcode_t iocache_write_error(io_channel channel, unsigned long block,
+				     int count, const void *data, size_t size,
+				     int actual_bytes_written,
+				     errcode_t error)
+{
+	io_channel iocache_channel = channel->app_data;
+
+	return iocache_channel->write_error(iocache_channel, block, count, data,
+					    size, actual_bytes_written, error);
+}
+
+static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
+{
+	io_channel	io = NULL;
+	io_channel	real;
+	struct iocache_private_data *data = NULL;
+	errcode_t	retval;
+
+	if (!name)
+		return EXT2_ET_BAD_DEVICE_NAME;
+	if (!iocache_backing_manager)
+		return EXT2_ET_INVALID_ARGUMENT;
+
+	retval = iocache_backing_manager->open(name, flags, &real);
+	if (retval)
+		return retval;
+
+	retval = ext2fs_get_mem(sizeof(struct struct_io_channel), &io);
+	if (retval)
+		goto out_backing;
+	memset(io, 0, sizeof(struct struct_io_channel));
+	io->magic = EXT2_ET_MAGIC_IO_CHANNEL;
+
+	retval = ext2fs_get_mem(sizeof(struct iocache_private_data), &data);
+	if (retval)
+		goto out_channel;
+	memset(data, 0, sizeof(struct iocache_private_data));
+	data->magic = IOCACHE_IO_CHANNEL_MAGIC;
+
+	io->manager = iocache_io_manager;
+	retval = ext2fs_get_mem(strlen(name) + 1, &io->name);
+	if (retval)
+		goto out_data;
+
+	strcpy(io->name, name);
+	io->private_data = data;
+	io->block_size = real->block_size;
+	io->read_error = 0;
+	io->write_error = 0;
+	io->refcount = 1;
+	io->flags = real->flags;
+	data->real = real;
+	real->app_data = io;
+	real->read_error = iocache_read_error;
+	real->write_error = iocache_write_error;
+
+	*channel = io;
+	return 0;
+
+out_data:
+	ext2fs_free_mem(&data);
+out_channel:
+	ext2fs_free_mem(&io);
+out_backing:
+	io_channel_close(real);
+	return retval;
+}
+
+static errcode_t iocache_close(io_channel channel)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+	errcode_t	retval = 0;
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	if (--channel->refcount > 0)
+		return 0;
+	if (data->real)
+		retval = io_channel_close(data->real);
+	ext2fs_free_mem(&channel->private_data);
+	if (channel->name)
+		ext2fs_free_mem(&channel->name);
+	ext2fs_free_mem(&channel);
+
+	return retval;
+}
+
+static errcode_t iocache_set_blksize(io_channel channel, int blksize)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+	errcode_t retval;
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	retval = io_channel_set_blksize(data->real, blksize);
+	if (retval)
+		return retval;
+
+	channel->block_size = data->real->block_size;
+	return 0;
+}
+
+static errcode_t iocache_flush(io_channel channel)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return io_channel_flush(data->real);
+}
+
+static errcode_t iocache_write_byte(io_channel channel, unsigned long offset,
+				    int count, const void *buf)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return io_channel_write_byte(data->real, offset, count, buf);
+}
+
+static errcode_t iocache_set_option(io_channel channel, const char *option,
+				    const char *arg)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return data->real->manager->set_option(data->real, option, arg);
+}
+
+static errcode_t iocache_get_stats(io_channel channel, io_stats *io_stats)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return data->real->manager->get_stats(data->real, io_stats);
+}
+
+static errcode_t iocache_read_blk64(io_channel channel,
+				    unsigned long long block, int count,
+				    void *buf)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return io_channel_read_blk64(data->real, block, count, buf);
+}
+
+static errcode_t iocache_write_blk64(io_channel channel,
+				     unsigned long long block, int count,
+				     const void *buf)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return io_channel_write_blk64(data->real, block, count, buf);
+}
+
+static errcode_t iocache_read_blk(io_channel channel, unsigned long block,
+				  int count, void *buf)
+{
+	return iocache_read_blk64(channel, block, count, buf);
+}
+
+static errcode_t iocache_write_blk(io_channel channel, unsigned long block,
+				   int count, const void *buf)
+{
+	return iocache_write_blk64(channel, block, count, buf);
+}
+
+static errcode_t iocache_discard(io_channel channel, unsigned long long block,
+				 unsigned long long count)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return io_channel_discard(data->real, block, count);
+}
+
+static errcode_t iocache_cache_readahead(io_channel channel,
+					 unsigned long long block,
+					 unsigned long long count)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return io_channel_cache_readahead(data->real, block, count);
+}
+
+static errcode_t iocache_zeroout(io_channel channel, unsigned long long block,
+				 unsigned long long count)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return io_channel_zeroout(data->real, block, count);
+}
+
+static errcode_t iocache_get_fd(io_channel channel, int *fd)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return io_channel_get_fd(data->real, fd);
+}
+
+static errcode_t iocache_flock(io_channel channel, unsigned int flock_flags)
+{
+	struct iocache_private_data *data = IOCACHE(channel);
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+	return io_channel_flock(data->real, flock_flags);
+}
+
+static struct struct_io_manager struct_iocache_manager = {
+	.magic			= EXT2_ET_MAGIC_IO_MANAGER,
+	.name			= "iocache I/O manager",
+	.open			= iocache_open,
+	.close			= iocache_close,
+	.set_blksize		= iocache_set_blksize,
+	.read_blk		= iocache_read_blk,
+	.write_blk		= iocache_write_blk,
+	.flush			= iocache_flush,
+	.write_byte		= iocache_write_byte,
+	.set_option		= iocache_set_option,
+	.get_stats		= iocache_get_stats,
+	.read_blk64		= iocache_read_blk64,
+	.write_blk64		= iocache_write_blk64,
+	.discard		= iocache_discard,
+	.cache_readahead	= iocache_cache_readahead,
+	.zeroout		= iocache_zeroout,
+	.get_fd			= iocache_get_fd,
+	.flock			= iocache_flock,
+};
+
+io_manager iocache_io_manager = &struct_iocache_manager;
+
+errcode_t iocache_set_backing_manager(io_manager manager)
+{
+	iocache_backing_manager = manager;
+	return 0;
+}



* [PATCH 2/6] iocache: add the actual buffer cache
  2026-06-25 19:35 ` [PATCHSET v6 3/4] fuse2fs: improve block and inode caching Darrick J. Wong
  2026-06-25 19:39   ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
@ 2026-06-25 19:39   ` Darrick J. Wong
  2026-06-25 19:40   ` [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:39 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Wire up buffer caching into our new caching IO manager.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
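For readers who have not met the uptodate/dirty scheme before, the
read-fill, deferred-write and flush behaviour that this patch wires up
can be modelled in a few lines.  This is only a sketch under loud
assumptions: every toy_* name is hypothetical, the "disk" is a plain
array, and the cache is direct-mapped, whereas the patch uses the hashed
struct cache from lib/support with per-node locking and reclaim.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK_SIZE	16
#define TOY_NR_BLOCKS	8

/* Hypothetical stand-in for the backing io channel: a flat array. */
static unsigned char toy_disk[TOY_NR_BLOCKS][TOY_BLOCK_SIZE];
static unsigned int toy_disk_reads, toy_disk_writes;

/* One cached block; uptodate and dirty mirror the iocache_buf flags. */
struct toy_buf {
	unsigned char	data[TOY_BLOCK_SIZE];
	bool		uptodate;
	bool		dirty;
};

/* Direct-mapped for brevity (assumption); no hashing, no reclaim. */
static struct toy_buf toy_cache[TOY_NR_BLOCKS];

/* Read one block, filling the buffer from "disk" only on a miss. */
static int toy_read(uint64_t block, void *out)
{
	struct toy_buf *b;

	if (block >= TOY_NR_BLOCKS)
		return -1;
	b = &toy_cache[block];
	if (!b->uptodate) {
		memcpy(b->data, toy_disk[block], TOY_BLOCK_SIZE);
		toy_disk_reads++;
		b->uptodate = true;
	}
	memcpy(out, b->data, TOY_BLOCK_SIZE);
	return 0;
}

/* Write one block into the cache only; the disk write is deferred. */
static int toy_write(uint64_t block, const void *in)
{
	struct toy_buf *b;

	if (block >= TOY_NR_BLOCKS)
		return -1;
	b = &toy_cache[block];
	memcpy(b->data, in, TOY_BLOCK_SIZE);
	b->uptodate = true;
	b->dirty = true;
	return 0;
}

/* Write every dirty buffer back; returns how many blocks were written. */
static unsigned int toy_flush(void)
{
	unsigned int i, flushed = 0;

	for (i = 0; i < TOY_NR_BLOCKS; i++) {
		if (!toy_cache[i].dirty)
			continue;
		memcpy(toy_disk[i], toy_cache[i].data, TOY_BLOCK_SIZE);
		toy_disk_writes++;
		toy_cache[i].dirty = false;
		flushed++;
	}
	return flushed;
}
```

A whole-block write marks the buffer uptodate without reading the old
contents first, which is the same shortcut iocache_write_blk64 takes
below.
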
 lib/support/iocache.c |  482 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 460 insertions(+), 22 deletions(-)


diff --git a/lib/support/iocache.c b/lib/support/iocache.c
index 2148a9d93a4285..59b71306f4dd41 100644
--- a/lib/support/iocache.c
+++ b/lib/support/iocache.c
@@ -9,46 +9,287 @@
  * %End-Header%
  */
 #include "config.h"
+#include <assert.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <unistd.h>
+#include <limits.h>
 #include "ext2fs/ext2_fs.h"
 #include "ext2fs/ext2fs.h"
 #include "ext2fs/ext2fsP.h"
 #include "support/iocache.h"
+#include "support/list.h"
+#include "support/cache.h"
 
 #define IOCACHE_IO_CHANNEL_MAGIC	0x424F5254	/* BORT */
 
 static io_manager iocache_backing_manager;
 
+static inline uint64_t B_TO_FSBT(io_channel channel, uint64_t number) {
+	return number / channel->block_size;
+}
+
+static inline uint64_t B_TO_FSB(io_channel channel, uint64_t number) {
+	return (number + channel->block_size - 1) / channel->block_size;
+}
+
 struct iocache_private_data {
 	int			magic;
-	io_channel		real;
+	io_channel		real;		/* lower level io channel */
+	io_channel		channel;	/* cache channel */
+	struct cache		cache;
+	pthread_mutex_t		stats_lock;
+	struct struct_io_stats	io_stats;
+	unsigned long long	write_errors;
 };
 
+#define IOCACHEDATA(cache) \
+	(container_of(cache, struct iocache_private_data, cache))
+
 static struct iocache_private_data *IOCACHE(io_channel channel)
 {
 	return (struct iocache_private_data *)channel->private_data;
 }
 
-static errcode_t iocache_read_error(io_channel channel, unsigned long block,
-				    int count, void *data, size_t size,
-				    int actual_bytes_read, errcode_t error)
+struct iocache_buf {
+	struct cache_node	node;
+	struct list_head	list;
+	blk64_t			block;
+	void			*buf;
+	errcode_t		write_error;
+	unsigned int		uptodate:1;
+	unsigned int		dirty:1;
+};
+
+static inline void iocache_buf_lock(struct iocache_buf *ubuf)
+{
+	pthread_mutex_lock(&ubuf->node.cn_mutex);
+}
+
+static inline void iocache_buf_unlock(struct iocache_buf *ubuf)
+{
+	pthread_mutex_unlock(&ubuf->node.cn_mutex);
+}
+
+struct iocache_key {
+	blk64_t			block;
+};
+
+#define IOKEY(key)	((struct iocache_key *)(key))
+#define IOBUF(node)	(container_of((node), struct iocache_buf, node))
+
+static unsigned int
+iocache_hash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
+{
+	uint64_t	hashval = IOKEY(key)->block;
+	uint64_t	tmp;
+
+	tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
+	return tmp % hashsize;
+}
+
+static int iocache_compare(struct cache_node *node, cache_key_t key)
+{
+	struct iocache_buf *ubuf = IOBUF(node);
+	struct iocache_key *ukey = IOKEY(key);
+
+	if (ubuf->block == ukey->block)
+		return CACHE_HIT;
+
+	return CACHE_MISS;
+}
+
+static struct cache_node *iocache_alloc_node(struct cache *cache,
+					     cache_key_t key)
+{
+	struct iocache_private_data *data = IOCACHEDATA(cache);
+	struct iocache_key *ukey = IOKEY(key);
+	struct iocache_buf *ubuf;
+	errcode_t retval;
+
+	retval = ext2fs_get_mem(sizeof(struct iocache_buf), &ubuf);
+	if (retval)
+		return NULL;
+	memset(ubuf, 0, sizeof(*ubuf));
+
+	retval = io_channel_alloc_buf(data->channel, 0, &ubuf->buf);
+	if (retval) {
+		ext2fs_free_mem(&ubuf);
+		return NULL;
+	}
+	memset(ubuf->buf, 0, data->channel->block_size);
+
+	INIT_LIST_HEAD(&ubuf->list);
+	ubuf->block = ukey->block;
+	return &ubuf->node;
+}
+
+static bool iocache_flush_node(struct cache *cache, struct cache_node *node)
+{
+	struct iocache_private_data *data = IOCACHEDATA(cache);
+	struct iocache_buf *ubuf = IOBUF(node);
+	errcode_t retval;
+
+	if (ubuf->dirty) {
+		retval = io_channel_write_blk64(data->real, ubuf->block, 1,
+						ubuf->buf);
+		if (retval) {
+			ubuf->write_error = retval;
+			data->write_errors++;
+		} else {
+			ubuf->dirty = 0;
+			ubuf->write_error = 0;
+		}
+	}
+
+	return ubuf->dirty;
+}
+
+static void iocache_relse(struct cache *cache, struct cache_node *node)
+{
+	struct iocache_buf *ubuf = IOBUF(node);
+
+	ext2fs_free_mem(&ubuf->buf);
+	ext2fs_free_mem(&ubuf);
+}
+
+static unsigned int iocache_bulkrelse(struct cache *cache,
+				      struct list_head *list)
+{
+	struct cache_node *cn, *n;
+	int count = 0;
+
+	if (list_empty(list))
+		return 0;
+
+	list_for_each_entry_safe(cn, n, list, cn_mru) {
+		iocache_relse(cache, cn);
+		count++;
+	}
+
+	return count;
+}
+
+/* Flush all dirty buffers in the cache to disk. */
+static errcode_t iocache_flush_cache(struct iocache_private_data *data)
+{
+	return cache_flush(&data->cache) ? 0 : EIO;
+}
+
+/* Flush all dirty buffers in this range of the cache to disk. */
+static errcode_t iocache_flush_range(struct iocache_private_data *data,
+				     blk64_t block, uint64_t count)
+{
+	uint64_t i;
+	bool still_dirty = false;
+
+	for (i = 0; i < count; i++) {
+		struct iocache_key ukey = {
+			.block = block + i,
+		};
+		struct cache_node *node;
+
+		cache_node_get(&data->cache, &ukey, CACHE_GET_INCORE,
+			       &node);
+		if (!node)
+			continue;
+
+		/* cache_flush holds cn_mutex across the node flush */
+		pthread_mutex_lock(&node->cn_mutex);
+		still_dirty |= iocache_flush_node(&data->cache, node);
+		pthread_mutex_unlock(&node->cn_mutex);
+
+		cache_node_put(&data->cache, node);
+	}
+
+	return still_dirty ? EIO : 0;
+}
+
+static void iocache_add_list(struct cache *cache, struct cache_node *node,
+			     void *data)
+{
+	struct iocache_buf *ubuf = IOBUF(node);
+	struct list_head *list = data;
+
+	assert(node->cn_count == 0 || node->cn_count == 1);
+
+	iocache_buf_lock(ubuf);
+	cache_node_grab(cache, node);
+	list_add_tail(&ubuf->list, list);
+	iocache_buf_unlock(ubuf);
+}
+
+static void iocache_invalidate_bufs(struct iocache_private_data *data,
+				    struct list_head *list)
+{
+	struct iocache_buf *ubuf, *n;
+
+	list_for_each_entry_safe(ubuf, n, list, list) {
+		struct iocache_key ukey = {
+			.block = ubuf->block,
+		};
+
+		assert(ubuf->node.cn_count == 1);
+
+		iocache_buf_lock(ubuf);
+		ubuf->dirty = 0;
+		list_del_init(&ubuf->list);
+		iocache_buf_unlock(ubuf);
+
+		cache_node_put(&data->cache, &ubuf->node);
+		cache_node_purge(&data->cache, &ukey, &ubuf->node);
+	}
+}
+
+/*
+ * Remove all blocks from the cache.  Dirty contents are discarded.  Buffer
+ * refcounts must be zero!
+ */
+static void iocache_invalidate_cache(struct iocache_private_data *data)
 {
-	io_channel iocache_channel = channel->app_data;
+	LIST_HEAD(list);
 
-	return iocache_channel->read_error(iocache_channel, block, count, data,
-					   size, actual_bytes_read, error);
+	cache_walk(&data->cache, iocache_add_list, &list);
+	iocache_invalidate_bufs(data, &list);
 }
 
-static errcode_t iocache_write_error(io_channel channel, unsigned long block,
-				     int count, const void *data, size_t size,
-				     int actual_bytes_written,
-				     errcode_t error)
+/*
+ * Remove a range of blocks from the cache.  Dirty contents are discarded.
+ * Buffer refcounts must be zero!
+ */
+static void iocache_invalidate_range(struct iocache_private_data *data,
+				     blk64_t block, uint64_t count)
 {
-	io_channel iocache_channel = channel->app_data;
+	LIST_HEAD(list);
+	uint64_t i;
 
-	return iocache_channel->write_error(iocache_channel, block, count, data,
-					    size, actual_bytes_written, error);
+	for (i = 0; i < count; i++) {
+		struct iocache_key ukey = {
+			.block = block + i,
+		};
+		struct cache_node *node;
+
+		cache_node_get(&data->cache, &ukey, CACHE_GET_INCORE,
+			       &node);
+		if (node) {
+			iocache_add_list(&data->cache, node, &list);
+			cache_node_put(&data->cache, node);
+		}
+	}
+	iocache_invalidate_bufs(data, &list);
 }
 
+static const struct cache_operations iocache_ops = {
+	.hash		= iocache_hash,
+	.alloc		= iocache_alloc_node,
+	.flush		= iocache_flush_node,
+	.relse		= iocache_relse,
+	.compare	= iocache_compare,
+	.bulkrelse	= iocache_bulkrelse,
+	.resize		= cache_gradual_resize,
+};
+
 static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
 {
 	io_channel	io = NULL;
@@ -65,6 +306,9 @@ static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
 	if (retval)
 		return retval;
 
+	/* disable any static cache in the lower io manager */
+	io_channel_set_options(real, "cache=off");
+
 	retval = ext2fs_get_mem(sizeof(struct struct_io_channel), &io);
 	if (retval)
 		goto out_backing;
@@ -76,12 +320,19 @@ static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
 		goto out_channel;
 	memset(data, 0, sizeof(struct iocache_private_data));
 	data->magic = IOCACHE_IO_CHANNEL_MAGIC;
+	data->io_stats.num_fields = 4;
+	data->channel = io;
 
 	io->manager = iocache_io_manager;
 	retval = ext2fs_get_mem(strlen(name) + 1, &io->name);
 	if (retval)
 		goto out_data;
 
+	retval = cache_init(CACHE_AUTO_SHRINK, 1U << 10, &iocache_ops,
+			    &data->cache);
+	if (retval)
+		goto out_name;
+
 	strcpy(io->name, name);
 	io->private_data = data;
 	io->block_size = real->block_size;
@@ -91,12 +342,14 @@ static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
 	io->flags = real->flags;
 	data->real = real;
 	real->app_data = io;
-	real->read_error = iocache_read_error;
-	real->write_error = iocache_write_error;
+
+	pthread_mutex_init(&data->stats_lock, NULL);
 
 	*channel = io;
 	return 0;
 
+out_name:
+	ext2fs_free_mem(&io->name);
 out_data:
 	ext2fs_free_mem(&data);
 out_channel:
@@ -116,6 +369,10 @@ static errcode_t iocache_close(io_channel channel)
 
 	if (--channel->refcount > 0)
 		return 0;
+	pthread_mutex_destroy(&data->stats_lock);
+	cache_flush(&data->cache);
+	cache_purge(&data->cache);
+	cache_destroy(&data->cache);
 	if (data->real)
 		retval = io_channel_close(data->real);
 	ext2fs_free_mem(&channel->private_data);
@@ -134,6 +391,11 @@ static errcode_t iocache_set_blksize(io_channel channel, int blksize)
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
 
+	retval = iocache_flush_cache(data);
+	if (retval)
+		return retval;
+	iocache_invalidate_cache(data);
+
 	retval = io_channel_set_blksize(data->real, blksize);
 	if (retval)
 		return retval;
@@ -145,21 +407,34 @@ static errcode_t iocache_set_blksize(io_channel channel, int blksize)
 static errcode_t iocache_flush(io_channel channel)
 {
 	struct iocache_private_data *data = IOCACHE(channel);
+	errcode_t retval = 0;
+	errcode_t retval2;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
 
-	return io_channel_flush(data->real);
+	retval = iocache_flush_cache(data);
+	retval2 = io_channel_flush(data->real);
+	if (retval)
+		return retval;
+	return retval2;
 }
 
 static errcode_t iocache_write_byte(io_channel channel, unsigned long offset,
 				    int count, const void *buf)
 {
 	struct iocache_private_data *data = IOCACHE(channel);
+	blk64_t bno = B_TO_FSBT(channel, offset);
+	blk64_t next_bno = B_TO_FSB(channel, offset + count);
+	errcode_t retval;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
 
+	retval = iocache_flush_range(data, bno, next_bno - bno);
+	if (retval)
+		return retval;
+	iocache_invalidate_range(data, bno, next_bno - bno);
 	return io_channel_write_byte(data->real, offset, count, buf);
 }
 
@@ -170,6 +445,31 @@ static errcode_t iocache_set_option(io_channel channel, const char *option,
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+	errcode_t retval;
+
+	/* don't let unix io cache= options leak through */
+	if (!strcmp(option, "cache"))
+		return 0;
+
+	if (!strcmp(option, "cache_blocks")) {
+		long long size;
+
+		if (!arg)
+			return EXT2_ET_INVALID_ARGUMENT;
+
+		errno = 0;
+		size = strtoll(arg, NULL, 0);
+		if (errno || size <= 0 || size > UINT_MAX)
+			return EXT2_ET_INVALID_ARGUMENT;
+
+		cache_set_maxcount(&data->cache, size);
+		return 0;
+	}
+
+	retval = iocache_flush_cache(data);
+	if (retval)
+		return retval;
+	iocache_invalidate_cache(data);
 
 	return data->real->manager->set_option(data->real, option, arg);
 }
@@ -181,31 +481,157 @@ static errcode_t iocache_get_stats(io_channel channel, io_stats *io_stats)
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
 
-	return data->real->manager->get_stats(data->real, io_stats);
+	/*
+	 * Yes, io_stats is a double-pointer, and we let the caller scribble on
+	 * our stats struct WITHOUT LOCKING!
+	 */
+	if (io_stats)
+		*io_stats = &data->io_stats;
+	return 0;
+}
+
+static void iocache_update_stats(struct iocache_private_data *data,
+				 unsigned long long bytes_read,
+				 unsigned long long bytes_written,
+				 int cache_op)
+{
+	pthread_mutex_lock(&data->stats_lock);
+	data->io_stats.bytes_read += bytes_read;
+	data->io_stats.bytes_written += bytes_written;
+	if (cache_op == CACHE_HIT)
+		data->io_stats.cache_hits++;
+	else
+		data->io_stats.cache_misses++;
+	pthread_mutex_unlock(&data->stats_lock);
 }
 
 static errcode_t iocache_read_blk64(io_channel channel,
 				    unsigned long long block, int count,
 				    void *buf)
 {
+	struct iocache_key ukey = {
+		.block = block,
+	};
 	struct iocache_private_data *data = IOCACHE(channel);
+	unsigned long long i;
+	errcode_t retval = 0;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
 
-	return io_channel_read_blk64(data->real, block, count, buf);
+	/*
+	 * If we're doing an odd-sized read, flush out the cache and then do a
+	 * direct read.
+	 */
+	if (count < 0) {
+		uint64_t fsbcount = B_TO_FSB(channel, -count);
+
+		retval = iocache_flush_range(data, block, fsbcount);
+		if (retval)
+			return retval;
+		iocache_invalidate_range(data, block, fsbcount);
+		iocache_update_stats(data, 0, 0, CACHE_MISS);
+		return io_channel_read_blk64(data->real, block, count, buf);
+	}
+
+	for (i = 0; i < count; i++, ukey.block++, buf += channel->block_size) {
+		struct cache_node *node;
+		struct iocache_buf *ubuf;
+
+		cache_node_get(&data->cache, &ukey, 0, &node);
+		if (!node) {
+			/* cannot instantiate cache, just do a direct read */
+			retval = io_channel_read_blk64(data->real, ukey.block,
+						       1, buf);
+			if (retval)
+				return retval;
+			iocache_update_stats(data, channel->block_size, 0,
+					     CACHE_MISS);
+			continue;
+		}
+
+		ubuf = IOBUF(node);
+		iocache_buf_lock(ubuf);
+		if (!ubuf->uptodate) {
+			retval = io_channel_read_blk64(data->real, ukey.block,
+						       1, ubuf->buf);
+			if (!retval) {
+				ubuf->uptodate = 1;
+				iocache_update_stats(data, channel->block_size,
+						     0, CACHE_MISS);
+			}
+		} else {
+			iocache_update_stats(data, channel->block_size, 0,
+					     CACHE_HIT);
+		}
+		if (ubuf->uptodate)
+			memcpy(buf, ubuf->buf, channel->block_size);
+		iocache_buf_unlock(ubuf);
+		cache_node_put(&data->cache, node);
+		if (retval)
+			return retval;
+	}
+
+	return 0;
 }
 
 static errcode_t iocache_write_blk64(io_channel channel,
 				     unsigned long long block, int count,
 				     const void *buf)
 {
+	struct iocache_key ukey = {
+		.block = block,
+	};
 	struct iocache_private_data *data = IOCACHE(channel);
+	unsigned long long i;
+	errcode_t retval;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
 
-	return io_channel_write_blk64(data->real, block, count, buf);
+	/*
+	 * If we're doing an odd-sized write, flush out the cache and then do a
+	 * direct write.
+	 */
+	if (count < 0) {
+		uint64_t fsbcount = B_TO_FSB(channel, -count);
+
+		retval = iocache_flush_range(data, block, fsbcount);
+		if (retval)
+			return retval;
+		iocache_invalidate_range(data, block, fsbcount);
+		iocache_update_stats(data, 0, 0, CACHE_MISS);
+		return io_channel_write_blk64(data->real, block, count, buf);
+	}
+
+	for (i = 0; i < count; i++, ukey.block++, buf += channel->block_size) {
+		struct cache_node *node;
+		struct iocache_buf *ubuf;
+
+		cache_node_get(&data->cache, &ukey, 0, &node);
+		if (!node) {
+			/* cannot instantiate cache, do a direct write */
+			retval = io_channel_write_blk64(data->real, ukey.block,
+							1, buf);
+			if (retval)
+				return retval;
+			iocache_update_stats(data, 0, channel->block_size,
+					     CACHE_MISS);
+			continue;
+		}
+
+		ubuf = IOBUF(node);
+		iocache_buf_lock(ubuf);
+		memcpy(ubuf->buf, buf, channel->block_size);
+		iocache_update_stats(data, 0, channel->block_size,
+				     ubuf->uptodate ? CACHE_HIT : CACHE_MISS);
+		ubuf->dirty = 1;
+		ubuf->uptodate = 1;
+		iocache_buf_unlock(ubuf);
+		cache_node_put(&data->cache, node);
+	}
+
+	return 0;
 }
 
 static errcode_t iocache_read_blk(io_channel channel, unsigned long block,
@@ -224,11 +650,17 @@ static errcode_t iocache_discard(io_channel channel, unsigned long long block,
 				 unsigned long long count)
 {
 	struct iocache_private_data *data = IOCACHE(channel);
+	errcode_t retval;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
 
-	return io_channel_discard(data->real, block, count);
+	retval = io_channel_discard(data->real, block, count);
+	if (retval)
+		return retval;
+
+	iocache_invalidate_range(data, block, count);
+	return 0;
 }
 
 static errcode_t iocache_cache_readahead(io_channel channel,
@@ -247,11 +679,17 @@ static errcode_t iocache_zeroout(io_channel channel, unsigned long long block,
 				 unsigned long long count)
 {
 	struct iocache_private_data *data = IOCACHE(channel);
+	errcode_t retval;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
 
-	return io_channel_zeroout(data->real, block, count);
+	retval = io_channel_zeroout(data->real, block, count);
+	if (retval)
+		return retval;
+
+	iocache_invalidate_range(data, block, count);
+	return 0;
 }
 
 static errcode_t iocache_get_fd(io_channel channel, int *fd)



* [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses
  2026-06-25 19:35 ` [PATCHSET v6 3/4] fuse2fs: improve block and inode caching Darrick J. Wong
  2026-06-25 19:39   ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
  2026-06-25 19:39   ` [PATCH 2/6] iocache: add the actual buffer cache Darrick J. Wong
@ 2026-06-25 19:40   ` Darrick J. Wong
  2026-06-25 19:40   ` [PATCH 4/6] fuse2fs: enable caching IO manager Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:40 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

If a buffer is hot enough to survive more than 50 accesses without being
reclaimed, bump its priority to the next MRU so it sticks around longer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
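The counting rule described above can be modelled on its own, away from
the cache code.  This is a hedged sketch, not the patch itself:
toy_node, toy_node_access and TOY_MAX_PRIORITY are hypothetical names,
the ceiling value is an assumption rather than the real
CACHE_MAX_PRIORITY, and the special case for dirty nodes (which bumps
cn_old_priority instead) is left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PRIORITY	15	/* assumed ceiling, for illustration */
#define TOY_BUMP_INTERVAL	50

struct toy_node {
	int	priority;	/* index of the MRU list the node sits on */
	uint8_t	access;		/* accesses since the last bump */
};

/* Record one access; returns 1 if this access raised the priority. */
static int toy_node_access(struct toy_node *n)
{
	if (++n->access <= TOY_BUMP_INTERVAL)
		return 0;

	/* The counter restarts even when the priority is already maxed. */
	n->access = 0;
	if (n->priority >= TOY_MAX_PRIORITY)
		return 0;
	n->priority++;
	return 1;
}
```

Because the comparison is "> 50", the bump lands on the 51st access
after the previous one, and a uint8_t counter is wide enough since it
never needs to hold more than 51.
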
 lib/support/cache.h   |    1 +
 lib/support/cache.c   |   16 ++++++++++++++++
 lib/support/iocache.c |    9 +++++++++
 3 files changed, 26 insertions(+)


diff --git a/lib/support/cache.h b/lib/support/cache.h
index cd0e8c20e2f3ea..c57cebf99cf589 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -180,5 +180,6 @@ int cache_node_purge(struct cache *, cache_key_t, struct cache_node *);
 void cache_report(FILE *fp, const char *, struct cache *);
 int cache_overflowed(struct cache *);
 struct cache_node *cache_node_grab(struct cache *cache, struct cache_node *node);
+void cache_node_bump_priority(struct cache *cache, struct cache_node *node);
 
 #endif	/* __CACHE_H__ */
diff --git a/lib/support/cache.c b/lib/support/cache.c
index ece0adeceda8ed..44397ea3bc7ac4 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -715,6 +715,22 @@ cache_node_put(
 		cache_shrink(cache);
 }
 
+/* Bump the priority of a cache node.  Caller must hold cn_mutex. */
+void
+cache_node_bump_priority(
+	struct cache		*cache,
+	struct cache_node	*node)
+{
+	int			*priop;
+
+	if (node->cn_priority == CACHE_DIRTY_PRIORITY)
+		priop = &node->cn_old_priority;
+	else
+		priop = &node->cn_priority;
+	if (*priop < CACHE_MAX_PRIORITY)
+		(*priop)++;
+}
+
 void
 cache_node_set_priority(
 	struct cache *		cache EXT2FS_ATTR((unused)),
diff --git a/lib/support/iocache.c b/lib/support/iocache.c
index 59b71306f4dd41..82f805e1000e97 100644
--- a/lib/support/iocache.c
+++ b/lib/support/iocache.c
@@ -57,6 +57,7 @@ struct iocache_buf {
 	blk64_t			block;
 	void			*buf;
 	errcode_t		write_error;
+	uint8_t			access;
 	unsigned int		uptodate:1;
 	unsigned int		dirty:1;
 };
@@ -566,6 +567,10 @@ static errcode_t iocache_read_blk64(io_channel channel,
 		}
 		if (ubuf->uptodate)
 			memcpy(buf, ubuf->buf, channel->block_size);
+		if (++ubuf->access > 50) {
+			cache_node_bump_priority(&data->cache, node);
+			ubuf->access = 0;
+		}
 		iocache_buf_unlock(ubuf);
 		cache_node_put(&data->cache, node);
 		if (retval)
@@ -627,6 +632,10 @@ static errcode_t iocache_write_blk64(io_channel channel,
 				     ubuf->uptodate ? CACHE_HIT : CACHE_MISS);
 		ubuf->dirty = 1;
 		ubuf->uptodate = 1;
+		if (++ubuf->access > 50) {
+			cache_node_bump_priority(&data->cache, node);
+			ubuf->access = 0;
+		}
 		iocache_buf_unlock(ubuf);
 		cache_node_put(&data->cache, node);
 	}



* [PATCH 4/6] fuse2fs: enable caching IO manager
  2026-06-25 19:35 ` [PATCHSET v6 3/4] fuse2fs: improve block and inode caching Darrick J. Wong
                     ` (2 preceding siblings ...)
  2026-06-25 19:40   ` [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses Darrick J. Wong
@ 2026-06-25 19:40   ` Darrick J. Wong
  2026-06-25 19:40   ` [PATCH 5/6] fuse2fs: increase inode cache size Darrick J. Wong
  2026-06-25 19:40   ` [PATCH 6/6] libext2fs: improve caching for inodes Darrick J. Wong
  5 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:40 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Enable the new dynamic iocache I/O manager in the fuse server, and turn
off all the other cache control.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
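The calling convention used in the hunks below is: name the backing
manager first, then hand the caching manager to the open call.  A
cut-down model of that layering, with loud assumptions: all toy_* names
and the integer return codes are hypothetical, and the real code uses
struct struct_io_manager, errcode_t and ext2fs_open2() instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for an io manager vtable. */
struct toy_manager {
	const char	*name;
	int		(*open)(const char *path);
};

/* Lower layer: pretend every non-NULL path opens successfully. */
static int toy_unix_open(const char *path)
{
	return path ? 0 : -1;
}

static struct toy_manager toy_unix_manager = {
	"toy unix manager", toy_unix_open
};

/* Chosen once before opening, like iocache_set_backing_manager(). */
static struct toy_manager *toy_backing;

static void toy_cache_set_backing(struct toy_manager *manager)
{
	toy_backing = manager;
}

/* Upper layer: refuses to open until a backing manager is set. */
static int toy_cache_open(const char *path)
{
	if (!path)
		return -1;
	if (!toy_backing)
		return -2;
	return toy_backing->open(path);
}

static struct toy_manager toy_cache_manager = {
	"toy cache manager", toy_cache_open
};
```

The backing manager is process-wide state in this model, as it is in
iocache.c, so the caller has to set it before every open that might
want a different lower layer.
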
 fuse4fs/Makefile.in   |    2 +-
 fuse4fs/fuse4fs.c     |   11 ++++++++---
 lib/support/iocache.c |    4 ++--
 misc/Makefile.in      |    3 ++-
 misc/fuse2fs.c        |    7 ++++++-
 5 files changed, 19 insertions(+), 8 deletions(-)


diff --git a/fuse4fs/Makefile.in b/fuse4fs/Makefile.in
index bb859369914a36..8fbbfabe7be1e7 100644
--- a/fuse4fs/Makefile.in
+++ b/fuse4fs/Makefile.in
@@ -180,7 +180,7 @@ fuse4fs.o: $(srcdir)/fuse4fs.c $(top_builddir)/lib/config.h \
  $(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/lib/support/bthread.h \
  $(top_srcdir)/lib/support/thread.h $(top_srcdir)/lib/support/list.h \
  $(top_srcdir)/lib/support/cache.h $(top_srcdir)/version.h \
- $(top_srcdir)/lib/e2p/e2p.h
+ $(top_srcdir)/lib/e2p/e2p.h $(top_srcdir)/lib/support/iocache.h
 journal.o: $(srcdir)/../debugfs/journal.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(srcdir)/../debugfs/journal.h \
  $(top_srcdir)/e2fsck/jfs_user.h $(top_srcdir)/e2fsck/e2fsck.h \
diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index fdd4327a4c0907..43e7278ffec1a9 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -60,6 +60,7 @@
 #include "support/thread.h"
 #include "support/list.h"
 #include "support/cache.h"
+#include "support/iocache.h"
 
 #include "../version.h"
 #include "uuid/uuid.h"
@@ -1434,13 +1435,15 @@ static errcode_t fuse4fs_service_openfs(struct fuse4fs *ff, char *options,
 	if (ret)
 		return errno;
 
+	iocache_set_backing_manager(unixfd_io_manager);
+
 	/*
 	 * Open the filesystem with SKIP_MMP so that we can find out if the
 	 * filesystem actually has MMP.
 	 */
 	snprintf(path, sizeof(path), "/dev/fd/%d", ff->bdev_fd);
 	retval = ext2fs_open2(path, options, *flags | EXT2_FLAG_SKIP_MMP, 0, 0,
-			      unixfd_io_manager, &ff->fs);
+			      iocache_io_manager, &ff->fs);
 	if (retval)
 		return retval;
 
@@ -1473,7 +1476,7 @@ static errcode_t fuse4fs_service_openfs(struct fuse4fs *ff, char *options,
 		*flags |= EXT2_FLAG_DIRECT_IO;
 	}
 
-	return ext2fs_open2(path, options, *flags, 0, 0, unixfd_io_manager,
+	return ext2fs_open2(path, options, *flags, 0, 0, iocache_io_manager,
 			    &ff->fs);
 }
 #else
@@ -1581,6 +1584,8 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff)
 	if (ff->directio)
 		flags |= EXT2_FLAG_DIRECT_IO;
 
+	iocache_set_backing_manager(unix_io_manager);
+
 	/*
 	 * If the filesystem is stored on a block device, the _EXCLUSIVE flag
 	 * causes libext2fs to try to open the block device with O_EXCL.  If
@@ -1616,7 +1621,7 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff)
 			err = fuse4fs_service_openfs(ff, options, &flags);
 		else
 			err = ext2fs_open2(ff->device, options, flags, 0, 0,
-					   unix_io_manager, &ff->fs);
+					   iocache_io_manager, &ff->fs);
 		if ((err == EPERM || err == EACCES) &&
 		    (!ff->ro || (flags & EXT2_FLAG_RW))) {
 			/*
diff --git a/lib/support/iocache.c b/lib/support/iocache.c
index 82f805e1000e97..4ed941ff2d3ae6 100644
--- a/lib/support/iocache.c
+++ b/lib/support/iocache.c
@@ -249,7 +249,7 @@ static void iocache_invalidate_bufs(struct iocache_private_data *data,
  */
 static void iocache_invalidate_cache(struct iocache_private_data *data)
 {
-	LIST_HEAD(list);
+	struct list_head list = LIST_HEAD_INIT(list);
 
 	cache_walk(&data->cache, iocache_add_list, &list);
 	iocache_invalidate_bufs(data, &list);
@@ -262,7 +262,7 @@ static void iocache_invalidate_cache(struct iocache_private_data *data)
 static void iocache_invalidate_range(struct iocache_private_data *data,
 				     blk64_t block, uint64_t count)
 {
-	LIST_HEAD(list);
+	struct list_head list = LIST_HEAD_INIT(list);
 	uint64_t i;
 
 	for (i = 0; i < count; i++) {
diff --git a/misc/Makefile.in b/misc/Makefile.in
index ec964688acd623..48bd42b8272572 100644
--- a/misc/Makefile.in
+++ b/misc/Makefile.in
@@ -880,7 +880,8 @@ fuse2fs.o: $(srcdir)/fuse2fs.c $(top_builddir)/lib/config.h \
  $(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
  $(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/ext2fsP.h \
  $(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/version.h \
- $(top_srcdir)/lib/e2p/e2p.h
+ $(top_srcdir)/lib/e2p/e2p.h $(top_srcdir)/lib/support/cache.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/support/iocache.h
 e2fuzz.o: $(srcdir)/e2fuzz.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/ext2fs/ext2_fs.h \
  $(top_builddir)/lib/ext2fs/ext2_types.h $(top_srcdir)/lib/ext2fs/ext2fs.h \
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 59ad3ab0c6eb3f..32f4f1e2a48056 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -49,6 +49,9 @@
 #include "ext2fs/ext2fsP.h"
 #include "support/bthread.h"
 #include "support/thread.h"
+#include "support/list.h"
+#include "support/cache.h"
+#include "support/iocache.h"
 
 #include "../version.h"
 #include "uuid/uuid.h"
@@ -1176,6 +1179,8 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff)
 	if (ff->directio)
 		flags |= EXT2_FLAG_DIRECT_IO;
 
+	iocache_set_backing_manager(unix_io_manager);
+
 	/*
 	 * If the filesystem is stored on a block device, the _EXCLUSIVE flag
 	 * causes libext2fs to try to open the block device with O_EXCL.  If
@@ -1208,7 +1213,7 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff)
 	deadline = init_deadline(FUSE2FS_OPEN_TIMEOUT);
 	do {
 		err = ext2fs_open2(ff->device, options, flags, 0, 0,
-				   unix_io_manager, &ff->fs);
+				   iocache_io_manager, &ff->fs);
 		if ((err == EPERM || err == EACCES) &&
 		    (!ff->ro || (flags & EXT2_FLAG_RW))) {
 			/*



* [PATCH 5/6] fuse2fs: increase inode cache size
  2026-06-25 19:35 ` [PATCHSET v6 3/4] fuse2fs: improve block and inode caching Darrick J. Wong
                     ` (3 preceding siblings ...)
  2026-06-25 19:40   ` [PATCH 4/6] fuse2fs: enable caching IO manager Darrick J. Wong
@ 2026-06-25 19:40   ` Darrick J. Wong
  2026-06-25 19:40   ` [PATCH 6/6] libext2fs: improve caching for inodes Darrick J. Wong
  5 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:40 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Increase the internal inode cache size to 1024 entries.  It is not yet
clear whether this measurably improves performance.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c |    4 ++++
 misc/fuse2fs.c    |    4 ++++
 2 files changed, 8 insertions(+)
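
Reviewer note (not part of the commit): judging by the code that
patch 6/6 removes, the inode cache at this point in the series appears
to be a fixed array with a linear lookup and round-robin replacement,
so a larger size would also mean a longer scan on every miss.  A
simplified sketch of that scheme, with hypothetical toy_* names and a
compare counter added purely for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_SIZE 4	/* the patch passes 1024 to the real cache */

struct toy_icache {
	unsigned int	ino[TOY_CACHE_SIZE];	/* 0 means "slot empty" */
	int		last;			/* most recently filled slot */
	unsigned long	compares;		/* lookup cost counter */
};

static void toy_icache_init(struct toy_icache *c)
{
	size_t i;

	assert(c != NULL);
	for (i = 0; i < TOY_CACHE_SIZE; i++)
		c->ino[i] = 0;
	c->last = -1;
	c->compares = 0;
}

/* Linear scan: returns the slot index, or -1 on a miss. */
static int toy_icache_lookup(struct toy_icache *c, unsigned int ino)
{
	int i;

	assert(c != NULL);
	for (i = 0; i < TOY_CACHE_SIZE; i++) {
		c->compares++;
		if (c->ino[i] == ino)
			return i;
	}
	return -1;
}

/* Round-robin replacement: overwrite the slot after the last one used. */
static int toy_icache_insert(struct toy_icache *c, unsigned int ino)
{
	int slot;

	assert(c != NULL);
	slot = (c->last + 1) % TOY_CACHE_SIZE;
	c->ino[slot] = ino;
	c->last = slot;
	return slot;
}
```

In this sketch a miss always costs TOY_CACHE_SIZE comparisons, which is
one reason the benefit of a bigger array is worth measuring rather than
assuming.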


diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 43e7278ffec1a9..9744c0941cf31b 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -1689,6 +1689,10 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff)
 	if (err)
 		return translate_error(ff->fs, 0, err);
 
+	err = ext2fs_create_inode_cache(ff->fs, 1024);
+	if (err)
+		return translate_error(ff->fs, 0, err);
+
 	ff->fs->priv_data = ff;
 	ff->blocklog = u_log2(ff->fs->blocksize);
 	ff->blockmask = ff->fs->blocksize - 1;
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 32f4f1e2a48056..caa675bc9f95e9 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1277,6 +1277,10 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff)
 		log_printf(ff, "%s %s.\n", _("mounted filesystem"), uuid);
 	}
 
+	err = ext2fs_create_inode_cache(ff->fs, 1024);
+	if (err)
+		return translate_error(ff->fs, 0, err);
+
 	ff->fs->priv_data = ff;
 	ff->blocklog = u_log2(ff->fs->blocksize);
 	ff->blockmask = ff->fs->blocksize - 1;



* [PATCH 6/6] libext2fs: improve caching for inodes
  2026-06-25 19:35 ` [PATCHSET v6 3/4] fuse2fs: improve block and inode caching Darrick J. Wong
                     ` (4 preceding siblings ...)
  2026-06-25 19:40   ` [PATCH 5/6] fuse2fs: increase inode cache size Darrick J. Wong
@ 2026-06-25 19:40   ` Darrick J. Wong
  5 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:40 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Use our new cache code to improve the on-disk inode cache inside
libext2fs.  Known issues: list.h is duplicated, and libext2fs now needs
to link against libsupport.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2fsP.h     |   13 ++-
 debugfs/Makefile.in      |    8 +-
 e2fsck/Makefile.in       |   12 +--
 fuse4fs/Makefile.in      |    8 +-
 lib/ext2fs/Makefile.in   |   69 +++++++--------
 lib/ext2fs/inline_data.c |    4 -
 lib/ext2fs/inode.c       |  215 ++++++++++++++++++++++++++++++++++++----------
 misc/Makefile.in         |    8 +-
 resize/Makefile.in       |   11 +-
 tests/fuzz/Makefile.in   |    4 -
 tests/progs/Makefile.in  |    4 -
 11 files changed, 239 insertions(+), 117 deletions(-)
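
Reviewer note (not part of the commit): ext2_inode_cache_ibump() in the
diff below only promotes an entry once its access counter passes 50, so
the MRU lists are not touched on every hit.  A minimal sketch of that
throttle, using a plain priority integer as a stand-in for
cache_node_bump_priority(); all toy_* names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BUMP_THRESHOLD 50	/* same constant as the patch */

struct toy_ent {
	uint8_t		access;		/* accesses since the last bump */
	unsigned int	priority;	/* stand-in for the MRU priority */
};

/* Returns 1 if this access promoted the entry, 0 otherwise. */
static int toy_ent_touch(struct toy_ent *ent)
{
	assert(ent != NULL);
	if (++ent->access > TOY_BUMP_THRESHOLD) {
		ent->priority++;	/* the real code bumps the cache node */
		ent->access = 0;
		return 1;
	}
	return 0;
}

/* Touch the entry n times and report how many promotions happened. */
static unsigned int toy_touch_many(struct toy_ent *ent, unsigned int n)
{
	unsigned int bumps = 0;

	while (n--)
		bumps += toy_ent_touch(ent);
	return bumps;
}
```

With the pre-increment and the strict comparison, the promotion in this
sketch happens on the 51st access after a reset, and the threshold fits
comfortably in the uint8_t counter.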


diff --git a/lib/ext2fs/ext2fsP.h b/lib/ext2fs/ext2fsP.h
index bdc92991e7dda0..272de454bf4d65 100644
--- a/lib/ext2fs/ext2fsP.h
+++ b/lib/ext2fs/ext2fsP.h
@@ -82,21 +82,26 @@ struct dir_context {
 	errcode_t	errcode;
 };
 
+#include "support/list.h"
+#include "support/cache.h"
+
 /*
  * Inode cache structure
  */
 struct ext2_inode_cache {
 	void *				buffer;
 	blk64_t				buffer_blk;
-	int				cache_last;
-	unsigned int			cache_size;
 	int				refcount;
-	struct ext2_inode_cache_ent	*cache;
+	struct cache			cache;
 };
 
 struct ext2_inode_cache_ent {
+	struct cache_node	node;
 	ext2_ino_t		ino;
-	struct ext2_inode	*inode;
+	uint8_t			access;
+
+	/* bytes representing a host-endian ext2_inode_large object */
+	char			raw[];
 };
 
 /*
diff --git a/debugfs/Makefile.in b/debugfs/Makefile.in
index 700ae87418c268..8bee4b67fc2de7 100644
--- a/debugfs/Makefile.in
+++ b/debugfs/Makefile.in
@@ -38,15 +38,15 @@ SRCS= debug_cmds.c $(srcdir)/debugfs.c $(srcdir)/util.c $(srcdir)/ls.c \
 	$(srcdir)/../e2fsck/recovery.c $(srcdir)/do_journal.c \
 	$(srcdir)/do_orphan.c
 
-LIBS= $(LIBSUPPORT) $(LIBEXT2FS) $(LIBE2P) $(LIBSS) $(LIBCOM_ERR) $(LIBBLKID) \
+LIBS= $(LIBEXT2FS) $(LIBSUPPORT) $(LIBE2P) $(LIBSS) $(LIBCOM_ERR) $(LIBBLKID) \
 	$(LIBUUID) $(LIBMAGIC) $(SYSLIBS) $(LIBARCHIVE)
-DEPLIBS= $(DEPLIBSUPPORT) $(LIBEXT2FS) $(LIBE2P) $(DEPLIBSS) $(DEPLIBCOM_ERR) \
+DEPLIBS= $(LIBEXT2FS) $(DEPLIBSUPPORT) $(LIBE2P) $(DEPLIBSS) $(DEPLIBCOM_ERR) \
 	$(DEPLIBBLKID) $(DEPLIBUUID)
 
-STATIC_LIBS= $(STATIC_LIBSUPPORT) $(STATIC_LIBEXT2FS) $(STATIC_LIBSS) \
+STATIC_LIBS= $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBSS) \
 	$(STATIC_LIBCOM_ERR) $(STATIC_LIBBLKID) $(STATIC_LIBUUID) \
 	$(STATIC_LIBE2P) $(LIBMAGIC) $(SYSLIBS)
-STATIC_DEPLIBS= $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBSS) \
+STATIC_DEPLIBS= $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBSUPPORT) $(DEPSTATIC_LIBSS) \
 		$(DEPSTATIC_LIBCOM_ERR) $(DEPSTATIC_LIBUUID) \
 		$(DEPSTATIC_LIBE2P)
 
diff --git a/e2fsck/Makefile.in b/e2fsck/Makefile.in
index 52fad9cbfd2b23..d72244f47e47c0 100644
--- a/e2fsck/Makefile.in
+++ b/e2fsck/Makefile.in
@@ -16,22 +16,22 @@ PROGS=		e2fsck
 MANPAGES=	e2fsck.8
 FMANPAGES=	e2fsck.conf.5
 
-LIBS= $(LIBSUPPORT) $(LIBEXT2FS) $(LIBCOM_ERR) $(LIBBLKID) $(LIBUUID) \
+LIBS= $(LIBEXT2FS) $(LIBSUPPORT) $(LIBCOM_ERR) $(LIBBLKID) $(LIBUUID) \
 	$(LIBINTL) $(LIBE2P) $(LIBMAGIC) $(SYSLIBS)
-DEPLIBS= $(DEPLIBSUPPORT) $(LIBEXT2FS) $(DEPLIBCOM_ERR) $(DEPLIBBLKID) \
+DEPLIBS= $(LIBEXT2FS) $(DEPLIBSUPPORT) $(DEPLIBCOM_ERR) $(DEPLIBBLKID) \
 	 $(DEPLIBUUID) $(DEPLIBE2P)
 
-STATIC_LIBS= $(STATIC_LIBSUPPORT) $(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) \
+STATIC_LIBS= $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR) \
 	     $(STATIC_LIBBLKID) $(STATIC_LIBUUID) $(LIBINTL) $(STATIC_LIBE2P) \
 	     $(LIBMAGIC) $(SYSLIBS)
-STATIC_DEPLIBS= $(DEPSTATIC_LIBSUPPORT) $(STATIC_LIBEXT2FS) \
+STATIC_DEPLIBS= $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBSUPPORT) \
 		$(DEPSTATIC_LIBCOM_ERR) $(DEPSTATIC_LIBBLKID) \
 		$(DEPSTATIC_LIBUUID) $(DEPSTATIC_LIBE2P)
 
-PROFILED_LIBS= $(PROFILED_LIBSUPPORT) $(PROFILED_LIBEXT2FS) \
+PROFILED_LIBS= $(PROFILED_LIBEXT2FS) $(PROFILED_LIBSUPPORT) \
 	       $(PROFILED_LIBCOM_ERR) $(PROFILED_LIBBLKID) $(PROFILED_LIBUUID) \
 	       $(PROFILED_LIBE2P) $(LIBINTL) $(LIBMAGIC) $(SYSLIBS)
-PROFILED_DEPLIBS= $(DEPPROFILED_LIBSUPPORT) $(PROFILED_LIBEXT2FS) \
+PROFILED_DEPLIBS= $(PROFILED_LIBEXT2FS) $(DEPPROFILED_LIBSUPPORT) \
 		  $(DEPPROFILED_LIBCOM_ERR) $(DEPPROFILED_LIBBLKID) \
 		  $(DEPPROFILED_LIBUUID) $(DEPPROFILED_LIBE2P)
 
diff --git a/fuse4fs/Makefile.in b/fuse4fs/Makefile.in
index 8fbbfabe7be1e7..44ae3b78a29b9d 100644
--- a/fuse4fs/Makefile.in
+++ b/fuse4fs/Makefile.in
@@ -37,11 +37,11 @@ SRCS=\
 
 LIBS= $(LIBEXT2FS) $(LIBCOM_ERR) $(LIBSUPPORT)
 DEPLIBS= $(LIBEXT2FS) $(DEPLIBCOM_ERR) $(DEPLIBSUPPORT)
-PROFILED_LIBS= $(LIBSUPPORT) $(PROFILED_LIBEXT2FS) $(PROFILED_LIBCOM_ERR)
-PROFILED_DEPLIBS= $(DEPLIBSUPPORT) $(PROFILED_LIBEXT2FS) $(DEPPROFILED_LIBCOM_ERR)
+PROFILED_LIBS= $(PROFILED_LIBEXT2FS) $(PROFILED_LIBSUPPORT) $(PROFILED_LIBCOM_ERR)
+PROFILED_DEPLIBS= $(PROFILED_LIBEXT2FS) $(DEPPROFILED_LIBSUPPORT) $(DEPPROFILED_LIBCOM_ERR)
 
-STATIC_LIBS= $(LIBSUPPORT) $(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR)
-STATIC_DEPLIBS= $(DEPLIBSUPPORT) $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+STATIC_LIBS= $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR)
+STATIC_DEPLIBS= $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 
 LIBS_E2P= $(LIBE2P) $(LIBCOM_ERR)
 DEPLIBS_E2P= $(LIBE2P) $(DEPLIBCOM_ERR)
diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index 1d0991defff804..45cd3814d4f2c7 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -246,7 +246,7 @@ ELF_SO_VERSION = 2
 ELF_IMAGE = libext2fs
 ELF_MYDIR = ext2fs
 ELF_INSTALL_DIR = $(root_libdir)
-ELF_OTHER_LIBS = -lcom_err
+ELF_OTHER_LIBS = -lcom_err $(top_builddir)/../lib/libsupport.a
 
 BSDLIB_VERSION = 2.1
 BSDLIB_IMAGE = libext2fs
@@ -283,54 +283,54 @@ ext2fs.pc: $(srcdir)/ext2fs.pc.in $(top_builddir)/config.status
 	$(E) "	CONFIG.STATUS $@"
 	$(Q) cd $(top_builddir); CONFIG_FILES=lib/ext2fs/ext2fs.pc ./config.status
 
-tst_badblocks: tst_badblocks.o $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+tst_badblocks: tst_badblocks.o $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_badblocks tst_badblocks.o $(ALL_LDFLAGS) \
-		$(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
 tst_digest_encode: $(srcdir)/digest_encode.c $(srcdir)/ext2_fs.h
 	$(E) "	CC $@"
 	$(Q) $(CC) $(ALL_LDFLAGS) $(ALL_CFLAGS) -o tst_digest_encode \
 		$(srcdir)/digest_encode.c -DUNITTEST $(SYSLIBS)
 
-tst_icount: $(srcdir)/icount.c $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+tst_icount: $(srcdir)/icount.c $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_icount $(srcdir)/icount.c -DDEBUG \
 		$(ALL_CFLAGS) $(ALL_LDFLAGS) \
-		$(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
-tst_iscan: tst_iscan.o $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+tst_iscan: tst_iscan.o $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_iscan tst_iscan.o $(ALL_LDFLAGS) \
-		$(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
-tst_getsize: tst_getsize.o $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+tst_getsize: tst_getsize.o $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_getsize tst_getsize.o $(ALL_LDFLAGS) \
-		$(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
-tst_ismounted: $(srcdir)/ismounted.c $(STATIC_LIBEXT2FS) \
+tst_ismounted: $(srcdir)/ismounted.c $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) \
 		$(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_ismounted $(srcdir)/ismounted.c \
-		$(STATIC_LIBEXT2FS) -DDEBUG $(ALL_CFLAGS) $(ALL_LDFLAGS) \
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) -DDEBUG $(ALL_CFLAGS) $(ALL_LDFLAGS) \
 		$(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
-tst_byteswap: tst_byteswap.o $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+tst_byteswap: tst_byteswap.o $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_byteswap tst_byteswap.o $(ALL_LDFLAGS) \
-		$(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
-tst_bitops: tst_bitops.o $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+tst_bitops: tst_bitops.o $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_bitops tst_bitops.o $(ALL_CFLAGS) $(ALL_LDFLAGS) \
-		$(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
-tst_getsectsize: tst_getsectsize.o getsectsize.o $(STATIC_LIBEXT2FS) \
+tst_getsectsize: tst_getsectsize.o getsectsize.o $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) \
 			$(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_getsectsize tst_getsectsize.o getsectsize.o \
-		$(ALL_LDFLAGS) $(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) \
+		$(ALL_LDFLAGS) $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR) \
 		$(SYSLIBS)
 
 tst_types.o: $(srcdir)/tst_types.c ext2_types.h 
@@ -490,11 +490,11 @@ tst_bitmaps_cmd.c: tst_bitmaps_cmd.ct
 	$(Q) DIR=$(srcdir) $(MK_CMDS) $(srcdir)/tst_bitmaps_cmd.ct
 
 tst_bitmaps: tst_bitmaps.o tst_bitmaps_cmd.o $(srcdir)/blkmap64_rb.c \
-		$(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBSS) $(DEPSTATIC_LIBCOM_ERR)
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBSS) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o $@ tst_bitmaps.o tst_bitmaps_cmd.o \
 		-DDEBUG_RB $(srcdir)/blkmap64_rb.c $(ALL_CFLAGS) \
-		$(ALL_LDFLAGS) $(STATIC_LIBEXT2FS) $(STATIC_LIBSS) \
+		$(ALL_LDFLAGS) $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBSS) \
 		$(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
 tst_extents: $(srcdir)/extent.c $(DEBUG_OBJS) $(DEPSTATIC_LIBSS) libext2fs.a \
@@ -503,8 +503,8 @@ tst_extents: $(srcdir)/extent.c $(DEBUG_OBJS) $(DEPSTATIC_LIBSS) libext2fs.a \
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_extents $(srcdir)/extent.c \
 		$(ALL_CFLAGS) $(ALL_LDFLAGS) -DDEBUG $(DEBUG_OBJS) \
-		$(STATIC_LIBSS) $(STATIC_LIBE2P) $(LIBSUPPORT) \
-		$(STATIC_LIBEXT2FS) $(LIBBLKID) $(LIBUUID) \
+		$(STATIC_LIBSS) $(STATIC_LIBE2P) \
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(LIBSUPPORT) $(LIBBLKID) $(LIBUUID) \
 		$(STATIC_LIBCOM_ERR) $(SYSLIBS) -I $(top_srcdir)/debugfs
 
 tst_libext2fs: $(DEBUG_OBJS) \
@@ -512,38 +512,38 @@ tst_libext2fs: $(DEBUG_OBJS) \
 	$(DEPLIBBLKID) $(DEPSTATIC_LIBCOM_ERR) $(DEPLIBSUPPORT)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_libext2fs $(ALL_LDFLAGS) -DDEBUG $(DEBUG_OBJS) \
-		$(STATIC_LIBSS) $(STATIC_LIBE2P) $(LIBSUPPORT) \
-		$(STATIC_LIBEXT2FS) $(LIBBLKID) $(LIBUUID) $(LIBMAGIC) \
+		$(STATIC_LIBSS) $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) \
+		$(LIBSUPPORT) $(LIBBLKID) $(LIBUUID) $(LIBMAGIC) \
 		$(STATIC_LIBCOM_ERR) $(SYSLIBS) $(LIBARCHIVE) -I $(top_srcdir)/debugfs
 
-tst_inline: $(srcdir)/inline.c $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+tst_inline: $(srcdir)/inline.c $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_inline $(srcdir)/inline.c $(ALL_CFLAGS) \
-		$(ALL_LDFLAGS) -DDEBUG $(STATIC_LIBEXT2FS) \
+		$(ALL_LDFLAGS) -DDEBUG $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) \
 		$(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
-tst_inline_data: inline_data.c $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+tst_inline_data: inline_data.c $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_inline_data $(srcdir)/inline_data.c $(ALL_CFLAGS) \
-		$(ALL_LDFLAGS) -DDEBUG $(STATIC_LIBEXT2FS) \
+		$(ALL_LDFLAGS) -DDEBUG $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) \
 		$(STATIC_LIBCOM_ERR) $(SYSLIBS)
 
-tst_csum: csum.c $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR) $(STATIC_LIBE2P) \
+tst_csum: csum.c $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR) $(STATIC_LIBE2P) \
 		$(top_srcdir)/lib/e2p/e2p.h
 	$(E) "	LD $@"
 	$(Q) $(CC) -o tst_csum $(srcdir)/csum.c -DDEBUG \
-		$(ALL_CFLAGS) $(ALL_LDFLAGS) $(STATIC_LIBEXT2FS) \
+		$(ALL_CFLAGS) $(ALL_LDFLAGS) $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) \
 		$(STATIC_LIBCOM_ERR) $(STATIC_LIBE2P) $(SYSLIBS)
 
-tst_crc32c: $(srcdir)/crc32c.c $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+tst_crc32c: $(srcdir)/crc32c.c $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 	$(Q) $(CC) $(ALL_LDFLAGS) $(ALL_CFLAGS) -o tst_crc32c $(srcdir)/crc32c.c \
-		-DUNITTEST $(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) \
+		-DUNITTEST $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR) \
 		$(SYSLIBS)
 
-mkjournal: mkjournal.c $(STATIC_LIBEXT2FS) $(DEPLIBCOM_ERR)
+mkjournal: mkjournal.c $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(DEPLIBCOM_ERR)
 	$(E) "	LD $@"
 	$(Q) $(CC) -o mkjournal $(srcdir)/mkjournal.c -DDEBUG \
-		$(STATIC_LIBEXT2FS) $(LIBCOM_ERR) $(ALL_CFLAGS) $(SYSLIBS)
+		$(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(LIBCOM_ERR) $(ALL_CFLAGS) $(SYSLIBS)
 
 fullcheck check:: tst_bitops tst_badblocks tst_iscan tst_types tst_icount \
     tst_super_size tst_types tst_inode_size tst_csum tst_crc32c tst_bitmaps \
@@ -976,7 +976,8 @@ inode.o: $(srcdir)/inode.c $(top_builddir)/lib/config.h \
  $(srcdir)/ext2fs.h $(srcdir)/ext2_fs.h $(srcdir)/ext3_extents.h \
  $(top_srcdir)/lib/et/com_err.h $(srcdir)/ext2_io.h \
  $(top_builddir)/lib/ext2fs/ext2_err.h $(srcdir)/ext2_ext_attr.h \
- $(srcdir)/hashmap.h $(srcdir)/bitops.h $(srcdir)/e2image.h
+ $(srcdir)/hashmap.h $(srcdir)/bitops.h $(srcdir)/e2image.h \
+ $(srcdir)/../support/cache.h $(srcdir)/../support/list.h
 inode_io.o: $(srcdir)/inode_io.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(srcdir)/ext2_fs.h \
  $(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2fs.h \
diff --git a/lib/ext2fs/inline_data.c b/lib/ext2fs/inline_data.c
index bd52e37708ccad..8ff4a23397f499 100644
--- a/lib/ext2fs/inline_data.c
+++ b/lib/ext2fs/inline_data.c
@@ -817,10 +817,6 @@ int main(int argc, char *argv[])
 				"tst_inline_data: init inode cache failed\n");
 			exit(1);
 		}
-
-		/* setup inode cache */
-		for (i = 0; i < fs->icache->cache_size; i++)
-			fs->icache->cache[i].ino = first_ino++;
 	}
 
 	/* test */
diff --git a/lib/ext2fs/inode.c b/lib/ext2fs/inode.c
index c9389a2324be07..8ca82af1ab35d3 100644
--- a/lib/ext2fs/inode.c
+++ b/lib/ext2fs/inode.c
@@ -59,18 +59,145 @@ struct ext2_struct_inode_scan {
 	int			reserved[6];
 };
 
+struct ext2_inode_cache_key {
+	ext2_filsys		fs;
+	ext2_ino_t		ino;
+};
+
+#define ICKEY(key)	((struct ext2_inode_cache_key *)(key))
+#define ICNODE(node)	(container_of((node), struct ext2_inode_cache_ent, node))
+
+static unsigned int
+ext2_inode_cache_hash(cache_key_t key, unsigned int hashsize,
+		      unsigned int hashshift)
+{
+	uint64_t	hashval = ICKEY(key)->ino;
+	uint64_t	tmp;
+
+	tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
+	return tmp % hashsize;
+}
+
+static int ext2_inode_cache_compare(struct cache_node *node, cache_key_t key)
+{
+	struct ext2_inode_cache_ent *ent = ICNODE(node);
+	struct ext2_inode_cache_key *ikey = ICKEY(key);
+
+	if (ent->ino == ikey->ino)
+		return CACHE_HIT;
+
+	return CACHE_MISS;
+}
+
+static struct cache_node *ext2_inode_cache_alloc(struct cache *c,
+						 cache_key_t key)
+{
+	struct ext2_inode_cache_key *ikey = ICKEY(key);
+	struct ext2_inode_cache_ent *ent;
+
+	ent = calloc(1, sizeof(struct ext2_inode_cache_ent) +
+			EXT2_INODE_SIZE(ikey->fs->super));
+	if (!ent)
+		return NULL;
+
+	ent->ino = ikey->ino;
+	return &ent->node;
+}
+
+static bool ext2_inode_cache_flush(struct cache *c, struct cache_node *node)
+{
+	/* can always drop inode cache */
+	return false;
+}
+
+static void ext2_inode_cache_relse(struct cache *c, struct cache_node *node)
+{
+	struct ext2_inode_cache_ent *ent = ICNODE(node);
+
+	free(ent);
+}
+
+static unsigned int ext2_inode_cache_bulkrelse(struct cache *cache,
+					       struct list_head *list)
+{
+	struct cache_node *cn, *n;
+	unsigned int count = 0;
+
+	if (list_empty(list))
+		return 0;
+
+	list_for_each_entry_safe(cn, n, list, cn_mru) {
+		ext2_inode_cache_relse(cache, cn);
+		count++;
+	}
+
+	return count;
+}
+
+static const struct cache_operations ext2_inode_cache_ops = {
+	.hash		= ext2_inode_cache_hash,
+	.alloc		= ext2_inode_cache_alloc,
+	.flush		= ext2_inode_cache_flush,
+	.relse		= ext2_inode_cache_relse,
+	.compare	= ext2_inode_cache_compare,
+	.bulkrelse	= ext2_inode_cache_bulkrelse,
+	.resize		= cache_gradual_resize,
+};
+
+static errcode_t ext2_inode_cache_iget(ext2_filsys fs, ext2_ino_t ino,
+				       unsigned int getflags,
+				       struct ext2_inode_cache_ent **entp)
+{
+	struct ext2_inode_cache_key ikey = {
+		.fs = fs,
+		.ino = ino,
+	};
+	struct cache_node *node = NULL;
+
+	cache_node_get(&fs->icache->cache, &ikey, getflags, &node);
+	if (!node)
+		return ENOMEM;
+
+	*entp = ICNODE(node);
+	return 0;
+}
+
+static void ext2_inode_cache_iput(ext2_filsys fs,
+				  struct ext2_inode_cache_ent *ent)
+{
+	cache_node_put(&fs->icache->cache, &ent->node);
+}
+
+static int ext2_inode_cache_ipurge(ext2_filsys fs, ext2_ino_t ino,
+				   struct ext2_inode_cache_ent *ent)
+{
+	struct ext2_inode_cache_key ikey = {
+		.fs = fs,
+		.ino = ino,
+	};
+
+	return cache_node_purge(&fs->icache->cache, &ikey, &ent->node);
+}
+
+static void ext2_inode_cache_ibump(ext2_filsys fs,
+				   struct ext2_inode_cache_ent *ent)
+{
+	if (++ent->access > 50) {
+		cache_node_bump_priority(&fs->icache->cache, &ent->node);
+		ent->access = 0;
+	}
+}
+
 /*
  * This routine flushes the icache, if it exists.
  */
 errcode_t ext2fs_flush_icache(ext2_filsys fs)
 {
-	unsigned	i;
-
 	if (!fs->icache)
 		return 0;
 
-	for (i=0; i < fs->icache->cache_size; i++)
-		fs->icache->cache[i].ino = 0;
+	cache_purge(&fs->icache->cache);
 
 	fs->icache->buffer_blk = 0;
 	return 0;
@@ -81,23 +208,20 @@ errcode_t ext2fs_flush_icache(ext2_filsys fs)
  */
 void ext2fs_free_inode_cache(struct ext2_inode_cache *icache)
 {
-	unsigned i;
-
 	if (--icache->refcount)
 		return;
 	if (icache->buffer)
 		ext2fs_free_mem(&icache->buffer);
-	for (i = 0; i < icache->cache_size; i++)
-		ext2fs_free_mem(&icache->cache[i].inode);
-	if (icache->cache)
-		ext2fs_free_mem(&icache->cache);
+	if (cache_initialized(&icache->cache)) {
+		cache_purge(&icache->cache);
+		cache_destroy(&icache->cache);
+	}
 	icache->buffer_blk = 0;
 	ext2fs_free_mem(&icache);
 }
 
 errcode_t ext2fs_create_inode_cache(ext2_filsys fs, unsigned int cache_size)
 {
-	unsigned	i;
 	errcode_t	retval;
 
 	if (fs->icache)
@@ -112,22 +236,12 @@ errcode_t ext2fs_create_inode_cache(ext2_filsys fs, unsigned int cache_size)
 		goto errout;
 
 	fs->icache->buffer_blk = 0;
-	fs->icache->cache_last = -1;
-	fs->icache->cache_size = cache_size;
 	fs->icache->refcount = 1;
-	retval = ext2fs_get_array(fs->icache->cache_size,
-				  sizeof(struct ext2_inode_cache_ent),
-				  &fs->icache->cache);
+	retval = cache_init(0, cache_size, &ext2_inode_cache_ops,
+			    &fs->icache->cache);
 	if (retval)
 		goto errout;
 
-	for (i = 0; i < fs->icache->cache_size; i++) {
-		retval = ext2fs_get_mem(EXT2_INODE_SIZE(fs->super),
-					&fs->icache->cache[i].inode);
-		if (retval)
-			goto errout;
-	}
-
 	ext2fs_flush_icache(fs);
 	return 0;
 errout:
@@ -762,12 +876,12 @@ errcode_t ext2fs_read_inode2(ext2_filsys fs, ext2_ino_t ino,
 	unsigned long 	block, offset;
 	char 		*ptr;
 	errcode_t	retval;
-	unsigned	i;
 	int		clen, inodes_per_block;
 	io_channel	io;
 	int		length = EXT2_INODE_SIZE(fs->super);
 	struct ext2_inode_large	*iptr;
-	int		cache_slot, fail_csum;
+	struct ext2_inode_cache_ent *ent = NULL;
+	int		fail_csum;
 
 	EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
 
@@ -794,12 +908,12 @@ errcode_t ext2fs_read_inode2(ext2_filsys fs, ext2_ino_t ino,
 			return retval;
 	}
 	/* Check to see if it's in the inode cache */
-	for (i = 0; i < fs->icache->cache_size; i++) {
-		if (fs->icache->cache[i].ino == ino) {
-			memcpy(inode, fs->icache->cache[i].inode,
-			       (bufsize > length) ? length : bufsize);
-			return 0;
-		}
+	ext2_inode_cache_iget(fs, ino, CACHE_GET_INCORE, &ent);
+	if (ent) {
+		memcpy(inode, ent->raw, (bufsize > length) ? length : bufsize);
+		ext2_inode_cache_ibump(fs, ent);
+		ext2_inode_cache_iput(fs, ent);
+		return 0;
 	}
 	if (fs->flags & EXT2_FLAG_IMAGE_FILE) {
 		inodes_per_block = fs->blocksize / EXT2_INODE_SIZE(fs->super);
@@ -827,8 +941,10 @@ errcode_t ext2fs_read_inode2(ext2_filsys fs, ext2_ino_t ino,
 	}
 	offset &= (EXT2_BLOCK_SIZE(fs->super) - 1);
 
-	cache_slot = (fs->icache->cache_last + 1) % fs->icache->cache_size;
-	iptr = (struct ext2_inode_large *)fs->icache->cache[cache_slot].inode;
+	retval = ext2_inode_cache_iget(fs, ino, 0, &ent);
+	if (retval)
+		return retval;
+	iptr = (struct ext2_inode_large *)ent->raw;
 
 	ptr = (char *) iptr;
 	while (length) {
@@ -863,13 +979,15 @@ errcode_t ext2fs_read_inode2(ext2_filsys fs, ext2_ino_t ino,
 			       0, length);
 #endif
 
-	/* Update the inode cache bookkeeping */
-	if (!fail_csum) {
-		fs->icache->cache_last = cache_slot;
-		fs->icache->cache[cache_slot].ino = ino;
-	}
 	memcpy(inode, iptr, (bufsize > length) ? length : bufsize);
 
+	/* Update the inode cache bookkeeping */
+	if (!fail_csum)
+		ext2_inode_cache_ibump(fs, ent);
+	ext2_inode_cache_iput(fs, ent);
+	if (fail_csum)
+		ext2_inode_cache_ipurge(fs, ino, ent);
+
 	if (!(fs->flags & EXT2_FLAG_IGNORE_CSUM_ERRORS) &&
 	    !(flags & READ_INODE_NOCSUM) && fail_csum)
 		return EXT2_ET_INODE_CSUM_INVALID;
@@ -899,8 +1017,8 @@ errcode_t ext2fs_write_inode2(ext2_filsys fs, ext2_ino_t ino,
 	unsigned long block, offset;
 	errcode_t retval = 0;
 	struct ext2_inode_large *w_inode;
+	struct ext2_inode_cache_ent *ent;
 	char *ptr;
-	unsigned i;
 	int clen;
 	int length = EXT2_INODE_SIZE(fs->super);
 
@@ -933,19 +1051,20 @@ errcode_t ext2fs_write_inode2(ext2_filsys fs, ext2_ino_t ino,
 	}
 
 	/* Check to see if the inode cache needs to be updated */
-	if (fs->icache) {
-		for (i=0; i < fs->icache->cache_size; i++) {
-			if (fs->icache->cache[i].ino == ino) {
-				memcpy(fs->icache->cache[i].inode, inode,
-				       (bufsize > length) ? length : bufsize);
-				break;
-			}
-		}
-	} else {
+	if (!fs->icache) {
 		retval = ext2fs_create_inode_cache(fs, 4);
 		if (retval)
 			goto errout;
 	}
+
+	retval = ext2_inode_cache_iget(fs, ino, 0, &ent);
+	if (retval)
+		goto errout;
+
+	memcpy(ent->raw, inode, (bufsize > length) ? length : bufsize);
+	ext2_inode_cache_ibump(fs, ent);
+	ext2_inode_cache_iput(fs, ent);
+
 	memcpy(w_inode, inode, (bufsize > length) ? length : bufsize);
 
 	if (!(fs->flags & EXT2_FLAG_RW)) {
diff --git a/misc/Makefile.in b/misc/Makefile.in
index 48bd42b8272572..6f46ae7007018c 100644
--- a/misc/Makefile.in
+++ b/misc/Makefile.in
@@ -115,11 +115,11 @@ SRCS=	$(srcdir)/tune2fs.c $(srcdir)/mklost+found.c $(srcdir)/mke2fs.c $(srcdir)/
 
 LIBS= $(LIBEXT2FS) $(LIBCOM_ERR) $(LIBSUPPORT)
 DEPLIBS= $(LIBEXT2FS) $(DEPLIBCOM_ERR) $(DEPLIBSUPPORT)
-PROFILED_LIBS= $(LIBSUPPORT) $(PROFILED_LIBEXT2FS) $(PROFILED_LIBCOM_ERR)
-PROFILED_DEPLIBS= $(DEPLIBSUPPORT) $(PROFILED_LIBEXT2FS) $(DEPPROFILED_LIBCOM_ERR)
+PROFILED_LIBS= $(PROFILED_LIBEXT2FS) $(PROFILED_LIBSUPPORT) $(PROFILED_LIBCOM_ERR)
+PROFILED_DEPLIBS= $(PROFILED_LIBEXT2FS) $(DEPPROFILED_LIBSUPPORT) $(DEPPROFILED_LIBCOM_ERR)
 
-STATIC_LIBS= $(LIBSUPPORT) $(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR)
-STATIC_DEPLIBS= $(DEPLIBSUPPORT) $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR)
+STATIC_LIBS= $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) $(STATIC_LIBCOM_ERR)
+STATIC_DEPLIBS= $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBSUPPORT) $(DEPSTATIC_LIBCOM_ERR)
 
 LIBS_E2P= $(LIBE2P) $(LIBCOM_ERR)
 DEPLIBS_E2P= $(LIBE2P) $(DEPLIBCOM_ERR)
diff --git a/resize/Makefile.in b/resize/Makefile.in
index 27f721305e052e..101cdbeaa9f1ef 100644
--- a/resize/Makefile.in
+++ b/resize/Makefile.in
@@ -28,12 +28,13 @@ SRCS= $(srcdir)/extent.c \
 	$(srcdir)/resource_track.c \
 	$(srcdir)/sim_progress.c
 
-LIBS= $(LIBE2P) $(LIBEXT2FS) $(LIBCOM_ERR) $(LIBINTL) $(SYSLIBS)
-DEPLIBS= $(LIBE2P) $(LIBEXT2FS) $(DEPLIBCOM_ERR)
+LIBS= $(LIBE2P) $(LIBEXT2FS) $(LIBSUPPORT) $(LIBCOM_ERR) $(LIBINTL) $(SYSLIBS)
+DEPLIBS= $(LIBE2P) $(LIBEXT2FS) $(DEPLIBSUPPORT) $(DEPLIBCOM_ERR)
 
-STATIC_LIBS= $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) \
-	$(LIBINTL) $(SYSLIBS)
-DEPSTATIC_LIBS= $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBCOM_ERR) 
+STATIC_LIBS= $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) \
+	     $(STATIC_LIBCOM_ERR) $(LIBINTL) $(SYSLIBS)
+DEPSTATIC_LIBS= $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBSUPPORT) \
+		$(DEPSTATIC_LIBCOM_ERR)
 
 .c.o:
 	$(E) "	CC $<"
diff --git a/tests/fuzz/Makefile.in b/tests/fuzz/Makefile.in
index 949579e7c6501f..2b959f612e2079 100644
--- a/tests/fuzz/Makefile.in
+++ b/tests/fuzz/Makefile.in
@@ -24,9 +24,9 @@ LOCAL_LDFLAGS= @fuzzer_ldflags@
 LIBS= $(LIBEXT2FS) $(LIBCOM_ERR) $(LIBSUPPORT)
 DEPLIBS= $(LIBEXT2FS) $(DEPLIBCOM_ERR) $(DEPLIBSUPPORT)
 
-STATIC_LIBS= $(LIBSUPPORT) $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) \
+STATIC_LIBS= $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) $(STATIC_LIBSUPPORT) \
 	$(STATIC_LIBCOM_ERR)
-STATIC_DEPLIBS= $(DEPLIBSUPPORT) $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) \
+STATIC_DEPLIBS= $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBSUPPORT) \
 	$(DEPSTATIC_LIBCOM_ERR)
 
 FUZZ_LDFLAGS= $(ALL_LDFLAGS)
diff --git a/tests/progs/Makefile.in b/tests/progs/Makefile.in
index 1a8e9299a1c1ca..64069a52c57cd3 100644
--- a/tests/progs/Makefile.in
+++ b/tests/progs/Makefile.in
@@ -23,8 +23,8 @@ TEST_ICOUNT_OBJS=	test_icount.o test_icount_cmds.o
 SRCS=	$(srcdir)/test_icount.c \
 	$(srcdir)/test_rel.c
 
-LIBS= $(LIBEXT2FS) $(LIBSS) $(LIBCOM_ERR) $(SYSLIBS)
-DEPLIBS= $(LIBEXT2FS) $(DEPLIBSS) $(DEPLIBCOM_ERR)
+LIBS= $(LIBEXT2FS) $(LIBSUPPORT) $(LIBSS) $(LIBCOM_ERR) $(SYSLIBS)
+DEPLIBS= $(LIBEXT2FS) $(DEPLIBSUPPORT) $(DEPLIBSS) $(DEPLIBCOM_ERR)
 
 .c.o:
 	$(E) "	CC $<"



* [PATCH 1/4] libsupport: add pressure stall monitor
  2026-06-25 19:35 ` [PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure Darrick J. Wong
@ 2026-06-25 19:41   ` Darrick J. Wong
  2026-06-25 19:41   ` [PATCH 2/4] fuse2fs: only reclaim buffer cache when there is memory pressure Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:41 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Add a monitor that runs in a background thread, watches for resource
pressure stalls, and calls the registered handlers whenever a stall
occurs.  This will be useful for shrinking the buffer cache when memory
gets tight.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/support/list.h      |    6 +
 lib/support/psi.h       |   57 +++++
 lib/support/Makefile.in |    4 
 lib/support/iocache.c   |   19 ++
 lib/support/psi.c       |  510 +++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 596 insertions(+)
 create mode 100644 lib/support/psi.h
 create mode 100644 lib/support/psi.c
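
For anyone who has not used the kernel's pressure stall trigger interface,
here is a minimal sketch of the two steps the monitor relies on: formatting a
"some <stall_us> <window_us>" trigger string to write into the pressure file,
and polling the armed descriptor for POLLPRI.  The helper names
format_psi_trigger() and wait_for_pressure() are made up for illustration and
are not part of this patch.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the trigger string that gets written to a
 * pressure file.  Returns the string length, or -1 if buf is too small.
 */
int format_psi_trigger(char *buf, size_t len, uint64_t stall_us,
		       uint64_t window_us)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "some %llu %llu\n",
		     (unsigned long long)stall_us,
		     (unsigned long long)window_us);
	/* snprintf returns the untruncated length; n >= len means truncation */
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}

/*
 * Hypothetical helper: wait for one event on an armed trigger fd.
 * Returns 1 if the pressure threshold was hit (POLLPRI), 0 on timeout,
 * and -1 on error or if the event source went away.
 */
int wait_for_pressure(int fd, int timeout_ms)
{
	struct pollfd pfd;
	int n;

	memset(&pfd, 0, sizeof(pfd));
	pfd.fd = fd;
	pfd.events = POLLPRI;

	n = poll(&pfd, 1, timeout_ms);
	if (n < 0)
		return -1;
	if (n == 0)
		return 0;
	if (pfd.revents & (POLLERR | POLLNVAL))
		return -1;
	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A caller would open the pressure file read-write and non-blocking, write the
formatted string once, and then loop on wait_for_pressure(), treating a
timeout as a periodic wakeup.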


diff --git a/lib/support/list.h b/lib/support/list.h
index 54e8e236048af7..831a40efc9b4f0 100644
--- a/lib/support/list.h
+++ b/lib/support/list.h
@@ -4,6 +4,12 @@
 
 #include <stdbool.h>
 
+#ifdef __GNUC__
+#define EXT2FS_ATTR(x) __attribute__(x)
+#else
+#define EXT2FS_ATTR(x)
+#endif
+
 struct list_head {
 	struct list_head *next, *prev;
 };
diff --git a/lib/support/psi.h b/lib/support/psi.h
new file mode 100644
index 00000000000000..675ebeb553da3e
--- /dev/null
+++ b/lib/support/psi.h
@@ -0,0 +1,57 @@
+/*
+ * psi.h - Pressure stall monitor
+ *
+ * Copyright (C) 2025-2026 Oracle.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License.
+ * %End-Header%
+ */
+#ifndef __PSI_H__
+#define __PSI_H__
+
+struct psi;
+struct psi_handler;
+
+enum psi_type {
+	PSI_MEMORY,
+};
+
+void psi_destroy(struct psi **psip);
+
+/* call malloc_trim after calling handlers */
+#define PSI_TRIM_HEAP		(1U << 0)
+
+#define PSI_FLAGS		(PSI_TRIM_HEAP)
+
+int psi_create(enum psi_type type, unsigned int psi_flags,
+	       uint64_t stall_us, uint64_t window_us, uint64_t timeout_us,
+	       struct psi **psip);
+
+/* psi triggered due to timeout (and not pressure) */
+#define PSI_REASON_TIMEOUT	(1U << 0)
+
+/*
+ * Prototype of a function to call when a stall occurs.  Implementations must
+ * not block on any resource that a caller of psi_stop_thread might hold.
+ */
+typedef void (*psi_handler_fn)(const struct psi *psi, unsigned int reasons,
+			       void *data);
+
+int psi_add_handler(struct psi *psi, psi_handler_fn callback, void *data,
+		    struct psi_handler **hanp);
+void psi_del_handler(struct psi *psi, struct psi_handler **hanp);
+void psi_cancel_handler(struct psi *psi, struct psi_handler **hanp);
+
+int psi_start_thread(struct psi *psi);
+void psi_stop_thread(struct psi *psi);
+
+bool psi_thread_cancelled(const struct psi *psi);
+
+static inline bool psi_active(struct psi *psi)
+{
+	return psi != NULL;
+}
+
+#endif /* __PSI_H__ */
diff --git a/lib/support/Makefile.in b/lib/support/Makefile.in
index 22242758b4e618..7cab855c24fc50 100644
--- a/lib/support/Makefile.in
+++ b/lib/support/Makefile.in
@@ -23,6 +23,7 @@ OBJS=		bthread.o \
 		print_fs_flags.o \
 		profile_helpers.o \
 		prof_err.o \
+		psi.o \
 		quotaio.o \
 		quotaio_v2.o \
 		quotaio_tree.o \
@@ -41,6 +42,7 @@ SRCS=		$(srcdir)/argv_parse.c \
 		$(srcdir)/profile.c \
 		$(srcdir)/profile_helpers.c \
 		prof_err.c \
+		$(srcdir)/psi.c \
 		$(srcdir)/quotaio.c \
 		$(srcdir)/quotaio_tree.c \
 		$(srcdir)/quotaio_v2.c \
@@ -204,3 +206,5 @@ cache.o: $(srcdir)/cache.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(srcdir)/list.h $(srcdir)/cache.h
 iocache.o: $(srcdir)/iocache.c $(top_builddir)/lib/config.h \
  $(srcdir)/iocache.h $(srcdir)/cache.h $(srcdir)/list.h
+psi.o: $(srcdir)/psi.c $(top_builddir)/lib/config.h \
+ $(srcdir)/psi.h $(srcdir)/list.h
diff --git a/lib/support/iocache.c b/lib/support/iocache.c
index 4ed941ff2d3ae6..faf434a95becc6 100644
--- a/lib/support/iocache.c
+++ b/lib/support/iocache.c
@@ -452,6 +452,20 @@ static errcode_t iocache_set_option(io_channel channel, const char *option,
 	if (!strcmp(option, "cache"))
 		return 0;
 
+	if (!strcmp(option, "cache_auto_shrink")) {
+		if (!arg)
+			return EXT2_ET_INVALID_ARGUMENT;
+		if (!strcmp(arg, "on")) {
+			cache_set_flag(&data->cache, CACHE_AUTO_SHRINK);
+			return 0;
+		}
+		if (!strcmp(arg, "off")) {
+			cache_clear_flag(&data->cache, CACHE_AUTO_SHRINK);
+			return 0;
+		}
+		return EXT2_ET_INVALID_ARGUMENT;
+	}
+
 	if (!strcmp(option, "cache_blocks")) {
 		long long size;
 
@@ -467,6 +481,11 @@ static errcode_t iocache_set_option(io_channel channel, const char *option,
 		return 0;
 	}
 
+	if (!strcmp(option, "cache_shrink")) {
+		cache_shrink(&data->cache);
+		return 0;
+	}
+
 	retval = iocache_flush_cache(data);
 	if (retval)
 		return retval;
diff --git a/lib/support/psi.c b/lib/support/psi.c
new file mode 100644
index 00000000000000..26ce6ee1985641
--- /dev/null
+++ b/lib/support/psi.c
@@ -0,0 +1,510 @@
+/*
+ * psi.c - Pressure stall monitor
+ *
+ * Copyright (C) 2025-2026 Oracle.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License.
+ * %End-Header%
+ */
+#include "config.h"
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <errno.h>
+#include <poll.h>
+#include <pthread.h>
+#include <malloc.h>
+#include <signal.h>
+#include <limits.h>
+
+#include "support/list.h"
+#include "support/psi.h"
+
+enum psi_state {
+	/* waiting to be put in the running state */
+	PSI_WAITING,
+	/* running */
+	PSI_RUNNING,
+	/* cancelled */
+	PSI_CANCELLED,
+};
+
+struct psi_handler {
+	struct list_head list;
+	psi_handler_fn callback;
+	void *data;
+};
+
+struct psi {
+	int system_fd;
+	int cgroup_fd;
+	unsigned int flags;
+	uint64_t timeout_us;
+
+	pthread_t thread;
+	pthread_mutex_t lock;
+	struct list_head handlers;
+
+	enum psi_type type;
+	enum psi_state state;
+};
+
+static const char *psi_system_path(enum psi_type type)
+{
+	switch (type) {
+	case PSI_MEMORY:
+		return "/proc/pressure/memory";
+	default:
+		return NULL;
+	}
+}
+
+static const char *psi_cgroup_fname(enum psi_type type)
+{
+	switch (type) {
+	case PSI_MEMORY:
+		return "memory.pressure";
+	default:
+		return NULL;
+	}
+}
+
+static const char *psi_shortname(enum psi_type type)
+{
+	switch (type) {
+	case PSI_MEMORY:
+		return "psi:memory";
+	default:
+		return NULL;
+	}
+}
+
+static void psi_run_callbacks(struct psi *psi, unsigned int reasons)
+{
+	struct psi_handler *h, *i;
+
+	pthread_mutex_lock(&psi->lock);
+	list_for_each_entry_safe(h, i, &psi->handlers, list)
+		h->callback(psi, reasons, h->data);
+	pthread_mutex_unlock(&psi->lock);
+
+	if (psi->flags & PSI_TRIM_HEAP)
+		malloc_trim(0);
+}
+
+static inline void psi_fill_pollfd(struct pollfd *pfd, int fd)
+{
+	memset(pfd, 0, sizeof(*pfd));
+	pfd->fd = fd;
+	pfd->events = POLLPRI | POLLRDHUP | POLLERR | POLLHUP;
+}
+
+static unsigned int psi_fill_pollfds(struct psi *psi, struct pollfd *pfds)
+{
+	unsigned int ret = 0;
+
+	if (psi->system_fd >= 0) {
+		psi_fill_pollfd(pfds, psi->system_fd);
+		pfds++;
+		ret++;
+	}
+
+	if (psi->cgroup_fd >= 0) {
+		psi_fill_pollfd(pfds, psi->cgroup_fd);
+		pfds++;
+		ret++;
+	}
+
+	return ret;
+}
+
+static void *psi_thread(void *arg)
+{
+	struct psi *psi = arg;
+	int oldstate;
+
+	/*
+	 * Don't let pthread_cancel kill us except while we're in poll()
+	 * because we don't hold any resources during that call.  Everywhere
+	 * else, there could be resource cleanups that would have to be done.
+	 * Hence we just turn off cancelling for simplicity's sake.
+	 */
+	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
+
+	pthread_mutex_lock(&psi->lock);
+	psi->state = PSI_RUNNING;
+	pthread_mutex_unlock(&psi->lock);
+
+	while (1) {
+		struct pollfd pfds[2] = { { .fd = -1 }, { .fd = -1 } };
+		unsigned int nr_pfds;
+		int timeout_ms;
+		int n;
+
+		pthread_mutex_lock(&psi->lock);
+		if (psi_thread_cancelled(psi)) {
+			pthread_mutex_unlock(&psi->lock);
+			break;
+		}
+
+		nr_pfds = psi_fill_pollfds(psi, pfds);
+		timeout_ms = psi->timeout_us ? psi->timeout_us / 1000 : -1;
+		pthread_mutex_unlock(&psi->lock);
+
+		pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
+		n = poll(pfds, nr_pfds, timeout_ms);
+		pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
+		if (n == 0) {
+			/* run callbacks on timeout */
+			psi_run_callbacks(psi, PSI_REASON_TIMEOUT);
+			continue;
+		}
+		if (n < 0) {
+			perror(psi_shortname(psi->type));
+			break;
+		}
+
+		/* psi fd closed */
+		if ((pfds[0].revents & POLLNVAL) ||
+		    (pfds[1].revents & POLLNVAL))
+			break;
+
+		if ((pfds[0].revents & POLLERR) ||
+		    (pfds[1].revents & POLLERR)) {
+			fprintf(stderr, "%s: event source dead?\n",
+				psi_shortname(psi->type));
+			break;
+		}
+
+		/* POLLPRI on a psi fd means we hit the pressure threshold */
+		if ((pfds[0].revents & POLLPRI) ||
+		    (pfds[1].revents & POLLPRI)) {
+			psi_run_callbacks(psi, 0);
+			continue;
+		}
+
+		fprintf(stderr, "%s: unknown events 0x%x/0x%x, ignoring.\n",
+			psi_shortname(psi->type), pfds[0].revents,
+			pfds[1].revents);
+	}
+
+	pthread_setcancelstate(oldstate, NULL);
+	return NULL;
+}
+
+/* Call a function whenever there is resource pressure */
+int psi_add_handler(struct psi *psi, psi_handler_fn callback, void *data,
+		    struct psi_handler **hanp)
+{
+	struct psi_handler *handler;
+
+	handler = malloc(sizeof(*handler));
+	if (!handler)
+		return -1;
+
+	INIT_LIST_HEAD(&handler->list);
+	handler->callback = callback;
+	handler->data = data;
+
+	pthread_mutex_lock(&psi->lock);
+	list_add_tail(&handler->list, &psi->handlers);
+	pthread_mutex_unlock(&psi->lock);
+
+	*hanp = handler;
+	return 0;
+}
+
+/* Stop calling this handler when there is resource pressure */
+void psi_del_handler(struct psi *psi, struct psi_handler **hanp)
+{
+	struct psi_handler *handler = *hanp;
+
+	if (handler) {
+		pthread_mutex_lock(&psi->lock);
+		list_del_init(&handler->list);
+		pthread_mutex_unlock(&psi->lock);
+		free(handler);
+	}
+
+	*hanp = NULL;
+}
+
+/* Remove a handler from inside its own callback; psi->lock is held. */
+void psi_cancel_handler(struct psi *psi, struct psi_handler **hanp)
+{
+	struct psi_handler *handler = *hanp;
+
+	if (handler) {
+		list_del_init(&handler->list);
+		free(handler);
+	}
+
+	*hanp = NULL;
+}
+
+/*
+ * Stop monitoring for resource pressure stalls.  The monitor cannot be
+ * restarted after this call completes.
+ */
+void psi_stop_thread(struct psi *psi)
+{
+	int system_fd;
+	int cgroup_fd;
+	enum psi_state old_state;
+
+	pthread_mutex_lock(&psi->lock);
+	system_fd = psi->system_fd;
+	cgroup_fd = psi->cgroup_fd;
+	old_state = psi->state;
+	psi->system_fd = -1;
+	psi->cgroup_fd = -1;
+	psi->state = PSI_CANCELLED;
+	pthread_mutex_unlock(&psi->lock);
+
+	if (system_fd >= 0)
+		close(system_fd);
+	if (cgroup_fd >= 0)
+		close(cgroup_fd);
+
+	if (old_state == PSI_RUNNING) {
+		/* Cancelling the thread interrupts the poll() call */
+		pthread_cancel(psi->thread);
+		pthread_join(psi->thread, NULL);
+	}
+}
+
+/* Has this stall monitor been cancelled, or was it never created? */
+bool psi_thread_cancelled(const struct psi *psi)
+{
+	return !psi || psi->state == PSI_CANCELLED;
+}
+
+/* Stop the monitor thread, then destroy this pressure stall monitor */
+void psi_destroy(struct psi **psip)
+{
+	struct psi *psi = *psip;
+
+	if (psi) {
+		psi_stop_thread(psi);
+		pthread_mutex_destroy(&psi->lock);
+		free(psi);
+	}
+
+	*psip = NULL;
+}
+
+static int psi_open_control(const char *path)
+{
+	return open(path, O_RDWR | O_NONBLOCK);
+}
+
+static void psi_open_system_control(struct psi *psi)
+{
+	/* PSI may not exist, so we don't error if it's not there */
+	psi->system_fd = psi_open_control(psi_system_path(psi->type));
+}
+
+static ssize_t psi_cgroup_path(enum psi_type type, char *path, size_t pathsize)
+{
+	char cgpath[PATH_MAX];
+	char *p = cgpath;
+	ssize_t bytes;
+	int pscgroupfd = open("/proc/self/cgroup", O_RDONLY);
+	int nr_colons = 0;
+
+	/*
+	 * Read the contents of /proc/self/cgroup, which should have the
+	 * format:
+	 *
+	 * <id>:<stuff>:<absolute path under cgroupfs>\n
+	 *
+	 * We care about the cgroupfs path (column 3) and not the newline.
+	 */
+	if (pscgroupfd < 0)
+		return 0;
+
+	bytes = read(pscgroupfd, cgpath, sizeof(cgpath) - 1);
+	close(pscgroupfd);
+	if (bytes < 0)
+		return 0;
+	cgpath[bytes] = 0;
+
+	/*
+	 * Find the second colon, turn it into a dot so that we have a relative
+	 * path.  sysfs paths can contain colons, so this will always be the
+	 * last column... right?
+	 */
+	while (*p != 0) {
+		if (*p == ':')
+			nr_colons++;
+		if (nr_colons == 2) {
+			*p = '.';
+			break;
+		}
+
+		p++;
+	}
+
+	if (nr_colons != 2)
+		return 0;
+
+	/* Trim trailing newline, p points to column 3 */
+	bytes = strlen(p);
+	if (p[bytes - 1] == '\n')
+		p[bytes - 1] = 0;
+
+	/* /sys/fs/cgroup/$col3/$psi_cgroup_fname */
+	return snprintf(path, pathsize, "/sys/fs/cgroup/%s/%s", p,
+			psi_cgroup_fname(type));
+}
+
+static void psi_open_cgroup_control(struct psi *psi)
+{
+	char path[PATH_MAX];
+	ssize_t pathlen;
+
+	pathlen = psi_cgroup_path(psi->type, path, sizeof(path));
+	if (!pathlen || pathlen >= sizeof(path)) {
+		psi->cgroup_fd = -1;
+		return;
+	}
+
+	/* PSI may not exist, so we don't error if it's not there */
+	psi->cgroup_fd = psi_open_control(path);
+}
+
+static int psi_config_fd(struct psi *psi, int fd, uint64_t stall_us,
+			 uint64_t window_us)
+{
+	char buf[256];
+	size_t bytes;
+	ssize_t written;
+
+	if (fd < 0)
+		return 0;
+
+	/*
+	 * The kernel blindly nulls out the last byte we write into the psi
+	 * file, so put a newline at the end because I bet they're only testing
+	 * this with bash scripts.
+	 */
+	bytes = snprintf(buf, sizeof(buf), "some %llu %llu\n",
+			 (unsigned long long)stall_us,
+			 (unsigned long long)window_us);
+	if (bytes >= sizeof(buf))
+		return -1;
+
+	written = write(fd, buf, bytes);
+	if (written >= 0 && written != bytes) {
+		written = -1;
+		errno = EMSGSIZE;
+	}
+	if (written < 0) {
+		perror(psi_shortname(psi->type));
+		return -1;
+	}
+
+	return 0;
+}
+
+static inline struct psi *psi_alloc(enum psi_type type, unsigned int psi_flags,
+				    uint64_t timeout_us)
+{
+	struct psi *psi = calloc(1, sizeof(*psi));
+	if (!psi)
+		return NULL;
+
+	psi->type = type;
+	psi->flags = psi_flags;
+	psi->timeout_us = timeout_us;
+	psi->state = PSI_WAITING;
+	INIT_LIST_HEAD(&psi->handlers);
+	pthread_mutex_init(&psi->lock, NULL);
+
+	return psi;
+}
+
+/*
+ * Create a pressure stall indicator monitor thread that monitors for
+ * resource availability stalls exceeding @stall_us in any @window_us time
+ * period and calls any attached handlers.  If @timeout_us is nonzero, the
+ * handlers will be called at that interval with PSI_REASON_TIMEOUT.
+ *
+ * Unprivileged processes are not allowed to set a @window_us that is not a
+ * multiple of 2 seconds(!)
+ *
+ * Returns 0 on success.  On error, returns -1 and sets errno.  errno values
+ * are as follows:
+ *
+ *    ENOENT means the monitor file cannot be found
+ *    EACCES or EPERM mean that monitoring is not available
+ *    EINVAL means the window values are not valid
+ *    Any other errno is a sign of deeper problems
+ */
+int psi_create(enum psi_type type, unsigned int psi_flags, uint64_t stall_us,
+	       uint64_t window_us, uint64_t timeout_us, struct psi **psip)
+{
+	struct psi *psi;
+	int ret;
+
+	if (psi_flags & ~PSI_FLAGS) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	psi = psi_alloc(type, psi_flags, timeout_us);
+	if (!psi)
+		return -1;
+
+	psi_open_system_control(psi);
+	psi_open_cgroup_control(psi);
+
+	if (psi->system_fd < 0 && psi->cgroup_fd < 0 && !psi->timeout_us) {
+		errno = ENOENT;
+		goto out_fds;
+	}
+
+	ret = psi_config_fd(psi, psi->system_fd, stall_us, window_us);
+	if (ret)
+		goto out_fds;
+
+	ret = psi_config_fd(psi, psi->cgroup_fd, stall_us, window_us);
+	if (ret)
+		goto out_fds;
+
+	*psip = psi;
+	return 0;
+
+out_fds:
+	psi_destroy(&psi);
+	return -1;
+}
+
+/* Start monitoring for resource pressure stalls */
+int psi_start_thread(struct psi *psi)
+{
+	int error;
+
+	if (psi->state != PSI_WAITING) {
+		fprintf(stderr, "%s: psi already torn down\n",
+			psi_shortname(psi->type));
+		errno = EINVAL;
+		return -1;
+	}
+
+	error = pthread_create(&psi->thread, NULL, psi_thread, psi);
+	if (error) {
+		fprintf(stderr, "%s: could not create thread: %s\n",
+			psi_shortname(psi->type), strerror(error));
+		errno = error;
+		return -1;
+	}
+
+	pthread_setname_np(psi->thread, psi_shortname(psi->type));
+	return 0;
+}
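
The cgroup path derivation in psi_cgroup_path() is the least obvious part of
this patch, so here is a standalone sketch of the same idea.  It takes one
"<id>:<controllers>:<path>" line as found in /proc/self/cgroup and builds the
matching memory.pressure path under /sys/fs/cgroup.  This is a variant, not
the patch's code: the name cgroup_pressure_path() is invented, and it stops at
the first newline instead of trimming only a trailing one.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: turn "<id>:<controllers>:<path>\n" into
 * "/sys/fs/cgroup<path>/memory.pressure".
 * Returns 0 on success, -1 on malformed input or truncation.
 */
int cgroup_pressure_path(const char *line, char *out, size_t outsz)
{
	const char *p = line;
	size_t len;
	int colons = 0;
	int n;

	assert(line != NULL && out != NULL);

	/*
	 * Skip past the second colon.  The path column may itself contain
	 * colons, so count from the front rather than searching from the end.
	 */
	while (*p && colons < 2) {
		if (*p == ':')
			colons++;
		p++;
	}
	if (colons != 2 || *p != '/')
		return -1;

	/* Use only the first line of input. */
	len = strcspn(p, "\n");

	n = snprintf(out, outsz, "/sys/fs/cgroup%.*s/memory.pressure",
		     (int)len, p);
	if (n < 0 || (size_t)n >= outsz)
		return -1;
	return 0;
}
```

Counting colons from the front matters because a cgroup name such as
"a:b.scope" is legal; splitting at the last colon would cut the path short.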



* [PATCH 2/4] fuse2fs: only reclaim buffer cache when there is memory pressure
  2026-06-25 19:35 ` [PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure Darrick J. Wong
  2026-06-25 19:41   ` [PATCH 1/4] libsupport: add pressure stall monitor Darrick J. Wong
@ 2026-06-25 19:41   ` Darrick J. Wong
  2026-06-25 19:41   ` [PATCH 3/4] fuse4fs: enable memory pressure monitoring with service containers Darrick J. Wong
  2026-06-25 19:41   ` [PATCH 4/4] fuse2fs: flush dirty metadata periodically Darrick J. Wong
  3 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:41 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Use the pressure stall monitor added in the previous patch so that
fuse2fs and fuse4fs shrink the buffer cache only when there is memory
pressure (or when the five-minute fallback timeout expires).

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/Makefile.in |    3 +-
 fuse4fs/fuse4fs.c   |   84 +++++++++++++++++++++++++++++++++++++++++++++++++++
 misc/Makefile.in    |    3 +-
 misc/fuse2fs.c      |   84 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 172 insertions(+), 2 deletions(-)
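
The setup code in this patch sorts psi_create() failures into "monitoring is
simply not available here" and "something is really wrong".  A minimal sketch
of that policy, following the errno meanings documented above psi_create() in
the previous patch; the enum and the function name classify_psi_setup() are
invented for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical three-way outcome for monitor setup */
enum psi_setup {
	PSI_SETUP_OK,		/* monitor created */
	PSI_SETUP_UNAVAILABLE,	/* no PSI here; carry on without it */
	PSI_SETUP_FATAL,	/* unexpected failure; refuse to mount */
};

/* Classify a psi_create()-style result (0 or -1) plus the saved errno. */
enum psi_setup classify_psi_setup(int ret, int err)
{
	/* A failure must carry an errno value. */
	assert(ret == 0 || err != 0);

	if (ret == 0)
		return PSI_SETUP_OK;

	switch (err) {
	case ENOENT:	/* monitor file cannot be found */
	case EINVAL:	/* window values were rejected */
	case EACCES:	/* monitoring is not available */
	case EPERM:
		return PSI_SETUP_UNAVAILABLE;
	default:
		return PSI_SETUP_FATAL;
	}
}
```

In the unavailable case there is no monitor object to attach a handler to, so
a caller would skip handler registration and leave the cache on its default
shrinking behavior.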


diff --git a/fuse4fs/Makefile.in b/fuse4fs/Makefile.in
index 44ae3b78a29b9d..f7475c5616ca7e 100644
--- a/fuse4fs/Makefile.in
+++ b/fuse4fs/Makefile.in
@@ -180,7 +180,8 @@ fuse4fs.o: $(srcdir)/fuse4fs.c $(top_builddir)/lib/config.h \
  $(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/lib/support/bthread.h \
  $(top_srcdir)/lib/support/thread.h $(top_srcdir)/lib/support/list.h \
  $(top_srcdir)/lib/support/cache.h $(top_srcdir)/version.h \
- $(top_srcdir)/lib/e2p/e2p.h $(top_srcdir)/lib/support/iocache.h
+ $(top_srcdir)/lib/e2p/e2p.h $(top_srcdir)/lib/support/iocache.h \
+ $(top_srcdir)/lib/support/psi.h
 journal.o: $(srcdir)/../debugfs/journal.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(srcdir)/../debugfs/journal.h \
  $(top_srcdir)/e2fsck/jfs_user.h $(top_srcdir)/e2fsck/e2fsck.h \
diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 9744c0941cf31b..9b04d12f7c8762 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -61,6 +61,7 @@
 #include "support/list.h"
 #include "support/cache.h"
 #include "support/iocache.h"
+#include "support/psi.h"
 
 #include "../version.h"
 #include "uuid/uuid.h"
@@ -304,6 +305,8 @@ struct fuse4fs {
 # endif
 	int bdev_fd;
 #endif
+	struct psi *mem_psi;
+	struct psi_handler *mem_psi_handler;
 };
 
 #ifdef HAVE_FUSE4FS_SERVICE
@@ -904,6 +907,74 @@ static void fuse4fs_mmp_destroy(struct fuse4fs *ff)
 # define fuse4fs_mmp_destroy(...)	((void)0)
 #endif
 
+static void fuse4fs_psi_memory(const struct psi *psi, unsigned int reasons,
+			       void *data)
+{
+	struct fuse4fs *ff = data;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	fs = fuse4fs_start(ff);
+	dbg_printf(ff, "%s:\n", __func__);
+	if (fs && !psi_thread_cancelled(ff->mem_psi)) {
+		err = io_channel_set_options(fs->io, "cache_shrink");
+		if (err)
+			ret = translate_error(fs, 0, err);
+	} else {
+		psi_cancel_handler(ff->mem_psi, &ff->mem_psi_handler);
+	}
+	fuse4fs_finish(ff, ret);
+}
+
+static int fuse4fs_psi_config(struct fuse4fs *ff)
+{
+	errcode_t err;
+
+	/*
+	 * Activate when there are memory stalls for 200ms every 2s; or
+	 * 5min goes by.  Unprivileged processes can only use 2s windows.
+	 */
+	err = psi_create(PSI_MEMORY, PSI_TRIM_HEAP, 200000, 2000000,
+			 5 * 60 * 1000000, &ff->mem_psi);
+	if (err) {
+		switch (errno) {
+		case ENOENT:
+		case EINVAL:
+		case EACCES:
+		case EPERM:
+			return 0;
+		default:
+			err_printf(ff, "PSI: %s.\n", error_message(errno));
+			return -1;
+		}
+	}
+
+	err = psi_add_handler(ff->mem_psi, fuse4fs_psi_memory, ff,
+			      &ff->mem_psi_handler);
+	if (err) {
+		err_printf(ff, "PSI: %s.\n", error_message(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void fuse4fs_psi_start(struct fuse4fs *ff)
+{
+	if (psi_active(ff->mem_psi))
+		psi_start_thread(ff->mem_psi);
+}
+
+static void fuse4fs_psi_destroy(struct fuse4fs *ff)
+{
+	if (!psi_active(ff->mem_psi))
+		return;
+
+	psi_del_handler(ff->mem_psi, &ff->mem_psi_handler);
+	psi_destroy(&ff->mem_psi);
+}
+
 static inline struct fuse4fs *fuse4fs_get(fuse_req_t req)
 {
 	return (struct fuse4fs *)fuse_req_userdata(req);
@@ -1744,6 +1815,11 @@ static errcode_t fuse4fs_config_cache(struct fuse4fs *ff)
 		return err;
 	}
 
+	if (psi_active(ff->mem_psi)) {
+		snprintf(buf, sizeof(buf), "cache_auto_shrink=off");
+		io_channel_set_options(ff->fs->io, buf);
+	}
+
 	return 0;
 }
 
@@ -2042,6 +2118,7 @@ static void op_init(void *userdata, struct fuse_conn_info *conn)
 	 * conveyed to the new child process.
 	 */
 	fuse4fs_mmp_start(ff);
+	fuse4fs_psi_start(ff);
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17)
 	/*
@@ -6544,6 +6621,12 @@ int main(int argc, char *argv[])
 	try_set_io_flusher(&fctx);
 	try_adjust_oom_score(&fctx);
 
+	err = fuse4fs_psi_config(&fctx);
+	if (err) {
+		ret |= 32;
+		goto out;
+	}
+
 	/* Will we allow users to allocate every last block? */
 	if (getenv("FUSE4FS_ALLOC_ALL_BLOCKS")) {
 		log_printf(&fctx, "%s\n",
@@ -6636,6 +6719,7 @@ int main(int argc, char *argv[])
  _("Mount failed while opening filesystem.  Check dmesg(1) for details."));
 		fflush(orig_stderr);
 	}
+	fuse4fs_psi_destroy(&fctx);
 	fuse4fs_mmp_destroy(&fctx);
 	fuse4fs_unmount(&fctx);
 	reset_com_err_hook();
diff --git a/misc/Makefile.in b/misc/Makefile.in
index 6f46ae7007018c..be668f69745dc2 100644
--- a/misc/Makefile.in
+++ b/misc/Makefile.in
@@ -881,7 +881,8 @@ fuse2fs.o: $(srcdir)/fuse2fs.c $(top_builddir)/lib/config.h \
  $(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/ext2fsP.h \
  $(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/version.h \
  $(top_srcdir)/lib/e2p/e2p.h $(top_srcdir)/lib/support/cache.h \
- $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/support/iocache.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/support/iocache.h \
+ $(top_srcdir)/lib/support/psi.h
 e2fuzz.o: $(srcdir)/e2fuzz.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/ext2fs/ext2_fs.h \
  $(top_builddir)/lib/ext2fs/ext2_types.h $(top_srcdir)/lib/ext2fs/ext2fs.h \
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index caa675bc9f95e9..92ecc4ca431a95 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -52,6 +52,7 @@
 #include "support/list.h"
 #include "support/cache.h"
 #include "support/iocache.h"
+#include "support/psi.h"
 
 #include "../version.h"
 #include "uuid/uuid.h"
@@ -279,6 +280,8 @@ struct fuse2fs {
 	/* options set by fuse_opt_parse must be of type int */
 	int timing;
 #endif
+	struct psi *mem_psi;
+	struct psi_handler *mem_psi_handler;
 };
 
 #define FUSE2FS_CHECK_HANDLE(ff, fh) \
@@ -716,6 +719,74 @@ static void fuse2fs_mmp_destroy(struct fuse2fs *ff)
 # define fuse2fs_mmp_destroy(...)	((void)0)
 #endif
 
+static void fuse2fs_psi_memory(const struct psi *psi, unsigned int reasons,
+			       void *data)
+{
+	struct fuse2fs *ff = data;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	fs = fuse2fs_start(ff);
+	dbg_printf(ff, "%s:\n", __func__);
+	if (fs && !psi_thread_cancelled(ff->mem_psi)) {
+		err = io_channel_set_options(fs->io, "cache_shrink");
+		if (err)
+			ret = translate_error(fs, 0, err);
+	} else {
+		psi_cancel_handler(ff->mem_psi, &ff->mem_psi_handler);
+	}
+	fuse2fs_finish(ff, ret);
+}
+
+static int fuse2fs_psi_config(struct fuse2fs *ff)
+{
+	errcode_t err;
+
+	/*
+	 * Activate when there are memory stalls for 200ms every 2s; or
+	 * 5min goes by.  Unprivileged processes can only use 2s windows.
+	 */
+	err = psi_create(PSI_MEMORY, PSI_TRIM_HEAP, 200000, 2000000,
+			 5 * 60 * 1000000, &ff->mem_psi);
+	if (err) {
+		switch (errno) {
+		case ENOENT:
+		case EINVAL:
+		case EACCES:
+		case EPERM:
+			return 0;
+		default:
+			err_printf(ff, "PSI: %s.\n", error_message(errno));
+			return -1;
+		}
+	}
+
+	err = psi_add_handler(ff->mem_psi, fuse2fs_psi_memory, ff,
+			      &ff->mem_psi_handler);
+	if (err) {
+		err_printf(ff, "PSI: %s.\n", error_message(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void fuse2fs_psi_start(struct fuse2fs *ff)
+{
+	if (psi_active(ff->mem_psi))
+		psi_start_thread(ff->mem_psi);
+}
+
+static void fuse2fs_psi_destroy(struct fuse2fs *ff)
+{
+	if (!psi_active(ff->mem_psi))
+		return;
+
+	psi_del_handler(ff->mem_psi, &ff->mem_psi_handler);
+	psi_destroy(&ff->mem_psi);
+}
+
 static inline struct fuse2fs *fuse2fs_get(void)
 {
 	struct fuse_context *ctxt = fuse_get_context();
@@ -1332,6 +1403,11 @@ static errcode_t fuse2fs_config_cache(struct fuse2fs *ff)
 		return err;
 	}
 
+	if (psi_active(ff->mem_psi)) {
+		snprintf(buf, sizeof(buf), "cache_auto_shrink=off");
+		err = io_channel_set_options(ff->fs->io, buf);
+	}
+
 	return 0;
 }
 
@@ -1641,6 +1717,7 @@ static void *op_init(struct fuse_conn_info *conn,
 	 * conveyed to the new child process.
 	 */
 	fuse2fs_mmp_start(ff);
+	fuse2fs_psi_start(ff);
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17)
 	/*
@@ -5569,6 +5646,12 @@ int main(int argc, char *argv[])
 	try_set_io_flusher(&fctx);
 	try_adjust_oom_score(&fctx);
 
+	err = fuse2fs_psi_config(&fctx);
+	if (err) {
+		ret |= 32;
+		goto out;
+	}
+
 	/* Will we allow users to allocate every last block? */
 	if (getenv("FUSE2FS_ALLOC_ALL_BLOCKS")) {
 		log_printf(&fctx, "%s\n",
@@ -5658,6 +5741,7 @@ int main(int argc, char *argv[])
  _("Mount failed while opening filesystem.  Check dmesg(1) for details."));
 		fflush(orig_stderr);
 	}
+	fuse2fs_psi_destroy(&fctx);
 	fuse2fs_mmp_destroy(&fctx);
 	fuse2fs_unmount(&fctx);
 	reset_com_err_hook();



* [PATCH 3/4] fuse4fs: enable memory pressure monitoring with service containers
  2026-06-25 19:35 ` [PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure Darrick J. Wong
  2026-06-25 19:41   ` [PATCH 1/4] libsupport: add pressure stall monitor Darrick J. Wong
  2026-06-25 19:41   ` [PATCH 2/4] fuse2fs: only reclaim buffer cache when there is memory pressure Darrick J. Wong
@ 2026-06-25 19:41   ` Darrick J. Wong
  2026-06-25 19:41   ` [PATCH 4/4] fuse2fs: flush dirty metadata periodically Darrick J. Wong
  3 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:41 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Ask the fuse filesystem service mount helper to open the memory pressure
stall files because we cannot open them ourselves.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/support/psi.h |    9 +++++++
 fuse4fs/fuse4fs.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 lib/support/psi.c |   53 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 130 insertions(+), 5 deletions(-)
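
Background for this patch: a contained server cannot open files itself, so a
helper outside the container opens them and hands over the descriptors.  The
sketch below shows the usual way to pass an open descriptor across an AF_UNIX
socket with SCM_RIGHTS.  It is an assumption that the libfuse service calls
used here work this way underneath; send_fd() and recv_fd() are illustrative
names, not libfuse API.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open fd over a connected AF_UNIX socket. */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr hdr;	/* forces correct alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor refers to the same open file description as the
sender's, so the open flags chosen by the helper (O_RDWR | O_NONBLOCK for the
pressure files) carry over to the server.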


diff --git a/lib/support/psi.h b/lib/support/psi.h
index 675ebeb553da3e..916ebf15d17431 100644
--- a/lib/support/psi.h
+++ b/lib/support/psi.h
@@ -54,4 +54,13 @@ static inline bool psi_active(struct psi *psi)
 	return psi != NULL;
 }
 
+char *psi_system_path(enum psi_type type);
+ssize_t psi_cgroup_path(enum psi_type type, char *path, size_t pathsize);
+
+#define PSI_OPEN_FLAGS (O_RDWR | O_NONBLOCK)
+
+int psi_create_from(enum psi_type type, unsigned int psi_flags,
+		    uint64_t stall_us, uint64_t window_us, uint64_t timeout_us,
+		    int *system_fd, int *cgroup_fd, struct psi **psip);
+
 #endif /* __PSI_H__ */
diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 9b04d12f7c8762..83855e4c5e63d7 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -304,6 +304,8 @@ struct fuse4fs {
 	char *svc_cmdline;
 # endif
 	int bdev_fd;
+	int psi_sys_mem_fd;
+	int psi_cgroup_mem_fd;
 #endif
 	struct psi *mem_psi;
 	struct psi_handler *mem_psi_handler;
@@ -935,8 +937,16 @@ static int fuse4fs_psi_config(struct fuse4fs *ff)
 	 * Activate when there are memory stalls for 200ms every 2s; or
 	 * 5min goes by.  Unprivileged processes can only use 2s windows.
 	 */
-	err = psi_create(PSI_MEMORY, PSI_TRIM_HEAP, 200000, 2000000,
-			 5 * 60 * 1000000, &ff->mem_psi);
+#ifdef HAVE_FUSE4FS_SERVICE
+	if (fuse4fs_is_service(ff))
+		err = psi_create_from(PSI_MEMORY, PSI_TRIM_HEAP, 202002,
+				      2000000, 5 * 60 * 1000000,
+				      &ff->psi_sys_mem_fd,
+				      &ff->psi_cgroup_mem_fd, &ff->mem_psi);
+	else
+#endif
+		err = psi_create(PSI_MEMORY, PSI_TRIM_HEAP, 202002, 2000000,
+				 5 * 60 * 1000000, &ff->mem_psi);
 	if (err) {
 		switch (errno) {
 		case ENOENT:
@@ -1438,6 +1448,14 @@ static int fuse4fs_service_exit(struct fuse4fs *ff, int exitcode)
 	close(ff->bdev_fd);
 	ff->bdev_fd = -1;
 
+	if (ff->psi_sys_mem_fd >= 0)
+		close(ff->psi_sys_mem_fd);
+	ff->psi_sys_mem_fd = -1;
+
+	if (ff->psi_cgroup_mem_fd >= 0)
+		close(ff->psi_cgroup_mem_fd);
+	ff->psi_cgroup_mem_fd = -1;
+
 	return fuse_service_exit(exitcode);
 }
 
@@ -1480,12 +1498,61 @@ static int fuse4fs_service_open_bdev(struct fuse4fs *ff)
 	return 0;
 }
 
+/* Open pressure stall control files for self monitoring */
+static int fuse4fs_service_open_psi_controls(struct fuse4fs *ff)
+{
+	const char *psifile = psi_system_path(PSI_MEMORY);
+	char cgpath[PATH_MAX];
+	ssize_t cgpathlen;
+	int fd;
+	int ret;
+
+	ret = fuse_service_request_file(ff->service, psifile, PSI_OPEN_FLAGS,
+					0, FUSE_SERVICE_REQUEST_FILE_QUIET);
+	if (ret)
+		return ret;
+
+	ret = fuse_service_receive_file(ff->service, psifile, &fd);
+	if (ret)
+		return ret;
+	if (fd < 0)
+		err_printf(ff, "%s %s: %s.\n",
+			   _("opening system memory pressure monitor"),
+			   psifile, strerror(-fd));
+	ff->psi_sys_mem_fd = fd;
+
+	cgpathlen = psi_cgroup_path(PSI_MEMORY, cgpath, sizeof(cgpath));
+	if (!cgpathlen || cgpathlen >= sizeof(cgpath))
+		return 0;
+
+	ret = fuse_service_request_file(ff->service, cgpath, PSI_OPEN_FLAGS,
+					0, FUSE_SERVICE_REQUEST_FILE_QUIET);
+	if (ret)
+		return ret;
+
+	ret = fuse_service_receive_file(ff->service, cgpath, &fd);
+	if (ret)
+		return ret;
+	if (fd < 0)
+		err_printf(ff, "%s %s: %s.\n",
+			   _("opening cgroup memory pressure monitor"),
+			   cgpath, strerror(-fd));
+		ff->psi_cgroup_mem_fd = fd;
+
+	return 0;
+}
+
 static int fuse4fs_service_get_config(struct fuse4fs *ff)
 {
 	int ret, ret2;
 
 	ret = fuse4fs_service_open_bdev(ff);
+	if (ret)
+		goto out_seal;
 
+	ret = fuse4fs_service_open_psi_controls(ff);
+
+out_seal:
 	/* Always prevent further fds from being added to our file table */
 	ret2 = fuse_service_finish_file_requests(ff->service);
 	if (ret2 && !ret)
@@ -6551,6 +6618,8 @@ int main(int argc, char *argv[])
 		.opstate = F4OP_WRITABLE,
 #ifdef HAVE_FUSE4FS_SERVICE
 		.bdev_fd = -1,
+		.psi_sys_mem_fd = -1,
+		.psi_cgroup_mem_fd = -1,
 #endif
 	};
 	errcode_t err;
diff --git a/lib/support/psi.c b/lib/support/psi.c
index 26ce6ee1985641..531ae935701edf 100644
--- a/lib/support/psi.c
+++ b/lib/support/psi.c
@@ -54,7 +54,7 @@ struct psi {
 	enum psi_state state;
 };
 
-static const char *psi_system_path(enum psi_type type)
+const char *psi_system_path(enum psi_type type)
 {
 	switch (type) {
 	case PSI_MEMORY:
@@ -300,7 +300,7 @@ void psi_destroy(struct psi **psip)
 
 static int psi_open_control(const char *path)
 {
-	return open(path, O_RDWR | O_NONBLOCK);
+	return open(path, PSI_OPEN_FLAGS);
 }
 
 static void psi_open_system_control(struct psi *psi)
@@ -309,7 +309,7 @@ static void psi_open_system_control(struct psi *psi)
 	psi->system_fd = psi_open_control(psi_system_path(psi->type));
 }
 
-static ssize_t psi_cgroup_path(enum psi_type type, char *path, size_t pathsize)
+ssize_t psi_cgroup_path(enum psi_type type, char *path, size_t pathsize)
 {
 	char cgpath[PATH_MAX];
 	char *p = cgpath;
@@ -485,6 +485,53 @@ int psi_create(enum psi_type type, unsigned int psi_flags, uint64_t stall_us,
 	return -1;
 }
 
+/*
+ * Same as psi_create, but you can specify the whole-system and per-cgroup
+ * monitoring fds.
+ */
+int psi_create_from(enum psi_type type, unsigned int psi_flags,
+		    uint64_t stall_us, uint64_t window_us, uint64_t timeout_us,
+		    int *system_fd, int *cgroup_fd, struct psi **psip)
+{
+	struct psi *psi;
+	int ret;
+
+	if (psi_flags & ~PSI_FLAGS) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	psi = psi_alloc(type, psi_flags, timeout_us);
+	if (!psi)
+		return -1;
+
+	psi->system_fd = *system_fd;
+	psi->cgroup_fd = *cgroup_fd;
+
+	*system_fd = -1;
+	*cgroup_fd = -1;
+
+	if (psi->system_fd < 0 && psi->cgroup_fd < 0 && !psi->timeout_us) {
+		errno = ENOENT;
+		goto out_fds;
+	}
+
+	ret = psi_config_fd(psi, psi->system_fd, stall_us, window_us);
+	if (ret)
+		goto out_fds;
+
+	ret = psi_config_fd(psi, psi->cgroup_fd, stall_us, window_us);
+	if (ret)
+		goto out_fds;
+
+	*psip = psi;
+	return 0;
+
+out_fds:
+	psi_destroy(&psi);
+	return -1;
+}
+
 /* Start monitoring for resource pressure stalls */
 int psi_start_thread(struct psi *psi)
 {
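
For readers unfamiliar with the kernel's pressure stall interface, here is a
minimal sketch of how a trigger matching the "200ms every 2s" comment in
fuse4fs_psi_config could be armed on one of the fds handed to
psi_create_from.  It follows the kernel's documented PSI trigger format
("some <stall_us> <window_us>" written to the control file, then poll() for
POLLPRI).  The names psi_format_trigger and psi_arm_trigger are hypothetical;
psi_config_fd is not shown in this patch and may well do this differently.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: format a PSI trigger such as "some 200000 2000000",
 * meaning "notify when some tasks stall for stall_us microseconds within
 * any window_us microsecond window".  Returns the string length, or -1 if
 * the buffer is too small.
 */
int psi_format_trigger(char *buf, size_t bufsize, uint64_t stall_us,
		       uint64_t window_us)
{
	int len = snprintf(buf, bufsize, "some %" PRIu64 " %" PRIu64,
			   stall_us, window_us);

	if (len < 0 || (size_t)len >= bufsize)
		return -1;
	return len;
}

/*
 * Hypothetical helper: arm a trigger on an already-open PSI control fd,
 * for example /proc/pressure/memory opened with O_RDWR | O_NONBLOCK.  The
 * example in the kernel documentation writes the string including its
 * terminating NUL, so this does the same.  The caller would then poll()
 * the fd for POLLPRI.
 */
int psi_arm_trigger(int fd, uint64_t stall_us, uint64_t window_us)
{
	char buf[64];
	int len = psi_format_trigger(buf, sizeof(buf), stall_us, window_us);

	if (len < 0)
		return -1;
	if (write(fd, buf, (size_t)len + 1) != (ssize_t)len + 1)
		return -1;
	return 0;
}
```

In the service container case the server cannot open the control files
itself, which is why the fds are requested from the mount helper first and
only configured afterwards.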



* [PATCH 4/4] fuse2fs: flush dirty metadata periodically
  2026-06-25 19:35 ` [PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure Darrick J. Wong
                     ` (2 preceding siblings ...)
  2026-06-25 19:41   ` [PATCH 3/4] fuse4fs: enable memory pressure monitoring with service containers Darrick J. Wong
@ 2026-06-25 19:41   ` Darrick J. Wong
  3 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2026-06-25 19:41 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

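Flush dirty metadata out to disk periodically, as the kernel does, to
reduce the potential for data loss if userspace doesn't explicitly
fsync.  The flush_interval= mount option sets the interval (default 30);
zero disables the flusher thread.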

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c |  105 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 misc/fuse2fs.c    |  105 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 198 insertions(+), 12 deletions(-)
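
The option parsing added below can be tried in isolation.  This is a sketch
under stated assumptions, not the code from the patch: parse_flush_interval
is a hypothetical standalone function, it derives the prefix length from the
string instead of hardcoding 15, and it also rejects an empty value, which
the strtoul check in the patch does not.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse "flush_interval=N" into *interval.  Returns 0
 * on success, or -1 if the prefix is missing, the number is empty or has
 * trailing junk, or the value does not fit in an unsigned int.
 */
int parse_flush_interval(const char *arg, unsigned int *interval)
{
	static const char prefix[] = "flush_interval=";
	const size_t prefixlen = sizeof(prefix) - 1;
	const char *num;
	char *end;
	unsigned long val;

	if (strncmp(arg, prefix, prefixlen) != 0)
		return -1;
	num = arg + prefixlen;

	/* Base 0 accepts decimal, octal (0...) and hex (0x...) input. */
	errno = 0;
	val = strtoul(num, &end, 0);
	if (end == num || *end != '\0' || errno || val > UINT_MAX)
		return -1;

	*interval = (unsigned int)val;
	return 0;
}
```

With this, "flush_interval=30" parses to 30, "flush_interval=0" parses to 0
(which the patch treats as "do not start the flusher"), and
"flush_interval=5s" is rejected.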


diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 83855e4c5e63d7..91cf65212c2e82 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -27,6 +27,7 @@
 #include <unistd.h>
 #include <ctype.h>
 #include <assert.h>
+#include <limits.h>
 #define FUSE_DARWIN_ENABLE_EXTENSIONS 0
 #ifdef __SET_FOB_FOR_FUSE
 # error Do not set magic value __SET_FOB_FOR_FUSE!!!!
@@ -309,6 +310,10 @@ struct fuse4fs {
 #endif
 	struct psi *mem_psi;
 	struct psi_handler *mem_psi_handler;
+
+	struct bthread *flush_thread;
+	unsigned int flush_interval;
+	double last_flush;
 };
 
 #ifdef HAVE_FUSE4FS_SERVICE
@@ -1003,6 +1008,71 @@ fuse4fs_set_handle(struct fuse_file_info *fp, struct fuse4fs_file_handle *fh)
 	fp->keep_cache = 1;
 }
 
+static errcode_t fuse4fs_flush(struct fuse4fs *ff, int flags)
+{
+	double last_flush = gettime_monotonic();
+	errcode_t err;
+
+	err = ext2fs_flush2(ff->fs, flags);
+	if (err)
+		return err;
+
+	ff->last_flush = last_flush;
+	return 0;
+}
+
+static inline int fuse4fs_flush_wanted(struct fuse4fs *ff)
+{
+	return ff->fs != NULL && ff->opstate == F4OP_WRITABLE &&
+	       ff->last_flush + ff->flush_interval <= gettime_monotonic();
+}
+
+static void fuse4fs_flush_bthread(void *data)
+{
+	struct fuse4fs *ff = data;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	fs = fuse4fs_start(ff);
+	if (fuse4fs_flush_wanted(ff) && !bthread_cancelled(ff->flush_thread)) {
+		err = fuse4fs_flush(ff, 0);
+		if (err)
+			ret = translate_error(fs, 0, err);
+	}
+	fuse4fs_finish(ff, ret);
+}
+
+static void fuse4fs_flush_start(struct fuse4fs *ff)
+{
+	int ret;
+
+	if (!ff->flush_interval)
+		return;
+
+	ret = bthread_create("fuse4fs_flush", fuse4fs_flush_bthread, ff,
+			     ff->flush_interval, &ff->flush_thread);
+	if (ret) {
+		err_printf(ff, "flusher: %s.\n", error_message(ret));
+		return;
+	}
+
+	ret = bthread_start(ff->flush_thread);
+	if (ret)
+		err_printf(ff, "flusher: %s.\n", error_message(ret));
+}
+
+static void fuse4fs_flush_cancel(struct fuse4fs *ff)
+{
+	if (ff->flush_thread)
+		bthread_cancel(ff->flush_thread);
+}
+
+static void fuse4fs_flush_destroy(struct fuse4fs *ff)
+{
+	bthread_destroy(&ff->flush_thread);
+}
+
 static void get_now(struct timespec *now)
 {
 #ifdef CLOCK_REALTIME
@@ -1966,7 +2036,7 @@ static int fuse4fs_mount(struct fuse4fs *ff)
 		ext2fs_set_tstamp(fs->super, s_mtime, time(NULL));
 		fs->super->s_state &= ~EXT2_VALID_FS;
 		ext2fs_mark_super_dirty(fs);
-		err = ext2fs_flush2(fs, 0);
+		err = fuse4fs_flush(ff, 0);
 		if (err)
 			return translate_error(fs, 0, err);
 	}
@@ -1994,7 +2064,7 @@ static void op_destroy(void *userdata)
 		if (err)
 			translate_error(fs, 0, err);
 
-		err = ext2fs_flush2(fs, 0);
+		err = fuse4fs_flush(ff, 0);
 		if (err)
 			translate_error(fs, 0, err);
 	}
@@ -2186,6 +2256,7 @@ static void op_init(void *userdata, struct fuse_conn_info *conn)
 	 */
 	fuse4fs_mmp_start(ff);
 	fuse4fs_psi_start(ff);
+	fuse4fs_flush_start(ff);
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17)
 	/*
@@ -2553,7 +2624,7 @@ static inline int fuse4fs_dirsync_flush(struct fuse4fs *ff, ext2_ino_t ino,
 		*flushed = 0;
 	return 0;
 flush:
-	err = ext2fs_flush2(fs, 0);
+	err = fuse4fs_flush(ff, 0);
 	if (err)
 		return translate_error(fs, 0, err);
 
@@ -4378,7 +4449,7 @@ static void op_release(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
 	if ((fp->flags & O_SYNC) &&
 	    fuse4fs_is_writeable(ff) &&
 	    (fh->open_flags & EXT2_FILE_WRITE)) {
-		err = ext2fs_flush2(fs, EXT2_FLAG_FLUSH_NO_SYNC);
+		err = fuse4fs_flush(ff, EXT2_FLAG_FLUSH_NO_SYNC);
 		if (err)
 			ret = translate_error(fs, fh->ino, err);
 	}
@@ -4409,7 +4480,7 @@ static void op_fsync(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
 	fs = fuse4fs_start(ff);
 	/* For now, flush everything, even if it's slow */
 	if (fuse4fs_is_writeable(ff) && fh->open_flags & EXT2_FILE_WRITE) {
-		err = ext2fs_flush2(fs, 0);
+		err = fuse4fs_flush(ff, 0);
 		if (err)
 			ret = translate_error(fs, fh->ino, err);
 	}
@@ -5730,6 +5801,7 @@ static int ioctl_shutdown(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
 
 	err_printf(ff, "%s.\n", _("shut down requested"));
 
+	fuse4fs_flush_cancel(ff);
 	fuse4fs_mmp_cancel(ff);
 
 	/*
@@ -5738,7 +5810,7 @@ static int ioctl_shutdown(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
 	 * any of the flags.  Flush whatever is dirty and shut down.
 	 */
 	if (ff->opstate == F4OP_WRITABLE)
-		ext2fs_flush2(fs, 0);
+		fuse4fs_flush(ff, 0);
 	ff->opstate = F4OP_SHUTDOWN;
 	fs->flags &= ~EXT2_FLAG_RW;
 
@@ -6192,6 +6264,7 @@ enum {
 	FUSE4FS_CACHE_SIZE,
 	FUSE4FS_DIRSYNC,
 	FUSE4FS_ERRORS_BEHAVIOR,
+	FUSE4FS_FLUSH_INTERVAL,
 };
 
 #define FUSE4FS_OPT(t, p, v) { t, offsetof(struct fuse4fs, p), v }
@@ -6216,6 +6289,7 @@ static struct fuse_opt fuse4fs_opts[] = {
 #ifdef HAVE_CLOCK_MONOTONIC
 	FUSE4FS_OPT("timing",		timing,			1),
 #endif
+	FUSE_OPT_KEY("flush_interval=%s", FUSE4FS_FLUSH_INTERVAL),
 
 	FUSE_OPT_KEY("user_xattr",	FUSE4FS_IGNORED),
 	FUSE_OPT_KEY("noblock_validity", FUSE4FS_IGNORED),
@@ -6274,6 +6348,21 @@ static int fuse4fs_opt_proc(void *data, const char *arg,
 
 		/* do not pass through to libfuse */
 		return 0;
+	case FUSE4FS_FLUSH_INTERVAL:
+		char *p;
+		unsigned long val;
+
+		errno = 0;
+		val = strtoul(arg + 15, &p, 0);
+		if (p != arg + strlen(arg) || errno || val > UINT_MAX) {
+			fprintf(stderr, "%s: %s.\n", arg,
+				_("Unrecognized flush interval"));
+			return -1;
+		}
+
+		/* do not pass through to libfuse */
+		ff->flush_interval = val;
+		return 0;
 	case FUSE4FS_IGNORED:
 		return 0;
 	case FUSE4FS_HELP:
@@ -6301,6 +6390,7 @@ static int fuse4fs_opt_proc(void *data, const char *arg,
 	"    -o cache_size=N[KMG]   use a disk cache of this size\n"
 	"    -o errors=             behavior when an error is encountered:\n"
 	"                           continue|remount-ro|panic\n"
+	"    -o flush_interval=N    flush dirty metadata on this interval\n"
 	"\n",
 			outargs->argv[0]);
 		if (key == FUSE4FS_HELPFULL) {
@@ -6621,6 +6711,7 @@ int main(int argc, char *argv[])
 		.psi_sys_mem_fd = -1,
 		.psi_cgroup_mem_fd = -1,
 #endif
+		.flush_interval = 30,
 	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;
@@ -6788,6 +6879,7 @@ int main(int argc, char *argv[])
  _("Mount failed while opening filesystem.  Check dmesg(1) for details."));
 		fflush(orig_stderr);
 	}
+	fuse4fs_flush_destroy(&fctx);
 	fuse4fs_psi_destroy(&fctx);
 	fuse4fs_mmp_destroy(&fctx);
 	fuse4fs_unmount(&fctx);
@@ -6969,6 +7061,7 @@ static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
  _("Remounting read-only due to errors."));
 			ff->opstate = F4OP_READONLY;
 		}
+		fuse4fs_flush_cancel(ff);
 		fuse4fs_mmp_cancel(ff);
 		fs->flags &= ~EXT2_FLAG_RW;
 		break;
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 92ecc4ca431a95..985229ccf676d6 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -25,6 +25,7 @@
 #include <sys/ioctl.h>
 #include <unistd.h>
 #include <ctype.h>
+#include <limits.h>
 #define FUSE_DARWIN_ENABLE_EXTENSIONS 0
 #ifdef __SET_FOB_FOR_FUSE
 # error Do not set magic value __SET_FOB_FOR_FUSE!!!!
@@ -282,6 +283,10 @@ struct fuse2fs {
 #endif
 	struct psi *mem_psi;
 	struct psi_handler *mem_psi_handler;
+
+	struct bthread *flush_thread;
+	unsigned int flush_interval;
+	double last_flush;
 };
 
 #define FUSE2FS_CHECK_HANDLE(ff, fh) \
@@ -806,6 +811,71 @@ fuse2fs_set_handle(struct fuse_file_info *fp, struct fuse2fs_file_handle *fh)
 	fp->fh = (uintptr_t)fh;
 }
 
+static errcode_t fuse2fs_flush(struct fuse2fs *ff, int flags)
+{
+	double last_flush = gettime_monotonic();
+	errcode_t err;
+
+	err = ext2fs_flush2(ff->fs, flags);
+	if (err)
+		return err;
+
+	ff->last_flush = last_flush;
+	return 0;
+}
+
+static inline int fuse2fs_flush_wanted(struct fuse2fs *ff)
+{
+	return ff->fs != NULL && ff->opstate == F2OP_WRITABLE &&
+	       ff->last_flush + ff->flush_interval <= gettime_monotonic();
+}
+
+static void fuse2fs_flush_bthread(void *data)
+{
+	struct fuse2fs *ff = data;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	fs = fuse2fs_start(ff);
+	if (fuse2fs_flush_wanted(ff) && !bthread_cancelled(ff->flush_thread)) {
+		err = fuse2fs_flush(ff, 0);
+		if (err)
+			ret = translate_error(fs, 0, err);
+	}
+	fuse2fs_finish(ff, ret);
+}
+
+static void fuse2fs_flush_start(struct fuse2fs *ff)
+{
+	int ret;
+
+	if (!ff->flush_interval)
+		return;
+
+	ret = bthread_create("fuse2fs_flush", fuse2fs_flush_bthread, ff,
+			     ff->flush_interval, &ff->flush_thread);
+	if (ret) {
+		err_printf(ff, "flusher: %s.\n", error_message(ret));
+		return;
+	}
+
+	ret = bthread_start(ff->flush_thread);
+	if (ret)
+		err_printf(ff, "flusher: %s.\n", error_message(ret));
+}
+
+static void fuse2fs_flush_cancel(struct fuse2fs *ff)
+{
+	if (ff->flush_thread)
+		bthread_cancel(ff->flush_thread);
+}
+
+static void fuse2fs_flush_destroy(struct fuse2fs *ff)
+{
+	bthread_destroy(&ff->flush_thread);
+}
+
 static void get_now(struct timespec *now)
 {
 #ifdef CLOCK_REALTIME
@@ -1487,7 +1557,7 @@ static int fuse2fs_mount(struct fuse2fs *ff)
 		ext2fs_set_tstamp(fs->super, s_mtime, time(NULL));
 		fs->super->s_state &= ~EXT2_VALID_FS;
 		ext2fs_mark_super_dirty(fs);
-		err = ext2fs_flush2(fs, 0);
+		err = fuse2fs_flush(ff, 0);
 		if (err)
 			return translate_error(fs, 0, err);
 	}
@@ -1515,7 +1585,7 @@ static void op_destroy(void *p EXT2FS_ATTR((unused)))
 		if (err)
 			translate_error(fs, 0, err);
 
-		err = ext2fs_flush2(fs, 0);
+		err = fuse2fs_flush(ff, 0);
 		if (err)
 			translate_error(fs, 0, err);
 	}
@@ -1718,6 +1788,7 @@ static void *op_init(struct fuse_conn_info *conn,
 	 */
 	fuse2fs_mmp_start(ff);
 	fuse2fs_psi_start(ff);
+	fuse2fs_flush_start(ff);
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17)
 	/*
@@ -2061,7 +2132,7 @@ static inline int fuse2fs_dirsync_flush(struct fuse2fs *ff, ext2_ino_t ino,
 		*flushed = 0;
 	return 0;
 flush:
-	err = ext2fs_flush2(fs, 0);
+	err = fuse2fs_flush(ff, 0);
 	if (err)
 		return translate_error(fs, 0, err);
 
@@ -3766,7 +3837,7 @@ static int op_release(const char *path EXT2FS_ATTR((unused)),
 	if ((fp->flags & O_SYNC) &&
 	    fuse2fs_is_writeable(ff) &&
 	    (fh->open_flags & EXT2_FILE_WRITE)) {
-		err = ext2fs_flush2(fs, EXT2_FLAG_FLUSH_NO_SYNC);
+		err = fuse2fs_flush(ff, EXT2_FLAG_FLUSH_NO_SYNC);
 		if (err)
 			ret = translate_error(fs, fh->ino, err);
 	}
@@ -3795,7 +3866,7 @@ static int op_fsync(const char *path EXT2FS_ATTR((unused)),
 	fs = fuse2fs_start(ff);
 	/* For now, flush everything, even if it's slow */
 	if (fuse2fs_is_writeable(ff) && fh->open_flags & EXT2_FILE_WRITE) {
-		err = ext2fs_flush2(fs, 0);
+		err = fuse2fs_flush(ff, 0);
 		if (err)
 			ret = translate_error(fs, fh->ino, err);
 	}
@@ -4900,6 +4971,7 @@ static int ioctl_shutdown(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 
 	err_printf(ff, "%s.\n", _("shut down requested"));
 
+	fuse2fs_flush_cancel(ff);
 	fuse2fs_mmp_cancel(ff);
 
 	/*
@@ -4908,7 +4980,7 @@ static int ioctl_shutdown(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 	 * any of the flags.  Flush whatever is dirty and shut down.
 	 */
 	if (ff->opstate == F2OP_WRITABLE)
-		ext2fs_flush2(fs, 0);
+		fuse2fs_flush(ff, 0);
 	ff->opstate = F2OP_SHUTDOWN;
 	fs->flags &= ~EXT2_FLAG_RW;
 
@@ -5343,6 +5415,7 @@ enum {
 	FUSE2FS_CACHE_SIZE,
 	FUSE2FS_DIRSYNC,
 	FUSE2FS_ERRORS_BEHAVIOR,
+	FUSE2FS_FLUSH_INTERVAL,
 };
 
 #define FUSE2FS_OPT(t, p, v) { t, offsetof(struct fuse2fs, p), v }
@@ -5367,6 +5440,7 @@ static struct fuse_opt fuse2fs_opts[] = {
 #ifdef HAVE_CLOCK_MONOTONIC
 	FUSE2FS_OPT("timing",		timing,			1),
 #endif
+	FUSE_OPT_KEY("flush_interval=%s", FUSE2FS_FLUSH_INTERVAL),
 
 	FUSE_OPT_KEY("user_xattr",	FUSE2FS_IGNORED),
 	FUSE_OPT_KEY("noblock_validity", FUSE2FS_IGNORED),
@@ -5425,6 +5499,21 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
 
 		/* do not pass through to libfuse */
 		return 0;
+	case FUSE2FS_FLUSH_INTERVAL:
+		char *p;
+		unsigned long val;
+
+		errno = 0;
+		val = strtoul(arg + 15, &p, 0);
+		if (p != arg + strlen(arg) || errno || val > UINT_MAX) {
+			fprintf(stderr, "%s: %s.\n", arg,
+				_("Unrecognized flush interval"));
+			return -1;
+		}
+
+		/* do not pass through to libfuse */
+		ff->flush_interval = val;
+		return 0;
 	case FUSE2FS_IGNORED:
 		return 0;
 	case FUSE2FS_HELP:
@@ -5452,6 +5541,7 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
 	"    -o cache_size=N[KMG]   use a disk cache of this size\n"
 	"    -o errors=             behavior when an error is encountered:\n"
 	"                           continue|remount-ro|panic\n"
+	"    -o flush_interval=N    flush dirty metadata on this interval\n"
 	"\n",
 			outargs->argv[0]);
 		if (key == FUSE2FS_HELPFULL) {
@@ -5602,6 +5692,7 @@ int main(int argc, char *argv[])
 		.bfl = (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER,
 		.oom_score_adj = -500,
 		.opstate = F2OP_WRITABLE,
+		.flush_interval = 30,
 	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;
@@ -5741,6 +5832,7 @@ int main(int argc, char *argv[])
  _("Mount failed while opening filesystem.  Check dmesg(1) for details."));
 		fflush(orig_stderr);
 	}
+	fuse2fs_flush_destroy(&fctx);
 	fuse2fs_psi_destroy(&fctx);
 	fuse2fs_mmp_destroy(&fctx);
 	fuse2fs_unmount(&fctx);
@@ -5920,6 +6012,7 @@ static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
  _("Remounting read-only due to errors."));
 			ff->opstate = F2OP_READONLY;
 		}
+		fuse2fs_flush_cancel(ff);
 		fuse2fs_mmp_cancel(ff);
 		fs->flags &= ~EXT2_FLAG_RW;
 		break;
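
The timing logic in the flush helpers above can be sketched on its own.  The
struct flusher type and the functions below are hypothetical stand-ins for
the fields added to struct fuse4fs and struct fuse2fs, and the sketch assumes
that gettime_monotonic returns seconds as a double, which this patch does not
show.  Unlike fuse4fs_flush_wanted, flush_due also checks for a zero
interval; the patch handles that case by never starting the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the flusher state kept in struct fuse4fs. */
struct flusher {
	double last_flush;		/* monotonic time of last good flush */
	unsigned int flush_interval;	/* 0 disables periodic flushing */
};

/* Monotonic clock as fractional seconds; immune to wall-clock jumps. */
double monotonic_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Nonzero if at least flush_interval has elapsed since the last flush. */
int flush_due(const struct flusher *f, double now)
{
	return f->flush_interval != 0 &&
	       f->last_flush + f->flush_interval <= now;
}

/*
 * Record the outcome of a flush.  The timestamp is sampled by the caller
 * before the flush starts and committed only on success, so a failed flush
 * remains due and anything dirtied during the flush is not counted as
 * already written.
 */
void flush_done(struct flusher *f, double started, int err)
{
	if (!err)
		f->last_flush = started;
}
```

Because every explicit flush (fsync, dirsync, O_SYNC release) also updates
last_flush, the periodic thread only does work when nothing else has flushed
within the interval.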



Thread overview: 28+ messages
2026-06-25 19:33 [PATCHBOMB v6] e2fsprogs: containerize ext4 for safer operation Darrick J. Wong
2026-06-25 19:35 ` [PATCHSET 1/4] libext2fs: fix some missed fsync calls Darrick J. Wong
2026-06-25 19:36   ` [PATCH 1/3] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
2026-06-25 19:36   ` [PATCH 2/3] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
2026-06-25 19:36   ` [PATCH 3/3] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
2026-06-25 19:35 ` [PATCHSET v6 2/4] fuse4fs: run servers as a contained service Darrick J. Wong
2026-06-25 19:37   ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
2026-06-25 19:37   ` [PATCH 02/10] libext2fs: fix checking for valid fds in mmp.c Darrick J. Wong
2026-06-25 19:37   ` [PATCH 03/10] unix_io: allow passing /dev/fd/XXX paths to the unixfd IO manager Darrick J. Wong
2026-06-25 19:37   ` [PATCH 04/10] libext2fs: fix MMP code to work with " Darrick J. Wong
2026-06-25 19:38   ` [PATCH 05/10] libext2fs: bump libfuse API version to 3.19 Darrick J. Wong
2026-06-25 19:38   ` [PATCH 06/10] fuse4fs: hoist some code out of fuse4fs_main Darrick J. Wong
2026-06-25 19:38   ` [PATCH 07/10] fuse4fs: enable safe service mode Darrick J. Wong
2026-06-25 19:38   ` [PATCH 08/10] fuse4fs: set proc title when in fuse " Darrick J. Wong
2026-06-25 19:39   ` [PATCH 09/10] fuse4fs: make MMP work correctly in safe " Darrick J. Wong
2026-06-25 19:39   ` [PATCH 10/10] debian: update packaging for fuse4fs service Darrick J. Wong
2026-06-25 19:35 ` [PATCHSET v6 3/4] fuse2fs: improve block and inode caching Darrick J. Wong
2026-06-25 19:39   ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
2026-06-25 19:39   ` [PATCH 2/6] iocache: add the actual buffer cache Darrick J. Wong
2026-06-25 19:40   ` [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses Darrick J. Wong
2026-06-25 19:40   ` [PATCH 4/6] fuse2fs: enable caching IO manager Darrick J. Wong
2026-06-25 19:40   ` [PATCH 5/6] fuse2fs: increase inode cache size Darrick J. Wong
2026-06-25 19:40   ` [PATCH 6/6] libext2fs: improve caching for inodes Darrick J. Wong
2026-06-25 19:35 ` [PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure Darrick J. Wong
2026-06-25 19:41   ` [PATCH 1/4] libsupport: add pressure stall monitor Darrick J. Wong
2026-06-25 19:41   ` [PATCH 2/4] fuse2fs: only reclaim buffer cache when there is memory pressure Darrick J. Wong
2026-06-25 19:41   ` [PATCH 3/4] fuse4fs: enable memory pressure monitoring with service containers Darrick J. Wong
2026-06-25 19:41   ` [PATCH 4/4] fuse2fs: flush dirty metadata periodically Darrick J. Wong
