linux-ext4.vger.kernel.org archive mirror
* [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
@ 2025-07-17 23:10 Darrick J. Wong
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4,
	Theodore Ts'o, Neal Gompa

Hi everyone,

DO NOT MERGE THIS, STILL!

This is the third request for comments of a prototype to connect the
Linux fuse driver to fs-iomap for regular file IO operations to and from
files whose contents persist to locally attached storage devices.

Why would you want to do that?  Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence.  Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.

willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code.  Eeeugh.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands.  Pagecache
writeback is now a directio write.  The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.

With this RFC, I am able to show that it's possible to build a fuse
server for a real filesystem (ext4) that runs entirely in userspace yet
maintains most of its performance.  At this stage I still get about 95%
of the kernel ext4 driver's streaming directio performance, and 110% of
its streaming buffered IO performance.  Random buffered
IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
fast as the kernel; see the cover letter for the fuse2fs iomap changes
for more details.  Unwritten extent conversions on random direct writes
are especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead.  And that's with debugging turned on!

These items have been addressed since the first RFC:

1. The iomap cookie validation is now present, which avoids subtle races
between pagecache zeroing and writeback on filesystems that support
unwritten and delalloc mappings.

2. Mappings can be cached in the kernel for more speed.

3. iomap supports inline data.

4. I can now turn on fuse+iomap on a per-inode basis, which turned out
to be as easy as creating a new ->getattr_iflags callback so that the
fuse server can set fuse_attr::flags.

5. statx and syncfs work on iomap filesystems.

6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
is enabled.

7. The ext4 shutdown ioctl is now supported.

There are some major warts remaining:

a. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.

b. iomap is an inode-based service, not a file-based service.  This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
to index its incore inode, so we have to pass those too so that
notifications work properly.  This is related to #c below:

c. Hardlinks and iomap are not possible for upper-level libfuse clients
because the upper level libfuse likes to abstract kernel nodeids with
its own homebrew dirent/inode cache, which doesn't understand hardlinks.
As a result, a hardlinked file results in two distinct struct inodes in
the kernel, which completely breaks iomap's locking model.  I will have
to rewrite fuse2fs for the lowlevel libfuse library to make this work,
but on the plus side there will be far less path lookup overhead.

d. There are too many changes to the IO manager in libext2fs because I
built things needed to stage the direct/buffered IO paths separately.
These are now unnecessary but I haven't pulled them out yet because
they're sort of useful to verify that iomap file IO never goes through
libext2fs except for inline data.

e. If we're going to use fuse servers as "safe" replacements for kernel
filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
We also need to disable the OOM killer(s) for fuse servers because you
don't want filesystems to unmount abruptly.

f. How do we maximally contain the fuse server to have safe filesystem
mounts?  It's very convenient to use systemd services to configure
isolation declaratively, but fuse2fs still needs to be able to open
/dev/fuse, the ext4 block device, and call mount() in the shared
namespace.  This prevents us from using most of the stronger systemd
protections because they tend to run in a private mount namespace with
various parts of the filesystem either hidden or readonly.

In theory one could design a socket protocol to pass mount options,
block device paths, fds, and responsibility for the mount() call between
a mount helper and a service:

e2fsprogs would define a systemd socket service for fuse2fs that sets
up a dynamic unprivileged user, no network access, and no access to the
host's filesystem aside from readonly access to the root filesystem.

The mount helper (e.g. mount.safe) would then connect to the magic
socket and pass the CLI arguments to the fuse2fs service.  The service
would parse the arguments, find the block device paths, and feed them
back through the socket to mount.safe.  mount.safe would open them and
pass fds back to the fuse2fs service.  The service would then open the
devices, parse the superblock, and if everything was ok, request a mount
through the socket.  The mount helper would then open /dev/fuse and
mount the filesystem, and if successful, pass the /dev/fuse fd through
the socket to the fuse2fs server.  At that point the fuse2fs server
would attach to the /dev/fuse device and handle the usual events.
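
The fd handoff steps above would presumably ride on SCM_RIGHTS
ancillary data over an AF_UNIX socket.  Here is a minimal sketch of
that one primitive.  The helper names send_fd() and recv_fd() and the
fd_cmsg union are made up for illustration; none of this is part of
e2fsprogs, libfuse, or any agreed-upon protocol.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one fd. */
union fd_cmsg {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send one open fd over an AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* at least one byte of real data is required */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);	/* the buffer above always fits one header */
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In the flow described above, mount.safe would call something like
send_fd() once per block device and once more for /dev/fuse, and the
contained service would call recv_fd() on its end of the socket.  The
received fd is a new descriptor in the receiver that refers to the same
open file, so the service never needs path access to the device nodes.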

Finally we'd have to train people/daemons to run "mount -t safe.ext4
/dev/sda1 /mnt" to get the contained version of ext4.

(Yeah, #f is all Neal. ;))

g. fuse2fs doesn't support the ext4 journal.  Urk.

I'll work on these in July/August, but for now here's an unmergeable RFC
to start some discussion.

--Darrick



* [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
@ 2025-07-17 23:25 ` Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
                     ` (21 more replies)
  2025-07-17 23:26 ` [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 22 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:25 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

Hi all,

Switch fuse2fs to use the new iomap file data IO paths instead of
pushing it very slowly through the /dev/fuse connection.  For local
filesystems, all we have to do is respond to requests for file to device
mappings; the rest of the IO hot path stays within the kernel.  This
means that we can get rid of all file data block processing within
fuse2fs.
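
To make "respond to requests for file to device mappings" a bit more
concrete, here is a toy model of such a lookup.  The names toy_extent,
toy_mapping, and toy_map_lookup() are invented for this sketch and are
not the libfuse or libext2fs API; it assumes a sorted, non-overlapping
extent array, and it reports a one-block hole past the last extent
where real code would more likely extend the hole towards EOF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_type { TOY_HOLE, TOY_MAPPED, TOY_UNWRITTEN };

/* One record of a sorted, non-overlapping extent list (hypothetical). */
struct toy_extent {
	uint64_t lblk;		/* first logical block in the file */
	uint64_t pblk;		/* first physical block on the device */
	uint64_t len;		/* length in blocks */
	int unwritten;		/* allocated but not yet written */
};

/* What gets reported back for a byte position (hypothetical). */
struct toy_mapping {
	enum toy_type type;
	uint64_t offset;	/* file offset in bytes */
	uint64_t length;	/* length in bytes */
	uint64_t addr;		/* device address in bytes; 0 for holes */
};

/*
 * Report the mapping that covers byte position pos.  A position that
 * falls in a gap yields a synthesized hole that ends where the next
 * extent begins.
 */
static void toy_map_lookup(const struct toy_extent *ex, size_t nr,
			   unsigned int blocksize, uint64_t pos,
			   struct toy_mapping *map)
{
	uint64_t startoff;
	size_t i;

	assert(blocksize > 0);
	startoff = pos / blocksize;

	for (i = 0; i < nr; i++) {
		if (startoff < ex[i].lblk) {
			/* Gap before this extent: report a hole. */
			map->type = TOY_HOLE;
			map->offset = startoff * blocksize;
			map->length = (ex[i].lblk - startoff) * blocksize;
			map->addr = 0;
			return;
		}
		if (startoff < ex[i].lblk + ex[i].len) {
			/* Extent overlaps startoff: report all of it. */
			map->type = ex[i].unwritten ? TOY_UNWRITTEN :
						      TOY_MAPPED;
			map->offset = ex[i].lblk * blocksize;
			map->length = ex[i].len * blocksize;
			map->addr = ex[i].pblk * blocksize;
			return;
		}
	}

	/* Past the last extent: a one-block hole in this toy model. */
	map->type = TOY_HOLE;
	map->offset = startoff * blocksize;
	map->length = blocksize;
	map->addr = 0;
}
```

Everything after the lookup (submitting bios, filling the pagecache)
can stay in the kernel, which is the point of the design: the server
only ever answers "where does this range live?".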

Because we're not pinning dirty pages through a potentially slow network
connection, we don't need the heavy BDI throttling for which most fuse
servers have become infamous.  Yes, mapping lookups for writeback can
stall, but mappings are small as compared to data and this situation
exists for all kernel filesystems as well.

The performance of this new data path is quite stunning: on a warm
system, streaming reads and writes through the pagecache go from
60-90MB/s to 2-2.5GB/s.  Direct IO reads and writes improve from the
same baseline to 2.5-8GB/s.  FIEMAP and SEEK_DATA/SEEK_HOLE now work
too.  The kernel ext4 driver can manage about 1.6GB/s for pagecache IO
and about 2.6-8.5GB/s for direct IO, which means that fuse2fs is about
as fast as the
kernel for streaming file IO.
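
For readers who want to poke at the SEEK_DATA/SEEK_HOLE support from
userspace, here is a small sketch built on plain lseek(2).  The helpers
next_data_range() and make_sparse_file() are hypothetical names for
this example only, and the fallback whence values are the Linux ones.
What a given filesystem reports depends on its hole support; one
without it may report the whole file as data.

```c
#define _GNU_SOURCE		/* exposes SEEK_DATA/SEEK_HOLE on glibc */
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback if headers were pulled in before _GNU_SOURCE (Linux values). */
#ifndef SEEK_DATA
# define SEEK_DATA	3
# define SEEK_HOLE	4
#endif

/*
 * Hypothetical helper: find the first data region at or after pos.
 * Returns 0 and fills *start and *end, or -1 with errno set by lseek
 * (ENXIO means there is no more data).
 */
static int next_data_range(int fd, off_t pos, off_t *start, off_t *end)
{
	off_t data, hole;

	assert(start != NULL && end != NULL);

	data = lseek(fd, pos, SEEK_DATA);
	if (data < 0)
		return -1;
	hole = lseek(fd, data, SEEK_HOLE);
	if (hole < 0)
		return -1;

	*start = data;
	*end = hole;
	return 0;
}

/*
 * Demo helper (hypothetical): create an unlinked temp file holding len
 * bytes of data at data_off and nothing before it.
 */
static int make_sparse_file(off_t data_off, size_t len)
{
	char path[] = "/tmp/sparse-demo-XXXXXX";
	char byte = 'x';
	size_t i;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	for (i = 0; i < len; i++) {
		if (pwrite(fd, &byte, 1, data_off + (off_t)i) != 1) {
			close(fd);
			return -1;
		}
	}
	return fd;
}
```

Calling next_data_range() in a loop, feeding each returned end back in
as the next pos until it fails, walks every data region of a file.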

Random 4k buffered IO is not so good: plain fuse2fs pokes along at
25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s.  The kernel
can do 900-1300MB/s.  Random directio is worse: plain fuse2fs does
20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does
40-55MB/s.  I suspect that metadata heavy workloads do not perform well
on fuse2fs because libext2fs wasn't designed for that and it doesn't
even have a journal to absorb all the fsync writes.  We also probably
need iomap caching really badly.

These performance numbers are slanted: my machine is 12 years old, and
fuse2fs is VERY poorly optimized for performance.  It contains a single
Big Filesystem Lock which nukes multi-threaded scalability.  There's no
inode cache nor is there a proper buffer cache, which means that fuse2fs
reads metadata in from disk and checksums it on EVERY ACCESS.  Sad!

Despite these gaps, this RFC demonstrates that it's feasible to run the
metadata parsing parts of a filesystem in userspace while not
sacrificing much performance.  We now have a vehicle to move the
filesystems out of the kernel, where they can be containerized so that
malicious filesystems can be contained, somewhat.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap
---
Commits in this patchset:
 * fuse2fs: implement bare minimum iomap for file mapping reporting
 * fuse2fs: add iomap= mount option
 * fuse2fs: implement iomap configuration
 * fuse2fs: register block devices for use with iomap
 * fuse2fs: always use directio disk reads with fuse2fs
 * fuse2fs: implement directio file reads
 * fuse2fs: use tagged block IO for zeroing sub-block regions
 * fuse2fs: only flush the cache for the file under directio read
 * fuse2fs: add extent dump function for debugging
 * fuse2fs: implement direct write support
 * fuse2fs: turn on iomap for pagecache IO
 * fuse2fs: improve tracing for fallocate
 * fuse2fs: don't zero bytes in punch hole
 * fuse2fs: don't do file data block IO when iomap is enabled
 * fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
 * fuse2fs: re-enable the block device pagecache for metadata IO
 * fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
 * fuse2fs: don't allow hardlinks for now
 * fuse2fs: enable file IO to inline data files
 * fuse2fs: set iomap-related inode flags
 * fuse2fs: add strictatime/lazytime mount options
 * fuse2fs: configure block device block size
---
 configure       |   47 ++
 configure.ac    |   32 +
 lib/config.h.in |    3 
 misc/fuse2fs.c  | 1567 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 1628 insertions(+), 21 deletions(-)



* [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-07-17 23:26 ` Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 1/1] fuse2fs: enable caching of iomaps Darrick J. Wong
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2025-07-18  8:54 ` [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Christian Brauner
  3 siblings, 1 reply; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:26 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

Hi all,

This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel.  For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem.  For everyone else, it simply
eliminates roundtrips to userspace.
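
The race described above is the usual reason for attaching a validity
cookie to a sampled mapping.  The following toy model shows the general
sequence-counter idea only; the names (toy_inode, toy_cached_map, and
friends) are invented here, and this is not a description of how the
fuse or iomap code in this series actually implements it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy inode with a single extent (hypothetical). */
struct toy_inode {
	uint64_t map_seq;	/* bumped on every mapping change */
	bool unwritten;		/* state of the one extent */
};

/* A mapping as remembered by some earlier lookup (hypothetical). */
struct toy_cached_map {
	uint64_t cookie;	/* map_seq at the time of sampling */
	bool unwritten;
};

/* Sample the current mapping state together with its cookie. */
static struct toy_cached_map toy_map_sample(const struct toy_inode *ip)
{
	struct toy_cached_map m = {
		.cookie = ip->map_seq,
		.unwritten = ip->unwritten,
	};

	return m;
}

/*
 * Writeback completion converts the unwritten extent to written.  The
 * bump of map_seq is what invalidates any older samples.
 */
static void toy_convert_unwritten(struct toy_inode *ip)
{
	assert(ip->unwritten);
	ip->unwritten = false;
	ip->map_seq++;
}

/* A sample may be trusted only if nothing changed since it was taken. */
static bool toy_map_valid(const struct toy_inode *ip,
			  const struct toy_cached_map *m)
{
	return m->cookie == ip->map_seq;
}
```

A pagecache writer that sampled the mapping before the conversion would
see toy_map_valid() return false afterwards and would have to look the
mapping up again rather than act on the stale "unwritten" state.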

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-cache
---
Commits in this patchset:
 * fuse2fs: enable caching of iomaps
---
 misc/fuse2fs.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)



* [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-07-17 23:26 ` [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-07-17 23:26 ` Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens Darrick J. Wong
                     ` (9 more replies)
  2025-07-18  8:54 ` [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Christian Brauner
  3 siblings, 10 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:26 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

Hi all,

When iomap is enabled for a fuse file, we try to keep as much of the
file IO path in the kernel as we possibly can.  That means no calling
out to the fuse server in the IO path when we can avoid it.  However,
the existing FUSE architecture defers all file attributes to the fuse
server -- [cm]time updates, ACL metadata management, set[ug]id removal,
and permissions checking thereof, etc.

We'd really rather do all these attribute updates in the kernel, and
only push them to the fuse server when it's actually necessary (e.g.
fsync).  Furthermore, the POSIX ACL code has the weird behavior that if
the access ACL can be represented entirely by i_mode bits, it will
change the mode and delete the ACL, which fuse servers generally don't
seem to implement.
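
The "representable entirely by i_mode bits" test can be sketched as
follows.  This is a simplified toy with made-up names (toy_acl_entry,
toy_acl_equiv_mode); it is not the kernel's POSIX ACL code and it
assumes the ACL has already been validated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified ACL entry tags (hypothetical). */
enum toy_acl_tag {
	TOY_USER_OBJ,		/* file owner */
	TOY_USER,		/* named user */
	TOY_GROUP_OBJ,		/* owning group */
	TOY_GROUP,		/* named group */
	TOY_MASK,
	TOY_OTHER,
};

struct toy_acl_entry {
	enum toy_acl_tag tag;
	unsigned int perm;	/* rwx bits, 0-7 */
};

/*
 * Return true if the access ACL carries no more information than the
 * nine permission bits, and store those bits in *mode.  On a false
 * return *mode is left untouched.
 */
static bool toy_acl_equiv_mode(const struct toy_acl_entry *acl, size_t nr,
			       unsigned int *mode)
{
	unsigned int m = 0;
	size_t i;

	assert(mode != NULL);

	for (i = 0; i < nr; i++) {
		switch (acl[i].tag) {
		case TOY_USER_OBJ:
			m |= (acl[i].perm & 7) << 6;
			break;
		case TOY_GROUP_OBJ:
			m |= (acl[i].perm & 7) << 3;
			break;
		case TOY_OTHER:
			m |= acl[i].perm & 7;
			break;
		default:
			/* Named users/groups or a mask need a real ACL. */
			return false;
		}
	}

	*mode = m;
	return true;
}
```

When the check succeeds, the caller would fold the result into the
inode mode and drop the stored ACL; when it fails, the ACL has to be
kept as an xattr alongside the mode.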

IOWs, we want consistent and correct (as defined by fstests) behavior
of file attributes in iomap mode.  Let's make the kernel manage all that
and push the results to userspace as needed.  This improves performance
even further, since it's sort of like writeback_cache mode but more
aggressive.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-attrs
---
Commits in this patchset:
 * fuse2fs: allow O_APPEND and O_TRUNC opens
 * fuse2fs: skip permission checking on utimens when iomap is enabled
 * fuse2fs: let the kernel tell us about acl/mode updates
 * fuse2fs: better debugging for file mode updates
 * fuse2fs: debug timestamp updates
 * fuse2fs: use coarse timestamps for iomap mode
 * fuse2fs: add tracing for retrieving timestamps
 * fuse2fs: enable syncfs
 * fuse2fs: skip the gdt write in op_destroy if syncfs is working
 * fuse2fs: implement statx
---
 misc/fuse2fs.c |  348 ++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 276 insertions(+), 72 deletions(-)



* [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-07-17 23:39   ` Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 02/22] fuse2fs: add iomap= mount option Darrick J. Wong
                     ` (20 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:39 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Add enough of an iomap implementation that we can do FIEMAP and
SEEK_DATA and SEEK_HOLE.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure       |   47 +++++
 configure.ac    |   32 ++++
 lib/config.h.in |    3 
 misc/fuse2fs.c  |  500 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 576 insertions(+), 6 deletions(-)


diff --git a/configure b/configure
index 0dc027d21280dc..ffa98829757788 100755
--- a/configure
+++ b/configure
@@ -14719,6 +14719,53 @@ elif test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=29
 fi
+
+if test "$FUSE_USE_VERSION" -ge 30
+then
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for iomap_begin in libfuse" >&5
+printf %s "checking for iomap_begin in libfuse... " >&6; }
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS	64
+#define FUSE_USE_VERSION 318
+#include <fuse.h>
+
+int
+main (void)
+{
+
+struct fuse_operations fs_ops = {
+	.iomap_begin = NULL,
+	.iomap_end = NULL,
+};
+struct fuse_iomap narf = { };
+
+  ;
+  return 0;
+}
+
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+  have_fuse_iomap=yes
+   { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+else $as_nop
+  { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; }
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+    conftest$ac_exeext conftest.$ac_ext
+if test "$have_fuse_iomap" = yes; then
+  FUSE_USE_VERSION=318
+
+printf "%s\n" "#define HAVE_FUSE_IOMAP 1" >>confdefs.h
+
+fi
+fi
+
 if test -n "$FUSE_USE_VERSION"
 then
 
diff --git a/configure.ac b/configure.ac
index 9f0e74c209b0f2..a4e122ac37880e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1447,6 +1447,38 @@ elif test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=29
 fi
+
+if test "$FUSE_USE_VERSION" -ge 30
+then
+dnl
+dnl see if fuse3 supports iomap
+dnl
+AC_MSG_CHECKING(for iomap_begin in libfuse)
+AC_LINK_IFELSE(
+[	AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS	64
+#define FUSE_USE_VERSION 318
+#include <fuse.h>
+	]], [[
+struct fuse_operations fs_ops = {
+	.iomap_begin = NULL,
+	.iomap_end = NULL,
+};
+struct fuse_iomap narf = { };
+	]])
+], have_fuse_iomap=yes
+   AC_MSG_RESULT(yes),
+   AC_MSG_RESULT(no))
+if test "$have_fuse_iomap" = yes; then
+  FUSE_USE_VERSION=318
+  AC_DEFINE(HAVE_FUSE_IOMAP, 1, [Define to 1 if fuse supports iomap])
+fi
+fi
+
+dnl
+dnl set FUSE_USE_VERSION now that we've done all the feature tests
+dnl
 if test -n "$FUSE_USE_VERSION"
 then
 	AC_DEFINE_UNQUOTED(FUSE_USE_VERSION, $FUSE_USE_VERSION,
diff --git a/lib/config.h.in b/lib/config.h.in
index f6597e69a7df8a..f054a1c1642a39 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -73,6 +73,9 @@
 /* Define to 1 if PR_SET_IO_FLUSHER is present */
 #undef HAVE_PR_SET_IO_FLUSHER
 
+/* Define to 1 if fuse supports iomap */
+#undef HAVE_FUSE_IOMAP
+
 /* Define to 1 if you have the Mac OS X function
    CFLocaleCopyPreferredLanguages in the CoreFoundation framework. */
 #undef HAVE_CFLOCALECOPYPREFERREDLANGUAGES
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 526c928f735ea2..e688772ddd8b60 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -145,6 +145,9 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
 	return b - m;
 }
 
+#define max(a, b)	((a) > (b) ? (a) : (b))
+#define min(x, y)	((x) < (y) ? (x) : (y))
+
 #define dbg_printf(fuse2fs, format, ...) \
 	while ((fuse2fs)->debug) { \
 		printf("FUSE2FS (%s): " format, (fuse2fs)->shortdev, ##__VA_ARGS__); \
@@ -216,6 +219,14 @@ enum fuse2fs_opstate {
 	F2OP_SHUTDOWN,
 };
 
+#ifdef HAVE_FUSE_IOMAP
+enum fuse2fs_iomap_state {
+	IOMAP_DISABLED,
+	IOMAP_UNKNOWN,
+	IOMAP_ENABLED,
+};
+#endif
+
 /* Main program context */
 #define FUSE2FS_MAGIC		(0xEF53DEADUL)
 struct fuse2fs {
@@ -241,6 +252,9 @@ struct fuse2fs {
 
 	enum fuse2fs_opstate opstate;
 	int blocklog;
+#ifdef HAVE_FUSE_IOMAP
+	enum fuse2fs_iomap_state iomap_state;
+#endif
 	unsigned int blockmask;
 	int retcode;
 	unsigned long offset;
@@ -462,6 +476,15 @@ static inline void __fuse2fs_finish(struct fuse2fs *ff, int ret,
 }
 #define fuse2fs_finish(ff, ret) __fuse2fs_finish((ff), (ret), __func__)
 
+#ifdef HAVE_FUSE_IOMAP
+static int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
+{
+	return ff->iomap_state >= IOMAP_ENABLED;
+}
+#else
+# define fuse2fs_iomap_enabled(...)	(0)
+#endif
+
 static void get_now(struct timespec *now)
 {
 #ifdef CLOCK_REALTIME
@@ -856,7 +879,7 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff, int libext2_flags)
 {
 	char options[128];
 	int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW |
-		    libext2_flags;
+		    EXT2_FLAG_WRITE_FULL_SUPER | libext2_flags;
 	errcode_t err;
 
 	if (ff->lockfile) {
@@ -1105,6 +1128,30 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
 }
 #endif
 
+#ifdef HAVE_FUSE_IOMAP
+static void fuse2fs_iomap_confirm(struct fuse_conn_info *conn,
+				  struct fuse2fs *ff)
+{
+	switch (ff->iomap_state) {
+	case IOMAP_UNKNOWN:
+		ff->iomap_state = IOMAP_DISABLED;
+		return;
+	case IOMAP_DISABLED:
+		return;
+	case IOMAP_ENABLED:
+		break;
+	}
+
+	/* iomap only works with block devices */
+	if (!fuse2fs_on_bdev(ff)) {
+		fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP);
+		ff->iomap_state = IOMAP_DISABLED;
+	}
+}
+#else
+# define fuse2fs_iomap_confirm(...)	((void)0)
+#endif
+
 static void *op_init(struct fuse_conn_info *conn
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 			, struct fuse_config *cfg EXT2FS_ATTR((unused))
@@ -1132,6 +1179,12 @@ static void *op_init(struct fuse_conn_info *conn
 #ifdef FUSE_CAP_NO_EXPORT_SUPPORT
 	fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	if (ff->iomap_state != IOMAP_DISABLED &&
+	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
+		ff->iomap_state = IOMAP_ENABLED;
+#endif
+
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 	conn->time_gran = 1;
 	cfg->use_ino = 1;
@@ -1151,6 +1204,8 @@ static void *op_init(struct fuse_conn_info *conn
 			goto mount_fail;
 		fs = ff->fs;
 
+		fuse2fs_iomap_confirm(conn, ff);
+
 		if (ff->cache_size) {
 			err = fuse2fs_config_cache(ff);
 			if (err)
@@ -1176,8 +1231,17 @@ static void *op_init(struct fuse_conn_info *conn
 		err = fuse2fs_mount(ff);
 		if (err)
 			goto mount_fail;
+	} else {
+		fuse2fs_iomap_confirm(conn, ff);
 	}
 
+	/*
+	 * If we're mounting in iomap mode, we need to unmount in op_destroy
+	 * so that the block device will be released before umount(2) returns.
+	 */
+	if (fuse2fs_iomap_enabled(ff))
+		ff->unmount_in_destroy = 1;
+
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
 	if (ff->opstate == F2OP_WRITABLE) {
 		fs->super->s_mnt_count++;
@@ -4734,6 +4798,424 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 # endif /* SUPPORT_FALLOCATE */
 #endif /* FUSE 29 */
 
+#ifdef HAVE_FUSE_IOMAP
+static void fuse2fs_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap,
+			       off_t pos, uint64_t count)
+{
+	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->addr = FUSE_IOMAP_NULL_ADDR;
+	iomap->offset = pos;
+	iomap->length = count;
+	iomap->type = FUSE_IOMAP_TYPE_HOLE;
+}
+
+static void fuse2fs_iomap_hole_to_eof(struct fuse2fs *ff,
+				      struct fuse_iomap *iomap, off_t pos,
+				      off_t count,
+				      const struct ext2_inode_large *inode)
+{
+	ext2_filsys fs = ff->fs;
+	uint64_t isize = EXT2_I_SIZE(inode);
+
+	/*
+	 * We have to be careful about handling a hole to the right of the
+	 * entire mapping tree.  First, the mapping must start and end on a
+	 * block boundary because they must be aligned to at least an LBA for
+	 * the block layer; and to the fsblock for smoother operation.
+	 *
+	 * As for the length -- we could return a mapping all the way to
+	 * i_size, but i_size could be less than pos/count if we're zeroing the
+	 * EOF block in anticipation of a truncate operation.  Similarly, we
+	 * don't want to end the mapping at pos+count because we know there's
+	 * nothing mapped beyond here.
+	 */
+	uint64_t startoff = round_down(pos, fs->blocksize);
+	uint64_t eofoff = round_up(max(pos + count, isize), fs->blocksize);
+
+	dbg_printf(ff,
+ "pos=0x%llx count=0x%llx isize=0x%llx startoff=0x%llx eofoff=0x%llx\n",
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   (unsigned long long)isize,
+		   (unsigned long long)startoff,
+		   (unsigned long long)eofoff);
+
+	fuse2fs_iomap_hole(ff, iomap, startoff, eofoff - startoff);
+}
+
+#define DEBUG_IOMAP
+#ifdef DEBUG_IOMAP
+# define __DUMP_EXTENT(ff, func, tag, startoff, err, extent) \
+	do { \
+		dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", \
+			   (func), (tag), (startoff), (err), (extent)->e_lblk, \
+			   (extent)->e_pblk, (extent)->e_len, \
+			   (extent)->e_flags & EXT2_EXTENT_FLAGS_UNINIT); \
+	} while(0)
+# define DUMP_EXTENT(ff, tag, startoff, err, extent) \
+	__DUMP_EXTENT((ff), __func__, (tag), (startoff), (err), (extent))
+#else
+# define __DUMP_EXTENT(...)	((void)0)
+# define DUMP_EXTENT(...)	((void)0)
+#endif
+
+static inline errcode_t __fuse2fs_get_mapping_at(struct fuse2fs *ff,
+						 ext2_extent_handle_t handle,
+						 blk64_t startoff,
+						 struct ext2fs_extent *bmap,
+						 const char *func)
+{
+	errcode_t err;
+
+	/*
+	 * Find the file mapping at startoff.  We don't check the return value
+	 * of _goto because _get will error out if _goto failed.  There's a
+	 * subtlety to the outcome of _goto when startoff falls in a sparse
+	 * hole however:
+	 *
+	 * Most of the time, _goto points the cursor at the mapping whose lblk
+	 * is just to the left of startoff.  The mapping may or may not overlap
+	 * startoff; this is ok.  In other words, the tree lookup behaves as if
+	 * we asked it to use a less than or equals comparison.
+	 *
+	 * However, if startoff is to the left of the first mapping in the
+	 * extent tree, _goto points the cursor at that first mapping because
+	 * it doesn't know how to deal with this situation.  In this case,
+	 * the tree lookup behaves as if we asked it to use a greater than
+	 * or equals comparison.
+	 *
+	 * Note: If _get() returns 'no current node', that means that there
+	 * aren't any mappings at all.
+	 */
+	ext2fs_extent_goto(handle, startoff);
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_CURRENT, bmap);
+	__DUMP_EXTENT(ff, func, "lookup", startoff, err, bmap);
+	if (err == EXT2_ET_NO_CURRENT_NODE)
+		err = EXT2_ET_EXTENT_NOT_FOUND;
+	return err;
+}
+
+static inline errcode_t __fuse2fs_get_next_mapping(struct fuse2fs *ff,
+						   ext2_extent_handle_t handle,
+						   blk64_t startoff,
+						   struct ext2fs_extent *bmap,
+						   const char *func)
+{
+	struct ext2fs_extent newex, errex;
+	errcode_t err;
+
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT_LEAF, &newex);
+	DUMP_EXTENT(ff, "NEXT", startoff, err, &newex);
+	if (err == EXT2_ET_EXTENT_NO_NEXT)
+		return EXT2_ET_EXTENT_NOT_FOUND;
+	if (err)
+		return err;
+
+	/*
+	 * Try to get the next leaf mapping.  There's a weird and longstanding
+	 * "feature" of EXT2_EXTENT_NEXT_LEAF where walking off the end of the
+	 * mapping recordset causes it to wrap around to the beginning of the
+	 * extent map and we end up with a mapping to the left of the one that
+	 * was passed in.
+	 *
+	 * However, a corrupt extent tree could also have such a record.  The
+	 * only way to be sure is to retrieve the mapping for the extreme right
+	 * edge of the tree and compare it to the mapping that the caller gave
+	 * us.  If they match, then we've hit the end.  If not, something is
+	 * corrupt in the ondisk metadata.
+	 */
+	if (newex.e_lblk <= bmap->e_lblk + bmap->e_len) {
+		err = __fuse2fs_get_mapping_at(ff, handle, ~0U, &errex, func);
+		if (err)
+			return err;
+
+		if (memcmp(bmap, &errex, sizeof(errex)) != 0)
+			return EXT2_ET_INODE_CORRUPTED;
+
+		return EXT2_ET_EXTENT_NOT_FOUND;
+	}
+
+	*bmap = newex;
+	return 0;
+}
+
+#define fuse2fs_get_mapping_at(ff, handle, startoff, bmap) \
+	__fuse2fs_get_mapping_at((ff), (handle), (startoff), (bmap), __func__)
+#define fuse2fs_get_next_mapping(ff, handle, startoff, bmap) \
+	__fuse2fs_get_next_mapping((ff), (handle), (startoff), (bmap), __func__)
+
+static errcode_t fuse2fs_iomap_begin_extent(struct fuse2fs *ff, uint64_t ino,
+					    struct ext2_inode_large *inode,
+					    off_t pos, uint64_t count,
+					    uint32_t opflags,
+					    struct fuse_iomap *iomap)
+{
+	ext2_extent_handle_t handle;
+	struct ext2fs_extent extent;
+	ext2_filsys fs = ff->fs;
+	const blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	errcode_t err;
+	int ret = 0;
+
+	err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = fuse2fs_get_mapping_at(ff, handle, startoff, &extent);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+		/* No mappings at all; the whole range is a hole. */
+		fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+		goto out_handle;
+	}
+	if (err) {
+		ret = translate_error(fs, ino, err);
+		goto out_handle;
+	}
+
+	if (startoff < extent.e_lblk) {
+		/*
+		 * Mapping starts to the right of the current position.
+		 * Synthesize a hole going to that next extent.
+		 */
+		fuse2fs_iomap_hole(ff, iomap, FUSE2FS_FSB_TO_B(ff, startoff),
+				FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+		goto out_handle;
+	}
+
+	if (startoff >= extent.e_lblk + extent.e_len) {
+		/*
+		 * Mapping ends to the left of the current position.  Try to
+		 * find the next mapping.  If there is no next mapping, the
+		 * whole range is in a hole.
+		 */
+		err = fuse2fs_get_next_mapping(ff, handle, startoff, &extent);
+		if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+			fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+			goto out_handle;
+		}
+
+		/*
+		 * If the new mapping starts to the right of startoff, there's
+		 * a hole from startoff to the start of the new mapping.
+		 */
+		if (startoff < extent.e_lblk) {
+			fuse2fs_iomap_hole(ff, iomap,
+				FUSE2FS_FSB_TO_B(ff, startoff),
+				FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+			goto out_handle;
+		}
+
+		/*
+		 * The new mapping starts at startoff.  Something weird
+		 * happened in the extent tree lookup, but we found a valid
+		 * mapping so we'll run with it.
+		 */
+	}
+
+	/* Mapping overlaps startoff, report this. */
+	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
+	iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
+	iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
+	if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)
+		iomap->type = FUSE_IOMAP_TYPE_UNWRITTEN;
+	else
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+
+out_handle:
+	ext2fs_extent_free(handle);
+	return ret;
+}
+
+static int fuse2fs_iomap_begin_indirect(struct fuse2fs *ff, uint64_t ino,
+					struct ext2_inode_large *inode,
+					off_t pos, uint64_t count,
+					uint32_t opflags,
+					struct fuse_iomap *iomap)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	uint64_t real_count = min(count, 131072);
+	const blk64_t endoff = FUSE2FS_B_TO_FSB(ff, pos + real_count);
+	blk64_t startblock;
+	errcode_t err;
+
+	err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, startoff, NULL,
+			   &startblock);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->offset = pos;
+	iomap->flags |= FUSE_IOMAP_F_MERGED;
+	if (startblock) {
+		iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock);
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+	} else {
+		iomap->addr = FUSE_IOMAP_NULL_ADDR;
+		iomap->type = FUSE_IOMAP_TYPE_HOLE;
+	}
+	iomap->length = fs->blocksize;
+
+	/* See how long the mapping goes for. */
+	for (startoff++; startoff < endoff; startoff++) {
+		blk64_t prev_startblock = startblock;
+
+		err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0,
+				   startoff, NULL, &startblock);
+		if (err)
+			break;
+
+		if (iomap->type == FUSE_IOMAP_TYPE_MAPPED) {
+			if (startblock == prev_startblock + 1)
+				iomap->length += fs->blocksize;
+			else
+				break;
+		} else {
+			if (startblock != 0)
+				break;
+		}
+	}
+
+	return 0;
+}
+
+static int fuse2fs_iomap_begin_inline(struct fuse2fs *ff, ext2_ino_t ino,
+				      struct ext2_inode_large *inode, off_t pos,
+				      uint64_t count, struct fuse_iomap *iomap)
+{
+	uint64_t one_fsb = FUSE2FS_FSB_TO_B(ff, 1);
+
+	if (pos >= one_fsb) {
+		fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+	} else {
+		/* ext4 only supports inline data files up to 1 fsb */
+		iomap->dev = FUSE_IOMAP_DEV_NULL;
+		iomap->addr = FUSE_IOMAP_NULL_ADDR;
+		iomap->offset = 0;
+		iomap->length = one_fsb;
+		iomap->type = FUSE_IOMAP_TYPE_INLINE;
+	}
+
+	return 0;
+}
+
+static int fuse2fs_iomap_begin_report(struct fuse2fs *ff, ext2_ino_t ino,
+				      struct ext2_inode_large *inode,
+				      off_t pos, uint64_t count,
+				      uint32_t opflags,
+				      struct fuse_iomap *read_iomap)
+{
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return fuse2fs_iomap_begin_inline(ff, ino, inode, pos, count,
+						  read_iomap);
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
+						  opflags, read_iomap);
+
+	return fuse2fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+					    opflags, read_iomap);
+}
+
+static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
+				    struct ext2_inode_large *inode, off_t pos,
+				    uint64_t count, uint32_t opflags,
+				    struct fuse_iomap *read_iomap)
+{
+	return -ENOSYS;
+}
+
+static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
+				     struct ext2_inode_large *inode, off_t pos,
+				     uint64_t count, uint32_t opflags,
+				     struct fuse_iomap *read_iomap)
+{
+	return -ENOSYS;
+}
+
+static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos, uint64_t count, uint32_t opflags,
+			  struct fuse_iomap *read_iomap,
+			  struct fuse_iomap *write_iomap)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct ext2_inode_large inode;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags);
+
+	fs = fuse2fs_start(ff);
+	err = fuse2fs_read_inode(fs, attr_ino, &inode);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	if (opflags & FUSE_IOMAP_OP_REPORT)
+		ret = fuse2fs_iomap_begin_report(ff, attr_ino, &inode, pos,
+						 count, opflags, read_iomap);
+	else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO))
+		ret = fuse2fs_iomap_begin_write(ff, attr_ino, &inode, pos,
+						count, opflags, read_iomap);
+	else
+		ret = fuse2fs_iomap_begin_read(ff, attr_ino, &inode, pos,
+					       count, opflags, read_iomap);
+	if (ret)
+		goto out_unlock;
+
+	dbg_printf(ff, "%s: nodeid=%llu attr_ino=%llu pos=0x%llx -> addr=0x%llx offset=0x%llx length=0x%llx type=%u\n",
+		   __func__,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)read_iomap->addr,
+		   (unsigned long long)read_iomap->offset,
+		   (unsigned long long)read_iomap->length,
+		   read_iomap->type);
+
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
+}
+
+static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			off_t pos, uint64_t count, uint32_t opflags,
+			ssize_t written, const struct fuse_iomap *iomap)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x written=0x%zx mapflags 0x%x\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags,
+		   written,
+		   iomap->flags);
+
+	return 0;
+}
+#endif /* HAVE_FUSE_IOMAP */
+
 static struct fuse_operations fs_ops = {
 	.init = op_init,
 	.destroy = op_destroy,
@@ -4794,6 +5276,10 @@ static struct fuse_operations fs_ops = {
 	.fallocate = op_fallocate,
 # endif
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	.iomap_begin = op_iomap_begin,
+	.iomap_end = op_iomap_end,
+#endif /* HAVE_FUSE_IOMAP */
 };
 
 static int get_random_bytes(void *p, size_t sz)
@@ -5010,17 +5496,19 @@ static void fuse2fs_com_err_proc(const char *whoami, errcode_t code,
 int main(int argc, char *argv[])
 {
 	struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
-	struct fuse2fs fctx;
+	struct fuse2fs fctx = {
+		.magic = FUSE2FS_MAGIC,
+		.opstate = F2OP_WRITABLE,
+#ifdef HAVE_FUSE_IOMAP
+		.iomap_state = IOMAP_UNKNOWN,
+#endif
+	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;
 	char *logfile;
 	char extra_args[BUFSIZ];
 	int ret;
 
-	memset(&fctx, 0, sizeof(fctx));
-	fctx.magic = FUSE2FS_MAGIC;
-	fctx.opstate = F2OP_WRITABLE;
-
 	ret = fuse_opt_parse(&args, &fctx, fuse2fs_opts, fuse2fs_opt_proc);
 	if (ret)
 		exit(1);

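Reader's note, not part of the patch: fuse2fs_iomap_begin_indirect() above
probes one logical block at a time because block-mapped files have no extent
records, and it extends the mapping only while the physical blocks stay
contiguous.  A minimal standalone sketch of that scan follows.  It assumes a
hypothetical in-memory array map[] (0 means "hole") standing in for
ext2fs_bmap2(); struct run and scan_run() are illustrative names only.  Unlike
the patch as posted, the sketch also grows the length of a hole run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: one run of logical blocks. */
struct run {
	uint64_t pblk;	/* first physical block, or 0 for a hole */
	uint64_t len;	/* number of logical blocks in the run */
};

/*
 * Walk map[start..end) and return the longest run beginning at start that
 * is either physically contiguous or entirely unmapped (all zeroes).
 */
static struct run scan_run(const uint64_t *map, size_t start, size_t end)
{
	struct run r;
	size_t i;

	assert(map != NULL && start < end);

	r.pblk = map[start];
	r.len = 1;

	for (i = start + 1; i < end; i++) {
		if (r.pblk != 0) {
			/* Mapped: the next block must follow the previous one. */
			if (map[i] != map[i - 1] + 1)
				break;
		} else if (map[i] != 0) {
			/* Hole: stop at the first mapped block. */
			break;
		}
		r.len++;
	}

	return r;
}
```

Multiplying r.len by the block size gives the byte length that would be
reported for the mapping.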


* [PATCH 02/22] fuse2fs: add iomap= mount option
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
@ 2025-07-17 23:39   ` Darrick J. Wong
  2025-07-17 23:40   ` [PATCH 03/22] fuse2fs: implement iomap configuration Darrick J. Wong
                     ` (19 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:39 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Add a mount option to control iomap usage so that we can test
before-and-after scenarios.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
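Not part of the commit: a minimal standalone sketch of the tri-state parsing
done by the FUSE2FS_IOMAP case in this patch, for anyone who wants to try the
accepted spellings in isolation.  The names enum toggle and parse_iomap_arg()
are hypothetical.  Unlike the patch, the sketch checks for the "iomap=" prefix
explicitly before reading the value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for enum fuse2fs_feature_toggle. */
enum toggle { T_DISABLE, T_ENABLE, T_DEFAULT };

/* Returns 0 and sets *out on success, -1 if the argument is not recognized. */
static int parse_iomap_arg(const char *arg, enum toggle *out)
{
	const char *val;

	assert(arg != NULL && out != NULL);

	/* A bare "iomap" is shorthand for "iomap=1". */
	if (strcmp(arg, "iomap") == 0) {
		*out = T_ENABLE;
		return 0;
	}

	/* Otherwise require the "iomap=" prefix before reading the value. */
	if (strncmp(arg, "iomap=", 6) != 0)
		return -1;
	val = arg + 6;

	if (strcmp(val, "1") == 0)
		*out = T_ENABLE;
	else if (strcmp(val, "0") == 0)
		*out = T_DISABLE;
	else if (strcmp(val, "default") == 0)
		*out = T_DEFAULT;
	else
		return -1;

	return 0;
}
```
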
 misc/fuse2fs.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e688772ddd8b60..d4912dee08d43f 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -219,6 +219,12 @@ enum fuse2fs_opstate {
 	F2OP_SHUTDOWN,
 };
 
+enum fuse2fs_feature_toggle {
+	FT_DISABLE,
+	FT_ENABLE,
+	FT_DEFAULT,
+};
+
 #ifdef HAVE_FUSE_IOMAP
 enum fuse2fs_iomap_state {
 	IOMAP_DISABLED,
@@ -253,6 +259,7 @@ struct fuse2fs {
 	enum fuse2fs_opstate opstate;
 	int blocklog;
 #ifdef HAVE_FUSE_IOMAP
+	enum fuse2fs_feature_toggle iomap_want;
 	enum fuse2fs_iomap_state iomap_state;
 #endif
 	unsigned int blockmask;
@@ -1235,6 +1242,13 @@ static void *op_init(struct fuse_conn_info *conn
 		fuse2fs_iomap_confirm(conn, ff);
 	}
 
+#if defined(HAVE_FUSE_IOMAP)
+	if (ff->iomap_want == FT_ENABLE && !fuse2fs_iomap_enabled(ff)) {
+		err_printf(ff, "%s\n", _("could not enable iomap."));
+		goto mount_fail;
+	}
+#endif
+
 	/*
 	 * If we're mounting in iomap mode, we need to unmount in op_destroy
 	 * so that the block device will be released before umount(2) returns.
@@ -5307,6 +5321,9 @@ enum {
 	FUSE2FS_CACHE_SIZE,
 	FUSE2FS_DIRSYNC,
 	FUSE2FS_ERRORS_BEHAVIOR,
+#ifdef HAVE_FUSE_IOMAP
+	FUSE2FS_IOMAP,
+#endif
 };
 
 #define FUSE2FS_OPT(t, p, v) { t, offsetof(struct fuse2fs, p), v }
@@ -5335,6 +5352,10 @@ static struct fuse_opt fuse2fs_opts[] = {
 	FUSE_OPT_KEY("cache_size=%s",	FUSE2FS_CACHE_SIZE),
 	FUSE_OPT_KEY("dirsync",		FUSE2FS_DIRSYNC),
 	FUSE_OPT_KEY("errors=%s",	FUSE2FS_ERRORS_BEHAVIOR),
+#ifdef HAVE_FUSE_IOMAP
+	FUSE_OPT_KEY("iomap=%s",	FUSE2FS_IOMAP),
+	FUSE_OPT_KEY("iomap",		FUSE2FS_IOMAP),
+#endif
 
 	FUSE_OPT_KEY("-V",             FUSE2FS_VERSION),
 	FUSE_OPT_KEY("--version",      FUSE2FS_VERSION),
@@ -5386,6 +5407,23 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
 
 		/* do not pass through to libfuse */
 		return 0;
+#ifdef HAVE_FUSE_IOMAP
+	case FUSE2FS_IOMAP:
+		if (strcmp(arg, "iomap") == 0 || strcmp(arg + 6, "1") == 0)
+			ff->iomap_want = FT_ENABLE;
+		else if (strcmp(arg + 6, "0") == 0)
+			ff->iomap_want = FT_DISABLE;
+		else if (strcmp(arg + 6, "default") == 0)
+			ff->iomap_want = FT_DEFAULT;
+		else {
+			fprintf(stderr, "%s: %s\n", arg,
+ _("unknown iomap= behavior."));
+			return -1;
+		}
+
+		/* do not pass through to libfuse */
+		return 0;
+#endif
 	case FUSE2FS_IGNORED:
 		return 0;
 	case FUSE2FS_HELP:
@@ -5413,6 +5451,9 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
 	"    -o cache_size=N[KMG]   use a disk cache of this size\n"
 	"    -o errors=             behavior when an error is encountered:\n"
 	"                           continue|remount-ro|panic\n"
+#ifdef HAVE_FUSE_IOMAP
+	"    -o iomap=              0 to disable iomap, 1 to enable iomap\n"
+#endif
 	"\n",
 			outargs->argv[0]);
 		if (key == FUSE2FS_HELPFULL) {
@@ -5500,6 +5541,7 @@ int main(int argc, char *argv[])
 		.magic = FUSE2FS_MAGIC,
 		.opstate = F2OP_WRITABLE,
 #ifdef HAVE_FUSE_IOMAP
+		.iomap_want = FT_DEFAULT,
 		.iomap_state = IOMAP_UNKNOWN,
 #endif
 	};
@@ -5518,6 +5560,11 @@ int main(int argc, char *argv[])
 		exit(1);
 	}
 
+#ifdef HAVE_FUSE_IOMAP
+	if (fctx.iomap_want == FT_DISABLE)
+		fctx.iomap_state = IOMAP_DISABLED;
+#endif
+
 	/* /dev/sda -> sda for reporting */
 	fctx.shortdev = strrchr(fctx.device, '/');
 	if (fctx.shortdev)



* [PATCH 03/22] fuse2fs: implement iomap configuration
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 02/22] fuse2fs: add iomap= mount option Darrick J. Wong
@ 2025-07-17 23:40   ` Darrick J. Wong
  2025-07-17 23:40   ` [PATCH 04/22] fuse2fs: register block devices for use with iomap Darrick J. Wong
                     ` (18 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:40 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Upload the filesystem geometry to the kernel when asked.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
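Not part of the commit: a worked example of the arithmetic in
fuse2fs_max_size(), written as a standalone function so the numbers can be
checked by hand.  max_file_size() and its parameters are hypothetical names;
blocklog is log2 of the fs block size, huge_file stands in for the huge_file
feature test, and upper_limit for the kernel-supplied maxbytes.

```c
#include <assert.h>
#include <stdint.h>

static int64_t max_file_size(unsigned int blocklog, int huge_file,
			     int64_t upper_limit)
{
	int64_t res;

	assert(blocklog >= 10 && blocklog <= 16);

	if (!huge_file) {
		/* Without huge_file, i_blocks is 32 bits of 512-byte units. */
		upper_limit = ((int64_t)1 << 32) - 1;

		/* Round down to whole fs blocks, then convert to bytes. */
		upper_limit >>= (blocklog - 9);
		upper_limit <<= blocklog;
	}

	/* Logical block numbers are 32 bits; stop one block short. */
	res = ((int64_t)1 << 32) - 1;
	res <<= blocklog;

	return res < upper_limit ? res : upper_limit;
}
```

With 4 KiB blocks (blocklog = 12) and huge_file set, this yields
(2^32 - 1) * 4096 = 17592186040320 bytes, one block short of 16 TiB.  Without
huge_file the limit drops to 2^41 - 4096 = 2199023251456 bytes.
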
 misc/fuse2fs.c |   96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 93 insertions(+), 3 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index d4912dee08d43f..fb71886b58f215 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -194,6 +194,10 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
 # define FL_ZERO_RANGE_FLAG (0)
 #endif
 
+#ifndef NSEC_PER_SEC
+# define NSEC_PER_SEC	(1000000000L)
+#endif
+
 errcode_t ext2fs_run_ext3_journal(ext2_filsys *fs);
 
 const char *err_shortdev;
@@ -575,9 +579,9 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
 	EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode);
 	get_now(&now);
 
-	datime = atime.tv_sec + ((double)atime.tv_nsec / 1000000000);
-	dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / 1000000000);
-	dnow = now.tv_sec + ((double)now.tv_nsec / 1000000000);
+	datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC);
+	dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
+	dnow = now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC);
 
 	/*
 	 * If atime is newer than mtime and atime hasn't been updated in thirty
@@ -5228,6 +5232,91 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 
 	return 0;
 }
+
+/*
+ * Maximal extent format file size.
+ * Resulting logical blkno at s_maxbytes must fit in our on-disk
+ * extent format containers, within a sector_t, and within i_blocks
+ * in the vfs.  ext4 inode has 48 bits of i_block in fsblock units,
+ * so that won't be a limiting factor.
+ *
+ * However, there is another limiting factor.  We store extents as a starting
+ * block and a length, so the length of the extent covering the maximum file
+ * size must also fit into the on-disk format containers.  That length is
+ * always one unit bigger than the highest block number (because we count
+ * block 0 as well), so we have to lower s_maxbytes by one fs block.
+ *
+ * Note, this does *not* consider any metadata overhead for vfs i_blocks.
+ */
+static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit)
+{
+	off_t res;
+
+	if (!ext2fs_has_feature_huge_file(ff->fs->super)) {
+		upper_limit = (1LL << 32) - 1;
+
+		/* total blocks in file system block size */
+		upper_limit >>= (ff->blocklog - 9);
+		upper_limit <<= ff->blocklog;
+	}
+
+	/*
+	 * 32-bit extent-start container, ee_block. We lower the maxbytes
+	 * by one fs block, so ee_len can cover the extent of maximum file
+	 * size
+	 */
+	res = (1LL << 32) - 1;
+	res <<= ff->blocklog;
+
+	/* Sanity check against vm- & vfs- imposed limits */
+	if (res > upper_limit)
+		res = upper_limit;
+
+	return res;
+}
+
+static int op_iomap_config(uint32_t flags, off_t maxbytes,
+			   struct fuse_iomap_config *cfg)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	ext2_filsys fs;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	dbg_printf(ff, "%s: flags=0x%x maxbytes=0x%llx\n", __func__, flags,
+		   (long long)maxbytes);
+	fs = fuse2fs_start(ff);
+
+	cfg->flags |= FUSE_IOMAP_CONFIG_UUID;
+	memcpy(cfg->s_uuid, fs->super->s_uuid, sizeof(cfg->s_uuid));
+	cfg->s_uuid_len = sizeof(fs->super->s_uuid);
+
+	cfg->flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE;
+	cfg->s_blocksize = FUSE2FS_FSB_TO_B(ff, 1);
+
+	/*
+	 * If the inode is large enough to house i_[acm]time_extra then we
+	 * can turn on nanosecond timestamps; i_crtime was the next field added
+	 * after i_atime_extra.
+	 */
+	cfg->flags |= FUSE_IOMAP_CONFIG_TIME;
+	if (fs->super->s_inode_size >=
+	    offsetof(struct ext2_inode_large, i_crtime)) {
+		cfg->s_time_gran = 1;
+		cfg->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX;
+	} else {
+		cfg->s_time_gran = NSEC_PER_SEC;
+		cfg->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX;
+	}
+	cfg->s_time_min = EXT4_TIMESTAMP_MIN;
+
+	cfg->flags |= FUSE_IOMAP_CONFIG_MAXBYTES;
+	cfg->s_maxbytes = fuse2fs_max_size(ff, maxbytes);
+
+	fuse2fs_finish(ff, 0);
+	return 0;
+}
 #endif /* HAVE_FUSE_IOMAP */
 
 static struct fuse_operations fs_ops = {
@@ -5293,6 +5382,7 @@ static struct fuse_operations fs_ops = {
 #ifdef HAVE_FUSE_IOMAP
 	.iomap_begin = op_iomap_begin,
 	.iomap_end = op_iomap_end,
+	.iomap_config = op_iomap_config,
 #endif /* HAVE_FUSE_IOMAP */
 };
 



* [PATCH 04/22] fuse2fs: register block devices for use with iomap
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:40   ` [PATCH 03/22] fuse2fs: implement iomap configuration Darrick J. Wong
@ 2025-07-17 23:40   ` Darrick J. Wong
  2025-07-17 23:40   ` [PATCH 05/22] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
                     ` (17 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:40 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Register the ext4 block device with the kernel for use with iomap.  For
now this is redundant with using fuseblk mode because the kernel
automatically registers any fuseblk devices, but eventually we'll go
back to regular fuse mode and we'll have to pin the bdev ourselves.
In theory this interface supports strange beasts where the metadata can
exist somewhere else entirely (or be made up by AI) while the file data
persists to real disks.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index fb71886b58f215..9eb067e1737054 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -40,6 +40,7 @@
 # define _FILE_OFFSET_BITS 64
 #endif /* _FILE_OFFSET_BITS */
 #include <fuse.h>
+#include <fuse_lowlevel.h>
 #ifdef __SET_FOB_FOR_FUSE
 # undef _FILE_OFFSET_BITS
 #endif /* __SET_FOB_FOR_FUSE */
@@ -265,6 +266,7 @@ struct fuse2fs {
 #ifdef HAVE_FUSE_IOMAP
 	enum fuse2fs_feature_toggle iomap_want;
 	enum fuse2fs_iomap_state iomap_state;
+	uint32_t iomap_dev;
 #endif
 	unsigned int blockmask;
 	int retcode;
@@ -5032,7 +5034,7 @@ static errcode_t fuse2fs_iomap_begin_extent(struct fuse2fs *ff, uint64_t ino,
 	}
 
 	/* Mapping overlaps startoff, report this. */
-	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->dev = ff->iomap_dev;
 	iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
 	iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
 	iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
@@ -5064,13 +5066,14 @@ static int fuse2fs_iomap_begin_indirect(struct fuse2fs *ff, uint64_t ino,
 	if (err)
 		return translate_error(fs, ino, err);
 
-	iomap->dev = FUSE_IOMAP_DEV_NULL;
 	iomap->offset = pos;
 	iomap->flags |= FUSE_IOMAP_F_MERGED;
 	if (startblock) {
+		iomap->dev = ff->iomap_dev;
 		iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock);
 		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
 	} else {
+		iomap->dev = FUSE_IOMAP_DEV_NULL;
 		iomap->addr = FUSE_IOMAP_NULL_ADDR;
 		iomap->type = FUSE_IOMAP_TYPE_HOLE;
 	}
@@ -5275,12 +5278,38 @@ static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit)
 	return res;
 }
 
+static errcode_t fuse2fs_iomap_config_devices(struct fuse_context *ctxt,
+					      struct fuse2fs *ff)
+{
+	struct fuse_session *se = fuse_get_session(ctxt->fuse);
+	errcode_t err;
+	int fd;
+	int ret;
+
+	err = io_channel_fd(ff->fs->io, &fd);
+	if (err)
+		return err;
+
+	ret = fuse_iomap_add_device(se, fd, 0);
+
+	dbg_printf(ff, "%s: registering iomap dev fd=%d ret=%d iomap_dev=%u\n",
+		   __func__, fd, ret, ff->iomap_dev);
+
+	if (ret < 1)
+		return -EIO;
+
+	ff->iomap_dev = ret;
+	return 0;
+}
+
 static int op_iomap_config(uint32_t flags, off_t maxbytes,
 			   struct fuse_iomap_config *cfg)
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
 
@@ -5314,8 +5343,15 @@ static int op_iomap_config(uint32_t flags, off_t maxbytes,
 	cfg->flags |= FUSE_IOMAP_CONFIG_MAXBYTES;
 	cfg->s_maxbytes = fuse2fs_max_size(ff, maxbytes);
 
-	fuse2fs_finish(ff, 0);
-	return 0;
+	err = fuse2fs_iomap_config_devices(ctxt, ff);
+	if (err) {
+		ret = translate_error(fs, 0, err);
+		goto out_unlock;
+	}
+
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
 }
 #endif /* HAVE_FUSE_IOMAP */
 
@@ -5633,6 +5669,7 @@ int main(int argc, char *argv[])
 #ifdef HAVE_FUSE_IOMAP
 		.iomap_want = FT_DEFAULT,
 		.iomap_state = IOMAP_UNKNOWN,
+		.iomap_dev = FUSE_IOMAP_DEV_NULL,
 #endif
 	};
 	errcode_t err;



* [PATCH 05/22] fuse2fs: always use directio disk reads with fuse2fs
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-17 23:40   ` [PATCH 04/22] fuse2fs: register block devices for use with iomap Darrick J. Wong
@ 2025-07-17 23:40   ` Darrick J. Wong
  2025-07-17 23:40   ` [PATCH 06/22] fuse2fs: implement directio file reads Darrick J. Wong
                     ` (16 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:40 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the kernel writes file data directly to the block device
and does not flush the bdev page cache.  We must open the filesystem in
directio mode to avoid cache coherency issues when reading file data
blocks.  If we can't open the bdev in directio mode, we must not use
iomap.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
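Not part of the commit: a minimal sketch of the fallback this patch adds to
op_init(), reduced to plain C so the control flow can be exercised without a
block device.  struct srv, open_fn, open_with_fallback() and the two stub
openers are all hypothetical; the stubs stand in for fuse2fs_open() on a
device that does or does not accept directio.

```c
#include <assert.h>
#include <stddef.h>

struct srv {
	int directio;	/* open the device with directio? */
	int iomap;	/* is iomap mode (still) enabled? */
};

typedef int (*open_fn)(const struct srv *);

/* Stub: a device that rejects directio opens. */
static int open_rejects_directio(const struct srv *s)
{
	return s->directio ? -1 : 0;
}

/* Stub: a device that accepts any open. */
static int open_accepts_all(const struct srv *s)
{
	(void)s;
	return 0;
}

static int open_with_fallback(struct srv *s, open_fn do_open)
{
	int was_directio;
	int err;

	assert(s != NULL && do_open != NULL);
	was_directio = s->directio;

	/* iomap mode needs uncached disk reads to stay coherent. */
	if (s->iomap)
		s->directio = 1;

	err = do_open(s);
	if (err && s->iomap && !was_directio) {
		/* directio was our idea, not the user's: drop iomap, retry. */
		s->iomap = 0;
		s->directio = 0;
		err = do_open(s);
	}

	return err;
}
```

If the user asked for directio explicitly and the open fails, the sketch (like
the patch) does not retry, so the failure is reported as-is.
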
 misc/fuse2fs.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 9eb067e1737054..72b9ec837209ca 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1174,6 +1174,9 @@ static void *op_init(struct fuse_conn_info *conn
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	ext2_filsys fs = ff->fs;
+#ifdef HAVE_FUSE_IOMAP
+	int was_directio = ff->directio;
+#endif
 	errcode_t err;
 	int ret;
 
@@ -1196,6 +1199,15 @@ static void *op_init(struct fuse_conn_info *conn
 	if (ff->iomap_state != IOMAP_DISABLED &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;
+	/*
+	 * In iomap mode, the kernel writes file data directly to the block
+	 * device and does not flush the bdev page cache.  We must open the
+	 * filesystem in directio mode to avoid cache coherency issues when
+	 * reading file data.  If we can't open the bdev in directio mode, we
+	 * must not use iomap.
+	 */
+	if (fuse2fs_iomap_enabled(ff))
+		ff->directio = 1;
 #endif
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
@@ -1213,6 +1225,14 @@ static void *op_init(struct fuse_conn_info *conn
 	 */
 	if (!fs) {
 		err = fuse2fs_open(ff, 0);
+#ifdef HAVE_FUSE_IOMAP
+		if (err && fuse2fs_iomap_enabled(ff) && !was_directio) {
+			fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP);
+			ff->iomap_state = IOMAP_DISABLED;
+			ff->directio = 0;
+			err = fuse2fs_open(ff, 0);
+		}
+#endif
 		if (err)
 			goto mount_fail;
 		fs = ff->fs;



* [PATCH 06/22] fuse2fs: implement directio file reads
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-17 23:40   ` [PATCH 05/22] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
@ 2025-07-17 23:40   ` Darrick J. Wong
  2025-07-17 23:41   ` [PATCH 07/22] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
                     ` (15 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:40 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Implement file reads via iomap.  Currently only directio is supported.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 72b9ec837209ca..209858aeb9307c 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1274,6 +1274,10 @@ static void *op_init(struct fuse_conn_info *conn
 		goto mount_fail;
 	}
 #endif
+#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_DIRECTIO)
+	if (fuse2fs_iomap_enabled(ff))
+		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO);
+#endif
 
 	/*
 	 * If we're mounting in iomap mode, we need to unmount in op_destroy
@@ -5165,7 +5169,26 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 				    uint64_t count, uint32_t opflags,
 				    struct fuse_iomap *read_iomap)
 {
-	return -ENOSYS;
+	errcode_t err;
+
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	/* fall back to slow path for inline data reads */
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return -ENOSYS;
+
+	/* flush dirty io_channel buffers to disk before iomap reads them */
+	err = io_channel_flush(ff->fs->io);
+	if (err)
+		return translate_error(ff->fs, ino, err);
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
+						  opflags, read_iomap);
+
+	return fuse2fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+					    opflags, read_iomap);
 }
 
 static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,



* [PATCH 07/22] fuse2fs: use tagged block IO for zeroing sub-block regions
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-07-17 23:40   ` [PATCH 06/22] fuse2fs: implement directio file reads Darrick J. Wong
@ 2025-07-17 23:41   ` Darrick J. Wong
  2025-07-17 23:41   ` [PATCH 08/22] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
                     ` (14 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:41 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Change the punch hole helpers to use the tagged block IO commands now
that libext2fs uses tagged block IO commands for file IO.  We'll need
this in the next patch when we turn on selective IO manager cache
clearing and invalidation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
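Not part of the commit: the helpers touched here zero part of a block with a
read-modify-write cycle.  A minimal sketch of that cycle against an in-memory
"device" follows; zero_within_block() and its parameters are hypothetical, and
the memcpy() calls stand in for the block read and write calls, whose tagged
variants are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Zero len bytes starting at byte offset off inside block blk of dev.
 * Returns 0 on success, -1 if the request does not fit inside one block.
 */
static int zero_within_block(unsigned char *dev, size_t nblocks,
			     size_t blocksize, size_t blk,
			     size_t off, size_t len)
{
	unsigned char buf[4096];

	assert(dev != NULL);

	if (blk >= nblocks || blocksize > sizeof(buf) ||
	    off > blocksize || len > blocksize - off)
		return -1;

	/* Read the whole block ... */
	memcpy(buf, dev + blk * blocksize, blocksize);

	/* ... clear only the requested bytes ... */
	memset(buf + off, 0, len);

	/* ... and write the whole block back. */
	memcpy(dev + blk * blocksize, buf, blocksize);
	return 0;
}
```
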
 misc/fuse2fs.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 209858aeb9307c..64aca0f962daaf 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4675,13 +4675,13 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	if (!blk || (retflags & BMAP_RET_UNINIT))
 		return 0;
 
-	err = io_channel_read_blk(fs->io, blk, 1, *buf);
+	err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf);
 	if (err)
 		return err;
 
 	memset(*buf + residue, 0, len);
 
-	return io_channel_write_blk(fs->io, blk, 1, *buf);
+	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
 
 static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
@@ -4709,7 +4709,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return err;
 
-	err = io_channel_read_blk(fs->io, blk, 1, *buf);
+	err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf);
 	if (err)
 		return err;
 	if (!blk || (retflags & BMAP_RET_UNINIT))
@@ -4720,7 +4720,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	else
 		memset(*buf + residue, 0, fs->blocksize - residue);
 
-	return io_channel_write_blk(fs->io, blk, 1, *buf);
+	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
 
 static int fuse2fs_punch_range(struct fuse2fs *ff,



* [PATCH 08/22] fuse2fs: only flush the cache for the file under directio read
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-07-17 23:41   ` [PATCH 07/22] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
@ 2025-07-17 23:41   ` Darrick J. Wong
  2025-07-17 23:41   ` [PATCH 09/22] fuse2fs: add extent dump function for debugging Darrick J. Wong
                     ` (13 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:41 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

We only need to flush the io_channel's cache for the file that's being
read directly, not everything else.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
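Not part of the commit: a toy model of why tagging cached blocks with their
owner makes a per-file flush cheap.  This is a sketch under stated
assumptions, not how libext2fs implements its cache; struct cache_ent and
flush_tag() are hypothetical names, and "flushing" here only clears the dirty
bit where a real cache would issue a disk write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cache_ent {
	uint32_t tag;	/* owner, e.g. an inode number */
	uint64_t blk;	/* block number held by this entry */
	int dirty;	/* nonzero if not yet written back */
};

/* Write back dirty entries owned by tag; returns how many were flushed. */
static size_t flush_tag(struct cache_ent *c, size_t n, uint32_t tag)
{
	size_t flushed = 0;
	size_t i;

	assert(c != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* Skip other owners' blocks and anything already clean. */
		if (c[i].tag != tag || !c[i].dirty)
			continue;

		/* A real cache would issue the disk write here. */
		c[i].dirty = 0;
		flushed++;
	}

	return flushed;
}
```

Dirty blocks belonging to other files stay cached, which is the saving the
commit message describes.
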
 misc/fuse2fs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 64aca0f962daaf..88b71af417c0d7 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5179,7 +5179,7 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 		return -ENOSYS;
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
-	err = io_channel_flush(ff->fs->io);
+	err = io_channel_flush_tag(ff->fs->io, ino);
 	if (err)
 		return translate_error(ff->fs, ino, err);
 



* [PATCH 09/22] fuse2fs: add extent dump function for debugging
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-07-17 23:41   ` [PATCH 08/22] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
@ 2025-07-17 23:41   ` Darrick J. Wong
  2025-07-17 23:41   ` [PATCH 10/22] fuse2fs: implement direct write support Darrick J. Wong
                     ` (12 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:41 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Add a function to dump an inode's extent map for debugging purposes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 88b71af417c0d7..0137403b7a25b9 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -498,6 +498,74 @@ static int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
 # define fuse2fs_iomap_enabled(...)	(0)
 #endif
 
+static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
+					struct ext2_inode_large *inode,
+					const char *why)
+{
+	ext2_filsys fs = ff->fs;
+	unsigned int nr = 0;
+	blk64_t blockcount = 0;
+	struct ext2_inode_large xinode;
+	struct ext2fs_extent extent;
+	ext2_extent_handle_t extents;
+	int op = EXT2_EXTENT_ROOT;
+	errcode_t retval;
+
+	if (!inode) {
+		inode = &xinode;
+
+		retval = fuse2fs_read_inode(fs, ino, inode);
+		if (retval) {
+			com_err(__func__, retval, _("reading ino %u"), ino);
+			return;
+		}
+	}
+
+	if (!(inode->i_flags & EXT4_EXTENTS_FL))
+		return;
+
+	printf("%s: %s ino=%u isize %llu iblocks %llu\n", __func__, why, ino,
+	       EXT2_I_SIZE(inode),
+	       (ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)) * 512) /
+	        fs->blocksize);
+	fflush(stdout);
+
+	retval = ext2fs_extent_open(fs, ino, &extents);
+	if (retval) {
+		com_err(__func__, retval, _("opening extents of ino \"%u\""),
+			ino);
+		return;
+	}
+
+	while ((retval = ext2fs_extent_get(extents, op, &extent)) == 0) {
+		op = EXT2_EXTENT_NEXT;
+
+		if (extent.e_flags & EXT2_EXTENT_FLAGS_SECOND_VISIT)
+			continue;
+
+		printf("[%u]: %s ino=%u lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n",
+		       nr++, why, ino, extent.e_lblk, extent.e_pblk,
+		       extent.e_len, extent.e_flags);
+		fflush(stdout);
+		if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF)
+			blockcount += extent.e_len;
+		else
+			blockcount++;
+	}
+	if (retval == EXT2_ET_EXTENT_NO_NEXT)
+		retval = 0;
+	if (retval) {
+		com_err(__func__, retval, _("getting extents of ino %u"),
+			ino);
+	}
+	if (inode->i_file_acl)
+		blockcount++;
+	printf("%s: %s sum(e_len) %llu\n", __func__, why, blockcount);
+	fflush(stdout);
+
+	ext2fs_extent_free(extents);
+}
+
 static void get_now(struct timespec *now)
 {
 #ifdef CLOCK_REALTIME


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 10/22] fuse2fs: implement direct write support
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-07-17 23:41   ` [PATCH 09/22] fuse2fs: add extent dump function for debugging Darrick J. Wong
@ 2025-07-17 23:41   ` Darrick J. Wong
  2025-07-17 23:42   ` [PATCH 11/22] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
                     ` (11 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:41 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Wire up an iomap_begin method that can allocate into holes so that we
can do directio writes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |  482 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 479 insertions(+), 3 deletions(-)
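
Reviewer aid, not part of the patch: a minimal in-memory sketch of what
the ioend path does to one unwritten extent when a write to the block
range [start, stop) completes.  It makes no libext2fs calls; struct
demo_extent and demo_convert_range() are hypothetical names used only
for illustration.  The overlapping part becomes written, and whatever
sticks out on either side stays unwritten, so a single mapping can turn
into as many as three.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory extent; not the libext2fs struct ext2fs_extent. */
struct demo_extent {
	uint64_t lblk;		/* first logical block */
	uint64_t pblk;		/* first physical block */
	uint32_t len;		/* length in blocks */
	bool unwritten;		/* true if the range is still uninitialized */
};

/*
 * Mark the part of @ex that overlaps [start, stop) as written.  Stores
 * the resulting pieces in @out, in logical order, and returns how many
 * pieces there are (1 to 3).
 */
static int demo_convert_range(const struct demo_extent *ex, uint64_t start,
			      uint64_t stop, struct demo_extent out[3])
{
	uint64_t ex_end = ex->lblk + ex->len;
	uint64_t lo = start > ex->lblk ? start : ex->lblk;
	uint64_t hi = stop < ex_end ? stop : ex_end;
	int n = 0;

	assert(start < stop);

	/* Already written, or no overlap at all: nothing to convert. */
	if (!ex->unwritten || lo >= hi) {
		out[0] = *ex;
		return 1;
	}

	/* Unwritten part to the left of the converted range. */
	if (lo > ex->lblk) {
		out[n] = *ex;
		out[n].len = (uint32_t)(lo - ex->lblk);
		n++;
	}

	/* The converted (now written) middle. */
	out[n].lblk = lo;
	out[n].pblk = ex->pblk + (lo - ex->lblk);
	out[n].len = (uint32_t)(hi - lo);
	out[n].unwritten = false;
	n++;

	/* Unwritten part to the right of the converted range. */
	if (hi < ex_end) {
		out[n].lblk = hi;
		out[n].pblk = ex->pblk + (hi - ex->lblk);
		out[n].len = (uint32_t)(ex_end - hi);
		out[n].unwritten = true;
		n++;
	}
	return n;
}
```

For an unwritten extent covering logical blocks 10-19 at physical block
100, converting [12, 15) yields three pieces: unwritten 10-11, written
12-14 (physical 102), and unwritten 15-19 (physical 105).  The patch
reaches the same layout through replace/insert calls on the extent
handle and then tries to merge the edges with their neighbours.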


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 0137403b7a25b9..8c3cc7adc72579 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5259,12 +5259,100 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 					    opflags, read_iomap);
 }
 
+static int fuse2fs_iomap_write_allocate(struct fuse2fs *ff, ext2_ino_t ino,
+				     struct ext2_inode_large *inode, off_t pos,
+				     uint64_t count, uint32_t opflags, struct
+				     fuse_iomap *read_iomap, bool *dirty)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + count);
+	errcode_t err;
+	int ret;
+
+	dbg_printf(ff, "%s: write_alloc ino=%u startoff 0x%llx blockcount 0x%llx\n",
+		   __func__, ino, startoff, stopoff - startoff);
+
+	if (!fs_can_allocate(ff, stopoff - startoff))
+		return -ENOSPC;
+
+	err = ext2fs_fallocate(fs, EXT2_FALLOCATE_FORCE_UNINIT, ino,
+			       EXT2_INODE(inode), ~0ULL, startoff,
+			       stopoff - startoff);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* pick up the newly allocated mapping */
+	ret = fuse2fs_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+				       read_iomap);
+	if (ret)
+		return ret;
+
+	read_iomap->flags |= FUSE_IOMAP_F_DIRTY;
+	*dirty = true;
+	return 0;
+}
+
+static off_t fuse2fs_max_file_size(const struct fuse2fs *ff,
+				   const struct ext2_inode_large *inode)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t addr_per_block, max_map_block;
+
+	if (inode->i_flags & EXT4_EXTENTS_FL) {
+		max_map_block = (1ULL << 32) - 1;
+	} else {
+		addr_per_block = fs->blocksize >> 2;
+		max_map_block = addr_per_block;
+		max_map_block += addr_per_block * addr_per_block;
+		max_map_block += addr_per_block * addr_per_block * addr_per_block;
+		max_map_block += 12;
+	}
+
+	return FUSE2FS_FSB_TO_B(ff, max_map_block) + (fs->blocksize - 1);
+}
+
 static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 				     struct ext2_inode_large *inode, off_t pos,
 				     uint64_t count, uint32_t opflags,
-				     struct fuse_iomap *read_iomap)
+				     struct fuse_iomap *read_iomap,
+				     bool *dirty)
 {
-	return -ENOSYS;
+	off_t max_size = fuse2fs_max_file_size(ff, inode);
+	errcode_t err;
+	int ret;
+
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	if (pos >= max_size)
+		return -EFBIG;
+
+	if (pos >= max_size - count)
+		count = max_size - pos;
+
+	ret = fuse2fs_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+				       read_iomap);
+	if (ret)
+		return ret;
+
+	if (read_iomap->type == FUSE_IOMAP_TYPE_HOLE &&
+	    !(opflags & FUSE_IOMAP_OP_ZERO)) {
+		ret = fuse2fs_iomap_write_allocate(ff, ino, inode, pos, count,
+						   opflags, read_iomap, dirty);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * flush and invalidate the file's io_channel buffers before iomap
+	 * writes them
+	 */
+	err = io_channel_invalidate_tag(ff->fs->io, ino);
+	if (err)
+		return translate_error(ff->fs, ino, err);
+
+	return 0;
 }
 
 static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
@@ -5277,6 +5365,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	struct ext2_inode_large inode;
 	ext2_filsys fs;
 	errcode_t err;
+	bool dirty = false;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
@@ -5302,7 +5391,8 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 						 count, opflags, read_iomap);
 	else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO))
 		ret = fuse2fs_iomap_begin_write(ff, attr_ino, &inode, pos,
-						count, opflags, read_iomap);
+						count, opflags, read_iomap,
+						&dirty);
 	else
 		ret = fuse2fs_iomap_begin_read(ff, attr_ino, &inode, pos,
 					       count, opflags, read_iomap);
@@ -5319,6 +5409,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   (unsigned long long)read_iomap->length,
 		   read_iomap->type);
 
+	if (dirty) {
+		err = fuse2fs_write_inode(fs, attr_ino, &inode);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
 out_unlock:
 	fuse2fs_finish(ff, ret);
 	return ret;
@@ -5460,6 +5558,383 @@ static int op_iomap_config(uint32_t flags, off_t maxbytes,
 		goto out_unlock;
 	}
 
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
+}
+
+static inline bool fuse2fs_can_merge_mappings(const struct ext2fs_extent *left,
+					      const struct ext2fs_extent *right)
+{
+	uint64_t max_len = (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ?
+				EXT_UNINIT_MAX_LEN : EXT_INIT_MAX_LEN;
+
+	return left->e_lblk + left->e_len == right->e_lblk &&
+	       left->e_pblk + left->e_len == right->e_pblk &&
+	       (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ==
+	        (right->e_flags & EXT2_EXTENT_FLAGS_UNINIT) &&
+	       (uint64_t)left->e_len + right->e_len <= max_len;
+}
+
+static int fuse2fs_try_merge_mappings(struct fuse2fs *ff, ext2_ino_t ino,
+				      ext2_extent_handle_t handle,
+				      blk64_t startoff)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2fs_extent left, right;
+	errcode_t err;
+
+	/* Look up the mapping before startoff */
+	err = fuse2fs_get_mapping_at(ff, handle, startoff - 1, &left);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND)
+		return 0;
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Look up the mapping at startoff */
+	err = fuse2fs_get_mapping_at(ff, handle, startoff, &right);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND)
+		return 0;
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Can we combine them? */
+	if (!fuse2fs_can_merge_mappings(&left, &right))
+		return 0;
+
+	/*
+	 * Delete the mapping after startoff because libext2fs cannot handle
+	 * overlapping mappings.
+	 */
+	err = ext2fs_extent_delete(handle, 0);
+	DUMP_EXTENT(ff, "remover", startoff, err, &right);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = ext2fs_extent_fix_parents(handle);
+	DUMP_EXTENT(ff, "fixremover", startoff, err, &right);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Move back and lengthen the mapping before startoff */
+	err = ext2fs_extent_goto(handle, left.e_lblk);
+	DUMP_EXTENT(ff, "movel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	left.e_len += right.e_len;
+	err = ext2fs_extent_replace(handle, 0, &left);
+	DUMP_EXTENT(ff, "replacel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = ext2fs_extent_fix_parents(handle);
+	DUMP_EXTENT(ff, "fixreplacel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
+static int fuse2fs_convert_unwritten_mapping(struct fuse2fs *ff,
+					     ext2_ino_t ino,
+					     struct ext2_inode_large *inode,
+					     ext2_extent_handle_t handle,
+					     blk64_t *cursor, blk64_t stopoff)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2fs_extent extent;
+	blk64_t startoff = *cursor;
+	errcode_t err;
+
+	/*
+	 * Find the mapping at startoff.  Note that we can find holes because
+	 * the mapping data can change due to racing writes.
+	 */
+	err = fuse2fs_get_mapping_at(ff, handle, startoff, &extent);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+		/*
+		 * If we didn't find any mappings at all then the file is
+		 * completely sparse.  There's nothing to convert.
+		 */
+		*cursor = stopoff;
+		return 0;
+	}
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/*
+	 * The mapping is completely to the left of the range that we want.
+	 * Let's see what's in the next extent, if there is one.
+	 */
+	if (startoff >= extent.e_lblk + extent.e_len) {
+		/*
+		 * Mapping ends to the left of the current position.  Try to
+		 * find the next mapping.  If there is no next mapping, then
+		 * we're done.
+		 */
+		err = fuse2fs_get_next_mapping(ff, handle, startoff, &extent);
+		if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+			*cursor = stopoff;
+			return 0;
+		}
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	/*
+	 * The mapping is completely to the right of the range that we want,
+	 * so we're done.
+	 */
+	if (extent.e_lblk >= stopoff) {
+		*cursor = stopoff;
+		return 0;
+	}
+
+	/*
+	 * At this point, we have a mapping that overlaps [startoff, stopoff).
+	 * If the mapping is already written, move on to the next one.
+	 */
+	if (!(extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT))
+		goto next;
+
+	if (startoff > extent.e_lblk) {
+		struct ext2fs_extent newex = extent;
+
+		/*
+		 * Unwritten mapping starts before startoff.  Shorten
+		 * the previous mapping...
+		 */
+		newex.e_len = startoff - extent.e_lblk;
+		err = ext2fs_extent_replace(handle, 0, &newex);
+		DUMP_EXTENT(ff, "shortenp", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixshortenp", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		/* ...and create new written mapping at startoff. */
+		extent.e_len -= newex.e_len;
+		extent.e_lblk += newex.e_len;
+		extent.e_pblk += newex.e_len;
+		extent.e_flags = newex.e_flags & ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_insert(handle,
+					   EXT2_EXTENT_INSERT_AFTER,
+					   &extent);
+		DUMP_EXTENT(ff, "insertx", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixinsertx", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	if (extent.e_lblk + extent.e_len > stopoff) {
+		struct ext2fs_extent newex = extent;
+
+		/*
+		 * Unwritten mapping ends after stopoff.  Shorten the current
+		 * mapping...
+		 */
+		extent.e_len = stopoff - extent.e_lblk;
+		extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_replace(handle, 0, &extent);
+		DUMP_EXTENT(ff, "shortenn", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixshortenn", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		/* ...and create a new unwritten mapping at stopoff. */
+		newex.e_pblk += extent.e_len;
+		newex.e_lblk += extent.e_len;
+		newex.e_len -= extent.e_len;
+		newex.e_flags |= EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_insert(handle,
+					   EXT2_EXTENT_INSERT_AFTER,
+					   &newex);
+		DUMP_EXTENT(ff, "insertn", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixinsertn", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	/* Still unwritten?  Update the state. */
+	if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) {
+		extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_replace(handle, 0, &extent);
+		DUMP_EXTENT(ff, "replacex", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixreplacex", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+next:
+	/* Try to merge with the previous extent */
+	if (startoff > 0) {
+		err = fuse2fs_try_merge_mappings(ff, ino, handle, startoff);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	*cursor = extent.e_lblk + extent.e_len;
+	return 0;
+}
+
+static int fuse2fs_convert_unwritten_mappings(struct fuse2fs *ff,
+					      ext2_ino_t ino,
+					      struct ext2_inode_large *inode,
+					      off_t pos, size_t written)
+{
+	ext2_extent_handle_t handle;
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	const blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + written);
+	errcode_t err;
+	int ret;
+
+	err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Walk every mapping in the range, converting them. */
+	while (startoff < stopoff) {
+		blk64_t old_startoff = startoff;
+
+		ret = fuse2fs_convert_unwritten_mapping(ff, ino, inode, handle,
+							&startoff, stopoff);
+		if (ret)
+			goto out_handle;
+		if (startoff <= old_startoff) {
+			/* Do not go backwards. */
+			ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED);
+			goto out_handle;
+		}
+	}
+
+	/* Try to merge the right edge */
+	ret = fuse2fs_try_merge_mappings(ff, ino, handle, stopoff);
+out_handle:
+	ext2fs_extent_free(handle);
+	return ret;
+}
+
+static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos, size_t written, uint32_t ioendflags,
+			  int error, uint64_t new_addr)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct ext2_inode_large inode;
+	ext2_filsys fs;
+	errcode_t err;
+	bool dirty = false;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx written=0x%zx ioendflags=0x%x error=%d new_addr=%llu\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   written,
+		   ioendflags,
+		   error,
+		   (unsigned long long)new_addr);
+
+	fs = fuse2fs_start(ff);
+	if (error) {
+		ret = error;
+		goto out_unlock;
+	}
+
+	/*
+	 * flush and invalidate the file's io_channel buffers again now that
+	 * iomap wrote them
+	 */
+	if (written > 0) {
+		err = io_channel_invalidate_tag(ff->fs->io, attr_ino);
+		if (err) {
+			ret = translate_error(ff->fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
+	/* should never see these ioend types */
+	if ((ioendflags & FUSE_IOMAP_IOEND_SHARED) ||
+	    new_addr != FUSE_IOMAP_NULL_ADDR) {
+		ret = translate_error(fs, attr_ino,
+				      EXT2_ET_FILESYSTEM_CORRUPTED);
+		goto out_unlock;
+	}
+
+	err = fuse2fs_read_inode(fs, attr_ino, &inode);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) {
+		/* unwritten extents are only supported on extents files */
+		if (!(inode.i_flags & EXT4_EXTENTS_FL)) {
+			ret = translate_error(fs, attr_ino,
+					      EXT2_ET_FILESYSTEM_CORRUPTED);
+			goto out_unlock;
+		}
+
+		ret = fuse2fs_convert_unwritten_mappings(ff, attr_ino, &inode,
+							 pos, written);
+		if (ret)
+			goto out_unlock;
+
+		dirty = true;
+	}
+
+	if (ioendflags & FUSE_IOMAP_IOEND_APPEND) {
+		ext2_off64_t isize = EXT2_I_SIZE(&inode);
+
+		if (pos + written > isize) {
+			err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode),
+						    pos + written);
+			if (err) {
+				ret = translate_error(fs, attr_ino, err);
+				goto out_unlock;
+			}
+
+			dirty = true;
+		}
+	}
+
+	if (dirty) {
+		err = fuse2fs_write_inode(fs, attr_ino, &inode);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
 out_unlock:
 	fuse2fs_finish(ff, ret);
 	return ret;
@@ -5530,6 +6005,7 @@ static struct fuse_operations fs_ops = {
 	.iomap_begin = op_iomap_begin,
 	.iomap_end = op_iomap_end,
 	.iomap_config = op_iomap_config,
+	.iomap_ioend = op_iomap_ioend,
 #endif /* HAVE_FUSE_IOMAP */
 };
 


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 11/22] fuse2fs: turn on iomap for pagecache IO
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (9 preceding siblings ...)
  2025-07-17 23:41   ` [PATCH 10/22] fuse2fs: implement direct write support Darrick J. Wong
@ 2025-07-17 23:42   ` Darrick J. Wong
  2025-07-17 23:42   ` [PATCH 12/22] fuse2fs: improve tracing for fallocate Darrick J. Wong
                     ` (10 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:42 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Turn on iomap for pagecache IO to regular files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   65 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 58 insertions(+), 7 deletions(-)
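
Reviewer aid, not part of the patch: a minimal sketch of the decision
that op_iomap_end makes for buffered appends.  The DEMO_* flag values
below are placeholders chosen for this example and are not the real
FUSE_IOMAP_* constants; demo_size_after_iomap_end() is a hypothetical
helper.  The size only grows, and only for a buffered (non-direct)
write that made progress on a mapping flagged as having changed the
file size.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag values for illustration; not the real FUSE constants. */
#define DEMO_OP_WRITE		(1U << 0)
#define DEMO_OP_DIRECT		(1U << 1)
#define DEMO_F_SIZE_CHANGED	(1U << 0)

/*
 * Return the file size that should be recorded after an iomap_end call.
 * Direct writes are left alone here because their size update happens
 * at ioend time instead.
 */
static uint64_t demo_size_after_iomap_end(uint64_t isize, uint64_t pos,
					  int64_t written, uint32_t opflags,
					  uint32_t mapflags)
{
	uint64_t newsize;

	if (!(opflags & DEMO_OP_WRITE) || (opflags & DEMO_OP_DIRECT))
		return isize;
	if (!(mapflags & DEMO_F_SIZE_CHANGED) || written <= 0)
		return isize;

	/* Guard against wrapping before computing the new end of file. */
	assert(pos <= UINT64_MAX - (uint64_t)written);
	newsize = pos + (uint64_t)written;

	/* Never shrink the file from this path. */
	return newsize > isize ? newsize : isize;
}
```

A 100-byte buffered write at offset 4096 of a 4096-byte file gives a new
size of 4196; the same write flagged as direct, or a write that lands
entirely inside the existing size, leaves the size unchanged.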


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 8c3cc7adc72579..a8fb18650ec080 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1346,6 +1346,10 @@ static void *op_init(struct fuse_conn_info *conn
 	if (fuse2fs_iomap_enabled(ff))
 		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO);
 #endif
+#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_FILEIO)
+	if (fuse2fs_iomap_enabled(ff))
+		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_FILEIO);
+#endif
 
 	/*
 	 * If we're mounting in iomap mode, we need to unmount in op_destroy
@@ -5239,9 +5243,6 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 {
 	errcode_t err;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	/* fall back to slow path for inline data reads */
 	if (inode->i_flags & EXT4_INLINE_DATA_FL)
 		return -ENOSYS;
@@ -5322,9 +5323,6 @@ static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	errcode_t err;
 	int ret;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	if (pos >= max_size)
 		return -EFBIG;
 
@@ -5422,12 +5420,51 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	return ret;
 }
 
+static int fuse2fs_iomap_append_setsize(struct fuse2fs *ff, ext2_ino_t ino,
+					loff_t newsize)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2_inode_large inode;
+	ext2_off64_t isize;
+	errcode_t err;
+
+	dbg_printf(ff, "%s: ino=%u newsize=%llu\n", __func__, ino,
+		   (unsigned long long)newsize);
+
+	err = fuse2fs_read_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	isize = EXT2_I_SIZE(&inode);
+	if (newsize <= isize)
+		return 0;
+
+	dbg_printf(ff, "%s: ino=%u oldsize=%llu newsize=%llu\n", __func__, ino,
+		   (unsigned long long)isize,
+		   (unsigned long long)newsize);
+
+	/*
+	 * XXX cheesily update the ondisk size even though we only want to do
+	 * the incore size until writeback happens
+	 */
+	err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), newsize);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = fuse2fs_write_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
 static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 			off_t pos, uint64_t count, uint32_t opflags,
 			ssize_t written, const struct fuse_iomap *iomap)
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
 
@@ -5442,7 +5479,21 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   written,
 		   iomap->flags);
 
-	return 0;
+	fuse2fs_start(ff);
+
+	/* XXX is this really necessary? */
+	if ((opflags & FUSE_IOMAP_OP_WRITE) &&
+	    !(opflags & FUSE_IOMAP_OP_DIRECT) &&
+	    (iomap->flags & FUSE_IOMAP_F_SIZE_CHANGED) &&
+	    written > 0) {
+		ret = fuse2fs_iomap_append_setsize(ff, attr_ino, pos + written);
+		if (ret)
+			goto out_unlock;
+	}
+
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
 }
 
 /*


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 12/22] fuse2fs: improve tracing for fallocate
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (10 preceding siblings ...)
  2025-07-17 23:42   ` [PATCH 11/22] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
@ 2025-07-17 23:42   ` Darrick J. Wong
  2025-07-17 23:42   ` [PATCH 13/22] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
                     ` (9 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:42 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Improve the tracing for fallocate by reporting the inode number and the
file range in all tracepoints.  Make the ranges hexadecimal to make it
easier for the programmer to convert bytes to block numbers and back.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)
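
Reviewer aid, not part of the patch: a small sketch of why hexadecimal
byte ranges are convenient.  With a power-of-two block size, converting
a byte offset to a block number is a right shift, so the low hex digits
simply drop off.  The helpers are hypothetical; they assume that
FUSE2FS_B_TO_FSBT truncates and FUSE2FS_B_TO_FSB rounds up, which is how
the macros are used in this series but is not shown in these hunks.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bytes to blocks, truncating (assumed behaviour of FUSE2FS_B_TO_FSBT). */
static uint64_t demo_b_to_fsbt(uint64_t bytes, unsigned int blocklog)
{
	assert(blocklog < 64);
	return bytes >> blocklog;
}

/* Bytes to blocks, rounding up (assumed behaviour of FUSE2FS_B_TO_FSB). */
static uint64_t demo_b_to_fsb(uint64_t bytes, unsigned int blocklog)
{
	assert(blocklog < 64);
	return (bytes + (UINT64_C(1) << blocklog) - 1) >> blocklog;
}

/* Format a byte range the way the new trace messages do, in hex. */
static int demo_format_range(char *buf, size_t buflen, uint64_t offset,
			     uint64_t len)
{
	return snprintf(buf, buflen, "offset=0x%" PRIx64 " len=0x%" PRIx64,
			offset, len);
}
```

On a 4096-byte-block filesystem (blocklog 12), offset 0x13000 is block
0x13 at a glance, whereas the decimal form 77824 needs a division.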


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index a8fb18650ec080..f7d17737459c11 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4683,8 +4683,8 @@ static int fuse2fs_allocate_range(struct fuse2fs *ff,
 
 	start = FUSE2FS_B_TO_FSBT(ff, offset);
 	end = FUSE2FS_B_TO_FSBT(ff, offset + len - 1);
-	dbg_printf(ff, "%s: ino=%d mode=0x%x start=%llu end=%llu\n", __func__,
-		   fh->ino, mode, start, end);
+	dbg_printf(ff, "%s: ino=%d mode=0x%x offset=0x%jx len=0x%jx start=%llu end=%llu\n",
+		   __func__, fh->ino, mode, offset, len, start, end);
 	if (!fs_can_allocate(ff, FUSE2FS_B_TO_FSB(ff, len)))
 		return -ENOSPC;
 
@@ -4751,6 +4751,7 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return err;
 
+	dbg_printf(ff, "%s: ino=%d offset=0x%jx len=0x%jx\n", __func__, ino, offset + residue, len);
 	memset(*buf + residue, 0, len);
 
 	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
@@ -4787,10 +4788,15 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	if (!blk || (retflags & BMAP_RET_UNINIT))
 		return 0;
 
-	if (clean_before)
+	if (clean_before) {
+		dbg_printf(ff, "%s: ino=%d before offset=0x%jx len=0x%jx\n",
+			   __func__, ino, offset, residue);
 		memset(*buf, 0, residue);
-	else
+	} else {
+		dbg_printf(ff, "%s: ino=%d after offset=0x%jx len=0x%jx\n",
+			   __func__, ino, offset, fs->blocksize - residue);
 		memset(*buf + residue, 0, fs->blocksize - residue);
+	}
 
 	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
@@ -4805,9 +4811,6 @@ static int fuse2fs_punch_range(struct fuse2fs *ff,
 	errcode_t err;
 	char *buf = NULL;
 
-	dbg_printf(ff, "%s: offset=%jd len=%jd\n", __func__,
-		   (intmax_t) offset, (intmax_t) len);
-
 	/* kernel ext4 punch requires this flag to be set */
 	if (!(mode & FL_KEEP_SIZE_FLAG))
 		return -EINVAL;
@@ -4900,6 +4903,12 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 		ret = -EROFS;
 		goto out;
 	}
+
+	dbg_printf(ff, "%s: ino=%d mode=0x%x start=0x%llx end=0x%llx\n", __func__,
+		   fh->ino, mode,
+		   (unsigned long long)offset,
+		   (unsigned long long)offset + len);
+
 	if (mode & FL_ZERO_RANGE_FLAG)
 		ret = fuse2fs_zero_range(ff, fh, mode, offset, len);
 	else if (mode & FL_PUNCH_HOLE_FLAG)


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 13/22] fuse2fs: don't zero bytes in punch hole
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (11 preceding siblings ...)
  2025-07-17 23:42   ` [PATCH 12/22] fuse2fs: improve tracing for fallocate Darrick J. Wong
@ 2025-07-17 23:42   ` Darrick J. Wong
  2025-07-17 23:43   ` [PATCH 14/22] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
                     ` (8 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:42 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the pagecache, it will take care of zeroing the
unaligned parts of punched out regions so we don't have to do it
ourselves.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
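
Reviewer aid, not part of the patch: a minimal sketch of how a punched
byte range splits into an unaligned head, a run of whole blocks, and an
unaligned tail.  struct demo_punch and demo_split_punch() are
hypothetical names.  In iomap file IO mode the kernel zeroes the head
and tail bytes in the pagecache, so the server only has to deal with the
whole blocks in the middle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a punched range; illustration only. */
struct demo_punch {
	uint64_t head_off, head_len;	/* partial-block bytes at the start */
	uint64_t first_blk, nr_blks;	/* whole blocks that can be freed */
	uint64_t tail_off, tail_len;	/* partial-block bytes at the end */
};

/* Split the byte range [offset, offset + len) at block boundaries. */
static struct demo_punch demo_split_punch(uint64_t offset, uint64_t len,
					  uint32_t blocksize)
{
	struct demo_punch p = { 0 };
	uint64_t end = offset + len;
	/* First block that is entirely inside the range. */
	uint64_t first = (offset + blocksize - 1) / blocksize;
	/* One past the last block that is entirely inside the range. */
	uint64_t stop = end / blocksize;

	assert(blocksize > 0 && len > 0);

	/* The range sits strictly inside one block: only bytes to zero. */
	if (first > stop) {
		p.head_off = offset;
		p.head_len = len;
		return p;
	}

	p.head_off = offset;
	p.head_len = first * blocksize - offset;
	p.first_blk = first;
	p.nr_blks = stop - first;
	p.tail_off = stop * blocksize;
	p.tail_len = end - stop * blocksize;
	return p;
}
```

Punching 8292 bytes at offset 4000 with 4096-byte blocks gives a 96-byte
head, whole blocks 1 and 2, and a 4-byte tail at offset 12288.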


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f7d17737459c11..45eec59d85faf4 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -235,6 +235,7 @@ enum fuse2fs_iomap_state {
 	IOMAP_DISABLED,
 	IOMAP_UNKNOWN,
 	IOMAP_ENABLED,
+	IOMAP_FILEIO,	/* enabled and does all file data block IO */
 };
 #endif
 
@@ -494,8 +495,14 @@ static int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
 {
 	return ff->iomap_state >= IOMAP_ENABLED;
 }
+
+static int fuse2fs_iomap_does_fileio(const struct fuse2fs *ff)
+{
+	return ff->iomap_state == IOMAP_FILEIO;
+}
 #else
 # define fuse2fs_iomap_enabled(...)	(0)
+# define fuse2fs_iomap_does_fileio(...)	(0)
 #endif
 
 static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
@@ -1219,6 +1226,7 @@ static void fuse2fs_iomap_confirm(struct fuse_conn_info *conn,
 		return;
 	case IOMAP_DISABLED:
 		return;
+	case IOMAP_FILEIO:
 	case IOMAP_ENABLED:
 		break;
 	}
@@ -1267,6 +1275,20 @@ static void *op_init(struct fuse_conn_info *conn
 	if (ff->iomap_state != IOMAP_DISABLED &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;
+
+	/*
+	 * If iomap is turned on and the kernel advertises support for both
+	 * direct and buffered IO, then that means the kernel handles all
+	 * regular file data block IO for us.  That means we can turn off all
+	 * of libext2fs' file data block handling except for inline data.
+	 *
+	 * XXX: kernel doesn't support inline data iomap
+	 */
+	if (fuse2fs_iomap_enabled(ff) &&
+	    fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO) &&
+	    fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_FILEIO))
+		ff->iomap_state = IOMAP_FILEIO;
+
 	/*
 	 * In iomap mode, the kernel writes file data directly to the block
 	 * device and does not flush the bdev page cache.  We must open the
@@ -4734,6 +4756,10 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	int retflags;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (fuse2fs_iomap_does_fileio(ff))
+		return 0;
+
 	if (!*buf) {
 		err = ext2fs_get_mem(fs->blocksize, buf);
 		if (err)
@@ -4767,6 +4793,10 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	off_t residue;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (fuse2fs_iomap_does_fileio(ff))
+		return 0;
+
 	residue = FUSE2FS_OFF_IN_FSB(ff, offset);
 	if (residue == 0)
 		return 0;


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 14/22] fuse2fs: don't do file data block IO when iomap is enabled
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (12 preceding siblings ...)
  2025-07-17 23:42   ` [PATCH 13/22] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
@ 2025-07-17 23:43   ` Darrick J. Wong
  2025-07-17 23:43   ` [PATCH 15/22] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:43 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the page cache, the kernel will take care of
all the file data block IO for us, including zeroing of punched ranges
and post-EOF bytes.  fuse2fs only needs to do IO for inline data.

Therefore, set the NOBLOCKIO ext2_file flag so that libext2fs will not
do any regular file IO to or from disk blocks at all.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 45eec59d85faf4..989f9f17cae0a9 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -3059,9 +3059,14 @@ static int truncate_helper(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size)
 	ext2_file_t file;
 	__u64 old_isize;
 	errcode_t err;
+	int flags = EXT2_FILE_WRITE;
 	int ret = 0;
 
-	err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file);
+	/* the kernel handles all eof zeroing for us in iomap mode */
+	if (fuse2fs_iomap_does_fileio(ff))
+		flags |= EXT2_FILE_NOBLOCKIO;
+
+	err = ext2fs_file_open(fs, ino, flags, &file);
 	if (err)
 		return translate_error(fs, ino, err);
 
@@ -3181,6 +3186,9 @@ static int __op_open(struct fuse2fs *ff, const char *path,
 		file->open_flags |= EXT2_FILE_WRITE;
 		break;
 	}
+	/* the kernel handles all block IO for us in iomap mode */
+	if (fuse2fs_iomap_does_fileio(ff))
+		file->open_flags |= EXT2_FILE_NOBLOCKIO;
 	if (fp->flags & O_APPEND) {
 		/* the kernel doesn't allow truncation of an append-only file */
 		if (fp->flags & O_TRUNC) {


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 15/22] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (13 preceding siblings ...)
  2025-07-17 23:43   ` [PATCH 14/22] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
@ 2025-07-17 23:43   ` Darrick J. Wong
  2025-07-17 23:43   ` [PATCH 16/22] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
                     ` (6 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:43 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Now that fuse2fs uses iomap for pagecache IO, all regular file IO goes
directly to the disk.  There is no need to flush the unix IO manager's
disk cache (or invalidate it) because it does not contain file data.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 989f9f17cae0a9..9604f06e69bc90 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5295,9 +5295,11 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 		return -ENOSYS;
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
-	err = io_channel_flush_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!fuse2fs_iomap_does_fileio(ff)) {
+		err = io_channel_flush_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	if (inode->i_flags & EXT4_EXTENTS_FL)
 		return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
@@ -5393,9 +5395,11 @@ static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	 * flush and invalidate the file's io_channel buffers before iomap
 	 * writes them
 	 */
-	err = io_channel_invalidate_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!fuse2fs_iomap_does_fileio(ff)) {
+		err = io_channel_invalidate_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	return 0;
 }
@@ -5972,7 +5976,7 @@ static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	 * flush and invalidate the file's io_channel buffers again now that
 	 * iomap wrote them
 	 */
-	if (written > 0) {
+	if (written > 0 && !fuse2fs_iomap_does_fileio(ff)) {
 		err = io_channel_invalidate_tag(ff->fs->io, attr_ino);
 		if (err) {
 			ret = translate_error(ff->fs, attr_ino, err);



* [PATCH 16/22] fuse2fs: re-enable the block device pagecache for metadata IO
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (14 preceding siblings ...)
  2025-07-17 23:43   ` [PATCH 15/22] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
@ 2025-07-17 23:43   ` Darrick J. Wong
  2025-07-17 23:43   ` [PATCH 17/22] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
                     ` (5 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:43 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Back in "fuse2fs: always use directio disk reads with fuse2fs", we
started using directio for all libext2fs disk IO to deal with cache
coherency issues between the unix io manager's disk cache, the block
device page cache, and the file data blocks being read and written to
disk by the kernel itself.

Now that we've turned off all regular file data block IO in libext2fs,
we don't need that and can go back to the old way, which is a lot
faster for metadata operations.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 9604f06e69bc90..9a62971f8dbba7 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1295,8 +1295,12 @@ static void *op_init(struct fuse_conn_info *conn
 	 * filesystem in directio mode to avoid cache coherency issues when
 	 * reading file data.  If we can't open the bdev in directio mode, we
 	 * must not use iomap.
+	 *
+	 * If we know that the kernel can handle all regular file IO for us,
+	 * then there is no cache coherency issue and we can use buffered reads
+	 * for all IO, which will all be filesystem metadata.
 	 */
-	if (fuse2fs_iomap_enabled(ff))
+	if (fuse2fs_iomap_enabled(ff) && !fuse2fs_iomap_does_fileio(ff))
 		ff->directio = 1;
 #endif
 



* [PATCH 17/22] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (15 preceding siblings ...)
  2025-07-17 23:43   ` [PATCH 16/22] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
@ 2025-07-17 23:43   ` Darrick J. Wong
  2025-07-17 23:44   ` [PATCH 18/22] fuse2fs: don't allow hardlinks for now Darrick J. Wong
                     ` (4 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:43 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Since fuse in iomap mode guarantees that op_destroy will be called
before umount returns, we don't need to use fuseblk mode to get that
guarantee.  Disable fuseblk mode, which saves us the trouble of closing
and reopening the device.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
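
Reader's sketch, not part of the patch: the mount-type decision described
above can be modelled as a small pure function.  The struct and the model_*
names are hypothetical stand-ins for the struct fuse2fs fields and helpers
that the diff touches; only the order of the checks is taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few struct fuse2fs fields involved. */
struct mount_model {
	bool noblkdev;         /* user passed -o noblkdev */
	bool iomap_disabled;   /* models ff->iomap_want == FT_DISABLE */
	bool kernel_has_iomap; /* models the result of fuse_discover_iomap() */
	bool on_bdev;          /* models fuse2fs_on_bdev() */
};

/* Mirrors fuse2fs_discover_iomap(): never probe if iomap was disabled. */
static bool model_discover_iomap(const struct mount_model *m)
{
	if (m->iomap_disabled)
		return false;
	return m->kernel_has_iomap;
}

/* Mirrors fuse2fs_want_fuseblk() after this patch. */
static bool model_want_fuseblk(const struct mount_model *m)
{
	if (m->noblkdev)
		return false;
	if (model_discover_iomap(m))
		return false;	/* iomap already guarantees op_destroy */
	return m->on_bdev;
}
```

In this model, fuseblk is chosen only for a block device when neither
noblkdev nor likely iomap support rules it out.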


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 9a62971f8dbba7..82b59c1ac89774 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -982,6 +982,8 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff, int libext2_flags)
 	if (ff->directio)
 		flags |= EXT2_FLAG_DIRECT_IO;
 
+	dbg_printf(ff, "opening with flags=0x%x\n", flags);
+
 	err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
 			   &ff->fs);
 	if (err == EPERM) {
@@ -6333,10 +6335,24 @@ static unsigned long long default_cache_size(void)
 	return ret;
 }
 
+#ifdef HAVE_FUSE_IOMAP
+static inline bool fuse2fs_discover_iomap(const struct fuse2fs *ff)
+{
+	if (ff->iomap_want == FT_DISABLE)
+		return false;
+
+	return fuse_discover_iomap();
+}
+#else
+# define fuse2fs_discover_iomap(...)	(false)
+#endif
+
 static inline bool fuse2fs_want_fuseblk(const struct fuse2fs *ff)
 {
 	if (ff->noblkdev)
 		return false;
+	if (fuse2fs_discover_iomap(ff))
+		return false;
 
 	return fuse2fs_on_bdev(ff);
 }
@@ -6499,6 +6515,12 @@ int main(int argc, char *argv[])
 		 * device) so that unmount will wait until op_destroy
 		 * completes.  If this is not a block device, we cannot use
 		 * fuseblk mode and should leave the filesystem open.
+		 *
+		 * However, fuse+iomap guarantees that op_destroy is called
+		 * before the filesystem is unmounted, so we don't need fuseblk
+		 * mode.  This saves us the trouble of reopening the filesystem
+		 * later, and means that fuse2fs itself owns the exclusive lock
+		 * on the block device.
 		 */
 		fuse2fs_unmount(&fctx);
 



* [PATCH 18/22] fuse2fs: don't allow hardlinks for now
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (16 preceding siblings ...)
  2025-07-17 23:43   ` [PATCH 17/22] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
@ 2025-07-17 23:44   ` Darrick J. Wong
  2025-07-17 23:44   ` [PATCH 19/22] fuse2fs: enable file IO to inline data files Darrick J. Wong
                     ` (3 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:44 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Disable hardlink creation when iomap is enabled; op_link returns -ENOSYS
in that case.

XXX see the comment in op_init for why we have to do this bellicosely
stupid thing.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 82b59c1ac89774..e281b5fc589d82 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -261,6 +261,7 @@ struct fuse2fs {
 	uint8_t dirsync;
 	uint8_t unmount_in_destroy;
 	uint8_t noblkdev;
+	uint8_t can_hardlink;
 
 	enum fuse2fs_opstate opstate;
 	int blocklog;
@@ -1382,9 +1383,31 @@ static void *op_init(struct fuse_conn_info *conn
 	/*
 	 * If we're mounting in iomap mode, we need to unmount in op_destroy
 	 * so that the block device will be released before umount(2) returns.
+	 *
+	 * XXX: It turns out that fuse2fs creates internal node ids that have
+	 * nothing to do with the ext2_ino_t that we give it.  These internal
+	 * node ids are what actually gets igetted in the kernel, which means
+	 * that there can be multiple fuse_inode objects for the same fuse2fs
+	 * inode.
+	 *
+	 * What this means, horrifyingly, is that on a fuse filesystem that
+	 * supports hard links, the in-kernel i_rwsem does not protect against
+	 * concurrent writes between files that point to the same inode.  That
+	 * in turn means that the file mode and size can get desynchronized
+	 * between the multiple fuse_inode objects.  This also means that we
+	 * cannot cache iomaps in the kernel AT ALL because the caches will
+	 * get out of sync, leading to WARN_ONs from the iomap zeroing code and
+	 * probably data corruption after that.
+	 *
+	 * So for now we just disable hardlinking on iomap to see if the weird
+	 * fstests failures (particularly g/476) go away.  Long term it means
+	 * we probably have to find a way around this, like porting fuse2fs
+	 * to be a low level fuse driver.
 	 */
-	if (fuse2fs_iomap_enabled(ff))
+	if (fuse2fs_iomap_enabled(ff)) {
 		ff->unmount_in_destroy = 1;
+		ff->can_hardlink = 0;
+	}
 
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
 	if (ff->opstate == F2OP_WRITABLE) {
@@ -2751,6 +2774,10 @@ static int op_link(const char *src, const char *dest)
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
+
+	if (!ff->can_hardlink)
+		return -ENOSYS;
+
 	dbg_printf(ff, "%s: src=%s dest=%s\n", __func__, src, dest);
 	temp_path = strdup(dest);
 	if (!temp_path) {
@@ -6380,6 +6407,7 @@ int main(int argc, char *argv[])
 		.iomap_state = IOMAP_UNKNOWN,
 		.iomap_dev = FUSE_IOMAP_DEV_NULL,
 #endif
+		.can_hardlink = 1,
 	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;



* [PATCH 19/22] fuse2fs: enable file IO to inline data files
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (17 preceding siblings ...)
  2025-07-17 23:44   ` [PATCH 18/22] fuse2fs: don't allow hardlinks for now Darrick J. Wong
@ 2025-07-17 23:44   ` Darrick J. Wong
  2025-07-17 23:44   ` [PATCH 20/22] fuse2fs: set iomap-related inode flags Darrick J. Wong
                     ` (2 subsequent siblings)
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:44 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Enable file reads and writes from inline data files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)
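
Reader's sketch, not part of the patch: the fallback-handle pattern used in
op_read and op_write below, reduced to a standalone model.  Every name here
(model_handle, model_lookup_ino, the magic value, the single known path) is
hypothetical; the real code uses struct fuse2fs_file_handle and
fuse2fs_file_ino().  It also shows why the patch clears nullpath_ok: with no
file handle and no path there is nothing left to resolve the inode from.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_FILE_MAGIC 0xEF53DEAFUL	/* arbitrary value for this sketch */

struct model_handle {
	unsigned long magic;
	uint32_t ino;
	int open_flags;
};

/* Hypothetical path lookup; the patch calls fuse2fs_file_ino() here. */
static int model_lookup_ino(const char *path, uint32_t *ino)
{
	if (!path)
		return -EINVAL;	/* no path was supplied by the caller */
	if (strcmp(path, "/inline.txt") == 0) {
		*ino = 12;
		return 0;
	}
	return -ENOENT;
}

/* Use the open file handle if there is one, else fall back to the path. */
static int model_resolve_ino(const char *path, const struct model_handle *fh,
			     uint32_t *ino)
{
	struct model_handle fallback = { .magic = MODEL_FILE_MAGIC };

	if (!fh) {
		int ret = model_lookup_ino(path, &fallback.ino);

		if (ret)
			return ret;
		fh = &fallback;
	}

	*ino = fh->ino;
	return 0;
}
```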


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e281b5fc589d82..c21a95b6920d5c 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1407,6 +1407,14 @@ static void *op_init(struct fuse_conn_info *conn
 	if (fuse2fs_iomap_enabled(ff)) {
 		ff->unmount_in_destroy = 1;
 		ff->can_hardlink = 0;
+
+		/*
+		 * XXX: inline data file io depends on op_read/write being fed
+		 * a path, so we have to slow everyone down to look up the path
+		 * from the nodeid
+		 */
+		if (ext2fs_has_feature_inline_data(ff->fs->super))
+			cfg->nullpath_ok = 0;
 	}
 
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
@@ -3294,6 +3302,9 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
 		   size_t len, off_t offset,
 		   struct fuse_file_info *fp)
 {
+	struct fuse2fs_file_handle fhurk = {
+		.magic = FUSE2FS_FILE_MAGIC,
+	};
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	struct fuse2fs_file_handle *fh =
@@ -3305,10 +3316,21 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
+
+	if (!fh)
+		fh = &fhurk;
+
 	FUSE2FS_CHECK_HANDLE(ff, fh);
 	dbg_printf(ff, "%s: ino=%d off=%jd len=%jd\n", __func__, fh->ino,
 		   (intmax_t) offset, len);
 	fs = fuse2fs_start(ff);
+
+	if (fh == &fhurk) {
+		ret = fuse2fs_file_ino(ff, path, NULL, &fhurk.ino);
+		if (ret)
+			goto out;
+	}
+
 	err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
 	if (err) {
 		ret = translate_error(fs, fh->ino, err);
@@ -3350,6 +3372,10 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
 		    const char *buf, size_t len, off_t offset,
 		    struct fuse_file_info *fp)
 {
+	struct fuse2fs_file_handle fhurk = {
+		.magic = FUSE2FS_FILE_MAGIC,
+		.open_flags = EXT2_FILE_WRITE,
+	};
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	struct fuse2fs_file_handle *fh =
@@ -3361,6 +3387,10 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
+
+	if (!fh)
+		fh = &fhurk;
+
 	FUSE2FS_CHECK_HANDLE(ff, fh);
 	dbg_printf(ff, "%s: ino=%d off=%jd len=%jd\n", __func__, fh->ino,
 		   (intmax_t) offset, (intmax_t) len);
@@ -3375,6 +3405,12 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
 		goto out;
 	}
 
+	if (fh == &fhurk) {
+		ret = fuse2fs_file_ino(ff, path, NULL, &fhurk.ino);
+		if (ret)
+			goto out;
+	}
+
 	err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
 	if (err) {
 		ret = translate_error(fs, fh->ino, err);
@@ -5325,7 +5361,8 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 
 	/* fall back to slow path for inline data reads */
 	if (inode->i_flags & EXT4_INLINE_DATA_FL)
-		return -ENOSYS;
+		return fuse2fs_iomap_begin_inline(ff, ino, inode, pos, count,
+						  read_iomap);
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
 	if (!fuse2fs_iomap_does_fileio(ff)) {



* [PATCH 20/22] fuse2fs: set iomap-related inode flags
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (18 preceding siblings ...)
  2025-07-17 23:44   ` [PATCH 19/22] fuse2fs: enable file IO to inline data files Darrick J. Wong
@ 2025-07-17 23:44   ` Darrick J. Wong
  2025-07-17 23:44   ` [PATCH 21/22] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 22/22] fuse2fs: configure block device block size Darrick J. Wong
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:44 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Set FUSE_IFLAG_* when we do a getattr, so that all files will have iomap
enabled.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index c21a95b6920d5c..e71fcbaeeaf0c6 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1571,6 +1571,25 @@ static int op_getattr(const char *path, struct stat *statbuf
 	return ret;
 }
 
+#ifdef HAVE_FUSE_IOMAP
+static int op_getattr_iflags(const char *path, struct stat *statbuf,
+			     unsigned int *iflags, struct fuse_file_info *fi)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	int ret = op_getattr(path, statbuf, fi);
+
+	if (ret)
+		return ret;
+
+	if (fuse2fs_iomap_does_fileio(ff))
+		*iflags |= FUSE_IFLAG_IOMAP_DIRECTIO | FUSE_IFLAG_IOMAP_FILEIO;
+
+	return 0;
+}
+#endif
+
+
 static int op_readlink(const char *path, char *buf, size_t len)
 {
 	struct fuse_context *ctxt = fuse_get_context();
@@ -6178,6 +6197,7 @@ static struct fuse_operations fs_ops = {
 	.iomap_end = op_iomap_end,
 	.iomap_config = op_iomap_config,
 	.iomap_ioend = op_iomap_ioend,
+	.getattr_iflags = op_getattr_iflags,
 #endif /* HAVE_FUSE_IOMAP */
 };
 



* [PATCH 21/22] fuse2fs: add strictatime/lazytime mount options
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (19 preceding siblings ...)
  2025-07-17 23:44   ` [PATCH 20/22] fuse2fs: set iomap-related inode flags Darrick J. Wong
@ 2025-07-17 23:44   ` Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 22/22] fuse2fs: configure block device block size Darrick J. Wong
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:44 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, we can support the strictatime/lazytime mount options.
Add them to fuse2fs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
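
Reader's sketch, not part of the patch: the option handling below boils
down to "remember that an iomap-only option was seen, then refuse the mount
if iomap could not be enabled".  The model_* names are hypothetical and the
real code goes through the fuse_opt tables instead of a string array; the
four option names are the ones the patch registers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The four option names this patch passes through to libfuse. */
static const char *const model_iomap_only_opts[] = {
	"lazytime", "nolazytime", "strictatime", "nostrictatime",
};

static bool model_opt_requires_iomap(const char *arg)
{
	size_t n = sizeof(model_iomap_only_opts) /
		   sizeof(model_iomap_only_opts[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(arg, model_iomap_only_opts[i]) == 0)
			return true;
	return false;
}

/* Returns 0 if the mount may proceed, -1 if an option needs iomap. */
static int model_validate_mount(const char *const *opts, size_t nr,
				bool iomap_enabled)
{
	bool saw_iomap_only = false;

	for (size_t i = 0; i < nr; i++)
		if (model_opt_requires_iomap(opts[i]))
			saw_iomap_only = true;

	if (saw_iomap_only && !iomap_enabled)
		return -1;	/* "some mount options require iomap." */
	return 0;
}
```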


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e71fcbaeeaf0c6..b5f665ada36991 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -262,6 +262,7 @@ struct fuse2fs {
 	uint8_t unmount_in_destroy;
 	uint8_t noblkdev;
 	uint8_t can_hardlink;
+	uint8_t iomap_passthrough_options;
 
 	enum fuse2fs_opstate opstate;
 	int blocklog;
@@ -1370,6 +1371,10 @@ static void *op_init(struct fuse_conn_info *conn
 		err_printf(ff, "%s\n", _("could not enable iomap."));
 		goto mount_fail;
 	}
+	if (ff->iomap_passthrough_options && !fuse2fs_iomap_enabled(ff)) {
+		err_printf(ff, "%s\n", _("some mount options require iomap."));
+		goto mount_fail;
+	}
 #endif
 #if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_DIRECTIO)
 	if (fuse2fs_iomap_enabled(ff))
@@ -6228,6 +6233,7 @@ enum {
 	FUSE2FS_ERRORS_BEHAVIOR,
 #ifdef HAVE_FUSE_IOMAP
 	FUSE2FS_IOMAP,
+	FUSE2FS_IOMAP_PASSTHROUGH,
 #endif
 };
 
@@ -6251,6 +6257,17 @@ static struct fuse_opt fuse2fs_opts[] = {
 	FUSE2FS_OPT("lockfile=%s",	lockfile,		0),
 	FUSE2FS_OPT("noblkdev",		noblkdev,		1),
 
+#ifdef HAVE_FUSE_IOMAP
+#ifdef MS_LAZYTIME
+	FUSE_OPT_KEY("lazytime",	FUSE2FS_IOMAP_PASSTHROUGH),
+	FUSE_OPT_KEY("nolazytime",	FUSE2FS_IOMAP_PASSTHROUGH),
+#endif
+#ifdef MS_STRICTATIME
+	FUSE_OPT_KEY("strictatime",	FUSE2FS_IOMAP_PASSTHROUGH),
+	FUSE_OPT_KEY("nostrictatime",	FUSE2FS_IOMAP_PASSTHROUGH),
+#endif
+#endif
+
 	FUSE_OPT_KEY("user_xattr",	FUSE2FS_IGNORED),
 	FUSE_OPT_KEY("noblock_validity", FUSE2FS_IGNORED),
 	FUSE_OPT_KEY("nodelalloc",	FUSE2FS_IGNORED),
@@ -6277,6 +6294,12 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
 	struct fuse2fs *ff = data;
 
 	switch (key) {
+#ifdef HAVE_FUSE_IOMAP
+	case FUSE2FS_IOMAP_PASSTHROUGH:
+		ff->iomap_passthrough_options = 1;
+		/* pass through to libfuse */
+		return 1;
+#endif
 	case FUSE2FS_DIRSYNC:
 		ff->dirsync = 1;
 		/* pass through to libfuse */



* [PATCH 22/22] fuse2fs: configure block device block size
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (20 preceding siblings ...)
  2025-07-17 23:44   ` [PATCH 21/22] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
@ 2025-07-17 23:45   ` Darrick J. Wong
  21 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:45 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Set the blocksize of the block device to the filesystem blocksize.
This prevents the bdev pagecache from caching file data blocks that
iomap will read and write directly.  Cache duplication is dangerous.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index b5f665ada36991..d0478af036a25e 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5683,6 +5683,42 @@ static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit)
 	return res;
 }
 
+/*
+ * Set the block device's blocksize to the fs blocksize.
+ *
+ * This is required to avoid creating uptodate bdev pagecache that aliases file
+ * data blocks because iomap reads and writes directly to file data blocks.
+ */
+static int fuse2fs_set_bdev_blocksize(struct fuse2fs *ff, int fd)
+{
+	int blocksize = ff->fs->blocksize;
+	int set_error;
+	int ret;
+
+	ret = ioctl(fd, BLKBSZSET, &blocksize);
+	if (!ret)
+		return 0;
+
+	/*
+	 * Save the original errno so we can report that if the block device
+	 * blocksize isn't set in an agreeable way.
+	 */
+	set_error = errno;
+
+	ret = ioctl(fd, BLKBSZGET, &blocksize);
+	if (ret)
+		goto out_bad;
+
+	if (blocksize <= ff->fs->blocksize)
+		return 0;
+
+	set_error = EINVAL;
+out_bad:
+	err_printf(ff, "%s: cannot set blocksize %u: %s\n", __func__,
+		   blocksize, strerror(set_error));
+	return -EIO;
+}
+
 static errcode_t fuse2fs_iomap_config_devices(struct fuse_context *ctxt,
 					      struct fuse2fs *ff)
 {
@@ -5695,6 +5731,10 @@ static errcode_t fuse2fs_iomap_config_devices(struct fuse_context *ctxt,
 	if (err)
 		return err;
 
+	ret = fuse2fs_set_bdev_blocksize(ff, fd);
+	if (ret)
+		return ret;
+
 	ret = fuse_iomap_add_device(se, fd, 0);
 
 	dbg_printf(ff, "%s: registering iomap dev fd=%d ret=%d iomap_dev=%u\n",



* [PATCH 1/1] fuse2fs: enable caching of iomaps
  2025-07-17 23:26 ` [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-07-17 23:45   ` Darrick J. Wong
  0 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:45 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Cache the iomaps we generate in the kernel for better performance.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
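
Reader's sketch, not part of the patch: a toy model of the "upsert the
mapping into the kernel cache, then return a null mapping so the kernel
retries against the cache" flow.  All types and functions here are
hypothetical; the real call is fuse_lowlevel_notify_iomap_upsert() and the
real cache lives in the kernel.  This only illustrates the control flow.

```c
#include <stddef.h>
#include <stdint.h>

enum model_iomap_type { MODEL_IOMAP_NULL = 0, MODEL_IOMAP_MAPPED };

struct model_iomap {
	enum model_iomap_type type;
	uint64_t offset;	/* file offset of the mapping */
	uint64_t length;	/* length in bytes */
	uint64_t addr;		/* disk address */
};

#define MODEL_CACHE_SLOTS 4

/* Stand-in for the per-inode mapping cache kept by the kernel. */
struct model_cache {
	struct model_iomap slot[MODEL_CACHE_SLOTS];
	unsigned int used;
};

/* Update the entry with the same file offset, or insert a new one. */
static int model_upsert(struct model_cache *c, const struct model_iomap *m)
{
	for (unsigned int i = 0; i < c->used; i++) {
		if (c->slot[i].offset == m->offset) {
			c->slot[i] = *m;
			return 0;
		}
	}
	if (c->used == MODEL_CACHE_SLOTS)
		return -1;
	c->slot[c->used++] = *m;
	return 0;
}

static const struct model_iomap *
model_cache_lookup(const struct model_cache *c, uint64_t pos)
{
	for (unsigned int i = 0; i < c->used; i++) {
		const struct model_iomap *m = &c->slot[i];

		if (pos >= m->offset && pos - m->offset < m->length)
			return m;
	}
	return NULL;
}

/* Server side: push the mapping into the cache, then hand back "null". */
static int model_iomap_begin(struct model_cache *kernel_cache,
			     struct model_iomap *read_iomap)
{
	int err = model_upsert(kernel_cache, read_iomap);

	if (err)
		return err;

	read_iomap->type = MODEL_IOMAP_NULL;	/* prompts a cache retry */
	return 0;
}
```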


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index d0478af036a25e..f863042a4db074 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5505,6 +5505,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct fuse_session *se = fuse_get_session(ctxt->fuse);
 	struct ext2_inode_large inode;
 	ext2_filsys fs;
 	errcode_t err;
@@ -5560,6 +5561,24 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		}
 	}
 
+	/*
+	 * Cache the mappings in the kernel so that we can reuse them for
+	 * subsequent IO.  Note that we have to return NULL mappings to the
+	 * kernel to prompt it to re-try the cache.
+	 */
+	write_iomap->type = FUSE_IOMAP_TYPE_NULL;
+	err = fuse_lowlevel_notify_iomap_upsert(se, nodeid, attr_ino,
+						read_iomap, write_iomap);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	/* Null out the read mapping to encourage a retry. */
+	read_iomap->type = FUSE_IOMAP_TYPE_NULL;
+	read_iomap->dev = FUSE_IOMAP_DEV_NULL;
+	read_iomap->addr = FUSE_IOMAP_NULL_ADDR;
+
 out_unlock:
 	fuse2fs_finish(ff, ret);
 	return ret;



* [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-07-17 23:45   ` Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled Darrick J. Wong
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:45 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Commit 9f69dfc4e275cc didn't quite get the permissions checking correct:

generic/362       - output mismatch (see /var/tmp/fstests/generic/362.out.bad)
    --- tests/generic/362.out   2025-04-30 16:20:44.563833050 -0700
    +++ /var/tmp/fstests/generic/362.out.bad    2025-06-11 17:04:24.061193618 -0700
    @@ -1,2 +1,3 @@
     QA output created by 362
    +Failed to open/create file: Operation not permitted
     Silence is golden
    ...
    (Run 'diff -u /run/fstests/bin/tests/generic/362.out /var/tmp/fstests/generic/362.out.bad'  to see the entire diff)

The kernel allows opening a file for append and truncation.  What it
doesn't allow is opening an append-only file for truncation.  Note that
this causes generic/079 to regress, but the root cause of that problem
is actually that fuse oddly supports FS_IOC_[GS]ETFLAGS but doesn't
actually set the VFS inode flags.

Fixes: 9f69dfc4e275cc ("fuse2fs: implement O_APPEND correctly")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)
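
Reader's sketch, not part of the patch: the distinction drawn above between
the O_APPEND open flag and an append-only inode, as a pure function.  The
function name is hypothetical.  The rules are modelled on what the VFS
enforces for append-only inodes and are an assumption for illustration;
fuse2fs itself does not gain such a helper in this patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Returns 0 if the open is allowed, -EPERM otherwise.  inode_append_only
 * models the append-only inode attribute, which is unrelated to O_APPEND.
 */
static int model_check_open(int open_flags, bool inode_append_only)
{
	if (!inode_append_only)
		return 0;	/* O_APPEND | O_TRUNC is fine on a normal file */

	/* Writes to an append-only file must be opened with O_APPEND. */
	if ((open_flags & O_ACCMODE) != O_RDONLY && !(open_flags & O_APPEND))
		return -EPERM;

	/* An append-only file can never be truncated on open. */
	if (open_flags & O_TRUNC)
		return -EPERM;

	return 0;
}
```

The old fuse2fs code rejected the first case below the comment, which is
what made generic/362 fail.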


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f863042a4db074..f9151ae6acb4e5 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -3254,15 +3254,8 @@ static int __op_open(struct fuse2fs *ff, const char *path,
 	/* the kernel handles all block IO for us in iomap mode */
 	if (fuse2fs_iomap_does_fileio(ff))
 		file->open_flags |= EXT2_FILE_NOBLOCKIO;
-	if (fp->flags & O_APPEND) {
-		/* the kernel doesn't allow truncation of an append-only file */
-		if (fp->flags & O_TRUNC) {
-			ret = -EPERM;
-			goto out;
-		}
-
+	if (fp->flags & O_APPEND)
 		check |= A_OK;
-	}
 
 	detect_linux_executable_open(fp->flags, &check, &file->open_flags);
 



* [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens Darrick J. Wong
@ 2025-07-17 23:45   ` Darrick J. Wong
  2025-07-17 23:46   ` [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:45 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

When iomap is enabled, the kernel is in charge of enforcing permissions
checks on timestamp updates for files.  We needn't do that in userspace
anymore.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f9151ae6acb4e5..5d75cffa8f6bca 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4334,11 +4334,12 @@ static int op_utimens(const char *path, const struct timespec ctv[2]
 
 	/*
 	 * ext4 allows timestamp updates of append-only files but only if we're
-	 * setting to current time
+	 * setting to current time.  If iomap is enabled, the kernel does the
+	 * permission checking for timestamp updates and we can skip the check.
 	 */
 	if (ctv[0].tv_nsec == UTIME_NOW && ctv[1].tv_nsec == UTIME_NOW)
 		access |= A_OK;
-	ret = check_inum_access(ff, ino, access);
+	ret = fuse2fs_iomap_enabled(ff) ? 0 : check_inum_access(ff, ino, access);
 	if (ret)
 		goto out;
 



* [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled Darrick J. Wong
@ 2025-07-17 23:46   ` Darrick J. Wong
  2025-07-17 23:46   ` [PATCH 04/10] fuse2fs: better debugging for file mode updates Darrick J. Wong
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:46 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

When the kernel is running in iomap mode, it will also manage all the
ACL updates and the resulting file mode changes for us.  Disable the
manual implementation of it in fuse2fs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 5d75cffa8f6bca..e580622d39b1d1 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1739,7 +1739,7 @@ static int propagate_default_acls(struct fuse2fs *ff, ext2_ino_t parent,
 	size_t deflen;
 	int ret;
 
-	if (!ff->acl)
+	if (!ff->acl || fuse2fs_iomap_does_fileio(ff))
 		return 0;
 
 	ret = __getxattr(ff, parent, XATTR_NAME_POSIX_ACL_DEFAULT, &def,
@@ -2999,7 +2999,7 @@ static int op_chmod(const char *path, mode_t mode
 	 * of the user's groups, but FUSE only tells us about the primary
 	 * group.
 	 */
-	if (!is_superuser(ff, ctxt)) {
+	if (!fuse2fs_iomap_does_fileio(ff) && !is_superuser(ff, ctxt)) {
 		ret = in_file_group(ctxt, &inode);
 		if (ret < 0)
 			goto out;



* [PATCH 04/10] fuse2fs: better debugging for file mode updates
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:46   ` [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
@ 2025-07-17 23:46   ` Darrick J. Wong
  2025-07-17 23:46   ` [PATCH 05/10] fuse2fs: debug timestamp updates Darrick J. Wong
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:46 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Improve the tracing of a chmod operation so that we can debug file mode
updates.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
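
Reader's sketch, not part of the patch: the mode merge that op_chmod
performs, pulled out as a pure function (the name is hypothetical).
Computing the new mode before touching inode.i_mode is what lets the debug
message in the diff print both the old and the new value.

```c
#include <sys/types.h>

/*
 * Keep the file type bits of the old mode and take only the low twelve
 * bits (permissions, setuid, setgid, sticky) from the request.
 */
static mode_t model_merge_mode(mode_t old_mode, mode_t requested)
{
	return (old_mode & ~(mode_t)0xFFF) | (requested & 0xFFF);
}
```

For example, a regular file with mode 0100644 and a request of 0755 yields
0100755; type bits in the request are ignored.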


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e580622d39b1d1..f2cb44a4e53b4c 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -2964,12 +2964,13 @@ static int op_chmod(const char *path, mode_t mode
 #endif
 			)
 {
+	struct ext2_inode_large inode;
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	ext2_filsys fs;
 	errcode_t err;
 	ext2_ino_t ino;
-	struct ext2_inode_large inode;
+	mode_t new_mode;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
@@ -3008,11 +3009,12 @@ static int op_chmod(const char *path, mode_t mode
 			mode &= ~S_ISGID;
 	}
 
-	inode.i_mode &= ~0xFFF;
-	inode.i_mode |= mode & 0xFFF;
+	new_mode = (inode.i_mode & ~0xFFF) | (mode & 0xFFF);
 
-	dbg_printf(ff, "%s: path=%s new_mode=0%o ino=%d\n", __func__,
-		   path, inode.i_mode, ino);
+	dbg_printf(ff, "%s: path=%s old_mode=0%o new_mode=0%o ino=%d\n",
+		   __func__, path, inode.i_mode, new_mode, ino);
+
+	inode.i_mode = new_mode;
 
 	ret = update_ctime(fs, ino, &inode);
 	if (ret)



* [PATCH 05/10] fuse2fs: debug timestamp updates
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-17 23:46   ` [PATCH 04/10] fuse2fs: better debugging for file mode updates Darrick J. Wong
@ 2025-07-17 23:46   ` Darrick J. Wong
  2025-07-17 23:46   ` [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:46 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Add tracing for timestamp updates to files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   99 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 62 insertions(+), 37 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f2cb44a4e53b4c..ddc647f32c5df6 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -599,7 +599,8 @@ static void increment_version(struct ext2_inode_large *inode)
 		inode->i_version_hi = ver >> 32;
 }
 
-static void init_times(struct ext2_inode_large *inode)
+static void fuse2fs_init_timestamps(struct fuse2fs *ff, ext2_ino_t ino,
+				    struct ext2_inode_large *inode)
 {
 	struct timespec now;
 
@@ -609,11 +610,15 @@ static void init_times(struct ext2_inode_large *inode)
 	EXT4_INODE_SET_XTIME(i_mtime, &now, inode);
 	EXT4_EINODE_SET_XTIME(i_crtime, &now, inode);
 	increment_version(inode);
+
+	dbg_printf(ff, "%s: ino=%u time %ld:%lu\n", __func__, ino, now.tv_sec,
+		   now.tv_nsec);
 }
 
-static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
-			struct ext2_inode_large *pinode)
+static int fuse2fs_update_ctime(struct fuse2fs *ff, ext2_ino_t ino,
+				struct ext2_inode_large *pinode)
 {
+	ext2_filsys fs = ff->fs;
 	errcode_t err;
 	struct timespec now;
 	struct ext2_inode_large inode;
@@ -624,6 +629,10 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
 	if (pinode) {
 		increment_version(pinode);
 		EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
+
+		dbg_printf(ff, "%s: ino=%u ctime %ld:%lu\n", __func__, ino,
+			   now.tv_sec, now.tv_nsec);
+
 		return 0;
 	}
 
@@ -635,6 +644,9 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
 	increment_version(&inode);
 	EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
 
+	dbg_printf(ff, "%s: ino=%u ctime %ld:%lu\n", __func__, ino,
+		   now.tv_sec, now.tv_nsec);
+
 	err = fuse2fs_write_inode(fs, ino, &inode);
 	if (err)
 		return translate_error(fs, ino, err);
@@ -642,8 +654,9 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
 	return 0;
 }
 
-static int update_atime(ext2_filsys fs, ext2_ino_t ino)
+static int fuse2fs_update_atime(struct fuse2fs *ff, ext2_ino_t ino)
 {
+	ext2_filsys fs = ff->fs;
 	errcode_t err;
 	struct ext2_inode_large inode, *pinode;
 	struct timespec atime, mtime, now;
@@ -662,6 +675,10 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
 	dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
 	dnow = now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC);
 
+	dbg_printf(ff, "%s: ino=%u atime %ld:%lu mtime %ld:%lu now %ld:%lu\n",
+		   __func__, ino, atime.tv_sec, atime.tv_nsec, mtime.tv_sec,
+		   mtime.tv_nsec, now.tv_sec, now.tv_nsec);
+
 	/*
 	 * If atime is newer than mtime and atime hasn't been updated in thirty
 	 * seconds, skip the atime update.  Same idea as Linux "relatime".  Use
@@ -678,9 +695,10 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
 	return 0;
 }
 
-static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
-			struct ext2_inode_large *pinode)
+static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino,
+				struct ext2_inode_large *pinode)
 {
+	ext2_filsys fs = ff->fs;
 	errcode_t err;
 	struct ext2_inode_large inode;
 	struct timespec now;
@@ -690,6 +708,10 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
 		EXT4_INODE_SET_XTIME(i_mtime, &now, pinode);
 		EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
 		increment_version(pinode);
+
+		dbg_printf(ff, "%s: ino=%u mtime/ctime %ld:%lu\n",
+			   __func__, ino, now.tv_sec, now.tv_nsec);
+
 		return 0;
 	}
 
@@ -702,6 +724,9 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
 	EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
 	increment_version(&inode);
 
+	dbg_printf(ff, "%s: ino=%u mtime/ctime %ld:%lu\n",
+		   __func__, ino, now.tv_sec, now.tv_nsec);
+
 	err = fuse2fs_write_inode(fs, ino, &inode);
 	if (err)
 		return translate_error(fs, ino, err);
@@ -1660,7 +1685,7 @@ static int op_readlink(const char *path, char *buf, size_t len)
 	buf[len] = 0;
 
 	if (fuse2fs_is_writeable(ff)) {
-		ret = update_atime(fs, ino);
+		ret = fuse2fs_update_atime(ff, ino);
 		if (ret)
 			goto out;
 	}
@@ -1927,7 +1952,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
 		goto out2;
 	}
 
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -1950,7 +1975,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
 	}
 
 	inode.i_generation = ff->next_generation++;
-	init_times(&inode);
+	fuse2fs_init_timestamps(ff, child, &inode);
 	err = fuse2fs_write_inode(fs, child, &inode);
 	if (err) {
 		ret = translate_error(fs, child, err);
@@ -2036,7 +2061,7 @@ static int op_mkdir(const char *path, mode_t mode)
 		goto out2;
 	}
 
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -2063,7 +2088,7 @@ static int op_mkdir(const char *path, mode_t mode)
 	if (parent_sgid)
 		inode.i_mode |= S_ISGID;
 	inode.i_generation = ff->next_generation++;
-	init_times(&inode);
+	fuse2fs_init_timestamps(ff, child, &inode);
 
 	err = fuse2fs_write_inode(fs, child, &inode);
 	if (err) {
@@ -2146,7 +2171,7 @@ static int fuse2fs_unlink(struct fuse2fs *ff, const char *path,
 	if (err)
 		return translate_error(fs, dir, err);
 
-	ret = update_mtime(fs, dir, NULL);
+	ret = fuse2fs_update_mtime(ff, dir, NULL);
 	if (ret)
 		return ret;
 
@@ -2215,7 +2240,7 @@ static int remove_inode(struct fuse2fs *ff, ext2_ino_t ino)
 		inode.i_links_count--;
 	}
 
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out;
 
@@ -2394,7 +2419,7 @@ static int __op_rmdir(struct fuse2fs *ff, const char *path)
 		}
 		if (inode.i_links_count > 1)
 			inode.i_links_count--;
-		ret = update_mtime(fs, rds.parent, &inode);
+		ret = fuse2fs_update_mtime(ff, rds.parent, &inode);
 		if (ret)
 			goto out;
 		err = fuse2fs_write_inode(fs, rds.parent, &inode);
@@ -2488,7 +2513,7 @@ static int op_symlink(const char *src, const char *dest)
 	}
 
 	/* Update parent dir's mtime */
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -2512,7 +2537,7 @@ static int op_symlink(const char *src, const char *dest)
 	fuse2fs_set_uid(&inode, ctxt->uid);
 	fuse2fs_set_gid(&inode, gid);
 	inode.i_generation = ff->next_generation++;
-	init_times(&inode);
+	fuse2fs_init_timestamps(ff, child, &inode);
 
 	err = fuse2fs_write_inode(fs, child, &inode);
 	if (err) {
@@ -2762,11 +2787,11 @@ static int op_rename(const char *from, const char *to
 	}
 
 	/* Update timestamps */
-	ret = update_ctime(fs, from_ino, NULL);
+	ret = fuse2fs_update_ctime(ff, from_ino, NULL);
 	if (ret)
 		goto out2;
 
-	ret = update_mtime(fs, to_dir_ino, NULL);
+	ret = fuse2fs_update_mtime(ff, to_dir_ino, NULL);
 	if (ret)
 		goto out2;
 
@@ -2860,7 +2885,7 @@ static int op_link(const char *src, const char *dest)
 		goto out2;
 
 	inode.i_links_count++;
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out2;
 
@@ -2879,7 +2904,7 @@ static int op_link(const char *src, const char *dest)
 		goto out2;
 	}
 
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -3016,7 +3041,7 @@ static int op_chmod(const char *path, mode_t mode
 
 	inode.i_mode = new_mode;
 
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out;
 
@@ -3086,7 +3111,7 @@ static int op_chown(const char *path, uid_t owner, gid_t group
 		fuse2fs_set_gid(&inode, group);
 	}
 
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out;
 
@@ -3159,7 +3184,7 @@ static int truncate_helper(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size)
 	if (err)
 		return translate_error(fs, ino, err);
 
-	ret = update_mtime(fs, ino, NULL);
+	ret = fuse2fs_update_mtime(ff, ino, NULL);
 	if (ret)
 		return ret;
 
@@ -3378,7 +3403,7 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
 	}
 
 	if (fuse2fs_is_writeable(ff)) {
-		ret = update_atime(fs, fh->ino);
+		ret = fuse2fs_update_atime(ff, fh->ino);
 		if (ret)
 			goto out;
 	}
@@ -3464,7 +3489,7 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
 		goto out;
 	}
 
-	ret = update_mtime(fs, fh->ino, NULL);
+	ret = fuse2fs_update_mtime(ff, fh->ino, NULL);
 	if (ret)
 		goto out;
 
@@ -3834,7 +3859,7 @@ static int op_setxattr(const char *path EXT2FS_ATTR((unused)),
 		goto out2;
 	}
 
-	ret = update_ctime(fs, ino, NULL);
+	ret = fuse2fs_update_ctime(ff, ino, NULL);
 out2:
 	err = ext2fs_xattrs_close(&h);
 	if (!ret && err)
@@ -3929,7 +3954,7 @@ static int op_removexattr(const char *path, const char *key)
 		goto out2;
 	}
 
-	ret = update_ctime(fs, ino, NULL);
+	ret = fuse2fs_update_ctime(ff, ino, NULL);
 out2:
 	err = ext2fs_xattrs_close(&h);
 	if (err && !ret)
@@ -4067,7 +4092,7 @@ static int op_readdir(const char *path EXT2FS_ATTR((unused)),
 	}
 
 	if (fuse2fs_is_writeable(ff)) {
-		ret = update_atime(i.fs, fh->ino);
+		ret = fuse2fs_update_atime(ff, fh->ino);
 		if (ret)
 			goto out;
 	}
@@ -4173,7 +4198,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
 		goto out2;
 	}
 
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -4204,7 +4229,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
 	}
 
 	inode.i_generation = ff->next_generation++;
-	init_times(&inode);
+	fuse2fs_init_timestamps(ff, child, &inode);
 	err = fuse2fs_write_inode(fs, child, &inode);
 	if (err) {
 		ret = translate_error(fs, child, err);
@@ -4277,7 +4302,7 @@ static int op_ftruncate(const char *path EXT2FS_ATTR((unused)),
 		goto out;
 	}
 
-	ret = update_mtime(fs, fh->ino, NULL);
+	ret = fuse2fs_update_mtime(ff, fh->ino, NULL);
 	if (ret)
 		goto out;
 
@@ -4365,7 +4390,7 @@ static int op_utimens(const char *path, const struct timespec ctv[2]
 	if (tv[1].tv_nsec != UTIME_OMIT)
 		EXT4_INODE_SET_XTIME(i_mtime, &tv[1], &inode);
 #endif /* UTIME_OMIT */
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out;
 
@@ -4433,7 +4458,7 @@ static int ioctl_setflags(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 	if (ret)
 		return ret;
 
-	ret = update_ctime(fs, fh->ino, &inode);
+	ret = fuse2fs_update_ctime(ff, fh->ino, &inode);
 	if (ret)
 		return ret;
 
@@ -4480,7 +4505,7 @@ static int ioctl_setversion(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 
 	inode.i_generation = generation;
 
-	ret = update_ctime(fs, fh->ino, &inode);
+	ret = fuse2fs_update_ctime(ff, fh->ino, &inode);
 	if (ret)
 		return ret;
 
@@ -4585,7 +4610,7 @@ static int ioctl_fssetxattr(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 	if (ext2fs_inode_includes(inode_size, i_projid))
 		inode.i_projid = fsx->fsx_projid;
 
-	ret = update_ctime(fs, fh->ino, &inode);
+	ret = fuse2fs_update_ctime(ff, fh->ino, &inode);
 	if (ret)
 		return ret;
 
@@ -4832,7 +4857,7 @@ static int fuse2fs_allocate_range(struct fuse2fs *ff,
 		}
 	}
 
-	err = update_mtime(fs, fh->ino, &inode);
+	err = fuse2fs_update_mtime(ff, fh->ino, &inode);
 	if (err)
 		return err;
 
@@ -4986,7 +5011,7 @@ static int fuse2fs_punch_range(struct fuse2fs *ff,
 			return translate_error(fs, fh->ino, err);
 	}
 
-	err = update_mtime(fs, fh->ino, &inode);
+	err = fuse2fs_update_mtime(ff, fh->ino, &inode);
 	if (err)
 		return err;
 


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-17 23:46   ` [PATCH 05/10] fuse2fs: debug timestamp updates Darrick J. Wong
@ 2025-07-17 23:46   ` Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:46 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the kernel is responsible for maintaining timestamps
because file writes don't upcall to fuse2fs.  The kernel decides whether
[cm]time should be updated by checking whether [cm]time exactly matches
the coarse clock, instead of checking that [cm]time < coarse_clock.  As
a result, if fuse2fs sets a fine-grained timestamp that is slightly
ahead of the coarse clock, timestamps can appear to go backwards.
generic/423 doesn't like seeing btime > ctime from statx, so use the
coarse clock in iomap mode.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index ddc647f32c5df6..54f501b36d808b 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -575,8 +575,24 @@ static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
 	ext2fs_extent_free(extents);
 }
 
-static void get_now(struct timespec *now)
+static void fuse2fs_get_now(struct fuse2fs *ff, struct timespec *now)
 {
+#ifdef CLOCK_REALTIME_COARSE
+	/*
+	 * In iomap mode, the kernel is responsible for maintaining timestamps
+	 * because file writes don't upcall to fuse2fs.  The kernel's predicate
+	 * for deciding if [cm]time should be updated bases its decisions off
+	 * [cm]time being an exact match for the coarse clock (instead of
+	 * checking that [cm]time < coarse_clock) which means that fuse2fs
+	 * setting a fine-grained timestamp that is slightly ahead of the
+	 * coarse clock can result in timestamps appearing to go backwards.
+	 * generic/423 doesn't like seeing btime > ctime from statx, so we'll
+	 * use the coarse clock in iomap mode.
+	 */
+	if (fuse2fs_iomap_does_fileio(ff) &&
+	    !clock_gettime(CLOCK_REALTIME_COARSE, now))
+		return;
+#endif
 #ifdef CLOCK_REALTIME
 	if (!clock_gettime(CLOCK_REALTIME, now))
 		return;
@@ -604,7 +620,7 @@ static void fuse2fs_init_timestamps(struct fuse2fs *ff, ext2_ino_t ino,
 {
 	struct timespec now;
 
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 	EXT4_INODE_SET_XTIME(i_atime, &now, inode);
 	EXT4_INODE_SET_XTIME(i_ctime, &now, inode);
 	EXT4_INODE_SET_XTIME(i_mtime, &now, inode);
@@ -623,7 +639,7 @@ static int fuse2fs_update_ctime(struct fuse2fs *ff, ext2_ino_t ino,
 	struct timespec now;
 	struct ext2_inode_large inode;
 
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 
 	/* If user already has a inode buffer, just update that */
 	if (pinode) {
@@ -669,7 +685,7 @@ static int fuse2fs_update_atime(struct fuse2fs *ff, ext2_ino_t ino)
 	pinode = &inode;
 	EXT4_INODE_GET_XTIME(i_atime, &atime, pinode);
 	EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode);
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 
 	datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC);
 	dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
@@ -704,7 +720,7 @@ static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino,
 	struct timespec now;
 
 	if (pinode) {
-		get_now(&now);
+		fuse2fs_get_now(ff, &now);
 		EXT4_INODE_SET_XTIME(i_mtime, &now, pinode);
 		EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
 		increment_version(pinode);
@@ -719,7 +735,7 @@ static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return translate_error(fs, ino, err);
 
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 	EXT4_INODE_SET_XTIME(i_mtime, &now, &inode);
 	EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
 	increment_version(&inode);
@@ -4380,9 +4396,9 @@ static int op_utimens(const char *path, const struct timespec ctv[2]
 	tv[1] = ctv[1];
 #ifdef UTIME_NOW
 	if (tv[0].tv_nsec == UTIME_NOW)
-		get_now(tv);
+		fuse2fs_get_now(ff, tv);
 	if (tv[1].tv_nsec == UTIME_NOW)
-		get_now(tv + 1);
+		fuse2fs_get_now(ff, tv + 1);
 #endif /* UTIME_NOW */
 #ifdef UTIME_OMIT
 	if (tv[0].tv_nsec != UTIME_OMIT)
@@ -6917,7 +6933,7 @@ static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
 			error_message(err), func, line);
 
 	/* Make a note in the error log */
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 	ext2fs_set_tstamp(fs->super, s_last_error_time, now.tv_sec);
 	fs->super->s_last_error_ino = ino;
 	fs->super->s_last_error_line = line;


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-07-17 23:46   ` [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
@ 2025-07-17 23:47   ` Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 08/10] fuse2fs: enable syncfs Darrick J. Wong
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:47 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Add tracing for retrieving timestamps so that we can debug weird
timestamp behavior.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 54f501b36d808b..15595fdf0b19ba 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1502,9 +1502,11 @@ static void *op_init(struct fuse_conn_info *conn
 	goto out;
 }
 
-static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf)
+static int fuse2fs_stat(struct fuse2fs *ff, ext2_ino_t ino,
+			struct stat *statbuf)
 {
 	struct ext2_inode_large inode;
+	ext2_filsys fs = ff->fs;
 	dev_t fakedev = 0;
 	errcode_t err;
 	int ret = 0;
@@ -1543,6 +1545,13 @@ static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf)
 #else
 	statbuf->st_ctime = tv.tv_sec;
 #endif
+
+	dbg_printf(ff, "%s: ino=%d atime=%lld.%ld mtime=%lld.%ld ctime=%lld.%ld\n",
+		   __func__, ino,
+		   (long long int)statbuf->st_atim.tv_sec, statbuf->st_atim.tv_nsec,
+		   (long long int)statbuf->st_mtim.tv_sec, statbuf->st_mtim.tv_nsec,
+		   (long long int)statbuf->st_ctim.tv_sec, statbuf->st_ctim.tv_nsec);
+
 	if (LINUX_S_ISCHR(inode.i_mode) ||
 	    LINUX_S_ISBLK(inode.i_mode)) {
 		if (inode.i_block[0])
@@ -1602,16 +1611,15 @@ static int op_getattr(const char *path, struct stat *statbuf
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
-	ext2_filsys fs;
 	ext2_ino_t ino;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
-	fs = fuse2fs_start(ff);
+	fuse2fs_start(ff);
 	ret = fuse2fs_file_ino(ff, path, fi, &ino);
 	if (ret)
 		goto out;
-	ret = stat_inode(fs, ino, statbuf);
+	ret = fuse2fs_stat(ff, ino, statbuf);
 out:
 	fuse2fs_finish(ff, ret);
 	return ret;
@@ -4051,7 +4059,7 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)),
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 	if (i->flags == FUSE_READDIR_PLUS) {
-		ret = stat_inode(i->fs, dirent->inode, &stat);
+		ret = fuse2fs_stat(i->ff, dirent->inode, &stat);
 		if (ret)
 			return DIRENT_ABORT;
 	}
@@ -4342,7 +4350,7 @@ static int op_fgetattr(const char *path EXT2FS_ATTR((unused)),
 	FUSE2FS_CHECK_HANDLE(ff, fh);
 	dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
 	fs = fuse2fs_start(ff);
-	ret = stat_inode(fs, fh->ino, statbuf);
+	ret = fuse2fs_stat(ff, fh->ino, statbuf);
 	fuse2fs_finish(ff, ret);
 
 	return ret;


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 08/10] fuse2fs: enable syncfs
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-07-17 23:47   ` [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
@ 2025-07-17 23:47   ` Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 09/10] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 10/10] fuse2fs: implement statx Darrick J. Wong
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:47 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Enable syncfs calls in fuse2fs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 15595fdf0b19ba..66baca72ad49d1 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5099,6 +5099,42 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 # endif /* SUPPORT_FALLOCATE */
 #endif /* FUSE 29 */
 
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+static int op_syncfs(const char *path)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+	dbg_printf(ff, "%s: path=%s\n", __func__, path);
+	fs = fuse2fs_start(ff);
+
+	if (ff->opstate == F2OP_WRITABLE) {
+		if (fs->super->s_error_count)
+			fs->super->s_state |= EXT2_ERROR_FS;
+		ext2fs_mark_super_dirty(fs);
+		err = ext2fs_set_gdt_csum(fs);
+		if (err) {
+			ret = translate_error(fs, 0, err);
+			goto out_unlock;
+		}
+
+		err = ext2fs_flush2(fs, 0);
+		if (err) {
+			ret = translate_error(fs, 0, err);
+			goto out_unlock;
+		}
+	}
+
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
+}
+#endif
+
 #ifdef HAVE_FUSE_IOMAP
 static void fuse2fs_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap,
 			       off_t pos, uint64_t count)
@@ -6301,6 +6337,9 @@ static struct fuse_operations fs_ops = {
 	.fallocate = op_fallocate,
 # endif
 #endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+	.syncfs = op_syncfs,
+#endif
 #ifdef HAVE_FUSE_IOMAP
 	.iomap_begin = op_iomap_begin,
 	.iomap_end = op_iomap_end,


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 09/10] fuse2fs: skip the gdt write in op_destroy if syncfs is working
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-07-17 23:47   ` [PATCH 08/10] fuse2fs: enable syncfs Darrick J. Wong
@ 2025-07-17 23:47   ` Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 10/10] fuse2fs: implement statx Darrick J. Wong
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:47 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

As an umount-time performance enhancement, don't bother to write the
group descriptor tables in op_destroy if we know that op_syncfs will do
it for us.  That only happens if iomap is enabled.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)
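
The flag handling can be modelled in isolation.  This is a minimal
sketch under simplifying assumptions: struct sketch_server is a
hypothetical, trimmed-down stand-in for struct fuse2fs, gdt_writes
merely counts where ext2fs_set_gdt_csum would have run, and flush_err
models any failure on the syncfs path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct fuse2fs used here. */
struct sketch_server {
	uint8_t iomap_enabled;		/* models fuse2fs_iomap_enabled(ff) */
	uint8_t write_gdt_on_destroy;	/* initialized to 1, as in main() */
	unsigned int gdt_writes;	/* simulated GDT checksum passes */
};

/* Simulated syncfs: a failure leaves write_gdt_on_destroy untouched. */
static int sketch_syncfs(struct sketch_server *s, int flush_err)
{
	assert(s);
	if (flush_err)
		return flush_err;
	s->gdt_writes++;

	/* Only a successful syncfs in iomap mode lets destroy skip it. */
	if (s->iomap_enabled)
		s->write_gdt_on_destroy = 0;
	return 0;
}

/* Simulated destroy: redo the GDT pass unless syncfs already did it. */
static void sketch_destroy(struct sketch_server *s)
{
	assert(s);
	if (s->write_gdt_on_destroy)
		s->gdt_writes++;
}
```

With iomap enabled, a successful syncfs followed by destroy costs one
GDT pass instead of two.  Without iomap, or if syncfs failed, destroy
still does the work itself.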


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 66baca72ad49d1..3bded0fdd21e2a 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -263,6 +263,7 @@ struct fuse2fs {
 	uint8_t noblkdev;
 	uint8_t can_hardlink;
 	uint8_t iomap_passthrough_options;
+	uint8_t write_gdt_on_destroy;
 
 	enum fuse2fs_opstate opstate;
 	int blocklog;
@@ -1212,9 +1213,11 @@ static void op_destroy(void *p EXT2FS_ATTR((unused)))
 		if (fs->super->s_error_count)
 			fs->super->s_state |= EXT2_ERROR_FS;
 		ext2fs_mark_super_dirty(fs);
-		err = ext2fs_set_gdt_csum(fs);
-		if (err)
-			translate_error(fs, 0, err);
+		if (ff->write_gdt_on_destroy) {
+			err = ext2fs_set_gdt_csum(fs);
+			if (err)
+				translate_error(fs, 0, err);
+		}
 
 		err = ext2fs_flush2(fs, 0);
 		if (err)
@@ -5129,6 +5132,15 @@ static int op_syncfs(const char *path)
 		}
 	}
 
+	/*
+	 * When iomap is enabled, the kernel will call syncfs right before
+	 * calling the destroy method.  If any syncfs succeeds, then we know
+	 * that there will be a last syncfs and that it will write the GDT, so
+	 * destroy doesn't need to waste time doing that.
+	 */
+	if (fuse2fs_iomap_enabled(ff))
+		ff->write_gdt_on_destroy = 0;
+
 out_unlock:
 	fuse2fs_finish(ff, ret);
 	return ret;
@@ -6631,6 +6643,7 @@ int main(int argc, char *argv[])
 		.iomap_dev = FUSE_IOMAP_DEV_NULL,
 #endif
 		.can_hardlink = 1,
+		.write_gdt_on_destroy = 1,
 	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 10/10] fuse2fs: implement statx
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-07-17 23:47   ` [PATCH 09/10] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
@ 2025-07-17 23:47   ` Darrick J. Wong
  9 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:47 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Implement statx.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 3bded0fdd21e2a..6d2ed7da9cc09e 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -23,6 +23,7 @@
 #include <sys/xattr.h>
 #endif
 #include <sys/ioctl.h>
+#include <sys/sysmacros.h>
 #include <unistd.h>
 #include <ctype.h>
 #define FUSE_DARWIN_ENABLE_EXTENSIONS 0
@@ -1646,6 +1647,111 @@ static int op_getattr_iflags(const char *path, struct stat *statbuf,
 }
 #endif
 
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18) && defined(STATX_BASIC_STATS)
+static inline void fuse2fs_set_statx_attr(struct statx *stx,
+					  uint64_t statx_flag, int set)
+{
+	if (set)
+		stx->stx_attributes |= statx_flag;
+	stx->stx_attributes_mask |= statx_flag;
+}
+
+static int fuse2fs_statx(struct fuse2fs *ff, ext2_ino_t ino,
+			 uint32_t statx_mask, struct statx *stx, size_t size)
+{
+	struct ext2_inode_large inode;
+	ext2_filsys fs = ff->fs;
+	dev_t fakedev = 0;
+	errcode_t err;
+	struct timespec tv;
+
+	if (size < sizeof(struct statx))
+		return translate_error(fs, ino, EOPNOTSUPP);
+
+	err = fuse2fs_read_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	memcpy(&fakedev, fs->super->s_uuid, sizeof(fakedev));
+	stx->stx_mask = STATX_BASIC_STATS | STATX_BTIME;
+	stx->stx_dev_major = major(fakedev);
+	stx->stx_dev_minor = minor(fakedev);
+	stx->stx_ino = ino;
+	stx->stx_mode = inode.i_mode;
+	stx->stx_nlink = inode.i_links_count;
+	stx->stx_uid = inode_uid(inode);
+	stx->stx_gid = inode_gid(inode);
+	stx->stx_size = EXT2_I_SIZE(&inode);
+	stx->stx_blksize = fs->blocksize;
+	stx->stx_blocks = ext2fs_get_stat_i_blocks(fs,
+						EXT2_INODE(&inode));
+	EXT4_INODE_GET_XTIME(i_atime, &tv, &inode);
+	stx->stx_atime.tv_sec = tv.tv_sec;
+	stx->stx_atime.tv_nsec = tv.tv_nsec;
+
+	EXT4_INODE_GET_XTIME(i_mtime, &tv, &inode);
+	stx->stx_mtime.tv_sec = tv.tv_sec;
+	stx->stx_mtime.tv_nsec = tv.tv_nsec;
+
+	EXT4_INODE_GET_XTIME(i_ctime, &tv, &inode);
+	stx->stx_ctime.tv_sec = tv.tv_sec;
+	stx->stx_ctime.tv_nsec = tv.tv_nsec;
+
+	EXT4_INODE_GET_XTIME(i_crtime, &tv, &inode);
+	stx->stx_btime.tv_sec = tv.tv_sec;
+	stx->stx_btime.tv_nsec = tv.tv_nsec;
+
+	dbg_printf(ff, "%s: ino=%d atime=%lld.%d mtime=%lld.%d ctime=%lld.%d btime=%lld.%d\n",
+		   __func__, ino,
+		   (long long int)stx->stx_atime.tv_sec, stx->stx_atime.tv_nsec,
+		   (long long int)stx->stx_mtime.tv_sec, stx->stx_mtime.tv_nsec,
+		   (long long int)stx->stx_ctime.tv_sec, stx->stx_ctime.tv_nsec,
+		   (long long int)stx->stx_btime.tv_sec, stx->stx_btime.tv_nsec);
+
+	if (LINUX_S_ISCHR(inode.i_mode) ||
+	    LINUX_S_ISBLK(inode.i_mode)) {
+		if (inode.i_block[0]) {
+			stx->stx_rdev_major = major(inode.i_block[0]);
+			stx->stx_rdev_minor = minor(inode.i_block[0]);
+		} else {
+			stx->stx_rdev_major = major(inode.i_block[1]);
+			stx->stx_rdev_minor = minor(inode.i_block[1]);
+		}
+	}
+
+	fuse2fs_set_statx_attr(stx, STATX_ATTR_COMPRESSED,
+			       inode.i_flags & EXT2_COMPR_FL);
+	fuse2fs_set_statx_attr(stx, STATX_ATTR_IMMUTABLE,
+			       inode.i_flags & EXT2_IMMUTABLE_FL);
+	fuse2fs_set_statx_attr(stx, STATX_ATTR_APPEND,
+			       inode.i_flags & EXT2_APPEND_FL);
+	fuse2fs_set_statx_attr(stx, STATX_ATTR_NODUMP,
+			       inode.i_flags & EXT2_NODUMP_FL);
+
+	return 0;
+}
+
+static int op_statx(const char *path, uint32_t statx_flags, uint32_t statx_mask,
+		    struct statx *stx, size_t size, struct fuse_file_info *fi)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	ext2_ino_t ino;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+	fuse2fs_start(ff);
+	ret = fuse2fs_file_ino(ff, path, fi, &ino);
+	if (ret)
+		goto out;
+	ret = fuse2fs_statx(ff, ino, statx_mask, stx, size);
+out:
+	fuse2fs_finish(ff, ret);
+	return ret;
+}
+#else
+# define op_statx		NULL
+#endif
 
 static int op_readlink(const char *path, char *buf, size_t len)
 {
@@ -6351,6 +6457,7 @@ static struct fuse_operations fs_ops = {
 #endif
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18)
 	.syncfs = op_syncfs,
+	.statx = op_statx,
 #endif
 #ifdef HAVE_FUSE_IOMAP
 	.iomap_begin = op_iomap_begin,


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-07-18  8:54 ` Christian Brauner
  2025-07-18 11:55   ` Amir Goldstein
  3 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2025-07-18  8:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o, Neal Gompa

On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> Hi everyone,
> 
> DO NOT MERGE THIS, STILL!
> 
> This is the third request for comments of a prototype to connect the
> Linux fuse driver to fs-iomap for regular file IO operations to and from
> files whose contents persist to locally attached storage devices.
> 
> Why would you want to do that?  Most filesystem drivers are seriously
> vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> over almost a decade of its existence.  Faulty code can lead to total
> kernel compromise, and I think there's a very strong incentive to move
> all that parsing out to userspace where we can containerize the fuse
> server process.
> 
> willy's folios conversion project (and to a certain degree RH's new
> mount API) have also demonstrated that treewide changes to the core
> mm/pagecache/fs code are very very difficult to pull off and take years
> because you have to understand every filesystem's bespoke use of that
> core code.  Eeeugh.
> 
> The fuse command plumbing is very simple -- the ->iomap_begin,
> ->iomap_end, and iomap ->ioend calls within iomap are turned into
> upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> writeback is now a directio write.  The fuse server is now able to
> upsert mappings into the kernel for cached access (== zero upcalls for
> rereads and pure overwrites!) and the iomap cache revalidation code
> works.
> 
> With this RFC, I am able to show that it's possible to build a fuse
> server for a real filesystem (ext4) that runs entirely in userspace yet
> maintains most of its performance.  At this stage I still get about 95%
> of the kernel ext4 driver's streaming directio performance on streaming
> IO, and 110% of its streaming buffered IO performance.  Random buffered
> IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> fast as the kernel; see the cover letter for the fuse2fs iomap changes
> for more details.  Unwritten extent conversions on random direct writes
> are especially painful for fuse+iomap (~90% more overhead) due to upcall
> overhead.  And that's with debugging turned on!
> 
> These items have been addressed since the first RFC:
> 
> 1. The iomap cookie validation is now present, which avoids subtle races
> between pagecache zeroing and writeback on filesystems that support
> unwritten and delalloc mappings.
> 
> 2. Mappings can be cached in the kernel for more speed.
> 
> 3. iomap supports inline data.
> 
> 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> to be as easy as creating a new ->getattr_iflags callback so that the
> fuse server can set fuse_attr::flags.
> 
> 5. statx and syncfs work on iomap filesystems.
> 
> 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> is enabled.
> 
> 7. The ext4 shutdown ioctl is now supported.
> 
> There are some major warts remaining:
> 
> a. ext4 doesn't support out of place writes so I don't know if that
> actually works correctly.
> 
> b. iomap is an inode-based service, not a file-based service.  This
> means that we /must/ push ext2's inode numbers into the kernel via
> FUSE_GETATTR so that it can report those same numbers back out through
> the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> to index its incore inode, so we have to pass those too so that
> notifications work properly.  This is related to #c below:
> 
> c. Hardlinks and iomap are not possible for upper-level libfuse clients
> because the upper level libfuse likes to abstract kernel nodeids with
> its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> As a result, a hardlinked file results in two distinct struct inodes in
> the kernel, which completely breaks iomap's locking model.  I will have
> to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> but on the plus side there will be far less path lookup overhead.
> 
> d. There are too many changes to the IO manager in libext2fs because I
> built things needed to stage the direct/buffered IO paths separately.
> These are now unnecessary but I haven't pulled them out yet because
> they're sort of useful to verify that iomap file IO never goes through
> libext2fs except for inline data.
> 
> e. If we're going to use fuse servers as "safe" replacements for kernel
> filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> We also need to disable the OOM killer(s) for fuse servers because you
> don't want filesystems to unmount abruptly.
> 
> f. How do we maximally contain the fuse server to have safe filesystem
> mounts?  It's very convenient to use systemd services to configure
> isolation declaratively, but fuse2fs still needs to be able to open
> /dev/fuse, the ext4 block device, and call mount() in the shared
> namespace.  This prevents us from using most of the stronger systemd

I'm happy to help you here.

First, I think using a character device for namespaced drivers is always
a mistake. FUSE predates all that ofc. They're incredibly terrible for
delegation because of devtmpfs not being namespaced as well as devices
in general. And having device nodes on anything other than tmpfs is just
wrong (TM).

In systemd I ultimately want a bpf LSM program that prevents the
creation of device nodes outside of tmpfs. They don't belong on
persistent storage imho. But anyway, that's beside the point.

Opening the block device should be done by systemd-mountfsd but I think
/dev/fuse should really be openable by the service itself.

So we can try and allowlist /dev/fuse in vfs_mknod() similar to
whiteouts. That means you can do mknod() in the container to create
/dev/fuse (Personally, I would even restrict this to tmpfs right off the
bat so that containers can only do this on their private tmpfs mount at
/dev.)

The downside of this would be to give unprivileged containers access to
FUSE by default. I don't think that's a problem per se but it is a uapi
change.

Let me think a bit about alternatives. I have one crazy idea but I'm not
sure enough about it to spill it.

> protections because they tend to run in a private mount namespace with
> various parts of the filesystem either hidden or readonly.
> 
> In theory one could design a socket protocol to pass mount options,
> block device paths, fds, and responsibility for the mount() call between
> a mount helper and a service:

This isn't a problem really. This should just be an extension to
systemd-mountfsd.

> 
> e2fsprogs would define a systemd socket service for fuse2fs that sets
> up a dynamic unprivileged user, no network access, and no access to the
> host's filesystem aside from readonly access to the root filesystem.
> 
> The mount helper (e.g. mount.safe) would then connect to the magic
> socket and pass the CLI arguments to the fuse2fs service.  The service
> would parse the arguments, find the block device paths, and feed them
> back through the socket to mount.safe.  mount.safe would open them and
> pass fds back to the fuse2fs service.  The service would then open the
> devices, parse the superblock, and if everything was ok, request a mount
> through the socket.  The mount helper would then open /dev/fuse and
> mount the filesystem, and if successful, pass the /dev/fuse fd through
> the socket to the fuse2fs server.  At that point the fuse2fs server
> would attach to the /dev/fuse device and handle the usual events.
> 
> Finally we'd have to train people/daemons to run "mount -t safe.ext4
> /dev/sda1 /mnt" to get the contained version of ext4.
> 
> (Yeah, #f is all Neal. ;))
> 
> g. fuse2fs doesn't support the ext4 journal.  Urk.
> 
> I'll work on these in July/August, but for now here's an unmergeable RFC
> to start some discussion.
> 
> --Darrick
> 


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18  8:54 ` [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Christian Brauner
@ 2025-07-18 11:55   ` Amir Goldstein
  2025-07-18 19:31     ` Darrick J. Wong
  0 siblings, 1 reply; 49+ messages in thread
From: Amir Goldstein @ 2025-07-18 11:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > Hi everyone,
> >
> > DO NOT MERGE THIS, STILL!
> >
> > This is the third request for comments of a prototype to connect the
> > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > files whose contents persist to locally attached storage devices.
> >
> > Why would you want to do that?  Most filesystem drivers are seriously
> > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > over almost a decade of its existence.  Faulty code can lead to total
> > kernel compromise, and I think there's a very strong incentive to move
> > all that parsing out to userspace where we can containerize the fuse
> > server process.
> >
> > willy's folios conversion project (and to a certain degree RH's new
> > mount API) have also demonstrated that treewide changes to the core
> > mm/pagecache/fs code are very very difficult to pull off and take years
> > because you have to understand every filesystem's bespoke use of that
> > core code.  Eeeugh.
> >
> > The fuse command plumbing is very simple -- the ->iomap_begin,
> > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > writeback is now a directio write.  The fuse server is now able to
> > upsert mappings into the kernel for cached access (== zero upcalls for
> > rereads and pure overwrites!) and the iomap cache revalidation code
> > works.
> >
> > With this RFC, I am able to show that it's possible to build a fuse
> > server for a real filesystem (ext4) that runs entirely in userspace yet
> > maintains most of its performance.  At this stage I still get about 95%
> > of the kernel ext4 driver's streaming directio performance on streaming
> > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > for more details.  Unwritten extent conversions on random direct writes
> > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > overhead.  And that's with debugging turned on!
> >
> > These items have been addressed since the first RFC:
> >
> > 1. The iomap cookie validation is now present, which avoids subtle races
> > between pagecache zeroing and writeback on filesystems that support
> > unwritten and delalloc mappings.
> >
> > 2. Mappings can be cached in the kernel for more speed.
> >
> > 3. iomap supports inline data.
> >
> > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > to be as easy as creating a new ->getattr_iflags callback so that the
> > fuse server can set fuse_attr::flags.
> >
> > 5. statx and syncfs work on iomap filesystems.
> >
> > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > is enabled.
> >
> > 7. The ext4 shutdown ioctl is now supported.
> >
> > There are some major warts remaining:
> >
> > a. ext4 doesn't support out of place writes so I don't know if that
> > actually works correctly.
> >
> > b. iomap is an inode-based service, not a file-based service.  This
> > means that we /must/ push ext2's inode numbers into the kernel via
> > FUSE_GETATTR so that it can report those same numbers back out through
> > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > to index its incore inode, so we have to pass those too so that
> > notifications work properly.  This is related to #c below:
> >
> > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > because the upper level libfuse likes to abstract kernel nodeids with
> > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > As a result, a hardlinked file results in two distinct struct inodes in
> > the kernel, which completely breaks iomap's locking model.  I will have
> > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > but on the plus side there will be far less path lookup overhead.
> >
> > d. There are too many changes to the IO manager in libext2fs because I
> > built things needed to stage the direct/buffered IO paths separately.
> > These are now unnecessary but I haven't pulled them out yet because
> > they're sort of useful to verify that iomap file IO never goes through
> > libext2fs except for inline data.
> >
> > e. If we're going to use fuse servers as "safe" replacements for kernel
> > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > We also need to disable the OOM killer(s) for fuse servers because you
> > don't want filesystems to unmount abruptly.
> >
> > f. How do we maximally contain the fuse server to have safe filesystem
> > mounts?  It's very convenient to use systemd services to configure
> > isolation declaratively, but fuse2fs still needs to be able to open
> > /dev/fuse, the ext4 block device, and call mount() in the shared
> > namespace.  This prevents us from using most of the stronger systemd
>
> I'm happy to help you here.
>
> First, I think using a character device for namespaced drivers is always
> a mistake. FUSE predates all that ofc. They're incredibly terrible for
> delegation because of devtmpfs not being namespaced as well as devices
> in general. And having device nodes on anything other than tmpfs is just
> wrong (TM).
>
> In systemd I ultimately want a bpf LSM program that prevents the
> creation of device nodes outside of tmpfs. They don't belong on
> persistent storage imho. But anyway, that's beside the point.
>
> Opening the block device should be done by systemd-mountfsd but I think
> /dev/fuse should really be openable by the service itself.
>
> So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> whiteouts. That means you can do mknod() in the container to create
> /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> bat so that containers can only do this on their private tmpfs mount at
> /dev.)
>
> The downside of this would be to give unprivileged containers access to
> FUSE by default. I don't think that's a problem per se but it is a uapi
> change.
>
> Let me think a bit about alternatives. I have one crazy idea but I'm not
> sure enough about it to spill it.
>

I don't think there is a hard requirement for the fuse fd to be opened from
a device driver.
With fuse io_uring communication, the open fd doesn't even need to do io.

> > protections because they tend to run in a private mount namespace with
> > various parts of the filesystem either hidden or readonly.
> >
> > In theory one could design a socket protocol to pass mount options,
> > block device paths, fds, and responsibility for the mount() call between
> > a mount helper and a service:
>
> This isn't a problem really. This should just be an extension to
> systemd-mountfsd.

This is relevant not only to systemd env.

I have been experimenting with this mount helper service to mount fuse fs
inside an unprivileged kubernetes container, where opening of /dev/fuse
is restricted by LSM policy:

https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach

Thanks,
Amir.


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18 11:55   ` Amir Goldstein
@ 2025-07-18 19:31     ` Darrick J. Wong
  2025-07-18 19:56       ` Amir Goldstein
  2025-07-23 13:05       ` Christian Brauner
  0 siblings, 2 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-18 19:31 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, linux-fsdevel, John, bernd, miklos,
	joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o,
	Neal Gompa

On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > Hi everyone,
> > >
> > > DO NOT MERGE THIS, STILL!
> > >
> > > This is the third request for comments of a prototype to connect the
> > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > files whose contents persist to locally attached storage devices.
> > >
> > > Why would you want to do that?  Most filesystem drivers are seriously
> > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > over almost a decade of its existence.  Faulty code can lead to total
> > > kernel compromise, and I think there's a very strong incentive to move
> > > all that parsing out to userspace where we can containerize the fuse
> > > server process.
> > >
> > > willy's folios conversion project (and to a certain degree RH's new
> > > mount API) have also demonstrated that treewide changes to the core
> > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > because you have to understand every filesystem's bespoke use of that
> > > core code.  Eeeugh.
> > >
> > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > writeback is now a directio write.  The fuse server is now able to
> > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > works.
> > >
> > > With this RFC, I am able to show that it's possible to build a fuse
> > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > maintains most of its performance.  At this stage I still get about 95%
> > > of the kernel ext4 driver's streaming directio performance on streaming
> > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > for more details.  Unwritten extent conversions on random direct writes
> > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > overhead.  And that's with debugging turned on!
> > >
> > > These items have been addressed since the first RFC:
> > >
> > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > between pagecache zeroing and writeback on filesystems that support
> > > unwritten and delalloc mappings.
> > >
> > > 2. Mappings can be cached in the kernel for more speed.
> > >
> > > 3. iomap supports inline data.
> > >
> > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > fuse server can set fuse_attr::flags.
> > >
> > > 5. statx and syncfs work on iomap filesystems.
> > >
> > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > is enabled.
> > >
> > > 7. The ext4 shutdown ioctl is now supported.
> > >
> > > There are some major warts remaining:
> > >
> > > a. ext4 doesn't support out of place writes so I don't know if that
> > > actually works correctly.
> > >
> > > b. iomap is an inode-based service, not a file-based service.  This
> > > means that we /must/ push ext2's inode numbers into the kernel via
> > > FUSE_GETATTR so that it can report those same numbers back out through
> > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > to index its incore inode, so we have to pass those too so that
> > > notifications work properly.  This is related to #c below:
> > >
> > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > because the upper level libfuse likes to abstract kernel nodeids with
> > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > As a result, a hardlinked file results in two distinct struct inodes in
> > > the kernel, which completely breaks iomap's locking model.  I will have
> > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > but on the plus side there will be far less path lookup overhead.
> > >
> > > d. There are too many changes to the IO manager in libext2fs because I
> > > built things needed to stage the direct/buffered IO paths separately.
> > > These are now unnecessary but I haven't pulled them out yet because
> > > they're sort of useful to verify that iomap file IO never goes through
> > > libext2fs except for inline data.
> > >
> > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > We also need to disable the OOM killer(s) for fuse servers because you
> > > don't want filesystems to unmount abruptly.
> > >
> > > f. How do we maximally contain the fuse server to have safe filesystem
> > > mounts?  It's very convenient to use systemd services to configure
> > > isolation declaratively, but fuse2fs still needs to be able to open
> > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > namespace.  This prevents us from using most of the stronger systemd
> >
> > I'm happy to help you here.
> >
> > First, I think using a character device for namespaced drivers is always
> > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > delegation because of devtmpfs not being namespaced as well as devices
> > in general. And having device nodes on anything other than tmpfs is just
> > wrong (TM).
> >
> > In systemd I ultimately want a bpf LSM program that prevents the
> > creation of device nodes outside of tmpfs. They don't belong on
> > persistent storage imho. But anyway, that's beside the point.
> >
> > Opening the block device should be done by systemd-mountfsd but I think
> > /dev/fuse should really be openable by the service itself.

/me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
Can you pass an fsopen fd to an unprivileged process and have that
second process call fsmount?

If so, then it would be more convenient if mount.safe/systemd-mountfsd
could pass open fds for /dev/fuse and fsopen, so that the fuse server wouldn't
need any special /dev access at all.  I think then the fuse server's
service could have:

DynamicUser=true
ProtectSystem=true
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
DevicePolicy=strict

(I think most of those are redundant with DynamicUser=true but a lot of
my systemd-fu is paged out ATM.)
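
For illustration, the fd handoff described above would most likely ride on
SCM_RIGHTS ancillary data over an AF_UNIX socket. The following is a minimal
sketch under that assumption, not code from this patchset: send_fd() and
recv_fd() are hypothetical helper names, and the one-byte payload is just a
placeholder, since at least one byte of real data has to accompany the
control message on a stream socket.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send one open fd over a connected AF_UNIX socket.  0 or -errno. */
int send_fd(int sock, int fd)
{
	char byte = 'F';	/* placeholder payload */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	if (sendmsg(sock, &msg, 0) < 0)
		return -errno;
	return 0;
}

/* Receive one fd from the socket.  Returns the new fd or -errno. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	ssize_t n;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	n = recvmsg(sock, &msg, 0);
	if (n < 0)
		return -errno;
	if (n == 0)
		return -ECONNRESET;	/* peer closed before sending */

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -EPROTO;		/* no fd attached */

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In this picture the privileged helper would call send_fd() once per
descriptor it opened, and the contained fuse server would call recv_fd() on
its end of the socket, so the server itself never opens anything under /dev.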

My goal here is extreme containment -- the code doing the fs metadata
parsing has no privileges, no write access except to the fds it was
given, no network access, and no ability to read anything outside the
root filesystem.  Then I can get back to writing buffer
overflows^W^Whigh quality filesystem code in peace.

> > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > whiteouts. That means you can do mknod() in the container to create
> > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > bat so that containers can only do this on their private tmpfs mount at
> > /dev.)
> >
> > The downside of this would be to give unprivileged containers access to
> > FUSE by default. I don't think that's a problem per se but it is a uapi
> > change.

Yeah, that is a new risk.  It's still better than metadata parsing
within the kernel address space ... though who knows how thoroughly fuse
has been fuzzed by syzbot :P

> > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > sure enough about it to spill it.

Please do share, #f is my crazy unbaked idea. :)

> I don't think there is a hard requirement for the fuse fd to be opened from
> a device driver.
> With fuse io_uring communication, the open fd doesn't even need to do io.
> 
> > > protections because they tend to run in a private mount namespace with
> > > various parts of the filesystem either hidden or readonly.
> > >
> > > In theory one could design a socket protocol to pass mount options,
> > > block device paths, fds, and responsibility for the mount() call between
> > > a mount helper and a service:
> >
> > This isn't a problem really. This should just be an extension to
> > systemd-mountfsd.

I suppose mount.safe could very well call systemd-mount to go do all the
systemd-related service setup, and that would take care of udisks as
well.

> This is relevant not only to systemd env.
> 
> I have been experimenting with this mount helper service to mount fuse fs
> inside an unprivileged kubernetes container, where opening of /dev/fuse
> is restricted by LSM policy:
> 
> https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach

That sounds similar to what I was thinking about, though there are a lot
of TLAs that I don't understand.

--D


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18 19:31     ` Darrick J. Wong
@ 2025-07-18 19:56       ` Amir Goldstein
  2025-07-18 20:21         ` Darrick J. Wong
  2025-07-23 13:05       ` Christian Brauner
  1 sibling, 1 reply; 49+ messages in thread
From: Amir Goldstein @ 2025-07-18 19:56 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christian Brauner, linux-fsdevel, John, bernd, miklos,
	joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o,
	Neal Gompa

On Fri, Jul 18, 2025 at 9:31 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > Hi everyone,
> > > >
> > > > DO NOT MERGE THIS, STILL!
> > > >
> > > > This is the third request for comments of a prototype to connect the
> > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > files whose contents persist to locally attached storage devices.
> > > >
> > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > kernel compromise, and I think there's a very strong incentive to move
> > > > all that parsing out to userspace where we can containerize the fuse
> > > > server process.
> > > >
> > > > willy's folios conversion project (and to a certain degree RH's new
> > > > mount API) have also demonstrated that treewide changes to the core
> > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > because you have to understand every filesystem's bespoke use of that
> > > > core code.  Eeeugh.
> > > >
> > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > writeback is now a directio write.  The fuse server is now able to
> > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > works.
> > > >
> > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > maintains most of its performance.  At this stage I still get about 95%
> > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > for more details.  Unwritten extent conversions on random direct writes
> > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > overhead.  And that's with debugging turned on!
> > > >
> > > > These items have been addressed since the first RFC:
> > > >
> > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > between pagecache zeroing and writeback on filesystems that support
> > > > unwritten and delalloc mappings.
> > > >
> > > > 2. Mappings can be cached in the kernel for more speed.
> > > >
> > > > 3. iomap supports inline data.
> > > >
> > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > fuse server can set fuse_attr::flags.
> > > >
> > > > 5. statx and syncfs work on iomap filesystems.
> > > >
> > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > is enabled.
> > > >
> > > > 7. The ext4 shutdown ioctl is now supported.
> > > >
> > > > There are some major warts remaining:
> > > >
> > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > actually works correctly.
> > > >
> > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > to index its incore inode, so we have to pass those too so that
> > > > notifications work properly.  This is related to #c below:
> > > >
> > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > but on the plus side there will be far less path lookup overhead.
> > > >
> > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > built things needed to stage the direct/buffered IO paths separately.
> > > > These are now unnecessary but I haven't pulled them out yet because
> > > > they're sort of useful to verify that iomap file IO never goes through
> > > > libext2fs except for inline data.
> > > >
> > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > don't want filesystems to unmount abruptly.
> > > >
> > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > mounts?  It's very convenient to use systemd services to configure
> > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > namespace.  This prevents us from using most of the stronger systemd
> > >
> > > I'm happy to help you here.
> > >
> > > First, I think using a character device for namespaced drivers is always
> > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > delegation because of devtmpfs not being namespaced as well as devices
> > > in general. And having device nodes on anything other than tmpfs is just
> > > wrong (TM).
> > >
> > > In systemd I ultimately want a bpf LSM program that prevents the
> > > creation of device nodes outside of tmpfs. They don't belong on
> > > persistent storage imho. But anyway, that's beside the point.
> > >
> > > Opening the block device should be done by systemd-mountfsd but I think
> > > /dev/fuse should really be openable by the service itself.
>
> /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> Can you pass an fsopen fd to an unprivileged process and have that
> second process call fsmount?
>
> If so, then it would be more convenient if mount.safe/systemd-mountfsd
> could pass open fds for /dev/fuse fsopen then the fuse server wouldn't
> need any special /dev access at all.  I think then the fuse server's
> service could have:
>
> DynamicUser=true
> ProtectSystem=true
> ProtectHome=true
> PrivateTmp=true
> PrivateDevices=true
> DevicePolicy=strict
>
> (I think most of those are redundant with DynamicUser=true but a lot of
> my systemd-fu is paged out ATM.)
>
> My goal here is extreme containment -- the code doing the fs metadata
> parsing has no privileges, no write access except to the fds it was
> given, no network access, and no ability to read anything outside the
> root filesystem.  Then I can get back to writing buffer
> overflows^W^Whigh quality filesystem code in peace.
>
> > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > whiteouts. That means you can do mknod() in the container to create
> > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > bat so that containers can only do this on their private tmpfs mount at
> > > /dev.)
> > >
> > > The downside of this would be to give unprivileged containers access to
> > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > change.
>
> Yeah, that is a new risk.  It's still better than metadata parsing
> within the kernel address space ... though who knows how thoroughly fuse
> has been fuzzed by syzbot :P
>
> > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > sure enough about it to spill it.
>
> Please do share, #f is my crazy unbaked idea. :)
>
> > I don't think there is a hard requirement for the fuse fd to be opened from
> > a device driver.
> > With fuse io_uring communication, the open fd doesn't even need to do io.
> >
> > > > protections because they tend to run in a private mount namespace with
> > > > various parts of the filesystem either hidden or readonly.
> > > >
> > > > In theory one could design a socket protocol to pass mount options,
> > > > block device paths, fds, and responsibility for the mount() call between
> > > > a mount helper and a service:
> > >
> > > This isn't a problem really. This should just be an extension to
> > > systemd-mountfsd.
>
> I suppose mount.safe could very well call systemd-mount to go do all the
> systemd-related service setup, and that would take care of udisks as
> well.
>
> > This is relevant not only to systemd env.
> >
> > I have been experimenting with this mount helper service to mount fuse fs
> > inside an unprivileged kubernetes container, where opening of /dev/fuse
> > is restricted by LSM policy:
> >
> > https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach
>
> That sounds similar to what I was thinking about, though there are a lot
> of TLAs that I don't understand.

Heh. UDS is Unix Domain Socket if that's what you missed (?)
All the rest don't matter.
It's just a privileged service to mount fuse filesystems.
The interesting thing is the trick with replacing fusermount3
to make existing fuse filesystems work out of the box, but the
principle is simply what you described.

Thanks,
Amir.

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18 19:56       ` Amir Goldstein
@ 2025-07-18 20:21         ` Darrick J. Wong
  0 siblings, 0 replies; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-18 20:21 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, linux-fsdevel, John, bernd, miklos,
	joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o,
	Neal Gompa

On Fri, Jul 18, 2025 at 09:56:56PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 9:31 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > Hi everyone,
> > > > >
> > > > > DO NOT MERGE THIS, STILL!
> > > > >
> > > > > This is the third request for comments of a prototype to connect the
> > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > files whose contents persist to locally attached storage devices.
> > > > >
> > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > server process.
> > > > >
> > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > because you have to understand every filesystem's bespoke use of that
> > > > > core code.  Eeeugh.
> > > > >
> > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > works.
> > > > >
> > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > overhead.  And that's with debugging turned on!
> > > > >
> > > > > These items have been addressed since the first RFC:
> > > > >
> > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > unwritten and delalloc mappings.
> > > > >
> > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > >
> > > > > 3. iomap supports inline data.
> > > > >
> > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > fuse server can set fuse_attr::flags.
> > > > >
> > > > > 5. statx and syncfs work on iomap filesystems.
> > > > >
> > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > is enabled.
> > > > >
> > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > >
> > > > > There are some major warts remaining:
> > > > >
> > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > actually works correctly.
> > > > >
> > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > to index its incore inode, so we have to pass those too so that
> > > > > notifications work properly.  This is related to #3 below:
> > > > >
> > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > but on the plus side there will be far less path lookup overhead.
> > > > >
> > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > libext2fs except for inline data.
> > > > >
> > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > don't want filesystems to unmount abruptly.
> > > > >
> > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > namespace.  This prevents us from using most of the stronger systemd
> > > >
> > > > I'm happy to help you here.
> > > >
> > > > First, I think using a character device for namespaced drivers is always
> > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > in general. And having device nodes on anything other than tmpfs is just
> > > > wrong (TM).
> > > >
> > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > persistent storage imho. But anyway, that's besides the point.
> > > >
> > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > /dev/fuse should really be openable by the service itself.
> >
> > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > Can you pass an fsopen fd to an unprivileged process and have that
> > second process call fsmount?
> >
> > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > could pass open fds for /dev/fuse fsopen then the fuse server wouldn't
> > need any special /dev access at all.  I think then the fuse server's
> > service could have:
> >
> > DynamicUser=true
> > ProtectSystem=true
> > ProtectHome=true
> > PrivateTmp=true
> > PrivateDevices=true
> > DevicePolicy=strict
> >
> > (I think most of those are redundant with DynamicUser=true but a lot of
> > my systemd-fu is paged out ATM.)
> >
> > My goal here is extreme containment -- the code doing the fs metadata
> > parsing has no privileges, no write access except to the fds it was
> > given, no network access, and no ability to read anything outside the
> > root filesystem.  Then I can get back to writing buffer
> > overflows^W^Whigh quality filesystem code in peace.
> >
> > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > whiteouts. That means you can do mknod() in the container to create
> > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > bat so that containers can only do this on their private tmpfs mount at
> > > > /dev.)
> > > >
> > > > The downside of this would be to give unprivileged containers access to
> > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > change.
> >
> > Yeah, that is a new risk.  It's still better than metadata parsing
> > within the kernel address space ... though who knows how thoroughly fuse
> > has been fuzzed by syzbot :P
> >
> > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > sure enough about it to spill it.
> >
> > Please do share, #f is my crazy unbaked idea. :)
> >
> > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > a device driver.
> > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > >
> > > > > protections because they tend to run in a private mount namespace with
> > > > > various parts of the filesystem either hidden or readonly.
> > > > >
> > > > > In theory one could design a socket protocol to pass mount options,
> > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > a mount helper and a service:
> > > >
> > > > This isn't a problem really. This should just be an extension to
> > > > systemd-mountfsd.
> >
> > I suppose mount.safe could very well call systemd-mount to go do all the
> > systemd-related service setup, and that would take care of udisks as
> > well.
> >
> > > This is relevant not only to systemd env.
> > >
> > > I have been experimenting with this mount helper service to mount fuse fs
> > > inside an unprivileged kubernetes container, where opening of /dev/fuse
> > > is restricted by LSM policy:
> > >
> > > https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach
> >
> > That sounds similar to what I was thinking about, though there are a lot
> > of TLAs that I don't understand.
> 
> Heh. UDS is Unix Domain Socket if that's what you missed (?)
> All the rest don't matter.

I was wondering what that was.

> It's just a privileged service to mount fuse filesystems.
> The interesting thing is the trick with replacing fusermount3
> to make existing fuse filesystems work out of the box, but the
> principle is simply what you described.

<nod> Got it.

--D

> Thanks,
> Amir.
> 

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18 19:31     ` Darrick J. Wong
  2025-07-18 19:56       ` Amir Goldstein
@ 2025-07-23 13:05       ` Christian Brauner
  2025-07-23 18:04         ` Darrick J. Wong
  1 sibling, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2025-07-23 13:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > Hi everyone,
> > > >
> > > > DO NOT MERGE THIS, STILL!
> > > >
> > > > This is the third request for comments of a prototype to connect the
> > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > files whose contents persist to locally attached storage devices.
> > > >
> > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > kernel compromise, and I think there's a very strong incentive to move
> > > > all that parsing out to userspace where we can containerize the fuse
> > > > server process.
> > > >
> > > > willy's folios conversion project (and to a certain degree RH's new
> > > > mount API) have also demonstrated that treewide changes to the core
> > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > because you have to understand every filesystem's bespoke use of that
> > > > core code.  Eeeugh.
> > > >
> > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > writeback is now a directio write.  The fuse server is now able to
> > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > works.
> > > >
> > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > maintains most of its performance.  At this stage I still get about 95%
> > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > for more details.  Unwritten extent conversions on random direct writes
> > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > overhead.  And that's with debugging turned on!
> > > >
> > > > These items have been addressed since the first RFC:
> > > >
> > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > between pagecache zeroing and writeback on filesystems that support
> > > > unwritten and delalloc mappings.
> > > >
> > > > 2. Mappings can be cached in the kernel for more speed.
> > > >
> > > > 3. iomap supports inline data.
> > > >
> > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > fuse server can set fuse_attr::flags.
> > > >
> > > > 5. statx and syncfs work on iomap filesystems.
> > > >
> > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > is enabled.
> > > >
> > > > 7. The ext4 shutdown ioctl is now supported.
> > > >
> > > > There are some major warts remaining:
> > > >
> > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > actually works correctly.
> > > >
> > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > to index its incore inode, so we have to pass those too so that
> > > > notifications work properly.  This is related to #3 below:
> > > >
> > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > but on the plus side there will be far less path lookup overhead.
> > > >
> > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > built things needed to stage the direct/buffered IO paths separately.
> > > > These are now unnecessary but I haven't pulled them out yet because
> > > > they're sort of useful to verify that iomap file IO never goes through
> > > > libext2fs except for inline data.
> > > >
> > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > don't want filesystems to unmount abruptly.
> > > >
> > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > mounts?  It's very convenient to use systemd services to configure
> > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > namespace.  This prevents us from using most of the stronger systemd
> > >
> > > I'm happy to help you here.
> > >
> > > First, I think using a character device for namespaced drivers is always
> > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > delegation because of devtmpfs not being namespaced as well as devices
> > > in general. And having device nodes on anything other than tmpfs is just
> > > wrong (TM).
> > >
> > > In systemd I ultimately want a bpf LSM program that prevents the
> > > creation of device nodes outside of tmpfs. They don't belong on
> > > persistent storage imho. But anyway, that's besides the point.
> > >
> > > Opening the block device should be done by systemd-mountfsd but I think
> > > /dev/fuse should really be openable by the service itself.
> 
> /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> Can you pass an fsopen fd to an unprivileged process and have that
> second process call fsmount?

Yes, but remember that at some point you must call
fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
fses that requires CAP_SYS_ADMIN so that has to be done by the
privileged process. All the rest can be done by the unprivileged process
though. That's exactly how bpf tokens work.

> 
> If so, then it would be more convenient if mount.safe/systemd-mountfsd
> could pass open fds for /dev/fuse fsopen then the fuse server wouldn't

Yes, that would work.

> need any special /dev access at all.  I think then the fuse server's
> service could have:
> 
> DynamicUser=true
> ProtectSystem=true
> ProtectHome=true
> PrivateTmp=true
> PrivateDevices=true
> DevicePolicy=strict
> 
> (I think most of those are redundant with DynamicUser=true but a lot of
> my systemd-fu is paged out ATM.)
> 
> My goal here is extreme containment -- the code doing the fs metadata
> parsing has no privileges, no write access except to the fds it was
> given, no network access, and no ability to read anything outside the
> root filesystem.  Then I can get back to writing buffer
> overflows^W^Whigh quality filesystem code in peace.

Yeah, sounds about right.

> 
> > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > whiteouts. That means you can do mknod() in the container to create
> > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > bat so that containers can only do this on their private tmpfs mount at
> > > /dev.)
> > >
> > > The downside of this would be to give unprivileged containers access to
> > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > change.
> 
> Yeah, that is a new risk.  It's still better than metadata parsing
> within the kernel address space ... though who knows how thoroughly fuse
> has been fuzzed by syzbot :P
> 
> > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > sure enough about it to spill it.
> 
> Please do share, #f is my crazy unbaked idea. :)
> 
> > I don't think there is a hard requirement for the fuse fd to be opened from
> > a device driver.
> > With fuse io_uring communication, the open fd doesn't even need to do io.
> > 
> > > > protections because they tend to run in a private mount namespace with
> > > > various parts of the filesystem either hidden or readonly.
> > > >
> > > > In theory one could design a socket protocol to pass mount options,
> > > > block device paths, fds, and responsibility for the mount() call between
> > > > a mount helper and a service:
> > >
> > > This isn't a problem really. This should just be an extension to
> > > systemd-mountfsd.
> 
> I suppose mount.safe could very well call systemd-mount to go do all the
> systemd-related service setup, and that would take care of udisks as
> well.

The ultimate goal is to teach mount(8)/libmount to use that daemon when
it's available. Because that would just make unprivileged mounting work
without userspace noticing anything.

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-23 13:05       ` Christian Brauner
@ 2025-07-23 18:04         ` Darrick J. Wong
  2025-07-31 10:13           ` Christian Brauner
  0 siblings, 1 reply; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-23 18:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > Hi everyone,
> > > > >
> > > > > DO NOT MERGE THIS, STILL!
> > > > >
> > > > > This is the third request for comments of a prototype to connect the
> > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > files whose contents persist to locally attached storage devices.
> > > > >
> > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > server process.
> > > > >
> > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > because you have to understand every filesystem's bespoke use of that
> > > > > core code.  Eeeugh.
> > > > >
> > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > works.
> > > > >
> > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > overhead.  And that's with debugging turned on!
> > > > >
> > > > > These items have been addressed since the first RFC:
> > > > >
> > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > unwritten and delalloc mappings.
> > > > >
> > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > >
> > > > > 3. iomap supports inline data.
> > > > >
> > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > fuse server can set fuse_attr::flags.
> > > > >
> > > > > 5. statx and syncfs work on iomap filesystems.
> > > > >
> > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > is enabled.
> > > > >
> > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > >
> > > > > There are some major warts remaining:
> > > > >
> > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > actually works correctly.
> > > > >
> > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > to index its incore inode, so we have to pass those too so that
> > > > > notifications work properly.  This is related to #3 below:
> > > > >
> > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > but on the plus side there will be far less path lookup overhead.
> > > > >
> > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > libext2fs except for inline data.
> > > > >
> > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > don't want filesystems to unmount abruptly.
> > > > >
> > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > namespace.  This prevents us from using most of the stronger systemd
> > > >
> > > > I'm happy to help you here.
> > > >
> > > > First, I think using a character device for namespaced drivers is always
> > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > in general. And having device nodes on anything other than tmpfs is just
> > > > wrong (TM).
> > > >
> > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > persistent storage imho. But anyway, that's besides the point.
> > > >
> > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > /dev/fuse should really be openable by the service itself.
> > 
> > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > Can you pass an fsopen fd to an unprivileged process and have that
> > second process call fsmount?
> 
> Yes, but remember that at some point you must call
> fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> fses that requires CAP_SYS_ADMIN so that has to be done by the
> privileged process. All the rest can be done by the unprivileged process
> though. That's exactly how bpf tokens work.

Hrm.  Assuming the fsopen mount sequence is still:

	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
	...
	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
for move_mount() to have the intended effect.
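
Spelled out with raw syscalls, that sequence might look roughly like the
sketch below.  Assumptions to flag: mount_new_api() is a made-up helper
name, the block device is handed over via the "source" string key,
fsmount() takes MOUNT_ATTR_* attribute flags, and move_mount() takes five
arguments.  It needs Linux 5.2+ UAPI headers, and FSCONFIG_CMD_CREATE is
the step that wants CAP_SYS_ADMIN for a block-based filesystem.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD */
#include <linux/mount.h>  /* FSOPEN_*, FSCONFIG_*, FSMOUNT_*, MOUNT_ATTR_*, MOVE_MOUNT_* */
#include <stddef.h>
#include <sys/syscall.h>  /* SYS_fsopen and friends */
#include <unistd.h>

/* Hypothetical helper: mount fstype from dev at dir; 0 on success, -1 on error. */
int mount_new_api(const char *fstype, const char *dev, const char *dir)
{
	int sfd, mfd, ret = -1;

	/* Create a filesystem context; the caller's userns is recorded here. */
	sfd = syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);
	if (sfd < 0)
		return -1;

	/* Configuration steps: these only modify the context. */
	if (syscall(SYS_fsconfig, sfd, FSCONFIG_SET_STRING, "source", dev, 0) < 0)
		goto out_sfd;
	if (syscall(SYS_fsconfig, sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) < 0)
		goto out_sfd;

	/* Superblock creation: the privileged step for block-based filesystems. */
	if (syscall(SYS_fsconfig, sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
		goto out_sfd;

	/* Turn the context into a detached mount, then attach it at dir. */
	mfd = syscall(SYS_fsmount, sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
	if (mfd < 0)
		goto out_sfd;
	if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, dir,
		    MOVE_MOUNT_F_EMPTY_PATH) == 0)
		ret = 0;

	close(mfd);
out_sfd:
	close(sfd);
	return ret;
}
```

Something like mount_new_api("ext4", "/dev/sdb1", "/mnt") would be the
privileged caller's view; the device path and mountpoint are placeholders.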

Can two processes share the same fsopen fd?  If so then systemd-mountfsd
could pass the fsopen fd to the fuse server (whilst retaining its own
copy).  The fuse server could do its own mount option parsing, call
FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
the create/fsmount/move_mount part.
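
For reference, the usual way to hand an open fd to another process is an
SCM_RIGHTS control message on an AF_UNIX socket.  A minimal sketch,
assuming the two sides already share a connected socket; send_fd() and
recv_fd() are made-up helper names, not anything from systemd or libfuse:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: send fd over the AF_UNIX socket sock; 0 on success. */
int send_fd(int sock, int fd)
{
	char byte = 'F';	/* one byte of real data travels with the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd from sock; returns the new fd or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	/* Accept only a single SCM_RIGHTS message carrying exactly one fd. */
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets its own descriptor referring to the same open file
description, so the sender keeps a working copy, which is the "whilst
retaining its own copy" part.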

The systemd-mountfsd would have to be running in the desired fs namespace
and with sufficient privileges to open block devices, but I'm guessing
that's already a requirement?

> > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > could pass open fds for /dev/fuse and fsopen; then the fuse server wouldn't
> 
> Yes, that would work.

Oh goody :)

> > need any special /dev access at all.  I think then the fuse server's
> > service could have:
> > 
> > DynamicUser=true
> > ProtectSystem=true
> > ProtectHome=true
> > PrivateTmp=true
> > PrivateDevices=true
> > DevicePolicy=strict
> > 
> > (I think most of those are redundant with DynamicUser=true but a lot of
> > my systemd-fu is paged out ATM.)
> > 
> > My goal here is extreme containment -- the code doing the fs metadata
> > parsing has no privileges, no write access except to the fds it was
> > given, no network access, and no ability to read anything outside the
> > root filesystem.  Then I can get back to writing buffer
> > overflows^W^Whigh quality filesystem code in peace.
> 
> Yeah, sounds about right.
> 
> > 
> > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > whiteouts. That means you can do mknod() in the container to create
> > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > bat so that containers can only do this on their private tmpfs mount at
> > > > /dev.)
> > > >
> > > > The downside of this would be to give unprivileged containers access to
> > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > change.
> > 
> > Yeah, that is a new risk.  It's still better than metadata parsing
> > within the kernel address space ... though who knows how thoroughly fuse
> > has been fuzzed by syzbot :P
> > 
> > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > sure enough about it to spill it.
> > 
> > Please do share, #f is my crazy unbaked idea. :)
> > 
> > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > a device driver.
> > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > 
> > > > > protections because they tend to run in a private mount namespace with
> > > > > various parts of the filesystem either hidden or readonly.
> > > > >
> > > > > In theory one could design a socket protocol to pass mount options,
> > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > a mount helper and a service:
> > > >
> > > > This isn't a problem really. This should just be an extension to
> > > > systemd-mountfsd.
> > 
> > I suppose mount.safe could very well call systemd-mount to go do all the
> > systemd-related service setup, and that would take care of udisks as
> > well.
> 
> The ultimate goal is to teach mount(8)/libmount to use that daemon when
> it's available. Because that would just make unprivileged mounting work
> without userspace noticing anything.

That sounds really neat. :)

--D


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-23 18:04         ` Darrick J. Wong
@ 2025-07-31 10:13           ` Christian Brauner
  2025-07-31 17:22             ` Darrick J. Wong
  0 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2025-07-31 10:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > >
> > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > DO NOT MERGE THIS, STILL!
> > > > > >
> > > > > > This is the third request for comments of a prototype to connect the
> > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > files whose contents persist to locally attached storage devices.
> > > > > >
> > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > server process.
> > > > > >
> > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > core code.  Eeeugh.
> > > > > >
> > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > works.
> > > > > >
> > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > overhead.  And that's with debugging turned on!
> > > > > >
> > > > > > These items have been addressed since the first RFC:
> > > > > >
> > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > unwritten and delalloc mappings.
> > > > > >
> > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > >
> > > > > > 3. iomap supports inline data.
> > > > > >
> > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > fuse server can set fuse_attr::flags.
> > > > > >
> > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > >
> > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > is enabled.
> > > > > >
> > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > >
> > > > > > There are some major warts remaining:
> > > > > >
> > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > actually works correctly.
> > > > > >
> > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > notifications work properly.  This is related to #c below:
> > > > > >
> > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > >
> > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > libext2fs except for inline data.
> > > > > >
> > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > don't want filesystems to unmount abruptly.
> > > > > >
> > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > >
> > > > > I'm happy to help you here.
> > > > >
> > > > > First, I think using a character device for namespaced drivers is always
> > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > wrong (TM).
> > > > >
> > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > persistent storage imho. But anyway, that's beside the point.
> > > > >
> > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > /dev/fuse should really be openable by the service itself.
> > > 
> > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > Can you pass an fsopen fd to an unprivileged process and have that
> > > second process call fsmount?
> > 
> > Yes, but remember that at some point you must call
> > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > privileged process. All the rest can be done by the unprivileged process
> > though. That's exactly how bpf tokens work.
> 
> Hrm.  Assuming the fsopen mount sequence is still:
> 
> 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> 	...
> 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
> 	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> 
> Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> for move_mount() to have the intended effect.

Yes-ish.

At fsopen() time the user namespace of the caller is recorded in
fs_context->user_ns. If the filesystem is mountable inside of a user
namespace then fs_context->user_ns will be used to perform the
CAP_SYS_ADMIN check.

For filesystems that aren't mountable inside of user namespaces (ext4,
xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
to mount a filesystem with a non-initial userns if it's not marked as
mountable. That used to be possible but it's an invitation for extremely
subtle bugs and you gain control over the superblock itself.

TL;DR the user namespace the superblock belongs to is usually determined
at fsopen() time.
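
As a userspace model of that decision (not the kernel source: the struct
and function names below are invented for illustration, and the booleans
stand in for the results of the real capability checks):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the parts of fs_context that matter here. */
struct model_fs_context {
	bool fs_userns_mount;		/* fs type sets FS_USERNS_MOUNT */
	bool admin_in_fc_userns;	/* CAP_SYS_ADMIN in the userns recorded at fsopen() */
	bool admin_in_init_userns;	/* global CAP_SYS_ADMIN */
};

/* Which capability check gates superblock creation in this model. */
bool model_mount_capable(const struct model_fs_context *fc)
{
	if (!fc->fs_userns_mount)
		return fc->admin_in_init_userns;	/* ext4, xfs, ...: userns ignored */
	return fc->admin_in_fc_userns;			/* checked against fc->user_ns */
}
```

In this model a caller that is only privileged inside its own user
namespace passes the check for a filesystem type marked as mountable
there, and fails it for ext4 or xfs.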

> 
> Can two processes share the same fsopen fd?  If so then systemd-mountfsd

Yes, they can share and it's synchronized.

> could pass the fsopen fd to the fuse server (whilst retaining its own
> copy).  The fuse server could do its own mount option parsing, call

Yes, systemd-mountfsd already does passing like that.

> FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
> the create/fsmount/move_mount part.

Yes.

> 
> The systemd-mountfsd would have to be running in the desired fs namespace
> and with sufficient privileges to open block devices, but I'm guessing
> that's already a requirement?

Yes, systemd-mountfsd is a system level service running in the initial
set of namespaces and interacting with systemd-nsresourced (namespace
related stuff). It can obviously also create a helper to setns() into
various namespaces if required.

> 
> > > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > > could pass open fds for /dev/fuse and fsopen; then the fuse server wouldn't

Yes, I would think so.

> > 
> > Yes, that would work.
> 
> Oh goody :)
> 
> > > need any special /dev access at all.  I think then the fuse server's
> > > service could have:
> > > 
> > > DynamicUser=true
> > > ProtectSystem=true
> > > ProtectHome=true
> > > PrivateTmp=true
> > > PrivateDevices=true
> > > DevicePolicy=strict
> > > 
> > > (I think most of those are redundant with DynamicUser=true but a lot of
> > > my systemd-fu is paged out ATM.)
> > > 
> > > My goal here is extreme containment -- the code doing the fs metadata
> > > parsing has no privileges, no write access except to the fds it was
> > > given, no network access, and no ability to read anything outside the
> > > root filesystem.  Then I can get back to writing buffer
> > > overflows^W^Whigh quality filesystem code in peace.
> > 
> > Yeah, sounds about right.
> > 
> > > 
> > > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > > whiteouts. That means you can do mknod() in the container to create
> > > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > > bat so that containers can only do this on their private tmpfs mount at
> > > > > /dev.)
> > > > >
> > > > > The downside of this would be to give unprivileged containers access to
> > > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > > change.
> > > 
> > > Yeah, that is a new risk.  It's still better than metadata parsing
> > > within the kernel address space ... though who knows how thoroughly fuse
> > > has been fuzzed by syzbot :P
> > > 
> > > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > > sure enough about it to spill it.
> > > 
> > > Please do share, #f is my crazy unbaked idea. :)
> > > 
> > > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > > a device driver.
> > > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > > 
> > > > > > protections because they tend to run in a private mount namespace with
> > > > > > various parts of the filesystem either hidden or readonly.
> > > > > >
> > > > > > In theory one could design a socket protocol to pass mount options,
> > > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > > a mount helper and a service:
> > > > >
> > > > > This isn't a problem really. This should just be an extension to
> > > > > systemd-mountfsd.
> > > 
> > > I suppose mount.safe could very well call systemd-mount to go do all the
> > > systemd-related service setup, and that would take care of udisks as
> > > well.
> > 
> > The ultimate goal is to teach mount(8)/libmount to use that daemon when
> > it's available. Because that would just make unprivileged mounting work
> > without userspace noticing anything.
> 
> That sounds really neat. :)
> 
> --D


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-31 10:13           ` Christian Brauner
@ 2025-07-31 17:22             ` Darrick J. Wong
  2025-08-04 10:12               ` Christian Brauner
  0 siblings, 1 reply; 49+ messages in thread
From: Darrick J. Wong @ 2025-07-31 17:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Thu, Jul 31, 2025 at 12:13:01PM +0200, Christian Brauner wrote:
> On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> > On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > >
> > > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > DO NOT MERGE THIS, STILL!
> > > > > > >
> > > > > > > This is the third request for comments of a prototype to connect the
> > > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > > files whose contents persist to locally attached storage devices.
> > > > > > >
> > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > server process.
> > > > > > >
> > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > core code.  Eeeugh.
> > > > > > >
> > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > > works.
> > > > > > >
> > > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > > overhead.  And that's with debugging turned on!
> > > > > > >
> > > > > > > These items have been addressed since the first RFC:
> > > > > > >
> > > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > > unwritten and delalloc mappings.
> > > > > > >
> > > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > > >
> > > > > > > 3. iomap supports inline data.
> > > > > > >
> > > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > > fuse server can set fuse_attr::flags.
> > > > > > >
> > > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > > >
> > > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > > is enabled.
> > > > > > >
> > > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > > >
> > > > > > > There are some major warts remaining:
> > > > > > >
> > > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > > actually works correctly.
> > > > > > >
> > > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > > notifications work properly.  This is related to #c below:
> > > > > > >
> > > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > > >
> > > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > > libext2fs except for inline data.
> > > > > > >
> > > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > > don't want filesystems to unmount abruptly.
> > > > > > >
> > > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > > >
> > > > > > I'm happy to help you here.
> > > > > >
> > > > > > First, I think using a character device for namespaced drivers is always
> > > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > > wrong (TM).
> > > > > >
> > > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > > persistent storage imho. But anyway, that's beside the point.
> > > > > >
> > > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > > /dev/fuse should really be openable by the service itself.
> > > > 
> > > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > > Can you pass an fsopen fd to an unprivileged process and have that
> > > > second process call fsmount?
> > > 
> > > Yes, but remember that at some point you must call
> > > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > > privileged process. All the rest can be done by the unprivileged process
> > > though. That's exactly how bpf tokens work.
> > 
> > Hrm.  Assuming the fsopen mount sequence is still:
> > 
> > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > 	...
> > 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
> > 	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > 
> > Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> > CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> > for move_mount() to have the intended effect.
> 
> Yes-ish.
> 
> At fsopen() time the user namespace of the caller is recorded in
> fs_context->user_ns. If the filesystem is mountable inside of a user
> namespace then fs_context->user_ns will be used to perform the
> CAP_SYS_ADMIN check.

Hrmm, well fuse is one of the filesystems that sets FS_USERNS_MOUNT, so
I gather that means that the fuse service server (ugh) could invoke the
mount using the fsopen fd given to it?  That sounds promising.

> For filesystems that aren't mountable inside of user namespaces (ext4,
> xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
> global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
> to mount a filesystem with a non-initial userns if it's not marked as
> mountable. That used to be possible but it's an invitation for extremely
> subtle bugs and you gain control over the superblock itself.

I guess that's commit e1c5ae59c0f22f ("fs: don't allow non-init
s_user_ns for filesystems without FS_USERNS_MOUNT")?  What does it mean
for a filesystem to be "...written with a non-initial s_user_ns in
mind"?  Is there something specific that I should look out for, aside
from the usual "we don't mount parking lot xfs because validating that
is too hard and it might explode the kernel"?

> TL;DR the user namespace the superblock belongs to is usually determined
> at fsopen() time.
> 
> > 
> > Can two processes share the same fsopen fd?  If so then systemd-mountfsd
> 
> Yes, they can share and it's synchronized.

> > could pass the fsopen fd to the fuse server (whilst retaining its own
> > copy).  The fuse server could do its own mount option parsing, call
> 
> Yes, systemd-mountfsd already does passing like that.

Oh!

> > FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
> > the create/fsmount/move_mount part.
> 
> Yes.

If the fsopen fd tracks the userns of whoever initiated the mount
attempt, then maybe the fuse server can do that part too?  I guess the
weird part would be that the fuse server would effectively be passing a
path from the caller's ns, despite not having access to that ns.

> > The systemd-mountfsd would have to be running in the desired fs namespace
> > and with sufficient privileges to open block devices, but I'm guessing
> > that's already a requirement?
> 
> Yes, systemd-mountfsd is a system level service running in the initial
> set of namespaces and interacting with systemd-nsresourced (namespace
> related stuff). It can obviously also create a helper to setns() into
> various namespaces if required.

<nod> I think I saw something else from you about a file descriptor
store, so I'll go look there next.

--D

> > 
> > > > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > > > could pass open fds for /dev/fuse and fsopen; then the fuse server wouldn't
> 
> Yes, I would think so.
> 
> > > 
> > > Yes, that would work.
> > 
> > Oh goody :)
> > 
> > > > need any special /dev access at all.  I think then the fuse server's
> > > > service could have:
> > > > 
> > > > DynamicUser=true
> > > > ProtectSystem=true
> > > > ProtectHome=true
> > > > PrivateTmp=true
> > > > PrivateDevices=true
> > > > DevicePolicy=strict
> > > > 
> > > > (I think most of those are redundant with DynamicUser=true but a lot of
> > > > my systemd-fu is paged out ATM.)
> > > > 
> > > > My goal here is extreme containment -- the code doing the fs metadata
> > > > parsing has no privileges, no write access except to the fds it was
> > > > given, no network access, and no ability to read anything outside the
> > > > root filesystem.  Then I can get back to writing buffer
> > > > overflows^W^Whigh quality filesystem code in peace.
> > > 
> > > Yeah, sounds about right.
> > > 
> > > > 
> > > > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > > > whiteouts. That means you can do mknod() in the container to create
> > > > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > > > bat so that containers can only do this on their private tmpfs mount at
> > > > > > /dev.)
> > > > > >
> > > > > > The downside of this would be to give unprivileged containers access to
> > > > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > > > change.
> > > > 
> > > > Yeah, that is a new risk.  It's still better than metadata parsing
> > > > within the kernel address space ... though who knows how thoroughly fuse
> > > > has been fuzzed by syzbot :P
> > > > 
> > > > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > > > sure enough about it to spill it.
> > > > 
> > > > Please do share, #f is my crazy unbaked idea. :)
> > > > 
> > > > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > > > a device driver.
> > > > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > > > 
> > > > > > > protections because they tend to run in a private mount namespace with
> > > > > > > various parts of the filesystem either hidden or readonly.
> > > > > > >
> > > > > > > In theory one could design a socket protocol to pass mount options,
> > > > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > > > a mount helper and a service:
> > > > > >
> > > > > > This isn't a problem really. This should just be an extension to
> > > > > > systemd-mountfsd.
> > > > 
> > > > I suppose mount.safe could very well call systemd-mount to go do all the
> > > > systemd-related service setup, and that would take care of udisks as
> > > > well.
> > > 
> > > The ultimate goal is to teach mount(8)/libmount to use that daemon when
> > > it's available. Because that would just make unprivileged mounting work
> > > without userspace noticing anything.
> > 
> > That sounds really neat. :)
> > 
> > --D


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-31 17:22             ` Darrick J. Wong
@ 2025-08-04 10:12               ` Christian Brauner
  2025-08-12 20:20                 ` Darrick J. Wong
  0 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2025-08-04 10:12 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Thu, Jul 31, 2025 at 10:22:06AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 31, 2025 at 12:13:01PM +0200, Christian Brauner wrote:
> > On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> > > On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > > > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > >
> > > > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > DO NOT MERGE THIS, STILL!
> > > > > > > >
> > > > > > > > This is the third request for comments of a prototype to connect the
> > > > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > > > files whose contents persist to locally attached storage devices.
> > > > > > > >
> > > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > > server process.
> > > > > > > >
> > > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > > core code.  Eeeugh.
> > > > > > > >
> > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > > > works.
> > > > > > > >
> > > > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > > > overhead.  And that's with debugging turned on!
> > > > > > > >
> > > > > > > > These items have been addressed since the first RFC:
> > > > > > > >
> > > > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > > > unwritten and delalloc mappings.
> > > > > > > >
> > > > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > > > >
> > > > > > > > 3. iomap supports inline data.
> > > > > > > >
> > > > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > > > fuse server can set fuse_attr::flags.
> > > > > > > >
> > > > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > > > >
> > > > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > > > is enabled.
> > > > > > > >
> > > > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > > > >
> > > > > > > > There are some major warts remaining:
> > > > > > > >
> > > > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > > > actually works correctly.
> > > > > > > >
> > > > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > > > notifications work properly.  This is related to #3 below:
> > > > > > > >
> > > > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > > > >
> > > > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > > > libext2fs except for inline data.
> > > > > > > >
> > > > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > > > don't want filesystems to unmount abruptly.
> > > > > > > >
> > > > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > > > >
> > > > > > > I'm happy to help you here.
> > > > > > >
> > > > > > > First, I think using a character device for namespaced drivers is always
> > > > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > > > wrong (TM).
> > > > > > >
> > > > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > > > persistent storage imho. But anyway, that's besides the point.
> > > > > > >
> > > > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > > > /dev/fuse should really be openable by the service itself.
> > > > > 
> > > > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > > > Can you pass an fsopen fd to an unprivileged process and have that
> > > > > second process call fsmount?
> > > > 
> > > > Yes, but remember that at some point you must call
> > > > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > > > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > > > privileged process. All the rest can be done by the unprivileged process
> > > > though. That's exactly how bpf tokens work.
> > > 
> > > Hrm.  Assuming the fsopen mount sequence is still:
> > > 
> > > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > > 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > > 	...
> > > 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > > 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> > > 	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > > 
> > > Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> > > CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> > > for move_mount() to have the intended effect.
> > 
> > Yes-ish.
> > 
> > At fsopen() time the user namespace of the caller is recorded in
> > fs_context->user_ns. If the filesystems is mountable inside of a user
> > namespace then fs_context->user_ns will be used to perform the
> > CAP_SYS_ADMIN check.
> 
> Hrmm, well fuse is one of the filesystems that sets FS_USERNS_MOUNT, so
> I gather that means that the fuse service server (ugh) could invoke the
> mount using the fsopen fd given to it?  That sounds promising.

Yes, it could, provided fsopen() was called in a user namespace that the
service holds privileges over.
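
For reference, a rough sketch of that call sequence using raw syscalls.
This is illustrative only and untested against this patchset:
mount_with_new_api() is a made-up helper name, and a block-based
filesystem would additionally need its source set with
fsconfig(FSCONFIG_SET_STRING, "source", ...) before the create step.
Note that fsmount() takes MOUNT_ATTR_* flags as its third argument and
that move_mount() takes five arguments.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>		/* AT_FDCWD */
#include <linux/mount.h>	/* FSOPEN_*, FSCONFIG_*, FSMOUNT_*, MOUNT_ATTR_* */
#include <sys/syscall.h>	/* SYS_fsopen and friends */
#include <unistd.h>		/* syscall(), close() */

/*
 * Hypothetical helper: mount @fstype read-only at @target with the new
 * mount API.  Returns 0 on success, -1 with errno set on failure.
 */
static int mount_with_new_api(const char *fstype, const char *target)
{
	int sfd, mfd, saved_errno, ret = -1;

	/* The caller's user namespace is recorded in the fs context here. */
	sfd = syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);
	if (sfd < 0)
		return -1;

	if (syscall(SYS_fsconfig, sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) < 0)
		goto out_sfd;

	/* Superblock creation; this is where the capability check bites. */
	if (syscall(SYS_fsconfig, sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
		goto out_sfd;

	mfd = syscall(SYS_fsmount, sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
	if (mfd < 0)
		goto out_sfd;

	/* Attach the detached mount in the caller's mount namespace. */
	if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, target,
		    MOVE_MOUNT_F_EMPTY_PATH) == 0)
		ret = 0;

	saved_errno = errno;
	close(mfd);
	errno = saved_errno;
out_sfd:
	saved_errno = errno;
	close(sfd);
	errno = saved_errno;
	return ret;
}
```

In the split setup, the fsopen fd would be handed to another process
between these steps instead of one process running them all.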

> 
> > For filesystems that aren't mountable inside of user namespaces (ext4,
> > xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
> > global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
> > to mount a filesystem with a non-initial userns if it's not marked as
> > mountable. That used to be possible but it's an invitation for extremely
> > subtle bugs and you gain control over the superblock itself.
> 
> I guess that's commit e1c5ae59c0f22f ("fs: don't allow non-init
> s_user_ns for filesystems without FS_USERNS_MOUNT")?  What does it mean
> for a filesystem to be "...written with a non-initial s_user_ns in
> mind"?  Is there something specific that I should look out for, aside
> from the usual "we don't mount parking lot xfs because validating that
> is too hard and it might explode the kernel"?

So there are two sides on how to view this:

(1) The filesystem is mountable   in a user namespace.
(2) The filesystem is delegatable to a user namespace.

These are two different things. Allowing (1) is difficult because of the
usual complexities involved, even though everyone always seems to believe
that their block-based filesystem is reliable enough to be mounted with
any corrupted image.

But (2) is something that's doable and in fact something we do allow
currently, e.g., for bpffs. In order to allow containers to use bpf, the
container must have a bpffs instance mounted.

To do this, fsopen() must be called in the container's user namespace. To
allow specific bpf features and to actually create the superblock,
CAP_SYS_ADMIN or CAP_BPF in the initial user namespace is required.
Then a new bpffs instance will be created that is owned by the user
namespace of the container.

IOW, to delegate a superblock/filesystem to an unprivileged container,
capabilities are still required, but ultimately the filesystem will be
owned by the container.

One story I always found worth exploring to get at (1) is if we had
dm-verity directly integrated into the filesystem. And I don't mean
fsverity, I mean dm-verity and in a way such that it's explicitly not
part of the on-disk image in contrast to fsverity where each filesystem
integrates this very differently into their on-disk format. It basically
would be as dumb as it gets. Static, simple arithmetic, appended,
pre-pended, whatever.

> 
> > TL;DR the user namespace the superblock belongs to is usually determined
> > at fsopen() time.
> > 
> > > 
> > > Can two processes share the same fsopen fd?  If so then systemd-mountfsd
> > 
> > Yes, they can share and it's synchronized.
> 
> > > could pass the fsopen fd to the fuse server (whilst retaining its own
> > > copy).  The fuse server could do its own mount option parsing, call
> > 
> > Yes, systemd-mountfsd already does passing like that.
> 
> Oh!
> 
> > > FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
> > > the create/fsmount/move_mount part.
> > 
> > Yes.
> 
> If the fdopen fd tracks the userns of whoever initiated the mount
> attempt, then maybe the fuse server can do that part too?  I guess the
> weird part would be that the fuse server would effectively be passing a
> path from the caller's ns, despite not having access to that ns.

Remind me why the FUSE server would want to track the userns?

> 
> > > The systemd-mountfsd would have to be running in desired fs namespace
> > > and with sufficient privileges to open block devices, but I'm guessing
> > > that's already a requirement?
> > 
> > Yes, systemd-mountfsd is a system level service running in the initial
> > set of namespaces and interacting with systemd-nsresourced (namespace
> > related stuff). It can obviously also create helper to setns() into
> > various namespaces if required. 
> 
> <nod> I think I saw something else from you about a file descriptor
> store, so I'll go look there next.
> 
> --D
> 
> > > 
> > > > > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > > > > could pass open fds for /dev/fuse fsopen then the fuse server wouldn't
> > 
> > Yes, I would think so.
> > 
> > > > 
> > > > Yes, that would work.
> > > 
> > > Oh goody :)
> > > 
> > > > > need any special /dev access at all.  I think then the fuse server's
> > > > > service could have:
> > > > > 
> > > > > DynamicUser=true
> > > > > ProtectSystem=true
> > > > > ProtectHome=true
> > > > > PrivateTmp=true
> > > > > PrivateDevices=true
> > > > > DevicePolicy=strict
> > > > > 
> > > > > (I think most of those are redundant with DynamicUser=true but a lot of
> > > > > my systemd-fu is paged out ATM.)
> > > > > 
> > > > > My goal here is extreme containment -- the code doing the fs metadata
> > > > > parsing has no privileges, no write access except to the fds it was
> > > > > given, no network access, and no ability to read anything outside the
> > > > > root filesystem.  Then I can get back to writing buffer
> > > > > overflows^W^Whigh quality filesystem code in peace.
> > > > 
> > > > Yeah, sounds about right.
> > > > 
> > > > > 
> > > > > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > > > > whiteouts. That means you can do mknod() in the container to create
> > > > > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > > > > bat so that containers can only do this on their private tmpfs mount at
> > > > > > > /dev.)
> > > > > > >
> > > > > > > The downside of this would be to give unprivileged containers access to
> > > > > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > > > > change.
> > > > > 
> > > > > Yeah, that is a new risk.  It's still better than metadata parsing
> > > > > within the kernel address space ... though who knows how thoroughly fuse
> > > > > has been fuzzed by syzbot :P
> > > > > 
> > > > > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > > > > sure enough about it to spill it.
> > > > > 
> > > > > Please do share, #f is my crazy unbaked idea. :)
> > > > > 
> > > > > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > > > > a device driver.
> > > > > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > > > > 
> > > > > > > > protections because they tend to run in a private mount namespace with
> > > > > > > > various parts of the filesystem either hidden or readonly.
> > > > > > > >
> > > > > > > > In theory one could design a socket protocol to pass mount options,
> > > > > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > > > > a mount helper and a service:
> > > > > > >
> > > > > > > This isn't a problem really. This should just be an extension to
> > > > > > > systemd-mountfsd.
> > > > > 
> > > > > I suppose mount.safe could very well call systemd-mount to go do all the
> > > > > systemd-related service setup, and that would take care of udisks as
> > > > > well.
> > > > 
> > > > The ultimate goal is to teach mount(8)/libmount to use that daemon when
> > > > it's available. Because that would just make unprivileged mounting work
> > > > without userspace noticing anything.
> > > 
> > > That sounds really neat. :)
> > > 
> > > --D


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-08-04 10:12               ` Christian Brauner
@ 2025-08-12 20:20                 ` Darrick J. Wong
  2025-08-15 14:20                   ` Christian Brauner
  0 siblings, 1 reply; 49+ messages in thread
From: Darrick J. Wong @ 2025-08-12 20:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Mon, Aug 04, 2025 at 12:12:24PM +0200, Christian Brauner wrote:
> On Thu, Jul 31, 2025 at 10:22:06AM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 31, 2025 at 12:13:01PM +0200, Christian Brauner wrote:
> > > On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > > > > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > > > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > DO NOT MERGE THIS, STILL!
> > > > > > > > >
> > > > > > > > > This is the third request for comments of a prototype to connect the
> > > > > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > > > > files whose contents persist to locally attached storage devices.
> > > > > > > > >
> > > > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > > > server process.
> > > > > > > > >
> > > > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > > > core code.  Eeeugh.
> > > > > > > > >
> > > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > > > > works.
> > > > > > > > >
> > > > > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > > > > overhead.  And that's with debugging turned on!
> > > > > > > > >
> > > > > > > > > These items have been addressed since the first RFC:
> > > > > > > > >
> > > > > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > > > > unwritten and delalloc mappings.
> > > > > > > > >
> > > > > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > > > > >
> > > > > > > > > 3. iomap supports inline data.
> > > > > > > > >
> > > > > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > > > > fuse server can set fuse_attr::flags.
> > > > > > > > >
> > > > > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > > > > >
> > > > > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > > > > is enabled.
> > > > > > > > >
> > > > > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > > > > >
> > > > > > > > > There are some major warts remaining:
> > > > > > > > >
> > > > > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > > > > actually works correctly.
> > > > > > > > >
> > > > > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > > > > notifications work properly.  This is related to #3 below:
> > > > > > > > >
> > > > > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > > > > >
> > > > > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > > > > libext2fs except for inline data.
> > > > > > > > >
> > > > > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > > > > don't want filesystems to unmount abruptly.
> > > > > > > > >
> > > > > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > > > > >
> > > > > > > > I'm happy to help you here.
> > > > > > > >
> > > > > > > > First, I think using a character device for namespaced drivers is always
> > > > > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > > > > wrong (TM).
> > > > > > > >
> > > > > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > > > > persistent storage imho. But anyway, that's besides the point.
> > > > > > > >
> > > > > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > > > > /dev/fuse should really be openable by the service itself.
> > > > > > 
> > > > > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > > > > Can you pass an fsopen fd to an unprivileged process and have that
> > > > > > second process call fsmount?
> > > > > 
> > > > > Yes, but remember that at some point you must call
> > > > > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > > > > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > > > > privileged process. All the rest can be done by the unprivileged process
> > > > > though. That's exactly how bpf tokens work.
> > > > 
> > > > Hrm.  Assuming the fsopen mount sequence is still:
> > > > 
> > > > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > > > 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > > > 	...
> > > > 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > > > 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> > > > 	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > > > 
> > > > Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> > > > CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> > > > for move_mount() to have the intended effect.
> > > 
> > > Yes-ish.
> > > 
> > > At fsopen() time the user namespace of the caller is recorded in
> > > fs_context->user_ns. If the filesystems is mountable inside of a user
> > > namespace then fs_context->user_ns will be used to perform the
> > > CAP_SYS_ADMIN check.
> > 
> > Hrmm, well fuse is one of the filesystems that sets FS_USERNS_MOUNT, so
> > I gather that means that the fuse service server (ugh) could invoke the
> > mount using the fsopen fd given to it?  That sounds promising.
> 
> Yes, it could provided fsopen() was called in a user namespace that the
> service holds privileges over.
> 
> > 
> > > For filesystems that aren't mountable inside of user namespaces (ext4,
> > > xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
> > > global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
> > > to mount a filesystem with a non-initial userns if it's not marked as
> > > mountable. That used to be possible but it's an invitation for extremely
> > > subtle bugs and you gain control over the superblock itself.
> > 
> > I guess that's commit e1c5ae59c0f22f ("fs: don't allow non-init
> > s_user_ns for filesystems without FS_USERNS_MOUNT")?  What does it mean
> > for a filesystem to be "...written with a non-initial s_user_ns in
> > mind"?  Is there something specific that I should look out for, aside
> > from the usual "we don't mount parking lot xfs because validating that
> > is too hard and it might explode the kernel"?
> 
> So there are two sides on how to view this:
> 
> (1) The filesystem is mountable   in a user namespace.
> (2) The filesystem is delegatable to a user namespace.
> 
> These are two different things. Allowing (1) is difficult because of the
> usual complexities involved even though everyone always seems to believe
> that their block-based filesystems is reliable enough to be mounted with
> any corrupted image.
> 
> But (2) is something that's doable and in fact something we do allow
> currently for e.g., bpffs. In order to allow containers to use bpf the
> container must have a bpffs instance mounted.
> 
> To do this fsopen() must be called in the containers user namespace. To
> allow specific bpf features and to actually create the superblock
> CAP_SYS_ADMIN or CAP_BPF in the initial users namespace are required.
> Then a new bpf instance will be created that is owned by the user
> namespace of the container.
> 
> IOW, to delegate a superblock/filesystems to an unprivileged container
> capabilities are still required but ultimately the filesystems will be
> owned by the container.

<nod>

> One story I always found worth exploring to get at (1) is if we had
> dm-verity directly integrated into the filesystem. And I don't mean
> fsverity, I mean dm-verity and in a way such that it's explicitly not
> part of the on-disk image in contrast to fsverity where each filesystem
> integrates this very differently into their on-disk format. It basically
> would be as dumb as it gets. Static, simple arithmetic, appended,
> pre-pended, whatever.

That would work as long as you don't need to write to the filesystem,
ever.  For gold master rootfs that would work fine, less so for "my
container needs a writable data partition but the bofh doesn't want us
compromising kernel memory".

> > 
> > > TL;DR the user namespace the superblock belongs to is usually determined
> > > at fsopen() time.
> > > 
> > > > 
> > > > Can two processes share the same fsopen fd?  If so then systemd-mountfsd
> > > 
> > > Yes, they can share and it's synchronized.
> > 
> > > > could pass the fsopen fd to the fuse server (whilst retaining its own
> > > > copy).  The fuse server could do its own mount option parsing, call
> > > 
> > > Yes, systemd-mountfsd already does passing like that.
> > 
> > Oh!
> > 
> > > > FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
> > > > the create/fsmount/move_mount part.
> > > 
> > > Yes.
> > 
> > If the fdopen fd tracks the userns of whoever initiated the mount
> > attempt, then maybe the fuse server can do that part too?  I guess the
> > weird part would be that the fuse server would effectively be passing a
> > path from the caller's ns, despite not having access to that ns.
> 
> Remind me why the FUSE server would want to track the userns?

My wording there might have been confusing -- what I meant is:

1. The fsopen fd tracks the userns of the program that called fsopen.
2. The program from #1 passes the fsopen fd to a fuse server that's
   running in a much more constrained environment (separate systemd
   scope, no privileges at all, resources)
3. The fuse server calls fsmount on the fsopen fd passed to it by #1.

But I also haven't tried *building* any of these pieces, so this is
entirely speculative nonsense on my part. :)
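
For what it's worth, step 2 boils down to ordinary SCM_RIGHTS fd passing
over an AF_UNIX socket.  A minimal sketch, assuming two hypothetical
helpers, send_fd() and recv_fd() (neither name comes from
systemd-mountfsd or libfuse), and assuming an fsopen fd can be handed
over like any other fd, as discussed above:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/*
 * Hypothetical helper: send one open fd over a connected AF_UNIX
 * socket.  Returns 0 on success, -1 on error.
 */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* at least one byte of real data is needed */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);	/* buffer was sized with CMSG_SPACE */
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: receive one fd sent by send_fd().  Returns the
 * new fd number in the receiving process, or -1 on error.
 */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In the scenario above, the privileged side would call send_fd() with
the fd it got from fsopen(), and the constrained fuse server would call
recv_fd() before issuing its own FSCONFIG_SET_* calls.  Whether the
server may also do the fsmount part is exactly the open question here;
the sketch only covers the handoff.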

> > > > The systemd-mountfsd would have to be running in desired fs namespace
> > > > and with sufficient privileges to open block devices, but I'm guessing
> > > > that's already a requirement?
> > > 
> > > Yes, systemd-mountfsd is a system level service running in the initial
> > > set of namespaces and interacting with systemd-nsresourced (namespace
> > > related stuff). It can obviously also create helpers to setns() into
> > > various namespaces if required. 
> > 
> > <nod> I think I saw something else from you about a file descriptor
> > store, so I'll go look there next.
> > 
> > --D
> > 
> > > > 
> > > > > > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > > > > > could pass open fds for /dev/fuse fsopen then the fuse server wouldn't
> > > 
> > > Yes, I would think so.
> > > 
> > > > > 
> > > > > Yes, that would work.
> > > > 
> > > > Oh goody :)
> > > > 
> > > > > > need any special /dev access at all.  I think then the fuse server's
> > > > > > service could have:
> > > > > > 
> > > > > > DynamicUser=true
> > > > > > ProtectSystem=true
> > > > > > ProtectHome=true
> > > > > > PrivateTmp=true
> > > > > > PrivateDevices=true
> > > > > > DevicePolicy=strict
> > > > > > 
> > > > > > (I think most of those are redundant with DynamicUser=true but a lot of
> > > > > > my systemd-fu is paged out ATM.)
> > > > > > 
> > > > > > My goal here is extreme containment -- the code doing the fs metadata
> > > > > > parsing has no privileges, no write access except to the fds it was
> > > > > > given, no network access, and no ability to read anything outside the
> > > > > > root filesystem.  Then I can get back to writing buffer
> > > > > > overflows^W^Whigh quality filesystem code in peace.
> > > > > 
> > > > > Yeah, sounds about right.
> > > > > 
> > > > > > 
> > > > > > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > > > > > whiteouts. That means you can do mknod() in the container to create
> > > > > > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > > > > > bat so that containers can only do this on their private tmpfs mount at
> > > > > > > > /dev.)
> > > > > > > >
> > > > > > > > The downside of this would be to give unprivileged containers access to
> > > > > > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > > > > > change.
> > > > > > 
> > > > > > Yeah, that is a new risk.  It's still better than metadata parsing
> > > > > > within the kernel address space ... though who knows how thoroughly fuse
> > > > > > has been fuzzed by syzbot :P
> > > > > > 
> > > > > > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > > > > > sure enough about it to spill it.
> > > > > > 
> > > > > > Please do share, #f is my crazy unbaked idea. :)
> > > > > > 
> > > > > > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > > > > > a device driver.
> > > > > > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > > > > > 
> > > > > > > > > protections because they tend to run in a private mount namespace with
> > > > > > > > > various parts of the filesystem either hidden or readonly.
> > > > > > > > >
> > > > > > > > > In theory one could design a socket protocol to pass mount options,
> > > > > > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > > > > > a mount helper and a service:
> > > > > > > >
> > > > > > > > This isn't a problem really. This should just be an extension to
> > > > > > > > systemd-mountfsd.
> > > > > > 
> > > > > > I suppose mount.safe could very well call systemd-mount to go do all the
> > > > > > systemd-related service setup, and that would take care of udisks as
> > > > > > well.
> > > > > 
> > > > > The ultimate goal is to teach mount(8)/libmount to use that daemon when
> > > > > it's available. Because that would just make unprivileged mounting work
> > > > > without userspace noticing anything.
> > > > 
> > > > That sounds really neat. :)
> > > > 
> > > > --D
> 


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-08-12 20:20                 ` Darrick J. Wong
@ 2025-08-15 14:20                   ` Christian Brauner
  0 siblings, 0 replies; 49+ messages in thread
From: Christian Brauner @ 2025-08-15 14:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Tue, Aug 12, 2025 at 01:20:25PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 04, 2025 at 12:12:24PM +0200, Christian Brauner wrote:
> > On Thu, Jul 31, 2025 at 10:22:06AM -0700, Darrick J. Wong wrote:
> > > On Thu, Jul 31, 2025 at 12:13:01PM +0200, Christian Brauner wrote:
> > > > On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > > > > > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > > > > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > > > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > DO NOT MERGE THIS, STILL!
> > > > > > > > > >
> > > > > > > > > > This is the third request for comments of a prototype to connect the
> > > > > > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > > > > > files whose contents persist to locally attached storage devices.
> > > > > > > > > >
> > > > > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > > > > server process.
> > > > > > > > > >
> > > > > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > > > > core code.  Eeeugh.
> > > > > > > > > >
> > > > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > > > > > works.
> > > > > > > > > >
> > > > > > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > > > > > overhead.  And that's with debugging turned on!
> > > > > > > > > >
> > > > > > > > > > These items have been addressed since the first RFC:
> > > > > > > > > >
> > > > > > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > > > > > unwritten and delalloc mappings.
> > > > > > > > > >
> > > > > > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > > > > > >
> > > > > > > > > > 3. iomap supports inline data.
> > > > > > > > > >
> > > > > > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > > > > > fuse server can set fuse_attr::flags.
> > > > > > > > > >
> > > > > > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > > > > > >
> > > > > > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > > > > > is enabled.
> > > > > > > > > >
> > > > > > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > > > > > >
> > > > > > > > > > There are some major warts remaining:
> > > > > > > > > >
> > > > > > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > > > > > actually works correctly.
> > > > > > > > > >
> > > > > > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > > > > > > notifications work properly.  This is related to item c below:
> > > > > > > > > >
> > > > > > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > > > > > >
> > > > > > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > > > > > libext2fs except for inline data.
> > > > > > > > > >
> > > > > > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > > > > > don't want filesystems to unmount abruptly.
> > > > > > > > > >
> > > > > > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > > > > > >
> > > > > > > > > I'm happy to help you here.
> > > > > > > > >
> > > > > > > > > First, I think using a character device for namespaced drivers is always
> > > > > > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > > > > > wrong (TM).
> > > > > > > > >
> > > > > > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > > > > > > persistent storage imho. But anyway, that's beside the point.
> > > > > > > > >
> > > > > > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > > > > > /dev/fuse should really be openable by the service itself.
> > > > > > > 
> > > > > > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > > > > > Can you pass an fsopen fd to an unprivileged process and have that
> > > > > > > second process call fsmount?
> > > > > > 
> > > > > > Yes, but remember that at some point you must call
> > > > > > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > > > > > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > > > > > privileged process. All the rest can be done by the unprivileged process
> > > > > > though. That's exactly how bpf tokens work.
> > > > > 
> > > > > Hrm.  Assuming the fsopen mount sequence is still:
> > > > > 
> > > > > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > > > > 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > > > > 	...
> > > > > 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > > > > 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> > > > > 	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > > > > 
> > > > > Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> > > > > CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> > > > > for move_mount() to have the intended effect.
> > > > 
> > > > Yes-ish.
> > > > 
> > > > At fsopen() time the user namespace of the caller is recorded in
> > > > fs_context->user_ns. If the filesystem is mountable inside of a user
> > > > namespace then fs_context->user_ns will be used to perform the
> > > > CAP_SYS_ADMIN check.
> > > 
> > > Hrmm, well fuse is one of the filesystems that sets FS_USERNS_MOUNT, so
> > > I gather that means that the fuse service server (ugh) could invoke the
> > > mount using the fsopen fd given to it?  That sounds promising.
> > 
> > Yes, it could provided fsopen() was called in a user namespace that the
> > service holds privileges over.
> > 
> > > 
> > > > For filesystems that aren't mountable inside of user namespaces (ext4,
> > > > xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
> > > > global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
> > > > to mount a filesystem with a non-initial userns if it's not marked as
> > > > mountable. That used to be possible but it's an invitation for extremely
> > > > subtle bugs and you gain control over the superblock itself.
> > > 
> > > I guess that's commit e1c5ae59c0f22f ("fs: don't allow non-init
> > > s_user_ns for filesystems without FS_USERNS_MOUNT")?  What does it mean
> > > for a filesystem to be "...written with a non-initial s_user_ns in
> > > mind"?  Is there something specific that I should look out for, aside
> > > from the usual "we don't mount parking lot xfs because validating that
> > > is too hard and it might explode the kernel"?
> > 
> > So there are two sides on how to view this:
> > 
> > (1) The filesystem is mountable   in a user namespace.
> > (2) The filesystem is delegatable to a user namespace.
> > 
> > These are two different things. Allowing (1) is difficult because of the
> > usual complexities involved even though everyone always seems to believe
> > that their block-based filesystem is reliable enough to be mounted with
> > any corrupted image.
> > 
> > But (2) is something that's doable and in fact something we do allow
> > currently for e.g., bpffs. In order to allow containers to use bpf the
> > container must have a bpffs instance mounted.
> > 
> > To do this fsopen() must be called in the container's user namespace. To
> > allow specific bpf features and to actually create the superblock
> > CAP_SYS_ADMIN or CAP_BPF in the initial user namespace is required.
> > Then a new bpf instance will be created that is owned by the user
> > namespace of the container.
> > 
> > IOW, to delegate a superblock/filesystem to an unprivileged container
> > capabilities are still required but ultimately the filesystem will be
> > owned by the container.
> 
> <nod>
> 
> > One story I always found worth exploring to get at (1) is if we had
> > dm-verity directly integrated into the filesystem. And I don't mean
> > fsverity, I mean dm-verity and in a way such that it's explicitly not
> > part of the on-disk image in contrast to fsverity where each filesystem
> > integrates this very differently into their on-disk format. It basically
> > would be as dumb as it gets. Static, simple arithmetic, appended,
> > pre-pended, whatever.
> 
> That would work as long as you don't need to write to the filesystem,
> ever.  For gold master rootfs that would work fine, less so for "my
> container needs a writable data partition but the bofh doesn't want us
> compromising kernel memory".

Yes, for that use-case you probably almost always want to combine this
with overlayfs. Well, ideally the system would clearly differentiate
between filesystems that contain executable code, which should never
be writable, and filesystems that contain data.


Thread overview: 49+ messages
2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-07-17 23:39   ` [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
2025-07-17 23:39   ` [PATCH 02/22] fuse2fs: add iomap= mount option Darrick J. Wong
2025-07-17 23:40   ` [PATCH 03/22] fuse2fs: implement iomap configuration Darrick J. Wong
2025-07-17 23:40   ` [PATCH 04/22] fuse2fs: register block devices for use with iomap Darrick J. Wong
2025-07-17 23:40   ` [PATCH 05/22] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
2025-07-17 23:40   ` [PATCH 06/22] fuse2fs: implement directio file reads Darrick J. Wong
2025-07-17 23:41   ` [PATCH 07/22] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
2025-07-17 23:41   ` [PATCH 08/22] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
2025-07-17 23:41   ` [PATCH 09/22] fuse2fs: add extent dump function for debugging Darrick J. Wong
2025-07-17 23:41   ` [PATCH 10/22] fuse2fs: implement direct write support Darrick J. Wong
2025-07-17 23:42   ` [PATCH 11/22] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
2025-07-17 23:42   ` [PATCH 12/22] fuse2fs: improve tracing for fallocate Darrick J. Wong
2025-07-17 23:42   ` [PATCH 13/22] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
2025-07-17 23:43   ` [PATCH 14/22] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
2025-07-17 23:43   ` [PATCH 15/22] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
2025-07-17 23:43   ` [PATCH 16/22] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
2025-07-17 23:43   ` [PATCH 17/22] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
2025-07-17 23:44   ` [PATCH 18/22] fuse2fs: don't allow hardlinks for now Darrick J. Wong
2025-07-17 23:44   ` [PATCH 19/22] fuse2fs: enable file IO to inline data files Darrick J. Wong
2025-07-17 23:44   ` [PATCH 20/22] fuse2fs: set iomap-related inode flags Darrick J. Wong
2025-07-17 23:44   ` [PATCH 21/22] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
2025-07-17 23:45   ` [PATCH 22/22] fuse2fs: configure block device block size Darrick J. Wong
2025-07-17 23:26 ` [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-07-17 23:45   ` [PATCH 1/1] fuse2fs: enable caching of iomaps Darrick J. Wong
2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-07-17 23:45   ` [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens Darrick J. Wong
2025-07-17 23:45   ` [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled Darrick J. Wong
2025-07-17 23:46   ` [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
2025-07-17 23:46   ` [PATCH 04/10] fuse2fs: better debugging for file mode updates Darrick J. Wong
2025-07-17 23:46   ` [PATCH 05/10] fuse2fs: debug timestamp updates Darrick J. Wong
2025-07-17 23:46   ` [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
2025-07-17 23:47   ` [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
2025-07-17 23:47   ` [PATCH 08/10] fuse2fs: enable syncfs Darrick J. Wong
2025-07-17 23:47   ` [PATCH 09/10] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
2025-07-17 23:47   ` [PATCH 10/10] fuse2fs: implement statx Darrick J. Wong
2025-07-18  8:54 ` [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Christian Brauner
2025-07-18 11:55   ` Amir Goldstein
2025-07-18 19:31     ` Darrick J. Wong
2025-07-18 19:56       ` Amir Goldstein
2025-07-18 20:21         ` Darrick J. Wong
2025-07-23 13:05       ` Christian Brauner
2025-07-23 18:04         ` Darrick J. Wong
2025-07-31 10:13           ` Christian Brauner
2025-07-31 17:22             ` Darrick J. Wong
2025-08-04 10:12               ` Christian Brauner
2025-08-12 20:20                 ` Darrick J. Wong
2025-08-15 14:20                   ` Christian Brauner
