linux-ext4.vger.kernel.org archive mirror
* [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
@ 2025-05-21 23:58 Darrick J. Wong
  2025-05-22  0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-21 23:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4,
	Theodore Ts'o

Hi everyone,

DO NOT MERGE THIS.

This is the very first request for comments of a prototype to connect
the Linux fuse driver to fs-iomap for regular file IO operations to and
from files whose contents persist to locally attached storage devices.

Why would you want to do that?  Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence.  Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.

willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code.  Eeeugh.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ioend calls within iomap are turned into upcalls
to the fuse server via a trio of new fuse commands.  This is suitable
for very simple filesystems that don't do tricky things with mappings
(e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
but solving that is for the next sprint.
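
As a rough illustration of what the server side of an ->iomap_begin
upcall has to compute, here is a minimal sketch that translates a file
range into a single device mapping using a sorted extent table.  Every
name in it (toy_extent, toy_mapping, toy_iomap_begin, TOY_MAP_*) is
hypothetical; it is not the wire format or the API from this patchset,
and it ignores unwritten and delalloc mappings entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping types; these are not the kernel's IOMAP_* values. */
enum toy_map_type { TOY_MAP_HOLE, TOY_MAP_MAPPED };

/* One contiguous run of file data on the device (all units are bytes). */
struct toy_extent {
	uint64_t file_off;	/* start offset within the file */
	uint64_t dev_off;	/* start offset on the block device */
	uint64_t len;		/* length of the run */
};

/* What the server would report back for one upcall. */
struct toy_mapping {
	enum toy_map_type type;
	uint64_t file_off;	/* start of the mapping in the file */
	uint64_t dev_off;	/* device offset; meaningless for holes */
	uint64_t len;		/* number of bytes the mapping covers */
};

/*
 * Find the mapping that starts at file offset @pos, trimmed to at most
 * @count bytes.  @ext must be sorted by file_off and must not overlap.
 */
static struct toy_mapping toy_iomap_begin(const struct toy_extent *ext,
					  size_t nr, uint64_t pos,
					  uint64_t count)
{
	struct toy_mapping map = { TOY_MAP_HOLE, pos, 0, count };
	size_t i;

	for (i = 0; i < nr; i++) {
		uint64_t end = ext[i].file_off + ext[i].len;

		if (pos >= end)
			continue;	/* extent lies entirely before pos */

		if (pos < ext[i].file_off) {
			/* pos is in a hole that ends where this extent starts */
			if (ext[i].file_off - pos < count)
				map.len = ext[i].file_off - pos;
			return map;
		}

		/* pos falls inside this extent */
		map.type = TOY_MAP_MAPPED;
		map.dev_off = ext[i].dev_off + (pos - ext[i].file_off);
		map.len = end - pos < count ? end - pos : count;
		return map;
	}

	/* past the last extent: report a hole for the whole request */
	return map;
}
```

In this sketch the caller would call toy_iomap_begin() repeatedly,
advancing pos by the returned len, until the whole range is covered.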

With this overly simplistic RFC, I aim to show that it's possible to
build a fuse server for a real filesystem (ext4) that runs entirely in
userspace yet maintains most of its performance.  At this early stage I
get about 95% of the kernel ext4 driver's streaming directio performance
and 110% of its streaming buffered IO performance.
Random buffered IO suffers a 90% hit on writes due to unwritten extent
conversions.  Random direct IO is about 60% as fast as the kernel; see
the cover letter for the fuse2fs iomap changes for more details.

There are some major warts remaining:

1. The iomap cookie validation is not present, which can lead to subtle
races between pagecache zeroing and writeback on filesystems that
support unwritten and delalloc mappings.

2. Mappings ought to be cached in the kernel for more speed.

3. iomap doesn't support things like fscrypt or fsverity, and I haven't
yet figured out how inline data is supposed to work.

4. I would like to be able to turn on fuse+iomap on a per-inode basis,
which currently isn't possible because the kernel fuse driver will iget
inodes prior to calling FUSE_GETATTR to discover the properties of the
inode it just read.

5. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.

6. iomap is an inode-based service, not a file-based service.  This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls.  However, the fuse kernel driver uses a separate
nodeid to index its incore inode, so we have to pass those too so that
notifications work properly.

I'll work on these in June, but for now here's an unmergeable RFC to
start some discussion.

--Darrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
@ 2025-05-22  0:01 ` Darrick J. Wong
  2025-05-22  0:07   ` [PATCH 1/3] fuse2fs: bump library version Darrick J. Wong
                     ` (2 more replies)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:01 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

Hi all,

In preparation to start hacking on fuse2fs and iomap, upgrade fuse2fs
library support to 3.17, which is the latest upstream release as of this
writing.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-library-upgrade
---
Commits in this patchset:
 * fuse2fs: bump library version
 * fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse
 * fuse2fs: disable nfs exports
---
 configure      |    4 ++--
 configure.ac   |    4 ++--
 misc/fuse2fs.c |   35 ++++++++++++++++++++++++++++++++---
 3 files changed, 36 insertions(+), 7 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
  2025-05-22  0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong
@ 2025-05-22  0:02 ` Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
                     ` (9 more replies)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
  3 siblings, 10 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:02 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

Hi all,

In preparation for connecting fuse, iomap, and fuse2fs for a much more
performant file IO path, make some changes to the Unix IO manager in
libext2fs so that we can have better IO.  First we start by making
filesystem flushes a lot more efficient by eliding fsyncs when they're
not necessary, and allowing library clients to turn off the racy code
that writes the superblock byte by byte but exposes stale checksums.

XXX: The second part of this series adds IO tagging so that we could tag
IOs by inode number to distinguish file data blocks in cache from
everything else.  This is temporary scaffolding whilst we're in the
middle of adding directio and later buffered writes.  Once we can use the
pagecache for all file IO activity I think we could drop the back half
of this series.
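
To make the tagging idea concrete, here is a minimal sketch of a block
cache whose entries carry a tag, so that everything cached on behalf of
one inode can be dropped without touching metadata blocks.  The names
(toy_cache, toy_cache_insert, toy_cache_invalidate_tag, TOY_TAG_NONE)
are hypothetical and the layout is an assumption for illustration; the
unix IO manager changes in this series are structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_SIZE	8
#define TOY_TAG_NONE	0	/* hypothetical tag for untagged (metadata) blocks */

struct toy_cache_entry {
	int		in_use;
	uint64_t	block;
	uint64_t	tag;	/* owning inode number, or TOY_TAG_NONE */
};

struct toy_cache {
	struct toy_cache_entry	slot[TOY_CACHE_SIZE];
};

/* Mark every slot as free. */
static void toy_cache_init(struct toy_cache *c)
{
	size_t i;

	for (i = 0; i < TOY_CACHE_SIZE; i++)
		c->slot[i].in_use = 0;
}

/* Record that @block is cached on behalf of @tag; returns -1 if full. */
static int toy_cache_insert(struct toy_cache *c, uint64_t block, uint64_t tag)
{
	size_t i;

	for (i = 0; i < TOY_CACHE_SIZE; i++) {
		if (c->slot[i].in_use)
			continue;
		c->slot[i].in_use = 1;
		c->slot[i].block = block;
		c->slot[i].tag = tag;
		return 0;
	}
	return -1;
}

/* Drop every cached block belonging to @tag; returns how many were dropped. */
static size_t toy_cache_invalidate_tag(struct toy_cache *c, uint64_t tag)
{
	size_t i, dropped = 0;

	for (i = 0; i < TOY_CACHE_SIZE; i++) {
		if (!c->slot[i].in_use || c->slot[i].tag != tag)
			continue;
		c->slot[i].in_use = 0;
		dropped++;
	}
	return dropped;
}

/* Count the entries that are still cached. */
static size_t toy_cache_count(const struct toy_cache *c)
{
	size_t i, n = 0;

	for (i = 0; i < TOY_CACHE_SIZE; i++)
		if (c->slot[i].in_use)
			n++;
	return n;
}
```

A directio read of inode 12 could then call
toy_cache_invalidate_tag(&cache, 12) and leave cached metadata alone.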

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=libext2fs-iomap-prep
---
Commits in this patchset:
 * libext2fs: always fsync the device when flushing the cache
 * libext2fs: always fsync the device when closing the unix IO manager
 * libext2fs: only fsync the unix fd if we wrote to the device
 * libext2fs: invalidate cached blocks when freeing them
 * libext2fs: add tagged block IO for better caching
 * libext2fs: add tagged block IO caching to the unix IO manager
 * libext2fs: only flush affected blocks in unix_write_byte
 * libext2fs: allow unix_write_byte when the write would be aligned
 * libext2fs: allow clients to ask to write full superblocks
 * libext2fs: allow callers to disallow I/O to file data blocks
---
 lib/ext2fs/ext2_io.h         |   29 ++++
 lib/ext2fs/ext2fs.h          |    4 +
 debian/libext2fs2t64.symbols |    5 +
 lib/ext2fs/alloc_stats.c     |    7 +
 lib/ext2fs/closefs.c         |    7 +
 lib/ext2fs/fileio.c          |   26 +++-
 lib/ext2fs/io_manager.c      |   56 ++++++++
 lib/ext2fs/unix_io.c         |  281 +++++++++++++++++++++++++++++++++++-------
 8 files changed, 362 insertions(+), 53 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
  2025-05-22  0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
@ 2025-05-22  0:02 ` Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
                     ` (15 more replies)
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
  3 siblings, 16 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:02 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

Hi all,

Switch fuse2fs to use the new iomap file data IO paths instead of
pushing it very slowly through the /dev/fuse connection.  For local
filesystems, all we have to do is respond to requests for file to device
mappings; the rest of the IO hot path stays within the kernel.  This
means that we can get rid of all file data block processing within
fuse2fs.

Because we're not pinning dirty pages through a potentially slow network
connection, we don't need the heavy BDI throttling for which most fuse
servers have become infamous.  Yes, mapping lookups for writeback can
stall, but mappings are small as compared to data and this situation
exists for all kernel filesystems as well.

The performance of this new data path is quite stunning: on a warm
system, streaming reads and writes through the pagecache go from
60-90MB/s to 2-2.5GB/s.  Direct IO reads and writes improve from the
same baseline to 2.5-8GB/s.  FIEMAP and SEEK_DATA/SEEK_HOLE now work
too.  The kernel ext4 driver can manage about 1.6GB/s for pagecache IO
and about 2.6-8.5GB/s for direct IO, which means that fuse2fs is about
as fast as the kernel for streaming file IO.

Random 4k buffered IO is not so good: plain fuse2fs pokes along at
25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s.  The kernel
can do 900-1300MB/s.  Random directio is worse: plain fuse2fs does
20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does
40-55MB/s.  I suspect that metadata heavy workloads do not perform well
on fuse2fs because libext2fs wasn't designed for that and it doesn't
even have a journal to absorb all the fsync writes.  We also probably
need iomap caching really badly.

These performance numbers are slanted: my machine is 12 years old, and
fuse2fs is VERY poorly optimized for performance.  It contains a single
Big Filesystem Lock which nukes multi-threaded scalability.  There's no
inode cache nor is there a proper buffer cache, which means that fuse2fs
reads metadata in from disk and checksums it on EVERY ACCESS.  Sad!

Despite these gaps, this RFC demonstrates that it's feasible to run the
metadata parsing parts of a filesystem in userspace while not
sacrificing much performance.  We now have a vehicle to move the
filesystems out of the kernel, where they can be containerized so that
malicious filesystems can be contained, somewhat.

iomap mode also calls FUSE_DESTROY before unmounting the filesystem, so
for capable systems, fuse2fs doesn't need to run in fuseblk mode
anymore.

However, there are some major warts remaining:

1. The iomap cookie validation is not present, which can lead to subtle
races between pagecache zeroing and writeback on filesystems that
support unwritten and delalloc mappings.

2. Mappings ought to be cached in the kernel for more speed.

3. iomap doesn't support things like fscrypt or fsverity, and I haven't
yet figured out how inline data is supposed to work.

4. I would like to be able to turn on fuse+iomap on a per-inode basis,
which currently isn't possible because the kernel fuse driver will iget
inodes prior to calling FUSE_GETATTR to discover the properties of the
inode it just read.

5. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.

6. iomap is an inode-based service, not a file-based service.  This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls.  However, the fuse kernel driver uses a separate
nodeid to index its incore inode, so we have to pass those too so that
notifications work properly.

I'll work on these in June, but for now here's an unmergeable RFC to
start some discussion.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap
---
Commits in this patchset:
 * fuse2fs: implement bare minimum iomap for file mapping reporting
 * fuse2fs: register block devices for use with iomap
 * fuse2fs: always use directio disk reads with fuse2fs
 * fuse2fs: implement directio file reads
 * fuse2fs: use tagged block IO for zeroing sub-block regions
 * fuse2fs: only flush the cache for the file under directio read
 * fuse2fs: add extent dump function for debugging
 * fuse2fs: implement direct write support
 * fuse2fs: turn on iomap for pagecache IO
 * fuse2fs: flush and invalidate the buffer cache on trim
 * fuse2fs: improve tracing for fallocate
 * fuse2fs: don't zero bytes in punch hole
 * fuse2fs: don't do file data block IO when iomap is enabled
 * fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
 * fuse2fs: re-enable the block device pagecache for metadata IO
 * fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
---
 configure       |   47 ++
 configure.ac    |   32 +
 lib/config.h.in |    3 
 misc/fuse2fs.c  | 1251 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 1312 insertions(+), 21 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/3] fuse2fs: bump library version
  2025-05-22  0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong
@ 2025-05-22  0:07   ` Darrick J. Wong
  2025-05-22  0:07   ` [PATCH 2/3] fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 3/3] fuse2fs: disable nfs exports Darrick J. Wong
  2 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:07 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Bump the library version so we can take advantage of new functionality
since libfuse 3.5.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure    |    4 ++--
 configure.ac |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)


diff --git a/configure b/configure
index dfc6bb4a5daa2e..1f7dbe24ee1ab1 100755
--- a/configure
+++ b/configure
@@ -14513,14 +14513,14 @@ fi
 
 if test "$FUSE_LIB" = "-lfuse3"
 then
-	FUSE_USE_VERSION=35
+	FUSE_USE_VERSION=314
 	CFLAGS="$CFLAGS $fuse3_CFLAGS"
 	LDFLAGS="$LDFLAGS $fuse3_LDFLAGS"
 	       for ac_header in pthread.h fuse.h
 do :
   as_ac_Header=`printf "%s\n" "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_compile "$LINENO" "$ac_header" "$as_ac_Header" "#define _FILE_OFFSET_BITS	64
-#define FUSE_USE_VERSION 35
+#define FUSE_USE_VERSION 314
 #ifdef __linux__
 #include <linux/fs.h>
 #include <linux/falloc.h>
diff --git a/configure.ac b/configure.ac
index 7f28701534a905..c7f193b4ed06bf 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1413,13 +1413,13 @@ AC_SUBST(FUSE_LIB)
 AC_SUBST(FUSE_CMT)
 if test "$FUSE_LIB" = "-lfuse3"
 then
-	FUSE_USE_VERSION=35
+	FUSE_USE_VERSION=314
 	CFLAGS="$CFLAGS $fuse3_CFLAGS"
 	LDFLAGS="$LDFLAGS $fuse3_LDFLAGS"
 	AC_CHECK_HEADERS([pthread.h fuse.h], [],
 		[AC_MSG_FAILURE([Cannot find fuse3 fuse2fs headers.])],
 [#define _FILE_OFFSET_BITS	64
-#define FUSE_USE_VERSION 35
+#define FUSE_USE_VERSION 314
 #ifdef __linux__
 #include <linux/fs.h>
 #include <linux/falloc.h>


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/3] fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse
  2025-05-22  0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong
  2025-05-22  0:07   ` [PATCH 1/3] fuse2fs: bump library version Darrick J. Wong
@ 2025-05-22  0:07   ` Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 3/3] fuse2fs: disable nfs exports Darrick J. Wong
  2 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:07 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

Create a compatibility wrapper for fuse_set_feature_flag if the libfuse
version is older than the one where that function was introduced (3.17).

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 9667f00e366a66..6137fc04198d39 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -932,6 +932,19 @@ static void op_destroy(void *p EXT2FS_ATTR((unused)))
 	}
 }
 
+#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 17)
+static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
+					 uint64_t flag)
+{
+	if (conn->capable & flag) {
+		conn->want |= flag;
+		return 1;
+	}
+
+	return 0;
+}
+#endif
+
 static void *op_init(struct fuse_conn_info *conn
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 			, struct fuse_config *cfg EXT2FS_ATTR((unused))
@@ -947,14 +960,14 @@ static void *op_init(struct fuse_conn_info *conn
 	FUSE2FS_CHECK_CONTEXT_NULL(ff);
 	dbg_printf(ff, "%s: dev=%s\n", __func__, ff->device);
 #ifdef FUSE_CAP_IOCTL_DIR
-	conn->want |= FUSE_CAP_IOCTL_DIR;
+	fuse_set_feature_flag(conn, FUSE_CAP_IOCTL_DIR);
 #endif
 #ifdef FUSE_CAP_POSIX_ACL
 	if (ff->acl)
-		conn->want |= FUSE_CAP_POSIX_ACL;
+		fuse_set_feature_flag(conn, FUSE_CAP_POSIX_ACL);
 #endif
 #ifdef FUSE_CAP_CACHE_SYMLINKS
-	conn->want |= FUSE_CAP_CACHE_SYMLINKS;
+	fuse_set_feature_flag(conn, FUSE_CAP_CACHE_SYMLINKS);
 #endif
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 	conn->time_gran = 1;
@@ -1020,6 +1033,19 @@ static void *op_init(struct fuse_conn_info *conn
 		log_printf(ff, "%s %s.\n", _("mounted filesystem"), uuid);
 	}
 out:
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17)
+	/*
+	 * THIS MUST GO LAST!
+	 *
+	 * The high-level libfuse code has a strange bug: it sets feature flags
+	 * in conn->want_ext, and later copies the lower 32 bits to conn->want.
+	 * If we in turn change some bits in want_ext without updating want,
+	 * the lower level library observes that want and want_ext have
+	 * gotten out of sync, and refuses to mount.  Therefore, synchronize
+	 * the two.
+	 */
+	conn->want = conn->want_ext & 0xFFFFFFFF;
+#endif
 	return ff;
 mount_fail:
 	ff->retcode = 32;
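
The interaction between the two request masks can be sketched in
isolation.  The struct below is a hypothetical stand-in for the handful
of fuse_conn_info fields involved, the TOY_CAP_* bits are made up, and
the sketch assumes (as the comment in the patch implies) that the newer
helper records requests in want_ext only.  It is not libfuse code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fuse_conn_info fields discussed above. */
struct toy_conn_info {
	uint32_t capable;	/* features the kernel offers */
	uint32_t want;		/* legacy 32-bit request mask */
	uint64_t want_ext;	/* extended 64-bit request mask */
};

#define TOY_CAP_A	(1ULL << 3)	/* made-up feature bits */
#define TOY_CAP_B	(1ULL << 7)

/* Request @flag only if the kernel advertises it; returns 1 if requested. */
static int toy_set_feature_flag(struct toy_conn_info *conn, uint64_t flag)
{
	if (conn->capable & flag) {
		conn->want_ext |= flag;
		return 1;
	}
	return 0;
}

/* Mirror the low 32 bits of want_ext into want so the two masks agree. */
static void toy_sync_want(struct toy_conn_info *conn)
{
	conn->want = (uint32_t)(conn->want_ext & 0xFFFFFFFFULL);
}
```

Until toy_sync_want() runs, want and want_ext disagree, which is the
state that the comment in the patch says makes the library refuse to
mount.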


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/3] fuse2fs: disable nfs exports
  2025-05-22  0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong
  2025-05-22  0:07   ` [PATCH 1/3] fuse2fs: bump library version Darrick J. Wong
  2025-05-22  0:07   ` [PATCH 2/3] fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse Darrick J. Wong
@ 2025-05-22  0:08   ` Darrick J. Wong
  2 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:08 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

The kernel fuse driver can export its own handles, but it doesn't
actually talk to the fuse server about those handles.  Hence they don't
survive unmount/mount cycles the way regular ext4 handles do.  Disable
them, because they cause fstests regressions and it's not clear that
they're suitable for NFS export, at least not as most people understand
ext4 NFS exports.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    3 +++
 1 file changed, 3 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 6137fc04198d39..769bb5babd2738 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -969,6 +969,9 @@ static void *op_init(struct fuse_conn_info *conn
 #ifdef FUSE_CAP_CACHE_SYMLINKS
 	fuse_set_feature_flag(conn, FUSE_CAP_CACHE_SYMLINKS);
 #endif
+#ifdef FUSE_CAP_NO_EXPORT_SUPPORT
+	fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
+#endif
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 	conn->time_gran = 1;
 	cfg->use_ino = 1;


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 01/10] libext2fs: always fsync the device when flushing the cache
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
@ 2025-05-22  0:08   ` Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:08 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

When we're flushing the unix IO manager's buffer cache, we should always
fsync the block device, because something could have written to the
block device -- either the buffer cache itself, or a direct write.
Regardless, the callers all want all dirtied regions to be persisted to
stable media.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index ede75cf8ee3681..40fd9cc1427c31 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1452,7 +1452,8 @@ static errcode_t unix_flush(io_channel channel)
 	retval = flush_cached_blocks(channel, data, 0);
 #endif
 #ifdef HAVE_FSYNC
-	if (!retval && fsync(data->dev) != 0)
+	/* always fsync the device, even if flushing our own cache failed */
+	if (fsync(data->dev) != 0 && !retval)
 		return errno;
 #endif
 	return retval;


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
@ 2025-05-22  0:08   ` Darrick J. Wong
  2025-05-22  0:09   ` [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:08 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

unix_close is the last chance that libext2fs has to report write
failures to users.  Although it's likely that ext2fs_close already
called ext2fs_flush and told the IO manager to flush, we could do one
more sync before we close the file descriptor.  Also don't override the
fsync's errno with the close's errno.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 40fd9cc1427c31..7c5cb075d6b6b6 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1136,8 +1136,11 @@ static errcode_t unix_close(io_channel channel)
 #ifndef NO_IO_CACHE
 	retval = flush_cached_blocks(channel, data, 0);
 #endif
+	/* always fsync the device, even if flushing our own cache failed */
+	if (fsync(data->dev) != 0 && !retval)
+		retval = errno;
 
-	if (close(data->dev) < 0)
+	if (close(data->dev) < 0 && !retval)
 		retval = errno;
 	free_cache(data);
 	free(data->cache);
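
The error handling in this hunk follows a "run every step, report the
first failure" pattern: flush, fsync, and close all execute, and a later
errno never overwrites an earlier one.  A minimal standalone sketch of
that pattern follows; toy_run_all and the toy_* step functions are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical step type: returns 0 on success or a positive errno value. */
typedef int (*toy_step_fn)(void);

/* Run every step in order, even after a failure; return the first error. */
static int toy_run_all(const toy_step_fn *steps, int nr)
{
	int retval = 0;
	int i;

	for (i = 0; i < nr; i++) {
		int err = steps[i]();

		/* keep the earliest error, but still run the later steps */
		if (err && !retval)
			retval = err;
	}
	return retval;
}

static int toy_calls;	/* how many steps actually ran */

static int toy_ok(void)
{
	toy_calls++;
	return 0;
}

static int toy_fail_eio(void)
{
	toy_calls++;
	return EIO;
}

static int toy_fail_enospc(void)
{
	toy_calls++;
	return ENOSPC;
}
```

With the steps { toy_fail_eio, toy_ok, toy_fail_enospc }, all three run
and the caller sees EIO, much as unix_close above reports a flush or
fsync error in preference to one from close().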


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
@ 2025-05-22  0:09   ` Darrick J. Wong
  2025-05-22  0:09   ` [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:09 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

As an optimization, only fsync the block device fd if we tried to write
to the io channel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |   48 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 6 deletions(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 7c5cb075d6b6b6..0fc83e471ca0fe 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -129,10 +129,13 @@ struct unix_cache {
 #define WRITE_DIRECT_SIZE 4	/* Must be smaller than CACHE_SIZE */
 #define READ_DIRECT_SIZE 4	/* Should be smaller than CACHE_SIZE */
 
+#define UNIX_STATE_DIRTY	(1U << 0) /* device needs fsyncing */
+
 struct unix_private_data {
 	int	magic;
 	int	dev;
 	int	flags;
+	unsigned int	state; /* UNIX_STATE_* */
 	int	align;
 	int	access_time;
 	ext2_loff_t offset;
@@ -1121,10 +1124,37 @@ static errcode_t unix_open(const char *name, int flags,
 	return unix_open_channel(name, fd, flags, channel, unix_io_manager);
 }
 
+static void mark_dirty(io_channel channel)
+{
+	struct unix_private_data *data =
+		(struct unix_private_data *) channel->private_data;
+
+	mutex_lock(data, CACHE_MTX);
+	data->state |= UNIX_STATE_DIRTY;
+	mutex_unlock(data, CACHE_MTX);
+}
+
+static errcode_t maybe_fsync(io_channel channel)
+{
+	struct unix_private_data *data =
+		(struct unix_private_data *) channel->private_data;
+	int was_dirty;
+
+	mutex_lock(data, CACHE_MTX);
+	was_dirty = data->state & UNIX_STATE_DIRTY;
+	data->state &= ~UNIX_STATE_DIRTY;
+	mutex_unlock(data, CACHE_MTX);
+
+	if (was_dirty && fsync(data->dev) != 0)
+		return errno;
+
+	return 0;
+}
+
 static errcode_t unix_close(io_channel channel)
 {
 	struct unix_private_data *data;
-	errcode_t	retval = 0;
+	errcode_t	retval = 0, retval2;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	data = (struct unix_private_data *) channel->private_data;
@@ -1137,8 +1167,9 @@ static errcode_t unix_close(io_channel channel)
 	retval = flush_cached_blocks(channel, data, 0);
 #endif
 	/* always fsync the device, even if flushing our own cache failed */
-	if (fsync(data->dev) != 0 && !retval)
-		retval = errno;
+	retval2 = maybe_fsync(channel);
+	if (retval2 && !retval)
+		retval = retval2;
 
 	if (close(data->dev) < 0 && !retval)
 		retval = errno;
@@ -1306,6 +1337,8 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 	data = (struct unix_private_data *) channel->private_data;
 	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
 
+	mark_dirty(channel);
+
 #ifdef NO_IO_CACHE
 	return raw_write_blk(channel, data, block, count, buf, 0);
 #else
@@ -1430,6 +1463,8 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 	if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0)
 		return errno;
 
+	mark_dirty(channel);
+
 	actual = write(data->dev, buf, size);
 	if (actual < 0)
 		return errno;
@@ -1445,7 +1480,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 static errcode_t unix_flush(io_channel channel)
 {
 	struct unix_private_data *data;
-	errcode_t retval = 0;
+	errcode_t retval = 0, retval2;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	data = (struct unix_private_data *) channel->private_data;
@@ -1456,8 +1491,9 @@ static errcode_t unix_flush(io_channel channel)
 #endif
 #ifdef HAVE_FSYNC
 	/* always fsync the device, even if flushing our own cache failed */
-	if (fsync(data->dev) != 0 && !retval)
-		return errno;
+	retval2 = maybe_fsync(channel);
+	if (retval2 && !retval)
+		retval = retval2;
 #endif
 	return retval;
 }
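
The dirty-flag handling can be sketched on its own: writers set a bit
under a lock, and the flush path tests and clears that bit in one
critical section before deciding whether to sync.  The names below
(toy_dev, toy_mark_dirty, toy_maybe_sync) are hypothetical, a plain
pthread mutex stands in for the library's mutex_lock wrappers, and a
counter stands in for the real fsync call.

```c
#include <assert.h>
#include <pthread.h>

#define TOY_STATE_DIRTY	(1U << 0)	/* device needs syncing */

struct toy_dev {
	pthread_mutex_t	lock;
	unsigned int	state;		/* TOY_STATE_* */
	unsigned int	nr_syncs;	/* simulated fsync calls (test aid only) */
};

/* Note that something was written since the last sync. */
static void toy_mark_dirty(struct toy_dev *dev)
{
	pthread_mutex_lock(&dev->lock);
	dev->state |= TOY_STATE_DIRTY;
	pthread_mutex_unlock(&dev->lock);
}

/*
 * Test and clear the dirty bit in one critical section, then sync only
 * if it was set.  Returns 1 if a sync was issued, 0 otherwise.
 */
static int toy_maybe_sync(struct toy_dev *dev)
{
	unsigned int was_dirty;

	pthread_mutex_lock(&dev->lock);
	was_dirty = dev->state & TOY_STATE_DIRTY;
	dev->state &= ~TOY_STATE_DIRTY;
	pthread_mutex_unlock(&dev->lock);

	if (!was_dirty)
		return 0;

	/* a real implementation would call fsync() on the device fd here */
	dev->nr_syncs++;
	return 1;
}
```

Because the bit is cleared before the sync is attempted, a failed sync
in this sketch is not retried unless another write marks the device
dirty again; whether that matters depends on how callers treat the
returned error.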


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-05-22  0:09   ` [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
@ 2025-05-22  0:09   ` Darrick J. Wong
  2025-05-22  0:09   ` [PATCH 05/10] libext2fs: add tagged block IO for better caching Darrick J. Wong
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:09 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

When we're freeing blocks, we should tell the IO manager to drop them
from any cache it might be maintaining to improve performance.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2_io.h         |    6 +++++-
 debian/libext2fs2t64.symbols |    1 +
 lib/ext2fs/alloc_stats.c     |    7 +++++++
 lib/ext2fs/io_manager.c      |    8 ++++++++
 lib/ext2fs/unix_io.c         |   32 ++++++++++++++++++++++++++++++++
 5 files changed, 53 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index 78c988374c8808..bab7f2a6a44b81 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -103,7 +103,9 @@ struct struct_io_manager {
 	errcode_t (*zeroout)(io_channel channel, unsigned long long block,
 			     unsigned long long count);
 	errcode_t (*get_fd)(io_channel channel, int *fd);
-	long	reserved[13];
+	errcode_t (*invalidate_blk)(io_channel channel,
+				    unsigned long long block);
+	long	reserved[12];
 };
 
 #define IO_FLAG_RW		0x0001
@@ -147,6 +149,8 @@ extern errcode_t io_channel_cache_readahead(io_channel io,
 					    unsigned long long block,
 					    unsigned long long count);
 extern errcode_t io_channel_fd(io_channel io, int *fd);
+extern errcode_t io_channel_invalidate_blk(io_channel io,
+					   unsigned long long block);
 
 #ifdef _WIN32
 /* windows_io.c */
diff --git a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols
index 9cf3b33ca15f91..13870c4b545b2f 100644
--- a/debian/libext2fs2t64.symbols
+++ b/debian/libext2fs2t64.symbols
@@ -689,6 +689,7 @@ libext2fs.so.2 libext2fs2t64 #MINVER#
  io_channel_cache_readahead@Base 1.43
  io_channel_discard@Base 1.42
  io_channel_fd@Base 1.47.3
+ io_channel_invalidate_blk@Base 1.47.3
  io_channel_read_blk64@Base 1.41.1
  io_channel_set_options@Base 1.37
  io_channel_write_blk64@Base 1.41.1
diff --git a/lib/ext2fs/alloc_stats.c b/lib/ext2fs/alloc_stats.c
index 6f98bcc7cbd5f3..4aeaa286b88a7e 100644
--- a/lib/ext2fs/alloc_stats.c
+++ b/lib/ext2fs/alloc_stats.c
@@ -84,6 +84,13 @@ void ext2fs_block_alloc_stats2(ext2_filsys fs, blk64_t blk, int inuse)
 	ext2fs_mark_bb_dirty(fs);
 	if (fs->block_alloc_stats)
 		(fs->block_alloc_stats)(fs, (blk64_t) blk, inuse);
+
+	if (inuse < 0) {
+		unsigned int i;
+
+		for (i = 0; i < EXT2FS_CLUSTER_RATIO(fs); i++)
+			io_channel_invalidate_blk(fs->io, blk + i);
+	}
 }
 
 void ext2fs_block_alloc_stats(ext2_filsys fs, blk_t blk, int inuse)
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index 1bab069de63e12..aa7fc58b846be8 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -158,3 +158,11 @@ errcode_t io_channel_fd(io_channel io, int *fd)
 
 	return io->manager->get_fd(io, fd);
 }
+
+errcode_t io_channel_invalidate_blk(io_channel io, unsigned long long block)
+{
+	if (!io->manager->invalidate_blk)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io->manager->invalidate_blk(io, block);
+}
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 0fc83e471ca0fe..89f7915371307f 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -664,6 +664,23 @@ static errcode_t reuse_cache(io_channel channel,
 #define FLUSH_INVALIDATE	0x01
 #define FLUSH_NOLOCK		0x02
 
+/* Remove a block from the cache.  Dirty contents are discarded. */
+static void invalidate_cached_block(io_channel channel,
+				    struct unix_private_data *data,
+				    unsigned long long block)
+{
+	struct unix_cache	*cache;
+	int			i;
+
+	mutex_lock(data, CACHE_MTX);
+	for (i = 0, cache = data->cache; i < data->cache_size; i++, cache++) {
+		if (!cache->in_use || cache->block != block)
+			continue;
+		cache->in_use = 0;
+	}
+	mutex_unlock(data, CACHE_MTX);
+}
+
 /*
  * Flush all of the blocks in the cache
  */
@@ -1705,6 +1722,19 @@ static errcode_t unix_get_fd(io_channel channel, int *fd)
 	return 0;
 }
 
+static errcode_t unix_invalidate_blk(io_channel channel,
+				     unsigned long long block)
+{
+	struct unix_private_data *data;
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	data = (struct unix_private_data *) channel->private_data;
+	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+	invalidate_cached_block(channel, data, block);
+	return 0;
+}
+
 #if __GNUC_PREREQ (4, 6)
 #pragma GCC diagnostic pop
 #endif
@@ -1727,6 +1757,7 @@ static struct struct_io_manager struct_unix_manager = {
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,
 	.get_fd		= unix_get_fd,
+	.invalidate_blk	= unix_invalidate_blk,
 };
 
 io_manager unix_io_manager = &struct_unix_manager;
@@ -1749,6 +1780,7 @@ static struct struct_io_manager struct_unixfd_manager = {
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,
 	.get_fd		= unix_get_fd,
+	.invalidate_blk	= unix_invalidate_blk,
 };
 
 io_manager unixfd_io_manager = &struct_unixfd_manager;



* [PATCH 05/10] libext2fs: add tagged block IO for better caching
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-05-22  0:09   ` [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
@ 2025-05-22  0:09   ` Darrick J. Wong
  2025-05-22  0:09   ` [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager Darrick J. Wong
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:09 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Pass inode numbers from the fileio.c code through the io manager to the
unix io manager so that we can manage the disk cache more effectively.
In the next few patches we'll need the ability to flush and invalidate
the caches for specific files.
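
For readers skimming the diff, the dispatch rule that the new tagged
wrappers follow can be sketched in isolation.  This is only an
illustration, not the libext2fs code: the sketch_* names, the error
constant, and the simplified method signatures (no io_channel argument)
are all made up here.  The rule is that a manager with a tagged method
gets the tag, a manager without one can still serve untagged requests
through the plain block call, and a tagged request to a tagless manager
is refused.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the libext2fs types; not the real definitions. */
typedef unsigned int sketch_tag_t;
#define SKETCH_TAG_NULL			0u
#define SKETCH_ERR_NOT_SUPPORTED	1 /* placeholder for EXT2_ET_OP_NOT_SUPPORTED */

struct sketch_manager {
	/* Mandatory untagged read. */
	int (*read_blk64)(unsigned long long block, int count, void *buf);
	/* Optional tagged read; NULL if the manager has no tag support. */
	int (*read_tagblk)(sketch_tag_t tag, unsigned long long block,
			   int count, void *buf);
};

static int sketch_read_tagblk(const struct sketch_manager *m, sketch_tag_t tag,
			      unsigned long long block, int count, void *buf)
{
	assert(m != NULL && m->read_blk64 != NULL);

	/* Prefer the tag-aware method when the manager provides one. */
	if (m->read_tagblk)
		return m->read_tagblk(tag, block, count, buf);

	/* A real tag cannot be honoured without manager support. */
	if (tag != SKETCH_TAG_NULL)
		return SKETCH_ERR_NOT_SUPPORTED;

	/* Untagged requests degrade to the plain 64-bit block read. */
	return m->read_blk64(block, count, buf);
}

/* Toy untagged reader for illustration: fills @count bytes with 0xAB. */
static int sketch_plain_read(unsigned long long block, int count, void *buf)
{
	unsigned char *p = buf;
	int i;

	(void)block;
	for (i = 0; i < count; i++)
		p[i] = 0xAB;
	return 0;
}
```

With a manager that only fills in read_blk64, a call with
SKETCH_TAG_NULL goes through sketch_plain_read, while a call with any
other tag returns SKETCH_ERR_NOT_SUPPORTED.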

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2_io.h         |   25 +++++++++++++++++++++-
 debian/libext2fs2t64.symbols |    4 ++++
 lib/ext2fs/fileio.c          |   14 +++++++-----
 lib/ext2fs/io_manager.c      |   48 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 84 insertions(+), 7 deletions(-)


diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index bab7f2a6a44b81..64b35b31d669e7 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -39,6 +39,11 @@ typedef struct struct_io_stats *io_stats;
 
 #define io_channel_discard_zeroes_data(i) (i->flags & CHANNEL_FLAGS_DISCARD_ZEROES)
 
+typedef unsigned int	io_channel_tag_t;
+
+/* I/O operation has no associated tag */
+#define IO_CHANNEL_TAG_NULL		(0)
+
 struct struct_io_channel {
 	errcode_t	magic;
 	io_manager	manager;
@@ -105,7 +110,15 @@ struct struct_io_manager {
 	errcode_t (*get_fd)(io_channel channel, int *fd);
 	errcode_t (*invalidate_blk)(io_channel channel,
 				    unsigned long long block);
-	long	reserved[12];
+	errcode_t (*read_tagblk)(io_channel channel, io_channel_tag_t tag,
+				 unsigned long long block, int count,
+				 void *data);
+	errcode_t (*write_tagblk)(io_channel channel, io_channel_tag_t tag,
+				   unsigned long long block, int count,
+				   const void *data);
+	errcode_t (*flush_tag)(io_channel channel, io_channel_tag_t tag);
+	errcode_t (*invalidate_tag)(io_channel channel, io_channel_tag_t tag);
+	long	reserved[8];
 };
 
 #define IO_FLAG_RW		0x0001
@@ -134,9 +147,17 @@ extern errcode_t io_channel_write_byte(io_channel channel,
 extern errcode_t io_channel_read_blk64(io_channel channel,
 				       unsigned long long block,
 				       int count, void *data);
+extern errcode_t io_channel_read_tagblk(io_channel channel,
+					io_channel_tag_t tag,
+					unsigned long long block, int count,
+					void *data);
 extern errcode_t io_channel_write_blk64(io_channel channel,
 					unsigned long long block,
 					int count, const void *data);
+extern errcode_t io_channel_write_tagblk(io_channel channel,
+					 io_channel_tag_t tag,
+					 unsigned long long block, int count,
+					 const void *data);
 extern errcode_t io_channel_discard(io_channel channel,
 				    unsigned long long block,
 				    unsigned long long count);
@@ -151,6 +172,8 @@ extern errcode_t io_channel_cache_readahead(io_channel io,
 extern errcode_t io_channel_fd(io_channel io, int *fd);
 extern errcode_t io_channel_invalidate_blk(io_channel io,
 					   unsigned long long block);
+extern errcode_t io_channel_flush_tag(io_channel io, io_channel_tag_t tag);
+extern errcode_t io_channel_invalidate_tag(io_channel io, io_channel_tag_t tag);
 
 #ifdef _WIN32
 /* windows_io.c */
diff --git a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols
index 13870c4b545b2f..87ed63155702e0 100644
--- a/debian/libext2fs2t64.symbols
+++ b/debian/libext2fs2t64.symbols
@@ -689,11 +689,15 @@ libext2fs.so.2 libext2fs2t64 #MINVER#
  io_channel_cache_readahead@Base 1.43
  io_channel_discard@Base 1.42
  io_channel_fd@Base 1.47.3
+ io_channel_flush_tag@Base 1.47.3
  io_channel_invalidate_blk@Base 1.47.3
+ io_channel_invalidate_tag@Base 1.47.3
  io_channel_read_blk64@Base 1.41.1
+ io_channel_read_tagblk@Base 1.47.3
  io_channel_set_options@Base 1.37
  io_channel_write_blk64@Base 1.41.1
  io_channel_write_byte@Base 1.37
+ io_channel_write_tagblk@Base 1.47.3
  io_channel_zeroout@Base 1.43
  qcow2_read_header@Base 1.42
  qcow2_write_raw_image@Base 1.42
diff --git a/lib/ext2fs/fileio.c b/lib/ext2fs/fileio.c
index 818f7f05420029..1b7e88d990036b 100644
--- a/lib/ext2fs/fileio.c
+++ b/lib/ext2fs/fileio.c
@@ -167,7 +167,8 @@ errcode_t ext2fs_file_flush(ext2_file_t file)
 			return retval;
 	}
 
-	retval = io_channel_write_blk64(fs->io, file->physblock, 1, file->buf);
+	retval = io_channel_write_tagblk(fs->io, file->ino, file->physblock,
+					  1, file->buf);
 	if (retval)
 		return retval;
 
@@ -220,9 +221,10 @@ static errcode_t load_buffer(ext2_file_t file, int dontfill)
 		if (!dontfill) {
 			if (file->physblock &&
 			    !(ret_flags & BMAP_RET_UNINIT)) {
-				retval = io_channel_read_blk64(fs->io,
-							       file->physblock,
-							       1, file->buf);
+				retval = io_channel_read_tagblk(fs->io,
+								 file->ino,
+								 file->physblock,
+								 1, file->buf);
 				if (retval)
 					return retval;
 			} else
@@ -603,13 +605,13 @@ static errcode_t ext2fs_file_zero_past_offset(ext2_file_t file,
 		return retval;
 
 	/* Read/zero/write block */
-	retval = io_channel_read_blk64(fs->io, blk, 1, b);
+	retval = io_channel_read_tagblk(fs->io, file->ino, blk, 1, b);
 	if (retval)
 		goto out;
 
 	memset(b + off, 0, fs->blocksize - off);
 
-	retval = io_channel_write_blk64(fs->io, blk, 1, b);
+	retval = io_channel_write_tagblk(fs->io, file->ino, blk, 1, b);
 	if (retval)
 		goto out;
 
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index aa7fc58b846be8..357a3bc7698129 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -85,6 +85,22 @@ errcode_t io_channel_read_blk64(io_channel channel, unsigned long long block,
 					     count, data);
 }
 
+errcode_t io_channel_read_tagblk(io_channel channel, io_channel_tag_t tag,
+				 unsigned long long block, int count,
+				 void *data)
+{
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+
+	if (channel->manager->read_tagblk)
+		return (channel->manager->read_tagblk)(channel, tag, block,
+						       count, data);
+
+	if (tag != IO_CHANNEL_TAG_NULL)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io_channel_read_blk64(channel, block, count, data);
+}
+
 errcode_t io_channel_write_blk64(io_channel channel, unsigned long long block,
 				 int count, const void *data)
 {
@@ -101,6 +117,22 @@ errcode_t io_channel_write_blk64(io_channel channel, unsigned long long block,
 					     count, data);
 }
 
+errcode_t io_channel_write_tagblk(io_channel channel, io_channel_tag_t tag,
+				  unsigned long long block, int count,
+				  const void *data)
+{
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+
+	if (channel->manager->write_tagblk)
+		return (channel->manager->write_tagblk)(channel, tag, block,
+							count, data);
+
+	if (tag != IO_CHANNEL_TAG_NULL)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io_channel_write_blk64(channel, block, count, data);
+}
+
 errcode_t io_channel_discard(io_channel channel, unsigned long long block,
 			     unsigned long long count)
 {
@@ -166,3 +198,19 @@ errcode_t io_channel_invalidate_blk(io_channel io, unsigned long long block)
 
 	return io->manager->invalidate_blk(io, block);
 }
+
+errcode_t io_channel_flush_tag(io_channel io, io_channel_tag_t tag)
+{
+	if (!io->manager->flush_tag)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io->manager->flush_tag(io, tag);
+}
+
+errcode_t io_channel_invalidate_tag(io_channel io, io_channel_tag_t tag)
+{
+	if (!io->manager->invalidate_tag)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io->manager->invalidate_tag(io, tag);
+}



* [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-05-22  0:09   ` [PATCH 05/10] libext2fs: add tagged block IO for better caching Darrick J. Wong
@ 2025-05-22  0:09   ` Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:09 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

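Add tagged block caching to the UNIX IO manager.  Each cache entry now
records the tag of the last tagged read or write that touched it, so
that flushes and invalidations can be limited to the blocks carrying a
given tag.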
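
A rough model of the tag filter in the flush path, for illustration
only.  The struct and function names are invented, the writeback is
reduced to clearing the dirty bit, and write errors (which the real
code handles by keeping the block cached) are ignored.  The point it
shows is that a null tag matches every in-use entry, while any other
tag restricts the walk to entries carrying that tag.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int sketch_tag_t;	/* hypothetical stand-in */
#define SKETCH_TAG_NULL	0u

struct sketch_cache_entry {
	unsigned long long	block;
	sketch_tag_t		tag;
	unsigned		dirty:1;
	unsigned		in_use:1;
};

/*
 * Write back dirty entries matching @tag (SKETCH_TAG_NULL matches all) and
 * optionally drop the matching entries from the cache.  Returns the number
 * of entries that were written back.
 */
static size_t sketch_flush(struct sketch_cache_entry *cache, size_t n,
			   sketch_tag_t tag, int invalidate)
{
	size_t written = 0;
	size_t i;

	assert(cache != NULL || n == 0);

	for (i = 0; i < n; i++) {
		struct sketch_cache_entry *c = &cache[i];

		if (!c->in_use)
			continue;
		if (tag != SKETCH_TAG_NULL && c->tag != tag)
			continue;
		if (c->dirty) {
			/* The real code would issue a raw block write here. */
			c->dirty = 0;
			written++;
		}
		if (invalidate) {
			c->in_use = 0;
			c->tag = SKETCH_TAG_NULL;
		}
	}
	return written;
}
```

Flushing with a specific tag leaves entries that belong to other tags
both cached and dirty; a later flush with SKETCH_TAG_NULL picks them up.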

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |  198 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 154 insertions(+), 44 deletions(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 89f7915371307f..8a8afe47ee4503 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -120,6 +120,7 @@ struct unix_cache {
 	char			*buf;
 	unsigned long long	block;
 	int			access_time;
+	io_channel_tag_t	tag;
 	unsigned		dirty:1;
 	unsigned		in_use:1;
 	unsigned		write_err:1;
@@ -526,6 +527,7 @@ static errcode_t alloc_cache(io_channel channel,
 		cache->access_time = 0;
 		cache->dirty = 0;
 		cache->in_use = 0;
+		cache->tag = IO_CHANNEL_TAG_NULL;
 		if (cache->buf)
 			ext2fs_free_mem(&cache->buf);
 		retval = io_channel_alloc_buf(channel, 0, &cache->buf);
@@ -552,6 +554,7 @@ static void free_cache(struct unix_private_data *data)
 		cache->access_time = 0;
 		cache->dirty = 0;
 		cache->in_use = 0;
+		cache->tag = IO_CHANNEL_TAG_NULL;
 		if (cache->buf)
 			ext2fs_free_mem(&cache->buf);
 	}
@@ -639,8 +642,9 @@ static struct unix_cache *find_cached_block(struct unix_private_data *data,
  * Reuse a particular cache entry for another block.
  */
 static errcode_t reuse_cache(io_channel channel,
-		struct unix_private_data *data, struct unix_cache *cache,
-		unsigned long long block)
+			     struct unix_private_data *data,
+			     struct unix_cache *cache, io_channel_tag_t tag,
+			     unsigned long long block)
 {
 	if (cache->dirty && cache->in_use) {
 		errcode_t retval;
@@ -653,7 +657,16 @@ static errcode_t reuse_cache(io_channel channel,
 		}
 	}
 
+#ifdef DEBUG
+	if (cache->in_use)
+		printf("Reusing cached block %llu(%u) for %llu(%u)\n",
+			cache->block, cache->tag, block, tag);
+	else
+		printf("Using cached block %llu(%u)\n", block, tag);
+#endif
+
 	cache->in_use = 1;
+	cache->tag = tag;
 	cache->dirty = 0;
 	cache->write_err = 0;
 	cache->block = block;
@@ -664,6 +677,17 @@ static errcode_t reuse_cache(io_channel channel,
 #define FLUSH_INVALIDATE	0x01
 #define FLUSH_NOLOCK		0x02
 
+static inline void invalidate_cache(struct unix_cache *cache)
+{
+#ifdef DEBUG
+	if (cache->in_use)
+		printf("Invalidating cache %llu(%u)\n", cache->block,
+				cache->tag);
+#endif
+	cache->in_use = 0;
+	cache->tag = IO_CHANNEL_TAG_NULL;
+}
+
 /* Remove a block from the cache.  Dirty contents are discarded. */
 static void invalidate_cached_block(io_channel channel,
 				    struct unix_private_data *data,
@@ -676,7 +700,7 @@ static void invalidate_cached_block(io_channel channel,
 	for (i = 0, cache = data->cache; i < data->cache_size; i++, cache++) {
 		if (!cache->in_use || cache->block != block)
 			continue;
-		cache->in_use = 0;
+		invalidate_cache(cache);
 	}
 	mutex_unlock(data, CACHE_MTX);
 }
@@ -686,7 +710,7 @@ static void invalidate_cached_block(io_channel channel,
  */
 static errcode_t flush_cached_blocks(io_channel channel,
 				     struct unix_private_data *data,
-				     int flags)
+				     io_channel_tag_t tag, int flags)
 {
 	struct unix_cache	*cache;
 	errcode_t		retval, retval2 = 0;
@@ -698,6 +722,11 @@ static errcode_t flush_cached_blocks(io_channel channel,
 	for (i=0, cache = data->cache; i < data->cache_size; i++, cache++) {
 		if (!cache->in_use)
 			continue;
+		if (tag && cache->tag != tag)
+			continue;
+#ifdef DEBUG
+		printf("Flushing %sblock %llu(%u)\n", cache->dirty ? "dirty " : "", cache->block, cache->tag);
+#endif
 		if (cache->dirty) {
 			int raw_flags = RAW_WRITE_NO_HANDLER;
 
@@ -715,10 +744,10 @@ static errcode_t flush_cached_blocks(io_channel channel,
 				cache->dirty = 0;
 				cache->write_err = 0;
 				if (flags & FLUSH_INVALIDATE)
-					cache->in_use = 0;
+					invalidate_cache(cache);
 			}
 		} else if (flags & FLUSH_INVALIDATE) {
-			cache->in_use = 0;
+			invalidate_cache(cache);
 		}
 	}
 	if ((flags & FLUSH_NOLOCK) == 0)
@@ -737,7 +766,7 @@ static errcode_t flush_cached_blocks(io_channel channel,
 				unsigned long long err_block = cache->block;
 
 				cache->dirty = 0;
-				cache->in_use = 0;
+				invalidate_cache(cache);
 				cache->write_err = 0;
 				if (io_channel_alloc_buf(channel, 0,
 							 &err_buf))
@@ -772,7 +801,7 @@ static errcode_t shrink_cache(io_channel channel,
 
 	mutex_lock(data, CACHE_MTX);
 
-	retval = flush_cached_blocks(channel, data,
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
 			FLUSH_INVALIDATE | FLUSH_NOLOCK);
 	if (retval)
 		goto unlock;
@@ -784,6 +813,7 @@ static errcode_t shrink_cache(io_channel channel,
 		cache->access_time = 0;
 		cache->dirty = 0;
 		cache->in_use = 0;
+		cache->tag = IO_CHANNEL_TAG_NULL;
 		if (cache->buf)
 			ext2fs_free_mem(&cache->buf);
 	}
@@ -814,7 +844,7 @@ static errcode_t grow_cache(io_channel channel,
 
 	mutex_lock(data, CACHE_MTX);
 
-	retval = flush_cached_blocks(channel, data,
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
 			FLUSH_INVALIDATE | FLUSH_NOLOCK);
 	if (retval)
 		goto unlock;
@@ -832,6 +862,7 @@ static errcode_t grow_cache(io_channel channel,
 		cache->access_time = 0;
 		cache->dirty = 0;
 		cache->in_use = 0;
+		cache->tag = IO_CHANNEL_TAG_NULL;
 		retval = io_channel_alloc_buf(channel, 0, &cache->buf);
 		if (retval)
 			goto unlock;
@@ -1181,7 +1212,7 @@ static errcode_t unix_close(io_channel channel)
 		return 0;
 
 #ifndef NO_IO_CACHE
-	retval = flush_cached_blocks(channel, data, 0);
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, 0);
 #endif
 	/* always fsync the device, even if flushing our own cache failed */
 	retval2 = maybe_fsync(channel);
@@ -1220,7 +1251,9 @@ static errcode_t unix_set_blksize(io_channel channel, int blksize)
 		mutex_lock(data, CACHE_MTX);
 		mutex_lock(data, BOUNCE_MTX);
 #ifndef NO_IO_CACHE
-		if ((retval = flush_cached_blocks(channel, data, FLUSH_NOLOCK))){
+		retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
+					     FLUSH_NOLOCK);
+		if (retval) {
 			mutex_unlock(data, BOUNCE_MTX);
 			mutex_unlock(data, CACHE_MTX);
 			return retval;
@@ -1236,8 +1269,9 @@ static errcode_t unix_set_blksize(io_channel channel, int blksize)
 	return retval;
 }
 
-static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
-			       int count, void *buf)
+static errcode_t unix_read_tagblk(io_channel channel, io_channel_tag_t tag,
+				  unsigned long long block, int count,
+				  void *buf)
 {
 	struct unix_private_data *data;
 	struct unix_cache *cache;
@@ -1249,6 +1283,10 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 	data = (struct unix_private_data *) channel->private_data;
 	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
 
+#ifdef DEBUG
+	printf("read block %llu(%u) count %d\n", block, tag, count);
+#endif
+
 #ifdef NO_IO_CACHE
 	return raw_read_blk(channel, data, block, count, buf);
 #else
@@ -1259,7 +1297,8 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 	 * flush out the cache and then do a direct read.
 	 */
 	if (count < 0 || count > WRITE_DIRECT_SIZE) {
-		if ((retval = flush_cached_blocks(channel, data, 0)))
+		retval = flush_cached_blocks(channel, data, tag, 0);
+		if (retval)
 			return retval;
 		return raw_read_blk(channel, data, block, count, buf);
 	}
@@ -1270,9 +1309,11 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 		/* If it's in the cache, use it! */
 		if ((cache = find_cached_block(data, block, NULL))) {
 #ifdef DEBUG
-			printf("Using cached block %lu\n", block);
+			printf("Reading from cached block %llu(%u)\n", block, tag);
 #endif
 			memcpy(cp, cache->buf, channel->block_size);
+			if (tag != IO_CHANNEL_TAG_NULL)
+				cache->tag = tag;
 			count--;
 			block++;
 			cp += channel->block_size;
@@ -1287,7 +1328,7 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 			if (find_cached_block(data, block+i, NULL))
 				break;
 #ifdef DEBUG
-		printf("Reading %d blocks starting at %lu\n", i, block);
+		printf("Reading %d blocks starting at %llu\n", i, block);
 #endif
 		mutex_unlock(data, CACHE_MTX);
 		if ((retval = raw_read_blk(channel, data, block, i, cp)))
@@ -1298,7 +1339,7 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 		for (j=0; j < i; j++) {
 			if (!find_cached_block(data, block, &cache)) {
 				retval = reuse_cache(channel, data,
-						     cache, block);
+						     cache, tag, block);
 				if (retval)
 					goto call_write_handler;
 				memcpy(cache->buf, cp, channel->block_size);
@@ -1317,7 +1358,7 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 		unsigned long long err_block = cache->block;
 
 		cache->dirty = 0;
-		cache->in_use = 0;
+		invalidate_cache(cache);
 		cache->write_err = 0;
 		if (io_channel_alloc_buf(channel, 0, &err_buf))
 			err_buf = NULL;
@@ -1335,14 +1376,22 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 #endif /* NO_IO_CACHE */
 }
 
+static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
+				  int count, void *buf)
+{
+	return unix_read_tagblk(channel, IO_CHANNEL_TAG_NULL, block, count,
+				buf);
+}
+
 static errcode_t unix_read_blk(io_channel channel, unsigned long block,
 			       int count, void *buf)
 {
 	return unix_read_blk64(channel, block, count, buf);
 }
 
-static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
-				int count, const void *buf)
+static errcode_t unix_write_tagblk(io_channel channel, io_channel_tag_t tag,
+				   unsigned long long block, int count,
+				   const void *buf)
 {
 	struct unix_private_data *data;
 	struct unix_cache *cache, *reuse;
@@ -1354,6 +1403,10 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 	data = (struct unix_private_data *) channel->private_data;
 	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
 
+#ifdef DEBUG
+	printf("write block %llu(%u) count %d\n", block, tag, count);
+#endif
+
 	mark_dirty(channel);
 
 #ifdef NO_IO_CACHE
@@ -1366,8 +1419,9 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 	 * flush out the cache completely and then do a direct write.
 	 */
 	if (count < 0 || count > WRITE_DIRECT_SIZE) {
-		if ((retval = flush_cached_blocks(channel, data,
-						  FLUSH_INVALIDATE)))
+		retval = flush_cached_blocks(channel, data, tag,
+					     FLUSH_INVALIDATE);
+		if (retval)
 			return retval;
 		return raw_write_blk(channel, data, block, count, buf, 0);
 	}
@@ -1385,11 +1439,17 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 	mutex_lock(data, CACHE_MTX);
 	while (count > 0) {
 		cache = find_cached_block(data, block, &reuse);
-		if (!cache) {
+		if (cache) {
+#ifdef DEBUG
+			printf("Writing to cached block %llu(%u)\n", block, tag);
+#endif
+			if (tag != IO_CHANNEL_TAG_NULL)
+				cache->tag = tag;
+		} else {
 			errcode_t err;
 
 			cache = reuse;
-			err = reuse_cache(channel, data, cache, block);
+			err = reuse_cache(channel, data, cache, tag, block);
 			if (err)
 				goto call_write_handler;
 		}
@@ -1409,7 +1469,7 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 		unsigned long long err_block = cache->block;
 
 		cache->dirty = 0;
-		cache->in_use = 0;
+		invalidate_cache(cache);
 		cache->write_err = 0;
 		if (io_channel_alloc_buf(channel, 0, &err_buf))
 			err_buf = NULL;
@@ -1427,6 +1487,13 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 #endif /* NO_IO_CACHE */
 }
 
+static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
+				  int count, const void *buf)
+{
+	return unix_write_tagblk(channel, IO_CHANNEL_TAG_NULL, block, count,
+				 buf);
+}
+
 static errcode_t unix_cache_readahead(io_channel channel,
 				      unsigned long long block,
 				      unsigned long long count)
@@ -1473,7 +1540,9 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 	/*
 	 * Flush out the cache completely
 	 */
-	if ((retval = flush_cached_blocks(channel, data, FLUSH_INVALIDATE)))
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
+				     FLUSH_INVALIDATE);
+	if (retval)
 		return retval;
 #endif
 
@@ -1491,28 +1560,60 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 	return 0;
 }
 
+/*
+ * Flush data buffers with the given tag to disk and invalidate them.
+ */
+static errcode_t unix_invalidate_tag(io_channel channel, io_channel_tag_t tag)
+{
+	struct unix_private_data *data;
+	errcode_t retval = 0, retval2;
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	data = (struct unix_private_data *) channel->private_data;
+	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+#ifndef NO_IO_CACHE
+	retval = flush_cached_blocks(channel, data, tag, FLUSH_INVALIDATE);
+#endif
+#ifdef HAVE_FSYNC
+	/* always fsync the device, even if flushing our own cache failed */
+	retval2 = maybe_fsync(channel);
+	if (retval2 && !retval)
+		retval = retval2;
+#endif
+	return retval;
+}
+
+/*
+ * Flush data buffers with the given tag to disk.
+ */
+static errcode_t unix_flush_tag(io_channel channel, io_channel_tag_t tag)
+{
+	struct unix_private_data *data;
+	errcode_t retval = 0, retval2;
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	data = (struct unix_private_data *) channel->private_data;
+	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+#ifndef NO_IO_CACHE
+	retval = flush_cached_blocks(channel, data, tag, 0);
+#endif
+#ifdef HAVE_FSYNC
+	/* always fsync the device, even if flushing our own cache failed */
+	retval2 = maybe_fsync(channel);
+	if (retval2 && !retval)
+		retval = retval2;
+#endif
+	return retval;
+}
+
 /*
  * Flush data buffers to disk.
  */
 static errcode_t unix_flush(io_channel channel)
 {
-	struct unix_private_data *data;
-	errcode_t retval = 0, retval2;
-
-	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
-	data = (struct unix_private_data *) channel->private_data;
-	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
-
-#ifndef NO_IO_CACHE
-	retval = flush_cached_blocks(channel, data, 0);
-#endif
-#ifdef HAVE_FSYNC
-	/* always fsync the device, even if flushing our own cache failed */
-	retval2 = maybe_fsync(channel);
-	if (retval2 && !retval)
-		retval = retval2;
-#endif
-	return retval;
+	return unix_flush_tag(channel, 0);
 }
 
 static errcode_t unix_set_option(io_channel channel, const char *option,
@@ -1547,7 +1648,8 @@ static errcode_t unix_set_option(io_channel channel, const char *option,
 			return 0;
 		}
 		if (!strcmp(arg, "off")) {
-			retval = flush_cached_blocks(channel, data, 0);
+			retval = flush_cached_blocks(channel, data,
+						     IO_CHANNEL_TAG_NULL, 0);
 			data->flags |= IO_FLAG_NOCACHE;
 			return retval;
 		}
@@ -1748,11 +1850,15 @@ static struct struct_io_manager struct_unix_manager = {
 	.read_blk	= unix_read_blk,
 	.write_blk	= unix_write_blk,
 	.flush		= unix_flush,
+	.flush_tag	= unix_flush_tag,
+	.invalidate_tag	= unix_invalidate_tag,
 	.write_byte	= unix_write_byte,
 	.set_option	= unix_set_option,
 	.get_stats	= unix_get_stats,
 	.read_blk64	= unix_read_blk64,
 	.write_blk64	= unix_write_blk64,
+	.read_tagblk	= unix_read_tagblk,
+	.write_tagblk	= unix_write_tagblk,
 	.discard	= unix_discard,
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,
@@ -1771,11 +1877,15 @@ static struct struct_io_manager struct_unixfd_manager = {
 	.read_blk	= unix_read_blk,
 	.write_blk	= unix_write_blk,
 	.flush		= unix_flush,
+	.flush_tag	= unix_flush_tag,
+	.invalidate_tag	= unix_invalidate_tag,
 	.write_byte	= unix_write_byte,
 	.set_option	= unix_set_option,
 	.get_stats	= unix_get_stats,
 	.read_blk64	= unix_read_blk64,
 	.write_blk64	= unix_write_blk64,
+	.read_tagblk	= unix_read_tagblk,
+	.write_tagblk	= unix_write_tagblk,
 	.discard	= unix_discard,
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,



* [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-05-22  0:09   ` [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager Darrick J. Wong
@ 2025-05-22  0:10   ` Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:10 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

There's no need to invalidate the entire cache when writing a range of
bytes to the device.  We still flush every dirty block first, but the
only cached blocks we need to invalidate are the ones that overlap the
byte range we're about to write directly.
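
The byte-range-to-block-range arithmetic is the usual round-down /
round-up pair.  A standalone sketch, with made-up names and a
half-open [first, end) result, assuming only that block_size is
nonzero:

```c
#include <assert.h>

/* Hypothetical helper type: blocks first .. end-1 overlap the byte range. */
struct sketch_blk_range {
	unsigned long long first;
	unsigned long long end;
};

static struct sketch_blk_range
sketch_bytes_to_blocks(unsigned long long offset, unsigned long long size,
		       unsigned int block_size)
{
	struct sketch_blk_range r;

	assert(block_size > 0);

	/* Round the start down to the block containing the first byte. */
	r.first = offset / block_size;
	/* Round the end up so a partially covered last block is included. */
	r.end = (offset + size + block_size - 1) / block_size;
	return r;
}
```

For 4096-byte blocks, a 200-byte write at offset 4000 straddles a block
boundary and yields first = 0, end = 2, so blocks 0 and 1 are both
invalidated; a 4096-byte write at offset 8192 yields first = 2, end = 3.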

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 8a8afe47ee4503..4c924ec9ee0760 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1523,6 +1523,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 {
 	struct unix_private_data *data;
 	errcode_t	retval = 0;
+	unsigned long long bno, nbno;
 	ssize_t		actual;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
@@ -1538,12 +1539,18 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 
 #ifndef NO_IO_CACHE
 	/*
-	 * Flush out the cache completely
+	 * Flush all the dirty blocks, then invalidate the blocks we're about
+	 * to write.
 	 */
-	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
-				     FLUSH_INVALIDATE);
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, 0);
 	if (retval)
 		return retval;
+
+	bno = offset / channel->block_size;
+	nbno = (offset + size + channel->block_size - 1) / channel->block_size;
+
+	for (; bno < nbno; bno++)
+		invalidate_cached_block(channel, data, bno);
 #endif
 
 	if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0)



* [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-05-22  0:10   ` [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
@ 2025-05-22  0:10   ` Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:10 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

If someone calls write_byte on an IO channel with an alignment
requirement and the range to be written is aligned correctly, go ahead
and do the write.  This will be needed later when we try to speed up
superblock writes.
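
The condition being added boils down to "both ends of the device byte
range sit on an alignment boundary".  A minimal standalone sketch of
that test follows.  The function name and parameters are invented for
illustration, and it uses a modulo rather than the IS_ALIGNED() macro
so that it does not depend on the alignment being a power of two:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true if the write [dev_offset + offset, dev_offset + offset + size)
 * starts and ends on a multiple of @align.  @dev_offset plays the role of
 * the channel's base offset into the device.
 */
static bool sketch_range_is_aligned(unsigned long long dev_offset,
				    unsigned long long offset,
				    unsigned long long size,
				    unsigned int align)
{
	unsigned long long start, end;

	assert(align > 0);

	start = dev_offset + offset;
	end = start + size;
	return (start % align) == 0 && (end % align) == 0;
}
```

With a 512-byte alignment, a 1024-byte write at offset 1024 passes,
while a 100-byte write at the same offset fails on its end boundary and
would still be rejected.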

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 4c924ec9ee0760..008a5b46ce7f1f 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1534,7 +1534,9 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 #ifdef ALIGN_DEBUG
 		printf("unix_write_byte: O_DIRECT fallback\n");
 #endif
-		return EXT2_ET_UNIMPLEMENTED;
+		if (!IS_ALIGNED(data->offset + offset, channel->align) ||
+		    !IS_ALIGNED(data->offset + offset + size, channel->align))
+			return EXT2_ET_UNIMPLEMENTED;
 	}
 
 #ifndef NO_IO_CACHE



* [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-05-22  0:10   ` [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
@ 2025-05-22  0:10   ` Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:10 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

write_primary_superblock currently does this weird dance where it will
try to write only the dirty bytes of the primary superblock to disk.  In
theory, this is done so that tune2fs can incrementally update superblock
bytes when the filesystem is mounted; ext2 was famous for allowing this
dance to set new fs parameters and have them take effect in real time.

The ability to do this safely was obliterated back in 2001 when ext3 was
introduced with journalling, because tune2fs has no way to know if the
journal has already logged an updated primary superblock but not yet
written it to disk, which means that they can race to write, and changes
can be lost.

This (non-)safety was further obliterated back in 2012 when I added
checksums to all the metadata blocks in ext4 because anyone else with
the block device open can see the primary superblock in an intermediate
state where the checksum does not match the superblock contents.

At this point in 2025 it's kind of stupid to still be doing this, and it
makes fuse2fs syncfs slow because we now perform a bunch of small writes
and introduce extra fsyncs.  It will become especially painful when
fuse2fs turns on iomap, at which point it will need to use directio to
access the disk, which then runs the Really Sad Path where we change the
blocksize and completely obliterate the cache contents.

So, add a new flag to ask for full superblock writes, which fuse2fs will
use later.
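
To make the cost difference concrete, here is a toy model (not the
libext2fs code) that counts device writes for the two strategies.
Everything with a _sketch suffix is invented for this example, and the
incremental path compares single bytes, which is a simplification of
what write_primary_superblock really does.

```c
#include <stddef.h>
#include <string.h>

#define SB_OFFSET_SKETCH		1024
#define SB_SIZE_SKETCH			1024
#define FLAG_WRITE_FULL_SUPER_SKETCH	0x1

/* Toy "disk" that counts how many write calls it has received. */
struct disk_sketch {
	unsigned char bytes[4096];
	int nr_writes;
};

int disk_write_sketch(struct disk_sketch *d, size_t off, size_t len,
		      const void *buf)
{
	if (off + len > sizeof(d->bytes))
		return -1;
	memcpy(d->bytes + off, buf, len);
	d->nr_writes++;
	return 0;
}

/* Issue one small write per run of bytes that differ from the original. */
int write_super_incremental_sketch(struct disk_sketch *d,
				   const unsigned char *orig,
				   const unsigned char *super)
{
	size_t i = 0;

	while (i < SB_SIZE_SKETCH) {
		size_t start;

		if (orig[i] == super[i]) {
			i++;
			continue;
		}
		start = i;
		while (i < SB_SIZE_SKETCH && orig[i] != super[i])
			i++;
		if (disk_write_sketch(d, SB_OFFSET_SKETCH + start, i - start,
				      super + start))
			return -1;
	}
	return 0;
}

/* Try one full-size write if asked; otherwise (or on failure) go piecemeal. */
int write_super_sketch(struct disk_sketch *d, unsigned int flags,
		       const unsigned char *orig, const unsigned char *super)
{
	if (flags & FLAG_WRITE_FULL_SUPER_SKETCH) {
		if (!disk_write_sketch(d, SB_OFFSET_SKETCH, SB_SIZE_SKETCH,
				       super))
			return 0;
	}
	return write_super_incremental_sketch(d, orig, super);
}
```

In this model a superblock with three scattered dirty bytes costs three
writes on the incremental path and a single write when the flag is set.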

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2fs.h  |    1 +
 lib/ext2fs/closefs.c |    7 +++++++
 2 files changed, 8 insertions(+)


diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 2661e10f57c047..22d56ad7554496 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -220,6 +220,7 @@ typedef struct ext2_file *ext2_file_t;
 #define EXT2_FLAG_IBITMAP_TAIL_PROBLEM	0x2000000
 #define EXT2_FLAG_THREADS		0x4000000
 #define EXT2_FLAG_IGNORE_SWAP_DIRENT	0x8000000
+#define EXT2_FLAG_WRITE_FULL_SUPER	0x10000000
 
 /*
  * Internal flags for use by the ext2fs library only
diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
index 8e5bec03a050de..9a67db76e7b326 100644
--- a/lib/ext2fs/closefs.c
+++ b/lib/ext2fs/closefs.c
@@ -196,6 +196,13 @@ static errcode_t write_primary_superblock(ext2_filsys fs,
 	int		check_idx, write_idx, size;
 	errcode_t	retval;
 
+	if (fs->flags & EXT2_FLAG_WRITE_FULL_SUPER) {
+		retval = io_channel_write_byte(fs->io, SUPERBLOCK_OFFSET,
+					       SUPERBLOCK_SIZE, super);
+		if (!retval)
+			return 0;
+	}
+
 	if (!fs->io->manager->write_byte || !fs->orig_super) {
 	fallback:
 		io_channel_set_blksize(fs->io, SUPERBLOCK_OFFSET);


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-05-22  0:10   ` [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
@ 2025-05-22  0:10   ` Darrick J. Wong
  9 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:10 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Add a flag to ext2_file_t to disallow read and write I/O to file data
blocks.  This is needed for fuse2fs iomap support, which will keep all
file data I/O inside the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2fs.h |    3 +++
 lib/ext2fs/fileio.c |   12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 22d56ad7554496..2c8e2cc2b55416 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -178,6 +178,9 @@ typedef struct ext2_struct_dblist *ext2_dblist;
 #define EXT2_FILE_WRITE		0x0001
 #define EXT2_FILE_CREATE	0x0002
 
+/* no file I/O to disk blocks, only to inline data */
+#define EXT2_FILE_NOBLOCKIO	0x0004
+
 #define EXT2_FILE_MASK		0x00FF
 
 #define EXT2_FILE_BUF_DIRTY	0x4000
diff --git a/lib/ext2fs/fileio.c b/lib/ext2fs/fileio.c
index 1b7e88d990036b..229ae6da7f448b 100644
--- a/lib/ext2fs/fileio.c
+++ b/lib/ext2fs/fileio.c
@@ -300,6 +300,11 @@ errcode_t ext2fs_file_read(ext2_file_t file, void *buf,
 	if (file->inode.i_flags & EXT4_INLINE_DATA_FL)
 		return ext2fs_file_read_inline_data(file, buf, wanted, got);
 
+	if (file->flags & EXT2_FILE_NOBLOCKIO) {
+		retval = EXT2_ET_OP_NOT_SUPPORTED;
+		goto fail;
+	}
+
 	while ((file->pos < EXT2_I_SIZE(&file->inode)) && (wanted > 0)) {
 		retval = sync_buffer_position(file);
 		if (retval)
@@ -416,6 +421,11 @@ errcode_t ext2fs_file_write(ext2_file_t file, const void *buf,
 		retval = 0;
 	}
 
+	if (file->flags & EXT2_FILE_NOBLOCKIO) {
+		retval = EXT2_ET_OP_NOT_SUPPORTED;
+		goto fail;
+	}
+
 	while (nbytes > 0) {
 		retval = sync_buffer_position(file);
 		if (retval)
@@ -584,7 +594,7 @@ static errcode_t ext2fs_file_zero_past_offset(ext2_file_t file,
 	int ret_flags;
 	errcode_t retval;
 
-	if (off == 0)
+	if (off == 0 || (file->flags & EXT2_FILE_NOBLOCKIO))
 		return 0;
 
 	retval = sync_buffer_position(file);


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-05-22  0:11   ` Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 02/16] fuse2fs: register block devices for use with iomap Darrick J. Wong
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:11 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Add enough of an iomap implementation that we can do FIEMAP and
SEEK_DATA and SEEK_HOLE.
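
For mapping reports, the server has to answer one question for any file
offset: is there a mapping covering it, or a hole, and how far does that
hole run?  The sketch below models that decision over a sorted array of
extents.  It is a simplified stand-in for the extent tree walk in this
patch, not a copy of it; all the type and function names are
hypothetical, and units are filesystem blocks rather than bytes.

```c
#include <stddef.h>
#include <stdint.h>

enum map_type_sketch {
	MAP_HOLE_SKETCH,
	MAP_MAPPED_SKETCH,
};

/* One extent: len blocks starting at logical block lblk, stored at pblk. */
struct extent_sketch {
	uint64_t lblk;
	uint64_t pblk;
	uint64_t len;
};

struct mapping_sketch {
	enum map_type_sketch type;
	uint64_t offset;	/* first logical block of the mapping */
	uint64_t length;	/* number of blocks */
	uint64_t addr;		/* first physical block; 0 for holes */
};

static void set_hole_sketch(struct mapping_sketch *out, uint64_t offset,
			    uint64_t length)
{
	out->type = MAP_HOLE_SKETCH;
	out->offset = offset;
	out->length = length;
	out->addr = 0;
}

/*
 * Report the mapping at startoff.  The extents must be sorted by lblk
 * and must not overlap.  count is the length of the caller's request
 * and is only used when there is no extent at or after startoff.
 */
void lookup_mapping_sketch(const struct extent_sketch *ex, size_t nr,
			   uint64_t startoff, uint64_t count,
			   struct mapping_sketch *out)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (startoff < ex[i].lblk) {
			/* Next extent starts to the right: hole up to it. */
			set_hole_sketch(out, startoff, ex[i].lblk - startoff);
			return;
		}
		if (startoff < ex[i].lblk + ex[i].len) {
			/* Extent overlaps startoff: report all of it. */
			out->type = MAP_MAPPED_SKETCH;
			out->offset = ex[i].lblk;
			out->length = ex[i].len;
			out->addr = ex[i].pblk;
			return;
		}
	}

	/* Nothing at or after startoff: the whole request is a hole. */
	set_hole_sketch(out, startoff, count);
}
```

Given extents at logical blocks 4-5 and 10-12, a lookup at block 0
yields a four-block hole, a lookup at block 5 yields the first extent,
and a lookup at block 6 yields a hole that ends where the second extent
begins.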

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure       |   47 ++++++
 configure.ac    |   32 ++++
 lib/config.h.in |    3 
 misc/fuse2fs.c  |  453 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 530 insertions(+), 5 deletions(-)


diff --git a/configure b/configure
index 1f7dbe24ee1ab1..c8b63dd448dca8 100755
--- a/configure
+++ b/configure
@@ -14545,6 +14545,53 @@ elif test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=29
 fi
+
+if test "$FUSE_LIB" = "-lfuse3"
+then
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for iomap_begin in libfuse" >&5
+printf %s "checking for iomap_begin in libfuse... " >&6; }
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS	64
+#define FUSE_USE_VERSION 318
+#include <fuse.h>
+
+int
+main (void)
+{
+
+struct fuse_operations fs_ops = {
+	.iomap_begin = NULL,
+	.iomap_end = NULL,
+};
+struct fuse_iomap narf = { };
+
+  ;
+  return 0;
+}
+
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+  have_fuse_iomap=yes
+   { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+else $as_nop
+  { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; }
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+    conftest$ac_exeext conftest.$ac_ext
+if test "$have_fuse_iomap" = yes; then
+  FUSE_USE_VERSION=318
+
+printf "%s\n" "#define HAVE_FUSE_IOMAP 1" >>confdefs.h
+
+fi
+fi
+
 if test -n "$FUSE_USE_VERSION"
 then
 
diff --git a/configure.ac b/configure.ac
index c7f193b4ed06bf..8b12ef3ee542e3 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1429,6 +1429,38 @@ elif test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=29
 fi
+
+if test "$FUSE_LIB" = "-lfuse3"
+then
+dnl
+dnl see if fuse3 supports iomap
+dnl
+AC_MSG_CHECKING(for iomap_begin in libfuse)
+AC_LINK_IFELSE(
+[	AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS	64
+#define FUSE_USE_VERSION 318
+#include <fuse.h>
+	]], [[
+struct fuse_operations fs_ops = {
+	.iomap_begin = NULL,
+	.iomap_end = NULL,
+};
+struct fuse_iomap narf = { };
+	]])
+], have_fuse_iomap=yes
+   AC_MSG_RESULT(yes),
+   AC_MSG_RESULT(no))
+if test "$have_fuse_iomap" = yes; then
+  FUSE_USE_VERSION=318
+  AC_DEFINE(HAVE_FUSE_IOMAP, 1, [Define to 1 if fuse supports iomap])
+fi
+fi
+
+dnl
+dnl set FUSE_USE_VERSION now that we've done all the feature tests
+dnl
 if test -n "$FUSE_USE_VERSION"
 then
 	AC_DEFINE_UNQUOTED(FUSE_USE_VERSION, $FUSE_USE_VERSION,
diff --git a/lib/config.h.in b/lib/config.h.in
index 6cd9751baab9d1..850c5fa573bcf0 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -73,6 +73,9 @@
 /* Define to 1 if PR_SET_IO_FLUSHER is present */
 #undef HAVE_PR_SET_IO_FLUSHER
 
+/* Define to 1 if fuse supports iomap */
+#undef HAVE_FUSE_IOMAP
+
 /* Define to 1 if you have the Mac OS X function
    CFLocaleCopyPreferredLanguages in the CoreFoundation framework. */
 #undef HAVE_CFLOCALECOPYPREFERREDLANGUAGES
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 769bb5babd2738..f9eed078d91152 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -79,6 +79,8 @@
 #define P_(singular, plural, n) ((n) == 1 ? (singular) : (plural))
 #endif
 
+#define min(x, y)	((x) < (y) ? (x) : (y))
+
 #define dbg_printf(fuse2fs, format, ...) \
 	while ((fuse2fs)->debug) { \
 		printf("FUSE2FS (%s): " format, (fuse2fs)->shortdev, ##__VA_ARGS__); \
@@ -144,6 +146,14 @@ struct fuse2fs_file_handle {
 	int open_flags;
 };
 
+#ifdef HAVE_FUSE_IOMAP
+enum fuse2fs_iomap_state {
+	IOMAP_DISABLED,
+	IOMAP_UNKNOWN,
+	IOMAP_ENABLED,
+};
+#endif
+
 /* Main program context */
 #define FUSE2FS_MAGIC		(0xEF53DEADUL)
 struct fuse2fs {
@@ -167,6 +177,9 @@ struct fuse2fs {
 	uint8_t writable;
 
 	int blocklog;
+#ifdef HAVE_FUSE_IOMAP
+	enum fuse2fs_iomap_state iomap_state;
+#endif
 	unsigned int blockmask;
 	int retcode;
 	unsigned long offset;
@@ -694,7 +707,7 @@ static errcode_t open_fs(struct fuse2fs *ff, int libext2_flags)
 {
 	char options[128];
 	int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW |
-		    libext2_flags;
+		    EXT2_FLAG_WRITE_FULL_SUPER | libext2_flags;
 	errcode_t err;
 
 	snprintf(options, sizeof(options) - 1, "offset=%lu", ff->offset);
@@ -945,6 +958,38 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
 }
 #endif
 
+#ifdef HAVE_FUSE_IOMAP
+static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff)
+{
+	int is_bdev;
+	errcode_t err;
+
+	switch (ff->iomap_state) {
+	case IOMAP_UNKNOWN:
+		ff->iomap_state = IOMAP_DISABLED;
+		/* fallthrough */;
+	case IOMAP_DISABLED:
+		return 0;
+	case IOMAP_ENABLED:
+		break;
+	}
+
+	err = fs_on_bdev(ff, &is_bdev);
+	if (err)
+		return err;
+
+	/* iomap only works with block devices */
+	if (!is_bdev) {
+		fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP);
+		ff->iomap_state = IOMAP_DISABLED;
+	}
+
+	return 0;
+}
+#else
+# define confirm_iomap(...)	(0)
+#endif
+
 static void *op_init(struct fuse_conn_info *conn
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 			, struct fuse_config *cfg EXT2FS_ATTR((unused))
@@ -972,6 +1017,12 @@ static void *op_init(struct fuse_conn_info *conn
 #ifdef FUSE_CAP_NO_EXPORT_SUPPORT
 	fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	if (ff->iomap_state != IOMAP_DISABLED &&
+	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
+		ff->iomap_state = IOMAP_ENABLED;
+#endif
+
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 	conn->time_gran = 1;
 	cfg->use_ino = 1;
@@ -989,6 +1040,10 @@ static void *op_init(struct fuse_conn_info *conn
 			goto mount_fail;
 		fs = ff->fs;
 
+		err = confirm_iomap(conn, ff);
+		if (err)
+			goto mount_fail;
+
 		if (ff->cache_size) {
 			err = config_fs_cache(ff);
 			if (err)
@@ -1014,6 +1069,10 @@ static void *op_init(struct fuse_conn_info *conn
 		err = mount_fs(ff);
 		if (err)
 			goto mount_fail;
+	} else {
+		err = confirm_iomap(conn, ff);
+		if (err)
+			goto mount_fail;
 	}
 
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
@@ -4575,6 +4634,384 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 # endif /* SUPPORT_FALLOCATE */
 #endif /* FUSE 29 */
 
+#ifdef HAVE_FUSE_IOMAP
+static void handle_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap,
+			      off_t pos, uint64_t count)
+{
+	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->addr = FUSE_IOMAP_NULL_ADDR;
+	iomap->offset = pos;
+	iomap->length = count;
+	iomap->type = FUSE_IOMAP_TYPE_HOLE;
+}
+
+#define DEBUG_IOMAP
+#ifdef DEBUG_IOMAP
+# define __DUMP_EXTENT(ff, func, tag, startoff, err, extent) \
+	do { \
+		dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", \
+			   (func), (tag), (startoff), (err), (extent)->e_lblk, \
+			   (extent)->e_pblk, (extent)->e_len, \
+			   (extent)->e_flags & EXT2_EXTENT_FLAGS_UNINIT); \
+	} while(0)
+# define DUMP_EXTENT(ff, tag, startoff, err, extent) \
+	__DUMP_EXTENT((ff), __func__, (tag), (startoff), (err), (extent))
+#else
+# define __DUMP_EXTENT(...)	((void)0)
+# define DUMP_EXTENT(...)	((void)0)
+#endif
+
+static inline errcode_t __get_mapping_at(struct fuse2fs *ff,
+					 ext2_extent_handle_t handle,
+					 blk64_t startoff,
+					 struct ext2fs_extent *bmap,
+					 const char *func)
+{
+	errcode_t err;
+
+	/*
+	 * Find the file mapping at startoff.  We don't check the return value
+	 * of _goto because _get will error out if _goto failed.  There's a
+	 * subtlety to the outcome of _goto when startoff falls in a sparse
+	 * hole however:
+	 *
+	 * Most of the time, _goto points the cursor at the mapping whose lblk
+	 * is just to the left of startoff.  The mapping may or may not overlap
+	 * startoff; this is ok.  In other words, the tree lookup behaves as if
+	 * we asked it to use a less than or equals comparison.
+	 *
+	 * However, if startoff is to the left of the first mapping in the
+	 * extent tree, _goto points the cursor at that first mapping because
+	 * it doesn't know how to deal with this situation.  In this case,
+	 * the tree lookup behaves as if we asked it to use a greater than
+	 * or equals comparison.
+	 *
+	 * Note: If _get() returns 'no current node', that means that there
+	 * aren't any mappings at all.
+	 */
+	ext2fs_extent_goto(handle, startoff);
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_CURRENT, bmap);
+	__DUMP_EXTENT(ff, func, "lookup", startoff, err, bmap);
+	if (err == EXT2_ET_NO_CURRENT_NODE)
+		err = EXT2_ET_EXTENT_NOT_FOUND;
+	return err;
+}
+
+static inline errcode_t __get_next_mapping(struct fuse2fs *ff,
+					   ext2_extent_handle_t handle,
+					   blk64_t startoff,
+					   struct ext2fs_extent *bmap,
+					   const char *func)
+{
+	struct ext2fs_extent newex, errex;
+	errcode_t err;
+
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT_LEAF, &newex);
+	DUMP_EXTENT(ff, "NEXT", startoff, err, &newex);
+	if (err == EXT2_ET_EXTENT_NO_NEXT)
+		return EXT2_ET_EXTENT_NOT_FOUND;
+	if (err)
+		return err;
+
+	/*
+	 * Try to get the next leaf mapping.  There's a weird and longstanding
+	 * "feature" of EXT2_EXTENT_NEXT_LEAF where walking off the end of the
+	 * mapping recordset causes it to wrap around to the beginning of the
+	 * extent map and we end up with a mapping to the left of the one that
+	 * was passed in.
+	 *
+	 * However, a corrupt extent tree could also have such a record.  The
+	 * only way to be sure is to retrieve the mapping for the extreme right
+	 * edge of the tree and compare it to the mapping that the caller gave
+	 * us.  If they match, then we've hit the end.  If not, something is
+	 * corrupt in the ondisk metadata.
+	 */
+	if (newex.e_lblk <= bmap->e_lblk + bmap->e_len) {
+		err = __get_mapping_at(ff, handle, ~0U, &errex, func);
+		if (err)
+			return err;
+
+		if (memcmp(bmap, &errex, sizeof(errex)) != 0)
+			return EXT2_ET_INODE_CORRUPTED;
+
+		return EXT2_ET_EXTENT_NOT_FOUND;
+	}
+
+	*bmap = newex;
+	return 0;
+}
+
+#define get_mapping_at(ff, handle, startoff, bmap) \
+	__get_mapping_at((ff), (handle), (startoff), (bmap), __func__)
+#define get_next_mapping(ff, handle, startoff, bmap) \
+	__get_next_mapping((ff), (handle), (startoff), (bmap), __func__)
+
+static errcode_t extent_iomap_begin(struct fuse2fs *ff, uint64_t ino,
+				    struct ext2_inode_large *inode,
+				    off_t pos, uint64_t count,
+				    uint32_t opflags, struct fuse_iomap *iomap)
+{
+	ext2_extent_handle_t handle;
+	struct ext2fs_extent extent;
+	ext2_filsys fs = ff->fs;
+	const blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	errcode_t err;
+	int ret = 0;
+
+	err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = get_mapping_at(ff, handle, startoff, &extent);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+		/* No mappings at all; the whole range is a hole. */
+		handle_iomap_hole(ff, iomap, pos, count);
+		goto out_handle;
+	}
+	if (err) {
+		ret = translate_error(fs, ino, err);
+		goto out_handle;
+	}
+
+	if (startoff < extent.e_lblk) {
+		/*
+		 * Mapping starts to the right of the current position.
+		 * Synthesize a hole going to that next extent.
+		 */
+		handle_iomap_hole(ff, iomap, FUSE2FS_FSB_TO_B(ff, startoff),
+				FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+		goto out_handle;
+	}
+
+	if (startoff >= extent.e_lblk + extent.e_len) {
+		/*
+		 * Mapping ends to the left of the current position.  Try to
+		 * find the next mapping.  If there is no next mapping, the
+		 * whole range is in a hole.
+		 */
+		err = get_next_mapping(ff, handle, startoff, &extent);
+		if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+			handle_iomap_hole(ff, iomap, pos, count);
+			goto out_handle;
+		}
+
+		/*
+		 * If the new mapping starts to the right of startoff, there's
+		 * a hole from startoff to the start of the new mapping.
+		 */
+		if (startoff < extent.e_lblk) {
+			handle_iomap_hole(ff, iomap,
+				FUSE2FS_FSB_TO_B(ff, startoff),
+				FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+			goto out_handle;
+		}
+
+		/*
+		 * The new mapping starts at startoff.  Something weird
+		 * happened in the extent tree lookup, but we found a valid
+		 * mapping so we'll run with it.
+		 */
+	}
+
+	/* Mapping overlaps startoff, report this. */
+	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
+	iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
+	iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
+	if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)
+		iomap->type = FUSE_IOMAP_TYPE_UNWRITTEN;
+	else
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+
+out_handle:
+	ext2fs_extent_free(handle);
+	return ret;
+}
+
+static int indirect_iomap_begin(struct fuse2fs *ff, uint64_t ino,
+				struct ext2_inode_large *inode, off_t pos,
+				uint64_t count, uint32_t opflags,
+				struct fuse_iomap *iomap)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	uint64_t real_count = min(count, 131072);
+	const blk64_t endoff = FUSE2FS_B_TO_FSB(ff, pos + real_count);
+	blk64_t startblock;
+	errcode_t err;
+
+	err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, startoff, NULL,
+			   &startblock);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->offset = pos;
+	iomap->flags |= FUSE_IOMAP_F_MERGED;
+	if (startblock) {
+		iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock);
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+	} else {
+		iomap->addr = FUSE_IOMAP_NULL_ADDR;
+		iomap->type = FUSE_IOMAP_TYPE_HOLE;
+	}
+	iomap->length = fs->blocksize;
+
+	/* See how long the mapping goes for. */
+	for (startoff++; startoff < endoff; startoff++) {
+		blk64_t prev_startblock = startblock;
+
+		err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0,
+				   startoff, NULL, &startblock);
+		if (err)
+			break;
+
+		if (iomap->type == FUSE_IOMAP_TYPE_MAPPED) {
+			if (startblock == prev_startblock + 1)
+				iomap->length += fs->blocksize;
+			else
+				break;
+		} else {
+			if (startblock != 0)
+				break;
+		}
+	}
+
+	return 0;
+}
+
+static int inline_iomap_begin(struct fuse2fs *ff, off_t pos, uint64_t count,
+			      struct fuse_iomap *iomap)
+{
+	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->addr = FUSE_IOMAP_NULL_ADDR;
+	iomap->offset = pos;
+	iomap->length = count;
+	iomap->type = FUSE_IOMAP_TYPE_INLINE;
+
+	return 0;
+}
+
+static int fuse_iomap_begin_report(struct fuse2fs *ff, ext2_ino_t ino,
+				   struct ext2_inode_large *inode,
+				   off_t pos, uint64_t count, uint32_t opflags,
+				   struct fuse_iomap *read_iomap)
+{
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return inline_iomap_begin(ff, pos, count, read_iomap);
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return extent_iomap_begin(ff, ino, inode, pos, count, opflags,
+					 read_iomap);
+
+	return indirect_iomap_begin(ff, ino, inode, pos, count, opflags,
+				    read_iomap);
+}
+
+static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
+				 struct ext2_inode_large *inode, off_t pos,
+				 uint64_t count, uint32_t opflags,
+				 struct fuse_iomap *read_iomap)
+{
+	return -ENOSYS;
+}
+
+static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
+				  struct ext2_inode_large *inode, off_t pos,
+				  uint64_t count, uint32_t opflags,
+				  struct fuse_iomap *read_iomap)
+{
+	return -ENOSYS;
+}
+
+static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos, uint64_t count, uint32_t opflags,
+			  struct fuse_iomap *read_iomap,
+			  struct fuse_iomap *write_iomap)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct ext2_inode_large inode;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+	fs = ff->fs;
+
+	pthread_mutex_lock(&ff->bfl);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags);
+
+	err = fuse2fs_read_inode(fs, attr_ino, &inode);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	if (opflags & FUSE_IOMAP_OP_REPORT)
+		ret = fuse_iomap_begin_report(ff, attr_ino, &inode, pos, count,
+					      opflags, read_iomap);
+	else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO))
+		ret = fuse_iomap_begin_write(ff, attr_ino, &inode, pos, count,
+					     opflags, read_iomap);
+	else
+		ret = fuse_iomap_begin_read(ff, attr_ino, &inode, pos, count,
+					    opflags, read_iomap);
+	if (ret)
+		goto out_unlock;
+
+	dbg_printf(ff, "%s: nodeid=%llu attr_ino=%llu pos=0x%llx -> addr=0x%llx offset=0x%llx length=0x%llx type=%u\n",
+		   __func__,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)read_iomap->addr,
+		   (unsigned long long)read_iomap->offset,
+		   (unsigned long long)read_iomap->length,
+		   read_iomap->type);
+
+out_unlock:
+	if (ret < 0)
+		dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret);
+	pthread_mutex_unlock(&ff->bfl);
+	return ret;
+}
+
+static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			off_t pos, uint64_t count, uint32_t opflags,
+			ssize_t written, const struct fuse_iomap *iomap)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	pthread_mutex_lock(&ff->bfl);
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x written=0x%zx mapflags 0x%x\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags,
+		   written,
+		   iomap->flags);
+	pthread_mutex_unlock(&ff->bfl);
+
+	return 0;
+}
+#endif /* HAVE_FUSE_IOMAP */
+
 static struct fuse_operations fs_ops = {
 	.init = op_init,
 	.destroy = op_destroy,
@@ -4635,6 +5072,10 @@ static struct fuse_operations fs_ops = {
 	.fallocate = op_fallocate,
 # endif
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	.iomap_begin = op_iomap_begin,
+	.iomap_end = op_iomap_end,
+#endif /* HAVE_FUSE_IOMAP */
 };
 
 static int get_random_bytes(void *p, size_t sz)
@@ -4840,7 +5281,12 @@ static void fuse2fs_com_err_proc(const char *whoami, errcode_t code,
 int main(int argc, char *argv[])
 {
 	struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
-	struct fuse2fs fctx;
+	struct fuse2fs fctx = {
+		.magic = FUSE2FS_MAGIC,
+#ifdef HAVE_FUSE_IOMAP
+		.iomap_state = IOMAP_UNKNOWN,
+#endif
+	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;
 	char *logfile;
@@ -4849,9 +5295,6 @@ int main(int argc, char *argv[])
 	int is_bdev;
 	int ret = 0;
 
-	memset(&fctx, 0, sizeof(fctx));
-	fctx.magic = FUSE2FS_MAGIC;
-
 	fuse_opt_parse(&args, &fctx, fuse2fs_opts, fuse2fs_opt_proc);
 	if (fctx.device == NULL) {
 		fprintf(stderr, "Missing ext4 device/image\n");


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 02/16] fuse2fs: register block devices for use with iomap
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
@ 2025-05-22  0:11   ` Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
                     ` (13 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:11 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Register the ext4 block device with the kernel for use with iomap.  For
now this is redundant with using fuseblk mode because the kernel
automatically registers any fuseblk devices, but eventually we'll go
back to regular fuse mode and we'll have to pin the bdev ourselves.
In theory this interface supports strange beasts where the metadata can
exist somewhere else entirely (or be made up by AI) while the file data
persists to real disks.
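
The patch registers the device lazily, on the first iomap_begin call,
and caches the resulting device cookie.  A minimal model of that
"register once, then reuse" pattern is sketched below.  The registration
function is a fake that stands in for the real libfuse call, and the
sentinel value chosen for DEV_NULL_SKETCH is an assumption made for this
example only.

```c
#include <stdint.h>

/* Assumed sentinel meaning "no device registered yet". */
#define DEV_NULL_SKETCH	UINT32_MAX

struct server_sketch {
	uint32_t iomap_dev;
	int nr_registrations;
};

/*
 * Fake registration call: pretends the kernel accepted the fd and
 * handed back device cookie 1.
 */
int register_device_sketch(struct server_sketch *s, int fd, uint32_t *dev)
{
	(void)fd;
	s->nr_registrations++;
	*dev = 1;
	return 0;
}

/* Register the device on first use; later calls reuse the cookie. */
int ensure_iomap_dev_sketch(struct server_sketch *s, int fd)
{
	int ret;

	if (s->iomap_dev != DEV_NULL_SKETCH)
		return 0;

	ret = register_device_sketch(s, fd, &s->iomap_dev);
	if (ret)
		return ret;
	if (s->iomap_dev == DEV_NULL_SKETCH)
		return -1;	/* registration claimed success but gave no cookie */
	return 0;
}
```

Calling ensure_iomap_dev_sketch() twice on the same server performs only
one registration.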

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   44 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 40 insertions(+), 4 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f9eed078d91152..92a80753f4f1e8 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -36,6 +36,7 @@
 # define _FILE_OFFSET_BITS 64
 #endif /* _FILE_OFFSET_BITS */
 #include <fuse.h>
+#include <fuse_lowlevel.h>
 #ifdef __SET_FOB_FOR_FUSE
 # undef _FILE_OFFSET_BITS
 #endif /* __SET_FOB_FOR_FUSE */
@@ -179,6 +180,7 @@ struct fuse2fs {
 	int blocklog;
 #ifdef HAVE_FUSE_IOMAP
 	enum fuse2fs_iomap_state iomap_state;
+	uint32_t iomap_dev;
 #endif
 	unsigned int blockmask;
 	int retcode;
@@ -4638,7 +4640,7 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 static void handle_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap,
 			      off_t pos, uint64_t count)
 {
-	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->dev = ff->iomap_dev;
 	iomap->addr = FUSE_IOMAP_NULL_ADDR;
 	iomap->offset = pos;
 	iomap->length = count;
@@ -4815,7 +4817,7 @@ static errcode_t extent_iomap_begin(struct fuse2fs *ff, uint64_t ino,
 	}
 
 	/* Mapping overlaps startoff, report this. */
-	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->dev = ff->iomap_dev;
 	iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
 	iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
 	iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
@@ -4846,7 +4848,7 @@ static int indirect_iomap_begin(struct fuse2fs *ff, uint64_t ino,
 	if (err)
 		return translate_error(fs, ino, err);
 
-	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->dev = ff->iomap_dev;
 	iomap->offset = pos;
 	iomap->flags |= FUSE_IOMAP_F_MERGED;
 	if (startblock) {
@@ -4884,7 +4886,7 @@ static int indirect_iomap_begin(struct fuse2fs *ff, uint64_t ino,
 static int inline_iomap_begin(struct fuse2fs *ff, off_t pos, uint64_t count,
 			      struct fuse_iomap *iomap)
 {
-	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->dev = ff->iomap_dev;
 	iomap->addr = FUSE_IOMAP_NULL_ADDR;
 	iomap->offset = pos;
 	iomap->length = count;
@@ -4925,6 +4927,31 @@ static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	return -ENOSYS;
 }
 
+static errcode_t config_iomap_devices(struct fuse_context *ctxt,
+				      struct fuse2fs *ff)
+{
+	struct fuse_session *se = fuse_get_session(ctxt->fuse);
+	errcode_t err;
+	int fd;
+	int ret;
+
+	err = io_channel_fd(ff->fs->io, &fd);
+	if (err)
+		return err;
+
+	ret = fuse_lowlevel_notify_iomap_add_device(se, fd, &ff->iomap_dev);
+
+	dbg_printf(ff, "%s: registering iomap dev fd=%d ret=%d iomap_dev=%u\n",
+		   __func__, fd, ret, ff->iomap_dev);
+
+	if (ret)
+		return ret;
+	if (ff->iomap_dev == FUSE_IOMAP_DEV_NULL)
+		return -EIO;
+
+	return 0;
+}
+
 static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 			  off_t pos, uint64_t count, uint32_t opflags,
 			  struct fuse_iomap *read_iomap,
@@ -4951,6 +4978,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   (unsigned long long)count,
 		   opflags);
 
+	if (ff->iomap_dev == FUSE_IOMAP_DEV_NULL) {
+		err = config_iomap_devices(ctxt, ff);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
 	err = fuse2fs_read_inode(fs, attr_ino, &inode);
 	if (err) {
 		ret = translate_error(fs, attr_ino, err);
@@ -5285,6 +5320,7 @@ int main(int argc, char *argv[])
 		.magic = FUSE2FS_MAGIC,
 #ifdef HAVE_FUSE_IOMAP
 		.iomap_state = IOMAP_UNKNOWN,
+		.iomap_dev = FUSE_IOMAP_DEV_NULL,
 #endif
 	};
 	errcode_t err;


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 02/16] fuse2fs: register block devices for use with iomap Darrick J. Wong
@ 2025-05-22  0:11   ` Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 04/16] fuse2fs: implement directio file reads Darrick J. Wong
                     ` (12 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:11 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the kernel writes file data directly to the block device
and does not flush the bdev page cache.  We must open the filesystem in
directio mode to avoid cache coherency issues when reading file data
blocks.  If we can't open the bdev in directio mode, we must not use
iomap.
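
The fallback described above can be modelled as a small state machine:
force directio when iomap is enabled, and if the open then fails and the
user did not ask for directio themselves, turn iomap off and retry
without it.  The sketch below is a simplified model with hypothetical
names; the opener callbacks stand in for the real open path and only
simulate success or failure.

```c
enum iomap_state_sketch {
	STATE_DISABLED_SKETCH,
	STATE_UNKNOWN_SKETCH,
	STATE_ENABLED_SKETCH,
};

struct mount_ctx_sketch {
	enum iomap_state_sketch state;
	int directio;
};

/* Opener callback: returns 0 on success, nonzero on failure. */
typedef int (*open_fn_sketch)(int directio);

/* Simulates a device that cannot be opened in directio mode. */
int open_rejects_directio_sketch(int directio)
{
	return directio ? -1 : 0;
}

/* Simulates a device that can be opened either way. */
int open_accepts_all_sketch(int directio)
{
	(void)directio;
	return 0;
}

int open_with_fallback_sketch(struct mount_ctx_sketch *c,
			      open_fn_sketch do_open)
{
	int was_directio = c->directio;
	int err;

	/* iomap needs directio to avoid stale bdev page cache reads. */
	if (c->state == STATE_ENABLED_SKETCH)
		c->directio = 1;

	err = do_open(c->directio);
	if (err && c->state == STATE_ENABLED_SKETCH && !was_directio) {
		/* We forced directio and it failed: give up on iomap. */
		c->state = STATE_DISABLED_SKETCH;
		c->directio = 0;
		err = do_open(c->directio);
	}
	return err;
}
```

If the user requested directio explicitly and the open fails, the model
returns the error without retrying, since buffered mode was never an
acceptable outcome for that user.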

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 92a80753f4f1e8..91c0da096bef9c 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -988,8 +988,14 @@ static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff)
 
 	return 0;
 }
+
+static int iomap_enabled(const struct fuse2fs *ff)
+{
+	return ff->iomap_state == IOMAP_ENABLED;
+}
 #else
 # define confirm_iomap(...)	(0)
+# define iomap_enabled(...)	(0)
 #endif
 
 static void *op_init(struct fuse_conn_info *conn
@@ -1001,6 +1007,9 @@ static void *op_init(struct fuse_conn_info *conn
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	ext2_filsys fs = ff->fs;
+#ifdef HAVE_FUSE_IOMAP
+	int was_directio = ff->directio;
+#endif
 	errcode_t err;
 	int ret;
 
@@ -1023,6 +1032,15 @@ static void *op_init(struct fuse_conn_info *conn
 	if (ff->iomap_state != IOMAP_DISABLED &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;
+	/*
+	 * In iomap mode, the kernel writes file data directly to the block
+	 * device and does not flush the bdev page cache.  We must open the
+	 * filesystem in directio mode to avoid cache coherency issues when
+	 * reading file data.  If we can't open the bdev in directio mode, we
+	 * must not use iomap.
+	 */
+	if (iomap_enabled(ff))
+		ff->directio = 1;
 #endif
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
@@ -1038,6 +1056,14 @@ static void *op_init(struct fuse_conn_info *conn
 	 */
 	if (!fs) {
 		err = open_fs(ff, 0);
+#ifdef HAVE_FUSE_IOMAP
+		if (err && iomap_enabled(ff) && !was_directio) {
+			fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP);
+			ff->iomap_state = IOMAP_DISABLED;
+			ff->directio = 0;
+			err = open_fs(ff, 0);
+		}
+#endif
 		if (err)
 			goto mount_fail;
 		fs = ff->fs;



* [PATCH 04/16] fuse2fs: implement directio file reads
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-05-22  0:11   ` [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
@ 2025-05-22  0:11   ` Darrick J. Wong
  2025-05-22  0:12   ` [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
                     ` (11 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:11 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Implement file reads via iomap.  Currently only directio is supported.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 91c0da096bef9c..b1f3002ec8c481 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1103,6 +1103,11 @@ static void *op_init(struct fuse_conn_info *conn
 			goto mount_fail;
 	}
 
+#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_DIRECTIO)
+	if (iomap_enabled(ff))
+		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO);
+#endif
+
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
 	if (ff->writable) {
 		fs->super->s_mnt_count++;
@@ -4942,7 +4947,26 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 				 uint64_t count, uint32_t opflags,
 				 struct fuse_iomap *read_iomap)
 {
-	return -ENOSYS;
+	errcode_t err;
+
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	/* fall back to slow path for inline data reads */
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return -ENOSYS;
+
+	/* flush dirty io_channel buffers to disk before iomap reads them */
+	err = io_channel_flush(ff->fs->io);
+	if (err)
+		return translate_error(ff->fs, ino, err);
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return extent_iomap_begin(ff, ino, inode, pos, count, opflags,
+					 read_iomap);
+
+	return indirect_iomap_begin(ff, ino, inode, pos, count, opflags,
+				    read_iomap);
 }
 
 static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,



* [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-05-22  0:11   ` [PATCH 04/16] fuse2fs: implement directio file reads Darrick J. Wong
@ 2025-05-22  0:12   ` Darrick J. Wong
  2025-05-22  0:12   ` [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
                     ` (10 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:12 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Change the punch hole helpers to use the tagged block IO commands now
that libext2fs uses tagged block IO commands for file IO.  We'll need
this in the next patch when we turn on selective IO manager cache
clearing and invalidation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index b1f3002ec8c481..c0f868e8f01ed4 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4510,13 +4510,13 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	if (!blk || (retflags & BMAP_RET_UNINIT))
 		return 0;
 
-	err = io_channel_read_blk(fs->io, blk, 1, *buf);
+	err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf);
 	if (err)
 		return err;
 
 	memset(*buf + residue, 0, len);
 
-	return io_channel_write_blk(fs->io, blk, 1, *buf);
+	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
 
 static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
@@ -4544,7 +4544,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return err;
 
-	err = io_channel_read_blk(fs->io, blk, 1, *buf);
+	err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf);
 	if (err)
 		return err;
 	if (!blk || (retflags & BMAP_RET_UNINIT))
@@ -4555,7 +4555,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	else
 		memset(*buf + residue, 0, fs->blocksize - residue);
 
-	return io_channel_write_blk(fs->io, blk, 1, *buf);
+	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
 
 static int punch_helper(struct fuse_file_info *fp, int mode, off_t offset,



* [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-05-22  0:12   ` [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
@ 2025-05-22  0:12   ` Darrick J. Wong
  2025-05-22  0:12   ` [PATCH 07/16] fuse2fs: add extent dump function for debugging Darrick J. Wong
                     ` (9 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:12 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

We only need to flush the io_channel's cache for the file that's being
read directly, not everything else.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
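To show why a per-file flush is cheaper than a global one, here is a toy
model of a tagged block cache.  It is only a sketch under assumptions:
struct cache_entry and flush_tag() are made-up names, and this is not
how the libext2fs IO manager is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry; the tag is the owning inode number. */
struct cache_entry {
	uint64_t block;
	uint32_t tag;
	bool dirty;
};

/*
 * Write back only the dirty entries that carry the given tag and return
 * how many were flushed.  Entries owned by other files stay dirty.
 */
static size_t flush_tag(struct cache_entry *cache, size_t n, uint32_t tag)
{
	size_t flushed = 0;
	size_t i;

	assert(cache != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (cache[i].tag != tag || !cache[i].dirty)
			continue;
		/* A real cache would write the block to disk here. */
		cache[i].dirty = false;
		flushed++;
	}
	return flushed;
}
```

With two files sharing the cache, flushing one tag leaves the other
file's dirty blocks alone, so a directio read of one file does not pay
for writeback of unrelated data.
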
 misc/fuse2fs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index c0f868e8f01ed4..3ec99310b0f112 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4957,7 +4957,7 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 		return -ENOSYS;
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
-	err = io_channel_flush(ff->fs->io);
+	err = io_channel_flush_tag(ff->fs->io, ino);
 	if (err)
 		return translate_error(ff->fs, ino, err);
 



* [PATCH 07/16] fuse2fs: add extent dump function for debugging
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-05-22  0:12   ` [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
@ 2025-05-22  0:12   ` Darrick J. Wong
  2025-05-22  0:12   ` [PATCH 08/16] fuse2fs: implement direct write support Darrick J. Wong
                     ` (8 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:12 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Add a function to dump an inode's extent map for debugging purposes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 3ec99310b0f112..7e9095766c6624 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -377,6 +377,74 @@ static inline errcode_t fuse2fs_write_inode(ext2_filsys fs, ext2_ino_t ino,
 				       sizeof(*inode));
 }
 
+static inline void dump_ino_extents(struct fuse2fs *ff, ext2_ino_t ino,
+				    struct ext2_inode_large *inode,
+				    const char *why)
+{
+	ext2_filsys fs = ff->fs;
+	unsigned int nr = 0;
+	blk64_t blockcount = 0;
+	struct ext2_inode_large xinode;
+	struct ext2fs_extent extent;
+	ext2_extent_handle_t extents;
+	int op = EXT2_EXTENT_ROOT;
+	errcode_t retval;
+
+	if (!inode) {
+		inode = &xinode;
+
+		retval = fuse2fs_read_inode(fs, ino, inode);
+		if (retval) {
+			com_err(__func__, retval, _("reading ino %u"), ino);
+			return;
+		}
+	}
+
+	if (!(inode->i_flags & EXT4_EXTENTS_FL))
+		return;
+
+	printf("%s: %s ino %u isize %llu iblocks %llu\n", __func__, why, ino,
+	       EXT2_I_SIZE(inode),
+	       (ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)) * 512) /
+	        fs->blocksize);
+	fflush(stdout);
+
+	retval = ext2fs_extent_open(fs, ino, &extents);
+	if (retval) {
+		com_err(__func__, retval, _("opening extents of ino \"%u\""),
+			ino);
+		return;
+	}
+
+	while ((retval = ext2fs_extent_get(extents, op, &extent)) == 0) {
+		op = EXT2_EXTENT_NEXT;
+
+		if (extent.e_flags & EXT2_EXTENT_FLAGS_SECOND_VISIT)
+			continue;
+
+		printf("[%u]: %s lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n",
+		       nr++, why, extent.e_lblk, extent.e_pblk, extent.e_len,
+		       extent.e_flags);
+		fflush(stdout);
+		if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF)
+			blockcount += extent.e_len;
+		else
+			blockcount++;
+	}
+	if (retval == EXT2_ET_EXTENT_NO_NEXT)
+		retval = 0;
+	if (retval) {
+		com_err(__func__, retval, _("getting extents of ino %u"),
+			ino);
+	}
+	if (inode->i_file_acl)
+		blockcount++;
+	printf("%s: %s sum(e_len) %llu\n", __func__, why, blockcount);
+	fflush(stdout);
+
+	ext2fs_extent_free(extents);
+}
+
 static void get_now(struct timespec *now)
 {
 #ifdef CLOCK_REALTIME



* [PATCH 08/16] fuse2fs: implement direct write support
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-05-22  0:12   ` [PATCH 07/16] fuse2fs: add extent dump function for debugging Darrick J. Wong
@ 2025-05-22  0:12   ` Darrick J. Wong
  2025-05-22  0:13   ` [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
                     ` (7 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:12 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Wire up an iomap_begin method that can allocate into holes so that we
can do directio writes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
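The size limit that max_file_size() enforces is easier to follow with
numbers plugged in.  The sketch below repeats the same arithmetic
outside of fuse2fs; max_map_block() and max_file_bytes() are
illustrative names, and the blocksize parameter stands in for
fs->blocksize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Highest mappable logical block.  Extent files have 32-bit logical
 * block numbers.  Block-mapped files have 12 direct blocks plus single,
 * double and triple indirect trees of 4-byte block pointers.
 */
static uint64_t max_map_block(uint32_t blocksize, bool extents)
{
	uint64_t addr_per_block;

	assert(blocksize >= 1024);

	if (extents)
		return (1ULL << 32) - 1;

	addr_per_block = blocksize >> 2;
	return 12 + addr_per_block +
	       addr_per_block * addr_per_block +
	       addr_per_block * addr_per_block * addr_per_block;
}

/* Same byte conversion as the patch: block offset plus the last block. */
static uint64_t max_file_bytes(uint32_t blocksize, bool extents)
{
	return max_map_block(blocksize, extents) * blocksize +
	       (blocksize - 1);
}
```

For 4096-byte blocks a block-mapped file tops out at
12 + 1024 + 1024^2 + 1024^3 = 1074791436 blocks, while an extent file
is limited by the 32-bit logical block number instead.
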
 misc/fuse2fs.c |  481 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 478 insertions(+), 3 deletions(-)
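
The unwritten extent conversion done at ioend time boils down to
splitting one unwritten mapping into up to three pieces.  The following
is a simplified sketch of that split, assuming a flat struct mapping
instead of the libext2fs extent handle; convert_range() is a
hypothetical helper and does not touch an extent tree or attempt the
merge step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical record; not libext2fs's struct ext2fs_extent. */
struct mapping {
	uint64_t lblk;		/* first logical block */
	uint64_t pblk;		/* first physical block */
	uint32_t len;		/* length in blocks */
	bool unwritten;
};

/*
 * Mark the part of an unwritten mapping that overlaps [start, stop) as
 * written.  Emits one to three mappings into out[] and returns the count.
 */
static size_t convert_range(const struct mapping *m, uint64_t start,
			    uint64_t stop, struct mapping out[3])
{
	uint64_t end = m->lblk + m->len;
	size_t n = 0;

	assert(start < stop);

	/* Written mappings and mappings outside the range stay as-is. */
	if (!m->unwritten || end <= start || m->lblk >= stop) {
		out[n++] = *m;
		return n;
	}

	/* Clamp the converted range to the mapping. */
	if (start < m->lblk)
		start = m->lblk;
	if (stop > end)
		stop = end;

	/* Left remainder stays unwritten. */
	if (start > m->lblk) {
		out[n] = *m;
		out[n].len = (uint32_t)(start - m->lblk);
		n++;
	}

	/* Middle piece becomes written. */
	out[n].lblk = start;
	out[n].pblk = m->pblk + (start - m->lblk);
	out[n].len = (uint32_t)(stop - start);
	out[n].unwritten = false;
	n++;

	/* Right remainder stays unwritten. */
	if (stop < end) {
		out[n].lblk = stop;
		out[n].pblk = m->pblk + (stop - m->lblk);
		out[n].len = (uint32_t)(end - stop);
		out[n].unwritten = true;
		n++;
	}
	return n;
}
```

The patch itself walks the range one mapping at a time with a cursor and
edits the extent tree in place, so the sketch only covers the per-mapping
arithmetic.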


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 7e9095766c6624..ec17f6203b4b70 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5037,12 +5037,99 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 				    read_iomap);
 }
 
+static int fuse_iomap_write_allocate(struct fuse2fs *ff, ext2_ino_t ino,
+				     struct ext2_inode_large *inode, off_t pos,
+				     uint64_t count, uint32_t opflags, struct
+				     fuse_iomap *read_iomap, bool *dirty)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + count);
+	errcode_t err;
+	int ret;
+
+	dbg_printf(ff, "%s: write_alloc ino=%u startoff 0x%llx blockcount 0x%llx\n",
+		   __func__, ino, startoff, stopoff - startoff);
+
+	if (!fs_can_allocate(ff, stopoff - startoff))
+		return -ENOSPC;
+
+	err = ext2fs_fallocate(fs, EXT2_FALLOCATE_FORCE_UNINIT, ino,
+			       EXT2_INODE(inode), 0, startoff,
+			       stopoff - startoff);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* pick up the newly allocated mapping */
+	ret = fuse_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+				     read_iomap);
+	if (ret)
+		return ret;
+
+	read_iomap->flags |= FUSE_IOMAP_F_DIRTY;
+	*dirty = true;
+	return 0;
+}
+
+static off_t max_file_size(const struct fuse2fs *ff,
+			   const struct ext2_inode_large *inode)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t addr_per_block, max_map_block;
+
+	if (inode->i_flags & EXT4_EXTENTS_FL) {
+		max_map_block = (1ULL << 32) - 1;
+	} else {
+		addr_per_block = fs->blocksize >> 2;
+		max_map_block = addr_per_block;
+		max_map_block += addr_per_block * addr_per_block;
+		max_map_block += addr_per_block * addr_per_block * addr_per_block;
+		max_map_block += 12;
+	}
+
+	return FUSE2FS_FSB_TO_B(ff, max_map_block) + (fs->blocksize - 1);
+}
+
 static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 				  struct ext2_inode_large *inode, off_t pos,
 				  uint64_t count, uint32_t opflags,
-				  struct fuse_iomap *read_iomap)
+				  struct fuse_iomap *read_iomap, bool *dirty)
 {
-	return -ENOSYS;
+	off_t max_size = max_file_size(ff, inode);
+	errcode_t err;
+	int ret;
+
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	if (pos >= max_size)
+		return -EFBIG;
+
+	if (pos >= max_size - count)
+		count = max_size - pos;
+
+	ret = fuse_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+				    read_iomap);
+	if (ret)
+		return ret;
+
+	if (read_iomap->type == FUSE_IOMAP_TYPE_HOLE &&
+	    !(opflags & FUSE_IOMAP_OP_ZERO)) {
+		ret = fuse_iomap_write_allocate(ff, ino, inode, pos, count,
+						opflags, read_iomap, dirty);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * flush and invalidate the file's io_channel buffers before iomap
+	 * writes them
+	 */
+	err = io_channel_invalidate_tag(ff->fs->io, ino);
+	if (err)
+		return translate_error(ff->fs, ino, err);
+
+	return 0;
 }
 
 static errcode_t config_iomap_devices(struct fuse_context *ctxt,
@@ -5080,6 +5167,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	struct ext2_inode_large inode;
 	ext2_filsys fs;
 	errcode_t err;
+	bool dirty = false;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
@@ -5115,7 +5203,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 					      opflags, read_iomap);
 	else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO))
 		ret = fuse_iomap_begin_write(ff, attr_ino, &inode, pos, count,
-					     opflags, read_iomap);
+					     opflags, read_iomap, &dirty);
 	else
 		ret = fuse_iomap_begin_read(ff, attr_ino, &inode, pos, count,
 					    opflags, read_iomap);
@@ -5132,6 +5220,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   (unsigned long long)read_iomap->length,
 		   read_iomap->type);
 
+	if (dirty) {
+		err = fuse2fs_write_inode(fs, attr_ino, &inode);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
 out_unlock:
 	if (ret < 0)
 		dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret);
@@ -5163,6 +5259,384 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 
 	return 0;
 }
+
+static inline bool can_merge_mappings(const struct ext2fs_extent *left,
+				      const struct ext2fs_extent *right)
+{
+	uint64_t max_len = (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ?
+				EXT_UNINIT_MAX_LEN : EXT_INIT_MAX_LEN;
+
+	return left->e_lblk + left->e_len == right->e_lblk &&
+	       left->e_pblk + left->e_len == right->e_pblk &&
+	       (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ==
+	        (right->e_flags & EXT2_EXTENT_FLAGS_UNINIT) &&
+	       (uint64_t)left->e_len + right->e_len <= max_len;
+}
+
+static int try_merge_mappings(struct fuse2fs *ff, ext2_ino_t ino,
+			      ext2_extent_handle_t handle, blk64_t startoff)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2fs_extent left, right;
+	errcode_t err;
+
+	/* Look up the mapping before startoff */
+	err = get_mapping_at(ff, handle, startoff - 1, &left);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND)
+		return 0;
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Look up the mapping at startoff */
+	err = get_mapping_at(ff, handle, startoff, &right);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND)
+		return 0;
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Can we combine them? */
+	if (!can_merge_mappings(&left, &right))
+		return 0;
+
+	/*
+	 * Delete the mapping after startoff because libext2fs cannot handle
+	 * overlapping mappings.
+	 */
+	err = ext2fs_extent_delete(handle, 0);
+	DUMP_EXTENT(ff, "remover", startoff, err, &right);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = ext2fs_extent_fix_parents(handle);
+	DUMP_EXTENT(ff, "fixremover", startoff, err, &right);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Move back and lengthen the mapping before startoff */
+	err = ext2fs_extent_goto(handle, left.e_lblk);
+	DUMP_EXTENT(ff, "movel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	left.e_len += right.e_len;
+	err = ext2fs_extent_replace(handle, 0, &left);
+	DUMP_EXTENT(ff, "replacel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = ext2fs_extent_fix_parents(handle);
+	DUMP_EXTENT(ff, "fixreplacel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
+static int convert_unwritten_mapping(struct fuse2fs *ff, ext2_ino_t ino,
+				     struct ext2_inode_large *inode,
+				     ext2_extent_handle_t handle,
+				     blk64_t *cursor, blk64_t stopoff)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2fs_extent extent;
+	blk64_t startoff = *cursor;
+	errcode_t err;
+
+	/*
+	 * Find the mapping at startoff.  Note that we can find holes because
+	 * the mapping data can change due to racing writes.
+	 */
+	err = get_mapping_at(ff, handle, startoff, &extent);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+		/*
+		 * If we didn't find any mappings at all then the file is
+		 * completely sparse.  There's nothing to convert.
+		 */
+		*cursor = stopoff;
+		return 0;
+	}
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/*
+	 * The mapping is completely to the left of the range that we want.
+	 * Let's see what's in the next extent, if there is one.
+	 */
+	if (startoff >= extent.e_lblk + extent.e_len) {
+		/*
+		 * Mapping ends to the left of the current position.  Try to
+		 * find the next mapping.  If there is no next mapping, then
+		 * we're done.
+		 */
+		err = get_next_mapping(ff, handle, startoff, &extent);
+		if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+			*cursor = stopoff;
+			return 0;
+		}
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	/*
+	 * The mapping is completely to the right of the range that we want,
+	 * so we're done.
+	 */
+	if (extent.e_lblk >= stopoff) {
+		*cursor = stopoff;
+		return 0;
+	}
+
+	/*
+	 * At this point, we have a mapping that overlaps [startoff, stopoff).
+	 * If the mapping is already written, move on to the next one.
+	 */
+	if (!(extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT))
+		goto next;
+
+	if (startoff > extent.e_lblk) {
+		struct ext2fs_extent newex = extent;
+
+		/*
+		 * Unwritten mapping starts before startoff.  Shorten
+		 * the previous mapping...
+		 */
+		newex.e_len = startoff - extent.e_lblk;
+		err = ext2fs_extent_replace(handle, 0, &newex);
+		DUMP_EXTENT(ff, "shortenp", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixshortenp", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		/* ...and create new written mapping at startoff. */
+		extent.e_len -= newex.e_len;
+		extent.e_lblk += newex.e_len;
+		extent.e_pblk += newex.e_len;
+		extent.e_flags = newex.e_flags & ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_insert(handle,
+					   EXT2_EXTENT_INSERT_AFTER,
+					   &extent);
+		DUMP_EXTENT(ff, "insertx", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixinsertx", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	if (extent.e_lblk + extent.e_len > stopoff) {
+		struct ext2fs_extent newex = extent;
+
+		/*
+		 * Unwritten mapping ends after stopoff.  Shorten the current
+		 * mapping...
+		 */
+		extent.e_len = stopoff - extent.e_lblk;
+		extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_replace(handle, 0, &extent);
+		DUMP_EXTENT(ff, "shortenn", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixshortenn", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		/* ...and create a new unwritten mapping at stopoff. */
+		newex.e_pblk += extent.e_len;
+		newex.e_lblk += extent.e_len;
+		newex.e_len -= extent.e_len;
+		newex.e_flags |= EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_insert(handle,
+					   EXT2_EXTENT_INSERT_AFTER,
+					   &newex);
+		DUMP_EXTENT(ff, "insertn", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixinsertn", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	/* Still unwritten?  Update the state. */
+	if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) {
+		extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_replace(handle, 0, &extent);
+		DUMP_EXTENT(ff, "replacex", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixreplacex", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+next:
+	/* Try to merge with the previous extent */
+	if (startoff > 0) {
+		err = try_merge_mappings(ff, ino, handle, startoff);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	*cursor = extent.e_lblk + extent.e_len;
+	return 0;
+}
+
+static int convert_unwritten_mappings(struct fuse2fs *ff, ext2_ino_t ino,
+				      struct ext2_inode_large *inode,
+				      off_t pos, size_t written)
+{
+	ext2_extent_handle_t handle;
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	const blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + written);
+	errcode_t err;
+	int ret;
+
+	err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Walk every mapping in the range, converting them. */
+	while (startoff < stopoff) {
+		blk64_t old_startoff = startoff;
+
+		ret = convert_unwritten_mapping(ff, ino, inode, handle,
+					        &startoff, stopoff);
+		if (ret)
+			goto out_handle;
+		if (startoff <= old_startoff) {
+			/* Do not go backwards. */
+			ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED);
+			goto out_handle;
+		}
+	}
+
+	/* Try to merge the right edge */
+	ret = try_merge_mappings(ff, ino, handle, stopoff);
+out_handle:
+	ext2fs_extent_free(handle);
+	return ret;
+}
+
+static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos, size_t written, uint32_t ioendflags,
+			  int error, uint64_t new_addr)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct ext2_inode_large inode;
+	ext2_filsys fs;
+	errcode_t err;
+	bool dirty = false;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+	fs = ff->fs;
+
+	pthread_mutex_lock(&ff->bfl);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx written=0x%zx ioendflags=0x%x error=%d new_addr=%llu\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   written,
+		   ioendflags,
+		   error,
+		   (unsigned long long)new_addr);
+
+	if (error) {
+		ret = error;
+		goto out_unlock;
+	}
+
+	/*
+	 * flush and invalidate the file's io_channel buffers again now that
+	 * iomap wrote them
+	 */
+	if (written > 0) {
+		err = io_channel_invalidate_tag(ff->fs->io, attr_ino);
+		if (err) {
+			ret = translate_error(ff->fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
+	/* should never see these ioend types */
+	if ((ioendflags & FUSE_IOMAP_IOEND_SHARED) ||
+	    new_addr != FUSE_IOMAP_NULL_ADDR) {
+		ret = translate_error(fs, attr_ino,
+				      EXT2_ET_FILESYSTEM_CORRUPTED);
+		goto out_unlock;
+	}
+
+	err = fuse2fs_read_inode(fs, attr_ino, &inode);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) {
+		/* unwritten extents are only supported on extents files */
+		if (!(inode.i_flags & EXT4_EXTENTS_FL)) {
+			ret = translate_error(fs, attr_ino,
+					      EXT2_ET_FILESYSTEM_CORRUPTED);
+			goto out_unlock;
+		}
+
+		ret = convert_unwritten_mappings(ff, attr_ino, &inode, pos,
+						 written);
+		if (ret)
+			goto out_unlock;
+
+		dirty = true;
+	}
+
+	if (ioendflags & FUSE_IOMAP_IOEND_APPEND) {
+		ext2_off64_t isize = EXT2_I_SIZE(&inode);
+
+		if (pos + written > isize) {
+			err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode),
+						    pos + written);
+			if (err) {
+				ret = translate_error(fs, attr_ino, err);
+				goto out_unlock;
+			}
+
+			dirty = true;
+		}
+	}
+
+	if (dirty) {
+		err = fuse2fs_write_inode(fs, attr_ino, &inode);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
+out_unlock:
+	if (ret < 0)
+		dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret);
+	pthread_mutex_unlock(&ff->bfl);
+	return ret;
+}
 #endif /* HAVE_FUSE_IOMAP */
 
 static struct fuse_operations fs_ops = {
@@ -5228,6 +5702,7 @@ static struct fuse_operations fs_ops = {
 #ifdef HAVE_FUSE_IOMAP
 	.iomap_begin = op_iomap_begin,
 	.iomap_end = op_iomap_end,
+	.iomap_ioend = op_iomap_ioend,
 #endif /* HAVE_FUSE_IOMAP */
 };
 



* [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-05-22  0:12   ` [PATCH 08/16] fuse2fs: implement direct write support Darrick J. Wong
@ 2025-05-22  0:13   ` Darrick J. Wong
  2025-05-22  0:13   ` [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim Darrick J. Wong
                     ` (6 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:13 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Turn on iomap for pagecache IO to regular files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
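The size update that op_iomap_end performs for buffered appends can be
modelled in a few lines.  This is a sketch only: the flag macros below
are placeholder values invented for the example (the real ones come from
the fuse headers), and maybe_extend_size() is a hypothetical helper that
works on a plain integer rather than an on-disk inode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder flag values for illustration only. */
#define OP_WRITE	(1U << 0)
#define OP_DIRECT	(1U << 1)
#define F_SIZE_CHANGED	(1U << 0)

/*
 * Grow *isize to pos + written for a buffered write that changed the
 * file size.  Returns true if the size was updated.  The size never
 * shrinks here.
 */
static bool maybe_extend_size(uint64_t *isize, uint64_t pos, int64_t written,
			      uint32_t opflags, uint32_t mapflags)
{
	uint64_t newsize;

	assert(isize != NULL);

	if (!(opflags & OP_WRITE) || (opflags & OP_DIRECT))
		return false;
	if (!(mapflags & F_SIZE_CHANGED) || written <= 0)
		return false;

	newsize = pos + (uint64_t)written;
	if (newsize <= *isize)
		return false;

	*isize = newsize;
	return true;
}
```

Direct writes are skipped because they update the size from the ioend
handler instead.
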
 misc/fuse2fs.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 57 insertions(+), 7 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index ec17f6203b4b70..7152979ed6694e 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1175,6 +1175,10 @@ static void *op_init(struct fuse_conn_info *conn
 	if (iomap_enabled(ff))
 		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO);
 #endif
+#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_PAGECACHE)
+	if (iomap_enabled(ff))
+		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_PAGECACHE);
+#endif
 
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
 	if (ff->writable) {
@@ -5017,9 +5021,6 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 {
 	errcode_t err;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	/* fall back to slow path for inline data reads */
 	if (inode->i_flags & EXT4_INLINE_DATA_FL)
 		return -ENOSYS;
@@ -5099,9 +5100,6 @@ static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	errcode_t err;
 	int ret;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	if (pos >= max_size)
 		return -EFBIG;
 
@@ -5235,12 +5233,51 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	return ret;
 }
 
+static int iomap_append_setsize(struct fuse2fs *ff, ext2_ino_t ino,
+				loff_t newsize)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2_inode_large inode;
+	ext2_off64_t isize;
+	errcode_t err;
+
+	dbg_printf(ff, "%s: ino=%u newsize=%llu\n", __func__, ino,
+		   (unsigned long long)newsize);
+
+	err = fuse2fs_read_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	isize = EXT2_I_SIZE(&inode);
+	if (newsize <= isize)
+		return 0;
+
+	dbg_printf(ff, "%s: ino=%u oldsize=%llu newsize=%llu\n", __func__, ino,
+		   (unsigned long long)isize,
+		   (unsigned long long)newsize);
+
+	/*
+	 * XXX cheesily update the ondisk size even though we only want to do
+	 * the incore size until writeback happens
+	 */
+	err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), newsize);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = fuse2fs_write_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
 static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 			off_t pos, uint64_t count, uint32_t opflags,
 			ssize_t written, const struct fuse_iomap *iomap)
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
 
@@ -5255,9 +5292,22 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   opflags,
 		   written,
 		   iomap->flags);
+
+	if ((opflags & FUSE_IOMAP_OP_WRITE) &&
+	    !(opflags & FUSE_IOMAP_OP_DIRECT) &&
+	    (iomap->flags & FUSE_IOMAP_F_SIZE_CHANGED) &&
+	    written > 0) {
+		ret = iomap_append_setsize(ff, attr_ino, pos + written);
+		if (ret)
+			goto out_unlock;
+	}
+
+out_unlock:
+	if (ret < 0)
+		dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret);
 	pthread_mutex_unlock(&ff->bfl);
 
-	return 0;
+	return ret;
 }
 
 static inline bool can_merge_mappings(const struct ext2fs_extent *left,



* [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-05-22  0:13   ` [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
@ 2025-05-22  0:13   ` Darrick J. Wong
  2025-05-22  0:13   ` [PATCH 11/16] fuse2fs: improve tracing for fallocate Darrick J. Wong
                     ` (5 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:13 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Discard operates directly on the storage device, which means that we
need to flush and invalidate the buffer cache because it could be
caching freed blocks whose contents are about to change.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)
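
To illustrate the ordering described above, here is a minimal standalone
sketch. It is not fuse2fs or libext2fs code: the toy_* names are
hypothetical, the "device" is an in-memory array, and discard is assumed
to leave zeroes behind, which real devices do not always guarantee.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCKS	8
#define TOY_BLOCKSIZE	16

/* Hypothetical stand-in for a device fronted by a write-back cache. */
struct toy_dev {
	uint8_t disk[TOY_BLOCKS][TOY_BLOCKSIZE];
	uint8_t cache[TOY_BLOCKS][TOY_BLOCKSIZE];
	bool cached[TOY_BLOCKS];
	bool dirty[TOY_BLOCKS];
};

/* Writes land in the cache only and are marked dirty. */
static void toy_write(struct toy_dev *d, unsigned int blk, uint8_t fill)
{
	memset(d->cache[blk], fill, TOY_BLOCKSIZE);
	d->cached[blk] = true;
	d->dirty[blk] = true;
}

/* Reads are served from the cache, filling it from disk on a miss. */
static uint8_t toy_read(struct toy_dev *d, unsigned int blk)
{
	if (!d->cached[blk]) {
		memcpy(d->cache[blk], d->disk[blk], TOY_BLOCKSIZE);
		d->cached[blk] = true;
	}
	return d->cache[blk][0];
}

/* Write every dirty cache entry back to disk. */
static void toy_flush(struct toy_dev *d)
{
	unsigned int blk;

	for (blk = 0; blk < TOY_BLOCKS; blk++) {
		if (!d->dirty[blk])
			continue;
		memcpy(d->disk[blk], d->cache[blk], TOY_BLOCKSIZE);
		d->dirty[blk] = false;
	}
}

/* Drop clean cache entries; callers are expected to flush first. */
static void toy_invalidate(struct toy_dev *d)
{
	unsigned int blk;

	for (blk = 0; blk < TOY_BLOCKS; blk++)
		if (!d->dirty[blk])
			d->cached[blk] = false;
}

/* Discard goes straight to the device and never touches the cache. */
static void toy_discard(struct toy_dev *d, unsigned int blk)
{
	memset(d->disk[blk], 0, TOY_BLOCKSIZE);
}

/* Flush, discard, then invalidate, in that order. */
static void toy_trim(struct toy_dev *d, unsigned int blk)
{
	toy_flush(d);
	toy_discard(d, blk);
	toy_invalidate(d);
}
```

Without the final toy_invalidate() call, a later toy_read() of the
trimmed block would keep returning the cached pre-discard contents.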


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 7152979ed6694e..219d4bf698d628 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4365,6 +4365,11 @@ static int ioctl_fitrim(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 	cleared = 0;
 	max_blocks = FUSE2FS_B_TO_FSBT(ff, 2048ULL * 1024 * 1024);
 
+	/* flush any dirty data out of the disk cache before trimming */
+	err = io_channel_flush_tag(ff->fs->io, IO_CHANNEL_TAG_NULL);
+	if (err)
+		return translate_error(fs, fh->ino, err);
+
 	fr->len = 0;
 	while (start <= end) {
 		err = ext2fs_find_first_zero_block_bitmap2(fs->block_map,
@@ -4394,6 +4399,16 @@ static int ioctl_fitrim(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 		}
 		start = b + 1;
 	}
+	if (err)
+		goto out;
+
+	/*
+	 * Invalidate the entire disk cache now that we've written zeroes so
+	 * that EXT2_ALLOCRANGE_ZERO_BLOCKS works correctly.
+	 */
+	err = io_channel_invalidate_tag(ff->fs->io, IO_CHANNEL_TAG_NULL);
+	if (err)
+		return translate_error(fs, fh->ino, err);
 
 out:
 	fr->len = cleared;


* [PATCH 11/16] fuse2fs: improve tracing for fallocate
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (9 preceding siblings ...)
  2025-05-22  0:13   ` [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim Darrick J. Wong
@ 2025-05-22  0:13   ` Darrick J. Wong
  2025-05-22  0:13   ` [PATCH 12/16] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
                     ` (4 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:13 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Improve the tracing for fallocate by reporting the inode number and the
file range in all tracepoints.  Make the ranges hexadecimal to make it
easier for the programmer to convert bytes to block numbers and back.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)
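
A small standalone sketch of why hex helps, assuming a power-of-two
block size. The helper names are hypothetical and are not the
FUSE2FS_B_TO_FSBT-style macros used in fuse2fs. With 4096-byte blocks
(12 bits), the block number is the hex offset with its last three
digits dropped, and those three digits are the offset within the block.

```c
#include <stdint.h>

/* Byte position -> block number, truncating (hypothetical helper). */
static uint64_t bytes_to_block(uint64_t pos, unsigned int blocksize_bits)
{
	return pos >> blocksize_bits;
}

/* Byte position -> offset within its block (hypothetical helper). */
static uint64_t offset_in_block(uint64_t pos, unsigned int blocksize_bits)
{
	return pos & ((UINT64_C(1) << blocksize_bits) - 1);
}

/* Block number -> byte position of the start of that block. */
static uint64_t block_to_bytes(uint64_t blk, unsigned int blocksize_bits)
{
	return blk << blocksize_bits;
}
```

For example, offset 0x12345 is block 0x12 plus 0x345 bytes; the same
offset printed in decimal (74565) gives no such visual hint.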


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 219d4bf698d628..fe6d97324c1f57 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4529,8 +4529,8 @@ static int fallocate_helper(struct fuse_file_info *fp, int mode, off_t offset,
 	FUSE2FS_CHECK_MAGIC(fs, fh, FUSE2FS_FILE_MAGIC);
 	start = FUSE2FS_B_TO_FSBT(ff, offset);
 	end = FUSE2FS_B_TO_FSBT(ff, offset + len - 1);
-	dbg_printf(ff, "%s: ino=%d mode=0x%x start=%llu end=%llu\n", __func__,
-		   fh->ino, mode, start, end);
+	dbg_printf(ff, "%s: ino=%d mode=0x%x offset=0x%jx len=0x%jx start=%llu end=%llu\n",
+		   __func__, fh->ino, mode, offset, len, start, end);
 	if (!fs_can_allocate(ff, FUSE2FS_B_TO_FSB(ff, len)))
 		return -ENOSPC;
 
@@ -4601,6 +4601,7 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return err;
 
+	dbg_printf(ff, "%s: ino=%d offset=0x%jx len=0x%jx\n", __func__, ino, offset + residue, len);
 	memset(*buf + residue, 0, len);
 
 	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
@@ -4637,10 +4638,13 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	if (!blk || (retflags & BMAP_RET_UNINIT))
 		return 0;
 
-	if (clean_before)
+	if (clean_before) {
+		dbg_printf(ff, "%s: ino=%d before offset=0x%jx len=0x%jx\n", __func__, ino, offset, residue);
 		memset(*buf, 0, residue);
-	else
+	} else {
+		dbg_printf(ff, "%s: ino=%d after offset=0x%jx len=0x%jx\n", __func__, ino, offset, fs->blocksize - residue);
 		memset(*buf + residue, 0, fs->blocksize - residue);
+	}
 
 	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
@@ -4661,7 +4665,6 @@ static int punch_helper(struct fuse_file_info *fp, int mode, off_t offset,
 	FUSE2FS_CHECK_CONTEXT(ff);
 	fs = ff->fs;
 	FUSE2FS_CHECK_MAGIC(fs, fh, FUSE2FS_FILE_MAGIC);
-	dbg_printf(ff, "%s: offset=%jd len=%jd\n", __func__, offset, len);
 
 	/* kernel ext4 punch requires this flag to be set */
 	if (!(mode & FL_KEEP_SIZE_FLAG))
@@ -4670,8 +4673,9 @@ static int punch_helper(struct fuse_file_info *fp, int mode, off_t offset,
 	/* Punch out a bunch of blocks */
 	start = FUSE2FS_B_TO_FSB(ff, offset);
 	end = (offset + len - fs->blocksize) / fs->blocksize;
-	dbg_printf(ff, "%s: ino=%d mode=0x%x start=%llu end=%llu\n", __func__,
-		   fh->ino, mode, start, end);
+
+	dbg_printf(ff, "%s: ino=%d mode=0x%x offset=0x%jx len=0x%jx start=%llu end=%llu\n",
+		   __func__, fh->ino, mode, offset, len, start, end);
 
 	err = fuse2fs_read_inode(fs, fh->ino, &inode);
 	if (err)
@@ -4727,6 +4731,8 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct fuse2fs_file_handle *fh =
+		(struct fuse2fs_file_handle *)(uintptr_t)fp->fh;
 	int ret;
 
 	/* Catch unknown flags */
@@ -4738,6 +4744,12 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 		ret = -EROFS;
 		goto out;
 	}
+
+	dbg_printf(ff, "%s: ino=%d mode=0x%x start=0x%llx end=0x%llx\n", __func__,
+		   fh->ino, mode,
+		   (unsigned long long)offset,
+		   (unsigned long long)offset + len);
+
 	if (mode & FL_ZERO_RANGE_FLAG)
 		ret = zero_helper(fp, mode, offset, len);
 	else if (mode & FL_PUNCH_HOLE_FLAG)


* [PATCH 12/16] fuse2fs: don't zero bytes in punch hole
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (10 preceding siblings ...)
  2025-05-22  0:13   ` [PATCH 11/16] fuse2fs: improve tracing for fallocate Darrick J. Wong
@ 2025-05-22  0:13   ` Darrick J. Wong
  2025-05-22  0:14   ` [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
                     ` (3 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:13 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the pagecache, it will take care of zeroing the
unaligned parts of punched out regions so we don't have to do it
ourselves.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)
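
For readers unfamiliar with what the "unaligned parts" are, the
following standalone sketch splits a punch range into a partial head, a
run of whole blocks, and a partial tail. It is an illustration only:
plan_punch() and struct punch_plan are hypothetical names, this is not
how punch_helper() computes its range, and arithmetic overflow near
UINT64_MAX is ignored.

```c
#include <stdint.h>

/* Hypothetical description of how a punch range decomposes. */
struct punch_plan {
	uint64_t head_pos, head_len;	/* bytes to zero before whole blocks */
	uint64_t first_blk, nr_blks;	/* whole blocks that can be freed */
	uint64_t tail_pos, tail_len;	/* bytes to zero after whole blocks */
};

static struct punch_plan plan_punch(uint64_t offset, uint64_t len,
				    uint64_t blocksize)
{
	struct punch_plan p = { 0 };
	uint64_t end = offset + len;	/* exclusive */
	uint64_t first_whole = (offset + blocksize - 1) / blocksize;
	uint64_t whole_end = end / blocksize;	/* exclusive block index */

	if (first_whole > whole_end) {
		/* Range sits strictly inside one block: only zeroing. */
		p.head_pos = offset;
		p.head_len = len;
		return p;
	}

	p.head_pos = offset;
	p.head_len = first_whole * blocksize - offset;
	p.first_blk = first_whole;
	p.nr_blks = whole_end - first_whole;
	p.tail_pos = whole_end * blocksize;
	p.tail_len = end - p.tail_pos;
	return p;
}
```

With 4096-byte blocks, punching offset=0x1800 len=0x3000 frees blocks 2
and 3 and leaves 0x800 bytes to zero on each side. Those two edge
ranges are the work that this patch leaves to the kernel when iomap
handles all file IO.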


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index fe6d97324c1f57..aeb2b6fbc28401 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -152,6 +152,7 @@ enum fuse2fs_iomap_state {
 	IOMAP_DISABLED,
 	IOMAP_UNKNOWN,
 	IOMAP_ENABLED,
+	IOMAP_FILEIO,	/* enabled and does all file data block IO */
 };
 #endif
 
@@ -1040,6 +1041,7 @@ static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff)
 		/* fallthrough */;
 	case IOMAP_DISABLED:
 		return 0;
+	case IOMAP_FILEIO:
 	case IOMAP_ENABLED:
 		break;
 	}
@@ -1059,11 +1061,17 @@ static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff)
 
 static int iomap_enabled(const struct fuse2fs *ff)
 {
-	return ff->iomap_state == IOMAP_ENABLED;
+	return ff->iomap_state >= IOMAP_ENABLED;
+}
+
+static int iomap_does_fileio(const struct fuse2fs *ff)
+{
+	return ff->iomap_state == IOMAP_FILEIO;
 }
 #else
 # define confirm_iomap(...)	(0)
 # define iomap_enabled(...)	(0)
+# define iomap_does_fileio(...)	(0)
 #endif
 
 static void *op_init(struct fuse_conn_info *conn
@@ -1100,6 +1108,20 @@ static void *op_init(struct fuse_conn_info *conn
 	if (ff->iomap_state != IOMAP_DISABLED &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;
+
+	/*
+	 * If iomap is turned on and the kernel advertises support for both
+	 * direct and pagecache IO, then that means the kernel handles all
+	 * regular file data block IO for us.  That means we can turn off all
+	 * of libext2fs' file data block handling except for inline data.
+	 *
+	 * XXX: kernel doesn't support inline data iomap
+	 */
+	if (iomap_enabled(ff) &&
+	    fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO) &&
+	    fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_PAGECACHE))
+		ff->iomap_state = IOMAP_FILEIO;
+
 	/*
 	 * In iomap mode, the kernel writes file data directly to the block
 	 * device and does not flush the bdev page cache.  We must open the
@@ -4580,6 +4602,10 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	int retflags;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		return 0;
+
 	residue = FUSE2FS_OFF_IN_FSB(ff, offset);
 	if (residue == 0)
 		return 0;
@@ -4617,6 +4643,10 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	off_t residue;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		return 0;
+
 	residue = FUSE2FS_OFF_IN_FSB(ff, offset);
 	if (residue == 0)
 		return 0;


* [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (11 preceding siblings ...)
  2025-05-22  0:13   ` [PATCH 12/16] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
@ 2025-05-22  0:14   ` Darrick J. Wong
  2025-05-22  0:14   ` [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
                     ` (2 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:14 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the page cache, the kernel will take care of
all the file data block IO for us, including zeroing of punched ranges
and post-EOF bytes.  fuse2fs only needs to do IO for inline data.

Therefore, set the NOBLOCKIO ext2_file flag so that libext2fs will not
do any regular file IO to or from disk blocks at all.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index aeb2b6fbc28401..842ea3a191fa44 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -2863,9 +2863,14 @@ static int truncate_helper(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size)
 	ext2_file_t file;
 	__u64 old_isize;
 	errcode_t err;
+	int flags = EXT2_FILE_WRITE;
 	int ret = 0;
 
-	err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file);
+	/* the kernel handles all eof zeroing for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		flags |= EXT2_FILE_NOBLOCKIO;
+
+	err = ext2fs_file_open(fs, ino, flags, &file);
 	if (err)
 		return translate_error(fs, ino, err);
 
@@ -2987,6 +2992,9 @@ static int __op_open(struct fuse2fs *ff, const char *path,
 		file->open_flags |= EXT2_FILE_WRITE;
 		break;
 	}
+	/* the kernel handles all block IO for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		file->open_flags |= EXT2_FILE_NOBLOCKIO;
 	if (fp->flags & O_APPEND) {
 		/* the kernel doesn't allow truncation of an append-only file */
 		if (fp->flags & O_TRUNC) {


* [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (12 preceding siblings ...)
  2025-05-22  0:14   ` [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
@ 2025-05-22  0:14   ` Darrick J. Wong
  2025-05-22  0:14   ` [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
  2025-05-22  0:15   ` [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:14 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Now that fuse2fs uses iomap for pagecache IO, all regular file IO goes
directly to the disk.  There is no need to flush the unix IO manager's
disk cache (or invalidate it) because it does not contain file data.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 842ea3a191fa44..ba8c5f301625c6 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5091,9 +5091,11 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 		return -ENOSYS;
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
-	err = io_channel_flush_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!iomap_does_fileio(ff)) {
+		err = io_channel_flush_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	if (inode->i_flags & EXT4_EXTENTS_FL)
 		return extent_iomap_begin(ff, ino, inode, pos, count, opflags,
@@ -5188,9 +5190,11 @@ static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	 * flush and invalidate the file's io_channel buffers before iomap
 	 * writes them
 	 */
-	err = io_channel_invalidate_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!iomap_does_fileio(ff)) {
+		err = io_channel_invalidate_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	return 0;
 }
@@ -5685,7 +5689,7 @@ static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	 * flush and invalidate the file's io_channel buffers again now that
 	 * iomap wrote them
 	 */
-	if (written > 0) {
+	if (written > 0 && !iomap_does_fileio(ff)) {
 		err = io_channel_invalidate_tag(ff->fs->io, attr_ino);
 		if (err) {
 			ret = translate_error(ff->fs, attr_ino, err);


* [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (13 preceding siblings ...)
  2025-05-22  0:14   ` [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
@ 2025-05-22  0:14   ` Darrick J. Wong
  2025-05-22  0:15   ` [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:14 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Back in "fuse2fs: always use directio disk reads with fuse2fs", we
started using directio for all libext2fs disk IO to deal with cache
coherency issues between the unix io manager's disk cache, the block
device page cache, and the file data blocks being read and written to
disk by the kernel itself.

Now that we've turned off all regular file data block IO in libext2fs,
we don't need that and can go back to the old way, which is a lot
faster for metadata operations.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
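
The resulting decision can be summarized with a standalone sketch. The
enum values and the two predicates mirror names from earlier patches in
this series; negotiate() and needs_directio() are hypothetical, and the
boolean parameters stand in for the libfuse capability checks, so treat
this as a simplified model and not as the op_init() logic itself.

```c
#include <stdbool.h>

enum fuse2fs_iomap_state {
	IOMAP_DISABLED,
	IOMAP_UNKNOWN,
	IOMAP_ENABLED,
	IOMAP_FILEIO,	/* enabled and does all file data block IO */
};

/* Hypothetical model of capability negotiation at init time. */
static enum fuse2fs_iomap_state negotiate(bool want_iomap, bool kernel_iomap,
					  bool kernel_directio,
					  bool kernel_pagecache)
{
	if (!want_iomap || !kernel_iomap)
		return IOMAP_DISABLED;
	if (kernel_directio && kernel_pagecache)
		return IOMAP_FILEIO;
	return IOMAP_ENABLED;
}

/* Relies on IOMAP_FILEIO sorting after IOMAP_ENABLED in the enum. */
static bool iomap_enabled(enum fuse2fs_iomap_state state)
{
	return state >= IOMAP_ENABLED;
}

static bool iomap_does_fileio(enum fuse2fs_iomap_state state)
{
	return state == IOMAP_FILEIO;
}

/*
 * Direct IO to the block device is only needed while libext2fs and the
 * kernel can both touch file data blocks.
 */
static bool needs_directio(enum fuse2fs_iomap_state state)
{
	return iomap_enabled(state) && !iomap_does_fileio(state);
}
```

In this model only the middle case (iomap on, but the kernel does not
handle both direct and pagecache IO) forces directio; with iomap off or
with full file IO in the kernel, buffered metadata reads are allowed.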


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index ba8c5f301625c6..f31aee5af5aad9 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1128,8 +1128,12 @@ static void *op_init(struct fuse_conn_info *conn
 	 * filesystem in directio mode to avoid cache coherency issues when
 	 * reading file data.  If we can't open the bdev in directio mode, we
 	 * must not use iomap.
+	 *
+	 * If we know that the kernel can handle all regular file IO for us,
+	 * then there is no cache coherency issue and we can use buffered reads
+	 * for all IO, which will all be filesystem metadata.
 	 */
-	if (iomap_enabled(ff))
+	if (iomap_enabled(ff) && !iomap_does_fileio(ff))
 		ff->directio = 1;
 #endif
 


* [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (14 preceding siblings ...)
  2025-05-22  0:14   ` [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
@ 2025-05-22  0:15   ` Darrick J. Wong
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:15 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Since fuse in iomap mode guarantees that op_destroy will be called
before umount returns, we don't need to use fuseblk mode to get that
guarantee.  Disable fuseblk mode, which saves us the trouble of closing
and reopening the device.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)
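
One side effect of the diff below is that fsname= moves into the
default option string, apparently so that the device name is still
reported when the blkdev options are skipped. The standalone sketch
that follows shows the two strings; the function names are
hypothetical and FUSE_PLATFORM_OPTS is left out for brevity.

```c
#include <stdbool.h>
#include <stdio.h>

/* Default options: always carry the device name as fsname. */
static int build_default_opts(char *buf, size_t len, const char *subtype,
			      const char *device)
{
	return snprintf(buf, len,
			"-okernel_cache,subtype=%s,attr_timeout=0,fsname=%s",
			subtype, device);
}

/* fuseblk-only options: emitted only when is_bdev is still set. */
static int build_blkdev_opts(char *buf, size_t len, bool is_bdev,
			     unsigned int blksize)
{
	if (!is_bdev) {
		if (len > 0)
			buf[0] = '\0';
		return 0;
	}
	return snprintf(buf, len, "-oblkdev,blksize=%u", blksize);
}
```

When iomap support is discovered and is_bdev is cleared, only the first
string is passed on, and it still names the device.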


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f31aee5af5aad9..28385d654f5e05 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -787,6 +787,8 @@ static errcode_t open_fs(struct fuse2fs *ff, int libext2_flags)
 	if (ff->directio)
 		flags |= EXT2_FLAG_DIRECT_IO;
 
+	dbg_printf(ff, "opening with flags=0x%x\n", flags);
+
 	err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
 			   &ff->fs);
 	if (err) {
@@ -6153,6 +6155,18 @@ int main(int argc, char *argv[])
 		ret = 32;
 		goto out;
 	}
+#ifdef HAVE_FUSE_IOMAP
+	if (is_bdev && fuse_discover_iomap()) {
+		/*
+		 * fuse-iomap guarantees that op_destroy is called before the
+		 * filesystem is unmounted, so we don't need fuseblk mode.
+		 * This saves us the trouble of reopening the filesystem later,
+		 * and means that fuse2fs itself owns the exclusive lock on the
+		 * block device.
+		 */
+		is_bdev = 0;
+	}
+#endif
 
 	blksize = fctx.fs->blocksize;
 
@@ -6171,14 +6185,14 @@ int main(int argc, char *argv[])
 
 	/* Set up default fuse parameters */
 	snprintf(extra_args, BUFSIZ, "-okernel_cache,subtype=%s,"
-		 "attr_timeout=0" FUSE_PLATFORM_OPTS,
-		 get_subtype(argv[0]));
+		 "attr_timeout=0,fsname=%s" FUSE_PLATFORM_OPTS,
+		 get_subtype(argv[0]), fctx.device);
 	if (fctx.no_default_opts == 0)
 		fuse_opt_add_arg(&args, extra_args);
 
 	if (is_bdev) {
-		snprintf(extra_args, BUFSIZ, "-ofsname=%s,blkdev,blksize=%u",
-			 fctx.device, blksize);
+		snprintf(extra_args, BUFSIZ, "-oblkdev,blksize=%u",
+			 blksize);
 		fuse_opt_add_arg(&args, extra_args);
 	}
 


* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-05-22 16:24 ` Amir Goldstein
  2025-05-29 16:45   ` Darrick J. Wong
  2025-06-13 17:37   ` [RFC[RAP] V2] " Darrick J. Wong
  3 siblings, 2 replies; 55+ messages in thread
From: Amir Goldstein @ 2025-05-22 16:24 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Hi everyone,
>
> DO NOT MERGE THIS.
>
> This is the very first request for comments of a prototype to connect
> the Linux fuse driver to fs-iomap for regular file IO operations to and
> from files whose contents persist to locally attached storage devices.
>
> Why would you want to do that?  Most filesystem drivers are seriously
> vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> over almost a decade of its existence.  Faulty code can lead to total
> kernel compromise, and I think there's a very strong incentive to move
> all that parsing out to userspace where we can containerize the fuse
> server process.
>
> willy's folios conversion project (and to a certain degree RH's new
> mount API) have also demonstrated that treewide changes to the core
> mm/pagecache/fs code are very very difficult to pull off and take years
> because you have to understand every filesystem's bespoke use of that
> core code.  Eeeugh.
>
> The fuse command plumbing is very simple -- the ->iomap_begin,
> ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> to the fuse server via a trio of new fuse commands.  This is suitable
> for very simple filesystems that don't do tricky things with mappings
> (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> but solving that is for the next sprint.
>
> With this overly simplistic RFC, I aim to show that it's possible to
> build a fuse server for a real filesystem (ext4) that runs entirely in
> userspace yet maintains most of its performance.  At this early stage I
> get about 95% of the kernel ext4 driver's streaming directio performance
> on streaming IO, and 110% of its streaming buffered IO performance.
> Random buffered IO suffers a 90% hit on writes due to unwritten extent
> conversions.  Random direct IO is about 60% as fast as the kernel; see
> the cover letter for the fuse2fs iomap changes for more details.
>

Very cool!

> There are some major warts remaining:
>
> 1. The iomap cookie validation is not present, which can lead to subtle
> races between pagecache zeroing and writeback on filesystems that
> support unwritten and delalloc mappings.
>
> 2. Mappings ought to be cached in the kernel for more speed.
>
> 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> yet figured out how inline data is supposed to work.
>
> 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> which currently isn't possible because the kernel fuse driver will iget
> inodes prior to calling FUSE_GETATTR to discover the properties of the
> inode it just read.

Can you make the decision about enabling iomap on lookup?
The plan for passthrough for inode operations was to allow
setting up passthrough config of inode on lookup.

>
> 5. ext4 doesn't support out of place writes so I don't know if that
> actually works correctly.
>
> 6. iomap is an inode-based service, not a file-based service.  This
> means that we /must/ push ext2's inode numbers into the kernel via
> FUSE_GETATTR so that it can report those same numbers back out through
> the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> to index its incore inode, so we have to pass those too so that
> notifications work properly.
>

Again, I might be missing something, but as long as the fuse filesystem
is exposing a single backing filesystem, it should be possible to make
sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
inode number.
See sketch in this WIP branch:
https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575

Thanks,
Amir.

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
@ 2025-05-29 16:45   ` Darrick J. Wong
  2025-05-29 19:41     ` Amir Goldstein
  2025-06-13 17:37   ` [RFC[RAP] V2] " Darrick J. Wong
  1 sibling, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-29 16:45 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi everyone,
> >
> > DO NOT MERGE THIS.
> >
> > This is the very first request for comments of a prototype to connect
> > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > from files whose contents persist to locally attached storage devices.
> >
> > Why would you want to do that?  Most filesystem drivers are seriously
> > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > over almost a decade of its existence.  Faulty code can lead to total
> > kernel compromise, and I think there's a very strong incentive to move
> > all that parsing out to userspace where we can containerize the fuse
> > server process.
> >
> > willy's folios conversion project (and to a certain degree RH's new
> > mount API) have also demonstrated that treewide changes to the core
> > mm/pagecache/fs code are very very difficult to pull off and take years
> > because you have to understand every filesystem's bespoke use of that
> > core code.  Eeeugh.
> >
> > The fuse command plumbing is very simple -- the ->iomap_begin,
> > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > to the fuse server via a trio of new fuse commands.  This is suitable
> > for very simple filesystems that don't do tricky things with mappings
> > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > but solving that is for the next sprint.
> >
> > With this overly simplistic RFC, I aim to show that it's possible to
> > build a fuse server for a real filesystem (ext4) that runs entirely in
> > userspace yet maintains most of its performance.  At this early stage I
> > get about 95% of the kernel ext4 driver's streaming directio performance
> > on streaming IO, and 110% of its streaming buffered IO performance.
> > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > the cover letter for the fuse2fs iomap changes for more details.
> >
> 
> Very cool!
> 
> > There are some major warts remaining:
> >
> > 1. The iomap cookie validation is not present, which can lead to subtle
> > races between pagecache zeroing and writeback on filesystems that
> > support unwritten and delalloc mappings.
> >
> > 2. Mappings ought to be cached in the kernel for more speed.
> >
> > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > yet figured out how inline data is supposed to work.
> >
> > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > which currently isn't possible because the kernel fuse driver will iget
> > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > inode it just read.
> 
> Can you make the decision about enabling iomap on lookup?
> The plan for passthrough for inode operations was to allow
> setting up passthrough config of inode on lookup.

The main requirement (especially for buffered IO) is that we've set the
address space operations structure either to the regular fuse one or to
the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
code assumes that those ops cannot change on a live inode.

So I /think/ we could ask the fuse server at inode instantiation time
(which, if I'm reading the code correctly, is when iget5_locked gives
fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
to userspace at that time.  Alternately I guess we could extend struct
fuse_attr with another FUSE_ATTR_ flag, I think?

> > 5. ext4 doesn't support out of place writes so I don't know if that
> > actually works correctly.
> >
> > 6. iomap is an inode-based service, not a file-based service.  This
> > means that we /must/ push ext2's inode numbers into the kernel via
> > FUSE_GETATTR so that it can report those same numbers back out through
> > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > to index its incore inode, so we have to pass those too so that
> > notifications work properly.
> >
> 
> Again, I might be missing something, but as long as the fuse filesystem
> is exposing a single backing filesystem, it should be possible to make
> sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> inode number.
> See sketch in this WIP branch:
> https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575

I think this would work in many places, except for filesystems with
64-bit inumbers on 32-bit machines.  That might be a good argument for
continuing to pass along the nodeid and fuse_inode::orig_ino like it
does now.  Plus there are some filesystems that synthesize inode numbers
so tying the two together might not be feasible/desirable anyway.
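
To make the 32-bit concern concrete, here is a standalone sketch (not
kernel code). uint32_t stands in for a 32-bit unsigned long, and both
helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Does this inode number survive a round trip through 32 bits? */
static bool ino_fits_in_32(uint64_t ino)
{
	return ino == (uint64_t)(uint32_t)ino;
}

/* What a 32-bit index would actually see for this inode number. */
static uint32_t truncate_ino(uint64_t ino)
{
	return (uint32_t)ino;
}
```

Two distinct 64-bit inumbers such as 0x80 and 0x100000080 truncate to
the same 32-bit value, so using the inumber directly as the nodeid
could make them collide there. ext4's u32 inumbers always fit.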

Though one nice feature of letting fuse have its own nodeids might be
that if the in-memory index switches to a tree structure, then it could
be more compact if the filesystem's inumbers are fairly sparse like xfs.
OTOH the current inode hashtable has been around for a very long time so
that might not be a big concern.  For fuse2fs it doesn't matter since
ext4 inumbers are u32.

--D

> 
> Thanks,
> Amir.
> 

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-29 16:45   ` Darrick J. Wong
@ 2025-05-29 19:41     ` Amir Goldstein
  2025-06-09 22:31       ` Darrick J. Wong
  2025-07-12 10:57       ` Amir Goldstein
  0 siblings, 2 replies; 55+ messages in thread
From: Amir Goldstein @ 2025-05-29 19:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > Hi everyone,
> > >
> > > DO NOT MERGE THIS.
> > >
> > > This is the very first request for comments of a prototype to connect
> > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > from files whose contents persist to locally attached storage devices.
> > >
> > > Why would you want to do that?  Most filesystem drivers are seriously
> > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > over almost a decade of its existence.  Faulty code can lead to total
> > > kernel compromise, and I think there's a very strong incentive to move
> > > all that parsing out to userspace where we can containerize the fuse
> > > server process.
> > >
> > > willy's folios conversion project (and to a certain degree RH's new
> > > mount API) have also demonstrated that treewide changes to the core
> > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > because you have to understand every filesystem's bespoke use of that
> > > core code.  Eeeugh.
> > >
> > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > for very simple filesystems that don't do tricky things with mappings
> > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > but solving that is for the next sprint.
> > >
> > > With this overly simplistic RFC, I am to show that it's possible to
> > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > userspace yet maintains most of its performance.  At this early stage I
> > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > the cover letter for the fuse2fs iomap changes for more details.
> > >
> >
> > Very cool!
> >
> > > There are some major warts remaining:
> > >
> > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > races between pagecache zeroing and writeback on filesystems that
> > > support unwritten and delalloc mappings.
> > >
> > > 2. Mappings ought to be cached in the kernel for more speed.
> > >
> > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > yet figured out how inline data is supposed to work.
> > >
> > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > which currently isn't possible because the kernel fuse driver will iget
> > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > inode it just read.
> >
> > Can you make the decision about enabling iomap on lookup?
> > The plan for passthrough for inode operations was to allow
> > setting up passthough config of inode on lookup.
>
> The main requirement (especially for buffered IO) is that we've set the
> address space operations structure either to the regular fuse one or to
> the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> code assumes that cannot change on a live inode.
>
> So I /think/ we could ask the fuse server at inode instantiation time
> (which, if I'm reading the code correctly, is when iget5_locked gives
> fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> to userspace at that time.  Alternately I guess we could extend struct
> fuse_attr with another FUSE_ATTR_ flag, I think?
>

The latter.  Either extend struct fuse_attr or struct fuse_entry_out,
which is part of the responses to FUSE_LOOKUP, FUSE_READDIRPLUS,
FUSE_CREATE and FUSE_TMPFILE, i.e. the commands that instantiate fuse
inodes.

There is a very hand-wavy discussion about this at:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/

In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
command that uses the variable length file handle instead of nodeid
as a key for the inode.

So we will have to extend fuse_entry_out anyway, but TBH I never got to
look at the gritty details of how best to extend all the relevant commands,
so I hope I am not sending you down the wrong path.
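
A minimal sketch of the "flag in the lookup reply" idea, written as a
plain C model rather than kernel code.  Everything prefixed sketch_ or
SKETCH_ is hypothetical: the structs are simplified stand-ins for struct
fuse_attr and struct fuse_entry_out, and SKETCH_ATTR_IOMAP is an invented
flag bit, not part of the real FUSE uapi.  The only point it illustrates
is that the address space ops are picked once, from the reply that
instantiates the inode, before the inode leaves its new state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit; NOT part of the real FUSE uapi. */
#define SKETCH_ATTR_IOMAP (1u << 31)

/* Simplified stand-ins for struct fuse_attr / struct fuse_entry_out. */
struct sketch_attr {
	uint64_t ino;
	uint32_t flags;
};

struct sketch_entry_out {
	uint64_t nodeid;
	struct sketch_attr attr;
};

enum sketch_aops {
	SKETCH_AOPS_FUSE,	/* regular fuse address space ops */
	SKETCH_AOPS_IOMAP,	/* fuse+iomap address space ops */
};

/* Simplified stand-in for an in-kernel inode. */
struct sketch_inode {
	uint64_t nodeid;
	enum sketch_aops aops;
	bool is_new;		/* models the INEW state */
};

/* Choose the ops from the lookup reply while the inode is still new. */
void sketch_init_inode(struct sketch_inode *inode,
		       const struct sketch_entry_out *out)
{
	inode->is_new = true;
	inode->nodeid = out->nodeid;
	inode->aops = (out->attr.flags & SKETCH_ATTR_IOMAP) ?
			SKETCH_AOPS_IOMAP : SKETCH_AOPS_FUSE;
	/* From here on the inode is live and the ops must not change. */
	inode->is_new = false;
}
```

Because the decision rides on the reply that creates the inode, no
extra upcall to userspace is needed at instantiation time.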


> > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > actually works correctly.
> > >
> > > 6. iomap is an inode-based service, not a file-based service.  This
> > > means that we /must/ push ext2's inode numbers into the kernel via
> > > FUSE_GETATTR so that it can report those same numbers back out through
> > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > to index its incore inode, so we have to pass those too so that
> > > notifications work properly.
> > >
> >
> > Again, I might be missing something, but as long as the fuse filesystem
> > is exposing a single backing filesystem, it should be possible to make
> > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > inode number.
> > See sketch in this WIP branch:
> > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
>
> I think this would work in many places, except for filesystems with
> 64-bit inumbers on 32-bit machines.  That might be a good argument for
> continuing to pass along the nodeid and fuse_inode::orig_ino like it
> does now.  Plus there are some filesystems that synthesize inode numbers
> so tying the two together might not be feasible/desirable anyway.
>
> Though one nice feature of letting fuse have its own nodeids might be
> that if the in-memory index switches to a tree structure, then it could
> be more compact if the filesystem's inumbers are fairly sparse like xfs.
> OTOH the current inode hashtable has been around for a very long time so
> that might not be a big concern.  For fuse2fs it doesn't matter since
> ext4 inumbers are u32.
>

I wanted to see if declaring a one-to-one 64-bit ino mapping can simplify
things for the first version of inode ops passthrough.
If this is not the case, or if this is too much of a limitation for
your use case, then never mind.
But if it is a good enough shortcut for the demo and can be extended later,
then why not?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-29 19:41     ` Amir Goldstein
@ 2025-06-09 22:31       ` Darrick J. Wong
  2025-06-10 10:59         ` Amir Goldstein
  2025-07-12 10:57       ` Amir Goldstein
  1 sibling, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2025-06-09 22:31 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
>  or
> 
> On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > DO NOT MERGE THIS.
> > > >
> > > > This is the very first request for comments of a prototype to connect
> > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > from files whose contents persist to locally attached storage devices.
> > > >
> > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > kernel compromise, and I think there's a very strong incentive to move
> > > > all that parsing out to userspace where we can containerize the fuse
> > > > server process.
> > > >
> > > > willy's folios conversion project (and to a certain degree RH's new
> > > > mount API) have also demonstrated that treewide changes to the core
> > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > because you have to understand every filesystem's bespoke use of that
> > > > core code.  Eeeugh.
> > > >
> > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > for very simple filesystems that don't do tricky things with mappings
> > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > but solving that is for the next sprint.
> > > >
> > > > With this overly simplistic RFC, I am to show that it's possible to
> > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > userspace yet maintains most of its performance.  At this early stage I
> > > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > the cover letter for the fuse2fs iomap changes for more details.
> > > >
> > >
> > > Very cool!
> > >
> > > > There are some major warts remaining:
> > > >
> > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > races between pagecache zeroing and writeback on filesystems that
> > > > support unwritten and delalloc mappings.
> > > >
> > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > >
> > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > yet figured out how inline data is supposed to work.
> > > >
> > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > which currently isn't possible because the kernel fuse driver will iget
> > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > inode it just read.
> > >
> > > Can you make the decision about enabling iomap on lookup?
> > > The plan for passthrough for inode operations was to allow
> > > setting up passthough config of inode on lookup.
> >
> > The main requirement (especially for buffered IO) is that we've set the
> > address space operations structure either to the regular fuse one or to
> > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > code assumes that cannot change on a live inode.
> >
> > So I /think/ we could ask the fuse server at inode instantiation time
> > (which, if I'm reading the code correctly, is when iget5_locked gives
> > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > to userspace at that time.  Alternately I guess we could extend struct
> > fuse_attr with another FUSE_ATTR_ flag, I think?
> >
> 
> The latter. Either extend fuse_attr or struct fuse_entry_out,
> which is in the responses of FUSE_LOOKUP,
> FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> which instantiate fuse inodes.
> 
> There is a very hand wavy discussion about this at:
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> 
> In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> command that uses the variable length file handle instead of nodeid
> as a key for the inode.
> 
> So we will have to extend fuse_entry_out anyway, but TBH I never got to
> look at the gritty details of how best to extend all the relevant commands,
> so I hope I am not sending you down the wrong path.

I found another twist to this story: the upper level libfuse3 library
assigns distinct nodeids for each directory entry.  These nodeids are
passed into the kernel and appear to be the basis for an iget5_locked call.
IOWs, each nodeid causes a struct fuse_inode to be created in the
kernel.

For a single-linked file this is no big deal, but for a hardlink this
makes iomap a mess because this means that in fuse2fs, an ext2 inode can
map to multiple kernel fuse_inode objects.  This /really/ breaks the
locking model of iomap, which assumes that there's one in-kernel inode
and that it can use i_rwsem to synchronize updates.

So I'm going to have to find a way to deal with this.  I tried trivially
messing with libfuse nodeid assignment but that blew some assertion.
Maybe your LOOKUP_HANDLE thing would work.
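
To make the hardlink problem concrete, here is a small self-contained C
model (not libfuse or kernel code; all sketch_ names are hypothetical).
It counts how many in-kernel inodes would exist if every directory entry
got its own nodeid, versus if the nodeid were derived from the ext2 inode
number.  The per-name nodeid values are arbitrary; only their
distinctness matters.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_INODES 16

struct sketch_dirent {
	const char *name;
	uint32_t ext2_ino;
};

/* Models the kernel's inode cache: one entry per distinct nodeid. */
struct sketch_icache {
	uint64_t nodeids[SKETCH_MAX_INODES];
	size_t count;
};

/* Models iget: reuse an existing inode for a nodeid, else create one. */
static void sketch_iget(struct sketch_icache *c, uint64_t nodeid)
{
	for (size_t i = 0; i < c->count; i++)
		if (c->nodeids[i] == nodeid)
			return;
	if (c->count < SKETCH_MAX_INODES)
		c->nodeids[c->count++] = nodeid;
}

/* One fresh nodeid per directory entry, as described above. */
size_t sketch_inodes_per_dirent(const struct sketch_dirent *d, size_t n)
{
	struct sketch_icache c = { .count = 0 };

	(void)d;	/* the ext2 inode number is ignored in this mode */
	for (size_t i = 0; i < n; i++)
		sketch_iget(&c, 1000 + i);
	return c.count;
}

/* nodeid derived from the ext2 inode number. */
size_t sketch_inodes_per_ino(const struct sketch_dirent *d, size_t n)
{
	struct sketch_icache c = { .count = 0 };

	for (size_t i = 0; i < n; i++)
		sketch_iget(&c, d[i].ext2_ino);
	return c.count;
}
```

With two names pointing at the same ext2 inode, the first mode ends up
with two in-kernel inodes (and therefore two independent i_rwsem locks)
for one on-disk file, while the second mode ends up with one.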

> > > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > > actually works correctly.
> > > >
> > > > 6. iomap is an inode-based service, not a file-based service.  This
> > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > to index its incore inode, so we have to pass those too so that
> > > > notifications work properly.
> > > >
> > >
> > > Again, I might be missing something, but as long as the fuse filesystem
> > > is exposing a single backing filesystem, it should be possible to make
> > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > > inode number.
> > > See sketch in this WIP branch:
> > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
> >
> > I think this would work in many places, except for filesystems with
> > 64-bit inumbers on 32-bit machines.  That might be a good argument for
> > continuing to pass along the nodeid and fuse_inode::orig_ino like it
> > does now.  Plus there are some filesystems that synthesize inode numbers
> > so tying the two together might not be feasible/desirable anyway.
> >
> > Though one nice feature of letting fuse have its own nodeids might be
> > that if the in-memory index switches to a tree structure, then it could
> > be more compact if the filesystem's inumbers are fairly sparse like xfs.
> > OTOH the current inode hashtable has been around for a very long time so
> > that might not be a big concern.  For fuse2fs it doesn't matter since
> > ext4 inumbers are u32.
> >
> 
> I wanted to see if declaring one-to-one 64bit ino can simplify things
> for the first version of inode ops passthrough.
> If this is not the case, or if this is too much of a limitation for
> your use case
> then nevermind.
> But if it is a good enough shortcut for the demo and can be extended later,
> then why not.

It's very tempting, because it's very confusing to have nodeids and
stat st_ino not be the same thing.

--D

> Thanks,
> Amir.
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-09 22:31       ` Darrick J. Wong
@ 2025-06-10 10:59         ` Amir Goldstein
  2025-06-10 19:00           ` Darrick J. Wong
  0 siblings, 1 reply; 55+ messages in thread
From: Amir Goldstein @ 2025-06-10 10:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> >  or
> >
> > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > DO NOT MERGE THIS.
> > > > >
> > > > > This is the very first request for comments of a prototype to connect
> > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > from files whose contents persist to locally attached storage devices.
> > > > >
> > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > server process.
> > > > >
> > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > because you have to understand every filesystem's bespoke use of that
> > > > > core code.  Eeeugh.
> > > > >
> > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > but solving that is for the next sprint.
> > > > >
> > > > > With this overly simplistic RFC, I am to show that it's possible to
> > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > >
> > > >
> > > > Very cool!
> > > >
> > > > > There are some major warts remaining:
> > > > >
> > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > support unwritten and delalloc mappings.
> > > > >
> > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > >
> > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > yet figured out how inline data is supposed to work.
> > > > >
> > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > inode it just read.
> > > >
> > > > Can you make the decision about enabling iomap on lookup?
> > > > The plan for passthrough for inode operations was to allow
> > > > setting up passthough config of inode on lookup.
> > >
> > > The main requirement (especially for buffered IO) is that we've set the
> > > address space operations structure either to the regular fuse one or to
> > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > code assumes that cannot change on a live inode.
> > >
> > > So I /think/ we could ask the fuse server at inode instantiation time
> > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > to userspace at that time.  Alternately I guess we could extend struct
> > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > >
> >
> > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > which is in the responses of FUSE_LOOKUP,
> > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> > which instantiate fuse inodes.
> >
> > There is a very hand wavy discussion about this at:
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> >
> > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > command that uses the variable length file handle instead of nodeid
> > as a key for the inode.
> >
> > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > look at the gritty details of how best to extend all the relevant commands,
> > so I hope I am not sending you down the wrong path.
>
> I found another twist to this story: the upper level libfuse3 library
> assigns distinct nodeids for each directory entry.  These nodeids are
> passed into the kernel and appear to be the basis for an iget5_locked call.
> IOWs, each nodeid causes a struct fuse_inode to be created in the
> kernel.
>
> For a single-linked file this is no big deal, but for a hardlink this
> makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> map to multiple kernel fuse_inode objects.  This /really/ breaks the
> locking model of iomap, which assumes that there's one in-kernel inode
> and that it can use i_rwsem to synchronize updates.
>
> So I'm going to have to find a way to deal with this.  I tried trivially
> messing with libfuse nodeid assignment but that blew some assertion.
> Maybe your LOOKUP_HANDLE thing would work.
>

Pull the emergency brake!

In an amateur move, I did not look at fuse2fs.c before commenting on your
work.

The high-level fuse interface is not the right tool for the job.
It's not even the easiest way to have written fuse2fs in the first place.

The high-level fuse API addresses filesystem objects by full path.
This is good for writing simple virtual filesystems, but it is neither
the correct nor the easiest choice for writing a userspace driver for ext4.

The low-level fuse interface addresses filesystem objects by nodeid
and requires the server to implement lookup(parent_nodeid, name),
where the server (not libfuse) gets to choose the nodeid.

The current fuse2fs code has to go to some effort to convert a full path
to inode + name using ext2fs_namei().

With the low-level fuse interface, op_lookup() could have used the native
ext2_lookup(), which would have been much more natural.

You can find the most featureful low-level fuse example at:
https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc

Among other things, the server has an inode cache, where each inode's
state includes 'nopen' (was this inode opened for io) and 'backing_id'
(was this inode mapped for kernel passthrough).

Currently this backing_id mapping is only made on first open of inode,
but the plan is to do that also at lookup time, for example, if the
iomap mode for the inode can be determined at lookup time.
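
The shape of such a server-side lookup can be sketched in plain C as
follows.  This is a model only, under stated assumptions: it does not
call libfuse or libext2fs, the directory is a flat in-memory table, and
every sketch_ name is hypothetical.  The 'nlookup' counter is an addition
for illustration; 'nopen' and 'backing_id' mirror the fields mentioned
above.  The server resolves (parent nodeid, name) to an inode number and
hands that number back as the nodeid, so two hardlinks resolve to the
same cached inode.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CACHE_MAX 8

struct sketch_dirent {
	uint64_t parent;	/* nodeid of the containing directory */
	const char *name;
	uint64_t ino;		/* inode number the name points at */
};

struct sketch_inode {
	uint64_t ino;
	uint64_t nlookup;	/* how many lookups returned this inode */
	int nopen;		/* open count, unused in this sketch */
	int backing_id;		/* passthrough mapping, unused here */
};

struct sketch_server {
	const struct sketch_dirent *dirents;
	size_t ndirents;
	struct sketch_inode cache[SKETCH_CACHE_MAX];
	size_t ncached;
};

/* Find the cached inode for ino, creating it on first use. */
static struct sketch_inode *sketch_cache_get(struct sketch_server *s,
					     uint64_t ino)
{
	for (size_t i = 0; i < s->ncached; i++)
		if (s->cache[i].ino == ino)
			return &s->cache[i];
	if (s->ncached == SKETCH_CACHE_MAX)
		return NULL;

	struct sketch_inode *inode = &s->cache[s->ncached++];
	inode->ino = ino;
	inode->nlookup = 0;
	inode->nopen = 0;
	inode->backing_id = 0;
	return inode;
}

/* Returns the nodeid chosen by the server, or 0 if the name is absent. */
uint64_t sketch_lookup(struct sketch_server *s, uint64_t parent,
		       const char *name)
{
	for (size_t i = 0; i < s->ndirents; i++) {
		const struct sketch_dirent *d = &s->dirents[i];

		if (d->parent != parent || strcmp(d->name, name) != 0)
			continue;

		struct sketch_inode *inode = sketch_cache_get(s, d->ino);
		if (!inode)
			return 0;
		inode->nlookup++;
		return inode->ino;	/* nodeid == inode number */
	}
	return 0;
}
```

Looking up two names that share an inode number yields the same nodeid
and a single cache entry whose nlookup is 2.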


> > > > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > > > actually works correctly.
> > > > >
> > > > > 6. iomap is an inode-based service, not a file-based service.  This
> > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > to index its incore inode, so we have to pass those too so that
> > > > > notifications work properly.
> > > > >
> > > >
> > > > Again, I might be missing something, but as long as the fuse filesystem
> > > > is exposing a single backing filesystem, it should be possible to make
> > > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > > > inode number.
> > > > See sketch in this WIP branch:
> > > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
> > >
> > > I think this would work in many places, except for filesystems with
> > > 64-bit inumbers on 32-bit machines.  That might be a good argument for
> > > continuing to pass along the nodeid and fuse_inode::orig_ino like it
> > > does now.  Plus there are some filesystems that synthesize inode numbers
> > > so tying the two together might not be feasible/desirable anyway.
> > >
> > > Though one nice feature of letting fuse have its own nodeids might be
> > > that if the in-memory index switches to a tree structure, then it could
> > > be more compact if the filesystem's inumbers are fairly sparse like xfs.
> > > OTOH the current inode hashtable has been around for a very long time so
> > > that might not be a big concern.  For fuse2fs it doesn't matter since
> > > ext4 inumbers are u32.
> > >
> >
> > I wanted to see if declaring one-to-one 64bit ino can simplify things
> > for the first version of inode ops passthrough.
> > If this is not the case, or if this is too much of a limitation for
> > your use case
> > then nevermind.
> > But if it is a good enough shortcut for the demo and can be extended later,
> > then why not.
>
> It's very tempting, because it's very confusing to have nodeids and
> stat st_ino not be the same thing.
>

Now that I have explained that fuse2fs should be low-level, it should be
easy to see that it would have no problem declaring to the kernel, via
a FUSE_PASSTHROUGH_INO flag, that nodeid == st_ino, because I see no
reason to implement fuse2fs with a non one-to-one mapping of
ino <==> nodeid.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-10 10:59         ` Amir Goldstein
@ 2025-06-10 19:00           ` Darrick J. Wong
  2025-06-10 19:51             ` Amir Goldstein
  2025-06-11 11:56             ` Theodore Ts'o
  0 siblings, 2 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-06-10 19:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote:
> On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> > >  or
> > >
> > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > DO NOT MERGE THIS.
> > > > > >
> > > > > > This is the very first request for comments of a prototype to connect
> > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > > from files whose contents persist to locally attached storage devices.
> > > > > >
> > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > server process.
> > > > > >
> > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > core code.  Eeeugh.
> > > > > >
> > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > > but solving that is for the next sprint.
> > > > > >
> > > > > > With this overly simplistic RFC, I am to show that it's possible to
> > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > > > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > > >
> > > > >
> > > > > Very cool!
> > > > >
> > > > > > There are some major warts remaining:
> > > > > >
> > > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > > support unwritten and delalloc mappings.
> > > > > >
> > > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > > >
> > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > > yet figured out how inline data is supposed to work.
> > > > > >
> > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > > inode it just read.
> > > > >
> > > > > Can you make the decision about enabling iomap on lookup?
> > > > > The plan for passthrough for inode operations was to allow
> > > > > setting up passthough config of inode on lookup.
> > > >
> > > > The main requirement (especially for buffered IO) is that we've set the
> > > > address space operations structure either to the regular fuse one or to
> > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > > code assumes that cannot change on a live inode.
> > > >
> > > > So I /think/ we could ask the fuse server at inode instantiation time
> > > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > > to userspace at that time.  Alternately I guess we could extend struct
> > > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > > >
> > >
> > > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > > which is in the responses of FUSE_LOOKUP,
> > > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> > > which instantiate fuse inodes.
> > >
> > > There is a very hand wavy discussion about this at:
> > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> > >
> > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > > command that uses the variable length file handle instead of nodeid
> > > as a key for the inode.
> > >
> > > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > > look at the gritty details of how best to extend all the relevant commands,
> > > so I hope I am not sending you down the wrong path.
> >
> > I found another twist to this story: the upper level libfuse3 library
> > assigns distinct nodeids for each directory entry.  These nodeids are
> > passed into the kernel and appear to be the basis for an iget5_locked call.
> > IOWs, each nodeid causes a struct fuse_inode to be created in the
> > kernel.
> >
> > For a single-linked file this is no big deal, but for a hardlink this
> > makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> > map to multiple kernel fuse_inode objects.  This /really/ breaks the
> > locking model of iomap, which assumes that there's one in-kernel inode
> > and that it can use i_rwsem to synchronize updates.
> >
> > So I'm going to have to find a way to deal with this.  I tried trivially
> > messing with libfuse nodeid assignment but that blew some assertion.
> > Maybe your LOOKUP_HANDLE thing would work.
> >
> 
> Pull the emergency brake!
> 
> In an amateur move, I did not look at fuse2fs.c before commenting on your
> work.
> 
> High level fuse interface is not the right tool for the job.
> It's not even the easiest way to have written fuse2fs in the first place.

At the time I thought it would minimize friction across multiple
operating systems' fuse implementations.

> High-level fuse API addresses file system objects with full paths.
> This is good for writing simple virtual filesystems, but it is neither
> the correct nor the easiest choice for writing a userspace driver for ext4.

Agreed, it's a *terrible* way to implement ext4.

I think, however, that Ted would like to maintain compatibility with
macfuse and freebsd(?) so he's been resistant to rewriting the entire
program to work with the lowlevel library.

That said, I decided just now to do some spelunking into those two fuse
ports and have discovered that freebsd[1] packages the same upstream
libfuse as linux, and macfuse[2] seems to vendor both libfuse 2 and 3.

[1] https://wiki.freebsd.org/FUSEFS
[2] https://github.com/macfuse/macfuse

Seeing as Debian 13 has killed off libfuse2 entirely, maybe I should
think about rewriting all of fuse2fs against the lowlevel library?  It's
really annoying to deal with all the problems of the current codebase.
I think I'll try to stabilize the current fuse+iomap code and then look
into a fuse2fs port.  What would we call it, fuse4fs? :D

> Low-level fuse interface addresses filesystem objects by nodeid
> and requires the server to implement lookup(parent_nodeid, name)
> where the server gets to choose the nodeid (not libfuse).

Does the nodeid for the root directory have to be FUSE_ROOT_ID?  I guess
for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
which cannot be accessed from userspace anyway.

> The current fuse2fs code has to go to some effort to convert from a
> full path to inode + name using ext2fs_namei().
> 
> With the low-level fuse API, op_lookup() could have used the native
> ext2_lookup(), which would have been much more natural.
> 
> You can find the most featureful low-level fuse example at:
> https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc
> 
> Among other things, the server has an inode cache, where an inode
> has in its state 'nopen' (was this inode opened for io) and 'backing_id'
> (was this inode mapped for kernel passthrough).
> 
> Currently this backing_id mapping is only made on first open of inode,
> but the plan is to do that also at lookup time, for example, if the
> iomap mode for the inode can be determined at lookup time.

<nod>

> > > > > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > > > > actually works correctly.
> > > > > >
> > > > > > 6. iomap is an inode-based service, not a file-based service.  This
> > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > notifications work properly.
> > > > > >
> > > > >
> > > > > Again, I might be missing something, but as long as the fuse filesystem
> > > > > is exposing a single backing filesystem, it should be possible to make
> > > > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > > > > inode number.
> > > > > See sketch in this WIP branch:
> > > > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
> > > >
> > > > I think this would work in many places, except for filesystems with
> > > > 64-bit inumbers on 32-bit machines.  That might be a good argument for
> > > > continuing to pass along the nodeid and fuse_inode::orig_ino like it
> > > > does now.  Plus there are some filesystems that synthesize inode numbers
> > > > so tying the two together might not be feasible/desirable anyway.
> > > >
> > > > Though one nice feature of letting fuse have its own nodeids might be
> > > > that if the in-memory index switches to a tree structure, then it could
> > > > be more compact if the filesystem's inumbers are fairly sparse like xfs.
> > > > OTOH the current inode hashtable has been around for a very long time so
> > > > that might not be a big concern.  For fuse2fs it doesn't matter since
> > > > ext4 inumbers are u32.
> > > >
> > >
> > > I wanted to see if declaring one-to-one 64bit ino can simplify things
> > > for the first version of inode ops passthrough.
> > > If this is not the case, or if this is too much of a limitation for
> > > your use case
> > > then nevermind.
> > > But if it is a good enough shortcut for the demo and can be extended later,
> > > then why not.
> >
> > It's very tempting, because it's very confusing to have nodeids and
> > stat st_ino not be the same thing.
> >
> 
> Now that I have explained that fuse2fs should be low-level, it should be
> trivial to claim that it should have no problem to declare via
> FUSE_PASSTHROUGH_INO flag to the kernel that nodeid == st_ino,
> because I see no reason to implement fuse2fs with non one-to-one
> mapping of ino <==> nodeid.

Agreed!  Thanks for the nudge!
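
To make the hardlink problem from earlier in this message concrete, here
is a toy model (not libfuse code) of the two nodeid policies.  All class
and function names are made up for illustration: PathKeyedNodeids stands
in for an allocator that hands out one nodeid per directory entry,
InoKeyedNodeids for a low-level server that declares nodeid == st_ino,
and the lock stored per nodeid stands in for the in-kernel inode's
i_rwsem.

```python
import threading

FUSE_ROOT_ID = 1  # nodeid reserved for the mount root


class PathKeyedNodeids:
    """Toy allocator: a fresh nodeid for every distinct path."""

    def __init__(self):
        self._next = FUSE_ROOT_ID + 1
        self._by_path = {}

    def lookup(self, path, ino):
        # The ext2 inode number plays no part in choosing the nodeid,
        # so two hardlinks to one inode end up with two nodeids.
        if path not in self._by_path:
            self._by_path[path] = self._next
            self._next += 1
        return self._by_path[path]


class InoKeyedNodeids:
    """Toy low-level server: the nodeid is simply the inode number."""

    def lookup(self, path, ino):
        return ino


class KernelInodeCache:
    """One in-core inode per nodeid, modelled as just its lock."""

    def __init__(self):
        self._rwsem = {}

    def iget(self, nodeid):
        return self._rwsem.setdefault(nodeid, threading.Lock())

    def count(self):
        return len(self._rwsem)


def incore_inodes(server, links):
    """Look up every link and report how many in-core inodes result."""
    cache = KernelInodeCache()
    for path, ino in links:
        cache.iget(server.lookup(path, ino))
    return cache.count()


# /a and /b are hardlinks to the same (made-up) ext2 inode 12.
LINKS = [("/a", 12), ("/b", 12)]

print(incore_inodes(PathKeyedNodeids(), LINKS))  # one file, two locks
print(incore_inodes(InoKeyedNodeids(), LINKS))   # one file, one lock
```

With the path-keyed policy the same file ends up behind two separate
locks, which is the situation iomap cannot tolerate; keying on the inode
number collapses both links onto one.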

Let's see what Ted thinks when he returns from vacation. :)

--D

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-10 19:00           ` Darrick J. Wong
@ 2025-06-10 19:51             ` Amir Goldstein
  2025-06-11  6:00               ` Darrick J. Wong
  2025-06-11 11:56             ` Theodore Ts'o
  1 sibling, 1 reply; 55+ messages in thread
From: Amir Goldstein @ 2025-06-10 19:51 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Tue, Jun 10, 2025 at 9:00 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Low-level fuse interface addresses filesystem objects by nodeid
> > and requires the server to implement lookup(parent_nodeid, name)
> > where the server gets to choose the nodeid (not libfuse).
>
> Does the nodeid for the root directory have to be FUSE_ROOT_ID?

Yeh, I think that's the case, otherwise FUSE_INIT would need to
tell the kernel the root nodeid, because there is no lookup to
return the root nodeid.

> I guess
> for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> which cannot be accessed from userspace anyway.
>

As long as inode #1 is reserved it should be fine.
We just need to refine the rules of the one-to-one mapping to account
for this exception.
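
One possible reading of that refined rule, as a sketch only: the
constants are the real values (FUSE_ROOT_ID is 1, the ext2 bad blocks
inode is 1, the ext2 root directory is 2), but the two helper functions
are hypothetical and not part of any existing API.

```python
FUSE_ROOT_ID = 1   # nodeid the kernel uses for the mount root
EXT2_BAD_INO = 1   # bad blocks inode, never visible to userspace
EXT2_ROOT_INO = 2  # ext2/3/4 root directory


def ino_to_nodeid(ino):
    """Map an ext2 inode number to the nodeid reported to the kernel."""
    if ino == EXT2_BAD_INO:
        raise ValueError("inode 1 is reserved and never gets a nodeid")
    if ino == EXT2_ROOT_INO:
        return FUSE_ROOT_ID  # the single exception to nodeid == ino
    return ino


def nodeid_to_ino(nodeid):
    """Map a nodeid received from the kernel back to an ext2 inode."""
    if nodeid == FUSE_ROOT_ID:
        return EXT2_ROOT_INO
    if nodeid == EXT2_ROOT_INO:
        # The root is exposed as nodeid 1, so nodeid 2 is never handed out.
        raise ValueError("nodeid 2 is never assigned")
    return nodeid
```

Under this reading the root directory is the only object whose nodeid
differs from its st_ino; every other inode maps to itself.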

Thanks,
Amir.

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-10 19:51             ` Amir Goldstein
@ 2025-06-11  6:00               ` Darrick J. Wong
  2025-06-11  8:54                 ` Amir Goldstein
  0 siblings, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2025-06-11  6:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Tue, Jun 10, 2025 at 09:51:55PM +0200, Amir Goldstein wrote:
> On Tue, Jun 10, 2025 at 9:00 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > > Low-level fuse interface addresses filesystem objects by nodeid
> > > and requires the server to implement lookup(parent_nodeid, name)
> > > where the server gets to choose the nodeid (not libfuse).
> >
> > Does the nodeid for the root directory have to be FUSE_ROOT_ID?
> 
> Yeh, I think that's the case, otherwise FUSE_INIT would need to
> tell the kernel the root nodeid, because there is no lookup to
> return the root nodeid.
> 
> > I guess
> > for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> > which cannot be accessed from userspace anyway.
> >
> 
> As long as inode #1 is reserved it should be fine.
> We just need to refine the rules of the one-to-one mapping to account
> for this exception.

Or just make it so that passthrough_ino filesystems can specify the
rootdir inumber?

--D

> Thanks,
> Amir.
> 

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-11  6:00               ` Darrick J. Wong
@ 2025-06-11  8:54                 ` Amir Goldstein
  2025-06-12  5:54                   ` Miklos Szeredi
  0 siblings, 1 reply; 55+ messages in thread
From: Amir Goldstein @ 2025-06-11  8:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

> > > Does the nodeid for the root directory have to be FUSE_ROOT_ID?
> >
> > Yeh, I think that's the case, otherwise FUSE_INIT would need to
> > tell the kernel the root nodeid, because there is no lookup to
> > return the root nodeid.
> >
> > > I guess
> > > for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> > > which cannot be accessed from userspace anyway.
> > >
> >
> > As long as inode #1 is reserved it should be fine.
> > We just need to refine the rules of the one-to-one mapping to account
> > for this exception.
>
> Or just make it so that passthrough_ino filesystems can specify the
> rootdir inumber?
>

There is already a mount option 'rootmode' for the st_mode of the root
inode, so I suppose we could add a 'rootino' mount option.

Note that currently fuse_fill_super_common() instantiates the root inode
before negotiating FUSE_INIT with the server.
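
Because the root inode exists before FUSE_INIT is negotiated, its number
would have to arrive with the mount data rather than in the INIT reply.
A rough sketch of what parsing that could look like, modelled in Python
rather than kernel C: 'fd', 'rootmode', 'user_id' and 'group_id' are
existing fuse mount options (rootmode is octal), while 'rootino' is
purely hypothetical here and defaults to FUSE_ROOT_ID when absent.

```python
FUSE_ROOT_ID = 1  # default root nodeid when no rootino is supplied


def parse_fuse_opts(data):
    """Parse a comma-separated fuse mount option string into a dict."""
    opts = {"rootino": FUSE_ROOT_ID}
    for item in data.split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        if key == "rootmode":
            opts[key] = int(value, 8)   # rootmode is given in octal
        elif key in ("fd", "user_id", "group_id", "rootino"):
            opts[key] = int(value)      # rootino: hypothetical option
        else:
            opts[key] = value or True   # flag-style or unknown options
    return opts


print(parse_fuse_opts("fd=3,rootmode=40000,user_id=0,group_id=0,rootino=2"))
```

A server for ext4 would pass rootino=2 so that the root directory's
nodeid matches its on-disk inode number from the start.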

Thanks,
Amir.

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-10 19:00           ` Darrick J. Wong
  2025-06-10 19:51             ` Amir Goldstein
@ 2025-06-11 11:56             ` Theodore Ts'o
  2025-06-12  3:20               ` Darrick J. Wong
  2025-06-20  8:58               ` Allison Karlitskaya
  1 sibling, 2 replies; 55+ messages in thread
From: Theodore Ts'o @ 2025-06-11 11:56 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Allison Karlitskaya

+Allison Karlitskaya

On Tue, Jun 10, 2025 at 12:00:26PM -0700, Darrick J. Wong wrote:
> > High level fuse interface is not the right tool for the job.
> > It's not even the easiest way to have written fuse2fs in the first place.
> 
> At the time I thought it would minimize friction across multiple
> operating systems' fuse implementations.
> 
> > The high-level fuse API addresses file system objects with full paths.
> > This is good for writing simple virtual filesystems, but it is neither
> > the correct nor the easiest choice for writing a userspace driver for ext4.
> 
> Agreed, it's a *terrible* way to implement ext4.
> 
> I think, however, that Ted would like to maintain compatibility with
> macfuse and freebsd(?) so he's been resistant to rewriting the entire
> program to work with the lowlevel library.

My priority is to make sure that we have compatibility with other OS's
(in particular MacOS, FreeBSD, if possible Windows, although that's
not something that I develop against or have test vehicles to
validate).  However, from what I can tell, they all support Fuse3 at
this point --- MacFuse, FreeBSD, and WinFSP all have Fuse3 support as
of today.

The only complaint that I've had about breaking Fuse2 support was from
Allison (Cc'ed), who was involved with another Github project whose
Github Actions broke because they were using a very old version of
Ubuntu (LTS 20.04), which only had support for libfuse2.  I am going to
assume that this is probably only because they hadn't bothered to
update their .github/workflows/ci.yaml file, and not because there was
any inherent requirement that we support ancient versions of Linux
distributions.  (When I was at IBM, I remember having to support
customers who used RHEL4, and even in one extreme case, RHEL3, because
there was a customer paying $$$$$ who refused to update; but that was
well over a decade ago, and at this point, I'm finding it a lot harder
to care about that.  :-)

After I release 1.47.2 (which will have some interesting data
corruption bugfixes thanks to Darrick and other users using fuse2fs in
deadly earnest, as opposed to as a lightweight way to copy files in and
out of a file system image), I plan to transition the master and next
branches to the future 1.48 release, and the maint branch will carry
bug fixes for 1.47.N releases.

At that point, unless I hear some very strong arguments against, for
1.48, my current thinking is that we will drop support for Fuse2.  I
will still care about making sure that fuse2fs will build and work
well enough that casual file copies work on MacOS and FreeBSD, and
I'll accept patches that make fuse2fs work with WinFSP.  In practice,
this means that Linux-specific things like Verity support will need to
be #ifdef'ed so that they will build against MacFUSE, and I assume the
same will be true for fuseblk mode and iomap mode(?).

This may break the github actions for composefs-rs[1], but I'm going
to assume that they can figure out a way to transition to Fuse3
(hopefully by just using a newer version of Ubuntu, but I suppose it's
possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
in any case, I don't think it makes sense to hold back fuse2fs
development just for the sake of Ubuntu Focal (LTS 20.04).  And if
necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
sound fair to you?

[1] https://github.com/containers/composefs-rs

Does anyone else have any objections to dropping Fuse2 support?  And
is that sufficient for folks to more easily support iomap mode in
fuse2fs?

Cheers,

							- Ted

P.S.  Greetings from Greenland.  :-)  (We're currently in the middle of
a cruise that started in Iceland, and will be ending in New York City
next week.)

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-11 11:56             ` Theodore Ts'o
@ 2025-06-12  3:20               ` Darrick J. Wong
  2025-06-12  6:10                 ` Amir Goldstein
  2025-06-20  8:58               ` Allison Karlitskaya
  1 sibling, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2025-06-12  3:20 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Allison Karlitskaya

On Wed, Jun 11, 2025 at 10:56:29AM -0100, Theodore Ts'o wrote:
> +Allison Karlitskaya
> 
> On Tue, Jun 10, 2025 at 12:00:26PM -0700, Darrick J. Wong wrote:
> > > High level fuse interface is not the right tool for the job.
> > > It's not even the easiest way to have written fuse2fs in the first place.
> > 
> > At the time I thought it would minimize friction across multiple
> > operating systems' fuse implementations.
> > 
> > > The high-level fuse API addresses file system objects with full paths.
> > > This is good for writing simple virtual filesystems, but it is neither
> > > the correct nor the easiest choice for writing a userspace driver for ext4.
> > 
> > Agreed, it's a *terrible* way to implement ext4.
> > 
> > I think, however, that Ted would like to maintain compatibility with
> > macfuse and freebsd(?) so he's been resistant to rewriting the entire
> > program to work with the lowlevel library.
> 
> My priority is to make sure that we have compatibility with other OS's
> (in particular MacOS, FreeBSD, if possible Windows, although that's
> not something that I develop against or have test vehicles to
> validate).  However, from what I can tell, they all support Fuse3 at
> this point --- MacFuse, FreeBSD, and WinFSP all have Fuse3 support as
> of today.
> 
> The only complaint that I've had about breaking Fuse2 support was from
> Allison (Cc'ed), who was involved with another Github project whose
> Github Actions broke because they were using a very old version of
> Ubuntu (LTS 20.04), which only had support for libfuse2.  I am going to
> assume that this is probably only because they hadn't bothered to
> update their .github/workflows/ci.yaml file, and not because there was
> any inherent requirement that we support ancient versions of Linux
> distributions.  (When I was at IBM, I remember having to support
> customers who used RHEL4, and even in one extreme case, RHEL3, because
> there was a customer paying $$$$$ who refused to update; but that was
> well over a decade ago, and at this point, I'm finding it a lot harder
> to care about that.  :-)
> 
> After I release 1.47.2 (which will have some interesting data
> corruption bugfixes thanks to Darrick and other users using fuse2fs in
> deadly earnest, as opposed to as a lightweight way to copy files in and
> out of a file system image), I plan to transition the master and next
> branches to the future 1.48 release, and the maint branch will carry
> bug fixes for 1.47.N releases.
> 
> At that point, unless I hear some very strong arguments against, for
> 1.48, my current thinking is that we will drop support for Fuse2.  I
> will still care about making sure that fuse2fs will build and work
> well enough that casual file copies work on MacOS and FreeBSD, and
> I'll accept patches that make fuse2fs work with WinFSP.  In practice,
> this means that Linux-specific things like Verity support will need to
> be #ifdef'ed so that they will build against MacFUSE, and I assume the
> same will be true for fuseblk mode and iomap mode(?).

<nod> I might just drop fuseblk mode since it's unusable for
unprivileged userspace and regular files; and is a real pain even for
"I'm pretending to be the kernel" mode.

> This may break the github actions for composefs-rs[1], but I'm going
> to assume that they can figure out a way to transition to Fuse3
> (hopefully by just using a newer version of Ubuntu, but I suppose it's
> possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> in any case, I don't think it makes sense to hold back fuse2fs
> development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> sound fair to you?
> 
> [1] https://github.com/containers/composefs-rs
> 
> Does anyone else have any objections to dropping Fuse2 support?  And
> is that sufficient for folks to more easily support iomap mode in
> fuse2fs?

I don't have any objections to cleaning the fuse2 crud out of fuse2fs.

I /do/ worry that rewriting fuse2fs to target the lowlevel fuse3 library
instead of the highlevel one is going to break the !linux platforms.
Although I *think* macfuse and freebsd fuse actually support the
lowlevel library and will be ok, I do worry that we might lose windows
support.  I can't tell if winfsp or dokan are what you're supposed to
use there, but afaict neither of them support the lowlevel interface.

That said, I could just fork fuse2fs and make the fork ("fuse4fs") talk
to the lowlevel library, and we can see what happens when/if people try
to build it on those platforms.

(Though again I have zero capacity to build macos or windows programs...)

TBH it might be a huge relief to just start with a new fuse4fs codebase
where I can focus on making iomap the single IO path that works really
well, rather than try to support the existing one.  There's a lot of IO
manager changes in the fuse2fs+iomap prototype that I think just go away
if you don't need to support doing the file IO yourself.

Any code that's shareable between fuse[24]fs should of course get split
out, which should ease the maintenance burden of having two fuse
servers.  Most of fuse2fs' "smarts" are just calling libext2fs anyway.
Maybe someday we can pull an egcs. :P

> Cheers,
> 
> 							- Ted
> 
> P.S.  Greetings from Greenland.  :-)  (We're currently in the middle of
> a cruise that started in Iceland, and will be ending in New York City
> next week.)

Heh, enjoy your cruise!!

--D

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-11  8:54                 ` Amir Goldstein
@ 2025-06-12  5:54                   ` Miklos Szeredi
  2025-06-13 17:44                     ` Darrick J. Wong
  0 siblings, 1 reply; 55+ messages in thread
From: Miklos Szeredi @ 2025-06-12  5:54 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, linux-fsdevel, John, bernd, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o

On Wed, 11 Jun 2025 at 10:54, Amir Goldstein <amir73il@gmail.com> wrote:

> There is already a mount option 'rootmode' for st_mode of root inode
> so I suppose we could add the rootino mount option.
>
> Note that currently fuse_fill_super_common() instantiates the root inode
> before negotiating FUSE_INIT with the server.

I'd prefer not to add more mount options like this.

It would be nice to move away from async FUSE_INIT.  It's one of those
things I wish I'd done differently.

Unfortunately I don't think adding FUSE_INIT_SYNC would be sufficient,
as servers might expect the first request to be always FUSE_INIT and
break if it isn't.   Libfuse seems to be okay, but...

One idea is to add an ioctl that the server would call before
mounting, that explicitly allows FUSE_INIT_SYNC.  It's somewhat ugly,
but I can't think of a better solution.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-12  3:20               ` Darrick J. Wong
@ 2025-06-12  6:10                 ` Amir Goldstein
  0 siblings, 0 replies; 55+ messages in thread
From: Amir Goldstein @ 2025-06-12  6:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Theodore Ts'o, linux-fsdevel, John, bernd, miklos,
	joannelkoong, Josef Bacik, linux-ext4, Allison Karlitskaya

On Thu, Jun 12, 2025 at 5:20 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jun 11, 2025 at 10:56:29AM -0100, Theodore Ts'o wrote:
> > +Allison Karlitskaya
> >
> > On Tue, Jun 10, 2025 at 12:00:26PM -0700, Darrick J. Wong wrote:
> > > > High level fuse interface is not the right tool for the job.
> > > > It's not even the easiest way to have written fuse2fs in the first place.
> > >
> > > At the time I thought it would minimize friction across multiple
> > > operating systems' fuse implementations.
> > >
> > > > High-level fuse API addresses file system objects with full paths.
> > > > This is good for writing simple virtual filesystems, but it is not the
> > > > correct nor is the easiest choice to write a userspace driver for ext4.
> > >
> > > Agreed, it's a *terrible* way to implement ext4.
> > >
> > > I think, however, that Ted would like to maintain compatibility with
> > > macfuse and freebsd(?) so he's been resistant to rewriting the entire
> > > program to work with the lowlevel library.
> >
> > My priority is to make sure that we have compatibility with other OS's
> > (in particular MacOS, FreeBSD, if possible Windows, although that's
> > not something that I develop against or have test vehicles to
> > validate).  However, from what I can tell, they all support Fuse3 at
> > this point --- MacFuse, FreeBSD, and WinFSP all have Fuse3 support as
> > of today.
> >
> > The only complaint that I've had about breaking support using Fuse2
> > was from Allison (Cc'ed), who was involved with another Github
> > project, whose Github Actions break because they were using a very old
> > version of Ubuntu (LTS 20.04), which only had support for libfuse2.  I
> > am going to assume that this is probably only because they hadn't
> > bothered to update their .github/workflows/ci.yaml file, and not
> > because there was any inherent requirement that we support ancient
> > versions of Linux distributions.  (When I was at IBM, I remember
> > having to support customers who used RHEL4, and even in one extreme
> > case, RHEL3 because there was a customer paying $$$$$ that refused to
> > update; but that was well over a decade ago, and at this point, I'm
> > finding it a lot harder to care about that.  :-)
> >
> > My plan is that after I release 1.47.2 (which will have some
> > interesting data corruption bugfixes thanks to Darrick and other users
> > using fuse2fs in deadly earnest, as opposed to as a lightweight way to
> > copy files in and out of a file system image), I plan to transition
> > the master and next branches for the future 1.48 release, and the
> > maint branch will have bug fixes for 1.47.N releases.
> >
> > At that point, unless I hear some very strong arguments against, for
> > 1.48, my current thinking is that we will drop support for Fuse2.  I
> > will still care about making sure that fuse2fs will build and work
> > well enough that casual file copies work on MacOS and FreeBSD, and
> > I'll accept patches that make fuse2fs work with WinFSP.  In practice,
> > this means that Linux-specific things like Verity support will need to
> > be #ifdef'ed so that they will build against MacFUSE, and I assume the
> > same will be true for fuseblk mode and iomap mode(?).
>
> <nod> I might just drop fuseblk mode since it's unusable for
> unprivileged userspace and regular files; and is a real pain even for
> "I'm pretending to be the kernel" mode.
>
> > This may break the github actions for composefs-rs[1], but I'm going
> > to assume that they can figure out a way to transition to Fuse3
> > (hopefully by just using a newer version of Ubuntu, but I suppose it's
> > possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> > in any case, I don't think it makes sense to hold back fuse2fs
> > development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> > necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> > they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> > sound fair to you?
> >
> > [1] https://github.com/containers/composefs-rs
> >
> > Does anyone else have any objections to dropping Fuse2 support?  And
> > is that sufficient for folks to more easily support iomap mode in
> > fuse2fs?
>
> I don't have any objections to cleaning the fuse2 crud out of fuse2fs.
>
> I /do/ worry that rewriting fuse2fs to target the lowlevel fuse3 library
> instead of the highlevel one is going to break the !linux platforms.
> Although I *think* macfuse and freebsd fuse actually support the
> lowlevel library and will be ok, I do worry that we might lose windows
> support.  I can't tell if winfsp or dokan are what you're supposed to
> use there, but afaict neither of them support the lowlevel interface.
>
> That said, I could just fork fuse2fs and make the fork ("fuse4fs") talk
> to the lowlevel library, and we can see what happens when/if people try
> to build it on those platforms.
>
> (Though again I have zero capacity to build macos or windows programs...)
>
> TBH it might be a huge relief to just start with a new fuse4fs codebase
> where I can focus on making iomap the single IO path that works really
> well, rather than try to support the existing one.  There's a lot of IO
> manager changes in the fuse2fs+iomap prototype that I think just go away
> if you don't need to support doing the file IO yourself.
>
> Any code that's shareable between fuse[24]fs should of course get split
> out, which should ease the maintenance burden of having two fuse
> servers.  Most of fuse2fs' "smarts" are just calling libext2fs anyway.

That seems like a good way to focus your energy on the important
goals. I like it.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC[RAP] V2] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
  2025-05-29 16:45   ` Darrick J. Wong
@ 2025-06-13 17:37   ` Darrick J. Wong
  2025-06-23 13:16     ` Miklos Szeredi
  1 sibling, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2025-06-13 17:37 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o, Matthew Wilcox

On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi everyone,
> >
> > DO NOT MERGE THIS.

Three weeks later, I've mostly gotten the iomap caching working.  This
is probably most exciting for John, because we were talking earlier
about uploading storage mappings to the fuse driver and this is what
I've come up with.  I'm running around trying to fix all the stuff that
doesn't quite work right.

Top of that list is timestamps and file attributes, because fuse no
longer calls the fuse server for file writes.  As a result, the kernel
inode always has the most up-to-date versions of some file attributes
(i_size, timestamps, mode) and we just want to send FUSE_SETATTR whenever
the dirty inode gets flushed.

After I get that working I'm going to have to rewrite fuse2fs (or more
likely just fork it) to be a lowlevel driver because as I've noted
elsewhere in this thread, the upper level fuse library can assign
multiple fuse nodeids for a single hardlinked inode.  The only reason
that worked for non-iomap fuse2fs is because we have a BKL and disable
all caching.

For fuse+iomap, this discrepancy between fuse nodeids and ext2 inumbers
means that iomap just plain doesn't work with hardlinks because there
are multiple struct fuse_inodes for each ondisk inode and the locking is
just broken; and the iomap callouts are per-inode, not per-file which
leads to multiple layering violations in the upper level fuse library.
Also as Amir points out, path lookups on every operation are just *slow*.
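
To illustrate the nodeid problem described above, here is a toy Python
model (not libfuse code).  It assumes, as a simplification, that the
highlevel library hands out one nodeid per (parent, name) directory
entry, while a lowlevel server is free to key its table on the ext2
inode number.  The class and method names are made up for this sketch.

```python
from itertools import count


class PathKeyedTable:
    """Toy stand-in for a highlevel-style table: one nodeid per name."""

    def __init__(self):
        self._next = count(2)   # nodeid 1 is reserved for the root
        self._by_name = {}

    def lookup(self, parent, name, ino):
        # The on-disk inode number is ignored when choosing the nodeid.
        key = (parent, name)
        if key not in self._by_name:
            self._by_name[key] = next(self._next)
        return self._by_name[key]


class InodeKeyedTable:
    """Toy stand-in for a lowlevel server: one nodeid per on-disk inode."""

    def __init__(self):
        self._next = count(2)
        self._by_ino = {}

    def lookup(self, parent, name, ino):
        # Every name that resolves to the same inode shares one nodeid.
        if ino not in self._by_ino:
            self._by_ino[ino] = next(self._next)
        return self._by_ino[ino]


if __name__ == "__main__":
    # "a" and "b" are hardlinks to the same (hypothetical) inode 12.
    hl, ll = PathKeyedTable(), InodeKeyedTable()
    print("path-keyed: ", hl.lookup(1, "a", 12), hl.lookup(1, "b", 12))
    print("inode-keyed:", ll.lookup(1, "a", 12), ll.lookup(1, "b", 12))
```

In the path-keyed model the two hardlinks end up as two separate kernel
inodes, which is the situation in which per-inode locking and per-inode
iomap callouts stop making sense.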

Interim branches can be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache_2025-06-13
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/libfuse.git/log/?h=fuse-iomap-cache_2025-06-13
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-cache_2025-06-13
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuse2fs_2025-06-13

(I'm not going to respam the list with patches right now because the
quality as told by fstests isn't quite where I want it to be for such a
thing.  fuse2fs+iomap passes 87% of fstests (down from 89% without
iomap) but that's still pretty bad.)

--D

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-12  5:54                   ` Miklos Szeredi
@ 2025-06-13 17:44                     ` Darrick J. Wong
  0 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-06-13 17:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o

On Thu, Jun 12, 2025 at 07:54:12AM +0200, Miklos Szeredi wrote:
> On Wed, 11 Jun 2025 at 10:54, Amir Goldstein <amir73il@gmail.com> wrote:
> 
> > There is already a mount option 'rootmode' for st_mode of root inode
> > so I suppose we could add the rootino mount option.
> >
> > Note that currently fuse_fill_super_common() instantiates the root inode
> > before negotiating FUSE_INIT with the server.
> 
> I'd prefer not to add more mount options like this.
> 
> It would be nice to move away from async FUSE_INIT.  It's one of those
> things I wish I'd done differently.
> 
> Unfortunately I don't think adding FUSE_INIT_SYNC would be sufficient,
> as servers might expect the first request to be always FUSE_INIT and
> break if it isn't.   Libfuse seems to be okay, but...
> 
> One idea is to add an ioctl that the server would call before
> mounting, that explicitly allows FUSE_INIT_SYNC.  It's somewhat ugly,
> but I can't think of a better solution.

Hmm, well for iomap the fuse server kinda wants to know if the kernel is
going to accept iomap prior to initializing the filesystem, so it
wouldn't be that weird to have it set a "send INIT_SYNC" flag.

If one were to add an INIT_SYNC upcall, where would the callsite be?
Somewhere just prior to where we need to open the root file?  And would
you want to add more fields to it?  Or just use the same struct and
flags as the existing INIT call?

--D

> 
> Thanks,
> Miklos
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-11 11:56             ` Theodore Ts'o
  2025-06-12  3:20               ` Darrick J. Wong
@ 2025-06-20  8:58               ` Allison Karlitskaya
  2025-06-20 11:50                 ` Bernd Schubert
  2025-07-01  5:58                 ` Darrick J. Wong
  1 sibling, 2 replies; 55+ messages in thread
From: Allison Karlitskaya @ 2025-06-20  8:58 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Darrick J. Wong, Amir Goldstein, linux-fsdevel, John, bernd,
	miklos, joannelkoong, Josef Bacik, linux-ext4

hi Ted,

Sorry I didn't see this earlier.  I've been travelling.

On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
> This may break the github actions for composefs-rs[1], but I'm going
> to assume that they can figure out a way to transition to Fuse3
> (hopefully by just using a newer version of Ubuntu, but I suppose it's
> possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> in any case, I don't think it makes sense to hold back fuse2fs
> development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> sound fair to you?

To be honest, with a composefs-rs hat on, I don't care at all about
fuse support for ext2/3/4 (although I think it's cool that it exists).
We also use fuse in composefs-rs for unrelated reasons, but even there
we use the fuser rust crate which has a "pure rust" direct syscall
layer that no longer depends on libfuse.  Our use of e2fsprogs is
strictly related to building testing images in CI, and for that we
only use mkfs.ext4.  There's also no specific reason that we're using
old Ubuntu.  I probably just copy-pasted it from another project
without paying too much attention.

Thanks for asking, though!

lis


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-20  8:58               ` Allison Karlitskaya
@ 2025-06-20 11:50                 ` Bernd Schubert
  2025-07-01  6:02                   ` Darrick J. Wong
  2025-07-01  5:58                 ` Darrick J. Wong
  1 sibling, 1 reply; 55+ messages in thread
From: Bernd Schubert @ 2025-06-20 11:50 UTC (permalink / raw)
  To: Allison Karlitskaya, Theodore Ts'o
  Cc: Darrick J. Wong, Amir Goldstein, linux-fsdevel, John, miklos,
	joannelkoong, Josef Bacik, linux-ext4



On 6/20/25 10:58, Allison Karlitskaya wrote:
> hi Ted,
> 
> Sorry I didn't see this earlier.  I've been travelling.
> 
> On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
>> This may break the github actions for composefs-rs[1], but I'm going
>> to assume that they can figure out a way to transition to Fuse3
>> (hopefully by just using a newer version of Ubuntu, but I suppose it's
>> possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
>> in any case, I don't think it makes sense to hold back fuse2fs
>> development just for the sake of Ubuntu Focal (LTS 20.04).  And if
>> necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
>> they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
>> sound fair to you?
> 
> To be honest, with a composefs-rs hat on, I don't care at all about
> fuse support for ext2/3/4 (although I think it's cool that it exists).
> We also use fuse in composefs-rs for unrelated reasons, but even there
> we use the fuser rust crate which has a "pure rust" direct syscall
> layer that no longer depends on libfuse.  Our use of e2fsprogs is
> strictly related to building testing images in CI, and for that we
> only use mkfs.ext4.  There's also no specific reason that we're using
> old Ubuntu.  I probably just copy-pasted it from another project
> without paying too much attention.


From libfuse point of view I'm too happy about that split into different
libraries. Libfuse already right now misses several features because
they were added to virtiofs, but not to libfuse. I need to find the time
for it, but I guess it makes sense to add rust support to libfuse (and
some parts can be entirely rewritten into rust).



Thanks,
Bernd

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP] V2] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-13 17:37   ` [RFC[RAP] V2] " Darrick J. Wong
@ 2025-06-23 13:16     ` Miklos Szeredi
  2025-07-01  6:05       ` Darrick J. Wong
  0 siblings, 1 reply; 55+ messages in thread
From: Miklos Szeredi @ 2025-06-23 13:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Matthew Wilcox

On Fri, 13 Jun 2025 at 19:37, Darrick J. Wong <djwong@kernel.org> wrote:

> Top of that list is timestamps and file attributes, because fuse no
> longer calls the fuse server for file writes.  As a result, the kernel
> inode always has the most up-to-date versions of some file attributes
> (i_size, timestamps, mode) and we just want to send FUSE_SETATTR whenever
> the dirty inode gets flushed.

This is already the case for cached writes, no new code should be needed.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-20  8:58               ` Allison Karlitskaya
  2025-06-20 11:50                 ` Bernd Schubert
@ 2025-07-01  5:58                 ` Darrick J. Wong
  1 sibling, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-07-01  5:58 UTC (permalink / raw)
  To: Allison Karlitskaya
  Cc: Theodore Ts'o, Amir Goldstein, linux-fsdevel, John, bernd,
	miklos, joannelkoong, Josef Bacik, linux-ext4

On Fri, Jun 20, 2025 at 10:58:38AM +0200, Allison Karlitskaya wrote:
> hi Ted,
> 
> Sorry I didn't see this earlier.  I've been travelling.
> 
> On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
> > This may break the github actions for composefs-rs[1], but I'm going
> > to assume that they can figure out a way to transition to Fuse3
> > (hopefully by just using a newer version of Ubuntu, but I suppose it's
> > possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> > in any case, I don't think it makes sense to hold back fuse2fs
> > development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> > necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> > they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> > sound fair to you?
> 
> To be honest, with a composefs-rs hat on, I don't care at all about
> fuse support for ext2/3/4 (although I think it's cool that it exists).
> We also use fuse in composefs-rs for unrelated reasons, but even there
> we use the fuser rust crate which has a "pure rust" direct syscall

Aha, I just stumbled upon that crate.  There are ... too many things on
crates.io that claim to be fuse libraries/wrappers/etc.

It's tempting to go write fuse4fs as an iomap-only Rust server, but I
never quite got the hang of configuring cargo to link against a locally
built .so in the same source tree (i.e. when I was trying to link
xfs_healer against libhandle that ships as part of xfsprogs).  I'm not
even sure I want to explore exposing libext2fs in a Rust-safe way.

> layer that no longer depends on libfuse.  Our use of e2fsprogs is
> strictly related to building testing images in CI, and for that we
> only use mkfs.ext4.  There's also no specific reason that we're using
> old Ubuntu.  I probably just copy-pasted it from another project
> without paying too much attention.
> 
> Thanks for asking, though!

I'm glad to hear that e2fsprogs can drop fuse2 support! :)

--D

> lis
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-20 11:50                 ` Bernd Schubert
@ 2025-07-01  6:02                   ` Darrick J. Wong
  0 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-07-01  6:02 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Allison Karlitskaya, Theodore Ts'o, Amir Goldstein,
	linux-fsdevel, John, miklos, joannelkoong, Josef Bacik,
	linux-ext4

On Fri, Jun 20, 2025 at 01:50:20PM +0200, Bernd Schubert wrote:
> 
> 
> On 6/20/25 10:58, Allison Karlitskaya wrote:
> > hi Ted,
> > 
> > Sorry I didn't see this earlier.  I've been travelling.
> > 
> > On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
> > > This may break the github actions for composefs-rs[1], but I'm going
> > > to assume that they can figure out a way to transition to Fuse3
> > > (hopefully by just using a newer version of Ubuntu, but I suppose it's
> > > possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> > > in any case, I don't think it makes sense to hold back fuse2fs
> > > development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> > > necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> > > they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> > > sound fair to you?
> > 
> > To be honest, with a composefs-rs hat on, I don't care at all about
> > fuse support for ext2/3/4 (although I think it's cool that it exists).
> > We also use fuse in composefs-rs for unrelated reasons, but even there
> > we use the fuser rust crate which has a "pure rust" direct syscall
> > layer that no longer depends on libfuse.  Our use of e2fsprogs is
> > strictly related to building testing images in CI, and for that we
> > only use mkfs.ext4.  There's also no specific reason that we're using
> > old Ubuntu.  I probably just copy-pasted it from another project
> > without paying too much attention.
> 
> 
> From libfuse point of view I'm too happy about that split into different

"too happy"?  I would have thought you would /not/ be too happy about
splits... <confused>

> libraries. Libfuse already right now misses several features because
> they were added to virtiofs, but not to libfuse. I need to find the time
> for it, but I guess it makes sense to add rust support to libfuse (and
> some parts can be entirely rewritten into rust).

Yeah, I noticed a few missing pieces like statx and syncfs support,
which I added to my own libfuse branch (+ fuse2fs).  Eventually I'd like
to get the kernel umount code to flush and wait for all pending fuse
commands, issue a FUSE_SYNCFS and wait for that, and then issue a
FUSE_DESTROY to tell the fuse server to tear itself down and release the
block device(s) it's holding.

--D

> 
> 
> Thanks,
> Bernd
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP] V2] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-23 13:16     ` Miklos Szeredi
@ 2025-07-01  6:05       ` Darrick J. Wong
  0 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-07-01  6:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Matthew Wilcox

On Mon, Jun 23, 2025 at 03:16:53PM +0200, Miklos Szeredi wrote:
> On Fri, 13 Jun 2025 at 19:37, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > Top of that list is timestamps and file attributes, because fuse no
> > longer calls the fuse server for file writes.  As a result, the kernel
> > inode always has the most up-to-date versions of some file attributes
> > (i_size, timestamps, mode) and we just want to send FUSE_SETATTR whenever
> > the dirty inode gets flushed.
> 
> This is already the case for cached writes, no new code should be needed.

Are you talking about the fc->writeback_cache stuff?  Yeah, that mostly
works out for fuse2fs.  Though I was wondering, when does atime get
updated?  fs/fuse sets S_NOATIME, so I guess it's up to the fuse server
to update it when it wants to, and a later FUSE_GETATTR can pick it up?
If so, how do fuse servers implement lazytime/relatime?
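
For reference, the relatime rule itself is small enough to sketch.  The
Python below is only a model of the policy (update atime if it is not
newer than mtime or ctime, or if it is at least a day old); it does not
describe what any existing fuse server does, and ToyInode and on_read
are hypothetical names invented for this example.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a relatime-style policy would bump atime."""
    if mtime >= atime or ctime >= atime:
        return True
    return now - atime >= DAY


class ToyInode:
    """Hypothetical in-memory inode holding timestamps in seconds."""

    def __init__(self, atime, mtime, ctime):
        self.atime, self.mtime, self.ctime = atime, mtime, ctime


def on_read(inode, now):
    """What a server-side read handler could do; returns True if dirtied."""
    if relatime_needs_update(inode.atime, inode.mtime, inode.ctime, now):
        inode.atime = now
        return True
    return False
```

A server doing this on its own would still have to get the new atime
back to the kernel, for example through a later FUSE_GETATTR.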

--D

> Thanks,
> Miklos
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-29 19:41     ` Amir Goldstein
  2025-06-09 22:31       ` Darrick J. Wong
@ 2025-07-12 10:57       ` Amir Goldstein
  1 sibling, 0 replies; 55+ messages in thread
From: Amir Goldstein @ 2025-07-12 10:57 UTC (permalink / raw)
  To: Darrick J. Wong, Bernd Schubert
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

> On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
...
> > So I /think/ we could ask the fuse server at inode instantiation time
> > (which, if I'm reading the code correctly, is when iget5_locked gives
> > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > to userspace at that time.  Alternately I guess we could extend struct
> > fuse_attr with another FUSE_ATTR_ flag, I think?
> >
>
> The latter. Either extend fuse_attr or struct fuse_entry_out,
> which is in the responses of FUSE_LOOKUP,
> FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE,
> which instantiate fuse inodes.
>

Update:
I went to look at this extension for my inode ops passthrough patches.

What I saw is that while struct fuse_attr and struct fuse_entry_out
are designed to be extended and have been extended in the past:
 * 7.9:
 *  - add blksize field to fuse_attr

Later on, struct fuse_direntplus was introduced
 * 7.21
 *  - add FUSE_READDIRPLUS

with struct fuse_entry_out/fuse_attr embedded in the middle,
and I don't see any code in the kernel/lib that is prepared to handle
a change in the FUSE_NAME_OFFSET_DIRENTPLUS constant
(maybe it's there and I am missing it)
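
The layout concern can be modelled with ctypes.  The field lists below
are written from memory of include/uapi/linux/fuse.h and should be
treated as an approximation; the "extra" field is purely hypothetical.
The sketch only shows that growing fuse_attr moves the dirent name
inside fuse_direntplus, which is what FUSE_NAME_OFFSET_DIRENTPLUS
encodes.

```python
import ctypes as C


def direntplus_name_offset(extra_attr_fields=()):
    """Offset of dirent.name in a modelled struct fuse_direntplus."""

    class FuseAttr(C.Structure):
        _fields_ = (
            [(n, C.c_uint64) for n in
             ("ino", "size", "blocks", "atime", "mtime", "ctime")]
            + [(n, C.c_uint32) for n in
               ("atimensec", "mtimensec", "ctimensec", "mode", "nlink",
                "uid", "gid", "rdev", "blksize", "flags")]
            + list(extra_attr_fields)  # hypothetical extension
        )

    class FuseEntryOut(C.Structure):
        _fields_ = [
            ("nodeid", C.c_uint64),
            ("generation", C.c_uint64),
            ("entry_valid", C.c_uint64),
            ("attr_valid", C.c_uint64),
            ("entry_valid_nsec", C.c_uint32),
            ("attr_valid_nsec", C.c_uint32),
            ("attr", FuseAttr),
        ]

    class FuseDirent(C.Structure):
        _fields_ = [
            ("ino", C.c_uint64),
            ("off", C.c_uint64),
            ("namelen", C.c_uint32),
            ("type", C.c_uint32),
            ("name", C.c_char * 0),  # flexible array member
        ]

    class FuseDirentplus(C.Structure):
        # entry_out (and the attr inside it) sits *before* the dirent.
        _fields_ = [("entry_out", FuseEntryOut), ("dirent", FuseDirent)]

    return FuseDirentplus.dirent.offset + FuseDirent.name.offset


if __name__ == "__main__":
    today = direntplus_name_offset()
    grown = direntplus_name_offset([("backing_id", C.c_uint64)])
    print("name offset today:", today, "after extension:", grown)
```

Any reader that parses a READDIRPLUS buffer with the old constant would
start reading names at the wrong place once the structure grows.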

So for my own use, which only requires passing a single int backing_id
I was tempted to try and overload attr_valid{,_nsec} which are
not relevant for passthrough getattr case,
something like {attr_valid = backing_id, attr_valid_nsec = UTIME_OMIT}.
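
A small Python sketch of that overloading idea, for illustration only.
It relies on UTIME_OMIT being outside the valid nanosecond range, so it
can act as a sentinel; the function names and the dict-shaped "reply"
are assumptions of this example, not an existing interface.

```python
UTIME_OMIT = (1 << 30) - 2        # Linux value of UTIME_OMIT
NSEC_PER_SEC = 1_000_000_000


def encode_backing_id(backing_id):
    """Pack a backing id into the two attr_valid fields (hypothetical)."""
    return {"attr_valid": backing_id, "attr_valid_nsec": UTIME_OMIT}


def encode_timeout(seconds, nsec):
    """Ordinary reply: attr_valid carries a cache timeout."""
    assert 0 <= nsec < NSEC_PER_SEC
    return {"attr_valid": seconds, "attr_valid_nsec": nsec}


def decode(reply):
    """Tell the two meanings apart by checking for the sentinel."""
    if reply["attr_valid_nsec"] == UTIME_OMIT:
        return ("backing_id", reply["attr_valid"])
    return ("timeout", (reply["attr_valid"], reply["attr_valid_nsec"]))
```

The trick only works because a legitimate attr_valid_nsec is always
below one billion, so the sentinel can never collide with a real
timeout.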

In the meantime, as an example I used a hole in struct fuse_attr_out
to implement backing file setup in reply to GETATTR in the wip branch [1].

Bernd,

I wonder if I am missing something w.r.t the intended extensibility of
struct fuse_entry_out/fuse_attr and current readdirplus code?

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fuse-backing-inode-wip/

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2025-07-12 10:58 UTC | newest]

Thread overview: 55+ messages
2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
2025-05-22  0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong
2025-05-22  0:07   ` [PATCH 1/3] fuse2fs: bump library version Darrick J. Wong
2025-05-22  0:07   ` [PATCH 2/3] fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse Darrick J. Wong
2025-05-22  0:08   ` [PATCH 3/3] fuse2fs: disable nfs exports Darrick J. Wong
2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
2025-05-22  0:08   ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
2025-05-22  0:08   ` [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
2025-05-22  0:09   ` [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
2025-05-22  0:09   ` [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
2025-05-22  0:09   ` [PATCH 05/10] libext2fs: add tagged block IO for better caching Darrick J. Wong
2025-05-22  0:09   ` [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager Darrick J. Wong
2025-05-22  0:10   ` [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
2025-05-22  0:10   ` [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
2025-05-22  0:10   ` [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
2025-05-22  0:10   ` [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-05-22  0:11   ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
2025-05-22  0:11   ` [PATCH 02/16] fuse2fs: register block devices for use with iomap Darrick J. Wong
2025-05-22  0:11   ` [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
2025-05-22  0:11   ` [PATCH 04/16] fuse2fs: implement directio file reads Darrick J. Wong
2025-05-22  0:12   ` [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
2025-05-22  0:12   ` [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
2025-05-22  0:12   ` [PATCH 07/16] fuse2fs: add extent dump function for debugging Darrick J. Wong
2025-05-22  0:12   ` [PATCH 08/16] fuse2fs: implement direct write support Darrick J. Wong
2025-05-22  0:13   ` [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
2025-05-22  0:13   ` [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim Darrick J. Wong
2025-05-22  0:13   ` [PATCH 11/16] fuse2fs: improve tracing for fallocate Darrick J. Wong
2025-05-22  0:13   ` [PATCH 12/16] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
2025-05-22  0:14   ` [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
2025-05-22  0:14   ` [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
2025-05-22  0:14   ` [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
2025-05-22  0:15   ` [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
2025-05-29 16:45   ` Darrick J. Wong
2025-05-29 19:41     ` Amir Goldstein
2025-06-09 22:31       ` Darrick J. Wong
2025-06-10 10:59         ` Amir Goldstein
2025-06-10 19:00           ` Darrick J. Wong
2025-06-10 19:51             ` Amir Goldstein
2025-06-11  6:00               ` Darrick J. Wong
2025-06-11  8:54                 ` Amir Goldstein
2025-06-12  5:54                   ` Miklos Szeredi
2025-06-13 17:44                     ` Darrick J. Wong
2025-06-11 11:56             ` Theodore Ts'o
2025-06-12  3:20               ` Darrick J. Wong
2025-06-12  6:10                 ` Amir Goldstein
2025-06-20  8:58               ` Allison Karlitskaya
2025-06-20 11:50                 ` Bernd Schubert
2025-07-01  6:02                   ` Darrick J. Wong
2025-07-01  5:58                 ` Darrick J. Wong
2025-07-12 10:57       ` Amir Goldstein
2025-06-13 17:37   ` [RFC[RAP] V2] " Darrick J. Wong
2025-06-23 13:16     ` Miklos Szeredi
2025-07-01  6:05       ` Darrick J. Wong
