* [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
@ 2025-05-21 23:58 Darrick J. Wong
  2025-05-22  0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-21 23:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4,
	Theodore Ts'o

Hi everyone,

DO NOT MERGE THIS.

This is the very first request for comments on a prototype to connect
the Linux fuse driver to fs-iomap for regular file IO operations to and
from files whose contents persist to locally attached storage devices.

Why would you want to do that?  Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence.  Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.

willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code.  Eeeugh.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ioend calls within iomap are turned into upcalls
to the fuse server via a trio of new fuse commands.  This is suitable
for very simple filesystems that don't do tricky things with mappings
(e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
but solving that is for the next sprint.

With this overly simplistic RFC, I aim to show that it's possible to
build a fuse server for a real filesystem (ext4) that runs entirely in
userspace yet retains most of the kernel driver's performance.  At this
early stage I get about 95% of the kernel ext4 driver's streaming direct
IO performance, and 110% of its streaming buffered IO performance.
Random buffered IO suffers a 90% hit on writes due to unwritten extent
conversions.  Random direct IO is about 60% as fast as the kernel; see
the cover letter for the fuse2fs iomap changes for more details.

There are some major warts remaining:

1. The iomap cookie validation is not present, which can lead to subtle
   races between pagecache zeroing and writeback on filesystems that
   support unwritten and delalloc mappings.

2. Mappings ought to be cached in the kernel for more speed.

3. iomap doesn't support things like fscrypt or fsverity, and I haven't
   yet figured out how inline data is supposed to work.

4. I would like to be able to turn on fuse+iomap on a per-inode basis,
   which currently isn't possible because the kernel fuse driver will
   iget inodes prior to calling FUSE_GETATTR to discover the properties
   of the inode it just read.

5. ext4 doesn't support out of place writes so I don't know if that
   actually works correctly.

6. iomap is an inode-based service, not a file-based service.  This
   means that we /must/ push ext2's inode numbers into the kernel via
   FUSE_GETATTR so that it can report those same numbers back out
   through the FUSE_IOMAP_* calls.  However, the fuse kernel driver uses
   a separate nodeid to index its incore inode, so we have to pass those
   too so that notifications work properly.

I'll work on these in June, but for now here's an unmergeable RFC to
start some discussion.

--Darrick

^ permalink raw reply	[flat|nested] 55+ messages in thread
* [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
@ 2025-05-22  0:01 ` Darrick J. Wong
  2025-05-22  0:07   ` [PATCH 1/3] fuse2fs: bump library version Darrick J. Wong
                     ` (2 more replies)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:01 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

Hi all,

In preparation to start hacking on fuse2fs and iomap, upgrade fuse2fs
library support to 3.17, which is the latest upstream release as of
this writing.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-library-upgrade
---
Commits in this patchset:
 * fuse2fs: bump library version
 * fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse
 * fuse2fs: disable nfs exports
---
 configure      |    4 ++--
 configure.ac   |    4 ++--
 misc/fuse2fs.c |   35 ++++++++++++++++++++++++++++++++---
 3 files changed, 36 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 55+ messages in thread
* [PATCH 1/3] fuse2fs: bump library version 2025-05-22 0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong @ 2025-05-22 0:07 ` Darrick J. Wong 2025-05-22 0:07 ` [PATCH 2/3] fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse Darrick J. Wong 2025-05-22 0:08 ` [PATCH 3/3] fuse2fs: disable nfs exports Darrick J. Wong 2 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:07 UTC (permalink / raw) To: tytso; +Cc: linux-ext4 From: Darrick J. Wong <djwong@kernel.org> Bump the library version so we can take advantage of new functionality since libfuse 3.5. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- configure | 4 ++-- configure.ac | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/configure b/configure index dfc6bb4a5daa2e..1f7dbe24ee1ab1 100755 --- a/configure +++ b/configure @@ -14513,14 +14513,14 @@ fi if test "$FUSE_LIB" = "-lfuse3" then - FUSE_USE_VERSION=35 + FUSE_USE_VERSION=314 CFLAGS="$CFLAGS $fuse3_CFLAGS" LDFLAGS="$LDFLAGS $fuse3_LDFLAGS" for ac_header in pthread.h fuse.h do : as_ac_Header=`printf "%s\n" "ac_cv_header_$ac_header" | $as_tr_sh` ac_fn_c_check_header_compile "$LINENO" "$ac_header" "$as_ac_Header" "#define _FILE_OFFSET_BITS 64 -#define FUSE_USE_VERSION 35 +#define FUSE_USE_VERSION 314 #ifdef __linux__ #include <linux/fs.h> #include <linux/falloc.h> diff --git a/configure.ac b/configure.ac index 7f28701534a905..c7f193b4ed06bf 100644 --- a/configure.ac +++ b/configure.ac @@ -1413,13 +1413,13 @@ AC_SUBST(FUSE_LIB) AC_SUBST(FUSE_CMT) if test "$FUSE_LIB" = "-lfuse3" then - FUSE_USE_VERSION=35 + FUSE_USE_VERSION=314 CFLAGS="$CFLAGS $fuse3_CFLAGS" LDFLAGS="$LDFLAGS $fuse3_LDFLAGS" AC_CHECK_HEADERS([pthread.h fuse.h], [], [AC_MSG_FAILURE([Cannot find fuse3 fuse2fs headers.])], [#define _FILE_OFFSET_BITS 64 -#define FUSE_USE_VERSION 35 +#define FUSE_USE_VERSION 314 #ifdef __linux__ #include <linux/fs.h> #include <linux/falloc.h> ^ permalink raw reply related 
[flat|nested] 55+ messages in thread
* [PATCH 2/3] fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse 2025-05-22 0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong 2025-05-22 0:07 ` [PATCH 1/3] fuse2fs: bump library version Darrick J. Wong @ 2025-05-22 0:07 ` Darrick J. Wong 2025-05-22 0:08 ` [PATCH 3/3] fuse2fs: disable nfs exports Darrick J. Wong 2 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:07 UTC (permalink / raw) To: tytso; +Cc: linux-ext4 From: Darrick J. Wong <djwong@kernel.org> Create a compatibility wrapper for fuse_set_feature_flag if the libfuse version is older than the one where that function was introduced (3.17). Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 32 +++++++++++++++++++++++++++++--- 1 file changed, 29 insertions(+), 3 deletions(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 9667f00e366a66..6137fc04198d39 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -932,6 +932,19 @@ static void op_destroy(void *p EXT2FS_ATTR((unused))) } } +#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 17) +static inline int fuse_set_feature_flag(struct fuse_conn_info *conn, + uint64_t flag) +{ + if (conn->capable & flag) { + conn->want |= flag; + return 1; + } + + return 0; +} +#endif + static void *op_init(struct fuse_conn_info *conn #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0) , struct fuse_config *cfg EXT2FS_ATTR((unused)) @@ -947,14 +960,14 @@ static void *op_init(struct fuse_conn_info *conn FUSE2FS_CHECK_CONTEXT_NULL(ff); dbg_printf(ff, "%s: dev=%s\n", __func__, ff->device); #ifdef FUSE_CAP_IOCTL_DIR - conn->want |= FUSE_CAP_IOCTL_DIR; + fuse_set_feature_flag(conn, FUSE_CAP_IOCTL_DIR); #endif #ifdef FUSE_CAP_POSIX_ACL if (ff->acl) - conn->want |= FUSE_CAP_POSIX_ACL; + fuse_set_feature_flag(conn, FUSE_CAP_POSIX_ACL); #endif #ifdef FUSE_CAP_CACHE_SYMLINKS - conn->want |= FUSE_CAP_CACHE_SYMLINKS; + fuse_set_feature_flag(conn, FUSE_CAP_CACHE_SYMLINKS); #endif #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0) 
conn->time_gran = 1; @@ -1020,6 +1033,19 @@ static void *op_init(struct fuse_conn_info *conn log_printf(ff, "%s %s.\n", _("mounted filesystem"), uuid); } out: +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17) + /* + * THIS MUST GO LAST! + * + * The high-level libfuse code has a strange bug: it sets feature flags + * in conn->want_ext, and later copies the lower 32 bits to conn->want. + * If we in turn change some bits in want_ext without updating want, + * the lower level library observes that want and want_ext have + * gotten out of sync, and refuses to mount. Therefore, synchronize + * the two. + */ + conn->want = conn->want_ext & 0xFFFFFFFF; +#endif return ff; mount_fail: ff->retcode = 32; ^ permalink raw reply related [flat|nested] 55+ messages in thread
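To make the want/want_ext hazard in the patch above concrete, here is a minimal standalone sketch. It does not use libfuse at all: struct conn_model, model_set_feature_flag(), model_sync_want() and the CAP_* bits are hypothetical stand-ins for the fuse_conn_info fields and helper that the patch touches, and the 64-bit/32-bit split is an assumption taken from the comment in the patch.

```c
#include <stdint.h>

/*
 * Hypothetical model of the fields the patch touches; this is NOT the
 * libfuse definition of struct fuse_conn_info.
 */
struct conn_model {
	uint64_t capable;	/* features the kernel offers */
	uint64_t want_ext;	/* features the server requests (64-bit) */
	uint32_t want;		/* legacy 32-bit view of want_ext */
};

/* Made-up capability bits, for illustration only. */
#define CAP_A	(1ULL << 3)
#define CAP_B	(1ULL << 9)

/* Request a feature only if the kernel advertises it; 1 on success. */
static int model_set_feature_flag(struct conn_model *conn, uint64_t flag)
{
	if (conn->capable & flag) {
		conn->want_ext |= flag;
		return 1;
	}
	return 0;
}

/* Final step of init: bring the 32-bit view back in line with want_ext. */
static void model_sync_want(struct conn_model *conn)
{
	conn->want = (uint32_t)(conn->want_ext & 0xFFFFFFFFULL);
}
```

In this model, want stays stale after every model_set_feature_flag() call until model_sync_want() runs, which is why the patch insists that the synchronization be the last thing op_init does.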
* [PATCH 3/3] fuse2fs: disable nfs exports 2025-05-22 0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong 2025-05-22 0:07 ` [PATCH 1/3] fuse2fs: bump library version Darrick J. Wong 2025-05-22 0:07 ` [PATCH 2/3] fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse Darrick J. Wong @ 2025-05-22 0:08 ` Darrick J. Wong 2 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:08 UTC (permalink / raw) To: tytso; +Cc: linux-ext4 From: Darrick J. Wong <djwong@kernel.org> The kernel fuse driver can export its own handles, but it doesn't actually talk to the fuse server about those handles. Hence they don't survive unmount/mount cycles like regular ext4. Disable them, because they cause fstests regressions and it's not clear that they're suitable for NFS export, at least not as most people understand ext4 NFS exports. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 6137fc04198d39..769bb5babd2738 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -969,6 +969,9 @@ static void *op_init(struct fuse_conn_info *conn #ifdef FUSE_CAP_CACHE_SYMLINKS fuse_set_feature_flag(conn, FUSE_CAP_CACHE_SYMLINKS); #endif +#ifdef FUSE_CAP_NO_EXPORT_SUPPORT + fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT); +#endif #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0) conn->time_gran = 1; cfg->use_ino = 1; ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
  2025-05-22  0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong
@ 2025-05-22  0:02 ` Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
                     ` (9 more replies)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
  3 siblings, 10 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:02 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

Hi all,

In preparation for connecting fuse, iomap, and fuse2fs for a much more
performant file IO path, make some changes to the Unix IO manager in
libext2fs so that we can have better IO.  First we start by making
filesystem flushes a lot more efficient by eliding fsyncs when they're
not necessary, and allowing library clients to turn off the racy code
that writes the superblock byte by byte but exposes stale checksums.

XXX: The second part of this series adds IO tagging so that we could tag
IOs by inode number to distinguish file data blocks in cache from
everything else.  This is temporary scaffolding whilst we're in the
middle of adding directio and later buffered writes.  Once we can use
the pagecache for all file IO activity I think we could drop the back
half of this series.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=libext2fs-iomap-prep
---
Commits in this patchset:
 * libext2fs: always fsync the device when flushing the cache
 * libext2fs: always fsync the device when closing the unix IO manager
 * libext2fs: only fsync the unix fd if we wrote to the device
 * libext2fs: invalidate cached blocks when freeing them
 * libext2fs: add tagged block IO for better caching
 * libext2fs: add tagged block IO caching to the unix IO manager
 * libext2fs: only flush affected blocks in unix_write_byte
 * libext2fs: allow unix_write_byte when the write would be aligned
 * libext2fs: allow clients to ask to write full superblocks
 * libext2fs: allow callers to disallow I/O to file data blocks
---
 lib/ext2fs/ext2_io.h         |   29 ++++
 lib/ext2fs/ext2fs.h          |    4 +
 debian/libext2fs2t64.symbols |    5 +
 lib/ext2fs/alloc_stats.c     |    7 +
 lib/ext2fs/closefs.c         |    7 +
 lib/ext2fs/fileio.c          |   26 +++-
 lib/ext2fs/io_manager.c      |   56 ++++++++
 lib/ext2fs/unix_io.c         |  281 +++++++++++++++++++++++++++++++++++-------
 8 files changed, 362 insertions(+), 53 deletions(-)

^ permalink raw reply	[flat|nested] 55+ messages in thread
* [PATCH 01/10] libext2fs: always fsync the device when flushing the cache 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong @ 2025-05-22 0:08 ` Darrick J. Wong 2025-05-22 0:08 ` [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong ` (8 subsequent siblings) 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:08 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> When we're flushing the unix IO manager's buffer cache, we should always fsync the block device, because something could have written to the block device -- either the buffer cache itself, or a direct write. Regardless, the callers all want all dirtied regions to be persisted to stable media. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/unix_io.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c index ede75cf8ee3681..40fd9cc1427c31 100644 --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -1452,7 +1452,8 @@ static errcode_t unix_flush(io_channel channel) retval = flush_cached_blocks(channel, data, 0); #endif #ifdef HAVE_FSYNC - if (!retval && fsync(data->dev) != 0) + /* always fsync the device, even if flushing our own cache failed */ + if (fsync(data->dev) != 0 && !retval) return errno; #endif return retval; ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong 2025-05-22 0:08 ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong @ 2025-05-22 0:08 ` Darrick J. Wong 2025-05-22 0:09 ` [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong ` (7 subsequent siblings) 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:08 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> unix_close is the last chance that libext2fs has to report write failures to users. Although it's likely that ext2fs_close already called ext2fs_flush and told the IO manager to flush, we could do one more sync before we close the file descriptor. Also don't override the fsync's errno with the close's errno. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/unix_io.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c index 40fd9cc1427c31..7c5cb075d6b6b6 100644 --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -1136,8 +1136,11 @@ static errcode_t unix_close(io_channel channel) #ifndef NO_IO_CACHE retval = flush_cached_blocks(channel, data, 0); #endif + /* always fsync the device, even if flushing our own cache failed */ + if (fsync(data->dev) != 0 && !retval) + retval = errno; - if (close(data->dev) < 0) + if (close(data->dev) < 0 && !retval) retval = errno; free_cache(data); free(data->cache); ^ permalink raw reply related [flat|nested] 55+ messages in thread
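The error handling in the two patches above follows a "first failure wins" rule: every teardown step still runs, but a later errno never replaces an earlier one. A minimal sketch of that rule outside libext2fs, assuming only POSIX fsync() and close(); sync_and_close() and its flush_err parameter are hypothetical names invented for this illustration.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush_err is the result of an earlier cache flush
 * (0 on success).  fsync and close are always attempted, but only the
 * earliest failure is reported to the caller.
 */
static int sync_and_close(int fd, int flush_err)
{
	int ret = flush_err;

	/* always fsync, even if flushing our own cache failed */
	if (fsync(fd) != 0 && !ret)
		ret = errno;

	/* a close failure must not hide an earlier flush or fsync error */
	if (close(fd) < 0 && !ret)
		ret = errno;

	return ret;
}
```

Calling it with an invalid descriptor shows the precedence: sync_and_close(-1, 0) reports EBADF from the fsync step, while sync_and_close(-1, 5) still reports the earlier flush error 5.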
* [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong 2025-05-22 0:08 ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong 2025-05-22 0:08 ` [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong @ 2025-05-22 0:09 ` Darrick J. Wong 2025-05-22 0:09 ` [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong ` (6 subsequent siblings) 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:09 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> As an optimization, only fsync the block device fd if we tried to write to the io channel. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/unix_io.c | 48 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 42 insertions(+), 6 deletions(-) diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c index 7c5cb075d6b6b6..0fc83e471ca0fe 100644 --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -129,10 +129,13 @@ struct unix_cache { #define WRITE_DIRECT_SIZE 4 /* Must be smaller than CACHE_SIZE */ #define READ_DIRECT_SIZE 4 /* Should be smaller than CACHE_SIZE */ +#define UNIX_STATE_DIRTY (1U << 0) /* device needs fsyncing */ + struct unix_private_data { int magic; int dev; int flags; + unsigned int state; /* UNIX_STATE_* */ int align; int access_time; ext2_loff_t offset; @@ -1121,10 +1124,37 @@ static errcode_t unix_open(const char *name, int flags, return unix_open_channel(name, fd, flags, channel, unix_io_manager); } +static void mark_dirty(io_channel channel) +{ + struct unix_private_data *data = + (struct unix_private_data *) channel->private_data; + + mutex_lock(data, CACHE_MTX); + data->state |= UNIX_STATE_DIRTY; + mutex_unlock(data, CACHE_MTX); 
+} + +static errcode_t maybe_fsync(io_channel channel) +{ + struct unix_private_data *data = + (struct unix_private_data *) channel->private_data; + int was_dirty; + + mutex_lock(data, CACHE_MTX); + was_dirty = data->state & UNIX_STATE_DIRTY; + data->state &= ~UNIX_STATE_DIRTY; + mutex_unlock(data, CACHE_MTX); + + if (was_dirty && fsync(data->dev) != 0) + return errno; + + return 0; +} + static errcode_t unix_close(io_channel channel) { struct unix_private_data *data; - errcode_t retval = 0; + errcode_t retval = 0, retval2; EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); data = (struct unix_private_data *) channel->private_data; @@ -1137,8 +1167,9 @@ static errcode_t unix_close(io_channel channel) retval = flush_cached_blocks(channel, data, 0); #endif /* always fsync the device, even if flushing our own cache failed */ - if (fsync(data->dev) != 0 && !retval) - retval = errno; + retval2 = maybe_fsync(channel); + if (retval2 && !retval) + retval = retval2; if (close(data->dev) < 0 && !retval) retval = errno; @@ -1306,6 +1337,8 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block, data = (struct unix_private_data *) channel->private_data; EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL); + mark_dirty(channel); + #ifdef NO_IO_CACHE return raw_write_blk(channel, data, block, count, buf, 0); #else @@ -1430,6 +1463,8 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset, if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0) return errno; + mark_dirty(channel); + actual = write(data->dev, buf, size); if (actual < 0) return errno; @@ -1445,7 +1480,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset, static errcode_t unix_flush(io_channel channel) { struct unix_private_data *data; - errcode_t retval = 0; + errcode_t retval = 0, retval2; EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); data = (struct unix_private_data *) channel->private_data; @@ -1456,8 +1491,9 @@ static 
errcode_t unix_flush(io_channel channel) #endif #ifdef HAVE_FSYNC /* always fsync the device, even if flushing our own cache failed */ - if (fsync(data->dev) != 0 && !retval) - return errno; + retval2 = maybe_fsync(channel); + if (retval2 && !retval) + retval = retval2; #endif return retval; } ^ permalink raw reply related [flat|nested] 55+ messages in thread
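The dirty-flag optimization in this patch can be modelled on its own. The sketch below is not the libext2fs code: struct chan_model, STATE_DIRTY, model_mark_dirty() and model_maybe_fsync() are hypothetical names, a pthread mutex stands in for CACHE_MTX, and the fsync_calls counter exists only so the example can show when fsync is skipped.

```c
#include <errno.h>
#include <pthread.h>
#include <unistd.h>

#define STATE_DIRTY	(1U << 0)	/* device needs fsyncing */

/* Hypothetical stand-in for the unix IO manager's private data. */
struct chan_model {
	int fd;
	unsigned int state;
	pthread_mutex_t lock;
	unsigned int fsync_calls;	/* instrumentation for this example */
};

/* Called before any write is attempted on the channel. */
static void model_mark_dirty(struct chan_model *c)
{
	pthread_mutex_lock(&c->lock);
	c->state |= STATE_DIRTY;
	pthread_mutex_unlock(&c->lock);
}

/* Test and clear the flag under the lock; fsync only if it was set. */
static int model_maybe_fsync(struct chan_model *c)
{
	unsigned int was_dirty;

	pthread_mutex_lock(&c->lock);
	was_dirty = c->state & STATE_DIRTY;
	c->state &= ~STATE_DIRTY;
	pthread_mutex_unlock(&c->lock);

	if (!was_dirty)
		return 0;

	c->fsync_calls++;
	return fsync(c->fd) != 0 ? errno : 0;
}
```

Two properties of this sketch are worth noticing. The flag is set before the write is issued, so a write that fails part way still results in an fsync. The flag is cleared before fsync runs, so a failed fsync is not retried by the next model_maybe_fsync() call unless another write happens in between.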
* [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong ` (2 preceding siblings ...) 2025-05-22 0:09 ` [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong @ 2025-05-22 0:09 ` Darrick J. Wong 2025-05-22 0:09 ` [PATCH 05/10] libext2fs: add tagged block IO for better caching Darrick J. Wong ` (5 subsequent siblings) 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:09 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> When we're freeing blocks, we should tell the IO manager to drop them from any cache it might be maintaining to improve performance. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/ext2_io.h | 6 +++++- debian/libext2fs2t64.symbols | 1 + lib/ext2fs/alloc_stats.c | 7 +++++++ lib/ext2fs/io_manager.c | 8 ++++++++ lib/ext2fs/unix_io.c | 32 ++++++++++++++++++++++++++++++++ 5 files changed, 53 insertions(+), 1 deletion(-) diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h index 78c988374c8808..bab7f2a6a44b81 100644 --- a/lib/ext2fs/ext2_io.h +++ b/lib/ext2fs/ext2_io.h @@ -103,7 +103,9 @@ struct struct_io_manager { errcode_t (*zeroout)(io_channel channel, unsigned long long block, unsigned long long count); errcode_t (*get_fd)(io_channel channel, int *fd); - long reserved[13]; + errcode_t (*invalidate_blk)(io_channel channel, + unsigned long long block); + long reserved[12]; }; #define IO_FLAG_RW 0x0001 @@ -147,6 +149,8 @@ extern errcode_t io_channel_cache_readahead(io_channel io, unsigned long long block, unsigned long long count); extern errcode_t io_channel_fd(io_channel io, int *fd); +extern errcode_t io_channel_invalidate_blk(io_channel io, + unsigned long long block); #ifdef _WIN32 /* windows_io.c */ diff --git a/debian/libext2fs2t64.symbols 
b/debian/libext2fs2t64.symbols index 9cf3b33ca15f91..13870c4b545b2f 100644 --- a/debian/libext2fs2t64.symbols +++ b/debian/libext2fs2t64.symbols @@ -689,6 +689,7 @@ libext2fs.so.2 libext2fs2t64 #MINVER# io_channel_cache_readahead@Base 1.43 io_channel_discard@Base 1.42 io_channel_fd@Base 1.47.3 + io_channel_invalidate_blk@Base 1.47.3 io_channel_read_blk64@Base 1.41.1 io_channel_set_options@Base 1.37 io_channel_write_blk64@Base 1.41.1 diff --git a/lib/ext2fs/alloc_stats.c b/lib/ext2fs/alloc_stats.c index 6f98bcc7cbd5f3..4aeaa286b88a7e 100644 --- a/lib/ext2fs/alloc_stats.c +++ b/lib/ext2fs/alloc_stats.c @@ -84,6 +84,13 @@ void ext2fs_block_alloc_stats2(ext2_filsys fs, blk64_t blk, int inuse) ext2fs_mark_bb_dirty(fs); if (fs->block_alloc_stats) (fs->block_alloc_stats)(fs, (blk64_t) blk, inuse); + + if (inuse < 0) { + unsigned int i; + + for (i = 0; i < EXT2FS_CLUSTER_RATIO(fs); i++) + io_channel_invalidate_blk(fs->io, blk + i); + } } void ext2fs_block_alloc_stats(ext2_filsys fs, blk_t blk, int inuse) diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c index 1bab069de63e12..aa7fc58b846be8 100644 --- a/lib/ext2fs/io_manager.c +++ b/lib/ext2fs/io_manager.c @@ -158,3 +158,11 @@ errcode_t io_channel_fd(io_channel io, int *fd) return io->manager->get_fd(io, fd); } + +errcode_t io_channel_invalidate_blk(io_channel io, unsigned long long block) +{ + if (!io->manager->invalidate_blk) + return EXT2_ET_OP_NOT_SUPPORTED; + + return io->manager->invalidate_blk(io, block); +} diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c index 0fc83e471ca0fe..89f7915371307f 100644 --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -664,6 +664,23 @@ static errcode_t reuse_cache(io_channel channel, #define FLUSH_INVALIDATE 0x01 #define FLUSH_NOLOCK 0x02 +/* Remove a block from the cache. Dirty contents are discarded. 
*/ +static void invalidate_cached_block(io_channel channel, + struct unix_private_data *data, + unsigned long long block) +{ + struct unix_cache *cache; + int i; + + mutex_lock(data, CACHE_MTX); + for (i = 0, cache = data->cache; i < data->cache_size; i++, cache++) { + if (!cache->in_use || cache->block != block) + continue; + cache->in_use = 0; + } + mutex_unlock(data, CACHE_MTX); +} + /* * Flush all of the blocks in the cache */ @@ -1705,6 +1722,19 @@ static errcode_t unix_get_fd(io_channel channel, int *fd) return 0; } +static errcode_t unix_invalidate_blk(io_channel channel, + unsigned long long block) +{ + struct unix_private_data *data; + + EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); + data = (struct unix_private_data *) channel->private_data; + EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL); + + invalidate_cached_block(channel, data, block); + return 0; +} + #if __GNUC_PREREQ (4, 6) #pragma GCC diagnostic pop #endif @@ -1727,6 +1757,7 @@ static struct struct_io_manager struct_unix_manager = { .cache_readahead = unix_cache_readahead, .zeroout = unix_zeroout, .get_fd = unix_get_fd, + .invalidate_blk = unix_invalidate_blk, }; io_manager unix_io_manager = &struct_unix_manager; @@ -1749,6 +1780,7 @@ static struct struct_io_manager struct_unixfd_manager = { .cache_readahead = unix_cache_readahead, .zeroout = unix_zeroout, .get_fd = unix_get_fd, + .invalidate_blk = unix_invalidate_blk, }; io_manager unixfd_io_manager = &struct_unixfd_manager; ^ permalink raw reply related [flat|nested] 55+ messages in thread
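As an illustration of what the patch above does when a cluster is freed, here is a small model of the cache scan. It is a sketch under stated assumptions, not the unix IO manager itself: struct cache_slot, model_invalidate_block(), model_free_cluster() and model_count_in_use() are hypothetical names, locking is omitted, and the cluster ratio is passed in as a plain integer rather than read from a filesystem.

```c
#include <stddef.h>

/* Hypothetical, simplified cache entry; buffers are left out. */
struct cache_slot {
	unsigned long long block;
	unsigned dirty:1;
	unsigned in_use:1;
};

/* Drop one block from the cache; dirty contents are discarded. */
static void model_invalidate_block(struct cache_slot *cache, size_t n,
				   unsigned long long block)
{
	for (size_t i = 0; i < n; i++) {
		if (cache[i].in_use && cache[i].block == block)
			cache[i].in_use = 0;
	}
}

/* Freeing a cluster invalidates every block that belongs to it. */
static void model_free_cluster(struct cache_slot *cache, size_t n,
			       unsigned long long first_blk,
			       unsigned int cluster_ratio)
{
	for (unsigned int i = 0; i < cluster_ratio; i++)
		model_invalidate_block(cache, n, first_blk + i);
}

/* Helper for the example: how many slots still hold a block? */
static size_t model_count_in_use(const struct cache_slot *cache, size_t n)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++)
		used += cache[i].in_use;
	return used;
}
```

The point of discarding rather than flushing is that a freed block may be reallocated to something else, so writing stale cached contents back later could overwrite the new owner's data.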
* [PATCH 05/10] libext2fs: add tagged block IO for better caching 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong ` (3 preceding siblings ...) 2025-05-22 0:09 ` [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong @ 2025-05-22 0:09 ` Darrick J. Wong 2025-05-22 0:09 ` [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager Darrick J. Wong ` (4 subsequent siblings) 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:09 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Pass inode numbers from the fileio.c code through the io manager to the unix io manager so that we can manage the disk cache more effectively. In the next few patches we'll need the ability to flush and invalidate the caches for specific files. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/ext2_io.h | 25 +++++++++++++++++++++- debian/libext2fs2t64.symbols | 4 ++++ lib/ext2fs/fileio.c | 14 +++++++----- lib/ext2fs/io_manager.c | 48 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 84 insertions(+), 7 deletions(-) diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h index bab7f2a6a44b81..64b35b31d669e7 100644 --- a/lib/ext2fs/ext2_io.h +++ b/lib/ext2fs/ext2_io.h @@ -39,6 +39,11 @@ typedef struct struct_io_stats *io_stats; #define io_channel_discard_zeroes_data(i) (i->flags & CHANNEL_FLAGS_DISCARD_ZEROES) +typedef unsigned int io_channel_tag_t; + +/* I/O operation has no associated tag */ +#define IO_CHANNEL_TAG_NULL (0) + struct struct_io_channel { errcode_t magic; io_manager manager; @@ -105,7 +110,15 @@ struct struct_io_manager { errcode_t (*get_fd)(io_channel channel, int *fd); errcode_t (*invalidate_blk)(io_channel channel, unsigned long long block); - long reserved[12]; + errcode_t (*read_tagblk)(io_channel channel, io_channel_tag_t tag, 
+ unsigned long long block, int count, + void *data); + errcode_t (*write_tagblk)(io_channel channel, io_channel_tag_t tag, + unsigned long long block, int count, + const void *data); + errcode_t (*flush_tag)(io_channel channel, io_channel_tag_t tag); + errcode_t (*invalidate_tag)(io_channel channel, io_channel_tag_t tag); + long reserved[8]; }; #define IO_FLAG_RW 0x0001 @@ -134,9 +147,17 @@ extern errcode_t io_channel_write_byte(io_channel channel, extern errcode_t io_channel_read_blk64(io_channel channel, unsigned long long block, int count, void *data); +extern errcode_t io_channel_read_tagblk(io_channel channel, + io_channel_tag_t tag, + unsigned long long block, int count, + void *data); extern errcode_t io_channel_write_blk64(io_channel channel, unsigned long long block, int count, const void *data); +extern errcode_t io_channel_write_tagblk(io_channel channel, + io_channel_tag_t tag, + unsigned long long block, int count, + const void *data); extern errcode_t io_channel_discard(io_channel channel, unsigned long long block, unsigned long long count); @@ -151,6 +172,8 @@ extern errcode_t io_channel_cache_readahead(io_channel io, extern errcode_t io_channel_fd(io_channel io, int *fd); extern errcode_t io_channel_invalidate_blk(io_channel io, unsigned long long block); +extern errcode_t io_channel_flush_tag(io_channel io, io_channel_tag_t tag); +extern errcode_t io_channel_invalidate_tag(io_channel io, io_channel_tag_t tag); #ifdef _WIN32 /* windows_io.c */ diff --git a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols index 13870c4b545b2f..87ed63155702e0 100644 --- a/debian/libext2fs2t64.symbols +++ b/debian/libext2fs2t64.symbols @@ -689,11 +689,15 @@ libext2fs.so.2 libext2fs2t64 #MINVER# io_channel_cache_readahead@Base 1.43 io_channel_discard@Base 1.42 io_channel_fd@Base 1.47.3 + io_channel_flush_tag@Base 1.47.3 io_channel_invalidate_blk@Base 1.47.3 + io_channel_invalidate_tag@Base 1.47.3 io_channel_read_blk64@Base 1.41.1 + 
io_channel_read_tagblk@Base 1.47.3 io_channel_set_options@Base 1.37 io_channel_write_blk64@Base 1.41.1 io_channel_write_byte@Base 1.37 + io_channel_write_tagblk@Base 1.47.3 io_channel_zeroout@Base 1.43 qcow2_read_header@Base 1.42 qcow2_write_raw_image@Base 1.42 diff --git a/lib/ext2fs/fileio.c b/lib/ext2fs/fileio.c index 818f7f05420029..1b7e88d990036b 100644 --- a/lib/ext2fs/fileio.c +++ b/lib/ext2fs/fileio.c @@ -167,7 +167,8 @@ errcode_t ext2fs_file_flush(ext2_file_t file) return retval; } - retval = io_channel_write_blk64(fs->io, file->physblock, 1, file->buf); + retval = io_channel_write_tagblk(fs->io, file->ino, file->physblock, + 1, file->buf); if (retval) return retval; @@ -220,9 +221,10 @@ static errcode_t load_buffer(ext2_file_t file, int dontfill) if (!dontfill) { if (file->physblock && !(ret_flags & BMAP_RET_UNINIT)) { - retval = io_channel_read_blk64(fs->io, - file->physblock, - 1, file->buf); + retval = io_channel_read_tagblk(fs->io, + file->ino, + file->physblock, + 1, file->buf); if (retval) return retval; } else @@ -603,13 +605,13 @@ static errcode_t ext2fs_file_zero_past_offset(ext2_file_t file, return retval; /* Read/zero/write block */ - retval = io_channel_read_blk64(fs->io, blk, 1, b); + retval = io_channel_read_tagblk(fs->io, file->ino, blk, 1, b); if (retval) goto out; memset(b + off, 0, fs->blocksize - off); - retval = io_channel_write_blk64(fs->io, blk, 1, b); + retval = io_channel_write_tagblk(fs->io, file->ino, blk, 1, b); if (retval) goto out; diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c index aa7fc58b846be8..357a3bc7698129 100644 --- a/lib/ext2fs/io_manager.c +++ b/lib/ext2fs/io_manager.c @@ -85,6 +85,22 @@ errcode_t io_channel_read_blk64(io_channel channel, unsigned long long block, count, data); } +errcode_t io_channel_read_tagblk(io_channel channel, io_channel_tag_t tag, + unsigned long long block, int count, + void *data) +{ + EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); + + if 
(channel->manager->read_tagblk) + return (channel->manager->read_tagblk)(channel, tag, block, + count, data); + + if (tag != IO_CHANNEL_TAG_NULL) + return EXT2_ET_OP_NOT_SUPPORTED; + + return io_channel_read_blk64(channel, block, count, data); +} + errcode_t io_channel_write_blk64(io_channel channel, unsigned long long block, int count, const void *data) { @@ -101,6 +117,22 @@ errcode_t io_channel_write_blk64(io_channel channel, unsigned long long block, count, data); } +errcode_t io_channel_write_tagblk(io_channel channel, io_channel_tag_t tag, + unsigned long long block, int count, + const void *data) +{ + EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); + + if (channel->manager->write_tagblk) + return (channel->manager->write_tagblk)(channel, tag, block, + count, data); + + if (tag != IO_CHANNEL_TAG_NULL) + return EXT2_ET_OP_NOT_SUPPORTED; + + return io_channel_write_blk64(channel, block, count, data); +} + errcode_t io_channel_discard(io_channel channel, unsigned long long block, unsigned long long count) { @@ -166,3 +198,19 @@ errcode_t io_channel_invalidate_blk(io_channel io, unsigned long long block) return io->manager->invalidate_blk(io, block); } + +errcode_t io_channel_flush_tag(io_channel io, io_channel_tag_t tag) +{ + if (!io->manager->flush_tag && tag != IO_CHANNEL_TAG_NULL) + return EXT2_ET_OP_NOT_SUPPORTED; + + return io->manager->flush_tag(io, tag); +} + +errcode_t io_channel_invalidate_tag(io_channel io, io_channel_tag_t tag) +{ + if (!io->manager->invalidate_tag && tag != IO_CHANNEL_TAG_NULL) + return EXT2_ET_OP_NOT_SUPPORTED; + + return io->manager->invalidate_tag(io, tag); +} ^ permalink raw reply related [flat|nested] 55+ messages in thread
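The fallback rule that io_channel_read_tagblk and io_channel_write_tagblk apply in the patch above can be modelled in isolation. What follows is only a minimal standalone sketch, not libext2fs code: struct toy_manager, the toy_* functions, TOY_TAG_NULL and TOY_ERR_NOT_SUPPORTED are hypothetical stand-ins for the io_manager, IO_CHANNEL_TAG_NULL and EXT2_ET_OP_NOT_SUPPORTED, and the backend just fills a four-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for io_channel_tag_t, IO_CHANNEL_TAG_NULL and
 * EXT2_ET_OP_NOT_SUPPORTED; the real definitions live in libext2fs. */
typedef unsigned int toy_tag_t;
#define TOY_TAG_NULL		0U
#define TOY_ERR_NOT_SUPPORTED	(-1)

struct toy_manager {
	/* mandatory untagged read */
	int (*read_blk)(unsigned long long block, void *data);
	/* optional tagged read; NULL if the manager has no tag support */
	int (*read_tagblk)(toy_tag_t tag, unsigned long long block,
			   void *data);
};

static int toy_read_tagblk(const struct toy_manager *m, toy_tag_t tag,
			   unsigned long long block, void *data)
{
	assert(m != NULL && m->read_blk != NULL);

	/* Prefer the tagged method when the manager provides one. */
	if (m->read_tagblk)
		return m->read_tagblk(tag, block, data);

	/* A real tag cannot be honoured by an untagged manager. */
	if (tag != TOY_TAG_NULL)
		return TOY_ERR_NOT_SUPPORTED;

	/* The null tag degrades to a plain block read. */
	return m->read_blk(block, data);
}

/* Trivial untagged backend: fills 4 bytes with the low byte of block. */
static int toy_plain_read(unsigned long long block, void *data)
{
	memset(data, (int)(block & 0xff), 4);
	return 0;
}
```

In this sketch the callback is checked for NULL before it is called on every path, so an untagged request against a manager without tag support never reaches a missing function pointer.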
* [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong ` (4 preceding siblings ...) 2025-05-22 0:09 ` [PATCH 05/10] libext2fs: add tagged block IO for better caching Darrick J. Wong @ 2025-05-22 0:09 ` Darrick J. Wong 2025-05-22 0:10 ` [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong ` (3 subsequent siblings) 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:09 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Add tagged block caching to the UNIX IO manager. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/unix_io.c | 198 +++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 154 insertions(+), 44 deletions(-) diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c index 89f7915371307f..8a8afe47ee4503 100644 --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -120,6 +120,7 @@ struct unix_cache { char *buf; unsigned long long block; int access_time; + io_channel_tag_t tag; unsigned dirty:1; unsigned in_use:1; unsigned write_err:1; @@ -526,6 +527,7 @@ static errcode_t alloc_cache(io_channel channel, cache->access_time = 0; cache->dirty = 0; cache->in_use = 0; + cache->tag = IO_CHANNEL_TAG_NULL; if (cache->buf) ext2fs_free_mem(&cache->buf); retval = io_channel_alloc_buf(channel, 0, &cache->buf); @@ -552,6 +554,7 @@ static void free_cache(struct unix_private_data *data) cache->access_time = 0; cache->dirty = 0; cache->in_use = 0; + cache->tag = IO_CHANNEL_TAG_NULL; if (cache->buf) ext2fs_free_mem(&cache->buf); } @@ -639,8 +642,9 @@ static struct unix_cache *find_cached_block(struct unix_private_data *data, * Reuse a particular cache entry for another block. 
*/ static errcode_t reuse_cache(io_channel channel, - struct unix_private_data *data, struct unix_cache *cache, - unsigned long long block) + struct unix_private_data *data, + struct unix_cache *cache, io_channel_tag_t tag, + unsigned long long block) { if (cache->dirty && cache->in_use) { errcode_t retval; @@ -653,7 +657,16 @@ static errcode_t reuse_cache(io_channel channel, } } +#ifdef DEBUG + if (cache->in_use) + printf("Reusing cached block %llu(%u) for %llu(%u)\n", + cache->block, cache->tag, block, tag); + else + printf("Using cached block %llu(%u)\n", block, tag); +#endif + cache->in_use = 1; + cache->tag = tag; cache->dirty = 0; cache->write_err = 0; cache->block = block; @@ -664,6 +677,17 @@ static errcode_t reuse_cache(io_channel channel, #define FLUSH_INVALIDATE 0x01 #define FLUSH_NOLOCK 0x02 +static inline void invalidate_cache(struct unix_cache *cache) +{ +#ifdef DEBUG + if (cache->in_use) + printf("Invalidating cache %llu(%u)\n", cache->block, + cache->tag); +#endif + cache->in_use = 0; + cache->tag = IO_CHANNEL_TAG_NULL; +} + /* Remove a block from the cache. Dirty contents are discarded. 
*/ static void invalidate_cached_block(io_channel channel, struct unix_private_data *data, @@ -676,7 +700,7 @@ static void invalidate_cached_block(io_channel channel, for (i = 0, cache = data->cache; i < data->cache_size; i++, cache++) { if (!cache->in_use || cache->block != block) continue; - cache->in_use = 0; + invalidate_cache(cache); } mutex_unlock(data, CACHE_MTX); } @@ -686,7 +710,7 @@ static void invalidate_cached_block(io_channel channel, */ static errcode_t flush_cached_blocks(io_channel channel, struct unix_private_data *data, - int flags) + io_channel_tag_t tag, int flags) { struct unix_cache *cache; errcode_t retval, retval2 = 0; @@ -698,6 +722,11 @@ static errcode_t flush_cached_blocks(io_channel channel, for (i=0, cache = data->cache; i < data->cache_size; i++, cache++) { if (!cache->in_use) continue; + if (tag && cache->tag != tag) + continue; +#ifdef DEBUG + printf("Flushing %sblock %llu(%u)\n", cache->dirty ? "dirty " : "", cache->block, cache->tag); +#endif if (cache->dirty) { int raw_flags = RAW_WRITE_NO_HANDLER; @@ -715,10 +744,10 @@ static errcode_t flush_cached_blocks(io_channel channel, cache->dirty = 0; cache->write_err = 0; if (flags & FLUSH_INVALIDATE) - cache->in_use = 0; + invalidate_cache(cache); } } else if (flags & FLUSH_INVALIDATE) { - cache->in_use = 0; + invalidate_cache(cache); } } if ((flags & FLUSH_NOLOCK) == 0) @@ -737,7 +766,7 @@ static errcode_t flush_cached_blocks(io_channel channel, unsigned long long err_block = cache->block; cache->dirty = 0; - cache->in_use = 0; + invalidate_cache(cache); cache->write_err = 0; if (io_channel_alloc_buf(channel, 0, &err_buf)) @@ -772,7 +801,7 @@ static errcode_t shrink_cache(io_channel channel, mutex_lock(data, CACHE_MTX); - retval = flush_cached_blocks(channel, data, + retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, FLUSH_INVALIDATE | FLUSH_NOLOCK); if (retval) goto unlock; @@ -784,6 +813,7 @@ static errcode_t shrink_cache(io_channel channel, cache->access_time = 0; 
cache->dirty = 0; cache->in_use = 0; + cache->tag = IO_CHANNEL_TAG_NULL; if (cache->buf) ext2fs_free_mem(&cache->buf); } @@ -814,7 +844,7 @@ static errcode_t grow_cache(io_channel channel, mutex_lock(data, CACHE_MTX); - retval = flush_cached_blocks(channel, data, + retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, FLUSH_INVALIDATE | FLUSH_NOLOCK); if (retval) goto unlock; @@ -832,6 +862,7 @@ static errcode_t grow_cache(io_channel channel, cache->access_time = 0; cache->dirty = 0; cache->in_use = 0; + cache->tag = IO_CHANNEL_TAG_NULL; retval = io_channel_alloc_buf(channel, 0, &cache->buf); if (retval) goto unlock; @@ -1181,7 +1212,7 @@ static errcode_t unix_close(io_channel channel) return 0; #ifndef NO_IO_CACHE - retval = flush_cached_blocks(channel, data, 0); + retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, 0); #endif /* always fsync the device, even if flushing our own cache failed */ retval2 = maybe_fsync(channel); @@ -1220,7 +1251,9 @@ static errcode_t unix_set_blksize(io_channel channel, int blksize) mutex_lock(data, CACHE_MTX); mutex_lock(data, BOUNCE_MTX); #ifndef NO_IO_CACHE - if ((retval = flush_cached_blocks(channel, data, FLUSH_NOLOCK))){ + retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, + FLUSH_NOLOCK); + if (retval) { mutex_unlock(data, BOUNCE_MTX); mutex_unlock(data, CACHE_MTX); return retval; @@ -1236,8 +1269,9 @@ static errcode_t unix_set_blksize(io_channel channel, int blksize) return retval; } -static errcode_t unix_read_blk64(io_channel channel, unsigned long long block, - int count, void *buf) +static errcode_t unix_read_tagblk(io_channel channel, io_channel_tag_t tag, + unsigned long long block, int count, + void *buf) { struct unix_private_data *data; struct unix_cache *cache; @@ -1249,6 +1283,10 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block, data = (struct unix_private_data *) channel->private_data; EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL); 
+#ifdef DEBUG + printf("read block %llu(%u) count %u\n", block, tag, count); +#endif + #ifdef NO_IO_CACHE return raw_read_blk(channel, data, block, count, buf); #else @@ -1259,7 +1297,8 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block, * flush out the cache and then do a direct read. */ if (count < 0 || count > WRITE_DIRECT_SIZE) { - if ((retval = flush_cached_blocks(channel, data, 0))) + retval = flush_cached_blocks(channel, data, tag, 0); + if (retval) return retval; return raw_read_blk(channel, data, block, count, buf); } @@ -1270,9 +1309,11 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block, /* If it's in the cache, use it! */ if ((cache = find_cached_block(data, block, NULL))) { #ifdef DEBUG - printf("Using cached block %lu\n", block); + printf("Reading from cached block %llu(%u)\n", block, tag); #endif memcpy(cp, cache->buf, channel->block_size); + if (tag != IO_CHANNEL_TAG_NULL) + cache->tag = tag; count--; block++; cp += channel->block_size; @@ -1287,7 +1328,7 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block, if (find_cached_block(data, block+i, NULL)) break; #ifdef DEBUG - printf("Reading %d blocks starting at %lu\n", i, block); + printf("Reading %d blocks starting at %llu\n", i, block); #endif mutex_unlock(data, CACHE_MTX); if ((retval = raw_read_blk(channel, data, block, i, cp))) @@ -1298,7 +1339,7 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block, for (j=0; j < i; j++) { if (!find_cached_block(data, block, &cache)) { retval = reuse_cache(channel, data, - cache, block); + cache, tag, block); if (retval) goto call_write_handler; memcpy(cache->buf, cp, channel->block_size); @@ -1317,7 +1358,7 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block, unsigned long long err_block = cache->block; cache->dirty = 0; - cache->in_use = 0; + invalidate_cache(cache); cache->write_err = 0; if 
(io_channel_alloc_buf(channel, 0, &err_buf)) err_buf = NULL; @@ -1335,14 +1376,22 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block, #endif /* NO_IO_CACHE */ } +static errcode_t unix_read_blk64(io_channel channel, unsigned long long block, + int count, void *buf) +{ + return unix_read_tagblk(channel, IO_CHANNEL_TAG_NULL, block, count, + buf); +} + static errcode_t unix_read_blk(io_channel channel, unsigned long block, int count, void *buf) { return unix_read_blk64(channel, block, count, buf); } -static errcode_t unix_write_blk64(io_channel channel, unsigned long long block, - int count, const void *buf) +static errcode_t unix_write_tagblk(io_channel channel, io_channel_tag_t tag, + unsigned long long block, int count, + const void *buf) { struct unix_private_data *data; struct unix_cache *cache, *reuse; @@ -1354,6 +1403,10 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block, data = (struct unix_private_data *) channel->private_data; EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL); +#ifdef DEBUG + printf("write block %llu(%u) count %u\n", block, tag, count); +#endif + mark_dirty(channel); #ifdef NO_IO_CACHE @@ -1366,8 +1419,9 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block, * flush out the cache completely and then do a direct write. 
*/ if (count < 0 || count > WRITE_DIRECT_SIZE) { - if ((retval = flush_cached_blocks(channel, data, - FLUSH_INVALIDATE))) + retval = flush_cached_blocks(channel, data, tag, + FLUSH_INVALIDATE); + if (retval) return retval; return raw_write_blk(channel, data, block, count, buf, 0); } @@ -1385,11 +1439,17 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block, mutex_lock(data, CACHE_MTX); while (count > 0) { cache = find_cached_block(data, block, &reuse); - if (!cache) { + if (cache) { +#ifdef DEBUG + printf("Writing to cached block %llu(%u)\n", block, tag); +#endif + if (tag != IO_CHANNEL_TAG_NULL) + cache->tag = tag; + } else { errcode_t err; cache = reuse; - err = reuse_cache(channel, data, cache, block); + err = reuse_cache(channel, data, cache, tag, block); if (err) goto call_write_handler; } @@ -1409,7 +1469,7 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block, unsigned long long err_block = cache->block; cache->dirty = 0; - cache->in_use = 0; + invalidate_cache(cache); cache->write_err = 0; if (io_channel_alloc_buf(channel, 0, &err_buf)) err_buf = NULL; @@ -1427,6 +1487,13 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block, #endif /* NO_IO_CACHE */ } +static errcode_t unix_write_blk64(io_channel channel, unsigned long long block, + int count, const void *buf) +{ + return unix_write_tagblk(channel, IO_CHANNEL_TAG_NULL, block, count, + buf); +} + static errcode_t unix_cache_readahead(io_channel channel, unsigned long long block, unsigned long long count) @@ -1473,7 +1540,9 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset, /* * Flush out the cache completely */ - if ((retval = flush_cached_blocks(channel, data, FLUSH_INVALIDATE))) + retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, + FLUSH_INVALIDATE); + if (retval) return retval; #endif @@ -1491,28 +1560,60 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long 
offset, return 0; } +/* + * Flush data buffers with the given tag to disk and invalidate them. + */ +static errcode_t unix_invalidate_tag(io_channel channel, io_channel_tag_t tag) +{ + struct unix_private_data *data; + errcode_t retval = 0, retval2; + + EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); + data = (struct unix_private_data *) channel->private_data; + EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL); + +#ifndef NO_IO_CACHE + retval = flush_cached_blocks(channel, data, tag, FLUSH_INVALIDATE); +#endif +#ifdef HAVE_FSYNC + /* always fsync the device, even if flushing our own cache failed */ + retval2 = maybe_fsync(channel); + if (retval2 && !retval) + retval = retval2; +#endif + return retval; +} + +/* + * Flush data buffers with the given tag to disk. + */ +static errcode_t unix_flush_tag(io_channel channel, io_channel_tag_t tag) +{ + struct unix_private_data *data; + errcode_t retval = 0, retval2; + + EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); + data = (struct unix_private_data *) channel->private_data; + EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL); + +#ifndef NO_IO_CACHE + retval = flush_cached_blocks(channel, data, tag, 0); +#endif +#ifdef HAVE_FSYNC + /* always fsync the device, even if flushing our own cache failed */ + retval2 = maybe_fsync(channel); + if (retval2 && !retval) + retval = retval2; +#endif + return retval; +} + /* * Flush data buffers to disk. 
*/ static errcode_t unix_flush(io_channel channel) { - struct unix_private_data *data; - errcode_t retval = 0, retval2; - - EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); - data = (struct unix_private_data *) channel->private_data; - EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL); - -#ifndef NO_IO_CACHE - retval = flush_cached_blocks(channel, data, 0); -#endif -#ifdef HAVE_FSYNC - /* always fsync the device, even if flushing our own cache failed */ - retval2 = maybe_fsync(channel); - if (retval2 && !retval) - retval = retval2; -#endif - return retval; + return unix_flush_tag(channel, 0); } static errcode_t unix_set_option(io_channel channel, const char *option, @@ -1547,7 +1648,8 @@ static errcode_t unix_set_option(io_channel channel, const char *option, return 0; } if (!strcmp(arg, "off")) { - retval = flush_cached_blocks(channel, data, 0); + retval = flush_cached_blocks(channel, data, + IO_CHANNEL_TAG_NULL, 0); data->flags |= IO_FLAG_NOCACHE; return retval; } @@ -1748,11 +1850,15 @@ static struct struct_io_manager struct_unix_manager = { .read_blk = unix_read_blk, .write_blk = unix_write_blk, .flush = unix_flush, + .flush_tag = unix_flush_tag, + .invalidate_tag = unix_invalidate_tag, .write_byte = unix_write_byte, .set_option = unix_set_option, .get_stats = unix_get_stats, .read_blk64 = unix_read_blk64, .write_blk64 = unix_write_blk64, + .read_tagblk = unix_read_tagblk, + .write_tagblk = unix_write_tagblk, .discard = unix_discard, .cache_readahead = unix_cache_readahead, .zeroout = unix_zeroout, @@ -1771,11 +1877,15 @@ static struct struct_io_manager struct_unixfd_manager = { .read_blk = unix_read_blk, .write_blk = unix_write_blk, .flush = unix_flush, + .flush_tag = unix_flush_tag, + .invalidate_tag = unix_invalidate_tag, .write_byte = unix_write_byte, .set_option = unix_set_option, .get_stats = unix_get_stats, .read_blk64 = unix_read_blk64, .write_blk64 = unix_write_blk64, + .read_tagblk = unix_read_tagblk, + .write_tagblk = unix_write_tagblk, 
.discard = unix_discard, .cache_readahead = unix_cache_readahead, .zeroout = unix_zeroout, ^ permalink raw reply related [flat|nested] 55+ messages in thread
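The tag filter that this patch adds to flush_cached_blocks can be illustrated with a toy cache. This is a simplified sketch under stated assumptions rather than the unix_io.c implementation: struct toy_cache, toy_flush and TOY_FLUSH_INVALIDATE are hypothetical names, there is no locking, and write errors are not modelled (a dirty entry always "writes" successfully).

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int toy_tag_t;		/* stand-in for io_channel_tag_t */
#define TOY_TAG_NULL		0U	/* stand-in for IO_CHANNEL_TAG_NULL */
#define TOY_FLUSH_INVALIDATE	0x01	/* stand-in for FLUSH_INVALIDATE */

struct toy_cache {
	unsigned long long block;
	toy_tag_t tag;
	unsigned dirty:1;
	unsigned in_use:1;
};

/* Flush entries matching tag; returns how many dirty entries were written. */
static int toy_flush(struct toy_cache *c, size_t n, toy_tag_t tag, int flags)
{
	int written = 0;
	size_t i;

	assert(c != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!c[i].in_use)
			continue;
		/* The null tag selects every entry; otherwise tags must match. */
		if (tag != TOY_TAG_NULL && c[i].tag != tag)
			continue;
		if (c[i].dirty) {
			/* A real cache would write the buffer to disk here. */
			c[i].dirty = 0;
			written++;
		}
		if (flags & TOY_FLUSH_INVALIDATE) {
			/* Dropping an entry also clears its tag. */
			c[i].in_use = 0;
			c[i].tag = TOY_TAG_NULL;
		}
	}
	return written;
}
```

Flushing with a file's tag (the patches use the inode number) touches only that file's blocks, while the null tag keeps the old whole-cache behaviour.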
* [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong ` (5 preceding siblings ...) 2025-05-22 0:09 ` [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager Darrick J. Wong @ 2025-05-22 0:10 ` Darrick J. Wong 2025-05-22 0:10 ` [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong ` (2 subsequent siblings) 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:10 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> There's no need to invalidate the entire cache when writing a range of bytes to the device. The only ones we need to invalidate are the ones that we're writing separately. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/unix_io.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c index 8a8afe47ee4503..4c924ec9ee0760 100644 --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -1523,6 +1523,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset, { struct unix_private_data *data; errcode_t retval = 0; + unsigned long long bno, nbno; ssize_t actual; EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); @@ -1538,12 +1539,18 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset, #ifndef NO_IO_CACHE /* - * Flush out the cache completely + * Flush all the dirty blocks, then invalidate the blocks we're about + * to write. 
*/ - retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, - FLUSH_INVALIDATE); + retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, 0); if (retval) return retval; + + bno = offset / channel->block_size; + nbno = (offset + size + channel->block_size - 1) / channel->block_size; + + for (; bno < nbno; bno++) + invalidate_cached_block(channel, data, bno); #endif if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0) ^ permalink raw reply related [flat|nested] 55+ messages in thread
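The bno/nbno arithmetic in the hunk above maps a byte range onto the half-open range of blocks it touches. A small worked sketch of the same calculation, using hypothetical names (struct toy_range, toy_byte_range_to_blocks) and ignoring overflow for very large offsets:

```c
#include <assert.h>

/* Half-open block range [first, end) covered by a byte range. */
struct toy_range {
	unsigned long long first;
	unsigned long long end;
};

static struct toy_range toy_byte_range_to_blocks(unsigned long long offset,
						 unsigned long long size,
						 unsigned int block_size)
{
	struct toy_range r;

	assert(block_size > 0);

	/* Round the start down to the block containing the first byte. */
	r.first = offset / block_size;
	/* Round the end up so a partially covered last block is included. */
	r.end = (offset + size + block_size - 1) / block_size;
	return r;
}
```

With 4096-byte blocks, a 1024-byte write at offset 1024 gives [0, 1), so only block 0 is invalidated; a 200-byte write at offset 4000 straddles a boundary and gives [0, 2).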
* [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong ` (6 preceding siblings ...) 2025-05-22 0:10 ` [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong @ 2025-05-22 0:10 ` Darrick J. Wong 2025-05-22 0:10 ` [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong 2025-05-22 0:10 ` [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:10 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> If someone calls write_byte on an IO channel with an alignment requirement and the range to be written is aligned correctly, go ahead and do the write. This will be needed later when we try to speed up superblock writes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/unix_io.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c index 4c924ec9ee0760..008a5b46ce7f1f 100644 --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -1534,7 +1534,9 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset, #ifdef ALIGN_DEBUG printf("unix_write_byte: O_DIRECT fallback\n"); #endif - return EXT2_ET_UNIMPLEMENTED; + if (!IS_ALIGNED(data->offset + offset, channel->align) || + !IS_ALIGNED(data->offset + offset + size, channel->align)) + return EXT2_ET_UNIMPLEMENTED; } #ifndef NO_IO_CACHE ^ permalink raw reply related [flat|nested] 55+ messages in thread
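The alignment test in this patch accepts a byte-granular write only when both ends of the range land on an alignment boundary. A rough standalone sketch of that check follows; toy_is_aligned and toy_write_allowed are hypothetical helpers (the definition of the IS_ALIGNED macro used by the patch is not shown in this thread), and dev_offset plays the role of data->offset.

```c
#include <assert.h>

/* Hypothetical helper: nonzero when value is a multiple of align. */
static int toy_is_aligned(unsigned long long value, unsigned int align)
{
	assert(align > 0);
	return (value % align) == 0;
}

/*
 * Nonzero when a write of size bytes at offset, on a device region that
 * starts at dev_offset, begins and ends on an alignment boundary.
 */
static int toy_write_allowed(unsigned long long dev_offset,
			     unsigned long long offset,
			     unsigned long long size,
			     unsigned int align)
{
	return toy_is_aligned(dev_offset + offset, align) &&
	       toy_is_aligned(dev_offset + offset + size, align);
}
```

For a 512-byte alignment requirement, a 1024-byte write at offset 1024 passes, while a 100-byte write at the same offset does not and would still get EXT2_ET_UNIMPLEMENTED.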
* [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong ` (7 preceding siblings ...) 2025-05-22 0:10 ` [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong @ 2025-05-22 0:10 ` Darrick J. Wong 2025-05-22 0:10 ` [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:10 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> write_primary_superblock currently does this weird dance where it will try to write only the dirty bytes of the primary superblock to disk. In theory, this is done so that tune2fs can incrementally update superblock bytes when the filesystem is mounted; ext2 was famous for allowing this dance to be used to set new fs parameters and have them take effect in real time. The ability to do this safely was obliterated back in 2001 when ext3 was introduced with journalling, because tune2fs has no way to know if the journal has already logged an updated primary superblock but not yet written it to disk, which means that tune2fs and the journal can race to write the superblock, and changes can be lost. This (non-)safety was further obliterated back in 2012 when I added checksums to all the metadata blocks in ext4 because anyone else with the block device open can see the primary superblock in an intermediate state where the checksum does not match the superblock contents. At this point in 2025 it's kind of stupid to still be doing this, and it makes fuse2fs syncfs slow because we now perform a bunch of small writes and introduce extra fsyncs. 
It will become especially painful when fuse2fs turns on iomap, at which point it will need to use directio to access the disk, which then runs the Really Sad Path where we change the blocksize and completely obliterate the cache contents. So, add a new flag to ask for full superblock writes, which fuse2fs will use later. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/ext2fs.h | 1 + lib/ext2fs/closefs.c | 7 +++++++ 2 files changed, 8 insertions(+) diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h index 2661e10f57c047..22d56ad7554496 100644 --- a/lib/ext2fs/ext2fs.h +++ b/lib/ext2fs/ext2fs.h @@ -220,6 +220,7 @@ typedef struct ext2_file *ext2_file_t; #define EXT2_FLAG_IBITMAP_TAIL_PROBLEM 0x2000000 #define EXT2_FLAG_THREADS 0x4000000 #define EXT2_FLAG_IGNORE_SWAP_DIRENT 0x8000000 +#define EXT2_FLAG_WRITE_FULL_SUPER 0x10000000 /* * Internal flags for use by the ext2fs library only diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c index 8e5bec03a050de..9a67db76e7b326 100644 --- a/lib/ext2fs/closefs.c +++ b/lib/ext2fs/closefs.c @@ -196,6 +196,13 @@ static errcode_t write_primary_superblock(ext2_filsys fs, int check_idx, write_idx, size; errcode_t retval; + if (fs->flags & EXT2_FLAG_WRITE_FULL_SUPER) { + retval = io_channel_write_byte(fs->io, SUPERBLOCK_OFFSET, + SUPERBLOCK_SIZE, super); + if (!retval) + return 0; + } + if (!fs->io->manager->write_byte || !fs->orig_super) { fallback: io_channel_set_blksize(fs->io, SUPERBLOCK_OFFSET); ^ permalink raw reply related [flat|nested] 55+ messages in thread
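The control flow this patch adds to write_primary_superblock is "try one full write first, and drop back to the old path if that fails". A minimal sketch of that shape, under stated assumptions: struct toy_fs, the toy_* functions and the TOY_* constants are hypothetical, the offset and size values are illustrative, and the legacy incremental path is reduced to a counter. Unlike the patch, the sketch also checks that a write_byte callback exists before using it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FLAG_WRITE_FULL_SUPER	0x1	/* models EXT2_FLAG_WRITE_FULL_SUPER */
#define TOY_SUPER_OFFSET		1024UL	/* illustrative value */
#define TOY_SUPER_SIZE			1024	/* illustrative value */

struct toy_fs {
	int flags;
	/* returns 0 on success, nonzero on failure */
	int (*write_byte)(unsigned long offset, int size, const void *buf);
	int incremental_writes;	/* counts trips through the legacy path */
};

/* Stand-in for the existing dirty-bytes-only superblock update. */
static int toy_write_incremental(struct toy_fs *fs, const void *super)
{
	(void)super;
	fs->incremental_writes++;
	return 0;
}

static int toy_write_primary_super(struct toy_fs *fs, const void *super)
{
	assert(fs != NULL && super != NULL);

	if ((fs->flags & TOY_FLAG_WRITE_FULL_SUPER) && fs->write_byte) {
		/* One write covering the whole superblock. */
		if (fs->write_byte(TOY_SUPER_OFFSET, TOY_SUPER_SIZE,
				   super) == 0)
			return 0;
		/* On failure, fall through to the legacy path. */
	}
	return toy_write_incremental(fs, super);
}

/* Backends used to exercise both outcomes. */
static int toy_ok_write(unsigned long offset, int size, const void *buf)
{
	(void)offset; (void)size; (void)buf;
	return 0;
}

static int toy_bad_write(unsigned long offset, int size, const void *buf)
{
	(void)offset; (void)size; (void)buf;
	return -1;
}
```

A caller that never sets the flag sees no behaviour change, which matches the opt-in design described in the commit message.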
* [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong ` (8 preceding siblings ...) 2025-05-22 0:10 ` [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong @ 2025-05-22 0:10 ` Darrick J. Wong 9 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:10 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Add a flag to ext2_file_t to disallow read and write I/O to file data blocks. This is needed for fuse2fs iomap support, which will keep all the file data I/O inside the kernel. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/ext2fs.h | 3 +++ lib/ext2fs/fileio.c | 12 +++++++++++- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h index 22d56ad7554496..2c8e2cc2b55416 100644 --- a/lib/ext2fs/ext2fs.h +++ b/lib/ext2fs/ext2fs.h @@ -178,6 +178,9 @@ typedef struct ext2_struct_dblist *ext2_dblist; #define EXT2_FILE_WRITE 0x0001 #define EXT2_FILE_CREATE 0x0002 +/* no file I/O to disk blocks, only to inline data */ +#define EXT2_FILE_NOBLOCKIO 0x0004 + #define EXT2_FILE_MASK 0x00FF #define EXT2_FILE_BUF_DIRTY 0x4000 diff --git a/lib/ext2fs/fileio.c b/lib/ext2fs/fileio.c index 1b7e88d990036b..229ae6da7f448b 100644 --- a/lib/ext2fs/fileio.c +++ b/lib/ext2fs/fileio.c @@ -300,6 +300,11 @@ errcode_t ext2fs_file_read(ext2_file_t file, void *buf, if (file->inode.i_flags & EXT4_INLINE_DATA_FL) return ext2fs_file_read_inline_data(file, buf, wanted, got); + if (file->flags & EXT2_FILE_NOBLOCKIO) { + retval = EXT2_ET_OP_NOT_SUPPORTED; + goto fail; + } + while ((file->pos < EXT2_I_SIZE(&file->inode)) && (wanted > 0)) { retval = sync_buffer_position(file); if (retval) @@ -416,6 +421,11 @@ errcode_t ext2fs_file_write(ext2_file_t file, const void *buf, 
retval = 0; } + if (file->flags & EXT2_FILE_NOBLOCKIO) { + retval = EXT2_ET_OP_NOT_SUPPORTED; + goto fail; + } + while (nbytes > 0) { retval = sync_buffer_position(file); if (retval) @@ -584,7 +594,7 @@ static errcode_t ext2fs_file_zero_past_offset(ext2_file_t file, int ret_flags; errcode_t retval; - if (off == 0) + if (off == 0 || (file->flags & EXT2_FILE_NOBLOCKIO)) return 0; retval = sync_buffer_position(file); ^ permalink raw reply related [flat|nested] 55+ messages in thread
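The ordering of the checks in ext2fs_file_read matters: inline data is served before the new flag is consulted, so a file opened with EXT2_FILE_NOBLOCKIO can still read data stored in the inode. A toy model of that ordering, with hypothetical names (struct toy_file, toy_file_read, TOY_*) and the actual I/O reduced to return codes:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FILE_NOBLOCKIO	0x0004	/* models EXT2_FILE_NOBLOCKIO */
#define TOY_ERR_NOT_SUPPORTED	(-1)	/* models EXT2_ET_OP_NOT_SUPPORTED */

struct toy_file {
	int flags;
	int has_inline_data;	/* models EXT4_INLINE_DATA_FL on the inode */
};

/* Returns 0 if the read may proceed, or an error code if it is refused. */
static int toy_file_read(const struct toy_file *f)
{
	assert(f != NULL);

	/* Inline data lives in the inode, so no data block is touched. */
	if (f->has_inline_data)
		return 0;

	/* Block-backed data is off limits when the flag is set. */
	if (f->flags & TOY_FILE_NOBLOCKIO)
		return TOY_ERR_NOT_SUPPORTED;

	/* A real implementation would map and read data blocks here. */
	return 0;
}
```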
* [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance 2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong 2025-05-22 0:01 ` [PATCHSET 1/3] fuse2fs: upgrade to libfuse 3.17 Darrick J. Wong 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong @ 2025-05-22 0:02 ` Darrick J. Wong 2025-05-22 0:11 ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong ` (15 more replies) 2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein 3 siblings, 16 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:02 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel Hi all, Switch fuse2fs to use the new iomap file data IO paths instead of pushing file data very slowly through the /dev/fuse connection. For local filesystems, all we have to do is respond to requests for file to device mappings; the rest of the IO hot path stays within the kernel. This means that we can get rid of all file data block processing within fuse2fs. Because we're not pinning dirty pages through a potentially slow network connection, we don't need the heavy BDI throttling for which most fuse servers have become infamous. Yes, mapping lookups for writeback can stall, but mappings are small as compared to data and this situation exists for all kernel filesystems as well. The performance of this new data path is quite stunning: on a warm system, streaming reads and writes through the pagecache go from 60-90MB/s to 2-2.5GB/s. Direct IO reads and writes improve from the same baseline to 2.5-8GB/s. FIEMAP and SEEK_DATA/SEEK_HOLE now work too. 
The kernel ext4 driver can manage about 1.6GB/s for pagecache IO and about 2.6-8.5GB/s for direct IO, which means that fuse2fs is about as fast as the kernel for streaming file IO. 
Random 4k buffered IO is not so good: plain fuse2fs pokes along at 25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s. The kernel can do 900-1300MB/s. Random directio is worse: plain fuse2fs does 20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does 40-55MB/s. I suspect that metadata heavy workloads do not perform well on fuse2fs because libext2fs wasn't designed for that and it doesn't even have a journal to absorb all the fsync writes. We also probably need iomap caching really badly. These performance numbers are slanted: my machine is 12 years old, and fuse2fs is VERY poorly optimized for performance. It contains a single Big Filesystem Lock which nukes multi-threaded scalability. There's no inode cache nor is there a proper buffer cache, which means that fuse2fs reads metadata in from disk and checksums it on EVERY ACCESS. Sad! Despite these gaps, this RFC demonstrates that it's feasible to run the metadata parsing parts of a filesystem in userspace while not sacrificing much performance. We now have a vehicle to move the filesystems out of the kernel, where they can be containerized so that malicious filesystems can be contained, somewhat. iomap mode also calls FUSE_DESTROY before unmounting the filesystem, so for capable systems, fuse2fs doesn't need to run in fuseblk mode anymore. However, there are some major warts remaining: 1. The iomap cookie validation is not present, which can lead to subtle races between pagecache zeroing and writeback on filesystems that support unwritten and delalloc mappings. 2. Mappings ought to be cached in the kernel for more speed. 3. iomap doesn't support things like fscrypt or fsverity, and I haven't yet figured out how inline data is supposed to work. 4. I would like to be able to turn on fuse+iomap on a per-inode basis, which currently isn't possible because the kernel fuse driver will iget inodes prior to calling FUSE_GETATTR to discover the properties of the inode it just read. 5. 
ext4 doesn't support out of place writes so I don't know if that actually works correctly. 6. iomap is an inode-based service, not a file-based service. This means that we /must/ push ext2's inode numbers into the kernel via FUSE_GETATTR so that it can report those same numbers back out through the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid to index its incore inode, so we have to pass those too so that notifications work properly. I'll work on these in June, but for now here's an unmergeable RFC to start some discussion. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. Comments and questions are, as always, welcome. e2fsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap --- Commits in this patchset: * fuse2fs: implement bare minimum iomap for file mapping reporting * fuse2fs: register block devices for use with iomap * fuse2fs: always use directio disk reads with fuse2fs * fuse2fs: implement directio file reads * fuse2fs: use tagged block IO for zeroing sub-block regions * fuse2fs: only flush the cache for the file under directio read * fuse2fs: add extent dump function for debugging * fuse2fs: implement direct write support * fuse2fs: turn on iomap for pagecache IO * fuse2fs: flush and invalidate the buffer cache on trim * fuse2fs: improve tracing for fallocate * fuse2fs: don't zero bytes in punch hole * fuse2fs: don't do file data block IO when iomap is enabled * fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode * fuse2fs: re-enable the block device pagecache for metadata IO * fuse2fs: avoid fuseblk mode if fuse-iomap support is likely --- configure | 47 ++ configure.ac | 32 + lib/config.h.in | 3 misc/fuse2fs.c | 1251 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 1312 insertions(+), 21 deletions(-) ^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong @ 2025-05-22 0:11 ` Darrick J. Wong 2025-05-22 0:11 ` [PATCH 02/16] fuse2fs: register block devices for use with iomap Darrick J. Wong ` (14 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:11 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Add enough of an iomap implementation that we can do FIEMAP and SEEK_DATA and SEEK_HOLE. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- configure | 47 ++++++ configure.ac | 32 ++++ lib/config.h.in | 3 misc/fuse2fs.c | 453 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 530 insertions(+), 5 deletions(-) diff --git a/configure b/configure index 1f7dbe24ee1ab1..c8b63dd448dca8 100755 --- a/configure +++ b/configure @@ -14545,6 +14545,53 @@ elif test -n "$FUSE_LIB" then FUSE_USE_VERSION=29 fi + +if test "$FUSE_LIB" = "-lfuse3" +then +{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for iomap_begin in libfuse" >&5 +printf %s "checking for iomap_begin in libfuse... " >&6; } +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ + +#define _GNU_SOURCE +#define _FILE_OFFSET_BITS 64 +#define FUSE_USE_VERSION 318 +#include <fuse.h> + +int +main (void) +{ + +struct fuse_operations fs_ops = { + .iomap_begin = NULL, + .iomap_end = NULL, +}; +struct fuse_iomap narf = { }; + + ; + return 0; +} + +_ACEOF +if ac_fn_c_try_link "$LINENO" +then : + have_fuse_iomap=yes + { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5 +printf "%s\n" "yes" >&6; } +else $as_nop + { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5 +printf "%s\n" "no" >&6; } +fi +rm -f core conftest.err conftest.$ac_objext conftest.beam \ + conftest$ac_exeext conftest.$ac_ext +if test "$have_fuse_iomap" = yes; then + FUSE_USE_VERSION=318 + +printf "%s\n" "#define HAVE_FUSE_IOMAP 1" >>confdefs.h + +fi +fi + if test -n "$FUSE_USE_VERSION" then diff --git a/configure.ac b/configure.ac index c7f193b4ed06bf..8b12ef3ee542e3 100644 --- a/configure.ac +++ b/configure.ac @@ -1429,6 +1429,38 @@ elif test -n "$FUSE_LIB" then FUSE_USE_VERSION=29 fi + +if test "$FUSE_LIB" = "-lfuse3" +then +dnl +dnl see if fuse3 supports iomap +dnl +AC_MSG_CHECKING(for iomap_begin in libfuse) +AC_LINK_IFELSE( +[ AC_LANG_PROGRAM([[ +#define _GNU_SOURCE +#define _FILE_OFFSET_BITS 64 +#define FUSE_USE_VERSION 318 +#include <fuse.h> + ]], [[ +struct fuse_operations fs_ops = { + .iomap_begin = NULL, + .iomap_end = NULL, +}; +struct fuse_iomap narf = { }; + ]]) +], have_fuse_iomap=yes + AC_MSG_RESULT(yes), + AC_MSG_RESULT(no)) +if test "$have_fuse_iomap" = yes; then + FUSE_USE_VERSION=318 + AC_DEFINE(HAVE_FUSE_IOMAP, 1, [Define to 1 if fuse supports iomap]) +fi +fi + +dnl +dnl set FUSE_USE_VERSION now that we've done all the feature tests +dnl if test -n "$FUSE_USE_VERSION" then AC_DEFINE_UNQUOTED(FUSE_USE_VERSION, $FUSE_USE_VERSION, diff --git a/lib/config.h.in b/lib/config.h.in index 6cd9751baab9d1..850c5fa573bcf0 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -73,6 +73,9 @@ /* Define to 1 if PR_SET_IO_FLUSHER is present */ #undef 
HAVE_PR_SET_IO_FLUSHER +/* Define to 1 if fuse supports iomap */ +#undef HAVE_FUSE_IOMAP + /* Define to 1 if you have the Mac OS X function CFLocaleCopyPreferredLanguages in the CoreFoundation framework. */ #undef HAVE_CFLOCALECOPYPREFERREDLANGUAGES diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 769bb5babd2738..f9eed078d91152 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -79,6 +79,8 @@ #define P_(singular, plural, n) ((n) == 1 ? (singular) : (plural)) #endif +#define min(x, y) ((x) < (y) ? (x) : (y)) + #define dbg_printf(fuse2fs, format, ...) \ while ((fuse2fs)->debug) { \ printf("FUSE2FS (%s): " format, (fuse2fs)->shortdev, ##__VA_ARGS__); \ @@ -144,6 +146,14 @@ struct fuse2fs_file_handle { int open_flags; }; +#ifdef HAVE_FUSE_IOMAP +enum fuse2fs_iomap_state { + IOMAP_DISABLED, + IOMAP_UNKNOWN, + IOMAP_ENABLED, +}; +#endif + /* Main program context */ #define FUSE2FS_MAGIC (0xEF53DEADUL) struct fuse2fs { @@ -167,6 +177,9 @@ struct fuse2fs { uint8_t writable; int blocklog; +#ifdef HAVE_FUSE_IOMAP + enum fuse2fs_iomap_state iomap_state; +#endif unsigned int blockmask; int retcode; unsigned long offset; @@ -694,7 +707,7 @@ static errcode_t open_fs(struct fuse2fs *ff, int libext2_flags) { char options[128]; int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW | - libext2_flags; + EXT2_FLAG_WRITE_FULL_SUPER | libext2_flags; errcode_t err; snprintf(options, sizeof(options) - 1, "offset=%lu", ff->offset); @@ -945,6 +958,38 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn, } #endif +#ifdef HAVE_FUSE_IOMAP +static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff) +{ + int is_bdev; + errcode_t err; + + switch (ff->iomap_state) { + case IOMAP_UNKNOWN: + ff->iomap_state = IOMAP_DISABLED; + /* fallthrough */; + case IOMAP_DISABLED: + return 0; + case IOMAP_ENABLED: + break; + } + + err = fs_on_bdev(ff, &is_bdev); + if (err) + return err; + + /* iomap only works with block devices */ + if (!is_bdev) { +
fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP); + ff->iomap_state = IOMAP_DISABLED; + } + + return 0; +} +#else +# define confirm_iomap(...) (0) +#endif + static void *op_init(struct fuse_conn_info *conn #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0) , struct fuse_config *cfg EXT2FS_ATTR((unused)) @@ -972,6 +1017,12 @@ static void *op_init(struct fuse_conn_info *conn #ifdef FUSE_CAP_NO_EXPORT_SUPPORT fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT); #endif +#ifdef HAVE_FUSE_IOMAP + if (ff->iomap_state != IOMAP_DISABLED && + fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) + ff->iomap_state = IOMAP_ENABLED; +#endif + #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0) conn->time_gran = 1; cfg->use_ino = 1; @@ -989,6 +1040,10 @@ static void *op_init(struct fuse_conn_info *conn goto mount_fail; fs = ff->fs; + err = confirm_iomap(conn, ff); + if (err) + goto mount_fail; + if (ff->cache_size) { err = config_fs_cache(ff); if (err) @@ -1014,6 +1069,10 @@ static void *op_init(struct fuse_conn_info *conn err = mount_fs(ff); if (err) goto mount_fail; + } else { + err = confirm_iomap(conn, ff); + if (err) + goto mount_fail; } /* Clear the valid flag so that an unclean shutdown forces a fsck */ @@ -4575,6 +4634,384 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode, # endif /* SUPPORT_FALLOCATE */ #endif /* FUSE 29 */ +#ifdef HAVE_FUSE_IOMAP +static void handle_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap, + off_t pos, uint64_t count) +{ + iomap->dev = FUSE_IOMAP_DEV_FUSEBLK; + iomap->addr = FUSE_IOMAP_NULL_ADDR; + iomap->offset = pos; + iomap->length = count; + iomap->type = FUSE_IOMAP_TYPE_HOLE; +} + +#define DEBUG_IOMAP +#ifdef DEBUG_IOMAP +# define __DUMP_EXTENT(ff, func, tag, startoff, err, extent) \ + do { \ + dbg_printf((ff), \ + "%s: %s startoff 0x%llx err %ld lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", \ + (func), (tag), (startoff), (err), (extent)->e_lblk, \ + (extent)->e_pblk, (extent)->e_len, \ + (extent)->e_flags & 
EXT2_EXTENT_FLAGS_UNINIT); \ + } while(0) +# define DUMP_EXTENT(ff, tag, startoff, err, extent) \ + __DUMP_EXTENT((ff), __func__, (tag), (startoff), (err), (extent)) +#else +# define __DUMP_EXTENT(...) ((void)0) +# define DUMP_EXTENT(...) ((void)0) +#endif + +static inline errcode_t __get_mapping_at(struct fuse2fs *ff, + ext2_extent_handle_t handle, + blk64_t startoff, + struct ext2fs_extent *bmap, + const char *func) +{ + errcode_t err; + + /* + * Find the file mapping at startoff. We don't check the return value + * of _goto because _get will error out if _goto failed. There's a + * subtlety to the outcome of _goto when startoff falls in a sparse + * hole however: + * + * Most of the time, _goto points the cursor at the mapping whose lblk + * is just to the left of startoff. The mapping may or may not overlap + * startoff; this is ok. In other words, the tree lookup behaves as if + * we asked it to use a less than or equals comparison. + * + * However, if startoff is to the left of the first mapping in the + * extent tree, _goto points the cursor at that first mapping because + * it doesn't know how to deal with this situation. In this case, + * the tree lookup behaves as if we asked it to use a greater than + * or equals comparison. + * + * Note: If _get() returns 'no current node', that means that there + * aren't any mappings at all. 
+ */ + ext2fs_extent_goto(handle, startoff); + err = ext2fs_extent_get(handle, EXT2_EXTENT_CURRENT, bmap); + __DUMP_EXTENT(ff, func, "lookup", startoff, err, bmap); + if (err == EXT2_ET_NO_CURRENT_NODE) + err = EXT2_ET_EXTENT_NOT_FOUND; + return err; +} + +static inline errcode_t __get_next_mapping(struct fuse2fs *ff, + ext2_extent_handle_t handle, + blk64_t startoff, + struct ext2fs_extent *bmap, + const char *func) +{ + struct ext2fs_extent newex, errex; + errcode_t err; + + err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT_LEAF, &newex); + DUMP_EXTENT(ff, "NEXT", startoff, err, &newex); + if (err == EXT2_ET_EXTENT_NO_NEXT) + return EXT2_ET_EXTENT_NOT_FOUND; + if (err) + return err; + + /* + * Try to get the next leaf mapping. There's a weird and longstanding + * "feature" of EXT2_EXTENT_NEXT_LEAF where walking off the end of the + * mapping recordset causes it to wrap around to the beginning of the + * extent map and we end up with a mapping to the left of the one that + * was passed in. + * + * However, a corrupt extent tree could also have such a record. The + * only way to be sure is to retrieve the mapping for the extreme right + * edge of the tree and compare it to the mapping that the caller gave + * us. If they match, then we've hit the end. If not, something is + * corrupt in the ondisk metadata. 
+ */ + if (newex.e_lblk <= bmap->e_lblk + bmap->e_len) { + err = __get_mapping_at(ff, handle, ~0U, &errex, func); + if (err) + return err; + + if (memcmp(bmap, &errex, sizeof(errex)) != 0) + return EXT2_ET_INODE_CORRUPTED; + + return EXT2_ET_EXTENT_NOT_FOUND; + } + + *bmap = newex; + return 0; +} + +#define get_mapping_at(ff, handle, startoff, bmap) \ + __get_mapping_at((ff), (handle), (startoff), (bmap), __func__) +#define get_next_mapping(ff, handle, startoff, bmap) \ + __get_next_mapping((ff), (handle), (startoff), (bmap), __func__) + +static errcode_t extent_iomap_begin(struct fuse2fs *ff, uint64_t ino, + struct ext2_inode_large *inode, + off_t pos, uint64_t count, + uint32_t opflags, struct fuse_iomap *iomap) +{ + ext2_extent_handle_t handle; + struct ext2fs_extent extent; + ext2_filsys fs = ff->fs; + const blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos); + errcode_t err; + int ret = 0; + + err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle); + if (err) + return translate_error(fs, ino, err); + + err = get_mapping_at(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + /* No mappings at all; the whole range is a hole. */ + handle_iomap_hole(ff, iomap, pos, count); + goto out_handle; + } + if (err) { + ret = translate_error(fs, ino, err); + goto out_handle; + } + + if (startoff < extent.e_lblk) { + /* + * Mapping starts to the right of the current position. + * Synthesize a hole going to that next extent. + */ + handle_iomap_hole(ff, iomap, FUSE2FS_FSB_TO_B(ff, startoff), + FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff)); + goto out_handle; + } + + if (startoff >= extent.e_lblk + extent.e_len) { + /* + * Mapping ends to the left of the current position. Try to + * find the next mapping. If there is no next mapping, the + * whole range is in a hole. 
+ */ + err = get_next_mapping(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + handle_iomap_hole(ff, iomap, pos, count); + goto out_handle; + } + + /* + * If the new mapping starts to the right of startoff, there's + * a hole from startoff to the start of the new mapping. + */ + if (startoff < extent.e_lblk) { + handle_iomap_hole(ff, iomap, + FUSE2FS_FSB_TO_B(ff, startoff), + FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff)); + goto out_handle; + } + + /* + * The new mapping starts at startoff. Something weird + * happened in the extent tree lookup, but we found a valid + * mapping so we'll run with it. + */ + } + + /* Mapping overlaps startoff, report this. */ + iomap->dev = FUSE_IOMAP_DEV_FUSEBLK; + iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk); + iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk); + iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len); + if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) + iomap->type = FUSE_IOMAP_TYPE_UNWRITTEN; + else + iomap->type = FUSE_IOMAP_TYPE_MAPPED; + +out_handle: + ext2fs_extent_free(handle); + return ret; +} + +static int indirect_iomap_begin(struct fuse2fs *ff, uint64_t ino, + struct ext2_inode_large *inode, off_t pos, + uint64_t count, uint32_t opflags, + struct fuse_iomap *iomap) +{ + ext2_filsys fs = ff->fs; + blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos); + uint64_t real_count = min(count, 131072); + const blk64_t endoff = FUSE2FS_B_TO_FSB(ff, pos + real_count); + blk64_t startblock; + errcode_t err; + + err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, startoff, NULL, + &startblock); + if (err) + return translate_error(fs, ino, err); + + iomap->dev = FUSE_IOMAP_DEV_FUSEBLK; + iomap->offset = pos; + iomap->flags |= FUSE_IOMAP_F_MERGED; + if (startblock) { + iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock); + iomap->type = FUSE_IOMAP_TYPE_MAPPED; + } else { + iomap->addr = FUSE_IOMAP_NULL_ADDR; + iomap->type = FUSE_IOMAP_TYPE_HOLE; + } + iomap->length = fs->blocksize; + + /* See how 
long the mapping goes for. */ + for (startoff++; startoff < endoff; startoff++) { + blk64_t prev_startblock = startblock; + + err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, + startoff, NULL, &startblock); + if (err) + break; + + if (iomap->type == FUSE_IOMAP_TYPE_MAPPED) { + if (startblock == prev_startblock + 1) + iomap->length += fs->blocksize; + else + break; + } else { + if (startblock != 0) + break; + } + } + + return 0; +} + +static int inline_iomap_begin(struct fuse2fs *ff, off_t pos, uint64_t count, + struct fuse_iomap *iomap) +{ + iomap->dev = FUSE_IOMAP_DEV_FUSEBLK; + iomap->addr = FUSE_IOMAP_NULL_ADDR; + iomap->offset = pos; + iomap->length = count; + iomap->type = FUSE_IOMAP_TYPE_INLINE; + + return 0; +} + +static int fuse_iomap_begin_report(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, + off_t pos, uint64_t count, uint32_t opflags, + struct fuse_iomap *read_iomap) +{ + if (inode->i_flags & EXT4_INLINE_DATA_FL) + return inline_iomap_begin(ff, pos, count, read_iomap); + + if (inode->i_flags & EXT4_EXTENTS_FL) + return extent_iomap_begin(ff, ino, inode, pos, count, opflags, + read_iomap); + + return indirect_iomap_begin(ff, ino, inode, pos, count, opflags, + read_iomap); +} + +static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, off_t pos, + uint64_t count, uint32_t opflags, + struct fuse_iomap *read_iomap) +{ + return -ENOSYS; +} + +static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, off_t pos, + uint64_t count, uint32_t opflags, + struct fuse_iomap *read_iomap) +{ + return -ENOSYS; +} + +static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, + off_t pos, uint64_t count, uint32_t opflags, + struct fuse_iomap *read_iomap, + struct fuse_iomap *write_iomap) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data; + struct ext2_inode_large 
inode; + ext2_filsys fs; + errcode_t err; + int ret = 0; + + FUSE2FS_CHECK_CONTEXT(ff); + fs = ff->fs; + + pthread_mutex_lock(&ff->bfl); + + dbg_printf(ff, + "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x\n", + __func__, path, + (unsigned long long)nodeid, + (unsigned long long)attr_ino, + (unsigned long long)pos, + (unsigned long long)count, + opflags); + + err = fuse2fs_read_inode(fs, attr_ino, &inode); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + + if (opflags & FUSE_IOMAP_OP_REPORT) + ret = fuse_iomap_begin_report(ff, attr_ino, &inode, pos, count, + opflags, read_iomap); + else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO)) + ret = fuse_iomap_begin_write(ff, attr_ino, &inode, pos, count, + opflags, read_iomap); + else + ret = fuse_iomap_begin_read(ff, attr_ino, &inode, pos, count, + opflags, read_iomap); + if (ret) + goto out_unlock; + + dbg_printf(ff, "%s: nodeid=%llu attr_ino=%llu pos=0x%llx -> addr=0x%llx offset=0x%llx length=0x%llx type=%u\n", + __func__, + (unsigned long long)nodeid, + (unsigned long long)attr_ino, + (unsigned long long)pos, + (unsigned long long)read_iomap->addr, + (unsigned long long)read_iomap->offset, + (unsigned long long)read_iomap->length, + read_iomap->type); + +out_unlock: + if (ret < 0) + dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret); + pthread_mutex_unlock(&ff->bfl); + return ret; +} + +static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino, + off_t pos, uint64_t count, uint32_t opflags, + ssize_t written, const struct fuse_iomap *iomap) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data; + + FUSE2FS_CHECK_CONTEXT(ff); + + pthread_mutex_lock(&ff->bfl); + dbg_printf(ff, + "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x written=0x%zx mapflags 0x%x\n", + __func__, path, + (unsigned long long)nodeid, + (unsigned long long)attr_ino, + 
(unsigned long long)pos, + (unsigned long long)count, + opflags, + written, + iomap->flags); + pthread_mutex_unlock(&ff->bfl); + + return 0; +} +#endif /* HAVE_FUSE_IOMAP */ + static struct fuse_operations fs_ops = { .init = op_init, .destroy = op_destroy, @@ -4635,6 +5072,10 @@ static struct fuse_operations fs_ops = { .fallocate = op_fallocate, # endif #endif +#ifdef HAVE_FUSE_IOMAP + .iomap_begin = op_iomap_begin, + .iomap_end = op_iomap_end, +#endif /* HAVE_FUSE_IOMAP */ }; static int get_random_bytes(void *p, size_t sz) @@ -4840,7 +5281,12 @@ static void fuse2fs_com_err_proc(const char *whoami, errcode_t code, int main(int argc, char *argv[]) { struct fuse_args args = FUSE_ARGS_INIT(argc, argv); - struct fuse2fs fctx; + struct fuse2fs fctx = { + .magic = FUSE2FS_MAGIC, +#ifdef HAVE_FUSE_IOMAP + .iomap_state = IOMAP_UNKNOWN, +#endif + }; errcode_t err; FILE *orig_stderr = stderr; char *logfile; @@ -4849,9 +5295,6 @@ int main(int argc, char *argv[]) int is_bdev; int ret = 0; - memset(&fctx, 0, sizeof(fctx)); - fctx.magic = FUSE2FS_MAGIC; - fuse_opt_parse(&args, &fctx, fuse2fs_opts, fuse2fs_opt_proc); if (fctx.device == NULL) { fprintf(stderr, "Missing ext4 device/image\n"); ^ permalink raw reply related [flat|nested] 55+ messages in thread
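For readers following the hole-synthesis logic in extent_iomap_begin() from patch 01, here is a minimal standalone sketch of the same decision tree. It is not the patch's code: it works in block units rather than bytes, and it swaps the libext2fs extent cursor for a plain sorted array. The names toy_extent, toy_iomap and toy_iomap_begin are hypothetical stand-ins, not libext2fs or libfuse APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct ext2fs_extent and struct fuse_iomap. */
struct toy_extent {
	uint64_t lblk;	/* first logical block */
	uint64_t pblk;	/* first physical block */
	uint32_t len;	/* length in blocks */
};

enum toy_type { TOY_HOLE, TOY_MAPPED };

struct toy_iomap {
	enum toy_type type;
	uint64_t offset;	/* logical block where the result starts */
	uint64_t length;	/* length in blocks */
	uint64_t addr;		/* physical block; unused for holes */
};

/*
 * Report the mapping or hole found at startoff.  ext[] must be sorted by
 * lblk and non-overlapping; count is how many blocks the caller asked about.
 */
static void toy_iomap_begin(const struct toy_extent *ext, size_t nr,
			    uint64_t startoff, uint64_t count,
			    struct toy_iomap *out)
{
	size_t i;

	assert(out != NULL);
	assert(ext != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		/* Extent ends to the left of startoff; keep looking. */
		if (startoff >= ext[i].lblk + ext[i].len)
			continue;

		if (startoff < ext[i].lblk) {
			/* Next extent starts to the right: hole up to it. */
			out->type = TOY_HOLE;
			out->offset = startoff;
			out->length = ext[i].lblk - startoff;
			out->addr = 0;
			return;
		}

		/* Extent overlaps startoff: report the whole extent. */
		out->type = TOY_MAPPED;
		out->offset = ext[i].lblk;
		out->length = ext[i].len;
		out->addr = ext[i].pblk;
		return;
	}

	/* Nothing at or after startoff: the requested range is a hole. */
	out->type = TOY_HOLE;
	out->offset = startoff;
	out->length = count;
	out->addr = 0;
}
```

As in the patch, a mapped result describes the entire extent, so its offset may lie to the left of the position that was asked about. A hole only ever starts at the requested position.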
* [PATCH 02/16] fuse2fs: register block devices for use with iomap 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong 2025-05-22 0:11 ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong @ 2025-05-22 0:11 ` Darrick J. Wong 2025-05-22 0:11 ` [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong ` (13 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:11 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Register the ext4 block device with the kernel for use with iomap. For now this is redundant with using fuseblk mode because the kernel automatically registers any fuseblk devices, but eventually we'll go back to regular fuse mode and we'll have to pin the bdev ourselves. In theory this interface supports strange beasts where the metadata can exist somewhere else entirely (or be made up by AI) while the file data persists to real disks. Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 44 ++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 40 insertions(+), 4 deletions(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index f9eed078d91152..92a80753f4f1e8 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -36,6 +36,7 @@ # define _FILE_OFFSET_BITS 64 #endif /* _FILE_OFFSET_BITS */ #include <fuse.h> +#include <fuse_lowlevel.h> #ifdef __SET_FOB_FOR_FUSE # undef _FILE_OFFSET_BITS #endif /* __SET_FOB_FOR_FUSE */ @@ -179,6 +180,7 @@ struct fuse2fs { int blocklog; #ifdef HAVE_FUSE_IOMAP enum fuse2fs_iomap_state iomap_state; + uint32_t iomap_dev; #endif unsigned int blockmask; int retcode; @@ -4638,7 +4640,7 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode, static void handle_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap, off_t pos, uint64_t count) { - iomap->dev = FUSE_IOMAP_DEV_FUSEBLK; + iomap->dev = ff->iomap_dev; iomap->addr = FUSE_IOMAP_NULL_ADDR; iomap->offset = pos; iomap->length = count; @@ -4815,7 +4817,7 @@ static errcode_t extent_iomap_begin(struct fuse2fs *ff, uint64_t ino, } /* Mapping overlaps startoff, report this. 
*/ - iomap->dev = FUSE_IOMAP_DEV_FUSEBLK; + iomap->dev = ff->iomap_dev; iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk); iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk); iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len); @@ -4846,7 +4848,7 @@ static int indirect_iomap_begin(struct fuse2fs *ff, uint64_t ino, if (err) return translate_error(fs, ino, err); - iomap->dev = FUSE_IOMAP_DEV_FUSEBLK; + iomap->dev = ff->iomap_dev; iomap->offset = pos; iomap->flags |= FUSE_IOMAP_F_MERGED; if (startblock) { @@ -4884,7 +4886,7 @@ static int indirect_iomap_begin(struct fuse2fs *ff, uint64_t ino, static int inline_iomap_begin(struct fuse2fs *ff, off_t pos, uint64_t count, struct fuse_iomap *iomap) { - iomap->dev = FUSE_IOMAP_DEV_FUSEBLK; + iomap->dev = ff->iomap_dev; iomap->addr = FUSE_IOMAP_NULL_ADDR; iomap->offset = pos; iomap->length = count; @@ -4925,6 +4927,31 @@ static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino, return -ENOSYS; } +static errcode_t config_iomap_devices(struct fuse_context *ctxt, + struct fuse2fs *ff) +{ + struct fuse_session *se = fuse_get_session(ctxt->fuse); + errcode_t err; + int fd; + int ret; + + err = io_channel_fd(ff->fs->io, &fd); + if (err) + return err; + + ret = fuse_lowlevel_notify_iomap_add_device(se, fd, &ff->iomap_dev); + + dbg_printf(ff, "%s: registering iomap dev fd=%d ret=%d iomap_dev=%u\n", + __func__, fd, ret, ff->iomap_dev); + + if (ret) + return ret; + if (ff->iomap_dev == FUSE_IOMAP_DEV_NULL) + return -EIO; + + return 0; +} + static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, off_t pos, uint64_t count, uint32_t opflags, struct fuse_iomap *read_iomap, @@ -4951,6 +4978,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, (unsigned long long)count, opflags); + if (ff->iomap_dev == FUSE_IOMAP_DEV_NULL) { + err = config_iomap_devices(ctxt, ff); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + } + err = 
fuse2fs_read_inode(fs, attr_ino, &inode); if (err) { ret = translate_error(fs, attr_ino, err); @@ -5285,6 +5320,7 @@ int main(int argc, char *argv[]) .magic = FUSE2FS_MAGIC, #ifdef HAVE_FUSE_IOMAP .iomap_state = IOMAP_UNKNOWN, + .iomap_dev = FUSE_IOMAP_DEV_NULL, #endif }; errcode_t err; ^ permalink raw reply related [flat|nested] 55+ messages in thread
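Patch 02 registers the block device lazily: the context starts with a "null" device cookie and the first iomap_begin call performs the registration. The sketch below shows only that pattern. The sentinel value, the cookie value and toy_register_device() are assumptions made up for illustration; the real code calls fuse_lowlevel_notify_iomap_add_device() from the prototype libfuse branch, whose behaviour is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DEV_NULL 0U	/* hypothetical "not registered yet" sentinel */

struct toy_ctx {
	uint32_t iomap_dev;		/* device cookie, or TOY_DEV_NULL */
	unsigned int registrations;	/* call counter, for illustration only */
};

/* Hypothetical stand-in for the registration call; hands out a cookie. */
static int toy_register_device(struct toy_ctx *ctx, uint32_t *dev)
{
	ctx->registrations++;
	*dev = 7;	/* arbitrary nonzero cookie */
	return 0;
}

/* Return the device cookie, registering the device on first use. */
static int toy_get_dev(struct toy_ctx *ctx, uint32_t *dev)
{
	int ret;

	assert(ctx != NULL && dev != NULL);

	if (ctx->iomap_dev == TOY_DEV_NULL) {
		ret = toy_register_device(ctx, &ctx->iomap_dev);
		if (ret)
			return ret;
		/* Registration claimed success but gave no usable cookie. */
		if (ctx->iomap_dev == TOY_DEV_NULL)
			return -1;
	}

	*dev = ctx->iomap_dev;
	return 0;
}
```

Calling toy_get_dev() twice on the same context registers once and returns the cached cookie the second time, which is the property the sentinel check in op_iomap_begin() relies on.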
* [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong 2025-05-22 0:11 ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong 2025-05-22 0:11 ` [PATCH 02/16] fuse2fs: register block devices for use with iomap Darrick J. Wong @ 2025-05-22 0:11 ` Darrick J. Wong 2025-05-22 0:11 ` [PATCH 04/16] fuse2fs: implement directio file reads Darrick J. Wong ` (12 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:11 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> In iomap mode, the kernel writes file data directly to the block device and does not flush the bdev page cache. We must open the filesystem in directio mode to avoid cache coherency issues when reading file data blocks. If we can't open the bdev in directio mode, we must not use iomap. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 92a80753f4f1e8..91c0da096bef9c 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -988,8 +988,14 @@ static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff) return 0; } + +static int iomap_enabled(const struct fuse2fs *ff) +{ + return ff->iomap_state == IOMAP_ENABLED; +} #else # define confirm_iomap(...) (0) +# define iomap_enabled(...) 
(0) #endif static void *op_init(struct fuse_conn_info *conn @@ -1001,6 +1007,9 @@ static void *op_init(struct fuse_conn_info *conn struct fuse_context *ctxt = fuse_get_context(); struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data; ext2_filsys fs = ff->fs; +#ifdef HAVE_FUSE_IOMAP + int was_directio = ff->directio; +#endif errcode_t err; int ret; @@ -1023,6 +1032,15 @@ static void *op_init(struct fuse_conn_info *conn if (ff->iomap_state != IOMAP_DISABLED && fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) ff->iomap_state = IOMAP_ENABLED; + /* + * In iomap mode, the kernel writes file data directly to the block + * device and does not flush the bdev page cache. We must open the + * filesystem in directio mode to avoid cache coherency issues when + * reading file data. If we can't open the bdev in directio mode, we + * must not use iomap. + */ + if (iomap_enabled(ff)) + ff->directio = 1; #endif #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0) @@ -1038,6 +1056,14 @@ static void *op_init(struct fuse_conn_info *conn */ if (!fs) { err = open_fs(ff, 0); +#ifdef HAVE_FUSE_IOMAP + if (err && iomap_enabled(ff) && !was_directio) { + fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP); + ff->iomap_state = IOMAP_DISABLED; + ff->directio = 0; + err = open_fs(ff, 0); + } +#endif if (err) goto mount_fail; fs = ff->fs; ^ permalink raw reply related [flat|nested] 55+ messages in thread
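The open-with-fallback flow in patch 03 is easy to misread in diff form, so here is a small self-contained sketch of it. It assumes a simplified server struct and a caller-supplied opener callback; toy_server, toy_open_fn and the two example openers are hypothetical and stand in for struct fuse2fs and open_fs().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the relevant fuse2fs state. */
struct toy_server {
	bool iomap_enabled;
	bool directio;
};

/* Hypothetical opener: returns 0 on success, nonzero on failure. */
typedef int (*toy_open_fn)(const struct toy_server *srv);

/* Example opener modelling a device that rejects directio opens. */
static int toy_open_rejects_directio(const struct toy_server *srv)
{
	return srv->directio ? -1 : 0;
}

/* Example opener that always succeeds. */
static int toy_open_always_ok(const struct toy_server *srv)
{
	(void)srv;
	return 0;
}

static int toy_open_with_fallback(struct toy_server *srv, toy_open_fn open_fs)
{
	bool was_directio;
	int err;

	assert(srv != NULL && open_fs != NULL);
	was_directio = srv->directio;

	/* iomap needs directio so metadata reads bypass the bdev page cache. */
	if (srv->iomap_enabled)
		srv->directio = true;

	err = open_fs(srv);
	if (err && srv->iomap_enabled && !was_directio) {
		/* We forced directio and it failed: drop iomap and retry. */
		srv->iomap_enabled = false;
		srv->directio = false;
		err = open_fs(srv);
	}

	return err;
}
```

If the user asked for directio explicitly, a failed open is reported as is and there is no retry, matching the !was_directio test in the patch.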
* [PATCH 04/16] fuse2fs: implement directio file reads 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (2 preceding siblings ...) 2025-05-22 0:11 ` [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong @ 2025-05-22 0:11 ` Darrick J. Wong 2025-05-22 0:12 ` [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong ` (11 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:11 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Implement file reads via iomap. Currently only directio is supported. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 91c0da096bef9c..b1f3002ec8c481 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -1103,6 +1103,11 @@ static void *op_init(struct fuse_conn_info *conn goto mount_fail; } +#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_DIRECTIO) + if (iomap_enabled(ff)) + fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO); +#endif + /* Clear the valid flag so that an unclean shutdown forces a fsck */ if (ff->writable) { fs->super->s_mnt_count++; @@ -4942,7 +4947,26 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino, uint64_t count, uint32_t opflags, struct fuse_iomap *read_iomap) { - return -ENOSYS; + errcode_t err; + + if (!(opflags & FUSE_IOMAP_OP_DIRECT)) + return -ENOSYS; + + /* fall back to slow path for inline data reads */ + if (inode->i_flags & EXT4_INLINE_DATA_FL) + return -ENOSYS; + + /* flush dirty io_channel buffers to disk before iomap reads them */ + err = io_channel_flush(ff->fs->io); + if (err) + return translate_error(ff->fs, ino, err); + + if (inode->i_flags & 
EXT4_EXTENTS_FL) + return extent_iomap_begin(ff, ino, inode, pos, count, opflags, + read_iomap); + + return indirect_iomap_begin(ff, ino, inode, pos, count, opflags, + read_iomap); } static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino, ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (3 preceding siblings ...) 2025-05-22 0:11 ` [PATCH 04/16] fuse2fs: implement directio file reads Darrick J. Wong @ 2025-05-22 0:12 ` Darrick J. Wong 2025-05-22 0:12 ` [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong ` (10 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:12 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Change the punch hole helpers to use the tagged block IO commands now that libext2fs uses tagged block IO commands for file IO. We'll need this in the next patch when we turn on selective IO manager cache clearing and invalidation. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index b1f3002ec8c481..c0f868e8f01ed4 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -4510,13 +4510,13 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino, if (!blk || (retflags & BMAP_RET_UNINIT)) return 0; - err = io_channel_read_blk(fs->io, blk, 1, *buf); + err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf); if (err) return err; memset(*buf + residue, 0, len); - return io_channel_write_blk(fs->io, blk, 1, *buf); + return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf); } static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino, @@ -4544,7 +4544,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino, if (err) return err; - err = io_channel_read_blk(fs->io, blk, 1, *buf); + err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf); if (err) return err; if (!blk || (retflags & BMAP_RET_UNINIT)) @@ 
-4555,7 +4555,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino, else memset(*buf + residue, 0, fs->blocksize - residue); - return io_channel_write_blk(fs->io, blk, 1, *buf); + return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf); } static int punch_helper(struct fuse_file_info *fp, int mode, off_t offset, ^ permalink raw reply related [flat|nested] 55+ messages in thread
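The punch-hole helpers touched by patch 05 zero part of a block with a read-modify-write cycle. The sketch below shows that cycle against an in-memory buffer standing in for the disk. The block size, the toy_clean_block_middle name and the bounds check are assumptions for illustration; the real helpers go through io_channel_read_tagblk() and io_channel_write_tagblk() and pass the inode number as the tag, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCKSIZE 16U	/* deliberately tiny, for illustration */

/*
 * Zero len bytes starting at byte offset residue inside block blk of an
 * in-memory "device".  buf is caller-provided scratch space of one block.
 * Returns 0 on success, -1 if the range does not fit inside the block.
 */
static int toy_clean_block_middle(uint8_t *dev, uint64_t blk,
				  unsigned int residue, unsigned int len,
				  uint8_t *buf)
{
	assert(dev != NULL && buf != NULL);

	if (residue > TOY_BLOCKSIZE || len > TOY_BLOCKSIZE - residue)
		return -1;

	/* Read the whole block; storage is addressed in block units. */
	memcpy(buf, dev + blk * TOY_BLOCKSIZE, TOY_BLOCKSIZE);

	/* Zero only the requested byte range. */
	memset(buf + residue, 0, len);

	/* Write the whole block back. */
	memcpy(dev + blk * TOY_BLOCKSIZE, buf, TOY_BLOCKSIZE);
	return 0;
}
```

Bytes outside the requested range survive the round trip untouched. That is why the helper reads the block first instead of writing a partially zeroed buffer.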
* [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (4 preceding siblings ...) 2025-05-22 0:12 ` [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong @ 2025-05-22 0:12 ` Darrick J. Wong 2025-05-22 0:12 ` [PATCH 07/16] fuse2fs: add extent dump function for debugging Darrick J. Wong ` (9 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:12 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> We only need to flush the io_channel's cache for the file that's being read directly, not everything else. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index c0f868e8f01ed4..3ec99310b0f112 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -4957,7 +4957,7 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino, return -ENOSYS; /* flush dirty io_channel buffers to disk before iomap reads them */ - err = io_channel_flush(ff->fs->io); + err = io_channel_flush_tag(ff->fs->io, ino); if (err) return translate_error(ff->fs, ino, err); ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH 07/16] fuse2fs: add extent dump function for debugging 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (5 preceding siblings ...) 2025-05-22 0:12 ` [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong @ 2025-05-22 0:12 ` Darrick J. Wong 2025-05-22 0:12 ` [PATCH 08/16] fuse2fs: implement direct write support Darrick J. Wong ` (8 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:12 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Add a function to dump an inode's extent map for debugging purposes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 68 insertions(+) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 3ec99310b0f112..7e9095766c6624 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -377,6 +377,74 @@ static inline errcode_t fuse2fs_write_inode(ext2_filsys fs, ext2_ino_t ino, sizeof(*inode)); } +static inline void dump_ino_extents(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, + const char *why) +{ + ext2_filsys fs = ff->fs; + unsigned int nr = 0; + blk64_t blockcount = 0; + struct ext2_inode_large xinode; + struct ext2fs_extent extent; + ext2_extent_handle_t extents; + int op = EXT2_EXTENT_ROOT; + errcode_t retval; + + if (!inode) { + inode = &xinode; + + retval = fuse2fs_read_inode(fs, ino, inode); + if (retval) { + com_err(__func__, retval, _("reading ino %u"), ino); + return; + } + } + + if (!(inode->i_flags & EXT4_EXTENTS_FL)) + return; + + printf("%s: %s ino %u isize %llu iblocks %llu\n", __func__, why, ino, + EXT2_I_SIZE(inode), + (ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)) * 512) / + fs->blocksize); + fflush(stdout); + + retval = ext2fs_extent_open(fs, 
ino, &extents); + if (retval) { + com_err(__func__, retval, _("opening extents of ino \"%u\""), + ino); + return; + } + + while ((retval = ext2fs_extent_get(extents, op, &extent)) == 0) { + op = EXT2_EXTENT_NEXT; + + if (extent.e_flags & EXT2_EXTENT_FLAGS_SECOND_VISIT) + continue; + + printf("[%u]: %s lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", + nr++, why, extent.e_lblk, extent.e_pblk, extent.e_len, + extent.e_flags); + fflush(stdout); + if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF) + blockcount += extent.e_len; + else + blockcount++; + } + if (retval == EXT2_ET_EXTENT_NO_NEXT) + retval = 0; + if (retval) { + com_err(__func__, retval, ("getting extents of ino %u"), + ino); + } + if (inode->i_file_acl) + blockcount++; + printf("%s: %s sum(e_len) %llu\n", __func__, why, blockcount); + fflush(stdout); + + ext2fs_extent_free(extents); +} + static void get_now(struct timespec *now) { #ifdef CLOCK_REALTIME ^ permalink raw reply related [flat|nested] 55+ messages in thread
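The block accounting at the end of dump_ino_extents() can be sketched without libext2fs. The record layout and flag values below are hypothetical; only the counting rule follows the patch: skip second visits, add the length of each leaf extent, count one block per interior record, and add one more block if the inode has an ACL block.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, not the EXT2_EXTENT_FLAGS_* constants. */
#define TOY_FLAG_LEAF		0x1u
#define TOY_FLAG_SECOND_VISIT	0x2u

struct toy_extent {
	uint64_t lblk;		/* first logical block */
	uint32_t len;		/* length in blocks */
	uint32_t flags;
};

/* Sum the blocks described by a flattened extent tree walk. */
static uint64_t toy_count_blocks(const struct toy_extent *recs, size_t nr,
				 int has_acl_block)
{
	uint64_t blockcount = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		/* Interior nodes are visited twice; count them once. */
		if (recs[i].flags & TOY_FLAG_SECOND_VISIT)
			continue;

		if (recs[i].flags & TOY_FLAG_LEAF)
			blockcount += recs[i].len;	/* data blocks */
		else
			blockcount++;			/* one tree block */
	}
	if (has_acl_block)
		blockcount++;
	return blockcount;
}
```

Comparing this sum with the iblocks value printed by the first line of the dump is a quick consistency check while debugging.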
* [PATCH 08/16] fuse2fs: implement direct write support 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (6 preceding siblings ...) 2025-05-22 0:12 ` [PATCH 07/16] fuse2fs: add extent dump function for debugging Darrick J. Wong @ 2025-05-22 0:12 ` Darrick J. Wong 2025-05-22 0:13 ` [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong ` (7 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:12 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Wire up an iomap_begin method that can allocate into holes so that we can do directio writes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 481 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 478 insertions(+), 3 deletions(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 7e9095766c6624..ec17f6203b4b70 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -5037,12 +5037,99 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino, read_iomap); } +static int fuse_iomap_write_allocate(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, off_t pos, + uint64_t count, uint32_t opflags, struct + fuse_iomap *read_iomap, bool *dirty) +{ + ext2_filsys fs = ff->fs; + blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos); + blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + count); + errcode_t err; + int ret; + + dbg_printf(ff, "%s: write_alloc ino=%u startoff 0x%llx blockcount 0x%llx\n", + __func__, ino, startoff, stopoff - startoff); + + if (!fs_can_allocate(ff, stopoff - startoff)) + return -ENOSPC; + + err = ext2fs_fallocate(fs, EXT2_FALLOCATE_FORCE_UNINIT, ino, + EXT2_INODE(inode), 0, startoff, + stopoff - startoff); + if (err) + return translate_error(fs, ino, err); + + /* pick up the newly allocated mapping */ + ret = 
fuse_iomap_begin_read(ff, ino, inode, pos, count, opflags, + read_iomap); + if (ret) + return ret; + + read_iomap->flags |= FUSE_IOMAP_F_DIRTY; + *dirty = true; + return 0; +} + +static off_t max_file_size(const struct fuse2fs *ff, + const struct ext2_inode_large *inode) +{ + ext2_filsys fs = ff->fs; + blk64_t addr_per_block, max_map_block; + + if (inode->i_flags & EXT4_EXTENTS_FL) { + max_map_block = (1ULL << 32) - 1; + } else { + addr_per_block = fs->blocksize >> 2; + max_map_block = addr_per_block; + max_map_block += addr_per_block * addr_per_block; + max_map_block += addr_per_block * addr_per_block * addr_per_block; + max_map_block += 12; + } + + return FUSE2FS_FSB_TO_B(ff, max_map_block) + (fs->blocksize - 1); +} + static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino, struct ext2_inode_large *inode, off_t pos, uint64_t count, uint32_t opflags, - struct fuse_iomap *read_iomap) + struct fuse_iomap *read_iomap, bool *dirty) { - return -ENOSYS; + off_t max_size = max_file_size(ff, inode); + errcode_t err; + int ret; + + if (!(opflags & FUSE_IOMAP_OP_DIRECT)) + return -ENOSYS; + + if (pos >= max_size) + return -EFBIG; + + if (pos >= max_size - count) + count = max_size - pos; + + ret = fuse_iomap_begin_read(ff, ino, inode, pos, count, opflags, + read_iomap); + if (ret) + return ret; + + if (read_iomap->type == FUSE_IOMAP_TYPE_HOLE && + !(opflags & FUSE_IOMAP_OP_ZERO)) { + ret = fuse_iomap_write_allocate(ff, ino, inode, pos, count, + opflags, read_iomap, dirty); + if (ret) + return ret; + } + + /* + * flush and invalidate the file's io_channel buffers before iomap + * writes them + */ + err = io_channel_invalidate_tag(ff->fs->io, ino); + if (err) + return translate_error(ff->fs, ino, err); + + return 0; } static errcode_t config_iomap_devices(struct fuse_context *ctxt, @@ -5080,6 +5167,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, struct ext2_inode_large inode; ext2_filsys fs; errcode_t err; + bool dirty = 
false; int ret = 0; FUSE2FS_CHECK_CONTEXT(ff); @@ -5115,7 +5203,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, opflags, read_iomap); else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO)) ret = fuse_iomap_begin_write(ff, attr_ino, &inode, pos, count, - opflags, read_iomap); + opflags, read_iomap, &dirty); else ret = fuse_iomap_begin_read(ff, attr_ino, &inode, pos, count, opflags, read_iomap); @@ -5132,6 +5220,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, (unsigned long long)read_iomap->length, read_iomap->type); + if (dirty) { + err = fuse2fs_write_inode(fs, attr_ino, &inode); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + } + out_unlock: if (ret < 0) dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret); @@ -5163,6 +5259,384 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino, return 0; } + +static inline bool can_merge_mappings(const struct ext2fs_extent *left, + const struct ext2fs_extent *right) +{ + uint64_t max_len = (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ? 
+ EXT_UNINIT_MAX_LEN : EXT_INIT_MAX_LEN; + + return left->e_lblk + left->e_len == right->e_lblk && + left->e_pblk + left->e_len == right->e_pblk && + (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) == + (right->e_flags & EXT2_EXTENT_FLAGS_UNINIT) && + (uint64_t)left->e_len + right->e_len <= max_len; +} + +static int try_merge_mappings(struct fuse2fs *ff, ext2_ino_t ino, + ext2_extent_handle_t handle, blk64_t startoff) +{ + ext2_filsys fs = ff->fs; + struct ext2fs_extent left, right; + errcode_t err; + + /* Look up the mappings before startoff */ + err = get_mapping_at(ff, handle, startoff - 1, &left); + if (err == EXT2_ET_EXTENT_NOT_FOUND) + return 0; + if (err) + return translate_error(fs, ino, err); + + /* Look up the mapping at startoff */ + err = get_mapping_at(ff, handle, startoff, &right); + if (err == EXT2_ET_EXTENT_NOT_FOUND) + return 0; + if (err) + return translate_error(fs, ino, err); + + /* Can we combine them? */ + if (!can_merge_mappings(&left, &right)) + return 0; + + /* + * Delete the mapping after startoff because libext2fs cannot handle + * overlapping mappings. 
+ */ + err = ext2fs_extent_delete(handle, 0); + DUMP_EXTENT(ff, "remover", startoff, err, &right); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixremover", startoff, err, &right); + if (err) + return translate_error(fs, ino, err); + + /* Move back and lengthen the mapping before startoff */ + err = ext2fs_extent_goto(handle, left.e_lblk); + DUMP_EXTENT(ff, "movel", startoff - 1, err, &left); + if (err) + return translate_error(fs, ino, err); + + left.e_len += right.e_len; + err = ext2fs_extent_replace(handle, 0, &left); + DUMP_EXTENT(ff, "replacel", startoff - 1, err, &left); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixreplacel", startoff - 1, err, &left); + if (err) + return translate_error(fs, ino, err); + + return 0; +} + +static int convert_unwritten_mapping(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, + ext2_extent_handle_t handle, + blk64_t *cursor, blk64_t stopoff) +{ + ext2_filsys fs = ff->fs; + struct ext2fs_extent extent; + blk64_t startoff = *cursor; + errcode_t err; + + /* + * Find the mapping at startoff. Note that we can find holes because + * the mapping data can change due to racing writes. + */ + err = get_mapping_at(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + /* + * If we didn't find any mappings at all then the file is + * completely sparse. There's nothing to convert. + */ + *cursor = stopoff; + return 0; + } + if (err) + return translate_error(fs, ino, err); + + /* + * The mapping is completely to the left of the range that we want. + * Let's see what's in the next extent, if there is one. + */ + if (startoff >= extent.e_lblk + extent.e_len) { + /* + * Mapping ends to the left of the current position. Try to + * find the next mapping. If there is no next mapping, then + * we're done. 
+ */ + err = get_next_mapping(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + *cursor = stopoff; + return 0; + } + if (err) + return translate_error(fs, ino, err); + } + + /* + * The mapping is completely to the right of the range that we want, + * so we're done. + */ + if (extent.e_lblk >= stopoff) { + *cursor = stopoff; + return 0; + } + + /* + * At this point, we have a mapping that overlaps (startoff, stopoff]. + * If the mapping is already written, move on to the next one. + */ + if (!(extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)) + goto next; + + if (startoff > extent.e_lblk) { + struct ext2fs_extent newex = extent; + + /* + * Unwritten mapping starts before startoff. Shorten + * the previous mapping... + */ + newex.e_len = startoff - extent.e_lblk; + err = ext2fs_extent_replace(handle, 0, &newex); + DUMP_EXTENT(ff, "shortenp", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixshortenp", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + + /* ...and create new written mapping at startoff. */ + extent.e_len -= newex.e_len; + extent.e_lblk += newex.e_len; + extent.e_pblk += newex.e_len; + extent.e_flags = newex.e_flags & ~EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_insert(handle, + EXT2_EXTENT_INSERT_AFTER, + &extent); + DUMP_EXTENT(ff, "insertx", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixinsertx", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + } + + if (extent.e_lblk + extent.e_len > stopoff) { + struct ext2fs_extent newex = extent; + + /* + * Unwritten mapping ends after stopoff. Shorten the current + * mapping... 
+ */ + extent.e_len = stopoff - extent.e_lblk; + extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_replace(handle, 0, &extent); + DUMP_EXTENT(ff, "shortenn", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixshortenn", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + /* ..and create a new unwritten mapping at stopoff. */ + newex.e_pblk += extent.e_len; + newex.e_lblk += extent.e_len; + newex.e_len -= extent.e_len; + newex.e_flags |= EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_insert(handle, + EXT2_EXTENT_INSERT_AFTER, + &newex); + DUMP_EXTENT(ff, "insertn", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixinsertn", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + } + + /* Still unwritten? Update the state. */ + if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) { + extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_replace(handle, 0, &extent); + DUMP_EXTENT(ff, "replacex", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixreplacex", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + } + +next: + /* Try to merge with the previous extent */ + if (startoff > 0) { + err = try_merge_mappings(ff, ino, handle, startoff); + if (err) + return translate_error(fs, ino, err); + } + + *cursor = extent.e_lblk + extent.e_len; + return 0; +} + +static int convert_unwritten_mappings(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, + off_t pos, size_t written) +{ + ext2_extent_handle_t handle; + ext2_filsys fs = ff->fs; + blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos); + const blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + written); + errcode_t err; + int ret; + + err 
= ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle); + if (err) + return translate_error(fs, ino, err); + + /* Walk every mapping in the range, converting them. */ + while (startoff < stopoff) { + blk64_t old_startoff = startoff; + + ret = convert_unwritten_mapping(ff, ino, inode, handle, + &startoff, stopoff); + if (ret) + goto out_handle; + if (startoff <= old_startoff) { + /* Do not go backwards. */ + ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED); + goto out_handle; + } + } + + /* Try to merge the right edge */ + ret = try_merge_mappings(ff, ino, handle, stopoff); +out_handle: + ext2fs_extent_free(handle); + return ret; +} + +static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino, + off_t pos, size_t written, uint32_t ioendflags, + int error, uint64_t new_addr) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data; + struct ext2_inode_large inode; + ext2_filsys fs; + errcode_t err; + bool dirty = false; + int ret = 0; + + FUSE2FS_CHECK_CONTEXT(ff); + fs = ff->fs; + + pthread_mutex_lock(&ff->bfl); + + dbg_printf(ff, + "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx written=0x%zx ioendflags=0x%x error=%d new_addr=%llu\n", + __func__, path, + (unsigned long long)nodeid, + (unsigned long long)attr_ino, + (unsigned long long)pos, + written, + ioendflags, + error, + (unsigned long long)new_addr); + + if (error) { + ret = error; + goto out_unlock; + } + + /* + * flush and invalidate the file's io_channel buffers again now that + * iomap wrote them + */ + if (written > 0) { + err = io_channel_invalidate_tag(ff->fs->io, attr_ino); + if (err) { + ret = translate_error(ff->fs, attr_ino, err); + goto out_unlock; + } + } + + /* should never see these ioend types */ + if ((ioendflags & FUSE_IOMAP_IOEND_SHARED) || + new_addr != FUSE_IOMAP_NULL_ADDR) { + ret = translate_error(fs, attr_ino, + EXT2_ET_FILESYSTEM_CORRUPTED); + goto out_unlock; + } + + err = fuse2fs_read_inode(fs, 
attr_ino, &inode); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + + if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) { + /* unwritten extents are only supported on extents files */ + if (!(inode.i_flags & EXT4_EXTENTS_FL)) { + ret = translate_error(fs, attr_ino, + EXT2_ET_FILESYSTEM_CORRUPTED); + goto out_unlock; + } + + ret = convert_unwritten_mappings(ff, attr_ino, &inode, pos, + written); + if (ret) + goto out_unlock; + + dirty = true; + } + + if (ioendflags & FUSE_IOMAP_IOEND_APPEND) { + ext2_off64_t isize = EXT2_I_SIZE(&inode); + + if (pos + written > isize) { + err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), + pos + written); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + + dirty = true; + } + } + + if (dirty) { + err = fuse2fs_write_inode(fs, attr_ino, &inode); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + } + +out_unlock: + if (ret < 0) + dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret); + pthread_mutex_unlock(&ff->bfl); + return ret; +} #endif /* HAVE_FUSE_IOMAP */ static struct fuse_operations fs_ops = { @@ -5228,6 +5702,7 @@ static struct fuse_operations fs_ops = { #ifdef HAVE_FUSE_IOMAP .iomap_begin = op_iomap_begin, .iomap_end = op_iomap_end, + .iomap_ioend = op_iomap_ioend, #endif /* HAVE_FUSE_IOMAP */ }; ^ permalink raw reply related [flat|nested] 55+ messages in thread
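The core of convert_unwritten_mapping() is splitting one unwritten extent into at most three pieces around the range that was just written. The following sketch shows only that arithmetic, assuming a simplified toy_extent record; it does not touch an extent tree, and the names are hypothetical rather than libext2fs API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_extent {
	uint64_t lblk;		/* first logical block */
	uint64_t pblk;		/* first physical block */
	uint32_t len;		/* length in blocks */
	bool unwritten;
};

/*
 * Mark the part of ex that overlaps [startoff, stopoff) as written.
 * Stores one to three extents in out and returns how many were stored.
 */
static size_t toy_convert_unwritten(const struct toy_extent *ex,
				    uint64_t startoff, uint64_t stopoff,
				    struct toy_extent out[3])
{
	uint64_t ex_end = ex->lblk + ex->len;
	uint64_t lo = startoff > ex->lblk ? startoff : ex->lblk;
	uint64_t hi = stopoff < ex_end ? stopoff : ex_end;
	size_t n = 0;

	/* Already written, or no overlap: nothing to convert. */
	if (!ex->unwritten || lo >= hi) {
		out[n++] = *ex;
		return n;
	}

	/* Left remainder keeps its unwritten state. */
	if (lo > ex->lblk) {
		out[n] = *ex;
		out[n].len = (uint32_t)(lo - ex->lblk);
		n++;
	}

	/* The overlapping middle becomes a written mapping. */
	out[n].lblk = lo;
	out[n].pblk = ex->pblk + (lo - ex->lblk);
	out[n].len = (uint32_t)(hi - lo);
	out[n].unwritten = false;
	n++;

	/* Right remainder stays unwritten. */
	if (hi < ex_end) {
		out[n].lblk = hi;
		out[n].pblk = ex->pblk + (hi - ex->lblk);
		out[n].len = (uint32_t)(ex_end - hi);
		out[n].unwritten = true;
		n++;
	}
	return n;
}
```

For example, converting blocks 103 through 105 of a ten-block unwritten extent starting at logical block 100 yields an unwritten piece of three blocks, a written piece of three blocks, and an unwritten piece of four blocks. The patch performs the same split with ext2fs_extent_replace() and ext2fs_extent_insert(), then tries to merge the result with its neighbours.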
* [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (7 preceding siblings ...) 2025-05-22 0:12 ` [PATCH 08/16] fuse2fs: implement direct write support Darrick J. Wong @ 2025-05-22 0:13 ` Darrick J. Wong 2025-05-22 0:13 ` [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim Darrick J. Wong ` (6 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:13 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Turn on iomap for pagecache IO to regular files. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 57 insertions(+), 7 deletions(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index ec17f6203b4b70..7152979ed6694e 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -1175,6 +1175,10 @@ static void *op_init(struct fuse_conn_info *conn if (iomap_enabled(ff)) fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO); #endif +#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_PAGECACHE) + if (iomap_enabled(ff)) + fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_PAGECACHE); +#endif /* Clear the valid flag so that an unclean shutdown forces a fsck */ if (ff->writable) { @@ -5017,9 +5021,6 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino, { errcode_t err; - if (!(opflags & FUSE_IOMAP_OP_DIRECT)) - return -ENOSYS; - /* fall back to slow path for inline data reads */ if (inode->i_flags & EXT4_INLINE_DATA_FL) return -ENOSYS; @@ -5099,9 +5100,6 @@ static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino, errcode_t err; int ret; - if (!(opflags & FUSE_IOMAP_OP_DIRECT)) - return -ENOSYS; - if (pos >= max_size) return -EFBIG; @@ -5235,12 +5233,51 @@ static int 
op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, return ret; } +static int iomap_append_setsize(struct fuse2fs *ff, ext2_ino_t ino, + loff_t newsize) +{ + ext2_filsys fs = ff->fs; + struct ext2_inode_large inode; + ext2_off64_t isize; + errcode_t err; + + dbg_printf(ff, "%s: ino=%u newsize=%llu\n", __func__, ino, + (unsigned long long)newsize); + + err = fuse2fs_read_inode(fs, ino, &inode); + if (err) + return translate_error(fs, ino, err); + + isize = EXT2_I_SIZE(&inode); + if (newsize <= isize) + return 0; + + dbg_printf(ff, "%s: ino=%u oldsize=%llu newsize=%llu\n", __func__, ino, + (unsigned long long)isize, + (unsigned long long)newsize); + + /* + * XXX cheesily update the ondisk size even though we only want to do + * the incore size until writeback happens + */ + err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), newsize); + if (err) + return translate_error(fs, ino, err); + + err = fuse2fs_write_inode(fs, ino, &inode); + if (err) + return translate_error(fs, ino, err); + + return 0; +} + static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino, off_t pos, uint64_t count, uint32_t opflags, ssize_t written, const struct fuse_iomap *iomap) { struct fuse_context *ctxt = fuse_get_context(); struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data; + int ret = 0; FUSE2FS_CHECK_CONTEXT(ff); @@ -5255,9 +5292,22 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino, opflags, written, iomap->flags); + + if ((opflags & FUSE_IOMAP_OP_WRITE) && + !(opflags & FUSE_IOMAP_OP_DIRECT) && + (iomap->flags & FUSE_IOMAP_F_SIZE_CHANGED) && + written > 0) { + ret = iomap_append_setsize(ff, attr_ino, pos + written); + if (ret) + goto out_unlock; + } + +out_unlock: + if (ret < 0) + dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret); pthread_mutex_unlock(&ff->bfl); - return 0; + return ret; } static inline bool can_merge_mappings(const struct ext2fs_extent *left, ^ permalink raw reply related [flat|nested] 55+ 
messages in thread
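The condition under which op_iomap_end() extends the file size for buffered writes can be written as a pure function. This is a sketch under stated assumptions: the flag values are made up for illustration and are not the real FUSE_IOMAP_* constants, and toy_size_after_iomap_end() is a hypothetical helper.

```c
#include <stdint.h>

/* Hypothetical flag values; the real FUSE_IOMAP_* constants may differ. */
#define TOY_OP_WRITE		0x1u
#define TOY_OP_DIRECT		0x2u
#define TOY_F_SIZE_CHANGED	0x100u

/*
 * Return the file size after an iomap_end call.  Only a buffered write
 * that changed the size and actually wrote bytes may grow the file, and
 * the size never shrinks here.
 */
static uint64_t toy_size_after_iomap_end(uint64_t isize, uint32_t opflags,
					 uint32_t mapflags, uint64_t pos,
					 int64_t written)
{
	uint64_t newsize;

	if (!(opflags & TOY_OP_WRITE) || (opflags & TOY_OP_DIRECT))
		return isize;
	if (!(mapflags & TOY_F_SIZE_CHANGED) || written <= 0)
		return isize;

	newsize = pos + (uint64_t)written;
	return newsize > isize ? newsize : isize;
}
```

Direct writes are excluded because, in the previous patch, their size update happens in the ioend handler when the append flag is set.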
* [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (8 preceding siblings ...) 2025-05-22 0:13 ` [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong @ 2025-05-22 0:13 ` Darrick J. Wong 2025-05-22 0:13 ` [PATCH 11/16] fuse2fs: improve tracing for fallocate Darrick J. Wong ` (5 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:13 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Discard operates directly on the storage device, which means that we need to flush and invalidate the buffer cache because it could be caching freed blocks whose contents are about to change. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 7152979ed6694e..219d4bf698d628 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -4365,6 +4365,11 @@ static int ioctl_fitrim(struct fuse2fs *ff, struct fuse2fs_file_handle *fh, cleared = 0; max_blocks = FUSE2FS_B_TO_FSBT(ff, 2048ULL * 1024 * 1024); + /* flush any dirty data out of the disk cache before trimming */ + err = io_channel_flush_tag(ff->fs->io, IO_CHANNEL_TAG_NULL); + if (err) + return translate_error(fs, fh->ino, err); + fr->len = 0; while (start <= end) { err = ext2fs_find_first_zero_block_bitmap2(fs->block_map, @@ -4394,6 +4399,16 @@ static int ioctl_fitrim(struct fuse2fs *ff, struct fuse2fs_file_handle *fh, } start = b + 1; } + if (err) + goto out; + + /* + * Invalidate the entire disk cache now that we've written zeroes so + * that EXT2_ALLOCRANGE_ZERO_BLOCKS works correctly. 
+ */ + err = io_channel_invalidate_tag(ff->fs->io, IO_CHANNEL_TAG_NULL); + if (err) + return translate_error(fs, fh->ino, err); out: fr->len = cleared; ^ permalink raw reply related [flat|nested] 55+ messages in thread
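Why a flush before the trim and an invalidation after it are both needed can be seen with a toy write-back cache. The model below is hypothetical throughout: discard is modelled as zeroing the device directly, which real devices do not guarantee, and none of the helpers are libext2fs functions.

```c
#include <stdbool.h>

#define TOY_NBLOCKS 4

/* One byte per "block"; callers must pass blk < TOY_NBLOCKS. */
static unsigned char toy_disk[TOY_NBLOCKS];

struct toy_slot {
	bool valid;
	bool dirty;
	unsigned char data;
};

static struct toy_slot toy_cache[TOY_NBLOCKS];

/* Read through the cache, filling it from disk on a miss. */
static unsigned char toy_read(unsigned int blk)
{
	if (!toy_cache[blk].valid) {
		toy_cache[blk].data = toy_disk[blk];
		toy_cache[blk].valid = true;
		toy_cache[blk].dirty = false;
	}
	return toy_cache[blk].data;
}

/* Write into the cache only; the disk is updated on flush. */
static void toy_write(unsigned int blk, unsigned char v)
{
	toy_cache[blk].data = v;
	toy_cache[blk].valid = true;
	toy_cache[blk].dirty = true;
}

/* Write back all dirty slots but keep them cached. */
static void toy_flush(void)
{
	unsigned int i;

	for (i = 0; i < TOY_NBLOCKS; i++) {
		if (toy_cache[i].valid && toy_cache[i].dirty) {
			toy_disk[i] = toy_cache[i].data;
			toy_cache[i].dirty = false;
		}
	}
}

/* Flush, then drop every cached copy. */
static void toy_invalidate(void)
{
	unsigned int i;

	toy_flush();
	for (i = 0; i < TOY_NBLOCKS; i++)
		toy_cache[i].valid = false;
}

/* Discard acts on the device directly and bypasses the cache. */
static void toy_discard(unsigned int blk)
{
	toy_disk[blk] = 0;
}
```

After a discard, a read through the cache still returns the old contents until the cache is invalidated, which is the staleness the patch guards against.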
* [PATCH 11/16] fuse2fs: improve tracing for fallocate 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (9 preceding siblings ...) 2025-05-22 0:13 ` [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim Darrick J. Wong @ 2025-05-22 0:13 ` Darrick J. Wong 2025-05-22 0:13 ` [PATCH 12/16] fuse2fs: don't zero bytes in punch hole Darrick J. Wong ` (4 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:13 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> Improve the tracing for fallocate by reporting the inode number and the file range in all tracepoints. Make the ranges hexadecimal to make it easier for the programmer to convert bytes to block numbers and back. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 26 +++++++++++++++++++------- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 219d4bf698d628..fe6d97324c1f57 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -4529,8 +4529,8 @@ static int fallocate_helper(struct fuse_file_info *fp, int mode, off_t offset, FUSE2FS_CHECK_MAGIC(fs, fh, FUSE2FS_FILE_MAGIC); start = FUSE2FS_B_TO_FSBT(ff, offset); end = FUSE2FS_B_TO_FSBT(ff, offset + len - 1); - dbg_printf(ff, "%s: ino=%d mode=0x%x start=%llu end=%llu\n", __func__, - fh->ino, mode, start, end); + dbg_printf(ff, "%s: ino=%d mode=0x%x offset=0x%jx len=0x%jx start=%llu end=%llu\n", + __func__, fh->ino, mode, offset, len, start, end); if (!fs_can_allocate(ff, FUSE2FS_B_TO_FSB(ff, len))) return -ENOSPC; @@ -4601,6 +4601,7 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino, if (err) return err; + dbg_printf(ff, "%s: ino=%d offset=0x%jx len=0x%jx\n", __func__, ino, offset + residue, len); memset(*buf + residue, 0, len); return 
io_channel_write_tagblk(fs->io, ino, blk, 1, *buf); @@ -4637,10 +4638,13 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino, if (!blk || (retflags & BMAP_RET_UNINIT)) return 0; - if (clean_before) + if (clean_before) { + dbg_printf(ff, "%s: ino=%d before offset=0x%jx len=0x%jx\n", __func__, ino, offset, residue); memset(*buf, 0, residue); - else + } else { + dbg_printf(ff, "%s: ino=%d after offset=0x%jx len=0x%jx\n", __func__, ino, offset, fs->blocksize - residue); memset(*buf + residue, 0, fs->blocksize - residue); + } return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf); } @@ -4661,7 +4665,6 @@ static int punch_helper(struct fuse_file_info *fp, int mode, off_t offset, FUSE2FS_CHECK_CONTEXT(ff); fs = ff->fs; FUSE2FS_CHECK_MAGIC(fs, fh, FUSE2FS_FILE_MAGIC); - dbg_printf(ff, "%s: offset=%jd len=%jd\n", __func__, offset, len); /* kernel ext4 punch requires this flag to be set */ if (!(mode & FL_KEEP_SIZE_FLAG)) @@ -4670,8 +4673,9 @@ static int punch_helper(struct fuse_file_info *fp, int mode, off_t offset, /* Punch out a bunch of blocks */ start = FUSE2FS_B_TO_FSB(ff, offset); end = (offset + len - fs->blocksize) / fs->blocksize; - dbg_printf(ff, "%s: ino=%d mode=0x%x start=%llu end=%llu\n", __func__, - fh->ino, mode, start, end); + + dbg_printf(ff, "%s: ino=%d mode=0x%x offset=0x%jx len=0x%jx start=%llu end=%llu\n", + __func__, fh->ino, mode, offset, len, start, end); err = fuse2fs_read_inode(fs, fh->ino, &inode); if (err) @@ -4727,6 +4731,8 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode, { struct fuse_context *ctxt = fuse_get_context(); struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data; + struct fuse2fs_file_handle *fh = + (struct fuse2fs_file_handle *)(uintptr_t)fp->fh; int ret; /* Catch unknown flags */ @@ -4738,6 +4744,12 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode, ret = -EROFS; goto out; } + + dbg_printf(ff, "%s: ino=%d mode=0x%x start=0x%llx end=0x%llx\n", __func__, 
+ fh->ino, mode, + (unsigned long long)offset, + (unsigned long long)offset + len); + if (mode & FL_ZERO_RANGE_FLAG) ret = zero_helper(fp, mode, offset, len); else if (mode & FL_PUNCH_HOLE_FLAG) ^ permalink raw reply related [flat|nested] 55+ messages in thread
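Printing byte ranges in hexadecimal helps because, with a power-of-two block size, the block number and the offset within the block are just bit fields of the byte offset. The helpers below sketch the three conversions used throughout these patches, assuming a 4 KiB block size. They are hypothetical reimplementations inferred from how FUSE2FS_B_TO_FSBT, FUSE2FS_B_TO_FSB and FUSE2FS_OFF_IN_FSB are used here (round down, round up, remainder), not copies of the real macros.

```c
#include <stdint.h>

#define TOY_BLOCKLOG	12u			/* assume 4 KiB blocks */
#define TOY_BLOCKSIZE	(1u << TOY_BLOCKLOG)
#define TOY_BLOCKMASK	(TOY_BLOCKSIZE - 1u)

/* Bytes to filesystem blocks, truncating (round down). */
static uint64_t toy_b_to_fsbt(uint64_t pos)
{
	return pos >> TOY_BLOCKLOG;
}

/* Bytes to filesystem blocks, rounding up. */
static uint64_t toy_b_to_fsb(uint64_t pos)
{
	return (pos + TOY_BLOCKMASK) >> TOY_BLOCKLOG;
}

/* Byte offset within the containing block. */
static uint64_t toy_off_in_fsb(uint64_t pos)
{
	return pos & TOY_BLOCKMASK;
}
```

For offset=0x3a00 the low three hex digits (0xa00) are the offset within the block and the remaining digit (3) is the block number, so a trace line can be checked by eye; the rounded-up block count for the same value is 4.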
* [PATCH 12/16] fuse2fs: don't zero bytes in punch hole 2025-05-22 0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (10 preceding siblings ...) 2025-05-22 0:13 ` [PATCH 11/16] fuse2fs: improve tracing for fallocate Darrick J. Wong @ 2025-05-22 0:13 ` Darrick J. Wong 2025-05-22 0:14 ` [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong ` (3 subsequent siblings) 15 siblings, 0 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-05-22 0:13 UTC (permalink / raw) To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel From: Darrick J. Wong <djwong@kernel.org> When iomap is in use for the pagecache, it will take care of zeroing the unaligned parts of punched out regions so we don't have to do it ourselves. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 32 +++++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index fe6d97324c1f57..aeb2b6fbc28401 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -152,6 +152,7 @@ enum fuse2fs_iomap_state { IOMAP_DISABLED, IOMAP_UNKNOWN, IOMAP_ENABLED, + IOMAP_FILEIO, /* enabled and does all file data block IO */ }; #endif @@ -1040,6 +1041,7 @@ static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff) /* fallthrough */; case IOMAP_DISABLED: return 0; + case IOMAP_FILEIO: case IOMAP_ENABLED: break; } @@ -1059,11 +1061,17 @@ static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff) static int iomap_enabled(const struct fuse2fs *ff) { - return ff->iomap_state == IOMAP_ENABLED; + return ff->iomap_state >= IOMAP_ENABLED; +} + +static int iomap_does_fileio(const struct fuse2fs *ff) +{ + return ff->iomap_state == IOMAP_FILEIO; } #else # define confirm_iomap(...) (0) # define iomap_enabled(...) (0) +# define iomap_does_fileio(...) 
(0) #endif static void *op_init(struct fuse_conn_info *conn @@ -1100,6 +1108,20 @@ static void *op_init(struct fuse_conn_info *conn if (ff->iomap_state != IOMAP_DISABLED && fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) ff->iomap_state = IOMAP_ENABLED; + + /* + * If iomap is turned on and the kernel advertises support for both + * direct and pagecache IO, then that means the kernel handles all + * regular file data block IO for us. That means we can turn off all + * of libext2fs' file data block handling except for inline data. + * + * XXX: kernel doesn't support inline data iomap + */ + if (iomap_enabled(ff) && + fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO) && + fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_PAGECACHE)) + ff->iomap_state = IOMAP_FILEIO; + /* * In iomap mode, the kernel writes file data directly to the block * device and does not flush the bdev page cache. We must open the @@ -4580,6 +4602,10 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino, int retflags; errcode_t err; + /* the kernel does this for us in iomap mode */ + if (iomap_does_fileio(ff)) + return 0; + residue = FUSE2FS_OFF_IN_FSB(ff, offset); if (residue == 0) return 0; @@ -4617,6 +4643,10 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino, off_t residue; errcode_t err; + /* the kernel does this for us in iomap mode */ + if (iomap_does_fileio(ff)) + return 0; + residue = FUSE2FS_OFF_IN_FSB(ff, offset); if (residue == 0) return 0; ^ permalink raw reply related [flat|nested] 55+ messages in thread
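The change of iomap_enabled() from an equality test to a greater-or-equal test relies on the order of the enum values: IOMAP_FILEIO is declared after IOMAP_ENABLED, so it is a stronger form of "enabled". A minimal sketch of that invariant, using hypothetical toy_ names that mirror the patch:

```c
/* Order matters: every state at or after TOY_IOMAP_ENABLED is "enabled". */
enum toy_iomap_state {
	TOY_IOMAP_DISABLED,
	TOY_IOMAP_UNKNOWN,
	TOY_IOMAP_ENABLED,
	TOY_IOMAP_FILEIO,	/* enabled and does all file data block IO */
};

/* True for both ENABLED and FILEIO. */
static int toy_iomap_enabled(enum toy_iomap_state s)
{
	return s >= TOY_IOMAP_ENABLED;
}

/* True only when the kernel handles all file data block IO. */
static int toy_iomap_does_fileio(enum toy_iomap_state s)
{
	return s == TOY_IOMAP_FILEIO;
}
```

Any state added later must be placed with this ordering in mind, otherwise the range comparison silently changes meaning.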
* [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (11 preceding siblings ...)
  2025-05-22  0:13   ` [PATCH 12/16] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
@ 2025-05-22  0:14   ` Darrick J. Wong
  2025-05-22  0:14   ` [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
                     ` (2 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:14 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the page cache, the kernel will take care of
all the file data block IO for us, including zeroing of punched ranges
and post-EOF bytes. fuse2fs only needs to do IO for inline data.

Therefore, set the NOBLOCKIO ext2_file flag so that libext2fs will not
do any regular file IO to or from disk blocks at all.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index aeb2b6fbc28401..842ea3a191fa44 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -2863,9 +2863,14 @@ static int truncate_helper(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size)
 	ext2_file_t file;
 	__u64 old_isize;
 	errcode_t err;
+	int flags = EXT2_FILE_WRITE;
 	int ret = 0;
 
-	err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file);
+	/* the kernel handles all eof zeroing for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		flags |= EXT2_FILE_NOBLOCKIO;
+
+	err = ext2fs_file_open(fs, ino, flags, &file);
 	if (err)
 		return translate_error(fs, ino, err);
 
@@ -2987,6 +2992,9 @@ static int __op_open(struct fuse2fs *ff, const char *path,
 		file->open_flags |= EXT2_FILE_WRITE;
 		break;
 	}
+	/* the kernel handles all block IO for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		file->open_flags |= EXT2_FILE_NOBLOCKIO;
 	if (fp->flags & O_APPEND) {
 		/* the kernel doesn't allow truncation of an append-only file */
 		if (fp->flags & O_TRUNC) {

^ permalink raw reply related	[flat|nested] 55+ messages in thread
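Both hunks above follow the same pattern: compute the usual open flags, then OR in the no-block-IO flag when the kernel handles file data. A small sketch of that composition follows. The DEMO_ constants and demo_open_flags() are hypothetical names with made-up values; the numeric value of EXT2_FILE_NOBLOCKIO is not shown in this part of the thread, so nothing here should be read as the library's definition.

```c
#include <assert.h>

/* Illustrative flag values only; not the libext2fs definitions. */
#define DEMO_FILE_WRITE		0x001
#define DEMO_FILE_NOBLOCKIO	0x100	/* stand-in for EXT2_FILE_NOBLOCKIO */

/* Independent bits, so either can be set without disturbing the other. */
static_assert((DEMO_FILE_WRITE & DEMO_FILE_NOBLOCKIO) == 0,
	      "flag bits must not overlap");

/*
 * Hypothetical helper: build the flags for opening a file handle.
 * kernel_does_fileio models the result of iomap_does_fileio(ff).
 */
static int demo_open_flags(int want_write, int kernel_does_fileio)
{
	int flags = 0;

	if (want_write)
		flags |= DEMO_FILE_WRITE;

	/* the kernel handles all block IO for us in iomap mode */
	if (kernel_does_fileio)
		flags |= DEMO_FILE_NOBLOCKIO;

	return flags;
}
```

Keeping the decision in one flag means the library can still be asked to update inode metadata such as the size on truncate while being told to leave data blocks alone, which is the split the commit message describes.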
* [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (12 preceding siblings ...)
  2025-05-22  0:14   ` [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
@ 2025-05-22  0:14   ` Darrick J. Wong
  2025-05-22  0:14   ` [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
  2025-05-22  0:15   ` [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:14 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Now that fuse2fs uses iomap for pagecache IO, all regular file IO goes
directly to the disk. There is no need to flush the unix IO manager's
disk cache (or invalidate it) because it does not contain file data.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 842ea3a191fa44..ba8c5f301625c6 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5091,9 +5091,11 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 		return -ENOSYS;
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
-	err = io_channel_flush_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!iomap_does_fileio(ff)) {
+		err = io_channel_flush_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	if (inode->i_flags & EXT4_EXTENTS_FL)
 		return extent_iomap_begin(ff, ino, inode, pos, count, opflags,
@@ -5188,9 +5190,11 @@ static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	 * flush and invalidate the file's io_channel buffers before iomap
 	 * writes them
 	 */
-	err = io_channel_invalidate_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!iomap_does_fileio(ff)) {
+		err = io_channel_invalidate_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	return 0;
 }
@@ -5685,7 +5689,7 @@ static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	 * flush and invalidate the file's io_channel buffers again now that
 	 * iomap wrote them
 	 */
-	if (written > 0) {
+	if (written > 0 && !iomap_does_fileio(ff)) {
 		err = io_channel_invalidate_tag(ff->fs->io, attr_ino);
 		if (err) {
 			ret = translate_error(ff->fs, attr_ino, err);

^ permalink raw reply related	[flat|nested] 55+ messages in thread
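The effect of the guard in the read path can be seen with a tiny mock. This is a sketch under stated assumptions: struct demo_fs, demo_flush_tag() and demo_iomap_begin_read() are hypothetical names, and the flush counter stands in for the real io_channel_flush_tag() call so the two modes can be compared.

```c
#include <assert.h>

enum demo_state { DEMO_DISABLED, DEMO_UNKNOWN, DEMO_ENABLED, DEMO_FILEIO };

/* Hypothetical stand-in for struct fuse2fs plus its io channel. */
struct demo_fs {
	enum demo_state iomap_state;
	unsigned int flushes;	/* how often the mock flush ran */
};

/* Mock of io_channel_flush_tag(): records the call and succeeds. */
static int demo_flush_tag(struct demo_fs *fs)
{
	fs->flushes++;
	return 0;
}

/*
 * Mock of the start of fuse_iomap_begin_read().  The userspace cache is
 * flushed only when it may still hold file data, that is, when iomap is
 * on but the kernel does not yet own all file data IO.
 */
static int demo_iomap_begin_read(struct demo_fs *fs)
{
	int err;

	/* this path is only reached with iomap turned on */
	assert(fs->iomap_state >= DEMO_ENABLED);

	if (fs->iomap_state != DEMO_FILEIO) {
		err = demo_flush_tag(fs);
		if (err)
			return err;
	}

	return 0;	/* the extent or block map lookup would follow here */
}
```

In the ENABLED-only state the flush still runs on every mapping request; once the state is FILEIO the call disappears from the hot path, which is the saving the commit message is after.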
* [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (13 preceding siblings ...)
  2025-05-22  0:14   ` [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
@ 2025-05-22  0:14   ` Darrick J. Wong
  2025-05-22  0:15   ` [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:14 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Back in "fuse2fs: always use directio disk reads with fuse2fs", we
started using directio for all libext2fs disk IO to deal with cache
coherency issues between the unix io manager's disk cache, the block
device page cache, and the file data blocks being read and written to
disk by the kernel itself.

Now that we've turned off all regular file data block IO in libext2fs,
we don't need that and can go back to the old way, which is a lot faster
for metadata operations.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index ba8c5f301625c6..f31aee5af5aad9 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1128,8 +1128,12 @@ static void *op_init(struct fuse_conn_info *conn
 	 * filesystem in directio mode to avoid cache coherency issues when
 	 * reading file data. If we can't open the bdev in directio mode, we
 	 * must not use iomap.
+	 *
+	 * If we know that the kernel can handle all regular file IO for us,
+	 * then there is no cache coherency issue and we can use buffered reads
+	 * for all IO, which will all be filesystem metadata.
 	 */
-	if (iomap_enabled(ff))
+	if (iomap_enabled(ff) && !iomap_does_fileio(ff))
 		ff->directio = 1;
 #endif
 

^ permalink raw reply related	[flat|nested] 55+ messages in thread
* [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (14 preceding siblings ...)
  2025-05-22  0:14   ` [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
@ 2025-05-22  0:15   ` Darrick J. Wong
  15 siblings, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:15 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Since fuse in iomap mode guarantees that op_destroy will be called
before umount returns, we don't need to use fuseblk mode to get that
guarantee. Disable fuseblk mode, which saves us the trouble of closing
and reopening the device.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f31aee5af5aad9..28385d654f5e05 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -787,6 +787,8 @@ static errcode_t open_fs(struct fuse2fs *ff, int libext2_flags)
 	if (ff->directio)
 		flags |= EXT2_FLAG_DIRECT_IO;
 
+	dbg_printf(ff, "opening with flags=0x%x\n", flags);
+
 	err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
 			   &ff->fs);
 	if (err) {
@@ -6153,6 +6155,18 @@ int main(int argc, char *argv[])
 		ret = 32;
 		goto out;
 	}
+#ifdef HAVE_FUSE_IOMAP
+	if (is_bdev && fuse_discover_iomap()) {
+		/*
+		 * fuse-iomap guarantees that op_destroy is called before the
+		 * filesystem is unmounted, so we don't need fuseblk mode.
+		 * This saves us the trouble of reopening the filesystem later,
+		 * and means that fuse2fs itself owns the exclusive lock on the
+		 * block device.
+		 */
+		is_bdev = 0;
+	}
+#endif
 
 	blksize = fctx.fs->blocksize;
 
@@ -6171,14 +6185,14 @@ int main(int argc, char *argv[])
 
 	/* Set up default fuse parameters */
 	snprintf(extra_args, BUFSIZ, "-okernel_cache,subtype=%s,"
-		 "attr_timeout=0" FUSE_PLATFORM_OPTS,
-		 get_subtype(argv[0]));
+		 "attr_timeout=0,fsname=%s" FUSE_PLATFORM_OPTS,
+		 get_subtype(argv[0]), fctx.device);
 	if (fctx.no_default_opts == 0)
 		fuse_opt_add_arg(&args, extra_args);
 
 	if (is_bdev) {
-		snprintf(extra_args, BUFSIZ, "-ofsname=%s,blkdev,blksize=%u",
-			 fctx.device, blksize);
+		snprintf(extra_args, BUFSIZ, "-oblkdev,blksize=%u",
+			 blksize);
 		fuse_opt_add_arg(&args, extra_args);
 	}
 

^ permalink raw reply related	[flat|nested] 55+ messages in thread
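The last hunk moves fsname= out of the fuseblk-only option string and into the defaults, so the device name is reported whether or not blkdev mode is used. A standalone sketch of the resulting strings follows. The demo_ helpers are hypothetical, FUSE_PLATFORM_OPTS is left out, and the subtype and device values used with them are made-up inputs rather than anything taken from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Default options after the patch: fsname= is always present. */
static int demo_default_opts(char *buf, size_t len, const char *subtype,
			     const char *device)
{
	assert(buf && len > 0);
	return snprintf(buf, len,
			"-okernel_cache,subtype=%s,attr_timeout=0,fsname=%s",
			subtype, device);
}

/* fuseblk-only options: emitted only while is_bdev is still set. */
static int demo_blkdev_opts(char *buf, size_t len, int is_bdev,
			    unsigned int blksize)
{
	assert(buf && len > 0);
	if (!is_bdev) {
		buf[0] = '\0';	/* nothing extra to pass to fuse */
		return 0;
	}
	return snprintf(buf, len, "-oblkdev,blksize=%u", blksize);
}

/*
 * Crude check for whether an option string asks for fuseblk mode.
 * A plain substring match is a simplification; a device path that
 * happened to contain "blkdev" would fool it.
 */
static int demo_requests_fuseblk(const char *opts)
{
	return strstr(opts, "blkdev") != NULL;
}
```

When fuse_discover_iomap() clears is_bdev, the second string is simply never added, and the mount still shows the device through the fsname= option in the defaults.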
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-05-22 16:24 ` Amir Goldstein
  2025-05-29 16:45   ` Darrick J. Wong
  2025-06-13 17:37 ` [RFC[RAP] V2] " Darrick J. Wong
  3 siblings, 2 replies; 55+ messages in thread
From: Amir Goldstein @ 2025-05-22 16:24 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Hi everyone,
>
> DO NOT MERGE THIS.
>
> This is the very first request for comments of a prototype to connect
> the Linux fuse driver to fs-iomap for regular file IO operations to and
> from files whose contents persist to locally attached storage devices.
>
> Why would you want to do that? Most filesystem drivers are seriously
> vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> over almost a decade of its existence. Faulty code can lead to total
> kernel compromise, and I think there's a very strong incentive to move
> all that parsing out to userspace where we can containerize the fuse
> server process.
>
> willy's folios conversion project (and to a certain degree RH's new
> mount API) have also demonstrated that treewide changes to the core
> mm/pagecache/fs code are very very difficult to pull off and take years
> because you have to understand every filesystem's bespoke use of that
> core code. Eeeugh.
>
> The fuse command plumbing is very simple -- the ->iomap_begin,
> ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> to the fuse server via a trio of new fuse commands. This is suitable
> for very simple filesystems that don't do tricky things with mappings
> (e.g. FAT/HFS) during writeback. This isn't quite adequate for ext4,
> but solving that is for the next sprint.
>
> With this overly simplistic RFC, I am to show that it's possible to
> build a fuse server for a real filesystem (ext4) that runs entirely in
> userspace yet maintains most of its performance. At this early stage I
> get about 95% of the kernel ext4 driver's streaming directio performance
> on streaming IO, and 110% of its streaming buffered IO performance.
> Random buffered IO suffers a 90% hit on writes due to unwritten extent
> conversions. Random direct IO is about 60% as fast as the kernel; see
> the cover letter for the fuse2fs iomap changes for more details.
>

Very cool!

> There are some major warts remaining:
>
> 1. The iomap cookie validation is not present, which can lead to subtle
> races between pagecache zeroing and writeback on filesystems that
> support unwritten and delalloc mappings.
>
> 2. Mappings ought to be cached in the kernel for more speed.
>
> 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> yet figured out how inline data is supposed to work.
>
> 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> which currently isn't possible because the kernel fuse driver will iget
> inodes prior to calling FUSE_GETATTR to discover the properties of the
> inode it just read.

Can you make the decision about enabling iomap on lookup?
The plan for passthrough for inode operations was to allow
setting up passthrough config of inode on lookup.

>
> 5. ext4 doesn't support out of place writes so I don't know if that
> actually works correctly.
>
> 6. iomap is an inode-based service, not a file-based service. This
> means that we /must/ push ext2's inode numbers into the kernel via
> FUSE_GETATTR so that it can report those same numbers back out through
> the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid
> to index its incore inode, so we have to pass those too so that
> notifications work properly.
>

Again, I might be missing something, but as long as the fuse filesystem
is exposing a single backing filesystem, it should be possible to make
sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
inode number.
See sketch in this WIP branch:
https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
@ 2025-05-29 16:45   ` Darrick J. Wong
  2025-05-29 19:41     ` Amir Goldstein
  2025-06-13 17:37 ` [RFC[RAP] V2] " Darrick J. Wong
  1 sibling, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2025-05-29 16:45 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi everyone,
> >
> > DO NOT MERGE THIS.
> >
> > This is the very first request for comments of a prototype to connect
> > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > from files whose contents persist to locally attached storage devices.
> >
> > Why would you want to do that? Most filesystem drivers are seriously
> > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > over almost a decade of its existence. Faulty code can lead to total
> > kernel compromise, and I think there's a very strong incentive to move
> > all that parsing out to userspace where we can containerize the fuse
> > server process.
> >
> > willy's folios conversion project (and to a certain degree RH's new
> > mount API) have also demonstrated that treewide changes to the core
> > mm/pagecache/fs code are very very difficult to pull off and take years
> > because you have to understand every filesystem's bespoke use of that
> > core code. Eeeugh.
> >
> > The fuse command plumbing is very simple -- the ->iomap_begin,
> > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > to the fuse server via a trio of new fuse commands. This is suitable
> > for very simple filesystems that don't do tricky things with mappings
> > (e.g. FAT/HFS) during writeback. This isn't quite adequate for ext4,
> > but solving that is for the next sprint.
> >
> > With this overly simplistic RFC, I am to show that it's possible to
> > build a fuse server for a real filesystem (ext4) that runs entirely in
> > userspace yet maintains most of its performance. At this early stage I
> > get about 95% of the kernel ext4 driver's streaming directio performance
> > on streaming IO, and 110% of its streaming buffered IO performance.
> > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > conversions. Random direct IO is about 60% as fast as the kernel; see
> > the cover letter for the fuse2fs iomap changes for more details.
> >
>
> Very cool!
>
> > There are some major warts remaining:
> >
> > 1. The iomap cookie validation is not present, which can lead to subtle
> > races between pagecache zeroing and writeback on filesystems that
> > support unwritten and delalloc mappings.
> >
> > 2. Mappings ought to be cached in the kernel for more speed.
> >
> > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > yet figured out how inline data is supposed to work.
> >
> > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > which currently isn't possible because the kernel fuse driver will iget
> > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > inode it just read.
>
> Can you make the decision about enabling iomap on lookup?
> The plan for passthrough for inode operations was to allow
> setting up passthrough config of inode on lookup.

The main requirement (especially for buffered IO) is that we've set the
address space operations structure either to the regular fuse one or to
the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
code assumes that cannot change on a live inode.

So I /think/ we could ask the fuse server at inode instantiation time
(which, if I'm reading the code correctly, is when iget5_locked gives
fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
to userspace at that time. Alternately I guess we could extend struct
fuse_attr with another FUSE_ATTR_ flag, I think?

> > 5. ext4 doesn't support out of place writes so I don't know if that
> > actually works correctly.
> >
> > 6. iomap is an inode-based service, not a file-based service. This
> > means that we /must/ push ext2's inode numbers into the kernel via
> > FUSE_GETATTR so that it can report those same numbers back out through
> > the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid
> > to index its incore inode, so we have to pass those too so that
> > notifications work properly.
> >
>
> Again, I might be missing something, but as long as the fuse filesystem
> is exposing a single backing filesystem, it should be possible to make
> sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> inode number.
> See sketch in this WIP branch:
> https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575

I think this would work in many places, except for filesystems with
64-bit inumbers on 32-bit machines. That might be a good argument for
continuing to pass along the nodeid and fuse_inode::orig_ino like it
does now. Plus there are some filesystems that synthesize inode numbers
so tying the two together might not be feasible/desirable anyway.

Though one nice feature of letting fuse have its own nodeids might be
that if the in-memory index switches to a tree structure, then it could
be more compact if the filesystem's inumbers are fairly sparse like xfs.
OTOH the current inode hashtable has been around for a very long time so
that might not be a big concern. For fuse2fs it doesn't matter since
ext4 inumbers are u32.

--D

>
> Thanks,
> Amir.
>

^ permalink raw reply	[flat|nested] 55+ messages in thread
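The concern about 64-bit inumbers on 32-bit machines can be illustrated with plain integer truncation. This is a sketch under an assumption stated here and in the code: it models the kernel's inode number field as a 32-bit value on such machines, and the demo_ function names are hypothetical. It says nothing about how fuse or any particular filesystem actually hashes inodes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption for illustration: on a 32-bit machine the in-kernel inode
 * number is only 32 bits wide, so a 64-bit filesystem inumber loses its
 * upper half when stored there.
 */
static uint32_t demo_ino_on_32bit(uint64_t fs_ino)
{
	return (uint32_t)fs_ino;
}

/* Two different filesystem inodes that look identical after truncation. */
static int demo_inos_collide(uint64_t a, uint64_t b)
{
	return a != b && demo_ino_on_32bit(a) == demo_ino_on_32bit(b);
}

/* Concrete case: inumbers that differ only above bit 31. */
static void demo_show_collision(void)
{
	assert(demo_inos_collide(UINT64_C(0x100000005), UINT64_C(0x5)));
	assert(!demo_inos_collide(UINT64_C(5), UINT64_C(6)));
}
```

A separate 64-bit nodeid carried alongside the inode number sidesteps this, which matches the argument above for continuing to pass both; for ext4 the issue does not arise because its inumbers already fit in 32 bits.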
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-29 16:45   ` Darrick J. Wong
@ 2025-05-29 19:41     ` Amir Goldstein
  2025-06-09 22:31       ` Darrick J. Wong
  2025-07-12 10:57       ` Amir Goldstein
  0 siblings, 2 replies; 55+ messages in thread
From: Amir Goldstein @ 2025-05-29 19:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

or

On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > Hi everyone,
> > >
> > > DO NOT MERGE THIS.
> > >
> > > This is the very first request for comments of a prototype to connect
> > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > from files whose contents persist to locally attached storage devices.
> > >
> > > Why would you want to do that? Most filesystem drivers are seriously
> > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > over almost a decade of its existence. Faulty code can lead to total
> > > kernel compromise, and I think there's a very strong incentive to move
> > > all that parsing out to userspace where we can containerize the fuse
> > > server process.
> > >
> > > willy's folios conversion project (and to a certain degree RH's new
> > > mount API) have also demonstrated that treewide changes to the core
> > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > because you have to understand every filesystem's bespoke use of that
> > > core code. Eeeugh.
> > >
> > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > to the fuse server via a trio of new fuse commands. This is suitable
> > > for very simple filesystems that don't do tricky things with mappings
> > > (e.g. FAT/HFS) during writeback. This isn't quite adequate for ext4,
> > > but solving that is for the next sprint.
> > >
> > > With this overly simplistic RFC, I am to show that it's possible to
> > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > userspace yet maintains most of its performance. At this early stage I
> > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > conversions. Random direct IO is about 60% as fast as the kernel; see
> > > the cover letter for the fuse2fs iomap changes for more details.
> > >
> >
> > Very cool!
> >
> > > There are some major warts remaining:
> > >
> > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > races between pagecache zeroing and writeback on filesystems that
> > > support unwritten and delalloc mappings.
> > >
> > > 2. Mappings ought to be cached in the kernel for more speed.
> > >
> > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > yet figured out how inline data is supposed to work.
> > >
> > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > which currently isn't possible because the kernel fuse driver will iget
> > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > inode it just read.
> >
> > Can you make the decision about enabling iomap on lookup?
> > The plan for passthrough for inode operations was to allow
> > setting up passthrough config of inode on lookup.
>
> The main requirement (especially for buffered IO) is that we've set the
> address space operations structure either to the regular fuse one or to
> the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> code assumes that cannot change on a live inode.
>
> So I /think/ we could ask the fuse server at inode instantiation time
> (which, if I'm reading the code correctly, is when iget5_locked gives
> fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> to userspace at that time. Alternately I guess we could extend struct
> fuse_attr with another FUSE_ATTR_ flag, I think?
>

The latter. Either extend fuse_attr or struct fuse_entry_out,
which is in the responses of FUSE_LOOKUP,
FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE,
which instantiate fuse inodes.

There is a very hand wavy discussion about this at:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/

In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
command that uses the variable length file handle instead of nodeid
as a key for the inode.

So we will have to extend fuse_entry_out anyway, but TBH I never got to
look at the gritty details of how best to extend all the relevant commands,
so I hope I am not sending you down the wrong path.

> > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > actually works correctly.
> > >
> > > 6. iomap is an inode-based service, not a file-based service. This
> > > means that we /must/ push ext2's inode numbers into the kernel via
> > > FUSE_GETATTR so that it can report those same numbers back out through
> > > the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid
> > > to index its incore inode, so we have to pass those too so that
> > > notifications work properly.
> > >
> >
> > Again, I might be missing something, but as long as the fuse filesystem
> > is exposing a single backing filesystem, it should be possible to make
> > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > inode number.
> > See sketch in this WIP branch:
> > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
>
> I think this would work in many places, except for filesystems with
> 64-bit inumbers on 32-bit machines. That might be a good argument for
> continuing to pass along the nodeid and fuse_inode::orig_ino like it
> does now. Plus there are some filesystems that synthesize inode numbers
> so tying the two together might not be feasible/desirable anyway.
>
> Though one nice feature of letting fuse have its own nodeids might be
> that if the in-memory index switches to a tree structure, then it could
> be more compact if the filesystem's inumbers are fairly sparse like xfs.
> OTOH the current inode hashtable has been around for a very long time so
> that might not be a big concern. For fuse2fs it doesn't matter since
> ext4 inumbers are u32.
>

I wanted to see if declaring one-to-one 64bit ino can simplify things
for the first version of inode ops passthrough.
If this is not the case, or if this is too much of a limitation for
your use case then nevermind.
But if it is a good enough shortcut for the demo and can be extended later,
then why not.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 2025-05-29 19:41 ` Amir Goldstein @ 2025-06-09 22:31 ` Darrick J. Wong 2025-06-10 10:59 ` Amir Goldstein 2025-07-12 10:57 ` Amir Goldstein 1 sibling, 1 reply; 55+ messages in thread From: Darrick J. Wong @ 2025-06-09 22:31 UTC (permalink / raw) To: Amir Goldstein Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote: > or > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote: > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > > > Hi everyone, > > > > > > > > DO NOT MERGE THIS. > > > > > > > > This is the very first request for comments of a prototype to connect > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and > > > > from files whose contents persist to locally attached storage devices. > > > > > > > > Why would you want to do that? Most filesystem drivers are seriously > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly > > > > over almost a decade of its existence. Faulty code can lead to total > > > > kernel compromise, and I think there's a very strong incentive to move > > > > all that parsing out to userspace where we can containerize the fuse > > > > server process. > > > > > > > > willy's folios conversion project (and to a certain degree RH's new > > > > mount API) have also demonstrated that treewide changes to the core > > > > mm/pagecache/fs code are very very difficult to pull off and take years > > > > because you have to understand every filesystem's bespoke use of that > > > > core code. Eeeugh. 
> > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin, > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls > > > > to the fuse server via a trio of new fuse commands. This is suitable > > > > for very simple filesystems that don't do tricky things with mappings > > > > (e.g. FAT/HFS) during writeback. This isn't quite adequate for ext4, > > > > but solving that is for the next sprint. > > > > > > > > With this overly simplistic RFC, I am to show that it's possible to > > > > build a fuse server for a real filesystem (ext4) that runs entirely in > > > > userspace yet maintains most of its performance. At this early stage I > > > > get about 95% of the kernel ext4 driver's streaming directio performance > > > > on streaming IO, and 110% of its streaming buffered IO performance. > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent > > > > conversions. Random direct IO is about 60% as fast as the kernel; see > > > > the cover letter for the fuse2fs iomap changes for more details. > > > > > > > > > > Very cool! > > > > > > > There are some major warts remaining: > > > > > > > > 1. The iomap cookie validation is not present, which can lead to subtle > > > > races between pagecache zeroing and writeback on filesystems that > > > > support unwritten and delalloc mappings. > > > > > > > > 2. Mappings ought to be cached in the kernel for more speed. > > > > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't > > > > yet figured out how inline data is supposed to work. > > > > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis, > > > > which currently isn't possible because the kernel fuse driver will iget > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the > > > > inode it just read. > > > > > > Can you make the decision about enabling iomap on lookup? 
> > > The plan for passthrough for inode operations was to allow
> > > setting up passthrough config of inode on lookup.
> >
> > The main requirement (especially for buffered IO) is that we've set the
> > address space operations structure either to the regular fuse one or to
> > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > code assumes that cannot change on a live inode.
> >
> > So I /think/ we could ask the fuse server at inode instantiation time
> > (which, if I'm reading the code correctly, is when iget5_locked gives
> > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > to userspace at that time. Alternately I guess we could extend struct
> > fuse_attr with another FUSE_ATTR_ flag, I think?
> >
>
> The latter. Either extend fuse_attr or struct fuse_entry_out,
> which is in the responses of FUSE_LOOKUP,
> FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> which instantiate fuse inodes.
>
> There is a very hand wavy discussion about this at:
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
>
> In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> command that uses the variable length file handle instead of nodeid
> as a key for the inode.
>
> So we will have to extend fuse_entry_out anyway, but TBH I never got to
> look at the gritty details of how best to extend all the relevant commands,
> so I hope I am not sending you down the wrong path.

I found another twist to this story: the upper level libfuse3 library
assigns distinct nodeids for each directory entry. These nodeids are
passed into the kernel and appear to be the basis for an iget5_locked call.
IOWs, each nodeid causes a struct fuse_inode to be created in the
kernel.

For a single-linked file this is no big deal, but for a hardlink this
makes iomap a mess because this means that in fuse2fs, an ext2 inode can
map to multiple kernel fuse_inode objects. This /really/ breaks the
locking model of iomap, which assumes that there's one in-kernel inode
and that it can use i_rwsem to synchronize updates.

So I'm going to have to find a way to deal with this. I tried trivially
messing with libfuse nodeid assignment but that blew some assertion.
Maybe your LOOKUP_HANDLE thing would work.

> > > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > > actually works correctly.
> > > >
> > > > 6. iomap is an inode-based service, not a file-based service. This
> > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid
> > > > to index its incore inode, so we have to pass those too so that
> > > > notifications work properly.
> > > >
> > > >
> > > Again, I might be missing something, but as long as the fuse filesystem
> > > is exposing a single backing filesystem, it should be possible to make
> > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > > inode number.
> > > See sketch in this WIP branch:
> > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
> >
> > I think this would work in many places, except for filesystems with
> > 64-bit inumbers on 32-bit machines. That might be a good argument for
> > continuing to pass along the nodeid and fuse_inode::orig_ino like it
> > does now. Plus there are some filesystems that synthesize inode numbers
> > so tying the two together might not be feasible/desirable anyway.
> >
> > Though one nice feature of letting fuse have its own nodeids might be
> > that if the in-memory index switches to a tree structure, then it could
> > be more compact if the filesystem's inumbers are fairly sparse like xfs.
> > OTOH the current inode hashtable has been around for a very long time so
> > that might not be a big concern.
For fuse2fs it doesn't matter since > > ext4 inumbers are u32. > > > > I wanted to see if declaring one-to-one 64bit ino can simplify things > for the first version of inode ops passthrough. > If this is not the case, or if this is too much of a limitation for > your use case > then nevermind. > But if it is a good enough shortcut for the demo and can be extended later, > then why not. It's very tempting, because it's very confusing to have nodeids and stat st_ino not be the same thing. --D > Thanks, > Amir. > ^ permalink raw reply [flat|nested] 55+ messages in thread
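The worry above about 64-bit inumbers on 32-bit machines can be made concrete with a small userspace sketch. It assumes a fold of the high half into the low half by XOR, which is one common way to squeeze a 64-bit number into 32 bits; squash_ino32() and ino32_collides() are hypothetical names invented for this note, not kernel functions.

```c
#include <stdint.h>

/*
 * Illustrative only: fold a 64-bit inode number into a 32-bit value, as a
 * 32-bit machine must when its native inode number type is too narrow.
 * squash_ino32() is a hypothetical helper written for this sketch.
 */
static uint32_t squash_ino32(uint64_t ino64)
{
	uint32_t ino = (uint32_t)ino64;	/* keep the low 32 bits */

	ino ^= (uint32_t)(ino64 >> 32);	/* mix in the high 32 bits */
	return ino;
}

/* Returns 1 if two distinct 64-bit inumbers land on the same 32-bit value. */
static int ino32_collides(uint64_t a, uint64_t b)
{
	return a != b && squash_ino32(a) == squash_ino32(b);
}
```

Because distinct 64-bit inumbers can fold to the same 32-bit value, a scheme that insists on nodeid == inode number would need a separate 64-bit key on such machines, which is the argument made above for continuing to pass both values. For ext4, whose inumbers are u32, the fold is the identity.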
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 2025-06-09 22:31 ` Darrick J. Wong @ 2025-06-10 10:59 ` Amir Goldstein 2025-06-10 19:00 ` Darrick J. Wong 0 siblings, 1 reply; 55+ messages in thread From: Amir Goldstein @ 2025-06-10 10:59 UTC (permalink / raw) To: Darrick J. Wong Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote: > > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote: > > or > > > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote: > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > > > > > Hi everyone, > > > > > > > > > > DO NOT MERGE THIS. > > > > > > > > > > This is the very first request for comments of a prototype to connect > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and > > > > > from files whose contents persist to locally attached storage devices. > > > > > > > > > > Why would you want to do that? Most filesystem drivers are seriously > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly > > > > > over almost a decade of its existence. Faulty code can lead to total > > > > > kernel compromise, and I think there's a very strong incentive to move > > > > > all that parsing out to userspace where we can containerize the fuse > > > > > server process. > > > > > > > > > > willy's folios conversion project (and to a certain degree RH's new > > > > > mount API) have also demonstrated that treewide changes to the core > > > > > mm/pagecache/fs code are very very difficult to pull off and take years > > > > > because you have to understand every filesystem's bespoke use of that > > > > > core code. Eeeugh. 
> > > > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin, > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls > > > > > to the fuse server via a trio of new fuse commands. This is suitable > > > > > for very simple filesystems that don't do tricky things with mappings > > > > > (e.g. FAT/HFS) during writeback. This isn't quite adequate for ext4, > > > > > but solving that is for the next sprint. > > > > > > > > > > With this overly simplistic RFC, I am to show that it's possible to > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in > > > > > userspace yet maintains most of its performance. At this early stage I > > > > > get about 95% of the kernel ext4 driver's streaming directio performance > > > > > on streaming IO, and 110% of its streaming buffered IO performance. > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent > > > > > conversions. Random direct IO is about 60% as fast as the kernel; see > > > > > the cover letter for the fuse2fs iomap changes for more details. > > > > > > > > > > > > > Very cool! > > > > > > > > > There are some major warts remaining: > > > > > > > > > > 1. The iomap cookie validation is not present, which can lead to subtle > > > > > races between pagecache zeroing and writeback on filesystems that > > > > > support unwritten and delalloc mappings. > > > > > > > > > > 2. Mappings ought to be cached in the kernel for more speed. > > > > > > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't > > > > > yet figured out how inline data is supposed to work. > > > > > > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis, > > > > > which currently isn't possible because the kernel fuse driver will iget > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the > > > > > inode it just read. 
> > > > > > > > Can you make the decision about enabling iomap on lookup? > > > > The plan for passthrough for inode operations was to allow > > > > setting up passthough config of inode on lookup. > > > > > > The main requirement (especially for buffered IO) is that we've set the > > > address space operations structure either to the regular fuse one or to > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c > > > code assumes that cannot change on a live inode. > > > > > > So I /think/ we could ask the fuse server at inode instantiation time > > > (which, if I'm reading the code correctly, is when iget5_locked gives > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall > > > to userspace at that time. Alternately I guess we could extend struct > > > fuse_attr with another FUSE_ATTR_ flag, I think? > > > > > > > The latter. Either extend fuse_attr or struct fuse_entry_out, > > which is in the responses of FUSE_LOOKUP, > > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE. > > which instantiate fuse inodes. > > > > There is a very hand wavy discussion about this at: > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/ > > > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE > > command that uses the variable length file handle instead of nodeid > > as a key for the inode. > > > > So we will have to extend fuse_entry_out anyway, but TBH I never got to > > look at the gritty details of how best to extend all the relevant commands, > > so I hope I am not sending you down the wrong path. > > I found another twist to this story: the upper level libfuse3 library > assigns distinct nodeids for each directory entry. These nodeids are > passed into the kernel and appear to the basis for an iget5_locked call. > IOWs, each nodeid causes a struct fuse_inode to be created in the > kernel. 
>
> For a single-linked file this is no big deal, but for a hardlink this
> makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> map to multiple kernel fuse_inode objects. This /really/ breaks the
> locking model of iomap, which assumes that there's one in-kernel inode
> and that it can use i_rwsem to synchronize updates.
>
> So I'm going to have to find a way to deal with this. I tried trivially
> messing with libfuse nodeid assignment but that blew some assertion.
> Maybe your LOOKUP_HANDLE thing would work.
>

Pull the emergency brake!

In an amateur move, I did not look at fuse2fs.c before commenting on your
work.

The high-level fuse interface is not the right tool for the job.
It's not even the easiest way to have written fuse2fs in the first place.

The high-level fuse API addresses filesystem objects with full paths.
This is good for writing simple virtual filesystems, but it is neither the
correct nor the easiest choice for writing a userspace driver for ext4.

The low-level fuse interface addresses filesystem objects by nodeid
and requires the server to implement lookup(parent_nodeid, name)
where the server gets to choose the nodeid (not libfuse).

The current fuse2fs code needs to go to an effort to convert from full path
to inode + name using ext2fs_namei().

With the low-level fuse API, op_lookup() could have used the native
ext2_lookup(), which would have been much more natural.

You can find the most featureful low-level fuse example at:
https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc

Among other things, the server has an inode cache, where an inode
has in its state 'nopen' (was this inode opened for io) and 'backing_id'
(was this inode mapped for kernel passthrough).

Currently this backing_id mapping is only made on first open of inode,
but the plan is to do that also at lookup time, for example, if the
iomap mode for the inode can be determined at lookup time.

> > > > > 5.
ext4 doesn't support out of place writes so I don't know if that > > > > > actually works correctly. > > > > > > > > > > 6. iomap is an inode-based service, not a file-based service. This > > > > > means that we /must/ push ext2's inode numbers into the kernel via > > > > > FUSE_GETATTR so that it can report those same numbers back out through > > > > > the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid > > > > > to index its incore inode, so we have to pass those too so that > > > > > notifications work properly. > > > > > > > > > > > > > Again, I might be missing something, but as long as the fuse filesystem > > > > is exposing a single backing filesystem, it should be possible to make > > > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs > > > > inode number. > > > > See sketch in this WIP branch: > > > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575 > > > > > > I think this would work in many places, except for filesystems with > > > 64-bit inumbers on 32-bit machines. That might be a good argument for > > > continuing to pass along the nodeid and fuse_inode::orig_ino like it > > > does now. Plus there are some filesystems that synthesize inode numbers > > > so tying the two together might not be feasible/desirable anyway. > > > > > > Though one nice feature of letting fuse have its own nodeids might be > > > that if the in-memory index switches to a tree structure, then it could > > > be more compact if the filesystem's inumbers are fairly sparse like xfs. > > > OTOH the current inode hashtable has been around for a very long time so > > > that might not be a big concern. For fuse2fs it doesn't matter since > > > ext4 inumbers are u32. > > > > > > > I wanted to see if declaring one-to-one 64bit ino can simplify things > > for the first version of inode ops passthrough. > > If this is not the case, or if this is too much of a limitation for > > your use case > > then nevermind. 
> > But if it is a good enough shortcut for the demo and can be extended later, > > then why not. > > It's very tempting, because it's very confusing to have nodeids and > stat st_ino not be the same thing. > Now that I have explained that fuse2fs should be low-level, it should be trivial to claim that it should have no problem to declare via FUSE_PASSTHROUGH_INO flag to the kernel that nodeid == st_ino, because I see no reason to implement fuse2fs with non one-to-one mapping of ino <==> nodeid. Thanks, Amir. ^ permalink raw reply [flat|nested] 55+ messages in thread
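The hardlink problem and the low-level fix discussed in this message can be modelled in a few lines of plain C. This is a toy sketch, not libfuse code: struct toy_cache, toy_lookup_by_ino() and toy_lookup_by_dirent() are hypothetical names, and the per-dirent function only stands in for the behaviour attributed above to the high-level library.

```c
#include <stdint.h>
#include <stddef.h>

#define TOY_CACHE_SIZE 64	/* arbitrary bound for this sketch */

/* Hypothetical server-side inode cache entry, keyed by ext2 inode number. */
struct toy_inode {
	uint32_t ino;		/* on-disk inode number */
	uint64_t nlookup;	/* how many lookups returned this inode */
};

struct toy_cache {
	struct toy_inode slots[TOY_CACHE_SIZE];
	size_t used;
};

/*
 * Low-level style lookup: the server picks the nodeid, so it can hand back
 * the inode number itself.  Two directory entries that resolve to the same
 * ino therefore share one nodeid.  Returns 0 if the cache is full.
 */
static uint64_t toy_lookup_by_ino(struct toy_cache *c, uint32_t ino)
{
	size_t i;

	for (i = 0; i < c->used; i++) {
		if (c->slots[i].ino == ino) {
			c->slots[i].nlookup++;
			return ino;
		}
	}
	if (c->used == TOY_CACHE_SIZE)
		return 0;
	c->slots[c->used].ino = ino;
	c->slots[c->used].nlookup = 1;
	c->used++;
	return ino;
}

/*
 * Per-dirent assignment: every directory entry gets a fresh nodeid,
 * whatever inode it points at.
 */
static uint64_t toy_lookup_by_dirent(uint64_t *next_nodeid)
{
	return (*next_nodeid)++;
}
```

Looking up two hardlinks to inode 12 through toy_lookup_by_ino() yields the same nodeid twice and one cache slot, so the kernel would see a single inode to lock. The per-dirent variant hands out two different nodeids for the same file, which is the situation that breaks the one-inode assumption described earlier in the thread.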
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 2025-06-10 10:59 ` Amir Goldstein @ 2025-06-10 19:00 ` Darrick J. Wong 2025-06-10 19:51 ` Amir Goldstein 2025-06-11 11:56 ` Theodore Ts'o 0 siblings, 2 replies; 55+ messages in thread From: Darrick J. Wong @ 2025-06-10 19:00 UTC (permalink / raw) To: Amir Goldstein Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote: > On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote: > > > > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote: > > > or > > > > > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote: > > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > DO NOT MERGE THIS. > > > > > > > > > > > > This is the very first request for comments of a prototype to connect > > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and > > > > > > from files whose contents persist to locally attached storage devices. > > > > > > > > > > > > Why would you want to do that? Most filesystem drivers are seriously > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly > > > > > > over almost a decade of its existence. Faulty code can lead to total > > > > > > kernel compromise, and I think there's a very strong incentive to move > > > > > > all that parsing out to userspace where we can containerize the fuse > > > > > > server process. 
> > > > > > > > > > > > willy's folios conversion project (and to a certain degree RH's new > > > > > > mount API) have also demonstrated that treewide changes to the core > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years > > > > > > because you have to understand every filesystem's bespoke use of that > > > > > > core code. Eeeugh. > > > > > > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin, > > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls > > > > > > to the fuse server via a trio of new fuse commands. This is suitable > > > > > > for very simple filesystems that don't do tricky things with mappings > > > > > > (e.g. FAT/HFS) during writeback. This isn't quite adequate for ext4, > > > > > > but solving that is for the next sprint. > > > > > > > > > > > > With this overly simplistic RFC, I am to show that it's possible to > > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in > > > > > > userspace yet maintains most of its performance. At this early stage I > > > > > > get about 95% of the kernel ext4 driver's streaming directio performance > > > > > > on streaming IO, and 110% of its streaming buffered IO performance. > > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent > > > > > > conversions. Random direct IO is about 60% as fast as the kernel; see > > > > > > the cover letter for the fuse2fs iomap changes for more details. > > > > > > > > > > > > > > > > Very cool! > > > > > > > > > > > There are some major warts remaining: > > > > > > > > > > > > 1. The iomap cookie validation is not present, which can lead to subtle > > > > > > races between pagecache zeroing and writeback on filesystems that > > > > > > support unwritten and delalloc mappings. > > > > > > > > > > > > 2. Mappings ought to be cached in the kernel for more speed. > > > > > > > > > > > > 3. 
iomap doesn't support things like fscrypt or fsverity, and I haven't > > > > > > yet figured out how inline data is supposed to work. > > > > > > > > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis, > > > > > > which currently isn't possible because the kernel fuse driver will iget > > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the > > > > > > inode it just read. > > > > > > > > > > Can you make the decision about enabling iomap on lookup? > > > > > The plan for passthrough for inode operations was to allow > > > > > setting up passthough config of inode on lookup. > > > > > > > > The main requirement (especially for buffered IO) is that we've set the > > > > address space operations structure either to the regular fuse one or to > > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c > > > > code assumes that cannot change on a live inode. > > > > > > > > So I /think/ we could ask the fuse server at inode instantiation time > > > > (which, if I'm reading the code correctly, is when iget5_locked gives > > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall > > > > to userspace at that time. Alternately I guess we could extend struct > > > > fuse_attr with another FUSE_ATTR_ flag, I think? > > > > > > > > > > The latter. Either extend fuse_attr or struct fuse_entry_out, > > > which is in the responses of FUSE_LOOKUP, > > > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE. > > > which instantiate fuse inodes. > > > > > > There is a very hand wavy discussion about this at: > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/ > > > > > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE > > > command that uses the variable length file handle instead of nodeid > > > as a key for the inode. 
> > > > > > So we will have to extend fuse_entry_out anyway, but TBH I never got to > > > look at the gritty details of how best to extend all the relevant commands, > > > so I hope I am not sending you down the wrong path. > > > > I found another twist to this story: the upper level libfuse3 library > > assigns distinct nodeids for each directory entry. These nodeids are > > passed into the kernel and appear to the basis for an iget5_locked call. > > IOWs, each nodeid causes a struct fuse_inode to be created in the > > kernel. > > > > For a single-linked file this is no big deal, but for a hardlink this > > makes iomap a mess because this means that in fuse2fs, an ext2 inode can > > map to multiple kernel fuse_inode objects. This /really/ breaks the > > locking model of iomap, which assumes that there's one in-kernel inode > > and that it can use i_rwsem to synchronize updates. > > > > So I'm going to have to find a way to deal with this. I tried trivially > > messing with libfuse nodeid assigment but that blew some assertion. > > Maybe your LOOKUP_HANDLE thing would work. > > > > Pull the emergency break! > > In an amature move, I did not look at fuse2fs.c before commenting on your > work. > > High level fuse interface is not the right tool for the job. > It's not even the easiest way to have written fuse2fs in the first place. At the time I thought it would minimize friction across multiple operating systems' fuse implementations. > High-level fuse API addresses file system objects with full paths. > This is good for writing simple virtual filesystems, but it is not the > correct nor is the easiest choice to write a userspace driver for ext4. Agreed, it's a *terrible* way to implement ext4. I think, however, that Ted would like to maintain compatibility with macfuse and freebsd(?) so he's been resistant to rewriting the entire program to work with the lowlevel library. 
That said, I decided just now to do some spelunking into those two fuse ports and have discovered that freebsd[1] packages the same upstream libfuse as linux, and macfuse[2] seems to vendor both libfuse 2 and 3. [1] https://wiki.freebsd.org/FUSEFS [2] https://github.com/macfuse/macfuse Seeing as Debian 13 has killed off libfuse2 entirely, maybe I should think about rewriting all of fuse2fs against the lowlevel library? It's really annoying to deal with all the problems of the current codebase. I think I'll try to stabilize the current fuse+iomap code and then look into a fuse2fs port. What would we call it, fuse4fs? :D > Low-level fuse interface addresses filesystem objects by nodeid > and requires the server to implement lookup(parent_nodeid, name) > where the server gets to choose the nodeid (not libfuse). Does the nodeid for the root directory have to be FUSE_ROOT_ID? I guess for ext4 that's not a big deal since ext2 inode #1 is the badblocks file which cannot be accessed from userspace anyway. > current fuse2fs code needs to go to an effort to convert from full path > to inode + name using ext2fs_namei(). > > With the low-level fuse op_lookup() might have used the native ext2_lookup() > which would have been much more natural. > > You can find the most featureful low-level fuse example at: > https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc > > Among other things, the server has an inode cache, where an inode > has in its state 'nopen' (was this inode opened for io) and 'backing_id' > (was this inode mapped for kernel passthrough). > > Currently this backing_id mapping is only made on first open of inode, > but the plan is to do that also at lookup time, for example, if the > iomap mode for the inode can be determined at lookup time. <nod> > > > > > > 5. ext4 doesn't support out of place writes so I don't know if that > > > > > > actually works correctly. > > > > > > > > > > > > 6. 
iomap is an inode-based service, not a file-based service. This > > > > > > means that we /must/ push ext2's inode numbers into the kernel via > > > > > > FUSE_GETATTR so that it can report those same numbers back out through > > > > > > the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid > > > > > > to index its incore inode, so we have to pass those too so that > > > > > > notifications work properly. > > > > > > > > > > > > > > > > Again, I might be missing something, but as long as the fuse filesystem > > > > > is exposing a single backing filesystem, it should be possible to make > > > > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs > > > > > inode number. > > > > > See sketch in this WIP branch: > > > > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575 > > > > > > > > I think this would work in many places, except for filesystems with > > > > 64-bit inumbers on 32-bit machines. That might be a good argument for > > > > continuing to pass along the nodeid and fuse_inode::orig_ino like it > > > > does now. Plus there are some filesystems that synthesize inode numbers > > > > so tying the two together might not be feasible/desirable anyway. > > > > > > > > Though one nice feature of letting fuse have its own nodeids might be > > > > that if the in-memory index switches to a tree structure, then it could > > > > be more compact if the filesystem's inumbers are fairly sparse like xfs. > > > > OTOH the current inode hashtable has been around for a very long time so > > > > that might not be a big concern. For fuse2fs it doesn't matter since > > > > ext4 inumbers are u32. > > > > > > > > > > I wanted to see if declaring one-to-one 64bit ino can simplify things > > > for the first version of inode ops passthrough. > > > If this is not the case, or if this is too much of a limitation for > > > your use case > > > then nevermind. 
> > > But if it is a good enough shortcut for the demo and can be extended later, > > > then why not. > > > > It's very tempting, because it's very confusing to have nodeids and > > stat st_ino not be the same thing. > > > > Now that I have explained that fuse2fs should be low-level, it should be > trivial to claim that it should have no problem to declare via > FUSE_PASSTHROUGH_INO flag to the kernel that nodeid == st_ino, > because I see no reason to implement fuse2fs with non one-to-one > mapping of ino <==> nodeid. Agreed! Thanks for the nudge! Let's see what Ted thinks when he returns from vacation. :) --D ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 2025-06-10 19:00 ` Darrick J. Wong @ 2025-06-10 19:51 ` Amir Goldstein 2025-06-11 6:00 ` Darrick J. Wong 2025-06-11 11:56 ` Theodore Ts'o 1 sibling, 1 reply; 55+ messages in thread From: Amir Goldstein @ 2025-06-10 19:51 UTC (permalink / raw) To: Darrick J. Wong Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o On Tue, Jun 10, 2025 at 9:00 PM Darrick J. Wong <djwong@kernel.org> wrote: > > On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote: > > On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote: > > > > or > > > > > > > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > > > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote: > > > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > DO NOT MERGE THIS. > > > > > > > > > > > > > > This is the very first request for comments of a prototype to connect > > > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and > > > > > > > from files whose contents persist to locally attached storage devices. > > > > > > > > > > > > > > Why would you want to do that? Most filesystem drivers are seriously > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly > > > > > > > over almost a decade of its existence. Faulty code can lead to total > > > > > > > kernel compromise, and I think there's a very strong incentive to move > > > > > > > all that parsing out to userspace where we can containerize the fuse > > > > > > > server process. 
> > > > > > > > > > > > > > willy's folios conversion project (and to a certain degree RH's new > > > > > > > mount API) have also demonstrated that treewide changes to the core > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years > > > > > > > because you have to understand every filesystem's bespoke use of that > > > > > > > core code. Eeeugh. > > > > > > > > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin, > > > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls > > > > > > > to the fuse server via a trio of new fuse commands. This is suitable > > > > > > > for very simple filesystems that don't do tricky things with mappings > > > > > > > (e.g. FAT/HFS) during writeback. This isn't quite adequate for ext4, > > > > > > > but solving that is for the next sprint. > > > > > > > > > > > > > > With this overly simplistic RFC, I am to show that it's possible to > > > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in > > > > > > > userspace yet maintains most of its performance. At this early stage I > > > > > > > get about 95% of the kernel ext4 driver's streaming directio performance > > > > > > > on streaming IO, and 110% of its streaming buffered IO performance. > > > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent > > > > > > > conversions. Random direct IO is about 60% as fast as the kernel; see > > > > > > > the cover letter for the fuse2fs iomap changes for more details. > > > > > > > > > > > > > > > > > > > Very cool! > > > > > > > > > > > > > There are some major warts remaining: > > > > > > > > > > > > > > 1. The iomap cookie validation is not present, which can lead to subtle > > > > > > > races between pagecache zeroing and writeback on filesystems that > > > > > > > support unwritten and delalloc mappings. > > > > > > > > > > > > > > 2. Mappings ought to be cached in the kernel for more speed. 
> > > > > > > > > > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't > > > > > > > yet figured out how inline data is supposed to work. > > > > > > > > > > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis, > > > > > > > which currently isn't possible because the kernel fuse driver will iget > > > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the > > > > > > > inode it just read. > > > > > > > > > > > > Can you make the decision about enabling iomap on lookup? > > > > > > The plan for passthrough for inode operations was to allow > > > > > > setting up passthough config of inode on lookup. > > > > > > > > > > The main requirement (especially for buffered IO) is that we've set the > > > > > address space operations structure either to the regular fuse one or to > > > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c > > > > > code assumes that cannot change on a live inode. > > > > > > > > > > So I /think/ we could ask the fuse server at inode instantiation time > > > > > (which, if I'm reading the code correctly, is when iget5_locked gives > > > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall > > > > > to userspace at that time. Alternately I guess we could extend struct > > > > > fuse_attr with another FUSE_ATTR_ flag, I think? > > > > > > > > > > > > > The latter. Either extend fuse_attr or struct fuse_entry_out, > > > > which is in the responses of FUSE_LOOKUP, > > > > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE. > > > > which instantiate fuse inodes. > > > > > > > > There is a very hand wavy discussion about this at: > > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/ > > > > > > > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE > > > > command that uses the variable length file handle instead of nodeid > > > > as a key for the inode. 
> > > > > > > > So we will have to extend fuse_entry_out anyway, but TBH I never got to > > > > look at the gritty details of how best to extend all the relevant commands, > > > > so I hope I am not sending you down the wrong path. > > > > > > I found another twist to this story: the upper level libfuse3 library > > > assigns distinct nodeids for each directory entry. These nodeids are > > > passed into the kernel and appear to the basis for an iget5_locked call. > > > IOWs, each nodeid causes a struct fuse_inode to be created in the > > > kernel. > > > > > > For a single-linked file this is no big deal, but for a hardlink this > > > makes iomap a mess because this means that in fuse2fs, an ext2 inode can > > > map to multiple kernel fuse_inode objects. This /really/ breaks the > > > locking model of iomap, which assumes that there's one in-kernel inode > > > and that it can use i_rwsem to synchronize updates. > > > > > > So I'm going to have to find a way to deal with this. I tried trivially > > > messing with libfuse nodeid assigment but that blew some assertion. > > > Maybe your LOOKUP_HANDLE thing would work. > > > > > > > Pull the emergency break! > > > > In an amature move, I did not look at fuse2fs.c before commenting on your > > work. > > > > High level fuse interface is not the right tool for the job. > > It's not even the easiest way to have written fuse2fs in the first place. > > At the time I thought it would minimize friction across multiple > operating systems' fuse implementations. > > > High-level fuse API addresses file system objects with full paths. > > This is good for writing simple virtual filesystems, but it is not the > > correct nor is the easiest choice to write a userspace driver for ext4. > > Agreed, it's a *terrible* way to implement ext4. > > I think, however, that Ted would like to maintain compatibility with > macfuse and freebsd(?) so he's been resistant to rewriting the entire > program to work with the lowlevel library. 
>
> That said, I decided just now to do some spelunking into those two fuse
> ports and have discovered that freebsd[1] packages the same upstream
> libfuse as linux, and macfuse[2] seems to vendor both libfuse 2 and 3.
>
> [1] https://wiki.freebsd.org/FUSEFS
> [2] https://github.com/macfuse/macfuse
>
> Seeing as Debian 13 has killed off libfuse2 entirely, maybe I should
> think about rewriting all of fuse2fs against the lowlevel library? It's
> really annoying to deal with all the problems of the current codebase.
> I think I'll try to stabilize the current fuse+iomap code and then look
> into a fuse2fs port. What would we call it, fuse4fs? :D
>
> > Low-level fuse interface addresses filesystem objects by nodeid
> > and requires the server to implement lookup(parent_nodeid, name)
> > where the server gets to choose the nodeid (not libfuse).
>
> Does the nodeid for the root directory have to be FUSE_ROOT_ID?

Yeh, I think that's the case, otherwise FUSE_INIT would need to
tell the kernel the root nodeid, because there is no lookup to
return the root nodeid.

> I guess
> for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> which cannot be accessed from userspace anyway.
>

As long as inode #1 is reserved it should be fine.
Just need to refine the rules of the one-to-one mapping with
this exception.

Thanks,
Amir.
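
The hardlink problem discussed in this message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `sketch_dirent`, `sketch_tree`, and both `sketch_nodeid_*` functions are hypothetical names invented for illustration, and nothing here is libfuse or fuse2fs code. It contrasts a scheme that hands out one nodeid per directory entry (the behaviour attributed above to the high-level library) with one that derives the nodeid from the ext2 inode number (which a low-level server would be free to choose).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a directory entry: a path and the ext2 inode it names. */
struct sketch_dirent {
    const char *path;
    unsigned long ext2_ino;
};

/* "/a" and "/b" are hardlinks to the same ext2 inode (12). */
static const struct sketch_dirent sketch_tree[] = {
    { "/a", 12 },
    { "/b", 12 },
    { "/c", 13 },
};
#define SKETCH_TREE_LEN (sizeof(sketch_tree) / sizeof(sketch_tree[0]))

/*
 * Per-entry scheme: every directory entry gets its own nodeid.
 * Returns 0 if the path is unknown.
 */
static unsigned long sketch_nodeid_per_entry(const char *path)
{
    assert(path != NULL);
    for (size_t i = 0; i < SKETCH_TREE_LEN; i++)
        if (strcmp(sketch_tree[i].path, path) == 0)
            return 100 + (unsigned long)i; /* arbitrary id tied to the entry */
    return 0;
}

/*
 * Per-inode scheme: the nodeid is derived from the ext2 inode number,
 * so all hardlinks to one inode share a single nodeid.
 * Returns 0 if the path is unknown.
 */
static unsigned long sketch_nodeid_per_inode(const char *path)
{
    assert(path != NULL);
    for (size_t i = 0; i < SKETCH_TREE_LEN; i++)
        if (strcmp(sketch_tree[i].path, path) == 0)
            return sketch_tree[i].ext2_ino;
    return 0;
}
```

In this model the per-entry scheme gives "/a" and "/b" different nodeids, which corresponds to two in-kernel inodes for one ext2 inode; the per-inode scheme gives them the same nodeid, which is the property the iomap locking model is said to rely on.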
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Darrick J. Wong @ 2025-06-11 6:00 UTC
To: Amir Goldstein
Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o

On Tue, Jun 10, 2025 at 09:51:55PM +0200, Amir Goldstein wrote:
> On Tue, Jun 10, 2025 at 9:00 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote:
> > > Low-level fuse interface addresses filesystem objects by nodeid
> > > and requires the server to implement lookup(parent_nodeid, name)
> > > where the server gets to choose the nodeid (not libfuse).
> >
> > Does the nodeid for the root directory have to be FUSE_ROOT_ID?
>
> Yeh, I think that's the case, otherwise FUSE_INIT would need to
> tell the kernel the root nodeid, because there is no lookup to
> return the root nodeid.
>
> > I guess
> > for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> > which cannot be accessed from userspace anyway.
> >
>
> As long as inode #1 is reserved it should be fine.
> just need to refine the rules of the one-to-one mapping with
> this exception.

Or just make it so that passthrough_ino filesystems can specify the
rootdir inumber?

--D

> Thanks,
> Amir.
>
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Amir Goldstein @ 2025-06-11 8:54 UTC
To: Darrick J. Wong
Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o

> > > Does the nodeid for the root directory have to be FUSE_ROOT_ID?
> >
> > Yeh, I think that's the case, otherwise FUSE_INIT would need to
> > tell the kernel the root nodeid, because there is no lookup to
> > return the root nodeid.
> >
> > > I guess
> > > for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> > > which cannot be accessed from userspace anyway.
> > >
> >
> > As long as inode #1 is reserved it should be fine.
> > just need to refine the rules of the one-to-one mapping with
> > this exception.
>
> Or just make it so that passthrough_ino filesystems can specify the
> rootdir inumber?
>

There is already a mount option 'rootmode' for st_mode of root inode
so I suppose we could add the rootino mount option.

Note that currently fuse_fill_super_common() instantiates the root inode
before negotiating FUSE_INIT with the server.

Thanks,
Amir.
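
The "one-to-one mapping with this exception" idea can be sketched as a pair of conversion functions. This is a minimal illustration, not the author's or fuse2fs' implementation: the `SKETCH_*` constants and all function names are hypothetical, and the constants merely mirror values I believe are well established (libfuse's `FUSE_ROOT_ID` is 1, ext2 reserves inode 1 for bad blocks and uses inode 2 for the root directory). The sketch assumes the root nodeid stays fixed at 1, i.e. no `rootino` mount option exists.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies so the sketch compiles without libfuse or e2fsprogs headers. */
#define SKETCH_FUSE_ROOT_ID  1u /* mirrors libfuse's FUSE_ROOT_ID */
#define SKETCH_EXT2_BAD_INO  1u /* ext2 bad blocks inode, never exposed */
#define SKETCH_EXT2_ROOT_INO 2u /* ext2 root directory inode */

/* Map an ext2 inode number to a fuse nodeid; 0 means "not exposable". */
static uint64_t sketch_ino_to_nodeid(uint32_t ino)
{
    if (ino == 0 || ino == SKETCH_EXT2_BAD_INO)
        return 0;
    if (ino == SKETCH_EXT2_ROOT_INO)
        return SKETCH_FUSE_ROOT_ID; /* the single exception */
    return ino;                     /* identity for everything else */
}

/* Inverse mapping; 0 means the nodeid was never handed out. */
static uint32_t sketch_nodeid_to_ino(uint64_t nodeid)
{
    if (nodeid == SKETCH_FUSE_ROOT_ID)
        return SKETCH_EXT2_ROOT_INO;
    if (nodeid == 0 || nodeid == SKETCH_EXT2_ROOT_INO || nodeid > UINT32_MAX)
        return 0;
    return (uint32_t)nodeid;
}

/* Check that an exposable inode survives a round trip through both maps. */
static void sketch_check_round_trip(uint32_t ino)
{
    uint64_t nodeid = sketch_ino_to_nodeid(ino);

    assert(nodeid != 0);
    assert(sketch_nodeid_to_ino(nodeid) == ino);
}
```

Because inode 1 is reserved and never visible to users, nodeid 1 is free to stand in for the root directory, and nodeid 2 is simply never issued. A `rootino` option, or a root inumber negotiated at init time, would make the exception unnecessary.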
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Miklos Szeredi @ 2025-06-12 5:54 UTC
To: Amir Goldstein
Cc: Darrick J. Wong, linux-fsdevel, John, bernd, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o

On Wed, 11 Jun 2025 at 10:54, Amir Goldstein <amir73il@gmail.com> wrote:

> There is already a mount option 'rootmode' for st_mode of root inode
> so I suppose we could add the rootino mount option.
>
> Note that currently fuse_fill_super_common() instantiates the root inode
> before negotiating FUSE_INIT with the server.

I'd prefer not to add more mount options like this.

It would be nice to move away from async FUSE_INIT. It's one of those
things I wish I'd done differently.

Unfortunately I don't think adding FUSE_INIT_SYNC would be sufficient,
as servers might expect the first request to be always FUSE_INIT and
break if it isn't. Libfuse seems to be okay, but...

One idea is to add an ioctl that the server would call before
mounting, that explicitly allows FUSE_INIT_SYNC. It's somewhat ugly,
but I can't think of a better solution.

Thanks,
Miklos
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Darrick J. Wong @ 2025-06-13 17:44 UTC
To: Miklos Szeredi
Cc: Amir Goldstein, linux-fsdevel, John, bernd, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o

On Thu, Jun 12, 2025 at 07:54:12AM +0200, Miklos Szeredi wrote:
> On Wed, 11 Jun 2025 at 10:54, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > There is already a mount option 'rootmode' for st_mode of root inode
> > so I suppose we could add the rootino mount option.
> >
> > Note that currently fuse_fill_super_common() instantiates the root inode
> > before negotiating FUSE_INIT with the server.
>
> I'd prefer not to add more mount options like this.
>
> It would be nice to move away from async FUSE_INIT. It's one of those
> things I wish I'd done differently.
>
> Unfortunately I don't think adding FUSE_INIT_SYNC would be sufficient,
> as servers might expect the first request to be always FUSE_INIT and
> break if it isn't. Libfuse seems to be okay, but...
>
> One idea is to add an ioctl that the server would call before
> mounting, that explicitly allows FUSE_INIT_SYNC. It's somewhat ugly,
> but I can't think of a better solution.

Hmm, well for iomap the fuse server kinda wants to know if the kernel is
going to accept iomap prior to initializing the filesystem, so it
wouldn't be that weird to have it set a "send INIT_SYNC" flag.

If one were to add an INIT_SYNC upcall, where would the callsite be?
Somewhere just prior to where we need to open the root file? And would
you want to add more fields to it? Or just use the same struct and
flags as the existing INIT call?

--D

>
> Thanks,
> Miklos
>
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Theodore Ts'o @ 2025-06-11 11:56 UTC
To: Darrick J. Wong
Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Allison Karlitskaya

+Allison Karlitskaya

On Tue, Jun 10, 2025 at 12:00:26PM -0700, Darrick J. Wong wrote:
> > High level fuse interface is not the right tool for the job.
> > It's not even the easiest way to have written fuse2fs in the first place.
>
> At the time I thought it would minimize friction across multiple
> operating systems' fuse implementations.
>
> > High-level fuse API addresses file system objects with full paths.
> > This is good for writing simple virtual filesystems, but it is neither
> > the correct nor the easiest choice to write a userspace driver for ext4.
>
> Agreed, it's a *terrible* way to implement ext4.
>
> I think, however, that Ted would like to maintain compatibility with
> macfuse and freebsd(?) so he's been resistant to rewriting the entire
> program to work with the lowlevel library.

My priority is to make sure that we have compatibility with other OS's
(in particular MacOS, FreeBSD, if possible Windows, although that's
not something that I develop against or have test vehicles to
validate). However, from what I can tell, they all support Fuse3 at
this point --- MacFuse, FreeBSD, and WinFSP all have Fuse3 support as
of today.

The only complaint that I've had about breaking support using Fuse2
was from Allison (Cc'ed), who was involved with another Github
project, whose Github Actions break because they were using a very old
version of Ubuntu (LTS 20.04), which only had support for libfuse2. I
am going to assume that this is probably only because they hadn't
bothered to update their .github/workflows/ci.yaml file, and not
because there was any inherent requirement that we support ancient
versions of Linux distributions. (When I was at IBM, I remember
having to support customers who used RHEL4, and even in one extreme
case, RHEL3 because there was a customer paying $$$$$ that refused to
update; but that was well over a decade ago, and at this point, I'm
finding it a lot harder to care about that. :-)

My plan is that after I release 1.47.2 (which will have some
interesting data corruption bugfixes thanks to Darrick and other users
using fuse2fs in deadly earnest, as opposed to as a lightweight way to
copy files in and out of a file system image), I will transition
the master and next branches for the future 1.48 release, and the
maint branch will have bug fixes for 1.47.N releases.

At that point, unless I hear some very strong arguments against, for
1.48, my current thinking is that we will drop support for Fuse2. I
will still care about making sure that fuse2fs will build and work
well enough that casual file copies work on MacOS and FreeBSD, and
I'll accept patches that make fuse2fs work with WinFSP. In practice,
this means that Linux-specific things like Verity support will need to
be #ifdef'ed so that they will build against MacFUSE, and I assume the
same will be true for fuseblk mode and iomap mode(?).

This may break the github actions for composefs-rs[1], but I'm going
to assume that they can figure out a way to transition to Fuse3
(hopefully by just using a newer version of Ubuntu, but I suppose it's
possible that Rust bindings only exist for Fuse2, and not Fuse3). But
in any case, I don't think it makes sense to hold back fuse2fs
development just for the sake of Ubuntu Focal (LTS 20.04). And if
necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
they can get off of Fuse2 and/or Ubuntu 20.04. Allison, does that
sound fair to you?

[1] https://github.com/containers/composefs-rs

Does anyone else have any objections to dropping Fuse2 support? And
is that sufficient for folks to more easily support iomap mode in
fuse2fs?

Cheers,

					- Ted

P.S. Greetings from Greenland. :-) (We're currently in the middle of
a cruise that started in Iceland, and will be ending in New York City
next week.)
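
The "#ifdef'ed" approach mentioned above for Linux-only features might look roughly like the following. This is a hedged sketch only: `SKETCH_HAVE_IOMAP`, `sketch_io_path`, and `sketch_pick_io_path` are hypothetical names made up for illustration and do not come from e2fsprogs, and a real build would more likely key off a configure-time feature test than the `__linux__` macro alone.

```c
#include <assert.h>

/* Hypothetical feature switch: compile the iomap path only on Linux. */
#if defined(__linux__)
#define SKETCH_HAVE_IOMAP 1
#else
#define SKETCH_HAVE_IOMAP 0
#endif

enum sketch_io_path {
    SKETCH_IO_CLASSIC = 0, /* server performs file IO itself */
    SKETCH_IO_IOMAP   = 1, /* kernel performs file IO via mappings */
};

/*
 * Choose the IO path at runtime. kernel_offers_iomap is 1 if the kernel
 * advertised iomap support during initialization, 0 otherwise. On
 * platforms built without the iomap code the classic path is always used.
 */
static enum sketch_io_path sketch_pick_io_path(int kernel_offers_iomap)
{
    assert(kernel_offers_iomap == 0 || kernel_offers_iomap == 1);
#if SKETCH_HAVE_IOMAP
    if (kernel_offers_iomap)
        return SKETCH_IO_IOMAP;
#else
    (void)kernel_offers_iomap; /* unused when iomap is compiled out */
#endif
    return SKETCH_IO_CLASSIC;
}
```

Keeping the fallback path unconditional means the same source still builds and works for casual file copies on platforms where the Linux-only feature is compiled out.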
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Darrick J. Wong @ 2025-06-12 3:20 UTC
To: Theodore Ts'o
Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Allison Karlitskaya

On Wed, Jun 11, 2025 at 10:56:29AM -0100, Theodore Ts'o wrote:
> At that point, unless I hear some very strong arguments against, for
> 1.48, my current thinking is that we will drop support for Fuse2. I
> will still care about making sure that fuse2fs will build and work
> well enough that casual file copies work on MacOS and FreeBSD, and
> I'll accept patches that make fuse2fs work with WinFSP. In practice,
> this means that Linux-specific things like Verity support will need to
> be #ifdef'ed so that they will build against MacFUSE, and I assume the
> same will be true for fuseblk mode and iomap mode(?).

<nod> I might just drop fuseblk mode since it's unusable for
unprivileged userspace and regular files; and is a real pain even for
"I'm pretending to be the kernel" mode.

> Does anyone else have any objections to dropping Fuse2 support? And
> is that sufficient for folks to more easily support iomap mode in
> fuse2fs?

I don't have any objections to cleaning the fuse2 crud out of fuse2fs.

I /do/ worry that rewriting fuse2fs to target the lowlevel fuse3 library
instead of the highlevel one is going to break the !linux platforms.
Although I *think* macfuse and freebsd fuse actually support the
lowlevel library and will be ok, I do worry that we might lose windows
support. I can't tell if winfsp or dokan are what you're supposed to
use there, but afaict neither of them support the lowlevel interface.

That said, I could just fork fuse2fs and make the fork ("fuse4fs") talk
to the lowlevel library, and we can see what happens when/if people try
to build it on those platforms.

(Though again I have zero capacity to build macos or windows programs...)

TBH it might be a huge relief to just start with a new fuse4fs codebase
where I can focus on making iomap the single IO path that works really
well, rather than try to support the existing one. There's a lot of IO
manager changes in the fuse2fs+iomap prototype that I think just go away
if you don't need to support doing the file IO yourself.

Any code that's shareable between fuse[24]fs should of course get split
out, which should ease the maintenance burden of having two fuse
servers. Most of fuse2fs' "smarts" are just calling libext2fs anyway.

Maybe someday we can pull an egcs. :P

> Cheers,
>
> 					- Ted
>
> P.S. Greetings from Greenland. :-) (We're currently in the middle of
> a cruise that started in Iceland, and will be ending in New York City
> next week.)

Heh, enjoy your cruise!!

--D
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Amir Goldstein @ 2025-06-12 6:10 UTC
To: Darrick J. Wong
Cc: Theodore Ts'o, linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Allison Karlitskaya

On Thu, Jun 12, 2025 at 5:20 AM Darrick J. Wong <djwong@kernel.org> wrote:
> That said, I could just fork fuse2fs and make the fork ("fuse4fs") talk
> to the lowlevel library, and we can see what happens when/if people try
> to build it on those platforms.
>
> (Though again I have zero capacity to build macos or windows programs...)
>
> TBH it might be a huge relief to just start with a new fuse4fs codebase
> where I can focus on making iomap the single IO path that works really
> well, rather than try to support the existing one. There's a lot of IO
> manager changes in the fuse2fs+iomap prototype that I think just go away
> if you don't need to support doing the file IO yourself.
>
> Any code that's shareable between fuse[24]fs should of course get split
> out, which should ease the maintenance burden of having two fuse
> servers. Most of fuse2fs' "smarts" are just calling libext2fs anyway.

That seems like a good way to focus your energy on the important goals.
I like it.

Thanks,
Amir.
* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Allison Karlitskaya @ 2025-06-20 8:58 UTC
To: Theodore Ts'o
Cc: Darrick J. Wong, Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4

hi Ted,

Sorry I didn't see this earlier. I've been travelling.

On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
> This may break the github actions for composefs-rs[1], but I'm going
> to assume that they can figure out a way to transition to Fuse3
> (hopefully by just using a newer version of Ubuntu, but I suppose it's
> possible that Rust bindings only exist for Fuse2, and not Fuse3). But
> in any case, I don't think it makes sense to hold back fuse2fs
> development just for the sake of Ubuntu Focal (LTS 20.04). And if
> necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> they can get off of Fuse2 and/or Ubuntu 20.04. Allison, does that
> sound fair to you?

To be honest, with a composefs-rs hat on, I don't care at all about
fuse support for ext2/3/4 (although I think it's cool that it exists).
We also use fuse in composefs-rs for unrelated reasons, but even there
we use the fuser rust crate which has a "pure rust" direct syscall
layer that no longer depends on libfuse. Our use of e2fsprogs is
strictly related to building testing images in CI, and for that we
only use mkfs.ext4. There's also no specific reason that we're using
old Ubuntu. I probably just copy-pasted it from another project
without paying too much attention.

Thanks for asking, though!

lis

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Bernd Schubert @ 2025-06-20 11:50 UTC
To: Allison Karlitskaya, Theodore Ts'o
Cc: Darrick J. Wong, Amir Goldstein, linux-fsdevel, John, miklos, joannelkoong, Josef Bacik, linux-ext4

On 6/20/25 10:58, Allison Karlitskaya wrote:
> hi Ted,
>
> Sorry I didn't see this earlier. I've been travelling.
>
> On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
>> This may break the github actions for composefs-rs[1], but I'm going
>> to assume that they can figure out a way to transition to Fuse3
>> (hopefully by just using a newer version of Ubuntu, but I suppose it's
>> possible that Rust bindings only exist for Fuse2, and not Fuse3). But
>> in any case, I don't think it makes sense to hold back fuse2fs
>> development just for the sake of Ubuntu Focal (LTS 20.04). And if
>> necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
>> they can get off of Fuse2 and/or Ubuntu 20.04. Allison, does that
>> sound fair to you?
>
> To be honest, with a composefs-rs hat on, I don't care at all about
> fuse support for ext2/3/4 (although I think it's cool that it exists).
> We also use fuse in composefs-rs for unrelated reasons, but even there
> we use the fuser rust crate which has a "pure rust" direct syscall
> layer that no longer depends on libfuse. Our use of e2fsprogs is
> strictly related to building testing images in CI, and for that we
> only use mkfs.ext4. There's also no specific reason that we're using
> old Ubuntu. I probably just copy-pasted it from another project
> without paying too much attention.

From libfuse point of view I'm too happy about that split into different
libraries. Libfuse already right now misses several features because
they were added to virtiofs, but not to libfuse. I need to find the time
for it, but I guess it makes sense to add rust support to libfuse (and
some parts can be entirely rewritten into rust).

Thanks,
Bernd

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Darrick J. Wong @ 2025-07-01 6:02 UTC
To: Bernd Schubert
Cc: Allison Karlitskaya, Theodore Ts'o, Amir Goldstein, linux-fsdevel, John, miklos, joannelkoong, Josef Bacik, linux-ext4

On Fri, Jun 20, 2025 at 01:50:20PM +0200, Bernd Schubert wrote:
>
>
> On 6/20/25 10:58, Allison Karlitskaya wrote:
> > hi Ted,
> >
> > Sorry I didn't see this earlier. I've been travelling.
> >
> > On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
> > > This may break the github actions for composefs-rs[1], but I'm going
> > > to assume that they can figure out a way to transition to Fuse3
> > > (hopefully by just using a newer version of Ubuntu, but I suppose it's
> > > possible that Rust bindings only exist for Fuse2, and not Fuse3). But
> > > in any case, I don't think it makes sense to hold back fuse2fs
> > > development just for the sake of Ubuntu Focal (LTS 20.04). And if
> > > necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> > > they can get off of Fuse2 and/or Ubuntu 20.04. Allison, does that
> > > sound fair to you?
> >
> > To be honest, with a composefs-rs hat on, I don't care at all about
> > fuse support for ext2/3/4 (although I think it's cool that it exists).
> > We also use fuse in composefs-rs for unrelated reasons, but even there
> > we use the fuser rust crate which has a "pure rust" direct syscall
> > layer that no longer depends on libfuse. Our use of e2fsprogs is
> > strictly related to building testing images in CI, and for that we
> > only use mkfs.ext4. There's also no specific reason that we're using
> > old Ubuntu. I probably just copy-pasted it from another project
> > without paying too much attention.
>
>
> From libfuse point of view I'm too happy about that split into different

"too happy"? I would have thought you would /not/ be too happy about
splits... <confused>

> libraries. Libfuse already right now misses several features because
> they were added to virtiofs, but not to libfuse. I need to find the time
> for it, but I guess it makes sense to add rust support to libfuse (and
> some parts can be entirely rewritten into rust).

Yeah, I noticed a few missing pieces like statx and syncfs support,
which I added to my own libfuse branch (+ fuse2fs). Eventually I'd like
to get the kernel umount code to flush and wait for all pending fuse
commands, issue a FUSE_SYNCFS and wait for that, and then issue a
FUSE_DESTROY to tell the fuse server to tear itself down and release the
block device(s) it's holding.

--D

>
>
> Thanks,
> Bernd
>

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Darrick J. Wong @ 2025-07-01 5:58 UTC
To: Allison Karlitskaya
Cc: Theodore Ts'o, Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4

On Fri, Jun 20, 2025 at 10:58:38AM +0200, Allison Karlitskaya wrote:
> hi Ted,
>
> Sorry I didn't see this earlier. I've been travelling.
>
> On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
> > This may break the github actions for composefs-rs[1], but I'm going
> > to assume that they can figure out a way to transition to Fuse3
> > (hopefully by just using a newer version of Ubuntu, but I suppose it's
> > possible that Rust bindings only exist for Fuse2, and not Fuse3). But
> > in any case, I don't think it makes sense to hold back fuse2fs
> > development just for the sake of Ubuntu Focal (LTS 20.04). And if
> > necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> > they can get off of Fuse2 and/or Ubuntu 20.04. Allison, does that
> > sound fair to you?
>
> To be honest, with a composefs-rs hat on, I don't care at all about
> fuse support for ext2/3/4 (although I think it's cool that it exists).
> We also use fuse in composefs-rs for unrelated reasons, but even there
> we use the fuser rust crate which has a "pure rust" direct syscall

Aha, I just stumbled upon that crate. There are ... too many things on
crates.io that claim to be fuse libraries/wrappers/etc.

It's tempting to go write fuse4fs as an iomap-only Rust server, but I
never quite got the hang of configuring cargo to link against a locally
built .so in the same source tree (i.e. when I was trying to link
xfs_healer against libhandle that ships as part of xfsprogs). I'm not
even sure I want to explore exposing libext2fs in a Rust-safe way.

> layer that no longer depends on libfuse. Our use of e2fsprogs is
> strictly related to building testing images in CI, and for that we
> only use mkfs.ext4. There's also no specific reason that we're using
> old Ubuntu. I probably just copy-pasted it from another project
> without paying too much attention.
>
> Thanks for asking, though!

I'm glad to hear that e2fsprogs can drop fuse2 support! :)

--D

> lis
>
>

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
From: Amir Goldstein @ 2025-07-12 10:57 UTC
To: Darrick J. Wong, Bernd Schubert
Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o

> On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
...
> > So I /think/ we could ask the fuse server at inode instantiation time
> > (which, if I'm reading the code correctly, is when iget5_locked gives
> > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > to userspace at that time. Alternately I guess we could extend struct
> > fuse_attr with another FUSE_ATTR_ flag, I think?
> >
>
> The latter. Either extend fuse_attr or struct fuse_entry_out,
> which is in the responses of FUSE_LOOKUP,
> FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> which instantiate fuse inodes.
>

Update:

I went to look at this extension for my inode ops passthrough patches.
What I saw is that while struct fuse_attr and struct fuse_entry_out
are designed to be extended and have been extended in the past:

 * 7.9:
 *  - add blksize field to fuse_attr

Later on, struct fuse_direntplus was introduced

 * 7.21
 *  - add FUSE_READDIRPLUS

with struct fuse_entry_out/fuse_attr embedded in the middle,
and I don't see any code in the kernel/lib that is prepared to handle
a change in the FUSE_NAME_OFFSET_DIRENTPLUS constant
(maybe it's there and I am missing it).

So for my own use, which only requires passing a single int backing_id,
I was tempted to try and overload attr_valid{,_nsec}, which are not
relevant for the passthrough getattr case, something like
{attr_valid = backing_id, attr_valid_nsec = UTIME_OMIT}.

In the meantime, as an example, I used a hole in struct fuse_attr_out
to implement backing file setup in reply to GETATTR in the wip branch [1].

Bernd,

I wonder if I am missing something w.r.t. the intended extensibility
of struct fuse_entry_out/fuse_attr and current readdirplus code?

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fuse-backing-inode-wip/

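To illustrate the readdirplus concern raised above, here is a minimal sketch of why growing a struct that is embedded ahead of the name bytes is awkward. The demo_* structs are simplified, hypothetical stand-ins and do not match the real layouts in the fuse uapi header; the backing_id field in the "v2" struct is likewise only an assumption for illustration. The point is just that a reader built against the old size looks for the dirent header at the wrong offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the entry reply that precedes each dirent. */
struct demo_entry_out {
	uint64_t nodeid;
	uint64_t attr_words[10];	/* placeholder for the embedded attributes */
};

/* Hypothetical extended layout with one extra field appended. */
struct demo_entry_out_v2 {
	uint64_t nodeid;
	uint64_t attr_words[10];
	uint64_t backing_id;		/* assumption: the new field */
};

/* Fixed dirent header; the name bytes follow it directly. */
struct demo_dirent_hdr {
	uint64_t ino;
	uint64_t off;
	uint32_t namelen;
	uint32_t type;
};

/* Offset of the name within one record, for a given entry_out size. */
static size_t demo_name_offset(size_t entry_out_size)
{
	return entry_out_size + sizeof(struct demo_dirent_hdr);
}

/* Pack one record as [entry_out (zeroed)][dirent header][name]. */
static size_t demo_pack(unsigned char *buf, size_t buflen,
			size_t entry_out_size, const char *name)
{
	struct demo_dirent_hdr hdr = {
		.ino = 1, .off = 0,
		.namelen = (uint32_t)strlen(name), .type = 0,
	};
	size_t total = demo_name_offset(entry_out_size) + hdr.namelen;

	assert(total <= buflen);
	memset(buf, 0, entry_out_size);
	memcpy(buf + entry_out_size, &hdr, sizeof(hdr));
	memcpy(buf + entry_out_size + sizeof(hdr), name, hdr.namelen);
	return total;
}

/* Read namelen the way a parser assuming entry_out_size would. */
static uint32_t demo_read_namelen(const unsigned char *buf,
				  size_t entry_out_size)
{
	struct demo_dirent_hdr hdr;

	memcpy(&hdr, buf + entry_out_size, sizeof(hdr));
	return hdr.namelen;
}
```

If a writer packs a record with sizeof(struct demo_entry_out_v2) and a reader calls demo_read_namelen() with sizeof(struct demo_entry_out), the reader decodes the header 8 bytes too early and gets the wrong name length. Any real extension would therefore need both sides to agree on the size, for example through protocol version negotiation.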
* [RFC[RAP] V2] fuse: use fs-iomap for better performance so we can containerize ext4
From: Darrick J. Wong @ 2025-06-13 17:37 UTC
To: Amir Goldstein
Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o, Matthew Wilcox

On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi everyone,
> >
> > DO NOT MERGE THIS.

Three weeks later, I've mostly gotten the iomap caching working. This
is probably most exciting for John, because we were talking earlier
about uploading storage mappings to the fuse driver and this is what
I've come up with.

I'm running around trying to fix all the stuff that doesn't quite work
right. Top of that list is timestamps and file attributes, because fuse no
longer calls the fuse server for file writes. As a result, the kernel
inode always has the most up-to-date versions of some file attributes
(i_size, timestamps, mode) and I just want to send FUSE_SETATTR whenever
the dirty inode gets flushed.

After I get that working I'm going to have to rewrite fuse2fs (or more
likely just fork it) to be a lowlevel driver because as I've noted
elsewhere in this thread, the upper level fuse library can assign
multiple fuse nodeids for a single hardlinked inode. The only reason
that worked for non-iomap fuse2fs is because we have a BKL and disable
all caching. For fuse+iomap, this discrepancy between fuse nodeids and
ext2 inumbers means that iomap just plain doesn't work with hardlinks
because there are multiple struct fuse_inodes for each ondisk inode and
the locking is just broken; and the iomap callouts are per-inode, not
per-file, which leads to multiple layering violations in the upper level
fuse library. Also, as Amir points out, path lookups on every operation
are just *slow*.

Interim branches can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache_2025-06-13
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/libfuse.git/log/?h=fuse-iomap-cache_2025-06-13
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-cache_2025-06-13
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuse2fs_2025-06-13

(I'm not going to respam the list with patches right now because the
quality as told by fstests isn't quite where I want it to be for such a
thing. fuse2fs+iomap passes 87% of fstests (down from 89% without
iomap) but that's still pretty bad.)

--D

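The hardlink problem described above can be shown with a toy model. This is only a sketch under stated assumptions: it models the "multiple nodeids per hardlinked inode" behaviour as a path-keyed table, which is a guess at the mechanism and not libfuse code, and every demo_* name is hypothetical. The inode-keyed variant shows what a lowlevel server would presumably want instead, namely one nodeid per on-disk inode.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_NODES 16

/* Hypothetical server-side node record; not a libfuse structure. */
struct demo_node {
	char path[32];
	uint64_t ino;		/* on-disk inode number */
	uint64_t nodeid;	/* id handed to the kernel */
};

struct demo_table {
	struct demo_node nodes[DEMO_MAX_NODES];
	int count;
	uint64_t next_nodeid;
};

/* Append a new node and return its freshly allocated nodeid. */
static uint64_t demo_add(struct demo_table *t, const char *path, uint64_t ino)
{
	struct demo_node *n;

	assert(t->count < DEMO_MAX_NODES);
	assert(strlen(path) < sizeof(n->path));
	n = &t->nodes[t->count++];
	strcpy(n->path, path);
	n->ino = ino;
	n->nodeid = t->next_nodeid++;
	return n->nodeid;
}

/* Path-keyed lookup: every distinct path gets its own nodeid. */
static uint64_t demo_lookup_by_path(struct demo_table *t, const char *path,
				    uint64_t ino)
{
	for (int i = 0; i < t->count; i++)
		if (strcmp(t->nodes[i].path, path) == 0)
			return t->nodes[i].nodeid;
	return demo_add(t, path, ino);
}

/* Inode-keyed lookup: all paths to one inode share a single nodeid. */
static uint64_t demo_lookup_by_ino(struct demo_table *t, const char *path,
				   uint64_t ino)
{
	for (int i = 0; i < t->count; i++)
		if (t->nodes[i].ino == ino)
			return t->nodes[i].nodeid;
	return demo_add(t, path, ino);
}
```

Looking up "/a" and "/b", both hardlinks to inode 12, yields two different nodeids with demo_lookup_by_path() and the same nodeid with demo_lookup_by_ino(). With per-inode iomap callouts, only the second behaviour gives the kernel a single incore inode to lock and cache mappings against.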
* Re: [RFC[RAP] V2] fuse: use fs-iomap for better performance so we can containerize ext4
From: Miklos Szeredi @ 2025-06-23 13:16 UTC
To: Darrick J. Wong
Cc: Amir Goldstein, linux-fsdevel, John, bernd, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o, Matthew Wilcox

On Fri, 13 Jun 2025 at 19:37, Darrick J. Wong <djwong@kernel.org> wrote:

> Top of that list is timestamps and file attributes, because fuse no
> longer calls the fuse server for file writes. As a result, the kernel
> inode always has the most up-to-date versions of some file attributes
> (i_size, timestamps, mode) and I just want to send FUSE_SETATTR whenever
> the dirty inode gets flushed.

This is already the case for cached writes, no new code should be needed.

Thanks,
Miklos

* Re: [RFC[RAP] V2] fuse: use fs-iomap for better performance so we can containerize ext4
From: Darrick J. Wong @ 2025-07-01 6:05 UTC
To: Miklos Szeredi
Cc: Amir Goldstein, linux-fsdevel, John, bernd, joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o, Matthew Wilcox

On Mon, Jun 23, 2025 at 03:16:53PM +0200, Miklos Szeredi wrote:
> On Fri, 13 Jun 2025 at 19:37, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Top of that list is timestamps and file attributes, because fuse no
> > longer calls the fuse server for file writes. As a result, the kernel
> > inode always has the most up-to-date versions of some file attributes
> > (i_size, timestamps, mode) and I just want to send FUSE_SETATTR whenever
> > the dirty inode gets flushed.
>
> This is already the case for cached writes, no new code should be needed.

Are you talking about the fc->writeback_cache stuff? Yeah, that mostly
works out for fuse2fs.

Though I was wondering, when does atime get updated? fs/fuse sets
S_NOATIME, so I guess it's up to the fuse server to update it when it
wants to, and a later FUSE_GETATTR can pick it up? If so, how do fuse
servers implement lazytime/relatime?

--D

> Thanks,
> Miklos
>

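On the relatime question above, one option (if atime really is left to the server) is for the server to apply the policy itself on reads. The following is a minimal sketch, assuming the same rule the kernel applies for the relatime mount option: update atime only if it is not newer than mtime or ctime, or if it is at least a day old. The demo_* names are hypothetical and nothing here is libfuse or fuse2fs API; lazytime is not covered.

```c
#include <stdbool.h>
#include <time.h>

/* Compare two timestamps: negative, zero, or positive like strcmp(). */
static int demo_ts_cmp(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/*
 * relatime-style policy: decide whether a read at time "now" should
 * update atime.  "chg" is the inode change time (ctime).
 */
static bool demo_relatime_need_update(const struct timespec *atime,
				      const struct timespec *mtime,
				      const struct timespec *chg,
				      const struct timespec *now)
{
	/* File was modified at or after the last recorded access. */
	if (demo_ts_cmp(mtime, atime) >= 0)
		return true;
	/* Inode was changed at or after the last recorded access. */
	if (demo_ts_cmp(chg, atime) >= 0)
		return true;
	/* Recorded access time is at least 24 hours old. */
	if (now->tv_sec - atime->tv_sec >= 24 * 60 * 60)
		return true;
	return false;
}
```

A server could call demo_relatime_need_update() from its read path and only dirty the on-disk inode when it returns true, so that a later FUSE_GETATTR picks up the new atime without every read turning into a metadata write.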