* [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers
@ 2026-04-29 14:12 Darrick J. Wong
2026-04-29 14:16 ` [PATCHSET v8 1/8] fuse: general bug fixes Darrick J. Wong
` (19 more replies)
0 siblings, 20 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:12 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, fuse-devel
Cc: Miklos Szeredi, Bernd Schubert, Joanne Koong, Theodore Ts'o,
Neal Gompa, Amir Goldstein, Christian Brauner
[let's send this as a separate thread]
Hi everyone,
This is the eighth public draft of a prototype to connect the Linux
fuse driver to fs-iomap for regular file IO operations to and from files
whose contents persist to locally attached storage devices. With this
release, I show that it's possible to build a fuse server for a real
filesystem (ext4) that runs entirely in userspace yet maintains most of
its performance.
This effort is now separate from the one to run fuse servers in a
constrained environment via systemd. Putting fuse servers in a
container gets you all the blast radii reduction advantages and provides
a pathway to removing less popular filesystem drivers to reduce
maintenance work in the kernel; now we want to trade relaxation of that
isolation for better performance.
The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands. Pagecache
writeback is now a directio write. The fuse server can upsert mappings
into the kernel for cached access (== zero upcalls for rereads and pure
overwrites!) and the iomap cache revalidation code works.
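
To make the "zero upcalls for rereads" point concrete, here is a toy
userspace model of a mapping cache that sits in front of an upcall. It is
only a sketch under stated assumptions: every demo_* name is hypothetical,
the fixed-size array stands in for whatever structure the kernel really
uses, and the 1 MiB extent layout is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified mapping: file byte range -> device address. */
struct demo_mapping {
	uint64_t offset;	/* file offset in bytes */
	uint64_t length;	/* length in bytes */
	uint64_t addr;		/* device address in bytes */
};

#define DEMO_CACHE_SLOTS 8

struct demo_cache {
	struct demo_mapping slots[DEMO_CACHE_SLOTS];
	size_t nr;		/* slots in use */
	unsigned int upcalls;	/* times we had to ask the "server" */
};

/* Stand-in for the fuse server: pretend files map in 1 MiB extents. */
static struct demo_mapping demo_server_iomap_begin(uint64_t pos)
{
	const uint64_t extent = 1ULL << 20;
	const uint64_t start = pos - (pos % extent);
	struct demo_mapping m = {
		.offset = start,
		.length = extent,
		.addr = (1ULL << 30) + start,
	};

	return m;
}

/* Insert a mapping, or replace one starting at the same file offset. */
static void demo_cache_upsert(struct demo_cache *c,
			      const struct demo_mapping *m)
{
	assert(c->nr <= DEMO_CACHE_SLOTS);
	for (size_t i = 0; i < c->nr; i++) {
		if (c->slots[i].offset == m->offset) {
			c->slots[i] = *m;
			return;
		}
	}
	/* A full cache simply skips caching in this toy model. */
	if (c->nr < DEMO_CACHE_SLOTS)
		c->slots[c->nr++] = *m;
}

/* Return true and fill *out if a cached mapping covers pos. */
static bool demo_cache_lookup(const struct demo_cache *c, uint64_t pos,
			      struct demo_mapping *out)
{
	for (size_t i = 0; i < c->nr; i++) {
		const struct demo_mapping *m = &c->slots[i];

		if (pos >= m->offset && pos - m->offset < m->length) {
			*out = *m;
			return true;
		}
	}
	return false;
}

/* Map a file position: cache first, "upcall" only on a miss. */
static struct demo_mapping demo_iomap_begin(struct demo_cache *c,
					    uint64_t pos)
{
	struct demo_mapping m;

	if (demo_cache_lookup(c, pos, &m))
		return m;

	c->upcalls++;
	m = demo_server_iomap_begin(pos);
	demo_cache_upsert(c, &m);
	return m;
}
```

Calling demo_iomap_begin() twice for positions inside the same extent
bumps the upcall counter only once; a position in a different extent
costs one more upcall. That is the whole benefit of caching mappings for
rereads and pure overwrites.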
At this stage I still get about 95% of the kernel ext4 driver's
streaming directio performance, and 110% of its streaming buffered IO
performance. Random buffered IO is about 85% as
fast as the kernel. Random direct IO is about 80% as fast as the
kernel; see the cover letter for the fuse2fs iomap changes for more
details. Unwritten extent conversions on random direct writes are
especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead. And that's with (now dynamic) debugging turned on!
This series has been rebased to 7.1-rc1 since the seventh RFC, but it
has not otherwise changed much. Most changes happened in userspace
this time:
1. I've written some example fuse-iomap servers, so I now have a vehicle
for testing that out of place writes work (they do) and that inline
data works.
2. Ted has started merging the very large quantity of fuse2fs
improvements into e2fsprogs.
3. I reordered the systemd service container patchset towards master
because the maintainer indicated that he wanted to merge it.
There are some questions remaining:
a. I would like to continue the discussion about how the design review
of this code should be structured, and how I might go about creating
new userspace filesystem servers -- lightweight new ones based off
the existing userspace tools? Or by merging lklfuse?
b. fuse2fs doesn't support the ext4 journal. Urk.
c. I've dropped the fstests and BPF parts of the patchbomb because v7
was just way too long. I'm also not including some extra
enhancements to fuse4fs, also for brevity.
I would like to get the main parts of this submission reviewed for 7.2
now that this has been collecting comments and tweaks in non-rfc status
for 5.5 months.
Kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-service-container
libfuse:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/libfuse.git/log/?h=fuse-iomap-cache
e2fsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-memory-reclaim
fstests:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuse2fs
--Darrick
Unreviewed patches:
[PATCHSET v8 1/8] fuse: general bug fixes
[PATCH 3/4] fuse: update file mode when updating acls
[PATCH 4/4] fuse: propagate default and file acls on creation
[PATCHSET v8 2/8] iomap: cleanups ahead of adding fuse support
[PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
[PATCHSET v8 3/8] fuse: cleanups ahead of adding fuse support
[PATCH 1/2] fuse: move the passthrough-specific code back to
[PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO
[PATCH 01/33] fuse: implement the basic iomap mechanisms
[PATCH 02/33] fuse_trace: implement the basic iomap mechanisms
[PATCH 03/33] fuse: make debugging configurable at runtime
[PATCH 04/33] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add
[PATCH 05/33] fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to
[PATCH 06/33] fuse: enable SYNCFS and ensure we flush everything
[PATCH 07/33] fuse: clean up per-file type inode initialization
[PATCH 08/33] fuse: create a per-inode flag for setting exclusive
[PATCH 10/33] fuse_trace: create a per-inode flag for toggling iomap
[PATCH 11/33] fuse: isolate the other regular file IO paths from
[PATCH 12/33] fuse: implement basic iomap reporting such as FIEMAP
[PATCH 13/33] fuse_trace: implement basic iomap reporting such as
[PATCH 14/33] fuse: implement direct IO with iomap
[PATCH 15/33] fuse_trace: implement direct IO with iomap
[PATCH 16/33] fuse: implement buffered IO with iomap
[PATCH 17/33] fuse_trace: implement buffered IO with iomap
[PATCH 18/33] fuse: use an unrestricted backing device with iomap
[PATCH 20/33] fuse: advertise support for iomap
[PATCH 21/33] fuse: query filesystem geometry when using iomap
[PATCH 22/33] fuse_trace: query filesystem geometry when using iomap
[PATCH 23/33] fuse: implement fadvise for iomap files
[PATCH 24/33] fuse: invalidate ranges of block devices being used for
[PATCH 25/33] fuse_trace: invalidate ranges of block devices being
[PATCH 26/33] fuse: implement inline data file IO via iomap
[PATCH 27/33] fuse_trace: implement inline data file IO via iomap
[PATCH 28/33] fuse: allow more statx fields
[PATCH 29/33] fuse: support atomic writes with iomap
[PATCH 30/33] fuse_trace: support atomic writes with iomap
[PATCH 31/33] fuse: disable direct fs reclaim for any fuse server
[PATCH 32/33] fuse: enable swapfile activation on iomap
[PATCH 33/33] fuse: implement freeze and shutdowns for iomap
[PATCHSET v8 5/8] fuse: allow servers to specify root node id
[PATCH 1/3] fuse: make the root nodeid dynamic
[PATCH 2/3] fuse_trace: make the root nodeid dynamic
[PATCH 3/3] fuse: allow setting of root nodeid
[PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when
[PATCH 1/9] fuse: enable caching of timestamps
[PATCH 2/9] fuse: force a ctime update after a fileattr_set call when
[PATCH 3/9] fuse: allow local filesystems to set some VFS iflags
[PATCH 4/9] fuse_trace: allow local filesystems to set some VFS
[PATCH 5/9] fuse: cache atime when in iomap mode
[PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap
[PATCH 7/9] fuse_trace: let the kernel handle KILL_SUID/KILL_SGID for
[PATCH 8/9] fuse: update ctime when updating acls on an iomap inode
[PATCH 9/9] fuse: always cache ACLs when using iomap
[PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO
[PATCH 01/12] fuse: cache iomaps
[PATCH 02/12] fuse_trace: cache iomaps
[PATCH 03/12] fuse: use the iomap cache for iomap_begin
[PATCH 04/12] fuse_trace: use the iomap cache for iomap_begin
[PATCH 05/12] fuse: invalidate iomap cache after file updates
[PATCH 06/12] fuse_trace: invalidate iomap cache after file updates
[PATCH 07/12] fuse: enable iomap cache management
[PATCH 08/12] fuse_trace: enable iomap cache management
[PATCH 09/12] fuse: overlay iomap inode info in struct fuse_inode
[PATCH 10/12] fuse: constrain iomap mapping cache size
[PATCH 11/12] fuse_trace: constrain iomap mapping cache size
[PATCH 12/12] fuse: enable iomap
[PATCHSET v8 8/8] fuse: run fuse servers as a contained service
[PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap
[PATCH 2/2] fuse: set iomap backing device block size
[PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file
[PATCH 01/25] libfuse: bump kernel and library ABI versions
[PATCH 02/25] libfuse: wait in do_destroy until all open files are
[PATCH 03/25] libfuse: add kernel gates for FUSE_IOMAP
[PATCH 04/25] libfuse: add fuse commands for iomap_begin and end
[PATCH 05/25] libfuse: add upper level iomap commands
[PATCH 06/25] libfuse: add a lowlevel notification to add a new
[PATCH 07/25] libfuse: add upper-level iomap add device function
[PATCH 08/25] libfuse: add iomap ioend low level handler
[PATCH 09/25] libfuse: add upper level iomap ioend commands
[PATCH 10/25] libfuse: add a reply function to send FUSE_ATTR_* to
[PATCH 11/25] libfuse: connect high level fuse library to
[PATCH 12/25] libfuse: support enabling exclusive mode for files
[PATCH 13/25] libfuse: support direct I/O through iomap
[PATCH 14/25] libfuse: don't allow hardlinking of iomap files in the
[PATCH 15/25] libfuse: allow discovery of the kernel's iomap
[PATCH 16/25] libfuse: add lower level iomap_config implementation
[PATCH 17/25] libfuse: add upper level iomap_config implementation
[PATCH 18/25] libfuse: add low level code to invalidate iomap block
[PATCH 19/25] libfuse: add upper-level API to invalidate parts of an
[PATCH 20/25] libfuse: add atomic write support
[PATCH 21/25] libfuse: allow disabling of fs memory reclaim and write
[PATCH 22/25] libfuse: create a helper to transform an open regular
[PATCH 23/25] libfuse: add swapfile support for iomap files
[PATCH 24/25] libfuse: add lower-level filesystem freeze, thaw,
[PATCH 25/25] libfuse: add upper-level filesystem freeze, thaw,
[PATCHSET v8 2/6] libfuse: allow servers to specify root node id
[PATCH 1/1] libfuse: allow root_nodeid mount option
[PATCHSET v8 3/6] libfuse: implement syncfs
[PATCH 1/2] libfuse: add strictatime/lazytime mount options
[PATCH 2/2] libfuse: set sync, immutable,
[PATCHSET v8 4/6] libfuse: add some service helper commands for iomap
[PATCH 1/3] mount_service: delegate iomap privilege from
[PATCH 2/3] libfuse: enable setting iomap block device block size
[PATCH 3/3] mount_service: create loop devices for regular files
[PATCHSET v8 5/6] fuse: add sample iomap fuse servers
[PATCH 1/7] example/iomap_ll: create a simple iomap server
[PATCH 2/7] example/iomap_ll: track block state
[PATCH 3/7] example/iomap_ll: implement atomic writes
[PATCH 4/7] example/iomap_inline_ll: create a simple server to test
[PATCH 5/7] example/iomap_ow_ll: create a simple iomap out of place
[PATCH 6/7] example/iomap_ow_ll: implement atomic writes
[PATCH 7/7] example/iomap_service_ll: create a sample systemd service
[PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file
[PATCH 1/9] libfuse: enable iomap cache management for lowlevel fuse
[PATCH 2/9] libfuse: add upper-level iomap cache management
[PATCH 3/9] libfuse: allow constraining of iomap mapping cache size
[PATCH 4/9] libfuse: add upper-level iomap mapping cache constraint
[PATCH 5/9] libfuse: enable iomap
[PATCH 6/9] example/iomap_ll: cache mappings for later
[PATCH 7/9] example/iomap_inline_ll: cache iomappings in the kernel
[PATCH 8/9] example/iomap_ow_ll: cache iomappings in the kernel
[PATCH 9/9] example/iomap_service_ll: cache iomappings in the kernel
[PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support
[PATCH 1/5] libext2fs: invalidate cached blocks when freeing them
[PATCH 2/5] libext2fs: only flush affected blocks in unix_write_byte
[PATCH 3/5] libext2fs: allow unix_write_byte when the write would be
[PATCH 4/5] libext2fs: allow clients to ask to write full superblocks
[PATCH 5/5] libext2fs: allow callers to disallow I/O to file data
[PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file
[PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping
[PATCH 02/19] fuse2fs: add iomap= mount option
[PATCH 03/19] fuse2fs: implement iomap configuration
[PATCH 04/19] fuse2fs: register block devices for use with iomap
[PATCH 05/19] fuse2fs: implement directio file reads
[PATCH 06/19] fuse2fs: add extent dump function for debugging
[PATCH 07/19] fuse2fs: implement direct write support
[PATCH 08/19] fuse2fs: turn on iomap for pagecache IO
[PATCH 09/19] fuse2fs: don't zero bytes in punch hole
[PATCH 10/19] fuse2fs: don't do file data block IO when iomap is
[PATCH 11/19] fuse2fs: try to create loop device when ext4 device is
[PATCH 12/19] fuse2fs: enable file IO to inline data files
[PATCH 13/19] fuse2fs: set iomap-related inode flags
[PATCH 14/19] fuse2fs: configure block device block size
[PATCH 15/19] fuse4fs: separate invalidation
[PATCH 16/19] fuse2fs: implement statx
[PATCH 17/19] fuse2fs: enable atomic writes
[PATCH 18/19] fuse4fs: disable fs reclaim and write throttling
[PATCH 19/19] fuse2fs: implement freeze and shutdown requests
[PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services
[PATCH 1/3] fuse4fs: configure iomap when running as a service
[PATCH 2/3] fuse4fs: set iomap backing device blocksize
[PATCH 3/3] fuse4fs: ask for loop devices when opening via
[PATCHSET v8 4/6] fuse4fs: specify the root node id
[PATCH 1/1] fuse4fs: don't use inode number translation when possible
[PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when
[PATCH 01/10] fuse2fs: add strictatime/lazytime mount options
[PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap
[PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates
[PATCH 04/10] fuse2fs: better debugging for file mode updates
[PATCH 05/10] fuse2fs: debug timestamp updates
[PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode
[PATCH 07/10] fuse2fs: add tracing for retrieving timestamps
[PATCH 08/10] fuse2fs: enable syncfs
[PATCH 09/10] fuse2fs: set sync, immutable,
[PATCH 10/10] fuse4fs: increase attribute timeout in iomap mode
[PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file
[PATCH 1/4] fuse2fs: enable caching of iomaps
[PATCH 2/4] fuse2fs: constrain iomap mapping cache size
[PATCH 3/4] fuse4fs: upsert first file mapping to kernel on open
[PATCH 4/4] fuse2fs: enable iomap
^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCHSET v8 1/8] fuse: general bug fixes
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
@ 2026-04-29 14:16 ` Darrick J. Wong
2026-04-29 14:21 ` [PATCH 1/4] fuse: flush pending FUSE_RELEASE requests before sending FUSE_DESTROY Darrick J. Wong
` (3 more replies)
2026-04-29 14:16 ` [PATCHSET v8 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
` (18 subsequent siblings)
19 siblings, 4 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:16 UTC (permalink / raw)
To: djwong, miklos
Cc: joannelkoong, joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

Hi all,

Here's a collection of fixes for what I *think* are bugs in fuse, along
with some scattered improvements.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-fixes
---
Commits in this patchset:
* fuse: flush pending FUSE_RELEASE requests before sending FUSE_DESTROY
* fuse: implement file attributes mask for statx
* fuse: update file mode when updating acls
* fuse: propagate default and file acls on creation
---
fs/fuse/fuse_i.h |   46 +++++++++++++++++++++
fs/fuse/acl.c    |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++---
fs/fuse/dev.c    |   18 ++++++++
fs/fuse/dir.c    |  105 +++++++++++++++++++++++++++++++++++++-----------
fs/fuse/inode.c  |   14 ++++++
5 files changed, 267 insertions(+), 34 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/4] fuse: flush pending FUSE_RELEASE requests before sending FUSE_DESTROY
2026-04-29 14:16 ` [PATCHSET v8 1/8] fuse: general bug fixes Darrick J. Wong
@ 2026-04-29 14:21 ` Darrick J. Wong
2026-04-29 14:22 ` [PATCH 2/4] fuse: implement file attributes mask for statx Darrick J. Wong
` (2 subsequent siblings)
3 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:21 UTC (permalink / raw)
To: djwong, miklos
Cc: joannelkoong, joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

generic/488 fails with fuse2fs in the following fashion:

generic/488 _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
(see /var/tmp/fstests/generic/488.full for details)

This test opens a large number of files, unlinks them (which really just
renames them to fuse hidden files), closes the program, unmounts the
filesystem, and runs fsck to check that there aren't any inconsistencies
in the filesystem.

Unfortunately, the 488.full file shows that there are a lot of hidden
files left over in the filesystem, with incorrect link counts. Tracing
fuse_request_* shows that there are a large number of FUSE_RELEASE
commands that are queued up on behalf of the unlinked files at the time
that fuse_conn_destroy calls fuse_abort_conn. Had the connection not
aborted, the fuse server would have responded to the RELEASE commands by
removing the hidden files; instead they stick around.

For upper-level fuse servers that don't use fuseblk mode this isn't a
problem because libfuse responds to the connection going down by pruning
its inode cache and calling the fuse server's ->release for any open
files before calling the server's ->destroy function.

For fuseblk servers this is a problem, however, because the kernel sends
FUSE_DESTROY to the fuse server, and the fuse server has to write all of
its pending changes to the block device before replying to the DESTROY
request because the kernel releases its O_EXCL hold on the block device.
This means that the kernel must flush all pending FUSE_RELEASE requests
before issuing FUSE_DESTROY.

For fuse-iomap servers this will also be a problem because iomap servers
are expected to release all exclusively-held resources before unmount
returns from the kernel.

Create a function to push all the background requests to the queue
before sending FUSE_DESTROY. That way, all the pending file release
events are processed by the fuse server before it tears itself down, and
we don't end up with a corrupt filesystem.

Note that multithreaded fuse servers will need to track the number of
open files and defer a FUSE_DESTROY request until that number reaches
zero. An earlier version of this patch made the kernel wait for the
RELEASE acknowledgements before sending DESTROY, but the kernel people
weren't comfortable with adding blocking waits to unmount.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/fuse_i.h |    3 +++
fs/fuse/dev.c    |   18 ++++++++++++++++++
fs/fuse/inode.c  |   10 +++++++++-
3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 17423d4e3cfa67..f83bd090ca28ea 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1286,6 +1286,9 @@ void fuse_request_end(struct fuse_req *req);
 void fuse_abort_conn(struct fuse_conn *fc);
 void fuse_wait_aborted(struct fuse_conn *fc);

+/* Flush all pending requests but do not wait for them. */
+void fuse_flush_requests(struct fuse_conn *fc);
+
 /* Check if any requests timed out */
 void fuse_check_timeout(struct work_struct *work);

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5dda7080f4a909..2ff09d9f101d00 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2436,6 +2436,24 @@ static void end_polls(struct fuse_conn *fc)
 	}
 }

+/*
+ * Flush all pending requests but do not wait for them. Only call this
+ * function when it is no longer possible for other threads to add requests.
+ */
+void fuse_flush_requests(struct fuse_conn *fc)
+{
+	spin_lock(&fc->lock);
+	spin_lock(&fc->bg_lock);
+	if (fc->connected) {
+		/* Push all the background requests to the queue. */
+		fc->blocked = 0;
+		fc->max_background = UINT_MAX;
+		flush_bg_queue(fc);
+	}
+	spin_unlock(&fc->bg_lock);
+	spin_unlock(&fc->lock);
+}
+
 /*
  * Abort all requests.
  *
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index deddfffb037fb8..21e55080faee2d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -2113,8 +2113,16 @@ void fuse_conn_destroy(struct fuse_mount *fm)
 {
 	struct fuse_conn *fc = fm->fc;

-	if (fc->destroy)
+	if (fc->destroy) {
+		/*
+		 * Flush all pending requests before sending FUSE_DESTROY. The
+		 * fuse server must reply to the flushed requests before
+		 * handling FUSE_DESTROY because unmount is about to release
+		 * its O_EXCL hold on the block device.
+		 */
+		fuse_flush_requests(fc);
 		fuse_send_destroy(fm);
+	}

 	fuse_abort_conn(fc);
 	fuse_wait_aborted(fc);

^ permalink raw reply related [flat|nested] 191+ messages in thread
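
The commit message above notes that multithreaded fuse servers need to
count open files and hold off on DESTROY until the count drops to zero.
A minimal userspace sketch of that bookkeeping might look like the
following. This is not libfuse code: the demo_* names are hypothetical,
and a real server would hang this off its own open/release/destroy
handlers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical server state; none of these names come from libfuse. */
struct demo_server {
	pthread_mutex_t lock;
	pthread_cond_t no_open_files;
	unsigned int open_files;
	bool destroyed;
};

static void demo_server_init(struct demo_server *s)
{
	pthread_mutex_init(&s->lock, NULL);
	pthread_cond_init(&s->no_open_files, NULL);
	s->open_files = 0;
	s->destroyed = false;
}

/* Call from the open handler. */
static void demo_file_opened(struct demo_server *s)
{
	pthread_mutex_lock(&s->lock);
	s->open_files++;
	pthread_mutex_unlock(&s->lock);
}

/* Call from the release handler, possibly on another thread. */
static void demo_file_released(struct demo_server *s)
{
	pthread_mutex_lock(&s->lock);
	assert(s->open_files > 0);
	if (--s->open_files == 0)
		pthread_cond_broadcast(&s->no_open_files);
	pthread_mutex_unlock(&s->lock);
}

/* Non-blocking variant: tear down only if nothing is open. */
static bool demo_try_destroy(struct demo_server *s)
{
	bool ok;

	pthread_mutex_lock(&s->lock);
	ok = (s->open_files == 0);
	if (ok)
		s->destroyed = true;
	pthread_mutex_unlock(&s->lock);
	return ok;
}

/* Blocking variant: wait for every release to land, then tear down. */
static void demo_destroy(struct demo_server *s)
{
	pthread_mutex_lock(&s->lock);
	while (s->open_files > 0)
		pthread_cond_wait(&s->no_open_files, &s->lock);
	s->destroyed = true;
	pthread_mutex_unlock(&s->lock);
}
```

The blocking variant is the one that matches the note in the commit
message: the DESTROY handler sleeps until the last queued RELEASE has been
processed, so the server does not write out its final state while hidden
files are still pending removal.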
* [PATCH 2/4] fuse: implement file attributes mask for statx
2026-04-29 14:16 ` [PATCHSET v8 1/8] fuse: general bug fixes Darrick J. Wong
2026-04-29 14:21 ` [PATCH 1/4] fuse: flush pending FUSE_RELEASE requests before sending FUSE_DESTROY Darrick J. Wong
@ 2026-04-29 14:22 ` Darrick J. Wong
2026-04-29 14:22 ` [PATCH 3/4] fuse: update file mode when updating acls Darrick J. Wong
2026-04-29 14:22 ` [PATCH 4/4] fuse: propagate default and file acls on creation Darrick J. Wong
3 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:22 UTC (permalink / raw)
To: djwong, miklos
Cc: joannelkoong, joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Actually copy the attributes/attributes_mask from userspace. Ignore
file attributes bits that the VFS sets (or doesn't set) on its own.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/fuse_i.h |   37 +++++++++++++++++++++++++++++++++++++
fs/fuse/dir.c    |    4 ++++
fs/fuse/inode.c  |    4 ++++
3 files changed, 45 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f83bd090ca28ea..4c135a7edf54ac 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -147,6 +147,10 @@ struct fuse_inode {
 	/** Version of last attribute change */
 	u64 attr_version;

+	/** statx file attributes */
+	u64 statx_attributes;
+	u64 statx_attributes_mask;
+
 	union {
 		/* read/write io cache (regular file only) */
 		struct {
@@ -1235,6 +1239,39 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 				   u64 attr_valid, u32 cache_mask,
 				   u64 evict_ctr);

+/*
+ * These statx attribute flags are set by the VFS so mask them out of replies
+ * from the fuse server for local filesystems. Nonlocal filesystems are
+ * responsible for enforcing and advertising these flags themselves.
+ */
+#define FUSE_STATX_LOCAL_VFS_ATTRIBUTES (STATX_ATTR_IMMUTABLE | \
+					 STATX_ATTR_APPEND)
+
+/*
+ * These statx attribute flags are set by the VFS so mask them out of replies
+ * from the fuse server.
+ */
+#define FUSE_STATX_VFS_ATTRIBUTES (STATX_ATTR_AUTOMOUNT | STATX_ATTR_DAX | \
+				   STATX_ATTR_MOUNT_ROOT)
+
+static inline u64 fuse_statx_attributes_mask(const struct inode *inode,
+					     const struct fuse_statx *sx)
+{
+	if (fuse_inode_is_exclusive(inode))
+		return sx->attributes_mask & ~(FUSE_STATX_VFS_ATTRIBUTES |
+					       FUSE_STATX_LOCAL_VFS_ATTRIBUTES);
+	return sx->attributes_mask & ~FUSE_STATX_VFS_ATTRIBUTES;
+}
+
+static inline u64 fuse_statx_attributes(const struct inode *inode,
+					const struct fuse_statx *sx)
+{
+	if (fuse_inode_is_exclusive(inode))
+		return sx->attributes & ~(FUSE_STATX_VFS_ATTRIBUTES |
+					  FUSE_STATX_LOCAL_VFS_ATTRIBUTES);
+	return sx->attributes & ~FUSE_STATX_VFS_ATTRIBUTES;
+}
+
 u32 fuse_get_cache_mask(struct inode *inode);

 /**
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b658b6baf72fe9..5d9466c7fd464e 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1468,6 +1468,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
 		stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
 		stat->btime.tv_sec = sx->btime.tv_sec;
 		stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
+		stat->attributes |= fuse_statx_attributes(inode, sx);
+		stat->attributes_mask |= fuse_statx_attributes_mask(inode, sx);
 		fuse_fillattr(idmap, inode, &attr, stat);
 		stat->result_mask |= STATX_TYPE;
 	}
@@ -1572,6 +1574,8 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode,
 			stat->btime = fi->i_btime;
 			stat->result_mask |= STATX_BTIME;
 		}
+		stat->attributes |= fi->statx_attributes;
+		stat->attributes_mask |= fi->statx_attributes_mask;
 	}

 	return err;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 21e55080faee2d..b43e4bf4a1c117 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -286,6 +286,10 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 			fi->i_btime.tv_sec = sx->btime.tv_sec;
 			fi->i_btime.tv_nsec = sx->btime.tv_nsec;
 		}
+
+		fi->statx_attributes = fuse_statx_attributes(inode, sx);
+		fi->statx_attributes_mask = fuse_statx_attributes_mask(inode,
+								       sx);
 	}

 	if (attr->blksize)

^ permalink raw reply related [flat|nested] 191+ messages in thread
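
The masking rule in this patch can be tried out in userspace with a small
stand-alone model. The sketch below is illustrative only: the DEMO_ATTR_*
constants are local stand-ins for the STATX_ATTR_* flags (the values are
my assumption of what the uapi header uses and do not matter for the
logic), and the "exclusive" argument stands in for
fuse_inode_is_exclusive().

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for STATX_ATTR_* flags; the values are assumptions. */
#define DEMO_ATTR_IMMUTABLE	0x00000010ULL
#define DEMO_ATTR_APPEND	0x00000020ULL
#define DEMO_ATTR_ENCRYPTED	0x00000800ULL
#define DEMO_ATTR_AUTOMOUNT	0x00001000ULL
#define DEMO_ATTR_MOUNT_ROOT	0x00002000ULL
#define DEMO_ATTR_DAX		0x00200000ULL

/* Flags the VFS always owns, whatever the server says. */
#define DEMO_VFS_ATTRS \
	(DEMO_ATTR_AUTOMOUNT | DEMO_ATTR_DAX | DEMO_ATTR_MOUNT_ROOT)

/* Flags the VFS owns only for "local" (exclusive) filesystems. */
#define DEMO_LOCAL_VFS_ATTRS \
	(DEMO_ATTR_IMMUTABLE | DEMO_ATTR_APPEND)

/*
 * Filter the attribute bits that a server reported. The same filter
 * would apply to both the attributes and attributes_mask words.
 */
static uint64_t demo_filter_attrs(bool exclusive, uint64_t server_attrs)
{
	uint64_t drop = DEMO_VFS_ATTRS;

	if (exclusive)
		drop |= DEMO_LOCAL_VFS_ATTRS;
	return server_attrs & ~drop;
}
```

For a nonlocal server, an IMMUTABLE bit from userspace survives the filter
while DAX is dropped; for an exclusive (local) inode both are dropped and
only bits such as ENCRYPTED pass through.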
* [PATCH 3/4] fuse: update file mode when updating acls
2026-04-29 14:16 ` [PATCHSET v8 1/8] fuse: general bug fixes Darrick J. Wong
2026-04-29 14:21 ` [PATCH 1/4] fuse: flush pending FUSE_RELEASE requests before sending FUSE_DESTROY Darrick J. Wong
2026-04-29 14:22 ` [PATCH 2/4] fuse: implement file attributes mask for statx Darrick J. Wong
@ 2026-04-29 14:22 ` Darrick J. Wong
2026-04-30 13:48 ` Joanne Koong
2026-04-29 14:22 ` [PATCH 4/4] fuse: propagate default and file acls on creation Darrick J. Wong
3 siblings, 1 reply; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:22 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

If someone sets ACLs on a file that can be expressed fully as Unix DAC
mode bits, most local filesystems will then update the mode bits and
drop the ACL xattr to reduce inefficiency in the file access paths.
Let's do that too. Note that means that we can setacl and end up with
no ACL xattrs, so we also need to tolerate ENODATA returns from
fuse_removexattr.

Note that here we define a "local" fuse filesystem as one that uses
fuseblk mode; we'll shortly add fuse servers that use iomap for the file
IO path to that list.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h |    2 +-
fs/fuse/acl.c    |   56 +++++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4c135a7edf54ac..0bcfb42592895c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1050,7 +1050,7 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
 	return get_fuse_mount_super(inode->i_sb);
 }

-static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
+static inline struct fuse_conn *get_fuse_conn(const struct inode *inode)
 {
 	return get_fuse_mount_super(inode->i_sb)->fc;
 }
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index cbde6ac1add35a..bee8a9a734f50a 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -11,6 +11,18 @@
 #include <linux/posix_acl.h>
 #include <linux/posix_acl_xattr.h>

+/*
+ * If this fuse server behaves like a local filesystem, we can implement the
+ * kernel's optimizations for ACLs for local filesystems instead of passing
+ * the ACL requests straight through to another server.
+ */
+static inline bool fuse_inode_has_local_acls(const struct inode *inode)
+{
+	const struct fuse_conn *fc = get_fuse_conn(inode);
+
+	return fc->posix_acl && fuse_inode_is_exclusive(inode);
+}
+
 static struct posix_acl *__fuse_get_acl(struct fuse_conn *fc,
 					struct inode *inode, int type, bool rcu)
 {
@@ -98,6 +110,8 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 	struct inode *inode = d_inode(dentry);
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	const char *name;
+	umode_t mode = inode->i_mode;
+	const bool local_acls = fuse_inode_has_local_acls(inode);
 	int ret;

 	if (fuse_is_bad(inode))
@@ -113,14 +127,25 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 	else
 		return -EINVAL;

+	/*
+	 * If the ACL can be represented entirely with changes to the mode
+	 * bits, then most filesystems will update the mode bits and delete
+	 * the ACL xattr.
+	 */
+	if (acl && type == ACL_TYPE_ACCESS && local_acls) {
+		ret = posix_acl_update_mode(idmap, inode, &mode, &acl);
+		if (ret)
+			return ret;
+	}
+
 	if (acl) {
 		unsigned int extra_flags = 0;
 		/*
-		 * Fuse userspace is responsible for updating access
-		 * permissions in the inode, if needed. fuse_setxattr
-		 * invalidates the inode attributes, which will force
-		 * them to be refreshed the next time they are used,
-		 * and it also updates i_ctime.
+		 * For non-local filesystems, fuse userspace is responsible for
+		 * updating access permissions in the inode, if needed.
+		 * fuse_setxattr invalidates the inode attributes, which will
+		 * force them to be refreshed the next time they are used, and
+		 * it also updates i_ctime.
 		 */
 		size_t size;
 		void *value;
@@ -137,9 +162,10 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 		/*
 		 * Fuse daemons without FUSE_POSIX_ACL never changed the passed
 		 * through POSIX ACLs. Such daemons don't expect setgid bits to
-		 * be stripped.
+		 * be stripped, unless they've explicitly told the kernel to
+		 * take care of that.
 		 */
-		if (fc->posix_acl &&
+		if (fc->posix_acl && !local_acls &&
 		    !in_group_or_capable(idmap, inode,
 					 i_gid_into_vfsgid(idmap, inode)))
 			extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID;
@@ -148,6 +174,22 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 		kfree(value);
 	} else {
 		ret = fuse_removexattr(inode, name);
+		/* If the acl didn't exist to start with that's fine. */
+		if (ret == -ENODATA)
+			ret = 0;
+	}
+
+	/* If we scheduled a mode update above, push that to userspace now. */
+	if (!ret) {
+		struct iattr attr = { };
+
+		if (mode != inode->i_mode) {
+			attr.ia_valid |= ATTR_MODE;
+			attr.ia_mode = mode;
+		}
+
+		if (attr.ia_valid)
+			ret = fuse_do_setattr(idmap, dentry, &attr, NULL);
 	}

 	if (fc->posix_acl) {

^ permalink raw reply related [flat|nested] 191+ messages in thread
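
For readers who have not looked at why an ACL can sometimes collapse into
plain mode bits, here is a toy userspace model of the idea. It is a sketch
in the spirit of what posix_acl_update_mode() decides, not the kernel's
code: the demo_* types are hypothetical, and setgid stripping and other
details of the real helper are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ACL entry kinds, loosely modelled on POSIX ACL tags. */
enum demo_tag {
	DEMO_USER_OBJ,		/* file owner */
	DEMO_USER,		/* a named user */
	DEMO_GROUP_OBJ,		/* owning group */
	DEMO_GROUP,		/* a named group */
	DEMO_MASK,		/* mask entry */
	DEMO_OTHER,		/* everyone else */
};

struct demo_ace {
	enum demo_tag tag;
	unsigned int perm;	/* rwx as a 3-bit value, 0..7 */
};

/*
 * If the ACL only has owner/group/other entries, store the equivalent
 * nine permission bits in *mode and return true; the caller could then
 * drop the ACL xattr. Any named user, named group, or mask entry means
 * the ACL carries more than mode bits can express, so return false.
 */
static bool demo_acl_to_mode(const struct demo_ace *acl, size_t n,
			     unsigned int *mode)
{
	unsigned int m = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned int p = acl[i].perm & 7;

		switch (acl[i].tag) {
		case DEMO_USER_OBJ:
			m |= p << 6;
			break;
		case DEMO_GROUP_OBJ:
			m |= p << 3;
			break;
		case DEMO_OTHER:
			m |= p;
			break;
		default:
			return false;
		}
	}

	*mode = m;
	return true;
}
```

An ACL of owner rw-, group r--, other r-- maps to mode 0644 and needs no
xattr at all, which is also why the patch has to tolerate ENODATA from the
subsequent xattr removal. Add a named-user entry and the function reports
that a real ACL is required.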
* Re: [PATCH 3/4] fuse: update file mode when updating acls 2026-04-29 14:22 ` [PATCH 3/4] fuse: update file mode when updating acls Darrick J. Wong @ 2026-04-30 13:48 ` Joanne Koong 2026-04-30 20:57 ` Darrick J. Wong 0 siblings, 1 reply; 191+ messages in thread From: Joanne Koong @ 2026-04-30 13:48 UTC (permalink / raw) To: Darrick J. Wong; +Cc: miklos, neal, linux-fsdevel, bernd, fuse-devel On Wed, Apr 29, 2026 at 3:22 PM Darrick J. Wong <djwong@kernel.org> wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > If someone sets ACLs on a file that can be expressed fully as Unix DAC > mode bits, most local filesystems will then update the mode bits and > drop the ACL xattr to reduce inefficiency in the file access paths. > Let's do that too. Note that means that we can setacl and end up with > no ACL xattrs, so we also need to tolerate ENODATA returns from > fuse_removexattr. > > Note that here we define a "local" fuse filesystem as one that uses > fuseblk mode; we'll shortly add fuse servers that use iomap for the file > IO path to that list. > > Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> > --- > fs/fuse/fuse_i.h | 2 +- > fs/fuse/acl.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++------- > 2 files changed, 50 insertions(+), 8 deletions(-) > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index 4c135a7edf54ac..0bcfb42592895c 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -1050,7 +1050,7 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode) > return get_fuse_mount_super(inode->i_sb); > } > > -static inline struct fuse_conn *get_fuse_conn(struct inode *inode) > +static inline struct fuse_conn *get_fuse_conn(const struct inode *inode) > { > return get_fuse_mount_super(inode->i_sb)->fc; > } > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c > index cbde6ac1add35a..bee8a9a734f50a 100644 > --- a/fs/fuse/acl.c > +++ b/fs/fuse/acl.c > @@ -11,6 +11,18 @@ > #include <linux/posix_acl.h> > #include <linux/posix_acl_xattr.h> > > +/* > + * If this fuse server behaves like a local filesystem, we can implement the > + * kernel's optimizations for ACLs for local filesystems instead of passing > + * the ACL requests straight through to another server. nit: "to the server" instead of "to another server"? 
another server sounds like there's 2 servers > + */ > +static inline bool fuse_inode_has_local_acls(const struct inode *inode) > +{ > + const struct fuse_conn *fc = get_fuse_conn(inode); > + > + return fc->posix_acl && fuse_inode_is_exclusive(inode); > +} > + > static struct posix_acl *__fuse_get_acl(struct fuse_conn *fc, > struct inode *inode, int type, bool rcu) > { > @@ -98,6 +110,8 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > struct inode *inode = d_inode(dentry); > struct fuse_conn *fc = get_fuse_conn(inode); > const char *name; > + umode_t mode = inode->i_mode; > + const bool local_acls = fuse_inode_has_local_acls(inode); > int ret; > > if (fuse_is_bad(inode)) > @@ -113,14 +127,25 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > else > return -EINVAL; > > + /* > + * If the ACL can be represented entirely with changes to the mode > + * bits, then most filesystems will update the mode bits and delete > + * the ACL xattr. > + */ > + if (acl && type == ACL_TYPE_ACCESS && local_acls) { > + ret = posix_acl_update_mode(idmap, inode, &mode, &acl); > + if (ret) > + return ret; > + } > + > if (acl) { > unsigned int extra_flags = 0; > /* > - * Fuse userspace is responsible for updating access > - * permissions in the inode, if needed. fuse_setxattr > - * invalidates the inode attributes, which will force > - * them to be refreshed the next time they are used, > - * and it also updates i_ctime. > + * For non-local filesystems, fuse userspace is responsible for > + * updating access permissions in the inode, if needed. > + * fuse_setxattr invalidates the inode attributes, which will > + * force them to be refreshed the next time they are used, and > + * it also updates i_ctime. > */ > size_t size; > void *value; > @@ -137,9 +162,10 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > /* > * Fuse daemons without FUSE_POSIX_ACL never changed the passed > * through POSIX ACLs. 
Such daemons don't expect setgid bits to > - * be stripped. > + * be stripped, unless they've explicitly told the kernel to > + * take care of that. nit: imo this would be clearer as its own sentence, eg "For local filesystems, the kernel already handled sgid stripping in posix_acl_update_mode() above". > */ > - if (fc->posix_acl && > + if (fc->posix_acl && !local_acls && > !in_group_or_capable(idmap, inode, > i_gid_into_vfsgid(idmap, inode))) > extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID; > @@ -148,6 +174,22 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > kfree(value); > } else { > ret = fuse_removexattr(inode, name); > + /* If the acl didn't exist to start with that's fine. */ > + if (ret == -ENODATA) > + ret = 0; > + } > + > + /* If we scheduled a mode update above, push that to userspace now. */ > + if (!ret) { > + struct iattr attr = { }; > + > + if (mode != inode->i_mode) { > + attr.ia_valid |= ATTR_MODE; > + attr.ia_mode = mode; > + } > + > + if (attr.ia_valid) > + ret = fuse_do_setattr(idmap, dentry, &attr, NULL); maybe something like this is clearer? if (!ret && mode != inode->i_mode) { struct iattr attr = { .ia_valid |= ATTR_MODE, .ia_mode = mode, }; ret = fuse_do_setattr(idmap, dentry, &attr, NULL); } Overall, this lgtm though. Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Thanks, Joanne > } > > if (fc->posix_acl) { > ^ permalink raw reply [flat|nested] 191+ messages in thread
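For readers following the commit message above, the idea of an ACL that "can be expressed fully as Unix DAC mode bits" can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the toy_* names and tag values are hypothetical and are not the kernel's posix_acl structures or ACL_* constants. It shows the general rule that an ACL containing only owner, owning-group, and other entries collapses into nine permission bits, while a named entry or a mask does not.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tags for this sketch; these are not the kernel's ACL_* values. */
enum toy_tag {
	TOY_USER_OBJ,	/* file owner */
	TOY_USER,	/* named user */
	TOY_GROUP_OBJ,	/* owning group */
	TOY_GROUP,	/* named group */
	TOY_MASK,	/* upper bound for named and group entries */
	TOY_OTHER,	/* everyone else */
};

struct toy_acl_entry {
	enum toy_tag tag;
	unsigned int perm;	/* rwx bits, 0 through 7 */
};

/*
 * Return true if the ACL says nothing beyond owner/group/other permissions.
 * *mode_out receives the nine permission bits in either case.
 */
static bool toy_acl_equiv_mode(const struct toy_acl_entry *acl, size_t count,
			       unsigned int *mode_out)
{
	unsigned int mode = 0;
	bool equiv = true;
	size_t i;

	for (i = 0; i < count; i++) {
		unsigned int perm = acl[i].perm & 7;

		switch (acl[i].tag) {
		case TOY_USER_OBJ:
			mode |= perm << 6;
			break;
		case TOY_GROUP_OBJ:
			mode |= perm << 3;
			break;
		case TOY_OTHER:
			mode |= perm;
			break;
		case TOY_MASK:
			/* The mask takes over the group bits and still needs an xattr. */
			mode = (mode & ~070u) | (perm << 3);
			equiv = false;
			break;
		case TOY_USER:
		case TOY_GROUP:
			/* Named entries cannot be expressed in nine mode bits. */
			equiv = false;
			break;
		}
	}

	*mode_out = mode;
	return equiv;
}
```

When the function returns true, a filesystem following the approach described in the patch could store only the mode and drop the ACL xattr; when it returns false, the xattr has to stay.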
* Re: [PATCH 3/4] fuse: update file mode when updating acls 2026-04-30 13:48 ` Joanne Koong @ 2026-04-30 20:57 ` Darrick J. Wong 2026-05-01 9:53 ` Joanne Koong 0 siblings, 1 reply; 191+ messages in thread From: Darrick J. Wong @ 2026-04-30 20:57 UTC (permalink / raw) To: Joanne Koong; +Cc: miklos, neal, linux-fsdevel, bernd, fuse-devel On Thu, Apr 30, 2026 at 02:48:29PM +0100, Joanne Koong wrote: > On Wed, Apr 29, 2026 at 3:22 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > From: Darrick J. Wong <djwong@kernel.org> > > > > If someone sets ACLs on a file that can be expressed fully as Unix DAC > > mode bits, most local filesystems will then update the mode bits and > > drop the ACL xattr to reduce inefficiency in the file access paths. > > Let's do that too. Note that means that we can setacl and end up with > > no ACL xattrs, so we also need to tolerate ENODATA returns from > > fuse_removexattr. > > > > Note that here we define a "local" fuse filesystem as one that uses > > fuseblk mode; we'll shortly add fuse servers that use iomap for the file > > IO path to that list. > > > > Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> > > --- > > fs/fuse/fuse_i.h | 2 +- > > fs/fuse/acl.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++------- > > 2 files changed, 50 insertions(+), 8 deletions(-) > > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > > index 4c135a7edf54ac..0bcfb42592895c 100644 > > --- a/fs/fuse/fuse_i.h > > +++ b/fs/fuse/fuse_i.h > > @@ -1050,7 +1050,7 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode) > > return get_fuse_mount_super(inode->i_sb); > > } > > > > -static inline struct fuse_conn *get_fuse_conn(struct inode *inode) > > +static inline struct fuse_conn *get_fuse_conn(const struct inode *inode) > > { > > return get_fuse_mount_super(inode->i_sb)->fc; > > } > > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c > > index cbde6ac1add35a..bee8a9a734f50a 100644 > > --- a/fs/fuse/acl.c > > +++ b/fs/fuse/acl.c > > @@ -11,6 +11,18 @@ > > #include <linux/posix_acl.h> > > #include <linux/posix_acl_xattr.h> > > > > +/* > > + * If this fuse server behaves like a local filesystem, we can implement the > > + * kernel's optimizations for ACLs for local filesystems instead of passing > > + * the ACL requests straight through to another server. > > nit: "to the server" instead of "to another server"? 
another server > sounds like there's 2 servers > > > + */ > > +static inline bool fuse_inode_has_local_acls(const struct inode *inode) > > +{ > > + const struct fuse_conn *fc = get_fuse_conn(inode); > > + > > + return fc->posix_acl && fuse_inode_is_exclusive(inode); > > +} > > + > > static struct posix_acl *__fuse_get_acl(struct fuse_conn *fc, > > struct inode *inode, int type, bool rcu) > > { > > @@ -98,6 +110,8 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > struct inode *inode = d_inode(dentry); > > struct fuse_conn *fc = get_fuse_conn(inode); > > const char *name; > > + umode_t mode = inode->i_mode; > > + const bool local_acls = fuse_inode_has_local_acls(inode); > > int ret; > > > > if (fuse_is_bad(inode)) > > @@ -113,14 +127,25 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > else > > return -EINVAL; > > > > + /* > > + * If the ACL can be represented entirely with changes to the mode > > + * bits, then most filesystems will update the mode bits and delete > > + * the ACL xattr. > > + */ > > + if (acl && type == ACL_TYPE_ACCESS && local_acls) { > > + ret = posix_acl_update_mode(idmap, inode, &mode, &acl); > > + if (ret) > > + return ret; > > + } > > + > > if (acl) { > > unsigned int extra_flags = 0; > > /* > > - * Fuse userspace is responsible for updating access > > - * permissions in the inode, if needed. fuse_setxattr > > - * invalidates the inode attributes, which will force > > - * them to be refreshed the next time they are used, > > - * and it also updates i_ctime. > > + * For non-local filesystems, fuse userspace is responsible for > > + * updating access permissions in the inode, if needed. > > + * fuse_setxattr invalidates the inode attributes, which will > > + * force them to be refreshed the next time they are used, and > > + * it also updates i_ctime. 
> > */ > > size_t size; > > void *value; > > @@ -137,9 +162,10 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > /* > > * Fuse daemons without FUSE_POSIX_ACL never changed the passed > > * through POSIX ACLs. Such daemons don't expect setgid bits to > > - * be stripped. > > + * be stripped, unless they've explicitly told the kernel to > > + * take care of that. > > nit: imo this would be clearer as its own sentence, eg "For local > filesystems, the kernel already handled sgid stripping in > posix_acl_update_mode() above". > > > */ > > - if (fc->posix_acl && > > + if (fc->posix_acl && !local_acls && > > !in_group_or_capable(idmap, inode, > > i_gid_into_vfsgid(idmap, inode))) > > extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID; > > @@ -148,6 +174,22 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > kfree(value); > > } else { > > ret = fuse_removexattr(inode, name); > > + /* If the acl didn't exist to start with that's fine. */ > > + if (ret == -ENODATA) > > + ret = 0; > > + } > > + > > + /* If we scheduled a mode update above, push that to userspace now. */ > > + if (!ret) { > > + struct iattr attr = { }; > > + > > + if (mode != inode->i_mode) { > > + attr.ia_valid |= ATTR_MODE; > > + attr.ia_mode = mode; > > + } > > + > > + if (attr.ia_valid) > > + ret = fuse_do_setattr(idmap, dentry, &attr, NULL); > > maybe something like this is clearer? > > if (!ret && mode != inode->i_mode) { > struct iattr attr = { > .ia_valid |= ATTR_MODE, > .ia_mode = mode, > }; > ret = fuse_do_setattr(idmap, dentry, &attr, NULL); > } I know, you said that last time ;) https://lore.kernel.org/linux-fsdevel/CAJnrk1YL3KnON-WtNjNi+2GZ+6rYvnVUnYwCk5efv0o41XkxcA@mail.gmail.com/ and my reply is the same: https://lore.kernel.org/linux-fsdevel/20260325222308.GH6202@frogsfrogsfrogs/#t ...which is to say, does anyone think that's worth the churn? > > Overall, this lgtm though. > > Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Thanks! 
--D

> Thanks,
> Joanne
>
> > 	}
> >
> > 	if (fc->posix_acl) {
> >
>

^ permalink raw reply	[flat|nested] 191+ messages in thread
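The two code shapes debated above can be compared side by side in a userspace model. This is a sketch only: toy_iattr and TOY_ATTR_MODE are hypothetical stand-ins for the kernel's struct iattr and ATTR_MODE. One assumption is worth stating loudly: the review suggestion as quoted uses `|=` inside a designated initializer, which C does not accept, so the sketch assumes a plain `=` was intended.

```c
#include <stdbool.h>

#define TOY_ATTR_MODE (1u << 0)	/* hypothetical stand-in for ATTR_MODE */

struct toy_iattr {
	unsigned int ia_valid;
	unsigned int ia_mode;
};

/* Shape used in the patch: start empty, then fill in whatever changed. */
static struct toy_iattr toy_build_incremental(unsigned int old_mode,
					      unsigned int new_mode)
{
	struct toy_iattr attr = { 0 };

	if (new_mode != old_mode) {
		attr.ia_valid |= TOY_ATTR_MODE;
		attr.ia_mode = new_mode;
	}
	return attr;
}

/*
 * Shape suggested in review: build the struct only when the mode changed.
 * Returns false (and leaves *out alone) when there is nothing to push.
 */
static bool toy_build_direct(unsigned int old_mode, unsigned int new_mode,
			     struct toy_iattr *out)
{
	if (new_mode == old_mode)
		return false;

	*out = (struct toy_iattr){
		.ia_valid = TOY_ATTR_MODE,	/* '=' here, since '|=' would not compile */
		.ia_mode = new_mode,
	};
	return true;
}
```

Both shapes produce the same request when only the mode can change. The incremental form leaves room for more attributes to be added later; the direct form is shorter.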
* Re: [PATCH 3/4] fuse: update file mode when updating acls 2026-04-30 20:57 ` Darrick J. Wong @ 2026-05-01 9:53 ` Joanne Koong 2026-05-01 16:15 ` Darrick J. Wong 0 siblings, 1 reply; 191+ messages in thread From: Joanne Koong @ 2026-05-01 9:53 UTC (permalink / raw) To: Darrick J. Wong; +Cc: miklos, neal, linux-fsdevel, bernd, fuse-devel On Thu, Apr 30, 2026 at 9:57 PM Darrick J. Wong <djwong@kernel.org> wrote: > > On Thu, Apr 30, 2026 at 02:48:29PM +0100, Joanne Koong wrote: > > On Wed, Apr 29, 2026 at 3:22 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > From: Darrick J. Wong <djwong@kernel.org> > > > > > > If someone sets ACLs on a file that can be expressed fully as Unix DAC > > > mode bits, most local filesystems will then update the mode bits and > > > drop the ACL xattr to reduce inefficiency in the file access paths. > > > Let's do that too. Note that means that we can setacl and end up with > > > no ACL xattrs, so we also need to tolerate ENODATA returns from > > > fuse_removexattr. > > > > > > Note that here we define a "local" fuse filesystem as one that uses > > > fuseblk mode; we'll shortly add fuse servers that use iomap for the file > > > IO path to that list. > > > > > > Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> > > > --- > > > fs/fuse/fuse_i.h | 2 +- > > > fs/fuse/acl.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++------- > > > 2 files changed, 50 insertions(+), 8 deletions(-) > > > > > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > > > index 4c135a7edf54ac..0bcfb42592895c 100644 > > > --- a/fs/fuse/fuse_i.h > > > +++ b/fs/fuse/fuse_i.h > > > @@ -1050,7 +1050,7 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode) > > > return get_fuse_mount_super(inode->i_sb); > > > } > > > > > > -static inline struct fuse_conn *get_fuse_conn(struct inode *inode) > > > +static inline struct fuse_conn *get_fuse_conn(const struct inode *inode) > > > { > > > return get_fuse_mount_super(inode->i_sb)->fc; > > > } > > > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c > > > index cbde6ac1add35a..bee8a9a734f50a 100644 > > > --- a/fs/fuse/acl.c > > > +++ b/fs/fuse/acl.c > > > @@ -11,6 +11,18 @@ > > > #include <linux/posix_acl.h> > > > #include <linux/posix_acl_xattr.h> > > > > > > +/* > > > + * If this fuse server behaves like a local filesystem, we can implement the > > > + * kernel's optimizations for ACLs for local filesystems instead of passing > > > + * the ACL requests straight through to another server. > > > > nit: "to the server" instead of "to another server"? 
another server > > sounds like there's 2 servers > > > > > + */ > > > +static inline bool fuse_inode_has_local_acls(const struct inode *inode) > > > +{ > > > + const struct fuse_conn *fc = get_fuse_conn(inode); > > > + > > > + return fc->posix_acl && fuse_inode_is_exclusive(inode); > > > +} > > > + > > > static struct posix_acl *__fuse_get_acl(struct fuse_conn *fc, > > > struct inode *inode, int type, bool rcu) > > > { > > > @@ -98,6 +110,8 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > > struct inode *inode = d_inode(dentry); > > > struct fuse_conn *fc = get_fuse_conn(inode); > > > const char *name; > > > + umode_t mode = inode->i_mode; > > > + const bool local_acls = fuse_inode_has_local_acls(inode); > > > int ret; > > > > > > if (fuse_is_bad(inode)) > > > @@ -113,14 +127,25 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > > else > > > return -EINVAL; > > > > > > + /* > > > + * If the ACL can be represented entirely with changes to the mode > > > + * bits, then most filesystems will update the mode bits and delete > > > + * the ACL xattr. > > > + */ > > > + if (acl && type == ACL_TYPE_ACCESS && local_acls) { > > > + ret = posix_acl_update_mode(idmap, inode, &mode, &acl); > > > + if (ret) > > > + return ret; > > > + } > > > + > > > if (acl) { > > > unsigned int extra_flags = 0; > > > /* > > > - * Fuse userspace is responsible for updating access > > > - * permissions in the inode, if needed. fuse_setxattr > > > - * invalidates the inode attributes, which will force > > > - * them to be refreshed the next time they are used, > > > - * and it also updates i_ctime. > > > + * For non-local filesystems, fuse userspace is responsible for > > > + * updating access permissions in the inode, if needed. > > > + * fuse_setxattr invalidates the inode attributes, which will > > > + * force them to be refreshed the next time they are used, and > > > + * it also updates i_ctime. 
> > > */ > > > size_t size; > > > void *value; > > > @@ -137,9 +162,10 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > > /* > > > * Fuse daemons without FUSE_POSIX_ACL never changed the passed > > > * through POSIX ACLs. Such daemons don't expect setgid bits to > > > - * be stripped. > > > + * be stripped, unless they've explicitly told the kernel to > > > + * take care of that. > > > > nit: imo this would be clearer as its own sentence, eg "For local > > filesystems, the kernel already handled sgid stripping in > > posix_acl_update_mode() above". > > > > > */ > > > - if (fc->posix_acl && > > > + if (fc->posix_acl && !local_acls && > > > !in_group_or_capable(idmap, inode, > > > i_gid_into_vfsgid(idmap, inode))) > > > extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID; > > > @@ -148,6 +174,22 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > > kfree(value); > > > } else { > > > ret = fuse_removexattr(inode, name); > > > + /* If the acl didn't exist to start with that's fine. */ > > > + if (ret == -ENODATA) > > > + ret = 0; > > > + } > > > + > > > + /* If we scheduled a mode update above, push that to userspace now. */ > > > + if (!ret) { > > > + struct iattr attr = { }; > > > + > > > + if (mode != inode->i_mode) { > > > + attr.ia_valid |= ATTR_MODE; > > > + attr.ia_mode = mode; > > > + } > > > + > > > + if (attr.ia_valid) > > > + ret = fuse_do_setattr(idmap, dentry, &attr, NULL); > > > > maybe something like this is clearer? > > > > if (!ret && mode != inode->i_mode) { > > struct iattr attr = { > > .ia_valid |= ATTR_MODE, > > .ia_mode = mode, > > }; > > ret = fuse_do_setattr(idmap, dentry, &attr, NULL); > > } > > I know, you said that last time ;) > https://lore.kernel.org/linux-fsdevel/CAJnrk1YL3KnON-WtNjNi+2GZ+6rYvnVUnYwCk5efv0o41XkxcA@mail.gmail.com/ > > and my reply is the same: > https://lore.kernel.org/linux-fsdevel/20260325222308.GH6202@frogsfrogsfrogs/#t Fair enough, I'd forgotten the conversation from v7. 
Also btw, I think there's the edge case where if the fuse_removexattr
succeeds but the fuse_do_setattr() fails, then the acl would have been
removed but the mode bits weren't updated. Maybe reordering it for
that case so that the mode update happens first before the xattr
removal would be safer (eg if the xattr removal fails, the old acl is
still there to enforce permissions)?

Thanks,
Joanne

>
> ...which is to say, does anyone think that's worth the churn?
>
> >
> > Overall, this lgtm though.
> >
> > Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
>
> Thanks!
>
> --D
>
> > Thanks,
> > Joanne
> >
> > > 	}
> > >
> > > 	if (fc->posix_acl) {
> > >
> >

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [PATCH 3/4] fuse: update file mode when updating acls 2026-05-01 9:53 ` Joanne Koong @ 2026-05-01 16:15 ` Darrick J. Wong 0 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-05-01 16:15 UTC (permalink / raw) To: Joanne Koong; +Cc: miklos, neal, linux-fsdevel, bernd, fuse-devel On Fri, May 01, 2026 at 10:53:20AM +0100, Joanne Koong wrote: > On Thu, Apr 30, 2026 at 9:57 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > On Thu, Apr 30, 2026 at 02:48:29PM +0100, Joanne Koong wrote: > > > On Wed, Apr 29, 2026 at 3:22 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > > > > > From: Darrick J. Wong <djwong@kernel.org> > > > > > > > > If someone sets ACLs on a file that can be expressed fully as Unix DAC > > > > mode bits, most local filesystems will then update the mode bits and > > > > drop the ACL xattr to reduce inefficiency in the file access paths. > > > > Let's do that too. Note that means that we can setacl and end up with > > > > no ACL xattrs, so we also need to tolerate ENODATA returns from > > > > fuse_removexattr. > > > > > > > > Note that here we define a "local" fuse filesystem as one that uses > > > > fuseblk mode; we'll shortly add fuse servers that use iomap for the file > > > > IO path to that list. > > > > > > > > Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> > > > > --- > > > > fs/fuse/fuse_i.h | 2 +- > > > > fs/fuse/acl.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++------- > > > > 2 files changed, 50 insertions(+), 8 deletions(-) > > > > > > > > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > > > > index 4c135a7edf54ac..0bcfb42592895c 100644 > > > > --- a/fs/fuse/fuse_i.h > > > > +++ b/fs/fuse/fuse_i.h > > > > @@ -1050,7 +1050,7 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode) > > > > return get_fuse_mount_super(inode->i_sb); > > > > } > > > > > > > > -static inline struct fuse_conn *get_fuse_conn(struct inode *inode) > > > > +static inline struct fuse_conn *get_fuse_conn(const struct inode *inode) > > > > { > > > > return get_fuse_mount_super(inode->i_sb)->fc; > > > > } > > > > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c > > > > index cbde6ac1add35a..bee8a9a734f50a 100644 > > > > --- a/fs/fuse/acl.c > > > > +++ b/fs/fuse/acl.c > > > > @@ -11,6 +11,18 @@ > > > > #include <linux/posix_acl.h> > > > > #include <linux/posix_acl_xattr.h> > > > > > > > > +/* > > > > + * If this fuse server behaves like a local filesystem, we can implement the > > > > + * kernel's optimizations for ACLs for local filesystems instead of passing > > > > + * the ACL requests straight through to another server. > > > > > > nit: "to the server" instead of "to another server"? 
another server > > > sounds like there's 2 servers > > > > > > > + */ > > > > +static inline bool fuse_inode_has_local_acls(const struct inode *inode) > > > > +{ > > > > + const struct fuse_conn *fc = get_fuse_conn(inode); > > > > + > > > > + return fc->posix_acl && fuse_inode_is_exclusive(inode); > > > > +} > > > > + > > > > static struct posix_acl *__fuse_get_acl(struct fuse_conn *fc, > > > > struct inode *inode, int type, bool rcu) > > > > { > > > > @@ -98,6 +110,8 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > > > struct inode *inode = d_inode(dentry); > > > > struct fuse_conn *fc = get_fuse_conn(inode); > > > > const char *name; > > > > + umode_t mode = inode->i_mode; > > > > + const bool local_acls = fuse_inode_has_local_acls(inode); > > > > int ret; > > > > > > > > if (fuse_is_bad(inode)) > > > > @@ -113,14 +127,25 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > > > else > > > > return -EINVAL; > > > > > > > > + /* > > > > + * If the ACL can be represented entirely with changes to the mode > > > > + * bits, then most filesystems will update the mode bits and delete > > > > + * the ACL xattr. > > > > + */ > > > > + if (acl && type == ACL_TYPE_ACCESS && local_acls) { > > > > + ret = posix_acl_update_mode(idmap, inode, &mode, &acl); > > > > + if (ret) > > > > + return ret; > > > > + } > > > > + > > > > if (acl) { > > > > unsigned int extra_flags = 0; > > > > /* > > > > - * Fuse userspace is responsible for updating access > > > > - * permissions in the inode, if needed. fuse_setxattr > > > > - * invalidates the inode attributes, which will force > > > > - * them to be refreshed the next time they are used, > > > > - * and it also updates i_ctime. > > > > + * For non-local filesystems, fuse userspace is responsible for > > > > + * updating access permissions in the inode, if needed. 
> > > > + * fuse_setxattr invalidates the inode attributes, which will > > > > + * force them to be refreshed the next time they are used, and > > > > + * it also updates i_ctime. > > > > */ > > > > size_t size; > > > > void *value; > > > > @@ -137,9 +162,10 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > > > /* > > > > * Fuse daemons without FUSE_POSIX_ACL never changed the passed > > > > * through POSIX ACLs. Such daemons don't expect setgid bits to > > > > - * be stripped. > > > > + * be stripped, unless they've explicitly told the kernel to > > > > + * take care of that. > > > > > > nit: imo this would be clearer as its own sentence, eg "For local > > > filesystems, the kernel already handled sgid stripping in > > > posix_acl_update_mode() above". > > > > > > > */ > > > > - if (fc->posix_acl && > > > > + if (fc->posix_acl && !local_acls && > > > > !in_group_or_capable(idmap, inode, > > > > i_gid_into_vfsgid(idmap, inode))) > > > > extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID; > > > > @@ -148,6 +174,22 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > > > kfree(value); > > > > } else { > > > > ret = fuse_removexattr(inode, name); > > > > + /* If the acl didn't exist to start with that's fine. */ > > > > + if (ret == -ENODATA) > > > > + ret = 0; > > > > + } > > > > + > > > > + /* If we scheduled a mode update above, push that to userspace now. */ > > > > + if (!ret) { > > > > + struct iattr attr = { }; > > > > + > > > > + if (mode != inode->i_mode) { > > > > + attr.ia_valid |= ATTR_MODE; > > > > + attr.ia_mode = mode; > > > > + } > > > > + > > > > + if (attr.ia_valid) > > > > + ret = fuse_do_setattr(idmap, dentry, &attr, NULL); > > > > > > maybe something like this is clearer? 
> > >
> > > if (!ret && mode != inode->i_mode) {
> > >         struct iattr attr = {
> > >                 .ia_valid |= ATTR_MODE,
> > >                 .ia_mode = mode,
> > >         };
> > >         ret = fuse_do_setattr(idmap, dentry, &attr, NULL);
> > > }
> >
> > I know, you said that last time ;)
> > https://lore.kernel.org/linux-fsdevel/CAJnrk1YL3KnON-WtNjNi+2GZ+6rYvnVUnYwCk5efv0o41XkxcA@mail.gmail.com/
> >
> > and my reply is the same:
> > https://lore.kernel.org/linux-fsdevel/20260325222308.GH6202@frogsfrogsfrogs/#t
>
> Fair enough, I'd forgotten the conversation from v7.
>
> Also btw, I think there's the edge case where if the fuse_removexattr
> succeeds but the fuse_do_setattr() fails, then the acl would have been
> removed but the mode bits weren't updated. Maybe reordering it for
> that case so that the mode update happens first before the xattr
> removal would be safer (eg if the xattr removal fails, the old acl is
> still there to enforce permissions)?

I don't think reordering the upcalls will help much for atomicity -- if
the mode update succeeds but the removexattr fails, the file's ACLs are
still inconsistent with what the calling process was trying to do.

Hrm. Solving this atomically probably means declaring a new
FUSE_SET_ACL command which conveys i_mode and the name/value of the acl
xattr, or null if the acl should be removed.

OTOH, XFS and ext4 don't do this atomically either. Updating the ACL
xattr and the mode are separate transactions in both. XFS has this
comment:

	/*
	 * We set the mode after successfully updating the ACL xattr
	 * because the xattr update can fail at ENOSPC and we don't want
	 * to change the mode if the ACL update hasn't been applied.
	 */

so I think I'll inject that comment into the patch but otherwise leave
the code as-is.

Well actually, I also forgot to set inode->i_mode = mode if the
fuse_do_setattr succeeds. So I'll change that too.

--D

> Thanks,
> Joanne
>
> >
> > ...which is to say, does anyone think that's worth the churn?
> >
> > >
> > > Overall, this lgtm though.
> > >
> > > Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
> >
> > Thanks!
> >
> > --D
> >
> > > Thanks,
> > > Joanne
> > >
> > > > 	}
> > > >
> > > > 	if (fc->posix_acl) {
> > > >
> > >
>

^ permalink raw reply	[flat|nested] 191+ messages in thread
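The ordering discussed in this subthread (xattr update first, tolerate ENODATA, then push the mode, and only refresh the cached mode once the server accepted it) can be modelled in userspace with fault injection. This is a sketch, not the fuse code: every toy_* name is hypothetical, and the injected error fields exist only so the failure cases from the discussion can be exercised.

```c
#include <errno.h>
#include <stdbool.h>

struct toy_file {
	unsigned int server_mode;	/* mode as stored by the server */
	unsigned int cached_mode;	/* mode as cached on the client side */
	bool has_acl_xattr;
	int xattr_err;		/* errno to inject into the xattr call, 0 for none */
	int setattr_err;	/* errno to inject into the setattr call, 0 for none */
};

/* Stand-in for the removexattr upcall. */
static int toy_removexattr(struct toy_file *f)
{
	if (f->xattr_err)
		return -f->xattr_err;
	if (!f->has_acl_xattr)
		return -ENODATA;
	f->has_acl_xattr = false;
	return 0;
}

/* Stand-in for the setattr upcall that carries the new mode. */
static int toy_setattr_mode(struct toy_file *f, unsigned int mode)
{
	if (f->setattr_err)
		return -f->setattr_err;
	f->server_mode = mode;
	return 0;
}

/* Drop an ACL that collapsed into mode bits, then push the new mode. */
static int toy_drop_acl_and_set_mode(struct toy_file *f, unsigned int mode)
{
	int ret = toy_removexattr(f);

	/* An ACL that never existed is not an error. */
	if (ret == -ENODATA)
		ret = 0;
	if (ret)
		return ret;	/* xattr step failed, so leave the mode alone */

	if (mode != f->cached_mode) {
		ret = toy_setattr_mode(f, mode);
		if (!ret)
			f->cached_mode = mode;	/* refresh the cache only on success */
	}
	return ret;
}
```

With setattr_err set, the model ends up with the xattr gone and the old mode still in place, which is the non-atomic window raised in the thread. Swapping the two steps only moves that window to the other failure, so closing it would need a single combined request of the kind mentioned above.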
* [PATCH 4/4] fuse: propagate default and file acls on creation 2026-04-29 14:16 ` [PATCHSET v8 1/8] fuse: general bug fixes Darrick J. Wong ` (2 preceding siblings ...) 2026-04-29 14:22 ` [PATCH 3/4] fuse: update file mode when updating acls Darrick J. Wong @ 2026-04-29 14:22 ` Darrick J. Wong 2026-05-01 11:11 ` Joanne Koong 3 siblings, 1 reply; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:22 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> For local filesystems, propagate the default and file access ACLs to new children when creating them, just like the other in-kernel local filesystems. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 4 ++ fs/fuse/acl.c | 62 +++++++++++++++++++++++++++++++++ fs/fuse/dir.c | 101 +++++++++++++++++++++++++++++++++++++++++------------- 3 files changed, 142 insertions(+), 25 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 0bcfb42592895c..0b9c617ee3e5be 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1530,6 +1530,10 @@ struct posix_acl *fuse_get_acl(struct mnt_idmap *idmap, struct dentry *dentry, int type); int fuse_set_acl(struct mnt_idmap *, struct dentry *dentry, struct posix_acl *acl, int type); +int fuse_acl_create(struct inode *dir, umode_t *mode, + struct posix_acl **default_acl, struct posix_acl **acl); +int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl, + const struct posix_acl *acl); /* readdir.c */ int fuse_readdir(struct file *file, struct dir_context *ctx); diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c index bee8a9a734f50a..9619ac84a85886 100644 --- a/fs/fuse/acl.c +++ b/fs/fuse/acl.c @@ -10,6 +10,7 @@ #include <linux/posix_acl.h> #include <linux/posix_acl_xattr.h> +#include <linux/fs_struct.h> /* * If this fuse server behaves like a local filesystem, we can implement the @@ -203,3 +204,64 @@ int fuse_set_acl(struct mnt_idmap *idmap, 
struct dentry *dentry, return ret; } + +int fuse_acl_create(struct inode *dir, umode_t *mode, + struct posix_acl **default_acl, struct posix_acl **acl) +{ + struct fuse_conn *fc = get_fuse_conn(dir); + + if (fuse_is_bad(dir)) + return -EIO; + + if (IS_POSIXACL(dir) && fuse_inode_has_local_acls(dir)) + return posix_acl_create(dir, mode, default_acl, acl); + + if (!fc->dont_mask) + *mode &= ~current_umask(); + + *default_acl = NULL; + *acl = NULL; + return 0; +} + +static int __fuse_set_acl(struct inode *inode, const char *name, + const struct posix_acl *acl) +{ + struct fuse_conn *fc = get_fuse_conn(inode); + size_t size; + void *value = posix_acl_to_xattr(fc->user_ns, acl, &size, GFP_KERNEL); + int ret; + + if (!value) + return -ENOMEM; + + if (size > PAGE_SIZE) { + kfree(value); + return -E2BIG; + } + + ret = fuse_setxattr(inode, name, value, size, 0, 0); + kfree(value); + return ret; +} + +int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl, + const struct posix_acl *acl) +{ + int ret; + + if (default_acl) { + ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_DEFAULT, + default_acl); + if (ret) + return ret; + } + + if (acl) { + ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_ACCESS, acl); + if (ret) + return ret; + } + + return 0; +} diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 5d9466c7fd464e..c5c97065984557 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -825,26 +825,28 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir, struct fuse_entry_out outentry; struct fuse_inode *fi; struct fuse_file *ff; + struct posix_acl *default_acl = NULL, *acl = NULL; int epoch, err; bool trunc = flags & O_TRUNC; /* Userspace expects S_IFREG in create mode */ BUG_ON((mode & S_IFMT) != S_IFREG); + err = fuse_acl_create(dir, &mode, &default_acl, &acl); + if (err) + return err; + epoch = atomic_read(&fm->fc->epoch); forget = fuse_alloc_forget(); err = -ENOMEM; if (!forget) - goto out_err; + goto out_acl_release; err = -ENOMEM; ff = 
fuse_file_alloc(fm, true); if (!ff) goto out_put_forget_req; - if (!fm->fc->dont_mask) - mode &= ~current_umask(); - flags &= ~O_NOCTTY; memset(&inarg, 0, sizeof(inarg)); memset(&outentry, 0, sizeof(outentry)); @@ -896,12 +898,17 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir, fuse_sync_release(NULL, ff, flags); fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1); err = -ENOMEM; - goto out_err; + goto out_acl_release; } kfree(forget); d_instantiate(entry, inode); entry->d_time = epoch; fuse_change_entry_timeout(entry, &outentry); + + err = fuse_init_acls(inode, default_acl, acl); + if (err) + goto out_acl_release; + fuse_dir_changed(dir); err = generic_file_open(inode, file); if (!err) { @@ -917,13 +924,17 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir, else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) invalidate_inode_pages2(inode->i_mapping); } + posix_acl_release(default_acl); + posix_acl_release(acl); return err; out_free_ff: fuse_file_free(ff); out_put_forget_req: kfree(forget); -out_err: +out_acl_release: + posix_acl_release(default_acl); + posix_acl_release(acl); return err; } @@ -975,7 +986,9 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry, */ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_mount *fm, struct fuse_args *args, struct inode *dir, - struct dentry *entry, umode_t mode) + struct dentry *entry, umode_t mode, + struct posix_acl *default_acl, + struct posix_acl *acl) { struct fuse_entry_out outarg; struct inode *inode; @@ -983,14 +996,18 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun struct fuse_forget_link *forget; int epoch, err; - if (fuse_is_bad(dir)) - return ERR_PTR(-EIO); + if (fuse_is_bad(dir)) { + err = -EIO; + goto out_acl_release; + } epoch = atomic_read(&fm->fc->epoch); forget = fuse_alloc_forget(); - if (!forget) - return ERR_PTR(-ENOMEM); + if (!forget) { + err = -ENOMEM; + goto out_acl_release; + 
} memset(&outarg, 0, sizeof(outarg)); args->nodeid = get_node_id(dir); @@ -1020,14 +1037,17 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun &outarg.attr, ATTR_TIMEOUT(&outarg), 0, 0); if (!inode) { fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1); - return ERR_PTR(-ENOMEM); + err = -ENOMEM; + goto out_acl_release; } kfree(forget); d_drop(entry); d = d_splice_alias(inode, entry); - if (IS_ERR(d)) - return d; + if (IS_ERR(d)) { + err = PTR_ERR(d); + goto out_acl_release; + } if (d) { d->d_time = epoch; @@ -1036,19 +1056,31 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun entry->d_time = epoch; fuse_change_entry_timeout(entry, &outarg); } + + err = fuse_init_acls(inode, default_acl, acl); + if (err) + goto out_acl_release; fuse_dir_changed(dir); + + posix_acl_release(default_acl); + posix_acl_release(acl); return d; out_put_forget_req: if (err == -EEXIST) fuse_invalidate_entry(entry); kfree(forget); + out_acl_release: + posix_acl_release(default_acl); + posix_acl_release(acl); return ERR_PTR(err); } static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm, struct fuse_args *args, struct inode *dir, - struct dentry *entry, umode_t mode) + struct dentry *entry, umode_t mode, + struct posix_acl *default_acl, + struct posix_acl *acl) { /* * Note that when creating anything other than a directory we @@ -1059,7 +1091,8 @@ static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm, */ WARN_ON_ONCE(S_ISDIR(mode)); - return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode)); + return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode, + default_acl, acl)); } static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, @@ -1067,10 +1100,13 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, { struct fuse_mknod_in inarg; struct fuse_mount *fm = get_fuse_mount(dir); + struct posix_acl *default_acl, *acl; FUSE_ARGS(args); + int err; 
- if (!fm->fc->dont_mask) - mode &= ~current_umask(); + err = fuse_acl_create(dir, &mode, &default_acl, &acl); + if (err) + return err; memset(&inarg, 0, sizeof(inarg)); inarg.mode = mode; @@ -1082,7 +1118,8 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, args.in_args[0].value = &inarg; args.in_args[1].size = entry->d_name.len + 1; args.in_args[1].value = entry->d_name.name; - return create_new_nondir(idmap, fm, &args, dir, entry, mode); + return create_new_nondir(idmap, fm, &args, dir, entry, mode, + default_acl, acl); } static int fuse_create(struct mnt_idmap *idmap, struct inode *dir, @@ -1114,13 +1151,17 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir, { struct fuse_mkdir_in inarg; struct fuse_mount *fm = get_fuse_mount(dir); + struct posix_acl *default_acl, *acl; FUSE_ARGS(args); + int err; - if (!fm->fc->dont_mask) - mode &= ~current_umask(); + mode |= S_IFDIR; /* vfs doesn't set S_IFDIR for us */ + err = fuse_acl_create(dir, &mode, &default_acl, &acl); + if (err) + return ERR_PTR(err); memset(&inarg, 0, sizeof(inarg)); - inarg.mode = mode; + inarg.mode = mode & ~S_IFDIR; inarg.umask = current_umask(); args.opcode = FUSE_MKDIR; args.in_numargs = 2; @@ -1128,7 +1169,8 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir, args.in_args[0].value = &inarg; args.in_args[1].size = entry->d_name.len + 1; args.in_args[1].value = entry->d_name.name; - return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR); + return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR, + default_acl, acl); } static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, @@ -1136,7 +1178,14 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, { struct fuse_mount *fm = get_fuse_mount(dir); unsigned len = strlen(link) + 1; + struct posix_acl *default_acl, *acl; + umode_t mode = S_IFLNK | 0777; FUSE_ARGS(args); + int err; + + err = fuse_acl_create(dir, &mode, &default_acl, &acl); + 
if (err) + return err; args.opcode = FUSE_SYMLINK; args.in_numargs = 3; @@ -1145,7 +1194,8 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, args.in_args[1].value = entry->d_name.name; args.in_args[2].size = len; args.in_args[2].value = link; - return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK); + return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK, + default_acl, acl); } void fuse_flush_time_update(struct inode *inode) @@ -1345,7 +1395,8 @@ static int fuse_link(struct dentry *entry, struct inode *newdir, args.in_args[0].value = &inarg; args.in_args[1].size = newent->d_name.len + 1; args.in_args[1].value = newent->d_name.name; - err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, inode->i_mode); + err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, + inode->i_mode, NULL, NULL); if (!err) fuse_update_ctime_in_cache(inode); else if (err == -EINTR) ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH 4/4] fuse: propagate default and file acls on creation 2026-04-29 14:22 ` [PATCH 4/4] fuse: propagate default and file acls on creation Darrick J. Wong @ 2026-05-01 11:11 ` Joanne Koong 2026-05-01 16:57 ` Darrick J. Wong 0 siblings, 1 reply; 191+ messages in thread From: Joanne Koong @ 2026-05-01 11:11 UTC (permalink / raw) To: Darrick J. Wong; +Cc: miklos, neal, linux-fsdevel, bernd, fuse-devel On Wed, Apr 29, 2026 at 3:22 PM Darrick J. Wong <djwong@kernel.org> wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > For local filesystems, propagate the default and file access ACLs to new > children when creating them, just like the other in-kernel local > filesystems. > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> > --- > fs/fuse/fuse_i.h | 4 ++ > fs/fuse/acl.c | 62 +++++++++++++++++++++++++++++++++ > fs/fuse/dir.c | 101 +++++++++++++++++++++++++++++++++++++++++------------- > 3 files changed, 142 insertions(+), 25 deletions(-) > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index 0bcfb42592895c..0b9c617ee3e5be 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -1530,6 +1530,10 @@ struct posix_acl *fuse_get_acl(struct mnt_idmap *idmap, > struct dentry *dentry, int type); > int fuse_set_acl(struct mnt_idmap *, struct dentry *dentry, > struct posix_acl *acl, int type); > +int fuse_acl_create(struct inode *dir, umode_t *mode, > + struct posix_acl **default_acl, struct posix_acl **acl); > +int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl, > + const struct posix_acl *acl); > > /* readdir.c */ > int fuse_readdir(struct file *file, struct dir_context *ctx); > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c > index bee8a9a734f50a..9619ac84a85886 100644 > --- a/fs/fuse/acl.c > +++ b/fs/fuse/acl.c > @@ -10,6 +10,7 @@ > > #include <linux/posix_acl.h> > #include <linux/posix_acl_xattr.h> > +#include <linux/fs_struct.h> > > /* > * If this fuse server behaves like a local filesystem, we can implement the > @@ 
-203,3 +204,64 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > return ret; > } > + > +int fuse_acl_create(struct inode *dir, umode_t *mode, > + struct posix_acl **default_acl, struct posix_acl **acl) > +{ > + struct fuse_conn *fc = get_fuse_conn(dir); > + > + if (fuse_is_bad(dir)) > + return -EIO; > + > + if (IS_POSIXACL(dir) && fuse_inode_has_local_acls(dir)) > + return posix_acl_create(dir, mode, default_acl, acl); > + > + if (!fc->dont_mask) > + *mode &= ~current_umask(); > + > + *default_acl = NULL; > + *acl = NULL; > + return 0; > +} > + > +static int __fuse_set_acl(struct inode *inode, const char *name, Should this function just be named something like "fuse_set_posix_acl()"? imo that seems clearer > + const struct posix_acl *acl) > +{ > + struct fuse_conn *fc = get_fuse_conn(inode); > + size_t size; > + void *value = posix_acl_to_xattr(fc->user_ns, acl, &size, GFP_KERNEL); nit: imo this would be cleaner separated out to its own line after the variable declarations. Also, I think there's that __free(kfree) annotation that automatically does the freeing for you after you're done with it? 
Maybe that could be useful here > + int ret; > + > + if (!value) > + return -ENOMEM; > + > + if (size > PAGE_SIZE) { > + kfree(value); > + return -E2BIG; > + } > + > + ret = fuse_setxattr(inode, name, value, size, 0, 0); > + kfree(value); > + return ret; > +} > + > +int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl, > + const struct posix_acl *acl) > +{ > + int ret; > + > + if (default_acl) { > + ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_DEFAULT, > + default_acl); > + if (ret) > + return ret; > + } > + > + if (acl) { > + ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_ACCESS, acl); > + if (ret) > + return ret; > + } > + > + return 0; > +} > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index 5d9466c7fd464e..c5c97065984557 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -825,26 +825,28 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir, > struct fuse_entry_out outentry; > struct fuse_inode *fi; > struct fuse_file *ff; > + struct posix_acl *default_acl = NULL, *acl = NULL; nit: Do these need to be null-initialized as fuse_acl_create() will already do that? 
the other call sites (fuse_mknod(), fuse_mkdir()) don't set to null, so might be nicer to have it be consistent > int epoch, err; > bool trunc = flags & O_TRUNC; > > /* Userspace expects S_IFREG in create mode */ > BUG_ON((mode & S_IFMT) != S_IFREG); > > + err = fuse_acl_create(dir, &mode, &default_acl, &acl); > + if (err) > + return err; > + > epoch = atomic_read(&fm->fc->epoch); > forget = fuse_alloc_forget(); > err = -ENOMEM; > if (!forget) > - goto out_err; > + goto out_acl_release; > > err = -ENOMEM; > ff = fuse_file_alloc(fm, true); > if (!ff) > goto out_put_forget_req; > > - if (!fm->fc->dont_mask) > - mode &= ~current_umask(); > - > flags &= ~O_NOCTTY; > memset(&inarg, 0, sizeof(inarg)); > memset(&outentry, 0, sizeof(outentry)); > @@ -896,12 +898,17 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir, > fuse_sync_release(NULL, ff, flags); > fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1); > err = -ENOMEM; > - goto out_err; > + goto out_acl_release; > } > kfree(forget); > d_instantiate(entry, inode); > entry->d_time = epoch; > fuse_change_entry_timeout(entry, &outentry); > + > + err = fuse_init_acls(inode, default_acl, acl); > + if (err) > + goto out_acl_release; I think this will leak the allocated fuse_file. I think there needs to be a FUSE_RELEASE sent to the server as well so the server can release state. 
> + > fuse_dir_changed(dir); > err = generic_file_open(inode, file); > if (!err) { > @@ -917,13 +924,17 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir, > else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) > invalidate_inode_pages2(inode->i_mapping); > } > + posix_acl_release(default_acl); > + posix_acl_release(acl); > return err; > > out_free_ff: > fuse_file_free(ff); > out_put_forget_req: > kfree(forget); > -out_err: > +out_acl_release: > + posix_acl_release(default_acl); > + posix_acl_release(acl); > return err; > } > > @@ -975,7 +986,9 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry, > */ > static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_mount *fm, > struct fuse_args *args, struct inode *dir, > - struct dentry *entry, umode_t mode) > + struct dentry *entry, umode_t mode, > + struct posix_acl *default_acl, > + struct posix_acl *acl) imo it would be cleaner to do the acl logic in the callers instead of threading it through create_new_entry() and create_new_nondir(), especially since the acl creation logic gets called directly in the caller function. 
> { > struct fuse_entry_out outarg; > struct inode *inode; > @@ -983,14 +996,18 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun > struct fuse_forget_link *forget; > int epoch, err; > > - if (fuse_is_bad(dir)) > - return ERR_PTR(-EIO); > + if (fuse_is_bad(dir)) { > + err = -EIO; > + goto out_acl_release; > + } > epoch = atomic_read(&fm->fc->epoch); > > forget = fuse_alloc_forget(); > - if (!forget) > - return ERR_PTR(-ENOMEM); > + if (!forget) { > + err = -ENOMEM; > + goto out_acl_release; > + } > > memset(&outarg, 0, sizeof(outarg)); > args->nodeid = get_node_id(dir); > @@ -1020,14 +1037,17 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun > &outarg.attr, ATTR_TIMEOUT(&outarg), 0, 0); > if (!inode) { > fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1); > - return ERR_PTR(-ENOMEM); > + err = -ENOMEM; > + goto out_acl_release; > } > kfree(forget); > > d_drop(entry); > d = d_splice_alias(inode, entry); > - if (IS_ERR(d)) > - return d; > + if (IS_ERR(d)) { > + err = PTR_ERR(d); > + goto out_acl_release; > + } > > if (d) { > d->d_time = epoch; > @@ -1036,19 +1056,31 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun > entry->d_time = epoch; > fuse_change_entry_timeout(entry, &outarg); > } > + > + err = fuse_init_acls(inode, default_acl, acl); > + if (err) > + goto out_acl_release; Do we probably need a dput() here? 
afaict from the logic in d_splice_alias_ops() [1], it looks like there's a refcount obtained if the spliced dentry was non-null [1] https://elixir.bootlin.com/linux/v7.0/source/fs/dcache.c#L3097 > fuse_dir_changed(dir); > + > + posix_acl_release(default_acl); > + posix_acl_release(acl); > return d; > > out_put_forget_req: > if (err == -EEXIST) > fuse_invalidate_entry(entry); > kfree(forget); > + out_acl_release: > + posix_acl_release(default_acl); > + posix_acl_release(acl); > return ERR_PTR(err); > } > > static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm, > struct fuse_args *args, struct inode *dir, > - struct dentry *entry, umode_t mode) > + struct dentry *entry, umode_t mode, > + struct posix_acl *default_acl, > + struct posix_acl *acl) > { > /* > * Note that when creating anything other than a directory we > @@ -1059,7 +1091,8 @@ static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm, > */ > WARN_ON_ONCE(S_ISDIR(mode)); > > - return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode)); > + return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode, > + default_acl, acl)); > } > > static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, > @@ -1067,10 +1100,13 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, > { > struct fuse_mknod_in inarg; > struct fuse_mount *fm = get_fuse_mount(dir); > + struct posix_acl *default_acl, *acl; > FUSE_ARGS(args); > + int err; > > - if (!fm->fc->dont_mask) > - mode &= ~current_umask(); > + err = fuse_acl_create(dir, &mode, &default_acl, &acl); > + if (err) > + return err; > > memset(&inarg, 0, sizeof(inarg)); > inarg.mode = mode; > @@ -1082,7 +1118,8 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, > args.in_args[0].value = &inarg; > args.in_args[1].size = entry->d_name.len + 1; > args.in_args[1].value = entry->d_name.name; > - return create_new_nondir(idmap, fm, &args, dir, entry, mode); > + return 
create_new_nondir(idmap, fm, &args, dir, entry, mode, > + default_acl, acl); > } > > static int fuse_create(struct mnt_idmap *idmap, struct inode *dir, > @@ -1114,13 +1151,17 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir, > { > struct fuse_mkdir_in inarg; > struct fuse_mount *fm = get_fuse_mount(dir); > + struct posix_acl *default_acl, *acl; > FUSE_ARGS(args); > + int err; > > - if (!fm->fc->dont_mask) > - mode &= ~current_umask(); > + mode |= S_IFDIR; /* vfs doesn't set S_IFDIR for us */ > + err = fuse_acl_create(dir, &mode, &default_acl, &acl); > + if (err) > + return ERR_PTR(err); > > memset(&inarg, 0, sizeof(inarg)); > - inarg.mode = mode; > + inarg.mode = mode & ~S_IFDIR; > inarg.umask = current_umask(); > args.opcode = FUSE_MKDIR; > args.in_numargs = 2; > @@ -1128,7 +1169,8 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir, > args.in_args[0].value = &inarg; > args.in_args[1].size = entry->d_name.len + 1; > args.in_args[1].value = entry->d_name.name; > - return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR); > + return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR, > + default_acl, acl); > } > > static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, > @@ -1136,7 +1178,14 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, > { > struct fuse_mount *fm = get_fuse_mount(dir); > unsigned len = strlen(link) + 1; > + struct posix_acl *default_acl, *acl; > + umode_t mode = S_IFLNK | 0777; > FUSE_ARGS(args); > + int err; > + > + err = fuse_acl_create(dir, &mode, &default_acl, &acl); > + if (err) > + return err; I think we could skip the acl stuff for symlinks, it looks like posix_acl_create() is a no-op for S_IFLNK mode [2] [2] https://elixir.bootlin.com/linux/v7.0.1/source/fs/posix_acl.c#L643 Thanks, Joanne > > args.opcode = FUSE_SYMLINK; > args.in_numargs = 3; > @@ -1145,7 +1194,8 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, > 
args.in_args[1].value = entry->d_name.name; > args.in_args[2].size = len; > args.in_args[2].value = link; > - return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK); > + return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK, > + default_acl, acl); > } > > void fuse_flush_time_update(struct inode *inode) > @@ -1345,7 +1395,8 @@ static int fuse_link(struct dentry *entry, struct inode *newdir, > args.in_args[0].value = &inarg; > args.in_args[1].size = newent->d_name.len + 1; > args.in_args[1].value = newent->d_name.name; > - err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, inode->i_mode); > + err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, > + inode->i_mode, NULL, NULL); > if (!err) > fuse_update_ctime_in_cache(inode); > else if (err == -EINTR) > ^ permalink raw reply [flat|nested] 191+ messages in thread
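The leak concern raised in this review for fuse_create_open() can be illustrated with a toy model. This is only a sketch under stated assumptions: the struct, enum, and function names are hypothetical, and plain counters stand in for the ACL references and for the open file state held by the server. It shows why a failure after the file has been opened needs a label that releases the file before the labels that drop earlier resources.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical counters standing in for kernel-side resources. */
struct model_state {
	int acl_refs;	/* outstanding ACL references */
	int open_files;	/* open-file state held on the "server" */
};

enum model_fail { MODEL_FAIL_NONE, MODEL_FAIL_ALLOC, MODEL_FAIL_ACL_INIT };

static int model_create_open(struct model_state *st, enum model_fail fail)
{
	int err;

	st->acl_refs++;			/* stands in for fuse_acl_create() */

	if (fail == MODEL_FAIL_ALLOC) {
		err = -ENOMEM;		/* nothing opened yet */
		goto out_acl_release;
	}

	st->open_files++;		/* server created and opened the file */

	if (fail == MODEL_FAIL_ACL_INIT) {
		err = -EIO;		/* stands in for fuse_init_acls() failing */
		goto out_release;	/* must also undo the open */
	}

	st->acl_refs--;			/* success: drop ACL ref, keep file open */
	return 0;

out_release:
	st->open_files--;		/* stands in for releasing the file */
out_acl_release:
	st->acl_refs--;			/* stands in for posix_acl_release() */
	return err;
}
```

If the ACL failure jumped straight to out_acl_release instead, open_files would stay at 1 after the error, which is the leak described above.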
* Re: [PATCH 4/4] fuse: propagate default and file acls on creation 2026-05-01 11:11 ` Joanne Koong @ 2026-05-01 16:57 ` Darrick J. Wong 0 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-05-01 16:57 UTC (permalink / raw) To: Joanne Koong; +Cc: miklos, neal, linux-fsdevel, bernd, fuse-devel On Fri, May 01, 2026 at 12:11:42PM +0100, Joanne Koong wrote: > On Wed, Apr 29, 2026 at 3:22 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > From: Darrick J. Wong <djwong@kernel.org> > > > > For local filesystems, propagate the default and file access ACLs to new > > children when creating them, just like the other in-kernel local > > filesystems. > > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> > > --- > > fs/fuse/fuse_i.h | 4 ++ > > fs/fuse/acl.c | 62 +++++++++++++++++++++++++++++++++ > > fs/fuse/dir.c | 101 +++++++++++++++++++++++++++++++++++++++++------------- > > 3 files changed, 142 insertions(+), 25 deletions(-) > > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > > index 0bcfb42592895c..0b9c617ee3e5be 100644 > > --- a/fs/fuse/fuse_i.h > > +++ b/fs/fuse/fuse_i.h > > @@ -1530,6 +1530,10 @@ struct posix_acl *fuse_get_acl(struct mnt_idmap *idmap, > > struct dentry *dentry, int type); > > int fuse_set_acl(struct mnt_idmap *, struct dentry *dentry, > > struct posix_acl *acl, int type); > > +int fuse_acl_create(struct inode *dir, umode_t *mode, > > + struct posix_acl **default_acl, struct posix_acl **acl); > > +int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl, > > + const struct posix_acl *acl); > > > > /* readdir.c */ > > int fuse_readdir(struct file *file, struct dir_context *ctx); > > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c > > index bee8a9a734f50a..9619ac84a85886 100644 > > --- a/fs/fuse/acl.c > > +++ b/fs/fuse/acl.c > > @@ -10,6 +10,7 @@ > > > > #include <linux/posix_acl.h> > > #include <linux/posix_acl_xattr.h> > > +#include <linux/fs_struct.h> > > > > /* > > * If this fuse server behaves 
like a local filesystem, we can implement the > > @@ -203,3 +204,64 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, > > > > return ret; > > } > > + > > +int fuse_acl_create(struct inode *dir, umode_t *mode, > > + struct posix_acl **default_acl, struct posix_acl **acl) > > +{ > > + struct fuse_conn *fc = get_fuse_conn(dir); > > + > > + if (fuse_is_bad(dir)) > > + return -EIO; > > + > > + if (IS_POSIXACL(dir) && fuse_inode_has_local_acls(dir)) > > + return posix_acl_create(dir, mode, default_acl, acl); > > + > > + if (!fc->dont_mask) > > + *mode &= ~current_umask(); > > + > > + *default_acl = NULL; > > + *acl = NULL; > > + return 0; > > +} > > + > > +static int __fuse_set_acl(struct inode *inode, const char *name, > > Should this function just be named something like > "fuse_set_posix_acl()"? imo that seems clearer Will do. > > + const struct posix_acl *acl) > > +{ > > + struct fuse_conn *fc = get_fuse_conn(inode); > > + size_t size; > > + void *value = posix_acl_to_xattr(fc->user_ns, acl, &size, GFP_KERNEL); > > nit: imo this would be cleaner separated out to its own line after the > variable declarations. Also, I think there's that __free(kfree) > annotation that automatically does the freeing for you after you're > done with it? 
Maybe that could be useful here

I'm not a big fan of these weird new macros, but I'll change it:

	struct fuse_conn *fc = get_fuse_conn(inode);
	void *value __free(kfree) = NULL;
	size_t size;

	value = posix_acl_to_xattr(fc->user_ns, acl, &size, GFP_KERNEL);
	if (!value)
		return -ENOMEM;

	if (size > PAGE_SIZE)
		return -E2BIG;

	return fuse_setxattr(inode, name, value, size, 0, 0);

> > +	int ret;
> > +
> > +	if (!value)
> > +		return -ENOMEM;
> > +
> > +	if (size > PAGE_SIZE) {
> > +		kfree(value);
> > +		return -E2BIG;
> > +	}
> > +
> > +	ret = fuse_setxattr(inode, name, value, size, 0, 0);
> > +	kfree(value);
> > +	return ret;
> > +}
> > +
> > +int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl,
> > +		   const struct posix_acl *acl)
> > +{
> > +	int ret;
> > +
> > +	if (default_acl) {
> > +		ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_DEFAULT,
> > +				     default_acl);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	if (acl) {
> > +		ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_ACCESS, acl);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 5d9466c7fd464e..c5c97065984557 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -825,26 +825,28 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
> >  	struct fuse_entry_out outentry;
> >  	struct fuse_inode *fi;
> >  	struct fuse_file *ff;
> > +	struct posix_acl *default_acl = NULL, *acl = NULL;
>
> nit: Do these need to be null-initialized as fuse_acl_create() will
> already do that? the other call sites (fuse_mknod(), fuse_mkdir())
> don't set to null, so might be nicer to have it be consistent

They're not strictly required because (AFAICT) fuse_acl_create either
returns an errno or sets default_acl/acl.  But I'm really really sick and
tired of playing the game where some static checkers can't dig deep
enough into the code to figure that out and complain based on their
incomplete scans.
But I'm probably screwed anyway because other checkers that can dig that deep will whine about the unnecessary store. Meh. I don't know. I don't care. I want to focus on getting the logic right, not micro-optimizing C local variable initialization. > > int epoch, err; > > bool trunc = flags & O_TRUNC; > > > > /* Userspace expects S_IFREG in create mode */ > > BUG_ON((mode & S_IFMT) != S_IFREG); > > > > + err = fuse_acl_create(dir, &mode, &default_acl, &acl); > > + if (err) > > + return err; > > + > > epoch = atomic_read(&fm->fc->epoch); > > forget = fuse_alloc_forget(); > > err = -ENOMEM; > > if (!forget) > > - goto out_err; > > + goto out_acl_release; > > > > err = -ENOMEM; > > ff = fuse_file_alloc(fm, true); > > if (!ff) > > goto out_put_forget_req; > > > > - if (!fm->fc->dont_mask) > > - mode &= ~current_umask(); > > - > > flags &= ~O_NOCTTY; > > memset(&inarg, 0, sizeof(inarg)); > > memset(&outentry, 0, sizeof(outentry)); > > @@ -896,12 +898,17 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir, > > fuse_sync_release(NULL, ff, flags); > > fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1); > > err = -ENOMEM; > > - goto out_err; > > + goto out_acl_release; > > } > > kfree(forget); > > d_instantiate(entry, inode); > > entry->d_time = epoch; > > fuse_change_entry_timeout(entry, &outentry); > > + > > + err = fuse_init_acls(inode, default_acl, acl); > > + if (err) > > + goto out_acl_release; > > I think this will leak the allocated fuse_file. I think there needs > to be a FUSE_RELEASE sent to the server as well so the server can > release state. I think you're right. This needs to do the same cleanups as the failure case for fuse_iget above, correct? 
I think in that case it makes sense to move the fuse_init_acls call up:

	inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
			  &outentry.attr, ATTR_TIMEOUT(&outentry), 0, 0);
	if (!inode) {
		err = -ENOMEM;
		goto out_release;
	}

	err = fuse_init_acls(inode, default_acl, acl);
	if (err)
		goto out_release;

wherein out_release calls fuse_sync_release and fuse_queue_forget.

> > +
> >  	fuse_dir_changed(dir);
> >  	err = generic_file_open(inode, file);
> >  	if (!err) {
> > @@ -917,13 +924,17 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
> >  		else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
> >  			invalidate_inode_pages2(inode->i_mapping);
> >  	}
> > +	posix_acl_release(default_acl);
> > +	posix_acl_release(acl);
> >  	return err;
> >
> >  out_free_ff:
> >  	fuse_file_free(ff);
> >  out_put_forget_req:
> >  	kfree(forget);
> > -out_err:
> > +out_acl_release:
> > +	posix_acl_release(default_acl);
> > +	posix_acl_release(acl);
> >  	return err;
> >  }
> >
> > @@ -975,7 +986,9 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> >   */
> >  static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_mount *fm,
> >  				       struct fuse_args *args, struct inode *dir,
> > -				       struct dentry *entry, umode_t mode)
> > +				       struct dentry *entry, umode_t mode,
> > +				       struct posix_acl *default_acl,
> > +				       struct posix_acl *acl)
>
> imo it would be cleaner to do the acl logic in the callers instead of
> threading it through create_new_entry() and create_new_nondir(),
> especially since the acl creation logic gets called directly in the
> caller function.

I'll change the code so that create_new_entry no longer consumes
acl/default_acl.
> > > { > > struct fuse_entry_out outarg; > > struct inode *inode; > > @@ -983,14 +996,18 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun > > struct fuse_forget_link *forget; > > int epoch, err; > > > > - if (fuse_is_bad(dir)) > > - return ERR_PTR(-EIO); > > + if (fuse_is_bad(dir)) { > > + err = -EIO; > > + goto out_acl_release; > > + } > > epoch = atomic_read(&fm->fc->epoch); > > > > forget = fuse_alloc_forget(); > > - if (!forget) > > - return ERR_PTR(-ENOMEM); > > + if (!forget) { > > + err = -ENOMEM; > > + goto out_acl_release; > > + } > > > > memset(&outarg, 0, sizeof(outarg)); > > args->nodeid = get_node_id(dir); > > @@ -1020,14 +1037,17 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun > > &outarg.attr, ATTR_TIMEOUT(&outarg), 0, 0); > > if (!inode) { > > fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1); > > - return ERR_PTR(-ENOMEM); > > + err = -ENOMEM; > > + goto out_acl_release; > > } > > kfree(forget); > > > > d_drop(entry); > > d = d_splice_alias(inode, entry); > > - if (IS_ERR(d)) > > - return d; > > + if (IS_ERR(d)) { > > + err = PTR_ERR(d); > > + goto out_acl_release; > > + } > > > > if (d) { > > d->d_time = epoch; > > @@ -1036,19 +1056,31 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun > > entry->d_time = epoch; > > fuse_change_entry_timeout(entry, &outarg); > > } > > + > > + err = fuse_init_acls(inode, default_acl, acl); > > + if (err) > > + goto out_acl_release; > > Do we probably need a dput() here? afaict from the logic in > d_splice_alias_ops() [1], it looks like there's a refcount obtained if > the spliced dentry was non-null > > [1] https://elixir.bootlin.com/linux/v7.0/source/fs/dcache.c#L3097 I think this code hunk should move to immediately after the fuse_iget, same as the last one. 
> > fuse_dir_changed(dir); > > + > > + posix_acl_release(default_acl); > > + posix_acl_release(acl); > > return d; > > > > out_put_forget_req: > > if (err == -EEXIST) > > fuse_invalidate_entry(entry); > > kfree(forget); > > + out_acl_release: > > + posix_acl_release(default_acl); > > + posix_acl_release(acl); > > return ERR_PTR(err); > > } > > > > static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm, > > struct fuse_args *args, struct inode *dir, > > - struct dentry *entry, umode_t mode) > > + struct dentry *entry, umode_t mode, > > + struct posix_acl *default_acl, > > + struct posix_acl *acl) > > { > > /* > > * Note that when creating anything other than a directory we > > @@ -1059,7 +1091,8 @@ static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm, > > */ > > WARN_ON_ONCE(S_ISDIR(mode)); > > > > - return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode)); > > + return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode, > > + default_acl, acl)); > > } > > > > static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, > > @@ -1067,10 +1100,13 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, > > { > > struct fuse_mknod_in inarg; > > struct fuse_mount *fm = get_fuse_mount(dir); > > + struct posix_acl *default_acl, *acl; > > FUSE_ARGS(args); > > + int err; > > > > - if (!fm->fc->dont_mask) > > - mode &= ~current_umask(); > > + err = fuse_acl_create(dir, &mode, &default_acl, &acl); > > + if (err) > > + return err; > > > > memset(&inarg, 0, sizeof(inarg)); > > inarg.mode = mode; > > @@ -1082,7 +1118,8 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, > > args.in_args[0].value = &inarg; > > args.in_args[1].size = entry->d_name.len + 1; > > args.in_args[1].value = entry->d_name.name; > > - return create_new_nondir(idmap, fm, &args, dir, entry, mode); > > + return create_new_nondir(idmap, fm, &args, dir, entry, mode, > > + default_acl, acl); > > } > > > > static int 
fuse_create(struct mnt_idmap *idmap, struct inode *dir, > > @@ -1114,13 +1151,17 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir, > > { > > struct fuse_mkdir_in inarg; > > struct fuse_mount *fm = get_fuse_mount(dir); > > + struct posix_acl *default_acl, *acl; > > FUSE_ARGS(args); > > + int err; > > > > - if (!fm->fc->dont_mask) > > - mode &= ~current_umask(); > > + mode |= S_IFDIR; /* vfs doesn't set S_IFDIR for us */ > > + err = fuse_acl_create(dir, &mode, &default_acl, &acl); > > + if (err) > > + return ERR_PTR(err); > > > > memset(&inarg, 0, sizeof(inarg)); > > - inarg.mode = mode; > > + inarg.mode = mode & ~S_IFDIR; > > inarg.umask = current_umask(); > > args.opcode = FUSE_MKDIR; > > args.in_numargs = 2; > > @@ -1128,7 +1169,8 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir, > > args.in_args[0].value = &inarg; > > args.in_args[1].size = entry->d_name.len + 1; > > args.in_args[1].value = entry->d_name.name; > > - return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR); > > + return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR, > > + default_acl, acl); > > } > > > > static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, > > @@ -1136,7 +1178,14 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, > > { > > struct fuse_mount *fm = get_fuse_mount(dir); > > unsigned len = strlen(link) + 1; > > + struct posix_acl *default_acl, *acl; > > + umode_t mode = S_IFLNK | 0777; > > FUSE_ARGS(args); > > + int err; > > + > > + err = fuse_acl_create(dir, &mode, &default_acl, &acl); > > + if (err) > > + return err; > > I think we could skip the acl stuff for symlinks, it looks like > posix_acl_create() is a no-op for S_IFLNK mode [2] > > [2] https://elixir.bootlin.com/linux/v7.0.1/source/fs/posix_acl.c#L643 You're right that posix_acl_create does nothing for symlinks, but skipping the call here encodes that implementation detail in fuse. 
That leaves a logic bomb for anyone who might modify posix_acl_create to do more with symlinks, because the major filesystems (ext4/btrfs/xfs) call posix_acl_create from their generic inode creation functions, which means they call it even for symlinks. That increases the chance that the author making that change will fail to notice fuse. --D > Thanks, > Joanne > > > > > args.opcode = FUSE_SYMLINK; > > args.in_numargs = 3; > > @@ -1145,7 +1194,8 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, > > args.in_args[1].value = entry->d_name.name; > > args.in_args[2].size = len; > > args.in_args[2].value = link; > > - return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK); > > + return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK, > > + default_acl, acl); > > } > > > > void fuse_flush_time_update(struct inode *inode) > > @@ -1345,7 +1395,8 @@ static int fuse_link(struct dentry *entry, struct inode *newdir, > > args.in_args[0].value = &inarg; > > args.in_args[1].size = newent->d_name.len + 1; > > args.in_args[1].value = newent->d_name.name; > > - err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, inode->i_mode); > > + err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, > > + inode->i_mode, NULL, NULL); > > if (!err) > > fuse_update_ctime_in_cache(inode); > > else if (err == -EINTR) > > > ^ permalink raw reply [flat|nested] 191+ messages in thread
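The scope-based cleanup style discussed above for the xattr buffer can be tried in userspace with the GCC/Clang cleanup variable attribute. The following is a minimal sketch under stated assumptions: the MODEL_AUTO_FREE macro, the helper names, and the 16-byte limit are all hypothetical, and malloc/free stand in for the kernel allocator. It shows how early returns stop needing an explicit free on every path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int model_frees;	/* counts frees, for demonstration only */

/* Cleanup handler: receives a pointer to the annotated variable. */
static void model_free_ptr(void *p)
{
	void **slot = p;

	if (*slot) {
		free(*slot);
		model_frees++;
	}
}

/* Hypothetical userspace analogue of a "free on scope exit" annotation. */
#define MODEL_AUTO_FREE	__attribute__((cleanup(model_free_ptr)))

#define MODEL_MAX_SIZE	16	/* arbitrary limit for the sketch */

static int model_set_value(const char *src)
{
	char *value MODEL_AUTO_FREE = NULL;
	size_t size = strlen(src) + 1;

	value = malloc(size);
	if (!value)
		return -ENOMEM;

	if (size > MODEL_MAX_SIZE)
		return -E2BIG;	/* value is freed on scope exit */

	memcpy(value, src, size);
	return 0;		/* value is freed on scope exit */
}
```

Both the success path and the -E2BIG path free the buffer exactly once without a visible free() call, which is the property the shortened function body above relies on.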
* [PATCHSET v8 2/8] iomap: cleanups ahead of adding fuse support
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
2026-04-29 14:16 ` [PATCHSET v8 1/8] fuse: general bug fixes Darrick J. Wong
@ 2026-04-29 14:16 ` Darrick J. Wong
2026-04-29 14:22 ` [PATCH 1/2] iomap: allow directio callers to supply _COMP_WORK Darrick J. Wong
2026-04-29 14:23 ` [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile Darrick J. Wong
2026-04-29 14:17 ` [PATCHSET v8 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
` (17 subsequent siblings)
19 siblings, 2 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:16 UTC (permalink / raw)
To: djwong, miklos, brauner; +Cc: hch, fuse-devel, linux-fsdevel, hch

Hi all,

In preparation for making fuse use the fs/iomap code for regular file
data IO, fix a few bugs in fuse and apply a couple of tweaks to iomap.
These patches can go in immediately.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=iomap-fuse-prep
---
Commits in this patchset:
 * iomap: allow directio callers to supply _COMP_WORK
 * iomap: allow NULL swap info bdev when activating swapfile
---
 include/linux/iomap.h |    3 +++
 fs/iomap/direct-io.c  |    7 ++++---
 fs/iomap/swapfile.c   |   17 +++++++++++++++++
 3 files changed, 24 insertions(+), 3 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/2] iomap: allow directio callers to supply _COMP_WORK
2026-04-29 14:16 ` [PATCHSET v8 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
@ 2026-04-29 14:22 ` Darrick J. Wong
2026-04-29 14:23 ` [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile Darrick J. Wong
1 sibling, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:22 UTC (permalink / raw)
To: djwong, miklos, brauner; +Cc: hch, fuse-devel, linux-fsdevel, hch

From: Darrick J. Wong <djwong@kernel.org>

Allow callers of iomap_dio_rw to specify the _COMP_WORK flag if they
require that all directio ioend completions occur in process context.
The upcoming fuse-iomap patchset needs this because fuse requests
(specifically FUSE_IOMAP_IOEND) cannot be sent from interrupt context.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/iomap.h |    3 +++
 fs/iomap/direct-io.c  |    7 ++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2c5685adf3a97c..0de25b1af72d3f 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -590,6 +590,9 @@ struct iomap_dio_ops {
  */
 #define IOMAP_DIO_BOUNCE		(1 << 4)
 
+/* Run IO completions from process context */
+#define IOMAP_DIO_COMP_WORK		(1 << 5)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b0a6549b38487b..71f43fa0cf26ea 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -19,8 +19,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
-#define IOMAP_DIO_NO_INVALIDATE	(1U << 26)
-#define IOMAP_DIO_COMP_WORK	(1U << 27)
+#define IOMAP_DIO_NO_INVALIDATE	(1U << 27)
 #define IOMAP_DIO_WRITE_THROUGH	(1U << 28)
 #define IOMAP_DIO_NEED_SYNC	(1U << 29)
 #define IOMAP_DIO_WRITE		(1U << 30)
@@ -711,7 +710,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->i_size = i_size_read(inode);
 	dio->dops = dops;
 	dio->error = 0;
-	dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE);
+	dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED |
+				  IOMAP_DIO_BOUNCE |
+				  IOMAP_DIO_COMP_WORK);
 	dio->done_before = done_before;
 
 	dio->submit.iter = iter;

^ permalink raw reply related	[flat|nested] 191+ messages in thread
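The flag handling in the patch above can be modelled in userspace. The following is a minimal sketch, not kernel code: the helper name dio_initial_flags and the two mask macros are hypothetical, and the value used for IOMAP_DIO_FSBLOCK_ALIGNED is an assumption because the patch does not show it. The other values are taken from the diff. It illustrates why promoting _COMP_WORK to a public flag requires adding it to the mask in __iomap_dio_rw, and why the public and private ranges must stay disjoint.

```c
#include <assert.h>

/* Public flags, as callers would pass them to iomap_dio_rw(). */
#define IOMAP_DIO_FSBLOCK_ALIGNED	(1 << 3)	/* assumed value, not in the patch */
#define IOMAP_DIO_BOUNCE		(1 << 4)
#define IOMAP_DIO_COMP_WORK		(1 << 5)

/* Private flags in fs/iomap/direct-io.c once the patch is applied. */
#define IOMAP_DIO_NO_INVALIDATE		(1U << 27)
#define IOMAP_DIO_WRITE_THROUGH		(1U << 28)
#define IOMAP_DIO_NEED_SYNC		(1U << 29)
#define IOMAP_DIO_WRITE			(1U << 30)

/* Hypothetical names for the two flag namespaces. */
#define DIO_PUBLIC_MASK \
	(IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE | IOMAP_DIO_COMP_WORK)
#define DIO_PRIVATE_MASK \
	(IOMAP_DIO_NO_INVALIDATE | IOMAP_DIO_WRITE_THROUGH | \
	 IOMAP_DIO_NEED_SYNC | IOMAP_DIO_WRITE)

/* Hypothetical helper mirroring the masking done in __iomap_dio_rw(). */
unsigned int dio_initial_flags(unsigned int dio_flags)
{
	/* Both namespaces share one word, so they must not overlap. */
	assert((DIO_PUBLIC_MASK & DIO_PRIVATE_MASK) == 0);

	/* Only caller-visible flags survive; private bits are dropped. */
	return dio_flags & DIO_PUBLIC_MASK;
}
```

In this model a caller that passes IOMAP_DIO_COMP_WORK gets it back unchanged, while any private bit supplied by the caller is masked off before the dio is set up.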
* [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
2026-04-29 14:16 ` [PATCHSET v8 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
2026-04-29 14:22 ` [PATCH 1/2] iomap: allow directio callers to supply _COMP_WORK Darrick J. Wong
@ 2026-04-29 14:23 ` Darrick J. Wong
2026-05-08 9:06 ` Christoph Hellwig
1 sibling, 1 reply; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:23 UTC (permalink / raw)
To: djwong, miklos, brauner; +Cc: fuse-devel, linux-fsdevel, hch

From: Darrick J. Wong <djwong@kernel.org>

All current users of the iomap swapfile activation mechanism are block
device filesystems.  This means that claim_swapfile will set
swap_info_struct::bdev to inode->i_sb->s_bdev of the swap file.

However, a subsequent patch to fuse will add iomap infrastructure so
that fuse servers can be asked to provide file mappings specifically for
swap files.  The fuse server isn't required to set s_bdev (by mounting
as fuseblk) so s_bdev might be null.  For this case, we want to set
sis::bdev from the first mapping.

To make this work robustly, we must explicitly check that each mapping
provides a bdev and that there's no way we can succeed at collecting
swapfile pages without a block device.

And just to be clear: fuse-iomap servers will have to respond to an
explicit request for swapfile activation.  It's not like fuseblk, where
responding to bmap means swapfiles work even if that wasn't expected.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/swapfile.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index 0db77c449467a7..9d9f4e84437df5 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -112,6 +112,13 @@ static int iomap_swapfile_iter(struct iomap_iter *iter,
 	if (iomap->flags & IOMAP_F_SHARED)
 		return iomap_swapfile_fail(isi, "has shared extents");
 
+	/* Swapfiles must be backed by a block device */
+	if (!iomap->bdev)
+		return iomap_swapfile_fail(isi, "is not on a block device");
+
+	if (iter->pos == 0 && !isi->sis->bdev)
+		isi->sis->bdev = iomap->bdev;
+
 	/* Only one bdev per swap file. */
 	if (iomap->bdev != isi->sis->bdev)
 		return iomap_swapfile_fail(isi, "outside the main device");
@@ -184,6 +191,16 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 		return -EINVAL;
 	}
 
+	/*
+	 * If this swapfile doesn't have a block device, reject this useless
+	 * swapfile to prevent confusion later on.
+	 */
+	if (sis->bdev == NULL) {
+		pr_warn(
+ "swapon: No block device for swap file but usage pages?!\n");
+		return -EINVAL;
+	}
+
 	*pagespan = 1 + isi.highest_ppage - isi.lowest_ppage;
 	sis->max = isi.nr_pages;
 	sis->pages = isi.nr_pages - 1;

^ permalink raw reply related	[flat|nested] 191+ messages in thread
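The per-mapping checks added to iomap_swapfile_iter above can be sketched as a small userspace model. This is an illustration only: every toy_* name is hypothetical, the structures carry just the fields the checks touch, and the error reporting of iomap_swapfile_fail is reduced to returning -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct toy_bdev {
	int id;
};

struct toy_swap_info {
	const struct toy_bdev *bdev;	/* NULL if the caller did not set one */
};

struct toy_mapping {
	long long pos;			/* file offset where the mapping starts */
	const struct toy_bdev *bdev;	/* device backing this mapping */
};

/*
 * Model of the bdev checks for one mapping: returns 0 if the mapping is
 * acceptable for a swapfile, -EINVAL otherwise.
 */
int toy_swapfile_check_bdev(struct toy_swap_info *sis,
			    const struct toy_mapping *map)
{
	/* Swapfiles must be backed by a block device. */
	if (!map->bdev)
		return -EINVAL;

	/* Adopt the device of the first mapping if none was preset. */
	if (map->pos == 0 && !sis->bdev)
		sis->bdev = map->bdev;

	/* Only one bdev per swap file. */
	if (map->bdev != sis->bdev)
		return -EINVAL;

	return 0;
}
```

With a swap info whose bdev starts out NULL, the mapping at offset 0 supplies the device, and any later mapping on a different device (or with no device at all) is rejected. A caller that presets the bdev, as XFS does, is unaffected because the adoption step is skipped.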
* Re: [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
2026-04-29 14:23 ` [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile Darrick J. Wong
@ 2026-05-08 9:06 ` Christoph Hellwig
2026-05-08 23:41 ` Darrick J. Wong
0 siblings, 1 reply; 191+ messages in thread
From: Christoph Hellwig @ 2026-05-08 9:06 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, brauner, fuse-devel, linux-fsdevel, hch

> +	/* Swapfiles must be backed by a block device */
> +	if (!iomap->bdev)
> +		return iomap_swapfile_fail(isi, "is not on a block device");
> +
> +	if (iter->pos == 0 && !isi->sis->bdev)
> +		isi->sis->bdev = iomap->bdev;

My gut feeling is still that I'd rather have this in the caller than
hidden in iomap.

> +	/*
> +	 * If this swapfile doesn't have a block device, reject this useless
> +	 * swapfile to prevent confusion later on.
> +	 */
> +	if (sis->bdev == NULL) {

normal kernel style would be "if (!sis->bdev)"

> +		pr_warn(
> + "swapon: No block device for swap file but usage pages?!\n");

Also non-XFS code tends to just put this onto the pr_warn line even
if it is less readable (and XFS code tends to not do an empty space).

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
2026-05-08 9:06 ` Christoph Hellwig
@ 2026-05-08 23:41 ` Darrick J. Wong
0 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-05-08 23:41 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: miklos, brauner, fuse-devel, linux-fsdevel

On Fri, May 08, 2026 at 11:06:34AM +0200, Christoph Hellwig wrote:
> > +	/* Swapfiles must be backed by a block device */
> > +	if (!iomap->bdev)
> > +		return iomap_swapfile_fail(isi, "is not on a block device");
> > +
> > +	if (iter->pos == 0 && !isi->sis->bdev)
> > +		isi->sis->bdev = iomap->bdev;
>
> My gut feeling is still that I'd rather have this in the caller than
> hidden in iomap.

The trouble here is that the iomap code requires that the caller sets
sis->bdev before calling iomap_swapfile_activate.  That works great for
XFS:

	/*
	 * Direct the swap code to the correct block device when this file
	 * sits on the RT device.
	 */
	sis->bdev = xfs_inode_buftarg(ip)->bt_bdev;

	return iomap_swapfile_activate(sis, swap_file, span,
			&xfs_read_iomap_ops);

But it doesn't work for fuse because fuse cannot know what the bdev for
block 0 is going to be without sending FUSE_IOMAP_BEGIN to the server.
iomap_swapfile_activate is already going to call fuse_iomap_begin to do
that, so we'd end up making an extra upcall just to set one pointer that
we could easily set later anyway.

> > +	/*
> > +	 * If this swapfile doesn't have a block device, reject this useless
> > +	 * swapfile to prevent confusion later on.
> > +	 */
> > +	if (sis->bdev == NULL) {
>
> normal kernel style would be "if (!sis->bdev)"
>
> > +		pr_warn(
> > + "swapon: No block device for swap file but usage pages?!\n");
>
> Also non-XFS code tends to just put this onto the pr_warn line even
> if it is less readable (and XFS code tends to not do an empty space).

Ok.  I could change that message to something less confusing too:

"swapon: No block device for swap file but we configured swap pages?!\n"

--D

^ permalink raw reply	[flat|nested] 191+ messages in thread
* [PATCHSET v8 3/8] fuse: cleanups ahead of adding fuse support
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
2026-04-29 14:16 ` [PATCHSET v8 1/8] fuse: general bug fixes Darrick J. Wong
2026-04-29 14:16 ` [PATCHSET v8 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
@ 2026-04-29 14:17 ` Darrick J. Wong
2026-04-29 14:23 ` [PATCH 1/2] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
2026-04-29 14:23 ` [PATCH 2/2] fuse_trace: " Darrick J. Wong
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (16 subsequent siblings)
19 siblings, 2 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:17 UTC (permalink / raw)
To: djwong, miklos
Cc: joannelkoong, joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

Hi all,

In preparation for making fuse use the fs/iomap code for regular file
data IO, fix a few bugs in fuse and apply a couple of tweaks to iomap.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-prep
---
Commits in this patchset:
 * fuse: move the passthrough-specific code back to passthrough.c
 * fuse_trace: move the passthrough-specific code back to passthrough.c
---
 fs/fuse/fuse_i.h          |   25 ++++++++++-
 fs/fuse/fuse_trace.h      |   35 ++++++++++++++++
 include/uapi/linux/fuse.h |    8 +++-
 fs/fuse/Kconfig           |    4 ++
 fs/fuse/Makefile          |    3 +
 fs/fuse/backing.c         |  101 ++++++++++++++++++++++++++++++++++-----------
 fs/fuse/dev.c             |    4 +-
 fs/fuse/inode.c           |    4 +-
 fs/fuse/passthrough.c     |   38 ++++++++++++++++-
 9 files changed, 188 insertions(+), 34 deletions(-)

^ permalink raw reply	[flat|nested] 191+ messages in thread
* [PATCH 1/2] fuse: move the passthrough-specific code back to passthrough.c 2026-04-29 14:17 ` [PATCHSET v8 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong @ 2026-04-29 14:23 ` Darrick J. Wong 2026-04-29 14:23 ` [PATCH 2/2] fuse_trace: " Darrick J. Wong 1 sibling, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:23 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> In preparation for iomap, move the passthrough-specific validation code back to passthrough.c and create a new Kconfig item for conditional compilation of backing.c. In the next patch, iomap will share the backing structures. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 25 ++++++++++- include/uapi/linux/fuse.h | 8 +++- fs/fuse/Kconfig | 4 ++ fs/fuse/Makefile | 3 + fs/fuse/backing.c | 98 ++++++++++++++++++++++++++++++++++----------- fs/fuse/dev.c | 4 +- fs/fuse/inode.c | 4 +- fs/fuse/passthrough.c | 38 +++++++++++++++++ 8 files changed, 149 insertions(+), 35 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 0b9c617ee3e5be..0666e03723071b 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -103,10 +103,23 @@ struct fuse_submount_lookup { struct fuse_forget_link *forget; }; +struct fuse_conn; + +/** Operations for subsystems that want to use a backing file */ +struct fuse_backing_ops { + int (*may_admin)(struct fuse_conn *fc, uint32_t flags); + int (*may_open)(struct fuse_conn *fc, struct file *file); + int (*may_close)(struct fuse_conn *fc, struct file *file); + unsigned int type; + int id_start; + int id_end; +}; + /** Container for data related to mapping to backing file */ struct fuse_backing { struct file *file; struct cred *cred; + const struct fuse_backing_ops *ops; /** refcount */ refcount_t count; @@ -980,7 +993,7 @@ struct fuse_conn { /* New writepages go into this bucket */ struct fuse_sync_bucket __rcu 
*curr_bucket; -#ifdef CONFIG_FUSE_PASSTHROUGH +#ifdef CONFIG_FUSE_BACKING /** IDR for backing files ids */ struct idr backing_files_map; #endif @@ -1591,10 +1604,12 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff, unsigned int open_flags, fl_owner_t id, bool isdir); /* backing.c */ -#ifdef CONFIG_FUSE_PASSTHROUGH +#ifdef CONFIG_FUSE_BACKING struct fuse_backing *fuse_backing_get(struct fuse_backing *fb); void fuse_backing_put(struct fuse_backing *fb); -struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id); +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, + const struct fuse_backing_ops *ops, + int backing_id); #else static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb) @@ -1649,6 +1664,10 @@ static inline struct file *fuse_file_passthrough(struct fuse_file *ff) #endif } +#ifdef CONFIG_FUSE_PASSTHROUGH +extern const struct fuse_backing_ops fuse_passthrough_backing_ops; +#endif + ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *iter); ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *iter); ssize_t fuse_passthrough_splice_read(struct file *in, loff_t *ppos, diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index c13e1f9a2f12bd..18713cfaf09171 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -1126,9 +1126,15 @@ struct fuse_notify_prune_out { uint64_t spare; }; +#define FUSE_BACKING_TYPE_MASK (0xFF) +#define FUSE_BACKING_TYPE_PASSTHROUGH (0) +#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH) + +#define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK) + struct fuse_backing_map { int32_t fd; - uint32_t flags; + uint32_t flags; /* FUSE_BACKING_* */ uint64_t padding; }; diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig index 3a4ae632c94aa8..290d1c09e0b924 100644 --- a/fs/fuse/Kconfig +++ b/fs/fuse/Kconfig @@ -59,12 +59,16 @@ config FUSE_PASSTHROUGH default y depends on FUSE_FS select FS_STACK + 
select FUSE_BACKING help This allows bypassing FUSE server by mapping specific FUSE operations to be performed directly on a backing file. If you want to allow passthrough operations, answer Y. +config FUSE_BACKING + bool + config FUSE_IO_URING bool "FUSE communication over io-uring" default y diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile index 22ad9538dfc4b8..46041228e5be2c 100644 --- a/fs/fuse/Makefile +++ b/fs/fuse/Makefile @@ -14,7 +14,8 @@ fuse-y := trace.o # put trace.o first so we see ftrace errors sooner fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o fuse-y += iomode.o fuse-$(CONFIG_FUSE_DAX) += dax.o -fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o +fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o +fuse-$(CONFIG_FUSE_BACKING) += backing.o fuse-$(CONFIG_SYSCTL) += sysctl.o fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c index d95dfa48483f0a..adb4d2ebb21379 100644 --- a/fs/fuse/backing.c +++ b/fs/fuse/backing.c @@ -6,6 +6,7 @@ */ #include "fuse_i.h" +#include "fuse_trace.h" #include <linux/file.h> @@ -44,7 +45,8 @@ static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb) idr_preload(GFP_KERNEL); spin_lock(&fc->lock); /* FIXME: xarray might be space inefficient */ - id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC); + id = idr_alloc_cyclic(&fc->backing_files_map, fb, fb->ops->id_start, + fb->ops->id_end, GFP_ATOMIC); spin_unlock(&fc->lock); idr_preload_end(); @@ -69,32 +71,53 @@ static int fuse_backing_id_free(int id, void *p, void *data) struct fuse_backing *fb = p; WARN_ON_ONCE(refcount_read(&fb->count) != 1); + fuse_backing_free(fb); return 0; } void fuse_backing_files_free(struct fuse_conn *fc) { - idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL); + idr_for_each(&fc->backing_files_map, fuse_backing_id_free, fc); idr_destroy(&fc->backing_files_map); } +static inline const struct fuse_backing_ops * 
+fuse_backing_ops_from_map(const struct fuse_backing_map *map) +{ + switch (map->flags & FUSE_BACKING_TYPE_MASK) { +#ifdef CONFIG_FUSE_PASSTHROUGH + case FUSE_BACKING_TYPE_PASSTHROUGH: + return &fuse_passthrough_backing_ops; +#endif + default: + break; + } + + return NULL; +} + int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map) { struct file *file; - struct super_block *backing_sb; struct fuse_backing *fb = NULL; + const struct fuse_backing_ops *ops = fuse_backing_ops_from_map(map); + uint32_t op_flags = map->flags & ~FUSE_BACKING_TYPE_MASK; int res; pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags); - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */ - res = -EPERM; - if (!fc->passthrough || !capable(CAP_SYS_ADMIN)) + res = -EOPNOTSUPP; + if (!ops) + goto out; + WARN_ON(ops->type != (map->flags & FUSE_BACKING_TYPE_MASK)); + + res = ops->may_admin ? ops->may_admin(fc, op_flags) : 0; + if (res) goto out; res = -EINVAL; - if (map->flags || map->padding) + if (map->padding) goto out; file = fget_raw(map->fd); @@ -102,14 +125,8 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map) if (!file) goto out; - /* read/write/splice/mmap passthrough only relevant for regular files */ - res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL; - if (!d_is_reg(file->f_path.dentry)) - goto out_fput; - - backing_sb = file_inode(file)->i_sb; - res = -ELOOP; - if (backing_sb->s_stack_depth >= fc->max_stack_depth) + res = ops->may_open ? 
ops->may_open(fc, file) : 0; + if (res) goto out_fput; fb = kmalloc_obj(struct fuse_backing); @@ -119,14 +136,15 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map) fb->file = file; fb->cred = prepare_creds(); + fb->ops = ops; refcount_set(&fb->count, 1); res = fuse_backing_id_alloc(fc, fb); if (res < 0) { fuse_backing_free(fb); fb = NULL; + goto out; } - out: pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res); @@ -137,41 +155,71 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map) goto out; } +static struct fuse_backing *__fuse_backing_lookup(struct fuse_conn *fc, + int backing_id) +{ + struct fuse_backing *fb; + + rcu_read_lock(); + fb = idr_find(&fc->backing_files_map, backing_id); + fb = fuse_backing_get(fb); + rcu_read_unlock(); + + return fb; +} + int fuse_backing_close(struct fuse_conn *fc, int backing_id) { - struct fuse_backing *fb = NULL; + struct fuse_backing *fb = NULL, *test_fb; + const struct fuse_backing_ops *ops; int err; pr_debug("%s: backing_id=%d\n", __func__, backing_id); - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */ - err = -EPERM; - if (!fc->passthrough || !capable(CAP_SYS_ADMIN)) - goto out; - err = -EINVAL; if (backing_id <= 0) goto out; err = -ENOENT; - fb = fuse_backing_id_remove(fc, backing_id); + fb = __fuse_backing_lookup(fc, backing_id); if (!fb) goto out; + ops = fb->ops; - fuse_backing_put(fb); + err = ops->may_admin ? ops->may_admin(fc, 0) : 0; + if (err) + goto out_fb; + + err = ops->may_close ? 
ops->may_close(fc, fb->file) : 0; + if (err) + goto out_fb; + + err = -ENOENT; + test_fb = fuse_backing_id_remove(fc, backing_id); + if (!test_fb) + goto out_fb; + + WARN_ON(fb != test_fb); err = 0; + fuse_backing_put(test_fb); +out_fb: + fuse_backing_put(fb); out: pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err); return err; } -struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id) +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, + const struct fuse_backing_ops *ops, + int backing_id) { struct fuse_backing *fb; rcu_read_lock(); fb = idr_find(&fc->backing_files_map, backing_id); + if (fb && fb->ops != ops) + fb = NULL; fb = fuse_backing_get(fb); rcu_read_unlock(); diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 2ff09d9f101d00..2708e17bc46949 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -2644,7 +2644,7 @@ static long fuse_dev_ioctl_backing_open(struct file *file, if (IS_ERR(fud)) return PTR_ERR(fud); - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) + if (!IS_ENABLED(CONFIG_FUSE_BACKING)) return -EOPNOTSUPP; if (copy_from_user(&map, argp, sizeof(map))) @@ -2661,7 +2661,7 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp) if (IS_ERR(fud)) return PTR_ERR(fud); - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) + if (!IS_ENABLED(CONFIG_FUSE_BACKING)) return -EOPNOTSUPP; if (get_user(backing_id, argp)) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index b43e4bf4a1c117..c04d9ae42bb008 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1034,7 +1034,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm, fc->name_max = FUSE_NAME_LOW_MAX; fc->timeout.req_timeout = 0; - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) + if (IS_ENABLED(CONFIG_FUSE_BACKING)) fuse_backing_files_init(fc); INIT_LIST_HEAD(&fc->mounts); @@ -1074,7 +1074,7 @@ void fuse_conn_put(struct fuse_conn *fc) WARN_ON(atomic_read(&bucket->count) != 1); kfree(bucket); } - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) + if 
(IS_ENABLED(CONFIG_FUSE_BACKING)) fuse_backing_files_free(fc); call_rcu(&fc->rcu, delayed_release); } diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c index f2d08ac2459b7e..a3ea9b9eb903d5 100644 --- a/fs/fuse/passthrough.c +++ b/fs/fuse/passthrough.c @@ -162,7 +162,7 @@ struct fuse_backing *fuse_passthrough_open(struct file *file, int backing_id) goto out; err = -ENOENT; - fb = fuse_backing_lookup(fc, backing_id); + fb = fuse_backing_lookup(fc, &fuse_passthrough_backing_ops, backing_id); if (!fb) goto out; @@ -195,3 +195,39 @@ void fuse_passthrough_release(struct fuse_file *ff, struct fuse_backing *fb) put_cred(ff->cred); ff->cred = NULL; } + +static int fuse_passthrough_may_admin(struct fuse_conn *fc, unsigned int flags) +{ + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */ + if (!fc->passthrough || !capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (flags) + return -EINVAL; + + return 0; +} + +static int fuse_passthrough_may_open(struct fuse_conn *fc, struct file *file) +{ + struct super_block *backing_sb; + int res; + + /* read/write/splice/mmap passthrough only relevant for regular files */ + res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL; + if (!d_is_reg(file->f_path.dentry)) + return res; + + backing_sb = file_inode(file)->i_sb; + if (backing_sb->s_stack_depth >= fc->max_stack_depth) + return -ELOOP; + + return 0; +} + +const struct fuse_backing_ops fuse_passthrough_backing_ops = { + .type = FUSE_BACKING_TYPE_PASSTHROUGH, + .id_start = 1, + .may_admin = fuse_passthrough_may_admin, + .may_open = fuse_passthrough_may_open, +}; ^ permalink raw reply related [flat|nested] 191+ messages in thread
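The dispatch scheme introduced by the patch above (type in the low byte of fuse_backing_map::flags, per-type ops, and a lookup that only returns objects owned by the asking subsystem) can be sketched in userspace as follows. This is a simplified model under stated assumptions: all toy_* names are hypothetical, the callbacks and ID ranges of struct fuse_backing_ops are omitted, and toy_other_ops stands in for a second backing type that this patch does not define. The two FUSE_BACKING_* values are copied from the uapi hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FUSE_BACKING_TYPE_MASK		(0xFF)
#define FUSE_BACKING_TYPE_PASSTHROUGH	(0)

/* Toy version of struct fuse_backing_ops; callbacks and id range omitted. */
struct toy_backing_ops {
	unsigned int type;
};

/* Toy version of struct fuse_backing; only the ops pointer matters here. */
struct toy_backing {
	const struct toy_backing_ops *ops;
};

const struct toy_backing_ops toy_passthrough_ops = {
	.type = FUSE_BACKING_TYPE_PASSTHROUGH,
};

/* Hypothetical second user of the backing table, for the lookup demo. */
const struct toy_backing_ops toy_other_ops = {
	.type = 1,
};

/* The low byte of the flags word selects the backing type. */
const struct toy_backing_ops *toy_ops_from_flags(uint32_t flags)
{
	switch (flags & FUSE_BACKING_TYPE_MASK) {
	case FUSE_BACKING_TYPE_PASSTHROUGH:
		return &toy_passthrough_ops;
	default:
		/* fuse_backing_open() maps a NULL result to -EOPNOTSUPP */
		return NULL;
	}
}

/* The remaining bits are per-type flags, handed to ->may_admin. */
uint32_t toy_op_flags(uint32_t flags)
{
	return flags & ~(uint32_t)FUSE_BACKING_TYPE_MASK;
}

/* A lookup succeeds only for the subsystem that registered the object. */
const struct toy_backing *toy_lookup(const struct toy_backing *fb,
				     const struct toy_backing_ops *ops)
{
	if (fb && fb->ops != ops)
		return NULL;
	return fb;
}
```

In the real patch the passthrough may_admin callback rejects any nonzero per-type flags with -EINVAL, so a flags word such as 0x100 would select the passthrough ops and then fail the admin check. The model only shows the split between the type byte and the remaining bits.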
* [PATCH 2/2] fuse_trace: move the passthrough-specific code back to passthrough.c
2026-04-29 14:17 ` [PATCHSET v8 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
2026-04-29 14:23 ` [PATCH 1/2] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
@ 2026-04-29 14:23 ` Darrick J. Wong
1 sibling, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:23 UTC (permalink / raw)
To: djwong, miklos
Cc: joannelkoong, joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Add tracepoints for the previous patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/fuse_trace.h |   35 +++++++++++++++++++++++++++++++++++
 fs/fuse/backing.c    |    5 +++++
 2 files changed, 40 insertions(+)

diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index bbe9ddd8c71696..286a0845dc0898 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -124,6 +124,41 @@ TRACE_EVENT(fuse_request_end,
 		  __entry->unique, __entry->len, __entry->error)
 );
 
+#ifdef CONFIG_FUSE_BACKING
+TRACE_EVENT(fuse_backing_class,
+	TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
+		 const struct fuse_backing *fb),
+
+	TP_ARGS(fc, idx, fb),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(unsigned int,		idx)
+		__field(unsigned long,		ino)
+	),
+
+	TP_fast_assign(
+		struct inode *inode = file_inode(fb->file);
+
+		__entry->connection	=	fc->dev;
+		__entry->idx		=	idx;
+		__entry->ino		=	inode->i_ino;
+	),
+
+	TP_printk("connection %u idx %u ino 0x%lx",
+		  __entry->connection,
+		  __entry->idx,
+		  __entry->ino)
+);
+#define DEFINE_FUSE_BACKING_EVENT(name)		\
+DEFINE_EVENT(fuse_backing_class, name,		\
+	TP_PROTO(const struct fuse_conn *fc, unsigned int idx, \
+		 const struct fuse_backing *fb), \
+	TP_ARGS(fc, idx, fb))
+DEFINE_FUSE_BACKING_EVENT(fuse_backing_open);
+DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
+#endif /* CONFIG_FUSE_BACKING */
+
 #endif /* _TRACE_FUSE_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
index adb4d2ebb21379..d7e074c30f46cc 100644
--- a/fs/fuse/backing.c
+++ b/fs/fuse/backing.c
@@ -72,6 +72,7 @@ static int fuse_backing_id_free(int id, void *p, void *data)
 
 	WARN_ON_ONCE(refcount_read(&fb->count) != 1);
 
+	trace_fuse_backing_close((struct fuse_conn *)data, id, fb);
 	fuse_backing_free(fb);
 	return 0;
 }
@@ -145,6 +146,8 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
 		fb = NULL;
 		goto out;
 	}
+
+	trace_fuse_backing_open(fc, res, fb);
 out:
 	pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
 
@@ -194,6 +197,8 @@ int fuse_backing_close(struct fuse_conn *fc, int backing_id)
 	if (err)
 		goto out_fb;
 
+	trace_fuse_backing_close(fc, backing_id, fb);
+
 	err = -ENOENT;
 	test_fb = fuse_backing_id_remove(fc, backing_id);
 	if (!test_fb)

^ permalink raw reply related	[flat|nested] 191+ messages in thread
* [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (2 preceding siblings ...)
2026-04-29 14:17 ` [PATCHSET v8 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
@ 2026-04-29 14:17 ` Darrick J. Wong
2026-04-29 14:23 ` [PATCH 01/33] fuse: implement the basic iomap mechanisms Darrick J. Wong
` (32 more replies)
2026-04-29 14:17 ` [PATCHSET v8 5/8] fuse: allow servers to specify root node id Darrick J. Wong
` (15 subsequent siblings)
19 siblings, 33 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:17 UTC (permalink / raw)
To: djwong, miklos
Cc: joannelkoong, joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

Hi all,

This series connects fuse (the userspace filesystem layer) to fs-iomap
to get fuse servers out of the business of handling file I/O themselves.
By keeping the IO path mostly within the kernel, we can dramatically
improve the speed of disk-based filesystems.  This enables us to move
all the filesystem metadata parsing code out of the kernel and into
userspace, which means that we can containerize them for security
without losing a lot of performance.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
---
Commits in this patchset:
 * fuse: implement the basic iomap mechanisms
 * fuse_trace: implement the basic iomap mechanisms
 * fuse: make debugging configurable at runtime
 * fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
 * fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
 * fuse: enable SYNCFS and ensure we flush everything before sending DESTROY
 * fuse: clean up per-file type inode initialization
 * fuse: create a per-inode flag for setting exclusive mode.
 * fuse: create a per-inode flag for toggling iomap
 * fuse_trace: create a per-inode flag for toggling iomap
 * fuse: isolate the other regular file IO paths from iomap
 * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
 * fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
 * fuse: implement direct IO with iomap
 * fuse_trace: implement direct IO with iomap
 * fuse: implement buffered IO with iomap
 * fuse_trace: implement buffered IO with iomap
 * fuse: use an unrestricted backing device with iomap pagecache io
 * fuse: implement large folios for iomap pagecache files
 * fuse: advertise support for iomap
 * fuse: query filesystem geometry when using iomap
 * fuse_trace: query filesystem geometry when using iomap
 * fuse: implement fadvise for iomap files
 * fuse: invalidate ranges of block devices being used for iomap
 * fuse_trace: invalidate ranges of block devices being used for iomap
 * fuse: implement inline data file IO via iomap
 * fuse_trace: implement inline data file IO via iomap
 * fuse: allow more statx fields
 * fuse: support atomic writes with iomap
 * fuse_trace: support atomic writes with iomap
 * fuse: disable direct fs reclaim for any fuse server that uses iomap
 * fuse: enable swapfile activation on iomap
 * fuse: implement freeze and shutdowns for iomap filesystems
---
 fs/fuse/fuse_i.h          |   73 +
 fs/fuse/fuse_iomap.h      |  110 ++
 fs/fuse/fuse_iomap_i.h    |   45 +
 fs/fuse/fuse_trace.h      |  974 ++++++++++++++++++
 include/uapi/linux/fuse.h |  235 ++++
 fs/fuse/Kconfig           |   48 +
 fs/fuse/Makefile          |    1
 fs/fuse/backing.c         |   47 +
 fs/fuse/dev.c             |   42 +
 fs/fuse/dir.c             |  141 ++-
 fs/fuse/file.c            |  122 ++
 fs/fuse/fuse_iomap.c      | 2380 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c           |  180 +++
 fs/fuse/iomode.c          |    1
 fs/fuse/trace.c           |    3
 15 files changed, 4331 insertions(+), 71 deletions(-)
 create mode 100644 fs/fuse/fuse_iomap.h
 create mode 100644 fs/fuse/fuse_iomap_i.h
 create mode 100644 fs/fuse/fuse_iomap.c

^ permalink raw reply	[flat|nested] 191+ messages in thread
* [PATCH 01/33] fuse: implement the basic iomap mechanisms 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong @ 2026-04-29 14:23 ` Darrick J. Wong 2026-04-29 14:24 ` [PATCH 02/33] fuse_trace: " Darrick J. Wong ` (31 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:23 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Implement functions to enable upcalling of iomap_begin and iomap_end to userspace fuse servers. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 5 - fs/fuse/fuse_iomap.h | 26 +++ fs/fuse/fuse_iomap_i.h | 28 +++ include/uapi/linux/fuse.h | 91 +++++++++- fs/fuse/Kconfig | 32 +++ fs/fuse/Makefile | 1 fs/fuse/fuse_iomap.c | 430 +++++++++++++++++++++++++++++++++++++++++++++ fs/fuse/inode.c | 9 + 8 files changed, 620 insertions(+), 2 deletions(-) create mode 100644 fs/fuse/fuse_iomap.h create mode 100644 fs/fuse/fuse_iomap_i.h create mode 100644 fs/fuse/fuse_iomap.c diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 0666e03723071b..7accde465d03a7 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -937,6 +937,9 @@ struct fuse_conn { /* Is synchronous FUSE_INIT allowed? 
*/ unsigned int sync_init:1; + /* Enable fs/iomap for file operations */ + unsigned int iomap:1; + /* Use io_uring for communication */ unsigned int io_uring; @@ -1058,7 +1061,7 @@ static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb) return get_fuse_mount_super(sb)->fc; } -static inline struct fuse_mount *get_fuse_mount(struct inode *inode) +static inline struct fuse_mount *get_fuse_mount(const struct inode *inode) { return get_fuse_mount_super(inode->i_sb); } diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h new file mode 100644 index 00000000000000..6c71318365ca82 --- /dev/null +++ b/fs/fuse/fuse_iomap.h @@ -0,0 +1,26 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2025-2026 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef _FS_FUSE_IOMAP_H +#define _FS_FUSE_IOMAP_H + +#if IS_ENABLED(CONFIG_FUSE_IOMAP) +enum fuse_iomap_iodir { + READ_MAPPING, + WRITE_MAPPING, +}; + +bool fuse_iomap_enabled(void); + +static inline bool fuse_has_iomap(const struct inode *inode) +{ + return get_fuse_conn(inode)->iomap; +} +#else +# define fuse_iomap_enabled(...) (false) +# define fuse_has_iomap(...) (false) +#endif /* CONFIG_FUSE_IOMAP */ + +#endif /* _FS_FUSE_IOMAP_H */ diff --git a/fs/fuse/fuse_iomap_i.h b/fs/fuse/fuse_iomap_i.h new file mode 100644 index 00000000000000..2897049637fad2 --- /dev/null +++ b/fs/fuse/fuse_iomap_i.h @@ -0,0 +1,28 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2025-2026 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#ifndef _FS_FUSE_IOMAP_I_H +#define _FS_FUSE_IOMAP_I_H + +#if IS_ENABLED(CONFIG_FUSE_IOMAP) +#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG) +# define ASSERT(condition) do { \ + int __cond = !!(condition); \ + WARN(!__cond, "Assertion failed: %s, func: %s, line: %d", #condition, __func__, __LINE__); \ +} while (0) +# define BAD_DATA(condition) ({ \ + int __cond = !!(condition); \ + WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \ +}) +#else +# define ASSERT(condition) +# define BAD_DATA(condition) ({ \ + int __cond = !!(condition); \ + unlikely(__cond); \ +}) +#endif /* CONFIG_FUSE_IOMAP_DEBUG */ +#endif /* CONFIG_FUSE_IOMAP */ + +#endif /* _FS_FUSE_IOMAP_I_H */ diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 18713cfaf09171..5a58011f66f501 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -240,6 +240,10 @@ * - add FUSE_COPY_FILE_RANGE_64 * - add struct fuse_copy_file_range_out * - add FUSE_NOTIFY_PRUNE + * + * 7.99 + * - XXX magic minor revision to make experimental code really obvious + * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations */ #ifndef _LINUX_FUSE_H @@ -275,7 +279,7 @@ #define FUSE_KERNEL_VERSION 7 /** Minor version number of this interface */ -#define FUSE_KERNEL_MINOR_VERSION 45 +#define FUSE_KERNEL_MINOR_VERSION 99 /** The node ID of the root inode */ #define FUSE_ROOT_ID 1 @@ -448,6 +452,7 @@ struct fuse_file_lock { * FUSE_OVER_IO_URING: Indicate that client supports io-uring * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests. * init_out.request_timeout contains the timeout (in secs) + * FUSE_IOMAP: Client supports iomap for regular file operations. 
*/ #define FUSE_ASYNC_READ (1 << 0) #define FUSE_POSIX_LOCKS (1 << 1) @@ -495,6 +500,7 @@ struct fuse_file_lock { #define FUSE_ALLOW_IDMAP (1ULL << 40) #define FUSE_OVER_IO_URING (1ULL << 41) #define FUSE_REQUEST_TIMEOUT (1ULL << 42) +#define FUSE_IOMAP (1ULL << 43) /** * CUSE INIT request/reply flags @@ -664,6 +670,9 @@ enum fuse_opcode { FUSE_STATX = 52, FUSE_COPY_FILE_RANGE_64 = 53, + FUSE_IOMAP_BEGIN = 4094, + FUSE_IOMAP_END = 4095, + /* CUSE specific operations */ CUSE_INIT = 4096, @@ -1314,4 +1323,84 @@ struct fuse_uring_cmd_req { uint8_t padding[6]; }; +/* mapping types; see corresponding IOMAP_TYPE_ */ +#define FUSE_IOMAP_TYPE_HOLE (0) +#define FUSE_IOMAP_TYPE_DELALLOC (1) +#define FUSE_IOMAP_TYPE_MAPPED (2) +#define FUSE_IOMAP_TYPE_UNWRITTEN (3) +#define FUSE_IOMAP_TYPE_INLINE (4) + +/* fuse-specific mapping type indicating that writes use the read mapping */ +#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255) + +#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */ + +/* mapping flags passed back from iomap_begin; see corresponding IOMAP_F_ */ +#define FUSE_IOMAP_F_NEW (1U << 0) +#define FUSE_IOMAP_F_DIRTY (1U << 1) +#define FUSE_IOMAP_F_SHARED (1U << 2) +#define FUSE_IOMAP_F_MERGED (1U << 3) +#define FUSE_IOMAP_F_BOUNDARY (1U << 4) +#define FUSE_IOMAP_F_ANON_WRITE (1U << 5) +#define FUSE_IOMAP_F_ATOMIC_BIO (1U << 6) + +/* fuse-specific mapping flag asking for ->iomap_end call */ +#define FUSE_IOMAP_F_WANT_IOMAP_END (1U << 7) + +/* mapping flags passed to iomap_end */ +#define FUSE_IOMAP_F_SIZE_CHANGED (1U << 8) +#define FUSE_IOMAP_F_STALE (1U << 9) + +/* operation flags from iomap; see corresponding IOMAP_* */ +#define FUSE_IOMAP_OP_WRITE (1U << 0) +#define FUSE_IOMAP_OP_ZERO (1U << 1) +#define FUSE_IOMAP_OP_REPORT (1U << 2) +#define FUSE_IOMAP_OP_FAULT (1U << 3) +#define FUSE_IOMAP_OP_DIRECT (1U << 4) +#define FUSE_IOMAP_OP_NOWAIT (1U << 5) +#define FUSE_IOMAP_OP_OVERWRITE_ONLY (1U << 6) +#define FUSE_IOMAP_OP_UNSHARE (1U << 7) +#define FUSE_IOMAP_OP_DAX 
(1U << 8) +#define FUSE_IOMAP_OP_ATOMIC (1U << 9) +#define FUSE_IOMAP_OP_DONTCACHE (1U << 10) + +#define FUSE_IOMAP_NULL_ADDR (-1ULL) /* addr is not valid */ + +struct fuse_iomap_io { + uint64_t offset; /* file offset of mapping, bytes */ + uint64_t length; /* length of mapping, bytes */ + uint64_t addr; /* disk offset of mapping, bytes */ + uint16_t type; /* FUSE_IOMAP_TYPE_* */ + uint16_t flags; /* FUSE_IOMAP_F_* */ + uint32_t dev; /* device cookie */ +}; + +struct fuse_iomap_begin_in { + uint32_t opflags; /* FUSE_IOMAP_OP_* */ + uint32_t reserved; /* zero */ + uint64_t attr_ino; /* matches fuse_attr:ino */ + uint64_t pos; /* file position, in bytes */ + uint64_t count; /* operation length, in bytes */ +}; + +struct fuse_iomap_begin_out { + /* read file data from here */ + struct fuse_iomap_io read; + + /* write file data to here, if applicable */ + struct fuse_iomap_io write; +}; + +struct fuse_iomap_end_in { + uint32_t opflags; /* FUSE_IOMAP_OP_* */ + uint32_t reserved; /* zero */ + uint64_t attr_ino; /* matches fuse_attr:ino */ + uint64_t pos; /* file position, in bytes */ + uint64_t count; /* operation length, in bytes */ + int64_t written; /* bytes processed */ + + /* mapping that the kernel acted upon */ + struct fuse_iomap_io map; +}; + #endif /* _LINUX_FUSE_H */ diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig index 290d1c09e0b924..934d48076a010c 100644 --- a/fs/fuse/Kconfig +++ b/fs/fuse/Kconfig @@ -69,6 +69,38 @@ config FUSE_PASSTHROUGH config FUSE_BACKING bool +config FUSE_IOMAP + bool "FUSE file IO over iomap" + default y + depends on FUSE_FS + depends on BLOCK + select FS_IOMAP + help + Enable fuse servers to operate the regular file I/O path through + the fs-iomap library in the kernel. This enables higher performance + userspace filesystems by keeping the performance critical parts in + the kernel while delegating the difficult metadata parsing parts to + an easily-contained userspace program. + + This feature is considered EXPERIMENTAL. 
Use with caution! + + If unsure, say N. + +config FUSE_IOMAP_BY_DEFAULT + bool "FUSE file I/O over iomap by default" + default n + depends on FUSE_IOMAP + help + Enable sending FUSE file I/O over iomap by default. + +config FUSE_IOMAP_DEBUG + bool "Debug FUSE file IO over iomap" + default y + depends on FUSE_IOMAP + help + Enable debugging assertions for the fuse iomap code paths and logging + of bad iomap file mapping data being sent to the kernel. + config FUSE_IO_URING bool "FUSE communication over io-uring" default y diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile index 46041228e5be2c..2536bc6a71b898 100644 --- a/fs/fuse/Makefile +++ b/fs/fuse/Makefile @@ -18,5 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o fuse-$(CONFIG_FUSE_BACKING) += backing.o fuse-$(CONFIG_SYSCTL) += sysctl.o fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o +fuse-$(CONFIG_FUSE_IOMAP) += fuse_iomap.o virtiofs-y := virtio_fs.o diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c new file mode 100644 index 00000000000000..8785f86941a1d2 --- /dev/null +++ b/fs/fuse/fuse_iomap.c @@ -0,0 +1,430 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2025-2026 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include <linux/iomap.h> +#include "fuse_i.h" +#include "fuse_trace.h" +#include "fuse_iomap.h" +#include "fuse_iomap_i.h" + +static bool __read_mostly enable_iomap = +#if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT) + true; +#else + false; +#endif +module_param(enable_iomap, bool, 0644); +MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap"); + +bool fuse_iomap_enabled(void) +{ + /* Don't let anyone touch iomap until the end of the patchset. */ + return false; + + /* + * There are fears that a fuse+iomap server could somehow DoS the + * system by doing things like going out to lunch during a writeback + * related iomap request. 
Only allow iomap access if the fuse server + * has rawio capabilities since those processes can mess things up + * quite well even without our help. + */ + return enable_iomap && has_capability_noaudit(current, CAP_SYS_RAWIO); +} + +/* Convert IOMAP_* mapping types to FUSE_IOMAP_TYPE_* */ +#define XMAP(word) \ + case IOMAP_##word: \ + return FUSE_IOMAP_TYPE_##word +static inline uint16_t fuse_iomap_type_to_server(uint16_t iomap_type) +{ + switch (iomap_type) { + XMAP(HOLE); + XMAP(DELALLOC); + XMAP(MAPPED); + XMAP(UNWRITTEN); + XMAP(INLINE); + default: + ASSERT(0); + } + return 0; +} +#undef XMAP + +/* Convert FUSE_IOMAP_TYPE_* to IOMAP_* mapping types */ +#define XMAP(word) \ + case FUSE_IOMAP_TYPE_##word: \ + return IOMAP_##word +static inline uint16_t fuse_iomap_type_from_server(uint16_t fuse_type) +{ + switch (fuse_type) { + XMAP(HOLE); + XMAP(DELALLOC); + XMAP(MAPPED); + XMAP(UNWRITTEN); + XMAP(INLINE); + default: + ASSERT(0); + } + return 0; +} +#undef XMAP + +/* Validate FUSE_IOMAP_TYPE_* */ +static inline bool fuse_iomap_check_type(uint16_t fuse_type) +{ + switch (fuse_type) { + case FUSE_IOMAP_TYPE_HOLE: + case FUSE_IOMAP_TYPE_DELALLOC: + case FUSE_IOMAP_TYPE_MAPPED: + case FUSE_IOMAP_TYPE_UNWRITTEN: + case FUSE_IOMAP_TYPE_INLINE: + case FUSE_IOMAP_TYPE_PURE_OVERWRITE: + return true; + } + + return false; +} + +#define FUSE_IOMAP_F_ALL (FUSE_IOMAP_F_NEW | \ + FUSE_IOMAP_F_DIRTY | \ + FUSE_IOMAP_F_SHARED | \ + FUSE_IOMAP_F_MERGED | \ + FUSE_IOMAP_F_BOUNDARY | \ + FUSE_IOMAP_F_ANON_WRITE | \ + FUSE_IOMAP_F_ATOMIC_BIO | \ + FUSE_IOMAP_F_WANT_IOMAP_END) + +static inline bool fuse_iomap_check_flags(uint16_t flags) +{ + return (flags & ~FUSE_IOMAP_F_ALL) == 0; +} + +/* Convert IOMAP_F_* mapping state flags to FUSE_IOMAP_F_* */ +#define XMAP(word) \ + if (iomap_f_flags & IOMAP_F_##word) \ + ret |= FUSE_IOMAP_F_##word +#define XMAP2(iword, oword) \ + if (iomap_f_flags & IOMAP_F_##iword) \ + ret |= FUSE_IOMAP_F_##oword +static inline uint16_t 
fuse_iomap_flags_to_server(uint16_t iomap_f_flags) +{ + uint16_t ret = 0; + + XMAP(NEW); + XMAP(DIRTY); + XMAP(SHARED); + XMAP(MERGED); + XMAP(BOUNDARY); + XMAP(ANON_WRITE); + XMAP(ATOMIC_BIO); + XMAP2(PRIVATE, WANT_IOMAP_END); + + XMAP(SIZE_CHANGED); + XMAP(STALE); + + return ret; +} +#undef XMAP2 +#undef XMAP + +/* Convert FUSE_IOMAP_F_* to IOMAP_F_* mapping state flags */ +#define XMAP(word) \ + if (fuse_f_flags & FUSE_IOMAP_F_##word) \ + ret |= IOMAP_F_##word +#define XMAP2(iword, oword) \ + if (fuse_f_flags & FUSE_IOMAP_F_##iword) \ + ret |= IOMAP_F_##oword +static inline uint16_t fuse_iomap_flags_from_server(uint16_t fuse_f_flags) +{ + uint16_t ret = 0; + + XMAP(NEW); + XMAP(DIRTY); + XMAP(SHARED); + XMAP(MERGED); + XMAP(BOUNDARY); + XMAP(ANON_WRITE); + XMAP(ATOMIC_BIO); + XMAP2(WANT_IOMAP_END, PRIVATE); + + return ret; +} +#undef XMAP2 +#undef XMAP + +/* Convert IOMAP_* operation flags to FUSE_IOMAP_OP_* */ +#define XMAP(word) \ + if (iomap_op_flags & IOMAP_##word) \ + ret |= FUSE_IOMAP_OP_##word +static inline uint32_t fuse_iomap_op_to_server(unsigned iomap_op_flags) +{ + uint32_t ret = 0; + + XMAP(WRITE); + XMAP(ZERO); + XMAP(REPORT); + XMAP(FAULT); + XMAP(DIRECT); + XMAP(NOWAIT); + XMAP(OVERWRITE_ONLY); + XMAP(UNSHARE); + XMAP(DAX); + XMAP(ATOMIC); + XMAP(DONTCACHE); + + return ret; +} +#undef XMAP + +/* Validate an iomap mapping. 
*/ +static inline bool fuse_iomap_check_mapping(const struct inode *inode, + const struct fuse_iomap_io *map, + enum fuse_iomap_iodir iodir) +{ + const unsigned int blocksize = i_blocksize(inode); + uint64_t end; + + /* Type and flags must be known */ + if (BAD_DATA(!fuse_iomap_check_type(map->type))) + return false; + if (BAD_DATA(!fuse_iomap_check_flags(map->flags))) + return false; + + /* No zero-length mappings */ + if (BAD_DATA(map->length == 0)) + return false; + + /* File range must be aligned to blocksize */ + if (BAD_DATA(!IS_ALIGNED(map->offset, blocksize))) + return false; + if (BAD_DATA(!IS_ALIGNED(map->length, blocksize))) + return false; + + /* No overflows in the file range */ + if (BAD_DATA(check_add_overflow(map->offset, map->length, &end))) + return false; + + /* File range cannot start past maxbytes */ + if (BAD_DATA(map->offset >= inode->i_sb->s_maxbytes)) + return false; + + switch (map->type) { + case FUSE_IOMAP_TYPE_MAPPED: + case FUSE_IOMAP_TYPE_UNWRITTEN: + /* Mappings backed by space must have a device/addr */ + if (BAD_DATA(map->dev == FUSE_IOMAP_DEV_NULL)) + return false; + if (BAD_DATA(map->addr == FUSE_IOMAP_NULL_ADDR)) + return false; + break; + case FUSE_IOMAP_TYPE_DELALLOC: + case FUSE_IOMAP_TYPE_HOLE: + case FUSE_IOMAP_TYPE_INLINE: + /* Mappings not backed by space cannot have a device addr. 
*/ + if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL)) + return false; + if (BAD_DATA(map->addr != FUSE_IOMAP_NULL_ADDR)) + return false; + break; + case FUSE_IOMAP_TYPE_PURE_OVERWRITE: + /* "Pure overwrite" only allowed for write mapping */ + if (BAD_DATA(iodir != WRITE_MAPPING)) + return false; + break; + default: + /* should have been caught already */ + ASSERT(0); + return false; + } + + /* XXX: we don't support devices yet */ + if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL)) + return false; + + /* No overflows in the device range, if supplied */ + if (map->addr != FUSE_IOMAP_NULL_ADDR && + BAD_DATA(check_add_overflow(map->addr, map->length, &end))) + return false; + + return true; +} + +/* Convert a mapping from the server into something the kernel can use */ +static inline void fuse_iomap_from_server(struct iomap *iomap, + const struct fuse_iomap_io *fmap) +{ + iomap->addr = fmap->addr; + iomap->offset = fmap->offset; + iomap->length = fmap->length; + iomap->type = fuse_iomap_type_from_server(fmap->type); + iomap->flags = fuse_iomap_flags_from_server(fmap->flags); + iomap->bdev = NULL; /* XXX */ +} + +/* Convert a mapping from the kernel into something the server can use */ +static inline void fuse_iomap_to_server(struct fuse_iomap_io *fmap, + const struct iomap *iomap) +{ + fmap->addr = iomap->addr; + fmap->offset = iomap->offset; + fmap->length = iomap->length; + fmap->type = fuse_iomap_type_to_server(iomap->type); + fmap->flags = fuse_iomap_flags_to_server(iomap->flags); + fmap->dev = FUSE_IOMAP_DEV_NULL; /* XXX */ +} + +/* Check the incoming _begin mappings to make sure they're not nonsense. 
*/ +static inline int +fuse_iomap_begin_validate(const struct inode *inode, + unsigned opflags, loff_t pos, + const struct fuse_iomap_begin_out *outarg) +{ + /* Make sure the mappings aren't garbage */ + if (!fuse_iomap_check_mapping(inode, &outarg->read, READ_MAPPING)) + return -EFSCORRUPTED; + + if (!fuse_iomap_check_mapping(inode, &outarg->write, WRITE_MAPPING)) + return -EFSCORRUPTED; + + /* + * Must have returned a mapping for at least the first byte in the + * range. The main mapping check already validated that the length + * is nonzero and there is no overflow in computing end. + */ + if (BAD_DATA(outarg->read.offset > pos)) + return -EFSCORRUPTED; + if (BAD_DATA(outarg->write.offset > pos)) + return -EFSCORRUPTED; + + if (BAD_DATA(outarg->read.offset + outarg->read.length <= pos)) + return -EFSCORRUPTED; + if (BAD_DATA(outarg->write.offset + outarg->write.length <= pos)) + return -EFSCORRUPTED; + + return 0; +} + +static inline bool fuse_is_iomap_file_write(unsigned int opflags) +{ + return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE); +} + +static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, + unsigned opflags, struct iomap *iomap, + struct iomap *srcmap) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_iomap_begin_in inarg = { + .attr_ino = fi->orig_ino, + .opflags = fuse_iomap_op_to_server(opflags), + .pos = pos, + .count = count, + }; + struct fuse_iomap_begin_out outarg = { }; + struct fuse_mount *fm = get_fuse_mount(inode); + FUSE_ARGS(args); + int err; + + args.opcode = FUSE_IOMAP_BEGIN; + args.nodeid = get_node_id(inode); + args.in_numargs = 1; + args.in_args[0].size = sizeof(inarg); + args.in_args[0].value = &inarg; + args.out_numargs = 1; + args.out_args[0].size = sizeof(outarg); + args.out_args[0].value = &outarg; + err = fuse_simple_request(fm, &args); + if (err) + return err; + + err = fuse_iomap_begin_validate(inode, opflags, pos, &outarg); + if (err) + return err; + + if 
(fuse_is_iomap_file_write(opflags) && + outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) { + /* + * For an out of place write, we must supply the write mapping + * via @iomap, and the read mapping via @srcmap. + */ + fuse_iomap_from_server(iomap, &outarg.write); + fuse_iomap_from_server(srcmap, &outarg.read); + } else { + /* + * For everything else (reads, reporting, and pure overwrites), + * we can return the sole mapping through @iomap and leave + * @srcmap unchanged from its default (HOLE). + */ + fuse_iomap_from_server(iomap, &outarg.read); + } + + return 0; +} + +/* Decide if we send FUSE_IOMAP_END to the fuse server */ +static bool fuse_should_send_iomap_end(const struct iomap *iomap, + unsigned int opflags, loff_t count, + ssize_t written) +{ + /* fuse server demanded an iomap_end call. */ + if (iomap->flags & FUSE_IOMAP_F_WANT_IOMAP_END) + return true; + + /* Reads and reporting should never affect the filesystem metadata */ + if (!fuse_is_iomap_file_write(opflags)) + return false; + + /* Appending writes get an iomap_end call */ + if (iomap->flags & IOMAP_F_SIZE_CHANGED) + return true; + + /* Short writes get an iomap_end call to clean up delalloc */ + return written < count; +} + +static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count, + ssize_t written, unsigned opflags, + struct iomap *iomap) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_mount *fm = get_fuse_mount(inode); + int err; + + if (fuse_should_send_iomap_end(iomap, opflags, count, written)) { + struct fuse_iomap_end_in inarg = { + .opflags = fuse_iomap_op_to_server(opflags), + .attr_ino = fi->orig_ino, + .pos = pos, + .count = count, + .written = written, + }; + FUSE_ARGS(args); + + fuse_iomap_to_server(&inarg.map, iomap); + + args.opcode = FUSE_IOMAP_END; + args.nodeid = get_node_id(inode); + args.in_numargs = 1; + args.in_args[0].size = sizeof(inarg); + args.in_args[0].value = &inarg; + err = fuse_simple_request(fm, &args); + if (err == -ENOSYS) { + /* 
+ * libfuse returns ENOSYS for servers that don't + * implement iomap_end + */ + err = 0; + } + if (err) + return err; + } + + return 0; +} + +const struct iomap_ops fuse_iomap_ops = { + .iomap_begin = fuse_iomap_begin, + .iomap_end = fuse_iomap_end, +}; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index c04d9ae42bb008..15a8ce8afa4241 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -9,6 +9,7 @@ #include "fuse_i.h" #include "fuse_dev_i.h" #include "dev_uring_i.h" +#include "fuse_iomap.h" #include <linux/dax.h> #include <linux/pagemap.h> @@ -1489,6 +1490,12 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, if (flags & FUSE_REQUEST_TIMEOUT) timeout = arg->request_timeout; + + if ((flags & FUSE_IOMAP) && fuse_iomap_enabled()) { + fc->iomap = 1; + pr_warn( + "EXPERIMENTAL iomap feature enabled. Use at your own risk!"); + } } else { ra_pages = fc->max_read / PAGE_SIZE; fc->no_lock = 1; @@ -1557,6 +1564,8 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm) */ if (fuse_uring_enabled()) flags |= FUSE_OVER_IO_URING; + if (fuse_iomap_enabled()) + flags |= FUSE_IOMAP; ia->in.flags = flags; ia->in.flags2 = flags >> 32; ^ permalink raw reply related [flat|nested] 191+ messages in thread
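For fuse server authors reading along, the range and backing-space rules that fuse_iomap_check_mapping() enforces in the patch above can be approximated in userspace before a mapping is sent to the kernel. The following is only a sketch under stated assumptions, not code from this series: the struct and constants are copied from the uapi additions above, while sketch_range_ok() and sketch_space_ok() are hypothetical names. It assumes a power-of-two block size, and it leaves out the flag check, the s_maxbytes check, the read/write direction check for PURE_OVERWRITE, and the temporary "no devices yet" restriction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the include/uapi/linux/fuse.h hunk in the patch above. */
#define FUSE_IOMAP_TYPE_HOLE		(0)
#define FUSE_IOMAP_TYPE_DELALLOC	(1)
#define FUSE_IOMAP_TYPE_MAPPED		(2)
#define FUSE_IOMAP_TYPE_UNWRITTEN	(3)
#define FUSE_IOMAP_TYPE_INLINE		(4)
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(255)
#define FUSE_IOMAP_DEV_NULL		(0U)
#define FUSE_IOMAP_NULL_ADDR		(-1ULL)

struct fuse_iomap_io {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	uint64_t addr;		/* disk offset of mapping, bytes */
	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
	uint16_t flags;		/* FUSE_IOMAP_F_* */
	uint32_t dev;		/* device cookie */
};

/*
 * Hypothetical helper: nonzero length, block-aligned file range, and no
 * overflow in either the file range or the device range.  The blocksize
 * argument is assumed to be a power of two.
 */
static bool sketch_range_ok(const struct fuse_iomap_io *map,
			    uint64_t blocksize)
{
	uint64_t end;

	if (map->length == 0)
		return false;
	if ((map->offset | map->length) & (blocksize - 1))
		return false;
	if (__builtin_add_overflow(map->offset, map->length, &end))
		return false;
	if (map->addr != FUSE_IOMAP_NULL_ADDR &&
	    __builtin_add_overflow(map->addr, map->length, &end))
		return false;
	return true;
}

/*
 * Hypothetical helper: mappings backed by space need a device and an
 * address; mappings not backed by space must have neither.
 */
static bool sketch_space_ok(const struct fuse_iomap_io *map)
{
	switch (map->type) {
	case FUSE_IOMAP_TYPE_MAPPED:
	case FUSE_IOMAP_TYPE_UNWRITTEN:
		return map->dev != FUSE_IOMAP_DEV_NULL &&
		       map->addr != FUSE_IOMAP_NULL_ADDR;
	case FUSE_IOMAP_TYPE_HOLE:
	case FUSE_IOMAP_TYPE_DELALLOC:
	case FUSE_IOMAP_TYPE_INLINE:
		return map->dev == FUSE_IOMAP_DEV_NULL &&
		       map->addr == FUSE_IOMAP_NULL_ADDR;
	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
		/* only valid as a write mapping; not checked in this sketch */
		return true;
	}
	return false;
}
```

A server could run both helpers on the read and write mappings before replying to FUSE_IOMAP_BEGIN. The kernel repeats its own checks regardless and fails the operation with -EFSCORRUPTED when they do not pass.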
* [PATCH 02/33] fuse_trace: implement the basic iomap mechanisms 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong 2026-04-29 14:23 ` [PATCH 01/33] fuse: implement the basic iomap mechanisms Darrick J. Wong @ 2026-04-29 14:24 ` Darrick J. Wong 2026-04-29 14:24 ` [PATCH 03/33] fuse: make debugging configurable at runtime Darrick J. Wong ` (30 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:24 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Add tracepoints for the previous patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_iomap_i.h | 6 + fs/fuse/fuse_trace.h | 295 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/fuse/fuse_iomap.c | 15 ++ 3 files changed, 314 insertions(+), 2 deletions(-) diff --git a/fs/fuse/fuse_iomap_i.h b/fs/fuse/fuse_iomap_i.h index 2897049637fad2..b9ab8ce140e8e1 100644 --- a/fs/fuse/fuse_iomap_i.h +++ b/fs/fuse/fuse_iomap_i.h @@ -10,16 +10,22 @@ #if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG) # define ASSERT(condition) do { \ int __cond = !!(condition); \ + if (unlikely(!__cond)) \ + trace_fuse_iomap_assert(__func__, __LINE__, #condition); \ WARN(!__cond, "Assertion failed: %s, func: %s, line: %d", #condition, __func__, __LINE__); \ } while (0) # define BAD_DATA(condition) ({ \ int __cond = !!(condition); \ + if (unlikely(__cond)) \ + trace_fuse_iomap_bad_data(__func__, __LINE__, #condition); \ WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \ }) #else # define ASSERT(condition) # define BAD_DATA(condition) ({ \ int __cond = !!(condition); \ + if (unlikely(__cond)) \ + trace_fuse_iomap_bad_data(__func__, __LINE__, #condition); \ unlikely(__cond); \ }) #endif /* CONFIG_FUSE_IOMAP_DEBUG */ diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index 
286a0845dc0898..c0878253e7c6ad 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -58,6 +58,8 @@ EM( FUSE_SYNCFS, "FUSE_SYNCFS") \ EM( FUSE_TMPFILE, "FUSE_TMPFILE") \ EM( FUSE_STATX, "FUSE_STATX") \ + EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \ + EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \ EMe(CUSE_INIT, "CUSE_INIT") /* @@ -77,6 +79,54 @@ OPCODES #define EM(a, b) {a, b}, #define EMe(a, b) {a, b} +/* tracepoint boilerplate so we don't have to keep doing this */ +#define FUSE_INODE_FIELDS \ + __field(dev_t, connection) \ + __field(uint64_t, ino) \ + __field(uint64_t, nodeid) \ + __field(loff_t, isize) + +#define FUSE_INODE_ASSIGN(inode, fi, fm) \ + const struct fuse_inode *fi = get_fuse_inode(inode); \ + const struct fuse_mount *fm = get_fuse_mount(inode); \ +\ + __entry->connection = (fm)->fc->dev; \ + __entry->ino = (fi)->orig_ino; \ + __entry->nodeid = (fi)->nodeid; \ + __entry->isize = i_size_read(inode) + +#define FUSE_INODE_FMT \ + "connection %u ino %llu nodeid %llu isize 0x%llx" + +#define FUSE_INODE_PRINTK_ARGS \ + __entry->connection, \ + __entry->ino, \ + __entry->nodeid, \ + __entry->isize + +#define FUSE_FILE_RANGE_FIELDS(prefix) \ + __field(loff_t, prefix##offset) \ + __field(loff_t, prefix##length) + +#define FUSE_FILE_RANGE_FMT(prefix) \ + " " prefix "pos 0x%llx length 0x%llx" + +#define FUSE_FILE_RANGE_PRINTK_ARGS(prefix) \ + __entry->prefix##offset, \ + __entry->prefix##length + +/* combinations of boilerplate to reduce typing further */ +#define FUSE_IO_RANGE_FIELDS(prefix) \ + FUSE_INODE_FIELDS \ + FUSE_FILE_RANGE_FIELDS(prefix) + +#define FUSE_IO_RANGE_FMT(prefix) \ + FUSE_INODE_FMT FUSE_FILE_RANGE_FMT(prefix) + +#define FUSE_IO_RANGE_PRINTK_ARGS(prefix) \ + FUSE_INODE_PRINTK_ARGS, \ + FUSE_FILE_RANGE_PRINTK_ARGS(prefix) + TRACE_EVENT(fuse_request_send, TP_PROTO(const struct fuse_req *req), @@ -159,6 +209,251 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_open); DEFINE_FUSE_BACKING_EVENT(fuse_backing_close); #endif /* CONFIG_FUSE_BACKING */ 
+#if IS_ENABLED(CONFIG_FUSE_IOMAP) + +/* tracepoint boilerplate so we don't have to keep doing this */ +#define FUSE_IOMAP_OPFLAGS_FIELD \ + __field(unsigned, opflags) + +#define FUSE_IOMAP_OPFLAGS_FMT \ + " opflags (%s)" + +#define FUSE_IOMAP_OPFLAGS_PRINTK_ARG \ + __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS) + +#define FUSE_IOMAP_MAP_FIELDS(prefix) \ + __field(uint64_t, prefix##offset) \ + __field(uint64_t, prefix##length) \ + __field(uint64_t, prefix##addr) \ + __field(uint32_t, prefix##dev) \ + __field(uint16_t, prefix##type) \ + __field(uint16_t, prefix##flags) + +#define FUSE_IOMAP_MAP_FMT(prefix) \ + " " prefix "offset 0x%llx length 0x%llx type %s dev %u addr 0x%llx mapflags (%s)" + +#define FUSE_IOMAP_MAP_PRINTK_ARGS(prefix) \ + __entry->prefix##offset, \ + __entry->prefix##length, \ + __print_symbolic(__entry->prefix##type, FUSE_IOMAP_TYPE_STRINGS), \ + __entry->prefix##dev, \ + __entry->prefix##addr, \ + __print_flags(__entry->prefix##flags, "|", FUSE_IOMAP_F_STRINGS) + +/* combinations of boilerplate to reduce typing further */ +#define FUSE_IOMAP_OP_FIELDS(prefix) \ + FUSE_INODE_FIELDS \ + FUSE_IOMAP_OPFLAGS_FIELD \ + FUSE_FILE_RANGE_FIELDS(prefix) + +#define FUSE_IOMAP_OP_FMT(prefix) \ + FUSE_INODE_FMT FUSE_IOMAP_OPFLAGS_FMT FUSE_FILE_RANGE_FMT(prefix) + +#define FUSE_IOMAP_OP_PRINTK_ARGS(prefix) \ + FUSE_INODE_PRINTK_ARGS, \ + FUSE_IOMAP_OPFLAGS_PRINTK_ARG, \ + FUSE_FILE_RANGE_PRINTK_ARGS(prefix) + +/* string decoding */ +#define FUSE_IOMAP_F_STRINGS \ + { FUSE_IOMAP_F_NEW, "new" }, \ + { FUSE_IOMAP_F_DIRTY, "dirty" }, \ + { FUSE_IOMAP_F_SHARED, "shared" }, \ + { FUSE_IOMAP_F_MERGED, "merged" }, \ + { FUSE_IOMAP_F_BOUNDARY, "boundary" }, \ + { FUSE_IOMAP_F_ANON_WRITE, "anon_write" }, \ + { FUSE_IOMAP_F_ATOMIC_BIO, "atomic" }, \ + { FUSE_IOMAP_F_WANT_IOMAP_END, "iomap_end" }, \ + { FUSE_IOMAP_F_SIZE_CHANGED, "append" }, \ + { FUSE_IOMAP_F_STALE, "stale" } + +#define FUSE_IOMAP_OP_STRINGS \ + { FUSE_IOMAP_OP_WRITE, "write" }, \ + { 
FUSE_IOMAP_OP_ZERO, "zero" }, \ + { FUSE_IOMAP_OP_REPORT, "report" }, \ + { FUSE_IOMAP_OP_FAULT, "fault" }, \ + { FUSE_IOMAP_OP_DIRECT, "direct" }, \ + { FUSE_IOMAP_OP_NOWAIT, "nowait" }, \ + { FUSE_IOMAP_OP_OVERWRITE_ONLY, "overwrite" }, \ + { FUSE_IOMAP_OP_UNSHARE, "unshare" }, \ + { FUSE_IOMAP_OP_DAX, "fsdax" }, \ + { FUSE_IOMAP_OP_ATOMIC, "atomic" }, \ + { FUSE_IOMAP_OP_DONTCACHE, "dontcache" } + +#define FUSE_IOMAP_TYPE_STRINGS \ + { FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \ + { FUSE_IOMAP_TYPE_HOLE, "hole" }, \ + { FUSE_IOMAP_TYPE_DELALLOC, "delalloc" }, \ + { FUSE_IOMAP_TYPE_MAPPED, "mapped" }, \ + { FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \ + { FUSE_IOMAP_TYPE_INLINE, "inline" } + +DECLARE_EVENT_CLASS(fuse_iomap_check_class, + TP_PROTO(const char *func, int line, const char *condition), + + TP_ARGS(func, line, condition), + + TP_STRUCT__entry( + __string(func, func) + __field(int, line) + __string(condition, condition) + ), + + TP_fast_assign( + __assign_str(func); + __assign_str(condition); + __entry->line = line; + ), + + TP_printk("func %s line %d condition %s", __get_str(func), + __entry->line, __get_str(condition)) +); +#define DEFINE_FUSE_IOMAP_CHECK_EVENT(name) \ +DEFINE_EVENT(fuse_iomap_check_class, name, \ + TP_PROTO(const char *func, int line, const char *condition), \ + TP_ARGS(func, line, condition)) +#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG) +DEFINE_FUSE_IOMAP_CHECK_EVENT(fuse_iomap_assert); +#endif +DEFINE_FUSE_IOMAP_CHECK_EVENT(fuse_iomap_bad_data); + +TRACE_EVENT(fuse_iomap_begin, + TP_PROTO(const struct inode *inode, loff_t pos, loff_t count, + unsigned opflags), + + TP_ARGS(inode, pos, count, opflags), + + TP_STRUCT__entry( + FUSE_IOMAP_OP_FIELDS() + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = pos; + __entry->length = count; + __entry->opflags = opflags; + ), + + TP_printk(FUSE_IOMAP_OP_FMT(), + FUSE_IOMAP_OP_PRINTK_ARGS()) +); + +TRACE_EVENT(fuse_iomap_begin_error, + TP_PROTO(const struct inode 
*inode, loff_t pos, loff_t count, + unsigned opflags, int error), + + TP_ARGS(inode, pos, count, opflags, error), + + TP_STRUCT__entry( + FUSE_IOMAP_OP_FIELDS() + __field(int, error) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = pos; + __entry->length = count; + __entry->opflags = opflags; + __entry->error = error; + ), + + TP_printk(FUSE_IOMAP_OP_FMT() " err %d", + FUSE_IOMAP_OP_PRINTK_ARGS(), + __entry->error) +); + +DECLARE_EVENT_CLASS(fuse_iomap_mapping_class, + TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map), + + TP_ARGS(inode, map), + + TP_STRUCT__entry( + FUSE_INODE_FIELDS + FUSE_IOMAP_MAP_FIELDS(map) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->mapoffset = map->offset; + __entry->maplength = map->length; + __entry->mapdev = map->dev; + __entry->mapaddr = map->addr; + __entry->maptype = map->type; + __entry->mapflags = map->flags; + ), + + TP_printk(FUSE_INODE_FMT FUSE_IOMAP_MAP_FMT(), + FUSE_INODE_PRINTK_ARGS, + FUSE_IOMAP_MAP_PRINTK_ARGS(map)) +); +#define DEFINE_FUSE_IOMAP_MAPPING_EVENT(name) \ +DEFINE_EVENT(fuse_iomap_mapping_class, name, \ + TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map), \ + TP_ARGS(inode, map)) +DEFINE_FUSE_IOMAP_MAPPING_EVENT(fuse_iomap_read_map); +DEFINE_FUSE_IOMAP_MAPPING_EVENT(fuse_iomap_write_map); + +TRACE_EVENT(fuse_iomap_end, + TP_PROTO(const struct inode *inode, + const struct fuse_iomap_end_in *inarg), + + TP_ARGS(inode, inarg), + + TP_STRUCT__entry( + FUSE_IOMAP_OP_FIELDS() + __field(size_t, written) + FUSE_IOMAP_MAP_FIELDS(map) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->opflags = inarg->opflags; + __entry->written = inarg->written; + __entry->offset = inarg->pos; + __entry->length = inarg->count; + + __entry->mapoffset = inarg->map.offset; + __entry->maplength = inarg->map.length; + __entry->mapdev = inarg->map.dev; + __entry->mapaddr = inarg->map.addr; + __entry->maptype = inarg->map.type; 
+ __entry->mapflags = inarg->map.flags; + ), + + TP_printk(FUSE_IOMAP_OP_FMT() " written %zd" FUSE_IOMAP_MAP_FMT(), + FUSE_IOMAP_OP_PRINTK_ARGS(), + __entry->written, + FUSE_IOMAP_MAP_PRINTK_ARGS(map)) +); + +TRACE_EVENT(fuse_iomap_end_error, + TP_PROTO(const struct inode *inode, + const struct fuse_iomap_end_in *inarg, int error), + + TP_ARGS(inode, inarg, error), + + TP_STRUCT__entry( + FUSE_IOMAP_OP_FIELDS() + __field(size_t, written) + __field(int, error) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = inarg->pos; + __entry->length = inarg->count; + __entry->opflags = inarg->opflags; + __entry->written = inarg->written; + __entry->error = error; + ), + + TP_printk(FUSE_IOMAP_OP_FMT() " written %zd error %d", + FUSE_IOMAP_OP_PRINTK_ARGS(), + __entry->written, + __entry->error) +); +#endif /* CONFIG_FUSE_IOMAP */ + #endif /* _TRACE_FUSE_H */ #undef TRACE_INCLUDE_PATH diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 8785f86941a1d2..c22c7961cc0bdc 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -327,6 +327,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, FUSE_ARGS(args); int err; + trace_fuse_iomap_begin(inode, pos, count, opflags); + args.opcode = FUSE_IOMAP_BEGIN; args.nodeid = get_node_id(inode); args.in_numargs = 1; @@ -336,8 +338,13 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, args.out_args[0].size = sizeof(outarg); args.out_args[0].value = &outarg; err = fuse_simple_request(fm, &args); - if (err) + if (err) { + trace_fuse_iomap_begin_error(inode, pos, count, opflags, err); return err; + } + + trace_fuse_iomap_read_map(inode, &outarg.read); + trace_fuse_iomap_write_map(inode, &outarg.write); err = fuse_iomap_begin_validate(inode, opflags, pos, &outarg); if (err) @@ -404,6 +411,8 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count, fuse_iomap_to_server(&inarg.map, iomap); + trace_fuse_iomap_end(inode, &inarg); + 
args.opcode = FUSE_IOMAP_END; args.nodeid = get_node_id(inode); args.in_numargs = 1; @@ -417,8 +426,10 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count, */ err = 0; } - if (err) + if (err) { + trace_fuse_iomap_end_error(inode, &inarg, err); return err; + } } return 0; ^ permalink raw reply related [flat|nested] 191+ messages in thread
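The mapflags column printed by the mapping tracepoints above is produced by __print_flags() with the FUSE_IOMAP_F_STRINGS table. For anyone post-processing raw flag values outside the kernel, the decoding can be approximated as below. This is a rough userspace sketch, not part of the series: the flag values and names are copied from the patches above, sketch_decode_flags() and sketch_append() are hypothetical helpers, and printing leftover unknown bits in hex is an assumption about how the tracing helper treats them.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values copied from the include/uapi/linux/fuse.h hunk in patch 01. */
#define FUSE_IOMAP_F_NEW		(1U << 0)
#define FUSE_IOMAP_F_DIRTY		(1U << 1)
#define FUSE_IOMAP_F_SHARED		(1U << 2)
#define FUSE_IOMAP_F_MERGED		(1U << 3)
#define FUSE_IOMAP_F_BOUNDARY		(1U << 4)
#define FUSE_IOMAP_F_ANON_WRITE		(1U << 5)
#define FUSE_IOMAP_F_ATOMIC_BIO		(1U << 6)
#define FUSE_IOMAP_F_WANT_IOMAP_END	(1U << 7)
#define FUSE_IOMAP_F_SIZE_CHANGED	(1U << 8)
#define FUSE_IOMAP_F_STALE		(1U << 9)

struct sketch_flag_name {
	uint16_t bit;
	const char *name;
};

/* Same names and order as FUSE_IOMAP_F_STRINGS in fuse_trace.h */
static const struct sketch_flag_name sketch_f_strings[] = {
	{ FUSE_IOMAP_F_NEW,		"new" },
	{ FUSE_IOMAP_F_DIRTY,		"dirty" },
	{ FUSE_IOMAP_F_SHARED,		"shared" },
	{ FUSE_IOMAP_F_MERGED,		"merged" },
	{ FUSE_IOMAP_F_BOUNDARY,	"boundary" },
	{ FUSE_IOMAP_F_ANON_WRITE,	"anon_write" },
	{ FUSE_IOMAP_F_ATOMIC_BIO,	"atomic" },
	{ FUSE_IOMAP_F_WANT_IOMAP_END,	"iomap_end" },
	{ FUSE_IOMAP_F_SIZE_CHANGED,	"append" },
	{ FUSE_IOMAP_F_STALE,		"stale" },
};

/* Hypothetical helper: append sep+word at pos, return the new position. */
static size_t sketch_append(char *buf, size_t len, size_t pos,
			    const char *sep, const char *word)
{
	int n;

	if (pos >= len)
		return pos;	/* already truncated, stop writing */
	n = snprintf(buf + pos, len - pos, "%s%s", sep, word);
	return n < 0 ? pos : pos + (size_t)n;
}

/* Hypothetical helper: render flags as "a|b|c" into buf and return buf. */
static const char *sketch_decode_flags(uint16_t flags, char *buf, size_t len)
{
	size_t pos = 0;
	size_t i;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (i = 0; i < sizeof(sketch_f_strings) / sizeof(sketch_f_strings[0]); i++) {
		uint16_t bit = sketch_f_strings[i].bit;

		if (!(flags & bit))
			continue;
		pos = sketch_append(buf, len, pos, pos ? "|" : "",
				    sketch_f_strings[i].name);
		flags = (uint16_t)(flags & ~bit);
	}

	/* Assumption: bits without a name are shown in hex */
	if (flags) {
		char hex[8];

		snprintf(hex, sizeof(hex), "0x%x", (unsigned int)flags);
		sketch_append(buf, len, pos, pos ? "|" : "", hex);
	}
	return buf;
}
```

With a 64-byte buffer, FUSE_IOMAP_F_NEW | FUSE_IOMAP_F_SHARED decodes to "new|shared", which matches the style of the mapflags field in the trace output format defined above.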
* [PATCH 03/33] fuse: make debugging configurable at runtime 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong 2026-04-29 14:23 ` [PATCH 01/33] fuse: implement the basic iomap mechanisms Darrick J. Wong 2026-04-29 14:24 ` [PATCH 02/33] fuse_trace: " Darrick J. Wong @ 2026-04-29 14:24 ` Darrick J. Wong 2026-04-29 14:24 ` [PATCH 04/33] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices Darrick J. Wong ` (29 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:24 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Use static keys so that we can configure debugging assertions and dmesg warnings at runtime. By default this is turned off so the cost is merely scanning a nop sled. However, fuse server developers can turn it on for their debugging systems. Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> --- fs/fuse/fuse_iomap_i.h | 17 ++++++++++++++--- fs/fuse/Kconfig | 15 +++++++++++++++ fs/fuse/fuse_iomap.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 72 insertions(+), 3 deletions(-) diff --git a/fs/fuse/fuse_iomap_i.h b/fs/fuse/fuse_iomap_i.h index b9ab8ce140e8e1..c37a7c5cfc862f 100644 --- a/fs/fuse/fuse_iomap_i.h +++ b/fs/fuse/fuse_iomap_i.h @@ -8,17 +8,28 @@ #if IS_ENABLED(CONFIG_FUSE_IOMAP) #if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG) -# define ASSERT(condition) do { \ + +#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG_BY_DEFAULT) +DECLARE_STATIC_KEY_TRUE(fuse_iomap_debug); +#else +DECLARE_STATIC_KEY_FALSE(fuse_iomap_debug); +#endif /* FUSE_IOMAP_DEBUG_BY_DEFAULT */ + +# define ASSERT(condition) \ +while (static_branch_unlikely(&fuse_iomap_debug)) { \ int __cond = !!(condition); \ if (unlikely(!__cond)) \ trace_fuse_iomap_assert(__func__, __LINE__, #condition); \ WARN(!__cond, "Assertion failed: %s, func: %s, line: %d", #condition, __func__, __LINE__); \ -} while (0) + break; \ +} # define BAD_DATA(condition) ({ \ int __cond = !!(condition); \ if (unlikely(__cond)) \ trace_fuse_iomap_bad_data(__func__, __LINE__, #condition); \ - WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \ + if (static_branch_unlikely(&fuse_iomap_debug)) \ + WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \ + unlikely(__cond); \ }) #else # define ASSERT(condition) diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig index 934d48076a010c..1b8990f1c2a8f9 100644 --- a/fs/fuse/Kconfig +++ b/fs/fuse/Kconfig @@ -101,6 +101,21 @@ config FUSE_IOMAP_DEBUG Enable debugging assertions for the fuse iomap code paths and logging of bad iomap file mapping data being sent to the kernel. + Say N here if you don't want any debugging code compiled in at + all. 
+ +config FUSE_IOMAP_DEBUG_BY_DEFAULT + bool "Debug FUSE file IO over iomap at boot time" + default n + depends on FUSE_IOMAP_DEBUG + help + At boot time, enable debugging assertions for the fuse iomap code + paths and warnings about bad iomap file mapping data. This enables + fuse server authors to control debugging at runtime even on a + distribution kernel while avoiding most of the overhead on production + systems. The setting can be changed at runtime via the debug_iomap + module parameter. + config FUSE_IO_URING bool "FUSE communication over io-uring" default y diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index c22c7961cc0bdc..f7a7eba8317c18 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -18,6 +18,49 @@ static bool __read_mostly enable_iomap = module_param(enable_iomap, bool, 0644); MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap"); +#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG) +#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG_BY_DEFAULT) +DEFINE_STATIC_KEY_TRUE(fuse_iomap_debug); +#else +DEFINE_STATIC_KEY_FALSE(fuse_iomap_debug); +#endif /* FUSE_IOMAP_DEBUG_BY_DEFAULT */ + +static int iomap_debug_set(const char *val, const struct kernel_param *kp) +{ + bool now; + int ret; + + if (!val) + return -EINVAL; + + ret = kstrtobool(val, &now); + if (ret) + return ret; + + if (now) + static_branch_enable(&fuse_iomap_debug); + else + static_branch_disable(&fuse_iomap_debug); + + return 0; +} + +static int iomap_debug_get(char *buffer, const struct kernel_param *kp) +{ + return sprintf(buffer, "%c\n", + static_branch_unlikely(&fuse_iomap_debug) ? 
'Y' : 'N'); +} + +static const struct kernel_param_ops iomap_debug_ops = { + .set = iomap_debug_set, + .get = iomap_debug_get, +}; + +module_param_cb(debug_iomap, &iomap_debug_ops, NULL, 0644); +__MODULE_PARM_TYPE(debug_iomap, "bool"); +MODULE_PARM_DESC(debug_iomap, "Enable debugging of fuse iomap"); +#endif /* IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG) */ + bool fuse_iomap_enabled(void) { /* Don't let anyone touch iomap until the end of the patchset. */ ^ permalink raw reply related [flat|nested] 191+ messages in thread
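The runtime toggle in this patch can be modelled in userspace as follows. This is only a sketch under stated assumptions: `debug_enabled` is a plain bool standing in for the static key (no code patching happens here), `parse_bool()` is a hypothetical and much reduced cousin of kstrtobool(), and `bad_data()`, `debug_set()` and `mapping_ok()` are names invented for the illustration. What it tries to show is the property the new BAD_DATA() has: the check always yields its verdict, and only the warning is gated by the switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-in for the static key: a plain bool, not a patched jump. */
static bool debug_enabled;
static unsigned int bad_data_warnings;

/* Hypothetical, reduced cousin of kstrtobool(): accepts y/Y/1 and n/N/0. */
static int parse_bool(const char *val, bool *out)
{
	if (!val || !val[0])
		return -1;

	switch (val[0]) {
	case 'y': case 'Y': case '1':
		*out = true;
		return 0;
	case 'n': case 'N': case '0':
		*out = false;
		return 0;
	default:
		return -1;
	}
}

/* Models the .set hook of the module parameter. */
static int debug_set(const char *val)
{
	bool now;

	if (parse_bool(val, &now))
		return -1;
	debug_enabled = now;
	return 0;
}

/* Always returns the verdict; only warns while debugging is switched on. */
static bool bad_data(bool cond, const char *expr, const char *func, int line)
{
	if (cond && debug_enabled) {
		bad_data_warnings++;
		fprintf(stderr, "Bad mapping: %s, func: %s, line: %d\n",
			expr, func, line);
	}
	return cond;
}

#define BAD_DATA(cond) bad_data((cond), #cond, __func__, __LINE__)

/* Example check: reject a device range whose end would overflow. */
static bool mapping_ok(unsigned long long addr, unsigned long long len)
{
	unsigned long long end;

	/* __builtin_add_overflow is a GCC/Clang builtin. */
	if (BAD_DATA(__builtin_add_overflow(addr, len, &end)))
		return false;
	return true;
}
```

With the switch off, a bad mapping is still rejected but nothing is logged; after `debug_set("Y")` the same rejection also produces a warning.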
* [PATCH 04/33] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (2 preceding siblings ...) 2026-04-29 14:24 ` [PATCH 03/33] fuse: make debugging configurable at runtime Darrick J. Wong @ 2026-04-29 14:24 ` Darrick J. Wong 2026-04-29 14:24 ` [PATCH 05/33] fuse_trace: " Darrick J. Wong ` (28 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:24 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Enable the use of the backing file open/close ioctls so that fuse servers can register block devices for use with iomap. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 7 +++ fs/fuse/fuse_iomap.h | 2 + include/uapi/linux/fuse.h | 3 + fs/fuse/Kconfig | 1 fs/fuse/backing.c | 47 +++++++++++++++++ fs/fuse/fuse_iomap.c | 126 +++++++++++++++++++++++++++++++++++++++++---- fs/fuse/trace.c | 1 7 files changed, 174 insertions(+), 13 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 7accde465d03a7..a02af3403e6849 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -104,12 +104,14 @@ struct fuse_submount_lookup { }; struct fuse_conn; +struct fuse_backing; /** Operations for subsystems that want to use a backing file */ struct fuse_backing_ops { int (*may_admin)(struct fuse_conn *fc, uint32_t flags); int (*may_open)(struct fuse_conn *fc, struct file *file); int (*may_close)(struct fuse_conn *fc, struct file *file); + int (*post_open)(struct fuse_conn *fc, struct fuse_backing *fb); unsigned int type; int id_start; int id_end; @@ -119,6 +121,7 @@ struct fuse_backing_ops { struct fuse_backing { struct file *file; struct cred *cred; + struct block_device *bdev; const struct fuse_backing_ops *ops; /** refcount */ @@ -1613,6 +1616,10 @@ void 
fuse_backing_put(struct fuse_backing *fb); struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, const struct fuse_backing_ops *ops, int backing_id); +typedef bool (*fuse_match_backing_fn)(const struct fuse_backing *fb, + const void *data); +int fuse_backing_lookup_id(struct fuse_conn *fc, fuse_match_backing_fn match_fn, + const void *data); #else static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb) diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 6c71318365ca82..43562ef23fb325 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -18,6 +18,8 @@ static inline bool fuse_has_iomap(const struct inode *inode) { return get_fuse_conn(inode)->iomap; } + +extern const struct fuse_backing_ops fuse_iomap_backing_ops; #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 5a58011f66f501..5ae6b05de623d7 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -1137,7 +1137,8 @@ struct fuse_notify_prune_out { #define FUSE_BACKING_TYPE_MASK (0xFF) #define FUSE_BACKING_TYPE_PASSTHROUGH (0) -#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH) +#define FUSE_BACKING_TYPE_IOMAP (1) +#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_IOMAP) #define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK) diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig index 1b8990f1c2a8f9..3e35611c3aac07 100644 --- a/fs/fuse/Kconfig +++ b/fs/fuse/Kconfig @@ -75,6 +75,7 @@ config FUSE_IOMAP depends on FUSE_FS depends on BLOCK select FS_IOMAP + select FUSE_BACKING help Enable fuse servers to operate the regular file I/O path through the fs-iomap library in the kernel. 
This enables higher performance diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c index d7e074c30f46cc..050657a6ef1c98 100644 --- a/fs/fuse/backing.c +++ b/fs/fuse/backing.c @@ -6,6 +6,7 @@ */ #include "fuse_i.h" +#include "fuse_iomap.h" #include "fuse_trace.h" #include <linux/file.h> @@ -90,6 +91,10 @@ fuse_backing_ops_from_map(const struct fuse_backing_map *map) #ifdef CONFIG_FUSE_PASSTHROUGH case FUSE_BACKING_TYPE_PASSTHROUGH: return &fuse_passthrough_backing_ops; +#endif +#ifdef CONFIG_FUSE_IOMAP + case FUSE_BACKING_TYPE_IOMAP: + return &fuse_iomap_backing_ops; #endif default: break; @@ -138,8 +143,16 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map) fb->file = file; fb->cred = prepare_creds(); fb->ops = ops; + fb->bdev = NULL; refcount_set(&fb->count, 1); + res = ops->post_open ? ops->post_open(fc, fb) : 0; + if (res) { + fuse_backing_free(fb); + fb = NULL; + goto out; + } + res = fuse_backing_id_alloc(fc, fb); if (res < 0) { fuse_backing_free(fb); @@ -230,3 +243,37 @@ struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, return fb; } + +struct fuse_backing_match { + fuse_match_backing_fn match_fn; + const void *data; +}; + +static int fuse_backing_matches(int id, void *p, void *data) +{ + struct fuse_backing *fb = p; + struct fuse_backing_match *fbm = data; + + if (fb && fbm->match_fn(fb, fbm->data)) { + /* backing ids are always greater than zero */ + return id; + } + + return 0; +} + +int fuse_backing_lookup_id(struct fuse_conn *fc, fuse_match_backing_fn match_fn, + const void *data) +{ + struct fuse_backing_match fbm = { + .match_fn = match_fn, + .data = data, + }; + int ret; + + rcu_read_lock(); + ret = idr_for_each(&fc->backing_files_map, fuse_backing_matches, &fbm); + rcu_read_unlock(); + + return ret; +} diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index f7a7eba8317c18..0ac783dd312dc3 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -282,10 +282,6 @@ static inline bool 
fuse_iomap_check_mapping(const struct inode *inode, return false; } - /* XXX: we don't support devices yet */ - if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL)) - return false; - /* No overflows in the device range, if supplied */ if (map->addr != FUSE_IOMAP_NULL_ADDR && BAD_DATA(check_add_overflow(map->addr, map->length, &end))) @@ -296,6 +292,7 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode, /* Convert a mapping from the server into something the kernel can use */ static inline void fuse_iomap_from_server(struct iomap *iomap, + const struct fuse_backing *fb, const struct fuse_iomap_io *fmap) { iomap->addr = fmap->addr; @@ -303,11 +300,32 @@ static inline void fuse_iomap_from_server(struct iomap *iomap, iomap->length = fmap->length; iomap->type = fuse_iomap_type_from_server(fmap->type); iomap->flags = fuse_iomap_flags_from_server(fmap->flags); - iomap->bdev = NULL; /* XXX */ + iomap->bdev = fb ? fb->bdev : NULL; + iomap->dax_dev = NULL; +} + +static bool fuse_iomap_matches_bdev(const struct fuse_backing *fb, + const void *data) +{ + return fb->bdev == data; +} + +static inline uint32_t +fuse_iomap_find_backing_id(struct fuse_conn *fc, + const struct block_device *bdev) +{ + int ret = -ENODEV; + + if (bdev) + ret = fuse_backing_lookup_id(fc, fuse_iomap_matches_bdev, bdev); + if (ret < 0) + return FUSE_IOMAP_DEV_NULL; + return ret; } /* Convert a mapping from the kernel into something the server can use */ -static inline void fuse_iomap_to_server(struct fuse_iomap_io *fmap, +static inline void fuse_iomap_to_server(struct fuse_conn *fc, + struct fuse_iomap_io *fmap, const struct iomap *iomap) { fmap->addr = iomap->addr; @@ -315,7 +333,7 @@ static inline void fuse_iomap_to_server(struct fuse_iomap_io *fmap, fmap->length = iomap->length; fmap->type = fuse_iomap_type_to_server(iomap->type); fmap->flags = fuse_iomap_flags_to_server(iomap->flags); - fmap->dev = FUSE_IOMAP_DEV_NULL; /* XXX */ + fmap->dev = fuse_iomap_find_backing_id(fc, 
iomap->bdev); } /* Check the incoming _begin mappings to make sure they're not nonsense. */ @@ -354,6 +372,27 @@ static inline bool fuse_is_iomap_file_write(unsigned int opflags) return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE); } +static inline struct fuse_backing * +fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map) +{ + struct fuse_backing *ret = NULL; + + if (map->dev != FUSE_IOMAP_DEV_NULL && map->dev < INT_MAX) + ret = fuse_backing_lookup(fc, &fuse_iomap_backing_ops, + map->dev); + + switch (map->type) { + case FUSE_IOMAP_TYPE_MAPPED: + case FUSE_IOMAP_TYPE_UNWRITTEN: + /* Mappings backed by space must have a device/addr */ + if (BAD_DATA(ret == NULL)) + return ERR_PTR(-EFSCORRUPTED); + break; + } + + return ret; +} + static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, unsigned opflags, struct iomap *iomap, struct iomap *srcmap) @@ -367,6 +406,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, }; struct fuse_iomap_begin_out outarg = { }; struct fuse_mount *fm = get_fuse_mount(inode); + struct fuse_backing *read_dev = NULL; + struct fuse_backing *write_dev = NULL; FUSE_ARGS(args); int err; @@ -393,24 +434,44 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, if (err) return err; + read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read); + if (IS_ERR(read_dev)) + return PTR_ERR(read_dev); + if (fuse_is_iomap_file_write(opflags) && outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) { + /* open the write device */ + write_dev = fuse_iomap_find_dev(fm->fc, &outarg.write); + if (IS_ERR(write_dev)) { + err = PTR_ERR(write_dev); + goto out_read_dev; + } + /* * For an out of place write, we must supply the write mapping * via @iomap, and the read mapping via @srcmap. 
*/ - fuse_iomap_from_server(iomap, &outarg.write); - fuse_iomap_from_server(srcmap, &outarg.read); + fuse_iomap_from_server(iomap, write_dev, &outarg.write); + fuse_iomap_from_server(srcmap, read_dev, &outarg.read); } else { /* * For everything else (reads, reporting, and pure overwrites), * we can return the sole mapping through @iomap and leave * @srcmap unchanged from its default (HOLE). */ - fuse_iomap_from_server(iomap, &outarg.read); + fuse_iomap_from_server(iomap, read_dev, &outarg.read); } - return 0; + /* + * XXX: if we ever want to support closing devices, we need a way to + * track the fuse_backing refcount all the way through bio endios. + * For now we put the refcount here because you can't remove an iomap + * device until unmount time. + */ + fuse_backing_put(write_dev); +out_read_dev: + fuse_backing_put(read_dev); + return err; } /* Decide if we send FUSE_IOMAP_END to the fuse server */ @@ -452,7 +513,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count, }; FUSE_ARGS(args); - fuse_iomap_to_server(&inarg.map, iomap); + fuse_iomap_to_server(fm->fc, &inarg.map, iomap); trace_fuse_iomap_end(inode, &inarg); @@ -482,3 +543,44 @@ const struct iomap_ops fuse_iomap_ops = { .iomap_begin = fuse_iomap_begin, .iomap_end = fuse_iomap_end, }; + +static int fuse_iomap_may_admin(struct fuse_conn *fc, unsigned int flags) +{ + if (!fc->iomap) + return -EPERM; + + if (flags) + return -EINVAL; + + return 0; +} + +static int fuse_iomap_may_open(struct fuse_conn *fc, struct file *file) +{ + if (!S_ISBLK(file_inode(file)->i_mode)) + return -ENODEV; + + return 0; +} + +static int fuse_iomap_post_open(struct fuse_conn *fc, struct fuse_backing *fb) +{ + fb->bdev = I_BDEV(fb->file->f_mapping->host); + return 0; +} + +static int fuse_iomap_may_close(struct fuse_conn *fc, struct file *file) +{ + /* We only support closing iomap block devices at unmount */ + return -EBUSY; +} + +const struct fuse_backing_ops fuse_iomap_backing_ops = { + .type = 
FUSE_BACKING_TYPE_IOMAP, + .id_start = 1, + .id_end = 1025, /* maximum 1024 block devices */ + .may_admin = fuse_iomap_may_admin, + .may_open = fuse_iomap_may_open, + .may_close = fuse_iomap_may_close, + .post_open = fuse_iomap_post_open, +}; diff --git a/fs/fuse/trace.c b/fs/fuse/trace.c index 93bd72efc98cd0..c830c1c38a833c 100644 --- a/fs/fuse/trace.c +++ b/fs/fuse/trace.c @@ -6,6 +6,7 @@ #include "dev_uring_i.h" #include "fuse_i.h" #include "fuse_dev_i.h" +#include "fuse_iomap_i.h" #include <linux/pagemap.h> ^ permalink raw reply related [flat|nested] 191+ messages in thread
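The two lookup directions added by this patch (backing id to device for mappings coming from the server, device to backing id for mappings going back to it) can be sketched in userspace roughly like this. Everything here is an assumption made for illustration: a fixed array replaces the kernel IDR, `struct backing` holds an opaque pointer instead of a struct block_device, there is no refcounting or RCU, and `DEV_NULL` is assumed to be 0, which the patch does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEV_NULL	0u	/* stand-in for FUSE_IOMAP_DEV_NULL; value assumed */
#define ID_START	1
#define ID_END		1025	/* exclusive bound, so at most 1024 devices */

struct backing {
	const void *bdev;	/* opaque stand-in for struct block_device * */
};

/* Fixed array standing in for the per-connection IDR. */
static struct backing *table[ID_END];

typedef int (*match_fn)(const struct backing *b, const void *data);

/* Register a device; returns its id, or -1 if the table is full. */
static int backing_add(struct backing *b)
{
	for (int id = ID_START; id < ID_END; id++) {
		if (!table[id]) {
			table[id] = b;
			return id;
		}
	}
	return -1;
}

/* Forward direction: id supplied by the server -> registered device. */
static struct backing *backing_lookup(uint32_t id)
{
	if (id == DEV_NULL || id >= ID_END)
		return NULL;
	return table[id];
}

/* Walk every entry; return the first matching id, or 0 because ids are > 0. */
static int backing_lookup_id(match_fn fn, const void *data)
{
	for (int id = ID_START; id < ID_END; id++)
		if (table[id] && fn(table[id], data))
			return id;
	return 0;
}

static int matches_bdev(const struct backing *b, const void *data)
{
	return b->bdev == data;
}

/* Reverse direction: device seen by the kernel -> id to report upstream. */
static uint32_t find_backing_id(const void *bdev)
{
	int id = bdev ? backing_lookup_id(matches_bdev, bdev) : 0;

	return id > 0 ? (uint32_t)id : DEV_NULL;
}
```

The reverse walk is linear in the number of registered devices, which seems acceptable given the 1024-device ceiling set by `id_end`.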
* [PATCH 05/33] fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (3 preceding siblings ...) 2026-04-29 14:24 ` [PATCH 04/33] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices Darrick J. Wong @ 2026-04-29 14:24 ` Darrick J. Wong 2026-04-29 14:25 ` [PATCH 06/33] fuse: enable SYNCFS and ensure we flush everything before sending DESTROY Darrick J. Wong ` (27 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:24 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Enhance the existing backing file tracepoints to report the subsystem that's actually using the backing file. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_trace.h | 42 +++++++++++++++++++++++++++++++++++++++--- 1 file changed, 39 insertions(+), 3 deletions(-) diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index c0878253e7c6ad..af21654d797f45 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -175,6 +175,10 @@ TRACE_EVENT(fuse_request_end, ); #ifdef CONFIG_FUSE_BACKING +#define FUSE_BACKING_FLAG_STRINGS \ + { FUSE_BACKING_TYPE_PASSTHROUGH, "pass" }, \ + { FUSE_BACKING_TYPE_IOMAP, "iomap" } + TRACE_EVENT(fuse_backing_class, TP_PROTO(const struct fuse_conn *fc, unsigned int idx, const struct fuse_backing *fb), @@ -184,7 +188,9 @@ TRACE_EVENT(fuse_backing_class, TP_STRUCT__entry( __field(dev_t, connection) __field(unsigned int, idx) + __field(unsigned int, type) __field(unsigned long, ino) + __field(dev_t, rdev) ), TP_fast_assign( @@ -193,12 +199,19 @@ TRACE_EVENT(fuse_backing_class, __entry->connection = fc->dev; __entry->idx = idx; __entry->ino = inode->i_ino; + __entry->type = fb->ops->type; + if (fb->ops->type == FUSE_BACKING_TYPE_IOMAP) + __entry->rdev = 
inode->i_rdev; + else + __entry->rdev = 0; ), - TP_printk("connection %u idx %u ino 0x%lx", + TP_printk("connection %u idx %u type %s ino 0x%lx rdev %u:%u", __entry->connection, __entry->idx, - __entry->ino) + __print_symbolic(__entry->type, FUSE_BACKING_FLAG_STRINGS), + __entry->ino, + MAJOR(__entry->rdev), MINOR(__entry->rdev)) ); #define DEFINE_FUSE_BACKING_EVENT(name) \ DEFINE_EVENT(fuse_backing_class, name, \ @@ -210,7 +223,6 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close); #endif /* CONFIG_FUSE_BACKING */ #if IS_ENABLED(CONFIG_FUSE_IOMAP) - /* tracepoint boilerplate so we don't have to keep doing this */ #define FUSE_IOMAP_OPFLAGS_FIELD \ __field(unsigned, opflags) @@ -452,6 +464,30 @@ TRACE_EVENT(fuse_iomap_end_error, __entry->written, __entry->error) ); + +TRACE_EVENT(fuse_iomap_dev_add, + TP_PROTO(const struct fuse_conn *fc, + const struct fuse_backing_map *map), + + TP_ARGS(fc, map), + + TP_STRUCT__entry( + __field(dev_t, connection) + __field(int, fd) + __field(unsigned int, flags) + ), + + TP_fast_assign( + __entry->connection = fc->dev; + __entry->fd = map->fd; + __entry->flags = map->flags; + ), + + TP_printk("connection %u fd %d flags 0x%x", + __entry->connection, + __entry->fd, + __entry->flags) +); #endif /* CONFIG_FUSE_IOMAP */ #endif /* _TRACE_FUSE_H */ ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 06/33] fuse: enable SYNCFS and ensure we flush everything before sending DESTROY
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (4 preceding siblings ...)
2026-04-29 14:24 ` [PATCH 05/33] fuse_trace: " Darrick J. Wong
@ 2026-04-29 14:25 ` Darrick J. Wong
2026-04-29 14:25 ` [PATCH 07/33] fuse: clean up per-file type inode initialization Darrick J. Wong
` (26 subsequent siblings)
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:25 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

At unmount time, there are a few things that we need to ask a fuse iomap
server to do.

We need to send FUSE_DESTROY to the fuse server so that it closes the
filesystem and the device fds before unmount returns. That way, a
script that does something like "umount /dev/sda ; e2fsck -fn /dev/sda"
will not fail the e2fsck because the fd closure races with e2fsck
startup. This is essential for fstests QA to work properly.

To make this happen, first, we need to flush queued events to userspace
to give the fuse server a chance to process the events. These will all
be FUSE_RELEASE events for previously opened files. Once the commands
are flushed, we can send the FUSE_DESTROY request to ask the server to
close the filesystem. This depends on libfuse to hold the destroy
command until there are no unreleased files (instead of having the
kernel enforce the ordering) because the kernel fuse developers don't
like the idea of unbounded waits in unmount.

Second, to reduce the amount of metadata that must be persisted to disk
in the fuse server's destroy method, enable FUSE_SYNCFS so that previous
sync_filesystem calls can flush dirty data before we get deeper in the
unmount machinery.
If a fuse command timeout is set, this reduces the likelihood that the kernel aborts the FUSE_DESTROY command whilst the server is busy flushing dirty metadata. This is a major behavior change and who knows what might break existing code, so we hide it behind iomap mode. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 3 +++ fs/fuse/fuse_iomap.h | 5 +++++ fs/fuse/fuse_iomap.c | 32 ++++++++++++++++++++++++++++++++ fs/fuse/inode.c | 9 +++++++-- 4 files changed, 47 insertions(+), 2 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index a02af3403e6849..620489520ddc8e 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1428,6 +1428,9 @@ int fuse_init_fs_context_submount(struct fs_context *fsc); */ void fuse_conn_destroy(struct fuse_mount *fm); +/* Send the FUSE_DESTROY command. */ +void fuse_send_destroy(struct fuse_mount *fm); + /* Drop the connection and free the fuse mount */ void fuse_mount_destroy(struct fuse_mount *fm); diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 43562ef23fb325..129680b056ebea 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -20,9 +20,14 @@ static inline bool fuse_has_iomap(const struct inode *inode) } extern const struct fuse_backing_ops fuse_iomap_backing_ops; + +void fuse_iomap_mount(struct fuse_mount *fm); +void fuse_iomap_unmount(struct fuse_mount *fm); #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) +# define fuse_iomap_mount(...) ((void)0) +# define fuse_iomap_unmount(...) 
((void)0) #endif /* CONFIG_FUSE_IOMAP */ #endif /* _FS_FUSE_IOMAP_H */ diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 0ac783dd312dc3..4b63bb32167877 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -584,3 +584,35 @@ const struct fuse_backing_ops fuse_iomap_backing_ops = { .may_close = fuse_iomap_may_close, .post_open = fuse_iomap_post_open, }; + +void fuse_iomap_mount(struct fuse_mount *fm) +{ + struct fuse_conn *fc = fm->fc; + + /* + * Enable syncfs for iomap fuse servers so that we can send a final + * flush at unmount time. This also means that we can support + * freeze/thaw properly. + */ + fc->sync_fs = true; +} + +void fuse_iomap_unmount(struct fuse_mount *fm) +{ + struct fuse_conn *fc = fm->fc; + + /* + * Flush all pending file release commands and send a destroy command. + * This gives the fuse server a chance to process all the pending + * releases, write the last bits of metadata changes to disk, and close + * the iomap block devices before we return from the umount call. + * iomap fuse servers are expected to release all exclusive access + * resources before unmount completes. + * + * Note that multithreaded fuse servers will have to hold the destroy + * command until all release requests have completed because the kernel + * maintainers do not want to introduce waits in unmount. 
+ */ + fuse_flush_requests(fc); + fuse_send_destroy(fm); +} diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 15a8ce8afa4241..945afc30211d69 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -636,7 +636,7 @@ static void fuse_umount_begin(struct super_block *sb) retire_super(sb); } -static void fuse_send_destroy(struct fuse_mount *fm) +void fuse_send_destroy(struct fuse_mount *fm) { if (fm->fc->conn_init) { FUSE_ARGS(args); @@ -1504,6 +1504,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, init_server_timeout(fc, timeout); + if (fc->iomap) + fuse_iomap_mount(fm); + fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages); fc->minor = arg->minor; @@ -2126,7 +2129,9 @@ void fuse_conn_destroy(struct fuse_mount *fm) { struct fuse_conn *fc = fm->fc; - if (fc->destroy) { + if (fc->iomap) { + fuse_iomap_unmount(fm); + } else if (fc->destroy) { /* * Flush all pending requests before sending FUSE_DESTROY. The * fuse server must reply to the flushed requests before ^ permalink raw reply related [flat|nested] 191+ messages in thread
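The server-side half of this ordering (hold the destroy command until every open file has been released) might look roughly like the following. This is a hedged sketch, not libfuse code: `struct server` and the `server_*()` functions are invented for the illustration, and a real multithreaded server would need locking around the counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-mount server state; not taken from libfuse. */
struct server {
	unsigned int open_files;	/* opened but not yet released */
	bool destroy_pending;		/* DESTROY arrived while files were open */
	bool destroyed;			/* filesystem and device fds closed */
};

static void server_finish_destroy(struct server *s)
{
	/* A real server would flush metadata and close the block device. */
	s->destroy_pending = false;
	s->destroyed = true;
}

static void server_open(struct server *s)
{
	s->open_files++;
}

/* FUSE_RELEASE handler: the last release runs a deferred destroy. */
static void server_release(struct server *s)
{
	if (s->open_files > 0)
		s->open_files--;
	if (s->open_files == 0 && s->destroy_pending)
		server_finish_destroy(s);
}

/* FUSE_DESTROY handler: defer while any file is still unreleased. */
static void server_destroy(struct server *s)
{
	if (s->open_files > 0) {
		s->destroy_pending = true;
		return;
	}
	server_finish_destroy(s);
}
```

Deferring in the server rather than in the kernel keeps the unmount path free of waits, which matches the reasoning given in the commit message.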
* [PATCH 07/33] fuse: clean up per-file type inode initialization 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (5 preceding siblings ...) 2026-04-29 14:25 ` [PATCH 06/33] fuse: enable SYNCFS and ensure we flush everything before sending DESTROY Darrick J. Wong @ 2026-04-29 14:25 ` Darrick J. Wong 2026-04-29 14:25 ` [PATCH 08/33] fuse: create a per-inode flag for setting exclusive mode Darrick J. Wong ` (25 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:25 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Clean up the per-filetype inode initialization in fuse_init_inode before we start adding more functionality here. Primarily this consists of refactoring to use a switch statement and passing the file attr struct around. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 2 +- fs/fuse/file.c | 5 +++-- fs/fuse/inode.c | 36 +++++++++++++++++++++++++----------- 3 files changed, 29 insertions(+), 14 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 620489520ddc8e..740589d5dc6c2d 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1229,7 +1229,7 @@ int fuse_notify_poll_wakeup(struct fuse_conn *fc, /** * Initialize file operations on a regular file */ -void fuse_init_file_inode(struct inode *inode, unsigned int flags); +void fuse_init_file_inode(struct inode *inode, struct fuse_attr *attr); /** * Initialize inode operations on regular files and special files diff --git a/fs/fuse/file.c b/fs/fuse/file.c index c59452d60b8da0..5e7fe7ef87d2e4 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -3199,11 +3199,12 @@ static const struct address_space_operations fuse_file_aops = { .direct_IO = fuse_direct_IO, }; -void fuse_init_file_inode(struct inode *inode, unsigned int flags) +void fuse_init_file_inode(struct 
inode *inode, struct fuse_attr *attr) { struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_conn *fc = get_fuse_conn(inode); + fuse_init_common(inode); inode->i_fop = &fuse_file_operations; inode->i_data.a_ops = &fuse_file_aops; if (fc->writeback_cache) @@ -3217,5 +3218,5 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) init_waitqueue_head(&fi->direct_io_waitq); if (IS_ENABLED(CONFIG_FUSE_DAX)) - fuse_dax_inode_init(inode, flags); + fuse_dax_inode_init(inode, attr->flags); } diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 945afc30211d69..bf86edbfa22d6b 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -422,6 +422,12 @@ static void fuse_init_submount_lookup(struct fuse_submount_lookup *sl, refcount_set(&sl->count, 1); } +static void fuse_init_special(struct inode *inode, struct fuse_attr *attr) +{ + fuse_init_common(inode); + init_special_inode(inode, inode->i_mode, new_decode_dev(attr->rdev)); +} + static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr, struct fuse_conn *fc) { @@ -429,20 +435,28 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr, inode->i_size = attr->size; inode_set_mtime(inode, attr->mtime, attr->mtimensec); inode_set_ctime(inode, attr->ctime, attr->ctimensec); - if (S_ISREG(inode->i_mode)) { - fuse_init_common(inode); - fuse_init_file_inode(inode, attr->flags); - } else if (S_ISDIR(inode->i_mode)) + + switch (inode->i_mode & S_IFMT) { + case S_IFREG: + fuse_init_file_inode(inode, attr); + break; + case S_IFDIR: fuse_init_dir(inode); - else if (S_ISLNK(inode->i_mode)) + break; + case S_IFLNK: fuse_init_symlink(inode); - else if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) || - S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) { - fuse_init_common(inode); - init_special_inode(inode, inode->i_mode, - new_decode_dev(attr->rdev)); - } else + break; + case S_IFCHR: + case S_IFBLK: + case S_IFIFO: + case S_IFSOCK: + fuse_init_special(inode, attr); + break; + default: 
BUG(); + break; + } + /* * Ensure that we don't cache acls for daemons without FUSE_POSIX_ACL * so they see the exact same behavior as before. ^ permalink raw reply related [flat|nested] 191+ messages in thread
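The dispatch this refactoring introduces, one switch on the file-type bits instead of a chain of S_ISxxx() tests, can be tried out in userspace with the standard `<sys/stat.h>` macros. A minimal sketch: `init_path_for_mode()` is a made-up helper that only names the branch fuse_init_inode() would take, and "invalid" stands in for the BUG() case.

```c
#define _DEFAULT_SOURCE		/* expose S_IFLNK and S_IFSOCK on glibc */
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

/* Name the initialization branch chosen for a given mode. */
static const char *init_path_for_mode(mode_t mode)
{
	switch (mode & S_IFMT) {
	case S_IFREG:
		return "file";
	case S_IFDIR:
		return "dir";
	case S_IFLNK:
		return "symlink";
	case S_IFCHR:
	case S_IFBLK:
	case S_IFIFO:
	case S_IFSOCK:
		return "special";
	default:
		/* the kernel code calls BUG() here */
		return "invalid";
	}
}
```

Masking with S_IFMT first matters because the permission bits share the same word as the type bits.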
* [PATCH 08/33] fuse: create a per-inode flag for setting exclusive mode
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (6 preceding siblings ...)
2026-04-29 14:25 ` [PATCH 07/33] fuse: clean up per-file type inode initialization Darrick J. Wong
@ 2026-04-29 14:25 ` Darrick J. Wong
2026-04-29 14:26 ` [PATCH 09/33] fuse: create a per-inode flag for toggling iomap Darrick J. Wong
` (24 subsequent siblings)
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:25 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Create a per-inode flag to control whether or not this inode can use
exclusive mode. This enables more aggressive use of cached file
attributes for things such as ACL inheritance and transformations.

The initial user will be iomap filesystems, where we hoist the IO path
into the kernel and need ACLs to work for regular files and directories
in the manner that most local filesystems expect.

Signed-off-by: "Darrick J.
Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 17 +++++++++++++++++ include/uapi/linux/fuse.h | 4 ++++ fs/fuse/inode.c | 5 +++++ 3 files changed, 26 insertions(+) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 740589d5dc6c2d..07ad5abc48f70f 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1116,6 +1116,23 @@ static inline bool fuse_is_bad(struct inode *inode) return unlikely(test_bit(FUSE_I_BAD, &get_fuse_inode(inode)->state)); } +static inline void fuse_inode_set_exclusive(const struct fuse_conn *fc, + struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + /* This flag wasn't added until kernel API 7.99 */ + if (fc->minor >= 99) + set_bit(FUSE_I_EXCLUSIVE, &fi->state); +} + +static inline void fuse_inode_clear_exclusive(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + clear_bit(FUSE_I_EXCLUSIVE, &fi->state); +} + static inline bool fuse_inode_is_exclusive(const struct inode *inode) { const struct fuse_inode *fi = get_fuse_inode(inode); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 5ae6b05de623d7..2d35dcfbf8aaf5 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -244,6 +244,7 @@ * 7.99 * - XXX magic minor revision to make experimental code really obvious * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations + * - add FUSE_ATTR_EXCLUSIVE to enable exclusive mode for specific inodes */ #ifndef _LINUX_FUSE_H @@ -584,9 +585,12 @@ struct fuse_file_lock { * * FUSE_ATTR_SUBMOUNT: Object is a submount root * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode + * FUSE_ATTR_EXCLUSIVE: This file can only be modified by this mount, so the + * kernel can use cached attributes more aggressively (e.g. 
ACL inheritance) */ #define FUSE_ATTR_SUBMOUNT (1 << 0) #define FUSE_ATTR_DAX (1 << 1) +#define FUSE_ATTR_EXCLUSIVE (1 << 2) /** * Open flags diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index bf86edbfa22d6b..36814b5de46879 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -197,6 +197,8 @@ static void fuse_evict_inode(struct inode *inode) WARN_ON(!list_empty(&fi->write_files)); WARN_ON(!list_empty(&fi->queued_writes)); } + + fuse_inode_clear_exclusive(inode); } static int fuse_reconfigure(struct fs_context *fsc) @@ -436,6 +438,9 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr, inode_set_mtime(inode, attr->mtime, attr->mtimensec); inode_set_ctime(inode, attr->ctime, attr->ctimensec); + if (attr->flags & FUSE_ATTR_EXCLUSIVE) + fuse_inode_set_exclusive(fc, inode); + switch (inode->i_mode & S_IFMT) { case S_IFREG: fuse_init_file_inode(inode, attr); ^ permalink raw reply related [flat|nested] 191+ messages in thread
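For readers following along, the gating in patch 08 can be modelled in plain userspace C. This is only a sketch, not kernel code: struct model_conn, struct model_inode, and the MODEL_I_EXCLUSIVE bit position are hypothetical stand-ins for struct fuse_conn, struct fuse_inode, and FUSE_I_EXCLUSIVE. Only the (1 << 2) attribute bit and the minor >= 99 check are taken from the diff above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_ATTR_EXCLUSIVE	(1u << 2)	/* same bit as FUSE_ATTR_EXCLUSIVE */
#define MODEL_MIN_MINOR		99u		/* same threshold as the patch */
#define MODEL_I_EXCLUSIVE	(1ul << 0)	/* hypothetical state bit position */

/* Hypothetical stand-ins for struct fuse_conn and struct fuse_inode. */
struct model_conn {
	unsigned int minor;	/* negotiated protocol minor version */
};

struct model_inode {
	unsigned long state;	/* per-inode state bits */
};

/* Set the state bit only if the server asked for it AND the protocol is new enough. */
static void model_init_inode(const struct model_conn *fc,
			     struct model_inode *fi, uint32_t attr_flags)
{
	if (!(attr_flags & MODEL_ATTR_EXCLUSIVE))
		return;
	if (fc->minor >= MODEL_MIN_MINOR)
		fi->state |= MODEL_I_EXCLUSIVE;
}

/* Eviction always drops the bit, mirroring the unconditional clear in the patch. */
static void model_evict_inode(struct model_inode *fi)
{
	fi->state &= ~MODEL_I_EXCLUSIVE;
}

static bool model_inode_is_exclusive(const struct model_inode *fi)
{
	return (fi->state & MODEL_I_EXCLUSIVE) != 0;
}
```

The point of the model is that a server speaking an older protocol minor cannot switch an inode into exclusive mode even if it sets the attribute bit, so the flag is silently ignored rather than rejected.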
* [PATCH 09/33] fuse: create a per-inode flag for toggling iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (7 preceding siblings ...) 2026-04-29 14:25 ` [PATCH 08/33] fuse: create a per-inode flag for setting exclusive mode Darrick J. Wong @ 2026-04-29 14:26 ` Darrick J. Wong 2026-04-29 14:26 ` [PATCH 10/33] fuse_trace: " Darrick J. Wong ` (23 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:26 UTC (permalink / raw) To: djwong, miklos Cc: joannelkoong, joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Create a per-inode flag to control whether or not this inode actually uses iomap. This is required for non-regular files because iomap doesn't apply there; and enables fuse filesystems to provide some non-iomap files if desired. Note that more code will be added to fuse_iomap_init_inode in subsequent patches. Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> --- fs/fuse/fuse_i.h | 6 ++++-- fs/fuse/fuse_iomap.h | 13 ++++++++++++ include/uapi/linux/fuse.h | 3 +++ fs/fuse/dir.c | 14 +++++++++++-- fs/fuse/file.c | 3 +++ fs/fuse/fuse_iomap.c | 47 +++++++++++++++++++++++++++++++++++++++++++++ fs/fuse/inode.c | 6 ++++-- 7 files changed, 86 insertions(+), 6 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 07ad5abc48f70f..a4d64fd2837778 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -264,6 +264,8 @@ enum { * or the fuse server has an exclusive "lease" on distributed fs */ FUSE_I_EXCLUSIVE, + /* Use iomap for this inode */ + FUSE_I_IOMAP, }; struct fuse_conn; @@ -1256,12 +1258,12 @@ void fuse_init_common(struct inode *inode); /** * Initialize inode and file operations on a directory */ -void fuse_init_dir(struct inode *inode); +void fuse_init_dir(struct inode *inode, struct fuse_attr *attr); /** * Initialize inode operations on a symlink */ -void fuse_init_symlink(struct inode *inode); +void fuse_init_symlink(struct inode *inode, struct fuse_attr *attr); /** * Change attributes of an inode diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 129680b056ebea..34f2c75416eb62 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -23,11 +23,24 @@ extern const struct fuse_backing_ops fuse_iomap_backing_ops; void fuse_iomap_mount(struct fuse_mount *fm); void fuse_iomap_unmount(struct fuse_mount *fm); + +void fuse_iomap_init_inode(struct inode *inode, struct fuse_attr *attr); +void fuse_iomap_evict_inode(struct inode *inode); + +static inline bool fuse_inode_has_iomap(const struct inode *inode) +{ + const struct fuse_inode *fi = get_fuse_inode(inode); + + return test_bit(FUSE_I_IOMAP, &fi->state); +} #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) # define fuse_iomap_mount(...) ((void)0) # define fuse_iomap_unmount(...) ((void)0) +# define fuse_iomap_init_inode(...) 
((void)0) +# define fuse_iomap_evict_inode(...) ((void)0) +# define fuse_inode_has_iomap(...) (false) #endif /* CONFIG_FUSE_IOMAP */ #endif /* _FS_FUSE_IOMAP_H */ diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 2d35dcfbf8aaf5..88f76f4be749a7 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -245,6 +245,7 @@ * - XXX magic minor revision to make experimental code really obvious * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations * - add FUSE_ATTR_EXCLUSIVE to enable exclusive mode for specific inodes + * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes */ #ifndef _LINUX_FUSE_H @@ -587,10 +588,12 @@ struct fuse_file_lock { * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode * FUSE_ATTR_EXCLUSIVE: This file can only be modified by this mount, so the * kernel can use cached attributes more aggressively (e.g. ACL inheritance) + * FUSE_ATTR_IOMAP: Use iomap for this inode */ #define FUSE_ATTR_SUBMOUNT (1 << 0) #define FUSE_ATTR_DAX (1 << 1) #define FUSE_ATTR_EXCLUSIVE (1 << 2) +#define FUSE_ATTR_IOMAP (1 << 3) /** * Open flags diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index c5c97065984557..996d81f263d637 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -7,6 +7,7 @@ */ #include "fuse_i.h" +#include "fuse_iomap.h" #include <linux/pagemap.h> #include <linux/file.h> @@ -2514,9 +2515,10 @@ void fuse_init_common(struct inode *inode) inode->i_op = &fuse_common_inode_operations; } -void fuse_init_dir(struct inode *inode) +void fuse_init_dir(struct inode *inode, struct fuse_attr *attr) { struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_conn *fc = get_fuse_conn(inode); inode->i_op = &fuse_dir_inode_operations; inode->i_fop = &fuse_dir_operations; @@ -2526,6 +2528,9 @@ void fuse_init_dir(struct inode *inode) fi->rdc.size = 0; fi->rdc.pos = 0; fi->rdc.version = 0; + + if (fc->iomap) + fuse_iomap_init_inode(inode, attr); } static int fuse_symlink_read_folio(struct file *null, 
struct folio *folio) @@ -2544,9 +2549,14 @@ static const struct address_space_operations fuse_symlink_aops = { .read_folio = fuse_symlink_read_folio, }; -void fuse_init_symlink(struct inode *inode) +void fuse_init_symlink(struct inode *inode, struct fuse_attr *attr) { + struct fuse_conn *fc = get_fuse_conn(inode); + inode->i_op = &fuse_symlink_inode_operations; inode->i_data.a_ops = &fuse_symlink_aops; inode_nohighmem(inode); + + if (fc->iomap) + fuse_iomap_init_inode(inode, attr); } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 5e7fe7ef87d2e4..8d55d2f4dd4cc9 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -7,6 +7,7 @@ */ #include "fuse_i.h" +#include "fuse_iomap.h" #include <linux/pagemap.h> #include <linux/slab.h> @@ -3217,6 +3218,8 @@ void fuse_init_file_inode(struct inode *inode, struct fuse_attr *attr) init_waitqueue_head(&fi->page_waitq); init_waitqueue_head(&fi->direct_io_waitq); + if (fc->iomap) + fuse_iomap_init_inode(inode, attr); if (IS_ENABLED(CONFIG_FUSE_DAX)) fuse_dax_inode_init(inode, attr->flags); } diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 4b63bb32167877..39c9239c64435a 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -616,3 +616,50 @@ void fuse_iomap_unmount(struct fuse_mount *fm) fuse_flush_requests(fc); fuse_send_destroy(fm); } + +static inline void fuse_inode_set_iomap(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + set_bit(FUSE_I_IOMAP, &fi->state); +} + +static inline void fuse_inode_clear_iomap(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + clear_bit(FUSE_I_IOMAP, &fi->state); +} + +void fuse_iomap_init_inode(struct inode *inode, struct fuse_attr *attr) +{ + ASSERT(get_fuse_conn(inode)->iomap); + + if (!(attr->flags & FUSE_ATTR_IOMAP)) + return; + + /* + * Any file being used in conjunction with iomap must also have the + * exclusive flag set because iomap requires cached file attributes to + * be correct at any time. 
This applies even to non-regular files + * (e.g. directories) because we need to do ACL and attribute + * inheritance the same way a local filesystem would do. If exclusive + * mode isn't set, then we won't use iomap. + */ + if (!fuse_inode_is_exclusive(inode)) { + ASSERT(fuse_inode_is_exclusive(inode)); + return; + } + + if (!S_ISREG(inode->i_mode)) + return; + + fuse_inode_set_iomap(inode); +} + +void fuse_iomap_evict_inode(struct inode *inode) +{ + ASSERT(fuse_inode_has_iomap(inode)); + + fuse_inode_clear_iomap(inode); +} diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 36814b5de46879..f8e5c03580e56b 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -198,6 +198,8 @@ static void fuse_evict_inode(struct inode *inode) WARN_ON(!list_empty(&fi->queued_writes)); } + if (fuse_inode_has_iomap(inode)) + fuse_iomap_evict_inode(inode); fuse_inode_clear_exclusive(inode); } @@ -446,10 +448,10 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr, fuse_init_file_inode(inode, attr); break; case S_IFDIR: - fuse_init_dir(inode); + fuse_init_dir(inode, attr); break; case S_IFLNK: - fuse_init_symlink(inode); + fuse_init_symlink(inode, attr); break; case S_IFCHR: case S_IFBLK: ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 10/33] fuse_trace: create a per-inode flag for toggling iomap
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (8 preceding siblings ...)
2026-04-29 14:26 ` [PATCH 09/33] fuse: create a per-inode flag for toggling iomap Darrick J. Wong
@ 2026-04-29 14:26 ` Darrick J. Wong
2026-04-29 14:26 ` [PATCH 11/33] fuse: isolate the other regular file IO paths from iomap Darrick J. Wong
` (22 subsequent siblings)
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:26 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/fuse_iomap.c | 8 +++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index af21654d797f45..fac981e2a30df0 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -300,6 +300,25 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
 	{ FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \
 	{ FUSE_IOMAP_TYPE_INLINE, "inline" }
 
+TRACE_DEFINE_ENUM(FUSE_I_ADVISE_RDPLUS);
+TRACE_DEFINE_ENUM(FUSE_I_INIT_RDPLUS);
+TRACE_DEFINE_ENUM(FUSE_I_SIZE_UNSTABLE);
+TRACE_DEFINE_ENUM(FUSE_I_BAD);
+TRACE_DEFINE_ENUM(FUSE_I_BTIME);
+TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
+TRACE_DEFINE_ENUM(FUSE_I_EXCLUSIVE);
+TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
+
+#define FUSE_IFLAG_STRINGS \
+	{ 1 << FUSE_I_ADVISE_RDPLUS, "advise_rdplus" }, \
+	{ 1 << FUSE_I_INIT_RDPLUS, "init_rdplus" }, \
+	{ 1 << FUSE_I_SIZE_UNSTABLE, "size_unstable" }, \
+	{ 1 << FUSE_I_BAD, "bad" }, \
+	{ 1 << FUSE_I_BTIME, "btime" }, \
+	{ 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \
+	{ 1 << FUSE_I_EXCLUSIVE, "excl" }, \
+	{ 1 << FUSE_I_IOMAP, "iomap" }
+
 DECLARE_EVENT_CLASS(fuse_iomap_check_class,
 	TP_PROTO(const char *func, int line, const char *condition),
 
@@ -488,6 +507,31 @@ TRACE_EVENT(fuse_iomap_dev_add,
 		  __entry->fd,
 		  __entry->flags)
 );
+
+DECLARE_EVENT_CLASS(fuse_inode_state_class,
+	TP_PROTO(const struct inode *inode),
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		FUSE_INODE_FIELDS
+		__field(unsigned long, state)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->state = fi->state;
+	),
+
+	TP_printk(FUSE_INODE_FMT " state (%s)",
+		  FUSE_INODE_PRINTK_ARGS,
+		  __print_flags(__entry->state, "|", FUSE_IFLAG_STRINGS))
+);
+#define DEFINE_FUSE_INODE_STATE_EVENT(name) \
+DEFINE_EVENT(fuse_inode_state_class, name, \
+	TP_PROTO(const struct inode *inode), \
+	TP_ARGS(inode))
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 39c9239c64435a..dccfc9a2c9847c 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -651,15 +651,21 @@ void fuse_iomap_init_inode(struct inode *inode, struct fuse_attr *attr)
 		return;
 	}
 
-	if (!S_ISREG(inode->i_mode))
+	if (!S_ISREG(inode->i_mode)) {
+		trace_fuse_iomap_init_inode(inode);
 		return;
+	}
 
 	fuse_inode_set_iomap(inode);
+
+	trace_fuse_iomap_init_inode(inode);
 }
 
 void fuse_iomap_evict_inode(struct inode *inode)
 {
 	ASSERT(fuse_inode_has_iomap(inode));
 
+	trace_fuse_iomap_evict_inode(inode);
+
 	fuse_inode_clear_iomap(inode);
 }
^ permalink raw reply related [flat|nested] 191+ messages in thread
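The state tracepoints in patch 10 print the inode state word as a "|"-separated list of names via __print_flags and FUSE_IFLAG_STRINGS. A rough userspace approximation of that decoding is sketched below. The table, the bit positions, and the decode_iflags name are assumptions for illustration; only the "excl" and "iomap" strings come from the patch. Unlike the kernel helper, this sketch simply ignores bits that have no name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct flag_name {
	unsigned long mask;
	const char *name;
};

/* Bit positions are hypothetical; only the names come from the patch. */
static const struct flag_name iflag_names[] = {
	{ 1ul << 0, "excl" },
	{ 1ul << 1, "iomap" },
};

/*
 * Write the names of all set bits into buf, separated by '|'.
 * Returns the number of characters of complete names written; if the
 * buffer is too small, decoding stops early and the tail may be cut off.
 */
static size_t decode_iflags(unsigned long state, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(iflag_names) / sizeof(iflag_names[0]); i++) {
		int n;

		if (!(state & iflag_names[i].mask))
			continue;

		/* Prefix a separator for every name after the first. */
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", iflag_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated; stop here */
		used += (size_t)n;
	}
	return used;
}
```

With both model bits set, the buffer ends up holding the two names joined by the separator, which is the shape of the "state (...)" field in the trace output.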
* [PATCH 11/33] fuse: isolate the other regular file IO paths from iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (9 preceding siblings ...) 2026-04-29 14:26 ` [PATCH 10/33] fuse_trace: " Darrick J. Wong @ 2026-04-29 14:26 ` Darrick J. Wong 2026-04-29 14:26 ` [PATCH 12/33] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong ` (21 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:26 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> iomap completely takes over all regular file IO, so we don't need to access any of the other mechanisms at all. Gate them off so that we can eventually overlay them with a union to save space in struct fuse_inode. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/dir.c | 14 +++++++++----- fs/fuse/file.c | 18 +++++++++++++----- fs/fuse/inode.c | 3 ++- fs/fuse/iomode.c | 3 ++- 4 files changed, 26 insertions(+), 12 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 996d81f263d637..1bb0302f7ce8bb 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -2200,6 +2200,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry, FUSE_ARGS(args); struct fuse_setattr_in inarg; struct fuse_attr_out outarg; + const bool is_iomap = fuse_inode_has_iomap(inode); bool is_truncate = false; bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode); loff_t oldsize; @@ -2257,12 +2258,15 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry, if (err) return err; - fuse_set_nowrite(inode); - fuse_release_nowrite(inode); + if (!is_iomap) { + fuse_set_nowrite(inode); + fuse_release_nowrite(inode); + } } if (is_truncate) { - fuse_set_nowrite(inode); + if (!is_iomap) + fuse_set_nowrite(inode); set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); if 
(trust_local_cmtime && attr->ia_size != inode->i_size) attr->ia_valid |= ATTR_MTIME | ATTR_CTIME; @@ -2334,7 +2338,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry, if (!is_wb || is_truncate) i_size_write(inode, outarg.attr.size); - if (is_truncate) { + if (is_truncate && !is_iomap) { /* NOTE: this may release/reacquire fi->lock */ __fuse_release_nowrite(inode); } @@ -2358,7 +2362,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry, return 0; error: - if (is_truncate) + if (is_truncate && !is_iomap) fuse_release_nowrite(inode); clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 8d55d2f4dd4cc9..2a807c49792d53 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -257,6 +257,7 @@ static int fuse_open(struct inode *inode, struct file *file) struct fuse_conn *fc = fm->fc; struct fuse_file *ff; int err; + const bool is_iomap = fuse_inode_has_iomap(inode); bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc; bool is_wb_truncate = is_truncate && fc->writeback_cache; bool dax_truncate = is_truncate && FUSE_IS_DAX(inode); @@ -278,7 +279,7 @@ static int fuse_open(struct inode *inode, struct file *file) goto out_inode_unlock; } - if (is_wb_truncate || dax_truncate) + if ((is_wb_truncate || dax_truncate) && !is_iomap) fuse_set_nowrite(inode); err = fuse_do_open(fm, get_node_id(inode), file, false); @@ -291,7 +292,7 @@ static int fuse_open(struct inode *inode, struct file *file) fuse_truncate_update_attr(inode, file); } - if (is_wb_truncate || dax_truncate) + if ((is_wb_truncate || dax_truncate) && !is_iomap) fuse_release_nowrite(inode); if (!err) { if (is_truncate) @@ -539,12 +540,14 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end, { struct inode *inode = file->f_mapping->host; struct fuse_conn *fc = get_fuse_conn(inode); + const bool need_sync_writes = !fuse_inode_has_iomap(inode); int err; if (fuse_is_bad(inode)) return -EIO; - inode_lock(inode); + if 
(need_sync_writes) + inode_lock(inode); /* * Start writeback against all dirty pages of the inode, then @@ -555,7 +558,8 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end, if (err) goto out; - fuse_sync_writes(inode); + if (need_sync_writes) + fuse_sync_writes(inode); /* * Due to implementation of fuse writeback @@ -579,7 +583,8 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end, err = 0; } out: - inode_unlock(inode); + if (need_sync_writes) + inode_unlock(inode); return err; } @@ -2015,6 +2020,9 @@ static struct fuse_file *__fuse_write_file_get(struct fuse_inode *fi) { struct fuse_file *ff; + if (fuse_inode_has_iomap(&fi->inode)) + return NULL; + spin_lock(&fi->lock); ff = list_first_entry_or_null(&fi->write_files, struct fuse_file, write_entry); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index f8e5c03580e56b..3cb2ef161ef7c5 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -192,7 +192,8 @@ static void fuse_evict_inode(struct inode *inode) if (inode->i_nlink > 0) atomic64_inc(&fc->evict_ctr); } - if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode)) { + if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode) && + !fuse_inode_has_iomap(inode)) { WARN_ON(fi->iocachectr != 0); WARN_ON(!list_empty(&fi->write_files)); WARN_ON(!list_empty(&fi->queued_writes)); diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c index 3728933188f307..9be9ae3520003e 100644 --- a/fs/fuse/iomode.c +++ b/fs/fuse/iomode.c @@ -6,6 +6,7 @@ */ #include "fuse_i.h" +#include "fuse_iomap.h" #include <linux/kernel.h> #include <linux/sched.h> @@ -203,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode) * io modes are not relevant with DAX and with server that does not * implement open. */ - if (FUSE_IS_DAX(inode) || !ff->args) + if (fuse_inode_has_iomap(inode) || FUSE_IS_DAX(inode) || !ff->args) return 0; /* ^ permalink raw reply related [flat|nested] 191+ messages in thread
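Patch 11 mentions eventually overlaying the non-iomap IO state with a union to save space in struct fuse_inode. The sketch below illustrates why the gating matters for that plan: once no code path touches both sets of fields on the same inode, they can share storage. All structure and field names here are hypothetical and are not the layout of struct fuse_inode.

```c
#include <stddef.h>

/* Hypothetical state used only by the non-iomap IO paths. */
struct legacy_io_state {
	long iocachectr;
	void *write_files;
	void *queued_writes;
};

/* Hypothetical state used only by the iomap IO paths. */
struct iomap_io_state {
	long disk_size;
};

/* Both sets of fields present at once. */
struct inode_separate {
	unsigned long state;
	struct legacy_io_state legacy;
	struct iomap_io_state iomap;
};

/* Fields overlaid; a state bit says which member is live. */
struct inode_overlaid {
	unsigned long state;
	union {
		struct legacy_io_state legacy;
		struct iomap_io_state iomap;
	} io;
};

/* Bytes saved per inode by overlaying the two states. */
static size_t bytes_saved_by_overlay(void)
{
	return sizeof(struct inode_separate) - sizeof(struct inode_overlaid);
}
```

The overlay is only safe if every access to the legacy member is guarded by the equivalent of !fuse_inode_has_iomap(inode), which is what the hunks in this patch add.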
* [PATCH 12/33] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (10 preceding siblings ...) 2026-04-29 14:26 ` [PATCH 11/33] fuse: isolate the other regular file IO paths from iomap Darrick J. Wong @ 2026-04-29 14:26 ` Darrick J. Wong 2026-04-29 14:27 ` [PATCH 13/33] fuse_trace: " Darrick J. Wong ` (20 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:26 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Implement the basic file mapping reporting functions like FIEMAP, BMAP, and SEEK_DATA/HOLE. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_iomap.h | 8 ++++++ fs/fuse/dir.c | 1 + fs/fuse/file.c | 13 ++++++++++ fs/fuse/fuse_iomap.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 89 insertions(+), 1 deletion(-) diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 34f2c75416eb62..8ba30a496545f5 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -33,6 +33,11 @@ static inline bool fuse_inode_has_iomap(const struct inode *inode) return test_bit(FUSE_I_IOMAP, &fi->state); } + +int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 length); +loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence); +sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block); #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) @@ -41,6 +46,9 @@ static inline bool fuse_inode_has_iomap(const struct inode *inode) # define fuse_iomap_init_inode(...) ((void)0) # define fuse_iomap_evict_inode(...) ((void)0) # define fuse_inode_has_iomap(...) (false) +# define fuse_iomap_fiemap NULL +# define fuse_iomap_lseek(...) 
(-ENOSYS) +# define fuse_iomap_bmap(...) (-ENOSYS) #endif /* CONFIG_FUSE_IOMAP */ #endif /* _FS_FUSE_IOMAP_H */ diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 1bb0302f7ce8bb..2ed19e21d90702 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -2505,6 +2505,7 @@ static const struct inode_operations fuse_common_inode_operations = { .set_acl = fuse_set_acl, .fileattr_get = fuse_fileattr_get, .fileattr_set = fuse_fileattr_set, + .fiemap = fuse_iomap_fiemap, }; static const struct inode_operations fuse_symlink_inode_operations = { diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 2a807c49792d53..67cb0844181851 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -2593,6 +2593,12 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block) struct fuse_bmap_out outarg; int err; + if (fuse_inode_has_iomap(inode)) { + sector_t alt_sec = fuse_iomap_bmap(mapping, block); + if (alt_sec > 0) + return alt_sec; + } + if (!inode->i_sb->s_bdev || fm->fc->no_bmap) return 0; @@ -2628,6 +2634,13 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence) struct fuse_lseek_out outarg; int err; + if (fuse_inode_has_iomap(inode)) { + loff_t alt_pos = fuse_iomap_lseek(file, offset, whence); + + if (alt_pos != -ENOSYS) + return alt_pos; + } + if (fm->fc->no_lseek) goto fallback; diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index dccfc9a2c9847c..32ddf2fa6bdf78 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -4,6 +4,7 @@ * Author: Darrick J. 
Wong <djwong@kernel.org> */ #include <linux/iomap.h> +#include <linux/fiemap.h> #include "fuse_i.h" #include "fuse_trace.h" #include "fuse_iomap.h" @@ -539,7 +540,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count, return 0; } -const struct iomap_ops fuse_iomap_ops = { +static const struct iomap_ops fuse_iomap_ops = { .iomap_begin = fuse_iomap_begin, .iomap_end = fuse_iomap_end, }; @@ -669,3 +670,68 @@ void fuse_iomap_evict_inode(struct inode *inode) fuse_inode_clear_iomap(inode); } + +int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 count) +{ + struct fuse_conn *fc = get_fuse_conn(inode); + int error; + + /* + * We are called directly from the vfs so we need to check per-inode + * support here explicitly. + */ + if (!fuse_inode_has_iomap(inode)) + return -EOPNOTSUPP; + + if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR) + return -EOPNOTSUPP; + + if (fuse_is_bad(inode)) + return -EIO; + + if (!fuse_allow_current_process(fc)) + return -EACCES; + + inode_lock_shared(inode); + error = iomap_fiemap(inode, fieinfo, start, count, &fuse_iomap_ops); + inode_unlock_shared(inode); + + return error; +} + +sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block) +{ + ASSERT(fuse_inode_has_iomap(mapping->host)); + + return iomap_bmap(mapping, block, &fuse_iomap_ops); +} + +loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence) +{ + struct inode *inode = file->f_mapping->host; + struct fuse_conn *fc = get_fuse_conn(inode); + + ASSERT(fuse_inode_has_iomap(inode)); + + if (fuse_is_bad(inode)) + return -EIO; + + if (!fuse_allow_current_process(fc)) + return -EACCES; + + switch (whence) { + case SEEK_HOLE: + offset = iomap_seek_hole(inode, offset, &fuse_iomap_ops); + break; + case SEEK_DATA: + offset = iomap_seek_data(inode, offset, &fuse_iomap_ops); + break; + default: + return generic_file_llseek(file, offset, whence); + } + + if (offset < 0) + return offset; + return vfs_setpos(file, 
offset, inode->i_sb->s_maxbytes); +} ^ permalink raw reply related [flat|nested] 191+ messages in thread
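From userspace, the SEEK_DATA/SEEK_HOLE support wired up in patch 12 is exercised through ordinary lseek(2) calls. The sketch below walks a file's data extents and sums their lengths. count_data_bytes is a hypothetical helper name; the fallback definitions of SEEK_DATA and SEEK_HOLE use the Linux values and are only there in case the libc headers do not expose them. Nothing here is specific to fuse: on a filesystem without real hole reporting, the whole file is treated as one data extent.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the headers did not define them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Sum the lengths of all data extents in the file behind fd.
 * Returns the byte count, or -1 on error (errno is left set by lseek).
 */
static off_t count_data_bytes(int fd)
{
	off_t total = 0;
	off_t pos = 0;

	for (;;) {
		off_t data, hole;

		/* Find the start of the next data extent at or after pos. */
		data = lseek(fd, pos, SEEK_DATA);
		if (data < 0)
			return errno == ENXIO ? total : -1; /* ENXIO: no more data */

		/* Find where that extent ends (EOF counts as a hole). */
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole <= data)
			return -1;

		total += hole - data;
		pos = hole;
	}
}
```

Each loop iteration issues two lseek calls, so on an iomap-enabled fuse inode each iteration would presumably be served by the iomap_seek_data and iomap_seek_hole paths shown in the diff rather than by a FUSE_LSEEK upcall.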
* [PATCH 13/33] fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (11 preceding siblings ...)
2026-04-29 14:26 ` [PATCH 12/33] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
@ 2026-04-29 14:27 ` Darrick J. Wong
2026-04-29 14:27 ` [PATCH 14/33] fuse: implement direct IO with iomap Darrick J. Wong
` (19 subsequent siblings)
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:27 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/fuse_iomap.c | 4 ++++
 2 files changed, 50 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index fac981e2a30df0..730ab8bce44450 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -532,6 +532,52 @@ DEFINE_EVENT(fuse_inode_state_class, name, \
 	TP_ARGS(inode))
 DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
 DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
+
+TRACE_EVENT(fuse_iomap_fiemap,
+	TP_PROTO(const struct inode *inode, u64 start, u64 count,
+		 unsigned int flags),
+
+	TP_ARGS(inode, start, count, flags),
+
+	TP_STRUCT__entry(
+		FUSE_IO_RANGE_FIELDS()
+		__field(unsigned int, flags)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->offset = start;
+		__entry->length = count;
+		__entry->flags = flags;
+	),
+
+	TP_printk(FUSE_IO_RANGE_FMT("fiemap") " flags 0x%x",
+		  FUSE_IO_RANGE_PRINTK_ARGS(),
+		  __entry->flags)
+);
+
+TRACE_EVENT(fuse_iomap_lseek,
+	TP_PROTO(const struct inode *inode, loff_t offset, int whence),
+
+	TP_ARGS(inode, offset, whence),
+
+	TP_STRUCT__entry(
+		FUSE_INODE_FIELDS
+		__field(loff_t, offset)
+		__field(int, whence)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->offset = offset;
+		__entry->whence = whence;
+	),
+
+	TP_printk(FUSE_INODE_FMT " offset 0x%llx whence %d",
+		  FUSE_INODE_PRINTK_ARGS,
+		  __entry->offset,
+		  __entry->whence)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 32ddf2fa6bdf78..be922888ae9e8a 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -693,6 +693,8 @@ int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	if (!fuse_allow_current_process(fc))
 		return -EACCES;
 
+	trace_fuse_iomap_fiemap(inode, start, count, fieinfo->fi_flags);
+
 	inode_lock_shared(inode);
 	error = iomap_fiemap(inode, fieinfo, start, count, &fuse_iomap_ops);
 	inode_unlock_shared(inode);
@@ -720,6 +722,8 @@ loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
 	if (!fuse_allow_current_process(fc))
 		return -EACCES;
 
+	trace_fuse_iomap_lseek(inode, offset, whence);
+
 	switch (whence) {
 	case SEEK_HOLE:
 		offset = iomap_seek_hole(inode, offset, &fuse_iomap_ops);
^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 14/33] fuse: implement direct IO with iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (12 preceding siblings ...) 2026-04-29 14:27 ` [PATCH 13/33] fuse_trace: " Darrick J. Wong @ 2026-04-29 14:27 ` Darrick J. Wong 2026-04-29 14:27 ` [PATCH 15/33] fuse_trace: " Darrick J. Wong ` (18 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:27 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Start implementing the fuse-iomap file I/O paths by adding direct I/O support and all the signalling flags that come with it. Buffered I/O is much more complicated, so we leave that to a subsequent patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 20 +++ fs/fuse/fuse_iomap.h | 18 ++ include/uapi/linux/fuse.h | 27 ++++ fs/fuse/dir.c | 13 ++ fs/fuse/file.c | 29 +++- fs/fuse/fuse_iomap.c | 335 +++++++++++++++++++++++++++++++++++++++++++++ fs/fuse/inode.c | 3 fs/fuse/iomode.c | 2 8 files changed, 441 insertions(+), 6 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index a4d64fd2837778..ea95615fbff598 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -188,6 +188,11 @@ struct fuse_inode { /* waitq for direct-io completion */ wait_queue_head_t direct_io_waitq; + +#ifdef CONFIG_FUSE_IOMAP + /* file size as reported by fuse server */ + loff_t i_disk_size; +#endif }; /* readdir cache (directory only) */ @@ -658,6 +663,16 @@ struct fuse_sync_bucket { struct rcu_head rcu; }; +#ifdef CONFIG_FUSE_IOMAP +struct fuse_iomap_conn { + /* fuse server doesn't implement iomap_end */ + unsigned int no_end:1; + + /* fuse server doesn't implement iomap_ioend */ + unsigned int no_ioend:1; +}; +#endif + /** * A Fuse connection. 
* @@ -1006,6 +1021,11 @@ struct fuse_conn { struct idr backing_files_map; #endif +#ifdef CONFIG_FUSE_IOMAP + /** iomap information */ + struct fuse_iomap_conn iomap_conn; +#endif + #ifdef CONFIG_FUSE_IO_URING /** uring connection information*/ struct fuse_ring *ring; diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 8ba30a496545f5..476e1b869d1906 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -38,6 +38,17 @@ int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 length); loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence); sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block); + +void fuse_iomap_open(struct inode *inode, struct file *file); +int fuse_iomap_finish_open(const struct fuse_file *ff, + const struct inode *inode); +void fuse_iomap_open_truncate(struct inode *inode); + +void fuse_iomap_set_disk_size(struct fuse_inode *fi, loff_t newsize); +int fuse_iomap_setsize_finish(struct inode *inode, loff_t newsize); + +ssize_t fuse_iomap_read_iter(struct kiocb *iocb, struct iov_iter *to); +ssize_t fuse_iomap_write_iter(struct kiocb *iocb, struct iov_iter *from); #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) @@ -49,6 +60,13 @@ sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block); # define fuse_iomap_fiemap NULL # define fuse_iomap_lseek(...) (-ENOSYS) # define fuse_iomap_bmap(...) (-ENOSYS) +# define fuse_iomap_open(...) ((void)0) +# define fuse_iomap_finish_open(...) (-ENOSYS) +# define fuse_iomap_open_truncate(...) ((void)0) +# define fuse_iomap_set_disk_size(...) ((void)0) +# define fuse_iomap_setsize_finish(...) (-ENOSYS) +# define fuse_iomap_read_iter(...) (-ENOSYS) +# define fuse_iomap_write_iter(...) 
(-ENOSYS) #endif /* CONFIG_FUSE_IOMAP */ #endif /* _FS_FUSE_IOMAP_H */ diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 88f76f4be749a7..543965b2f8fb37 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -677,6 +677,7 @@ enum fuse_opcode { FUSE_STATX = 52, FUSE_COPY_FILE_RANGE_64 = 53, + FUSE_IOMAP_IOEND = 4093, FUSE_IOMAP_BEGIN = 4094, FUSE_IOMAP_END = 4095, @@ -1411,4 +1412,30 @@ struct fuse_iomap_end_in { struct fuse_iomap_io map; }; +/* out of place write extent */ +#define FUSE_IOMAP_IOEND_SHARED (1U << 0) +/* unwritten extent */ +#define FUSE_IOMAP_IOEND_UNWRITTEN (1U << 1) +/* don't merge into previous ioend */ +#define FUSE_IOMAP_IOEND_BOUNDARY (1U << 2) +/* is direct I/O */ +#define FUSE_IOMAP_IOEND_DIRECT (1U << 3) +/* is append ioend */ +#define FUSE_IOMAP_IOEND_APPEND (1U << 4) + +struct fuse_iomap_ioend_in { + uint32_t flags; /* FUSE_IOMAP_IOEND_* */ + int32_t error; /* negative errno or 0 */ + uint64_t attr_ino; /* matches fuse_attr:ino */ + uint64_t pos; /* file position, in bytes */ + uint64_t new_addr; /* disk offset of new mapping, in bytes */ + uint64_t written; /* bytes processed */ + uint32_t dev; /* device cookie */ + uint32_t pad; /* zero */ +}; + +struct fuse_iomap_ioend_out { + uint64_t newsize; /* new ondisk size */ +}; + #endif /* _LINUX_FUSE_H */ diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 2ed19e21d90702..944ed53852444c 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -911,6 +911,10 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir, goto out_acl_release; fuse_dir_changed(dir); + + if (fuse_inode_has_iomap(inode)) + fuse_iomap_open(inode, file); + err = generic_file_open(inode, file); if (!err) { file->private_data = ff; @@ -1952,6 +1956,9 @@ static int fuse_dir_open(struct inode *inode, struct file *file) if (fuse_is_bad(inode)) return -EIO; + if (fuse_inode_has_iomap(inode)) + fuse_iomap_open(inode, file); + err = generic_file_open(inode, file); if (err) return 
err; @@ -2312,6 +2319,12 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry, goto error; } + if (is_iomap && is_truncate) { + err = fuse_iomap_setsize_finish(inode, outarg.attr.size); + if (err) + goto error; + } + spin_lock(&fi->lock); /* the kernel maintains i_mtime locally */ if (trust_local_cmtime) { diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 67cb0844181851..ad28bcd36d15e6 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -222,7 +222,10 @@ int fuse_finish_open(struct inode *inode, struct file *file) struct fuse_conn *fc = get_fuse_conn(inode); int err; - err = fuse_file_io_open(file, inode); + if (fuse_inode_has_iomap(inode)) + err = fuse_iomap_finish_open(ff, inode); + else + err = fuse_file_io_open(file, inode); if (err) return err; @@ -265,6 +268,9 @@ static int fuse_open(struct inode *inode, struct file *file) if (fuse_is_bad(inode)) return -EIO; + if (is_iomap) + fuse_iomap_open(inode, file); + err = generic_file_open(inode, file); if (err) return err; @@ -295,9 +301,11 @@ static int fuse_open(struct inode *inode, struct file *file) if ((is_wb_truncate || dax_truncate) && !is_iomap) fuse_release_nowrite(inode); if (!err) { - if (is_truncate) + if (is_truncate) { truncate_pagecache(inode, 0); - else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) + if (is_iomap) + fuse_iomap_open_truncate(inode); + } else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) invalidate_inode_pages2(inode->i_mapping); } if (dax_truncate) @@ -1827,6 +1835,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to) if (fuse_is_bad(inode)) return -EIO; + if (fuse_inode_has_iomap(inode)) + return fuse_iomap_read_iter(iocb, to); + if (FUSE_IS_DAX(inode)) return fuse_dax_read_iter(iocb, to); @@ -1848,6 +1859,9 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from) if (fuse_is_bad(inode)) return -EIO; + if (fuse_inode_has_iomap(inode)) + return fuse_iomap_write_iter(iocb, from); + if (FUSE_IS_DAX(inode)) return 
fuse_dax_write_iter(iocb, from); @@ -2960,7 +2974,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, .length = length, .mode = mode }; + loff_t newsize = 0; int err; + const bool is_iomap = fuse_inode_has_iomap(inode); bool block_faults = FUSE_IS_DAX(inode) && (!(mode & FALLOC_FL_KEEP_SIZE) || (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))); @@ -2993,6 +3009,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, err = inode_newsize_ok(inode, offset + length); if (err) goto out; + newsize = offset + length; } err = file_modified(file); @@ -3017,6 +3034,12 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, /* we could have extended the file */ if (!(mode & FALLOC_FL_KEEP_SIZE)) { + if (is_iomap && newsize > 0) { + err = fuse_iomap_setsize_finish(inode, newsize); + if (err) + goto out; + } + if (fuse_write_update_attr(inode, offset + length, length)) file_update_time(file); } diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index be922888ae9e8a..833ac0ce682838 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -476,10 +476,15 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, } /* Decide if we send FUSE_IOMAP_END to the fuse server */ -static bool fuse_should_send_iomap_end(const struct iomap *iomap, +static bool fuse_should_send_iomap_end(const struct fuse_mount *fm, + const struct iomap *iomap, unsigned int opflags, loff_t count, ssize_t written) { + /* Not implemented on fuse server */ + if (fm->fc->iomap_conn.no_end) + return false; + /* fuse server demanded an iomap_end call. 
*/ if (iomap->flags & FUSE_IOMAP_F_WANT_IOMAP_END) return true; @@ -504,7 +509,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count, struct fuse_mount *fm = get_fuse_mount(inode); int err; - if (fuse_should_send_iomap_end(iomap, opflags, count, written)) { + if (fuse_should_send_iomap_end(fm, iomap, opflags, count, written)) { struct fuse_iomap_end_in inarg = { .opflags = fuse_iomap_op_to_server(opflags), .attr_ino = fi->orig_ino, @@ -529,6 +534,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count, * libfuse returns ENOSYS for servers that don't * implement iomap_end */ + fm->fc->iomap_conn.no_end = 1; err = 0; } if (err) { @@ -545,6 +551,119 @@ static const struct iomap_ops fuse_iomap_ops = { .iomap_end = fuse_iomap_end, }; +static inline bool +fuse_should_send_iomap_ioend(const struct fuse_mount *fm, + const struct fuse_iomap_ioend_in *inarg) +{ + /* Not implemented on fuse server */ + if (fm->fc->iomap_conn.no_ioend) + return false; + + /* Always send an ioend for errors. */ + if (inarg->error) + return true; + + /* Send an ioend if we performed an IO involving metadata changes. 
*/ + return inarg->written > 0 && + (inarg->flags & (FUSE_IOMAP_IOEND_SHARED | + FUSE_IOMAP_IOEND_UNWRITTEN | + FUSE_IOMAP_IOEND_APPEND)); +} + +int +fuse_iomap_setsize_finish( + struct inode *inode, + loff_t newsize) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + ASSERT(fuse_inode_has_iomap(inode)); + + spin_lock(&fi->lock); + fi->i_disk_size = newsize; + spin_unlock(&fi->lock); + return 0; +} + +static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written, + int error, unsigned ioendflags, + struct block_device *bdev, sector_t new_addr) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_mount *fm = get_fuse_mount(inode); + struct fuse_iomap_ioend_in inarg = { + .flags = ioendflags, + .error = error, + .attr_ino = fi->orig_ino, + .pos = pos, + .written = written, + .dev = fuse_iomap_find_backing_id(fm->fc, bdev), + .new_addr = new_addr, + }; + struct fuse_iomap_ioend_out outarg = { }; + + spin_lock(&fi->lock); + if (pos + written > fi->i_disk_size) + inarg.flags |= FUSE_IOMAP_IOEND_APPEND; + + outarg.newsize = max_t(loff_t, fi->i_disk_size, pos + written), + spin_unlock(&fi->lock); + + if (fuse_should_send_iomap_ioend(fm, &inarg)) { + FUSE_ARGS(args); + int iomap_error; + + args.opcode = FUSE_IOMAP_IOEND; + args.nodeid = get_node_id(inode); + args.in_numargs = 1; + args.in_args[0].size = sizeof(inarg); + args.in_args[0].value = &inarg; + args.out_numargs = 1; + args.out_args[0].size = sizeof(outarg); + args.out_args[0].value = &outarg; + iomap_error = fuse_simple_request(fm, &args); + switch (iomap_error) { + case -ENOSYS: + /* + * fuse servers can return ENOSYS if ioend processing + * is never needed for this filesystem. Don't pass + * that up to iomap. + */ + fm->fc->iomap_conn.no_ioend = 1; + break; + case 0: + break; + default: + /* + * If the write IO failed, return the failure code to + * the caller no matter what happens with the ioend. 
+ * If the write IO succeeded but the ioend did not, + * pass the new error up to the caller. + */ + if (!error) + error = iomap_error; + break; + } + } + + /* + * Pass whatever error iomap gave us (or any new errors since then) + * back to iomap. + */ + if (error) + return error; + + /* + * If there weren't any ioend errors, update the incore isize, which + * confusingly takes the new i_size as "pos". + */ + spin_lock(&fi->lock); + fi->i_disk_size = outarg.newsize; + spin_unlock(&fi->lock); + fuse_write_update_attr(inode, pos + written, written); + return 0; +} + static int fuse_iomap_may_admin(struct fuse_conn *fc, unsigned int flags) { if (!fc->iomap) @@ -739,3 +858,215 @@ loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence) return offset; return vfs_setpos(file, offset, inode->i_sb->s_maxbytes); } + +void fuse_iomap_open(struct inode *inode, struct file *file) +{ + ASSERT(fuse_inode_has_iomap(inode)); + + file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT; +} + +int fuse_iomap_finish_open(const struct fuse_file *ff, + const struct inode *inode) +{ + ASSERT(fuse_inode_has_iomap(inode)); + + /* no weird modes, iomap only handles seekable regular files */ + if (ff->open_flags & (FOPEN_PASSTHROUGH | + FOPEN_STREAM | + FOPEN_NONSEEKABLE)) + return -EINVAL; + + return 0; +} + +enum fuse_ilock_type { + SHARED, + EXCL, +}; + +static int fuse_iomap_ilock_iocb(const struct kiocb *iocb, + enum fuse_ilock_type type) +{ + struct inode *inode = file_inode(iocb->ki_filp); + + if (iocb->ki_flags & IOCB_NOWAIT) { + switch (type) { + case SHARED: + return inode_trylock_shared(inode) ? 0 : -EAGAIN; + case EXCL: + return inode_trylock(inode) ? 
0 : -EAGAIN; + default: + ASSERT(0); + return -EIO; + } + + /* shut up gcc */ + return 0; + } + + switch (type) { + case SHARED: + inode_lock_shared(inode); + break; + case EXCL: + inode_lock(inode); + break; + default: + ASSERT(0); + return -EIO; + } + + return 0; +} + +static ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to) +{ + struct inode *inode = file_inode(iocb->ki_filp); + ssize_t ret; + + if (!iov_iter_count(to)) + return 0; /* skip atime */ + + ret = fuse_iomap_ilock_iocb(iocb, SHARED); + if (ret) + return ret; + ret = iomap_dio_rw(iocb, to, &fuse_iomap_ops, NULL, 0, NULL, 0); + if (ret > 0) + file_accessed(iocb->ki_filp); + inode_unlock_shared(inode); + + return ret; +} + +static int fuse_iomap_dio_write_end_io(struct kiocb *iocb, ssize_t written, + int error, unsigned dioflags) +{ + struct inode *inode = file_inode(iocb->ki_filp); + unsigned int ioendflags = FUSE_IOMAP_IOEND_DIRECT; + + if (fuse_is_bad(inode)) + return -EIO; + + ASSERT(fuse_inode_has_iomap(inode)); + + if (dioflags & IOMAP_DIO_COW) + ioendflags |= FUSE_IOMAP_IOEND_SHARED; + if (dioflags & IOMAP_DIO_UNWRITTEN) + ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN; + + return fuse_iomap_ioend(inode, iocb->ki_pos, written, error, + ioendflags, NULL, FUSE_IOMAP_NULL_ADDR); +} + +static const struct iomap_dio_ops fuse_iomap_dio_write_ops = { + .end_io = fuse_iomap_dio_write_end_io, +}; + +static ssize_t +fuse_iomap_write_checks( + struct kiocb *iocb, + struct iov_iter *from) +{ + ssize_t error; + + error = generic_write_checks(iocb, from); + if (error <= 0) + return error; + + return kiocb_modified(iocb); +} + +static ssize_t fuse_iomap_direct_write(struct kiocb *iocb, + struct iov_iter *from) +{ + struct inode *inode = file_inode(iocb->ki_filp); + loff_t blockmask = i_blocksize(inode) - 1; + size_t count = iov_iter_count(from); + unsigned int flags = IOMAP_DIO_COMP_WORK; + ssize_t ret; + + if (!count) + return 0; + + /* + * Unaligned direct writes require zeroing of unwritten 
head and tail + * blocks. Extending writes require zeroing of post-EOF tail blocks. + * The zeroing writes must complete before we return the direct write + * to userspace. Don't even bother trying the fast path. + */ + if ((iocb->ki_pos | count) & blockmask) + flags |= IOMAP_DIO_FORCE_WAIT; + + ret = fuse_iomap_ilock_iocb(iocb, EXCL); + if (ret) + goto out_dsync; + + ret = fuse_iomap_write_checks(iocb, from); + if (ret) + goto out_unlock; + + /* + * If we are doing exclusive unaligned I/O, this must be the only I/O + * in-flight. Otherwise we risk data corruption due to unwritten + * extent conversions from the AIO end_io handler. Wait for all other + * I/O to drain first. + */ + if (flags & IOMAP_DIO_FORCE_WAIT) + inode_dio_wait(inode); + + ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops, + &fuse_iomap_dio_write_ops, flags, NULL, 0); +out_unlock: + inode_unlock(inode); +out_dsync: + return ret; +} + +void fuse_iomap_set_disk_size(struct fuse_inode *fi, loff_t newsize) +{ + lockdep_assert_held(&fi->lock); + + if (fuse_inode_has_iomap(&fi->inode)) + fi->i_disk_size = newsize; +} + +void fuse_iomap_open_truncate(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + ASSERT(fuse_inode_has_iomap(inode)); + + spin_lock(&fi->lock); + fi->i_disk_size = 0; + spin_unlock(&fi->lock); +} + +static inline bool fuse_iomap_force_directio(const struct kiocb *iocb) +{ + struct fuse_file *ff = iocb->ki_filp->private_data; + + return ff->open_flags & FOPEN_DIRECT_IO; +} + +ssize_t fuse_iomap_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + const bool force_directio = fuse_iomap_force_directio(iocb); + + ASSERT(fuse_inode_has_iomap(file_inode(iocb->ki_filp))); + + if ((iocb->ki_flags & IOCB_DIRECT) || force_directio) + return fuse_iomap_direct_read(iocb, to); + return -EIO; +} + +ssize_t fuse_iomap_write_iter(struct kiocb *iocb, struct iov_iter *from) +{ + const bool force_directio = fuse_iomap_force_directio(iocb); + + 
ASSERT(fuse_inode_has_iomap(file_inode(iocb->ki_filp))); + + if ((iocb->ki_flags & IOCB_DIRECT) || force_directio) + return fuse_iomap_direct_write(iocb, from); + return -EIO; +} diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 3cb2ef161ef7c5..23ca401a3e08e6 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -303,6 +303,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr, else fi->cached_i_blkbits = inode->i_sb->s_blocksize_bits; + fuse_iomap_set_disk_size(fi, attr->size); + /* * Don't set the sticky bit in i_mode, unless we want the VFS * to check permissions. This prevents failures due to the @@ -352,6 +354,7 @@ static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr * inode. */ cache_mask = fuse_get_cache_mask(inode); + fuse_iomap_set_disk_size(fi, attr->size); if (cache_mask & STATX_SIZE) attr->size = i_size_read(inode); diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c index 9be9ae3520003e..ed666b39cf8af4 100644 --- a/fs/fuse/iomode.c +++ b/fs/fuse/iomode.c @@ -204,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode) * io modes are not relevant with DAX and with server that does not * implement open. */ - if (fuse_inode_has_iomap(inode) || FUSE_IS_DAX(inode) || !ff->args) + if (FUSE_IS_DAX(inode) || !ff->args) return 0; /* ^ permalink raw reply related [flat|nested] 191+ messages in thread
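As a reading aid for patch 14, the upcall decision in fuse_should_send_iomap_ioend() and the append detection at the top of fuse_iomap_ioend() can be sketched in userspace roughly as follows. This is an illustration only and not part of the posted series: the flag values are copied from the include/uapi/linux/fuse.h hunk above, while struct ioend_sketch and both sketch_* function names are hypothetical. The fi->lock locking, the device cookie lookup, and the FUSE request itself are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in the uapi hunk of patch 14. */
#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)	/* out of place write extent */
#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)	/* unwritten extent */
#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)	/* don't merge into previous ioend */
#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)	/* is direct I/O */
#define FUSE_IOMAP_IOEND_APPEND		(1U << 4)	/* is append ioend */

/* Hypothetical stand-in for the fields of fuse_iomap_ioend_in used here. */
struct ioend_sketch {
	uint32_t flags;		/* FUSE_IOMAP_IOEND_* */
	int32_t error;		/* negative errno or 0 */
	uint64_t pos;		/* file position, in bytes */
	uint64_t written;	/* bytes processed */
};

/*
 * Hypothetical helper: tag the ioend as an append when the IO ends past
 * the on-disk size currently known to the kernel.
 */
static uint32_t sketch_ioend_flags(uint32_t flags, uint64_t pos,
				   uint64_t written, uint64_t disk_size)
{
	if (pos + written > disk_size)
		flags |= FUSE_IOMAP_IOEND_APPEND;
	return flags;
}

/*
 * Hypothetical helper mirroring the decision order in the patch:
 * never upcall once the server has answered ENOSYS, always upcall on
 * errors, otherwise upcall only for IO that implies metadata changes.
 */
static bool sketch_should_send_ioend(bool no_ioend,
				     const struct ioend_sketch *io)
{
	if (no_ioend)
		return false;

	if (io->error)
		return true;

	return io->written > 0 &&
	       (io->flags & (FUSE_IOMAP_IOEND_SHARED |
			     FUSE_IOMAP_IOEND_UNWRITTEN |
			     FUSE_IOMAP_IOEND_APPEND));
}
```

Under these assumptions, a pure overwrite inside the known on-disk size carries only FUSE_IOMAP_IOEND_DIRECT and triggers no upcall, which seems to be the case the cover letter refers to as needing no upcalls for pure overwrites.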
* [PATCH 15/33] fuse_trace: implement direct IO with iomap
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (13 preceding siblings ...)
2026-04-29 14:27 ` [PATCH 14/33] fuse: implement direct IO with iomap Darrick J. Wong
@ 2026-04-29 14:27 ` Darrick J. Wong
2026-04-29 14:27 ` [PATCH 16/33] fuse: implement buffered " Darrick J. Wong
` (17 subsequent siblings)
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:27 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_iomap.h | 6 +
 fs/fuse/fuse_trace.h | 212 +++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_iomap.c | 18 ++++
 fs/fuse/trace.c | 2
 4 files changed, 234 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h
index 476e1b869d1906..0076e1cb12e8b0 100644
--- a/fs/fuse/fuse_iomap.h
+++ b/fs/fuse/fuse_iomap.h
@@ -45,6 +45,11 @@ int fuse_iomap_finish_open(const struct fuse_file *ff,
 void fuse_iomap_open_truncate(struct inode *inode);
 void fuse_iomap_set_disk_size(struct fuse_inode *fi, loff_t newsize);
+static inline loff_t fuse_iomap_get_disk_size(const struct fuse_inode *fi)
+{
+	/* unlocked access, for tracing only */
+	return fuse_inode_has_iomap(&fi->inode) ? fi->i_disk_size : 0;
+}
 int fuse_iomap_setsize_finish(struct inode *inode, loff_t newsize);
 ssize_t fuse_iomap_read_iter(struct kiocb *iocb, struct iov_iter *to);
@@ -64,6 +69,7 @@ ssize_t fuse_iomap_write_iter(struct kiocb *iocb, struct iov_iter *from);
 # define fuse_iomap_finish_open(...) (-ENOSYS)
 # define fuse_iomap_open_truncate(...) ((void)0)
 # define fuse_iomap_set_disk_size(...) ((void)0)
+# define fuse_iomap_get_disk_size(...) ((loff_t)0)
 # define fuse_iomap_setsize_finish(...)
(-ENOSYS) # define fuse_iomap_read_iter(...) (-ENOSYS) # define fuse_iomap_write_iter(...) (-ENOSYS) diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index 730ab8bce44450..a8337f5ddcf011 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -60,6 +60,7 @@ EM( FUSE_STATX, "FUSE_STATX") \ EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \ EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \ + EM( FUSE_IOMAP_IOEND, "FUSE_IOMAP_IOEND") \ EMe(CUSE_INIT, "CUSE_INIT") /* @@ -84,7 +85,8 @@ OPCODES __field(dev_t, connection) \ __field(uint64_t, ino) \ __field(uint64_t, nodeid) \ - __field(loff_t, isize) + __field(loff_t, isize) \ + __field(loff_t, idisksize) #define FUSE_INODE_ASSIGN(inode, fi, fm) \ const struct fuse_inode *fi = get_fuse_inode(inode); \ @@ -93,16 +95,18 @@ OPCODES __entry->connection = (fm)->fc->dev; \ __entry->ino = (fi)->orig_ino; \ __entry->nodeid = (fi)->nodeid; \ - __entry->isize = i_size_read(inode) + __entry->isize = i_size_read(inode); \ + __entry->idisksize = fuse_iomap_get_disk_size(fi) #define FUSE_INODE_FMT \ - "connection %u ino %llu nodeid %llu isize 0x%llx" + "connection %u ino %llu nodeid %llu isize 0x%llx idisksize 0x%llx" #define FUSE_INODE_PRINTK_ARGS \ __entry->connection, \ __entry->ino, \ __entry->nodeid, \ - __entry->isize + __entry->isize, \ + __entry->idisksize #define FUSE_FILE_RANGE_FIELDS(prefix) \ __field(loff_t, prefix##offset) \ @@ -300,6 +304,17 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close); { FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \ { FUSE_IOMAP_TYPE_INLINE, "inline" } +#define FUSE_IOMAP_IOEND_STRINGS \ + { FUSE_IOMAP_IOEND_SHARED, "shared" }, \ + { FUSE_IOMAP_IOEND_UNWRITTEN, "unwritten" }, \ + { FUSE_IOMAP_IOEND_BOUNDARY, "boundary" }, \ + { FUSE_IOMAP_IOEND_DIRECT, "direct" }, \ + { FUSE_IOMAP_IOEND_APPEND, "append" } + +#define IOMAP_DIOEND_STRINGS \ + { IOMAP_DIO_UNWRITTEN, "unwritten" }, \ + { IOMAP_DIO_COW, "cow" } + TRACE_DEFINE_ENUM(FUSE_I_ADVISE_RDPLUS); TRACE_DEFINE_ENUM(FUSE_I_INIT_RDPLUS); 
TRACE_DEFINE_ENUM(FUSE_I_SIZE_UNSTABLE); @@ -484,6 +499,75 @@ TRACE_EVENT(fuse_iomap_end_error, __entry->error) ); +TRACE_EVENT(fuse_iomap_ioend, + TP_PROTO(const struct inode *inode, + const struct fuse_iomap_ioend_in *inarg), + + TP_ARGS(inode, inarg), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + __field(unsigned, ioendflags) + __field(int, error) + __field(uint32_t, dev) + __field(uint64_t, new_addr) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = inarg->pos; + __entry->length = inarg->written; + __entry->ioendflags = inarg->flags; + __entry->error = inarg->error; + __entry->dev = inarg->dev; + __entry->new_addr = inarg->new_addr; + ), + + TP_printk(FUSE_IO_RANGE_FMT() " ioendflags (%s) error %d dev %u new_addr 0x%llx", + FUSE_IO_RANGE_PRINTK_ARGS(), + __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS), + __entry->error, + __entry->dev, + __entry->new_addr) +); + +TRACE_EVENT(fuse_iomap_ioend_error, + TP_PROTO(const struct inode *inode, + const struct fuse_iomap_ioend_in *inarg, + const struct fuse_iomap_ioend_out *outarg, + int error), + + TP_ARGS(inode, inarg, outarg, error), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + __field(unsigned, ioendflags) + __field(int, error) + __field(uint32_t, dev) + __field(uint64_t, new_addr) + __field(uint64_t, new_size) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = inarg->pos; + __entry->length = inarg->written; + __entry->ioendflags = inarg->flags; + __entry->error = error; + __entry->dev = inarg->dev; + __entry->new_addr = inarg->new_addr; + __entry->new_size = outarg->newsize; + ), + + TP_printk(FUSE_IO_RANGE_FMT() " ioendflags (%s) error %d dev %u new_addr 0x%llx new_size 0x%llx", + FUSE_IO_RANGE_PRINTK_ARGS(), + __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS), + __entry->error, + __entry->dev, + __entry->new_addr, + __entry->new_size) +); + TRACE_EVENT(fuse_iomap_dev_add, TP_PROTO(const struct fuse_conn *fc, const 
struct fuse_backing_map *map), @@ -578,6 +662,126 @@ TRACE_EVENT(fuse_iomap_lseek, __entry->offset, __entry->whence) ); + +DECLARE_EVENT_CLASS(fuse_iomap_file_io_class, + TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter), + TP_ARGS(iocb, iter), + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + ), + TP_fast_assign( + FUSE_INODE_ASSIGN(file_inode(iocb->ki_filp), fi, fm); + __entry->offset = iocb->ki_pos; + __entry->length = iov_iter_count(iter); + ), + TP_printk(FUSE_IO_RANGE_FMT(), + FUSE_IO_RANGE_PRINTK_ARGS()) +) +#define DEFINE_FUSE_IOMAP_FILE_IO_EVENT(name) \ +DEFINE_EVENT(fuse_iomap_file_io_class, name, \ + TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter), \ + TP_ARGS(iocb, iter)) +DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read); +DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write); + +DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class, + TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter, + ssize_t ret), + TP_ARGS(iocb, iter, ret), + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + __field(ssize_t, ret) + ), + TP_fast_assign( + FUSE_INODE_ASSIGN(file_inode(iocb->ki_filp), fi, fm); + __entry->offset = iocb->ki_pos; + __entry->length = iov_iter_count(iter); + __entry->ret = ret; + ), + TP_printk(FUSE_IO_RANGE_FMT() " ret 0x%zx", + FUSE_IO_RANGE_PRINTK_ARGS(), + __entry->ret) +) +#define DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(name) \ +DEFINE_EVENT(fuse_iomap_file_ioend_class, name, \ + TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter, \ + ssize_t ret), \ + TP_ARGS(iocb, iter, ret)) +DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end); +DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end); + +TRACE_EVENT(fuse_iomap_dio_write_end_io, + TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written, + int error, unsigned flags), + + TP_ARGS(inode, pos, written, error, flags), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + __field(unsigned, dioendflags) + __field(int, error) + ), + + 
TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = pos; + __entry->length = written; + __entry->dioendflags = flags; + __entry->error = error; + ), + + TP_printk(FUSE_IO_RANGE_FMT() " dioendflags (%s) error %d", + FUSE_IO_RANGE_PRINTK_ARGS(), + __print_flags(__entry->dioendflags, "|", IOMAP_DIOEND_STRINGS), + __entry->error) +); + +DECLARE_EVENT_CLASS(fuse_iomap_inode_class, + TP_PROTO(const struct inode *inode), + + TP_ARGS(inode), + + TP_STRUCT__entry( + FUSE_INODE_FIELDS + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + ), + + TP_printk(FUSE_INODE_FMT, + FUSE_INODE_PRINTK_ARGS) +); +#define DEFINE_FUSE_IOMAP_INODE_EVENT(name) \ +DEFINE_EVENT(fuse_iomap_inode_class, name, \ + TP_PROTO(const struct inode *inode), \ + TP_ARGS(inode)) +DEFINE_FUSE_IOMAP_INODE_EVENT(fuse_iomap_open_truncate); + +DECLARE_EVENT_CLASS(fuse_iomap_file_range_class, + TP_PROTO(const struct inode *inode, loff_t offset, loff_t length), + + TP_ARGS(inode, offset, length), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = offset; + __entry->length = length; + ), + + TP_printk(FUSE_IO_RANGE_FMT(), + FUSE_IO_RANGE_PRINTK_ARGS()) +) +#define DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(name) \ +DEFINE_EVENT(fuse_iomap_file_range_class, name, \ + TP_PROTO(const struct inode *inode, loff_t offset, loff_t length), \ + TP_ARGS(inode, offset, length)) +DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize_finish); + #endif /* CONFIG_FUSE_IOMAP */ #endif /* _TRACE_FUSE_H */ diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 833ac0ce682838..acc7ae0f56a2bf 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -579,6 +579,8 @@ fuse_iomap_setsize_finish( ASSERT(fuse_inode_has_iomap(inode)); + trace_fuse_iomap_setsize_finish(inode, newsize, 0); + spin_lock(&fi->lock); fi->i_disk_size = newsize; spin_unlock(&fi->lock); @@ -609,6 +611,8 @@ static int fuse_iomap_ioend(struct inode *inode, 
loff_t pos, size_t written, outarg.newsize = max_t(loff_t, fi->i_disk_size, pos + written), spin_unlock(&fi->lock); + trace_fuse_iomap_ioend(inode, &inarg); + if (fuse_should_send_iomap_ioend(fm, &inarg)) { FUSE_ARGS(args); int iomap_error; @@ -634,6 +638,9 @@ static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written, case 0: break; default: + trace_fuse_iomap_ioend_error(inode, &inarg, &outarg, + iomap_error); + /* * If the write IO failed, return the failure code to * the caller no matter what happens with the ioend. @@ -925,6 +932,8 @@ static ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to) struct inode *inode = file_inode(iocb->ki_filp); ssize_t ret; + trace_fuse_iomap_direct_read(iocb, to); + if (!iov_iter_count(to)) return 0; /* skip atime */ @@ -936,6 +945,7 @@ static ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to) file_accessed(iocb->ki_filp); inode_unlock_shared(inode); + trace_fuse_iomap_direct_read_end(iocb, to, ret); return ret; } @@ -950,6 +960,9 @@ static int fuse_iomap_dio_write_end_io(struct kiocb *iocb, ssize_t written, ASSERT(fuse_inode_has_iomap(inode)); + trace_fuse_iomap_dio_write_end_io(inode, iocb->ki_pos, written, error, + dioflags); + if (dioflags & IOMAP_DIO_COW) ioendflags |= FUSE_IOMAP_IOEND_SHARED; if (dioflags & IOMAP_DIO_UNWRITTEN) @@ -986,6 +999,8 @@ static ssize_t fuse_iomap_direct_write(struct kiocb *iocb, unsigned int flags = IOMAP_DIO_COMP_WORK; ssize_t ret; + trace_fuse_iomap_direct_write(iocb, from); + if (!count) return 0; @@ -1020,6 +1035,7 @@ static ssize_t fuse_iomap_direct_write(struct kiocb *iocb, out_unlock: inode_unlock(inode); out_dsync: + trace_fuse_iomap_direct_write_end(iocb, from, ret); return ret; } @@ -1037,6 +1053,8 @@ void fuse_iomap_open_truncate(struct inode *inode) ASSERT(fuse_inode_has_iomap(inode)); + trace_fuse_iomap_open_truncate(inode); + spin_lock(&fi->lock); fi->i_disk_size = 0; spin_unlock(&fi->lock); diff --git a/fs/fuse/trace.c 
b/fs/fuse/trace.c
index c830c1c38a833c..71d444ac1e5021 100644
--- a/fs/fuse/trace.c
+++ b/fs/fuse/trace.c
@@ -6,9 +6,11 @@
 #include "dev_uring_i.h"
 #include "fuse_i.h"
 #include "fuse_dev_i.h"
+#include "fuse_iomap.h"
 #include "fuse_iomap_i.h"
 #include <linux/pagemap.h>
+#include <linux/iomap.h>
 #define CREATE_TRACE_POINTS
 #include "fuse_trace.h"
^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 16/33] fuse: implement buffered IO with iomap
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (14 preceding siblings ...)
2026-04-29 14:27 ` [PATCH 15/33] fuse_trace: " Darrick J. Wong
@ 2026-04-29 14:27 ` Darrick J. Wong
2026-04-29 14:28 ` [PATCH 17/33] fuse_trace: " Darrick J. Wong
` (16 subsequent siblings)
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:27 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel
From: Darrick J. Wong <djwong@kernel.org>
Implement pagecache IO with iomap, complete with hooks into truncate and
fallocate so that the fuse server needn't implement disk block zeroing
of post-EOF and unaligned punch/zero regions.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h | 7
 fs/fuse/fuse_iomap.h | 11 +
 include/uapi/linux/fuse.h | 5
 fs/fuse/dir.c | 23 +
 fs/fuse/file.c | 61 +++-
 fs/fuse/fuse_iomap.c | 699 ++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 775 insertions(+), 31 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ea95615fbff598..23212ca1b6871e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -192,6 +192,11 @@ struct fuse_inode {
 #ifdef CONFIG_FUSE_IOMAP
 	/* file size as reported by fuse server */
 	loff_t i_disk_size;
+
+	/* pending io completions */
+	spinlock_t ioend_lock;
+	struct work_struct ioend_work;
+	struct list_head ioend_list;
 #endif
 };
@@ -1738,4 +1743,6 @@ extern void fuse_sysctl_unregister(void);
 #define fuse_sysctl_unregister() do { } while (0)
 #endif /* CONFIG_SYSCTL */
+sector_t fuse_bmap(struct address_space *mapping, sector_t block);
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h
index 0076e1cb12e8b0..be37ddac2f1e25 100644
--- a/fs/fuse/fuse_iomap.h
+++ b/fs/fuse/fuse_iomap.h
@@ -54,6 +54,13 @@ int fuse_iomap_setsize_finish(struct inode *inode, loff_t newsize);
ssize_t fuse_iomap_read_iter(struct kiocb *iocb, struct iov_iter *to); ssize_t fuse_iomap_write_iter(struct kiocb *iocb, struct iov_iter *from); + +int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma); +int fuse_iomap_setsize_start(struct inode *inode, loff_t newsize); +int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset, + loff_t length, loff_t new_size); +int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos, + loff_t endpos); #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) @@ -73,6 +80,10 @@ ssize_t fuse_iomap_write_iter(struct kiocb *iocb, struct iov_iter *from); # define fuse_iomap_setsize_finish(...) (-ENOSYS) # define fuse_iomap_read_iter(...) (-ENOSYS) # define fuse_iomap_write_iter(...) (-ENOSYS) +# define fuse_iomap_mmap(...) (-ENOSYS) +# define fuse_iomap_setsize_start(...) (-ENOSYS) +# define fuse_iomap_fallocate(...) (-ENOSYS) +# define fuse_iomap_flush_unmap_range(...) (-ENOSYS) #endif /* CONFIG_FUSE_IOMAP */ #endif /* _FS_FUSE_IOMAP_H */ diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 543965b2f8fb37..71b216262c84cb 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -1373,6 +1373,9 @@ struct fuse_uring_cmd_req { #define FUSE_IOMAP_OP_ATOMIC (1U << 9) #define FUSE_IOMAP_OP_DONTCACHE (1U << 10) +/* pagecache writeback operation */ +#define FUSE_IOMAP_OP_WRITEBACK (1U << 31) + #define FUSE_IOMAP_NULL_ADDR (-1ULL) /* addr is not valid */ struct fuse_iomap_io { @@ -1422,6 +1425,8 @@ struct fuse_iomap_end_in { #define FUSE_IOMAP_IOEND_DIRECT (1U << 3) /* is append ioend */ #define FUSE_IOMAP_IOEND_APPEND (1U << 4) +/* is pagecache writeback */ +#define FUSE_IOMAP_IOEND_WRITEBACK (1U << 5) struct fuse_iomap_ioend_in { uint32_t flags; /* FUSE_IOMAP_IOEND_* */ diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 944ed53852444c..5b10ddf9b8077a 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -2229,7 +2229,10 @@ int fuse_do_setattr(struct 
mnt_idmap *idmap, struct dentry *dentry, is_truncate = true; } - if (FUSE_IS_DAX(inode) && is_truncate) { + if (is_iomap && is_truncate) { + filemap_invalidate_lock(mapping); + fault_blocked = true; + } else if (FUSE_IS_DAX(inode) && is_truncate) { filemap_invalidate_lock(mapping); fault_blocked = true; err = fuse_dax_break_layouts(inode, 0, -1); @@ -2244,6 +2247,18 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry, WARN_ON(!(attr->ia_valid & ATTR_SIZE)); WARN_ON(attr->ia_size != 0); if (fc->atomic_o_trunc) { + if (is_iomap) { + /* + * fuse_open already set the size to zero and + * truncated the pagecache, and we've since + * cycled the inode locks. Another thread + * could have performed an appending write, so + * we don't want to touch the file further. + */ + filemap_invalidate_unlock(mapping); + return 0; + } + /* * No need to send request to userspace, since actual * truncation has already been done by OPEN. But still @@ -2277,6 +2292,12 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry, set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); if (trust_local_cmtime && attr->ia_size != inode->i_size) attr->ia_valid |= ATTR_MTIME | ATTR_CTIME; + + if (is_iomap) { + err = fuse_iomap_setsize_start(inode, attr->ia_size); + if (err) + goto error; + } } memset(&inarg, 0, sizeof(inarg)); diff --git a/fs/fuse/file.c b/fs/fuse/file.c index ad28bcd36d15e6..6471952489aa2e 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -408,7 +408,7 @@ static int fuse_release(struct inode *inode, struct file *file) * Dirty pages might remain despite write_inode_now() call from * fuse_flush() due to writes racing with the close. 
*/ - if (fc->writeback_cache) + if (fc->writeback_cache || fuse_inode_has_iomap(inode)) write_inode_now(inode, 1); fuse_release_common(file, false); @@ -2400,6 +2400,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma) struct inode *inode = file_inode(file); int rc; + if (fuse_inode_has_iomap(inode)) + return fuse_iomap_mmap(file, vma); + /* DAX mmap is superior to direct_io mmap */ if (FUSE_IS_DAX(inode)) return fuse_dax_mmap(file, vma); @@ -2598,7 +2601,7 @@ static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl) return err; } -static sector_t fuse_bmap(struct address_space *mapping, sector_t block) +sector_t fuse_bmap(struct address_space *mapping, sector_t block) { struct inode *inode = mapping->host; struct fuse_mount *fm = get_fuse_mount(inode); @@ -2952,8 +2955,12 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter) static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end) { - int err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX); + int err; + if (fuse_inode_has_iomap(inode)) + return fuse_iomap_flush_unmap_range(inode, start, end); + + err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX); if (!err) fuse_sync_writes(inode); @@ -2989,7 +2996,10 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, return -EOPNOTSUPP; inode_lock(inode); - if (block_faults) { + if (is_iomap) { + filemap_invalidate_lock(inode->i_mapping); + block_faults = true; + } else if (block_faults) { filemap_invalidate_lock(inode->i_mapping); err = fuse_dax_break_layouts(inode, 0, -1); if (err) @@ -3004,6 +3014,17 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, goto out; } + /* + * If we are using iomap for file IO, fallocate must wait for all AIO + * to complete before we continue as AIO can change the file size on + * completion without holding any locks we currently hold. 
We must do + * this first because AIO can update the in-memory inode size, and the + * operations that follow require the in-memory size to be fully + * up-to-date. + */ + if (is_iomap) + inode_dio_wait(inode); + if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + length > i_size_read(inode)) { err = inode_newsize_ok(inode, offset + length); @@ -3032,21 +3053,23 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, if (err) goto out; - /* we could have extended the file */ - if (!(mode & FALLOC_FL_KEEP_SIZE)) { - if (is_iomap && newsize > 0) { - err = fuse_iomap_setsize_finish(inode, newsize); - if (err) - goto out; + if (is_iomap) { + err = fuse_iomap_fallocate(file, mode, offset, length, + newsize); + if (err) + goto out; + } else { + /* we could have extended the file */ + if (!(mode & FALLOC_FL_KEEP_SIZE)) { + if (fuse_write_update_attr(inode, newsize, length)) + file_update_time(file); } - if (fuse_write_update_attr(inode, offset + length, length)) - file_update_time(file); + if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) + truncate_pagecache_range(inode, offset, + offset + length - 1); } - if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) - truncate_pagecache_range(inode, offset, offset + length - 1); - fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE); out: @@ -3090,6 +3113,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in, ssize_t err; /* mark unstable when write-back is not used, and file_out gets * extended */ + const bool is_iomap = fuse_inode_has_iomap(inode_out); bool is_unstable = (!fc->writeback_cache) && ((pos_out + len) > inode_out->i_size); @@ -3133,6 +3157,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in, if (err) goto out; + /* See inode_dio_wait comment in fuse_file_fallocate */ + if (is_iomap) + inode_dio_wait(inode_out); + if (is_unstable) set_bit(FUSE_I_SIZE_UNSTABLE, &fi_out->state); @@ -3173,7 +3201,8 @@ static ssize_t 
__fuse_copy_file_range(struct file *file_in, loff_t pos_in, goto out; } - truncate_inode_pages_range(inode_out->i_mapping, + if (!is_iomap) + truncate_inode_pages_range(inode_out->i_mapping, ALIGN_DOWN(pos_out, PAGE_SIZE), ALIGN(pos_out + bytes_copied, PAGE_SIZE) - 1); diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index acc7ae0f56a2bf..7a7dfc4f665d8e 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -5,6 +5,8 @@ */ #include <linux/iomap.h> #include <linux/fiemap.h> +#include <linux/pagemap.h> +#include <linux/falloc.h> #include "fuse_i.h" #include "fuse_trace.h" #include "fuse_iomap.h" @@ -204,7 +206,7 @@ static inline uint16_t fuse_iomap_flags_from_server(uint16_t fuse_f_flags) ret |= FUSE_IOMAP_OP_##word static inline uint32_t fuse_iomap_op_to_server(unsigned iomap_op_flags) { - uint32_t ret = 0; + uint32_t ret = iomap_op_flags & FUSE_IOMAP_OP_WRITEBACK; XMAP(WRITE); XMAP(ZERO); @@ -370,7 +372,8 @@ fuse_iomap_begin_validate(const struct inode *inode, static inline bool fuse_is_iomap_file_write(unsigned int opflags) { - return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE); + return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE | + FUSE_IOMAP_OP_WRITEBACK); } static inline struct fuse_backing * @@ -744,12 +747,7 @@ void fuse_iomap_unmount(struct fuse_mount *fm) fuse_send_destroy(fm); } -static inline void fuse_inode_set_iomap(struct inode *inode) -{ - struct fuse_inode *fi = get_fuse_inode(inode); - - set_bit(FUSE_I_IOMAP, &fi->state); -} +static inline void fuse_inode_set_iomap(struct inode *inode); static inline void fuse_inode_clear_iomap(struct inode *inode) { @@ -976,17 +974,107 @@ static const struct iomap_dio_ops fuse_iomap_dio_write_ops = { .end_io = fuse_iomap_dio_write_end_io, }; +static const struct iomap_write_ops fuse_iomap_write_ops = { +}; + +static int +fuse_iomap_zero_range( + struct inode *inode, + loff_t pos, + loff_t len, + bool *did_zero) +{ + return iomap_zero_range(inode, pos, len, did_zero, 
&fuse_iomap_ops, + &fuse_iomap_write_ops, NULL); +} + +/* Take care of zeroing post-EOF blocks when they might exist. */ +static ssize_t +fuse_iomap_write_zero_eof( + struct kiocb *iocb, + struct iov_iter *from, + bool *drained_dio) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct fuse_inode *fi = get_fuse_inode(inode); + struct address_space *mapping = iocb->ki_filp->f_mapping; + loff_t isize; + int error; + + /* + * We need to serialise against EOF updates that occur in IO + * completions here. We want to make sure that nobody is changing the + * size while we do this check until we have placed an IO barrier (i.e. + * hold i_rwsem exclusively) that prevents new IO from being + * dispatched. The spinlock effectively forms a memory barrier once we + * have i_rwsem exclusively so we are guaranteed to see the latest EOF + * value and hence be able to correctly determine if we need to run + * zeroing. + */ + spin_lock(&fi->lock); + isize = i_size_read(inode); + if (iocb->ki_pos <= isize) { + spin_unlock(&fi->lock); + return 0; + } + spin_unlock(&fi->lock); + + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + + if (!(*drained_dio)) { + /* + * We now have an IO submission barrier in place, but AIO can + * do EOF updates during IO completion and hence we now need to + * wait for all of them to drain. Non-AIO DIO will have + * drained before we are given the exclusive i_rwsem, and so + * for most cases this wait is a no-op. 
+ */ + inode_dio_wait(inode); + *drained_dio = true; + return 1; + } + + filemap_invalidate_lock(mapping); + error = fuse_iomap_zero_range(inode, isize, iocb->ki_pos - isize, NULL); + filemap_invalidate_unlock(mapping); + + return error; +} + static ssize_t fuse_iomap_write_checks( struct kiocb *iocb, struct iov_iter *from) { + struct inode *inode = iocb->ki_filp->f_mapping->host; ssize_t error; + bool drained_dio = false; +restart: error = generic_write_checks(iocb, from); if (error <= 0) return error; + /* + * If the offset is beyond the size of the file, we need to zero all + * blocks that fall between the existing EOF and the start of this + * write. + * + * We can do an unlocked check for i_size here safely as I/O completion + * can only extend EOF. Truncate is locked out at this point, so the + * EOF cannot move backwards, only forwards. Hence we only need to take + * the slow path when we are at or beyond the current EOF. + */ + if (fuse_inode_has_iomap(inode) && + iocb->ki_pos > i_size_read(inode)) { + error = fuse_iomap_write_zero_eof(iocb, from, &drained_dio); + if (error == 1) + goto restart; + if (error) + return error; + } + return kiocb_modified(iocb); } @@ -1060,6 +1148,366 @@ void fuse_iomap_open_truncate(struct inode *inode) spin_unlock(&fi->lock); } +struct fuse_writepage_ctx { + struct iomap_writepage_ctx ctx; +}; + +static void fuse_iomap_end_ioend(struct iomap_ioend *ioend) +{ + struct inode *inode = ioend->io_inode; + unsigned int ioendflags = FUSE_IOMAP_IOEND_WRITEBACK; + unsigned int nofs_flag; + int error = blk_status_to_errno(ioend->io_bio.bi_status); + + ASSERT(fuse_inode_has_iomap(inode)); + + /* We still have to clean up the ioend even if the inode is dead */ + if (!error && fuse_is_bad(inode)) + error = -EIO; + + if (ioend->io_flags & IOMAP_IOEND_SHARED) + ioendflags |= FUSE_IOMAP_IOEND_SHARED; + if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN) + ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN; + + /* + * We can allocate memory here while doing 
writeback on behalf of + * memory reclaim. To avoid memory allocation deadlocks set the + * task-wide nofs context for the following operations. + */ + nofs_flag = memalloc_nofs_save(); + fuse_iomap_ioend(inode, ioend->io_offset, ioend->io_size, error, + ioendflags, ioend->io_bio.bi_bdev, ioend->io_sector); + iomap_finish_ioends(ioend, error); + memalloc_nofs_restore(nofs_flag); +} + +/* + * Finish all pending IO completions that require transactional modifications. + * + * We try to merge physical and logically contiguous ioends before completion to + * minimise the number of transactions we need to perform during IO completion. + * Both unwritten extent conversion and COW remapping need to iterate and modify + * one physical extent at a time, so we gain nothing by merging physically + * discontiguous extents here. + * + * The ioend chain length that we can be processing here is largely unbound in + * length and we may have to perform significant amounts of work on each ioend + * to complete it. Hence we have to be careful about holding the CPU for too + * long in this loop. 
+ */ +static void fuse_iomap_end_io(struct work_struct *work) +{ + struct fuse_inode *fi = + container_of(work, struct fuse_inode, ioend_work); + struct iomap_ioend *ioend; + struct list_head tmp; + unsigned long flags; + + spin_lock_irqsave(&fi->ioend_lock, flags); + list_replace_init(&fi->ioend_list, &tmp); + spin_unlock_irqrestore(&fi->ioend_lock, flags); + + iomap_sort_ioends(&tmp); + while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend, + io_list))) { + list_del_init(&ioend->io_list); + iomap_ioend_try_merge(ioend, &tmp); + fuse_iomap_end_ioend(ioend); + cond_resched(); + } +} + +static void fuse_iomap_end_bio(struct bio *bio) +{ + struct iomap_ioend *ioend = iomap_ioend_from_bio(bio); + struct inode *inode = ioend->io_inode; + struct fuse_inode *fi = get_fuse_inode(inode); + unsigned long flags; + + ASSERT(fuse_inode_has_iomap(inode)); + + spin_lock_irqsave(&fi->ioend_lock, flags); + if (list_empty(&fi->ioend_list)) + WARN_ON_ONCE(!queue_work(system_unbound_wq, &fi->ioend_work)); + list_add_tail(&ioend->io_list, &fi->ioend_list); + spin_unlock_irqrestore(&fi->ioend_lock, flags); +} + +/* + * Fast revalidation of the cached writeback mapping. Return true if the current + * mapping is valid, false otherwise. + */ +static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc, + loff_t offset) +{ + if (offset < wpc->iomap.offset || + offset >= wpc->iomap.offset + wpc->iomap.length) + return false; + + /* XXX actually use revalidation cookie */ + return true; +} + +/* + * If the folio has delalloc blocks on it, the caller is asking us to punch them + * out. If we don't, we can leave a stale delalloc mapping covered by a clean + * page that needs to be dirtied again before the delalloc mapping can be + * converted. This stale delalloc mapping can trip up a later direct I/O read + * operation on the same region. + * + * We prevent this by truncating away the delalloc regions on the folio. 
Because + * they are delalloc, we can do this without needing a transaction. Indeed - if + * we get ENOSPC errors, we have to be able to do this truncation without a + * transaction as there is no space left for block reservation (typically why + * we see a ENOSPC in writeback). + */ +static void fuse_iomap_discard_folio(struct folio *folio, loff_t pos, int error) +{ + struct inode *inode = folio->mapping->host; + struct fuse_inode *fi = get_fuse_inode(inode); + loff_t end = folio_pos(folio) + folio_size(folio); + + if (fuse_is_bad(inode)) + return; + + ASSERT(fuse_inode_has_iomap(inode)); + + printk_ratelimited(KERN_ERR + "page discard on page %px, inode 0x%llx, pos %llu.", + folio, fi->orig_ino, pos); + + /* Userspace may need to remove delayed allocations */ + fuse_iomap_ioend(inode, pos, end - pos, error, 0, NULL, + FUSE_IOMAP_NULL_ADDR); +} + +static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc, + struct folio *folio, u64 offset, + unsigned int len, u64 end_pos) +{ + struct inode *inode = wpc->inode; + struct iomap write_iomap, dontcare; + ssize_t ret; + + if (fuse_is_bad(inode)) { + ret = -EIO; + goto discard_folio; + } + + ASSERT(fuse_inode_has_iomap(inode)); + + if (!fuse_iomap_revalidate_writeback(wpc, offset)) { + ret = fuse_iomap_begin(inode, offset, len, + FUSE_IOMAP_OP_WRITEBACK, + &write_iomap, &dontcare); + if (ret) + goto discard_folio; + + /* + * Landed in a hole or beyond EOF? Send that to iomap, it'll + * skip writing back the file range. 
+ */ + if (write_iomap.offset > offset) { + write_iomap.length = write_iomap.offset - offset; + write_iomap.offset = offset; + write_iomap.type = IOMAP_HOLE; + } + + memcpy(&wpc->iomap, &write_iomap, sizeof(struct iomap)); + } + + ret = iomap_add_to_ioend(wpc, folio, offset, end_pos, len); + if (ret < 0) + goto discard_folio; + + return ret; +discard_folio: + fuse_iomap_discard_folio(folio, offset, ret); + return ret; +} + +static int fuse_iomap_writeback_submit(struct iomap_writepage_ctx *wpc, + int error) +{ + struct iomap_ioend *ioend = wpc->wb_ctx; + + ASSERT(fuse_inode_has_iomap(ioend->io_inode)); + + /* always call our ioend function, even if we cancel the bio */ + ioend->io_bio.bi_end_io = fuse_iomap_end_bio; + return iomap_ioend_writeback_submit(wpc, error); +} + +static const struct iomap_writeback_ops fuse_iomap_writeback_ops = { + .writeback_range = fuse_iomap_writeback_range, + .writeback_submit = fuse_iomap_writeback_submit, +}; + +static int fuse_iomap_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct fuse_writepage_ctx wpc = { + .ctx = { + .inode = mapping->host, + .wbc = wbc, + .ops = &fuse_iomap_writeback_ops, + }, + }; + + ASSERT(fuse_inode_has_iomap(mapping->host)); + + return iomap_writepages(&wpc.ctx); +} + +static int fuse_iomap_read_folio(struct file *file, struct folio *folio) +{ + ASSERT(fuse_inode_has_iomap(file_inode(file))); + + iomap_bio_read_folio(folio, &fuse_iomap_ops); + return 0; +} + +static void fuse_iomap_readahead(struct readahead_control *rac) +{ + ASSERT(fuse_inode_has_iomap(file_inode(rac->file))); + + iomap_bio_readahead(rac, &fuse_iomap_ops); +} + +static const struct address_space_operations fuse_iomap_aops = { + .read_folio = fuse_iomap_read_folio, + .readahead = fuse_iomap_readahead, + .writepages = fuse_iomap_writepages, + .dirty_folio = iomap_dirty_folio, + .release_folio = iomap_release_folio, + .invalidate_folio = iomap_invalidate_folio, + .migrate_folio = filemap_migrate_folio, 
+ .is_partially_uptodate = iomap_is_partially_uptodate, + .error_remove_folio = generic_error_remove_folio, + + /* These aren't pagecache operations per se */ + .bmap = fuse_bmap, +}; + +static inline void fuse_inode_set_iomap(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + inode->i_data.a_ops = &fuse_iomap_aops; + + INIT_WORK(&fi->ioend_work, fuse_iomap_end_io); + INIT_LIST_HEAD(&fi->ioend_list); + spin_lock_init(&fi->ioend_lock); + set_bit(FUSE_I_IOMAP, &fi->state); +} + +/* + * Locking for serialisation of IO during page faults. This results in a lock + * ordering of: + * + * mmap_lock (MM) + * sb_start_pagefault(vfs, freeze) + * invalidate_lock (vfs - truncate serialisation) + * page_lock (MM) + * i_lock (FUSE - extent map serialisation) + */ +static vm_fault_t fuse_iomap_page_mkwrite(struct vm_fault *vmf) +{ + struct inode *inode = file_inode(vmf->vma->vm_file); + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + vm_fault_t ret; + + ASSERT(fuse_inode_has_iomap(inode)); + + sb_start_pagefault(inode->i_sb); + file_update_time(vmf->vma->vm_file); + + filemap_invalidate_lock_shared(mapping); + ret = iomap_page_mkwrite(vmf, &fuse_iomap_ops, NULL); + filemap_invalidate_unlock_shared(mapping); + + sb_end_pagefault(inode->i_sb); + return ret; +} + +static const struct vm_operations_struct fuse_iomap_vm_ops = { + .fault = filemap_fault, + .map_pages = filemap_map_pages, + .page_mkwrite = fuse_iomap_page_mkwrite, +}; + +int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma) +{ + ASSERT(fuse_inode_has_iomap(file_inode(file))); + + file_accessed(file); + vma->vm_ops = &fuse_iomap_vm_ops; + return 0; +} + +static ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to) +{ + struct inode *inode = file_inode(iocb->ki_filp); + ssize_t ret; + + ASSERT(fuse_inode_has_iomap(inode)); + + if (!iov_iter_count(to)) + return 0; /* skip atime */ + + ret = fuse_iomap_ilock_iocb(iocb, SHARED); + if (ret) + return 
ret; + ret = generic_file_read_iter(iocb, to); + if (ret > 0) + file_accessed(iocb->ki_filp); + inode_unlock_shared(inode); + + return ret; +} + +static ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, + struct iov_iter *from) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct fuse_inode *fi = get_fuse_inode(inode); + loff_t pos = iocb->ki_pos; + ssize_t ret; + + ASSERT(fuse_inode_has_iomap(inode)); + + if (!iov_iter_count(from)) + return 0; + + ret = fuse_iomap_ilock_iocb(iocb, EXCL); + if (ret) + return ret; + + ret = fuse_iomap_write_checks(iocb, from); + if (ret) + goto out_unlock; + + if (inode->i_size < pos + iov_iter_count(from)) + set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); + + ret = iomap_file_buffered_write(iocb, from, &fuse_iomap_ops, + &fuse_iomap_write_ops, NULL); + + if (ret > 0) + fuse_write_update_attr(inode, pos + ret, ret); + clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); + +out_unlock: + inode_unlock(inode); + + if (ret > 0) { + /* Handle various SYNC-type writes */ + ret = generic_write_sync(iocb, ret); + } + return ret; +} + static inline bool fuse_iomap_force_directio(const struct kiocb *iocb) { struct fuse_file *ff = iocb->ki_filp->private_data; @@ -1073,9 +1521,33 @@ ssize_t fuse_iomap_read_iter(struct kiocb *iocb, struct iov_iter *to) ASSERT(fuse_inode_has_iomap(file_inode(iocb->ki_filp))); - if ((iocb->ki_flags & IOCB_DIRECT) || force_directio) - return fuse_iomap_direct_read(iocb, to); - return -EIO; + if ((iocb->ki_flags & IOCB_DIRECT) || force_directio) { + ssize_t ret = fuse_iomap_direct_read(iocb, to); + + switch (ret) { + case -ENOTBLK: + case -ENOSYS: + /* + * We fall back to a buffered read if: + * + * - ENOTBLK means iomap told us to do it + * - ENOSYS means the fuse server wants it + * + * Don't fall back if we were forced to do it. 
+ */ + if (force_directio) + return -EIO; + break; + default: + /* errors, no progress, or partial progress */ + return ret; + } + + /* do not let generic_file_read_iter fall into ->direct_IO */ + iocb->ki_flags &= ~IOCB_DIRECT; + } + + return fuse_iomap_buffered_read(iocb, to); } ssize_t fuse_iomap_write_iter(struct kiocb *iocb, struct iov_iter *from) @@ -1084,7 +1556,206 @@ ssize_t fuse_iomap_write_iter(struct kiocb *iocb, struct iov_iter *from) ASSERT(fuse_inode_has_iomap(file_inode(iocb->ki_filp))); - if ((iocb->ki_flags & IOCB_DIRECT) || force_directio) - return fuse_iomap_direct_write(iocb, from); - return -EIO; + if ((iocb->ki_flags & IOCB_DIRECT) || force_directio) { + ssize_t ret = fuse_iomap_direct_write(iocb, from); + + switch (ret) { + case -ENOTBLK: + case -ENOSYS: + /* + * We fall back to a buffered write if: + * + * - ENOTBLK means iomap told us to do it + * - ENOSYS means the fuse server wants it + * + * Either way, try the write again as a synchronous + * buffered write unless we were forced to do directio. + */ + if (force_directio) + return -EIO; + iocb->ki_flags |= IOCB_SYNC; + break; + default: + /* errors, no progress, or partial progress */ + return ret; + } + } + + return fuse_iomap_buffered_write(iocb, from); +} + +static int +fuse_iomap_truncate_page( + struct inode *inode, + loff_t pos, + bool *did_zero) +{ + return iomap_truncate_page(inode, pos, did_zero, &fuse_iomap_ops, + &fuse_iomap_write_ops, NULL); +} +/* + * Truncate pagecache for a file before sending the truncate request to + * userspace. Must have write permission and not be a directory. + * + * Caution: The caller of this function is responsible for calling + * setattr_prepare() or otherwise verifying the change is fine. 
+ */ +int +fuse_iomap_setsize_start( + struct inode *inode, + loff_t newsize) +{ + loff_t oldsize = i_size_read(inode); + int error; + bool did_zeroing = false; + + rwsem_assert_held_write(&inode->i_rwsem); + rwsem_assert_held_write(&inode->i_mapping->invalidate_lock); + ASSERT(S_ISREG(inode->i_mode)); + + /* + * Wait for all direct I/O to complete. + */ + inode_dio_wait(inode); + + /* + * File data changes must be complete and flushed to disk before we + * call userspace to modify the inode. + * + * Start with zeroing any data beyond EOF that we may expose on file + * extension, or zeroing out the rest of the block on a downward + * truncate. + */ + if (newsize > oldsize) + error = fuse_iomap_zero_range(inode, oldsize, newsize - oldsize, + &did_zeroing); + else + error = fuse_iomap_truncate_page(inode, newsize, &did_zeroing); + if (error) + return error; + + /* + * We've already locked out new page faults, so now we can safely + * remove pages from the page cache knowing they won't get refaulted + * until we drop the mapping invalidation lock after the extent + * manipulations are complete. The truncate_setsize() call also cleans + * folios spanning EOF on extending truncates and hence ensures + * sub-page block size filesystems are correctly handled, too. + * + * And we update in-core i_size and truncate page cache beyond newsize + * before writing back the whole file, so we're guaranteed not to write + * stale data past the new EOF on truncate down. + */ + truncate_setsize(inode, newsize); + + /* + * Flush the entire pagecache to ensure the fuse server logs the inode + * size change and all dirty data that might be associated with it. + * We don't know the ondisk inode size, so we only have this clumsy + * hammer. + */ + return filemap_write_and_wait(inode->i_mapping); +} + +/* + * Prepare for a file data block remapping operation by flushing and unmapping + * all pagecache for the entire range. 
+ */ +int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos, + loff_t endpos) +{ + loff_t start, end; + unsigned int rounding; + int error; + + /* + * Make sure we extend the flush out to extent alignment boundaries so + * any extent range overlapping the start/end of the modification we + * are about to do is clean and idle. + */ + rounding = max_t(unsigned int, i_blocksize(inode), PAGE_SIZE); + start = round_down(pos, rounding); + end = round_up(endpos + 1, rounding) - 1; + + error = filemap_write_and_wait_range(inode->i_mapping, start, end); + if (error) + return error; + truncate_pagecache_range(inode, start, end); + return 0; +} + +static int fuse_iomap_punch_range(struct inode *inode, loff_t offset, + loff_t length) +{ + loff_t isize = i_size_read(inode); + int error; + + /* + * Now that we've unmap all full blocks we'll have to zero out any + * partial block at the beginning and/or end. iomap_zero_range is + * smart enough to skip holes and unwritten extents, including those we + * just created, but we must take care not to zero beyond EOF, which + * would enlarge i_size. + */ + if (offset >= isize) + return 0; + if (offset + length > isize) + length = isize - offset; + error = fuse_iomap_zero_range(inode, offset, length, NULL); + if (error) + return error; + + /* + * If we zeroed right up to EOF and EOF straddles a page boundary we + * must make sure that the post-EOF area is also zeroed because the + * page could be mmap'd and iomap_zero_range doesn't do that for us. + * Writeback of the eof page will do this, albeit clumsily. 
+	 */
+	if (offset + length >= isize && offset_in_page(offset + length) > 0) {
+		error = filemap_write_and_wait_range(inode->i_mapping,
+				round_down(offset + length, PAGE_SIZE),
+				LLONG_MAX);
+	}
+
+	return error;
+}
+
+int
+fuse_iomap_fallocate(
+	struct file *file,
+	int mode,
+	loff_t offset,
+	loff_t length,
+	loff_t new_size)
+{
+	struct inode *inode = file_inode(file);
+	int error;
+
+	ASSERT(fuse_inode_has_iomap(inode));
+
+	/*
+	 * If we unmapped blocks from the file range, then we zero the
+	 * pagecache for those regions and push them to disk rather than make
+	 * the fuse server manually zero the disk blocks.
+	 */
+	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) {
+		error = fuse_iomap_punch_range(inode, offset, length);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * If this is an extending write, we need to zero the bytes beyond the
+	 * new EOF and bounce the new size out to userspace.
+	 */
+	if (new_size) {
+		error = fuse_iomap_setsize_start(inode, new_size);
+		if (error)
+			return error;
+
+		fuse_write_update_attr(inode, new_size, length);
+	}
+
+	file_update_time(file);
+	return 0;
 }
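For readers following the range arithmetic in fuse_iomap_flush_unmap_range()
and fuse_iomap_punch_range() in the patch above, here is a minimal userspace
sketch in C. It is not kernel code and is not part of the patch: the sketch_*
names, struct byte_range, and the fixed 4096-byte SKETCH_PAGE_SIZE are
assumptions made for illustration, and the rounding helpers only model
power-of-two granules.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a fixed page size; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical type: an inclusive byte range [start, end]. */
struct byte_range {
	int64_t start;
	int64_t end;
};

/* Round x down to a power-of-two granule (models round_down()). */
static int64_t sketch_round_down(int64_t x, int64_t granule)
{
	return x & ~(granule - 1);
}

/* Round x up to a power-of-two granule (models round_up()). */
static int64_t sketch_round_up(int64_t x, int64_t granule)
{
	return (x + granule - 1) & ~(granule - 1);
}

/*
 * Models how the flush/unmap range is widened: the granule is the larger
 * of the block size and the page size, the start is rounded down, and the
 * inclusive end is rounded up to the last byte of its granule.
 */
static struct byte_range sketch_flush_range(int64_t pos, int64_t endpos,
					    uint32_t blocksize)
{
	int64_t rounding = blocksize > SKETCH_PAGE_SIZE ?
				blocksize : SKETCH_PAGE_SIZE;
	struct byte_range r;

	r.start = sketch_round_down(pos, rounding);
	r.end = sketch_round_up(endpos + 1, rounding) - 1;
	return r;
}

/*
 * Models the EOF clamp before zeroing in the punch path: nothing is zeroed
 * at or beyond EOF, and a range that crosses EOF is trimmed so that zeroing
 * cannot enlarge the file. Returns the number of bytes to zero.
 */
static int64_t sketch_punch_zero_len(int64_t offset, int64_t length,
				     int64_t isize)
{
	if (offset >= isize)
		return 0;
	if (offset + length > isize)
		length = isize - offset;
	return length;
}
```

With a 1k block size and 4k pages, a request covering bytes 5000..9000 widens
to 4096..12287, so every page or block that overlaps the modification is
flushed and dropped. A punch of 4096 bytes at offset 8192 in a 10000-byte file
zeroes only the 1808 bytes that sit below EOF.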
* [PATCH 17/33] fuse_trace: implement buffered IO with iomap
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (15 preceding siblings ...)
2026-04-29 14:27 ` [PATCH 16/33] fuse: implement buffered " Darrick J. Wong
@ 2026-04-29 14:28 ` Darrick J. Wong
2026-04-29 14:28 ` [PATCH 18/33] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
` (15 subsequent siblings)
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:28 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Add tracepoints for the previous patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h | 227 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev.c        |  10 ++
 fs/fuse/fuse_iomap.c |  40 ++++++++-
 3 files changed, 273 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index a8337f5ddcf011..c832fb9012d983 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -227,6 +227,9 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
 #endif /* CONFIG_FUSE_BACKING */
 
 #if IS_ENABLED(CONFIG_FUSE_IOMAP)
+struct iomap_writepage_ctx;
+struct iomap_ioend;
+
 /* tracepoint boilerplate so we don't have to keep doing this */
 #define FUSE_IOMAP_OPFLAGS_FIELD \
 	__field(unsigned, opflags)
@@ -294,7 +297,8 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
 	{ FUSE_IOMAP_OP_UNSHARE, "unshare" }, \
 	{ FUSE_IOMAP_OP_DAX, "fsdax" }, \
 	{ FUSE_IOMAP_OP_ATOMIC, "atomic" }, \
-	{ FUSE_IOMAP_OP_DONTCACHE, "dontcache" }
+	{ FUSE_IOMAP_OP_DONTCACHE, "dontcache" }, \
+	{ FUSE_IOMAP_OP_WRITEBACK, "writeback" }
 
 #define FUSE_IOMAP_TYPE_STRINGS \
 	{ FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \
@@ -309,7 +313,8 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
 	{ FUSE_IOMAP_IOEND_UNWRITTEN, "unwritten" }, \
 	{ FUSE_IOMAP_IOEND_BOUNDARY, "boundary" }, \
{ FUSE_IOMAP_IOEND_DIRECT, "direct" }, \ - { FUSE_IOMAP_IOEND_APPEND, "append" } + { FUSE_IOMAP_IOEND_APPEND, "append" }, \ + { FUSE_IOMAP_IOEND_WRITEBACK, "writeback" } #define IOMAP_DIOEND_STRINGS \ { IOMAP_DIO_UNWRITTEN, "unwritten" }, \ @@ -334,6 +339,12 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP); { 1 << FUSE_I_EXCLUSIVE, "excl" }, \ { 1 << FUSE_I_IOMAP, "iomap" } +#define IOMAP_IOEND_STRINGS \ + { IOMAP_IOEND_SHARED, "shared" }, \ + { IOMAP_IOEND_UNWRITTEN, "unwritten" }, \ + { IOMAP_IOEND_BOUNDARY, "boundary" }, \ + { IOMAP_IOEND_DIRECT, "direct" } + DECLARE_EVENT_CLASS(fuse_iomap_check_class, TP_PROTO(const char *func, int line, const char *condition), @@ -683,6 +694,9 @@ DEFINE_EVENT(fuse_iomap_file_io_class, name, \ TP_ARGS(iocb, iter)) DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read); DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write); +DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_read); +DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_write); +DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_write_zero_eof); DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class, TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter, @@ -709,6 +723,8 @@ DEFINE_EVENT(fuse_iomap_file_ioend_class, name, \ TP_ARGS(iocb, iter, ret)) DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end); DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end); +DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_read_end); +DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_write_end); TRACE_EVENT(fuse_iomap_dio_write_end_io, TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written, @@ -781,7 +797,214 @@ DEFINE_EVENT(fuse_iomap_file_range_class, name, \ TP_PROTO(const struct inode *inode, loff_t offset, loff_t length), \ TP_ARGS(inode, offset, length)) DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize_finish); +DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_up); +DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down); 
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range); +DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_flush_unmap_range); +TRACE_EVENT(fuse_iomap_end_ioend, + TP_PROTO(const struct iomap_ioend *ioend), + + TP_ARGS(ioend), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + __field(unsigned int, ioendflags) + __field(int, error) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(ioend->io_inode, fi, fm); + __entry->offset = ioend->io_offset; + __entry->length = ioend->io_size; + __entry->ioendflags = ioend->io_flags; + __entry->error = blk_status_to_errno(ioend->io_bio.bi_status); + ), + + TP_printk(FUSE_IO_RANGE_FMT() " ioendflags (%s) error %d", + FUSE_IO_RANGE_PRINTK_ARGS(), + __print_flags(__entry->ioendflags, "|", IOMAP_IOEND_STRINGS), + __entry->error) +); + +TRACE_EVENT(fuse_iomap_writeback_range, + TP_PROTO(const struct inode *inode, u64 offset, unsigned int count, + u64 end_pos), + + TP_ARGS(inode, offset, count, end_pos), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + __field(uint64_t, end_pos) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = offset; + __entry->length = count; + __entry->end_pos = end_pos; + ), + + TP_printk(FUSE_IO_RANGE_FMT() " end_pos 0x%llx", + FUSE_IO_RANGE_PRINTK_ARGS(), + __entry->end_pos) +); + +TRACE_EVENT(fuse_iomap_writeback_submit, + TP_PROTO(const struct iomap_writepage_ctx *wpc, int error), + + TP_ARGS(wpc, error), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + __field(unsigned int, nr_folios) + __field(uint64_t, addr) + __field(int, error) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(wpc->inode, fi, fm); + __entry->nr_folios = wpc->nr_folios; + __entry->offset = wpc->iomap.offset; + __entry->length = wpc->iomap.length; + __entry->addr = wpc->iomap.addr << 9; + __entry->error = error; + ), + + TP_printk(FUSE_IO_RANGE_FMT() " addr 0x%llx nr_folios %u error %d", + FUSE_IO_RANGE_PRINTK_ARGS(), + __entry->addr, + __entry->nr_folios, + __entry->error) +); + +TRACE_EVENT(fuse_iomap_discard_folio, + 
TP_PROTO(const struct inode *inode, loff_t offset, size_t count), + + TP_ARGS(inode, offset, count), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = offset; + __entry->length = count; + ), + + TP_printk(FUSE_IO_RANGE_FMT(), + FUSE_IO_RANGE_PRINTK_ARGS()) +); + +TRACE_EVENT(fuse_iomap_writepages, + TP_PROTO(const struct inode *inode, const struct writeback_control *wbc), + + TP_ARGS(inode, wbc), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + __field(long, nr_to_write) + __field(bool, sync_all) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = wbc->range_start; + __entry->length = wbc->range_end - wbc->range_start + 1; + __entry->nr_to_write = wbc->nr_to_write; + __entry->sync_all = wbc->sync_mode == WB_SYNC_ALL; + ), + + TP_printk(FUSE_IO_RANGE_FMT() " nr_folios %ld sync_all? %d", + FUSE_IO_RANGE_PRINTK_ARGS(), + __entry->nr_to_write, + __entry->sync_all) +); + +TRACE_EVENT(fuse_iomap_read_folio, + TP_PROTO(const struct folio *folio), + + TP_ARGS(folio), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(folio->mapping->host, fi, fm); + __entry->offset = folio_pos(folio); + __entry->length = folio_size(folio); + ), + + TP_printk(FUSE_IO_RANGE_FMT(), + FUSE_IO_RANGE_PRINTK_ARGS()) +); + +TRACE_EVENT(fuse_iomap_readahead, + TP_PROTO(const struct readahead_control *rac), + + TP_ARGS(rac), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + ), + + TP_fast_assign( + struct readahead_control *mutrac = (struct readahead_control *)rac; + FUSE_INODE_ASSIGN(file_inode(rac->file), fi, fm); + __entry->offset = readahead_pos(mutrac); + __entry->length = readahead_length(mutrac); + ), + + TP_printk(FUSE_IO_RANGE_FMT(), + FUSE_IO_RANGE_PRINTK_ARGS()) +); + +TRACE_EVENT(fuse_iomap_page_mkwrite, + TP_PROTO(const struct vm_fault *vmf), + + TP_ARGS(vmf), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + ), + + TP_fast_assign( + struct 
folio *folio = page_folio(vmf->page); + FUSE_INODE_ASSIGN(file_inode(vmf->vma->vm_file), fi, fm); + __entry->offset = folio_pos(folio); + __entry->length = folio_size(folio); + ), + + TP_printk(FUSE_IO_RANGE_FMT(), + FUSE_IO_RANGE_PRINTK_ARGS()) +); + +TRACE_EVENT(fuse_iomap_fallocate, + TP_PROTO(const struct inode *inode, int mode, loff_t offset, + loff_t length, loff_t newsize), + TP_ARGS(inode, mode, offset, length, newsize), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + __field(loff_t, newsize) + __field(int, mode) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = offset; + __entry->length = length; + __entry->mode = mode; + __entry->newsize = newsize; + ), + + TP_printk(FUSE_IO_RANGE_FMT() " mode 0x%x newsize 0x%llx", + FUSE_IO_RANGE_PRINTK_ARGS(), + __entry->mode, + __entry->newsize) +); #endif /* CONFIG_FUSE_IOMAP */ #endif /* _TRACE_FUSE_H */ diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 2708e17bc46949..d21eebbe12d4c9 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -9,6 +9,7 @@ #include "dev_uring_i.h" #include "fuse_i.h" #include "fuse_dev_i.h" +#include "fuse_iomap.h" #include <linux/init.h> #include <linux/module.h> @@ -1794,6 +1795,12 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size, if (!inode) goto out_up_killsb; + /* no backchannels for messing with the pagecache */ + if (fuse_inode_has_iomap(inode)) { + err = -EOPNOTSUPP; + goto out_iput; + } + mapping = inode->i_mapping; file_size = i_size_read(inode); end = pos + num; @@ -1873,6 +1880,9 @@ static int fuse_retrieve(struct fuse_mount *fm, struct inode *inode, struct fuse_args *args; loff_t pos = outarg->offset; + if (fuse_inode_has_iomap(inode)) + return -EOPNOTSUPP; + offset = offset_in_page(pos); file_size = i_size_read(inode); diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 7a7dfc4f665d8e..3136326bafb858 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -1035,6 +1035,8 @@ fuse_iomap_write_zero_eof( return 
1; } + trace_fuse_iomap_write_zero_eof(iocb, from); + filemap_invalidate_lock(mapping); error = fuse_iomap_zero_range(inode, isize, iocb->ki_pos - isize, NULL); filemap_invalidate_unlock(mapping); @@ -1165,6 +1167,8 @@ static void fuse_iomap_end_ioend(struct iomap_ioend *ioend) if (!error && fuse_is_bad(inode)) error = -EIO; + trace_fuse_iomap_end_ioend(ioend); + if (ioend->io_flags & IOMAP_IOEND_SHARED) ioendflags |= FUSE_IOMAP_IOEND_SHARED; if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN) @@ -1273,6 +1277,8 @@ static void fuse_iomap_discard_folio(struct folio *folio, loff_t pos, int error) ASSERT(fuse_inode_has_iomap(inode)); + trace_fuse_iomap_discard_folio(inode, pos, folio_size(folio)); + printk_ratelimited(KERN_ERR "page discard on page %px, inode 0x%llx, pos %llu.", folio, fi->orig_ino, pos); @@ -1297,6 +1303,8 @@ static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc, ASSERT(fuse_inode_has_iomap(inode)); + trace_fuse_iomap_writeback_range(inode, offset, len, end_pos); + if (!fuse_iomap_revalidate_writeback(wpc, offset)) { ret = fuse_iomap_begin(inode, offset, len, FUSE_IOMAP_OP_WRITEBACK, @@ -1334,6 +1342,8 @@ static int fuse_iomap_writeback_submit(struct iomap_writepage_ctx *wpc, ASSERT(fuse_inode_has_iomap(ioend->io_inode)); + trace_fuse_iomap_writeback_submit(wpc, error); + /* always call our ioend function, even if we cancel the bio */ ioend->io_bio.bi_end_io = fuse_iomap_end_bio; return iomap_ioend_writeback_submit(wpc, error); @@ -1357,6 +1367,8 @@ static int fuse_iomap_writepages(struct address_space *mapping, ASSERT(fuse_inode_has_iomap(mapping->host)); + trace_fuse_iomap_writepages(mapping->host, wbc); + return iomap_writepages(&wpc.ctx); } @@ -1364,6 +1376,8 @@ static int fuse_iomap_read_folio(struct file *file, struct folio *folio) { ASSERT(fuse_inode_has_iomap(file_inode(file))); + trace_fuse_iomap_read_folio(folio); + iomap_bio_read_folio(folio, &fuse_iomap_ops); return 0; } @@ -1372,6 +1386,8 @@ static void 
fuse_iomap_readahead(struct readahead_control *rac) { ASSERT(fuse_inode_has_iomap(file_inode(rac->file))); + trace_fuse_iomap_readahead(rac); + iomap_bio_readahead(rac, &fuse_iomap_ops); } @@ -1420,6 +1436,8 @@ static vm_fault_t fuse_iomap_page_mkwrite(struct vm_fault *vmf) ASSERT(fuse_inode_has_iomap(inode)); + trace_fuse_iomap_page_mkwrite(vmf); + sb_start_pagefault(inode->i_sb); file_update_time(vmf->vma->vm_file); @@ -1453,6 +1471,8 @@ static ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to) ASSERT(fuse_inode_has_iomap(inode)); + trace_fuse_iomap_buffered_read(iocb, to); + if (!iov_iter_count(to)) return 0; /* skip atime */ @@ -1464,6 +1484,7 @@ static ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to) file_accessed(iocb->ki_filp); inode_unlock_shared(inode); + trace_fuse_iomap_buffered_read_end(iocb, to, ret); return ret; } @@ -1477,6 +1498,8 @@ static ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, ASSERT(fuse_inode_has_iomap(inode)); + trace_fuse_iomap_buffered_write(iocb, from); + if (!iov_iter_count(from)) return 0; @@ -1505,6 +1528,7 @@ static ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, /* Handle various SYNC-type writes */ ret = generic_write_sync(iocb, ret); } + trace_fuse_iomap_buffered_write_end(iocb, from, ret); return ret; } @@ -1626,11 +1650,17 @@ fuse_iomap_setsize_start( * extension, or zeroing out the rest of the block on a downward * truncate. 
*/ - if (newsize > oldsize) + if (newsize > oldsize) { + trace_fuse_iomap_truncate_up(inode, oldsize, newsize - oldsize); + error = fuse_iomap_zero_range(inode, oldsize, newsize - oldsize, &did_zeroing); - else + } else { + trace_fuse_iomap_truncate_down(inode, newsize, + oldsize - newsize); + error = fuse_iomap_truncate_page(inode, newsize, &did_zeroing); + } if (error) return error; @@ -1677,6 +1707,8 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos, start = round_down(pos, rounding); end = round_up(endpos + 1, rounding) - 1; + trace_fuse_iomap_flush_unmap_range(inode, start, end + 1 - start); + error = filemap_write_and_wait_range(inode->i_mapping, start, end); if (error) return error; @@ -1690,6 +1722,8 @@ static int fuse_iomap_punch_range(struct inode *inode, loff_t offset, loff_t isize = i_size_read(inode); int error; + trace_fuse_iomap_punch_range(inode, offset, length); + /* * Now that we've unmap all full blocks we'll have to zero out any * partial block at the beginning and/or end. iomap_zero_range is @@ -1733,6 +1767,8 @@ fuse_iomap_fallocate( ASSERT(fuse_inode_has_iomap(inode)); + trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size); + /* * If we unmapped blocks from the file range, then we zero the * pagecache for those regions and push them to disk rather than make ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 18/33] fuse: use an unrestricted backing device with iomap pagecache io
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (16 preceding siblings ...)
2026-04-29 14:28 ` [PATCH 17/33] fuse_trace: " Darrick J. Wong
@ 2026-04-29 14:28 ` Darrick J. Wong
2026-04-29 14:28 ` [PATCH 19/33] fuse: implement large folios for iomap pagecache files Darrick J. Wong
` (14 subsequent siblings)
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:28 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

With iomap support turned on for the pagecache, the kernel issues
writeback directly to block devices and we no longer have to push all
those pages through the fuse device to userspace.  Therefore, we don't
need the tight dirty limits (~1M) that are used for regular fuse.  This
dramatically increases the performance of fuse's pagecache IO.

A reviewer of this patch asked why we reset s_bdi to the noop bdi and
call super_setup_bdi_name a second time, instead of simply clearing
STRICTLIMIT and resetting the bdi max ratio.  That's sufficient to undo
the effects of fuse_bdi_init, yes.  However, the BDI gets created with
the name "$major:$minor{-fuseblk}" and there are "management" scripts
that try to tweak fuse BDIs for better performance.

I don't want some dumb script to mismanage a fuse-iomap filesystem
because it can't tell the difference, so I create a new bdi with the
name "$major:$minor.iomap" to make it obvious.  But super_setup_bdi_name
gets cranky if s_bdi isn't set to noop and we don't want to fail a mount
here due to ENOMEM so ... I implemented this weird switcheroo code.
Also, userspace scripts such as udev rules can modify the bdi as soon as
it appears in sysfs, so we can't run the fuse_bdi_init code in reverse
and expect that to undo everything.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_iomap.c |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 3136326bafb858..3c40a1e58017b3 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -718,6 +718,27 @@ const struct fuse_backing_ops fuse_iomap_backing_ops = {
 void fuse_iomap_mount(struct fuse_mount *fm)
 {
 	struct fuse_conn *fc = fm->fc;
+	struct super_block *sb = fm->sb;
+	struct backing_dev_info *old_bdi = sb->s_bdi;
+	char *suffix = sb->s_bdev ? "-fuseblk" : "-fuse";
+	int res;
+
+	/*
+	 * sb->s_bdi points to the initial private bdi.  However, we want to
+	 * redirect it to a new private bdi with default dirty and readahead
+	 * settings because iomap writeback won't be pushing a ton of dirty
+	 * data through the fuse device.  If this fails we fall back to the
+	 * initial fuse bdi.
+	 */
+	sb->s_bdi = &noop_backing_dev_info;
+	res = super_setup_bdi_name(sb, "%u:%u%s.iomap", MAJOR(fc->dev),
+				   MINOR(fc->dev), suffix);
+	if (res) {
+		sb->s_bdi = old_bdi;
+	} else {
+		bdi_unregister(old_bdi);
+		bdi_put(old_bdi);
+	}

 	/*
 	 * Enable syncfs for iomap fuse servers so that we can send a final

^ permalink raw reply related [flat|nested] 191+ messages in thread
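[Editorial sketch, not part of the patch.] The bdi naming in patch 18 can be modelled in userspace to show how a management script might tell a fuse-iomap bdi from a regular fuse one. Only the format string "%u:%u%s.iomap" and the two suffixes come from the hunk; the helper names `format_iomap_bdi_name` and `bdi_name_is_iomap` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace helper (not kernel code): build a bdi name with
 * the same format string that the patch passes to super_setup_bdi_name().
 * The suffix is "-fuseblk" for block-device mounts and "-fuse" otherwise.
 */
int format_iomap_bdi_name(char *buf, size_t len, unsigned int major,
			  unsigned int minor, bool has_bdev)
{
	const char *suffix = has_bdev ? "-fuseblk" : "-fuse";

	return snprintf(buf, len, "%u:%u%s.iomap", major, minor, suffix);
}

/* Script-style check: does this bdi name end in ".iomap"? */
bool bdi_name_is_iomap(const char *name)
{
	const char *tail = ".iomap";
	size_t n = strlen(name);
	size_t t = strlen(tail);

	return n >= t && strcmp(name + n - t, tail) == 0;
}
```

A script that only wants to tune classic fuse bdis could skip any name for which `bdi_name_is_iomap()` returns true.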
* [PATCH 19/33] fuse: implement large folios for iomap pagecache files
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (17 preceding siblings ...)
2026-04-29 14:28 ` [PATCH 18/33] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
@ 2026-04-29 14:28 ` Darrick J. Wong
2026-04-29 14:28 ` [PATCH 20/33] fuse: advertise support for iomap Darrick J. Wong
` (13 subsequent siblings)
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:28 UTC (permalink / raw)
To: djwong, miklos
Cc: joannelkoong, joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Use large folios when we're using iomap.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/fuse_iomap.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 3c40a1e58017b3..3519d5a9c836ba 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -1430,12 +1430,18 @@ static const struct address_space_operations fuse_iomap_aops = {
 static inline void fuse_inode_set_iomap(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	unsigned int min_order = 0;

 	inode->i_data.a_ops = &fuse_iomap_aops;

 	INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
 	INIT_LIST_HEAD(&fi->ioend_list);
 	spin_lock_init(&fi->ioend_lock);
+
+	if (inode->i_blkbits > PAGE_SHIFT)
+		min_order = inode->i_blkbits - PAGE_SHIFT;
+
+	mapping_set_folio_min_order(inode->i_mapping, min_order);
 	set_bit(FUSE_I_IOMAP, &fi->state);
 }

^ permalink raw reply related [flat|nested] 191+ messages in thread
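[Editorial sketch, not part of the patch.] The min_order arithmetic in patch 19 is small enough to model on its own. In this userspace version, `page_shift` stands in for the kernel's PAGE_SHIFT and `blkbits` for inode->i_blkbits; the function name `folio_min_order` is hypothetical.

```c
#include <assert.h>

/*
 * Userspace model of the min_order computation in the hunk above.  A
 * folio of order N spans (1 << N) pages, so a filesystem block larger
 * than a page needs a minimum order of blkbits - page_shift.
 */
unsigned int folio_min_order(unsigned int blkbits, unsigned int page_shift)
{
	/* Blocks no larger than a page need no minimum folio order. */
	if (blkbits <= page_shift)
		return 0;

	/* Otherwise every folio must cover at least one whole fs block. */
	return blkbits - page_shift;
}
```

For example, a 64k block size (blkbits 16) with 4k pages (page_shift 12) gives a minimum order of 4, that is, folios of at least 16 pages.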
* [PATCH 20/33] fuse: advertise support for iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (18 preceding siblings ...) 2026-04-29 14:28 ` [PATCH 19/33] fuse: implement large folios for iomap pagecache files Darrick J. Wong @ 2026-04-29 14:28 ` Darrick J. Wong 2026-04-29 14:29 ` [PATCH 21/33] fuse: query filesystem geometry when using iomap Darrick J. Wong ` (12 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:28 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Advertise our new IO paths programmatically by creating an ioctl that can return the capabilities of the kernel. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_iomap.h | 4 ++++ include/uapi/linux/fuse.h | 9 +++++++++ fs/fuse/dev.c | 3 +++ fs/fuse/fuse_iomap.c | 13 +++++++++++++ 4 files changed, 29 insertions(+) diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index be37ddac2f1e25..9b17f4414dcca4 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -61,6 +61,9 @@ int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset, loff_t length, loff_t new_size); int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos, loff_t endpos); + +int fuse_dev_ioctl_iomap_support(struct file *file, + struct fuse_iomap_support __user *argp); #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) @@ -84,6 +87,7 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos, # define fuse_iomap_setsize_start(...) (-ENOSYS) # define fuse_iomap_fallocate(...) (-ENOSYS) # define fuse_iomap_flush_unmap_range(...) (-ENOSYS) +# define fuse_dev_ioctl_iomap_support(...) 
(-EOPNOTSUPP) #endif /* CONFIG_FUSE_IOMAP */ #endif /* _FS_FUSE_IOMAP_H */ diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 71b216262c84cb..de9b56e6e8d250 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -1156,6 +1156,13 @@ struct fuse_backing_map { uint64_t padding; }; +/* basic file I/O functionality through iomap */ +#define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0) +struct fuse_iomap_support { + uint64_t flags; + uint64_t padding; +}; + /* Device ioctls: */ #define FUSE_DEV_IOC_MAGIC 229 #define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t) @@ -1163,6 +1170,8 @@ struct fuse_backing_map { struct fuse_backing_map) #define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t) #define FUSE_DEV_IOC_SYNC_INIT _IO(FUSE_DEV_IOC_MAGIC, 3) +#define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 99, \ + struct fuse_iomap_support) struct fuse_lseek_in { uint64_t fh; diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index d21eebbe12d4c9..75e1e3f8a4ddd1 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -2712,6 +2712,9 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd, case FUSE_DEV_IOC_SYNC_INIT: return fuse_dev_ioctl_sync_init(file); + case FUSE_DEV_IOC_IOMAP_SUPPORT: + return fuse_dev_ioctl_iomap_support(file, argp); + default: return -ENOTTY; } diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 3519d5a9c836ba..f57e0317f7324e 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -1822,3 +1822,16 @@ fuse_iomap_fallocate( file_update_time(file); return 0; } + +int fuse_dev_ioctl_iomap_support(struct file *file, + struct fuse_iomap_support __user *argp) +{ + struct fuse_iomap_support ios = { }; + + if (fuse_iomap_enabled()) + ios.flags = FUSE_IOMAP_SUPPORT_FILEIO; + + if (copy_to_user(argp, &ios, sizeof(ios))) + return -EFAULT; + return 0; +} ^ permalink raw reply related [flat|nested] 191+ messages in thread
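[Editorial sketch, not part of the patch.] A fuse server could probe for the capability advertised by patch 20 roughly as follows. The struct, flag, and ioctl number are local copies of what this patch proposes; they are not in any released <linux/fuse.h>, and the number 99 looks like a placeholder that may change. The helper name `fuse_iomap_fileio_supported` is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Local copies of the uapi definitions proposed in this patch. */
#define FUSE_DEV_IOC_MAGIC		229
#define FUSE_IOMAP_SUPPORT_FILEIO	(1ULL << 0)

struct fuse_iomap_support {
	uint64_t flags;
	uint64_t padding;
};

#define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 99, \
					     struct fuse_iomap_support)

/*
 * Hypothetical helper: ask the kernel behind fuse_dev_fd (an open
 * /dev/fuse descriptor) whether iomap file IO is advertised.  Returns 1
 * if the FILEIO flag is set, 0 if it is not, or a negative errno if the
 * ioctl itself failed.
 */
int fuse_iomap_fileio_supported(int fuse_dev_fd)
{
	struct fuse_iomap_support ios = { 0 };

	if (ioctl(fuse_dev_fd, FUSE_DEV_IOC_IOMAP_SUPPORT, &ios) < 0)
		return -errno;

	return (ios.flags & FUSE_IOMAP_SUPPORT_FILEIO) ? 1 : 0;
}
```

Going by the diff, a kernel without this patch would fall through to the default case of fuse_dev_ioctl and return -ENOTTY, and a kernel built without CONFIG_FUSE_IOMAP would return -EOPNOTSUPP, so a server should treat any negative result as "no iomap".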
* [PATCH 21/33] fuse: query filesystem geometry when using iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (19 preceding siblings ...) 2026-04-29 14:28 ` [PATCH 20/33] fuse: advertise support for iomap Darrick J. Wong @ 2026-04-29 14:29 ` Darrick J. Wong 2026-04-29 14:29 ` [PATCH 22/33] fuse_trace: " Darrick J. Wong ` (11 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:29 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Add a new upcall to the fuse server so that the kernel can request filesystem geometry bits when iomap mode is in use. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 4 + fs/fuse/fuse_iomap.h | 6 +- include/uapi/linux/fuse.h | 39 ++++++++++++ fs/fuse/fuse_iomap.c | 147 +++++++++++++++++++++++++++++++++++++++++++++ fs/fuse/inode.c | 42 ++++++++++--- 5 files changed, 227 insertions(+), 11 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 23212ca1b6871e..0d9ac3ff18eedf 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1036,6 +1036,9 @@ struct fuse_conn { struct fuse_ring *ring; #endif + /** How many subsystems still need initialization? 
*/ + atomic_t need_init; + /** Only used if the connection opts into request timeouts */ struct { /* Worker for checking if any requests have timed out */ @@ -1447,6 +1450,7 @@ struct fuse_dev *fuse_dev_alloc(void); void fuse_dev_install(struct fuse_dev *fud, struct fuse_conn *fc); void fuse_dev_put(struct fuse_dev *fud); int fuse_send_init(struct fuse_mount *fm); +void fuse_finish_init(struct fuse_conn *fc, bool ok); /** * Fill in superblock and initialize fuse connection diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 9b17f4414dcca4..13b5c5c896f25a 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -21,7 +21,8 @@ static inline bool fuse_has_iomap(const struct inode *inode) extern const struct fuse_backing_ops fuse_iomap_backing_ops; -void fuse_iomap_mount(struct fuse_mount *fm); +int fuse_iomap_mount(struct fuse_mount *fm); +void fuse_iomap_mount_async(struct fuse_mount *fm); void fuse_iomap_unmount(struct fuse_mount *fm); void fuse_iomap_init_inode(struct inode *inode, struct fuse_attr *attr); @@ -67,7 +68,8 @@ int fuse_dev_ioctl_iomap_support(struct file *file, #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) -# define fuse_iomap_mount(...) ((void)0) +# define fuse_iomap_mount(...) (0) +# define fuse_iomap_mount_async(...) ((void)0) # define fuse_iomap_unmount(...) ((void)0) # define fuse_iomap_init_inode(...) ((void)0) # define fuse_iomap_evict_inode(...) 
((void)0) diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index de9b56e6e8d250..33668d66e9c4b4 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -246,6 +246,7 @@ * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations * - add FUSE_ATTR_EXCLUSIVE to enable exclusive mode for specific inodes * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes + * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry */ #ifndef _LINUX_FUSE_H @@ -677,6 +678,7 @@ enum fuse_opcode { FUSE_STATX = 52, FUSE_COPY_FILE_RANGE_64 = 53, + FUSE_IOMAP_CONFIG = 4092, FUSE_IOMAP_IOEND = 4093, FUSE_IOMAP_BEGIN = 4094, FUSE_IOMAP_END = 4095, @@ -1452,4 +1454,41 @@ struct fuse_iomap_ioend_out { uint64_t newsize; /* new ondisk size */ }; +struct fuse_iomap_config_in { + uint64_t flags; /* supported FUSE_IOMAP_CONFIG_* flags */ + int64_t maxbytes; /* maximum supported file size */ + uint64_t padding[6]; /* zero */ +}; + +/* Which fields are set in fuse_iomap_config_out? 
*/ +#define FUSE_IOMAP_CONFIG_SID (1 << 0ULL) +#define FUSE_IOMAP_CONFIG_UUID (1 << 1ULL) +#define FUSE_IOMAP_CONFIG_BLOCKSIZE (1 << 2ULL) +#define FUSE_IOMAP_CONFIG_MAX_LINKS (1 << 3ULL) +#define FUSE_IOMAP_CONFIG_TIME (1 << 4ULL) +#define FUSE_IOMAP_CONFIG_MAXBYTES (1 << 5ULL) + +struct fuse_iomap_config_out { + uint64_t flags; /* FUSE_IOMAP_CONFIG_* */ + + char s_id[32]; /* Informational name */ + char s_uuid[16]; /* UUID */ + + uint8_t s_uuid_len; /* length of s_uuid */ + + uint8_t s_pad[3]; /* must be zeroes */ + + uint32_t s_blocksize; /* fs block size */ + uint32_t s_max_links; /* max hard links */ + + /* Granularity of c/m/atime in ns (cannot be worse than a second) */ + uint32_t s_time_gran; + + /* Time limits for c/m/atime in seconds */ + int64_t s_time_min; + int64_t s_time_max; + + int64_t s_maxbytes; /* max file size */ +}; + #endif /* _LINUX_FUSE_H */ diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index f57e0317f7324e..cd74497ceb3f42 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -715,14 +715,103 @@ const struct fuse_backing_ops fuse_iomap_backing_ops = { .post_open = fuse_iomap_post_open, }; -void fuse_iomap_mount(struct fuse_mount *fm) +struct fuse_iomap_config_args { + struct fuse_args args; + struct fuse_iomap_config_in inarg; + struct fuse_iomap_config_out outarg; +}; + +#define FUSE_IOMAP_CONFIG_ALL (FUSE_IOMAP_CONFIG_SID | \ + FUSE_IOMAP_CONFIG_UUID | \ + FUSE_IOMAP_CONFIG_BLOCKSIZE | \ + FUSE_IOMAP_CONFIG_MAX_LINKS | \ + FUSE_IOMAP_CONFIG_TIME | \ + FUSE_IOMAP_CONFIG_MAXBYTES) + +static int fuse_iomap_process_config(struct fuse_mount *fm, int error, + const struct fuse_iomap_config_out *outarg) { + struct super_block *sb = fm->sb; + + switch (error) { + case 0: + break; + case -ENOSYS: + return 0; + default: + return error; + } + + if (outarg->flags & ~FUSE_IOMAP_CONFIG_ALL) + return -EINVAL; + + if (outarg->s_uuid_len > sizeof(outarg->s_uuid)) + return -EINVAL; + + if (memchr_inv(outarg->s_pad, 0, 
sizeof(outarg->s_pad))) + return -EINVAL; + + if (outarg->flags & FUSE_IOMAP_CONFIG_BLOCKSIZE) { + if (sb->s_bdev) { +#ifdef CONFIG_BLOCK + if (!sb_set_blocksize(sb, outarg->s_blocksize)) + return -EINVAL; +#else + /* + * XXX: how do we have a bdev filesystem without + * CONFIG_BLOCK??? + */ + return -EINVAL; +#endif + } else { + sb->s_blocksize = outarg->s_blocksize; + sb->s_blocksize_bits = blksize_bits(outarg->s_blocksize); + } + } + + if (outarg->flags & FUSE_IOMAP_CONFIG_SID) + memcpy(sb->s_id, outarg->s_id, sizeof(sb->s_id)); + + if (outarg->flags & FUSE_IOMAP_CONFIG_UUID) { + memcpy(&sb->s_uuid, outarg->s_uuid, outarg->s_uuid_len); + sb->s_uuid_len = outarg->s_uuid_len; + } + + if (outarg->flags & FUSE_IOMAP_CONFIG_MAX_LINKS) + sb->s_max_links = outarg->s_max_links; + + if (outarg->flags & FUSE_IOMAP_CONFIG_TIME) { + sb->s_time_gran = outarg->s_time_gran; + sb->s_time_min = outarg->s_time_min; + sb->s_time_max = outarg->s_time_max; + } + + if (outarg->flags & FUSE_IOMAP_CONFIG_MAXBYTES) + sb->s_maxbytes = outarg->s_maxbytes; + + return 0; +} + +static void fuse_iomap_config_reply(struct fuse_mount *fm, + struct fuse_args *args, int error) +{ + struct fuse_iomap_config_args *ia = + container_of(args, struct fuse_iomap_config_args, args); struct fuse_conn *fc = fm->fc; struct super_block *sb = fm->sb; struct backing_dev_info *old_bdi = sb->s_bdi; char *suffix = sb->s_bdev ? "-fuseblk" : "-fuse"; + bool ok = true; int res; + res = fuse_iomap_process_config(fm, error, &ia->outarg); + if (res) { + printk(KERN_ERR "%s: could not configure iomap, err=%d", + sb->s_id, res); + ok = false; + goto done; + } + /* * sb->s_bdi points to the initial private bdi. However, we want to * redirect it to a new private bdi with default dirty and readahead @@ -746,6 +835,62 @@ void fuse_iomap_mount(struct fuse_mount *fm) * freeze/thaw properly. 
*/ fc->sync_fs = true; + +done: + kfree(ia); + fuse_finish_init(fc, ok); +} + +static struct fuse_iomap_config_args * +fuse_iomap_new_mount(struct fuse_mount *fm) +{ + struct fuse_iomap_config_args *ia; + + ia = kzalloc(sizeof(*ia), GFP_KERNEL | __GFP_NOFAIL); + ia->inarg.maxbytes = MAX_LFS_FILESIZE; + ia->inarg.flags = FUSE_IOMAP_CONFIG_ALL; + + ia->args.opcode = FUSE_IOMAP_CONFIG; + ia->args.nodeid = 0; + ia->args.in_numargs = 1; + ia->args.in_args[0].size = sizeof(ia->inarg); + ia->args.in_args[0].value = &ia->inarg; + ia->args.out_argvar = true; + ia->args.out_numargs = 1; + ia->args.out_args[0].size = sizeof(ia->outarg); + ia->args.out_args[0].value = &ia->outarg; + ia->args.force = true; + ia->args.nocreds = true; + + return ia; +} + +int fuse_iomap_mount(struct fuse_mount *fm) +{ + struct fuse_iomap_config_args *ia = fuse_iomap_new_mount(fm); + int err; + + ASSERT(fm->fc->sync_init); + + err = fuse_simple_request(fm, &ia->args); + /* Ignore size of iomap_config reply */ + if (err > 0) + err = 0; + fuse_iomap_config_reply(fm, &ia->args, err); + return err; +} + +void fuse_iomap_mount_async(struct fuse_mount *fm) +{ + struct fuse_iomap_config_args *ia = fuse_iomap_new_mount(fm); + int err; + + ASSERT(!fm->fc->sync_init); + + ia->args.end = fuse_iomap_config_reply; + err = fuse_simple_background(fm, &ia->args, GFP_KERNEL); + if (err) + fuse_iomap_config_reply(fm, &ia->args, -ENOTCONN); } void fuse_iomap_unmount(struct fuse_mount *fm) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 23ca401a3e08e6..f6ec67a8eb86a2 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1383,6 +1383,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, struct fuse_init_out *arg = &ia->out; bool ok = true; + atomic_inc(&fc->need_init); + if (error || arg->major != FUSE_KERNEL_VERSION) ok = false; else { @@ -1529,9 +1531,6 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, init_server_timeout(fc, timeout); - if 
(fc->iomap) - fuse_iomap_mount(fm); - fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages); fc->minor = arg->minor; @@ -1541,13 +1540,27 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, } kfree(ia); - if (!ok) { + if (!ok) fc->conn_init = 0; + + if (ok && fc->iomap) { + atomic_inc(&fc->need_init); + if (!fc->sync_init) + fuse_iomap_mount_async(fm); + } + + fuse_finish_init(fc, ok); +} + +void fuse_finish_init(struct fuse_conn *fc, bool ok) +{ + if (!ok) fc->conn_error = 1; - } - fuse_set_initialized(fc); - wake_up_all(&fc->blocked_waitq); + if (atomic_dec_and_test(&fc->need_init)) { + fuse_set_initialized(fc); + wake_up_all(&fc->blocked_waitq); + } } static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm) @@ -2028,7 +2041,20 @@ static int fuse_fill_super(struct super_block *sb, struct fs_context *fsc) fm = get_fuse_mount_super(sb); - return fuse_send_init(fm); + err = fuse_send_init(fm); + if (err) + return err; + + if (fm->fc->conn_init && fm->fc->sync_init && fm->fc->iomap) { + err = fuse_iomap_mount(fm); + if (err) + return err; + } + + if (fm->fc->conn_error) + return -EIO; + + return 0; } /* ^ permalink raw reply related [flat|nested] 191+ messages in thread
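[Editorial sketch, not part of the patch.] The need_init counter introduced in patch 21 delays marking the connection initialized until both FUSE_INIT and FUSE_IOMAP_CONFIG have finished. The userspace model below uses C11 atomics in place of the kernel's atomic_t; `struct conn_model` and the `model_*` names are hypothetical stand-ins, and the wakeup of blocked waiters is reduced to setting a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-in for the fields of struct fuse_conn used here. */
struct conn_model {
	atomic_int need_init;	/* subsystems still initializing */
	bool conn_error;
	bool initialized;
};

/* Mirrors fuse_finish_init(): the last finisher marks the conn usable. */
void model_finish_init(struct conn_model *c, bool ok)
{
	if (!ok)
		c->conn_error = true;

	/* atomic_fetch_sub returns the old value; 1 means we reached zero. */
	if (atomic_fetch_sub(&c->need_init, 1) == 1)
		c->initialized = true;
}

/*
 * Mirrors the counting in process_init_reply(): one reference for
 * FUSE_INIT itself and, if init succeeded with iomap enabled, one more
 * that the FUSE_IOMAP_CONFIG reply drops later.
 */
void model_init_reply(struct conn_model *c, bool ok, bool iomap)
{
	atomic_fetch_add(&c->need_init, 1);
	if (ok && iomap)
		atomic_fetch_add(&c->need_init, 1);
	model_finish_init(c, ok);
}
```

With iomap enabled, `model_init_reply()` leaves the counter at one, and only the later `model_finish_init()` call (the config reply in the real code) flips `initialized`.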
* [PATCH 22/33] fuse_trace: query filesystem geometry when using iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (20 preceding siblings ...) 2026-04-29 14:29 ` [PATCH 21/33] fuse: query filesystem geometry when using iomap Darrick J. Wong @ 2026-04-29 14:29 ` Darrick J. Wong 2026-04-29 14:29 ` [PATCH 23/33] fuse: implement fadvise for iomap files Darrick J. Wong ` (10 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:29 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Add tracepoints for the previous patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_trace.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/fuse/fuse_iomap.c | 2 ++ 2 files changed, 50 insertions(+) diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index c832fb9012d983..96c4db84c7106a 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -58,6 +58,7 @@ EM( FUSE_SYNCFS, "FUSE_SYNCFS") \ EM( FUSE_TMPFILE, "FUSE_TMPFILE") \ EM( FUSE_STATX, "FUSE_STATX") \ + EM( FUSE_IOMAP_CONFIG, "FUSE_IOMAP_CONFIG") \ EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \ EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \ EM( FUSE_IOMAP_IOEND, "FUSE_IOMAP_IOEND") \ @@ -345,6 +346,14 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP); { IOMAP_IOEND_BOUNDARY, "boundary" }, \ { IOMAP_IOEND_DIRECT, "direct" } +#define FUSE_IOMAP_CONFIG_STRINGS \ + { FUSE_IOMAP_CONFIG_SID, "sid" }, \ + { FUSE_IOMAP_CONFIG_UUID, "uuid" }, \ + { FUSE_IOMAP_CONFIG_BLOCKSIZE, "blocksize" }, \ + { FUSE_IOMAP_CONFIG_MAX_LINKS, "max_links" }, \ + { FUSE_IOMAP_CONFIG_TIME, "time" }, \ + { FUSE_IOMAP_CONFIG_MAXBYTES, "maxbytes" } + DECLARE_EVENT_CLASS(fuse_iomap_check_class, TP_PROTO(const char *func, int line, const char *condition), @@ -1005,6 +1014,45 @@ TRACE_EVENT(fuse_iomap_fallocate, __entry->mode, 
__entry->newsize) ); + +TRACE_EVENT(fuse_iomap_config, + TP_PROTO(const struct fuse_mount *fm, + const struct fuse_iomap_config_out *outarg), + TP_ARGS(fm, outarg), + + TP_STRUCT__entry( + __field(dev_t, connection) + + __field(uint64_t, flags) + __field(uint32_t, blocksize) + __field(uint32_t, max_links) + __field(uint32_t, time_gran) + + __field(int64_t, time_min) + __field(int64_t, time_max) + __field(int64_t, maxbytes) + __field(uint8_t, uuid_len) + ), + + TP_fast_assign( + __entry->connection = fm->fc->dev; + __entry->flags = outarg->flags; + __entry->blocksize = outarg->s_blocksize; + __entry->max_links = outarg->s_max_links; + __entry->time_gran = outarg->s_time_gran; + __entry->time_min = outarg->s_time_min; + __entry->time_max = outarg->s_time_max; + __entry->maxbytes = outarg->s_maxbytes; + __entry->uuid_len = outarg->s_uuid_len; + ), + + TP_printk("connection %u flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u", + __entry->connection, + __print_flags(__entry->flags, "|", FUSE_IOMAP_CONFIG_STRINGS), + __entry->blocksize, __entry->max_links, __entry->time_gran, + __entry->time_min, __entry->time_max, __entry->maxbytes, + __entry->uuid_len) +); #endif /* CONFIG_FUSE_IOMAP */ #endif /* _TRACE_FUSE_H */ diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index cd74497ceb3f42..9a856617234f78 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -742,6 +742,8 @@ static int fuse_iomap_process_config(struct fuse_mount *fm, int error, return error; } + trace_fuse_iomap_config(fm, outarg); + if (outarg->flags & ~FUSE_IOMAP_CONFIG_ALL) return -EINVAL; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 23/33] fuse: implement fadvise for iomap files 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (21 preceding siblings ...) 2026-04-29 14:29 ` [PATCH 22/33] fuse_trace: " Darrick J. Wong @ 2026-04-29 14:29 ` Darrick J. Wong 2026-04-29 14:29 ` [PATCH 24/33] fuse: invalidate ranges of block devices being used for iomap Darrick J. Wong ` (9 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:29 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> If userspace asks us to perform readahead on a file, take i_rwsem so that it can't race with hole punching or writes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_iomap.h | 3 +++ fs/fuse/file.c | 1 + fs/fuse/fuse_iomap.c | 20 ++++++++++++++++++++ 3 files changed, 24 insertions(+) diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 13b5c5c896f25a..17d0507a243b59 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -65,6 +65,8 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos, int fuse_dev_ioctl_iomap_support(struct file *file, struct fuse_iomap_support __user *argp); + +int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice); #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) @@ -90,6 +92,7 @@ int fuse_dev_ioctl_iomap_support(struct file *file, # define fuse_iomap_fallocate(...) (-ENOSYS) # define fuse_iomap_flush_unmap_range(...) (-ENOSYS) # define fuse_dev_ioctl_iomap_support(...) 
(-EOPNOTSUPP) +# define fuse_iomap_fadvise NULL #endif /* CONFIG_FUSE_IOMAP */ #endif /* _FS_FUSE_IOMAP_H */ diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 6471952489aa2e..fa67b20f5ad3ae 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -3257,6 +3257,7 @@ static const struct file_operations fuse_file_operations = { .fallocate = fuse_file_fallocate, .copy_file_range = fuse_copy_file_range, .setlease = generic_setlease, + .fadvise = fuse_iomap_fadvise, }; static const struct address_space_operations fuse_file_aops = { diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 9a856617234f78..ad7c526545776e 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -7,6 +7,7 @@ #include <linux/fiemap.h> #include <linux/pagemap.h> #include <linux/falloc.h> +#include <linux/fadvise.h> #include "fuse_i.h" #include "fuse_trace.h" #include "fuse_iomap.h" @@ -1982,3 +1983,22 @@ int fuse_dev_ioctl_iomap_support(struct file *file, return -EFAULT; return 0; } + +int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice) +{ + struct inode *inode = file_inode(file); + bool needlock = advice == POSIX_FADV_WILLNEED && + fuse_inode_has_iomap(inode); + int ret; + + /* + * Operations creating pages in page cache need protection from hole + * punching and similar ops + */ + if (needlock) + inode_lock_shared(inode); + ret = generic_fadvise(file, start, end, advice); + if (needlock) + inode_unlock_shared(inode); + return ret; +} ^ permalink raw reply related [flat|nested] 191+ messages in thread
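[Editorial sketch, not part of the patch.] For context, this is the kind of userspace request that reaches the new ->fadvise hook in patch 23. Per the hunk, only POSIX_FADV_WILLNEED on an iomap inode takes the shared inode lock; other advice values pass straight to generic_fadvise. The wrapper name `prefetch_range` is hypothetical; posix_fadvise() is the standard POSIX call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Ask the kernel to start reading [offset, offset + len) of fd into the
 * pagecache.  posix_fadvise() returns 0 on success or an error number
 * directly; it does not set errno.
 */
int prefetch_range(int fd, off_t offset, off_t len)
{
	return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}
```

A len of 0 means "through the end of the file", so `prefetch_range(fd, 0, 0)` requests readahead for the whole file.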
* [PATCH 24/33] fuse: invalidate ranges of block devices being used for iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (22 preceding siblings ...) 2026-04-29 14:29 ` [PATCH 23/33] fuse: implement fadvise for iomap files Darrick J. Wong @ 2026-04-29 14:29 ` Darrick J. Wong 2026-04-29 14:30 ` [PATCH 25/33] fuse_trace: " Darrick J. Wong ` (8 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:29 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Make it easier to invalidate the page cache for a block device that is being used in conjunction with iomap. This allows a fuse server to kill all cached data for a block that is being freed, so that block reuse doesn't result in file corruption. Right now, the only way to do this is with fadvise, which ignores and doesn't wait for pages undergoing writeback. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_iomap.h | 3 +++ include/uapi/linux/fuse.h | 16 ++++++++++++++++ fs/fuse/dev.c | 27 +++++++++++++++++++++++++++ fs/fuse/fuse_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 87 insertions(+) diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 17d0507a243b59..31d6f7b392771c 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -65,6 +65,8 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos, int fuse_dev_ioctl_iomap_support(struct file *file, struct fuse_iomap_support __user *argp); +int fuse_iomap_dev_inval(struct fuse_conn *fc, + const struct fuse_iomap_dev_inval_out *arg); int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice); #else @@ -92,6 +94,7 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice); # define fuse_iomap_fallocate(...) 
(-ENOSYS) # define fuse_iomap_flush_unmap_range(...) (-ENOSYS) # define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP) +# define fuse_iomap_dev_inval(...) (-ENOSYS) # define fuse_iomap_fadvise NULL #endif /* CONFIG_FUSE_IOMAP */ diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 33668d66e9c4b4..1ef7152306a24f 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -247,6 +247,7 @@ * - add FUSE_ATTR_EXCLUSIVE to enable exclusive mode for specific inodes * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry + * - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges */ #ifndef _LINUX_FUSE_H @@ -701,6 +702,8 @@ enum fuse_notify_code { FUSE_NOTIFY_RESEND = 7, FUSE_NOTIFY_INC_EPOCH = 8, FUSE_NOTIFY_PRUNE = 9, + FUSE_NOTIFY_IOMAP_DEV_INVAL = 99, + FUSE_NOTIFY_CODE_MAX, }; /* The read buffer is required to be at least 8k, but may be much larger */ @@ -1491,4 +1494,17 @@ struct fuse_iomap_config_out { int64_t s_maxbytes; /* max file size */ }; +struct fuse_range { + uint64_t offset; + uint64_t length; +}; + +struct fuse_iomap_dev_inval_out { + uint32_t dev; /* device cookie */ + uint32_t reserved; /* zero */ + + /* range of bdev pagecache to invalidate, in bytes */ + struct fuse_range range; +}; + #endif /* _LINUX_FUSE_H */ diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 75e1e3f8a4ddd1..9918911fe44855 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1848,6 +1848,30 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size, return err; } +static int fuse_notify_iomap_dev_inval(struct fuse_conn *fc, unsigned int size, + struct fuse_copy_state *cs) +{ + struct fuse_iomap_dev_inval_out outarg; + int err = -EINVAL; + + if (size != sizeof(outarg)) + goto err; + + err = fuse_copy_one(cs, &outarg, sizeof(outarg)); + if (err) + goto err; + if (outarg.reserved) { + err = -EINVAL; + goto err; + } + fuse_copy_finish(cs); + + return 
fuse_iomap_dev_inval(fc, &outarg); +err: + fuse_copy_finish(cs); + return err; +} + struct fuse_retrieve_args { struct fuse_args_pages ap; struct fuse_notify_retrieve_in inarg; @@ -2138,6 +2162,9 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code, case FUSE_NOTIFY_PRUNE: return fuse_notify_prune(fc, size, cs); + case FUSE_NOTIFY_IOMAP_DEV_INVAL: + return fuse_notify_iomap_dev_inval(fc, size, cs); + default: return -EINVAL; } diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index ad7c526545776e..fe937529543b0c 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -2002,3 +2002,44 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice) inode_unlock_shared(inode); return ret; } + +int fuse_iomap_dev_inval(struct fuse_conn *fc, + const struct fuse_iomap_dev_inval_out *arg) +{ + struct fuse_backing *fb; + struct block_device *bdev; + loff_t end; + int ret = 0; + + if (!fc->iomap || arg->dev == FUSE_IOMAP_DEV_NULL) + return -EINVAL; + + down_read(&fc->killsb); + fb = fuse_backing_lookup(fc, &fuse_iomap_backing_ops, arg->dev); + if (!fb) { + ret = -ENODEV; + goto out_killsb; + } + bdev = fb->bdev; + + inode_lock(bdev->bd_mapping->host); + filemap_invalidate_lock(bdev->bd_mapping); + + if (check_add_overflow(arg->range.offset, arg->range.length, &end) || + arg->range.offset >= bdev_nr_bytes(bdev)) { + ret = -EINVAL; + goto out_unlock; + } + + end = min(end, bdev_nr_bytes(bdev)); + truncate_inode_pages_range(bdev->bd_mapping, arg->range.offset, + end - 1); + +out_unlock: + filemap_invalidate_unlock(bdev->bd_mapping); + inode_unlock(bdev->bd_mapping->host); + fuse_backing_put(fb); +out_killsb: + up_read(&fc->killsb); + return ret; +} ^ permalink raw reply related [flat|nested] 191+ messages in thread
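[Editorial sketch, not part of the posted patch.] The range validation in fuse_iomap_dev_inval() above can be modelled in userspace roughly as follows. This is a minimal sketch under stated assumptions: inval_range_clamp() and its parameters are hypothetical names, the GCC/Clang builtin __builtin_add_overflow() stands in for the kernel's check_add_overflow(), and a plain int64_t stands in for loff_t and the value returned by bdev_nr_bytes().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the range checks in fuse_iomap_dev_inval().
 * inval_range_clamp() is a hypothetical name used only for illustration.
 *
 * Returns true and stores the inclusive last byte to invalidate in *last.
 * Returns false in the cases where the patch returns -EINVAL.
 * Zero-length ranges are not special-cased, matching the patch.
 */
static bool inval_range_clamp(uint64_t offset, uint64_t length,
			      int64_t dev_bytes, int64_t *last)
{
	int64_t end;

	/* offset + length must fit in a signed 64-bit file offset. */
	if (__builtin_add_overflow(offset, length, &end))
		return false;

	/* The range must start inside the device. */
	if (offset >= (uint64_t)dev_bytes)
		return false;

	/* Clamp the exclusive end to the device size. */
	if (end > dev_bytes)
		end = dev_bytes;

	/* truncate_inode_pages_range() takes an inclusive last byte. */
	*last = end - 1;
	return true;
}
```

A range that runs past the end of the device is trimmed rather than rejected; only a start offset at or beyond the device size, or an overflowing sum, fails.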
* [PATCH 25/33] fuse_trace: invalidate ranges of block devices being used for iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (23 preceding siblings ...) 2026-04-29 14:29 ` [PATCH 24/33] fuse: invalidate ranges of block devices being used for iomap Darrick J. Wong @ 2026-04-29 14:30 ` Darrick J. Wong 2026-04-29 14:30 ` [PATCH 26/33] fuse: implement inline data file IO via iomap Darrick J. Wong ` (7 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:30 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Add tracepoints for the previous patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_trace.h | 26 ++++++++++++++++++++++++++ fs/fuse/fuse_iomap.c | 2 ++ 2 files changed, 28 insertions(+) diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index 96c4db84c7106a..0e4be645802055 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -1053,6 +1053,32 @@ TRACE_EVENT(fuse_iomap_config, __entry->time_min, __entry->time_max, __entry->maxbytes, __entry->uuid_len) ); + +TRACE_EVENT(fuse_iomap_dev_inval, + TP_PROTO(const struct fuse_conn *fc, + const struct fuse_iomap_dev_inval_out *arg), + TP_ARGS(fc, arg), + + TP_STRUCT__entry( + __field(dev_t, connection) + __field(int, dev) + __field(unsigned long long, offset) + __field(unsigned long long, length) + ), + + TP_fast_assign( + __entry->connection = fc->dev; + __entry->dev = arg->dev; + __entry->offset = arg->range.offset; + __entry->length = arg->range.length; + ), + + TP_printk("connection %u dev %d offset 0x%llx length 0x%llx", + __entry->connection, + __entry->dev, + __entry->offset, + __entry->length) +); #endif /* CONFIG_FUSE_IOMAP */ #endif /* _TRACE_FUSE_H */ diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index fe937529543b0c..807d84a2139362 100644 --- 
a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -2011,6 +2011,8 @@ int fuse_iomap_dev_inval(struct fuse_conn *fc, loff_t end; int ret = 0; + trace_fuse_iomap_dev_inval(fc, arg); + if (!fc->iomap || arg->dev == FUSE_IOMAP_DEV_NULL) return -EINVAL; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 26/33] fuse: implement inline data file IO via iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (24 preceding siblings ...) 2026-04-29 14:30 ` [PATCH 25/33] fuse_trace: " Darrick J. Wong @ 2026-04-29 14:30 ` Darrick J. Wong 2026-04-29 14:30 ` [PATCH 27/33] fuse_trace: " Darrick J. Wong ` (6 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:30 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Implement inline data file IO by issuing FUSE_READ/FUSE_WRITE commands in response to an inline data mapping. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_iomap.c | 213 ++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 205 insertions(+), 8 deletions(-) diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 807d84a2139362..b9531ad5ab5180 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -398,6 +398,156 @@ fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map) return ret; } +static inline int fuse_iomap_inline_alloc(struct iomap *iomap) +{ + ASSERT(iomap->inline_data == NULL); + ASSERT(iomap->length > 0); + + iomap->inline_data = kvzalloc(iomap->length, GFP_KERNEL); + return iomap->inline_data ? 0 : -ENOMEM; +} + +static inline void fuse_iomap_inline_free(struct iomap *iomap) +{ + kvfree(iomap->inline_data); + iomap->inline_data = NULL; +} + +/* + * Use the FUSE_READ command to read inline file data from the fuse server. + * Note that there's no file handle attached, so the fuse server must be able + * to reconnect to the inode via the nodeid. 
+ */ +static int fuse_iomap_inline_read(struct inode *inode, loff_t pos, + loff_t count, struct iomap *iomap) +{ + struct fuse_read_in in = { + .offset = pos, + .size = count, + }; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_mount *fm = get_fuse_mount(inode); + FUSE_ARGS(args); + ssize_t ret; + + if (BAD_DATA(!iomap_inline_data_valid(iomap))) { + fuse_iomap_inline_free(iomap); + return -EFSCORRUPTED; + } + + args.opcode = FUSE_READ; + args.nodeid = fi->nodeid; + args.in_numargs = 1; + args.in_args[0].size = sizeof(in); + args.in_args[0].value = &in; + args.out_argvar = true; + args.out_numargs = 1; + args.out_args[0].size = count; + args.out_args[0].value = iomap_inline_data(iomap, pos); + + ret = fuse_simple_request(fm, &args); + if (ret == -ENOSYS) + ret = 0; + if (ret < 0) { + fuse_iomap_inline_free(iomap); + return ret; + } + /* no readahead means something bad happened */ + if (ret == 0) { + fuse_iomap_inline_free(iomap); + return -EIO; + } + + return 0; +} + +/* + * Use the FUSE_WRITE command to write inline file data to the fuse server. + * Note that there's no file handle attached, so the fuse server must be able + * to reconnect to the inode via the nodeid. 
+ */ +static int fuse_iomap_inline_write(struct inode *inode, loff_t pos, + loff_t count, struct iomap *iomap) +{ + struct fuse_write_in in = { + .offset = pos, + .size = count, + }; + struct fuse_write_out out = { }; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_mount *fm = get_fuse_mount(inode); + FUSE_ARGS(args); + ssize_t ret; + + if (BAD_DATA(!iomap_inline_data_valid(iomap))) + return -EFSCORRUPTED; + + args.opcode = FUSE_WRITE; + args.nodeid = fi->nodeid; + args.in_numargs = 2; + args.in_args[0].size = sizeof(in); + args.in_args[0].value = &in; + args.in_args[1].size = count; + args.in_args[1].value = iomap_inline_data(iomap, pos); + args.out_numargs = 1; + args.out_args[0].size = sizeof(out); + args.out_args[0].value = &out; + + ret = fuse_simple_request(fm, &args); + if (ret == -ENOSYS) + ret = 0; + if (ret < 0) { + fuse_iomap_inline_free(iomap); + return ret; + } + /* short write means something bad happened */ + if (out.size < count) { + fuse_iomap_inline_free(iomap); + return -EIO; + } + + return 0; +} + +/* Set up inline data buffers for iomap_begin */ +static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags, + loff_t pos, loff_t count, + struct iomap *iomap, struct iomap *srcmap) +{ + int err; + + if (opflags & IOMAP_REPORT) + return 0; + + if (fuse_is_iomap_file_write(opflags)) { + if (iomap->type == IOMAP_INLINE) { + err = fuse_iomap_inline_alloc(iomap); + if (err) + return err; + } + + if (srcmap->type == IOMAP_INLINE) { + err = fuse_iomap_inline_alloc(srcmap); + if (!err) + err = fuse_iomap_inline_read(inode, pos, count, + srcmap); + if (err) { + fuse_iomap_inline_free(iomap); + return err; + } + } + } else if (iomap->type == IOMAP_INLINE) { + /* inline data read */ + err = fuse_iomap_inline_alloc(iomap); + if (!err) + err = fuse_iomap_inline_read(inode, pos, count, iomap); + if (err) + return err; + } + + return 0; +} + static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, unsigned opflags, struct 
iomap *iomap, struct iomap *srcmap) @@ -467,12 +617,20 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, fuse_iomap_from_server(iomap, read_dev, &outarg.read); } + if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) { + err = fuse_iomap_set_inline(inode, opflags, pos, count, iomap, + srcmap); + if (err) + goto out_write_dev; + } + /* * XXX: if we ever want to support closing devices, we need a way to * track the fuse_backing refcount all the way through bio endios. * For now we put the refcount here because you can't remove an iomap * device until unmount time. */ +out_write_dev: fuse_backing_put(write_dev); out_read_dev: fuse_backing_put(read_dev); @@ -511,8 +669,32 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count, { struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_mount *fm = get_fuse_mount(inode); + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); + struct iomap *srcmap = &iter->srcmap; int err; + if (srcmap->inline_data) + fuse_iomap_inline_free(srcmap); + + if (iomap->inline_data) { + if (fuse_is_iomap_file_write(opflags) && written > 0) { + err = fuse_iomap_inline_write(inode, pos, written, + iomap); + fuse_iomap_inline_free(iomap); + if (err) + return err; + + spin_lock(&fi->lock); + fi->i_disk_size = max(fi->i_disk_size, pos + written); + spin_unlock(&fi->lock); + } else { + fuse_iomap_inline_free(iomap); + } + + /* fuse server should already be aware of what happened */ + return 0; + } + if (fuse_should_send_iomap_end(fm, iomap, opflags, count, written)) { struct fuse_iomap_end_in inarg = { .opflags = fuse_iomap_op_to_server(opflags), @@ -1462,7 +1644,6 @@ static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc, unsigned int len, u64 end_pos) { struct inode *inode = wpc->inode; - struct iomap write_iomap, dontcare; ssize_t ret; if (fuse_is_bad(inode)) { @@ -1475,23 +1656,39 @@ static ssize_t fuse_iomap_writeback_range(struct 
iomap_writepage_ctx *wpc, trace_fuse_iomap_writeback_range(inode, offset, len, end_pos); if (!fuse_iomap_revalidate_writeback(wpc, offset)) { + struct iomap_iter fake_iter = { }; + struct iomap *write_iomap = &fake_iter.iomap; + ret = fuse_iomap_begin(inode, offset, len, - FUSE_IOMAP_OP_WRITEBACK, - &write_iomap, &dontcare); + FUSE_IOMAP_OP_WRITEBACK, write_iomap, + &fake_iter.srcmap); if (ret) goto discard_folio; + if (BAD_DATA(write_iomap->type == IOMAP_INLINE)) { + /* + * iomap assumes that inline data writes are completed + * by the time ->iomap_end completes, so it should + * never mark a pagecache folio dirty. + */ + fuse_iomap_end(inode, offset, len, 0, + FUSE_IOMAP_OP_WRITEBACK, + write_iomap); + ret = -EIO; + goto discard_folio; + } + /* * Landed in a hole or beyond EOF? Send that to iomap, it'll * skip writing back the file range. */ - if (write_iomap.offset > offset) { - write_iomap.length = write_iomap.offset - offset; - write_iomap.offset = offset; - write_iomap.type = IOMAP_HOLE; + if (write_iomap->offset > offset) { + write_iomap->length = write_iomap->offset - offset; + write_iomap->offset = offset; + write_iomap->type = IOMAP_HOLE; } - memcpy(&wpc->iomap, &write_iomap, sizeof(struct iomap)); + memcpy(&wpc->iomap, write_iomap, sizeof(struct iomap)); } ret = iomap_add_to_ioend(wpc, folio, offset, end_pos, len); ^ permalink raw reply related [flat|nested] 191+ messages in thread
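[Editorial sketch, not part of the posted patch.] The writeback hunk above keeps the existing "landed in a hole or beyond EOF" fixup, now applied through a pointer into the fake iomap_iter. A minimal userspace model of that fixup might look like this; struct sketch_map, enum sketch_map_type, and sketch_fixup_writeback_map() are hypothetical stand-ins for struct iomap and the inline code in fuse_iomap_writeback_range(), not kernel API.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the mapping types in struct iomap. */
enum sketch_map_type { SKETCH_MAP_HOLE, SKETCH_MAP_MAPPED };

/* Hypothetical, reduced stand-in for struct iomap. */
struct sketch_map {
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* number of bytes the mapping covers */
	enum sketch_map_type type;
};

/*
 * If the server's mapping starts after the offset being written back, the
 * bytes in between are a hole (or lie beyond EOF). Rewrite the mapping so
 * that it describes only that gap; iomap then skips writing the range.
 * A mapping that already covers the offset is left untouched.
 */
static void sketch_fixup_writeback_map(struct sketch_map *map, uint64_t offset)
{
	if (map->offset > offset) {
		map->length = map->offset - offset;
		map->offset = offset;
		map->type = SKETCH_MAP_HOLE;
	}
}
```

For example, a mapping that starts at byte 8192 while writeback is at byte 4096 becomes a 4096-byte hole starting at 4096.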
* [PATCH 27/33] fuse_trace: implement inline data file IO via iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (25 preceding siblings ...) 2026-04-29 14:30 ` [PATCH 26/33] fuse: implement inline data file IO via iomap Darrick J. Wong @ 2026-04-29 14:30 ` Darrick J. Wong 2026-04-29 14:31 ` [PATCH 28/33] fuse: allow more statx fields Darrick J. Wong ` (5 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:30 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Add tracepoints for the previous patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_trace.h | 45 +++++++++++++++++++++++++++++++++++++++++++++ fs/fuse/fuse_iomap.c | 7 +++++++ 2 files changed, 52 insertions(+) diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index 0e4be645802055..d3352e75fa6bdf 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -230,6 +230,7 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close); #if IS_ENABLED(CONFIG_FUSE_IOMAP) struct iomap_writepage_ctx; struct iomap_ioend; +struct iomap; /* tracepoint boilerplate so we don't have to keep doing this */ #define FUSE_IOMAP_OPFLAGS_FIELD \ @@ -1079,6 +1080,50 @@ TRACE_EVENT(fuse_iomap_dev_inval, __entry->offset, __entry->length) ); + +DECLARE_EVENT_CLASS(fuse_iomap_inline_class, + TP_PROTO(const struct inode *inode, loff_t pos, uint64_t count, + const struct iomap *map), + TP_ARGS(inode, pos, count, map), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + FUSE_IOMAP_MAP_FIELDS(map) + __field(bool, has_buf) + __field(uint64_t, validity_cookie) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->offset = pos; + __entry->length = count; + + __entry->mapdev = FUSE_IOMAP_DEV_NULL; + __entry->mapaddr = map->addr; + __entry->mapoffset = map->offset; + __entry->maplength = 
map->length; + __entry->maptype = map->type; + __entry->mapflags = map->flags; + + __entry->has_buf = map->inline_data != NULL; + __entry->validity_cookie= map->validity_cookie; + ), + + TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_MAP_FMT() " has_buf? %d cookie 0x%llx", + FUSE_IO_RANGE_PRINTK_ARGS(), + FUSE_IOMAP_MAP_PRINTK_ARGS(map), + __entry->has_buf, + __entry->validity_cookie) +); +#define DEFINE_FUSE_IOMAP_INLINE_EVENT(name) \ +DEFINE_EVENT(fuse_iomap_inline_class, name, \ + TP_PROTO(const struct inode *inode, loff_t pos, uint64_t count, \ + const struct iomap *map), \ + TP_ARGS(inode, pos, count, map)) +DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_read); +DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write); +DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap); +DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap); #endif /* CONFIG_FUSE_IOMAP */ #endif /* _TRACE_FUSE_H */ diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index b9531ad5ab5180..8be8e49605bb85 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -435,6 +435,8 @@ static int fuse_iomap_inline_read(struct inode *inode, loff_t pos, return -EFSCORRUPTED; } + trace_fuse_iomap_inline_read(inode, pos, count, iomap); + args.opcode = FUSE_READ; args.nodeid = fi->nodeid; args.in_numargs = 1; @@ -482,6 +484,8 @@ static int fuse_iomap_inline_write(struct inode *inode, loff_t pos, if (BAD_DATA(!iomap_inline_data_valid(iomap))) return -EFSCORRUPTED; + trace_fuse_iomap_inline_write(inode, pos, count, iomap); + args.opcode = FUSE_WRITE; args.nodeid = fi->nodeid; args.in_numargs = 2; @@ -545,6 +549,9 @@ static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags, return err; } + trace_fuse_iomap_set_inline_iomap(inode, pos, count, iomap); + trace_fuse_iomap_set_inline_srcmap(inode, pos, count, srcmap); + return 0; } ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 28/33] fuse: allow more statx fields 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (26 preceding siblings ...) 2026-04-29 14:30 ` [PATCH 27/33] fuse_trace: " Darrick J. Wong @ 2026-04-29 14:31 ` Darrick J. Wong 2026-04-29 14:31 ` [PATCH 29/33] fuse: support atomic writes with iomap Darrick J. Wong ` (4 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:31 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Allow the fuse server to supply us with the more recently added fields of struct statx. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/uapi/linux/fuse.h | 15 ++++++++- fs/fuse/dir.c | 76 ++++++++++++++++++++++++++++++++++++++------- 2 files changed, 79 insertions(+), 12 deletions(-) diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 1ef7152306a24f..aa96b4cbdfa255 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -341,7 +341,20 @@ struct fuse_statx { uint32_t rdev_minor; uint32_t dev_major; uint32_t dev_minor; - uint64_t __spare2[14]; + + uint64_t mnt_id; + uint32_t dio_mem_align; + uint32_t dio_offset_align; + uint64_t subvol; + + uint32_t atomic_write_unit_min; + uint32_t atomic_write_unit_max; + uint32_t atomic_write_segments_max; + uint32_t dio_read_offset_align; + uint32_t atomic_write_unit_max_opt; + uint32_t __spare2[1]; + + uint64_t __spare3[8]; }; struct fuse_kstatfs { diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 5b10ddf9b8077a..2db88bc364bb50 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1470,6 +1470,50 @@ static void fuse_statx_to_attr(struct fuse_statx *sx, struct fuse_attr *attr) attr->blksize = sx->blksize; } +#define FUSE_SUPPORTED_STATX_MASK (STATX_BASIC_STATS | \ + STATX_BTIME | \ + STATX_DIOALIGN | \ + STATX_SUBVOL | \ + STATX_WRITE_ATOMIC) + +#define 
FUSE_UNCACHED_STATX_MASK (STATX_DIOALIGN | \ + STATX_SUBVOL | \ + STATX_WRITE_ATOMIC) + +static void kstat_from_fuse_statx(const struct inode *inode, + struct kstat *stat, + const struct fuse_statx *sx) +{ + stat->result_mask = sx->mask & FUSE_SUPPORTED_STATX_MASK; + + stat->attributes |= fuse_statx_attributes(inode, sx); + stat->attributes_mask |= fuse_statx_attributes_mask(inode, sx); + + if (sx->mask & STATX_BTIME) { + stat->btime.tv_sec = sx->btime.tv_sec; + stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, + NSEC_PER_SEC - 1); + } + + if (sx->mask & STATX_DIOALIGN) { + stat->dio_mem_align = sx->dio_mem_align; + stat->dio_offset_align = sx->dio_offset_align; + } + + if (sx->mask & STATX_SUBVOL) + stat->subvol = sx->subvol; + + if (sx->mask & STATX_WRITE_ATOMIC) { + stat->atomic_write_unit_min = sx->atomic_write_unit_min; + stat->atomic_write_unit_max = sx->atomic_write_unit_max; + stat->atomic_write_unit_max_opt = sx->atomic_write_unit_max_opt; + stat->atomic_write_segments_max = sx->atomic_write_segments_max; + } + + if (sx->mask & STATX_DIO_READ_ALIGN) + stat->dio_read_offset_align = sx->dio_read_offset_align; +} + static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode, struct file *file, struct kstat *stat) { @@ -1493,7 +1537,7 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode, } /* For now leave sync hints as the default, request all stats. 
*/ inarg.sx_flags = 0; - inarg.sx_mask = STATX_BASIC_STATS | STATX_BTIME; + inarg.sx_mask = FUSE_SUPPORTED_STATX_MASK; args.opcode = FUSE_STATX; args.nodeid = get_node_id(inode); args.in_numargs = 1; @@ -1521,11 +1565,7 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode, } if (stat) { - stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME); - stat->btime.tv_sec = sx->btime.tv_sec; - stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1); - stat->attributes |= fuse_statx_attributes(inode, sx); - stat->attributes_mask |= fuse_statx_attributes_mask(inode, sx); + kstat_from_fuse_statx(inode, stat, sx); fuse_fillattr(idmap, inode, &attr, stat); stat->result_mask |= STATX_TYPE; } @@ -1590,16 +1630,30 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode, u32 inval_mask = READ_ONCE(fi->inval_mask); u32 cache_mask = fuse_get_cache_mask(inode); - - /* FUSE only supports basic stats and possibly btime */ - request_mask &= STATX_BASIC_STATS | STATX_BTIME; + /* Only ask for supported stats */ + request_mask &= FUSE_SUPPORTED_STATX_MASK; retry: if (fc->no_statx) request_mask &= STATX_BASIC_STATS; if (!request_mask) sync = false; - else if (flags & AT_STATX_FORCE_SYNC) + else if (request_mask & FUSE_UNCACHED_STATX_MASK) { + switch (flags & AT_STATX_SYNC_TYPE) { + case AT_STATX_DONT_SYNC: + request_mask &= ~FUSE_UNCACHED_STATX_MASK; + sync = false; + break; + case AT_STATX_FORCE_SYNC: + case AT_STATX_SYNC_AS_STAT: + sync = true; + break; + default: + WARN_ON(1); + sync = false; + break; + } + } else if (flags & AT_STATX_FORCE_SYNC) sync = true; else if (flags & AT_STATX_DONT_SYNC) sync = false; @@ -1610,7 +1664,7 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode, if (sync) { forget_all_cached_acls(inode); - /* Try statx if BTIME is requested */ + /* Try statx if a field not covered by regular stat is wanted */ if (!fc->no_statx && (request_mask & ~STATX_BASIC_STATS)) { err = 
fuse_do_statx(idmap, inode, file, stat); if (err == -ENOSYS) { ^ permalink raw reply related [flat|nested] 191+ messages in thread
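[Editorial sketch, not part of the posted patch.] The new sync decision in fuse_update_get_attr() above can be modelled as a pure function. This is a sketch under stated assumptions: every SK_/sk_ name is hypothetical, the constants are local copies of the uapi values as I understand them, and SK_SYNC_CACHE_POLICY stands for the remaining cache-validity branches that the quoted hunk does not show. The WARN_ON(1) in the default case is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of uapi values, prefixed SK_ to avoid header clashes. */
#define SK_STATX_BASIC_STATS		0x000007ffU
#define SK_STATX_BTIME			0x00000800U
#define SK_STATX_DIOALIGN		0x00002000U
#define SK_STATX_SUBVOL			0x00008000U
#define SK_STATX_WRITE_ATOMIC		0x00010000U

#define SK_AT_STATX_SYNC_TYPE		0x6000
#define SK_AT_STATX_SYNC_AS_STAT	0x0000
#define SK_AT_STATX_FORCE_SYNC		0x2000
#define SK_AT_STATX_DONT_SYNC		0x4000

/* Fields that the kernel does not cache, so they need a server round trip. */
#define SK_UNCACHED_MASK	(SK_STATX_DIOALIGN | SK_STATX_SUBVOL | \
				 SK_STATX_WRITE_ATOMIC)

enum sk_sync {
	SK_SYNC_NO,		/* answer from cached attributes */
	SK_SYNC_YES,		/* ask the fuse server */
	SK_SYNC_CACHE_POLICY,	/* decided by branches not shown in the hunk */
};

/* Model of the decision; may strip uncached bits from *request_mask. */
static enum sk_sync sk_statx_sync(uint32_t *request_mask, int flags)
{
	if (!*request_mask)
		return SK_SYNC_NO;

	if (*request_mask & SK_UNCACHED_MASK) {
		switch (flags & SK_AT_STATX_SYNC_TYPE) {
		case SK_AT_STATX_DONT_SYNC:
			/* Caller refuses a round trip: drop uncached fields. */
			*request_mask &= ~SK_UNCACHED_MASK;
			return SK_SYNC_NO;
		case SK_AT_STATX_FORCE_SYNC:
		case SK_AT_STATX_SYNC_AS_STAT:
			return SK_SYNC_YES;
		default:
			return SK_SYNC_NO;
		}
	}

	if (flags & SK_AT_STATX_FORCE_SYNC)
		return SK_SYNC_YES;
	if (flags & SK_AT_STATX_DONT_SYNC)
		return SK_SYNC_NO;
	return SK_SYNC_CACHE_POLICY;
}
```

The point of the model is that asking for any of the uncached fields forces an upcall unless the caller passed AT_STATX_DONT_SYNC, in which case those fields are silently dropped from the request.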
* [PATCH 29/33] fuse: support atomic writes with iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (27 preceding siblings ...) 2026-04-29 14:31 ` [PATCH 28/33] fuse: allow more statx fields Darrick J. Wong @ 2026-04-29 14:31 ` Darrick J. Wong 2026-04-29 14:31 ` [PATCH 30/33] fuse_trace: " Darrick J. Wong ` (3 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:31 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Enable untorn writes of up to a single fsblock, if iomap is enabled. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 2 ++ fs/fuse/fuse_iomap.h | 7 +++++++ include/uapi/linux/fuse.h | 5 +++++ fs/fuse/fuse_iomap.c | 42 +++++++++++++++++++++++++++++++++++++++++- 4 files changed, 55 insertions(+), 1 deletion(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 0d9ac3ff18eedf..c7c7f2e888bb8b 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -276,6 +276,8 @@ enum { FUSE_I_EXCLUSIVE, /* Use iomap for this inode */ FUSE_I_IOMAP, + /* Enable untorn writes */ + FUSE_I_ATOMIC, }; struct fuse_conn; diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 31d6f7b392771c..ca44df00f113d2 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -35,6 +35,13 @@ static inline bool fuse_inode_has_iomap(const struct inode *inode) return test_bit(FUSE_I_IOMAP, &fi->state); } +static inline bool fuse_inode_has_atomic(const struct inode *inode) +{ + const struct fuse_inode *fi = get_fuse_inode(inode); + + return test_bit(FUSE_I_ATOMIC, &fi->state); +} + int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 length); loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 
aa96b4cbdfa255..c454cea83083d3 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -248,6 +248,7 @@ * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry * - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges + * - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support */ #ifndef _LINUX_FUSE_H @@ -604,11 +605,13 @@ struct fuse_file_lock { * FUSE_ATTR_EXCLUSIVE: This file can only be modified by this mount, so the * kernel can use cached attributes more aggressively (e.g. ACL inheritance) * FUSE_ATTR_IOMAP: Use iomap for this inode + * FUSE_ATTR_ATOMIC: Enable untorn writes */ #define FUSE_ATTR_SUBMOUNT (1 << 0) #define FUSE_ATTR_DAX (1 << 1) #define FUSE_ATTR_EXCLUSIVE (1 << 2) #define FUSE_ATTR_IOMAP (1 << 3) +#define FUSE_ATTR_ATOMIC (1 << 4) /** * Open flags @@ -1176,6 +1179,8 @@ struct fuse_backing_map { /* basic file I/O functionality through iomap */ #define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0) +/* untorn writes through iomap */ +#define FUSE_IOMAP_SUPPORT_ATOMIC (1ULL << 1) struct fuse_iomap_support { uint64_t flags; uint64_t padding; diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 8be8e49605bb85..94ed7c69d892d3 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -1114,6 +1114,20 @@ static inline void fuse_inode_clear_iomap(struct inode *inode) clear_bit(FUSE_I_IOMAP, &fi->state); } +static inline void fuse_inode_set_atomic(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + set_bit(FUSE_I_ATOMIC, &fi->state); +} + +static inline void fuse_inode_clear_atomic(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + clear_bit(FUSE_I_ATOMIC, &fi->state); +} + void fuse_iomap_init_inode(struct inode *inode, struct fuse_attr *attr) { ASSERT(get_fuse_conn(inode)->iomap); @@ -1140,6 +1154,8 @@ void fuse_iomap_init_inode(struct inode *inode, struct fuse_attr *attr) } 
fuse_inode_set_iomap(inode); + if (attr->flags & FUSE_ATTR_ATOMIC) + fuse_inode_set_atomic(inode); trace_fuse_iomap_init_inode(inode); } @@ -1150,6 +1166,7 @@ void fuse_iomap_evict_inode(struct inode *inode) trace_fuse_iomap_evict_inode(inode); + fuse_inode_clear_atomic(inode); fuse_inode_clear_iomap(inode); } @@ -1227,6 +1244,8 @@ void fuse_iomap_open(struct inode *inode, struct file *file) ASSERT(fuse_inode_has_iomap(inode)); file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT; + if (fuse_inode_has_atomic(inode)) + file->f_mode |= FMODE_CAN_ATOMIC_WRITE; } int fuse_iomap_finish_open(const struct fuse_file *ff, @@ -1438,6 +1457,17 @@ fuse_iomap_write_checks( return kiocb_modified(iocb); } +static inline ssize_t fuse_iomap_atomic_write_valid(struct kiocb *iocb, + struct iov_iter *from) +{ + struct inode *inode = file_inode(iocb->ki_filp); + + if (iov_iter_count(from) != i_blocksize(inode)) + return -EINVAL; + + return generic_atomic_write_valid(iocb, from); +} + static ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from) { @@ -1452,6 +1482,12 @@ static ssize_t fuse_iomap_direct_write(struct kiocb *iocb, if (!count) return 0; + if (iocb->ki_flags & IOCB_ATOMIC) { + ret = fuse_iomap_atomic_write_valid(iocb, from); + if (ret) + return ret; + } + /* * Unaligned direct writes require zeroing of unwritten head and tail * blocks. Extending writes require zeroing of post-EOF tail blocks. 
@@ -1882,6 +1918,9 @@ static ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, if (!iov_iter_count(from)) return 0; + if (iocb->ki_flags & IOCB_ATOMIC) + return -EOPNOTSUPP; + ret = fuse_iomap_ilock_iocb(iocb, EXCL); if (ret) return ret; @@ -2181,7 +2220,8 @@ int fuse_dev_ioctl_iomap_support(struct file *file, struct fuse_iomap_support ios = { }; if (fuse_iomap_enabled()) - ios.flags = FUSE_IOMAP_SUPPORT_FILEIO; + ios.flags = FUSE_IOMAP_SUPPORT_FILEIO | + FUSE_IOMAP_SUPPORT_ATOMIC; if (copy_to_user(argp, &ios, sizeof(ios))) return -EFAULT; ^ permalink raw reply related [flat|nested] 191+ messages in thread
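[Editorial sketch, not part of the posted patch.] The acceptance rules for an IOCB_ATOMIC write in the patch above can be approximated in userspace as follows. sk_atomic_write_valid() is a hypothetical name. The first check mirrors fuse_iomap_atomic_write_valid(); the remaining checks are my approximation of what generic_atomic_write_valid() enforces (power-of-two length, naturally aligned position, direct IO only) and leave out its iterator-type check, so treat them as an assumption rather than a statement of the kernel's exact behaviour.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Approximate model of the checks applied to an untorn write.
 * Returns 0 if the write would be accepted, or a negative errno.
 */
static int sk_atomic_write_valid(uint64_t pos, uint64_t count,
				 uint32_t blocksize, bool direct)
{
	/* The patch only allows writes of exactly one fsblock. */
	if (count != blocksize)
		return -EINVAL;

	/* Assumed generic check: the length must be a power of two. */
	if (count == 0 || (count & (count - 1)) != 0)
		return -EINVAL;

	/* Assumed generic check: the position is aligned to the length. */
	if ((pos & (count - 1)) != 0)
		return -EINVAL;

	/* Buffered atomic writes are refused with -EOPNOTSUPP. */
	if (!direct)
		return -EOPNOTSUPP;

	return 0;
}
```

With a 4096-byte block size, a direct write of 4096 bytes at offset 8192 passes, while an 8192-byte write or a write at offset 2048 is rejected.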
* [PATCH 30/33] fuse_trace: support atomic writes with iomap 2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (28 preceding siblings ...) 2026-04-29 14:31 ` [PATCH 29/33] fuse: support atomic writes with iomap Darrick J. Wong @ 2026-04-29 14:31 ` Darrick J. Wong 2026-04-29 14:31 ` [PATCH 31/33] fuse: disable direct fs reclaim for any fuse server that uses iomap Darrick J. Wong ` (2 subsequent siblings) 32 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:31 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Add tracepoints for the previous patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_trace.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index d3352e75fa6bdf..de7d483d4b0f34 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -330,6 +330,7 @@ TRACE_DEFINE_ENUM(FUSE_I_BTIME); TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE); TRACE_DEFINE_ENUM(FUSE_I_EXCLUSIVE); TRACE_DEFINE_ENUM(FUSE_I_IOMAP); +TRACE_DEFINE_ENUM(FUSE_I_ATOMIC); #define FUSE_IFLAG_STRINGS \ { 1 << FUSE_I_ADVISE_RDPLUS, "advise_rdplus" }, \ @@ -339,7 +340,8 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP); { 1 << FUSE_I_BTIME, "btime" }, \ { 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \ { 1 << FUSE_I_EXCLUSIVE, "excl" }, \ - { 1 << FUSE_I_IOMAP, "iomap" } + { 1 << FUSE_I_IOMAP, "iomap" }, \ + { 1 << FUSE_I_ATOMIC, "atomic" } #define IOMAP_IOEND_STRINGS \ { IOMAP_IOEND_SHARED, "shared" }, \ ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 31/33] fuse: disable direct fs reclaim for any fuse server that uses iomap
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (29 preceding siblings ...)
2026-04-29 14:31 ` [PATCH 30/33] fuse_trace: " Darrick J. Wong
@ 2026-04-29 14:31 ` Darrick J. Wong
2026-04-29 14:32 ` [PATCH 32/33] fuse: enable swapfile activation on iomap Darrick J. Wong
2026-04-29 14:32 ` [PATCH 33/33] fuse: implement freeze and shutdowns for iomap filesystems Darrick J. Wong
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:31 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Any fuse server that uses iomap can create a substantial amount of dirty
pages in the pagecache because we don't write dirty stuff until reclaim
or fsync. Therefore, memory reclaim on any fuse iomap server mustn't
ever recurse back into the same filesystem. We must also never throttle
the fuse server writes to a bdi because that will just slow down
metadata operations.

Add a new ioctl that the fuse server can call on the fuse device to set
PF_MEMALLOC_NOFS and PF_LOCAL_THROTTLE. Either the fuse connection must
have already enabled iomap, or the caller must have CAP_SYS_RESOURCE.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_iomap.h      |    2 ++
 include/uapi/linux/fuse.h |    1 +
 fs/fuse/dev.c             |    2 ++
 fs/fuse/fuse_iomap.c      |   37 +++++++++++++++++++++++++++++++++++++
 4 files changed, 42 insertions(+)

diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h
index ca44df00f113d2..25c36c9c39d6f3 100644
--- a/fs/fuse/fuse_iomap.h
+++ b/fs/fuse/fuse_iomap.h
@@ -76,6 +76,7 @@ int fuse_iomap_dev_inval(struct fuse_conn *fc,
 			 const struct fuse_iomap_dev_inval_out *arg);
 
 int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
+int fuse_dev_ioctl_iomap_set_nofs(struct file *file, uint32_t __user *argp);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -103,6 +104,7 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
 # define fuse_dev_ioctl_iomap_support(...)	(-EOPNOTSUPP)
 # define fuse_iomap_dev_inval(...)		(-ENOSYS)
 # define fuse_iomap_fadvise			NULL
+# define fuse_dev_ioctl_iomap_set_nofs(...)	(-EOPNOTSUPP)
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _FS_FUSE_IOMAP_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c454cea83083d3..9e59fba64f48d9 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1195,6 +1195,7 @@ struct fuse_iomap_support {
 #define FUSE_DEV_IOC_SYNC_INIT		_IO(FUSE_DEV_IOC_MAGIC, 3)
 #define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 99, \
 					     struct fuse_iomap_support)
+#define FUSE_DEV_IOC_SET_NOFS		_IOW(FUSE_DEV_IOC_MAGIC, 100, uint32_t)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9918911fe44855..cf4bad6ffc287b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2741,6 +2741,8 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 
 	case FUSE_DEV_IOC_IOMAP_SUPPORT:
 		return fuse_dev_ioctl_iomap_support(file, argp);
+	case FUSE_DEV_IOC_SET_NOFS:
+		return fuse_dev_ioctl_iomap_set_nofs(file, argp);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 94ed7c69d892d3..1e01e0011a412d 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -12,6 +12,7 @@
 #include "fuse_trace.h"
 #include "fuse_iomap.h"
 #include "fuse_iomap_i.h"
+#include "fuse_dev_i.h"
 
 static bool __read_mostly enable_iomap =
 #if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
@@ -2289,3 +2290,39 @@ int fuse_iomap_dev_inval(struct fuse_conn *fc,
 	up_read(&fc->killsb);
 	return ret;
 }
+
+static inline bool can_set_nofs(struct fuse_dev *fud)
+{
+	if (fud && fud->fc && fud->fc->iomap)
+		return true;
+
+	return capable(CAP_SYS_RESOURCE);
+}
+
+int fuse_dev_ioctl_iomap_set_nofs(struct file *file, uint32_t __user *argp)
+{
+	struct fuse_dev *fud = fuse_get_dev(file);
+	uint32_t flags;
+
+	if (!can_set_nofs(fud))
+		return -EPERM;
+
+	if (copy_from_user(&flags, argp, sizeof(flags)))
+		return -EFAULT;
+
+	/*
+	 * The fuse server could be asked to perform a substantial amount of
+	 * writeback, so prohibit reclaim from recursing into fuse or the
+	 * kernel from throttling any bdis that the fuse server might write to.
+	 */
+	switch (flags) {
+	case 1:
+		current->flags |= PF_MEMALLOC_NOFS | PF_LOCAL_THROTTLE;
+		return 0;
+	case 0:
+		current->flags &= ~(PF_MEMALLOC_NOFS | PF_LOCAL_THROTTLE);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}

^ permalink raw reply related [flat|nested] 191+ messages in thread
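[Editor's sketch, not part of the posted patch] From the server side, the
new ioctl in [PATCH 31/33] might be driven roughly as below. The helper
name fuse_server_set_nofs() is hypothetical. The two #defines mirror the
uapi header as patched (the magic value 229 is the existing
FUSE_DEV_IOC_MAGIC), since installed headers will not carry
FUSE_DEV_IOC_SET_NOFS yet. Because the kernel side modifies
current->flags, the setting presumably applies only to the calling
thread, so each thread that performs writeback would need its own call.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Assumption: these mirror include/uapi/linux/fuse.h with the patch
 * applied; they are defined here only because installed headers lack
 * the new ioctl.
 */
#ifndef FUSE_DEV_IOC_MAGIC
#define FUSE_DEV_IOC_MAGIC	229
#endif
#ifndef FUSE_DEV_IOC_SET_NOFS
#define FUSE_DEV_IOC_SET_NOFS	_IOW(FUSE_DEV_IOC_MAGIC, 100, uint32_t)
#endif

/*
 * Hypothetical helper: ask the kernel to set (enable != 0) or clear
 * (enable == 0) PF_MEMALLOC_NOFS | PF_LOCAL_THROTTLE on the calling
 * thread.  fuse_dev_fd is an open /dev/fuse descriptor.  Returns 0 on
 * success or a negative errno.
 */
static int fuse_server_set_nofs(int fuse_dev_fd, int enable)
{
	uint32_t arg = enable ? 1 : 0;

	if (ioctl(fuse_dev_fd, FUSE_DEV_IOC_SET_NOFS, &arg) < 0)
		return -errno;
	return 0;
}
```

Per the patch, the kernel accepts only the values 0 and 1 and returns
-EINVAL for anything else, which is why the helper normalizes its
argument before issuing the ioctl.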
* [PATCH 32/33] fuse: enable swapfile activation on iomap
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (30 preceding siblings ...)
2026-04-29 14:31 ` [PATCH 31/33] fuse: disable direct fs reclaim for any fuse server that uses iomap Darrick J. Wong
@ 2026-04-29 14:32 ` Darrick J. Wong
2026-04-29 14:32 ` [PATCH 33/33] fuse: implement freeze and shutdowns for iomap filesystems Darrick J. Wong
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:32 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

It turns out that fuse supports swapfile activation, so let's enable
that.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h      |    1 +
 include/uapi/linux/fuse.h |    5 ++++
 fs/fuse/fuse_iomap.c      |   54 ++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index de7d483d4b0f34..63cc1496ee5ca1 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -300,6 +300,7 @@ struct iomap;
 	{ FUSE_IOMAP_OP_DAX,			"fsdax" }, \
 	{ FUSE_IOMAP_OP_ATOMIC,			"atomic" }, \
 	{ FUSE_IOMAP_OP_DONTCACHE,		"dontcache" }, \
+	{ FUSE_IOMAP_OP_SWAPFILE,		"swapfile" }, \
 	{ FUSE_IOMAP_OP_WRITEBACK,		"writeback" }
 
 #define FUSE_IOMAP_TYPE_STRINGS \
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 9e59fba64f48d9..5f3724f36f764a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1406,6 +1406,9 @@ struct fuse_uring_cmd_req {
 #define FUSE_IOMAP_OP_ATOMIC		(1U << 9)
 #define FUSE_IOMAP_OP_DONTCACHE		(1U << 10)
 
+/* swapfile config operation */
+#define FUSE_IOMAP_OP_SWAPFILE		(1U << 30)
+
 /* pagecache writeback operation */
 #define FUSE_IOMAP_OP_WRITEBACK		(1U << 31)
 
@@ -1460,6 +1463,8 @@ struct fuse_iomap_end_in {
 #define FUSE_IOMAP_IOEND_APPEND		(1U << 4)
 /* is pagecache writeback */
 #define FUSE_IOMAP_IOEND_WRITEBACK	(1U << 5)
+/* swapfile deactivation */
+#define FUSE_IOMAP_IOEND_SWAPOFF	(1U << 6)
 
 struct fuse_iomap_ioend_in {
 	uint32_t flags;		/* FUSE_IOMAP_IOEND_* */
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 1e01e0011a412d..cf52e2747e7f7f 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -8,6 +8,7 @@
 #include <linux/pagemap.h>
 #include <linux/falloc.h>
 #include <linux/fadvise.h>
+#include <linux/swap.h>
 #include "fuse_i.h"
 #include "fuse_trace.h"
 #include "fuse_iomap.h"
@@ -202,13 +203,16 @@ static inline uint16_t fuse_iomap_flags_from_server(uint16_t fuse_f_flags)
 #undef XMAP2
 #undef XMAP
 
+#define FUSE_IOMAP_PRIVATE_OPS	(FUSE_IOMAP_OP_WRITEBACK | \
+				 FUSE_IOMAP_OP_SWAPFILE)
+
 /* Convert IOMAP_* operation flags to FUSE_IOMAP_OP_* */
 #define XMAP(word) \
 	if (iomap_op_flags & IOMAP_##word) \
 		ret |= FUSE_IOMAP_OP_##word
 static inline uint32_t fuse_iomap_op_to_server(unsigned iomap_op_flags)
 {
-	uint32_t ret = iomap_op_flags & FUSE_IOMAP_OP_WRITEBACK;
+	uint32_t ret = iomap_op_flags & FUSE_IOMAP_PRIVATE_OPS;
 
 	XMAP(WRITE);
 	XMAP(ZERO);
@@ -757,6 +761,13 @@ fuse_should_send_iomap_ioend(const struct fuse_mount *fm,
 	if (inarg->error)
 		return true;
 
+	/*
+	 * Always send an ioend for swapoff to let the fuse server know the
+	 * long term layout "lease" is over.
+	 */
+	if (inarg->flags & FUSE_IOMAP_IOEND_SWAPOFF)
+		return true;
+
 	/* Send an ioend if we performed an IO involving metadata changes. */
 	return inarg->written > 0 &&
 	       (inarg->flags & (FUSE_IOMAP_IOEND_SHARED |
@@ -1801,6 +1812,43 @@ static void fuse_iomap_readahead(struct readahead_control *rac)
 	iomap_bio_readahead(rac, &fuse_iomap_ops);
 }
 
+#ifdef CONFIG_SWAP
+static int fuse_iomap_swapfile_begin(struct inode *inode, loff_t pos,
+				     loff_t count, unsigned opflags,
+				     struct iomap *iomap, struct iomap *srcmap)
+{
+	return fuse_iomap_begin(inode, pos, count,
+				FUSE_IOMAP_OP_SWAPFILE | opflags, iomap,
+				srcmap);
+}
+
+static const struct iomap_ops fuse_iomap_swapfile_ops = {
+	.iomap_begin		= fuse_iomap_swapfile_begin,
+};
+
+static int fuse_iomap_swap_activate(struct swap_info_struct *sis,
+				    struct file *swap_file, sector_t *span)
+{
+	int ret;
+
+	/* obtain the block device from the header iomapping */
+	sis->bdev = NULL;
+	ret = iomap_swapfile_activate(sis, swap_file, span,
+				      &fuse_iomap_swapfile_ops);
+	if (ret < 0)
+		fuse_iomap_ioend(file_inode(swap_file), 0, 0, ret,
+				 FUSE_IOMAP_IOEND_SWAPOFF, NULL,
+				 FUSE_IOMAP_NULL_ADDR);
+	return ret;
+}
+
+static void fuse_iomap_swap_deactivate(struct file *file)
+{
+	fuse_iomap_ioend(file_inode(file), 0, 0, 0, FUSE_IOMAP_IOEND_SWAPOFF,
+			 NULL, FUSE_IOMAP_NULL_ADDR);
+}
+#endif
+
 static const struct address_space_operations fuse_iomap_aops = {
 	.read_folio		= fuse_iomap_read_folio,
 	.readahead		= fuse_iomap_readahead,
@@ -1811,6 +1859,10 @@ static const struct address_space_operations fuse_iomap_aops = {
 	.migrate_folio		= filemap_migrate_folio,
 	.is_partially_uptodate	= iomap_is_partially_uptodate,
 	.error_remove_folio	= generic_error_remove_folio,
+#ifdef CONFIG_SWAP
+	.swap_activate		= fuse_iomap_swap_activate,
+	.swap_deactivate	= fuse_iomap_swap_deactivate,
+#endif
 
 	/* These aren't pagecache operations per se */
 	.bmap			= fuse_bmap,

^ permalink raw reply related [flat|nested] 191+ messages in thread
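[Editor's sketch, not part of the posted patch] The change to
fuse_iomap_op_to_server() in [PATCH 32/33] is easiest to see in
isolation: fuse-private operation bits are copied through by a mask,
and the mask had to grow for the swapfile bit to reach the server. The
model below uses only the two bit values visible in the uapi hunks; the
function names are hypothetical and the XMAP() translation of the
generic IOMAP_* bits is deliberately left out.

```c
#include <stdint.h>

/* Values taken from the include/uapi/linux/fuse.h hunks in this patch. */
#define FUSE_IOMAP_OP_SWAPFILE	(1U << 30)
#define FUSE_IOMAP_OP_WRITEBACK	(1U << 31)
#define FUSE_IOMAP_PRIVATE_OPS	(FUSE_IOMAP_OP_WRITEBACK | \
				 FUSE_IOMAP_OP_SWAPFILE)

/* Before the patch: only the writeback bit survives the conversion. */
static uint32_t private_bits_before(uint32_t iomap_op_flags)
{
	return iomap_op_flags & FUSE_IOMAP_OP_WRITEBACK;
}

/* After the patch: every bit in FUSE_IOMAP_PRIVATE_OPS is passed through. */
static uint32_t private_bits_after(uint32_t iomap_op_flags)
{
	return iomap_op_flags & FUSE_IOMAP_PRIVATE_OPS;
}
```

With the old mask, the FUSE_IOMAP_OP_SWAPFILE bit that
fuse_iomap_swapfile_begin() ORs into opflags would be silently dropped
before the upcall, so the server could not tell a swapfile mapping
request from an ordinary one.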
* [PATCH 33/33] fuse: implement freeze and shutdowns for iomap filesystems
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (31 preceding siblings ...)
2026-04-29 14:32 ` [PATCH 32/33] fuse: enable swapfile activation on iomap Darrick J. Wong
@ 2026-04-29 14:32 ` Darrick J. Wong
32 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:32 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Implement filesystem freezing and block device shutdown notifications
for iomap-based servers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/uapi/linux/fuse.h |   12 +++++++
 fs/fuse/inode.c           |   73 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5f3724f36f764a..b6fa828776b82f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -695,6 +695,10 @@ enum fuse_opcode {
 	FUSE_STATX		= 52,
 	FUSE_COPY_FILE_RANGE_64	= 53,
 
+	FUSE_FREEZE_FS		= 4089,
+	FUSE_UNFREEZE_FS	= 4090,
+	FUSE_SHUTDOWN_FS	= 4091,
+
 	FUSE_IOMAP_CONFIG	= 4092,
 	FUSE_IOMAP_IOEND	= 4093,
 	FUSE_IOMAP_BEGIN	= 4094,
@@ -1257,6 +1261,14 @@ struct fuse_syncfs_in {
 	uint64_t	padding;
 };
 
+struct fuse_freezefs_in {
+	uint64_t	unlinked;
+};
+
+struct fuse_shutdownfs_in {
+	uint64_t	flags;
+};
+
 /*
  * For each security context, send fuse_secctx with size of security context
  * fuse_secctx will be followed by security context name and this in turn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f6ec67a8eb86a2..6d4ff08ad3069a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1264,6 +1264,74 @@ static struct dentry *fuse_get_parent(struct dentry *child)
 	return parent;
 }
 
+#ifdef CONFIG_FUSE_IOMAP
+/*
+ * Second stage of a freeze. The data is already frozen so we only
+ * need to take care of the fuse server.
+ */
+static int fuse_freeze_fs(struct super_block *sb)
+{
+	struct fuse_mount *fm = get_fuse_mount_super(sb);
+	struct fuse_conn *fc = get_fuse_conn_super(sb);
+	struct fuse_freezefs_in inarg = {
+		.unlinked = atomic_long_read(&sb->s_remove_count),
+	};
+	FUSE_ARGS(args);
+	int err;
+
+	if (!fc->iomap)
+		return -EOPNOTSUPP;
+
+	args.opcode = FUSE_FREEZE_FS;
+	args.nodeid = get_node_id(sb->s_root->d_inode);
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	err = fuse_simple_request(fm, &args);
+	if (err == -ENOSYS)
+		err = -EOPNOTSUPP;
+	return err;
+}
+
+static int fuse_unfreeze_fs(struct super_block *sb)
+{
+	struct fuse_mount *fm = get_fuse_mount_super(sb);
+	struct fuse_conn *fc = get_fuse_conn_super(sb);
+	FUSE_ARGS(args);
+	int err;
+
+	if (!fc->iomap)
+		return 0;
+
+	args.opcode = FUSE_UNFREEZE_FS;
+	args.nodeid = get_node_id(sb->s_root->d_inode);
+	err = fuse_simple_request(fm, &args);
+	if (err == -ENOSYS)
+		err = 0;
+	return err;
+}
+
+static void fuse_shutdown_fs(struct super_block *sb)
+{
+	struct fuse_mount *fm = get_fuse_mount_super(sb);
+	struct fuse_conn *fc = get_fuse_conn_super(sb);
+	struct fuse_shutdownfs_in inarg = {
+		.flags = 0,
+	};
+	FUSE_ARGS(args);
+
+	if (!fc->iomap)
+		return;
+
+	args.opcode = FUSE_SHUTDOWN_FS;
+	args.nodeid = get_node_id(sb->s_root->d_inode);
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	fuse_simple_request(fm, &args);
+}
+#endif /* CONFIG_FUSE_IOMAP */
+
 /* only for fid encoding; no support for file handle */
 static const struct export_operations fuse_export_fid_operations = {
 	.encode_fh	= fuse_encode_fh,
@@ -1286,6 +1354,11 @@ static const struct super_operations fuse_super_operations = {
 	.statfs		= fuse_statfs,
 	.sync_fs	= fuse_sync_fs,
 	.show_options	= fuse_show_options,
+#ifdef CONFIG_FUSE_IOMAP
+	.freeze_fs	= fuse_freeze_fs,
+	.unfreeze_fs	= fuse_unfreeze_fs,
+	.shutdown	= fuse_shutdown_fs,
+#endif
 };
 
 static void sanitize_global_limit(unsigned int *limit)

^ permalink raw reply related [flat|nested] 191+ messages in thread
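[Editor's sketch, not part of the posted patch] The freeze and unfreeze
handlers in [PATCH 33/33] treat a server that does not implement the
new opcodes asymmetrically. The userspace model below restates only
that error translation; the function names are hypothetical, and
server_err stands in for the value fuse_simple_request() would return.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Model of fuse_freeze_fs(): a connection without iomap, or a server
 * that answers -ENOSYS, makes the freeze fail with -EOPNOTSUPP.
 */
static int model_freeze_result(bool conn_has_iomap, int server_err)
{
	if (!conn_has_iomap)
		return -EOPNOTSUPP;
	return server_err == -ENOSYS ? -EOPNOTSUPP : server_err;
}

/*
 * Model of fuse_unfreeze_fs(): the same two situations are treated as
 * success, so a thaw is never blocked by a missing handler.
 */
static int model_unfreeze_result(bool conn_has_iomap, int server_err)
{
	if (!conn_has_iomap)
		return 0;
	return server_err == -ENOSYS ? 0 : server_err;
}
```

Other server errors are passed through unchanged in both directions, so
a server can still veto a freeze (for example with -EBUSY) if it cannot
quiesce.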
* [PATCHSET v8 5/8] fuse: allow servers to specify root node id
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (3 preceding siblings ...)
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2026-04-29 14:17 ` Darrick J. Wong
2026-04-29 14:32 ` [PATCH 1/3] fuse: make the root nodeid dynamic Darrick J. Wong
` (2 more replies)
2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (14 subsequent siblings)
19 siblings, 3 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:17 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

Hi all,

This series grants fuse servers full control over the entire node id
address space by allowing them to specify the nodeid of the root
directory. With this new feature, fuse4fs will not have to translate
node ids.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-root-nodeid
---
Commits in this patchset:
 * fuse: make the root nodeid dynamic
 * fuse_trace: make the root nodeid dynamic
 * fuse: allow setting of root nodeid
---
 fs/fuse/fuse_i.h     |    9 +++++++--
 fs/fuse/fuse_trace.h |    6 ++++--
 fs/fuse/dir.c        |   10 ++++++----
 fs/fuse/inode.c      |   22 ++++++++++++++++++----
 fs/fuse/readdir.c    |   10 +++++-----
 5 files changed, 40 insertions(+), 17 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/3] fuse: make the root nodeid dynamic
2026-04-29 14:17 ` [PATCHSET v8 5/8] fuse: allow servers to specify root node id Darrick J. Wong
@ 2026-04-29 14:32 ` Darrick J. Wong
2026-04-29 14:32 ` [PATCH 2/3] fuse_trace: " Darrick J. Wong
2026-04-29 14:33 ` [PATCH 3/3] fuse: allow setting of root nodeid Darrick J. Wong
2 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:32 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Change the root nodeid from a hardcoded constant to a dynamic field so
that fuse servers don't need to translate node ids.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h  |    7 +++++--
 fs/fuse/dir.c     |   10 ++++++----
 fs/fuse/inode.c   |   11 +++++++----
 fs/fuse/readdir.c |   10 +++++-----
 4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c7c7f2e888bb8b..a9ca0e936524e7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -701,6 +701,9 @@ struct fuse_conn {
 
 	struct rcu_head rcu;
 
+	/* node id of the root directory */
+	u64 root_nodeid;
+
 	/** The user id for this mount */
 	kuid_t user_id;
 
@@ -1116,9 +1119,9 @@ static inline u64 get_node_id(struct inode *inode)
 	return get_fuse_inode(inode)->nodeid;
 }
 
-static inline int invalid_nodeid(u64 nodeid)
+static inline int invalid_nodeid(const struct fuse_conn *fc, u64 nodeid)
 {
-	return !nodeid || nodeid == FUSE_ROOT_ID;
+	return !nodeid || nodeid == fc->root_nodeid;
 }
 
 static inline u64 fuse_get_attr_version(struct fuse_conn *fc)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 2db88bc364bb50..3a76eea04c6425 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -584,7 +584,7 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
 	err = -EIO;
 	if (fuse_invalid_attr(&outarg->attr))
 		goto out_put_forget;
-	if (outarg->nodeid == FUSE_ROOT_ID && outarg->generation != 0) {
+	if (outarg->nodeid == fm->fc->root_nodeid && outarg->generation != 0) {
 		pr_warn_once("root generation should be zero\n");
 		outarg->generation = 0;
 	}
@@ -634,7 +634,7 @@ static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
 		goto out_err;
 
 	err = -EIO;
-	if (inode && get_node_id(inode) == FUSE_ROOT_ID)
+	if (inode && get_node_id(inode) == fc->root_nodeid)
 		goto out_iput;
 
 	newent = d_splice_alias(inode, entry);
@@ -885,7 +885,8 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
 		goto out_free_ff;
 
 	err = -EIO;
-	if (!S_ISREG(outentry.attr.mode) || invalid_nodeid(outentry.nodeid) ||
+	if (!S_ISREG(outentry.attr.mode) ||
+	    invalid_nodeid(fm->fc, outentry.nodeid) ||
 	    fuse_invalid_attr(&outentry.attr))
 		goto out_free_ff;
 
@@ -1032,7 +1033,8 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
 		goto out_put_forget_req;
 
 	err = -EIO;
-	if (invalid_nodeid(outarg.nodeid) || fuse_invalid_attr(&outarg.attr))
+	if (invalid_nodeid(fm->fc, outarg.nodeid) ||
+	    fuse_invalid_attr(&outarg.attr))
 		goto out_put_forget_req;
 
 	if ((outarg.attr.mode ^ mode) & S_IFMT)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 6d4ff08ad3069a..d883c9e3543f5c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1059,6 +1059,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
 	fc->max_pages_limit = fuse_max_pages_limit;
 	fc->name_max = FUSE_NAME_LOW_MAX;
 	fc->timeout.req_timeout = 0;
+	fc->root_nodeid = FUSE_ROOT_ID;
 
 	if (IS_ENABLED(CONFIG_FUSE_BACKING))
 		fuse_backing_files_init(fc);
@@ -1116,12 +1117,14 @@ EXPORT_SYMBOL_GPL(fuse_conn_get);
 static struct inode *fuse_get_root_inode(struct super_block *sb, unsigned int mode)
 {
 	struct fuse_attr attr;
+	struct fuse_conn *fc = get_fuse_conn_super(sb);
+
 	memset(&attr, 0, sizeof(attr));
 
 	attr.mode = mode;
-	attr.ino = FUSE_ROOT_ID;
+	attr.ino = fc->root_nodeid;
 	attr.nlink = 1;
-	return fuse_iget(sb, FUSE_ROOT_ID, 0, &attr, 0, 0, 0);
+	return fuse_iget(sb, fc->root_nodeid, 0, &attr, 0, 0, 0);
 }
 
 struct fuse_inode_handle {
@@ -1165,7 +1168,7 @@ static struct dentry *fuse_get_dentry(struct super_block *sb,
 		goto out_iput;
 
 	entry = d_obtain_alias(inode);
-	if (!IS_ERR(entry) && get_node_id(inode) != FUSE_ROOT_ID)
+	if (!IS_ERR(entry) && get_node_id(inode) != fc->root_nodeid)
 		fuse_invalidate_entry_cache(entry);
 
 	return entry;
@@ -1258,7 +1261,7 @@ static struct dentry *fuse_get_parent(struct dentry *child)
 	}
 
 	parent = d_obtain_alias(inode);
-	if (!IS_ERR(parent) && get_node_id(inode) != FUSE_ROOT_ID)
+	if (!IS_ERR(parent) && get_node_id(inode) != fc->root_nodeid)
 		fuse_invalidate_entry_cache(parent);
 
 	return parent;
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index db5ae8ec10305a..5ba19ca9d8a949 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -189,12 +189,12 @@ static int fuse_direntplus_link(struct file *file,
 		return 0;
 	}
 
-	if (invalid_nodeid(o->nodeid))
-		return -EIO;
-	if (fuse_invalid_attr(&o->attr))
-		return -EIO;
-
 	fc = get_fuse_conn(dir);
+	if (invalid_nodeid(fc, o->nodeid))
+		return -EIO;
+	if (fuse_invalid_attr(&o->attr))
+		return -EIO;
+
 	epoch = atomic_read(&fc->epoch);
 
 	name.hash = full_name_hash(parent, name.name, name.len);

^ permalink raw reply related [flat|nested] 191+ messages in thread
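[Editor's sketch, not part of the posted patch] The practical effect of
[PATCH 1/3] on invalid_nodeid() can be modelled in userspace. struct
model_conn and model_invalid_nodeid() are hypothetical stand-ins for
struct fuse_conn and the kernel helper. The value 2 is used because it
is ext4's root inode number, which is presumably what a server such as
fuse4fs would pass through unchanged.

```c
#include <stdbool.h>
#include <stdint.h>

#define FUSE_ROOT_ID	1	/* default root nodeid from the fuse uapi header */

/* Hypothetical stand-in for the one field of struct fuse_conn used here. */
struct model_conn {
	uint64_t root_nodeid;
};

/*
 * Same test as the patched invalid_nodeid(): a reply to lookup, create,
 * or readdirplus may not name nodeid 0 or the root of this connection.
 */
static bool model_invalid_nodeid(const struct model_conn *fc, uint64_t nodeid)
{
	return !nodeid || nodeid == fc->root_nodeid;
}
```

Once the root moves to 2, nodeid 1 stops being special and becomes an
ordinary value the server may hand out, while 2 is rejected in replies
for non-root entries.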
* [PATCH 2/3] fuse_trace: make the root nodeid dynamic
2026-04-29 14:17 ` [PATCHSET v8 5/8] fuse: allow servers to specify root node id Darrick J. Wong
2026-04-29 14:32 ` [PATCH 1/3] fuse: make the root nodeid dynamic Darrick J. Wong
@ 2026-04-29 14:32 ` Darrick J. Wong
2026-04-29 14:33 ` [PATCH 3/3] fuse: allow setting of root nodeid Darrick J. Wong
2 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:32 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Enhance the iomap config tracepoint to report the node id of the root
directory.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 63cc1496ee5ca1..0016242ff34f62 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -1026,6 +1026,7 @@ TRACE_EVENT(fuse_iomap_config,
 
 	TP_STRUCT__entry(
 		__field(dev_t,			connection)
+		__field(uint64_t,		root_nodeid)
 
 		__field(uint64_t,		flags)
 		__field(uint32_t,		blocksize)
@@ -1040,6 +1041,7 @@ TRACE_EVENT(fuse_iomap_config,
 
 	TP_fast_assign(
 		__entry->connection	=	fm->fc->dev;
+		__entry->root_nodeid	=	fm->fc->root_nodeid;
 		__entry->flags		=	outarg->flags;
 		__entry->blocksize	=	outarg->s_blocksize;
 		__entry->max_links	=	outarg->s_max_links;
@@ -1050,8 +1052,8 @@ TRACE_EVENT(fuse_iomap_config,
 		__entry->uuid_len	=	outarg->s_uuid_len;
 	),
 
-	TP_printk("connection %u flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u",
-		  __entry->connection,
+	TP_printk("connection %u root_ino 0x%llx flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u",
+		  __entry->connection, __entry->root_nodeid,
 		  __print_flags(__entry->flags, "|", FUSE_IOMAP_CONFIG_STRINGS),
 		  __entry->blocksize, __entry->max_links, __entry->time_gran,
 		  __entry->time_min, __entry->time_max, __entry->maxbytes,

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 3/3] fuse: allow setting of root nodeid
2026-04-29 14:17 ` [PATCHSET v8 5/8] fuse: allow servers to specify root node id Darrick J. Wong
2026-04-29 14:32 ` [PATCH 1/3] fuse: make the root nodeid dynamic Darrick J. Wong
2026-04-29 14:32 ` [PATCH 2/3] fuse_trace: " Darrick J. Wong
@ 2026-04-29 14:33 ` Darrick J. Wong
2 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:33 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Provide a new mount option so that fuse servers can actually set the
root nodeid.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h |    2 ++
 fs/fuse/inode.c  |   11 +++++++++++
 2 files changed, 13 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a9ca0e936524e7..95b37f4660cc1d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -642,6 +642,7 @@ static inline bool fuse_is_inode_dax_mode(enum fuse_dax_mode mode)
 struct fuse_fs_context {
 	struct fuse_dev *fud;
 	unsigned int rootmode;
+	u64 root_nodeid;
 	kuid_t user_id;
 	kgid_t group_id;
 	bool is_bdev:1;
@@ -654,6 +655,7 @@ struct fuse_fs_context {
 	bool no_control:1;
 	bool no_force_umount:1;
 	bool legacy_opts_show:1;
+	bool root_nodeid_present:1;
 	enum fuse_dax_mode dax_mode;
 	unsigned int max_read;
 	unsigned int blksize;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d883c9e3543f5c..d48a76e6d17995 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -816,6 +816,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_ROOT_NODEID,
 	OPT_ERR
 };
 
@@ -830,6 +831,7 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
 	fsparam_u32	("max_read",		OPT_MAX_READ),
 	fsparam_u32	("blksize",		OPT_BLKSIZE),
 	fsparam_string	("subtype",		OPT_SUBTYPE),
+	fsparam_u64	("root_nodeid",		OPT_ROOT_NODEID),
 	{}
 };
 
@@ -950,6 +952,11 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
 		ctx->blksize = result.uint_32;
 		break;
 
+	case OPT_ROOT_NODEID:
+		ctx->root_nodeid = result.uint_64;
+		ctx->root_nodeid_present = true;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -987,6 +994,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 			seq_printf(m, ",max_read=%u", fc->max_read);
 		if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
 			seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+		if (fc->root_nodeid && fc->root_nodeid != FUSE_ROOT_ID)
+			seq_printf(m, ",root_nodeid=%llu", fc->root_nodeid);
 	}
 #ifdef CONFIG_FUSE_DAX
 	if (fc->dax_mode == FUSE_DAX_ALWAYS)
@@ -2052,6 +2061,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 		sb->s_flags |= SB_POSIXACL;
 
 	fc->default_permissions = ctx->default_permissions;
+	if (ctx->root_nodeid_present)
+		fc->root_nodeid = ctx->root_nodeid;
 	fc->allow_other = ctx->allow_other;
 	fc->user_id = ctx->user_id;
 	fc->group_id = ctx->group_id;

^ permalink raw reply related [flat|nested] 191+ messages in thread
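[Editor's sketch, not part of the posted patch] How the root_nodeid
mount option in [PATCH 3/3] flows from the parsed context into the
connection, and back out through the mount options listing, can be
modelled as follows. struct model_ctx and both functions are
hypothetical. The real fuse_show_options() emits the option only inside
the block that also prints max_read and blksize, which this model
ignores.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define FUSE_ROOT_ID	1	/* default set by fuse_conn_init() */

/* Hypothetical stand-in for the two new fuse_fs_context fields. */
struct model_ctx {
	uint64_t root_nodeid;
	bool root_nodeid_present;
};

/* fuse_fill_super_common(): override the default only if the option was given. */
static uint64_t model_effective_root(const struct model_ctx *ctx)
{
	return ctx->root_nodeid_present ? ctx->root_nodeid : FUSE_ROOT_ID;
}

/*
 * fuse_show_options(): print the option only for a nonzero, non-default
 * root.  Returns the number of characters written, 0 if nothing is shown.
 */
static int model_show_root(char *buf, size_t len, uint64_t root_nodeid)
{
	if (root_nodeid && root_nodeid != FUSE_ROOT_ID)
		return snprintf(buf, len, ",root_nodeid=%" PRIu64, root_nodeid);
	if (len)
		buf[0] = '\0';
	return 0;
}
```

A server exporting an ext4 image would presumably add root_nodeid=2 to
its mount options. The separate root_nodeid_present bit lets the kernel
distinguish "option not given" from an explicit value; like the hunks
shown in this patch, the model does not range-check that value.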
* [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (4 preceding siblings ...)
2026-04-29 14:17 ` [PATCHSET v8 5/8] fuse: allow servers to specify root node id Darrick J. Wong
@ 2026-04-29 14:17 ` Darrick J. Wong
2026-04-29 14:33 ` [PATCH 1/9] fuse: enable caching of timestamps Darrick J. Wong
` (8 more replies)
2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (13 subsequent siblings)
19 siblings, 9 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:17 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

Hi all,

When iomap is enabled for a fuse file, we try to keep as much of the
file IO path in the kernel as we possibly can. That means no calling
out to the fuse server in the IO path when we can avoid it.

However, the existing FUSE architecture defers all file attributes to
the fuse server -- [cm]time updates, ACL metadata management, set[ug]id
removal, and permissions checking thereof, etc. We'd really rather do
all these attribute updates in the kernel, and only push them to the
fuse server when it's actually necessary (e.g. fsync).

Furthermore, the POSIX ACL code has the weird behavior that if the
access ACL can be represented entirely by i_mode bits, it will change
the mode and delete the ACL, which fuse servers generally don't seem to
implement.

IOWs, we want consistent and correct (as defined by fstests) behavior
of file attributes in iomap mode. Let's make the kernel manage all that
and push the results to userspace as needed. This improves performance
even further, since it's sort of like writeback_cache mode but more
aggressive.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs
---
Commits in this patchset:
 * fuse: enable caching of timestamps
 * fuse: force a ctime update after a fileattr_set call when in iomap mode
 * fuse: allow local filesystems to set some VFS iflags
 * fuse_trace: allow local filesystems to set some VFS iflags
 * fuse: cache atime when in iomap mode
 * fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
 * fuse_trace: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
 * fuse: update ctime when updating acls on an iomap inode
 * fuse: always cache ACLs when using iomap
---
 fs/fuse/fuse_i.h          |    1 +
 fs/fuse/fuse_trace.h      |   87 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |    8 ++++
 fs/fuse/acl.c             |   30 +++++++++++++---
 fs/fuse/dir.c             |   38 ++++++++++++++++----
 fs/fuse/file.c            |   18 ++++++---
 fs/fuse/fuse_iomap.c      |   12 ++++++
 fs/fuse/inode.c           |   19 +++++++---
 fs/fuse/ioctl.c           |   68 +++++++++++++++++++++++++++++++++++
 fs/fuse/readdir.c         |    5 ++-
 10 files changed, 262 insertions(+), 24 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/9] fuse: enable caching of timestamps
2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2026-04-29 14:33 ` Darrick J. Wong
2026-04-29 14:33 ` [PATCH 2/9] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
` (7 subsequent siblings)
8 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:33 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Cache the timestamps in the kernel so that the kernel sends FUSE_SETATTR
calls to the fuse server after writes, because the iomap infrastructure
won't do that for us.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/dir.c        |    5 ++++-
 fs/fuse/file.c       |   18 ++++++++++++------
 fs/fuse/fuse_iomap.c |    6 ++++++
 fs/fuse/inode.c      |   13 +++++++------
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 3a76eea04c6425..11898102adefe2 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2265,7 +2265,8 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	struct fuse_attr_out outarg;
 	const bool is_iomap = fuse_inode_has_iomap(inode);
 	bool is_truncate = false;
-	bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode);
+	bool is_wb = (is_iomap || fc->writeback_cache) &&
+		     S_ISREG(inode->i_mode);
 	loff_t oldsize;
 	int err;
 	bool trust_local_cmtime = is_wb;
@@ -2405,6 +2406,8 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	spin_lock(&fi->lock);
 	/* the kernel maintains i_mtime locally */
 	if (trust_local_cmtime) {
+		if ((attr->ia_valid & ATTR_ATIME) && is_iomap)
+			inode_set_atime_to_ts(inode, attr->ia_atime);
 		if (attr->ia_valid & ATTR_MTIME)
 			inode_set_mtime_to_ts(inode, attr->ia_mtime);
 		if (attr->ia_valid & ATTR_CTIME)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index fa67b20f5ad3ae..eecd0610fbd3e5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -262,7 +262,7 @@ static int fuse_open(struct inode *inode, struct file *file)
 	int err;
 	const bool is_iomap = fuse_inode_has_iomap(inode);
 	bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
-	bool is_wb_truncate = is_truncate && fc->writeback_cache;
+	bool is_wb_truncate = is_truncate && (is_iomap || fc->writeback_cache);
 	bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
 
 	if (fuse_is_bad(inode))
@@ -477,12 +477,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	struct fuse_file *ff = file->private_data;
 	struct fuse_flush_in inarg;
 	FUSE_ARGS(args);
+	const bool is_iomap = fuse_inode_has_iomap(inode);
 	int err;
 
 	if (fuse_is_bad(inode))
 		return -EIO;
 
-	if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
+	if ((ff->open_flags & FOPEN_NOFLUSH) &&
+	    !fm->fc->writeback_cache && !is_iomap)
 		return 0;
 
 	err = write_inode_now(inode, 1);
@@ -518,7 +520,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	 * In memory i_blocks is not maintained by fuse, if writeback cache is
 	 * enabled, i_blocks from cached attr may not be accurate.
 	 */
-	if (!err && fm->fc->writeback_cache)
+	if (!err && (is_iomap || fm->fc->writeback_cache))
 		fuse_invalidate_attr_mask(inode, STATX_BLOCKS);
 	return err;
 }
@@ -820,8 +822,10 @@ static void fuse_short_read(struct inode *inode, u64 attr_ver, size_t num_read,
 	 * If writeback_cache is enabled, a short read means there's a hole in
 	 * the file. Some data after the hole is in page cache, but has not
 	 * reached the client fs yet. So the hole is not present there.
+	 * If iomap is enabled, a short read means we hit EOF so there's
+	 * nothing to adjust.
 	 */
-	if (!fc->writeback_cache) {
+	if (!fc->writeback_cache && !fuse_inode_has_iomap(inode)) {
 		loff_t pos = folio_pos(ap->folios[0]) + num_read;
 		fuse_read_update_size(inode, pos, attr_ver);
 	}
@@ -870,6 +874,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 			    unsigned int flags, struct iomap *iomap,
 			    struct iomap *srcmap)
 {
+	WARN_ON(fuse_inode_has_iomap(inode));
+
 	iomap->type = IOMAP_MAPPED;
 	iomap->length = length;
 	iomap->offset = offset;
@@ -2021,7 +2027,7 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 	 * Do this only if writeback_cache is not enabled. If writeback_cache
 	 * is enabled, we trust local ctime/mtime.
 	 */
-	if (!fc->writeback_cache)
+	if (!fc->writeback_cache && !fuse_inode_has_iomap(inode))
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
 	spin_lock(&fi->lock);
 	fi->writectr--;
@@ -3114,7 +3120,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 	/* mark unstable when write-back is not used, and file_out gets
 	 * extended */
 	const bool is_iomap = fuse_inode_has_iomap(inode_out);
-	bool is_unstable = (!fc->writeback_cache) &&
+	bool is_unstable = (!fc->writeback_cache && !is_iomap) &&
 			   ((pos_out + len) > inode_out->i_size);
 
 	if (fc->no_copy_file_range)
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index cf52e2747e7f7f..a39370b97ca508 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -1873,6 +1873,12 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	unsigned int min_order = 0;
 
+	/*
+	 * Manage timestamps ourselves, don't make the fuse server do it. This
+	 * is critical for mtime updates to work correctly with page_mkwrite.
+	 */
+	inode->i_flags &= ~S_NOCMTIME;
+	inode->i_flags &= ~S_NOATIME;
 	inode->i_data.a_ops = &fuse_iomap_aops;
 
 	INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d48a76e6d17995..2513ea108ff9a8 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -331,10 +331,11 @@ u32 fuse_get_cache_mask(struct inode *inode)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
-	if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
-		return 0;
+	if (S_ISREG(inode->i_mode) &&
+	    (fuse_inode_has_iomap(inode) || fc->writeback_cache))
+		return STATX_MTIME | STATX_CTIME | STATX_SIZE;
 
-	return STATX_MTIME | STATX_CTIME | STATX_SIZE;
+	return 0;
 }
 
 static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr,
@@ -349,9 +350,9 @@ static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr
 
 	spin_lock(&fi->lock);
 	/*
-	 * In case of writeback_cache enabled, writes update mtime, ctime and
-	 * may update i_size. In these cases trust the cached value in the
-	 * inode.
+	 * In case of writeback_cache or iomap enabled, writes update mtime,
+	 * ctime and may update i_size. In these cases trust the cached value
+	 * in the inode.
 	 */
 	cache_mask = fuse_get_cache_mask(inode);
 	fuse_iomap_set_disk_size(fi, attr->size);

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 2/9] fuse: force a ctime update after a fileattr_set call when in iomap mode
2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2026-04-29 14:33 ` [PATCH 1/9] fuse: enable caching of timestamps Darrick J. Wong
@ 2026-04-29 14:33 ` Darrick J. Wong
2026-04-29 14:33 ` [PATCH 3/9] fuse: allow local filesystems to set some VFS iflags Darrick J. Wong
` (6 subsequent siblings)
8 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:33 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the kernel is in charge of driving ctime updates to the
fuse server and ignores updates coming from the fuse server. Therefore,
when someone calls fileattr_set to change file attributes, we must force
a ctime update.

Found by generic/277.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/ioctl.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index fdc175e93f7474..07529db21fb781 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -546,8 +546,13 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
 	struct fuse_file *ff;
 	unsigned int flags = fa->flags;
 	struct fsxattr xfa;
+	struct file_kattr old_ma = { };
+	bool is_wb = (fuse_get_cache_mask(inode) & STATX_CTIME);
 	int err;
 
+	if (is_wb)
+		vfs_fileattr_get(dentry, &old_ma);
+
 	ff = fuse_priv_ioctl_prepare(inode);
 	if (IS_ERR(ff))
 		return PTR_ERR(ff);
@@ -571,6 +576,12 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
 
 cleanup:
 	fuse_priv_ioctl_cleanup(inode, ff);
+	/*
+	 * If we cache ctime updates and the fileattr changed, then force a
+	 * ctime update.
+	 */
+	if (is_wb && memcmp(&old_ma, fa, sizeof(old_ma)))
+		fuse_update_ctime(inode);
 
 	return err;
 }

^ permalink raw reply related [flat|nested] 191+ messages in thread
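For readers following along outside the kernel tree, the idea in patch 2/9 (snapshot the file attributes, apply the change, and bump ctime only if something actually differs and the kernel owns ctime) can be sketched in plain userspace C. This is only an illustration under stated assumptions: the `demo_*` types and functions are hypothetical and are not part of fuse, and the sketch compares fields one by one where the patch uses memcmp() on struct file_kattr.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the attribute state that fileattr_set changes. */
struct demo_fileattr {
	uint32_t flags;		/* FS_*_FL style flags */
	uint32_t xflags;	/* FS_XFLAG_* style flags */
	uint32_t projid;	/* project id */
};

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct demo_inode {
	struct demo_fileattr fa;
	bool caches_ctime;	/* true when the kernel owns ctime */
	int64_t ctime_sec;
};

/* Field-wise comparison; an assumption made here to sidestep padding bytes. */
static bool demo_fileattr_equal(const struct demo_fileattr *a,
				const struct demo_fileattr *b)
{
	return a->flags == b->flags && a->xflags == b->xflags &&
	       a->projid == b->projid;
}

/*
 * Apply new attributes.  Returns true if ctime was bumped, which only
 * happens when ctime is cached locally and the attributes really changed.
 */
static bool demo_fileattr_set(struct demo_inode *inode,
			      const struct demo_fileattr *new_fa,
			      int64_t now_sec)
{
	struct demo_fileattr old = inode->fa;	/* snapshot before the change */

	inode->fa = *new_fa;
	if (inode->caches_ctime && !demo_fileattr_equal(&old, new_fa)) {
		inode->ctime_sec = now_sec;
		return true;
	}
	return false;
}
```

Setting identical attributes leaves ctime alone in this sketch, which mirrors the "fileattr changed" condition in the patch comment.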
* [PATCH 3/9] fuse: allow local filesystems to set some VFS iflags 2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong 2026-04-29 14:33 ` [PATCH 1/9] fuse: enable caching of timestamps Darrick J. Wong 2026-04-29 14:33 ` [PATCH 2/9] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong @ 2026-04-29 14:33 ` Darrick J. Wong 2026-04-29 14:34 ` [PATCH 4/9] fuse_trace: " Darrick J. Wong ` (5 subsequent siblings) 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:33 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> There are three inode flags (immutable, append, sync) that are enforced by the VFS. Whenever we go around setting iflags, let's update the VFS state so that they actually work. Make it so that the fuse server can set these three inode flags at load time and have the kernel advertise and enforce them. Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 1 + include/uapi/linux/fuse.h | 8 +++++++ fs/fuse/dir.c | 1 + fs/fuse/inode.c | 1 + fs/fuse/ioctl.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 61 insertions(+) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 95b37f4660cc1d..3b38b98dc0096c 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1651,6 +1651,7 @@ long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, int fuse_fileattr_get(struct dentry *dentry, struct file_kattr *fa); int fuse_fileattr_set(struct mnt_idmap *idmap, struct dentry *dentry, struct file_kattr *fa); +void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr); /* iomode.c */ int fuse_file_cached_io_open(struct inode *inode, struct fuse_file *ff); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index b6fa828776b82f..bf8514a5ee27af 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -249,6 +249,8 @@ * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry * - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges * - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support + * - add FUSE_ATTR_{SYNC,IMMUTABLE,APPEND} for VFS enforcement of file + * attributes */ #ifndef _LINUX_FUSE_H @@ -606,12 +608,18 @@ struct fuse_file_lock { * kernel can use cached attributes more aggressively (e.g. 
ACL inheritance) * FUSE_ATTR_IOMAP: Use iomap for this inode * FUSE_ATTR_ATOMIC: Enable untorn writes + * FUSE_ATTR_SYNC: File writes are synchronous + * FUSE_ATTR_IMMUTABLE: File is immutable + * FUSE_ATTR_APPEND: File is append-only */ #define FUSE_ATTR_SUBMOUNT (1 << 0) #define FUSE_ATTR_DAX (1 << 1) #define FUSE_ATTR_EXCLUSIVE (1 << 2) #define FUSE_ATTR_IOMAP (1 << 3) #define FUSE_ATTR_ATOMIC (1 << 4) +#define FUSE_ATTR_SYNC (1 << 5) +#define FUSE_ATTR_IMMUTABLE (1 << 6) +#define FUSE_ATTR_APPEND (1 << 7) /** * Open flags diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 11898102adefe2..53fda7f63b11c8 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1450,6 +1450,7 @@ static void fuse_fillattr(struct mnt_idmap *idmap, struct inode *inode, blkbits = inode->i_sb->s_blocksize_bits; stat->blksize = 1 << blkbits; + generic_fill_statx_attr(inode, stat); } static void fuse_statx_to_attr(struct fuse_statx *sx, struct fuse_attr *attr) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 2513ea108ff9a8..aa9f880b9a2ea6 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -544,6 +544,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid, inode->i_flags |= S_NOCMTIME; inode->i_generation = generation; fuse_init_inode(inode, attr, fc); + fuse_fileattr_init(inode, attr); } else if (fuse_stale_inode(inode, generation, attr)) { /* nodeid was reused, any I/O on the old inode should fail */ fuse_make_bad(inode); diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c index 07529db21fb781..bd2caf191ce2e0 100644 --- a/fs/fuse/ioctl.c +++ b/fs/fuse/ioctl.c @@ -502,6 +502,53 @@ static void fuse_priv_ioctl_cleanup(struct inode *inode, struct fuse_file *ff) fuse_file_release(inode, ff, O_RDONLY, NULL, S_ISDIR(inode->i_mode)); } +static inline void update_iflag(struct inode *inode, unsigned int iflag, + bool set) +{ + if (set) + inode->i_flags |= iflag; + else + inode->i_flags &= ~iflag; +} + +static void fuse_fileattr_update_inode(struct inode *inode, + const struct file_kattr *fa) 
+{ + unsigned int old_iflags = inode->i_flags; + + if (!fuse_inode_is_exclusive(inode)) + return; + + if (fa->flags_valid) { + update_iflag(inode, S_SYNC, fa->flags & FS_SYNC_FL); + update_iflag(inode, S_IMMUTABLE, fa->flags & FS_IMMUTABLE_FL); + update_iflag(inode, S_APPEND, fa->flags & FS_APPEND_FL); + } else if (fa->fsx_valid) { + update_iflag(inode, S_SYNC, fa->fsx_xflags & FS_XFLAG_SYNC); + update_iflag(inode, S_IMMUTABLE, + fa->fsx_xflags & FS_XFLAG_IMMUTABLE); + update_iflag(inode, S_APPEND, fa->fsx_xflags & FS_XFLAG_APPEND); + } + + if (old_iflags != inode->i_flags) + fuse_invalidate_attr(inode); +} + +void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr) +{ + if (!fuse_inode_is_exclusive(inode)) + return; + + if (attr->flags & FUSE_ATTR_SYNC) + inode->i_flags |= S_SYNC; + + if (attr->flags & FUSE_ATTR_IMMUTABLE) + inode->i_flags |= S_IMMUTABLE; + + if (attr->flags & FUSE_ATTR_APPEND) + inode->i_flags |= S_APPEND; +} + int fuse_fileattr_get(struct dentry *dentry, struct file_kattr *fa) { struct inode *inode = d_inode(dentry); @@ -572,7 +619,10 @@ int fuse_fileattr_set(struct mnt_idmap *idmap, err = fuse_priv_ioctl(inode, ff, FS_IOC_FSSETXATTR, &xfa, sizeof(xfa)); + if (err) + goto cleanup; } + fuse_fileattr_update_inode(inode, fa); cleanup: fuse_priv_ioctl_cleanup(inode, ff); ^ permalink raw reply related [flat|nested] 191+ messages in thread
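As a rough userspace illustration of what patch 3/9 does at inode load time and after fileattr_set, the sketch below maps the three new FUSE_ATTR_ bits onto inode flag bits. The FUSE_ATTR_ bit positions are taken from the patch; the `DEMO_S_*` values are hypothetical stand-ins for the kernel-internal S_SYNC, S_IMMUTABLE and S_APPEND bits, and the `demo_*` helpers are not real fuse functions.

```c
#include <stdint.h>

/* Bit positions as added to include/uapi/linux/fuse.h by the patch. */
#define FUSE_ATTR_SYNC		(1u << 5)
#define FUSE_ATTR_IMMUTABLE	(1u << 6)
#define FUSE_ATTR_APPEND	(1u << 7)

/* Hypothetical stand-ins for the kernel-internal S_* inode flag bits. */
#define DEMO_S_SYNC		(1u << 0)
#define DEMO_S_APPEND		(1u << 1)
#define DEMO_S_IMMUTABLE	(1u << 2)

/* Set or clear one flag bit, in the spirit of update_iflag() in the patch. */
static uint32_t demo_update_iflag(uint32_t iflags, uint32_t bit, int set)
{
	return set ? (iflags | bit) : (iflags & ~bit);
}

/*
 * Load-time initialization: only inodes that the server owns exclusively
 * get VFS-enforced flags; everything else is left untouched.
 */
static uint32_t demo_iflags_from_attr(uint32_t iflags, uint32_t attr_flags,
				      int exclusive)
{
	if (!exclusive)
		return iflags;

	if (attr_flags & FUSE_ATTR_SYNC)
		iflags |= DEMO_S_SYNC;
	if (attr_flags & FUSE_ATTR_IMMUTABLE)
		iflags |= DEMO_S_IMMUTABLE;
	if (attr_flags & FUSE_ATTR_APPEND)
		iflags |= DEMO_S_APPEND;
	return iflags;
}
```

Note that the load-time path only ever sets bits, while the fileattr_set path can also clear them, which is why the two helpers differ.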
* [PATCH 4/9] fuse_trace: allow local filesystems to set some VFS iflags
2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (2 preceding siblings ...)
2026-04-29 14:33 ` [PATCH 3/9] fuse: allow local filesystems to set some VFS iflags Darrick J. Wong
@ 2026-04-29 14:34 ` Darrick J. Wong
2026-04-29 14:34 ` [PATCH 5/9] fuse: cache atime when in iomap mode Darrick J. Wong
` (4 subsequent siblings)
8 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:34 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Add tracepoints for the previous patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h | 29 +++++++++++++++++++++++++++++
 fs/fuse/ioctl.c      |  7 +++++++
 2 files changed, 36 insertions(+)

diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 0016242ff34f62..7136ecf25e1f2b 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -179,6 +179,35 @@ TRACE_EVENT(fuse_request_end,
 		  __entry->unique, __entry->len, __entry->error)
 );
 
+DECLARE_EVENT_CLASS(fuse_fileattr_class,
+	TP_PROTO(const struct inode *inode, unsigned int old_iflags),
+
+	TP_ARGS(inode, old_iflags),
+
+	TP_STRUCT__entry(
+		FUSE_INODE_FIELDS
+		__field(unsigned int, old_iflags)
+		__field(unsigned int, new_iflags)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->old_iflags = old_iflags;
+		__entry->new_iflags = inode->i_flags;
+	),
+
+	TP_printk(FUSE_INODE_FMT " old_iflags 0x%x iflags 0x%x",
+		  FUSE_INODE_PRINTK_ARGS,
+		  __entry->old_iflags,
+		  __entry->new_iflags)
+);
+#define DEFINE_FUSE_FILEATTR_EVENT(name) \
+DEFINE_EVENT(fuse_fileattr_class, name, \
+	TP_PROTO(const struct inode *inode, unsigned int old_iflags), \
+	TP_ARGS(inode, old_iflags))
+DEFINE_FUSE_FILEATTR_EVENT(fuse_fileattr_update_inode);
+DEFINE_FUSE_FILEATTR_EVENT(fuse_fileattr_init);
+
 #ifdef CONFIG_FUSE_BACKING
 #define FUSE_BACKING_FLAG_STRINGS \
 	{ FUSE_BACKING_TYPE_PASSTHROUGH, "pass" }, \
diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index bd2caf191ce2e0..5180066678e8c1 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -4,6 +4,7 @@
  */
 
 #include "fuse_i.h"
+#include "fuse_trace.h"
 
 #include <linux/uio.h>
 #include <linux/compat.h>
@@ -530,12 +531,16 @@ static void fuse_fileattr_update_inode(struct inode *inode,
 		update_iflag(inode, S_APPEND, fa->fsx_xflags & FS_XFLAG_APPEND);
 	}
 
+	trace_fuse_fileattr_update_inode(inode, old_iflags);
+
 	if (old_iflags != inode->i_flags)
 		fuse_invalidate_attr(inode);
 }
 
 void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr)
 {
+	unsigned int old_iflags = inode->i_flags;
+
 	if (!fuse_inode_is_exclusive(inode))
 		return;
 
@@ -547,6 +552,8 @@ void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr)
 
 	if (attr->flags & FUSE_ATTR_APPEND)
 		inode->i_flags |= S_APPEND;
+
+	trace_fuse_fileattr_init(inode, old_iflags);
 }
 
 int fuse_fileattr_get(struct dentry *dentry, struct file_kattr *fa)

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 5/9] fuse: cache atime when in iomap mode
2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (3 preceding siblings ...)
2026-04-29 14:34 ` [PATCH 4/9] fuse_trace: " Darrick J. Wong
@ 2026-04-29 14:34 ` Darrick J. Wong
2026-04-29 14:34 ` [PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
` (3 subsequent siblings)
8 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:34 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

When we're running in iomap mode, allow the kernel to cache the access
timestamp to further reduce the number of roundtrips to the fuse server.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/dir.c        |  5 +++++
 fs/fuse/fuse_iomap.c |  6 ++++++
 fs/fuse/inode.c      | 11 ++++++++---
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 53fda7f63b11c8..61015e83888ee4 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2236,6 +2236,11 @@ int fuse_flush_times(struct inode *inode, struct fuse_file *ff)
 		inarg.ctime = inode_get_ctime_sec(inode);
 		inarg.ctimensec = inode_get_ctime_nsec(inode);
 	}
+	if (fuse_inode_has_iomap(inode)) {
+		inarg.valid |= FATTR_ATIME;
+		inarg.atime = inode_get_atime_sec(inode);
+		inarg.atimensec = inode_get_atime_nsec(inode);
+	}
 	if (ff) {
 		inarg.valid |= FATTR_FH;
 		inarg.fh = ff->fh;
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index a39370b97ca508..9278e2e399ba9b 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -1169,6 +1169,12 @@ void fuse_iomap_init_inode(struct inode *inode, struct fuse_attr *attr)
 
 	if (attr->flags & FUSE_ATTR_ATOMIC)
 		fuse_inode_set_atomic(inode);
 
+	/*
+	 * iomap caches atime too, so we must load it from the fuse server
+	 * at instantiation time.
+	 */
+	inode_set_atime(inode, attr->atime, attr->atimensec);
+
 	trace_fuse_iomap_init_inode(inode);
 }
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index aa9f880b9a2ea6..1a09b9e1446919 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -266,7 +266,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	attr->mtimensec = min_t(u32, attr->mtimensec, NSEC_PER_SEC - 1);
 	attr->ctimensec = min_t(u32, attr->ctimensec, NSEC_PER_SEC - 1);
 
-	inode_set_atime(inode, attr->atime, attr->atimensec);
+	if (!(cache_mask & STATX_ATIME))
+		inode_set_atime(inode, attr->atime, attr->atimensec);
 	/* mtime from server may be stale due to local buffered write */
 	if (!(cache_mask & STATX_MTIME)) {
 		inode_set_mtime(inode, attr->mtime, attr->mtimensec);
@@ -331,8 +332,12 @@ u32 fuse_get_cache_mask(struct inode *inode)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
-	if (S_ISREG(inode->i_mode) &&
-	    (fuse_inode_has_iomap(inode) || fc->writeback_cache))
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+
+	if (fuse_inode_has_iomap(inode))
+		return STATX_MTIME | STATX_CTIME | STATX_ATIME | STATX_SIZE;
+	if (fc->writeback_cache)
 		return STATX_MTIME | STATX_CTIME | STATX_SIZE;
 
 	return 0;

^ permalink raw reply related [flat|nested] 191+ messages in thread
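The cache mask logic after patches 1/9 and 5/9 can be summarized with a small userspace sketch: regular files in iomap mode trust the local atime, mtime, ctime and size, writeback_cache files trust everything but atime, and anything else takes the server's values. The `DEMO_*` bits and `demo_*` names below are hypothetical stand-ins for the STATX_ constants and kernel helpers, chosen so the example builds without kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for STATX_ATIME, STATX_MTIME, STATX_CTIME, STATX_SIZE. */
#define DEMO_ATIME	(1u << 0)
#define DEMO_MTIME	(1u << 1)
#define DEMO_CTIME	(1u << 2)
#define DEMO_SIZE	(1u << 3)

struct demo_attrs {
	int64_t atime;
	int64_t mtime;
	int64_t ctime;
	uint64_t size;
};

/* Which attributes does the kernel keep locally instead of trusting the server? */
static uint32_t demo_cache_mask(bool is_regular, bool has_iomap,
				bool writeback_cache)
{
	if (!is_regular)
		return 0;
	if (has_iomap)
		return DEMO_MTIME | DEMO_CTIME | DEMO_ATIME | DEMO_SIZE;
	if (writeback_cache)
		return DEMO_MTIME | DEMO_CTIME | DEMO_SIZE;
	return 0;
}

/* Copy server-supplied attributes, skipping every field the mask says is cached. */
static void demo_apply_server_attrs(struct demo_attrs *local,
				    const struct demo_attrs *server,
				    uint32_t cache_mask)
{
	if (!(cache_mask & DEMO_ATIME))
		local->atime = server->atime;
	if (!(cache_mask & DEMO_MTIME))
		local->mtime = server->mtime;
	if (!(cache_mask & DEMO_CTIME))
		local->ctime = server->ctime;
	if (!(cache_mask & DEMO_SIZE))
		local->size = server->size;
}
```

With writeback_cache but no iomap, only atime is overwritten by the server's reply in this sketch; with iomap, nothing is.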
* [PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (4 preceding siblings ...)
2026-04-29 14:34 ` [PATCH 5/9] fuse: cache atime when in iomap mode Darrick J. Wong
@ 2026-04-29 14:34 ` Darrick J. Wong
2026-04-29 14:34 ` [PATCH 7/9] fuse_trace: " Darrick J. Wong
` (2 subsequent siblings)
8 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:34 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Let the kernel handle killing the suid/sgid bits because the
write/falloc/truncate/chown code already does this, and we don't have to
worry about external modifications that are only visible to the fuse
server (i.e. we're not a cluster fs).

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/dir.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 61015e83888ee4..e664a60200ee26 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2477,6 +2477,7 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
 	struct inode *inode = d_inode(entry);
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct file *file = (attr->ia_valid & ATTR_FILE) ? attr->ia_file : NULL;
+	const bool is_iomap = fuse_inode_has_iomap(inode);
 	int ret;
 
 	if (fuse_is_bad(inode))
@@ -2485,15 +2486,19 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
 	if (!fuse_allow_current_process(get_fuse_conn(inode)))
 		return -EACCES;
 
-	if (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID)) {
+	if (!is_iomap &&
+	    (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))) {
 		attr->ia_valid &= ~(ATTR_KILL_SUID | ATTR_KILL_SGID |
 				    ATTR_MODE);
 
 		/*
 		 * The only sane way to reliably kill suid/sgid is to do it in
-		 * the userspace filesystem
+		 * the userspace filesystem if this isn't an iomap file.  For
+		 * iomap filesystems we let the kernel kill the setuid/setgid
+		 * bits.
 		 *
-		 * This should be done on write(), truncate() and chown().
+		 * This should be done on write(), truncate(), chown(), and
+		 * fallocate().
 		 */
 		if (!fc->handle_killpriv && !fc->handle_killpriv_v2) {
 			/*

^ permalink raw reply related [flat|nested] 191+ messages in thread
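To make "killing the suid/sgid bits" concrete, here is a deliberately simplified userspace sketch of the mode computation, plus the decision that patch 6/9 changes (who is responsible for it). It assumes a POSIX system for the S_ISUID, S_ISGID and S_IXGRP macros. The `demo_*` functions are hypothetical, and the real kernel logic also takes the caller's capabilities and group membership into account, which this sketch ignores.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Simplified privilege stripping: always clear setuid, and clear setgid
 * only when group-execute is set.  (Assumption: capability and group
 * membership checks done by the real kernel are left out.)
 */
static mode_t demo_kill_priv(mode_t mode)
{
	mode &= ~(mode_t)S_ISUID;
	if ((mode & S_ISGID) && (mode & S_IXGRP))
		mode &= ~(mode_t)S_ISGID;
	return mode;
}

/*
 * Per the patch description: for iomap inodes the kernel strips the bits
 * itself, so the request is only handed to the fuse server for non-iomap
 * inodes.
 */
static bool demo_server_handles_killpriv(bool is_iomap, bool kill_requested)
{
	return !is_iomap && kill_requested;
}
```

For example, a file with mode 04755 comes out as 0755, while 02745 keeps its setgid bit in this sketch because group-execute is not set.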
* [PATCH 7/9] fuse_trace: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems 2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong ` (5 preceding siblings ...) 2026-04-29 14:34 ` [PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong @ 2026-04-29 14:34 ` Darrick J. Wong 2026-04-29 14:35 ` [PATCH 8/9] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong 2026-04-29 14:35 ` [PATCH 9/9] fuse: always cache ACLs when using iomap Darrick J. Wong 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:34 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Add tracepoints for the previous patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_trace.h | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++ fs/fuse/dir.c | 5 ++++ 2 files changed, 63 insertions(+) diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index 7136ecf25e1f2b..a6374d64a62357 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -208,6 +208,64 @@ DEFINE_EVENT(fuse_fileattr_class, name, \ DEFINE_FUSE_FILEATTR_EVENT(fuse_fileattr_update_inode); DEFINE_FUSE_FILEATTR_EVENT(fuse_fileattr_init); +TRACE_EVENT(fuse_setattr_fill, + TP_PROTO(const struct inode *inode, + const struct fuse_setattr_in *inarg), + TP_ARGS(inode, inarg), + + TP_STRUCT__entry( + FUSE_INODE_FIELDS + __field(umode_t, mode) + __field(uint32_t, valid) + __field(umode_t, new_mode) + __field(uint64_t, new_size) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->mode = inode->i_mode; + __entry->valid = inarg->valid; + __entry->new_mode = inarg->mode; + __entry->new_size = inarg->size; + ), + + TP_printk(FUSE_INODE_FMT " mode 0%o valid 0x%x new_mode 0%o new_size 0x%llx", + FUSE_INODE_PRINTK_ARGS, + __entry->mode, + __entry->valid, + 
__entry->new_mode, + __entry->new_size) +); + +TRACE_EVENT(fuse_setattr, + TP_PROTO(const struct inode *inode, + const struct iattr *inarg), + TP_ARGS(inode, inarg), + + TP_STRUCT__entry( + FUSE_INODE_FIELDS + __field(umode_t, mode) + __field(uint32_t, valid) + __field(umode_t, new_mode) + __field(uint64_t, new_size) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->mode = inode->i_mode; + __entry->valid = inarg->ia_valid; + __entry->new_mode = inarg->ia_mode; + __entry->new_size = inarg->ia_size; + ), + + TP_printk(FUSE_INODE_FMT " mode 0%o valid 0x%x new_mode 0%o new_size 0x%llx", + FUSE_INODE_PRINTK_ARGS, + __entry->mode, + __entry->valid, + __entry->new_mode, + __entry->new_size) +); + #ifdef CONFIG_FUSE_BACKING #define FUSE_BACKING_FLAG_STRINGS \ { FUSE_BACKING_TYPE_PASSTHROUGH, "pass" }, \ diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index e664a60200ee26..716f3e893dda66 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -8,6 +8,7 @@ #include "fuse_i.h" #include "fuse_iomap.h" +#include "fuse_trace.h" #include <linux/pagemap.h> #include <linux/file.h> @@ -2205,6 +2206,8 @@ static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_args *args, struct fuse_setattr_in *inarg_p, struct fuse_attr_out *outarg_p) { + trace_fuse_setattr_fill(inode, inarg_p); + args->opcode = FUSE_SETATTR; args->nodeid = get_node_id(inode); args->in_numargs = 1; @@ -2486,6 +2489,8 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry, if (!fuse_allow_current_process(get_fuse_conn(inode))) return -EACCES; + trace_fuse_setattr(inode, attr); + if (!is_iomap && (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))) { attr->ia_valid &= ~(ATTR_KILL_SUID | ATTR_KILL_SGID | ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 8/9] fuse: update ctime when updating acls on an iomap inode
2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (6 preceding siblings ...)
2026-04-29 14:34 ` [PATCH 7/9] fuse_trace: " Darrick J. Wong
@ 2026-04-29 14:35 ` Darrick J. Wong
2026-04-29 14:35 ` [PATCH 9/9] fuse: always cache ACLs when using iomap Darrick J. Wong
8 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:35 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the fuse kernel driver is in charge of updating file
attributes, so we need to update ctime after an ACL change.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/acl.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 9619ac84a85886..76c46946640d44 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -7,6 +7,7 @@
  */
 
 #include "fuse_i.h"
+#include "fuse_iomap.h"
 
 #include <linux/posix_acl.h>
 #include <linux/posix_acl_xattr.h>
@@ -113,6 +114,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 	const char *name;
 	umode_t mode = inode->i_mode;
 	const bool local_acls = fuse_inode_has_local_acls(inode);
+	const bool is_iomap = fuse_inode_has_iomap(inode);
 	int ret;
 
 	if (fuse_is_bad(inode))
@@ -180,10 +182,24 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 		ret = 0;
 	}
 
-	/* If we scheduled a mode update above, push that to userspace now. */
 	if (!ret) {
 		struct iattr attr = { };
 
+		/*
+		 * When we're running in iomap mode, we need to update mode and
+		 * ctime ourselves instead of letting the fuse server figure
+		 * that out.
+		 */
+		if (is_iomap) {
+			attr.ia_valid |= ATTR_CTIME;
+			inode_set_ctime_current(inode);
+			attr.ia_ctime = inode_get_ctime(inode);
+		}
+
+		/*
+		 * If we scheduled a mode update above, push that to userspace
+		 * now.
+		 */
 		if (mode != inode->i_mode) {
 			attr.ia_valid |= ATTR_MODE;
 			attr.ia_mode = mode;

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 9/9] fuse: always cache ACLs when using iomap 2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong ` (7 preceding siblings ...) 2026-04-29 14:35 ` [PATCH 8/9] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong @ 2026-04-29 14:35 ` Darrick J. Wong 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:35 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Keep ACLs cached in memory when we're using iomap, so that we don't have to make a round trip to the fuse server. This might want to become a FUSE_ATTR_ flag. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/acl.c | 12 +++++++++--- fs/fuse/dir.c | 11 ++++++++--- fs/fuse/readdir.c | 5 ++++- 3 files changed, 21 insertions(+), 7 deletions(-) diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c index 76c46946640d44..8e88029ae29265 100644 --- a/fs/fuse/acl.c +++ b/fs/fuse/acl.c @@ -212,10 +212,16 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, if (fc->posix_acl) { /* * Fuse daemons without FUSE_POSIX_ACL never cached POSIX ACLs - * and didn't invalidate attributes. Retain that behavior. + * and didn't invalidate attributes. Retain that behavior + * except for iomap, where we assume that only the source of + * ACL changes is userspace. 
*/ - forget_all_cached_acls(inode); - fuse_invalidate_attr(inode); + if (!ret && is_iomap) { + set_cached_acl(inode, type, acl); + } else { + forget_all_cached_acls(inode); + fuse_invalidate_attr(inode); + } } return ret; diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 716f3e893dda66..ca75f2e9be4464 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -447,7 +447,8 @@ static int fuse_dentry_revalidate(struct inode *dir, const struct qstr *name, fuse_stale_inode(inode, outarg.generation, &outarg.attr)) goto invalid; - forget_all_cached_acls(inode); + if (!fuse_inode_has_iomap(inode)) + forget_all_cached_acls(inode); fuse_change_attributes(inode, &outarg.attr, NULL, ATTR_TIMEOUT(&outarg), attr_version); @@ -1667,7 +1668,8 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode, sync = time_before64(fi->i_time, get_jiffies_64()); if (sync) { - forget_all_cached_acls(inode); + if (!fuse_inode_has_iomap(inode)) + forget_all_cached_acls(inode); /* Try statx if a field not covered by regular stat is wanted */ if (!fc->no_statx && (request_mask & ~STATX_BASIC_STATS)) { err = fuse_do_statx(idmap, inode, file, stat); @@ -1851,6 +1853,9 @@ static int fuse_access(struct inode *inode, int mask) static int fuse_perm_getattr(struct inode *inode, int mask) { + if (fuse_inode_has_iomap(inode)) + return 0; + if (mask & MAY_NOT_BLOCK) return -ECHILD; @@ -2534,7 +2539,7 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry, * If filesystem supports acls it may have updated acl xattrs in * the filesystem, so forget cached acls for the inode. 
*/ - if (fc->posix_acl) + if (fc->posix_acl && !is_iomap) forget_all_cached_acls(inode); /* Directory mode changed, may need to revalidate access */ diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c index 5ba19ca9d8a949..bd92bb00d197f7 100644 --- a/fs/fuse/readdir.c +++ b/fs/fuse/readdir.c @@ -8,6 +8,8 @@ #include "fuse_i.h" +#include "fuse_iomap.h" + #include <linux/iversion.h> #include <linux/posix_acl.h> #include <linux/pagemap.h> @@ -228,7 +230,8 @@ static int fuse_direntplus_link(struct file *file, fi->nlookup++; spin_unlock(&fi->lock); - forget_all_cached_acls(inode); + if (!fuse_inode_has_iomap(inode)) + forget_all_cached_acls(inode); fuse_change_attributes(inode, &o->attr, NULL, ATTR_TIMEOUT(o), attr_version); ^ permalink raw reply related [flat|nested] 191+ messages in thread
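The ACL caching policy that patch 9/9 introduces can be condensed into two small decision functions. The sketch below is a userspace restatement of the branches visible in the diff; the enum and the `demo_*` function names are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Hypothetical labels for what happens to the in-memory ACL cache. */
enum demo_acl_action {
	DEMO_ACL_LEAVE,		/* no FUSE_POSIX_ACL: cache was never used */
	DEMO_ACL_STORE,		/* keep the ACL that was just set */
	DEMO_ACL_FORGET,	/* drop cached ACLs and invalidate attributes */
};

/* Decision taken at the end of an ACL set operation. */
static enum demo_acl_action demo_acl_after_set(bool posix_acl, bool is_iomap,
					       int ret)
{
	if (!posix_acl)
		return DEMO_ACL_LEAVE;
	if (ret == 0 && is_iomap)
		return DEMO_ACL_STORE;
	return DEMO_ACL_FORGET;
}

/*
 * Decision taken on revalidate, getattr refresh, and readdirplus: iomap
 * inodes keep their cached ACLs, everyone else forgets them.
 */
static bool demo_forget_acls_on_refresh(bool is_iomap)
{
	return !is_iomap;
}
```

The underlying assumption, stated in the patch comment, is that for iomap inodes the ACLs only change through this kernel, so a cached copy cannot go stale behind its back.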
* [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (5 preceding siblings ...)
2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2026-04-29 14:18 ` Darrick J. Wong
2026-04-29 14:35 ` [PATCH 01/12] fuse: cache iomaps Darrick J. Wong
` (11 more replies)
2026-04-29 14:18 ` [PATCHSET v8 8/8] fuse: run fuse-iomap servers as a contained service Darrick J. Wong
` (12 subsequent siblings)
19 siblings, 12 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:18 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

Hi all,

This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel. For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem. For everyone else, it simply
eliminates roundtrips to userspace.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache
---
Commits in this patchset:
 * fuse: cache iomaps
 * fuse_trace: cache iomaps
 * fuse: use the iomap cache for iomap_begin
 * fuse_trace: use the iomap cache for iomap_begin
 * fuse: invalidate iomap cache after file updates
 * fuse_trace: invalidate iomap cache after file updates
 * fuse: enable iomap cache management
 * fuse_trace: enable iomap cache management
 * fuse: overlay iomap inode info in struct fuse_inode
 * fuse: constrain iomap mapping cache size
 * fuse_trace: constrain iomap mapping cache size
 * fuse: enable iomap
---
 fs/fuse/fuse_i.h           |   11
 fs/fuse/fuse_iomap.h       |   10
 fs/fuse/fuse_iomap_cache.h |  121 +++
 fs/fuse/fuse_trace.h       |  445 +++++++++++
 include/uapi/linux/fuse.h  |   41 +
 fs/fuse/Makefile           |    2
 fs/fuse/dev.c              |   46 +
 fs/fuse/file.c             |   18
 fs/fuse/fuse_iomap.c       |  535 +++++++++++++
 fs/fuse/fuse_iomap_cache.c | 1791 ++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/trace.c            |    1
 11 files changed, 2998 insertions(+), 23 deletions(-)
 create mode 100644 fs/fuse/fuse_iomap_cache.h
 create mode 100644 fs/fuse/fuse_iomap_cache.c

^ permalink raw reply [flat|nested] 191+ messages in thread
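As a rough mental model for the mapping cache described in this cover letter, the sketch below looks up a file offset in a sorted array of cached extents and reports a hit or a miss, with a sequence counter that a caller can sample to detect that the cache changed underneath it. This is a simplified illustration only: the series itself uses an extent tree adapted from xfs_iext_tree, not a flat array, and every `demo_*` name here is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum demo_lookup_result {
	DEMO_LOOKUP_HIT,	/* a cached mapping covers the offset */
	DEMO_LOOKUP_MISS,	/* nothing cached; an upcall would be needed */
};

/* One cached extent: file range [offset, offset + length) lives at addr. */
struct demo_mapping {
	uint64_t offset;
	uint64_t length;
	uint64_t addr;
};

/* Cache of non-overlapping mappings, sorted by ascending file offset. */
struct demo_cache {
	const struct demo_mapping *maps;
	size_t count;
	uint64_t seq;	/* bumped by the owner whenever the cache is modified */
};

/* Binary search for the mapping that contains pos. */
static enum demo_lookup_result demo_cache_lookup(const struct demo_cache *c,
						 uint64_t pos,
						 struct demo_mapping *out)
{
	size_t lo = 0, hi = c->count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct demo_mapping *m = &c->maps[mid];

		if (pos < m->offset) {
			hi = mid;
		} else if (pos - m->offset >= m->length) {
			lo = mid + 1;
		} else {
			*out = *m;
			return DEMO_LOOKUP_HIT;
		}
	}
	return DEMO_LOOKUP_MISS;
}

/* A mapping sampled at sampled_seq is stale once the counter has moved on. */
static int demo_mapping_still_valid(const struct demo_cache *c,
				    uint64_t sampled_seq)
{
	return c->seq == sampled_seq;
}
```

A hit is what lets rereads and pure overwrites proceed without an upcall; a miss, or a failed sequence check, would send the caller back to the fuse server for a fresh mapping.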
* [PATCH 01/12] fuse: cache iomaps
2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2026-04-29 14:35 ` Darrick J. Wong
2026-04-29 14:35 ` [PATCH 02/12] fuse_trace: " Darrick J. Wong
` (10 subsequent siblings)
11 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:35 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Cache iomaps to a file so that we don't have to upcall the server.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h           |    3
 fs/fuse/fuse_iomap_cache.h |   99 +++
 include/uapi/linux/fuse.h  |    5
 fs/fuse/Makefile           |    2
 fs/fuse/fuse_iomap.c       |    5
 fs/fuse/fuse_iomap_cache.c | 1695 ++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/trace.c            |    1
 7 files changed, 1809 insertions(+), 1 deletion(-)
 create mode 100644 fs/fuse/fuse_iomap_cache.h
 create mode 100644 fs/fuse/fuse_iomap_cache.c

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3b38b98dc0096c..1bf5dd373153e5 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -197,6 +197,9 @@ struct fuse_inode {
 	spinlock_t ioend_lock;
 	struct work_struct ioend_work;
 	struct list_head ioend_list;
+
+	/* cached iomap mappings */
+	struct fuse_iomap_cache *cache;
 #endif
 };
diff --git a/fs/fuse/fuse_iomap_cache.h b/fs/fuse/fuse_iomap_cache.h
new file mode 100644
index 00000000000000..922ca182357aa7
--- /dev/null
+++ b/fs/fuse/fuse_iomap_cache.h
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The fuse_iext code comes from xfs_iext_tree.[ch] and is:
+ * Copyright (c) 2017 Christoph Hellwig.
+ *
+ * Everything else is:
+ * Copyright (C) 2025-2026 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef _FS_FUSE_IOMAP_CACHE_H
+#define _FS_FUSE_IOMAP_CACHE_H
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+/*
+ * File incore extent information, present for read and write mappings.
+ */
+struct fuse_iext_root {
+	/* bytes in ir_data, or -1 if it has never been used */
+	int64_t ir_bytes;
+	void *ir_data;			/* extent tree root */
+	unsigned int ir_height;		/* height of the extent tree */
+};
+
+struct fuse_iomap_cache {
+	struct fuse_iext_root ic_read;
+	struct fuse_iext_root ic_write;
+	uint64_t ic_seq;		/* validity counter */
+	struct rw_semaphore ic_lock;	/* mapping lock */
+	struct inode *ic_inode;
+};
+
+void fuse_iomap_cache_lock(struct inode *inode);
+void fuse_iomap_cache_unlock(struct inode *inode);
+void fuse_iomap_cache_lock_shared(struct inode *inode);
+void fuse_iomap_cache_unlock_shared(struct inode *inode);
+
+struct fuse_iext_leaf;
+
+struct fuse_iext_cursor {
+	struct fuse_iext_leaf *leaf;
+	int pos;
+};
+
+#define FUSE_IEXT_LEFT_CONTIG	(1u << 0)
+#define FUSE_IEXT_RIGHT_CONTIG	(1u << 1)
+#define FUSE_IEXT_LEFT_FILLING	(1u << 2)
+#define FUSE_IEXT_RIGHT_FILLING	(1u << 3)
+#define FUSE_IEXT_LEFT_VALID	(1u << 4)
+#define FUSE_IEXT_RIGHT_VALID	(1u << 5)
+#define FUSE_IEXT_WRITE_MAPPING	(1u << 6)
+
+bool fuse_iext_get_extent(const struct fuse_iext_root *ir,
+			  const struct fuse_iext_cursor *cur,
+			  struct fuse_iomap_io *gotp);
+
+static inline uint64_t fuse_iext_read_seq(struct fuse_iomap_cache *ic)
+{
+	return (uint64_t)READ_ONCE(ic->ic_seq);
+}
+
+static inline void fuse_iomap_cache_init(struct fuse_inode *fi)
+{
+	fi->cache = NULL;
+}
+
+static inline bool fuse_inode_caches_iomaps(const struct inode *inode)
+{
+	const struct fuse_inode *fi = get_fuse_inode(inode);
+
+	return fi->cache != NULL;
+}
+
+int fuse_iomap_cache_alloc(struct inode *inode);
+void fuse_iomap_cache_free(struct inode *inode);
+
+int fuse_iomap_cache_remove(struct inode *inode, enum fuse_iomap_iodir iodir,
+			    loff_t off, uint64_t len);
+
+int fuse_iomap_cache_upsert(struct inode *inode, enum fuse_iomap_iodir iodir,
+			    const struct fuse_iomap_io *map);
+
+enum fuse_iomap_lookup_result {
+	LOOKUP_HIT,
+	LOOKUP_MISS,
+	LOOKUP_NOFORK,
+};
+
+struct fuse_iomap_lookup {
+	struct fuse_iomap_io map;	/* cached mapping */
+	uint64_t validity_cookie;	/* used with .iomap_valid() */
+};
+
+enum fuse_iomap_lookup_result
+fuse_iomap_cache_lookup(struct inode *inode, enum fuse_iomap_iodir iodir,
+			loff_t off, uint64_t len,
+			struct fuse_iomap_lookup *mval);
+#endif /* CONFIG_FUSE_IOMAP */
+
+#endif /* _FS_FUSE_IOMAP_CACHE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index bf8514a5ee27af..a273838bc20f2f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1394,6 +1394,8 @@ struct fuse_uring_cmd_req {
 /* fuse-specific mapping type indicating that writes use the read mapping */
 #define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(255)
+/* fuse-specific mapping type saying the server has populated the cache */
+#define FUSE_IOMAP_TYPE_RETRY_CACHE	(254)
 #define FUSE_IOMAP_DEV_NULL	(0U)	/* null device cookie */
@@ -1551,4 +1553,7 @@ struct fuse_iomap_dev_inval_out {
 	struct fuse_range range;
 };
+/* invalidate all cached iomap mappings up to EOF */
+#define FUSE_IOMAP_INVAL_TO_EOF	(~0ULL)
+
 #endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 2536bc6a71b898..c672503da7bcbd 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -18,6 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
 fuse-$(CONFIG_FUSE_BACKING) += backing.o
 fuse-$(CONFIG_SYSCTL) += sysctl.o
 fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
-fuse-$(CONFIG_FUSE_IOMAP) += fuse_iomap.o
+fuse-$(CONFIG_FUSE_IOMAP) += fuse_iomap.o fuse_iomap_cache.o
 virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 9278e2e399ba9b..bb47b4c7f7eabc 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -14,6 +14,7 @@
 #include "fuse_iomap.h"
 #include "fuse_iomap_i.h"
 #include "fuse_dev_i.h"
+#include "fuse_iomap_cache.h"
 static bool __read_mostly enable_iomap =
 #if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
@@ -1184,6 +1185,8 @@ void fuse_iomap_evict_inode(struct inode *inode)
 	trace_fuse_iomap_evict_inode(inode);
+	if (fuse_inode_caches_iomaps(inode))
+		fuse_iomap_cache_free(inode);
 	fuse_inode_clear_atomic(inode);
 	fuse_inode_clear_iomap(inode);
 }
@@ -1895,6 +1898,8 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
 	min_order = inode->i_blkbits - PAGE_SHIFT;
 	mapping_set_folio_min_order(inode->i_mapping, min_order);
+
+	fuse_iomap_cache_init(fi);
 	set_bit(FUSE_I_IOMAP, &fi->state);
 }
diff --git a/fs/fuse/fuse_iomap_cache.c b/fs/fuse/fuse_iomap_cache.c
new file mode 100644
index 00000000000000..35384df09b9200
--- /dev/null
+++ b/fs/fuse/fuse_iomap_cache.c
@@ -0,0 +1,1695 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fuse_iext* code adapted from xfs_iext_tree.c:
+ * Copyright (c) 2017 Christoph Hellwig.
+ *
+ * fuse_iomap_cache*lock* code adapted from xfs_inode.c:
+ * Copyright (c) 2000-2006 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * Copyright (C) 2025-2026 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "fuse_i.h"
+#include "fuse_trace.h"
+#include "fuse_iomap_i.h"
+#include "fuse_iomap.h"
+#include "fuse_iomap_cache.h"
+#include <linux/iomap.h>
+
+void fuse_iomap_cache_lock_shared(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_cache *ic = fi->cache;
+
+	down_read(&ic->ic_lock);
+}
+
+void fuse_iomap_cache_unlock_shared(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_cache *ic = fi->cache;
+
+	up_read(&ic->ic_lock);
+}
+
+void fuse_iomap_cache_lock(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_cache *ic = fi->cache;
+
+	down_write(&ic->ic_lock);
+}
+
+void fuse_iomap_cache_unlock(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_cache *ic = fi->cache;
+
+	up_write(&ic->ic_lock);
+}
+
+static inline void assert_cache_locked_shared(struct fuse_iomap_cache *ic)
+{
+	rwsem_assert_held(&ic->ic_lock);
+}
+
+static inline void assert_cache_locked(struct fuse_iomap_cache *ic)
+{
+	rwsem_assert_held_write_nolockdep(&ic->ic_lock);
+}
+
+/*
+ * In-core extent btree block layout:
+ *
+ * There are two types of blocks in the btree: leaf and inner (non-leaf) blocks.
+ *
+ * The leaf blocks are made up by %KEYS_PER_NODE extent records, which each
+ * contain the startoffset, blockcount, startblock and unwritten extent flag.
+ * See above for the exact format, followed by pointers to the previous and next
+ * leaf blocks (if there are any).
+ *
+ * The inner (non-leaf) blocks first contain KEYS_PER_NODE lookup keys, followed
+ * by an equal number of pointers to the btree blocks at the next lower level.
+ *
+ *         +-------+-------+-------+-------+-------+----------+----------+
+ * Leaf:   | rec 1 | rec 2 | rec 3 | rec 4 | rec N | prev-ptr | next-ptr |
+ *         +-------+-------+-------+-------+-------+----------+----------+
+ *
+ *         +-------+-------+-------+-------+-------+-------+------+-------+
+ * Inner:  | key 1 | key 2 | key 3 | key N | ptr 1 | ptr 2 | ptr3 | ptr N |
+ *         +-------+-------+-------+-------+-------+-------+------+-------+
+ */
+typedef uint64_t fuse_iext_key_t;
+#define FUSE_IEXT_KEY_INVALID	(1ULL << 63)
+
+enum {
+	NODE_SIZE	= 256,
+	KEYS_PER_NODE	= NODE_SIZE / (sizeof(fuse_iext_key_t) + sizeof(void *)),
+	RECS_PER_LEAF	= (NODE_SIZE - (2 * sizeof(struct fuse_iext_leaf *))) /
+				sizeof(struct fuse_iomap_io),
+};
+
+/* maximum length of a mapping that we're willing to cache */
+#define FUSE_IOMAP_MAX_LEN	((loff_t)(1ULL << 63))
+
+struct fuse_iext_node {
+	fuse_iext_key_t keys[KEYS_PER_NODE];
+	void *ptrs[KEYS_PER_NODE];
+};
+
+struct fuse_iext_leaf {
+	struct fuse_iomap_io recs[RECS_PER_LEAF];
+	struct fuse_iext_leaf *prev;
+	struct fuse_iext_leaf *next;
+};
+
+static uint32_t
+fuse_iomap_fork_to_state(const struct fuse_iomap_cache *ic,
+			 const struct fuse_iext_root *ir)
+{
+	ASSERT(ir == &ic->ic_write || ir == &ic->ic_read);
+
+	if (ir == &ic->ic_write)
+		return FUSE_IEXT_WRITE_MAPPING;
+	return 0;
+}
+
+/* Convert bmap state flags to an inode fork. */
+static struct fuse_iext_root *
+fuse_iext_state_to_fork(
+	struct fuse_iomap_cache *ic,
+	uint32_t state)
+{
+	if (state & FUSE_IEXT_WRITE_MAPPING)
+		return &ic->ic_write;
+	return &ic->ic_read;
+}
+
+/* The internal iext tree record is a struct fuse_iomap_io */
+
+static inline bool fuse_iext_rec_is_empty(const struct fuse_iomap_io *rec)
+{
+	return rec->length == 0;
+}
+
+static inline void fuse_iext_rec_clear(struct fuse_iomap_io *rec)
+{
+	memset(rec, 0, sizeof(*rec));
+}
+
+static inline void
+fuse_iext_set(
+	struct fuse_iomap_io *rec,
+	const struct fuse_iomap_io *irec)
+{
+	ASSERT(irec->length > 0);
+
+	*rec = *irec;
+}
+
+static inline void
+fuse_iext_get(
+	struct fuse_iomap_io *irec,
+	const struct fuse_iomap_io *rec)
+{
+	*irec = *rec;
+}
+
+static inline uint64_t fuse_iext_count(const struct fuse_iext_root *ir)
+{
+	return ir->ir_bytes / sizeof(struct fuse_iomap_io);
+}
+
+static inline int fuse_iext_max_recs(const struct fuse_iext_root *ir)
+{
+	if (ir->ir_height == 1)
+		return fuse_iext_count(ir);
+	return RECS_PER_LEAF;
+}
+
+static inline struct fuse_iomap_io *cur_rec(const struct fuse_iext_cursor *cur)
+{
+	return &cur->leaf->recs[cur->pos];
+}
+
+static bool fuse_iext_valid(const struct fuse_iext_root *ir,
+			    const struct fuse_iext_cursor *cur)
+{
+	if (!cur->leaf)
+		return false;
+	if (cur->pos < 0 || cur->pos >= fuse_iext_max_recs(ir))
+		return false;
+	if (fuse_iext_rec_is_empty(cur_rec(cur)))
+		return false;
+	return true;
+}
+
+static void *
+fuse_iext_find_first_leaf(
+	struct fuse_iext_root *ir)
+{
+	struct fuse_iext_node *node = ir->ir_data;
+	int height;
+
+	if (!ir->ir_height)
+		return NULL;
+
+	for (height = ir->ir_height; height > 1; height--) {
+		node = node->ptrs[0];
+		ASSERT(node);
+	}
+
+	return node;
+}
+
+static void *
+fuse_iext_find_last_leaf(
+	struct fuse_iext_root *ir)
+{
+	struct fuse_iext_node *node = ir->ir_data;
+	int height, i;
+
+	if (!ir->ir_height)
+		return NULL;
+
+	for (height = ir->ir_height; height > 1; height--) {
+		for (i = 1; i < KEYS_PER_NODE; i++)
+			if (!node->ptrs[i])
+				break;
+		node = node->ptrs[i - 1];
+		ASSERT(node);
+	}
+
+	return node;
+}
+
+static void
+fuse_iext_first(
+	struct fuse_iext_root *ir,
+	struct fuse_iext_cursor *cur)
+{
+	cur->pos = 0;
+	cur->leaf = fuse_iext_find_first_leaf(ir);
+}
+
+static void
+fuse_iext_last(
+	struct fuse_iext_root *ir,
+	struct fuse_iext_cursor *cur)
+{
+	int i;
+
+	cur->leaf = fuse_iext_find_last_leaf(ir);
+	if (!cur->leaf) {
+		cur->pos = 0;
+		return;
+	}
+
+	for (i = 1; i < fuse_iext_max_recs(ir); i++) {
+		if (fuse_iext_rec_is_empty(&cur->leaf->recs[i]))
+			break;
+	}
+	cur->pos = i - 1;
+}
+
+static void
+fuse_iext_next(
+	struct fuse_iext_root *ir,
+	struct fuse_iext_cursor *cur)
+{
+	if (!cur->leaf) {
+		ASSERT(cur->pos <= 0 || cur->pos >= RECS_PER_LEAF);
+		fuse_iext_first(ir, cur);
+		return;
+	}
+
+	ASSERT(cur->pos >= 0);
+	ASSERT(cur->pos < fuse_iext_max_recs(ir));
+
+	cur->pos++;
+	if (ir->ir_height > 1 && !fuse_iext_valid(ir, cur) &&
+	    cur->leaf->next) {
+		cur->leaf = cur->leaf->next;
+		cur->pos = 0;
+	}
+}
+
+static void
+fuse_iext_prev(
+	struct fuse_iext_root *ir,
+	struct fuse_iext_cursor *cur)
+{
+	if (!cur->leaf) {
+		ASSERT(cur->pos <= 0 || cur->pos >= RECS_PER_LEAF);
+		fuse_iext_last(ir, cur);
+		return;
+	}
+
+	ASSERT(cur->pos >= 0);
+	ASSERT(cur->pos <= RECS_PER_LEAF);
+
+recurse:
+	do {
+		cur->pos--;
+		if (fuse_iext_valid(ir, cur))
+			return;
+	} while (cur->pos > 0);
+
+	if (ir->ir_height > 1 && cur->leaf->prev) {
+		cur->leaf = cur->leaf->prev;
+		cur->pos = RECS_PER_LEAF;
+		goto recurse;
+	}
+}
+
+/*
+ * Return true if the cursor points at an extent and return the extent structure
+ * in gotp. Else return false.
+ */
+bool
+fuse_iext_get_extent(
+	const struct fuse_iext_root *ir,
+	const struct fuse_iext_cursor *cur,
+	struct fuse_iomap_io *gotp)
+{
+	if (!fuse_iext_valid(ir, cur))
+		return false;
+	fuse_iext_get(gotp, cur_rec(cur));
+	return true;
+}
+
+static inline bool fuse_iext_next_extent(struct fuse_iext_root *ir,
+		struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+	fuse_iext_next(ir, cur);
+	return fuse_iext_get_extent(ir, cur, gotp);
+}
+
+static inline bool fuse_iext_prev_extent(struct fuse_iext_root *ir,
+		struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+	fuse_iext_prev(ir, cur);
+	return fuse_iext_get_extent(ir, cur, gotp);
+}
+
+/*
+ * Return the extent after cur in gotp without updating the cursor.
+ */
+static inline bool fuse_iext_peek_next_extent(struct fuse_iext_root *ir,
+		struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+	struct fuse_iext_cursor ncur = *cur;
+
+	fuse_iext_next(ir, &ncur);
+	return fuse_iext_get_extent(ir, &ncur, gotp);
+}
+
+/*
+ * Return the extent before cur in gotp without updating the cursor.
+ */
+static inline bool fuse_iext_peek_prev_extent(struct fuse_iext_root *ir,
+		struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+	struct fuse_iext_cursor ncur = *cur;
+
+	fuse_iext_prev(ir, &ncur);
+	return fuse_iext_get_extent(ir, &ncur, gotp);
+}
+
+static inline int
+fuse_iext_key_cmp(
+	struct fuse_iext_node *node,
+	int n,
+	loff_t offset)
+{
+	if (node->keys[n] > offset)
+		return 1;
+	if (node->keys[n] < offset)
+		return -1;
+	return 0;
+}
+
+static inline int
+fuse_iext_rec_cmp(
+	struct fuse_iomap_io *rec,
+	loff_t offset)
+{
+	if (rec->offset > offset)
+		return 1;
+	if (rec->offset + rec->length <= offset)
+		return -1;
+	return 0;
+}
+
+static void *
+fuse_iext_find_level(
+	struct fuse_iext_root *ir,
+	loff_t offset,
+	int level)
+{
+	struct fuse_iext_node *node = ir->ir_data;
+	int height, i;
+
+	if (!ir->ir_height)
+		return NULL;
+
+	for (height = ir->ir_height; height > level; height--) {
+		for (i = 1; i < KEYS_PER_NODE; i++)
+			if (fuse_iext_key_cmp(node, i, offset) > 0)
+				break;
+
+		node = node->ptrs[i - 1];
+		if (!node)
+			break;
+	}
+
+	return node;
+}
+
+static int
+fuse_iext_node_pos(
+	struct fuse_iext_node *node,
+	loff_t offset)
+{
+	int i;
+
+	for (i = 1; i < KEYS_PER_NODE; i++) {
+		if (fuse_iext_key_cmp(node, i, offset) > 0)
+			break;
+	}
+
+	return i - 1;
+}
+
+static int
+fuse_iext_node_insert_pos(
+	struct fuse_iext_node *node,
+	loff_t offset)
+{
+	int i;
+
+	for (i = 0; i < KEYS_PER_NODE; i++) {
+		if (fuse_iext_key_cmp(node, i, offset) > 0)
+			return i;
+	}
+
+	return KEYS_PER_NODE;
+}
+
+static int
+fuse_iext_node_nr_entries(
+	struct fuse_iext_node *node,
+	int start)
+{
+	int i;
+
+	for (i = start; i < KEYS_PER_NODE; i++) {
+		if (node->keys[i] == FUSE_IEXT_KEY_INVALID)
+			break;
+	}
+
+	return i;
+}
+
+static int
+fuse_iext_leaf_nr_entries(
+	struct fuse_iext_root *ir,
+	struct fuse_iext_leaf *leaf,
+	int start)
+{
+	int i;
+
+	for (i = start; i < fuse_iext_max_recs(ir); i++) {
+		if (fuse_iext_rec_is_empty(&leaf->recs[i]))
+			break;
+	}
+
+	return i;
+}
+
+static inline fuse_iext_key_t
+fuse_iext_leaf_key(
+	struct fuse_iext_leaf *leaf,
+	int n)
+{
+	return leaf->recs[n].offset;
+}
+
+static inline void *
+fuse_iext_alloc_node(
+	int size)
+{
+	return kzalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL);
+}
+
+static void
+fuse_iext_grow(
+	struct fuse_iext_root *ir)
+{
+	struct fuse_iext_node *node = fuse_iext_alloc_node(NODE_SIZE);
+	int i;
+
+	if (ir->ir_height == 1) {
+		struct fuse_iext_leaf *prev = ir->ir_data;
+
+		node->keys[0] = fuse_iext_leaf_key(prev, 0);
+		node->ptrs[0] = prev;
+	} else {
+		struct fuse_iext_node *prev = ir->ir_data;
+
+		ASSERT(ir->ir_height > 1);
+
+		node->keys[0] = prev->keys[0];
+		node->ptrs[0] = prev;
+	}
+
+	for (i = 1; i < KEYS_PER_NODE; i++)
+		node->keys[i] = FUSE_IEXT_KEY_INVALID;
+
+	ir->ir_data = node;
+	ir->ir_height++;
+}
+
+static void
+fuse_iext_update_node(
+	struct fuse_iext_root *ir,
+	loff_t old_offset,
+	loff_t new_offset,
+	int level,
+	void *ptr)
+{
+	struct fuse_iext_node *node = ir->ir_data;
+	int height, i;
+
+	for (height = ir->ir_height; height > level; height--) {
+		for (i = 0; i < KEYS_PER_NODE; i++) {
+			if (i > 0 && fuse_iext_key_cmp(node, i, old_offset) > 0)
+				break;
+			if (node->keys[i] == old_offset)
+				node->keys[i] = new_offset;
+		}
+		node = node->ptrs[i - 1];
+		ASSERT(node);
+	}
+
+	ASSERT(node == ptr);
+}
+
+static struct fuse_iext_node *
+fuse_iext_split_node(
+	struct fuse_iext_node **nodep,
+	int *pos,
+	int *nr_entries)
+{
+	struct fuse_iext_node *node = *nodep;
+	struct fuse_iext_node *new = fuse_iext_alloc_node(NODE_SIZE);
+	const int nr_move = KEYS_PER_NODE / 2;
+	int nr_keep = nr_move + (KEYS_PER_NODE & 1);
+	int i = 0;
+
+	/* for sequential append operations just spill over into the new node */
+	if (*pos == KEYS_PER_NODE) {
+		*nodep = new;
+		*pos = 0;
+		*nr_entries = 0;
+		goto done;
+	}
+
+
+	for (i = 0; i < nr_move; i++) {
+		new->keys[i] = node->keys[nr_keep + i];
+		new->ptrs[i] = node->ptrs[nr_keep + i];
+
+		node->keys[nr_keep + i] = FUSE_IEXT_KEY_INVALID;
+		node->ptrs[nr_keep + i] = NULL;
+	}
+
+	if (*pos >= nr_keep) {
+		*nodep = new;
+		*pos -= nr_keep;
+		*nr_entries = nr_move;
+	} else {
+		*nr_entries = nr_keep;
+	}
+done:
+	for (; i < KEYS_PER_NODE; i++)
+		new->keys[i] = FUSE_IEXT_KEY_INVALID;
+	return new;
+}
+
+static void
+fuse_iext_insert_node(
+	struct fuse_iext_root *ir,
+	fuse_iext_key_t offset,
+	void *ptr,
+	int level)
+{
+	struct fuse_iext_node *node, *new;
+	int i, pos, nr_entries;
+
+again:
+	if (ir->ir_height < level)
+		fuse_iext_grow(ir);
+
+	new = NULL;
+	node = fuse_iext_find_level(ir, offset, level);
+	pos = fuse_iext_node_insert_pos(node, offset);
+	nr_entries = fuse_iext_node_nr_entries(node, pos);
+
+	ASSERT(pos >= nr_entries || fuse_iext_key_cmp(node, pos, offset) != 0);
+	ASSERT(nr_entries <= KEYS_PER_NODE);
+
+	if (nr_entries == KEYS_PER_NODE)
+		new = fuse_iext_split_node(&node, &pos, &nr_entries);
+
+	/*
+	 * Update the pointers in higher levels if the first entry changes
+	 * in an existing node.
+	 */
+	if (node != new && pos == 0 && nr_entries > 0)
+		fuse_iext_update_node(ir, node->keys[0], offset, level, node);
+
+	for (i = nr_entries; i > pos; i--) {
+		node->keys[i] = node->keys[i - 1];
+		node->ptrs[i] = node->ptrs[i - 1];
+	}
+	node->keys[pos] = offset;
+	node->ptrs[pos] = ptr;
+
+	if (new) {
+		offset = new->keys[0];
+		ptr = new;
+		level++;
+		goto again;
+	}
+}
+
+static struct fuse_iext_leaf *
+fuse_iext_split_leaf(
+	struct fuse_iext_cursor *cur,
+	int *nr_entries)
+{
+	struct fuse_iext_leaf *leaf = cur->leaf;
+	struct fuse_iext_leaf *new = fuse_iext_alloc_node(NODE_SIZE);
+	const int nr_move = RECS_PER_LEAF / 2;
+	int nr_keep = nr_move + (RECS_PER_LEAF & 1);
+	int i;
+
+	/* for sequential append operations just spill over into the new node */
+	if (cur->pos == RECS_PER_LEAF) {
+		cur->leaf = new;
+		cur->pos = 0;
+		*nr_entries = 0;
+		goto done;
+	}
+
+	for (i = 0; i < nr_move; i++) {
+		new->recs[i] = leaf->recs[nr_keep + i];
+		fuse_iext_rec_clear(&leaf->recs[nr_keep + i]);
+	}
+
+	if (cur->pos >= nr_keep) {
+		cur->leaf = new;
+		cur->pos -= nr_keep;
+		*nr_entries = nr_move;
+	} else {
+		*nr_entries = nr_keep;
+	}
+done:
+	if (leaf->next)
+		leaf->next->prev = new;
+	new->next = leaf->next;
+	new->prev = leaf;
+	leaf->next = new;
+	return new;
+}
+
+static void
+fuse_iext_alloc_root(
+	struct fuse_iext_root *ir,
+	struct fuse_iext_cursor *cur)
+{
+	ASSERT(ir->ir_bytes == 0);
+
+	ir->ir_data = fuse_iext_alloc_node(sizeof(struct fuse_iomap_io));
+	ir->ir_height = 1;
+
+	/* now that we have a node step into it */
+	cur->leaf = ir->ir_data;
+	cur->pos = 0;
+}
+
+static void
+fuse_iext_realloc_root(
+	struct fuse_iext_root *ir,
+	struct fuse_iext_cursor *cur)
+{
+	int64_t new_size = ir->ir_bytes + sizeof(struct fuse_iomap_io);
+	void *new;
+
+	/* account for the prev/next pointers */
+	if (new_size / sizeof(struct fuse_iomap_io) == RECS_PER_LEAF)
+		new_size = NODE_SIZE;
+
+	new = krealloc(ir->ir_data, new_size,
+			GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL);
+	memset(new + ir->ir_bytes, 0, new_size - ir->ir_bytes);
+	ir->ir_data = new;
+	cur->leaf = new;
+}
+
+/*
+ * Increment the sequence counter on extent tree changes. We use WRITE_ONCE
+ * here to ensure the update to the sequence counter is seen before the
+ * modifications to the extent tree itself take effect.
+ */
+static inline void fuse_iext_inc_seq(struct fuse_iomap_cache *ic)
+{
+	WRITE_ONCE(ic->ic_seq, READ_ONCE(ic->ic_seq) + 1);
+}
+
+static void
+fuse_iext_insert_raw(
+	struct fuse_iomap_cache *ic,
+	struct fuse_iext_root *ir,
+	struct fuse_iext_cursor *cur,
+	const struct fuse_iomap_io *irec)
+{
+	loff_t offset = irec->offset;
+	struct fuse_iext_leaf *new = NULL;
+	int nr_entries, i;
+
+	fuse_iext_inc_seq(ic);
+
+	if (ir->ir_height == 0)
+		fuse_iext_alloc_root(ir, cur);
+	else if (ir->ir_height == 1)
+		fuse_iext_realloc_root(ir, cur);
+
+	nr_entries = fuse_iext_leaf_nr_entries(ir, cur->leaf, cur->pos);
+	ASSERT(nr_entries <= RECS_PER_LEAF);
+	ASSERT(cur->pos >= nr_entries ||
+	       fuse_iext_rec_cmp(cur_rec(cur), irec->offset) != 0);
+
+	if (nr_entries == RECS_PER_LEAF)
+		new = fuse_iext_split_leaf(cur, &nr_entries);
+
+	/*
+	 * Update the pointers in higher levels if the first entry changes
+	 * in an existing node.
+	 */
+	if (cur->leaf != new && cur->pos == 0 && nr_entries > 0) {
+		fuse_iext_update_node(ir, fuse_iext_leaf_key(cur->leaf, 0),
+				offset, 1, cur->leaf);
+	}
+
+	for (i = nr_entries; i > cur->pos; i--)
+		cur->leaf->recs[i] = cur->leaf->recs[i - 1];
+	fuse_iext_set(cur_rec(cur), irec);
+	ir->ir_bytes += sizeof(struct fuse_iomap_io);
+
+	if (new)
+		fuse_iext_insert_node(ir, fuse_iext_leaf_key(new, 0), new, 2);
+}
+
+static void
+fuse_iext_insert(
+	struct fuse_iomap_cache *ic,
+	struct fuse_iext_cursor *cur,
+	const struct fuse_iomap_io *irec,
+	uint32_t state)
+{
+	struct fuse_iext_root *ir = fuse_iext_state_to_fork(ic, state);
+
+	fuse_iext_insert_raw(ic, ir, cur, irec);
+}
+
+static struct fuse_iext_node *
+fuse_iext_rebalance_node(
+	struct fuse_iext_node *parent,
+	int *pos,
+	struct fuse_iext_node *node,
+	int nr_entries)
+{
+	/*
+	 * If the neighbouring nodes are completely full, or have different
+	 * parents, we might never be able to merge our node, and will only
+	 * delete it once the number of entries hits zero.
+	 */
+	if (nr_entries == 0)
+		return node;
+
+	if (*pos > 0) {
+		struct fuse_iext_node *prev = parent->ptrs[*pos - 1];
+		int nr_prev = fuse_iext_node_nr_entries(prev, 0), i;
+
+		if (nr_prev + nr_entries <= KEYS_PER_NODE) {
+			for (i = 0; i < nr_entries; i++) {
+				prev->keys[nr_prev + i] = node->keys[i];
+				prev->ptrs[nr_prev + i] = node->ptrs[i];
+			}
+			return node;
+		}
+	}
+
+	if (*pos + 1 < fuse_iext_node_nr_entries(parent, *pos)) {
+		struct fuse_iext_node *next = parent->ptrs[*pos + 1];
+		int nr_next = fuse_iext_node_nr_entries(next, 0), i;
+
+		if (nr_entries + nr_next <= KEYS_PER_NODE) {
+			/*
+			 * Merge the next node into this node so that we don't
+			 * have to do an additional update of the keys in the
+			 * higher levels.
+			 */
+			for (i = 0; i < nr_next; i++) {
+				node->keys[nr_entries + i] = next->keys[i];
+				node->ptrs[nr_entries + i] = next->ptrs[i];
+			}
+
+			++*pos;
+			return next;
+		}
+	}
+
+	return NULL;
+}
+
+static void
+fuse_iext_remove_node(
+	struct fuse_iext_root *ir,
+	loff_t offset,
+	void *victim)
+{
+	struct fuse_iext_node *node, *parent;
+	int level = 2, pos, nr_entries, i;
+
+	ASSERT(level <= ir->ir_height);
+	node = fuse_iext_find_level(ir, offset, level);
+	pos = fuse_iext_node_pos(node, offset);
+again:
+	ASSERT(node->ptrs[pos]);
+	ASSERT(node->ptrs[pos] == victim);
+	kfree(victim);
+
+	nr_entries = fuse_iext_node_nr_entries(node, pos) - 1;
+	offset = node->keys[0];
+	for (i = pos; i < nr_entries; i++) {
+		node->keys[i] = node->keys[i + 1];
+		node->ptrs[i] = node->ptrs[i + 1];
+	}
+	node->keys[nr_entries] = FUSE_IEXT_KEY_INVALID;
+	node->ptrs[nr_entries] = NULL;
+
+	if (pos == 0 && nr_entries > 0) {
+		fuse_iext_update_node(ir, offset, node->keys[0], level, node);
+		offset = node->keys[0];
+	}
+
+	if (nr_entries >= KEYS_PER_NODE / 2)
+		return;
+
+	if (level < ir->ir_height) {
+		/*
+		 * If we aren't at the root yet try to find a neighbour node to
+		 * merge with (or delete the node if it is empty), and then
+		 * recurse up to the next level.
+		 */
+		level++;
+		parent = fuse_iext_find_level(ir, offset, level);
+		pos = fuse_iext_node_pos(parent, offset);
+
+		ASSERT(pos != KEYS_PER_NODE);
+		ASSERT(parent->ptrs[pos] == node);
+
+		node = fuse_iext_rebalance_node(parent, &pos, node, nr_entries);
+		if (node) {
+			victim = node;
+			node = parent;
+			goto again;
+		}
+	} else if (nr_entries == 1) {
+		/*
+		 * If we are at the root and only one entry is left we can just
+		 * free this node and update the root pointer.
+		 */
+		ASSERT(node == ir->ir_data);
+		ir->ir_data = node->ptrs[0];
+		ir->ir_height--;
+		kfree(node);
+	}
+}
+
+static void
+fuse_iext_rebalance_leaf(
+	struct fuse_iext_root *ir,
+	struct fuse_iext_cursor *cur,
+	struct fuse_iext_leaf *leaf,
+	loff_t offset,
+	int nr_entries)
+{
+	/*
+	 * If the neighbouring nodes are completely full we might never be able
+	 * to merge our node, and will only delete it once the number of
+	 * entries hits zero.
+	 */
+	if (nr_entries == 0)
+		goto remove_node;
+
+	if (leaf->prev) {
+		int nr_prev = fuse_iext_leaf_nr_entries(ir, leaf->prev, 0), i;
+
+		if (nr_prev + nr_entries <= RECS_PER_LEAF) {
+			for (i = 0; i < nr_entries; i++)
+				leaf->prev->recs[nr_prev + i] = leaf->recs[i];
+
+			if (cur->leaf == leaf) {
+				cur->leaf = leaf->prev;
+				cur->pos += nr_prev;
+			}
+			goto remove_node;
+		}
+	}
+
+	if (leaf->next) {
+		int nr_next = fuse_iext_leaf_nr_entries(ir, leaf->next, 0), i;
+
+		if (nr_entries + nr_next <= RECS_PER_LEAF) {
+			/*
+			 * Merge the next node into this node so that we don't
+			 * have to do an additional update of the keys in the
+			 * higher levels.
+			 */
+			for (i = 0; i < nr_next; i++) {
+				leaf->recs[nr_entries + i] =
+					leaf->next->recs[i];
+			}
+
+			if (cur->leaf == leaf->next) {
+				cur->leaf = leaf;
+				cur->pos += nr_entries;
+			}
+
+			offset = fuse_iext_leaf_key(leaf->next, 0);
+			leaf = leaf->next;
+			goto remove_node;
+		}
+	}
+
+	return;
+remove_node:
+	if (leaf->prev)
+		leaf->prev->next = leaf->next;
+	if (leaf->next)
+		leaf->next->prev = leaf->prev;
+	fuse_iext_remove_node(ir, offset, leaf);
+}
+
+static void
+fuse_iext_free_last_leaf(
+	struct fuse_iext_root *ir)
+{
+	ir->ir_height--;
+	kfree(ir->ir_data);
+	ir->ir_data = NULL;
+}
+
+static void
+fuse_iext_remove(
+	struct fuse_iomap_cache *ic,
+	struct fuse_iext_cursor *cur,
+	uint32_t state)
+{
+	struct fuse_iext_root *ir = fuse_iext_state_to_fork(ic, state);
+	struct fuse_iext_leaf *leaf = cur->leaf;
+	loff_t offset = fuse_iext_leaf_key(leaf, 0);
+	int i, nr_entries;
+
+	ASSERT(ir->ir_height > 0);
+	ASSERT(ir->ir_data != NULL);
+	ASSERT(fuse_iext_valid(ir, cur));
+
+	fuse_iext_inc_seq(ic);
+
+	nr_entries = fuse_iext_leaf_nr_entries(ir, leaf, cur->pos) - 1;
+	for (i = cur->pos; i < nr_entries; i++)
+		leaf->recs[i] = leaf->recs[i + 1];
+	fuse_iext_rec_clear(&leaf->recs[nr_entries]);
+	ir->ir_bytes -= sizeof(struct fuse_iomap_io);
+
+	if (cur->pos == 0 && nr_entries > 0) {
+		fuse_iext_update_node(ir, offset, fuse_iext_leaf_key(leaf, 0), 1,
+				leaf);
+		offset = fuse_iext_leaf_key(leaf, 0);
+	} else if (cur->pos == nr_entries) {
+		if (ir->ir_height > 1 && leaf->next)
+			cur->leaf = leaf->next;
+		else
+			cur->leaf = NULL;
+		cur->pos = 0;
+	}
+
+	if (nr_entries >= RECS_PER_LEAF / 2)
+		return;
+
+	if (ir->ir_height > 1)
+		fuse_iext_rebalance_leaf(ir, cur, leaf, offset, nr_entries);
+	else if (nr_entries == 0)
+		fuse_iext_free_last_leaf(ir);
+}
+
+/*
+ * Lookup the extent covering offset.
+ *
+ * If there is an extent covering offset return the extent index, and store the
+ * expanded extent structure in *gotp, and the extent cursor in *cur.
+ * If there is no extent covering offset, but there is an extent after it (e.g.
+ * it lies in a hole) return that extent in *gotp and its cursor in *cur
+ * instead.
+ * If offset is beyond the last extent return false, and return an invalid
+ * cursor value.
+ */
+static bool
+fuse_iext_lookup_extent(
+	struct fuse_iomap_cache *ic,
+	struct fuse_iext_root *ir,
+	loff_t offset,
+	struct fuse_iext_cursor *cur,
+	struct fuse_iomap_io *gotp)
+{
+	cur->leaf = fuse_iext_find_level(ir, offset, 1);
+	if (!cur->leaf) {
+		cur->pos = 0;
+		return false;
+	}
+
+	for (cur->pos = 0; cur->pos < fuse_iext_max_recs(ir); cur->pos++) {
+		struct fuse_iomap_io *rec = cur_rec(cur);
+
+		if (fuse_iext_rec_is_empty(rec))
+			break;
+		if (fuse_iext_rec_cmp(rec, offset) >= 0)
+			goto found;
+	}
+
+	/* Try looking in the next node for an entry > offset */
+	if (ir->ir_height == 1 || !cur->leaf->next)
+		return false;
+	cur->leaf = cur->leaf->next;
+	cur->pos = 0;
+	if (!fuse_iext_valid(ir, cur))
+		return false;
+found:
+	fuse_iext_get(gotp, cur_rec(cur));
+	return true;
+}
+
+/*
+ * Returns the last extent before end, and if this extent doesn't cover
+ * end, update end to the end of the extent.
+ */
+static bool
+fuse_iext_lookup_extent_before(
+	struct fuse_iomap_cache *ic,
+	struct fuse_iext_root *ir,
+	loff_t *end,
+	struct fuse_iext_cursor *cur,
+	struct fuse_iomap_io *gotp)
+{
+	/* could be optimized to not even look up the next on a match.. */
+	if (fuse_iext_lookup_extent(ic, ir, *end - 1, cur, gotp) &&
+	    gotp->offset <= *end - 1)
+		return true;
+	if (!fuse_iext_prev_extent(ir, cur, gotp))
+		return false;
+	*end = gotp->offset + gotp->length;
+	return true;
+}
+
+static void
+fuse_iext_update_extent(
+	struct fuse_iomap_cache *ic,
+	uint32_t state,
+	struct fuse_iext_cursor *cur,
+	struct fuse_iomap_io *new)
+{
+	struct fuse_iext_root *ir = fuse_iext_state_to_fork(ic, state);
+
+	fuse_iext_inc_seq(ic);
+
+	if (cur->pos == 0) {
+		struct fuse_iomap_io old;
+
+		fuse_iext_get(&old, cur_rec(cur));
+		if (new->offset != old.offset) {
+			fuse_iext_update_node(ir, old.offset,
+					new->offset, 1, cur->leaf);
+		}
+	}
+
+	fuse_iext_set(cur_rec(cur), new);
+}
+
+/*
+ * This is a recursive function, because of that we need to be extremely
+ * careful with stack usage.
+ */
+static void
+fuse_iext_destroy_node(
+	struct fuse_iext_node *node,
+	int level)
+{
+	int i;
+
+	if (level > 1) {
+		for (i = 0; i < KEYS_PER_NODE; i++) {
+			if (node->keys[i] == FUSE_IEXT_KEY_INVALID)
+				break;
+			fuse_iext_destroy_node(node->ptrs[i], level - 1);
+		}
+	}
+
+	kfree(node);
+}
+
+static void
+fuse_iext_destroy(
+	struct fuse_iext_root *ir)
+{
+	fuse_iext_destroy_node(ir->ir_data, ir->ir_height);
+
+	ir->ir_bytes = 0;
+	ir->ir_height = 0;
+	ir->ir_data = NULL;
+}
+
+static inline struct fuse_iext_root *
+fuse_iext_root_ptr(
+	struct fuse_iomap_cache *ic,
+	enum fuse_iomap_iodir iodir)
+{
+	switch (iodir) {
+	case READ_MAPPING:
+		return &ic->ic_read;
+	case WRITE_MAPPING:
+		return &ic->ic_write;
+	default:
+		ASSERT(0);
+		return NULL;
+	}
+}
+
+static inline bool fuse_iomap_addrs_adjacent(const struct fuse_iomap_io *left,
+					     const struct fuse_iomap_io *right)
+{
+	switch (left->type) {
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		return left->addr + left->length == right->addr;
+	default:
+		return left->addr == FUSE_IOMAP_NULL_ADDR &&
+		       right->addr == FUSE_IOMAP_NULL_ADDR;
+	}
+}
+
+static inline bool fuse_iomap_can_merge(const struct fuse_iomap_io *left,
+					const struct fuse_iomap_io *right)
+{
+	return (left->dev == right->dev &&
+		left->offset + left->length == right->offset &&
+		left->type == right->type &&
+		fuse_iomap_addrs_adjacent(left, right) &&
+		left->flags == right->flags &&
+		left->length + right->length <= FUSE_IOMAP_MAX_LEN);
+}
+
+static inline bool fuse_iomap_can_merge3(const struct fuse_iomap_io *left,
+					 const struct fuse_iomap_io *new,
+					 const struct fuse_iomap_io *right)
+{
+	return left->length + new->length + right->length <= FUSE_IOMAP_MAX_LEN;
+}
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static void fuse_iext_check_mappings(struct fuse_iomap_cache *ic,
+				     struct fuse_iext_root *ir)
+{
+	struct fuse_iext_cursor icur;
+	struct fuse_iomap_io prev, got;
+	struct inode *inode = ic->ic_inode;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	unsigned long long nr = 0;
+
+	if (ir->ir_bytes < 0 || !static_branch_unlikely(&fuse_iomap_debug))
+		return;
+
+	fuse_iext_first(ir, &icur);
+	if (!fuse_iext_get_extent(ir, &icur, &prev))
+		return;
+	nr++;
+
+	fuse_iext_next(ir, &icur);
+	while (fuse_iext_get_extent(ir, &icur, &got)) {
+		if (got.length == 0 ||
+		    got.offset < prev.offset + prev.length ||
+		    fuse_iomap_can_merge(&prev, &got)) {
+			printk(KERN_ERR "FUSE IOMAP CORRUPTION ino=%llu nr=%llu",
+			       fi->orig_ino, nr);
+			printk(KERN_ERR "prev: offset=%llu length=%llu type=%u flags=0x%x dev=%u addr=%llu\n",
+			       prev.offset, prev.length, prev.type, prev.flags,
+			       prev.dev, prev.addr);
+			printk(KERN_ERR "curr: offset=%llu length=%llu type=%u flags=0x%x dev=%u addr=%llu\n",
+			       got.offset, got.length, got.type, got.flags,
+			       got.dev, got.addr);
+		}
+
+		prev = got;
+		nr++;
+		fuse_iext_next(ir, &icur);
+	}
+}
+#else
+# define fuse_iext_check_mappings(...)	((void)0)
+#endif
+
+static void
+fuse_iext_del_mapping(
+	struct fuse_iomap_cache *ic,
+	struct fuse_iext_root *ir,
+	struct fuse_iext_cursor *icur,
+	struct fuse_iomap_io *got,	/* current extent entry */
+	struct fuse_iomap_io *del)	/* data to remove from extents */
+{
+	struct fuse_iomap_io new; /* new record to be inserted */
+	/* first addr (fsblock aligned) past del */
+	fuse_iext_key_t del_endaddr;
+	/* first offset (fsblock aligned) past del */
+	fuse_iext_key_t del_endoff = del->offset + del->length;
+	/* first offset (fsblock aligned) past got */
+	fuse_iext_key_t got_endoff = got->offset + got->length;
+	uint32_t state = fuse_iomap_fork_to_state(ic, ir);
+
+	ASSERT(del->length > 0);
+	ASSERT(got->offset <= del->offset);
+	ASSERT(got_endoff >= del_endoff);
+
+	switch (del->type) {
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		del_endaddr = del->addr + del->length;
+		break;
+	default:
+		del_endaddr = FUSE_IOMAP_NULL_ADDR;
+		break;
+	}
+
+	if (got->offset == del->offset)
+		state |= FUSE_IEXT_LEFT_FILLING;
+	if (got_endoff == del_endoff)
+		state |= FUSE_IEXT_RIGHT_FILLING;
+
+	switch (state & (FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING)) {
+	case FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING:
+		/*
+		 * Matches the whole extent. Delete the entry.
+		 */
+		fuse_iext_remove(ic, icur, state);
+		fuse_iext_prev(ir, icur);
+		break;
+	case FUSE_IEXT_LEFT_FILLING:
+		/*
+		 * Deleting the first part of the extent.
+		 */
+		got->offset = del_endoff;
+		got->addr = del_endaddr;
+		got->length -= del->length;
+		fuse_iext_update_extent(ic, state, icur, got);
+		break;
+	case FUSE_IEXT_RIGHT_FILLING:
+		/*
+		 * Deleting the last part of the extent.
+		 */
+		got->length -= del->length;
+		fuse_iext_update_extent(ic, state, icur, got);
+		break;
+	case 0:
+		/*
+		 * Deleting the middle of the extent.
+ */ + got->length = del->offset - got->offset; + fuse_iext_update_extent(ic, state, icur, got); + + new.offset = del_endoff; + new.length = got_endoff - del_endoff; + new.type = got->type; + new.flags = got->flags; + new.addr = del_endaddr; + new.dev = got->dev; + + fuse_iext_next(ir, icur); + fuse_iext_insert(ic, icur, &new, state); + break; + } +} + +int +fuse_iomap_cache_remove( + struct inode *inode, + enum fuse_iomap_iodir iodir, + loff_t start, /* first file offset deleted */ + uint64_t len) /* length to unmap */ +{ + struct fuse_iext_cursor icur; + struct fuse_iomap_io got; /* current extent record */ + struct fuse_iomap_io del; /* extent being deleted */ + loff_t end; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_iomap_cache *ic = fi->cache; + struct fuse_iext_root *ir = fuse_iext_root_ptr(ic, iodir); + bool wasreal; + bool done = false; + int ret = 0; + + assert_cache_locked(ic); + + /* Fork is not active or has zero mappings */ + if (ir->ir_bytes < 0 || fuse_iext_count(ir) == 0) + return 0; + + /* Fast shortcut if the caller wants to erase everything */ + if (start == 0 && len >= inode->i_sb->s_maxbytes) { + fuse_iext_destroy(ir); + return 0; + } + + if (!len) + goto out; + + /* + * If the caller wants us to remove everything to EOF, we set the end + * of the removal range to the maximum file offset. We don't support + * unsigned file offsets. + */ + if (len == FUSE_IOMAP_INVAL_TO_EOF) { + const unsigned int blocksize = i_blocksize(&fi->inode); + + len = round_up(inode->i_sb->s_maxbytes, blocksize) - start; + } + + /* + * Now that we've settled len, look up the extent before the end of the + * range. + */ + end = start + len; + if (!fuse_iext_lookup_extent_before(ic, ir, &end, &icur, &got)) + goto out; + end--; + + while (end != -1 && end >= start) { + /* + * Is the found extent after a hole in which end lives? + * Just back up to the previous extent, if so. 
+ */ + if (got.offset > end && + !fuse_iext_prev_extent(ir, &icur, &got)) { + done = true; + break; + } + /* + * Is the last block of this extent before the range + * we're supposed to delete? If so, we're done. + */ + end = min_t(loff_t, end, got.offset + got.length - 1); + if (end < start) + break; + /* + * Then deal with the (possibly delayed) allocated space + * we found. + */ + del = got; + switch (del.type) { + case FUSE_IOMAP_TYPE_DELALLOC: + case FUSE_IOMAP_TYPE_HOLE: + case FUSE_IOMAP_TYPE_INLINE: + case FUSE_IOMAP_TYPE_PURE_OVERWRITE: + wasreal = false; + break; + case FUSE_IOMAP_TYPE_MAPPED: + case FUSE_IOMAP_TYPE_UNWRITTEN: + wasreal = true; + break; + default: + ASSERT(0); + ret = -EFSCORRUPTED; + goto out; + } + + if (got.offset < start) { + del.offset = start; + del.length -= start - got.offset; + if (wasreal) + del.addr += start - got.offset; + } + if (del.offset + del.length > end + 1) + del.length = end + 1 - del.offset; + + fuse_iext_del_mapping(ic, ir, &icur, &got, &del); + end = del.offset - 1; + + /* + * If not done go on to the next (previous) record. + */ + if (end != -1 && end >= start) { + if (!fuse_iext_get_extent(ir, &icur, &got) || + (got.offset > end && + !fuse_iext_prev_extent(ir, &icur, &got))) { + done = true; + break; + } + } + } + + /* Should have removed everything */ + if (len == 0 || done || end == (loff_t)-1 || end < start) + ret = 0; + else + ret = -EFSCORRUPTED; + +out: + fuse_iext_check_mappings(ic, ir); + return ret; +} + +static void +fuse_iext_add_mapping( + struct fuse_iomap_cache *ic, + struct fuse_iext_root *ir, + struct fuse_iext_cursor *icur, + const struct fuse_iomap_io *new) /* new extent entry */ +{ + struct fuse_iomap_io left; /* left neighbor extent entry */ + struct fuse_iomap_io right; /* right neighbor extent entry */ + uint32_t state = fuse_iomap_fork_to_state(ic, ir); + + /* + * Check and set flags if this segment has a left neighbor. 
+ */ + if (fuse_iext_peek_prev_extent(ir, icur, &left)) + state |= FUSE_IEXT_LEFT_VALID; + + /* + * Check and set flags if this segment has a current value. + * Not true if we're inserting into the "hole" at eof. + */ + if (fuse_iext_get_extent(ir, icur, &right)) + state |= FUSE_IEXT_RIGHT_VALID; + + /* + * We're inserting a real allocation between "left" and "right". + * Set the contiguity flags. Don't let extents get too large. + */ + if ((state & FUSE_IEXT_LEFT_VALID) && fuse_iomap_can_merge(&left, new)) + state |= FUSE_IEXT_LEFT_CONTIG; + + if ((state & FUSE_IEXT_RIGHT_VALID) && + fuse_iomap_can_merge(new, &right) && + (!(state & FUSE_IEXT_LEFT_CONTIG) || + fuse_iomap_can_merge3(&left, new, &right))) + state |= FUSE_IEXT_RIGHT_CONTIG; + + /* + * Select which case we're in here, and implement it. + */ + switch (state & (FUSE_IEXT_LEFT_CONTIG | FUSE_IEXT_RIGHT_CONTIG)) { + case FUSE_IEXT_LEFT_CONTIG | FUSE_IEXT_RIGHT_CONTIG: + /* + * New allocation is contiguous with real allocations on the + * left and on the right. + * Merge all three into a single extent record. + */ + left.length += new->length + right.length; + + fuse_iext_remove(ic, icur, state); + fuse_iext_prev(ir, icur); + fuse_iext_update_extent(ic, state, icur, &left); + break; + + case FUSE_IEXT_LEFT_CONTIG: + /* + * New allocation is contiguous with a real allocation + * on the left. + * Merge the new allocation with the left neighbor. + */ + left.length += new->length; + + fuse_iext_prev(ir, icur); + fuse_iext_update_extent(ic, state, icur, &left); + break; + + case FUSE_IEXT_RIGHT_CONTIG: + /* + * New allocation is contiguous with a real allocation + * on the right. + * Merge the new allocation with the right neighbor. + */ + right.offset = new->offset; + right.addr = new->addr; + right.length += new->length; + fuse_iext_update_extent(ic, state, icur, &right); + break; + + case 0: + /* + * New allocation is not contiguous with another + * real allocation. + * Insert a new entry. 
+ */ + fuse_iext_insert(ic, icur, new, state); + break; + } +} + +static int +fuse_iomap_cache_add( + struct inode *inode, + enum fuse_iomap_iodir iodir, + const struct fuse_iomap_io *new) +{ + struct fuse_iext_cursor icur; + struct fuse_iomap_io got; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_iomap_cache *ic = fi->cache; + struct fuse_iext_root *ir = fuse_iext_root_ptr(ic, iodir); + + assert_cache_locked(ic); + ASSERT(new->length > 0); + ASSERT(new->offset < inode->i_sb->s_maxbytes); + + /* Mark this fork as being in use */ + if (ir->ir_bytes < 0) + ir->ir_bytes = 0; + + if (fuse_iext_lookup_extent(ic, ir, new->offset, &icur, &got)) { + /* make sure we only add into a hole. */ + ASSERT(got.offset > new->offset); + ASSERT(got.offset - new->offset >= new->length); + + if (got.offset <= new->offset || + got.offset - new->offset < new->length) + return -EFSCORRUPTED; + } + + fuse_iext_add_mapping(ic, ir, &icur, new); + fuse_iext_check_mappings(ic, ir); + return 0; +} + +int fuse_iomap_cache_alloc(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_iomap_cache *old = NULL; + struct fuse_iomap_cache *ic; + + ic = kzalloc_obj(struct fuse_iomap_cache); + if (!ic) + return -ENOMEM; + + /* Only the write mapping cache can return NOFORK */ + ic->ic_write.ir_bytes = -1; + ic->ic_inode = inode; + init_rwsem(&ic->ic_lock); + + if (!try_cmpxchg(&fi->cache, &old, ic)) { + /* Someone created mapping cache before us? Free ours... */ + kfree(ic); + } + + return 0; +} + +static void fuse_iomap_cache_purge(struct fuse_iomap_cache *ic) +{ + fuse_iext_destroy(&ic->ic_read); + fuse_iext_destroy(&ic->ic_write); +} + +void fuse_iomap_cache_free(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_iomap_cache *ic = fi->cache; + + /* + * This is only called from eviction, so we cannot be racing to set or + * clear the pointer. 
+ */ + fi->cache = NULL; + + fuse_iomap_cache_purge(ic); + kfree(ic); +} + +int +fuse_iomap_cache_upsert( + struct inode *inode, + enum fuse_iomap_iodir iodir, + const struct fuse_iomap_io *map) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_iomap_cache *ic = fi->cache; + int err; + + ASSERT(fuse_inode_caches_iomaps(inode)); + + /* + * We interpret no write fork to mean that all writes are pure + * overwrites. Avoid wasting memory if we're trying to upsert a + * pure overwrite. + */ + if (iodir == WRITE_MAPPING && + map->type == FUSE_IOMAP_TYPE_PURE_OVERWRITE && + ic->ic_write.ir_bytes < 0) + return 0; + + err = fuse_iomap_cache_remove(inode, iodir, map->offset, map->length); + if (err) + return err; + + return fuse_iomap_cache_add(inode, iodir, map); +} + +/* + * Trim the returned map to the required bounds + */ +static void +fuse_iomap_trim( + struct fuse_inode *fi, + struct fuse_iomap_lookup *mval, + const struct fuse_iomap_io *got, + loff_t off, + loff_t len) +{ + struct fuse_iomap_cache *ic = fi->cache; + + switch (got->type) { + case FUSE_IOMAP_TYPE_MAPPED: + case FUSE_IOMAP_TYPE_UNWRITTEN: + mval->map.addr = got->addr; + break; + default: + mval->map.addr = FUSE_IOMAP_NULL_ADDR; + break; + } + mval->map.offset = got->offset; + mval->map.length = got->length; + + mval->map.type = got->type; + mval->map.flags = got->flags; + mval->map.dev = got->dev; + mval->validity_cookie = fuse_iext_read_seq(ic); +} + +enum fuse_iomap_lookup_result +fuse_iomap_cache_lookup( + struct inode *inode, + enum fuse_iomap_iodir iodir, + loff_t off, + uint64_t len, + struct fuse_iomap_lookup *mval) +{ + struct fuse_iomap_io got; + struct fuse_iext_cursor icur; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_iomap_cache *ic = fi->cache; + struct fuse_iext_root *ir = fuse_iext_root_ptr(ic, iodir); + + assert_cache_locked_shared(ic); + + if (ir->ir_bytes < 0) { + /* + * No write fork at all means this filesystem doesn't do out of + * place writes. 
+ */ + return LOOKUP_NOFORK; + } + + if (!fuse_iext_lookup_extent(ic, ir, off, &icur, &got)) { + /* + * Does not contain a mapping at or beyond off, which is a + * cache miss. + */ + return LOOKUP_MISS; + } + + if (got.offset > off) { + /* + * Found a mapping, but it doesn't cover the start of the + * range, which is effectively a miss. + */ + return LOOKUP_MISS; + } + + /* Found a mapping in the cache, return it */ + fuse_iomap_trim(fi, mval, &got, off, len); + return LOOKUP_HIT; +} diff --git a/fs/fuse/trace.c b/fs/fuse/trace.c index 71d444ac1e5021..69310d6f773ffa 100644 --- a/fs/fuse/trace.c +++ b/fs/fuse/trace.c @@ -8,6 +8,7 @@ #include "fuse_dev_i.h" #include "fuse_iomap.h" #include "fuse_iomap_i.h" +#include "fuse_iomap_cache.h" #include <linux/pagemap.h> #include <linux/iomap.h> ^ permalink raw reply related [flat|nested] 191+ messages in thread
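For readers following the four cases in fuse_iext_del_mapping() above, here is a minimal userspace sketch of the same bookkeeping. It is not the patch's code: struct demo_extent and demo_punch() are hypothetical names, the extent is assumed to be a mapped one with a real device address, and there is no btree or cursor, only the arithmetic that decides which pieces of the original extent survive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct fuse_iomap_io (mapped type). */
struct demo_extent {
	uint64_t offset;	/* file offset in bytes */
	uint64_t length;	/* length in bytes */
	uint64_t addr;		/* device address in bytes */
};

/*
 * Remove [del_off, del_off + del_len) from *got; the range must lie entirely
 * inside *got.  Survivors are written to out[] in file-offset order, and the
 * return value is how many there are (0, 1 or 2).
 */
static int demo_punch(const struct demo_extent *got, uint64_t del_off,
		      uint64_t del_len, struct demo_extent out[2])
{
	uint64_t got_end = got->offset + got->length;
	uint64_t del_end = del_off + del_len;
	bool left_filling = (got->offset == del_off);
	bool right_filling = (got_end == del_end);
	int n = 0;

	assert(del_len > 0);
	assert(got->offset <= del_off);
	assert(got_end >= del_end);

	if (!left_filling) {
		/* The head of the extent survives: [got->offset, del_off). */
		out[n].offset = got->offset;
		out[n].length = del_off - got->offset;
		out[n].addr = got->addr;
		n++;
	}
	if (!right_filling) {
		/* The tail survives; its device address moves up with it. */
		out[n].offset = del_end;
		out[n].length = got_end - del_end;
		out[n].addr = got->addr + (del_end - got->offset);
		n++;
	}
	return n;
}
```

Zero survivors corresponds to the LEFT_FILLING | RIGHT_FILLING case (remove the record), one survivor to either single-flag case (update the record in place), and two survivors to the "case 0" path, which updates the record and then inserts a new one for the tail.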
* [PATCH 02/12] fuse_trace: cache iomaps
2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2026-04-29 14:35 ` [PATCH 01/12] fuse: cache iomaps Darrick J. Wong
@ 2026-04-29 14:35 ` Darrick J. Wong
2026-04-29 14:36 ` [PATCH 03/12] fuse: use the iomap cache for iomap_begin Darrick J. Wong
` (9 subsequent siblings)
11 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:35 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Add tracepoints for the previous patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h       |  293 ++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/fuse_iomap_cache.c |   36 +++++
 2 files changed, 329 insertions(+)

diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index a6374d64a62357..697289c82d0dad 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -318,6 +318,8 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
 struct iomap_writepage_ctx;
 struct iomap_ioend;
 struct iomap;
+struct fuse_iext_cursor;
+struct fuse_iomap_lookup;
 
 /* tracepoint boilerplate so we don't have to keep doing this */
 #define FUSE_IOMAP_OPFLAGS_FIELD \
@@ -348,6 +350,16 @@ struct iomap;
 		__entry->prefix##addr, \
 		__print_flags(__entry->prefix##flags, "|", FUSE_IOMAP_F_STRINGS)
 
+#define FUSE_IOMAP_IODIR_FIELD \
+		__field(unsigned int, iodir)
+
+#define FUSE_IOMAP_IODIR_FMT \
+		" iodir %s"
+
+#define FUSE_IOMAP_IODIR_PRINTK_ARGS \
+		__print_symbolic(__entry->iodir, FUSE_IOMAP_FORK_STRINGS)
+
+
 /* combinations of boilerplate to reduce typing further */
 #define FUSE_IOMAP_OP_FIELDS(prefix) \
 		FUSE_INODE_FIELDS \
@@ -445,6 +457,22 @@ TRACE_DEFINE_ENUM(FUSE_I_ATOMIC);
 	{ FUSE_IOMAP_CONFIG_TIME,		"time" }, \
 	{ FUSE_IOMAP_CONFIG_MAXBYTES,		"maxbytes" }
 
+TRACE_DEFINE_ENUM(READ_MAPPING);
+TRACE_DEFINE_ENUM(WRITE_MAPPING);
+
+#define FUSE_IOMAP_FORK_STRINGS \
+	{ READ_MAPPING,				"read" }, \
+	{ WRITE_MAPPING,			"write" }
+
+#define FUSE_IEXT_STATE_STRINGS \
+	{ FUSE_IEXT_LEFT_CONTIG,		"l_cont" }, \
+	{ FUSE_IEXT_RIGHT_CONTIG,		"r_cont" }, \
+	{ FUSE_IEXT_LEFT_FILLING,		"l_fill" }, \
+	{ FUSE_IEXT_RIGHT_FILLING,		"r_fill" }, \
+	{ FUSE_IEXT_LEFT_VALID,			"l_valid" }, \
+	{ FUSE_IEXT_RIGHT_VALID,		"r_valid" }, \
+	{ FUSE_IEXT_WRITE_MAPPING,		"write" }
+
 DECLARE_EVENT_CLASS(fuse_iomap_check_class,
 	TP_PROTO(const char *func, int line, const char *condition),
 
@@ -727,6 +755,8 @@ DEFINE_EVENT(fuse_inode_state_class, name, \
 	TP_ARGS(inode))
 DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
 DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_cache_alloc);
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_cache_free);
 
 TRACE_EVENT(fuse_iomap_fiemap,
 	TP_PROTO(const struct inode *inode, u64 start, u64 count,
@@ -1216,6 +1246,269 @@ DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_read);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
+
+DECLARE_EVENT_CLASS(fuse_iext_class,
+	TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur,
+		 int state, unsigned long caller_ip),
+
+	TP_ARGS(inode, cur, state, caller_ip),
+
+	TP_STRUCT__entry(
+		FUSE_INODE_FIELDS
+		FUSE_IOMAP_MAP_FIELDS(map)
+		__field(void *, leaf)
+		__field(int, pos)
+		__field(int, iext_state)
+		__field(unsigned long, caller_ip)
+	),
+	TP_fast_assign(
+		const struct fuse_iext_root *ir;
+		struct fuse_iomap_io r = { };
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+
+		if (state & FUSE_IEXT_WRITE_MAPPING)
+			ir = &fi->cache->ic_write;
+		else
+			ir = &fi->cache->ic_read;
+		if (ir)
+			fuse_iext_get_extent(ir, cur, &r);
+
+		__entry->mapoffset = r.offset;
+		__entry->mapaddr = r.addr;
+		__entry->maplength = r.length;
+		__entry->mapdev = r.dev;
+		__entry->maptype = r.type;
+		__entry->mapflags = r.flags;
+
+		__entry->leaf = cur->leaf;
+		__entry->pos = cur->pos;
+
+		__entry->iext_state = state;
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk(FUSE_INODE_FMT " state (%s) cur %p/%d " FUSE_IOMAP_MAP_FMT() " caller %pS",
+		  FUSE_INODE_PRINTK_ARGS,
+		  __print_flags(__entry->iext_state, "|", FUSE_IEXT_STATE_STRINGS),
+		  __entry->leaf,
+		  __entry->pos,
+		  FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+		  (void *)__entry->caller_ip)
+)
+
+#define DEFINE_IEXT_EVENT(name) \
+DEFINE_EVENT(fuse_iext_class, name, \
+	TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur, \
+		 int state, unsigned long caller_ip), \
+	TP_ARGS(inode, cur, state, caller_ip))
+DEFINE_IEXT_EVENT(fuse_iext_insert);
+DEFINE_IEXT_EVENT(fuse_iext_remove);
+DEFINE_IEXT_EVENT(fuse_iext_pre_update);
+DEFINE_IEXT_EVENT(fuse_iext_post_update);
+
+TRACE_EVENT(fuse_iext_update_class,
+	TP_PROTO(const struct inode *inode, uint32_t iext_state,
+		 const struct fuse_iomap_io *map),
+	TP_ARGS(inode, iext_state, map),
+
+	TP_STRUCT__entry(
+		FUSE_INODE_FIELDS
+		FUSE_IOMAP_MAP_FIELDS(map)
+		__field(uint32_t, iext_state)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->mapoffset = map->offset;
+		__entry->maplength = map->length;
+		__entry->maptype = map->type;
+		__entry->mapflags = map->flags;
+		__entry->mapdev = map->dev;
+		__entry->mapaddr = map->addr;
+
+		__entry->iext_state = iext_state;
+	),
+
+	TP_printk(FUSE_INODE_FMT " state (%s)" FUSE_IOMAP_MAP_FMT(),
+		  FUSE_INODE_PRINTK_ARGS,
+		  __print_flags(__entry->iext_state, "|", FUSE_IEXT_STATE_STRINGS),
+		  FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+#define DEFINE_IEXT_UPDATE_EVENT(name) \
+DEFINE_EVENT(fuse_iext_update_class, name, \
+	TP_PROTO(const struct inode *inode, uint32_t iext_state, \
+		 const struct fuse_iomap_io *map), \
+	TP_ARGS(inode, iext_state, map))
+DEFINE_IEXT_UPDATE_EVENT(fuse_iext_del_mapping);
+DEFINE_IEXT_UPDATE_EVENT(fuse_iext_add_mapping);
+
+TRACE_EVENT(fuse_iext_alt_update_class,
+	TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map),
+	TP_ARGS(inode, map),
+
+	TP_STRUCT__entry(
+		FUSE_INODE_FIELDS
+		FUSE_IOMAP_MAP_FIELDS(map)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+
+		__entry->mapoffset = map->offset;
+		__entry->maplength = map->length;
+		__entry->maptype = map->type;
+		__entry->mapflags = map->flags;
+		__entry->mapdev = map->dev;
+		__entry->mapaddr = map->addr;
+	),
+
+	TP_printk(FUSE_INODE_FMT FUSE_IOMAP_MAP_FMT(),
+		  FUSE_INODE_PRINTK_ARGS,
+		  FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+#define DEFINE_IEXT_ALT_UPDATE_EVENT(name) \
+DEFINE_EVENT(fuse_iext_alt_update_class, name, \
+	TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map), \
+	TP_ARGS(inode, map))
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_del_mapping_got);
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_left);
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_right);
+
+TRACE_EVENT(fuse_iomap_cache_remove,
+	TP_PROTO(const struct inode *inode, unsigned int iodir,
+		 loff_t offset, uint64_t length, unsigned long caller_ip),
+	TP_ARGS(inode, iodir, offset, length, caller_ip),
+
+	TP_STRUCT__entry(
+		FUSE_IO_RANGE_FIELDS()
+		FUSE_IOMAP_IODIR_FIELD
+		__field(unsigned long, caller_ip)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->iodir = iodir;
+		__entry->offset = offset;
+		__entry->length = length;
+		__entry->caller_ip = caller_ip;
+	),
+
+	TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT " caller %pS",
+		  FUSE_IO_RANGE_PRINTK_ARGS(),
+		  FUSE_IOMAP_IODIR_PRINTK_ARGS,
+		  (void *)__entry->caller_ip)
+);
+
+TRACE_EVENT(fuse_iomap_cached_mapping_class,
+	TP_PROTO(const struct inode *inode, unsigned int iodir,
+		 const struct fuse_iomap_io *map, unsigned long caller_ip),
+	TP_ARGS(inode, iodir, map, caller_ip),
+
+	TP_STRUCT__entry(
+		FUSE_INODE_FIELDS
+		FUSE_IOMAP_IODIR_FIELD
+		FUSE_IOMAP_MAP_FIELDS(map)
+		__field(unsigned long, caller_ip)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->iodir = iodir;
+
+		__entry->mapoffset = map->offset;
+		__entry->maplength = map->length;
+		__entry->maptype = map->type;
+		__entry->mapflags = map->flags;
+		__entry->mapdev = map->dev;
+		__entry->mapaddr = map->addr;
+
+		__entry->caller_ip = caller_ip;
+	),
+
+	TP_printk(FUSE_INODE_FMT FUSE_IOMAP_IODIR_FMT FUSE_IOMAP_MAP_FMT() " caller %pS",
+		  FUSE_INODE_PRINTK_ARGS,
+		  FUSE_IOMAP_IODIR_PRINTK_ARGS,
+		  FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+		  (void *)__entry->caller_ip)
+);
+#define DEFINE_FUSE_IOMAP_CACHED_MAPPING_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_cached_mapping_class, name, \
+	TP_PROTO(const struct inode *inode, unsigned int iodir, \
+		 const struct fuse_iomap_io *map, unsigned long caller_ip), \
+	TP_ARGS(inode, iodir, map, caller_ip))
+DEFINE_FUSE_IOMAP_CACHED_MAPPING_EVENT(fuse_iomap_cache_add);
+DEFINE_FUSE_IOMAP_CACHED_MAPPING_EVENT(fuse_iext_check_mapping);
+
+TRACE_EVENT(fuse_iomap_cache_lookup,
+	TP_PROTO(const struct inode *inode, unsigned int iodir,
+		 loff_t pos, uint64_t count, unsigned long caller_ip),
+	TP_ARGS(inode, iodir, pos, count, caller_ip),
+
+	TP_STRUCT__entry(
+		FUSE_IO_RANGE_FIELDS()
+		FUSE_IOMAP_IODIR_FIELD
+		__field(unsigned long, caller_ip)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->iodir = iodir;
+		__entry->offset = pos;
+		__entry->length = count;
+		__entry->caller_ip = caller_ip;
+	),
+
+	TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT " caller %pS",
+		  FUSE_IO_RANGE_PRINTK_ARGS(),
+		  FUSE_IOMAP_IODIR_PRINTK_ARGS,
+		  (void *)__entry->caller_ip)
+);
+
+TRACE_EVENT(fuse_iomap_cache_lookup_result,
+	TP_PROTO(const struct inode *inode, unsigned int iodir,
+		 loff_t pos, uint64_t count, const struct fuse_iomap_io *got,
+		 const struct fuse_iomap_lookup *map),
+	TP_ARGS(inode, iodir, pos, count, got, map),
+
+	TP_STRUCT__entry(
+		FUSE_IO_RANGE_FIELDS()
+
+		FUSE_IOMAP_MAP_FIELDS(got)
+		FUSE_IOMAP_MAP_FIELDS(map)
+
+		FUSE_IOMAP_IODIR_FIELD
+		__field(uint64_t, validity_cookie)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->iodir = iodir;
+		__entry->offset = pos;
+		__entry->length = count;
+
+		__entry->gotoffset = got->offset;
+		__entry->gotlength = got->length;
+		__entry->gottype = got->type;
+		__entry->gotflags = got->flags;
+		__entry->gotdev = got->dev;
+		__entry->gotaddr = got->addr;
+
+		__entry->mapoffset = map->map.offset;
+		__entry->maplength = map->map.length;
+		__entry->maptype = map->map.type;
+		__entry->mapflags = map->map.flags;
+		__entry->mapdev = map->map.dev;
+		__entry->mapaddr = map->map.addr;
+
+		__entry->validity_cookie = map->validity_cookie;
+	),
+
+	TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT FUSE_IOMAP_MAP_FMT("map") FUSE_IOMAP_MAP_FMT("got") " cookie 0x%llx",
+		  FUSE_IO_RANGE_PRINTK_ARGS(),
+		  FUSE_IOMAP_IODIR_PRINTK_ARGS,
+		  FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+		  FUSE_IOMAP_MAP_PRINTK_ARGS(got),
+		  __entry->validity_cookie)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/fuse_iomap_cache.c b/fs/fuse/fuse_iomap_cache.c
index 35384df09b9200..1608e96984bb38 100644
--- a/fs/fuse/fuse_iomap_cache.c
+++ b/fs/fuse/fuse_iomap_cache.c
@@ -763,6 +763,7 @@ fuse_iext_insert(
 	struct fuse_iext_root *ir = fuse_iext_state_to_fork(ic, state);
 
 	fuse_iext_insert_raw(ic, ir, cur, irec);
+	trace_fuse_iext_insert(ic->ic_inode, cur, state, _RET_IP_);
 }
 
 static struct fuse_iext_node *
@@ -970,6 +971,8 @@ fuse_iext_remove(
 	ASSERT(ir->ir_data != NULL);
 	ASSERT(fuse_iext_valid(ir, cur));
 
+	trace_fuse_iext_remove(ic->ic_inode, cur, state, _RET_IP_);
+
 	fuse_iext_inc_seq(ic);
 
 	nr_entries = fuse_iext_leaf_nr_entries(ir, leaf, cur->pos) - 1;
@@ -1088,7 +1091,9 @@ fuse_iext_update_extent(
 		}
 	}
 
+	trace_fuse_iext_pre_update(ic->ic_inode, cur, state, _RET_IP_);
 	fuse_iext_set(cur_rec(cur), new);
+	trace_fuse_iext_post_update(ic->ic_inode, cur, state, _RET_IP_);
 }
 
 /*
@@ -1180,17 +1185,26 @@ static void fuse_iext_check_mappings(struct fuse_iomap_cache *ic,
 	struct inode *inode = ic->ic_inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	unsigned long long nr = 0;
+	enum fuse_iomap_iodir iodir;
 
 	if (ir->ir_bytes < 0 || !static_branch_unlikely(&fuse_iomap_debug))
 		return;
 
+	if (ir == &ic->ic_write)
+		iodir = WRITE_MAPPING;
+	else
+		iodir = READ_MAPPING;
+
 	fuse_iext_first(ir, &icur);
 	if (!fuse_iext_get_extent(ir, &icur, &prev))
 		return;
+	trace_fuse_iext_check_mapping(ic->ic_inode, iodir, &prev, _RET_IP_);
 	nr++;
 
 	fuse_iext_next(ir, &icur);
 	while (fuse_iext_get_extent(ir, &icur, &got)) {
+		trace_fuse_iext_check_mapping(ic->ic_inode, iodir, &got,
+					      _RET_IP_);
 		if (got.length == 0 ||
 		    got.offset < prev.offset + prev.length ||
 		    fuse_iomap_can_merge(&prev, &got)) {
@@ -1249,6 +1263,9 @@ fuse_iext_del_mapping(
 	if (got_endoff == del_endoff)
 		state |= FUSE_IEXT_RIGHT_FILLING;
 
+	trace_fuse_iext_del_mapping(ic->ic_inode, state, del);
+	trace_fuse_iext_del_mapping_got(ic->ic_inode, got);
+
 	switch (state & (FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING)) {
 	case FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING:
 		/*
@@ -1313,6 +1330,8 @@ fuse_iomap_cache_remove(
 
 	assert_cache_locked(ic);
 
+	trace_fuse_iomap_cache_remove(&fi->inode, iodir, start, len, _RET_IP_);
+
 	/* Fork is not active or has zero mappings */
 	if (ir->ir_bytes < 0 || fuse_iext_count(ir) == 0)
 		return 0;
@@ -1458,6 +1477,12 @@ fuse_iext_add_mapping(
 	     fuse_iomap_can_merge3(&left, new, &right)))
 		state |= FUSE_IEXT_RIGHT_CONTIG;
 
+	trace_fuse_iext_add_mapping(ic->ic_inode, state, new);
+	if (state & FUSE_IEXT_LEFT_VALID)
+		trace_fuse_iext_add_mapping_left(ic->ic_inode, &left);
+	if (state & FUSE_IEXT_RIGHT_VALID)
+		trace_fuse_iext_add_mapping_right(ic->ic_inode, &right);
+
 	/*
 	 * Select which case we're in here, and implement it.
 	 */
@@ -1526,6 +1551,8 @@ fuse_iomap_cache_add(
 	ASSERT(new->length > 0);
 	ASSERT(new->offset < inode->i_sb->s_maxbytes);
 
+	trace_fuse_iomap_cache_add(&fi->inode, iodir, new, _RET_IP_);
+
 	/* Mark this fork as being in use */
 	if (ir->ir_bytes < 0)
 		ir->ir_bytes = 0;
@@ -1563,8 +1590,10 @@ int fuse_iomap_cache_alloc(struct inode *inode)
 	if (!try_cmpxchg(&fi->cache, &old, ic)) {
 		/* Someone created mapping cache before us? Free ours... */
 		kfree(ic);
+		return 0;
 	}
 
+	trace_fuse_iomap_cache_alloc(inode);
 	return 0;
 }
 
@@ -1579,6 +1608,8 @@ void fuse_iomap_cache_free(struct inode *inode)
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_iomap_cache *ic = fi->cache;
 
+	trace_fuse_iomap_cache_free(inode);
+
 	/*
 	 * This is only called from eviction, so we cannot be racing to set or
 	 * clear the pointer.
@@ -1665,6 +1696,8 @@ fuse_iomap_cache_lookup(
 
 	assert_cache_locked_shared(ic);
 
+	trace_fuse_iomap_cache_lookup(ic->ic_inode, iodir, off, len, _RET_IP_);
+
 	if (ir->ir_bytes < 0) {
 		/*
 		 * No write fork at all means this filesystem doesn't do out of
@@ -1691,5 +1724,8 @@ fuse_iomap_cache_lookup(
 
 	/* Found a mapping in the cache, return it */
 	fuse_iomap_trim(fi, mval, &got, off, len);
+
+	trace_fuse_iomap_cache_lookup_result(inode, iodir, off, len, &got,
+					     mval);
 	return LOOKUP_HIT;
 }
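The fuse_iomap_cache_alloc() hunk above returns early when try_cmpxchg() loses the race, so the alloc tracepoint fires only for the caller that actually installed the cache. A rough userspace analogue of that install-once pattern, using C11 atomics instead of the kernel helper, might look like the sketch below. struct demo_cache, struct demo_inode and demo_cache_alloc() are made-up names, and unlike the kernel function (which returns 0 in both the win and lose cases) this version returns 1 or 0 purely so the outcome is visible to a caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct fuse_iomap_cache and struct fuse_inode. */
struct demo_cache {
	int unused;
};

struct demo_inode {
	_Atomic(struct demo_cache *) cache;
};

/*
 * Allocate a cache and publish it only if nobody else has done so already.
 * Returns 1 if this caller installed the cache, 0 if another caller won the
 * race (our copy is freed), or -1 if the allocation failed.
 */
static int demo_cache_alloc(struct demo_inode *di)
{
	struct demo_cache *expected = NULL;
	struct demo_cache *dc = calloc(1, sizeof(*dc));

	if (!dc)
		return -1;

	/* Succeeds only while di->cache still holds NULL. */
	if (!atomic_compare_exchange_strong(&di->cache, &expected, dc)) {
		/* Someone else published a cache first; discard ours. */
		free(dc);
		return 0;
	}

	/* This is the point where the winner would emit its tracepoint. */
	return 1;
}
```

Calling demo_cache_alloc() twice on the same demo_inode leaves the first cache in place and frees the second allocation, which is the behavior the kernel comment "Someone created mapping cache before us? Free ours..." describes.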
* [PATCH 03/12] fuse: use the iomap cache for iomap_begin
2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2026-04-29 14:35 ` [PATCH 01/12] fuse: cache iomaps Darrick J. Wong
2026-04-29 14:35 ` [PATCH 02/12] fuse_trace: " Darrick J. Wong
@ 2026-04-29 14:36 ` Darrick J. Wong
2026-04-29 14:36 ` [PATCH 04/12] fuse_trace: " Darrick J. Wong
` (8 subsequent siblings)
11 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:36 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Look inside the iomap cache to try to satisfy iomap_begin.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_iomap_cache.h |    5 +
 fs/fuse/fuse_iomap.c       |  221 +++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_iomap_cache.c |    7 +
 3 files changed, 228 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/fuse_iomap_cache.h b/fs/fuse/fuse_iomap_cache.h
index 922ca182357aa7..dcd52c183f22ab 100644
--- a/fs/fuse/fuse_iomap_cache.h
+++ b/fs/fuse/fuse_iomap_cache.h
@@ -53,6 +53,11 @@ bool fuse_iext_get_extent(const struct fuse_iext_root *ir,
 			  const struct fuse_iext_cursor *cur,
 			  struct fuse_iomap_io *gotp);
 
+/* iomaps that come direct from the fuse server are presumed to be valid */
+#define FUSE_IOMAP_ALWAYS_VALID	((uint64_t)0)
+/* set initial iomap cookie value to avoid ALWAYS_VALID */
+#define FUSE_IOMAP_INIT_COOKIE	((uint64_t)1)
+
 static inline uint64_t fuse_iext_read_seq(struct fuse_iomap_cache *ic)
 {
 	return (uint64_t)READ_ONCE(ic->ic_seq);
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index bb47b4c7f7eabc..8b37c2f5cbdb2b 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -131,6 +131,7 @@ static inline bool fuse_iomap_check_type(uint16_t fuse_type)
 	case FUSE_IOMAP_TYPE_UNWRITTEN:
 	case FUSE_IOMAP_TYPE_INLINE:
 	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+	case FUSE_IOMAP_TYPE_RETRY_CACHE:
 		return true;
 	}
 
@@ -239,9 +240,21 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
 	const unsigned int blocksize = i_blocksize(inode);
 	uint64_t end;
 
-	/* Type and flags must be known */
+	/*
+	 * Type and flags must be known. Mapping type "retry cache" doesn't
+	 * use any of the other fields.
+	 */
 	if (BAD_DATA(!fuse_iomap_check_type(map->type)))
 		return false;
+	if (map->type == FUSE_IOMAP_TYPE_RETRY_CACHE) {
+		/*
+		 * We only accept cache retries if we have a cache to query.
+		 * There must not be a device addr.
+		 */
+		if (BAD_DATA(!fuse_inode_caches_iomaps(inode)))
+			return false;
+		return true;
+	}
 	if (BAD_DATA(!fuse_iomap_check_flags(map->flags)))
 		return false;
 
@@ -286,6 +299,7 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
 		if (BAD_DATA(iodir != WRITE_MAPPING))
 			return false;
 		break;
+	case FUSE_IOMAP_TYPE_RETRY_CACHE:
 	default:
 		/* should have been caught already */
 		ASSERT(0);
@@ -561,6 +575,157 @@ static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags,
 	return 0;
 }
 
+/* Convert a mapping from the cache into something the kernel can use */
+static int fuse_iomap_from_cache(struct inode *inode, struct iomap *iomap,
+				 const struct fuse_iomap_lookup *lmap)
+{
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	struct fuse_backing *fb;
+
+	fb = fuse_iomap_find_dev(fm->fc, &lmap->map);
+	if (IS_ERR(fb))
+		return PTR_ERR(fb);
+
+	fuse_iomap_from_server(iomap, fb, &lmap->map);
+	iomap->validity_cookie = lmap->validity_cookie;
+
+	fuse_backing_put(fb);
+	return 0;
+}
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static inline int
+fuse_iomap_cached_validate(const struct inode *inode,
+			   enum fuse_iomap_iodir dir,
+			   const struct fuse_iomap_lookup *lmap)
+{
+	if (!static_branch_unlikely(&fuse_iomap_debug))
+		return 0;
+
+	/* Make sure the mappings aren't garbage */
+	if (!fuse_iomap_check_mapping(inode, &lmap->map, dir))
+		return -EFSCORRUPTED;
+
+	/* The cache should not be storing "retry cache" mappings */
+	if (BAD_DATA(lmap->map.type == FUSE_IOMAP_TYPE_RETRY_CACHE))
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+#else
+# define fuse_iomap_cached_validate(...)	(0)
+#endif
+
+/*
+ * Look up iomappings from the cache. Returns 1 if iomap and srcmap were
+ * satisfied from cache; 0 if not; or a negative errno.
+ */
+static int fuse_iomap_try_cache(struct inode *inode, loff_t pos, loff_t count,
+				unsigned opflags, struct iomap *iomap,
+				struct iomap *srcmap)
+{
+	struct fuse_iomap_lookup lmap;
+	struct iomap *dest = iomap;
+	enum fuse_iomap_lookup_result res;
+	int ret;
+
+	if (!fuse_inode_caches_iomaps(inode))
+		return 0;
+
+	fuse_iomap_cache_lock_shared(inode);
+
+	if (fuse_is_iomap_file_write(opflags)) {
+		res = fuse_iomap_cache_lookup(inode, WRITE_MAPPING, pos, count,
+					      &lmap);
+		switch (res) {
+		case LOOKUP_HIT:
+			ret = fuse_iomap_cached_validate(inode, WRITE_MAPPING,
+							 &lmap);
+			if (ret)
+				goto out_unlock;
+
+			if (lmap.map.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+				ret = fuse_iomap_from_cache(inode, dest, &lmap);
+				if (ret)
+					goto out_unlock;
+
+				dest = srcmap;
+			}
+			fallthrough;
+		case LOOKUP_NOFORK:
+			/* move on to the read fork */
+			break;
+		case LOOKUP_MISS:
+			ret = 0;
+			goto out_unlock;
+		}
+	}
+
+	res = fuse_iomap_cache_lookup(inode, READ_MAPPING, pos, count, &lmap);
+	switch (res) {
+	case LOOKUP_HIT:
+		break;
+	case LOOKUP_NOFORK:
+		ASSERT(res != LOOKUP_NOFORK);
+		ret = -EFSCORRUPTED;
+		goto out_unlock;
+	case LOOKUP_MISS:
+		ret = 0;
+		goto out_unlock;
+	}
+
+	ret = fuse_iomap_cached_validate(inode, READ_MAPPING, &lmap);
+	if (ret)
+		goto out_unlock;
+
+	ret = fuse_iomap_from_cache(inode, dest, &lmap);
+	if (ret)
+		goto out_unlock;
+
+	if (fuse_is_iomap_file_write(opflags)) {
+		switch (iomap->type) {
+		case IOMAP_HOLE:
+			if (opflags & (IOMAP_ZERO | IOMAP_UNSHARE))
+				ret = 1;
+			else
+				ret = 0;
+			break;
+		case IOMAP_DELALLOC:
+			if (opflags & (IOMAP_DIRECT | FUSE_IOMAP_OP_WRITEBACK))
+				ret = 0;
+			else
+				ret = 1;
+			break;
+		default:
+			ret = 1;
break; + } + } else { + ret = 1; + } + +out_unlock: + fuse_iomap_cache_unlock_shared(inode); + if (ret < 1) + return ret; + + if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) { + ret = fuse_iomap_set_inline(inode, opflags, pos, count, iomap, + srcmap); + if (ret) + return ret; + } + return 1; +} + +/* + * For atomic writes we must always query the server because that might require + * assistance from the fuse server. For swapfiles we always query the server + * because we have no idea if the server actually wants to support that. + */ +#define FUSE_IOMAP_OP_NOCACHE (FUSE_IOMAP_OP_ATOMIC | \ + FUSE_IOMAP_OP_SWAPFILE) + static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, unsigned opflags, struct iomap *iomap, struct iomap *srcmap) @@ -581,6 +746,20 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, trace_fuse_iomap_begin(inode, pos, count, opflags); + /* + * Try to read mappings from the cache; if we find something then use + * it; otherwise we upcall the fuse server. + */ + if (!(opflags & FUSE_IOMAP_OP_NOCACHE)) { + err = fuse_iomap_try_cache(inode, pos, count, opflags, iomap, + srcmap); + if (err < 0) + return err; + if (err == 1) + return 0; + } + +retry: args.opcode = FUSE_IOMAP_BEGIN; args.nodeid = get_node_id(inode); args.in_numargs = 1; @@ -602,6 +781,24 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count, if (err) return err; + /* + * If the fuse server tells us it populated the cache, we'll try the + * cache lookup again. Note that we dropped the cache lock, so it's + * entirely possible that another thread could have invalidated the + * cache -- if the cache misses, we'll call the server again. 
+	 */
+	if (outarg.read.type == FUSE_IOMAP_TYPE_RETRY_CACHE) {
+		err = fuse_iomap_try_cache(inode, pos, count, opflags, iomap,
+					   srcmap);
+		if (err < 0)
+			return err;
+		if (err == 1)
+			return 0;
+		if (signal_pending(current))
+			return -EINTR;
+		goto retry;
+	}
+
 	read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read);
 	if (IS_ERR(read_dev))
 		return PTR_ERR(read_dev);
@@ -629,6 +826,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 		 */
 		fuse_iomap_from_server(iomap, read_dev, &outarg.read);
 	}
+	iomap->validity_cookie = FUSE_IOMAP_ALWAYS_VALID;
+	srcmap->validity_cookie = FUSE_IOMAP_ALWAYS_VALID;
 
 	if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) {
 		err = fuse_iomap_set_inline(inode, opflags, pos, count, iomap,
@@ -1372,7 +1571,21 @@ static const struct iomap_dio_ops fuse_iomap_dio_write_ops = {
 	.end_io		= fuse_iomap_dio_write_end_io,
 };
 
+static bool fuse_iomap_revalidate(struct inode *inode,
+				  const struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	uint64_t validity_cookie;
+
+	if (iomap->validity_cookie == FUSE_IOMAP_ALWAYS_VALID)
+		return true;
+
+	validity_cookie = fuse_iext_read_seq(fi->cache);
+	return iomap->validity_cookie == validity_cookie;
+}
+
 static const struct iomap_write_ops fuse_iomap_write_ops = {
+	.iomap_valid		= fuse_iomap_revalidate,
 };
 
 static int
@@ -1658,14 +1871,14 @@ static void fuse_iomap_end_bio(struct bio *bio)
  * mapping is valid, false otherwise.
*/ static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc, + struct inode *inode, loff_t offset) { if (offset < wpc->iomap.offset || offset >= wpc->iomap.offset + wpc->iomap.length) return false; - /* XXX actually use revalidation cookie */ - return true; + return fuse_iomap_revalidate(inode, &wpc->iomap); } /* @@ -1719,7 +1932,7 @@ static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc, trace_fuse_iomap_writeback_range(inode, offset, len, end_pos); - if (!fuse_iomap_revalidate_writeback(wpc, offset)) { + if (!fuse_iomap_revalidate_writeback(wpc, inode, offset)) { struct iomap_iter fake_iter = { }; struct iomap *write_iomap = &fake_iter.iomap; diff --git a/fs/fuse/fuse_iomap_cache.c b/fs/fuse/fuse_iomap_cache.c index 1608e96984bb38..ea63a12a01c1fe 100644 --- a/fs/fuse/fuse_iomap_cache.c +++ b/fs/fuse/fuse_iomap_cache.c @@ -706,7 +706,11 @@ fuse_iext_realloc_root( */ static inline void fuse_iext_inc_seq(struct fuse_iomap_cache *ic) { - WRITE_ONCE(ic->ic_seq, READ_ONCE(ic->ic_seq) + 1); + uint64_t new_val = READ_ONCE(ic->ic_seq) + 1; + + if (new_val == FUSE_IOMAP_ALWAYS_VALID) + new_val++; + WRITE_ONCE(ic->ic_seq, new_val); } static void @@ -1584,6 +1588,7 @@ int fuse_iomap_cache_alloc(struct inode *inode) /* Only the write mapping cache can return NOFORK */ ic->ic_write.ir_bytes = -1; + ic->ic_seq = FUSE_IOMAP_INIT_COOKIE; ic->ic_inode = inode; init_rwsem(&ic->ic_lock); ^ permalink raw reply related [flat|nested] 191+ messages in thread
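
For readers following the cookie logic in the patch above, here is a minimal userspace sketch of the sequence counter and the revalidation check. It is not the kernel code. The names demo_cache, demo_inc_seq and demo_revalidate are hypothetical, the two DEMO_ constants are assumed values (the patch does not show how FUSE_IOMAP_ALWAYS_VALID and FUSE_IOMAP_INIT_COOKIE are defined), and the READ_ONCE/WRITE_ONCE annotations are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; the real definitions are not shown. */
#define DEMO_ALWAYS_VALID	0ULL
#define DEMO_INIT_COOKIE	1ULL

struct demo_cache {
	uint64_t seq;	/* bumped whenever the cached mappings change */
};

/* Advance the sequence counter, skipping the reserved sentinel value. */
static void demo_inc_seq(struct demo_cache *c)
{
	uint64_t new_val = c->seq + 1;	/* unsigned, so wraparound is defined */

	if (new_val == DEMO_ALWAYS_VALID)
		new_val++;
	c->seq = new_val;

	/* The live counter must never equal the "always valid" cookie. */
	assert(c->seq != DEMO_ALWAYS_VALID);
}

/* A mapping is usable if it is exempt or if its cookie is still current. */
static bool demo_revalidate(const struct demo_cache *c, uint64_t cookie)
{
	if (cookie == DEMO_ALWAYS_VALID)
		return true;
	return cookie == c->seq;
}
```

Skipping the sentinel matters because a mapping handed straight back from the server carries the "always valid" cookie. If the counter could ever land on that value, a stale cached mapping sampled at that moment would be indistinguishable from an exempt one.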
* [PATCH 04/12] fuse_trace: use the iomap cache for iomap_begin
2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2026-04-29 14:36 ` [PATCH 03/12] fuse: use the iomap cache for iomap_begin Darrick J. Wong
@ 2026-04-29 14:36 ` Darrick J. Wong
2026-04-29 14:36 ` [PATCH 05/12] fuse: invalidate iomap cache after file updates Darrick J. Wong
` (7 subsequent siblings)
11 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:36 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel
From: Darrick J. Wong <djwong@kernel.org>

Add tracepoints for the previous patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h |   34 ++++++++++++++++++++++++++++++++++
 fs/fuse/fuse_iomap.c |    7 ++++++-
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 697289c82d0dad..bf47008e50920a 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -404,6 +404,7 @@ struct fuse_iomap_lookup;
 
 #define FUSE_IOMAP_TYPE_STRINGS \
 	{ FUSE_IOMAP_TYPE_PURE_OVERWRITE,	"overwrite" }, \
+	{ FUSE_IOMAP_TYPE_RETRY_CACHE,		"retry" }, \
 	{ FUSE_IOMAP_TYPE_HOLE,			"hole" }, \
 	{ FUSE_IOMAP_TYPE_DELALLOC,		"delalloc" }, \
 	{ FUSE_IOMAP_TYPE_MAPPED,		"mapped" }, \
@@ -1509,6 +1510,39 @@ TRACE_EVENT(fuse_iomap_cache_lookup_result,
 		  FUSE_IOMAP_MAP_PRINTK_ARGS(got),
 		  __entry->validity_cookie)
 );
+
+TRACE_EVENT(fuse_iomap_invalid,
+	TP_PROTO(const struct inode *inode, const struct iomap *map,
+		 uint64_t validity_cookie),
+	TP_ARGS(inode, map, validity_cookie),
+
+	TP_STRUCT__entry(
+		FUSE_INODE_FIELDS
+		FUSE_IOMAP_MAP_FIELDS(map)
+		__field(uint64_t,		old_validity_cookie)
+		__field(uint64_t,		validity_cookie)
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+
+		__entry->mapoffset	=	map->offset;
+		__entry->maplength	=	map->length;
+		__entry->maptype	=	map->type;
+		__entry->mapflags	=	map->flags;
+		__entry->mapaddr	=	map->addr;
+		__entry->mapdev		=	FUSE_IOMAP_DEV_NULL;
+
+		__entry->old_validity_cookie =	map->validity_cookie;
+		__entry->validity_cookie =	validity_cookie;
+	),
+
+	TP_printk(FUSE_INODE_FMT FUSE_IOMAP_MAP_FMT() " old_cookie 0x%llx new_cookie 0x%llx",
+		  FUSE_INODE_PRINTK_ARGS,
+		  FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+		  __entry->old_validity_cookie,
+		  __entry->validity_cookie)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 8b37c2f5cbdb2b..e1e7b98e591d16 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -1581,7 +1581,12 @@ static bool fuse_iomap_revalidate(struct inode *inode,
 		return true;
 
 	validity_cookie = fuse_iext_read_seq(fi->cache);
-	return iomap->validity_cookie == validity_cookie;
+	if (unlikely(iomap->validity_cookie != validity_cookie)) {
+		trace_fuse_iomap_invalid(inode, iomap, validity_cookie);
+		return false;
+	}
+
+	return true;
 }
 
 static const struct iomap_write_ops fuse_iomap_write_ops = {

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 05/12] fuse: invalidate iomap cache after file updates 2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (3 preceding siblings ...) 2026-04-29 14:36 ` [PATCH 04/12] fuse_trace: " Darrick J. Wong @ 2026-04-29 14:36 ` Darrick J. Wong 2026-04-29 14:36 ` [PATCH 06/12] fuse_trace: " Darrick J. Wong ` (6 subsequent siblings) 11 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:36 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> The kernel doesn't know what the fuse server might have done in response to truncate, fallocate, or ioend events. Therefore, it must invalidate the mapping cache after those operations to ensure cache coherency. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_iomap.h | 3 +++ fs/fuse/fuse_iomap_cache.h | 9 +++++++++ fs/fuse/file.c | 18 ++++++++++-------- fs/fuse/fuse_iomap.c | 28 ++++++++++++++++++++++++++-- fs/fuse/fuse_iomap_cache.c | 27 +++++++++++++++++++++++++++ 5 files changed, 75 insertions(+), 10 deletions(-) diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 25c36c9c39d6f3..5cdf7b311dba42 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -69,6 +69,8 @@ int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset, loff_t length, loff_t new_size); int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos, loff_t endpos); +void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset, + u64 written); int fuse_dev_ioctl_iomap_support(struct file *file, struct fuse_iomap_support __user *argp); @@ -101,6 +103,7 @@ int fuse_dev_ioctl_iomap_set_nofs(struct file *file, uint32_t __user *argp); # define fuse_iomap_setsize_start(...) (-ENOSYS) # define fuse_iomap_fallocate(...) (-ENOSYS) # define fuse_iomap_flush_unmap_range(...) (-ENOSYS) +# define fuse_iomap_copied_file_range(...) 
((void)0) # define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP) # define fuse_iomap_dev_inval(...) (-ENOSYS) # define fuse_iomap_fadvise NULL diff --git a/fs/fuse/fuse_iomap_cache.h b/fs/fuse/fuse_iomap_cache.h index dcd52c183f22ab..eba90d9519b8c3 100644 --- a/fs/fuse/fuse_iomap_cache.h +++ b/fs/fuse/fuse_iomap_cache.h @@ -99,6 +99,15 @@ enum fuse_iomap_lookup_result fuse_iomap_cache_lookup(struct inode *inode, enum fuse_iomap_iodir iodir, loff_t off, uint64_t len, struct fuse_iomap_lookup *mval); + +int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset, + uint64_t length); +static inline int fuse_iomap_cache_invalidate(struct inode *inode, + loff_t offset) +{ + return fuse_iomap_cache_invalidate_range(inode, offset, + FUSE_IOMAP_INVAL_TO_EOF); +} #endif /* CONFIG_FUSE_IOMAP */ #endif /* _FS_FUSE_IOMAP_CACHE_H */ diff --git a/fs/fuse/file.c b/fs/fuse/file.c index eecd0610fbd3e5..3e9de61122f529 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -101,7 +101,7 @@ static void fuse_release_end(struct fuse_mount *fm, struct fuse_args *args, kfree(ra); } -static void fuse_file_put(struct fuse_file *ff, bool sync) +static void fuse_file_put(struct fuse_file *ff, struct inode *inode, bool sync) { if (refcount_dec_and_test(&ff->count)) { struct fuse_release_args *ra = &ff->args->release_args; @@ -391,7 +391,7 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff, * own ref to the file, the IO completion has to drop the ref, which is * how the fuse server can end up closing its clients' files. */ - fuse_file_put(ff, false); + fuse_file_put(ff, inode, false); } void fuse_release_common(struct file *file, bool isdir) @@ -422,7 +422,7 @@ void fuse_sync_release(struct fuse_inode *fi, struct fuse_file *ff, { WARN_ON(refcount_read(&ff->count) > 1); fuse_prepare_release(fi, ff, flags, FUSE_RELEASE, true); - fuse_file_put(ff, true); + fuse_file_put(ff, fi ? 
&fi->inode : NULL, true); } EXPORT_SYMBOL_GPL(fuse_sync_release); @@ -1050,7 +1050,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args, folio_put(ap->folios[i]); } if (ia->ff) - fuse_file_put(ia->ff, false); + fuse_file_put(ia->ff, inode, false); fuse_io_free(ia); } @@ -1912,7 +1912,7 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa) if (wpa->bucket) fuse_sync_bucket_dec(wpa->bucket); - fuse_file_put(wpa->ia.ff, false); + fuse_file_put(wpa->ia.ff, wpa->inode, false); kfree(ap->folios); kfree(wpa); @@ -2069,7 +2069,7 @@ int fuse_write_inode(struct inode *inode, struct writeback_control *wbc) ff = __fuse_write_file_get(fi); err = fuse_flush_times(inode, ff); if (ff) - fuse_file_put(ff, false); + fuse_file_put(ff, inode, false); return err; } @@ -2295,7 +2295,7 @@ static int fuse_iomap_writeback_submit(struct iomap_writepage_ctx *wpc, } if (data->ff) - fuse_file_put(data->ff, false); + fuse_file_put(data->ff, wpc->inode, false); return error; } @@ -3207,7 +3207,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in, goto out; } - if (!is_iomap) + if (is_iomap) + fuse_iomap_copied_file_range(inode_out, pos_out, bytes_copied); + else truncate_inode_pages_range(inode_out->i_mapping, ALIGN_DOWN(pos_out, PAGE_SIZE), ALIGN(pos_out + bytes_copied, PAGE_SIZE) - 1); diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index e1e7b98e591d16..6223b98890a83b 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -899,6 +899,8 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count, spin_lock(&fi->lock); fi->i_disk_size = max(fi->i_disk_size, pos + written); spin_unlock(&fi->lock); + + fuse_iomap_cache_invalidate_range(inode, pos, written); } else { fuse_iomap_inline_free(iomap); } @@ -989,7 +991,7 @@ fuse_iomap_setsize_finish( spin_lock(&fi->lock); fi->i_disk_size = newsize; spin_unlock(&fi->lock); - return 0; + return fuse_iomap_cache_invalidate(inode, newsize); } static int 
fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written, @@ -1067,12 +1069,14 @@ static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written, /* * If there weren't any ioend errors, update the incore isize, which - * confusingly takes the new i_size as "pos". + * confusingly takes the new i_size as "pos". Invalidate cached + * mappings for the file range that we just completed. */ spin_lock(&fi->lock); fi->i_disk_size = outarg.newsize; spin_unlock(&fi->lock); fuse_write_update_attr(inode, pos + written, written); + fuse_iomap_cache_invalidate_range(inode, pos, written); return 0; } @@ -1781,6 +1785,8 @@ void fuse_iomap_open_truncate(struct inode *inode) spin_lock(&fi->lock); fi->i_disk_size = 0; spin_unlock(&fi->lock); + + fuse_iomap_cache_invalidate(inode, 0); } struct fuse_writepage_ctx { @@ -2475,6 +2481,14 @@ fuse_iomap_fallocate( trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size); + if (mode & (FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_INSERT_RANGE)) + error = fuse_iomap_cache_invalidate(inode, offset); + else + error = fuse_iomap_cache_invalidate_range(inode, offset, + length); + if (error) + return error; + /* * If we unmapped blocks from the file range, then we zero the * pagecache for those regions and push them to disk rather than make @@ -2492,6 +2506,8 @@ fuse_iomap_fallocate( */ if (new_size) { error = fuse_iomap_setsize_start(inode, new_size); + if (!error) + error = fuse_iomap_setsize_finish(inode, new_size); if (error) return error; @@ -2613,3 +2629,11 @@ int fuse_dev_ioctl_iomap_set_nofs(struct file *file, uint32_t __user *argp) return -EINVAL; } } + +void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset, + u64 written) +{ + ASSERT(fuse_inode_has_iomap(inode)); + + fuse_iomap_cache_invalidate_range(inode, offset, written); +} diff --git a/fs/fuse/fuse_iomap_cache.c b/fs/fuse/fuse_iomap_cache.c index ea63a12a01c1fe..3d80c3eafad2cc 100644 --- a/fs/fuse/fuse_iomap_cache.c +++ 
b/fs/fuse/fuse_iomap_cache.c @@ -1444,6 +1444,33 @@ fuse_iomap_cache_remove( return ret; } +int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset, + uint64_t length) +{ + loff_t aligned_offset; + const unsigned int blocksize = i_blocksize(inode); + int ret, ret2; + + if (!fuse_inode_caches_iomaps(inode)) + return 0; + + aligned_offset = round_down(offset, blocksize); + if (length != FUSE_IOMAP_INVAL_TO_EOF) { + length += offset - aligned_offset; + length = round_up(length, blocksize); + } + + fuse_iomap_cache_lock(inode); + ret = fuse_iomap_cache_remove(inode, READ_MAPPING, + aligned_offset, length); + ret2 = fuse_iomap_cache_remove(inode, WRITE_MAPPING, + aligned_offset, length); + fuse_iomap_cache_unlock(inode); + if (ret) + return ret; + return ret2; +} + static void fuse_iext_add_mapping( struct fuse_iomap_cache *ic, ^ permalink raw reply related [flat|nested] 191+ messages in thread
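
The range rounding done by fuse_iomap_cache_invalidate_range can be tried out in isolation. What follows is a minimal userspace sketch, assuming a power-of-two block size. The names demo_range and demo_align_range are hypothetical, and the bit masks stand in for the kernel's round_down()/round_up() helpers. Only the (~0ULL) value of FUSE_IOMAP_INVAL_TO_EOF is taken from the uapi header in this series.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as FUSE_IOMAP_INVAL_TO_EOF in the uapi header of this series. */
#define DEMO_INVAL_TO_EOF	(~0ULL)

struct demo_range {
	uint64_t offset;
	uint64_t length;
};

/* Widen [offset, offset + length) outward to block boundaries. */
static struct demo_range demo_align_range(uint64_t offset, uint64_t length,
					  uint64_t blocksize)
{
	struct demo_range r;

	/* The mask arithmetic below only works for power-of-two sizes. */
	assert(blocksize && !(blocksize & (blocksize - 1)));

	r.offset = offset & ~(blocksize - 1);		/* round down */
	r.length = length;
	if (length != DEMO_INVAL_TO_EOF) {
		/* Cover the bytes we backed up over, then round up. */
		r.length += offset - r.offset;
		r.length = (r.length + blocksize - 1) & ~(blocksize - 1);
	}
	return r;
}
```

With 4096-byte blocks, a 100-byte range at offset 5000 becomes the single block at offset 4096, and a 200-byte range at offset 4000 straddles a boundary and grows to two blocks starting at 0. The to-EOF sentinel is passed through untouched so that it is never mistaken for a byte count.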
* [PATCH 06/12] fuse_trace: invalidate iomap cache after file updates
2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (4 preceding siblings ...)
2026-04-29 14:36 ` [PATCH 05/12] fuse: invalidate iomap cache after file updates Darrick J. Wong
@ 2026-04-29 14:36 ` Darrick J. Wong
2026-04-29 14:37 ` [PATCH 07/12] fuse: enable iomap cache management Darrick J. Wong
` (5 subsequent siblings)
11 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:36 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel
From: Darrick J. Wong <djwong@kernel.org>

Add tracepoints for the previous patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h       |   20 ++++++++++++++++++++
 fs/fuse/fuse_iomap.c       |    2 ++
 fs/fuse/fuse_iomap_cache.c |    2 ++
 3 files changed, 24 insertions(+)

diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index bf47008e50920a..ddcbefd33a4024 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -932,6 +932,7 @@ DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_up);
 DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down);
 DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range);
 DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_flush_unmap_range);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_cache_invalidate_range);
 
 TRACE_EVENT(fuse_iomap_end_ioend,
 	TP_PROTO(const struct iomap_ioend *ioend),
@@ -1248,6 +1249,25 @@ DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
 
+TRACE_EVENT(fuse_iomap_copied_file_range,
+	TP_PROTO(const struct inode *inode, loff_t offset,
+		 size_t written),
+	TP_ARGS(inode, offset, written),
+
+	TP_STRUCT__entry(
+		FUSE_IO_RANGE_FIELDS()
+	),
+
+	TP_fast_assign(
+		FUSE_INODE_ASSIGN(inode, fi, fm);
+		__entry->offset		=	offset;
+		__entry->length		=	written;
+	),
+
+	TP_printk(FUSE_IO_RANGE_FMT(),
+		  FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
 DECLARE_EVENT_CLASS(fuse_iext_class,
 	TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur,
 		 int state, unsigned long caller_ip),
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 6223b98890a83b..6d6c42fdaaac5b 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -2635,5 +2635,7 @@ void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
 {
 	ASSERT(fuse_inode_has_iomap(inode));
 
+	trace_fuse_iomap_copied_file_range(inode, offset, written);
+
 	fuse_iomap_cache_invalidate_range(inode, offset, written);
 }
diff --git a/fs/fuse/fuse_iomap_cache.c b/fs/fuse/fuse_iomap_cache.c
index 3d80c3eafad2cc..7d6944a6106f07 100644
--- a/fs/fuse/fuse_iomap_cache.c
+++ b/fs/fuse/fuse_iomap_cache.c
@@ -1454,6 +1454,8 @@ int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset,
 	if (!fuse_inode_caches_iomaps(inode))
 		return 0;
 
+	trace_fuse_iomap_cache_invalidate_range(inode, offset, length);
+
 	aligned_offset = round_down(offset, blocksize);
 	if (length != FUSE_IOMAP_INVAL_TO_EOF) {
 		length += offset - aligned_offset;

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 07/12] fuse: enable iomap cache management 2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (5 preceding siblings ...) 2026-04-29 14:36 ` [PATCH 06/12] fuse_trace: " Darrick J. Wong @ 2026-04-29 14:37 ` Darrick J. Wong 2026-04-29 14:37 ` [PATCH 08/12] fuse_trace: " Darrick J. Wong ` (4 subsequent siblings) 11 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:37 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Provide a means for the fuse server to upload iomappings to the kernel and invalidate them. This is how we enable iomap caching for better performance. This is also required for correct synchronization between pagecache writes and writeback. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_iomap.h | 7 + include/uapi/linux/fuse.h | 29 +++++ fs/fuse/dev.c | 46 ++++++++ fs/fuse/fuse_iomap.c | 264 ++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 343 insertions(+), 3 deletions(-) diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 5cdf7b311dba42..79625897dded50 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -79,6 +79,11 @@ int fuse_iomap_dev_inval(struct fuse_conn *fc, int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice); int fuse_dev_ioctl_iomap_set_nofs(struct file *file, uint32_t __user *argp); + +int fuse_iomap_upsert_mappings(struct fuse_conn *fc, + const struct fuse_iomap_upsert_mappings_out *outarg); +int fuse_iomap_inval_mappings(struct fuse_conn *fc, + const struct fuse_iomap_inval_mappings_out *outarg); #else # define fuse_iomap_enabled(...) (false) # define fuse_has_iomap(...) (false) @@ -108,6 +113,8 @@ int fuse_dev_ioctl_iomap_set_nofs(struct file *file, uint32_t __user *argp); # define fuse_iomap_dev_inval(...) 
(-ENOSYS)
 # define fuse_iomap_fadvise			NULL
 # define fuse_dev_ioctl_iomap_set_nofs(...)	(-EOPNOTSUPP)
+# define fuse_iomap_upsert_mappings(...)	(-ENOSYS)
+# define fuse_iomap_inval_mappings(...)	(-ENOSYS)
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _FS_FUSE_IOMAP_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index a273838bc20f2f..8c5e67731b21b8 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -251,6 +251,8 @@
  *  - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
  *  - add FUSE_ATTR_{SYNC,IMMUTABLE,APPEND} for VFS enforcement of file
  *    attributes
+ *  - add FUSE_NOTIFY_IOMAP_{UPSERT,INVAL}_MAPPINGS so fuse servers can cache
+ *    file range mappings in the kernel for iomap
  */
 
 #ifndef _LINUX_FUSE_H
@@ -731,6 +733,8 @@ enum fuse_notify_code {
 	FUSE_NOTIFY_INC_EPOCH = 8,
 	FUSE_NOTIFY_PRUNE = 9,
 	FUSE_NOTIFY_IOMAP_DEV_INVAL = 99,
+	FUSE_NOTIFY_IOMAP_UPSERT_MAPPINGS = 100,
+	FUSE_NOTIFY_IOMAP_INVAL_MAPPINGS = 101,
 	FUSE_NOTIFY_CODE_MAX,
 };
 
@@ -1396,6 +1400,8 @@ struct fuse_uring_cmd_req {
 #define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(255)
 /* fuse-specific mapping type saying the server has populated the cache */
 #define FUSE_IOMAP_TYPE_RETRY_CACHE	(254)
+/* do not upsert this mapping */
+#define FUSE_IOMAP_TYPE_NOCACHE		(253)
 
 #define FUSE_IOMAP_DEV_NULL		(0U)	/* null device cookie */
 
@@ -1556,4 +1562,27 @@ struct fuse_iomap_dev_inval_out {
 /* invalidate all cached iomap mappings up to EOF */
 #define FUSE_IOMAP_INVAL_TO_EOF		(~0ULL)
 
+struct fuse_iomap_inval_mappings_out {
+	uint64_t nodeid;	/* Inode ID */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+
+	/*
+	 * Range of read and write mappings to invalidate.  Zero length means
+	 * ignore the range; and FUSE_IOMAP_INVAL_TO_EOF can be used for length.
+ */ + struct fuse_range read; + struct fuse_range write; +}; + +struct fuse_iomap_upsert_mappings_out { + uint64_t nodeid; /* Inode ID */ + uint64_t attr_ino; /* matches fuse_attr:ino */ + + /* read file data from here */ + struct fuse_iomap_io read; + + /* write file data to here, if applicable */ + struct fuse_iomap_io write; +}; + #endif /* _LINUX_FUSE_H */ diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index cf4bad6ffc287b..fcee1a23375cee 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1872,6 +1872,48 @@ static int fuse_notify_iomap_dev_inval(struct fuse_conn *fc, unsigned int size, return err; } +static int fuse_notify_iomap_upsert_mappings(struct fuse_conn *fc, + unsigned int size, + struct fuse_copy_state *cs) +{ + struct fuse_iomap_upsert_mappings_out outarg; + int err = -EINVAL; + + if (size != sizeof(outarg)) + goto err; + + err = fuse_copy_one(cs, &outarg, sizeof(outarg)); + if (err) + goto err; + fuse_copy_finish(cs); + + return fuse_iomap_upsert_mappings(fc, &outarg); +err: + fuse_copy_finish(cs); + return err; +} + +static int fuse_notify_iomap_inval_mappings(struct fuse_conn *fc, + unsigned int size, + struct fuse_copy_state *cs) +{ + struct fuse_iomap_inval_mappings_out outarg; + int err = -EINVAL; + + if (size != sizeof(outarg)) + goto err; + + err = fuse_copy_one(cs, &outarg, sizeof(outarg)); + if (err) + goto err; + fuse_copy_finish(cs); + + return fuse_iomap_inval_mappings(fc, &outarg); +err: + fuse_copy_finish(cs); + return err; +} + struct fuse_retrieve_args { struct fuse_args_pages ap; struct fuse_notify_retrieve_in inarg; @@ -2164,6 +2206,10 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code, case FUSE_NOTIFY_IOMAP_DEV_INVAL: return fuse_notify_iomap_dev_inval(fc, size, cs); + case FUSE_NOTIFY_IOMAP_UPSERT_MAPPINGS: + return fuse_notify_iomap_upsert_mappings(fc, size, cs); + case FUSE_NOTIFY_IOMAP_INVAL_MAPPINGS: + return fuse_notify_iomap_inval_mappings(fc, size, cs); default: return -EINVAL; diff --git 
a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 6d6c42fdaaac5b..8e747084d81b28 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -132,6 +132,7 @@ static inline bool fuse_iomap_check_type(uint16_t fuse_type) case FUSE_IOMAP_TYPE_INLINE: case FUSE_IOMAP_TYPE_PURE_OVERWRITE: case FUSE_IOMAP_TYPE_RETRY_CACHE: + case FUSE_IOMAP_TYPE_NOCACHE: return true; } @@ -241,8 +242,8 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode, uint64_t end; /* - * Type and flags must be known. Mapping type "retry cache" doesn't - * use any of the other fields. + * Type and flags must be known. Mapping types "retry cache" and "do + * not insert in cache" don't use any of the other fields. */ if (BAD_DATA(!fuse_iomap_check_type(map->type))) return false; @@ -255,6 +256,8 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode, return false; return true; } + if (map->type == FUSE_IOMAP_TYPE_NOCACHE) + return true; if (BAD_DATA(!fuse_iomap_check_flags(map->flags))) return false; @@ -299,6 +302,7 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode, if (BAD_DATA(iodir != WRITE_MAPPING)) return false; break; + case FUSE_IOMAP_TYPE_NOCACHE: case FUSE_IOMAP_TYPE_RETRY_CACHE: default: /* should have been caught already */ @@ -373,6 +377,15 @@ fuse_iomap_begin_validate(const struct inode *inode, if (!fuse_iomap_check_mapping(inode, &outarg->write, WRITE_MAPPING)) return -EFSCORRUPTED; + /* + * ->iomap_begin requires real mappings or "retry from cache"; "do not + * add to cache" does not apply here. + */ + if (BAD_DATA(outarg->read.type == FUSE_IOMAP_TYPE_NOCACHE)) + return -EFSCORRUPTED; + if (BAD_DATA(outarg->write.type == FUSE_IOMAP_TYPE_NOCACHE)) + return -EFSCORRUPTED; + /* * Must have returned a mapping for at least the first byte in the * range. 
The main mapping check already validated that the length @@ -606,9 +619,11 @@ fuse_iomap_cached_validate(const struct inode *inode, if (!fuse_iomap_check_mapping(inode, &lmap->map, dir)) return -EFSCORRUPTED; - /* The cache should not be storing "retry cache" mappings */ + /* The cache should not be storing cache management mappings */ if (BAD_DATA(lmap->map.type == FUSE_IOMAP_TYPE_RETRY_CACHE)) return -EFSCORRUPTED; + if (BAD_DATA(lmap->map.type == FUSE_IOMAP_TYPE_NOCACHE)) + return -EFSCORRUPTED; return 0; } @@ -2639,3 +2654,246 @@ void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset, fuse_iomap_cache_invalidate_range(inode, offset, written); } + +static inline int +fuse_iomap_upsert_validate_dev( + const struct fuse_backing *fb, + const struct fuse_iomap_io *map) +{ + uint64_t map_end; + sector_t device_bytes; + + if (!fb) { + if (BAD_DATA(map->addr != FUSE_IOMAP_NULL_ADDR)) + return -EFSCORRUPTED; + + return 0; + } + + if (BAD_DATA(map->addr == FUSE_IOMAP_NULL_ADDR)) + return -EFSCORRUPTED; + + if (BAD_DATA(check_add_overflow(map->addr, map->length, &map_end))) + return -EFSCORRUPTED; + + /* + * bdev_nr_sectors() == 0 usually means the device has gone away from + * underneath us. We won't cache this mapping, but we'll return + * -EINVAL to signal a softer error to the fuse server than "your fs + * metadata are corrupt". If the fuse server persists anyway, then + * the worst that happens is that the IO will fail. 
+ */ + device_bytes = bdev_nr_sectors(fb->bdev) << SECTOR_SHIFT; + if (!device_bytes) + return -EINVAL; + + if (BAD_DATA(map_end > device_bytes)) + return -EFSCORRUPTED; + + return 0; +} + +/* Validate one of the incoming upsert mappings */ +static inline int +fuse_iomap_upsert_validate_mapping(struct inode *inode, + enum fuse_iomap_iodir iodir, + const struct fuse_iomap_io *map) +{ + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_backing *fb; + int ret; + + if (!fuse_iomap_check_mapping(inode, map, iodir)) + return -EFSCORRUPTED; + + /* + * A "retry cache" instruction makes no sense when we're adding to + * the mapping cache. + */ + if (BAD_DATA(map->type == FUSE_IOMAP_TYPE_RETRY_CACHE)) + return -EFSCORRUPTED; + + /* nocache is allowed, because we ignore it later */ + if (map->type == FUSE_IOMAP_TYPE_NOCACHE) + return 0; + + /* Make sure we can find the device */ + fb = fuse_iomap_find_dev(fc, map); + if (BAD_DATA(IS_ERR(fb))) + return -EFSCORRUPTED; + + ret = fuse_iomap_upsert_validate_dev(fb, map); + fuse_backing_put(fb); + return ret; +} + +/* Check the incoming upsert mappings to make sure they're not nonsense */ +static inline int +fuse_iomap_upsert_validate_mappings(struct inode *inode, + const struct fuse_iomap_upsert_mappings_out *outarg) +{ + int ret = fuse_iomap_upsert_validate_mapping(inode, READ_MAPPING, + &outarg->read); + if (ret) + return ret; + + return fuse_iomap_upsert_validate_mapping(inode, WRITE_MAPPING, + &outarg->write); +} + +static int fuse_iomap_upsert_inode(struct inode *inode, + const struct fuse_iomap_upsert_mappings_out *outarg) +{ + int ret = fuse_iomap_upsert_validate_mappings(inode, outarg); + if (ret) + return ret; + + if (!fuse_inode_caches_iomaps(inode)) { + ret = fuse_iomap_cache_alloc(inode); + if (ret) + return ret; + } + + fuse_iomap_cache_lock(inode); + + if (outarg->read.type != FUSE_IOMAP_TYPE_NOCACHE) { + ret = fuse_iomap_cache_upsert(inode, READ_MAPPING, + &outarg->read); + if (ret) + goto out_unlock; + } + 
+ if (outarg->write.type != FUSE_IOMAP_TYPE_NOCACHE) { + ret = fuse_iomap_cache_upsert(inode, WRITE_MAPPING, + &outarg->write); + if (ret) + goto out_unlock; + } + +out_unlock: + fuse_iomap_cache_unlock(inode); + return ret; +} + +int fuse_iomap_upsert_mappings(struct fuse_conn *fc, + const struct fuse_iomap_upsert_mappings_out *outarg) +{ + struct inode *inode; + struct fuse_inode *fi; + int ret; + + if (!fc->iomap) + return -EINVAL; + + down_read(&fc->killsb); + inode = fuse_ilookup(fc, outarg->nodeid, NULL); + if (!inode) { + ret = -ESTALE; + goto out_sb; + } + + fi = get_fuse_inode(inode); + if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) { + ret = -EINVAL; + goto out_inode; + } + + if (fuse_is_bad(inode)) { + ret = -EIO; + goto out_inode; + } + + ret = fuse_iomap_upsert_inode(inode, outarg); +out_inode: + iput(inode); +out_sb: + up_read(&fc->killsb); + return ret; +} + +static inline bool +fuse_iomap_inval_validate_range(const struct inode *inode, + const struct fuse_range *range) +{ + const unsigned int blocksize = i_blocksize(inode); + + if (range->length == 0) + return true; + + /* Range can't start beyond maxbytes */ + if (BAD_DATA(range->offset >= inode->i_sb->s_maxbytes)) + return false; + + /* File range must be aligned to blocksize */ + if (BAD_DATA(!IS_ALIGNED(range->offset, blocksize))) + return false; + if (range->length != FUSE_IOMAP_INVAL_TO_EOF && + BAD_DATA(!IS_ALIGNED(range->length, blocksize))) + return false; + + return true; +} + +static int fuse_iomap_inval_inode(struct inode *inode, + const struct fuse_iomap_inval_mappings_out *outarg) +{ + int ret = 0, ret2 = 0; + + if (!fuse_iomap_inval_validate_range(inode, &outarg->write)) + return -EFSCORRUPTED; + + if (!fuse_iomap_inval_validate_range(inode, &outarg->read)) + return -EFSCORRUPTED; + + if (!fuse_inode_caches_iomaps(inode)) + return 0; + + fuse_iomap_cache_lock(inode); + if (outarg->read.length) + ret2 = fuse_iomap_cache_remove(inode, READ_MAPPING, + outarg->read.offset, + 
outarg->read.length); + if (outarg->write.length) + ret = fuse_iomap_cache_remove(inode, WRITE_MAPPING, + outarg->write.offset, + outarg->write.length); + fuse_iomap_cache_unlock(inode); + + return ret ? ret : ret2; +} + +int fuse_iomap_inval_mappings(struct fuse_conn *fc, + const struct fuse_iomap_inval_mappings_out *outarg) +{ + struct inode *inode; + struct fuse_inode *fi; + int ret; + + if (!fc->iomap) + return -EINVAL; + + down_read(&fc->killsb); + inode = fuse_ilookup(fc, outarg->nodeid, NULL); + if (!inode) { + ret = -ESTALE; + goto out_sb; + } + + fi = get_fuse_inode(inode); + if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) { + ret = -EINVAL; + goto out_inode; + } + + if (fuse_is_bad(inode)) { + ret = -EIO; + goto out_inode; + } + + ret = fuse_iomap_inval_inode(inode, outarg); +out_inode: + iput(inode); +out_sb: + up_read(&fc->killsb); + return ret; +} ^ permalink raw reply related [flat|nested] 191+ messages in thread
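As a reading aid for the invalidation path in patch 07/12, the range checks in fuse_iomap_inval_validate_range() can be modelled in userspace. This is a minimal sketch, not the kernel code. The sketch_* names are hypothetical, the block size is assumed to be a power of two, and the value of the to-EOF sentinel is an assumption because the real FUSE_IOMAP_INVAL_TO_EOF definition does not appear in this part of the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sentinel; the real FUSE_IOMAP_INVAL_TO_EOF value is not shown. */
#define SKETCH_INVAL_TO_EOF	UINT64_MAX

/* Hypothetical stand-in for struct fuse_range. */
struct sketch_range {
	uint64_t offset;
	uint64_t length;
};

/* True if v is a multiple of blocksize (blocksize must be a power of two). */
static bool sketch_is_aligned(uint64_t v, uint32_t blocksize)
{
	return (v & ((uint64_t)blocksize - 1)) == 0;
}

/* Mirror of the checks applied to each invalidation range. */
static bool sketch_inval_range_ok(const struct sketch_range *r,
				  uint32_t blocksize, uint64_t maxbytes)
{
	/* An empty range means "nothing to invalidate" and is always fine. */
	if (r->length == 0)
		return true;

	/* The range cannot start at or beyond the maximum file size. */
	if (r->offset >= maxbytes)
		return false;

	/* The start must be block aligned. */
	if (!sketch_is_aligned(r->offset, blocksize))
		return false;

	/* The length must be block aligned unless it means "to EOF". */
	if (r->length != SKETCH_INVAL_TO_EOF &&
	    !sketch_is_aligned(r->length, blocksize))
		return false;

	return true;
}
```

With a 4096-byte block size, offset 4096 and length 8192 pass, while an offset of 100 is rejected. In the patch, a rejected range makes the whole notification fail with -EFSCORRUPTED before any cache is touched.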
* [PATCH 08/12] fuse_trace: enable iomap cache management 2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (6 preceding siblings ...) 2026-04-29 14:37 ` [PATCH 07/12] fuse: enable iomap cache management Darrick J. Wong @ 2026-04-29 14:37 ` Darrick J. Wong 2026-04-29 14:37 ` [PATCH 09/12] fuse: overlay iomap inode info in struct fuse_inode Darrick J. Wong ` (3 subsequent siblings) 11 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:37 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Add tracepoints for the previous patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_trace.h | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++ fs/fuse/fuse_iomap.c | 4 +++ 2 files changed, 71 insertions(+) diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index ddcbefd33a4024..09da9bce61b98c 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -405,6 +405,7 @@ struct fuse_iomap_lookup; #define FUSE_IOMAP_TYPE_STRINGS \ { FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \ { FUSE_IOMAP_TYPE_RETRY_CACHE, "retry" }, \ + { FUSE_IOMAP_TYPE_NOCACHE, "nocache" }, \ { FUSE_IOMAP_TYPE_HOLE, "hole" }, \ { FUSE_IOMAP_TYPE_DELALLOC, "delalloc" }, \ { FUSE_IOMAP_TYPE_MAPPED, "mapped" }, \ @@ -1563,6 +1564,72 @@ TRACE_EVENT(fuse_iomap_invalid, __entry->old_validity_cookie, __entry->validity_cookie) ); + +TRACE_EVENT(fuse_iomap_upsert_mappings, + TP_PROTO(const struct inode *inode, + const struct fuse_iomap_upsert_mappings_out *outarg), + TP_ARGS(inode, outarg), + + TP_STRUCT__entry( + FUSE_INODE_FIELDS + __field(uint64_t, attr_ino) + + FUSE_IOMAP_MAP_FIELDS(read) + FUSE_IOMAP_MAP_FIELDS(write) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->attr_ino = outarg->attr_ino; + __entry->readoffset = outarg->read.offset; + __entry->readlength = 
outarg->read.length; + __entry->readaddr = outarg->read.addr; + __entry->readtype = outarg->read.type; + __entry->readflags = outarg->read.flags; + __entry->readdev = outarg->read.dev; + __entry->writeoffset = outarg->write.offset; + __entry->writelength = outarg->write.length; + __entry->writeaddr = outarg->write.addr; + __entry->writetype = outarg->write.type; + __entry->writeflags = outarg->write.flags; + __entry->writedev = outarg->write.dev; + ), + + TP_printk(FUSE_INODE_FMT " attr_ino 0x%llx" FUSE_IOMAP_MAP_FMT("read") FUSE_IOMAP_MAP_FMT("write"), + FUSE_INODE_PRINTK_ARGS, + __entry->attr_ino, + FUSE_IOMAP_MAP_PRINTK_ARGS(read), + FUSE_IOMAP_MAP_PRINTK_ARGS(write)) +); + +TRACE_EVENT(fuse_iomap_inval_mappings, + TP_PROTO(const struct inode *inode, + const struct fuse_iomap_inval_mappings_out *outarg), + TP_ARGS(inode, outarg), + + TP_STRUCT__entry( + FUSE_INODE_FIELDS + __field(uint64_t, attr_ino) + + FUSE_FILE_RANGE_FIELDS(read) + FUSE_FILE_RANGE_FIELDS(write) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->attr_ino = outarg->attr_ino; + __entry->readoffset = outarg->read.offset; + __entry->readlength = outarg->read.length; + __entry->writeoffset = outarg->write.offset; + __entry->writelength = outarg->write.length; + ), + + TP_printk(FUSE_INODE_FMT " attr_ino 0x%llx" FUSE_FILE_RANGE_FMT("read") FUSE_FILE_RANGE_FMT("write"), + FUSE_INODE_PRINTK_ARGS, + __entry->attr_ino, + FUSE_FILE_RANGE_PRINTK_ARGS(read), + FUSE_FILE_RANGE_PRINTK_ARGS(write)) +); #endif /* CONFIG_FUSE_IOMAP */ #endif /* _TRACE_FUSE_H */ diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 8e747084d81b28..075fa62f856495 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -2792,6 +2792,8 @@ int fuse_iomap_upsert_mappings(struct fuse_conn *fc, goto out_sb; } + trace_fuse_iomap_upsert_mappings(inode, outarg); + fi = get_fuse_inode(inode); if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) { ret = -EINVAL; @@ -2879,6 +2881,8 @@ int 
fuse_iomap_inval_mappings(struct fuse_conn *fc, goto out_sb; } + trace_fuse_iomap_inval_mappings(inode, outarg); + fi = get_fuse_inode(inode); if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) { ret = -EINVAL; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 09/12] fuse: overlay iomap inode info in struct fuse_inode 2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (7 preceding siblings ...) 2026-04-29 14:37 ` [PATCH 08/12] fuse_trace: " Darrick J. Wong @ 2026-04-29 14:37 ` Darrick J. Wong 2026-04-29 14:38 ` [PATCH 10/12] fuse: constrain iomap mapping cache size Darrick J. Wong ` (2 subsequent siblings) 11 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:37 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> It's not possible for a regular file to use iomap mode and writeback caching at the same time, so we can save some memory in struct fuse_inode by overlaying them in the union. This is a separate patch because C unions are rather unsafe and I prefer any errors to be bisectable to this patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 1bf5dd373153e5..c6a7ca4f1df76c 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -188,8 +188,11 @@ struct fuse_inode { /* waitq for direct-io completion */ wait_queue_head_t direct_io_waitq; + }; #ifdef CONFIG_FUSE_IOMAP + /* regular file iomap mode */ + struct { /* file size as reported by fuse server */ loff_t i_disk_size; @@ -200,8 +203,8 @@ struct fuse_inode { /* cached iomap mappings */ struct fuse_iomap_cache *cache; -#endif }; +#endif /* readdir cache (directory only) */ struct { ^ permalink raw reply related [flat|nested] 191+ messages in thread
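The memory argument in patch 09/12 can be illustrated with a small standalone sketch. The structures below are hypothetical stand-ins, not the real struct fuse_inode members, and the field choices are assumptions made only to show the size effect of overlaying two mutually exclusive states in a union.

```c
#include <stdint.h>

/* Hypothetical stand-in for the writeback-cache fields of a regular file. */
struct sketch_writeback_state {
	uint64_t writectr;
	void *queued_writes;
	void *direct_io_waitq;
};

/* Hypothetical stand-in for the iomap-mode fields of a regular file. */
struct sketch_iomap_state {
	int64_t i_disk_size;
	void *cache;
};

/* Side by side: every inode pays for both sets of fields. */
struct sketch_inode_separate {
	struct sketch_writeback_state wb;
	struct sketch_iomap_state iomap;
};

/* Overlaid: every inode pays only for the larger of the two. */
struct sketch_inode_overlaid {
	union {
		struct sketch_writeback_state wb;
		struct sketch_iomap_state iomap;
	} u;
};
```

The saving comes with the usual union hazard: code that writes through one member and later reads through the other sees garbage, which is presumably why the change is isolated in its own patch.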
* [PATCH 10/12] fuse: constrain iomap mapping cache size 2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (8 preceding siblings ...) 2026-04-29 14:37 ` [PATCH 09/12] fuse: overlay iomap inode info in struct fuse_inode Darrick J. Wong @ 2026-04-29 14:38 ` Darrick J. Wong 2026-04-29 14:38 ` [PATCH 11/12] fuse_trace: " Darrick J. Wong 2026-04-29 14:38 ` [PATCH 12/12] fuse: enable iomap Darrick J. Wong 11 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:38 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Provide a means to constrain the size of each iomap mapping cache. Most files (at least on XFS) don't have more than 1000 extents, so we'll set the absolute maximum to 10000 and let the fuse server set a lower limit if it so desires. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 3 +++ fs/fuse/fuse_iomap_cache.h | 8 ++++++++ include/uapi/linux/fuse.h | 7 ++++++- fs/fuse/fuse_iomap.c | 9 ++++++++- fs/fuse/fuse_iomap_cache.c | 24 ++++++++++++++++++++++++ 5 files changed, 49 insertions(+), 2 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index c6a7ca4f1df76c..91f399a7a9a990 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -685,6 +685,9 @@ struct fuse_iomap_conn { /* fuse server doesn't implement iomap_ioend */ unsigned int no_ioend:1; + + /* maximum mapping cache size */ + unsigned int cache_maxbytes; }; #endif diff --git a/fs/fuse/fuse_iomap_cache.h b/fs/fuse/fuse_iomap_cache.h index eba90d9519b8c3..cb312b0720d6b7 100644 --- a/fs/fuse/fuse_iomap_cache.h +++ b/fs/fuse/fuse_iomap_cache.h @@ -108,6 +108,14 @@ static inline int fuse_iomap_cache_invalidate(struct inode *inode, return fuse_iomap_cache_invalidate_range(inode, offset, FUSE_IOMAP_INVAL_TO_EOF); } + +/* absolute maximum memory consumption per iomap mapping cache */ 
+#define FUSE_IOMAP_CACHE_MAX_MAXBYTES (SZ_2M) + +/* default maximum memory consumption per iomap mapping cache */ +#define FUSE_IOMAP_CACHE_DEFAULT_MAXBYTES (SZ_256K) + +void fuse_iomap_cache_set_maxbytes(struct fuse_conn *fc, unsigned int maxbytes); #endif /* CONFIG_FUSE_IOMAP */ #endif /* _FS_FUSE_IOMAP_CACHE_H */ diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 8c5e67731b21b8..035e1a59ce50d3 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -1512,7 +1512,9 @@ struct fuse_iomap_ioend_out { struct fuse_iomap_config_in { uint64_t flags; /* supported FUSE_IOMAP_CONFIG_* flags */ int64_t maxbytes; /* maximum supported file size */ - uint64_t padding[6]; /* zero */ + uint32_t cache_maxbytes; /* mapping cache maxbytes */ + uint32_t zero; /* zero */ + uint64_t padding[5]; /* zero */ }; /* Which fields are set in fuse_iomap_config_out? */ @@ -1522,6 +1524,7 @@ struct fuse_iomap_config_in { #define FUSE_IOMAP_CONFIG_MAX_LINKS (1 << 3ULL) #define FUSE_IOMAP_CONFIG_TIME (1 << 4ULL) #define FUSE_IOMAP_CONFIG_MAXBYTES (1 << 5ULL) +#define FUSE_IOMAP_CONFIG_CACHE_MAXBYTES (1 << 6ULL) struct fuse_iomap_config_out { uint64_t flags; /* FUSE_IOMAP_CONFIG_* */ @@ -1544,6 +1547,8 @@ struct fuse_iomap_config_out { int64_t s_time_max; int64_t s_maxbytes; /* max file size */ + + uint32_t cache_maxbytes; /* mapping cache maximum size */ }; struct fuse_range { diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 075fa62f856495..22b23a3fa2ae2c 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -1147,7 +1147,8 @@ struct fuse_iomap_config_args { FUSE_IOMAP_CONFIG_BLOCKSIZE | \ FUSE_IOMAP_CONFIG_MAX_LINKS | \ FUSE_IOMAP_CONFIG_TIME | \ - FUSE_IOMAP_CONFIG_MAXBYTES) + FUSE_IOMAP_CONFIG_MAXBYTES | \ + FUSE_IOMAP_CONFIG_CACHE_MAXBYTES) static int fuse_iomap_process_config(struct fuse_mount *fm, int error, const struct fuse_iomap_config_out *outarg) @@ -1212,6 +1213,9 @@ static int fuse_iomap_process_config(struct fuse_mount 
*fm, int error, if (outarg->flags & FUSE_IOMAP_CONFIG_MAXBYTES) sb->s_maxbytes = outarg->s_maxbytes; + if (outarg->flags & FUSE_IOMAP_CONFIG_CACHE_MAXBYTES) + fuse_iomap_cache_set_maxbytes(fm->fc, outarg->cache_maxbytes); + return 0; } @@ -1273,6 +1277,9 @@ fuse_iomap_new_mount(struct fuse_mount *fm) ia->inarg.maxbytes = MAX_LFS_FILESIZE; ia->inarg.flags = FUSE_IOMAP_CONFIG_ALL; + fm->fc->iomap_conn.cache_maxbytes = FUSE_IOMAP_CACHE_DEFAULT_MAXBYTES; + ia->inarg.cache_maxbytes = fm->fc->iomap_conn.cache_maxbytes; + ia->args.opcode = FUSE_IOMAP_CONFIG; ia->args.nodeid = 0; ia->args.in_numargs = 1; diff --git a/fs/fuse/fuse_iomap_cache.c b/fs/fuse/fuse_iomap_cache.c index 7d6944a6106f07..a8e16302ce4405 100644 --- a/fs/fuse/fuse_iomap_cache.c +++ b/fs/fuse/fuse_iomap_cache.c @@ -1654,6 +1654,28 @@ void fuse_iomap_cache_free(struct inode *inode) kfree(ic); } +void fuse_iomap_cache_set_maxbytes(struct fuse_conn *fc, unsigned int maxbytes) +{ + if (!maxbytes) + return; + + fc->iomap_conn.cache_maxbytes = clamp(maxbytes, NODE_SIZE, + FUSE_IOMAP_CACHE_MAX_MAXBYTES); +} + +static void +fuse_iomap_cache_cleanup( + struct inode *inode, + enum fuse_iomap_iodir iodir) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_mount *fm = get_fuse_mount(inode); + struct fuse_iext_root *ir = fuse_iext_root_ptr(fi->cache, iodir); + + if (ir && ir->ir_bytes > fm->fc->iomap_conn.cache_maxbytes) + fuse_iext_destroy(ir); +} + int fuse_iomap_cache_upsert( struct inode *inode, @@ -1680,6 +1702,8 @@ fuse_iomap_cache_upsert( if (err) return err; + fuse_iomap_cache_cleanup(inode, iodir); + return fuse_iomap_cache_add(inode, iodir, map); } ^ permalink raw reply related [flat|nested] 191+ messages in thread
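The sizing policy in patch 10/12 can be summarised with a short userspace model. This is a sketch under stated assumptions: the default and maximum come from the SZ_256K and SZ_2M definitions in the patch, but NODE_SIZE is not shown in this part of the thread, so the value 256 below is a placeholder, and the sketch_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NODE_SIZE		256u			/* assumed placeholder */
#define SKETCH_CACHE_DEFAULT_MAXBYTES	(256u * 1024u)		/* SZ_256K */
#define SKETCH_CACHE_MAX_MAXBYTES	(2u * 1024u * 1024u)	/* SZ_2M */

/*
 * Model of fuse_iomap_cache_set_maxbytes(): a request of zero leaves the
 * current limit alone; anything else is clamped to [NODE_SIZE, 2M].
 */
static uint32_t sketch_effective_maxbytes(uint32_t current, uint32_t requested)
{
	if (requested == 0)
		return current;
	if (requested < SKETCH_NODE_SIZE)
		return SKETCH_NODE_SIZE;
	if (requested > SKETCH_CACHE_MAX_MAXBYTES)
		return SKETCH_CACHE_MAX_MAXBYTES;
	return requested;
}

/*
 * Model of the check in fuse_iomap_cache_cleanup(): the whole per-direction
 * mapping tree is dropped once it has grown past the limit.
 */
static bool sketch_cache_should_flush(uint64_t cache_bytes, uint32_t maxbytes)
{
	return cache_bytes > maxbytes;
}
```

Note that the cleanup step in the patch is all-or-nothing: it runs before each upsert and destroys the entire tree for that I/O direction rather than evicting individual mappings.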
* [PATCH 11/12] fuse_trace: constrain iomap mapping cache size 2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (9 preceding siblings ...) 2026-04-29 14:38 ` [PATCH 10/12] fuse: constrain iomap mapping cache size Darrick J. Wong @ 2026-04-29 14:38 ` Darrick J. Wong 2026-04-29 14:38 ` [PATCH 12/12] fuse: enable iomap Darrick J. Wong 11 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:38 UTC (permalink / raw) To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel From: Darrick J. Wong <djwong@kernel.org> Add tracepoints for the previous patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/fuse/fuse_trace.h | 31 +++++++++++++++++++++++++++++-- fs/fuse/fuse_iomap_cache.c | 4 +++- 2 files changed, 32 insertions(+), 3 deletions(-) diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h index 09da9bce61b98c..aa2d5ca88c9d40 100644 --- a/fs/fuse/fuse_trace.h +++ b/fs/fuse/fuse_trace.h @@ -320,6 +320,7 @@ struct iomap_ioend; struct iomap; struct fuse_iext_cursor; struct fuse_iomap_lookup; +struct fuse_iext_root; /* tracepoint boilerplate so we don't have to keep doing this */ #define FUSE_IOMAP_OPFLAGS_FIELD \ @@ -1157,6 +1158,8 @@ TRACE_EVENT(fuse_iomap_config, __field(int64_t, time_max) __field(int64_t, maxbytes) __field(uint8_t, uuid_len) + + __field(uint32_t, cache_maxbytes) ), TP_fast_assign( @@ -1170,14 +1173,15 @@ TRACE_EVENT(fuse_iomap_config, __entry->time_max = outarg->s_time_max; __entry->maxbytes = outarg->s_maxbytes; __entry->uuid_len = outarg->s_uuid_len; + __entry->cache_maxbytes = outarg->cache_maxbytes; ), - TP_printk("connection %u root_ino 0x%llx flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u", + TP_printk("connection %u root_ino 0x%llx flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u cache_maxbytes 
%u", __entry->connection, __entry->root_nodeid, __print_flags(__entry->flags, "|", FUSE_IOMAP_CONFIG_STRINGS), __entry->blocksize, __entry->max_links, __entry->time_gran, __entry->time_min, __entry->time_max, __entry->maxbytes, - __entry->uuid_len) + __entry->uuid_len, __entry->cache_maxbytes) ); TRACE_EVENT(fuse_iomap_dev_inval, @@ -1395,6 +1399,29 @@ DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_del_mapping_got); DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_left); DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_right); +TRACE_EVENT(fuse_iomap_cache_cleanup, + TP_PROTO(const struct inode *inode, unsigned int iodir, + struct fuse_iext_root *ir), + TP_ARGS(inode, iodir, ir), + + TP_STRUCT__entry( + FUSE_IO_RANGE_FIELDS() + FUSE_IOMAP_IODIR_FIELD + __field(unsigned long long, bytes) + ), + + TP_fast_assign( + FUSE_INODE_ASSIGN(inode, fi, fm); + __entry->iodir = iodir; + __entry->bytes = ir->ir_bytes; + ), + + TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT " bytes 0x%llx", + FUSE_IO_RANGE_PRINTK_ARGS(), + FUSE_IOMAP_IODIR_PRINTK_ARGS, + __entry->bytes) +); + TRACE_EVENT(fuse_iomap_cache_remove, TP_PROTO(const struct inode *inode, unsigned int iodir, loff_t offset, uint64_t length, unsigned long caller_ip), diff --git a/fs/fuse/fuse_iomap_cache.c b/fs/fuse/fuse_iomap_cache.c index a8e16302ce4405..66739ec59e9c8b 100644 --- a/fs/fuse/fuse_iomap_cache.c +++ b/fs/fuse/fuse_iomap_cache.c @@ -1672,8 +1672,10 @@ fuse_iomap_cache_cleanup( struct fuse_mount *fm = get_fuse_mount(inode); struct fuse_iext_root *ir = fuse_iext_root_ptr(fi->cache, iodir); - if (ir && ir->ir_bytes > fm->fc->iomap_conn.cache_maxbytes) + if (ir && ir->ir_bytes > fm->fc->iomap_conn.cache_maxbytes) { + trace_fuse_iomap_cache_cleanup(inode, iodir, ir); fuse_iext_destroy(ir); + } } int ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 12/12] fuse: enable iomap
2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (10 preceding siblings ...)
2026-04-29 14:38 ` [PATCH 11/12] fuse_trace: " Darrick J. Wong
@ 2026-04-29 14:38 ` Darrick J. Wong
11 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:38 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Remove the guard that we used to avoid bisection problems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_iomap.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 22b23a3fa2ae2c..8d3c7273dda810 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -70,9 +70,6 @@ MODULE_PARM_DESC(debug_iomap, "Enable debugging of fuse iomap");
 
 bool fuse_iomap_enabled(void)
 {
-	/* Don't let anyone touch iomap until the end of the patchset. */
-	return false;
-
 	/*
 	 * There are fears that a fuse+iomap server could somehow DoS the
 	 * system by doing things like going out to lunch during a writeback
* [PATCHSET v8 8/8] fuse: run fuse-iomap servers as a contained service
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (6 preceding siblings ...)
2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2026-04-29 14:18 ` Darrick J. Wong
2026-04-29 14:38 ` [PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap usage Darrick J. Wong
2026-04-29 14:39 ` [PATCH 2/2] fuse: set iomap backing device block size Darrick J. Wong
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (11 subsequent siblings)
19 siblings, 2 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:18 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

Hi all,

This patchset defines the necessary communication protocols and library
code so that users can mount fuse-iomap servers that run in unprivileged
systemd service containers. This is largely adding new kernel calls and
protocols so that anyone with an open fusedev fd can ask for permission
for anyone else with that fusedev fd to use iomap.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-service-container
---
Commits in this patchset:
 * fuse: allow privileged mount helpers to pre-approve iomap usage
 * fuse: set iomap backing device block size
---
 fs/fuse/fuse_i.h          |   10 ++++++++
 fs/fuse/fuse_iomap.h      |    5 ++++
 include/uapi/linux/fuse.h |    8 ++++++
 fs/fuse/dev.c             |    4 +++
 fs/fuse/fuse_iomap.c      |   59 +++++++++++++++++++++++++++++++++++++++++++--
 fs/fuse/inode.c           |    7 ++++-
 6 files changed, 89 insertions(+), 4 deletions(-)
* [PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap usage
2026-04-29 14:18 ` [PATCHSET v8 8/8] fuse: run fuse-iomap servers as a contained service Darrick J. Wong
@ 2026-04-29 14:38 ` Darrick J. Wong
2026-04-29 14:39 ` [PATCH 2/2] fuse: set iomap backing device block size Darrick J. Wong
1 sibling, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:38 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

For the upcoming safemount functionality in libfuse, we will create a
privileged "mount.safe" helper that starts the fuse server in a
completely unprivileged systemd container. The mount helper will pass
the mount options and fds for /dev/fuse and any other files requested by
the fuse server into the container via a Unix socket.

Currently, the ability to turn on iomap for fuse depends on a module
parameter and the process that calls mount() having the CAP_SYS_RAWIO
capability. However, the unprivileged fuse server might want to query
the /dev/fuse fd for iomap capabilities before mount or FUSE_INIT so
that it can get ready.

Similar to FUSE_DEV_SYNC_INIT, add a new bit for iomap that can be
squirreled away in file->private_data and an ioctl to set that bit.
That way the privileged mount helper can pass its iomap privilege to the
contained fuse server without the fuse server needing to have
CAP_SYS_RAWIO.

Signed-off-by: "Darrick J.
Wong" <djwong@kernel.org> --- fs/fuse/fuse_i.h | 10 ++++++++++ fs/fuse/fuse_iomap.h | 2 ++ include/uapi/linux/fuse.h | 1 + fs/fuse/dev.c | 2 ++ fs/fuse/fuse_iomap.c | 32 ++++++++++++++++++++++++++++++-- fs/fuse/inode.c | 7 +++++-- 6 files changed, 50 insertions(+), 4 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 91f399a7a9a990..a491b80c2afacf 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -623,6 +623,9 @@ struct fuse_dev { /** Issue FUSE_INIT synchronously */ bool sync_init; + /** Allow fuse server to ask for IOMAP */ + bool may_iomap; + /** Fuse connection for this device */ struct fuse_conn *fc; @@ -981,6 +984,13 @@ struct fuse_conn { /* Enable fs/iomap for file operations */ unsigned int iomap:1; + /* + * Are filesystems using this connection allowed to use iomap? This is + * determined by the privilege level of the process that initiated the + * mount() call. + */ + unsigned int may_iomap:1; + /* Use io_uring for communication */ unsigned int io_uring; diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h index 79625897dded50..636d75e44cda82 100644 --- a/fs/fuse/fuse_iomap.h +++ b/fs/fuse/fuse_iomap.h @@ -72,6 +72,7 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos, void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset, u64 written); +int fuse_dev_ioctl_add_iomap(struct file *file); int fuse_dev_ioctl_iomap_support(struct file *file, struct fuse_iomap_support __user *argp); int fuse_iomap_dev_inval(struct fuse_conn *fc, @@ -109,6 +110,7 @@ int fuse_iomap_inval_mappings(struct fuse_conn *fc, # define fuse_iomap_fallocate(...) (-ENOSYS) # define fuse_iomap_flush_unmap_range(...) (-ENOSYS) # define fuse_iomap_copied_file_range(...) ((void)0) +# define fuse_dev_ioctl_add_iomap(...) (-EOPNOTSUPP) # define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP) # define fuse_iomap_dev_inval(...) 
(-ENOSYS) # define fuse_iomap_fadvise NULL diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 035e1a59ce50d3..1132493c66d266 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -1212,6 +1212,7 @@ struct fuse_iomap_support { #define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 99, \ struct fuse_iomap_support) #define FUSE_DEV_IOC_SET_NOFS _IOW(FUSE_DEV_IOC_MAGIC, 100, uint32_t) +#define FUSE_DEV_IOC_ADD_IOMAP _IO(FUSE_DEV_IOC_MAGIC, 101) struct fuse_lseek_in { uint64_t fh; diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index fcee1a23375cee..fc2875cfffdb3b 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -2789,6 +2789,8 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd, return fuse_dev_ioctl_iomap_support(file, argp); case FUSE_DEV_IOC_SET_NOFS: return fuse_dev_ioctl_iomap_set_nofs(file, argp); + case FUSE_DEV_IOC_ADD_IOMAP: + return fuse_dev_ioctl_add_iomap(file); default: return -ENOTTY; diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c index 8d3c7273dda810..3c2821c5f74c03 100644 --- a/fs/fuse/fuse_iomap.c +++ b/fs/fuse/fuse_iomap.c @@ -10,6 +10,7 @@ #include <linux/fadvise.h> #include <linux/swap.h> #include "fuse_i.h" +#include "fuse_dev_i.h" #include "fuse_trace.h" #include "fuse_iomap.h" #include "fuse_iomap_i.h" @@ -80,6 +81,12 @@ bool fuse_iomap_enabled(void) return enable_iomap && has_capability_noaudit(current, CAP_SYS_RAWIO); } +static inline bool fuse_iomap_may_enable(void) +{ + /* Same as above, but this time we log the denial in audit log */ + return enable_iomap && capable(CAP_SYS_RAWIO); +} + /* Convert IOMAP_* mapping types to FUSE_IOMAP_TYPE_* */ #define XMAP(word) \ case IOMAP_##word: \ @@ -2537,14 +2544,35 @@ fuse_iomap_fallocate( return 0; } +int fuse_dev_ioctl_add_iomap(struct file *file) +{ + int err = -EINVAL; + struct fuse_dev *fud = fuse_file_to_fud(file); + + mutex_lock(&fuse_mutex); + if (!fuse_dev_fc_get(fud)) { + fud->may_iomap = true; + err = 0; + } + 
mutex_unlock(&fuse_mutex); + return err; +} + int fuse_dev_ioctl_iomap_support(struct file *file, struct fuse_iomap_support __user *argp) { struct fuse_iomap_support ios = { }; + struct fuse_dev *fud = fuse_file_to_fud(file); + struct fuse_conn *fc; - if (fuse_iomap_enabled()) + mutex_lock(&fuse_mutex); + fc = fuse_dev_fc_get(fud); + if ((fc && fc != FUSE_DEV_FC_DISCONNECTED && fc->may_iomap) || + (!fc && fud->may_iomap) || + fuse_iomap_enabled()) ios.flags = FUSE_IOMAP_SUPPORT_FILEIO | FUSE_IOMAP_SUPPORT_ATOMIC; + mutex_unlock(&fuse_mutex); if (copy_to_user(argp, &ios, sizeof(ios))) return -EFAULT; @@ -2615,7 +2643,7 @@ int fuse_iomap_dev_inval(struct fuse_conn *fc, static inline bool can_set_nofs(struct fuse_dev *fud) { - if (fud && fud->fc && fud->fc->iomap) + if (fud && fud->fc && (fud->fc->iomap || fud->fc->may_iomap)) return true; return capable(CAP_SYS_RESOURCE); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 1a09b9e1446919..ca7981c38d5a1c 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1076,6 +1076,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm, fc->name_max = FUSE_NAME_LOW_MAX; fc->timeout.req_timeout = 0; fc->root_nodeid = FUSE_ROOT_ID; + fc->may_iomap = fuse_iomap_enabled(); if (IS_ENABLED(CONFIG_FUSE_BACKING)) fuse_backing_files_init(fc); @@ -1610,7 +1611,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, if (flags & FUSE_REQUEST_TIMEOUT) timeout = arg->request_timeout; - if ((flags & FUSE_IOMAP) && fuse_iomap_enabled()) { + if ((flags & FUSE_IOMAP) && fc->may_iomap) { fc->iomap = 1; pr_warn( "EXPERIMENTAL iomap feature enabled. 
Use at your own risk!"); @@ -1697,7 +1698,7 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm) */ if (fuse_uring_enabled()) flags |= FUSE_OVER_IO_URING; - if (fuse_iomap_enabled()) + if (fm->fc->may_iomap) flags |= FUSE_IOMAP; ia->in.flags = flags; @@ -2093,6 +2094,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx) goto err_unlock; if (fud->sync_init) fc->sync_init = 1; + if (fud->may_iomap) + fc->may_iomap = 1; } err = fuse_ctl_add_conn(fc); ^ permalink raw reply related [flat|nested] 191+ messages in thread
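From the mount helper's point of view, the pre-approval step in patch 1/2 might look like the sketch below. The ioctl number is copied from the uapi hunk in this patch and may change, since this is prototype code. The SKETCH_ prefix and the helper name are hypothetical and are used only to avoid clashing with a real header.

```c
#include <errno.h>
#include <sys/ioctl.h>

/* Copied from the uapi hunk in this patch; prototype values that may change. */
#define SKETCH_FUSE_DEV_IOC_MAGIC	229
#define SKETCH_FUSE_DEV_IOC_ADD_IOMAP	_IO(SKETCH_FUSE_DEV_IOC_MAGIC, 101)

/*
 * Mark an open /dev/fuse fd as pre-approved for iomap.  Intended to be
 * called by the privileged mount helper before the fd is mounted and
 * before it is passed to the contained fuse server.
 *
 * Returns 0 on success or a negative errno.
 */
static int sketch_preapprove_iomap(int fusedev_fd)
{
	if (ioctl(fusedev_fd, SKETCH_FUSE_DEV_IOC_ADD_IOMAP) < 0)
		return -errno;
	return 0;
}
```

According to the patch, the ioctl fails with EINVAL once the fd is already attached to a fuse connection, so the helper has to call it before mount(). The contained server can then use FUSE_DEV_IOC_IOMAP_SUPPORT on the same fd to learn that iomap is available.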
* [PATCH 2/2] fuse: set iomap backing device block size
2026-04-29 14:18 ` [PATCHSET v8 8/8] fuse: run fuse-iomap servers as a contained service Darrick J. Wong
2026-04-29 14:38 ` [PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap usage Darrick J. Wong
@ 2026-04-29 14:39 ` Darrick J. Wong
1 sibling, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:39 UTC (permalink / raw)
To: djwong, miklos; +Cc: joannelkoong, neal, linux-fsdevel, bernd, fuse-devel

From: Darrick J. Wong <djwong@kernel.org>

Add a new ioctl so that an unprivileged fuse server can set the block
size of a bdev that's opened for iomap usage.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_iomap.h      |    3 +++
 include/uapi/linux/fuse.h |    7 +++++++
 fs/fuse/dev.c             |    2 ++
 fs/fuse/fuse_iomap.c      |   27 +++++++++++++++++++++++++++
 4 files changed, 39 insertions(+)

diff --git a/fs/fuse/fuse_iomap.h b/fs/fuse/fuse_iomap.h
index 636d75e44cda82..39da52bfdbe622 100644
--- a/fs/fuse/fuse_iomap.h
+++ b/fs/fuse/fuse_iomap.h
@@ -75,6 +75,8 @@ void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
 int fuse_dev_ioctl_add_iomap(struct file *file);
 int fuse_dev_ioctl_iomap_support(struct file *file,
 				 struct fuse_iomap_support __user *argp);
+int fuse_dev_ioctl_iomap_set_blocksize(struct file *file,
+				struct fuse_iomap_backing_info __user *argp);
 int fuse_iomap_dev_inval(struct fuse_conn *fc,
 			 const struct fuse_iomap_dev_inval_out *arg);
 
@@ -112,6 +114,7 @@ int fuse_iomap_inval_mappings(struct fuse_conn *fc,
 # define fuse_iomap_copied_file_range(...)	((void)0)
 # define fuse_dev_ioctl_add_iomap(...)		(-EOPNOTSUPP)
 # define fuse_dev_ioctl_iomap_support(...)	(-EOPNOTSUPP)
+# define fuse_dev_ioctl_iomap_set_blocksize(...) (-EOPNOTSUPP)
 # define fuse_iomap_dev_inval(...)		(-ENOSYS)
 # define fuse_iomap_fadvise			NULL
 # define fuse_dev_ioctl_iomap_set_nofs(...)	(-EOPNOTSUPP)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1132493c66d266..94e99a01af5da7 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1202,6 +1202,11 @@ struct fuse_iomap_support {
 	uint64_t	padding;
 };
 
+struct fuse_iomap_backing_info {
+	uint32_t	backing_id;
+	uint32_t	blocksize;
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
@@ -1213,6 +1218,8 @@ struct fuse_iomap_support {
 					     struct fuse_iomap_support)
 #define FUSE_DEV_IOC_SET_NOFS		_IOW(FUSE_DEV_IOC_MAGIC, 100, uint32_t)
 #define FUSE_DEV_IOC_ADD_IOMAP		_IO(FUSE_DEV_IOC_MAGIC, 101)
+#define FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE _IOW(FUSE_DEV_IOC_MAGIC, 102, \
+					     struct fuse_iomap_backing_info)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index fc2875cfffdb3b..f7da6fb8c60d58 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2791,6 +2791,8 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 		return fuse_dev_ioctl_iomap_set_nofs(file, argp);
 	case FUSE_DEV_IOC_ADD_IOMAP:
 		return fuse_dev_ioctl_add_iomap(file);
+	case FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE:
+		return fuse_dev_ioctl_iomap_set_blocksize(file, argp);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/fuse/fuse_iomap.c b/fs/fuse/fuse_iomap.c
index 3c2821c5f74c03..57c0539f78a1d7 100644
--- a/fs/fuse/fuse_iomap.c
+++ b/fs/fuse/fuse_iomap.c
@@ -2933,3 +2933,30 @@ int fuse_iomap_inval_mappings(struct fuse_conn *fc,
 	up_read(&fc->killsb);
 	return ret;
 }
+
+int fuse_dev_ioctl_iomap_set_blocksize(struct file *file,
+				struct fuse_iomap_backing_info __user *argp)
+{
+	struct fuse_iomap_backing_info fbi;
+	struct fuse_dev *fud = fuse_get_dev(file);
+	struct fuse_backing *fb;
+	int ret;
+
+	if (IS_ERR(fud))
+		return PTR_ERR(fud);
+
+	if (!fud->fc->iomap)
+		return -EOPNOTSUPP;
+
+	if (copy_from_user(&fbi, argp, sizeof(fbi)))
+		return -EFAULT;
+
+	fb = fuse_backing_lookup(fud->fc, &fuse_iomap_backing_ops,
+				 fbi.backing_id);
+	if (!fb)
+		return -ENOENT;
+
+	ret = set_blocksize(fb->file, fbi.blocksize);
+	fuse_backing_put(fb);
+	return ret;
+}

^ permalink raw reply related [flat|nested] 191+ messages in thread
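For readers who want to see the new ioctl from the server side, here is a minimal userspace sketch. It assumes the struct layout and ioctl number exactly as they appear in the uapi hunk of this patch, and redeclares them locally so the example builds without patched kernel headers. The helper name set_backing_blocksize() is hypothetical and is not part of libfuse or the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Copied from the include/uapi/linux/fuse.h hunk in this patch. */
struct fuse_iomap_backing_info {
	uint32_t backing_id;	/* backing device already registered for iomap */
	uint32_t blocksize;	/* requested block size, in bytes */
};

#define FUSE_DEV_IOC_MAGIC		229
#define FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE _IOW(FUSE_DEV_IOC_MAGIC, 102, \
					     struct fuse_iomap_backing_info)

/* Two 32-bit fields with no padding, so the ioctl payload is 8 bytes. */
static_assert(sizeof(struct fuse_iomap_backing_info) == 8,
	      "unexpected padding in fuse_iomap_backing_info");

/*
 * Hypothetical helper: ask the kernel to set the block size of a backing
 * device.  fuse_dev_fd is an open /dev/fuse descriptor for the connection.
 * Returns 0 on success or a negative errno.
 */
static int set_backing_blocksize(int fuse_dev_fd, uint32_t backing_id,
				 uint32_t blocksize)
{
	struct fuse_iomap_backing_info fbi = {
		.backing_id = backing_id,
		.blocksize = blocksize,
	};

	if (ioctl(fuse_dev_fd, FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE, &fbi) < 0)
		return -errno;
	return 0;
}
```

Going by the kernel function above, a caller should expect -EOPNOTSUPP when iomap is not enabled on the connection, -EFAULT for a bad argument pointer, -ENOENT for an unknown backing_id, and otherwise whatever set_blocksize() returns.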
* [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (7 preceding siblings ...)
2026-04-29 14:18 ` [PATCHSET v8 8/8] fuse: run fuse-iomap servers as a contained service Darrick J. Wong
@ 2026-04-29 14:18 ` Darrick J. Wong
2026-04-29 14:39 ` [PATCH 01/25] libfuse: bump kernel and library ABI versions Darrick J. Wong
` (24 more replies)
2026-04-29 14:19 ` [PATCHSET v8 2/6] libfuse: allow servers to specify root node id Darrick J. Wong
` (10 subsequent siblings)
19 siblings, 25 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:18 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

Hi all,

This series connects libfuse to the iomap-enabled fuse driver in Linux to
get fuse servers out of the business of handling file I/O themselves.
By keeping the IO path mostly within the kernel, we can dramatically
improve the speed of disk-based filesystems. This enables us to move
all the filesystem metadata parsing code out of the kernel and into
userspace, which means that we can containerize the fuse servers for
security without losing a lot of performance.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
---
Commits in this patchset:
 * libfuse: bump kernel and library ABI versions
 * libfuse: wait in do_destroy until all open files are closed
 * libfuse: add kernel gates for FUSE_IOMAP
 * libfuse: add fuse commands for iomap_begin and end
 * libfuse: add upper level iomap commands
 * libfuse: add a lowlevel notification to add a new device to iomap
 * libfuse: add upper-level iomap add device function
 * libfuse: add iomap ioend low level handler
 * libfuse: add upper level iomap ioend commands
 * libfuse: add a reply function to send FUSE_ATTR_* to the kernel
 * libfuse: connect high level fuse library to fuse_reply_attr_iflags
 * libfuse: support enabling exclusive mode for files
 * libfuse: support direct I/O through iomap
 * libfuse: don't allow hardlinking of iomap files in the upper level fuse library
 * libfuse: allow discovery of the kernel's iomap capabilities
 * libfuse: add lower level iomap_config implementation
 * libfuse: add upper level iomap_config implementation
 * libfuse: add low level code to invalidate iomap block device ranges
 * libfuse: add upper-level API to invalidate parts of an iomap block device
 * libfuse: add atomic write support
 * libfuse: allow disabling of fs memory reclaim and write throttling
 * libfuse: create a helper to transform an open regular file into an open loopdev
 * libfuse: add swapfile support for iomap files
 * libfuse: add lower-level filesystem freeze, thaw, and shutdown requests
 * libfuse: add upper-level filesystem freeze, thaw, and shutdown events
---
 include/fuse.h          |  102 ++++++++
 include/fuse_common.h   |  149 ++++++++++++
 include/fuse_kernel.h   |  147 ++++++++++++
 include/fuse_loopdev.h  |   29 ++
 include/fuse_lowlevel.h |  307 ++++++++++++++++++++++++
 include/fuse_service.h  |   12 +
 lib/fuse_i.h            |    4
 ChangeLog.rst           |    5
 example/printcap.c      |    1
 include/meson.build     |    4
 lib/fuse.c              |  593 +++++++++++++++++++++++++++++++++++++++++++----
 lib/fuse_loopdev.c      |  441 +++++++++++++++++++++++++++++++++++
 lib/fuse_lowlevel.c     |  564 +++++++++++++++++++++++++++++++++++++++++++--
 lib/fuse_service.c      |    5
 lib/fuse_service_stub.c |    5
 lib/fuse_versionscript  |   24 ++
 lib/meson.build         |    7 -
 meson.build             |   13 +
 18 files changed, 2329 insertions(+), 83 deletions(-)
 create mode 100644 include/fuse_loopdev.h
 create mode 100644 lib/fuse_loopdev.c

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 01/25] libfuse: bump kernel and library ABI versions
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2026-04-29 14:39 ` Darrick J. Wong
2026-04-29 14:39 ` [PATCH 02/25] libfuse: wait in do_destroy until all open files are closed Darrick J. Wong
` (23 subsequent siblings)
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:39 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Bump the kernel ABI version to 7.99 and the libfuse ABI version to 3.99
to start our development. This patch exists to avoid confusion during
the prototyping stage.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_kernel.h  |    5 ++++-
 ChangeLog.rst          |    5 +++++
 lib/fuse_versionscript |    3 +++
 lib/meson.build        |    4 ++--
 meson.build            |    2 +-
 5 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index f0dee3d6cf51b0..842cc08a083a6f 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -240,6 +240,9 @@
  *  - add FUSE_COPY_FILE_RANGE_64
  *  - add struct fuse_copy_file_range_out
  *  - add FUSE_NOTIFY_PRUNE
+ *
+ * 7.99
+ *  - XXX magic minor revision to make experimental code really obvious
  */
 
 #ifndef _LINUX_FUSE_H
@@ -275,7 +278,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 45
+#define FUSE_KERNEL_MINOR_VERSION 99
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
diff --git a/ChangeLog.rst b/ChangeLog.rst
index 15c998cf1623b8..3cb95081d42d06 100644
--- a/ChangeLog.rst
+++ b/ChangeLog.rst
@@ -1,3 +1,8 @@
+libfuse 3.99-rc0 (2025-12-19)
+===============================
+
+* Add prototypes of iomap and syncfs (djwong)
+
 libfuse 3.18.0 (2025-12-18)
 ===========================
 
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index acd1d28907c614..af17e7ab2d7c88 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -247,6 +247,9 @@ FUSE_3.19 {
 		fuse_service_session_unmount;
 } FUSE_3.18;
 
+FUSE_3.99 {
+} FUSE_3.19;
+
 # Local Variables:
 # indent-tabs-mode: t
 # End:
diff --git a/lib/meson.build b/lib/meson.build
index d9a902f74b558f..ff311d4002da0e 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -50,12 +50,12 @@ fusermount_path = join_paths(get_option('prefix'), get_option('bindir'))
 libfuse = library('fuse3',
                   libfuse_sources,
                   version: base_version,
-                  soversion: '4',
+                  soversion: '99',
                   include_directories: include_dirs,
                   dependencies: deps,
                   install: true,
                   link_depends: 'fuse_versionscript',
-                  c_args: [ '-DFUSE_USE_VERSION=319',
+                  c_args: [ '-DFUSE_USE_VERSION=399',
                             '-DFUSERMOUNT_DIR="@0@"'.format(fusermount_path) ],
                   link_args: ['-Wl,--version-script,' + meson.current_source_dir()
                               + '/fuse_versionscript' ])
diff --git a/meson.build b/meson.build
index de038df8d92071..9fa17d08c39b6b 100644
--- a/meson.build
+++ b/meson.build
@@ -1,5 +1,5 @@
 project('libfuse3', ['c'],
-        version: '3.19.0-rc0',
+        version: '3.99.0-rc0', # Version with RC suffix
         meson_version: '>= 0.60.0',
         default_options: [
             'buildtype=debugoptimized',

^ permalink raw reply related [flat|nested] 191+ messages in thread
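Since the build now passes -DFUSE_USE_VERSION=399, a server that wants to build against both released libfuse and this prototype could gate its experimental code on the API level. The following is a sketch under assumptions: FUSE_MAKE_VERSION() is redefined locally, mirroring the libfuse macro, so that the example stands alone, and SERVER_HAS_IOMAP_PROTOTYPE and server_supports_iomap_prototype() are hypothetical names.

```c
#include <assert.h>

/* Local copy of libfuse's version encoding so the sketch is standalone. */
#define FUSE_MAKE_VERSION(maj, min)	((maj) * 100 + (min))

/* A server built against the prototype would request API level 3.99. */
#define FUSE_USE_VERSION		FUSE_MAKE_VERSION(3, 99)

/* Hypothetical feature macro: compile iomap code only at level 3.99+. */
#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 99)
# define SERVER_HAS_IOMAP_PROTOTYPE	1
#else
# define SERVER_HAS_IOMAP_PROTOTYPE	0
#endif

/* 3.99 encodes as 399, matching the c_args change in lib/meson.build. */
static_assert(FUSE_MAKE_VERSION(3, 99) == 399, "3.99 should encode as 399");

/* Hypothetical runtime accessor, handy for logging at startup. */
static int server_supports_iomap_prototype(void)
{
	return SERVER_HAS_IOMAP_PROTOTYPE;
}
```

The deliberately odd minor number makes it hard to mistake a prototype build for a released one, which is the stated purpose of this patch.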
* [PATCH 02/25] libfuse: wait in do_destroy until all open files are closed
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2026-04-29 14:39 ` [PATCH 01/25] libfuse: bump kernel and library ABI versions Darrick J. Wong
@ 2026-04-29 14:39 ` Darrick J. Wong
2026-04-29 14:39 ` [PATCH 03/25] libfuse: add kernel gates for FUSE_IOMAP Darrick J. Wong
` (22 subsequent siblings)
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:39 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

This patch complements the Linux kernel patch "fuse: flush pending fuse
events before aborting the connection".

This test opens a large number of files, unlinks them (which really just
renames them to fuse hidden files), closes the program, unmounts the
filesystem, and runs fsck to check that there aren't any inconsistencies
in the filesystem.

Unfortunately, the 488.full file shows that there are a lot of hidden
files left over in the filesystem, with incorrect link counts. Tracing
fuse_request_* shows that there are a large number of FUSE_RELEASE
commands that are queued up on behalf of the unlinked files at the time
that fuse_conn_destroy calls fuse_abort_conn. Had the connection not
aborted, the fuse server would have responded to the RELEASE commands by
removing the hidden files; instead they stick around.

For upper-level fuse servers that don't use fuseblk mode this isn't a
problem because libfuse responds to the connection going down by pruning
its inode cache and calling the fuse server's ->release for any open
files before calling the server's ->destroy function.

For fuseblk servers this is a problem, however, because the kernel sends
FUSE_DESTROY to the fuse server, and the fuse server has to write all of
its pending changes to the block device before replying to the DESTROY
request because the kernel releases its O_EXCL hold on the block device.
This means that the kernel must flush all pending FUSE_RELEASE requests
before issuing FUSE_DESTROY.

For fuse-iomap servers this will also be a problem because iomap servers
are expected to release all exclusively-held resources before unmount
returns from the kernel.

Create a function to push all the background requests to the queue
before sending FUSE_DESTROY. That way, all the pending file release
events are processed by the fuse server before it tears itself down, and
we don't end up with a corrupt filesystem.

Note that multithreaded fuse servers will need to track the number of
open files and defer a FUSE_DESTROY request until that number reaches
zero. An earlier version of this patch made the kernel wait for the
RELEASE acknowledgements before sending DESTROY, but the kernel people
weren't comfortable with adding blocking waits to unmount.

This patch implements the deferral for the multithreaded libfuse
backend. However, we must implement this deferral by starting a new
background thread because libfuse in io_uring mode starts up a bunch of
threads, each of which submits batches of SQEs to request fuse commands,
and then waits for the kernel to mark some CQEs to note which slots now
have fuse commands to process. Each uring thread processes the fuse
commands in the CQE serially, which means that _do_destroy can't just
wait for the open file counter to hit zero; it has to start a new
background thread to do that, so that it can continue to process pending
fuse commands.

[Aside: is this bad for fuse command processing latency? Suppose we get
two CQE completions, then the second command won't even be looked at
until the first one is done.]

Non-uring fuse by contrast reads one fuse command and processes it
immediately, so one command taking a long time won't stall any other
commands.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/fuse_i.h        |    4 ++
 lib/fuse_lowlevel.c |  107 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 104 insertions(+), 7 deletions(-)

diff --git a/lib/fuse_i.h b/lib/fuse_i.h
index 1710a872e19c72..17175e4454f90f 100644
--- a/lib/fuse_i.h
+++ b/lib/fuse_i.h
@@ -122,6 +122,10 @@ struct fuse_session {
 	 */
 	uint32_t conn_want;
 	uint64_t conn_want_ext;
+
+	/* destroy has to wait for all the open files to go away */
+	pthread_cond_t zero_open_files;
+	uint64_t open_files;
 };
 
 struct fuse_chan {
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 7e7ad683bccb50..487fd72420cdf1 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -54,6 +54,22 @@
 #define PARAM(inarg) (((char *)(inarg)) + sizeof(*(inarg)))
 #define OFFSET_MAX 0x7fffffffffffffffLL
 
+static inline void inc_open_files(struct fuse_session *se)
+{
+	pthread_mutex_lock(&se->lock);
+	se->open_files++;
+	pthread_mutex_unlock(&se->lock);
+}
+
+static inline void dec_open_files(struct fuse_session *se)
+{
+	pthread_mutex_lock(&se->lock);
+	se->open_files--;
+	if (!se->open_files)
+		pthread_cond_broadcast(&se->zero_open_files);
+	pthread_mutex_unlock(&se->lock);
+}
+
 struct fuse_pollhandle {
 	uint64_t kh;
 	struct fuse_session *se;
@@ -549,12 +565,17 @@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
 		FUSE_COMPAT_ENTRY_OUT_SIZE : sizeof(struct fuse_entry_out);
 	struct fuse_entry_out *earg = (struct fuse_entry_out *) buf;
 	struct fuse_open_out *oarg = (struct fuse_open_out *) (buf + entrysize);
+	struct fuse_session *se = req->se;
+	int error;
 
 	memset(buf, 0, sizeof(buf));
 	fill_entry(earg, e);
 	fill_open(oarg, f);
-	return send_reply_ok(req, buf,
+	error = send_reply_ok(req, buf,
 			     entrysize + sizeof(struct fuse_open_out));
+	if (!error)
+		inc_open_files(se);
+	return error;
 }
 
 int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
@@ -605,10 +626,15 @@ int fuse_passthrough_close(fuse_req_t req, int backing_id)
 int fuse_reply_open(fuse_req_t req, const struct fuse_file_info *f)
 {
 	struct fuse_open_out arg;
+	struct fuse_session *se = req->se;
+	int error;
 
 	memset(&arg, 0, sizeof(arg));
 	fill_open(&arg, f);
-	return send_reply_ok(req, &arg, sizeof(arg));
+	error = send_reply_ok(req, &arg, sizeof(arg));
+	if (!error)
+		inc_open_files(se);
+	return error;
 }
 
 static int do_fuse_reply_write(fuse_req_t req, size_t count)
@@ -1876,6 +1902,7 @@ static void _do_release(fuse_req_t req, const fuse_ino_t nodeid,
 {
 	(void)in_payload;
 	const struct fuse_release_in *arg = op_in;
+	struct fuse_session *se = req->se;
 	struct fuse_file_info fi;
 
 	memset(&fi, 0, sizeof(fi));
@@ -1894,6 +1921,7 @@ static void _do_release(fuse_req_t req, const fuse_ino_t nodeid,
 		req->se->op.release(req, nodeid, &fi);
 	else
 		fuse_reply_err(req, 0);
+	dec_open_files(se);
 }
 
 static void do_release(fuse_req_t req, const fuse_ino_t nodeid,
@@ -1998,6 +2026,7 @@ static void _do_releasedir(fuse_req_t req, const fuse_ino_t nodeid,
 {
 	(void)in_payload;
 	const struct fuse_release_in *arg = (const struct fuse_release_in *)op_in;
+	struct fuse_session *se = req->se;
 	struct fuse_file_info fi;
 
 	memset(&fi, 0, sizeof(fi));
@@ -2008,6 +2037,7 @@ static void _do_releasedir(fuse_req_t req, const fuse_ino_t nodeid,
 		req->se->op.releasedir(req, nodeid, &fi);
 	else
 		fuse_reply_err(req, 0);
+	dec_open_files(se);
 }
 
 static void do_releasedir(fuse_req_t req, const fuse_ino_t nodeid,
@@ -3030,14 +3060,20 @@ do_init(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
 	_do_init(req, nodeid, inarg, NULL);
 }
 
-static void _do_destroy(fuse_req_t req, const fuse_ino_t nodeid,
-			const void *op_in, const void *in_payload)
+static void *__fuse_destroy_sync(void *arg)
 {
+	struct fuse_req *req = arg;
 	struct fuse_session *se = req->se;
 
-	(void) nodeid;
-	(void)op_in;
-	(void)in_payload;
+	/*
+	 * Wait for all the FUSE_RELEASE requests to work their way through the
+	 * other worker threads, if any.
+	 */
+	pthread_mutex_lock(&se->lock);
+	se->open_files--;
+	while (se->open_files > 0)
+		pthread_cond_wait(&se->zero_open_files, &se->lock);
+	pthread_mutex_unlock(&se->lock);
 
 	{
 		char *mountpoint = atomic_exchange(&se->mountpoint, NULL);
@@ -3051,6 +3087,54 @@ static void _do_destroy(fuse_req_t req, const fuse_ino_t nodeid,
 		se->op.destroy(se->userdata);
 
 	send_reply_ok(req, NULL, 0);
+	return NULL;
+}
+
+/*
+ * Destroy the fuse session asynchronously.
+ *
+ * If we have any open files, then we want to kick the actual destroy call to a
+ * new detached background thread that can wait for the open file count to
+ * reach zero without blocking processing of the rest of the commands that are
+ * pending in the fuse thread's cqe.  For non-uring multithreaded mode, we also
+ * use the detached thread to avoid blocking a fuse worker from processing
+ * other commands.
+ *
+ * If the kernel sends us an explicit FUSE_DESTROY command then it won't tear
+ * down the fuse fd until it receives the reply, so fuse_session_destroy
+ * doesn't need to wait for this thread.
+ */
+static int __fuse_destroy_try_async(fuse_req_t req)
+{
+	pthread_t destroy_thread;
+	pthread_attr_t destroy_attr;
+	int ret;
+
+	ret = pthread_attr_init(&destroy_attr);
+	if (ret)
+		return ret;
+
+	ret = pthread_attr_setdetachstate(&destroy_attr,
+					  PTHREAD_CREATE_DETACHED);
+	if (ret)
+		return ret;
+
+	return pthread_create(&destroy_thread, &destroy_attr,
+			      __fuse_destroy_sync, req);
+}
+
+static void _do_destroy(fuse_req_t req, const fuse_ino_t nodeid,
+			const void *op_in, const void *in_payload)
+{
+	struct fuse_session *se = req->se;
+
+	(void) nodeid;
+	(void)op_in;
+	(void)in_payload;
+
+	if (se->open_files > 0 && __fuse_destroy_try_async(req) == 0)
+		return;
+	__fuse_destroy_sync(req);
 }
 
 static void do_destroy(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
@@ -3896,6 +3980,7 @@ void fuse_session_destroy(struct fuse_session *se)
 		fuse_ll_pipe_free(llp);
 	pthread_key_delete(se->pipe_key);
 	sem_destroy(&se->mt_finish);
+	pthread_cond_destroy(&se->zero_open_files);
 	pthread_mutex_destroy(&se->mt_lock);
 	pthread_mutex_destroy(&se->lock);
 	free(se->cuse_data);
@@ -4283,9 +4368,16 @@ fuse_session_new_versioned(struct fuse_args *args,
 	list_init_nreq(&se->notify_list);
 	se->notify_ctr = 1;
 	pthread_mutex_init(&se->lock, NULL);
+	pthread_cond_init(&se->zero_open_files, NULL);
 	sem_init(&se->mt_finish, 0, 0);
 	pthread_mutex_init(&se->mt_lock, NULL);
 
+	/*
+	 * Bias the open file counter by 1 so that we only wake the condition
+	 * variable once FUSE_DESTROY has been seen.
+	 */
+	se->open_files = 1;
+
 	err = pthread_key_create(&se->pipe_key, fuse_ll_pipe_destructor);
 	if (err) {
 		fuse_log(FUSE_LOG_ERR, "fuse: failed to create thread specific key: %s\n",
@@ -4310,6 +4402,7 @@ fuse_session_new_versioned(struct fuse_args *args,
 
 out5:
 	sem_destroy(&se->mt_finish);
+	pthread_cond_destroy(&se->zero_open_files);
 	pthread_mutex_destroy(&se->mt_lock);
 	pthread_mutex_destroy(&se->lock);
 out4:

^ permalink raw reply related [flat|nested] 191+ messages in thread
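The biased counter is the subtle part of this patch, so here is a standalone sketch of the same pattern outside libfuse. It assumes nothing beyond POSIX threads. struct open_tracker and the tracker_*() helpers are hypothetical names standing in for the fields and functions that the patch adds to struct fuse_session.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the state added to struct fuse_session. */
struct open_tracker {
	pthread_mutex_t lock;
	pthread_cond_t zero_open_files;
	uint64_t open_files;
};

static void tracker_init(struct open_tracker *t)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->zero_open_files, NULL);
	/* Bias by one so that releases alone can never reach zero. */
	t->open_files = 1;
}

/* Called after a successful open or create reply. */
static void tracker_open(struct open_tracker *t)
{
	pthread_mutex_lock(&t->lock);
	t->open_files++;
	pthread_mutex_unlock(&t->lock);
}

/* Called after the release handler has run. */
static void tracker_release(struct open_tracker *t)
{
	pthread_mutex_lock(&t->lock);
	assert(t->open_files > 0);
	t->open_files--;
	/* Only reachable once destroy has dropped the bias. */
	if (!t->open_files)
		pthread_cond_broadcast(&t->zero_open_files);
	pthread_mutex_unlock(&t->lock);
}

/* Called once on destroy: drop the bias, then wait for stragglers. */
static void tracker_wait_for_destroy(struct open_tracker *t)
{
	pthread_mutex_lock(&t->lock);
	t->open_files--;
	while (t->open_files > 0)
		pthread_cond_wait(&t->zero_open_files, &t->lock);
	pthread_mutex_unlock(&t->lock);
}

/* Snapshot of the counter, taken under the lock. */
static uint64_t tracker_count(struct open_tracker *t)
{
	uint64_t ret;

	pthread_mutex_lock(&t->lock);
	ret = t->open_files;
	pthread_mutex_unlock(&t->lock);
	return ret;
}
```

Because of the bias, the broadcast in tracker_release() cannot fire before destroy has been requested. In a multithreaded server, tracker_wait_for_destroy() would run on a separate detached thread, as the patch does, so that the worker that received the destroy request can keep processing the pending releases.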
* [PATCH 03/25] libfuse: add kernel gates for FUSE_IOMAP
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2026-04-29 14:39 ` [PATCH 01/25] libfuse: bump kernel and library ABI versions Darrick J. Wong
2026-04-29 14:39 ` [PATCH 02/25] libfuse: wait in do_destroy until all open files are closed Darrick J. Wong
@ 2026-04-29 14:39 ` Darrick J. Wong
2026-04-29 14:40 ` [PATCH 04/25] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
` (21 subsequent siblings)
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:39 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Add some flags to query and request kernel support for filesystem iomap
for regular files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h |    5 +++++
 include/fuse_kernel.h |    3 +++
 example/printcap.c    |    1 +
 lib/fuse_lowlevel.c   |   11 +++++++++++
 4 files changed, 20 insertions(+)

diff --git a/include/fuse_common.h b/include/fuse_common.h
index 6680fdb17d2e6b..23712fed8d64c2 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -517,6 +517,11 @@ struct fuse_loop_config_v1 {
  */
 #define FUSE_CAP_ALLOW_IDMAP (1ULL << 32)
 
+/*
+ * Client supports using iomap for regular file operations
+ */
+#define FUSE_CAP_IOMAP (1ULL << 33)
+
 /**
  * Ioctl flags
  *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 842cc08a083a6f..354a6da01c2ecc 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -243,6 +243,7 @@
  *
  * 7.99
  *  - XXX magic minor revision to make experimental code really obvious
+ *  - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
  */
 
 #ifndef _LINUX_FUSE_H
@@ -451,6 +452,7 @@ struct fuse_file_lock {
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
  * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
  *			 init_out.request_timeout contains the timeout (in secs)
+ * FUSE_IOMAP: Client supports iomap for regular file operations
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -498,6 +500,7 @@ struct fuse_file_lock {
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
+#define FUSE_IOMAP		(1ULL << 43)
 
 /**
  * CUSE INIT request/reply flags
diff --git a/example/printcap.c b/example/printcap.c
index e88ce1399ed18c..001922e0d74004 100644
--- a/example/printcap.c
+++ b/example/printcap.c
@@ -67,6 +67,7 @@ static const struct cap_info capabilities[] = {
 	{ FUSE_CAP_PASSTHROUGH, "FUSE_CAP_PASSTHROUGH"},
 	{ FUSE_CAP_OVER_IO_URING, "FUSE_CAP_OVER_IO_URING"},
 	{ FUSE_CAP_ALLOW_IDMAP, "FUSE_CAP_ALLOW_IDMAP"},
+	{ FUSE_CAP_IOMAP, "FUSE_CAP_IOMAP"},
 	// Add any new capabilities here
 	{ 0, NULL} // Sentinel to mark the end of the array
 };
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 487fd72420cdf1..b8700cd786a034 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2818,6 +2818,10 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			se->conn.capable_ext |= FUSE_CAP_OVER_IO_URING;
 		if (inargflags & FUSE_ALLOW_IDMAP)
 			se->conn.capable_ext |= FUSE_CAP_ALLOW_IDMAP;
+		if (inargflags & FUSE_IOMAP)
+			se->conn.capable_ext |= FUSE_CAP_IOMAP;
+		/* Don't let anyone touch iomap until the end of the patchset. */
+		se->conn.capable_ext &= ~FUSE_CAP_IOMAP;
 
 	} else {
 		se->conn.max_readahead = 0;
@@ -2864,6 +2868,9 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		       FUSE_CAP_READDIRPLUS_AUTO);
 	LL_SET_DEFAULT(1, FUSE_CAP_OVER_IO_URING);
 
+	/* servers need to opt-in to iomap explicitly */
+	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP);
+
 	/* This could safely become default, but libfuse needs an API extension
 	 * to support it
 	 * LL_SET_DEFAULT(1, FUSE_CAP_SETXATTR_EXT);
@@ -2983,6 +2990,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		outargflags |= FUSE_REQUEST_TIMEOUT;
 		outarg.request_timeout = se->conn.request_timeout;
 	}
+	if (se->conn.want_ext & FUSE_CAP_IOMAP)
+		outargflags |= FUSE_IOMAP;
 
 	outarg.max_readahead = se->conn.max_readahead;
 	outarg.max_write = se->conn.max_write;
@@ -3017,6 +3026,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		if (se->conn.want_ext & FUSE_CAP_PASSTHROUGH)
 			fuse_log(FUSE_LOG_DEBUG, "   max_stack_depth=%u\n",
 				 outarg.max_stack_depth);
+		if (se->conn.want_ext & FUSE_CAP_IOMAP)
+			fuse_log(FUSE_LOG_DEBUG, "   iomap=1\n");
 	}
 	if (arg->minor < 5)
 		outargsize = FUSE_COMPAT_INIT_OUT_SIZE;

^ permalink raw reply related [flat|nested] 191+ messages in thread
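Since the library leaves FUSE_CAP_IOMAP off by default, a server has to ask for it during init. The sketch below shows the check-then-request step in isolation. It assumes only the FUSE_CAP_IOMAP value from the fuse_common.h hunk above. struct conn_caps is a hypothetical stand-in for the capable_ext and want_ext fields of the connection info, and request_iomap() is a hypothetical helper, not a libfuse function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the include/fuse_common.h hunk in this patch. */
#define FUSE_CAP_IOMAP	(1ULL << 33)

/* The flag sits above bit 31, so it only fits in the 64-bit *_ext masks. */
static_assert((FUSE_CAP_IOMAP >> 32) != 0,
	      "FUSE_CAP_IOMAP must live in the extended capability bits");

/* Hypothetical stand-in for the two connection fields used here. */
struct conn_caps {
	uint64_t capable_ext;	/* what the kernel (and library) offer */
	uint64_t want_ext;	/* what the server requests */
};

/*
 * Request iomap only if it was advertised.  Returns true if the request
 * was recorded, false if the server must fall back to regular fuse IO.
 */
static bool request_iomap(struct conn_caps *conn)
{
	if (!(conn->capable_ext & FUSE_CAP_IOMAP))
		return false;
	conn->want_ext |= FUSE_CAP_IOMAP;
	return true;
}
```

At this point in the series the library clears FUSE_CAP_IOMAP from capable_ext unconditionally, so a check like this would always take the fallback path until the gate is lifted later in the patchset.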
* [PATCH 04/25] libfuse: add fuse commands for iomap_begin and end
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2026-04-29 14:39 ` [PATCH 03/25] libfuse: add kernel gates for FUSE_IOMAP Darrick J. Wong
@ 2026-04-29 14:40 ` Darrick J. Wong
2026-04-29 14:40 ` [PATCH 05/25] libfuse: add upper level iomap commands Darrick J. Wong
` (20 subsequent siblings)
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:40 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Teach the low level API how to handle iomap begin and end commands that
we get from the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h   |   71 +++++++++++++++++++++++++++++++++
 include/fuse_kernel.h   |   40 ++++++++++++++++++
 include/fuse_lowlevel.h |   59 +++++++++++++++++++++++++++
 lib/fuse_lowlevel.c     |  102 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    3 +
 5 files changed, 275 insertions(+)

diff --git a/include/fuse_common.h b/include/fuse_common.h
index 23712fed8d64c2..d8b371b6641a6c 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1140,7 +1140,78 @@ bool fuse_get_feature_flag(const struct fuse_conn_info *conn, uint64_t flag);
  */
 int fuse_convert_to_conn_want_ext(struct fuse_conn_info *conn);
 
+/**
+ * iomap operations.
+ * These APIs are introduced in version 399 (FUSE_MAKE_VERSION(3, 99)).
+ */
 
+/* mapping types; see corresponding IOMAP_TYPE_ */
+#define FUSE_IOMAP_TYPE_HOLE		(0)
+#define FUSE_IOMAP_TYPE_DELALLOC	(1)
+#define FUSE_IOMAP_TYPE_MAPPED		(2)
+#define FUSE_IOMAP_TYPE_UNWRITTEN	(3)
+#define FUSE_IOMAP_TYPE_INLINE		(4)
+
+/* fuse-specific mapping type indicating that writes use the read mapping */
+#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(255)
+
+#define FUSE_IOMAP_DEV_NULL		(0U)	/* null device cookie */
+
+/* mapping flags passed back from iomap_begin; see corresponding IOMAP_F_ */
+#define FUSE_IOMAP_F_NEW		(1U << 0)
+#define FUSE_IOMAP_F_DIRTY		(1U << 1)
+#define FUSE_IOMAP_F_SHARED		(1U << 2)
+#define FUSE_IOMAP_F_MERGED		(1U << 3)
+#define FUSE_IOMAP_F_BOUNDARY		(1U << 4)
+#define FUSE_IOMAP_F_ANON_WRITE		(1U << 5)
+#define FUSE_IOMAP_F_ATOMIC_BIO		(1U << 6)
+
+/* fuse-specific mapping flag asking for ->iomap_end call */
+#define FUSE_IOMAP_F_WANT_IOMAP_END	(1U << 7)
+
+/* mapping flags passed to iomap_end */
+#define FUSE_IOMAP_F_SIZE_CHANGED	(1U << 8)
+#define FUSE_IOMAP_F_STALE		(1U << 9)
+
+/* operation flags from iomap; see corresponding IOMAP_* */
+#define FUSE_IOMAP_OP_WRITE		(1U << 0)
+#define FUSE_IOMAP_OP_ZERO		(1U << 1)
+#define FUSE_IOMAP_OP_REPORT		(1U << 2)
+#define FUSE_IOMAP_OP_FAULT		(1U << 3)
+#define FUSE_IOMAP_OP_DIRECT		(1U << 4)
+#define FUSE_IOMAP_OP_NOWAIT		(1U << 5)
+#define FUSE_IOMAP_OP_OVERWRITE_ONLY	(1U << 6)
+#define FUSE_IOMAP_OP_UNSHARE		(1U << 7)
+#define FUSE_IOMAP_OP_DAX		(1U << 8)
+#define FUSE_IOMAP_OP_ATOMIC		(1U << 9)
+#define FUSE_IOMAP_OP_DONTCACHE		(1U << 10)
+
+/* pagecache writeback operation */
+#define FUSE_IOMAP_OP_WRITEBACK		(1U << 31)
+
+#define FUSE_IOMAP_NULL_ADDR		(-1ULL)	/* addr is not valid */
+
+struct fuse_file_iomap {
+	uint64_t offset;	/* file offset of mapping, bytes */
+	uint64_t length;	/* length of mapping, bytes */
+	uint64_t addr;		/* disk offset of mapping, bytes */
+	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
+	uint16_t flags;		/* FUSE_IOMAP_F_* */
+	uint32_t dev;		/* device cookie */
+};
+
+static inline bool fuse_iomap_is_write(unsigned int opflags)
+{
+	return opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO |
+			  FUSE_IOMAP_OP_UNSHARE | FUSE_IOMAP_OP_WRITEBACK);
+}
+
+static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
+					const struct fuse_file_iomap *map)
+{
+	return map->type == FUSE_IOMAP_TYPE_HOLE &&
+		!(opflags & FUSE_IOMAP_OP_ZERO);
+}
 
 /* ----------------------------------------------------------- *
  * Compatibility stuff					       *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 354a6da01c2ecc..b3750bb6275620 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -670,6 +670,9 @@ enum fuse_opcode {
 	FUSE_STATX		= 52,
 	FUSE_COPY_FILE_RANGE_64	= 53,
 
+	FUSE_IOMAP_BEGIN	= 4094,
+	FUSE_IOMAP_END		= 4095,
+
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
 
@@ -1313,4 +1316,41 @@ struct fuse_uring_cmd_req {
 	uint8_t padding[6];
 };
 
+struct fuse_iomap_io {
+	uint64_t offset;	/* file offset of mapping, bytes */
+	uint64_t length;	/* length of mapping, bytes */
+	uint64_t addr;		/* disk offset of mapping, bytes */
+	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
+	uint16_t flags;		/* FUSE_IOMAP_F_* */
+	uint32_t dev;		/* device cookie */
+};
+
+struct fuse_iomap_begin_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+};
+
+struct fuse_iomap_begin_out {
+	/* read file data from here */
+	struct fuse_iomap_io	read;
+
+	/* write file data to here, if applicable */
+	struct fuse_iomap_io	write;
+};
+
+struct fuse_iomap_end_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+	int64_t written;	/* bytes processed */
+
+	/* mapping that the kernel acted upon */
+	struct fuse_iomap_io	map;
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index ea71130946ba21..b9a072841ef078 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1357,6 +1357,43 @@ struct fuse_lowlevel_ops {
 	 * @param ino the inode number
 	 */
 	void (*syncfs)(fuse_req_t req, fuse_ino_t ino);
+
+	/**
+	 * Fetch file I/O mappings to begin an operation
+	 *
+	 * Valid replies:
+	 *   fuse_reply_iomap_begin
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param nodeid the inode number
+	 * @param attr_ino inode number as told by fuse_attr::ino
+	 * @param pos position in file, in bytes
+	 * @param count length of operation, in bytes
+	 * @param opflags mask of FUSE_IOMAP_OP_ flags specifying operation
+	 */
+	void (*iomap_begin)(fuse_req_t req, fuse_ino_t nodeid,
+			    uint64_t attr_ino, off_t pos, uint64_t count,
+			    uint32_t opflags);
+
+	/**
+	 * Complete an iomap operation
+	 *
+	 * Valid replies:
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param nodeid the inode number
+	 * @param attr_ino inode number as told by fuse_attr::ino
+	 * @param pos position in file, in bytes
+	 * @param count length of operation, in bytes
+	 * @param written number of bytes processed, or a negative errno
+	 * @param opflags mask of FUSE_IOMAP_OP_ flags specifying operation
+	 * @param iomap file I/O mapping that was acted upon
+	 */
+	void (*iomap_end)(fuse_req_t req, fuse_ino_t nodeid, uint64_t attr_ino,
+			  off_t pos, uint64_t count, uint32_t opflags,
+			  ssize_t written, const struct fuse_file_iomap *iomap);
 };
 
 /**
@@ -1752,6 +1789,28 @@ int fuse_reply_lseek(fuse_req_t req, off_t off);
 int fuse_reply_statx(fuse_req_t req, int flags, const struct statx *statx,
 		     double attr_timeout);
 
+/**
+ * Set an iomap write mapping to be a pure overwrite of the read mapping.
+ * @param write mapping for file data writes
+ * @param read mapping for file data reads
+ */
+void fuse_iomap_pure_overwrite(struct fuse_file_iomap *write,
+			       const struct fuse_file_iomap *read);
+
+/**
+ * Reply with iomappings for an iomap_begin operation
+ *
+ * Possible requests:
+ *   iomap_begin
+ *
+ * @param req request handle
+ * @param read mapping for file data reads
+ * @param write mapping for file data writes
+ * @return zero for success, or negative errno on failure
+ */
+int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_file_iomap *read,
+			   const struct fuse_file_iomap *write);
+
 /* ----------------------------------------------------------- *
  * Notification						       *
  * ----------------------------------------------------------- */
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index b8700cd786a034..df13e2f8f84add 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2619,6 +2619,104 @@ static void do_syncfs(fuse_req_t req, const fuse_ino_t nodeid,
 	_do_syncfs(req, nodeid, inarg, NULL);
 }
 
+void fuse_iomap_pure_overwrite(struct fuse_file_iomap *write,
+			       const struct fuse_file_iomap *read)
+{
+	write->addr = FUSE_IOMAP_NULL_ADDR;
+	write->offset = read->offset;
+	write->length = read->length;
+	write->type = FUSE_IOMAP_TYPE_PURE_OVERWRITE;
+	write->flags = 0;
+	write->dev = FUSE_IOMAP_DEV_NULL;
+}
+
+static inline void fuse_iomap_to_kernel(struct fuse_iomap_io *fmap,
+					const struct fuse_file_iomap *fimap)
+{
+	fmap->addr = fimap->addr;
+	fmap->offset = fimap->offset;
+	fmap->length = fimap->length;
+	fmap->type = fimap->type;
+	fmap->flags = fimap->flags;
+	fmap->dev = fimap->dev;
+}
+
+static inline void fuse_iomap_from_kernel(struct fuse_file_iomap *fimap,
+					  const struct fuse_iomap_io *fmap)
+{
+	fimap->addr = fmap->addr;
+	fimap->offset = fmap->offset;
+	fimap->length = fmap->length;
+	fimap->type = fmap->type;
+	fimap->flags = fmap->flags;
+	fimap->dev = fmap->dev;
+}
+
+int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_file_iomap *read,
+			   const struct fuse_file_iomap *write)
+{
+	struct fuse_iomap_begin_out arg = {
+		.write = {
+			.addr = FUSE_IOMAP_NULL_ADDR,
+			.offset = read->offset,
+			.length = read->length,
+			.type = FUSE_IOMAP_TYPE_PURE_OVERWRITE,
+			.flags = 0,
+			.dev = FUSE_IOMAP_DEV_NULL,
+		},
+	};
+
+	fuse_iomap_to_kernel(&arg.read, read);
+	if (write)
+		fuse_iomap_to_kernel(&arg.write, write);
+
+	return send_reply_ok(req, &arg, sizeof(arg));
+}
+
+static void _do_iomap_begin(fuse_req_t req, const fuse_ino_t nodeid,
+			    const void *op_in, const void *in_payload)
+{
+	const struct fuse_iomap_begin_in *arg = op_in;
+	(void)in_payload;
+	(void)nodeid;
+
+	if (req->se->op.iomap_begin)
+		req->se->op.iomap_begin(req, nodeid, arg->attr_ino, arg->pos,
+					arg->count, arg->opflags);
+	else
+		fuse_reply_err(req, ENOSYS);
+}
+
+static void do_iomap_begin(fuse_req_t req, const fuse_ino_t nodeid,
+			   const void *inarg)
+{
+	_do_iomap_begin(req, nodeid, inarg, NULL);
+}
+
+static void _do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
+			  const void *op_in, const void *in_payload)
+{
+	const struct fuse_iomap_end_in *arg = op_in;
+	(void)in_payload;
+	(void)nodeid;
+
+	if (req->se->op.iomap_end) {
+		struct fuse_file_iomap fimap;
+
+		fuse_iomap_from_kernel(&fimap, &arg->map);
+		req->se->op.iomap_end(req, nodeid, arg->attr_ino, arg->pos,
+				      arg->count, arg->opflags, arg->written,
+				      &fimap);
+	} else
+		fuse_reply_err(req, ENOSYS);
+}
+
+static void do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
+			 const void *inarg)
+{
+	_do_iomap_end(req, nodeid, inarg, NULL);
+}
+
 static bool want_flags_valid(uint64_t capable, uint64_t want)
 {
 	uint64_t unknown_flags = want & (~capable);
@@ -3612,6 +3710,8 @@ static struct {
 	[FUSE_LSEEK] = { do_lseek, "LSEEK" },
 	[FUSE_SYNCFS] = { do_syncfs, "SYNCFS" },
 	[FUSE_STATX] = { do_statx, "STATX" },
+	[FUSE_IOMAP_BEGIN] = { do_iomap_begin, "IOMAP_BEGIN" },
+	[FUSE_IOMAP_END] = { do_iomap_end, "IOMAP_END" },
 	[CUSE_INIT] = { cuse_lowlevel_init, "CUSE_INIT" },
}; @@ -3669,6 +3769,8 @@ static struct { [FUSE_LSEEK] = { _do_lseek, "LSEEK" }, [FUSE_SYNCFS] = { _do_syncfs, "SYNCFS" }, [FUSE_STATX] = { _do_statx, "STATX" }, + [FUSE_IOMAP_BEGIN] = { _do_iomap_begin, "IOMAP_BEGIN" }, + [FUSE_IOMAP_END] = { _do_iomap_end, "IOMAP_END" }, [CUSE_INIT] = { _cuse_lowlevel_init, "CUSE_INIT" }, }; diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index af17e7ab2d7c88..260a7047c158e4 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -248,6 +248,9 @@ FUSE_3.19 { } FUSE_3.18; FUSE_3.99 { + global: + fuse_iomap_pure_overwrite; + fuse_reply_iomap_begin; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
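The defaulting rule in fuse_reply_iomap_begin() can be looked at in isolation. What follows is a minimal standalone sketch, not libfuse code: struct demo_iomap is a hypothetical stand-in for struct fuse_file_iomap (the field widths are assumed), and the DEMO_* constants are placeholder values picked for illustration because the real FUSE_IOMAP_* values are not part of this patch. It only shows that a NULL write mapping turns into a pure overwrite of the read mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholders; the real values live in the fuse headers. */
#define DEMO_NULL_ADDR			UINT64_MAX
#define DEMO_TYPE_MAPPED		1
#define DEMO_TYPE_PURE_OVERWRITE	255
#define DEMO_DEV_NULL			0

/* Stand-in for struct fuse_file_iomap; field widths are assumptions. */
struct demo_iomap {
	uint64_t offset;	/* file offset, bytes */
	uint64_t length;	/* mapping length, bytes */
	uint64_t addr;		/* device address, bytes */
	uint16_t type;
	uint16_t flags;
	uint32_t dev;
};

struct demo_begin_reply {
	struct demo_iomap read;
	struct demo_iomap write;
};

/* Same field assignments as fuse_iomap_pure_overwrite() in the patch. */
void demo_pure_overwrite(struct demo_iomap *write,
			 const struct demo_iomap *read)
{
	write->addr = DEMO_NULL_ADDR;
	write->offset = read->offset;
	write->length = read->length;
	write->type = DEMO_TYPE_PURE_OVERWRITE;
	write->flags = 0;
	write->dev = DEMO_DEV_NULL;
}

/* Build the reply payload: a NULL write mapping means "pure overwrite". */
struct demo_begin_reply demo_build_begin_reply(const struct demo_iomap *read,
					       const struct demo_iomap *write)
{
	struct demo_begin_reply out;

	assert(read != NULL);
	out.read = *read;
	if (write)
		out.write = *write;
	else
		demo_pure_overwrite(&out.write, read);
	return out;
}
```

A server that never remaps on write can therefore pass NULL for the write mapping and let the library fill in the overwrite marker.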
* [PATCH 05/25] libfuse: add upper level iomap commands 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (3 preceding siblings ...) 2026-04-29 14:40 ` [PATCH 04/25] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong @ 2026-04-29 14:40 ` Darrick J. Wong 2026-04-29 14:40 ` [PATCH 06/25] libfuse: add a lowlevel notification to add a new device to iomap Darrick J. Wong ` (19 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:40 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Teach the upper level fuse library about the iomap begin and end operations, and connect it to the lower level. This is needed for fuse2fs to start using iomap. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse.h | 17 +++++++++ lib/fuse.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 119 insertions(+) diff --git a/include/fuse.h b/include/fuse.h index 129c744e39c46a..58d7538aee795c 100644 --- a/include/fuse.h +++ b/include/fuse.h @@ -885,6 +885,23 @@ struct fuse_operations { * calling fsync(2) on every file on the filesystem. */ int (*syncfs)(const char *path); + + /** + * Send a mapping to the kernel so that a file IO operation can run. + */ + int (*iomap_begin)(const char *path, uint64_t nodeid, + uint64_t attr_ino, off_t pos_in, + uint64_t length_in, uint32_t opflags_in, + struct fuse_file_iomap *read_out, + struct fuse_file_iomap *write_out); + + /** + * Respond to the outcome of a previous file mapping operation. 
+ */ + int (*iomap_end)(const char *path, uint64_t nodeid, uint64_t attr_ino, + off_t pos_in, uint64_t length_in, + uint32_t opflags_in, ssize_t written_in, + const struct fuse_file_iomap *iomap); }; /** Extra context that may be needed by some filesystems diff --git a/lib/fuse.c b/lib/fuse.c index 1bb94ca6aa0d39..c8089baaf6628c 100644 --- a/lib/fuse.c +++ b/lib/fuse.c @@ -2813,6 +2813,49 @@ int fuse_fs_chmod(struct fuse_fs *fs, const char *path, mode_t mode, return fs->op.chmod(path, mode, fi); } +static int fuse_fs_iomap_begin(struct fuse_fs *fs, const char *path, + fuse_ino_t nodeid, uint64_t attr_ino, off_t pos, + uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read, + struct fuse_file_iomap *write) +{ + fuse_get_context()->private_data = fs->user_data; + if (!fs->op.iomap_begin) + return -ENOSYS; + + if (fs->debug) { + fuse_log(FUSE_LOG_DEBUG, + "iomap_begin[%s] nodeid %llu attr_ino %llu pos %llu count %llu opflags 0x%x\n", + path, (unsigned long long)nodeid, + (unsigned long long)attr_ino, (unsigned long long)pos, + (unsigned long long)count, opflags); + } + + return fs->op.iomap_begin(path, nodeid, attr_ino, pos, count, opflags, + read, write); +} + +static int fuse_fs_iomap_end(struct fuse_fs *fs, const char *path, + fuse_ino_t nodeid, uint64_t attr_ino, off_t pos, + uint64_t count, uint32_t opflags, ssize_t written, + const struct fuse_file_iomap *iomap) +{ + fuse_get_context()->private_data = fs->user_data; + if (!fs->op.iomap_end) + return -ENOSYS; + + if (fs->debug) { + fuse_log(FUSE_LOG_DEBUG, + "iomap_end[%s] nodeid %llu attr_ino %llu pos %llu count %llu opflags 0x%x written %zd\n", + path, (unsigned long long)nodeid, + (unsigned long long)attr_ino, (unsigned long long)pos, + (unsigned long long)count, opflags, written); + } + + return fs->op.iomap_end(path, nodeid, attr_ino, pos, count, opflags, + written, iomap); +} + static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, int valid, struct fuse_file_info *fi) { 
@@ -4505,6 +4548,63 @@ static void fuse_lib_syncfs(fuse_req_t req, fuse_ino_t ino) reply_err(req, err); } +static void fuse_lib_iomap_begin(fuse_req_t req, fuse_ino_t nodeid, + uint64_t attr_ino, off_t pos, uint64_t count, + uint32_t opflags) +{ + struct fuse *f = req_fuse_prepare(req); + struct fuse_file_iomap read = { }; + struct fuse_file_iomap write = { }; + struct fuse_intr_data d; + char *path; + int err; + + err = get_path_nullok(f, nodeid, &path); + if (err) { + reply_err(req, err); + return; + } + + fuse_prepare_interrupt(f, req, &d); + err = fuse_fs_iomap_begin(f->fs, path, nodeid, attr_ino, pos, count, + opflags, &read, &write); + fuse_finish_interrupt(f, req, &d); + free_path(f, nodeid, path); + if (err) { + reply_err(req, err); + return; + } + + if (write.length == 0) + fuse_iomap_pure_overwrite(&write, &read); + + fuse_reply_iomap_begin(req, &read, &write); +} + +static void fuse_lib_iomap_end(fuse_req_t req, fuse_ino_t nodeid, + uint64_t attr_ino, off_t pos, uint64_t count, + uint32_t opflags, ssize_t written, + const struct fuse_file_iomap *iomap) +{ + struct fuse *f = req_fuse_prepare(req); + struct fuse_intr_data d; + char *path; + int err; + + err = get_path_nullok(f, nodeid, &path); + if (err) { + reply_err(req, err); + return; + } + + fuse_prepare_interrupt(f, req, &d); + err = fuse_fs_iomap_end(f->fs, path, nodeid, attr_ino, pos, count, + opflags, written, iomap); + fuse_finish_interrupt(f, req, &d); + free_path(f, nodeid, path); + reply_err(req, err); +} + static int clean_delay(struct fuse *f) { /* @@ -4607,6 +4707,8 @@ static struct fuse_lowlevel_ops fuse_path_ops = { .statx = fuse_lib_statx, #endif .syncfs = fuse_lib_syncfs, + .iomap_begin = fuse_lib_iomap_begin, + .iomap_end = fuse_lib_iomap_end, }; int fuse_notify_poll(struct fuse_pollhandle *ph) ^ permalink raw reply related [flat|nested] 191+ messages in thread
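To make the upper-level contract concrete, here is a rough sketch of what an iomap_begin callback could look like for a toy server whose single file is laid out linearly on one device. Everything named demo_* or DEMO_* is hypothetical (including the file size, the disk base address and the mapping type value), and rejecting out-of-range positions with -EINVAL is a simplification of mine, not something the patch prescribes. The one behaviour taken from the patch is that a callback may leave the write mapping zeroed, in which case fuse_lib_iomap_begin() sees write.length == 0 and substitutes a pure overwrite.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical layout of the toy filesystem. */
#define DEMO_TYPE_MAPPED	1			/* placeholder value */
#define DEMO_FILE_SIZE		UINT64_C(1048576)	/* 1 MiB file */
#define DEMO_DISK_BASE		UINT64_C(16777216)	/* data starts at 16 MiB */
#define DEMO_DEV_ID		1			/* id from a device_add call */

/* Stand-in for struct fuse_file_iomap; field widths are assumptions. */
struct demo_iomap {
	uint64_t offset;
	uint64_t length;
	uint64_t addr;
	uint16_t type;
	uint16_t flags;
	uint32_t dev;
};

/* Same parameter list as fuse_operations::iomap_begin in the patch. */
int demo_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
		     off_t pos, uint64_t count, uint32_t opflags,
		     struct demo_iomap *read, struct demo_iomap *write)
{
	uint64_t avail;

	(void)path; (void)nodeid; (void)attr_ino; (void)opflags;
	(void)write;	/* left zeroed on purpose, see demo_write_left_unset() */
	assert(read != NULL);

	if (pos < 0 || (uint64_t)pos >= DEMO_FILE_SIZE)
		return -EINVAL;

	/* Trim the request to the end of the file. */
	avail = DEMO_FILE_SIZE - (uint64_t)pos;
	if (count > avail)
		count = avail;

	read->offset = (uint64_t)pos;
	read->length = count;
	read->addr = DEMO_DISK_BASE + (uint64_t)pos;
	read->type = DEMO_TYPE_MAPPED;
	read->flags = 0;
	read->dev = DEMO_DEV_ID;
	return 0;
}

/* Mirrors the "write.length == 0" test in fuse_lib_iomap_begin(). */
int demo_write_left_unset(const struct demo_iomap *write)
{
	return write->length == 0;
}
```

Using a zero length as the "not filled in" marker keeps the callback signature simple, at the cost that a server cannot return a deliberately empty write mapping.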
* [PATCH 06/25] libfuse: add a lowlevel notification to add a new device to iomap 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (4 preceding siblings ...) 2026-04-29 14:40 ` [PATCH 05/25] libfuse: add upper level iomap commands Darrick J. Wong @ 2026-04-29 14:40 ` Darrick J. Wong 2026-04-29 14:40 ` [PATCH 07/25] libfuse: add upper-level iomap add device function Darrick J. Wong ` (18 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:40 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Plumb in the pieces needed to attach block devices to a fuse+iomap mount for use with iomap operations. This enables us to have filesystems where the metadata could live somewhere else, but the actual file IO goes to locally attached storage. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse_kernel.h | 7 +++++++ include/fuse_lowlevel.h | 30 +++++++++++++++++++++++++++++ lib/fuse_lowlevel.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++ lib/fuse_versionscript | 2 ++ 4 files changed, 88 insertions(+) diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h index b3750bb6275620..e69f9675c4b57d 100644 --- a/include/fuse_kernel.h +++ b/include/fuse_kernel.h @@ -1135,6 +1135,13 @@ struct fuse_notify_prune_out { uint64_t spare; }; +#define FUSE_BACKING_TYPE_MASK (0xFF) +#define FUSE_BACKING_TYPE_PASSTHROUGH (0) +#define FUSE_BACKING_TYPE_IOMAP (1) +#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_IOMAP) + +#define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK) + struct fuse_backing_map { int32_t fd; uint32_t flags; diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h index b9a072841ef078..c0bbb6b658808d 100644 --- a/include/fuse_lowlevel.h +++ b/include/fuse_lowlevel.h @@ -2035,6 +2035,36 @@ int fuse_lowlevel_notify_retrieve(struct 
fuse_session *se, fuse_ino_t ino, int fuse_lowlevel_notify_prune(struct fuse_session *se, fuse_ino_t *nodeids, uint32_t count); +/** + * Attach an open file descriptor to a fuse+iomap mount. Currently must be + * a block device. + * + * Added in FUSE protocol version 7.99. If the kernel does not support + * this (or a newer) version, the function will return -ENOSYS and do + * nothing. + * + * @param se the session object + * @param fd file descriptor of an open block device + * @param flags behavior flags for the operation; none defined so far + * @return positive nonzero device id on success, or negative errno on failure + */ +int fuse_lowlevel_iomap_device_add(struct fuse_session *se, int fd, + unsigned int flags); + +/** + * Detach an open file from a fuse+iomap mount. Must be a device id returned + * by fuse_lowlevel_iomap_device_add. + * + * Added in FUSE protocol version 7.99. If the kernel does not support + * this (or a newer) version, the function will return -ENOSYS and do + * nothing. 
+ * + * @param se the session object + * @param device_id device index as returned by fuse_lowlevel_iomap_device_add + * @return 0 on success, or negative errno on failure + */ +int fuse_lowlevel_iomap_device_remove(struct fuse_session *se, int device_id); + /* ----------------------------------------------------------- * * Utility functions * * ----------------------------------------------------------- */ diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c index df13e2f8f84add..95ff8ca458480a 100644 --- a/lib/fuse_lowlevel.c +++ b/lib/fuse_lowlevel.c @@ -623,6 +623,55 @@ int fuse_passthrough_close(fuse_req_t req, int backing_id) return ret; } +int fuse_lowlevel_iomap_device_add(struct fuse_session *se, int fd, + unsigned int flags) +{ + struct fuse_backing_map map = { + .fd = fd, + .flags = FUSE_BACKING_TYPE_IOMAP | + (flags & ~FUSE_BACKING_TYPE_MASK), + }; + int ret; + + if (!(se->conn.want_ext & FUSE_CAP_IOMAP)) + return -ENOSYS; + + ret = ioctl(se->fd, FUSE_DEV_IOC_BACKING_OPEN, &map); + if (ret == 0) { + /* not supposed to happen */ + ret = -1; + errno = ERANGE; + } + if (ret < 0) { + int err = errno; + + fuse_log(FUSE_LOG_ERR, "fuse: iomap_device_add: %s\n", + strerror(err)); + return -err; + } + + return ret; +} + +int fuse_lowlevel_iomap_device_remove(struct fuse_session *se, int device_id) +{ + int ret; + + if (!(se->conn.want_ext & FUSE_CAP_IOMAP)) + return -ENOSYS; + + ret = ioctl(se->fd, FUSE_DEV_IOC_BACKING_CLOSE, &device_id); + if (ret < 0) { + int err = errno; + + fuse_log(FUSE_LOG_ERR, "fuse: iomap_device_remove: %s\n", + strerror(errno)); + return -err; + } + + return ret; +} + int fuse_reply_open(fuse_req_t req, const struct fuse_file_info *f) { struct fuse_open_out arg; diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index 260a7047c158e4..e32dcb17551e9e 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -251,6 +251,8 @@ FUSE_3.99 { global: fuse_iomap_pure_overwrite; fuse_reply_iomap_begin; + 
fuse_lowlevel_iomap_device_add; + fuse_lowlevel_iomap_device_remove; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 07/25] libfuse: add upper-level iomap add device function 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (5 preceding siblings ...) 2026-04-29 14:40 ` [PATCH 06/25] libfuse: add a lowlevel notification to add a new device to iomap Darrick J. Wong @ 2026-04-29 14:40 ` Darrick J. Wong 2026-04-29 14:41 ` [PATCH 08/25] libfuse: add iomap ioend low level handler Darrick J. Wong ` (17 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:40 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Make it so that the upper level fuse library can add iomap devices too. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse.h | 19 +++++++++++++++++++ lib/fuse.c | 16 ++++++++++++++++ lib/fuse_versionscript | 2 ++ 3 files changed, 37 insertions(+) diff --git a/include/fuse.h b/include/fuse.h index 58d7538aee795c..8e79103f45404d 100644 --- a/include/fuse.h +++ b/include/fuse.h @@ -1437,6 +1437,25 @@ void fuse_fs_init(struct fuse_fs *fs, struct fuse_conn_info *conn, struct fuse_config *cfg); void fuse_fs_destroy(struct fuse_fs *fs); +/** + * Attach an open file descriptor to a fuse+iomap mount. Currently must be + * a block device. + * + * @param fd file descriptor of an open block device + * @param flags behavior flags for the operation; none defined so far + * @return positive nonzero device id on success, or negative errno on failure + */ +int fuse_fs_iomap_device_add(int fd, unsigned int flags); + +/** + * Detach an open file from a fuse+iomap mount. Must be a device id returned + * by fuse_lowlevel_iomap_device_add. 
+ * + * @param device_id device index as returned by fuse_lowlevel_iomap_device_add + * @return 0 on success, or negative errno on failure + */ +int fuse_fs_iomap_device_remove(int device_id); + int fuse_notify_poll(struct fuse_pollhandle *ph); /** diff --git a/lib/fuse.c b/lib/fuse.c index c8089baaf6628c..8f012710c994d5 100644 --- a/lib/fuse.c +++ b/lib/fuse.c @@ -2856,6 +2856,22 @@ static int fuse_fs_iomap_end(struct fuse_fs *fs, const char *path, written, iomap); } +int fuse_fs_iomap_device_add(int fd, unsigned int flags) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse_session *se = fuse_get_session(ctxt->fuse); + + return fuse_lowlevel_iomap_device_add(se, fd, flags); +} + +int fuse_fs_iomap_device_remove(int device_id) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse_session *se = fuse_get_session(ctxt->fuse); + + return fuse_lowlevel_iomap_device_remove(se, device_id); +} + static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, int valid, struct fuse_file_info *fi) { diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index e32dcb17551e9e..918b714e2e9166 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -253,6 +253,8 @@ FUSE_3.99 { fuse_reply_iomap_begin; fuse_lowlevel_iomap_device_add; fuse_lowlevel_iomap_device_remove; + fuse_fs_iomap_device_add; + fuse_fs_iomap_device_remove; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
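Since fuse_fs_iomap_device_add() is only a thin wrapper, the interesting parts are in the lowlevel function it calls: how the flags word is composed and how the ioctl result is mapped to a return value. The sketch below restates just those two steps as pure functions so they can be read without a fuse session. The DEMO_BACKING_* values are copied from the fuse_kernel.h hunk in patch 06; the function names are made up for this example, and the ioctl itself is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values copied from the FUSE_BACKING_* defines added in patch 06. */
#define DEMO_BACKING_TYPE_MASK	(0xFF)
#define DEMO_BACKING_TYPE_IOMAP	(1)

/* The low byte selects the backing type; caller flags keep the upper bits. */
uint32_t demo_backing_flags(unsigned int flags)
{
	return DEMO_BACKING_TYPE_IOMAP | (flags & ~DEMO_BACKING_TYPE_MASK);
}

/*
 * Map the result of the backing-open ioctl to the library return value:
 * a positive value is the device id, a negative value becomes -errno, and
 * zero is treated as -ERANGE because a device id must be nonzero.
 */
int demo_device_add_result(int ioctl_ret, int ioctl_errno)
{
	assert(ioctl_ret >= 0 || ioctl_errno > 0);

	if (ioctl_ret == 0)
		return -ERANGE;
	if (ioctl_ret < 0)
		return -ioctl_errno;
	return ioctl_ret;
}
```

Because the type lives in the low byte, any caller-supplied bits in that byte are silently dropped rather than rejected; that matches the masking in the patch.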
* [PATCH 08/25] libfuse: add iomap ioend low level handler 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (6 preceding siblings ...) 2026-04-29 14:40 ` [PATCH 07/25] libfuse: add upper-level iomap add device function Darrick J. Wong @ 2026-04-29 14:41 ` Darrick J. Wong 2026-04-29 14:41 ` [PATCH 09/25] libfuse: add upper level iomap ioend commands Darrick J. Wong ` (16 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:41 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Teach the low level library about the iomap ioend handler, which gets called by the kernel when we finish a file write that isn't a pure overwrite operation. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse_common.h | 13 +++++++++++++ include/fuse_kernel.h | 16 ++++++++++++++++ include/fuse_lowlevel.h | 34 ++++++++++++++++++++++++++++++++++ lib/fuse_lowlevel.c | 32 ++++++++++++++++++++++++++++++++ lib/fuse_versionscript | 1 + 5 files changed, 96 insertions(+) diff --git a/include/fuse_common.h b/include/fuse_common.h index d8b371b6641a6c..865bcc6f5543d8 100644 --- a/include/fuse_common.h +++ b/include/fuse_common.h @@ -1213,6 +1213,19 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags, !(opflags & FUSE_IOMAP_OP_ZERO); } +/* out of place write extent */ +#define FUSE_IOMAP_IOEND_SHARED (1U << 0) +/* unwritten extent */ +#define FUSE_IOMAP_IOEND_UNWRITTEN (1U << 1) +/* don't merge into previous ioend */ +#define FUSE_IOMAP_IOEND_BOUNDARY (1U << 2) +/* is direct I/O */ +#define FUSE_IOMAP_IOEND_DIRECT (1U << 3) +/* is append ioend */ +#define FUSE_IOMAP_IOEND_APPEND (1U << 4) +/* is pagecache writeback */ +#define FUSE_IOMAP_IOEND_WRITEBACK (1U << 5) + /* ----------------------------------------------------------- * * Compatibility stuff * 
* ----------------------------------------------------------- */ diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h index e69f9675c4b57d..732085a1b900b0 100644 --- a/include/fuse_kernel.h +++ b/include/fuse_kernel.h @@ -670,6 +670,7 @@ enum fuse_opcode { FUSE_STATX = 52, FUSE_COPY_FILE_RANGE_64 = 53, + FUSE_IOMAP_IOEND = 4093, FUSE_IOMAP_BEGIN = 4094, FUSE_IOMAP_END = 4095, @@ -1360,4 +1361,19 @@ struct fuse_iomap_end_in { struct fuse_iomap_io map; }; +struct fuse_iomap_ioend_in { + uint32_t flags; /* FUSE_IOMAP_IOEND_* */ + int32_t error; /* negative errno or 0 */ + uint64_t attr_ino; /* matches fuse_attr:ino */ + uint64_t pos; /* file position, in bytes */ + uint64_t new_addr; /* disk offset of new mapping, in bytes */ + uint64_t written; /* bytes processed */ + uint32_t dev; /* device cookie */ + uint32_t pad; /* zero */ +}; + +struct fuse_iomap_ioend_out { + uint64_t newsize; /* new ondisk size */ +}; + #endif /* _LINUX_FUSE_H */ diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h index c0bbb6b658808d..f55682f81e86f4 100644 --- a/include/fuse_lowlevel.h +++ b/include/fuse_lowlevel.h @@ -1394,6 +1394,28 @@ struct fuse_lowlevel_ops { void (*iomap_end)(fuse_req_t req, fuse_ino_t nodeid, uint64_t attr_ino, off_t pos, uint64_t count, uint32_t opflags, ssize_t written, const struct fuse_file_iomap *iomap); + + /** + * Complete an iomap IO operation + * + * Valid replies: + * fuse_reply_iomap_ioend + * fuse_reply_err + * + * @param req request handle + * @param nodeid the inode number + * @param attr_ino inode number as told by fuse_attr::ino + * @param pos position in file, in bytes + * @param written number of bytes processed + * @param ioendflags mask of FUSE_IOMAP_IOEND_ flags specifying operation + * @param error negative errno if the IO failed, or 0 + * @param dev device cookie of new address + * @param new_addr disk address of new mapping, in bytes + */ + void (*iomap_ioend)(fuse_req_t req, fuse_ino_t nodeid, + uint64_t attr_ino, off_t pos, 
uint64_t written, + uint32_t ioendflags, int error, uint32_t dev, + uint64_t new_addr); }; /** @@ -1811,6 +1833,18 @@ void fuse_iomap_pure_overwrite(struct fuse_file_iomap *write, int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_file_iomap *read, const struct fuse_file_iomap *write); +/** + * Reply to an ioend with the new ondisk size + * + * Possible requests: + * iomap_ioend + * + * @param req request handle + * @param newsize new ondisk file size + * @return zero for success, or negative errno on failure + */ +int fuse_reply_iomap_ioend(fuse_req_t req, off_t newsize); + /* ----------------------------------------------------------- * * Notification * * ----------------------------------------------------------- */ diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c index 95ff8ca458480a..8d706aa4b92f88 100644 --- a/lib/fuse_lowlevel.c +++ b/lib/fuse_lowlevel.c @@ -2766,6 +2766,36 @@ static void do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid, _do_iomap_end(req, nodeid, inarg, NULL); } +int fuse_reply_iomap_ioend(fuse_req_t req, off_t newsize) +{ + struct fuse_iomap_ioend_out arg = { + .newsize = newsize, + }; + + return send_reply_ok(req, &arg, sizeof(arg)); +} + +static void _do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid, + const void *op_in, const void *in_payload) +{ + const struct fuse_iomap_ioend_in *arg = op_in; + (void)in_payload; + (void)nodeid; + + if (req->se->op.iomap_ioend) + req->se->op.iomap_ioend(req, nodeid, arg->attr_ino, arg->pos, + arg->written, arg->flags, arg->error, + arg->dev, arg->new_addr); + else + fuse_reply_err(req, ENOSYS); +} + +static void do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid, + const void *inarg) +{ + _do_iomap_ioend(req, nodeid, inarg, NULL); +} + static bool want_flags_valid(uint64_t capable, uint64_t want) { uint64_t unknown_flags = want & (~capable); @@ -3761,6 +3791,7 @@ static struct { [FUSE_STATX] = { do_statx, "STATX" }, [FUSE_IOMAP_BEGIN] = { do_iomap_begin, "IOMAP_BEGIN" }, 
[FUSE_IOMAP_END] = { do_iomap_end, "IOMAP_END" }, + [FUSE_IOMAP_IOEND] = { do_iomap_ioend, "IOMAP_IOEND" }, [CUSE_INIT] = { cuse_lowlevel_init, "CUSE_INIT" }, }; @@ -3820,6 +3851,7 @@ static struct { [FUSE_STATX] = { _do_statx, "STATX" }, [FUSE_IOMAP_BEGIN] = { _do_iomap_begin, "IOMAP_BEGIN" }, [FUSE_IOMAP_END] = { _do_iomap_end, "IOMAP_END" }, + [FUSE_IOMAP_IOEND] = { _do_iomap_ioend, "IOMAP_IOEND" }, [CUSE_INIT] = { _cuse_lowlevel_init, "CUSE_INIT" }, }; diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index 918b714e2e9166..06fcc6dcaa11be 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -251,6 +251,7 @@ FUSE_3.99 { global: fuse_iomap_pure_overwrite; fuse_reply_iomap_begin; + fuse_reply_iomap_ioend; fuse_lowlevel_iomap_device_add; fuse_lowlevel_iomap_device_remove; fuse_fs_iomap_device_add; ^ permalink raw reply related [flat|nested] 191+ messages in thread
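When debugging an iomap_ioend handler it helps to see the FUSE_IOMAP_IOEND_* mask spelled out. The helper below is a small sketch of such a decoder. The bit values are copied from the fuse_common.h hunk in this patch, but the DEMO_* names and demo_ioend_flags_str() itself are hypothetical and not part of libfuse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit values copied from the FUSE_IOMAP_IOEND_* defines in this patch. */
#define DEMO_IOEND_SHARED	(1U << 0)	/* out of place write extent */
#define DEMO_IOEND_UNWRITTEN	(1U << 1)	/* unwritten extent */
#define DEMO_IOEND_BOUNDARY	(1U << 2)	/* don't merge into previous ioend */
#define DEMO_IOEND_DIRECT	(1U << 3)	/* is direct I/O */
#define DEMO_IOEND_APPEND	(1U << 4)	/* is append ioend */
#define DEMO_IOEND_WRITEBACK	(1U << 5)	/* is pagecache writeback */

static const struct {
	uint32_t bit;
	const char *name;
} demo_ioend_names[] = {
	{ DEMO_IOEND_SHARED,	"SHARED" },
	{ DEMO_IOEND_UNWRITTEN,	"UNWRITTEN" },
	{ DEMO_IOEND_BOUNDARY,	"BOUNDARY" },
	{ DEMO_IOEND_DIRECT,	"DIRECT" },
	{ DEMO_IOEND_APPEND,	"APPEND" },
	{ DEMO_IOEND_WRITEBACK,	"WRITEBACK" },
};

/* Render the set bits as "A|B|C" into buf; stops early if buf is too small. */
const char *demo_ioend_flags_str(uint32_t flags, char *buf, size_t len)
{
	size_t used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0;
	     i < sizeof(demo_ioend_names) / sizeof(demo_ioend_names[0]); i++) {
		int n;

		if (!(flags & demo_ioend_names[i].bit))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", demo_ioend_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output was truncated */
		used += (size_t)n;
	}
	return buf;
}
```

Names are emitted in bit order, so the output is stable regardless of how the caller ORed the flags together.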
* [PATCH 09/25] libfuse: add upper level iomap ioend commands 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (7 preceding siblings ...) 2026-04-29 14:41 ` [PATCH 08/25] libfuse: add iomap ioend low level handler Darrick J. Wong @ 2026-04-29 14:41 ` Darrick J. Wong 2026-04-29 14:41 ` [PATCH 10/25] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong ` (15 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:41 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Teach the upper level fuse library about iomap ioend events, which happen when a write that isn't a pure overwrite completes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse.h | 9 +++++++++ lib/fuse.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 62 insertions(+) diff --git a/include/fuse.h b/include/fuse.h index 8e79103f45404d..dbb72c80b138c6 100644 --- a/include/fuse.h +++ b/include/fuse.h @@ -902,6 +902,15 @@ struct fuse_operations { off_t pos_in, uint64_t length_in, uint32_t opflags_in, ssize_t written_in, const struct fuse_file_iomap *iomap); + + /** + * Respond to the outcome of a file IO operation. 
+ */ + int (*iomap_ioend)(const char *path, uint64_t nodeid, + uint64_t attr_ino, off_t pos_in, + uint64_t written_in, uint32_t ioendflags_in, + int error_in, uint32_t dev_in, + uint64_t new_addr_in, off_t *newsize); }; /** Extra context that may be needed by some filesystems diff --git a/lib/fuse.c b/lib/fuse.c index 8f012710c994d5..1bd672a7818a1f 100644 --- a/lib/fuse.c +++ b/lib/fuse.c @@ -2872,6 +2872,28 @@ int fuse_fs_iomap_device_remove(int device_id) return fuse_lowlevel_iomap_device_remove(se, device_id); } +static int fuse_fs_iomap_ioend(struct fuse_fs *fs, const char *path, + uint64_t nodeid, uint64_t attr_ino, off_t pos, + uint64_t written, uint32_t ioendflags, int error, + uint32_t dev, uint64_t new_addr, off_t *newsize) +{ + fuse_get_context()->private_data = fs->user_data; + if (!fs->op.iomap_ioend) + return -ENOSYS; + + if (fs->debug) { + fuse_log(FUSE_LOG_DEBUG, + "iomap_ioend[%s] nodeid %llu attr_ino %llu pos %llu written %llu ioendflags 0x%x error %d dev %u new_addr 0x%llx\n", + path, (unsigned long long)nodeid, + (unsigned long long)attr_ino, (unsigned long long)pos, + (unsigned long long)written, ioendflags, error, dev, + (unsigned long long)new_addr); + } + + return fs->op.iomap_ioend(path, nodeid, attr_ino, pos, written, + ioendflags, error, dev, new_addr, newsize); +} + static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, int valid, struct fuse_file_info *fi) { @@ -4621,6 +4643,36 @@ static void fuse_lib_iomap_end(fuse_req_t req, fuse_ino_t nodeid, reply_err(req, err); } +static void fuse_lib_iomap_ioend(fuse_req_t req, fuse_ino_t nodeid, + uint64_t attr_ino, off_t pos, uint64_t written, + uint32_t ioendflags, int error, uint32_t dev, + uint64_t new_addr) +{ + struct fuse *f = req_fuse_prepare(req); + struct fuse_intr_data d; + char *path; + off_t newsize = 0; + int err; + + err = get_path_nullok(f, nodeid, &path); + if (err) { + reply_err(req, err); + return; + } + + fuse_prepare_interrupt(f, req, &d); + err = fuse_fs_iomap_ioend(f->fs, 
path, nodeid, attr_ino, pos, written, + ioendflags, error, dev, new_addr, &newsize); + fuse_finish_interrupt(f, req, &d); + free_path(f, nodeid, path); + if (err) { + reply_err(req, err); + return; + } + + fuse_reply_iomap_ioend(req, newsize); +} + static int clean_delay(struct fuse *f) { /* @@ -4725,6 +4777,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = { .syncfs = fuse_lib_syncfs, .iomap_begin = fuse_lib_iomap_begin, .iomap_end = fuse_lib_iomap_end, + .iomap_ioend = fuse_lib_iomap_ioend, }; int fuse_notify_poll(struct fuse_pollhandle *ph) ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 10/25] libfuse: add a reply function to send FUSE_ATTR_* to the kernel 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (8 preceding siblings ...) 2026-04-29 14:41 ` [PATCH 09/25] libfuse: add upper level iomap ioend commands Darrick J. Wong @ 2026-04-29 14:41 ` Darrick J. Wong 2026-04-29 14:41 ` [PATCH 11/25] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong ` (14 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:41 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Create new fuse_reply_{attr,create,entry}_iflags functions so that we can send FUSE_ATTR_* flags to the kernel when instantiating an inode. Servers are expected to send FUSE_IFLAG_* values, which will be translated into what the kernel can understand. Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> --- include/fuse_common.h | 3 ++ include/fuse_lowlevel.h | 83 +++++++++++++++++++++++++++++++++++++++++++++++ lib/fuse_lowlevel.c | 64 ++++++++++++++++++++++++++++-------- lib/fuse_versionscript | 4 ++ 4 files changed, 139 insertions(+), 15 deletions(-) diff --git a/include/fuse_common.h b/include/fuse_common.h index 865bcc6f5543d8..5a5f69a6e23e5e 100644 --- a/include/fuse_common.h +++ b/include/fuse_common.h @@ -1226,6 +1226,9 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags, /* is pagecache writeback */ #define FUSE_IOMAP_IOEND_WRITEBACK (1U << 5) +/* enable fsdax */ +#define FUSE_IFLAG_DAX (1U << 0) + /* ----------------------------------------------------------- * * Compatibility stuff * * ----------------------------------------------------------- */ diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h index f55682f81e86f4..6d80713fe8bf88 100644 --- a/include/fuse_lowlevel.h +++ b/include/fuse_lowlevel.h @@ -242,6 +242,7 @@ struct fuse_lowlevel_ops { * * Valid replies: * fuse_reply_entry + * fuse_reply_entry_iflags * fuse_reply_err * * @param req request handle @@ -301,6 +302,7 @@ struct fuse_lowlevel_ops { * * Valid replies: * fuse_reply_attr + * fuse_reply_attr_iflags * fuse_reply_err * * @param req request handle @@ -336,6 +338,7 @@ struct fuse_lowlevel_ops { * * Valid replies: * fuse_reply_attr + * fuse_reply_attr_iflags * fuse_reply_err * * @param req request handle @@ -367,6 +370,7 @@ struct fuse_lowlevel_ops { * * Valid replies: * fuse_reply_entry + * fuse_reply_entry_iflags * fuse_reply_err * * @param req request handle @@ -383,6 +387,7 @@ struct fuse_lowlevel_ops { * * Valid replies: * fuse_reply_entry + * fuse_reply_entry_iflags * fuse_reply_err * * @param req request handle @@ -432,6 +437,7 @@ struct fuse_lowlevel_ops { * * Valid replies: * fuse_reply_entry + * fuse_reply_entry_iflags * fuse_reply_err * * @param req request handle @@ -480,6 +486,7 @@ struct fuse_lowlevel_ops { * * Valid 
replies: * fuse_reply_entry + * fuse_reply_entry_iflags * fuse_reply_err * * @param req request handle @@ -971,6 +978,7 @@ struct fuse_lowlevel_ops { * * Valid replies: * fuse_reply_create + * fuse_reply_create_iflags * fuse_reply_err * * @param req request handle @@ -1316,6 +1324,7 @@ struct fuse_lowlevel_ops { * * Valid replies: * fuse_reply_create + * fuse_reply_create_iflags * fuse_reply_err * * @param req request handle @@ -1468,6 +1477,23 @@ void fuse_reply_none(fuse_req_t req); */ int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e); +/** + * Reply with a directory entry and FUSE_IFLAG_* + * + * Possible requests: + * lookup, mknod, mkdir, symlink, link + * + * Side effects: + * increments the lookup count on success + * + * @param req request handle + * @param e the entry parameters + * @param iflags FUSE_IFLAG_* + * @return zero for success, -errno for failure to send reply + */ +int fuse_reply_entry_iflags(fuse_req_t req, const struct fuse_entry_param *e, + unsigned int iflags); + /** * Reply with a directory entry and open parameters * @@ -1489,6 +1515,29 @@ int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e); int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e, const struct fuse_file_info *fi); +/** + * Reply with a directory entry, open parameters and FUSE_IFLAG_* + * + * currently the following members of 'fi' are used: + * fh, direct_io, keep_cache, cache_readdir, nonseekable, noflush, + * parallel_direct_writes + * + * Possible requests: + * create + * + * Side effects: + * increments the lookup count on success + * + * @param req request handle + * @param e the entry parameters + * @param iflags FUSE_IFLAG_* + * @param fi file information + * @return zero for success, -errno for failure to send reply + */ +int fuse_reply_create_iflags(fuse_req_t req, const struct fuse_entry_param *e, + unsigned int iflags, + const struct fuse_file_info *fi); + /** * Reply with attributes * @@ -1503,6 +1552,21 
@@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e, int fuse_reply_attr(fuse_req_t req, const struct stat *attr, double attr_timeout); +/** + * Reply with attributes and FUSE_IFLAG_* flags + * + * Possible requests: + * getattr, setattr + * + * @param req request handle + * @param attr the attributes + * @param attr_timeout validity timeout (in seconds) for the attributes + * @param iflags set of FUSE_IFLAG_* flags + * @return zero for success, -errno for failure to send reply + */ +int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr, + unsigned int iflags, double attr_timeout); + /** * Reply with the contents of a symbolic link * @@ -1730,6 +1794,25 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize, const char *name, const struct fuse_entry_param *e, off_t off); +/** + * Add a directory entry and FUSE_IFLAG_* to the buffer with the attributes + * + * See documentation of `fuse_add_direntry_plus()` for more details. + * + * @param req request handle + * @param buf the point where the new entry will be added to the buffer + * @param bufsize remaining size of the buffer + * @param name the name of the entry + * @param iflags FUSE_IFLAG_* + * @param e the directory entry + * @param off the offset of the next entry + * @return the space needed for the entry + */ +size_t fuse_add_direntry_plus_iflags(fuse_req_t req, char *buf, size_t bufsize, + const char *name, unsigned int iflags, + const struct fuse_entry_param *e, + off_t off); + /** * Reply to ask for data fetch and output buffer preparation. 
ioctl * will be retried with the specified input data fetched and output diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c index 8d706aa4b92f88..2dbdef695009f8 100644 --- a/lib/fuse_lowlevel.c +++ b/lib/fuse_lowlevel.c @@ -142,7 +142,8 @@ static void trace_request_reply(uint64_t unique, unsigned int len, } #endif -static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr) +static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr, + unsigned int iflags) { attr->ino = stbuf->st_ino; attr->mode = stbuf->st_mode; @@ -159,6 +160,10 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr) attr->atimensec = ST_ATIM_NSEC(stbuf); attr->mtimensec = ST_MTIM_NSEC(stbuf); attr->ctimensec = ST_CTIM_NSEC(stbuf); + + attr->flags = 0; + if (iflags & FUSE_IFLAG_DAX) + attr->flags |= FUSE_ATTR_DAX; } static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf) @@ -476,7 +481,8 @@ static unsigned int calc_timeout_nsec(double t) } static void fill_entry(struct fuse_entry_out *arg, - const struct fuse_entry_param *e) + const struct fuse_entry_param *e, + unsigned int iflags) { arg->nodeid = e->ino; arg->generation = e->generation; @@ -484,14 +490,15 @@ static void fill_entry(struct fuse_entry_out *arg, arg->entry_valid_nsec = calc_timeout_nsec(e->entry_timeout); arg->attr_valid = calc_timeout_sec(e->attr_timeout); arg->attr_valid_nsec = calc_timeout_nsec(e->attr_timeout); - convert_stat(&e->attr, &arg->attr); + convert_stat(&e->attr, &arg->attr, iflags); } /* `buf` is allowed to be empty so that the proper size may be allocated by the caller */ -size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize, - const char *name, - const struct fuse_entry_param *e, off_t off) +size_t fuse_add_direntry_plus_iflags(fuse_req_t req, char *buf, size_t bufsize, + const char *name, unsigned int iflags, + const struct fuse_entry_param *e, + off_t off) { (void)req; size_t namelen; @@ -506,7 +513,7 @@ size_t 
fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize, struct fuse_direntplus *dp = (struct fuse_direntplus *) buf; memset(&dp->entry_out, 0, sizeof(dp->entry_out)); - fill_entry(&dp->entry_out, e); + fill_entry(&dp->entry_out, e, iflags); struct fuse_dirent *dirent = &dp->dirent; dirent->ino = e->attr.st_ino; @@ -519,6 +526,14 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize, return entlen_padded; } +size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize, + const char *name, + const struct fuse_entry_param *e, off_t off) +{ + return fuse_add_direntry_plus_iflags(req, buf, bufsize, name, 0, e, + off); +} + static void fill_open(struct fuse_open_out *arg, const struct fuse_file_info *f) { @@ -541,7 +556,8 @@ static void fill_open(struct fuse_open_out *arg, arg->open_flags |= FOPEN_PARALLEL_DIRECT_WRITES; } -int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e) +int fuse_reply_entry_iflags(fuse_req_t req, const struct fuse_entry_param *e, + unsigned int iflags) { struct fuse_entry_out arg; size_t size = req->se->conn.proto_minor < 9 ? @@ -553,12 +569,18 @@ int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e) return fuse_reply_err(req, ENOENT); memset(&arg, 0, sizeof(arg)); - fill_entry(&arg, e); + fill_entry(&arg, e, iflags); return send_reply_ok(req, &arg, size); } -int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e, - const struct fuse_file_info *f) +int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e) +{ + return fuse_reply_entry_iflags(req, e, 0); +} + +int fuse_reply_create_iflags(fuse_req_t req, const struct fuse_entry_param *e, + unsigned int iflags, + const struct fuse_file_info *f) { alignas(uint64_t) char buf[sizeof(struct fuse_entry_out) + sizeof(struct fuse_open_out)]; size_t entrysize = req->se->conn.proto_minor < 9 ? 
@@ -569,7 +591,7 @@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e, int error; memset(buf, 0, sizeof(buf)); - fill_entry(earg, e); + fill_entry(earg, e, iflags); fill_open(oarg, f); error = send_reply_ok(req, buf, entrysize + sizeof(struct fuse_open_out)); @@ -578,8 +600,14 @@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e, return error; } -int fuse_reply_attr(fuse_req_t req, const struct stat *attr, - double attr_timeout) +int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e, + const struct fuse_file_info *f) +{ + return fuse_reply_create_iflags(req, e, 0, f); +} + +int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr, + unsigned int iflags, double attr_timeout) { struct fuse_attr_out arg; size_t size = req->se->conn.proto_minor < 9 ? @@ -588,11 +616,17 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr, memset(&arg, 0, sizeof(arg)); arg.attr_valid = calc_timeout_sec(attr_timeout); arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout); - convert_stat(attr, &arg.attr); + convert_stat(attr, &arg.attr, iflags); return send_reply_ok(req, &arg, size); } +int fuse_reply_attr(fuse_req_t req, const struct stat *attr, + double attr_timeout) +{ + return fuse_reply_attr_iflags(req, attr, 0, attr_timeout); +} + int fuse_reply_readlink(fuse_req_t req, const char *linkname) { return send_reply_ok(req, linkname, strlen(linkname)); diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index 06fcc6dcaa11be..01c11ba1794231 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -256,6 +256,10 @@ FUSE_3.99 { fuse_lowlevel_iomap_device_remove; fuse_fs_iomap_device_add; fuse_fs_iomap_device_remove; + fuse_reply_attr_iflags; + fuse_reply_create_iflags; + fuse_reply_entry_iflags; + fuse_add_direntry_plus_iflags; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
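A minimal sketch, not taken from the patch, of the forwarding pattern used above: each legacy reply function becomes a thin wrapper that calls its new *_iflags sibling with iflags == 0, so existing servers keep their current behavior. The flag values are copied from the series; every demo_* name and the trimmed-down attr struct are stand-ins invented for illustration, and nothing here talks to a real fuse session.

```c
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>

/* Values as they appear in the series; everything named demo_* is a stand-in. */
#define FUSE_IFLAG_DAX	(1U << 0)	/* libfuse-level flag (fuse_common.h) */
#define FUSE_ATTR_DAX	(1 << 1)	/* wire flag (fuse_kernel.h) */

/* Trimmed-down stand-in for struct fuse_attr; the real one has more fields. */
struct demo_fuse_attr {
	uint64_t ino;
	uint32_t mode;
	uint32_t flags;
};

/* Same shape as the patched convert_stat(): copy fields, then map iflags. */
static void demo_convert_stat(const struct stat *stbuf,
			      struct demo_fuse_attr *attr,
			      unsigned int iflags)
{
	attr->ino = stbuf->st_ino;
	attr->mode = stbuf->st_mode;

	attr->flags = 0;
	if (iflags & FUSE_IFLAG_DAX)
		attr->flags |= FUSE_ATTR_DAX;
}

/* New-style entry point: the caller chooses the inode flags. */
static int demo_reply_attr_iflags(const struct stat *stbuf,
				  unsigned int iflags,
				  struct demo_fuse_attr *out)
{
	memset(out, 0, sizeof(*out));
	demo_convert_stat(stbuf, out, iflags);
	return 0;
}

/* Legacy entry point: forwards with iflags == 0, so old callers see no change. */
static int demo_reply_attr(const struct stat *stbuf,
			   struct demo_fuse_attr *out)
{
	return demo_reply_attr_iflags(stbuf, 0, out);
}
```

Keeping the old symbols as wrappers means only the four new *_iflags names need to be added to the version script.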
* [PATCH 11/25] libfuse: connect high level fuse library to fuse_reply_attr_iflags 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (9 preceding siblings ...) 2026-04-29 14:41 ` [PATCH 10/25] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong @ 2026-04-29 14:41 ` Darrick J. Wong 2026-04-29 14:42 ` [PATCH 12/25] libfuse: support enabling exclusive mode for files Darrick J. Wong ` (13 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:41 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Create a new ->getattr_iflags function so that iomap filesystems can set the appropriate in-kernel inode flags on instantiation. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse.h | 7 ++ lib/fuse.c | 192 ++++++++++++++++++++++++++++++++++++++++++-------------- 2 files changed, 151 insertions(+), 48 deletions(-) diff --git a/include/fuse.h b/include/fuse.h index dbb72c80b138c6..1b009c7bdee295 100644 --- a/include/fuse.h +++ b/include/fuse.h @@ -911,6 +911,13 @@ struct fuse_operations { uint64_t written_in, uint32_t ioendflags_in, int error_in, uint32_t dev_in, uint64_t new_addr_in, off_t *newsize); + + /** + * Get file attributes and FUSE_IFLAG_* flags. Otherwise the same as + * getattr. 
+ */ + int (*getattr_iflags)(const char *path, struct stat *statbuf, + unsigned int *iflags, struct fuse_file_info *fi); }; /** Extra context that may be needed by some filesystems diff --git a/lib/fuse.c b/lib/fuse.c index 1bd672a7818a1f..f2d1026c90b957 100644 --- a/lib/fuse.c +++ b/lib/fuse.c @@ -123,6 +123,7 @@ struct fuse { struct list_head partial_slabs; struct list_head full_slabs; pthread_t prune_thread; + bool want_iflags; }; struct lock { @@ -144,6 +145,7 @@ struct node { char *name; uint64_t nlookup; int open_count; + unsigned int iflags; struct timespec stat_updated; struct timespec mtime; off_t size; @@ -1629,6 +1631,24 @@ int fuse_fs_getattr(struct fuse_fs *fs, const char *path, struct stat *buf, return fs->op.getattr(path, buf, fi); } +static int fuse_fs_getattr_iflags(struct fuse_fs *fs, const char *path, + struct stat *statbuf, unsigned int *iflags, + struct fuse_file_info *fi) +{ + fuse_get_context()->private_data = fs->user_data; + if (!fs->op.getattr_iflags) + return -ENOSYS; + + if (fs->debug) { + char buf[10]; + + fuse_log(FUSE_LOG_DEBUG, "getattr_iflags[%s] %s\n", + file_info_string(fi, buf, sizeof(buf)), + path); + } + return fs->op.getattr_iflags(path, statbuf, iflags, fi); +} + int fuse_fs_rename(struct fuse_fs *fs, const char *oldpath, const char *newpath, unsigned int flags) { @@ -2490,7 +2510,7 @@ static void update_stat(struct node *node, const struct stat *stbuf) } static int do_lookup(struct fuse *f, fuse_ino_t nodeid, const char *name, - struct fuse_entry_param *e) + struct fuse_entry_param *e, unsigned int *iflags) { struct node *node; @@ -2508,25 +2528,64 @@ static int do_lookup(struct fuse *f, fuse_ino_t nodeid, const char *name, pthread_mutex_unlock(&f->lock); } set_stat(f, e->ino, &e->attr); + *iflags = node->iflags; return 0; } +static int lookup_and_update(struct fuse *f, fuse_ino_t nodeid, + const char *name, struct fuse_entry_param *e, + unsigned int iflags) +{ + struct node *node; + + node = find_node(f, nodeid, name); + if 
(node == NULL) + return -ENOMEM; + + e->ino = node->nodeid; + e->generation = node->generation; + e->entry_timeout = f->conf.entry_timeout; + e->attr_timeout = f->conf.attr_timeout; + if (f->conf.auto_cache) { + pthread_mutex_lock(&f->lock); + update_stat(node, &e->attr); + pthread_mutex_unlock(&f->lock); + } + set_stat(f, e->ino, &e->attr); + node->iflags = iflags; + return 0; +} + +static int getattr(struct fuse *f, const char *path, struct stat *buf, + unsigned int *iflags, struct fuse_file_info *fi) +{ + if (f->want_iflags) + return fuse_fs_getattr_iflags(f->fs, path, buf, iflags, fi); + return fuse_fs_getattr(f->fs, path, buf, fi); +} + static int lookup_path(struct fuse *f, fuse_ino_t nodeid, const char *name, const char *path, - struct fuse_entry_param *e, struct fuse_file_info *fi) + struct fuse_entry_param *e, unsigned int *iflags, + struct fuse_file_info *fi) { int res; memset(e, 0, sizeof(struct fuse_entry_param)); - res = fuse_fs_getattr(f->fs, path, &e->attr, fi); - if (res == 0) { - res = do_lookup(f, nodeid, name, e); - if (res == 0 && f->conf.debug) { - fuse_log(FUSE_LOG_DEBUG, " NODEID: %llu\n", - (unsigned long long) e->ino); - } - } - return res; + *iflags = 0; + res = getattr(f, path, &e->attr, iflags, fi); + if (res) + return res; + + res = lookup_and_update(f, nodeid, name, e, *iflags); + if (res) + return res; + + if (f->conf.debug) + fuse_log(FUSE_LOG_DEBUG, " NODEID: %llu iflags 0x%x\n", + (unsigned long long) e->ino, *iflags); + + return 0; } static struct fuse_context_i *fuse_get_context_internal(void) @@ -2613,11 +2672,14 @@ static inline void reply_err(fuse_req_t req, int err) } static void reply_entry(fuse_req_t req, const struct fuse_entry_param *e, - int err) + unsigned int iflags, int err) { if (!err) { struct fuse *f = req_fuse(req); - if (fuse_reply_entry(req, e) == -ENOENT) { + int entry_res; + + entry_res = fuse_reply_entry_iflags(req, e, iflags); + if (entry_res == -ENOENT) { /* Skip forget for negative result */ if (e->ino != 
0) forget_node(f, e->ino, 1); @@ -2658,6 +2720,9 @@ static void fuse_lib_init(void *data, struct fuse_conn_info *conn) /* Disable the receiving and processing of FUSE_INTERRUPT requests */ conn->no_interrupt = 1; } + + if (conn->want_ext & FUSE_CAP_IOMAP) + f->want_iflags = true; } void fuse_fs_destroy(struct fuse_fs *fs) @@ -2681,6 +2746,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent, struct fuse *f = req_fuse_prepare(req); struct fuse_entry_param e = { .ino = 0 }; /* invalid ino */ char *path; + unsigned int iflags = 0; int err; struct node *dot = NULL; @@ -2695,7 +2761,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent, dot = get_node_nocheck(f, parent); if (dot == NULL) { pthread_mutex_unlock(&f->lock); - reply_entry(req, &e, -ESTALE); + reply_entry(req, &e, 0, -ESTALE); return; } dot->refctr++; @@ -2715,7 +2781,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent, if (f->conf.debug) fuse_log(FUSE_LOG_DEBUG, "LOOKUP %s\n", path); fuse_prepare_interrupt(f, req, &d); - err = lookup_path(f, parent, name, path, &e, NULL); + err = lookup_path(f, parent, name, path, &e, &iflags, NULL); if (err == -ENOENT && f->conf.negative_timeout != 0.0) { e.ino = 0; e.entry_timeout = f->conf.negative_timeout; @@ -2729,7 +2795,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent, unref_node(f, dot); pthread_mutex_unlock(&f->lock); } - reply_entry(req, &e, err); + reply_entry(req, &e, iflags, err); } static void do_forget(struct fuse *f, fuse_ino_t ino, uint64_t nlookup) @@ -2765,6 +2831,7 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino, struct fuse *f = req_fuse_prepare(req); struct stat buf; char *path; + unsigned int iflags = 0; int err; memset(&buf, 0, sizeof(buf)); @@ -2776,7 +2843,7 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino, if (!err) { struct fuse_intr_data d; fuse_prepare_interrupt(f, req, &d); - err = fuse_fs_getattr(f->fs, path, &buf, fi); + err = getattr(f, path, &buf, 
&iflags, fi); fuse_finish_interrupt(f, req, &d); free_path(f, ino, path); } @@ -2789,9 +2856,11 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino, buf.st_nlink--; if (f->conf.auto_cache) update_stat(node, &buf); + node->iflags = iflags; pthread_mutex_unlock(&f->lock); set_stat(f, ino, &buf); - fuse_reply_attr(req, &buf, f->conf.attr_timeout); + fuse_reply_attr_iflags(req, &buf, iflags, + f->conf.attr_timeout); } else reply_err(req, err); } @@ -2900,6 +2969,7 @@ static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, struct fuse *f = req_fuse_prepare(req); struct stat buf; char *path; + unsigned int iflags = 0; int err; memset(&buf, 0, sizeof(buf)); @@ -2957,19 +3027,23 @@ static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, err = fuse_fs_utimens(f->fs, path, tv, fi); } if (!err) { - err = fuse_fs_getattr(f->fs, path, &buf, fi); + err = getattr(f, path, &buf, &iflags, fi); } fuse_finish_interrupt(f, req, &d); free_path(f, ino, path); } if (!err) { - if (f->conf.auto_cache) { - pthread_mutex_lock(&f->lock); - update_stat(get_node(f, ino), &buf); - pthread_mutex_unlock(&f->lock); - } + struct node *node; + + pthread_mutex_lock(&f->lock); + node = get_node(f, ino); + if (f->conf.auto_cache) + update_stat(node, &buf); + node->iflags = iflags; + pthread_mutex_unlock(&f->lock); set_stat(f, ino, &buf); - fuse_reply_attr(req, &buf, f->conf.attr_timeout); + fuse_reply_attr_iflags(req, &buf, iflags, + f->conf.attr_timeout); } else reply_err(req, err); } @@ -3020,6 +3094,7 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name, struct fuse *f = req_fuse_prepare(req); struct fuse_entry_param e; char *path; + unsigned int iflags = 0; int err; err = get_path_name(f, parent, name, &path); @@ -3036,7 +3111,7 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name, err = fuse_fs_create(f->fs, path, mode, &fi); if (!err) { err = lookup_path(f, parent, name, path, &e, - 
&fi); + &iflags, &fi); fuse_fs_release(f->fs, path, &fi); } } @@ -3044,12 +3119,12 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name, err = fuse_fs_mknod(f->fs, path, mode, rdev); if (!err) err = lookup_path(f, parent, name, path, &e, - NULL); + &iflags, NULL); } fuse_finish_interrupt(f, req, &d); free_path(f, parent, path); } - reply_entry(req, &e, err); + reply_entry(req, &e, iflags, err); } static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name, @@ -3058,6 +3133,7 @@ static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name, struct fuse *f = req_fuse_prepare(req); struct fuse_entry_param e; char *path; + unsigned int iflags = 0; int err; err = get_path_name(f, parent, name, &path); @@ -3067,11 +3143,12 @@ static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name, fuse_prepare_interrupt(f, req, &d); err = fuse_fs_mkdir(f->fs, path, mode); if (!err) - err = lookup_path(f, parent, name, path, &e, NULL); + err = lookup_path(f, parent, name, path, &e, &iflags, + NULL); fuse_finish_interrupt(f, req, &d); free_path(f, parent, path); } - reply_entry(req, &e, err); + reply_entry(req, &e, iflags, err); } static void fuse_lib_unlink(fuse_req_t req, fuse_ino_t parent, @@ -3141,6 +3218,7 @@ static void fuse_lib_symlink(fuse_req_t req, const char *linkname, struct fuse *f = req_fuse_prepare(req); struct fuse_entry_param e; char *path; + unsigned int iflags = 0; int err; err = get_path_name(f, parent, name, &path); @@ -3150,11 +3228,12 @@ static void fuse_lib_symlink(fuse_req_t req, const char *linkname, fuse_prepare_interrupt(f, req, &d); err = fuse_fs_symlink(f->fs, linkname, path); if (!err) - err = lookup_path(f, parent, name, path, &e, NULL); + err = lookup_path(f, parent, name, path, &e, &iflags, + NULL); fuse_finish_interrupt(f, req, &d); free_path(f, parent, path); } - reply_entry(req, &e, err); + reply_entry(req, &e, iflags, err); } static void fuse_lib_rename(fuse_req_t 
req, fuse_ino_t olddir, @@ -3202,6 +3281,7 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent, struct fuse_entry_param e; char *oldpath; char *newpath; + unsigned int iflags = 0; int err; err = get_path2(f, ino, NULL, newparent, newname, @@ -3213,11 +3293,11 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent, err = fuse_fs_link(f->fs, oldpath, newpath); if (!err) err = lookup_path(f, newparent, newname, newpath, - &e, NULL); + &e, &iflags, NULL); fuse_finish_interrupt(f, req, &d); free_path2(f, ino, newparent, NULL, NULL, oldpath, newpath); } - reply_entry(req, &e, err); + reply_entry(req, &e, iflags, err); } static void fuse_do_release(struct fuse *f, fuse_ino_t ino, const char *path, @@ -3260,6 +3340,7 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent, struct fuse_intr_data d; struct fuse_entry_param e = {0}; char *path; + unsigned int iflags; int err; err = get_path_name(f, parent, name, &path); @@ -3267,7 +3348,8 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent, fuse_prepare_interrupt(f, req, &d); err = fuse_fs_create(f->fs, path, mode, fi); if (!err) { - err = lookup_path(f, parent, name, path, &e, fi); + err = lookup_path(f, parent, name, path, &e, + &iflags, fi); if (err) fuse_fs_release(f->fs, path, fi); else if (!S_ISREG(e.attr.st_mode)) { @@ -3287,10 +3369,14 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent, fuse_finish_interrupt(f, req, &d); } if (!err) { + int create_res; + pthread_mutex_lock(&f->lock); get_node(f, e.ino)->open_count++; pthread_mutex_unlock(&f->lock); - if (fuse_reply_create(req, &e, fi) == -ENOENT) { + + create_res = fuse_reply_create_iflags(req, &e, iflags, fi); + if (create_res == -ENOENT) { /* The open syscall was interrupted, so it must be cancelled */ fuse_do_release(f, e.ino, path, fi); @@ -3324,13 +3410,16 @@ static void open_auto_cache(struct fuse *f, fuse_ino_t ino, const char *path, if (diff_timespec(&now, 
&node->stat_updated) > f->conf.ac_attr_timeout) { struct stat stbuf; + unsigned int iflags = 0; int err; + pthread_mutex_unlock(&f->lock); - err = fuse_fs_getattr(f->fs, path, &stbuf, fi); + err = getattr(f, path, &stbuf, &iflags, fi); pthread_mutex_lock(&f->lock); - if (!err) + if (!err) { update_stat(node, &stbuf); - else + node->iflags = iflags; + } else node->cache_valid = 0; } } @@ -3659,6 +3748,7 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp, .ino = 0, }; struct fuse *f = dh->fuse; + unsigned int iflags = 0; if ((flags & ~FUSE_FILL_DIR_PLUS) != 0) { dh->error = -EIO; @@ -3682,6 +3772,7 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp, if (off) { size_t newlen; + size_t thislen; if (dh->filled) { dh->error = -EIO; @@ -3697,8 +3788,8 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp, if (statp && (flags & FUSE_FILL_DIR_PLUS)) { if (!is_dot_or_dotdot(name)) { - int res = do_lookup(f, dh->nodeid, name, &e); - + int res = do_lookup(f, dh->nodeid, name, &e, + &iflags); if (res) { dh->error = res; return 1; @@ -3706,10 +3797,12 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp, } } - newlen = dh->len + - fuse_add_direntry_plus(dh->req, dh->contents + dh->len, - dh->needlen - dh->len, name, - &e, off); + thislen = fuse_add_direntry_plus_iflags(dh->req, + dh->contents + dh->len, + dh->needlen - dh->len, + name, iflags, &e, off); + newlen = dh->len + thislen; + if (newlen > dh->needlen) return 1; dh->len = newlen; @@ -3796,6 +3889,7 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh, unsigned rem = dh->needlen - dh->len; unsigned thislen; unsigned newlen; + unsigned int iflags = 0; pos++; if (flags & FUSE_READDIR_PLUS) { @@ -3807,15 +3901,17 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh, if (de->flags & FUSE_FILL_DIR_PLUS && !is_dot_or_dotdot(de->name)) { res = do_lookup(dh->fuse, dh->nodeid, 
- de->name, &e); + de->name, &e, &iflags); if (res) { dh->error = res; return 1; } } - thislen = fuse_add_direntry_plus(req, p, rem, - de->name, &e, pos); + thislen = fuse_add_direntry_plus_iflags(req, p, rem, + de->name, + iflags, &e, + pos); } else { thislen = fuse_add_direntry(req, p, rem, de->name, &de->stat, pos); ^ permalink raw reply related [flat|nested] 191+ messages in thread
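A rough sketch, under stated assumptions, of the dispatch this patch adds to the high level library: when the server asked for iomap, attributes are fetched through ->getattr_iflags, otherwise through plain ->getattr. The demo_* types are hypothetical stand-ins for struct fuse and struct fuse_operations, the struct fuse_file_info argument is dropped for brevity, and the example callbacks are invented; only the control flow follows the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/stat.h>

#define FUSE_IFLAG_DAX	(1U << 0)	/* value from include/fuse_common.h */

/* Stand-in for the two struct fuse_operations callbacks involved. */
struct demo_ops {
	int (*getattr)(const char *path, struct stat *statbuf);
	int (*getattr_iflags)(const char *path, struct stat *statbuf,
			      unsigned int *iflags);
};

/* Stand-in for struct fuse; want_iflags mirrors the new field. */
struct demo_fuse {
	struct demo_ops op;
	bool want_iflags;	/* set at init time if FUSE_CAP_IOMAP was wanted */
};

/* Mirrors the patch's static getattr() helper in lib/fuse.c. */
static int demo_getattr(struct demo_fuse *f, const char *path,
			struct stat *buf, unsigned int *iflags)
{
	if (f->want_iflags) {
		/* As in fuse_fs_getattr_iflags(): a missing callback is -ENOSYS. */
		if (!f->op.getattr_iflags)
			return -ENOSYS;
		return f->op.getattr_iflags(path, buf, iflags);
	}
	return f->op.getattr(path, buf);
}

/* Hypothetical server callbacks, for illustration only. */
static int example_getattr(const char *path, struct stat *statbuf)
{
	(void)path;
	memset(statbuf, 0, sizeof(*statbuf));
	statbuf->st_nlink = 1;
	statbuf->st_size = 4096;
	return 0;
}

static int example_getattr_iflags(const char *path, struct stat *statbuf,
				  unsigned int *iflags)
{
	int ret = example_getattr(path, statbuf);

	if (ret)
		return ret;
	*iflags = FUSE_IFLAG_DAX;	/* arbitrary choice for the demo */
	return 0;
}
```

Callers initialize *iflags to 0 before the call, which is why the plain ->getattr path can leave it untouched.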
* [PATCH 12/25] libfuse: support enabling exclusive mode for files
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (10 preceding siblings ...)
2026-04-29 14:41 ` [PATCH 11/25] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong
@ 2026-04-29 14:42 ` Darrick J. Wong
2026-04-29 14:42 ` [PATCH 13/25] libfuse: support direct I/O through iomap Darrick J. Wong
` (12 subsequent siblings)
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:42 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Make it so that lowlevel fuse servers can ask for exclusive mode.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h |    2 ++
 include/fuse_kernel.h |    4 ++++
 lib/fuse_lowlevel.c   |    2 ++
 3 files changed, 8 insertions(+)

diff --git a/include/fuse_common.h b/include/fuse_common.h
index 5a5f69a6e23e5e..08dfdd57fd4977 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1228,6 +1228,8 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
 
 /* enable fsdax */
 #define FUSE_IFLAG_DAX		(1U << 0)
+/* exclusive attr mode */
+#define FUSE_IFLAG_EXCLUSIVE	(1U << 1)
 
 /* ----------------------------------------------------------- *
  * Compatibility stuff					       *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 732085a1b900b0..9b0894899ca453 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -244,6 +244,7 @@
  *  7.99
  *  - XXX magic minor revision to make experimental code really obvious
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
+ *  - add FUSE_ATTR_EXCLUSIVE to enable exclusive mode for specific inodes
  */
 
 #ifndef _LINUX_FUSE_H
@@ -584,9 +585,12 @@ struct fuse_file_lock {
  *
  * FUSE_ATTR_SUBMOUNT: Object is a submount root
  * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
+ * FUSE_ATTR_EXCLUSIVE: This file can only be modified by this mount, so the
+ * kernel can use cached attributes more aggressively (e.g. ACL inheritance)
  */
 #define FUSE_ATTR_SUBMOUNT      (1 << 0)
 #define FUSE_ATTR_DAX		(1 << 1)
+#define FUSE_ATTR_EXCLUSIVE	(1 << 2)
 
 /**
  * Open flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 2dbdef695009f8..f033decf2def26 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -164,6 +164,8 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
 	attr->flags = 0;
 	if (iflags & FUSE_IFLAG_DAX)
 		attr->flags |= FUSE_ATTR_DAX;
+	if (iflags & FUSE_IFLAG_EXCLUSIVE)
+		attr->flags |= FUSE_ATTR_EXCLUSIVE;
 }
 
 static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 13/25] libfuse: support direct I/O through iomap
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (11 preceding siblings ...)
2026-04-29 14:42 ` [PATCH 12/25] libfuse: support enabling exclusive mode for files Darrick J. Wong
@ 2026-04-29 14:42 ` Darrick J. Wong
2026-04-29 14:42 ` [PATCH 14/25] libfuse: don't allow hardlinking of iomap files in the upper level fuse library Darrick J. Wong
` (11 subsequent siblings)
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:42 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Make it so that fuse servers can ask the kernel fuse driver to use iomap
to support direct IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h |    2 ++
 include/fuse_kernel.h |    3 +++
 lib/fuse_lowlevel.c   |    2 ++
 3 files changed, 7 insertions(+)

diff --git a/include/fuse_common.h b/include/fuse_common.h
index 08dfdd57fd4977..6c03d845258d25 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1230,6 +1230,8 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
 #define FUSE_IFLAG_DAX		(1U << 0)
 /* exclusive attr mode */
 #define FUSE_IFLAG_EXCLUSIVE	(1U << 1)
+/* use iomap for this inode */
+#define FUSE_IFLAG_IOMAP	(1U << 2)
 
 /* ----------------------------------------------------------- *
  * Compatibility stuff					       *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 9b0894899ca453..6b1fcc44004dbf 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -245,6 +245,7 @@
  *  - XXX magic minor revision to make experimental code really obvious
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
  *  - add FUSE_ATTR_EXCLUSIVE to enable exclusive mode for specific inodes
+ *  - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
  */
 
 #ifndef _LINUX_FUSE_H
@@ -587,10 +588,12 @@ struct fuse_file_lock {
  * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
  * FUSE_ATTR_EXCLUSIVE: This file can only be modified by this mount, so the
  * kernel can use cached attributes more aggressively (e.g. ACL inheritance)
+ * FUSE_ATTR_IOMAP: Use iomap for this inode
  */
 #define FUSE_ATTR_SUBMOUNT      (1 << 0)
 #define FUSE_ATTR_DAX		(1 << 1)
 #define FUSE_ATTR_EXCLUSIVE	(1 << 2)
+#define FUSE_ATTR_IOMAP		(1 << 3)
 
 /**
  * Open flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index f033decf2def26..034c00fcbe07ca 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -166,6 +166,8 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
 		attr->flags |= FUSE_ATTR_DAX;
 	if (iflags & FUSE_IFLAG_EXCLUSIVE)
 		attr->flags |= FUSE_ATTR_EXCLUSIVE;
+	if (iflags & FUSE_IFLAG_IOMAP)
+		attr->flags |= FUSE_ATTR_IOMAP;
 }
 
 static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)

^ permalink raw reply related [flat|nested] 191+ messages in thread
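A small sketch of why convert_stat() tests each bit instead of copying iflags straight into attr->flags: the libfuse-level FUSE_IFLAG_* values start at bit 0, while the wire-level FUSE_ATTR_* values are shifted up by one because FUSE_ATTR_SUBMOUNT already owns bit 0. The constants are copied from the hunks above; demo_iflags_to_attr_flags() is a hypothetical helper name, not something the series exports.

```c
#include <stdint.h>

/* libfuse-level inode flags (include/fuse_common.h in this series) */
#define FUSE_IFLAG_DAX		(1U << 0)
#define FUSE_IFLAG_EXCLUSIVE	(1U << 1)
#define FUSE_IFLAG_IOMAP	(1U << 2)

/* wire-level attr flags (include/fuse_kernel.h in this series) */
#define FUSE_ATTR_SUBMOUNT	(1 << 0)
#define FUSE_ATTR_DAX		(1 << 1)
#define FUSE_ATTR_EXCLUSIVE	(1 << 2)
#define FUSE_ATTR_IOMAP		(1 << 3)

/*
 * Hypothetical helper doing what the tail of convert_stat() does:
 * translate each known iflag to its wire bit and drop unknown bits.
 */
static uint32_t demo_iflags_to_attr_flags(unsigned int iflags)
{
	uint32_t flags = 0;

	if (iflags & FUSE_IFLAG_DAX)
		flags |= FUSE_ATTR_DAX;
	if (iflags & FUSE_IFLAG_EXCLUSIVE)
		flags |= FUSE_ATTR_EXCLUSIVE;
	if (iflags & FUSE_IFLAG_IOMAP)
		flags |= FUSE_ATTR_IOMAP;
	return flags;
}
```

One consequence of the explicit mapping is that a server cannot set FUSE_ATTR_SUBMOUNT through iflags, since no FUSE_IFLAG_* value maps onto it.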
* [PATCH 14/25] libfuse: don't allow hardlinking of iomap files in the upper level fuse library 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (12 preceding siblings ...) 2026-04-29 14:42 ` [PATCH 13/25] libfuse: support direct I/O through iomap Darrick J. Wong @ 2026-04-29 14:42 ` Darrick J. Wong 2026-04-29 14:42 ` [PATCH 15/25] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong ` (10 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:42 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> The upper level fuse library creates a separate node object for every (i)node referenced by a directory entry. Unfortunately, it doesn't account for the possibility of hardlinks, which means that we can create multiple nodeids that refer to the same hardlinked inode. Inode locking in iomap mode in the kernel relies there only being one inode object for a hardlinked file, so we cannot allow anyone to hardlink an iomap file. The client had better not turn on iomap for an existing hardlinked file. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse.h | 18 ++++++++++ lib/fuse.c | 90 +++++++++++++++++++++++++++++++++++++++++++----- lib/fuse_versionscript | 2 + 3 files changed, 101 insertions(+), 9 deletions(-) diff --git a/include/fuse.h b/include/fuse.h index 1b009c7bdee295..e2099ebc8ac7a1 100644 --- a/include/fuse.h +++ b/include/fuse.h @@ -1472,6 +1472,24 @@ int fuse_fs_iomap_device_add(int fd, unsigned int flags); */ int fuse_fs_iomap_device_remove(int device_id); +/** + * Decide if we can enable iomap mode for a particular file for an upper-level + * fuse server. + * + * @param stbuf stat information for the file. + * @return true if it can be enabled, false if not. 
+ */ +bool fuse_fs_can_enable_iomap(const struct stat *stbuf); + +/** + * Decide if we can enable iomap mode for a particular file for an upper-level + * fuse server. + * + * @param statxbuf statx information for the file. + * @return true if it can be enabled, false if not. + */ +bool fuse_fs_can_enable_iomapx(const struct statx *statxbuf); + int fuse_notify_poll(struct fuse_pollhandle *ph); /** diff --git a/lib/fuse.c b/lib/fuse.c index f2d1026c90b957..35ba13271d0b9c 100644 --- a/lib/fuse.c +++ b/lib/fuse.c @@ -3274,10 +3274,66 @@ static void fuse_lib_rename(fuse_req_t req, fuse_ino_t olddir, reply_err(req, err); } +/* + * Decide if file IO for this inode can use iomap. + * + * The upper level libfuse creates internal node ids that have nothing to do + * with the ext2_ino_t that we give it. These internal node ids are what + * actually gets igetted in the kernel, which means that there can be multiple + * fuse_inode objects in the kernel for a single hardlinked inode in the fuse + * server. + * + * What this means, horrifyingly, is that on a fuse filesystem that supports + * hard links, the in-kernel i_rwsem does not protect against concurrent writes + * between files that point to the same inode. That in turn means that the + * file mode and size can get desynchronized between the multiple fuse_inode + * objects. This also means that we cannot cache iomaps in the kernel AT ALL + * because the caches will get out of sync, leading to WARN_ONs from the iomap + * zeroing code and probably data corruption after that. + * + * Therefore, libfuse must never create hardlinks of iomap files, and the + * predicates below allow fuse servers to decide if they can turn on iomap for + * existing hardlinked files. 
+ */ +bool fuse_fs_can_enable_iomap(const struct stat *stbuf) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse_session *se = fuse_get_session(ctxt->fuse); + + if (!(se->conn.want_ext & FUSE_CAP_IOMAP)) + return false; + + return stbuf->st_nlink < 2; +} + +bool fuse_fs_can_enable_iomapx(const struct statx *statxbuf) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse_session *se = fuse_get_session(ctxt->fuse); + + if (!(se->conn.want_ext & FUSE_CAP_IOMAP)) + return false; + + return statxbuf->stx_nlink < 2; +} + +static bool fuse_lib_can_link(fuse_req_t req, fuse_ino_t ino) +{ + struct fuse *f = req_fuse_prepare(req); + struct node *node; + + if (!(req->se->conn.want_ext & FUSE_CAP_IOMAP)) + return true; + + node = get_node(f, ino); + return !(node->iflags & FUSE_IFLAG_IOMAP); +} + static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent, const char *newname) { struct fuse *f = req_fuse_prepare(req); + struct fuse_intr_data d; struct fuse_entry_param e; char *oldpath; char *newpath; @@ -3286,17 +3342,33 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent, err = get_path2(f, ino, NULL, newparent, newname, &oldpath, &newpath, NULL, NULL); - if (!err) { - struct fuse_intr_data d; + if (err) + goto out_reply; - fuse_prepare_interrupt(f, req, &d); - err = fuse_fs_link(f->fs, oldpath, newpath); - if (!err) - err = lookup_path(f, newparent, newname, newpath, - &e, &iflags, NULL); - fuse_finish_interrupt(f, req, &d); - free_path2(f, ino, newparent, NULL, NULL, oldpath, newpath); + /* + * The upper level fuse library creates a separate node object for + * every (i)node referenced by a directory entry. Unfortunately, it + * doesn't account for the possibility of hardlinks, which means that + * we can create multiple nodeids that refer to the same hardlinked + * inode. 
Inode locking in iomap mode in the kernel relies on there only + * being one inode object for a hardlinked file, so we cannot allow + * anyone to hardlink an iomap file. The client had better not turn on + * iomap for an existing hardlinked file. + */ + if (!fuse_lib_can_link(req, ino)) { + err = -EPERM; + goto out_path; } + + fuse_prepare_interrupt(f, req, &d); + err = fuse_fs_link(f->fs, oldpath, newpath); + if (!err) + err = lookup_path(f, newparent, newname, newpath, + &e, &iflags, NULL); + fuse_finish_interrupt(f, req, &d); +out_path: + free_path2(f, ino, newparent, NULL, NULL, oldpath, newpath); +out_reply: reply_entry(req, &e, iflags, err); } diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index 01c11ba1794231..5c1a1bfff44d87 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -260,6 +260,8 @@ FUSE_3.99 { fuse_reply_create_iflags; fuse_reply_entry_iflags; fuse_add_direntry_plus_iflags; + fuse_fs_can_enable_iomap; + fuse_fs_can_enable_iomapx; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
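[Editorial sketch, not part of the posted patch.] The link-count rule that the predicates in this patch enforce can be shown in isolation. This is a minimal standalone sketch: can_enable_iomap_sketch and its iomap_wanted argument are hypothetical names. In the patch itself the first input comes from se->conn.want_ext & FUSE_CAP_IOMAP and the second from st_nlink (or stx_nlink in the statx variant).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical stand-in for fuse_fs_can_enable_iomap(): iomap may only be
 * turned on if the connection asked for it and the file has at most one
 * link, so that only one nodeid can ever refer to the inode.
 */
static bool can_enable_iomap_sketch(bool iomap_wanted, const struct stat *stbuf)
{
	assert(stbuf != NULL);

	/* Connection never negotiated iomap: nothing to enable. */
	if (!iomap_wanted)
		return false;

	/* Two or more links means two or more nodeids could alias the inode. */
	return stbuf->st_nlink < 2;
}
```

A server would presumably call the real predicate from its getattr path before setting FUSE_IFLAG_IOMAP, and fall back to regular fuse file IO when it returns false.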
* [PATCH 15/25] libfuse: allow discovery of the kernel's iomap capabilities 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (13 preceding siblings ...) 2026-04-29 14:42 ` [PATCH 14/25] libfuse: don't allow hardlinking of iomap files in the upper level fuse library Darrick J. Wong @ 2026-04-29 14:42 ` Darrick J. Wong 2026-04-29 14:43 ` [PATCH 16/25] libfuse: add lower level iomap_config implementation Darrick J. Wong ` (9 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:42 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Create a library function so that we can discover the kernel's iomap capabilities ahead of time. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse_common.h | 7 +++++++ include/fuse_kernel.h | 7 +++++++ include/fuse_lowlevel.h | 9 +++++++++ include/fuse_service.h | 12 ++++++++++++ lib/fuse_lowlevel.c | 19 +++++++++++++++++++ lib/fuse_service.c | 5 +++++ lib/fuse_service_stub.c | 5 +++++ lib/fuse_versionscript | 2 ++ 8 files changed, 66 insertions(+) diff --git a/include/fuse_common.h b/include/fuse_common.h index 6c03d845258d25..277a8209f09d50 100644 --- a/include/fuse_common.h +++ b/include/fuse_common.h @@ -539,6 +539,13 @@ struct fuse_loop_config_v1 { #define FUSE_IOCTL_MAX_IOV 256 +/** + * iomap discovery flags + * + * FUSE_IOMAP_SUPPORT_FILEIO: basic file I/O functionality through iomap + */ +#define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0) + /** * Connection information, passed to the ->init() method * diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h index 6b1fcc44004dbf..95c6c179a4398a 100644 --- a/include/fuse_kernel.h +++ b/include/fuse_kernel.h @@ -1156,12 +1156,19 @@ struct fuse_backing_map { uint64_t padding; }; +struct fuse_iomap_support { + uint64_t flags; + uint64_t padding; +}; + /* 
Device ioctls: */ #define FUSE_DEV_IOC_MAGIC 229 #define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t) #define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \ struct fuse_backing_map) #define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t) +#define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 99, \ + struct fuse_iomap_support) struct fuse_lseek_in { uint64_t fh; diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h index 6d80713fe8bf88..df41afee0cfbe5 100644 --- a/include/fuse_lowlevel.h +++ b/include/fuse_lowlevel.h @@ -2659,6 +2659,15 @@ bool fuse_req_is_uring(fuse_req_t req); int fuse_req_get_payload(fuse_req_t req, char **payload, size_t *payload_sz, void **mr); +/** + * Discover the kernel's iomap capabilities. Returns FUSE_IOMAP_SUPPORT_* flags. + * + * @param fd open file descriptor to a fuse device, or -1 if you're running + * in the same process that will call mount(). + * @return FUSE_IOMAP_SUPPORT_* flags + */ +uint64_t fuse_lowlevel_discover_iomap(int fd); + #ifdef __cplusplus } #endif diff --git a/include/fuse_service.h b/include/fuse_service.h index 7e4c204e7a70bf..f1b3fb738aec3a 100644 --- a/include/fuse_service.h +++ b/include/fuse_service.h @@ -234,6 +234,18 @@ int fuse_service_send_goodbye(struct fuse_service *sf, int exitcode); */ int fuse_service_exit(int ret); +#if FUSE_MAKE_VERSION(3, 99) <= FUSE_USE_VERSION +/** + * Discover the kernel's iomap capabilities. Returns FUSE_IOMAP_SUPPORT_* flags. + * + * @param sf fuse service context; the query is issued against the fuse + * device fd that the service already holds.
+ * @return FUSE_IOMAP_SUPPORT_* flags + */ +uint64_t fuse_service_discover_iomap(struct fuse_service *sf); + +#endif /* FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 99) */ + #endif /* FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 19) */ #ifdef __cplusplus diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c index 034c00fcbe07ca..90136408b39204 100644 --- a/lib/fuse_lowlevel.c +++ b/lib/fuse_lowlevel.c @@ -5082,3 +5082,22 @@ void fuse_session_stop_teardown_watchdog(void *data) pthread_join(tt->thread_id, NULL); fuse_tt_destruct(tt); } + +uint64_t fuse_lowlevel_discover_iomap(int fd) +{ + struct fuse_iomap_support ios = { }; + + if (fd >= 0) { + ioctl(fd, FUSE_DEV_IOC_IOMAP_SUPPORT, &ios); + return ios.flags; + } + + fd = open(fuse_mnt_get_devname(), O_RDONLY | O_CLOEXEC); + if (fd < 0) + return 0; + + ioctl(fd, FUSE_DEV_IOC_IOMAP_SUPPORT, &ios); + close(fd); + + return ios.flags; +} diff --git a/lib/fuse_service.c b/lib/fuse_service.c index 83c1d564a18b0c..e860d1aafbe5e5 100644 --- a/lib/fuse_service.c +++ b/lib/fuse_service.c @@ -1246,3 +1246,8 @@ int fuse_service_exit(int ret) */ return ret != 0 ? 
EXIT_FAILURE : EXIT_SUCCESS; } + +uint64_t fuse_service_discover_iomap(struct fuse_service *sf) +{ + return fuse_lowlevel_discover_iomap(sf->fusedevfd); +} diff --git a/lib/fuse_service_stub.c b/lib/fuse_service_stub.c index d34df3891a6e31..2cafde6d6b6f5d 100644 --- a/lib/fuse_service_stub.c +++ b/lib/fuse_service_stub.c @@ -104,3 +104,8 @@ int fuse_service_exit(int ret) { return ret; } + +uint64_t fuse_service_discover_iomap(struct fuse_service *sf) +{ + return 0; +} diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index 5c1a1bfff44d87..46465540a1f51b 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -262,6 +262,8 @@ FUSE_3.99 { fuse_add_direntry_plus_iflags; fuse_fs_can_enable_iomap; fuse_fs_can_enable_iomapx; + fuse_lowlevel_discover_iomap; + fuse_service_discover_iomap; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
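[Editorial sketch, not part of the posted patch.] The discovery call above boils down to one ioctl on a fuse device fd plus a flag test. The sketch below copies the ioctl number and payload layout from this patch under local *_sketch / SKETCH_* names, because they are not in any released fuse_kernel.h. It only handles an already-open fd, whereas the patch also opens the fuse device itself when passed -1. Checking the ioctl return value is this sketch's choice; the patch relies on the zero-initialized struct instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Layout and numbers copied from the patch; names are local to this sketch. */
struct fuse_iomap_support_sketch {
	uint64_t flags;
	uint64_t padding;
};
static_assert(sizeof(struct fuse_iomap_support_sketch) == 16,
	      "ioctl payload is two 64-bit words");

#define SKETCH_DEV_IOC_MAGIC		229
#define SKETCH_DEV_IOC_IOMAP_SUPPORT	_IOR(SKETCH_DEV_IOC_MAGIC, 99, \
					     struct fuse_iomap_support_sketch)
#define SKETCH_IOMAP_SUPPORT_FILEIO	(1ULL << 0)

/* Query an already-open fuse device; report no capabilities on any error. */
static uint64_t discover_iomap_sketch(int fd)
{
	struct fuse_iomap_support_sketch ios = { 0 };

	if (ioctl(fd, SKETCH_DEV_IOC_IOMAP_SUPPORT, &ios) < 0)
		return 0;

	return ios.flags;
}

/* A server would stay on classic fuse file IO when this returns false. */
static bool iomap_fileio_supported(uint64_t flags)
{
	return (flags & SKETCH_IOMAP_SUPPORT_FILEIO) != 0;
}
```

Treating every failure as "no capabilities" keeps the caller's logic to a single flag test, which matches the uint64_t return type chosen in the patch.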
* [PATCH 16/25] libfuse: add lower level iomap_config implementation 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (14 preceding siblings ...) 2026-04-29 14:42 ` [PATCH 15/25] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong @ 2026-04-29 14:43 ` Darrick J. Wong 2026-04-29 14:43 ` [PATCH 17/25] libfuse: add upper " Darrick J. Wong ` (8 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:43 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Add FUSE_IOMAP_CONFIG helpers to the low level fuse library. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse_common.h | 37 ++++++++++++++++++++ include/fuse_kernel.h | 31 +++++++++++++++++ include/fuse_lowlevel.h | 28 +++++++++++++++ lib/fuse_lowlevel.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++ lib/fuse_versionscript | 1 + 5 files changed, 182 insertions(+) diff --git a/include/fuse_common.h b/include/fuse_common.h index 277a8209f09d50..2f0aff038c3ece 100644 --- a/include/fuse_common.h +++ b/include/fuse_common.h @@ -1240,6 +1240,43 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags, /* use iomap for this inode */ #define FUSE_IFLAG_IOMAP (1U << 2) +/* Which fields are set in fuse_iomap_config_out? 
*/ +#define FUSE_IOMAP_CONFIG_SID (1 << 0ULL) +#define FUSE_IOMAP_CONFIG_UUID (1 << 1ULL) +#define FUSE_IOMAP_CONFIG_BLOCKSIZE (1 << 2ULL) +#define FUSE_IOMAP_CONFIG_MAX_LINKS (1 << 3ULL) +#define FUSE_IOMAP_CONFIG_TIME (1 << 4ULL) +#define FUSE_IOMAP_CONFIG_MAXBYTES (1 << 5ULL) + +struct fuse_iomap_config_params { + uint64_t flags; /* supported FUSE_IOMAP_CONFIG_* flags */ + int64_t maxbytes; /* max supported file size */ + uint64_t padding[6]; /* zero */ +}; + +struct fuse_iomap_config { + uint64_t flags; /* FUSE_IOMAP_CONFIG_* */ + + char s_id[32]; /* Informational name */ + char s_uuid[16]; /* UUID */ + + uint8_t s_uuid_len; /* length of s_uuid */ + + uint8_t s_pad[3]; /* must be zeroes */ + + uint32_t s_blocksize; /* fs block size */ + uint32_t s_max_links; /* max hard links */ + + /* Granularity of c/m/atime in ns (cannot be worse than a second) */ + uint32_t s_time_gran; + + /* Time limits for c/m/atime in seconds */ + int64_t s_time_min; + int64_t s_time_max; + + int64_t s_maxbytes; /* max file size */ +}; + /* ----------------------------------------------------------- * * Compatibility stuff * * ----------------------------------------------------------- */ diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h index 95c6c179a4398a..897d996a0ce60d 100644 --- a/include/fuse_kernel.h +++ b/include/fuse_kernel.h @@ -246,6 +246,7 @@ * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations * - add FUSE_ATTR_EXCLUSIVE to enable exclusive mode for specific inodes * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes + * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry */ #ifndef _LINUX_FUSE_H @@ -677,6 +678,7 @@ enum fuse_opcode { FUSE_STATX = 52, FUSE_COPY_FILE_RANGE_64 = 53, + FUSE_IOMAP_CONFIG = 4092, FUSE_IOMAP_IOEND = 4093, FUSE_IOMAP_BEGIN = 4094, FUSE_IOMAP_END = 4095, @@ -1390,4 +1392,33 @@ struct fuse_iomap_ioend_out { uint64_t newsize; /* new ondisk size */ }; +struct fuse_iomap_config_in { + 
uint64_t flags; /* supported FUSE_IOMAP_CONFIG_* flags */ + int64_t maxbytes; /* max supported file size */ + uint64_t padding[6]; /* zero */ +}; + +struct fuse_iomap_config_out { + uint64_t flags; /* FUSE_IOMAP_CONFIG_* */ + + char s_id[32]; /* Informational name */ + char s_uuid[16]; /* UUID */ + + uint8_t s_uuid_len; /* length of s_uuid */ + + uint8_t s_pad[3]; /* must be zeroes */ + + uint32_t s_blocksize; /* fs block size */ + uint32_t s_max_links; /* max hard links */ + + /* Granularity of c/m/atime in ns (cannot be worse than a second) */ + uint32_t s_time_gran; + + /* Time limits for c/m/atime in seconds */ + int64_t s_time_min; + int64_t s_time_max; + + int64_t s_maxbytes; /* max file size */ +}; + #endif /* _LINUX_FUSE_H */ diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h index df41afee0cfbe5..e97c3df16fb466 100644 --- a/include/fuse_lowlevel.h +++ b/include/fuse_lowlevel.h @@ -1425,6 +1425,21 @@ struct fuse_lowlevel_ops { uint64_t attr_ino, off_t pos, uint64_t written, uint32_t ioendflags, int error, uint32_t dev, uint64_t new_addr); + + /** + * Configure the filesystem geometry for iomap mode + * + * Valid replies: + * fuse_reply_iomap_config + * fuse_reply_err + * + * @param req request handle + * @param p all available iomap configuration parameters + * @param psize size of parameters structure + */ + void (*iomap_config)(fuse_req_t req, + const struct fuse_iomap_config_params *p, + size_t psize); }; /** @@ -1928,6 +1943,19 @@ int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_file_iomap *read, */ int fuse_reply_iomap_ioend(fuse_req_t req, off_t newsize); +/** + * Reply with iomap configuration + * + * Possible requests: + * iomap_config + * + * @param req request handle + * @param cfg iomap configuration + * @return zero for success, -errno for failure to send reply + */ +int fuse_reply_iomap_config(fuse_req_t req, + const struct fuse_iomap_config *cfg); + /* ----------------------------------------------------------- * * 
Notification * * ----------------------------------------------------------- */ diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c index 90136408b39204..4750b39ab99137 100644 --- a/lib/fuse_lowlevel.c +++ b/lib/fuse_lowlevel.c @@ -2834,6 +2834,89 @@ static void do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid, _do_iomap_ioend(req, nodeid, inarg, NULL); } +#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER)) +#define offsetofend(TYPE, MEMBER) \ + (offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER)) + +#define FUSE_IOMAP_CONFIG_V1 (FUSE_IOMAP_CONFIG_SID | \ + FUSE_IOMAP_CONFIG_UUID | \ + FUSE_IOMAP_CONFIG_BLOCKSIZE | \ + FUSE_IOMAP_CONFIG_MAX_LINKS | \ + FUSE_IOMAP_CONFIG_TIME | \ + FUSE_IOMAP_CONFIG_MAXBYTES) + +#define FUSE_IOMAP_CONFIG_ALL (FUSE_IOMAP_CONFIG_V1) + +static ssize_t iomap_config_reply_size(const struct fuse_iomap_config *cfg) +{ + if (cfg->flags & ~FUSE_IOMAP_CONFIG_ALL) + return -EINVAL; + + return offsetofend(struct fuse_iomap_config_out, s_maxbytes); +} + +int fuse_reply_iomap_config(fuse_req_t req, const struct fuse_iomap_config *cfg) +{ + struct fuse_iomap_config_out arg = { + .flags = cfg->flags, + }; + const ssize_t reply_size = iomap_config_reply_size(cfg); + + if (reply_size < 0) + return fuse_reply_err(req, -reply_size); + + if (cfg->flags & FUSE_IOMAP_CONFIG_BLOCKSIZE) + arg.s_blocksize = cfg->s_blocksize; + + if (cfg->flags & FUSE_IOMAP_CONFIG_SID) + memcpy(arg.s_id, cfg->s_id, sizeof(arg.s_id)); + + if (cfg->flags & FUSE_IOMAP_CONFIG_UUID) { + arg.s_uuid_len = cfg->s_uuid_len; + if (arg.s_uuid_len > sizeof(arg.s_uuid)) + arg.s_uuid_len = sizeof(arg.s_uuid); + memcpy(arg.s_uuid, cfg->s_uuid, arg.s_uuid_len); + } + + if (cfg->flags & FUSE_IOMAP_CONFIG_MAX_LINKS) + arg.s_max_links = cfg->s_max_links; + + if (cfg->flags & FUSE_IOMAP_CONFIG_TIME) { + arg.s_time_gran = cfg->s_time_gran; + arg.s_time_min = cfg->s_time_min; + arg.s_time_max = cfg->s_time_max; + } + + if (cfg->flags & FUSE_IOMAP_CONFIG_MAXBYTES) + arg.s_maxbytes =
cfg->s_maxbytes; + + return send_reply_ok(req, &arg, reply_size); +} + +static void _do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid, + const void *op_in, const void *in_payload) +{ + const struct fuse_iomap_config_in *arg = op_in; + struct fuse_iomap_config_params p = { + .flags = arg->flags & FUSE_IOMAP_CONFIG_ALL, + .maxbytes = arg->maxbytes, + }; + + (void)nodeid; + (void)in_payload; + + if (req->se->op.iomap_config) + req->se->op.iomap_config(req, &p, sizeof(p)); + else + fuse_reply_err(req, ENOSYS); +} + +static void do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid, + const void *inarg) +{ + _do_iomap_config(req, nodeid, inarg, NULL); +} + static bool want_flags_valid(uint64_t capable, uint64_t want) { uint64_t unknown_flags = want & (~capable); @@ -3827,6 +3910,7 @@ static struct { [FUSE_LSEEK] = { do_lseek, "LSEEK" }, [FUSE_SYNCFS] = { do_syncfs, "SYNCFS" }, [FUSE_STATX] = { do_statx, "STATX" }, + [FUSE_IOMAP_CONFIG] = { do_iomap_config, "IOMAP_CONFIG" }, [FUSE_IOMAP_BEGIN] = { do_iomap_begin, "IOMAP_BEGIN" }, [FUSE_IOMAP_END] = { do_iomap_end, "IOMAP_END" }, [FUSE_IOMAP_IOEND] = { do_iomap_ioend, "IOMAP_IOEND" }, @@ -3887,6 +3971,7 @@ static struct { [FUSE_LSEEK] = { _do_lseek, "LSEEK" }, [FUSE_SYNCFS] = { _do_syncfs, "SYNCFS" }, [FUSE_STATX] = { _do_statx, "STATX" }, + [FUSE_IOMAP_CONFIG] = { _do_iomap_config, "IOMAP_CONFIG" }, [FUSE_IOMAP_BEGIN] = { _do_iomap_begin, "IOMAP_BEGIN" }, [FUSE_IOMAP_END] = { _do_iomap_end, "IOMAP_END" }, [FUSE_IOMAP_IOEND] = { _do_iomap_ioend, "IOMAP_IOEND" }, diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index 46465540a1f51b..f1e5cc63da18ee 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -264,6 +264,7 @@ FUSE_3.99 { fuse_fs_can_enable_iomapx; fuse_lowlevel_discover_iomap; fuse_service_discover_iomap; + fuse_reply_iomap_config; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
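[Editorial sketch, not part of the posted patch.] The reply helper in this patch follows a "validate flags, then copy only the flagged fields" pattern. The standalone sketch below shows that pattern on a trimmed structure. cfg_sketch, the CFG_* names and build_config_reply are hypothetical, and the flag values are written as 1ULL << n on the assumption that this is the width the series intends for a uint64_t flags field.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same bit positions as FUSE_IOMAP_CONFIG_*; names are local to the sketch. */
#define CFG_SID		(1ULL << 0)
#define CFG_UUID	(1ULL << 1)
#define CFG_BLOCKSIZE	(1ULL << 2)
#define CFG_MAX_LINKS	(1ULL << 3)
#define CFG_TIME	(1ULL << 4)
#define CFG_MAXBYTES	(1ULL << 5)
#define CFG_ALL		(CFG_SID | CFG_UUID | CFG_BLOCKSIZE | \
			 CFG_MAX_LINKS | CFG_TIME | CFG_MAXBYTES)

/* Trimmed mirror of struct fuse_iomap_config with only two payload fields. */
struct cfg_sketch {
	uint64_t flags;
	char s_uuid[16];
	uint8_t s_uuid_len;
	uint32_t s_blocksize;
};

/*
 * Copy only the fields whose flag is set, and return before touching the
 * output if the caller set a flag this code does not understand.
 */
static int build_config_reply(const struct cfg_sketch *in,
			      struct cfg_sketch *out)
{
	assert(in != NULL && out != NULL);

	if (in->flags & ~CFG_ALL)
		return -EINVAL;

	memset(out, 0, sizeof(*out));
	out->flags = in->flags;

	if (in->flags & CFG_BLOCKSIZE)
		out->s_blocksize = in->s_blocksize;

	if (in->flags & CFG_UUID) {
		/* Clamp the length so the copy cannot overrun s_uuid. */
		out->s_uuid_len = in->s_uuid_len;
		if (out->s_uuid_len > sizeof(out->s_uuid))
			out->s_uuid_len = sizeof(out->s_uuid);
		memcpy(out->s_uuid, in->s_uuid, out->s_uuid_len);
	}

	return 0;
}
```

Returning at the point of failure keeps a rejected configuration from producing any reply payload.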
* [PATCH 17/25] libfuse: add upper level iomap_config implementation 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (15 preceding siblings ...) 2026-04-29 14:43 ` [PATCH 16/25] libfuse: add lower level iomap_config implementation Darrick J. Wong @ 2026-04-29 14:43 ` Darrick J. Wong 2026-04-29 14:43 ` [PATCH 18/25] libfuse: add low level code to invalidate iomap block device ranges Darrick J. Wong ` (7 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:43 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Add FUSE_IOMAP_CONFIG helpers to the upper level fuse library. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse.h | 7 +++++++ lib/fuse.c | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 45 insertions(+) diff --git a/include/fuse.h b/include/fuse.h index e2099ebc8ac7a1..8b6cd7e3cf9f84 100644 --- a/include/fuse.h +++ b/include/fuse.h @@ -918,6 +918,13 @@ struct fuse_operations { */ int (*getattr_iflags)(const char *path, struct stat *statbuf, unsigned int *iflags, struct fuse_file_info *fi); + + /** + * Configure the filesystem geometry that will be used by iomap + * files. 
+ */ + int (*iomap_config)(const struct fuse_iomap_config_params *p, + size_t psize, struct fuse_iomap_config *cfg); }; /** Extra context that may be needed by some filesystems diff --git a/lib/fuse.c b/lib/fuse.c index 35ba13271d0b9c..2d9a49c16c25ff 100644 --- a/lib/fuse.c +++ b/lib/fuse.c @@ -2963,6 +2963,23 @@ static int fuse_fs_iomap_ioend(struct fuse_fs *fs, const char *path, ioendflags, error, dev, new_addr, newsize); } +static int fuse_fs_iomap_config(struct fuse_fs *fs, + const struct fuse_iomap_config_params *p, + size_t psize, struct fuse_iomap_config *cfg) +{ + fuse_get_context()->private_data = fs->user_data; + if (!fs->op.iomap_config) + return -ENOSYS; + + if (fs->debug) { + fuse_log(FUSE_LOG_DEBUG, + "iomap_config flags 0x%llx maxbytes %lld\n", + (unsigned long long)p->flags, (long long)p->maxbytes); + } + + return fs->op.iomap_config(p, psize, cfg); +} + static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, int valid, struct fuse_file_info *fi) { @@ -4841,6 +4858,26 @@ static void fuse_lib_iomap_ioend(fuse_req_t req, fuse_ino_t nodeid, fuse_reply_iomap_ioend(req, newsize); } +static void fuse_lib_iomap_config(fuse_req_t req, + const struct fuse_iomap_config_params *p, + size_t psize) +{ + struct fuse_iomap_config cfg = { }; + struct fuse *f = req_fuse_prepare(req); + struct fuse_intr_data d; + int err; + + fuse_prepare_interrupt(f, req, &d); + err = fuse_fs_iomap_config(f->fs, p, psize, &cfg); + fuse_finish_interrupt(f, req, &d); + if (err) { + reply_err(req, err); + return; + } + + fuse_reply_iomap_config(req, &cfg); +} + static int clean_delay(struct fuse *f) { /* @@ -4946,6 +4983,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = { .iomap_begin = fuse_lib_iomap_begin, .iomap_end = fuse_lib_iomap_end, .iomap_ioend = fuse_lib_iomap_ioend, + .iomap_config = fuse_lib_iomap_config, }; int fuse_notify_poll(struct fuse_pollhandle *ph) ^ permalink raw reply related [flat|nested] 191+ messages in thread
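[Editorial sketch, not part of the posted patch.] For orientation, an upper-level ->iomap_config handler might look roughly like the following. Everything here is illustrative: params_sketch and config_sketch are trimmed local mirrors of fuse_iomap_config_params and fuse_iomap_config, myfs_iomap_config is a hypothetical server function, and the 4096-byte block size and 2^44-byte file size limit are made-up values. The sketch also assumes that p->maxbytes is a ceiling imposed by the kernel, which is how the "max supported file size" comment reads.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit positions as the FUSE_IOMAP_CONFIG_* flags in this series. */
#define CFG_BLOCKSIZE	(1ULL << 2)
#define CFG_MAXBYTES	(1ULL << 5)

struct params_sketch {
	uint64_t flags;		/* fields the kernel is willing to accept */
	int64_t maxbytes;	/* assumed: largest file size the kernel allows */
};

struct config_sketch {
	uint64_t flags;
	uint32_t s_blocksize;
	int64_t s_maxbytes;
};

/* Hypothetical handler for a filesystem with 4k blocks. */
static int myfs_iomap_config(const struct params_sketch *p, size_t psize,
			     struct config_sketch *cfg)
{
	const int64_t fs_maxbytes = INT64_C(1) << 44;	/* made-up fs limit */

	assert(cfg != NULL);

	/* Refuse a parameter block too short to hold the fields read here. */
	if (p == NULL || psize < sizeof(*p))
		return -EINVAL;

	if (p->flags & CFG_BLOCKSIZE) {
		cfg->flags |= CFG_BLOCKSIZE;
		cfg->s_blocksize = 4096;
	}

	/* Never advertise more than the caller says it can handle. */
	if (p->flags & CFG_MAXBYTES) {
		cfg->flags |= CFG_MAXBYTES;
		cfg->s_maxbytes = fs_maxbytes < p->maxbytes ?
				  fs_maxbytes : p->maxbytes;
	}

	return 0;
}
```

Passing psize alongside the structure presumably lets the parameter block grow in later protocol versions, so a handler should only read fields that fit inside psize.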
* [PATCH 18/25] libfuse: add low level code to invalidate iomap block device ranges 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (16 preceding siblings ...) 2026-04-29 14:43 ` [PATCH 17/25] libfuse: add upper " Darrick J. Wong @ 2026-04-29 14:43 ` Darrick J. Wong 2026-04-29 14:44 ` [PATCH 19/25] libfuse: add upper-level API to invalidate parts of an iomap block device Darrick J. Wong ` (6 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:43 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Make it easier to invalidate the page cache for a block device that is being used in conjunction with iomap. This allows a fuse server to kill all cached data for a block that is being freed, so that block reuse doesn't result in file corruption. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse_kernel.h | 15 +++++++++++++++ include/fuse_lowlevel.h | 16 ++++++++++++++++ lib/fuse_lowlevel.c | 22 ++++++++++++++++++++++ lib/fuse_versionscript | 1 + 4 files changed, 54 insertions(+) diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h index 897d996a0ce60d..1e7c9d8082cf23 100644 --- a/include/fuse_kernel.h +++ b/include/fuse_kernel.h @@ -247,6 +247,7 @@ * - add FUSE_ATTR_EXCLUSIVE to enable exclusive mode for specific inodes * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry + * - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges */ #ifndef _LINUX_FUSE_H @@ -701,6 +702,7 @@ enum fuse_notify_code { FUSE_NOTIFY_RESEND = 7, FUSE_NOTIFY_INC_EPOCH = 8, FUSE_NOTIFY_PRUNE = 9, + FUSE_NOTIFY_IOMAP_DEV_INVAL = 99, }; /* The read buffer is required to be at least 8k, but may be much larger */ @@ -1421,4 +1423,17 @@ struct fuse_iomap_config_out { 
int64_t s_maxbytes; /* max file size */ }; +struct fuse_range { + uint64_t offset; + uint64_t length; +}; + +struct fuse_iomap_dev_inval_out { + uint32_t dev; /* device cookie */ + uint32_t reserved; /* zero */ + + /* range of bdev pagecache to invalidate, in bytes */ + struct fuse_range range; +}; + #endif /* _LINUX_FUSE_H */ diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h index e97c3df16fb466..d17bde088d4992 100644 --- a/include/fuse_lowlevel.h +++ b/include/fuse_lowlevel.h @@ -2210,6 +2210,22 @@ int fuse_lowlevel_iomap_device_add(struct fuse_session *se, int fd, */ int fuse_lowlevel_iomap_device_remove(struct fuse_session *se, int device_id); +/** + * Invalidate the page cache of a block device opened for use with iomap. + * + * Added in FUSE protocol version 7.99. If the kernel does not support + * this (or a newer) version, the function will return -ENOSYS and do + * nothing. + * + * @param se the session object + * @param dev device cookie returned by fuse_lowlevel_iomap_device_add + * @param offset start of the range to invalidate, in bytes + * @param length the length of the range to invalidate, in bytes + * @return 0 on success, or negative errno on failure + */ +int fuse_lowlevel_iomap_device_invalidate(struct fuse_session *se, int dev, + off_t offset, off_t length); + /* ----------------------------------------------------------- * * Utility functions * * ----------------------------------------------------------- */ diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c index 4750b39ab99137..2d3e0b49c5a50f 100644 --- a/lib/fuse_lowlevel.c +++ b/lib/fuse_lowlevel.c @@ -3693,6 +3693,28 @@ int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino, return res; } +int fuse_lowlevel_iomap_device_invalidate(struct fuse_session *se, int dev, + off_t offset, off_t length) +{ + struct fuse_iomap_dev_inval_out arg = { + .dev = dev, + .range.offset = offset, + .range.length = length, + }; + struct iovec iov[2]; + + if (!se) + return
-EINVAL; + + if (!(se->conn.want_ext & FUSE_CAP_IOMAP)) + return -ENOSYS; + + iov[1].iov_base = &arg; + iov[1].iov_len = sizeof(arg); + + return send_notify_iov(se, FUSE_NOTIFY_IOMAP_DEV_INVAL, iov, 2); +} + struct fuse_retrieve_req { struct fuse_notify_req nreq; void *cookie; diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index f1e5cc63da18ee..43bef1cd1c5076 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -265,6 +265,7 @@ FUSE_3.99 { fuse_lowlevel_discover_iomap; fuse_service_discover_iomap; fuse_reply_iomap_config; + fuse_lowlevel_iomap_device_invalidate; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
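[Editorial sketch, not part of the posted patch.] The invalidation call takes a byte range, while a filesystem that frees blocks usually thinks in block numbers. A server would need a small conversion step along the lines of the sketch below before calling fuse_lowlevel_iomap_device_invalidate. freed_blocks_to_range and range_sketch are hypothetical names. The overflow checks are bounded by INT64_MAX on the assumption that off_t is 64 bits wide, since the library call takes off_t arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct fuse_range from the patch. */
struct range_sketch {
	uint64_t offset;
	uint64_t length;
};

/*
 * Convert a run of freed filesystem blocks into the byte range that the
 * invalidation call expects, rejecting inputs that would not fit in a
 * signed 64-bit offset.
 */
static int freed_blocks_to_range(uint64_t first_block, uint64_t nr_blocks,
				 uint32_t blocksize, struct range_sketch *out)
{
	const uint64_t max = INT64_MAX;

	assert(out != NULL);

	if (blocksize == 0 || nr_blocks == 0)
		return -EINVAL;

	/* Check each multiplication before performing it. */
	if (first_block > max / blocksize || nr_blocks > max / blocksize)
		return -EOVERFLOW;

	out->offset = first_block * blocksize;
	out->length = nr_blocks * blocksize;

	/* The end of the range must also be representable. */
	if (out->offset > max - out->length)
		return -EOVERFLOW;

	return 0;
}
```

With 4096-byte blocks, freeing blocks 10 through 13 yields offset 40960 and length 16384, which is the range a server would pass along with the device cookie.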
* [PATCH 19/25] libfuse: add upper-level API to invalidate parts of an iomap block device 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (17 preceding siblings ...) 2026-04-29 14:43 ` [PATCH 18/25] libfuse: add low level code to invalidate iomap block device ranges Darrick J. Wong @ 2026-04-29 14:44 ` Darrick J. Wong 2026-04-29 14:44 ` [PATCH 20/25] libfuse: add atomic write support Darrick J. Wong ` (5 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:44 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Wire up the upper-level wrapper for fuse_lowlevel_iomap_device_invalidate. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse.h | 10 ++++++++++ lib/fuse.c | 9 +++++++++ lib/fuse_versionscript | 1 + 3 files changed, 20 insertions(+) diff --git a/include/fuse.h b/include/fuse.h index 8b6cd7e3cf9f84..d68feb7ccfa84c 100644 --- a/include/fuse.h +++ b/include/fuse.h @@ -1479,6 +1479,16 @@ int fuse_fs_iomap_device_add(int fd, unsigned int flags); */ int fuse_fs_iomap_device_remove(int device_id); +/** + * Invalidate any pagecache for the given iomap (block) device. + * + * @param device_id device index as returned by fuse_fs_iomap_device_add + * @param offset starting offset of the range to invalidate + * @param length the length of the range to invalidate + * @return 0 on success, or negative errno on failure + */ +int fuse_fs_iomap_device_invalidate(int device_id, off_t offset, off_t length); + /** * Decide if we can enable iomap mode for a particular file for an upper-level * fuse server.
diff --git a/lib/fuse.c b/lib/fuse.c index 2d9a49c16c25ff..ddf9b4044b8952 100644 --- a/lib/fuse.c +++ b/lib/fuse.c @@ -2941,6 +2941,15 @@ int fuse_fs_iomap_device_remove(int device_id) return fuse_lowlevel_iomap_device_remove(se, device_id); } +int fuse_fs_iomap_device_invalidate(int device_id, off_t offset, off_t length) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse_session *se = fuse_get_session(ctxt->fuse); + + return fuse_lowlevel_iomap_device_invalidate(se, device_id, offset, + length); +} + static int fuse_fs_iomap_ioend(struct fuse_fs *fs, const char *path, uint64_t nodeid, uint64_t attr_ino, off_t pos, uint64_t written, uint32_t ioendflags, int error, diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index 43bef1cd1c5076..7f14aadff9fe72 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -266,6 +266,7 @@ FUSE_3.99 { fuse_service_discover_iomap; fuse_reply_iomap_config; fuse_lowlevel_iomap_device_invalidate; + fuse_fs_iomap_device_invalidate; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 20/25] libfuse: add atomic write support 2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong ` (18 preceding siblings ...) 2026-04-29 14:44 ` [PATCH 19/25] libfuse: add upper-level API to invalidate parts of an iomap block device Darrick J. Wong @ 2026-04-29 14:44 ` Darrick J. Wong 2026-04-29 14:44 ` [PATCH 21/25] libfuse: allow disabling of fs memory reclaim and write throttling Darrick J. Wong ` (4 subsequent siblings) 24 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:44 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Add the single flag that we need to turn on atomic write support in fuse. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse_common.h | 4 ++++ include/fuse_kernel.h | 3 +++ lib/fuse_lowlevel.c | 2 ++ 3 files changed, 9 insertions(+) diff --git a/include/fuse_common.h b/include/fuse_common.h index 2f0aff038c3ece..7c34dcf3b90754 100644 --- a/include/fuse_common.h +++ b/include/fuse_common.h @@ -545,6 +545,8 @@ struct fuse_loop_config_v1 { * FUSE_IOMAP_SUPPORT_FILEIO: basic file I/O functionality through iomap */ #define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0) +/* untorn writes through iomap */ +#define FUSE_IOMAP_SUPPORT_ATOMIC (1ULL << 1) /** * Connection information, passed to the ->init() method @@ -1239,6 +1241,8 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags, #define FUSE_IFLAG_EXCLUSIVE (1U << 1) /* use iomap for this inode */ #define FUSE_IFLAG_IOMAP (1U << 2) +/* enable untorn writes */ +#define FUSE_IFLAG_ATOMIC (1U << 3) /* Which fields are set in fuse_iomap_config_out? 
*/ #define FUSE_IOMAP_CONFIG_SID (1 << 0ULL) diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h index 1e7c9d8082cf23..71a6a92b4b4a65 100644 --- a/include/fuse_kernel.h +++ b/include/fuse_kernel.h @@ -248,6 +248,7 @@ * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry * - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges + * - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support */ #ifndef _LINUX_FUSE_H @@ -591,11 +592,13 @@ struct fuse_file_lock { * FUSE_ATTR_EXCLUSIVE: This file can only be modified by this mount, so the * kernel can use cached attributes more aggressively (e.g. ACL inheritance) * FUSE_ATTR_IOMAP: Use iomap for this inode + * FUSE_ATTR_ATOMIC: Enable untorn writes */ #define FUSE_ATTR_SUBMOUNT (1 << 0) #define FUSE_ATTR_DAX (1 << 1) #define FUSE_ATTR_EXCLUSIVE (1 << 2) #define FUSE_ATTR_IOMAP (1 << 3) +#define FUSE_ATTR_ATOMIC (1 << 4) /** * Open flags diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c index 2d3e0b49c5a50f..6c300c39ece52c 100644 --- a/lib/fuse_lowlevel.c +++ b/lib/fuse_lowlevel.c @@ -168,6 +168,8 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr, attr->flags |= FUSE_ATTR_EXCLUSIVE; if (iflags & FUSE_IFLAG_IOMAP) attr->flags |= FUSE_ATTR_IOMAP; + if (iflags & FUSE_IFLAG_ATOMIC) + attr->flags |= FUSE_ATTR_ATOMIC; } static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf) ^ permalink raw reply related [flat|nested] 191+ messages in thread
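[Editorial sketch, not part of the posted patch.] The convert_stat() hunk in this patch translates library-level FUSE_IFLAG_* bits into wire-level FUSE_ATTR_* bits. The standalone sketch below reproduces that mapping for the three flags whose values appear in this series. The shortened macro names and iflags_to_attr_flags are local to the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the patches in this series; names are shortened. */
#define IFLAG_EXCLUSIVE	(1U << 1)
#define IFLAG_IOMAP	(1U << 2)
#define IFLAG_ATOMIC	(1U << 3)

#define ATTR_EXCLUSIVE	(1U << 2)
#define ATTR_IOMAP	(1U << 3)
#define ATTR_ATOMIC	(1U << 4)

/* The two namespaces use different bit positions, so map them one by one. */
static_assert(IFLAG_ATOMIC != ATTR_ATOMIC, "iflags and attr flags differ");

static uint32_t iflags_to_attr_flags(unsigned int iflags)
{
	uint32_t attr_flags = 0;

	if (iflags & IFLAG_EXCLUSIVE)
		attr_flags |= ATTR_EXCLUSIVE;
	if (iflags & IFLAG_IOMAP)
		attr_flags |= ATTR_IOMAP;
	if (iflags & IFLAG_ATOMIC)
		attr_flags |= ATTR_ATOMIC;

	return attr_flags;
}
```

Keeping the translation explicit, rather than shifting the whole word, means the library flag numbering can change later without touching the kernel ABI.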
* [PATCH 21/25] libfuse: allow disabling of fs memory reclaim and write throttling
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (19 preceding siblings ...)
2026-04-29 14:44 ` [PATCH 20/25] libfuse: add atomic write support Darrick J. Wong
@ 2026-04-29 14:44 ` Darrick J. Wong
2026-04-29 14:44 ` [PATCH 22/25] libfuse: create a helper to transform an open regular file into an open loopdev Darrick J. Wong
` (3 subsequent siblings)
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:44 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Create a library function so that fuse-iomap servers can ask the kernel
to disable direct memory reclaim for filesystems and BDI write
throttling.

Disabling fs reclaim prevents livelocks where the fuse server can
allocate memory, fault into the kernel, and then the allocation tries to
initiate writeback by calling back into the same fuse server.

Disabling BDI write throttling means that writeback won't be throttled
by metadata writes to the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_kernel.h   |  1 +
 include/fuse_lowlevel.h | 13 +++++++++++++
 lib/fuse_lowlevel.c     |  7 +++++++
 lib/fuse_versionscript  |  1 +
 4 files changed, 22 insertions(+)

diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 71a6a92b4b4a65..0779e3917a1e8f 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -1176,6 +1176,7 @@ struct fuse_iomap_support {
 #define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
 #define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 99, \
 					struct fuse_iomap_support)
+#define FUSE_DEV_IOC_SET_NOFS _IOW(FUSE_DEV_IOC_MAGIC, 100, uint32_t)

 struct fuse_lseek_in {
 	uint64_t fh;
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index d17bde088d4992..bc720564d828cd 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2712,6 +2712,19 @@ int fuse_req_get_payload(fuse_req_t req, char **payload, size_t *payload_sz,
  */
 uint64_t fuse_lowlevel_discover_iomap(int fd);

+/**
+ * Disable direct fs memory reclaim and BDI throttling for a fuse-iomap server.
+ * This prevents memory allocations for the fuse server from initiating
+ * pagecache writeback to the fuse server and only throttles writes to the
+ * fuse server's block devices. The fuse connection must already be
+ * initialized with iomap enabled.
+ *
+ * @param se the session object
+ * @param val 1 to disable fs reclaim and throttling, 0 to enable them
+ * @return 0 on success, or negative errno on failure
+ */
+int fuse_lowlevel_disable_fsreclaim(struct fuse_session *se, int val);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 6c300c39ece52c..e4b9ce2e4d2f08 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -5210,3 +5210,10 @@ uint64_t fuse_lowlevel_discover_iomap(int fd)

 	return ios.flags;
 }
+
+int fuse_lowlevel_disable_fsreclaim(struct fuse_session *se, int val)
+{
+	int ret = ioctl(se->fd, FUSE_DEV_IOC_SET_NOFS, &val);
+
+	return ret ? -errno : 0;
+}
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 7f14aadff9fe72..99dfe35eccc90a 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -267,6 +267,7 @@ FUSE_3.99 {
 		fuse_reply_iomap_config;
 		fuse_lowlevel_iomap_device_invalidate;
 		fuse_fs_iomap_device_invalidate;
+		fuse_lowlevel_disable_fsreclaim;
 } FUSE_3.19;

 # Local Variables:

^ permalink raw reply related [flat|nested] 191+ messages in thread
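[editor's sketch] fuse_lowlevel_disable_fsreclaim() above is a thin wrapper that turns the usual ioctl() "-1 plus errno" result into the "0 or negative errno" convention used across libfuse. The standalone sketch below shows only that convention. The DEMO_ names are hypothetical; the magic number 229 and command number 100 are assumed to mirror FUSE_DEV_IOC_MAGIC and the FUSE_DEV_IOC_SET_NOFS definition in this patch, and the call is only meaningful on a kernel carrying the matching fuse-iomap patches.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Assumption: mirrors FUSE_DEV_IOC_MAGIC and command 100 from the patch. */
#define DEMO_FUSE_DEV_IOC_MAGIC		229
#define DEMO_FUSE_DEV_IOC_SET_NOFS	\
	_IOW(DEMO_FUSE_DEV_IOC_MAGIC, 100, uint32_t)

/* Hypothetical wrapper: returns 0 on success or a negative errno. */
static int demo_disable_fsreclaim(int fuse_dev_fd, int val)
{
	uint32_t arg = val ? 1 : 0;
	int ret = ioctl(fuse_dev_fd, DEMO_FUSE_DEV_IOC_SET_NOFS, &arg);

	return ret ? -errno : 0;
}
```

A server would presumably call the real library function once, after the connection has been initialized with iomap enabled, and treat a negative return as "feature unavailable" rather than as a fatal error; that policy is an assumption, not something the patch states.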
* [PATCH 22/25] libfuse: create a helper to transform an open regular file into an open loopdev
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (20 preceding siblings ...)
2026-04-29 14:44 ` [PATCH 21/25] libfuse: allow disabling of fs memory reclaim and write throttling Darrick J. Wong
@ 2026-04-29 14:44 ` Darrick J. Wong
2026-04-29 14:45 ` [PATCH 23/25] libfuse: add swapfile support for iomap files Darrick J. Wong
` (2 subsequent siblings)
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:44 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Create a helper function to configure a loop device for an open regular
file fd, and then return an open fd to the loop device. This will
enable the use of fuse+iomap file servers with filesystem image files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_loopdev.h |  29 +++
 include/meson.build    |   4
 lib/fuse_loopdev.c     | 441 ++++++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_versionscript |   1
 lib/meson.build        |   3
 meson.build            |  11 +
 6 files changed, 488 insertions(+), 1 deletion(-)
 create mode 100644 include/fuse_loopdev.h
 create mode 100644 lib/fuse_loopdev.c

diff --git a/include/fuse_loopdev.h b/include/fuse_loopdev.h
new file mode 100644
index 00000000000000..aa536e2a0b3964
--- /dev/null
+++ b/include/fuse_loopdev.h
@@ -0,0 +1,29 @@
+/*
+ * FUSE: Filesystem in Userspace
+ * Copyright (C) 2025-2026 Oracle.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ *
+ * This program can be distributed under the terms of the GNU LGPLv2.
+ * See the file LGPL2.txt.
+ */
+#ifndef FUSE_LOOPDEV_H_
+#define FUSE_LOOPDEV_H_
+
+/**
+ * If possible, set up a loop device for the given file fd. Return the opened
+ * loop device fd and the path to the loop device. The loop device will be
+ * removed when the last close() occurs.
+ *
+ * @param file_fd an open file
+ * @param open_flags O_* flags that were used to open file_fd
+ * @param path the path to the open regular file
+ * @param timeout spend this much time waiting to lock the file
+ * @param loop_fd set to an open fd to the new loop device or
+ * -1 if setting up a loop device is not possible or appropriate
+ * @param loop_dev (optional) set to a pointer to the path to the loop device
+ * @return 0 for success, or negative errno on failure
+ */
+int fuse_loopdev_setup(int file_fd, int open_flags, const char *path,
+		       unsigned int timeout, int *loop_fd, char **loop_dev);
+
+#endif /* FUSE_LOOPDEV_H_ */
diff --git a/include/meson.build b/include/meson.build
index da51180f87eea2..60edd649f1784f 100644
--- a/include/meson.build
+++ b/include/meson.build
@@ -5,4 +5,8 @@ if private_cfg.get('HAVE_SERVICEMOUNT', false)
   libfuse_headers += [ 'fuse_service.h' ]
 endif

+if private_cfg.get('FUSE_LOOPDEV_ENABLED')
+  libfuse_headers += [ 'fuse_loopdev.h' ]
+endif
+
 install_headers(libfuse_headers, subdir: 'fuse3')
diff --git a/lib/fuse_loopdev.c b/lib/fuse_loopdev.c
new file mode 100644
index 00000000000000..6d7017277d9eaf
--- /dev/null
+++ b/lib/fuse_loopdev.c
@@ -0,0 +1,441 @@
+/*
+ * FUSE: Filesystem in Userspace
+ * Copyright (C) 2025-2026 Oracle.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ *
+ * Library functions for handling loopback devices on linux.
+ *
+ * This program can be distributed under the terms of the GNU LGPLv2.
+ * See the file LGPL2.txt
+ */
+#define _GNU_SOURCE
+#include "fuse_config.h"
+#include "fuse_loopdev.h"
+
+#ifdef FUSE_LOOPDEV_ENABLED
+#include <stdint.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdlib.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <dirent.h>
+#include <signal.h>
+#include <time.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/file.h>
+#include <sys/types.h>
+#include <sys/time.h>
+#include <linux/loop.h>
+
+#include "fuse_log.h"
+
+#define _PATH_LOOPCTL "/dev/loop-control"
+#define _PATH_SYS_BLOCK "/sys/block"
+
+#ifdef STATX_SUBVOL
+# define STATX_SUBVOL_FLAG STATX_SUBVOL
+#else
+# define STATX_SUBVOL_FLAG 0
+#endif
+
+static int lock_file(int fd, const char *path)
+{
+	int ret;
+
+	ret = flock(fd, LOCK_EX);
+	if (ret) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", path, strerror(error));
+		return -error;
+	}
+
+	return 0;
+}
+
+static double gettime_monotonic(void)
+{
+#ifdef CLOCK_MONOTONIC
+	struct timespec ts;
+#endif
+	struct timeval tv;
+	static double fake_ret;
+	int ret;
+
+#ifdef CLOCK_MONOTONIC
+	ret = clock_gettime(CLOCK_MONOTONIC, &ts);
+	if (ret == 0)
+		return ts.tv_sec + (ts.tv_nsec / 1000000000.0);
+#endif
+	ret = gettimeofday(&tv, NULL);
+	if (ret == 0)
+		return tv.tv_sec + (tv.tv_usec / 1000000.0);
+
+	fake_ret += 1.0;
+	return fake_ret;
+}
+
+static int lock_file_timeout(int fd, const char *path, unsigned int timeout)
+{
+	double deadline, now;
+	int ret;
+
+	now = gettime_monotonic();
+	deadline = now + timeout;
+
+	/* Use a tight sleeping loop here to avoid signal handlers */
+	while (now <= deadline) {
+		struct timespec sleepy = {
+			/* sleep 0.1s before trying again */
+			.tv_nsec = 100000000,
+		};
+		int error;
+
+		ret = flock(fd, LOCK_EX | LOCK_NB);
+		if (ret == 0)
+			return 0;
+
+		error = errno;
+
+		if (error != EWOULDBLOCK) {
+			fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", path,
+				 strerror(error));
+			return -error;
+		}
+
+		nanosleep(&sleepy, NULL);
+
+		now = gettime_monotonic();
+	}
+
+	fuse_log(FUSE_LOG_DEBUG, "%s: could not lock file\n", path);
+	return -EWOULDBLOCK;
+}
+
+static int unlock_file(int fd, const char *path)
+{
+	int ret;
+
+	ret = flock(fd, LOCK_UN);
+	if (ret) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", path, strerror(error));
+		return -error;
+	}
+
+	return 0;
+}
+
+static int want_loopdev(int file_fd, const char *path)
+{
+	struct stat stbuf;
+	int ret;
+
+	ret = fstat(file_fd, &stbuf);
+	if (ret < 0) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: fstat failed: %s\n",
+			 path, strerror(error));
+		return -error;
+	}
+
+	/*
+	 * Keep quiet about block devices, the client can probably still read
+	 * and write that.
+	 */
+	if (S_ISBLK(stbuf.st_mode))
+		return 0;
+
+	ret = S_ISREG(stbuf.st_mode) && stbuf.st_size >= 512;
+	if (!ret)
+		fuse_log(FUSE_LOG_DEBUG,
+			 "%s: file not compatible with loop device\n", path);
+	return ret;
+}
+
+static int same_backing_file(int dir_fd, const char *name,
+			     const struct statx *file_stat)
+{
+	struct statx backing_stat;
+	char backing_name[NAME_MAX + 18 + 1];
+	char path[PATH_MAX + 1];
+	ssize_t bytes;
+	int fd;
+	int ret;
+
+	snprintf(backing_name, sizeof(backing_name), "%s/loop/backing_file",
+		 name);
+
+	fd = openat(dir_fd, backing_name, O_RDONLY);
+	if (fd < 0) {
+		int error = errno;
+
+		/* unconfigured loop devices don't have backing_file attr */
+		if (error == ENOENT)
+			return 0;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", backing_name,
+			 strerror(error));
+		return -error;
+	}
+
+	bytes = pread(fd, path, sizeof(path) - 1, 0);
+	if (bytes < 0) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", backing_name,
+			 strerror(error));
+		ret = -error;
+		goto out_backing;
+	} else if (bytes == 0) {
+		fuse_log(FUSE_LOG_DEBUG, "%s: no path in backing file?\n",
+			 backing_name);
+		ret = -ENOENT;
+		goto out_backing;
+	}
+
+	if (path[bytes - 1] == '\n')
+		path[bytes - 1] = 0;
+
+	ret = statx(AT_FDCWD, path, 0, STATX_BASIC_STATS | STATX_SUBVOL_FLAG,
+		    &backing_stat);
+	if (ret) {
+		int error = errno;
+
+		/*
+		 * backing file deleted, assume nobody's doing procfd
+		 * shenanigans
+		 */
+		if (error == ENOENT) {
+			ret = 0;
+			goto out_backing;
+		}
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", path, strerror(error));
+		ret = -error;
+		goto out_backing;
+	}
+
+	/* different devices */
+	if (backing_stat.stx_dev_major != file_stat->stx_dev_major)
+		goto out_backing;
+	if (backing_stat.stx_dev_minor != file_stat->stx_dev_minor)
+		goto out_backing;
+
+	/* different inode number */
+	if (backing_stat.stx_ino != file_stat->stx_ino)
+		goto out_backing;
+
+#ifdef STATX_SUBVOL
+	/* different subvol (or subvol state) */
+	if ((backing_stat.stx_mask ^ file_stat->stx_mask) & STATX_SUBVOL)
+		goto out_backing;
+
+	if ((backing_stat.stx_mask & STATX_SUBVOL) &&
+	    backing_stat.stx_subvol != file_stat->stx_subvol)
+		goto out_backing;
+#endif
+
+	ret = 1;
+
+out_backing:
+	close(fd);
+	return ret;
+}
+
+static int has_existing_loopdev(int file_fd, const char *path)
+{
+	struct statx file_stat;
+	DIR *dir;
+	struct dirent *d;
+	int blockfd;
+	int ret;
+
+	ret = statx(file_fd, "", AT_EMPTY_PATH,
+		    STATX_BASIC_STATS | STATX_SUBVOL_FLAG, &file_stat);
+	if (ret) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", path, strerror(error));
+		return -error;
+	}
+
+	dir = opendir(_PATH_SYS_BLOCK);
+	if (!dir) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", _PATH_SYS_BLOCK,
+			 strerror(error));
+		return -error;
+	}
+
+	blockfd = dirfd(dir);
+
+	while ((d = readdir(dir)) != NULL) {
+		if (strcmp(d->d_name, ".") == 0 ||
+		    strcmp(d->d_name, "..") == 0 ||
+		    strncmp(d->d_name, "loop", 4) != 0)
+			continue;
+
+		ret = same_backing_file(blockfd, d->d_name, &file_stat);
+		if (ret != 0)
+			break;
+	}
+
+	closedir(dir);
+	return ret;
+}
+
+static int open_loopdev(int file_fd, int open_flags, char *loopdev,
+			size_t loopdev_sz)
+{
+	struct loop_config lc = {
+		.info.lo_flags = LO_FLAGS_DIRECT_IO | LO_FLAGS_AUTOCLEAR,
+	};
+	int ctl_fd = -1;
+	int loop_fd = -1;
+	int loopno;
+	int ret;
+
+	if ((open_flags & O_ACCMODE) == O_RDONLY)
+		lc.info.lo_flags |= LO_FLAGS_READ_ONLY;
+
+	ctl_fd = open(_PATH_LOOPCTL, O_RDONLY);
+	if (ctl_fd < 0) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", _PATH_LOOPCTL,
+			 strerror(error));
+		return -error;
+	}
+
+	ret = ioctl(ctl_fd, LOOP_CTL_GET_FREE);
+	if (ret < 0) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", _PATH_LOOPCTL,
+			 strerror(error));
+		ret = -error;
+		goto out_ctl;
+	}
+	loopno = ret;
+	snprintf(loopdev, loopdev_sz, "/dev/loop%d", loopno);
+
+	loop_fd = open(loopdev, open_flags);
+	if (loop_fd < 0) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", loopdev, strerror(error));
+		ret = -error;
+		goto out_ctl;
+	}
+
+	lc.fd = file_fd;
+
+	ret = ioctl(loop_fd, LOOP_CONFIGURE, &lc);
+	if (ret < 0) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_DEBUG, "%s: %s\n", loopdev, strerror(error));
+		ret = -error;
+		goto out_loop;
+	}
+
+	close(ctl_fd);
+	return loop_fd;
+
+out_loop:
+	ioctl(ctl_fd, LOOP_CTL_REMOVE, loopno);
+	close(loop_fd);
+out_ctl:
+	close(ctl_fd);
+	return ret;
+}
+
+int fuse_loopdev_setup(int file_fd, int open_flags, const char *path,
+		       unsigned int timeout, int *loop_fd, char **loop_dev)
+{
+	char loopdev[PATH_MAX];
+	int loopfd = -1;
+	int ret;
+
+	*loop_fd = -1;
+	if (loop_dev)
+		*loop_dev = NULL;
+
+	if (timeout)
+		ret = lock_file_timeout(file_fd, path, timeout);
+	else
+		ret = lock_file(file_fd, path);
+	if (ret)
+		return ret;
+
+	ret = want_loopdev(file_fd, path);
+	if (ret <= 0)
+		goto out_unlock;
+
+	ret = has_existing_loopdev(file_fd, path);
+	if (ret < 0)
+		goto out_unlock;
+	if (ret == 1) {
+		fuse_log(FUSE_LOG_DEBUG,
+			 "%s: attached to another loop device\n", path);
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	loopfd = open_loopdev(file_fd, open_flags, loopdev, sizeof(loopdev));
+	if (loopfd < 0) {
+		ret = loopfd;
+		goto out_unlock;
+	}
+
+	ret = unlock_file(file_fd, path);
+	if (ret)
+		goto out_loop;
+
+	if (loop_dev) {
+		char *ldev = strdup(loopdev);
+
+		if (!ldev) {
+			ret = -ENOMEM;
+			goto out_loop;
+		}
+
+		*loop_fd = loopfd;
+		*loop_dev = ldev;
+	} else {
+		*loop_fd = loopfd;
+	}
+
+	return 0;
+
+out_loop:
+	close(loopfd);
+out_unlock:
+	unlock_file(file_fd, path);
+	return ret;
+}
+#else
+#include <stdlib.h>
+
+#include "util.h"
+
+int fuse_loopdev_setup(int file_fd FUSE_VAR_UNUSED,
+		       int open_flags FUSE_VAR_UNUSED,
+		       const char *path FUSE_VAR_UNUSED,
+		       unsigned int timeout FUSE_VAR_UNUSED,
+		       int *loop_fd, char **loop_dev)
+{
+	*loop_fd = -1;
+	if (loop_dev)
+		*loop_dev = NULL;
+	return 0;
+}
+#endif /* FUSE_LOOPDEV_ENABLED */
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 99dfe35eccc90a..b2357623a49ce6 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -268,6 +268,7 @@ FUSE_3.99 {
 		fuse_lowlevel_iomap_device_invalidate;
 		fuse_fs_iomap_device_invalidate;
 		fuse_lowlevel_disable_fsreclaim;
+		fuse_loopdev_setup;
 } FUSE_3.19;

 # Local Variables:
diff --git a/lib/meson.build b/lib/meson.build
index ff311d4002da0e..e708247a307ba4 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -2,7 +2,8 @@ libfuse_sources = ['fuse.c', 'fuse_i.h', 'fuse_loop.c', 'fuse_loop_mt.c',
                    'fuse_lowlevel.c', 'fuse_misc.h', 'fuse_opt.c',
                    'fuse_signals.c', 'buffer.c', 'cuse_lowlevel.c',
                    'helper.c', 'modules/subdir.c', 'mount_util.c',
-                   'fuse_log.c', 'compat.c', 'util.c', 'util.h' ]
+                   'fuse_log.c', 'compat.c', 'util.c', 'util.h',
+                   'fuse_loopdev.c' ]

 if host_machine.system().startswith('linux')
    libfuse_sources += [ 'mount.c' ]
diff --git a/meson.build b/meson.build
index 9fa17d08c39b6b..09e77ce3987dcc 100644
--- a/meson.build
+++ b/meson.build
@@ -187,7 +187,18 @@ private_cfg.set('HAVE_STRUCT_STAT_ST_ATIMESPEC',
     cc.has_member('struct stat', 'st_atimespec',
                   prefix: include_default + '#include <sys/stat.h>',
                   args: args_default))
+private_cfg.set('HAVE_STRUCT_LOOP_CONFIG_INFO',
+    cc.has_member('struct loop_config', 'info',
+                  prefix: include_default + '#include <linux/loop.h>',
+                  args: args_default))
+private_cfg.set('HAVE_STATX_BASIC_STATS',
+    cc.has_member('struct statx', 'stx_ino',
+                  prefix: include_default + '#include <sys/stat.h>',
+                  args: args_default))

+private_cfg.set('FUSE_LOOPDEV_ENABLED', \
+    private_cfg.get('HAVE_STRUCT_LOOP_CONFIG_INFO') and \
+    private_cfg.get('HAVE_STATX_BASIC_STATS'))
 private_cfg.set('USDT_ENABLED', get_option('enable-usdt'))

 # Check for liburing with SQE128 support

^ permalink raw reply related [flat|nested] 191+ messages in thread
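[editor's sketch] The lock_file_timeout() helper in this patch polls a non-blocking flock() until a monotonic deadline passes. The standalone sketch below shows that polling technique in isolation. The demo_ names are hypothetical, logging is omitted, and, unlike the patch, the loop always makes at least one attempt and has no gettimeofday() fallback; treat it as an illustration under those assumptions rather than a copy of the library code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/file.h>
#include <time.h>

/* Current monotonic time in seconds, as a double. */
static double demo_now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + (ts.tv_nsec / 1000000000.0);
}

/*
 * Try to take an exclusive flock() on fd, retrying every 0.1s until
 * roughly `timeout` seconds have elapsed.  Returns 0 on success,
 * -EWOULDBLOCK if the deadline passes, or another negative errno.
 */
static int demo_lock_timeout(int fd, unsigned int timeout)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 100000000 };
	double deadline = demo_now() + timeout;

	do {
		if (flock(fd, LOCK_EX | LOCK_NB) == 0)
			return 0;
		if (errno != EWOULDBLOCK)
			return -errno;
		nanosleep(&pause, NULL);
	} while (demo_now() <= deadline);

	return -EWOULDBLOCK;
}
```

Because flock() locks belong to the open file description, a second open() of the same path in the same process contends with the first, which makes the timeout path easy to exercise in a unit test.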
* [PATCH 23/25] libfuse: add swapfile support for iomap files
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (21 preceding siblings ...)
2026-04-29 14:44 ` [PATCH 22/25] libfuse: create a helper to transform an open regular file into an open loopdev Darrick J. Wong
@ 2026-04-29 14:45 ` Darrick J. Wong
2026-04-29 14:45 ` [PATCH 24/25] libfuse: add lower-level filesystem freeze, thaw, and shutdown requests Darrick J. Wong
2026-04-29 14:45 ` [PATCH 25/25] libfuse: add upper-level filesystem freeze, thaw, and shutdown events Darrick J. Wong
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:45 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Add flags for swapfile activation and deactivation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/fuse_common.h b/include/fuse_common.h
index 7c34dcf3b90754..0df70d6064b457 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1195,6 +1195,9 @@ int fuse_convert_to_conn_want_ext(struct fuse_conn_info *conn);
 #define FUSE_IOMAP_OP_ATOMIC (1U << 9)
 #define FUSE_IOMAP_OP_DONTCACHE (1U << 10)

+/* swapfile config operation */
+#define FUSE_IOMAP_OP_SWAPFILE (1U << 30)
+
 /* pagecache writeback operation */
 #define FUSE_IOMAP_OP_WRITEBACK (1U << 31)

@@ -1234,6 +1237,8 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
 #define FUSE_IOMAP_IOEND_APPEND (1U << 4)
 /* is pagecache writeback */
 #define FUSE_IOMAP_IOEND_WRITEBACK (1U << 5)
+/* swapfile deactivation */
+#define FUSE_IOMAP_IOEND_SWAPOFF (1U << 6)

 /* enable fsdax */
 #define FUSE_IFLAG_DAX (1U << 0)

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 24/25] libfuse: add lower-level filesystem freeze, thaw, and shutdown requests
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (22 preceding siblings ...)
2026-04-29 14:45 ` [PATCH 23/25] libfuse: add swapfile support for iomap files Darrick J. Wong
@ 2026-04-29 14:45 ` Darrick J. Wong
2026-04-29 14:45 ` [PATCH 25/25] libfuse: add upper-level filesystem freeze, thaw, and shutdown events Darrick J. Wong
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:45 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Pass the kernel's filesystem freeze, thaw, and shutdown requests through
to low level fuse servers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_kernel.h   | 12 +++++++++
 include/fuse_lowlevel.h | 35 +++++++++++++++++++++++++++
 lib/fuse_lowlevel.c     | 60 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 107 insertions(+)

diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 0779e3917a1e8f..ff21973e1c88f7 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -682,6 +682,10 @@ enum fuse_opcode {
 	FUSE_STATX = 52,
 	FUSE_COPY_FILE_RANGE_64 = 53,

+	FUSE_FREEZE_FS = 4089,
+	FUSE_UNFREEZE_FS = 4090,
+	FUSE_SHUTDOWN_FS = 4091,
+
 	FUSE_IOMAP_CONFIG = 4092,
 	FUSE_IOMAP_IOEND = 4093,
 	FUSE_IOMAP_BEGIN = 4094,
@@ -1238,6 +1242,14 @@ struct fuse_syncfs_in {
 	uint64_t padding;
 };

+struct fuse_freezefs_in {
+	uint64_t unlinked;
+};
+
+struct fuse_shutdownfs_in {
+	uint64_t flags;
+};
+
 /*
  * For each security context, send fuse_secctx with size of security context
  * fuse_secctx will be followed by security context name and this in turn
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index bc720564d828cd..bac627bb9038c6 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1440,6 +1440,41 @@ struct fuse_lowlevel_ops {
 	void (*iomap_config)(fuse_req_t req,
 			     const struct fuse_iomap_config_params *p,
 			     size_t psize);
+
+	/**
+	 * Freeze the filesystem
+	 *
+	 * Valid replies:
+	 * fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param ino the root inode number
+	 * @param unlinked count of open unlinked inodes
+	 */
+	void (*freezefs)(fuse_req_t req, fuse_ino_t ino, uint64_t unlinked);
+
+	/**
+	 * Thaw the filesystem
+	 *
+	 * Valid replies:
+	 * fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param ino the root inode number
+	 */
+	void (*unfreezefs)(fuse_req_t req, fuse_ino_t ino);
+
+	/**
+	 * Shut down the filesystem
+	 *
+	 * Valid replies:
+	 * fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param ino the root inode number
+	 * @param flags zero, currently
+	 */
+	void (*shutdownfs)(fuse_req_t req, fuse_ino_t ino, uint64_t flags);
 };

 /**
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index e4b9ce2e4d2f08..9aa7730db39ac6 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2919,6 +2919,60 @@ static void do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid,
 	_do_iomap_config(req, nodeid, inarg, NULL);
 }

+static void _do_freezefs(fuse_req_t req, const fuse_ino_t nodeid,
+			 const void *op_in, const void *in_payload)
+{
+	const struct fuse_freezefs_in *inarg = op_in;
+	(void)in_payload;
+
+	if (req->se->op.freezefs)
+		req->se->op.freezefs(req, nodeid, inarg->unlinked);
+	else
+		fuse_reply_err(req, ENOSYS);
+}
+
+static void do_freezefs(fuse_req_t req, const fuse_ino_t nodeid,
+			const void *inarg)
+{
+	_do_freezefs(req, nodeid, inarg, NULL);
+}
+
+static void _do_unfreezefs(fuse_req_t req, const fuse_ino_t nodeid,
+			   const void *op_in, const void *in_payload)
+{
+	(void)op_in;
+	(void)in_payload;
+
+	if (req->se->op.unfreezefs)
+		req->se->op.unfreezefs(req, nodeid);
+	else
+		fuse_reply_err(req, ENOSYS);
+}
+
+static void do_unfreezefs(fuse_req_t req, const fuse_ino_t nodeid,
+			  const void *inarg)
+{
+	_do_unfreezefs(req, nodeid, inarg, NULL);
+}
+
+static void _do_shutdownfs(fuse_req_t req, const fuse_ino_t nodeid,
+			   const void *op_in, const void *in_payload)
+{
+	const struct fuse_shutdownfs_in *inarg = op_in;
+	(void)in_payload;
+
+	if (req->se->op.shutdownfs)
+		req->se->op.shutdownfs(req, nodeid, inarg->flags);
+	else
+		fuse_reply_err(req, ENOSYS);
+}
+
+static void do_shutdownfs(fuse_req_t req, const fuse_ino_t nodeid,
+			  const void *inarg)
+{
+	_do_shutdownfs(req, nodeid, inarg, NULL);
+}
+
 static bool want_flags_valid(uint64_t capable, uint64_t want)
 {
 	uint64_t unknown_flags = want & (~capable);
@@ -3934,6 +3988,9 @@ static struct {
 	[FUSE_LSEEK] = { do_lseek, "LSEEK" },
 	[FUSE_SYNCFS] = { do_syncfs, "SYNCFS" },
 	[FUSE_STATX] = { do_statx, "STATX" },
+	[FUSE_FREEZE_FS] = { do_freezefs, "FREEZE" },
+	[FUSE_UNFREEZE_FS] = { do_unfreezefs, "UNFREEZE" },
+	[FUSE_SHUTDOWN_FS] = { do_shutdownfs, "SHUTDOWN" },
 	[FUSE_IOMAP_CONFIG] = { do_iomap_config, "IOMAP_CONFIG" },
 	[FUSE_IOMAP_BEGIN] = { do_iomap_begin, "IOMAP_BEGIN" },
 	[FUSE_IOMAP_END] = { do_iomap_end, "IOMAP_END" },
@@ -3995,6 +4052,9 @@ static struct {
 	[FUSE_LSEEK] = { _do_lseek, "LSEEK" },
 	[FUSE_SYNCFS] = { _do_syncfs, "SYNCFS" },
 	[FUSE_STATX] = { _do_statx, "STATX" },
+	[FUSE_FREEZE_FS] = { _do_freezefs, "FREEZE" },
+	[FUSE_UNFREEZE_FS] = { _do_unfreezefs, "UNFREEZE" },
+	[FUSE_SHUTDOWN_FS] = { _do_shutdownfs, "SHUTDOWN" },
 	[FUSE_IOMAP_CONFIG] = { _do_iomap_config, "IOMAP_CONFIG" },
 	[FUSE_IOMAP_BEGIN] = { _do_iomap_begin, "IOMAP_BEGIN" },
 	[FUSE_IOMAP_END] = { _do_iomap_end, "IOMAP_END" },

^ permalink raw reply related [flat|nested] 191+ messages in thread
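[editor's sketch] Each of the three _do_*fs() dispatchers in this patch follows the same shape: call the server's handler if one is registered, otherwise reply ENOSYS so the kernel knows the operation is unsupported. The standalone sketch below models that shape with stand-in types. Everything prefixed demo_ is hypothetical; the real code uses fuse_req_t, struct fuse_lowlevel_ops, and fuse_reply_err().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_req;

/* Stand-in for the freezefs member of struct fuse_lowlevel_ops. */
struct demo_ops {
	void (*freezefs)(struct demo_req *req, uint64_t ino,
			 uint64_t unlinked);
};

/* Stand-in for a request; replied_errno records what would be sent
 * back to the kernel (-1 means no reply yet). */
struct demo_req {
	const struct demo_ops *ops;
	int replied_errno;
};

static void demo_reply_err(struct demo_req *req, int err)
{
	req->replied_errno = err;
}

/* Same structure as _do_freezefs(): dispatch or fall back to ENOSYS. */
static void demo_do_freezefs(struct demo_req *req, uint64_t nodeid,
			     uint64_t unlinked)
{
	if (req->ops->freezefs)
		req->ops->freezefs(req, nodeid, unlinked);
	else
		demo_reply_err(req, ENOSYS);
}

/* Example handler: a real server would quiesce itself before replying. */
static void demo_freezefs(struct demo_req *req, uint64_t ino,
			  uint64_t unlinked)
{
	(void)ino;
	(void)unlinked;
	demo_reply_err(req, 0);
}
```

The handler is responsible for sending exactly one reply; the dispatcher replies on its own only when no handler is registered.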
* [PATCH 25/25] libfuse: add upper-level filesystem freeze, thaw, and shutdown events
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (23 preceding siblings ...)
2026-04-29 14:45 ` [PATCH 24/25] libfuse: add lower-level filesystem freeze, thaw, and shutdown requests Darrick J. Wong
@ 2026-04-29 14:45 ` Darrick J. Wong
24 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:45 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Pass filesystem freeze, thaw, and shutdown requests from the low level
library to the upper level library so that those fuse servers can handle
the events.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse.h | 15 +++++++++
 lib/fuse.c     | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/include/fuse.h b/include/fuse.h
index d68feb7ccfa84c..2717e654e92071 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -925,6 +925,21 @@ struct fuse_operations {
 	 */
 	int (*iomap_config)(const struct fuse_iomap_config_params *p,
 			    size_t psize, struct fuse_iomap_config *cfg);
+
+	/**
+	 * Freeze the filesystem
+	 */
+	int (*freezefs)(const char *path, uint64_t unlinked_files);
+
+	/**
+	 * Thaw the filesystem
+	 */
+	int (*unfreezefs)(const char *path);
+
+	/**
+	 * Shut down the filesystem
+	 */
+	int (*shutdownfs)(const char *path, uint64_t flags);
 };

 /** Extra context that may be needed by some filesystems
diff --git a/lib/fuse.c b/lib/fuse.c
index ddf9b4044b8952..b43a336d7530bb 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2989,6 +2989,38 @@ static int fuse_fs_iomap_config(struct fuse_fs *fs,
 	return fs->op.iomap_config(p, psize, cfg);
 }

+static int fuse_fs_freezefs(struct fuse_fs *fs, const char *path,
+			    uint64_t unlinked)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.freezefs)
+		return -ENOSYS;
+	if (fs->debug)
+		fuse_log(FUSE_LOG_DEBUG, "freezefs[%s]\n", path);
+	return fs->op.freezefs(path, unlinked);
+}
+
+static int fuse_fs_unfreezefs(struct fuse_fs *fs, const char *path)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.unfreezefs)
+		return -ENOSYS;
+	if (fs->debug)
+		fuse_log(FUSE_LOG_DEBUG, "unfreezefs[%s]\n", path);
+	return fs->op.unfreezefs(path);
+}
+
+static int fuse_fs_shutdownfs(struct fuse_fs *fs, const char *path,
+			      uint64_t flags)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.shutdownfs)
+		return -ENOSYS;
+	if (fs->debug)
+		fuse_log(FUSE_LOG_DEBUG, "shutdownfs[%s]\n", path);
+	return fs->op.shutdownfs(path, flags);
+}
+
 static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
 			     int valid, struct fuse_file_info *fi)
 {
@@ -4887,6 +4919,66 @@ static void fuse_lib_iomap_config(fuse_req_t req,
 	fuse_reply_iomap_config(req, &cfg);
 }

+static void fuse_lib_freezefs(fuse_req_t req, fuse_ino_t ino, uint64_t unlinked)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path(f, ino, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_freezefs(f->fs, path, unlinked);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, ino, path);
+	reply_err(req, err);
+}
+
+static void fuse_lib_unfreezefs(fuse_req_t req, fuse_ino_t ino)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path(f, ino, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_unfreezefs(f->fs, path);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, ino, path);
+	reply_err(req, err);
+}
+
+static void fuse_lib_shutdownfs(fuse_req_t req, fuse_ino_t ino, uint64_t flags)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path(f, ino, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_shutdownfs(f->fs, path, flags);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, ino, path);
+	reply_err(req, err);
+}
+
 static int clean_delay(struct fuse *f)
 {
 	/*
@@ -4989,6 +5081,9 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
 	.statx = fuse_lib_statx,
 #endif
 	.syncfs = fuse_lib_syncfs,
+	.freezefs = fuse_lib_freezefs,
+	.unfreezefs = fuse_lib_unfreezefs,
+	.shutdownfs = fuse_lib_shutdownfs,
 	.iomap_begin = fuse_lib_iomap_begin,
 	.iomap_end = fuse_lib_iomap_end,
 	.iomap_ioend = fuse_lib_iomap_ioend,

^ permalink raw reply related [flat|nested] 191+ messages in thread
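[editor's sketch] At the upper level the three new callbacks take a path and return 0 or a negative errno, which fuse_lib_*() then forwards through reply_err(). The standalone sketch below shows what a server's side of that contract might look like. struct demo_operations is a stand-in that mirrors only the three new members of struct fuse_operations, and the state tracking and error choices (-EIO after shutdown, -EINVAL for nonzero flags) are assumptions made for illustration, not behaviour specified by the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in mirroring the three members added to struct fuse_operations. */
struct demo_operations {
	int (*freezefs)(const char *path, uint64_t unlinked_files);
	int (*unfreezefs)(const char *path);
	int (*shutdownfs)(const char *path, uint64_t flags);
};

/* Hypothetical server state. */
static bool demo_frozen;
static bool demo_shut_down;

static int demo_freezefs(const char *path, uint64_t unlinked_files)
{
	(void)path;
	(void)unlinked_files;
	if (demo_shut_down)
		return -EIO;	/* assumed policy: nothing left to freeze */
	demo_frozen = true;
	return 0;
}

static int demo_unfreezefs(const char *path)
{
	(void)path;
	demo_frozen = false;
	return 0;
}

static int demo_shutdownfs(const char *path, uint64_t flags)
{
	(void)path;
	if (flags)
		return -EINVAL;	/* assumed policy: flags are zero for now */
	demo_shut_down = true;
	return 0;
}

static const struct demo_operations demo_ops = {
	.freezefs = demo_freezefs,
	.unfreezefs = demo_unfreezefs,
	.shutdownfs = demo_shutdownfs,
};
```

A server that leaves any of these members NULL gets -ENOSYS from the fuse_fs_*() wrappers shown in the patch, so implementing only a subset is possible.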
* [PATCHSET v8 2/6] libfuse: allow servers to specify root node id
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (8 preceding siblings ...)
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2026-04-29 14:19 ` Darrick J. Wong
2026-04-29 14:45 ` [PATCH 1/1] libfuse: allow root_nodeid mount option Darrick J. Wong
2026-04-29 14:19 ` [PATCHSET v8 3/6] libfuse: implement syncfs Darrick J. Wong
` (9 subsequent siblings)
19 siblings, 1 reply; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:19 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

Hi all,

This series grants fuse servers full control over the entire node id
address space by allowing them to specify the nodeid of the root
directory. With this new feature, fuse4fs will not have to translate
node ids.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-root-nodeid
---
Commits in this patchset:
 * libfuse: allow root_nodeid mount option
---
 lib/mount.c       | 1 +
 util/fusermount.c | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/1] libfuse: allow root_nodeid mount option
2026-04-29 14:19 ` [PATCHSET v8 2/6] libfuse: allow servers to specify root node id Darrick J. Wong
@ 2026-04-29 14:45 ` Darrick J. Wong
0 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:45 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Allow this mount option so that fuse servers can configure the root
nodeid if they want to.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/mount.c       | 1 +
 util/fusermount.c | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/mount.c b/lib/mount.c
index 84c73579ab2daf..3e7e4c22ce6559 100644
--- a/lib/mount.c
+++ b/lib/mount.c
@@ -100,6 +100,7 @@ static const struct fuse_opt fuse_mount_opts[] = {
 	FUSE_OPT_KEY("defcontext=", KEY_KERN_OPT),
 	FUSE_OPT_KEY("rootcontext=", KEY_KERN_OPT),
 	FUSE_OPT_KEY("max_read=", KEY_KERN_OPT),
+	FUSE_OPT_KEY("root_nodeid=", KEY_KERN_OPT),
 	FUSE_OPT_KEY("user=", KEY_MTAB_OPT),
 	FUSE_OPT_KEY("-n", KEY_MTAB_OPT),
 	FUSE_OPT_KEY("-r", KEY_RO),
diff --git a/util/fusermount.c b/util/fusermount.c
index c7905d58a85e32..a9d07683395632 100644
--- a/util/fusermount.c
+++ b/util/fusermount.c
@@ -733,7 +733,8 @@ static int do_mount(const char *mnt, const char **typep, mode_t rootmode,
 		} else if (opt_eq(s, len, "default_permissions") ||
 			   opt_eq(s, len, "allow_other") ||
 			   begins_with(s, "max_read=") ||
-			   begins_with(s, "blksize=")) {
+			   begins_with(s, "blksize=") ||
+			   begins_with(s, "root_nodeid=")) {
 			memcpy(d, s, len);
 			d += len;
 			*d++ = ',';

^ permalink raw reply related [flat|nested] 191+ messages in thread
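[editor's sketch] The fusermount.c hunk extends an allowlist of options that are copied verbatim into the option string handed to the kernel. The standalone sketch below reproduces just that allowlist test so the effect of adding "root_nodeid=" can be seen in isolation. The demo_ helpers are hypothetical reimplementations written for this example; they are assumed to behave like fusermount's opt_eq() and begins_with(), whose bodies are not shown in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the len-byte option at s is exactly opt. */
static bool demo_opt_eq(const char *s, size_t len, const char *opt)
{
	return strlen(opt) == len && strncmp(s, opt, len) == 0;
}

/* True if s starts with the prefix beg. */
static bool demo_begins_with(const char *s, const char *beg)
{
	return strncmp(s, beg, strlen(beg)) == 0;
}

/* Would this option be passed straight through to the kernel? */
static bool demo_is_passthrough_opt(const char *s, size_t len)
{
	return demo_opt_eq(s, len, "default_permissions") ||
	       demo_opt_eq(s, len, "allow_other") ||
	       demo_begins_with(s, "max_read=") ||
	       demo_begins_with(s, "blksize=") ||
	       demo_begins_with(s, "root_nodeid=");
}
```

Options with a value are matched by prefix, while flag-style options must match exactly, so "allow_other" passes but a misspelling such as "allow_others" does not.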
* [PATCHSET v8 3/6] libfuse: implement syncfs
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (9 preceding siblings ...)
2026-04-29 14:19 ` [PATCHSET v8 2/6] libfuse: allow servers to specify root node id Darrick J. Wong
@ 2026-04-29 14:19 ` Darrick J. Wong
2026-04-29 14:46 ` [PATCH 1/2] libfuse: add strictatime/lazytime mount options Darrick J. Wong
2026-04-29 14:46 ` [PATCH 2/2] libfuse: set sync, immutable, and append when loading files Darrick J. Wong
2026-04-29 14:19 ` [PATCHSET v8 4/6] libfuse: add some service helper commands for iomap Darrick J. Wong
` (8 subsequent siblings)
19 siblings, 2 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:19 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

Hi all,

Implement syncfs in libfuse so that iomap-compatible fuse servers can
receive syncfs commands, and enable fuse servers to transmit inode
flags to the kernel so that it can enforce sync, immutable, and append.
Also enable some of the timestamp update mount options.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs
---
Commits in this patchset:
 * libfuse: add strictatime/lazytime mount options
 * libfuse: set sync, immutable, and append when loading files
---
 include/fuse_common.h |    6 ++++++
 include/fuse_kernel.h |    8 ++++++++
 lib/fuse_lowlevel.c   |    6 ++++++
 lib/mount.c           |   10 ++++++++--
 4 files changed, 28 insertions(+), 2 deletions(-)
* [PATCH 1/2] libfuse: add strictatime/lazytime mount options
2026-04-29 14:19 ` [PATCHSET v8 3/6] libfuse: implement syncfs Darrick J. Wong
@ 2026-04-29 14:46 ` Darrick J. Wong
2026-04-29 14:46 ` [PATCH 2/2] libfuse: set sync, immutable, and append when loading files Darrick J. Wong
1 sibling, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:46 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

fuse+iomap leaves the kernel completely in charge of handling
timestamps. Add the lazytime and strictatime mount options so that
fuse+iomap filesystems can take advantage of those options.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/mount.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/mount.c b/lib/mount.c
index 3e7e4c22ce6559..0825b4dfba4385 100644
--- a/lib/mount.c
+++ b/lib/mount.c
@@ -117,9 +117,12 @@ static const struct fuse_opt fuse_mount_opts[] = {
 	FUSE_OPT_KEY("dirsync", KEY_KERN_FLAG),
 	FUSE_OPT_KEY("noatime", KEY_KERN_FLAG),
 	FUSE_OPT_KEY("nodiratime", KEY_KERN_FLAG),
-	FUSE_OPT_KEY("nostrictatime", KEY_KERN_FLAG),
 	FUSE_OPT_KEY("symfollow", KEY_KERN_FLAG),
 	FUSE_OPT_KEY("nosymfollow", KEY_KERN_FLAG),
+	FUSE_OPT_KEY("lazytime", KEY_KERN_FLAG),
+	FUSE_OPT_KEY("nolazytime", KEY_KERN_FLAG),
+	FUSE_OPT_KEY("strictatime", KEY_KERN_FLAG),
+	FUSE_OPT_KEY("nostrictatime", KEY_KERN_FLAG),
 	FUSE_OPT_END
 };

@@ -189,12 +192,15 @@ static const struct mount_flags mount_flags[] = {
 	{"noatime", MS_NOATIME, 1},
 	{"nodiratime", MS_NODIRATIME, 1},
 	{"norelatime", MS_RELATIME, 0},
-	{"nostrictatime", MS_STRICTATIME, 0},
 	{"symfollow", MS_NOSYMFOLLOW, 0},
 	{"nosymfollow", MS_NOSYMFOLLOW, 1},
 #ifndef __NetBSD__
 	{"dirsync", MS_DIRSYNC, 1},
 #endif
+	{"lazytime", MS_LAZYTIME, 1},
+	{"nolazytime", MS_LAZYTIME, 0},
+	{"strictatime", MS_STRICTATIME, 1},
+	{"nostrictatime", MS_STRICTATIME, 0},
 	{NULL, 0, 0}
 };

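The mount_flags[] table in the hunk above pairs each option name with a
flag bit and an on/off value, so "lazytime" sets MS_LAZYTIME and
"nolazytime" clears it. A minimal sketch of how such a table can be
applied follows. It is an illustration only: the EX_MS_* constants are
local copies of the Linux values (an assumption made so that the sketch
builds without <sys/mount.h>), and ex_apply_mount_opt() is a hypothetical
helper, not a libfuse function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copies of the Linux MS_* bit values, for illustration only. */
#define EX_MS_NOATIME		(1UL << 10)
#define EX_MS_STRICTATIME	(1UL << 24)
#define EX_MS_LAZYTIME		(1UL << 25)

struct ex_mount_flag {
	const char *opt;	/* option name as written by the user */
	unsigned long flag;	/* flag bit that the option controls */
	int on;			/* 1 = set the bit, 0 = clear the bit */
};

static const struct ex_mount_flag ex_mount_flags[] = {
	{"noatime",		EX_MS_NOATIME,		1},
	{"lazytime",		EX_MS_LAZYTIME,		1},
	{"nolazytime",		EX_MS_LAZYTIME,		0},
	{"strictatime",		EX_MS_STRICTATIME,	1},
	{"nostrictatime",	EX_MS_STRICTATIME,	0},
	{NULL, 0, 0}
};

/*
 * Hypothetical helper: look up opt in the table and set or clear the matching
 * bit in *flags.  Returns 0 if the option was recognized, -1 otherwise.
 */
static int ex_apply_mount_opt(const char *opt, unsigned long *flags)
{
	const struct ex_mount_flag *mf;

	assert(opt != NULL && flags != NULL);
	for (mf = ex_mount_flags; mf->opt != NULL; mf++) {
		if (strcmp(opt, mf->opt) != 0)
			continue;
		if (mf->on)
			*flags |= mf->flag;
		else
			*flags &= ~mf->flag;
		return 0;
	}
	return -1;
}
```

Because each positive option and its "no" counterpart share one flag bit,
the last option applied wins, which matches the usual mount option
semantics.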
* [PATCH 2/2] libfuse: set sync, immutable, and append when loading files
2026-04-29 14:19 ` [PATCHSET v8 3/6] libfuse: implement syncfs Darrick J. Wong
2026-04-29 14:46 ` [PATCH 1/2] libfuse: add strictatime/lazytime mount options Darrick J. Wong
@ 2026-04-29 14:46 ` Darrick J. Wong
1 sibling, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:46 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Add these three fuse_attr::flags bits so that servers can mark a file
as synchronous, immutable, or append-only and have the kernel advertise
and enforce that.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h |    6 ++++++
 include/fuse_kernel.h |    8 ++++++++
 lib/fuse_lowlevel.c   |    6 ++++++
 3 files changed, 20 insertions(+)

diff --git a/include/fuse_common.h b/include/fuse_common.h
index 0df70d6064b457..7b209b217b310e 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1248,6 +1248,12 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
 #define FUSE_IFLAG_IOMAP (1U << 2)
 /* enable untorn writes */
 #define FUSE_IFLAG_ATOMIC (1U << 3)
+/* file writes are synchronous */
+#define FUSE_IFLAG_SYNC (1U << 4)
+/* file is immutable */
+#define FUSE_IFLAG_IMMUTABLE (1U << 5)
+/* file is append only */
+#define FUSE_IFLAG_APPEND (1U << 6)

 /* Which fields are set in fuse_iomap_config_out? */
 #define FUSE_IOMAP_CONFIG_SID (1 << 0ULL)
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index ff21973e1c88f7..bee825a6d17ad5 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -249,6 +249,8 @@
  * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
  * - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
  * - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
+ * - add FUSE_ATTR_{SYNC,IMMUTABLE,APPEND} for VFS enforcement of file
+ *   attributes
  */

 #ifndef _LINUX_FUSE_H
@@ -593,12 +595,18 @@ struct fuse_file_lock {
  * kernel can use cached attributes more aggressively (e.g. ACL inheritance)
  * FUSE_ATTR_IOMAP: Use iomap for this inode
  * FUSE_ATTR_ATOMIC: Enable untorn writes
+ * FUSE_ATTR_SYNC: File writes are always synchronous
+ * FUSE_ATTR_IMMUTABLE: File is immutable
+ * FUSE_ATTR_APPEND: File is append-only
  */
 #define FUSE_ATTR_SUBMOUNT (1 << 0)
 #define FUSE_ATTR_DAX (1 << 1)
 #define FUSE_ATTR_EXCLUSIVE (1 << 2)
 #define FUSE_ATTR_IOMAP (1 << 3)
 #define FUSE_ATTR_ATOMIC (1 << 4)
+#define FUSE_ATTR_SYNC (1 << 5)
+#define FUSE_ATTR_IMMUTABLE (1 << 6)
+#define FUSE_ATTR_APPEND (1 << 7)

 /**
  * Open flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 9aa7730db39ac6..00eaf511aaccc0 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -170,6 +170,12 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
 		attr->flags |= FUSE_ATTR_IOMAP;
 	if (iflags & FUSE_IFLAG_ATOMIC)
 		attr->flags |= FUSE_ATTR_ATOMIC;
+	if (iflags & FUSE_IFLAG_SYNC)
+		attr->flags |= FUSE_ATTR_SYNC;
+	if (iflags & FUSE_IFLAG_IMMUTABLE)
+		attr->flags |= FUSE_ATTR_IMMUTABLE;
+	if (iflags & FUSE_IFLAG_APPEND)
+		attr->flags |= FUSE_ATTR_APPEND;
 }

 static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
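One detail that is easy to miss in the hunks above: the libfuse-side
FUSE_IFLAG_* bits and the wire-format FUSE_ATTR_* bits do not share bit
positions (SYNC is bit 4 on one side and bit 5 on the other), so the
translation in convert_stat() cannot be a plain copy. The sketch below
restates that mapping as a table. The flag values are copied from the
patch; ex_iflags_to_attr_flags() is a hypothetical helper and is
table-driven, unlike the if-chain in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the patch; guarded in case real headers define them. */
#ifndef FUSE_IFLAG_SYNC
#define FUSE_IFLAG_SYNC		(1U << 4)
#define FUSE_IFLAG_IMMUTABLE	(1U << 5)
#define FUSE_IFLAG_APPEND	(1U << 6)
#endif
#ifndef FUSE_ATTR_SYNC
#define FUSE_ATTR_SYNC		(1 << 5)
#define FUSE_ATTR_IMMUTABLE	(1 << 6)
#define FUSE_ATTR_APPEND	(1 << 7)
#endif

/*
 * Hypothetical helper: translate server-side inode flags into the flag bits
 * carried in fuse_attr::flags.  Only the three bits added by this patch are
 * handled here.
 */
static uint32_t ex_iflags_to_attr_flags(unsigned int iflags)
{
	static const struct {
		unsigned int iflag;
		uint32_t attr;
	} map[] = {
		{ FUSE_IFLAG_SYNC,	FUSE_ATTR_SYNC },
		{ FUSE_IFLAG_IMMUTABLE,	FUSE_ATTR_IMMUTABLE },
		{ FUSE_IFLAG_APPEND,	FUSE_ATTR_APPEND },
	};
	uint32_t flags = 0;
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if (iflags & map[i].iflag)
			flags |= map[i].attr;
	}
	return flags;
}
```

Unknown input bits are ignored rather than passed through, which keeps
the two flag namespaces independent.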
* [PATCHSET v8 4/6] libfuse: add some service helper commands for iomap
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (10 preceding siblings ...)
2026-04-29 14:19 ` [PATCHSET v8 3/6] libfuse: implement syncfs Darrick J. Wong
@ 2026-04-29 14:19 ` Darrick J. Wong
2026-04-29 14:46 ` [PATCH 1/3] mount_service: delegate iomap privilege from mount.service to fuse services Darrick J. Wong
` (2 more replies)
2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong
` (7 subsequent siblings)
19 siblings, 3 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:19 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

Hi all,

There are a few privileges that fuse-iomap servers need to ask of the
kernel, such as asking for iomap support, being able to configure loop
devices if a passed-in filesystem path actually points to a regular
file, and setting the block size once we've parsed the superblock.
Because fuse-iomap services are supposed to run with zero privileges,
these three things must be performed by the fuservicemount program,
which presumably has more privilege (and the correct mount ns) than the
fuse server itself.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D
---
Commits in this patchset:
 * mount_service: delegate iomap privilege from mount.service to fuse services
 * libfuse: enable setting iomap block device block size
 * mount_service: create loop devices for regular files
---
 include/fuse_kernel.h       |    8 +++
 include/fuse_lowlevel.h     |   22 ++++++++
 include/fuse_service.h      |   17 ++++++
 include/fuse_service_priv.h |   14 +++++
 lib/fuse_lowlevel.c         |   19 +++++++
 lib/fuse_service.c          |   48 +++++++++++++++++
 lib/fuse_service_stub.c     |    6 ++
 lib/fuse_versionscript      |    3 +
 util/mount_service.c        |  119 +++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 254 insertions(+), 2 deletions(-)
* [PATCH 1/3] mount_service: delegate iomap privilege from mount.service to fuse services
2026-04-29 14:19 ` [PATCHSET v8 4/6] libfuse: add some service helper commands for iomap Darrick J. Wong
@ 2026-04-29 14:46 ` Darrick J. Wong
2026-04-29 14:46 ` [PATCH 2/3] libfuse: enable setting iomap block device block size Darrick J. Wong
2026-04-29 14:47 ` [PATCH 3/3] mount_service: create loop devices for regular files Darrick J. Wong
2 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:46 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Enable the mount.service helper to attach whatever privileges it might
have to enable iomap to a /dev/fuse fd before passing that fd to the
fuse server. Assuming that the fuse service itself does not have
sufficient privilege to enable iomap on its own, it can now inherit
that privilege via the fd.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_kernel.h       |    1 +
 include/fuse_lowlevel.h     |   10 ++++++
 include/fuse_service.h      |   11 ++++++
 include/fuse_service_priv.h |   10 ++++++
 lib/fuse_lowlevel.c         |    7 ++++
 lib/fuse_service.c          |   43 +++++++++++++++++++++++++
 lib/fuse_service_stub.c     |    6 +++
 lib/fuse_versionscript      |    2 +
 util/mount_service.c        |   74 +++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 164 insertions(+)

diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index bee825a6d17ad5..10a82bf818d2cc 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -1189,6 +1189,7 @@ struct fuse_iomap_support {
 #define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 99, \
 					     struct fuse_iomap_support)
 #define FUSE_DEV_IOC_SET_NOFS _IOW(FUSE_DEV_IOC_MAGIC, 100, uint32_t)
+#define FUSE_DEV_IOC_ADD_IOMAP _IO(FUSE_DEV_IOC_MAGIC, 101)

 struct fuse_lseek_in {
 	uint64_t fh;
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index bac627bb9038c6..d09ea5ca1b8ad7 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2760,6 +2760,16 @@ uint64_t fuse_lowlevel_discover_iomap(int fd);
  */
 int fuse_lowlevel_disable_fsreclaim(struct fuse_session *se, int val);

+/**
+ * Request that iomap capabilities be added to this fuse device. This enables
+ * a privileged mount helper to convey the privileges that allow iomap usage to
+ * a completely unprivileged fuse server.
+ *
+ * @param fd open file descriptor to a fuse device
+ * @return 0 on success, or negative errno on failure
+ */
+int fuse_lowlevel_add_iomap(int fd);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/include/fuse_service.h b/include/fuse_service.h
index f1b3fb738aec3a..5c4cd89f68459d 100644
--- a/include/fuse_service.h
+++ b/include/fuse_service.h
@@ -179,6 +179,17 @@ int fuse_service_receive_file(struct fuse_service *sf,
  */
 int fuse_service_finish_file_requests(struct fuse_service *sf);

+/**
+ * Attach iomap to the fuse connection.
+ *
+ * @param sf service context
+ * @param mandatory true if the server requires iomap
+ * @param error result of trying to enable iomap
+ * @return 0 on success, or negative errno on failure
+ */
+int fuse_service_configure_iomap(struct fuse_service *sf, bool mandatory,
+				 int *error);
+
 /**
  * Require that the filesystem mount point have the expected file format
  * (S_IFDIR/S_IFREG). Can be overridden when calling
diff --git a/include/fuse_service_priv.h b/include/fuse_service_priv.h
index 8560b1ac610143..224ccd6e926085 100644
--- a/include/fuse_service_priv.h
+++ b/include/fuse_service_priv.h
@@ -40,6 +40,7 @@ struct fuse_service_memfd_argv {
 #define FUSE_SERVICE_UNMOUNT_CMD 0x554d4e54 /* UMNT */
 #define FUSE_SERVICE_BYE_CMD 0x42594545 /* BYEE */
 #define FUSE_SERVICE_MTABOPTS_CMD 0x4d544142 /* MTAB */
+#define FUSE_SERVICE_IOMAP_CMD 0x494f4d41 /* IOMA */

 /* mount.service sends replies to the fuse server */
 #define FUSE_SERVICE_OPEN_REPLY 0x46494c45 /* FILE */
@@ -116,6 +117,15 @@ static inline size_t sizeof_fuse_service_open_command(size_t pathlen)
 	return sizeof(struct fuse_service_open_command) + pathlen + 1;
 }

+#define FUSE_IOMAP_MODE_OPTIONAL 0x503F /* P? */
+#define FUSE_IOMAP_MODE_MANDATORY 0x5021 /* P! */
+
+struct fuse_service_iomap_command {
+	struct fuse_service_packet p;
+	uint16_t mode;
+	uint16_t padding;
+};
+
 struct fuse_service_string_command {
 	struct fuse_service_packet p;
 	char value[];
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 00eaf511aaccc0..fa8cf6f9837bed 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -5283,3 +5283,10 @@ int fuse_lowlevel_disable_fsreclaim(struct fuse_session *se, int val)

 	return ret ? -errno : 0;
 }
+
+int fuse_lowlevel_add_iomap(int fd)
+{
+	int ret = ioctl(fd, FUSE_DEV_IOC_ADD_IOMAP);
+
+	return ret ? -errno : 0;
+}
diff --git a/lib/fuse_service.c b/lib/fuse_service.c
index e860d1aafbe5e5..2193b997583e3f 100644
--- a/lib/fuse_service.c
+++ b/lib/fuse_service.c
@@ -1251,3 +1251,46 @@ uint64_t fuse_service_discover_iomap(struct fuse_service *sf)
 {
 	return fuse_lowlevel_discover_iomap(sf->fusedevfd);
 }
+
+int fuse_service_configure_iomap(struct fuse_service *sf, bool mandatory,
+				 int *errorp)
+{
+	struct fuse_service_iomap_command cmd = {
+		.p.magic = ntohl(FUSE_SERVICE_IOMAP_CMD),
+		.mode = mandatory ? ntohs(FUSE_IOMAP_MODE_MANDATORY) :
+				    ntohs(FUSE_IOMAP_MODE_OPTIONAL),
+	};
+	struct fuse_service_simple_reply reply = { };
+	ssize_t size;
+
+	size = __send_packet(sf, &cmd, sizeof(cmd));
+	if (size < 0) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_ERR, "fuse: send iomap command: %s\n",
+			 strerror(error));
+		return -error;
+	}
+
+	size = __recv_packet(sf, &reply, sizeof(reply));
+	if (size < 0) {
+		int error = errno;
+
+		fuse_log(FUSE_LOG_ERR, "fuse: iomap command reply: %s\n",
+			 strerror(error));
+		return -error;
+	}
+	if (size != sizeof(reply)) {
+		fuse_log(FUSE_LOG_ERR, "fuse: wrong iomap command reply size %zd, expected %zd\n",
+			 size, sizeof(reply));
+		return -EBADMSG;
+	}
+
+	if (ntohl(reply.p.magic) != FUSE_SERVICE_SIMPLE_REPLY) {
+		fuse_log(FUSE_LOG_ERR, "fuse: iomap command reply contains wrong magic!\n");
+		return -EBADMSG;
+	}
+
+	*errorp = ntohl(reply.error);
+	return 0;
+}
diff --git a/lib/fuse_service_stub.c b/lib/fuse_service_stub.c
index 2cafde6d6b6f5d..84eed8482f7d2b 100644
--- a/lib/fuse_service_stub.c
+++ b/lib/fuse_service_stub.c
@@ -109,3 +109,9 @@ uint64_t fuse_service_discover_iomap(struct fuse_service *sf)
 {
 	return 0;
 }
+
+int fuse_service_configure_iomap(struct fuse_service *sf, bool mandatory,
+				 int *error)
+{
+	return -EOPNOTSUPP;
+}
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index b2357623a49ce6..347aad1c1c8b1d 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -269,6 +269,8 @@ FUSE_3.99 {
 		fuse_fs_iomap_device_invalidate;
 		fuse_lowlevel_disable_fsreclaim;
 		fuse_loopdev_setup;
+		fuse_lowlevel_add_iomap;
+		fuse_service_configure_iomap;
 } FUSE_3.19;

 # Local Variables:
diff --git a/util/mount_service.c b/util/mount_service.c
index bc5940bc900dad..c50eb3125f8ef1 100644
--- a/util/mount_service.c
+++ b/util/mount_service.c
@@ -88,6 +88,9 @@ struct mount_service {

 	/* is this a fuseblk mount? */
 	bool fuseblk;
+
+	/* did someone try to configure iomap already? */
+	bool iomap_configured;
 };

 static char IGNORE_MTAB;
@@ -586,6 +589,23 @@ static int mount_service_send_file_error(struct mount_service *mo, int error,
 	return ret;
 }

+static int mount_service_config_iomap(struct mount_service *mo,
+				      bool mandatory)
+{
+	int ret;
+
+	mo->iomap_configured = true;
+
+	ret = fuse_lowlevel_add_iomap(mo->fusedevfd);
+	if (ret && mandatory) {
+		fprintf(stderr, "%s: adding iomap capability: %s\n",
+			mo->msgtag, strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
 static int mount_service_send_required_files(struct mount_service *mo,
 					     const char *fusedev)
 {
@@ -1354,6 +1374,50 @@ static int mount_service_handle_mountpoint_cmd(struct mount_service *mo,
 	return attach_to_mountpoint(mo, expected_fmt, mntpt);
 }

+static int mount_service_handle_iomap_cmd(struct mount_service *mo,
+					  struct fuse_service_packet *p,
+					  size_t psz)
+{
+	struct fuse_service_iomap_command *oc =
+			container_of(p, struct fuse_service_iomap_command, p);
+	bool mandatory = false;
+	int ret;
+
+	if (psz != sizeof(struct fuse_service_iomap_command)) {
+		fprintf(stderr, "%s: iomap command wrong size\n",
+			mo->msgtag);
+		return mount_service_send_reply(mo, EINVAL);
+	}
+
+	if (oc->padding) {
+		fprintf(stderr, "%s: invalid iomap command\n",
+			mo->msgtag);
+		return mount_service_send_reply(mo, EINVAL);
+	}
+
+	switch (ntohs(oc->mode)) {
+	case FUSE_IOMAP_MODE_MANDATORY:
+		mandatory = true;
+		fallthrough;
+	case FUSE_IOMAP_MODE_OPTIONAL:
+		ret = mount_service_config_iomap(mo, mandatory);
+		if (ret < 0) {
+			/*
+			 * Ok to return 0 here, fuse servers can fall back
+			 * if there's no iomap support.
+			 */
+			return mount_service_send_reply(mo, EPERM);
+		}
+		break;
+	default:
+		fprintf(stderr, "%s: invalid iomap command mode\n",
+			mo->msgtag);
+		return mount_service_send_reply(mo, EINVAL);
+	}
+
+	return mount_service_send_reply(mo, 0);
+}
+
 static inline int format_libfuse_mntopts(char *buf, size_t bufsz,
 					 const struct mount_service *mo,
 					 const struct stat *stbuf)
@@ -1839,6 +1903,13 @@ static int mount_service_handle_mount_cmd(struct mount_service *mo,
 		return mount_service_send_reply(mo, -ret);
 	}

+	/*
+	 * If nobody tried to configure iomap, try to enable it but don't
+	 * fail if we can't.
+	 */
+	if (!mo->iomap_configured)
+		mount_service_config_iomap(mo, false);
+
 	if (mo->fsopenfd >= 0) {
 		ret = mount_service_fsopen_mount(mo, oc, &stbuf);
 		if (ret != FUSE_MOUNT_FALLBACK_NEEDED)
@@ -2037,6 +2108,9 @@ int mount_service_main(int argc, char *argv[])
 		case FUSE_SERVICE_MTABOPTS_CMD:
 			ret = mount_service_handle_mtabopts_cmd(&mo, p, sz);
 			break;
+		case FUSE_SERVICE_IOMAP_CMD:
+			ret = mount_service_handle_iomap_cmd(&mo, p, sz);
+			break;
 		case FUSE_SERVICE_MOUNT_CMD:
 			ret = mount_service_handle_mount_cmd(&mo, p, sz);
 			break;
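The new IOMAP command is a small fixed-size packet whose magic and mode
fields travel in network byte order, which is why the constants spell
"IOMA" and "P!" / "P?" on the wire. The sketch below shows one way to
encode and validate such a packet. The three constants are copied from
the patch. Everything prefixed ex_ is hypothetical, and struct ex_packet
assumes the common packet header carries only a 32-bit magic, which the
patch does not show. The sketch calls htonl()/htons() on the sending
side; they perform the same byte swap as the ntohl()/ntohs() calls used
in the patch.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Constants copied from the patch. */
#define FUSE_SERVICE_IOMAP_CMD		0x494f4d41	/* IOMA */
#define FUSE_IOMAP_MODE_OPTIONAL	0x503F		/* P? */
#define FUSE_IOMAP_MODE_MANDATORY	0x5021		/* P! */

/* Assumption: the shared packet header holds only the magic number. */
struct ex_packet {
	uint32_t magic;
};

struct ex_iomap_command {
	struct ex_packet p;
	uint16_t mode;
	uint16_t padding;
};

/* Fill in cmd so that it is ready to be written to the socket. */
static void ex_encode_iomap_cmd(struct ex_iomap_command *cmd, bool mandatory)
{
	assert(cmd != NULL);
	memset(cmd, 0, sizeof(*cmd));
	cmd->p.magic = htonl(FUSE_SERVICE_IOMAP_CMD);
	cmd->mode = htons(mandatory ? FUSE_IOMAP_MODE_MANDATORY :
				      FUSE_IOMAP_MODE_OPTIONAL);
}

/*
 * Validate a received buffer the way the handler in the patch does: exact
 * size, zero padding, known mode.  Returns 1 for mandatory, 0 for optional,
 * and -1 for a malformed packet.
 */
static int ex_decode_iomap_cmd(const void *buf, size_t len)
{
	struct ex_iomap_command cmd;

	if (buf == NULL || len != sizeof(cmd))
		return -1;
	memcpy(&cmd, buf, sizeof(cmd));
	if (ntohl(cmd.p.magic) != FUSE_SERVICE_IOMAP_CMD || cmd.padding)
		return -1;

	switch (ntohs(cmd.mode)) {
	case FUSE_IOMAP_MODE_MANDATORY:
		return 1;
	case FUSE_IOMAP_MODE_OPTIONAL:
		return 0;
	default:
		return -1;
	}
}
```

Rejecting nonzero padding, as the handler in the patch does, keeps those
bytes available for future protocol extensions.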
* [PATCH 2/3] libfuse: enable setting iomap block device block size
2026-04-29 14:19 ` [PATCHSET v8 4/6] libfuse: add some service helper commands for iomap Darrick J. Wong
2026-04-29 14:46 ` [PATCH 1/3] mount_service: delegate iomap privilege from mount.service to fuse services Darrick J. Wong
@ 2026-04-29 14:46 ` Darrick J. Wong
2026-04-29 14:47 ` [PATCH 3/3] mount_service: create loop devices for regular files Darrick J. Wong
2 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:46 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Create a means for an unprivileged fuse server to set the block size of
a block device that it previously opened and associated with the fuse
connection.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_kernel.h   |    7 +++++++
 include/fuse_lowlevel.h |   12 ++++++++++++
 lib/fuse_lowlevel.c     |   12 ++++++++++++
 lib/fuse_versionscript  |    1 +
 4 files changed, 32 insertions(+)

diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 10a82bf818d2cc..3ed174567dc172 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -1180,6 +1180,11 @@ struct fuse_iomap_support {
 	uint64_t padding;
 };

+struct fuse_iomap_backing_info {
+	uint32_t backing_id;
+	uint32_t blocksize;
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC 229
 #define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
@@ -1190,6 +1195,8 @@ struct fuse_iomap_support {
 					     struct fuse_iomap_support)
 #define FUSE_DEV_IOC_SET_NOFS _IOW(FUSE_DEV_IOC_MAGIC, 100, uint32_t)
 #define FUSE_DEV_IOC_ADD_IOMAP _IO(FUSE_DEV_IOC_MAGIC, 101)
+#define FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE _IOW(FUSE_DEV_IOC_MAGIC, 102, \
+					     struct fuse_iomap_backing_info)

 struct fuse_lseek_in {
 	uint64_t fh;
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index d09ea5ca1b8ad7..67c9bd4b2c6cee 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2770,6 +2770,18 @@ int fuse_lowlevel_disable_fsreclaim(struct fuse_session *se, int val);
  */
 int fuse_lowlevel_add_iomap(int fd);

+/**
+ * Set the block size of an open block device that has been opened for use with
+ * iomap.
+ *
+ * @param se the session object
+ * @param dev_index device index returned by fuse_lowlevel_iomap_device_add
+ * @param blocksize block size in bytes
+ * @return 0 on success, or negative errno on failure
+ */
+int fuse_lowlevel_iomap_set_blocksize(struct fuse_session *se, int dev_index,
+				      unsigned int blocksize);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index fa8cf6f9837bed..d3e2d4c698a62b 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -5290,3 +5290,15 @@ int fuse_lowlevel_add_iomap(int fd)

 	return ret ? -errno : 0;
 }
+
+int fuse_lowlevel_iomap_set_blocksize(struct fuse_session *se, int dev_index,
+				      unsigned int blocksize)
+{
+	struct fuse_iomap_backing_info fbi = {
+		.backing_id = dev_index,
+		.blocksize = blocksize,
+	};
+	int ret = ioctl(se->fd, FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE, &fbi);
+
+	return ret ? -errno : 0;
+}
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 347aad1c1c8b1d..9c9013c964488c 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -271,6 +271,7 @@ FUSE_3.99 {
 		fuse_loopdev_setup;
 		fuse_lowlevel_add_iomap;
 		fuse_service_configure_iomap;
+		fuse_lowlevel_iomap_set_blocksize;
 } FUSE_3.19;

 # Local Variables:
* [PATCH 3/3] mount_service: create loop devices for regular files
2026-04-29 14:19 ` [PATCHSET v8 4/6] libfuse: add some service helper commands for iomap Darrick J. Wong
2026-04-29 14:46 ` [PATCH 1/3] mount_service: delegate iomap privilege from mount.service to fuse services Darrick J. Wong
2026-04-29 14:46 ` [PATCH 2/3] libfuse: enable setting iomap block device block size Darrick J. Wong
@ 2026-04-29 14:47 ` Darrick J. Wong
2 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:47 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

If a fuse server asks fuservicemount to open a regular file, try to
create an auto-clear loop device so that the fuse server can use iomap.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_service.h      |    6 ++++++
 include/fuse_service_priv.h |    4 +++-
 lib/fuse_service.c          |    5 ++++-
 util/mount_service.c        |   45 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/include/fuse_service.h b/include/fuse_service.h
index 5c4cd89f68459d..152e46cf425d18 100644
--- a/include/fuse_service.h
+++ b/include/fuse_service.h
@@ -128,6 +128,12 @@ int fuse_service_parse_cmdline_opts(struct fuse_args *args,
  */
 #define FUSE_SERVICE_REQUEST_FILE_QUIET (1U << 0)

+/**
+ * If the file opened is a regular file, try to create a loop device for it.
+ * If successful, the loop device is returned; if not, the regular file is.
+ */
+#define FUSE_SERVICE_REQUEST_FILE_TRYLOOP (1U << 1)
+
 /**
  * Ask the mount.service helper to open a file on behalf of the fuse server.
  *
diff --git a/include/fuse_service_priv.h b/include/fuse_service_priv.h
index 224ccd6e926085..db9ecdf768b6d6 100644
--- a/include/fuse_service_priv.h
+++ b/include/fuse_service_priv.h
@@ -101,7 +101,9 @@ struct fuse_service_fsopen_command {
 };

 #define FUSE_SERVICE_OPEN_QUIET (1U << 0)
-#define FUSE_SERVICE_OPEN_FLAGS (FUSE_SERVICE_OPEN_QUIET)
+#define FUSE_SERVICE_OPEN_TRYLOOP (1U << 1)
+#define FUSE_SERVICE_OPEN_FLAGS (FUSE_SERVICE_OPEN_QUIET | \
+					 FUSE_SERVICE_OPEN_TRYLOOP)

 struct fuse_service_open_command {
 	struct fuse_service_packet p;
diff --git a/lib/fuse_service.c b/lib/fuse_service.c
index 2193b997583e3f..100c98e4a303c4 100644
--- a/lib/fuse_service.c
+++ b/lib/fuse_service.c
@@ -222,7 +222,8 @@ int fuse_service_receive_file(struct fuse_service *sf, const char *path,
 	return ret;
 }

-#define FUSE_SERVICE_REQUEST_FILE_FLAGS (FUSE_SERVICE_REQUEST_FILE_QUIET)
+#define FUSE_SERVICE_REQUEST_FILE_FLAGS (FUSE_SERVICE_REQUEST_FILE_QUIET | \
+					 FUSE_SERVICE_REQUEST_FILE_TRYLOOP)

 static int fuse_service_request_path(struct fuse_service *sf, const char *path,
 				     mode_t expected_fmt, int open_flags,
@@ -244,6 +245,8 @@ static int fuse_service_request_path(struct fuse_service *sf, const char *path,

 	if (request_flags & FUSE_SERVICE_REQUEST_FILE_QUIET)
 		rqflags |= FUSE_SERVICE_OPEN_QUIET;
+	if (request_flags & FUSE_SERVICE_REQUEST_FILE_TRYLOOP)
+		rqflags |= FUSE_SERVICE_OPEN_TRYLOOP;

 	cmd = calloc(1, cmdsz);
 	if (!cmd) {
diff --git a/util/mount_service.c b/util/mount_service.c
index c50eb3125f8ef1..f36471a6387601 100644
--- a/util/mount_service.c
+++ b/util/mount_service.c
@@ -37,6 +37,7 @@
 #include "util.h"
 #include "fuse_i.h"
 #include "fuse_service_priv.h"
+#include "fuse_loopdev.h"
 #include "mount_service.h"
 #include "fuser_conf.h"

@@ -754,6 +755,41 @@ static int prepare_bdev(struct mount_service *mo,
 	return 0;
 }

+static int prepare_loopdev(struct fuse_service_open_command *oc, int *fd)
+{
+	int reg_fd = *fd;
+	int loop_fd = -1;
+	int ret;
+
+	drop_privs();
+	ret = fuse_loopdev_setup(reg_fd, ntohl(oc->open_flags), oc->path, 5,
+				 &loop_fd, NULL);
+	restore_privs();
+	if (ret) {
+		/*
+		 * If the setup function returned EBUSY, there is already a
+		 * loop device backed by this file, so we must return an error.
+		 * For any other type of error we'll leave *fd untouched to
+		 * send the original file we opened to the fuse server.
+		 */
+		if (ret == -EBUSY)
+			return -EBUSY;
+		return 0;
+	}
+	if (loop_fd < 0) {
+		/*
+		 * The loopdev setup function didn't give us a new fd, so we
+		 * return having left the passed-in fd alone.
+		 */
+		return 0;
+	}
+
+	/* Substitute the loop device for the regular file. */
+	close(reg_fd);
+	*fd = loop_fd;
+	return 0;
+}
+
 static int mount_service_open_path(struct mount_service *mo,
 				   mode_t expected_fmt,
 				   struct fuse_service_packet *p, size_t psz)
@@ -802,6 +838,15 @@ static int mount_service_open_path(struct mount_service *mo,
 	}
 	restore_privs();

+	if (request_flags & FUSE_SERVICE_OPEN_TRYLOOP) {
+		ret = prepare_loopdev(oc, &fd);
+		if (ret < 0) {
+			close(fd);
+			return mount_service_send_file_error(mo, -ret,
+							     oc->path);
+		}
+	}
+
 	if (S_ISBLK(expected_fmt)) {
 		ret = prepare_bdev(mo, oc, fd);
 		if (ret < 0) {
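The TRYLOOP handling in prepare_loopdev() has three outcomes: -EBUSY is
passed back to the caller, any other failure (or a setup call that
yields no new fd) silently keeps the regular file, and success swaps in
the loop device. The sketch below isolates that decision logic so that
it can be exercised without real devices. It is an illustration, not the
mount_service.c code: ex_try_loopdev(), the ex_loop_setup_fn callback
type, and the ex_setup_* fakes are hypothetical, and instead of calling
close() the sketch reports the fd that the caller should close.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop device setup call. */
typedef int (*ex_loop_setup_fn)(int reg_fd, int *loop_fd);

/*
 * Try to replace *fd with a loop device.  Returns -EBUSY if a loop device
 * already exists for the file, and 0 in every other case.  On substitution,
 * *fd becomes the loop fd and *stale_fd is the regular-file fd that the
 * caller should close; otherwise *stale_fd is -1 and *fd is unchanged.
 */
static int ex_try_loopdev(ex_loop_setup_fn setup, int *fd, int *stale_fd)
{
	int reg_fd = *fd;
	int loop_fd = -1;
	int ret;

	assert(setup != NULL);
	*stale_fd = -1;

	ret = setup(reg_fd, &loop_fd);
	if (ret == -EBUSY)
		return -EBUSY;
	if (ret || loop_fd < 0)
		return 0;	/* fall back to the regular file */

	*stale_fd = reg_fd;
	*fd = loop_fd;
	return 0;
}

/* Fake setup callbacks used only to demonstrate the three outcomes. */
static int ex_setup_ok(int reg_fd, int *loop_fd)
{
	(void)reg_fd;
	*loop_fd = 100;
	return 0;
}

static int ex_setup_busy(int reg_fd, int *loop_fd)
{
	(void)reg_fd;
	(void)loop_fd;
	return -EBUSY;
}

static int ex_setup_fail(int reg_fd, int *loop_fd)
{
	(void)reg_fd;
	(void)loop_fd;
	return -ENOENT;
}

static int ex_setup_noop(int reg_fd, int *loop_fd)
{
	(void)reg_fd;
	(void)loop_fd;
	return 0;
}
```

Treating everything except -EBUSY as a soft failure matches the "try"
semantics of the flag: the fuse server still receives a usable fd and can
decide for itself whether to proceed without a block device.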
* [PATCHSET v8 5/6] fuse: add sample iomap fuse servers
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (11 preceding siblings ...)
2026-04-29 14:19 ` [PATCHSET v8 4/6] libfuse: add some service helper commands for iomap Darrick J. Wong
@ 2026-04-29 14:19 ` Darrick J. Wong
2026-04-29 14:47 ` [PATCH 1/7] example/iomap_ll: create a simple iomap server Darrick J. Wong
` (6 more replies)
2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (6 subsequent siblings)
19 siblings, 7 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:19 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, john, fuse-devel, joannelkoong, neal

Hi all,

This series adds some simple examples of iomap-based fuse servers so
that we can demonstrate and test the libfuse functionality. There's a
simple iomap server that simulates an ext4 file, a second one that
shows off inline data IO, a third one that implements out of place
writes, and a fourth one that shows off fuse servers running as a
systemd service.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D
---
Commits in this patchset:
 * example/iomap_ll: create a simple iomap server
 * example/iomap_ll: track block state
 * example/iomap_ll: implement atomic writes
 * example/iomap_inline_ll: create a simple server to test inlinedata
 * example/iomap_ow_ll: create a simple iomap out of place write server
 * example/iomap_ow_ll: implement atomic writes
 * example/iomap_service_ll: create a sample systemd service fuse server
---
 example/single_file.h              |   22 +
 example/iomap_inline_ll.c          |  367 ++++++++++++++++
 example/iomap_ll.c                 |  714 +++++++++++++++++++++++++++++++
 example/iomap_ow_ll.c              |  820 ++++++++++++++++++++++++++++++++++++
 example/iomap_service_ll.c         |  377 +++++++++++++++++
 example/iomap_service_ll.socket.in |   15 +
 example/iomap_service_ll@.service  |  102 ++++
 example/meson.build                |    6
 example/service_hl.c               |    4
 example/service_ll.c               |    4
 example/single_file.c              |  212 +++++++++
 11 files changed, 2619 insertions(+), 24 deletions(-)
 create mode 100644 example/iomap_inline_ll.c
 create mode 100644 example/iomap_ll.c
 create mode 100644 example/iomap_ow_ll.c
 create mode 100644 example/iomap_service_ll.c
 create mode 100644 example/iomap_service_ll.socket.in
 create mode 100644 example/iomap_service_ll@.service
* [PATCH 1/7] example/iomap_ll: create a simple iomap server 2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong @ 2026-04-29 14:47 ` Darrick J. Wong 2026-04-29 14:47 ` [PATCH 2/7] example/iomap_ll: track block state Darrick J. Wong ` (5 subsequent siblings) 6 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:47 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, john, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Create a toy iomap fileserver as an example for how this all works. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/single_file.h | 6 + example/iomap_ll.c | 332 +++++++++++++++++++++++++++++++++++++++++++++++++ example/meson.build | 1 example/service_hl.c | 2 example/service_ll.c | 2 example/single_file.c | 105 +++++++++++++++ 6 files changed, 444 insertions(+), 4 deletions(-) create mode 100644 example/iomap_ll.c diff --git a/example/single_file.h b/example/single_file.h index 290dd6051ed6f5..a12389d69e33cc 100644 --- a/example/single_file.h +++ b/example/single_file.h @@ -45,6 +45,7 @@ static inline uint64_t howmany(uint64_t b, unsigned int align) } struct single_file { + int file_fd; int backing_fd; int64_t isize; @@ -56,6 +57,7 @@ struct single_file { bool allow_dio; bool sync; bool require_bdev; + bool uses_iomap; unsigned int blocksize; @@ -115,11 +117,13 @@ unsigned long long parse_num_blocks(const char *arg, int log_block_size); struct fuse_service; int single_file_service_open(struct fuse_service *sf, const char *path); +int single_file_open(const char *path); void single_file_check_read(off_t pos, size_t *count); int single_file_check_write(off_t pos, size_t *count); -int single_file_configure(const char *device, const char *filename); +int single_file_configure(const char *device, mode_t expected_fmt, + const char *filename); int single_file_configure_simple(const char *filename); void single_file_close(void); diff 
--git a/example/iomap_ll.c b/example/iomap_ll.c new file mode 100644 index 00000000000000..93199b80205c2d --- /dev/null +++ b/example/iomap_ll.c @@ -0,0 +1,332 @@ +/* + * FUSE: Filesystem in Userspace + * Copyright (C) 2026 Oracle. + * + * This program can be distributed under the terms of the GNU GPLv2. + * See the file GPL2.txt. + */ + +/** @file + * + * minimal example iomap filesystem using low-level API + * + * Compile with: + * + * gcc -Wall single_file.c iomap_ll.c `pkg-config fuse3 --cflags --libs` -o iomap_ll + * + * Note: If the pkg-config command fails due to the absence of the fuse3.pc + * file, you should configure the path to the fuse3.pc file in the + * PKG_CONFIG_PATH variable. + * + * ## Source code ## + * \include iomap_ll.c + * \include single_file.c + * \include single_file.h + */ + +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 99) + +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif + +#include <fuse_lowlevel.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> +#include <fcntl.h> +#include <unistd.h> +#include <assert.h> +#include <pthread.h> +#include <sys/ioctl.h> +#include <sys/stat.h> +#include <linux/fs.h> +#include <linux/stat.h> +#define USE_SINGLE_FILE_LL_API +#include "single_file.h" + +struct iomap_ll { + struct fuse_session *se; + char *device; + + /* really booleans */ + int debug; + + int dev_index; +}; + +static struct iomap_ll ll = { }; + +static void iomap_ll_init(void *userdata, struct fuse_conn_info *conn) +{ + (void)userdata; + + if (fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) + single_file.uses_iomap = true; + conn->time_gran = 1; +} + +static int iomap_iomap_config_devices(void) +{ + int blocksize = single_file.blocksize; + int ret; + + ll.dev_index = fuse_lowlevel_iomap_device_add(ll.se, single_file.backing_fd, 0); + if (ll.dev_index < 0) { + ret = -ll.dev_index; + printf("%s: cannot register iomap dev fd=%d: %s\n", + ll.device, single_file.backing_fd, strerror(ret)); + return ret; + } + + 
ret = ioctl(single_file.backing_fd, BLKBSZSET, &blocksize); + if (ret) { + printf("%s: cannot set block size %u: %s\n", + ll.device, single_file.blocksize, strerror(errno)); + return errno; + } + + return 0; +} + +static void iomap_ll_iomap_config(fuse_req_t req, + const struct fuse_iomap_config_params *p, + size_t psize) +{ + struct fuse_iomap_config cfg = { }; + int ret; + + (void)p; + (void)psize; + + cfg.flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE; + cfg.s_blocksize = single_file.blocksize; + + cfg.flags |= FUSE_IOMAP_CONFIG_MAXBYTES; + cfg.s_maxbytes = single_file.isize; + + ret = iomap_iomap_config_devices(); + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_iomap_config(req, &cfg); +} + +static int iomap_begin_report(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + (void)pos; + (void)count; + (void)opflags; + + read->offset = 0; + read->length = fsb_to_b(single_file.blocks); + read->addr = 0; + read->dev = ll.dev_index; + read->type = FUSE_IOMAP_TYPE_MAPPED; + + return 0; +} + +static int iomap_begin_read(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + return iomap_begin_report(pos, count, opflags, read); +} + +static int iomap_begin_write(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + return iomap_begin_read(pos, count, opflags, read); +} + +static void iomap_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, + uint64_t dontcare, off_t pos, uint64_t count, + uint32_t opflags) +{ + struct fuse_file_iomap read; + int ret; + + (void)dontcare; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx opflags 0x%x\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count, + opflags); + + if (!single_file.allow_dio && (opflags & FUSE_IOMAP_OP_DIRECT)) { + ret = ENOSYS; + goto out_reply; + } + + memset(&read, 0, sizeof(read)); + + pthread_mutex_lock(&single_file.lock); + if 
(opflags & FUSE_IOMAP_OP_REPORT) + ret = iomap_begin_report(pos, count, opflags, &read); + else if (fuse_iomap_is_write(opflags)) + ret = iomap_begin_write(pos, count, opflags, &read); + else + ret = iomap_begin_read(pos, count, opflags, &read); + if (ret) + goto out_unlock; + + if (ll.debug) + fprintf(stderr, + "%s: offset 0x%llx length 0x%llx type %u dev %u addr 0x%llx flags 0x%x\n", + __func__, + (unsigned long long)read.offset, + (unsigned long long)read.length, + read.type, + read.dev, + (unsigned long long)read.addr, + read.flags); + + /* Not filling even the first byte will make the kernel unhappy. */ + if (read.offset > pos || read.offset + read.length <= pos) { + printf("%s: made bad mapping at pos %llu\n", ll.device, + (unsigned long long)pos); + ret = EIO; + goto out_unlock; + } + +out_unlock: + pthread_mutex_unlock(&single_file.lock); +out_reply: + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_iomap_begin(req, &read, NULL); +} + +static const struct fuse_lowlevel_ops iomap_ll_oper = { + .lookup = single_file_ll_lookup, + .getattr = single_file_ll_getattr, + .setattr = single_file_ll_setattr, + .readdir = single_file_ll_readdir, + .open = single_file_ll_open, + .statfs = single_file_ll_statfs, + .statx = single_file_ll_statx, + .fsync = single_file_ll_fsync, + + .init = iomap_ll_init, + .iomap_config = iomap_ll_iomap_config, + .iomap_begin = iomap_ll_iomap_begin, +}; + +#define IOMAP_LL_OPT(t, p, v) { t, offsetof(struct iomap_ll, p), v } + +static struct fuse_opt iomap_ll_opts[] = { + IOMAP_LL_OPT("debug", debug, 1), + SINGLE_FILE_OPT_KEYS, + FUSE_OPT_END +}; + +static int iomap_ll_opt_proc(void *data, const char *arg, + int key, struct fuse_args *outargs) +{ + if (single_file_opt_proc(data, arg, key, outargs) == 0) + return 0; + + switch (key) { + case FUSE_OPT_KEY_NONOPT: + if (!ll.device) { + ll.device = strdup(arg); + return 0; + } + return 1; + } + + return 1; +} + +int main(int argc, char *argv[]) +{ + struct fuse_args args = 
FUSE_ARGS_INIT(argc, argv); + struct fuse_cmdline_opts opts = { }; + int ret = 1; + + if (fuse_opt_parse(&args, &ll, iomap_ll_opts, iomap_ll_opt_proc)) + goto err_args; + + if (fuse_parse_cmdline(&args, &opts)) + goto err_args; + + if (opts.show_help) { + printf("usage: %s [options] <device> <mountpoint>\n\n", argv[0]); + fuse_cmdline_help(); + fuse_lowlevel_help(); + ret = 0; + goto err_strings; + } else if (opts.show_version) { + printf("FUSE library version %s\n", fuse_pkgversion()); + fuse_lowlevel_version(); + ret = 0; + goto err_strings; + } + + if (!opts.mountpoint || !ll.device) { + printf("usage: %s [options] <device> <mountpoint>\n", argv[0]); + printf(" %s --help\n", argv[0]); + goto err_strings; + } + + if (single_file_open(ll.device)) + goto err_strings; + + if (single_file_configure(ll.device, S_IFBLK, "bdev")) + goto err_singlefile; + + ll.se = fuse_session_new(&args, &iomap_ll_oper, sizeof(iomap_ll_oper), + NULL); + if (ll.se == NULL) + goto err_singlefile; + + if (fuse_set_signal_handlers(ll.se)) + goto err_session; + + if (fuse_session_mount(ll.se, opts.mountpoint)) + goto err_signals; + + fuse_daemonize(opts.foreground); + + /* Block until ctrl+c or fusermount -u */ + if (opts.singlethread) { + ret = fuse_session_loop(ll.se); + } else { + struct fuse_loop_config *config = fuse_loop_cfg_create(); + + if (!config) { + ret = 1; + goto err_mount; + } + + fuse_loop_cfg_set_clone_fd(config, opts.clone_fd); + fuse_loop_cfg_set_max_threads(config, opts.max_threads); + ret = fuse_session_loop_mt(ll.se, config); + fuse_loop_cfg_destroy(config); + } + +err_mount: + fuse_session_unmount(ll.se); +err_signals: + fuse_remove_signal_handlers(ll.se); +err_session: + fuse_session_destroy(ll.se); +err_singlefile: + single_file_close(); +err_strings: + free(opts.mountpoint); + free(ll.device); +err_args: + fuse_opt_free_args(&args); + return ret ? 
1 : 0; +} diff --git a/example/meson.build b/example/meson.build index 45dbf26eb355a7..16eb256d7125d6 100644 --- a/example/meson.build +++ b/example/meson.build @@ -31,6 +31,7 @@ if platform.endswith('linux') output: 'service_hl.socket', configuration: private_cfg) + single_file_examples += [ 'iomap_ll' ] endif threaded_examples = [ 'notify_inval_inode', diff --git a/example/service_hl.c b/example/service_hl.c index ea041f670f2ec5..b270315e63e297 100644 --- a/example/service_hl.c +++ b/example/service_hl.c @@ -221,7 +221,7 @@ int main(int argc, char *argv[]) if (fuse_service_finish_file_requests(hl.service)) goto err_singlefile; - if (single_file_configure(hl.device, NULL)) + if (single_file_configure(hl.device, 0, NULL)) goto err_singlefile; fuse_service_expect_mount_format(hl.service, S_IFDIR); diff --git a/example/service_ll.c b/example/service_ll.c index 33a8bd48bc1215..f09e211ce6c393 100644 --- a/example/service_ll.c +++ b/example/service_ll.c @@ -274,7 +274,7 @@ int main(int argc, char *argv[]) if (fuse_service_finish_file_requests(ll.service)) goto err_singlefile; - if (single_file_configure(ll.device, NULL)) + if (single_file_configure(ll.device, 0, NULL)) goto err_singlefile; ll.se = fuse_session_new(&args, &service_ll_oper, diff --git a/example/single_file.c b/example/single_file.c index a962232d576e17..d4db4e90593cc7 100644 --- a/example/single_file.c +++ b/example/single_file.c @@ -29,6 +29,7 @@ #include "fuse_lowlevel.h" #include "fuse.h" #include "fuse_service.h" +#include "fuse_loopdev.h" #define USE_SINGLE_FILE_LL_API #define USE_SINGLE_FILE_HL_API #include "single_file.h" @@ -48,6 +49,7 @@ struct dirbuf { struct single_file_stat { struct fuse_entry_param entry; + unsigned int iflags; }; #define SINGLE_FILE_INO (FUSE_ROOT_ID + 1) @@ -59,6 +61,7 @@ static struct timespec startup_time; struct single_file single_file = { .backing_fd = -1, .allow_dio = true, + .file_fd = -1, .mode = S_IFREG | 0444, .lock = PTHREAD_MUTEX_INITIALIZER, }; @@ -209,6 +212,14 
@@ static bool sf_stat(fuse_ino_t ino, struct single_file_stat *llstat) entry->entry_timeout = 0.0; entry->ino = ino; + if (single_file.uses_iomap) { + llstat->iflags = FUSE_IFLAG_IOMAP | FUSE_IFLAG_EXCLUSIVE; + if (single_file.sync) + llstat->iflags |= FUSE_IFLAG_SYNC; + if (single_file.ro) + llstat->iflags |= FUSE_IFLAG_IMMUTABLE; + } + return true; } @@ -396,6 +407,9 @@ void single_file_ll_getattr(fuse_req_t req, fuse_ino_t ino, pthread_mutex_unlock(&single_file.lock); if (!filled) fuse_reply_err(req, ENOENT); + else if (single_file.uses_iomap) + fuse_reply_attr_iflags(req, &llstat.entry.attr, llstat.iflags, + llstat.entry.attr_timeout); else fuse_reply_attr(req, &llstat.entry.attr, llstat.entry.attr_timeout); @@ -567,6 +581,8 @@ void single_file_ll_lookup(fuse_req_t req, fuse_ino_t parent, const char *name) pthread_mutex_unlock(&single_file.lock); if (!filled) fuse_reply_err(req, ENOENT); + else if (single_file.uses_iomap) + fuse_reply_entry_iflags(req, &llstat.entry, llstat.iflags); else fuse_reply_entry(req, &llstat.entry); } @@ -756,6 +772,20 @@ int single_file_opt_proc(void *data, const char *arg, int key, return 1; } +int single_file_open(const char *path) +{ + const int open_flags = (single_file.ro ? O_RDONLY : O_RDWR) | O_EXCL; + int fd = open(path, open_flags); + + if (fd < 0) { + perror(path); + return -1; + } + + single_file.backing_fd = fd; + return 0; +} + int single_file_service_open(struct fuse_service *sf, const char *path) { int open_flags = single_file.ro ? 
O_RDONLY : O_RDWR; @@ -874,7 +904,51 @@ ssize_t single_file_pread(char *buf, size_t count, off_t pos) return 0; } -int single_file_configure(const char *device, const char *filename) +static int check_fmt(mode_t expected_fmt, const struct stat *stbuf, + const char *device) +{ + const char *type = NULL; + + switch (expected_fmt) { + case S_IFREG: + type = "regular file"; + break; + case S_IFDIR: + type = "directory"; + break; + case S_IFLNK: + type = "symlink"; + break; + case S_IFBLK: + type = "block device"; + break; + case S_IFCHR: + type = "char device"; + break; + case S_IFIFO: + type = "fifo"; + break; + case S_IFSOCK: + type = "socket"; + break; + case 0: + return 0; + default: + fprintf(stderr, "%s: unrecognized format 0o%o\n", + device, expected_fmt); + return -1; + } + + if (expected_fmt != (stbuf->st_mode & S_IFMT)) { + fprintf(stderr, "%s: Not a %s\n", device, type); + return -1; + } + + return 0; +} + +int single_file_configure(const char *device, mode_t expected_fmt, + const char *filename) { struct stat stbuf; unsigned long long backing_size; @@ -887,9 +961,35 @@ int single_file_configure(const char *device, const char *filename) perror(device); return -1; } + + if (expected_fmt == S_IFBLK && !S_ISBLK(stbuf.st_mode)) { + int loop_fd = -1; + + ret = fuse_loopdev_setup(single_file.backing_fd, + (single_file.ro ? 
O_RDONLY : O_RDWR), + device, 5, &loop_fd, NULL); + if (ret) { + fprintf(stderr, "%s: %s\n", device, strerror(-ret)); + return -1; + } + + single_file.file_fd = single_file.backing_fd; + single_file.backing_fd = loop_fd; + + ret = fstat(single_file.backing_fd, &stbuf); + if (ret) { + perror(device); + return -1; + } + } + lbasize = stbuf.st_blksize; backing_size = stbuf.st_size; + ret = check_fmt(expected_fmt, &stbuf, device); + if (ret) + return ret; + if (S_ISBLK(stbuf.st_mode)) { #ifdef BLKSSZGET ret = ioctl(single_file.backing_fd, BLKSSZGET, &lbasize); @@ -986,6 +1086,9 @@ void single_file_close(void) close(single_file.backing_fd); single_file.backing_fd = -1; + close(single_file.file_fd); + single_file.file_fd = -1; + if (single_file_name_set) free((void *)single_file_name); single_file_name_set = false; ^ permalink raw reply related [flat|nested] 191+ messages in thread
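The "made bad mapping" check in iomap_ll_iomap_begin above rejects any mapping that does not cover the byte at pos. A minimal standalone sketch of that predicate follows. The names demo_mapping and demo_mapping_covers are hypothetical and not part of libfuse, and the comparison is written to avoid overflow in offset + length, which the sample itself does not guard against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset/length pair of a file mapping. */
struct demo_mapping {
	uint64_t offset;	/* first byte covered by the mapping */
	uint64_t length;	/* number of bytes covered */
};

/*
 * Return true if the mapping covers the byte at pos.  This is the inverse
 * of the sample's rejection test (offset > pos || offset + length <= pos),
 * rearranged so that offset + length is never computed.
 */
static bool demo_mapping_covers(const struct demo_mapping *m, uint64_t pos)
{
	assert(m != NULL);

	return m->offset <= pos && pos - m->offset < m->length;
}
```

A zero-length mapping never covers any position, so a server that forgets to fill in the length would fail this check.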
* [PATCH 2/7] example/iomap_ll: track block state 2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong 2026-04-29 14:47 ` [PATCH 1/7] example/iomap_ll: create a simple iomap server Darrick J. Wong @ 2026-04-29 14:47 ` Darrick J. Wong 2026-04-29 14:47 ` [PATCH 3/7] example/iomap_ll: implement atomic writes Darrick J. Wong ` (4 subsequent siblings) 6 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:47 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, john, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Track block state so we can do more interesting things with iomap. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/single_file.h | 11 + example/iomap_ll.c | 419 +++++++++++++++++++++++++++++++++++++++++++++++-- example/service_hl.c | 2 example/service_ll.c | 2 example/single_file.c | 30 +++- 5 files changed, 432 insertions(+), 32 deletions(-) diff --git a/example/single_file.h b/example/single_file.h index a12389d69e33cc..38e147ee24d331 100644 --- a/example/single_file.h +++ b/example/single_file.h @@ -70,6 +70,11 @@ struct single_file { extern struct single_file single_file; +struct single_file_ops { + uint64_t (*calc_st_blocks)(void); + uint64_t (*calc_bfree)(void); +}; + static inline uint64_t b_to_fsbt(uint64_t off) { return off / single_file.blocksize; @@ -123,8 +128,10 @@ void single_file_check_read(off_t pos, size_t *count); int single_file_check_write(off_t pos, size_t *count); int single_file_configure(const char *device, mode_t expected_fmt, - const char *filename); -int single_file_configure_simple(const char *filename); + const char *filename, + const struct single_file_ops *fops); +int single_file_configure_simple(const char *filename, + const struct single_file_ops *fops); void single_file_close(void); ssize_t single_file_pwrite(const char *buf, size_t count, off_t pos); diff --git a/example/iomap_ll.c b/example/iomap_ll.c 
index 93199b80205c2d..0388400af2efd0 100644 --- a/example/iomap_ll.c +++ b/example/iomap_ll.c @@ -45,18 +45,73 @@ #include <linux/stat.h> #define USE_SINGLE_FILE_LL_API #include "single_file.h" +#include <linux/falloc.h> + +#define STATE_UNTRACKED 0 +#define STATE_HOLE '.' +#define STATE_UNWRITTEN 'U' +#define STATE_MAPPED 'M' +#define STATE_DELALLOC 'd' + +static uint16_t state_to_iomap_type(char s) +{ + switch (s) { + case STATE_HOLE: + return FUSE_IOMAP_TYPE_HOLE; + case STATE_DELALLOC: + return FUSE_IOMAP_TYPE_DELALLOC; + case STATE_MAPPED: + return FUSE_IOMAP_TYPE_MAPPED; + case STATE_UNWRITTEN: + return FUSE_IOMAP_TYPE_UNWRITTEN; + default: + /* should never get here */ + return FUSE_IOMAP_TYPE_INLINE; + } +} struct iomap_ll { struct fuse_session *se; char *device; + uint8_t *state; /* really booleans */ int debug; + int track_state; int dev_index; }; -static struct iomap_ll ll = { }; +static struct iomap_ll ll = { + .track_state = STATE_UNTRACKED, +}; + +static uint64_t iomap_bytes_allocated(void) +{ + uint8_t *p; + uint64_t i; + uint64_t blocks = 0; + + if (ll.track_state == STATE_UNTRACKED) + return single_file.isize; + + for (i = 0, p = ll.state; i < single_file.blocks; i++, p++) + if (*p != STATE_HOLE) + blocks++; + + return fsb_to_b(blocks); +} + + +static uint64_t iomap_ll_calc_stblocks(void) +{ + return howmany(iomap_bytes_allocated(), 512); +} + +static uint64_t iomap_ll_calc_bfree(void) +{ + return single_file.blocks - b_to_fsb(iomap_bytes_allocated()); +} static void iomap_ll_init(void *userdata, struct fuse_conn_info *conn) { @@ -113,31 +168,132 @@ static void iomap_ll_iomap_config(fuse_req_t req, fuse_reply_iomap_config(req, &cfg); } -static int iomap_begin_report(off_t pos, uint64_t count, uint32_t opflags, - struct fuse_file_iomap *read) +static ssize_t adjust_endoff(uint64_t fileoff, uint64_t *endoff) { - (void)pos; - (void)count; + if (fileoff >= single_file.blocks) + return -EIO; + if (*endoff > single_file.blocks) + *endoff = 
single_file.blocks; + return 0; +} + +static ssize_t iomap_begin_report(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + uint64_t fileoff, endoff; + uint8_t *p; + uint8_t orig_state; + ssize_t ret; + (void)opflags; - read->offset = 0; - read->length = fsb_to_b(single_file.blocks); - read->addr = 0; - read->dev = ll.dev_index; - read->type = FUSE_IOMAP_TYPE_MAPPED; + if (ll.track_state == STATE_UNTRACKED) { + read->offset = 0; + read->length = fsb_to_b(single_file.blocks); + read->addr = 0; + read->dev = ll.dev_index; + read->type = FUSE_IOMAP_TYPE_MAPPED; + return 0; + } + + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + count); + + ret = adjust_endoff(fileoff, &endoff); + if (ret) + return ret; + + if (ll.debug) + fprintf(stderr, +"%s: pos 0x%llx count 0x%llx fileoff 0x%llx endoff 0x%llx new_endoff 0x%llx\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count, + (unsigned long long)fileoff, + (unsigned long long)b_to_fsb(pos + count), + (unsigned long long)endoff); + + read->offset = fsb_to_b(fileoff); + read->length = 0; + orig_state = ll.state[fileoff]; + read->type = state_to_iomap_type(orig_state); + switch (read->type) { + case FUSE_IOMAP_TYPE_MAPPED: + case FUSE_IOMAP_TYPE_UNWRITTEN: + read->dev = ll.dev_index; + read->addr = fsb_to_b(fileoff); + break; + case FUSE_IOMAP_TYPE_DELALLOC: + case FUSE_IOMAP_TYPE_HOLE: + read->dev = FUSE_IOMAP_DEV_NULL; + read->addr = FUSE_IOMAP_NULL_ADDR; + break; + default: + return -EIO; + } + + for (p = ll.state + fileoff; fileoff < endoff; p++, fileoff++) { + if (*p != orig_state) + break; + read->length++; + } + read->length = fsb_to_b(read->length); return 0; } -static int iomap_begin_read(off_t pos, uint64_t count, uint32_t opflags, - struct fuse_file_iomap *read) +static ssize_t iomap_begin_read(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) { return iomap_begin_report(pos, count, opflags, read); } -static int iomap_begin_write(off_t pos, 
uint64_t count, uint32_t opflags, - struct fuse_file_iomap *read) +static ssize_t iomap_write_allocate(off_t pos, uint64_t count, uint32_t opflags) { + uint64_t fileoff, endoff; + uint8_t *p; + const bool direct = opflags & (FUSE_IOMAP_OP_DIRECT | + FUSE_IOMAP_OP_WRITEBACK); + ssize_t ret; + + if (ll.track_state == STATE_UNTRACKED) + return 0; + if (opflags & FUSE_IOMAP_OP_ZERO) + return 0; + + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + count); + + ret = adjust_endoff(fileoff, &endoff); + if (ret) + return ret; + + if (ll.debug) + fprintf(stderr, "%s: set %s pos 0x%llx count 0x%llx\n", + __func__, + direct ? "unwritten" : "delalloc", + (unsigned long long)pos, + (unsigned long long)count); + + for (p = ll.state + fileoff; fileoff < endoff; p++, fileoff++) { + if (direct && *p != STATE_MAPPED) + *p = STATE_UNWRITTEN; + else if (!direct && *p == STATE_HOLE) + *p = STATE_DELALLOC; + } + + return 0; +} + +static ssize_t iomap_begin_write(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + ssize_t ret; + + ret = iomap_write_allocate(pos, count, opflags); + if (ret) + return ret; + return iomap_begin_read(pos, count, opflags, read); } @@ -146,7 +302,8 @@ static void iomap_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, uint32_t opflags) { struct fuse_file_iomap read; - int ret; + ssize_t got; + int ret = 0; (void)dontcare; @@ -171,13 +328,15 @@ static void iomap_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, pthread_mutex_lock(&single_file.lock); if (opflags & FUSE_IOMAP_OP_REPORT) - ret = iomap_begin_report(pos, count, opflags, &read); + got = iomap_begin_report(pos, count, opflags, &read); else if (fuse_iomap_is_write(opflags)) - ret = iomap_begin_write(pos, count, opflags, &read); + got = iomap_begin_write(pos, count, opflags, &read); else - ret = iomap_begin_read(pos, count, opflags, &read); - if (ret) + got = iomap_begin_read(pos, count, opflags, &read); + if (got < 0) { + ret = -got; goto out_unlock; + } if (ll.debug) 
fprintf(stderr, @@ -207,6 +366,181 @@ static void iomap_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, fuse_reply_iomap_begin(req, &read, NULL); } +static void iomap_ll_iomap_end(fuse_req_t req, fuse_ino_t ino, + uint64_t dontcare, off_t pos, uint64_t count, + uint32_t opflags, ssize_t written, + const struct fuse_file_iomap *iomap) +{ + uint64_t fileoff, endoff; + uint8_t *p; + ssize_t got; + int ret = 0; + + (void)dontcare; + (void)iomap; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx opflags 0x%x written %zd\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count, + opflags, + written); + + if (ll.track_state == STATE_UNTRACKED || written >= 0) + goto out_reply; + + /* punch delalloc mappings due to error */ + pthread_mutex_lock(&single_file.lock); + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + count); + + got = adjust_endoff(fileoff, &endoff); + if (got < 0) { + ret = -got; + goto out_unlock; + } + + for (p = ll.state + fileoff; fileoff < endoff; p++, fileoff++) { + if (*p == STATE_DELALLOC) + *p = STATE_HOLE; + } + +out_unlock: + pthread_mutex_unlock(&single_file.lock); +out_reply: + fuse_reply_err(req, ret); +} + +static void iomap_ll_iomap_ioend(fuse_req_t req, fuse_ino_t ino, + uint64_t dontcare, off_t pos, size_t written, + uint32_t ioendflags, int error, uint32_t dev, + uint64_t new_addr) +{ + uint64_t fileoff, endoff; + uint8_t *p; + ssize_t got; + int ret = 0; + + (void)dontcare; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, +"%s: pos 0x%llx written 0x%zx ioendflags 0x%x error %d dev %u new_addr 0x%llx\n", + __func__, + (unsigned long long)pos, + written, + ioendflags, + error, + dev, + (unsigned long long)new_addr); + + if (error) { + ret = error; + goto out_reply; + } + + if (ll.track_state == STATE_UNTRACKED) + goto out_reply; + if (!(ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN)) +
goto out_reply; + + pthread_mutex_lock(&single_file.lock); + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + written); + + got = adjust_endoff(fileoff, &endoff); + if (got < 0) { + ret = -got; + goto out_unlock; + } + + for (p = ll.state + fileoff; fileoff < endoff; p++, fileoff++) { + if (*p == STATE_UNWRITTEN) + *p = STATE_MAPPED; + } + +out_unlock: + pthread_mutex_unlock(&single_file.lock); +out_reply: + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_iomap_ioend(req, single_file.isize); +} + +static void iomap_ll_fallocate(fuse_req_t req, fuse_ino_t ino, int mode, + off_t pos, off_t count, + struct fuse_file_info *fp) +{ + uint64_t fileoff, endoff; + uint8_t old_state, new_state; + uint8_t *p; + ssize_t got; + int ret = 0; + + (void)fp; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx mode 0x%x\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count, + mode); + + if (ll.track_state == STATE_UNTRACKED) { + if (mode & (FALLOC_FL_ZERO_RANGE | FALLOC_FL_PUNCH_HOLE)) + ret = EOPNOTSUPP; + goto out_reply; + } + + pthread_mutex_lock(&single_file.lock); + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + count); + + got = adjust_endoff(fileoff, &endoff); + if (got < 0) { + ret = -got; + goto out_unlock; + } + + if (mode & FALLOC_FL_ZERO_RANGE) { + old_state = STATE_UNTRACKED; + new_state = STATE_UNWRITTEN; + } else if (mode & FALLOC_FL_PUNCH_HOLE) { + old_state = STATE_UNTRACKED; + new_state = STATE_HOLE; + } else { + old_state = STATE_HOLE; + new_state = STATE_UNWRITTEN; + } + + for (p = ll.state + fileoff; fileoff < endoff; p++, fileoff++) { + if (old_state == STATE_UNTRACKED || *p == old_state) + *p = new_state; + } + +out_unlock: + pthread_mutex_unlock(&single_file.lock); +out_reply: + fuse_reply_err(req, ret); +} + static const struct fuse_lowlevel_ops iomap_ll_oper = { .lookup = single_file_ll_lookup, .getattr = single_file_ll_getattr, @@ -220,6 
+554,13 @@ static const struct fuse_lowlevel_ops iomap_ll_oper = { .init = iomap_ll_init, .iomap_config = iomap_ll_iomap_config, .iomap_begin = iomap_ll_iomap_begin, + .iomap_end = iomap_ll_iomap_end, + .iomap_ioend = iomap_ll_iomap_ioend, + .fallocate = iomap_ll_fallocate, +}; + +enum { + IOMAP_LL_INITSTATE = SINGLE_FILE_NR_KEYS, }; #define IOMAP_LL_OPT(t, p, v) { t, offsetof(struct iomap_ll, p), v } @@ -227,6 +568,7 @@ static const struct fuse_lowlevel_ops iomap_ll_oper = { static struct fuse_opt iomap_ll_opts[] = { IOMAP_LL_OPT("debug", debug, 1), SINGLE_FILE_OPT_KEYS, + FUSE_OPT_KEY("init_state=%s", IOMAP_LL_INITSTATE), FUSE_OPT_END }; @@ -243,11 +585,32 @@ static int iomap_ll_opt_proc(void *data, const char *arg, return 0; } return 1; + case IOMAP_LL_INITSTATE: + if (!strcmp(arg + 11, "alwaysmapped")) + ll.track_state = STATE_UNTRACKED; + else if (!strcmp(arg + 11, "hole")) + ll.track_state = STATE_HOLE; + else if (!strcmp(arg + 11, "unwritten")) + ll.track_state = STATE_UNWRITTEN; + else if (!strcmp(arg + 11, "mapped")) + ll.track_state = STATE_MAPPED; + else { + fprintf(stderr, "%s: initial state not recognized\n", arg); + return -1; + } + + /* do not pass through to libfuse */ + return 0; } return 1; } +static const struct single_file_ops fops = { + .calc_st_blocks = iomap_ll_calc_stblocks, + .calc_bfree = iomap_ll_calc_bfree, +}; + int main(int argc, char *argv[]) { struct fuse_args args = FUSE_ARGS_INIT(argc, argv); @@ -282,13 +645,25 @@ int main(int argc, char *argv[]) if (single_file_open(ll.device)) goto err_strings; - if (single_file_configure(ll.device, S_IFBLK, "bdev")) + if (single_file_configure(ll.device, S_IFBLK, "bdev", &fops)) goto err_singlefile; + if (ll.track_state != STATE_UNTRACKED) { + ll.state = malloc(single_file.blocks + 1); + if (!ll.state) { + perror(ll.device); + goto err_singlefile; + } + memset(ll.state, ll.track_state, single_file.blocks); + + /* make this a nice asciiz string */ + ll.state[single_file.blocks] = 0; + } + ll.se = 
fuse_session_new(&args, &iomap_ll_oper, sizeof(iomap_ll_oper), NULL); if (ll.se == NULL) - goto err_singlefile; + goto err_state; if (fuse_set_signal_handlers(ll.se)) goto err_session; @@ -321,6 +696,8 @@ int main(int argc, char *argv[]) fuse_remove_signal_handlers(ll.se); err_session: fuse_session_destroy(ll.se); +err_state: + free(ll.state); err_singlefile: single_file_close(); err_strings: diff --git a/example/service_hl.c b/example/service_hl.c index b270315e63e297..db92dbb9b611b3 100644 --- a/example/service_hl.c +++ b/example/service_hl.c @@ -221,7 +221,7 @@ int main(int argc, char *argv[]) if (fuse_service_finish_file_requests(hl.service)) goto err_singlefile; - if (single_file_configure(hl.device, 0, NULL)) + if (single_file_configure(hl.device, 0, NULL, NULL)) goto err_singlefile; fuse_service_expect_mount_format(hl.service, S_IFDIR); diff --git a/example/service_ll.c b/example/service_ll.c index f09e211ce6c393..eb3d77c639f705 100644 --- a/example/service_ll.c +++ b/example/service_ll.c @@ -274,7 +274,7 @@ int main(int argc, char *argv[]) if (fuse_service_finish_file_requests(ll.service)) goto err_singlefile; - if (single_file_configure(ll.device, 0, NULL)) + if (single_file_configure(ll.device, 0, NULL, NULL)) goto err_singlefile; ll.se = fuse_session_new(&args, &service_ll_oper, diff --git a/example/single_file.c b/example/single_file.c index d4db4e90593cc7..5af17e539c566c 100644 --- a/example/single_file.c +++ b/example/single_file.c @@ -58,6 +58,8 @@ static const char *single_file_name = "single_file"; static bool single_file_name_set; static struct timespec startup_time; +static const struct single_file_ops *single_file_ops; + struct single_file single_file = { .backing_fd = -1, .allow_dio = true, @@ -198,7 +200,10 @@ static bool sf_stat(fuse_ino_t ino, struct single_file_stat *llstat) stbuf->st_nlink = 1; stbuf->st_size = single_file.isize; stbuf->st_blksize = single_file.blocksize; - stbuf->st_blocks = howmany(single_file.isize, 512); + if 
(single_file_ops && single_file_ops->calc_st_blocks) + stbuf->st_blocks = single_file_ops->calc_st_blocks(); + else + stbuf->st_blocks = howmany(single_file.isize, 512); stbuf->st_atim = single_file.atime; stbuf->st_mtim = single_file.mtime; stbuf->st_ctim = single_file.ctime; @@ -271,7 +276,10 @@ static bool sf_statx(fuse_ino_t ino, int statx_mask, struct statx *stx) stx->stx_nlink = 1; stx->stx_size = single_file.isize; stx->stx_blksize = single_file.blocksize; - stx->stx_blocks = howmany(single_file.isize, 512); + if (single_file_ops && single_file_ops->calc_st_blocks) + stx->stx_blocks = single_file_ops->calc_st_blocks(); + else + stx->stx_blocks = howmany(single_file.isize, 512); stx->stx_atime.tv_sec = single_file.atime.tv_sec; stx->stx_atime.tv_nsec = single_file.atime.tv_nsec; stx->stx_mtime.tv_sec = single_file.mtime.tv_sec; @@ -362,8 +370,11 @@ static void single_file_statfs(struct statvfs *buf) buf->f_frsize = 0; buf->f_blocks = single_file.blocks; - buf->f_bfree = 0; - buf->f_bavail = 0; + if (single_file_ops && single_file_ops->calc_bfree) + buf->f_bfree = single_file_ops->calc_bfree(); + else + buf->f_bfree = 0; + buf->f_bavail = buf->f_bfree; buf->f_files = 1; buf->f_ffree = 0; buf->f_favail = 0; @@ -948,7 +959,8 @@ static int check_fmt(mode_t expected_fmt, const struct stat *stbuf, } int single_file_configure(const char *device, mode_t expected_fmt, - const char *filename) + const char *filename, + const struct single_file_ops *fops) { struct stat stbuf; unsigned long long backing_size; @@ -1048,10 +1060,11 @@ int single_file_configure(const char *device, mode_t expected_fmt, single_file.isize = round_down(single_file.isize, single_file.blocksize); single_file.blocks = single_file.isize / single_file.blocksize; - return single_file_configure_simple(filename); + return single_file_configure_simple(filename, fops); } -int single_file_configure_simple(const char *filename) +int single_file_configure_simple(const char *filename, + const struct 
single_file_ops *fops) { if (!single_file.blocksize) single_file.blocksize = sysconf(_SC_PAGESIZE); @@ -1078,6 +1091,7 @@ int single_file_configure_simple(const char *filename) if (!single_file.ro) single_file.mode |= 0220; + single_file_ops = fops; return 0; } @@ -1092,4 +1106,6 @@ void single_file_close(void) if (single_file_name_set) free((void *)single_file_name); single_file_name_set = false; + + single_file_ops = NULL; } ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 3/7] example/iomap_ll: implement atomic writes 2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong 2026-04-29 14:47 ` [PATCH 1/7] example/iomap_ll: create a simple iomap server Darrick J. Wong 2026-04-29 14:47 ` [PATCH 2/7] example/iomap_ll: track block state Darrick J. Wong @ 2026-04-29 14:47 ` Darrick J. Wong 2026-04-29 14:48 ` [PATCH 4/7] example/iomap_inline_ll: create a simple server to test inlinedata Darrick J. Wong ` (3 subsequent siblings) 6 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:47 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, john, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> If the block device supports atomic writes, we will too. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/single_file.h | 2 ++ example/iomap_ll.c | 5 +++++ example/single_file.c | 40 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 47 insertions(+) diff --git a/example/single_file.h b/example/single_file.h index 38e147ee24d331..76f2cd6eb80529 100644 --- a/example/single_file.h +++ b/example/single_file.h @@ -59,6 +59,7 @@ struct single_file { bool require_bdev; bool uses_iomap; + int awu_min, awu_max; unsigned int blocksize; struct timespec atime; @@ -132,6 +133,7 @@ int single_file_configure(const char *device, mode_t expected_fmt, const struct single_file_ops *fops); int single_file_configure_simple(const char *filename, const struct single_file_ops *fops); +void single_file_configure_atomic_write(void); void single_file_close(void); ssize_t single_file_pwrite(const char *buf, size_t count, off_t pos); diff --git a/example/iomap_ll.c b/example/iomap_ll.c index 0388400af2efd0..9824172be4afa4 100644 --- a/example/iomap_ll.c +++ b/example/iomap_ll.c @@ -135,6 +135,8 @@ static int iomap_iomap_config_devices(void) return ret; } + single_file_configure_atomic_write(); + ret = ioctl(single_file.backing_fd, BLKBSZSET, 
&blocksize); if (ret) { printf("%s: cannot set block size %u: %s\n", @@ -338,6 +340,9 @@ static void iomap_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, goto out_unlock; } + if (opflags & FUSE_IOMAP_OP_ATOMIC) + read.flags |= FUSE_IOMAP_F_ATOMIC_BIO; + if (ll.debug) fprintf(stderr, "%s: offset 0x%llx length 0x%llx type %u dev %u addr 0x%llx flags 0x%x\n", diff --git a/example/single_file.c b/example/single_file.c index 5af17e539c566c..401bdf7afae6c8 100644 --- a/example/single_file.c +++ b/example/single_file.c @@ -34,6 +34,7 @@ #define USE_SINGLE_FILE_HL_API #include "single_file.h" +#define max(x, y) ((x) > (y) ? (x) : (y)) #define min(x, y) ((x) < (y) ? (x) : (y)) #if __has_attribute(__fallthrough__) @@ -223,6 +224,8 @@ static bool sf_stat(fuse_ino_t ino, struct single_file_stat *llstat) llstat->iflags |= FUSE_IFLAG_SYNC; if (single_file.ro) llstat->iflags |= FUSE_IFLAG_IMMUTABLE; + if (single_file.awu_min > 0 && single_file.awu_max > 0) + llstat->iflags |= FUSE_IFLAG_ATOMIC; } return true; @@ -288,6 +291,15 @@ static bool sf_statx(fuse_ino_t ino, int statx_mask, struct statx *stx) stx->stx_ctime.tv_nsec = single_file.ctime.tv_nsec; stx->stx_btime.tv_sec = startup_time.tv_sec; stx->stx_btime.tv_nsec = startup_time.tv_nsec; + +#ifdef STATX_WRITE_ATOMIC + if (single_file.awu_min > 0 && single_file.awu_max > 0) { + stx->stx_mask |= STATX_WRITE_ATOMIC; + stx->stx_atomic_write_unit_min = single_file.awu_min; + stx->stx_atomic_write_unit_max = single_file.awu_max; + stx->stx_atomic_write_segments_max = 1; + } +#endif } else { return false; } @@ -1095,6 +1107,34 @@ int single_file_configure_simple(const char *filename, return 0; } +#ifdef STATX_WRITE_ATOMIC +void single_file_configure_atomic_write(void) +{ + struct statx devx; + unsigned int awu_min, awu_max; + int ret; + + ret = statx(single_file.backing_fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &devx); + if (ret) + return; + if (!(devx.stx_mask & STATX_WRITE_ATOMIC)) + return; + + awu_min = 
max(single_file.blocksize, devx.stx_atomic_write_unit_min); + awu_max = min(single_file.blocksize, devx.stx_atomic_write_unit_max); + if (awu_min > awu_max) + return; + + single_file.awu_min = awu_min; + single_file.awu_max = awu_max; +} +#else +void single_file_configure_atomic_write(void) +{ + single_file.awu_min = single_file.awu_max = 0; +} +#endif + void single_file_close(void) { close(single_file.backing_fd);
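The clamping in single_file_configure_atomic_write() is easy to misread in diff form, so here is the same arithmetic as a standalone sketch. It assumes nothing beyond the max()/min() logic shown in the patch; demo_clamp_awu() and its parameter names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/*
 * Clamp the device's atomic write unit range against the filesystem
 * block size, mirroring the max()/min() logic in the patch.  Returns
 * false (and leaves the outputs untouched) if the ranges do not overlap.
 */
static bool demo_clamp_awu(unsigned int blocksize,
			   unsigned int dev_min, unsigned int dev_max,
			   unsigned int *awu_min, unsigned int *awu_max)
{
	unsigned int lo = blocksize > dev_min ? blocksize : dev_min;
	unsigned int hi = blocksize < dev_max ? blocksize : dev_max;

	if (lo > hi)
		return false;

	*awu_min = lo;
	*awu_max = hi;
	return true;
}
```

One consequence of this arithmetic: atomic writes are advertised only when dev_min <= blocksize <= dev_max, and in that case both units collapse to exactly one block.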
* [PATCH 4/7] example/iomap_inline_ll: create a simple server to test inlinedata 2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong ` (2 preceding siblings ...) 2026-04-29 14:47 ` [PATCH 3/7] example/iomap_ll: implement atomic writes Darrick J. Wong @ 2026-04-29 14:48 ` Darrick J. Wong 2026-04-29 14:48 ` [PATCH 5/7] example/iomap_ow_ll: create a simple iomap out of place write server Darrick J. Wong ` (2 subsequent siblings) 6 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:48 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, john, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Create another example iomap fuse server to test inline data handling. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/single_file.h | 1 example/iomap_inline_ll.c | 367 +++++++++++++++++++++++++++++++++++++++++++++ example/meson.build | 2 example/single_file.c | 26 ++- 4 files changed, 389 insertions(+), 7 deletions(-) create mode 100644 example/iomap_inline_ll.c diff --git a/example/single_file.h b/example/single_file.h index 76f2cd6eb80529..90696a4d5a626a 100644 --- a/example/single_file.h +++ b/example/single_file.h @@ -58,6 +58,7 @@ struct single_file { bool sync; bool require_bdev; bool uses_iomap; + bool fixed_size; int awu_min, awu_max; unsigned int blocksize; diff --git a/example/iomap_inline_ll.c b/example/iomap_inline_ll.c new file mode 100644 index 00000000000000..bed9855b72a27e --- /dev/null +++ b/example/iomap_inline_ll.c @@ -0,0 +1,367 @@ +/* + * FUSE: Filesystem in Userspace + * Copyright (C) 2026 Oracle. + * + * This program can be distributed under the terms of the GNU GPLv2. + * See the file GPL2.txt. 
+ */ + +/** @file + * + * minimal example inlinedata iomap filesystem using low-level API + * + * Compile with: + * + * gcc -Wall single_file.c iomap_inline_ll.c `pkg-config fuse3 --cflags --libs` \ + * -o iomap_inline_ll + * + * Note: If the pkg-config command fails due to the absence of the fuse3.pc + * file, you should configure the path to the fuse3.pc file in the + * PKG_CONFIG_PATH variable. + * + * ## Source code ## + * \include iomap_inline_ll.c + * \include single_file.c + * \include single_file.h + */ + +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 99) + +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif + +#include <fuse_lowlevel.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> +#include <fcntl.h> +#include <unistd.h> +#include <assert.h> +#include <pthread.h> +#include <sys/ioctl.h> +#include <sys/stat.h> +#include <linux/fs.h> +#include <linux/stat.h> +#define USE_SINGLE_FILE_LL_API +#include "single_file.h" + +#define max(x, y) ((x) > (y) ? 
(x) : (y)) + +struct ioinline_ll { + struct fuse_session *se; + + char *filedata; + + /* really booleans */ + int debug; +}; + +static struct ioinline_ll ll = { }; + +static uint64_t ioinline_ll_calc_bfree(void) +{ + return single_file.isize == 0; +} + +static void ioinline_ll_init(void *userdata, struct fuse_conn_info *conn) +{ + (void)userdata; + + if (fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) + single_file.uses_iomap = true; + conn->time_gran = 1; +} + +static void ioinline_ll_iomap_config(fuse_req_t req, + const struct fuse_iomap_config_params *p, + size_t psize) +{ + struct fuse_iomap_config cfg = { }; + + (void)p; + (void)psize; + + cfg.flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE; + cfg.s_blocksize = single_file.blocksize; + + cfg.flags |= FUSE_IOMAP_CONFIG_MAXBYTES; + cfg.s_maxbytes = single_file.blocksize; + + fuse_reply_iomap_config(req, &cfg); +} + +static int ioinline_begin_report(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + (void)pos; + (void)count; + (void)opflags; + + read->offset = 0; + read->length = single_file.blocksize; + read->addr = FUSE_IOMAP_NULL_ADDR; + read->type = FUSE_IOMAP_TYPE_INLINE; + + return 0; +} + +static int ioinline_begin_read(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + return ioinline_begin_report(pos, count, opflags, read); +} + +static int ioinline_begin_write(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + return ioinline_begin_report(pos, count, opflags, read); +} + +static void ioinline_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, + uint64_t dontcare, off_t pos, + uint64_t count, uint32_t opflags) +{ + struct fuse_file_iomap read; + int ret; + + (void)dontcare; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx opflags 0x%x\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count, + opflags); + + if 
(!single_file.allow_dio && (opflags & FUSE_IOMAP_OP_DIRECT)) { + ret = ENOSYS; + goto out_reply; + } + + memset(&read, 0, sizeof(read)); + + pthread_mutex_lock(&single_file.lock); + if (opflags & FUSE_IOMAP_OP_REPORT) + ret = ioinline_begin_report(pos, count, opflags, &read); + else if (fuse_iomap_is_write(opflags)) + ret = ioinline_begin_write(pos, count, opflags, &read); + else + ret = ioinline_begin_read(pos, count, opflags, &read); + if (ret) + goto out_unlock; + + if (ll.debug) + fprintf(stderr, +"%s: offset 0x%llx length 0x%llx type %u dev %u addr 0x%llx flags 0x%x\n", + __func__, + (unsigned long long)read.offset, + (unsigned long long)read.length, + read.type, + read.dev, + (unsigned long long)read.addr, + read.flags); + + /* Not filling even the first byte will make the kernel unhappy. */ + if (read.offset > pos || read.offset + read.length <= pos) { + printf("made bad mapping at pos %llu\n", + (unsigned long long)pos); + ret = EIO; + goto out_unlock; + } + +out_unlock: + pthread_mutex_unlock(&single_file.lock); +out_reply: + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_iomap_begin(req, &read, NULL); +} + +static void ioinline_ll_read(fuse_req_t req, fuse_ino_t ino, size_t count, + off_t pos, struct fuse_file_info *fp) +{ + int ret = 0; + + (void)fp; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count); + +out_reply: + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_buf(req, ll.filedata + pos, count); +} + +static void ioinline_ll_write(fuse_req_t req, fuse_ino_t ino, const char *buf, + size_t count, off_t pos, + struct fuse_file_info *fp) +{ + int ret = 0; + + (void)fp; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count); + + 
pthread_mutex_lock(&single_file.lock); + memcpy(ll.filedata + pos, buf, count); + single_file.isize = max(single_file.isize, pos + count); + pthread_mutex_unlock(&single_file.lock); + +out_reply: + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_write(req, count); +} + +static const struct fuse_lowlevel_ops ioinline_ll_oper = { + .lookup = single_file_ll_lookup, + .getattr = single_file_ll_getattr, + .setattr = single_file_ll_setattr, + .readdir = single_file_ll_readdir, + .open = single_file_ll_open, + .statfs = single_file_ll_statfs, + .statx = single_file_ll_statx, + + .init = ioinline_ll_init, + .iomap_config = ioinline_ll_iomap_config, + .iomap_begin = ioinline_ll_iomap_begin, + .read = ioinline_ll_read, + .write = ioinline_ll_write, +}; + +#define IOMAP_LL_OPT(t, p, v) { t, offsetof(struct ioinline_ll, p), v } + +static struct fuse_opt ioinline_ll_opts[] = { + IOMAP_LL_OPT("debug", debug, 1), + SINGLE_FILE_OPT_KEYS, + FUSE_OPT_END +}; + +static int ioinline_ll_opt_proc(void *data, const char *arg, + int key, struct fuse_args *outargs) +{ + return single_file_opt_proc(data, arg, key, outargs); +} + +static const struct single_file_ops fops = { + .calc_bfree = ioinline_ll_calc_bfree, +}; + +int main(int argc, char *argv[]) +{ + struct fuse_args args = FUSE_ARGS_INIT(argc, argv); + struct fuse_cmdline_opts opts = { }; + int ret = 1; + + single_file.fixed_size = false; + + if (fuse_opt_parse(&args, &ll, ioinline_ll_opts, ioinline_ll_opt_proc)) + goto err_args; + + if (fuse_parse_cmdline(&args, &opts)) + goto err_args; + + if (opts.show_help) { + printf("usage: %s [options] <mountpoint>\n\n", argv[0]); + fuse_cmdline_help(); + fuse_lowlevel_help(); + ret = 0; + goto err_strings; + } else if (opts.show_version) { + printf("FUSE library version %s\n", fuse_pkgversion()); + fuse_lowlevel_version(); + ret = 0; + goto err_strings; + } + + if (!opts.mountpoint) { + printf("usage: %s [options] <mountpoint>\n", argv[0]); + printf(" %s --help\n", argv[0]); + goto 
err_strings; + } + + if (single_file_configure_simple("inline", &fops)) + goto err_strings; + + ll.filedata = malloc(single_file.blocksize); + if (!ll.filedata) { + perror("memory"); + goto err_singlefile; + } + memset(ll.filedata, 0x58, single_file.blocksize); + single_file.blocks = 1; + + ll.se = fuse_session_new(&args, &ioinline_ll_oper, + sizeof(ioinline_ll_oper), NULL); + if (ll.se == NULL) + goto err_filedata; + + if (fuse_set_signal_handlers(ll.se)) + goto err_session; + + if (fuse_session_mount(ll.se, opts.mountpoint)) + goto err_signals; + + fuse_daemonize(opts.foreground); + + /* Block until ctrl+c or fusermount -u */ + if (opts.singlethread) { + ret = fuse_session_loop(ll.se); + } else { + struct fuse_loop_config *config = fuse_loop_cfg_create(); + + if (!config) { + ret = 1; + goto err_mount; + } + + fuse_loop_cfg_set_clone_fd(config, opts.clone_fd); + fuse_loop_cfg_set_max_threads(config, opts.max_threads); + ret = fuse_session_loop_mt(ll.se, config); + fuse_loop_cfg_destroy(config); + } + +err_mount: + fuse_session_unmount(ll.se); +err_signals: + fuse_remove_signal_handlers(ll.se); +err_session: + fuse_session_destroy(ll.se); +err_filedata: + free(ll.filedata); +err_singlefile: + single_file_close(); +err_strings: + free(opts.mountpoint); +err_args: + fuse_opt_free_args(&args); + return ret ? 
1 : 0; +} diff --git a/example/meson.build b/example/meson.build index 16eb256d7125d6..24fdcb23cb58f2 100644 --- a/example/meson.build +++ b/example/meson.build @@ -31,7 +31,7 @@ if platform.endswith('linux') output: 'service_hl.socket', configuration: private_cfg) - single_file_examples += [ 'iomap_ll' ] + single_file_examples += [ 'iomap_ll', 'iomap_inline_ll' ] endif threaded_examples = [ 'notify_inval_inode', diff --git a/example/single_file.c b/example/single_file.c index 401bdf7afae6c8..1d66157a8c637c 100644 --- a/example/single_file.c +++ b/example/single_file.c @@ -67,6 +67,7 @@ struct single_file single_file = { .file_fd = -1, .mode = S_IFREG | 0444, .lock = PTHREAD_MUTEX_INITIALIZER, + .fixed_size = true, }; static fuse_ino_t single_file_path_to_ino(const char *path) @@ -478,8 +479,9 @@ void single_file_ll_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, if (ino != SINGLE_FILE_INO) goto deny; - if (to_set & (FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID | - FUSE_SET_ATTR_SIZE)) + if (to_set & (FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID)) + goto deny; + if ((to_set & FUSE_SET_ATTR_SIZE) && single_file.fixed_size) goto deny; if (single_file.ro) goto deny; @@ -504,6 +506,8 @@ void single_file_ll_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, } if (to_set & FUSE_SET_ATTR_CTIME) single_file.ctime = now; + if (to_set & FUSE_SET_ATTR_SIZE) + single_file.isize = attr->st_size; pthread_mutex_unlock(&single_file.lock); single_file_ll_getattr(req, ino, fi); @@ -581,11 +585,18 @@ int single_file_hl_chown(const char *path, uid_t owner, gid_t group, int single_file_hl_truncate(const char *path, off_t len, struct fuse_file_info *fi) { - (void)path; - (void)len; - (void)fi; + fuse_ino_t ino = single_open_file_path_to_ino(fi, path); - return -EPERM; + if (ino != SINGLE_FILE_INO) + return -EPERM; + if (single_file.fixed_size) + return -EPERM; + + pthread_mutex_lock(&single_file.lock); + single_file.isize = len; + pthread_mutex_unlock(&single_file.lock); + + return 
0; } void single_file_ll_lookup(fuse_req_t req, fuse_ino_t parent, const char *name) @@ -852,6 +863,9 @@ int single_file_service_open(struct fuse_service *sf, const char *path) int single_file_check_write(off_t pos, size_t *count) { + if (!single_file.fixed_size) + return 0; + if (pos >= single_file.isize) return -EFBIG;
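In the inline-data sample above, ioinline_ll_read() and ioinline_ll_write() copy to and from the one-block ll.filedata buffer without range checks of their own, and the server sets s_maxbytes to one block. A more defensive variant might clamp each request to the buffer first. The sketch below is an assumption about how that could look, not part of the patch; demo_inline_write() and demo_inline_read() are hypothetical helpers.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy up to count bytes into a fixed-size inline buffer, clamping the
 * request to the buffer and growing *isize if the write extends the file.
 * Returns the number of bytes copied, or -EFBIG if pos is out of range.
 */
static ssize_t demo_inline_write(char *filedata, size_t bufsize,
				 size_t *isize, const char *src,
				 off_t pos, size_t count)
{
	if (pos < 0 || (size_t)pos >= bufsize)
		return -EFBIG;
	if (count > bufsize - (size_t)pos)
		count = bufsize - (size_t)pos;

	memcpy(filedata + pos, src, count);
	if ((size_t)pos + count > *isize)
		*isize = (size_t)pos + count;
	return (ssize_t)count;
}

/*
 * Copy up to count bytes out of the inline buffer, stopping at isize.
 * Returns the number of bytes copied; 0 means pos is at or past EOF.
 */
static ssize_t demo_inline_read(const char *filedata, size_t isize,
				char *dst, off_t pos, size_t count)
{
	if (pos < 0 || (size_t)pos >= isize)
		return 0;
	if (count > isize - (size_t)pos)
		count = isize - (size_t)pos;

	memcpy(dst, filedata + pos, count);
	return (ssize_t)count;
}
```

A handler built on these helpers would pass the returned byte count to fuse_reply_write() or fuse_reply_buf() rather than the caller's original count.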
* [PATCH 5/7] example/iomap_ow_ll: create a simple iomap out of place write server 2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong ` (3 preceding siblings ...) 2026-04-29 14:48 ` [PATCH 4/7] example/iomap_inline_ll: create a simple server to test inlinedata Darrick J. Wong @ 2026-04-29 14:48 ` Darrick J. Wong 2026-04-29 14:48 ` [PATCH 6/7] example/iomap_ow_ll: implement atomic writes Darrick J. Wong 2026-04-29 14:48 ` [PATCH 7/7] example/iomap_service_ll: create a sample systemd service fuse server Darrick J. Wong 6 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:48 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, john, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Create a toy iomap fileserver as an example of how out-of-place writes work. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/single_file.h | 1 example/iomap_ow_ll.c | 813 +++++++++++++++++++++++++++++++++++++++++++++++++ example/meson.build | 2 example/single_file.c | 5 4 files changed, 819 insertions(+), 2 deletions(-) create mode 100644 example/iomap_ow_ll.c diff --git a/example/single_file.h b/example/single_file.h index 90696a4d5a626a..3ee9dd261352ba 100644 --- a/example/single_file.h +++ b/example/single_file.h @@ -75,6 +75,7 @@ extern struct single_file single_file; struct single_file_ops { uint64_t (*calc_st_blocks)(void); uint64_t (*calc_bfree)(void); + uint64_t (*calc_blocks)(void); }; static inline uint64_t b_to_fsbt(uint64_t off) diff --git a/example/iomap_ow_ll.c b/example/iomap_ow_ll.c new file mode 100644 index 00000000000000..c4edda1f9bfe40 --- /dev/null +++ b/example/iomap_ow_ll.c @@ -0,0 +1,813 @@ +/* + * FUSE: Filesystem in Userspace + * Copyright (C) 2026 Oracle. + * + * This program can be distributed under the terms of the GNU GPLv2. + * See the file GPL2.txt. 
+ */ + +/** @file + * + * minimal example iomap out of place write filesystem using low-level API + * + * Compile with: + * + * gcc -Wall single_file.c iomap_ow_ll.c `pkg-config fuse3 --cflags --libs` -o iomap_ow_ll + * + * Note: If the pkg-config command fails due to the absence of the fuse3.pc + * file, you should configure the path to the fuse3.pc file in the + * PKG_CONFIG_PATH variable. + * + * ## Source code ## + * \include iomap_ow_ll.c + * \include single_file.c + * \include single_file.h + */ + +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 99) + +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif + +#include <fuse_lowlevel.h> +#include <fuse_loopdev.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> +#include <fcntl.h> +#include <unistd.h> +#include <assert.h> +#include <pthread.h> +#include <sys/ioctl.h> +#include <sys/stat.h> +#include <linux/fs.h> +#include <linux/stat.h> +#include <linux/falloc.h> +#define USE_SINGLE_FILE_LL_API +#include "single_file.h" + +#define max(x, y) ((x) > (y) ? 
(x) : (y)) + +#define MAP_HOLE (0) +#define MAP_DELALLOC (1) + +static uint16_t read_state_to_iomap_type(uint32_t map) +{ + switch (map) { + case MAP_HOLE: + return FUSE_IOMAP_TYPE_HOLE; + case MAP_DELALLOC: + return FUSE_IOMAP_TYPE_DELALLOC; + } + + return FUSE_IOMAP_TYPE_MAPPED; +} + +static uint16_t write_state_to_iomap_type(uint32_t map) +{ + switch (map) { + case MAP_HOLE: + return FUSE_IOMAP_TYPE_HOLE; + case MAP_DELALLOC: + return FUSE_IOMAP_TYPE_DELALLOC; + } + + return FUSE_IOMAP_TYPE_UNWRITTEN; +} + +struct ioow_ll { + struct fuse_session *se; + char *device; + + uint32_t *read_map; + uint32_t *write_map; + uint32_t max_file_blocks; + + uint8_t *freespace; + uint64_t bdev_blocks; + + /* really booleans */ + int debug; + + int dev_index; +}; + +static struct ioow_ll ll = { }; + +static uint64_t ioow_bytes_allocated(void) +{ + uint32_t *p, *q; + uint64_t i; + uint64_t blocks = 0; + + for (i = 0, p = ll.read_map; i < ll.max_file_blocks; i++, p++) + if (*p != MAP_HOLE) + blocks++; + + for (i = 0, q = ll.write_map; i < ll.max_file_blocks; i++, q++) + if (*q != MAP_HOLE) + blocks++; + + return fsb_to_b(blocks); +} + +static uint64_t ioow_ll_calc_stblocks(void) +{ + return howmany(ioow_bytes_allocated(), 512); +} + +static uint64_t ioow_ll_calc_blocks(void) +{ + return ll.bdev_blocks; +} + +static uint64_t ioow_ll_calc_bfree(void) +{ + return ll.bdev_blocks - b_to_fsb(ioow_bytes_allocated()); +} + +static void ioow_ll_init(void *userdata, struct fuse_conn_info *conn) +{ + (void)userdata; + + if (fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) + single_file.uses_iomap = true; + conn->time_gran = 1; +} + +static int ioow_iomap_config_devices(void) +{ + int blocksize = single_file.blocksize; + int ret; + + ll.dev_index = fuse_lowlevel_iomap_device_add(ll.se, single_file.backing_fd, 0); + if (ll.dev_index < 0) { + ret = -ll.dev_index; + printf("%s: cannot register iomap dev fd=%d: %s\n", + ll.device, single_file.backing_fd, strerror(ret)); + return ret; + } + + ret = 
ioctl(single_file.backing_fd, BLKBSZSET, &blocksize); + if (ret) { + printf("%s: cannot set block size %u: %s\n", + ll.device, single_file.blocksize, strerror(errno)); + return errno; + } + + return 0; +} + +static void ioow_ll_iomap_config(fuse_req_t req, + const struct fuse_iomap_config_params *p, + size_t psize) +{ + struct fuse_iomap_config cfg = { }; + int ret; + + (void)p; + (void)psize; + + cfg.flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE; + cfg.s_blocksize = single_file.blocksize; + + cfg.flags |= FUSE_IOMAP_CONFIG_MAXBYTES; + cfg.s_maxbytes = fsb_to_b(ll.max_file_blocks); + + ret = ioow_iomap_config_devices(); + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_iomap_config(req, &cfg); +} + +static ssize_t adjust_endoff(uint64_t fileoff, uint64_t *endoff) +{ + if (fileoff >= ll.max_file_blocks) + return -EIO; + if (*endoff > ll.max_file_blocks) + *endoff = ll.max_file_blocks; + return 0; +} + +static ssize_t ioow_begin_report(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read, + struct fuse_file_iomap *write) +{ + uint64_t fileoff, endoff; + uint32_t *p, *q; + uint32_t orig_p, orig_q; + ssize_t ret; + + (void)opflags; + + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + count); + + ret = adjust_endoff(fileoff, &endoff); + if (ret) + return ret; + + if (ll.debug) + fprintf(stderr, +"%s: pos 0x%llx count 0x%llx fileoff 0x%llx endoff 0x%llx new_endoff 0x%llx\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count, + (unsigned long long)fileoff, + (unsigned long long)b_to_fsb(pos + count), + (unsigned long long)endoff); + + read->offset = fsb_to_b(fileoff); + write->offset = fsb_to_b(fileoff); + read->length = 0; + write->length = 0; + + orig_p = ll.read_map[fileoff]; + orig_q = ll.write_map[fileoff]; + read->type = read_state_to_iomap_type(orig_p); + write->type = write_state_to_iomap_type(orig_q); + + switch (read->type) { + case FUSE_IOMAP_TYPE_MAPPED: + case FUSE_IOMAP_TYPE_UNWRITTEN: + read->dev = ll.dev_index; + 
read->addr = fsb_to_b(orig_p); + break; + case FUSE_IOMAP_TYPE_DELALLOC: + case FUSE_IOMAP_TYPE_HOLE: + read->dev = FUSE_IOMAP_DEV_NULL; + read->addr = FUSE_IOMAP_NULL_ADDR; + break; + default: + return -EIO; + } + + switch (write->type) { + case FUSE_IOMAP_TYPE_MAPPED: + case FUSE_IOMAP_TYPE_UNWRITTEN: + write->dev = ll.dev_index; + write->addr = fsb_to_b(orig_q); + break; + case FUSE_IOMAP_TYPE_DELALLOC: + case FUSE_IOMAP_TYPE_HOLE: + write->dev = FUSE_IOMAP_DEV_NULL; + write->addr = FUSE_IOMAP_NULL_ADDR; + break; + default: + return -EIO; + } + + for (p = ll.read_map + fileoff, q = ll.write_map + fileoff; + fileoff < endoff; + p++, q++, fileoff++) { + if (read->type != read_state_to_iomap_type(*p)) + break; + if (write->type != write_state_to_iomap_type(*q)) + break; + if (read->type == FUSE_IOMAP_TYPE_MAPPED && + read->addr + read->length != fsb_to_b(*p)) + break; + if (write->type == FUSE_IOMAP_TYPE_MAPPED && + write->addr + write->length != fsb_to_b(*q)) + break; + + read->length += single_file.blocksize; + write->length += single_file.blocksize; + } + + return 0; +} + +static ssize_t ioow_begin_read(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read, + struct fuse_file_iomap *write) +{ + return ioow_begin_report(pos, count, opflags, read, write); +} + +struct free_extent { + uint32_t startblock; + uint32_t blockcount; +}; + +#define FIND_FREE_INIT_CURSOR (MAP_DELALLOC + 1) + +static uint32_t ioow_find_free(uint32_t *cursor) +{ + bool wrapped = false; + + while (!wrapped) { + if (*cursor >= ll.bdev_blocks || *cursor <= MAP_DELALLOC) + *cursor = FIND_FREE_INIT_CURSOR; + + for (; *cursor < ll.bdev_blocks; (*cursor)++) { + if (ll.freespace[*cursor]) { + ll.freespace[*cursor] = 0; + (*cursor)++; + return *cursor - 1; + } + } + + wrapped = true; + } + + return MAP_HOLE; +} + +static void ioow_free_block(uint32_t *block) +{ + if (*block > MAP_DELALLOC) + ll.freespace[*block] = 1; + *block = MAP_HOLE; +} + +static void 
ioow_remap_block(uint32_t *p, uint32_t *q) +{ + ioow_free_block(p); + *p = *q; + *q = MAP_HOLE; +} + +static ssize_t ioow_write_allocate(off_t pos, uint64_t count, uint32_t opflags) +{ + uint64_t fileoff, endoff; + uint32_t *q; + const bool direct = opflags & (FUSE_IOMAP_OP_DIRECT | + FUSE_IOMAP_OP_WRITEBACK); + uint32_t cursor = FIND_FREE_INIT_CURSOR; + ssize_t ret; + + if (opflags & FUSE_IOMAP_OP_ZERO) + return 0; + + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + count); + + ret = adjust_endoff(fileoff, &endoff); + if (ret) + return ret; + + if (ll.debug) + fprintf(stderr, "%s: set %s pos 0x%llx count 0x%llx\n", + __func__, + direct ? "unwritten" : "delalloc", + (unsigned long long)pos, + (unsigned long long)count); + + for (q = ll.write_map + fileoff; fileoff < endoff; q++, fileoff++) { + if (!direct && *q == MAP_HOLE) { + fprintf(stderr, "%s: fileoff %lu delalloc\n", __func__, fileoff); + *q = MAP_DELALLOC; + continue; + } + + if (direct && (*q == MAP_HOLE || *q == MAP_DELALLOC)) { + uint32_t free_block = ioow_find_free(&cursor); + + if (free_block == MAP_HOLE) + return -ENOSPC; + + fprintf(stderr, "%s: fileoff %lu unwritten %u\n", + __func__, fileoff, free_block); + *q = free_block; + } + } + + single_file.isize = max(single_file.isize, pos + count); + return 0; +} + +static ssize_t ioow_begin_write(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read, + struct fuse_file_iomap *write) +{ + ssize_t ret; + + ret = ioow_write_allocate(pos, count, opflags); + if (ret) + return ret; + + return ioow_begin_read(pos, count, opflags, read, write); +} + +static void ioow_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, + uint64_t dontcare, off_t pos, uint64_t count, + uint32_t opflags) +{ + struct fuse_file_iomap read, write; + ssize_t got; + int ret = 0; + + (void)dontcare; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx opflags 0x%x\n", + __func__, + 
(unsigned long long)pos, + (unsigned long long)count, + opflags); + + if (!single_file.allow_dio && (opflags & FUSE_IOMAP_OP_DIRECT)) { + ret = ENOSYS; + goto out_reply; + } + + memset(&read, 0, sizeof(read)); + memset(&write, 0, sizeof(write)); + + pthread_mutex_lock(&single_file.lock); + if (opflags & FUSE_IOMAP_OP_REPORT) + got = ioow_begin_report(pos, count, opflags, &read, &write); + else if (fuse_iomap_is_write(opflags)) + got = ioow_begin_write(pos, count, opflags, &read, &write); + else + got = ioow_begin_read(pos, count, opflags, &read, &write); + if (got < 0) { + ret = -got; + goto out_unlock; + } + + if (ll.debug) { + fprintf(stderr, +"%s: read offset 0x%llx length 0x%llx type %u dev %u addr 0x%llx flags 0x%x\n", + __func__, + (unsigned long long)read.offset, + (unsigned long long)read.length, + read.type, + read.dev, + (unsigned long long)read.addr, + read.flags); + fprintf(stderr, +"%s: write offset 0x%llx length 0x%llx type %u dev %u addr 0x%llx flags 0x%x\n", + __func__, + (unsigned long long)write.offset, + (unsigned long long)write.length, + write.type, + write.dev, + (unsigned long long)write.addr, + write.flags); + } + + /* Not filling even the first byte will make the kernel unhappy. 
*/ + if (read.offset > pos || read.offset + read.length <= pos) { + printf("%s: made bad read mapping at pos %llu\n", ll.device, + (unsigned long long)pos); + ret = EIO; + goto out_unlock; + } + if (write.offset > pos || write.offset + write.length <= pos) { + printf("%s: made bad write mapping at pos %llu\n", ll.device, + (unsigned long long)pos); + ret = EIO; + goto out_unlock; + } + +out_unlock: + pthread_mutex_unlock(&single_file.lock); +out_reply: + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_iomap_begin(req, &read, &write); +} + +static void ioow_ll_iomap_end(fuse_req_t req, fuse_ino_t ino, + uint64_t dontcare, off_t pos, uint64_t count, + uint32_t opflags, ssize_t written, + const struct fuse_file_iomap *iomap) +{ + uint64_t fileoff, endoff; + uint32_t *q; + ssize_t got; + int ret = 0; + + (void)dontcare; + (void)iomap; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx opflags 0x%x written %zd\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count, + opflags, + written); + + if (written >= 0) + goto out_reply; + + /* punch delalloc mappings due to error */ + pthread_mutex_lock(&single_file.lock); + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + count); + + got = adjust_endoff(fileoff, &endoff); + if (got < 0) { + ret = -got; + goto out_unlock; + } + + for (q = ll.write_map + fileoff; fileoff < endoff; q++, fileoff++) { + if (*q == MAP_DELALLOC) + ioow_free_block(q); + } + +out_unlock: + pthread_mutex_unlock(&single_file.lock); +out_reply: + fuse_reply_err(req, ret); +} + +static void ioow_ll_iomap_ioend(fuse_req_t req, fuse_ino_t ino, + uint64_t dontcare, off_t pos, size_t written, + uint32_t ioendflags, int error, uint32_t dev, + uint64_t new_addr) +{ + uint64_t fileoff, endoff; + uint32_t *p, *q; + ssize_t got; + int ret = 0; + + (void)dontcare; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + 
fprintf(stderr, +"%s: pos 0x%llx written 0x%zx ioendflags 0x%x error %d dev %u new_addr 0x%llx\n", + __func__, + (unsigned long long)pos, + written, + ioendflags, + error, + dev, + (unsigned long long)new_addr); + + if (error) { + ret = error; + goto out_reply; + } + + pthread_mutex_lock(&single_file.lock); + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + written); + + got = adjust_endoff(fileoff, &endoff); + if (got < 0) { + ret = -got; + goto out_unlock; + } + + for (p = ll.read_map + fileoff, q = ll.write_map + fileoff; + fileoff < endoff; p++, q++, fileoff++) { + fprintf(stderr, "%s: fileoff %lu read %u write %u\n", __func__, fileoff, *p, *q); + ioow_remap_block(p, q); + } + + single_file.isize = max(single_file.isize, pos + written); + +out_unlock: + pthread_mutex_unlock(&single_file.lock); +out_reply: + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_iomap_ioend(req, single_file.isize); +} + +static void ioow_ll_fallocate(fuse_req_t req, fuse_ino_t ino, int mode, + off_t pos, off_t count, + struct fuse_file_info *fp) +{ + uint64_t fileoff, endoff; + uint32_t *p, *q; + ssize_t got; + int ret = 0; + + (void)fp; + + if (!is_single_file_ino(ino)) { + ret = EIO; + goto out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx mode 0x%x\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count, + mode); + + if (!(mode & (FALLOC_FL_ZERO_RANGE | FALLOC_FL_PUNCH_HOLE))) { + ret = EOPNOTSUPP; + goto out_reply; + } + + pthread_mutex_lock(&single_file.lock); + fileoff = b_to_fsbt(pos); + endoff = b_to_fsb(pos + count); + + got = adjust_endoff(fileoff, &endoff); + if (got < 0) { + ret = -got; + goto out_unlock; + } + + for (p = ll.read_map + fileoff, q = ll.write_map + fileoff; + fileoff < endoff; + p++, q++, fileoff++) { + ioow_free_block(p); + ioow_free_block(q); + } + + if (!(mode & FALLOC_FL_KEEP_SIZE)) + single_file.isize = max(single_file.isize, pos + count); + +out_unlock: + pthread_mutex_unlock(&single_file.lock); 
+out_reply: + fuse_reply_err(req, ret); +} + +static const struct fuse_lowlevel_ops ioow_ll_oper = { + .lookup = single_file_ll_lookup, + .getattr = single_file_ll_getattr, + .setattr = single_file_ll_setattr, + .readdir = single_file_ll_readdir, + .open = single_file_ll_open, + .statfs = single_file_ll_statfs, + .statx = single_file_ll_statx, + .fsync = single_file_ll_fsync, + + .init = ioow_ll_init, + .iomap_config = ioow_ll_iomap_config, + .iomap_begin = ioow_ll_iomap_begin, + .iomap_end = ioow_ll_iomap_end, + .iomap_ioend = ioow_ll_iomap_ioend, + .fallocate = ioow_ll_fallocate, +}; + +#define IOOW_LL_OPT(t, p, v) { t, offsetof(struct ioow_ll, p), v } + +static struct fuse_opt ioow_ll_opts[] = { + IOOW_LL_OPT("debug", debug, 1), + SINGLE_FILE_OPT_KEYS, + FUSE_OPT_END +}; + +static int ioow_ll_opt_proc(void *data, const char *arg, + int key, struct fuse_args *outargs) +{ + if (single_file_opt_proc(data, arg, key, outargs) == 0) + return 0; + + switch (key) { + case FUSE_OPT_KEY_NONOPT: + if (!ll.device) { + ll.device = strdup(arg); + return 0; + } + return 1; + } + + return 1; +} + +static const struct single_file_ops fops = { + .calc_st_blocks = ioow_ll_calc_stblocks, + .calc_blocks = ioow_ll_calc_blocks, + .calc_bfree = ioow_ll_calc_bfree, +}; + +int main(int argc, char *argv[]) +{ + struct fuse_args args = FUSE_ARGS_INIT(argc, argv); + struct fuse_cmdline_opts opts = { }; + int ret = 1; + + if (fuse_opt_parse(&args, &ll, ioow_ll_opts, ioow_ll_opt_proc)) + goto err_args; + + if (fuse_parse_cmdline(&args, &opts)) + goto err_args; + + if (opts.show_help) { + printf("usage: %s [options] <device> <mountpoint>\n\n", argv[0]); + fuse_cmdline_help(); + fuse_lowlevel_help(); + ret = 0; + goto err_strings; + } else if (opts.show_version) { + printf("FUSE library version %s\n", fuse_pkgversion()); + fuse_lowlevel_version(); + ret = 0; + goto err_strings; + } + + if (!opts.mountpoint || !ll.device) { + printf("usage: %s [options] <device> <mountpoint>\n", argv[0]); + 
printf(" %s --help\n", argv[0]);
+		goto err_strings;
+	}
+
+	if (single_file_open(ll.device))
+		goto err_strings;
+
+	if (single_file_configure(ll.device, S_IFBLK, "outtaplace", &fops))
+		goto err_singlefile;
+
+	ll.bdev_blocks = single_file.isize / single_file.blocksize;
+	if (ll.bdev_blocks < 3) {
+		fprintf(stderr, "%s: block device must be at least %llu bytes long\n",
+			ll.device, (unsigned long long)single_file.blocksize * 3);
+		goto err_singlefile;
+	}
+	single_file.isize = 0;
+
+	ll.freespace = malloc(ll.bdev_blocks);
+	if (!ll.freespace) {
+		perror("freespace");
+		goto err_singlefile;
+	}
+	memset(ll.freespace, 1, ll.bdev_blocks);
+
+	/* block 0 means hole, block 1 means delalloc, and 1 spare */
+	ll.max_file_blocks = ll.bdev_blocks - 3;
+	ll.read_map = calloc(ll.max_file_blocks, sizeof(uint32_t));
+	if (!ll.read_map) {
+		perror("read map");
+		goto err_freespace;
+	}
+
+	ll.write_map = calloc(ll.max_file_blocks, sizeof(uint32_t));
+	if (!ll.write_map) {
+		perror("write map");
+		goto err_readmap;
+	}
+
+	ll.se = fuse_session_new(&args, &ioow_ll_oper, sizeof(ioow_ll_oper),
+				 NULL);
+	if (ll.se == NULL)
+		goto err_writemap;
+
+	if (fuse_set_signal_handlers(ll.se))
+		goto err_session;
+
+	if (fuse_session_mount(ll.se, opts.mountpoint))
+		goto err_signals;
+
+	fuse_daemonize(opts.foreground);
+
+	/* Block until ctrl+c or fusermount -u */
+	if (opts.singlethread) {
+		ret = fuse_session_loop(ll.se);
+	} else {
+		struct fuse_loop_config *config = fuse_loop_cfg_create();
+
+		if (!config) {
+			ret = 1;
+			goto err_mount;
+		}
+
+		fuse_loop_cfg_set_clone_fd(config, opts.clone_fd);
+		fuse_loop_cfg_set_max_threads(config, opts.max_threads);
+		ret = fuse_session_loop_mt(ll.se, config);
+		fuse_loop_cfg_destroy(config);
+	}
+
+err_mount:
+	fuse_session_unmount(ll.se);
+err_signals:
+	fuse_remove_signal_handlers(ll.se);
+err_session:
+	fuse_session_destroy(ll.se);
+err_writemap:
+	free(ll.write_map);
+err_readmap:
+	free(ll.read_map);
+err_freespace:
+	free(ll.freespace);
+err_singlefile: + single_file_close(); +err_strings: + free(opts.mountpoint); + free(ll.device); +err_args: + fuse_opt_free_args(&args); + return ret ? 1 : 0; +} diff --git a/example/meson.build b/example/meson.build index 24fdcb23cb58f2..6967a2ee370d4b 100644 --- a/example/meson.build +++ b/example/meson.build @@ -31,7 +31,7 @@ if platform.endswith('linux') output: 'service_hl.socket', configuration: private_cfg) - single_file_examples += [ 'iomap_ll', 'iomap_inline_ll' ] + single_file_examples += [ 'iomap_ll', 'iomap_inline_ll', 'iomap_ow_ll' ] endif threaded_examples = [ 'notify_inval_inode', diff --git a/example/single_file.c b/example/single_file.c index 1d66157a8c637c..c4ac998a2c978b 100644 --- a/example/single_file.c +++ b/example/single_file.c @@ -382,7 +382,10 @@ static void single_file_statfs(struct statvfs *buf) buf->f_bsize = single_file.blocksize; buf->f_frsize = 0; - buf->f_blocks = single_file.blocks; + if (single_file_ops && single_file_ops->calc_blocks) + buf->f_blocks = single_file_ops->calc_blocks(); + else + buf->f_blocks = single_file.blocks; if (single_file_ops && single_file_ops->calc_bfree) buf->f_bfree = single_file_ops->calc_bfree(); else ^ permalink raw reply related [flat|nested] 191+ messages in thread
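The out-of-place write server in the patch above keeps a read map and a write map per file block, and remaps blocks when the kernel reports write completion through iomap_ioend. The following is only a toy model of that bookkeeping, written under stated assumptions: every `sk_`/`SK_` name is hypothetical, the body of the patch's own ioow_remap_block() is not visible in this part of the thread, and the real server may well differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; they only echo the patch's vocabulary. */
#define SK_MAP_HOLE	0	/* slot maps nothing */
#define SK_MAP_DELALLOC	1	/* space promised but not yet allocated */
#define SK_NR_BLOCKS	8	/* tiny pretend block device */
#define SK_NR_SLOTS	4	/* file blocks tracked by each map */

struct sk_state {
	uint8_t freespace[SK_NR_BLOCKS];	/* 1 = free, 0 = in use */
	uint32_t read_map[SK_NR_SLOTS];		/* where reads find data */
	uint32_t write_map[SK_NR_SLOTS];	/* where new writes are staged */
};

static void sk_init(struct sk_state *s)
{
	memset(s, 0, sizeof(*s));
	/* Assumption: blocks 0 and 1 are never handed out, because their
	 * numbers double as the HOLE and DELALLOC markers. */
	for (uint32_t b = SK_MAP_DELALLOC + 1; b < SK_NR_BLOCKS; b++)
		s->freespace[b] = 1;
}

/* Return a free device block, or SK_MAP_HOLE if the device is full. */
static uint32_t sk_alloc_block(struct sk_state *s)
{
	for (uint32_t b = SK_MAP_DELALLOC + 1; b < SK_NR_BLOCKS; b++) {
		if (s->freespace[b]) {
			s->freespace[b] = 0;
			return b;
		}
	}
	return SK_MAP_HOLE;
}

/* Release whatever a map slot points at and turn the slot into a hole. */
static void sk_free_block(struct sk_state *s, uint32_t *slot)
{
	if (*slot > SK_MAP_DELALLOC)
		s->freespace[*slot] = 1;
	*slot = SK_MAP_HOLE;
}

/* Stage a write: pick a fresh block instead of overwriting in place. */
static int sk_stage_write(struct sk_state *s, uint32_t fileoff)
{
	uint32_t b;

	if (s->write_map[fileoff] > SK_MAP_DELALLOC)
		return 0;	/* already staged */
	b = sk_alloc_block(s);
	if (b == SK_MAP_HOLE)
		return -1;	/* out of space */
	s->write_map[fileoff] = b;
	return 0;
}

/* Write completed: the staged block becomes the block that reads see. */
static void sk_remap_block(struct sk_state *s, uint32_t fileoff)
{
	if (s->write_map[fileoff] <= SK_MAP_DELALLOC)
		return;		/* nothing was staged */
	sk_free_block(s, &s->read_map[fileoff]);
	s->read_map[fileoff] = s->write_map[fileoff];
	s->write_map[fileoff] = SK_MAP_HOLE;
}
```

In this model an overwrite of file block 0 is a call to sk_stage_write() followed, once the IO is done, by sk_remap_block(); the old block returns to the free pool only after the new one is readable.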
* [PATCH 6/7] example/iomap_ow_ll: implement atomic writes
2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong
` (4 preceding siblings ...)
2026-04-29 14:48 ` [PATCH 5/7] example/iomap_ow_ll: create a simple iomap out of place write server Darrick J. Wong
@ 2026-04-29 14:48 ` Darrick J. Wong
2026-04-29 14:48 ` [PATCH 7/7] example/iomap_service_ll: create a sample systemd service fuse server Darrick J. Wong
6 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:48 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, john, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

If the block device supports atomic writes, we will too.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 example/iomap_ow_ll.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/example/iomap_ow_ll.c b/example/iomap_ow_ll.c
index c4edda1f9bfe40..926c25133a3a84 100644
--- a/example/iomap_ow_ll.c
+++ b/example/iomap_ow_ll.c
@@ -150,6 +150,8 @@ static int ioow_iomap_config_devices(void)
 		return ret;
 	}
 
+	single_file_configure_atomic_write();
+
 	ret = ioctl(single_file.backing_fd, BLKBSZSET, &blocksize);
 	if (ret) {
 		printf("%s: cannot set block size %u: %s\n",
@@ -435,6 +437,9 @@ static void ioow_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino,
 		goto out_unlock;
 	}
 
+	if (opflags & FUSE_IOMAP_OP_ATOMIC)
+		read.flags |= FUSE_IOMAP_F_ATOMIC_BIO;
+
 	if (ll.debug) {
 		fprintf(stderr,
 "%s: read offset 0x%llx length 0x%llx type %u dev %u addr 0x%llx flags 0x%x\n",
@@ -700,6 +705,8 @@ int main(int argc, char *argv[])
 	struct fuse_cmdline_opts opts = { };
 	int ret = 1;
 
+	single_file.fixed_size = false;
+
 	if (fuse_opt_parse(&args, &ll, ioow_ll_opts, ioow_ll_opt_proc))
 		goto err_args;
 

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 7/7] example/iomap_service_ll: create a sample systemd service fuse server 2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong ` (5 preceding siblings ...) 2026-04-29 14:48 ` [PATCH 6/7] example/iomap_ow_ll: implement atomic writes Darrick J. Wong @ 2026-04-29 14:48 ` Darrick J. Wong 6 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:48 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, john, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Create a simple fuse server that can be run as a systemd service. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/single_file.h | 3 example/iomap_service_ll.c | 377 ++++++++++++++++++++++++++++++++++++ example/iomap_service_ll.socket.in | 15 + example/iomap_service_ll@.service | 102 ++++++++++ example/meson.build | 7 + example/service_hl.c | 2 example/service_ll.c | 2 example/single_file.c | 8 - 8 files changed, 509 insertions(+), 7 deletions(-) create mode 100644 example/iomap_service_ll.c create mode 100644 example/iomap_service_ll.socket.in create mode 100644 example/iomap_service_ll@.service diff --git a/example/single_file.h b/example/single_file.h index 3ee9dd261352ba..edf8a424d02ec6 100644 --- a/example/single_file.h +++ b/example/single_file.h @@ -124,7 +124,8 @@ int single_file_opt_proc(void *data, const char *arg, int key, unsigned long long parse_num_blocks(const char *arg, int log_block_size); struct fuse_service; -int single_file_service_open(struct fuse_service *sf, const char *path); +int single_file_service_open(struct fuse_service *sf, const char *path, + unsigned int request_flags); int single_file_open(const char *path); void single_file_check_read(off_t pos, size_t *count); diff --git a/example/iomap_service_ll.c b/example/iomap_service_ll.c new file mode 100644 index 00000000000000..c7b9e9c064a32d --- /dev/null +++ b/example/iomap_service_ll.c @@ -0,0 +1,377 @@ +/* + * 
FUSE: Filesystem in Userspace + * Copyright (C) 2026 Oracle. + * + * This program can be distributed under the terms of the GNU GPLv2. + * See the file GPL2.txt. + */ + +/** @file + * + * minimal example iomap filesystem using low-level API and systemd service api + * + * Compile with: + * + * gcc -Wall single_file.c iomap_service_ll.c `pkg-config fuse3 --cflags --libs` \ + * -o iomap_service_ll + * + * Note: If the pkg-config command fails due to the absence of the fuse3.pc + * file, you should configure the path to the fuse3.pc file in the + * PKG_CONFIG_PATH variable. + * + * Change the ExecStart line in iomap_service_ll@.service: + * + * ExecStart=/path/to/iomap_service_ll + * + * to point to the actual path of the iomap_service_ll binary. + * + * Finally, install the iomap_service_ll@.service and iomap_service_ll.socket files to the + * systemd service directory, usually /run/systemd/system. + * + * ## Source code ## + * \include iomap_service_ll.c + * \include iomap_service_ll.socket + * \include iomap_service_ll@.service + * \include single_file.c + * \include single_file.h + */ + +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 99) + +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif + +#include <fuse_lowlevel.h> +#include <fuse_service.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> +#include <fcntl.h> +#include <unistd.h> +#include <assert.h> +#include <pthread.h> +#include <sys/ioctl.h> +#include <sys/stat.h> +#include <linux/fs.h> +#include <linux/stat.h> +#define USE_SINGLE_FILE_LL_API +#include "single_file.h" + +struct ioservice_ll { + struct fuse_session *se; + char *device; + struct fuse_service *service; + + /* really booleans */ + int debug; + + int dev_index; +}; + +static struct ioservice_ll ll = { }; + +static void ioservice_ll_init(void *userdata, struct fuse_conn_info *conn) +{ + (void)userdata; + + if (fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) + single_file.uses_iomap = true; + conn->time_gran = 1; +} + 
+static int ioservice_iomap_config_devices(void) +{ + int ret; + + ll.dev_index = fuse_lowlevel_iomap_device_add(ll.se, single_file.backing_fd, 0); + if (ll.dev_index < 0) { + ret = -ll.dev_index; + printf("%s: cannot register iomap dev fd=%d: %s\n", + ll.device, single_file.backing_fd, strerror(ret)); + return ret; + } + + ret = fuse_lowlevel_iomap_set_blocksize(ll.se, ll.dev_index, + single_file.blocksize); + if (ret) { + printf("%s: cannot set block size %u: %s\n", + ll.device, single_file.blocksize, strerror(errno)); + return errno; + } + + return 0; +} + +static void ioservice_ll_iomap_config(fuse_req_t req, + const struct fuse_iomap_config_params *p, + size_t psize) +{ + struct fuse_iomap_config cfg = { }; + int ret; + + (void)p; + (void)psize; + + cfg.flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE; + cfg.s_blocksize = single_file.blocksize; + + cfg.flags |= FUSE_IOMAP_CONFIG_MAXBYTES; + cfg.s_maxbytes = single_file.isize; + + ret = ioservice_iomap_config_devices(); + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_iomap_config(req, &cfg); +} + +static int ioservice_begin_report(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + (void)pos; + (void)count; + (void)opflags; + + read->offset = 0; + read->length = fsb_to_b(single_file.blocks); + read->addr = 0; + read->dev = ll.dev_index; + read->type = FUSE_IOMAP_TYPE_MAPPED; + + return 0; +} + +static int ioservice_begin_read(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + return ioservice_begin_report(pos, count, opflags, read); +} + +static int ioservice_begin_write(off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + return ioservice_begin_read(pos, count, opflags, read); +} + +static void ioservice_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, + uint64_t dontcare, off_t pos, + uint64_t count, uint32_t opflags) +{ + struct fuse_file_iomap read; + int ret; + + (void)dontcare; + + if (ino != 2) { + ret = EIO; + goto 
out_reply; + } + + if (ll.debug) + fprintf(stderr, "%s: pos 0x%llx count 0x%llx opflags 0x%x\n", + __func__, + (unsigned long long)pos, + (unsigned long long)count, + opflags); + + if (!single_file.allow_dio && (opflags & FUSE_IOMAP_OP_DIRECT)) { + ret = ENOSYS; + goto out_reply; + } + + memset(&read, 0, sizeof(read)); + + pthread_mutex_lock(&single_file.lock); + if (opflags & FUSE_IOMAP_OP_REPORT) + ret = ioservice_begin_report(pos, count, opflags, &read); + else if (fuse_iomap_is_write(opflags)) + ret = ioservice_begin_write(pos, count, opflags, &read); + else + ret = ioservice_begin_read(pos, count, opflags, &read); + if (ret) + goto out_unlock; + + if (ll.debug) + fprintf(stderr, +"%s: offset 0x%llx length 0x%llx type %u dev %u addr 0x%llx flags 0x%x\n", + __func__, + (unsigned long long)read.offset, + (unsigned long long)read.length, + read.type, + read.dev, + (unsigned long long)read.addr, + read.flags); + + /* Not filling even the first byte will make the kernel unhappy. */ + if (read.offset > pos || read.offset + read.length <= pos) { + printf("%s: made bad mapping at pos %llu\n", ll.device, + (unsigned long long)pos); + ret = EIO; + goto out_unlock; + } + +out_unlock: + pthread_mutex_unlock(&single_file.lock); +out_reply: + if (ret) + fuse_reply_err(req, ret); + else + fuse_reply_iomap_begin(req, &read, NULL); +} + +static const struct fuse_lowlevel_ops ioservice_ll_oper = { + .lookup = single_file_ll_lookup, + .getattr = single_file_ll_getattr, + .setattr = single_file_ll_setattr, + .readdir = single_file_ll_readdir, + .open = single_file_ll_open, + .statfs = single_file_ll_statfs, + .statx = single_file_ll_statx, + .fsync = single_file_ll_fsync, + + .init = ioservice_ll_init, + .iomap_config = ioservice_ll_iomap_config, + .iomap_begin = ioservice_ll_iomap_begin, +}; + +#define IOMAP_SERVICE_LL_OPT(t, p, v) { t, offsetof(struct ioservice_ll, p), v } + +static struct fuse_opt ioservice_ll_opts[] = { + IOMAP_SERVICE_LL_OPT("debug", debug, 1), + 
SINGLE_FILE_OPT_KEYS, + FUSE_OPT_END +}; + +static int ioservice_ll_opt_proc(void *data, const char *arg, int key, + struct fuse_args *outargs) +{ + if (single_file_opt_proc(data, arg, key, outargs) == 0) + return 0; + + switch (key) { + case FUSE_OPT_KEY_NONOPT: + if (!ll.device) { + ll.device = strdup(arg); + return 0; + } + return 1; + } + + return 1; +} + +int main(int argc, char *argv[]) +{ + struct fuse_args args = FUSE_ARGS_INIT(argc, argv); + struct fuse_cmdline_opts opts = { }; + struct fuse_loop_config *config = NULL; + int error; + int ret = 1; + + if (fuse_service_accept(&ll.service)) + goto err_args; + + if (!fuse_service_accepted(ll.service)) + goto err_args; + + if (fuse_service_append_args(ll.service, &args)) + goto err_service; + + if (fuse_opt_parse(&args, &ll, ioservice_ll_opts, + ioservice_ll_opt_proc) != 0) + goto err_service; + + if (fuse_service_parse_cmdline_opts(&args, &opts)) + goto err_service; + + if (opts.show_help) { + printf("usage: %s [options] <device> <mountpoint>\n\n", argv[0]); + fuse_cmdline_help(); + fuse_lowlevel_help(); + ret = 0; + goto err_service; + } else if (opts.show_version) { + printf("FUSE library version %s\n", fuse_pkgversion()); + fuse_lowlevel_version(); + ret = 0; + goto err_service; + } + + if (!opts.mountpoint || !ll.device) { + printf("usage: %s [options] <device> <mountpoint>\n", argv[0]); + printf(" %s --help\n", argv[0]); + goto err_service; + } + + if (single_file_service_open(ll.service, ll.device, + FUSE_SERVICE_REQUEST_FILE_TRYLOOP)) + goto err_service; + + if (fuse_service_finish_file_requests(ll.service)) + goto err_singlefile; + + if (fuse_service_configure_iomap(ll.service, true, &error)) + goto err_singlefile; + if (error) { + fprintf(stderr, "%s: mount helper could not enable iomap: %s\n", + ll.device, strerror(error)); + goto err_singlefile; + } + + if (single_file_configure(ll.device, S_IFBLK, "svc_bdev", NULL)) + goto err_singlefile; + + ll.se = fuse_session_new(&args, &ioservice_ll_oper, + 
sizeof(ioservice_ll_oper), NULL); + if (ll.se == NULL) + goto err_singlefile; + + if (!opts.singlethread) { + config = fuse_loop_cfg_create(); + if (!config) { + ret = 1; + goto err_session; + } + } + + if (fuse_set_signal_handlers(ll.se)) + goto err_loopcfg; + + if (fuse_service_session_mount(ll.service, ll.se, S_IFDIR, &opts)) + goto err_signals; + + /* Block until ctrl+c or fusermount -u */ + if (opts.singlethread) { + fuse_service_send_goodbye(ll.service, 0); + fuse_service_release(ll.service); + + ret = fuse_session_loop(ll.se); + } else { + fuse_loop_cfg_set_clone_fd(config, opts.clone_fd); + fuse_loop_cfg_set_max_threads(config, opts.max_threads); + + fuse_service_send_goodbye(ll.service, 0); + fuse_service_release(ll.service); + + ret = fuse_session_loop_mt(ll.se, config); + } + +err_signals: + fuse_remove_signal_handlers(ll.se); +err_loopcfg: + fuse_loop_cfg_destroy(config); +err_session: + fuse_session_destroy(ll.se); +err_singlefile: + single_file_close(); +err_service: + free(opts.mountpoint); + free(ll.device); + fuse_service_send_goodbye(ll.service, ret); + fuse_service_destroy(&ll.service); +err_args: + fuse_opt_free_args(&args); + return fuse_service_exit(ret); +} diff --git a/example/iomap_service_ll.socket.in b/example/iomap_service_ll.socket.in new file mode 100644 index 00000000000000..690064fa493522 --- /dev/null +++ b/example/iomap_service_ll.socket.in @@ -0,0 +1,15 @@ +# SPDX-License-Identifier: GPL-2.0-or-later +# +# Copyright (C) 2026 Oracle. All Rights Reserved. +# Author: Darrick J. 
Wong <djwong@kernel.org> +[Unit] +Description=Socket for iomap_service_ll Service + +[Socket] +ListenSequentialPacket=@FUSE_SERVICE_SOCKET_DIR_RAW@/iomap_service_ll +Accept=yes +SocketMode=@FUSE_SERVICE_SOCKET_PERMS@ +RemoveOnStop=yes + +[Install] +WantedBy=sockets.target diff --git a/example/iomap_service_ll@.service b/example/iomap_service_ll@.service new file mode 100644 index 00000000000000..67d4a21acc89b4 --- /dev/null +++ b/example/iomap_service_ll@.service @@ -0,0 +1,102 @@ +# SPDX-License-Identifier: GPL-2.0-or-later +# +# Copyright (C) 2026 Oracle. All Rights Reserved. +# Author: Darrick J. Wong <djwong@kernel.org> +[Unit] +Description=iomap_service_ll Sample Fuse Service + +# Don't leave failed units behind, systemd does not clean them up! +CollectMode=inactive-or-failed + +[Service] +Type=exec +ExecStart=/path/to/iomap_service_ll + +# Try to capture core dumps +LimitCORE=infinity + +SyslogIdentifier=%N + +# No realtime CPU scheduling +RestrictRealtime=true + +# Don't let us see anything in the regular system, and don't run as root +DynamicUser=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +PrivateDevices=true +PrivateUsers=true + +# No network access +PrivateNetwork=true +ProtectHostname=true +RestrictAddressFamilies=none +IPAddressDeny=any + +# Don't let the program mess with the kernel configuration at all +ProtectKernelLogs=true +ProtectKernelModules=true +ProtectKernelTunables=true +ProtectControlGroups=true +ProtectProc=invisible +RestrictNamespaces=true +RestrictFileSystems= + +# Hide everything in /proc, even /proc/mounts +ProcSubset=pid + +# Only allow the default personality Linux +LockPersonality=true + +# No writable memory pages +MemoryDenyWriteExecute=true + +# Don't let our mounts leak out to the host +PrivateMounts=true + +# Restrict system calls to the native arch and only enough to get things going +SystemCallArchitectures=native +SystemCallFilter=@system-service +SystemCallFilter=~@privileged 
+SystemCallFilter=~@resources + +SystemCallFilter=~@clock +SystemCallFilter=~@cpu-emulation +SystemCallFilter=~@debug +SystemCallFilter=~@module +SystemCallFilter=~@reboot +SystemCallFilter=~@swap + +SystemCallFilter=~@mount + +# libfuse io_uring wants to pin cores and memory +SystemCallFilter=mbind +SystemCallFilter=sched_setaffinity + +# Leave a breadcrumb if we get whacked by the system call filter +SystemCallErrorNumber=EL3RST + +# Log to the kernel dmesg, just like an in-kernel filesystem driver +StandardOutput=append:/dev/ttyprintk +StandardError=append:/dev/ttyprintk + +# Run with no capabilities at all +CapabilityBoundingSet= +AmbientCapabilities= +NoNewPrivileges=true + +# We don't create files +UMask=7777 + +# No access to hardware /dev files at all +ProtectClock=true +DevicePolicy=closed + +# Don't mess with set[ug]id anything. +RestrictSUIDSGID=true + +# Don't let OOM kills of processes in this containment group kill the whole +# service, because we don't want filesystem drivers to go down. 
+OOMPolicy=continue +OOMScoreAdjust=-1000 diff --git a/example/meson.build b/example/meson.build index 6967a2ee370d4b..06af406d0c909a 100644 --- a/example/meson.build +++ b/example/meson.build @@ -31,7 +31,12 @@ if platform.endswith('linux') output: 'service_hl.socket', configuration: private_cfg) - single_file_examples += [ 'iomap_ll', 'iomap_inline_ll', 'iomap_ow_ll' ] + single_file_examples += [ 'iomap_ll', 'iomap_inline_ll', 'iomap_ow_ll', + 'iomap_service_ll'] + + configure_file(input: 'iomap_service_ll.socket.in', + output: 'iomap_service_ll.socket', + configuration: private_cfg) endif threaded_examples = [ 'notify_inval_inode', diff --git a/example/service_hl.c b/example/service_hl.c index db92dbb9b611b3..4964df411b0e23 100644 --- a/example/service_hl.c +++ b/example/service_hl.c @@ -215,7 +215,7 @@ int main(int argc, char *argv[]) goto err_service; } - if (single_file_service_open(hl.service, hl.device)) + if (single_file_service_open(hl.service, hl.device, 0)) goto err_service; if (fuse_service_finish_file_requests(hl.service)) diff --git a/example/service_ll.c b/example/service_ll.c index eb3d77c639f705..e7bed0deb69c11 100644 --- a/example/service_ll.c +++ b/example/service_ll.c @@ -268,7 +268,7 @@ int main(int argc, char *argv[]) goto err_service; } - if (single_file_service_open(ll.service, ll.device)) + if (single_file_service_open(ll.service, ll.device, 0)) goto err_service; if (fuse_service_finish_file_requests(ll.service)) diff --git a/example/single_file.c b/example/single_file.c index c4ac998a2c978b..49c3e857b130f8 100644 --- a/example/single_file.c +++ b/example/single_file.c @@ -823,7 +823,8 @@ int single_file_open(const char *path) return 0; } -int single_file_service_open(struct fuse_service *sf, const char *path) +int single_file_service_open(struct fuse_service *sf, const char *path, + unsigned int request_flags) { int open_flags = single_file.ro ? 
O_RDONLY : O_RDWR; int fd; @@ -832,11 +833,12 @@ int single_file_service_open(struct fuse_service *sf, const char *path) again: if (single_file.require_bdev) ret = fuse_service_request_blockdev(sf, path, - open_flags | O_EXCL, 0, 0, + open_flags | O_EXCL, 0, + request_flags, single_file.blocksize); else ret = fuse_service_request_file(sf, path, open_flags | O_EXCL, - 0, 0); + 0, request_flags); if (ret) return ret; ^ permalink raw reply related [flat|nested] 191+ messages in thread
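The sample servers in this series reject any iomap_begin reply whose mapping does not contain the requested position ("Not filling even the first byte will make the kernel unhappy"). As a minimal sketch of that sanity check, assuming a hypothetical `sk_mapping` struct that stands in for the offset and length fields of struct fuse_file_iomap, the test can be written without the `offset + length` addition so that it cannot wrap around for very large mappings:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset/length pair of a file mapping. */
struct sk_mapping {
	uint64_t offset;	/* file offset of the mapping, bytes */
	uint64_t length;	/* length of the mapping, bytes */
};

/*
 * Return true if the mapping contains the byte at pos.  For values that do
 * not overflow, this is the logical negation of the servers' rejection test
 * "offset > pos || offset + length <= pos".
 */
static bool sk_mapping_covers(const struct sk_mapping *map, uint64_t pos)
{
	return map->offset <= pos && pos - map->offset < map->length;
}
```

A server following this pattern would reply with EIO whenever sk_mapping_covers() is false for the position the kernel asked about.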
* [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (12 preceding siblings ...)
2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong
@ 2026-04-29 14:20 ` Darrick J. Wong
2026-04-29 14:49 ` [PATCH 1/9] libfuse: enable iomap cache management for lowlevel fuse Darrick J. Wong
` (8 more replies)
2026-04-29 14:20 ` [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (5 subsequent siblings)
19 siblings, 9 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:20 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

Hi all,

This series improves performance (and, for some filesystems,
correctness) by adding the ability to cache iomap mappings in the
kernel.  For filesystems that can change mapping states during
pagecache writeback (e.g. unwritten extent conversion), this is
absolutely necessary to deal with races against writes to the
pagecache, because writeback does not take i_rwsem.  For everyone
else, it simply eliminates roundtrips to userspace.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache
---
Commits in this patchset:
 * libfuse: enable iomap cache management for lowlevel fuse
 * libfuse: add upper-level iomap cache management
 * libfuse: allow constraining of iomap mapping cache size
 * libfuse: add upper-level iomap mapping cache constraint code
 * libfuse: enable iomap
 * example/iomap_ll: cache mappings for later
 * example/iomap_inline_ll: cache iomappings in the kernel
 * example/iomap_ow_ll: cache iomappings in the kernel
 * example/iomap_service_ll: cache iomappings in the kernel
---
 example/single_file.h      |    7 +++-
 include/fuse.h             |   31 ++++++++++++++++
 include/fuse_common.h      |   19 +++++++++-
 include/fuse_kernel.h      |   33 ++++++++++++++++-
 include/fuse_lowlevel.h    |   52 ++++++++++++++++++++++++++
 example/iomap_inline_ll.c  |   16 ++++++++
 example/iomap_ll.c         |   16 ++++++++
 example/iomap_ow_ll.c      |   17 +++++++++
 example/iomap_service_ll.c |   16 ++++++++
 example/single_file.c      |    7 ++++
 lib/fuse.c                 |   35 +++++++++++++++++-
 lib/fuse_lowlevel.c        |   88 ++++++++++++++++++++++++++++++++++++++++++--
 lib/fuse_versionscript     |    4 ++
 13 files changed, 332 insertions(+), 9 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/9] libfuse: enable iomap cache management for lowlevel fuse 2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong @ 2026-04-29 14:49 ` Darrick J. Wong 2026-04-29 14:49 ` [PATCH 2/9] libfuse: add upper-level iomap cache management Darrick J. Wong ` (7 subsequent siblings) 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:49 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Add the library methods so that fuse servers can manage an in-kernel iomap cache. This enables better performance on small IOs and is required if the filesystem needs synchronization between pagecache writes and writeback. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse_common.h | 12 +++++++ include/fuse_kernel.h | 27 ++++++++++++++++ include/fuse_lowlevel.h | 52 ++++++++++++++++++++++++++++++++ lib/fuse_lowlevel.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++ lib/fuse_versionscript | 2 + 5 files changed, 170 insertions(+) diff --git a/include/fuse_common.h b/include/fuse_common.h index 7b209b217b310e..76c65f7e79179e 100644 --- a/include/fuse_common.h +++ b/include/fuse_common.h @@ -1163,6 +1163,10 @@ int fuse_convert_to_conn_want_ext(struct fuse_conn_info *conn); /* fuse-specific mapping type indicating that writes use the read mapping */ #define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255) +/* fuse-specific mapping type saying the server has populated the cache */ +#define FUSE_IOMAP_TYPE_RETRY_CACHE (254) +/* do not upsert this mapping */ +#define FUSE_IOMAP_TYPE_NOCACHE (253) #define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */ @@ -1292,6 +1296,14 @@ struct fuse_iomap_config { int64_t s_maxbytes; /* max file size */ }; +/* invalidate to end of file */ +#define FUSE_IOMAP_INVAL_TO_EOF (~0ULL) + +struct fuse_file_range { + uint64_t offset; /* file offset to invalidate, 
bytes */
+	uint64_t length; /* length to invalidate, bytes */
+};
+
 /* ----------------------------------------------------------- *
  * Compatibility stuff					       *
  * ----------------------------------------------------------- */
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 3ed174567dc172..115c8a228e765a 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -251,6 +251,8 @@
  *  - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
  *  - add FUSE_ATTR_{SYNC,IMMUTABLE,APPEND} for VFS enforcement of file
  *    attributes
+ *  - add FUSE_NOTIFY_IOMAP_{UPSERT,INVAL}_MAPPINGS so fuse servers can cache
+ *    file range mappings in the kernel for iomap
  */
 
 #ifndef _LINUX_FUSE_H
@@ -718,6 +720,8 @@ enum fuse_notify_code {
 	FUSE_NOTIFY_INC_EPOCH = 8,
 	FUSE_NOTIFY_PRUNE = 9,
 	FUSE_NOTIFY_IOMAP_DEV_INVAL = 99,
+	FUSE_NOTIFY_IOMAP_UPSERT_MAPPINGS = 100,
+	FUSE_NOTIFY_IOMAP_INVAL_MAPPINGS = 101,
 };
 
 /* The read buffer is required to be at least 8k, but may be much larger */
@@ -1468,4 +1472,27 @@ struct fuse_iomap_dev_inval_out {
 	struct fuse_range range;
 };
 
+struct fuse_iomap_inval_mappings_out {
+	uint64_t nodeid;	/* Inode ID */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+
+	/*
+	 * Range of read and write mappings to invalidate.  Zero length means
+	 * ignore the range; and FUSE_IOMAP_INVAL_TO_EOF can be used for length.
+ */ + struct fuse_range read; + struct fuse_range write; +}; + +struct fuse_iomap_upsert_mappings_out { + uint64_t nodeid; /* Inode ID */ + uint64_t attr_ino; /* matches fuse_attr:ino */ + + /* read file data from here */ + struct fuse_iomap_io read; + + /* write file data to here, if applicable */ + struct fuse_iomap_io write; +}; + #endif /* _LINUX_FUSE_H */ diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h index 67c9bd4b2c6cee..3e9ceae9aa4aa4 100644 --- a/include/fuse_lowlevel.h +++ b/include/fuse_lowlevel.h @@ -2261,6 +2261,58 @@ int fuse_lowlevel_iomap_device_remove(struct fuse_session *se, int device_id); int fuse_lowlevel_iomap_device_invalidate(struct fuse_session *se, int dev, off_t offset, off_t length); +/* + * Upsert some file mapping information into the kernel. This is necessary + * for filesystems that require coordination of mapping state changes between + * buffered writes and writeback, and desirable for better performance + * elsewhere. + * + * Added in FUSE protocol version 7.99. If the kernel does not support + * this (or a newer) version, the function will return -ENOSYS and do + * nothing. + * + * @param se the session object + * @param nodeid the inode number + * @param attr_ino inode number as told by fuse_attr::ino + * @param read mapping information for file reads + * @param write mapping information for file writes + * @return zero for success, -errno for failure + */ +int fuse_lowlevel_iomap_upsert_mappings(struct fuse_session *se, + fuse_ino_t nodeid, uint64_t attr_ino, + const struct fuse_file_iomap *read, + const struct fuse_file_iomap *write); + +/* + * Update a mapping that will be sent to the kernel as part of an iomap_begin + * reply to signal that the mapping has been upserted into the cache. 
+ */ +static inline void fuse_file_iomap_retry_cache(struct fuse_file_iomap *map) +{ + map->type = FUSE_IOMAP_TYPE_RETRY_CACHE; + map->dev = FUSE_IOMAP_DEV_NULL; + map->addr = FUSE_IOMAP_NULL_ADDR; +} + +/** + * Invalidate some file mapping information in the kernel. + * + * Added in FUSE protocol version 7.99. If the kernel does not support + * this (or a newer) version, the function will return -ENOSYS and do + * nothing. + * + * @param se the session object + * @param nodeid the inode number + * @param attr_ino inode number as told by fuse_attr::ino + * @param read file read mapping range to invalidate + * @param write file write mapping range to invalidate + * @return zero for success, -errno for failure + */ +int fuse_lowlevel_iomap_inval_mappings(struct fuse_session *se, + fuse_ino_t nodeid, uint64_t attr_ino, + const struct fuse_file_range *read, + const struct fuse_file_range *write); + /* ----------------------------------------------------------- * * Utility functions * * ----------------------------------------------------------- */ diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c index d3e2d4c698a62b..6e8d2a7d74201b 100644 --- a/lib/fuse_lowlevel.c +++ b/lib/fuse_lowlevel.c @@ -3777,6 +3777,83 @@ int fuse_lowlevel_iomap_device_invalidate(struct fuse_session *se, int dev, return send_notify_iov(se, FUSE_NOTIFY_IOMAP_DEV_INVAL, iov, 2); } +int fuse_lowlevel_iomap_upsert_mappings(struct fuse_session *se, + fuse_ino_t nodeid, uint64_t attr_ino, + const struct fuse_file_iomap *read, + const struct fuse_file_iomap *write) +{ + struct fuse_iomap_upsert_mappings_out outarg = { + .nodeid = nodeid, + .attr_ino = attr_ino, + .read = { + .type = FUSE_IOMAP_TYPE_NOCACHE, + }, + .write = { + .type = FUSE_IOMAP_TYPE_NOCACHE, + } + }; + struct iovec iov[2]; + + if (!se) + return -EINVAL; + + if (se->conn.proto_minor < 99) + return -ENOSYS; + + if (!read && !write) + return 0; + + if (read) + fuse_iomap_to_kernel(&outarg.read, read); + + if (write) + 
fuse_iomap_to_kernel(&outarg.write, write); + + iov[1].iov_base = &outarg; + iov[1].iov_len = sizeof(outarg); + + return send_notify_iov(se, FUSE_NOTIFY_IOMAP_UPSERT_MAPPINGS, iov, 2); +} + +static inline void +fuse_iomap_range_to_kernel(struct fuse_range *range, + const struct fuse_file_range *firange) +{ + range->offset = firange->offset; + range->length = firange->length; +} + +int fuse_lowlevel_iomap_inval_mappings(struct fuse_session *se, + fuse_ino_t nodeid, uint64_t attr_ino, + const struct fuse_file_range *read, + const struct fuse_file_range *write) +{ + struct fuse_iomap_inval_mappings_out outarg = { + .nodeid = nodeid, + .attr_ino = attr_ino, + }; + struct iovec iov[2]; + + if (!se) + return -EINVAL; + + if (se->conn.proto_minor < 99) + return -ENOSYS; + + if (!read && !write) + return 0; + + if (read) + fuse_iomap_range_to_kernel(&outarg.read, read); + if (write) + fuse_iomap_range_to_kernel(&outarg.write, write); + + iov[1].iov_base = &outarg; + iov[1].iov_len = sizeof(outarg); + + return send_notify_iov(se, FUSE_NOTIFY_IOMAP_INVAL_MAPPINGS, iov, 2); +} + struct fuse_retrieve_req { struct fuse_notify_req nreq; void *cookie; diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index 9c9013c964488c..41e0193708e57d 100644 --- a/lib/fuse_versionscript +++ b/lib/fuse_versionscript @@ -272,6 +272,8 @@ FUSE_3.99 { fuse_lowlevel_add_iomap; fuse_service_configure_iomap; fuse_lowlevel_iomap_set_blocksize; + fuse_lowlevel_iomap_upsert_mappings; + fuse_lowlevel_iomap_inval_mappings; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 2/9] libfuse: add upper-level iomap cache management 2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong 2026-04-29 14:49 ` [PATCH 1/9] libfuse: enable iomap cache management for lowlevel fuse Darrick J. Wong @ 2026-04-29 14:49 ` Darrick J. Wong 2026-04-29 14:49 ` [PATCH 3/9] libfuse: allow constraining of iomap mapping cache size Darrick J. Wong ` (6 subsequent siblings) 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:49 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Make it so that upper-level fuse servers can use the iomap cache too. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse.h | 31 +++++++++++++++++++++++++++++++ lib/fuse.c | 30 ++++++++++++++++++++++++++++++ lib/fuse_versionscript | 2 ++ 3 files changed, 63 insertions(+) diff --git a/include/fuse.h b/include/fuse.h index 2717e654e92071..54803803beb5cb 100644 --- a/include/fuse.h +++ b/include/fuse.h @@ -1522,6 +1522,37 @@ bool fuse_fs_can_enable_iomap(const struct stat *stbuf); */ bool fuse_fs_can_enable_iomapx(const struct statx *statxbuf); +/* + * Upsert some file mapping information into the kernel. This is necessary + * for filesystems that require coordination of mapping state changes between + * buffered writes and writeback, and desirable for better performance + * elsewhere. + * + * @param nodeid the inode number + * @param attr_ino inode number as told by fuse_attr::ino + * @param read mapping information for file reads + * @param write mapping information for file writes + * @return zero for success, -errno for failure + */ +int fuse_fs_iomap_upsert(uint64_t nodeid, uint64_t attr_ino, + const struct fuse_file_iomap *read, + const struct fuse_file_iomap *write); + +/** + * Invalidate some file mapping information in the kernel. 
+ * + * @param nodeid the inode number + * @param attr_ino inode number as told by fuse_attr::ino + * @param read_off start of the range of read mappings to invalidate + * @param read_len length of the range of read mappings to invalidate + * @param write_off start of the range of write mappings to invalidate + * @param write_len length of the range of write mappings to invalidate + * @return zero for success, -errno for failure + */ +int fuse_fs_iomap_inval(uint64_t nodeid, uint64_t attr_ino, loff_t read_off, + uint64_t read_len, loff_t write_off, + uint64_t write_len); + int fuse_notify_poll(struct fuse_pollhandle *ph); /** diff --git a/lib/fuse.c b/lib/fuse.c index b43a336d7530bb..9ca3ed7d92bfde 100644 --- a/lib/fuse.c +++ b/lib/fuse.c @@ -3021,6 +3021,36 @@ static int fuse_fs_shutdownfs(struct fuse_fs *fs, const char *path, return fs->op.shutdownfs(path, flags); } +int fuse_fs_iomap_upsert(uint64_t nodeid, uint64_t attr_ino, + const struct fuse_file_iomap *read, + const struct fuse_file_iomap *write) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse_session *se = fuse_get_session(ctxt->fuse); + + return fuse_lowlevel_iomap_upsert_mappings(se, nodeid, attr_ino, read, + write); +} + +int fuse_fs_iomap_inval(uint64_t nodeid, uint64_t attr_ino, + loff_t read_off, uint64_t read_len, + loff_t write_off, uint64_t write_len) +{ + struct fuse_context *ctxt = fuse_get_context(); + struct fuse_session *se = fuse_get_session(ctxt->fuse); + struct fuse_file_range read = { + .offset = read_off, + .length = read_len, + }; + struct fuse_file_range write = { + .offset = write_off, + .length = write_len, + }; + + return fuse_lowlevel_iomap_inval_mappings(se, nodeid, attr_ino, &read, + &write); +} + static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, int valid, struct fuse_file_info *fi) { diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript index 41e0193708e57d..f5a2fcb3621d90 100644 --- a/lib/fuse_versionscript +++ 
b/lib/fuse_versionscript @@ -274,6 +274,8 @@ FUSE_3.99 { fuse_lowlevel_iomap_set_blocksize; fuse_lowlevel_iomap_upsert_mappings; fuse_lowlevel_iomap_inval_mappings; + fuse_fs_iomap_upsert; + fuse_fs_iomap_inval; } FUSE_3.19; # Local Variables: ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 3/9] libfuse: allow constraining of iomap mapping cache size 2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong 2026-04-29 14:49 ` [PATCH 1/9] libfuse: enable iomap cache management for lowlevel fuse Darrick J. Wong 2026-04-29 14:49 ` [PATCH 2/9] libfuse: add upper-level iomap cache management Darrick J. Wong @ 2026-04-29 14:49 ` Darrick J. Wong 2026-04-29 14:50 ` [PATCH 4/9] libfuse: add upper-level iomap mapping cache constraint code Darrick J. Wong ` (5 subsequent siblings) 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:49 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Allow the fuse server to constrain the maximum size of each iomap mapping cache. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/fuse_common.h | 7 ++++++- include/fuse_kernel.h | 6 +++++- lib/fuse_lowlevel.c | 9 +++++++-- 3 files changed, 18 insertions(+), 4 deletions(-) diff --git a/include/fuse_common.h b/include/fuse_common.h index 76c65f7e79179e..3a8fc7ea78bcc7 100644 --- a/include/fuse_common.h +++ b/include/fuse_common.h @@ -1266,11 +1266,14 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags, #define FUSE_IOMAP_CONFIG_MAX_LINKS (1 << 3ULL) #define FUSE_IOMAP_CONFIG_TIME (1 << 4ULL) #define FUSE_IOMAP_CONFIG_MAXBYTES (1 << 5ULL) +#define FUSE_IOMAP_CONFIG_CACHE_MAXBYTES (1 << 6ULL) struct fuse_iomap_config_params { uint64_t flags; /* supported FUSE_IOMAP_CONFIG_* flags */ int64_t maxbytes; /* max supported file size */ - uint64_t padding[6]; /* zero */ + uint32_t cache_maxbytes; /* mapping cache maxbytes */ + uint32_t zero; /* zero */ + uint64_t padding[5]; /* zero */ }; struct fuse_iomap_config { @@ -1294,6 +1297,8 @@ struct fuse_iomap_config { int64_t s_time_max; int64_t s_maxbytes; /* max file size */ + + uint32_t cache_maxbytes; /* mapping 
cache maximum size */ }; /* invalidate to end of file */ diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h index 115c8a228e765a..d9b65fe2395bde 100644 --- a/include/fuse_kernel.h +++ b/include/fuse_kernel.h @@ -1433,7 +1433,9 @@ struct fuse_iomap_ioend_out { struct fuse_iomap_config_in { uint64_t flags; /* supported FUSE_IOMAP_CONFIG_* flags */ int64_t maxbytes; /* max supported file size */ - uint64_t padding[6]; /* zero */ + uint32_t cache_maxbytes; /* mapping cache maxbytes */ + uint32_t zero; /* zero */ + uint64_t padding[5]; /* zero */ }; struct fuse_iomap_config_out { @@ -1457,6 +1459,8 @@ struct fuse_iomap_config_out { int64_t s_time_max; int64_t s_maxbytes; /* max file size */ + + uint32_t cache_maxbytes; /* mapping cache maximum size */ }; struct fuse_range { diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c index 6e8d2a7d74201b..4af2ed8380f5ac 100644 --- a/lib/fuse_lowlevel.c +++ b/lib/fuse_lowlevel.c @@ -2851,7 +2851,8 @@ static void do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid, FUSE_IOMAP_CONFIG_BLOCKSIZE | \ FUSE_IOMAP_CONFIG_MAX_LINKS | \ FUSE_IOMAP_CONFIG_TIME | \ - FUSE_IOMAP_CONFIG_MAXBYTES) + FUSE_IOMAP_CONFIG_MAXBYTES | \ + FUSE_IOMAP_CONFIG_CACHE_MAXBYTES) #define FUSE_IOMAP_CONFIG_ALL (FUSE_IOMAP_CONFIG_V1) @@ -2860,7 +2861,7 @@ static ssize_t iomap_config_reply_size(const struct fuse_iomap_config *cfg) if (cfg->flags & ~FUSE_IOMAP_CONFIG_ALL) return -EINVAL; - return offsetofend(struct fuse_iomap_config_out, s_maxbytes); + return offsetofend(struct fuse_iomap_config_out, cache_maxbytes); } int fuse_reply_iomap_config(fuse_req_t req, const struct fuse_iomap_config *cfg) @@ -2898,6 +2899,9 @@ int fuse_reply_iomap_config(fuse_req_t req, const struct fuse_iomap_config *cfg) if (cfg->flags & FUSE_IOMAP_CONFIG_MAXBYTES) arg.s_maxbytes = cfg->s_maxbytes; + if (cfg->flags & FUSE_IOMAP_CONFIG_CACHE_MAXBYTES) + arg.cache_maxbytes = cfg->cache_maxbytes; + return send_reply_ok(req, &arg, reply_size); } @@ -2908,6 +2912,7 @@ static 
void _do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid, struct fuse_iomap_config_params p = { .flags = arg->flags & FUSE_IOMAP_CONFIG_ALL, .maxbytes = arg->maxbytes, + .cache_maxbytes = arg->cache_maxbytes, }; (void)nodeid; ^ permalink raw reply related [flat|nested] 191+ messages in thread
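The padding change in the patch above appears to keep the config structure the same size: one 64-bit padding slot is split into a 32-bit cache_maxbytes field and a 32-bit zero field. A minimal sketch of that invariant follows. The struct names config_in_old and config_in_new are hypothetical stand-ins that only mirror the fields visible in the diff; they are not the real libfuse definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the layout before the patch. */
struct config_in_old {
	uint64_t flags;		/* supported config flags */
	int64_t maxbytes;	/* max supported file size */
	uint64_t padding[6];	/* zero */
};

/* Hypothetical copy of the layout after the patch. */
struct config_in_new {
	uint64_t flags;		/* supported config flags */
	int64_t maxbytes;	/* max supported file size */
	uint32_t cache_maxbytes;	/* mapping cache maxbytes */
	uint32_t zero;		/* zero */
	uint64_t padding[5];	/* zero */
};

/* The new fields must consume exactly one old padding slot. */
static_assert(sizeof(struct config_in_old) == sizeof(struct config_in_new),
	      "structure size must not change");
static_assert(offsetof(struct config_in_new, cache_maxbytes) ==
	      offsetof(struct config_in_old, padding),
	      "new field must start where the old padding started");
```

Compile-time checks like these are one way to catch an accidental size change in a wire structure before it reaches a kernel that expects the old size.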
* [PATCH 4/9] libfuse: add upper-level iomap mapping cache constraint code
2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2026-04-29 14:49 ` [PATCH 3/9] libfuse: allow constraining of iomap mapping cache size Darrick J. Wong
@ 2026-04-29 14:50 ` Darrick J. Wong
2026-04-29 14:50 ` [PATCH 5/9] libfuse: enable iomap Darrick J. Wong
` (4 subsequent siblings)
8 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:50 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Allow high-level fuse servers to constrain the maximum size of each
iomap mapping cache.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/fuse.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/fuse.c b/lib/fuse.c
index 9ca3ed7d92bfde..6b97f33abab72f 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2982,8 +2982,9 @@ static int fuse_fs_iomap_config(struct fuse_fs *fs,

 	if (fs->debug) {
 		fuse_log(FUSE_LOG_DEBUG,
-			 "iomap_config flags 0x%llx maxbytes %lld\n",
-			 (unsigned long long)p->flags, (long long)p->maxbytes);
+			 "iomap_config flags 0x%llx maxbytes %lld cache_maxbytes %u\n",
+			 (unsigned long long)p->flags, (long long)p->maxbytes,
+			 p->cache_maxbytes);
 	}

 	return fs->op.iomap_config(p, psize, cfg);

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 5/9] libfuse: enable iomap
2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (3 preceding siblings ...)
2026-04-29 14:50 ` [PATCH 4/9] libfuse: add upper-level iomap mapping cache constraint code Darrick J. Wong
@ 2026-04-29 14:50 ` Darrick J. Wong
2026-04-29 14:50 ` [PATCH 6/9] example/iomap_ll: cache mappings for later Darrick J. Wong
` (3 subsequent siblings)
8 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:50 UTC (permalink / raw)
To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal

From: Darrick J. Wong <djwong@kernel.org>

Remove the guard that we used to avoid bisection problems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/fuse_lowlevel.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 4af2ed8380f5ac..5dcb321f2f87a7 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -3185,8 +3185,6 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			se->conn.capable_ext |= FUSE_CAP_ALLOW_IDMAP;
 		if (inargflags & FUSE_IOMAP)
 			se->conn.capable_ext |= FUSE_CAP_IOMAP;
-		/* Don't let anyone touch iomap until the end of the patchset. */
-		se->conn.capable_ext &= ~FUSE_CAP_IOMAP;
 	} else {
 		se->conn.max_readahead = 0;

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 6/9] example/iomap_ll: cache mappings for later 2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (4 preceding siblings ...) 2026-04-29 14:50 ` [PATCH 5/9] libfuse: enable iomap Darrick J. Wong @ 2026-04-29 14:50 ` Darrick J. Wong 2026-04-29 14:50 ` [PATCH 7/9] example/iomap_inline_ll: cache iomappings in the kernel Darrick J. Wong ` (2 subsequent siblings) 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:50 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Cache the iomappings in the kernel for better performance. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/single_file.h | 7 ++++++- example/iomap_ll.c | 16 ++++++++++++++++ example/single_file.c | 7 +++++++ 3 files changed, 29 insertions(+), 1 deletion(-) diff --git a/example/single_file.h b/example/single_file.h index edf8a424d02ec6..96cda5be59d77a 100644 --- a/example/single_file.h +++ b/example/single_file.h @@ -59,6 +59,7 @@ struct single_file { bool require_bdev; bool uses_iomap; bool fixed_size; + bool iomap_cache; int awu_min, awu_max; unsigned int blocksize; @@ -103,6 +104,8 @@ enum single_file_opt_keys { SINGLE_FILE_NOSYNC, SINGLE_FILE_SIZE, SINGLE_FILE_BLOCKSIZE, + SINGLE_FILE_IOMAP_CACHE, + SINGLE_FILE_NOIOMAP_CACHE, SINGLE_FILE_NR_KEYS, }; @@ -116,7 +119,9 @@ enum single_file_opt_keys { FUSE_OPT_KEY("sync", SINGLE_FILE_SYNC), \ FUSE_OPT_KEY("nosync", SINGLE_FILE_NOSYNC), \ FUSE_OPT_KEY("size=%s", SINGLE_FILE_SIZE), \ - FUSE_OPT_KEY("blocksize=%s", SINGLE_FILE_BLOCKSIZE) + FUSE_OPT_KEY("blocksize=%s", SINGLE_FILE_BLOCKSIZE), \ + FUSE_OPT_KEY("iomap_cache", SINGLE_FILE_IOMAP_CACHE), \ + FUSE_OPT_KEY("noiomap_cache", SINGLE_FILE_NOIOMAP_CACHE) int single_file_opt_proc(void *data, const char *arg, int key, struct fuse_args *outargs); diff --git a/example/iomap_ll.c 
b/example/iomap_ll.c index 9824172be4afa4..ff0a21db46d2c2 100644 --- a/example/iomap_ll.c +++ b/example/iomap_ll.c @@ -362,6 +362,22 @@ static void iomap_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, goto out_unlock; } + /* + * For real IO operations, cache the mapping in the kernel so that we + * can reuse them for subsequent IO to the same regions. Don't let + * FIEMAP thrash the cache. + */ + if (!(opflags & FUSE_IOMAP_OP_REPORT) && single_file.iomap_cache) { + ret = fuse_lowlevel_iomap_upsert_mappings(ll.se, ino, ino, + &read, NULL); + if (ret) { + ret = -ret; + goto out_unlock; + } + + fuse_file_iomap_retry_cache(&read); + } + out_unlock: pthread_mutex_unlock(&single_file.lock); out_reply: diff --git a/example/single_file.c b/example/single_file.c index 49c3e857b130f8..d669f81bc309c9 100644 --- a/example/single_file.c +++ b/example/single_file.c @@ -68,6 +68,7 @@ struct single_file single_file = { .mode = S_IFREG | 0444, .lock = PTHREAD_MUTEX_INITIALIZER, .fixed_size = true, + .iomap_cache = true, }; static fuse_ino_t single_file_path_to_ino(const char *path) @@ -800,6 +801,12 @@ int single_file_opt_proc(void *data, const char *arg, int key, case SINGLE_FILE_NOSYNC: single_file.sync = false; return 0; + case SINGLE_FILE_IOMAP_CACHE: + single_file.iomap_cache = true; + return 0; + case SINGLE_FILE_NOIOMAP_CACHE: + single_file.iomap_cache = false; + return 0; case SINGLE_FILE_BLOCKSIZE: return single_file_set_blocksize(arg + 10); case SINGLE_FILE_SIZE: ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 7/9] example/iomap_inline_ll: cache iomappings in the kernel 2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (5 preceding siblings ...) 2026-04-29 14:50 ` [PATCH 6/9] example/iomap_ll: cache mappings for later Darrick J. Wong @ 2026-04-29 14:50 ` Darrick J. Wong 2026-04-29 14:51 ` [PATCH 8/9] example/iomap_ow_ll: " Darrick J. Wong 2026-04-29 14:51 ` [PATCH 9/9] example/iomap_service_ll: " Darrick J. Wong 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:50 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Cache iomappings in the kernel. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/iomap_inline_ll.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/example/iomap_inline_ll.c b/example/iomap_inline_ll.c index bed9855b72a27e..202a528c2735a8 100644 --- a/example/iomap_inline_ll.c +++ b/example/iomap_inline_ll.c @@ -176,6 +176,22 @@ static void ioinline_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, goto out_unlock; } + /* + * For real IO operations, cache the mapping in the kernel so that we + * can reuse them for subsequent IO to the same regions. Don't let + * FIEMAP thrash the cache. + */ + if (!(opflags & FUSE_IOMAP_OP_REPORT) && single_file.iomap_cache) { + ret = fuse_lowlevel_iomap_upsert_mappings(ll.se, ino, ino, + &read, NULL); + if (ret) { + ret = -ret; + goto out_unlock; + } + + fuse_file_iomap_retry_cache(&read); + } + out_unlock: pthread_mutex_unlock(&single_file.lock); out_reply: ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 8/9] example/iomap_ow_ll: cache iomappings in the kernel 2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (6 preceding siblings ...) 2026-04-29 14:50 ` [PATCH 7/9] example/iomap_inline_ll: cache iomappings in the kernel Darrick J. Wong @ 2026-04-29 14:51 ` Darrick J. Wong 2026-04-29 14:51 ` [PATCH 9/9] example/iomap_service_ll: " Darrick J. Wong 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:51 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Cache iomappings in the kernel. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/iomap_ow_ll.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/example/iomap_ow_ll.c b/example/iomap_ow_ll.c index 926c25133a3a84..3df96581231dd2 100644 --- a/example/iomap_ow_ll.c +++ b/example/iomap_ow_ll.c @@ -475,6 +475,23 @@ static void ioow_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, goto out_unlock; } + /* + * For real IO operations, cache the mapping in the kernel so that we + * can reuse them for subsequent IO to the same regions. Don't let + * FIEMAP thrash the cache. + */ + if (!(opflags & FUSE_IOMAP_OP_REPORT) && single_file.iomap_cache) { + ret = fuse_lowlevel_iomap_upsert_mappings(ll.se, ino, ino, + &read, &write); + if (ret) { + ret = -ret; + goto out_unlock; + } + + fuse_file_iomap_retry_cache(&read); + fuse_file_iomap_retry_cache(&write); + } + out_unlock: pthread_mutex_unlock(&single_file.lock); out_reply: ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 9/9] example/iomap_service_ll: cache iomappings in the kernel 2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong ` (7 preceding siblings ...) 2026-04-29 14:51 ` [PATCH 8/9] example/iomap_ow_ll: " Darrick J. Wong @ 2026-04-29 14:51 ` Darrick J. Wong 8 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:51 UTC (permalink / raw) To: djwong, bernd; +Cc: miklos, linux-fsdevel, fuse-devel, joannelkoong, neal From: Darrick J. Wong <djwong@kernel.org> Cache iomappings in the kernel. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- example/iomap_service_ll.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/example/iomap_service_ll.c b/example/iomap_service_ll.c index c7b9e9c064a32d..021a22ae13388d 100644 --- a/example/iomap_service_ll.c +++ b/example/iomap_service_ll.c @@ -212,6 +212,22 @@ static void ioservice_ll_iomap_begin(fuse_req_t req, fuse_ino_t ino, goto out_unlock; } + /* + * For real IO operations, cache the mapping in the kernel so that we + * can reuse them for subsequent IO to the same regions. Don't let + * FIEMAP thrash the cache. + */ + if (!(opflags & FUSE_IOMAP_OP_REPORT) && single_file.iomap_cache) { + ret = fuse_lowlevel_iomap_upsert_mappings(ll.se, ino, ino, + &read, NULL); + if (ret) { + ret = -ret; + goto out_unlock; + } + + fuse_file_iomap_retry_cache(&read); + } + out_unlock: pthread_mutex_unlock(&single_file.lock); out_reply: ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (13 preceding siblings ...)
2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2026-04-29 14:20 ` Darrick J. Wong
2026-04-29 14:51 ` [PATCH 1/5] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
` (4 more replies)
2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (4 subsequent siblings)
19 siblings, 5 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:20 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong

Hi all,

In preparation for connecting fuse, iomap, and fuse2fs for a much more
performant file IO path, make some changes to the Unix IO manager in
libext2fs so that we can have better IO. First we start by making
filesystem flushes a lot more efficient by eliding fsyncs when they're
not necessary, and allowing library clients to turn off the racy code
that writes the superblock byte by byte but exposes stale checksums.

XXX: The second part of this series adds IO tagging so that we could tag
IOs by inode number to distinguish file data blocks in cache from
everything else. This is temporary scaffolding whilst we're in the
middle of adding directio and later buffered writes. Once we can use the
pagecache for all file IO activity I think we could drop the back half
of this series.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.
e2fsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=libext2fs-iomap-prep --- Commits in this patchset: * libext2fs: invalidate cached blocks when freeing them * libext2fs: only flush affected blocks in unix_write_byte * libext2fs: allow unix_write_byte when the write would be aligned * libext2fs: allow clients to ask to write full superblocks * libext2fs: allow callers to disallow I/O to file data blocks --- lib/ext2fs/ext2_io.h | 8 ++++++- lib/ext2fs/ext2fs.h | 4 +++ debian/libext2fs2t64.symbols | 1 + lib/ext2fs/alloc_stats.c | 6 +++++ lib/ext2fs/closefs.c | 7 ++++++ lib/ext2fs/fileio.c | 12 +++++++++- lib/ext2fs/io_manager.c | 9 +++++++ lib/ext2fs/unix_io.c | 51 ++++++++++++++++++++++++++++++++++++++++-- 8 files changed, 93 insertions(+), 5 deletions(-) ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/5] libext2fs: invalidate cached blocks when freeing them 2026-04-29 14:20 ` [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong @ 2026-04-29 14:51 ` Darrick J. Wong 2026-04-29 14:51 ` [PATCH 2/5] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong ` (3 subsequent siblings) 4 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:51 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> When we're freeing blocks, we should tell the IO manager to drop them from any cache it might be maintaining to improve performance. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/ext2_io.h | 8 +++++++- debian/libext2fs2t64.symbols | 1 + lib/ext2fs/alloc_stats.c | 6 ++++++ lib/ext2fs/io_manager.c | 9 +++++++++ lib/ext2fs/unix_io.c | 35 +++++++++++++++++++++++++++++++++++ 5 files changed, 58 insertions(+), 1 deletion(-) diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h index c880ea2524f248..0148492caf63b6 100644 --- a/lib/ext2fs/ext2_io.h +++ b/lib/ext2fs/ext2_io.h @@ -104,7 +104,10 @@ struct struct_io_manager { unsigned long long count); errcode_t (*flock)(io_channel channel, unsigned int flock_flags); errcode_t (*get_fd)(io_channel channel, int *fd); - long reserved[12]; + errcode_t (*invalidate_blocks)(io_channel channel, + unsigned long long block, + unsigned long long count); + long reserved[11]; }; #define IO_FLAG_RW 0x0001 @@ -157,6 +160,9 @@ extern errcode_t io_channel_cache_readahead(io_channel io, extern errcode_t io_channel_flock(io_channel io, unsigned int flock_flags); extern errcode_t io_channel_funlock(io_channel io); extern errcode_t io_channel_get_fd(io_channel io, int *fd); +extern errcode_t io_channel_invalidate_blocks(io_channel io, + unsigned long long block, + unsigned long long count); #ifdef _WIN32 /* windows_io.c */ diff --git 
a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols index 555fbbb0c98878..b19a362967f00e 100644 --- a/debian/libext2fs2t64.symbols +++ b/debian/libext2fs2t64.symbols @@ -702,6 +702,7 @@ libext2fs.so.2 libext2fs2t64 #MINVER# io_channel_flock@Base 1.47.99 io_channel_funlock@Base 1.47.99 io_channel_get_fd@Base 1.47.99 + io_channel_invalidate_blocks@Base 1.47.99 io_channel_read_blk64@Base 1.41.1 io_channel_set_options@Base 1.37 io_channel_write_blk64@Base 1.41.1 diff --git a/lib/ext2fs/alloc_stats.c b/lib/ext2fs/alloc_stats.c index 95a6438f252e0f..68bbe6807a8ed3 100644 --- a/lib/ext2fs/alloc_stats.c +++ b/lib/ext2fs/alloc_stats.c @@ -82,6 +82,9 @@ void ext2fs_block_alloc_stats2(ext2_filsys fs, blk64_t blk, int inuse) -inuse * (blk64_t) EXT2FS_CLUSTER_RATIO(fs)); ext2fs_mark_super_dirty(fs); ext2fs_mark_bb_dirty(fs); + if (inuse < 0) + io_channel_invalidate_blocks(fs->io, blk, + EXT2FS_CLUSTER_RATIO(fs)); if (fs->block_alloc_stats) (fs->block_alloc_stats)(fs, (blk64_t) blk, inuse); } @@ -144,11 +147,14 @@ void ext2fs_block_alloc_stats_range(ext2_filsys fs, blk64_t blk, ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT); ext2fs_group_desc_csum_set(fs, group); ext2fs_free_blocks_count_add(fs->super, -inuse * (blk64_t) n); + blk += n; num -= n; } ext2fs_mark_super_dirty(fs); ext2fs_mark_bb_dirty(fs); + if (inuse < 0) + io_channel_invalidate_blocks(fs->io, orig_blk, orig_num); if (fs->block_alloc_stats_range) (fs->block_alloc_stats_range)(fs, orig_blk, orig_num, inuse); } diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c index dff3d73552827f..a92dba7b9dc880 100644 --- a/lib/ext2fs/io_manager.c +++ b/lib/ext2fs/io_manager.c @@ -174,3 +174,12 @@ errcode_t io_channel_get_fd(io_channel io, int *fd) return io->manager->get_fd(io, fd); } + +errcode_t io_channel_invalidate_blocks(io_channel io, unsigned long long block, + unsigned long long count) +{ + if (!io->manager->invalidate_blocks) + return EXT2_ET_OP_NOT_SUPPORTED; + + return 
io->manager->invalidate_blocks(io, block, count); +} diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c index 567bbd9493f7f1..54bd4b5597ea9e 100644 --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -672,6 +672,25 @@ static errcode_t reuse_cache(io_channel channel, #define FLUSH_INVALIDATE 0x01 #define FLUSH_NOLOCK 0x02 +/* Remove blocks from the cache. Dirty contents are discarded. */ +static void invalidate_cached_blocks(io_channel channel, + struct unix_private_data *data, + unsigned long long block, + unsigned long long count) +{ + struct unix_cache *cache; + int i; + + mutex_lock(data, CACHE_MTX); + for (i = 0, cache = data->cache; i < data->cache_size; i++, cache++) { + if (!cache->in_use || cache->block < block || + cache->block >= block + count) + continue; + cache->in_use = 0; + } + mutex_unlock(data, CACHE_MTX); +} + /* * Flush all of the blocks in the cache */ @@ -1832,6 +1851,20 @@ static errcode_t unix_get_fd(io_channel channel, int *fd) return 0; } +static errcode_t unix_invalidate_blocks(io_channel channel, + unsigned long long block, + unsigned long long count) +{ + struct unix_private_data *data; + + EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); + data = (struct unix_private_data *) channel->private_data; + EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL); + + invalidate_cached_blocks(channel, data, block, count); + return 0; +} + #if __GNUC_PREREQ (4, 6) #pragma GCC diagnostic pop #endif @@ -1855,6 +1888,7 @@ static struct struct_io_manager struct_unix_manager = { .zeroout = unix_zeroout, .flock = unix_flock, .get_fd = unix_get_fd, + .invalidate_blocks = unix_invalidate_blocks, }; io_manager unix_io_manager = &struct_unix_manager; @@ -1878,6 +1912,7 @@ static struct struct_io_manager struct_unixfd_manager = { .zeroout = unix_zeroout, .flock = unix_flock, .get_fd = unix_get_fd, + .invalidate_blocks = unix_invalidate_blocks, }; io_manager unixfd_io_manager = &struct_unixfd_manager; ^ permalink raw reply related [flat|nested] 
191+ messages in thread
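The invalidation loop added to unix_io.c in the patch above can be modelled in isolation. The following is a minimal sketch, assuming a flat array of cache slots; struct toy_cache_entry and toy_invalidate_blocks are hypothetical names used only for illustration, and the real code also takes the cache mutex, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one slot of a block cache. */
struct toy_cache_entry {
	unsigned long long block;	/* block number held by this slot */
	int in_use;			/* nonzero if the slot is valid */
	int dirty;			/* nonzero if not yet written back */
};

/*
 * Drop every in-use slot whose block lies in [block, block + count).
 * Dirty contents are discarded rather than written back, because the
 * blocks are being freed.  Returns the number of slots dropped.
 */
static size_t toy_invalidate_blocks(struct toy_cache_entry *cache, size_t nr,
				    unsigned long long block,
				    unsigned long long count)
{
	size_t dropped = 0;
	size_t i;

	assert(cache != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		/* Written as a subtraction so block + count cannot wrap. */
		if (!cache[i].in_use || cache[i].block < block ||
		    cache[i].block - block >= count)
			continue;
		cache[i].in_use = 0;
		cache[i].dirty = 0;
		dropped++;
	}
	return dropped;
}
```

The range test here is phrased as a subtraction, which differs slightly from the patch's `cache->block >= block + count`; both select the same slots as long as block + count does not overflow.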
* [PATCH 2/5] libext2fs: only flush affected blocks in unix_write_byte 2026-04-29 14:20 ` [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong 2026-04-29 14:51 ` [PATCH 1/5] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong @ 2026-04-29 14:51 ` Darrick J. Wong 2026-04-29 14:52 ` [PATCH 3/5] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong ` (2 subsequent siblings) 4 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:51 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> There's no need to invalidate the entire cache when writing a range of bytes to the device. The only ones we need to invalidate are the ones that we're writing separately. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- lib/ext2fs/unix_io.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c index 54bd4b5597ea9e..35c42c35f735a3 100644 --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -1588,6 +1588,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset, { struct unix_private_data *data; errcode_t retval = 0; + unsigned long long bno, nbno; ssize_t actual; EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL); @@ -1603,10 +1604,17 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset, #ifndef NO_IO_CACHE /* - * Flush out the cache completely + * Flush all the dirty blocks, then invalidate the blocks we're about + * to write. 
*/ - if ((retval = flush_cached_blocks(channel, data, FLUSH_INVALIDATE))) + retval = flush_cached_blocks(channel, data, 0); + if (retval) return retval; + + bno = offset / channel->block_size; + nbno = (offset + size + channel->block_size - 1) / channel->block_size; + + invalidate_cached_blocks(channel, data, bno, nbno - bno); #endif if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0) ^ permalink raw reply related [flat|nested] 191+ messages in thread
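The byte-to-block conversion in the patch above rounds the start of the write down and the end of the write up to block boundaries. A small sketch of that arithmetic, with hypothetical names (toy_block_range, toy_byte_range_to_blocks) that do not exist in libext2fs:

```c
#include <assert.h>

/* Hypothetical result type: first block and number of blocks touched. */
struct toy_block_range {
	unsigned long long first;
	unsigned long long count;
};

/*
 * Convert the byte range [offset, offset + size) into the range of
 * blocks that it touches.  The first block is rounded down and the end
 * block is rounded up, so partially covered blocks are included.
 */
static struct toy_block_range
toy_byte_range_to_blocks(unsigned long long offset, unsigned long long size,
			 unsigned int block_size)
{
	struct toy_block_range r;
	unsigned long long end_block;

	assert(block_size > 0);

	r.first = offset / block_size;
	end_block = (offset + size + block_size - 1) / block_size;
	r.count = end_block - r.first;
	return r;
}
```

For example, a 200-byte write at byte offset 4000 with 4096-byte blocks straddles a block boundary, so it covers blocks 0 and 1 and both would need to be invalidated.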
* [PATCH 3/5] libext2fs: allow unix_write_byte when the write would be aligned
2026-04-29 14:20 ` [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
2026-04-29 14:51 ` [PATCH 1/5] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
2026-04-29 14:51 ` [PATCH 2/5] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
@ 2026-04-29 14:52 ` Darrick J. Wong
2026-04-29 14:52 ` [PATCH 4/5] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
2026-04-29 14:52 ` [PATCH 5/5] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
4 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:52 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

If someone calls write_byte on an IO channel with an alignment
requirement and the range to be written is aligned correctly, go ahead
and do the write. This will be needed later when we try to speed up
superblock writes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 35c42c35f735a3..ea8ee56b7d5163 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1599,7 +1599,9 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 #ifdef ALIGN_DEBUG
 		printf("unix_write_byte: O_DIRECT fallback\n");
 #endif
-		return EXT2_ET_UNIMPLEMENTED;
+		if (!IS_ALIGNED(data->offset + offset, channel->align) ||
+		    !IS_ALIGNED(data->offset + offset + size, channel->align))
+			return EXT2_ET_UNIMPLEMENTED;
 	}
 
 #ifndef NO_IO_CACHE
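
The alignment test added above checks both ends of the write, not just
the start. Here is a small standalone sketch of that check; is_aligned()
and write_range_is_aligned() are hypothetical stand-ins written for this
illustration (the patch itself uses an IS_ALIGNED() macro), and the
sketch assumes a nonzero alignment.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the IS_ALIGNED() test used in the patch;
 * assumes align is nonzero.
 */
static bool is_aligned(unsigned long long n, unsigned int align)
{
	return (n % align) == 0;
}

/*
 * Return true if both the first byte and the byte just past the end of
 * the write land on an alignment boundary of the device.  dev_offset is
 * the channel's base offset into the device.
 */
static bool write_range_is_aligned(unsigned long long dev_offset,
				   unsigned long long offset,
				   unsigned long long size,
				   unsigned int align)
{
	assert(align > 0);

	return is_aligned(dev_offset + offset, align) &&
	       is_aligned(dev_offset + offset + size, align);
}
```

As an example, the ext4 primary superblock lives at byte offset 1024 and
is 1024 bytes long. With a 512-byte alignment requirement both ends are
aligned and the write can proceed; with a 4096-byte requirement the
start is not aligned and the caller still gets the old fallback.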
* [PATCH 4/5] libext2fs: allow clients to ask to write full superblocks
2026-04-29 14:20 ` [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (2 preceding siblings ...)
2026-04-29 14:52 ` [PATCH 3/5] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
@ 2026-04-29 14:52 ` Darrick J. Wong
2026-04-29 14:52 ` [PATCH 5/5] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
4 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:52 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

write_primary_superblock currently does this weird dance where it will
try to write only the dirty bytes of the primary superblock to disk. In
theory, this is done so that tune2fs can incrementally update superblock
bytes when the filesystem is mounted; ext2 was famous for allowing this
dance to be used to set new fs parameters and have them take effect in
real time.

The ability to do this safely was obliterated back in 2001 when ext3 was
introduced with journalling, because tune2fs has no way to know if the
journal has already logged an updated primary superblock but not yet
written it to disk, which means that they can race to write, and changes
can be lost.

This (non-)safety was further obliterated back in 2012 when I added
checksums to all the metadata blocks in ext4 because anyone else with
the block device open can see the primary superblock in an intermediate
state where the checksum does not match the superblock contents.

At this point in 2025 it's kind of stupid for fuse2fs to be doing this
because you can't have the kernel and fuse2fs mount the same filesystem
at the same time. It also makes fuse2fs op_fsync slow because libext2fs
performs a bunch of small writes and introduces extra fsyncs.

So, add a new flag to ask for full superblock writes, which fuse2fs will
use later.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2fs.h  | 1 +
 lib/ext2fs/closefs.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 02c3cbcea92482..8fad4c4011dd5a 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -220,6 +220,7 @@ typedef struct ext2_file *ext2_file_t;
 #define EXT2_FLAG_IBITMAP_TAIL_PROBLEM	0x2000000
 #define EXT2_FLAG_THREADS		0x4000000
 #define EXT2_FLAG_IGNORE_SWAP_DIRENT	0x8000000
+#define EXT2_FLAG_WRITE_FULL_SUPER	0x10000000
 
 /*
  * Internal flags for use by the ext2fs library only
diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
index 8e5bec03a050de..9a67db76e7b326 100644
--- a/lib/ext2fs/closefs.c
+++ b/lib/ext2fs/closefs.c
@@ -196,6 +196,13 @@ static errcode_t write_primary_superblock(ext2_filsys fs,
 	int check_idx, write_idx, size;
 	errcode_t retval;
 
+	if (fs->flags & EXT2_FLAG_WRITE_FULL_SUPER) {
+		retval = io_channel_write_byte(fs->io, SUPERBLOCK_OFFSET,
+					       SUPERBLOCK_SIZE, super);
+		if (!retval)
+			return 0;
+	}
+
 	if (!fs->io->manager->write_byte || !fs->orig_super) {
 	fallback:
 		io_channel_set_blksize(fs->io, SUPERBLOCK_OFFSET);
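
To see why the "dirty bytes only" dance turns into a bunch of small
writes, here is a simplified standalone model. It is only a sketch under
the assumption that the old and new superblock images are compared as
16-bit words and each contiguous run of changed words becomes one write;
count_dirty_runs() is a hypothetical helper, not a libext2fs function,
and the real code may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count how many separate writes a "dirty bytes only" update would issue:
 * one per contiguous run of 16-bit words that differ between the old and
 * new images.
 */
static size_t count_dirty_runs(const uint16_t *old_img,
			       const uint16_t *new_img, size_t nwords)
{
	size_t i = 0, runs = 0;

	assert(old_img && new_img);

	while (i < nwords) {
		if (old_img[i] == new_img[i]) {
			i++;
			continue;
		}
		/* Found the start of a dirty run; skip to its end. */
		runs++;
		while (i < nwords && old_img[i] != new_img[i])
			i++;
	}
	return runs;
}
```

A superblock update that touches a few scattered fields (free block
counts, write time, checksum) produces one run per field in this model,
whereas the full-superblock flag collapses them into a single write.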
* [PATCH 5/5] libext2fs: allow callers to disallow I/O to file data blocks
2026-04-29 14:20 ` [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (3 preceding siblings ...)
2026-04-29 14:52 ` [PATCH 4/5] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
@ 2026-04-29 14:52 ` Darrick J. Wong
4 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:52 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Add a flag to ext2_file_t to disallow read and write I/O to file data
blocks. This is needed for fuse2fs iomap support, which will keep all
the file data I/O inside the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2fs.h | 3 +++
 lib/ext2fs/fileio.c | 12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 8fad4c4011dd5a..23b0695a32d150 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -178,6 +178,9 @@ typedef struct ext2_struct_dblist *ext2_dblist;
 #define EXT2_FILE_WRITE		0x0001
 #define EXT2_FILE_CREATE	0x0002
 
+/* no file I/O to disk blocks, only to inline data */
+#define EXT2_FILE_NOBLOCKIO	0x0004
+
 #define EXT2_FILE_MASK		0x00FF
 
 #define EXT2_FILE_BUF_DIRTY	0x4000
diff --git a/lib/ext2fs/fileio.c b/lib/ext2fs/fileio.c
index 3a36e9e7fff43b..95ee45ec7371ae 100644
--- a/lib/ext2fs/fileio.c
+++ b/lib/ext2fs/fileio.c
@@ -314,6 +314,11 @@ errcode_t ext2fs_file_read(ext2_file_t file, void *buf,
 	if (file->inode.i_flags & EXT4_INLINE_DATA_FL)
 		return ext2fs_file_read_inline_data(file, buf, wanted, got);
 
+	if (file->flags & EXT2_FILE_NOBLOCKIO) {
+		retval = EXT2_ET_OP_NOT_SUPPORTED;
+		goto fail;
+	}
+
 	while ((file->pos < EXT2_I_SIZE(&file->inode)) && (wanted > 0)) {
 		retval = sync_buffer_position(file);
 		if (retval)
@@ -441,6 +446,11 @@ errcode_t ext2fs_file_write(ext2_file_t file, const void *buf,
 		retval = 0;
 	}
 
+	if (file->flags & EXT2_FILE_NOBLOCKIO) {
+		retval = EXT2_ET_OP_NOT_SUPPORTED;
+		goto fail;
+	}
+
 	while (nbytes > 0) {
 		retval = sync_buffer_position(file);
 		if (retval)
@@ -609,7 +619,7 @@ static errcode_t ext2fs_file_zero_past_offset(ext2_file_t file,
 	int ret_flags;
 	errcode_t retval;
 
-	if (off == 0)
+	if (off == 0 || (file->flags & EXT2_FILE_NOBLOCKIO))
 		return 0;
 
 	retval = sync_buffer_position(file);
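
The ordering of the checks in the read path above matters: inline data
is still served by the library, and only I/O to file data blocks is
refused. The following standalone sketch models that decision. Every
DEMO_ and demo_ name is hypothetical and exists only for this
illustration; the two flag values are copied from the headers
(EXT2_FILE_NOBLOCKIO from this patch, EXT4_INLINE_DATA_FL from the
on-disk format).

```c
#include <assert.h>

/* Flag values mirror the real ones; the types below are hypothetical. */
#define DEMO_FILE_NOBLOCKIO	0x0004		/* EXT2_FILE_NOBLOCKIO */
#define DEMO_INLINE_DATA_FL	0x10000000	/* EXT4_INLINE_DATA_FL */

enum demo_read_path {
	DEMO_READ_INLINE,	/* served from inline data in the inode */
	DEMO_READ_BLOCKS,	/* served by reading file data blocks */
	DEMO_READ_REFUSED,	/* caller asked the library not to do block I/O */
};

struct demo_file {
	unsigned int file_flags;	/* open flags, like ext2_file_t flags */
	unsigned int inode_flags;	/* on-disk i_flags */
};

/* Decide which path a read takes, in the same order as the patch checks. */
static enum demo_read_path demo_pick_read_path(const struct demo_file *f)
{
	assert(f != 0);

	if (f->inode_flags & DEMO_INLINE_DATA_FL)
		return DEMO_READ_INLINE;
	if (f->file_flags & DEMO_FILE_NOBLOCKIO)
		return DEMO_READ_REFUSED;
	return DEMO_READ_BLOCKS;
}
```

In this model a file opened with the no-block-I/O flag can still read
its inline data, which matches the comment on the new flag.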
* [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (14 preceding siblings ...)
2026-04-29 14:20 ` [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
@ 2026-04-29 14:20 ` Darrick J. Wong
2026-04-29 14:52 ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
` (18 more replies)
2026-04-29 14:20 ` [PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services Darrick J. Wong
` (3 subsequent siblings)
19 siblings, 19 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:20 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

Hi all,

Switch fuse2fs to use the new iomap file data IO paths instead of
pushing it very slowly through the /dev/fuse connection. For local
filesystems, all we have to do is respond to requests for file to device
mappings; the rest of the IO hot path stays within the kernel. This
means that we can get rid of all file data block processing within
fuse2fs.

Because we're not pinning dirty pages through a potentially slow network
connection, we don't need the heavy BDI throttling for which most fuse
servers have become infamous. Yes, mapping lookups for writeback can
stall, but mappings are small as compared to data and this situation
exists for all kernel filesystems as well.

The performance of this new data path is quite stunning: on a warm
system, streaming reads and writes through the pagecache go from
60-90MB/s to 2-2.5GB/s. Direct IO reads and writes improve from the
same baseline to 2.5-8GB/s. FIEMAP and SEEK_DATA/SEEK_HOLE now work
too. The kernel ext4 driver can manage about 1.6GB/s for pagecache IO
and about 2.6-8.5GB/s for direct IO, which means that fuse2fs is about
as fast as the kernel for streaming file IO.

Random 4k buffered IO is not so good: plain fuse2fs pokes along at
25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s. The kernel
can do 900-1300MB/s. Random directio is worse: plain fuse2fs does
20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does
40-55MB/s. I suspect that metadata heavy workloads do not perform well
on fuse2fs because libext2fs wasn't designed for that and it doesn't
even have a journal to absorb all the fsync writes. We also probably
need iomap caching really badly.

These performance numbers are slanted: my machine is 12 years old, and
fuse2fs is VERY poorly optimized for performance. It contains a single
Big Filesystem Lock which nukes multi-threaded scalability. There's no
inode cache nor is there a proper buffer cache, which means that fuse2fs
reads metadata in from disk and checksums it on EVERY ACCESS. Sad!

Despite these gaps, this RFC demonstrates that it's feasible to run the
metadata parsing parts of a filesystem in userspace while not
sacrificing much performance. We now have a vehicle to move the
filesystems out of the kernel, where they can be containerized so that
malicious filesystems can be contained, somewhat.

iomap mode also calls FUSE_DESTROY before unmounting the filesystem, so
for capable systems, fuse2fs doesn't need to run in fuseblk mode
anymore.

However, there are some major warts remaining:

1. The iomap cookie validation is not present, which can lead to subtle
races between pagecache zeroing and writeback on filesystems that
support unwritten and delalloc mappings.

2. Mappings ought to be cached in the kernel for more speed.

3. iomap doesn't support things like fscrypt or fsverity, and I haven't
yet figured out how inline data is supposed to work.

4. I would like to be able to turn on fuse+iomap on a per-inode basis,
which currently isn't possible because the kernel fuse driver will iget
inodes prior to calling FUSE_GETATTR to discover the properties of the
inode it just read.

5. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.

6. iomap is an inode-based service, not a file-based service. This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid
to index its incore inode, so we have to pass those too so that
notifications work properly.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-fileio
---
Commits in this patchset:
* fuse2fs: implement bare minimum iomap for file mapping reporting
* fuse2fs: add iomap= mount option
* fuse2fs: implement iomap configuration
* fuse2fs: register block devices for use with iomap
* fuse2fs: implement directio file reads
* fuse2fs: add extent dump function for debugging
* fuse2fs: implement direct write support
* fuse2fs: turn on iomap for pagecache IO
* fuse2fs: don't zero bytes in punch hole
* fuse2fs: don't do file data block IO when iomap is enabled
* fuse2fs: try to create loop device when ext4 device is a regular file
* fuse2fs: enable file IO to inline data files
* fuse2fs: set iomap-related inode flags
* fuse2fs: configure block device block size
* fuse4fs: separate invalidation
* fuse2fs: implement statx
* fuse2fs: enable atomic writes
* fuse4fs: disable fs reclaim and write throttling
* fuse2fs: implement freeze and shutdown requests
---
 configure            |   90 ++
 configure.ac         |   54 +
 fuse4fs/fuse4fs.1.in |    6
 fuse4fs/fuse4fs.c    | 1934 +++++++++++++++++++++++++++++++++++++++++++++++++-
 lib/config.h.in      |    6
 misc/fuse2fs.1.in    |    6
 misc/fuse2fs.c       | 1947 ++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 4016 insertions(+), 27 deletions(-)
* [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting
2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2026-04-29 14:52 ` Darrick J. Wong
2026-04-29 14:53 ` [PATCH 02/19] fuse2fs: add iomap= mount option Darrick J. Wong
` (17 subsequent siblings)
18 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:52 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Add enough of an iomap implementation that we can do FIEMAP and
SEEK_DATA and SEEK_HOLE.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure         |  49 +++++
 configure.ac      |  31 +++
 fuse4fs/fuse4fs.c | 540 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/config.h.in   |   3
 misc/fuse2fs.c    | 540 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 1161 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 80aad505da550c..344c7af2ee48f8 100755
--- a/configure
+++ b/configure
@@ -14608,6 +14608,7 @@ printf "%s\n" "yes" >&6; }
 
 fi
 
+have_fuse_iomap=
 if test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=319
@@ -14634,12 +14635,60 @@ esac
 fi
 
 done
+
+	{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for iomap_begin in libfuse" >&5
+printf %s "checking for iomap_begin in libfuse... " >&6; }
+	cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+	#define _GNU_SOURCE
+	#define _FILE_OFFSET_BITS 64
+	#define FUSE_USE_VERSION 399
+	#include <fuse.h>
+
+int
+main (void)
+{
+
+	struct fuse_operations fs_ops = {
+		.iomap_begin = NULL,
+		.iomap_end = NULL,
+	};
+	struct fuse_file_iomap narf = { };
+
+  ;
+  return 0;
+}
+
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+  have_fuse_iomap=yes
+	   { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+else case e in #(
+  e) { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; } ;;
+esac
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+    conftest$ac_exeext conftest.$ac_ext
+	if test "$have_fuse_iomap" = yes
+	then
+		FUSE_USE_VERSION=399
+	fi
 fi
 if test -n "$FUSE_USE_VERSION"
 then
 
 printf "%s\n" "#define FUSE_USE_VERSION $FUSE_USE_VERSION" >>confdefs.h
 
+fi
+if test -n "$have_fuse_iomap"
+then
+
+printf "%s\n" "#define HAVE_FUSE_IOMAP 1" >>confdefs.h
+
 fi
 
 have_fuse_lowlevel=
diff --git a/configure.ac b/configure.ac
index 63a5cd697a6dde..8d85e9966877ea 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1385,6 +1385,7 @@ dnl
 dnl Set FUSE_USE_VERSION, which is how fuse servers build against a particular
 dnl libfuse ABI. Currently we link against the libfuse 3.19 ABI (hence 319)
 dnl
+have_fuse_iomap=
 if test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=319
@@ -1394,12 +1395,42 @@ then
 		[AC_MSG_FAILURE([Cannot build against fuse3 headers])],
 [#define _FILE_OFFSET_BITS 64
 #define FUSE_USE_VERSION 319])
+
+	dnl
+	dnl Check if the fuse library supports iomap, which requires a higher
+	dnl FUSE_USE_VERSION ABI version (3.99)
+	dnl
+	AC_MSG_CHECKING(for iomap_begin in libfuse)
+	AC_LINK_IFELSE(
+	[	AC_LANG_PROGRAM([[
+	#define _GNU_SOURCE
+	#define _FILE_OFFSET_BITS 64
+	#define FUSE_USE_VERSION 399
+	#include <fuse.h>
+		]], [[
+	struct fuse_operations fs_ops = {
+		.iomap_begin = NULL,
+		.iomap_end = NULL,
+	};
+	struct fuse_file_iomap narf = { };
+		]])
+	], have_fuse_iomap=yes
+	   AC_MSG_RESULT(yes),
+	   AC_MSG_RESULT(no))
+	if test "$have_fuse_iomap" = yes
+	then
+		FUSE_USE_VERSION=399
+	fi
 fi
 if test -n "$FUSE_USE_VERSION"
 then
 	AC_DEFINE_UNQUOTED(FUSE_USE_VERSION, $FUSE_USE_VERSION,
 		[Define to the version of FUSE to use])
 fi
+if test -n "$have_fuse_iomap"
+then
+	AC_DEFINE(HAVE_FUSE_IOMAP, 1, [Define to 1 if fuse supports iomap])
+fi
 
 dnl
 dnl Check if the FUSE lowlevel library is supported
diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index dc5a0ede9f5072..a159024f778ba2 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -155,6 +155,9 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
 	return b - m;
 }
 
+#define max(a, b) ((a) > (b) ? (a) : (b))
+#define min(a, b) ((a) < (b) ? (a) : (b))
+
 #define dbg_printf(fuse4fs, format, ...) \
 	while ((fuse4fs)->debug) { \
 		printf("FUSE4FS (%s): tid=%llu " format, (fuse4fs)->shortdev, get_thread_id(), ##__VA_ARGS__); \
@@ -233,6 +236,14 @@ enum fuse4fs_opstate {
 	F4OP_SHUTDOWN,
 };
 
+#ifdef HAVE_FUSE_IOMAP
+enum fuse4fs_iomap_state {
+	IOMAP_DISABLED,
+	IOMAP_UNKNOWN,
+	IOMAP_ENABLED,
+};
+#endif
+
 /* Main program context */
 #define FUSE4FS_MAGIC (0xEF53DEADUL)
 struct fuse4fs {
@@ -260,6 +271,9 @@ struct fuse4fs {
 	int logfd;
 	int blocklog;
 	int oom_score_adj;
+#ifdef HAVE_FUSE_IOMAP
+	enum fuse4fs_iomap_state iomap_state;
+#endif
 	unsigned int blockmask;
 	unsigned long offset;
 	unsigned int next_generation;
@@ -882,6 +896,15 @@ fuse4fs_set_handle(struct fuse_file_info *fp, struct fuse4fs_file_handle *fh)
 	fp->keep_cache = 1;
 }
 
+#ifdef HAVE_FUSE_IOMAP
+static inline int fuse4fs_iomap_enabled(const struct fuse4fs *ff)
+{
+	return ff->iomap_state >= IOMAP_ENABLED;
+}
+#else
+# define fuse4fs_iomap_enabled(...) (0)
+#endif
+
 static void get_now(struct timespec *now)
 {
 #ifdef CLOCK_REALTIME
@@ -1514,7 +1537,7 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff)
 	char options[128];
 	double deadline;
 	int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW |
-		    EXT2_FLAG_EXCLUSIVE;
+		    EXT2_FLAG_EXCLUSIVE | EXT2_FLAG_WRITE_FULL_SUPER;
 	errcode_t err;
 
 	if (ff->lockfile) {
@@ -1808,6 +1831,15 @@ static void op_destroy(void *userdata)
 				(stats->cache_hits + stats->cache_misses));
 	}
 
+	/*
+	 * If we're mounting in iomap mode, we need to unmount in op_destroy so
+	 * that the block device will be released before umount(2) returns.
+	 */
+	if (ff->iomap_state == IOMAP_ENABLED) {
+		fuse4fs_mmp_cancel(ff);
+		fuse4fs_unmount(ff);
+	}
+
 	fuse4fs_finish(ff, 0);
 }
 
@@ -1948,6 +1980,44 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
 }
 #endif
 
+#ifdef HAVE_FUSE_IOMAP
+static inline bool fuse4fs_wants_iomap(struct fuse4fs *ff)
+{
+	if (ff->iomap_state == IOMAP_DISABLED)
+		return false;
+
+	/* iomap only works with block devices */
+	if (!(ff->fs->io->flags & CHANNEL_FLAGS_BLOCK_DEVICE))
+		return false;
+
+	/*
+	 * iomap addrs must be aligned to the bdev lba size; we use fs
+	 * blocksize as a proxy here
+	 */
+	if (ff->offset % ff->fs->blocksize > 0)
+		return false;
+
+	return true;
+}
+
+static void fuse4fs_iomap_enable(struct fuse_conn_info *conn,
+				 struct fuse4fs *ff)
+{
+	/* Don't let anyone touch iomap until the end of the patchset. */
+	ff->iomap_state = IOMAP_DISABLED;
+	return;
+
+	if (fuse4fs_wants_iomap(ff) &&
+	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
+		ff->iomap_state = IOMAP_ENABLED;
+
+	if (ff->iomap_state == IOMAP_UNKNOWN)
+		ff->iomap_state = IOMAP_DISABLED;
+}
+#else
+# define fuse4fs_iomap_enable(...) ((void)0)
+#endif
+
 static void op_init(void *userdata, struct fuse_conn_info *conn)
 {
 	struct fuse4fs *ff = userdata;
@@ -1970,6 +2040,7 @@ static void op_init(void *userdata, struct fuse_conn_info *conn)
 #ifdef FUSE_CAP_NO_EXPORT_SUPPORT
 	fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
 #endif
+	fuse4fs_iomap_enable(conn, ff);
 	conn->time_gran = 1;
 
 	if (ff->opstate == F4OP_WRITABLE)
@@ -5917,6 +5988,466 @@ static void op_fallocate(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
 }
 #endif /* SUPPORT_FALLOCATE */
 
+#ifdef HAVE_FUSE_IOMAP
+static void fuse4fs_iomap_hole(struct fuse4fs *ff, struct fuse_file_iomap *iomap,
+			       off_t pos, uint64_t count)
+{
+	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->addr = FUSE_IOMAP_NULL_ADDR;
+	iomap->offset = pos;
+	iomap->length = count;
+	iomap->type = FUSE_IOMAP_TYPE_HOLE;
+}
+
+static void fuse4fs_iomap_hole_to_eof(struct fuse4fs *ff,
+				      struct fuse_file_iomap *iomap, off_t pos,
+				      off_t count,
+				      const struct ext2_inode_large *inode)
+{
+	ext2_filsys fs = ff->fs;
+	uint64_t isize = EXT2_I_SIZE(inode);
+
+	/*
+	 * We have to be careful about handling a hole to the right of the
+	 * entire mapping tree. First, the mapping must start and end on a
+	 * block boundary because they must be aligned to at least an LBA for
+	 * the block layer; and to the fsblock for smoother operation.
+	 *
+	 * As for the length -- we could return a mapping all the way to
+	 * i_size, but i_size could be less than pos/count if we're zeroing the
+	 * EOF block in anticipation of a truncate operation. Similarly, we
+	 * don't want to end the mapping at pos+count because we know there's
+	 * nothing mapped beyond here.
+	 */
+	uint64_t startoff = round_down(pos, fs->blocksize);
+	uint64_t eofoff = round_up(max(pos + count, isize), fs->blocksize);
+
+	dbg_printf(ff,
+ "pos=0x%llx count=0x%llx isize=0x%llx startoff=0x%llx eofoff=0x%llx\n",
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   (unsigned long long)isize,
+		   (unsigned long long)startoff,
+		   (unsigned long long)eofoff);
+
+	fuse4fs_iomap_hole(ff, iomap, startoff, eofoff - startoff);
+}
+
+#define DEBUG_IOMAP
+#ifdef DEBUG_IOMAP
+# define __DUMP_EXTENT(ff, func, tag, startoff, err, extent) \
+	do { \
+		dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", \
+			   (func), (tag), (startoff), (err), (extent)->e_lblk, \
+			   (extent)->e_pblk, (extent)->e_len, \
+			   (extent)->e_flags & EXT2_EXTENT_FLAGS_UNINIT); \
+	} while(0)
+# define DUMP_EXTENT(ff, tag, startoff, err, extent) \
+	__DUMP_EXTENT((ff), __func__, (tag), (startoff), (err), (extent))
+
+# define __DUMP_INFO(ff, func, tag, startoff, err, info) \
+	do { \
+		dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld entry %d/%d/%d level %d/%d\n", \
+			   (func), (tag), (startoff), (err), \
+			   (info)->curr_entry, (info)->num_entries, \
+			   (info)->max_entries, (info)->curr_level, \
+			   (info)->max_depth); \
+	} while(0)
+# define DUMP_INFO(ff, tag, startoff, err, info) \
+	__DUMP_INFO((ff), __func__, (tag), (startoff), (err), (info))
+#else
+# define __DUMP_EXTENT(...) ((void)0)
+# define DUMP_EXTENT(...) ((void)0)
+# define DUMP_INFO(...) ((void)0)
+#endif
+
+static inline errcode_t __fuse4fs_get_mapping_at(struct fuse4fs *ff,
+						 ext2_extent_handle_t handle,
+						 blk64_t startoff,
+						 struct ext2fs_extent *bmap,
+						 const char *func)
+{
+	errcode_t err;
+
+	/*
+	 * Find the file mapping at startoff. We don't check the return value
+	 * of _goto because _get will error out if _goto failed. There's a
+	 * subtlety to the outcome of _goto when startoff falls in a sparse
+	 * hole however:
+	 *
+	 * Most of the time, _goto points the cursor at the mapping whose lblk
+	 * is just to the left of startoff. The mapping may or may not overlap
+	 * startoff; this is ok. In other words, the tree lookup behaves as if
+	 * we asked it to use a less than or equals comparison.
+	 *
+	 * However, if startoff is to the left of the first mapping in the
+	 * extent tree, _goto points the cursor at that first mapping because
+	 * it doesn't know how to deal with this situation. In this case,
+	 * the tree lookup behaves as if we asked it to use a greater than
+	 * or equals comparison.
+	 *
+	 * Note: If _get() returns 'no current node', that means that there
+	 * aren't any mappings at all.
+	 */
+	ext2fs_extent_goto(handle, startoff);
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_CURRENT, bmap);
+	__DUMP_EXTENT(ff, func, "lookup", startoff, err, bmap);
+	if (err == EXT2_ET_NO_CURRENT_NODE)
+		err = EXT2_ET_EXTENT_NOT_FOUND;
+	return err;
+}
+
+static inline errcode_t __fuse4fs_get_next_mapping(struct fuse4fs *ff,
+						   ext2_extent_handle_t handle,
+						   blk64_t startoff,
+						   struct ext2fs_extent *bmap,
+						   const char *func)
+{
+	struct ext2fs_extent newex;
+	struct ext2_extent_info info;
+	errcode_t err;
+
+	/*
+	 * The extent tree code has this (probably broken) behavior that if
+	 * more than two of the highest levels of the cursor point at the
+	 * rightmost edge of an extent tree block, a _NEXT_LEAF movement fails
+	 * to move the cursor position of any of the lower levels. IOWs, if
+	 * leaf level N is at the right edge, it will only advance level N-1
+	 * to the right. If N-1 was at the right edge, the cursor resets to
+	 * record 0 of that level and goes down to the wrong leaf.
+	 *
+	 * Work around this by walking up (towards root level 0) the extent
+	 * tree until we find a level where we're not already at the rightmost
+	 * edge. The _NEXT_LEAF movement will walk down the tree to find the
+	 * leaves.
+	 */
+	err = ext2fs_extent_get_info(handle, &info);
+	DUMP_INFO(ff, "UP?", startoff, err, &info);
+	if (err)
+		return err;
+
+	while (info.curr_entry == info.num_entries && info.curr_level > 0) {
+		err = ext2fs_extent_get(handle, EXT2_EXTENT_UP, &newex);
+		DUMP_EXTENT(ff, "UP", startoff, err, &newex);
+		if (err)
+			return err;
+		err = ext2fs_extent_get_info(handle, &info);
+		DUMP_INFO(ff, "UP", startoff, err, &info);
+		if (err)
+			return err;
+	}
+
+	/*
+	 * If we're at the root and there are no more entries, there's nothing
+	 * else to be found.
+	 */
+	if (info.curr_level == 0 && info.curr_entry == info.num_entries)
+		return EXT2_ET_EXTENT_NOT_FOUND;
+
+	/* Otherwise grab this next leaf and return it. */
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT_LEAF, &newex);
+	DUMP_EXTENT(ff, "NEXT", startoff, err, &newex);
+	if (err)
+		return err;
+
+	*bmap = newex;
+	return 0;
+}
+
+#define fuse4fs_get_mapping_at(ff, handle, startoff, bmap) \
+	__fuse4fs_get_mapping_at((ff), (handle), (startoff), (bmap), __func__)
+#define fuse4fs_get_next_mapping(ff, handle, startoff, bmap) \
+	__fuse4fs_get_next_mapping((ff), (handle), (startoff), (bmap), __func__)
+
+static errcode_t fuse4fs_iomap_begin_extent(struct fuse4fs *ff, uint64_t ino,
+					    struct ext2_inode_large *inode,
+					    off_t pos, uint64_t count,
+					    uint32_t opflags,
+					    struct fuse_file_iomap *iomap)
+{
+	ext2_extent_handle_t handle;
+	struct ext2fs_extent extent = { };
+	ext2_filsys fs = ff->fs;
+	const blk64_t startoff = FUSE4FS_B_TO_FSBT(ff, pos);
+	errcode_t err;
+	int ret = 0;
+
+	err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = fuse4fs_get_mapping_at(ff, handle, startoff, &extent);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+		/* No mappings at all; the whole range is a hole. */
+		fuse4fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+		goto out_handle;
+	}
+	if (err) {
+		ret = translate_error(fs, ino, err);
+		goto out_handle;
+	}
+
+	if (startoff < extent.e_lblk) {
+		/*
+		 * Mapping starts to the right of the current position.
+		 * Synthesize a hole going to that next extent.
+		 */
+		fuse4fs_iomap_hole(ff, iomap, FUSE4FS_FSB_TO_B(ff, startoff),
+				FUSE4FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+		goto out_handle;
+	}
+
+	if (startoff >= extent.e_lblk + extent.e_len) {
+		/*
+		 * Mapping ends to the left of the current position. Try to
+		 * find the next mapping. If there is no next mapping, the
+		 * whole range is in a hole.
+		 */
+		err = fuse4fs_get_next_mapping(ff, handle, startoff, &extent);
+		if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+			fuse4fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+			goto out_handle;
+		}
+
+		/*
+		 * If the new mapping starts to the right of startoff, there's
+		 * a hole from startoff to the start of the new mapping.
+		 */
+		if (startoff < extent.e_lblk) {
+			fuse4fs_iomap_hole(ff, iomap,
+				FUSE4FS_FSB_TO_B(ff, startoff),
+				FUSE4FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+			goto out_handle;
+		}
+
+		/*
+		 * The new mapping starts at startoff. Something weird
+		 * happened in the extent tree lookup, but we found a valid
+		 * mapping so we'll run with it.
+		 */
+	}
+
+	/* Mapping overlaps startoff, report this. */
+	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->addr = FUSE4FS_FSB_TO_B(ff, extent.e_pblk) + ff->offset;
+	iomap->offset = FUSE4FS_FSB_TO_B(ff, extent.e_lblk);
+	iomap->length = FUSE4FS_FSB_TO_B(ff, extent.e_len);
+	if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)
+		iomap->type = FUSE_IOMAP_TYPE_UNWRITTEN;
+	else
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+
+out_handle:
+	ext2fs_extent_free(handle);
+	return ret;
+}
+
+static int fuse4fs_iomap_begin_indirect(struct fuse4fs *ff, uint64_t ino,
+					struct ext2_inode_large *inode,
+					off_t pos, uint64_t count,
+					uint32_t opflags,
+					struct fuse_file_iomap *iomap)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE4FS_B_TO_FSBT(ff, pos);
+	uint64_t isize = EXT2_I_SIZE(inode);
+	uint64_t real_count = min(count, 131072);
+	const blk64_t endoff = FUSE4FS_B_TO_FSB(ff, pos + real_count);
+	blk64_t startblock;
+	errcode_t err;
+
+	err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, startoff, NULL,
+			   &startblock);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->offset = FUSE4FS_FSB_TO_B(ff, startoff);
+	iomap->flags |= FUSE_IOMAP_F_MERGED;
+	if (startblock) {
+		iomap->addr = FUSE4FS_FSB_TO_B(ff, startblock) + ff->offset;
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+	} else {
+		iomap->addr = FUSE_IOMAP_NULL_ADDR;
+		iomap->type = FUSE_IOMAP_TYPE_HOLE;
+	}
+	iomap->length = fs->blocksize;
+
+	/* See how long the mapping goes for. */
+	for (startoff++; startoff < endoff; startoff++) {
+		blk64_t prev_startblock = startblock;
+
+		err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0,
+				   startoff, NULL, &startblock);
+		if (err)
+			break;
+
+		if (iomap->type == FUSE_IOMAP_TYPE_MAPPED) {
+			if (startblock == prev_startblock + 1)
+				iomap->length += fs->blocksize;
+			else
+				break;
+		} else {
+			if (startblock == 0)
+				iomap->length += fs->blocksize;
+			else
+				break;
+		}
+	}
+
+	/*
+	 * If this is a hole that goes beyond EOF, report this as a hole to the
+	 * end of the range queried so that FIEMAP doesn't go mad.
+	 */
+	if (iomap->type == FUSE_IOMAP_TYPE_HOLE &&
+	    iomap->offset + iomap->length >= isize)
+		fuse4fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+
+	return 0;
+}
+
+static int fuse4fs_iomap_begin_inline(struct fuse4fs *ff, ext2_ino_t ino,
+				      struct ext2_inode_large *inode, off_t pos,
+				      uint64_t count, struct fuse_file_iomap *iomap)
+{
+	uint64_t one_fsb = FUSE4FS_FSB_TO_B(ff, 1);
+
+	if (pos >= one_fsb) {
+		fuse4fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+	} else {
+		/* ext4 only supports inline data files up to 1 fsb */
+		iomap->dev = FUSE_IOMAP_DEV_NULL;
+		iomap->addr = FUSE_IOMAP_NULL_ADDR;
+		iomap->offset = 0;
+		iomap->length = one_fsb;
+		iomap->type = FUSE_IOMAP_TYPE_INLINE;
+	}
+
+	return 0;
+}
+
+static int fuse4fs_iomap_begin_report(struct fuse4fs *ff, ext2_ino_t ino,
+				      struct ext2_inode_large *inode,
+				      off_t pos, uint64_t count,
+				      uint32_t opflags,
+				      struct fuse_file_iomap *read)
+{
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return fuse4fs_iomap_begin_inline(ff, ino, inode, pos, count,
+						  read);
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return fuse4fs_iomap_begin_extent(ff, ino, inode, pos, count,
+						  opflags, read);
+
+	return fuse4fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+					    opflags, read);
+}
+
+static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino,
+				    struct ext2_inode_large *inode, off_t pos,
+				    uint64_t count, uint32_t opflags,
+				    struct fuse_file_iomap *read)
+{
+	return -ENOSYS;
+}
+
+static int fuse4fs_iomap_begin_write(struct fuse4fs *ff, ext2_ino_t ino,
+				     struct ext2_inode_large *inode, off_t pos,
+				     uint64_t count, uint32_t opflags,
+				     struct fuse_file_iomap *read)
+{
+	return -ENOSYS;
+}
+
+static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
+			   off_t pos, uint64_t count, uint32_t opflags)
+{
+	struct fuse4fs *ff = fuse4fs_get(req);
+	struct ext2_inode_large inode;
+	struct fuse_file_iomap read = { };
+	ext2_filsys fs;
+	ext2_ino_t ino;
+	errcode_t err;
+	int ret = 0;
+
+	FUSE4FS_CHECK_CONTEXT(req);
+	FUSE4FS_CONVERT_FINO(req, &ino, fino);
+
+	dbg_printf(ff, "%s: ino=%d pos=0x%llx count=0x%llx opflags=0x%x\n",
+		   __func__, ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags);
+
+	fs = fuse4fs_start(ff);
+	err = fuse4fs_read_inode(fs, ino, &inode);
+	if (err) {
+		ret = translate_error(fs, ino, err);
+		goto out_unlock;
+	}
+
+	if (opflags & FUSE_IOMAP_OP_REPORT)
+		ret = fuse4fs_iomap_begin_report(ff, ino, &inode, pos, count,
+						 opflags, &read);
+	else if (fuse_iomap_is_write(opflags))
+		ret = fuse4fs_iomap_begin_write(ff, ino, &inode, pos, count,
+						opflags, &read);
+	else
+		ret = fuse4fs_iomap_begin_read(ff, ino, &inode, pos, count,
+					       opflags, &read);
+	if (ret)
+		goto out_unlock;
+
+	dbg_printf(ff,
+ "%s: ino=%d pos=0x%llx -> addr=0x%llx offset=0x%llx length=0x%llx type=%u flags=0x%x\n",
+		   __func__, ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)read.addr,
+		   (unsigned long long)read.offset,
+		   (unsigned long long)read.length,
+		   read.type,
+		   read.flags);
+
+	/* Not filling even the first byte will make the kernel unhappy. */
+	if (read.offset > pos || read.offset + read.length <= pos) {
+		ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED);
+		goto out_unlock;
+	}
+
+out_unlock:
+	fuse4fs_finish(ff, ret);
+	if (ret)
+		fuse_reply_err(req, -ret);
+	else
+		fuse_reply_iomap_begin(req, &read, NULL);
+}
+
+static void op_iomap_end(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
+			 off_t pos, uint64_t count, uint32_t opflags,
+			 ssize_t written, const struct fuse_file_iomap *iomap)
+{
+	struct fuse4fs *ff = fuse4fs_get(req);
+	ext2_ino_t ino;
+
+	FUSE4FS_CHECK_CONTEXT(req);
+	FUSE4FS_CONVERT_FINO(req, &ino, fino);
+
+	dbg_printf(ff,
+ "%s: ino=%d pos=0x%llx count=0x%llx opflags=0x%x written=0x%zx mapflags=0x%x\n",
+		   __func__, ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags,
+		   written,
+		   iomap->flags);
+
+	fuse_reply_err(req, 0);
+}
+#endif /* HAVE_FUSE_IOMAP */
+
 static struct fuse_lowlevel_ops fs_ops = {
 	.lookup = op_lookup,
 	.setattr = op_setattr,
@@ -5960,6 +6491,10 @@ static struct fuse_lowlevel_ops fs_ops = {
 #ifdef SUPPORT_FALLOCATE
 	.fallocate = op_fallocate,
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	.iomap_begin = op_iomap_begin,
+	.iomap_end = op_iomap_end,
+#endif /* HAVE_FUSE_IOMAP */
 };
 
 static int get_random_bytes(void *p, size_t sz)
@@ -6413,6 +6948,9 @@ int main(int argc, char *argv[])
 		.opstate = F4OP_WRITABLE,
 #ifdef HAVE_FUSE4FS_SERVICE
 		.bdev_fd = -1,
+#endif
+#ifdef HAVE_FUSE_IOMAP
+		.iomap_state = IOMAP_UNKNOWN,
 #endif
 	};
 	errcode_t err;
diff --git a/lib/config.h.in b/lib/config.h.in
index 2c25632188e4f3..58338cc926590e 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -148,6 +148,9 @@
 /* Define to 1 if you have the <fuse.h> header file.
*/ #undef HAVE_FUSE_H +/* Define to 1 if fuse supports iomap */ +#undef HAVE_FUSE_IOMAP + /* Define to 1 if fuse supports lowlevel API */ #undef HAVE_FUSE_LOWLEVEL diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 0f4781bc49f18f..63c9b59e54fb04 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -138,6 +138,9 @@ static inline uint64_t round_down(uint64_t b, unsigned int align) return b - m; } +#define max(a, b) ((a) > (b) ? (a) : (b)) +#define min(a, b) ((a) < (b) ? (a) : (b)) + #define dbg_printf(fuse2fs, format, ...) \ while ((fuse2fs)->debug) { \ printf("FUSE2FS (%s): tid=%llu " format, (fuse2fs)->shortdev, get_thread_id(), ##__VA_ARGS__); \ @@ -215,6 +218,14 @@ enum fuse2fs_opstate { F2OP_SHUTDOWN, }; +#ifdef HAVE_FUSE_IOMAP +enum fuse2fs_iomap_state { + IOMAP_DISABLED, + IOMAP_UNKNOWN, + IOMAP_ENABLED, +}; +#endif + /* Main program context */ #define FUSE2FS_MAGIC (0xEF53DEADUL) struct fuse2fs { @@ -242,6 +253,9 @@ struct fuse2fs { int logfd; int blocklog; int oom_score_adj; +#ifdef HAVE_FUSE_IOMAP + enum fuse2fs_iomap_state iomap_state; +#endif unsigned int blockmask; unsigned long offset; unsigned int next_generation; @@ -693,6 +707,15 @@ fuse2fs_set_handle(struct fuse_file_info *fp, struct fuse2fs_file_handle *fh) fp->fh = (uintptr_t)fh; } +#ifdef HAVE_FUSE_IOMAP +static inline int fuse2fs_iomap_enabled(const struct fuse2fs *ff) +{ + return ff->iomap_state >= IOMAP_ENABLED; +} +#else +# define fuse2fs_iomap_enabled(...) 
(0) +#endif + static void get_now(struct timespec *now) { #ifdef CLOCK_REALTIME @@ -1122,7 +1145,7 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff) char options[128]; double deadline; int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW | - EXT2_FLAG_EXCLUSIVE; + EXT2_FLAG_EXCLUSIVE | EXT2_FLAG_WRITE_FULL_SUPER; errcode_t err; if (ff->lockfile) { @@ -1409,6 +1432,15 @@ static void op_destroy(void *p EXT2FS_ATTR((unused))) (stats->cache_hits + stats->cache_misses)); } + /* + * If we're mounting in iomap mode, we need to unmount in op_destroy so + * that the block device will be released before umount(2) returns. + */ + if (ff->iomap_state == IOMAP_ENABLED) { + fuse2fs_mmp_cancel(ff); + fuse2fs_unmount(ff); + } + fuse2fs_finish(ff, 0); } @@ -1545,6 +1577,44 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn, } #endif +#ifdef HAVE_FUSE_IOMAP +static inline bool fuse2fs_wants_iomap(struct fuse2fs *ff) +{ + if (ff->iomap_state == IOMAP_DISABLED) + return false; + + /* iomap only works with block devices */ + if (!(ff->fs->io->flags & CHANNEL_FLAGS_BLOCK_DEVICE)) + return false; + + /* + * iomap addrs must be aligned to the bdev lba size; we use fs + * blocksize as a proxy here + */ + if (ff->offset % ff->fs->blocksize > 0) + return false; + + return true; +} + +static void fuse2fs_iomap_enable(struct fuse_conn_info *conn, + struct fuse2fs *ff) +{ + /* Don't let anyone touch iomap until the end of the patchset. */ + ff->iomap_state = IOMAP_DISABLED; + return; + + if (fuse2fs_wants_iomap(ff) && + fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) + ff->iomap_state = IOMAP_ENABLED; + + if (ff->iomap_state == IOMAP_UNKNOWN) + ff->iomap_state = IOMAP_DISABLED; +} +#else +# define fuse2fs_iomap_enable(...) 
((void)0) +#endif + static void *op_init(struct fuse_conn_info *conn, struct fuse_config *cfg EXT2FS_ATTR((unused))) { @@ -1578,6 +1648,8 @@ static void *op_init(struct fuse_conn_info *conn, #ifdef FUSE_CAP_NO_EXPORT_SUPPORT fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT); #endif + fuse2fs_iomap_enable(conn, ff); + conn->time_gran = 1; cfg->use_ino = 1; if (ff->debug) @@ -5150,6 +5222,465 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode, } #endif /* SUPPORT_FALLOCATE */ +#ifdef HAVE_FUSE_IOMAP +static void fuse2fs_iomap_hole(struct fuse2fs *ff, struct fuse_file_iomap *iomap, + off_t pos, uint64_t count) +{ + iomap->dev = FUSE_IOMAP_DEV_NULL; + iomap->addr = FUSE_IOMAP_NULL_ADDR; + iomap->offset = pos; + iomap->length = count; + iomap->type = FUSE_IOMAP_TYPE_HOLE; +} + +static void fuse2fs_iomap_hole_to_eof(struct fuse2fs *ff, + struct fuse_file_iomap *iomap, off_t pos, + off_t count, + const struct ext2_inode_large *inode) +{ + ext2_filsys fs = ff->fs; + uint64_t isize = EXT2_I_SIZE(inode); + + /* + * We have to be careful about handling a hole to the right of the + * entire mapping tree. First, the mapping must start and end on a + * block boundary because they must be aligned to at least an LBA for + * the block layer; and to the fsblock for smoother operation. + * + * As for the length -- we could return a mapping all the way to + * i_size, but i_size could be less than pos/count if we're zeroing the + * EOF block in anticipation of a truncate operation. Similarly, we + * don't want to end the mapping at pos+count because we know there's + * nothing mapped beyond here. 
+ */ + uint64_t startoff = round_down(pos, fs->blocksize); + uint64_t eofoff = round_up(max(pos + count, isize), fs->blocksize); + + dbg_printf(ff, + "pos=0x%llx count=0x%llx isize=0x%llx startoff=0x%llx eofoff=0x%llx\n", + (unsigned long long)pos, + (unsigned long long)count, + (unsigned long long)isize, + (unsigned long long)startoff, + (unsigned long long)eofoff); + + fuse2fs_iomap_hole(ff, iomap, startoff, eofoff - startoff); +} + +#define DEBUG_IOMAP +#ifdef DEBUG_IOMAP +# define __DUMP_EXTENT(ff, func, tag, startoff, err, extent) \ + do { \ + dbg_printf((ff), \ + "%s: %s startoff 0x%llx err %ld lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", \ + (func), (tag), (startoff), (err), (extent)->e_lblk, \ + (extent)->e_pblk, (extent)->e_len, \ + (extent)->e_flags & EXT2_EXTENT_FLAGS_UNINIT); \ + } while(0) +# define DUMP_EXTENT(ff, tag, startoff, err, extent) \ + __DUMP_EXTENT((ff), __func__, (tag), (startoff), (err), (extent)) + +# define __DUMP_INFO(ff, func, tag, startoff, err, info) \ + do { \ + dbg_printf((ff), \ + "%s: %s startoff 0x%llx err %ld entry %d/%d/%d level %d/%d\n", \ + (func), (tag), (startoff), (err), \ + (info)->curr_entry, (info)->num_entries, \ + (info)->max_entries, (info)->curr_level, \ + (info)->max_depth); \ + } while(0) +# define DUMP_INFO(ff, tag, startoff, err, info) \ + __DUMP_INFO((ff), __func__, (tag), (startoff), (err), (info)) +#else +# define __DUMP_EXTENT(...) ((void)0) +# define DUMP_EXTENT(...) ((void)0) +# define DUMP_INFO(...) ((void)0) +#endif + +static inline errcode_t __fuse2fs_get_mapping_at(struct fuse2fs *ff, + ext2_extent_handle_t handle, + blk64_t startoff, + struct ext2fs_extent *bmap, + const char *func) +{ + errcode_t err; + + /* + * Find the file mapping at startoff. We don't check the return value + * of _goto because _get will error out if _goto failed. 
There's a + * subtlety to the outcome of _goto when startoff falls in a sparse + * hole however: + * + * Most of the time, _goto points the cursor at the mapping whose lblk + * is just to the left of startoff. The mapping may or may not overlap + * startoff; this is ok. In other words, the tree lookup behaves as if + * we asked it to use a less than or equals comparison. + * + * However, if startoff is to the left of the first mapping in the + * extent tree, _goto points the cursor at that first mapping because + * it doesn't know how to deal with this situation. In this case, + * the tree lookup behaves as if we asked it to use a greater than + * or equals comparison. + * + * Note: If _get() returns 'no current node', that means that there + * aren't any mappings at all. + */ + ext2fs_extent_goto(handle, startoff); + err = ext2fs_extent_get(handle, EXT2_EXTENT_CURRENT, bmap); + __DUMP_EXTENT(ff, func, "lookup", startoff, err, bmap); + if (err == EXT2_ET_NO_CURRENT_NODE) + err = EXT2_ET_EXTENT_NOT_FOUND; + return err; +} + +static inline errcode_t __fuse2fs_get_next_mapping(struct fuse2fs *ff, + ext2_extent_handle_t handle, + blk64_t startoff, + struct ext2fs_extent *bmap, + const char *func) +{ + struct ext2fs_extent newex; + struct ext2_extent_info info; + errcode_t err; + + /* + * The extent tree code has this (probably broken) behavior that if + * more than two of the highest levels of the cursor point at the + * rightmost edge of an extent tree block, a _NEXT_LEAF movement fails + * to move the cursor position of any of the lower levels. IOWs, if + * leaf level N is at the right edge, it will only advance level N-1 + * to the right. If N-1 was at the right edge, the cursor resets to + * record 0 of that level and goes down to the wrong leaf. + * + * Work around this by walking up (towards root level 0) the extent + * tree until we find a level where we're not already at the rightmost + * edge. 
The _NEXT_LEAF movement will walk down the tree to find the + * leaves. + */ + err = ext2fs_extent_get_info(handle, &info); + DUMP_INFO(ff, "UP?", startoff, err, &info); + if (err) + return err; + + while (info.curr_entry == info.num_entries && info.curr_level > 0) { + err = ext2fs_extent_get(handle, EXT2_EXTENT_UP, &newex); + DUMP_EXTENT(ff, "UP", startoff, err, &newex); + if (err) + return err; + err = ext2fs_extent_get_info(handle, &info); + DUMP_INFO(ff, "UP", startoff, err, &info); + if (err) + return err; + } + + /* + * If we're at the root and there are no more entries, there's nothing + * else to be found. + */ + if (info.curr_level == 0 && info.curr_entry == info.num_entries) + return EXT2_ET_EXTENT_NOT_FOUND; + + /* Otherwise grab this next leaf and return it. */ + err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT_LEAF, &newex); + DUMP_EXTENT(ff, "NEXT", startoff, err, &newex); + if (err) + return err; + + *bmap = newex; + return 0; +} + +#define fuse2fs_get_mapping_at(ff, handle, startoff, bmap) \ + __fuse2fs_get_mapping_at((ff), (handle), (startoff), (bmap), __func__) +#define fuse2fs_get_next_mapping(ff, handle, startoff, bmap) \ + __fuse2fs_get_next_mapping((ff), (handle), (startoff), (bmap), __func__) + +static errcode_t fuse2fs_iomap_begin_extent(struct fuse2fs *ff, uint64_t ino, + struct ext2_inode_large *inode, + off_t pos, uint64_t count, + uint32_t opflags, + struct fuse_file_iomap *iomap) +{ + ext2_extent_handle_t handle; + struct ext2fs_extent extent = { }; + ext2_filsys fs = ff->fs; + const blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos); + errcode_t err; + int ret = 0; + + err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle); + if (err) + return translate_error(fs, ino, err); + + err = fuse2fs_get_mapping_at(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + /* No mappings at all; the whole range is a hole. 
*/ + fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode); + goto out_handle; + } + if (err) { + ret = translate_error(fs, ino, err); + goto out_handle; + } + + if (startoff < extent.e_lblk) { + /* + * Mapping starts to the right of the current position. + * Synthesize a hole going to that next extent. + */ + fuse2fs_iomap_hole(ff, iomap, FUSE2FS_FSB_TO_B(ff, startoff), + FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff)); + goto out_handle; + } + + if (startoff >= extent.e_lblk + extent.e_len) { + /* + * Mapping ends to the left of the current position. Try to + * find the next mapping. If there is no next mapping, the + * whole range is in a hole. + */ + err = fuse2fs_get_next_mapping(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode); + goto out_handle; + } + + /* + * If the new mapping starts to the right of startoff, there's + * a hole from startoff to the start of the new mapping. + */ + if (startoff < extent.e_lblk) { + fuse2fs_iomap_hole(ff, iomap, + FUSE2FS_FSB_TO_B(ff, startoff), + FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff)); + goto out_handle; + } + + /* + * The new mapping starts at startoff. Something weird + * happened in the extent tree lookup, but we found a valid + * mapping so we'll run with it. + */ + } + + /* Mapping overlaps startoff, report this. 
*/ + iomap->dev = FUSE_IOMAP_DEV_NULL; + iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk) + ff->offset; + iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk); + iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len); + if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) + iomap->type = FUSE_IOMAP_TYPE_UNWRITTEN; + else + iomap->type = FUSE_IOMAP_TYPE_MAPPED; + +out_handle: + ext2fs_extent_free(handle); + return ret; +} + +static int fuse2fs_iomap_begin_indirect(struct fuse2fs *ff, uint64_t ino, + struct ext2_inode_large *inode, + off_t pos, uint64_t count, + uint32_t opflags, + struct fuse_file_iomap *iomap) +{ + ext2_filsys fs = ff->fs; + blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos); + uint64_t isize = EXT2_I_SIZE(inode); + uint64_t real_count = min(count, 131072); + const blk64_t endoff = FUSE2FS_B_TO_FSB(ff, pos + real_count); + blk64_t startblock; + errcode_t err; + + err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, startoff, NULL, + &startblock); + if (err) + return translate_error(fs, ino, err); + + iomap->dev = FUSE_IOMAP_DEV_NULL; + iomap->offset = FUSE2FS_FSB_TO_B(ff, startoff); + iomap->flags |= FUSE_IOMAP_F_MERGED; + if (startblock) { + iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock) + ff->offset; + iomap->type = FUSE_IOMAP_TYPE_MAPPED; + } else { + iomap->addr = FUSE_IOMAP_NULL_ADDR; + iomap->type = FUSE_IOMAP_TYPE_HOLE; + } + iomap->length = fs->blocksize; + + /* See how long the mapping goes for. 
*/ + for (startoff++; startoff < endoff; startoff++) { + blk64_t prev_startblock = startblock; + + err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, + startoff, NULL, &startblock); + if (err) + break; + + if (iomap->type == FUSE_IOMAP_TYPE_MAPPED) { + if (startblock == prev_startblock + 1) + iomap->length += fs->blocksize; + else + break; + } else { + if (startblock == 0) + iomap->length += fs->blocksize; + else + break; + } + } + + /* + * If this is a hole that goes beyond EOF, report this as a hole to the + * end of the range queried so that FIEMAP doesn't go mad. + */ + if (iomap->type == FUSE_IOMAP_TYPE_HOLE && + iomap->offset + iomap->length >= isize) + fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode); + + return 0; +} + +static int fuse2fs_iomap_begin_inline(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, off_t pos, + uint64_t count, struct fuse_file_iomap *iomap) +{ + uint64_t one_fsb = FUSE2FS_FSB_TO_B(ff, 1); + + if (pos >= one_fsb) { + fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode); + } else { + /* ext4 only supports inline data files up to 1 fsb */ + iomap->dev = FUSE_IOMAP_DEV_NULL; + iomap->addr = FUSE_IOMAP_NULL_ADDR; + iomap->offset = 0; + iomap->length = one_fsb; + iomap->type = FUSE_IOMAP_TYPE_INLINE; + } + + return 0; +} + +static int fuse2fs_iomap_begin_report(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, + off_t pos, uint64_t count, + uint32_t opflags, + struct fuse_file_iomap *read) +{ + if (inode->i_flags & EXT4_INLINE_DATA_FL) + return fuse2fs_iomap_begin_inline(ff, ino, inode, pos, count, + read); + + if (inode->i_flags & EXT4_EXTENTS_FL) + return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count, + opflags, read); + + return fuse2fs_iomap_begin_indirect(ff, ino, inode, pos, count, + opflags, read); +} + +static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, off_t pos, + uint64_t count, uint32_t opflags, + struct 
fuse_file_iomap *read) +{ + return -ENOSYS; +} + +static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, off_t pos, + uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read) +{ + return -ENOSYS; +} + +static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, + off_t pos, uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read, + struct fuse_file_iomap *write) +{ + struct fuse2fs *ff = fuse2fs_get(); + struct ext2_inode_large inode; + ext2_filsys fs; + errcode_t err; + int ret = 0; + + FUSE2FS_CHECK_CONTEXT(ff); + + dbg_printf(ff, + "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x\n", + __func__, path, + (unsigned long long)nodeid, + (unsigned long long)attr_ino, + (unsigned long long)pos, + (unsigned long long)count, + opflags); + + fs = fuse2fs_start(ff); + err = fuse2fs_read_inode(fs, attr_ino, &inode); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + + if (opflags & FUSE_IOMAP_OP_REPORT) + ret = fuse2fs_iomap_begin_report(ff, attr_ino, &inode, pos, + count, opflags, read); + else if (fuse_iomap_is_write(opflags)) + ret = fuse2fs_iomap_begin_write(ff, attr_ino, &inode, pos, + count, opflags, read); + else + ret = fuse2fs_iomap_begin_read(ff, attr_ino, &inode, pos, + count, opflags, read); + if (ret) + goto out_unlock; + + dbg_printf(ff, "%s: nodeid=%llu attr_ino=%llu pos=0x%llx -> addr=0x%llx offset=0x%llx length=0x%llx type=%u\n", + __func__, + (unsigned long long)nodeid, + (unsigned long long)attr_ino, + (unsigned long long)pos, + (unsigned long long)read->addr, + (unsigned long long)read->offset, + (unsigned long long)read->length, + read->type); + + /* Not filling even the first byte will make the kernel unhappy. 
*/ + if (read->offset > pos || read->offset + read->length <= pos) { + ret = translate_error(fs, attr_ino, EXT2_ET_INODE_CORRUPTED); + goto out_unlock; + } + +out_unlock: + fuse2fs_finish(ff, ret); + return ret; +} + +static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino, + off_t pos, uint64_t count, uint32_t opflags, + ssize_t written, const struct fuse_file_iomap *iomap) +{ + struct fuse2fs *ff = fuse2fs_get(); + + FUSE2FS_CHECK_CONTEXT(ff); + + dbg_printf(ff, + "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x written=0x%zx mapflags=0x%x\n", + __func__, path, + (unsigned long long)nodeid, + (unsigned long long)attr_ino, + (unsigned long long)pos, + (unsigned long long)count, + opflags, + written, + iomap->flags); + + return 0; +} +#endif /* HAVE_FUSE_IOMAP */ + static struct fuse_operations fs_ops = { .init = op_init, .destroy = op_destroy, @@ -5191,6 +5722,10 @@ static struct fuse_operations fs_ops = { #ifdef SUPPORT_FALLOCATE .fallocate = op_fallocate, #endif +#ifdef HAVE_FUSE_IOMAP + .iomap_begin = op_iomap_begin, + .iomap_end = op_iomap_end, +#endif /* HAVE_FUSE_IOMAP */ }; static int get_random_bytes(void *p, size_t sz) @@ -5477,6 +6012,9 @@ int main(int argc, char *argv[]) .bfl = (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER, .oom_score_adj = -500, .opstate = F2OP_WRITABLE, +#ifdef HAVE_FUSE_IOMAP + .iomap_state = IOMAP_UNKNOWN, +#endif }; errcode_t err; FILE *orig_stderr = stderr; ^ permalink raw reply related [flat|nested] 191+ messages in thread
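The indirect-mapped path above probes one block at a time and merges the results into a single mapping. A minimal standalone sketch of that merging rule follows. It is an illustration only, not the patch's code: sketch_coalesce(), struct sketch_iomap, and the blockmap[] array (standing in for repeated ext2fs_bmap2() lookups) are hypothetical names, holes are given address 0 rather than FUSE_IOMAP_NULL_ADDR, and the ff->offset adjustment and the 128KiB query cap are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the FUSE_IOMAP_TYPE_* values in the patch. */
enum sketch_type { SKETCH_HOLE, SKETCH_MAPPED };

struct sketch_iomap {
	uint64_t offset;	/* file offset of the mapping, in bytes */
	uint64_t length;	/* length of the mapping, in bytes */
	uint64_t addr;		/* device address in bytes; 0 for a hole */
	enum sketch_type type;
};

/*
 * blockmap[i] is the physical block backing logical block i, or 0 if that
 * block is unmapped.  Build one mapping that begins at logical block
 * 'start' and is extended block by block, stopping before 'end'.
 */
static struct sketch_iomap sketch_coalesce(const uint64_t *blockmap,
					   size_t start, size_t end,
					   uint32_t blocksize)
{
	struct sketch_iomap map;
	uint64_t prev;
	size_t i;

	assert(blocksize > 0 && start < end);

	prev = blockmap[start];
	map.offset = (uint64_t)start * blocksize;
	map.length = blocksize;
	map.type = prev ? SKETCH_MAPPED : SKETCH_HOLE;
	map.addr = prev * blocksize;

	for (i = start + 1; i < end; i++) {
		uint64_t cur = blockmap[i];

		if (map.type == SKETCH_MAPPED) {
			/* Extend only while blocks are physically contiguous. */
			if (cur != prev + 1)
				break;
		} else {
			/* Extend a hole only while blocks stay unmapped. */
			if (cur != 0)
				break;
		}
		map.length += blocksize;
		prev = cur;
	}

	return map;
}
```

With a block map of { 100, 101, 102, 0, 0, 200 } and 4096-byte blocks, a query at block 0 yields one mapped range of three blocks, and a query at block 3 yields a two-block hole. The kernel then asks again at the end of whatever range was returned.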
* [PATCH 02/19] fuse2fs: add iomap= mount option 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong 2026-04-29 14:52 ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong @ 2026-04-29 14:53 ` Darrick J. Wong 2026-04-29 14:53 ` [PATCH 03/19] fuse2fs: implement iomap configuration Darrick J. Wong ` (16 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:53 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Add a mount option to control iomap usage so that we can test before and after scenarios. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.1.in | 6 ++++++ fuse4fs/fuse4fs.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ misc/fuse2fs.1.in | 6 ++++++ misc/fuse2fs.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 104 insertions(+) diff --git a/fuse4fs/fuse4fs.1.in b/fuse4fs/fuse4fs.1.in index 8bef5f48802385..8855867d27101d 100644 --- a/fuse4fs/fuse4fs.1.in +++ b/fuse4fs/fuse4fs.1.in @@ -75,6 +75,12 @@ .SS "fuse4fs options:" \fB-o\fR fuse4fs_debug enable fuse4fs debugging .TP +\fB-o\fR iomap= +If set to \fI1\fR, requires iomap to be enabled. +If set to \fI0\fR, forbids use of iomap. +If set to \fIdefault\fR (or not set), enables iomap if present. +This substantially improves the performance of the fuse4fs server. +.TP \fB-o\fR kernel Behave more like the kernel ext4 driver in the following ways: Allows processes owned by other users to access the filesystem. 
diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index a159024f778ba2..df2bda7cc22bf2 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -236,6 +236,12 @@ enum fuse4fs_opstate { F4OP_SHUTDOWN, }; +enum fuse4fs_feature_toggle { + FT_DISABLE, + FT_ENABLE, + FT_DEFAULT, +}; + #ifdef HAVE_FUSE_IOMAP enum fuse4fs_iomap_state { IOMAP_DISABLED, @@ -272,6 +278,7 @@ struct fuse4fs { int blocklog; int oom_score_adj; #ifdef HAVE_FUSE_IOMAP + enum fuse4fs_feature_toggle iomap_want; enum fuse4fs_iomap_state iomap_state; #endif unsigned int blockmask; @@ -2013,6 +2020,12 @@ static void fuse4fs_iomap_enable(struct fuse_conn_info *conn, if (ff->iomap_state == IOMAP_UNKNOWN) ff->iomap_state = IOMAP_DISABLED; + + if (!fuse4fs_iomap_enabled(ff)) { + if (ff->iomap_want == FT_ENABLE) + err_printf(ff, "%s\n", _("Could not enable iomap.")); + return; + } } #else # define fuse4fs_iomap_enable(...) ((void)0) @@ -6522,6 +6535,9 @@ enum { FUSE4FS_CACHE_SIZE, FUSE4FS_DIRSYNC, FUSE4FS_ERRORS_BEHAVIOR, +#ifdef HAVE_FUSE_IOMAP + FUSE4FS_IOMAP, +#endif }; #define FUSE4FS_OPT(t, p, v) { t, offsetof(struct fuse4fs, p), v } @@ -6553,6 +6569,10 @@ static struct fuse_opt fuse4fs_opts[] = { FUSE_OPT_KEY("cache_size=%s", FUSE4FS_CACHE_SIZE), FUSE_OPT_KEY("dirsync", FUSE4FS_DIRSYNC), FUSE_OPT_KEY("errors=%s", FUSE4FS_ERRORS_BEHAVIOR), +#ifdef HAVE_FUSE_IOMAP + FUSE_OPT_KEY("iomap=%s", FUSE4FS_IOMAP), + FUSE_OPT_KEY("iomap", FUSE4FS_IOMAP), +#endif FUSE_OPT_KEY("-V", FUSE4FS_VERSION), FUSE_OPT_KEY("--version", FUSE4FS_VERSION), @@ -6604,6 +6624,23 @@ static int fuse4fs_opt_proc(void *data, const char *arg, /* do not pass through to libfuse */ return 0; +#ifdef HAVE_FUSE_IOMAP + case FUSE4FS_IOMAP: + if (strcmp(arg, "iomap") == 0 || strcmp(arg + 6, "1") == 0) + ff->iomap_want = FT_ENABLE; + else if (strcmp(arg + 6, "0") == 0) + ff->iomap_want = FT_DISABLE; + else if (strcmp(arg + 6, "default") == 0) + ff->iomap_want = FT_DEFAULT; + else { + fprintf(stderr, "%s: %s\n", arg, + _("unknown iomap= 
behavior.")); + return -1; + } + + /* do not pass through to libfuse */ + return 0; +#endif case FUSE4FS_IGNORED: return 0; case FUSE4FS_HELP: @@ -6631,6 +6668,9 @@ static int fuse4fs_opt_proc(void *data, const char *arg, " -o cache_size=N[KMG] use a disk cache of this size\n" " -o errors= behavior when an error is encountered:\n" " continue|remount-ro|panic\n" +#ifdef HAVE_FUSE_IOMAP + " -o iomap= 0 to disable iomap, 1 to enable iomap\n" +#endif "\n", outargs->argv[0]); if (key == FUSE4FS_HELPFULL) { @@ -6950,6 +6990,7 @@ int main(int argc, char *argv[]) .bdev_fd = -1, #endif #ifdef HAVE_FUSE_IOMAP + .iomap_want = FT_DEFAULT, .iomap_state = IOMAP_UNKNOWN, #endif }; @@ -6983,6 +7024,11 @@ int main(int argc, char *argv[]) if (fuse4fs_is_service(&fctx)) fuse4fs_service_set_proc_cmdline(&fctx, argc, argv, &args); +#ifdef HAVE_FUSE_IOMAP + if (fctx.iomap_want == FT_DISABLE) + fctx.iomap_state = IOMAP_DISABLED; +#endif + /* /dev/sda -> sda for reporting */ fctx.shortdev = strrchr(fctx.device, '/'); if (fctx.shortdev) diff --git a/misc/fuse2fs.1.in b/misc/fuse2fs.1.in index 6acfa092851292..2b55fa0e723966 100644 --- a/misc/fuse2fs.1.in +++ b/misc/fuse2fs.1.in @@ -75,6 +75,12 @@ .SS "fuse2fs options:" \fB-o\fR fuse2fs_debug enable fuse2fs debugging .TP +\fB-o\fR iomap= +If set to \fI1\fR, requires iomap to be enabled. +If set to \fI0\fR, forbids use of iomap. +If set to \fIdefault\fR (or not set), enables iomap if present. +This substantially improves the performance of the fuse2fs server. +.TP \fB-o\fR kernel Behave more like the kernel ext4 driver in the following ways: Allows processes owned by other users to access the filesystem. 
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 63c9b59e54fb04..15ebe6b39f1288 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -218,6 +218,12 @@ enum fuse2fs_opstate { F2OP_SHUTDOWN, }; +enum fuse2fs_feature_toggle { + FT_DISABLE, + FT_ENABLE, + FT_DEFAULT, +}; + #ifdef HAVE_FUSE_IOMAP enum fuse2fs_iomap_state { IOMAP_DISABLED, @@ -254,6 +260,7 @@ struct fuse2fs { int blocklog; int oom_score_adj; #ifdef HAVE_FUSE_IOMAP + enum fuse2fs_feature_toggle iomap_want; enum fuse2fs_iomap_state iomap_state; #endif unsigned int blockmask; @@ -1610,6 +1617,12 @@ static void fuse2fs_iomap_enable(struct fuse_conn_info *conn, if (ff->iomap_state == IOMAP_UNKNOWN) ff->iomap_state = IOMAP_DISABLED; + + if (!fuse2fs_iomap_enabled(ff)) { + if (ff->iomap_want == FT_ENABLE) + err_printf(ff, "%s\n", _("Could not enable iomap.")); + return; + } } #else # define fuse2fs_iomap_enable(...) ((void)0) @@ -5753,6 +5766,9 @@ enum { FUSE2FS_CACHE_SIZE, FUSE2FS_DIRSYNC, FUSE2FS_ERRORS_BEHAVIOR, +#ifdef HAVE_FUSE_IOMAP + FUSE2FS_IOMAP, +#endif }; #define FUSE2FS_OPT(t, p, v) { t, offsetof(struct fuse2fs, p), v } @@ -5784,6 +5800,10 @@ static struct fuse_opt fuse2fs_opts[] = { FUSE_OPT_KEY("cache_size=%s", FUSE2FS_CACHE_SIZE), FUSE_OPT_KEY("dirsync", FUSE2FS_DIRSYNC), FUSE_OPT_KEY("errors=%s", FUSE2FS_ERRORS_BEHAVIOR), +#ifdef HAVE_FUSE_IOMAP + FUSE_OPT_KEY("iomap=%s", FUSE2FS_IOMAP), + FUSE_OPT_KEY("iomap", FUSE2FS_IOMAP), +#endif FUSE_OPT_KEY("-V", FUSE2FS_VERSION), FUSE_OPT_KEY("--version", FUSE2FS_VERSION), @@ -5835,6 +5855,23 @@ static int fuse2fs_opt_proc(void *data, const char *arg, /* do not pass through to libfuse */ return 0; +#ifdef HAVE_FUSE_IOMAP + case FUSE2FS_IOMAP: + if (strcmp(arg, "iomap") == 0 || strcmp(arg + 6, "1") == 0) + ff->iomap_want = FT_ENABLE; + else if (strcmp(arg + 6, "0") == 0) + ff->iomap_want = FT_DISABLE; + else if (strcmp(arg + 6, "default") == 0) + ff->iomap_want = FT_DEFAULT; + else { + fprintf(stderr, "%s: %s\n", arg, + _("unknown iomap= behavior.")); + 
return -1; + } + + /* do not pass through to libfuse */ + return 0; +#endif case FUSE2FS_IGNORED: return 0; case FUSE2FS_HELP: @@ -5862,6 +5899,9 @@ static int fuse2fs_opt_proc(void *data, const char *arg, " -o cache_size=N[KMG] use a disk cache of this size\n" " -o errors= behavior when an error is encountered:\n" " continue|remount-ro|panic\n" +#ifdef HAVE_FUSE_IOMAP + " -o iomap= 0 to disable iomap, 1 to enable iomap\n" +#endif "\n", outargs->argv[0]); if (key == FUSE2FS_HELPFULL) { @@ -6013,6 +6053,7 @@ int main(int argc, char *argv[]) .oom_score_adj = -500, .opstate = F2OP_WRITABLE, #ifdef HAVE_FUSE_IOMAP + .iomap_want = FT_DEFAULT, .iomap_state = IOMAP_UNKNOWN, #endif }; @@ -6029,6 +6070,11 @@ int main(int argc, char *argv[]) exit(1); } +#ifdef HAVE_FUSE_IOMAP + if (fctx.iomap_want == FT_DISABLE) + fctx.iomap_state = IOMAP_DISABLED; +#endif + /* /dev/sda -> sda for reporting */ fctx.shortdev = strrchr(fctx.device, '/'); if (fctx.shortdev) ^ permalink raw reply related [flat|nested] 191+ messages in thread
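The option handling in this patch maps four spellings onto a tri-state: a bare "iomap" or "iomap=1" asks for iomap, "iomap=0" forbids it, and "iomap=default" leaves the decision to feature detection. A minimal standalone sketch of that parsing follows. It is an illustration under assumptions, not the patch's code: sketch_parse_iomap() and the SK_* names are hypothetical, and unlike the patch it checks the "iomap=" prefix explicitly instead of relying on libfuse having matched the key before the callback runs.

```c
#include <stddef.h>
#include <string.h>

/* Mirrors the FT_* tri-state added by the patch; names are illustrative. */
enum sketch_toggle { SK_DISABLE, SK_ENABLE, SK_DEFAULT };

/*
 * Parse "iomap", "iomap=0", "iomap=1", or "iomap=default".
 * Returns 0 and stores the result in *want, or -1 if the argument is not
 * recognized (in which case *want is left untouched).
 */
static int sketch_parse_iomap(const char *arg, enum sketch_toggle *want)
{
	static const char prefix[] = "iomap=";
	const size_t plen = sizeof(prefix) - 1;
	const char *val;

	/* A bare "iomap" is shorthand for "iomap=1". */
	if (strcmp(arg, "iomap") == 0) {
		*want = SK_ENABLE;
		return 0;
	}

	if (strncmp(arg, prefix, plen) != 0)
		return -1;

	val = arg + plen;
	if (strcmp(val, "1") == 0)
		*want = SK_ENABLE;
	else if (strcmp(val, "0") == 0)
		*want = SK_DISABLE;
	else if (strcmp(val, "default") == 0)
		*want = SK_DEFAULT;
	else
		return -1;

	return 0;
}
```

Keeping the user's request (iomap_want) separate from the negotiated result (iomap_state) is what lets the server report "Could not enable iomap." only when iomap was explicitly requested and then turned out to be unavailable.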
* [PATCH 03/19] fuse2fs: implement iomap configuration 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong 2026-04-29 14:52 ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong 2026-04-29 14:53 ` [PATCH 02/19] fuse2fs: add iomap= mount option Darrick J. Wong @ 2026-04-29 14:53 ` Darrick J. Wong 2026-04-29 14:53 ` [PATCH 04/19] fuse2fs: register block devices for use with iomap Darrick J. Wong ` (15 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:53 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Upload the filesystem geometry to the kernel when asked. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++-- misc/fuse2fs.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 188 insertions(+), 6 deletions(-) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index df2bda7cc22bf2..feb46bdfbac39b 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -208,6 +208,10 @@ static inline uint64_t round_down(uint64_t b, unsigned int align) # define FL_ZERO_RANGE_FLAG (0) #endif +#ifndef NSEC_PER_SEC +# define NSEC_PER_SEC (1000000000L) +#endif + errcode_t ext2fs_check_ext3_journal(ext2_filsys fs); errcode_t ext2fs_run_ext3_journal(ext2_filsys *fs); @@ -995,9 +999,9 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino) EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode); get_now(&now); - datime = atime.tv_sec + ((double)atime.tv_nsec / 1000000000); - dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / 1000000000); - dnow = now.tv_sec + ((double)now.tv_nsec / 1000000000); + datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC); + dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC); + dnow = 
now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC); /* * If atime is newer than mtime and atime hasn't been updated in thirty @@ -6459,6 +6463,93 @@ static void op_iomap_end(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare, fuse_reply_err(req, 0); } + +/* + * Maximal extent format file size. + * Resulting logical blkno at s_maxbytes must fit in our on-disk + * extent format containers, within a sector_t, and within i_blocks + * in the vfs. ext4 inode has 48 bits of i_block in fsblock units, + * so that won't be a limiting factor. + * + * However there is other limiting factor. We do store extents in the form + * of starting block and length, hence the resulting length of the extent + * covering maximum file size must fit into on-disk format containers as + * well. Given that length is always by 1 unit bigger than max unit (because + * we count 0 as well) we have to lower the s_maxbytes by one fs block. + * + * Note, this does *not* consider any metadata overhead for vfs i_blocks. + */ +static off_t fuse4fs_max_size(struct fuse4fs *ff, off_t upper_limit) +{ + off_t res; + + if (!ext2fs_has_feature_huge_file(ff->fs->super)) { + upper_limit = (1LL << 32) - 1; + + /* total blocks in file system block size */ + upper_limit >>= (ff->blocklog - 9); + upper_limit <<= ff->blocklog; + } + + /* + * 32-bit extent-start container, ee_block. 
We lower the maxbytes + * by one fs block, so ee_len can cover the extent of maximum file + * size + */ + res = (1LL << 32) - 1; + res <<= ff->blocklog; + + /* Sanity check against vm- & vfs- imposed limits */ + if (res > upper_limit) + res = upper_limit; + + return res; +} + +static void op_iomap_config(fuse_req_t req, + const struct fuse_iomap_config_params *p, + size_t psize) +{ + struct fuse_iomap_config cfg = { }; + struct fuse4fs *ff = fuse4fs_get(req); + ext2_filsys fs; + + FUSE4FS_CHECK_CONTEXT(req); + + dbg_printf(ff, "%s: flags=0x%llx maxbytes=0x%llx\n", __func__, + (unsigned long long)p->flags, + (unsigned long long)p->maxbytes); + fs = fuse4fs_start(ff); + + cfg.flags |= FUSE_IOMAP_CONFIG_UUID; + memcpy(cfg.s_uuid, fs->super->s_uuid, sizeof(cfg.s_uuid)); + cfg.s_uuid_len = sizeof(fs->super->s_uuid); + + cfg.flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE; + cfg.s_blocksize = FUSE4FS_FSB_TO_B(ff, 1); + + /* + * If the inode is large enough to house i_[acm]time_extra then we + * can turn on nanosecond timestamps; i_crtime was the next field added + * after i_atime_extra. 
+ */ + cfg.flags |= FUSE_IOMAP_CONFIG_TIME; + if (fs->super->s_inode_size >= + offsetof(struct ext2_inode_large, i_crtime)) { + cfg.s_time_gran = 1; + cfg.s_time_max = EXT4_EXTRA_TIMESTAMP_MAX; + } else { + cfg.s_time_gran = NSEC_PER_SEC; + cfg.s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX; + } + cfg.s_time_min = EXT4_TIMESTAMP_MIN; + + cfg.flags |= FUSE_IOMAP_CONFIG_MAXBYTES; + cfg.s_maxbytes = fuse4fs_max_size(ff, p->maxbytes); + + fuse4fs_finish(ff, 0); + fuse_reply_iomap_config(req, &cfg); +} #endif /* HAVE_FUSE_IOMAP */ static struct fuse_lowlevel_ops fs_ops = { @@ -6507,6 +6598,7 @@ static struct fuse_lowlevel_ops fs_ops = { #ifdef HAVE_FUSE_IOMAP .iomap_begin = op_iomap_begin, .iomap_end = op_iomap_end, + .iomap_config = op_iomap_config, #endif /* HAVE_FUSE_IOMAP */ }; diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 15ebe6b39f1288..7df4e127e5981a 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -191,6 +191,10 @@ static inline uint64_t round_down(uint64_t b, unsigned int align) # define FL_ZERO_RANGE_FLAG (0) #endif +#ifndef NSEC_PER_SEC +# define NSEC_PER_SEC (1000000000L) +#endif + errcode_t ext2fs_check_ext3_journal(ext2_filsys fs); errcode_t ext2fs_run_ext3_journal(ext2_filsys *fs); @@ -806,9 +810,9 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino) EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode); get_now(&now); - datime = atime.tv_sec + ((double)atime.tv_nsec / 1000000000); - dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / 1000000000); - dnow = now.tv_sec + ((double)now.tv_nsec / 1000000000); + datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC); + dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC); + dnow = now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC); /* * If atime is newer than mtime and atime hasn't been updated in thirty @@ -5692,6 +5696,91 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino, return 0; } + +/* + * Maximal extent format file size. 
+ * Resulting logical blkno at s_maxbytes must fit in our on-disk + * extent format containers, within a sector_t, and within i_blocks + * in the vfs. ext4 inode has 48 bits of i_block in fsblock units, + * so that won't be a limiting factor. + * + * However there is other limiting factor. We do store extents in the form + * of starting block and length, hence the resulting length of the extent + * covering maximum file size must fit into on-disk format containers as + * well. Given that length is always by 1 unit bigger than max unit (because + * we count 0 as well) we have to lower the s_maxbytes by one fs block. + * + * Note, this does *not* consider any metadata overhead for vfs i_blocks. + */ +static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit) +{ + off_t res; + + if (!ext2fs_has_feature_huge_file(ff->fs->super)) { + upper_limit = (1LL << 32) - 1; + + /* total blocks in file system block size */ + upper_limit >>= (ff->blocklog - 9); + upper_limit <<= ff->blocklog; + } + + /* + * 32-bit extent-start container, ee_block. 
We lower the maxbytes + * by one fs block, so ee_len can cover the extent of maximum file + * size + */ + res = (1LL << 32) - 1; + res <<= ff->blocklog; + + /* Sanity check against vm- & vfs- imposed limits */ + if (res > upper_limit) + res = upper_limit; + + return res; +} + +static int op_iomap_config(const struct fuse_iomap_config_params *p, + size_t psize, struct fuse_iomap_config *cfg) +{ + struct fuse2fs *ff = fuse2fs_get(); + ext2_filsys fs; + + FUSE2FS_CHECK_CONTEXT(ff); + + dbg_printf(ff, "%s: flags=0x%llx maxbytes=0x%llx\n", __func__, + (unsigned long long)p->flags, + (unsigned long long)p->maxbytes); + fs = fuse2fs_start(ff); + + cfg->flags |= FUSE_IOMAP_CONFIG_UUID; + memcpy(cfg->s_uuid, fs->super->s_uuid, sizeof(cfg->s_uuid)); + cfg->s_uuid_len = sizeof(fs->super->s_uuid); + + cfg->flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE; + cfg->s_blocksize = FUSE2FS_FSB_TO_B(ff, 1); + + /* + * If the inode is large enough to house i_[acm]time_extra then we + * can turn on nanosecond timestamps; i_crtime was the next field added + * after i_atime_extra. + */ + cfg->flags |= FUSE_IOMAP_CONFIG_TIME; + if (fs->super->s_inode_size >= + offsetof(struct ext2_inode_large, i_crtime)) { + cfg->s_time_gran = 1; + cfg->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX; + } else { + cfg->s_time_gran = NSEC_PER_SEC; + cfg->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX; + } + cfg->s_time_min = EXT4_TIMESTAMP_MIN; + + cfg->flags |= FUSE_IOMAP_CONFIG_MAXBYTES; + cfg->s_maxbytes = fuse2fs_max_size(ff, p->maxbytes); + + fuse2fs_finish(ff, 0); + return 0; +} #endif /* HAVE_FUSE_IOMAP */ static struct fuse_operations fs_ops = { @@ -5738,6 +5827,7 @@ static struct fuse_operations fs_ops = { #ifdef HAVE_FUSE_IOMAP .iomap_begin = op_iomap_begin, .iomap_end = op_iomap_end, + .iomap_config = op_iomap_config, #endif /* HAVE_FUSE_IOMAP */ }; ^ permalink raw reply related [flat|nested] 191+ messages in thread
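The s_maxbytes calculation in patch 03 can be tried out in isolation. The sketch below is only an illustration: the name sketch_max_size, the reduction of the filesystem state to a blocklog and a huge_file flag, and the use of int64_t in place of off_t are all assumptions made for this example and are not part of fuse2fs or libext2fs.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for fuse2fs_max_size().  blocklog is log2 of the
 * filesystem block size, huge_file mirrors the huge_file feature bit, and
 * upper_limit is the largest file size the kernel says it can handle.
 */
static int64_t sketch_max_size(unsigned int blocklog, bool huge_file,
			       int64_t upper_limit)
{
	int64_t res;

	if (!huge_file) {
		/*
		 * Same arithmetic as the patch: start from a 32-bit count,
		 * scale it from 512-byte units to fs blocks, then to bytes.
		 */
		upper_limit = (1LL << 32) - 1;
		upper_limit >>= (blocklog - 9);
		upper_limit <<= blocklog;
	}

	/* Logical block numbers are 32 bits wide; stop one block short. */
	res = (1LL << 32) - 1;
	res <<= blocklog;

	/* Never report more than the caller-supplied limit. */
	if (res > upper_limit)
		res = upper_limit;

	return res;
}
```

With 4k blocks (blocklog 12) and huge_file set, the result is one block short of 16TiB, provided the kernel limit is larger than that. Without huge_file the 512-byte block counter becomes the tighter bound and the result is one block short of 2TiB.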
* [PATCH 04/19] fuse2fs: register block devices for use with iomap 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (2 preceding siblings ...) 2026-04-29 14:53 ` [PATCH 03/19] fuse2fs: implement iomap configuration Darrick J. Wong @ 2026-04-29 14:53 ` Darrick J. Wong 2026-04-29 14:53 ` [PATCH 05/19] fuse2fs: implement directio file reads Darrick J. Wong ` (14 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:53 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Register the ext4 block device with the kernel for use with iomap. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 44 ++++++++++++++++++++++++++++++++++++++++---- misc/fuse2fs.c | 42 ++++++++++++++++++++++++++++++++++++++---- 2 files changed, 78 insertions(+), 8 deletions(-) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index feb46bdfbac39b..3e9852f585302d 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -284,6 +284,7 @@ struct fuse4fs { #ifdef HAVE_FUSE_IOMAP enum fuse4fs_feature_toggle iomap_want; enum fuse4fs_iomap_state iomap_state; + uint32_t iomap_dev; #endif unsigned int blockmask; unsigned long offset; @@ -6247,7 +6248,7 @@ static errcode_t fuse4fs_iomap_begin_extent(struct fuse4fs *ff, uint64_t ino, } /* Mapping overlaps startoff, report this. 
*/ - iomap->dev = FUSE_IOMAP_DEV_NULL; + iomap->dev = ff->iomap_dev; iomap->addr = FUSE4FS_FSB_TO_B(ff, extent.e_pblk) + ff->offset; iomap->offset = FUSE4FS_FSB_TO_B(ff, extent.e_lblk); iomap->length = FUSE4FS_FSB_TO_B(ff, extent.e_len); @@ -6280,13 +6281,14 @@ static int fuse4fs_iomap_begin_indirect(struct fuse4fs *ff, uint64_t ino, if (err) return translate_error(fs, ino, err); - iomap->dev = FUSE_IOMAP_DEV_NULL; iomap->offset = FUSE4FS_FSB_TO_B(ff, startoff); iomap->flags |= FUSE_IOMAP_F_MERGED; if (startblock) { + iomap->dev = ff->iomap_dev; iomap->addr = FUSE4FS_FSB_TO_B(ff, startblock) + ff->offset; iomap->type = FUSE_IOMAP_TYPE_MAPPED; } else { + iomap->dev = FUSE_IOMAP_DEV_NULL; iomap->addr = FUSE_IOMAP_NULL_ADDR; iomap->type = FUSE_IOMAP_TYPE_HOLE; } @@ -6506,6 +6508,30 @@ static off_t fuse4fs_max_size(struct fuse4fs *ff, off_t upper_limit) return res; } +static int fuse4fs_iomap_config_devices(struct fuse4fs *ff) +{ + errcode_t err; + int fd; + int ret; + + err = io_channel_get_fd(ff->fs->io, &fd); + if (err) + return translate_error(ff->fs, 0, err); + + ret = fuse_lowlevel_iomap_device_add(ff->fuse, fd, 0); + if (ret < 0) { + dbg_printf(ff, "%s: cannot register iomap dev fd=%d, err=%d\n", + __func__, fd, -ret); + return translate_error(ff->fs, 0, -ret); + } + + ff->iomap_dev = ret; + + dbg_printf(ff, "%s: registered iomap dev fd=%d iomap_dev=%u\n", + __func__, fd, ff->iomap_dev); + return 0; +} + static void op_iomap_config(fuse_req_t req, const struct fuse_iomap_config_params *p, size_t psize) @@ -6513,6 +6539,7 @@ static void op_iomap_config(fuse_req_t req, struct fuse_iomap_config cfg = { }; struct fuse4fs *ff = fuse4fs_get(req); ext2_filsys fs; + int ret = 0; FUSE4FS_CHECK_CONTEXT(req); @@ -6547,8 +6574,16 @@ static void op_iomap_config(fuse_req_t req, cfg.flags |= FUSE_IOMAP_CONFIG_MAXBYTES; cfg.s_maxbytes = fuse4fs_max_size(ff, p->maxbytes); - fuse4fs_finish(ff, 0); - fuse_reply_iomap_config(req, &cfg); + ret = fuse4fs_iomap_config_devices(ff); + 
if (ret) + goto out_unlock; + +out_unlock: + fuse4fs_finish(ff, ret); + if (ret) + fuse_reply_err(req, -ret); + else + fuse_reply_iomap_config(req, &cfg); } #endif /* HAVE_FUSE_IOMAP */ @@ -7084,6 +7119,7 @@ int main(int argc, char *argv[]) #ifdef HAVE_FUSE_IOMAP .iomap_want = FT_DEFAULT, .iomap_state = IOMAP_UNKNOWN, + .iomap_dev = FUSE_IOMAP_DEV_NULL, #endif }; errcode_t err; diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 7df4e127e5981a..c24ae461dad2ad 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -40,6 +40,7 @@ # define _FILE_OFFSET_BITS 64 #endif /* _FILE_OFFSET_BITS */ #include <fuse.h> +#include <fuse_lowlevel.h> #ifdef __SET_FOB_FOR_FUSE # undef _FILE_OFFSET_BITS #endif /* __SET_FOB_FOR_FUSE */ @@ -266,6 +267,7 @@ struct fuse2fs { #ifdef HAVE_FUSE_IOMAP enum fuse2fs_feature_toggle iomap_want; enum fuse2fs_iomap_state iomap_state; + uint32_t iomap_dev; #endif unsigned int blockmask; unsigned long offset; @@ -5481,7 +5483,7 @@ static errcode_t fuse2fs_iomap_begin_extent(struct fuse2fs *ff, uint64_t ino, } /* Mapping overlaps startoff, report this. 
*/ - iomap->dev = FUSE_IOMAP_DEV_NULL; + iomap->dev = ff->iomap_dev; iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk) + ff->offset; iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk); iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len); @@ -5514,13 +5516,14 @@ static int fuse2fs_iomap_begin_indirect(struct fuse2fs *ff, uint64_t ino, if (err) return translate_error(fs, ino, err); - iomap->dev = FUSE_IOMAP_DEV_NULL; iomap->offset = FUSE2FS_FSB_TO_B(ff, startoff); iomap->flags |= FUSE_IOMAP_F_MERGED; if (startblock) { + iomap->dev = ff->iomap_dev; iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock) + ff->offset; iomap->type = FUSE_IOMAP_TYPE_MAPPED; } else { + iomap->dev = FUSE_IOMAP_DEV_NULL; iomap->addr = FUSE_IOMAP_NULL_ADDR; iomap->type = FUSE_IOMAP_TYPE_HOLE; } @@ -5739,11 +5742,36 @@ static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit) return res; } +static int fuse2fs_iomap_config_devices(struct fuse2fs *ff) +{ + errcode_t err; + int fd; + int ret; + + err = io_channel_get_fd(ff->fs->io, &fd); + if (err) + return translate_error(ff->fs, 0, err); + + ret = fuse_fs_iomap_device_add(fd, 0); + if (ret < 0) { + dbg_printf(ff, "%s: cannot register iomap dev fd=%d, err=%d\n", + __func__, fd, -ret); + return translate_error(ff->fs, 0, -ret); + } + + ff->iomap_dev = ret; + + dbg_printf(ff, "%s: registered iomap dev fd=%d iomap_dev=%u\n", + __func__, fd, ff->iomap_dev); + return 0; +} + static int op_iomap_config(const struct fuse_iomap_config_params *p, size_t psize, struct fuse_iomap_config *cfg) { struct fuse2fs *ff = fuse2fs_get(); ext2_filsys fs; + int ret = 0; FUSE2FS_CHECK_CONTEXT(ff); @@ -5778,8 +5806,13 @@ static int op_iomap_config(const struct fuse_iomap_config_params *p, cfg->flags |= FUSE_IOMAP_CONFIG_MAXBYTES; cfg->s_maxbytes = fuse2fs_max_size(ff, p->maxbytes); - fuse2fs_finish(ff, 0); - return 0; + ret = fuse2fs_iomap_config_devices(ff); + if (ret) + goto out_unlock; + +out_unlock: + fuse2fs_finish(ff, ret); + return ret; } #endif /* 
HAVE_FUSE_IOMAP */ @@ -6145,6 +6178,7 @@ int main(int argc, char *argv[]) #ifdef HAVE_FUSE_IOMAP .iomap_want = FT_DEFAULT, .iomap_state = IOMAP_UNKNOWN, + .iomap_dev = FUSE_IOMAP_DEV_NULL, #endif }; errcode_t err; ^ permalink raw reply related [flat|nested] 191+ messages in thread
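To make the flow of patch 04 easier to follow, here is a small sketch of how a registered device cookie ends up in the mappings that are reported back: mapped ranges carry the cookie, holes carry the null device. Everything prefixed with sketch_ or SKETCH_ is hypothetical, including the constant values and the stand-in registration function; the real cookie comes from the libfuse device-add call used in the patch, whose return convention is only assumed here to be "id on success, negative errno on failure".

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_DEV_NULL		UINT32_MAX	/* hypothetical "no device" cookie */
#define SKETCH_NULL_ADDR	UINT64_MAX	/* hypothetical "no address" value */

enum sketch_type { SKETCH_HOLE, SKETCH_MAPPED };

struct sketch_mapping {
	uint32_t dev;
	uint64_t addr;
	enum sketch_type type;
};

struct sketch_fs {
	uint32_t iomap_dev;	/* cookie to put in mapped ranges */
	unsigned int blocklog;	/* log2 of the fs block size */
	uint64_t offset;	/* byte offset of the fs within the device */
};

/* Stand-in for the registration call: device id, or -errno on failure. */
static int sketch_device_add(int fd)
{
	return fd < 0 ? -EBADF : 1;
}

/* Register the backing device once and remember the cookie. */
static int sketch_config_devices(struct sketch_fs *sfs, int fd)
{
	int ret = sketch_device_add(fd);

	if (ret < 0)
		return ret;

	sfs->iomap_dev = ret;
	return 0;
}

/* Fill in a mapping for one block; startblock == 0 means a hole. */
static void sketch_fill_mapping(const struct sketch_fs *sfs,
				uint64_t startblock,
				struct sketch_mapping *map)
{
	if (startblock) {
		map->dev = sfs->iomap_dev;
		map->addr = (startblock << sfs->blocklog) + sfs->offset;
		map->type = SKETCH_MAPPED;
	} else {
		map->dev = SKETCH_DEV_NULL;
		map->addr = SKETCH_NULL_ADDR;
		map->type = SKETCH_HOLE;
	}
}
```

Keeping the null device on holes means the kernel is never handed a device cookie together with an address that does not point at real storage.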
* [PATCH 05/19] fuse2fs: implement directio file reads
2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (3 preceding siblings ...)
2026-04-29 14:53 ` [PATCH 04/19] fuse2fs: register block devices for use with iomap Darrick J. Wong
@ 2026-04-29 14:53 ` Darrick J. Wong
2026-04-29 14:54 ` [PATCH 06/19] fuse2fs: add extent dump function for debugging Darrick J. Wong
` (13 subsequent siblings)
18 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:53 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Implement file reads via iomap. Currently only directio is supported.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 14 +++++++++++++-
 misc/fuse2fs.c | 14 +++++++++++++-
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 3e9852f585302d..a1d931aed8f393 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -6370,7 +6370,19 @@ static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino,
 				    uint64_t count, uint32_t opflags,
 				    struct fuse_file_iomap *read)
 {
-	return -ENOSYS;
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	/* fall back to slow path for inline data reads */
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return -ENOSYS;
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return fuse4fs_iomap_begin_extent(ff, ino, inode, pos, count,
+						  opflags, read);
+
+	return fuse4fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+					    opflags, read);
 }
 
 static int fuse4fs_iomap_begin_write(struct fuse4fs *ff, ext2_ino_t ino,
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index c24ae461dad2ad..739867fa41dd91 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5605,7 +5605,19 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 				    uint64_t count, uint32_t opflags,
 				    struct fuse_file_iomap *read)
 {
-	return -ENOSYS;
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	/* fall back to slow path for inline data reads */
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return -ENOSYS;
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
+						  opflags, read);
+
+	return fuse2fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+					    opflags, read);
 }
 
 static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,

^ permalink raw reply related [flat|nested] 191+ messages in thread
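The read dispatch in patch 05 boils down to a three-way decision, which the sketch below isolates so that it can be exercised without a filesystem. The enum, the function name, and SKETCH_OP_DIRECT are hypothetical (the real value of FUSE_IOMAP_OP_DIRECT is not shown in this thread). The two inode flag values are, to the best of my knowledge, the ones ext4 uses on disk, but treat them as assumptions too.

```c
#include <stdint.h>

#define SKETCH_OP_DIRECT	(1U << 0)	/* hypothetical opflags bit */
#define SKETCH_EXTENTS_FL	0x00080000U	/* assumed EXT4_EXTENTS_FL */
#define SKETCH_INLINE_DATA_FL	0x10000000U	/* assumed EXT4_INLINE_DATA_FL */

enum sketch_path {
	SKETCH_FALLBACK,	/* -ENOSYS in the patch: use the non-iomap path */
	SKETCH_EXTENT,		/* look the range up in the extent tree */
	SKETCH_INDIRECT,	/* look the range up in the block map */
};

/* Decide how a read mapping request would be served. */
static enum sketch_path sketch_pick_read_path(uint32_t opflags,
					       uint32_t i_flags)
{
	/* Only directio is handled at this point in the series. */
	if (!(opflags & SKETCH_OP_DIRECT))
		return SKETCH_FALLBACK;

	/* Inline data lives inside the inode, so there is no block to map. */
	if (i_flags & SKETCH_INLINE_DATA_FL)
		return SKETCH_FALLBACK;

	if (i_flags & SKETCH_EXTENTS_FL)
		return SKETCH_EXTENT;

	return SKETCH_INDIRECT;
}
```

The order of the checks matters: the inline data test comes before the extent test so that a file with inline data is never handed to either mapping lookup.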
* [PATCH 06/19] fuse2fs: add extent dump function for debugging 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (4 preceding siblings ...) 2026-04-29 14:53 ` [PATCH 05/19] fuse2fs: implement directio file reads Darrick J. Wong @ 2026-04-29 14:54 ` Darrick J. Wong 2026-04-29 14:54 ` [PATCH 07/19] fuse2fs: implement direct write support Darrick J. Wong ` (12 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:54 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Add a function to dump an inode's extent map for debugging purposes. This helped debug a problem with generic/299 failing on 1k fsblock filesystems: --- a/tests/generic/299.out 2025-07-15 14:45:15.030113607 -0700 +++ b/tests/generic/299.out.bad 2025-07-16 19:33:50.889344998 -0700 @@ -3,3 +3,4 @@ QA output created by 299 Run fio with random aio-dio pattern Start fallocate/truncate loop +fio: io_u error on file /opt/direct_aio.0.0: Input/output error: write offset=2602827776, buflen=131072 (The cause of this was misuse of the libext2fs extent code) Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++ misc/fuse2fs.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 140 insertions(+) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index a1d931aed8f393..1489be2104f2b2 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -917,6 +917,74 @@ static inline int fuse4fs_iomap_enabled(const struct fuse4fs *ff) # define fuse4fs_iomap_enabled(...) 
(0) #endif +static inline void fuse4fs_dump_extents(struct fuse4fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, + const char *why) +{ + ext2_filsys fs = ff->fs; + unsigned int nr = 0; + blk64_t blockcount = 0; + struct ext2_inode_large xinode; + struct ext2fs_extent extent; + ext2_extent_handle_t extents; + int op = EXT2_EXTENT_ROOT; + errcode_t retval; + + if (!inode) { + inode = &xinode; + + retval = fuse4fs_read_inode(fs, ino, inode); + if (retval) { + com_err(__func__, retval, _("reading ino %u"), ino); + return; + } + } + + if (!(inode->i_flags & EXT4_EXTENTS_FL)) + return; + + printf("%s: %s ino=%u isize %llu iblocks %llu\n", __func__, why, ino, + EXT2_I_SIZE(inode), + (ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)) * 512) / + fs->blocksize); + fflush(stdout); + + retval = ext2fs_extent_open(fs, ino, &extents); + if (retval) { + com_err(__func__, retval, _("opening extents of ino \"%u\""), + ino); + return; + } + + while ((retval = ext2fs_extent_get(extents, op, &extent)) == 0) { + op = EXT2_EXTENT_NEXT; + + if (extent.e_flags & EXT2_EXTENT_FLAGS_SECOND_VISIT) + continue; + + printf("[%u]: %s ino=%u lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", + nr++, why, ino, extent.e_lblk, extent.e_pblk, + extent.e_len, extent.e_flags); + fflush(stdout); + if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF) + blockcount += extent.e_len; + else + blockcount++; + } + if (retval == EXT2_ET_EXTENT_NO_NEXT) + retval = 0; + if (retval) { + com_err(__func__, retval, ("getting extents of ino %u"), + ino); + } + if (inode->i_file_acl) + blockcount++; + printf("%s: %s sum(e_len) %llu\n", __func__, why, blockcount); + fflush(stdout); + + ext2fs_extent_free(extents); +} + static void get_now(struct timespec *now) { #ifdef CLOCK_REALTIME @@ -6444,6 +6512,8 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare, /* Not filling even the first byte will make the kernel unhappy. 
*/ if (read.offset > pos || read.offset + read.length <= pos) { + if (ff->debug) + fuse4fs_dump_extents(ff, ino, &inode, "BAD DATA"); ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED); goto out_unlock; } diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 739867fa41dd91..4b37bcde63d0f2 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -729,6 +729,74 @@ static inline int fuse2fs_iomap_enabled(const struct fuse2fs *ff) # define fuse2fs_iomap_enabled(...) (0) #endif +static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, + const char *why) +{ + ext2_filsys fs = ff->fs; + unsigned int nr = 0; + blk64_t blockcount = 0; + struct ext2_inode_large xinode; + struct ext2fs_extent extent; + ext2_extent_handle_t extents; + int op = EXT2_EXTENT_ROOT; + errcode_t retval; + + if (!inode) { + inode = &xinode; + + retval = fuse2fs_read_inode(fs, ino, inode); + if (retval) { + com_err(__func__, retval, _("reading ino %u"), ino); + return; + } + } + + if (!(inode->i_flags & EXT4_EXTENTS_FL)) + return; + + printf("%s: %s ino=%u isize %llu iblocks %llu\n", __func__, why, ino, + EXT2_I_SIZE(inode), + (ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)) * 512) / + fs->blocksize); + fflush(stdout); + + retval = ext2fs_extent_open(fs, ino, &extents); + if (retval) { + com_err(__func__, retval, _("opening extents of ino \"%u\""), + ino); + return; + } + + while ((retval = ext2fs_extent_get(extents, op, &extent)) == 0) { + op = EXT2_EXTENT_NEXT; + + if (extent.e_flags & EXT2_EXTENT_FLAGS_SECOND_VISIT) + continue; + + printf("[%u]: %s ino=%u lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", + nr++, why, ino, extent.e_lblk, extent.e_pblk, + extent.e_len, extent.e_flags); + fflush(stdout); + if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF) + blockcount += extent.e_len; + else + blockcount++; + } + if (retval == EXT2_ET_EXTENT_NO_NEXT) + retval = 0; + if (retval) { + com_err(__func__, retval, ("getting extents of ino %u"), + ino); + } + if 
(inode->i_file_acl) + blockcount++; + printf("%s: %s sum(e_len) %llu\n", __func__, why, blockcount); + fflush(stdout); + + ext2fs_extent_free(extents); +} + static void get_now(struct timespec *now) { #ifdef CLOCK_REALTIME @@ -5681,6 +5749,8 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, /* Not filling even the first byte will make the kernel unhappy. */ if (read->offset > pos || read->offset + read->length <= pos) { + if (ff->debug) + fuse2fs_dump_extents(ff, attr_ino, &inode, "BAD DATA"); ret = translate_error(fs, attr_ino, EXT2_ET_INODE_CORRUPTED); goto out_unlock; } ^ permalink raw reply related [flat|nested] 191+ messages in thread
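The block accounting done by the dump function in patch 06 is easy to misread in diff form, so here is a reduced sketch of it. It walks an array instead of a libext2fs extent handle; the struct, the function name, and the flag values are illustrative assumptions, and only the counting rules are taken from the patch: records flagged as a second visit are skipped, leaf records contribute their length, every other record counts as one tree block, and an extended attribute block adds one more.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits standing in for the EXT2_EXTENT_FLAGS_* macros. */
#define SKETCH_FLAG_LEAF		0x1U
#define SKETCH_FLAG_SECOND_VISIT	0x4U

struct sketch_extent {
	uint32_t e_len;		/* length in fs blocks (leaf records only) */
	uint32_t e_flags;
};

/* Sum up the blocks that an extent walk would attribute to the file. */
static uint64_t sketch_count_blocks(const struct sketch_extent *recs,
				    size_t nr, int has_file_acl)
{
	uint64_t blockcount = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		/* Interior nodes are seen twice; count them only once. */
		if (recs[i].e_flags & SKETCH_FLAG_SECOND_VISIT)
			continue;

		if (recs[i].e_flags & SKETCH_FLAG_LEAF)
			blockcount += recs[i].e_len;	/* data blocks */
		else
			blockcount++;			/* one tree block */
	}

	/* The xattr block is charged to the file as well. */
	if (has_file_acl)
		blockcount++;

	return blockcount;
}
```

Comparing this sum with the iblocks value printed at the top of the dump is one way to spot a mapping that the extent tree and the inode disagree about.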
* [PATCH 07/19] fuse2fs: implement direct write support 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (5 preceding siblings ...) 2026-04-29 14:54 ` [PATCH 06/19] fuse2fs: add extent dump function for debugging Darrick J. Wong @ 2026-04-29 14:54 ` Darrick J. Wong 2026-04-29 14:54 ` [PATCH 08/19] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong ` (11 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:54 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Wire up an iomap_begin method that can allocate into holes so that we can do directio writes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 477 +++++++++++++++++++++++++++++++++++++++++++++++++++++ misc/fuse2fs.c | 471 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 942 insertions(+), 6 deletions(-) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index 1489be2104f2b2..8b508de5b8cb65 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -6453,12 +6453,106 @@ static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino, opflags, read); } +static int fuse4fs_iomap_write_allocate(struct fuse4fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, + off_t pos, uint64_t count, + uint32_t opflags, + struct fuse_file_iomap *read, + bool *dirty) +{ + ext2_filsys fs = ff->fs; + blk64_t startoff = FUSE4FS_B_TO_FSBT(ff, pos); + blk64_t stopoff = FUSE4FS_B_TO_FSB(ff, pos + count); + blk64_t old_iblocks; + errcode_t err; + int ret; + + dbg_printf(ff, + "%s: ino=%d startoff 0x%llx blockcount 0x%llx\n", + __func__, ino, startoff, stopoff - startoff); + + if (!fuse4fs_can_allocate(ff, stopoff - startoff)) + return -ENOSPC; + + old_iblocks = ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)); + err = ext2fs_fallocate(fs, 
EXT2_FALLOCATE_FORCE_UNINIT, ino, + EXT2_INODE(inode), ~0ULL, startoff, + stopoff - startoff); + if (err) + return translate_error(fs, ino, err); + + /* + * New allocations for file data blocks on indirect mapped files are + * zeroed through the IO manager so we have to flush it to disk. + */ + if (!(inode->i_flags & EXT4_EXTENTS_FL) && + old_iblocks != ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode))) { + err = io_channel_flush(fs->io); + if (err) + return translate_error(fs, ino, err); + } + + /* pick up the newly allocated mapping */ + ret = fuse4fs_iomap_begin_read(ff, ino, inode, pos, count, opflags, + read); + if (ret) + return ret; + + read->flags |= FUSE_IOMAP_F_DIRTY; + *dirty = true; + return 0; +} + +static off_t fuse4fs_max_file_size(const struct fuse4fs *ff, + const struct ext2_inode_large *inode) +{ + ext2_filsys fs = ff->fs; + blk64_t addr_per_block, max_map_block; + + if (inode->i_flags & EXT4_EXTENTS_FL) { + max_map_block = (1ULL << 32) - 1; + } else { + addr_per_block = fs->blocksize >> 2; + max_map_block = addr_per_block; + max_map_block += addr_per_block * addr_per_block; + max_map_block += addr_per_block * addr_per_block * addr_per_block; + max_map_block += 12; + } + + return FUSE4FS_FSB_TO_B(ff, max_map_block) + (fs->blocksize - 1); +} + static int fuse4fs_iomap_begin_write(struct fuse4fs *ff, ext2_ino_t ino, struct ext2_inode_large *inode, off_t pos, uint64_t count, uint32_t opflags, - struct fuse_file_iomap *read) + struct fuse_file_iomap *read, + bool *dirty) { - return -ENOSYS; + off_t max_size = fuse4fs_max_file_size(ff, inode); + int ret; + + if (!(opflags & FUSE_IOMAP_OP_DIRECT)) + return -ENOSYS; + + if (pos >= max_size) + return -EFBIG; + + if (pos >= max_size - count) + count = max_size - pos; + + ret = fuse4fs_iomap_begin_read(ff, ino, inode, pos, count, opflags, + read); + if (ret) + return ret; + + if (fuse_iomap_need_write_allocate(opflags, read)) { + ret = fuse4fs_iomap_write_allocate(ff, ino, inode, pos, count, + opflags, read, 
dirty); + if (ret) + return ret; + } + + return 0; } static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare, @@ -6470,6 +6564,7 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare, ext2_filsys fs; ext2_ino_t ino; errcode_t err; + bool dirty = false; int ret = 0; FUSE4FS_CHECK_CONTEXT(req); @@ -6493,7 +6588,7 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare, opflags, &read); else if (fuse_iomap_is_write(opflags)) ret = fuse4fs_iomap_begin_write(ff, ino, &inode, pos, count, - opflags, &read); + opflags, &read, &dirty); else ret = fuse4fs_iomap_begin_read(ff, ino, &inode, pos, count, opflags, &read); @@ -6518,6 +6613,14 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare, goto out_unlock; } + if (dirty) { + err = fuse4fs_write_inode(fs, ino, &inode); + if (err) { + ret = translate_error(fs, ino, err); + goto out_unlock; + } + } + out_unlock: fuse4fs_finish(ff, ret); if (ret) @@ -6667,6 +6770,373 @@ static void op_iomap_config(fuse_req_t req, else fuse_reply_iomap_config(req, &cfg); } + +static inline bool fuse4fs_can_merge_mappings(const struct ext2fs_extent *left, + const struct ext2fs_extent *right) +{ + uint64_t max_len = (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ? 
+ EXT_UNINIT_MAX_LEN : EXT_INIT_MAX_LEN; + + return left->e_lblk + left->e_len == right->e_lblk && + left->e_pblk + left->e_len == right->e_pblk && + (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) == + (right->e_flags & EXT2_EXTENT_FLAGS_UNINIT) && + (uint64_t)left->e_len + right->e_len <= max_len; +} + +static int fuse4fs_try_merge_mappings(struct fuse4fs *ff, ext2_ino_t ino, + ext2_extent_handle_t handle, + blk64_t startoff) +{ + ext2_filsys fs = ff->fs; + struct ext2fs_extent left, right; + errcode_t err; + + /* Look up the mappings before startoff */ + err = fuse4fs_get_mapping_at(ff, handle, startoff - 1, &left); + if (err == EXT2_ET_EXTENT_NOT_FOUND) + return 0; + if (err) + return translate_error(fs, ino, err); + + /* Look up the mapping at startoff */ + err = fuse4fs_get_mapping_at(ff, handle, startoff, &right); + if (err == EXT2_ET_EXTENT_NOT_FOUND) + return 0; + if (err) + return translate_error(fs, ino, err); + + /* Can we combine them? */ + if (!fuse4fs_can_merge_mappings(&left, &right)) + return 0; + + /* + * Delete the mapping after startoff because libext2fs cannot handle + * overlapping mappings. 
+ */ + err = ext2fs_extent_delete(handle, 0); + DUMP_EXTENT(ff, "remover", startoff, err, &right); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixremover", startoff, err, &right); + if (err) + return translate_error(fs, ino, err); + + /* Move back and lengthen the mapping before startoff */ + err = ext2fs_extent_goto(handle, left.e_lblk); + DUMP_EXTENT(ff, "movel", startoff - 1, err, &left); + if (err) + return translate_error(fs, ino, err); + + left.e_len += right.e_len; + err = ext2fs_extent_replace(handle, 0, &left); + DUMP_EXTENT(ff, "replacel", startoff - 1, err, &left); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixreplacel", startoff - 1, err, &left); + if (err) + return translate_error(fs, ino, err); + + return 0; +} + +static int fuse4fs_convert_unwritten_mapping(struct fuse4fs *ff, + ext2_ino_t ino, + struct ext2_inode_large *inode, + ext2_extent_handle_t handle, + blk64_t *cursor, blk64_t stopoff) +{ + ext2_filsys fs = ff->fs; + struct ext2fs_extent extent; + blk64_t startoff = *cursor; + errcode_t err; + + /* + * Find the mapping at startoff. Note that we can find holes because + * the mapping data can change due to racing writes. + */ + err = fuse4fs_get_mapping_at(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + /* + * If we didn't find any mappings at all then the file is + * completely sparse. There's nothing to convert. + */ + *cursor = stopoff; + return 0; + } + if (err) + return translate_error(fs, ino, err); + + /* + * The mapping is completely to the left of the range that we want. + * Let's see what's in the next extent, if there is one. + */ + if (startoff >= extent.e_lblk + extent.e_len) { + /* + * Mapping ends to the left of the current position. Try to + * find the next mapping. If there is no next mapping, then + * we're done. 
+ */ + err = fuse4fs_get_next_mapping(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + *cursor = stopoff; + return 0; + } + if (err) + return translate_error(fs, ino, err); + } + + /* + * The mapping is completely to the right of the range that we want, + * so we're done. + */ + if (extent.e_lblk >= stopoff) { + *cursor = stopoff; + return 0; + } + + /* + * At this point, we have a mapping that overlaps (startoff, stopoff]. + * If the mapping is already written, move on to the next one. + */ + if (!(extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)) + goto next; + + if (startoff > extent.e_lblk) { + struct ext2fs_extent newex = extent; + + /* + * Unwritten mapping starts before startoff. Shorten + * the previous mapping... + */ + newex.e_len = startoff - extent.e_lblk; + err = ext2fs_extent_replace(handle, 0, &newex); + DUMP_EXTENT(ff, "shortenp", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixshortenp", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + + /* ...and create new written mapping at startoff. */ + extent.e_len -= newex.e_len; + extent.e_lblk += newex.e_len; + extent.e_pblk += newex.e_len; + extent.e_flags = newex.e_flags & ~EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_insert(handle, + EXT2_EXTENT_INSERT_AFTER, + &extent); + DUMP_EXTENT(ff, "insertx", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixinsertx", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + } + + if (extent.e_lblk + extent.e_len > stopoff) { + struct ext2fs_extent newex = extent; + + /* + * Unwritten mapping ends after stopoff. Shorten the current + * mapping... 
+ */ + extent.e_len = stopoff - extent.e_lblk; + extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_replace(handle, 0, &extent); + DUMP_EXTENT(ff, "shortenn", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixshortenn", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + /* ..and create a new unwritten mapping at stopoff. */ + newex.e_pblk += extent.e_len; + newex.e_lblk += extent.e_len; + newex.e_len -= extent.e_len; + newex.e_flags |= EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_insert(handle, + EXT2_EXTENT_INSERT_AFTER, + &newex); + DUMP_EXTENT(ff, "insertn", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixinsertn", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + } + + /* Still unwritten? Update the state. */ + if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) { + extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_replace(handle, 0, &extent); + DUMP_EXTENT(ff, "replacex", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixreplacex", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + } + +next: + /* Try to merge with the previous extent */ + if (startoff > 0) { + err = fuse4fs_try_merge_mappings(ff, ino, handle, startoff); + if (err) + return translate_error(fs, ino, err); + } + + *cursor = extent.e_lblk + extent.e_len; + return 0; +} + +static int fuse4fs_convert_unwritten_mappings(struct fuse4fs *ff, + ext2_ino_t ino, + struct ext2_inode_large *inode, + off_t pos, size_t written) +{ + ext2_extent_handle_t handle; + ext2_filsys fs = ff->fs; + blk64_t startoff = FUSE4FS_B_TO_FSBT(ff, pos); + const blk64_t stopoff = FUSE4FS_B_TO_FSB(ff, pos + written); + errcode_t err; + 
int ret; + + err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle); + if (err) + return translate_error(fs, ino, err); + + /* Walk every mapping in the range, converting them. */ + while (startoff < stopoff) { + blk64_t old_startoff = startoff; + + ret = fuse4fs_convert_unwritten_mapping(ff, ino, inode, handle, + &startoff, stopoff); + if (ret) + goto out_handle; + if (startoff <= old_startoff) { + /* Do not go backwards. */ + ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED); + goto out_handle; + } + } + + /* Try to merge the right edge */ + ret = fuse4fs_try_merge_mappings(ff, ino, handle, stopoff); +out_handle: + ext2fs_extent_free(handle); + return ret; +} + +static void op_iomap_ioend(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare, + off_t pos, size_t written, uint32_t ioendflags, + int error, uint32_t dev, uint64_t new_addr) +{ + struct fuse4fs *ff = fuse4fs_get(req); + struct ext2_inode_large inode; + ext2_filsys fs; + ext2_ino_t ino; + ext2_off64_t isize; + errcode_t err; + bool dirty = false; + off_t newsize = -1; + int ret = 0; + + FUSE4FS_CHECK_CONTEXT(req); + FUSE4FS_CONVERT_FINO(req, &ino, fino); + + dbg_printf(ff, + "%s: ino=%d pos=0x%llx written=0x%zx ioendflags=0x%x error=%d dev=%u new_addr=0x%llx\n", + __func__, ino, + (unsigned long long)pos, + written, + ioendflags, + error, + dev, + (unsigned long long)new_addr); + + if (error) { + fuse_reply_err(req, -error); + return; + } + + fs = fuse4fs_start(ff); + + /* should never see these ioend types */ + if (ioendflags & FUSE_IOMAP_IOEND_SHARED) { + ret = translate_error(fs, ino, EXT2_ET_FILESYSTEM_CORRUPTED); + goto out_unlock; + } + + err = fuse4fs_read_inode(fs, ino, &inode); + if (err) { + ret = translate_error(fs, ino, err); + goto out_unlock; + } + + if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) { + /* unwritten extents are only supported on extents files */ + if (!(inode.i_flags & EXT4_EXTENTS_FL)) { + ret = translate_error(fs, ino, + EXT2_ET_FILESYSTEM_CORRUPTED); + goto 
out_unlock; + } + + ret = fuse4fs_convert_unwritten_mappings(ff, ino, &inode, + pos, written); + if (ret) + goto out_unlock; + + dirty = true; + } + + isize = EXT2_I_SIZE(&inode); + if (pos + written > isize) { + err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), + pos + written); + if (err) { + ret = translate_error(fs, ino, err); + goto out_unlock; + } + + dirty = true; + } + + if (dirty) { + err = fuse4fs_write_inode(fs, ino, &inode); + if (err) { + ret = translate_error(fs, ino, err); + goto out_unlock; + } + } + + newsize = EXT2_I_SIZE(&inode); +out_unlock: + fuse4fs_finish(ff, ret); + if (ret) + fuse_reply_err(req, -ret); + else + fuse_reply_iomap_ioend(req, newsize); +} #endif /* HAVE_FUSE_IOMAP */ static struct fuse_lowlevel_ops fs_ops = { @@ -6716,6 +7186,7 @@ static struct fuse_lowlevel_ops fs_ops = { .iomap_begin = op_iomap_begin, .iomap_end = op_iomap_end, .iomap_config = op_iomap_config, + .iomap_ioend = op_iomap_ioend, #endif /* HAVE_FUSE_IOMAP */ }; diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 4b37bcde63d0f2..67a5bc4c5cc986 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -5688,12 +5688,103 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino, opflags, read); } +static int fuse2fs_iomap_write_allocate(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode, off_t pos, + uint64_t count, uint32_t opflags, + struct fuse_file_iomap *read, bool *dirty) +{ + ext2_filsys fs = ff->fs; + blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos); + blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + count); + blk64_t old_iblocks; + errcode_t err; + int ret; + + dbg_printf(ff, "%s: write_alloc ino=%u startoff 0x%llx blockcount 0x%llx\n", + __func__, ino, startoff, stopoff - startoff); + + if (!fs_can_allocate(ff, stopoff - startoff)) + return -ENOSPC; + + old_iblocks = ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)); + err = ext2fs_fallocate(fs, EXT2_FALLOCATE_FORCE_UNINIT, ino, + EXT2_INODE(inode), ~0ULL, startoff, + stopoff - 
startoff); + if (err) + return translate_error(fs, ino, err); + + /* + * New allocations for file data blocks on indirect mapped files are + * zeroed through the IO manager so we have to flush it to disk. + */ + if (!(inode->i_flags & EXT4_EXTENTS_FL) && + old_iblocks != ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode))) { + err = io_channel_flush(fs->io); + if (err) + return translate_error(fs, ino, err); + } + + /* pick up the newly allocated mapping */ + ret = fuse2fs_iomap_begin_read(ff, ino, inode, pos, count, opflags, + read); + if (ret) + return ret; + + read->flags |= FUSE_IOMAP_F_DIRTY; + *dirty = true; + return 0; +} + +static off_t fuse2fs_max_file_size(const struct fuse2fs *ff, + const struct ext2_inode_large *inode) +{ + ext2_filsys fs = ff->fs; + blk64_t addr_per_block, max_map_block; + + if (inode->i_flags & EXT4_EXTENTS_FL) { + max_map_block = (1ULL << 32) - 1; + } else { + addr_per_block = fs->blocksize >> 2; + max_map_block = addr_per_block; + max_map_block += addr_per_block * addr_per_block; + max_map_block += addr_per_block * addr_per_block * addr_per_block; + max_map_block += 12; + } + + return FUSE2FS_FSB_TO_B(ff, max_map_block) + (fs->blocksize - 1); +} + static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino, struct ext2_inode_large *inode, off_t pos, uint64_t count, uint32_t opflags, - struct fuse_file_iomap *read) + struct fuse_file_iomap *read, + bool *dirty) { - return -ENOSYS; + off_t max_size = fuse2fs_max_file_size(ff, inode); + int ret; + + if (!(opflags & FUSE_IOMAP_OP_DIRECT)) + return -ENOSYS; + + if (pos >= max_size) + return -EFBIG; + + if (pos >= max_size - count) + count = max_size - pos; + + ret = fuse2fs_iomap_begin_read(ff, ino, inode, pos, count, opflags, + read); + if (ret) + return ret; + + if (fuse_iomap_need_write_allocate(opflags, read)) { + ret = fuse2fs_iomap_write_allocate(ff, ino, inode, pos, count, + opflags, read, dirty); + if (ret) + return ret; + } + + return 0; } static int op_iomap_begin(const 
char *path, uint64_t nodeid, uint64_t attr_ino, @@ -5705,6 +5796,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, struct ext2_inode_large inode; ext2_filsys fs; errcode_t err; + bool dirty = false; int ret = 0; FUSE2FS_CHECK_CONTEXT(ff); @@ -5730,7 +5822,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, count, opflags, read); else if (fuse_iomap_is_write(opflags)) ret = fuse2fs_iomap_begin_write(ff, attr_ino, &inode, pos, - count, opflags, read); + count, opflags, read, &dirty); else ret = fuse2fs_iomap_begin_read(ff, attr_ino, &inode, pos, count, opflags, read); @@ -5755,6 +5847,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, goto out_unlock; } + if (dirty) { + err = fuse2fs_write_inode(fs, attr_ino, &inode); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + } + out_unlock: fuse2fs_finish(ff, ret); return ret; @@ -5896,6 +5996,370 @@ static int op_iomap_config(const struct fuse_iomap_config_params *p, fuse2fs_finish(ff, ret); return ret; } + +static inline bool fuse2fs_can_merge_mappings(const struct ext2fs_extent *left, + const struct ext2fs_extent *right) +{ + uint64_t max_len = (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ? 
+ EXT_UNINIT_MAX_LEN : EXT_INIT_MAX_LEN; + + return left->e_lblk + left->e_len == right->e_lblk && + left->e_pblk + left->e_len == right->e_pblk && + (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) == + (right->e_flags & EXT2_EXTENT_FLAGS_UNINIT) && + (uint64_t)left->e_len + right->e_len <= max_len; +} + +static int fuse2fs_try_merge_mappings(struct fuse2fs *ff, ext2_ino_t ino, + ext2_extent_handle_t handle, + blk64_t startoff) +{ + ext2_filsys fs = ff->fs; + struct ext2fs_extent left, right; + errcode_t err; + + /* Look up the mappings before startoff */ + err = fuse2fs_get_mapping_at(ff, handle, startoff - 1, &left); + if (err == EXT2_ET_EXTENT_NOT_FOUND) + return 0; + if (err) + return translate_error(fs, ino, err); + + /* Look up the mapping at startoff */ + err = fuse2fs_get_mapping_at(ff, handle, startoff, &right); + if (err == EXT2_ET_EXTENT_NOT_FOUND) + return 0; + if (err) + return translate_error(fs, ino, err); + + /* Can we combine them? */ + if (!fuse2fs_can_merge_mappings(&left, &right)) + return 0; + + /* + * Delete the mapping after startoff because libext2fs cannot handle + * overlapping mappings. 
+ */ + err = ext2fs_extent_delete(handle, 0); + DUMP_EXTENT(ff, "remover", startoff, err, &right); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixremover", startoff, err, &right); + if (err) + return translate_error(fs, ino, err); + + /* Move back and lengthen the mapping before startoff */ + err = ext2fs_extent_goto(handle, left.e_lblk); + DUMP_EXTENT(ff, "movel", startoff - 1, err, &left); + if (err) + return translate_error(fs, ino, err); + + left.e_len += right.e_len; + err = ext2fs_extent_replace(handle, 0, &left); + DUMP_EXTENT(ff, "replacel", startoff - 1, err, &left); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixreplacel", startoff - 1, err, &left); + if (err) + return translate_error(fs, ino, err); + + return 0; +} + +static int fuse2fs_convert_unwritten_mapping(struct fuse2fs *ff, + ext2_ino_t ino, + struct ext2_inode_large *inode, + ext2_extent_handle_t handle, + blk64_t *cursor, blk64_t stopoff) +{ + ext2_filsys fs = ff->fs; + struct ext2fs_extent extent; + blk64_t startoff = *cursor; + errcode_t err; + + /* + * Find the mapping at startoff. Note that we can find holes because + * the mapping data can change due to racing writes. + */ + err = fuse2fs_get_mapping_at(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + /* + * If we didn't find any mappings at all then the file is + * completely sparse. There's nothing to convert. + */ + *cursor = stopoff; + return 0; + } + if (err) + return translate_error(fs, ino, err); + + /* + * The mapping is completely to the left of the range that we want. + * Let's see what's in the next extent, if there is one. + */ + if (startoff >= extent.e_lblk + extent.e_len) { + /* + * Mapping ends to the left of the current position. Try to + * find the next mapping. If there is no next mapping, then + * we're done. 
+ */ + err = fuse2fs_get_next_mapping(ff, handle, startoff, &extent); + if (err == EXT2_ET_EXTENT_NOT_FOUND) { + *cursor = stopoff; + return 0; + } + if (err) + return translate_error(fs, ino, err); + } + + /* + * The mapping is completely to the right of the range that we want, + * so we're done. + */ + if (extent.e_lblk >= stopoff) { + *cursor = stopoff; + return 0; + } + + /* + * At this point, we have a mapping that overlaps (startoff, stopoff]. + * If the mapping is already written, move on to the next one. + */ + if (!(extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)) + goto next; + + if (startoff > extent.e_lblk) { + struct ext2fs_extent newex = extent; + + /* + * Unwritten mapping starts before startoff. Shorten + * the previous mapping... + */ + newex.e_len = startoff - extent.e_lblk; + err = ext2fs_extent_replace(handle, 0, &newex); + DUMP_EXTENT(ff, "shortenp", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixshortenp", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + + /* ...and create new written mapping at startoff. */ + extent.e_len -= newex.e_len; + extent.e_lblk += newex.e_len; + extent.e_pblk += newex.e_len; + extent.e_flags = newex.e_flags & ~EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_insert(handle, + EXT2_EXTENT_INSERT_AFTER, + &extent); + DUMP_EXTENT(ff, "insertx", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixinsertx", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + } + + if (extent.e_lblk + extent.e_len > stopoff) { + struct ext2fs_extent newex = extent; + + /* + * Unwritten mapping ends after stopoff. Shorten the current + * mapping... 
+ */ + extent.e_len = stopoff - extent.e_lblk; + extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_replace(handle, 0, &extent); + DUMP_EXTENT(ff, "shortenn", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixshortenn", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + /* ..and create a new unwritten mapping at stopoff. */ + newex.e_pblk += extent.e_len; + newex.e_lblk += extent.e_len; + newex.e_len -= extent.e_len; + newex.e_flags |= EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_insert(handle, + EXT2_EXTENT_INSERT_AFTER, + &newex); + DUMP_EXTENT(ff, "insertn", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixinsertn", startoff, err, &newex); + if (err) + return translate_error(fs, ino, err); + } + + /* Still unwritten? Update the state. */ + if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) { + extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT; + + err = ext2fs_extent_replace(handle, 0, &extent); + DUMP_EXTENT(ff, "replacex", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + + err = ext2fs_extent_fix_parents(handle); + DUMP_EXTENT(ff, "fixreplacex", startoff, err, &extent); + if (err) + return translate_error(fs, ino, err); + } + +next: + /* Try to merge with the previous extent */ + if (startoff > 0) { + err = fuse2fs_try_merge_mappings(ff, ino, handle, startoff); + if (err) + return translate_error(fs, ino, err); + } + + *cursor = extent.e_lblk + extent.e_len; + return 0; +} + +static int fuse2fs_convert_unwritten_mappings(struct fuse2fs *ff, + ext2_ino_t ino, + struct ext2_inode_large *inode, + off_t pos, size_t written) +{ + ext2_extent_handle_t handle; + ext2_filsys fs = ff->fs; + blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos); + const blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + written); + errcode_t err; + 
int ret; + + err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle); + if (err) + return translate_error(fs, ino, err); + + /* Walk every mapping in the range, converting them. */ + while (startoff < stopoff) { + blk64_t old_startoff = startoff; + + ret = fuse2fs_convert_unwritten_mapping(ff, ino, inode, handle, + &startoff, stopoff); + if (ret) + goto out_handle; + if (startoff <= old_startoff) { + /* Do not go backwards. */ + ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED); + goto out_handle; + } + } + + /* Try to merge the right edge */ + ret = fuse2fs_try_merge_mappings(ff, ino, handle, stopoff); +out_handle: + ext2fs_extent_free(handle); + return ret; +} + +static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino, + off_t pos, size_t written, uint32_t ioendflags, + int error, uint32_t dev, uint64_t new_addr, + off_t *newsize) +{ + struct fuse2fs *ff = fuse2fs_get(); + struct ext2_inode_large inode; + ext2_filsys fs; + errcode_t err; + ext2_off64_t isize; + bool dirty = false; + int ret = 0; + + FUSE2FS_CHECK_CONTEXT(ff); + + dbg_printf(ff, + "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx written=0x%zx ioendflags=0x%x error=%d dev=%u new_addr=%llu\n", + __func__, path, + (unsigned long long)nodeid, + (unsigned long long)attr_ino, + (unsigned long long)pos, + written, + ioendflags, + error, + dev, + (unsigned long long)new_addr); + + fs = fuse2fs_start(ff); + if (error) { + ret = error; + goto out_unlock; + } + + /* should never see these ioend types */ + if (ioendflags & FUSE_IOMAP_IOEND_SHARED) { + ret = translate_error(fs, attr_ino, + EXT2_ET_FILESYSTEM_CORRUPTED); + goto out_unlock; + } + + err = fuse2fs_read_inode(fs, attr_ino, &inode); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + + if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) { + /* unwritten extents are only supported on extents files */ + if (!(inode.i_flags & EXT4_EXTENTS_FL)) { + ret = translate_error(fs, attr_ino, + 
EXT2_ET_FILESYSTEM_CORRUPTED); + goto out_unlock; + } + + ret = fuse2fs_convert_unwritten_mappings(ff, attr_ino, &inode, + pos, written); + if (ret) + goto out_unlock; + + dirty = true; + } + + isize = EXT2_I_SIZE(&inode); + if (pos + written > isize) { + err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), + pos + written); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + + dirty = true; + } + + if (dirty) { + err = fuse2fs_write_inode(fs, attr_ino, &inode); + if (err) { + ret = translate_error(fs, attr_ino, err); + goto out_unlock; + } + } + + *newsize = EXT2_I_SIZE(&inode); +out_unlock: + fuse2fs_finish(ff, ret); + return ret; +} #endif /* HAVE_FUSE_IOMAP */ static struct fuse_operations fs_ops = { @@ -5943,6 +6407,7 @@ static struct fuse_operations fs_ops = { .iomap_begin = op_iomap_begin, .iomap_end = op_iomap_end, .iomap_config = op_iomap_config, + .iomap_ioend = op_iomap_ioend, #endif /* HAVE_FUSE_IOMAP */ }; ^ permalink raw reply related [flat|nested] 191+ messages in thread
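For readers following the unwritten-extent conversion in the patch above, the splitting step can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from the patch: `struct demo_extent` and `demo_convert_range()` are hypothetical stand-ins for `struct ext2fs_extent` and the libext2fs extent handle calls, and the block range is assumed to be half-open, [start, stop), which is how the bounds checks in the patch appear to treat it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ext2fs_extent; not the libext2fs type. */
struct demo_extent {
	uint64_t lblk;		/* first logical block */
	uint64_t pblk;		/* first physical block */
	uint32_t len;		/* length in blocks */
	bool unwritten;		/* models EXT2_EXTENT_FLAGS_UNINIT */
};

/*
 * Split @ex around the half-open block range [start, stop) and mark the
 * overlapping part written.  Stores up to three pieces in @out and returns
 * how many were produced.  The caller must ensure @ex overlaps the range.
 */
static int demo_convert_range(const struct demo_extent *ex, uint64_t start,
			      uint64_t stop, struct demo_extent out[3])
{
	uint64_t end = ex->lblk + ex->len;
	uint64_t cstart = start > ex->lblk ? start : ex->lblk;
	uint64_t cstop = stop < end ? stop : end;
	int n = 0;

	/* Already written: nothing to convert. */
	if (!ex->unwritten) {
		out[0] = *ex;
		return 1;
	}

	/* Unwritten blocks before the range stay unwritten. */
	if (cstart > ex->lblk) {
		out[n] = *ex;
		out[n].len = (uint32_t)(cstart - ex->lblk);
		n++;
	}

	/* The overlapping blocks become written. */
	out[n] = *ex;
	out[n].lblk = cstart;
	out[n].pblk = ex->pblk + (cstart - ex->lblk);
	out[n].len = (uint32_t)(cstop - cstart);
	out[n].unwritten = false;
	n++;

	/* Unwritten blocks after the range stay unwritten. */
	if (cstop < end) {
		out[n] = *ex;
		out[n].lblk = cstop;
		out[n].pblk = ex->pblk + (cstop - ex->lblk);
		out[n].len = (uint32_t)(end - cstop);
		n++;
	}

	return n;
}
```

With an unwritten extent covering logical blocks 10-19 and an ioend covering blocks 12-14, the sketch yields three pieces: unwritten 10-11, written 12-14, and unwritten 15-19. The patch performs the equivalent split in place with ext2fs_extent_replace and ext2fs_extent_insert, fixes up the parents, and then tries to merge the result with its neighbors.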
* [PATCH 08/19] fuse2fs: turn on iomap for pagecache IO
2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (6 preceding siblings ...)
2026-04-29 14:54 ` [PATCH 07/19] fuse2fs: implement direct write support Darrick J. Wong
@ 2026-04-29 14:54 ` Darrick J. Wong
2026-04-29 14:54 ` [PATCH 09/19] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
` (10 subsequent siblings)
18 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:54 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Turn on iomap for pagecache IO to regular files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c |   61 +++++++++++++++++++++++++++++++++++++++++++++++------
 misc/fuse2fs.c    |   61 +++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 108 insertions(+), 14 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 8b508de5b8cb65..fa82fda99ff687 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -6438,9 +6438,6 @@ static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino,
 				    uint64_t count, uint32_t opflags,
 				    struct fuse_file_iomap *read)
 {
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	/* fall back to slow path for inline data reads */
 	if (inode->i_flags & EXT4_INLINE_DATA_FL)
 		return -ENOSYS;
@@ -6531,9 +6528,6 @@ static int fuse4fs_iomap_begin_write(struct fuse4fs *ff, ext2_ino_t ino,
 	off_t max_size = fuse4fs_max_file_size(ff, inode);
 	int ret;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	if (pos >= max_size)
 		return -EFBIG;
 
@@ -6629,12 +6623,51 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
 	fuse_reply_iomap_begin(req, &read, NULL);
 }
 
+static int fuse4fs_iomap_append_setsize(struct fuse4fs *ff, ext2_ino_t ino,
+					loff_t newsize)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2_inode_large inode;
+	ext2_off64_t isize;
+	errcode_t err;
+
+	dbg_printf(ff, "%s: ino=%u newsize=%llu\n", __func__, ino,
+		   (unsigned long long)newsize);
+
+	err = fuse4fs_read_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	isize = EXT2_I_SIZE(&inode);
+	if (newsize <= isize)
+		return 0;
+
+	dbg_printf(ff, "%s: ino=%u oldsize=%llu newsize=%llu\n", __func__, ino,
+		   (unsigned long long)isize,
+		   (unsigned long long)newsize);
+
+	/*
+	 * XXX cheesily update the ondisk size even though we only want to do
+	 * the incore size until writeback happens
+	 */
+	err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), newsize);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = fuse4fs_write_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
 static void op_iomap_end(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
 			 off_t pos, uint64_t count, uint32_t opflags,
 			 ssize_t written, const struct fuse_file_iomap *iomap)
 {
 	struct fuse4fs *ff = fuse4fs_get(req);
 	ext2_ino_t ino;
+	int ret = 0;
 
 	FUSE4FS_CHECK_CONTEXT(req);
 	FUSE4FS_CONVERT_FINO(req, &ino, fino);
@@ -6648,7 +6681,21 @@ static void op_iomap_end(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
 		   written,
 		   iomap->flags);
 
-	fuse_reply_err(req, 0);
+	fuse4fs_start(ff);
+
+	/* XXX is this really necessary? */
+	if ((opflags & FUSE_IOMAP_OP_WRITE) &&
+	    !(opflags & FUSE_IOMAP_OP_DIRECT) &&
+	    (iomap->flags & FUSE_IOMAP_F_SIZE_CHANGED) &&
+	    written > 0) {
+		ret = fuse4fs_iomap_append_setsize(ff, ino, pos + written);
+		if (ret)
+			goto out_unlock;
+	}
+
+out_unlock:
+	fuse4fs_finish(ff, ret);
+	fuse_reply_err(req, -ret);
 }
 
 /*
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 67a5bc4c5cc986..679406323df86f 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5673,9 +5673,6 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 				    uint64_t count, uint32_t opflags,
 				    struct fuse_file_iomap *read)
 {
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	/* fall back to slow path for inline data reads */
 	if (inode->i_flags & EXT4_INLINE_DATA_FL)
 		return -ENOSYS;
@@ -5763,9 +5760,6 @@ static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	off_t max_size = fuse2fs_max_file_size(ff, inode);
 	int ret;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	if (pos >= max_size)
 		return -EFBIG;
 
@@ -5860,11 +5854,50 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	return ret;
 }
 
+static int fuse2fs_iomap_append_setsize(struct fuse2fs *ff, ext2_ino_t ino,
+					loff_t newsize)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2_inode_large inode;
+	ext2_off64_t isize;
+	errcode_t err;
+
+	dbg_printf(ff, "%s: ino=%u newsize=%llu\n", __func__, ino,
+		   (unsigned long long)newsize);
+
+	err = fuse2fs_read_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	isize = EXT2_I_SIZE(&inode);
+	if (newsize <= isize)
+		return 0;
+
+	dbg_printf(ff, "%s: ino=%u oldsize=%llu newsize=%llu\n", __func__, ino,
+		   (unsigned long long)isize,
+		   (unsigned long long)newsize);
+
+	/*
+	 * XXX cheesily update the ondisk size even though we only want to do
+	 * the incore size until writeback happens
+	 */
+	err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), newsize);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = fuse2fs_write_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
 static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 			off_t pos, uint64_t count, uint32_t opflags,
 			ssize_t written, const struct fuse_file_iomap *iomap)
 {
 	struct fuse2fs *ff = fuse2fs_get();
+	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
 
@@ -5879,7 +5912,21 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   written,
 		   iomap->flags);
 
-	return 0;
+	fuse2fs_start(ff);
+
+	/* XXX is this really necessary? */
+	if ((opflags & FUSE_IOMAP_OP_WRITE) &&
+	    !(opflags & FUSE_IOMAP_OP_DIRECT) &&
+	    (iomap->flags & FUSE_IOMAP_F_SIZE_CHANGED) &&
+	    written > 0) {
+		ret = fuse2fs_iomap_append_setsize(ff, attr_ino, pos + written);
+		if (ret)
+			goto out_unlock;
+	}
+
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
 }
 
 /*

^ permalink raw reply related [flat|nested] 191+ messages in thread
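The condition that decides whether the iomap_end handler pushes the file size out after a buffered append can be looked at on its own. What follows is only a sketch: the `DEMO_*` bit values are placeholders chosen for illustration (they are not the real FUSE_IOMAP_* values), and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration; NOT the real FUSE_IOMAP_* values. */
#define DEMO_OP_WRITE		(1U << 0)
#define DEMO_OP_DIRECT		(1U << 1)
#define DEMO_F_SIZE_CHANGED	(1U << 0)

/*
 * Mirrors the shape of the check in op_iomap_end: only a buffered (non-direct)
 * write that actually wrote something and changed the size needs an update.
 */
static bool demo_needs_setsize(uint32_t opflags, uint32_t mapflags,
			       int64_t written)
{
	return (opflags & DEMO_OP_WRITE) &&
	       !(opflags & DEMO_OP_DIRECT) &&
	       (mapflags & DEMO_F_SIZE_CHANGED) &&
	       written > 0;
}

/* The size only grows; a write that ends inside EOF leaves it unchanged. */
static uint64_t demo_append_size(uint64_t isize, uint64_t pos,
				 uint64_t written)
{
	uint64_t newsize = pos + written;

	return newsize > isize ? newsize : isize;
}
```

Direct writes are excluded here because, per the previous patch, their size updates are handled in the ioend handler instead.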
* [PATCH 09/19] fuse2fs: don't zero bytes in punch hole
2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (7 preceding siblings ...)
2026-04-29 14:54 ` [PATCH 08/19] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
@ 2026-04-29 14:54 ` Darrick J. Wong
2026-04-29 14:55 ` [PATCH 10/19] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
` (9 subsequent siblings)
18 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:54 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the pagecache, it will take care of zeroing the
unaligned parts of punched out regions so we don't have to do it
ourselves.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c |    8 ++++++++
 misc/fuse2fs.c    |    9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index fa82fda99ff687..40713b0d0d5e37 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -5868,6 +5868,10 @@ static errcode_t fuse4fs_zero_middle(struct fuse4fs *ff, ext2_ino_t ino,
 	int retflags;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (fuse4fs_iomap_enabled(ff))
+		return 0;
+
 	if (!*buf) {
 		err = ext2fs_get_mem(fs->blocksize, buf);
 		if (err)
@@ -5904,6 +5908,10 @@ static errcode_t fuse4fs_zero_edge(struct fuse4fs *ff, ext2_ino_t ino,
 	off_t residue;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (fuse4fs_iomap_enabled(ff))
+		return 0;
+
 	residue = FUSE4FS_OFF_IN_FSB(ff, offset);
 	if (residue == 0)
 		return 0;
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 679406323df86f..a37851cdf30785 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -727,6 +727,7 @@ static inline int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
 }
 #else
 # define fuse2fs_iomap_enabled(...)	(0)
+# define fuse2fs_iomap_enabled(...)	(0)
 #endif
 
 static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
@@ -5104,6 +5105,10 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	int retflags;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (fuse2fs_iomap_enabled(ff))
+		return 0;
+
 	if (!*buf) {
 		err = ext2fs_get_mem(fs->blocksize, buf);
 		if (err)
@@ -5140,6 +5145,10 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	off_t residue;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (fuse2fs_iomap_enabled(ff))
+		return 0;
+
 	residue = FUSE2FS_OFF_IN_FSB(ff, offset);
 	if (residue == 0)
 		return 0;

^ permalink raw reply related [flat|nested] 191+ messages in thread
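To make "the unaligned parts of punched out regions" concrete, here is a small worked sketch of how a byte range breaks down into a partial head, whole blocks, and a partial tail. It is illustrative only: `struct demo_punch` and `demo_split_punch()` are hypothetical names, and the arithmetic assumes a nonzero block size and a range that does not overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: how a punch range decomposes by block. */
struct demo_punch {
	uint64_t head_bytes;	/* bytes to zero before the first whole block */
	uint64_t first_blk;	/* first whole block that can be freed */
	uint64_t nr_blks;	/* number of whole blocks that can be freed */
	uint64_t tail_bytes;	/* bytes to zero after the last whole block */
};

static struct demo_punch demo_split_punch(uint64_t off, uint64_t len,
					  uint32_t blocksize)
{
	struct demo_punch p = { 0, 0, 0, 0 };
	uint64_t end = off + len;
	uint64_t first = (off + blocksize - 1) / blocksize;	/* round up */
	uint64_t last = end / blocksize;			/* round down */

	p.first_blk = first;
	if (first > last) {
		/* Range sits inside one block and touches no boundary. */
		p.head_bytes = len;
		return p;
	}

	p.head_bytes = first * blocksize - off;
	p.nr_blks = last - first;
	p.tail_bytes = end - last * blocksize;
	return p;
}
```

For example, punching 10000 bytes at offset 1000 with 4096-byte blocks gives a 3096-byte head, one whole block (block 1), and a 2808-byte tail. Only the head and tail need zeroing, and that is the part the commit message says iomap handles in the kernel.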
* [PATCH 10/19] fuse2fs: don't do file data block IO when iomap is enabled 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (8 preceding siblings ...) 2026-04-29 14:54 ` [PATCH 09/19] fuse2fs: don't zero bytes in punch hole Darrick J. Wong @ 2026-04-29 14:55 ` Darrick J. Wong 2026-04-29 14:55 ` [PATCH 11/19] fuse2fs: try to create loop device when ext4 device is a regular file Darrick J. Wong ` (8 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:55 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> When iomap is in use for the page cache, the kernel will take care of all the file data block IO for us, including zeroing of punched ranges and post-EOF bytes. fuse2fs only needs to do IO for inline data. Therefore, set the NOBLOCKIO ext2_file flag so that libext2fs will not do any regular file IO to or from disk blocks at all. Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 11 +++++++- misc/fuse2fs.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 81 insertions(+), 2 deletions(-) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index 40713b0d0d5e37..68f1f7c02df223 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -3933,9 +3933,14 @@ static int fuse4fs_truncate(struct fuse4fs *ff, ext2_ino_t ino, off_t new_size) ext2_file_t file; __u64 old_isize; errcode_t err; + int flags = EXT2_FILE_WRITE; int ret = 0; - err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file); + /* the kernel handles all eof zeroing for us in iomap mode */ + if (fuse4fs_iomap_enabled(ff)) + flags |= EXT2_FILE_NOBLOCKIO; + + err = ext2fs_file_open(fs, ino, flags, &file); if (err) return translate_error(fs, ino, err); @@ -4030,6 +4035,10 @@ static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt, if (linked) check |= L_OK; + /* the kernel handles all block IO for us in iomap mode */ + if (fuse4fs_iomap_enabled(ff)) + file->open_flags |= EXT2_FILE_NOBLOCKIO; + /* * If the caller wants to truncate the file, we need to ask for full * write access even if the caller claims to be appending. diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index a37851cdf30785..f7653dc6c20c3f 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -3463,15 +3463,72 @@ static int fuse2fs_punch_posteof(struct fuse2fs *ff, ext2_ino_t ino, return 0; } +/* + * Decide if file IO for this inode can use iomap. + * + * It turns out that libfuse creates internal node ids that have nothing to do + * with the ext2_ino_t that we give it. These internal node ids are what + * actually gets igetted in the kernel, which means that there can be multiple + * fuse_inode objects in the kernel for a single hardlinked ondisk ext2 inode. 
+ * + * What this means, horrifyingly, is that on a fuse filesystem that supports + * hard links, the in-kernel i_rwsem does not protect against concurrent writes + * between files that point to the same inode. That in turn means that the + * file mode and size can get desynchronized between the multiple fuse_inode + * objects. This also means that we cannot cache iomaps in the kernel AT ALL + * because the caches will get out of sync, leading to WARN_ONs from the iomap + * zeroing code and probably data corruption after that. + * + * Therefore, libfuse won't let us create hardlinks of iomap files, and we must + * never turn on iomap for existing hardlinked files. Long term it means we + * have to find a way around this loss of functionality. fuse4fs gets around + * this by being a low level fuse driver and controlling the nodeids itself. + * + * Returns 0 for no, 1 for yes, or a negative errno. + */ +#ifdef HAVE_FUSE_IOMAP +static int fuse2fs_file_uses_iomap(struct fuse2fs *ff, ext2_ino_t ino) +{ + struct stat statbuf; + int ret; + + if (!fuse2fs_iomap_enabled(ff)) + return 0; + + ret = stat_inode(ff->fs, ino, &statbuf); + if (ret) + return ret; + + /* the kernel handles all block IO for us in iomap mode */ + return fuse_fs_can_enable_iomap(&statbuf); +} +#else +# define fuse2fs_file_uses_iomap(...) 
(0) +#endif + static int fuse2fs_truncate(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size) { ext2_filsys fs = ff->fs; ext2_file_t file; __u64 old_isize; errcode_t err; + int flags = EXT2_FILE_WRITE; int ret = 0; - err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file); + /* the kernel handles all eof zeroing for us in iomap mode */ + ret = fuse2fs_file_uses_iomap(ff, ino); + switch (ret) { + case 0: + break; + case 1: + flags |= EXT2_FILE_NOBLOCKIO; + ret = 0; + break; + default: + return ret; + } + + err = ext2fs_file_open(fs, ino, flags, &file); if (err) return translate_error(fs, ino, err); @@ -3626,6 +3683,19 @@ static int __op_open(struct fuse2fs *ff, const char *path, goto out; } + /* the kernel handles all block IO for us in iomap mode */ + ret = fuse2fs_file_uses_iomap(ff, file->ino); + switch (ret) { + case 0: + break; + case 1: + file->open_flags |= EXT2_FILE_NOBLOCKIO; + ret = 0; + break; + default: + goto out; + } + if (fp->flags & O_TRUNC) { ret = fuse2fs_truncate(ff, file->ino, 0); if (ret) ^ permalink raw reply related [flat|nested] 191+ messages in thread
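The tri-state convention that fuse2fs_file_uses_iomap() follows (0 for no, 1 for yes, negative errno on failure) can be folded into the open flags in one place. What follows is only a minimal sketch of that pattern, not code from the patch: DEMO_FILE_WRITE, DEMO_FILE_NOBLOCKIO and demo_pick_open_flags() are hypothetical stand-ins for EXT2_FILE_WRITE, EXT2_FILE_NOBLOCKIO and the open-coded switch statements above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for EXT2_FILE_WRITE and EXT2_FILE_NOBLOCKIO. */
#define DEMO_FILE_WRITE		0x1
#define DEMO_FILE_NOBLOCKIO	0x2

/*
 * Turn a tri-state "does this file use iomap?" answer into open flags.
 * uses_iomap is 0 for no, 1 for yes, or a negative errno.  Returns 0 on
 * success or passes the negative errno through unchanged.
 */
static int demo_pick_open_flags(int uses_iomap, int *flags)
{
	assert(flags != NULL);

	*flags = DEMO_FILE_WRITE;

	switch (uses_iomap) {
	case 0:
		/* classic path: the server does its own file block IO */
		return 0;
	case 1:
		/* iomap path: the kernel does the file block IO */
		*flags |= DEMO_FILE_NOBLOCKIO;
		return 0;
	default:
		/* anything else is treated as an error code */
		return uses_iomap;
	}
}
```

A helper shaped like this would let the truncate and open paths share one switch instead of repeating it, though whether that is worth doing is a judgment call for the patch author.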
* [PATCH 11/19] fuse2fs: try to create loop device when ext4 device is a regular file 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (9 preceding siblings ...) 2026-04-29 14:55 ` [PATCH 10/19] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong @ 2026-04-29 14:55 ` Darrick J. Wong 2026-04-29 14:55 ` [PATCH 12/19] fuse2fs: enable file IO to inline data files Darrick J. Wong ` (7 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:55 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> If the filesystem device is a regular file, try to create a loop device for it so that we can take advantage of iomap. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- configure | 41 +++++++++++++++++++ configure.ac | 23 +++++++++++ fuse4fs/fuse4fs.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++- lib/config.h.in | 3 + misc/fuse2fs.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++- 5 files changed, 292 insertions(+), 4 deletions(-) diff --git a/configure b/configure index 344c7af2ee48f8..ba1556b34257a6 100755 --- a/configure +++ b/configure @@ -14691,6 +14691,47 @@ printf "%s\n" "#define HAVE_FUSE_IOMAP 1" >>confdefs.h fi +if test -n "$have_fuse_iomap"; then + { printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for fuse_loopdev.h in libfuse" >&5 +printf %s "checking for fuse_loopdev.h in libfuse... " >&6; } + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ + + #define _GNU_SOURCE + #define _FILE_OFFSET_BITS 64 + #define FUSE_USE_VERSION 399 + #include <fuse_loopdev.h> + +int +main (void) +{ + + + ; + return 0; +} + +_ACEOF +if ac_fn_c_try_link "$LINENO" +then : + have_fuse_loopdev=yes + { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5 +printf "%s\n" "yes" >&6; } +else case e in #( + e) { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5 +printf "%s\n" "no" >&6; } ;; +esac +fi +rm -f core conftest.err conftest.$ac_objext conftest.beam \ + conftest$ac_exeext conftest.$ac_ext +fi +if test -n "$have_fuse_loopdev" +then + +printf "%s\n" "#define HAVE_FUSE_LOOPDEV 1" >>confdefs.h + +fi + have_fuse_lowlevel= if test -n "$FUSE_USE_VERSION" then diff --git a/configure.ac b/configure.ac index 8d85e9966877ea..8cfde4d85489e5 100644 --- a/configure.ac +++ b/configure.ac @@ -1432,6 +1432,29 @@ then AC_DEFINE(HAVE_FUSE_IOMAP, 1, [Define to 1 if fuse supports iomap]) fi +dnl +dnl Check if fuse library has fuse_loopdev.h, which it only gained after adding +dnl iomap support. +dnl +if test -n "$have_fuse_iomap"; then + AC_MSG_CHECKING(for fuse_loopdev.h in libfuse) + AC_LINK_IFELSE( + [ AC_LANG_PROGRAM([[ + #define _GNU_SOURCE + #define _FILE_OFFSET_BITS 64 + #define FUSE_USE_VERSION 399 + #include <fuse_loopdev.h> + ]], [[ + ]]) + ], have_fuse_loopdev=yes + AC_MSG_RESULT(yes), + AC_MSG_RESULT(no)) +fi +if test -n "$have_fuse_loopdev" +then + AC_DEFINE(HAVE_FUSE_LOOPDEV, 1, [Define to 1 if fuse supports loopdev operations]) +fi + dnl dnl Check if the FUSE lowlevel library is supported dnl diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index 68f1f7c02df223..3c3debb6f60ac7 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -27,6 +27,9 @@ #include <unistd.h> #include <ctype.h> #include <assert.h> +#ifdef HAVE_FUSE_LOOPDEV +# include <fuse_loopdev.h> +#endif #define FUSE_DARWIN_ENABLE_EXTENSIONS 0 #ifdef __SET_FOB_FOR_FUSE # error Do not set magic value __SET_FOB_FOR_FUSE!!!! 
@@ -262,6 +265,10 @@ struct fuse4fs { pthread_mutex_t bfl; char *device; char *shortdev; +#ifdef HAVE_FUSE_LOOPDEV + char *loop_device; + int loop_fd; +#endif /* options set by fuse_opt_parse must be of type int */ int ro; @@ -285,6 +292,7 @@ struct fuse4fs { enum fuse4fs_feature_toggle iomap_want; enum fuse4fs_iomap_state iomap_state; uint32_t iomap_dev; + uint64_t iomap_cap; #endif unsigned int blockmask; unsigned long offset; @@ -913,8 +921,23 @@ static inline int fuse4fs_iomap_enabled(const struct fuse4fs *ff) { return ff->iomap_state >= IOMAP_ENABLED; } + +static inline void fuse4fs_discover_iomap(struct fuse4fs *ff) +{ + if (ff->iomap_want == FT_DISABLE) + return; + + ff->iomap_cap = fuse_lowlevel_discover_iomap(-1); +} + +static inline bool fuse4fs_can_iomap(const struct fuse4fs *ff) +{ + return ff->iomap_cap & FUSE_IOMAP_SUPPORT_FILEIO; +} #else # define fuse4fs_iomap_enabled(...) (0) +# define fuse4fs_discover_iomap(...) ((void)0) +# define fuse4fs_can_iomap(...) (false) #endif static inline void fuse4fs_dump_extents(struct fuse4fs *ff, ext2_ino_t ino, @@ -1584,6 +1607,76 @@ static void fuse4fs_release_lockfile(struct fuse4fs *ff) free(ff->lockfile); } +#ifdef HAVE_FUSE_LOOPDEV +static int fuse4fs_try_losetup(struct fuse4fs *ff, int flags) +{ + bool rw = flags & EXT2_FLAG_RW; + int dev_fd; + int ret; + + /* + * Only transform a regular file into a loopdev for iomap, and only if + * the service helper isn't required to do that for us. + */ + if (!fuse4fs_can_iomap(ff) || fuse4fs_is_service(ff)) + return 0; + + /* open the actual target device, see if it's a regular file */ + dev_fd = open(ff->device, rw ? O_RDWR : O_RDONLY); + if (dev_fd < 0) { + err_printf(ff, "%s: %s\n", _("while opening fs"), + error_message(errno)); + return -1; + } + + ret = fuse_loopdev_setup(dev_fd, rw ? 
O_RDWR : O_RDONLY, ff->device, 5, + &ff->loop_fd, &ff->loop_device); + if (ret == -EBUSY) { + /* + * If the setup function returned EBUSY, there is already a + * loop device backed by this file. Report that the file is + * already in use. Ignore the other errors because we can + * otherwise handle filesystem in a file. + */ + err_printf(ff, "%s: %s\n", _("while opening fs loopdev"), + error_message(-ret)); + close(dev_fd); + return -1; + } + + close(dev_fd); + return 0; +} + +static void fuse4fs_detach_losetup(struct fuse4fs *ff) +{ + if (ff->loop_fd >= 0) + close(ff->loop_fd); + ff->loop_fd = -1; +} + +static void fuse4fs_undo_losetup(struct fuse4fs *ff) +{ + fuse4fs_detach_losetup(ff); + free(ff->loop_device); + ff->loop_device = NULL; +} + +static inline const char *fuse4fs_device(const struct fuse4fs *ff) +{ + /* + * If we created a loop device for the file passed in, open that. + * Otherwise open the path the user gave us. + */ + return ff->loop_device ? ff->loop_device : ff->device; +} +#else +# define fuse4fs_try_losetup(...) (0) +# define fuse4fs_detach_losetup(...) ((void)0) +# define fuse4fs_undo_losetup(...) 
((void)0) +# define fuse4fs_device(ff) ((ff)->device) +#endif + static void fuse4fs_unmount(struct fuse4fs *ff) { char uuid[UUID_STR_SIZE]; @@ -1607,6 +1700,7 @@ static void fuse4fs_unmount(struct fuse4fs *ff) } fuse4fs_service_close_bdev(ff); + fuse4fs_undo_losetup(ff); if (ff->lockfile) fuse4fs_release_lockfile(ff); @@ -1620,6 +1714,8 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff) EXT2_FLAG_EXCLUSIVE | EXT2_FLAG_WRITE_FULL_SUPER; errcode_t err; + fuse4fs_discover_iomap(ff); + if (ff->lockfile) { err = fuse4fs_acquire_lockfile(ff); if (err) @@ -1632,6 +1728,12 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff) if (ff->directio) flags |= EXT2_FLAG_DIRECT_IO; + dbg_printf(ff, "opening with flags=0x%x\n", flags); + + err = fuse4fs_try_losetup(ff, flags); + if (err) + return err; + /* * If the filesystem is stored on a block device, the _EXCLUSIVE flag * causes libext2fs to try to open the block device with O_EXCL. If @@ -1666,8 +1768,8 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff) if (fuse4fs_is_service(ff)) err = fuse4fs_service_openfs(ff, options, &flags); else - err = ext2fs_open2(ff->device, options, flags, 0, 0, - unix_io_manager, &ff->fs); + err = ext2fs_open2(fuse4fs_device(ff), options, flags, + 0, 0, unix_io_manager, &ff->fs); if ((err == EPERM || err == EACCES) && (!ff->ro || (flags & EXT2_FLAG_RW))) { /* @@ -1681,6 +1783,11 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff) flags &= ~EXT2_FLAG_RW; ff->ro = 1; + fuse4fs_undo_losetup(ff); + err = fuse4fs_try_losetup(ff, flags); + if (err) + return err; + /* Force the loop to run once more */ err = -1; } @@ -2129,6 +2236,8 @@ static void op_init(void *userdata, struct fuse_conn_info *conn) fuse4fs_iomap_enable(conn, ff); conn->time_gran = 1; + fuse4fs_detach_losetup(ff); + if (ff->opstate == F4OP_WRITABLE) fuse4fs_read_bitmaps(ff); @@ -7737,6 +7846,9 @@ int main(int argc, char *argv[]) .iomap_want = FT_DEFAULT, .iomap_state = IOMAP_UNKNOWN, .iomap_dev = FUSE_IOMAP_DEV_NULL, +#endif +#ifdef 
HAVE_FUSE_LOOPDEV + .loop_fd = -1, #endif }; errcode_t err; diff --git a/lib/config.h.in b/lib/config.h.in index 58338cc926590e..96ed5479181a5b 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -151,6 +151,9 @@ /* Define to 1 if fuse supports iomap */ #undef HAVE_FUSE_IOMAP +/* Define to 1 if fuse supports loopdev operations */ +#undef HAVE_FUSE_LOOPDEV + /* Define to 1 if fuse supports lowlevel API */ #undef HAVE_FUSE_LOWLEVEL diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index f7653dc6c20c3f..3c76eba683a10d 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -25,6 +25,9 @@ #include <sys/ioctl.h> #include <unistd.h> #include <ctype.h> +#ifdef HAVE_FUSE_LOOPDEV +# include <fuse_loopdev.h> +#endif #define FUSE_DARWIN_ENABLE_EXTENSIONS 0 #ifdef __SET_FOB_FOR_FUSE # error Do not set magic value __SET_FOB_FOR_FUSE!!!! @@ -245,6 +248,10 @@ struct fuse2fs { pthread_mutex_t bfl; char *device; char *shortdev; +#ifdef HAVE_FUSE_LOOPDEV + char *loop_device; + int loop_fd; +#endif /* options set by fuse_opt_parse must be of type int */ int ro; @@ -268,6 +275,7 @@ struct fuse2fs { enum fuse2fs_feature_toggle iomap_want; enum fuse2fs_iomap_state iomap_state; uint32_t iomap_dev; + uint64_t iomap_cap; #endif unsigned int blockmask; unsigned long offset; @@ -725,9 +733,23 @@ static inline int fuse2fs_iomap_enabled(const struct fuse2fs *ff) { return ff->iomap_state >= IOMAP_ENABLED; } + +static inline void fuse2fs_discover_iomap(struct fuse2fs *ff) +{ + if (ff->iomap_want == FT_DISABLE) + return; + + ff->iomap_cap = fuse_lowlevel_discover_iomap(-1); +} + +static inline bool fuse2fs_can_iomap(const struct fuse2fs *ff) +{ + return ff->iomap_cap & FUSE_IOMAP_SUPPORT_FILEIO; +} #else # define fuse2fs_iomap_enabled(...) (0) -# define fuse2fs_iomap_enabled(...) (0) +# define fuse2fs_discover_iomap(...) ((void)0) +# define fuse2fs_can_iomap(...) 
(false) #endif static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino, @@ -1201,6 +1223,73 @@ static void fuse2fs_release_lockfile(struct fuse2fs *ff) free(ff->lockfile); } +#ifdef HAVE_FUSE_LOOPDEV +static int fuse2fs_try_losetup(struct fuse2fs *ff, int flags) +{ + bool rw = flags & EXT2_FLAG_RW; + int dev_fd; + int ret; + + /* Only transform a regular file into a loopdev for iomap */ + if (!fuse2fs_can_iomap(ff)) + return 0; + + /* open the actual target device, see if it's a regular file */ + dev_fd = open(ff->device, rw ? O_RDWR : O_RDONLY); + if (dev_fd < 0) { + err_printf(ff, "%s: %s\n", _("while opening fs"), + error_message(errno)); + return -1; + } + + ret = fuse_loopdev_setup(dev_fd, rw ? O_RDWR : O_RDONLY, ff->device, 5, + &ff->loop_fd, &ff->loop_device); + if (ret == -EBUSY) { + /* + * If the setup function returned EBUSY, there is already a + * loop device backed by this file. Report that the file is + * already in use. Ignore the other errors because we can + * otherwise handle filesystem in a file. + */ + err_printf(ff, "%s: %s\n", _("while opening fs loopdev"), + error_message(-ret)); + close(dev_fd); + return -1; + } + + close(dev_fd); + return 0; +} + +static void fuse2fs_detach_losetup(struct fuse2fs *ff) +{ + if (ff->loop_fd >= 0) + close(ff->loop_fd); + ff->loop_fd = -1; +} + +static void fuse2fs_undo_losetup(struct fuse2fs *ff) +{ + fuse2fs_detach_losetup(ff); + free(ff->loop_device); + ff->loop_device = NULL; +} + +static inline const char *fuse2fs_device(const struct fuse2fs *ff) +{ + /* + * If we created a loop device for the file passed in, open that. + * Otherwise open the path the user gave us. + */ + return ff->loop_device ? ff->loop_device : ff->device; +} +#else +# define fuse2fs_try_losetup(...) (0) +# define fuse2fs_detach_losetup(...) ((void)0) +# define fuse2fs_undo_losetup(...) 
((void)0) +# define fuse2fs_device(ff) ((ff)->device) +#endif + static void fuse2fs_unmount(struct fuse2fs *ff) { char uuid[UUID_STR_SIZE]; @@ -1218,6 +1307,8 @@ static void fuse2fs_unmount(struct fuse2fs *ff) uuid); } + fuse2fs_undo_losetup(ff); + if (ff->lockfile) fuse2fs_release_lockfile(ff); } @@ -1230,6 +1321,8 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff) EXT2_FLAG_EXCLUSIVE | EXT2_FLAG_WRITE_FULL_SUPER; errcode_t err; + fuse2fs_discover_iomap(ff); + if (ff->lockfile) { err = fuse2fs_acquire_lockfile(ff); if (err) @@ -1242,6 +1335,12 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff) if (ff->directio) flags |= EXT2_FLAG_DIRECT_IO; + dbg_printf(ff, "opening with flags=0x%x\n", flags); + + err = fuse2fs_try_losetup(ff, flags); + if (err) + return err; + /* * If the filesystem is stored on a block device, the _EXCLUSIVE flag * causes libext2fs to try to open the block device with O_EXCL. If @@ -1273,7 +1372,7 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff) */ deadline = init_deadline(FUSE2FS_OPEN_TIMEOUT); do { - err = ext2fs_open2(ff->device, options, flags, 0, 0, + err = ext2fs_open2(fuse2fs_device(ff), options, flags, 0, 0, unix_io_manager, &ff->fs); if ((err == EPERM || err == EACCES) && (!ff->ro || (flags & EXT2_FLAG_RW))) { @@ -1288,6 +1387,11 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff) flags &= ~EXT2_FLAG_RW; ff->ro = 1; + fuse2fs_undo_losetup(ff); + err = fuse2fs_try_losetup(ff, flags); + if (err) + return err; + /* Force the loop to run once more */ err = -1; } @@ -1744,6 +1848,8 @@ static void *op_init(struct fuse_conn_info *conn, cfg->debug = 1; cfg->nullpath_ok = 1; + fuse2fs_detach_losetup(ff); + if (ff->opstate == F2OP_WRITABLE) fuse2fs_read_bitmaps(ff); @@ -6852,6 +6958,9 @@ int main(int argc, char *argv[]) .iomap_want = FT_DEFAULT, .iomap_state = IOMAP_UNKNOWN, .iomap_dev = FUSE_IOMAP_DEV_NULL, +#endif +#ifdef HAVE_FUSE_LOOPDEV + .loop_fd = -1, #endif }; errcode_t err; ^ permalink raw reply related [flat|nested] 191+ 
messages in thread
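The loop device patch hinges on two small decisions: whether the path the user passed is a regular file, and which path to hand to ext2fs_open2() afterwards. The patch delegates the first decision to fuse_loopdev_setup(), whose internals are not shown in this thread, so the following is only a rough sketch under the assumption that a plain stat() check is good enough for illustration. demo_is_regular_file() and demo_device_path() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if path is a regular file, 0 if it is
 * anything else (block device, directory, ...), or a negative errno if
 * the path cannot be examined.
 */
static int demo_is_regular_file(const char *path)
{
	struct stat sb;

	assert(path != NULL);

	if (stat(path, &sb) < 0)
		return -errno;

	return S_ISREG(sb.st_mode) ? 1 : 0;
}

/*
 * Same selection rule as fuse2fs_device(): prefer the loop device if one
 * was set up, otherwise fall back to the path the user gave us.
 */
static const char *demo_device_path(const char *loop_device,
				    const char *device)
{
	return loop_device ? loop_device : device;
}
```

Keeping the selection in one accessor means the retry path (read-write open fails, loop device is torn down and recreated read-only) never has to know which of the two paths is live.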
* [PATCH 12/19] fuse2fs: enable file IO to inline data files 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (10 preceding siblings ...) 2026-04-29 14:55 ` [PATCH 11/19] fuse2fs: try to create loop device when ext4 device is a regular file Darrick J. Wong @ 2026-04-29 14:55 ` Darrick J. Wong 2026-04-29 14:56 ` [PATCH 13/19] fuse2fs: set iomap-related inode flags Darrick J. Wong ` (6 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:55 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Enable file reads and writes from inline data files. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 3 ++- misc/fuse2fs.c | 42 ++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 42 insertions(+), 3 deletions(-) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index 3c3debb6f60ac7..e2421dda75475a 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -6566,7 +6566,8 @@ static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino, { /* fall back to slow path for inline data reads */ if (inode->i_flags & EXT4_INLINE_DATA_FL) - return -ENOSYS; + return fuse4fs_iomap_begin_inline(ff, ino, inode, pos, count, + read); if (inode->i_flags & EXT4_EXTENTS_FL) return fuse4fs_iomap_begin_extent(ff, ino, inode, pos, count, diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 3c76eba683a10d..eecbf60a3360c6 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -1846,7 +1846,16 @@ static void *op_init(struct fuse_conn_info *conn, cfg->use_ino = 1; if (ff->debug) cfg->debug = 1; - cfg->nullpath_ok = 1; + + /* + * Inline data file io depends on op_read/write being fed a path, so we + * have to slow everyone down to look up the path from the nodeid. 
+ */ + if (fuse2fs_iomap_enabled(ff) && + ext2fs_has_feature_inline_data(ff->fs->super)) + cfg->nullpath_ok = 0; + else + cfg->nullpath_ok = 1; fuse2fs_detach_losetup(ff); @@ -3840,6 +3849,9 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf, size_t len, off_t offset, struct fuse_file_info *fp) { + struct fuse2fs_file_handle fhurk = { + .magic = FUSE2FS_FILE_MAGIC, + }; struct fuse2fs *ff = fuse2fs_get(); struct fuse2fs_file_handle *fh = fuse2fs_get_handle(fp); ext2_filsys fs; @@ -3849,10 +3861,21 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf, int ret = 0; FUSE2FS_CHECK_CONTEXT(ff); + + if (!fh) + fh = &fhurk; + FUSE2FS_CHECK_HANDLE(ff, fh); dbg_printf(ff, "%s: ino=%d off=0x%llx len=0x%zx\n", __func__, fh->ino, (unsigned long long)offset, len); fs = fuse2fs_start(ff); + + if (fh == &fhurk) { + ret = fuse2fs_file_ino(ff, path, NULL, &fhurk.ino); + if (ret) + goto out; + } + err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp); if (err) { ret = translate_error(fs, fh->ino, err); @@ -3894,6 +3917,10 @@ static int op_write(const char *path EXT2FS_ATTR((unused)), const char *buf, size_t len, off_t offset, struct fuse_file_info *fp) { + struct fuse2fs_file_handle fhurk = { + .magic = FUSE2FS_FILE_MAGIC, + .open_flags = EXT2_FILE_WRITE, + }; struct fuse2fs *ff = fuse2fs_get(); struct fuse2fs_file_handle *fh = fuse2fs_get_handle(fp); ext2_filsys fs; @@ -3903,6 +3930,10 @@ static int op_write(const char *path EXT2FS_ATTR((unused)), int ret = 0; FUSE2FS_CHECK_CONTEXT(ff); + + if (!fh) + fh = &fhurk; + FUSE2FS_CHECK_HANDLE(ff, fh); dbg_printf(ff, "%s: ino=%d off=0x%llx len=0x%zx\n", __func__, fh->ino, (unsigned long long) offset, len); @@ -3917,6 +3948,12 @@ static int op_write(const char *path EXT2FS_ATTR((unused)), goto out; } + if (fh == &fhurk) { + ret = fuse2fs_file_ino(ff, path, NULL, &fhurk.ino); + if (ret) + goto out; + } + err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp); if (err) { ret = 
translate_error(fs, fh->ino, err); @@ -5860,7 +5897,8 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino, { /* fall back to slow path for inline data reads */ if (inode->i_flags & EXT4_INLINE_DATA_FL) - return -ENOSYS; + return fuse2fs_iomap_begin_inline(ff, ino, inode, pos, count, + read); if (inode->i_flags & EXT4_EXTENTS_FL) return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count, ^ permalink raw reply related [flat|nested] 191+ messages in thread
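The op_read/op_write changes above rely on a stack-allocated fallback handle when fuse hands the server no file handle, and resolve the inode from the path instead. A minimal sketch of that fallback follows. It is an illustration only: struct demo_file_handle, DEMO_FILE_MAGIC, demo_path_to_ino() and the "/inline.txt" path are all made up here and merely play the roles of struct fuse2fs_file_handle, FUSE2FS_FILE_MAGIC and fuse2fs_file_ino().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_FILE_MAGIC	0x46483246UL	/* hypothetical magic value */

struct demo_file_handle {
	unsigned long magic;
	unsigned int ino;
};

/* Hypothetical path resolver that only knows a single file. */
static int demo_path_to_ino(const char *path, unsigned int *ino)
{
	if (!path)
		return -EINVAL;
	if (strcmp(path, "/inline.txt") != 0)
		return -ENOENT;
	*ino = 12;
	return 0;
}

/*
 * Find the inode number for an IO request.  If the caller has no open
 * handle, fill out a temporary one on the stack from the path, which is
 * why the path must not be NULL in that case.
 */
static int demo_resolve_ino(const struct demo_file_handle *fh,
			    const char *path, unsigned int *ino)
{
	struct demo_file_handle fallback = { .magic = DEMO_FILE_MAGIC };
	int ret;

	assert(ino != NULL);

	if (!fh) {
		ret = demo_path_to_ino(path, &fallback.ino);
		if (ret)
			return ret;
		fh = &fallback;
	}

	if (fh->magic != DEMO_FILE_MAGIC)
		return -EBADF;

	*ino = fh->ino;
	return 0;
}
```

The sketch also shows why the patch clears nullpath_ok for inline data filesystems: without a handle, the path is the only way left to find the inode.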
* [PATCH 13/19] fuse2fs: set iomap-related inode flags 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (11 preceding siblings ...) 2026-04-29 14:55 ` [PATCH 12/19] fuse2fs: enable file IO to inline data files Darrick J. Wong @ 2026-04-29 14:56 ` Darrick J. Wong 2026-04-29 14:56 ` [PATCH 14/19] fuse2fs: configure block device block size Darrick J. Wong ` (5 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:56 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Set FUSE_IFLAG_* when we do a getattr, so that all files will have iomap enabled. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 46 +++++++++++++++++++++++++++++++++++----------- misc/fuse2fs.c | 20 ++++++++++++++++++++ 2 files changed, 55 insertions(+), 11 deletions(-) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index e2421dda75475a..f9c905c0805d9e 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -2265,6 +2265,7 @@ static void op_init(void *userdata, struct fuse_conn_info *conn) struct fuse4fs_stat { struct fuse_entry_param entry; + unsigned int iflags; }; static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino, @@ -2330,9 +2331,29 @@ static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino, entry->attr_timeout = FUSE4FS_ATTR_TIMEOUT; entry->entry_timeout = FUSE4FS_ATTR_TIMEOUT; + fstat->iflags = 0; +#ifdef HAVE_FUSE_IOMAP + if (fuse4fs_iomap_enabled(ff)) + fstat->iflags |= FUSE_IFLAG_IOMAP | FUSE_IFLAG_EXCLUSIVE; +#endif + return 0; } +#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 99) +#define fuse_reply_entry_iflags(req, entry, iflags) \ + fuse_reply_entry((req), (entry)) + +#define fuse_reply_attr_iflags(req, entry, iflags, timeout) \ + fuse_reply_attr((req), (entry), (timeout)) + +#define fuse_add_direntry_plus_iflags(req, 
buf, sz, name, iflags, entry, dirpos) \ + fuse_add_direntry_plus((req), (buf), (sz), (name), (entry), (dirpos)) + +#define fuse_reply_create_iflags(req, entry, iflags, fp) \ + fuse_reply_create((req), (entry), (fp)) +#endif + static void op_lookup(fuse_req_t req, fuse_ino_t fino, const char *name) { struct fuse4fs_stat fstat; @@ -2363,7 +2384,7 @@ static void op_lookup(fuse_req_t req, fuse_ino_t fino, const char *name) if (ret) fuse_reply_err(req, -ret); else - fuse_reply_entry(req, &fstat.entry); + fuse_reply_entry_iflags(req, &fstat.entry, fstat.iflags); } static void op_getattr(fuse_req_t req, fuse_ino_t fino, @@ -2383,8 +2404,8 @@ static void op_getattr(fuse_req_t req, fuse_ino_t fino, if (ret) fuse_reply_err(req, -ret); else - fuse_reply_attr(req, &fstat.entry.attr, - fstat.entry.attr_timeout); + fuse_reply_attr_iflags(req, &fstat.entry.attr, fstat.iflags, + fstat.entry.attr_timeout); } static void op_readlink(fuse_req_t req, fuse_ino_t fino) @@ -2662,7 +2683,7 @@ static void fuse4fs_reply_entry(fuse_req_t req, ext2_ino_t ino, return; } - fuse_reply_entry(req, &fstat.entry); + fuse_reply_entry_iflags(req, &fstat.entry, fstat.iflags); } static void op_mknod(fuse_req_t req, fuse_ino_t fino, const char *name, @@ -4990,10 +5011,13 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)), namebuf[dirent->name_len & 0xFF] = 0; if (i->readdirplus) { - entrysize = fuse_add_direntry_plus(i->req, i->buf + i->bufused, - i->bufsz - i->bufused, - namebuf, &fstat.entry, - i->dirpos); + entrysize = fuse_add_direntry_plus_iflags(i->req, + i->buf + i->bufused, + i->bufsz - i->bufused, + namebuf, + fstat.iflags, + &fstat.entry, + i->dirpos); } else { entrysize = fuse_add_direntry(i->req, i->buf + i->bufused, i->bufsz - i->bufused, namebuf, @@ -5218,7 +5242,7 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name, if (ret) fuse_reply_err(req, -ret); else - fuse_reply_create(req, &fstat.entry, fp); + fuse_reply_create_iflags(req, &fstat.entry, 
fstat.iflags, fp); } #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17) @@ -5417,8 +5441,8 @@ static void op_setattr(fuse_req_t req, fuse_ino_t fino, struct stat *attr, if (ret) fuse_reply_err(req, -ret); else - fuse_reply_attr(req, &fstat.entry.attr, - fstat.entry.attr_timeout); + fuse_reply_attr_iflags(req, &fstat.entry.attr, fstat.iflags, + fstat.entry.attr_timeout); } #define FUSE4FS_MODIFIABLE_IFLAGS \ diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index eecbf60a3360c6..c6472a1c45506f 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -1987,6 +1987,23 @@ static int op_getattr(const char *path, struct stat *statbuf, return ret; } +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99) +static int op_getattr_iflags(const char *path, struct stat *statbuf, + unsigned int *iflags, struct fuse_file_info *fi) +{ + int ret = op_getattr(path, statbuf, fi); + + if (ret) + return ret; + + if (fuse_fs_can_enable_iomap(statbuf)) + *iflags |= FUSE_IFLAG_IOMAP | FUSE_IFLAG_EXCLUSIVE; + + return 0; +} +#endif + + static int op_readlink(const char *path, char *buf, size_t len) { struct fuse2fs *ff = fuse2fs_get(); @@ -6673,6 +6690,9 @@ static struct fuse_operations fs_ops = { #ifdef SUPPORT_FALLOCATE .fallocate = op_fallocate, #endif +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99) + .getattr_iflags = op_getattr_iflags, +#endif #ifdef HAVE_FUSE_IOMAP .iomap_begin = op_iomap_begin, .iomap_end = op_iomap_end, ^ permalink raw reply related [flat|nested] 191+ messages in thread
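The fuse4fs half of this patch keeps the call sites uniform by defining the *_iflags reply functions as macros that drop the extra argument when building against an older libfuse. Below is a small sketch of that shim technique in isolation. Every name in it (DEMO_MAKE_VERSION, DEMO_API_VERSION, demo_reply_attr, demo_reply_attr_iflags) is hypothetical, and the version number is pinned to an "old" value purely so the fallback branch is the one that gets compiled.

```c
#include <assert.h>

#define DEMO_MAKE_VERSION(maj, min)	(((maj) << 16) | (min))

/* Pretend we are building against an old library unless told otherwise. */
#ifndef DEMO_API_VERSION
# define DEMO_API_VERSION	DEMO_MAKE_VERSION(3, 17)
#endif

/* Records what a "new" library would have been told; unused when old. */
static unsigned int demo_last_iflags;

/* The call that every library version provides. */
static int demo_reply_attr(int attr, double timeout)
{
	assert(timeout >= 0.0);
	return attr;
}

#if DEMO_API_VERSION < DEMO_MAKE_VERSION(3, 99)
/* Old library: accept the iflags argument at the call site and drop it. */
# define demo_reply_attr_iflags(attr, iflags, timeout) \
	demo_reply_attr((attr), (timeout))
#else
/* New library: forward the flags along with the reply. */
static int demo_reply_attr_iflags(int attr, unsigned int iflags,
				  double timeout)
{
	demo_last_iflags = iflags;
	return demo_reply_attr(attr, timeout);
}
#endif
```

One caveat with this style of shim is that the dropped macro argument is never evaluated, so call sites should not pass expressions with side effects as the iflags value.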
* [PATCH 14/19] fuse2fs: configure block device block size 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (12 preceding siblings ...) 2026-04-29 14:56 ` [PATCH 13/19] fuse2fs: set iomap-related inode flags Darrick J. Wong @ 2026-04-29 14:56 ` Darrick J. Wong 2026-04-29 14:56 ` [PATCH 15/19] fuse4fs: separate invalidation Darrick J. Wong ` (4 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:56 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Set the blocksize of the block device to the filesystem blocksize. This prevents the bdev pagecache from caching file data blocks that iomap will read and write directly. Cache duplication is dangerous. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 43 +++++++++++++++++++++++++++++++++++++++++++ misc/fuse2fs.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 86 insertions(+) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index f9c905c0805d9e..e92a85da0115ca 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -6891,6 +6891,45 @@ static off_t fuse4fs_max_size(struct fuse4fs *ff, off_t upper_limit) return res; } +/* + * Set the block device's blocksize to the fs blocksize. + * + * This is required to avoid creating uptodate bdev pagecache that aliases file + * data blocks because iomap reads and writes directly to file data blocks. + */ +static int fuse4fs_set_bdev_blocksize(struct fuse4fs *ff, int fd) +{ + int blocksize = ff->fs->blocksize; + int set_error; + int ret; + + ret = ioctl(fd, BLKBSZSET, &blocksize); + if (!ret) + return 0; + + /* + * Save the original errno so we can report that if the block device + * blocksize isn't set in an agreeable way. 
+ */ + set_error = errno; + + ret = ioctl(fd, BLKBSZGET, &blocksize); + if (ret) + goto out_bad; + + /* Pretend that BLKBSZSET rejected our proposed block size */ + if (blocksize > ff->fs->blocksize) { + set_error = EINVAL; + goto out_bad; + } + + return 0; +out_bad: + err_printf(ff, "%s: cannot set blocksize %u: %s\n", __func__, + blocksize, strerror(set_error)); + return -EIO; +} + static int fuse4fs_iomap_config_devices(struct fuse4fs *ff) { errcode_t err; @@ -6901,6 +6940,10 @@ static int fuse4fs_iomap_config_devices(struct fuse4fs *ff) if (err) return translate_error(ff->fs, 0, err); + ret = fuse4fs_set_bdev_blocksize(ff, fd); + if (ret) + return ret; + ret = fuse_lowlevel_iomap_device_add(ff->fuse, fd, 0); if (ret < 0) { dbg_printf(ff, "%s: cannot register iomap dev fd=%d, err=%d\n", diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index c6472a1c45506f..c922c7fb45d311 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -6211,6 +6211,45 @@ static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit) return res; } +/* + * Set the block device's blocksize to the fs blocksize. + * + * This is required to avoid creating uptodate bdev pagecache that aliases file + * data blocks because iomap reads and writes directly to file data blocks. + */ +static int fuse2fs_set_bdev_blocksize(struct fuse2fs *ff, int fd) +{ + int blocksize = ff->fs->blocksize; + int set_error; + int ret; + + ret = ioctl(fd, BLKBSZSET, &blocksize); + if (!ret) + return 0; + + /* + * Save the original errno so we can report that if the block device + * blocksize isn't set in an agreeable way. 
+ */ + set_error = errno; + + ret = ioctl(fd, BLKBSZGET, &blocksize); + if (ret) + goto out_bad; + + /* Pretend that BLKBSZSET rejected our proposed block size */ + if (blocksize > ff->fs->blocksize) { + set_error = EINVAL; + goto out_bad; + } + + return 0; +out_bad: + err_printf(ff, "%s: cannot set blocksize %u: %s\n", __func__, + blocksize, strerror(set_error)); + return -EIO; +} + static int fuse2fs_iomap_config_devices(struct fuse2fs *ff) { errcode_t err; @@ -6221,6 +6260,10 @@ static int fuse2fs_iomap_config_devices(struct fuse2fs *ff) if (err) return translate_error(ff->fs, 0, err); + ret = fuse2fs_set_bdev_blocksize(ff, fd); + if (ret) + return ret; + ret = fuse_fs_iomap_device_add(fd, 0); if (ret < 0) { dbg_printf(ff, "%s: cannot register iomap dev fd=%d, err=%d\n", ^ permalink raw reply related [flat|nested] 191+ messages in thread
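The acceptance rule in fuse2fs_set_bdev_blocksize() is easy to lose among the ioctl calls: a failed BLKBSZSET is tolerated as long as BLKBSZGET succeeds and reports a block size no larger than the filesystem block size. The sketch below restates just that decision as a pure function so it can be reasoned about without a block device. demo_check_bdev_blocksize() and its parameters are hypothetical; the real code obtains these values from the two ioctls.

```c
#include <assert.h>
#include <errno.h>

/*
 * Decide whether the block device's block size is acceptable.
 *
 * set_errno:      0 if BLKBSZSET succeeded, otherwise the errno it left
 * get_errno:      0 if BLKBSZGET succeeded, otherwise the errno it left
 * bdev_blocksize: the value BLKBSZGET reported (only valid if it succeeded)
 * fs_blocksize:   the filesystem block size we wanted to set
 *
 * Returns 0 if the configuration is usable or -EIO if it is not, which
 * mirrors the return values of the function in the patch.
 */
static int demo_check_bdev_blocksize(int set_errno, int get_errno,
				     int bdev_blocksize, int fs_blocksize)
{
	assert(fs_blocksize > 0);

	/* the kernel accepted the fs block size, nothing more to check */
	if (!set_errno)
		return 0;

	/* could not set it and cannot even query it: give up */
	if (get_errno)
		return -EIO;

	/* a larger bdev block size is treated as if the set was rejected */
	if (bdev_blocksize > fs_blocksize)
		return -EIO;

	return 0;
}
```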
* [PATCH 15/19] fuse4fs: separate invalidation 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (13 preceding siblings ...) 2026-04-29 14:56 ` [PATCH 14/19] fuse2fs: configure block device block size Darrick J. Wong @ 2026-04-29 14:56 ` Darrick J. Wong 2026-04-29 14:56 ` [PATCH 16/19] fuse2fs: implement statx Darrick J. Wong ` (3 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:56 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Use the new iomap device invalidation calls to drop any block device pagecache covering blocks that the filesystem frees. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++ misc/fuse2fs.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 121 insertions(+) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index e92a85da0115ca..6016e23c511ac1 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -293,6 +293,9 @@ struct fuse4fs { enum fuse4fs_iomap_state iomap_state; uint32_t iomap_dev; uint64_t iomap_cap; + void (*old_alloc_stats)(ext2_filsys fs, blk64_t blk, int inuse); + void (*old_alloc_stats_range)(ext2_filsys fs, blk64_t blk, blk_t num, + int inuse); #endif unsigned int blockmask; unsigned long offset; @@ -6958,6 +6961,51 @@ static int fuse4fs_iomap_config_devices(struct fuse4fs *ff) return 0; } +static void fuse4fs_invalidate_bdev(struct fuse4fs *ff, blk64_t blk, blk_t num) +{ + off_t offset = FUSE4FS_FSB_TO_B(ff, blk); + off_t length = FUSE4FS_FSB_TO_B(ff, num); + int ret; + + ret = fuse_lowlevel_iomap_device_invalidate(ff->fuse, ff->iomap_dev, + offset, length); + if (!ret) + return; + + if (num == 1) + err_printf(ff, "%s %llu: %s\n", + _("error invalidating block"), + (unsigned long long)blk, + strerror(ret)); + else + err_printf(ff, "%s %llu-%llu: %s\n", + _("error invalidating blocks"), + 
(unsigned long long)blk, + (unsigned long long)blk + num - 1, + strerror(ret)); +} + +static void fuse4fs_alloc_stats(ext2_filsys fs, blk64_t blk, int inuse) +{ + struct fuse4fs *ff = fs->priv_data; + + if (inuse < 0) + fuse4fs_invalidate_bdev(ff, blk, 1); + if (ff->old_alloc_stats) + ff->old_alloc_stats(fs, blk, inuse); +} + +static void fuse4fs_alloc_stats_range(ext2_filsys fs, blk64_t blk, blk_t num, + int inuse) +{ + struct fuse4fs *ff = fs->priv_data; + + if (inuse < 0) + fuse4fs_invalidate_bdev(ff, blk, num); + if (ff->old_alloc_stats_range) + ff->old_alloc_stats_range(fs, blk, num, inuse); +} + static void op_iomap_config(fuse_req_t req, const struct fuse_iomap_config_params *p, size_t psize) @@ -7004,6 +7052,19 @@ static void op_iomap_config(fuse_req_t req, if (ret) goto out_unlock; + /* + * If we let iomap do all file block IO, then we need to watch for + * freed blocks so that we can invalidate any page cache that might + * get written to the block device. + */ + if (fuse4fs_iomap_enabled(ff)) { + ext2fs_set_block_alloc_stats_callback(ff->fs, + fuse4fs_alloc_stats, &ff->old_alloc_stats); + ext2fs_set_block_alloc_stats_range_callback(ff->fs, + fuse4fs_alloc_stats_range, + &ff->old_alloc_stats_range); + } + out_unlock: fuse4fs_finish(ff, ret); if (ret) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index c922c7fb45d311..138346fcc4517f 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -276,6 +276,9 @@ struct fuse2fs { enum fuse2fs_iomap_state iomap_state; uint32_t iomap_dev; uint64_t iomap_cap; + void (*old_alloc_stats)(ext2_filsys fs, blk64_t blk, int inuse); + void (*old_alloc_stats_range)(ext2_filsys fs, blk64_t blk, blk_t num, + int inuse); #endif unsigned int blockmask; unsigned long offset; @@ -6278,6 +6281,50 @@ static int fuse2fs_iomap_config_devices(struct fuse2fs *ff) return 0; } +static void fuse2fs_invalidate_bdev(struct fuse2fs *ff, blk64_t blk, blk_t num) +{ + off_t offset = FUSE2FS_FSB_TO_B(ff, blk); + off_t length = FUSE2FS_FSB_TO_B(ff, num); 
+ int ret; + + ret = fuse_fs_iomap_device_invalidate(ff->iomap_dev, offset, length); + if (!ret) + return; + + if (num == 1) + err_printf(ff, "%s %llu: %s\n", + _("error invalidating block"), + (unsigned long long)blk, + strerror(ret)); + else + err_printf(ff, "%s %llu-%llu: %s\n", + _("error invalidating blocks"), + (unsigned long long)blk, + (unsigned long long)blk + num - 1, + strerror(ret)); +} + +static void fuse2fs_alloc_stats(ext2_filsys fs, blk64_t blk, int inuse) +{ + struct fuse2fs *ff = fs->priv_data; + + if (inuse < 0) + fuse2fs_invalidate_bdev(ff, blk, 1); + if (ff->old_alloc_stats) + ff->old_alloc_stats(fs, blk, inuse); +} + +static void fuse2fs_alloc_stats_range(ext2_filsys fs, blk64_t blk, blk_t num, + int inuse) +{ + struct fuse2fs *ff = fs->priv_data; + + if (inuse < 0) + fuse2fs_invalidate_bdev(ff, blk, num); + if (ff->old_alloc_stats_range) + ff->old_alloc_stats_range(fs, blk, num, inuse); +} + static int op_iomap_config(const struct fuse_iomap_config_params *p, size_t psize, struct fuse_iomap_config *cfg) { @@ -6322,6 +6369,19 @@ static int op_iomap_config(const struct fuse_iomap_config_params *p, if (ret) goto out_unlock; + /* + * If we let iomap do all file block IO, then we need to watch for + * freed blocks so that we can invalidate any page cache that might + * get written to the block deivce. + */ + if (fuse2fs_iomap_enabled(ff)) { + ext2fs_set_block_alloc_stats_callback(ff->fs, + fuse2fs_alloc_stats, &ff->old_alloc_stats); + ext2fs_set_block_alloc_stats_range_callback(ff->fs, + fuse2fs_alloc_stats_range, + &ff->old_alloc_stats_range); + } + out_unlock: fuse2fs_finish(ff, ret); return ret; ^ permalink raw reply related [flat|nested] 191+ messages in thread
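The hook installation in the patch above follows a save-and-chain pattern: the server records whichever allocation-stats callback was already installed, installs its own, and forwards every event to the old one after doing its own work. A minimal sketch of that pattern in isolation follows. All `demo_*` names are hypothetical stand-ins invented for illustration; they are not libext2fs types or functions, and the counter merely marks where the real code would invalidate the block device page cache.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real code uses ext2_filsys and blk64_t. */
typedef unsigned long long demo_blk_t;
struct demo_fs;
typedef void (*demo_alloc_stats_fn)(struct demo_fs *fs, demo_blk_t blk,
				    int inuse);

struct demo_fs {
	void			*priv_data;	/* points back at the server */
	demo_alloc_stats_fn	alloc_stats;	/* currently installed hook */
};

struct demo_server {
	demo_alloc_stats_fn	old_alloc_stats; /* hook that we displaced */
	unsigned int		nr_invalidated;	 /* stand-in for invalidation */
};

/* Counts how many events reached the pre-existing hook. */
static unsigned int demo_nr_accounted;

/* Pretend this is the accounting hook that was installed first. */
static void demo_accounting(struct demo_fs *fs, demo_blk_t blk, int inuse)
{
	(void)fs;
	(void)blk;
	(void)inuse;
	demo_nr_accounted++;
}

/* Install a new hook and hand the previous one back to the caller. */
static void demo_set_callback(struct demo_fs *fs, demo_alloc_stats_fn fn,
			      demo_alloc_stats_fn *old)
{
	if (old)
		*old = fs->alloc_stats;
	fs->alloc_stats = fn;
}

/* Act only on frees (inuse < 0), then chain to the displaced hook. */
static void demo_alloc_stats(struct demo_fs *fs, demo_blk_t blk, int inuse)
{
	struct demo_server *srv = fs->priv_data;

	if (inuse < 0)
		srv->nr_invalidated++;
	if (srv->old_alloc_stats)
		srv->old_alloc_stats(fs, blk, inuse);
}
```

A caller would install the hook with `demo_set_callback(&fs, demo_alloc_stats, &srv.old_alloc_stats)` and then drive events through `fs.alloc_stats`. Chaining keeps the earlier hook's accounting intact, which is presumably why the patch stores the old pointers instead of overwriting them.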
* [PATCH 16/19] fuse2fs: implement statx 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (14 preceding siblings ...) 2026-04-29 14:56 ` [PATCH 15/19] fuse4fs: separate invalidation Darrick J. Wong @ 2026-04-29 14:56 ` Darrick J. Wong 2026-04-29 14:57 ` [PATCH 17/19] fuse2fs: enable atomic writes Darrick J. Wong ` (2 subsequent siblings) 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:56 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Implement statx. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++ misc/fuse2fs.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 267 insertions(+) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index 6016e23c511ac1..8d994fe490e914 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -24,6 +24,7 @@ #include <sys/xattr.h> #endif #include <sys/ioctl.h> +#include <sys/sysmacros.h> #include <unistd.h> #include <ctype.h> #include <assert.h> @@ -2411,6 +2412,138 @@ static void op_getattr(fuse_req_t req, fuse_ino_t fino, fstat.entry.attr_timeout); } +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18) && defined(STATX_BASIC_STATS) +static inline void fuse4fs_set_statx_attr(struct statx *stx, + uint64_t statx_flag, int set) +{ + if (set) + stx->stx_attributes |= statx_flag; + stx->stx_attributes_mask |= statx_flag; +} + +static void fuse4fs_statx_directio(struct fuse4fs *ff, struct statx *stx) +{ + struct statx devx; + errcode_t err; + int fd; + + err = io_channel_get_fd(ff->fs->io, &fd); + if (err) + return; + + err = statx(fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &devx); + if (err) + return; + if (!(devx.stx_mask & STATX_DIOALIGN)) + return; + + stx->stx_mask |= STATX_DIOALIGN; + stx->stx_dio_mem_align = 
devx.stx_dio_mem_align; + stx->stx_dio_offset_align = devx.stx_dio_offset_align; +} + +static int fuse4fs_statx(struct fuse4fs *ff, ext2_ino_t ino, int statx_mask, + struct statx *stx) +{ + struct ext2_inode_large inode; + ext2_filsys fs = ff->fs;; + dev_t fakedev = 0; + errcode_t err; + struct timespec tv; + + err = fuse4fs_read_inode(fs, ino, &inode); + if (err) + return translate_error(fs, ino, err); + + memcpy(&fakedev, fs->super->s_uuid, sizeof(fakedev)); + stx->stx_mask = STATX_BASIC_STATS; + stx->stx_dev_major = major(fakedev); + stx->stx_dev_minor = minor(fakedev); + stx->stx_ino = ino; + stx->stx_mode = inode.i_mode; + stx->stx_nlink = inode.i_links_count; + stx->stx_uid = inode_uid(inode); + stx->stx_gid = inode_gid(inode); + stx->stx_size = EXT2_I_SIZE(&inode); + stx->stx_blksize = fs->blocksize; + stx->stx_blocks = ext2fs_get_stat_i_blocks(fs, + EXT2_INODE(&inode)); + EXT4_INODE_GET_XTIME(i_atime, &tv, &inode); + stx->stx_atime.tv_sec = tv.tv_sec; + stx->stx_atime.tv_nsec = tv.tv_nsec; + + EXT4_INODE_GET_XTIME(i_mtime, &tv, &inode); + stx->stx_mtime.tv_sec = tv.tv_sec; + stx->stx_mtime.tv_nsec = tv.tv_nsec; + + EXT4_INODE_GET_XTIME(i_ctime, &tv, &inode); + stx->stx_ctime.tv_sec = tv.tv_sec; + stx->stx_ctime.tv_nsec = tv.tv_nsec; + + if (EXT4_FITS_IN_INODE(&inode, i_crtime)) { + stx->stx_mask |= STATX_BTIME; + EXT4_INODE_GET_XTIME(i_crtime, &tv, &inode); + stx->stx_btime.tv_sec = tv.tv_sec; + stx->stx_btime.tv_nsec = tv.tv_nsec; + } + + dbg_printf(ff, "%s: ino=%d atime=%lld.%d mtime=%lld.%d ctime=%lld.%d btime=%lld.%d\n", + __func__, ino, + (long long int)stx->stx_atime.tv_sec, stx->stx_atime.tv_nsec, + (long long int)stx->stx_mtime.tv_sec, stx->stx_mtime.tv_nsec, + (long long int)stx->stx_ctime.tv_sec, stx->stx_ctime.tv_nsec, + (long long int)stx->stx_btime.tv_sec, stx->stx_btime.tv_nsec); + + if (LINUX_S_ISCHR(inode.i_mode) || + LINUX_S_ISBLK(inode.i_mode)) { + if (inode.i_block[0]) { + stx->stx_rdev_major = major(inode.i_block[0]); + 
stx->stx_rdev_minor = minor(inode.i_block[0]); + } else { + stx->stx_rdev_major = major(inode.i_block[1]); + stx->stx_rdev_minor = minor(inode.i_block[1]); + } + } + + fuse4fs_set_statx_attr(stx, STATX_ATTR_COMPRESSED, + inode.i_flags & EXT2_COMPR_FL); + fuse4fs_set_statx_attr(stx, STATX_ATTR_IMMUTABLE, + inode.i_flags & EXT2_IMMUTABLE_FL); + fuse4fs_set_statx_attr(stx, STATX_ATTR_APPEND, + inode.i_flags & EXT2_APPEND_FL); + fuse4fs_set_statx_attr(stx, STATX_ATTR_NODUMP, + inode.i_flags & EXT2_NODUMP_FL); + + fuse4fs_statx_directio(ff, stx); + + return 0; +} + +static void op_statx(fuse_req_t req, fuse_ino_t fino, int flags, int mask, + struct fuse_file_info *fi) +{ + struct statx stx = { }; + struct fuse4fs *ff = fuse4fs_get(req); + ext2_ino_t ino; + int ret = 0; + + FUSE4FS_CHECK_CONTEXT(req); + FUSE4FS_CONVERT_FINO(req, &ino, fino); + fuse4fs_start(ff); + ret = fuse4fs_statx(ff, ino, mask, &stx); + if (ret) + goto out; +out: + fuse4fs_finish(ff, ret); + if (ret) + fuse_reply_err(req, -ret); + else + fuse_reply_statx(req, 0, &stx, FUSE4FS_ATTR_TIMEOUT); +} +#else +# define op_statx NULL +#endif + static void op_readlink(fuse_req_t req, fuse_ino_t fino) { struct ext2_inode inode; @@ -7484,6 +7617,9 @@ static struct fuse_lowlevel_ops fs_ops = { #ifdef SUPPORT_FALLOCATE .fallocate = op_fallocate, #endif +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18) + .statx = op_statx, +#endif #ifdef HAVE_FUSE_IOMAP .iomap_begin = op_iomap_begin, .iomap_end = op_iomap_end, diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 138346fcc4517f..f9e8fca096ec2c 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -23,6 +23,7 @@ #include <sys/xattr.h> #endif #include <sys/ioctl.h> +#include <sys/sysmacros.h> #include <unistd.h> #include <ctype.h> #ifdef HAVE_FUSE_LOOPDEV @@ -2006,6 +2007,133 @@ static int op_getattr_iflags(const char *path, struct stat *statbuf, } #endif +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18) && defined(STATX_BASIC_STATS) +static inline void 
fuse2fs_set_statx_attr(struct statx *stx, + uint64_t statx_flag, int set) +{ + if (set) + stx->stx_attributes |= statx_flag; + stx->stx_attributes_mask |= statx_flag; +} + +static void fuse2fs_statx_directio(struct fuse2fs *ff, struct statx *stx) +{ + struct statx devx; + errcode_t err; + int fd; + + err = io_channel_get_fd(ff->fs->io, &fd); + if (err) + return; + + err = statx(fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &devx); + if (err) + return; + if (!(devx.stx_mask & STATX_DIOALIGN)) + return; + + stx->stx_mask |= STATX_DIOALIGN; + stx->stx_dio_mem_align = devx.stx_dio_mem_align; + stx->stx_dio_offset_align = devx.stx_dio_offset_align; +} + +static int fuse2fs_statx(struct fuse2fs *ff, ext2_ino_t ino, int statx_mask, + struct statx *stx) +{ + struct ext2_inode_large inode; + ext2_filsys fs = ff->fs;; + dev_t fakedev = 0; + errcode_t err; + struct timespec tv; + + err = fuse2fs_read_inode(fs, ino, &inode); + if (err) + return translate_error(fs, ino, err); + + memcpy(&fakedev, fs->super->s_uuid, sizeof(fakedev)); + stx->stx_mask = STATX_BASIC_STATS; + stx->stx_dev_major = major(fakedev); + stx->stx_dev_minor = minor(fakedev); + stx->stx_ino = ino; + stx->stx_mode = inode.i_mode; + stx->stx_nlink = inode.i_links_count; + stx->stx_uid = inode_uid(inode); + stx->stx_gid = inode_gid(inode); + stx->stx_size = EXT2_I_SIZE(&inode); + stx->stx_blksize = fs->blocksize; + stx->stx_blocks = ext2fs_get_stat_i_blocks(fs, + EXT2_INODE(&inode)); + EXT4_INODE_GET_XTIME(i_atime, &tv, &inode); + stx->stx_atime.tv_sec = tv.tv_sec; + stx->stx_atime.tv_nsec = tv.tv_nsec; + + EXT4_INODE_GET_XTIME(i_mtime, &tv, &inode); + stx->stx_mtime.tv_sec = tv.tv_sec; + stx->stx_mtime.tv_nsec = tv.tv_nsec; + + EXT4_INODE_GET_XTIME(i_ctime, &tv, &inode); + stx->stx_ctime.tv_sec = tv.tv_sec; + stx->stx_ctime.tv_nsec = tv.tv_nsec; + + if (EXT4_FITS_IN_INODE(&inode, i_crtime)) { + stx->stx_mask |= STATX_BTIME; + EXT4_INODE_GET_XTIME(i_crtime, &tv, &inode); + stx->stx_btime.tv_sec = tv.tv_sec; + 
stx->stx_btime.tv_nsec = tv.tv_nsec; + } + + dbg_printf(ff, "%s: ino=%d atime=%lld.%d mtime=%lld.%d ctime=%lld.%d btime=%lld.%d\n", + __func__, ino, + (long long int)stx->stx_atime.tv_sec, stx->stx_atime.tv_nsec, + (long long int)stx->stx_mtime.tv_sec, stx->stx_mtime.tv_nsec, + (long long int)stx->stx_ctime.tv_sec, stx->stx_ctime.tv_nsec, + (long long int)stx->stx_btime.tv_sec, stx->stx_btime.tv_nsec); + + if (LINUX_S_ISCHR(inode.i_mode) || + LINUX_S_ISBLK(inode.i_mode)) { + if (inode.i_block[0]) { + stx->stx_rdev_major = major(inode.i_block[0]); + stx->stx_rdev_minor = minor(inode.i_block[0]); + } else { + stx->stx_rdev_major = major(inode.i_block[1]); + stx->stx_rdev_minor = minor(inode.i_block[1]); + } + } + + fuse2fs_set_statx_attr(stx, STATX_ATTR_COMPRESSED, + inode.i_flags & EXT2_COMPR_FL); + fuse2fs_set_statx_attr(stx, STATX_ATTR_IMMUTABLE, + inode.i_flags & EXT2_IMMUTABLE_FL); + fuse2fs_set_statx_attr(stx, STATX_ATTR_APPEND, + inode.i_flags & EXT2_APPEND_FL); + fuse2fs_set_statx_attr(stx, STATX_ATTR_NODUMP, + inode.i_flags & EXT2_NODUMP_FL); + + fuse2fs_statx_directio(ff, stx); + + return 0; +} + +static int op_statx(const char *path, int statx_flags, int statx_mask, + struct statx *stx, struct fuse_file_info *fi) +{ + struct fuse2fs *ff = fuse2fs_get(); + ext2_ino_t ino; + int ret = 0; + + FUSE2FS_CHECK_CONTEXT(ff); + fuse2fs_start(ff); + ret = fuse2fs_file_ino(ff, path, fi, &ino); + if (ret) + goto out; + ret = fuse2fs_statx(ff, ino, statx_mask, stx); +out: + fuse2fs_finish(ff, ret); + return ret; +} +#else +# define op_statx NULL +#endif static int op_readlink(const char *path, char *buf, size_t len) { @@ -6793,6 +6921,9 @@ static struct fuse_operations fs_ops = { #ifdef SUPPORT_FALLOCATE .fallocate = op_fallocate, #endif +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18) + .statx = op_statx, +#endif #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99) .getattr_iflags = op_getattr_iflags, #endif ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 17/19] fuse2fs: enable atomic writes 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (15 preceding siblings ...) 2026-04-29 14:56 ` [PATCH 16/19] fuse2fs: implement statx Darrick J. Wong @ 2026-04-29 14:57 ` Darrick J. Wong 2026-04-29 14:57 ` [PATCH 18/19] fuse4fs: disable fs reclaim and write throttling Darrick J. Wong 2026-04-29 14:57 ` [PATCH 19/19] fuse2fs: implement freeze and shutdown requests Darrick J. Wong 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:57 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Advertise the single-fsblock atomic write capability that iomap can do. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++ misc/fuse2fs.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 134 insertions(+), 2 deletions(-) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index 8d994fe490e914..49708bdf7b655d 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -297,6 +297,9 @@ struct fuse4fs { void (*old_alloc_stats)(ext2_filsys fs, blk64_t blk, int inuse); void (*old_alloc_stats_range)(ext2_filsys fs, blk64_t blk, blk_t num, int inuse); +#ifdef STATX_WRITE_ATOMIC + unsigned int awu_min, awu_max; +#endif #endif unsigned int blockmask; unsigned long offset; @@ -938,10 +941,22 @@ static inline bool fuse4fs_can_iomap(const struct fuse4fs *ff) { return ff->iomap_cap & FUSE_IOMAP_SUPPORT_FILEIO; } + +static inline bool fuse4fs_iomap_supports_hw_atomic(const struct fuse4fs *ff) +{ + return fuse4fs_iomap_enabled(ff) && + (ff->iomap_cap & FUSE_IOMAP_SUPPORT_ATOMIC) && +#ifdef STATX_WRITE_ATOMIC + ff->awu_max > 0 && ff->awu_min > 0; +#else + 0; +#endif +} #else # define fuse4fs_iomap_enabled(...) (0) # define fuse4fs_discover_iomap(...) 
((void)0) # define fuse4fs_can_iomap(...) (false) +# define fuse4fs_iomap_supports_hw_atomic(...) (0) #endif static inline void fuse4fs_dump_extents(struct fuse4fs *ff, ext2_ino_t ino, @@ -2337,8 +2352,12 @@ static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino, fstat->iflags = 0; #ifdef HAVE_FUSE_IOMAP - if (fuse4fs_iomap_enabled(ff)) + if (fuse4fs_iomap_enabled(ff)) { fstat->iflags |= FUSE_IFLAG_IOMAP | FUSE_IFLAG_EXCLUSIVE; + + if (fuse4fs_iomap_supports_hw_atomic(ff)) + fstat->iflags |= FUSE_IFLAG_ATOMIC; + } #endif return 0; @@ -2516,6 +2535,15 @@ static int fuse4fs_statx(struct fuse4fs *ff, ext2_ino_t ino, int statx_mask, fuse4fs_statx_directio(ff, stx); +#ifdef STATX_WRITE_ATOMIC + if (fuse4fs_iomap_supports_hw_atomic(ff)) { + stx->stx_mask |= STATX_WRITE_ATOMIC; + stx->stx_atomic_write_unit_min = ff->awu_min; + stx->stx_atomic_write_unit_max = ff->awu_max; + stx->stx_atomic_write_segments_max = 1; + } +#endif + return 0; } @@ -6902,6 +6930,9 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare, } } + if (opflags & FUSE_IOMAP_OP_ATOMIC) + read.flags |= FUSE_IOMAP_F_ATOMIC_BIO; + out_unlock: fuse4fs_finish(ff, ret); if (ret) @@ -7066,6 +7097,38 @@ static int fuse4fs_set_bdev_blocksize(struct fuse4fs *ff, int fd) return -EIO; } +#ifdef STATX_WRITE_ATOMIC +static void fuse4fs_configure_atomic_write(struct fuse4fs *ff, int bdev_fd) +{ + struct statx devx; + unsigned int awu_min, awu_max; + int ret; + + if (!ext2fs_has_feature_extents(ff->fs->super)) + return; + + ret = statx(bdev_fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &devx); + if (ret) + return; + if (!(devx.stx_mask & STATX_WRITE_ATOMIC)) + return; + + awu_min = max(ff->fs->blocksize, devx.stx_atomic_write_unit_min); + awu_max = min(ff->fs->blocksize, devx.stx_atomic_write_unit_max); + if (awu_min > awu_max) + return; + + log_printf(ff, "%s awu_min: %u, awu_max: %u\n", + _("Supports (experimental) DIO atomic writes"), + awu_min, awu_max); + + ff->awu_min = awu_min; + 
ff->awu_max = awu_max; +} +#else +# define fuse4fs_configure_atomic_write(...) ((void)0) +#endif + static int fuse4fs_iomap_config_devices(struct fuse4fs *ff) { errcode_t err; @@ -7090,6 +7153,8 @@ static int fuse4fs_iomap_config_devices(struct fuse4fs *ff) dbg_printf(ff, "%s: registered iomap dev fd=%d iomap_dev=%u\n", __func__, fd, ff->iomap_dev); + fuse4fs_configure_atomic_write(ff, fd); + ff->iomap_dev = ret; return 0; } diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index f9e8fca096ec2c..fe45ffa86823b0 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -280,6 +280,9 @@ struct fuse2fs { void (*old_alloc_stats)(ext2_filsys fs, blk64_t blk, int inuse); void (*old_alloc_stats_range)(ext2_filsys fs, blk64_t blk, blk_t num, int inuse); +#ifdef STATX_WRITE_ATOMIC + unsigned int awu_min, awu_max; +#endif #endif unsigned int blockmask; unsigned long offset; @@ -750,10 +753,22 @@ static inline bool fuse2fs_can_iomap(const struct fuse2fs *ff) { return ff->iomap_cap & FUSE_IOMAP_SUPPORT_FILEIO; } + +static inline bool fuse2fs_iomap_supports_hw_atomic(const struct fuse2fs *ff) +{ + return fuse2fs_iomap_enabled(ff) && + (ff->iomap_cap & FUSE_IOMAP_SUPPORT_ATOMIC) && +#ifdef STATX_WRITE_ATOMIC + ff->awu_max > 0 && ff->awu_min > 0; +#else + 0; +#endif +} #else # define fuse2fs_iomap_enabled(...) (0) # define fuse2fs_discover_iomap(...) ((void)0) # define fuse2fs_can_iomap(...) (false) +# define fuse2fs_iomap_supports_hw_atomic(...) 
(0) #endif static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino, @@ -1995,14 +2010,19 @@ static int op_getattr(const char *path, struct stat *statbuf, static int op_getattr_iflags(const char *path, struct stat *statbuf, unsigned int *iflags, struct fuse_file_info *fi) { + struct fuse2fs *ff = fuse2fs_get(); int ret = op_getattr(path, statbuf, fi); if (ret) return ret; - if (fuse_fs_can_enable_iomap(statbuf)) + if (fuse_fs_can_enable_iomap(statbuf)) { *iflags |= FUSE_IFLAG_IOMAP | FUSE_IFLAG_EXCLUSIVE; + if (fuse2fs_iomap_supports_hw_atomic(ff)) + *iflags |= FUSE_IFLAG_ATOMIC; + } + return 0; } #endif @@ -2111,6 +2131,16 @@ static int fuse2fs_statx(struct fuse2fs *ff, ext2_ino_t ino, int statx_mask, fuse2fs_statx_directio(ff, stx); +#ifdef STATX_WRITE_ATOMIC + if (fuse_fs_can_enable_iomapx(stx) && + fuse2fs_iomap_supports_hw_atomic(ff)) { + stx->stx_mask |= STATX_WRITE_ATOMIC; + stx->stx_atomic_write_unit_min = ff->awu_min; + stx->stx_atomic_write_unit_max = ff->awu_max; + stx->stx_atomic_write_segments_max = 1; + } +#endif + return 0; } @@ -6220,6 +6250,9 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino, } } + if (opflags & FUSE_IOMAP_OP_ATOMIC) + read->flags |= FUSE_IOMAP_F_ATOMIC_BIO; + out_unlock: fuse2fs_finish(ff, ret); return ret; @@ -6381,6 +6414,38 @@ static int fuse2fs_set_bdev_blocksize(struct fuse2fs *ff, int fd) return -EIO; } +#ifdef STATX_WRITE_ATOMIC +static void fuse2fs_configure_atomic_write(struct fuse2fs *ff, int bdev_fd) +{ + struct statx devx; + unsigned int awu_min, awu_max; + int ret; + + if (!ext2fs_has_feature_extents(ff->fs->super)) + return; + + ret = statx(bdev_fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &devx); + if (ret) + return; + if (!(devx.stx_mask & STATX_WRITE_ATOMIC)) + return; + + awu_min = max(ff->fs->blocksize, devx.stx_atomic_write_unit_min); + awu_max = min(ff->fs->blocksize, devx.stx_atomic_write_unit_max); + if (awu_min > awu_max) + return; + + log_printf(ff, "%s 
awu_min: %u, awu_max: %u\n", + _("Supports (experimental) DIO atomic writes"), + awu_min, awu_max); + + ff->awu_min = awu_min; + ff->awu_max = awu_max; +} +#else +# define fuse2fs_configure_atomic_write(...) ((void)0) +#endif + static int fuse2fs_iomap_config_devices(struct fuse2fs *ff) { errcode_t err; @@ -6405,6 +6470,8 @@ static int fuse2fs_iomap_config_devices(struct fuse2fs *ff) dbg_printf(ff, "%s: registered iomap dev fd=%d iomap_dev=%u\n", __func__, fd, ff->iomap_dev); + fuse2fs_configure_atomic_write(ff, fd); + ff->iomap_dev = ret; return 0; } ^ permalink raw reply related [flat|nested] 191+ messages in thread
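The clamping in `fuse2fs_configure_atomic_write` and its fuse4fs twin takes the larger of the filesystem block size and the device minimum as `awu_min`, the smaller of the block size and the device maximum as `awu_max`, and gives up if the two cross. A small sketch of just that arithmetic is shown below; `demo_clamp_awu` and `struct demo_awu` are hypothetical names for illustration and do not appear in the patch.

```c
#include <stdbool.h>

/* Hypothetical container for the advertised atomic write unit limits. */
struct demo_awu {
	unsigned int min;
	unsigned int max;
};

/*
 * Mirror the patch's clamping: the minimum can be no smaller than one
 * filesystem block, the maximum no larger than one filesystem block.
 * Returns false (leaving *out untouched) if the range is empty.
 */
static bool demo_clamp_awu(unsigned int fs_blocksize, unsigned int dev_min,
			   unsigned int dev_max, struct demo_awu *out)
{
	unsigned int awu_min = fs_blocksize > dev_min ? fs_blocksize : dev_min;
	unsigned int awu_max = fs_blocksize < dev_max ? fs_blocksize : dev_max;

	if (awu_min > awu_max)
		return false;

	out->min = awu_min;
	out->max = awu_max;
	return true;
}
```

Under these rules the only surviving outcome is `min == max == fs_blocksize`, which matches the commit message's "single-fsblock" wording: a 4096-byte filesystem on a device advertising 512 to 65536 bytes yields 4096/4096, while a device whose minimum exceeds the block size, or whose maximum falls below it, yields nothing.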
* [PATCH 18/19] fuse4fs: disable fs reclaim and write throttling
2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (16 preceding siblings ...)
2026-04-29 14:57 ` [PATCH 17/19] fuse2fs: enable atomic writes Darrick J. Wong
@ 2026-04-29 14:57 ` Darrick J. Wong
2026-04-29 14:57 ` [PATCH 19/19] fuse2fs: implement freeze and shutdown requests Darrick J. Wong
18 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:57 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Ask the kernel if we can disable fs reclaim and write throttling.
Disabling fs reclaim prevents livelocks where the fuse server can
allocate memory, fault into the kernel, and then the allocation tries to
initiate writeback by calling back into the same fuse server.

Disabling BDI write throttling means that writeback won't be throttled
by metadata writes to the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 49708bdf7b655d..6ea2d30772ae5a 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -7969,6 +7969,19 @@ static void try_set_io_flusher(struct fuse4fs *ff)
 #endif
 }
 
+/* Undo try_set_io_flusher */
+static void try_clear_io_flusher(struct fuse4fs *ff)
+{
+#ifdef HAVE_PR_SET_IO_FLUSHER
+	/*
+	 * zero ret means it's already clear, negative means we can't even
+	 * look at the value so don't bother clearing it
+	 */
+	if (prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0) > 0)
+		prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+#endif
+}
+
 /* Try to adjust the OOM score so that we don't get killed */
 static void try_adjust_oom_score(struct fuse4fs *ff)
 {
@@ -8022,6 +8035,23 @@ static int fuse4fs_event_loop(struct fuse4fs *ff,
 				struct fuse_loop_config *loop_config,
 				const struct fuse_cmdline_opts *opts)
 {
+	bool clear_io_flusher = false;
+	int ret;
+
+	/*
+	 * Try to set ourselves up with fs reclaim disabled to prevent
+	 * recursive reclaim and throttling.  This must be done before starting
+	 * the worker threads so that they inherit the process flags.
+	 */
+	ret = fuse_lowlevel_disable_fsreclaim(ff->fuse, 1);
+	if (ret) {
+		err_printf(ff, "%s: %s.\n",
+			   _("Could not register as FS flusher thread"),
+			   strerror(-ret));
+		try_set_io_flusher(ff);
+		clear_io_flusher = true;
+	}
+
 	/*
 	 * Since there's a Big Kernel Lock around all the libext2fs code, we
 	 * only need to start four threads -- one to decode a request, another
@@ -8032,7 +8062,10 @@ static int fuse4fs_event_loop(struct fuse4fs *ff,
 	fuse_loop_cfg_set_idle_threads(loop_config, opts->max_idle_threads);
 	fuse_loop_cfg_set_max_threads(loop_config, 4);
 
-	return fuse_session_loop_mt(ff->fuse, loop_config) == 0 ? 0 : 8;
+	ret = fuse_session_loop_mt(ff->fuse, loop_config) == 0 ? 0 : 8;
+	if (clear_io_flusher)
+		try_clear_io_flusher(ff);
+	return ret;
 }
 
 #ifdef HAVE_FUSE4FS_SERVICE
@@ -8251,7 +8284,6 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	try_set_io_flusher(&fctx);
 	try_adjust_oom_score(&fctx);
 
 	/* Will we allow users to allocate every last block? */
^ permalink raw reply related [flat|nested] 191+ messages in thread
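The fallback path in the patch above relies on the `PR_SET_IO_FLUSHER` and `PR_GET_IO_FLUSHER` prctl operations, which need `CAP_SYS_RESOURCE` and a Linux 5.6 or newer kernel. A minimal sketch of a set/clear pair follows, assuming only the standard `prctl(2)` interface. The `demo_*` helpers are hypothetical, and the sketch does not touch `fuse_lowlevel_disable_fsreclaim`, which belongs to this patchset's libfuse changes.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Older headers may lack these; the values match the Linux UAPI. */
#ifndef PR_SET_IO_FLUSHER
# define PR_SET_IO_FLUSHER	57
# define PR_GET_IO_FLUSHER	58
#endif

/*
 * Returns 1 if this call turned the flag on (the caller should clear it
 * later), 0 if it was already on, or a negative errno if the flag could
 * not be read or set (for example, without CAP_SYS_RESOURCE).
 */
static int demo_set_io_flusher(void)
{
	int ret = prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0);

	if (ret < 0)
		return -errno;
	if (ret > 0)
		return 0;

	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) < 0)
		return -errno;
	return 1;
}

/* Clear the flag only if we can see that it is currently set. */
static void demo_clear_io_flusher(void)
{
	if (prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0) > 0)
		prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
}
```

Returning whether the flag was newly set lets a caller undo only its own change, which is roughly the role `clear_io_flusher` plays in `fuse4fs_event_loop`.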
* [PATCH 19/19] fuse2fs: implement freeze and shutdown requests 2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong ` (17 preceding siblings ...) 2026-04-29 14:57 ` [PATCH 18/19] fuse4fs: disable fs reclaim and write throttling Darrick J. Wong @ 2026-04-29 14:57 ` Darrick J. Wong 18 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:57 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Handle freezing and shutting down the filesystem if requested. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++ misc/fuse2fs.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 175 insertions(+) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index 6ea2d30772ae5a..46f81e2066b044 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -240,6 +240,7 @@ struct fuse4fs_file_handle { enum fuse4fs_opstate { F4OP_READONLY, + F4OP_WRITABLE_FROZEN, F4OP_WRITABLE, F4OP_SHUTDOWN, }; @@ -6388,6 +6389,91 @@ static void op_fallocate(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)), } #endif /* SUPPORT_FALLOCATE */ +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99) +static void op_freezefs(fuse_req_t req, fuse_ino_t ino, uint64_t unlinked) +{ + struct fuse4fs *ff = fuse4fs_get(req); + ext2_filsys fs; + errcode_t err; + int ret = 0; + + FUSE4FS_CHECK_CONTEXT(req); + fs = fuse4fs_start(ff); + + if (ff->opstate == F4OP_WRITABLE) { + if (fs->super->s_error_count) + fs->super->s_state |= EXT2_ERROR_FS; + else if (!unlinked) + fs->super->s_state |= EXT2_VALID_FS; + ext2fs_mark_super_dirty(fs); + err = ext2fs_set_gdt_csum(fs); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + + err = ext2fs_flush2(fs, 0); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + 
} + + ff->opstate = F4OP_WRITABLE_FROZEN; + } + +out_unlock: + fs->super->s_state &= ~EXT2_VALID_FS; + fuse4fs_finish(ff, ret); + fuse_reply_err(req, -ret); +} + +static void op_unfreezefs(fuse_req_t req, fuse_ino_t ino) +{ + struct fuse4fs *ff = fuse4fs_get(req); + ext2_filsys fs; + errcode_t err; + int ret = 0; + + FUSE4FS_CHECK_CONTEXT(req); + fs = fuse4fs_start(ff); + + if (ff->opstate == F4OP_WRITABLE_FROZEN) { + if (fs->super->s_error_count) + fs->super->s_state |= EXT2_ERROR_FS; + fs->super->s_state &= ~EXT2_VALID_FS; + ext2fs_mark_super_dirty(fs); + err = ext2fs_set_gdt_csum(fs); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + + err = ext2fs_flush2(fs, 0); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + + ff->opstate = F4OP_WRITABLE; + } + +out_unlock: + fuse4fs_finish(ff, ret); + fuse_reply_err(req, -ret); +} + +static void op_shutdownfs(fuse_req_t req, fuse_ino_t ino, uint64_t flags) +{ + const struct fuse_ctx *ctxt = fuse_req_ctx(req); + struct fuse4fs *ff = fuse4fs_get(req); + int ret; + + ret = ioctl_shutdown(ff, ctxt, NULL, NULL, 0); + + fuse_reply_err(req, -ret); +} +#endif + #ifdef HAVE_FUSE_IOMAP static void fuse4fs_iomap_hole(struct fuse4fs *ff, struct fuse_file_iomap *iomap, off_t pos, uint64_t count) @@ -7685,6 +7771,11 @@ static struct fuse_lowlevel_ops fs_ops = { #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18) .statx = op_statx, #endif +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99) + .freezefs = op_freezefs, + .unfreezefs = op_unfreezefs, + .shutdownfs = op_shutdownfs, +#endif #ifdef HAVE_FUSE_IOMAP .iomap_begin = op_iomap_begin, .iomap_end = op_iomap_end, diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index fe45ffa86823b0..16b010fd28d4b5 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -223,6 +223,7 @@ struct fuse2fs_file_handle { enum fuse2fs_opstate { F2OP_READONLY, + F2OP_WRITABLE_FROZEN, F2OP_WRITABLE, F2OP_SHUTDOWN, }; @@ -5709,6 +5710,86 @@ static int op_fallocate(const char *path 
EXT2FS_ATTR((unused)), int mode, } #endif /* SUPPORT_FALLOCATE */ +#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99) +static int op_freezefs(const char *path, uint64_t unlinked) +{ + struct fuse2fs *ff = fuse2fs_get(); + ext2_filsys fs; + errcode_t err; + int ret = 0; + + FUSE2FS_CHECK_CONTEXT(ff); + fs = fuse2fs_start(ff); + + if (ff->opstate == F2OP_WRITABLE) { + if (fs->super->s_error_count) + fs->super->s_state |= EXT2_ERROR_FS; + else if (!unlinked) + fs->super->s_state |= EXT2_VALID_FS; + ext2fs_mark_super_dirty(fs); + err = ext2fs_set_gdt_csum(fs); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + + err = ext2fs_flush2(fs, 0); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + + ff->opstate = F2OP_WRITABLE_FROZEN; + } + +out_unlock: + fs->super->s_state &= ~EXT2_VALID_FS; + fuse2fs_finish(ff, ret); + return ret; +} + +static int op_unfreezefs(const char *path) +{ + struct fuse2fs *ff = fuse2fs_get(); + ext2_filsys fs; + errcode_t err; + int ret = 0; + + FUSE2FS_CHECK_CONTEXT(ff); + fs = fuse2fs_start(ff); + + if (ff->opstate == F2OP_WRITABLE_FROZEN) { + if (fs->super->s_error_count) + fs->super->s_state |= EXT2_ERROR_FS; + ext2fs_mark_super_dirty(fs); + err = ext2fs_set_gdt_csum(fs); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + + err = ext2fs_flush2(fs, 0); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + + ff->opstate = F2OP_WRITABLE; + } + +out_unlock: + fuse2fs_finish(ff, ret); + return ret; +} + +static int op_shutdownfs(const char *path, uint64_t flags) +{ + struct fuse2fs *ff = fuse2fs_get(); + + return ioctl_shutdown(ff, NULL, NULL); +} +#endif + #ifdef HAVE_FUSE_IOMAP static void fuse2fs_iomap_hole(struct fuse2fs *ff, struct fuse_file_iomap *iomap, off_t pos, uint64_t count) @@ -6993,6 +7074,9 @@ static struct fuse_operations fs_ops = { #endif #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99) .getattr_iflags = op_getattr_iflags, + .freezefs = op_freezefs, + 
.unfreezefs = op_unfreezefs, + .shutdownfs = op_shutdownfs, #endif #ifdef HAVE_FUSE_IOMAP .iomap_begin = op_iomap_begin, ^ permalink raw reply related [flat|nested] 191+ messages in thread
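The freeze and unfreeze handlers above amount to a small state machine: freezing is only meaningful from the writable state, unfreezing only from the frozen state, and a failed flush leaves the state where it was. A simplified sketch of those transitions follows. The `demo_*` names are hypothetical, the `flush_err` parameter stands in for a failing `ext2fs_flush2` or `ext2fs_set_gdt_csum` call, and the superblock state-bit handling from the patch is deliberately left out.

```c
/* Same ordering as the patch's enum: frozen sits between ro and rw. */
enum demo_opstate {
	DEMO_READONLY,
	DEMO_WRITABLE_FROZEN,
	DEMO_WRITABLE,
	DEMO_SHUTDOWN,
};

/*
 * Freeze: a no-op unless the filesystem is writable.  A non-zero
 * flush_err models a failed metadata flush and is passed back to the
 * caller without changing the state.
 */
static int demo_freeze(enum demo_opstate *state, int flush_err)
{
	if (*state != DEMO_WRITABLE)
		return 0;
	if (flush_err)
		return flush_err;
	*state = DEMO_WRITABLE_FROZEN;
	return 0;
}

/* Unfreeze: a no-op unless the filesystem is currently frozen. */
static int demo_unfreeze(enum demo_opstate *state, int flush_err)
{
	if (*state != DEMO_WRITABLE_FROZEN)
		return 0;
	if (flush_err)
		return flush_err;
	*state = DEMO_WRITABLE;
	return 0;
}
```

Treating requests from other states as successful no-ops means a read-only or shut-down filesystem accepts a freeze request without changing anything, which is what the handlers do when the `opstate` check fails.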
* [PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (15 preceding siblings ...)
2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2026-04-29 14:20 ` Darrick J. Wong
2026-04-29 14:57 ` [PATCH 1/3] fuse4fs: configure iomap when running as a service Darrick J. Wong
` (2 more replies)
2026-04-29 14:21 ` [PATCHSET v8 4/6] fuse4fs: specify the root node id Darrick J. Wong
` (2 subsequent siblings)
19 siblings, 3 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:20 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong

Hi all,

This series adapts the iomap code to work in systemd service mode.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-iomap-service
---
Commits in this patchset:
 * fuse4fs: configure iomap when running as a service
 * fuse4fs: set iomap backing device blocksize
 * fuse4fs: ask for loop devices when opening via fuservicemount
---
 fuse4fs/fuse4fs.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 83 insertions(+), 11 deletions(-)
^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/3] fuse4fs: configure iomap when running as a service
2026-04-29 14:20 ` [PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services Darrick J. Wong
@ 2026-04-29 14:57 ` Darrick J. Wong
2026-04-29 14:58 ` [PATCH 2/3] fuse4fs: set iomap backing device blocksize Darrick J. Wong
2026-04-29 14:58 ` [PATCH 3/3] fuse4fs: ask for loop devices when opening via fuservicemount Darrick J. Wong
2 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:57 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

We have to ask the mount helper to enable iomap for us on the fuse
connection, so do that before mounting.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 46f81e2066b044..d7de2e94a7e536 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -334,8 +334,14 @@ static inline bool fuse4fs_is_service(const struct fuse4fs *ff)
 {
 	return fuse_service_accepted(ff->service);
 }
+
+static int fuse4fs_service_discover_iomap(struct fuse4fs *ff)
+{
+	return fuse_service_discover_iomap(ff->service);
+}
 #else
 # define fuse4fs_is_service(...)	(false)
+# define fuse4fs_service_discover_iomap(...)	(0)
 #endif
 
 #define FUSE4FS_CHECK_HANDLE(req, fh) \
@@ -935,6 +941,11 @@ static inline void fuse4fs_discover_iomap(struct fuse4fs *ff)
 	if (ff->iomap_want == FT_DISABLE)
 		return;
 
+	if (fuse4fs_is_service(ff)) {
+		ff->iomap_cap = fuse4fs_service_discover_iomap(ff);
+		return;
+	}
+
 	ff->iomap_cap = fuse_lowlevel_discover_iomap(-1);
 }
 
@@ -1579,6 +1590,30 @@ static errcode_t fuse4fs_service_openfs(struct fuse4fs *ff, char *options,
 # define fuse4fs_service_openfs(...)	(EOPNOTSUPP)
 #endif
 
+#if defined(HAVE_FUSE4FS_SERVICE) && defined(HAVE_FUSE_IOMAP)
+static int fuse4fs_service_configure_iomap(struct fuse4fs *ff)
+{
+	int error = 0;
+	int ret;
+
+	ret = fuse_service_configure_iomap(ff->service,
+					   ff->iomap_want == FT_ENABLE,
+					   &error);
+	if (ret)
+		return -1;
+
+	if (error) {
+		err_printf(ff, "%s: %s.\n", _("enabling iomap"),
+			   strerror(error));
+		return -1;
+	}
+
+	return 0;
+}
+#else
+# define fuse4fs_service_configure_iomap(...)	(EOPNOTSUPP)
+#endif
+
 static errcode_t fuse4fs_acquire_lockfile(struct fuse4fs *ff)
 {
 	char *resolved;
@@ -8373,6 +8408,16 @@ int main(int argc, char *argv[])
 			ret = 2;
 			goto out;
 		}
+
+#ifdef HAVE_FUSE_IOMAP
+		if (fctx.iomap_want != FT_DISABLE) {
+			ret = fuse4fs_service_configure_iomap(&fctx);
+			if (ret) {
+				ret = 2;
+				goto out;
+			}
+		}
+#endif
 	}
 
 	try_adjust_oom_score(&fctx);
^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 2/3] fuse4fs: set iomap backing device blocksize
  2026-04-29 14:20 ` [PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services Darrick J. Wong
  2026-04-29 14:57 ` [PATCH 1/3] fuse4fs: configure iomap when running as a service Darrick J. Wong
@ 2026-04-29 14:58 ` Darrick J. Wong
  2026-04-29 14:58 ` [PATCH 3/3] fuse4fs: ask for loop devices when opening via fuservicemount Darrick J. Wong
2 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:58 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

If we're running as an unprivileged iomap fuse server, we must ask the
kernel to set the blocksize of the block device.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 42 ++++++++++++++++++++++++++++++++----------
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index d7de2e94a7e536..e4338722e88e73 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -1610,10 +1610,27 @@ static int fuse4fs_service_configure_iomap(struct fuse4fs *ff)
 
 	return 0;
 }
+
+int fuse4fs_service_set_bdev_blocksize(struct fuse4fs *ff, int dev_index)
+{
+	int ret;
+
+	ret = fuse_lowlevel_iomap_set_blocksize(ff->fuse, dev_index,
+						ff->fs->blocksize);
+	if (ret) {
+		err_printf(ff, "%s: cannot set blocksize %u: %s\n", __func__,
+			   ff->fs->blocksize, strerror(-ret));
+		return -EIO;
+	}
+
+	return 0;
+}
 #else
 # define fuse4fs_service_configure_iomap(...)	(EOPNOTSUPP)
+# define fuse4fs_service_set_bdev_blocksize(...)	(EOPNOTSUPP)
 #endif
 
+
 static errcode_t fuse4fs_acquire_lockfile(struct fuse4fs *ff)
 {
 	char *resolved;
@@ -7254,21 +7271,19 @@ static int fuse4fs_iomap_config_devices(struct fuse4fs *ff)
 {
 	errcode_t err;
 	int fd;
+	int dev_index;
 	int ret;
 
 	err = io_channel_get_fd(ff->fs->io, &fd);
 	if (err)
 		return translate_error(ff->fs, 0, err);
 
-	ret = fuse4fs_set_bdev_blocksize(ff, fd);
-	if (ret)
-		return ret;
-
-	ret = fuse_lowlevel_iomap_device_add(ff->fuse, fd, 0);
-	if (ret < 0) {
-		dbg_printf(ff, "%s: cannot register iomap dev fd=%d, err=%d\n",
-			   __func__, fd, -ret);
-		return translate_error(ff->fs, 0, -ret);
+	dev_index = fuse_lowlevel_iomap_device_add(ff->fuse, fd, 0);
+	if (dev_index < 0) {
+		ret = -dev_index;
+		dbg_printf(ff, "%s: cannot register iomap dev fd=%d: %s\n",
+			   __func__, fd, strerror(ret));
+		return translate_error(ff->fs, 0, ret);
 	}
 
 	dbg_printf(ff, "%s: registered iomap dev fd=%d iomap_dev=%u\n",
@@ -7276,7 +7291,14 @@ static int fuse4fs_iomap_config_devices(struct fuse4fs *ff)
 
 	fuse4fs_configure_atomic_write(ff, fd);
 
-	ff->iomap_dev = ret;
+	if (fuse4fs_is_service(ff))
+		ret = fuse4fs_service_set_bdev_blocksize(ff, dev_index);
+	else
+		ret = fuse4fs_set_bdev_blocksize(ff, fd);
+	if (ret)
+		return ret;
+
+	ff->iomap_dev = dev_index;
 
 	return 0;
 }

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 3/3] fuse4fs: ask for loop devices when opening via fuservicemount
  2026-04-29 14:20 ` [PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services Darrick J. Wong
  2026-04-29 14:57 ` [PATCH 1/3] fuse4fs: configure iomap when running as a service Darrick J. Wong
  2026-04-29 14:58 ` [PATCH 2/3] fuse4fs: set iomap backing device blocksize Darrick J. Wong
@ 2026-04-29 14:58 ` Darrick J. Wong
2 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:58 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

When requesting a file, ask the fuservicemount program to transform an
open regular file into a loop device for us, so that we can use iomap
even when the filesystem is actually an image file.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index e4338722e88e73..7c861fc28e9fa4 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -1491,13 +1491,18 @@ static int fuse4fs_service_get_config(struct fuse4fs *ff)
 {
 	double deadline = init_deadline(FUSE4FS_OPEN_TIMEOUT);
 	const int open_flags = O_EXCL | (ff->directio ? O_DIRECT : 0);
+	unsigned int request_flags = 0;
 	int open_mode = O_RDWR;
 	int fd;
 	int ret;
 
+	if (fuse4fs_can_iomap(ff))
+		request_flags |= FUSE_SERVICE_REQUEST_FILE_TRYLOOP;
+
 	do {
 		ret = fuse_service_request_file(ff->service, ff->device,
-						open_mode | open_flags, 0, 0);
+						open_mode | open_flags, 0,
+						request_flags);
 		if (ret)
 			return ret;
 

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCHSET v8 4/6] fuse4fs: specify the root node id
  2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
  ` (16 preceding siblings ...)
  2026-04-29 14:20 ` [PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services Darrick J. Wong
@ 2026-04-29 14:21 ` Darrick J. Wong
  2026-04-29 14:58 ` [PATCH 1/1] fuse4fs: don't use inode number translation when possible Darrick J. Wong
  2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2026-04-29 14:21 ` [PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file IO performance Darrick J. Wong
19 siblings, 1 reply; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:21 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

Hi all,

This series adapts fuse4fs to have a 1:1 mapping of ext2_ino_t to
fuse_ino_t for slightly better performance and less confusing code
interpretation.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-root-nodeid
---
Commits in this patchset:
 * fuse4fs: don't use inode number translation when possible
---
 fuse4fs/fuse4fs.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/1] fuse4fs: don't use inode number translation when possible
  2026-04-29 14:21 ` [PATCHSET v8 4/6] fuse4fs: specify the root node id Darrick J. Wong
@ 2026-04-29 14:58 ` Darrick J. Wong
0 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:58 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Prior to the integration of iomap into fuse, the fuse client (aka the
kernel) required that the root directory have an inumber of
FUSE_ROOT_ID, which is 1. However, the ext2 filesystem defines the
root inode number to be EXT2_ROOT_INO, which is 2. This dissonance
means that we have to have translator functions, and that any access to
inumber 1 (the ext2 badblocks file) will instead redirect to the root
directory.

That's horrible. Use the new mount option to set the root directory
nodeid to EXT2_ROOT_INO so that we don't need this translation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 7c861fc28e9fa4..ec6af3813a661a 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -285,6 +285,7 @@ struct fuse4fs {
 	int directio;
 	int acl;
 	int dirsync;
+	int translate_inums;
 
 	enum fuse4fs_opstate opstate;
 	int logfd;
@@ -379,17 +380,19 @@ static int fuse4fs_service_discover_iomap(struct fuse4fs *ff)
 #define FUSE4FS_CHECK_CONTEXT_INIT(req) \
 	__FUSE4FS_CHECK_CONTEXT((req), abort(), abort())
 
-static inline void fuse4fs_ino_from_fuse(ext2_ino_t *inop, fuse_ino_t fino)
+static inline void fuse4fs_ino_from_fuse(const struct fuse4fs *ff,
+					 ext2_ino_t *inop, fuse_ino_t fino)
 {
-	if (fino == FUSE_ROOT_ID)
+	if (ff->translate_inums && fino == FUSE_ROOT_ID)
 		*inop = EXT2_ROOT_INO;
 	else
 		*inop = fino;
 }
 
-static inline void fuse4fs_ino_to_fuse(fuse_ino_t *finop, ext2_ino_t ino)
+static inline void fuse4fs_ino_to_fuse(const struct fuse4fs *ff,
+				       fuse_ino_t *finop, ext2_ino_t ino)
 {
-	if (ino == EXT2_ROOT_INO)
+	if (ff->translate_inums && ino == EXT2_ROOT_INO)
 		*finop = FUSE_ROOT_ID;
 	else
 		*finop = ino;
@@ -405,7 +408,7 @@ static inline void fuse4fs_ino_to_fuse(fuse_ino_t *finop, ext2_ino_t ino)
 		fuse_reply_err((req), EIO); \
 		return; \
 	} \
-	fuse4fs_ino_from_fuse(ext2_inop, fuse_ino); \
+	fuse4fs_ino_from_fuse(fuse4fs_get(req), ext2_inop, fuse_ino); \
 } while (0)
 
 static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
@@ -2403,7 +2406,7 @@ static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino,
 		statbuf->st_rdev = inodep->i_block[1];
 	}
 
-	fuse4fs_ino_to_fuse(&entry->ino, ino);
+	fuse4fs_ino_to_fuse(ff, &entry->ino, ino);
 	entry->generation = inodep->i_generation;
 	entry->attr_timeout = FUSE4FS_ATTR_TIMEOUT;
 	entry->entry_timeout = FUSE4FS_ATTR_TIMEOUT;
@@ -8087,6 +8090,20 @@ static void fuse4fs_compute_libfuse_args(struct fuse4fs *ff,
 			"-oallow_other,default_permissions,suid,dev");
 	}
 
+	if (fuse4fs_can_iomap(ff)) {
+		/*
+		 * The root_nodeid mount option was added when iomap support
+		 * was added to fuse. This enables us to control the root
+		 * nodeid in the kernel, which enables a 1:1 translation of
+		 * ext2 to kernel inumbers.
+		 */
+		snprintf(extra_args, BUFSIZ, "-oroot_nodeid=%d",
+			 EXT2_ROOT_INO);
+		fuse_opt_add_arg(args, extra_args);
+		ff->translate_inums = 0;
+	}
+
+
 	if (ff->debug) {
 		int i;
 
@@ -8366,6 +8383,7 @@ int main(int argc, char *argv[])
 #ifdef HAVE_FUSE_LOOPDEV
 		.loop_fd = -1,
 #endif
+		.translate_inums = 1,
 	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled
  2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
  ` (17 preceding siblings ...)
  2026-04-29 14:21 ` [PATCHSET v8 4/6] fuse4fs: specify the root node id Darrick J. Wong
@ 2026-04-29 14:21 ` Darrick J. Wong
  2026-04-29 14:58 ` [PATCH 01/10] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
  ` (9 more replies)
  2026-04-29 14:21 ` [PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file IO performance Darrick J. Wong
19 siblings, 10 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:21 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

Hi all,

When iomap is enabled for a fuse file, we try to keep as much of the
file IO path in the kernel as we possibly can. That means no calling
out to the fuse server in the IO path when we can avoid it. However,
the existing FUSE architecture defers all file attributes to the fuse
server -- [cm]time updates, ACL metadata management, set[ug]id removal,
and permissions checking thereof, etc. We'd really rather do all these
attribute updates in the kernel, and only push them to the fuse server
when it's actually necessary (e.g. fsync).

Furthermore, the POSIX ACL code has the weird behavior that if the
access ACL can be represented entirely by i_mode bits, it will change
the mode and delete the ACL, which fuse servers generally don't seem to
implement.

IOWs, we want consistent and correct (as defined by fstests) behavior
of file attributes in iomap mode. Let's make the kernel manage all that
and push the results to userspace as needed. This improves performance
even further, since it's sort of like writeback_cache mode but more
aggressive.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-attrs
---
Commits in this patchset:
 * fuse2fs: add strictatime/lazytime mount options
 * fuse2fs: skip permission checking on utimens when iomap is enabled
 * fuse2fs: let the kernel tell us about acl/mode updates
 * fuse2fs: better debugging for file mode updates
 * fuse2fs: debug timestamp updates
 * fuse2fs: use coarse timestamps for iomap mode
 * fuse2fs: add tracing for retrieving timestamps
 * fuse2fs: enable syncfs
 * fuse2fs: set sync, immutable, and append at file load time
 * fuse4fs: increase attribute timeout in iomap mode
---
 fuse4fs/fuse4fs.1.in |   6 +
 fuse4fs/fuse4fs.c    | 226 ++++++++++++++++++++++++++++++----------
 misc/fuse2fs.1.in    |   6 +
 misc/fuse2fs.c       | 282 +++++++++++++++++++++++++++++++++++++-------------
 4 files changed, 389 insertions(+), 131 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 01/10] fuse2fs: add strictatime/lazytime mount options 2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong @ 2026-04-29 14:58 ` Darrick J. Wong 2026-04-29 14:59 ` [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled Darrick J. Wong ` (8 subsequent siblings) 9 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:58 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> In iomap mode, we can support the strictatime/lazytime mount options. Add them to fuse2fs. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.1.in | 6 ++++++ fuse4fs/fuse4fs.c | 28 +++++++++++++++++++++++++++- misc/fuse2fs.1.in | 6 ++++++ misc/fuse2fs.c | 27 +++++++++++++++++++++++++++ 4 files changed, 66 insertions(+), 1 deletion(-) diff --git a/fuse4fs/fuse4fs.1.in b/fuse4fs/fuse4fs.1.in index 8855867d27101d..119cbcc903d8af 100644 --- a/fuse4fs/fuse4fs.1.in +++ b/fuse4fs/fuse4fs.1.in @@ -90,6 +90,9 @@ .SS "fuse4fs options:" .I nosuid ) later. 
.TP +\fB-o\fR lazytime +if iomap is enabled, enable lazy updates of timestamps +.TP \fB-o\fR lockfile=path use this file to control access to the filesystem .TP @@ -98,6 +101,9 @@ .SS "fuse4fs options:" .TP \fB-o\fR norecovery do not replay the journal and mount the file system read-only +.TP +\fB-o\fR strictatime +if iomap is enabled, update atime on every access .SS "FUSE options:" .TP \fB-d -o\fR debug diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index ec6af3813a661a..6e5683dba4c918 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -286,6 +286,7 @@ struct fuse4fs { int acl; int dirsync; int translate_inums; + int iomap_passthrough_options; enum fuse4fs_opstate opstate; int logfd; @@ -1415,6 +1416,12 @@ static errcode_t fuse4fs_check_support(struct fuse4fs *ff) return EXT2_ET_FILESYSTEM_CORRUPTED; } + if (ff->iomap_passthrough_options && !fuse4fs_can_iomap(ff)) { + err_printf(ff, "%s\n", + _("Some mount options require iomap.")); + return EINVAL; + } + return 0; } @@ -2284,6 +2291,8 @@ static void fuse4fs_iomap_enable(struct fuse_conn_info *conn, if (!fuse4fs_iomap_enabled(ff)) { if (ff->iomap_want == FT_ENABLE) err_printf(ff, "%s\n", _("Could not enable iomap.")); + if (ff->iomap_passthrough_options) + err_printf(ff, "%s\n", _("Some mount options require iomap.")); return; } } @@ -7876,6 +7885,7 @@ enum { FUSE4FS_ERRORS_BEHAVIOR, #ifdef HAVE_FUSE_IOMAP FUSE4FS_IOMAP, + FUSE4FS_IOMAP_PASSTHROUGH, #endif }; @@ -7902,6 +7912,17 @@ static struct fuse_opt fuse4fs_opts[] = { FUSE4FS_OPT("timing", timing, 1), #endif +#ifdef HAVE_FUSE_IOMAP +#ifdef MS_LAZYTIME + FUSE_OPT_KEY("lazytime", FUSE4FS_IOMAP_PASSTHROUGH), + FUSE_OPT_KEY("nolazytime", FUSE4FS_IOMAP_PASSTHROUGH), +#endif +#ifdef MS_STRICTATIME + FUSE_OPT_KEY("strictatime", FUSE4FS_IOMAP_PASSTHROUGH), + FUSE_OPT_KEY("nostrictatime", FUSE4FS_IOMAP_PASSTHROUGH), +#endif +#endif + FUSE_OPT_KEY("user_xattr", FUSE4FS_IGNORED), FUSE_OPT_KEY("noblock_validity", FUSE4FS_IGNORED), FUSE_OPT_KEY("nodelalloc", 
FUSE4FS_IGNORED), @@ -7928,6 +7949,12 @@ static int fuse4fs_opt_proc(void *data, const char *arg, struct fuse4fs *ff = data; switch (key) { +#ifdef HAVE_FUSE_IOMAP + case FUSE4FS_IOMAP_PASSTHROUGH: + ff->iomap_passthrough_options = 1; + /* pass through to libfuse */ + return 1; +#endif case FUSE4FS_DIRSYNC: ff->dirsync = 1; /* pass through to libfuse */ @@ -8103,7 +8130,6 @@ static void fuse4fs_compute_libfuse_args(struct fuse4fs *ff, ff->translate_inums = 0; } - if (ff->debug) { int i; diff --git a/misc/fuse2fs.1.in b/misc/fuse2fs.1.in index 2b55fa0e723966..0c0934f03c9543 100644 --- a/misc/fuse2fs.1.in +++ b/misc/fuse2fs.1.in @@ -90,6 +90,9 @@ .SS "fuse2fs options:" .I nosuid ) later. .TP +\fB-o\fR lazytime +if iomap is enabled, enable lazy updates of timestamps +.TP \fB-o\fR lockfile=path use this file to control access to the filesystem .TP @@ -98,6 +101,9 @@ .SS "fuse2fs options:" .TP \fB-o\fR norecovery do not replay the journal and mount the file system read-only +.TP +\fB-o\fR strictatime +if iomap is enabled, update atime on every access .SS "FUSE options:" .TP \fB-d -o\fR debug diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 16b010fd28d4b5..02ca7ade4aaad6 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -268,6 +268,7 @@ struct fuse2fs { int directio; int acl; int dirsync; + int iomap_passthrough_options; enum fuse2fs_opstate opstate; int logfd; @@ -1192,6 +1193,12 @@ static errcode_t fuse2fs_check_support(struct fuse2fs *ff) return EXT2_ET_FILESYSTEM_CORRUPTED; } + if (ff->iomap_passthrough_options && !fuse2fs_can_iomap(ff)) { + err_printf(ff, "%s\n", + _("Some mount options require iomap.")); + return EINVAL; + } + return 0; } @@ -1820,6 +1827,8 @@ static void fuse2fs_iomap_enable(struct fuse_conn_info *conn, if (!fuse2fs_iomap_enabled(ff)) { if (ff->iomap_want == FT_ENABLE) err_printf(ff, "%s\n", _("Could not enable iomap.")); + if (ff->iomap_passthrough_options) + err_printf(ff, "%s\n", _("Some mount options require iomap.")); return; } } @@ 
-7113,6 +7122,7 @@ enum { FUSE2FS_ERRORS_BEHAVIOR, #ifdef HAVE_FUSE_IOMAP FUSE2FS_IOMAP, + FUSE2FS_IOMAP_PASSTHROUGH, #endif }; @@ -7139,6 +7149,17 @@ static struct fuse_opt fuse2fs_opts[] = { FUSE2FS_OPT("timing", timing, 1), #endif +#ifdef HAVE_FUSE_IOMAP +#ifdef MS_LAZYTIME + FUSE_OPT_KEY("lazytime", FUSE2FS_IOMAP_PASSTHROUGH), + FUSE_OPT_KEY("nolazytime", FUSE2FS_IOMAP_PASSTHROUGH), +#endif +#ifdef MS_STRICTATIME + FUSE_OPT_KEY("strictatime", FUSE2FS_IOMAP_PASSTHROUGH), + FUSE_OPT_KEY("nostrictatime", FUSE2FS_IOMAP_PASSTHROUGH), +#endif +#endif + FUSE_OPT_KEY("user_xattr", FUSE2FS_IGNORED), FUSE_OPT_KEY("noblock_validity", FUSE2FS_IGNORED), FUSE_OPT_KEY("nodelalloc", FUSE2FS_IGNORED), @@ -7165,6 +7186,12 @@ static int fuse2fs_opt_proc(void *data, const char *arg, struct fuse2fs *ff = data; switch (key) { +#ifdef HAVE_FUSE_IOMAP + case FUSE2FS_IOMAP_PASSTHROUGH: + ff->iomap_passthrough_options = 1; + /* pass through to libfuse */ + return 1; +#endif case FUSE2FS_DIRSYNC: ff->dirsync = 1; /* pass through to libfuse */ ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled
  2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2026-04-29 14:58 ` [PATCH 01/10] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
@ 2026-04-29 14:59 ` Darrick J. Wong
  2026-04-29 14:59 ` [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
  ` (7 subsequent siblings)
9 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:59 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

When iomap is enabled, the kernel is in charge of enforcing permissions
checks on timestamp updates for files. We needn't do that in userspace
anymore.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 11 +++++++----
 misc/fuse2fs.c    | 11 +++++++----
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 6e5683dba4c918..7163c440c9ee66 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -5555,13 +5555,16 @@ static int fuse4fs_utimens(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
 
 	/*
 	 * ext4 allows timestamp updates of append-only files but only if we're
-	 * setting to current time
+	 * setting to current time. If iomap is enabled, the kernel does the
+	 * permission checking for timestamp updates; skip the access check.
 	 */
 	if (aact == TA_NOW && mact == TA_NOW)
 		access |= A_OK;
-	ret = fuse4fs_inum_access(ff, ctxt, ino, access);
-	if (ret)
-		return ret;
+	if (!fuse4fs_iomap_enabled(ff)) {
+		ret = fuse4fs_inum_access(ff, ctxt, ino, access);
+		if (ret)
+			return ret;
+	}
 
 	if (aact != TA_OMIT)
 		EXT4_INODE_SET_XTIME(i_atime, &atime, inode);
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 02ca7ade4aaad6..670411c1117e44 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4939,13 +4939,16 @@ static int op_utimens(const char *path, const struct timespec ctv[2],
 
 	/*
 	 * ext4 allows timestamp updates of append-only files but only if we're
-	 * setting to current time
+	 * setting to current time. If iomap is enabled, the kernel does the
+	 * permission checking for timestamp updates; skip the access check.
 	 */
 	if (ctv[0].tv_nsec == UTIME_NOW && ctv[1].tv_nsec == UTIME_NOW)
 		access |= A_OK;
-	ret = check_inum_access(ff, ino, access);
-	if (ret)
-		goto out;
+	if (!fuse2fs_iomap_enabled(ff)) {
+		ret = check_inum_access(ff, ino, access);
+		if (ret)
+			goto out;
+	}
 
 	err = fuse2fs_read_inode(fs, ino, &inode);
 	if (err) {

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates
  2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2026-04-29 14:58 ` [PATCH 01/10] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
  2026-04-29 14:59 ` [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled Darrick J. Wong
@ 2026-04-29 14:59 ` Darrick J. Wong
  2026-04-29 14:59 ` [PATCH 04/10] fuse2fs: better debugging for file mode updates Darrick J. Wong
  ` (6 subsequent siblings)
9 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:59 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

When the kernel is running in iomap mode, it will also manage all the
ACL updates and the resulting file mode changes for us. Disable the
manual implementation of it in fuse2fs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 4 ++--
 misc/fuse2fs.c    | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 7163c440c9ee66..2020c6bc1e55db 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -2784,7 +2784,7 @@ static int fuse4fs_propagate_default_acls(struct fuse4fs *ff, ext2_ino_t parent,
 	size_t deflen;
 	int ret;
 
-	if (!ff->acl || S_ISDIR(mode))
+	if (!ff->acl || S_ISDIR(mode) || fuse4fs_iomap_enabled(ff))
 		return 0;
 
 	ret = fuse4fs_getxattr(ff, parent, XATTR_NAME_POSIX_ACL_DEFAULT, &def,
@@ -4210,7 +4210,7 @@ static int fuse4fs_chmod(struct fuse4fs *ff, fuse_req_t req, ext2_ino_t ino,
 	 * of the user's groups, but FUSE only tells us about the primary
 	 * group.
 	 */
-	if (!fuse4fs_is_superuser(ff, ctxt)) {
+	if (!fuse4fs_iomap_enabled(ff) && !fuse4fs_is_superuser(ff, ctxt)) {
 		ret = fuse4fs_in_file_group(ff, req, inode);
 		if (ret < 0)
 			return ret;
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 670411c1117e44..6e19f5ae796127 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -2316,7 +2316,7 @@ static int propagate_default_acls(struct fuse2fs *ff, ext2_ino_t parent,
 	size_t deflen;
 	int ret;
 
-	if (!ff->acl || S_ISDIR(mode))
+	if (!ff->acl || S_ISDIR(mode) || fuse2fs_iomap_enabled(ff))
 		return 0;
 
 	ret = __getxattr(ff, parent, XATTR_NAME_POSIX_ACL_DEFAULT, &def,
@@ -3645,7 +3645,7 @@ static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi)
 	 * of the user's groups, but FUSE only tells us about the primary
 	 * group.
 	 */
-	if (!is_superuser(ff, ctxt)) {
+	if (!fuse2fs_iomap_enabled(ff) && !is_superuser(ff, ctxt)) {
 		ret = in_file_group(ctxt, &inode);
 		if (ret < 0)
 			goto out;

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 04/10] fuse2fs: better debugging for file mode updates
  2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  ` (2 preceding siblings ...)
  2026-04-29 14:59 ` [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
@ 2026-04-29 14:59 ` Darrick J. Wong
  2026-04-29 14:59 ` [PATCH 05/10] fuse2fs: debug timestamp updates Darrick J. Wong
  ` (5 subsequent siblings)
9 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:59 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Improve the tracing of a chmod operation so that we can debug file mode
updates.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 10 ++++++----
 misc/fuse2fs.c    | 12 +++++++-----
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 2020c6bc1e55db..567b576e5f5779 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -4193,6 +4193,7 @@ static int fuse4fs_chmod(struct fuse4fs *ff, fuse_req_t req, ext2_ino_t ino,
 			 mode_t mode, struct ext2_inode_large *inode)
 {
 	const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+	mode_t new_mode;
 	int ret = 0;
 
 	dbg_printf(ff, "%s: ino=%d mode=0%o\n", __func__, ino, mode);
@@ -4219,11 +4220,12 @@ static int fuse4fs_chmod(struct fuse4fs *ff, fuse_req_t req, ext2_ino_t ino,
 			mode &= ~S_ISGID;
 	}
 
-	inode->i_mode &= ~0xFFF;
-	inode->i_mode |= mode & 0xFFF;
+	new_mode = (inode->i_mode & ~0xFFF) | (mode & 0xFFF);
 
-	dbg_printf(ff, "%s: ino=%d new_mode=0%o\n",
-		   __func__, ino, inode->i_mode);
+	dbg_printf(ff, "%s: ino=%d old_mode=0%o new_mode=0%o\n",
+		   __func__, ino, inode->i_mode, new_mode);
+
+	inode->i_mode = new_mode;
 
 	return 0;
 }
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 6e19f5ae796127..2e0fdeda963de2 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -3616,6 +3616,7 @@ static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi)
 	errcode_t err;
 	ext2_ino_t ino;
 	struct ext2_inode_large inode;
+	mode_t new_mode;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
@@ -3654,11 +3655,12 @@ static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi)
 			mode &= ~S_ISGID;
 	}
 
-	inode.i_mode &= ~0xFFF;
-	inode.i_mode |= mode & 0xFFF;
+	new_mode = (inode.i_mode & ~0xFFF) | (mode & 0xFFF);
 
-	dbg_printf(ff, "%s: path=%s new_mode=0%o ino=%d\n", __func__,
-		   path, inode.i_mode, ino);
+	dbg_printf(ff, "%s: path=%s old_mode=0%o new_mode=0%o ino=%d\n",
+		   __func__, path, inode.i_mode, new_mode, ino);
+
+	inode.i_mode = new_mode;
 
 	ret = update_ctime(fs, ino, &inode);
 	if (ret)
@@ -3678,12 +3680,12 @@ static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi)
 static int op_chown(const char *path, uid_t owner, gid_t group,
 		    struct fuse_file_info *fi)
 {
+	struct ext2_inode_large inode;
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = fuse2fs_get();
 	ext2_filsys fs;
 	errcode_t err;
 	ext2_ino_t ino;
-	struct ext2_inode_large inode;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 05/10] fuse2fs: debug timestamp updates 2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong ` (3 preceding siblings ...) 2026-04-29 14:59 ` [PATCH 04/10] fuse2fs: better debugging for file mode updates Darrick J. Wong @ 2026-04-29 14:59 ` Darrick J. Wong 2026-04-29 15:00 ` [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong ` (4 subsequent siblings) 9 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 14:59 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Add tracing for timestamp updates to files. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 97 +++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 61 insertions(+), 36 deletions(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 2e0fdeda963de2..935a66af66603e 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -865,7 +865,8 @@ static void increment_version(struct ext2_inode_large *inode) inode->i_version_hi = ver >> 32; } -static void init_times(struct ext2_inode_large *inode) +static void fuse2fs_init_timestamps(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *inode) { struct timespec now; @@ -875,11 +876,15 @@ static void init_times(struct ext2_inode_large *inode) EXT4_INODE_SET_XTIME(i_mtime, &now, inode); EXT4_EINODE_SET_XTIME(i_crtime, &now, inode); increment_version(inode); + + dbg_printf(ff, "%s: ino=%u time %ld:%lu\n", __func__, ino, now.tv_sec, + now.tv_nsec); } -static int update_ctime(ext2_filsys fs, ext2_ino_t ino, - struct ext2_inode_large *pinode) +static int fuse2fs_update_ctime(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *pinode) { + ext2_filsys fs = ff->fs; errcode_t err; struct timespec now; struct ext2_inode_large inode; @@ -890,6 +895,10 @@ static int update_ctime(ext2_filsys fs, 
ext2_ino_t ino, if (pinode) { increment_version(pinode); EXT4_INODE_SET_XTIME(i_ctime, &now, pinode); + + dbg_printf(ff, "%s: ino=%u ctime %ld:%lu\n", __func__, ino, + now.tv_sec, now.tv_nsec); + return 0; } @@ -901,6 +910,9 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino, increment_version(&inode); EXT4_INODE_SET_XTIME(i_ctime, &now, &inode); + dbg_printf(ff, "%s: ino=%u ctime %ld:%lu\n", __func__, ino, + now.tv_sec, now.tv_nsec); + err = fuse2fs_write_inode(fs, ino, &inode); if (err) return translate_error(fs, ino, err); @@ -908,8 +920,9 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino, return 0; } -static int update_atime(ext2_filsys fs, ext2_ino_t ino) +static int fuse2fs_update_atime(struct fuse2fs *ff, ext2_ino_t ino) { + ext2_filsys fs = ff->fs; errcode_t err; struct ext2_inode_large inode, *pinode; struct timespec atime, mtime, now; @@ -928,6 +941,10 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino) dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC); dnow = now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC); + dbg_printf(ff, "%s: ino=%u atime %ld:%lu mtime %ld:%lu now %ld:%lu\n", + __func__, ino, atime.tv_sec, atime.tv_nsec, mtime.tv_sec, + mtime.tv_nsec, now.tv_sec, now.tv_nsec); + /* * If atime is newer than mtime and atime hasn't been updated in thirty * seconds, skip the atime update. Same idea as Linux "relatime". 
Use @@ -944,9 +961,10 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino) return 0; } -static int update_mtime(ext2_filsys fs, ext2_ino_t ino, - struct ext2_inode_large *pinode) +static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino, + struct ext2_inode_large *pinode) { + ext2_filsys fs = ff->fs; errcode_t err; struct ext2_inode_large inode; struct timespec now; @@ -956,6 +974,10 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino, EXT4_INODE_SET_XTIME(i_mtime, &now, pinode); EXT4_INODE_SET_XTIME(i_ctime, &now, pinode); increment_version(pinode); + + dbg_printf(ff, "%s: ino=%u mtime/ctime %ld:%lu\n", + __func__, ino, now.tv_sec, now.tv_nsec); + return 0; } @@ -968,6 +990,9 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino, EXT4_INODE_SET_XTIME(i_ctime, &now, &inode); increment_version(&inode); + dbg_printf(ff, "%s: ino=%u mtime/ctime %ld:%lu\n", + __func__, ino, now.tv_sec, now.tv_nsec); + err = fuse2fs_write_inode(fs, ino, &inode); if (err) return translate_error(fs, ino, err); @@ -2237,7 +2262,7 @@ static int op_readlink(const char *path, char *buf, size_t len) buf[len] = 0; if (fuse2fs_is_writeable(ff)) { - ret = update_atime(fs, ino); + ret = fuse2fs_update_atime(ff, ino); if (ret) goto out; } @@ -2511,7 +2536,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev) goto out2; } - ret = update_mtime(fs, parent, NULL); + ret = fuse2fs_update_mtime(ff, parent, NULL); if (ret) goto out2; @@ -2534,7 +2559,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev) } inode.i_generation = ff->next_generation++; - init_times(&inode); + fuse2fs_init_timestamps(ff, child, &inode); err = fuse2fs_write_inode(fs, child, &inode); if (err) { ret = translate_error(fs, child, err); @@ -2620,7 +2645,7 @@ static int op_mkdir(const char *path, mode_t mode) goto out2; } - ret = update_mtime(fs, parent, NULL); + ret = fuse2fs_update_mtime(ff, parent, NULL); if (ret) goto out2; @@ -2647,7 +2672,7 @@ static int op_mkdir(const char 
*path, mode_t mode) if (parent_sgid) inode.i_mode |= S_ISGID; inode.i_generation = ff->next_generation++; - init_times(&inode); + fuse2fs_init_timestamps(ff, child, &inode); err = fuse2fs_write_inode(fs, child, &inode); if (err) { @@ -2730,7 +2755,7 @@ static int fuse2fs_unlink(struct fuse2fs *ff, const char *path, if (err) return translate_error(fs, dir, err); - ret = update_mtime(fs, dir, NULL); + ret = fuse2fs_update_mtime(ff, dir, NULL); if (ret) return ret; @@ -2821,7 +2846,7 @@ static int remove_inode(struct fuse2fs *ff, ext2_ino_t ino) ext2fs_set_dtime(fs, EXT2_INODE(&inode)); } - ret = update_ctime(fs, ino, &inode); + ret = fuse2fs_update_ctime(ff, ino, &inode); if (ret) return ret; @@ -2991,7 +3016,7 @@ static int __op_rmdir(struct fuse2fs *ff, const char *path) goto out; } ext2fs_dec_nlink(EXT2_INODE(&inode)); - ret = update_mtime(fs, rds.parent, &inode); + ret = fuse2fs_update_mtime(ff, rds.parent, &inode); if (ret) goto out; err = fuse2fs_write_inode(fs, rds.parent, &inode); @@ -3088,7 +3113,7 @@ static int op_symlink(const char *src, const char *dest) } /* Update parent dir's mtime */ - ret = update_mtime(fs, parent, NULL); + ret = fuse2fs_update_mtime(ff, parent, NULL); if (ret) goto out2; @@ -3112,7 +3137,7 @@ static int op_symlink(const char *src, const char *dest) fuse2fs_set_uid(&inode, ctxt->uid); fuse2fs_set_gid(&inode, gid); inode.i_generation = ff->next_generation++; - init_times(&inode); + fuse2fs_init_timestamps(ff, child, &inode); err = fuse2fs_write_inode(fs, child, &inode); if (err) { @@ -3397,11 +3422,11 @@ static int op_rename(const char *from, const char *to, } /* Update timestamps */ - ret = update_ctime(fs, from_ino, NULL); + ret = fuse2fs_update_ctime(ff, from_ino, NULL); if (ret) goto out2; - ret = update_mtime(fs, to_dir_ino, NULL); + ret = fuse2fs_update_mtime(ff, to_dir_ino, NULL); if (ret) goto out2; @@ -3495,7 +3520,7 @@ static int op_link(const char *src, const char *dest) } ext2fs_inc_nlink(fs, EXT2_INODE(&inode)); - ret = 
update_ctime(fs, ino, &inode); + ret = fuse2fs_update_ctime(ff, ino, &inode); if (ret) goto out2; @@ -3514,7 +3539,7 @@ static int op_link(const char *src, const char *dest) goto out2; } - ret = update_mtime(fs, parent, NULL); + ret = fuse2fs_update_mtime(ff, parent, NULL); if (ret) goto out2; @@ -3662,7 +3687,7 @@ static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi) inode.i_mode = new_mode; - ret = update_ctime(fs, ino, &inode); + ret = fuse2fs_update_ctime(ff, ino, &inode); if (ret) goto out; @@ -3729,7 +3754,7 @@ static int op_chown(const char *path, uid_t owner, gid_t group, fuse2fs_set_gid(&inode, group); } - ret = update_ctime(fs, ino, &inode); + ret = fuse2fs_update_ctime(ff, ino, &inode); if (ret) goto out; @@ -3859,7 +3884,7 @@ static int fuse2fs_truncate(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size) if (err) return translate_error(fs, ino, err); - ret = update_mtime(fs, ino, NULL); + ret = fuse2fs_update_mtime(ff, ino, NULL); if (ret) return ret; @@ -4094,7 +4119,7 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf, } if (fh->check_flags != X_OK && fuse2fs_is_writeable(ff)) { - ret = update_atime(fs, fh->ino); + ret = fuse2fs_update_atime(ff, fh->ino); if (ret) goto out; } @@ -4178,7 +4203,7 @@ static int op_write(const char *path EXT2FS_ATTR((unused)), goto out; } - ret = update_mtime(fs, fh->ino, NULL); + ret = fuse2fs_update_mtime(ff, fh->ino, NULL); if (ret) goto out; @@ -4540,7 +4565,7 @@ static int op_setxattr(const char *path EXT2FS_ATTR((unused)), goto out2; } - ret = update_ctime(fs, ino, NULL); + ret = fuse2fs_update_ctime(ff, ino, NULL); out2: err = ext2fs_xattrs_close(&h); if (!ret && err) @@ -4634,7 +4659,7 @@ static int op_removexattr(const char *path, const char *key) goto out2; } - ret = update_ctime(fs, ino, NULL); + ret = fuse2fs_update_ctime(ff, ino, NULL); out2: err = ext2fs_xattrs_close(&h); if (err && !ret) @@ -4752,7 +4777,7 @@ static int op_readdir(const char *path 
EXT2FS_ATTR((unused)), void *buf, } if (fuse2fs_is_writeable(ff)) { - ret = update_atime(i.fs, fh->ino); + ret = fuse2fs_update_atime(ff, fh->ino); if (ret) goto out; } @@ -4857,7 +4882,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp) goto out2; } - ret = update_mtime(fs, parent, NULL); + ret = fuse2fs_update_mtime(ff, parent, NULL); if (ret) goto out2; @@ -4888,7 +4913,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp) } inode.i_generation = ff->next_generation++; - init_times(&inode); + fuse2fs_init_timestamps(ff, child, &inode); err = fuse2fs_write_inode(fs, child, &inode); if (err) { ret = translate_error(fs, child, err); @@ -4972,7 +4997,7 @@ static int op_utimens(const char *path, const struct timespec ctv[2], if (tv[1].tv_nsec != UTIME_OMIT) EXT4_INODE_SET_XTIME(i_mtime, &tv[1], &inode); #endif /* UTIME_OMIT */ - ret = update_ctime(fs, ino, &inode); + ret = fuse2fs_update_ctime(ff, ino, &inode); if (ret) goto out; @@ -5040,7 +5065,7 @@ static int ioctl_setflags(struct fuse2fs *ff, struct fuse2fs_file_handle *fh, if (ret) return ret; - ret = update_ctime(fs, fh->ino, &inode); + ret = fuse2fs_update_ctime(ff, fh->ino, &inode); if (ret) return ret; @@ -5087,7 +5112,7 @@ static int ioctl_setversion(struct fuse2fs *ff, struct fuse2fs_file_handle *fh, inode.i_generation = generation; - ret = update_ctime(fs, fh->ino, &inode); + ret = fuse2fs_update_ctime(ff, fh->ino, &inode); if (ret) return ret; @@ -5218,7 +5243,7 @@ static int ioctl_fssetxattr(struct fuse2fs *ff, struct fuse2fs_file_handle *fh, if (ext2fs_inode_includes(inode_size, i_projid)) inode.i_projid = fsx->fsx_projid; - ret = update_ctime(fs, fh->ino, &inode); + ret = fuse2fs_update_ctime(ff, fh->ino, &inode); if (ret) return ret; @@ -5490,7 +5515,7 @@ static int fuse2fs_allocate_range(struct fuse2fs *ff, } } - err = update_mtime(fs, fh->ino, &inode); + err = fuse2fs_update_mtime(ff, fh->ino, &inode); if (err) return err; @@ -5663,7 
+5688,7 @@ static int fuse2fs_punch_range(struct fuse2fs *ff, return translate_error(fs, fh->ino, err); } - err = update_mtime(fs, fh->ino, &inode); + err = fuse2fs_update_mtime(ff, fh->ino, &inode); if (err) return err; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode
2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (4 preceding siblings ...)
2026-04-29 14:59 ` [PATCH 05/10] fuse2fs: debug timestamp updates Darrick J. Wong
@ 2026-04-29 15:00 ` Darrick J. Wong
2026-04-29 15:00 ` [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
` (3 subsequent siblings)
9 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 15:00 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the kernel is responsible for maintaining timestamps
because file writes don't upcall to fuse2fs. The kernel's predicate for
deciding if [cm]time should be updated bases its decisions off [cm]time
being an exact match for the coarse clock (instead of checking that
[cm]time < coarse_clock) which means that fuse2fs setting a fine-grained
timestamp that is slightly ahead of the coarse clock can result in
timestamps appearing to go backwards. generic/423 doesn't like seeing
btime > ctime from statx, so we'll use the coarse clock in iomap mode.

Signed-off-by: "Darrick J.
Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 110 +++++++++++++++++++++++++++++++---------------------- misc/fuse2fs.c | 34 ++++++++++++---- 2 files changed, 90 insertions(+), 54 deletions(-) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index 567b576e5f5779..8a9ada8905983d 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -1043,8 +1043,24 @@ static inline void fuse4fs_dump_extents(struct fuse4fs *ff, ext2_ino_t ino, ext2fs_extent_free(extents); } -static void get_now(struct timespec *now) +static void fuse4fs_get_now(struct fuse4fs *ff, struct timespec *now) { +#ifdef CLOCK_REALTIME_COARSE + /* + * In iomap mode, the kernel is responsible for maintaining timestamps + * because file writes don't upcall to fuse4fs. The kernel's predicate + * for deciding if [cm]time should be updated bases its decisions off + * [cm]time being an exact match for the coarse clock (instead of + * checking that [cm]time < coarse_clock) which means that fuse4fs + * setting a fine-grained timestamp that is slightly ahead of the + * coarse clock can result in timestamps appearing to go backwards. + * generic/423 doesn't like seeing btime > ctime from statx, so we'll + * use the coarse clock in iomap mode. 
+ */ + if (fuse4fs_iomap_enabled(ff) && + !clock_gettime(CLOCK_REALTIME_COARSE, now)) + return; +#endif #ifdef CLOCK_REALTIME if (!clock_gettime(CLOCK_REALTIME, now)) return; @@ -1067,11 +1083,12 @@ static void increment_version(struct ext2_inode_large *inode) inode->i_version_hi = ver >> 32; } -static void init_times(struct ext2_inode_large *inode) +static void fuse4fs_init_timestamps(struct fuse4fs *ff, + struct ext2_inode_large *inode) { struct timespec now; - get_now(&now); + fuse4fs_get_now(ff, &now); EXT4_INODE_SET_XTIME(i_atime, &now, inode); EXT4_INODE_SET_XTIME(i_ctime, &now, inode); EXT4_INODE_SET_XTIME(i_mtime, &now, inode); @@ -1079,14 +1096,15 @@ static void init_times(struct ext2_inode_large *inode) increment_version(inode); } -static int update_ctime(ext2_filsys fs, ext2_ino_t ino, - struct ext2_inode_large *pinode) +static int fuse4fs_update_ctime(struct fuse4fs *ff, ext2_ino_t ino, + struct ext2_inode_large *pinode) { - errcode_t err; struct timespec now; struct ext2_inode_large inode; + ext2_filsys fs = ff->fs; + errcode_t err; - get_now(&now); + fuse4fs_get_now(ff, &now); /* If user already has a inode buffer, just update that */ if (pinode) { @@ -1110,12 +1128,13 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino, return 0; } -static int update_atime(ext2_filsys fs, ext2_ino_t ino) +static int fuse4fs_update_atime(struct fuse4fs *ff, ext2_ino_t ino) { - errcode_t err; struct ext2_inode_large inode, *pinode; struct timespec atime, mtime, now; + ext2_filsys fs = ff->fs; double datime, dmtime, dnow; + errcode_t err; err = fuse4fs_read_inode(fs, ino, &inode); if (err) @@ -1124,7 +1143,7 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino) pinode = &inode; EXT4_INODE_GET_XTIME(i_atime, &atime, pinode); EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode); - get_now(&now); + fuse4fs_get_now(ff, &now); datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC); dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC); @@ -1146,15 
+1165,16 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino) return 0; } -static int update_mtime(ext2_filsys fs, ext2_ino_t ino, - struct ext2_inode_large *pinode) +static int fuse4fs_update_mtime(struct fuse4fs *ff, ext2_ino_t ino, + struct ext2_inode_large *pinode) { - errcode_t err; struct ext2_inode_large inode; struct timespec now; + ext2_filsys fs = ff->fs; + errcode_t err; if (pinode) { - get_now(&now); + fuse4fs_get_now(ff, &now); EXT4_INODE_SET_XTIME(i_mtime, &now, pinode); EXT4_INODE_SET_XTIME(i_ctime, &now, pinode); increment_version(pinode); @@ -1165,7 +1185,7 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino, if (err) return translate_error(fs, ino, err); - get_now(&now); + fuse4fs_get_now(ff, &now); EXT4_INODE_SET_XTIME(i_mtime, &now, &inode); EXT4_INODE_SET_XTIME(i_ctime, &now, &inode); increment_version(&inode); @@ -2701,7 +2721,7 @@ static void op_readlink(fuse_req_t req, fuse_ino_t fino) buf[len] = 0; if (fuse4fs_is_writeable(ff)) { - ret = update_atime(fs, ino); + ret = fuse4fs_update_atime(ff, ino); if (ret) goto out; } @@ -2970,7 +2990,7 @@ static void op_mknod(fuse_req_t req, fuse_ino_t fino, const char *name, goto out2; } - ret = update_mtime(fs, parent, NULL); + ret = fuse4fs_update_mtime(ff, parent, NULL); if (ret) goto out2; @@ -2993,7 +3013,7 @@ static void op_mknod(fuse_req_t req, fuse_ino_t fino, const char *name, } inode.i_generation = ff->next_generation++; - init_times(&inode); + fuse4fs_init_timestamps(ff, &inode); err = fuse4fs_write_inode(fs, child, &inode); if (err) { ret = translate_error(fs, child, err); @@ -3055,7 +3075,7 @@ static void op_mkdir(fuse_req_t req, fuse_ino_t fino, const char *name, goto out2; } - ret = update_mtime(fs, parent, NULL); + ret = fuse4fs_update_mtime(ff, parent, NULL); if (ret) goto out2; @@ -3081,7 +3101,7 @@ static void op_mkdir(fuse_req_t req, fuse_ino_t fino, const char *name, if (parent_sgid) inode.i_mode |= S_ISGID; inode.i_generation = ff->next_generation++; - 
init_times(&inode); + fuse4fs_init_timestamps(ff, &inode); err = fuse4fs_write_inode(fs, child, &inode); if (err) { @@ -3432,7 +3452,7 @@ static int fuse4fs_remove_inode(struct fuse4fs *ff, ext2_ino_t ino) inode.i_links_count--; } - ret = update_ctime(fs, ino, &inode); + ret = fuse4fs_update_ctime(ff, ino, &inode); if (ret) return ret; @@ -3504,7 +3524,7 @@ static int fuse4fs_unlink(struct fuse4fs *ff, ext2_ino_t parent, goto out; } - ret = update_mtime(fs, parent, NULL); + ret = fuse4fs_update_mtime(ff, parent, NULL); if (ret) goto out; out: @@ -3638,7 +3658,7 @@ static int fuse4fs_rmdir(struct fuse4fs *ff, ext2_ino_t parent, goto out; } ext2fs_dec_nlink(EXT2_INODE(&inode)); - ret = update_mtime(fs, rds.parent, &inode); + ret = fuse4fs_update_mtime(ff, rds.parent, &inode); if (ret) goto out; err = fuse4fs_write_inode(fs, rds.parent, &inode); @@ -3742,7 +3762,7 @@ static void op_symlink(fuse_req_t req, const char *target, fuse_ino_t fino, } /* Update parent dir's mtime */ - ret = update_mtime(fs, parent, NULL); + ret = fuse4fs_update_mtime(ff, parent, NULL); if (ret) goto out2; @@ -3765,7 +3785,7 @@ static void op_symlink(fuse_req_t req, const char *target, fuse_ino_t fino, fuse4fs_set_uid(&inode, ctxt->uid); fuse4fs_set_gid(&inode, gid); inode.i_generation = ff->next_generation++; - init_times(&inode); + fuse4fs_init_timestamps(ff, &inode); err = fuse4fs_write_inode(fs, child, &inode); if (err) { @@ -3996,11 +4016,11 @@ static void op_rename(fuse_req_t req, fuse_ino_t from_parent, const char *from, } /* Update timestamps */ - ret = update_ctime(fs, from_ino, NULL); + ret = fuse4fs_update_ctime(ff, from_ino, NULL); if (ret) goto out; - ret = update_mtime(fs, to_dir_ino, NULL); + ret = fuse4fs_update_mtime(ff, to_dir_ino, NULL); if (ret) goto out; @@ -4079,7 +4099,7 @@ static void op_link(fuse_req_t req, fuse_ino_t child_fino, } ext2fs_inc_nlink(fs, EXT2_INODE(&inode)); - ret = update_ctime(fs, child, &inode); + ret = fuse4fs_update_ctime(ff, child, &inode); if 
(ret) goto out2; @@ -4096,7 +4116,7 @@ static void op_link(fuse_req_t req, fuse_ino_t child_fino, goto out2; } - ret = update_mtime(fs, parent, NULL); + ret = fuse4fs_update_mtime(ff, parent, NULL); if (ret) goto out2; @@ -4332,7 +4352,7 @@ static int fuse4fs_truncate(struct fuse4fs *ff, ext2_ino_t ino, off_t new_size) if (err) return translate_error(fs, ino, err); - ret = update_mtime(fs, ino, NULL); + ret = fuse4fs_update_mtime(ff, ino, NULL); if (ret) return ret; @@ -4541,7 +4561,7 @@ static void op_read(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)), } if (fh->check_flags != X_OK && fuse4fs_is_writeable(ff)) { - ret = update_atime(fs, fh->ino); + ret = fuse4fs_update_atime(ff, fh->ino); if (ret) goto out; } @@ -4615,7 +4635,7 @@ static void op_write(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)), goto out; } - ret = update_mtime(fs, fh->ino, NULL); + ret = fuse4fs_update_mtime(ff, fh->ino, NULL); if (ret) goto out; @@ -5062,7 +5082,7 @@ static void op_setxattr(fuse_req_t req, fuse_ino_t fino, const char *key, goto out2; } - ret = update_ctime(fs, ino, NULL); + ret = fuse4fs_update_ctime(ff, ino, NULL); out2: err = ext2fs_xattrs_close(&h); if (!ret && err) @@ -5156,7 +5176,7 @@ static void op_removexattr(fuse_req_t req, fuse_ino_t fino, const char *key) goto out2; } - ret = update_ctime(fs, ino, NULL); + ret = fuse4fs_update_ctime(ff, ino, NULL); out2: err = ext2fs_xattrs_close(&h); if (err && !ret) @@ -5303,7 +5323,7 @@ static void __op_readdir(fuse_req_t req, fuse_ino_t fino, size_t size, } if (fuse4fs_is_writeable(ff)) { - ret = update_atime(i.fs, fh->ino); + ret = fuse4fs_update_atime(i.ff, fh->ino); if (ret) goto out; } @@ -5403,7 +5423,7 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name, goto out2; } - ret = update_mtime(fs, parent, NULL); + ret = fuse4fs_update_mtime(ff, parent, NULL); if (ret) goto out2; } else { @@ -5444,7 +5464,7 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name, } 
inode.i_generation = ff->next_generation++; - init_times(&inode); + fuse4fs_init_timestamps(ff, &inode); err = fuse4fs_write_inode(fs, child, &inode); if (err) { ret = translate_error(fs, child, err); @@ -5523,7 +5543,7 @@ static int fuse4fs_utimens(struct fuse4fs *ff, const struct fuse_ctx *ctxt, int ret = 0; if (to_set & (FUSE_SET_ATTR_ATIME_NOW | FUSE_SET_ATTR_MTIME_NOW)) - get_now(&now); + fuse4fs_get_now(ff, &now); if (to_set & FUSE_SET_ATTR_ATIME_NOW) { atime = now; @@ -5661,7 +5681,7 @@ static void op_setattr(fuse_req_t req, fuse_ino_t fino, struct stat *attr, } /* Update ctime for any attribute change */ - ret = update_ctime(fs, ino, &inode); + ret = fuse4fs_update_ctime(ff, ino, &inode); if (ret) goto out; @@ -5743,7 +5763,7 @@ static int ioctl_setflags(struct fuse4fs *ff, const struct fuse_ctx *ctxt, if (ret) return ret; - ret = update_ctime(fs, fh->ino, &inode); + ret = fuse4fs_update_ctime(ff, fh->ino, &inode); if (ret) return ret; @@ -5796,7 +5816,7 @@ static int ioctl_setversion(struct fuse4fs *ff, const struct fuse_ctx *ctxt, inode.i_generation = *indata; - ret = update_ctime(fs, fh->ino, &inode); + ret = fuse4fs_update_ctime(ff, fh->ino, &inode); if (ret) return ret; @@ -5932,7 +5952,7 @@ static int ioctl_fssetxattr(struct fuse4fs *ff, const struct fuse_ctx *ctxt, if (ext2fs_inode_includes(inode_size, i_projid)) inode.i_projid = fsx->fsx_projid; - ret = update_ctime(fs, fh->ino, &inode); + ret = fuse4fs_update_ctime(ff, fh->ino, &inode); if (ret) return ret; @@ -6228,7 +6248,7 @@ static int fuse4fs_allocate_range(struct fuse4fs *ff, } } - err = update_mtime(fs, fh->ino, &inode); + err = fuse4fs_update_mtime(ff, fh->ino, &inode); if (err) return err; @@ -6401,7 +6421,7 @@ static int fuse4fs_punch_range(struct fuse4fs *ff, return translate_error(fs, fh->ino, err); } - err = update_mtime(fs, fh->ino, &inode); + err = fuse4fs_update_mtime(ff, fh->ino, &inode); if (err) return err; @@ -8739,7 +8759,7 @@ static int __translate_error(ext2_filsys fs, 
ext2_ino_t ino, errcode_t err, error_message(err), func, line); /* Make a note in the error log */ - get_now(&now); + fuse4fs_get_now(ff, &now); ext2fs_set_tstamp(fs->super, s_last_error_time, now.tv_sec); fs->super->s_last_error_ino = ino; fs->super->s_last_error_line = line; diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 935a66af66603e..458fd6e5bf6dd8 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -841,8 +841,24 @@ static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino, ext2fs_extent_free(extents); } -static void get_now(struct timespec *now) +static void fuse2fs_get_now(struct fuse2fs *ff, struct timespec *now) { +#ifdef CLOCK_REALTIME_COARSE + /* + * In iomap mode, the kernel is responsible for maintaining timestamps + * because file writes don't upcall to fuse2fs. The kernel's predicate + * for deciding if [cm]time should be updated bases its decisions off + * [cm]time being an exact match for the coarse clock (instead of + * checking that [cm]time < coarse_clock) which means that fuse2fs + * setting a fine-grained timestamp that is slightly ahead of the + * coarse clock can result in timestamps appearing to go backwards. + * generic/423 doesn't like seeing btime > ctime from statx, so we'll + * use the coarse clock in iomap mode. 
+ */ + if (fuse2fs_iomap_enabled(ff) && + !clock_gettime(CLOCK_REALTIME_COARSE, now)) + return; +#endif #ifdef CLOCK_REALTIME if (!clock_gettime(CLOCK_REALTIME, now)) return; @@ -870,7 +886,7 @@ static void fuse2fs_init_timestamps(struct fuse2fs *ff, ext2_ino_t ino, { struct timespec now; - get_now(&now); + fuse2fs_get_now(ff, &now); EXT4_INODE_SET_XTIME(i_atime, &now, inode); EXT4_INODE_SET_XTIME(i_ctime, &now, inode); EXT4_INODE_SET_XTIME(i_mtime, &now, inode); @@ -889,7 +905,7 @@ static int fuse2fs_update_ctime(struct fuse2fs *ff, ext2_ino_t ino, struct timespec now; struct ext2_inode_large inode; - get_now(&now); + fuse2fs_get_now(ff, &now); /* If user already has a inode buffer, just update that */ if (pinode) { @@ -935,7 +951,7 @@ static int fuse2fs_update_atime(struct fuse2fs *ff, ext2_ino_t ino) pinode = &inode; EXT4_INODE_GET_XTIME(i_atime, &atime, pinode); EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode); - get_now(&now); + fuse2fs_get_now(ff, &now); datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC); dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC); @@ -970,7 +986,7 @@ static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino, struct timespec now; if (pinode) { - get_now(&now); + fuse2fs_get_now(ff, &now); EXT4_INODE_SET_XTIME(i_mtime, &now, pinode); EXT4_INODE_SET_XTIME(i_ctime, &now, pinode); increment_version(pinode); @@ -985,7 +1001,7 @@ static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino, if (err) return translate_error(fs, ino, err); - get_now(&now); + fuse2fs_get_now(ff, &now); EXT4_INODE_SET_XTIME(i_mtime, &now, &inode); EXT4_INODE_SET_XTIME(i_ctime, &now, &inode); increment_version(&inode); @@ -4987,9 +5003,9 @@ static int op_utimens(const char *path, const struct timespec ctv[2], tv[1] = ctv[1]; #ifdef UTIME_NOW if (tv[0].tv_nsec == UTIME_NOW) - get_now(tv); + fuse2fs_get_now(ff, tv); if (tv[1].tv_nsec == UTIME_NOW) - get_now(tv + 1); + fuse2fs_get_now(ff, tv + 1); #endif /* UTIME_NOW */ #ifdef 
UTIME_OMIT if (tv[0].tv_nsec != UTIME_OMIT) @@ -7747,7 +7763,7 @@ static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err, error_message(err), func, line); /* Make a note in the error log */ - get_now(&now); + fuse2fs_get_now(ff, &now); ext2fs_set_tstamp(fs->super, s_last_error_time, now.tv_sec); fs->super->s_last_error_ino = ino; fs->super->s_last_error_line = line; ^ permalink raw reply related [flat|nested] 191+ messages in thread
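The clock selection described in the 06/10 commit message can be tried outside of fuse2fs. What follows is a minimal sketch under stated assumptions, not the patch itself: get_now() takes a plain use_coarse flag as a hypothetical stand-in for fuse2fs_iomap_enabled(ff), ts_to_ns() is an illustrative helper that does not exist in e2fsprogs, and the time() fallback at the end is an assumption about what follows the CLOCK_REALTIME branch, since the hunk does not show it.

```c
#include <time.h>

/*
 * Hypothetical stand-in for fuse2fs_get_now().  use_coarse plays the role
 * of fuse2fs_iomap_enabled(ff) in the patch.
 */
static void get_now(int use_coarse, struct timespec *now)
{
	(void)use_coarse;	/* unused if the coarse clock is unavailable */

#ifdef CLOCK_REALTIME_COARSE
	/* Prefer the coarse clock when asked to, as the patch does. */
	if (use_coarse && !clock_gettime(CLOCK_REALTIME_COARSE, now))
		return;
#endif
#ifdef CLOCK_REALTIME
	if (!clock_gettime(CLOCK_REALTIME, now))
		return;
#endif
	/* Assumed last-resort fallback; not shown in the quoted hunk. */
	now->tv_sec = time(NULL);
	now->tv_nsec = 0;
}

/* Illustrative helper: flatten a timespec to nanoseconds for comparison. */
static long long ts_to_ns(const struct timespec *ts)
{
	return (long long)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}
```

Calling get_now(1, &coarse) and then get_now(0, &fine) and comparing the two with ts_to_ns() typically shows the fine-grained reading at or slightly ahead of the coarse one, which is the gap the commit message is concerned about.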
* [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps 2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong ` (5 preceding siblings ...) 2026-04-29 15:00 ` [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong @ 2026-04-29 15:00 ` Darrick J. Wong 2026-04-29 15:00 ` [PATCH 08/10] fuse2fs: enable syncfs Darrick J. Wong ` (2 subsequent siblings) 9 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 15:00 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Add tracing for retrieving timestamps so we can debug the weird behavior. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- misc/fuse2fs.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index 458fd6e5bf6dd8..e790f6c2b59ecd 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -1955,9 +1955,11 @@ static void *op_init(struct fuse_conn_info *conn, return ff; } -static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf) +static int fuse2fs_stat(struct fuse2fs *ff, ext2_ino_t ino, + struct stat *statbuf) { struct ext2_inode_large inode; + ext2_filsys fs = ff->fs; dev_t fakedev = 0; errcode_t err; int ret = 0; @@ -1996,6 +1998,13 @@ static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf) #else statbuf->st_ctime = tv.tv_sec; #endif + + dbg_printf(ff, "%s: ino=%d atime=%lld.%ld mtime=%lld.%ld ctime=%lld.%ld\n", + __func__, ino, + (long long int)statbuf->st_atim.tv_sec, statbuf->st_atim.tv_nsec, + (long long int)statbuf->st_mtim.tv_sec, statbuf->st_mtim.tv_nsec, + (long long int)statbuf->st_ctim.tv_sec, statbuf->st_ctim.tv_nsec); + if (LINUX_S_ISCHR(inode.i_mode) || LINUX_S_ISBLK(inode.i_mode)) { if (inode.i_block[0]) @@ -2042,16 +2051,15 @@ static int op_getattr(const char *path, struct stat 
*statbuf, struct fuse_file_info *fi) { struct fuse2fs *ff = fuse2fs_get(); - ext2_filsys fs; ext2_ino_t ino; int ret = 0; FUSE2FS_CHECK_CONTEXT(ff); - fs = fuse2fs_start(ff); + fuse2fs_start(ff); ret = fuse2fs_file_ino(ff, path, fi, &ino); if (ret) goto out; - ret = stat_inode(fs, ino, statbuf); + ret = fuse2fs_stat(ff, ino, statbuf); out: fuse2fs_finish(ff, ret); return ret; @@ -3841,7 +3849,7 @@ static int fuse2fs_file_uses_iomap(struct fuse2fs *ff, ext2_ino_t ino) if (!fuse2fs_iomap_enabled(ff)) return 0; - ret = stat_inode(ff->fs, ino, &statbuf); + ret = fuse2fs_stat(ff, ino, &statbuf); if (ret) return ret; @@ -4750,7 +4758,7 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)), (unsigned long long)i->dirpos); if (i->flags == FUSE_READDIR_PLUS) { - ret = stat_inode(i->fs, dirent->inode, &stat); + ret = fuse2fs_stat(i->ff, dirent->inode, &stat); if (ret) return DIRENT_ABORT; } ^ permalink raw reply related [flat|nested] 191+ messages in thread
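The tracing added in 07/10 casts tv_sec to long long int before printing it, which keeps the format string valid regardless of how wide time_t is. A minimal sketch of that idea as a standalone helper follows. format_ts() is a hypothetical name, not something in fuse2fs, and zero-padding the nanoseconds to nine digits is an addition of this sketch (the patch prints them with a bare %ld).

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: render a timespec as "seconds.nanoseconds".
 * tv_sec is cast to long long, as the patch does, so the format string
 * does not depend on the width of time_t.  The %09ld padding is this
 * sketch's own choice so that 5ns prints as .000000005 rather than .5.
 * Returns the snprintf() result.
 */
static int format_ts(char *buf, size_t len, const struct timespec *ts)
{
	return snprintf(buf, len, "%lld.%09ld",
			(long long int)ts->tv_sec, (long)ts->tv_nsec);
}
```

A debug line in the style of the patch could then print the buffer with %s instead of passing each field separately, at the cost of one buffer per timestamp.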
* [PATCH 08/10] fuse2fs: enable syncfs 2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong ` (6 preceding siblings ...) 2026-04-29 15:00 ` [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong @ 2026-04-29 15:00 ` Darrick J. Wong 2026-04-29 15:00 ` [PATCH 09/10] fuse2fs: set sync, immutable, and append at file load time Darrick J. Wong 2026-04-29 15:01 ` [PATCH 10/10] fuse4fs: increase attribute timeout in iomap mode Darrick J. Wong 9 siblings, 0 replies; 191+ messages in thread From: Darrick J. Wong @ 2026-04-29 15:00 UTC (permalink / raw) To: tytso Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong From: Darrick J. Wong <djwong@kernel.org> Enable syncfs calls in fuse2fs. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fuse4fs/fuse4fs.c | 32 ++++++++++++++++++++++++++++++++ misc/fuse2fs.c | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 66 insertions(+) diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c index 8a9ada8905983d..51958d6677288c 100644 --- a/fuse4fs/fuse4fs.c +++ b/fuse4fs/fuse4fs.c @@ -6563,7 +6563,38 @@ static void op_shutdownfs(fuse_req_t req, fuse_ino_t ino, uint64_t flags) int ret; ret = ioctl_shutdown(ff, ctxt, NULL, NULL, 0); + fuse_reply_err(req, -ret); +} +static void op_syncfs(fuse_req_t req, fuse_ino_t ino) +{ + struct fuse4fs *ff = fuse4fs_get(req); + ext2_filsys fs; + errcode_t err; + int ret = 0; + + FUSE4FS_CHECK_CONTEXT(req); + fs = fuse4fs_start(ff); + + if (ff->opstate == F4OP_WRITABLE) { + if (fs->super->s_error_count) + fs->super->s_state |= EXT2_ERROR_FS; + ext2fs_mark_super_dirty(fs); + err = ext2fs_set_gdt_csum(fs); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + + err = ext2fs_flush2(fs, 0); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + } + +out_unlock: + fuse4fs_finish(ff, ret); fuse_reply_err(req, -ret); } #endif @@ -7874,6 +7905,7 @@ 
static struct fuse_lowlevel_ops fs_ops = { .freezefs = op_freezefs, .unfreezefs = op_unfreezefs, .shutdownfs = op_shutdownfs, + .syncfs = op_syncfs, #endif #ifdef HAVE_FUSE_IOMAP .iomap_begin = op_iomap_begin, diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c index e790f6c2b59ecd..026c547618b5cd 100644 --- a/misc/fuse2fs.c +++ b/misc/fuse2fs.c @@ -5851,6 +5851,39 @@ static int op_shutdownfs(const char *path, uint64_t flags) return ioctl_shutdown(ff, NULL, NULL); } + +static int op_syncfs(const char *path) +{ + struct fuse2fs *ff = fuse2fs_get(); + ext2_filsys fs; + errcode_t err; + int ret = 0; + + FUSE2FS_CHECK_CONTEXT(ff); + dbg_printf(ff, "%s: path=%s\n", __func__, path); + fs = fuse2fs_start(ff); + + if (ff->opstate == F2OP_WRITABLE) { + if (fs->super->s_error_count) + fs->super->s_state |= EXT2_ERROR_FS; + ext2fs_mark_super_dirty(fs); + err = ext2fs_set_gdt_csum(fs); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + + err = ext2fs_flush2(fs, 0); + if (err) { + ret = translate_error(fs, 0, err); + goto out_unlock; + } + } + +out_unlock: + fuse2fs_finish(ff, ret); + return ret; +} #endif #ifdef HAVE_FUSE_IOMAP @@ -7140,6 +7173,7 @@ static struct fuse_operations fs_ops = { .freezefs = op_freezefs, .unfreezefs = op_unfreezefs, .shutdownfs = op_shutdownfs, + .syncfs = op_syncfs, #endif #ifdef HAVE_FUSE_IOMAP .iomap_begin = op_iomap_begin, ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 09/10] fuse2fs: set sync, immutable, and append at file load time
2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (7 preceding siblings ...)
2026-04-29 15:00 ` [PATCH 08/10] fuse2fs: enable syncfs Darrick J. Wong
@ 2026-04-29 15:00 ` Darrick J. Wong
2026-04-29 15:01 ` [PATCH 10/10] fuse4fs: increase attribute timeout in iomap mode Darrick J. Wong
9 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 15:00 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Convey these three inode flags to the kernel when we're loading a file.
This way the kernel can advertise and enforce those flags so that the
fuse server doesn't have to.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 16 ++++++++++++++++
 misc/fuse2fs.c | 53 ++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 51958d6677288c..b4105333f4fe22 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -2441,6 +2441,22 @@ static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino,
 	entry->entry_timeout = FUSE4FS_ATTR_TIMEOUT;
 
 	fstat->iflags = 0;
+
+#ifdef FUSE_IFLAG_SYNC
+	if (inodep->i_flags & EXT2_SYNC_FL)
+		fstat->iflags |= FUSE_IFLAG_SYNC;
+#endif
+
+#ifdef FUSE_IFLAG_IMMUTABLE
+	if (inodep->i_flags & EXT2_IMMUTABLE_FL)
+		fstat->iflags |= FUSE_IFLAG_IMMUTABLE;
+#endif
+
+#ifdef FUSE_IFLAG_APPEND
+	if (inodep->i_flags & EXT2_APPEND_FL)
+		fstat->iflags |= FUSE_IFLAG_APPEND;
+#endif
+
 #ifdef HAVE_FUSE_IOMAP
 	if (fuse4fs_iomap_enabled(ff)) {
 		fstat->iflags |= FUSE_IFLAG_IOMAP | FUSE_IFLAG_EXCLUSIVE;
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 026c547618b5cd..6e4780121d5c83 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1956,7 +1956,7 @@ static void *op_init(struct fuse_conn_info *conn,
 }
 
 static int fuse2fs_stat(struct fuse2fs *ff, ext2_ino_t ino,
-			struct stat *statbuf)
+			struct stat *statbuf, unsigned int *iflags)
 {
 	struct ext2_inode_large inode;
 	ext2_filsys fs = ff->fs;
@@ -2013,6 +2013,7 @@ static int fuse2fs_stat(struct fuse2fs *ff, ext2_ino_t ino,
 		statbuf->st_rdev = inode.i_block[1];
 	}
 
+	*iflags = inode.i_flags;
 	return ret;
 }
 
@@ -2047,22 +2048,31 @@ static int __fuse2fs_file_ino(struct fuse2fs *ff, const char *path,
 # define fuse2fs_file_ino(ff, path, fp, inop) \
 	__fuse2fs_file_ino((ff), (path), (fp), (inop), __func__, __LINE__)
 
+static int fuse2fs_getattr(struct fuse2fs *ff, const char *path,
+			   struct stat *statbuf, struct fuse_file_info *fi,
+			   unsigned int *iflags)
+{
+	ext2_ino_t ino;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+	fuse2fs_start(ff);
+	ret = fuse2fs_file_ino(ff, path, fi, &ino);
+	if (ret)
+		goto out;
+	ret = fuse2fs_stat(ff, ino, statbuf, iflags);
+out:
+	fuse2fs_finish(ff, ret);
+	return ret;
+}
+
 static int op_getattr(const char *path, struct stat *statbuf,
 		      struct fuse_file_info *fi)
 {
 	struct fuse2fs *ff = fuse2fs_get();
-	ext2_ino_t ino;
-	int ret = 0;
+	unsigned int dontcare;
 
-	FUSE2FS_CHECK_CONTEXT(ff);
-	fuse2fs_start(ff);
-	ret = fuse2fs_file_ino(ff, path, fi, &ino);
-	if (ret)
-		goto out;
-	ret = fuse2fs_stat(ff, ino, statbuf);
-out:
-	fuse2fs_finish(ff, ret);
-	return ret;
+	return fuse2fs_getattr(ff, path, statbuf, fi, &dontcare);
 }
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99)
@@ -2070,11 +2080,21 @@ static int op_getattr_iflags(const char *path, struct stat *statbuf,
 			     unsigned int *iflags, struct fuse_file_info *fi)
 {
 	struct fuse2fs *ff = fuse2fs_get();
-	int ret = op_getattr(path, statbuf, fi);
+	unsigned int i_flags;
+	int ret = fuse2fs_getattr(ff, path, statbuf, fi, &i_flags);
 
 	if (ret)
 		return ret;
 
+	if (i_flags & EXT2_SYNC_FL)
+		*iflags |= FUSE_IFLAG_SYNC;
+
+	if (i_flags & EXT2_IMMUTABLE_FL)
+		*iflags |= FUSE_IFLAG_IMMUTABLE;
+
+	if (i_flags & EXT2_APPEND_FL)
+		*iflags |= FUSE_IFLAG_APPEND;
+
 	if (fuse_fs_can_enable_iomap(statbuf)) {
 		*iflags |= FUSE_IFLAG_IOMAP | FUSE_IFLAG_EXCLUSIVE;
 
@@ -3844,12 +3864,13 @@ static int fuse2fs_punch_posteof(struct fuse2fs *ff, ext2_ino_t ino,
 static int fuse2fs_file_uses_iomap(struct fuse2fs *ff, ext2_ino_t ino)
 {
 	struct stat statbuf;
+	unsigned int dontcare;
 	int ret;
 
 	if (!fuse2fs_iomap_enabled(ff))
 		return 0;
 
-	ret = fuse2fs_stat(ff, ino, &statbuf);
+	ret = fuse2fs_stat(ff, ino, &statbuf, &dontcare);
 	if (ret)
 		return ret;
 
@@ -4758,7 +4779,9 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)),
 			   (unsigned long long)i->dirpos);
 
 	if (i->flags == FUSE_READDIR_PLUS) {
-		ret = fuse2fs_stat(i->ff, dirent->inode, &stat);
+		unsigned int dontcare;
+
+		ret = fuse2fs_stat(i->ff, dirent->inode, &stat, &dontcare);
 		if (ret)
 			return DIRENT_ABORT;
 	}

^ permalink raw reply related [flat|nested] 191+ messages in thread
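For readers skimming the diff, the flag translation in this patch boils down to the mapping below. This is only a standalone sketch: the EXT2_*_FL values match the on-disk ext2 inode flag bits, but the SKETCH_IFLAG_* values are made up for illustration, because the real FUSE_IFLAG_* constants live in the prototype libfuse headers and are not shown in this thread.

```c
#include <stdint.h>

/* On-disk ext2/3/4 inode flag bits (same values as in ext2_fs.h). */
#define EXT2_SYNC_FL		0x00000008
#define EXT2_IMMUTABLE_FL	0x00000010
#define EXT2_APPEND_FL		0x00000020

/*
 * HYPOTHETICAL values, for illustration only.  The real FUSE_IFLAG_*
 * constants are defined by the prototype libfuse and may differ.
 */
#define SKETCH_IFLAG_SYNC	(1U << 0)
#define SKETCH_IFLAG_IMMUTABLE	(1U << 1)
#define SKETCH_IFLAG_APPEND	(1U << 2)

/* Translate on-disk i_flags into the flags reported to the kernel. */
static uint32_t sketch_iflags_from_ext2(uint32_t i_flags)
{
	uint32_t iflags = 0;

	if (i_flags & EXT2_SYNC_FL)
		iflags |= SKETCH_IFLAG_SYNC;
	if (i_flags & EXT2_IMMUTABLE_FL)
		iflags |= SKETCH_IFLAG_IMMUTABLE;
	if (i_flags & EXT2_APPEND_FL)
		iflags |= SKETCH_IFLAG_APPEND;

	/* Any other on-disk flag is not conveyed to the kernel here. */
	return iflags;
}
```

Starting from zero and OR-ing in one bit per recognized flag means that on-disk flags outside these three are silently ignored rather than leaked to the kernel.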
* [PATCH 10/10] fuse4fs: increase attribute timeout in iomap mode
2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (8 preceding siblings ...)
2026-04-29 15:00 ` [PATCH 09/10] fuse2fs: set sync, immutable, and append at file load time Darrick J. Wong
@ 2026-04-29 15:01 ` Darrick J. Wong
9 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 15:01 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, we trust the kernel to cache file attributes, because it
is critical to keep all of the file IO permissions checking in the
kernel as part of keeping all the file IO paths in the kernel.
Therefore, increase the attribute timeout to 30 seconds to reduce the
number of upcalls even further.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index b4105333f4fe22..9f7fe365ccdf87 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -131,7 +131,8 @@
 #endif
 #endif /* !defined(ENODATA) */
 
-#define FUSE4FS_ATTR_TIMEOUT	(0.0)
+#define FUSE4FS_IOMAP_ATTR_TIMEOUT	(30.0)
+#define FUSE4FS_ATTR_TIMEOUT		(0.0)
 
 #ifndef O_DIRECT
 # define O_DIRECT (0)
@@ -2437,8 +2438,14 @@ static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino,
 
 	fuse4fs_ino_to_fuse(ff, &entry->ino, ino);
 	entry->generation = inodep->i_generation;
-	entry->attr_timeout = FUSE4FS_ATTR_TIMEOUT;
-	entry->entry_timeout = FUSE4FS_ATTR_TIMEOUT;
+
+	if (fuse4fs_iomap_enabled(ff)) {
+		entry->attr_timeout = FUSE4FS_IOMAP_ATTR_TIMEOUT;
+		entry->entry_timeout = FUSE4FS_IOMAP_ATTR_TIMEOUT;
+	} else {
+		entry->attr_timeout = FUSE4FS_ATTR_TIMEOUT;
+		entry->entry_timeout = FUSE4FS_ATTR_TIMEOUT;
+	}
 
 	fstat->iflags = 0;
 
@@ -2671,6 +2678,8 @@ static void op_statx(fuse_req_t req, fuse_ino_t fino, int flags, int mask,
 	fuse4fs_finish(ff, ret);
 	if (ret)
 		fuse_reply_err(req, -ret);
+	else if (fuse4fs_iomap_enabled(ff))
+		fuse_reply_statx(req, 0, &stx, FUSE4FS_IOMAP_ATTR_TIMEOUT);
 	else
 		fuse_reply_statx(req, 0, &stx, FUSE4FS_ATTR_TIMEOUT);
 }

^ permalink raw reply related [flat|nested] 191+ messages in thread
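The intent stated in the commit message above (30 seconds in iomap mode, and the pre-patch value of zero otherwise) can be summarized as a single selection function. This is a sketch only; the SKETCH_* names and the helper are hypothetical and not part of fuse4fs.

```c
#include <stdbool.h>

/* HYPOTHETICAL names; values follow the commit message, not fuse4fs itself. */
#define SKETCH_IOMAP_ATTR_TIMEOUT	(30.0)	/* kernel caches attributes */
#define SKETCH_ATTR_TIMEOUT		(0.0)	/* ask the server every time */

/* Pick the attribute/entry timeout, in seconds, for a reply. */
static double sketch_attr_timeout(bool iomap_enabled)
{
	return iomap_enabled ? SKETCH_IOMAP_ATTR_TIMEOUT : SKETCH_ATTR_TIMEOUT;
}
```

A nonzero timeout only makes sense when the kernel is the authority on the attributes it caches, which is the argument the commit message makes for iomap mode.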
* [PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file IO performance
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
` (18 preceding siblings ...)
2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2026-04-29 14:21 ` Darrick J. Wong
2026-04-29 15:01 ` [PATCH 1/4] fuse2fs: enable caching of iomaps Darrick J. Wong
` (3 more replies)
19 siblings, 4 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 14:21 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

Hi all,

This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel. For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem. For everyone else, it simply
eliminates roundtrips to userspace.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-cache
---
Commits in this patchset:
 * fuse2fs: enable caching of iomaps
 * fuse2fs: constrain iomap mapping cache size
 * fuse4fs: upsert first file mapping to kernel on open
 * fuse2fs: enable iomap
---
 fuse4fs/fuse4fs.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 misc/fuse2fs.c | 38 ++++++++++++++++++++++-----
 2 files changed, 101 insertions(+), 13 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/4] fuse2fs: enable caching of iomaps
2026-04-29 14:21 ` [PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2026-04-29 15:01 ` Darrick J. Wong
2026-04-29 15:01 ` [PATCH 2/4] fuse2fs: constrain iomap mapping cache size Darrick J. Wong
` (2 subsequent siblings)
3 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 15:01 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Cache the iomaps we generate in the kernel for better performance.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 30 ++++++++++++++++++++++++++++++
 misc/fuse2fs.c | 29 +++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 9f7fe365ccdf87..56a79dd48c9f43 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -304,6 +304,8 @@ struct fuse4fs {
 #ifdef STATX_WRITE_ATOMIC
 	unsigned int awu_min, awu_max;
 #endif
+	/* options set by fuse_opt_parse must be of type int */
+	int iomap_cache;
 #endif
 	unsigned int blockmask;
 	unsigned long offset;
@@ -2314,6 +2316,7 @@ static void fuse4fs_iomap_enable(struct fuse_conn_info *conn,
 		err_printf(ff, "%s\n", _("Could not enable iomap."));
 		if (ff->iomap_passthrough_options)
 			err_printf(ff, "%s\n", _("Some mount options require iomap."));
+		ff->iomap_cache = 0;
 		return;
 	}
 }
@@ -7169,6 +7172,28 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
 	if (opflags & FUSE_IOMAP_OP_ATOMIC)
 		read.flags |= FUSE_IOMAP_F_ATOMIC_BIO;
 
+	/*
+	 * For real IO operations, cache the mapping in the kernel so that we
+	 * can reuse them for subsequent IO to the same regions. Don't let
+	 * FIEMAP thrash the cache.
+	 */
+	if (!(opflags & FUSE_IOMAP_OP_REPORT) && ff->iomap_cache) {
+		ret = fuse_lowlevel_iomap_upsert_mappings(ff->fuse, fino, ino,
+							  &read, NULL);
+		if (ret) {
+			/*
+			 * Log the cache upsert error, but we can still return
+			 * the mapping via the reply. EINVAL is the magic code
+			 * for the kernel declining to cache the mapping.
+			 */
+			if (ret != -ENOMEM && ret != -EINVAL)
+				translate_error(fs, ino, -ret);
+			goto out_unlock;
+		}
+
+		fuse_file_iomap_retry_cache(&read);
+	}
+
 out_unlock:
 	fuse4fs_finish(ff, ret);
 	if (ret)
@@ -7993,6 +8018,10 @@ static struct fuse_opt fuse4fs_opts[] = {
 #ifdef HAVE_CLOCK_MONOTONIC
 	FUSE4FS_OPT("timing", timing, 1),
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	FUSE4FS_OPT("iomap_cache", iomap_cache, 1),
+	FUSE4FS_OPT("noiomap_cache", iomap_cache, 0),
+#endif
 
 #ifdef HAVE_FUSE_IOMAP
 #ifdef MS_LAZYTIME
@@ -8487,6 +8516,7 @@ int main(int argc, char *argv[])
 		.iomap_want = FT_DEFAULT,
 		.iomap_state = IOMAP_UNKNOWN,
 		.iomap_dev = FUSE_IOMAP_DEV_NULL,
+		.iomap_cache = 1,
 #endif
 #ifdef HAVE_FUSE_LOOPDEV
 		.loop_fd = -1,
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 6e4780121d5c83..e485c38bda41e8 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -285,6 +285,8 @@ struct fuse2fs {
 #ifdef STATX_WRITE_ATOMIC
 	unsigned int awu_min, awu_max;
 #endif
+	/* options set by fuse_opt_parse must be of type int */
+	int iomap_cache;
 #endif
 	unsigned int blockmask;
 	unsigned long offset;
@@ -1870,6 +1872,7 @@ static void fuse2fs_iomap_enable(struct fuse_conn_info *conn,
 		err_printf(ff, "%s\n", _("Could not enable iomap."));
 		if (ff->iomap_passthrough_options)
 			err_printf(ff, "%s\n", _("Some mount options require iomap."));
+		ff->iomap_cache = 0;
 		return;
 	}
 }
@@ -6453,6 +6456,27 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	if (opflags & FUSE_IOMAP_OP_ATOMIC)
 		read->flags |= FUSE_IOMAP_F_ATOMIC_BIO;
 
+	/*
+	 * For real IO operations, cache the mapping in the kernel so that we
+	 * can reuse them for subsequent IO to the same regions. Don't let
+	 * FIEMAP thrash the cache.
+	 */
+	if (!(opflags & FUSE_IOMAP_OP_REPORT) && ff->iomap_cache) {
+		ret = fuse_fs_iomap_upsert(nodeid, attr_ino, read, NULL);
+		if (ret) {
+			/*
+			 * Log the cache upsert error, but we can still return
+			 * the mapping via the reply. EINVAL is the magic code
+			 * for the kernel declining to cache the mapping.
+			 */
+			if (ret != -ENOMEM && ret != -EINVAL)
+				translate_error(fs, attr_ino, -ret);
+			goto out_unlock;
+		}
+
+		fuse_file_iomap_retry_cache(read);
+	}
+
 out_unlock:
 	fuse2fs_finish(ff, ret);
 	return ret;
@@ -7259,6 +7283,10 @@ static struct fuse_opt fuse2fs_opts[] = {
 #ifdef HAVE_CLOCK_MONOTONIC
 	FUSE2FS_OPT("timing", timing, 1),
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	FUSE2FS_OPT("iomap_cache", iomap_cache, 1),
+	FUSE2FS_OPT("noiomap_cache", iomap_cache, 0),
+#endif
 
 #ifdef HAVE_FUSE_IOMAP
 #ifdef MS_LAZYTIME
@@ -7539,6 +7567,7 @@ int main(int argc, char *argv[])
 		.iomap_want = FT_DEFAULT,
 		.iomap_state = IOMAP_UNKNOWN,
 		.iomap_dev = FUSE_IOMAP_DEV_NULL,
+		.iomap_cache = 1,
 #endif
 #ifdef HAVE_FUSE_LOOPDEV
 		.loop_fd = -1,

^ permalink raw reply related [flat|nested] 191+ messages in thread
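The two decisions made in the op_iomap_begin hunks above (when to upsert a mapping into the kernel cache, and which upsert failures get logged) can be isolated as a pair of predicates. This is a standalone sketch: SKETCH_IOMAP_OP_REPORT is a made-up bit value, since the real FUSE_IOMAP_OP_REPORT comes from the prototype headers, and both helper names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL bit value standing in for FUSE_IOMAP_OP_REPORT. */
#define SKETCH_IOMAP_OP_REPORT	(1U << 0)

/*
 * Cache mappings only for real IO.  Reporting operations such as FIEMAP
 * walk the whole file and would otherwise thrash the cache.
 */
static bool sketch_should_cache(uint32_t opflags, int iomap_cache)
{
	return iomap_cache && !(opflags & SKETCH_IOMAP_OP_REPORT);
}

/*
 * -ENOMEM and -EINVAL (the kernel declining to cache the mapping) are
 * treated as quiet failures; anything else is worth logging.
 */
static bool sketch_upsert_error_is_loggable(int ret)
{
	return ret != 0 && ret != -ENOMEM && ret != -EINVAL;
}
```

The cache flag is an int rather than a bool to mirror the comment in the patch that options set by fuse_opt_parse must be of type int.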
* [PATCH 2/4] fuse2fs: constrain iomap mapping cache size
2026-04-29 14:21 ` [PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file IO performance Darrick J. Wong
2026-04-29 15:01 ` [PATCH 1/4] fuse2fs: enable caching of iomaps Darrick J. Wong
@ 2026-04-29 15:01 ` Darrick J. Wong
2026-04-29 15:02 ` [PATCH 3/4] fuse4fs: upsert first file mapping to kernel on open Darrick J. Wong
2026-04-29 15:02 ` [PATCH 4/4] fuse2fs: enable iomap Darrick J. Wong
3 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 15:01 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Update the iomap config functions to handle the new iomap mapping cache
size restriction knob.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 6 ++++--
 misc/fuse2fs.c | 5 +++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index 56a79dd48c9f43..d07c576d31862a 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -7481,9 +7481,11 @@ static void op_iomap_config(fuse_req_t req,
 	FUSE4FS_CHECK_CONTEXT(req);
 
-	dbg_printf(ff, "%s: flags=0x%llx maxbytes=0x%llx\n", __func__,
+	dbg_printf(ff, "%s: flags=0x%llx maxbytes=0x%llx cache_maxbytes=0x%x\n",
+		   __func__,
 		   (unsigned long long)p->flags,
-		   (unsigned long long)p->maxbytes);
+		   (unsigned long long)p->maxbytes,
+		   p->cache_maxbytes);
 	fs = fuse4fs_start(ff);
 
 	cfg.flags |= FUSE_IOMAP_CONFIG_UUID;
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e485c38bda41e8..232181bc170183 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -6753,9 +6753,10 @@ static int op_iomap_config(const struct fuse_iomap_config_params *p,
 	FUSE2FS_CHECK_CONTEXT(ff);
 
-	dbg_printf(ff, "%s: flags=0x%llx maxbytes=0x%llx\n", __func__,
+	dbg_printf(ff, "%s: flags=0x%llx maxbytes=0x%llx cache_maxbytes=%u\n", __func__,
 		   (unsigned long long)p->flags,
-		   (unsigned long long)p->maxbytes);
+		   (unsigned long long)p->maxbytes,
+		   p->cache_maxbytes);
 	fs = fuse2fs_start(ff);
 
 	cfg->flags |= FUSE_IOMAP_CONFIG_UUID;

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH 3/4] fuse4fs: upsert first file mapping to kernel on open
2026-04-29 14:21 ` [PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file IO performance Darrick J. Wong
2026-04-29 15:01 ` [PATCH 1/4] fuse2fs: enable caching of iomaps Darrick J. Wong
2026-04-29 15:01 ` [PATCH 2/4] fuse2fs: constrain iomap mapping cache size Darrick J. Wong
@ 2026-04-29 15:02 ` Darrick J. Wong
2026-04-29 15:02 ` [PATCH 4/4] fuse2fs: enable iomap Darrick J. Wong
3 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 15:02 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Try to speed up the first access to a file by upserting the first file
space mapping to the kernel at open time.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index d07c576d31862a..b10a9a8be00a08 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -4415,6 +4415,9 @@ static void detect_linux_executable_open(int kernel_flags, int *access_check,
 }
 #endif /* __linux__ */
 
+static void fuse4fs_try_upsert_first_mapping(struct fuse4fs *ff, ext2_ino_t ino,
+					     struct fuse_file_info *fp);
+
 static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
 			     ext2_ino_t ino, bool linked,
 			     struct fuse_file_info *fp)
@@ -4509,7 +4512,7 @@ static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
 	/* fuse 3.5: cache dirents from readdir contents */
 	fp->cache_readdir = 1;
 #endif
-
+	fuse4fs_try_upsert_first_mapping(ff, ino, fp);
 out:
 	if (ret)
 		ext2fs_free_mem(&file);
@@ -7277,6 +7280,37 @@ static void op_iomap_end(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
 		fuse_reply_err(req, -ret);
 }
 
+static void fuse4fs_try_upsert_first_mapping(struct fuse4fs *ff, ext2_ino_t ino,
+					     struct fuse_file_info *fp)
+{
+	struct ext2_inode_large inode;
+	struct fuse_file_iomap read = { };
+	uint64_t fsize;
+	errcode_t err;
+
+	if (!ff->iomap_cache || (fp->flags & O_TRUNC))
+		return;
+
+	err = fuse4fs_read_inode(ff->fs, ino, &inode);
+	if (err)
+		return;
+
+	if (!S_ISREG(inode.i_mode))
+		return;
+
+	fsize = EXT2_I_SIZE(&inode);
+	if (!fsize)
+		return;
+
+	/* try to map the first 64k */
+	err = fuse4fs_iomap_begin_read(ff, ino, &inode, 0, min(fsize, 65536),
+				       0, &read);
+	if (err)
+		return;
+
+	fuse_lowlevel_iomap_upsert_mappings(ff->fuse, ino, ino, &read, NULL);
+}
+
 /*
  * Maximal extent format file size.
  * Resulting logical blkno at s_maxbytes must fit in our on-disk

^ permalink raw reply related [flat|nested] 191+ messages in thread
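The gating logic in fuse4fs_try_upsert_first_mapping amounts to computing how many bytes to pre-map at open time, or zero to skip the upsert entirely. A minimal standalone sketch follows; the function name, the boolean parameters, and SKETCH_FIRST_MAP_BYTES are hypothetical stand-ins for the inode and mount state the real code consults.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>

/* The patch tries to map the first 64k of the file. */
#define SKETCH_FIRST_MAP_BYTES	65536ULL

/*
 * Return the number of bytes (starting at offset 0) to map at open time,
 * or 0 if no mapping should be pushed to the kernel.
 */
static uint64_t sketch_first_mapping_len(bool cache_enabled, int open_flags,
					 bool is_regular, uint64_t fsize)
{
	/* A truncating open is about to discard the mappings anyway. */
	if (!cache_enabled || (open_flags & O_TRUNC))
		return 0;

	/* Only regular files with data have something worth mapping. */
	if (!is_regular || fsize == 0)
		return 0;

	return fsize < SKETCH_FIRST_MAP_BYTES ? fsize : SKETCH_FIRST_MAP_BYTES;
}
```

Because the upsert is purely an optimization, every early exit simply returns without reporting an error, matching the void return type in the patch.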
* [PATCH 4/4] fuse2fs: enable iomap
2026-04-29 14:21 ` [PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2026-04-29 15:02 ` [PATCH 3/4] fuse4fs: upsert first file mapping to kernel on open Darrick J. Wong
@ 2026-04-29 15:02 ` Darrick J. Wong
3 siblings, 0 replies; 191+ messages in thread
From: Darrick J. Wong @ 2026-04-29 15:02 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Now that iomap functionality is complete, enable this for users.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fuse4fs/fuse4fs.c | 4 ----
 misc/fuse2fs.c | 4 ----
 2 files changed, 8 deletions(-)

diff --git a/fuse4fs/fuse4fs.c b/fuse4fs/fuse4fs.c
index b10a9a8be00a08..fc72fbe1f00eac 100644
--- a/fuse4fs/fuse4fs.c
+++ b/fuse4fs/fuse4fs.c
@@ -2300,10 +2300,6 @@ static inline bool fuse4fs_wants_iomap(struct fuse4fs *ff)
 static void fuse4fs_iomap_enable(struct fuse_conn_info *conn,
 				 struct fuse4fs *ff)
 {
-	/* Don't let anyone touch iomap until the end of the patchset. */
-	ff->iomap_state = IOMAP_DISABLED;
-	return;
-
 	if (fuse4fs_wants_iomap(ff) &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 232181bc170183..453e3347b2a295 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1856,10 +1856,6 @@ static inline bool fuse2fs_wants_iomap(struct fuse2fs *ff)
 static void fuse2fs_iomap_enable(struct fuse_conn_info *conn,
 				 struct fuse2fs *ff)
 {
-	/* Don't let anyone touch iomap until the end of the patchset. */
-	ff->iomap_state = IOMAP_DISABLED;
-	return;
-
 	if (fuse2fs_wants_iomap(ff) &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;

^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCHSET v7 2/9] iomap: cleanups ahead of adding fuse support
@ 2026-02-23 23:00 Darrick J. Wong
2026-02-23 23:08 ` [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile Darrick J. Wong
0 siblings, 1 reply; 191+ messages in thread
From: Darrick J. Wong @ 2026-02-23 23:00 UTC (permalink / raw)
To: brauner, miklos, djwong; +Cc: bpf, hch, linux-fsdevel, linux-ext4

Hi all,

In preparation for making fuse use the fs/iomap code for regular file
data IO, fix a few bugs in fuse and apply a couple of tweaks to iomap.
These patches can go in immediately.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=iomap-fuse-prep
---
Commits in this patchset:
 * iomap: allow directio callers to supply _COMP_WORK
 * iomap: allow NULL swap info bdev when activating swapfile
---
 include/linux/iomap.h | 3 +++
 fs/iomap/direct-io.c | 5 +++--
 fs/iomap/swapfile.c | 17 +++++++++++++++++
 3 files changed, 23 insertions(+), 2 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
2026-02-23 23:00 [PATCHSET v7 2/9] iomap: cleanups ahead of adding fuse support Darrick J. Wong
@ 2026-02-23 23:08 ` Darrick J. Wong
2026-02-24 14:01 ` Christoph Hellwig
0 siblings, 1 reply; 191+ messages in thread
From: Darrick J. Wong @ 2026-02-23 23:08 UTC (permalink / raw)
To: brauner, miklos, djwong; +Cc: bpf, hch, linux-fsdevel, linux-ext4

From: Darrick J. Wong <djwong@kernel.org>

All current users of the iomap swapfile activation mechanism are block
device filesystems. This means that claim_swapfile will set
swap_info_struct::bdev to inode->i_sb->s_bdev of the swap file.

However, a subsequent patch to fuse will add iomap infrastructure so
that fuse servers can be asked to provide file mappings specifically for
swap files. The fuse server isn't required to set s_bdev (by mounting
as fuseblk) so s_bdev might be null. For this case, we want to set
sis::bdev from the first mapping.

To make this work robustly, we must explicitly check that each mapping
provides a bdev and that there's no way we can succeed at collecting
swapfile pages without a block device.

And just to be clear: fuse-iomap servers will have to respond to an
explicit request for swapfile activation. It's not like fuseblk, where
responding to bmap means swapfiles work even if that wasn't expected.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/swapfile.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index 0db77c449467a7..9d9f4e84437df5 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -112,6 +112,13 @@ static int iomap_swapfile_iter(struct iomap_iter *iter,
 	if (iomap->flags & IOMAP_F_SHARED)
 		return iomap_swapfile_fail(isi, "has shared extents");
 
+	/* Swapfiles must be backed by a block device */
+	if (!iomap->bdev)
+		return iomap_swapfile_fail(isi, "is not on a block device");
+
+	if (iter->pos == 0 && !isi->sis->bdev)
+		isi->sis->bdev = iomap->bdev;
+
 	/* Only one bdev per swap file. */
 	if (iomap->bdev != isi->sis->bdev)
 		return iomap_swapfile_fail(isi, "outside the main device");
@@ -184,6 +191,16 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 		return -EINVAL;
 	}
 
+	/*
+	 * If this swapfile doesn't have a block device, reject this useless
+	 * swapfile to prevent confusion later on.
+	 */
+	if (sis->bdev == NULL) {
+		pr_warn(
+ "swapon: No block device for swap file but usage pages?!\n");
+		return -EINVAL;
+	}
+
 	*pagespan = 1 + isi.highest_ppage - isi.lowest_ppage;
 	sis->max = isi.nr_pages;
 	sis->pages = isi.nr_pages - 1;

^ permalink raw reply related [flat|nested] 191+ messages in thread
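The per-mapping checks this patch adds can be modelled outside the kernel to see how they interact. The sketch below is a userspace approximation only: struct sketch_mapping and sketch_check_swap_bdevs are hypothetical stand-ins for struct iomap and the iomap_swapfile_iter/iomap_swapfile_activate pair, and an opaque pointer stands in for struct block_device.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct iomap; only the fields used here. */
struct sketch_mapping {
	uint64_t pos;		/* file offset where this mapping starts */
	const void *bdev;	/* backing block device, or NULL */
};

/*
 * Walk the mappings in file order.  *sis_bdev may arrive preset (as on
 * a block device filesystem) or NULL (as a fuse-iomap server without
 * s_bdev would leave it).  Returns 0 on success, -1 on rejection.
 */
static int sketch_check_swap_bdevs(const struct sketch_mapping *maps,
				   size_t nr, const void **sis_bdev)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		/* Swapfiles must be backed by a block device. */
		if (!maps[i].bdev)
			return -1;

		/* Learn the device from the first mapping if unset. */
		if (maps[i].pos == 0 && !*sis_bdev)
			*sis_bdev = maps[i].bdev;

		/* Only one bdev per swap file. */
		if (maps[i].bdev != *sis_bdev)
			return -1;
	}

	/* No device was ever learned: reject the swapfile. */
	return *sis_bdev ? 0 : -1;
}
```

Note that a preset device is never overwritten, so a mapping that points at some other device is rejected by the existing "one bdev per swap file" comparison rather than silently adopted.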
* Re: [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
2026-02-23 23:08 ` [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile Darrick J. Wong
@ 2026-02-24 14:01 ` Christoph Hellwig
2026-02-24 19:26 ` Darrick J. Wong
0 siblings, 1 reply; 191+ messages in thread
From: Christoph Hellwig @ 2026-02-24 14:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: brauner, miklos, bpf, hch, linux-fsdevel, linux-ext4

On Mon, Feb 23, 2026 at 03:08:08PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> All current users of the iomap swapfile activation mechanism are block
> device filesystems. This means that claim_swapfile will set
> swap_info_struct::bdev to inode->i_sb->s_bdev of the swap file.
>
> However, a subsequent patch to fuse will add iomap infrastructure so
> that fuse servers can be asked to provide file mappings specifically for
> swap files.

That sounds pretty sketchy. How do you make sure that is safe vs
memory reclaim deadlocks? Does someone really need this feature?

^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
2026-02-24 14:01 ` Christoph Hellwig
@ 2026-02-24 19:26 ` Darrick J. Wong
2026-02-25 14:16 ` Christoph Hellwig
0 siblings, 1 reply; 191+ messages in thread
From: Darrick J. Wong @ 2026-02-24 19:26 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: brauner, miklos, bpf, linux-fsdevel, linux-ext4

On Tue, Feb 24, 2026 at 03:01:18PM +0100, Christoph Hellwig wrote:
> On Mon, Feb 23, 2026 at 03:08:08PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > All current users of the iomap swapfile activation mechanism are block
> > device filesystems. This means that claim_swapfile will set
> > swap_info_struct::bdev to inode->i_sb->s_bdev of the swap file.
> >
> > However, a subsequent patch to fuse will add iomap infrastructure so
> > that fuse servers can be asked to provide file mappings specifically for
> > swap files.
>
> That sounds pretty sketchy. How do you make sure that is safe vs
> memory reclaim deadlocks? Does someone really need this feature?

Err, which part is sketchy, specifically? This patch that adjusts stuff
in fs/iomap/, or the (much later) patch to fuse-iomap?

If it's the second part (activating swapfiles via fuse-iomap) then I'll
state that fuse-iomap swapfiles work mostly the same way that they do in
xfs: iomap_swapfile_activate calls ->iomap_begin, which calls the fuse
server to get iomappings. Each iomap is passed to the mm, which
constructs its own internal mapping of swapfile range to disk addresses.
The swap code then submits bios to read/write swapfile contents
directly, which means that there are no upcalls to the fuse server after
the initial activation.

Obviously this means that the fuse server is granting a longterm layout
lease to the kernel swapfile code, so it should reply to
fuse_iomap_begin with an error code if it doesn't want to do that.

I don't know that anyone really needs this feature, but pre-iomap
fuse2fs supports swapfiles, as does any other fuse server that
implements bmap.

--D

^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
2026-02-24 19:26 ` Darrick J. Wong
@ 2026-02-25 14:16 ` Christoph Hellwig
2026-02-25 17:03 ` Darrick J. Wong
0 siblings, 1 reply; 191+ messages in thread
From: Christoph Hellwig @ 2026-02-25 14:16 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, brauner, miklos, bpf, linux-fsdevel, linux-ext4

On Tue, Feb 24, 2026 at 11:26:53AM -0800, Darrick J. Wong wrote:
> > That sounds pretty sketchy. How do you make sure that is safe vs
> > memory reclaim deadlocks? Does someone really need this feature?
>
> Err, which part is sketchy, specifically? This patch that adjusts stuff
> in fs/iomap/, or the (much later) patch to fuse-iomap?

The concept of swapping to fuse.

> Obviously this means that the fuse server is granting a longterm layout
> lease to the kernel swapfile code, so it should reply to
> fuse_iomap_begin with an error code if it doesn't want to do that.
>
> I don't know that anyone really needs this feature, but pre-iomap
> fuse2fs supports swapfiles, as does any other fuse server that
> implements bmap.

Eww, I didn't know people were already trying to support swap to fuse.

^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
2026-02-25 14:16 ` Christoph Hellwig
@ 2026-02-25 17:03 ` Darrick J. Wong
2026-02-25 17:49 ` Christoph Hellwig
0 siblings, 1 reply; 191+ messages in thread
From: Darrick J. Wong @ 2026-02-25 17:03 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: brauner, miklos, bpf, linux-fsdevel, linux-ext4

On Wed, Feb 25, 2026 at 03:16:39PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 24, 2026 at 11:26:53AM -0800, Darrick J. Wong wrote:
> > > That sounds pretty sketchy. How do you make sure that is safe vs
> > > memory reclaim deadlocks? Does someone really need this feature?
> >
> > Err, which part is sketchy, specifically? This patch that adjusts stuff
> > in fs/iomap/, or the (much later) patch to fuse-iomap?
>
> The concept of swapping to fuse.
>
> > Obviously this means that the fuse server is granting a longterm layout
> > lease to the kernel swapfile code, so it should reply to
> > fuse_iomap_begin with an error code if it doesn't want to do that.
> >
> > I don't know that anyone really needs this feature, but pre-iomap
> > fuse2fs supports swapfiles, as does any other fuse server that
> > implements bmap.
>
> Eww, I didn't know people were already trying to support swap to fuse.

It was merged in the kernel via commit b2d2272fae1e1d ("[PATCH] fuse:
add bmap support"), which was 2.6.20. So people have been using it for
~20 years now. At least it's the mm-managed bio swap path and we're not
actually upcalling the fuse server to do swapins/swapouts.

--D

^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
2026-02-25 17:03 ` Darrick J. Wong
@ 2026-02-25 17:49 ` Christoph Hellwig
0 siblings, 0 replies; 191+ messages in thread
From: Christoph Hellwig @ 2026-02-25 17:49 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, brauner, miklos, bpf, linux-fsdevel, linux-ext4

On Wed, Feb 25, 2026 at 09:03:22AM -0800, Darrick J. Wong wrote:
> > > I don't know that anyone really needs this feature, but pre-iomap
> > > fuse2fs supports swapfiles, as does any other fuse server that
> > > implements bmap.
> >
> > Eww, I didn't know people were already trying to support swap to fuse.
>
> It was merged in the kernel via commit b2d2272fae1e1d ("[PATCH] fuse:
> add bmap support"), which was 2.6.20. So people have been using it for
> ~20 years now. At least it's the mm-managed bio swap path and we're not
> actually upcalling the fuse server to do swapins/swapouts.

Assuming it actually gets used, but yes, it's been there forever. :(

^ permalink raw reply [flat|nested] 191+ messages in thread
Thread overview: 191+ messages
2026-04-29 14:12 [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers Darrick J. Wong
2026-04-29 14:16 ` [PATCHSET v8 1/8] fuse: general bug fixes Darrick J. Wong
2026-04-29 14:21 ` [PATCH 1/4] fuse: flush pending FUSE_RELEASE requests before sending FUSE_DESTROY Darrick J. Wong
2026-04-29 14:22 ` [PATCH 2/4] fuse: implement file attributes mask for statx Darrick J. Wong
2026-04-29 14:22 ` [PATCH 3/4] fuse: update file mode when updating acls Darrick J. Wong
2026-04-30 13:48 ` Joanne Koong
2026-04-30 20:57 ` Darrick J. Wong
2026-05-01 9:53 ` Joanne Koong
2026-05-01 16:15 ` Darrick J. Wong
2026-04-29 14:22 ` [PATCH 4/4] fuse: propagate default and file acls on creation Darrick J. Wong
2026-05-01 11:11 ` Joanne Koong
2026-05-01 16:57 ` Darrick J. Wong
2026-04-29 14:16 ` [PATCHSET v8 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
2026-04-29 14:22 ` [PATCH 1/2] iomap: allow directio callers to supply _COMP_WORK Darrick J. Wong
2026-04-29 14:23 ` [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile Darrick J. Wong
2026-05-08 9:06 ` Christoph Hellwig
2026-05-08 23:41 ` Darrick J. Wong
2026-04-29 14:17 ` [PATCHSET v8 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
2026-04-29 14:23 ` [PATCH 1/2] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
2026-04-29 14:23 ` [PATCH 2/2] fuse_trace: " Darrick J. Wong
2026-04-29 14:17 ` [PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2026-04-29 14:23 ` [PATCH 01/33] fuse: implement the basic iomap mechanisms Darrick J. Wong
2026-04-29 14:24 ` [PATCH 02/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:24 ` [PATCH 03/33] fuse: make debugging configurable at runtime Darrick J. Wong
2026-04-29 14:24 ` [PATCH 04/33] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices Darrick J. Wong
2026-04-29 14:24 ` [PATCH 05/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:25 ` [PATCH 06/33] fuse: enable SYNCFS and ensure we flush everything before sending DESTROY Darrick J. Wong
2026-04-29 14:25 ` [PATCH 07/33] fuse: clean up per-file type inode initialization Darrick J. Wong
2026-04-29 14:25 ` [PATCH 08/33] fuse: create a per-inode flag for setting exclusive mode Darrick J. Wong
2026-04-29 14:26 ` [PATCH 09/33] fuse: create a per-inode flag for toggling iomap Darrick J. Wong
2026-04-29 14:26 ` [PATCH 10/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:26 ` [PATCH 11/33] fuse: isolate the other regular file IO paths from iomap Darrick J. Wong
2026-04-29 14:26 ` [PATCH 12/33] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
2026-04-29 14:27 ` [PATCH 13/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:27 ` [PATCH 14/33] fuse: implement direct IO with iomap Darrick J. Wong
2026-04-29 14:27 ` [PATCH 15/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:27 ` [PATCH 16/33] fuse: implement buffered " Darrick J. Wong
2026-04-29 14:28 ` [PATCH 17/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:28 ` [PATCH 18/33] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
2026-04-29 14:28 ` [PATCH 19/33] fuse: implement large folios for iomap pagecache files Darrick J. Wong
2026-04-29 14:28 ` [PATCH 20/33] fuse: advertise support for iomap Darrick J. Wong
2026-04-29 14:29 ` [PATCH 21/33] fuse: query filesystem geometry when using iomap Darrick J. Wong
2026-04-29 14:29 ` [PATCH 22/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:29 ` [PATCH 23/33] fuse: implement fadvise for iomap files Darrick J. Wong
2026-04-29 14:29 ` [PATCH 24/33] fuse: invalidate ranges of block devices being used for iomap Darrick J. Wong
2026-04-29 14:30 ` [PATCH 25/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:30 ` [PATCH 26/33] fuse: implement inline data file IO via iomap Darrick J. Wong
2026-04-29 14:30 ` [PATCH 27/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:31 ` [PATCH 28/33] fuse: allow more statx fields Darrick J. Wong
2026-04-29 14:31 ` [PATCH 29/33] fuse: support atomic writes with iomap Darrick J. Wong
2026-04-29 14:31 ` [PATCH 30/33] fuse_trace: " Darrick J. Wong
2026-04-29 14:31 ` [PATCH 31/33] fuse: disable direct fs reclaim for any fuse server that uses iomap Darrick J. Wong
2026-04-29 14:32 ` [PATCH 32/33] fuse: enable swapfile activation on iomap Darrick J. Wong
2026-04-29 14:32 ` [PATCH 33/33] fuse: implement freeze and shutdowns for iomap filesystems Darrick J. Wong
2026-04-29 14:17 ` [PATCHSET v8 5/8] fuse: allow servers to specify root node id Darrick J. Wong
2026-04-29 14:32 ` [PATCH 1/3] fuse: make the root nodeid dynamic Darrick J. Wong
2026-04-29 14:32 ` [PATCH 2/3] fuse_trace: " Darrick J. Wong
2026-04-29 14:33 ` [PATCH 3/3] fuse: allow setting of root nodeid Darrick J. Wong
2026-04-29 14:17 ` [PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2026-04-29 14:33 ` [PATCH 1/9] fuse: enable caching of timestamps Darrick J. Wong
2026-04-29 14:33 ` [PATCH 2/9] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
2026-04-29 14:33 ` [PATCH 3/9] fuse: allow local filesystems to set some VFS iflags Darrick J. Wong
2026-04-29 14:34 ` [PATCH 4/9] fuse_trace: " Darrick J. Wong
2026-04-29 14:34 ` [PATCH 5/9] fuse: cache atime when in iomap mode Darrick J. Wong
2026-04-29 14:34 ` [PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
2026-04-29 14:34 ` [PATCH 7/9] fuse_trace: " Darrick J. Wong
2026-04-29 14:35 ` [PATCH 8/9] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
2026-04-29 14:35 ` [PATCH 9/9] fuse: always cache ACLs when using iomap Darrick J. Wong
2026-04-29 14:18 ` [PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2026-04-29 14:35 ` [PATCH 01/12] fuse: cache iomaps Darrick J. Wong
2026-04-29 14:35 ` [PATCH 02/12] fuse_trace: " Darrick J. Wong
2026-04-29 14:36 ` [PATCH 03/12] fuse: use the iomap cache for iomap_begin Darrick J. Wong
2026-04-29 14:36 ` [PATCH 04/12] fuse_trace: " Darrick J. Wong
2026-04-29 14:36 ` [PATCH 05/12] fuse: invalidate iomap cache after file updates Darrick J. Wong
2026-04-29 14:36 ` [PATCH 06/12] fuse_trace: " Darrick J. Wong
2026-04-29 14:37 ` [PATCH 07/12] fuse: enable iomap cache management Darrick J. Wong
2026-04-29 14:37 ` [PATCH 08/12] fuse_trace: " Darrick J. Wong
2026-04-29 14:37 ` [PATCH 09/12] fuse: overlay iomap inode info in struct fuse_inode Darrick J. Wong
2026-04-29 14:38 ` [PATCH 10/12] fuse: constrain iomap mapping cache size Darrick J. Wong
2026-04-29 14:38 ` [PATCH 11/12] fuse_trace: " Darrick J. Wong
2026-04-29 14:38 ` [PATCH 12/12] fuse: enable iomap Darrick J. Wong
2026-04-29 14:18 ` [PATCHSET v8 8/8] fuse: run fuse-iomap servers as a contained service Darrick J. Wong
2026-04-29 14:38 ` [PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap usage Darrick J. Wong
2026-04-29 14:39 ` [PATCH 2/2] fuse: set iomap backing device block size Darrick J. Wong
2026-04-29 14:18 ` [PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2026-04-29 14:39 ` [PATCH 01/25] libfuse: bump kernel and library ABI versions Darrick J. Wong
2026-04-29 14:39 ` [PATCH 02/25] libfuse: wait in do_destroy until all open files are closed Darrick J. Wong
2026-04-29 14:39 ` [PATCH 03/25] libfuse: add kernel gates for FUSE_IOMAP Darrick J. Wong
2026-04-29 14:40 ` [PATCH 04/25] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
2026-04-29 14:40 ` [PATCH 05/25] libfuse: add upper level iomap commands Darrick J. Wong
2026-04-29 14:40 ` [PATCH 06/25] libfuse: add a lowlevel notification to add a new device to iomap Darrick J. Wong
2026-04-29 14:40 ` [PATCH 07/25] libfuse: add upper-level iomap add device function Darrick J. Wong
2026-04-29 14:41 ` [PATCH 08/25] libfuse: add iomap ioend low level handler Darrick J. Wong
2026-04-29 14:41 ` [PATCH 09/25] libfuse: add upper level iomap ioend commands Darrick J. Wong
2026-04-29 14:41 ` [PATCH 10/25] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong
2026-04-29 14:41 ` [PATCH 11/25] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong
2026-04-29 14:42 ` [PATCH 12/25] libfuse: support enabling exclusive mode for files Darrick J. Wong
2026-04-29 14:42 ` [PATCH 13/25] libfuse: support direct I/O through iomap Darrick J. Wong
2026-04-29 14:42 ` [PATCH 14/25] libfuse: don't allow hardlinking of iomap files in the upper level fuse library Darrick J. Wong
2026-04-29 14:42 ` [PATCH 15/25] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
2026-04-29 14:43 ` [PATCH 16/25] libfuse: add lower level iomap_config implementation Darrick J. Wong
2026-04-29 14:43 ` [PATCH 17/25] libfuse: add upper " Darrick J. Wong
2026-04-29 14:43 ` [PATCH 18/25] libfuse: add low level code to invalidate iomap block device ranges Darrick J. Wong
2026-04-29 14:44 ` [PATCH 19/25] libfuse: add upper-level API to invalidate parts of an iomap block device Darrick J. Wong
2026-04-29 14:44 ` [PATCH 20/25] libfuse: add atomic write support Darrick J. Wong
2026-04-29 14:44 ` [PATCH 21/25] libfuse: allow disabling of fs memory reclaim and write throttling Darrick J. Wong
2026-04-29 14:44 ` [PATCH 22/25] libfuse: create a helper to transform an open regular file into an open loopdev Darrick J. Wong
2026-04-29 14:45 ` [PATCH 23/25] libfuse: add swapfile support for iomap files Darrick J. Wong
2026-04-29 14:45 ` [PATCH 24/25] libfuse: add lower-level filesystem freeze, thaw, and shutdown requests Darrick J. Wong
2026-04-29 14:45 ` [PATCH 25/25] libfuse: add upper-level filesystem freeze, thaw, and shutdown events Darrick J. Wong
2026-04-29 14:19 ` [PATCHSET v8 2/6] libfuse: allow servers to specify root node id Darrick J. Wong
2026-04-29 14:45 ` [PATCH 1/1] libfuse: allow root_nodeid mount option Darrick J. Wong
2026-04-29 14:19 ` [PATCHSET v8 3/6] libfuse: implement syncfs Darrick J. Wong
2026-04-29 14:46 ` [PATCH 1/2] libfuse: add strictatime/lazytime mount options Darrick J. Wong
2026-04-29 14:46 ` [PATCH 2/2] libfuse: set sync, immutable, and append when loading files Darrick J. Wong
2026-04-29 14:19 ` [PATCHSET v8 4/6] libfuse: add some service helper commands for iomap Darrick J. Wong
2026-04-29 14:46 ` [PATCH 1/3] mount_service: delegate iomap privilege from mount.service to fuse services Darrick J. Wong
2026-04-29 14:46 ` [PATCH 2/3] libfuse: enable setting iomap block device block size Darrick J. Wong
2026-04-29 14:47 ` [PATCH 3/3] mount_service: create loop devices for regular files Darrick J. Wong
2026-04-29 14:19 ` [PATCHSET v8 5/6] fuse: add sample iomap fuse servers Darrick J. Wong
2026-04-29 14:47 ` [PATCH 1/7] example/iomap_ll: create a simple iomap server Darrick J. Wong
2026-04-29 14:47 ` [PATCH 2/7] example/iomap_ll: track block state Darrick J. Wong
2026-04-29 14:47 ` [PATCH 3/7] example/iomap_ll: implement atomic writes Darrick J. Wong
2026-04-29 14:48 ` [PATCH 4/7] example/iomap_inline_ll: create a simple server to test inlinedata Darrick J. Wong
2026-04-29 14:48 ` [PATCH 5/7] example/iomap_ow_ll: create a simple iomap out of place write server Darrick J. Wong
2026-04-29 14:48 ` [PATCH 6/7] example/iomap_ow_ll: implement atomic writes Darrick J. Wong
2026-04-29 14:48 ` [PATCH 7/7] example/iomap_service_ll: create a sample systemd service fuse server Darrick J. Wong
2026-04-29 14:20 ` [PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2026-04-29 14:49 ` [PATCH 1/9] libfuse: enable iomap cache management for lowlevel fuse Darrick J. Wong
2026-04-29 14:49 ` [PATCH 2/9] libfuse: add upper-level iomap cache management Darrick J. Wong
2026-04-29 14:49 ` [PATCH 3/9] libfuse: allow constraining of iomap mapping cache size Darrick J. Wong
2026-04-29 14:50 ` [PATCH 4/9] libfuse: add upper-level iomap mapping cache constraint code Darrick J. Wong
2026-04-29 14:50 ` [PATCH 5/9] libfuse: enable iomap Darrick J. Wong
2026-04-29 14:50 ` [PATCH 6/9] example/iomap_ll: cache mappings for later Darrick J. Wong
2026-04-29 14:50 ` [PATCH 7/9] example/iomap_inline_ll: cache iomappings in the kernel Darrick J. Wong
2026-04-29 14:51 ` [PATCH 8/9] example/iomap_ow_ll: " Darrick J. Wong
2026-04-29 14:51 ` [PATCH 9/9] example/iomap_service_ll: " Darrick J. Wong
2026-04-29 14:20 ` [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
2026-04-29 14:51 ` [PATCH 1/5] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
2026-04-29 14:51 ` [PATCH 2/5] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
2026-04-29 14:52 ` [PATCH 3/5] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
2026-04-29 14:52 ` [PATCH 4/5] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
2026-04-29 14:52 ` [PATCH 5/5] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
2026-04-29 14:20 ` [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2026-04-29 14:52 ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
2026-04-29 14:53 ` [PATCH 02/19] fuse2fs: add iomap= mount option Darrick J. Wong
2026-04-29 14:53 ` [PATCH 03/19] fuse2fs: implement iomap configuration Darrick J. Wong
2026-04-29 14:53 ` [PATCH 04/19] fuse2fs: register block devices for use with iomap Darrick J. Wong
2026-04-29 14:53 ` [PATCH 05/19] fuse2fs: implement directio file reads Darrick J. Wong
2026-04-29 14:54 ` [PATCH 06/19] fuse2fs: add extent dump function for debugging Darrick J. Wong
2026-04-29 14:54 ` [PATCH 07/19] fuse2fs: implement direct write support Darrick J. Wong
2026-04-29 14:54 ` [PATCH 08/19] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
2026-04-29 14:54 ` [PATCH 09/19] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
2026-04-29 14:55 ` [PATCH 10/19] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
2026-04-29 14:55 ` [PATCH 11/19] fuse2fs: try to create loop device when ext4 device is a regular file Darrick J. Wong
2026-04-29 14:55 ` [PATCH 12/19] fuse2fs: enable file IO to inline data files Darrick J. Wong
2026-04-29 14:56 ` [PATCH 13/19] fuse2fs: set iomap-related inode flags Darrick J. Wong
2026-04-29 14:56 ` [PATCH 14/19] fuse2fs: configure block device block size Darrick J. Wong
2026-04-29 14:56 ` [PATCH 15/19] fuse4fs: separate invalidation Darrick J. Wong
2026-04-29 14:56 ` [PATCH 16/19] fuse2fs: implement statx Darrick J. Wong
2026-04-29 14:57 ` [PATCH 17/19] fuse2fs: enable atomic writes Darrick J. Wong
2026-04-29 14:57 ` [PATCH 18/19] fuse4fs: disable fs reclaim and write throttling Darrick J. Wong
2026-04-29 14:57 ` [PATCH 19/19] fuse2fs: implement freeze and shutdown requests Darrick J. Wong
2026-04-29 14:20 ` [PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services Darrick J. Wong
2026-04-29 14:57 ` [PATCH 1/3] fuse4fs: configure iomap when running as a service Darrick J. Wong
2026-04-29 14:58 ` [PATCH 2/3] fuse4fs: set iomap backing device blocksize Darrick J. Wong
2026-04-29 14:58 ` [PATCH 3/3] fuse4fs: ask for loop devices when opening via fuservicemount Darrick J. Wong
2026-04-29 14:21 ` [PATCHSET v8 4/6] fuse4fs: specify the root node id Darrick J. Wong
2026-04-29 14:58 ` [PATCH 1/1] fuse4fs: don't use inode number translation when possible Darrick J. Wong
2026-04-29 14:21 ` [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2026-04-29 14:58 ` [PATCH 01/10] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
2026-04-29 14:59 ` [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled Darrick J. Wong
2026-04-29 14:59 ` [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
2026-04-29 14:59 ` [PATCH 04/10] fuse2fs: better debugging for file mode updates Darrick J. Wong
2026-04-29 14:59 ` [PATCH 05/10] fuse2fs: debug timestamp updates Darrick J. Wong
2026-04-29 15:00 ` [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
2026-04-29 15:00 ` [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
2026-04-29 15:00 ` [PATCH 08/10] fuse2fs: enable syncfs Darrick J. Wong
2026-04-29 15:00 ` [PATCH 09/10] fuse2fs: set sync, immutable, and append at file load time Darrick J. Wong
2026-04-29 15:01 ` [PATCH 10/10] fuse4fs: increase attribute timeout in iomap mode Darrick J. Wong
2026-04-29 14:21 ` [PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file IO performance Darrick J. Wong
2026-04-29 15:01 ` [PATCH 1/4] fuse2fs: enable caching of iomaps Darrick J. Wong
2026-04-29 15:01 ` [PATCH 2/4] fuse2fs: constrain iomap mapping cache size Darrick J. Wong
2026-04-29 15:02 ` [PATCH 3/4] fuse4fs: upsert first file mapping to kernel on open Darrick J. Wong
2026-04-29 15:02 ` [PATCH 4/4] fuse2fs: enable iomap Darrick J. Wong
-- strict thread matches above, loose matches on Subject: below --
2026-02-23 23:00 [PATCHSET v7 2/9] iomap: cleanups ahead of adding fuse support Darrick J. Wong
2026-02-23 23:08 ` [PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile Darrick J. Wong
2026-02-24 14:01 ` Christoph Hellwig
2026-02-24 19:26 ` Darrick J. Wong
2026-02-25 14:16 ` Christoph Hellwig
2026-02-25 17:03 ` Darrick J. Wong
2026-02-25 17:49 ` Christoph Hellwig