* [PATCHSET RFC v5 1/8] fuse: general bug fixes
[not found] <20250916000759.GA8080@frogsfrogsfrogs>
@ 2025-09-16 0:18 ` Darrick J. Wong
2025-09-16 0:24 ` [PATCH 1/8] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
` (7 more replies)
2025-09-16 0:18 ` [PATCHSET RFC v5 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
` (6 subsequent siblings)
7 siblings, 8 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:18 UTC (permalink / raw)
To: djwong, miklos
Cc: stable, joannelkoong, bernd, linux-xfs, John, linux-fsdevel, neal
Hi all,
Here's a collection of fixes for what I *think* are bugs in fuse, along
with some scattered improvements.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-fixes
---
Commits in this patchset:
* fuse: fix livelock in synchronous file put from fuseblk workers
* fuse: flush pending fuse events before aborting the connection
* fuse: capture the unique id of fuse commands being sent
* fuse: signal that a fuse filesystem should exhibit local fs behaviors
* fuse: implement file attributes mask for statx
* fuse: update file mode when updating acls
* fuse: propagate default and file acls on creation
* fuse: enable FUSE_SYNCFS for all fuseblk servers
---
fs/fuse/fuse_i.h | 55 +++++++++++++++++++++++++++
fs/fuse/acl.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dev.c | 60 +++++++++++++++++++++++++++--
fs/fuse/dev_uring.c | 4 +-
fs/fuse/dir.c | 96 +++++++++++++++++++++++++++++++++++------------
fs/fuse/file.c | 8 +++-
fs/fuse/inode.c | 17 ++++++++
fs/fuse/virtio_fs.c | 3 -
8 files changed, 314 insertions(+), 34 deletions(-)
* [PATCHSET RFC v5 2/8] iomap: cleanups ahead of adding fuse support
[not found] <20250916000759.GA8080@frogsfrogsfrogs>
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
@ 2025-09-16 0:18 ` Darrick J. Wong
2025-09-16 0:26 ` [PATCH 1/2] iomap: trace iomap_zero_iter zeroing activities Darrick J. Wong
2025-09-16 0:26 ` [PATCH 2/2] iomap: error out on file IO when there is no inline_data buffer Darrick J. Wong
2025-09-16 0:18 ` [PATCHSET RFC v5 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
` (5 subsequent siblings)
7 siblings, 2 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:18 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
Hi all,
In preparation for making fuse use the fs/iomap code for regular file
data IO, fix a few bugs in fuse and apply a couple of tweaks to iomap.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=iomap-fuse-prep
---
Commits in this patchset:
* iomap: trace iomap_zero_iter zeroing activities
* iomap: error out on file IO when there is no inline_data buffer
---
fs/iomap/trace.h | 1 +
fs/iomap/buffered-io.c | 18 +++++++++++++-----
fs/iomap/direct-io.c | 3 +++
3 files changed, 17 insertions(+), 5 deletions(-)
* [PATCHSET RFC v5 3/8] fuse: cleanups ahead of adding fuse support
[not found] <20250916000759.GA8080@frogsfrogsfrogs>
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
2025-09-16 0:18 ` [PATCHSET RFC v5 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
@ 2025-09-16 0:18 ` Darrick J. Wong
2025-09-16 0:26 ` [PATCH 1/5] fuse: allow synchronous FUSE_INIT Darrick J. Wong
` (4 more replies)
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (4 subsequent siblings)
7 siblings, 5 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:18 UTC (permalink / raw)
To: djwong, miklos
Cc: mszeredi, amir73il, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
Hi all,
In preparation for making fuse use the fs/iomap code for regular file
data IO, fix a few bugs in fuse and apply a couple of tweaks to iomap.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-prep
---
Commits in this patchset:
* fuse: allow synchronous FUSE_INIT
* fuse: move the backing file idr and code into a new source file
* fuse: move the passthrough-specific code back to passthrough.c
* fuse_trace: move the passthrough-specific code back to passthrough.c
* fuse: move CREATE_TRACE_POINTS to a separate file
---
fs/fuse/fuse_dev_i.h | 13 ++-
fs/fuse/fuse_i.h | 73 ++++++++++----
fs/fuse/fuse_trace.h | 35 +++++++
include/uapi/linux/fuse.h | 9 ++
fs/fuse/Kconfig | 4 +
fs/fuse/Makefile | 4 +
fs/fuse/backing.c | 231 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/cuse.c | 3 -
fs/fuse/dev.c | 79 +++++++++++----
fs/fuse/dev_uring.c | 4 -
fs/fuse/inode.c | 54 ++++++++---
fs/fuse/passthrough.c | 198 +++++++--------------------------------
fs/fuse/trace.c | 13 +++
13 files changed, 494 insertions(+), 226 deletions(-)
create mode 100644 fs/fuse/backing.c
create mode 100644 fs/fuse/trace.c
* [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance
[not found] <20250916000759.GA8080@frogsfrogsfrogs>
` (2 preceding siblings ...)
2025-09-16 0:18 ` [PATCHSET RFC v5 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
@ 2025-09-16 0:19 ` Darrick J. Wong
2025-09-16 0:28 ` [PATCH 01/28] fuse: implement the basic iomap mechanisms Darrick J. Wong
` (27 more replies)
2025-09-16 0:19 ` [PATCHSET RFC v5 5/8] fuse: allow servers to specify root node id Darrick J. Wong
` (3 subsequent siblings)
7 siblings, 28 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:19 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
Hi all,
This series connects fuse (the userspace filesystem layer) to fs-iomap
to get fuse servers out of the business of handling file I/O themselves.
By keeping the IO path mostly within the kernel, we can dramatically
improve the speed of disk-based filesystems. This enables us to move
all the filesystem metadata parsing code out of the kernel and into
userspace, which means that we can containerize the fuse servers for
security without losing a lot of performance.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
---
Commits in this patchset:
* fuse: implement the basic iomap mechanisms
* fuse_trace: implement the basic iomap mechanisms
* fuse: make debugging configurable at runtime
* fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
* fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
* fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
* fuse: create a per-inode flag for toggling iomap
* fuse_trace: create a per-inode flag for toggling iomap
* fuse: isolate the other regular file IO paths from iomap
* fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
* fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
* fuse: implement direct IO with iomap
* fuse_trace: implement direct IO with iomap
* fuse: implement buffered IO with iomap
* fuse_trace: implement buffered IO with iomap
* fuse: implement large folios for iomap pagecache files
* fuse: use an unrestricted backing device with iomap pagecache io
* fuse: advertise support for iomap
* fuse: query filesystem geometry when using iomap
* fuse_trace: query filesystem geometry when using iomap
* fuse: implement fadvise for iomap files
* fuse: invalidate ranges of block devices being used for iomap
* fuse_trace: invalidate ranges of block devices being used for iomap
* fuse: implement inline data file IO via iomap
* fuse_trace: implement inline data file IO via iomap
* fuse: allow more statx fields
* fuse: support atomic writes with iomap
* fuse: disable direct reclaim for any fuse server that uses iomap
---
fs/fuse/fuse_i.h | 172 ++++
fs/fuse/fuse_trace.h | 936 +++++++++++++++++++
fs/fuse/iomap_priv.h | 52 +
include/uapi/linux/fuse.h | 201 ++++
fs/fuse/Kconfig | 48 +
fs/fuse/Makefile | 1
fs/fuse/backing.c | 12
fs/fuse/dev.c | 30 +
fs/fuse/dir.c | 120 ++
fs/fuse/file.c | 133 ++-
fs/fuse/file_iomap.c | 2165 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 66 +
fs/fuse/iomode.c | 2
fs/fuse/trace.c | 2
14 files changed, 3892 insertions(+), 48 deletions(-)
create mode 100644 fs/fuse/iomap_priv.h
create mode 100644 fs/fuse/file_iomap.c
* [PATCHSET RFC v5 5/8] fuse: allow servers to specify root node id
[not found] <20250916000759.GA8080@frogsfrogsfrogs>
` (3 preceding siblings ...)
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-09-16 0:19 ` Darrick J. Wong
2025-09-16 0:35 ` [PATCH 1/3] fuse: make the root nodeid dynamic Darrick J. Wong
` (2 more replies)
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (2 subsequent siblings)
7 siblings, 3 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:19 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
Hi all,
This series grants fuse servers full control over the entire node id
address space by allowing them to specify the nodeid of the root
directory. With this new feature, fuse4fs will not have to translate
node ids.
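To make the translation concrete, here is a minimal userspace sketch. It
assumes the root nodeid is fixed at FUSE_ROOT_ID (1) while an ext2/3/4
image keeps its root directory at inode 2. The constant and function
names are hypothetical and are not taken from fuse4fs or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants: FUSE_ROOT_ID is 1, the ext2/3/4 root inode is 2. */
#define SKETCH_FUSE_ROOT_ID	1ULL
#define SKETCH_EXT_ROOT_INO	2ULL

/* Fixed root nodeid: the server has to rewrite the root on every request. */
static uint64_t sketch_nodeid_to_ino_fixed_root(uint64_t nodeid)
{
	assert(nodeid != 0);	/* nodeid 0 is never a valid node */
	return nodeid == SKETCH_FUSE_ROOT_ID ? SKETCH_EXT_ROOT_INO : nodeid;
}

/* Server-chosen root nodeid: node ids are inode numbers, no rewriting. */
static uint64_t sketch_nodeid_to_ino_dynamic_root(uint64_t nodeid)
{
	assert(nodeid != 0);
	return nodeid;
}
```

With a server-chosen root nodeid the second helper is an identity
function, which is the point of the series: the translation layer can
go away.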
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-root-nodeid
---
Commits in this patchset:
* fuse: make the root nodeid dynamic
* fuse_trace: make the root nodeid dynamic
* fuse: allow setting of root nodeid
---
fs/fuse/fuse_i.h | 9 +++++++--
fs/fuse/fuse_trace.h | 6 ++++--
fs/fuse/dir.c | 10 ++++++----
fs/fuse/inode.c | 22 ++++++++++++++++++----
fs/fuse/readdir.c | 10 +++++-----
5 files changed, 40 insertions(+), 17 deletions(-)
* [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled
[not found] <20250916000759.GA8080@frogsfrogsfrogs>
` (4 preceding siblings ...)
2025-09-16 0:19 ` [PATCHSET RFC v5 5/8] fuse: allow servers to specify root node id Darrick J. Wong
@ 2025-09-16 0:19 ` Darrick J. Wong
2025-09-16 0:36 ` [PATCH 1/9] fuse: enable caching of timestamps Darrick J. Wong
` (8 more replies)
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-09-16 0:20 ` [PATCHSET RFC v5 8/8] fuse: run fuse servers as a contained service Darrick J. Wong
7 siblings, 9 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:19 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
Hi all,
When iomap is enabled for a fuse file, we try to keep as much of the
file IO path in the kernel as we possibly can. That means no calling
out to the fuse server in the IO path when we can avoid it. However,
the existing FUSE architecture defers all file attributes to the fuse
server -- [cm]time updates, ACL metadata management, set[ug]id removal,
and permissions checking thereof, etc.
We'd really rather do all these attribute updates in the kernel, and
only push them to the fuse server when it's actually necessary (e.g.
fsync). Furthermore, the POSIX ACL code has the weird behavior that if
the access ACL can be represented entirely by i_mode bits, it will
change the mode and delete the ACL, which fuse servers generally don't
seem to implement.
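The ACL behavior described above can be sketched in userspace as
follows. This is loosely modeled on what the kernel's
posix_acl_equiv_mode() does, but the types and names here are
hypothetical and simplified: an access ACL that contains only the owner,
owning-group, and other entries collapses into mode bits, and anything
else (named users or groups, or a mask) does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified ACL entry; not the kernel's struct. */
enum sketch_tag {
	SKETCH_USER_OBJ, SKETCH_USER, SKETCH_GROUP_OBJ,
	SKETCH_GROUP, SKETCH_MASK, SKETCH_OTHER,
};

struct sketch_ace {
	enum sketch_tag	tag;
	unsigned int	perm;	/* rwx bits, 0..7 */
};

/*
 * Compute the mode bits implied by an access ACL.  Returns 1 if the ACL
 * is fully represented by *mode (so the ACL itself could be dropped),
 * or 0 if it carries information that mode bits cannot express.
 */
static int sketch_acl_fits_in_mode(const struct sketch_ace *acl, size_t n,
				   unsigned int *mode)
{
	unsigned int m = 0;
	int extended = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(acl[i].perm <= 7);
		switch (acl[i].tag) {
		case SKETCH_USER_OBJ:
			m |= acl[i].perm << 6;
			break;
		case SKETCH_GROUP_OBJ:
			m |= acl[i].perm << 3;
			break;
		case SKETCH_OTHER:
			m |= acl[i].perm;
			break;
		case SKETCH_MASK:
			/* The mask replaces the group bits in the mode. */
			m = (m & ~070u) | (acl[i].perm << 3);
			extended = 1;
			break;
		case SKETCH_USER:
		case SKETCH_GROUP:
			extended = 1;
			break;
		}
	}
	*mode = m;
	return !extended;
}
```

A server that stores whatever ACL it is handed, without this collapsing
step, ends up with a different on-disk state than a local filesystem
would after the same setfacl call.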
IOWs, we want consistent and correct (as defined by fstests) behavior
of file attributes in iomap mode. Let's make the kernel manage all that
and push the results to userspace as needed. This improves performance
even further, since it's sort of like writeback_cache mode but more
aggressive.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs
---
Commits in this patchset:
* fuse: enable caching of timestamps
* fuse: force a ctime update after a fileattr_set call when in iomap mode
* fuse: allow local filesystems to set some VFS iflags
* fuse_trace: allow local filesystems to set some VFS iflags
* fuse: cache atime when in iomap mode
* fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
* fuse_trace: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
* fuse: update ctime when updating acls on an iomap inode
* fuse: always cache ACLs when using iomap
---
fs/fuse/fuse_i.h | 1 +
fs/fuse/fuse_trace.h | 87 +++++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/fuse.h | 8 ++++
fs/fuse/acl.c | 29 +++++++++++++--
fs/fuse/dir.c | 38 ++++++++++++++++----
fs/fuse/file.c | 18 ++++++---
fs/fuse/file_iomap.c | 6 +++
fs/fuse/inode.c | 27 +++++++++++---
fs/fuse/ioctl.c | 70 ++++++++++++++++++++++++++++++++++++
fs/fuse/readdir.c | 3 +-
10 files changed, 263 insertions(+), 24 deletions(-)
* [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance
[not found] <20250916000759.GA8080@frogsfrogsfrogs>
` (5 preceding siblings ...)
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-09-16 0:19 ` Darrick J. Wong
2025-09-16 0:38 ` [PATCH 01/10] fuse: cache iomaps Darrick J. Wong
` (9 more replies)
2025-09-16 0:20 ` [PATCHSET RFC v5 8/8] fuse: run fuse servers as a contained service Darrick J. Wong
7 siblings, 10 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:19 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
Hi all,
This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel. For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem. For everyone else, it simply
eliminates roundtrips to userspace.
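As a rough illustration of the roundtrip savings, a mapping cache can be
pictured as a sorted set of extents that is searched before asking the
fuse server. The sketch below is a userspace toy under that assumption.
The struct and function names are hypothetical, and the real cache in
this series has locking and invalidation that are not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached mapping: file range -> device address. */
struct sketch_mapping {
	uint64_t offset;	/* file offset, bytes */
	uint64_t length;	/* mapping length, bytes */
	uint64_t addr;		/* device address, bytes */
};

/*
 * Binary search over mappings sorted by offset with no overlaps.
 * Returns the mapping covering @pos, or NULL on a cache miss, in which
 * case a real implementation would have to query the fuse server.
 */
static const struct sketch_mapping *
sketch_cache_lookup(const struct sketch_mapping *maps, size_t n, uint64_t pos)
{
	size_t lo = 0, hi = n;

	assert(maps != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pos < maps[mid].offset)
			hi = mid;
		else if (pos - maps[mid].offset >= maps[mid].length)
			lo = mid + 1;
		else
			return &maps[mid];
	}
	return NULL;
}
```

Every hit is one upcall to userspace that does not happen.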
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache
---
Commits in this patchset:
* fuse: cache iomaps
* fuse_trace: cache iomaps
* fuse: use the iomap cache for iomap_begin
* fuse_trace: use the iomap cache for iomap_begin
* fuse: invalidate iomap cache after file updates
* fuse_trace: invalidate iomap cache after file updates
* fuse: enable iomap cache management
* fuse_trace: enable iomap cache management
* fuse: overlay iomap inode info in struct fuse_inode
* fuse: enable iomap
---
fs/fuse/fuse_i.h | 58 ++
fs/fuse/fuse_trace.h | 434 ++++++++++++
fs/fuse/iomap_priv.h | 149 ++++
include/uapi/linux/fuse.h | 33 +
fs/fuse/Makefile | 2
fs/fuse/dev.c | 44 +
fs/fuse/dir.c | 6
fs/fuse/file.c | 10
fs/fuse/file_iomap.c | 527 ++++++++++++++
fs/fuse/iomap_cache.c | 1693 +++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 2937 insertions(+), 19 deletions(-)
create mode 100644 fs/fuse/iomap_cache.c
* [PATCHSET RFC v5 8/8] fuse: run fuse servers as a contained service
[not found] <20250916000759.GA8080@frogsfrogsfrogs>
` (6 preceding siblings ...)
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-09-16 0:20 ` Darrick J. Wong
2025-09-16 0:41 ` [PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap usage Darrick J. Wong
2025-09-16 0:41 ` [PATCH 2/2] fuse: set iomap backing device block size Darrick J. Wong
7 siblings, 2 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:20 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
Hi all,
This patchset defines the necessary communication protocols and library
code so that users can mount fuse servers that run in unprivileged
systemd service containers. That in turn allows unprivileged untrusted
mounts, because the worst that can happen is that a malicious image
crashes the fuse server and the mount dies, instead of corrupting the
kernel. As part of the delegation, add a new ioctl allowing any process
with an open fusedev fd to ask for permission for anyone with that
fusedev fd to use iomap.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-service-container
---
Commits in this patchset:
* fuse: allow privileged mount helpers to pre-approve iomap usage
* fuse: set iomap backing device block size
---
fs/fuse/fuse_dev_i.h | 32 +++++++++++++++++++--
fs/fuse/fuse_i.h | 12 ++++++++
include/uapi/linux/fuse.h | 8 +++++
fs/fuse/dev.c | 13 +++++----
fs/fuse/file_iomap.c | 67 ++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/inode.c | 18 ++++++++----
6 files changed, 134 insertions(+), 16 deletions(-)
* [PATCH 1/8] fuse: fix livelock in synchronous file put from fuseblk workers
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
@ 2025-09-16 0:24 ` Darrick J. Wong
2025-09-23 10:57 ` Miklos Szeredi
2025-09-16 0:24 ` [PATCH 2/8] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
` (6 subsequent siblings)
7 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:24 UTC (permalink / raw)
To: djwong, miklos
Cc: stable, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
I observed a hang when running generic/323 against a fuseblk server.
This test opens a file, initiates a lot of AIO writes to that file
descriptor, and closes the file descriptor before the writes complete.
Unsurprisingly, the AIO exerciser threads are mostly stuck waiting for
responses from the fuseblk server:
# cat /proc/372265/task/372313/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_do_getattr+0xfc/0x1f0 [fuse]
[<0>] fuse_file_read_iter+0xbe/0x1c0 [fuse]
[<0>] aio_read+0x130/0x1e0
[<0>] io_submit_one+0x542/0x860
[<0>] __x64_sys_io_submit+0x98/0x1a0
[<0>] do_syscall_64+0x37/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
But the /weird/ part is that the fuseblk server threads are waiting for
responses from the server itself:
# cat /proc/372210/task/372232/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_file_put+0x9a/0xd0 [fuse]
[<0>] fuse_release+0x36/0x50 [fuse]
[<0>] __fput+0xec/0x2b0
[<0>] task_work_run+0x55/0x90
[<0>] syscall_exit_to_user_mode+0xe9/0x100
[<0>] do_syscall_64+0x43/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
The fuseblk server is fuse2fs so there's nothing all that exciting in
the server itself. So why is the fuse server calling fuse_file_put?
The commit message for the fstest sheds some light on that:
"By closing the file descriptor before calling io_destroy, you pretty
much guarantee that the last put on the ioctx will be done in interrupt
context (during I/O completion)."
Aha. AIO fgets a new struct file from the fd when it queues the ioctx.
The completion of the FUSE_WRITE command from userspace causes the fuse
server to call the AIO completion function. The completion puts the
struct file, queuing a delayed fput to the fuse server task. When the
fuse server task returns to userspace, it has to run the delayed fput,
which in the case of a fuseblk server, it does synchronously.
Sending the FUSE_RELEASE command synchronously from fuse server threads
is a bad idea because a client program can initiate enough simultaneous
AIOs such that all the fuse server threads end up in delayed_fput, and
now there aren't any threads left to handle the queued fuse commands.
Fix this by only using asynchronous fputs when closing files, and leave
a comment explaining why.
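The thread-starvation argument can be reduced to a small worst-case
model. This is a hypothetical userspace toy, not kernel code: it assumes
every final fput lands on a distinct server thread and that a
synchronous release parks that thread until some other server thread
answers the FUSE_RELEASE.

```c
#include <assert.h>

/*
 * Worst-case snapshot: how many server threads are still free to read
 * the fuse queue after @final_fputs delayed fputs have landed on them?
 */
static unsigned int sketch_free_workers(unsigned int nworkers,
					unsigned int final_fputs,
					int sync_release)
{
	unsigned int parked = 0;

	assert(nworkers > 0);
	if (sync_release)
		parked = final_fputs < nworkers ? final_fputs : nworkers;
	return nworkers - parked;
}
```

Once the result reaches zero with sync_release set, nobody is left to
service the queued RELEASE commands. With asynchronous puts the count
never drops.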
Cc: <stable@vger.kernel.org> # v2.6.38
Fixes: 5a18ec176c934c ("fuse: fix hang of single threaded fuseblk filesystem")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4adcf09d4b01a6..ebdca39b2261d7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -356,8 +356,14 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
* Make the release synchronous if this is a fuseblk mount,
* synchronous RELEASE is allowed (and desirable) in this case
* because the server can be trusted not to screw up.
+ *
+ * Always use the asynchronous file put because the current thread
+ * might be the fuse server. This can happen if a process starts some
+ * aio and closes the fd before the aio completes. Since aio takes its
+ * own ref to the file, the IO completion has to drop the ref, which is
+ * how the fuse server can end up closing its clients' files.
*/
- fuse_file_put(ff, ff->fm->fc->destroy);
+ fuse_file_put(ff, false);
}
void fuse_release_common(struct file *file, bool isdir)
* [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
2025-09-16 0:24 ` [PATCH 1/8] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
@ 2025-09-16 0:24 ` Darrick J. Wong
2025-09-23 11:11 ` Miklos Szeredi
2025-09-16 0:24 ` [PATCH 3/8] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
` (5 subsequent siblings)
7 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:24 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
generic/488 fails with fuse2fs in the following fashion:
generic/488 _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
(see /var/tmp/fstests/generic/488.full for details)
This test opens a large number of files, unlinks them (which really just
renames them to fuse hidden files), closes the program, unmounts the
filesystem, and runs fsck to check that there aren't any inconsistencies
in the filesystem.
Unfortunately, the 488.full file shows that there are a lot of hidden
files left over in the filesystem, with incorrect link counts. Tracing
fuse_request_* shows that there are a large number of FUSE_RELEASE
commands that are queued up on behalf of the unlinked files at the time
that fuse_conn_destroy calls fuse_abort_conn. Had the connection not
aborted, the fuse server would have responded to the RELEASE commands by
removing the hidden files; instead they stick around.
For upper-level fuse servers that don't use fuseblk mode this isn't a
problem because libfuse responds to the connection going down by pruning
its inode cache and calling the fuse server's ->release for any open
files before calling the server's ->destroy function.
For fuseblk servers this is a problem, however, because the kernel sends
FUSE_DESTROY to the fuse server, and the fuse server has to close the
block device before returning. This means that the kernel must flush
all pending FUSE_RELEASE requests before issuing FUSE_DESTROY.
Create a function to push all the background requests to the queue and
then wait for the number of pending events to hit zero, and call this
before sending FUSE_DESTROY. That way, all the pending events are
processed by the fuse server and we don't end up with a corrupt
filesystem.
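The ordering problem can be modeled in a few lines of userspace C. This
is a hypothetical toy (the struct and helpers are illustrative, not the
kernel's): each queued RELEASE stands for one hidden file that the
server would delete if it ever saw the request, and aborting the
connection discards whatever is still queued.

```c
#include <assert.h>

/* Hypothetical connection state for the unmount ordering model. */
struct sketch_conn {
	unsigned int queued_releases;	/* RELEASEs not yet seen by server */
	unsigned int hidden_files;	/* unlinked-but-open files on disk */
};

/* Let the server process every queued RELEASE, removing hidden files. */
static void sketch_flush(struct sketch_conn *c)
{
	while (c->queued_releases > 0) {
		assert(c->hidden_files > 0);
		c->queued_releases--;
		c->hidden_files--;
	}
}

/* Abort the connection: anything still queued is dropped on the floor. */
static void sketch_abort(struct sketch_conn *c)
{
	c->queued_releases = 0;
}

/* Returns the number of hidden files left behind for fsck to find. */
static unsigned int sketch_unmount(struct sketch_conn *c, int flush_first)
{
	if (flush_first)
		sketch_flush(c);
	/* FUSE_DESTROY would be sent at this point. */
	sketch_abort(c);
	return c->hidden_files;
}
```

Without the flush, every queued RELEASE turns into a leftover hidden
file, which matches the fsck complaint from generic/488.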
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 5 +++++
fs/fuse/dev.c | 33 +++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 11 ++++++++++-
3 files changed, 48 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index cc428d04be3e14..8edca9ad13a9d1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1263,6 +1263,11 @@ void fuse_request_end(struct fuse_req *req);
void fuse_abort_conn(struct fuse_conn *fc);
void fuse_wait_aborted(struct fuse_conn *fc);
+/**
+ * Flush all pending requests and wait for them.
+ */
+void fuse_flush_requests_and_wait(struct fuse_conn *fc);
+
/* Check if any requests timed out */
void fuse_check_timeout(struct work_struct *work);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5150aa25e64be9..dcd338b65b2fc7 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -24,6 +24,7 @@
#include <linux/splice.h>
#include <linux/sched.h>
#include <linux/seq_file.h>
+#include <linux/nmi.h>
#define CREATE_TRACE_POINTS
#include "fuse_trace.h"
@@ -2385,6 +2386,38 @@ static void end_polls(struct fuse_conn *fc)
}
}
+/*
+ * Flush all pending requests and wait for them. Only call this function when
+ * it is no longer possible for other threads to add requests.
+ */
+void fuse_flush_requests_and_wait(struct fuse_conn *fc)
+{
+	spin_lock(&fc->lock);
+	if (!fc->connected) {
+		spin_unlock(&fc->lock);
+		return;
+	}
+
+	/* Push all the background requests to the queue. */
+	spin_lock(&fc->bg_lock);
+	fc->blocked = 0;
+	fc->max_background = UINT_MAX;
+	flush_bg_queue(fc);
+	spin_unlock(&fc->bg_lock);
+	spin_unlock(&fc->lock);
+
+	/*
+	 * Wait for all the events to complete or abort. Touch the watchdog
+	 * once per second so that we don't trip the hangcheck timer while
+	 * waiting for the fuse server.
+	 */
+	smp_mb();
+	while (wait_event_timeout(fc->blocked_waitq,
+			!fc->connected || atomic_read(&fc->num_waiting) == 0,
+			HZ) == 0)
+		touch_softlockup_watchdog();
+}
+
/*
* Abort all requests.
*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7ddfd2b3cc9c4f..c94aba627a6f11 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -2056,8 +2056,17 @@ void fuse_conn_destroy(struct fuse_mount *fm)
{
struct fuse_conn *fc = fm->fc;
- if (fc->destroy)
+ if (fc->destroy) {
+ /*
+ * Flush all pending requests (most of which will be
+ * FUSE_RELEASE) before sending FUSE_DESTROY, because the fuse
+ * server must close the filesystem before replying to the
+ * destroy message, because unmount is about to release its
+ * O_EXCL hold on the block device.
+ */
+ fuse_flush_requests_and_wait(fc);
fuse_send_destroy(fm);
+ }
fuse_abort_conn(fc);
fuse_wait_aborted(fc);
* [PATCH 3/8] fuse: capture the unique id of fuse commands being sent
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
2025-09-16 0:24 ` [PATCH 1/8] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
2025-09-16 0:24 ` [PATCH 2/8] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
@ 2025-09-16 0:24 ` Darrick J. Wong
2025-09-23 10:58 ` Miklos Szeredi
2025-09-16 0:25 ` [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors Darrick J. Wong
` (4 subsequent siblings)
7 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:24 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
The fuse_request_{send,end} tracepoints capture the value of
req->in.h.unique in the trace output. It would be really nice if we
could use this to match a request to its response for debugging and
latency analysis, but the call to trace_fuse_request_send occurs before
the unique id has been set:
fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
fuse_request_end: connection 8388608 req 6 len 16 error -2
(Notice that req moves from 0 to 6)
Move the trace_fuse_request_send callsites to after the unique id has
been set by introducing a helper that does both for standard fuse_req
requests. FUSE_FORGET requests are not covered by this because they
appear to be synthesized into the event stream without a fuse_req
object and are never replied to.
Requests that are aborted without ever having been submitted to the fuse
server retain the behavior that only the fuse_request_end tracepoint
shows up in the trace record, and with req==0.
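The effect of the reordering can be shown with a small userspace model.
Everything here is hypothetical (the structs, the counter, and the
stand-in for the tracepoint); it only demonstrates that tracing before
the id is assigned records 0, while assigning first records the real id.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request and queue; not the kernel's fuse_req/fuse_iqueue. */
struct sketch_req {
	uint64_t unique;
};

struct sketch_queue {
	uint64_t next_unique;	/* last id handed out */
	uint64_t last_traced;	/* what the "tracepoint" recorded */
};

/* Stand-in for trace_fuse_request_send(): records the id it sees. */
static void sketch_trace_send(struct sketch_queue *q,
			      const struct sketch_req *req)
{
	q->last_traced = req->unique;
}

/* Old ordering: trace first, assign the unique id afterwards. */
static void sketch_send_old(struct sketch_queue *q, struct sketch_req *req)
{
	assert(req->unique == 0);	/* id not assigned yet */
	sketch_trace_send(q, req);
	req->unique = ++q->next_unique;
}

/* New ordering: assign the unique id, then trace. */
static void sketch_send_new(struct sketch_queue *q, struct sketch_req *req)
{
	assert(req->unique == 0);
	req->unique = ++q->next_unique;
	sketch_trace_send(q, req);
}
```

In the old ordering the send event and the end event cannot be joined on
the request id, which is the mismatch visible in the trace excerpt
above.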
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 5 +++++
fs/fuse/dev.c | 27 +++++++++++++++++++++++----
fs/fuse/dev_uring.c | 4 ++--
fs/fuse/virtio_fs.c | 3 +--
4 files changed, 31 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 8edca9ad13a9d1..e93a3c3f11d901 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1254,6 +1254,11 @@ static inline ssize_t fuse_simple_idmap_request(struct mnt_idmap *idmap,
int fuse_simple_background(struct fuse_mount *fm, struct fuse_args *args,
gfp_t gfp_flags);
+/**
+ * Assign a unique id to a fuse request
+ */
+void fuse_request_assign_unique(struct fuse_iqueue *fiq, struct fuse_req *req);
+
/**
* End a finished request
*/
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index dcd338b65b2fc7..f06208e4364642 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -370,12 +370,32 @@ void fuse_dev_queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req)
}
}
+static inline void fuse_request_assign_unique_locked(struct fuse_iqueue *fiq,
+						     struct fuse_req *req)
+{
+	if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
+		req->in.h.unique = fuse_get_unique_locked(fiq);
+
+	/* tracepoint captures in.h.unique and in.h.len */
+	trace_fuse_request_send(req);
+}
+
+inline void fuse_request_assign_unique(struct fuse_iqueue *fiq,
+				       struct fuse_req *req)
+{
+	if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
+		req->in.h.unique = fuse_get_unique(fiq);
+
+	/* tracepoint captures in.h.unique and in.h.len */
+	trace_fuse_request_send(req);
+}
+EXPORT_SYMBOL_GPL(fuse_request_assign_unique);
+
static void fuse_dev_queue_req(struct fuse_iqueue *fiq, struct fuse_req *req)
{
spin_lock(&fiq->lock);
if (fiq->connected) {
- if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
- req->in.h.unique = fuse_get_unique_locked(fiq);
+ fuse_request_assign_unique_locked(fiq, req);
list_add_tail(&req->list, &fiq->pending);
fuse_dev_wake_and_unlock(fiq);
} else {
@@ -398,7 +418,6 @@ static void fuse_send_one(struct fuse_iqueue *fiq, struct fuse_req *req)
req->in.h.len = sizeof(struct fuse_in_header) +
fuse_len_args(req->args->in_numargs,
(struct fuse_arg *) req->args->in_args);
- trace_fuse_request_send(req);
fiq->ops->send_req(fiq, req);
}
@@ -688,10 +707,10 @@ static bool fuse_request_queue_background_uring(struct fuse_conn *fc,
{
struct fuse_iqueue *fiq = &fc->iq;
- req->in.h.unique = fuse_get_unique(fiq);
req->in.h.len = sizeof(struct fuse_in_header) +
fuse_len_args(req->args->in_numargs,
(struct fuse_arg *) req->args->in_args);
+ fuse_request_assign_unique(fiq, req);
return fuse_uring_queue_bq_req(req);
}
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 249b210becb1cc..7b541aeea1813f 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -7,6 +7,7 @@
#include "fuse_i.h"
#include "dev_uring_i.h"
#include "fuse_dev_i.h"
+#include "fuse_trace.h"
#include <linux/fs.h>
#include <linux/io_uring/cmd.h>
@@ -1268,8 +1269,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
if (!queue)
goto err;
- if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
- req->in.h.unique = fuse_get_unique(fiq);
+ fuse_request_assign_unique(fiq, req);
spin_lock(&queue->lock);
err = -ENOTCONN;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 76c8fd0bfc75d5..a880294549a6bd 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1480,8 +1480,7 @@ static void virtio_fs_send_req(struct fuse_iqueue *fiq, struct fuse_req *req)
struct virtio_fs_vq *fsvq;
int ret;
- if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
- req->in.h.unique = fuse_get_unique(fiq);
+ fuse_request_assign_unique(fiq, req);
clear_bit(FR_PENDING, &req->flags);
* [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
` (2 preceding siblings ...)
2025-09-16 0:24 ` [PATCH 3/8] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
@ 2025-09-16 0:25 ` Darrick J. Wong
2025-09-17 17:18 ` Joanne Koong
2025-09-16 0:25 ` [PATCH 5/8] fuse: implement file attributes mask for statx Darrick J. Wong
` (3 subsequent siblings)
7 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:25 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Create a new fuse context flag that indicates that the kernel should
implement various local filesystem behaviors instead of passing vfs
commands straight through to the fuse server and expecting the server to
do all the work. For example, this means that we'll use the kernel to
transform some ACL updates into mode changes, and later to do
enforcement of the immutable and append iflags.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 4 ++++
fs/fuse/inode.c | 2 ++
2 files changed, 6 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e93a3c3f11d901..e13e8270f4f58d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -603,6 +603,7 @@ struct fuse_fs_context {
bool no_control:1;
bool no_force_umount:1;
bool legacy_opts_show:1;
+ bool local_fs:1;
enum fuse_dax_mode dax_mode;
unsigned int max_read;
unsigned int blksize;
@@ -901,6 +902,9 @@ struct fuse_conn {
/* Is link not implemented by fs? */
unsigned int no_link:1;
+ /* Should this filesystem behave like a local filesystem? */
+ unsigned int local_fs:1;
+
/* Use io_uring for communication */
unsigned int io_uring;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index c94aba627a6f11..c8dd0bcb7e6f9f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1862,6 +1862,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
+ fc->local_fs = ctx->local_fs;
err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
@@ -2029,6 +2030,7 @@ static int fuse_init_fs_context(struct fs_context *fsc)
if (fsc->fs_type == &fuseblk_fs_type) {
ctx->is_bdev = true;
ctx->destroy = true;
+ ctx->local_fs = true;
}
#endif
* [PATCH 5/8] fuse: implement file attributes mask for statx
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
` (3 preceding siblings ...)
2025-09-16 0:25 ` [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors Darrick J. Wong
@ 2025-09-16 0:25 ` Darrick J. Wong
2025-09-16 0:25 ` [PATCH 6/8] fuse: update file mode when updating acls Darrick J. Wong
` (2 subsequent siblings)
7 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:25 UTC (permalink / raw)
To: djwong, miklos
Cc: joannelkoong, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Actually copy the attributes/attributes_mask from userspace. Ignore
file attributes bits that the VFS sets (or doesn't set) on its own.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/fuse_i.h | 37 +++++++++++++++++++++++++++++++++++++
fs/fuse/dir.c | 4 ++++
fs/fuse/inode.c | 3 +++
3 files changed, 44 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e13e8270f4f58d..52776b77efc0e4 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -140,6 +140,10 @@ struct fuse_inode {
/** Version of last attribute change */
u64 attr_version;
+ /** statx file attributes */
+ u64 statx_attributes;
+ u64 statx_attributes_mask;
+
union {
/* read/write io cache (regular file only) */
struct {
@@ -1221,6 +1225,39 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
u64 attr_valid, u32 cache_mask,
u64 evict_ctr);
+/*
+ * These statx attribute flags are set by the VFS so mask them out of replies
+ * from the fuse server for local filesystems. Nonlocal filesystems are
+ * responsible for enforcing and advertising these flags themselves.
+ */
+#define FUSE_STATX_LOCAL_VFS_ATTRIBUTES (STATX_ATTR_IMMUTABLE | \
+ STATX_ATTR_APPEND)
+
+/*
+ * These statx attribute flags are set by the VFS so mask them out of replies
+ * from the fuse server.
+ */
+#define FUSE_STATX_VFS_ATTRIBUTES (STATX_ATTR_AUTOMOUNT | STATX_ATTR_DAX | \
+ STATX_ATTR_MOUNT_ROOT)
+
+static inline u64 fuse_statx_attributes_mask(const struct fuse_conn *fc,
+ const struct fuse_statx *sx)
+{
+ if (fc->local_fs)
+ return sx->attributes_mask & ~(FUSE_STATX_VFS_ATTRIBUTES |
+ FUSE_STATX_LOCAL_VFS_ATTRIBUTES);
+ return sx->attributes_mask & ~FUSE_STATX_VFS_ATTRIBUTES;
+}
+
+static inline u64 fuse_statx_attributes(const struct fuse_conn *fc,
+ const struct fuse_statx *sx)
+{
+ if (fc->local_fs)
+ return sx->attributes & ~(FUSE_STATX_VFS_ATTRIBUTES |
+ FUSE_STATX_LOCAL_VFS_ATTRIBUTES);
+ return sx->attributes & ~FUSE_STATX_VFS_ATTRIBUTES;
+}
+
u32 fuse_get_cache_mask(struct inode *inode);
/**
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 5c569c3cb53f3d..a7f47e43692f1c 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1278,6 +1278,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
stat->btime.tv_sec = sx->btime.tv_sec;
stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
+ stat->attributes |= fuse_statx_attributes(fm->fc, sx);
+ stat->attributes_mask |= fuse_statx_attributes_mask(fm->fc, sx);
fuse_fillattr(idmap, inode, &attr, stat);
stat->result_mask |= STATX_TYPE;
}
@@ -1382,6 +1384,8 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode,
stat->btime = fi->i_btime;
stat->result_mask |= STATX_BTIME;
}
+ stat->attributes = fi->statx_attributes;
+ stat->attributes_mask = fi->statx_attributes_mask;
}
return err;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index c8dd0bcb7e6f9f..55db991bb6b8c1 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -287,6 +287,9 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
fi->i_btime.tv_sec = sx->btime.tv_sec;
fi->i_btime.tv_nsec = sx->btime.tv_nsec;
}
+
+ fi->statx_attributes = fuse_statx_attributes(fc, sx);
+ fi->statx_attributes_mask = fuse_statx_attributes_mask(fc, sx);
}
if (attr->blksize)
* [PATCH 6/8] fuse: update file mode when updating acls
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
` (4 preceding siblings ...)
2025-09-16 0:25 ` [PATCH 5/8] fuse: implement file attributes mask for statx Darrick J. Wong
@ 2025-09-16 0:25 ` Darrick J. Wong
2025-09-16 0:25 ` [PATCH 7/8] fuse: propagate default and file acls on creation Darrick J. Wong
2025-09-16 0:26 ` [PATCH 8/8] fuse: enable FUSE_SYNCFS for all fuseblk servers Darrick J. Wong
7 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:25 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
If someone sets ACLs on a file that can be expressed fully as Unix DAC
mode bits, most local filesystems will then update the mode bits and
drop the ACL xattr to reduce inefficiency in the file access paths.
Let's do that too. Note that this means a setacl call can end up leaving
no ACL xattrs behind, so we also need to tolerate ENODATA returns from
fuse_removexattr.
Note that here we define a "local" fuse filesystem as one that uses
fuseblk mode; we'll shortly add fuse servers that use iomap for the file
IO path to that list.
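To illustrate what "can be expressed fully as Unix DAC mode bits" means, here is a rough userspace model of the equivalence check. It is a sketch only: the `demo_` types are hypothetical, and the logic is a simplified rendering of the usual POSIX ACL rule (only owner, owning-group and other entries are present), not the kernel's posix_acl_update_mode().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ACL entry tags, named after the usual POSIX ACL tags. */
enum demo_tag {
	DEMO_USER_OBJ,	/* file owner */
	DEMO_USER,	/* named user */
	DEMO_GROUP_OBJ,	/* owning group */
	DEMO_GROUP,	/* named group */
	DEMO_MASK,	/* mask entry */
	DEMO_OTHER,	/* everyone else */
};

struct demo_acl_entry {
	enum demo_tag tag;
	unsigned int perm;	/* rwx bits, 0-7 */
};

/*
 * Compute the permission bits implied by an ACL.  Returns true when the
 * ACL carries nothing beyond those bits, i.e. the xattr could be dropped.
 */
static bool demo_acl_equiv_mode(const struct demo_acl_entry *e, size_t n,
				unsigned int *mode_bits)
{
	unsigned int mode = 0;
	bool equiv = true;
	size_t i;

	assert(mode_bits != NULL);

	for (i = 0; i < n; i++) {
		switch (e[i].tag) {
		case DEMO_USER_OBJ:
			mode |= (e[i].perm & 7u) << 6;
			break;
		case DEMO_GROUP_OBJ:
			mode |= (e[i].perm & 7u) << 3;
			break;
		case DEMO_OTHER:
			mode |= e[i].perm & 7u;
			break;
		case DEMO_MASK:
			/* The mask replaces the group bits in the mode. */
			mode = (mode & ~070u) | ((e[i].perm & 7u) << 3);
			equiv = false;
			break;
		case DEMO_USER:
		case DEMO_GROUP:
			/* Named entries cannot be expressed as mode bits. */
			equiv = false;
			break;
		}
	}

	*mode_bits = mode;
	return equiv;
}
```

An ACL of u::rw-, g::r--, o::--- is equivalent to mode 0640 in this model, so nothing but the mode needs to be stored; adding a named user entry makes the ACL non-equivalent and the xattr has to stay.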
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/acl.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 8f484b105f13ab..4997827ee83c6d 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -11,6 +11,16 @@
#include <linux/posix_acl.h>
#include <linux/posix_acl_xattr.h>
+/*
+ * If this fuse server behaves like a local filesystem, we can implement the
+ * kernel's optimizations for ACLs for local filesystems instead of passing
+ * the ACL requests straight through to another server.
+ */
+static inline bool fuse_has_local_acls(const struct fuse_conn *fc)
+{
+ return fc->posix_acl && fc->local_fs;
+}
+
static struct posix_acl *__fuse_get_acl(struct fuse_conn *fc,
struct inode *inode, int type, bool rcu)
{
@@ -98,6 +108,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
struct inode *inode = d_inode(dentry);
struct fuse_conn *fc = get_fuse_conn(inode);
const char *name;
+ umode_t mode = inode->i_mode;
int ret;
if (fuse_is_bad(inode))
@@ -113,6 +124,17 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
else
return -EINVAL;
+ /*
+ * If the ACL can be represented entirely with changes to the mode
+ * bits, then most filesystems will update the mode bits and delete
+ * the ACL xattr.
+ */
+ if (acl && type == ACL_TYPE_ACCESS && fuse_has_local_acls(fc)) {
+ ret = posix_acl_update_mode(idmap, inode, &mode, &acl);
+ if (ret)
+ return ret;
+ }
+
if (acl) {
unsigned int extra_flags = 0;
/*
@@ -143,7 +165,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
* through POSIX ACLs. Such daemons don't expect setgid bits to
* be stripped.
*/
- if (fc->posix_acl &&
+ if (fc->posix_acl && mode == inode->i_mode &&
!in_group_or_capable(idmap, inode,
i_gid_into_vfsgid(idmap, inode)))
extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID;
@@ -152,6 +174,22 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
kfree(value);
} else {
ret = fuse_removexattr(inode, name);
+ /* If the acl didn't exist to start with that's fine. */
+ if (ret == -ENODATA)
+ ret = 0;
+ }
+
+ /* If we scheduled a mode update above, push that to userspace now. */
+ if (!ret) {
+ struct iattr attr = { };
+
+ if (mode != inode->i_mode) {
+ attr.ia_valid |= ATTR_MODE;
+ attr.ia_mode = mode;
+ }
+
+ if (attr.ia_valid)
+ ret = fuse_do_setattr(idmap, dentry, &attr, NULL);
}
if (fc->posix_acl) {
* [PATCH 7/8] fuse: propagate default and file acls on creation
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
` (5 preceding siblings ...)
2025-09-16 0:25 ` [PATCH 6/8] fuse: update file mode when updating acls Darrick J. Wong
@ 2025-09-16 0:25 ` Darrick J. Wong
2025-09-16 6:41 ` Chen Linxuan
2025-09-16 0:26 ` [PATCH 8/8] fuse: enable FUSE_SYNCFS for all fuseblk servers Darrick J. Wong
7 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:25 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
For local filesystems, propagate the default and file access ACLs to new
children when creating them, just like the other in-kernel local
filesystems.
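As a rough illustration of what inheriting a default ACL does to the mode of a new child, here is a simplified userspace model. It is a sketch under stated assumptions, not the kernel's posix_acl_create(): the `demo_` names and the flattened ACL struct are hypothetical, and only the effect on the permission bits is modelled (the cloned ACL entries themselves are ignored).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of a parent directory's default ACL. */
struct demo_default_acl {
	unsigned int user_obj;	/* rwx bits (0-7) for the owner */
	unsigned int group_obj;	/* rwx bits (0-7) for the owning group */
	unsigned int other;	/* rwx bits (0-7) for everyone else */
	bool has_mask;		/* true if a mask entry exists */
	unsigned int mask;	/* rwx bits (0-7) of the mask entry */
};

/*
 * Compute the mode of a new child.  Without a default ACL the umask
 * applies; with one, the umask is ignored and the requested permission
 * bits are limited by the ACL entries instead.
 */
static unsigned int demo_create_mode(unsigned int mode,
				     unsigned int umask_bits,
				     const struct demo_default_acl *dacl)
{
	unsigned int group, limit;

	/* A umask only ever carries permission bits. */
	assert((umask_bits & ~0777u) == 0);

	if (dacl == NULL)
		return mode & ~umask_bits;

	/* When a mask entry exists, it stands in for the group bits. */
	group = dacl->has_mask ? dacl->mask : dacl->group_obj;
	limit = ((dacl->user_obj & 7u) << 6) | ((group & 7u) << 3) |
		(dacl->other & 7u);

	/* Keep file type and special bits; restrict only the rwx bits. */
	return mode & (limit | ~0777u);
}
```

This is also why the umask masking has to move out of the individual create paths in the patch: whether the umask applies depends on whether the parent carries a default ACL.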
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 4 ++
fs/fuse/acl.c | 65 ++++++++++++++++++++++++++++++++++++++
fs/fuse/dir.c | 92 +++++++++++++++++++++++++++++++++++++++++-------------
3 files changed, 138 insertions(+), 23 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 52776b77efc0e4..b9306678dcda0d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1507,6 +1507,10 @@ struct posix_acl *fuse_get_acl(struct mnt_idmap *idmap,
struct dentry *dentry, int type);
int fuse_set_acl(struct mnt_idmap *, struct dentry *dentry,
struct posix_acl *acl, int type);
+int fuse_acl_create(struct inode *dir, umode_t *mode,
+ struct posix_acl **default_acl, struct posix_acl **acl);
+int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl,
+ const struct posix_acl *acl);
/* readdir.c */
int fuse_readdir(struct file *file, struct dir_context *ctx);
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 4997827ee83c6d..4faee72f1365a5 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -203,3 +203,68 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
return ret;
}
+
+int fuse_acl_create(struct inode *dir, umode_t *mode,
+ struct posix_acl **default_acl, struct posix_acl **acl)
+{
+ struct fuse_conn *fc = get_fuse_conn(dir);
+
+ if (fuse_is_bad(dir))
+ return -EIO;
+
+ if (IS_POSIXACL(dir) && fuse_has_local_acls(fc))
+ return posix_acl_create(dir, mode, default_acl, acl);
+
+ if (!fc->dont_mask)
+ *mode &= ~current_umask();
+
+ *default_acl = NULL;
+ *acl = NULL;
+ return 0;
+}
+
+static int __fuse_set_acl(struct inode *inode, const char *name,
+ const struct posix_acl *acl)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ size_t size = posix_acl_xattr_size(acl->a_count);
+ void *value;
+ int ret;
+
+ if (size > PAGE_SIZE)
+ return -E2BIG;
+
+ value = kmalloc(size, GFP_KERNEL);
+ if (!value)
+ return -ENOMEM;
+
+ ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
+ if (ret < 0)
+ goto out_value;
+
+ ret = fuse_setxattr(inode, name, value, size, 0, 0);
+out_value:
+ kfree(value);
+ return ret;
+}
+
+int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl,
+ const struct posix_acl *acl)
+{
+ int ret;
+
+ if (default_acl) {
+ ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_DEFAULT,
+ default_acl);
+ if (ret)
+ return ret;
+ }
+
+ if (acl) {
+ ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_ACCESS, acl);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index a7f47e43692f1c..b116e42431ee12 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -628,26 +628,28 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
struct fuse_entry_out outentry;
struct fuse_inode *fi;
struct fuse_file *ff;
+ struct posix_acl *default_acl = NULL, *acl = NULL;
int epoch, err;
bool trunc = flags & O_TRUNC;
/* Userspace expects S_IFREG in create mode */
BUG_ON((mode & S_IFMT) != S_IFREG);
+ err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+ if (err)
+ return err;
+
epoch = atomic_read(&fm->fc->epoch);
forget = fuse_alloc_forget();
err = -ENOMEM;
if (!forget)
- goto out_err;
+ goto out_acl_release;
err = -ENOMEM;
ff = fuse_file_alloc(fm, true);
if (!ff)
goto out_put_forget_req;
- if (!fm->fc->dont_mask)
- mode &= ~current_umask();
-
flags &= ~O_NOCTTY;
memset(&inarg, 0, sizeof(inarg));
memset(&outentry, 0, sizeof(outentry));
@@ -699,12 +701,16 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
fuse_sync_release(NULL, ff, flags);
fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
err = -ENOMEM;
- goto out_err;
+ goto out_acl_release;
}
kfree(forget);
d_instantiate(entry, inode);
entry->d_time = epoch;
fuse_change_entry_timeout(entry, &outentry);
+
+ err = fuse_init_acls(inode, default_acl, acl);
+ if (err)
+ goto out_acl_release;
fuse_dir_changed(dir);
err = generic_file_open(inode, file);
if (!err) {
@@ -726,7 +732,9 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
fuse_file_free(ff);
out_put_forget_req:
kfree(forget);
-out_err:
+out_acl_release:
+ posix_acl_release(default_acl);
+ posix_acl_release(acl);
return err;
}
@@ -785,7 +793,9 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
*/
static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_mount *fm,
struct fuse_args *args, struct inode *dir,
- struct dentry *entry, umode_t mode)
+ struct dentry *entry, umode_t mode,
+ struct posix_acl *default_acl,
+ struct posix_acl *acl)
{
struct fuse_entry_out outarg;
struct inode *inode;
@@ -793,14 +803,18 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
struct fuse_forget_link *forget;
int epoch, err;
- if (fuse_is_bad(dir))
- return ERR_PTR(-EIO);
+ if (fuse_is_bad(dir)) {
+ err = -EIO;
+ goto out_acl_release;
+ }
epoch = atomic_read(&fm->fc->epoch);
forget = fuse_alloc_forget();
- if (!forget)
- return ERR_PTR(-ENOMEM);
+ if (!forget) {
+ err = -ENOMEM;
+ goto out_acl_release;
+ }
memset(&outarg, 0, sizeof(outarg));
args->nodeid = get_node_id(dir);
@@ -830,7 +844,8 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
&outarg.attr, ATTR_TIMEOUT(&outarg), 0, 0);
if (!inode) {
fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1);
- return ERR_PTR(-ENOMEM);
+ err = -ENOMEM;
+ goto out_acl_release;
}
kfree(forget);
@@ -846,19 +861,31 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
entry->d_time = epoch;
fuse_change_entry_timeout(entry, &outarg);
}
+
+ err = fuse_init_acls(inode, default_acl, acl);
+ if (err)
+ goto out_acl_release;
fuse_dir_changed(dir);
+
+ posix_acl_release(default_acl);
+ posix_acl_release(acl);
return d;
out_put_forget_req:
if (err == -EEXIST)
fuse_invalidate_entry(entry);
kfree(forget);
+ out_acl_release:
+ posix_acl_release(default_acl);
+ posix_acl_release(acl);
return ERR_PTR(err);
}
static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm,
struct fuse_args *args, struct inode *dir,
- struct dentry *entry, umode_t mode)
+ struct dentry *entry, umode_t mode,
+ struct posix_acl *default_acl,
+ struct posix_acl *acl)
{
/*
* Note that when creating anything other than a directory we
@@ -869,7 +896,8 @@ static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm,
*/
WARN_ON_ONCE(S_ISDIR(mode));
- return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode));
+ return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode,
+ default_acl, acl));
}
static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
@@ -877,10 +905,13 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
{
struct fuse_mknod_in inarg;
struct fuse_mount *fm = get_fuse_mount(dir);
+ struct posix_acl *default_acl, *acl;
FUSE_ARGS(args);
+ int err;
- if (!fm->fc->dont_mask)
- mode &= ~current_umask();
+ err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+ if (err)
+ return err;
memset(&inarg, 0, sizeof(inarg));
inarg.mode = mode;
@@ -892,7 +923,8 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
args.in_args[0].value = &inarg;
args.in_args[1].size = entry->d_name.len + 1;
args.in_args[1].value = entry->d_name.name;
- return create_new_nondir(idmap, fm, &args, dir, entry, mode);
+ return create_new_nondir(idmap, fm, &args, dir, entry, mode,
+ default_acl, acl);
}
static int fuse_create(struct mnt_idmap *idmap, struct inode *dir,
@@ -924,13 +956,17 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
{
struct fuse_mkdir_in inarg;
struct fuse_mount *fm = get_fuse_mount(dir);
+ struct posix_acl *default_acl, *acl;
FUSE_ARGS(args);
+ int err;
- if (!fm->fc->dont_mask)
- mode &= ~current_umask();
+ mode |= S_IFDIR; /* vfs doesn't set S_IFDIR for us */
+ err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+ if (err)
+ return ERR_PTR(err);
memset(&inarg, 0, sizeof(inarg));
- inarg.mode = mode;
+ inarg.mode = mode & ~S_IFDIR;
inarg.umask = current_umask();
args.opcode = FUSE_MKDIR;
args.in_numargs = 2;
@@ -938,7 +974,8 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
args.in_args[0].value = &inarg;
args.in_args[1].size = entry->d_name.len + 1;
args.in_args[1].value = entry->d_name.name;
- return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR);
+ return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR,
+ default_acl, acl);
}
static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
@@ -946,7 +983,14 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
{
struct fuse_mount *fm = get_fuse_mount(dir);
unsigned len = strlen(link) + 1;
+ struct posix_acl *default_acl, *acl;
+ umode_t mode = S_IFLNK | 0777;
FUSE_ARGS(args);
+ int err;
+
+ err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+ if (err)
+ return err;
args.opcode = FUSE_SYMLINK;
args.in_numargs = 3;
@@ -955,7 +999,8 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
args.in_args[1].value = entry->d_name.name;
args.in_args[2].size = len;
args.in_args[2].value = link;
- return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK);
+ return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK,
+ default_acl, acl);
}
void fuse_flush_time_update(struct inode *inode)
@@ -1155,7 +1200,8 @@ static int fuse_link(struct dentry *entry, struct inode *newdir,
args.in_args[0].value = &inarg;
args.in_args[1].size = newent->d_name.len + 1;
args.in_args[1].value = newent->d_name.name;
- err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, inode->i_mode);
+ err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent,
+ inode->i_mode, NULL, NULL);
if (!err)
fuse_update_ctime_in_cache(inode);
else if (err == -EINTR)
* [PATCH 8/8] fuse: enable FUSE_SYNCFS for all fuseblk servers
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
` (6 preceding siblings ...)
2025-09-16 0:25 ` [PATCH 7/8] fuse: propagate default and file acls on creation Darrick J. Wong
@ 2025-09-16 0:26 ` Darrick J. Wong
2025-09-23 10:58 ` Miklos Szeredi
7 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:26 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Turn on syncfs for all fuseblk servers so that the ones in the know can
flush cached intermediate data and logs to disk.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/inode.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 55db991bb6b8c1..869d8a87bfb628 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1824,6 +1824,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
!sb_set_blocksize(sb, PAGE_SIZE))
goto err;
#endif
+ fc->sync_fs = 1;
} else {
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
* [PATCH 1/2] iomap: trace iomap_zero_iter zeroing activities
2025-09-16 0:18 ` [PATCHSET RFC v5 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
@ 2025-09-16 0:26 ` Darrick J. Wong
2025-09-16 13:49 ` Christoph Hellwig
2025-09-16 0:26 ` [PATCH 2/2] iomap: error out on file IO when there is no inline_data buffer Darrick J. Wong
1 sibling, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:26 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Trace which bytes actually get zeroed.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/trace.h | 1 +
fs/iomap/buffered-io.c | 3 +++
2 files changed, 4 insertions(+)
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 6ad66e6ba653e8..a61c1dae474270 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -84,6 +84,7 @@ DEFINE_RANGE_EVENT(iomap_release_folio);
DEFINE_RANGE_EVENT(iomap_invalidate_folio);
DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
+DEFINE_RANGE_EVENT(iomap_zero_iter);
#define IOMAP_TYPE_STRINGS \
{ IOMAP_HOLE, "HOLE" }, \
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1e95a331a682e2..741f1f6001e1ff 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1415,6 +1415,9 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
/* warn about zeroing folios beyond eof that won't write back */
WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
+ trace_iomap_zero_iter(iter->inode, folio_pos(folio) + offset,
+ bytes);
+
folio_zero_range(folio, offset, bytes);
folio_mark_accessed(folio);
* [PATCH 2/2] iomap: error out on file IO when there is no inline_data buffer
2025-09-16 0:18 ` [PATCHSET RFC v5 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
2025-09-16 0:26 ` [PATCH 1/2] iomap: trace iomap_zero_iter zeroing activities Darrick J. Wong
@ 2025-09-16 0:26 ` Darrick J. Wong
2025-09-16 13:50 ` Christoph Hellwig
1 sibling, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:26 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Return IO errors if an ->iomap_begin implementation returns an
IOMAP_INLINE buffer but forgets to set the inline_data pointer.
Filesystems should never do this, but we could help fs developers (me)
fix their bugs by handling this more gracefully than crashing the
kernel.
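The defensive pattern is simple enough to show in isolation. Below is a minimal userspace sketch, assuming a made-up `demo_inline_map` struct in place of struct iomap: a missing buffer is reported as -EIO to the caller instead of being dereferenced.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a mapping that carries its data inline. */
struct demo_inline_map {
	const void *inline_data;	/* may be NULL if the provider forgot it */
	size_t length;			/* bytes available at inline_data */
};

/*
 * Copy up to len bytes of inline data into dst.  Returns 0 on success
 * or -EIO if the mapping has no buffer attached.
 */
static int demo_read_inline(const struct demo_inline_map *map,
			    void *dst, size_t len)
{
	assert(map != NULL && dst != NULL);

	/* Fail the IO rather than crash on a NULL dereference. */
	if (map->inline_data == NULL)
		return -EIO;

	if (len > map->length)
		len = map->length;
	memcpy(dst, map->inline_data, len);
	return 0;
}
```

The kernel patch additionally wraps the check in WARN_ON_ONCE so that the offending filesystem shows up in the log once; the sketch leaves that out.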
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 15 ++++++++++-----
fs/iomap/direct-io.c | 3 +++
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 741f1f6001e1ff..869f178aea28d3 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -312,6 +312,9 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
size_t size = i_size_read(iter->inode) - iomap->offset;
size_t offset = offset_in_folio(folio, iomap->offset);
+ if (WARN_ON_ONCE(iomap->inline_data == NULL))
+ return -EIO;
+
if (folio_test_uptodate(folio))
return 0;
@@ -913,7 +916,7 @@ static bool __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
return true;
}
-static void iomap_write_end_inline(const struct iomap_iter *iter,
+static bool iomap_write_end_inline(const struct iomap_iter *iter,
struct folio *folio, loff_t pos, size_t copied)
{
const struct iomap *iomap = &iter->iomap;
@@ -922,12 +925,16 @@ static void iomap_write_end_inline(const struct iomap_iter *iter,
WARN_ON_ONCE(!folio_test_uptodate(folio));
BUG_ON(!iomap_inline_data_valid(iomap));
+ if (WARN_ON_ONCE(iomap->inline_data == NULL))
+ return false;
+
flush_dcache_folio(folio);
addr = kmap_local_folio(folio, pos);
memcpy(iomap_inline_data(iomap, pos), addr, copied);
kunmap_local(addr);
mark_inode_dirty(iter->inode);
+ return true;
}
/*
@@ -940,10 +947,8 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
const struct iomap *srcmap = iomap_iter_srcmap(iter);
loff_t pos = iter->pos;
- if (srcmap->type == IOMAP_INLINE) {
- iomap_write_end_inline(iter, folio, pos, copied);
- return true;
- }
+ if (srcmap->type == IOMAP_INLINE)
+ return iomap_write_end_inline(iter, folio, pos, copied);
if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
size_t bh_written;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 6dc4e18f93a40a..a992130a1cb6dd 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -523,6 +523,9 @@ static int iomap_dio_inline_iter(struct iomap_iter *iomi, struct iomap_dio *dio)
loff_t pos = iomi->pos;
u64 copied;
+ if (WARN_ON_ONCE(inline_data == NULL))
+ return -EIO;
+
if (WARN_ON_ONCE(!iomap_inline_data_valid(iomap)))
return -EIO;
* [PATCH 1/5] fuse: allow synchronous FUSE_INIT
2025-09-16 0:18 ` [PATCHSET RFC v5 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
@ 2025-09-16 0:26 ` Darrick J. Wong
2025-09-17 17:22 ` Joanne Koong
2025-09-16 0:27 ` [PATCH 2/5] fuse: move the backing file idr and code into a new source file Darrick J. Wong
` (3 subsequent siblings)
4 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:26 UTC (permalink / raw)
To: djwong, miklos
Cc: mszeredi, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
From: Miklos Szeredi <mszeredi@redhat.com>
FUSE_INIT has always been asynchronous with mount: the server processes
the request after the mount syscall has returned.
As a result, FUSE_INIT can't supply the root inode's ID, which is why it
currently has a hardcoded value. There are other limitations, such as not
being able to perform getxattr during mount, which is needed by selinux.
To remove these limitations, allow the server to process FUSE_INIT while
initializing the in-core super block for the fuse filesystem. This can
only be done if the server is prepared to handle this, so add the
FUSE_DEV_IOC_SYNC_INIT ioctl, which
a) lets the server know whether this feature is supported, returning
ENOTTY otherwise.
b) lets the kernel know to perform a synchronous initialization.
The implementation is slightly tricky, since fuse_dev/fuse_conn are set up
only during super block creation. This is solved by setting the private
data of the fuse device file to a special value ((struct fuse_dev *) 1) and
waiting for this to be turned into a proper fuse_dev before commencing with
operations on the device file.
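The sentinel trick can be sketched in userspace as follows. This is an illustrative model only, with hypothetical `demo_` names: a slot holds either NULL, the marker value 1 ("sync init requested, no device yet"), or a real pointer, and readers mask off the low bit so that the marker is never dereferenced. It relies on real objects being at least 2-byte aligned, so their low address bit is always clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_dev {
	int id;
};

/* Marker stored in the slot before a real device exists. */
#define DEMO_SYNC_INIT	((void *)(uintptr_t)1)
#define DEMO_PTR_MASK	(~(uintptr_t)1)

/* Return the real device pointer, or NULL if none is installed yet. */
static struct demo_dev *demo_get_dev(void *slot_value)
{
	return (struct demo_dev *)((uintptr_t)slot_value & DEMO_PTR_MASK);
}

/* True while the slot only holds the "sync init requested" marker. */
static bool demo_wants_sync_init(void *slot_value)
{
	return slot_value == DEMO_SYNC_INIT;
}

/* Replace the marker with a real device once it has been set up. */
static void demo_install(void **slot, struct demo_dev *dev)
{
	/* The scheme only works if real pointers never have bit 0 set. */
	assert(((uintptr_t)dev & 1) == 0);
	*slot = dev;
}
```

In the patch itself, code that needs a usable device waits on a wait queue until the marker has been replaced; the sketch shows only the pointer encoding, not the waiting.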
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_dev_i.h | 13 +++++++-
fs/fuse/fuse_i.h | 5 ++-
include/uapi/linux/fuse.h | 1 +
fs/fuse/cuse.c | 3 +-
fs/fuse/dev.c | 74 +++++++++++++++++++++++++++++++++------------
fs/fuse/dev_uring.c | 4 +-
fs/fuse/inode.c | 50 ++++++++++++++++++++++++------
7 files changed, 115 insertions(+), 35 deletions(-)
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 5a9bd771a3193d..6e8373f970409e 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -12,6 +12,8 @@
#define FUSE_INT_REQ_BIT (1ULL << 0)
#define FUSE_REQ_ID_STEP (1ULL << 1)
+extern struct wait_queue_head fuse_dev_waitq;
+
struct fuse_arg;
struct fuse_args;
struct fuse_pqueue;
@@ -37,15 +39,22 @@ struct fuse_copy_state {
} ring;
};
-static inline struct fuse_dev *fuse_get_dev(struct file *file)
+#define FUSE_DEV_SYNC_INIT ((struct fuse_dev *) 1)
+#define FUSE_DEV_PTR_MASK (~1UL)
+
+static inline struct fuse_dev *__fuse_get_dev(struct file *file)
{
/*
* Lockless access is OK, because file->private data is set
* once during mount and is valid until the file is released.
*/
- return READ_ONCE(file->private_data);
+ struct fuse_dev *fud = READ_ONCE(file->private_data);
+
+ return (typeof(fud)) ((unsigned long) fud & FUSE_DEV_PTR_MASK);
}
+struct fuse_dev *fuse_get_dev(struct file *file);
+
unsigned int fuse_req_hash(u64 unique);
struct fuse_req *fuse_request_find(struct fuse_pqueue *fpq, u64 unique);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b9306678dcda0d..02f0138e2fe443 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -909,6 +909,9 @@ struct fuse_conn {
/* Should this filesystem behave like a local filesystem? */
unsigned int local_fs:1;
+ /* Is synchronous FUSE_INIT allowed? */
+ unsigned int sync_init:1;
+
/* Use io_uring for communication */
unsigned int io_uring;
@@ -1366,7 +1369,7 @@ struct fuse_dev *fuse_dev_alloc_install(struct fuse_conn *fc);
struct fuse_dev *fuse_dev_alloc(void);
void fuse_dev_install(struct fuse_dev *fud, struct fuse_conn *fc);
void fuse_dev_free(struct fuse_dev *fud);
-void fuse_send_init(struct fuse_mount *fm);
+int fuse_send_init(struct fuse_mount *fm);
/**
* Fill in superblock and initialize fuse connection
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 122d6586e8d4da..1d76d0332f46f6 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1126,6 +1126,7 @@ struct fuse_backing_map {
#define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_SYNC_INIT _IO(FUSE_DEV_IOC_MAGIC, 3)
struct fuse_lseek_in {
uint64_t fh;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index b39844d75a806f..28c96961e85d1c 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -52,6 +52,7 @@
#include <linux/user_namespace.h>
#include "fuse_i.h"
+#include "fuse_dev_i.h"
#define CUSE_CONNTBL_LEN 64
@@ -547,7 +548,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
*/
static int cuse_channel_release(struct inode *inode, struct file *file)
{
- struct fuse_dev *fud = file->private_data;
+ struct fuse_dev *fud = __fuse_get_dev(file);
struct cuse_conn *cc = fc_to_cc(fud->fc);
/* remove from the conntbl, no more access from this point on */
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index f06208e4364642..e5aaf0c668bc11 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1548,14 +1548,34 @@ static int fuse_dev_open(struct inode *inode, struct file *file)
return 0;
}
+struct fuse_dev *fuse_get_dev(struct file *file)
+{
+ struct fuse_dev *fud = __fuse_get_dev(file);
+ int err;
+
+ if (likely(fud))
+ return fud;
+
+ err = wait_event_interruptible(fuse_dev_waitq,
+ READ_ONCE(file->private_data) != FUSE_DEV_SYNC_INIT);
+ if (err)
+ return ERR_PTR(err);
+
+ fud = __fuse_get_dev(file);
+ if (!fud)
+ return ERR_PTR(-EPERM);
+
+ return fud;
+}
+
static ssize_t fuse_dev_read(struct kiocb *iocb, struct iov_iter *to)
{
struct fuse_copy_state cs;
struct file *file = iocb->ki_filp;
struct fuse_dev *fud = fuse_get_dev(file);
- if (!fud)
- return -EPERM;
+ if (IS_ERR(fud))
+ return PTR_ERR(fud);
if (!user_backed_iter(to))
return -EINVAL;
@@ -1575,8 +1595,8 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
struct fuse_copy_state cs;
struct fuse_dev *fud = fuse_get_dev(in);
- if (!fud)
- return -EPERM;
+ if (IS_ERR(fud))
+ return PTR_ERR(fud);
bufs = kvmalloc_array(pipe->max_usage, sizeof(struct pipe_buffer),
GFP_KERNEL);
@@ -2251,8 +2271,8 @@ static ssize_t fuse_dev_write(struct kiocb *iocb, struct iov_iter *from)
struct fuse_copy_state cs;
struct fuse_dev *fud = fuse_get_dev(iocb->ki_filp);
- if (!fud)
- return -EPERM;
+ if (IS_ERR(fud))
+ return PTR_ERR(fud);
if (!user_backed_iter(from))
return -EINVAL;
@@ -2276,8 +2296,8 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
ssize_t ret;
fud = fuse_get_dev(out);
- if (!fud)
- return -EPERM;
+ if (IS_ERR(fud))
+ return PTR_ERR(fud);
pipe_lock(pipe);
@@ -2361,7 +2381,7 @@ static __poll_t fuse_dev_poll(struct file *file, poll_table *wait)
struct fuse_iqueue *fiq;
struct fuse_dev *fud = fuse_get_dev(file);
- if (!fud)
+ if (IS_ERR(fud))
return EPOLLERR;
fiq = &fud->fc->iq;
@@ -2540,7 +2560,7 @@ void fuse_wait_aborted(struct fuse_conn *fc)
int fuse_dev_release(struct inode *inode, struct file *file)
{
- struct fuse_dev *fud = fuse_get_dev(file);
+ struct fuse_dev *fud = __fuse_get_dev(file);
if (fud) {
struct fuse_conn *fc = fud->fc;
@@ -2571,8 +2591,8 @@ static int fuse_dev_fasync(int fd, struct file *file, int on)
{
struct fuse_dev *fud = fuse_get_dev(file);
- if (!fud)
- return -EPERM;
+ if (IS_ERR(fud))
+ return PTR_ERR(fud);
/* No locking - fasync_helper does its own locking */
return fasync_helper(fd, file, on, &fud->fc->iq.fasync);
@@ -2582,7 +2602,7 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
{
struct fuse_dev *fud;
- if (new->private_data)
+ if (__fuse_get_dev(new))
return -EINVAL;
fud = fuse_dev_alloc_install(fc);
@@ -2613,7 +2633,7 @@ static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
* uses the same ioctl handler.
*/
if (fd_file(f)->f_op == file->f_op)
- fud = fuse_get_dev(fd_file(f));
+ fud = __fuse_get_dev(fd_file(f));
res = -EINVAL;
if (fud) {
@@ -2631,8 +2651,8 @@ static long fuse_dev_ioctl_backing_open(struct file *file,
struct fuse_dev *fud = fuse_get_dev(file);
struct fuse_backing_map map;
- if (!fud)
- return -EPERM;
+ if (IS_ERR(fud))
+ return PTR_ERR(fud);
if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
return -EOPNOTSUPP;
@@ -2648,8 +2668,8 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
struct fuse_dev *fud = fuse_get_dev(file);
int backing_id;
- if (!fud)
- return -EPERM;
+ if (IS_ERR(fud))
+ return PTR_ERR(fud);
if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
return -EOPNOTSUPP;
@@ -2660,6 +2680,19 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
return fuse_backing_close(fud->fc, backing_id);
}
+static long fuse_dev_ioctl_sync_init(struct file *file)
+{
+ int err = -EINVAL;
+
+ mutex_lock(&fuse_mutex);
+ if (!__fuse_get_dev(file)) {
+ WRITE_ONCE(file->private_data, FUSE_DEV_SYNC_INIT);
+ err = 0;
+ }
+ mutex_unlock(&fuse_mutex);
+ return err;
+}
+
static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -2675,6 +2708,9 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
case FUSE_DEV_IOC_BACKING_CLOSE:
return fuse_dev_ioctl_backing_close(file, argp);
+ case FUSE_DEV_IOC_SYNC_INIT:
+ return fuse_dev_ioctl_sync_init(file);
+
default:
return -ENOTTY;
}
@@ -2683,7 +2719,7 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
#ifdef CONFIG_PROC_FS
static void fuse_dev_show_fdinfo(struct seq_file *seq, struct file *file)
{
- struct fuse_dev *fud = fuse_get_dev(file);
+ struct fuse_dev *fud = __fuse_get_dev(file);
if (!fud)
return;
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 7b541aeea1813f..6862fe6b7799a7 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1140,9 +1140,9 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
return -EINVAL;
fud = fuse_get_dev(cmd->file);
- if (!fud) {
+ if (IS_ERR(fud)) {
pr_info_ratelimited("No fuse device found\n");
- return -ENOTCONN;
+ return PTR_ERR(fud);
}
fc = fud->fc;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 869d8a87bfb628..14c35ce12b87d6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -7,6 +7,7 @@
*/
#include "fuse_i.h"
+#include "fuse_dev_i.h"
#include "dev_uring_i.h"
#include <linux/dax.h>
@@ -34,6 +35,7 @@ MODULE_LICENSE("GPL");
static struct kmem_cache *fuse_inode_cachep;
struct list_head fuse_conn_list;
DEFINE_MUTEX(fuse_mutex);
+DECLARE_WAIT_QUEUE_HEAD(fuse_dev_waitq);
static int set_global_limit(const char *val, const struct kernel_param *kp);
@@ -1472,7 +1474,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
wake_up_all(&fc->blocked_waitq);
}
-void fuse_send_init(struct fuse_mount *fm)
+static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
{
struct fuse_init_args *ia;
u64 flags;
@@ -1531,10 +1533,29 @@ void fuse_send_init(struct fuse_mount *fm)
ia->args.out_args[0].value = &ia->out;
ia->args.force = true;
ia->args.nocreds = true;
- ia->args.end = process_init_reply;
- if (fuse_simple_background(fm, &ia->args, GFP_KERNEL) != 0)
- process_init_reply(fm, &ia->args, -ENOTCONN);
+ return ia;
+}
+
+int fuse_send_init(struct fuse_mount *fm)
+{
+ struct fuse_init_args *ia = fuse_new_init(fm);
+ int err;
+
+ if (fm->fc->sync_init) {
+ err = fuse_simple_request(fm, &ia->args);
+ /* Ignore size of init reply */
+ if (err > 0)
+ err = 0;
+ } else {
+ ia->args.end = process_init_reply;
+ err = fuse_simple_background(fm, &ia->args, GFP_KERNEL);
+ if (!err)
+ return 0;
+ err = -ENOTCONN;
+ }
+ process_init_reply(fm, &ia->args, err);
+ return err;
}
EXPORT_SYMBOL_GPL(fuse_send_init);
@@ -1877,8 +1898,12 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
mutex_lock(&fuse_mutex);
err = -EINVAL;
- if (ctx->fudptr && *ctx->fudptr)
- goto err_unlock;
+ if (ctx->fudptr && *ctx->fudptr) {
+ if (*ctx->fudptr == FUSE_DEV_SYNC_INIT) {
+ fc->sync_init = 1;
+ } else
+ goto err_unlock;
+ }
err = fuse_ctl_add_conn(fc);
if (err)
@@ -1886,8 +1911,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
list_add_tail(&fc->entry, &fuse_conn_list);
sb->s_root = root_dentry;
- if (ctx->fudptr)
+ if (ctx->fudptr) {
*ctx->fudptr = fud;
+ wake_up_all(&fuse_dev_waitq);
+ }
mutex_unlock(&fuse_mutex);
return 0;
@@ -1908,6 +1935,7 @@ EXPORT_SYMBOL_GPL(fuse_fill_super_common);
static int fuse_fill_super(struct super_block *sb, struct fs_context *fsc)
{
struct fuse_fs_context *ctx = fsc->fs_private;
+ struct fuse_mount *fm;
int err;
if (!ctx->file || !ctx->rootmode_present ||
@@ -1928,8 +1956,10 @@ static int fuse_fill_super(struct super_block *sb, struct fs_context *fsc)
return err;
/* file->private_data shall be visible on all CPUs after this */
smp_mb();
- fuse_send_init(get_fuse_mount_super(sb));
- return 0;
+
+ fm = get_fuse_mount_super(sb);
+
+ return fuse_send_init(fm);
}
/*
@@ -1990,7 +2020,7 @@ static int fuse_get_tree(struct fs_context *fsc)
* Allow creating a fuse mount with an already initialized fuse
* connection
*/
- fud = READ_ONCE(ctx->file->private_data);
+ fud = __fuse_get_dev(ctx->file);
if (ctx->file->f_op == &fuse_dev_operations && fud) {
fsc->sget_key = fud->fc;
sb = sget_fc(fsc, fuse_test_super, fuse_set_no_super);
^ permalink raw reply related [flat|nested] 126+ messages in thread
* [PATCH 2/5] fuse: move the backing file idr and code into a new source file
2025-09-16 0:18 ` [PATCHSET RFC v5 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
2025-09-16 0:26 ` [PATCH 1/5] fuse: allow synchronous FUSE_INIT Darrick J. Wong
@ 2025-09-16 0:27 ` Darrick J. Wong
2025-09-25 14:11 ` Miklos Szeredi
2025-09-16 0:27 ` [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
` (2 subsequent siblings)
4 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:27 UTC (permalink / raw)
To: djwong, miklos
Cc: amir73il, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
iomap support for fuse is also going to want the ability to attach
backing files to a fuse filesystem. Move the fuse_backing code into a
separate file so that both passthrough and iomap can use it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
fs/fuse/fuse_i.h | 47 +++++++------
fs/fuse/Makefile | 2 -
fs/fuse/backing.c | 179 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/passthrough.c | 163 ---------------------------------------------
4 files changed, 208 insertions(+), 183 deletions(-)
create mode 100644 fs/fuse/backing.c
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 02f0138e2fe443..52db609e63eb54 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1570,29 +1570,11 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
void fuse_file_release(struct inode *inode, struct fuse_file *ff,
unsigned int open_flags, fl_owner_t id, bool isdir);
-/* passthrough.c */
-static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
-{
-#ifdef CONFIG_FUSE_PASSTHROUGH
- return READ_ONCE(fi->fb);
-#else
- return NULL;
-#endif
-}
-
-static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
- struct fuse_backing *fb)
-{
-#ifdef CONFIG_FUSE_PASSTHROUGH
- return xchg(&fi->fb, fb);
-#else
- return NULL;
-#endif
-}
-
+/* backing.c */
#ifdef CONFIG_FUSE_PASSTHROUGH
struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
void fuse_backing_put(struct fuse_backing *fb);
+struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
#else
static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
@@ -1603,6 +1585,11 @@ static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
static inline void fuse_backing_put(struct fuse_backing *fb)
{
}
+static inline struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
+ int backing_id)
+{
+ return NULL;
+}
#endif
void fuse_backing_files_init(struct fuse_conn *fc);
@@ -1610,6 +1597,26 @@ void fuse_backing_files_free(struct fuse_conn *fc);
int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map);
int fuse_backing_close(struct fuse_conn *fc, int backing_id);
+/* passthrough.c */
+static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
+{
+#ifdef CONFIG_FUSE_PASSTHROUGH
+ return READ_ONCE(fi->fb);
+#else
+ return NULL;
+#endif
+}
+
+static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
+ struct fuse_backing *fb)
+{
+#ifdef CONFIG_FUSE_PASSTHROUGH
+ return xchg(&fi->fb, fb);
+#else
+ return NULL;
+#endif
+}
+
struct fuse_backing *fuse_passthrough_open(struct file *file,
struct inode *inode,
int backing_id);
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3f0f312a31c1cc..8ddd8f0b204ee5 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
fuse-y += iomode.o
fuse-$(CONFIG_FUSE_DAX) += dax.o
-fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
+fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
new file mode 100644
index 00000000000000..4afda419dd1416
--- /dev/null
+++ b/fs/fuse/backing.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FUSE passthrough to backing file.
+ *
+ * Copyright (c) 2023 CTERA Networks.
+ */
+
+#include "fuse_i.h"
+
+#include <linux/file.h>
+
+struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
+{
+ if (fb && refcount_inc_not_zero(&fb->count))
+ return fb;
+ return NULL;
+}
+
+static void fuse_backing_free(struct fuse_backing *fb)
+{
+ pr_debug("%s: fb=0x%p\n", __func__, fb);
+
+ if (fb->file)
+ fput(fb->file);
+ put_cred(fb->cred);
+ kfree_rcu(fb, rcu);
+}
+
+void fuse_backing_put(struct fuse_backing *fb)
+{
+ if (fb && refcount_dec_and_test(&fb->count))
+ fuse_backing_free(fb);
+}
+
+void fuse_backing_files_init(struct fuse_conn *fc)
+{
+ idr_init(&fc->backing_files_map);
+}
+
+static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
+{
+ int id;
+
+ idr_preload(GFP_KERNEL);
+ spin_lock(&fc->lock);
+ /* FIXME: xarray might be space inefficient */
+ id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
+ spin_unlock(&fc->lock);
+ idr_preload_end();
+
+ WARN_ON_ONCE(id == 0);
+ return id;
+}
+
+static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
+ int id)
+{
+ struct fuse_backing *fb;
+
+ spin_lock(&fc->lock);
+ fb = idr_remove(&fc->backing_files_map, id);
+ spin_unlock(&fc->lock);
+
+ return fb;
+}
+
+static int fuse_backing_id_free(int id, void *p, void *data)
+{
+ struct fuse_backing *fb = p;
+
+ WARN_ON_ONCE(refcount_read(&fb->count) != 1);
+ fuse_backing_free(fb);
+ return 0;
+}
+
+void fuse_backing_files_free(struct fuse_conn *fc)
+{
+ idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
+ idr_destroy(&fc->backing_files_map);
+}
+
+int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
+{
+ struct file *file;
+ struct super_block *backing_sb;
+ struct fuse_backing *fb = NULL;
+ int res;
+
+ pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
+
+ /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
+ res = -EPERM;
+ if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
+ goto out;
+
+ res = -EINVAL;
+ if (map->flags || map->padding)
+ goto out;
+
+ file = fget_raw(map->fd);
+ res = -EBADF;
+ if (!file)
+ goto out;
+
+ /* read/write/splice/mmap passthrough only relevant for regular files */
+ res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL;
+ if (!d_is_reg(file->f_path.dentry))
+ goto out_fput;
+
+ backing_sb = file_inode(file)->i_sb;
+ res = -ELOOP;
+ if (backing_sb->s_stack_depth >= fc->max_stack_depth)
+ goto out_fput;
+
+ fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
+ res = -ENOMEM;
+ if (!fb)
+ goto out_fput;
+
+ fb->file = file;
+ fb->cred = prepare_creds();
+ refcount_set(&fb->count, 1);
+
+ res = fuse_backing_id_alloc(fc, fb);
+ if (res < 0) {
+ fuse_backing_free(fb);
+ fb = NULL;
+ }
+
+out:
+ pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
+
+ return res;
+
+out_fput:
+ fput(file);
+ goto out;
+}
+
+int fuse_backing_close(struct fuse_conn *fc, int backing_id)
+{
+ struct fuse_backing *fb = NULL;
+ int err;
+
+ pr_debug("%s: backing_id=%d\n", __func__, backing_id);
+
+ /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
+ err = -EPERM;
+ if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
+ goto out;
+
+ err = -EINVAL;
+ if (backing_id <= 0)
+ goto out;
+
+ err = -ENOENT;
+ fb = fuse_backing_id_remove(fc, backing_id);
+ if (!fb)
+ goto out;
+
+ fuse_backing_put(fb);
+ err = 0;
+out:
+ pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
+
+ return err;
+}
+
+struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id)
+{
+ struct fuse_backing *fb;
+
+ rcu_read_lock();
+ fb = idr_find(&fc->backing_files_map, backing_id);
+ fb = fuse_backing_get(fb);
+ rcu_read_unlock();
+
+ return fb;
+}
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index eb97ac009e75d9..e0b8d885bc81f3 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -144,163 +144,6 @@ ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
return backing_file_mmap(backing_file, vma, &ctx);
}
-struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
-{
- if (fb && refcount_inc_not_zero(&fb->count))
- return fb;
- return NULL;
-}
-
-static void fuse_backing_free(struct fuse_backing *fb)
-{
- pr_debug("%s: fb=0x%p\n", __func__, fb);
-
- if (fb->file)
- fput(fb->file);
- put_cred(fb->cred);
- kfree_rcu(fb, rcu);
-}
-
-void fuse_backing_put(struct fuse_backing *fb)
-{
- if (fb && refcount_dec_and_test(&fb->count))
- fuse_backing_free(fb);
-}
-
-void fuse_backing_files_init(struct fuse_conn *fc)
-{
- idr_init(&fc->backing_files_map);
-}
-
-static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
-{
- int id;
-
- idr_preload(GFP_KERNEL);
- spin_lock(&fc->lock);
- /* FIXME: xarray might be space inefficient */
- id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
- spin_unlock(&fc->lock);
- idr_preload_end();
-
- WARN_ON_ONCE(id == 0);
- return id;
-}
-
-static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
- int id)
-{
- struct fuse_backing *fb;
-
- spin_lock(&fc->lock);
- fb = idr_remove(&fc->backing_files_map, id);
- spin_unlock(&fc->lock);
-
- return fb;
-}
-
-static int fuse_backing_id_free(int id, void *p, void *data)
-{
- struct fuse_backing *fb = p;
-
- WARN_ON_ONCE(refcount_read(&fb->count) != 1);
- fuse_backing_free(fb);
- return 0;
-}
-
-void fuse_backing_files_free(struct fuse_conn *fc)
-{
- idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
- idr_destroy(&fc->backing_files_map);
-}
-
-int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
-{
- struct file *file;
- struct super_block *backing_sb;
- struct fuse_backing *fb = NULL;
- int res;
-
- pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
-
- /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
- res = -EPERM;
- if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
- goto out;
-
- res = -EINVAL;
- if (map->flags || map->padding)
- goto out;
-
- file = fget_raw(map->fd);
- res = -EBADF;
- if (!file)
- goto out;
-
- /* read/write/splice/mmap passthrough only relevant for regular files */
- res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL;
- if (!d_is_reg(file->f_path.dentry))
- goto out_fput;
-
- backing_sb = file_inode(file)->i_sb;
- res = -ELOOP;
- if (backing_sb->s_stack_depth >= fc->max_stack_depth)
- goto out_fput;
-
- fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
- res = -ENOMEM;
- if (!fb)
- goto out_fput;
-
- fb->file = file;
- fb->cred = prepare_creds();
- refcount_set(&fb->count, 1);
-
- res = fuse_backing_id_alloc(fc, fb);
- if (res < 0) {
- fuse_backing_free(fb);
- fb = NULL;
- }
-
-out:
- pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
-
- return res;
-
-out_fput:
- fput(file);
- goto out;
-}
-
-int fuse_backing_close(struct fuse_conn *fc, int backing_id)
-{
- struct fuse_backing *fb = NULL;
- int err;
-
- pr_debug("%s: backing_id=%d\n", __func__, backing_id);
-
- /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
- err = -EPERM;
- if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
- goto out;
-
- err = -EINVAL;
- if (backing_id <= 0)
- goto out;
-
- err = -ENOENT;
- fb = fuse_backing_id_remove(fc, backing_id);
- if (!fb)
- goto out;
-
- fuse_backing_put(fb);
- err = 0;
-out:
- pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
-
- return err;
-}
-
/*
* Setup passthrough to a backing file.
*
@@ -320,12 +163,8 @@ struct fuse_backing *fuse_passthrough_open(struct file *file,
if (backing_id <= 0)
goto out;
- rcu_read_lock();
- fb = idr_find(&fc->backing_files_map, backing_id);
- fb = fuse_backing_get(fb);
- rcu_read_unlock();
-
err = -ENOENT;
+ fb = fuse_backing_lookup(fc, backing_id);
if (!fb)
goto out;
^ permalink raw reply related [flat|nested] 126+ messages in thread
* [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c
2025-09-16 0:18 ` [PATCHSET RFC v5 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
2025-09-16 0:26 ` [PATCH 1/5] fuse: allow synchronous FUSE_INIT Darrick J. Wong
2025-09-16 0:27 ` [PATCH 2/5] fuse: move the backing file idr and code into a new source file Darrick J. Wong
@ 2025-09-16 0:27 ` Darrick J. Wong
2025-09-17 2:47 ` Amir Goldstein
2025-09-16 0:27 ` [PATCH 4/5] fuse_trace: " Darrick J. Wong
2025-09-16 0:27 ` [PATCH 5/5] fuse: move CREATE_TRACE_POINTS to a separate file Darrick J. Wong
4 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:27 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
In preparation for iomap, move the passthrough-specific validation code
back to passthrough.c and create a new Kconfig item for conditional
compilation of backing.c. In the next patch, iomap will share the
backing structures.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
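Not part of the patch: a rough userspace sketch of the dispatch this patch
introduces, where the low byte of fuse_backing_map.flags selects an ops
table and the remaining bits are handed to the optional may_admin hook.
The two FUSE_BACKING_* values are copied from the uapi hunk below.
Everything named sketch_* is made up for illustration, and the may_admin
policy shown here (reject any extra flag bits) is an assumption, not what
the real passthrough hook does.

```c
/* Userspace sketch of type-based ops dispatch; not kernel code. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the uapi hunk in this patch. */
#define FUSE_BACKING_TYPE_MASK		(0xFF)
#define FUSE_BACKING_TYPE_PASSTHROUGH	(0)

/* Simplified stand-in for struct fuse_backing_ops; the hook is optional. */
struct sketch_backing_ops {
	int (*may_admin)(uint32_t flags);
	unsigned int type;
};

/* Assumed policy for the sketch: this backing type takes no extra flags. */
static int sketch_passthrough_may_admin(uint32_t flags)
{
	return flags ? -EINVAL : 0;
}

static const struct sketch_backing_ops sketch_passthrough_ops = {
	.may_admin = sketch_passthrough_may_admin,
	.type = FUSE_BACKING_TYPE_PASSTHROUGH,
};

/* The low byte of flags selects the ops table; unknown types yield NULL. */
static const struct sketch_backing_ops *sketch_ops_from_flags(uint32_t flags)
{
	switch (flags & FUSE_BACKING_TYPE_MASK) {
	case FUSE_BACKING_TYPE_PASSTHROUGH:
		return &sketch_passthrough_ops;
	default:
		return NULL;
	}
}

/* Mirrors the start of fuse_backing_open(): pick ops, then ask permission. */
static int sketch_backing_open_check(uint32_t flags)
{
	const struct sketch_backing_ops *ops = sketch_ops_from_flags(flags);
	uint32_t op_flags = flags & ~(uint32_t) FUSE_BACKING_TYPE_MASK;

	if (!ops)
		return -EOPNOTSUPP;
	return ops->may_admin ? ops->may_admin(op_flags) : 0;
}
```

With only one type defined, flags == 0 keeps its old meaning for existing
passthrough servers, while any other type value fails with -EOPNOTSUPP
until a matching ops table is added.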
fs/fuse/fuse_i.h | 23 +++++++++--
include/uapi/linux/fuse.h | 8 +++-
fs/fuse/Kconfig | 4 ++
fs/fuse/Makefile | 3 +
fs/fuse/backing.c | 95 ++++++++++++++++++++++++++++++++++-----------
fs/fuse/dev.c | 4 +-
fs/fuse/inode.c | 4 +-
fs/fuse/passthrough.c | 37 +++++++++++++++++-
8 files changed, 144 insertions(+), 34 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 52db609e63eb54..4560687d619d76 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -96,10 +96,21 @@ struct fuse_submount_lookup {
struct fuse_forget_link *forget;
};
+struct fuse_conn;
+
+/** Operations for subsystems that want to use a backing file */
+struct fuse_backing_ops {
+ int (*may_admin)(struct fuse_conn *fc, uint32_t flags);
+ int (*may_open)(struct fuse_conn *fc, struct file *file);
+ int (*may_close)(struct fuse_conn *fc, struct file *file);
+ unsigned int type;
+};
+
/** Container for data related to mapping to backing file */
struct fuse_backing {
struct file *file;
struct cred *cred;
+ const struct fuse_backing_ops *ops;
/** refcount */
refcount_t count;
@@ -968,7 +979,7 @@ struct fuse_conn {
/* New writepages go into this bucket */
struct fuse_sync_bucket __rcu *curr_bucket;
-#ifdef CONFIG_FUSE_PASSTHROUGH
+#ifdef CONFIG_FUSE_BACKING
/** IDR for backing files ids */
struct idr backing_files_map;
#endif
@@ -1571,10 +1582,12 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
unsigned int open_flags, fl_owner_t id, bool isdir);
/* backing.c */
-#ifdef CONFIG_FUSE_PASSTHROUGH
+#ifdef CONFIG_FUSE_BACKING
struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
void fuse_backing_put(struct fuse_backing *fb);
-struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
+struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
+ const struct fuse_backing_ops *ops,
+ int backing_id);
#else
static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
@@ -1631,6 +1644,10 @@ static inline struct file *fuse_file_passthrough(struct fuse_file *ff)
#endif
}
+#ifdef CONFIG_FUSE_PASSTHROUGH
+extern const struct fuse_backing_ops fuse_passthrough_backing_ops;
+#endif
+
ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *iter);
ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *iter);
ssize_t fuse_passthrough_splice_read(struct file *in, loff_t *ppos,
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1d76d0332f46f6..31b80f93211b81 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1114,9 +1114,15 @@ struct fuse_notify_retrieve_in {
uint64_t dummy4;
};
+#define FUSE_BACKING_TYPE_MASK (0xFF)
+#define FUSE_BACKING_TYPE_PASSTHROUGH (0)
+#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH)
+
+#define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK)
+
struct fuse_backing_map {
int32_t fd;
- uint32_t flags;
+ uint32_t flags; /* FUSE_BACKING_* */
uint64_t padding;
};
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index a774166264de69..9563fa5387a241 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -59,12 +59,16 @@ config FUSE_PASSTHROUGH
default y
depends on FUSE_FS
select FS_STACK
+ select FUSE_BACKING
help
This allows bypassing FUSE server by mapping specific FUSE operations
to be performed directly on a backing file.
If you want to allow passthrough operations, answer Y.
+config FUSE_BACKING
+ bool
+
config FUSE_IO_URING
bool "FUSE communication over io-uring"
default y
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 8ddd8f0b204ee5..36be6d715b111a 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -13,7 +13,8 @@ obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
fuse-y += iomode.o
fuse-$(CONFIG_FUSE_DAX) += dax.o
-fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
+fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
+fuse-$(CONFIG_FUSE_BACKING) += backing.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
index 4afda419dd1416..da0dff288396ed 100644
--- a/fs/fuse/backing.c
+++ b/fs/fuse/backing.c
@@ -6,6 +6,7 @@
*/
#include "fuse_i.h"
+#include "fuse_trace.h"
#include <linux/file.h>
@@ -69,32 +70,53 @@ static int fuse_backing_id_free(int id, void *p, void *data)
struct fuse_backing *fb = p;
WARN_ON_ONCE(refcount_read(&fb->count) != 1);
+
fuse_backing_free(fb);
return 0;
}
void fuse_backing_files_free(struct fuse_conn *fc)
{
- idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
+ idr_for_each(&fc->backing_files_map, fuse_backing_id_free, fc);
idr_destroy(&fc->backing_files_map);
}
+static inline const struct fuse_backing_ops *
+fuse_backing_ops_from_map(const struct fuse_backing_map *map)
+{
+ switch (map->flags & FUSE_BACKING_TYPE_MASK) {
+#ifdef CONFIG_FUSE_PASSTHROUGH
+ case FUSE_BACKING_TYPE_PASSTHROUGH:
+ return &fuse_passthrough_backing_ops;
+#endif
+ default:
+ break;
+ }
+
+ return NULL;
+}
+
int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
{
struct file *file;
- struct super_block *backing_sb;
struct fuse_backing *fb = NULL;
+ const struct fuse_backing_ops *ops = fuse_backing_ops_from_map(map);
+ uint32_t op_flags = map->flags & ~FUSE_BACKING_TYPE_MASK;
int res;
pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
- /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
- res = -EPERM;
- if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
+ res = -EOPNOTSUPP;
+ if (!ops)
+ goto out;
+ WARN_ON(ops->type != (map->flags & FUSE_BACKING_TYPE_MASK));
+
+ res = ops->may_admin ? ops->may_admin(fc, op_flags) : 0;
+ if (res)
goto out;
res = -EINVAL;
- if (map->flags || map->padding)
+ if (map->padding)
goto out;
file = fget_raw(map->fd);
@@ -102,14 +124,8 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
if (!file)
goto out;
- /* read/write/splice/mmap passthrough only relevant for regular files */
- res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL;
- if (!d_is_reg(file->f_path.dentry))
- goto out_fput;
-
- backing_sb = file_inode(file)->i_sb;
- res = -ELOOP;
- if (backing_sb->s_stack_depth >= fc->max_stack_depth)
+ res = ops->may_open ? ops->may_open(fc, file) : 0;
+ if (res)
goto out_fput;
fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
@@ -119,14 +135,15 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
fb->file = file;
fb->cred = prepare_creds();
+ fb->ops = ops;
refcount_set(&fb->count, 1);
res = fuse_backing_id_alloc(fc, fb);
if (res < 0) {
fuse_backing_free(fb);
fb = NULL;
+ goto out;
}
-
out:
pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
@@ -137,41 +154,71 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
goto out;
}
+static struct fuse_backing *__fuse_backing_lookup(struct fuse_conn *fc,
+ int backing_id)
+{
+ struct fuse_backing *fb;
+
+ rcu_read_lock();
+ fb = idr_find(&fc->backing_files_map, backing_id);
+ fb = fuse_backing_get(fb);
+ rcu_read_unlock();
+
+ return fb;
+}
+
int fuse_backing_close(struct fuse_conn *fc, int backing_id)
{
- struct fuse_backing *fb = NULL;
+ struct fuse_backing *fb, *test_fb;
+ const struct fuse_backing_ops *ops;
int err;
pr_debug("%s: backing_id=%d\n", __func__, backing_id);
- /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
- err = -EPERM;
- if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
- goto out;
-
err = -EINVAL;
if (backing_id <= 0)
goto out;
err = -ENOENT;
- fb = fuse_backing_id_remove(fc, backing_id);
+ fb = __fuse_backing_lookup(fc, backing_id);
if (!fb)
goto out;
+ ops = fb->ops;
- fuse_backing_put(fb);
+ err = ops->may_admin ? ops->may_admin(fc, 0) : 0;
+ if (err)
+ goto out_fb;
+
+ err = ops->may_close ? ops->may_close(fc, fb->file) : 0;
+ if (err)
+ goto out_fb;
+
+ err = -ENOENT;
+ test_fb = fuse_backing_id_remove(fc, backing_id);
+ if (!test_fb)
+ goto out_fb;
+
+ WARN_ON(fb != test_fb);
err = 0;
+ fuse_backing_put(test_fb);
+out_fb:
+ fuse_backing_put(fb);
out:
pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
return err;
}
-struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id)
+struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
+ const struct fuse_backing_ops *ops,
+ int backing_id)
{
struct fuse_backing *fb;
rcu_read_lock();
fb = idr_find(&fc->backing_files_map, backing_id);
+ if (fb && fb->ops != ops)
+ fb = NULL;
fb = fuse_backing_get(fb);
rcu_read_unlock();
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e5aaf0c668bc11..281bc81f3b448b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2654,7 +2654,7 @@ static long fuse_dev_ioctl_backing_open(struct file *file,
if (IS_ERR(fud))
return PTR_ERR(fud);
- if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
+ if (!IS_ENABLED(CONFIG_FUSE_BACKING))
return -EOPNOTSUPP;
if (copy_from_user(&map, argp, sizeof(map)))
@@ -2671,7 +2671,7 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
if (IS_ERR(fud))
return PTR_ERR(fud);
- if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
+ if (!IS_ENABLED(CONFIG_FUSE_BACKING))
return -EOPNOTSUPP;
if (get_user(backing_id, argp))
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 14c35ce12b87d6..1e7298b2b89b58 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -995,7 +995,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
fc->name_max = FUSE_NAME_LOW_MAX;
fc->timeout.req_timeout = 0;
- if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
+ if (IS_ENABLED(CONFIG_FUSE_BACKING))
fuse_backing_files_init(fc);
INIT_LIST_HEAD(&fc->mounts);
@@ -1032,7 +1032,7 @@ void fuse_conn_put(struct fuse_conn *fc)
WARN_ON(atomic_read(&bucket->count) != 1);
kfree(bucket);
}
- if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
+ if (IS_ENABLED(CONFIG_FUSE_BACKING))
fuse_backing_files_free(fc);
call_rcu(&fc->rcu, delayed_release);
}
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index e0b8d885bc81f3..9792d7b12a775b 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -164,7 +164,7 @@ struct fuse_backing *fuse_passthrough_open(struct file *file,
goto out;
err = -ENOENT;
- fb = fuse_backing_lookup(fc, backing_id);
+ fb = fuse_backing_lookup(fc, &fuse_passthrough_backing_ops, backing_id);
if (!fb)
goto out;
@@ -197,3 +197,38 @@ void fuse_passthrough_release(struct fuse_file *ff, struct fuse_backing *fb)
put_cred(ff->cred);
ff->cred = NULL;
}
+
+static int fuse_passthrough_may_admin(struct fuse_conn *fc, unsigned int flags)
+{
+ /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
+ if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (flags)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int fuse_passthrough_may_open(struct fuse_conn *fc, struct file *file)
+{
+ struct super_block *backing_sb;
+ int res;
+
+ /* read/write/splice/mmap passthrough only relevant for regular files */
+ res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL;
+ if (!d_is_reg(file->f_path.dentry))
+ return res;
+
+ backing_sb = file_inode(file)->i_sb;
+ if (backing_sb->s_stack_depth >= fc->max_stack_depth)
+ return -ELOOP;
+
+ return 0;
+}
+
+const struct fuse_backing_ops fuse_passthrough_backing_ops = {
+ .type = FUSE_BACKING_TYPE_PASSTHROUGH,
+ .may_admin = fuse_passthrough_may_admin,
+ .may_open = fuse_passthrough_may_open,
+};
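Not part of the patch: a minimal userspace sketch of the optional-hook dispatch that fuse_backing_close() uses above, plus the ops filter added to fuse_backing_lookup(). All sketch_* names and the -EBUSY policy are hypothetical. The real hooks take a struct fuse_conn and a struct file, which are left out here so the example stands alone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_backing;

/* Optional per-type hooks; a NULL hook means "no objection". */
struct sketch_backing_ops {
	int (*may_admin)(unsigned int flags);
	int (*may_close)(const struct sketch_backing *fb);
};

struct sketch_backing {
	const struct sketch_backing_ops *ops;
	int busy;
};

/* Hypothetical may_close hook that refuses to drop a busy file. */
static int sketch_may_close(const struct sketch_backing *fb)
{
	return fb->busy ? -EBUSY : 0;
}

static const struct sketch_backing_ops sketch_strict_ops = {
	.may_close = sketch_may_close,
};

/* An ops table with no hooks at all: every check passes. */
static const struct sketch_backing_ops sketch_lenient_ops = { 0 };

/* Run the hooks in the same order as fuse_backing_close(). */
static int sketch_close_allowed(const struct sketch_backing *fb)
{
	const struct sketch_backing_ops *ops;
	int err;

	assert(fb && fb->ops);
	ops = fb->ops;

	err = ops->may_admin ? ops->may_admin(0) : 0;
	if (err)
		return err;

	return ops->may_close ? ops->may_close(fb) : 0;
}

/* Mirror the "fb->ops != ops" filter in fuse_backing_lookup(). */
static const struct sketch_backing *
sketch_lookup(const struct sketch_backing *fb,
	      const struct sketch_backing_ops *ops)
{
	return (fb && fb->ops == ops) ? fb : NULL;
}
```

Comparing the ops pointer during lookup means a backing file registered by one user (passthrough, say) cannot be picked up by another user that happens to be handed the same backing id.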
* [PATCH 4/5] fuse_trace: move the passthrough-specific code back to passthrough.c
2025-09-16 0:18 ` [PATCHSET RFC v5 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
` (2 preceding siblings ...)
2025-09-16 0:27 ` [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
@ 2025-09-16 0:27 ` Darrick J. Wong
2025-09-16 0:27 ` [PATCH 5/5] fuse: move CREATE_TRACE_POINTS to a separate file Darrick J. Wong
4 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:27 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add fuse_backing_open and fuse_backing_close tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 35 +++++++++++++++++++++++++++++++++++
fs/fuse/backing.c | 5 +++++
2 files changed, 40 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index bbe9ddd8c71696..286a0845dc0898 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -124,6 +124,41 @@ TRACE_EVENT(fuse_request_end,
__entry->unique, __entry->len, __entry->error)
);
+#ifdef CONFIG_FUSE_BACKING
+DECLARE_EVENT_CLASS(fuse_backing_class,
+ TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
+ const struct fuse_backing *fb),
+
+ TP_ARGS(fc, idx, fb),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(unsigned int, idx)
+ __field(unsigned long, ino)
+ ),
+
+ TP_fast_assign(
+ struct inode *inode = file_inode(fb->file);
+
+ __entry->connection = fc->dev;
+ __entry->idx = idx;
+ __entry->ino = inode->i_ino;
+ ),
+
+ TP_printk("connection %u idx %u ino 0x%lx",
+ __entry->connection,
+ __entry->idx,
+ __entry->ino)
+);
+#define DEFINE_FUSE_BACKING_EVENT(name) \
+DEFINE_EVENT(fuse_backing_class, name, \
+ TP_PROTO(const struct fuse_conn *fc, unsigned int idx, \
+ const struct fuse_backing *fb), \
+ TP_ARGS(fc, idx, fb))
+DEFINE_FUSE_BACKING_EVENT(fuse_backing_open);
+DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
+#endif /* CONFIG_FUSE_BACKING */
+
#endif /* _TRACE_FUSE_H */
#undef TRACE_INCLUDE_PATH
diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
index da0dff288396ed..229c101ab46b0e 100644
--- a/fs/fuse/backing.c
+++ b/fs/fuse/backing.c
@@ -71,6 +71,7 @@ static int fuse_backing_id_free(int id, void *p, void *data)
WARN_ON_ONCE(refcount_read(&fb->count) != 1);
+ trace_fuse_backing_close((struct fuse_conn *)data, id, fb);
fuse_backing_free(fb);
return 0;
}
@@ -144,6 +145,8 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
fb = NULL;
goto out;
}
+
+ trace_fuse_backing_open(fc, res, fb);
out:
pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
@@ -193,6 +196,8 @@ int fuse_backing_close(struct fuse_conn *fc, int backing_id)
if (err)
goto out_fb;
+ trace_fuse_backing_close(fc, backing_id, fb);
+
err = -ENOENT;
test_fb = fuse_backing_id_remove(fc, backing_id);
if (!test_fb)
* [PATCH 5/5] fuse: move CREATE_TRACE_POINTS to a separate file
2025-09-16 0:18 ` [PATCHSET RFC v5 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
` (3 preceding siblings ...)
2025-09-16 0:27 ` [PATCH 4/5] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:27 ` Darrick J. Wong
2025-09-25 14:25 ` Miklos Szeredi
4 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:27 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Before we start adding new tracepoints for fuse+iomap, move the
tracepoint creation itself to a separate source file so that we don't
have to start pulling iomap dependencies into dev.c just for the iomap
structures.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/Makefile | 3 ++-
fs/fuse/dev.c | 1 -
fs/fuse/trace.c | 13 +++++++++++++
3 files changed, 15 insertions(+), 2 deletions(-)
create mode 100644 fs/fuse/trace.c
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 36be6d715b111a..46041228e5be2c 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -10,7 +10,8 @@ obj-$(CONFIG_FUSE_FS) += fuse.o
obj-$(CONFIG_CUSE) += cuse.o
obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
-fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
+fuse-y := trace.o # put trace.o first so we see ftrace errors sooner
+fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
fuse-y += iomode.o
fuse-$(CONFIG_FUSE_DAX) += dax.o
fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 281bc81f3b448b..871877cac2acf3 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -26,7 +26,6 @@
#include <linux/seq_file.h>
#include <linux/nmi.h>
-#define CREATE_TRACE_POINTS
#include "fuse_trace.h"
MODULE_ALIAS_MISCDEV(FUSE_MINOR);
diff --git a/fs/fuse/trace.c b/fs/fuse/trace.c
new file mode 100644
index 00000000000000..93bd72efc98cd0
--- /dev/null
+++ b/fs/fuse/trace.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "dev_uring_i.h"
+#include "fuse_i.h"
+#include "fuse_dev_i.h"
+
+#include <linux/pagemap.h>
+
+#define CREATE_TRACE_POINTS
+#include "fuse_trace.h"
* [PATCH 01/28] fuse: implement the basic iomap mechanisms
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-09-16 0:28 ` Darrick J. Wong
2025-09-19 22:36 ` Joanne Koong
2025-09-16 0:28 ` [PATCH 02/28] fuse_trace: " Darrick J. Wong
` (26 subsequent siblings)
27 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:28 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Implement functions to enable upcalling of iomap_begin and iomap_end to
userspace fuse servers.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
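Not part of the patch: a minimal userspace sketch of the range checks that fuse_iomap_check_mapping() and fuse_iomap_begin_validate() apply to each mapping, which a server author might run against a reply before sending it. The struct layout and constants are copied from the uapi hunk below. sketch_hole() and sketch_mapping_covers() are hypothetical names, the block size is assumed to be a power of two, and __builtin_add_overflow (GCC/Clang) stands in for the kernel's check_add_overflow(). The type, flag, device and s_maxbytes checks are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the uapi additions in this patch. */
#define FUSE_IOMAP_TYPE_HOLE	(0)
#define FUSE_IOMAP_DEV_NULL	(0U)
#define FUSE_IOMAP_NULL_ADDR	(-1ULL)

/* Same layout as struct fuse_iomap_io in this patch. */
struct fuse_iomap_io {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	uint64_t addr;		/* disk offset of mapping, bytes */
	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
	uint16_t flags;		/* FUSE_IOMAP_F_* */
	uint32_t dev;		/* device cookie */
};

/* Hypothetical helper: describe a hole with no device and no address. */
static struct fuse_iomap_io sketch_hole(uint64_t offset, uint64_t length)
{
	struct fuse_iomap_io map = {
		.offset = offset,
		.length = length,
		.addr = FUSE_IOMAP_NULL_ADDR,
		.type = FUSE_IOMAP_TYPE_HOLE,
		.dev = FUSE_IOMAP_DEV_NULL,
	};

	return map;
}

/*
 * Hypothetical helper: true if @map has a nonzero, block-aligned,
 * non-overflowing file range that contains @pos.
 */
static bool sketch_mapping_covers(const struct fuse_iomap_io *map,
				  uint64_t pos, uint32_t blocksize)
{
	uint64_t end;

	/* The alignment mask below only works for power-of-two sizes. */
	assert(blocksize && (blocksize & (blocksize - 1)) == 0);

	/* No zero-length mappings */
	if (map->length == 0)
		return false;

	/* File range must be aligned to blocksize */
	if ((map->offset | map->length) & (blocksize - 1))
		return false;

	/* No overflows in the file range */
	if (__builtin_add_overflow(map->offset, map->length, &end))
		return false;

	/* Must map at least the first byte of the requested range */
	return map->offset <= pos && pos < end;
}
```

A server that fails any of these checks would see the kernel reject the FUSE_IOMAP_BEGIN reply with -EFSCORRUPTED, so catching them in userspace first gives a more useful error location.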
fs/fuse/fuse_i.h | 35 ++++
fs/fuse/iomap_priv.h | 36 ++++
include/uapi/linux/fuse.h | 90 +++++++++
fs/fuse/Kconfig | 32 +++
fs/fuse/Makefile | 1
fs/fuse/file_iomap.c | 434 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 9 +
7 files changed, 636 insertions(+), 1 deletion(-)
create mode 100644 fs/fuse/iomap_priv.h
create mode 100644 fs/fuse/file_iomap.c
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4560687d619d76..f0d408a6e12c32 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -923,6 +923,9 @@ struct fuse_conn {
/* Is synchronous FUSE_INIT allowed? */
unsigned int sync_init:1;
+ /* Enable fs/iomap for file operations */
+ unsigned int iomap:1;
+
/* Use io_uring for communication */
unsigned int io_uring;
@@ -1047,6 +1050,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
return sb->s_fs_info;
}
+static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
+{
+ return sb->s_fs_info;
+}
+
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
{
return get_fuse_mount_super(sb)->fc;
@@ -1057,16 +1065,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
return get_fuse_mount_super(inode->i_sb);
}
+static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
+{
+ return get_fuse_mount_super_c(inode->i_sb);
+}
+
static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
{
return get_fuse_mount_super(inode->i_sb)->fc;
}
+static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
+{
+ return get_fuse_mount_super_c(inode->i_sb)->fc;
+}
+
static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
{
return container_of(inode, struct fuse_inode, inode);
}
+static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
+{
+ return container_of(inode, struct fuse_inode, inode);
+}
+
static inline u64 get_node_id(struct inode *inode)
{
return get_fuse_inode(inode)->nodeid;
@@ -1666,4 +1689,16 @@ extern void fuse_sysctl_unregister(void);
#define fuse_sysctl_unregister() do { } while (0)
#endif /* CONFIG_SYSCTL */
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+bool fuse_iomap_enabled(void);
+
+static inline bool fuse_has_iomap(const struct inode *inode)
+{
+ return get_fuse_conn_c(inode)->iomap;
+}
+#else
+# define fuse_iomap_enabled(...) (false)
+# define fuse_has_iomap(...) (false)
+#endif
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
new file mode 100644
index 00000000000000..243d92cb625095
--- /dev/null
+++ b/fs/fuse/iomap_priv.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef _FS_FUSE_IOMAP_PRIV_H
+#define _FS_FUSE_IOMAP_PRIV_H
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+# define ASSERT(condition) do { \
+ int __cond = !!(condition); \
+ WARN(!__cond, "Assertion failed: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
+} while (0)
+# define BAD_DATA(condition) ({ \
+ int __cond = !!(condition); \
+ WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
+})
+#else
+# define ASSERT(condition)
+# define BAD_DATA(condition) ({ \
+ int __cond = !!(condition); \
+ unlikely(__cond); \
+})
+#endif /* CONFIG_FUSE_IOMAP_DEBUG */
+
+enum fuse_iomap_iodir {
+ READ_MAPPING,
+ WRITE_MAPPING,
+};
+
+#define EFSCORRUPTED EUCLEAN
+
+#endif /* CONFIG_FUSE_IOMAP */
+
+#endif /* _FS_FUSE_IOMAP_PRIV_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 31b80f93211b81..3634cbe602cd9c 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -235,6 +235,9 @@
*
* 7.44
* - add FUSE_NOTIFY_INC_EPOCH
+ *
+ * 7.99
+ * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
*/
#ifndef _LINUX_FUSE_H
@@ -270,7 +273,7 @@
#define FUSE_KERNEL_VERSION 7
/** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 44
+#define FUSE_KERNEL_MINOR_VERSION 99
/** The node ID of the root inode */
#define FUSE_ROOT_ID 1
@@ -443,6 +446,7 @@ struct fuse_file_lock {
* FUSE_OVER_IO_URING: Indicate that client supports io-uring
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
+ * FUSE_IOMAP: Client supports iomap for regular file operations.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -490,6 +494,7 @@ struct fuse_file_lock {
#define FUSE_ALLOW_IDMAP (1ULL << 40)
#define FUSE_OVER_IO_URING (1ULL << 41)
#define FUSE_REQUEST_TIMEOUT (1ULL << 42)
+#define FUSE_IOMAP (1ULL << 43)
/**
* CUSE INIT request/reply flags
@@ -658,6 +663,9 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_BEGIN = 4094,
+ FUSE_IOMAP_END = 4095,
+
/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -1297,4 +1305,84 @@ struct fuse_uring_cmd_req {
uint8_t padding[6];
};
+/* mapping types; see corresponding IOMAP_TYPE_ */
+#define FUSE_IOMAP_TYPE_HOLE (0)
+#define FUSE_IOMAP_TYPE_DELALLOC (1)
+#define FUSE_IOMAP_TYPE_MAPPED (2)
+#define FUSE_IOMAP_TYPE_UNWRITTEN (3)
+#define FUSE_IOMAP_TYPE_INLINE (4)
+
+/* fuse-specific mapping type indicating that writes use the read mapping */
+#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255)
+
+#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */
+
+/* mapping flags passed back from iomap_begin; see corresponding IOMAP_F_ */
+#define FUSE_IOMAP_F_NEW (1U << 0)
+#define FUSE_IOMAP_F_DIRTY (1U << 1)
+#define FUSE_IOMAP_F_SHARED (1U << 2)
+#define FUSE_IOMAP_F_MERGED (1U << 3)
+#define FUSE_IOMAP_F_BOUNDARY (1U << 4)
+#define FUSE_IOMAP_F_ANON_WRITE (1U << 5)
+#define FUSE_IOMAP_F_ATOMIC_BIO (1U << 6)
+
+/* fuse-specific mapping flag asking for ->iomap_end call */
+#define FUSE_IOMAP_F_WANT_IOMAP_END (1U << 7)
+
+/* mapping flags passed to iomap_end */
+#define FUSE_IOMAP_F_SIZE_CHANGED (1U << 8)
+#define FUSE_IOMAP_F_STALE (1U << 9)
+
+/* operation flags from iomap; see corresponding IOMAP_* */
+#define FUSE_IOMAP_OP_WRITE (1U << 0)
+#define FUSE_IOMAP_OP_ZERO (1U << 1)
+#define FUSE_IOMAP_OP_REPORT (1U << 2)
+#define FUSE_IOMAP_OP_FAULT (1U << 3)
+#define FUSE_IOMAP_OP_DIRECT (1U << 4)
+#define FUSE_IOMAP_OP_NOWAIT (1U << 5)
+#define FUSE_IOMAP_OP_OVERWRITE_ONLY (1U << 6)
+#define FUSE_IOMAP_OP_UNSHARE (1U << 7)
+#define FUSE_IOMAP_OP_DAX (1U << 8)
+#define FUSE_IOMAP_OP_ATOMIC (1U << 9)
+#define FUSE_IOMAP_OP_DONTCACHE (1U << 10)
+
+#define FUSE_IOMAP_NULL_ADDR (-1ULL) /* addr is not valid */
+
+struct fuse_iomap_io {
+ uint64_t offset; /* file offset of mapping, bytes */
+ uint64_t length; /* length of mapping, bytes */
+ uint64_t addr; /* disk offset of mapping, bytes */
+ uint16_t type; /* FUSE_IOMAP_TYPE_* */
+ uint16_t flags; /* FUSE_IOMAP_F_* */
+ uint32_t dev; /* device cookie */
+};
+
+struct fuse_iomap_begin_in {
+ uint32_t opflags; /* FUSE_IOMAP_OP_* */
+ uint32_t reserved; /* zero */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t count; /* operation length, in bytes */
+};
+
+struct fuse_iomap_begin_out {
+ /* read file data from here */
+ struct fuse_iomap_io read;
+
+ /* write file data to here, if applicable */
+ struct fuse_iomap_io write;
+};
+
+struct fuse_iomap_end_in {
+ uint32_t opflags; /* FUSE_IOMAP_OP_* */
+ uint32_t reserved; /* zero */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t count; /* operation length, in bytes */
+ int64_t written; /* bytes processed */
+
+ /* mapping that the kernel acted upon */
+ struct fuse_iomap_io map;
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 9563fa5387a241..67dfe300bf2e07 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -69,6 +69,38 @@ config FUSE_PASSTHROUGH
config FUSE_BACKING
bool
+config FUSE_IOMAP
+ bool "FUSE file IO over iomap"
+ default n
+ depends on FUSE_FS
+ depends on BLOCK
+ select FS_IOMAP
+ help
+ Enable fuse servers to operate the regular file I/O path through
+ the fs-iomap library in the kernel. This enables higher performance
+ userspace filesystems by keeping the performance critical parts in
+ the kernel while delegating the difficult metadata parsing parts to
+ an easily-contained userspace program.
+
+ This feature is considered EXPERIMENTAL. Use with caution!
+
+ If unsure, say N.
+
+config FUSE_IOMAP_BY_DEFAULT
+ bool "FUSE file I/O over iomap by default"
+ default n
+ depends on FUSE_IOMAP
+ help
+ Enable sending FUSE file I/O over iomap by default.
+
+config FUSE_IOMAP_DEBUG
+ bool "Debug FUSE file IO over iomap"
+ default n
+ depends on FUSE_IOMAP
+ help
+ Enable debugging assertions for the fuse iomap code paths and logging
+ of bad iomap file mapping data being sent to the kernel.
+
config FUSE_IO_URING
bool "FUSE communication over io-uring"
default y
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 46041228e5be2c..27be39317701d6 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -18,5 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
fuse-$(CONFIG_FUSE_BACKING) += backing.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
+fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
new file mode 100644
index 00000000000000..dda757768d3ea6
--- /dev/null
+++ b/fs/fuse/file_iomap.c
@@ -0,0 +1,434 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include <linux/iomap.h>
+#include "fuse_i.h"
+#include "fuse_trace.h"
+#include "iomap_priv.h"
+
+static bool __read_mostly enable_iomap =
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
+ true;
+#else
+ false;
+#endif
+module_param(enable_iomap, bool, 0644);
+MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap");
+
+bool fuse_iomap_enabled(void)
+{
+ /* Don't let anyone touch iomap until the end of the patchset. */
+ return false;
+
+ /*
+ * There are fears that a fuse+iomap server could somehow DoS the
+ * system by doing things like going out to lunch during a writeback
+ * related iomap request. Only allow iomap access if the fuse server
+ * has rawio capabilities since those processes can mess things up
+ * quite well even without our help.
+ */
+ return enable_iomap && has_capability_noaudit(current, CAP_SYS_RAWIO);
+}
+
+/* Convert IOMAP_* mapping types to FUSE_IOMAP_TYPE_* */
+#define XMAP(word) \
+ case IOMAP_##word: \
+ return FUSE_IOMAP_TYPE_##word
+static inline uint16_t fuse_iomap_type_to_server(uint16_t iomap_type)
+{
+ switch (iomap_type) {
+ XMAP(HOLE);
+ XMAP(DELALLOC);
+ XMAP(MAPPED);
+ XMAP(UNWRITTEN);
+ XMAP(INLINE);
+ default:
+ ASSERT(0);
+ }
+ return 0;
+}
+#undef XMAP
+
+/* Convert FUSE_IOMAP_TYPE_* to IOMAP_* mapping types */
+#define XMAP(word) \
+ case FUSE_IOMAP_TYPE_##word: \
+ return IOMAP_##word
+static inline uint16_t fuse_iomap_type_from_server(uint16_t fuse_type)
+{
+ switch (fuse_type) {
+ XMAP(HOLE);
+ XMAP(DELALLOC);
+ XMAP(MAPPED);
+ XMAP(UNWRITTEN);
+ XMAP(INLINE);
+ default:
+ ASSERT(0);
+ }
+ return 0;
+}
+#undef XMAP
+
+/* Validate FUSE_IOMAP_TYPE_* */
+static inline bool fuse_iomap_check_type(uint16_t fuse_type)
+{
+ switch (fuse_type) {
+ case FUSE_IOMAP_TYPE_HOLE:
+ case FUSE_IOMAP_TYPE_DELALLOC:
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ case FUSE_IOMAP_TYPE_INLINE:
+ case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+ return true;
+ }
+
+ return false;
+}
+
+#define FUSE_IOMAP_F_ALL (FUSE_IOMAP_F_NEW | \
+ FUSE_IOMAP_F_DIRTY | \
+ FUSE_IOMAP_F_SHARED | \
+ FUSE_IOMAP_F_MERGED | \
+ FUSE_IOMAP_F_BOUNDARY | \
+ FUSE_IOMAP_F_ANON_WRITE | \
+ FUSE_IOMAP_F_ATOMIC_BIO | \
+ FUSE_IOMAP_F_WANT_IOMAP_END)
+
+static inline bool fuse_iomap_check_flags(uint16_t flags)
+{
+ return (flags & ~FUSE_IOMAP_F_ALL) == 0;
+}
+
+/* Convert IOMAP_F_* mapping state flags to FUSE_IOMAP_F_* */
+#define XMAP(word) \
+ if (iomap_f_flags & IOMAP_F_##word) \
+ ret |= FUSE_IOMAP_F_##word
+#define YMAP(iword, oword) \
+ if (iomap_f_flags & IOMAP_F_##iword) \
+ ret |= FUSE_IOMAP_F_##oword
+static inline uint16_t fuse_iomap_flags_to_server(uint16_t iomap_f_flags)
+{
+ uint16_t ret = 0;
+
+ XMAP(NEW);
+ XMAP(DIRTY);
+ XMAP(SHARED);
+ XMAP(MERGED);
+ XMAP(BOUNDARY);
+ XMAP(ANON_WRITE);
+ XMAP(ATOMIC_BIO);
+ YMAP(PRIVATE, WANT_IOMAP_END);
+
+ XMAP(SIZE_CHANGED);
+ XMAP(STALE);
+
+ return ret;
+}
+#undef YMAP
+#undef XMAP
+
+/* Convert FUSE_IOMAP_F_* to IOMAP_F_* mapping state flags */
+#define XMAP(word) \
+ if (fuse_f_flags & FUSE_IOMAP_F_##word) \
+ ret |= IOMAP_F_##word
+#define YMAP(iword, oword) \
+ if (fuse_f_flags & FUSE_IOMAP_F_##iword) \
+ ret |= IOMAP_F_##oword
+static inline uint16_t fuse_iomap_flags_from_server(uint16_t fuse_f_flags)
+{
+ uint16_t ret = 0;
+
+ XMAP(NEW);
+ XMAP(DIRTY);
+ XMAP(SHARED);
+ XMAP(MERGED);
+ XMAP(BOUNDARY);
+ XMAP(ANON_WRITE);
+ XMAP(ATOMIC_BIO);
+ YMAP(WANT_IOMAP_END, PRIVATE);
+
+ return ret;
+}
+#undef YMAP
+#undef XMAP
+
+/* Convert IOMAP_* operation flags to FUSE_IOMAP_OP_* */
+#define XMAP(word) \
+ if (iomap_op_flags & IOMAP_##word) \
+ ret |= FUSE_IOMAP_OP_##word
+static inline uint32_t fuse_iomap_op_to_server(unsigned iomap_op_flags)
+{
+ uint32_t ret = 0;
+
+ XMAP(WRITE);
+ XMAP(ZERO);
+ XMAP(REPORT);
+ XMAP(FAULT);
+ XMAP(DIRECT);
+ XMAP(NOWAIT);
+ XMAP(OVERWRITE_ONLY);
+ XMAP(UNSHARE);
+ XMAP(DAX);
+ XMAP(ATOMIC);
+ XMAP(DONTCACHE);
+
+ return ret;
+}
+#undef XMAP
+
+/* Validate an iomap mapping. */
+static inline bool fuse_iomap_check_mapping(const struct inode *inode,
+ const struct fuse_iomap_io *map,
+ enum fuse_iomap_iodir iodir)
+{
+ const unsigned int blocksize = i_blocksize(inode);
+ uint64_t end;
+
+ /* Type and flags must be known */
+ if (BAD_DATA(!fuse_iomap_check_type(map->type)))
+ return false;
+ if (BAD_DATA(!fuse_iomap_check_flags(map->flags)))
+ return false;
+
+ /* No zero-length mappings */
+ if (BAD_DATA(map->length == 0))
+ return false;
+
+ /* File range must be aligned to blocksize */
+ if (BAD_DATA(!IS_ALIGNED(map->offset, blocksize)))
+ return false;
+ if (BAD_DATA(!IS_ALIGNED(map->length, blocksize)))
+ return false;
+
+ /* No overflows in the file range */
+ if (BAD_DATA(check_add_overflow(map->offset, map->length, &end)))
+ return false;
+
+ /* File range cannot start past maxbytes */
+ if (BAD_DATA(map->offset >= inode->i_sb->s_maxbytes))
+ return false;
+
+ switch (map->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ /* Mappings backed by space must have a device/addr */
+ if (BAD_DATA(map->dev == FUSE_IOMAP_DEV_NULL))
+ return false;
+ if (BAD_DATA(map->addr == FUSE_IOMAP_NULL_ADDR))
+ return false;
+ break;
+ case FUSE_IOMAP_TYPE_DELALLOC:
+ case FUSE_IOMAP_TYPE_HOLE:
+ case FUSE_IOMAP_TYPE_INLINE:
+ /* Mappings not backed by space cannot have a device addr. */
+ if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
+ return false;
+ if (BAD_DATA(map->addr != FUSE_IOMAP_NULL_ADDR))
+ return false;
+ break;
+ case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+ /* "Pure overwrite" only allowed for write mapping */
+ if (BAD_DATA(iodir != WRITE_MAPPING))
+ return false;
+ break;
+ default:
+ /* should have been caught already */
+ ASSERT(0);
+ return false;
+ }
+
+ /* XXX: we don't support devices yet */
+ if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
+ return false;
+
+ /* No overflows in the device range, if supplied */
+ if (map->addr != FUSE_IOMAP_NULL_ADDR &&
+ BAD_DATA(check_add_overflow(map->addr, map->length, &end)))
+ return false;
+
+ return true;
+}
+
+/* Convert a mapping from the server into something the kernel can use */
+static inline void fuse_iomap_from_server(struct inode *inode,
+ struct iomap *iomap,
+ const struct fuse_iomap_io *fmap)
+{
+ iomap->addr = fmap->addr;
+ iomap->offset = fmap->offset;
+ iomap->length = fmap->length;
+ iomap->type = fuse_iomap_type_from_server(fmap->type);
+ iomap->flags = fuse_iomap_flags_from_server(fmap->flags);
+ iomap->bdev = inode->i_sb->s_bdev; /* XXX */
+}
+
+/* Convert a mapping from the kernel into something the server can use */
+static inline void fuse_iomap_to_server(struct fuse_iomap_io *fmap,
+ const struct iomap *iomap)
+{
+ fmap->addr = FUSE_IOMAP_NULL_ADDR; /* XXX */
+ fmap->offset = iomap->offset;
+ fmap->length = iomap->length;
+ fmap->type = fuse_iomap_type_to_server(iomap->type);
+ fmap->flags = fuse_iomap_flags_to_server(iomap->flags);
+ fmap->dev = FUSE_IOMAP_DEV_NULL; /* XXX */
+}
+
+/* Check the incoming _begin mappings to make sure they're not nonsense. */
+static inline int
+fuse_iomap_begin_validate(const struct inode *inode,
+ unsigned opflags, loff_t pos,
+ const struct fuse_iomap_begin_out *outarg)
+{
+ /* Make sure the mappings aren't garbage */
+ if (!fuse_iomap_check_mapping(inode, &outarg->read, READ_MAPPING))
+ return -EFSCORRUPTED;
+
+ if (!fuse_iomap_check_mapping(inode, &outarg->write, WRITE_MAPPING))
+ return -EFSCORRUPTED;
+
+ /*
+ * Must have returned a mapping for at least the first byte in the
+ * range. The main mapping check already validated that the length
+ * is nonzero and there is no overflow in computing end.
+ */
+ if (BAD_DATA(outarg->read.offset > pos))
+ return -EFSCORRUPTED;
+ if (BAD_DATA(outarg->write.offset > pos))
+ return -EFSCORRUPTED;
+
+ if (BAD_DATA(outarg->read.offset + outarg->read.length <= pos))
+ return -EFSCORRUPTED;
+ if (BAD_DATA(outarg->write.offset + outarg->write.length <= pos))
+ return -EFSCORRUPTED;
+
+ return 0;
+}
+
+static inline bool fuse_is_iomap_file_write(unsigned int opflags)
+{
+ return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
+}
+
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_begin_in inarg = {
+ .attr_ino = fi->orig_ino,
+ .opflags = fuse_iomap_op_to_server(opflags),
+ .pos = pos,
+ .count = count,
+ };
+ struct fuse_iomap_begin_out outarg = { };
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ FUSE_ARGS(args);
+ int err;
+
+ args.opcode = FUSE_IOMAP_BEGIN;
+ args.nodeid = get_node_id(inode);
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ args.out_numargs = 1;
+ args.out_args[0].size = sizeof(outarg);
+ args.out_args[0].value = &outarg;
+ err = fuse_simple_request(fm, &args);
+ if (err)
+ return err;
+
+ err = fuse_iomap_begin_validate(inode, opflags, pos, &outarg);
+ if (err)
+ return err;
+
+ if (fuse_is_iomap_file_write(opflags) &&
+ outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+ /*
+ * For an out of place write, we must supply the write mapping
+ * via @iomap, and the read mapping via @srcmap.
+ */
+ fuse_iomap_from_server(inode, iomap, &outarg.write);
+ fuse_iomap_from_server(inode, srcmap, &outarg.read);
+ } else {
+ /*
+ * For everything else (reads, reporting, and pure overwrites),
+ * we can return the sole mapping through @iomap and leave
+ * @srcmap unchanged from its default (HOLE).
+ */
+ fuse_iomap_from_server(inode, iomap, &outarg.read);
+ }
+
+ return 0;
+}
+
+/* Decide if we send FUSE_IOMAP_END to the fuse server */
+static bool fuse_should_send_iomap_end(const struct iomap *iomap,
+ unsigned int opflags, loff_t count,
+ ssize_t written)
+{
+ /* fuse server demanded an iomap_end call (WANT_IOMAP_END became IOMAP_F_PRIVATE) */
+ if (iomap->flags & IOMAP_F_PRIVATE)
+ return true;
+
+ /* Reads and reporting should never affect the filesystem metadata */
+ if (!fuse_is_iomap_file_write(opflags))
+ return false;
+
+ /* Appending writes get an iomap_end call */
+ if (iomap->flags & IOMAP_F_SIZE_CHANGED)
+ return true;
+
+ /* Short writes get an iomap_end call to clean up delalloc */
+ return written < count;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
+ ssize_t written, unsigned opflags,
+ struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ int err = 0;
+
+ if (fuse_should_send_iomap_end(iomap, opflags, count, written)) {
+ struct fuse_iomap_end_in inarg = {
+ .opflags = fuse_iomap_op_to_server(opflags),
+ .attr_ino = fi->orig_ino,
+ .pos = pos,
+ .count = count,
+ .written = written,
+ };
+ FUSE_ARGS(args);
+
+ fuse_iomap_to_server(&inarg.map, iomap);
+
+ args.opcode = FUSE_IOMAP_END;
+ args.nodeid = get_node_id(inode);
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ err = fuse_simple_request(fm, &args);
+ switch (err) {
+ case -ENOSYS:
+ /*
+ * libfuse returns ENOSYS for servers that don't
+ * implement iomap_end
+ */
+ err = 0;
+ break;
+ case 0:
+ break;
+ default:
+ break;
+ }
+ }
+
+ return err;
+}
+
+const struct iomap_ops fuse_iomap_ops = {
+ .iomap_begin = fuse_iomap_begin,
+ .iomap_end = fuse_iomap_end,
+};
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1e7298b2b89b58..32f4b7c9a20a8a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1448,6 +1448,13 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
if (flags & FUSE_REQUEST_TIMEOUT)
timeout = arg->request_timeout;
+
+ if ((flags & FUSE_IOMAP) && fuse_iomap_enabled()) {
+ fc->local_fs = 1;
+ fc->iomap = 1;
+ printk(KERN_WARNING
+ "fuse: EXPERIMENTAL iomap feature enabled. Use at your own risk!\n");
+ }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1516,6 +1523,8 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
*/
if (fuse_uring_enabled())
flags |= FUSE_OVER_IO_URING;
+ if (fuse_iomap_enabled())
+ flags |= FUSE_IOMAP;
ia->in.flags = flags;
ia->in.flags2 = flags >> 32;
* [PATCH 02/28] fuse_trace: implement the basic iomap mechanisms
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-09-16 0:28 ` [PATCH 01/28] fuse: implement the basic iomap mechanisms Darrick J. Wong
@ 2025-09-16 0:28 ` Darrick J. Wong
2025-09-16 0:28 ` [PATCH 03/28] fuse: make debugging configurable at runtime Darrick J. Wong
` (25 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:28 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
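Not part of the patch: the format macros in the hunk below compose through adjacent string literal concatenation, so FUSE_FILE_RANGE_FMT() takes a string literal prefix while FUSE_FILE_RANGE_FIELDS() takes a bare token that gets pasted onto the field names. A small userspace sketch of the format half, with the three macros copied from the patch and a hypothetical helper, sketch_format_io_range(), standing in for TP_printk:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace copies of the format macros from fuse_trace.h in this patch. */
#define FUSE_INODE_FMT \
	"connection %u ino %llu nodeid %llu isize 0x%llx"

#define FUSE_FILE_RANGE_FMT(prefix) \
	" " prefix "pos 0x%llx length 0x%llx"

#define FUSE_IO_RANGE_FMT(prefix) \
	FUSE_INODE_FMT FUSE_FILE_RANGE_FMT(prefix)

/*
 * Hypothetical helper: render one event line into @buf using an empty
 * prefix.  Returns what snprintf() returns.
 */
static int sketch_format_io_range(char *buf, size_t len, unsigned int conn,
				  unsigned long long ino,
				  unsigned long long nodeid,
				  unsigned long long isize,
				  unsigned long long pos,
				  unsigned long long length)
{
	assert(buf && len > 0);

	/* The preprocessor has already glued the pieces into one literal. */
	return snprintf(buf, len, FUSE_IO_RANGE_FMT(""),
			conn, ino, nodeid, isize, pos, length);
}
```

Passing a literal such as "src " instead of "" would label a second range within the same event, which is presumably why the prefix is a parameter at all.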
fs/fuse/fuse_trace.h | 295 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/iomap_priv.h | 6 +
fs/fuse/file_iomap.c | 12 ++
3 files changed, 312 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 286a0845dc0898..ef94f07cbbf2d4 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -58,6 +58,8 @@
EM( FUSE_SYNCFS, "FUSE_SYNCFS") \
EM( FUSE_TMPFILE, "FUSE_TMPFILE") \
EM( FUSE_STATX, "FUSE_STATX") \
+ EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \
+ EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \
EMe(CUSE_INIT, "CUSE_INIT")
/*
@@ -77,6 +79,54 @@ OPCODES
#define EM(a, b) {a, b},
#define EMe(a, b) {a, b}
+/* tracepoint boilerplate so we don't have to keep doing this */
+#define FUSE_INODE_FIELDS \
+ __field(dev_t, connection) \
+ __field(uint64_t, ino) \
+ __field(uint64_t, nodeid) \
+ __field(loff_t, isize)
+
+#define FUSE_INODE_ASSIGN(inode, fi, fm) \
+ const struct fuse_inode *fi = get_fuse_inode_c(inode); \
+ const struct fuse_mount *fm = get_fuse_mount_c(inode); \
+\
+ __entry->connection = (fm)->fc->dev; \
+ __entry->ino = (fi)->orig_ino; \
+ __entry->nodeid = (fi)->nodeid; \
+ __entry->isize = i_size_read(inode)
+
+#define FUSE_INODE_FMT \
+ "connection %u ino %llu nodeid %llu isize 0x%llx"
+
+#define FUSE_INODE_PRINTK_ARGS \
+ __entry->connection, \
+ __entry->ino, \
+ __entry->nodeid, \
+ __entry->isize
+
+#define FUSE_FILE_RANGE_FIELDS(prefix) \
+ __field(loff_t, prefix##offset) \
+ __field(loff_t, prefix##length)
+
+#define FUSE_FILE_RANGE_FMT(prefix) \
+ " " prefix "pos 0x%llx length 0x%llx"
+
+#define FUSE_FILE_RANGE_PRINTK_ARGS(prefix) \
+ __entry->prefix##offset, \
+ __entry->prefix##length
+
+/* combinations of boilerplate to reduce typing further */
+#define FUSE_IO_RANGE_FIELDS(prefix) \
+ FUSE_INODE_FIELDS \
+ FUSE_FILE_RANGE_FIELDS(prefix)
+
+#define FUSE_IO_RANGE_FMT(prefix) \
+ FUSE_INODE_FMT FUSE_FILE_RANGE_FMT(prefix)
+
+#define FUSE_IO_RANGE_PRINTK_ARGS(prefix) \
+ FUSE_INODE_PRINTK_ARGS, \
+ FUSE_FILE_RANGE_PRINTK_ARGS(prefix)
+
TRACE_EVENT(fuse_request_send,
TP_PROTO(const struct fuse_req *req),
@@ -159,6 +209,251 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_open);
DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
#endif /* CONFIG_FUSE_BACKING */
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+
+/* tracepoint boilerplate so we don't have to keep doing this */
+#define FUSE_IOMAP_OPFLAGS_FIELD \
+ __field(unsigned, opflags)
+
+#define FUSE_IOMAP_OPFLAGS_FMT \
+ " opflags (%s)"
+
+#define FUSE_IOMAP_OPFLAGS_PRINTK_ARG \
+ __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS)
+
+#define FUSE_IOMAP_MAP_FIELDS(prefix) \
+ __field(uint64_t, prefix##offset) \
+ __field(uint64_t, prefix##length) \
+ __field(uint64_t, prefix##addr) \
+ __field(uint32_t, prefix##dev) \
+ __field(uint16_t, prefix##type) \
+ __field(uint16_t, prefix##flags)
+
+#define FUSE_IOMAP_MAP_FMT(prefix) \
+ " " prefix "offset 0x%llx length 0x%llx type %s dev %u addr 0x%llx mapflags (%s)"
+
+#define FUSE_IOMAP_MAP_PRINTK_ARGS(prefix) \
+ __entry->prefix##offset, \
+ __entry->prefix##length, \
+ __print_symbolic(__entry->prefix##type, FUSE_IOMAP_TYPE_STRINGS), \
+ __entry->prefix##dev, \
+ __entry->prefix##addr, \
+ __print_flags(__entry->prefix##flags, "|", FUSE_IOMAP_F_STRINGS)
+
+/* combinations of boilerplate to reduce typing further */
+#define FUSE_IOMAP_OP_FIELDS(prefix) \
+ FUSE_INODE_FIELDS \
+ FUSE_IOMAP_OPFLAGS_FIELD \
+ FUSE_FILE_RANGE_FIELDS(prefix)
+
+#define FUSE_IOMAP_OP_FMT(prefix) \
+ FUSE_INODE_FMT FUSE_IOMAP_OPFLAGS_FMT FUSE_FILE_RANGE_FMT(prefix)
+
+#define FUSE_IOMAP_OP_PRINTK_ARGS(prefix) \
+ FUSE_INODE_PRINTK_ARGS, \
+ FUSE_IOMAP_OPFLAGS_PRINTK_ARG, \
+ FUSE_FILE_RANGE_PRINTK_ARGS(prefix)
+
+/* string decoding */
+#define FUSE_IOMAP_F_STRINGS \
+ { FUSE_IOMAP_F_NEW, "new" }, \
+ { FUSE_IOMAP_F_DIRTY, "dirty" }, \
+ { FUSE_IOMAP_F_SHARED, "shared" }, \
+ { FUSE_IOMAP_F_MERGED, "merged" }, \
+ { FUSE_IOMAP_F_BOUNDARY, "boundary" }, \
+ { FUSE_IOMAP_F_ANON_WRITE, "anon_write" }, \
+ { FUSE_IOMAP_F_ATOMIC_BIO, "atomic" }, \
+ { FUSE_IOMAP_F_WANT_IOMAP_END, "iomap_end" }, \
+ { FUSE_IOMAP_F_SIZE_CHANGED, "append" }, \
+ { FUSE_IOMAP_F_STALE, "stale" }
+
+#define FUSE_IOMAP_OP_STRINGS \
+ { FUSE_IOMAP_OP_WRITE, "write" }, \
+ { FUSE_IOMAP_OP_ZERO, "zero" }, \
+ { FUSE_IOMAP_OP_REPORT, "report" }, \
+ { FUSE_IOMAP_OP_FAULT, "fault" }, \
+ { FUSE_IOMAP_OP_DIRECT, "direct" }, \
+ { FUSE_IOMAP_OP_NOWAIT, "nowait" }, \
+ { FUSE_IOMAP_OP_OVERWRITE_ONLY, "overwrite" }, \
+ { FUSE_IOMAP_OP_UNSHARE, "unshare" }, \
+ { FUSE_IOMAP_OP_DAX, "fsdax" }, \
+ { FUSE_IOMAP_OP_ATOMIC, "atomic" }, \
+ { FUSE_IOMAP_OP_DONTCACHE, "dontcache" }
+
+#define FUSE_IOMAP_TYPE_STRINGS \
+ { FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \
+ { FUSE_IOMAP_TYPE_HOLE, "hole" }, \
+ { FUSE_IOMAP_TYPE_DELALLOC, "delalloc" }, \
+ { FUSE_IOMAP_TYPE_MAPPED, "mapped" }, \
+ { FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \
+ { FUSE_IOMAP_TYPE_INLINE, "inline" }
+
+DECLARE_EVENT_CLASS(fuse_iomap_check_class,
+ TP_PROTO(const char *func, int line, const char *condition),
+
+ TP_ARGS(func, line, condition),
+
+ TP_STRUCT__entry(
+ __string(func, func)
+ __field(int, line)
+ __string(condition, condition)
+ ),
+
+ TP_fast_assign(
+ __assign_str(func);
+ __assign_str(condition);
+ __entry->line = line;
+ ),
+
+ TP_printk("func %s line %d condition %s", __get_str(func),
+ __entry->line, __get_str(condition))
+);
+#define DEFINE_FUSE_IOMAP_CHECK_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_check_class, name, \
+ TP_PROTO(const char *func, int line, const char *condition), \
+ TP_ARGS(func, line, condition))
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+DEFINE_FUSE_IOMAP_CHECK_EVENT(fuse_iomap_assert);
+#endif
+DEFINE_FUSE_IOMAP_CHECK_EVENT(fuse_iomap_bad_data);
+
+TRACE_EVENT(fuse_iomap_begin,
+ TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags),
+
+ TP_ARGS(inode, pos, count, opflags),
+
+ TP_STRUCT__entry(
+ FUSE_IOMAP_OP_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = pos;
+ __entry->length = count;
+ __entry->opflags = opflags;
+ ),
+
+ TP_printk(FUSE_IOMAP_OP_FMT(),
+ FUSE_IOMAP_OP_PRINTK_ARGS())
+);
+
+TRACE_EVENT(fuse_iomap_begin_error,
+ TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags, int error),
+
+ TP_ARGS(inode, pos, count, opflags, error),
+
+ TP_STRUCT__entry(
+ FUSE_IOMAP_OP_FIELDS()
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = pos;
+ __entry->length = count;
+ __entry->opflags = opflags;
+ __entry->error = error;
+ ),
+
+ TP_printk(FUSE_IOMAP_OP_FMT() " err %d",
+ FUSE_IOMAP_OP_PRINTK_ARGS(),
+ __entry->error)
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_mapping_class,
+ TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map),
+
+ TP_ARGS(inode, map),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->mapdev = map->dev;
+ __entry->mapaddr = map->addr;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ ),
+
+ TP_printk(FUSE_INODE_FMT FUSE_IOMAP_MAP_FMT(),
+ FUSE_INODE_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+#define DEFINE_FUSE_IOMAP_MAPPING_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_mapping_class, name, \
+ TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map), \
+ TP_ARGS(inode, map))
+DEFINE_FUSE_IOMAP_MAPPING_EVENT(fuse_iomap_read_map);
+DEFINE_FUSE_IOMAP_MAPPING_EVENT(fuse_iomap_write_map);
+
+TRACE_EVENT(fuse_iomap_end,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_end_in *inarg),
+
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ FUSE_IOMAP_OP_FIELDS()
+ __field(size_t, written)
+ FUSE_IOMAP_MAP_FIELDS(map)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->opflags = inarg->opflags;
+ __entry->written = inarg->written;
+ __entry->offset = inarg->pos;
+ __entry->length = inarg->count;
+
+ __entry->mapoffset = inarg->map.offset;
+ __entry->maplength = inarg->map.length;
+ __entry->mapdev = inarg->map.dev;
+ __entry->mapaddr = inarg->map.addr;
+ __entry->maptype = inarg->map.type;
+ __entry->mapflags = inarg->map.flags;
+ ),
+
+ TP_printk(FUSE_IOMAP_OP_FMT() " written %zd" FUSE_IOMAP_MAP_FMT(),
+ FUSE_IOMAP_OP_PRINTK_ARGS(),
+ __entry->written,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+
+TRACE_EVENT(fuse_iomap_end_error,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_end_in *inarg, int error),
+
+ TP_ARGS(inode, inarg, error),
+
+ TP_STRUCT__entry(
+ FUSE_IOMAP_OP_FIELDS()
+ __field(size_t, written)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = inarg->pos;
+ __entry->length = inarg->count;
+ __entry->opflags = inarg->opflags;
+ __entry->written = inarg->written;
+ __entry->error = error;
+ ),
+
+ TP_printk(FUSE_IOMAP_OP_FMT() " written %zd error %d",
+ FUSE_IOMAP_OP_PRINTK_ARGS(),
+ __entry->written,
+ __entry->error)
+);
+#endif /* CONFIG_FUSE_IOMAP */
+
#endif /* _TRACE_FUSE_H */
#undef TRACE_INCLUDE_PATH
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
index 243d92cb625095..ca8544a95a4267 100644
--- a/fs/fuse/iomap_priv.h
+++ b/fs/fuse/iomap_priv.h
@@ -10,16 +10,22 @@
#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
# define ASSERT(condition) do { \
int __cond = !!(condition); \
+ if (unlikely(!__cond)) \
+ trace_fuse_iomap_assert(__func__, __LINE__, #condition); \
WARN(!__cond, "Assertion failed: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
} while (0)
# define BAD_DATA(condition) ({ \
int __cond = !!(condition); \
+ if (unlikely(__cond)) \
+ trace_fuse_iomap_bad_data(__func__, __LINE__, #condition); \
WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
})
#else
# define ASSERT(condition)
# define BAD_DATA(condition) ({ \
int __cond = !!(condition); \
+ if (unlikely(__cond)) \
+ trace_fuse_iomap_bad_data(__func__, __LINE__, #condition); \
unlikely(__cond); \
})
#endif /* CONFIG_FUSE_IOMAP_DEBUG */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index dda757768d3ea6..e503bb06fe0c69 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -327,6 +327,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
FUSE_ARGS(args);
int err;
+ trace_fuse_iomap_begin(inode, pos, count, opflags);
+
args.opcode = FUSE_IOMAP_BEGIN;
args.nodeid = get_node_id(inode);
args.in_numargs = 1;
@@ -336,8 +338,13 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
args.out_args[0].size = sizeof(outarg);
args.out_args[0].value = &outarg;
err = fuse_simple_request(fm, &args);
- if (err)
+ if (err) {
+ trace_fuse_iomap_begin_error(inode, pos, count, opflags, err);
return err;
+ }
+
+ trace_fuse_iomap_read_map(inode, &outarg.read);
+ trace_fuse_iomap_write_map(inode, &outarg.write);
err = fuse_iomap_begin_validate(inode, opflags, pos, &outarg);
if (err)
@@ -404,6 +411,8 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
fuse_iomap_to_server(&inarg.map, iomap);
+ trace_fuse_iomap_end(inode, &inarg);
+
args.opcode = FUSE_IOMAP_END;
args.nodeid = get_node_id(inode);
args.in_numargs = 1;
@@ -421,6 +430,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
case 0:
break;
default:
+ trace_fuse_iomap_end_error(inode, &inarg, err);
break;
}
}
* [PATCH 03/28] fuse: make debugging configurable at runtime
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-09-16 0:28 ` [PATCH 01/28] fuse: implement the basic iomap mechanisms Darrick J. Wong
2025-09-16 0:28 ` [PATCH 02/28] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:28 ` Darrick J. Wong
2025-09-16 0:29 ` [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices Darrick J. Wong
` (24 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:28 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Use static keys so that we can configure debugging assertions and dmesg
warnings at runtime. By default this is turned off so the cost is
merely scanning a nop sled. However, fuse server developers can turn
it on for their debugging systems.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
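As a rough illustration of the runtime knob, the sketch below models the 0/1 parsing that the sysfs store handler performs. It is a userspace approximation under stated assumptions: a plain bool stands in for the static key, `debug_store` and `debug_show` are hypothetical names, success returns 0 rather than the byte count, and every parse failure is folded into -EINVAL.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Plain flag standing in for the fuse_iomap_debug static key. */
static bool debug_enabled;

/*
 * Accept "0" or "1", optionally followed by a single newline, and flip
 * the flag.  Returns 0 on success or -EINVAL on any other input.
 */
static int debug_store(const char *buf)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 0);
	if (end == buf || errno == ERANGE)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (val < 0 || val > 1)
		return -EINVAL;

	debug_enabled = (val != 0);
	return 0;
}

/* Report the current setting as 0 or 1, like the show handler. */
static int debug_show(void)
{
	return debug_enabled ? 1 : 0;
}
```

On a kernel built with this patch, the equivalent action would be writing 0 or 1 to /sys/fs/fuse/iomap/debug, the path named in the Kconfig help text.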
fs/fuse/fuse_i.h | 8 +++++
fs/fuse/iomap_priv.h | 16 ++++++++--
fs/fuse/Kconfig | 15 +++++++++
fs/fuse/file_iomap.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 7 ++++
5 files changed, 124 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f0d408a6e12c32..389b123f0bf144 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1689,6 +1689,14 @@ extern void fuse_sysctl_unregister(void);
#define fuse_sysctl_unregister() do { } while (0)
#endif /* CONFIG_SYSCTL */
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+int fuse_iomap_sysfs_init(struct kobject *kobj);
+void fuse_iomap_sysfs_cleanup(struct kobject *kobj);
+#else
+# define fuse_iomap_sysfs_init(...) (0)
+# define fuse_iomap_sysfs_cleanup(...) ((void)0)
+#endif
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
bool fuse_iomap_enabled(void);
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
index ca8544a95a4267..7002eb38f87fe1 100644
--- a/fs/fuse/iomap_priv.h
+++ b/fs/fuse/iomap_priv.h
@@ -6,19 +6,29 @@
#ifndef _FS_FUSE_IOMAP_PRIV_H
#define _FS_FUSE_IOMAP_PRIV_H
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG_BY_DEFAULT)
+DECLARE_STATIC_KEY_TRUE(fuse_iomap_debug);
+#else
+DECLARE_STATIC_KEY_FALSE(fuse_iomap_debug);
+#endif
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
-# define ASSERT(condition) do { \
+# define ASSERT(condition) \
+while (static_branch_unlikely(&fuse_iomap_debug)) { \
int __cond = !!(condition); \
if (unlikely(!__cond)) \
trace_fuse_iomap_assert(__func__, __LINE__, #condition); \
WARN(!__cond, "Assertion failed: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
-} while (0)
+ break; \
+}
# define BAD_DATA(condition) ({ \
int __cond = !!(condition); \
if (unlikely(__cond)) \
trace_fuse_iomap_bad_data(__func__, __LINE__, #condition); \
- WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
+ if (static_branch_unlikely(&fuse_iomap_debug)) \
+ WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
+ unlikely(__cond); \
})
#else
# define ASSERT(condition)
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 67dfe300bf2e07..52e1a04183e760 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -101,6 +101,21 @@ config FUSE_IOMAP_DEBUG
Enable debugging assertions for the fuse iomap code paths and logging
of bad iomap file mapping data being sent to the kernel.
+ Say N here if you don't want any debugging code compiled in at
+ all.
+
+config FUSE_IOMAP_DEBUG_BY_DEFAULT
+ bool "Debug FUSE file IO over iomap at boot time"
+ default n
+ depends on FUSE_IOMAP_DEBUG
+ help
+ At boot time, enable debugging assertions for the fuse iomap code
+ paths and warnings about bad iomap file mapping data. This enables
+ fuse server authors to control debugging at runtime even on a
+ distribution kernel while avoiding most of the overhead on production
+ systems. The setting can be changed at runtime via
+ /sys/fs/fuse/iomap/debug.
+
config FUSE_IO_URING
bool "FUSE communication over io-uring"
default y
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index e503bb06fe0c69..e7d19e2aee4541 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -8,6 +8,12 @@
#include "fuse_trace.h"
#include "iomap_priv.h"
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG_BY_DEFAULT)
+DEFINE_STATIC_KEY_TRUE(fuse_iomap_debug);
+#else
+DEFINE_STATIC_KEY_FALSE(fuse_iomap_debug);
+#endif
+
static bool __read_mostly enable_iomap =
#if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
true;
@@ -17,6 +23,81 @@ static bool __read_mostly enable_iomap =
module_param(enable_iomap, bool, 0644);
MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap");
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static struct kobject *iomap_kobj;
+
+static ssize_t fuse_iomap_debug_show(struct kobject *kobject,
+ struct kobj_attribute *a, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", !!static_key_enabled(&fuse_iomap_debug));
+}
+
+static ssize_t fuse_iomap_debug_store(struct kobject *kobject,
+ struct kobj_attribute *a,
+ const char *buf, size_t count)
+{
+ int ret;
+ int val;
+
+ ret = kstrtoint(buf, 0, &val);
+ if (ret)
+ return ret;
+
+ if (val < 0 || val > 1)
+ return -EINVAL;
+
+ if (val)
+ static_branch_enable(&fuse_iomap_debug);
+ else
+ static_branch_disable(&fuse_iomap_debug);
+
+ return count;
+}
+
+#define __INIT_KOBJ_ATTR(_name, _mode, _show, _store) \
+{ \
+ .attr = { .name = __stringify(_name), .mode = _mode }, \
+ .show = _show, \
+ .store = _store, \
+}
+
+#define FUSE_ATTR_RW(_name, _show, _store) \
+ static struct kobj_attribute fuse_attr_##_name = \
+ __INIT_KOBJ_ATTR(_name, 0644, _show, _store)
+
+#define FUSE_ATTR_PTR(_name) \
+ (&fuse_attr_##_name.attr)
+
+FUSE_ATTR_RW(debug, fuse_iomap_debug_show, fuse_iomap_debug_store);
+
+static const struct attribute *fuse_iomap_attrs[] = {
+ FUSE_ATTR_PTR(debug),
+ NULL,
+};
+
+int fuse_iomap_sysfs_init(struct kobject *fuse_kobj)
+{
+ int error;
+
+ iomap_kobj = kobject_create_and_add("iomap", fuse_kobj);
+ if (!iomap_kobj)
+ return -ENOMEM;
+
+ error = sysfs_create_files(iomap_kobj, fuse_iomap_attrs);
+ if (error) {
+ kobject_put(iomap_kobj);
+ return error;
+ }
+
+ return 0;
+}
+
+void fuse_iomap_sysfs_cleanup(struct kobject *fuse_kobj)
+{
+ kobject_put(iomap_kobj);
+}
+#endif /* IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG) */
+
bool fuse_iomap_enabled(void)
{
/* Don't let anyone touch iomap until the end of the patchset. */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 32f4b7c9a20a8a..0d39e1dcec308d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -2272,8 +2272,14 @@ static int fuse_sysfs_init(void)
if (err)
goto out_fuse_unregister;
+ err = fuse_iomap_sysfs_init(fuse_kobj);
+ if (err)
+ goto out_fuse_connections;
+
return 0;
+ out_fuse_connections:
+ sysfs_remove_mount_point(fuse_kobj, "connections");
out_fuse_unregister:
kobject_put(fuse_kobj);
out_err:
@@ -2282,6 +2288,7 @@ static int fuse_sysfs_init(void)
static void fuse_sysfs_cleanup(void)
{
+ fuse_iomap_sysfs_cleanup(fuse_kobj);
sysfs_remove_mount_point(fuse_kobj, "connections");
kobject_put(fuse_kobj);
}
* [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2025-09-16 0:28 ` [PATCH 03/28] fuse: make debugging configurable at runtime Darrick J. Wong
@ 2025-09-16 0:29 ` Darrick J. Wong
2025-09-17 3:09 ` Amir Goldstein
2025-09-16 0:29 ` [PATCH 05/28] fuse_trace: " Darrick J. Wong
` (23 subsequent siblings)
27 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:29 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Enable the use of the backing file open/close ioctls so that fuse
servers can register block devices for use with iomap.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
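The lookup rule this patch introduces (a mapping backed by space must name a registered device, while holes and the like may omit one) can be sketched in userspace as follows. This is a simplified model, not the kernel code: `dev_table`, `dev_register`, `find_dev`, `DEV_NULL`, and the enum values are hypothetical, and reference counting is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping types, loosely mirroring the FUSE_IOMAP_TYPE_* names. */
enum map_type { MAP_HOLE, MAP_DELALLOC, MAP_MAPPED, MAP_UNWRITTEN, MAP_INLINE };

#define DEV_NULL	UINT32_MAX	/* hypothetical "no device" cookie */
#define MAX_DEVS	4

struct backing_dev {
	const char *name;
};

/* Registered devices, indexed by the id handed back to the server. */
static struct backing_dev *dev_table[MAX_DEVS];

/* Register a device; returns its id, or -1 if the table is full. */
static int dev_register(struct backing_dev *dev)
{
	for (int i = 0; i < MAX_DEVS; i++) {
		if (!dev_table[i]) {
			dev_table[i] = dev;
			return i;
		}
	}
	return -1;
}

/*
 * Resolve the device named by a mapping.  Returns 0 and stores the device
 * (possibly NULL) in *out, or -1 if a mapped/unwritten mapping does not
 * name a registered device.
 */
static int find_dev(uint32_t dev, enum map_type type, struct backing_dev **out)
{
	struct backing_dev *ret = NULL;

	if (dev != DEV_NULL && dev < MAX_DEVS)
		ret = dev_table[dev];

	if ((type == MAP_MAPPED || type == MAP_UNWRITTEN) && !ret)
		return -1;

	*out = ret;
	return 0;
}
```

In the patch itself the failing case returns -EFSCORRUPTED and the lookup takes a reference that is dropped again before fuse_iomap_begin returns.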
fs/fuse/fuse_i.h | 5 ++
include/uapi/linux/fuse.h | 3 +
fs/fuse/Kconfig | 1
fs/fuse/backing.c | 12 +++++
fs/fuse/file_iomap.c | 99 +++++++++++++++++++++++++++++++++++++++++----
fs/fuse/trace.c | 1
6 files changed, 111 insertions(+), 10 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 389b123f0bf144..791f210c13a876 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -97,12 +97,14 @@ struct fuse_submount_lookup {
};
struct fuse_conn;
+struct fuse_backing;
/** Operations for subsystems that want to use a backing file */
struct fuse_backing_ops {
int (*may_admin)(struct fuse_conn *fc, uint32_t flags);
int (*may_open)(struct fuse_conn *fc, struct file *file);
int (*may_close)(struct fuse_conn *fc, struct file *file);
+ int (*post_open)(struct fuse_conn *fc, struct fuse_backing *fb);
unsigned int type;
};
@@ -110,6 +112,7 @@ struct fuse_backing_ops {
struct fuse_backing {
struct file *file;
struct cred *cred;
+ struct block_device *bdev;
const struct fuse_backing_ops *ops;
/** refcount */
@@ -1704,6 +1707,8 @@ static inline bool fuse_has_iomap(const struct inode *inode)
{
return get_fuse_conn_c(inode)->iomap;
}
+
+extern const struct fuse_backing_ops fuse_iomap_backing_ops;
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 3634cbe602cd9c..3a367f387795ff 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1124,7 +1124,8 @@ struct fuse_notify_retrieve_in {
#define FUSE_BACKING_TYPE_MASK (0xFF)
#define FUSE_BACKING_TYPE_PASSTHROUGH (0)
-#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH)
+#define FUSE_BACKING_TYPE_IOMAP (1)
+#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_IOMAP)
#define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK)
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 52e1a04183e760..baa38cf0f295ff 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -75,6 +75,7 @@ config FUSE_IOMAP
depends on FUSE_FS
depends on BLOCK
select FS_IOMAP
+ select FUSE_BACKING
help
Enable fuse servers to operate the regular file I/O path through
the fs-iomap library in the kernel. This enables higher performance
diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
index 229c101ab46b0e..fc58636ac78eaa 100644
--- a/fs/fuse/backing.c
+++ b/fs/fuse/backing.c
@@ -89,6 +89,10 @@ fuse_backing_ops_from_map(const struct fuse_backing_map *map)
#ifdef CONFIG_FUSE_PASSTHROUGH
case FUSE_BACKING_TYPE_PASSTHROUGH:
return &fuse_passthrough_backing_ops;
+#endif
+#ifdef CONFIG_FUSE_IOMAP
+ case FUSE_BACKING_TYPE_IOMAP:
+ return &fuse_iomap_backing_ops;
#endif
default:
break;
@@ -137,8 +141,16 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
fb->file = file;
fb->cred = prepare_creds();
fb->ops = ops;
+ fb->bdev = NULL;
refcount_set(&fb->count, 1);
+ res = ops->post_open ? ops->post_open(fc, fb) : 0;
+ if (res) {
+ fuse_backing_free(fb);
+ fb = NULL;
+ goto out;
+ }
+
res = fuse_backing_id_alloc(fc, fb);
if (res < 0) {
fuse_backing_free(fb);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index e7d19e2aee4541..3a4161633add0e 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -319,10 +319,6 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
return false;
}
- /* XXX: we don't support devices yet */
- if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
- return false;
-
/* No overflows in the device range, if supplied */
if (map->addr != FUSE_IOMAP_NULL_ADDR &&
BAD_DATA(check_add_overflow(map->addr, map->length, &end)))
@@ -334,6 +330,7 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
/* Convert a mapping from the server into something the kernel can use */
static inline void fuse_iomap_from_server(struct inode *inode,
struct iomap *iomap,
+ const struct fuse_backing *fb,
const struct fuse_iomap_io *fmap)
{
iomap->addr = fmap->addr;
@@ -341,7 +338,9 @@ static inline void fuse_iomap_from_server(struct inode *inode,
iomap->length = fmap->length;
iomap->type = fuse_iomap_type_from_server(fmap->type);
iomap->flags = fuse_iomap_flags_from_server(fmap->flags);
- iomap->bdev = inode->i_sb->s_bdev; /* XXX */
+
+ iomap->bdev = fb ? fb->bdev : NULL;
+ iomap->dax_dev = NULL;
}
/* Convert a mapping from the kernel into something the server can use */
@@ -392,6 +391,27 @@ static inline bool fuse_is_iomap_file_write(unsigned int opflags)
return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
}
+static inline struct fuse_backing *
+fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map)
+{
+ struct fuse_backing *ret = NULL;
+
+ if (map->dev != FUSE_IOMAP_DEV_NULL && map->dev < INT_MAX)
+ ret = fuse_backing_lookup(fc, &fuse_iomap_backing_ops,
+ map->dev);
+
+ switch (map->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ /* Mappings backed by space must have a device/addr */
+ if (BAD_DATA(ret == NULL))
+ return ERR_PTR(-EFSCORRUPTED);
+ break;
+ }
+
+ return ret;
+}
+
static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
unsigned opflags, struct iomap *iomap,
struct iomap *srcmap)
@@ -405,6 +425,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
};
struct fuse_iomap_begin_out outarg = { };
struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_backing *read_dev = NULL;
+ struct fuse_backing *write_dev = NULL;
FUSE_ARGS(args);
int err;
@@ -431,24 +453,44 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
if (err)
return err;
+ read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read);
+ if (IS_ERR(read_dev))
+ return PTR_ERR(read_dev);
+
if (fuse_is_iomap_file_write(opflags) &&
outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+ /* open the write device */
+ write_dev = fuse_iomap_find_dev(fm->fc, &outarg.write);
+ if (IS_ERR(write_dev)) {
+ err = PTR_ERR(write_dev);
+ goto out_read_dev;
+ }
+
/*
* For an out of place write, we must supply the write mapping
* via @iomap, and the read mapping via @srcmap.
*/
- fuse_iomap_from_server(inode, iomap, &outarg.write);
- fuse_iomap_from_server(inode, srcmap, &outarg.read);
+ fuse_iomap_from_server(inode, iomap, write_dev, &outarg.write);
+ fuse_iomap_from_server(inode, srcmap, read_dev, &outarg.read);
} else {
/*
* For everything else (reads, reporting, and pure overwrites),
* we can return the sole mapping through @iomap and leave
* @srcmap unchanged from its default (HOLE).
*/
- fuse_iomap_from_server(inode, iomap, &outarg.read);
+ fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
}
- return 0;
+ /*
+ * XXX: if we ever want to support closing devices, we need a way to
+ * track the fuse_backing refcount all the way through bio endios.
+ * For now we put the refcount here because you can't remove an iomap
+ * device until unmount time.
+ */
+ fuse_backing_put(write_dev);
+out_read_dev:
+ fuse_backing_put(read_dev);
+ return err;
}
/* Decide if we send FUSE_IOMAP_END to the fuse server */
@@ -523,3 +565,42 @@ const struct iomap_ops fuse_iomap_ops = {
.iomap_begin = fuse_iomap_begin,
.iomap_end = fuse_iomap_end,
};
+
+static int fuse_iomap_may_admin(struct fuse_conn *fc, unsigned int flags)
+{
+ if (!fc->iomap)
+ return -EPERM;
+
+ if (flags)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int fuse_iomap_may_open(struct fuse_conn *fc, struct file *file)
+{
+ if (!S_ISBLK(file_inode(file)->i_mode))
+ return -ENODEV;
+
+ return 0;
+}
+
+static int fuse_iomap_post_open(struct fuse_conn *fc, struct fuse_backing *fb)
+{
+ fb->bdev = I_BDEV(fb->file->f_mapping->host);
+ return 0;
+}
+
+static int fuse_iomap_may_close(struct fuse_conn *fc, struct file *file)
+{
+ /* We only support closing iomap block devices at unmount */
+ return -EBUSY;
+}
+
+const struct fuse_backing_ops fuse_iomap_backing_ops = {
+ .type = FUSE_BACKING_TYPE_IOMAP,
+ .may_admin = fuse_iomap_may_admin,
+ .may_open = fuse_iomap_may_open,
+ .may_close = fuse_iomap_may_close,
+ .post_open = fuse_iomap_post_open,
+};
diff --git a/fs/fuse/trace.c b/fs/fuse/trace.c
index 93bd72efc98cd0..3b54f639a5423e 100644
--- a/fs/fuse/trace.c
+++ b/fs/fuse/trace.c
@@ -6,6 +6,7 @@
#include "dev_uring_i.h"
#include "fuse_i.h"
#include "fuse_dev_i.h"
+#include "iomap_priv.h"
#include <linux/pagemap.h>
* [PATCH 05/28] fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (3 preceding siblings ...)
2025-09-16 0:29 ` [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices Darrick J. Wong
@ 2025-09-16 0:29 ` Darrick J. Wong
2025-09-16 0:29 ` [PATCH 06/28] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount Darrick J. Wong
` (22 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:29 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Enhance the existing backing file tracepoints to report the subsystem
that's actually using the backing file.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
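For anyone unfamiliar with how a tracepoint turns the numeric backing type into a name, here is a small userspace sketch of a value-to-string table lookup. It only approximates what __print_symbolic does: `struct sym`, `backing_type_names`, and `print_symbolic` are hypothetical names, and unknown values yield "?" here as a simplification rather than a printed number. The 0 and 1 values match FUSE_BACKING_TYPE_PASSTHROUGH and FUSE_BACKING_TYPE_IOMAP from the uapi header in the previous patch.

```c
#include <stddef.h>

/* One entry of a value-to-name decoding table. */
struct sym {
	unsigned int value;
	const char *name;
};

enum { BACKING_PASSTHROUGH = 0, BACKING_IOMAP = 1 };

/* Same strings as FUSE_BACKING_FLAG_STRINGS in this patch. */
static const struct sym backing_type_names[] = {
	{ BACKING_PASSTHROUGH,	"pass" },
	{ BACKING_IOMAP,	"iomap" },
};

/* Linear search of the table; "?" stands in for an unknown value. */
static const char *print_symbolic(unsigned int value, const struct sym *tbl,
				  size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].value == value)
			return tbl[i].name;
	}
	return "?";
}
```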
fs/fuse/fuse_trace.h | 42 +++++++++++++++++++++++++++++++++++++++---
1 file changed, 39 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index ef94f07cbbf2d4..d39029b30e0198 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -175,6 +175,10 @@ TRACE_EVENT(fuse_request_end,
);
#ifdef CONFIG_FUSE_BACKING
+#define FUSE_BACKING_FLAG_STRINGS \
+ { FUSE_BACKING_TYPE_PASSTHROUGH, "pass" }, \
+ { FUSE_BACKING_TYPE_IOMAP, "iomap" }
+
TRACE_EVENT(fuse_backing_class,
TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
const struct fuse_backing *fb),
@@ -184,7 +188,9 @@ TRACE_EVENT(fuse_backing_class,
TP_STRUCT__entry(
__field(dev_t, connection)
__field(unsigned int, idx)
+ __field(unsigned int, type)
__field(unsigned long, ino)
+ __field(dev_t, rdev)
),
TP_fast_assign(
@@ -193,12 +199,19 @@ TRACE_EVENT(fuse_backing_class,
__entry->connection = fc->dev;
__entry->idx = idx;
__entry->ino = inode->i_ino;
+ __entry->type = fb->ops->type;
+ if (fb->ops->type == FUSE_BACKING_TYPE_IOMAP)
+ __entry->rdev = inode->i_rdev;
+ else
+ __entry->rdev = 0;
),
- TP_printk("connection %u idx %u ino 0x%lx",
+ TP_printk("connection %u idx %u type %s ino 0x%lx rdev %u:%u",
__entry->connection,
__entry->idx,
- __entry->ino)
+ __print_symbolic(__entry->type, FUSE_BACKING_FLAG_STRINGS),
+ __entry->ino,
+ MAJOR(__entry->rdev), MINOR(__entry->rdev))
);
#define DEFINE_FUSE_BACKING_EVENT(name) \
DEFINE_EVENT(fuse_backing_class, name, \
@@ -210,7 +223,6 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
#endif /* CONFIG_FUSE_BACKING */
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
-
/* tracepoint boilerplate so we don't have to keep doing this */
#define FUSE_IOMAP_OPFLAGS_FIELD \
__field(unsigned, opflags)
@@ -452,6 +464,30 @@ TRACE_EVENT(fuse_iomap_end_error,
__entry->written,
__entry->error)
);
+
+TRACE_EVENT(fuse_iomap_dev_add,
+ TP_PROTO(const struct fuse_conn *fc,
+ const struct fuse_backing_map *map),
+
+ TP_ARGS(fc, map),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(int, fd)
+ __field(unsigned int, flags)
+ ),
+
+ TP_fast_assign(
+ __entry->connection = fc->dev;
+ __entry->fd = map->fd;
+ __entry->flags = map->flags;
+ ),
+
+ TP_printk("connection %u fd %d flags 0x%x",
+ __entry->connection,
+ __entry->fd,
+ __entry->flags)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
* [PATCH 06/28] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (4 preceding siblings ...)
2025-09-16 0:29 ` [PATCH 05/28] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:29 ` Darrick J. Wong
2025-09-16 0:29 ` [PATCH 07/28] fuse: create a per-inode flag for toggling iomap Darrick J. Wong
` (21 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:29 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
At unmount time, there are a few things that we need to ask the fuse
server to do.
First, we need to flush queued events to userspace to give the fuse
server a chance to process the events. This is how we make sure that
the server processes FUSE_RELEASE events before the connection goes
down.
Second, to ensure that all those metadata updates are persisted to disk
before we tell the fuse server to destroy itself, send FUSE_SYNCFS after
waiting for the queued events.
Finally, we need to send FUSE_DESTROY to the fuse server so that it
closes the filesystem and the device fds before unmount returns. That
way, a script that does something like "umount /dev/sda ; e2fsck -fn
/dev/sda" will not fail the e2fsck because the fd closure races with
e2fsck startup. Obviously, we need to wait for FUSE_SYNCFS to complete
before sending FUSE_DESTROY.
This is a major behavior change that could break existing fuse servers
in ways we can't predict, so we hide it behind iomap mode.
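The ordering described above can be modelled in plain userspace C.
This is only a sketch: struct toy_conn, the toy_* helpers, and the step
names are made up for illustration and merely record which step would
run when. The real code calls fuse_flush_requests_and_wait(),
sync_filesystem(), and fuse_send_destroy().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step names; they only label the order of operations. */
enum toy_step {
	TOY_FLUSH_REQUESTS,	/* stands in for fuse_flush_requests_and_wait() */
	TOY_SYNCFS,		/* stands in for sync_filesystem() */
	TOY_SEND_DESTROY,	/* stands in for fuse_send_destroy() */
};

#define TOY_MAX_STEPS 8

/* Toy stand-in for the two connection flags that pick the teardown path. */
struct toy_conn {
	int iomap;		/* connection is in iomap mode */
	int destroy;		/* server wants FUSE_DESTROY */
	enum toy_step steps[TOY_MAX_STEPS];
	size_t nr_steps;
};

static void toy_record(struct toy_conn *tc, enum toy_step step)
{
	assert(tc->nr_steps < TOY_MAX_STEPS);
	tc->steps[tc->nr_steps++] = step;
}

/* Record the teardown steps in the order the commit message describes. */
void toy_conn_destroy(struct toy_conn *tc)
{
	if (tc->iomap) {
		toy_record(tc, TOY_FLUSH_REQUESTS);	/* drain queued releases */
		toy_record(tc, TOY_SYNCFS);		/* persist metadata */
		toy_record(tc, TOY_FLUSH_REQUESTS);	/* wait for the syncfs */
		toy_record(tc, TOY_SEND_DESTROY);	/* server closes its fds */
	} else if (tc->destroy) {
		/* existing behavior: flush, then destroy, with no syncfs */
		toy_record(tc, TOY_FLUSH_REQUESTS);
		toy_record(tc, TOY_SEND_DESTROY);
	}
}
```

The point of the second flush is that the syncfs request is itself
queued, so the destroy must not be sent until that request has been
answered.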
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 8 ++++++++
fs/fuse/file_iomap.c | 29 +++++++++++++++++++++++++++++
fs/fuse/inode.c | 9 +++++++--
3 files changed, 44 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 791f210c13a876..3cda9bc6af23fe 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1432,6 +1432,9 @@ int fuse_init_fs_context_submount(struct fs_context *fsc);
*/
void fuse_conn_destroy(struct fuse_mount *fm);
+/* Send the FUSE_DESTROY command. */
+void fuse_send_destroy(struct fuse_mount *fm);
+
/* Drop the connection and free the fuse mount */
void fuse_mount_destroy(struct fuse_mount *fm);
@@ -1709,9 +1712,14 @@ static inline bool fuse_has_iomap(const struct inode *inode)
}
extern const struct fuse_backing_ops fuse_iomap_backing_ops;
+
+void fuse_iomap_mount(struct fuse_mount *fm);
+void fuse_iomap_unmount(struct fuse_mount *fm);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
+# define fuse_iomap_mount(...) ((void)0)
+# define fuse_iomap_unmount(...) ((void)0)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 3a4161633add0e..75e6f668baa9ef 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -604,3 +604,32 @@ const struct fuse_backing_ops fuse_iomap_backing_ops = {
.may_close = fuse_iomap_may_close,
.post_open = fuse_iomap_post_open,
};
+
+void fuse_iomap_mount(struct fuse_mount *fm)
+{
+ struct fuse_conn *fc = fm->fc;
+
+ /*
+ * Enable syncfs for iomap fuse servers so that we can send a final
+ * flush at unmount time. This also means that we can support
+ * freeze/thaw properly.
+ */
+ fc->sync_fs = true;
+}
+
+void fuse_iomap_unmount(struct fuse_mount *fm)
+{
+ struct fuse_conn *fc = fm->fc;
+
+ /*
+ * Flush all pending commands, then issue a syncfs, flush the syncfs,
+ * and send a destroy command. This gives the fuse server a chance to
+ * process all the pending releases, write the last bits of metadata
+ * changes to disk, and close the iomap block devices before we return
+ * from the umount call.
+ */
+ fuse_flush_requests_and_wait(fc);
+ sync_filesystem(fm->sb);
+ fuse_flush_requests_and_wait(fc);
+ fuse_send_destroy(fm);
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0d39e1dcec308d..7cb1426ca3e767 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -623,7 +623,7 @@ static void fuse_umount_begin(struct super_block *sb)
retire_super(sb);
}
-static void fuse_send_destroy(struct fuse_mount *fm)
+void fuse_send_destroy(struct fuse_mount *fm)
{
if (fm->fc->conn_init) {
FUSE_ARGS(args);
@@ -1463,6 +1463,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
init_server_timeout(fc, timeout);
+ if (fc->iomap)
+ fuse_iomap_mount(fm);
+
fm->sb->s_bdi->ra_pages =
min(fm->sb->s_bdi->ra_pages, ra_pages);
fc->minor = arg->minor;
@@ -2101,7 +2104,9 @@ void fuse_conn_destroy(struct fuse_mount *fm)
{
struct fuse_conn *fc = fm->fc;
- if (fc->destroy) {
+ if (fc->iomap) {
+ fuse_iomap_unmount(fm);
+ } else if (fc->destroy) {
/*
* Flush all pending requests (most of which will be
* FUSE_RELEASE) before sending FUSE_DESTROY, because the fuse
* [PATCH 07/28] fuse: create a per-inode flag for toggling iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (5 preceding siblings ...)
2025-09-16 0:29 ` [PATCH 06/28] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount Darrick J. Wong
@ 2025-09-16 0:29 ` Darrick J. Wong
2025-09-16 0:30 ` [PATCH 08/28] fuse_trace: " Darrick J. Wong
` (20 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:29 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Create a per-inode flag to control whether or not this inode actually
uses iomap. This is required because iomap doesn't apply to non-regular
files, and it also enables fuse filesystems to provide some non-iomap
files if desired.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 15 +++++++++++++++
include/uapi/linux/fuse.h | 3 +++
fs/fuse/file.c | 1 +
fs/fuse/file_iomap.c | 32 ++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 2 ++
5 files changed, 53 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3cda9bc6af23fe..791e868e568cc5 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -250,6 +250,8 @@ enum {
FUSE_I_BTIME,
/* Wants or already has page cache IO */
FUSE_I_CACHE_IO_MODE,
+ /* Use iomap for this inode */
+ FUSE_I_IOMAP,
};
struct fuse_conn;
@@ -1715,11 +1717,24 @@ extern const struct fuse_backing_ops fuse_iomap_backing_ops;
void fuse_iomap_mount(struct fuse_mount *fm);
void fuse_iomap_unmount(struct fuse_mount *fm);
+
+void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags);
+void fuse_iomap_evict_inode(struct inode *inode);
+
+static inline bool fuse_inode_has_iomap(const struct inode *inode)
+{
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+
+ return test_bit(FUSE_I_IOMAP, &fi->state);
+}
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
# define fuse_iomap_mount(...) ((void)0)
# define fuse_iomap_unmount(...) ((void)0)
+# define fuse_iomap_init_inode(...) ((void)0)
+# define fuse_iomap_evict_inode(...) ((void)0)
+# define fuse_inode_has_iomap(...) (false)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 3a367f387795ff..cc4bca2941cb79 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -238,6 +238,7 @@
*
* 7.99
* - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
+ * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
*/
#ifndef _LINUX_FUSE_H
@@ -578,9 +579,11 @@ struct fuse_file_lock {
*
* FUSE_ATTR_SUBMOUNT: Object is a submount root
* FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
+ * FUSE_ATTR_IOMAP: Use iomap for this inode
*/
#define FUSE_ATTR_SUBMOUNT (1 << 0)
#define FUSE_ATTR_DAX (1 << 1)
+#define FUSE_ATTR_IOMAP (1 << 2)
/**
* Open flags
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ebdca39b2261d7..8982e0b9661bb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3127,4 +3127,5 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
if (IS_ENABLED(CONFIG_FUSE_DAX))
fuse_dax_inode_init(inode, flags);
+ fuse_iomap_init_inode(inode, flags);
}
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 75e6f668baa9ef..6ffa5710a92ad5 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -633,3 +633,35 @@ void fuse_iomap_unmount(struct fuse_mount *fm)
fuse_flush_requests_and_wait(fc);
fuse_send_destroy(fm);
}
+
+static inline void fuse_inode_set_iomap(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ set_bit(FUSE_I_IOMAP, &fi->state);
+}
+
+static inline void fuse_inode_clear_iomap(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ clear_bit(FUSE_I_IOMAP, &fi->state);
+}
+
+void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
+{
+ struct fuse_conn *conn = get_fuse_conn(inode);
+
+ if (conn->iomap && (attr_flags & FUSE_ATTR_IOMAP))
+ fuse_inode_set_iomap(inode);
+}
+
+void fuse_iomap_evict_inode(struct inode *inode)
+{
+ if (fuse_inode_has_iomap(inode))
+ fuse_inode_clear_iomap(inode);
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7cb1426ca3e767..b209db07e60e33 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -199,6 +199,8 @@ static void fuse_evict_inode(struct inode *inode)
WARN_ON(!list_empty(&fi->write_files));
WARN_ON(!list_empty(&fi->queued_writes));
}
+
+ fuse_iomap_evict_inode(inode);
}
static int fuse_reconfigure(struct fs_context *fsc)
* [PATCH 08/28] fuse_trace: create a per-inode flag for toggling iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (6 preceding siblings ...)
2025-09-16 0:29 ` [PATCH 07/28] fuse: create a per-inode flag for toggling iomap Darrick J. Wong
@ 2025-09-16 0:30 ` Darrick J. Wong
2025-09-16 0:30 ` [PATCH 09/28] fuse: isolate the other regular file IO paths from iomap Darrick J. Wong
` (19 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:30 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 42 ++++++++++++++++++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 4 ++++
2 files changed, 46 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index d39029b30e0198..cdedaf2b2a0ad5 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -300,6 +300,23 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
{ FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \
{ FUSE_IOMAP_TYPE_INLINE, "inline" }
+TRACE_DEFINE_ENUM(FUSE_I_ADVISE_RDPLUS);
+TRACE_DEFINE_ENUM(FUSE_I_INIT_RDPLUS);
+TRACE_DEFINE_ENUM(FUSE_I_SIZE_UNSTABLE);
+TRACE_DEFINE_ENUM(FUSE_I_BAD);
+TRACE_DEFINE_ENUM(FUSE_I_BTIME);
+TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
+TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
+
+#define FUSE_IFLAG_STRINGS \
+ { 1 << FUSE_I_ADVISE_RDPLUS, "advise_rdplus" }, \
+ { 1 << FUSE_I_INIT_RDPLUS, "init_rdplus" }, \
+ { 1 << FUSE_I_SIZE_UNSTABLE, "size_unstable" }, \
+ { 1 << FUSE_I_BAD, "bad" }, \
+ { 1 << FUSE_I_BTIME, "btime" }, \
+ { 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \
+ { 1 << FUSE_I_IOMAP, "iomap" }
+
DECLARE_EVENT_CLASS(fuse_iomap_check_class,
TP_PROTO(const char *func, int line, const char *condition),
@@ -488,6 +505,31 @@ TRACE_EVENT(fuse_iomap_dev_add,
__entry->fd,
__entry->flags)
);
+
+DECLARE_EVENT_CLASS(fuse_inode_state_class,
+ TP_PROTO(const struct inode *inode),
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(unsigned long, state)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->state = fi->state;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " state (%s)",
+ FUSE_INODE_PRINTK_ARGS,
+ __print_flags(__entry->state, "|", FUSE_IFLAG_STRINGS))
+);
+#define DEFINE_FUSE_INODE_STATE_EVENT(name) \
+DEFINE_EVENT(fuse_inode_state_class, name, \
+ TP_PROTO(const struct inode *inode), \
+ TP_ARGS(inode))
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 6ffa5710a92ad5..0759704847598b 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -658,10 +658,14 @@ void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
if (conn->iomap && (attr_flags & FUSE_ATTR_IOMAP))
fuse_inode_set_iomap(inode);
+
+ trace_fuse_iomap_init_inode(inode);
}
void fuse_iomap_evict_inode(struct inode *inode)
{
+ trace_fuse_iomap_evict_inode(inode);
+
if (fuse_inode_has_iomap(inode))
fuse_inode_clear_iomap(inode);
}
* [PATCH 09/28] fuse: isolate the other regular file IO paths from iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (7 preceding siblings ...)
2025-09-16 0:30 ` [PATCH 08/28] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:30 ` Darrick J. Wong
2025-09-16 0:30 ` [PATCH 10/28] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
` (18 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:30 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
iomap completely takes over all regular file IO, so we don't need to
access any of the other mechanisms at all. Gate them off so that we can
eventually overlay them with a union to save space in struct fuse_inode.
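To illustrate why the gating matters, here is a speculative sketch of
what such a union overlay might look like. None of these toy_* types
exist in the kernel, and the eventual layout of struct fuse_inode is not
shown in this series excerpt. The sketch only demonstrates that every
access to the legacy members has to be guarded by the iomap check once
the storage is shared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_file {
	int id;
};

struct toy_inode {
	bool uses_iomap;	/* stands in for the FUSE_I_IOMAP state bit */
	union {
		/* legacy IO state, valid only when !uses_iomap */
		struct {
			struct toy_file *first_write_file;
			int iocachectr;
		} legacy;
		/* hypothetical iomap-only state, valid only when uses_iomap */
		struct {
			unsigned long placeholder;
		} iomap;
	} io;
};

/*
 * Same shape as the __fuse_write_file_get() hunk: return early for
 * iomap inodes so the legacy members are never read.
 */
struct toy_file *toy_write_file_get(struct toy_inode *ti)
{
	if (ti->uses_iomap)
		return NULL;
	return ti->io.legacy.first_write_file;
}

/* Callers that can only be reached on the legacy path assert instead. */
int toy_iocachectr_read(const struct toy_inode *ti)
{
	assert(!ti->uses_iomap);
	return ti->io.legacy.iocachectr;
}
```

Without the early return, an iomap inode would have its iomap state
misinterpreted as a file pointer.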
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dir.c | 14 +++++++++-----
fs/fuse/file.c | 18 +++++++++++++-----
fs/fuse/inode.c | 3 ++-
fs/fuse/iomode.c | 2 +-
4 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b116e42431ee12..6dbce975dc96b7 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1998,6 +1998,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
FUSE_ARGS(args);
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
+ const bool is_iomap = fuse_inode_has_iomap(inode);
bool is_truncate = false;
bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode);
loff_t oldsize;
@@ -2055,12 +2056,15 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
if (err)
return err;
- fuse_set_nowrite(inode);
- fuse_release_nowrite(inode);
+ if (!is_iomap) {
+ fuse_set_nowrite(inode);
+ fuse_release_nowrite(inode);
+ }
}
if (is_truncate) {
- fuse_set_nowrite(inode);
+ if (!is_iomap)
+ fuse_set_nowrite(inode);
set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
if (trust_local_cmtime && attr->ia_size != inode->i_size)
attr->ia_valid |= ATTR_MTIME | ATTR_CTIME;
@@ -2132,7 +2136,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
if (!is_wb || is_truncate)
i_size_write(inode, outarg.attr.size);
- if (is_truncate) {
+ if (is_truncate && !is_iomap) {
/* NOTE: this may release/reacquire fi->lock */
__fuse_release_nowrite(inode);
}
@@ -2156,7 +2160,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
return 0;
error:
- if (is_truncate)
+ if (is_truncate && !is_iomap)
fuse_release_nowrite(inode);
clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8982e0b9661bb1..0f253837b57fdc 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -238,6 +238,7 @@ static int fuse_open(struct inode *inode, struct file *file)
struct fuse_conn *fc = fm->fc;
struct fuse_file *ff;
int err;
+ const bool is_iomap = fuse_inode_has_iomap(inode);
bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
bool is_wb_truncate = is_truncate && fc->writeback_cache;
bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
@@ -259,7 +260,7 @@ static int fuse_open(struct inode *inode, struct file *file)
goto out_inode_unlock;
}
- if (is_wb_truncate || dax_truncate)
+ if ((is_wb_truncate || dax_truncate) && !is_iomap)
fuse_set_nowrite(inode);
err = fuse_do_open(fm, get_node_id(inode), file, false);
@@ -272,7 +273,7 @@ static int fuse_open(struct inode *inode, struct file *file)
fuse_truncate_update_attr(inode, file);
}
- if (is_wb_truncate || dax_truncate)
+ if ((is_wb_truncate || dax_truncate) && !is_iomap)
fuse_release_nowrite(inode);
if (!err) {
if (is_truncate)
@@ -520,12 +521,14 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
{
struct inode *inode = file->f_mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
+ const bool need_sync_writes = !fuse_inode_has_iomap(inode);
int err;
if (fuse_is_bad(inode))
return -EIO;
- inode_lock(inode);
+ if (need_sync_writes)
+ inode_lock(inode);
/*
* Start writeback against all dirty pages of the inode, then
@@ -536,7 +539,8 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
if (err)
goto out;
- fuse_sync_writes(inode);
+ if (need_sync_writes)
+ fuse_sync_writes(inode);
/*
* Due to implementation of fuse writeback
@@ -560,7 +564,8 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
err = 0;
}
out:
- inode_unlock(inode);
+ if (need_sync_writes)
+ inode_unlock(inode);
return err;
}
@@ -1949,6 +1954,9 @@ static struct fuse_file *__fuse_write_file_get(struct fuse_inode *fi)
{
struct fuse_file *ff;
+ if (fuse_inode_has_iomap(&fi->inode))
+ return NULL;
+
spin_lock(&fi->lock);
ff = list_first_entry_or_null(&fi->write_files, struct fuse_file,
write_entry);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b209db07e60e33..4f348fc575a5c3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -194,7 +194,8 @@ static void fuse_evict_inode(struct inode *inode)
if (inode->i_nlink > 0)
atomic64_inc(&fc->evict_ctr);
}
- if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode)) {
+ if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode) &&
+ !fuse_inode_has_iomap(inode)) {
WARN_ON(fi->iocachectr != 0);
WARN_ON(!list_empty(&fi->write_files));
WARN_ON(!list_empty(&fi->queued_writes));
diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
index c99e285f3183ef..92225dfa6e7ad9 100644
--- a/fs/fuse/iomode.c
+++ b/fs/fuse/iomode.c
@@ -204,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
* io modes are not relevant with DAX and with server that does not
* implement open.
*/
- if (FUSE_IS_DAX(inode) || !ff->args)
+ if (fuse_inode_has_iomap(inode) || FUSE_IS_DAX(inode) || !ff->args)
return 0;
/*
* [PATCH 10/28] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (8 preceding siblings ...)
2025-09-16 0:30 ` [PATCH 09/28] fuse: isolate the other regular file IO paths from iomap Darrick J. Wong
@ 2025-09-16 0:30 ` Darrick J. Wong
2025-09-16 0:30 ` [PATCH 11/28] fuse_trace: " Darrick J. Wong
` (17 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:30 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Implement the basic file mapping reporting functions like FIEMAP, BMAP,
and SEEK_DATA/HOLE.
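From userspace, the SEEK_DATA/SEEK_HOLE half of this can be exercised
with nothing more than lseek(2). The helper below is a minimal sketch
(count_data_bytes is a made-up name, not part of this series) that walks
a file's data extents and sums their lengths. It works on any Linux
filesystem, since filesystems without hole tracking report the whole
file as data.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* glibc hides SEEK_DATA/SEEK_HOLE without this */
#endif
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, in case the feature macros were fixed by an earlier include. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Sum the lengths of all data extents in @fd.  If @report is non-NULL,
 * print each extent to it.  Returns the byte count, or -1 with errno
 * set.  Note that this moves the file offset of @fd.
 */
off_t count_data_bytes(int fd, FILE *report)
{
	off_t total = 0;
	off_t pos = 0;

	for (;;) {
		off_t data, hole;

		data = lseek(fd, pos, SEEK_DATA);
		if (data < 0) {
			/* ENXIO means there is no data at or after pos */
			if (errno == ENXIO)
				break;
			return -1;
		}

		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;
		if (hole <= data)
			break;	/* file changed underneath us; stop */

		if (report)
			fprintf(report, "data [%lld, %lld)\n",
				(long long)data, (long long)hole);

		total += hole - data;
		pos = hole;
	}

	return total;
}
```

On an iomap-enabled fuse inode, each of those lseek calls would end up
in fuse_iomap_lseek() and be answered from the server's mappings.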
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 8 ++++++
fs/fuse/dir.c | 1 +
fs/fuse/file.c | 13 ++++++++++
fs/fuse/file_iomap.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 89 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 791e868e568cc5..ea879b45e904c5 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1727,6 +1727,11 @@ static inline bool fuse_inode_has_iomap(const struct inode *inode)
return test_bit(FUSE_I_IOMAP, &fi->state);
}
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 length);
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1735,6 +1740,9 @@ static inline bool fuse_inode_has_iomap(const struct inode *inode)
# define fuse_iomap_init_inode(...) ((void)0)
# define fuse_iomap_evict_inode(...) ((void)0)
# define fuse_inode_has_iomap(...) (false)
+# define fuse_iomap_fiemap NULL
+# define fuse_iomap_lseek(...) (-ENOSYS)
+# define fuse_iomap_bmap(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 6dbce975dc96b7..467ea2f46798ba 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2303,6 +2303,7 @@ static const struct inode_operations fuse_common_inode_operations = {
.set_acl = fuse_set_acl,
.fileattr_get = fuse_fileattr_get,
.fileattr_set = fuse_fileattr_set,
+ .fiemap = fuse_iomap_fiemap,
};
static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 0f253837b57fdc..1941dd7846f12a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2522,6 +2522,12 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
struct fuse_bmap_out outarg;
int err;
+ if (fuse_inode_has_iomap(inode)) {
+ sector_t alt_sec = fuse_iomap_bmap(mapping, block);
+ if (alt_sec > 0)
+ return alt_sec;
+ }
+
if (!inode->i_sb->s_bdev || fm->fc->no_bmap)
return 0;
@@ -2557,6 +2563,13 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
struct fuse_lseek_out outarg;
int err;
+ if (fuse_inode_has_iomap(inode)) {
+ loff_t alt_pos = fuse_iomap_lseek(file, offset, whence);
+
+ if (alt_pos >= 0 || (alt_pos < 0 && alt_pos != -ENOSYS))
+ return alt_pos;
+ }
+
if (fm->fc->no_lseek)
goto fallback;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 0759704847598b..29603eec939589 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -4,6 +4,7 @@
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#include <linux/iomap.h>
+#include <linux/fiemap.h>
#include "fuse_i.h"
#include "fuse_trace.h"
#include "iomap_priv.h"
@@ -561,7 +562,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
return err;
}
-const struct iomap_ops fuse_iomap_ops = {
+static const struct iomap_ops fuse_iomap_ops = {
.iomap_begin = fuse_iomap_begin,
.iomap_end = fuse_iomap_end,
};
@@ -669,3 +670,68 @@ void fuse_iomap_evict_inode(struct inode *inode)
if (fuse_inode_has_iomap(inode))
fuse_inode_clear_iomap(inode);
}
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 count)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ int error;
+
+ /*
+ * We are called directly from the vfs so we need to check per-inode
+ * support here explicitly.
+ */
+ if (!fuse_inode_has_iomap(inode))
+ return -EOPNOTSUPP;
+
+ if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR)
+ return -EOPNOTSUPP;
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ if (!fuse_allow_current_process(fc))
+ return -EACCES;
+
+ inode_lock_shared(inode);
+ error = iomap_fiemap(inode, fieinfo, start, count, &fuse_iomap_ops);
+ inode_unlock_shared(inode);
+
+ return error;
+}
+
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block)
+{
+ ASSERT(fuse_inode_has_iomap(mapping->host));
+
+ return iomap_bmap(mapping, block, &fuse_iomap_ops);
+}
+
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ if (!fuse_allow_current_process(fc))
+ return -EACCES;
+
+ switch (whence) {
+ case SEEK_HOLE:
+ offset = iomap_seek_hole(inode, offset, &fuse_iomap_ops);
+ break;
+ case SEEK_DATA:
+ offset = iomap_seek_data(inode, offset, &fuse_iomap_ops);
+ break;
+ default:
+ return -ENOSYS;
+ }
+
+ if (offset < 0)
+ return offset;
+ return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
+}
* [PATCH 11/28] fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (9 preceding siblings ...)
2025-09-16 0:30 ` [PATCH 10/28] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
@ 2025-09-16 0:30 ` Darrick J. Wong
2025-09-16 0:31 ` [PATCH 12/28] fuse: implement direct IO with iomap Darrick J. Wong
` (16 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:30 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 46 ++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 4 ++++
2 files changed, 50 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index cdedaf2b2a0ad5..4fe51be0e65bdc 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -530,6 +530,52 @@ DEFINE_EVENT(fuse_inode_state_class, name, \
TP_ARGS(inode))
DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
+
+TRACE_EVENT(fuse_iomap_fiemap,
+ TP_PROTO(const struct inode *inode, u64 start, u64 count,
+ unsigned int flags),
+
+ TP_ARGS(inode, start, count, flags),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned int, flags)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = start;
+ __entry->length = count;
+ __entry->flags = flags;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT("fiemap") " flags 0x%x",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->flags)
+);
+
+TRACE_EVENT(fuse_iomap_lseek,
+ TP_PROTO(const struct inode *inode, loff_t offset, int whence),
+
+ TP_ARGS(inode, offset, whence),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(loff_t, offset)
+ __field(int, whence)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->whence = whence;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " offset 0x%llx whence %d",
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->offset,
+ __entry->whence)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 29603eec939589..88d85d572faf97 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -693,6 +693,8 @@ int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
if (!fuse_allow_current_process(fc))
return -EACCES;
+ trace_fuse_iomap_fiemap(inode, start, count, fieinfo->fi_flags);
+
inode_lock_shared(inode);
error = iomap_fiemap(inode, fieinfo, start, count, &fuse_iomap_ops);
inode_unlock_shared(inode);
@@ -720,6 +722,8 @@ loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
if (!fuse_allow_current_process(fc))
return -EACCES;
+ trace_fuse_iomap_lseek(inode, offset, whence);
+
switch (whence) {
case SEEK_HOLE:
offset = iomap_seek_hole(inode, offset, &fuse_iomap_ops);
* [PATCH 12/28] fuse: implement direct IO with iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (10 preceding siblings ...)
2025-09-16 0:30 ` [PATCH 11/28] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:31 ` Darrick J. Wong
2025-09-16 0:31 ` [PATCH 13/28] fuse_trace: " Darrick J. Wong
` (15 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:31 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Start implementing the fuse-iomap file I/O paths by adding direct I/O
support and all the signalling flags that come with it. Buffered I/O
is much more complicated, so we leave that to a subsequent patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 30 +++++
include/uapi/linux/fuse.h | 22 ++++
fs/fuse/dir.c | 7 +
fs/fuse/file.c | 16 +++
fs/fuse/file_iomap.c | 249 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/trace.c | 1
6 files changed, 323 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ea879b45e904c5..ed0608d84ac76c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -643,6 +643,16 @@ struct fuse_sync_bucket {
struct rcu_head rcu;
};
+#ifdef CONFIG_FUSE_IOMAP
+struct fuse_iomap_conn {
+ /* fuse server doesn't implement iomap_end */
+ unsigned int no_end:1;
+
+ /* fuse server doesn't implement iomap_ioend */
+ unsigned int no_ioend:1;
+};
+#endif
+
/**
* A Fuse connection.
*
@@ -992,6 +1002,11 @@ struct fuse_conn {
struct idr backing_files_map;
#endif
+#ifdef CONFIG_FUSE_IOMAP
+ /** iomap information */
+ struct fuse_iomap_conn iomap_conn;
+#endif
+
#ifdef CONFIG_FUSE_IO_URING
/** uring connection information*/
struct fuse_ring *ring;
@@ -1732,6 +1747,17 @@ int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 length);
loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
+
+void fuse_iomap_open(struct inode *inode, struct file *file);
+
+static inline bool fuse_want_iomap_directio(const struct kiocb *iocb)
+{
+ return (iocb->ki_flags & IOCB_DIRECT) &&
+ fuse_inode_has_iomap(file_inode(iocb->ki_filp));
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1743,6 +1769,10 @@ sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
# define fuse_iomap_fiemap NULL
# define fuse_iomap_lseek(...) (-ENOSYS)
# define fuse_iomap_bmap(...) (-ENOSYS)
+# define fuse_iomap_open(...) ((void)0)
+# define fuse_want_iomap_directio(...) (false)
+# define fuse_iomap_direct_read(...) (-ENOSYS)
+# define fuse_iomap_direct_write(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index cc4bca2941cb79..4835a40b8af664 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -666,6 +666,7 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_IOEND = 4093,
FUSE_IOMAP_BEGIN = 4094,
FUSE_IOMAP_END = 4095,
@@ -1389,4 +1390,25 @@ struct fuse_iomap_end_in {
struct fuse_iomap_io map;
};
+/* out of place write extent */
+#define FUSE_IOMAP_IOEND_SHARED (1U << 0)
+/* unwritten extent */
+#define FUSE_IOMAP_IOEND_UNWRITTEN (1U << 1)
+/* don't merge into previous ioend */
+#define FUSE_IOMAP_IOEND_BOUNDARY (1U << 2)
+/* is direct I/O */
+#define FUSE_IOMAP_IOEND_DIRECT (1U << 3)
+/* is append ioend */
+#define FUSE_IOMAP_IOEND_APPEND (1U << 4)
+
+struct fuse_iomap_ioend_in {
+ uint32_t ioendflags; /* FUSE_IOMAP_IOEND_* */
+ int32_t error; /* negative errno or 0 */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t new_addr; /* disk offset of new mapping, in bytes */
+ uint32_t written; /* bytes processed */
+ uint32_t reserved1; /* zero */
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 467ea2f46798ba..e0022eea806fbd 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -712,6 +712,10 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
if (err)
goto out_acl_release;
fuse_dir_changed(dir);
+
+ if (fuse_inode_has_iomap(inode))
+ fuse_iomap_open(inode, file);
+
err = generic_file_open(inode, file);
if (!err) {
file->private_data = ff;
@@ -1750,6 +1754,9 @@ static int fuse_dir_open(struct inode *inode, struct file *file)
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_inode_has_iomap(inode))
+ fuse_iomap_open(inode, file);
+
err = generic_file_open(inode, file);
if (err)
return err;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1941dd7846f12a..baf433b4c23e1b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -246,6 +246,9 @@ static int fuse_open(struct inode *inode, struct file *file)
if (fuse_is_bad(inode))
return -EIO;
+ if (is_iomap)
+ fuse_iomap_open(inode, file);
+
err = generic_file_open(inode, file);
if (err)
return err;
@@ -1754,10 +1757,17 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct file *file = iocb->ki_filp;
struct fuse_file *ff = file->private_data;
struct inode *inode = file_inode(file);
+ ssize_t ret;
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_want_iomap_directio(iocb)) {
+ ret = fuse_iomap_direct_read(iocb, to);
+ if (ret != -ENOSYS)
+ return ret;
+ }
+
if (FUSE_IS_DAX(inode))
return fuse_dax_read_iter(iocb, to);
@@ -1779,6 +1789,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_want_iomap_directio(iocb)) {
+ ssize_t ret = fuse_iomap_direct_write(iocb, from);
+ if (ret != -ENOSYS)
+ return ret;
+ }
+
if (FUSE_IS_DAX(inode))
return fuse_dax_write_iter(iocb, from);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 88d85d572faf97..84eb1fe4fcde49 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -495,10 +495,15 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
}
/* Decide if we send FUSE_IOMAP_END to the fuse server */
-static bool fuse_should_send_iomap_end(const struct iomap *iomap,
+static bool fuse_should_send_iomap_end(const struct fuse_mount *fm,
+ const struct iomap *iomap,
unsigned int opflags, loff_t count,
ssize_t written)
{
+ /* Not implemented on fuse server */
+ if (fm->fc->iomap_conn.no_end)
+ return false;
+
/* fuse server demanded an iomap_end call. */
if (iomap->flags & FUSE_IOMAP_F_WANT_IOMAP_END)
return true;
@@ -523,7 +528,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
struct fuse_mount *fm = get_fuse_mount(inode);
int err = 0;
- if (fuse_should_send_iomap_end(iomap, opflags, count, written)) {
+ if (fuse_should_send_iomap_end(fm, iomap, opflags, count, written)) {
struct fuse_iomap_end_in inarg = {
.opflags = fuse_iomap_op_to_server(opflags),
.attr_ino = fi->orig_ino,
@@ -549,6 +554,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
* libfuse returns ENOSYS for servers that don't
* implement iomap_end
*/
+ fm->fc->iomap_conn.no_end = 1;
err = 0;
break;
case 0:
@@ -567,6 +573,95 @@ static const struct iomap_ops fuse_iomap_ops = {
.iomap_end = fuse_iomap_end,
};
+static inline bool
+fuse_should_send_iomap_ioend(const struct fuse_mount *fm,
+ const struct fuse_iomap_ioend_in *inarg)
+{
+ /* Not implemented on fuse server */
+ if (fm->fc->iomap_conn.no_ioend)
+ return false;
+
+ /* Always send an ioend for errors. */
+ if (inarg->error)
+ return true;
+
+ /* Send an ioend if we performed an IO involving metadata changes. */
+ return inarg->written > 0 &&
+ (inarg->ioendflags & (FUSE_IOMAP_IOEND_SHARED |
+ FUSE_IOMAP_IOEND_UNWRITTEN |
+ FUSE_IOMAP_IOEND_APPEND));
+}
+
+/*
+ * Fast and loose check if this write could update the on-disk inode size.
+ */
+static inline bool fuse_ioend_is_append(const struct fuse_inode *fi,
+ loff_t pos, size_t written)
+{
+ return pos + written > i_size_read(&fi->inode);
+}
+
+static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
+ int error, unsigned ioendflags, sector_t new_addr)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_iomap_ioend_in inarg = {
+ .ioendflags = ioendflags,
+ .error = error,
+ .attr_ino = fi->orig_ino,
+ .pos = pos,
+ .written = written,
+ .new_addr = new_addr,
+ };
+
+ if (fuse_ioend_is_append(fi, pos, written))
+ inarg.ioendflags |= FUSE_IOMAP_IOEND_APPEND;
+
+ if (fuse_should_send_iomap_ioend(fm, &inarg)) {
+ FUSE_ARGS(args);
+ int err;
+
+ args.opcode = FUSE_IOMAP_IOEND;
+ args.nodeid = get_node_id(inode);
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ err = fuse_simple_request(fm, &args);
+ switch (err) {
+ case -ENOSYS:
+ /*
+ * fuse servers can return ENOSYS if ioend processing
+ * is never needed for this filesystem.
+ */
+ fm->fc->iomap_conn.no_ioend = 1;
+ err = 0;
+ break;
+ case 0:
+ break;
+ default:
+ /*
+ * If the write IO failed, return the failure code to
+ * the caller no matter what happens with the ioend.
+ * If the write IO succeeded but the ioend did not,
+ * pass the new error up to the caller.
+ */
+ if (!error)
+ error = err;
+ break;
+ }
+ }
+ if (error)
+ return error;
+
+ /*
+ * If there weren't any ioend errors, update the incore isize, which
+ * confusingly takes the new i_size as "pos".
+ */
+ fuse_write_update_attr(inode, pos + written, written);
+ return 0;
+}
+
static int fuse_iomap_may_admin(struct fuse_conn *fc, unsigned int flags)
{
if (!fc->iomap)
@@ -616,6 +711,8 @@ void fuse_iomap_mount(struct fuse_mount *fm)
* freeze/thaw properly.
*/
fc->sync_fs = true;
+ fc->iomap_conn.no_end = 0;
+ fc->iomap_conn.no_ioend = 0;
}
void fuse_iomap_unmount(struct fuse_mount *fm)
@@ -739,3 +836,151 @@ loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
return offset;
return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
}
+
+void fuse_iomap_open(struct inode *inode, struct file *file)
+{
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+}
+
+enum fuse_ilock_type {
+ SHARED,
+ EXCL,
+};
+
+static int fuse_iomap_ilock_iocb(const struct kiocb *iocb,
+ enum fuse_ilock_type type)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ switch (type) {
+ case SHARED:
+ return inode_trylock_shared(inode) ? 0 : -EAGAIN;
+ case EXCL:
+ return inode_trylock(inode) ? 0 : -EAGAIN;
+ default:
+ ASSERT(0);
+ return -EIO;
+ }
+ } else {
+ switch (type) {
+ case SHARED:
+ inode_lock_shared(inode);
+ break;
+ case EXCL:
+ inode_lock(inode);
+ break;
+ default:
+ ASSERT(0);
+ return -EIO;
+ }
+ }
+
+ return 0;
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ if (!iov_iter_count(to))
+ return 0; /* skip atime */
+
+ file_accessed(iocb->ki_filp);
+
+ ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+ if (ret)
+ return ret;
+ ret = iomap_dio_rw(iocb, to, &fuse_iomap_ops, NULL, 0, NULL, 0);
+ inode_unlock_shared(inode);
+
+ return ret;
+}
+
+static int fuse_iomap_dio_write_end_io(struct kiocb *iocb, ssize_t written,
+ int error, unsigned dioflags)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ unsigned int nofs_flag;
+ unsigned int ioendflags = FUSE_IOMAP_IOEND_DIRECT;
+ int ret;
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ if (dioflags & IOMAP_DIO_COW)
+ ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+ if (dioflags & IOMAP_DIO_UNWRITTEN)
+ ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+ /*
+ * We can allocate memory here while doing writeback on behalf of
+ * memory reclaim. To avoid memory allocation deadlocks set the
+ * task-wide nofs context for the following operations.
+ */
+ nofs_flag = memalloc_nofs_save();
+ ret = fuse_iomap_ioend(inode, iocb->ki_pos, written, error, ioendflags,
+ FUSE_IOMAP_NULL_ADDR);
+ memalloc_nofs_restore(nofs_flag);
+ return ret;
+}
+
+static const struct iomap_dio_ops fuse_iomap_dio_write_ops = {
+ .end_io = fuse_iomap_dio_write_end_io,
+};
+
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ loff_t blockmask = i_blocksize(inode) - 1;
+ size_t count = iov_iter_count(from);
+ unsigned int flags = 0;
+ ssize_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ if (!count)
+ return 0;
+
+ /*
+ * Unaligned direct writes require zeroing of unwritten head and tail
+ * blocks. Extending writes require zeroing of post-EOF tail blocks.
+ * The zeroing writes must complete before we return the direct write
+ * to userspace. Don't even bother trying the fast path.
+ */
+ if ((iocb->ki_pos | count) & blockmask)
+ flags = IOMAP_DIO_FORCE_WAIT;
+
+ ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+ if (ret)
+ goto out_dsync;
+ ret = generic_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out_unlock;
+
+ /*
+ * If we are doing exclusive unaligned I/O, this must be the only I/O
+ * in-flight. Otherwise we risk data corruption due to unwritten
+ * extent conversions from the AIO end_io handler. Wait for all other
+ * I/O to drain first.
+ */
+ if (flags & IOMAP_DIO_FORCE_WAIT)
+ inode_dio_wait(inode);
+
+ ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops,
+ &fuse_iomap_dio_write_ops, flags, NULL, 0);
+ if (ret)
+ goto out_unlock;
+
+out_unlock:
+ inode_unlock(inode);
+out_dsync:
+ return ret;
+}
diff --git a/fs/fuse/trace.c b/fs/fuse/trace.c
index 3b54f639a5423e..9de407148c867d 100644
--- a/fs/fuse/trace.c
+++ b/fs/fuse/trace.c
@@ -9,6 +9,7 @@
#include "iomap_priv.h"
#include <linux/pagemap.h>
+#include <linux/iomap.h>
#define CREATE_TRACE_POINTS
#include "fuse_trace.h"
* [PATCH 13/28] fuse_trace: implement direct IO with iomap
From: Darrick J. Wong @ 2025-09-16 0:31 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 13 +++++
2 files changed, 157 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 4fe51be0e65bdc..434d38ce89c428 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -60,6 +60,7 @@
EM( FUSE_STATX, "FUSE_STATX") \
EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \
EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \
+ EM( FUSE_IOMAP_IOEND, "FUSE_IOMAP_IOEND") \
EMe(CUSE_INIT, "CUSE_INIT")
/*
@@ -300,6 +301,17 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
{ FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \
{ FUSE_IOMAP_TYPE_INLINE, "inline" }
+#define FUSE_IOMAP_IOEND_STRINGS \
+ { FUSE_IOMAP_IOEND_SHARED, "shared" }, \
+ { FUSE_IOMAP_IOEND_UNWRITTEN, "unwritten" }, \
+ { FUSE_IOMAP_IOEND_BOUNDARY, "boundary" }, \
+ { FUSE_IOMAP_IOEND_DIRECT, "direct" }, \
+ { FUSE_IOMAP_IOEND_APPEND, "append" }
+
+#define IOMAP_DIOEND_STRINGS \
+ { IOMAP_DIO_UNWRITTEN, "unwritten" }, \
+ { IOMAP_DIO_COW, "cow" }
+
TRACE_DEFINE_ENUM(FUSE_I_ADVISE_RDPLUS);
TRACE_DEFINE_ENUM(FUSE_I_INIT_RDPLUS);
TRACE_DEFINE_ENUM(FUSE_I_SIZE_UNSTABLE);
@@ -482,6 +494,65 @@ TRACE_EVENT(fuse_iomap_end_error,
__entry->error)
);
+TRACE_EVENT(fuse_iomap_ioend,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_ioend_in *inarg),
+
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned, ioendflags)
+ __field(int, error)
+ __field(uint64_t, new_addr)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = inarg->pos;
+ __entry->length = inarg->written;
+ __entry->ioendflags = inarg->ioendflags;
+ __entry->error = inarg->error;
+ __entry->new_addr = inarg->new_addr;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " ioendflags (%s) error %d new_addr 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+ __entry->error,
+ __entry->new_addr)
+);
+
+TRACE_EVENT(fuse_iomap_ioend_error,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_ioend_in *inarg,
+ int error),
+
+ TP_ARGS(inode, inarg, error),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned, ioendflags)
+ __field(int, error)
+ __field(uint64_t, new_addr)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = inarg->pos;
+ __entry->length = inarg->written;
+ __entry->ioendflags = inarg->ioendflags;
+ __entry->error = error;
+ __entry->new_addr = inarg->new_addr;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " ioendflags (%s) error %d new_addr 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+ __entry->error,
+ __entry->new_addr)
+);
+
TRACE_EVENT(fuse_iomap_dev_add,
TP_PROTO(const struct fuse_conn *fc,
const struct fuse_backing_map *map),
@@ -576,6 +647,79 @@ TRACE_EVENT(fuse_iomap_lseek,
__entry->offset,
__entry->whence)
);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_io_class,
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter),
+ TP_ARGS(iocb, iter),
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(file_inode(iocb->ki_filp), fi, fm);
+ __entry->offset = iocb->ki_pos;
+ __entry->length = iov_iter_count(iter);
+ ),
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+)
+#define DEFINE_FUSE_IOMAP_FILE_IO_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_file_io_class, name, \
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter), \
+ TP_ARGS(iocb, iter))
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
+ ssize_t ret),
+ TP_ARGS(iocb, iter, ret),
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(ssize_t, ret)
+ ),
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(file_inode(iocb->ki_filp), fi, fm);
+ __entry->offset = iocb->ki_pos;
+ __entry->length = iov_iter_count(iter);
+ __entry->ret = ret;
+ ),
+ TP_printk(FUSE_IO_RANGE_FMT() " ret 0x%zx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->ret)
+)
+#define DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_file_ioend_class, name, \
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter, \
+ ssize_t ret), \
+ TP_ARGS(iocb, iter, ret))
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+
+TRACE_EVENT(fuse_iomap_dio_write_end_io,
+ TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
+ int error, unsigned flags),
+
+ TP_ARGS(inode, pos, written, error, flags),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned, dioendflags)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = pos;
+ __entry->length = written;
+ __entry->dioendflags = flags;
+ __entry->error = error;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " dioendflags (%s) error %d",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __print_flags(__entry->dioendflags, "|", IOMAP_DIOEND_STRINGS),
+ __entry->error)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 84eb1fe4fcde49..54e09f60980ef1 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -618,6 +618,8 @@ static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
if (fuse_ioend_is_append(fi, pos, written))
inarg.ioendflags |= FUSE_IOMAP_IOEND_APPEND;
+ trace_fuse_iomap_ioend(inode, &inarg);
+
if (fuse_should_send_iomap_ioend(fm, &inarg)) {
FUSE_ARGS(args);
int err;
@@ -640,6 +642,8 @@ static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
case 0:
break;
default:
+ trace_fuse_iomap_ioend_error(inode, &inarg, err);
+
/*
* If the write IO failed, return the failure code to
* the caller no matter what happens with the ioend.
@@ -888,6 +892,8 @@ ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_direct_read(iocb, to);
+
if (!iov_iter_count(to))
return 0; /* skip atime */
@@ -899,6 +905,7 @@ ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
ret = iomap_dio_rw(iocb, to, &fuse_iomap_ops, NULL, 0, NULL, 0);
inode_unlock_shared(inode);
+ trace_fuse_iomap_direct_read_end(iocb, to, ret);
return ret;
}
@@ -915,6 +922,9 @@ static int fuse_iomap_dio_write_end_io(struct kiocb *iocb, ssize_t written,
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_dio_write_end_io(inode, iocb->ki_pos, written, error,
+ dioflags);
+
if (dioflags & IOMAP_DIO_COW)
ioendflags |= FUSE_IOMAP_IOEND_SHARED;
if (dioflags & IOMAP_DIO_UNWRITTEN)
@@ -946,6 +956,8 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_direct_write(iocb, from);
+
if (!count)
return 0;
@@ -982,5 +994,6 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
out_unlock:
inode_unlock(inode);
out_dsync:
+ trace_fuse_iomap_direct_write_end(iocb, from, ret);
return ret;
}
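
As a reading aid for the new tracepoints, the following userspace sketch shows roughly how the ioendflags field would be rendered using the name table from FUSE_IOMAP_IOEND_STRINGS. It only approximates what __print_flags does with a "|" delimiter. `struct flag_name` and `model_decode_ioendflags` are hypothetical names, and bits outside the table are simply ignored here, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One entry per row of FUSE_IOMAP_IOEND_STRINGS in the patch. */
struct flag_name {
	uint32_t	mask;
	const char	*name;
};

static const struct flag_name ioend_names[] = {
	{ 1U << 0, "shared" },
	{ 1U << 1, "unwritten" },
	{ 1U << 2, "boundary" },
	{ 1U << 3, "direct" },
	{ 1U << 4, "append" },
};

/*
 * Join the names of all set flags with '|'.  Stops early if the buffer is
 * too small, leaving a NUL-terminated but truncated string in buf.
 */
static const char *model_decode_ioendflags(uint32_t flags, char *buf,
					    size_t len)
{
	size_t used = 0;
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(ioend_names) / sizeof(ioend_names[0]); i++) {
		int n;

		if (!(flags & ioend_names[i].mask))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", ioend_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated */
		used += (size_t)n;
	}
	return buf;
}
```

With this table, an extending direct write that touches no shared or unwritten extents would show up in the trace output as "direct|append".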
* [PATCH 14/28] fuse: implement buffered IO with iomap
From: Darrick J. Wong @ 2025-09-16 0:31 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Implement pagecache IO with iomap, complete with hooks into truncate and
fallocate so that the fuse server needn't implement disk block zeroing
of post-EOF and unaligned punch/zero regions.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 30 ++
include/uapi/linux/fuse.h | 5
fs/fuse/dir.c | 23 ++
fs/fuse/file.c | 86 +++++-
fs/fuse/file_iomap.c | 659 ++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 777 insertions(+), 26 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ed0608d84ac76c..7581d22de2340c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -179,6 +179,13 @@ struct fuse_inode {
/* waitq for direct-io completion */
wait_queue_head_t direct_io_waitq;
+
+#ifdef CONFIG_FUSE_IOMAP
+ /* pending io completions */
+ spinlock_t ioend_lock;
+ struct work_struct ioend_work;
+ struct list_head ioend_list;
+#endif
};
/* readdir cache (directory only) */
@@ -1720,6 +1727,8 @@ void fuse_iomap_sysfs_cleanup(struct kobject *kobj);
# define fuse_iomap_sysfs_cleanup(...) ((void)0)
#endif
+sector_t fuse_bmap(struct address_space *mapping, sector_t block);
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
bool fuse_iomap_enabled(void);
@@ -1758,6 +1767,20 @@ static inline bool fuse_want_iomap_directio(const struct kiocb *iocb)
ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
+
+static inline bool fuse_want_iomap_buffered_io(const struct kiocb *iocb)
+{
+ return fuse_inode_has_iomap(file_inode(iocb->ki_filp));
+}
+
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma);
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from);
+int fuse_iomap_setsize_start(struct inode *inode, loff_t newsize);
+int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
+ loff_t length, loff_t new_size);
+int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
+ loff_t endpos);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1773,6 +1796,13 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
# define fuse_want_iomap_directio(...) (false)
# define fuse_iomap_direct_read(...) (-ENOSYS)
# define fuse_iomap_direct_write(...) (-ENOSYS)
+# define fuse_want_iomap_buffered_io(...) (false)
+# define fuse_iomap_mmap(...) (-ENOSYS)
+# define fuse_iomap_buffered_read(...) (-ENOSYS)
+# define fuse_iomap_buffered_write(...) (-ENOSYS)
+# define fuse_iomap_setsize_start(...) (-ENOSYS)
+# define fuse_iomap_fallocate(...) (-ENOSYS)
+# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 4835a40b8af664..c0af8a4d3e30d8 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1351,6 +1351,9 @@ struct fuse_uring_cmd_req {
#define FUSE_IOMAP_OP_ATOMIC (1U << 9)
#define FUSE_IOMAP_OP_DONTCACHE (1U << 10)
+/* pagecache writeback operation */
+#define FUSE_IOMAP_OP_WRITEBACK (1U << 31)
+
#define FUSE_IOMAP_NULL_ADDR (-1ULL) /* addr is not valid */
struct fuse_iomap_io {
@@ -1400,6 +1403,8 @@ struct fuse_iomap_end_in {
#define FUSE_IOMAP_IOEND_DIRECT (1U << 3)
/* is append ioend */
#define FUSE_IOMAP_IOEND_APPEND (1U << 4)
+/* is pagecache writeback */
+#define FUSE_IOMAP_IOEND_WRITEBACK (1U << 5)
struct fuse_iomap_ioend_in {
uint32_t ioendflags; /* FUSE_IOMAP_IOEND_* */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index e0022eea806fbd..d62ceadbc05fb2 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2027,7 +2027,10 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
is_truncate = true;
}
- if (FUSE_IS_DAX(inode) && is_truncate) {
+ if (is_iomap && is_truncate) {
+ filemap_invalidate_lock(mapping);
+ fault_blocked = true;
+ } else if (FUSE_IS_DAX(inode) && is_truncate) {
filemap_invalidate_lock(mapping);
fault_blocked = true;
err = fuse_dax_break_layouts(inode, 0, -1);
@@ -2042,6 +2045,18 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
WARN_ON(!(attr->ia_valid & ATTR_SIZE));
WARN_ON(attr->ia_size != 0);
if (fc->atomic_o_trunc) {
+ if (is_iomap) {
+ /*
+ * fuse_open already set the size to zero and
+ * truncated the pagecache, and we've since
+ * cycled the inode locks. Another thread
+ * could have performed an appending write, so
+ * we don't want to touch the file further.
+ */
+ filemap_invalidate_unlock(mapping);
+ return 0;
+ }
+
/*
* No need to send request to userspace, since actual
* truncation has already been done by OPEN. But still
@@ -2075,6 +2090,12 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
if (trust_local_cmtime && attr->ia_size != inode->i_size)
attr->ia_valid |= ATTR_MTIME | ATTR_CTIME;
+
+ if (is_iomap) {
+ err = fuse_iomap_setsize_start(inode, attr->ia_size);
+ if (err)
+ goto error;
+ }
}
memset(&inarg, 0, sizeof(inarg));
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index baf433b4c23e1b..dd65485c9743bf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -384,7 +384,7 @@ static int fuse_release(struct inode *inode, struct file *file)
* Dirty pages might remain despite write_inode_now() call from
* fuse_flush() due to writes racing with the close.
*/
- if (fc->writeback_cache)
+ if (fc->writeback_cache || fuse_inode_has_iomap(inode))
write_inode_now(inode, 1);
fuse_release_common(file, false);
@@ -1768,6 +1768,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
return ret;
}
+ if (fuse_want_iomap_buffered_io(iocb))
+ return fuse_iomap_buffered_read(iocb, to);
+
if (FUSE_IS_DAX(inode))
return fuse_dax_read_iter(iocb, to);
@@ -1791,10 +1794,29 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (fuse_want_iomap_directio(iocb)) {
ssize_t ret = fuse_iomap_direct_write(iocb, from);
- if (ret != -ENOSYS)
+ switch (ret) {
+ case -ENOTBLK:
+ /*
+ * If we're going to fall back to the iomap buffered
+ * write path only, then try the write again as a
+ * synchronous buffered write. Otherwise we let it
+ * drop through to the old ->direct_IO path.
+ */
+ if (fuse_want_iomap_buffered_io(iocb))
+ iocb->ki_flags |= IOCB_SYNC;
+ fallthrough;
+ case -ENOSYS:
+ /* no implementation, fall through */
+ break;
+ default:
+ /* errors, no progress, or even partial progress */
return ret;
+ }
}
+ if (fuse_want_iomap_buffered_io(iocb))
+ return fuse_iomap_buffered_write(iocb, from);
+
if (FUSE_IS_DAX(inode))
return fuse_dax_write_iter(iocb, from);
@@ -2331,6 +2353,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
struct inode *inode = file_inode(file);
int rc;
+ if (fuse_inode_has_iomap(inode))
+ return fuse_iomap_mmap(file, vma);
+
/* DAX mmap is superior to direct_io mmap */
if (FUSE_IS_DAX(inode))
return fuse_dax_mmap(file, vma);
@@ -2529,7 +2554,7 @@ static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
return err;
}
-static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
+sector_t fuse_bmap(struct address_space *mapping, sector_t block)
{
struct inode *inode = mapping->host;
struct fuse_mount *fm = get_fuse_mount(inode);
@@ -2883,8 +2908,12 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end)
{
- int err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
+ int err;
+ if (fuse_inode_has_iomap(inode))
+ return fuse_iomap_flush_unmap_range(inode, start, end);
+
+ err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
if (!err)
fuse_sync_writes(inode);
@@ -2905,7 +2934,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
.length = length,
.mode = mode
};
+ loff_t newsize = 0;
int err;
+ const bool is_iomap = fuse_inode_has_iomap(inode);
bool block_faults = FUSE_IS_DAX(inode) &&
(!(mode & FALLOC_FL_KEEP_SIZE) ||
(mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)));
@@ -2918,7 +2949,10 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
return -EOPNOTSUPP;
inode_lock(inode);
- if (block_faults) {
+ if (is_iomap) {
+ filemap_invalidate_lock(inode->i_mapping);
+ block_faults = true;
+ } else if (block_faults) {
filemap_invalidate_lock(inode->i_mapping);
err = fuse_dax_break_layouts(inode, 0, -1);
if (err)
@@ -2933,11 +2967,23 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
goto out;
}
+ /*
+ * If we are using iomap for file IO, fallocate must wait for all AIO
+ * to complete before we continue as AIO can change the file size on
+ * completion without holding any locks we currently hold. We must do
+ * this first because AIO can update the in-memory inode size, and the
+ * operations that follow require the in-memory size to be fully
+ * up-to-date.
+ */
+ if (is_iomap)
+ inode_dio_wait(inode);
+
if (!(mode & FALLOC_FL_KEEP_SIZE) &&
offset + length > i_size_read(inode)) {
err = inode_newsize_ok(inode, offset + length);
if (err)
goto out;
+ newsize = offset + length;
}
err = file_modified(file);
@@ -2960,14 +3006,22 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (err)
goto out;
- /* we could have extended the file */
- if (!(mode & FALLOC_FL_KEEP_SIZE)) {
- if (fuse_write_update_attr(inode, offset + length, length))
- file_update_time(file);
- }
+ if (is_iomap) {
+ err = fuse_iomap_fallocate(file, mode, offset, length,
+ newsize);
+ if (err)
+ goto out;
+ } else {
+ /* we could have extended the file */
+ if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+ if (fuse_write_update_attr(inode, newsize, length))
+ file_update_time(file);
+ }
- if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
- truncate_pagecache_range(inode, offset, offset + length - 1);
+ if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
+ truncate_pagecache_range(inode, offset,
+ offset + length - 1);
+ }
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
@@ -3010,6 +3064,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
ssize_t err;
/* mark unstable when write-back is not used, and file_out gets
* extended */
+ const bool is_iomap = fuse_inode_has_iomap(inode_out);
bool is_unstable = (!fc->writeback_cache) &&
((pos_out + len) > inode_out->i_size);
@@ -3053,6 +3108,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
if (err)
goto out;
+ /* See inode_dio_wait comment in fuse_file_fallocate */
+ if (is_iomap)
+ inode_dio_wait(inode_out);
+
if (is_unstable)
set_bit(FUSE_I_SIZE_UNSTABLE, &fi_out->state);
@@ -3075,7 +3134,8 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
if (err)
goto out;
- truncate_inode_pages_range(inode_out->i_mapping,
+ if (!is_iomap)
+ truncate_inode_pages_range(inode_out->i_mapping,
ALIGN_DOWN(pos_out, PAGE_SIZE),
ALIGN(pos_out + outarg.size, PAGE_SIZE) - 1);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 54e09f60980ef1..64f851d04a009b 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -5,6 +5,8 @@
*/
#include <linux/iomap.h>
#include <linux/fiemap.h>
+#include <linux/pagemap.h>
+#include <linux/falloc.h>
#include "fuse_i.h"
#include "fuse_trace.h"
#include "iomap_priv.h"
@@ -241,7 +243,7 @@ static inline uint16_t fuse_iomap_flags_from_server(uint16_t fuse_f_flags)
ret |= FUSE_IOMAP_OP_##word
static inline uint32_t fuse_iomap_op_to_server(unsigned iomap_op_flags)
{
- uint32_t ret = 0;
+ uint32_t ret = iomap_op_flags & FUSE_IOMAP_OP_WRITEBACK;
XMAP(WRITE);
XMAP(ZERO);
@@ -389,7 +391,8 @@ fuse_iomap_begin_validate(const struct inode *inode,
static inline bool fuse_is_iomap_file_write(unsigned int opflags)
{
- return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
+ return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE |
+ FUSE_IOMAP_OP_WRITEBACK);
}
static inline struct fuse_backing *
@@ -736,14 +739,7 @@ void fuse_iomap_unmount(struct fuse_mount *fm)
fuse_send_destroy(fm);
}
-static inline void fuse_inode_set_iomap(struct inode *inode)
-{
- struct fuse_inode *fi = get_fuse_inode(inode);
-
- ASSERT(fuse_has_iomap(inode));
-
- set_bit(FUSE_I_IOMAP, &fi->state);
-}
+static inline void fuse_inode_set_iomap(struct inode *inode);
static inline void fuse_inode_clear_iomap(struct inode *inode)
{
@@ -946,6 +942,110 @@ static const struct iomap_dio_ops fuse_iomap_dio_write_ops = {
.end_io = fuse_iomap_dio_write_end_io,
};
+static const struct iomap_write_ops fuse_iomap_write_ops = {
+};
+
+static int
+fuse_iomap_zero_range(
+ struct inode *inode,
+ loff_t pos,
+ loff_t len,
+ bool *did_zero)
+{
+ return iomap_zero_range(inode, pos, len, did_zero, &fuse_iomap_ops,
+ &fuse_iomap_write_ops, NULL);
+}
+
+/* Take care of zeroing post-EOF blocks when they might exist. */
+static ssize_t
+fuse_iomap_write_zero_eof(
+ struct kiocb *iocb,
+ struct iov_iter *from,
+ bool *drained_dio)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
+ loff_t isize;
+ int error;
+
+ /*
+ * We need to serialise against EOF updates that occur in IO
+ * completions here. We want to make sure that nobody is changing the
+ * size while we do this check until we have placed an IO barrier (i.e.
+ * hold i_rwsem exclusively) that prevents new IO from being
+ * dispatched. The spinlock effectively forms a memory barrier once we
+ * have i_rwsem exclusively so we are guaranteed to see the latest EOF
+ * value and hence be able to correctly determine if we need to run
+ * zeroing.
+ */
+ spin_lock(&fi->lock);
+ isize = i_size_read(inode);
+ if (iocb->ki_pos <= isize) {
+ spin_unlock(&fi->lock);
+ return 0;
+ }
+ spin_unlock(&fi->lock);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+
+ if (!(*drained_dio)) {
+ /*
+ * We now have an IO submission barrier in place, but AIO can
+ * do EOF updates during IO completion and hence we now need to
+ * wait for all of them to drain. Non-AIO DIO will have
+ * drained before we are given the exclusive i_rwsem, and so
+ * for most cases this wait is a no-op.
+ */
+ inode_dio_wait(inode);
+ *drained_dio = true;
+ return 1;
+ }
+
+ filemap_invalidate_lock(mapping);
+ error = fuse_iomap_zero_range(inode, isize, iocb->ki_pos - isize, NULL);
+ filemap_invalidate_unlock(mapping);
+
+ return error;
+}
+
+static ssize_t
+fuse_iomap_write_checks(
+ struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ ssize_t error;
+ bool drained_dio = false;
+
+restart:
+ error = generic_write_checks(iocb, from);
+ if (error <= 0)
+ return error;
+
+ /*
+ * If the offset is beyond the size of the file, we need to zero all
+ * blocks that fall between the existing EOF and the start of this
+ * write.
+ *
+ * We can do an unlocked check for i_size here safely as I/O completion
+ * can only extend EOF. Truncate is locked out at this point, so the
+ * EOF cannot move backwards, only forwards. Hence we only need to take
+ * the slow path when we are at or beyond the current EOF.
+ */
+ if (fuse_inode_has_iomap(inode) &&
+ iocb->ki_pos > i_size_read(inode)) {
+ error = fuse_iomap_write_zero_eof(iocb, from, &drained_dio);
+ if (error == 1)
+ goto restart;
+ if (error)
+ return error;
+ }
+
+ return kiocb_modified(iocb);
+}
+
ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
@@ -973,8 +1073,9 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
ret = fuse_iomap_ilock_iocb(iocb, EXCL);
if (ret)
goto out_dsync;
- ret = generic_write_checks(iocb, from);
- if (ret <= 0)
+
+ ret = fuse_iomap_write_checks(iocb, from);
+ if (ret)
goto out_unlock;
/*
@@ -997,3 +1098,537 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
trace_fuse_iomap_direct_write_end(iocb, from, ret);
return ret;
}
+
+struct fuse_writepage_ctx {
+ struct iomap_writepage_ctx ctx;
+};
+
+static void fuse_iomap_end_ioend(struct iomap_ioend *ioend)
+{
+ struct inode *inode = ioend->io_inode;
+ unsigned int ioendflags = FUSE_IOMAP_IOEND_WRITEBACK;
+ unsigned int nofs_flag;
+ int error = blk_status_to_errno(ioend->io_bio.bi_status);
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ if (fuse_is_bad(inode))
+ return;
+
+ if (ioend->io_flags & IOMAP_IOEND_SHARED)
+ ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+ if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
+ ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+ /*
+ * We can allocate memory here while doing writeback on behalf of
+ * memory reclaim. To avoid memory allocation deadlocks set the
+ * task-wide nofs context for the following operations.
+ */
+ nofs_flag = memalloc_nofs_save();
+ fuse_iomap_ioend(inode, ioend->io_offset, ioend->io_size, error,
+ ioendflags, ioend->io_sector);
+ iomap_finish_ioends(ioend, error);
+ memalloc_nofs_restore(nofs_flag);
+}
+
+/*
+ * Finish all pending IO completions that require transactional modifications.
+ *
+ * We try to merge physical and logically contiguous ioends before completion to
+ * minimise the number of transactions we need to perform during IO completion.
+ * Both unwritten extent conversion and COW remapping need to iterate and modify
+ * one physical extent at a time, so we gain nothing by merging physically
+ * discontiguous extents here.
+ *
+ * The ioend chain length that we can be processing here is largely unbound in
+ * length and we may have to perform significant amounts of work on each ioend
+ * to complete it. Hence we have to be careful about holding the CPU for too
+ * long in this loop.
+ */
+static void fuse_iomap_end_io(struct work_struct *work)
+{
+ struct fuse_inode *fi =
+ container_of(work, struct fuse_inode, ioend_work);
+ struct iomap_ioend *ioend;
+ struct list_head tmp;
+ unsigned long flags;
+
+ spin_lock_irqsave(&fi->ioend_lock, flags);
+ list_replace_init(&fi->ioend_list, &tmp);
+ spin_unlock_irqrestore(&fi->ioend_lock, flags);
+
+ iomap_sort_ioends(&tmp);
+ while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
+ io_list))) {
+ list_del_init(&ioend->io_list);
+ iomap_ioend_try_merge(ioend, &tmp);
+ fuse_iomap_end_ioend(ioend);
+ cond_resched();
+ }
+}
+
+static void fuse_iomap_end_bio(struct bio *bio)
+{
+ struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+ struct inode *inode = ioend->io_inode;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ unsigned long flags;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ spin_lock_irqsave(&fi->ioend_lock, flags);
+ if (list_empty(&fi->ioend_list))
+ WARN_ON_ONCE(!queue_work(system_unbound_wq, &fi->ioend_work));
+ list_add_tail(&ioend->io_list, &fi->ioend_list);
+ spin_unlock_irqrestore(&fi->ioend_lock, flags);
+}
+
+/*
+ * Fast revalidation of the cached writeback mapping. Return true if the current
+ * mapping is valid, false otherwise.
+ */
+static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc,
+ loff_t offset)
+{
+ if (offset < wpc->iomap.offset ||
+ offset >= wpc->iomap.offset + wpc->iomap.length)
+ return false;
+
+ /* XXX actually use revalidation cookie */
+ return true;
+}
+
+/*
+ * If the folio has delalloc blocks on it, the caller is asking us to punch them
+ * out. If we don't, we can leave a stale delalloc mapping covered by a clean
+ * page that needs to be dirtied again before the delalloc mapping can be
+ * converted. This stale delalloc mapping can trip up a later direct I/O read
+ * operation on the same region.
+ *
+ * We prevent this by truncating away the delalloc regions on the folio. Because
+ * they are delalloc, we can do this without needing a transaction. Indeed - if
+ * we get ENOSPC errors, we have to be able to do this truncation without a
+ * transaction as there is no space left for block reservation (typically why
+ * we see an ENOSPC in writeback).
+ */
+static void fuse_iomap_discard_folio(struct folio *folio, loff_t pos, int error)
+{
+ struct inode *inode = folio->mapping->host;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ loff_t end = folio_pos(folio) + folio_size(folio);
+
+ if (fuse_is_bad(inode))
+ return;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ printk_ratelimited(KERN_ERR
+ "page discard on page %px, inode 0x%llx, pos %llu.",
+ folio, fi->orig_ino, pos);
+
+ /* Userspace may need to remove delayed allocations */
+ fuse_iomap_ioend(inode, pos, end - pos, error, 0, FUSE_IOMAP_NULL_ADDR);
+}
+
+static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
+ struct folio *folio, u64 offset,
+ unsigned int len, u64 end_pos)
+{
+ struct inode *inode = wpc->inode;
+ struct iomap write_iomap, dontcare;
+ ssize_t ret;
+
+ if (fuse_is_bad(inode)) {
+ ret = -EIO;
+ goto discard_folio;
+ }
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ if (!fuse_iomap_revalidate_writeback(wpc, offset)) {
+ ret = fuse_iomap_begin(inode, offset, len,
+ FUSE_IOMAP_OP_WRITEBACK,
+ &write_iomap, &dontcare);
+ if (ret)
+ goto discard_folio;
+
+ /*
+ * Landed in a hole or beyond EOF? Send that to iomap, it'll
+ * skip writing back the file range.
+ */
+ if (write_iomap.offset > offset) {
+ write_iomap.length = write_iomap.offset - offset;
+ write_iomap.offset = offset;
+ write_iomap.type = IOMAP_HOLE;
+ }
+
+ memcpy(&wpc->iomap, &write_iomap, sizeof(struct iomap));
+ }
+
+ ret = iomap_add_to_ioend(wpc, folio, offset, end_pos, len);
+ if (ret < 0)
+ goto discard_folio;
+
+ return ret;
+discard_folio:
+ fuse_iomap_discard_folio(folio, offset, ret);
+ return ret;
+}
+
+static int fuse_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
+ int error)
+{
+ struct iomap_ioend *ioend = wpc->wb_ctx;
+
+ ASSERT(fuse_inode_has_iomap(ioend->io_inode));
+
+ /* always call our ioend function, even if we cancel the bio */
+ ioend->io_bio.bi_end_io = fuse_iomap_end_bio;
+ return iomap_ioend_writeback_submit(wpc, error);
+}
+
+static const struct iomap_writeback_ops fuse_iomap_writeback_ops = {
+ .writeback_range = fuse_iomap_writeback_range,
+ .writeback_submit = fuse_iomap_writeback_submit,
+};
+
+static int fuse_iomap_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct fuse_writepage_ctx wpc = {
+ .ctx = {
+ .inode = mapping->host,
+ .wbc = wbc,
+ .ops = &fuse_iomap_writeback_ops,
+ },
+ };
+
+ ASSERT(fuse_inode_has_iomap(mapping->host));
+
+ return iomap_writepages(&wpc.ctx);
+}
+
+static int fuse_iomap_read_folio(struct file *file, struct folio *folio)
+{
+ ASSERT(fuse_inode_has_iomap(file_inode(file)));
+
+ return iomap_read_folio(folio, &fuse_iomap_ops);
+}
+
+static void fuse_iomap_readahead(struct readahead_control *rac)
+{
+ ASSERT(fuse_inode_has_iomap(file_inode(rac->file)));
+
+ iomap_readahead(rac, &fuse_iomap_ops);
+}
+
+static const struct address_space_operations fuse_iomap_aops = {
+ .read_folio = fuse_iomap_read_folio,
+ .readahead = fuse_iomap_readahead,
+ .writepages = fuse_iomap_writepages,
+ .dirty_folio = iomap_dirty_folio,
+ .release_folio = iomap_release_folio,
+ .invalidate_folio = iomap_invalidate_folio,
+ .migrate_folio = filemap_migrate_folio,
+ .is_partially_uptodate = iomap_is_partially_uptodate,
+ .error_remove_folio = generic_error_remove_folio,
+
+ /* These aren't pagecache operations per se */
+ .bmap = fuse_bmap,
+};
+
+static inline void fuse_inode_set_iomap(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ inode->i_data.a_ops = &fuse_iomap_aops;
+
+ INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
+ INIT_LIST_HEAD(&fi->ioend_list);
+ spin_lock_init(&fi->ioend_lock);
+ set_bit(FUSE_I_IOMAP, &fi->state);
+}
+
+/*
+ * Locking for serialisation of IO during page faults. This results in a lock
+ * ordering of:
+ *
+ * mmap_lock (MM)
+ * sb_start_pagefault(vfs, freeze)
+ * invalidate_lock (vfs - truncate serialisation)
+ * page_lock (MM)
+ * i_lock (FUSE - extent map serialisation)
+ */
+static vm_fault_t fuse_iomap_page_mkwrite(struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ vm_fault_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vmf->vma->vm_file);
+
+ filemap_invalidate_lock_shared(mapping);
+ ret = iomap_page_mkwrite(vmf, &fuse_iomap_ops, NULL);
+ filemap_invalidate_unlock_shared(mapping);
+
+ sb_end_pagefault(inode->i_sb);
+ return ret;
+}
+
+static const struct vm_operations_struct fuse_iomap_vm_ops = {
+ .fault = filemap_fault,
+ .map_pages = filemap_map_pages,
+ .page_mkwrite = fuse_iomap_page_mkwrite,
+};
+
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ ASSERT(fuse_inode_has_iomap(file_inode(file)));
+
+ file_accessed(file);
+ vma->vm_ops = &fuse_iomap_vm_ops;
+ return 0;
+}
+
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ if (!iov_iter_count(to))
+ return 0; /* skip atime */
+
+ file_accessed(iocb->ki_filp);
+
+ ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+ if (ret)
+ return ret;
+ ret = generic_file_read_iter(iocb, to);
+ inode_unlock_shared(inode);
+
+ return ret;
+}
+
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ loff_t pos = iocb->ki_pos;
+ ssize_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ if (!iov_iter_count(from))
+ return 0;
+
+ ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+ if (ret)
+ return ret;
+
+ ret = fuse_iomap_write_checks(iocb, from);
+ if (ret)
+ goto out_unlock;
+
+ if (inode->i_size < pos + iov_iter_count(from))
+ set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+ ret = iomap_file_buffered_write(iocb, from, &fuse_iomap_ops,
+ &fuse_iomap_write_ops, NULL);
+
+ if (ret > 0)
+ fuse_write_update_attr(inode, pos + ret, ret);
+ clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+out_unlock:
+ inode_unlock(inode);
+
+ if (ret > 0) {
+ /* Handle various SYNC-type writes */
+ ret = generic_write_sync(iocb, ret);
+ }
+ return ret;
+}
+
+static int
+fuse_iomap_truncate_page(
+ struct inode *inode,
+ loff_t pos,
+ bool *did_zero)
+{
+ return iomap_truncate_page(inode, pos, did_zero, &fuse_iomap_ops,
+ &fuse_iomap_write_ops, NULL);
+}
+/*
+ * Truncate pagecache for a file before sending the truncate request to
+ * userspace. Must have write permission and not be a directory.
+ *
+ * Caution: The caller of this function is responsible for calling
+ * setattr_prepare() or otherwise verifying the change is fine.
+ */
+int
+fuse_iomap_setsize_start(
+ struct inode *inode,
+ loff_t newsize)
+{
+ loff_t oldsize = i_size_read(inode);
+ int error;
+ bool did_zeroing = false;
+
+ rwsem_assert_held_write(&inode->i_rwsem);
+ rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
+ ASSERT(S_ISREG(inode->i_mode));
+
+ /*
+ * Wait for all direct I/O to complete.
+ */
+ inode_dio_wait(inode);
+
+ /*
+ * File data changes must be complete and flushed to disk before we
+ * call userspace to modify the inode.
+ *
+ * Start with zeroing any data beyond EOF that we may expose on file
+ * extension, or zeroing out the rest of the block on a downward
+ * truncate.
+ */
+ if (newsize > oldsize)
+ error = fuse_iomap_zero_range(inode, oldsize, newsize - oldsize,
+ &did_zeroing);
+ else
+ error = fuse_iomap_truncate_page(inode, newsize, &did_zeroing);
+ if (error)
+ return error;
+
+ /*
+ * We've already locked out new page faults, so now we can safely
+ * remove pages from the page cache knowing they won't get refaulted
+ * until we drop the mapping invalidation lock after the extent
+ * manipulations are complete. The truncate_setsize() call also cleans
+ * folios spanning EOF on extending truncates and hence ensures
+ * sub-page block size filesystems are correctly handled, too.
+ *
+ * And we update in-core i_size and truncate page cache beyond newsize
+ * before writing back the whole file, so we're guaranteed not to write
+ * stale data past the new EOF on truncate down.
+ */
+ truncate_setsize(inode, newsize);
+
+ /*
+ * Flush the entire pagecache to ensure the fuse server logs the inode
+ * size change and all dirty data that might be associated with it.
+ * We don't know the ondisk inode size, so we only have this clumsy
+ * hammer.
+ */
+ return filemap_write_and_wait(inode->i_mapping);
+}
+
+/*
+ * Prepare for a file data block remapping operation by flushing and unmapping
+ * all pagecache for the entire range.
+ */
+int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
+ loff_t endpos)
+{
+ loff_t start, end;
+ unsigned int rounding;
+ int error;
+
+ /*
+ * Make sure we extend the flush out to extent alignment boundaries so
+ * any extent range overlapping the start/end of the modification we
+ * are about to do is clean and idle.
+ */
+ rounding = max_t(unsigned int, i_blocksize(inode), PAGE_SIZE);
+ start = round_down(pos, rounding);
+ end = round_up(endpos + 1, rounding) - 1;
+
+ error = filemap_write_and_wait_range(inode->i_mapping, start, end);
+ if (error)
+ return error;
+ truncate_pagecache_range(inode, start, end);
+ return 0;
+}
+
+static int fuse_iomap_punch_range(struct inode *inode, loff_t offset,
+ loff_t length)
+{
+ loff_t isize = i_size_read(inode);
+ int error;
+
+ /*
+ * Now that we've unmap all full blocks we'll have to zero out any
+ * partial block at the beginning and/or end. iomap_zero_range is
+ * smart enough to skip holes and unwritten extents, including those we
+ * just created, but we must take care not to zero beyond EOF, which
+ * would enlarge i_size.
+ */
+ if (offset >= isize)
+ return 0;
+ if (offset + length > isize)
+ length = isize - offset;
+ error = fuse_iomap_zero_range(inode, offset, length, NULL);
+ if (error)
+ return error;
+
+ /*
+ * If we zeroed right up to EOF and EOF straddles a page boundary we
+ * must make sure that the post-EOF area is also zeroed because the
+ * page could be mmap'd and iomap_zero_range doesn't do that for us.
+ * Writeback of the eof page will do this, albeit clumsily.
+ */
+ if (offset + length >= isize && offset_in_page(offset + length) > 0) {
+ error = filemap_write_and_wait_range(inode->i_mapping,
+ round_down(offset + length, PAGE_SIZE),
+ LLONG_MAX);
+ }
+
+ return error;
+}
+
+int
+fuse_iomap_fallocate(
+ struct file *file,
+ int mode,
+ loff_t offset,
+ loff_t length,
+ loff_t new_size)
+{
+ struct inode *inode = file_inode(file);
+ int error;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ /*
+ * If we unmapped blocks from the file range, then we zero the
+ * pagecache for those regions and push them to disk rather than make
+ * the fuse server manually zero the disk blocks.
+ */
+ if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) {
+ error = fuse_iomap_punch_range(inode, offset, length);
+ if (error)
+ return error;
+ }
+
+ /*
+ * If this is an extending write, we need to zero the bytes beyond the
+ * new EOF and bounce the new size out to userspace.
+ */
+ if (new_size) {
+ error = fuse_iomap_setsize_start(inode, new_size);
+ if (error)
+ return error;
+
+ fuse_write_update_attr(inode, new_size, length);
+ }
+
+ file_update_time(file);
+ return 0;
+}
* [PATCH 15/28] fuse_trace: implement buffered IO with iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (13 preceding siblings ...)
2025-09-16 0:31 ` [PATCH 14/28] fuse: implement buffered " Darrick J. Wong
@ 2025-09-16 0:31 ` Darrick J. Wong
2025-09-16 0:32 ` [PATCH 16/28] fuse: implement large folios for iomap pagecache files Darrick J. Wong
` (12 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:31 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
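Reviewer note, not part of the commit: the range-style tracepoints all
report a byte offset plus a length. Below is a rough userspace sketch of
how the flush_unmap_range event arrives at its (start, end + 1 - start)
arguments. The sketch_* names are hypothetical and exist only for
illustration; they stand in for round_down()/round_up(), and the block
size and page size are passed in as plain integers rather than read from
an inode.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for round_down(); align must be a power of two. */
static int64_t sketch_round_down(int64_t x, int64_t align)
{
	assert(align > 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* Hypothetical stand-in for round_up(); align must be a power of two. */
static int64_t sketch_round_up(int64_t x, int64_t align)
{
	return sketch_round_down(x + align - 1, align);
}

struct sketch_range {
	int64_t start;	/* first byte flushed and unmapped */
	int64_t length;	/* byte count, i.e. end + 1 - start */
};

/*
 * Widen [pos, endpos] to the larger of the block size and the page size,
 * the same way the flush/unmap helper does before it traces the range.
 */
static struct sketch_range sketch_flush_range(int64_t pos, int64_t endpos,
					      int64_t blocksize,
					      int64_t pagesize)
{
	int64_t rounding = blocksize > pagesize ? blocksize : pagesize;
	int64_t start = sketch_round_down(pos, rounding);
	int64_t end = sketch_round_up(endpos + 1, rounding) - 1;
	struct sketch_range r = { start, end + 1 - start };

	return r;
}
```

So a request covering bytes 5000 through 9000 on a 4k block, 4k page
setup would be traced as offset 4096, length 8192, assuming the sketch
matches the kernel rounding helpers.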
fs/fuse/fuse_trace.h | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 40 ++++++++
2 files changed, 288 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 434d38ce89c428..e69ad48b14066b 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -224,6 +224,9 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
#endif /* CONFIG_FUSE_BACKING */
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+struct iomap_writepage_ctx;
+struct iomap_ioend;
+
/* tracepoint boilerplate so we don't have to keep doing this */
#define FUSE_IOMAP_OPFLAGS_FIELD \
__field(unsigned, opflags)
@@ -291,7 +294,8 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
{ FUSE_IOMAP_OP_UNSHARE, "unshare" }, \
{ FUSE_IOMAP_OP_DAX, "fsdax" }, \
{ FUSE_IOMAP_OP_ATOMIC, "atomic" }, \
- { FUSE_IOMAP_OP_DONTCACHE, "dontcache" }
+ { FUSE_IOMAP_OP_DONTCACHE, "dontcache" }, \
+ { FUSE_IOMAP_OP_WRITEBACK, "writeback" }
#define FUSE_IOMAP_TYPE_STRINGS \
{ FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \
@@ -306,7 +310,8 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
{ FUSE_IOMAP_IOEND_UNWRITTEN, "unwritten" }, \
{ FUSE_IOMAP_IOEND_BOUNDARY, "boundary" }, \
{ FUSE_IOMAP_IOEND_DIRECT, "direct" }, \
- { FUSE_IOMAP_IOEND_APPEND, "append" }
+ { FUSE_IOMAP_IOEND_APPEND, "append" }, \
+ { FUSE_IOMAP_IOEND_WRITEBACK, "writeback" }
#define IOMAP_DIOEND_STRINGS \
{ IOMAP_DIO_UNWRITTEN, "unwritten" }, \
@@ -329,6 +334,12 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
{ 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \
{ 1 << FUSE_I_IOMAP, "iomap" }
+#define IOMAP_IOEND_STRINGS \
+ { IOMAP_IOEND_SHARED, "shared" }, \
+ { IOMAP_IOEND_UNWRITTEN, "unwritten" }, \
+ { IOMAP_IOEND_BOUNDARY, "boundary" }, \
+ { IOMAP_IOEND_DIRECT, "direct" }
+
DECLARE_EVENT_CLASS(fuse_iomap_check_class,
TP_PROTO(const char *func, int line, const char *condition),
@@ -668,6 +679,9 @@ DEFINE_EVENT(fuse_iomap_file_io_class, name, \
TP_ARGS(iocb, iter))
DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_write_zero_eof);
DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
@@ -694,6 +708,8 @@ DEFINE_EVENT(fuse_iomap_file_ioend_class, name, \
TP_ARGS(iocb, iter, ret))
DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_write_end);
TRACE_EVENT(fuse_iomap_dio_write_end_io,
TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
@@ -720,6 +736,238 @@ TRACE_EVENT(fuse_iomap_dio_write_end_io,
__print_flags(__entry->dioendflags, "|", IOMAP_DIOEND_STRINGS),
__entry->error)
);
+
+TRACE_EVENT(fuse_iomap_end_ioend,
+ TP_PROTO(const struct iomap_ioend *ioend),
+
+ TP_ARGS(ioend),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned int, ioendflags)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(ioend->io_inode, fi, fm);
+ __entry->offset = ioend->io_offset;
+ __entry->length = ioend->io_size;
+ __entry->ioendflags = ioend->io_flags;
+ __entry->error = blk_status_to_errno(ioend->io_bio.bi_status);
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " ioendflags (%s) error %d",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __print_flags(__entry->ioendflags, "|", IOMAP_IOEND_STRINGS),
+ __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_writeback_range,
+ TP_PROTO(const struct inode *inode, u64 offset, unsigned int count,
+ u64 end_pos),
+
+ TP_ARGS(inode, offset, count, end_pos),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(uint64_t, end_pos)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = count;
+ __entry->end_pos = end_pos;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " end_pos 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->end_pos)
+);
+
+TRACE_EVENT(fuse_iomap_writeback_submit,
+ TP_PROTO(const struct iomap_writepage_ctx *wpc, int error),
+
+ TP_ARGS(wpc, error),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned int, nr_folios)
+ __field(uint64_t, addr)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(wpc->inode, fi, fm);
+ __entry->nr_folios = wpc->nr_folios;
+ __entry->offset = wpc->iomap.offset;
+ __entry->length = wpc->iomap.length;
+ __entry->addr = wpc->iomap.addr << 9;
+ __entry->error = error;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " addr 0x%llx nr_folios %u error %d",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->addr,
+ __entry->nr_folios,
+ __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_discard_folio,
+ TP_PROTO(const struct inode *inode, loff_t offset, size_t count),
+
+ TP_ARGS(inode, offset, count),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = count;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
+TRACE_EVENT(fuse_iomap_writepages,
+ TP_PROTO(const struct inode *inode, const struct writeback_control *wbc),
+
+ TP_ARGS(inode, wbc),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(long, nr_to_write)
+ __field(bool, sync_all)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = wbc->range_start;
+ __entry->length = wbc->range_end - wbc->range_start + 1;
+ __entry->nr_to_write = wbc->nr_to_write;
+ __entry->sync_all = wbc->sync_mode == WB_SYNC_ALL;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " nr_to_write %ld sync_all? %d",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->nr_to_write,
+ __entry->sync_all)
+);
+
+TRACE_EVENT(fuse_iomap_read_folio,
+ TP_PROTO(const struct folio *folio),
+
+ TP_ARGS(folio),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(folio->mapping->host, fi, fm);
+ __entry->offset = folio_pos(folio);
+ __entry->length = folio_size(folio);
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
+TRACE_EVENT(fuse_iomap_readahead,
+ TP_PROTO(const struct readahead_control *rac),
+
+ TP_ARGS(rac),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ struct readahead_control *mutrac = (struct readahead_control *)rac;
+ FUSE_INODE_ASSIGN(file_inode(rac->file), fi, fm);
+ __entry->offset = readahead_pos(mutrac);
+ __entry->length = readahead_length(mutrac);
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
+TRACE_EVENT(fuse_iomap_page_mkwrite,
+ TP_PROTO(const struct vm_fault *vmf),
+
+ TP_ARGS(vmf),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ struct folio *folio = page_folio(vmf->page);
+ FUSE_INODE_ASSIGN(file_inode(vmf->vma->vm_file), fi, fm);
+ __entry->offset = folio_pos(folio);
+ __entry->length = folio_size(folio);
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_range_class,
+ TP_PROTO(const struct inode *inode, loff_t offset, loff_t length),
+
+ TP_ARGS(inode, offset, length),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = length;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+)
+#define DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_file_range_class, name, \
+ TP_PROTO(const struct inode *inode, loff_t offset, loff_t length), \
+ TP_ARGS(inode, offset, length))
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_up);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_flush_unmap_range);
+
+TRACE_EVENT(fuse_iomap_fallocate,
+ TP_PROTO(const struct inode *inode, int mode, loff_t offset,
+ loff_t length, loff_t newsize),
+ TP_ARGS(inode, mode, offset, length, newsize),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(loff_t, newsize)
+ __field(int, mode)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = length;
+ __entry->mode = mode;
+ __entry->newsize = newsize;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " mode 0x%x newsize 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->mode,
+ __entry->newsize)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 64f851d04a009b..d771b1068fb912 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1003,6 +1003,8 @@ fuse_iomap_write_zero_eof(
return 1;
}
+ trace_fuse_iomap_write_zero_eof(iocb, from);
+
filemap_invalidate_lock(mapping);
error = fuse_iomap_zero_range(inode, isize, iocb->ki_pos - isize, NULL);
filemap_invalidate_unlock(mapping);
@@ -1115,6 +1117,8 @@ static void fuse_iomap_end_ioend(struct iomap_ioend *ioend)
if (fuse_is_bad(inode))
return;
+ trace_fuse_iomap_end_ioend(ioend);
+
if (ioend->io_flags & IOMAP_IOEND_SHARED)
ioendflags |= FUSE_IOMAP_IOEND_SHARED;
if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
@@ -1223,6 +1227,8 @@ static void fuse_iomap_discard_folio(struct folio *folio, loff_t pos, int error)
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_discard_folio(inode, pos, folio_size(folio));
+
printk_ratelimited(KERN_ERR
"page discard on page %px, inode 0x%llx, pos %llu.",
folio, fi->orig_ino, pos);
@@ -1246,6 +1252,8 @@ static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_writeback_range(inode, offset, len, end_pos);
+
if (!fuse_iomap_revalidate_writeback(wpc, offset)) {
ret = fuse_iomap_begin(inode, offset, len,
FUSE_IOMAP_OP_WRITEBACK,
@@ -1283,6 +1291,8 @@ static int fuse_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
ASSERT(fuse_inode_has_iomap(ioend->io_inode));
+ trace_fuse_iomap_writeback_submit(wpc, error);
+
/* always call our ioend function, even if we cancel the bio */
ioend->io_bio.bi_end_io = fuse_iomap_end_bio;
return iomap_ioend_writeback_submit(wpc, error);
@@ -1306,6 +1316,8 @@ static int fuse_iomap_writepages(struct address_space *mapping,
ASSERT(fuse_inode_has_iomap(mapping->host));
+ trace_fuse_iomap_writepages(mapping->host, wbc);
+
return iomap_writepages(&wpc.ctx);
}
@@ -1313,6 +1325,8 @@ static int fuse_iomap_read_folio(struct file *file, struct folio *folio)
{
ASSERT(fuse_inode_has_iomap(file_inode(file)));
+ trace_fuse_iomap_read_folio(folio);
+
return iomap_read_folio(folio, &fuse_iomap_ops);
}
@@ -1320,6 +1334,8 @@ static void fuse_iomap_readahead(struct readahead_control *rac)
{
ASSERT(fuse_inode_has_iomap(file_inode(rac->file)));
+ trace_fuse_iomap_readahead(rac);
+
iomap_readahead(rac, &fuse_iomap_ops);
}
@@ -1370,6 +1386,8 @@ static vm_fault_t fuse_iomap_page_mkwrite(struct vm_fault *vmf)
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_page_mkwrite(vmf);
+
sb_start_pagefault(inode->i_sb);
file_update_time(vmf->vma->vm_file);
@@ -1403,6 +1421,8 @@ ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to)
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_buffered_read(iocb, to);
+
if (!iov_iter_count(to))
return 0; /* skip atime */
@@ -1414,6 +1434,7 @@ ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to)
ret = generic_file_read_iter(iocb, to);
inode_unlock_shared(inode);
+ trace_fuse_iomap_buffered_read_end(iocb, to, ret);
return ret;
}
@@ -1426,6 +1447,8 @@ ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from)
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_buffered_write(iocb, from);
+
if (!iov_iter_count(from))
return 0;
@@ -1454,6 +1477,7 @@ ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from)
/* Handle various SYNC-type writes */
ret = generic_write_sync(iocb, ret);
}
+ trace_fuse_iomap_buffered_write_end(iocb, from, ret);
return ret;
}
@@ -1499,11 +1523,17 @@ fuse_iomap_setsize_start(
* extension, or zeroing out the rest of the block on a downward
* truncate.
*/
- if (newsize > oldsize)
+ if (newsize > oldsize) {
+ trace_fuse_iomap_truncate_up(inode, oldsize, newsize - oldsize);
+
error = fuse_iomap_zero_range(inode, oldsize, newsize - oldsize,
&did_zeroing);
- else
+ } else {
+ trace_fuse_iomap_truncate_down(inode, newsize,
+ oldsize - newsize);
+
error = fuse_iomap_truncate_page(inode, newsize, &did_zeroing);
+ }
if (error)
return error;
@@ -1550,6 +1580,8 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
start = round_down(pos, rounding);
end = round_up(endpos + 1, rounding) - 1;
+ trace_fuse_iomap_flush_unmap_range(inode, start, end + 1 - start);
+
error = filemap_write_and_wait_range(inode->i_mapping, start, end);
if (error)
return error;
@@ -1563,6 +1595,8 @@ static int fuse_iomap_punch_range(struct inode *inode, loff_t offset,
loff_t isize = i_size_read(inode);
int error;
+ trace_fuse_iomap_punch_range(inode, offset, length);
+
/*
* Now that we've unmap all full blocks we'll have to zero out any
* partial block at the beginning and/or end. iomap_zero_range is
@@ -1606,6 +1640,8 @@ fuse_iomap_fallocate(
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size);
+
/*
* If we unmapped blocks from the file range, then we zero the
* pagecache for those regions and push them to disk rather than make
* [PATCH 16/28] fuse: implement large folios for iomap pagecache files
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (14 preceding siblings ...)
2025-09-16 0:31 ` [PATCH 15/28] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:32 ` Darrick J. Wong
2025-09-16 0:32 ` [PATCH 17/28] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
` (11 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:32 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
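Use large folios in the pagecache when we're using iomap. If the file
block size is larger than the page size, set the minimum folio order so
that every folio covers at least one block.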
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
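Reviewer note, not part of the commit: a small userspace sketch of the
minimum folio order computation that this patch adds to
fuse_inode_set_iomap(). The sketch_* helpers are hypothetical names used
only here; blkbits and page_shift stand in for inode->i_blkbits and
PAGE_SHIFT.

```c
#include <assert.h>

/*
 * Hypothetical helper: the folio order needed so that one folio covers at
 * least one file block. Order 0 is a single page.
 */
static unsigned int sketch_min_folio_order(unsigned int blkbits,
					   unsigned int page_shift)
{
	if (blkbits > page_shift)
		return blkbits - page_shift;
	return 0;
}

/* Smallest folio size, in bytes, implied by that order. */
static unsigned long sketch_min_folio_bytes(unsigned int blkbits,
					    unsigned int page_shift)
{
	unsigned int order = sketch_min_folio_order(blkbits, page_shift);

	/* Guard against shifting past the width of unsigned long. */
	assert(page_shift + order < sizeof(unsigned long) * 8);
	return 1UL << (page_shift + order);
}
```

With 4k pages, a 16k block size gives a minimum order of 2, while block
sizes at or below the page size leave the minimum order at 0.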
fs/fuse/file_iomap.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index d771b1068fb912..c09e00c7de2694 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1357,6 +1357,7 @@ static const struct address_space_operations fuse_iomap_aops = {
static inline void fuse_inode_set_iomap(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
+ unsigned int min_order = 0;
ASSERT(fuse_has_iomap(inode));
@@ -1365,6 +1366,11 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
INIT_LIST_HEAD(&fi->ioend_list);
spin_lock_init(&fi->ioend_lock);
+
+ if (inode->i_blkbits > PAGE_SHIFT)
+ min_order = inode->i_blkbits - PAGE_SHIFT;
+
+ mapping_set_folio_min_order(inode->i_mapping, min_order);
set_bit(FUSE_I_IOMAP, &fi->state);
}
* [PATCH 17/28] fuse: use an unrestricted backing device with iomap pagecache io
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (15 preceding siblings ...)
2025-09-16 0:32 ` [PATCH 16/28] fuse: implement large folios for iomap pagecache files Darrick J. Wong
@ 2025-09-16 0:32 ` Darrick J. Wong
2025-09-16 0:32 ` [PATCH 18/28] fuse: advertise support for iomap Darrick J. Wong
` (10 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:32 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
With iomap support turned on for the pagecache, the kernel issues
writeback directly to block devices and we no longer have to push all
those pages through the fuse device to userspace. Therefore, we don't
need the tight dirty limits (~1M) that are used for regular fuse. This
dramatically increases the performance of fuse's pagecache IO.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file_iomap.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index c09e00c7de2694..6cc1f91fe3d5a4 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -711,6 +711,27 @@ const struct fuse_backing_ops fuse_iomap_backing_ops = {
void fuse_iomap_mount(struct fuse_mount *fm)
{
struct fuse_conn *fc = fm->fc;
+ struct super_block *sb = fm->sb;
+ struct backing_dev_info *old_bdi = sb->s_bdi;
+ char *suffix = sb->s_bdev ? "-fuseblk" : "-fuse";
+ int res;
+
+ /*
+ * sb->s_bdi points to the initial private bdi. However, we want to
+ * redirect it to a new private bdi with default dirty and readahead
+ * settings because iomap writeback won't be pushing a ton of dirty
+ * data through the fuse device. If this fails we fall back to the
+ * initial fuse bdi.
+ */
+ sb->s_bdi = &noop_backing_dev_info;
+ res = super_setup_bdi_name(sb, "%u:%u%s.iomap", MAJOR(fc->dev),
+ MINOR(fc->dev), suffix);
+ if (res) {
+ sb->s_bdi = old_bdi;
+ } else {
+ bdi_unregister(old_bdi);
+ bdi_put(old_bdi);
+ }
/*
* Enable syncfs for iomap fuse servers so that we can send a final
* [PATCH 18/28] fuse: advertise support for iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (16 preceding siblings ...)
2025-09-16 0:32 ` [PATCH 17/28] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
@ 2025-09-16 0:32 ` Darrick J. Wong
2025-09-16 0:32 ` [PATCH 19/28] fuse: query filesystem geometry when using iomap Darrick J. Wong
` (9 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:32 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Advertise our new IO paths programmatically by creating an ioctl that
can return the capabilities of the kernel.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
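As an illustration of how a fuse server might probe for this capability, here is a minimal userspace sketch. The structure and ioctl definitions are copied from this patch because they are not in released uapi headers; the ioctl number (99) is the RFC's value and could change. `fuse_kernel_has_iomap_fileio()` is a hypothetical helper name.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Copied from this RFC patch; not available in released uapi headers. */
#define FUSE_DEV_IOC_MAGIC		229
#define FUSE_IOMAP_SUPPORT_FILEIO	(1ULL << 0)

struct fuse_iomap_support {
	uint64_t flags;
	uint64_t padding;
};

#define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 99, \
					     struct fuse_iomap_support)

/*
 * Hypothetical helper: ask the kernel behind an open /dev/fuse fd whether
 * iomap file IO is available.  Returns 1 if it is, 0 if it is not, or a
 * negative errno if the ioctl itself failed (for example -ENOTTY on a
 * kernel that does not know this ioctl).
 */
static int fuse_kernel_has_iomap_fileio(int fuse_dev_fd)
{
	struct fuse_iomap_support ios = { 0 };

	if (ioctl(fuse_dev_fd, FUSE_DEV_IOC_IOMAP_SUPPORT, &ios) < 0)
		return -errno;

	return (ios.flags & FUSE_IOMAP_SUPPORT_FILEIO) ? 1 : 0;
}
```

A server would presumably call this right after opening /dev/fuse and fall back to regular fuse IO on anything other than a return value of 1.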
fs/fuse/fuse_i.h | 4 ++++
include/uapi/linux/fuse.h | 9 +++++++++
fs/fuse/dev.c | 3 +++
fs/fuse/file_iomap.c | 13 +++++++++++++
4 files changed, 29 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7581d22de2340c..82191e92c21097 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1781,6 +1781,9 @@ int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
loff_t length, loff_t new_size);
int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
loff_t endpos);
+
+int fuse_dev_ioctl_iomap_support(struct file *file,
+ struct fuse_iomap_support __user *argp);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1803,6 +1806,7 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
# define fuse_iomap_setsize_start(...) (-ENOSYS)
# define fuse_iomap_fallocate(...) (-ENOSYS)
# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
+# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c0af8a4d3e30d8..675b1d4fdff8db 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1139,6 +1139,13 @@ struct fuse_backing_map {
uint64_t padding;
};
+/* basic file I/O functionality through iomap */
+#define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0)
+struct fuse_iomap_support {
+ uint64_t flags;
+ uint64_t padding;
+};
+
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
@@ -1146,6 +1153,8 @@ struct fuse_backing_map {
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
#define FUSE_DEV_IOC_SYNC_INIT _IO(FUSE_DEV_IOC_MAGIC, 3)
+#define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 99, \
+ struct fuse_iomap_support)
struct fuse_lseek_in {
uint64_t fh;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 871877cac2acf3..bb0ec19a368bea 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2710,6 +2710,9 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
case FUSE_DEV_IOC_SYNC_INIT:
return fuse_dev_ioctl_sync_init(file);
+ case FUSE_DEV_IOC_IOMAP_SUPPORT:
+ return fuse_dev_ioctl_iomap_support(file, argp);
+
default:
return -ENOTTY;
}
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 6cc1f91fe3d5a4..5cefceb267f8f1 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1695,3 +1695,16 @@ fuse_iomap_fallocate(
file_update_time(file);
return 0;
}
+
+int fuse_dev_ioctl_iomap_support(struct file *file,
+ struct fuse_iomap_support __user *argp)
+{
+ struct fuse_iomap_support ios = { };
+
+ if (fuse_iomap_enabled())
+ ios.flags = FUSE_IOMAP_SUPPORT_FILEIO;
+
+ if (copy_to_user(argp, &ios, sizeof(ios)))
+ return -EFAULT;
+ return 0;
+}
* [PATCH 19/28] fuse: query filesystem geometry when using iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (17 preceding siblings ...)
2025-09-16 0:32 ` [PATCH 18/28] fuse: advertise support for iomap Darrick J. Wong
@ 2025-09-16 0:32 ` Darrick J. Wong
2025-09-16 0:33 ` [PATCH 20/28] fuse_trace: " Darrick J. Wong
` (8 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:32 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add a new upcall to the fuse server so that the kernel can request
filesystem geometry bits when iomap mode is in use.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
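To make the new upcall a little more concrete, the sketch below shows how a server might fill in a FUSE_IOMAP_CONFIG reply, plus a userspace mirror of the three sanity checks that fuse_iomap_process_config() applies before touching the superblock. The structure and flag values are copied from this patch; `example_fill_config()`, `example_check_config()`, and the chosen geometry values are hypothetical and only for illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Copied from this RFC patch; not available in released uapi headers. */
#define FUSE_IOMAP_CONFIG_SID		(1ULL << 0)
#define FUSE_IOMAP_CONFIG_UUID		(1ULL << 1)
#define FUSE_IOMAP_CONFIG_BLOCKSIZE	(1ULL << 2)
#define FUSE_IOMAP_CONFIG_MAX_LINKS	(1ULL << 3)
#define FUSE_IOMAP_CONFIG_TIME		(1ULL << 4)
#define FUSE_IOMAP_CONFIG_MAXBYTES	(1ULL << 5)
#define FUSE_IOMAP_CONFIG_ALL		((1ULL << 6) - 1)

struct fuse_iomap_config_out {
	uint64_t flags;		/* FUSE_IOMAP_CONFIG_* */
	char s_id[32];		/* informational name */
	char s_uuid[16];	/* UUID */
	uint8_t s_uuid_len;	/* length of s_uuid */
	uint8_t s_pad[3];	/* must be zeroes */
	uint32_t s_blocksize;	/* fs block size */
	uint32_t s_max_links;	/* max hard links */
	uint32_t s_time_gran;	/* timestamp granularity, ns */
	int64_t s_time_min;	/* timestamp limits, seconds */
	int64_t s_time_max;
	int64_t s_maxbytes;	/* max file size */
};

/* Hypothetical server-side helper: describe a 4 KiB-block filesystem. */
static void example_fill_config(struct fuse_iomap_config_out *out,
				const unsigned char uuid[16])
{
	memset(out, 0, sizeof(*out));	/* also zeroes s_pad */
	out->flags = FUSE_IOMAP_CONFIG_UUID | FUSE_IOMAP_CONFIG_BLOCKSIZE |
		     FUSE_IOMAP_CONFIG_TIME;
	memcpy(out->s_uuid, uuid, sizeof(out->s_uuid));
	out->s_uuid_len = sizeof(out->s_uuid);
	out->s_blocksize = 4096;	/* example value */
	out->s_time_gran = 1;		/* example: nanosecond timestamps */
	out->s_time_min = 0;		/* example limits */
	out->s_time_max = INT32_MAX;
}

/* Userspace mirror of the kernel's checks: flags, uuid length, padding. */
static int example_check_config(const struct fuse_iomap_config_out *out)
{
	size_t i;

	if (out->flags & ~FUSE_IOMAP_CONFIG_ALL)
		return -EINVAL;
	if (out->s_uuid_len > sizeof(out->s_uuid))
		return -EINVAL;
	for (i = 0; i < sizeof(out->s_pad); i++)
		if (out->s_pad[i])
			return -EINVAL;
	return 0;
}
```

Fields whose flag bit is not set are ignored by the kernel side in this patch, so a server only needs to fill in what it actually wants to override.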
fs/fuse/fuse_i.h | 10 ++-
include/uapi/linux/fuse.h | 39 ++++++++++++
fs/fuse/file_iomap.c | 147 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 42 ++++++++++---
4 files changed, 227 insertions(+), 11 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 82191e92c21097..e45780f6fe9e39 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1019,6 +1019,9 @@ struct fuse_conn {
struct fuse_ring *ring;
#endif
+ /** How many subsystems still need initialization? */
+ atomic_t need_init;
+
/** Only used if the connection opts into request timeouts */
struct {
/* Worker for checking if any requests have timed out */
@@ -1431,6 +1434,7 @@ struct fuse_dev *fuse_dev_alloc(void);
void fuse_dev_install(struct fuse_dev *fud, struct fuse_conn *fc);
void fuse_dev_free(struct fuse_dev *fud);
int fuse_send_init(struct fuse_mount *fm);
+void fuse_finish_init(struct fuse_conn *fc, bool ok);
/**
* Fill in superblock and initialize fuse connection
@@ -1739,7 +1743,8 @@ static inline bool fuse_has_iomap(const struct inode *inode)
extern const struct fuse_backing_ops fuse_iomap_backing_ops;
-void fuse_iomap_mount(struct fuse_mount *fm);
+int fuse_iomap_mount(struct fuse_mount *fm);
+void fuse_iomap_mount_async(struct fuse_mount *fm);
void fuse_iomap_unmount(struct fuse_mount *fm);
void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags);
@@ -1787,7 +1792,8 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
-# define fuse_iomap_mount(...) ((void)0)
+# define fuse_iomap_mount(...) (0)
+# define fuse_iomap_mount_async(...) ((void)0)
# define fuse_iomap_unmount(...) ((void)0)
# define fuse_iomap_init_inode(...) ((void)0)
# define fuse_iomap_evict_inode(...) ((void)0)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 675b1d4fdff8db..19c1ac5006faa9 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -239,6 +239,7 @@
* 7.99
* - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
+ * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
*/
#ifndef _LINUX_FUSE_H
@@ -666,6 +667,7 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_CONFIG = 4092,
FUSE_IOMAP_IOEND = 4093,
FUSE_IOMAP_BEGIN = 4094,
FUSE_IOMAP_END = 4095,
@@ -1425,4 +1427,41 @@ struct fuse_iomap_ioend_in {
uint32_t reserved1; /* zero */
};
+struct fuse_iomap_config_in {
+ uint64_t flags; /* supported FUSE_IOMAP_CONFIG_* flags */
+ int64_t maxbytes; /* maximum supported file size */
+ uint64_t padding[6]; /* zero */
+};
+
+/* Which fields are set in fuse_iomap_config_out? */
+#define FUSE_IOMAP_CONFIG_SID (1ULL << 0)
+#define FUSE_IOMAP_CONFIG_UUID (1ULL << 1)
+#define FUSE_IOMAP_CONFIG_BLOCKSIZE (1ULL << 2)
+#define FUSE_IOMAP_CONFIG_MAX_LINKS (1ULL << 3)
+#define FUSE_IOMAP_CONFIG_TIME (1ULL << 4)
+#define FUSE_IOMAP_CONFIG_MAXBYTES (1ULL << 5)
+
+struct fuse_iomap_config_out {
+ uint64_t flags; /* FUSE_IOMAP_CONFIG_* */
+
+ char s_id[32]; /* Informational name */
+ char s_uuid[16]; /* UUID */
+
+ uint8_t s_uuid_len; /* length of s_uuid */
+
+ uint8_t s_pad[3]; /* must be zeroes */
+
+ uint32_t s_blocksize; /* fs block size */
+ uint32_t s_max_links; /* max hard links */
+
+ /* Granularity of c/m/atime in ns (cannot be worse than a second) */
+ uint32_t s_time_gran;
+
+ /* Time limits for c/m/atime in seconds */
+ int64_t s_time_min;
+ int64_t s_time_max;
+
+ int64_t s_maxbytes; /* max file size */
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 5cefceb267f8f1..abba22107718d9 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -708,14 +708,103 @@ const struct fuse_backing_ops fuse_iomap_backing_ops = {
.post_open = fuse_iomap_post_open,
};
-void fuse_iomap_mount(struct fuse_mount *fm)
+struct fuse_iomap_config_args {
+ struct fuse_args args;
+ struct fuse_iomap_config_in inarg;
+ struct fuse_iomap_config_out outarg;
+};
+
+#define FUSE_IOMAP_CONFIG_ALL (FUSE_IOMAP_CONFIG_SID | \
+ FUSE_IOMAP_CONFIG_UUID | \
+ FUSE_IOMAP_CONFIG_BLOCKSIZE | \
+ FUSE_IOMAP_CONFIG_MAX_LINKS | \
+ FUSE_IOMAP_CONFIG_TIME | \
+ FUSE_IOMAP_CONFIG_MAXBYTES)
+
+static int fuse_iomap_process_config(struct fuse_mount *fm, int error,
+ const struct fuse_iomap_config_out *outarg)
{
+ struct super_block *sb = fm->sb;
+
+ switch (error) {
+ case 0:
+ break;
+ case -ENOSYS:
+ return 0;
+ default:
+ return error;
+ }
+
+ if (outarg->flags & ~FUSE_IOMAP_CONFIG_ALL)
+ return -EINVAL;
+
+ if (outarg->s_uuid_len > sizeof(outarg->s_uuid))
+ return -EINVAL;
+
+ if (memchr_inv(outarg->s_pad, 0, sizeof(outarg->s_pad)))
+ return -EINVAL;
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_BLOCKSIZE) {
+ if (sb->s_bdev) {
+#ifdef CONFIG_BLOCK
+ if (!sb_set_blocksize(sb, outarg->s_blocksize))
+ return -EINVAL;
+#else
+ /*
+ * XXX: how do we have a bdev filesystem without
+ * CONFIG_BLOCK???
+ */
+ return -EINVAL;
+#endif
+ } else {
+ sb->s_blocksize = outarg->s_blocksize;
+ sb->s_blocksize_bits = blksize_bits(outarg->s_blocksize);
+ }
+ }
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_SID)
+ memcpy(sb->s_id, outarg->s_id, sizeof(sb->s_id));
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_UUID) {
+ memcpy(&sb->s_uuid, outarg->s_uuid, outarg->s_uuid_len);
+ sb->s_uuid_len = outarg->s_uuid_len;
+ }
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_MAX_LINKS)
+ sb->s_max_links = outarg->s_max_links;
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_TIME) {
+ sb->s_time_gran = outarg->s_time_gran;
+ sb->s_time_min = outarg->s_time_min;
+ sb->s_time_max = outarg->s_time_max;
+ }
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_MAXBYTES)
+ sb->s_maxbytes = outarg->s_maxbytes;
+
+ return 0;
+}
+
+static void fuse_iomap_config_reply(struct fuse_mount *fm,
+ struct fuse_args *args, int error)
+{
+ struct fuse_iomap_config_args *ia =
+ container_of(args, struct fuse_iomap_config_args, args);
struct fuse_conn *fc = fm->fc;
struct super_block *sb = fm->sb;
struct backing_dev_info *old_bdi = sb->s_bdi;
char *suffix = sb->s_bdev ? "-fuseblk" : "-fuse";
+ bool ok = true;
int res;
+ res = fuse_iomap_process_config(fm, error, &ia->outarg);
+ if (res) {
+ printk(KERN_ERR "%s: could not configure iomap, err=%d\n",
+ sb->s_id, res);
+ ok = false;
+ goto done;
+ }
+
/*
* sb->s_bdi points to the initial private bdi. However, we want to
* redirect it to a new private bdi with default dirty and readahead
@@ -741,6 +830,62 @@ void fuse_iomap_mount(struct fuse_mount *fm)
fc->sync_fs = true;
fc->iomap_conn.no_end = 0;
fc->iomap_conn.no_ioend = 0;
+
+done:
+ kfree(ia);
+ fuse_finish_init(fc, ok);
+}
+
+static struct fuse_iomap_config_args *
+fuse_iomap_new_mount(struct fuse_mount *fm)
+{
+ struct fuse_iomap_config_args *ia;
+
+ ia = kzalloc(sizeof(*ia), GFP_KERNEL | __GFP_NOFAIL);
+ ia->inarg.maxbytes = MAX_LFS_FILESIZE;
+ ia->inarg.flags = FUSE_IOMAP_CONFIG_ALL;
+
+ ia->args.opcode = FUSE_IOMAP_CONFIG;
+ ia->args.nodeid = 0;
+ ia->args.in_numargs = 1;
+ ia->args.in_args[0].size = sizeof(ia->inarg);
+ ia->args.in_args[0].value = &ia->inarg;
+ ia->args.out_argvar = true;
+ ia->args.out_numargs = 1;
+ ia->args.out_args[0].size = sizeof(ia->outarg);
+ ia->args.out_args[0].value = &ia->outarg;
+ ia->args.force = true;
+ ia->args.nocreds = true;
+
+ return ia;
+}
+
+int fuse_iomap_mount(struct fuse_mount *fm)
+{
+ struct fuse_iomap_config_args *ia = fuse_iomap_new_mount(fm);
+ int err;
+
+ ASSERT(fm->fc->sync_init);
+
+ err = fuse_simple_request(fm, &ia->args);
+ /* Ignore size of iomap_config reply */
+ if (err > 0)
+ err = 0;
+ fuse_iomap_config_reply(fm, &ia->args, err);
+ return err;
+}
+
+void fuse_iomap_mount_async(struct fuse_mount *fm)
+{
+ struct fuse_iomap_config_args *ia = fuse_iomap_new_mount(fm);
+ int err;
+
+ ASSERT(!fm->fc->sync_init);
+
+ ia->args.end = fuse_iomap_config_reply;
+ err = fuse_simple_background(fm, &ia->args, GFP_KERNEL);
+ if (err)
+ fuse_iomap_config_reply(fm, &ia->args, -ENOTCONN);
}
void fuse_iomap_unmount(struct fuse_mount *fm)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4f348fc575a5c3..beb9ee62b6b861 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1319,6 +1319,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
struct fuse_init_out *arg = &ia->out;
bool ok = true;
+ atomic_inc(&fc->need_init);
+
if (error || arg->major != FUSE_KERNEL_VERSION)
ok = false;
else {
@@ -1466,9 +1468,6 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
init_server_timeout(fc, timeout);
- if (fc->iomap)
- fuse_iomap_mount(fm);
-
fm->sb->s_bdi->ra_pages =
min(fm->sb->s_bdi->ra_pages, ra_pages);
fc->minor = arg->minor;
@@ -1478,13 +1477,27 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
}
kfree(ia);
- if (!ok) {
+ if (!ok)
fc->conn_init = 0;
+
+ if (ok && fc->iomap) {
+ atomic_inc(&fc->need_init);
+ if (!fc->sync_init)
+ fuse_iomap_mount_async(fm);
+ }
+
+ fuse_finish_init(fc, ok);
+}
+
+void fuse_finish_init(struct fuse_conn *fc, bool ok)
+{
+ if (!ok)
fc->conn_error = 1;
- }
- fuse_set_initialized(fc);
- wake_up_all(&fc->blocked_waitq);
+ if (atomic_dec_and_test(&fc->need_init)) {
+ fuse_set_initialized(fc);
+ wake_up_all(&fc->blocked_waitq);
+ }
}
static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
@@ -1974,7 +1987,20 @@ static int fuse_fill_super(struct super_block *sb, struct fs_context *fsc)
fm = get_fuse_mount_super(sb);
- return fuse_send_init(fm);
+ err = fuse_send_init(fm);
+ if (err)
+ return err;
+
+ if (fm->fc->conn_init && fm->fc->sync_init && fm->fc->iomap) {
+ err = fuse_iomap_mount(fm);
+ if (err)
+ return err;
+ }
+
+ if (fm->fc->conn_error)
+ return -EIO;
+
+ return 0;
}
/*
* [PATCH 20/28] fuse_trace: query filesystem geometry when using iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (18 preceding siblings ...)
2025-09-16 0:32 ` [PATCH 19/28] fuse: query filesystem geometry when using iomap Darrick J. Wong
@ 2025-09-16 0:33 ` Darrick J. Wong
2025-09-16 0:33 ` [PATCH 21/28] fuse: implement fadvise for iomap files Darrick J. Wong
` (7 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:33 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 3 +++
2 files changed, 51 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index e69ad48b14066b..66b564bcd25360 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -58,6 +58,7 @@
EM( FUSE_SYNCFS, "FUSE_SYNCFS") \
EM( FUSE_TMPFILE, "FUSE_TMPFILE") \
EM( FUSE_STATX, "FUSE_STATX") \
+ EM( FUSE_IOMAP_CONFIG, "FUSE_IOMAP_CONFIG") \
EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \
EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \
EM( FUSE_IOMAP_IOEND, "FUSE_IOMAP_IOEND") \
@@ -340,6 +341,14 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
{ IOMAP_IOEND_BOUNDARY, "boundary" }, \
{ IOMAP_IOEND_DIRECT, "direct" }
+#define FUSE_IOMAP_CONFIG_STRINGS \
+ { FUSE_IOMAP_CONFIG_SID, "sid" }, \
+ { FUSE_IOMAP_CONFIG_UUID, "uuid" }, \
+ { FUSE_IOMAP_CONFIG_BLOCKSIZE, "blocksize" }, \
+ { FUSE_IOMAP_CONFIG_MAX_LINKS, "max_links" }, \
+ { FUSE_IOMAP_CONFIG_TIME, "time" }, \
+ { FUSE_IOMAP_CONFIG_MAXBYTES, "maxbytes" }
+
DECLARE_EVENT_CLASS(fuse_iomap_check_class,
TP_PROTO(const char *func, int line, const char *condition),
@@ -968,6 +977,45 @@ TRACE_EVENT(fuse_iomap_fallocate,
__entry->mode,
__entry->newsize)
);
+
+TRACE_EVENT(fuse_iomap_config,
+ TP_PROTO(const struct fuse_mount *fm,
+ const struct fuse_iomap_config_out *outarg),
+ TP_ARGS(fm, outarg),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+
+ __field(uint32_t, flags)
+ __field(uint32_t, blocksize)
+ __field(uint32_t, max_links)
+ __field(uint32_t, time_gran)
+
+ __field(int64_t, time_min)
+ __field(int64_t, time_max)
+ __field(int64_t, maxbytes)
+ __field(uint8_t, uuid_len)
+ ),
+
+ TP_fast_assign(
+ __entry->connection = fm->fc->dev;
+ __entry->flags = outarg->flags;
+ __entry->blocksize = outarg->s_blocksize;
+ __entry->max_links = outarg->s_max_links;
+ __entry->time_gran = outarg->s_time_gran;
+ __entry->time_min = outarg->s_time_min;
+ __entry->time_max = outarg->s_time_max;
+ __entry->maxbytes = outarg->s_maxbytes;
+ __entry->uuid_len = outarg->s_uuid_len;
+ ),
+
+ TP_printk("connection %u flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u",
+ __entry->connection,
+ __print_flags(__entry->flags, "|", FUSE_IOMAP_CONFIG_STRINGS),
+ __entry->blocksize, __entry->max_links, __entry->time_gran,
+ __entry->time_min, __entry->time_max, __entry->maxbytes,
+ __entry->uuid_len)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index abba22107718d9..2d01828fc532b0 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -735,6 +735,8 @@ static int fuse_iomap_process_config(struct fuse_mount *fm, int error,
return error;
}
+ trace_fuse_iomap_config(fm, outarg);
+
if (outarg->flags & ~FUSE_IOMAP_CONFIG_ALL)
return -EINVAL;
@@ -760,6 +762,7 @@ static int fuse_iomap_process_config(struct fuse_mount *fm, int error,
sb->s_blocksize = outarg->s_blocksize;
sb->s_blocksize_bits = blksize_bits(outarg->s_blocksize);
}
+ fm->fc->blkbits = sb->s_blocksize_bits;
}
if (outarg->flags & FUSE_IOMAP_CONFIG_SID)
* [PATCH 21/28] fuse: implement fadvise for iomap files
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (19 preceding siblings ...)
2025-09-16 0:33 ` [PATCH 20/28] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:33 ` Darrick J. Wong
2025-09-16 0:33 ` [PATCH 22/28] fuse: invalidate ranges of block devices being used for iomap Darrick J. Wong
` (6 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:33 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
If userspace asks us to perform readahead on a file, take i_rwsem so
that it can't race with hole punching or writes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
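For context, the userspace request that reaches this new handler is an ordinary posix_fadvise() call with POSIX_FADV_WILLNEED. A minimal sketch follows; `request_readahead()` is a hypothetical wrapper name.

```c
#define _POSIX_C_SOURCE 200112L
#include <errno.h>	/* callers compare the result against errno values */
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: ask the kernel to start reading the byte range
 * [offset, offset + len) into the pagecache.  posix_fadvise() returns 0
 * on success or a positive error number; it does not set errno.
 */
static int request_readahead(int fd, off_t offset, off_t len)
{
	return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}
```

On an iomap-enabled fuse file, this is the advice value for which the patch takes i_rwsem in shared mode around generic_fadvise(); other advice values pass straight through.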
fs/fuse/fuse_i.h | 3 +++
fs/fuse/file.c | 1 +
fs/fuse/file_iomap.c | 20 ++++++++++++++++++++
3 files changed, 24 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e45780f6fe9e39..d59c19f61d5337 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1789,6 +1789,8 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support __user *argp);
+
+int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1813,6 +1815,7 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
# define fuse_iomap_fallocate(...) (-ENOSYS)
# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
+# define fuse_iomap_fadvise NULL
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index dd65485c9743bf..9476f14035bb7f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3189,6 +3189,7 @@ static const struct file_operations fuse_file_operations = {
.poll = fuse_file_poll,
.fallocate = fuse_file_fallocate,
.copy_file_range = fuse_copy_file_range,
+ .fadvise = fuse_iomap_fadvise,
};
static const struct address_space_operations fuse_file_aops = {
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 2d01828fc532b0..a484cd235d9da2 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -7,6 +7,7 @@
#include <linux/fiemap.h>
#include <linux/pagemap.h>
#include <linux/falloc.h>
+#include <linux/fadvise.h>
#include "fuse_i.h"
#include "fuse_trace.h"
#include "iomap_priv.h"
@@ -1856,3 +1857,22 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
return -EFAULT;
return 0;
}
+
+int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice)
+{
+ struct inode *inode = file_inode(file);
+ bool needlock = advice == POSIX_FADV_WILLNEED &&
+ fuse_inode_has_iomap(inode);
+ int ret;
+
+ /*
+ * Operations creating pages in page cache need protection from hole
+ * punching and similar ops
+ */
+ if (needlock)
+ inode_lock_shared(inode);
+ ret = generic_fadvise(file, start, end, advice);
+ if (needlock)
+ inode_unlock_shared(inode);
+ return ret;
+}
* [PATCH 22/28] fuse: invalidate ranges of block devices being used for iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (20 preceding siblings ...)
2025-09-16 0:33 ` [PATCH 21/28] fuse: implement fadvise for iomap files Darrick J. Wong
@ 2025-09-16 0:33 ` Darrick J. Wong
2025-09-16 0:33 ` [PATCH 23/28] fuse_trace: " Darrick J. Wong
` (5 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:33 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Make it easier to invalidate the page cache for a block device that is
being used in conjunction with iomap. This allows a fuse server to kill
all cached data for a block that is being freed, so that block reuse
doesn't result in file corruption. Right now, the only way to do this
is with fadvise, which ignores and doesn't wait for pages undergoing
writeback.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
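The range handling in fuse_iomap_dev_inval() can be sketched in userspace as follows. This is an illustration under stated assumptions: `example_inval_range()` is a hypothetical name, `dev_bytes` stands in for bdev_nr_bytes() and is assumed non-negative, and `__builtin_add_overflow()` (a GCC/Clang builtin) stands in for the kernel's check_add_overflow().

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the range checks in fuse_iomap_dev_inval():
 * reject a range whose end overflows a signed 64-bit offset or that
 * starts at or beyond the end of the device, clamp the end to the device
 * size, and report the last byte (inclusive) to invalidate.
 */
static int example_inval_range(uint64_t offset, uint64_t length,
			       int64_t dev_bytes, int64_t *last)
{
	int64_t end;

	if (__builtin_add_overflow(offset, length, &end) ||
	    offset >= (uint64_t)dev_bytes)
		return -EINVAL;

	if (end > dev_bytes)
		end = dev_bytes;

	*last = end - 1;
	return 0;
}
```

For a 1 MiB device, a request for offset 4096 and length 8192 yields a last byte of 12287, and a request that runs past the end of the device is clamped to byte 1048575 rather than rejected.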
fs/fuse/fuse_i.h | 3 +++
include/uapi/linux/fuse.h | 10 ++++++++++
fs/fuse/dev.c | 27 +++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 40 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 80 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d59c19f61d5337..4aa7199dd0cd9f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1789,6 +1789,8 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support __user *argp);
+int fuse_iomap_dev_inval(struct fuse_conn *fc,
+ const struct fuse_iomap_dev_inval_out *arg);
int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
#else
@@ -1815,6 +1817,7 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
# define fuse_iomap_fallocate(...) (-ENOSYS)
# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
+# define fuse_iomap_dev_inval(...) (-ENOSYS)
# define fuse_iomap_fadvise NULL
#endif
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 19c1ac5006faa9..b63fba0a2c52c9 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -240,6 +240,7 @@
* - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
* - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
+ * - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
*/
#ifndef _LINUX_FUSE_H
@@ -689,6 +690,7 @@ enum fuse_notify_code {
FUSE_NOTIFY_DELETE = 6,
FUSE_NOTIFY_RESEND = 7,
FUSE_NOTIFY_INC_EPOCH = 8,
+ FUSE_NOTIFY_IOMAP_DEV_INVAL = 9,
FUSE_NOTIFY_CODE_MAX,
};
@@ -1464,4 +1466,12 @@ struct fuse_iomap_config_out {
int64_t s_maxbytes; /* max file size */
};
+struct fuse_iomap_dev_inval_out {
+ uint32_t dev; /* device cookie */
+ uint32_t reserved; /* zero */
+
+ uint64_t offset; /* range to invalidate pagecache, bytes */
+ uint64_t length;
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index bb0ec19a368bea..adbe2a65e6fe87 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1868,6 +1868,30 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
return err;
}
+static int fuse_notify_iomap_dev_inval(struct fuse_conn *fc, unsigned int size,
+ struct fuse_copy_state *cs)
+{
+ struct fuse_iomap_dev_inval_out outarg;
+ int err = -EINVAL;
+
+ if (size != sizeof(outarg))
+ goto err;
+
+ err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+ if (err)
+ goto err;
+ if (outarg.reserved) {
+ err = -EINVAL;
+ goto err;
+ }
+ fuse_copy_finish(cs);
+
+ return fuse_iomap_dev_inval(fc, &outarg);
+err:
+ fuse_copy_finish(cs);
+ return err;
+}
+
struct fuse_retrieve_args {
struct fuse_args_pages ap;
struct fuse_notify_retrieve_in inarg;
@@ -2114,6 +2138,9 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
case FUSE_NOTIFY_INC_EPOCH:
return fuse_notify_inc_epoch(fc);
+ case FUSE_NOTIFY_IOMAP_DEV_INVAL:
+ return fuse_notify_iomap_dev_inval(fc, size, cs);
+
default:
fuse_copy_finish(cs);
return -EINVAL;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index a484cd235d9da2..9c798435e45633 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1876,3 +1876,43 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice)
inode_unlock_shared(inode);
return ret;
}
+
+int fuse_iomap_dev_inval(struct fuse_conn *fc,
+ const struct fuse_iomap_dev_inval_out *arg)
+{
+ struct fuse_backing *fb;
+ struct block_device *bdev;
+ loff_t end;
+ int ret = 0;
+
+ if (!fc->iomap || arg->dev == FUSE_IOMAP_DEV_NULL)
+ return -EINVAL;
+
+ down_read(&fc->killsb);
+ fb = fuse_backing_lookup(fc, &fuse_iomap_backing_ops, arg->dev);
+ if (!fb) {
+ ret = -ENODEV;
+ goto out_killsb;
+ }
+ bdev = fb->bdev;
+
+ inode_lock(bdev->bd_mapping->host);
+ filemap_invalidate_lock(bdev->bd_mapping);
+
+ if (check_add_overflow(arg->offset, arg->length, &end) ||
+ arg->offset >= bdev_nr_bytes(bdev)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ end = min(end, bdev_nr_bytes(bdev));
+ truncate_inode_pages_range(bdev->bd_mapping, arg->offset, end - 1);
+
+out_unlock:
+ filemap_invalidate_unlock(bdev->bd_mapping);
+ inode_unlock(bdev->bd_mapping->host);
+ fuse_backing_put(fb);
+out_killsb:
+ up_read(&fc->killsb);
+ return ret;
+}
* [PATCH 23/28] fuse_trace: invalidate ranges of block devices being used for iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (21 preceding siblings ...)
2025-09-16 0:33 ` [PATCH 22/28] fuse: invalidate ranges of block devices being used for iomap Darrick J. Wong
@ 2025-09-16 0:33 ` Darrick J. Wong
2025-09-16 0:34 ` [PATCH 24/28] fuse: implement inline data file IO via iomap Darrick J. Wong
` (4 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:33 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 26 ++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 2 ++
2 files changed, 28 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 66b564bcd25360..1cff42bc5907bf 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -1016,6 +1016,32 @@ TRACE_EVENT(fuse_iomap_config,
__entry->time_min, __entry->time_max, __entry->maxbytes,
__entry->uuid_len)
);
+
+TRACE_EVENT(fuse_iomap_dev_inval,
+ TP_PROTO(const struct fuse_conn *fc,
+ const struct fuse_iomap_dev_inval_out *arg),
+ TP_ARGS(fc, arg),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(int, dev)
+ __field(unsigned long long, offset)
+ __field(unsigned long long, length)
+ ),
+
+ TP_fast_assign(
+ __entry->connection = fc->dev;
+ __entry->dev = arg->dev;
+ __entry->offset = arg->offset;
+ __entry->length = arg->length;
+ ),
+
+ TP_printk("connection %u dev %d offset 0x%llx length 0x%llx",
+ __entry->connection,
+ __entry->dev,
+ __entry->offset,
+ __entry->length)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 9c798435e45633..d2945f8071a296 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1885,6 +1885,8 @@ int fuse_iomap_dev_inval(struct fuse_conn *fc,
loff_t end;
int ret = 0;
+ trace_fuse_iomap_dev_inval(fc, arg);
+
if (!fc->iomap || arg->dev == FUSE_IOMAP_DEV_NULL)
return -EINVAL;
* [PATCH 24/28] fuse: implement inline data file IO via iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (22 preceding siblings ...)
2025-09-16 0:33 ` [PATCH 23/28] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:34 ` Darrick J. Wong
2025-09-16 0:34 ` [PATCH 25/28] fuse_trace: " Darrick J. Wong
` (3 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:34 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Implement inline data file IO by issuing FUSE_READ/FUSE_WRITE commands
in response to an inline data mapping.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
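Reader's note, not part of the commit: the staging-buffer lifecycle this
patch sets up can be modelled in userspace. This is only a rough sketch
under stated assumptions. model_server, model_map, model_begin and
model_end are hypothetical names, the memcpy() calls stand in for the
FUSE_READ and FUSE_WRITE round trips, and unlike the real code the model
always writes back from the start of the mapping instead of from pos.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the fuse server's copy of the inline data. */
struct model_server {
	unsigned char data[64];
};

/* Hypothetical stand-in for the parts of struct iomap used here. */
struct model_map {
	size_t offset;			/* file offset of the mapping */
	size_t length;			/* length of the mapping */
	unsigned char *inline_data;	/* kernel-side staging buffer */
};

/* Like ->iomap_begin: allocate a zeroed buffer; the read path fills it. */
static int model_begin(struct model_server *srv, struct model_map *map,
		       bool is_write)
{
	if (map->length == 0 || map->offset + map->length > sizeof(srv->data))
		return -EINVAL;

	map->inline_data = calloc(1, map->length);
	if (!map->inline_data)
		return -ENOMEM;

	/* Stands in for the FUSE_READ round trip. */
	if (!is_write)
		memcpy(map->inline_data, srv->data + map->offset, map->length);
	return 0;
}

/* Like ->iomap_end: push written bytes to the server, then always free. */
static int model_end(struct model_server *srv, struct model_map *map,
		     bool is_write, size_t written)
{
	int err = 0;

	if (is_write && written > 0) {
		if (written > map->length)
			err = -EIO;
		else	/* stands in for the FUSE_WRITE round trip */
			memcpy(srv->data + map->offset, map->inline_data,
			       written);
	}

	free(map->inline_data);
	map->inline_data = NULL;
	return err;
}
```

The point of the model is that the buffer never outlives one
begin/end pair, so the server has seen every written byte by the time
->iomap_end returns.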
fs/fuse/file_iomap.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 184 insertions(+)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index d2945f8071a296..8faf16f58df035 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -417,6 +417,150 @@ fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map)
return ret;
}
+static inline int fuse_iomap_inline_alloc(struct iomap *iomap)
+{
+ ASSERT(iomap->inline_data == NULL);
+ ASSERT(iomap->length > 0);
+
+ iomap->inline_data = kvzalloc(iomap->length, GFP_KERNEL);
+ return iomap->inline_data ? 0 : -ENOMEM;
+}
+
+static inline void fuse_iomap_inline_free(struct iomap *iomap)
+{
+ kvfree(iomap->inline_data);
+ iomap->inline_data = NULL;
+}
+
+/*
+ * Use the FUSE_READ command to read inline file data from the fuse server.
+ * Note that there's no file handle attached, so the fuse server must be able
+ * to reconnect to the inode via the nodeid.
+ */
+static int fuse_iomap_inline_read(struct inode *inode, loff_t pos,
+ loff_t count, struct iomap *iomap)
+{
+ struct fuse_read_in in = {
+ .offset = pos,
+ .size = count,
+ };
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ FUSE_ARGS(args);
+ ssize_t ret;
+
+ if (BAD_DATA(!iomap_inline_data_valid(iomap)))
+ return -EFSCORRUPTED;
+
+ args.opcode = FUSE_READ;
+ args.nodeid = fi->nodeid;
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(in);
+ args.in_args[0].value = &in;
+ args.out_argvar = true;
+ args.out_numargs = 1;
+ args.out_args[0].size = count;
+ args.out_args[0].value = iomap_inline_data(iomap, pos);
+
+ ret = fuse_simple_request(fm, &args);
+ if (ret < 0) {
+ fuse_iomap_inline_free(iomap);
+ return ret;
+ }
+ /* no readahead means something bad happened */
+ if (ret == 0) {
+ fuse_iomap_inline_free(iomap);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+/*
+ * Use the FUSE_WRITE command to write inline file data to the fuse server.
+ * Note that there's no file handle attached, so the fuse server must be able
+ * to reconnect to the inode via the nodeid.
+ */
+static int fuse_iomap_inline_write(struct inode *inode, loff_t pos,
+ loff_t count, struct iomap *iomap)
+{
+ struct fuse_write_in in = {
+ .offset = pos,
+ .size = count,
+ };
+ struct fuse_write_out out = { };
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ FUSE_ARGS(args);
+ ssize_t ret;
+
+ if (BAD_DATA(!iomap_inline_data_valid(iomap)))
+ return -EFSCORRUPTED;
+
+ args.opcode = FUSE_WRITE;
+ args.nodeid = fi->nodeid;
+ args.in_numargs = 2;
+ args.in_args[0].size = sizeof(in);
+ args.in_args[0].value = &in;
+ args.in_args[1].size = count;
+ args.in_args[1].value = iomap_inline_data(iomap, pos);
+ args.out_numargs = 1;
+ args.out_args[0].size = sizeof(out);
+ args.out_args[0].value = &out;
+
+ ret = fuse_simple_request(fm, &args);
+ if (ret < 0) {
+ fuse_iomap_inline_free(iomap);
+ return ret;
+ }
+ /* short write means something bad happened */
+ if (out.size < count) {
+ fuse_iomap_inline_free(iomap);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+/* Set up inline data buffers for iomap_begin */
+static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags,
+ loff_t pos, loff_t count,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ int err;
+
+ if (opflags & IOMAP_REPORT)
+ return 0;
+
+ if (fuse_is_iomap_file_write(opflags)) {
+ if (iomap->type == IOMAP_INLINE) {
+ err = fuse_iomap_inline_alloc(iomap);
+ if (err)
+ return err;
+ }
+
+ if (srcmap->type == IOMAP_INLINE) {
+ err = fuse_iomap_inline_alloc(srcmap);
+ if (!err)
+ err = fuse_iomap_inline_read(inode, pos, count,
+ srcmap);
+ if (err) {
+ fuse_iomap_inline_free(iomap);
+ return err;
+ }
+ }
+ } else if (iomap->type == IOMAP_INLINE) {
+ /* inline data read */
+ err = fuse_iomap_inline_alloc(iomap);
+ if (!err)
+ err = fuse_iomap_inline_read(inode, pos, count, iomap);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
unsigned opflags, struct iomap *iomap,
struct iomap *srcmap)
@@ -486,12 +630,20 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
}
+ if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) {
+ err = fuse_iomap_set_inline(inode, opflags, pos, count, iomap,
+ srcmap);
+ if (err)
+ goto out_write_dev;
+ }
+
/*
* XXX: if we ever want to support closing devices, we need a way to
* track the fuse_backing refcount all the way through bio endios.
* For now we put the refcount here because you can't remove an iomap
* device until unmount time.
*/
+out_write_dev:
fuse_backing_put(write_dev);
out_read_dev:
fuse_backing_put(read_dev);
@@ -530,8 +682,28 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
{
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_mount *fm = get_fuse_mount(inode);
+ struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
+ struct iomap *srcmap = &iter->srcmap;
int err = 0;
+ if (srcmap->inline_data)
+ fuse_iomap_inline_free(srcmap);
+
+ if (iomap->inline_data) {
+ if (fuse_is_iomap_file_write(opflags) && written > 0) {
+ err = fuse_iomap_inline_write(inode, pos, written,
+ iomap);
+ fuse_iomap_inline_free(iomap);
+ if (err)
+ return err;
+ } else {
+ fuse_iomap_inline_free(iomap);
+ }
+
+ /* fuse server should already be aware of what happened */
+ return 0;
+ }
+
if (fuse_should_send_iomap_end(fm, iomap, opflags, count, written)) {
struct fuse_iomap_end_in inarg = {
.opflags = fuse_iomap_op_to_server(opflags),
@@ -1431,6 +1603,18 @@ static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
if (ret)
goto discard_folio;
+ if (BAD_DATA(write_iomap.type == IOMAP_INLINE)) {
+ /*
+ * iomap assumes that inline data writes are completed
+ * by the time ->iomap_end completes, so it should
+ * never mark a pagecache folio dirty.
+ */
+ fuse_iomap_end(inode, offset, len, 0,
+ FUSE_IOMAP_OP_WRITEBACK, &write_iomap);
+ ret = -EIO;
+ goto discard_folio;
+ }
+
/*
* Landed in a hole or beyond EOF? Send that to iomap, it'll
* skip writing back the file range.
* [PATCH 25/28] fuse_trace: implement inline data file IO via iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (23 preceding siblings ...)
2025-09-16 0:34 ` [PATCH 24/28] fuse: implement inline data file IO via iomap Darrick J. Wong
@ 2025-09-16 0:34 ` Darrick J. Wong
2025-09-16 0:34 ` [PATCH 26/28] fuse: allow more statx fields Darrick J. Wong
` (2 subsequent siblings)
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:34 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 7 +++++++
2 files changed, 52 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 1cff42bc5907bf..b1c45abe40b440 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -227,6 +227,7 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
struct iomap_writepage_ctx;
struct iomap_ioend;
+struct iomap;
/* tracepoint boilerplate so we don't have to keep doing this */
#define FUSE_IOMAP_OPFLAGS_FIELD \
@@ -1042,6 +1043,50 @@ TRACE_EVENT(fuse_iomap_dev_inval,
__entry->offset,
__entry->length)
);
+
+DECLARE_EVENT_CLASS(fuse_iomap_inline_class,
+ TP_PROTO(const struct inode *inode, loff_t pos, uint64_t count,
+ const struct iomap *map),
+ TP_ARGS(inode, pos, count, map),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(bool, has_buf)
+ __field(uint64_t, validity_cookie)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = pos;
+ __entry->length = count;
+
+ __entry->mapdev = FUSE_IOMAP_DEV_NULL;
+ __entry->mapaddr = map->addr;
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+
+ __entry->has_buf = map->inline_data != NULL;
+ __entry->validity_cookie= map->validity_cookie;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_MAP_FMT() " has_buf? %d cookie 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ __entry->has_buf,
+ __entry->validity_cookie)
+);
+#define DEFINE_FUSE_IOMAP_INLINE_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_inline_class, name, \
+ TP_PROTO(const struct inode *inode, loff_t pos, uint64_t count, \
+ const struct iomap *map), \
+ TP_ARGS(inode, pos, count, map))
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_read);
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 8faf16f58df035..a71de4ea5eb32d 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -452,6 +452,8 @@ static int fuse_iomap_inline_read(struct inode *inode, loff_t pos,
if (BAD_DATA(!iomap_inline_data_valid(iomap)))
return -EFSCORRUPTED;
+ trace_fuse_iomap_inline_read(inode, pos, count, iomap);
+
args.opcode = FUSE_READ;
args.nodeid = fi->nodeid;
args.in_numargs = 1;
@@ -497,6 +499,8 @@ static int fuse_iomap_inline_write(struct inode *inode, loff_t pos,
if (BAD_DATA(!iomap_inline_data_valid(iomap)))
return -EFSCORRUPTED;
+ trace_fuse_iomap_inline_write(inode, pos, count, iomap);
+
args.opcode = FUSE_WRITE;
args.nodeid = fi->nodeid;
args.in_numargs = 2;
@@ -558,6 +562,9 @@ static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags,
return err;
}
+ trace_fuse_iomap_set_inline_iomap(inode, pos, count, iomap);
+ trace_fuse_iomap_set_inline_srcmap(inode, pos, count, srcmap);
+
return 0;
}
* [PATCH 26/28] fuse: allow more statx fields
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (24 preceding siblings ...)
2025-09-16 0:34 ` [PATCH 25/28] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:34 ` Darrick J. Wong
2025-09-16 0:35 ` [PATCH 27/28] fuse: support atomic writes with iomap Darrick J. Wong
2025-09-16 0:35 ` [PATCH 28/28] fuse: disable direct reclaim for any fuse server that uses iomap Darrick J. Wong
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:34 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Allow the fuse server to supply us with the more recently added fields
of struct statx.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
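Reader's note, not part of the commit: the new fields are carved out of
the old __spare2[14] padding, so the size of struct fuse_statx should not
change. The sketch below checks that arithmetic in userspace. The struct
names model_old_tail and model_new_tail are hypothetical, and only the
tail of the structure is modelled; the field order is copied from the
uapi hunk in this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Tail of struct fuse_statx before this patch (model only). */
struct model_old_tail {
	uint64_t spare2[14];
};

/* Tail of struct fuse_statx after this patch (model only). */
struct model_new_tail {
	uint64_t mnt_id;
	uint32_t dio_mem_align;
	uint32_t dio_offset_align;
	uint64_t subvol;

	uint32_t atomic_write_unit_min;
	uint32_t atomic_write_unit_max;
	uint32_t atomic_write_segments_max;
	uint32_t dio_read_offset_align;
	uint32_t atomic_write_unit_max_opt;
	uint32_t spare2[1];

	uint64_t spare3[8];
};

/* The replacement must occupy exactly the bytes of the old padding. */
_Static_assert(sizeof(struct model_new_tail) == sizeof(struct model_old_tail),
	       "new fuse_statx tail must not change the struct size");
```

The single 32-bit spare2 slot is what keeps spare3 aligned to 8 bytes
without compiler-inserted padding.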
fs/fuse/fuse_i.h | 8 +++++
include/uapi/linux/fuse.h | 15 ++++++++-
fs/fuse/dir.c | 75 ++++++++++++++++++++++++++++++++++++++-------
3 files changed, 86 insertions(+), 12 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4aa7199dd0cd9f..02af28f49cdfe5 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1733,6 +1733,14 @@ void fuse_iomap_sysfs_cleanup(struct kobject *kobj);
sector_t fuse_bmap(struct address_space *mapping, sector_t block);
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+int fuse_iomap_sysfs_init(struct kobject *kobj);
+void fuse_iomap_sysfs_cleanup(struct kobject *kobj);
+#else
+# define fuse_iomap_sysfs_init(...) (0)
+# define fuse_iomap_sysfs_cleanup(...) ((void)0)
+#endif
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
bool fuse_iomap_enabled(void);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index b63fba0a2c52c9..e0139fb43f82ea 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -334,7 +334,20 @@ struct fuse_statx {
uint32_t rdev_minor;
uint32_t dev_major;
uint32_t dev_minor;
- uint64_t __spare2[14];
+
+ uint64_t mnt_id;
+ uint32_t dio_mem_align;
+ uint32_t dio_offset_align;
+ uint64_t subvol;
+
+ uint32_t atomic_write_unit_min;
+ uint32_t atomic_write_unit_max;
+ uint32_t atomic_write_segments_max;
+ uint32_t dio_read_offset_align;
+ uint32_t atomic_write_unit_max_opt;
+ uint32_t __spare2[1];
+
+ uint64_t __spare3[8];
};
struct fuse_kstatfs {
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index d62ceadbc05fb2..b5e3536f1d53c1 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1274,6 +1274,50 @@ static void fuse_statx_to_attr(struct fuse_statx *sx, struct fuse_attr *attr)
attr->blksize = sx->blksize;
}
+#define FUSE_SUPPORTED_STATX_MASK (STATX_BASIC_STATS | \
+ STATX_BTIME | \
+ STATX_DIOALIGN | \
+ STATX_SUBVOL | \
+ STATX_WRITE_ATOMIC)
+
+#define FUSE_UNCACHED_STATX_MASK (STATX_DIOALIGN | \
+ STATX_SUBVOL | \
+ STATX_WRITE_ATOMIC)
+
+static void kstat_from_fuse_statx(const struct fuse_conn *fc,
+ struct kstat *stat,
+ const struct fuse_statx *sx)
+{
+ stat->result_mask = sx->mask & FUSE_SUPPORTED_STATX_MASK;
+
+ stat->attributes |= fuse_statx_attributes(fc, sx);
+ stat->attributes_mask |= fuse_statx_attributes_mask(fc, sx);
+
+ if (sx->mask & STATX_BTIME) {
+ stat->btime.tv_sec = sx->btime.tv_sec;
+ stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec,
+ NSEC_PER_SEC - 1);
+ }
+
+ if (sx->mask & STATX_DIOALIGN) {
+ stat->dio_mem_align = sx->dio_mem_align;
+ stat->dio_offset_align = sx->dio_offset_align;
+ }
+
+ if (sx->mask & STATX_SUBVOL)
+ stat->subvol = sx->subvol;
+
+ if (sx->mask & STATX_WRITE_ATOMIC) {
+ stat->atomic_write_unit_min = sx->atomic_write_unit_min;
+ stat->atomic_write_unit_max = sx->atomic_write_unit_max;
+ stat->atomic_write_unit_max_opt = sx->atomic_write_unit_max_opt;
+ stat->atomic_write_segments_max = sx->atomic_write_segments_max;
+ }
+
+ if (sx->mask & STATX_DIO_READ_ALIGN)
+ stat->dio_read_offset_align = sx->dio_read_offset_align;
+}
+
static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
struct file *file, struct kstat *stat)
{
@@ -1297,7 +1341,7 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
}
/* For now leave sync hints as the default, request all stats. */
inarg.sx_flags = 0;
- inarg.sx_mask = STATX_BASIC_STATS | STATX_BTIME;
+ inarg.sx_mask = FUSE_SUPPORTED_STATX_MASK;
args.opcode = FUSE_STATX;
args.nodeid = get_node_id(inode);
args.in_numargs = 1;
@@ -1325,11 +1369,7 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
}
if (stat) {
- stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
- stat->btime.tv_sec = sx->btime.tv_sec;
- stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
- stat->attributes |= fuse_statx_attributes(fm->fc, sx);
- stat->attributes_mask |= fuse_statx_attributes_mask(fm->fc, sx);
+ kstat_from_fuse_statx(fm->fc, stat, sx);
fuse_fillattr(idmap, inode, &attr, stat);
stat->result_mask |= STATX_TYPE;
}
@@ -1394,16 +1434,29 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode,
u32 inval_mask = READ_ONCE(fi->inval_mask);
u32 cache_mask = fuse_get_cache_mask(inode);
-
- /* FUSE only supports basic stats and possibly btime */
- request_mask &= STATX_BASIC_STATS | STATX_BTIME;
+ /* Only ask for supported stats */
+ request_mask &= FUSE_SUPPORTED_STATX_MASK;
retry:
if (fc->no_statx)
request_mask &= STATX_BASIC_STATS;
if (!request_mask)
sync = false;
- else if (flags & AT_STATX_FORCE_SYNC)
+ else if (request_mask & FUSE_UNCACHED_STATX_MASK) {
+ switch (flags & AT_STATX_SYNC_TYPE) {
+ case AT_STATX_DONT_SYNC:
+ request_mask &= ~FUSE_UNCACHED_STATX_MASK;
+ sync = false;
+ break;
+ case AT_STATX_FORCE_SYNC:
+ case AT_STATX_SYNC_AS_STAT:
+ sync = true;
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+ } else if (flags & AT_STATX_FORCE_SYNC)
sync = true;
else if (flags & AT_STATX_DONT_SYNC)
sync = false;
@@ -1414,7 +1467,7 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode,
if (sync) {
forget_all_cached_acls(inode);
- /* Try statx if BTIME is requested */
+ /* Try statx if a field not covered by regular stat is wanted */
if (!fc->no_statx && (request_mask & ~STATX_BASIC_STATS)) {
err = fuse_do_statx(idmap, inode, file, stat);
if (err == -ENOSYS) {
* [PATCH 27/28] fuse: support atomic writes with iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (25 preceding siblings ...)
2025-09-16 0:34 ` [PATCH 26/28] fuse: allow more statx fields Darrick J. Wong
@ 2025-09-16 0:35 ` Darrick J. Wong
2025-09-16 0:35 ` [PATCH 28/28] fuse: disable direct reclaim for any fuse server that uses iomap Darrick J. Wong
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:35 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
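Support untorn (atomic) writes on iomap files, limited for now to
exactly one filesystem block per write. One whole block!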
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
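Reader's note, not part of the commit: the acceptance rule for an
IOCB_ATOMIC write can be modelled as a pure function. This is a sketch
under assumptions, not the kernel code. model_atomic_write_valid is a
hypothetical name; the first check mirrors fuse_iomap_atomic_write_valid()
in this patch, and the remaining checks reflect my reading of
generic_atomic_write_valid() in recent kernels (its user-buffer iterator
check is omitted here).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; returns 0 or a negative errno. */
static int model_atomic_write_valid(uint64_t pos, size_t len,
				    size_t blocksize, bool is_direct)
{
	/* fuse rule from this patch: exactly one filesystem block */
	if (len != blocksize)
		return -EINVAL;

	/* assumed generic rule: length is a power of two */
	if (len == 0 || (len & (len - 1)) != 0)
		return -EINVAL;

	/* assumed generic rule: position is naturally aligned to length */
	if (pos % len != 0)
		return -EINVAL;

	/* assumed generic rule: only direct IO can be untorn */
	if (!is_direct)
		return -EOPNOTSUPP;

	return 0;
}
```

With a 4096-byte block size, a direct write of 4096 bytes at offset 4096
passes, while an 8192-byte write or one starting at offset 512 is
rejected with -EINVAL.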
fs/fuse/fuse_i.h | 9 ++++++++
fs/fuse/fuse_trace.h | 4 +++-
include/uapi/linux/fuse.h | 5 +++++
fs/fuse/file_iomap.c | 50 ++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 66 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 02af28f49cdfe5..777826115d3e80 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -259,6 +259,8 @@ enum {
FUSE_I_CACHE_IO_MODE,
/* Use iomap for this inode */
FUSE_I_IOMAP,
+ /* Enable untorn writes */
+ FUSE_I_ATOMIC,
};
struct fuse_conn;
@@ -1765,6 +1767,13 @@ static inline bool fuse_inode_has_iomap(const struct inode *inode)
return test_bit(FUSE_I_IOMAP, &fi->state);
}
+static inline bool fuse_inode_has_atomic(const struct inode *inode)
+{
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+
+ return test_bit(FUSE_I_ATOMIC, &fi->state);
+}
+
int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 length);
loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index b1c45abe40b440..1befea65d4b15c 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -326,6 +326,7 @@ TRACE_DEFINE_ENUM(FUSE_I_BAD);
TRACE_DEFINE_ENUM(FUSE_I_BTIME);
TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
+TRACE_DEFINE_ENUM(FUSE_I_ATOMIC);
#define FUSE_IFLAG_STRINGS \
{ 1 << FUSE_I_ADVISE_RDPLUS, "advise_rdplus" }, \
@@ -334,7 +335,8 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
{ 1 << FUSE_I_BAD, "bad" }, \
{ 1 << FUSE_I_BTIME, "btime" }, \
{ 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \
- { 1 << FUSE_I_IOMAP, "iomap" }
+ { 1 << FUSE_I_IOMAP, "iomap" }, \
+ { 1 << FUSE_I_ATOMIC, "atomic" }
#define IOMAP_IOEND_STRINGS \
{ IOMAP_IOEND_SHARED, "shared" }, \
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index e0139fb43f82ea..472605d7ff6a2f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -241,6 +241,7 @@
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
* - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
* - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
+ * - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
*/
#ifndef _LINUX_FUSE_H
@@ -595,10 +596,12 @@ struct fuse_file_lock {
* FUSE_ATTR_SUBMOUNT: Object is a submount root
* FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
* FUSE_ATTR_IOMAP: Use iomap for this inode
+ * FUSE_ATTR_ATOMIC: Enable untorn writes
*/
#define FUSE_ATTR_SUBMOUNT (1 << 0)
#define FUSE_ATTR_DAX (1 << 1)
#define FUSE_ATTR_IOMAP (1 << 2)
+#define FUSE_ATTR_ATOMIC (1 << 3)
/**
* Open flags
@@ -1158,6 +1161,8 @@ struct fuse_backing_map {
/* basic file I/O functionality through iomap */
#define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0)
+/* untorn writes through iomap */
+#define FUSE_IOMAP_SUPPORT_ATOMIC (1ULL << 1)
struct fuse_iomap_support {
uint64_t flags;
uint64_t padding;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index a71de4ea5eb32d..30db4079ab8a55 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1099,12 +1099,32 @@ static inline void fuse_inode_clear_iomap(struct inode *inode)
clear_bit(FUSE_I_IOMAP, &fi->state);
}
+static inline void fuse_inode_set_atomic(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ set_bit(FUSE_I_ATOMIC, &fi->state);
+}
+
+static inline void fuse_inode_clear_atomic(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ clear_bit(FUSE_I_ATOMIC, &fi->state);
+}
+
void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
{
struct fuse_conn *conn = get_fuse_conn(inode);
if (conn->iomap && (attr_flags & FUSE_ATTR_IOMAP))
fuse_inode_set_iomap(inode);
+ if (fuse_inode_has_iomap(inode) && (attr_flags & FUSE_ATTR_ATOMIC))
+ fuse_inode_set_atomic(inode);
trace_fuse_iomap_init_inode(inode);
}
@@ -1113,6 +1133,8 @@ void fuse_iomap_evict_inode(struct inode *inode)
{
trace_fuse_iomap_evict_inode(inode);
+ if (fuse_inode_has_atomic(inode))
+ fuse_inode_clear_atomic(inode);
if (fuse_inode_has_iomap(inode))
fuse_inode_clear_iomap(inode);
}
@@ -1191,6 +1213,8 @@ void fuse_iomap_open(struct inode *inode, struct file *file)
ASSERT(fuse_inode_has_iomap(inode));
file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+ if (fuse_inode_has_atomic(inode))
+ file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
}
enum fuse_ilock_type {
@@ -1397,6 +1421,17 @@ fuse_iomap_write_checks(
return kiocb_modified(iocb);
}
+static inline ssize_t fuse_iomap_atomic_write_valid(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iov_iter_count(from) != i_blocksize(inode))
+ return -EINVAL;
+
+ return generic_atomic_write_valid(iocb, from);
+}
+
ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
@@ -1412,6 +1447,12 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (!count)
return 0;
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ ret = fuse_iomap_atomic_write_valid(iocb, from);
+ if (ret)
+ return ret;
+ }
+
/*
* Unaligned direct writes require zeroing of unwritten head and tail
* blocks. Extending writes require zeroing of post-EOF tail blocks.
@@ -1819,6 +1860,12 @@ ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from)
if (!iov_iter_count(from))
return 0;
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ ret = fuse_iomap_atomic_write_valid(iocb, from);
+ if (ret)
+ return ret;
+ }
+
ret = fuse_iomap_ilock_iocb(iocb, EXCL);
if (ret)
return ret;
@@ -2042,7 +2089,8 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support ios = { };
if (fuse_iomap_enabled())
- ios.flags = FUSE_IOMAP_SUPPORT_FILEIO;
+ ios.flags = FUSE_IOMAP_SUPPORT_FILEIO |
+ FUSE_IOMAP_SUPPORT_ATOMIC;
if (copy_to_user(argp, &ios, sizeof(ios)))
return -EFAULT;
* [PATCH 28/28] fuse: disable direct reclaim for any fuse server that uses iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (26 preceding siblings ...)
2025-09-16 0:35 ` [PATCH 27/28] fuse: support atomic writes with iomap Darrick J. Wong
@ 2025-09-16 0:35 ` Darrick J. Wong
27 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:35 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Any fuse server that uses iomap can create a substantial amount of dirty
pages in the pagecache because we don't write dirty stuff until reclaim
or fsync. Therefore, memory reclaim on any fuse iomap server mustn't
ever recurse back into the same filesystem. We must also never throttle
the fuse server writes to a bdi because that will just slow down
metadata operations.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file_iomap.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 30db4079ab8a55..524c26e53674f2 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1014,6 +1014,12 @@ static void fuse_iomap_config_reply(struct fuse_mount *fm,
fc->iomap_conn.no_end = 0;
fc->iomap_conn.no_ioend = 0;
+ /*
+ * We could be on the hook for a substantial amount of writeback, so
+ * prohibit reclaim from recursing into fuse or the kernel from
+ * throttling any bdis that the fuse server might write to.
+ */
+ current->flags |= PF_MEMALLOC_NOFS | PF_LOCAL_THROTTLE;
done:
kfree(ia);
fuse_finish_init(fc, ok);
* [PATCH 1/3] fuse: make the root nodeid dynamic
2025-09-16 0:19 ` [PATCHSET RFC v5 5/8] fuse: allow servers to specify root node id Darrick J. Wong
@ 2025-09-16 0:35 ` Darrick J. Wong
2025-09-16 0:35 ` [PATCH 2/3] fuse_trace: " Darrick J. Wong
2025-09-16 0:36 ` [PATCH 3/3] fuse: allow setting of root nodeid Darrick J. Wong
2 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:35 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Change this from a hardcoded constant to a dynamic field so that fuse
servers don't need to translate.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
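Reader's note, not part of the commit: the effect on invalid_nodeid() can
be shown with a small userspace model. The names model_conn,
model_conn_init and model_invalid_nodeid are hypothetical; the default of
1 matches FUSE_ROOT_ID in the uapi header, and the value 128 used below
is just an arbitrary example of a server-chosen root.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_FUSE_ROOT_ID	1	/* FUSE_ROOT_ID in the uapi header */

/* Hypothetical stand-in for the one struct fuse_conn field used here. */
struct model_conn {
	uint64_t root_nodeid;
};

/* Mirrors the fuse_conn_init() hunk: default to the historic constant. */
static void model_conn_init(struct model_conn *fc)
{
	fc->root_nodeid = MODEL_FUSE_ROOT_ID;
}

/*
 * Mirrors the new invalid_nodeid(): a server reply may never name node 0
 * or whatever node this connection treats as the root directory.
 */
static bool model_invalid_nodeid(const struct model_conn *fc, uint64_t nodeid)
{
	return !nodeid || nodeid == fc->root_nodeid;
}
```

Once the root is moved to 128, a reply naming node 1 becomes acceptable
and a reply naming node 128 is rejected, which is why every caller now
has to pass the connection.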
fs/fuse/fuse_i.h | 7 +++++--
fs/fuse/dir.c | 10 ++++++----
fs/fuse/inode.c | 11 +++++++----
fs/fuse/readdir.c | 10 +++++-----
4 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 777826115d3e80..70942340e33855 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -684,6 +684,9 @@ struct fuse_conn {
struct rcu_head rcu;
+ /* node id of the root directory */
+ u64 root_nodeid;
+
/** The user id for this mount */
kuid_t user_id;
@@ -1127,9 +1130,9 @@ static inline u64 get_node_id(struct inode *inode)
return get_fuse_inode(inode)->nodeid;
}
-static inline int invalid_nodeid(u64 nodeid)
+static inline int invalid_nodeid(const struct fuse_conn *fc, u64 nodeid)
{
- return !nodeid || nodeid == FUSE_ROOT_ID;
+ return !nodeid || nodeid == fc->root_nodeid;
}
static inline u64 fuse_get_attr_version(struct fuse_conn *fc)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b5e3536f1d53c1..c6e83b724f8cd0 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -386,7 +386,7 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
err = -EIO;
if (fuse_invalid_attr(&outarg->attr))
goto out_put_forget;
- if (outarg->nodeid == FUSE_ROOT_ID && outarg->generation != 0) {
+ if (outarg->nodeid == fm->fc->root_nodeid && outarg->generation != 0) {
pr_warn_once("root generation should be zero\n");
outarg->generation = 0;
}
@@ -436,7 +436,7 @@ static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
goto out_err;
err = -EIO;
- if (inode && get_node_id(inode) == FUSE_ROOT_ID)
+ if (inode && get_node_id(inode) == fc->root_nodeid)
goto out_iput;
newent = d_splice_alias(inode, entry);
@@ -687,7 +687,8 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
goto out_free_ff;
err = -EIO;
- if (!S_ISREG(outentry.attr.mode) || invalid_nodeid(outentry.nodeid) ||
+ if (!S_ISREG(outentry.attr.mode) ||
+ invalid_nodeid(fm->fc, outentry.nodeid) ||
fuse_invalid_attr(&outentry.attr))
goto out_free_ff;
@@ -838,7 +839,8 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
goto out_put_forget_req;
err = -EIO;
- if (invalid_nodeid(outarg.nodeid) || fuse_invalid_attr(&outarg.attr))
+ if (invalid_nodeid(fm->fc, outarg.nodeid) ||
+ fuse_invalid_attr(&outarg.attr))
goto out_put_forget_req;
if ((outarg.attr.mode ^ mode) & S_IFMT)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index beb9ee62b6b861..350805fa61690c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -997,6 +997,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
fc->max_pages_limit = fuse_max_pages_limit;
fc->name_max = FUSE_NAME_LOW_MAX;
fc->timeout.req_timeout = 0;
+ fc->root_nodeid = FUSE_ROOT_ID;
if (IS_ENABLED(CONFIG_FUSE_BACKING))
fuse_backing_files_init(fc);
@@ -1052,12 +1053,14 @@ EXPORT_SYMBOL_GPL(fuse_conn_get);
static struct inode *fuse_get_root_inode(struct super_block *sb, unsigned int mode)
{
struct fuse_attr attr;
+ struct fuse_conn *fc = get_fuse_conn_super(sb);
+
memset(&attr, 0, sizeof(attr));
attr.mode = mode;
- attr.ino = FUSE_ROOT_ID;
+ attr.ino = fc->root_nodeid;
attr.nlink = 1;
- return fuse_iget(sb, FUSE_ROOT_ID, 0, &attr, 0, 0, 0);
+ return fuse_iget(sb, fc->root_nodeid, 0, &attr, 0, 0, 0);
}
struct fuse_inode_handle {
@@ -1101,7 +1104,7 @@ static struct dentry *fuse_get_dentry(struct super_block *sb,
goto out_iput;
entry = d_obtain_alias(inode);
- if (!IS_ERR(entry) && get_node_id(inode) != FUSE_ROOT_ID)
+ if (!IS_ERR(entry) && get_node_id(inode) != fc->root_nodeid)
fuse_invalidate_entry_cache(entry);
return entry;
@@ -1194,7 +1197,7 @@ static struct dentry *fuse_get_parent(struct dentry *child)
}
parent = d_obtain_alias(inode);
- if (!IS_ERR(parent) && get_node_id(inode) != FUSE_ROOT_ID)
+ if (!IS_ERR(parent) && get_node_id(inode) != fc->root_nodeid)
fuse_invalidate_entry_cache(parent);
return parent;
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index c2aae2eef0868b..45dd932eb03a5e 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -185,12 +185,12 @@ static int fuse_direntplus_link(struct file *file,
return 0;
}
- if (invalid_nodeid(o->nodeid))
- return -EIO;
- if (fuse_invalid_attr(&o->attr))
- return -EIO;
-
fc = get_fuse_conn(dir);
+ if (invalid_nodeid(fc, o->nodeid))
+ return -EIO;
+ if (fuse_invalid_attr(&o->attr))
+ return -EIO;
+
epoch = atomic_read(&fc->epoch);
name.hash = full_name_hash(parent, name.name, name.len);
* [PATCH 2/3] fuse_trace: make the root nodeid dynamic
2025-09-16 0:19 ` [PATCHSET RFC v5 5/8] fuse: allow servers to specify root node id Darrick J. Wong
2025-09-16 0:35 ` [PATCH 1/3] fuse: make the root nodeid dynamic Darrick J. Wong
@ 2025-09-16 0:35 ` Darrick J. Wong
2025-09-16 0:36 ` [PATCH 3/3] fuse: allow setting of root nodeid Darrick J. Wong
2 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:35 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Enhance the iomap config tracepoint to report the node id of the root
directory.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 1befea65d4b15c..9c2eb497730b06 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -988,6 +988,7 @@ TRACE_EVENT(fuse_iomap_config,
TP_STRUCT__entry(
__field(dev_t, connection)
+ __field(uint64_t, root_nodeid)
__field(uint32_t, flags)
__field(uint32_t, blocksize)
@@ -1002,6 +1003,7 @@ TRACE_EVENT(fuse_iomap_config,
TP_fast_assign(
__entry->connection = fm->fc->dev;
+ __entry->root_nodeid = fm->fc->root_nodeid;
__entry->flags = outarg->flags;
__entry->blocksize = outarg->s_blocksize;
__entry->max_links = outarg->s_max_links;
@@ -1012,8 +1014,8 @@ TRACE_EVENT(fuse_iomap_config,
__entry->uuid_len = outarg->s_uuid_len;
),
- TP_printk("connection %u flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u",
- __entry->connection,
+ TP_printk("connection %u root_ino 0x%llx flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u",
+ __entry->connection, __entry->root_nodeid,
__print_flags(__entry->flags, "|", FUSE_IOMAP_CONFIG_STRINGS),
__entry->blocksize, __entry->max_links, __entry->time_gran,
__entry->time_min, __entry->time_max, __entry->maxbytes,
* [PATCH 3/3] fuse: allow setting of root nodeid
2025-09-16 0:19 ` [PATCHSET RFC v5 5/8] fuse: allow servers to specify root node id Darrick J. Wong
2025-09-16 0:35 ` [PATCH 1/3] fuse: make the root nodeid dynamic Darrick J. Wong
2025-09-16 0:35 ` [PATCH 2/3] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:36 ` Darrick J. Wong
2 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:36 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Provide a new mount option so that fuse servers can actually set the
root nodeid.
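
For readers who want the option semantics in one place, here is a rough
userspace sketch, not the kernel code. It assumes the earlier patch in this
series leaves fc->root_nodeid at FUSE_ROOT_ID unless the option is given.
Every model_* name and the value 128 are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* FUSE_ROOT_ID is 1 in include/uapi/linux/fuse.h. */
#define MODEL_FUSE_ROOT_ID 1ULL

/* Hypothetical stand-in for the two fields added to struct fuse_fs_context. */
struct model_fs_context {
	uint64_t root_nodeid;
	bool root_nodeid_present;
};

/* Mirrors the OPT_ROOT_NODEID case: record the value and that it was given. */
static void model_parse_root_nodeid(struct model_fs_context *ctx, uint64_t val)
{
	ctx->root_nodeid = val;
	ctx->root_nodeid_present = true;
}

/* Assumed default: fall back to FUSE_ROOT_ID when the option is absent. */
static uint64_t model_effective_root_nodeid(const struct model_fs_context *ctx)
{
	return ctx->root_nodeid_present ? ctx->root_nodeid : MODEL_FUSE_ROOT_ID;
}

/* Mirrors the fuse_show_options() test: only report non-default values. */
static bool model_show_root_nodeid(uint64_t root_nodeid)
{
	return root_nodeid && root_nodeid != MODEL_FUSE_ROOT_ID;
}

static void model_root_nodeid_demo(void)
{
	struct model_fs_context ctx = { 0 };

	/* No root_nodeid= option: the traditional root id is used and hidden. */
	assert(model_effective_root_nodeid(&ctx) == MODEL_FUSE_ROOT_ID);
	assert(!model_show_root_nodeid(model_effective_root_nodeid(&ctx)));

	/* root_nodeid=128 (made-up value): used as the root and shown. */
	model_parse_root_nodeid(&ctx, 128);
	assert(model_effective_root_nodeid(&ctx) == 128);
	assert(model_show_root_nodeid(model_effective_root_nodeid(&ctx)));
}
```

The separate "present" flag is what lets the mount code tell "option not
given" apart from any particular numeric value.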
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 2 ++
fs/fuse/inode.c | 11 +++++++++++
2 files changed, 13 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 70942340e33855..fb60686fb9c61a 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -619,6 +619,7 @@ struct fuse_fs_context {
int fd;
struct file *file;
unsigned int rootmode;
+ u64 root_nodeid;
kuid_t user_id;
kgid_t group_id;
bool is_bdev:1;
@@ -633,6 +634,7 @@ struct fuse_fs_context {
bool no_force_umount:1;
bool legacy_opts_show:1;
bool local_fs:1;
+ bool root_nodeid_present:1;
enum fuse_dax_mode dax_mode;
unsigned int max_read;
unsigned int blksize;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 350805fa61690c..e74d39ac05a570 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -781,6 +781,7 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+ OPT_ROOT_NODEID,
OPT_ERR
};
@@ -795,6 +796,7 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
fsparam_u32 ("max_read", OPT_MAX_READ),
fsparam_u32 ("blksize", OPT_BLKSIZE),
fsparam_string ("subtype", OPT_SUBTYPE),
+ fsparam_u64 ("root_nodeid", OPT_ROOT_NODEID),
{}
};
@@ -890,6 +892,11 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
ctx->blksize = result.uint_32;
break;
+ case OPT_ROOT_NODEID:
+ ctx->root_nodeid = result.uint_64;
+ ctx->root_nodeid_present = true;
+ break;
+
default:
return -EINVAL;
}
@@ -925,6 +932,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+ if (fc->root_nodeid && fc->root_nodeid != FUSE_ROOT_ID)
+ seq_printf(m, ",root_nodeid=%llu", fc->root_nodeid);
}
#ifdef CONFIG_FUSE_DAX
if (fc->dax_mode == FUSE_DAX_ALWAYS)
@@ -1910,6 +1919,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
sb->s_flags |= SB_POSIXACL;
fc->default_permissions = ctx->default_permissions;
+ if (ctx->root_nodeid_present)
+ fc->root_nodeid = ctx->root_nodeid;
fc->allow_other = ctx->allow_other;
fc->user_id = ctx->user_id;
fc->group_id = ctx->group_id;
* [PATCH 1/9] fuse: enable caching of timestamps
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-09-16 0:36 ` Darrick J. Wong
2025-09-16 0:36 ` [PATCH 2/9] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
` (7 subsequent siblings)
8 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:36 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Cache the timestamps in the kernel so that the kernel sends FUSE_SETATTR
calls to the fuse server after writes, because the iomap infrastructure
won't do that for us.
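
The decision this patch changes can be sketched in userspace as below. This
is only a model of fuse_get_cache_mask() as it looks after this patch. The
model_* names are hypothetical and the bit values are copied from the statx
UAPI header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Numeric values as in the statx UAPI header; MODEL_ marks the sketch. */
#define MODEL_STATX_MTIME 0x40U
#define MODEL_STATX_CTIME 0x80U
#define MODEL_STATX_SIZE  0x200U

/* Hypothetical, flattened view of the inode state the check looks at. */
struct model_inode {
	bool is_regular;	/* S_ISREG(inode->i_mode) */
	bool has_iomap;		/* fuse_inode_has_iomap(inode) */
};

/* Which attributes the kernel trusts locally instead of the server's copy. */
static uint32_t model_cache_mask(const struct model_inode *inode,
				 bool writeback_cache)
{
	if (inode->is_regular && (inode->has_iomap || writeback_cache))
		return MODEL_STATX_MTIME | MODEL_STATX_CTIME | MODEL_STATX_SIZE;
	return 0;
}

static void model_cache_mask_demo(void)
{
	struct model_inode iomap_file = { .is_regular = true, .has_iomap = true };
	struct model_inode plain_file = { .is_regular = true, .has_iomap = false };
	struct model_inode iomap_dir = { .is_regular = false, .has_iomap = true };

	/* iomap alone is now enough to cache timestamps for regular files. */
	assert(model_cache_mask(&iomap_file, false) & MODEL_STATX_MTIME);
	/* Without iomap, behavior still depends on writeback_cache. */
	assert(model_cache_mask(&plain_file, false) == 0);
	assert(model_cache_mask(&plain_file, true) & MODEL_STATX_CTIME);
	/* Non-regular files never cache these attributes. */
	assert(model_cache_mask(&iomap_dir, true) == 0);
}
```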
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dir.c | 5 ++++-
fs/fuse/file.c | 18 ++++++++++++------
fs/fuse/file_iomap.c | 6 ++++++
fs/fuse/inode.c | 13 +++++++------
4 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index c6e83b724f8cd0..380559950c3444 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2062,7 +2062,8 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
struct fuse_attr_out outarg;
const bool is_iomap = fuse_inode_has_iomap(inode);
bool is_truncate = false;
- bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode);
+ bool is_wb = (is_iomap || fc->writeback_cache) &&
+ S_ISREG(inode->i_mode);
loff_t oldsize;
int err;
bool trust_local_cmtime = is_wb;
@@ -2196,6 +2197,8 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
spin_lock(&fi->lock);
/* the kernel maintains i_mtime locally */
if (trust_local_cmtime) {
+ if ((attr->ia_valid & ATTR_ATIME) && is_iomap)
+ inode_set_atime_to_ts(inode, attr->ia_atime);
if (attr->ia_valid & ATTR_MTIME)
inode_set_mtime_to_ts(inode, attr->ia_mtime);
if (attr->ia_valid & ATTR_CTIME)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9476f14035bb7f..0ed13082d0d00d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -240,7 +240,7 @@ static int fuse_open(struct inode *inode, struct file *file)
int err;
const bool is_iomap = fuse_inode_has_iomap(inode);
bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
- bool is_wb_truncate = is_truncate && fc->writeback_cache;
+ bool is_wb_truncate = is_truncate && (is_iomap || fc->writeback_cache);
bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
if (fuse_is_bad(inode))
@@ -453,12 +453,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
struct fuse_file *ff = file->private_data;
struct fuse_flush_in inarg;
FUSE_ARGS(args);
+ const bool is_iomap = fuse_inode_has_iomap(inode);
int err;
if (fuse_is_bad(inode))
return -EIO;
- if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
+ if ((ff->open_flags & FOPEN_NOFLUSH) &&
+ !fm->fc->writeback_cache && !is_iomap)
return 0;
err = write_inode_now(inode, 1);
@@ -494,7 +496,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
* In memory i_blocks is not maintained by fuse, if writeback cache is
* enabled, i_blocks from cached attr may not be accurate.
*/
- if (!err && fm->fc->writeback_cache)
+ if (!err && (is_iomap || fm->fc->writeback_cache))
fuse_invalidate_attr_mask(inode, STATX_BLOCKS);
return err;
}
@@ -796,8 +798,10 @@ static void fuse_short_read(struct inode *inode, u64 attr_ver, size_t num_read,
* If writeback_cache is enabled, a short read means there's a hole in
* the file. Some data after the hole is in page cache, but has not
* reached the client fs yet. So the hole is not present there.
+ * If iomap is enabled, a short read means we hit EOF so there's
+ * nothing to adjust.
*/
- if (!fc->writeback_cache) {
+ if (!fc->writeback_cache && !fuse_inode_has_iomap(inode)) {
loff_t pos = folio_pos(ap->folios[0]) + num_read;
fuse_read_update_size(inode, pos, attr_ver);
}
@@ -1412,6 +1416,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned int flags, struct iomap *iomap,
struct iomap *srcmap)
{
+ WARN_ON(fuse_inode_has_iomap(inode));
+
iomap->type = IOMAP_MAPPED;
iomap->length = length;
iomap->offset = offset;
@@ -1979,7 +1985,7 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
* Do this only if writeback_cache is not enabled. If writeback_cache
* is enabled, we trust local ctime/mtime.
*/
- if (!fc->writeback_cache)
+ if (!fc->writeback_cache && !fuse_inode_has_iomap(inode))
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
spin_lock(&fi->lock);
fi->writectr--;
@@ -3065,7 +3071,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
/* mark unstable when write-back is not used, and file_out gets
* extended */
const bool is_iomap = fuse_inode_has_iomap(inode_out);
- bool is_unstable = (!fc->writeback_cache) &&
+ bool is_unstable = (!fc->writeback_cache && !is_iomap) &&
((pos_out + len) > inode_out->i_size);
if (fc->no_copy_file_range)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 524c26e53674f2..1fc9a9b7b75094 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1769,6 +1769,12 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
ASSERT(fuse_has_iomap(inode));
+ /*
+ * Manage timestamps ourselves, don't make the fuse server do it. This
+ * is critical for mtime updates to work correctly with page_mkwrite.
+ */
+ inode->i_flags &= ~S_NOCMTIME;
+ inode->i_flags &= ~S_NOATIME;
inode->i_data.a_ops = &fuse_iomap_aops;
INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e74d39ac05a570..b63e4e1d8f45ce 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -328,10 +328,11 @@ u32 fuse_get_cache_mask(struct inode *inode)
{
struct fuse_conn *fc = get_fuse_conn(inode);
- if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
- return 0;
+ if (S_ISREG(inode->i_mode) &&
+ (fuse_inode_has_iomap(inode) || fc->writeback_cache))
+ return STATX_MTIME | STATX_CTIME | STATX_SIZE;
- return STATX_MTIME | STATX_CTIME | STATX_SIZE;
+ return 0;
}
static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr,
@@ -346,9 +347,9 @@ static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr
spin_lock(&fi->lock);
/*
- * In case of writeback_cache enabled, writes update mtime, ctime and
- * may update i_size. In these cases trust the cached value in the
- * inode.
+ * In case of writeback_cache or iomap enabled, writes update mtime,
+ * ctime and may update i_size. In these cases trust the cached value
+ * in the inode.
*/
cache_mask = fuse_get_cache_mask(inode);
if (cache_mask & STATX_SIZE)
* [PATCH 2/9] fuse: force a ctime update after a fileattr_set call when in iomap mode
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-09-16 0:36 ` [PATCH 1/9] fuse: enable caching of timestamps Darrick J. Wong
@ 2025-09-16 0:36 ` Darrick J. Wong
2025-09-16 0:36 ` [PATCH 3/9] fuse: allow local filesystems to set some VFS iflags Darrick J. Wong
` (6 subsequent siblings)
8 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:36 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
In iomap mode, the kernel is in charge of driving ctime updates to
the fuse server and ignores updates coming from the fuse server.
Therefore, when someone calls fileattr_set to change file attributes, we
must force a ctime update.
Found by generic/277.
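
A simplified userspace sketch of the "snapshot, set, compare, bump ctime"
flow follows. It is not the kernel code: the patch compares the whole
struct file_kattr with memcmp(), whereas this model compares three
hypothetical fields, and the direct assignment stands in for the ioctl sent
to the fuse server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the file attributes that can change. */
struct model_fileattr {
	uint32_t flags;
	uint32_t fsx_xflags;
	uint32_t fsx_projid;
};

struct model_inode {
	struct model_fileattr fa;
	int64_t ctime;
	bool caches_ctime;	/* cache mask contains STATX_CTIME */
};

static bool model_fileattr_changed(const struct model_fileattr *a,
				   const struct model_fileattr *b)
{
	return a->flags != b->flags ||
	       a->fsx_xflags != b->fsx_xflags ||
	       a->fsx_projid != b->fsx_projid;
}

/* Apply new attributes; bump ctime only if we cache it and something changed. */
static void model_fileattr_set(struct model_inode *inode,
			       const struct model_fileattr *fa, int64_t now)
{
	struct model_fileattr old = inode->fa;	/* snapshot before the call */

	inode->fa = *fa;	/* stand-in for the request to the server */
	if (inode->caches_ctime && model_fileattr_changed(&old, fa))
		inode->ctime = now;
}

static void model_fileattr_demo(void)
{
	struct model_inode inode = { .ctime = 100, .caches_ctime = true };
	struct model_fileattr same = { 0 };
	struct model_fileattr changed = { .flags = 0x10 };

	model_fileattr_set(&inode, &same, 200);
	assert(inode.ctime == 100);	/* nothing changed, ctime untouched */

	model_fileattr_set(&inode, &changed, 300);
	assert(inode.ctime == 300);	/* attribute changed, ctime forced */

	inode.caches_ctime = false;
	model_fileattr_set(&inode, &same, 400);
	assert(inode.ctime == 300);	/* server owns ctime in this mode */
}
```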
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/ioctl.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index 57032eadca6c27..f5f7d806262cdf 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -548,8 +548,13 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
struct fuse_file *ff;
unsigned int flags = fa->flags;
struct fsxattr xfa;
+ struct file_kattr old_ma = { };
+ bool is_wb = (fuse_get_cache_mask(inode) & STATX_CTIME);
int err;
+ if (is_wb)
+ vfs_fileattr_get(dentry, &old_ma);
+
ff = fuse_priv_ioctl_prepare(inode);
if (IS_ERR(ff))
return PTR_ERR(ff);
@@ -573,6 +578,12 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
cleanup:
fuse_priv_ioctl_cleanup(inode, ff);
+ /*
+ * If we cache ctime updates and the fileattr changed, then force a
+ * ctime update.
+ */
+ if (is_wb && memcmp(&old_ma, fa, sizeof(old_ma)))
+ fuse_update_ctime(inode);
if (err == -ENOTTY)
err = -EOPNOTSUPP;
* [PATCH 3/9] fuse: allow local filesystems to set some VFS iflags
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-09-16 0:36 ` [PATCH 1/9] fuse: enable caching of timestamps Darrick J. Wong
2025-09-16 0:36 ` [PATCH 2/9] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
@ 2025-09-16 0:36 ` Darrick J. Wong
2025-09-16 0:37 ` [PATCH 4/9] fuse_trace: " Darrick J. Wong
` (5 subsequent siblings)
8 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:36 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
There are three inode flags (immutable, append, sync) that are enforced
by the VFS. Whenever we go around setting iflags, let's update the VFS
state so that they actually work. Make it so that the fuse server can
set these three inode flags at load time and have the kernel advertise
and enforce them.
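
The load-time translation can be modeled in userspace roughly like this.
The FUSE_ATTR_* bit positions are the ones this patch adds to the uapi
header. The MODEL_S_* values are copied from include/linux/fs.h as I
remember them and should be treated as an assumption, as should all of the
model_* names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits proposed by this patch for fuse_attr.flags. */
#define MODEL_FUSE_ATTR_SYNC		(1U << 4)
#define MODEL_FUSE_ATTR_IMMUTABLE	(1U << 5)
#define MODEL_FUSE_ATTR_APPEND		(1U << 6)

/* Assumed VFS inode flag values; only their distinctness matters here. */
#define MODEL_S_SYNC		(1U << 0)
#define MODEL_S_APPEND		(1U << 2)
#define MODEL_S_IMMUTABLE	(1U << 3)

/* Return the inode flags after applying the server-supplied attr flags. */
static unsigned int model_fileattr_init(unsigned int i_flags,
					uint32_t attr_flags, bool local_fs)
{
	/* Only servers that asked for local fs behavior get VFS enforcement. */
	if (!local_fs)
		return i_flags;

	if (attr_flags & MODEL_FUSE_ATTR_SYNC)
		i_flags |= MODEL_S_SYNC;
	if (attr_flags & MODEL_FUSE_ATTR_IMMUTABLE)
		i_flags |= MODEL_S_IMMUTABLE;
	if (attr_flags & MODEL_FUSE_ATTR_APPEND)
		i_flags |= MODEL_S_APPEND;
	return i_flags;
}

static void model_fileattr_init_demo(void)
{
	uint32_t attr = MODEL_FUSE_ATTR_IMMUTABLE | MODEL_FUSE_ATTR_APPEND;

	assert(model_fileattr_init(0, attr, true) ==
	       (MODEL_S_IMMUTABLE | MODEL_S_APPEND));
	/* Without local_fs the server's bits are ignored. */
	assert(model_fileattr_init(0, attr, false) == 0);
}
```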
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 1 +
include/uapi/linux/fuse.h | 8 +++++++
fs/fuse/dir.c | 1 +
fs/fuse/inode.c | 1 +
fs/fuse/ioctl.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 64 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index fb60686fb9c61a..ae03a898d3aa7d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1629,6 +1629,7 @@ long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
int fuse_fileattr_get(struct dentry *dentry, struct file_kattr *fa);
int fuse_fileattr_set(struct mnt_idmap *idmap,
struct dentry *dentry, struct file_kattr *fa);
+void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr);
/* iomode.c */
int fuse_file_cached_io_open(struct inode *inode, struct fuse_file *ff);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 472605d7ff6a2f..94ec220beb5f79 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -242,6 +242,8 @@
* - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
* - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
* - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
+ * - add FUSE_ATTR_{SYNC,IMMUTABLE,APPEND} for VFS enforcement of file
+ * attributes
*/
#ifndef _LINUX_FUSE_H
@@ -597,11 +599,17 @@ struct fuse_file_lock {
* FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
* FUSE_ATTR_IOMAP: Use iomap for this inode
* FUSE_ATTR_ATOMIC: Enable untorn writes
+ * FUSE_ATTR_SYNC: File writes are synchronous
+ * FUSE_ATTR_IMMUTABLE: File is immutable
+ * FUSE_ATTR_APPEND: File is append-only
*/
#define FUSE_ATTR_SUBMOUNT (1 << 0)
#define FUSE_ATTR_DAX (1 << 1)
#define FUSE_ATTR_IOMAP (1 << 2)
#define FUSE_ATTR_ATOMIC (1 << 3)
+#define FUSE_ATTR_SYNC (1 << 4)
+#define FUSE_ATTR_IMMUTABLE (1 << 5)
+#define FUSE_ATTR_APPEND (1 << 6)
/**
* Open flags
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 380559950c3444..30c914ba4bb23f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1254,6 +1254,7 @@ static void fuse_fillattr(struct mnt_idmap *idmap, struct inode *inode,
blkbits = fc->blkbits;
stat->blksize = 1 << blkbits;
+ generic_fill_statx_attr(inode, stat);
}
static void fuse_statx_to_attr(struct fuse_statx *sx, struct fuse_attr *attr)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b63e4e1d8f45ce..f845864bf50dee 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -521,6 +521,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
fuse_init_inode(inode, attr, fc);
+ fuse_fileattr_init(inode, attr);
unlock_new_inode(inode);
} else if (fuse_stale_inode(inode, generation, attr)) {
/* nodeid was reused, any I/O on the old inode should fail */
diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index f5f7d806262cdf..fc0c9bac7a5939 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -502,6 +502,56 @@ static void fuse_priv_ioctl_cleanup(struct inode *inode, struct fuse_file *ff)
fuse_file_release(inode, ff, O_RDONLY, NULL, S_ISDIR(inode->i_mode));
}
+static inline void update_iflag(struct inode *inode, unsigned int iflag,
+ bool set)
+{
+ if (set)
+ inode->i_flags |= iflag;
+ else
+ inode->i_flags &= ~iflag;
+}
+
+static void fuse_fileattr_update_inode(struct inode *inode,
+ const struct file_kattr *fa)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ unsigned int old_iflags = inode->i_flags;
+
+ if (!fc->local_fs)
+ return;
+
+ if (fa->flags_valid) {
+ update_iflag(inode, S_SYNC, fa->flags & FS_SYNC_FL);
+ update_iflag(inode, S_IMMUTABLE, fa->flags & FS_IMMUTABLE_FL);
+ update_iflag(inode, S_APPEND, fa->flags & FS_APPEND_FL);
+ } else if (fa->fsx_valid) {
+ update_iflag(inode, S_SYNC, fa->fsx_xflags & FS_XFLAG_SYNC);
+ update_iflag(inode, S_IMMUTABLE,
+ fa->fsx_xflags & FS_XFLAG_IMMUTABLE);
+ update_iflag(inode, S_APPEND, fa->fsx_xflags & FS_XFLAG_APPEND);
+ }
+
+ if (old_iflags != inode->i_flags)
+ fuse_invalidate_attr(inode);
+}
+
+void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ if (!fc->local_fs)
+ return;
+
+ if (attr->flags & FUSE_ATTR_SYNC)
+ inode->i_flags |= S_SYNC;
+
+ if (attr->flags & FUSE_ATTR_IMMUTABLE)
+ inode->i_flags |= S_IMMUTABLE;
+
+ if (attr->flags & FUSE_ATTR_APPEND)
+ inode->i_flags |= S_APPEND;
+}
+
int fuse_fileattr_get(struct dentry *dentry, struct file_kattr *fa)
{
struct inode *inode = d_inode(dentry);
@@ -574,7 +624,10 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
err = fuse_priv_ioctl(inode, ff, FS_IOC_FSSETXATTR,
&xfa, sizeof(xfa));
+ if (err)
+ goto cleanup;
}
+ fuse_fileattr_update_inode(inode, fa);
cleanup:
fuse_priv_ioctl_cleanup(inode, ff);
* [PATCH 4/9] fuse_trace: allow local filesystems to set some VFS iflags
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (2 preceding siblings ...)
2025-09-16 0:36 ` [PATCH 3/9] fuse: allow local filesystems to set some VFS iflags Darrick J. Wong
@ 2025-09-16 0:37 ` Darrick J. Wong
2025-09-16 0:37 ` [PATCH 5/9] fuse: cache atime when in iomap mode Darrick J. Wong
` (4 subsequent siblings)
8 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:37 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 29 +++++++++++++++++++++++++++++
fs/fuse/ioctl.c | 6 ++++++
2 files changed, 35 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 9c2eb497730b06..2aff78a30503ee 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -176,6 +176,35 @@ TRACE_EVENT(fuse_request_end,
__entry->unique, __entry->len, __entry->error)
);
+DECLARE_EVENT_CLASS(fuse_fileattr_class,
+ TP_PROTO(const struct inode *inode, unsigned int old_iflags),
+
+ TP_ARGS(inode, old_iflags),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(unsigned int, old_iflags)
+ __field(unsigned int, new_iflags)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->old_iflags = old_iflags;
+ __entry->new_iflags = inode->i_flags;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " old_iflags 0x%x iflags 0x%x",
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->old_iflags,
+ __entry->new_iflags)
+);
+#define DEFINE_FUSE_FILEATTR_EVENT(name) \
+DEFINE_EVENT(fuse_fileattr_class, name, \
+ TP_PROTO(const struct inode *inode, unsigned int old_iflags), \
+ TP_ARGS(inode, old_iflags))
+DEFINE_FUSE_FILEATTR_EVENT(fuse_fileattr_update_inode);
+DEFINE_FUSE_FILEATTR_EVENT(fuse_fileattr_init);
+
#ifdef CONFIG_FUSE_BACKING
#define FUSE_BACKING_FLAG_STRINGS \
{ FUSE_BACKING_TYPE_PASSTHROUGH, "pass" }, \
diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index fc0c9bac7a5939..2ac1911dc5cc83 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -4,6 +4,7 @@
*/
#include "fuse_i.h"
+#include "fuse_trace.h"
#include <linux/uio.h>
#include <linux/compat.h>
@@ -531,6 +532,8 @@ static void fuse_fileattr_update_inode(struct inode *inode,
update_iflag(inode, S_APPEND, fa->fsx_xflags & FS_XFLAG_APPEND);
}
+ trace_fuse_fileattr_update_inode(inode, old_iflags);
+
if (old_iflags != inode->i_flags)
fuse_invalidate_attr(inode);
}
@@ -538,6 +541,7 @@ static void fuse_fileattr_update_inode(struct inode *inode,
void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr)
{
struct fuse_conn *fc = get_fuse_conn(inode);
+ unsigned int old_iflags = inode->i_flags;
if (!fc->local_fs)
return;
@@ -550,6 +554,8 @@ void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr)
if (attr->flags & FUSE_ATTR_APPEND)
inode->i_flags |= S_APPEND;
+
+ trace_fuse_fileattr_init(inode, old_iflags);
}
int fuse_fileattr_get(struct dentry *dentry, struct file_kattr *fa)
* [PATCH 5/9] fuse: cache atime when in iomap mode
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (3 preceding siblings ...)
2025-09-16 0:37 ` [PATCH 4/9] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:37 ` Darrick J. Wong
2025-09-16 0:37 ` [PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
` (3 subsequent siblings)
8 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:37 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
When we're running in iomap mode, allow the kernel to cache the access
timestamp to further reduce the number of roundtrips to the fuse server.
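
As a rough illustration of what caching atime means on the attribute
refresh path, here is a userspace model of the cache_mask checks in
fuse_change_attributes_common(). The struct and the model_* names are
hypothetical, and the bit values are copied from the statx UAPI header.

```c
#include <assert.h>
#include <stdint.h>

/* Numeric values as in the statx UAPI header; MODEL_ marks the sketch. */
#define MODEL_STATX_ATIME 0x20U
#define MODEL_STATX_MTIME 0x40U
#define MODEL_STATX_CTIME 0x80U

/* Hypothetical flattened timestamps, seconds only for brevity. */
struct model_times {
	int64_t atime;
	int64_t mtime;
	int64_t ctime;
};

/* Copy server timestamps into the cache, skipping the ones we own locally. */
static void model_apply_server_times(struct model_times *cached,
				     const struct model_times *server,
				     uint32_t cache_mask)
{
	if (!(cache_mask & MODEL_STATX_ATIME))
		cached->atime = server->atime;
	if (!(cache_mask & MODEL_STATX_MTIME))
		cached->mtime = server->mtime;
	if (!(cache_mask & MODEL_STATX_CTIME))
		cached->ctime = server->ctime;
}

static void model_apply_times_demo(void)
{
	const struct model_times server = { 1, 2, 3 };
	struct model_times iomap = { 10, 20, 30 };
	struct model_times wb = { 10, 20, 30 };

	/* iomap mode: all three timestamps are kept from the local cache. */
	model_apply_server_times(&iomap, &server, MODEL_STATX_ATIME |
				 MODEL_STATX_MTIME | MODEL_STATX_CTIME);
	assert(iomap.atime == 10 && iomap.mtime == 20 && iomap.ctime == 30);

	/* writeback_cache only: atime still comes from the server. */
	model_apply_server_times(&wb, &server,
				 MODEL_STATX_MTIME | MODEL_STATX_CTIME);
	assert(wb.atime == 1 && wb.mtime == 20 && wb.ctime == 30);
}
```

Because refreshes no longer overwrite atime in iomap mode, the initial
value has to be loaded once when the inode is instantiated, which is what
the fuse_init_inode() hunk does.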
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dir.c | 5 +++++
fs/fuse/inode.c | 19 ++++++++++++++++---
2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 30c914ba4bb23f..8247e5196fd0b2 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2033,6 +2033,11 @@ int fuse_flush_times(struct inode *inode, struct fuse_file *ff)
inarg.ctime = inode_get_ctime_sec(inode);
inarg.ctimensec = inode_get_ctime_nsec(inode);
}
+ if (fuse_inode_has_iomap(inode)) {
+ inarg.valid |= FATTR_ATIME;
+ inarg.atime = inode_get_atime_sec(inode);
+ inarg.atimensec = inode_get_atime_nsec(inode);
+ }
if (ff) {
inarg.valid |= FATTR_FH;
inarg.fh = ff->fh;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f845864bf50dee..c29a8cbc55fa27 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -266,7 +266,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
attr->mtimensec = min_t(u32, attr->mtimensec, NSEC_PER_SEC - 1);
attr->ctimensec = min_t(u32, attr->ctimensec, NSEC_PER_SEC - 1);
- inode_set_atime(inode, attr->atime, attr->atimensec);
+ if (!(cache_mask & STATX_ATIME))
+ inode_set_atime(inode, attr->atime, attr->atimensec);
/* mtime from server may be stale due to local buffered write */
if (!(cache_mask & STATX_MTIME)) {
inode_set_mtime(inode, attr->mtime, attr->mtimensec);
@@ -328,8 +329,12 @@ u32 fuse_get_cache_mask(struct inode *inode)
{
struct fuse_conn *fc = get_fuse_conn(inode);
- if (S_ISREG(inode->i_mode) &&
- (fuse_inode_has_iomap(inode) || fc->writeback_cache))
+ if (!S_ISREG(inode->i_mode))
+ return 0;
+
+ if (fuse_inode_has_iomap(inode))
+ return STATX_MTIME | STATX_CTIME | STATX_ATIME | STATX_SIZE;
+ if (fc->writeback_cache)
return STATX_MTIME | STATX_CTIME | STATX_SIZE;
return 0;
@@ -448,6 +453,14 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr,
new_decode_dev(attr->rdev));
} else
BUG();
+
+ /*
+ * iomap caches atime too, so we must load it from the fuse server
+ * at instantiation time.
+ */
+ if (fuse_inode_has_iomap(inode))
+ inode_set_atime(inode, attr->atime, attr->atimensec);
+
/*
* Ensure that we don't cache acls for daemons without FUSE_POSIX_ACL
* so they see the exact same behavior as before.
* [PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (4 preceding siblings ...)
2025-09-16 0:37 ` [PATCH 5/9] fuse: cache atime when in iomap mode Darrick J. Wong
@ 2025-09-16 0:37 ` Darrick J. Wong
2025-09-16 0:37 ` [PATCH 7/9] fuse_trace: " Darrick J. Wong
` (2 subsequent siblings)
8 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:37 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Let the kernel handle killing the suid/sgid bits because the
write/falloc/truncate/chown code already does this, and we don't have to
worry about external modifications that are only visible to the fuse
server (i.e. we're not a cluster fs).
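
For anyone unfamiliar with what "killing" these bits means, a deliberately
simplified userspace sketch follows. It only models the mode-bit
arithmetic: setuid is always cleared, and setgid is cleared when group
execute is set. The real VFS helpers also look at the caller's group
membership and capabilities, which this model ignores; model_kill_priv is a
made-up name.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Return the mode with suid/sgid stripped per the simplified rule above. */
static mode_t model_kill_priv(mode_t mode)
{
	/* setuid never survives a write, truncate, chown, or fallocate. */
	mode &= ~(mode_t)S_ISUID;

	/* setgid is only meaningful for execution when group exec is set. */
	if (mode & S_IXGRP)
		mode &= ~(mode_t)S_ISGID;

	return mode;
}

static void model_kill_priv_demo(void)
{
	assert(model_kill_priv(04755) == 0755);	/* suid dropped */
	assert(model_kill_priv(06755) == 0755);	/* suid and sgid dropped */
	assert(model_kill_priv(02644) == 02644);	/* sgid kept, no group exec */
}
```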
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dir.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8247e5196fd0b2..2e1837b2363e83 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2268,6 +2268,7 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
struct inode *inode = d_inode(entry);
struct fuse_conn *fc = get_fuse_conn(inode);
struct file *file = (attr->ia_valid & ATTR_FILE) ? attr->ia_file : NULL;
+ const bool is_iomap = fuse_inode_has_iomap(inode);
int ret;
if (fuse_is_bad(inode))
@@ -2276,15 +2277,19 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
if (!fuse_allow_current_process(get_fuse_conn(inode)))
return -EACCES;
- if (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID)) {
+ if (!is_iomap &&
+ (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))) {
attr->ia_valid &= ~(ATTR_KILL_SUID | ATTR_KILL_SGID |
ATTR_MODE);
/*
* The only sane way to reliably kill suid/sgid is to do it in
- * the userspace filesystem
+ * the userspace filesystem if this isn't an iomap file. For
+ * iomap filesystems we let the kernel kill the setuid/setgid
+ * bits.
*
- * This should be done on write(), truncate() and chown().
+ * This should be done on write(), truncate(), chown(), and
+ * fallocate().
*/
if (!fc->handle_killpriv && !fc->handle_killpriv_v2) {
/*
* [PATCH 7/9] fuse_trace: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (5 preceding siblings ...)
2025-09-16 0:37 ` [PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
@ 2025-09-16 0:37 ` Darrick J. Wong
2025-09-16 0:38 ` [PATCH 8/9] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
2025-09-16 0:38 ` [PATCH 9/9] fuse: always cache ACLs when using iomap Darrick J. Wong
8 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:37 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dir.c | 5 ++++
2 files changed, 63 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 2aff78a30503ee..1f900580b14937 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -205,6 +205,64 @@ DEFINE_EVENT(fuse_fileattr_class, name, \
DEFINE_FUSE_FILEATTR_EVENT(fuse_fileattr_update_inode);
DEFINE_FUSE_FILEATTR_EVENT(fuse_fileattr_init);
+TRACE_EVENT(fuse_setattr_fill,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_setattr_in *inarg),
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(umode_t, mode)
+ __field(uint32_t, valid)
+ __field(umode_t, new_mode)
+ __field(uint64_t, new_size)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->mode = inode->i_mode;
+ __entry->valid = inarg->valid;
+ __entry->new_mode = inarg->mode;
+ __entry->new_size = inarg->size;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " mode 0%o valid 0x%x new_mode 0%o new_size 0x%llx",
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->mode,
+ __entry->valid,
+ __entry->new_mode,
+ __entry->new_size)
+);
+
+TRACE_EVENT(fuse_setattr,
+ TP_PROTO(const struct inode *inode,
+ const struct iattr *inarg),
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(umode_t, mode)
+ __field(uint32_t, valid)
+ __field(umode_t, new_mode)
+ __field(uint64_t, new_size)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->mode = inode->i_mode;
+ __entry->valid = inarg->ia_valid;
+ __entry->new_mode = inarg->ia_mode;
+ __entry->new_size = inarg->ia_size;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " mode 0%o valid 0x%x new_mode 0%o new_size 0x%llx",
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->mode,
+ __entry->valid,
+ __entry->new_mode,
+ __entry->new_size)
+);
+
#ifdef CONFIG_FUSE_BACKING
#define FUSE_BACKING_FLAG_STRINGS \
{ FUSE_BACKING_TYPE_PASSTHROUGH, "pass" }, \
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 2e1837b2363e83..58106f49395697 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -7,6 +7,7 @@
*/
#include "fuse_i.h"
+#include "fuse_trace.h"
#include <linux/pagemap.h>
#include <linux/file.h>
@@ -2002,6 +2003,8 @@ static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_args *args,
struct fuse_setattr_in *inarg_p,
struct fuse_attr_out *outarg_p)
{
+ trace_fuse_setattr_fill(inode, inarg_p);
+
args->opcode = FUSE_SETATTR;
args->nodeid = get_node_id(inode);
args->in_numargs = 1;
@@ -2277,6 +2280,8 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
if (!fuse_allow_current_process(get_fuse_conn(inode)))
return -EACCES;
+ trace_fuse_setattr(inode, attr);
+
if (!is_iomap &&
(attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))) {
attr->ia_valid &= ~(ATTR_KILL_SUID | ATTR_KILL_SGID |
* [PATCH 8/9] fuse: update ctime when updating acls on an iomap inode
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (6 preceding siblings ...)
2025-09-16 0:37 ` [PATCH 7/9] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:38 ` Darrick J. Wong
2025-09-16 0:38 ` [PATCH 9/9] fuse: always cache ACLs when using iomap Darrick J. Wong
8 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:38 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
In iomap mode, the fuse kernel driver is in charge of updating file
attributes, so we need to update ctime after an ACL change.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
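The attribute mask assembled after a successful ACL update can be modelled in userspace roughly as follows. This is only a sketch: the SK_ATTR_* values, struct sk_iattr, and sk_acl_build_setattr() are hypothetical stand-ins for the kernel's ATTR_* bits and struct iattr, and the "only send if something is set" return value is an assumption, since the hunk below is cut off before the request is issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's ATTR_* bits; values are illustrative. */
#define SK_ATTR_MODE	(1u << 0)
#define SK_ATTR_CTIME	(1u << 6)

/* Simplified stand-in for struct iattr. */
struct sk_iattr {
	uint32_t	ia_valid;
	uint32_t	ia_mode;
	int64_t		ia_ctime;	/* seconds only, for brevity */
};

/*
 * Build the setattr request that follows a successful ACL update.
 * In iomap mode the kernel owns the timestamps, so ctime is always
 * included; the mode is included only if the ACL changed it.
 * Returns true if at least one attribute needs to be pushed.
 */
static bool sk_acl_build_setattr(bool is_iomap, uint32_t old_mode,
				 uint32_t new_mode, int64_t now,
				 struct sk_iattr *attr)
{
	attr->ia_valid = 0;
	attr->ia_mode = 0;
	attr->ia_ctime = 0;

	if (is_iomap) {
		attr->ia_valid |= SK_ATTR_CTIME;
		attr->ia_ctime = now;
	}

	if (new_mode != old_mode) {
		attr->ia_valid |= SK_ATTR_MODE;
		attr->ia_mode = new_mode;
	}

	return attr->ia_valid != 0;
}
```

Note that in iomap mode the mask is never empty, so an ACL change that leaves the mode alone still results in a ctime update.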
fs/fuse/acl.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 4faee72f1365a5..9b24c53b510405 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -109,6 +109,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
struct fuse_conn *fc = get_fuse_conn(inode);
const char *name;
umode_t mode = inode->i_mode;
+ const bool is_iomap = fuse_inode_has_iomap(inode);
int ret;
if (fuse_is_bad(inode))
@@ -179,10 +180,24 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
ret = 0;
}
- /* If we scheduled a mode update above, push that to userspace now. */
if (!ret) {
struct iattr attr = { };
+ /*
+ * When we're running in iomap mode, we need to update mode and
+ * ctime ourselves instead of letting the fuse server figure
+ * that out.
+ */
+ if (is_iomap) {
+ attr.ia_valid |= ATTR_CTIME;
+ inode_set_ctime_current(inode);
+ attr.ia_ctime = inode_get_ctime(inode);
+ }
+
+ /*
+ * If we scheduled a mode update above, push that to userspace
+ * now.
+ */
if (mode != inode->i_mode) {
attr.ia_valid |= ATTR_MODE;
attr.ia_mode = mode;
* [PATCH 9/9] fuse: always cache ACLs when using iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (7 preceding siblings ...)
2025-09-16 0:38 ` [PATCH 8/9] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
@ 2025-09-16 0:38 ` Darrick J. Wong
8 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:38 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Keep ACLs cached in memory when we're using iomap, so that we don't have
to make a round trip to the fuse server. This might want to become a
FUSE_ATTR_ flag.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
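The post-update caching decision in fuse_set_acl() boils down to a three-way choice, which can be written out as a small pure function. This is a sketch for illustration only: enum sk_acl_action and sk_acl_after_set() are hypothetical names, and the real code acts on the inode directly instead of returning a decision.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three outcomes visible in the hunk below. */
enum sk_acl_action {
	SK_ACL_LEAVE,			/* server lacks FUSE_POSIX_ACL: touch nothing */
	SK_ACL_CACHE_NEW,		/* iomap + success: keep the new ACL cached */
	SK_ACL_FORGET_AND_INVALIDATE,	/* otherwise: drop ACLs, invalidate attrs */
};

/*
 * Decide what to do with the in-memory ACL state after a set_acl call.
 * @posix_acl: the connection negotiated FUSE_POSIX_ACL
 * @is_iomap:  the inode is in iomap mode
 * @ret:       result of pushing the ACL to the server (0 on success)
 */
static enum sk_acl_action sk_acl_after_set(bool posix_acl, bool is_iomap,
					   int ret)
{
	if (!posix_acl)
		return SK_ACL_LEAVE;
	if (ret == 0 && is_iomap)
		return SK_ACL_CACHE_NEW;
	return SK_ACL_FORGET_AND_INVALIDATE;
}
```

A failed update in iomap mode still falls through to the forget-and-invalidate path, so a stale ACL is never left cached after an error.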
fs/fuse/acl.c | 12 +++++++++---
fs/fuse/dir.c | 11 ++++++++---
fs/fuse/readdir.c | 3 ++-
3 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 9b24c53b510405..a9f152ddc1faec 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -210,10 +210,16 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
if (fc->posix_acl) {
/*
* Fuse daemons without FUSE_POSIX_ACL never cached POSIX ACLs
- * and didn't invalidate attributes. Retain that behavior.
+ * and didn't invalidate attributes. Retain that behavior
+ * except for iomap, where we assume that the only source of
+ * ACL changes is userspace.
*/
- forget_all_cached_acls(inode);
- fuse_invalidate_attr(inode);
+ if (!ret && is_iomap) {
+ set_cached_acl(inode, type, acl);
+ } else {
+ forget_all_cached_acls(inode);
+ fuse_invalidate_attr(inode);
+ }
}
return ret;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 58106f49395697..9adaf262bda975 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -261,7 +261,8 @@ static int fuse_dentry_revalidate(struct inode *dir, const struct qstr *name,
fuse_stale_inode(inode, outarg.generation, &outarg.attr))
goto invalid;
- forget_all_cached_acls(inode);
+ if (!fuse_inode_has_iomap(inode))
+ forget_all_cached_acls(inode);
fuse_change_attributes(inode, &outarg.attr, NULL,
ATTR_TIMEOUT(&outarg),
attr_version);
@@ -1470,7 +1471,8 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode,
sync = time_before64(fi->i_time, get_jiffies_64());
if (sync) {
- forget_all_cached_acls(inode);
+ if (!fuse_inode_has_iomap(inode))
+ forget_all_cached_acls(inode);
/* Try statx if a field not covered by regular stat is wanted */
if (!fc->no_statx && (request_mask & ~STATX_BASIC_STATS)) {
err = fuse_do_statx(idmap, inode, file, stat);
@@ -1648,6 +1650,9 @@ static int fuse_access(struct inode *inode, int mask)
static int fuse_perm_getattr(struct inode *inode, int mask)
{
+ if (fuse_inode_has_iomap(inode))
+ return 0;
+
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
@@ -2325,7 +2330,7 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
* If filesystem supports acls it may have updated acl xattrs in
* the filesystem, so forget cached acls for the inode.
*/
- if (fc->posix_acl)
+ if (fc->posix_acl && !is_iomap)
forget_all_cached_acls(inode);
/* Directory mode changed, may need to revalidate access */
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index 45dd932eb03a5e..f7c2a45f23678e 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -224,7 +224,8 @@ static int fuse_direntplus_link(struct file *file,
fi->nlookup++;
spin_unlock(&fi->lock);
- forget_all_cached_acls(inode);
+ if (!fuse_inode_has_iomap(inode))
+ forget_all_cached_acls(inode);
fuse_change_attributes(inode, &o->attr, NULL,
ATTR_TIMEOUT(o),
attr_version);
* [PATCH 01/10] fuse: cache iomaps
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-09-16 0:38 ` Darrick J. Wong
2025-09-16 0:38 ` [PATCH 02/10] fuse_trace: " Darrick J. Wong
` (8 subsequent siblings)
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:38 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Cache a file's iomap mappings in the kernel so that we don't have to upcall
the fuse server every time we need a mapping.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
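For readers skimming the series, the lookup side of the cache can be pictured with the userspace sketch below. It is a simplification under several assumptions: a sorted array stands in for the iext btree, the sk_* names are hypothetical, the requested length is ignored, and NOFORK is taken to mean "no fork has been set up for this direction". Only the interval comparison mirrors fuse_iext_rec_cmp() from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down mapping record. */
struct sk_map {
	uint64_t offset;	/* file offset of the mapping */
	uint64_t length;	/* length in bytes, never zero */
	uint64_t addr;		/* device address */
};

/* A sorted array stands in for the btree-backed fork. */
struct sk_fork {
	const struct sk_map	*recs;
	size_t			nr;
};

struct sk_cache {
	struct sk_fork		read;	/* always present */
	const struct sk_fork	*write;	/* allocated on demand, may be NULL */
};

enum sk_dir { SK_READ, SK_WRITE };
enum sk_result { SK_HIT, SK_MISS, SK_NOFORK };

/* Same convention as fuse_iext_rec_cmp(): 0 means off lies inside rec. */
static int sk_rec_cmp(const struct sk_map *rec, uint64_t off)
{
	if (rec->offset > off)
		return 1;
	if (rec->offset + rec->length <= off)
		return -1;
	return 0;
}

/* Binary search for the mapping that covers @off. */
static enum sk_result sk_lookup(const struct sk_cache *c, enum sk_dir dir,
				uint64_t off, struct sk_map *out)
{
	const struct sk_fork *f = (dir == SK_WRITE) ? c->write : &c->read;
	size_t lo = 0, hi;

	if (!f)
		return SK_NOFORK;

	hi = f->nr;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = sk_rec_cmp(&f->recs[mid], off);

		if (cmp == 0) {
			*out = f->recs[mid];
			return SK_HIT;
		}
		if (cmp > 0)
			hi = mid;	/* record starts after off */
		else
			lo = mid + 1;	/* record ends at or before off */
	}
	return SK_MISS;
}
```

A caller would presumably fall back to asking the fuse server only when the result is not a hit.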
fs/fuse/fuse_i.h | 39 +
fs/fuse/iomap_priv.h | 135 ++++
include/uapi/linux/fuse.h | 5
fs/fuse/Makefile | 2
fs/fuse/file_iomap.c | 23 +
fs/fuse/iomap_cache.c | 1629 +++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 1827 insertions(+), 6 deletions(-)
create mode 100644 fs/fuse/iomap_cache.c
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ae03a898d3aa7d..33b65253b2e9be 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -120,6 +120,24 @@ struct fuse_backing {
struct rcu_head rcu;
};
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+/*
+ * File incore extent information, present for each of the read and write
+ * mapping forks.
+ */
+struct fuse_ifork {
+ int64_t if_bytes; /* bytes in if_data */
+ void *if_data; /* extent tree root */
+ int if_height; /* height of the extent tree */
+};
+
+struct fuse_iomap_cache {
+ struct fuse_ifork im_read;
+ struct fuse_ifork *im_write;
+ uint64_t im_seq; /* validity counter */
+ struct rw_semaphore im_lock; /* mapping lock */
+};
+#endif
+
/** FUSE inode */
struct fuse_inode {
/** Inode data */
@@ -185,6 +203,9 @@ struct fuse_inode {
spinlock_t ioend_lock;
struct work_struct ioend_work;
struct list_head ioend_list;
+
+ /* cached iomap mappings */
+ struct fuse_iomap_cache cache;
#endif
};
@@ -261,6 +282,11 @@ enum {
FUSE_I_IOMAP,
/* Enable untorn writes */
FUSE_I_ATOMIC,
+ /*
+ * Cache iomaps in the kernel. This is required for any filesystem
+ * that needs to synchronize pagecache writes and writeback.
+ */
+ FUSE_I_IOMAP_CACHE,
};
struct fuse_conn;
@@ -1816,6 +1842,18 @@ int fuse_iomap_dev_inval(struct fuse_conn *fc,
const struct fuse_iomap_dev_inval_out *arg);
int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
+
+static inline bool fuse_inode_caches_iomaps(const struct inode *inode)
+{
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+
+ return test_bit(FUSE_I_IOMAP_CACHE, &fi->state);
+}
+
+enum fuse_iomap_iodir {
+ READ_MAPPING,
+ WRITE_MAPPING,
+};
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1842,6 +1880,7 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
# define fuse_iomap_dev_inval(...) (-ENOSYS)
# define fuse_iomap_fadvise NULL
+# define fuse_inode_caches_iomaps(...) (false)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
index 7002eb38f87fe1..8e4a32879025a4 100644
--- a/fs/fuse/iomap_priv.h
+++ b/fs/fuse/iomap_priv.h
@@ -1,5 +1,9 @@
// SPDX-License-Identifier: GPL-2.0
/*
+ * The fuse_iext code comes from xfs_iext_tree.[ch] and is:
+ * Copyright (c) 2017 Christoph Hellwig.
+ *
+ * Everything else is:
* Copyright (C) 2025 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
@@ -40,13 +44,134 @@ while (static_branch_unlikely(&fuse_iomap_debug)) { \
})
#endif /* CONFIG_FUSE_IOMAP_DEBUG */
-enum fuse_iomap_iodir {
- READ_MAPPING,
- WRITE_MAPPING,
-};
-
#define EFSCORRUPTED EUCLEAN
+void fuse_iomap_cache_lock(struct inode *inode);
+void fuse_iomap_cache_unlock(struct inode *inode);
+void fuse_iomap_cache_lock_shared(struct inode *inode);
+void fuse_iomap_cache_unlock_shared(struct inode *inode);
+
+struct fuse_iext_leaf;
+
+struct fuse_iext_cursor {
+ struct fuse_iext_leaf *leaf;
+ int pos;
+};
+
+#define FUSE_IEXT_LEFT_CONTIG (1u << 0)
+#define FUSE_IEXT_RIGHT_CONTIG (1u << 1)
+#define FUSE_IEXT_LEFT_FILLING (1u << 2)
+#define FUSE_IEXT_RIGHT_FILLING (1u << 3)
+#define FUSE_IEXT_LEFT_VALID (1u << 4)
+#define FUSE_IEXT_RIGHT_VALID (1u << 5)
+#define FUSE_IEXT_WRITE_MAPPING (1u << 6)
+
+struct fuse_ifork *fuse_iext_state_to_fork(struct fuse_iomap_cache *ip,
+ unsigned int state);
+
+uint64_t fuse_iext_count(const struct fuse_ifork *ifp);
+void fuse_iext_insert_raw(struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur,
+ const struct fuse_iomap_io *irec);
+void fuse_iext_insert(struct fuse_iomap_cache *,
+ struct fuse_iext_cursor *cur,
+ const struct fuse_iomap_io *, int);
+void fuse_iext_remove(struct fuse_iomap_cache *,
+ struct fuse_iext_cursor *,
+ int);
+void fuse_iext_destroy(struct fuse_ifork *);
+
+bool fuse_iext_lookup_extent(struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp, loff_t bno,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp);
+bool fuse_iext_lookup_extent_before(struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp, loff_t *end,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp);
+bool fuse_iext_get_extent(const struct fuse_ifork *ifp,
+ const struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp);
+void fuse_iext_update_extent(struct fuse_iomap_cache *ip, int state,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp);
+
+void fuse_iext_first(struct fuse_ifork *, struct fuse_iext_cursor *);
+void fuse_iext_last(struct fuse_ifork *, struct fuse_iext_cursor *);
+void fuse_iext_next(struct fuse_ifork *, struct fuse_iext_cursor *);
+void fuse_iext_prev(struct fuse_ifork *, struct fuse_iext_cursor *);
+
+static inline bool fuse_iext_next_extent(struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+ fuse_iext_next(ifp, cur);
+ return fuse_iext_get_extent(ifp, cur, gotp);
+}
+
+static inline bool fuse_iext_prev_extent(struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+ fuse_iext_prev(ifp, cur);
+ return fuse_iext_get_extent(ifp, cur, gotp);
+}
+
+/*
+ * Return the extent after cur in gotp without updating the cursor.
+ */
+static inline bool fuse_iext_peek_next_extent(struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+ struct fuse_iext_cursor ncur = *cur;
+
+ fuse_iext_next(ifp, &ncur);
+ return fuse_iext_get_extent(ifp, &ncur, gotp);
+}
+
+/*
+ * Return the extent before cur in gotp without updating the cursor.
+ */
+static inline bool fuse_iext_peek_prev_extent(struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+ struct fuse_iext_cursor ncur = *cur;
+
+ fuse_iext_prev(ifp, &ncur);
+ return fuse_iext_get_extent(ifp, &ncur, gotp);
+}
+
+#define for_each_fuse_iext(ifp, ext, got) \
+ for (fuse_iext_first((ifp), (ext)); \
+ fuse_iext_get_extent((ifp), (ext), (got)); \
+ fuse_iext_next((ifp), (ext)))
+
+static inline uint64_t fuse_iext_read_seq(struct fuse_iomap_cache *ip)
+{
+ return (uint64_t)READ_ONCE(ip->im_seq);
+}
+
+int fuse_iomap_cache_remove(struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t off, uint64_t len);
+
+int fuse_iomap_cache_upsert(struct inode *inode, enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *map);
+
+enum fuse_iomap_lookup_result {
+ LOOKUP_HIT,
+ LOOKUP_MISS,
+ LOOKUP_NOFORK,
+};
+
+struct fuse_iomap_lookup {
+ struct fuse_iomap_io map; /* cached mapping */
+ uint64_t validity_cookie; /* used with .iomap_valid() */
+};
+
+enum fuse_iomap_lookup_result
+fuse_iomap_cache_lookup(struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t off, uint64_t len,
+ struct fuse_iomap_lookup *mval);
+
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _FS_FUSE_IOMAP_PRIV_H */
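The validity_cookie field and fuse_iext_read_seq() declared above suggest the usual sequence-counter pattern: sample the counter together with the mapping, then compare it again before trusting the mapping. A single-threaded userspace sketch, with hypothetical sk_* names and plain loads and stores where the kernel code uses READ_ONCE/WRITE_ONCE under im_lock:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-entry cache; the real cache holds a tree of mappings. */
struct sk_cache {
	uint64_t seq;	/* bumped on every change, like im_seq */
	uint64_t addr;	/* the single cached mapping */
};

/* Result of a lookup: the mapping plus the counter value it was read at. */
struct sk_lookup {
	uint64_t addr;
	uint64_t validity_cookie;
};

/* Any change to the cache bumps the counter first, then edits the data. */
static void sk_cache_update(struct sk_cache *c, uint64_t new_addr)
{
	c->seq++;
	c->addr = new_addr;
}

/* Sample the mapping and remember which generation it belongs to. */
static void sk_cache_sample(const struct sk_cache *c, struct sk_lookup *l)
{
	l->addr = c->addr;
	l->validity_cookie = c->seq;
}

/* The kind of check an .iomap_valid() style hook could perform. */
static bool sk_mapping_valid(const struct sk_cache *c,
			     const struct sk_lookup *l)
{
	return l->validity_cookie == c->seq;
}
```

The cookie says nothing about which mapping changed; any modification invalidates every outstanding sample, which keeps the check cheap at the cost of some spurious re-lookups.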
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 94ec220beb5f79..d4a257517915fd 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1358,6 +1358,8 @@ struct fuse_uring_cmd_req {
/* fuse-specific mapping type indicating that writes use the read mapping */
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255)
+/* fuse-specific mapping type saying the server has populated the cache */
+#define FUSE_IOMAP_TYPE_RETRY_CACHE (254)
#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */
@@ -1500,4 +1502,7 @@ struct fuse_iomap_dev_inval_out {
uint64_t length;
};
+/* invalidate all cached iomap mappings up to EOF */
+#define FUSE_IOMAP_INVAL_TO_EOF (~0ULL)
+
#endif /* _LINUX_FUSE_H */
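One plausible way to consume a sentinel length such as FUSE_IOMAP_INVAL_TO_EOF is to resolve it into an exclusive end offset before walking the cache. The helper below is an assumption about usage, not code from this series: sk_inval_end() and its max_end parameter are hypothetical, and only the sentinel value (~0ULL) comes from the header above.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as FUSE_IOMAP_INVAL_TO_EOF in the uapi header. */
#define SK_INVAL_TO_EOF	(~0ULL)

/*
 * Turn an (offset, length) invalidation request into an exclusive end
 * offset, clamped to @max_end.  A length of SK_INVAL_TO_EOF means
 * "everything from offset onwards".  If @offset is already at or past
 * @max_end the returned end equals @offset, i.e. an empty range.
 */
static uint64_t sk_inval_end(uint64_t offset, uint64_t length,
			     uint64_t max_end)
{
	if (offset >= max_end)
		return offset;
	if (length == SK_INVAL_TO_EOF)
		return max_end;
	/* Compare against the remaining space to avoid offset + length overflow. */
	if (length > max_end - offset)
		return max_end;
	return offset + length;
}
```

Handling the sentinel before doing any arithmetic avoids the wraparound that a naive offset + length would hit.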
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 27be39317701d6..e3ed1da6cfb6e7 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -18,6 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
fuse-$(CONFIG_FUSE_BACKING) += backing.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
-fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
+fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o iomap_cache.o
virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 1fc9a9b7b75094..d35e69d03b0940 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1123,6 +1123,21 @@ static inline void fuse_inode_clear_atomic(struct inode *inode)
clear_bit(FUSE_I_ATOMIC, &fi->state);
}
+static inline void fuse_iomap_clear_cache(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ clear_bit(FUSE_I_IOMAP_CACHE, &fi->state);
+
+ fuse_iext_destroy(&fi->cache.im_read);
+ if (fi->cache.im_write) {
+ fuse_iext_destroy(fi->cache.im_write);
+ kfree(fi->cache.im_write);
+ }
+}
+
void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
{
struct fuse_conn *conn = get_fuse_conn(inode);
@@ -1139,6 +1154,8 @@ void fuse_iomap_evict_inode(struct inode *inode)
{
trace_fuse_iomap_evict_inode(inode);
+ if (fuse_inode_caches_iomaps(inode))
+ fuse_iomap_clear_cache(inode);
if (fuse_inode_has_atomic(inode))
fuse_inode_clear_atomic(inode);
if (fuse_inode_has_iomap(inode))
@@ -1785,6 +1802,12 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
min_order = inode->i_blkbits - PAGE_SHIFT;
mapping_set_folio_min_order(inode->i_mapping, min_order);
+
+ memset(&fi->cache.im_read, 0, sizeof(fi->cache.im_read));
+ fi->cache.im_seq = 0;
+ fi->cache.im_write = NULL;
+
+ init_rwsem(&fi->cache.im_lock);
set_bit(FUSE_I_IOMAP, &fi->state);
}
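The new file that follows sizes its tree blocks from NODE_SIZE. The arithmetic can be checked in userspace with the sketch below. The record layout in struct sk_rec is a hypothetical 32-byte stand-in; the real struct fuse_iomap_io is defined elsewhere in the series and may differ, which would change the records-per-leaf figure.

```c
#include <assert.h>
#include <stdint.h>

struct sk_leaf;

/* Hypothetical 32-byte mapping record standing in for struct fuse_iomap_io. */
struct sk_rec {
	uint64_t offset;
	uint64_t length;
	uint64_t addr;
	uint16_t type;
	uint16_t flags;
	uint32_t dev;
};

/* Same formulas as the enum in iomap_cache.c, with the stand-in record. */
enum {
	SK_NODE_SIZE	 = 256,
	SK_KEYS_PER_NODE = SK_NODE_SIZE / (sizeof(uint64_t) + sizeof(void *)),
	SK_RECS_PER_LEAF = (SK_NODE_SIZE - (2 * sizeof(struct sk_leaf *))) /
			   sizeof(struct sk_rec),
};

/* Inner block: keys first, then an equal number of child pointers. */
struct sk_node {
	uint64_t	keys[SK_KEYS_PER_NODE];
	void		*ptrs[SK_KEYS_PER_NODE];
};

/* Leaf block: records first, then the sibling links. */
struct sk_leaf {
	struct sk_rec	recs[SK_RECS_PER_LEAF];
	struct sk_leaf	*prev;
	struct sk_leaf	*next;
};

/* Both block types must fit in one NODE_SIZE allocation. */
_Static_assert(sizeof(struct sk_node) <= SK_NODE_SIZE, "inner node too big");
_Static_assert(sizeof(struct sk_leaf) <= SK_NODE_SIZE, "leaf too big");
```

With 8-byte pointers and this record layout, an inner block holds 16 key/pointer pairs and a leaf holds 7 records, leaving 16 bytes of slack in each leaf.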
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
new file mode 100644
index 00000000000000..1fec9dcc6d3922
--- /dev/null
+++ b/fs/fuse/iomap_cache.c
@@ -0,0 +1,1629 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fuse_iext* code adapted from xfs_iext_tree.c:
+ * Copyright (c) 2017 Christoph Hellwig.
+ *
+ * fuse_iomap_cache*lock* code adapted from xfs_inode.c:
+ * Copyright (c) 2000-2006 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * Copyright (C) 2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "fuse_i.h"
+#include "iomap_priv.h"
+#include "fuse_trace.h"
+#include <linux/iomap.h>
+
+/* maximum length of a mapping that we're willing to cache */
+#define FUSE_IOMAP_MAX_LEN ((loff_t)(1ULL << 63))
+
+void fuse_iomap_cache_lock_shared(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+
+ down_read(&ip->im_lock);
+}
+
+void fuse_iomap_cache_unlock_shared(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+
+ up_read(&ip->im_lock);
+}
+
+void fuse_iomap_cache_lock(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+
+ down_write(&ip->im_lock);
+}
+
+void fuse_iomap_cache_unlock(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+
+ up_write(&ip->im_lock);
+}
+
+static inline void assert_cache_locked_shared(struct fuse_iomap_cache *ip)
+{
+ rwsem_assert_held(&ip->im_lock);
+}
+
+static inline void assert_cache_locked(struct fuse_iomap_cache *ip)
+{
+ rwsem_assert_held_write_nolockdep(&ip->im_lock);
+}
+
+static inline struct fuse_inode *FUSE_I(struct fuse_iomap_cache *ip)
+{
+ return container_of(ip, struct fuse_inode, cache);
+}
+
+static inline struct inode *VFS_I(struct fuse_iomap_cache *ip)
+{
+ struct fuse_inode *fi = FUSE_I(ip);
+
+ return &fi->inode;
+}
+
+static inline uint32_t
+fuse_iomap_fork_to_state(const struct fuse_iomap_cache *ip,
+ const struct fuse_ifork *ifp)
+{
+ ASSERT(ifp == ip->im_write || ifp == &ip->im_read);
+
+ if (ifp == ip->im_write)
+ return FUSE_IEXT_WRITE_MAPPING;
+ return 0;
+}
+
+/* Convert bmap state flags to an inode fork. */
+struct fuse_ifork *
+fuse_iext_state_to_fork(
+ struct fuse_iomap_cache *ip,
+ unsigned int state)
+{
+ if (state & FUSE_IEXT_WRITE_MAPPING)
+ return ip->im_write;
+ return &ip->im_read;
+}
+
+/* The internal iext tree record is a struct fuse_iomap_io */
+
+static bool fuse_iext_rec_is_empty(const struct fuse_iomap_io *rec)
+{
+ return rec->length == 0;
+}
+
+static inline void fuse_iext_rec_clear(struct fuse_iomap_io *rec)
+{
+ memset(rec, 0, sizeof(*rec));
+}
+
+static void
+fuse_iext_set(
+ struct fuse_iomap_io *rec,
+ const struct fuse_iomap_io *irec)
+{
+ ASSERT(irec->length > 0);
+
+ *rec = *irec;
+}
+
+static void
+fuse_iext_get(
+ struct fuse_iomap_io *irec,
+ const struct fuse_iomap_io *rec)
+{
+ *irec = *rec;
+}
+
+enum {
+ NODE_SIZE = 256,
+ KEYS_PER_NODE = NODE_SIZE / (sizeof(uint64_t) + sizeof(void *)),
+ RECS_PER_LEAF = (NODE_SIZE - (2 * sizeof(struct fuse_iext_leaf *))) /
+ sizeof(struct fuse_iomap_io),
+};
+
+/*
+ * In-core extent btree block layout:
+ *
+ * There are two types of blocks in the btree: leaf and inner (non-leaf) blocks.
+ *
+ * The leaf blocks are made up of %RECS_PER_LEAF mapping records, each of which
+ * is a struct fuse_iomap_io, followed by pointers to the previous and next
+ * leaf blocks (if there are any).
+ *
+ * The inner (non-leaf) blocks first contain KEYS_PER_NODE lookup keys, followed
+ * by an equal number of pointers to the btree blocks at the next lower level.
+ *
+ * +-------+-------+-------+-------+-------+----------+----------+
+ * Leaf: | rec 1 | rec 2 | rec 3 | rec 4 | rec N | prev-ptr | next-ptr |
+ * +-------+-------+-------+-------+-------+----------+----------+
+ *
+ * +-------+-------+-------+-------+-------+-------+------+-------+
+ * Inner: | key 1 | key 2 | key 3 | key N | ptr 1 | ptr 2 | ptr3 | ptr N |
+ * +-------+-------+-------+-------+-------+-------+------+-------+
+ */
+struct fuse_iext_node {
+ uint64_t keys[KEYS_PER_NODE];
+#define FUSE_IEXT_KEY_INVALID (1ULL << 63)
+ void *ptrs[KEYS_PER_NODE];
+};
+
+struct fuse_iext_leaf {
+ struct fuse_iomap_io recs[RECS_PER_LEAF];
+ struct fuse_iext_leaf *prev;
+ struct fuse_iext_leaf *next;
+};
+
+inline uint64_t fuse_iext_count(const struct fuse_ifork *ifp)
+{
+ return ifp->if_bytes / sizeof(struct fuse_iomap_io);
+}
+
+static inline int fuse_iext_max_recs(const struct fuse_ifork *ifp)
+{
+ if (ifp->if_height == 1)
+ return fuse_iext_count(ifp);
+ return RECS_PER_LEAF;
+}
+
+static inline struct fuse_iomap_io *cur_rec(const struct fuse_iext_cursor *cur)
+{
+ return &cur->leaf->recs[cur->pos];
+}
+
+static inline bool fuse_iext_valid(const struct fuse_ifork *ifp,
+ const struct fuse_iext_cursor *cur)
+{
+ if (!cur->leaf)
+ return false;
+ if (cur->pos < 0 || cur->pos >= fuse_iext_max_recs(ifp))
+ return false;
+ if (fuse_iext_rec_is_empty(cur_rec(cur)))
+ return false;
+ return true;
+}
+
+static void *
+fuse_iext_find_first_leaf(
+ struct fuse_ifork *ifp)
+{
+ struct fuse_iext_node *node = ifp->if_data;
+ int height;
+
+ if (!ifp->if_height)
+ return NULL;
+
+ for (height = ifp->if_height; height > 1; height--) {
+ node = node->ptrs[0];
+ ASSERT(node);
+ }
+
+ return node;
+}
+
+static void *
+fuse_iext_find_last_leaf(
+ struct fuse_ifork *ifp)
+{
+ struct fuse_iext_node *node = ifp->if_data;
+ int height, i;
+
+ if (!ifp->if_height)
+ return NULL;
+
+ for (height = ifp->if_height; height > 1; height--) {
+ for (i = 1; i < KEYS_PER_NODE; i++)
+ if (!node->ptrs[i])
+ break;
+ node = node->ptrs[i - 1];
+ ASSERT(node);
+ }
+
+ return node;
+}
+
+void
+fuse_iext_first(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ cur->pos = 0;
+ cur->leaf = fuse_iext_find_first_leaf(ifp);
+}
+
+void
+fuse_iext_last(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ int i;
+
+ cur->leaf = fuse_iext_find_last_leaf(ifp);
+ if (!cur->leaf) {
+ cur->pos = 0;
+ return;
+ }
+
+ for (i = 1; i < fuse_iext_max_recs(ifp); i++) {
+ if (fuse_iext_rec_is_empty(&cur->leaf->recs[i]))
+ break;
+ }
+ cur->pos = i - 1;
+}
+
+void
+fuse_iext_next(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ if (!cur->leaf) {
+ ASSERT(cur->pos <= 0 || cur->pos >= RECS_PER_LEAF);
+ fuse_iext_first(ifp, cur);
+ return;
+ }
+
+ ASSERT(cur->pos >= 0);
+ ASSERT(cur->pos < fuse_iext_max_recs(ifp));
+
+ cur->pos++;
+ if (ifp->if_height > 1 && !fuse_iext_valid(ifp, cur) &&
+ cur->leaf->next) {
+ cur->leaf = cur->leaf->next;
+ cur->pos = 0;
+ }
+}
+
+void
+fuse_iext_prev(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ if (!cur->leaf) {
+ ASSERT(cur->pos <= 0 || cur->pos >= RECS_PER_LEAF);
+ fuse_iext_last(ifp, cur);
+ return;
+ }
+
+ ASSERT(cur->pos >= 0);
+ ASSERT(cur->pos <= RECS_PER_LEAF);
+
+recurse:
+ do {
+ cur->pos--;
+ if (fuse_iext_valid(ifp, cur))
+ return;
+ } while (cur->pos > 0);
+
+ if (ifp->if_height > 1 && cur->leaf->prev) {
+ cur->leaf = cur->leaf->prev;
+ cur->pos = RECS_PER_LEAF;
+ goto recurse;
+ }
+}
+
+static inline int
+fuse_iext_key_cmp(
+ struct fuse_iext_node *node,
+ int n,
+ loff_t offset)
+{
+ if (node->keys[n] > offset)
+ return 1;
+ if (node->keys[n] < offset)
+ return -1;
+ return 0;
+}
+
+static inline int
+fuse_iext_rec_cmp(
+ struct fuse_iomap_io *rec,
+ loff_t offset)
+{
+ if (rec->offset > offset)
+ return 1;
+ if (rec->offset + rec->length <= offset)
+ return -1;
+ return 0;
+}
+
+static void *
+fuse_iext_find_level(
+ struct fuse_ifork *ifp,
+ loff_t offset,
+ int level)
+{
+ struct fuse_iext_node *node = ifp->if_data;
+ int height, i;
+
+ if (!ifp->if_height)
+ return NULL;
+
+ for (height = ifp->if_height; height > level; height--) {
+ for (i = 1; i < KEYS_PER_NODE; i++)
+ if (fuse_iext_key_cmp(node, i, offset) > 0)
+ break;
+
+ node = node->ptrs[i - 1];
+ if (!node)
+ break;
+ }
+
+ return node;
+}
+
+static int
+fuse_iext_node_pos(
+ struct fuse_iext_node *node,
+ loff_t offset)
+{
+ int i;
+
+ for (i = 1; i < KEYS_PER_NODE; i++) {
+ if (fuse_iext_key_cmp(node, i, offset) > 0)
+ break;
+ }
+
+ return i - 1;
+}
+
+static int
+fuse_iext_node_insert_pos(
+ struct fuse_iext_node *node,
+ loff_t offset)
+{
+ int i;
+
+ for (i = 0; i < KEYS_PER_NODE; i++) {
+ if (fuse_iext_key_cmp(node, i, offset) > 0)
+ return i;
+ }
+
+ return KEYS_PER_NODE;
+}
+
+static int
+fuse_iext_node_nr_entries(
+ struct fuse_iext_node *node,
+ int start)
+{
+ int i;
+
+ for (i = start; i < KEYS_PER_NODE; i++) {
+ if (node->keys[i] == FUSE_IEXT_KEY_INVALID)
+ break;
+ }
+
+ return i;
+}
+
+static int
+fuse_iext_leaf_nr_entries(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_leaf *leaf,
+ int start)
+{
+ int i;
+
+ for (i = start; i < fuse_iext_max_recs(ifp); i++) {
+ if (fuse_iext_rec_is_empty(&leaf->recs[i]))
+ break;
+ }
+
+ return i;
+}
+
+static inline uint64_t
+fuse_iext_leaf_key(
+ struct fuse_iext_leaf *leaf,
+ int n)
+{
+ return leaf->recs[n].offset;
+}
+
+static inline void *
+fuse_iext_alloc_node(
+ int size)
+{
+ return kzalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL);
+}
+
+static void
+fuse_iext_grow(
+ struct fuse_ifork *ifp)
+{
+ struct fuse_iext_node *node = fuse_iext_alloc_node(NODE_SIZE);
+ int i;
+
+ if (ifp->if_height == 1) {
+ struct fuse_iext_leaf *prev = ifp->if_data;
+
+ node->keys[0] = fuse_iext_leaf_key(prev, 0);
+ node->ptrs[0] = prev;
+ } else {
+ struct fuse_iext_node *prev = ifp->if_data;
+
+ ASSERT(ifp->if_height > 1);
+
+ node->keys[0] = prev->keys[0];
+ node->ptrs[0] = prev;
+ }
+
+ for (i = 1; i < KEYS_PER_NODE; i++)
+ node->keys[i] = FUSE_IEXT_KEY_INVALID;
+
+ ifp->if_data = node;
+ ifp->if_height++;
+}
+
+static void
+fuse_iext_update_node(
+ struct fuse_ifork *ifp,
+ loff_t old_offset,
+ loff_t new_offset,
+ int level,
+ void *ptr)
+{
+ struct fuse_iext_node *node = ifp->if_data;
+ int height, i;
+
+ for (height = ifp->if_height; height > level; height--) {
+ for (i = 0; i < KEYS_PER_NODE; i++) {
+ if (i > 0 && fuse_iext_key_cmp(node, i, old_offset) > 0)
+ break;
+ if (node->keys[i] == old_offset)
+ node->keys[i] = new_offset;
+ }
+ node = node->ptrs[i - 1];
+ ASSERT(node);
+ }
+
+ ASSERT(node == ptr);
+}
+
+static struct fuse_iext_node *
+fuse_iext_split_node(
+ struct fuse_iext_node **nodep,
+ int *pos,
+ int *nr_entries)
+{
+ struct fuse_iext_node *node = *nodep;
+ struct fuse_iext_node *new = fuse_iext_alloc_node(NODE_SIZE);
+ const int nr_move = KEYS_PER_NODE / 2;
+ int nr_keep = nr_move + (KEYS_PER_NODE & 1);
+ int i = 0;
+
+ /* for sequential append operations just spill over into the new node */
+ if (*pos == KEYS_PER_NODE) {
+ *nodep = new;
+ *pos = 0;
+ *nr_entries = 0;
+ goto done;
+ }
+
+
+ for (i = 0; i < nr_move; i++) {
+ new->keys[i] = node->keys[nr_keep + i];
+ new->ptrs[i] = node->ptrs[nr_keep + i];
+
+ node->keys[nr_keep + i] = FUSE_IEXT_KEY_INVALID;
+ node->ptrs[nr_keep + i] = NULL;
+ }
+
+ if (*pos >= nr_keep) {
+ *nodep = new;
+ *pos -= nr_keep;
+ *nr_entries = nr_move;
+ } else {
+ *nr_entries = nr_keep;
+ }
+done:
+ for (; i < KEYS_PER_NODE; i++)
+ new->keys[i] = FUSE_IEXT_KEY_INVALID;
+ return new;
+}
+
+static void
+fuse_iext_insert_node(
+ struct fuse_ifork *ifp,
+ uint64_t offset,
+ void *ptr,
+ int level)
+{
+ struct fuse_iext_node *node, *new;
+ int i, pos, nr_entries;
+
+again:
+ if (ifp->if_height < level)
+ fuse_iext_grow(ifp);
+
+ new = NULL;
+ node = fuse_iext_find_level(ifp, offset, level);
+ pos = fuse_iext_node_insert_pos(node, offset);
+ nr_entries = fuse_iext_node_nr_entries(node, pos);
+
+ ASSERT(pos >= nr_entries || fuse_iext_key_cmp(node, pos, offset) != 0);
+ ASSERT(nr_entries <= KEYS_PER_NODE);
+
+ if (nr_entries == KEYS_PER_NODE)
+ new = fuse_iext_split_node(&node, &pos, &nr_entries);
+
+ /*
+ * Update the pointers in higher levels if the first entry changes
+ * in an existing node.
+ */
+ if (node != new && pos == 0 && nr_entries > 0)
+ fuse_iext_update_node(ifp, node->keys[0], offset, level, node);
+
+ for (i = nr_entries; i > pos; i--) {
+ node->keys[i] = node->keys[i - 1];
+ node->ptrs[i] = node->ptrs[i - 1];
+ }
+ node->keys[pos] = offset;
+ node->ptrs[pos] = ptr;
+
+ if (new) {
+ offset = new->keys[0];
+ ptr = new;
+ level++;
+ goto again;
+ }
+}
+
+static struct fuse_iext_leaf *
+fuse_iext_split_leaf(
+ struct fuse_iext_cursor *cur,
+ int *nr_entries)
+{
+ struct fuse_iext_leaf *leaf = cur->leaf;
+ struct fuse_iext_leaf *new = fuse_iext_alloc_node(NODE_SIZE);
+ const int nr_move = RECS_PER_LEAF / 2;
+ int nr_keep = nr_move + (RECS_PER_LEAF & 1);
+ int i;
+
+ /* for sequential append operations just spill over into the new node */
+ if (cur->pos == RECS_PER_LEAF) {
+ cur->leaf = new;
+ cur->pos = 0;
+ *nr_entries = 0;
+ goto done;
+ }
+
+ for (i = 0; i < nr_move; i++) {
+ new->recs[i] = leaf->recs[nr_keep + i];
+ fuse_iext_rec_clear(&leaf->recs[nr_keep + i]);
+ }
+
+ if (cur->pos >= nr_keep) {
+ cur->leaf = new;
+ cur->pos -= nr_keep;
+ *nr_entries = nr_move;
+ } else {
+ *nr_entries = nr_keep;
+ }
+done:
+ if (leaf->next)
+ leaf->next->prev = new;
+ new->next = leaf->next;
+ new->prev = leaf;
+ leaf->next = new;
+ return new;
+}
+
+static void
+fuse_iext_alloc_root(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ ASSERT(ifp->if_bytes == 0);
+
+ ifp->if_data = fuse_iext_alloc_node(sizeof(struct fuse_iomap_io));
+ ifp->if_height = 1;
+
+ /* now that we have a node step into it */
+ cur->leaf = ifp->if_data;
+ cur->pos = 0;
+}
+
+static void
+fuse_iext_realloc_root(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ int64_t new_size = ifp->if_bytes + sizeof(struct fuse_iomap_io);
+ void *new;
+
+ /* account for the prev/next pointers */
+ if (new_size / sizeof(struct fuse_iomap_io) == RECS_PER_LEAF)
+ new_size = NODE_SIZE;
+
+ new = krealloc(ifp->if_data, new_size,
+ GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL);
+ memset(new + ifp->if_bytes, 0, new_size - ifp->if_bytes);
+ ifp->if_data = new;
+ cur->leaf = new;
+}
+
+/*
+ * Increment the sequence counter on extent tree changes. We use WRITE_ONCE
+ * here to ensure the update to the sequence counter is seen before the
+ * modifications to the extent tree itself take effect.
+ */
+static inline void fuse_iext_inc_seq(struct fuse_iomap_cache *ip)
+{
+ WRITE_ONCE(ip->im_seq, READ_ONCE(ip->im_seq) + 1);
+}
+
+void
+fuse_iext_insert_raw(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur,
+ const struct fuse_iomap_io *irec)
+{
+ loff_t offset = irec->offset;
+ struct fuse_iext_leaf *new = NULL;
+ int nr_entries, i;
+
+ fuse_iext_inc_seq(ip);
+
+ if (ifp->if_height == 0)
+ fuse_iext_alloc_root(ifp, cur);
+ else if (ifp->if_height == 1)
+ fuse_iext_realloc_root(ifp, cur);
+
+ nr_entries = fuse_iext_leaf_nr_entries(ifp, cur->leaf, cur->pos);
+ ASSERT(nr_entries <= RECS_PER_LEAF);
+ ASSERT(cur->pos >= nr_entries ||
+ fuse_iext_rec_cmp(cur_rec(cur), irec->offset) != 0);
+
+ if (nr_entries == RECS_PER_LEAF)
+ new = fuse_iext_split_leaf(cur, &nr_entries);
+
+ /*
+ * Update the pointers in higher levels if the first entry changes
+ * in an existing node.
+ */
+ if (cur->leaf != new && cur->pos == 0 && nr_entries > 0) {
+ fuse_iext_update_node(ifp, fuse_iext_leaf_key(cur->leaf, 0),
+ offset, 1, cur->leaf);
+ }
+
+ for (i = nr_entries; i > cur->pos; i--)
+ cur->leaf->recs[i] = cur->leaf->recs[i - 1];
+ fuse_iext_set(cur_rec(cur), irec);
+ ifp->if_bytes += sizeof(struct fuse_iomap_io);
+
+ if (new)
+ fuse_iext_insert_node(ifp, fuse_iext_leaf_key(new, 0), new, 2);
+}
+
+void
+fuse_iext_insert(
+ struct fuse_iomap_cache *ip,
+ struct fuse_iext_cursor *cur,
+ const struct fuse_iomap_io *irec,
+ int state)
+{
+ struct fuse_ifork *ifp = fuse_iext_state_to_fork(ip, state);
+
+ fuse_iext_insert_raw(ip, ifp, cur, irec);
+}
+
+static struct fuse_iext_node *
+fuse_iext_rebalance_node(
+ struct fuse_iext_node *parent,
+ int *pos,
+ struct fuse_iext_node *node,
+ int nr_entries)
+{
+ /*
+ * If the neighbouring nodes are completely full, or have different
+ * parents, we might never be able to merge our node, and will only
+ * delete it once the number of entries hits zero.
+ */
+ if (nr_entries == 0)
+ return node;
+
+ if (*pos > 0) {
+ struct fuse_iext_node *prev = parent->ptrs[*pos - 1];
+ int nr_prev = fuse_iext_node_nr_entries(prev, 0), i;
+
+ if (nr_prev + nr_entries <= KEYS_PER_NODE) {
+ for (i = 0; i < nr_entries; i++) {
+ prev->keys[nr_prev + i] = node->keys[i];
+ prev->ptrs[nr_prev + i] = node->ptrs[i];
+ }
+ return node;
+ }
+ }
+
+ if (*pos + 1 < fuse_iext_node_nr_entries(parent, *pos)) {
+ struct fuse_iext_node *next = parent->ptrs[*pos + 1];
+ int nr_next = fuse_iext_node_nr_entries(next, 0), i;
+
+ if (nr_entries + nr_next <= KEYS_PER_NODE) {
+ /*
+ * Merge the next node into this node so that we don't
+ * have to do an additional update of the keys in the
+ * higher levels.
+ */
+ for (i = 0; i < nr_next; i++) {
+ node->keys[nr_entries + i] = next->keys[i];
+ node->ptrs[nr_entries + i] = next->ptrs[i];
+ }
+
+ ++*pos;
+ return next;
+ }
+ }
+
+ return NULL;
+}
+
+static void
+fuse_iext_remove_node(
+ struct fuse_ifork *ifp,
+ loff_t offset,
+ void *victim)
+{
+ struct fuse_iext_node *node, *parent;
+ int level = 2, pos, nr_entries, i;
+
+ ASSERT(level <= ifp->if_height);
+ node = fuse_iext_find_level(ifp, offset, level);
+ pos = fuse_iext_node_pos(node, offset);
+again:
+ ASSERT(node->ptrs[pos]);
+ ASSERT(node->ptrs[pos] == victim);
+ kfree(victim);
+
+ nr_entries = fuse_iext_node_nr_entries(node, pos) - 1;
+ offset = node->keys[0];
+ for (i = pos; i < nr_entries; i++) {
+ node->keys[i] = node->keys[i + 1];
+ node->ptrs[i] = node->ptrs[i + 1];
+ }
+ node->keys[nr_entries] = FUSE_IEXT_KEY_INVALID;
+ node->ptrs[nr_entries] = NULL;
+
+ if (pos == 0 && nr_entries > 0) {
+ fuse_iext_update_node(ifp, offset, node->keys[0], level, node);
+ offset = node->keys[0];
+ }
+
+ if (nr_entries >= KEYS_PER_NODE / 2)
+ return;
+
+ if (level < ifp->if_height) {
+ /*
+ * If we aren't at the root yet try to find a neighbour node to
+ * merge with (or delete the node if it is empty), and then
+ * recurse up to the next level.
+ */
+ level++;
+ parent = fuse_iext_find_level(ifp, offset, level);
+ pos = fuse_iext_node_pos(parent, offset);
+
+ ASSERT(pos != KEYS_PER_NODE);
+ ASSERT(parent->ptrs[pos] == node);
+
+ node = fuse_iext_rebalance_node(parent, &pos, node, nr_entries);
+ if (node) {
+ victim = node;
+ node = parent;
+ goto again;
+ }
+ } else if (nr_entries == 1) {
+ /*
+ * If we are at the root and only one entry is left we can just
+ * free this node and update the root pointer.
+ */
+ ASSERT(node == ifp->if_data);
+ ifp->if_data = node->ptrs[0];
+ ifp->if_height--;
+ kfree(node);
+ }
+}
+
+static void
+fuse_iext_rebalance_leaf(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iext_leaf *leaf,
+ loff_t offset,
+ int nr_entries)
+{
+ /*
+ * If the neighbouring nodes are completely full we might never be able
+ * to merge our node, and will only delete it once the number of
+ * entries hits zero.
+ */
+ if (nr_entries == 0)
+ goto remove_node;
+
+ if (leaf->prev) {
+ int nr_prev = fuse_iext_leaf_nr_entries(ifp, leaf->prev, 0), i;
+
+ if (nr_prev + nr_entries <= RECS_PER_LEAF) {
+ for (i = 0; i < nr_entries; i++)
+ leaf->prev->recs[nr_prev + i] = leaf->recs[i];
+
+ if (cur->leaf == leaf) {
+ cur->leaf = leaf->prev;
+ cur->pos += nr_prev;
+ }
+ goto remove_node;
+ }
+ }
+
+ if (leaf->next) {
+ int nr_next = fuse_iext_leaf_nr_entries(ifp, leaf->next, 0), i;
+
+ if (nr_entries + nr_next <= RECS_PER_LEAF) {
+ /*
+ * Merge the next node into this node so that we don't
+ * have to do an additional update of the keys in the
+ * higher levels.
+ */
+ for (i = 0; i < nr_next; i++) {
+ leaf->recs[nr_entries + i] =
+ leaf->next->recs[i];
+ }
+
+ if (cur->leaf == leaf->next) {
+ cur->leaf = leaf;
+ cur->pos += nr_entries;
+ }
+
+ offset = fuse_iext_leaf_key(leaf->next, 0);
+ leaf = leaf->next;
+ goto remove_node;
+ }
+ }
+
+ return;
+remove_node:
+ if (leaf->prev)
+ leaf->prev->next = leaf->next;
+ if (leaf->next)
+ leaf->next->prev = leaf->prev;
+ fuse_iext_remove_node(ifp, offset, leaf);
+}
+
+static void
+fuse_iext_free_last_leaf(
+ struct fuse_ifork *ifp)
+{
+ ifp->if_height--;
+ kfree(ifp->if_data);
+ ifp->if_data = NULL;
+}
+
+void
+fuse_iext_remove(
+ struct fuse_iomap_cache *ip,
+ struct fuse_iext_cursor *cur,
+ int state)
+{
+ struct fuse_ifork *ifp = fuse_iext_state_to_fork(ip, state);
+ struct fuse_iext_leaf *leaf = cur->leaf;
+ loff_t offset = fuse_iext_leaf_key(leaf, 0);
+ int i, nr_entries;
+
+ ASSERT(ifp->if_height > 0);
+ ASSERT(ifp->if_data != NULL);
+ ASSERT(fuse_iext_valid(ifp, cur));
+
+ fuse_iext_inc_seq(ip);
+
+ nr_entries = fuse_iext_leaf_nr_entries(ifp, leaf, cur->pos) - 1;
+ for (i = cur->pos; i < nr_entries; i++)
+ leaf->recs[i] = leaf->recs[i + 1];
+ fuse_iext_rec_clear(&leaf->recs[nr_entries]);
+ ifp->if_bytes -= sizeof(struct fuse_iomap_io);
+
+ if (cur->pos == 0 && nr_entries > 0) {
+ fuse_iext_update_node(ifp, offset, fuse_iext_leaf_key(leaf, 0), 1,
+ leaf);
+ offset = fuse_iext_leaf_key(leaf, 0);
+ } else if (cur->pos == nr_entries) {
+ if (ifp->if_height > 1 && leaf->next)
+ cur->leaf = leaf->next;
+ else
+ cur->leaf = NULL;
+ cur->pos = 0;
+ }
+
+ if (nr_entries >= RECS_PER_LEAF / 2)
+ return;
+
+ if (ifp->if_height > 1)
+ fuse_iext_rebalance_leaf(ifp, cur, leaf, offset, nr_entries);
+ else if (nr_entries == 0)
+ fuse_iext_free_last_leaf(ifp);
+}
+
+/*
+ * Look up the extent covering offset.
+ *
+ * If there is an extent covering offset, return true, store the expanded
+ * extent structure in *gotp, and point *cur at it.
+ * If there is no extent covering offset, but there is an extent after it (e.g.
+ * it lies in a hole), return true with that extent in *gotp and its cursor in
+ * *cur instead.
+ * If offset is beyond the last extent, return false and leave *cur pointing
+ * at an invalid position.
+ */
+bool
+fuse_iext_lookup_extent(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ loff_t offset,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp)
+{
+ cur->leaf = fuse_iext_find_level(ifp, offset, 1);
+ if (!cur->leaf) {
+ cur->pos = 0;
+ return false;
+ }
+
+ for (cur->pos = 0; cur->pos < fuse_iext_max_recs(ifp); cur->pos++) {
+ struct fuse_iomap_io *rec = cur_rec(cur);
+
+ if (fuse_iext_rec_is_empty(rec))
+ break;
+ if (fuse_iext_rec_cmp(rec, offset) >= 0)
+ goto found;
+ }
+
+ /* Try looking in the next node for an entry > offset */
+ if (ifp->if_height == 1 || !cur->leaf->next)
+ return false;
+ cur->leaf = cur->leaf->next;
+ cur->pos = 0;
+ if (!fuse_iext_valid(ifp, cur))
+ return false;
+found:
+ fuse_iext_get(gotp, cur_rec(cur));
+ return true;
+}
+
+/*
+ * Returns the last extent before end, and if this extent doesn't cover
+ * end, update end to the end of the extent.
+ */
+bool
+fuse_iext_lookup_extent_before(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ loff_t *end,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp)
+{
+ /* could be optimized to not even look up the next on a match.. */
+ if (fuse_iext_lookup_extent(ip, ifp, *end - 1, cur, gotp) &&
+ gotp->offset <= *end - 1)
+ return true;
+ if (!fuse_iext_prev_extent(ifp, cur, gotp))
+ return false;
+ *end = gotp->offset + gotp->length;
+ return true;
+}
+
+void
+fuse_iext_update_extent(
+ struct fuse_iomap_cache *ip,
+ int state,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *new)
+{
+ struct fuse_ifork *ifp = fuse_iext_state_to_fork(ip, state);
+
+ fuse_iext_inc_seq(ip);
+
+ if (cur->pos == 0) {
+ struct fuse_iomap_io old;
+
+ fuse_iext_get(&old, cur_rec(cur));
+ if (new->offset != old.offset) {
+ fuse_iext_update_node(ifp, old.offset,
+ new->offset, 1, cur->leaf);
+ }
+ }
+
+ fuse_iext_set(cur_rec(cur), new);
+}
+
+/*
+ * Return true if the cursor points at an extent and return the extent structure
+ * in gotp. Else return false.
+ */
+bool
+fuse_iext_get_extent(
+ const struct fuse_ifork *ifp,
+ const struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp)
+{
+ if (!fuse_iext_valid(ifp, cur))
+ return false;
+ fuse_iext_get(gotp, cur_rec(cur));
+ return true;
+}
+
+/*
+ * This function is recursive, so we need to be extremely careful with
+ * stack usage.
+ */
+static void
+fuse_iext_destroy_node(
+ struct fuse_iext_node *node,
+ int level)
+{
+ int i;
+
+ if (level > 1) {
+ for (i = 0; i < KEYS_PER_NODE; i++) {
+ if (node->keys[i] == FUSE_IEXT_KEY_INVALID)
+ break;
+ fuse_iext_destroy_node(node->ptrs[i], level - 1);
+ }
+ }
+
+ kfree(node);
+}
+
+void
+fuse_iext_destroy(
+ struct fuse_ifork *ifp)
+{
+ fuse_iext_destroy_node(ifp->if_data, ifp->if_height);
+
+ ifp->if_bytes = 0;
+ ifp->if_height = 0;
+ ifp->if_data = NULL;
+}
+
+static inline struct fuse_ifork *
+fuse_iomap_fork_ptr(
+ struct fuse_iomap_cache *ip,
+ enum fuse_iomap_iodir iodir)
+{
+ switch (iodir) {
+ case READ_MAPPING:
+ return &ip->im_read;
+ case WRITE_MAPPING:
+ return ip->im_write;
+ default:
+ ASSERT(0);
+ return NULL;
+ }
+}
+
+static inline bool fuse_iomap_addrs_adjacent(const struct fuse_iomap_io *left,
+ const struct fuse_iomap_io *right)
+{
+ switch (left->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ return left->addr + left->length == right->addr;
+ default:
+ return left->addr == FUSE_IOMAP_NULL_ADDR &&
+ right->addr == FUSE_IOMAP_NULL_ADDR;
+ }
+}
+
+static inline bool fuse_iomap_can_merge(const struct fuse_iomap_io *left,
+ const struct fuse_iomap_io *right)
+{
+ return (left->dev == right->dev &&
+ left->offset + left->length == right->offset &&
+ left->type == right->type &&
+ fuse_iomap_addrs_adjacent(left, right) &&
+ left->flags == right->flags &&
+ left->length + right->length <= FUSE_IOMAP_MAX_LEN);
+}
+
+static inline bool fuse_iomap_can_merge3(const struct fuse_iomap_io *left,
+ const struct fuse_iomap_io *new,
+ const struct fuse_iomap_io *right)
+{
+ return left->length + new->length + right->length <= FUSE_IOMAP_MAX_LEN;
+}
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static void fuse_iext_check_mappings(struct inode *inode,
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp)
+{
+ struct fuse_inode *fi = FUSE_I(ip);
+ struct fuse_iext_cursor icur;
+ struct fuse_iomap_io prev, got;
+ unsigned long long nr = 0;
+
+ if (!ifp || !static_branch_unlikely(&fuse_iomap_debug))
+ return;
+
+ fuse_iext_first(ifp, &icur);
+ if (!fuse_iext_get_extent(ifp, &icur, &prev))
+ return;
+ nr++;
+
+ fuse_iext_next(ifp, &icur);
+ while (fuse_iext_get_extent(ifp, &icur, &got)) {
+ if (got.length == 0 ||
+ got.offset < prev.offset + prev.length ||
+ fuse_iomap_can_merge(&prev, &got)) {
+ printk(KERN_ERR "FUSE IOMAP CORRUPTION ino=%llu nr=%llu\n",
+ fi->orig_ino, nr);
+ printk(KERN_ERR "prev: offset=%llu length=%llu type=%u flags=0x%x dev=%u addr=%llu\n",
+ prev.offset, prev.length, prev.type, prev.flags,
+ prev.dev, prev.addr);
+ printk(KERN_ERR "curr: offset=%llu length=%llu type=%u flags=0x%x dev=%u addr=%llu\n",
+ got.offset, got.length, got.type, got.flags,
+ got.dev, got.addr);
+ }
+
+ prev = got;
+ nr++;
+ fuse_iext_next(ifp, &icur);
+ }
+}
+#else
+# define fuse_iext_check_mappings(...) ((void)0)
+#endif
+
+static void
+fuse_iext_del_mapping(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *icur,
+ struct fuse_iomap_io *got, /* current extent entry */
+ struct fuse_iomap_io *del) /* data to remove from extents */
+{
+ struct fuse_iomap_io new; /* new record to be inserted */
+ /* first addr (fsblock aligned) past del */
+ uint64_t del_endaddr;
+ /* first offset (fsblock aligned) past del */
+ uint64_t del_endoff = del->offset + del->length;
+ /* first offset (fsblock aligned) past got */
+ uint64_t got_endoff = got->offset + got->length;
+ uint32_t state = fuse_iomap_fork_to_state(ip, ifp);
+
+ ASSERT(del->length > 0);
+ ASSERT(got->offset <= del->offset);
+ ASSERT(got_endoff >= del_endoff);
+
+ switch (del->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ del_endaddr = del->addr + del->length;
+ break;
+ default:
+ del_endaddr = FUSE_IOMAP_NULL_ADDR;
+ break;
+ }
+
+ if (got->offset == del->offset)
+ state |= FUSE_IEXT_LEFT_FILLING;
+ if (got_endoff == del_endoff)
+ state |= FUSE_IEXT_RIGHT_FILLING;
+
+ switch (state & (FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING)) {
+ case FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING:
+ /*
+ * Matches the whole extent. Delete the entry.
+ */
+ fuse_iext_remove(ip, icur, state);
+ fuse_iext_prev(ifp, icur);
+ break;
+ case FUSE_IEXT_LEFT_FILLING:
+ /*
+ * Deleting the first part of the extent.
+ */
+ got->offset = del_endoff;
+ got->addr = del_endaddr;
+ got->length -= del->length;
+ fuse_iext_update_extent(ip, state, icur, got);
+ break;
+ case FUSE_IEXT_RIGHT_FILLING:
+ /*
+ * Deleting the last part of the extent.
+ */
+ got->length -= del->length;
+ fuse_iext_update_extent(ip, state, icur, got);
+ break;
+ case 0:
+ /*
+ * Deleting the middle of the extent.
+ */
+ got->length = del->offset - got->offset;
+ fuse_iext_update_extent(ip, state, icur, got);
+
+ new.offset = del_endoff;
+ new.length = got_endoff - del_endoff;
+ new.type = got->type;
+ new.flags = got->flags;
+ new.addr = del_endaddr;
+ new.dev = got->dev;
+
+ fuse_iext_next(ifp, icur);
+ fuse_iext_insert(ip, icur, &new, state);
+ break;
+ }
+}
+
+int
+fuse_iomap_cache_remove(
+ struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ loff_t start, /* first file offset deleted */
+ uint64_t len) /* length to unmap */
+{
+ struct fuse_iext_cursor icur;
+ struct fuse_iomap_io got; /* current extent record */
+ struct fuse_iomap_io del; /* extent being deleted */
+ loff_t end;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+ struct fuse_ifork *ifp = fuse_iomap_fork_ptr(ip, iodir);
+ bool wasreal;
+ bool done = false;
+ int ret = 0;
+
+ assert_cache_locked(ip);
+
+ if (!ifp || fuse_iext_count(ifp) == 0)
+ return 0;
+
+ /* Fast shortcut if the caller wants to erase everything */
+ if (start == 0 && len >= inode->i_sb->s_maxbytes) {
+ fuse_iext_destroy(ifp);
+ return 0;
+ }
+
+ if (!len)
+ goto out;
+
+ /*
+ * If the caller wants us to remove everything to EOF, we set the end
+ * of the removal range to the maximum file offset. We don't support
+ * unsigned file offsets.
+ */
+ if (len == FUSE_IOMAP_INVAL_TO_EOF) {
+ const unsigned int blocksize = i_blocksize(inode);
+
+ len = round_up(inode->i_sb->s_maxbytes, blocksize) - start;
+ }
+
+ /*
+ * Now that we've settled len, look up the extent before the end of the
+ * range.
+ */
+ end = start + len;
+ if (!fuse_iext_lookup_extent_before(ip, ifp, &end, &icur, &got))
+ goto out;
+ end--;
+
+ while (end != -1 && end >= start) {
+ /*
+ * Is the found extent after a hole in which end lives?
+ * Just back up to the previous extent, if so.
+ */
+ if (got.offset > end &&
+ !fuse_iext_prev_extent(ifp, &icur, &got)) {
+ done = true;
+ break;
+ }
+ /*
+ * Is the last block of this extent before the range
+ * we're supposed to delete? If so, we're done.
+ */
+ end = min_t(loff_t, end, got.offset + got.length - 1);
+ if (end < start)
+ break;
+ /*
+ * Then deal with the (possibly delayed) allocated space
+ * we found.
+ */
+ del = got;
+ switch (del.type) {
+ case FUSE_IOMAP_TYPE_DELALLOC:
+ case FUSE_IOMAP_TYPE_HOLE:
+ case FUSE_IOMAP_TYPE_INLINE:
+ case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+ wasreal = false;
+ break;
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ wasreal = true;
+ break;
+ default:
+ ASSERT(0);
+ ret = -EFSCORRUPTED;
+ goto out;
+ }
+
+ if (got.offset < start) {
+ del.offset = start;
+ del.length -= start - got.offset;
+ if (wasreal)
+ del.addr += start - got.offset;
+ }
+ if (del.offset + del.length > end + 1)
+ del.length = end + 1 - del.offset;
+
+ fuse_iext_del_mapping(ip, ifp, &icur, &got, &del);
+ end = del.offset - 1;
+
+ /*
+ * If not done go on to the next (previous) record.
+ */
+ if (end != -1 && end >= start) {
+ if (!fuse_iext_get_extent(ifp, &icur, &got) ||
+ (got.offset > end &&
+ !fuse_iext_prev_extent(ifp, &icur, &got))) {
+ done = true;
+ break;
+ }
+ }
+ }
+
+ /* Should have removed everything */
+ if (len == 0 || done || end == (loff_t)-1 || end < start)
+ ret = 0;
+ else
+ ret = -EFSCORRUPTED;
+
+out:
+ fuse_iext_check_mappings(inode, ip, ifp);
+ return ret;
+}
+
+static void
+fuse_iext_add_mapping(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *icur,
+ const struct fuse_iomap_io *new) /* new extent entry */
+{
+ struct fuse_iomap_io left; /* left neighbor extent entry */
+ struct fuse_iomap_io right; /* right neighbor extent entry */
+ uint32_t state = fuse_iomap_fork_to_state(ip, ifp);
+
+ /*
+ * Check and set flags if this segment has a left neighbor.
+ */
+ if (fuse_iext_peek_prev_extent(ifp, icur, &left))
+ state |= FUSE_IEXT_LEFT_VALID;
+
+ /*
+ * Check and set flags if this segment has a current value.
+ * Not true if we're inserting into the "hole" at eof.
+ */
+ if (fuse_iext_get_extent(ifp, icur, &right))
+ state |= FUSE_IEXT_RIGHT_VALID;
+
+ /*
+ * We're inserting a real allocation between "left" and "right".
+ * Set the contiguity flags. Don't let extents get too large.
+ */
+ if ((state & FUSE_IEXT_LEFT_VALID) && fuse_iomap_can_merge(&left, new))
+ state |= FUSE_IEXT_LEFT_CONTIG;
+
+ if ((state & FUSE_IEXT_RIGHT_VALID) &&
+ fuse_iomap_can_merge(new, &right) &&
+ (!(state & FUSE_IEXT_LEFT_CONTIG) ||
+ fuse_iomap_can_merge3(&left, new, &right)))
+ state |= FUSE_IEXT_RIGHT_CONTIG;
+
+ /*
+ * Select which case we're in here, and implement it.
+ */
+ switch (state & (FUSE_IEXT_LEFT_CONTIG | FUSE_IEXT_RIGHT_CONTIG)) {
+ case FUSE_IEXT_LEFT_CONTIG | FUSE_IEXT_RIGHT_CONTIG:
+ /*
+ * New allocation is contiguous with real allocations on the
+ * left and on the right.
+ * Merge all three into a single extent record.
+ */
+ left.length += new->length + right.length;
+
+ fuse_iext_remove(ip, icur, state);
+ fuse_iext_prev(ifp, icur);
+ fuse_iext_update_extent(ip, state, icur, &left);
+ break;
+
+ case FUSE_IEXT_LEFT_CONTIG:
+ /*
+ * New allocation is contiguous with a real allocation
+ * on the left.
+ * Merge the new allocation with the left neighbor.
+ */
+ left.length += new->length;
+
+ fuse_iext_prev(ifp, icur);
+ fuse_iext_update_extent(ip, state, icur, &left);
+ break;
+
+ case FUSE_IEXT_RIGHT_CONTIG:
+ /*
+ * New allocation is contiguous with a real allocation
+ * on the right.
+ * Merge the new allocation with the right neighbor.
+ */
+ right.offset = new->offset;
+ right.addr = new->addr;
+ right.length += new->length;
+ fuse_iext_update_extent(ip, state, icur, &right);
+ break;
+
+ case 0:
+ /*
+ * New allocation is not contiguous with another
+ * real allocation.
+ * Insert a new entry.
+ */
+ fuse_iext_insert(ip, icur, new, state);
+ break;
+ }
+}
+
+static int
+fuse_iomap_cache_add(
+ struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *new)
+{
+ struct fuse_iext_cursor icur;
+ struct fuse_iomap_io got;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+ struct fuse_ifork *ifp = fuse_iomap_fork_ptr(ip, iodir);
+
+ assert_cache_locked(ip);
+ ASSERT(new->length > 0);
+ ASSERT(new->offset < inode->i_sb->s_maxbytes);
+
+ if (!ifp) {
+ ifp = kzalloc(sizeof(struct fuse_ifork),
+ GFP_KERNEL | __GFP_NOFAIL);
+ if (!ifp)
+ return -ENOMEM;
+
+ ip->im_write = ifp;
+ }
+
+ if (fuse_iext_lookup_extent(ip, ifp, new->offset, &icur, &got)) {
+ /* make sure we only add into a hole. */
+ ASSERT(got.offset > new->offset);
+ ASSERT(got.offset - new->offset >= new->length);
+
+ if (got.offset <= new->offset ||
+ got.offset - new->offset < new->length)
+ return -EFSCORRUPTED;
+ }
+
+ fuse_iext_add_mapping(ip, ifp, &icur, new);
+ fuse_iext_check_mappings(inode, ip, ifp);
+ return 0;
+}
+
+int
+fuse_iomap_cache_upsert(
+ struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *map)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+ int err;
+
+ /*
+ * We interpret no write fork to mean that all writes are pure
+ * overwrites. Avoid wasting memory if we're trying to upsert a
+ * pure overwrite.
+ */
+ if (iodir == WRITE_MAPPING &&
+ map->type == FUSE_IOMAP_TYPE_PURE_OVERWRITE &&
+ ip->im_write == NULL)
+ return 0;
+
+ err = fuse_iomap_cache_remove(inode, iodir, map->offset, map->length);
+ if (err)
+ return err;
+
+ return fuse_iomap_cache_add(inode, iodir, map);
+}
+
+/*
+ * Trim the returned map to the required bounds
+ */
+static void
+fuse_iomap_trim(
+ struct fuse_inode *fi,
+ struct fuse_iomap_lookup *mval,
+ const struct fuse_iomap_io *got,
+ loff_t off,
+ loff_t len)
+{
+ struct fuse_iomap_cache *ip = &fi->cache;
+ const unsigned int blocksize = i_blocksize(&fi->inode);
+ const loff_t aligned_off = round_down(off, blocksize);
+ const loff_t aligned_end = round_up(off + len, blocksize);
+ const loff_t aligned_len = aligned_end - aligned_off;
+
+ ASSERT(aligned_off >= got->offset);
+
+ switch (got->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ mval->map.addr = got->addr + (aligned_off - got->offset);
+ break;
+ default:
+ mval->map.addr = FUSE_IOMAP_NULL_ADDR;
+ break;
+ }
+ mval->map.offset = aligned_off;
+ mval->map.length = min_t(loff_t, aligned_len,
+ got->length - (aligned_off - got->offset));
+ mval->map.type = got->type;
+ mval->map.flags = got->flags;
+ mval->map.dev = got->dev;
+ mval->validity_cookie = fuse_iext_read_seq(ip);
+}
+
+enum fuse_iomap_lookup_result
+fuse_iomap_cache_lookup(
+ struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ loff_t off,
+ uint64_t len,
+ struct fuse_iomap_lookup *mval)
+{
+ struct fuse_iomap_io got;
+ struct fuse_iext_cursor icur;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+ struct fuse_ifork *ifp = fuse_iomap_fork_ptr(ip, iodir);
+
+ assert_cache_locked_shared(ip);
+
+ if (!ifp) {
+ /*
+ * No write fork at all means this filesystem doesn't do out of
+ * place writes.
+ */
+ return LOOKUP_NOFORK;
+ }
+
+ if (!fuse_iext_lookup_extent(ip, ifp, off, &icur, &got)) {
+ /*
+ * Write fork does not contain a mapping at or beyond off,
+ * which is a cache miss.
+ */
+ return LOOKUP_MISS;
+ }
+
+ if (got.offset > off) {
+ /*
+ * Found a mapping, but it doesn't cover the start of the
+ * range, which is effectively a miss.
+ */
+ return LOOKUP_MISS;
+ }
+
+ /* Found a mapping in the cache, return it */
+ fuse_iomap_trim(fi, mval, &got, off, len);
+ return LOOKUP_HIT;
+}
* [PATCH 02/10] fuse_trace: cache iomaps
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-09-16 0:38 ` [PATCH 01/10] fuse: cache iomaps Darrick J. Wong
@ 2025-09-16 0:38 ` Darrick J. Wong
2025-09-16 0:39 ` [PATCH 03/10] fuse: use the iomap cache for iomap_begin Darrick J. Wong
` (7 subsequent siblings)
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:38 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the iomap cache introduced in the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
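For anyone wanting to watch these events, here is a minimal userspace
sketch that toggles a single tracepoint through tracefs. It rests on
assumptions this patch does not confirm: that tracefs is mounted at
/sys/kernel/tracing and that the events are registered under the "fuse"
trace system. tp_enable_path() and tp_set_enabled() are hypothetical helper
names, not part of any existing tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<root>/events/<system>/<event>/enable" in buf.
 * Returns 0 on success, -1 if the path would not fit.
 */
static int tp_enable_path(char *buf, size_t size, const char *root,
			  const char *system, const char *event)
{
	int n;

	assert(buf && root && system && event);
	n = snprintf(buf, size, "%s/events/%s/%s/enable", root, system, event);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Write "1" or "0" to the tracepoint's enable file.
 * Returns 0 on success, -1 on any failure (path too long, no permission,
 * tracefs not mounted, event not present in this kernel).
 */
static int tp_set_enabled(const char *root, const char *system,
			  const char *event, int on)
{
	char path[512];
	FILE *fp;
	int ret = 0;

	if (tp_enable_path(path, sizeof(path), root, system, event))
		return -1;

	fp = fopen(path, "w");
	if (!fp)
		return -1;
	if (fputs(on ? "1" : "0", fp) == EOF)
		ret = -1;
	if (fclose(fp) == EOF)
		ret = -1;
	return ret;
}
```

A call such as tp_set_enabled("/sys/kernel/tracing", "fuse",
"fuse_iomap_cache_lookup", 1) would then enable the lookup tracepoint
defined below, provided the caller has permission to write to tracefs.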
fs/fuse/fuse_trace.h | 295 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/iomap_cache.c | 31 +++++
2 files changed, 325 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 1f900580b14937..6072ef187f9215 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -315,6 +315,8 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
struct iomap_writepage_ctx;
struct iomap_ioend;
struct iomap;
+struct fuse_iext_cursor;
+struct fuse_iomap_lookup;
/* tracepoint boilerplate so we don't have to keep doing this */
#define FUSE_IOMAP_OPFLAGS_FIELD \
@@ -345,6 +347,16 @@ struct iomap;
__entry->prefix##addr, \
__print_flags(__entry->prefix##flags, "|", FUSE_IOMAP_F_STRINGS)
+#define FUSE_IOMAP_IODIR_FIELD \
+ __field(enum fuse_iomap_iodir, iodir)
+
+#define FUSE_IOMAP_IODIR_FMT \
+ " iodir %s"
+
+#define FUSE_IOMAP_IODIR_PRINTK_ARGS \
+ __print_symbolic(__entry->iodir, FUSE_IOMAP_FORK_STRINGS)
+
+
/* combinations of boilerplate to reduce typing further */
#define FUSE_IOMAP_OP_FIELDS(prefix) \
FUSE_INODE_FIELDS \
@@ -414,6 +426,7 @@ TRACE_DEFINE_ENUM(FUSE_I_BTIME);
TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
TRACE_DEFINE_ENUM(FUSE_I_ATOMIC);
+TRACE_DEFINE_ENUM(FUSE_I_IOMAP_CACHE);
#define FUSE_IFLAG_STRINGS \
{ 1 << FUSE_I_ADVISE_RDPLUS, "advise_rdplus" }, \
@@ -423,7 +436,8 @@ TRACE_DEFINE_ENUM(FUSE_I_ATOMIC);
{ 1 << FUSE_I_BTIME, "btime" }, \
{ 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \
{ 1 << FUSE_I_IOMAP, "iomap" }, \
- { 1 << FUSE_I_ATOMIC, "atomic" }
+ { 1 << FUSE_I_ATOMIC, "atomic" }, \
+ { 1 << FUSE_I_IOMAP_CACHE, "iomap_cache" }
#define IOMAP_IOEND_STRINGS \
{ IOMAP_IOEND_SHARED, "shared" }, \
@@ -439,6 +453,22 @@ TRACE_DEFINE_ENUM(FUSE_I_ATOMIC);
{ FUSE_IOMAP_CONFIG_TIME, "time" }, \
{ FUSE_IOMAP_CONFIG_MAXBYTES, "maxbytes" }
+TRACE_DEFINE_ENUM(READ_MAPPING);
+TRACE_DEFINE_ENUM(WRITE_MAPPING);
+
+#define FUSE_IOMAP_FORK_STRINGS \
+ { READ_MAPPING, "read" }, \
+ { WRITE_MAPPING, "write" }
+
+#define FUSE_IEXT_STATE_STRINGS \
+ { FUSE_IEXT_LEFT_CONTIG, "l_cont" }, \
+ { FUSE_IEXT_RIGHT_CONTIG, "r_cont" }, \
+ { FUSE_IEXT_LEFT_FILLING, "l_fill" }, \
+ { FUSE_IEXT_RIGHT_FILLING, "r_fill" }, \
+ { FUSE_IEXT_LEFT_VALID, "l_valid" }, \
+ { FUSE_IEXT_RIGHT_VALID, "r_valid" }, \
+ { FUSE_IEXT_WRITE_MAPPING, "write" }
+
DECLARE_EVENT_CLASS(fuse_iomap_check_class,
TP_PROTO(const char *func, int line, const char *condition),
@@ -1178,6 +1208,269 @@ DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_read);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
+
+DECLARE_EVENT_CLASS(fuse_iext_class,
+ TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur,
+ int state, unsigned long caller_ip),
+
+ TP_ARGS(inode, cur, state, caller_ip),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(void *, leaf)
+ __field(int, pos)
+ __field(int, iext_state)
+ __field(unsigned long, caller_ip)
+ ),
+ TP_fast_assign(
+ const struct fuse_ifork *ifp;
+ struct fuse_iomap_io r = { };
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+
+ if (state & FUSE_IEXT_WRITE_MAPPING)
+ ifp = fi->cache.im_write;
+ else
+ ifp = &fi->cache.im_read;
+ if (ifp)
+ fuse_iext_get_extent(ifp, cur, &r);
+
+ __entry->mapoffset = r.offset;
+ __entry->mapaddr = r.addr;
+ __entry->maplength = r.length;
+ __entry->mapdev = r.dev;
+ __entry->maptype = r.type;
+ __entry->mapflags = r.flags;
+
+ __entry->leaf = cur->leaf;
+ __entry->pos = cur->pos;
+
+ __entry->iext_state = state;
+ __entry->caller_ip = caller_ip;
+ ),
+ TP_printk(FUSE_INODE_FMT " state (%s) cur %p/%d " FUSE_IOMAP_MAP_FMT() " caller %pS",
+ FUSE_INODE_PRINTK_ARGS,
+ __print_flags(__entry->iext_state, "|", FUSE_IEXT_STATE_STRINGS),
+ __entry->leaf,
+ __entry->pos,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ (void *)__entry->caller_ip)
+)
+
+#define DEFINE_IEXT_EVENT(name) \
+DEFINE_EVENT(fuse_iext_class, name, \
+ TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur, \
+ int state, unsigned long caller_ip), \
+ TP_ARGS(inode, cur, state, caller_ip))
+DEFINE_IEXT_EVENT(fuse_iext_insert);
+DEFINE_IEXT_EVENT(fuse_iext_remove);
+DEFINE_IEXT_EVENT(fuse_iext_pre_update);
+DEFINE_IEXT_EVENT(fuse_iext_post_update);
+
+DECLARE_EVENT_CLASS(fuse_iext_update_class,
+ TP_PROTO(const struct inode *inode, uint32_t iext_state,
+ const struct fuse_iomap_io *map),
+ TP_ARGS(inode, iext_state, map),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(uint32_t, iext_state)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ __entry->mapdev = map->dev;
+ __entry->mapaddr = map->addr;
+
+ __entry->iext_state = iext_state;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " state (%s)" FUSE_IOMAP_MAP_FMT(),
+ FUSE_INODE_PRINTK_ARGS,
+ __print_flags(__entry->iext_state, "|", FUSE_IEXT_STATE_STRINGS),
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+#define DEFINE_IEXT_UPDATE_EVENT(name) \
+DEFINE_EVENT(fuse_iext_update_class, name, \
+ TP_PROTO(const struct inode *inode, uint32_t iext_state, \
+ const struct fuse_iomap_io *map), \
+ TP_ARGS(inode, iext_state, map))
+DEFINE_IEXT_UPDATE_EVENT(fuse_iext_del_mapping);
+DEFINE_IEXT_UPDATE_EVENT(fuse_iext_add_mapping);
+
+DECLARE_EVENT_CLASS(fuse_iext_alt_update_class,
+ TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map),
+ TP_ARGS(inode, map),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ __entry->mapdev = map->dev;
+ __entry->mapaddr = map->addr;
+ ),
+
+ TP_printk(FUSE_INODE_FMT FUSE_IOMAP_MAP_FMT(),
+ FUSE_INODE_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+#define DEFINE_IEXT_ALT_UPDATE_EVENT(name) \
+DEFINE_EVENT(fuse_iext_alt_update_class, name, \
+ TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map), \
+ TP_ARGS(inode, map))
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_del_mapping_got);
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_left);
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_right);
+
+TRACE_EVENT(fuse_iomap_cache_remove,
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t offset, uint64_t length, unsigned long caller_ip),
+ TP_ARGS(inode, iodir, offset, length, caller_ip),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ FUSE_IOMAP_IODIR_FIELD
+ __field(unsigned long, caller_ip)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->iodir = iodir;
+ __entry->offset = offset;
+ __entry->length = length;
+ __entry->caller_ip = caller_ip;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT " caller %pS",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ FUSE_IOMAP_IODIR_PRINTK_ARGS,
+ (void *)__entry->caller_ip)
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_cached_mapping_class,
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *map, unsigned long caller_ip),
+ TP_ARGS(inode, iodir, map, caller_ip),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_IODIR_FIELD
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(unsigned long, caller_ip)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->iodir = iodir;
+
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ __entry->mapdev = map->dev;
+ __entry->mapaddr = map->addr;
+
+ __entry->caller_ip = caller_ip;
+ ),
+
+ TP_printk(FUSE_INODE_FMT FUSE_IOMAP_IODIR_FMT FUSE_IOMAP_MAP_FMT() " caller %pS",
+ FUSE_INODE_PRINTK_ARGS,
+ FUSE_IOMAP_IODIR_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ (void *)__entry->caller_ip)
+);
+#define DEFINE_FUSE_IOMAP_CACHED_MAPPING_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_cached_mapping_class, name, \
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir, \
+ const struct fuse_iomap_io *map, unsigned long caller_ip), \
+ TP_ARGS(inode, iodir, map, caller_ip))
+DEFINE_FUSE_IOMAP_CACHED_MAPPING_EVENT(fuse_iomap_cache_add);
+DEFINE_FUSE_IOMAP_CACHED_MAPPING_EVENT(fuse_iext_check_mapping);
+
+TRACE_EVENT(fuse_iomap_cache_lookup,
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t pos, uint64_t count, unsigned long caller_ip),
+ TP_ARGS(inode, iodir, pos, count, caller_ip),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ FUSE_IOMAP_IODIR_FIELD
+ __field(unsigned long, caller_ip)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->iodir = iodir;
+ __entry->offset = pos;
+ __entry->length = count;
+ __entry->caller_ip = caller_ip;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT " caller %pS",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ FUSE_IOMAP_IODIR_PRINTK_ARGS,
+ (void *)__entry->caller_ip)
+);
+
+TRACE_EVENT(fuse_iomap_cache_lookup_result,
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t pos, uint64_t count, const struct fuse_iomap_io *got,
+ const struct fuse_iomap_lookup *map),
+ TP_ARGS(inode, iodir, pos, count, got, map),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+
+ FUSE_IOMAP_MAP_FIELDS(got)
+ FUSE_IOMAP_MAP_FIELDS(map)
+
+ FUSE_IOMAP_IODIR_FIELD
+ __field(uint64_t, validity_cookie)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->iodir = iodir;
+ __entry->offset = pos;
+ __entry->length = count;
+
+ __entry->gotoffset = got->offset;
+ __entry->gotlength = got->length;
+ __entry->gottype = got->type;
+ __entry->gotflags = got->flags;
+ __entry->gotdev = got->dev;
+ __entry->gotaddr = got->addr;
+
+ __entry->mapoffset = map->map.offset;
+ __entry->maplength = map->map.length;
+ __entry->maptype = map->map.type;
+ __entry->mapflags = map->map.flags;
+ __entry->mapdev = map->map.dev;
+ __entry->mapaddr = map->map.addr;
+
+ __entry->validity_cookie = map->validity_cookie;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT FUSE_IOMAP_MAP_FMT("map") FUSE_IOMAP_MAP_FMT("got") " cookie 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ FUSE_IOMAP_IODIR_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ FUSE_IOMAP_MAP_PRINTK_ARGS(got),
+ __entry->validity_cookie)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
index 1fec9dcc6d3922..5bfa0e26346d1f 100644
--- a/fs/fuse/iomap_cache.c
+++ b/fs/fuse/iomap_cache.c
@@ -717,6 +717,7 @@ fuse_iext_insert(
struct fuse_ifork *ifp = fuse_iext_state_to_fork(ip, state);
fuse_iext_insert_raw(ip, ifp, cur, irec);
+ trace_fuse_iext_insert(VFS_I(ip), cur, state, _RET_IP_);
}
static struct fuse_iext_node *
@@ -920,6 +921,8 @@ fuse_iext_remove(
loff_t offset = fuse_iext_leaf_key(leaf, 0);
int i, nr_entries;
+ trace_fuse_iext_remove(VFS_I(ip), cur, state, _RET_IP_);
+
ASSERT(ifp->if_height > 0);
ASSERT(ifp->if_data != NULL);
ASSERT(fuse_iext_valid(ifp, cur));
@@ -1042,7 +1045,9 @@ fuse_iext_update_extent(
}
}
+ trace_fuse_iext_pre_update(VFS_I(ip), cur, state, _RET_IP_);
fuse_iext_set(cur_rec(cur), new);
+ trace_fuse_iext_post_update(VFS_I(ip), cur, state, _RET_IP_);
}
/*
@@ -1150,17 +1155,25 @@ static void fuse_iext_check_mappings(struct inode *inode,
struct fuse_iext_cursor icur;
struct fuse_iomap_io prev, got;
unsigned long long nr = 0;
+ enum fuse_iomap_iodir iodir;
if (!ifp || !static_branch_unlikely(&fuse_iomap_debug))
return;
+ if (ifp == ip->im_write)
+ iodir = WRITE_MAPPING;
+ else
+ iodir = READ_MAPPING;
+
fuse_iext_first(ifp, &icur);
if (!fuse_iext_get_extent(ifp, &icur, &prev))
return;
+ trace_fuse_iext_check_mapping(inode, iodir, &prev, _RET_IP_);
nr++;
fuse_iext_next(ifp, &icur);
while (fuse_iext_get_extent(ifp, &icur, &got)) {
+ trace_fuse_iext_check_mapping(inode, iodir, &got, _RET_IP_);
if (got.length == 0 ||
got.offset < prev.offset + prev.length ||
fuse_iomap_can_merge(&prev, &got)) {
@@ -1219,6 +1232,9 @@ fuse_iext_del_mapping(
if (got_endoff == del_endoff)
state |= FUSE_IEXT_RIGHT_FILLING;
+ trace_fuse_iext_del_mapping(VFS_I(ip), state, del);
+ trace_fuse_iext_del_mapping_got(VFS_I(ip), got);
+
switch (state & (FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING)) {
case FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING:
/*
@@ -1283,6 +1299,8 @@ fuse_iomap_cache_remove(
assert_cache_locked(ip);
+ trace_fuse_iomap_cache_remove(inode, iodir, start, len, _RET_IP_);
+
if (!ifp || fuse_iext_count(ifp) == 0)
return 0;
@@ -1427,6 +1445,12 @@ fuse_iext_add_mapping(
fuse_iomap_can_merge3(&left, new, &right)))
state |= FUSE_IEXT_RIGHT_CONTIG;
+ trace_fuse_iext_add_mapping(VFS_I(ip), state, new);
+ if (state & FUSE_IEXT_LEFT_VALID)
+ trace_fuse_iext_add_mapping_left(VFS_I(ip), &left);
+ if (state & FUSE_IEXT_RIGHT_VALID)
+ trace_fuse_iext_add_mapping_right(VFS_I(ip), &right);
+
/*
* Select which case we're in here, and implement it.
*/
@@ -1495,6 +1519,8 @@ fuse_iomap_cache_add(
ASSERT(new->length > 0);
ASSERT(new->offset < inode->i_sb->s_maxbytes);
+ trace_fuse_iomap_cache_add(inode, iodir, new, _RET_IP_);
+
if (!ifp) {
ifp = kzalloc(sizeof(struct fuse_ifork),
GFP_KERNEL | __GFP_NOFAIL);
@@ -1599,6 +1625,8 @@ fuse_iomap_cache_lookup(
assert_cache_locked_shared(ip);
+ trace_fuse_iomap_cache_lookup(inode, iodir, off, len, _RET_IP_);
+
if (!ifp) {
/*
* No write fork at all means this filesystem doesn't do out of
@@ -1625,5 +1653,8 @@ fuse_iomap_cache_lookup(
/* Found a mapping in the cache, return it */
fuse_iomap_trim(fi, mval, &got, off, len);
+
+ trace_fuse_iomap_cache_lookup_result(inode, iodir, off, len, &got,
+ mval);
return LOOKUP_HIT;
}
* [PATCH 03/10] fuse: use the iomap cache for iomap_begin
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-09-16 0:38 ` [PATCH 01/10] fuse: cache iomaps Darrick J. Wong
2025-09-16 0:38 ` [PATCH 02/10] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:39 ` Darrick J. Wong
2025-09-16 0:39 ` [PATCH 04/10] fuse_trace: " Darrick J. Wong
` (6 subsequent siblings)
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:39 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
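Try to satisfy iomap_begin from the iomap cache before upcalling the
fuse server. If the server replies with a FUSE_IOMAP_TYPE_RETRY_CACHE
mapping, look in the cache again.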
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
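The lookup order described in the commit message can be illustrated with a
small user-space model. This is only a sketch, not the kernel code: every
demo_* name is hypothetical, the "cache" is a single boolean, and
pending_invals simulates other threads invalidating the cache after the lock
was dropped. The real fuse_iomap_begin also checks signal_pending() before
each retry, which this model omits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for "server sent a mapping" and "retry the cache". */
enum demo_reply { DEMO_REPLY_MAPPING, DEMO_REPLY_RETRY_CACHE };
enum demo_source { DEMO_FROM_CACHE, DEMO_FROM_SERVER };

struct demo_state {
	bool cached;		/* cache currently holds a usable mapping */
	bool server_uploads;	/* server answers by populating the cache */
	int pending_invals;	/* simulated invalidations racing with us */
	int server_calls;	/* number of upcalls made so far */
};

/* Returns true on a cache hit; a racing invalidation turns a hit into a miss. */
static bool demo_try_cache(struct demo_state *s)
{
	if (s->cached && s->pending_invals > 0) {
		s->pending_invals--;
		s->cached = false;
	}
	return s->cached;
}

/* Model of the upcall: either hand back a mapping or populate the cache. */
static enum demo_reply demo_call_server(struct demo_state *s)
{
	s->server_calls++;
	if (s->server_uploads) {
		s->cached = true;
		return DEMO_REPLY_RETRY_CACHE;
	}
	return DEMO_REPLY_MAPPING;
}

/* Cache first, then the server; on "retry cache", look in the cache again. */
static enum demo_source demo_iomap_begin(struct demo_state *s)
{
	assert(s->pending_invals >= 0);

	if (demo_try_cache(s))
		return DEMO_FROM_CACHE;

	for (;;) {
		if (demo_call_server(s) != DEMO_REPLY_RETRY_CACHE)
			return DEMO_FROM_SERVER;
		if (demo_try_cache(s))
			return DEMO_FROM_CACHE;
		/* The cache was invalidated under us: ask the server again. */
	}
}
```

In this model, two racing invalidations cost two extra upcalls before the
lookup finally hits, which mirrors the "if the cache misses, we'll call the
server again" comment in the hunk below.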
fs/fuse/iomap_priv.h | 5 +
fs/fuse/file_iomap.c | 216 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/iomap_cache.c | 6 +
3 files changed, 221 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
index 8e4a32879025a4..8f1aef381942b6 100644
--- a/fs/fuse/iomap_priv.h
+++ b/fs/fuse/iomap_priv.h
@@ -145,6 +145,11 @@ static inline bool fuse_iext_peek_prev_extent(struct fuse_ifork *ifp,
fuse_iext_get_extent((ifp), (ext), (got)); \
fuse_iext_next((ifp), (ext)))
+/* iomaps that come directly from the fuse server are presumed to be valid */
+#define FUSE_IOMAP_ALWAYS_VALID ((uint64_t)0)
+/* set initial iomap cookie value to avoid ALWAYS_VALID */
+#define FUSE_IOMAP_INIT_COOKIE ((uint64_t)1)
+
static inline uint64_t fuse_iext_read_seq(struct fuse_iomap_cache *ip)
{
return (uint64_t)READ_ONCE(ip->im_seq);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index d35e69d03b0940..47c82ec29238e3 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -165,6 +165,7 @@ static inline bool fuse_iomap_check_type(uint16_t fuse_type)
case FUSE_IOMAP_TYPE_UNWRITTEN:
case FUSE_IOMAP_TYPE_INLINE:
case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+ case FUSE_IOMAP_TYPE_RETRY_CACHE:
return true;
}
@@ -270,9 +271,14 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
const unsigned int blocksize = i_blocksize(inode);
uint64_t end;
- /* Type and flags must be known */
+ /*
+ * Type and flags must be known. Mapping type "retry cache" doesn't
+ * use any of the other fields.
+ */
if (BAD_DATA(!fuse_iomap_check_type(map->type)))
return false;
+ if (map->type == FUSE_IOMAP_TYPE_RETRY_CACHE)
+ return true;
if (BAD_DATA(!fuse_iomap_check_flags(map->flags)))
return false;
@@ -303,6 +309,14 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
if (BAD_DATA(map->addr == FUSE_IOMAP_NULL_ADDR))
return false;
break;
+ case FUSE_IOMAP_TYPE_RETRY_CACHE:
+ /*
+ * We only accept cache retries if we have a cache to query.
+ * There must not be a device addr.
+ */
+ if (BAD_DATA(!fuse_inode_caches_iomaps(inode)))
+ return false;
+ fallthrough;
case FUSE_IOMAP_TYPE_DELALLOC:
case FUSE_IOMAP_TYPE_HOLE:
case FUSE_IOMAP_TYPE_INLINE:
@@ -568,6 +582,149 @@ static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags,
return 0;
}
+/* Convert a mapping from the cache into something the kernel can use */
+static int fuse_iomap_from_cache(struct inode *inode, struct iomap *iomap,
+ const struct fuse_iomap_lookup *lmap)
+{
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_backing *fb;
+
+ fb = fuse_iomap_find_dev(fm->fc, &lmap->map);
+ if (IS_ERR(fb))
+ return PTR_ERR(fb);
+
+ fuse_iomap_from_server(inode, iomap, fb, &lmap->map);
+ iomap->validity_cookie = lmap->validity_cookie;
+
+ fuse_backing_put(fb);
+ return 0;
+}
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static inline int
+fuse_iomap_cached_validate(const struct inode *inode,
+ enum fuse_iomap_iodir dir,
+ const struct fuse_iomap_lookup *lmap)
+{
+ if (!static_branch_unlikely(&fuse_iomap_debug))
+ return 0;
+
+ /* Make sure the mappings aren't garbage */
+ if (!fuse_iomap_check_mapping(inode, &lmap->map, dir))
+ return -EFSCORRUPTED;
+
+ /* The cache should not be storing "retry cache" mappings */
+ if (BAD_DATA(lmap->map.type == FUSE_IOMAP_TYPE_RETRY_CACHE))
+ return -EFSCORRUPTED;
+
+ return 0;
+}
+#else
+# define fuse_iomap_cached_validate(...) (0)
+#endif
+
+/*
+ * Look up iomappings from the cache. Returns 1 if iomap and srcmap were
+ * satisfied from cache; 0 if not; or a negative errno.
+ */
+static int fuse_iomap_try_cache(struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct fuse_iomap_lookup lmap;
+ struct iomap *dest = iomap;
+ enum fuse_iomap_lookup_result res;
+ int ret;
+
+ if (!fuse_inode_caches_iomaps(inode))
+ return 0;
+
+ fuse_iomap_cache_lock_shared(inode);
+
+ if (fuse_is_iomap_file_write(opflags)) {
+ res = fuse_iomap_cache_lookup(inode, WRITE_MAPPING, pos, count,
+ &lmap);
+ switch (res) {
+ case LOOKUP_HIT:
+ ret = fuse_iomap_cached_validate(inode, WRITE_MAPPING,
+ &lmap);
+ if (ret)
+ goto out_unlock;
+
+ if (lmap.map.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+ ret = fuse_iomap_from_cache(inode, dest, &lmap);
+ if (ret)
+ goto out_unlock;
+
+ dest = srcmap;
+ }
+ fallthrough;
+ case LOOKUP_NOFORK:
+ /* move on to the read fork */
+ break;
+ case LOOKUP_MISS:
+ ret = 0;
+ goto out_unlock;
+ }
+ }
+
+ res = fuse_iomap_cache_lookup(inode, READ_MAPPING, pos, count, &lmap);
+ switch (res) {
+ case LOOKUP_HIT:
+ break;
+ case LOOKUP_NOFORK:
+ ASSERT(res != LOOKUP_NOFORK);
+ ret = -EFSCORRUPTED;
+ goto out_unlock;
+ case LOOKUP_MISS:
+ ret = 0;
+ goto out_unlock;
+ }
+
+ ret = fuse_iomap_cached_validate(inode, READ_MAPPING, &lmap);
+ if (ret)
+ goto out_unlock;
+
+ ret = fuse_iomap_from_cache(inode, dest, &lmap);
+ if (ret)
+ goto out_unlock;
+
+ if (fuse_is_iomap_file_write(opflags)) {
+ switch (iomap->type) {
+ case IOMAP_HOLE:
+ if (opflags & (IOMAP_ZERO | IOMAP_UNSHARE))
+ ret = 1;
+ else
+ ret = 0;
+ break;
+ case IOMAP_DELALLOC:
+ if (opflags & IOMAP_DIRECT)
+ ret = 0;
+ else
+ ret = 1;
+ break;
+ default:
+ ret = 1;
+ break;
+ }
+ } else {
+ ret = 1;
+ }
+
+out_unlock:
+ fuse_iomap_cache_unlock_shared(inode);
+ if (ret < 1)
+ return ret;
+
+ if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) {
+ ret = fuse_iomap_set_inline(inode, opflags, pos, count, iomap,
+ srcmap);
+ if (ret)
+ return ret;
+ }
+ return 1;
+}
+
static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
unsigned opflags, struct iomap *iomap,
struct iomap *srcmap)
@@ -588,6 +745,21 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
trace_fuse_iomap_begin(inode, pos, count, opflags);
+ /*
+ * Try to read mappings from the cache; if we find something then use
+ * it; otherwise we upcall the fuse server. For atomic writes we must
+ * always query the server.
+ */
+ if (!(opflags & FUSE_IOMAP_OP_ATOMIC)) {
+ err = fuse_iomap_try_cache(inode, pos, count, opflags, iomap,
+ srcmap);
+ if (err < 0)
+ return err;
+ if (err == 1)
+ return 0;
+ }
+
+retry:
args.opcode = FUSE_IOMAP_BEGIN;
args.nodeid = get_node_id(inode);
args.in_numargs = 1;
@@ -609,6 +781,24 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
if (err)
return err;
+ /*
+ * If the fuse server tells us it populated the cache, we'll try the
+ * cache lookup again. Note that we dropped the cache lock, so it's
+ * entirely possible that another thread could have invalidated the
+ * cache -- if the cache misses, we'll call the server again.
+ */
+ if (outarg.read.type == FUSE_IOMAP_TYPE_RETRY_CACHE) {
+ err = fuse_iomap_try_cache(inode, pos, count, opflags, iomap,
+ srcmap);
+ if (err < 0)
+ return err;
+ if (err == 1)
+ return 0;
+ if (signal_pending(current))
+ return -EINTR;
+ goto retry;
+ }
+
read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read);
if (IS_ERR(read_dev))
return PTR_ERR(read_dev);
@@ -636,6 +826,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
*/
fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
}
+ iomap->validity_cookie = FUSE_IOMAP_ALWAYS_VALID;
+ srcmap->validity_cookie = FUSE_IOMAP_ALWAYS_VALID;
if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) {
err = fuse_iomap_set_inline(inode, opflags, pos, count, iomap,
@@ -1338,7 +1530,21 @@ static const struct iomap_dio_ops fuse_iomap_dio_write_ops = {
.end_io = fuse_iomap_dio_write_end_io,
};
+static bool fuse_iomap_revalidate(struct inode *inode,
+ const struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ uint64_t validity_cookie;
+
+ if (iomap->validity_cookie == FUSE_IOMAP_ALWAYS_VALID)
+ return true;
+
+ validity_cookie = fuse_iext_read_seq(&fi->cache);
+ return iomap->validity_cookie == validity_cookie;
+}
+
static const struct iomap_write_ops fuse_iomap_write_ops = {
+ .iomap_valid = fuse_iomap_revalidate,
};
static int
@@ -1606,14 +1812,14 @@ static void fuse_iomap_end_bio(struct bio *bio)
* mapping is valid, false otherwise.
*/
static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc,
+ struct inode *inode,
loff_t offset)
{
if (offset < wpc->iomap.offset ||
offset >= wpc->iomap.offset + wpc->iomap.length)
return false;
- /* XXX actually use revalidation cookie */
- return true;
+ return fuse_iomap_revalidate(inode, &wpc->iomap);
}
/*
@@ -1667,7 +1873,7 @@ static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
trace_fuse_iomap_writeback_range(inode, offset, len, end_pos);
- if (!fuse_iomap_revalidate_writeback(wpc, offset)) {
+ if (!fuse_iomap_revalidate_writeback(wpc, inode, offset)) {
ret = fuse_iomap_begin(inode, offset, len,
FUSE_IOMAP_OP_WRITEBACK,
&write_iomap, &dontcare);
@@ -1804,7 +2010,7 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
mapping_set_folio_min_order(inode->i_mapping, min_order);
memset(&fi->cache.im_read, 0, sizeof(fi->cache.im_read));
- fi->cache.im_seq = 0;
+ fi->cache.im_seq = FUSE_IOMAP_INIT_COOKIE;
fi->cache.im_write = NULL;
init_rwsem(&fi->cache.im_lock);
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
index 5bfa0e26346d1f..572bccf99a97a8 100644
--- a/fs/fuse/iomap_cache.c
+++ b/fs/fuse/iomap_cache.c
@@ -660,7 +660,11 @@ fuse_iext_realloc_root(
*/
static inline void fuse_iext_inc_seq(struct fuse_iomap_cache *ip)
{
- WRITE_ONCE(ip->im_seq, READ_ONCE(ip->im_seq) + 1);
+ uint64_t new_val = READ_ONCE(ip->im_seq) + 1;
+
+ if (new_val == FUSE_IOMAP_ALWAYS_VALID)
+ new_val++;
+ WRITE_ONCE(ip->im_seq, new_val);
}
void
* [PATCH 04/10] fuse_trace: use the iomap cache for iomap_begin
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2025-09-16 0:39 ` [PATCH 03/10] fuse: use the iomap cache for iomap_begin Darrick J. Wong
@ 2025-09-16 0:39 ` Darrick J. Wong
2025-09-16 0:39 ` [PATCH 05/10] fuse: invalidate iomap cache after file updates Darrick J. Wong
` (5 subsequent siblings)
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:39 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
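The new tracepoint fires when a mapping's validity cookie no longer matches
the cache's sequence counter. As a rough user-space sketch of that cookie
scheme (the demo_* names are hypothetical; the constants copy the values of
FUSE_IOMAP_ALWAYS_VALID and FUSE_IOMAP_INIT_COOKIE from the previous patch,
and the kernel's READ_ONCE/WRITE_ONCE accessors are left out):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ALWAYS_VALID	((uint64_t)0)	/* mapping came from the server */
#define DEMO_INIT_COOKIE	((uint64_t)1)	/* first real sequence value */

struct demo_cache {
	uint64_t seq;	/* bumped on every change to the cached mappings */
};

static void demo_cache_init(struct demo_cache *c)
{
	c->seq = DEMO_INIT_COOKIE;
}

/* Advance the sequence counter, skipping the reserved "always valid" value. */
static void demo_cache_bump(struct demo_cache *c)
{
	uint64_t next = c->seq + 1;

	assert(c->seq != DEMO_ALWAYS_VALID);
	if (next == DEMO_ALWAYS_VALID)
		next++;
	c->seq = next;
}

/* A mapping is still usable if it is exempt or its cookie is current. */
static bool demo_mapping_valid(const struct demo_cache *c, uint64_t cookie)
{
	if (cookie == DEMO_ALWAYS_VALID)
		return true;
	return cookie == c->seq;
}
```

Reserving zero means a wrapped counter can never be mistaken for a mapping
that came straight from the server, at the cost of one skipped value per
wraparound.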
fs/fuse/fuse_trace.h | 34 ++++++++++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 7 ++++++-
2 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 6072ef187f9215..5f399b1604a2ac 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -400,6 +400,7 @@ struct fuse_iomap_lookup;
#define FUSE_IOMAP_TYPE_STRINGS \
{ FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \
+ { FUSE_IOMAP_TYPE_RETRY_CACHE, "retry" }, \
{ FUSE_IOMAP_TYPE_HOLE, "hole" }, \
{ FUSE_IOMAP_TYPE_DELALLOC, "delalloc" }, \
{ FUSE_IOMAP_TYPE_MAPPED, "mapped" }, \
@@ -1471,6 +1472,39 @@ TRACE_EVENT(fuse_iomap_cache_lookup_result,
FUSE_IOMAP_MAP_PRINTK_ARGS(got),
__entry->validity_cookie)
);
+
+TRACE_EVENT(fuse_iomap_invalid,
+ TP_PROTO(const struct inode *inode, const struct iomap *map,
+ uint64_t validity_cookie),
+ TP_ARGS(inode, map, validity_cookie),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(uint64_t, old_validity_cookie)
+ __field(uint64_t, validity_cookie)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ __entry->mapaddr = map->addr;
+ __entry->mapdev = FUSE_IOMAP_DEV_NULL;
+
+ __entry->old_validity_cookie = map->validity_cookie;
+ __entry->validity_cookie = validity_cookie;
+ ),
+
+ TP_printk(FUSE_INODE_FMT FUSE_IOMAP_MAP_FMT() " old_cookie 0x%llx new_cookie 0x%llx",
+ FUSE_INODE_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ __entry->old_validity_cookie,
+ __entry->validity_cookie)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 47c82ec29238e3..b568a862f120ff 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1540,7 +1540,12 @@ static bool fuse_iomap_revalidate(struct inode *inode,
return true;
validity_cookie = fuse_iext_read_seq(&fi->cache);
- return iomap->validity_cookie == validity_cookie;
+ if (unlikely(iomap->validity_cookie != validity_cookie)) {
+ trace_fuse_iomap_invalid(inode, iomap, validity_cookie);
+ return false;
+ }
+
+ return true;
}
static const struct iomap_write_ops fuse_iomap_write_ops = {
* [PATCH 05/10] fuse: invalidate iomap cache after file updates
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (3 preceding siblings ...)
2025-09-16 0:39 ` [PATCH 04/10] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:39 ` Darrick J. Wong
2025-09-16 0:39 ` [PATCH 06/10] fuse_trace: " Darrick J. Wong
` (4 subsequent siblings)
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:39 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
The kernel doesn't know what the fuse server might have done in response
to truncate, fallocate, or ioend events. Therefore, it must invalidate
the mapping cache after those operations to ensure cache coherency.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
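Invalidation works on whole filesystem blocks, so the byte range passed in is
widened before mappings are removed. The arithmetic in
fuse_iomap_cache_invalidate_range can be sketched in user space as follows.
The demo_* names are hypothetical, and the sketch assumes the block size is a
power of two, as the kernel's round_down()/round_up() macros also require:

```c
#include <assert.h>
#include <stdint.h>

/* Same value as FUSE_IOMAP_INVAL_TO_EOF: "everything from offset onward". */
#define DEMO_INVAL_TO_EOF	(~0ULL)

struct demo_range {
	uint64_t offset;
	uint64_t length;
};

/* Widen [offset, offset + length) outward to block boundaries. */
static struct demo_range demo_align_range(uint64_t offset, uint64_t length,
					  uint64_t blocksize)
{
	struct demo_range r;

	/* Assumption of this sketch: blocksize is a nonzero power of two. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

	r.offset = offset & ~(blocksize - 1);		/* round down */
	r.length = length;
	if (length != DEMO_INVAL_TO_EOF) {
		/* Cover the bytes we just added in front of the range... */
		r.length += offset - r.offset;
		/* ...then round the end up to the next block boundary. */
		r.length = (r.length + blocksize - 1) & ~(blocksize - 1);
	}
	return r;
}
```

For example, with 4096-byte blocks a request to invalidate 200 bytes at
offset 4000 straddles a block boundary, so the sketch widens it to offset 0
and length 8192. A length of DEMO_INVAL_TO_EOF is passed through untouched so
that it keeps its "to end of file" meaning.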
fs/fuse/fuse_i.h | 7 +++++++
fs/fuse/iomap_priv.h | 9 +++++++++
fs/fuse/dir.c | 6 ++++++
fs/fuse/file.c | 10 +++++++---
fs/fuse/file_iomap.c | 42 +++++++++++++++++++++++++++++++++++++++++-
fs/fuse/iomap_cache.c | 27 +++++++++++++++++++++++++++
6 files changed, 97 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 33b65253b2e9be..c6ec9383a99ce5 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1831,10 +1831,14 @@ int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma);
ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to);
ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from);
int fuse_iomap_setsize_start(struct inode *inode, loff_t newsize);
+int fuse_iomap_setsize_finish(struct inode *inode, loff_t newsize);
int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
loff_t length, loff_t new_size);
int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
loff_t endpos);
+void fuse_iomap_open_truncate(struct inode *inode);
+void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
+ size_t written);
int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support __user *argp);
@@ -1875,8 +1879,11 @@ enum fuse_iomap_iodir {
# define fuse_iomap_buffered_read(...) (-ENOSYS)
# define fuse_iomap_buffered_write(...) (-ENOSYS)
# define fuse_iomap_setsize_start(...) (-ENOSYS)
+# define fuse_iomap_setsize_finish(...) (-ENOSYS)
# define fuse_iomap_fallocate(...) (-ENOSYS)
# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
+# define fuse_iomap_open_truncate(...) ((void)0)
+# define fuse_iomap_copied_file_range(...) ((void)0)
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
# define fuse_iomap_dev_inval(...) (-ENOSYS)
# define fuse_iomap_fadvise NULL
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
index 8f1aef381942b6..e78c49af638e0f 100644
--- a/fs/fuse/iomap_priv.h
+++ b/fs/fuse/iomap_priv.h
@@ -177,6 +177,15 @@ fuse_iomap_cache_lookup(struct inode *inode, enum fuse_iomap_iodir iodir,
loff_t off, uint64_t len,
struct fuse_iomap_lookup *mval);
+int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset,
+ uint64_t length);
+static inline int fuse_iomap_cache_invalidate(struct inode *inode,
+ loff_t offset)
+{
+ return fuse_iomap_cache_invalidate_range(inode, offset,
+ FUSE_IOMAP_INVAL_TO_EOF);
+}
+
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _FS_FUSE_IOMAP_PRIV_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 9adaf262bda975..c7291d968ba89c 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2208,6 +2208,12 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
goto error;
}
+ if (is_iomap && is_truncate) {
+ err = fuse_iomap_setsize_finish(inode, outarg.attr.size);
+ if (err)
+ goto error;
+ }
+
spin_lock(&fi->lock);
/* the kernel maintains i_mtime locally */
if (trust_local_cmtime) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 0ed13082d0d00d..130395403535dd 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -279,9 +279,11 @@ static int fuse_open(struct inode *inode, struct file *file)
if ((is_wb_truncate || dax_truncate) && !is_iomap)
fuse_release_nowrite(inode);
if (!err) {
- if (is_truncate)
+ if (is_truncate) {
truncate_pagecache(inode, 0);
- else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
+ if (is_iomap)
+ fuse_iomap_open_truncate(inode);
+ } else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
invalidate_inode_pages2(inode->i_mapping);
}
if (dax_truncate)
@@ -3140,7 +3142,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
if (err)
goto out;
- if (!is_iomap)
+ if (is_iomap)
+ fuse_iomap_copied_file_range(inode_out, pos_out, outarg.size);
+ else
truncate_inode_pages_range(inode_out->i_mapping,
ALIGN_DOWN(pos_out, PAGE_SIZE),
ALIGN(pos_out + outarg.size, PAGE_SIZE) - 1);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index b568a862f120ff..b410cae0dec5dd 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -895,6 +895,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
fuse_iomap_inline_free(iomap);
if (err)
return err;
+ fuse_iomap_cache_invalidate_range(inode, pos, written);
} else {
fuse_iomap_inline_free(iomap);
}
@@ -1035,9 +1036,11 @@ static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
/*
* If there weren't any ioend errors, update the incore isize, which
- * confusingly takes the new i_size as "pos".
+ * confusingly takes the new i_size as "pos". Invalidate cached
+ * mappings for the file range that we just completed.
*/
fuse_write_update_attr(inode, pos + written, written);
+ fuse_iomap_cache_invalidate_range(inode, pos, written);
return 0;
}
@@ -2220,6 +2223,18 @@ fuse_iomap_setsize_start(
return filemap_write_and_wait(inode->i_mapping);
}
+int
+fuse_iomap_setsize_finish(
+ struct inode *inode,
+ loff_t newsize)
+{
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_setsize(inode, newsize, 0);
+
+ return fuse_iomap_cache_invalidate(inode, newsize);
+}
+
/*
* Prepare for a file data block remapping operation by flushing and unmapping
* all pagecache for the entire range.
@@ -2302,6 +2317,14 @@ fuse_iomap_fallocate(
trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size);
+ if (mode & (FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_INSERT_RANGE))
+ error = fuse_iomap_cache_invalidate(inode, offset);
+ else
+ error = fuse_iomap_cache_invalidate_range(inode, offset,
+ length);
+ if (error)
+ return error;
+
/*
* If we unmapped blocks from the file range, then we zero the
* pagecache for those regions and push them to disk rather than make
@@ -2319,6 +2342,8 @@ fuse_iomap_fallocate(
*/
if (new_size) {
error = fuse_iomap_setsize_start(inode, new_size);
+ if (!error)
+ error = fuse_iomap_setsize_finish(inode, new_size);
if (error)
return error;
@@ -2403,3 +2428,18 @@ int fuse_iomap_dev_inval(struct fuse_conn *fc,
up_read(&fc->killsb);
return ret;
}
+
+void fuse_iomap_open_truncate(struct inode *inode)
+{
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ fuse_iomap_cache_invalidate(inode, 0);
+}
+
+void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
+ size_t written)
+{
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ fuse_iomap_cache_invalidate_range(inode, offset, written);
+}
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
index 572bccf99a97a8..f1be73da571440 100644
--- a/fs/fuse/iomap_cache.c
+++ b/fs/fuse/iomap_cache.c
@@ -1412,6 +1412,33 @@ fuse_iomap_cache_remove(
return ret;
}
+int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset,
+ uint64_t length)
+{
+ loff_t aligned_offset;
+ const unsigned int blocksize = i_blocksize(inode);
+ int ret, ret2;
+
+ if (!fuse_inode_caches_iomaps(inode))
+ return 0;
+
+ aligned_offset = round_down(offset, blocksize);
+ if (length != FUSE_IOMAP_INVAL_TO_EOF) {
+ length += offset - aligned_offset;
+ length = round_up(length, blocksize);
+ }
+
+ fuse_iomap_cache_lock(inode);
+ ret = fuse_iomap_cache_remove(inode, READ_MAPPING,
+ aligned_offset, length);
+ ret2 = fuse_iomap_cache_remove(inode, WRITE_MAPPING,
+ aligned_offset, length);
+ fuse_iomap_cache_unlock(inode);
+ if (ret)
+ return ret;
+ return ret2;
+}
+
static void
fuse_iext_add_mapping(
struct fuse_iomap_cache *ip,
* [PATCH 06/10] fuse_trace: invalidate iomap cache after file updates
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (4 preceding siblings ...)
2025-09-16 0:39 ` [PATCH 05/10] fuse: invalidate iomap cache after file updates Darrick J. Wong
@ 2025-09-16 0:39 ` Darrick J. Wong
2025-09-16 0:40 ` [PATCH 07/10] fuse: enable iomap cache management Darrick J. Wong
` (3 subsequent siblings)
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:39 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 37 +++++++++++++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 4 ++++
fs/fuse/iomap_cache.c | 2 ++
3 files changed, 43 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 5f399b1604a2ac..1cfcc64de08817 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -1073,6 +1073,7 @@ DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down);
DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range);
DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize);
DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_flush_unmap_range);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_cache_invalidate_range);
TRACE_EVENT(fuse_iomap_fallocate,
TP_PROTO(const struct inode *inode, int mode, loff_t offset,
@@ -1210,6 +1211,42 @@ DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
+TRACE_EVENT(fuse_iomap_open_truncate,
+ TP_PROTO(const struct inode *inode),
+
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ ),
+
+ TP_printk(FUSE_INODE_FMT,
+ FUSE_INODE_PRINTK_ARGS)
+);
+
+TRACE_EVENT(fuse_iomap_copied_file_range,
+ TP_PROTO(const struct inode *inode, loff_t offset,
+ size_t written),
+ TP_ARGS(inode, offset, written),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = written;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
DECLARE_EVENT_CLASS(fuse_iext_class,
TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur,
int state, unsigned long caller_ip),
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index b410cae0dec5dd..c7b0026bff75f3 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -2433,6 +2433,8 @@ void fuse_iomap_open_truncate(struct inode *inode)
{
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_open_truncate(inode);
+
fuse_iomap_cache_invalidate(inode, 0);
}
@@ -2441,5 +2443,7 @@ void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
{
ASSERT(fuse_inode_has_iomap(inode));
+ trace_fuse_iomap_copied_file_range(inode, offset, written);
+
fuse_iomap_cache_invalidate_range(inode, offset, written);
}
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
index f1be73da571440..a13eb5eec72415 100644
--- a/fs/fuse/iomap_cache.c
+++ b/fs/fuse/iomap_cache.c
@@ -1422,6 +1422,8 @@ int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset,
if (!fuse_inode_caches_iomaps(inode))
return 0;
+ trace_fuse_iomap_cache_invalidate_range(inode, offset, length);
+
aligned_offset = round_down(offset, blocksize);
if (length != FUSE_IOMAP_INVAL_TO_EOF) {
length += offset - aligned_offset;
* [PATCH 07/10] fuse: enable iomap cache management
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (5 preceding siblings ...)
2025-09-16 0:39 ` [PATCH 06/10] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:40 ` Darrick J. Wong
2025-09-16 0:40 ` [PATCH 08/10] fuse_trace: " Darrick J. Wong
` (2 subsequent siblings)
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:40 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Provide a means for the fuse server to upload iomappings to the kernel
and invalidate them. This is how we enable iomap caching for better
performance. This is also required for correct synchronization between
pagecache writes and writeback.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
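To give a feel for the invalidation payload, here is a sketch of how a fuse
server might fill it in after shrinking a file on its side. The struct is a
local mirror of struct fuse_iomap_inval_out as it appears in this patch; the
demo_* names are hypothetical, and how the notification is actually delivered
to the kernel (for example through a library helper) is not shown because
this patch does not define it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same value as FUSE_IOMAP_INVAL_TO_EOF in the uapi header. */
#define DEMO_INVAL_TO_EOF	(~0ULL)

/* Local mirror of struct fuse_iomap_inval_out from this patch. */
struct demo_iomap_inval_out {
	uint64_t nodeid;	/* Inode ID */
	uint64_t attr_ino;	/* matches fuse_attr:ino */
	uint64_t read_offset;	/* range to invalidate read iomaps, bytes */
	uint64_t read_length;	/* can be DEMO_INVAL_TO_EOF */
	uint64_t write_offset;	/* range to invalidate write iomaps, bytes */
	uint64_t write_length;	/* can be DEMO_INVAL_TO_EOF */
};

/* Drop cached read and write mappings from @offset to the end of the file. */
static struct demo_iomap_inval_out
demo_inval_from(uint64_t nodeid, uint64_t attr_ino, uint64_t offset)
{
	struct demo_iomap_inval_out out;

	/* Assumption of this sketch: node ID 0 never names a real inode. */
	assert(nodeid != 0);

	memset(&out, 0, sizeof(out));
	out.nodeid = nodeid;
	out.attr_ino = attr_ino;
	out.read_offset = offset;
	out.read_length = DEMO_INVAL_TO_EOF;
	out.write_offset = offset;
	out.write_length = DEMO_INVAL_TO_EOF;
	return out;
}
```

Keeping separate read and write ranges lets a server drop only the mappings
it actually changed; this sketch invalidates both because a truncate affects
both.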
fs/fuse/fuse_i.h | 7 +
include/uapi/linux/fuse.h | 28 +++++
fs/fuse/dev.c | 44 ++++++++
fs/fuse/file_iomap.c | 239 ++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 314 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c6ec9383a99ce5..d42737bac0af88 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1858,6 +1858,11 @@ enum fuse_iomap_iodir {
READ_MAPPING,
WRITE_MAPPING,
};
+
+int fuse_iomap_upsert(struct fuse_conn *fc,
+ const struct fuse_iomap_upsert_out *outarg);
+int fuse_iomap_inval(struct fuse_conn *fc,
+ const struct fuse_iomap_inval_out *outarg);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1888,6 +1893,8 @@ enum fuse_iomap_iodir {
# define fuse_iomap_dev_inval(...) (-ENOSYS)
# define fuse_iomap_fadvise NULL
# define fuse_inode_caches_iomaps(...) (false)
+# define fuse_iomap_upsert(...) (-ENOSYS)
+# define fuse_iomap_inval(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d4a257517915fd..5c2c594fc87892 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -244,6 +244,8 @@
* - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
* - add FUSE_ATTR_{SYNC,IMMUTABLE,APPEND} for VFS enforcement of file
* attributes
+ * - add FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL so fuse servers
+ * can cache iomappings in the kernel
*/
#ifndef _LINUX_FUSE_H
@@ -715,6 +717,8 @@ enum fuse_notify_code {
FUSE_NOTIFY_RESEND = 7,
FUSE_NOTIFY_INC_EPOCH = 8,
FUSE_NOTIFY_IOMAP_DEV_INVAL = 9,
+ FUSE_NOTIFY_IOMAP_UPSERT = 10,
+ FUSE_NOTIFY_IOMAP_INVAL = 11,
FUSE_NOTIFY_CODE_MAX,
};
@@ -1360,6 +1364,8 @@ struct fuse_uring_cmd_req {
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255)
/* fuse-specific mapping type saying the server has populated the cache */
#define FUSE_IOMAP_TYPE_RETRY_CACHE (254)
+/* do not upsert this mapping */
+#define FUSE_IOMAP_TYPE_NOCACHE (253)
#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */
@@ -1505,4 +1511,26 @@ struct fuse_iomap_dev_inval_out {
/* invalidate all cached iomap mappings up to EOF */
#define FUSE_IOMAP_INVAL_TO_EOF (~0ULL)
+struct fuse_iomap_inval_out {
+ uint64_t nodeid; /* Inode ID */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+
+ uint64_t read_offset; /* range to invalidate read iomaps, bytes */
+ uint64_t read_length; /* can be FUSE_IOMAP_INVAL_TO_EOF */
+
+ uint64_t write_offset; /* range to invalidate write iomaps, bytes */
+ uint64_t write_length; /* can be FUSE_IOMAP_INVAL_TO_EOF */
+};
+
+struct fuse_iomap_upsert_out {
+ uint64_t nodeid; /* Inode ID */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+
+ /* read file data from here */
+ struct fuse_iomap_io read;
+
+ /* write file data to here, if applicable */
+ struct fuse_iomap_io write;
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index adbe2a65e6fe87..b144f67f06160f 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1892,6 +1892,46 @@ static int fuse_notify_iomap_dev_inval(struct fuse_conn *fc, unsigned int size,
return err;
}
+static int fuse_notify_iomap_upsert(struct fuse_conn *fc, unsigned int size,
+ struct fuse_copy_state *cs)
+{
+ struct fuse_iomap_upsert_out outarg;
+ int err = -EINVAL;
+
+ if (size != sizeof(outarg))
+ goto err;
+
+ err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+ if (err)
+ goto err;
+ fuse_copy_finish(cs);
+
+ return fuse_iomap_upsert(fc, &outarg);
+err:
+ fuse_copy_finish(cs);
+ return err;
+}
+
+static int fuse_notify_iomap_inval(struct fuse_conn *fc, unsigned int size,
+ struct fuse_copy_state *cs)
+{
+ struct fuse_iomap_inval_out outarg;
+ int err = -EINVAL;
+
+ if (size != sizeof(outarg))
+ goto err;
+
+ err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+ if (err)
+ goto err;
+ fuse_copy_finish(cs);
+
+ return fuse_iomap_inval(fc, &outarg);
+err:
+ fuse_copy_finish(cs);
+ return err;
+}
+
struct fuse_retrieve_args {
struct fuse_args_pages ap;
struct fuse_notify_retrieve_in inarg;
@@ -2140,6 +2180,10 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
case FUSE_NOTIFY_IOMAP_DEV_INVAL:
return fuse_notify_iomap_dev_inval(fc, size, cs);
+ case FUSE_NOTIFY_IOMAP_UPSERT:
+ return fuse_notify_iomap_upsert(fc, size, cs);
+ case FUSE_NOTIFY_IOMAP_INVAL:
+ return fuse_notify_iomap_inval(fc, size, cs);
default:
fuse_copy_finish(cs);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index c7b0026bff75f3..ff79a30f6ff8d2 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -166,6 +166,7 @@ static inline bool fuse_iomap_check_type(uint16_t fuse_type)
case FUSE_IOMAP_TYPE_INLINE:
case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
case FUSE_IOMAP_TYPE_RETRY_CACHE:
+ case FUSE_IOMAP_TYPE_NOCACHE:
return true;
}
@@ -272,12 +273,13 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
uint64_t end;
/*
- * Type and flags must be known. Mapping type "retry cache" doesn't
- * use any of the other fields.
+ * Type and flags must be known. Mapping types "retry cache" and "do
+ * not insert in cache" don't use any of the other fields.
*/
if (BAD_DATA(!fuse_iomap_check_type(map->type)))
return false;
- if (map->type == FUSE_IOMAP_TYPE_RETRY_CACHE)
+ if (map->type == FUSE_IOMAP_TYPE_RETRY_CACHE ||
+ map->type == FUSE_IOMAP_TYPE_NOCACHE)
return true;
if (BAD_DATA(!fuse_iomap_check_flags(map->flags)))
return false;
@@ -331,6 +333,9 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
if (BAD_DATA(iodir != WRITE_MAPPING))
return false;
break;
+ case FUSE_IOMAP_TYPE_NOCACHE:
+ /* We're ignoring this mapping */
+ break;
default:
/* should have been caught already */
ASSERT(0);
@@ -386,6 +391,15 @@ fuse_iomap_begin_validate(const struct inode *inode,
if (!fuse_iomap_check_mapping(inode, &outarg->write, WRITE_MAPPING))
return -EFSCORRUPTED;
+ /*
+ * ->iomap_begin requires real mappings or "retry from cache"; "do not
+ * add to cache" does not apply here.
+ */
+ if (BAD_DATA(outarg->read.type == FUSE_IOMAP_TYPE_NOCACHE))
+ return -EFSCORRUPTED;
+ if (BAD_DATA(outarg->write.type == FUSE_IOMAP_TYPE_NOCACHE))
+ return -EFSCORRUPTED;
+
/*
* Must have returned a mapping for at least the first byte in the
* range. The main mapping check already validated that the length
@@ -613,9 +627,11 @@ fuse_iomap_cached_validate(const struct inode *inode,
if (!fuse_iomap_check_mapping(inode, &lmap->map, dir))
return -EFSCORRUPTED;
- /* The cache should not be storing "retry cache" mappings */
+ /* The cache should not be storing cache management mappings */
if (BAD_DATA(lmap->map.type == FUSE_IOMAP_TYPE_RETRY_CACHE))
return -EFSCORRUPTED;
+ if (BAD_DATA(lmap->map.type == FUSE_IOMAP_TYPE_NOCACHE))
+ return -EFSCORRUPTED;
return 0;
}
@@ -2447,3 +2463,218 @@ void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
fuse_iomap_cache_invalidate_range(inode, offset, written);
}
+
+static inline bool
+fuse_iomap_upsert_validate_dev(
+ const struct fuse_backing *fb,
+ const struct fuse_iomap_io *map)
+{
+ uint64_t map_end;
+ sector_t device_bytes;
+
+ if (!fb) {
+ if (BAD_DATA(map->addr != FUSE_IOMAP_NULL_ADDR))
+ return false;
+
+ return true;
+ }
+
+ if (BAD_DATA(map->addr == FUSE_IOMAP_NULL_ADDR))
+ return false;
+
+ if (BAD_DATA(check_add_overflow(map->addr, map->length, &map_end)))
+ return false;
+
+ device_bytes = bdev_nr_sectors(fb->bdev) << SECTOR_SHIFT;
+ if (BAD_DATA(map_end > device_bytes))
+ return false;
+
+ return true;
+}
+
+/* Validate one of the incoming upsert mappings */
+static inline bool
+fuse_iomap_upsert_validate_mapping(struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *map)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_backing *fb;
+ bool ret;
+
+ if (!fuse_iomap_check_mapping(inode, map, iodir))
+ return false;
+
+ /*
+ * A "retry cache" instruction makes no sense when we're adding to
+ * the mapping cache.
+ */
+ if (BAD_DATA(map->type == FUSE_IOMAP_TYPE_RETRY_CACHE))
+ return false;
+
+ if (map->type == FUSE_IOMAP_TYPE_NOCACHE)
+ return true;
+
+ /* Make sure we can find the device */
+ fb = fuse_iomap_find_dev(fc, map);
+ if (IS_ERR(fb))
+ return false;
+
+ ret = fuse_iomap_upsert_validate_dev(fb, map);
+ fuse_backing_put(fb);
+ return ret;
+}
+
+/* Check the incoming upsert mappings to make sure they're not nonsense */
+static inline int
+fuse_iomap_upsert_validate(struct inode *inode,
+ const struct fuse_iomap_upsert_out *outarg)
+{
+ if (!fuse_iomap_upsert_validate_mapping(inode, READ_MAPPING,
+ &outarg->read))
+ return -EFSCORRUPTED;
+ if (!fuse_iomap_upsert_validate_mapping(inode, WRITE_MAPPING,
+ &outarg->write))
+ return -EFSCORRUPTED;
+
+ return 0;
+}
+
+int fuse_iomap_upsert(struct fuse_conn *fc,
+ const struct fuse_iomap_upsert_out *outarg)
+{
+ struct inode *inode;
+ struct fuse_inode *fi;
+ int ret;
+
+ if (!fc->iomap)
+ return -EINVAL;
+
+ down_read(&fc->killsb);
+ inode = fuse_ilookup(fc, outarg->nodeid, NULL);
+ if (!inode) {
+ ret = -ESTALE;
+ goto out_sb;
+ }
+
+ fi = get_fuse_inode(inode);
+ if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) {
+ ret = -EINVAL;
+ goto out_inode;
+ }
+
+ if (fuse_is_bad(inode)) {
+ ret = -EIO;
+ goto out_inode;
+ }
+
+ ret = fuse_iomap_upsert_validate(inode, outarg);
+ if (ret)
+ goto out_inode;
+
+ fuse_iomap_cache_lock(inode);
+
+ set_bit(FUSE_I_IOMAP_CACHE, &fi->state);
+
+ if (outarg->read.type != FUSE_IOMAP_TYPE_NOCACHE) {
+ ret = fuse_iomap_cache_upsert(inode, READ_MAPPING,
+ &outarg->read);
+ if (ret)
+ goto out_unlock;
+ }
+
+ if (outarg->write.type != FUSE_IOMAP_TYPE_NOCACHE) {
+ ret = fuse_iomap_cache_upsert(inode, WRITE_MAPPING,
+ &outarg->write);
+ if (ret)
+ goto out_unlock;
+ }
+
+out_unlock:
+ fuse_iomap_cache_unlock(inode);
+out_inode:
+ iput(inode);
+out_sb:
+ up_read(&fc->killsb);
+ return ret;
+}
+
+static inline bool fuse_iomap_inval_validate(const struct inode *inode,
+ uint64_t offset, uint64_t length)
+{
+ const unsigned int blocksize = i_blocksize(inode);
+
+ if (length == 0)
+ return true;
+
+ /* Range can't start beyond maxbytes */
+ if (BAD_DATA(offset >= inode->i_sb->s_maxbytes))
+ return false;
+
+ /* File range must be aligned to blocksize */
+ if (BAD_DATA(!IS_ALIGNED(offset, blocksize)))
+ return false;
+ if (length != FUSE_IOMAP_INVAL_TO_EOF &&
+ BAD_DATA(!IS_ALIGNED(length, blocksize)))
+ return false;
+
+ return true;
+}
+
+int fuse_iomap_inval(struct fuse_conn *fc,
+ const struct fuse_iomap_inval_out *outarg)
+{
+ struct inode *inode;
+ struct fuse_inode *fi;
+ int ret = 0, ret2 = 0;
+
+ if (!fc->iomap)
+ return -EINVAL;
+
+ down_read(&fc->killsb);
+ inode = fuse_ilookup(fc, outarg->nodeid, NULL);
+ if (!inode) {
+ ret = -ESTALE;
+ goto out_sb;
+ }
+
+ fi = get_fuse_inode(inode);
+ if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) {
+ ret = -EINVAL;
+ goto out_inode;
+ }
+
+ if (fuse_is_bad(inode)) {
+ ret = -EIO;
+ goto out_inode;
+ }
+
+ if (!fuse_iomap_inval_validate(inode, outarg->write_offset,
+ outarg->write_length)) {
+ ret = -EFSCORRUPTED;
+ goto out_inode;
+ }
+
+ if (!fuse_iomap_inval_validate(inode, outarg->read_offset,
+ outarg->read_length)) {
+ ret = -EFSCORRUPTED;
+ goto out_inode;
+ }
+
+ fuse_iomap_cache_lock(inode);
+ if (outarg->read_length)
+ ret2 = fuse_iomap_cache_remove(inode, READ_MAPPING,
+ outarg->read_offset,
+ outarg->read_length);
+ if (outarg->write_length)
+ ret = fuse_iomap_cache_remove(inode, WRITE_MAPPING,
+ outarg->write_offset,
+ outarg->write_length);
+ fuse_iomap_cache_unlock(inode);
+
+out_inode:
+ iput(inode);
+out_sb:
+ up_read(&fc->killsb);
+ return ret ? ret : ret2;
+}
* [PATCH 08/10] fuse_trace: enable iomap cache management
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (6 preceding siblings ...)
2025-09-16 0:40 ` [PATCH 07/10] fuse: enable iomap cache management Darrick J. Wong
@ 2025-09-16 0:40 ` Darrick J. Wong
2025-09-16 0:40 ` [PATCH 09/10] fuse: overlay iomap inode info in struct fuse_inode Darrick J. Wong
2025-09-16 0:41 ` [PATCH 10/10] fuse: enable iomap Darrick J. Wong
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:40 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add tracepoints for the previous patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 7 ++++-
2 files changed, 74 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 1cfcc64de08817..202fc32f6b02e1 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -401,6 +401,7 @@ struct fuse_iomap_lookup;
#define FUSE_IOMAP_TYPE_STRINGS \
{ FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \
{ FUSE_IOMAP_TYPE_RETRY_CACHE, "retry" }, \
+ { FUSE_IOMAP_TYPE_NOCACHE, "nocache" }, \
{ FUSE_IOMAP_TYPE_HOLE, "hole" }, \
{ FUSE_IOMAP_TYPE_DELALLOC, "delalloc" }, \
{ FUSE_IOMAP_TYPE_MAPPED, "mapped" }, \
@@ -742,6 +743,7 @@ DEFINE_EVENT(fuse_inode_state_class, name, \
TP_ARGS(inode))
DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_cache_enable);
TRACE_EVENT(fuse_iomap_fiemap,
TP_PROTO(const struct inode *inode, u64 start, u64 count,
@@ -1542,6 +1544,72 @@ TRACE_EVENT(fuse_iomap_invalid,
__entry->old_validity_cookie,
__entry->validity_cookie)
);
+
+TRACE_EVENT(fuse_iomap_upsert,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_upsert_out *outarg),
+ TP_ARGS(inode, outarg),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(uint64_t, attr_ino)
+
+ FUSE_IOMAP_MAP_FIELDS(read)
+ FUSE_IOMAP_MAP_FIELDS(write)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->attr_ino = outarg->attr_ino;
+ __entry->readoffset = outarg->read.offset;
+ __entry->readlength = outarg->read.length;
+ __entry->readaddr = outarg->read.addr;
+ __entry->readtype = outarg->read.type;
+ __entry->readflags = outarg->read.flags;
+ __entry->readdev = outarg->read.dev;
+ __entry->writeoffset = outarg->write.offset;
+ __entry->writelength = outarg->write.length;
+ __entry->writeaddr = outarg->write.addr;
+ __entry->writetype = outarg->write.type;
+ __entry->writeflags = outarg->write.flags;
+ __entry->writedev = outarg->write.dev;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " attr_ino 0x%llx" FUSE_IOMAP_MAP_FMT("read") FUSE_IOMAP_MAP_FMT("write"),
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->attr_ino,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(read),
+ FUSE_IOMAP_MAP_PRINTK_ARGS(write))
+);
+
+TRACE_EVENT(fuse_iomap_inval,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_inval_out *outarg),
+ TP_ARGS(inode, outarg),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(uint64_t, attr_ino)
+
+ FUSE_FILE_RANGE_FIELDS(read)
+ FUSE_FILE_RANGE_FIELDS(write)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->attr_ino = outarg->attr_ino;
+ __entry->readoffset = outarg->read_offset;
+ __entry->readlength = outarg->read_length;
+ __entry->writeoffset = outarg->write_offset;
+ __entry->writelength = outarg->write_length;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " attr_ino 0x%llx" FUSE_FILE_RANGE_FMT("read") FUSE_FILE_RANGE_FMT("write"),
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->attr_ino,
+ FUSE_FILE_RANGE_PRINTK_ARGS(read),
+ FUSE_FILE_RANGE_PRINTK_ARGS(write))
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index ff79a30f6ff8d2..c82434674fb52b 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -2557,6 +2557,8 @@ int fuse_iomap_upsert(struct fuse_conn *fc,
goto out_sb;
}
+ trace_fuse_iomap_upsert(inode, outarg);
+
fi = get_fuse_inode(inode);
if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) {
ret = -EINVAL;
@@ -2574,7 +2576,8 @@ int fuse_iomap_upsert(struct fuse_conn *fc,
fuse_iomap_cache_lock(inode);
- set_bit(FUSE_I_IOMAP_CACHE, &fi->state);
+ if (!test_and_set_bit(FUSE_I_IOMAP_CACHE, &fi->state))
+ trace_fuse_iomap_cache_enable(inode);
if (outarg->read.type != FUSE_IOMAP_TYPE_NOCACHE) {
ret = fuse_iomap_cache_upsert(inode, READ_MAPPING,
@@ -2638,6 +2641,8 @@ int fuse_iomap_inval(struct fuse_conn *fc,
goto out_sb;
}
+ trace_fuse_iomap_inval(inode, outarg);
+
fi = get_fuse_inode(inode);
if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) {
ret = -EINVAL;
* [PATCH 09/10] fuse: overlay iomap inode info in struct fuse_inode
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (7 preceding siblings ...)
2025-09-16 0:40 ` [PATCH 08/10] fuse_trace: " Darrick J. Wong
@ 2025-09-16 0:40 ` Darrick J. Wong
2025-09-16 0:41 ` [PATCH 10/10] fuse: enable iomap Darrick J. Wong
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:40 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
It's not possible for a regular file to use iomap mode and writeback
caching at the same time, so we can save some memory in struct
fuse_inode by overlaying them in the union. This is a separate patch
because C unions are rather unsafe and I prefer any errors to be
bisectable to this patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
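As a rough illustration of the memory argument in the commit message, here is a standalone C sketch of overlaying two mutually exclusive per-inode states in a union. None of these types exist in fuse: the sketch_* structs, their fields, and the explicit mode tag are assumptions made for the example, and the real struct fuse_inode has no such tag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two mutually exclusive states. */
struct sketch_writeback_state {
	void *queued_writes[2];
	long writectr;
};

struct sketch_iomap_state {
	int ioend_lock;
	void *ioend_list[2];
	void *cache_root;
};

enum sketch_mode { SKETCH_MODE_WRITEBACK, SKETCH_MODE_IOMAP };

/* Both states carried side by side: pays for both all the time. */
struct sketch_inode_separate {
	enum sketch_mode mode;
	struct sketch_writeback_state wb;
	struct sketch_iomap_state iomap;
};

/* Both states overlaid: pays only for the larger of the two. */
struct sketch_inode_overlaid {
	enum sketch_mode mode;
	union {
		struct sketch_writeback_state wb;
		struct sketch_iomap_state iomap;
	} u;
};

static_assert(sizeof(struct sketch_inode_overlaid) <
	      sizeof(struct sketch_inode_separate),
	      "overlaying the two states should shrink the struct");

/*
 * The hazard with unions is touching the member that is not live.
 * This accessor refuses to hand out the iomap state unless the tag
 * says the inode is in iomap mode.
 */
static struct sketch_iomap_state *
sketch_iomap_state(struct sketch_inode_overlaid *si)
{
	return si->mode == SKETCH_MODE_IOMAP ? &si->u.iomap : NULL;
}
```

The guarded accessor is one way to catch the kind of mistake that motivates keeping this change in its own bisectable patch; the patch itself relies on the two modes never being active at once.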
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d42737bac0af88..8238c1cfd9c481 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -197,8 +197,11 @@ struct fuse_inode {
/* waitq for direct-io completion */
wait_queue_head_t direct_io_waitq;
+ };
#ifdef CONFIG_FUSE_IOMAP
+ /* regular file iomap mode */
+ struct {
/* pending io completions */
spinlock_t ioend_lock;
struct work_struct ioend_work;
@@ -206,8 +209,8 @@ struct fuse_inode {
/* cached iomap mappings */
struct fuse_iomap_cache cache;
-#endif
};
+#endif
/* readdir cache (directory only) */
struct {
* [PATCH 10/10] fuse: enable iomap
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (8 preceding siblings ...)
2025-09-16 0:40 ` [PATCH 09/10] fuse: overlay iomap inode info in struct fuse_inode Darrick J. Wong
@ 2025-09-16 0:41 ` Darrick J. Wong
9 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:41 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Remove the guard that we used to avoid bisection problems.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file_iomap.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index c82434674fb52b..261490a322c289 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -104,9 +104,6 @@ void fuse_iomap_sysfs_cleanup(struct kobject *fuse_kobj)
bool fuse_iomap_enabled(void)
{
- /* Don't let anyone touch iomap until the end of the patchset. */
- return false;
-
/*
* There are fears that a fuse+iomap server could somehow DoS the
* system by doing things like going out to lunch during a writeback
* [PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap usage
2025-09-16 0:20 ` [PATCHSET RFC v5 8/8] fuse: run fuse servers as a contained service Darrick J. Wong
@ 2025-09-16 0:41 ` Darrick J. Wong
2025-09-16 0:41 ` [PATCH 2/2] fuse: set iomap backing device block size Darrick J. Wong
1 sibling, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:41 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
For the upcoming safemount functionality in libfuse, we will create a
privileged "mount.safe" helper that starts the fuse server in a
completely unprivileged systemd container. The mount helper will pass
the mount options and fds for /dev/fuse and any other files requested by
the fuse server into the container via a Unix socket.
Currently, the ability to turn on iomap for fuse depends on a module
parameter and the process that calls mount() having the CAP_SYS_RAWIO
capability. However, the unprivileged fuse server might want to query
the /dev/fuse fd for iomap capabilities before mount or FUSE_INIT so
that it can get ready.
Similar to FUSE_DEV_SYNC_INIT, add a new bit for iomap that can be
squirreled away in file->private_data and an ioctl to set that bit.
That way the privileged mount helper can pass its iomap privilege to the
contained fuse server without the fuse server needing to have
CAP_SYS_RAWIO.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_dev_i.h | 32 +++++++++++++++++++++++++++++---
fs/fuse/fuse_i.h | 9 +++++++++
include/uapi/linux/fuse.h | 1 +
fs/fuse/dev.c | 11 +++++------
fs/fuse/file_iomap.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/inode.c | 18 ++++++++++++------
6 files changed, 98 insertions(+), 16 deletions(-)
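The flag handling described in the commit message boils down to stashing bits in the low end of a pointer-sized field until a real device pointer replaces them. A minimal userspace sketch of that idea follows. It assumes the pointed-to object is at least 4-byte aligned so that the two low bits are free; the sketch_* names are invented for illustration and only loosely mirror __fuse_get_dev_and_flags() and __fuse_set_dev_flags().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits kept in the low bits of the pointer-sized field. */
#define SKETCH_SYNC_INIT	((uintptr_t)1 << 0)
#define SKETCH_INHERIT_IOMAP	((uintptr_t)1 << 1)
#define SKETCH_FLAGS_MASK	(SKETCH_SYNC_INIT | SKETCH_INHERIT_IOMAP)
#define SKETCH_PTR_MASK		(~SKETCH_FLAGS_MASK)

/* Hypothetical device object; must be aligned to at least 4 bytes. */
struct sketch_dev {
	long id;
};

static_assert(_Alignof(struct sketch_dev) >= 4,
	      "two low pointer bits must be free for flags");

/* Split the raw field into its pointer part and its flag part. */
static struct sketch_dev *sketch_get_dev_and_flags(void *private_data,
						   uintptr_t *flagsp)
{
	uintptr_t raw = (uintptr_t)private_data;

	*flagsp = raw & SKETCH_FLAGS_MASK;
	return (struct sketch_dev *)(raw & SKETCH_PTR_MASK);
}

/* Flags may only be added while no device pointer is installed. */
static int sketch_set_flag(void **private_data, uintptr_t flag)
{
	uintptr_t old_flags = 0;

	if (sketch_get_dev_and_flags(*private_data, &old_flags))
		return -1;

	*private_data = (void *)(old_flags | flag);
	return 0;
}
```

In this sketch, as in the patch, flags accumulate before the device is attached and become read-only afterwards, which is why the helper fails once the pointer part is non-NULL.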
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 6e8373f970409e..783ab1432c8691 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -39,8 +39,10 @@ struct fuse_copy_state {
} ring;
};
-#define FUSE_DEV_SYNC_INIT ((struct fuse_dev *) 1)
-#define FUSE_DEV_PTR_MASK (~1UL)
+#define FUSE_DEV_SYNC_INIT (1UL << 0)
+#define FUSE_DEV_INHERIT_IOMAP (1UL << 1)
+#define FUSE_DEV_FLAGS_MASK (FUSE_DEV_SYNC_INIT | FUSE_DEV_INHERIT_IOMAP)
+#define FUSE_DEV_PTR_MASK (~FUSE_DEV_FLAGS_MASK)
static inline struct fuse_dev *__fuse_get_dev(struct file *file)
{
@@ -50,7 +52,31 @@ static inline struct fuse_dev *__fuse_get_dev(struct file *file)
*/
struct fuse_dev *fud = READ_ONCE(file->private_data);
- return (typeof(fud)) ((unsigned long) fud & FUSE_DEV_PTR_MASK);
+ return (typeof(fud)) ((uintptr_t)fud & FUSE_DEV_PTR_MASK);
+}
+
+static inline struct fuse_dev *__fuse_get_dev_and_flags(struct file *file,
+ uintptr_t *flagsp)
+{
+ /*
+ * Lockless access is OK, because file->private_data is set
+ * once during mount and is valid until the file is released.
+ */
+ struct fuse_dev *fud = READ_ONCE(file->private_data);
+
+ *flagsp = ((uintptr_t)fud) & FUSE_DEV_FLAGS_MASK;
+ return (typeof(fud)) ((uintptr_t) fud & FUSE_DEV_PTR_MASK);
+}
+
+static inline int __fuse_set_dev_flags(struct file *file, uintptr_t flag)
+{
+ uintptr_t old_flags = 0;
+
+ if (__fuse_get_dev_and_flags(file, &old_flags))
+ return -EINVAL;
+
+ WRITE_ONCE(file->private_data, (struct fuse_dev *)(old_flags | flag));
+ return 0;
}
struct fuse_dev *fuse_get_dev(struct file *file);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 8238c1cfd9c481..1a965d3dee6479 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -984,6 +984,13 @@ struct fuse_conn {
/* Enable fs/iomap for file operations */
unsigned int iomap:1;
+ /*
+ * Are filesystems using this connection allowed to use iomap? This is
+ * determined by the privilege level of the process that initiated the
+ * mount() call.
+ */
+ unsigned int may_iomap:1;
+
/* Use io_uring for communication */
unsigned int io_uring;
@@ -1843,6 +1850,7 @@ void fuse_iomap_open_truncate(struct inode *inode);
void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
size_t written);
+int fuse_dev_ioctl_add_iomap(struct file *file);
int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support __user *argp);
int fuse_iomap_dev_inval(struct fuse_conn *fc,
@@ -1892,6 +1900,7 @@ int fuse_iomap_inval(struct fuse_conn *fc,
# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
# define fuse_iomap_open_truncate(...) ((void)0)
# define fuse_iomap_copied_file_range(...) ((void)0)
+# define fuse_dev_ioctl_add_iomap(...) (-EOPNOTSUPP)
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
# define fuse_iomap_dev_inval(...) (-ENOSYS)
# define fuse_iomap_fadvise NULL
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5c2c594fc87892..b59ce131513efd 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1187,6 +1187,7 @@ struct fuse_iomap_support {
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
#define FUSE_DEV_IOC_SYNC_INIT _IO(FUSE_DEV_IOC_MAGIC, 3)
+#define FUSE_DEV_IOC_ADD_IOMAP _IO(FUSE_DEV_IOC_MAGIC, 99)
#define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 99, \
struct fuse_iomap_support)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index b144f67f06160f..7a24fbcdb2f919 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1556,7 +1556,7 @@ struct fuse_dev *fuse_get_dev(struct file *file)
return fud;
err = wait_event_interruptible(fuse_dev_waitq,
- READ_ONCE(file->private_data) != FUSE_DEV_SYNC_INIT);
+ __fuse_get_dev(file) != NULL);
if (err)
return ERR_PTR(err);
@@ -2752,13 +2752,10 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
static long fuse_dev_ioctl_sync_init(struct file *file)
{
- int err = -EINVAL;
+ int err;
mutex_lock(&fuse_mutex);
- if (!__fuse_get_dev(file)) {
- WRITE_ONCE(file->private_data, FUSE_DEV_SYNC_INIT);
- err = 0;
- }
+ err = __fuse_set_dev_flags(file, FUSE_DEV_SYNC_INIT);
mutex_unlock(&fuse_mutex);
return err;
}
@@ -2783,6 +2780,8 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
case FUSE_DEV_IOC_IOMAP_SUPPORT:
return fuse_dev_ioctl_iomap_support(file, argp);
+ case FUSE_DEV_IOC_ADD_IOMAP:
+ return fuse_dev_ioctl_add_iomap(file);
default:
return -ENOTTY;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 261490a322c289..70b01638006a2e 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -9,6 +9,7 @@
#include <linux/falloc.h>
#include <linux/fadvise.h>
#include "fuse_i.h"
+#include "fuse_dev_i.h"
#include "fuse_trace.h"
#include "iomap_priv.h"
@@ -114,6 +115,12 @@ bool fuse_iomap_enabled(void)
return enable_iomap && has_capability_noaudit(current, CAP_SYS_RAWIO);
}
+static inline bool fuse_iomap_may_enable(void)
+{
+ /* Same as above, but this time we log the denial in audit log */
+ return enable_iomap && capable(CAP_SYS_RAWIO);
+}
+
/* Convert IOMAP_* mapping types to FUSE_IOMAP_TYPE_* */
#define XMAP(word) \
case IOMAP_##word: \
@@ -2367,12 +2374,46 @@ fuse_iomap_fallocate(
return 0;
}
+int fuse_dev_ioctl_add_iomap(struct file *file)
+{
+ uintptr_t flags = 0;
+ struct fuse_dev *fud;
+ int ret = 0;
+
+ mutex_lock(&fuse_mutex);
+ fud = __fuse_get_dev_and_flags(file, &flags);
+ if (fud) {
+ if (!fud->fc->may_iomap && !fuse_iomap_may_enable()) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+
+ fud->fc->may_iomap = 1;
+ goto out_unlock;
+ }
+
+ if (!(flags & FUSE_DEV_INHERIT_IOMAP) && !fuse_iomap_may_enable()) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+
+ ret = __fuse_set_dev_flags(file, FUSE_DEV_INHERIT_IOMAP);
+
+out_unlock:
+ mutex_unlock(&fuse_mutex);
+ return ret;
+}
+
int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support __user *argp)
{
struct fuse_iomap_support ios = { };
+ uintptr_t flags = 0;
+ struct fuse_dev *fud = __fuse_get_dev_and_flags(file, &flags);
- if (fuse_iomap_enabled())
+ if ((!fud && (flags & FUSE_DEV_INHERIT_IOMAP)) ||
+ (fud && fud->fc->may_iomap) ||
+ fuse_iomap_enabled())
ios.flags = FUSE_IOMAP_SUPPORT_FILEIO |
FUSE_IOMAP_SUPPORT_ATOMIC;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index c29a8cbc55fa27..19b385b79f7cbe 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1022,6 +1022,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
fc->name_max = FUSE_NAME_LOW_MAX;
fc->timeout.req_timeout = 0;
fc->root_nodeid = FUSE_ROOT_ID;
+ fc->may_iomap = fuse_iomap_enabled();
if (IS_ENABLED(CONFIG_FUSE_BACKING))
fuse_backing_files_init(fc);
@@ -1481,7 +1482,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
if (flags & FUSE_REQUEST_TIMEOUT)
timeout = arg->request_timeout;
- if ((flags & FUSE_IOMAP) && fuse_iomap_enabled()) {
+ if ((flags & FUSE_IOMAP) && fc->may_iomap) {
fc->local_fs = 1;
fc->iomap = 1;
printk(KERN_WARNING
@@ -1569,7 +1570,7 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
*/
if (fuse_uring_enabled())
flags |= FUSE_OVER_IO_URING;
- if (fuse_iomap_enabled())
+ if (fm->fc->may_iomap)
flags |= FUSE_IOMAP;
ia->in.flags = flags;
@@ -1955,11 +1956,16 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
mutex_lock(&fuse_mutex);
err = -EINVAL;
- if (ctx->fudptr && *ctx->fudptr) {
- if (*ctx->fudptr == FUSE_DEV_SYNC_INIT) {
- fc->sync_init = 1;
- } else
+ if (ctx->fudptr) {
+ uintptr_t raw = (uintptr_t)(*ctx->fudptr);
+ uintptr_t flags = raw & FUSE_DEV_FLAGS_MASK;
+
+ if (raw & FUSE_DEV_PTR_MASK)
goto err_unlock;
+ if (flags & FUSE_DEV_SYNC_INIT)
+ fc->sync_init = 1;
+ if (flags & FUSE_DEV_INHERIT_IOMAP)
+ fc->may_iomap = 1;
}
err = fuse_ctl_add_conn(fc);
* [PATCH 2/2] fuse: set iomap backing device block size
2025-09-16 0:20 ` [PATCHSET RFC v5 8/8] fuse: run fuse servers as a contained service Darrick J. Wong
2025-09-16 0:41 ` [PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap usage Darrick J. Wong
@ 2025-09-16 0:41 ` Darrick J. Wong
1 sibling, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 0:41 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add a new ioctl so that an unprivileged fuse server can set the block
size of a bdev that's opened for iomap usage.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 3 +++
include/uapi/linux/fuse.h | 7 +++++++
fs/fuse/dev.c | 2 ++
fs/fuse/file_iomap.c | 24 ++++++++++++++++++++++++
4 files changed, 36 insertions(+)
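For illustration, a fuse server might call the new ioctl roughly as sketched below. The struct and ioctl definitions are copied from the uapi hunk in this patch and are not in any released kernel header; command number 99 is simply what this RFC uses and should be assumed to change. The sketch_set_backing_blocksize() helper and its local power-of-two check are inventions for the example, on the assumption that the kernel would reject such sizes anyway.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Copied from the uapi hunk in this patch; not in released headers. */
struct fuse_iomap_backing_info {
	uint32_t backing_id;
	uint32_t blocksize;
};

static_assert(sizeof(struct fuse_iomap_backing_info) == 8,
	      "ioctl argument is two packed 32-bit fields");

#define FUSE_DEV_IOC_MAGIC	229
/* Command number 99 is the RFC's placeholder; assume it may change. */
#define FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE \
	_IOW(FUSE_DEV_IOC_MAGIC, 99, struct fuse_iomap_backing_info)

/*
 * Hypothetical helper: ask the kernel to set the block size of an
 * already registered iomap backing device.  Returns 0 or a negative
 * errno value.
 */
static int sketch_set_backing_blocksize(int fuse_fd, uint32_t backing_id,
					uint32_t blocksize)
{
	struct fuse_iomap_backing_info fbi = {
		.backing_id = backing_id,
		.blocksize = blocksize,
	};

	/* Cheap local sanity check before bothering the kernel. */
	if (blocksize == 0 || (blocksize & (blocksize - 1)))
		return -EINVAL;

	if (ioctl(fuse_fd, FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE, &fbi) < 0)
		return -errno;
	return 0;
}
```

Here backing_id would be the id the server got back when it registered the block device as a backing file, and fuse_fd is the /dev/fuse descriptor handed over by the mount helper.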
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 1a965d3dee6479..faef0efe6a9506 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1853,6 +1853,8 @@ void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
int fuse_dev_ioctl_add_iomap(struct file *file);
int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support __user *argp);
+int fuse_dev_ioctl_iomap_set_blocksize(struct file *file,
+ struct fuse_iomap_backing_info __user *argp);
int fuse_iomap_dev_inval(struct fuse_conn *fc,
const struct fuse_iomap_dev_inval_out *arg);
@@ -1902,6 +1904,7 @@ int fuse_iomap_inval(struct fuse_conn *fc,
# define fuse_iomap_copied_file_range(...) ((void)0)
# define fuse_dev_ioctl_add_iomap(...) (-EOPNOTSUPP)
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
+# define fuse_dev_ioctl_iomap_set_blocksize(...) (-EOPNOTSUPP)
# define fuse_iomap_dev_inval(...) (-ENOSYS)
# define fuse_iomap_fadvise NULL
# define fuse_inode_caches_iomaps(...) (false)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index b59ce131513efd..d360c39be43104 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1180,6 +1180,11 @@ struct fuse_iomap_support {
uint64_t padding;
};
+struct fuse_iomap_backing_info {
+ uint32_t backing_id;
+ uint32_t blocksize;
+};
+
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
@@ -1190,6 +1195,8 @@ struct fuse_iomap_support {
#define FUSE_DEV_IOC_ADD_IOMAP _IO(FUSE_DEV_IOC_MAGIC, 99)
#define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 99, \
struct fuse_iomap_support)
+#define FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE _IOW(FUSE_DEV_IOC_MAGIC, 99, \
+ struct fuse_iomap_backing_info)
struct fuse_lseek_in {
uint64_t fh;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 7a24fbcdb2f919..5003a862daf37a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2782,6 +2782,8 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
return fuse_dev_ioctl_iomap_support(file, argp);
case FUSE_DEV_IOC_ADD_IOMAP:
return fuse_dev_ioctl_add_iomap(file);
+ case FUSE_DEV_IOC_IOMAP_SET_BLOCKSIZE:
+ return fuse_dev_ioctl_iomap_set_blocksize(file, argp);
default:
return -ENOTTY;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 70b01638006a2e..a915cc9520b532 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -2721,3 +2721,27 @@ int fuse_iomap_inval(struct fuse_conn *fc,
up_read(&fc->killsb);
return ret ? ret : ret2;
}
+
+int fuse_dev_ioctl_iomap_set_blocksize(struct file *file,
+ struct fuse_iomap_backing_info __user *argp)
+{
+ struct fuse_iomap_backing_info fbi;
+ struct fuse_dev *fud = fuse_get_dev(file);
+ struct fuse_backing *fb;
+ int ret;
+
+ if (IS_ERR(fud))
+ return PTR_ERR(fud);
+
+ if (copy_from_user(&fbi, argp, sizeof(fbi)))
+ return -EFAULT;
+
+ fb = fuse_backing_lookup(fud->fc, &fuse_iomap_backing_ops,
+ fbi.backing_id);
+ if (!fb)
+ return -ENOENT;
+
+ ret = set_blocksize(fb->file, fbi.blocksize);
+ fuse_backing_put(fb);
+ return ret;
+}
* Re: [PATCH 7/8] fuse: propagate default and file acls on creation
2025-09-16 0:25 ` [PATCH 7/8] fuse: propagate default and file acls on creation Darrick J. Wong
@ 2025-09-16 6:41 ` Chen Linxuan
2025-09-16 14:48 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Chen Linxuan @ 2025-09-16 6:41 UTC (permalink / raw)
To: Darrick J. Wong
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 16, 2025 at 8:26 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> For local filesystems, propagate the default and file access ACLs to new
> children when creating them, just like the other in-kernel local
> filesystems.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 4 ++
> fs/fuse/acl.c | 65 ++++++++++++++++++++++++++++++++++++++
> fs/fuse/dir.c | 92 +++++++++++++++++++++++++++++++++++++++++-------------
> 3 files changed, 138 insertions(+), 23 deletions(-)
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 52776b77efc0e4..b9306678dcda0d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1507,6 +1507,10 @@ struct posix_acl *fuse_get_acl(struct mnt_idmap *idmap,
> struct dentry *dentry, int type);
> int fuse_set_acl(struct mnt_idmap *, struct dentry *dentry,
> struct posix_acl *acl, int type);
> +int fuse_acl_create(struct inode *dir, umode_t *mode,
> + struct posix_acl **default_acl, struct posix_acl **acl);
> +int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl,
> + const struct posix_acl *acl);
>
> /* readdir.c */
> int fuse_readdir(struct file *file, struct dir_context *ctx);
> diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
> index 4997827ee83c6d..4faee72f1365a5 100644
> --- a/fs/fuse/acl.c
> +++ b/fs/fuse/acl.c
> @@ -203,3 +203,68 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
>
> return ret;
> }
> +
> +int fuse_acl_create(struct inode *dir, umode_t *mode,
> + struct posix_acl **default_acl, struct posix_acl **acl)
> +{
> + struct fuse_conn *fc = get_fuse_conn(dir);
> +
> + if (fuse_is_bad(dir))
> + return -EIO;
> +
> + if (IS_POSIXACL(dir) && fuse_has_local_acls(fc))
> + return posix_acl_create(dir, mode, default_acl, acl);
> +
> + if (!fc->dont_mask)
> + *mode &= ~current_umask();
> +
> + *default_acl = NULL;
> + *acl = NULL;
> + return 0;
> +}
> +
> +static int __fuse_set_acl(struct inode *inode, const char *name,
> + const struct posix_acl *acl)
> +{
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + size_t size = posix_acl_xattr_size(acl->a_count);
> + void *value;
> + int ret;
> +
> + if (size > PAGE_SIZE)
> + return -E2BIG;
> +
> + value = kmalloc(size, GFP_KERNEL);
> + if (!value)
> + return -ENOMEM;
> +
> + ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
> + if (ret < 0)
> + goto out_value;
> +
> + ret = fuse_setxattr(inode, name, value, size, 0, 0);
> +out_value:
> + kfree(value);
> + return ret;
> +}
> +
> +int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl,
> + const struct posix_acl *acl)
> +{
> + int ret;
> +
> + if (default_acl) {
> + ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_DEFAULT,
> + default_acl);
> + if (ret)
> + return ret;
> + }
> +
> + if (acl) {
> + ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_ACCESS, acl);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index a7f47e43692f1c..b116e42431ee12 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -628,26 +628,28 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
> struct fuse_entry_out outentry;
> struct fuse_inode *fi;
> struct fuse_file *ff;
> + struct posix_acl *default_acl = NULL, *acl = NULL;
> int epoch, err;
> bool trunc = flags & O_TRUNC;
>
> /* Userspace expects S_IFREG in create mode */
> BUG_ON((mode & S_IFMT) != S_IFREG);
>
> + err = fuse_acl_create(dir, &mode, &default_acl, &acl);
> + if (err)
> + return err;
> +
> epoch = atomic_read(&fm->fc->epoch);
> forget = fuse_alloc_forget();
> err = -ENOMEM;
> if (!forget)
> - goto out_err;
> + goto out_acl_release;
>
> err = -ENOMEM;
> ff = fuse_file_alloc(fm, true);
> if (!ff)
> goto out_put_forget_req;
>
> - if (!fm->fc->dont_mask)
> - mode &= ~current_umask();
> -
> flags &= ~O_NOCTTY;
> memset(&inarg, 0, sizeof(inarg));
> memset(&outentry, 0, sizeof(outentry));
> @@ -699,12 +701,16 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
> fuse_sync_release(NULL, ff, flags);
> fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
> err = -ENOMEM;
> - goto out_err;
> + goto out_acl_release;
> }
> kfree(forget);
> d_instantiate(entry, inode);
> entry->d_time = epoch;
> fuse_change_entry_timeout(entry, &outentry);
> +
> + err = fuse_init_acls(inode, default_acl, acl);
> + if (err)
> + goto out_acl_release;
> fuse_dir_changed(dir);
> err = generic_file_open(inode, file);
> if (!err) {
> @@ -726,7 +732,9 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
> fuse_file_free(ff);
> out_put_forget_req:
> kfree(forget);
> -out_err:
> +out_acl_release:
> + posix_acl_release(default_acl);
> + posix_acl_release(acl);
> return err;
> }
>
> @@ -785,7 +793,9 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> */
> static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_mount *fm,
> struct fuse_args *args, struct inode *dir,
> - struct dentry *entry, umode_t mode)
> + struct dentry *entry, umode_t mode,
> + struct posix_acl *default_acl,
> + struct posix_acl *acl)
> {
> struct fuse_entry_out outarg;
> struct inode *inode;
> @@ -793,14 +803,18 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
> struct fuse_forget_link *forget;
> int epoch, err;
>
> - if (fuse_is_bad(dir))
> - return ERR_PTR(-EIO);
> + if (fuse_is_bad(dir)) {
> + err = -EIO;
> + goto out_acl_release;
> + }
>
> epoch = atomic_read(&fm->fc->epoch);
>
> forget = fuse_alloc_forget();
> - if (!forget)
> - return ERR_PTR(-ENOMEM);
> + if (!forget) {
> + err = -ENOMEM;
> + goto out_acl_release;
> + }
>
> memset(&outarg, 0, sizeof(outarg));
> args->nodeid = get_node_id(dir);
> @@ -830,7 +844,8 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
> &outarg.attr, ATTR_TIMEOUT(&outarg), 0, 0);
> if (!inode) {
> fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1);
> - return ERR_PTR(-ENOMEM);
> + err = -ENOMEM;
> + goto out_acl_release;
> }
> kfree(forget);
>
> @@ -846,19 +861,31 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
> entry->d_time = epoch;
> fuse_change_entry_timeout(entry, &outarg);
> }
> +
> + err = fuse_init_acls(inode, default_acl, acl);
> + if (err)
> + goto out_acl_release;
> fuse_dir_changed(dir);
> +
> + posix_acl_release(default_acl);
> + posix_acl_release(acl);
> return d;
>
> out_put_forget_req:
> if (err == -EEXIST)
> fuse_invalidate_entry(entry);
> kfree(forget);
> + out_acl_release:
> + posix_acl_release(default_acl);
> + posix_acl_release(acl);
> return ERR_PTR(err);
> }
>
> static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm,
> struct fuse_args *args, struct inode *dir,
> - struct dentry *entry, umode_t mode)
> + struct dentry *entry, umode_t mode,
> + struct posix_acl *default_acl,
> + struct posix_acl *acl)
> {
> /*
> * Note that when creating anything other than a directory we
> @@ -869,7 +896,8 @@ static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm,
> */
> WARN_ON_ONCE(S_ISDIR(mode));
>
> - return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode));
> + return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode,
> + default_acl, acl));
> }
>
> static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
> @@ -877,10 +905,13 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
> {
> struct fuse_mknod_in inarg;
> struct fuse_mount *fm = get_fuse_mount(dir);
> + struct posix_acl *default_acl, *acl;
> FUSE_ARGS(args);
> + int err;
>
> - if (!fm->fc->dont_mask)
> - mode &= ~current_umask();
> + err = fuse_acl_create(dir, &mode, &default_acl, &acl);
Please excuse me if this is a dumb question.
In this function (and likewise in fuse_mkdir and fuse_symlink), why
can't we pair fuse_acl_create and posix_acl_release within the same
function, just like in fuse_create_open?
Thanks,
Chen Linxuan
> + if (err)
> + return err;
>
> memset(&inarg, 0, sizeof(inarg));
> inarg.mode = mode;
> @@ -892,7 +923,8 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
> args.in_args[0].value = &inarg;
> args.in_args[1].size = entry->d_name.len + 1;
> args.in_args[1].value = entry->d_name.name;
> - return create_new_nondir(idmap, fm, &args, dir, entry, mode);
> + return create_new_nondir(idmap, fm, &args, dir, entry, mode,
> + default_acl, acl);
> }
>
> static int fuse_create(struct mnt_idmap *idmap, struct inode *dir,
> @@ -924,13 +956,17 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
> {
> struct fuse_mkdir_in inarg;
> struct fuse_mount *fm = get_fuse_mount(dir);
> + struct posix_acl *default_acl, *acl;
> FUSE_ARGS(args);
> + int err;
>
> - if (!fm->fc->dont_mask)
> - mode &= ~current_umask();
> + mode |= S_IFDIR; /* vfs doesn't set S_IFDIR for us */
> + err = fuse_acl_create(dir, &mode, &default_acl, &acl);
> + if (err)
> + return ERR_PTR(err);
>
> memset(&inarg, 0, sizeof(inarg));
> - inarg.mode = mode;
> + inarg.mode = mode & ~S_IFDIR;
> inarg.umask = current_umask();
> args.opcode = FUSE_MKDIR;
> args.in_numargs = 2;
> @@ -938,7 +974,8 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
> args.in_args[0].value = &inarg;
> args.in_args[1].size = entry->d_name.len + 1;
> args.in_args[1].value = entry->d_name.name;
> - return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR);
> + return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR,
> + default_acl, acl);
> }
>
> static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
> @@ -946,7 +983,14 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
> {
> struct fuse_mount *fm = get_fuse_mount(dir);
> unsigned len = strlen(link) + 1;
> + struct posix_acl *default_acl, *acl;
> + umode_t mode = S_IFLNK | 0777;
> FUSE_ARGS(args);
> + int err;
> +
> + err = fuse_acl_create(dir, &mode, &default_acl, &acl);
> + if (err)
> + return err;
>
> args.opcode = FUSE_SYMLINK;
> args.in_numargs = 3;
> @@ -955,7 +999,8 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
> args.in_args[1].value = entry->d_name.name;
> args.in_args[2].size = len;
> args.in_args[2].value = link;
> - return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK);
> + return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK,
> + default_acl, acl);
> }
>
> void fuse_flush_time_update(struct inode *inode)
> @@ -1155,7 +1200,8 @@ static int fuse_link(struct dentry *entry, struct inode *newdir,
> args.in_args[0].value = &inarg;
> args.in_args[1].size = newent->d_name.len + 1;
> args.in_args[1].value = newent->d_name.name;
> - err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, inode->i_mode);
> + err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent,
> + inode->i_mode, NULL, NULL);
> if (!err)
> fuse_update_ctime_in_cache(inode);
> else if (err == -EINTR)
>
>
>
* Re: [PATCH 1/2] iomap: trace iomap_zero_iter zeroing activities
2025-09-16 0:26 ` [PATCH 1/2] iomap: trace iomap_zero_iter zeroing activities Darrick J. Wong
@ 2025-09-16 13:49 ` Christoph Hellwig
2025-09-16 14:49 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Christoph Hellwig @ 2025-09-16 13:49 UTC (permalink / raw)
To: Darrick J. Wong
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Mon, Sep 15, 2025 at 05:26:24PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Trace which bytes actually get zeroed.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
Can you send this out separately so that we can get it queued up ASAP?
* Re: [PATCH 2/2] iomap: error out on file IO when there is no inline_data buffer
2025-09-16 0:26 ` [PATCH 2/2] iomap: error out on file IO when there is no inline_data buffer Darrick J. Wong
@ 2025-09-16 13:50 ` Christoph Hellwig
2025-09-16 14:50 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Christoph Hellwig @ 2025-09-16 13:50 UTC (permalink / raw)
To: Darrick J. Wong
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
> + if (WARN_ON_ONCE(iomap->inline_data == NULL))
Shorten this to just !iomap->inline_data instead of checking for NULL?
Same for the other two.
Otherwise this looks good, and I'd prefer to see it go upstream ASAP
instead of hiding it in your big patch pile if possible.
* Re: [PATCH 7/8] fuse: propagate default and file acls on creation
2025-09-16 6:41 ` Chen Linxuan
@ 2025-09-16 14:48 ` Darrick J. Wong
0 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 14:48 UTC (permalink / raw)
To: Chen Linxuan
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 16, 2025 at 02:41:30PM +0800, Chen Linxuan wrote:
> On Tue, Sep 16, 2025 at 8:26 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > For local filesystems, propagate the default and file access ACLs to new
> > children when creating them, just like the other in-kernel local
> > filesystems.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/fuse_i.h | 4 ++
> > fs/fuse/acl.c | 65 ++++++++++++++++++++++++++++++++++++++
> > fs/fuse/dir.c | 92 +++++++++++++++++++++++++++++++++++++++++-------------
> > 3 files changed, 138 insertions(+), 23 deletions(-)
> >
> >
<snip>
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index a7f47e43692f1c..b116e42431ee12 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -877,10 +905,13 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
> > {
> > struct fuse_mknod_in inarg;
> > struct fuse_mount *fm = get_fuse_mount(dir);
> > + struct posix_acl *default_acl, *acl;
> > FUSE_ARGS(args);
> > + int err;
> >
> > - if (!fm->fc->dont_mask)
> > - mode &= ~current_umask();
> > + err = fuse_acl_create(dir, &mode, &default_acl, &acl);
>
> Please excuse me if this is a dumb question.
> In this function (including fuse_mkdir and fuse_symlink),
> why can't we pair fuse_acl_create and posix_acl_release together
> within the same function,
> just like in fuse_create_open?
It seemed cleaner to have create_new_{entry,nondir} consume the two acl
arguments rather than change every callsite to look like this:
fuse_acl_create(...., &default_acl, &acl);
...
ret = create_new_nondir(..., default_acl, acl);
posix_acl_release(default_acl);
posix_acl_release(acl);
return ret;
since create_new_entry is really the bottom half of mknod, mkdir,
symlink, and link.
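For illustration only, here is a user-space sketch of that "callee
consumes the references" convention. Every toy_* name below is a
hypothetical stand-in and not a kernel API; the real code uses
struct posix_acl and posix_acl_release(), and the real bottom half does
far more work than this.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Toy stand-in for struct posix_acl: only a reference count. */
struct toy_acl {
	int refcount;
};

/* Number of toy_acl objects that have been allocated but not freed. */
int live_acls;

struct toy_acl *toy_acl_alloc(void)
{
	struct toy_acl *acl = malloc(sizeof(*acl));

	if (acl) {
		acl->refcount = 1;
		live_acls++;
	}
	return acl;
}

/* Mirrors the posix_acl_release() convention: NULL is accepted and ignored. */
void toy_acl_release(struct toy_acl *acl)
{
	if (!acl)
		return;
	assert(acl->refcount > 0);
	if (--acl->refcount == 0) {
		free(acl);
		live_acls--;
	}
}

/*
 * Bottom half: takes ownership of both references and drops them on
 * every return path, so no caller has to release them.
 */
int toy_create_entry(int fail, struct toy_acl *default_acl,
		     struct toy_acl *acl)
{
	int err = fail ? -EIO : 0;

	toy_acl_release(default_acl);
	toy_acl_release(acl);
	return err;
}

/* Top half: obtains the references, then hands them off in the return. */
int toy_mknod(int fail)
{
	struct toy_acl *default_acl = toy_acl_alloc();
	struct toy_acl *acl = toy_acl_alloc();

	return toy_create_entry(fail, default_acl, acl);
}
```

With this shape each top half stays a single "return bottom_half(...)"
statement, and the release logic lives in one place.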
--D
> Thanks,
> Chen Linxuan
>
> > + if (err)
> > + return err;
> >
> > memset(&inarg, 0, sizeof(inarg));
> > inarg.mode = mode;
> > @@ -892,7 +923,8 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
> > args.in_args[0].value = &inarg;
> > args.in_args[1].size = entry->d_name.len + 1;
> > args.in_args[1].value = entry->d_name.name;
> > - return create_new_nondir(idmap, fm, &args, dir, entry, mode);
> > + return create_new_nondir(idmap, fm, &args, dir, entry, mode,
> > + default_acl, acl);
> > }
> >
> > static int fuse_create(struct mnt_idmap *idmap, struct inode *dir,
> > @@ -924,13 +956,17 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
> > {
> > struct fuse_mkdir_in inarg;
> > struct fuse_mount *fm = get_fuse_mount(dir);
> > + struct posix_acl *default_acl, *acl;
> > FUSE_ARGS(args);
> > + int err;
> >
> > - if (!fm->fc->dont_mask)
> > - mode &= ~current_umask();
> > + mode |= S_IFDIR; /* vfs doesn't set S_IFDIR for us */
> > + err = fuse_acl_create(dir, &mode, &default_acl, &acl);
> > + if (err)
> > + return ERR_PTR(err);
> >
> > memset(&inarg, 0, sizeof(inarg));
> > - inarg.mode = mode;
> > + inarg.mode = mode & ~S_IFDIR;
> > inarg.umask = current_umask();
> > args.opcode = FUSE_MKDIR;
> > args.in_numargs = 2;
> > @@ -938,7 +974,8 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
> > args.in_args[0].value = &inarg;
> > args.in_args[1].size = entry->d_name.len + 1;
> > args.in_args[1].value = entry->d_name.name;
> > - return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR);
> > + return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR,
> > + default_acl, acl);
> > }
> >
> > static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
> > @@ -946,7 +983,14 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
> > {
> > struct fuse_mount *fm = get_fuse_mount(dir);
> > unsigned len = strlen(link) + 1;
> > + struct posix_acl *default_acl, *acl;
> > + umode_t mode = S_IFLNK | 0777;
> > FUSE_ARGS(args);
> > + int err;
> > +
> > + err = fuse_acl_create(dir, &mode, &default_acl, &acl);
> > + if (err)
> > + return err;
> >
> > args.opcode = FUSE_SYMLINK;
> > args.in_numargs = 3;
> > @@ -955,7 +999,8 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
> > args.in_args[1].value = entry->d_name.name;
> > args.in_args[2].size = len;
> > args.in_args[2].value = link;
> > - return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK);
> > + return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK,
> > + default_acl, acl);
> > }
> >
> > void fuse_flush_time_update(struct inode *inode)
> > @@ -1155,7 +1200,8 @@ static int fuse_link(struct dentry *entry, struct inode *newdir,
> > args.in_args[0].value = &inarg;
> > args.in_args[1].size = newent->d_name.len + 1;
> > args.in_args[1].value = newent->d_name.name;
> > - err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, inode->i_mode);
> > + err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent,
> > + inode->i_mode, NULL, NULL);
> > if (!err)
> > fuse_update_ctime_in_cache(inode);
> > else if (err == -EINTR)
> >
> >
> >
>
* Re: [PATCH 1/2] iomap: trace iomap_zero_iter zeroing activities
2025-09-16 13:49 ` Christoph Hellwig
@ 2025-09-16 14:49 ` Darrick J. Wong
0 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 14:49 UTC (permalink / raw)
To: Christoph Hellwig
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 16, 2025 at 06:49:20AM -0700, Christoph Hellwig wrote:
> On Mon, Sep 15, 2025 at 05:26:24PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Trace which bytes actually get zeroed.
>
> Looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> Can you send this out separately so that we can get it queued up ASAP?
Will do. Thanks for the review.
--D
* Re: [PATCH 2/2] iomap: error out on file IO when there is no inline_data buffer
2025-09-16 13:50 ` Christoph Hellwig
@ 2025-09-16 14:50 ` Darrick J. Wong
0 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-16 14:50 UTC (permalink / raw)
To: Christoph Hellwig
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 16, 2025 at 06:50:27AM -0700, Christoph Hellwig wrote:
> > + if (WARN_ON_ONCE(iomap->inline_data == NULL))
>
> Shorten this to just !iomap->inline_data instead of checking for NULL?
>
> Same for the other two.
>
> Otherwise this looks good, and I'd prefer to see it go upstream ASAP
> instead of hiding it in your big patch pile if possible.
Ok. Will fix and resend as an independent series.
--D
* Re: [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c
2025-09-16 0:27 ` [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
@ 2025-09-17 2:47 ` Amir Goldstein
2025-09-18 18:02 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Amir Goldstein @ 2025-09-17 2:47 UTC (permalink / raw)
To: Darrick J. Wong
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 16, 2025 at 2:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> In preparation for iomap, move the passthrough-specific validation code
> back to passthrough.c and create a new Kconfig item for conditional
> compilation of backing.c. In the next patch, iomap will share the
> backing structures.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 23 +++++++++--
> include/uapi/linux/fuse.h | 8 +++-
> fs/fuse/Kconfig | 4 ++
> fs/fuse/Makefile | 3 +
> fs/fuse/backing.c | 95 ++++++++++++++++++++++++++++++++++-----------
> fs/fuse/dev.c | 4 +-
> fs/fuse/inode.c | 4 +-
> fs/fuse/passthrough.c | 37 +++++++++++++++++-
> 8 files changed, 144 insertions(+), 34 deletions(-)
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 52db609e63eb54..4560687d619d76 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -96,10 +96,21 @@ struct fuse_submount_lookup {
> struct fuse_forget_link *forget;
> };
>
> +struct fuse_conn;
> +
> +/** Operations for subsystems that want to use a backing file */
> +struct fuse_backing_ops {
> + int (*may_admin)(struct fuse_conn *fc, uint32_t flags);
> + int (*may_open)(struct fuse_conn *fc, struct file *file);
> + int (*may_close)(struct fuse_conn *fc, struct file *file);
> + unsigned int type;
> +};
> +
> /** Container for data related to mapping to backing file */
> struct fuse_backing {
> struct file *file;
> struct cred *cred;
> + const struct fuse_backing_ops *ops;
Please argue why we need a mix of passthrough backing
files and iomap backing bdev on the same filesystem.
This is the same as my argument against passthrough/iomap on the
same fuse_backing:
If you do not plan to test it, and nobody asked for it, please do
not allow it - it's bad for code test coverage.
I think at this point in time FUSE_PASSTHROUGH and
FUSE_IOMAP should be mutually exclusive and
fuse_backing_ops could be set at fc level.
If we want to make them per-fuse_backing later,
we can always do that when the use cases and tests arrive.
Thanks,
Amir.
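
As a rough illustration of the connection-level alternative suggested
above, here is a user-space sketch (not kernel code). All toy_* names
are hypothetical, and returning -EBUSY for a conflicting type is an
assumption made only for this example; nothing in the patch defines it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_backing_type {
	TOY_BACKING_PASSTHROUGH,
	TOY_BACKING_IOMAP,
};

/* Toy stand-in for struct fuse_backing_ops: only the type tag. */
struct toy_backing_ops {
	enum toy_backing_type type;
};

/* Toy stand-in for struct fuse_conn: one ops pointer per connection. */
struct toy_conn {
	const struct toy_backing_ops *backing_ops;
};

const struct toy_backing_ops toy_passthrough_ops = {
	.type = TOY_BACKING_PASSTHROUGH,
};

const struct toy_backing_ops toy_iomap_ops = {
	.type = TOY_BACKING_IOMAP,
};

/*
 * Bind the connection to one backing type. Binding the same type again
 * is harmless; binding a different type is refused, which keeps the two
 * modes mutually exclusive on a given connection.
 */
int toy_conn_set_backing_ops(struct toy_conn *fc,
			     const struct toy_backing_ops *ops)
{
	assert(ops != NULL);

	if (fc->backing_ops && fc->backing_ops != ops)
		return -EBUSY;

	fc->backing_ops = ops;
	return 0;
}
```

In this arrangement each backing object would not need its own ops
pointer, and a lookup would not need to compare ops, since every
backing file on the connection shares the same type.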
>
> /** refcount */
> refcount_t count;
> @@ -968,7 +979,7 @@ struct fuse_conn {
> /* New writepages go into this bucket */
> struct fuse_sync_bucket __rcu *curr_bucket;
>
> -#ifdef CONFIG_FUSE_PASSTHROUGH
> +#ifdef CONFIG_FUSE_BACKING
> /** IDR for backing files ids */
> struct idr backing_files_map;
> #endif
> @@ -1571,10 +1582,12 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
> unsigned int open_flags, fl_owner_t id, bool isdir);
>
> /* backing.c */
> -#ifdef CONFIG_FUSE_PASSTHROUGH
> +#ifdef CONFIG_FUSE_BACKING
> struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
> void fuse_backing_put(struct fuse_backing *fb);
> -struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
> +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
> + const struct fuse_backing_ops *ops,
> + int backing_id);
> #else
>
> static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> @@ -1631,6 +1644,10 @@ static inline struct file *fuse_file_passthrough(struct fuse_file *ff)
> #endif
> }
>
> +#ifdef CONFIG_FUSE_PASSTHROUGH
> +extern const struct fuse_backing_ops fuse_passthrough_backing_ops;
> +#endif
> +
> ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *iter);
> ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *iter);
> ssize_t fuse_passthrough_splice_read(struct file *in, loff_t *ppos,
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 1d76d0332f46f6..31b80f93211b81 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -1114,9 +1114,15 @@ struct fuse_notify_retrieve_in {
> uint64_t dummy4;
> };
>
> +#define FUSE_BACKING_TYPE_MASK (0xFF)
> +#define FUSE_BACKING_TYPE_PASSTHROUGH (0)
> +#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH)
> +
> +#define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK)
> +
> struct fuse_backing_map {
> int32_t fd;
> - uint32_t flags;
> + uint32_t flags; /* FUSE_BACKING_* */
> uint64_t padding;
> };
>
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index a774166264de69..9563fa5387a241 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -59,12 +59,16 @@ config FUSE_PASSTHROUGH
> default y
> depends on FUSE_FS
> select FS_STACK
> + select FUSE_BACKING
> help
> This allows bypassing FUSE server by mapping specific FUSE operations
> to be performed directly on a backing file.
>
> If you want to allow passthrough operations, answer Y.
>
> +config FUSE_BACKING
> + bool
> +
> config FUSE_IO_URING
> bool "FUSE communication over io-uring"
> default y
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index 8ddd8f0b204ee5..36be6d715b111a 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -13,7 +13,8 @@ obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
> fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
> fuse-y += iomode.o
> fuse-$(CONFIG_FUSE_DAX) += dax.o
> -fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
> +fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> +fuse-$(CONFIG_FUSE_BACKING) += backing.o
> fuse-$(CONFIG_SYSCTL) += sysctl.o
> fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
>
> diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> index 4afda419dd1416..da0dff288396ed 100644
> --- a/fs/fuse/backing.c
> +++ b/fs/fuse/backing.c
> @@ -6,6 +6,7 @@
> */
>
> #include "fuse_i.h"
> +#include "fuse_trace.h"
>
> #include <linux/file.h>
>
> @@ -69,32 +70,53 @@ static int fuse_backing_id_free(int id, void *p, void *data)
> struct fuse_backing *fb = p;
>
> WARN_ON_ONCE(refcount_read(&fb->count) != 1);
> +
> fuse_backing_free(fb);
> return 0;
> }
>
> void fuse_backing_files_free(struct fuse_conn *fc)
> {
> - idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
> + idr_for_each(&fc->backing_files_map, fuse_backing_id_free, fc);
> idr_destroy(&fc->backing_files_map);
> }
>
> +static inline const struct fuse_backing_ops *
> +fuse_backing_ops_from_map(const struct fuse_backing_map *map)
> +{
> + switch (map->flags & FUSE_BACKING_TYPE_MASK) {
> +#ifdef CONFIG_FUSE_PASSTHROUGH
> + case FUSE_BACKING_TYPE_PASSTHROUGH:
> + return &fuse_passthrough_backing_ops;
> +#endif
> + default:
> + break;
> + }
> +
> + return NULL;
> +}
> +
> int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> {
> struct file *file;
> - struct super_block *backing_sb;
> struct fuse_backing *fb = NULL;
> + const struct fuse_backing_ops *ops = fuse_backing_ops_from_map(map);
> + uint32_t op_flags = map->flags & ~FUSE_BACKING_TYPE_MASK;
> int res;
>
> pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
>
> - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> - res = -EPERM;
> - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> + res = -EOPNOTSUPP;
> + if (!ops)
> + goto out;
> + WARN_ON(ops->type != (map->flags & FUSE_BACKING_TYPE_MASK));
> +
> + res = ops->may_admin ? ops->may_admin(fc, op_flags) : 0;
> + if (res)
> goto out;
>
> res = -EINVAL;
> - if (map->flags || map->padding)
> + if (map->padding)
> goto out;
>
> file = fget_raw(map->fd);
> @@ -102,14 +124,8 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> if (!file)
> goto out;
>
> - /* read/write/splice/mmap passthrough only relevant for regular files */
> - res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL;
> - if (!d_is_reg(file->f_path.dentry))
> - goto out_fput;
> -
> - backing_sb = file_inode(file)->i_sb;
> - res = -ELOOP;
> - if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> + res = ops->may_open ? ops->may_open(fc, file) : 0;
> + if (res)
> goto out_fput;
>
> fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> @@ -119,14 +135,15 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
>
> fb->file = file;
> fb->cred = prepare_creds();
> + fb->ops = ops;
> refcount_set(&fb->count, 1);
>
> res = fuse_backing_id_alloc(fc, fb);
> if (res < 0) {
> fuse_backing_free(fb);
> fb = NULL;
> + goto out;
> }
> -
> out:
> pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
>
> @@ -137,41 +154,71 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> goto out;
> }
>
> +static struct fuse_backing *__fuse_backing_lookup(struct fuse_conn *fc,
> + int backing_id)
> +{
> + struct fuse_backing *fb;
> +
> + rcu_read_lock();
> + fb = idr_find(&fc->backing_files_map, backing_id);
> + fb = fuse_backing_get(fb);
> + rcu_read_unlock();
> +
> + return fb;
> +}
> +
> int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> {
> - struct fuse_backing *fb = NULL;
> + struct fuse_backing *fb, *test_fb;
> + const struct fuse_backing_ops *ops;
> int err;
>
> pr_debug("%s: backing_id=%d\n", __func__, backing_id);
>
> - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> - err = -EPERM;
> - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> - goto out;
> -
> err = -EINVAL;
> if (backing_id <= 0)
> goto out;
>
> err = -ENOENT;
> - fb = fuse_backing_id_remove(fc, backing_id);
> + fb = __fuse_backing_lookup(fc, backing_id);
> if (!fb)
> goto out;
> + ops = fb->ops;
>
> - fuse_backing_put(fb);
> + err = ops->may_admin ? ops->may_admin(fc, 0) : 0;
> + if (err)
> + goto out_fb;
> +
> + err = ops->may_close ? ops->may_close(fc, fb->file) : 0;
> + if (err)
> + goto out_fb;
> +
> + err = -ENOENT;
> + test_fb = fuse_backing_id_remove(fc, backing_id);
> + if (!test_fb)
> + goto out_fb;
> +
> + WARN_ON(fb != test_fb);
> err = 0;
> + fuse_backing_put(test_fb);
> +out_fb:
> + fuse_backing_put(fb);
> out:
> pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
>
> return err;
> }
>
> -struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id)
> +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
> + const struct fuse_backing_ops *ops,
> + int backing_id)
> {
> struct fuse_backing *fb;
>
> rcu_read_lock();
> fb = idr_find(&fc->backing_files_map, backing_id);
> + if (fb && fb->ops != ops)
> + fb = NULL;
> fb = fuse_backing_get(fb);
> rcu_read_unlock();
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index e5aaf0c668bc11..281bc81f3b448b 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2654,7 +2654,7 @@ static long fuse_dev_ioctl_backing_open(struct file *file,
> if (IS_ERR(fud))
> return PTR_ERR(fud);
>
> - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> + if (!IS_ENABLED(CONFIG_FUSE_BACKING))
> return -EOPNOTSUPP;
>
> if (copy_from_user(&map, argp, sizeof(map)))
> @@ -2671,7 +2671,7 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
> if (IS_ERR(fud))
> return PTR_ERR(fud);
>
> - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> + if (!IS_ENABLED(CONFIG_FUSE_BACKING))
> return -EOPNOTSUPP;
>
> if (get_user(backing_id, argp))
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 14c35ce12b87d6..1e7298b2b89b58 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -995,7 +995,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> fc->name_max = FUSE_NAME_LOW_MAX;
> fc->timeout.req_timeout = 0;
>
> - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> + if (IS_ENABLED(CONFIG_FUSE_BACKING))
> fuse_backing_files_init(fc);
>
> INIT_LIST_HEAD(&fc->mounts);
> @@ -1032,7 +1032,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> WARN_ON(atomic_read(&bucket->count) != 1);
> kfree(bucket);
> }
> - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> + if (IS_ENABLED(CONFIG_FUSE_BACKING))
> fuse_backing_files_free(fc);
> call_rcu(&fc->rcu, delayed_release);
> }
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index e0b8d885bc81f3..9792d7b12a775b 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -164,7 +164,7 @@ struct fuse_backing *fuse_passthrough_open(struct file *file,
> goto out;
>
> err = -ENOENT;
> - fb = fuse_backing_lookup(fc, backing_id);
> + fb = fuse_backing_lookup(fc, &fuse_passthrough_backing_ops, backing_id);
> if (!fb)
> goto out;
>
> @@ -197,3 +197,38 @@ void fuse_passthrough_release(struct fuse_file *ff, struct fuse_backing *fb)
> put_cred(ff->cred);
> ff->cred = NULL;
> }
> +
> +static int fuse_passthrough_may_admin(struct fuse_conn *fc, unsigned int flags)
> +{
> + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> + if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + if (flags)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static int fuse_passthrough_may_open(struct fuse_conn *fc, struct file *file)
> +{
> + struct super_block *backing_sb;
> + int res;
> +
> + /* read/write/splice/mmap passthrough only relevant for regular files */
> + res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL;
> + if (!d_is_reg(file->f_path.dentry))
> + return res;
> +
> + backing_sb = file_inode(file)->i_sb;
> + if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> + return -ELOOP;
> +
> + return 0;
> +}
> +
> +const struct fuse_backing_ops fuse_passthrough_backing_ops = {
> + .type = FUSE_BACKING_TYPE_PASSTHROUGH,
> + .may_admin = fuse_passthrough_may_admin,
> + .may_open = fuse_passthrough_may_open,
> +};
>
>
* Re: [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-16 0:29 ` [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices Darrick J. Wong
@ 2025-09-17 3:09 ` Amir Goldstein
2025-09-18 18:17 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Amir Goldstein @ 2025-09-17 3:09 UTC (permalink / raw)
To: Darrick J. Wong
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 16, 2025 at 2:30 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Enable the use of the backing file open/close ioctls so that fuse
> servers can register block devices for use with iomap.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 5 ++
> include/uapi/linux/fuse.h | 3 +
> fs/fuse/Kconfig | 1
> fs/fuse/backing.c | 12 +++++
> fs/fuse/file_iomap.c | 99 +++++++++++++++++++++++++++++++++++++++++----
> fs/fuse/trace.c | 1
> 6 files changed, 111 insertions(+), 10 deletions(-)
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 389b123f0bf144..791f210c13a876 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -97,12 +97,14 @@ struct fuse_submount_lookup {
> };
>
> struct fuse_conn;
> +struct fuse_backing;
>
> /** Operations for subsystems that want to use a backing file */
> struct fuse_backing_ops {
> int (*may_admin)(struct fuse_conn *fc, uint32_t flags);
> int (*may_open)(struct fuse_conn *fc, struct file *file);
> int (*may_close)(struct fuse_conn *fc, struct file *file);
> + int (*post_open)(struct fuse_conn *fc, struct fuse_backing *fb);
> unsigned int type;
> };
>
> @@ -110,6 +112,7 @@ struct fuse_backing_ops {
> struct fuse_backing {
> struct file *file;
> struct cred *cred;
> + struct block_device *bdev;
> const struct fuse_backing_ops *ops;
>
> /** refcount */
> @@ -1704,6 +1707,8 @@ static inline bool fuse_has_iomap(const struct inode *inode)
> {
> return get_fuse_conn_c(inode)->iomap;
> }
> +
> +extern const struct fuse_backing_ops fuse_iomap_backing_ops;
> #else
> # define fuse_iomap_enabled(...) (false)
> # define fuse_has_iomap(...) (false)
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 3634cbe602cd9c..3a367f387795ff 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -1124,7 +1124,8 @@ struct fuse_notify_retrieve_in {
>
> #define FUSE_BACKING_TYPE_MASK (0xFF)
> #define FUSE_BACKING_TYPE_PASSTHROUGH (0)
> -#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH)
> +#define FUSE_BACKING_TYPE_IOMAP (1)
> +#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_IOMAP)
>
> #define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK)
>
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 52e1a04183e760..baa38cf0f295ff 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -75,6 +75,7 @@ config FUSE_IOMAP
> depends on FUSE_FS
> depends on BLOCK
> select FS_IOMAP
> + select FUSE_BACKING
> help
> Enable fuse servers to operate the regular file I/O path through
> the fs-iomap library in the kernel. This enables higher performance
> diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> index 229c101ab46b0e..fc58636ac78eaa 100644
> --- a/fs/fuse/backing.c
> +++ b/fs/fuse/backing.c
> @@ -89,6 +89,10 @@ fuse_backing_ops_from_map(const struct fuse_backing_map *map)
> #ifdef CONFIG_FUSE_PASSTHROUGH
> case FUSE_BACKING_TYPE_PASSTHROUGH:
> return &fuse_passthrough_backing_ops;
> +#endif
> +#ifdef CONFIG_FUSE_IOMAP
> + case FUSE_BACKING_TYPE_IOMAP:
> + return &fuse_iomap_backing_ops;
> #endif
> default:
> break;
> @@ -137,8 +141,16 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> fb->file = file;
> fb->cred = prepare_creds();
> fb->ops = ops;
> + fb->bdev = NULL;
> refcount_set(&fb->count, 1);
>
> + res = ops->post_open ? ops->post_open(fc, fb) : 0;
> + if (res) {
> + fuse_backing_free(fb);
> + fb = NULL;
> + goto out;
> + }
> +
> res = fuse_backing_id_alloc(fc, fb);
> if (res < 0) {
> fuse_backing_free(fb);
> diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> index e7d19e2aee4541..3a4161633add0e 100644
> --- a/fs/fuse/file_iomap.c
> +++ b/fs/fuse/file_iomap.c
> @@ -319,10 +319,6 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
> return false;
> }
>
> - /* XXX: we don't support devices yet */
> - if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
> - return false;
> -
> /* No overflows in the device range, if supplied */
> if (map->addr != FUSE_IOMAP_NULL_ADDR &&
> BAD_DATA(check_add_overflow(map->addr, map->length, &end)))
> @@ -334,6 +330,7 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
> /* Convert a mapping from the server into something the kernel can use */
> static inline void fuse_iomap_from_server(struct inode *inode,
> struct iomap *iomap,
> + const struct fuse_backing *fb,
> const struct fuse_iomap_io *fmap)
> {
> iomap->addr = fmap->addr;
> @@ -341,7 +338,9 @@ static inline void fuse_iomap_from_server(struct inode *inode,
> iomap->length = fmap->length;
> iomap->type = fuse_iomap_type_from_server(fmap->type);
> iomap->flags = fuse_iomap_flags_from_server(fmap->flags);
> - iomap->bdev = inode->i_sb->s_bdev; /* XXX */
> +
> + iomap->bdev = fb ? fb->bdev : NULL;
> + iomap->dax_dev = NULL;
> }
>
> /* Convert a mapping from the kernel into something the server can use */
> @@ -392,6 +391,27 @@ static inline bool fuse_is_iomap_file_write(unsigned int opflags)
> return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
> }
>
> +static inline struct fuse_backing *
> +fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map)
> +{
> + struct fuse_backing *ret = NULL;
> +
> + if (map->dev != FUSE_IOMAP_DEV_NULL && map->dev < INT_MAX)
> + ret = fuse_backing_lookup(fc, &fuse_iomap_backing_ops,
> + map->dev);
> +
> + switch (map->type) {
> + case FUSE_IOMAP_TYPE_MAPPED:
> + case FUSE_IOMAP_TYPE_UNWRITTEN:
> + /* Mappings backed by space must have a device/addr */
> + if (BAD_DATA(ret == NULL))
> + return ERR_PTR(-EFSCORRUPTED);
> + break;
> + }
> +
> + return ret;
> +}
> +
> static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> unsigned opflags, struct iomap *iomap,
> struct iomap *srcmap)
> @@ -405,6 +425,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> };
> struct fuse_iomap_begin_out outarg = { };
> struct fuse_mount *fm = get_fuse_mount(inode);
> + struct fuse_backing *read_dev = NULL;
> + struct fuse_backing *write_dev = NULL;
> FUSE_ARGS(args);
> int err;
>
> @@ -431,24 +453,44 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> if (err)
> return err;
>
> + read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read);
> + if (IS_ERR(read_dev))
> + return PTR_ERR(read_dev);
> +
> if (fuse_is_iomap_file_write(opflags) &&
> outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
> + /* open the write device */
> + write_dev = fuse_iomap_find_dev(fm->fc, &outarg.write);
> + if (IS_ERR(write_dev)) {
> + err = PTR_ERR(write_dev);
> + goto out_read_dev;
> + }
> +
> /*
> * For an out of place write, we must supply the write mapping
> * via @iomap, and the read mapping via @srcmap.
> */
> - fuse_iomap_from_server(inode, iomap, &outarg.write);
> - fuse_iomap_from_server(inode, srcmap, &outarg.read);
> + fuse_iomap_from_server(inode, iomap, write_dev, &outarg.write);
> + fuse_iomap_from_server(inode, srcmap, read_dev, &outarg.read);
> } else {
> /*
> * For everything else (reads, reporting, and pure overwrites),
> * we can return the sole mapping through @iomap and leave
> * @srcmap unchanged from its default (HOLE).
> */
> - fuse_iomap_from_server(inode, iomap, &outarg.read);
> + fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
> }
>
> - return 0;
> + /*
> + * XXX: if we ever want to support closing devices, we need a way to
> + * track the fuse_backing refcount all the way through bio endios.
> + * For now we put the refcount here because you can't remove an iomap
> + * device until unmount time.
> + */
> + fuse_backing_put(write_dev);
> +out_read_dev:
> + fuse_backing_put(read_dev);
> + return err;
> }
>
> /* Decide if we send FUSE_IOMAP_END to the fuse server */
> @@ -523,3 +565,42 @@ const struct iomap_ops fuse_iomap_ops = {
> .iomap_begin = fuse_iomap_begin,
> .iomap_end = fuse_iomap_end,
> };
> +
> +static int fuse_iomap_may_admin(struct fuse_conn *fc, unsigned int flags)
> +{
> + if (!fc->iomap)
> + return -EPERM;
> +
IIRC, on the RFC I asked why iomap is exempt from the CAP_SYS_ADMIN
check. If there was a good reason, I forgot it.
The problem is that while a fuse-iomap fs is only expected to open
a handful of backing devs, we would like to prevent abuse of this ioctl
by a buggy or malicious user.
I think that if you want to avoid CAP_SYS_ADMIN here, you should
enforce a limit on the number of backing bdevs.
If you accept my suggestion to make the passthrough and iomap
features mutually exclusive per fs, then you'd just need to keep track
of the number of fuse_backing ids and place a limit on it for an iomap fs.
BTW, I think it is enough to keep track of the number of backing ids;
there is no need to keep track of the number of fuse_backing objects
(which can outlive a backing id), because an "anonymous" fuse_backing
object is always associated with an open fuse file - that's the same as
an overlayfs backing file, which is not accounted for in ulimit.
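To make the accounting idea concrete, here is a rough userspace sketch.
None of these names exist in the patch: model_conn,
MODEL_IOMAP_MAX_BACKING_IDS and the choice of -EMFILE are all made up
for illustration, and the real limit (if any) is up for discussion.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap on registered backing ids for an iomap fs. */
#define MODEL_IOMAP_MAX_BACKING_IDS 16

/* Userspace stand-in for the few fuse_conn fields that matter here. */
struct model_conn {
	bool iomap;			/* connection has iomap enabled */
	unsigned int nr_backing_ids;	/* backing ids currently registered */
};

/* Account for one new backing id; refuse once an iomap fs hits the cap. */
int model_backing_id_charge(struct model_conn *fc)
{
	assert(fc != NULL);

	if (fc->iomap && fc->nr_backing_ids >= MODEL_IOMAP_MAX_BACKING_IDS)
		return -EMFILE;

	fc->nr_backing_ids++;
	return 0;
}

/* Undo the accounting when the id is removed, not when the object dies. */
void model_backing_id_uncharge(struct model_conn *fc)
{
	assert(fc != NULL && fc->nr_backing_ids > 0);
	fc->nr_backing_ids--;
}
```

The point of charging per id rather than per object is that closing an
id frees a slot immediately, even if an open fuse file still holds a
reference to the fuse_backing.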
Thanks,
Amir.
* Re: [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-16 0:25 ` [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors Darrick J. Wong
@ 2025-09-17 17:18 ` Joanne Koong
2025-09-18 16:52 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Joanne Koong @ 2025-09-17 17:18 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal
On Mon, Sep 15, 2025 at 5:25 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Create a new fuse context flag that indicates that the kernel should
> implement various local filesystem behaviors instead of passing vfs
> commands straight through to the fuse server and expecting the server to
> do all the work. For example, this means that we'll use the kernel to
> transform some ACL updates into mode changes, and later to do
> enforcement of the immutable and append iflags.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 4 ++++
> fs/fuse/inode.c | 2 ++
> 2 files changed, 6 insertions(+)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index e93a3c3f11d901..e13e8270f4f58d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -603,6 +603,7 @@ struct fuse_fs_context {
> bool no_control:1;
> bool no_force_umount:1;
> bool legacy_opts_show:1;
> + bool local_fs:1;
> enum fuse_dax_mode dax_mode;
> unsigned int max_read;
> unsigned int blksize;
> @@ -901,6 +902,9 @@ struct fuse_conn {
> /* Is link not implemented by fs? */
> unsigned int no_link:1;
>
> + /* Should this filesystem behave like a local filesystem? */
> + unsigned int local_fs:1;
> +
> /* Use io_uring for communication */
> unsigned int io_uring;
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index c94aba627a6f11..c8dd0bcb7e6f9f 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1862,6 +1862,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> fc->destroy = ctx->destroy;
> fc->no_control = ctx->no_control;
> fc->no_force_umount = ctx->no_force_umount;
> + fc->local_fs = ctx->local_fs;
>
If I'm understanding it correctly, fc->local_fs is set to true if it's
a fuseblk device? Why do we need a new "ctx->local_fs" instead of
reusing ctx->is_bdev?
Thanks,
Joanne
> err = -ENOMEM;
> root = fuse_get_root_inode(sb, ctx->rootmode);
> @@ -2029,6 +2030,7 @@ static int fuse_init_fs_context(struct fs_context *fsc)
> if (fsc->fs_type == &fuseblk_fs_type) {
> ctx->is_bdev = true;
> ctx->destroy = true;
> + ctx->local_fs = true;
> }
> #endif
>
>
* Re: [PATCH 1/5] fuse: allow synchronous FUSE_INIT
2025-09-16 0:26 ` [PATCH 1/5] fuse: allow synchronous FUSE_INIT Darrick J. Wong
@ 2025-09-17 17:22 ` Joanne Koong
2025-09-18 18:04 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Joanne Koong @ 2025-09-17 17:22 UTC (permalink / raw)
To: Darrick J. Wong
Cc: miklos, mszeredi, bernd, linux-xfs, John, linux-fsdevel, neal
On Mon, Sep 15, 2025 at 5:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Miklos Szeredi <mszeredi@redhat.com>
>
> FUSE_INIT has always been asynchronous with mount. That means that the
> server processed this request after the mount syscall returned.
>
> This means that FUSE_INIT can't supply the root inode's ID, hence it
> currently has a hardcoded value. There are other limitations such as not
> being able to perform getxattr during mount, which is needed by selinux.
>
> To remove these limitations allow server to process FUSE_INIT while
> initializing the in-core super block for the fuse filesystem. This can
> only be done if the server is prepared to handle this, so add
> FUSE_DEV_IOC_SYNC_INIT ioctl, which
>
> a) lets the server know whether this feature is supported, returning
> ENOTTY otherwise.
>
> b) lets the kernel know to perform a synchronous initialization
>
> The implementation is slightly tricky, since fuse_dev/fuse_conn are set up
> only during super block creation. This is solved by setting the private
> data of the fuse device file to a special value ((struct fuse_dev *) 1) and
> waiting for this to be turned into a proper fuse_dev before commencing with
> operations on the device file.
>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_dev_i.h | 13 +++++++-
> fs/fuse/fuse_i.h | 5 ++-
> include/uapi/linux/fuse.h | 1 +
> fs/fuse/cuse.c | 3 +-
> fs/fuse/dev.c | 74 +++++++++++++++++++++++++++++++++------------
> fs/fuse/dev_uring.c | 4 +-
> fs/fuse/inode.c | 50 ++++++++++++++++++++++++------
> 7 files changed, 115 insertions(+), 35 deletions(-)
btw, I think an updated version of this has already been merged into
the fuse for-next tree (commit dfb84c330794)
Thanks,
Joanne
* Re: [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-17 17:18 ` Joanne Koong
@ 2025-09-18 16:52 ` Darrick J. Wong
2025-09-19 9:24 ` Miklos Szeredi
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-18 16:52 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal
On Wed, Sep 17, 2025 at 10:18:40AM -0700, Joanne Koong wrote:
> On Mon, Sep 15, 2025 at 5:25 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Create a new fuse context flag that indicates that the kernel should
> > implement various local filesystem behaviors instead of passing vfs
> > commands straight through to the fuse server and expecting the server to
> > do all the work. For example, this means that we'll use the kernel to
> > transform some ACL updates into mode changes, and later to do
> > enforcement of the immutable and append iflags.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/fuse_i.h | 4 ++++
> > fs/fuse/inode.c | 2 ++
> > 2 files changed, 6 insertions(+)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index e93a3c3f11d901..e13e8270f4f58d 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -603,6 +603,7 @@ struct fuse_fs_context {
> > bool no_control:1;
> > bool no_force_umount:1;
> > bool legacy_opts_show:1;
> > + bool local_fs:1;
> > enum fuse_dax_mode dax_mode;
> > unsigned int max_read;
> > unsigned int blksize;
> > @@ -901,6 +902,9 @@ struct fuse_conn {
> > /* Is link not implemented by fs? */
> > unsigned int no_link:1;
> >
> > + /* Should this filesystem behave like a local filesystem? */
> > + unsigned int local_fs:1;
> > +
> > /* Use io_uring for communication */
> > unsigned int io_uring;
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index c94aba627a6f11..c8dd0bcb7e6f9f 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1862,6 +1862,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> > fc->destroy = ctx->destroy;
> > fc->no_control = ctx->no_control;
> > fc->no_force_umount = ctx->no_force_umount;
> > + fc->local_fs = ctx->local_fs;
> >
>
> If I'm understanding it correctly, fc->local_fs is set to true if it's
> a fuseblk device? Why do we need a new "ctx->local_fs" instead of
> reusing ctx->is_bdev?
Eventually, enabling iomap will also set local_fs=1, as Miklos and I
sort of touched on a couple weeks ago:
https://lore.kernel.org/linux-fsdevel/CAJfpegvmXnZc=nC4UGw5Gya2cAr-kR0s=WNecnMhdTM_mGyuUg@mail.gmail.com/
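As a rough userspace sketch of why the flag is separate from is_bdev
(the struct and helper names below are placeholders for illustration,
not code from this series): fuseblk sets the flag at mount time, and a
later iomap enablement would set the same flag without a block device
being involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the relevant fuse_conn bits (hypothetical). */
struct model_conn {
	bool is_bdev;	/* mounted as fuseblk */
	bool iomap;	/* iomap enabled on this connection */
	bool local_fs;	/* kernel implements local-fs behaviors */
};

/* Mount-time setup: fuseblk implies local fs behavior. */
void model_conn_init(struct model_conn *fc, bool is_bdev)
{
	assert(fc != NULL);
	fc->is_bdev = is_bdev;
	fc->iomap = false;
	fc->local_fs = is_bdev;
}

/* Enabling iomap also turns on local fs behavior, bdev or not. */
void model_conn_enable_iomap(struct model_conn *fc)
{
	assert(fc != NULL);
	fc->iomap = true;
	fc->local_fs = true;
}
```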
--D
> Thanks,
> Joanne
>
> > err = -ENOMEM;
> > root = fuse_get_root_inode(sb, ctx->rootmode);
> > @@ -2029,6 +2030,7 @@ static int fuse_init_fs_context(struct fs_context *fsc)
> > if (fsc->fs_type == &fuseblk_fs_type) {
> > ctx->is_bdev = true;
> > ctx->destroy = true;
> > + ctx->local_fs = true;
> > }
> > #endif
> >
> >
* Re: [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c
2025-09-17 2:47 ` Amir Goldstein
@ 2025-09-18 18:02 ` Darrick J. Wong
2025-09-19 7:34 ` Miklos Szeredi
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-18 18:02 UTC (permalink / raw)
To: Amir Goldstein
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Wed, Sep 17, 2025 at 04:47:19AM +0200, Amir Goldstein wrote:
> On Tue, Sep 16, 2025 at 2:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > In preparation for iomap, move the passthrough-specific validation code
> > back to passthrough.c and create a new Kconfig item for conditional
> > compilation of backing.c. In the next patch, iomap will share the
> > backing structures.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/fuse_i.h | 23 +++++++++--
> > include/uapi/linux/fuse.h | 8 +++-
> > fs/fuse/Kconfig | 4 ++
> > fs/fuse/Makefile | 3 +
> > fs/fuse/backing.c | 95 ++++++++++++++++++++++++++++++++++-----------
> > fs/fuse/dev.c | 4 +-
> > fs/fuse/inode.c | 4 +-
> > fs/fuse/passthrough.c | 37 +++++++++++++++++-
> > 8 files changed, 144 insertions(+), 34 deletions(-)
> >
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 52db609e63eb54..4560687d619d76 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -96,10 +96,21 @@ struct fuse_submount_lookup {
> > struct fuse_forget_link *forget;
> > };
> >
> > +struct fuse_conn;
> > +
> > +/** Operations for subsystems that want to use a backing file */
> > +struct fuse_backing_ops {
> > + int (*may_admin)(struct fuse_conn *fc, uint32_t flags);
> > + int (*may_open)(struct fuse_conn *fc, struct file *file);
> > + int (*may_close)(struct fuse_conn *fc, struct file *file);
> > + unsigned int type;
> > +};
> > +
> > /** Container for data related to mapping to backing file */
> > struct fuse_backing {
> > struct file *file;
> > struct cred *cred;
> > + const struct fuse_backing_ops *ops;
>
> Please argue why we need a mix of passthrough backing
> files and iomap backing bdev on the same filesystem.
I've no particular reason to allow both on the same filesystem. I
simply didn't want to add restrictions to existing functionality.
> Same as my argument against passthrough/iomap on
> same fuse_backing:
> If you do not plan to test it, and nobody asked for it, please do
> not allow it - it's bad for code test coverage.
>
> I think at this point in time FUSE_PASSTHROUGH and
> FUSE_IOMAP should be mutually exclusive and
> fuse_backing_ops could be set at fc level.
> If we want to move them for per fuse_backing later
> we can always do that when the use cases and tests arrive.
With Miklos' ok I'll constrain fuse not to allow passthrough and iomap
files on the same filesystem, but as it is now there's no technical
reason to make it so that they can't coexist.
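If we do go that way, I imagine the constraint would amount to
something like the following userspace sketch. Everything here is
hypothetical (the model_* names, claiming at first use, and the -EBUSY
return); it only illustrates keeping a single ops pointer at the
connection level instead of one per fuse_backing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum model_backing_type {
	MODEL_BACKING_PASSTHROUGH = 0,
	MODEL_BACKING_IOMAP = 1,
};

/* Stand-in for struct fuse_backing_ops; only the type matters here. */
struct model_backing_ops {
	enum model_backing_type type;
};

const struct model_backing_ops model_passthrough_ops = {
	.type = MODEL_BACKING_PASSTHROUGH,
};
const struct model_backing_ops model_iomap_ops = {
	.type = MODEL_BACKING_IOMAP,
};

/* Stand-in for fuse_conn: one ops pointer for the whole connection. */
struct model_conn {
	const struct model_backing_ops *backing_ops;	/* NULL until claimed */
};

/* The first feature to claim the connection wins; the other is refused. */
int model_conn_claim_backing(struct model_conn *fc,
			     const struct model_backing_ops *ops)
{
	assert(fc != NULL && ops != NULL);

	if (fc->backing_ops && fc->backing_ops != ops)
		return -EBUSY;

	fc->backing_ops = ops;
	return 0;
}
```

With the ops fixed per connection, the lookup path no longer needs to
compare fb->ops against the caller's ops, since every id on the
connection has the same type.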
--D
> Thanks,
> Amir.
>
> >
> > /** refcount */
> > refcount_t count;
> > @@ -968,7 +979,7 @@ struct fuse_conn {
> > /* New writepages go into this bucket */
> > struct fuse_sync_bucket __rcu *curr_bucket;
> >
> > -#ifdef CONFIG_FUSE_PASSTHROUGH
> > +#ifdef CONFIG_FUSE_BACKING
> > /** IDR for backing files ids */
> > struct idr backing_files_map;
> > #endif
> > @@ -1571,10 +1582,12 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
> > unsigned int open_flags, fl_owner_t id, bool isdir);
> >
> > /* backing.c */
> > -#ifdef CONFIG_FUSE_PASSTHROUGH
> > +#ifdef CONFIG_FUSE_BACKING
> > struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
> > void fuse_backing_put(struct fuse_backing *fb);
> > -struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
> > +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
> > + const struct fuse_backing_ops *ops,
> > + int backing_id);
> > #else
> >
> > static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> > @@ -1631,6 +1644,10 @@ static inline struct file *fuse_file_passthrough(struct fuse_file *ff)
> > #endif
> > }
> >
> > +#ifdef CONFIG_FUSE_PASSTHROUGH
> > +extern const struct fuse_backing_ops fuse_passthrough_backing_ops;
> > +#endif
> > +
> > ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *iter);
> > ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *iter);
> > ssize_t fuse_passthrough_splice_read(struct file *in, loff_t *ppos,
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 1d76d0332f46f6..31b80f93211b81 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -1114,9 +1114,15 @@ struct fuse_notify_retrieve_in {
> > uint64_t dummy4;
> > };
> >
> > +#define FUSE_BACKING_TYPE_MASK (0xFF)
> > +#define FUSE_BACKING_TYPE_PASSTHROUGH (0)
> > +#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH)
> > +
> > +#define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK)
> > +
> > struct fuse_backing_map {
> > int32_t fd;
> > - uint32_t flags;
> > + uint32_t flags; /* FUSE_BACKING_* */
> > uint64_t padding;
> > };
> >
> > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > index a774166264de69..9563fa5387a241 100644
> > --- a/fs/fuse/Kconfig
> > +++ b/fs/fuse/Kconfig
> > @@ -59,12 +59,16 @@ config FUSE_PASSTHROUGH
> > default y
> > depends on FUSE_FS
> > select FS_STACK
> > + select FUSE_BACKING
> > help
> > This allows bypassing FUSE server by mapping specific FUSE operations
> > to be performed directly on a backing file.
> >
> > If you want to allow passthrough operations, answer Y.
> >
> > +config FUSE_BACKING
> > + bool
> > +
> > config FUSE_IO_URING
> > bool "FUSE communication over io-uring"
> > default y
> > diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> > index 8ddd8f0b204ee5..36be6d715b111a 100644
> > --- a/fs/fuse/Makefile
> > +++ b/fs/fuse/Makefile
> > @@ -13,7 +13,8 @@ obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
> > fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
> > fuse-y += iomode.o
> > fuse-$(CONFIG_FUSE_DAX) += dax.o
> > -fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
> > +fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> > +fuse-$(CONFIG_FUSE_BACKING) += backing.o
> > fuse-$(CONFIG_SYSCTL) += sysctl.o
> > fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> >
> > diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> > index 4afda419dd1416..da0dff288396ed 100644
> > --- a/fs/fuse/backing.c
> > +++ b/fs/fuse/backing.c
> > @@ -6,6 +6,7 @@
> > */
> >
> > #include "fuse_i.h"
> > +#include "fuse_trace.h"
> >
> > #include <linux/file.h>
> >
> > @@ -69,32 +70,53 @@ static int fuse_backing_id_free(int id, void *p, void *data)
> > struct fuse_backing *fb = p;
> >
> > WARN_ON_ONCE(refcount_read(&fb->count) != 1);
> > +
> > fuse_backing_free(fb);
> > return 0;
> > }
> >
> > void fuse_backing_files_free(struct fuse_conn *fc)
> > {
> > - idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
> > + idr_for_each(&fc->backing_files_map, fuse_backing_id_free, fc);
> > idr_destroy(&fc->backing_files_map);
> > }
> >
> > +static inline const struct fuse_backing_ops *
> > +fuse_backing_ops_from_map(const struct fuse_backing_map *map)
> > +{
> > + switch (map->flags & FUSE_BACKING_TYPE_MASK) {
> > +#ifdef CONFIG_FUSE_PASSTHROUGH
> > + case FUSE_BACKING_TYPE_PASSTHROUGH:
> > + return &fuse_passthrough_backing_ops;
> > +#endif
> > + default:
> > + break;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > {
> > struct file *file;
> > - struct super_block *backing_sb;
> > struct fuse_backing *fb = NULL;
> > + const struct fuse_backing_ops *ops = fuse_backing_ops_from_map(map);
> > + uint32_t op_flags = map->flags & ~FUSE_BACKING_TYPE_MASK;
> > int res;
> >
> > pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
> >
> > - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > - res = -EPERM;
> > - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > + res = -EOPNOTSUPP;
> > + if (!ops)
> > + goto out;
> > + WARN_ON(ops->type != (map->flags & FUSE_BACKING_TYPE_MASK));
> > +
> > + res = ops->may_admin ? ops->may_admin(fc, op_flags) : 0;
> > + if (res)
> > goto out;
> >
> > res = -EINVAL;
> > - if (map->flags || map->padding)
> > + if (map->padding)
> > goto out;
> >
> > file = fget_raw(map->fd);
> > @@ -102,14 +124,8 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > if (!file)
> > goto out;
> >
> > - /* read/write/splice/mmap passthrough only relevant for regular files */
> > - res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL;
> > - if (!d_is_reg(file->f_path.dentry))
> > - goto out_fput;
> > -
> > - backing_sb = file_inode(file)->i_sb;
> > - res = -ELOOP;
> > - if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> > + res = ops->may_open ? ops->may_open(fc, file) : 0;
> > + if (res)
> > goto out_fput;
> >
> > fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> > @@ -119,14 +135,15 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> >
> > fb->file = file;
> > fb->cred = prepare_creds();
> > + fb->ops = ops;
> > refcount_set(&fb->count, 1);
> >
> > res = fuse_backing_id_alloc(fc, fb);
> > if (res < 0) {
> > fuse_backing_free(fb);
> > fb = NULL;
> > + goto out;
> > }
> > -
> > out:
> > pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
> >
> > @@ -137,41 +154,71 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > goto out;
> > }
> >
> > +static struct fuse_backing *__fuse_backing_lookup(struct fuse_conn *fc,
> > + int backing_id)
> > +{
> > + struct fuse_backing *fb;
> > +
> > + rcu_read_lock();
> > + fb = idr_find(&fc->backing_files_map, backing_id);
> > + fb = fuse_backing_get(fb);
> > + rcu_read_unlock();
> > +
> > + return fb;
> > +}
> > +
> > int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> > {
> > - struct fuse_backing *fb = NULL;
> > + struct fuse_backing *fb, *test_fb;
> > + const struct fuse_backing_ops *ops;
> > int err;
> >
> > pr_debug("%s: backing_id=%d\n", __func__, backing_id);
> >
> > - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > - err = -EPERM;
> > - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > - goto out;
> > -
> > err = -EINVAL;
> > if (backing_id <= 0)
> > goto out;
> >
> > err = -ENOENT;
> > - fb = fuse_backing_id_remove(fc, backing_id);
> > + fb = __fuse_backing_lookup(fc, backing_id);
> > if (!fb)
> > goto out;
> > + ops = fb->ops;
> >
> > - fuse_backing_put(fb);
> > + err = ops->may_admin ? ops->may_admin(fc, 0) : 0;
> > + if (err)
> > + goto out_fb;
> > +
> > + err = ops->may_close ? ops->may_close(fc, fb->file) : 0;
> > + if (err)
> > + goto out_fb;
> > +
> > + err = -ENOENT;
> > + test_fb = fuse_backing_id_remove(fc, backing_id);
> > + if (!test_fb)
> > + goto out_fb;
> > +
> > + WARN_ON(fb != test_fb);
> > err = 0;
> > + fuse_backing_put(test_fb);
> > +out_fb:
> > + fuse_backing_put(fb);
> > out:
> > pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
> >
> > return err;
> > }
> >
> > -struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id)
> > +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
> > + const struct fuse_backing_ops *ops,
> > + int backing_id)
> > {
> > struct fuse_backing *fb;
> >
> > rcu_read_lock();
> > fb = idr_find(&fc->backing_files_map, backing_id);
> > + if (fb && fb->ops != ops)
> > + fb = NULL;
> > fb = fuse_backing_get(fb);
> > rcu_read_unlock();
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index e5aaf0c668bc11..281bc81f3b448b 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -2654,7 +2654,7 @@ static long fuse_dev_ioctl_backing_open(struct file *file,
> > if (IS_ERR(fud))
> > return PTR_ERR(fud);
> >
> > - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > + if (!IS_ENABLED(CONFIG_FUSE_BACKING))
> > return -EOPNOTSUPP;
> >
> > if (copy_from_user(&map, argp, sizeof(map)))
> > @@ -2671,7 +2671,7 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
> > if (IS_ERR(fud))
> > return PTR_ERR(fud);
> >
> > - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > + if (!IS_ENABLED(CONFIG_FUSE_BACKING))
> > return -EOPNOTSUPP;
> >
> > if (get_user(backing_id, argp))
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 14c35ce12b87d6..1e7298b2b89b58 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -995,7 +995,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> > fc->name_max = FUSE_NAME_LOW_MAX;
> > fc->timeout.req_timeout = 0;
> >
> > - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > + if (IS_ENABLED(CONFIG_FUSE_BACKING))
> > fuse_backing_files_init(fc);
> >
> > INIT_LIST_HEAD(&fc->mounts);
> > @@ -1032,7 +1032,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> > WARN_ON(atomic_read(&bucket->count) != 1);
> > kfree(bucket);
> > }
> > - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > + if (IS_ENABLED(CONFIG_FUSE_BACKING))
> > fuse_backing_files_free(fc);
> > call_rcu(&fc->rcu, delayed_release);
> > }
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index e0b8d885bc81f3..9792d7b12a775b 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -164,7 +164,7 @@ struct fuse_backing *fuse_passthrough_open(struct file *file,
> > goto out;
> >
> > err = -ENOENT;
> > - fb = fuse_backing_lookup(fc, backing_id);
> > + fb = fuse_backing_lookup(fc, &fuse_passthrough_backing_ops, backing_id);
> > if (!fb)
> > goto out;
> >
> > @@ -197,3 +197,38 @@ void fuse_passthrough_release(struct fuse_file *ff, struct fuse_backing *fb)
> > put_cred(ff->cred);
> > ff->cred = NULL;
> > }
> > +
> > +static int fuse_passthrough_may_admin(struct fuse_conn *fc, unsigned int flags)
> > +{
> > + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > + if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > + return -EPERM;
> > +
> > + if (flags)
> > + return -EINVAL;
> > +
> > + return 0;
> > +}
> > +
> > +static int fuse_passthrough_may_open(struct fuse_conn *fc, struct file *file)
> > +{
> > + struct super_block *backing_sb;
> > + int res;
> > +
> > + /* read/write/splice/mmap passthrough only relevant for regular files */
> > + res = d_is_dir(file->f_path.dentry) ? -EISDIR : -EINVAL;
> > + if (!d_is_reg(file->f_path.dentry))
> > + return res;
> > +
> > + backing_sb = file_inode(file)->i_sb;
> > + if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> > + return -ELOOP;
> > +
> > + return 0;
> > +}
> > +
> > +const struct fuse_backing_ops fuse_passthrough_backing_ops = {
> > + .type = FUSE_BACKING_TYPE_PASSTHROUGH,
> > + .may_admin = fuse_passthrough_may_admin,
> > + .may_open = fuse_passthrough_may_open,
> > +};
> >
> >
* Re: [PATCH 1/5] fuse: allow synchronous FUSE_INIT
2025-09-17 17:22 ` Joanne Koong
@ 2025-09-18 18:04 ` Darrick J. Wong
0 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-18 18:04 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, mszeredi, bernd, linux-xfs, John, linux-fsdevel, neal
On Wed, Sep 17, 2025 at 10:22:21AM -0700, Joanne Koong wrote:
> On Mon, Sep 15, 2025 at 5:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Miklos Szeredi <mszeredi@redhat.com>
> >
> > FUSE_INIT has always been asynchronous with mount. That means that the
> > server processed this request after the mount syscall returned.
> >
> > This means that FUSE_INIT can't supply the root inode's ID, hence it
> > currently has a hardcoded value. There are other limitations such as not
> > being able to perform getxattr during mount, which is needed by selinux.
> >
> > To remove these limitations allow server to process FUSE_INIT while
> > initializing the in-core super block for the fuse filesystem. This can
> > only be done if the server is prepared to handle this, so add
> > FUSE_DEV_IOC_SYNC_INIT ioctl, which
> >
> > a) lets the server know whether this feature is supported, returning
> > > ENOTTY otherwise.
> >
> > b) lets the kernel know to perform a synchronous initialization
> >
> > The implementation is slightly tricky, since fuse_dev/fuse_conn are set up
> > only during super block creation. This is solved by setting the private
> > data of the fuse device file to a special value ((struct fuse_dev *) 1) and
> > > waiting for this to be turned into a proper fuse_dev before commencing with
> > operations on the device file.
> >
> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/fuse_dev_i.h | 13 +++++++-
> > fs/fuse/fuse_i.h | 5 ++-
> > include/uapi/linux/fuse.h | 1 +
> > fs/fuse/cuse.c | 3 +-
> > fs/fuse/dev.c | 74 +++++++++++++++++++++++++++++++++------------
> > fs/fuse/dev_uring.c | 4 +-
> > fs/fuse/inode.c | 50 ++++++++++++++++++++++++------
> > 7 files changed, 115 insertions(+), 35 deletions(-)
>
> btw, I think an updated version of this has already been merged into
> the fuse for-next tree (commit dfb84c330794)
Thanks for letting me know!
(I don't develop against for-next. ;))
--D
> Thanks,
> Joanne
>
* Re: [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-17 3:09 ` Amir Goldstein
@ 2025-09-18 18:17 ` Darrick J. Wong
2025-09-18 18:42 ` Amir Goldstein
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-18 18:17 UTC (permalink / raw)
To: Amir Goldstein
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Wed, Sep 17, 2025 at 05:09:14AM +0200, Amir Goldstein wrote:
> On Tue, Sep 16, 2025 at 2:30 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Enable the use of the backing file open/close ioctls so that fuse
> > servers can register block devices for use with iomap.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/fuse_i.h | 5 ++
> > include/uapi/linux/fuse.h | 3 +
> > fs/fuse/Kconfig | 1
> > fs/fuse/backing.c | 12 +++++
> > fs/fuse/file_iomap.c | 99 +++++++++++++++++++++++++++++++++++++++++----
> > fs/fuse/trace.c | 1
> > 6 files changed, 111 insertions(+), 10 deletions(-)
> >
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 389b123f0bf144..791f210c13a876 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -97,12 +97,14 @@ struct fuse_submount_lookup {
> > };
> >
> > struct fuse_conn;
> > +struct fuse_backing;
> >
> > /** Operations for subsystems that want to use a backing file */
> > struct fuse_backing_ops {
> > int (*may_admin)(struct fuse_conn *fc, uint32_t flags);
> > int (*may_open)(struct fuse_conn *fc, struct file *file);
> > int (*may_close)(struct fuse_conn *fc, struct file *file);
> > + int (*post_open)(struct fuse_conn *fc, struct fuse_backing *fb);
> > unsigned int type;
> > };
> >
> > @@ -110,6 +112,7 @@ struct fuse_backing_ops {
> > struct fuse_backing {
> > struct file *file;
> > struct cred *cred;
> > + struct block_device *bdev;
> > const struct fuse_backing_ops *ops;
> >
> > /** refcount */
> > @@ -1704,6 +1707,8 @@ static inline bool fuse_has_iomap(const struct inode *inode)
> > {
> > return get_fuse_conn_c(inode)->iomap;
> > }
> > +
> > +extern const struct fuse_backing_ops fuse_iomap_backing_ops;
> > #else
> > # define fuse_iomap_enabled(...) (false)
> > # define fuse_has_iomap(...) (false)
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 3634cbe602cd9c..3a367f387795ff 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -1124,7 +1124,8 @@ struct fuse_notify_retrieve_in {
> >
> > #define FUSE_BACKING_TYPE_MASK (0xFF)
> > #define FUSE_BACKING_TYPE_PASSTHROUGH (0)
> > -#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH)
> > +#define FUSE_BACKING_TYPE_IOMAP (1)
> > +#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_IOMAP)
> >
> > #define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK)
> >
> > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > index 52e1a04183e760..baa38cf0f295ff 100644
> > --- a/fs/fuse/Kconfig
> > +++ b/fs/fuse/Kconfig
> > @@ -75,6 +75,7 @@ config FUSE_IOMAP
> > depends on FUSE_FS
> > depends on BLOCK
> > select FS_IOMAP
> > + select FUSE_BACKING
> > help
> > Enable fuse servers to operate the regular file I/O path through
> > the fs-iomap library in the kernel. This enables higher performance
> > diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> > index 229c101ab46b0e..fc58636ac78eaa 100644
> > --- a/fs/fuse/backing.c
> > +++ b/fs/fuse/backing.c
> > @@ -89,6 +89,10 @@ fuse_backing_ops_from_map(const struct fuse_backing_map *map)
> > #ifdef CONFIG_FUSE_PASSTHROUGH
> > case FUSE_BACKING_TYPE_PASSTHROUGH:
> > return &fuse_passthrough_backing_ops;
> > +#endif
> > +#ifdef CONFIG_FUSE_IOMAP
> > + case FUSE_BACKING_TYPE_IOMAP:
> > + return &fuse_iomap_backing_ops;
> > #endif
> > default:
> > break;
> > @@ -137,8 +141,16 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > fb->file = file;
> > fb->cred = prepare_creds();
> > fb->ops = ops;
> > + fb->bdev = NULL;
> > refcount_set(&fb->count, 1);
> >
> > + res = ops->post_open ? ops->post_open(fc, fb) : 0;
> > + if (res) {
> > + fuse_backing_free(fb);
> > + fb = NULL;
> > + goto out;
> > + }
> > +
> > res = fuse_backing_id_alloc(fc, fb);
> > if (res < 0) {
> > fuse_backing_free(fb);
> > diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> > index e7d19e2aee4541..3a4161633add0e 100644
> > --- a/fs/fuse/file_iomap.c
> > +++ b/fs/fuse/file_iomap.c
> > @@ -319,10 +319,6 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
> > return false;
> > }
> >
> > - /* XXX: we don't support devices yet */
> > - if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
> > - return false;
> > -
> > /* No overflows in the device range, if supplied */
> > if (map->addr != FUSE_IOMAP_NULL_ADDR &&
> > BAD_DATA(check_add_overflow(map->addr, map->length, &end)))
> > @@ -334,6 +330,7 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
> > /* Convert a mapping from the server into something the kernel can use */
> > static inline void fuse_iomap_from_server(struct inode *inode,
> > struct iomap *iomap,
> > + const struct fuse_backing *fb,
> > const struct fuse_iomap_io *fmap)
> > {
> > iomap->addr = fmap->addr;
> > @@ -341,7 +338,9 @@ static inline void fuse_iomap_from_server(struct inode *inode,
> > iomap->length = fmap->length;
> > iomap->type = fuse_iomap_type_from_server(fmap->type);
> > iomap->flags = fuse_iomap_flags_from_server(fmap->flags);
> > - iomap->bdev = inode->i_sb->s_bdev; /* XXX */
> > +
> > + iomap->bdev = fb ? fb->bdev : NULL;
> > + iomap->dax_dev = NULL;
> > }
> >
> > /* Convert a mapping from the kernel into something the server can use */
> > @@ -392,6 +391,27 @@ static inline bool fuse_is_iomap_file_write(unsigned int opflags)
> > return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
> > }
> >
> > +static inline struct fuse_backing *
> > +fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map)
> > +{
> > + struct fuse_backing *ret = NULL;
> > +
> > + if (map->dev != FUSE_IOMAP_DEV_NULL && map->dev < INT_MAX)
> > + ret = fuse_backing_lookup(fc, &fuse_iomap_backing_ops,
> > + map->dev);
> > +
> > + switch (map->type) {
> > + case FUSE_IOMAP_TYPE_MAPPED:
> > + case FUSE_IOMAP_TYPE_UNWRITTEN:
> > + /* Mappings backed by space must have a device/addr */
> > + if (BAD_DATA(ret == NULL))
> > + return ERR_PTR(-EFSCORRUPTED);
> > + break;
> > + }
> > +
> > + return ret;
> > +}
> > +
> > static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> > unsigned opflags, struct iomap *iomap,
> > struct iomap *srcmap)
> > @@ -405,6 +425,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> > };
> > struct fuse_iomap_begin_out outarg = { };
> > struct fuse_mount *fm = get_fuse_mount(inode);
> > + struct fuse_backing *read_dev = NULL;
> > + struct fuse_backing *write_dev = NULL;
> > FUSE_ARGS(args);
> > int err;
> >
> > @@ -431,24 +453,44 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> > if (err)
> > return err;
> >
> > + read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read);
> > + if (IS_ERR(read_dev))
> > + return PTR_ERR(read_dev);
> > +
> > if (fuse_is_iomap_file_write(opflags) &&
> > outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
> > + /* open the write device */
> > + write_dev = fuse_iomap_find_dev(fm->fc, &outarg.write);
> > + if (IS_ERR(write_dev)) {
> > + err = PTR_ERR(write_dev);
> > + goto out_read_dev;
> > + }
> > +
> > /*
> > * For an out of place write, we must supply the write mapping
> > * via @iomap, and the read mapping via @srcmap.
> > */
> > - fuse_iomap_from_server(inode, iomap, &outarg.write);
> > - fuse_iomap_from_server(inode, srcmap, &outarg.read);
> > + fuse_iomap_from_server(inode, iomap, write_dev, &outarg.write);
> > + fuse_iomap_from_server(inode, srcmap, read_dev, &outarg.read);
> > } else {
> > /*
> > * For everything else (reads, reporting, and pure overwrites),
> > * we can return the sole mapping through @iomap and leave
> > * @srcmap unchanged from its default (HOLE).
> > */
> > - fuse_iomap_from_server(inode, iomap, &outarg.read);
> > + fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
> > }
> >
> > - return 0;
> > + /*
> > + * XXX: if we ever want to support closing devices, we need a way to
> > + * track the fuse_backing refcount all the way through bio endios.
> > + * For now we put the refcount here because you can't remove an iomap
> > + * device until unmount time.
> > + */
> > + fuse_backing_put(write_dev);
> > +out_read_dev:
> > + fuse_backing_put(read_dev);
> > + return err;
> > }
> >
> > /* Decide if we send FUSE_IOMAP_END to the fuse server */
> > @@ -523,3 +565,42 @@ const struct iomap_ops fuse_iomap_ops = {
> > .iomap_begin = fuse_iomap_begin,
> > .iomap_end = fuse_iomap_end,
> > };
> > +
> > +static int fuse_iomap_may_admin(struct fuse_conn *fc, unsigned int flags)
> > +{
> > + if (!fc->iomap)
> > + return -EPERM;
> > +
>
> IIRC, on the RFC I asked why iomap is exempt from the CAP_SYS_ADMIN
> check. If there was a good reason, I forgot it.
CAP_SYS_ADMIN means that the fuse server (or the fuservicemount helper)
can make quite a lot of other changes to the system that are not at all
related to being a filesystem. I'd rather not use that one.
Instead I require CAP_SYS_RAWIO to enable fc->iomap, so that the fuse
server has to have *some* privilege, but only enough to write to raw
block devices since that's what iomap does.
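To make the privilege split above concrete, here is a minimal userspace
sketch, not the patch code. Every model_* name is hypothetical; the two
capability numbers are the values from <linux/capability.h>. The flags
check in the iomap variant is an assumption, because the quoted hunk is
cut off right after the fc->iomap test.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define MODEL_CAP_SYS_RAWIO	17
#define MODEL_CAP_SYS_ADMIN	21

/* Hypothetical stand-in for struct fuse_conn: only what the sketch needs. */
struct model_conn {
	uint64_t caps;		/* capability bits held by the fuse server */
	bool passthrough;	/* passthrough negotiated at init time */
	bool iomap;		/* iomap negotiated at init time */
};

static bool model_capable(const struct model_conn *fc, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (fc->caps & (UINT64_C(1) << cap)) != 0;
}

/* iomap is only switched on for servers allowed to do raw block I/O. */
static void model_negotiate_iomap(struct model_conn *fc)
{
	fc->iomap = model_capable(fc, MODEL_CAP_SYS_RAWIO);
}

/* Same shape as the fuse_passthrough_may_admin() quoted earlier. */
static int model_passthrough_may_admin(const struct model_conn *fc,
				       uint32_t flags)
{
	if (!fc->passthrough || !model_capable(fc, MODEL_CAP_SYS_ADMIN))
		return -EPERM;
	return flags ? -EINVAL : 0;
}

/* No capability test here: the privilege was checked when iomap was enabled. */
static int model_iomap_may_admin(const struct model_conn *fc, uint32_t flags)
{
	if (!fc->iomap)
		return -EPERM;
	return flags ? -EINVAL : 0;	/* assumed, not shown in the quote */
}
```

In this model a server holding only CAP_SYS_RAWIO can register iomap
devices but is still refused by the passthrough check.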
> The problem is that while fuse-iomap fs is only expected to open
> a handful of backing devs, we would like to prevent abuse of this ioctl
> by a buggy or malicious user.
>
> I think that if you want to avoid CAP_SYS_ADMIN here you should
> enforce a limit on the number of backing bdevs.
>
> If you accept my suggestion to mutually exclude passthrough and
> iomap features per fs, then you'd just need to keep track of the number
> of fuse_backing ids and place a limit for iomap fs.
>
> BTW, I think it is enough to keep track of the number of backing ids;
> there is no need to keep track of the number of fuse_backing objects
> (which can outlive a backing id), because an "anonymous" fuse_backing
> object is always associated with an open fuse file - that's the same as
> an overlayfs backing file, which is not accounted for in ulimit.
How about restricting the backing ids to RLIMIT_NOFILE? The @end param
to idr_alloc_cyclic constrains them in exactly that way.
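As a rough userspace model of that idea, not kernel code: the sketch
below hands out ids cyclically from [start, end), with the exclusive
upper bound taken from the soft RLIMIT_NOFILE. All model_* names and the
fixed table size are hypothetical. The in-kernel idr_alloc_cyclic()
likewise treats @end as exclusive and returns -ENOSPC when the range is
exhausted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

#define MODEL_MAX_IDS	1024	/* hypothetical table size for the sketch */

struct model_idmap {
	bool used[MODEL_MAX_IDS];
	int next;		/* cyclic hint: where the next search begins */
};

/*
 * Allocate an id in [start, end), resuming after the most recently
 * allocated id and wrapping around once.  Returns -ENOSPC if every id
 * in the range is taken.
 */
static int model_alloc_cyclic(struct model_idmap *m, int start, int end)
{
	assert(m != NULL);
	if (start < 0 || end > MODEL_MAX_IDS || start >= end)
		return -EINVAL;

	int span = end - start;
	int first = m->next;

	if (first < start || first >= end)
		first = start;

	for (int i = 0; i < span; i++) {
		int id = start + (first - start + i) % span;

		if (!m->used[id]) {
			m->used[id] = true;
			m->next = id + 1;
			return id;
		}
	}
	return -ENOSPC;
}

static void model_free_id(struct model_idmap *m, int id)
{
	if (id >= 0 && id < MODEL_MAX_IDS)
		m->used[id] = false;
}

/* Exclusive upper bound derived from the caller's soft RLIMIT_NOFILE. */
static int model_id_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0 || rl.rlim_cur > MODEL_MAX_IDS)
		return MODEL_MAX_IDS;
	return (int)rl.rlim_cur;
}
```

A caller would pass model_id_limit() as the end argument, so a server
with a small file limit runs out of backing ids at the same point where
it would run out of file descriptors.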
--D
> Thanks,
> Amir.
>
* Re: [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-18 18:17 ` Darrick J. Wong
@ 2025-09-18 18:42 ` Amir Goldstein
2025-09-18 19:03 ` Darrick J. Wong
2025-09-19 7:13 ` Miklos Szeredi
0 siblings, 2 replies; 126+ messages in thread
From: Amir Goldstein @ 2025-09-18 18:42 UTC (permalink / raw)
To: Darrick J. Wong
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Thu, Sep 18, 2025 at 8:17 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Sep 17, 2025 at 05:09:14AM +0200, Amir Goldstein wrote:
> > On Tue, Sep 16, 2025 at 2:30 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Enable the use of the backing file open/close ioctls so that fuse
> > > servers can register block devices for use with iomap.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > > fs/fuse/fuse_i.h | 5 ++
> > > include/uapi/linux/fuse.h | 3 +
> > > fs/fuse/Kconfig | 1
> > > fs/fuse/backing.c | 12 +++++
> > > fs/fuse/file_iomap.c | 99 +++++++++++++++++++++++++++++++++++++++++----
> > > fs/fuse/trace.c | 1
> > > 6 files changed, 111 insertions(+), 10 deletions(-)
> > >
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 389b123f0bf144..791f210c13a876 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -97,12 +97,14 @@ struct fuse_submount_lookup {
> > > };
> > >
> > > struct fuse_conn;
> > > +struct fuse_backing;
> > >
> > > /** Operations for subsystems that want to use a backing file */
> > > struct fuse_backing_ops {
> > > int (*may_admin)(struct fuse_conn *fc, uint32_t flags);
> > > int (*may_open)(struct fuse_conn *fc, struct file *file);
> > > int (*may_close)(struct fuse_conn *fc, struct file *file);
> > > + int (*post_open)(struct fuse_conn *fc, struct fuse_backing *fb);
> > > unsigned int type;
> > > };
> > >
> > > @@ -110,6 +112,7 @@ struct fuse_backing_ops {
> > > struct fuse_backing {
> > > struct file *file;
> > > struct cred *cred;
> > > + struct block_device *bdev;
> > > const struct fuse_backing_ops *ops;
> > >
> > > /** refcount */
> > > @@ -1704,6 +1707,8 @@ static inline bool fuse_has_iomap(const struct inode *inode)
> > > {
> > > return get_fuse_conn_c(inode)->iomap;
> > > }
> > > +
> > > +extern const struct fuse_backing_ops fuse_iomap_backing_ops;
> > > #else
> > > # define fuse_iomap_enabled(...) (false)
> > > # define fuse_has_iomap(...) (false)
> > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > index 3634cbe602cd9c..3a367f387795ff 100644
> > > --- a/include/uapi/linux/fuse.h
> > > +++ b/include/uapi/linux/fuse.h
> > > @@ -1124,7 +1124,8 @@ struct fuse_notify_retrieve_in {
> > >
> > > #define FUSE_BACKING_TYPE_MASK (0xFF)
> > > #define FUSE_BACKING_TYPE_PASSTHROUGH (0)
> > > -#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH)
> > > +#define FUSE_BACKING_TYPE_IOMAP (1)
> > > +#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_IOMAP)
> > >
> > > #define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK)
> > >
> > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > > index 52e1a04183e760..baa38cf0f295ff 100644
> > > --- a/fs/fuse/Kconfig
> > > +++ b/fs/fuse/Kconfig
> > > @@ -75,6 +75,7 @@ config FUSE_IOMAP
> > > depends on FUSE_FS
> > > depends on BLOCK
> > > select FS_IOMAP
> > > + select FUSE_BACKING
> > > help
> > > Enable fuse servers to operate the regular file I/O path through
> > > the fs-iomap library in the kernel. This enables higher performance
> > > diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> > > index 229c101ab46b0e..fc58636ac78eaa 100644
> > > --- a/fs/fuse/backing.c
> > > +++ b/fs/fuse/backing.c
> > > @@ -89,6 +89,10 @@ fuse_backing_ops_from_map(const struct fuse_backing_map *map)
> > > #ifdef CONFIG_FUSE_PASSTHROUGH
> > > case FUSE_BACKING_TYPE_PASSTHROUGH:
> > > return &fuse_passthrough_backing_ops;
> > > +#endif
> > > +#ifdef CONFIG_FUSE_IOMAP
> > > + case FUSE_BACKING_TYPE_IOMAP:
> > > + return &fuse_iomap_backing_ops;
> > > #endif
> > > default:
> > > break;
> > > @@ -137,8 +141,16 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > > fb->file = file;
> > > fb->cred = prepare_creds();
> > > fb->ops = ops;
> > > + fb->bdev = NULL;
> > > refcount_set(&fb->count, 1);
> > >
> > > + res = ops->post_open ? ops->post_open(fc, fb) : 0;
> > > + if (res) {
> > > + fuse_backing_free(fb);
> > > + fb = NULL;
> > > + goto out;
> > > + }
> > > +
> > > res = fuse_backing_id_alloc(fc, fb);
> > > if (res < 0) {
> > > fuse_backing_free(fb);
> > > diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> > > index e7d19e2aee4541..3a4161633add0e 100644
> > > --- a/fs/fuse/file_iomap.c
> > > +++ b/fs/fuse/file_iomap.c
> > > @@ -319,10 +319,6 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
> > > return false;
> > > }
> > >
> > > - /* XXX: we don't support devices yet */
> > > - if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
> > > - return false;
> > > -
> > > /* No overflows in the device range, if supplied */
> > > if (map->addr != FUSE_IOMAP_NULL_ADDR &&
> > > BAD_DATA(check_add_overflow(map->addr, map->length, &end)))
> > > @@ -334,6 +330,7 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
> > > /* Convert a mapping from the server into something the kernel can use */
> > > static inline void fuse_iomap_from_server(struct inode *inode,
> > > struct iomap *iomap,
> > > + const struct fuse_backing *fb,
> > > const struct fuse_iomap_io *fmap)
> > > {
> > > iomap->addr = fmap->addr;
> > > @@ -341,7 +338,9 @@ static inline void fuse_iomap_from_server(struct inode *inode,
> > > iomap->length = fmap->length;
> > > iomap->type = fuse_iomap_type_from_server(fmap->type);
> > > iomap->flags = fuse_iomap_flags_from_server(fmap->flags);
> > > - iomap->bdev = inode->i_sb->s_bdev; /* XXX */
> > > +
> > > + iomap->bdev = fb ? fb->bdev : NULL;
> > > + iomap->dax_dev = NULL;
> > > }
> > >
> > > /* Convert a mapping from the kernel into something the server can use */
> > > @@ -392,6 +391,27 @@ static inline bool fuse_is_iomap_file_write(unsigned int opflags)
> > > return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
> > > }
> > >
> > > +static inline struct fuse_backing *
> > > +fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map)
> > > +{
> > > + struct fuse_backing *ret = NULL;
> > > +
> > > + if (map->dev != FUSE_IOMAP_DEV_NULL && map->dev < INT_MAX)
> > > + ret = fuse_backing_lookup(fc, &fuse_iomap_backing_ops,
> > > + map->dev);
> > > +
> > > + switch (map->type) {
> > > + case FUSE_IOMAP_TYPE_MAPPED:
> > > + case FUSE_IOMAP_TYPE_UNWRITTEN:
> > > + /* Mappings backed by space must have a device/addr */
> > > + if (BAD_DATA(ret == NULL))
> > > + return ERR_PTR(-EFSCORRUPTED);
> > > + break;
> > > + }
> > > +
> > > + return ret;
> > > +}
> > > +
> > > static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> > > unsigned opflags, struct iomap *iomap,
> > > struct iomap *srcmap)
> > > @@ -405,6 +425,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> > > };
> > > struct fuse_iomap_begin_out outarg = { };
> > > struct fuse_mount *fm = get_fuse_mount(inode);
> > > + struct fuse_backing *read_dev = NULL;
> > > + struct fuse_backing *write_dev = NULL;
> > > FUSE_ARGS(args);
> > > int err;
> > >
> > > @@ -431,24 +453,44 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> > > if (err)
> > > return err;
> > >
> > > + read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read);
> > > + if (IS_ERR(read_dev))
> > > + return PTR_ERR(read_dev);
> > > +
> > > if (fuse_is_iomap_file_write(opflags) &&
> > > outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
> > > + /* open the write device */
> > > + write_dev = fuse_iomap_find_dev(fm->fc, &outarg.write);
> > > + if (IS_ERR(write_dev)) {
> > > + err = PTR_ERR(write_dev);
> > > + goto out_read_dev;
> > > + }
> > > +
> > > /*
> > > * For an out of place write, we must supply the write mapping
> > > * via @iomap, and the read mapping via @srcmap.
> > > */
> > > - fuse_iomap_from_server(inode, iomap, &outarg.write);
> > > - fuse_iomap_from_server(inode, srcmap, &outarg.read);
> > > + fuse_iomap_from_server(inode, iomap, write_dev, &outarg.write);
> > > + fuse_iomap_from_server(inode, srcmap, read_dev, &outarg.read);
> > > } else {
> > > /*
> > > * For everything else (reads, reporting, and pure overwrites),
> > > * we can return the sole mapping through @iomap and leave
> > > * @srcmap unchanged from its default (HOLE).
> > > */
> > > - fuse_iomap_from_server(inode, iomap, &outarg.read);
> > > + fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
> > > }
> > >
> > > - return 0;
> > > + /*
> > > + * XXX: if we ever want to support closing devices, we need a way to
> > > + * track the fuse_backing refcount all the way through bio endios.
> > > + * For now we put the refcount here because you can't remove an iomap
> > > + * device until unmount time.
> > > + */
> > > + fuse_backing_put(write_dev);
> > > +out_read_dev:
> > > + fuse_backing_put(read_dev);
> > > + return err;
> > > }
> > >
> > > /* Decide if we send FUSE_IOMAP_END to the fuse server */
> > > @@ -523,3 +565,42 @@ const struct iomap_ops fuse_iomap_ops = {
> > > .iomap_begin = fuse_iomap_begin,
> > > .iomap_end = fuse_iomap_end,
> > > };
> > > +
> > > +static int fuse_iomap_may_admin(struct fuse_conn *fc, unsigned int flags)
> > > +{
> > > + if (!fc->iomap)
> > > + return -EPERM;
> > > +
> >
> > IIRC, on the RFC I asked why iomap is exempt from the CAP_SYS_ADMIN
> > check. If there was a good reason, I forgot it.
>
> CAP_SYS_ADMIN means that the fuse server (or the fuservicemount helper)
> can make quite a lot of other changes to the system that are not at all
> related to being a filesystem. I'd rather not use that one.
>
> Instead I require CAP_SYS_RAWIO to enable fc->iomap, so that the fuse
> server has to have *some* privilege, but only enough to write to raw
> block devices since that's what iomap does.
>
> > The problem is that while fuse-iomap fs is only expected to open
> > a handful of backing devs, we would like to prevent abuse of this ioctl
> > by a buggy or malicious user.
> >
> > I think that if you want to avoid CAP_SYS_ADMIN here you should
> > enforce a limit on the number of backing bdevs.
> >
> > If you accept my suggestion to mutually exclude passthrough and
> > iomap features per fs, then you'd just need to keep track of the number
> > of fuse_backing ids and place a limit for iomap fs.
> >
> > BTW, I think it is enough to keep track of the number of backing ids;
> > there is no need to keep track of the number of fuse_backing objects
> > (which can outlive a backing id), because an "anonymous" fuse_backing
> > object is always associated with an open fuse file - that's the same as
> > an overlayfs backing file, which is not accounted for in ulimit.
>
> How about restricting the backing ids to RLIMIT_NOFILE? The @end param
> to idr_alloc_cyclic constrains them in exactly that way.
IDK. My impression was that Miklos didn't like having a large number
of unaccounted files, but it's up to him.
Do you have an estimate of the worst-case number of backing blockdevs
for fuse iomap?
Thanks,
Amir.
* Re: [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-18 18:42 ` Amir Goldstein
@ 2025-09-18 19:03 ` Darrick J. Wong
2025-09-19 7:13 ` Miklos Szeredi
1 sibling, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-18 19:03 UTC (permalink / raw)
To: Amir Goldstein
Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Thu, Sep 18, 2025 at 08:42:08PM +0200, Amir Goldstein wrote:
> On Thu, Sep 18, 2025 at 8:17 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Sep 17, 2025 at 05:09:14AM +0200, Amir Goldstein wrote:
> > > On Tue, Sep 16, 2025 at 2:30 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Enable the use of the backing file open/close ioctls so that fuse
> > > > servers can register block devices for use with iomap.
> > > >
> > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > ---
> > > > fs/fuse/fuse_i.h | 5 ++
> > > > include/uapi/linux/fuse.h | 3 +
> > > > fs/fuse/Kconfig | 1
> > > > fs/fuse/backing.c | 12 +++++
> > > > fs/fuse/file_iomap.c | 99 +++++++++++++++++++++++++++++++++++++++++----
> > > > fs/fuse/trace.c | 1
> > > > 6 files changed, 111 insertions(+), 10 deletions(-)
> > > >
> > > >
> > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > index 389b123f0bf144..791f210c13a876 100644
> > > > --- a/fs/fuse/fuse_i.h
> > > > +++ b/fs/fuse/fuse_i.h
> > > > @@ -97,12 +97,14 @@ struct fuse_submount_lookup {
> > > > };
> > > >
> > > > struct fuse_conn;
> > > > +struct fuse_backing;
> > > >
> > > > /** Operations for subsystems that want to use a backing file */
> > > > struct fuse_backing_ops {
> > > > int (*may_admin)(struct fuse_conn *fc, uint32_t flags);
> > > > int (*may_open)(struct fuse_conn *fc, struct file *file);
> > > > int (*may_close)(struct fuse_conn *fc, struct file *file);
> > > > + int (*post_open)(struct fuse_conn *fc, struct fuse_backing *fb);
> > > > unsigned int type;
> > > > };
> > > >
> > > > @@ -110,6 +112,7 @@ struct fuse_backing_ops {
> > > > struct fuse_backing {
> > > > struct file *file;
> > > > struct cred *cred;
> > > > + struct block_device *bdev;
> > > > const struct fuse_backing_ops *ops;
> > > >
> > > > /** refcount */
> > > > @@ -1704,6 +1707,8 @@ static inline bool fuse_has_iomap(const struct inode *inode)
> > > > {
> > > > return get_fuse_conn_c(inode)->iomap;
> > > > }
> > > > +
> > > > +extern const struct fuse_backing_ops fuse_iomap_backing_ops;
> > > > #else
> > > > # define fuse_iomap_enabled(...) (false)
> > > > # define fuse_has_iomap(...) (false)
> > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > index 3634cbe602cd9c..3a367f387795ff 100644
> > > > --- a/include/uapi/linux/fuse.h
> > > > +++ b/include/uapi/linux/fuse.h
> > > > @@ -1124,7 +1124,8 @@ struct fuse_notify_retrieve_in {
> > > >
> > > > #define FUSE_BACKING_TYPE_MASK (0xFF)
> > > > #define FUSE_BACKING_TYPE_PASSTHROUGH (0)
> > > > -#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_PASSTHROUGH)
> > > > +#define FUSE_BACKING_TYPE_IOMAP (1)
> > > > +#define FUSE_BACKING_MAX_TYPE (FUSE_BACKING_TYPE_IOMAP)
> > > >
> > > > #define FUSE_BACKING_FLAGS_ALL (FUSE_BACKING_TYPE_MASK)
> > > >
> > > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > > > index 52e1a04183e760..baa38cf0f295ff 100644
> > > > --- a/fs/fuse/Kconfig
> > > > +++ b/fs/fuse/Kconfig
> > > > @@ -75,6 +75,7 @@ config FUSE_IOMAP
> > > > depends on FUSE_FS
> > > > depends on BLOCK
> > > > select FS_IOMAP
> > > > + select FUSE_BACKING
> > > > help
> > > > Enable fuse servers to operate the regular file I/O path through
> > > > the fs-iomap library in the kernel. This enables higher performance
> > > > diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> > > > index 229c101ab46b0e..fc58636ac78eaa 100644
> > > > --- a/fs/fuse/backing.c
> > > > +++ b/fs/fuse/backing.c
> > > > @@ -89,6 +89,10 @@ fuse_backing_ops_from_map(const struct fuse_backing_map *map)
> > > > #ifdef CONFIG_FUSE_PASSTHROUGH
> > > > case FUSE_BACKING_TYPE_PASSTHROUGH:
> > > > return &fuse_passthrough_backing_ops;
> > > > +#endif
> > > > +#ifdef CONFIG_FUSE_IOMAP
> > > > + case FUSE_BACKING_TYPE_IOMAP:
> > > > + return &fuse_iomap_backing_ops;
> > > > #endif
> > > > default:
> > > > break;
> > > > @@ -137,8 +141,16 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > > > fb->file = file;
> > > > fb->cred = prepare_creds();
> > > > fb->ops = ops;
> > > > + fb->bdev = NULL;
> > > > refcount_set(&fb->count, 1);
> > > >
> > > > + res = ops->post_open ? ops->post_open(fc, fb) : 0;
> > > > + if (res) {
> > > > + fuse_backing_free(fb);
> > > > + fb = NULL;
> > > > + goto out;
> > > > + }
> > > > +
> > > > res = fuse_backing_id_alloc(fc, fb);
> > > > if (res < 0) {
> > > > fuse_backing_free(fb);
> > > > diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> > > > index e7d19e2aee4541..3a4161633add0e 100644
> > > > --- a/fs/fuse/file_iomap.c
> > > > +++ b/fs/fuse/file_iomap.c
> > > > @@ -319,10 +319,6 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
> > > > return false;
> > > > }
> > > >
> > > > - /* XXX: we don't support devices yet */
> > > > - if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
> > > > - return false;
> > > > -
> > > > /* No overflows in the device range, if supplied */
> > > > if (map->addr != FUSE_IOMAP_NULL_ADDR &&
> > > > BAD_DATA(check_add_overflow(map->addr, map->length, &end)))
> > > > @@ -334,6 +330,7 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
> > > > /* Convert a mapping from the server into something the kernel can use */
> > > > static inline void fuse_iomap_from_server(struct inode *inode,
> > > > struct iomap *iomap,
> > > > + const struct fuse_backing *fb,
> > > > const struct fuse_iomap_io *fmap)
> > > > {
> > > > iomap->addr = fmap->addr;
> > > > @@ -341,7 +338,9 @@ static inline void fuse_iomap_from_server(struct inode *inode,
> > > > iomap->length = fmap->length;
> > > > iomap->type = fuse_iomap_type_from_server(fmap->type);
> > > > iomap->flags = fuse_iomap_flags_from_server(fmap->flags);
> > > > - iomap->bdev = inode->i_sb->s_bdev; /* XXX */
> > > > +
> > > > + iomap->bdev = fb ? fb->bdev : NULL;
> > > > + iomap->dax_dev = NULL;
> > > > }
> > > >
> > > > /* Convert a mapping from the kernel into something the server can use */
> > > > @@ -392,6 +391,27 @@ static inline bool fuse_is_iomap_file_write(unsigned int opflags)
> > > > return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
> > > > }
> > > >
> > > > +static inline struct fuse_backing *
> > > > +fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map)
> > > > +{
> > > > + struct fuse_backing *ret = NULL;
> > > > +
> > > > + if (map->dev != FUSE_IOMAP_DEV_NULL && map->dev < INT_MAX)
> > > > + ret = fuse_backing_lookup(fc, &fuse_iomap_backing_ops,
> > > > + map->dev);
> > > > +
> > > > + switch (map->type) {
> > > > + case FUSE_IOMAP_TYPE_MAPPED:
> > > > + case FUSE_IOMAP_TYPE_UNWRITTEN:
> > > > + /* Mappings backed by space must have a device/addr */
> > > > + if (BAD_DATA(ret == NULL))
> > > > + return ERR_PTR(-EFSCORRUPTED);
> > > > + break;
> > > > + }
> > > > +
> > > > + return ret;
> > > > +}
> > > > +
> > > > static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> > > > unsigned opflags, struct iomap *iomap,
> > > > struct iomap *srcmap)
> > > > @@ -405,6 +425,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> > > > };
> > > > struct fuse_iomap_begin_out outarg = { };
> > > > struct fuse_mount *fm = get_fuse_mount(inode);
> > > > + struct fuse_backing *read_dev = NULL;
> > > > + struct fuse_backing *write_dev = NULL;
> > > > FUSE_ARGS(args);
> > > > int err;
> > > >
> > > > @@ -431,24 +453,44 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> > > > if (err)
> > > > return err;
> > > >
> > > > + read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read);
> > > > + if (IS_ERR(read_dev))
> > > > + return PTR_ERR(read_dev);
> > > > +
> > > > if (fuse_is_iomap_file_write(opflags) &&
> > > > outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
> > > > + /* open the write device */
> > > > + write_dev = fuse_iomap_find_dev(fm->fc, &outarg.write);
> > > > + if (IS_ERR(write_dev)) {
> > > > + err = PTR_ERR(write_dev);
> > > > + goto out_read_dev;
> > > > + }
> > > > +
> > > > /*
> > > > * For an out of place write, we must supply the write mapping
> > > > * via @iomap, and the read mapping via @srcmap.
> > > > */
> > > > - fuse_iomap_from_server(inode, iomap, &outarg.write);
> > > > - fuse_iomap_from_server(inode, srcmap, &outarg.read);
> > > > + fuse_iomap_from_server(inode, iomap, write_dev, &outarg.write);
> > > > + fuse_iomap_from_server(inode, srcmap, read_dev, &outarg.read);
> > > > } else {
> > > > /*
> > > > * For everything else (reads, reporting, and pure overwrites),
> > > > * we can return the sole mapping through @iomap and leave
> > > > * @srcmap unchanged from its default (HOLE).
> > > > */
> > > > - fuse_iomap_from_server(inode, iomap, &outarg.read);
> > > > + fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
> > > > }
> > > >
> > > > - return 0;
> > > > + /*
> > > > + * XXX: if we ever want to support closing devices, we need a way to
> > > > + * track the fuse_backing refcount all the way through bio endios.
> > > > + * For now we put the refcount here because you can't remove an iomap
> > > > + * device until unmount time.
> > > > + */
> > > > + fuse_backing_put(write_dev);
> > > > +out_read_dev:
> > > > + fuse_backing_put(read_dev);
> > > > + return err;
> > > > }
> > > >
> > > > /* Decide if we send FUSE_IOMAP_END to the fuse server */
> > > > @@ -523,3 +565,42 @@ const struct iomap_ops fuse_iomap_ops = {
> > > > .iomap_begin = fuse_iomap_begin,
> > > > .iomap_end = fuse_iomap_end,
> > > > };
> > > > +
> > > > +static int fuse_iomap_may_admin(struct fuse_conn *fc, unsigned int flags)
> > > > +{
> > > > + if (!fc->iomap)
> > > > + return -EPERM;
> > > > +
> > >
> > > IIRC, on RFC I asked why is iomap exempt from CAP_SYS_ADMIN
> > > check. If there was a good reason, I forgot it.
> >
> > CAP_SYS_ADMIN means that the fuse server (or the fuservicemount helper)
> > can make quite a lot of other changes to the system that are not at all
> > related to being a filesystem. I'd rather not use that one.
> >
> > Instead I require CAP_SYS_RAWIO to enable fc->iomap, so that the fuse
> > server has to have *some* privilege, but only enough to write to raw
> > block devices since that's what iomap does.
> >
> > > The problem is that while fuse-iomap fs is only expected to open
> > > a handful of backing devs, we would like to prevent abuse of this ioctl
> > > by a buggy or malicious user.
> > >
> > > I think that if you want to avoid CAP_SYS_ADMIN here you should
> > > enforce a limit on the number of backing bdevs.
> > >
> > > If you accept my suggestion to mutually exclude passthrough and
> > > > iomap features per fs, then you'd just need to keep track of the number
> > > > of fuse_backing ids and place a limit for iomap fs.
> > >
> > > > BTW, I think it is enough to keep track of the number of backing ids
> > > and no need to keep track of the number of fuse_backing objects
> > > (which can outlive a backing id), because an "anonymous" fuse_backing
> > > object is always associated with an open fuse file - that's the same as
> > > an overlayfs backing file, which is not accounted for in ulimit.
> >
> > How about restricting the backing ids to RLIMIT_NOFILE? The @end param
> > to idr_alloc_cyclic constrains them in exactly that way.
>
> IDK. My impression was that Miklos didn't like having a large number
> of unaccounted files, but it's up to him.
>
> Do you have an estimate on the worst case number of backing blockdev
> for fuse iomap?
It's the upper limit on the number of block devices that you can attach
to a multi-device filesystem for use with files. For ext4 it's 1, for
XFS it would be 2, for btrfs I have no idea.
--D
> Thanks,
> Amir.
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-18 18:42 ` Amir Goldstein
2025-09-18 19:03 ` Darrick J. Wong
@ 2025-09-19 7:13 ` Miklos Szeredi
2025-09-19 9:54 ` Amir Goldstein
1 sibling, 1 reply; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-19 7:13 UTC (permalink / raw)
To: Amir Goldstein
Cc: Darrick J. Wong, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
On Thu, 18 Sept 2025 at 20:42, Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Sep 18, 2025 at 8:17 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > How about restricting the backing ids to RLIMIT_NOFILE? The @end param
> > to idr_alloc_cyclic constrains them in exactly that way.
>
> IDK. My impression was that Miklos didn't like having a large number
> of unaccounted files, but it's up to him.
There's no 1:1 mapping between a fuse instance and a "fuse server
process", so the question is whose RLIMIT_NOFILE? Accounting to the
process that registered the fd would be good, but implementing it
looks exceedingly complex. Just taking RLIMIT_NOFILE value from the
process that is doing the fd registering should work, I guess.
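A minimal userspace sketch of the "take RLIMIT_NOFILE from the process doing the registering" idea, for illustration only. The helper name ex_backing_id_end() is hypothetical and not part of the patchset; inside the kernel the analogous read would presumably be rlimit(RLIMIT_NOFILE) in the ioctl path, with the result passed as the @end argument of idr_alloc_cyclic().

```c
#include <assert.h>
#include <limits.h>
#include <sys/resource.h>

/*
 * HYPOTHETICAL helper: compute the exclusive upper bound for backing ids
 * from the soft RLIMIT_NOFILE of the calling process, i.e. the process
 * that is registering the fd.  Returns -1 if the limit cannot be read.
 */
static int ex_backing_id_end(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;

	/* Backing ids are ints, so clamp unlimited or oversized limits. */
	if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > INT_MAX)
		return INT_MAX;

	return (int)rl.rlim_cur;
}
```

Note that this only samples the limit at registration time; it does not account the backing files against any process, which is the part described above as exceedingly complex.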
There's still the question of unhiding these files. The latest discussion
ended with: let's create a proper directory tree for open files in proc,
i.e. /proc/PID/fdtree/FD/hidden/...
Thanks,
Miklos
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c
2025-09-18 18:02 ` Darrick J. Wong
@ 2025-09-19 7:34 ` Miklos Szeredi
2025-09-19 9:36 ` Amir Goldstein
2025-09-19 17:43 ` Darrick J. Wong
0 siblings, 2 replies; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-19 7:34 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Amir Goldstein, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
On Thu, 18 Sept 2025 at 20:02, Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Sep 17, 2025 at 04:47:19AM +0200, Amir Goldstein wrote:
> > I think at this point in time FUSE_PASSTHROUGH and
> > FUSE_IOMAP should be mutually exclusive and
> > fuse_backing_ops could be set at fc level.
> > If we want to move them for per fuse_backing later
> > we can always do that when the use cases and tests arrive.
>
> With Miklos' ok I'll constrain fuse not to allow passthrough and iomap
> files on the same filesystem, but as it is now there's no technical
> reason to make it so that they can't coexist.
Is there a good reason to add the restriction? If restricting it
doesn't simplify anything or even makes it more complex, then I'd opt
for leaving it more general, even if it doesn't seem to make sense.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-18 16:52 ` Darrick J. Wong
@ 2025-09-19 9:24 ` Miklos Szeredi
2025-09-19 17:50 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-19 9:24 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Joanne Koong, bernd, linux-xfs, John, linux-fsdevel, neal
On Thu, 18 Sept 2025 at 18:52, Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Sep 17, 2025 at 10:18:40AM -0700, Joanne Koong wrote:
> > If I'm understanding it correctly, fc->local_fs is set to true if it's
> > a fuseblk device? Why do we need a new "ctx->local_fs" instead of
> > reusing ctx->is_bdev?
>
> Eventually, enabling iomap will also set local_fs=1, as Miklos and I
> sort of touched on a couple weeks ago:
>
> https://lore.kernel.org/linux-fsdevel/CAJfpegvmXnZc=nC4UGw5Gya2cAr-kR0s=WNecnMhdTM_mGyuUg@mail.gmail.com/
I think it might be worth making this property per-inode. I.e. a
distributed filesystem could allow one inode to be completely "owned"
by one client. This would be similar to NFSv4 delegations and could
be refined to read-only (shared) and read-write (exclusive) ownership.
A local filesystem would have all inodes exclusively owned.
This has long been on my todo list, and I also have some prior experiments,
so it's a good opportunity to start working on it again :)
Thanks,
Miklos
>
> --D
>
> > Thanks,
> > Joanne
> >
> > > err = -ENOMEM;
> > > root = fuse_get_root_inode(sb, ctx->rootmode);
> > > @@ -2029,6 +2030,7 @@ static int fuse_init_fs_context(struct fs_context *fsc)
> > > if (fsc->fs_type == &fuseblk_fs_type) {
> > > ctx->is_bdev = true;
> > > ctx->destroy = true;
> > > + ctx->local_fs = true;
> > > }
> > > #endif
> > >
> > >
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c
2025-09-19 7:34 ` Miklos Szeredi
@ 2025-09-19 9:36 ` Amir Goldstein
2025-09-19 17:43 ` Darrick J. Wong
1 sibling, 0 replies; 126+ messages in thread
From: Amir Goldstein @ 2025-09-19 9:36 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Darrick J. Wong, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
On Fri, Sep 19, 2025 at 9:34 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Thu, 18 Sept 2025 at 20:02, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Sep 17, 2025 at 04:47:19AM +0200, Amir Goldstein wrote:
>
> > > I think at this point in time FUSE_PASSTHROUGH and
> > > FUSE_IOMAP should be mutually exclusive and
> > > fuse_backing_ops could be set at fc level.
> > > If we want to move them for per fuse_backing later
> > > we can always do that when the use cases and tests arrive.
> >
> > With Miklos' ok I'll constrain fuse not to allow passthrough and iomap
> > files on the same filesystem, but as it is now there's no technical
> > reason to make it so that they can't coexist.
>
> Is there a good reason to add the restriction? If restricting it
I guess "good reason" is subjective.
I do not like having never-tested code, but it's your fs, so up to you.
> doesn't simplify anything or even makes it more complex, then I'd opt
> for leaving it more general, even if it doesn't seem to make sense.
I don't think either restricting or not is more complex.
It's just a matter of whether fuse_backing_ops are per fuse_backing
or per fuse_conn.
It may come in handy to limit the number of backing ids per fuse_conn
so that it can be negotiated on FUSE_INIT, but that is independent
of the question of mutually excluding the two features.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-19 7:13 ` Miklos Szeredi
@ 2025-09-19 9:54 ` Amir Goldstein
2025-09-19 17:42 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Amir Goldstein @ 2025-09-19 9:54 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Darrick J. Wong, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
On Fri, Sep 19, 2025 at 9:14 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Thu, 18 Sept 2025 at 20:42, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Thu, Sep 18, 2025 at 8:17 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> > > How about restricting the backing ids to RLIMIT_NOFILE? The @end param
> > > to idr_alloc_cyclic constrains them in exactly that way.
> >
> > IDK. My impression was that Miklos didn't like having a large number
> > of unaccounted files, but it's up to him.
>
> There's no 1:1 mapping between a fuse instance and a "fuse server
> process", so the question is whose RLIMIT_NOFILE? Accounting to the
> process that registered the fd would be good, but implementing it
> looks exceedingly complex. Just taking RLIMIT_NOFILE value from the
> process that is doing the fd registering should work, I guess.
>
> There's still the question of unhiding these files. The latest discussion
> ended with: let's create a proper directory tree for open files in proc,
> i.e. /proc/PID/fdtree/FD/hidden/...
>
Yes, well, fuse_backing_open() says:
/* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
So that's the reason I was saying there is no justification to
relax this for FUSE_IOMAP as long as this issue is not resolved.
As Darrick writes, fuse4fs needs only 1 backing blockdev
and other iomap fuse fs are unlikely to need more than a few
backing blockdevs.
So maybe, similar to max_stack_depth, we require the server to
negotiate the max_backing_id at FUSE_INIT time.
We could allow any "reasonable" number without any capabilities
and regardless of RLIMIT_NOFILE or we can account max_backing_id
in advance for the user setting up the connection.
For backward compat (or for privileged servers) zero max_backing_id
means unlimited (within the int32 range) and that requires
CAP_SYS_ADMIN for fuse_backing_open() regardless of which
type of backing file it is.
WDYT?
Thanks,
Amir.
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-19 9:54 ` Amir Goldstein
@ 2025-09-19 17:42 ` Darrick J. Wong
2025-09-23 7:10 ` Miklos Szeredi
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-19 17:42 UTC (permalink / raw)
To: Amir Goldstein
Cc: Miklos Szeredi, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
On Fri, Sep 19, 2025 at 11:54:39AM +0200, Amir Goldstein wrote:
> On Fri, Sep 19, 2025 at 9:14 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Thu, 18 Sept 2025 at 20:42, Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Thu, Sep 18, 2025 at 8:17 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > > > How about restricting the backing ids to RLIMIT_NOFILE? The @end param
> > > > to idr_alloc_cyclic constrains them in exactly that way.
> > >
> > > IDK. My impression was that Miklos didn't like having a large number
> > > of unaccounted files, but it's up to him.
> >
> > There's no 1:1 mapping between a fuse instance and a "fuse server
> > process", so the question is whose RLIMIT_NOFILE? Accounting to the
> > process that registered the fd would be good, but implementing it
> > looks exceedingly complex. Just taking RLIMIT_NOFILE value from the
> > process that is doing the fd registering should work, I guess.
Or perhaps a static limit of 1024 for now, and if someone comes up with
a humongous filesystem that needs more, we can figure out how to support
that later.
Since we're already adding flag bits to the /dev/fuse file::private_data
for synchronous init, I guess we could expand that into a full struct so
that you could open /dev/fuse, ask for various config options, and then
apply them to the fuse_dev when it gets created?
> > There's still the question of unhiding these files. The latest discussion
> > ended with: let's create a proper directory tree for open files in proc,
> > i.e. /proc/PID/fdtree/FD/hidden/...
All the iomap backing files are block devices, perhaps we could put a
symlink in /sys/block/XXX/holders/ to something associated with the
fuse_mount? Perhaps the s_bdi?
This is a more general problem, because there's no standard way to
figure out that a given bdev is an auxiliary device attached to a
multi-device filesystem (e.g. xfs realtime volume or external log).
The downsides are that "holders" is sysfs-happy and even symlinks
require a target kobject; and lsof doesn't know about holders. But at
least it wouldn't be 100% invisible like it is now.
> Yes, well, fuse_backing_open() says:
> /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> So that's the reason I was saying there is no justification to
> relax this for FUSE_IOMAP as long as this issue is not resolved.
>
> As Darrick writes, fuse4fs needs only 1 backing blockdev
> and other iomap fuse fs are unlikely to need more than a few
> backing blockdevs.
Until someone has a go at making btrfs-fuse fully functional. But that
can be their problem. ;)
> So maybe, similar to max_stack_depth, we require the server to
> negotiate the max_backing_id at FUSE_INIT time.
>
> We could allow any "reasonable" number without any capabilities
> and regardless of RLIMIT_NOFILE or we can account max_backing_id
> in advance for the user setting up the connection.
>
> For backward compat (or for privileged servers) zero max_backing_id
> means unlimited (within the int32 range) and that requires
> CAP_SYS_ADMIN for fuse_backing_open() regardless of which
> type of backing file it is.
>
> WDYT?
I think capping at 1024 (or 256, or even 8) is fine for now, and we
can figure out the request protocol later when someone wants more.
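As a rough sketch of what a static cap could look like (a userspace mock, not the patchset's code): struct ex_conn, the ex_* helpers, and the choice of -ENOSPC as the error are all assumptions made for illustration. A real implementation would also need to do the accounting under whatever lock protects the backing id table.

```c
#include <assert.h>
#include <errno.h>

/* The figure floated in this thread; 256 or 8 would work the same way. */
#define EX_MAX_BACKING_IDS 1024

/* HYPOTHETICAL stand-in for the per-connection state. */
struct ex_conn {
	unsigned int nr_backing_ids;
};

/* Reserve one backing id slot, refusing once the static cap is reached. */
static int ex_backing_id_reserve(struct ex_conn *fc)
{
	if (fc->nr_backing_ids >= EX_MAX_BACKING_IDS)
		return -ENOSPC;	/* error code is an assumption */

	fc->nr_backing_ids++;
	return 0;
}

/* Give a slot back when the backing id is closed. */
static void ex_backing_id_release(struct ex_conn *fc)
{
	assert(fc->nr_backing_ids > 0);
	fc->nr_backing_ids--;
}
```

Counting ids rather than fuse_backing objects follows the earlier suggestion in this thread that tracking the number of backing ids is enough.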
Alternately, I wonder if there's a way to pin the fd that is used to
create the backing id so that the fuse server can't close it? There's
probably no non-awful way to pin the fd table entry though.
--D
> Thanks,
> Amir.
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c
2025-09-19 7:34 ` Miklos Szeredi
2025-09-19 9:36 ` Amir Goldstein
@ 2025-09-19 17:43 ` Darrick J. Wong
1 sibling, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-19 17:43 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Amir Goldstein, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
On Fri, Sep 19, 2025 at 09:34:06AM +0200, Miklos Szeredi wrote:
> On Thu, 18 Sept 2025 at 20:02, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Sep 17, 2025 at 04:47:19AM +0200, Amir Goldstein wrote:
>
> > > I think at this point in time FUSE_PASSTHROUGH and
> > > FUSE_IOMAP should be mutually exclusive and
> > > fuse_backing_ops could be set at fc level.
> > > If we want to move them for per fuse_backing later
> > > we can always do that when the use cases and tests arrive.
> >
> > With Miklos' ok I'll constrain fuse not to allow passthrough and iomap
> > files on the same filesystem, but as it is now there's no technical
> > reason to make it so that they can't coexist.
>
> Is there a good reason to add the restriction? If restricting it
> doesn't simplify anything or even makes it more complex, then I'd opt
> for leaving it more general, even if it doesn't seem to make sense.
I don't have a good reason to add a restriction; it's entirely Amir's
concern about testing the two together.
--D
> Thanks,
> Miklos
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-19 9:24 ` Miklos Szeredi
@ 2025-09-19 17:50 ` Darrick J. Wong
2025-09-23 14:57 ` Miklos Szeredi
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-19 17:50 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: Joanne Koong, bernd, linux-xfs, John, linux-fsdevel, neal
On Fri, Sep 19, 2025 at 11:24:09AM +0200, Miklos Szeredi wrote:
> On Thu, 18 Sept 2025 at 18:52, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Sep 17, 2025 at 10:18:40AM -0700, Joanne Koong wrote:
>
> > > If I'm understanding it correctly, fc->local_fs is set to true if it's
> > > a fuseblk device? Why do we need a new "ctx->local_fs" instead of
> > > reusing ctx->is_bdev?
> >
> > Eventually, enabling iomap will also set local_fs=1, as Miklos and I
> > sort of touched on a couple weeks ago:
> >
> > https://lore.kernel.org/linux-fsdevel/CAJfpegvmXnZc=nC4UGw5Gya2cAr-kR0s=WNecnMhdTM_mGyuUg@mail.gmail.com/
>
> I think it might be worth making this property per-inode. I.e. a
> distributed filesystem could allow one inode to be completely "owned"
> by one client. This would be similar to NFSv4 delegations and could
> be refined to read-only (shared) and read-write (exclusive) ownership.
> A local filesystem would have all inodes exclusively owned.
>
> This has long been on my todo list, and I also have some prior experiments,
> so it's a good opportunity to start working on it again :)
Since I already have per-fs and per-inode iomap flags, I can add a
per-inode localfs flag pretty easily for v6. ATM I see 2 existing flags
and 5 proposed (out of 32 possible):
/**
* fuse_attr flags
*
* FUSE_ATTR_SUBMOUNT: Object is a submount root
* FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
* FUSE_ATTR_IOMAP: Use iomap for this inode
* FUSE_ATTR_ATOMIC: Enable untorn writes
* FUSE_ATTR_SYNC: File writes are synchronous
* FUSE_ATTR_IMMUTABLE: File is immutable
* FUSE_ATTR_APPEND: File is append-only
*/
So we still have plenty of space.
Would you like to allow any server to set the per-inode flag? Or would you
rather keep the per-fs flag and require that it's set before setting the
per-inode flag? That would be useful for privilege checking of the fuse
server.
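To make the second option concrete, here is a small userspace sketch of gating the per-inode flag on the per-fs flag. Everything named ex_* or EX_* is hypothetical: the bit position is invented, and struct ex_conn only mimics the local_fs bit discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL bit; the real flag name and value have not been chosen. */
#define EX_FUSE_ATTR_LOCALFS	(1U << 7)

/* HYPOTHETICAL stand-in for the connection-wide state. */
struct ex_conn {
	bool local_fs;	/* set once, after any privilege check at init time */
};

/*
 * Return the attr flags the kernel would honor: the per-inode localfs
 * bit is dropped unless the connection already opted in per-fs.
 */
static uint32_t ex_filter_attr_flags(const struct ex_conn *fc, uint32_t flags)
{
	if (!fc->local_fs)
		flags &= ~EX_FUSE_ATTR_LOCALFS;

	return flags;
}
```

With this shape the privilege check happens once per connection, and the per-inode bit can only narrow behavior that the connection was already allowed to request.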
--D
> Thanks,
> Miklos
>
> >
> > --D
> >
> > > Thanks,
> > > Joanne
> > >
> > > > err = -ENOMEM;
> > > > root = fuse_get_root_inode(sb, ctx->rootmode);
> > > > @@ -2029,6 +2030,7 @@ static int fuse_init_fs_context(struct fs_context *fsc)
> > > > if (fsc->fs_type == &fuseblk_fs_type) {
> > > > ctx->is_bdev = true;
> > > > ctx->destroy = true;
> > > > + ctx->local_fs = true;
> > > > }
> > > > #endif
> > > >
> > > >
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 01/28] fuse: implement the basic iomap mechanisms
2025-09-16 0:28 ` [PATCH 01/28] fuse: implement the basic iomap mechanisms Darrick J. Wong
@ 2025-09-19 22:36 ` Joanne Koong
2025-09-23 20:32 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Joanne Koong @ 2025-09-19 22:36 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal
On Mon, Sep 15, 2025 at 5:28 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Implement functions to enable upcalling of iomap_begin and iomap_end to
> userspace fuse servers.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 35 ++++
> fs/fuse/iomap_priv.h | 36 ++++
> include/uapi/linux/fuse.h | 90 +++++++++
> fs/fuse/Kconfig | 32 +++
> fs/fuse/Makefile | 1
> fs/fuse/file_iomap.c | 434 +++++++++++++++++++++++++++++++++++++++++++++
> fs/fuse/inode.c | 9 +
> 7 files changed, 636 insertions(+), 1 deletion(-)
> create mode 100644 fs/fuse/iomap_priv.h
> create mode 100644 fs/fuse/file_iomap.c
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 4560687d619d76..f0d408a6e12c32 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -923,6 +923,9 @@ struct fuse_conn {
> /* Is synchronous FUSE_INIT allowed? */
> unsigned int sync_init:1;
>
> + /* Enable fs/iomap for file operations */
> + unsigned int iomap:1;
> +
> /* Use io_uring for communication */
> unsigned int io_uring;
>
> @@ -1047,6 +1050,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> return sb->s_fs_info;
> }
>
> +static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
> +{
> + return sb->s_fs_info;
> +}
> +
> static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
> {
> return get_fuse_mount_super(sb)->fc;
> @@ -1057,16 +1065,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
> return get_fuse_mount_super(inode->i_sb);
> }
>
> +static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
> +{
> + return get_fuse_mount_super_c(inode->i_sb);
> +}
> +
> static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
> {
> return get_fuse_mount_super(inode->i_sb)->fc;
> }
>
> +static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
> +{
> + return get_fuse_mount_super_c(inode->i_sb)->fc;
> +}
> +
> static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
> {
> return container_of(inode, struct fuse_inode, inode);
> }
>
> +static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
> +{
> + return container_of(inode, struct fuse_inode, inode);
> +}
Do we need this new set of helpers? AFAICT they do two things: a)
guarantee constness of the arg passed in, and b) guarantee constness of
the pointer returned.
But it seems like for a) we could get that by modifying the existing
functions to take in a const arg, eg
-static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
+static inline struct fuse_inode *get_fuse_inode(const struct inode *inode)
{
return container_of(inode, struct fuse_inode, inode);
}
and for b) it seems to me like the caller enforces the constness of
the pointer returned whether the actual function returns a const
pointer or not,
eg
const struct fuse_inode *fi = get_fuse_inode{_c}(inode);
Maybe I'm missing something here but it seems to me like we don't need
these new helpers?
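For reference, a self-contained userspace sketch of the two variants being compared. The struct layouts and the simplified container_of are stand-ins, not the kernel definitions; the only property relied on is that the cast inside container_of drops the const qualifier, as the kernel's plain container_of also does.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; layouts are illustrative only. */
struct inode {
	unsigned long i_ino;
};

struct fuse_inode {
	unsigned long nodeid;
	struct inode inode;
};

/* Simplified container_of: the cast discards any const on @ptr. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Variant a): accept a const inode, hand back a non-const fuse_inode. */
static inline struct fuse_inode *get_fuse_inode(const struct inode *inode)
{
	return container_of(inode, struct fuse_inode, inode);
}

/* Variant b): the _c helper from the patch, which also returns const. */
static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
{
	return container_of(inode, struct fuse_inode, inode);
}

/* A caller holding only a const inode can still keep the result const. */
static unsigned long nodeid_of(const struct inode *inode)
{
	const struct fuse_inode *fi = get_fuse_inode(inode);

	assert(fi != NULL);
	return fi->nodeid;
}
```

The difference is who enforces the constness: with variant a) a caller can obtain a writable fuse_inode from a const inode without any compiler diagnostic, whereas with variant b) that requires an explicit cast at the call site.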
> +
> diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
Btw, I think the general convention is to use an "_i.h" suffix for
private internal files, e.g. fuse_i.h, fuse_dev_i.h, dev_uring_i.h
> new file mode 100644
> index 00000000000000..243d92cb625095
> --- /dev/null
> +++ b/fs/fuse/iomap_priv.h
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2025 Oracle. All Rights Reserved.
> + * Author: Darrick J. Wong <djwong@kernel.org>
> + */
> +#ifndef _FS_FUSE_IOMAP_PRIV_H
> +#define _FS_FUSE_IOMAP_PRIV_H
> +
...
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 31b80f93211b81..3634cbe602cd9c 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -235,6 +235,9 @@
> *
> * 7.44
> * - add FUSE_NOTIFY_INC_EPOCH
> + *
> + * 7.99
Just curious, where did you get the .99 from?
> + * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -270,7 +273,7 @@
> #define FUSE_KERNEL_VERSION 7
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 9563fa5387a241..67dfe300bf2e07 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -69,6 +69,38 @@ config FUSE_PASSTHROUGH
> +config FUSE_IOMAP_DEBUG
> + bool "Debug FUSE file IO over iomap"
> + default n
> + depends on FUSE_IOMAP
> + help
> + Enable debugging assertions for the fuse iomap code paths and logging
> + of bad iomap file mapping data being sent to the kernel.
> +
I wonder if we should have a general FUSE_DEBUG that this would fall
under instead of creating one that's iomap_debug specific
> config FUSE_IO_URING
> bool "FUSE communication over io-uring"
> default y
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index 46041228e5be2c..27be39317701d6 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -18,5 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> fuse-$(CONFIG_FUSE_BACKING) += backing.o
> fuse-$(CONFIG_SYSCTL) += sysctl.o
> fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> +fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
>
> virtiofs-y := virtio_fs.o
> diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> new file mode 100644
> index 00000000000000..dda757768d3ea6
> --- /dev/null
> +++ b/fs/fuse/file_iomap.c
> @@ -0,0 +1,434 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2025 Oracle. All Rights Reserved.
> + * Author: Darrick J. Wong <djwong@kernel.org>
> + */
> +#include <linux/iomap.h>
> +#include "fuse_i.h"
> +#include "fuse_trace.h"
> +#include "iomap_priv.h"
> +
> +/* Validate FUSE_IOMAP_TYPE_* */
> +static inline bool fuse_iomap_check_type(uint16_t fuse_type)
> +{
> + switch (fuse_type) {
> + case FUSE_IOMAP_TYPE_HOLE:
> + case FUSE_IOMAP_TYPE_DELALLOC:
> + case FUSE_IOMAP_TYPE_MAPPED:
> + case FUSE_IOMAP_TYPE_UNWRITTEN:
> + case FUSE_IOMAP_TYPE_INLINE:
> + case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
> + return true;
> + }
> +
> + return false;
> +}
Maybe faster to check by using a bitmask instead?
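A sketch of what the bitmask variant might look like, under a loud assumption: the EX_IOMAP_TYPE_* values below are invented small, dense numbers. The real FUSE_IOMAP_TYPE_* constants are not shown in this thread, and if any of them is a large sentinel value the mask approach does not apply as written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL values, chosen only so that every type fits in one word. */
enum {
	EX_IOMAP_TYPE_HOLE		= 0,
	EX_IOMAP_TYPE_DELALLOC		= 1,
	EX_IOMAP_TYPE_MAPPED		= 2,
	EX_IOMAP_TYPE_UNWRITTEN		= 3,
	EX_IOMAP_TYPE_INLINE		= 4,
	EX_IOMAP_TYPE_PURE_OVERWRITE	= 5,
};

#define EX_IOMAP_VALID_TYPES \
	((1U << EX_IOMAP_TYPE_HOLE) | \
	 (1U << EX_IOMAP_TYPE_DELALLOC) | \
	 (1U << EX_IOMAP_TYPE_MAPPED) | \
	 (1U << EX_IOMAP_TYPE_UNWRITTEN) | \
	 (1U << EX_IOMAP_TYPE_INLINE) | \
	 (1U << EX_IOMAP_TYPE_PURE_OVERWRITE))

/* Validate a type value against the mask of known types. */
static inline bool ex_iomap_check_type(uint16_t fuse_type)
{
	/* Range check first: shifting by 32 or more would be undefined. */
	return fuse_type < 32 && (EX_IOMAP_VALID_TYPES & (1U << fuse_type));
}
```

Whether this is measurably faster than the switch is an open question; a compiler may already turn a dense switch into a similar test.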
> +
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 1e7298b2b89b58..32f4b7c9a20a8a 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1448,6 +1448,13 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>
> if (flags & FUSE_REQUEST_TIMEOUT)
> timeout = arg->request_timeout;
> +
> + if ((flags & FUSE_IOMAP) && fuse_iomap_enabled()) {
> + fc->local_fs = 1;
> + fc->iomap = 1;
> + printk(KERN_WARNING
> + "fuse: EXPERIMENTAL iomap feature enabled. Use at your own risk!");
> + }
pr_warn() seems to be the convention elsewhere in the fuse code
Thanks,
Joanne
> } else {
> ra_pages = fc->max_read / PAGE_SIZE;
> fc->no_lock = 1;
> @@ -1516,6 +1523,8 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
> */
> if (fuse_uring_enabled())
> flags |= FUSE_OVER_IO_URING;
> + if (fuse_iomap_enabled())
> + flags |= FUSE_IOMAP;
>
> ia->in.flags = flags;
> ia->in.flags2 = flags >> 32;
>
^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
2025-09-19 17:42 ` Darrick J. Wong
@ 2025-09-23 7:10 ` Miklos Szeredi
0 siblings, 0 replies; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-23 7:10 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Amir Goldstein, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
On Fri, 19 Sept 2025 at 19:42, Darrick J. Wong <djwong@kernel.org> wrote:
> I think capping at 1024 now (or 256, or even 8) is fine for now, and we
> can figure out the request protocol later when someone wants more.
Yeah, whichever.
> Alternately, I wonder if there's a way to pin the fd that is used to
> create the backing id so that the fuse server can't close it? There's
> probably no non-awful way to pin the fd table entry though.
I don't think this could work.
My idea back then was to create a kernel thread for each fuse instance
and have FUSE_DEV_IOC_BACKING_OPEN/_CLOSE operate on the file table of
this thread. Not sure how practical this would be.
Thanks,
Miklos
>
> --D
>
> > Thanks,
> > Amir.
* Re: [PATCH 1/8] fuse: fix livelock in synchronous file put from fuseblk workers
2025-09-16 0:24 ` [PATCH 1/8] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
@ 2025-09-23 10:57 ` Miklos Szeredi
0 siblings, 0 replies; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-23 10:57 UTC (permalink / raw)
To: Darrick J. Wong
Cc: stable, bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, 16 Sept 2025 at 02:24, Darrick J. Wong <djwong@kernel.org> wrote:
> Fix this by only using asynchronous fputs when closing files, and leave
> a comment explaining why.
>
> Cc: <stable@vger.kernel.org> # v2.6.38
> Fixes: 5a18ec176c934c ("fuse: fix hang of single threaded fuseblk filesystem")
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Applied, thanks.
Miklos
* Re: [PATCH 3/8] fuse: capture the unique id of fuse commands being sent
2025-09-16 0:24 ` [PATCH 3/8] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
@ 2025-09-23 10:58 ` Miklos Szeredi
0 siblings, 0 replies; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-23 10:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, 16 Sept 2025 at 02:24, Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> The fuse_request_{send,end} tracepoints capture the value of
> req->in.h.unique in the trace output. It would be really nice if we
> could use this to match a request to its response for debugging and
> latency analysis, but the call to trace_fuse_request_send occurs before
> the unique id has been set:
>
> fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
> fuse_request_end: connection 8388608 req 6 len 16 error -2
>
> (Notice that req moves from 0 to 6)
>
> Move the callsites to trace_fuse_request_send to after the unique id has
> been set by introducing a helper to do that for standard fuse_req
> requests. FUSE_FORGET requests are not covered by this because they
> appear to be synthesized into the event stream without a fuse_req
> object and are never replied to.
>
> Requests that are aborted without ever having been submitted to the fuse
> server retain the behavior that only the fuse_request_end tracepoint
> shows up in the trace record, and with req==0.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Applied, thanks.
Miklos
* Re: [PATCH 8/8] fuse: enable FUSE_SYNCFS for all fuseblk servers
2025-09-16 0:26 ` [PATCH 8/8] fuse: enable FUSE_SYNCFS for all fuseblk servers Darrick J. Wong
@ 2025-09-23 10:58 ` Miklos Szeredi
0 siblings, 0 replies; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-23 10:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, 16 Sept 2025 at 02:26, Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Turn on syncfs for all fuseblk servers so that the ones in the know can
> flush cached intermediate data and logs to disk.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Applied, thanks.
Miklos
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-16 0:24 ` [PATCH 2/8] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
@ 2025-09-23 11:11 ` Miklos Szeredi
2025-09-23 14:54 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-23 11:11 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, 16 Sept 2025 at 02:24, Darrick J. Wong <djwong@kernel.org> wrote:
> + /*
> + * Wait for all the events to complete or abort. Touch the watchdog
> + * once per second so that we don't trip the hangcheck timer while
> + * waiting for the fuse server.
> + */
> + smp_mb();
> + while (wait_event_timeout(fc->blocked_waitq,
> + !fc->connected || atomic_read(&fc->num_waiting) == 0,
> + HZ) == 0)
I applied this patch, but then realized that I don't understand what's
going on here.
Why is this site special? Should the other waits for server response
be treated like this?
Thanks,
Miklos
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-23 11:11 ` Miklos Szeredi
@ 2025-09-23 14:54 ` Darrick J. Wong
2025-09-23 18:56 ` Miklos Szeredi
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-23 14:54 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 23, 2025 at 01:11:39PM +0200, Miklos Szeredi wrote:
> On Tue, 16 Sept 2025 at 02:24, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > + /*
> > + * Wait for all the events to complete or abort. Touch the watchdog
> > + * once per second so that we don't trip the hangcheck timer while
> > + * waiting for the fuse server.
> > + */
> > + smp_mb();
> > + while (wait_event_timeout(fc->blocked_waitq,
> > + !fc->connected || atomic_read(&fc->num_waiting) == 0,
> > + HZ) == 0)
>
> I applied this patch, but then realized that I don't understand what's
> going on here.
We go around this loop, sleeping for up to 1 second per pass, until
either the fuse connection drops or the number of pending fuse commands
hits zero. Non-timeout wakeups are triggered by the
wake_up_all(&fc->blocked_waitq) in fuse_drop_waiting() after a request
is completed or aborted.
The loop body touches the soft lockup watchdog so that we don't get hung
task warnings while waiting for a possibly large number of RELEASE
requests (or whatever's queued up at that point) to be processed by the
server. I didn't use wait_event_killable_timeout because I don't know
how to clean up an in-progress unmount midway through.
> Why is this site special? Should the other waits for server response
> be treated like this?
I'm not sure what you're referring to by "special" -- are you asking
about why I added the touch_softlockup_watchdog() call here but not in
fuse_wait_aborted()? I think it could use that treatment too, but once
you abort all the pending requests they tend to go away very quickly.
It might be the case that nobody's gotten a warning simply because the
aborted requests all go away in under 30 seconds.
--D
> Thanks,
> Miklos
* Re: [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-19 17:50 ` Darrick J. Wong
@ 2025-09-23 14:57 ` Miklos Szeredi
2025-09-23 20:51 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-23 14:57 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Joanne Koong, bernd, linux-xfs, John, linux-fsdevel, neal
On Fri, 19 Sept 2025 at 19:50, Darrick J. Wong <djwong@kernel.org> wrote:
> /**
> * fuse_attr flags
> *
> * FUSE_ATTR_SUBMOUNT: Object is a submount root
> * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
> * FUSE_ATTR_IOMAP: Use iomap for this inode
> * FUSE_ATTR_ATOMIC: Enable untorn writes
> * FUSE_ATTR_SYNC: File writes are synchronous
> * FUSE_ATTR_IMMUTABLE: File is immutable
> * FUSE_ATTR_APPEND: File is append-only
> */
>
> So we still have plenty of space.
No, I was thinking of an internal flag or flags. Exporting this to
the server will come at some point, but not now.
So for now something like
/** FUSE inode state bits */
enum {
...
/* Exclusive access to file, either because fs is local or have an
exclusive "lease" on distributed fs */
FUSE_I_EXCLUSIVE,
};
Thanks,
Miklos
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-23 14:54 ` Darrick J. Wong
@ 2025-09-23 18:56 ` Miklos Szeredi
2025-09-23 20:59 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-23 18:56 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, 23 Sept 2025 at 16:54, Darrick J. Wong <djwong@kernel.org> wrote:
> I'm not sure what you're referring to by "special" -- are you asking
> about why I added the touch_softlockup_watchdog() call here but not in
> fuse_wait_aborted()? I think it could use that treatment too, but once
> you abort all the pending requests they tend to go away very quickly.
> It might be the case that nobody's gotten a warning simply because the
> aborted requests all go away in under 30 seconds.
Maybe I'm not understanding how the softlockup detector works. I
thought that it triggers if task is spinning in a tight loop. That
precludes any timeouts, since that means that the task went to sleep.
So what's happening here?
Thanks,
Miklos
* Re: [PATCH 01/28] fuse: implement the basic iomap mechanisms
2025-09-19 22:36 ` Joanne Koong
@ 2025-09-23 20:32 ` Darrick J. Wong
2025-09-23 21:24 ` Joanne Koong
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-23 20:32 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal
On Fri, Sep 19, 2025 at 03:36:52PM -0700, Joanne Koong wrote:
> On Mon, Sep 15, 2025 at 5:28 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Implement functions to enable upcalling of iomap_begin and iomap_end to
> > userspace fuse servers.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/fuse_i.h | 35 ++++
> > fs/fuse/iomap_priv.h | 36 ++++
> > include/uapi/linux/fuse.h | 90 +++++++++
> > fs/fuse/Kconfig | 32 +++
> > fs/fuse/Makefile | 1
> > fs/fuse/file_iomap.c | 434 +++++++++++++++++++++++++++++++++++++++++++++
> > fs/fuse/inode.c | 9 +
> > 7 files changed, 636 insertions(+), 1 deletion(-)
> > create mode 100644 fs/fuse/iomap_priv.h
> > create mode 100644 fs/fuse/file_iomap.c
> >
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 4560687d619d76..f0d408a6e12c32 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -923,6 +923,9 @@ struct fuse_conn {
> > /* Is synchronous FUSE_INIT allowed? */
> > unsigned int sync_init:1;
> >
> > + /* Enable fs/iomap for file operations */
> > + unsigned int iomap:1;
> > +
> > /* Use io_uring for communication */
> > unsigned int io_uring;
> >
> > @@ -1047,6 +1050,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> > return sb->s_fs_info;
> > }
> >
> > +static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
> > +{
> > + return sb->s_fs_info;
> > +}
> > +
> > static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
> > {
> > return get_fuse_mount_super(sb)->fc;
> > @@ -1057,16 +1065,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
> > return get_fuse_mount_super(inode->i_sb);
> > }
> >
> > +static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
> > +{
> > + return get_fuse_mount_super_c(inode->i_sb);
> > +}
> > +
> > static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
> > {
> > return get_fuse_mount_super(inode->i_sb)->fc;
> > }
> >
> > +static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
> > +{
> > + return get_fuse_mount_super_c(inode->i_sb)->fc;
> > +}
> > +
> > static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
> > {
> > return container_of(inode, struct fuse_inode, inode);
> > }
> >
> > +static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
> > +{
> > + return container_of(inode, struct fuse_inode, inode);
> > +}
>
> Do we need these new set of helpers? AFAICT it does two things: a)
> guarantee constness of the arg passed in b) guarantee constness of the
> pointer returned
>
> But it seems like for a) we could get that by modifying the existing
> functions to take in a const arg, eg
>
> -static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
> +static inline struct fuse_inode *get_fuse_inode(const struct inode *inode)
> {
> return container_of(inode, struct fuse_inode, inode);
> }
>
> and for b) it seems to me like the caller enforces the constness of
> the pointer returned whether the actual function returns a const
> pointer or not,
>
> eg
> const struct fuse_inode *fi = get_fuse_inode{_c}(inode);
>
> Maybe I'm missing something here but it seems to me like we don't need
> these new helpers?
Heh. I had mistakenly thought that one cannot cast a const struct
pointer to a mutable struct pointer, but I just tried your
suggestion and it seemed to work fine. So I guess we don't need
get_fuse_mount_c either.
Yay C, all it's doing is taking a number pointing to something that
can't be changed, subtracting from it, and thus returning a different
number. Perhaps Rust has polluted my brain.
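For the record, a small userspace sketch of why this compiles. The
demo_* names are invented for the example, and demo_container_of() is a
simplified stand-in for the kernel macro; the relevant property it
shares with the real one is that it ends in a cast to (type *), which
quietly drops the const qualifier of the argument.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for container_of(); the final cast drops const. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature versions of struct inode / struct fuse_inode. */
struct demo_inode {
	unsigned long i_ino;
};

struct demo_fuse_inode {
	unsigned long nodeid;
	struct demo_inode inode;
};

/* Takes a const argument, yet returns a non-const pointer. */
static inline struct demo_fuse_inode *
demo_get_fuse_inode(const struct demo_inode *inode)
{
	assert(inode != NULL);
	return demo_container_of(inode, struct demo_fuse_inode, inode);
}

/* The caller decides whether the result is treated as const. */
static unsigned long demo_nodeid(const struct demo_inode *inode)
{
	const struct demo_fuse_inode *fi = demo_get_fuse_inode(inode);

	return fi->nodeid;
}
```

So a single helper with a const parameter serves both kinds of caller,
at the cost that nothing stops a caller from writing through the
returned pointer.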
> > +
> > diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
>
> btw, i think the general convention is to use "_i.h" suffixing for
> private internal files, eg fuse_i.h, fuse_dev_i.h, dev_uring_i.h
Noted, thank you.
> > new file mode 100644
> > index 00000000000000..243d92cb625095
> > --- /dev/null
> > +++ b/fs/fuse/iomap_priv.h
> > @@ -0,0 +1,36 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2025 Oracle. All Rights Reserved.
> > + * Author: Darrick J. Wong <djwong@kernel.org>
> > + */
> > +#ifndef _FS_FUSE_IOMAP_PRIV_H
> > +#define _FS_FUSE_IOMAP_PRIV_H
> > +
> ...
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 31b80f93211b81..3634cbe602cd9c 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -235,6 +235,9 @@
> > *
> > * 7.44
> > * - add FUSE_NOTIFY_INC_EPOCH
> > + *
> > + * 7.99
>
> Just curious, where did you get the .99 from?
Any time I go adding to a versioned ABI, I try to use high numbers (and
high bits for flags) so that it's really obvious that the new flags are
in use when I poke through crash/gdb/etc.
For permanent artifacts like an ondisk format, it's convenient to cache
fs images for fuzz testing, etc. Using a high bit/number reduces the
chance that someone else's new feature will get merged and cause
conflicts, which force me to regenerate all cached images.
Obviously at merge time I change these values to use lower bit positions
or version numbers to fit the merge target so it doesn't completely
eliminate the caching problems.
> > + * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
> > */
> >
> > #ifndef _LINUX_FUSE_H
> > @@ -270,7 +273,7 @@
> > #define FUSE_KERNEL_VERSION 7
> > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > index 9563fa5387a241..67dfe300bf2e07 100644
> > --- a/fs/fuse/Kconfig
> > +++ b/fs/fuse/Kconfig
> > @@ -69,6 +69,38 @@ config FUSE_PASSTHROUGH
> > +config FUSE_IOMAP_DEBUG
> > + bool "Debug FUSE file IO over iomap"
> > + default n
> > + depends on FUSE_IOMAP
> > + help
> > + Enable debugging assertions for the fuse iomap code paths and logging
> > + of bad iomap file mapping data being sent to the kernel.
> > +
>
> I wonder if we should have a general FUSE_DEBUG that this would fall
> under instead of creating one that's iomap_debug specific
Probably, but I was also trying to keep this as localized to iomap as
possible. If Miklos would rather I extend it to all of fuse (which is
probably a good idea!) then I'm happy to do so.
> > config FUSE_IO_URING
> > bool "FUSE communication over io-uring"
> > default y
> > diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> > index 46041228e5be2c..27be39317701d6 100644
> > --- a/fs/fuse/Makefile
> > +++ b/fs/fuse/Makefile
> > @@ -18,5 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> > fuse-$(CONFIG_FUSE_BACKING) += backing.o
> > fuse-$(CONFIG_SYSCTL) += sysctl.o
> > fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> > +fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
> >
> > virtiofs-y := virtio_fs.o
> > diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> > new file mode 100644
> > index 00000000000000..dda757768d3ea6
> > --- /dev/null
> > +++ b/fs/fuse/file_iomap.c
> > @@ -0,0 +1,434 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2025 Oracle. All Rights Reserved.
> > + * Author: Darrick J. Wong <djwong@kernel.org>
> > + */
> > +#include <linux/iomap.h>
> > +#include "fuse_i.h"
> > +#include "fuse_trace.h"
> > +#include "iomap_priv.h"
> > +
> > +/* Validate FUSE_IOMAP_TYPE_* */
> > +static inline bool fuse_iomap_check_type(uint16_t fuse_type)
> > +{
> > + switch (fuse_type) {
> > + case FUSE_IOMAP_TYPE_HOLE:
> > + case FUSE_IOMAP_TYPE_DELALLOC:
> > + case FUSE_IOMAP_TYPE_MAPPED:
> > + case FUSE_IOMAP_TYPE_UNWRITTEN:
> > + case FUSE_IOMAP_TYPE_INLINE:
> > + case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
> > + return true;
> > + }
> > +
> > + return false;
> > +}
>
> Maybe faster to check by using a bitmask instead?
They're consecutive; one could #define a FUSE_IOMAP_TYPE_MAX to alias
PURE_OVERWRITE and collapse the whole check to:
return fuse_type <= FUSE_IOMAP_TYPE_MAX;
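Spelled out as a userspace sketch, the three variants look roughly like
this. The DEMO_* names and values are invented for the example; the only
assumption about the real FUSE_IOMAP_TYPE_* constants is that they are
consecutive and start at zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values, assumed consecutive from zero. */
enum demo_iomap_type {
	DEMO_TYPE_HOLE = 0,
	DEMO_TYPE_DELALLOC,
	DEMO_TYPE_MAPPED,
	DEMO_TYPE_UNWRITTEN,
	DEMO_TYPE_INLINE,
	DEMO_TYPE_PURE_OVERWRITE,
	DEMO_TYPE_MAX = DEMO_TYPE_PURE_OVERWRITE,
};

/* Switch form, mirroring the shape of the function under review. */
static inline bool demo_check_type_switch(uint16_t type)
{
	switch (type) {
	case DEMO_TYPE_HOLE:
	case DEMO_TYPE_DELALLOC:
	case DEMO_TYPE_MAPPED:
	case DEMO_TYPE_UNWRITTEN:
	case DEMO_TYPE_INLINE:
	case DEMO_TYPE_PURE_OVERWRITE:
		return true;
	}

	return false;
}

/* Bitmask form: one bit per valid type; needs every value below 32. */
#define DEMO_VALID_TYPES ((1U << DEMO_TYPE_HOLE) | \
			  (1U << DEMO_TYPE_DELALLOC) | \
			  (1U << DEMO_TYPE_MAPPED) | \
			  (1U << DEMO_TYPE_UNWRITTEN) | \
			  (1U << DEMO_TYPE_INLINE) | \
			  (1U << DEMO_TYPE_PURE_OVERWRITE))

static inline bool demo_check_type_mask(uint16_t type)
{
	return type < 32 && ((DEMO_VALID_TYPES >> type) & 1U);
}

/* Range form: only correct while the values stay consecutive. */
static inline bool demo_check_type_range(uint16_t type)
{
	return type <= DEMO_TYPE_MAX;
}

/* Check that all three forms agree for every possible uint16_t value. */
static void demo_selftest(void)
{
	for (uint32_t t = 0; t <= UINT16_MAX; t++) {
		bool want = demo_check_type_switch((uint16_t)t);

		assert(demo_check_type_mask((uint16_t)t) == want);
		assert(demo_check_type_range((uint16_t)t) == want);
	}
}
```

The range check is the shortest, but it silently becomes wrong if a gap
is ever introduced in the numbering; the bitmask survives gaps as long
as the values fit in the mask.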
> > +
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 1e7298b2b89b58..32f4b7c9a20a8a 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1448,6 +1448,13 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> >
> > if (flags & FUSE_REQUEST_TIMEOUT)
> > timeout = arg->request_timeout;
> > +
> > + if ((flags & FUSE_IOMAP) && fuse_iomap_enabled()) {
> > + fc->local_fs = 1;
> > + fc->iomap = 1;
> > + printk(KERN_WARNING
> > + "fuse: EXPERIMENTAL iomap feature enabled. Use at your own risk!");
> > + }
>
> pr_warn() seems to be the convention elsewhere in the fuse code
Ah, thanks. Do you know why fuse calls pr_warn("fuse: XXX") instead of
the usual sequence of
#define pr_fmt(fmt) "fuse: " fmt
so that "fuse: " gets included automatically?
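As a reminder of how that sequence works, here is a userspace sketch.
demo_pr_warn() and demo_log are invented stand-ins that format into a
buffer instead of the kernel log; the pr_fmt() definition is the one
quoted above, and the wrapper pastes it in front of the caller's format
string at compile time.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Has to be defined before the first use of the pr_* style wrappers. */
#define pr_fmt(fmt) "fuse: " fmt

/* Hypothetical log buffer so the output can be inspected. */
static char demo_log[256];

/*
 * Stand-in for pr_warn().  Adjacent string literals are concatenated, so
 * pr_fmt(fmt) turns "foo\n" into "fuse: foo\n".  ##__VA_ARGS__ is a GNU
 * extension that swallows the comma when there are no extra arguments.
 */
#define demo_pr_warn(fmt, ...) \
	snprintf(demo_log, sizeof(demo_log), pr_fmt(fmt), ##__VA_ARGS__)

/* The caller no longer spells out the "fuse: " prefix itself. */
static const char *demo_warn_iomap(void)
{
	demo_pr_warn("EXPERIMENTAL iomap feature enabled. Use at your own risk!\n");
	return demo_log;
}
```

Calling demo_warn_iomap() leaves the prefixed message in demo_log, and a
format with arguments such as demo_pr_warn("type %u\n", 7u) is prefixed
the same way.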
--D
>
> Thanks,
> Joanne
> > } else {
> > ra_pages = fc->max_read / PAGE_SIZE;
> > fc->no_lock = 1;
> > @@ -1516,6 +1523,8 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
> > */
> > if (fuse_uring_enabled())
> > flags |= FUSE_OVER_IO_URING;
> > + if (fuse_iomap_enabled())
> > + flags |= FUSE_IOMAP;
> >
> > ia->in.flags = flags;
> > ia->in.flags2 = flags >> 32;
> >
>
* Re: [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-23 14:57 ` Miklos Szeredi
@ 2025-09-23 20:51 ` Darrick J. Wong
2025-09-24 13:55 ` Miklos Szeredi
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-23 20:51 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: Joanne Koong, bernd, linux-xfs, John, linux-fsdevel, neal
On Tue, Sep 23, 2025 at 04:57:30PM +0200, Miklos Szeredi wrote:
> On Fri, 19 Sept 2025 at 19:50, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > /**
> > * fuse_attr flags
> > *
> > * FUSE_ATTR_SUBMOUNT: Object is a submount root
> > * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
> > * FUSE_ATTR_IOMAP: Use iomap for this inode
> > * FUSE_ATTR_ATOMIC: Enable untorn writes
> > * FUSE_ATTR_SYNC: File writes are synchronous
> > * FUSE_ATTR_IMMUTABLE: File is immutable
> > * FUSE_ATTR_APPEND: File is append-only
> > */
> >
> > So we still have plenty of space.
>
> No, I was thinking of an internal flag or flags. Exporting this to
> the server will come at some point, but not now.
>
> So for now something like
>
> /** FUSE inode state bits */
> enum {
> ...
> /* Exclusive access to file, either because fs is local or have an
> exclusive "lease" on distributed fs */
> FUSE_I_EXCLUSIVE,
> };
Oh, ok. I can do that. Just to be clear about what I need to do for
v6:
* fuse_conn::local_fs goes away
* FUSE_I_* gains a new FUSE_I_EXCLUSIVE flag
* "local" operations check for FUSE_I_EXCLUSIVE instead of local_fs
* fuseblk filesystems always set FUSE_I_EXCLUSIVE
* iomap filesystems (when they arrive) always set FUSE_I_EXCLUSIVE
Right?
--D
> Thanks,
> Miklos
>
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-23 18:56 ` Miklos Szeredi
@ 2025-09-23 20:59 ` Darrick J. Wong
2025-09-23 22:34 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-23 20:59 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 23, 2025 at 08:56:47PM +0200, Miklos Szeredi wrote:
> On Tue, 23 Sept 2025 at 16:54, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > I'm not sure what you're referring to by "special" -- are you asking
> > about why I added the touch_softlockup_watchdog() call here but not in
> > fuse_wait_aborted()? I think it could use that treatment too, but once
> > you abort all the pending requests they tend to go away very quickly.
> > It might be the case that nobody's gotten a warning simply because the
> > aborted requests all go away in under 30 seconds.
>
> Maybe I'm not understanding how the softlockup detector works. I
> thought that it triggers if task is spinning in a tight loop. That
> precludes any timeouts, since that means that the task went to sleep.
>
> So what's happening here?
Hrm, I thought the softlockup detector also complains about tasks stuck
in uninterruptible sleep, but you're right, it *does* schedule() so the
softlockup detector won't complain about it.
I think. Let me go try to prove that empirically. :)
--D
> Thanks,
> Miklos
>
* Re: [PATCH 01/28] fuse: implement the basic iomap mechanisms
2025-09-23 20:32 ` Darrick J. Wong
@ 2025-09-23 21:24 ` Joanne Koong
2025-09-23 22:10 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Joanne Koong @ 2025-09-23 21:24 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal
On Tue, Sep 23, 2025 at 1:32 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Sep 19, 2025 at 03:36:52PM -0700, Joanne Koong wrote:
> > On Mon, Sep 15, 2025 at 5:28 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Implement functions to enable upcalling of iomap_begin and iomap_end to
> > > userspace fuse servers.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > > fs/fuse/fuse_i.h | 35 ++++
> > > fs/fuse/iomap_priv.h | 36 ++++
> > > include/uapi/linux/fuse.h | 90 +++++++++
> > > fs/fuse/Kconfig | 32 +++
> > > fs/fuse/Makefile | 1
> > > fs/fuse/file_iomap.c | 434 +++++++++++++++++++++++++++++++++++++++++++++
> > > fs/fuse/inode.c | 9 +
> > > 7 files changed, 636 insertions(+), 1 deletion(-)
> > > create mode 100644 fs/fuse/iomap_priv.h
> > > create mode 100644 fs/fuse/file_iomap.c
> > >
> > > new file mode 100644
> > > index 00000000000000..243d92cb625095
> > > --- /dev/null
> > > +++ b/fs/fuse/iomap_priv.h
> > > @@ -0,0 +1,36 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * Copyright (C) 2025 Oracle. All Rights Reserved.
> > > + * Author: Darrick J. Wong <djwong@kernel.org>
> > > + */
> > > +#ifndef _FS_FUSE_IOMAP_PRIV_H
> > > +#define _FS_FUSE_IOMAP_PRIV_H
> > > +
> > ...
> > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > index 31b80f93211b81..3634cbe602cd9c 100644
> > > --- a/include/uapi/linux/fuse.h
> > > +++ b/include/uapi/linux/fuse.h
> > > @@ -235,6 +235,9 @@
> > > *
> > > * 7.44
> > > * - add FUSE_NOTIFY_INC_EPOCH
> > > + *
> > > + * 7.99
> >
> > Just curious, where did you get the .99 from?
>
> Any time I go adding to a versioned ABI, I try to use high numbers (and
> high bits for flags) so that it's really obvious that the new flags are
> in use when I poke through crash/gdb/etc.
>
> For permanent artifacts like an ondisk format, it's convenient to cache
> fs images for fuzz testing, etc. Using a high bit/number reduces the
> chance that someone else's new feature will get merged and cause
> conflicts, which force me to regenerate all cached images.
>
> Obviously at merge time I change these values to use lower bit positions
> or version numbers to fit the merge target so it doesn't completely
> eliminate the caching problems.
Ahh okay I see, thanks for the explanation!
>
> > > + * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
> > > */
> > >
> > > #ifndef _LINUX_FUSE_H
> > > @@ -270,7 +273,7 @@
> > > #define FUSE_KERNEL_VERSION 7
> > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > > index 9563fa5387a241..67dfe300bf2e07 100644
> > > --- a/fs/fuse/Kconfig
> > > +++ b/fs/fuse/Kconfig
> > > @@ -69,6 +69,38 @@ config FUSE_PASSTHROUGH
> > > +config FUSE_IOMAP_DEBUG
> > > + bool "Debug FUSE file IO over iomap"
> > > + default n
> > > + depends on FUSE_IOMAP
> > > + help
> > > + Enable debugging assertions for the fuse iomap code paths and logging
> > > + of bad iomap file mapping data being sent to the kernel.
> > > +
> >
> > I wonder if we should have a general FUSE_DEBUG that this would fall
> > under instead of creating one that's iomap_debug specific
>
> Probably, but I was also trying to keep this as localized to iomap as
> possible. If Miklos would rather I extend it to all of fuse (which is
> probably a good idea!) then I'm happy to do so.
>
> > > config FUSE_IO_URING
> > > bool "FUSE communication over io-uring"
> > > default y
> > > diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> > > index 46041228e5be2c..27be39317701d6 100644
> > > --- a/fs/fuse/Makefile
> > > +++ b/fs/fuse/Makefile
> > > @@ -18,5 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> > > fuse-$(CONFIG_FUSE_BACKING) += backing.o
> > > fuse-$(CONFIG_SYSCTL) += sysctl.o
> > > fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> > > +fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
> > >
> > > virtiofs-y := virtio_fs.o
> > > diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> > > new file mode 100644
> > > index 00000000000000..dda757768d3ea6
> > > --- /dev/null
> > > +++ b/fs/fuse/file_iomap.c
> > > @@ -0,0 +1,434 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * Copyright (C) 2025 Oracle. All Rights Reserved.
> > > + * Author: Darrick J. Wong <djwong@kernel.org>
> > > + */
> > > +#include <linux/iomap.h>
> > > +#include "fuse_i.h"
> > > +#include "fuse_trace.h"
> > > +#include "iomap_priv.h"
> > > +
> > > +/* Validate FUSE_IOMAP_TYPE_* */
> > > +static inline bool fuse_iomap_check_type(uint16_t fuse_type)
> > > +{
> > > + switch (fuse_type) {
> > > + case FUSE_IOMAP_TYPE_HOLE:
> > > + case FUSE_IOMAP_TYPE_DELALLOC:
> > > + case FUSE_IOMAP_TYPE_MAPPED:
> > > + case FUSE_IOMAP_TYPE_UNWRITTEN:
> > > + case FUSE_IOMAP_TYPE_INLINE:
> > > + case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
> > > + return true;
> > > + }
> > > +
> > > + return false;
> > > +}
> >
> > Maybe faster to check by using a bitmask instead?
>
> They're consecutive; one could #define a FUSE_IOMAP_TYPE_MAX to alias
> PURE_OVERWRITE and collapse the whole check to:
>
> return fuse_type <= FUSE_IOMAP_TYPE_MAX;
>
> > > +
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index 1e7298b2b89b58..32f4b7c9a20a8a 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -1448,6 +1448,13 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > >
> > > if (flags & FUSE_REQUEST_TIMEOUT)
> > > timeout = arg->request_timeout;
> > > +
> > > + if ((flags & FUSE_IOMAP) && fuse_iomap_enabled()) {
> > > + fc->local_fs = 1;
> > > + fc->iomap = 1;
> > > + printk(KERN_WARNING
> > > + "fuse: EXPERIMENTAL iomap feature enabled. Use at your own risk!");
> > > + }
> >
> > pr_warn() seems to be the convention elsewhere in the fuse code
>
> Ah, thanks. Do you know why fuse calls pr_warn("fuse: XXX") instead of
> the usual sequence of
>
> #define pr_fmt(fmt) "fuse: " fmt
>
> so that "fuse: " gets included automatically?
I think it does do this, or at least that's what I see in fuse_i.h :D
Thanks,
Joanne
>
> --D
>
> >
> > Thanks,
> > Joanne
> > > } else {
> > > ra_pages = fc->max_read / PAGE_SIZE;
> > > fc->no_lock = 1;
> > > @@ -1516,6 +1523,8 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
> > > */
> > > if (fuse_uring_enabled())
> > > flags |= FUSE_OVER_IO_URING;
> > > + if (fuse_iomap_enabled())
> > > + flags |= FUSE_IOMAP;
> > >
> > > ia->in.flags = flags;
> > > ia->in.flags2 = flags >> 32;
> > >
> >
* Re: [PATCH 01/28] fuse: implement the basic iomap mechanisms
2025-09-23 21:24 ` Joanne Koong
@ 2025-09-23 22:10 ` Darrick J. Wong
2025-09-23 23:08 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-23 22:10 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal
On Tue, Sep 23, 2025 at 02:24:21PM -0700, Joanne Koong wrote:
> On Tue, Sep 23, 2025 at 1:32 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Fri, Sep 19, 2025 at 03:36:52PM -0700, Joanne Koong wrote:
> > > On Mon, Sep 15, 2025 at 5:28 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Implement functions to enable upcalling of iomap_begin and iomap_end to
> > > > userspace fuse servers.
> > > >
> > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > ---
> > > > fs/fuse/fuse_i.h | 35 ++++
> > > > fs/fuse/iomap_priv.h | 36 ++++
> > > > include/uapi/linux/fuse.h | 90 +++++++++
> > > > fs/fuse/Kconfig | 32 +++
> > > > fs/fuse/Makefile | 1
> > > > fs/fuse/file_iomap.c | 434 +++++++++++++++++++++++++++++++++++++++++++++
> > > > fs/fuse/inode.c | 9 +
> > > > 7 files changed, 636 insertions(+), 1 deletion(-)
> > > > create mode 100644 fs/fuse/iomap_priv.h
> > > > create mode 100644 fs/fuse/file_iomap.c
> > > >
> > > > new file mode 100644
> > > > index 00000000000000..243d92cb625095
> > > > --- /dev/null
> > > > +++ b/fs/fuse/iomap_priv.h
> > > > @@ -0,0 +1,36 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * Copyright (C) 2025 Oracle. All Rights Reserved.
> > > > + * Author: Darrick J. Wong <djwong@kernel.org>
> > > > + */
> > > > +#ifndef _FS_FUSE_IOMAP_PRIV_H
> > > > +#define _FS_FUSE_IOMAP_PRIV_H
> > > > +
> > > ...
> > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > index 31b80f93211b81..3634cbe602cd9c 100644
> > > > --- a/include/uapi/linux/fuse.h
> > > > +++ b/include/uapi/linux/fuse.h
> > > > @@ -235,6 +235,9 @@
> > > > *
> > > > * 7.44
> > > > * - add FUSE_NOTIFY_INC_EPOCH
> > > > + *
> > > > + * 7.99
> > >
> > > Just curious, where did you get the .99 from?
> >
> > Any time I go adding to a versioned ABI, I try to use high numbers (and
> > high bits for flags) so that it's really obvious that the new flags are
> > in use when I poke through crash/gdb/etc.
> >
> > For permanent artifacts like an ondisk format, it's convenient to cache
> > fs images for fuzz testing, etc. Using a high bit/number reduces the
> > chance that someone else's new feature will get merged and cause
> > conflicts, which force me to regenerate all cached images.
> >
> > Obviously at merge time I change these values to use lower bit positions
> > or version numbers to fit the merge target so it doesn't completely
> > eliminate the caching problems.
>
> Ahh okay I see, thanks for the explanation!
>
> >
> > > > + * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
> > > > */
> > > >
> > > > #ifndef _LINUX_FUSE_H
> > > > @@ -270,7 +273,7 @@
> > > > #define FUSE_KERNEL_VERSION 7
> > > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > > > index 9563fa5387a241..67dfe300bf2e07 100644
> > > > --- a/fs/fuse/Kconfig
> > > > +++ b/fs/fuse/Kconfig
> > > > @@ -69,6 +69,38 @@ config FUSE_PASSTHROUGH
> > > > +config FUSE_IOMAP_DEBUG
> > > > + bool "Debug FUSE file IO over iomap"
> > > > + default n
> > > > + depends on FUSE_IOMAP
> > > > + help
> > > > + Enable debugging assertions for the fuse iomap code paths and logging
> > > > + of bad iomap file mapping data being sent to the kernel.
> > > > +
> > >
> > > I wonder if we should have a general FUSE_DEBUG that this would fall
> > > under instead of creating one that's iomap_debug specific
> >
> > Probably, but I was also trying to keep this as localized to iomap as
> > possible. If Miklos would rather I extend it to all of fuse (which is
> > probably a good idea!) then I'm happy to do so.
> >
> > > > config FUSE_IO_URING
> > > > bool "FUSE communication over io-uring"
> > > > default y
> > > > diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> > > > index 46041228e5be2c..27be39317701d6 100644
> > > > --- a/fs/fuse/Makefile
> > > > +++ b/fs/fuse/Makefile
> > > > @@ -18,5 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> > > > fuse-$(CONFIG_FUSE_BACKING) += backing.o
> > > > fuse-$(CONFIG_SYSCTL) += sysctl.o
> > > > fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> > > > +fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
> > > >
> > > > virtiofs-y := virtio_fs.o
> > > > diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> > > > new file mode 100644
> > > > index 00000000000000..dda757768d3ea6
> > > > --- /dev/null
> > > > +++ b/fs/fuse/file_iomap.c
> > > > @@ -0,0 +1,434 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * Copyright (C) 2025 Oracle. All Rights Reserved.
> > > > + * Author: Darrick J. Wong <djwong@kernel.org>
> > > > + */
> > > > +#include <linux/iomap.h>
> > > > +#include "fuse_i.h"
> > > > +#include "fuse_trace.h"
> > > > +#include "iomap_priv.h"
> > > > +
> > > > +/* Validate FUSE_IOMAP_TYPE_* */
> > > > +static inline bool fuse_iomap_check_type(uint16_t fuse_type)
> > > > +{
> > > > + switch (fuse_type) {
> > > > + case FUSE_IOMAP_TYPE_HOLE:
> > > > + case FUSE_IOMAP_TYPE_DELALLOC:
> > > > + case FUSE_IOMAP_TYPE_MAPPED:
> > > > + case FUSE_IOMAP_TYPE_UNWRITTEN:
> > > > + case FUSE_IOMAP_TYPE_INLINE:
> > > > + case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
> > > > + return true;
> > > > + }
> > > > +
> > > > + return false;
> > > > +}
> > >
> > > Maybe faster to check by using a bitmask instead?
> >
> > They're consecutive; one could #define a FUSE_IOMAP_TYPE_MAX to alias
> > PURE_OVERWRITE and collapse the whole check to:
> >
> > return fuse_type <= FUSE_IOMAP_TYPE_MAX;
> >
> > > > +
> > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > index 1e7298b2b89b58..32f4b7c9a20a8a 100644
> > > > --- a/fs/fuse/inode.c
> > > > +++ b/fs/fuse/inode.c
> > > > @@ -1448,6 +1448,13 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > >
> > > > if (flags & FUSE_REQUEST_TIMEOUT)
> > > > timeout = arg->request_timeout;
> > > > +
> > > > + if ((flags & FUSE_IOMAP) && fuse_iomap_enabled()) {
> > > > + fc->local_fs = 1;
> > > > + fc->iomap = 1;
> > > > + printk(KERN_WARNING
> > > > + "fuse: EXPERIMENTAL iomap feature enabled. Use at your own risk!");
> > > > + }
> > >
> > > pr_warn() seems to be the convention elsewhere in the fuse code
> >
> > Ah, thanks. Do you know why fuse calls pr_warn("fuse: XXX") instead of
> > the usual sequence of
> >
> > #define pr_fmt(fmt) "fuse: " fmt
> >
> > so that "fuse: " gets included automatically?
>
> I think it does do this, or at least that's what I see in fuse_i.h :D
Whoooops, sorry for the noise :)
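For anyone unfamiliar with the convention mentioned above, here is a minimal userspace sketch of how a pr_fmt() definition prepends a prefix at compile time through string-literal concatenation. Only the pr_fmt() line mirrors what is quoted in this thread; demo_pr_warn() and demo_log_iomap_enabled() are hypothetical stand-ins that format into a buffer instead of the kernel log, and ##__VA_ARGS__ is a GNU extension (the same one the kernel relies on).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Must be defined before the logging macro below is expanded. */
#define pr_fmt(fmt) "fuse: " fmt

/*
 * Hypothetical userspace stand-in for pr_warn(): the prefix from pr_fmt()
 * is glued onto the format string by literal concatenation.
 */
#define demo_pr_warn(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), ##__VA_ARGS__)

/* Format the warning text without spelling out "fuse: " by hand. */
static int demo_log_iomap_enabled(char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return demo_pr_warn(buf, len,
		"EXPERIMENTAL iomap feature enabled. Use at your own risk!\n");
}
```

With the prefix supplied by pr_fmt(), the message string passed to the logging macro would not need its own "fuse: " at the front.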
--D
> Thanks,
> Joanne
> >
> > --D
> >
> > >
> > > Thanks,
> > > Joanne
> > > > } else {
> > > > ra_pages = fc->max_read / PAGE_SIZE;
> > > > fc->no_lock = 1;
> > > > @@ -1516,6 +1523,8 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
> > > > */
> > > > if (fuse_uring_enabled())
> > > > flags |= FUSE_OVER_IO_URING;
> > > > + if (fuse_iomap_enabled())
> > > > + flags |= FUSE_IOMAP;
> > > >
> > > > ia->in.flags = flags;
> > > > ia->in.flags2 = flags >> 32;
> > > >
> > >
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-23 20:59 ` Darrick J. Wong
@ 2025-09-23 22:34 ` Darrick J. Wong
2025-09-24 12:04 ` Miklos Szeredi
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-23 22:34 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 23, 2025 at 01:59:36PM -0700, Darrick J. Wong wrote:
> On Tue, Sep 23, 2025 at 08:56:47PM +0200, Miklos Szeredi wrote:
> > On Tue, 23 Sept 2025 at 16:54, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > > I'm not sure what you're referring to by "special" -- are you asking
> > > about why I added the touch_softlockup_watchdog() call here but not in
> > > fuse_wait_aborted()? I think it could use that treatment too, but once
> > > you abort all the pending requests they tend to go away very quickly.
> > > It might be the case that nobody's gotten a warning simply because the
> > > aborted requests all go away in under 30 seconds.
> >
> > Maybe I'm not understanding how the softlockup detector works. I
> > thought that it triggers if task is spinning in a tight loop. That
> > precludes any timeouts, since that means that the task went to sleep.
> >
> > So what's happening here?
>
> Hrm, I thought the softlockup detector also complains about tasks stuck
> in uninterruptible sleep, but you're right, it *does* schedule() so the
> softlockup detector won't complain about it.
>
> I think. Let me go try to prove that empirically. :)
Hrm. If I change the bottom of the function to:

	wait_event(fc->blocked_waitq, <some false expression>);

Then I get softlockup warnings because the process state gets set to
UNINTERRUPTIBLE, schedule() is called to pick another process, and the
umount process never reaches runnable state ever again.
If instead I change it to:

	while (wait_event_timeout(fc->blocked_waitq, <false expr>, HZ) == 0) {
		/* empty */
	}

then I do not get softlockup warnings, because the umount process
actually does get scheduled off and on the system, repeatedly.
Conclusion: The loop is necessary to avoid softlockup warnings while the
fuse requests are processed by the server, but it is not necessary to
touch the watchdog in the loop body.
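As a rough userspace analogy (not the kernel code), the pattern is a waiter that sleeps on a condition but wakes at least once per period to recheck it, instead of sleeping indefinitely. The sketch below uses POSIX condition variables; struct demo_conn, demo_drain_wait() and demo_complete_one() are hypothetical names, num_waiting merely stands in for the counter being waited on, and max_timeouts exists only so that the demo terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

struct demo_conn {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	int             num_waiting;	/* outstanding requests */
};

/*
 * Wait for num_waiting to reach zero, waking at least every period_ms
 * to recheck.  Returns how many periodic timeouts occurred; gives up
 * after max_timeouts (a demo-only guard).
 */
static int demo_drain_wait(struct demo_conn *c, long period_ms, int max_timeouts)
{
	int timeouts = 0;

	assert(period_ms > 0);
	pthread_mutex_lock(&c->lock);
	while (c->num_waiting > 0 && timeouts < max_timeouts) {
		struct timespec deadline;

		/* Absolute deadline one period from now. */
		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_sec  += period_ms / 1000;
		deadline.tv_nsec += (period_ms % 1000) * 1000000L;
		if (deadline.tv_nsec >= 1000000000L) {
			deadline.tv_sec++;
			deadline.tv_nsec -= 1000000000L;
		}
		if (pthread_cond_timedwait(&c->cond, &c->lock, &deadline) == ETIMEDOUT)
			timeouts++;
	}
	pthread_mutex_unlock(&c->lock);
	return timeouts;
}

/* Mark one request finished and wake the waiter so it can recheck. */
static void demo_complete_one(struct demo_conn *c)
{
	pthread_mutex_lock(&c->lock);
	if (c->num_waiting > 0)
		c->num_waiting--;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}
```

The waiter makes no progress by itself; it only returns to the scheduler periodically while some other thread does the real work and eventually calls demo_complete_one().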
Thanks for challenging me, now I've learned something useful. :)
--D
* Re: [PATCH 01/28] fuse: implement the basic iomap mechanisms
2025-09-23 22:10 ` Darrick J. Wong
@ 2025-09-23 23:08 ` Darrick J. Wong
0 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-23 23:08 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, bernd, linux-xfs, John, linux-fsdevel, neal
On Tue, Sep 23, 2025 at 03:10:14PM -0700, Darrick J. Wong wrote:
> On Tue, Sep 23, 2025 at 02:24:21PM -0700, Joanne Koong wrote:
> > On Tue, Sep 23, 2025 at 1:32 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Fri, Sep 19, 2025 at 03:36:52PM -0700, Joanne Koong wrote:
> > > > On Mon, Sep 15, 2025 at 5:28 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > >
> > > > > Implement functions to enable upcalling of iomap_begin and iomap_end to
> > > > > userspace fuse servers.
> > > > >
> > > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > ---
> > > > > fs/fuse/fuse_i.h | 35 ++++
> > > > > fs/fuse/iomap_priv.h | 36 ++++
> > > > > include/uapi/linux/fuse.h | 90 +++++++++
> > > > > fs/fuse/Kconfig | 32 +++
> > > > > fs/fuse/Makefile | 1
> > > > > fs/fuse/file_iomap.c | 434 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > fs/fuse/inode.c | 9 +
> > > > > 7 files changed, 636 insertions(+), 1 deletion(-)
> > > > > create mode 100644 fs/fuse/iomap_priv.h
> > > > > create mode 100644 fs/fuse/file_iomap.c
> > > > >
> > > > > new file mode 100644
> > > > > index 00000000000000..243d92cb625095
> > > > > --- /dev/null
> > > > > +++ b/fs/fuse/iomap_priv.h
> > > > > @@ -0,0 +1,36 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > +/*
> > > > > + * Copyright (C) 2025 Oracle. All Rights Reserved.
> > > > > + * Author: Darrick J. Wong <djwong@kernel.org>
> > > > > + */
> > > > > +#ifndef _FS_FUSE_IOMAP_PRIV_H
> > > > > +#define _FS_FUSE_IOMAP_PRIV_H
> > > > > +
> > > > ...
> > > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > > index 31b80f93211b81..3634cbe602cd9c 100644
> > > > > --- a/include/uapi/linux/fuse.h
> > > > > +++ b/include/uapi/linux/fuse.h
> > > > > @@ -235,6 +235,9 @@
> > > > > *
> > > > > * 7.44
> > > > > * - add FUSE_NOTIFY_INC_EPOCH
> > > > > + *
> > > > > + * 7.99
> > > >
> > > > Just curious, where did you get the .99 from?
> > >
> > > Any time I go adding to a versioned ABI, I try to use high numbers (and
> > > high bits for flags) so that it's really obvious that the new flags are
> > > in use when I poke through crash/gdb/etc.
> > >
> > > For permanent artifacts like an ondisk format, it's convenient to cache
> > > fs images for fuzz testing, etc. Using a high bit/number reduces the
> > > chance that someone else's new feature will get merged and cause
> > > conflicts, which force me to regenerate all cached images.
> > >
> > > Obviously at merge time I change these values to use lower bit positions
> > > or version numbers to fit the merge target so it doesn't completely
> > > eliminate the caching problems.
> >
> > Ahh okay I see, thanks for the explanation!
> >
> > >
> > > > > + * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
> > > > > */
> > > > >
> > > > > #ifndef _LINUX_FUSE_H
> > > > > @@ -270,7 +273,7 @@
> > > > > #define FUSE_KERNEL_VERSION 7
> > > > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > > > > index 9563fa5387a241..67dfe300bf2e07 100644
> > > > > --- a/fs/fuse/Kconfig
> > > > > +++ b/fs/fuse/Kconfig
> > > > > @@ -69,6 +69,38 @@ config FUSE_PASSTHROUGH
> > > > > +config FUSE_IOMAP_DEBUG
> > > > > + bool "Debug FUSE file IO over iomap"
> > > > > + default n
> > > > > + depends on FUSE_IOMAP
> > > > > + help
> > > > > + Enable debugging assertions for the fuse iomap code paths and logging
> > > > > + of bad iomap file mapping data being sent to the kernel.
> > > > > +
> > > >
> > > > I wonder if we should have a general FUSE_DEBUG that this would fall
> > > > under instead of creating one that's iomap_debug specific
> > >
> > > Probably, but I was also trying to keep this as localized to iomap as
> > > possible. If Miklos would rather I extend it to all of fuse (which is
> > > probably a good idea!) then I'm happy to do so.
> > >
> > > > > config FUSE_IO_URING
> > > > > bool "FUSE communication over io-uring"
> > > > > default y
> > > > > diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> > > > > index 46041228e5be2c..27be39317701d6 100644
> > > > > --- a/fs/fuse/Makefile
> > > > > +++ b/fs/fuse/Makefile
> > > > > @@ -18,5 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> > > > > fuse-$(CONFIG_FUSE_BACKING) += backing.o
> > > > > fuse-$(CONFIG_SYSCTL) += sysctl.o
> > > > > fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> > > > > +fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
> > > > >
> > > > > virtiofs-y := virtio_fs.o
> > > > > diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> > > > > new file mode 100644
> > > > > index 00000000000000..dda757768d3ea6
> > > > > --- /dev/null
> > > > > +++ b/fs/fuse/file_iomap.c
> > > > > @@ -0,0 +1,434 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > +/*
> > > > > + * Copyright (C) 2025 Oracle. All Rights Reserved.
> > > > > + * Author: Darrick J. Wong <djwong@kernel.org>
> > > > > + */
> > > > > +#include <linux/iomap.h>
> > > > > +#include "fuse_i.h"
> > > > > +#include "fuse_trace.h"
> > > > > +#include "iomap_priv.h"
> > > > > +
> > > > > +/* Validate FUSE_IOMAP_TYPE_* */
> > > > > +static inline bool fuse_iomap_check_type(uint16_t fuse_type)
> > > > > +{
> > > > > + switch (fuse_type) {
> > > > > + case FUSE_IOMAP_TYPE_HOLE:
> > > > > + case FUSE_IOMAP_TYPE_DELALLOC:
> > > > > + case FUSE_IOMAP_TYPE_MAPPED:
> > > > > + case FUSE_IOMAP_TYPE_UNWRITTEN:
> > > > > + case FUSE_IOMAP_TYPE_INLINE:
> > > > > + case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
> > > > > + return true;
> > > > > + }
> > > > > +
> > > > > + return false;
> > > > > +}
> > > >
> > > > Maybe faster to check by using a bitmask instead?
> > >
> > > They're consecutive; one could #define a FUSE_IOMAP_TYPE_MAX to alias
> > > PURE_OVERWRITE and collapse the whole check to:
> > >
> > > return fuse_type <= FUSE_IOMAP_TYPE_MAX;
I godbolted this with gcc -O0, and got (arm64):

	sub	sp, sp, #16
	strh	w0, [sp, 14]
	ldrh	w0, [sp, 14]	/* load first arg in w0 */
	cmp	w0, 4
	bgt	.L2		/* goto L2 if arg > _INLINE */
	cmp	w0, 0
	bge	.L3		/* goto L3 if _HOLE <= arg <= _INLINE */
	b	.L4		/* goto L4 if arg < _HOLE */
.L2:
	cmp	w0, 255		/* goto L4 if arg != _PURE_OVERWRITE */
	bne	.L4
.L3:
	mov	w0, 1		/* input was good */
	b	.L5
.L4:
	mov	w0, 0		/* input was bad */
.L5:				/* return result in w0 */
	add	sp, sp, 16
	ret

The compiler is apparently smart enough to recognize the adjacent case
statements and merge them into the appropriate integer comparisons.
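To make the comparison concrete, here is a small userspace sketch of the three forms discussed: the switch from the patch, the range-plus-equality test that the listing above boils down to, and the single "<= MAX" comparison suggested earlier. The numeric values are assumptions read off the disassembly (0 through 4, plus 255 for PURE_OVERWRITE, with the order of the middle three guessed from the case order); they are not copied from the uapi header, and all demo_/DEMO_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, inferred from the comparisons in the arm64 listing. */
enum {
	DEMO_IOMAP_TYPE_HOLE		= 0,
	DEMO_IOMAP_TYPE_DELALLOC	= 1,
	DEMO_IOMAP_TYPE_MAPPED		= 2,
	DEMO_IOMAP_TYPE_UNWRITTEN	= 3,
	DEMO_IOMAP_TYPE_INLINE		= 4,
	DEMO_IOMAP_TYPE_PURE_OVERWRITE	= 255,
};
#define DEMO_IOMAP_TYPE_MAX	DEMO_IOMAP_TYPE_PURE_OVERWRITE

/* Same shape as the switch in the patch. */
static bool demo_check_switch(uint16_t t)
{
	switch (t) {
	case DEMO_IOMAP_TYPE_HOLE:
	case DEMO_IOMAP_TYPE_DELALLOC:
	case DEMO_IOMAP_TYPE_MAPPED:
	case DEMO_IOMAP_TYPE_UNWRITTEN:
	case DEMO_IOMAP_TYPE_INLINE:
	case DEMO_IOMAP_TYPE_PURE_OVERWRITE:
		return true;
	}
	return false;
}

/* What the generated code amounts to: one range test, one equality test. */
static bool demo_check_ranges(uint16_t t)
{
	return t <= DEMO_IOMAP_TYPE_INLINE ||
	       t == DEMO_IOMAP_TYPE_PURE_OVERWRITE;
}

/* The single-comparison form, with MAX aliasing PURE_OVERWRITE. */
static bool demo_check_max(uint16_t t)
{
	return t <= DEMO_IOMAP_TYPE_MAX;
}

/* Count the 16-bit inputs on which two checkers disagree. */
static unsigned int demo_count_mismatches(bool (*a)(uint16_t),
					  bool (*b)(uint16_t))
{
	unsigned int v, bad = 0;

	for (v = 0; v <= UINT16_MAX; v++)
		if (a((uint16_t)v) != b((uint16_t)v))
			bad++;
	return bad;
}

static void demo_selfcheck(void)
{
	/* The switch and the range form agree on every input. */
	assert(demo_count_mismatches(demo_check_switch, demo_check_ranges) == 0);
	/* The "<= MAX" form also accepts 5..254 under these assumed values. */
	assert(demo_count_mismatches(demo_check_switch, demo_check_max) == 250);
}
```

If PURE_OVERWRITE really is 255, as the cmp in the listing suggests, the single comparison is only equivalent to the switch when the constants are contiguous; otherwise it would let 5 through 254 pass.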
--D
> > >
> > > > > +
> > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > > index 1e7298b2b89b58..32f4b7c9a20a8a 100644
> > > > > --- a/fs/fuse/inode.c
> > > > > +++ b/fs/fuse/inode.c
> > > > > @@ -1448,6 +1448,13 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > > >
> > > > > if (flags & FUSE_REQUEST_TIMEOUT)
> > > > > timeout = arg->request_timeout;
> > > > > +
> > > > > + if ((flags & FUSE_IOMAP) && fuse_iomap_enabled()) {
> > > > > + fc->local_fs = 1;
> > > > > + fc->iomap = 1;
> > > > > + printk(KERN_WARNING
> > > > > + "fuse: EXPERIMENTAL iomap feature enabled. Use at your own risk!");
> > > > > + }
> > > >
> > > > pr_warn() seems to be the convention elsewhere in the fuse code
> > >
> > > Ah, thanks. Do you know why fuse calls pr_warn("fuse: XXX") instead of
> > > the usual sequence of
> > >
> > > #define pr_fmt(fmt) "fuse: " fmt
> > >
> > > so that "fuse: " gets included automatically?
> >
> > I think it does do this, or at least that's what I see in fuse_i.h :D
>
> Whoooops, sorry for the noise :)
>
> --D
>
> > Thanks,
> > Joanne
> > >
> > > --D
> > >
> > > >
> > > > Thanks,
> > > > Joanne
> > > > > } else {
> > > > > ra_pages = fc->max_read / PAGE_SIZE;
> > > > > fc->no_lock = 1;
> > > > > @@ -1516,6 +1523,8 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
> > > > > */
> > > > > if (fuse_uring_enabled())
> > > > > flags |= FUSE_OVER_IO_URING;
> > > > > + if (fuse_iomap_enabled())
> > > > > + flags |= FUSE_IOMAP;
> > > > >
> > > > > ia->in.flags = flags;
> > > > > ia->in.flags2 = flags >> 32;
> > > > >
> > > >
>
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-23 22:34 ` Darrick J. Wong
@ 2025-09-24 12:04 ` Miklos Szeredi
2025-09-24 17:50 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-24 12:04 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Wed, 24 Sept 2025 at 00:34, Darrick J. Wong <djwong@kernel.org> wrote:
> Conclusion: The loop is necessary to avoid softlockup warnings while the
> fuse requests are processed by the server, but it is not necessary to
> touch the watchdog in the loop body.
I'm still confused.
What is the kernel message you get?
"watchdog: BUG: soft lockup - CPU#X stuck for NNs!"
or
"INFO: task PROC blocked for more than NN seconds."
Thanks,
Miklos
* Re: [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-23 20:51 ` Darrick J. Wong
@ 2025-09-24 13:55 ` Miklos Szeredi
2025-09-24 17:31 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-24 13:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Joanne Koong, bernd, linux-xfs, John, linux-fsdevel, neal
On Tue, 23 Sept 2025 at 22:51, Darrick J. Wong <djwong@kernel.org> wrote:
> Oh, ok. I can do that. Just to be clear about what I need to do for
> v6:
>
> * fuse_conn::is_local goes away
> * FUSE_I_* gains a new FUSE_I_EXCLUSIVE flag
> * "local" operations check for FUSE_I_EXCLUSIVE instead of local_fs
> * fuseblk filesystems always set FUSE_I_EXCLUSIVE
Not sure if we want to touch fuseblk, as that carries a risk of regressions.
> * iomap filesystems (when they arrive) always set FUSE_I_EXCLUSIVE
Yes.
Thanks,
Miklos
* Re: [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-24 13:55 ` Miklos Szeredi
@ 2025-09-24 17:31 ` Darrick J. Wong
2025-09-25 19:17 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-24 17:31 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: Joanne Koong, bernd, linux-xfs, John, linux-fsdevel, neal
On Wed, Sep 24, 2025 at 03:55:48PM +0200, Miklos Szeredi wrote:
> On Tue, 23 Sept 2025 at 22:51, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Oh, ok. I can do that. Just to be clear about what I need to do for
> > v6:
> >
> > * fuse_conn::is_local goes away
> > * FUSE_I_* gains a new FUSE_I_EXCLUSIVE flag
> > * "local" operations check for FUSE_I_EXCLUSIVE instead of local_fs
> > * fuseblk filesystems always set FUSE_I_EXCLUSIVE
>
> Not sure if we want to touch fuseblk, as that carries a risk of regressions.
Hrm. As it stands today, setting FUSE_I_EXCLUSIVE in fuseblk mode
solves various mode/acl failures in fstests.
On the other hand, mounting with fuseblk requires fsname to point to a
block device that the mount()ing process can open, and if you're working
with a local filesystem on a block device, why wouldn't you use iomap
mode?
Add to that Ted's reluctance to merge the fuseblk support patches into
fuse2fs, and perhaps I should take that as a sign to abandon fuseblk
work entirely. It'd get rid of an entire test configuration, since I'd
only have to check fuse4fs-iomap on a bdev; and classic fuse4fs on a
regular file. Even in that second case, fuse4fs could losetup to take
advantage of iomap mode.
Yeah ok I've persuaded myself to drop the fuseblk stuff entirely. If
anyone /really/ wants me to keep it, holler in the next couple of hours.
> > * iomap filesystems (when they arrive) always set FUSE_I_EXCLUSIVE
>
> Yes.
Ok, thanks for the quick responses! :)
--D
> Thanks,
> Miklos
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-24 12:04 ` Miklos Szeredi
@ 2025-09-24 17:50 ` Darrick J. Wong
2025-09-24 18:19 ` Miklos Szeredi
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-24 17:50 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Wed, Sep 24, 2025 at 02:04:20PM +0200, Miklos Szeredi wrote:
> On Wed, 24 Sept 2025 at 00:34, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Conclusion: The loop is necessary to avoid softlockup warnings while the
> > fuse requests are processed by the server, but it is not necessary to
> > touch the watchdog in the loop body.
>
> I'm still confused.
>
> What is the kernel message you get?
>
> "watchdog: BUG: soft lockup - CPU#X stuck for NNs!"
>
> or
>
> "INFO: task PROC blocked for more than NN seconds."
Oh! The second:
INFO: task umount:1279 blocked for more than 20 seconds.
Not tainted 6.17.0-rc7-xfsx #rc7
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:umount state:D stack:11984 pid:1279 tgid:1279 ppid:10690
Call Trace:
<TASK>
__schedule+0x4cb/0x1a70
? vprintk_emit+0x10e/0x330
schedule+0x2a/0xe0
fuse_flush_requests_and_wait+0xe5/0x110 [fuse 810ab4024704e943aea18e16]
? cpuacct_css_alloc+0xa0/0xa0
fuse_iomap_unmount+0x15/0x30 [fuse 810ab4024704e943aea18e1670e01b473ed]
fuse_conn_destroy+0xdb/0xe0 [fuse 810ab4024704e943aea18e1670e01b473ed1]
fuse_kill_sb_anon+0xb7/0xc0 [fuse 810ab4024704e943aea18e1670e01b473ed1]
deactivate_locked_super+0x29/0xa0
cleanup_mnt+0xbd/0x150
task_work_run+0x55/0x90
exit_to_user_mode_loop+0xa0/0xb0
do_syscall_64+0x16b/0x1a0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f24c342ab77
RSP: 002b:00007ffd683ce5e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00005648fc1208a8 RCX: 00007f24c342ab77
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005648fc1209c0
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000073
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f24c3566264
R13: 00005648fc1209c0 R14: 0000000000000000 R15: 00005648fc120790
</TASK>
Apologies for my imprecise description of what I was trying to avoid; I
should have paid closer attention.
The wait_event_timeout() loop causes the process to schedule at least
once per second, which avoids the "blocked for more than..." warning.
Since the process actually does go to sleep, it's not necessary to touch
the softlockup watchdog because we're not preventing another process
from being scheduled on a CPU.
I can copy the above into the commit message if that resolves the
confusion.
--D
> Thanks,
> Miklos
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-24 17:50 ` Darrick J. Wong
@ 2025-09-24 18:19 ` Miklos Szeredi
2025-09-24 20:54 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-24 18:19 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Wed, 24 Sept 2025 at 19:50, Darrick J. Wong <djwong@kernel.org> wrote:
> The wait_event_timeout() loop causes the process to schedule at least
> once per second, which avoids the "blocked for more than..." warning.
> Since the process actually does go to sleep, it's not necessary to touch
> the softlockup watchdog because we're not preventing another process
> from being scheduled on a CPU.
To be clear, this triggers because no RELEASE reply is received for
more than 20 seconds? That sounds weird. What is the server doing
all that time?
If a reply *is* received, then the task doing the umount should have
woken up (to check fc->num_waiting), which would have prevented the
hung task warning.
What am I missing?
Thanks,
Miklos
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-24 18:19 ` Miklos Szeredi
@ 2025-09-24 20:54 ` Darrick J. Wong
2025-09-30 10:29 ` Miklos Szeredi
0 siblings, 1 reply; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-24 20:54 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Wed, Sep 24, 2025 at 08:19:59PM +0200, Miklos Szeredi wrote:
> On Wed, 24 Sept 2025 at 19:50, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > The wait_event_timeout() loop causes the process to schedule at least
> > once per second, which avoids the "blocked for more than..." warning.
> > Since the process actually does go to sleep, it's not necessary to touch
> > the softlockup watchdog because we're not preventing another process
> > from being scheduled on a CPU.
>
> To be clear, this triggers because no RELEASE reply is received for
> more than 20 seconds? That sounds weird. What is the server doing
> all that time?
>
> If a reply *is* received, then the task doing the umount should have
> woken up (to check fc->num_waiting), which would have prevented the
> hung task warning.
>
> What am I missing?
(Note: I set /proc/sys/kernel/hung_task_timeout_secs to 10 seconds to
generate the 20 second warning)
I think what you're missing is the fuse server taking more than 20
seconds to process one RELEASE command successfully. Say you create a
sparse file with 1 million extents, open it, and unlink the file.
The file's still open, so the unlink can't truncate it or free it.
Next, you close the file and unmount the filesystem. Inode eviction
causes a RELEASE command to be issued, so the fuse server starts
truncating the file to free it. There's a million extents to free, but
the server is slow and can't process more than (say) 1000 extent freeing
operations per second. That implies that the truncation will take 1000
seconds to complete, which means the reply to the RELEASE doesn't arrive
for 1000 seconds. Meanwhile, the umount process doesn't see a change in
fc->num_waiting for 1000 seconds, so it isn't woken up for that amount of
time and we get the stuck task warning.
I think we don't want stuck task warnings because the "stuck" task
(umount) is not the task that is actually doing the work. For an
in-kernel filesystem like XFS, the inode eviction process would be
generating enough context switches from all the metadata IOs to avoid
the hung task warning.
--D
> Thanks,
> Miklos
>
* Re: [PATCH 2/5] fuse: move the backing file idr and code into a new source file
2025-09-16 0:27 ` [PATCH 2/5] fuse: move the backing file idr and code into a new source file Darrick J. Wong
@ 2025-09-25 14:11 ` Miklos Szeredi
0 siblings, 0 replies; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-25 14:11 UTC (permalink / raw)
To: Darrick J. Wong
Cc: amir73il, bernd, linux-xfs, John, linux-fsdevel, neal,
joannelkoong
On Tue, 16 Sept 2025 at 02:27, Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> iomap support for fuse is also going to want the ability to attach
> backing files to a fuse filesystem. Move the fuse_backing code into a
> separate file so that both can use it.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Applied, thanks.
Miklos
* Re: [PATCH 5/5] fuse: move CREATE_TRACE_POINTS to a separate file
2025-09-16 0:27 ` [PATCH 5/5] fuse: move CREATE_TRACE_POINTS to a separate file Darrick J. Wong
@ 2025-09-25 14:25 ` Miklos Szeredi
0 siblings, 0 replies; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-25 14:25 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, 16 Sept 2025 at 02:27, Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Before we start adding new tracepoints for fuse+iomap, move the
> tracepoint creation itself to a separate source file so that we don't
> have to start pulling iomap dependencies into dev.c just for the iomap
> structures.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Applied, thanks.
Miklos
* Re: [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors
2025-09-24 17:31 ` Darrick J. Wong
@ 2025-09-25 19:17 ` Darrick J. Wong
0 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-25 19:17 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: Joanne Koong, bernd, linux-xfs, John, linux-fsdevel, neal
On Wed, Sep 24, 2025 at 10:31:36AM -0700, Darrick J. Wong wrote:
> On Wed, Sep 24, 2025 at 03:55:48PM +0200, Miklos Szeredi wrote:
> > On Tue, 23 Sept 2025 at 22:51, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > > Oh, ok. I can do that. Just to be clear about what I need to do for
> > > v6:
> > >
> > > * fuse_conn::is_local goes away
> > > * FUSE_I_* gains a new FUSE_I_EXCLUSIVE flag
> > > * "local" operations check for FUSE_I_EXCLUSIVE instead of local_fs
> > > * fuseblk filesystems always set FUSE_I_EXCLUSIVE
> >
> > Not sure if we want to touch fuseblk, as that carries a risk of regressions.
>
> Hrm. As it stands today, setting FUSE_I_EXCLUSIVE in fuseblk mode
> solves various mode/acl failures in fstests.
>
> On the other hand, mounting with fuseblk requires fsname to point to a
> block device that the mount()ing process can open, and if you're working
> with a local filesystem on a block device, why wouldn't you use iomap
> mode?
>
> Add to that Ted's reluctance to merge the fuseblk support patches into
> fuse2fs, and perhaps I should take that as a sign to abandon fuseblk
> work entirely. It'd get rid of an entire test configuration, since I'd
> only have to check fuse4fs-iomap on a bdev; and classic fuse4fs on a
> regular file. Even in that second case, fuse4fs could losetup to take
> advantage of iomap mode.
>
> Yeah ok I've persuaded myself to drop the fuseblk stuff entirely. If
> anyone /really/ wants me to keep it, holler in the next couple of hours.
Ted agrees with this, so I'm dropping fuseblk support for fuse[24]fs.
--D
> > > * iomap filesystems (when they arrive) always set FUSE_I_EXCLUSIVE
> >
> > Yes.
>
> Ok, thanks for the quick responses! :)
>
> --D
>
> > Thanks,
> > Miklos
>
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-24 20:54 ` Darrick J. Wong
@ 2025-09-30 10:29 ` Miklos Szeredi
2025-09-30 17:56 ` Darrick J. Wong
0 siblings, 1 reply; 126+ messages in thread
From: Miklos Szeredi @ 2025-09-30 10:29 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Wed, 24 Sept 2025 at 22:54, Darrick J. Wong <djwong@kernel.org> wrote:
> I think we don't want stuck task warnings because the "stuck" task
> (umount) is not the task that is actually doing the work.
Agreed.
I do wonder why this isn't happening during normal operation. There
could be multiple explanations:
- release is async, so this particular case would not trigger the hang warning
- some other op could be taking a long time to complete (fsync?), but
request_wait_answer() starts with interruptible sleep and falls back
to uninterruptible sleep after a signal is received. So unless
there's a signal, even a very slow request would fail to trigger the
hang warning.
A more generic solution would be to introduce a mechanism that would
tell the kernel that while the request is taking long, it's not
stalled (e.g. periodic progress reports).
But I also get the feeling that this is not very urgent and possibly
more of a test checkbox than a real life issue.
Thanks,
Miklos
* Re: [PATCH 2/8] fuse: flush pending fuse events before aborting the connection
2025-09-30 10:29 ` Miklos Szeredi
@ 2025-09-30 17:56 ` Darrick J. Wong
0 siblings, 0 replies; 126+ messages in thread
From: Darrick J. Wong @ 2025-09-30 17:56 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, linux-xfs, John, linux-fsdevel, neal, joannelkoong
On Tue, Sep 30, 2025 at 12:29:30PM +0200, Miklos Szeredi wrote:
> On Wed, 24 Sept 2025 at 22:54, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > I think we don't want stuck task warnings because the "stuck" task
> > (umount) is not the task that is actually doing the work.
>
> Agreed.
>
> I do wonder why this isn't happening during normal operation. There
> could be multiple explanations:
>
> - release is async, so this particular case would not trigger the hang warning
<nod> I confirm that a normal release takes place asynchronously, so
nothing in the kernel gets hung up if ->release takes a long time.
It's only my new code that causes the hang warnings, and only because
it's using the non-interruptible wait_event variant to flush the
requests.
(I /could/ solve the problem differently by calling
wait_event_interruptible in a loop and ignoring the EINTR, but that
seems like a misuse of APIs.)
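For illustration only, here is a userspace sketch of the loop-and-retry
alternative mentioned above. The helper names (wait_ignoring_signals,
fake_wait_once) are hypothetical, and -EINTR stands in for the
-ERESTARTSYS that wait_event_interruptible() returns in the kernel when
a signal arrives. The "wait" here is a fake that reports a fixed number
of interruptions before succeeding, so the retry behaviour is visible
without real signals.

```c
#include <assert.h>
#include <errno.h>

/* One attempt at an interruptible wait: returns 0 on success or
 * -EINTR if the attempt was cut short by a signal. */
typedef int (*wait_fn)(void *ctx);

/*
 * Hypothetical helper: keep retrying an interruptible wait until it
 * completes, counting how many interruptions were swallowed.
 */
static int wait_ignoring_signals(wait_fn wait_once, void *ctx,
				 unsigned int *retries)
{
	int ret;

	assert(wait_once && retries);

	*retries = 0;
	while ((ret = wait_once(ctx)) == -EINTR)
		(*retries)++;	/* interruption ignored, go back to sleep */
	return ret;
}

/* Fake wait used for the demonstration (an assumption, not kernel code). */
struct fake_wait {
	unsigned int interruptions_left;
};

static int fake_wait_once(void *ctx)
{
	struct fake_wait *fw = ctx;

	if (fw->interruptions_left > 0) {
		fw->interruptions_left--;
		return -EINTR;	/* pretend a signal arrived */
	}
	return 0;		/* the awaited condition became true */
}
```

With a fake_wait that reports three interruptions, the helper returns 0
after three retries. In kernel context a pending signal is not cleared
by simply retrying, so a loop like this may end up spinning rather than
sleeping, which is one plausible reading of why it looks like a misuse
of the API.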
> - some other op could be taking a long time to complete (fsync?), but
> request_wait_answer() starts with interruptible sleep and falls back
> to uninterruptible sleep after a signal is received. So unless
> there's a signal, even a very slow request would fail to trigger the
> hang warning.
Yes, that's what happens if I inject a "stall" into, say, fallocate by
adding a gdb breakpoint on the fallocate handler in fuse4fs. The xfs_io
process calling fallocate() then just blocks in interruptible sleep
and I see no complaints from the hangcheck timer. But it's fallocate(),
which is quite interruptible.
Unmount is different -- the kernel has already torn down some of the
mount state, so we can't back out after some sort of interruption.
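The difference between the two cases can be sketched with a toy model.
This is not kernel code: the enum, NO_SIGNAL, and hang_warning_fires()
are made-up names, and the model only encodes the behaviour described
in this thread, namely that the hang warning is tied to time spent in
uninterruptible sleep and that request_wait_answer() only falls back
to uninterruptible sleep after a signal.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not kernel APIs. */
enum sleep_state {
	SLEEP_INTERRUPTIBLE,	/* not counted by the modeled hang detector */
	SLEEP_UNINTERRUPTIBLE,	/* counted by the modeled hang detector */
};

#define NO_SIGNAL (-1)

/*
 * Simulate one waiter, second by second.
 *
 * start:     sleep state the waiter begins in
 * signal_at: second at which a signal arrives, or NO_SIGNAL
 * reply_at:  second at which the fuse server finally answers
 * timeout:   seconds of uninterruptible sleep tolerated before warning
 *
 * Returns true if the modeled hang detector would have warned.
 */
static bool hang_warning_fires(enum sleep_state start, int signal_at,
			       int reply_at, int timeout)
{
	enum sleep_state state = start;
	int blocked = 0;	/* seconds spent in uninterruptible sleep */

	assert(timeout > 0 && reply_at >= 0);

	for (int now = 0; now < reply_at; now++) {
		if (now == signal_at)
			state = SLEEP_UNINTERRUPTIBLE;	/* fallback after signal */
		if (state == SLEEP_UNINTERRUPTIBLE && ++blocked > timeout)
			return true;
	}
	return false;
}
```

In this model a request that stalls for 600 seconds with no signal never
warns, the same request warns once a signal pushes it into
uninterruptible sleep for longer than the timeout, and a flush that
starts out uninterruptible (as in the unmount path discussed here) warns
as soon as the server takes longer than the timeout. The timeout value
is a parameter of the sketch, not a statement about any particular
kernel configuration.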
> A more generic solution would be to introduce a mechanism that would
> tell the kernel that while the request is taking long, it's not
> stalled (e.g. periodic progress reports).
>
> But I also get the feeling that this is not very urgent and possibly
> more of a test checkbox than a real life issue.
It's probably not an issue for 99% of filesystems and use cases, but
unprivileged userspace can set up the conditions for a stall warning.
Some customers file bug reports for /any/ kernel backtrace, even if it's
a stall warning caused by slow IO, so I'd prefer not to create a new
opening for this to happen.
--D
>
> Thanks,
> Miklos
Thread overview: 126+ messages
[not found] <20250916000759.GA8080@frogsfrogsfrogs>
2025-09-16 0:18 ` [PATCHSET RFC v5 1/8] fuse: general bug fixes Darrick J. Wong
2025-09-16 0:24 ` [PATCH 1/8] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
2025-09-23 10:57 ` Miklos Szeredi
2025-09-16 0:24 ` [PATCH 2/8] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
2025-09-23 11:11 ` Miklos Szeredi
2025-09-23 14:54 ` Darrick J. Wong
2025-09-23 18:56 ` Miklos Szeredi
2025-09-23 20:59 ` Darrick J. Wong
2025-09-23 22:34 ` Darrick J. Wong
2025-09-24 12:04 ` Miklos Szeredi
2025-09-24 17:50 ` Darrick J. Wong
2025-09-24 18:19 ` Miklos Szeredi
2025-09-24 20:54 ` Darrick J. Wong
2025-09-30 10:29 ` Miklos Szeredi
2025-09-30 17:56 ` Darrick J. Wong
2025-09-16 0:24 ` [PATCH 3/8] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
2025-09-23 10:58 ` Miklos Szeredi
2025-09-16 0:25 ` [PATCH 4/8] fuse: signal that a fuse filesystem should exhibit local fs behaviors Darrick J. Wong
2025-09-17 17:18 ` Joanne Koong
2025-09-18 16:52 ` Darrick J. Wong
2025-09-19 9:24 ` Miklos Szeredi
2025-09-19 17:50 ` Darrick J. Wong
2025-09-23 14:57 ` Miklos Szeredi
2025-09-23 20:51 ` Darrick J. Wong
2025-09-24 13:55 ` Miklos Szeredi
2025-09-24 17:31 ` Darrick J. Wong
2025-09-25 19:17 ` Darrick J. Wong
2025-09-16 0:25 ` [PATCH 5/8] fuse: implement file attributes mask for statx Darrick J. Wong
2025-09-16 0:25 ` [PATCH 6/8] fuse: update file mode when updating acls Darrick J. Wong
2025-09-16 0:25 ` [PATCH 7/8] fuse: propagate default and file acls on creation Darrick J. Wong
2025-09-16 6:41 ` Chen Linxuan
2025-09-16 14:48 ` Darrick J. Wong
2025-09-16 0:26 ` [PATCH 8/8] fuse: enable FUSE_SYNCFS for all fuseblk servers Darrick J. Wong
2025-09-23 10:58 ` Miklos Szeredi
2025-09-16 0:18 ` [PATCHSET RFC v5 2/8] iomap: cleanups ahead of adding fuse support Darrick J. Wong
2025-09-16 0:26 ` [PATCH 1/2] iomap: trace iomap_zero_iter zeroing activities Darrick J. Wong
2025-09-16 13:49 ` Christoph Hellwig
2025-09-16 14:49 ` Darrick J. Wong
2025-09-16 0:26 ` [PATCH 2/2] iomap: error out on file IO when there is no inline_data buffer Darrick J. Wong
2025-09-16 13:50 ` Christoph Hellwig
2025-09-16 14:50 ` Darrick J. Wong
2025-09-16 0:18 ` [PATCHSET RFC v5 3/8] fuse: cleanups ahead of adding fuse support Darrick J. Wong
2025-09-16 0:26 ` [PATCH 1/5] fuse: allow synchronous FUSE_INIT Darrick J. Wong
2025-09-17 17:22 ` Joanne Koong
2025-09-18 18:04 ` Darrick J. Wong
2025-09-16 0:27 ` [PATCH 2/5] fuse: move the backing file idr and code into a new source file Darrick J. Wong
2025-09-25 14:11 ` Miklos Szeredi
2025-09-16 0:27 ` [PATCH 3/5] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
2025-09-17 2:47 ` Amir Goldstein
2025-09-18 18:02 ` Darrick J. Wong
2025-09-19 7:34 ` Miklos Szeredi
2025-09-19 9:36 ` Amir Goldstein
2025-09-19 17:43 ` Darrick J. Wong
2025-09-16 0:27 ` [PATCH 4/5] fuse_trace: " Darrick J. Wong
2025-09-16 0:27 ` [PATCH 5/5] fuse: move CREATE_TRACE_POINTS to a separate file Darrick J. Wong
2025-09-25 14:25 ` Miklos Szeredi
2025-09-16 0:19 ` [PATCHSET RFC v5 4/8] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-09-16 0:28 ` [PATCH 01/28] fuse: implement the basic iomap mechanisms Darrick J. Wong
2025-09-19 22:36 ` Joanne Koong
2025-09-23 20:32 ` Darrick J. Wong
2025-09-23 21:24 ` Joanne Koong
2025-09-23 22:10 ` Darrick J. Wong
2025-09-23 23:08 ` Darrick J. Wong
2025-09-16 0:28 ` [PATCH 02/28] fuse_trace: " Darrick J. Wong
2025-09-16 0:28 ` [PATCH 03/28] fuse: make debugging configurable at runtime Darrick J. Wong
2025-09-16 0:29 ` [PATCH 04/28] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices Darrick J. Wong
2025-09-17 3:09 ` Amir Goldstein
2025-09-18 18:17 ` Darrick J. Wong
2025-09-18 18:42 ` Amir Goldstein
2025-09-18 19:03 ` Darrick J. Wong
2025-09-19 7:13 ` Miklos Szeredi
2025-09-19 9:54 ` Amir Goldstein
2025-09-19 17:42 ` Darrick J. Wong
2025-09-23 7:10 ` Miklos Szeredi
2025-09-16 0:29 ` [PATCH 05/28] fuse_trace: " Darrick J. Wong
2025-09-16 0:29 ` [PATCH 06/28] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount Darrick J. Wong
2025-09-16 0:29 ` [PATCH 07/28] fuse: create a per-inode flag for toggling iomap Darrick J. Wong
2025-09-16 0:30 ` [PATCH 08/28] fuse_trace: " Darrick J. Wong
2025-09-16 0:30 ` [PATCH 09/28] fuse: isolate the other regular file IO paths from iomap Darrick J. Wong
2025-09-16 0:30 ` [PATCH 10/28] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
2025-09-16 0:30 ` [PATCH 11/28] fuse_trace: " Darrick J. Wong
2025-09-16 0:31 ` [PATCH 12/28] fuse: implement direct IO with iomap Darrick J. Wong
2025-09-16 0:31 ` [PATCH 13/28] fuse_trace: " Darrick J. Wong
2025-09-16 0:31 ` [PATCH 14/28] fuse: implement buffered " Darrick J. Wong
2025-09-16 0:31 ` [PATCH 15/28] fuse_trace: " Darrick J. Wong
2025-09-16 0:32 ` [PATCH 16/28] fuse: implement large folios for iomap pagecache files Darrick J. Wong
2025-09-16 0:32 ` [PATCH 17/28] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
2025-09-16 0:32 ` [PATCH 18/28] fuse: advertise support for iomap Darrick J. Wong
2025-09-16 0:32 ` [PATCH 19/28] fuse: query filesystem geometry when using iomap Darrick J. Wong
2025-09-16 0:33 ` [PATCH 20/28] fuse_trace: " Darrick J. Wong
2025-09-16 0:33 ` [PATCH 21/28] fuse: implement fadvise for iomap files Darrick J. Wong
2025-09-16 0:33 ` [PATCH 22/28] fuse: invalidate ranges of block devices being used for iomap Darrick J. Wong
2025-09-16 0:33 ` [PATCH 23/28] fuse_trace: " Darrick J. Wong
2025-09-16 0:34 ` [PATCH 24/28] fuse: implement inline data file IO via iomap Darrick J. Wong
2025-09-16 0:34 ` [PATCH 25/28] fuse_trace: " Darrick J. Wong
2025-09-16 0:34 ` [PATCH 26/28] fuse: allow more statx fields Darrick J. Wong
2025-09-16 0:35 ` [PATCH 27/28] fuse: support atomic writes with iomap Darrick J. Wong
2025-09-16 0:35 ` [PATCH 28/28] fuse: disable direct reclaim for any fuse server that uses iomap Darrick J. Wong
2025-09-16 0:19 ` [PATCHSET RFC v5 5/8] fuse: allow servers to specify root node id Darrick J. Wong
2025-09-16 0:35 ` [PATCH 1/3] fuse: make the root nodeid dynamic Darrick J. Wong
2025-09-16 0:35 ` [PATCH 2/3] fuse_trace: " Darrick J. Wong
2025-09-16 0:36 ` [PATCH 3/3] fuse: allow setting of root nodeid Darrick J. Wong
2025-09-16 0:19 ` [PATCHSET RFC v5 6/8] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-09-16 0:36 ` [PATCH 1/9] fuse: enable caching of timestamps Darrick J. Wong
2025-09-16 0:36 ` [PATCH 2/9] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
2025-09-16 0:36 ` [PATCH 3/9] fuse: allow local filesystems to set some VFS iflags Darrick J. Wong
2025-09-16 0:37 ` [PATCH 4/9] fuse_trace: " Darrick J. Wong
2025-09-16 0:37 ` [PATCH 5/9] fuse: cache atime when in iomap mode Darrick J. Wong
2025-09-16 0:37 ` [PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
2025-09-16 0:37 ` [PATCH 7/9] fuse_trace: " Darrick J. Wong
2025-09-16 0:38 ` [PATCH 8/9] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
2025-09-16 0:38 ` [PATCH 9/9] fuse: always cache ACLs when using iomap Darrick J. Wong
2025-09-16 0:19 ` [PATCHSET RFC v5 7/8] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-09-16 0:38 ` [PATCH 01/10] fuse: cache iomaps Darrick J. Wong
2025-09-16 0:38 ` [PATCH 02/10] fuse_trace: " Darrick J. Wong
2025-09-16 0:39 ` [PATCH 03/10] fuse: use the iomap cache for iomap_begin Darrick J. Wong
2025-09-16 0:39 ` [PATCH 04/10] fuse_trace: " Darrick J. Wong
2025-09-16 0:39 ` [PATCH 05/10] fuse: invalidate iomap cache after file updates Darrick J. Wong
2025-09-16 0:39 ` [PATCH 06/10] fuse_trace: " Darrick J. Wong
2025-09-16 0:40 ` [PATCH 07/10] fuse: enable iomap cache management Darrick J. Wong
2025-09-16 0:40 ` [PATCH 08/10] fuse_trace: " Darrick J. Wong
2025-09-16 0:40 ` [PATCH 09/10] fuse: overlay iomap inode info in struct fuse_inode Darrick J. Wong
2025-09-16 0:41 ` [PATCH 10/10] fuse: enable iomap Darrick J. Wong
2025-09-16 0:20 ` [PATCHSET RFC v5 8/8] fuse: run fuse servers as a contained service Darrick J. Wong
2025-09-16 0:41 ` [PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap usage Darrick J. Wong
2025-09-16 0:41 ` [PATCH 2/2] fuse: set iomap backing device block size Darrick J. Wong