* [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-05-22 0:02 ` Darrick J. Wong
2025-05-29 11:08 ` Miklos Szeredi
2025-05-22 0:02 ` [PATCH 02/11] iomap: exit early when iomap_iter is called with zero length Darrick J. Wong
` (10 subsequent siblings)
11 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:02 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
I observed a hang when running generic/323 against a fuseblk server.
This test opens a file, initiates a lot of AIO writes to that file
descriptor, and closes the file descriptor before the writes complete.
Unsurprisingly, the AIO exerciser threads are mostly stuck waiting for
responses from the fuseblk server:
# cat /proc/372265/task/372313/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_do_getattr+0xfc/0x1f0 [fuse]
[<0>] fuse_file_read_iter+0xbe/0x1c0 [fuse]
[<0>] aio_read+0x130/0x1e0
[<0>] io_submit_one+0x542/0x860
[<0>] __x64_sys_io_submit+0x98/0x1a0
[<0>] do_syscall_64+0x37/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
But the /weird/ part is that the fuseblk server threads are waiting for
responses from the server itself:
# cat /proc/372210/task/372232/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_file_put+0x9a/0xd0 [fuse]
[<0>] fuse_release+0x36/0x50 [fuse]
[<0>] __fput+0xec/0x2b0
[<0>] task_work_run+0x55/0x90
[<0>] syscall_exit_to_user_mode+0xe9/0x100
[<0>] do_syscall_64+0x43/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
The fuseblk server is fuse2fs so there's nothing all that exciting in
the server itself. So why is the fuse server calling fuse_file_put?
The commit message for the fstest sheds some light on that:
"By closing the file descriptor before calling io_destroy, you pretty
much guarantee that the last put on the ioctx will be done in interrupt
context (during I/O completion)."
Aha. AIO fgets a new struct file from the fd when it queues the ioctx.
The completion of the FUSE_WRITE command from userspace causes the fuse
server to call the AIO completion function. The completion puts the
struct file, queuing a delayed fput to the fuse server task. When the
fuse server task returns to userspace, it has to run the delayed fput,
which in the case of a fuseblk server, it does synchronously.
Sending the FUSE_RELEASE command synchronously from fuse server threads
is a bad idea because a client program can initiate enough simultaneous
AIOs such that all the fuse server threads end up in delayed_fput, and
now there aren't any threads left to handle the queued fuse commands.
Fix this by only using synchronous fputs for fuseblk servers if the
process doesn't have PF_LOCAL_THROTTLE. Hopefully the fuseblk server
had the good sense to call PR_SET_IO_FLUSHER to mark itself as a
filesystem server.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 754378dd9f7159..ada1ed9e653e42 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -355,8 +355,16 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
* Make the release synchronous if this is a fuseblk mount,
* synchronous RELEASE is allowed (and desirable) in this case
* because the server can be trusted not to screw up.
+ *
+ * If we're a LOCAL_THROTTLE thread, use the asynchronous put
+ * because the current thread might be a fuse server. This can
+ * happen if a process starts some aio and closes the fd before
+ * the aio completes. Since aio takes its own ref to the file,
+ * the IO completion has to drop the ref, which is how the fuse
+ * server can end up closing its own clients' files.
*/
- fuse_file_put(ff, ff->fm->fc->destroy);
+ fuse_file_put(ff, ff->fm->fc->destroy &&
+ (current->flags & PF_LOCAL_THROTTLE) == 0);
}
void fuse_release_common(struct file *file, bool isdir)
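For reference, the PF_LOCAL_THROTTLE test in the hunk above only helps if the
server has marked itself with prctl(2). A minimal userspace sketch of that
opt-in follows, assuming the server runs with CAP_SYS_RESOURCE on a kernel that
knows PR_SET_IO_FLUSHER (5.6 or later). The helper names are hypothetical and
are not taken from fuse2fs or libfuse.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Older userspace headers may lack these; values are from linux/prctl.h. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: mark the caller as an I/O flusher.
 * Returns 0 on success or a negative errno value on failure
 * (for example -EPERM when CAP_SYS_RESOURCE is missing).
 */
static int server_mark_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) < 0)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: query the flusher state.
 * Returns 1 if set, 0 if clear, or a negative errno value on failure.
 */
static int server_is_io_flusher(void)
{
	int ret = prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0);

	return ret < 0 ? -errno : ret;
}
```

A server that cannot set the flag (no capability, old kernel) would still get
the synchronous release with this patch, which is the robustness concern raised
later in the thread.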
* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
2025-05-22 0:02 ` [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
@ 2025-05-29 11:08 ` Miklos Szeredi
2025-05-31 1:08 ` Darrick J. Wong
0 siblings, 1 reply; 23+ messages in thread
From: Miklos Szeredi @ 2025-05-29 11:08 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John
On Thu, 22 May 2025 at 02:02, Darrick J. Wong <djwong@kernel.org> wrote:
> Fix this by only using synchronous fputs for fuseblk servers if the
> process doesn't have PF_LOCAL_THROTTLE. Hopefully the fuseblk server
> had the good sense to call PR_SET_IO_FLUSHER to mark itself as a
> filesystem server.
The bug is valid.
I just wonder if we really need to check against the task flag instead
of always sending release async, which would simplify things.
The sync release originates from commit 5a18ec176c93 ("fuse: fix hang
of single threaded fuseblk filesystem"), but then commit baebccbe997d
("fuse: hold inode instead of path after release") made that obsolete.
Does anybody see a reason why sync release for fuseblk is a good idea?
Thanks,
Miklos
* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
2025-05-29 11:08 ` Miklos Szeredi
@ 2025-05-31 1:08 ` Darrick J. Wong
2025-06-06 13:54 ` Miklos Szeredi
0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-31 1:08 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John
On Thu, May 29, 2025 at 01:08:25PM +0200, Miklos Szeredi wrote:
> On Thu, 22 May 2025 at 02:02, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Fix this by only using synchronous fputs for fuseblk servers if the
> > process doesn't have PF_LOCAL_THROTTLE. Hopefully the fuseblk server
> > had the good sense to call PR_SET_IO_FLUSHER to mark itself as a
> > filesystem server.
>
> The bug is valid.
>
> I just wonder if we really need to check against the task flag instead
> of always sending release async, which would simplify things.
>
> The sync release originates from commit 5a18ec176c93 ("fuse: fix hang
> of single threaded fuseblk filesystem"), but then commit baebccbe997d
> ("fuse: hold inode instead of path after release") made that obsolete.
>
> Does anybody see a reason why sync release for fuseblk is a good idea?
The best reason that I can think of is that normally the process that
owns the fd (and hence is releasing it) should be made to wait for
the release, because normally we want processes that generate file
activity to pay those costs. It's just this weird case where the fd
already got closed but aio is still going in the background.
(yeah, everyone hates aio ;))
Also: is it a bug that the kernel only sends FUSE_DESTROY on umount for
fuseblk filesystems? I'd have thought that you'd want to make umount
block until the fuse server is totally done. OTOH I guess I could see
an argument for not waiting for potentially hung servers, etc.
--D
> Thanks,
> Miklos
* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
2025-05-31 1:08 ` Darrick J. Wong
@ 2025-06-06 13:54 ` Miklos Szeredi
2025-06-09 18:13 ` Darrick J. Wong
0 siblings, 1 reply; 23+ messages in thread
From: Miklos Szeredi @ 2025-06-06 13:54 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John
On Sat, 31 May 2025 at 03:08, Darrick J. Wong <djwong@kernel.org> wrote:
> The best reason that I can think of is that normally the process that
> owns the fd (and hence is releasing it) should be made to wait for
> the release, because normally we want processes that generate file
> activity to pay those costs.
That argument seems to apply to all fuse variants. But fuse does get
away with async release and I don't see why fuseblk would be different
in this respect.
Trying to hack around the problems of sync release with a task flag
that servers might or might not have set does not feel like a very robust
solution.
> Also: is it a bug that the kernel only sends FUSE_DESTROY on umount for
> fuseblk filesystems? I'd have thought that you'd want to make umount
> block until the fuse server is totally done. OTOH I guess I could see
> an argument for not waiting for potentially hung servers, etc.
It's a potential DoS. With allow_root we could arguably enable
FUSE_DESTROY, since the mounter is explicitly acknowledging this DoS
possibility.
Thanks,
Miklos
* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
2025-06-06 13:54 ` Miklos Szeredi
@ 2025-06-09 18:13 ` Darrick J. Wong
2025-06-09 20:29 ` Darrick J. Wong
0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2025-06-09 18:13 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John
On Fri, Jun 06, 2025 at 03:54:50PM +0200, Miklos Szeredi wrote:
> On Sat, 31 May 2025 at 03:08, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > The best reason that I can think of is that normally the process that
> > owns the fd (and hence is releasing it) should be made to wait for
> > the release, because normally we want processes that generate file
> > activity to pay those costs.
>
> That argument seems to apply to all fuse variants. But fuse does get
> away with async release and I don't see why fuseblk would be different
> in this respect.
>
> Trying to hack around the problems of sync release with a task flag
> that servers might or might not have set does not feel like a very robust
> solution.
>
> > Also: is it a bug that the kernel only sends FUSE_DESTROY on umount for
> > fuseblk filesystems? I'd have thought that you'd want to make umount
> > block until the fuse server is totally done. OTOH I guess I could see
> > an argument for not waiting for potentially hung servers, etc.
>
> It's a potential DoS. With allow_root we could arguably enable
> FUSE_DESTROY, since the mounter is explicitly acknowledging this DoS
> possibility.
<nod> Looking deeper at fuse2fs's op_destroy function, I think most of
the slow functionality (writing group descriptors and the primary super
and fsyncing the device) ought to be done via FUSE_SYNCFS, not
FUSE_DESTROY. If I made that change, I think op_destroy becomes very
fast -- all it does is close the fs and log a message. The VFS unmount
code calls sync_filesystem (which initiates a FUSE_SYNCFS) which sounds
like it would work for fuse2fs.
Unhappily, libfuse3 doesn't seem to implement it:
$ git grep FUSE_SYNCFS
doc/libfuse-operations.txt:394:50. FUSE_SYNCFS (50)
include/fuse_kernel.h:186: * - add FUSE_SYNCFS
include/fuse_kernel.h:670: FUSE_SYNCFS = 50,
--D
> Thanks,
> Miklos
* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
2025-06-09 18:13 ` Darrick J. Wong
@ 2025-06-09 20:29 ` Darrick J. Wong
0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-06-09 20:29 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John
On Mon, Jun 09, 2025 at 11:13:26AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 06, 2025 at 03:54:50PM +0200, Miklos Szeredi wrote:
> > On Sat, 31 May 2025 at 03:08, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > > The best reason that I can think of is that normally the process that
> > > owns the fd (and hence is releasing it) should be made to wait for
> > > the release, because normally we want processes that generate file
> > > activity to pay those costs.
> >
> > That argument seems to apply to all fuse variants. But fuse does get
> > away with async release and I don't see why fuseblk would be different
> > in this respect.
> >
> > Trying to hack around the problems of sync release with a task flag
> > that servers might or might not have set does not feel like a very robust
> > solution.
> >
> > > Also: is it a bug that the kernel only sends FUSE_DESTROY on umount for
> > > fuseblk filesystems? I'd have thought that you'd want to make umount
> > > block until the fuse server is totally done. OTOH I guess I could see
> > > an argument for not waiting for potentially hung servers, etc.
> >
> > It's a potential DoS. With allow_root we could arguably enable
> > FUSE_DESTROY, since the mounter is explicitly acknowledging this DoS
> > possibility.
>
> <nod> Looking deeper at fuse2fs's op_destroy function, I think most of
> the slow functionality (writing group descriptors and the primary super
> and fsyncing the device) ought to be done via FUSE_SYNCFS, not
> FUSE_DESTROY. If I made that change, I think op_destroy becomes very
> fast -- all it does is close the fs and log a message. The VFS unmount
> code calls sync_filesystem (which initiates a FUSE_SYNCFS) which sounds
> like it would work for fuse2fs.
>
> Unhappily, libfuse3 doesn't seem to implement it:
>
> $ git grep FUSE_SYNCFS
> doc/libfuse-operations.txt:394:50. FUSE_SYNCFS (50)
> include/fuse_kernel.h:186: * - add FUSE_SYNCFS
> include/fuse_kernel.h:670: FUSE_SYNCFS = 50,
...and it won't really work anyway since fuse_sync_fs doesn't upcall to
the fuse server if sb->s_root == NULL; and we can't do anything at that
point anyway because deactivate_locked_super -> fuse_kill_sb_anon has
already called fuse_conn_destroy to tear down the connection.
--D
>
> > Thanks,
> > Miklos
>
* [PATCH 02/11] iomap: exit early when iomap_iter is called with zero length
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-05-22 0:02 ` [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
@ 2025-05-22 0:02 ` Darrick J. Wong
2025-05-22 0:03 ` [PATCH 03/11] fuse: implement the basic iomap mechanisms Darrick J. Wong
` (9 subsequent siblings)
11 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:02 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
If iomap_iter::len is zero on the first call to iomap_iter(), we should
just return zero instead of calling ->iomap_begin with zero count. This
obviates the need for ->iomap_begin implementations to handle that
"correctly" by not returning a zero-length mapping.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/iter.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 6ffc6a7b9ba502..b86a6a08627126 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -66,8 +66,11 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
trace_iomap_iter(iter, ops, _RET_IP_);
- if (!iter->iomap.length)
+ if (!iter->iomap.length) {
+ if (iter->len == 0)
+ return 0;
goto begin;
+ }
/*
* Calculate how far the iter was advanced and the original length bytes
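The effect of the early exit can be seen in a toy userspace model of the
iterate-then-map loop. This is only a sketch: struct toy_iter, toy_begin() and
TOY_MAP_MAX are made up for illustration and are not the kernel's iomap_iter
machinery. The model also assumes the caller always consumes the whole mapping.

```c
#include <stdint.h>

/* Toy stand-in for struct iomap_iter; not the kernel structure. */
struct toy_iter {
	uint64_t pos;		/* current file position */
	uint64_t len;		/* bytes left to process */
	uint64_t map_len;	/* length of current mapping, 0 if none yet */
	uint64_t processed;	/* bytes the caller consumed from the mapping */
	unsigned begin_calls;	/* how often the "->iomap_begin" stand-in ran */
};

#define TOY_MAP_MAX 4	/* pretend each mapping covers at most 4 bytes */

/* Stand-in for ->iomap_begin: map as much as allowed of what is left. */
static void toy_begin(struct toy_iter *it)
{
	it->begin_calls++;
	it->map_len = it->len < TOY_MAP_MAX ? it->len : TOY_MAP_MAX;
}

/* Returns 1 if the caller should process a mapping, 0 when done. */
static int toy_iter_next(struct toy_iter *it)
{
	if (!it->map_len) {
		/* First call: nothing is mapped yet. */
		if (it->len == 0)
			return 0;	/* the early exit added by the patch */
	} else {
		/* Later calls: account for progress, drop the old mapping. */
		it->pos += it->processed;
		it->len -= it->processed;
		it->processed = 0;
		it->map_len = 0;
		if (it->len == 0)
			return 0;
	}
	toy_begin(it);
	return 1;
}

/* Walk a range and report how many times toy_begin() was called. */
static unsigned toy_walk(uint64_t pos, uint64_t len)
{
	struct toy_iter it = { .pos = pos, .len = len };

	while (toy_iter_next(&it) > 0)
		it.processed = it.map_len;
	return it.begin_calls;
}
```

In this model a zero-length walk never reaches toy_begin(), so the mapping
callback does not have to decide what a zero-length mapping means.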
* [PATCH 03/11] fuse: implement the basic iomap mechanisms
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-05-22 0:02 ` [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
2025-05-22 0:02 ` [PATCH 02/11] iomap: exit early when iomap_iter is called with zero length Darrick J. Wong
@ 2025-05-22 0:03 ` Darrick J. Wong
2025-05-29 22:15 ` Joanne Koong
2025-05-22 0:03 ` [PATCH 04/11] fuse: add a notification to add new iomap devices Darrick J. Wong
` (8 subsequent siblings)
11 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:03 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
Implement functions to enable upcalling of iomap_begin and iomap_end to
userspace fuse servers.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 38 ++++++
fs/fuse/fuse_trace.h | 258 +++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/fuse.h | 87 ++++++++++++++
fs/fuse/Kconfig | 23 ++++
fs/fuse/Makefile | 1
fs/fuse/file_iomap.c | 280 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 5 +
7 files changed, 691 insertions(+), 1 deletion(-)
create mode 100644 fs/fuse/file_iomap.c
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d56d4fd956db99..aa51f25856697d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -895,6 +895,9 @@ struct fuse_conn {
/* Is link not implemented by fs? */
unsigned int no_link:1;
+ /* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
+ unsigned int iomap:1;
+
/* Use io_uring for communication */
unsigned int io_uring;
@@ -1017,6 +1020,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
return sb->s_fs_info;
}
+static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
+{
+ return sb->s_fs_info;
+}
+
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
{
return get_fuse_mount_super(sb)->fc;
@@ -1027,16 +1035,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
return get_fuse_mount_super(inode->i_sb);
}
+static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
+{
+ return get_fuse_mount_super_c(inode->i_sb);
+}
+
static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
{
return get_fuse_mount_super(inode->i_sb)->fc;
}
+static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
+{
+ return get_fuse_mount_super_c(inode->i_sb)->fc;
+}
+
static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
{
return container_of(inode, struct fuse_inode, inode);
}
+static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
+{
+ return container_of(inode, struct fuse_inode, inode);
+}
+
static inline u64 get_node_id(struct inode *inode)
{
return get_fuse_inode(inode)->nodeid;
@@ -1577,4 +1600,19 @@ extern void fuse_sysctl_unregister(void);
#define fuse_sysctl_unregister() do { } while (0)
#endif /* CONFIG_SYSCTL */
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+# include <linux/fiemap.h>
+# include <linux/iomap.h>
+
+bool fuse_iomap_enabled(void);
+
+static inline bool fuse_has_iomap(const struct inode *inode)
+{
+ return get_fuse_conn_c(inode)->iomap;
+}
+#else
+# define fuse_iomap_enabled(...) (false)
+# define fuse_has_iomap(...) (false)
+#endif
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index bbe9ddd8c71696..f9a316c9788e06 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -58,6 +58,8 @@
EM( FUSE_SYNCFS, "FUSE_SYNCFS") \
EM( FUSE_TMPFILE, "FUSE_TMPFILE") \
EM( FUSE_STATX, "FUSE_STATX") \
+ EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \
+ EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \
EMe(CUSE_INIT, "CUSE_INIT")
/*
@@ -124,6 +126,262 @@ TRACE_EVENT(fuse_request_end,
__entry->unique, __entry->len, __entry->error)
);
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+
+#define FUSE_IOMAP_F_STRINGS \
+ { FUSE_IOMAP_F_NEW, "new" }, \
+ { FUSE_IOMAP_F_DIRTY, "dirty" }, \
+ { FUSE_IOMAP_F_SHARED, "shared" }, \
+ { FUSE_IOMAP_F_MERGED, "merged" }, \
+ { FUSE_IOMAP_F_XATTR, "xattr" }, \
+ { FUSE_IOMAP_F_BOUNDARY, "boundary" }, \
+ { FUSE_IOMAP_F_ANON_WRITE, "anon_write" }, \
+ { FUSE_IOMAP_F_ATOMIC_BIO, "atomic" }, \
+ { FUSE_IOMAP_F_WANT_IOMAP_END, "iomap_end" }, \
+ { FUSE_IOMAP_F_SIZE_CHANGED, "append" }, \
+ { FUSE_IOMAP_F_STALE, "stale" }
+
+#define FUSE_IOMAP_OP_STRINGS \
+ { FUSE_IOMAP_OP_WRITE, "write" }, \
+ { FUSE_IOMAP_OP_ZERO, "zero" }, \
+ { FUSE_IOMAP_OP_REPORT, "report" }, \
+ { FUSE_IOMAP_OP_FAULT, "fault" }, \
+ { FUSE_IOMAP_OP_DIRECT, "direct" }, \
+ { FUSE_IOMAP_OP_NOWAIT, "nowait" }, \
+ { FUSE_IOMAP_OP_OVERWRITE_ONLY, "overwrite" }, \
+ { FUSE_IOMAP_OP_UNSHARE, "unshare" }, \
+ { FUSE_IOMAP_OP_ATOMIC, "atomic" }, \
+ { FUSE_IOMAP_OP_DONTCACHE, "dontcache" }
+
+#define FUSE_IOMAP_TYPE_STRINGS \
+ { FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \
+ { FUSE_IOMAP_TYPE_HOLE, "hole" }, \
+ { FUSE_IOMAP_TYPE_DELALLOC, "delalloc" }, \
+ { FUSE_IOMAP_TYPE_MAPPED, "mapped" }, \
+ { FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \
+ { FUSE_IOMAP_TYPE_INLINE, "inline" }
+
+TRACE_EVENT(fuse_iomap_begin,
+ TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags),
+
+ TP_ARGS(inode, pos, count, opflags),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(loff_t, count)
+ __field(unsigned, opflags)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->pos = pos;
+ __entry->count = count;
+ __entry->opflags = opflags;
+ ),
+
+ TP_printk("connection %u ino %llu opflags (%s) pos 0x%llx count 0x%llx",
+ __entry->connection, __entry->ino,
+ __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+ __entry->pos, __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_begin_error,
+ TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags, int error),
+
+ TP_ARGS(inode, pos, count, opflags, error),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(loff_t, count)
+ __field(unsigned, opflags)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->pos = pos;
+ __entry->count = count;
+ __entry->opflags = opflags;
+ __entry->error = error;
+ ),
+
+ TP_printk("connection %u ino %llu opflags (%s) pos 0x%llx count 0x%llx err %d",
+ __entry->connection, __entry->ino,
+ __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+ __entry->pos, __entry->count, __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_read_map,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_begin_out *outarg),
+
+ TP_ARGS(inode, outarg),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, offset)
+ __field(loff_t, length)
+ __field(uint32_t, dev)
+ __field(uint64_t, addr)
+ __field(uint16_t, type)
+ __field(uint16_t, mapflags)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->offset = outarg->offset;
+ __entry->length = outarg->length;
+ __entry->dev = outarg->read_dev;
+ __entry->addr = outarg->read_addr;
+ __entry->type = outarg->read_type;
+ __entry->mapflags = outarg->read_flags;
+ ),
+
+ TP_printk("connection %u ino %llu read offset 0x%llx count 0x%llx dev %u addr 0x%llx type %s mapflags (%s)",
+ __entry->connection, __entry->ino, __entry->offset,
+ __entry->length, __entry->dev, __entry->addr,
+ __print_symbolic(__entry->type, FUSE_IOMAP_TYPE_STRINGS),
+ __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS))
+);
+
+TRACE_EVENT(fuse_iomap_write_map,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_begin_out *outarg),
+
+ TP_ARGS(inode, outarg),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, offset)
+ __field(loff_t, length)
+ __field(uint32_t, dev)
+ __field(uint64_t, addr)
+ __field(uint16_t, type)
+ __field(uint16_t, mapflags)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->offset = outarg->offset;
+ __entry->length = outarg->length;
+ __entry->dev = outarg->write_dev;
+ __entry->addr = outarg->write_addr;
+ __entry->type = outarg->write_type;
+ __entry->mapflags = outarg->write_flags;
+ ),
+
+ TP_printk("connection %u ino %llu write offset 0x%llx count 0x%llx dev %u addr 0x%llx type %s mapflags (%s)",
+ __entry->connection, __entry->ino, __entry->offset,
+ __entry->length, __entry->dev, __entry->addr,
+ __print_symbolic(__entry->type, FUSE_IOMAP_TYPE_STRINGS),
+ __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS))
+);
+
+TRACE_EVENT(fuse_iomap_end,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_end_in *inarg),
+
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(loff_t, count)
+ __field(unsigned, opflags)
+ __field(size_t, written)
+
+ __field(uint32_t, dev)
+ __field(uint64_t, addr)
+ __field(uint16_t, type)
+ __field(uint16_t, mapflags)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->pos = inarg->pos;
+ __entry->count = inarg->count;
+ __entry->opflags = inarg->opflags;
+ __entry->written = inarg->written;
+ __entry->dev = inarg->map_dev;
+ __entry->addr = inarg->map_addr;
+ __entry->type = inarg->map_type;
+ __entry->mapflags = inarg->map_flags;
+ ),
+
+ TP_printk("connection %u ino %llu opflags (%s) pos 0x%llx count 0x%llx written %zd dev %u addr 0x%llx type 0x%x mapflags (%s)",
+ __entry->connection, __entry->ino,
+ __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+ __entry->pos, __entry->count, __entry->written, __entry->dev,
+ __entry->addr, __entry->type,
+ __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS))
+);
+
+TRACE_EVENT(fuse_iomap_end_error,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_end_in *inarg, int error),
+
+ TP_ARGS(inode, inarg, error),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(loff_t, count)
+ __field(unsigned, opflags)
+ __field(size_t, written)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->pos = inarg->pos;
+ __entry->count = inarg->count;
+ __entry->opflags = inarg->opflags;
+ __entry->written = inarg->written;
+ __entry->error = error;
+ ),
+
+ TP_printk("connection %u ino %llu opflags (%s) pos 0x%llx count 0x%llx written %zd error %d",
+ __entry->connection, __entry->ino,
+ __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+ __entry->pos, __entry->count, __entry->written,
+ __entry->error)
+);
+#endif /* CONFIG_FUSE_IOMAP */
+
#endif /* _TRACE_FUSE_H */
#undef TRACE_INCLUDE_PATH
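As an aside on the trace output: the opflags field is rendered as names joined
with "|". A rough userspace sketch of that rendering follows, using the
FUSE_IOMAP_OP_* bit values from the uapi header hunk of this patch. The
function name sketch_format_opflags is hypothetical, and unlike the kernel's
__print_flags() this sketch silently ignores bits it has no name for.

```c
#include <stddef.h>
#include <stdio.h>

struct flag_name {
	unsigned int bit;
	const char *name;
};

/* Bit values and names mirror FUSE_IOMAP_OP_STRINGS in the patch. */
static const struct flag_name op_names[] = {
	{ 1U << 0, "write" },
	{ 1U << 1, "zero" },
	{ 1U << 2, "report" },
	{ 1U << 3, "fault" },
	{ 1U << 4, "direct" },
	{ 1U << 5, "nowait" },
	{ 1U << 6, "overwrite" },
	{ 1U << 7, "unshare" },
	{ 1U << 9, "atomic" },
	{ 1U << 10, "dontcache" },
};

/* Format the named bits of @flags into @buf; @buflen must be nonzero. */
static void sketch_format_opflags(unsigned int flags, char *buf, size_t buflen)
{
	size_t used = 0;
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(op_names) / sizeof(op_names[0]); i++) {
		int n;

		if (!(flags & op_names[i].bit))
			continue;
		n = snprintf(buf + used, buflen - used, "%s%s",
			     used ? "|" : "", op_names[i].name);
		if (n < 0 || (size_t)n >= buflen - used)
			break;	/* output truncated, stop here */
		used += (size_t)n;
	}
}
```

For example, a direct write would carry bits 0 and 4 and show up as
"write|direct" in the opflags column.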
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5ec43ecbceb783..ce6c9960f2418f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -232,6 +232,10 @@
*
* 7.43
* - add FUSE_REQUEST_TIMEOUT
+ *
+ * 7.44
+ * - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
+ * SEEK_{DATA,HOLE} support
*/
#ifndef _LINUX_FUSE_H
@@ -267,7 +271,7 @@
#define FUSE_KERNEL_VERSION 7
/** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 43
+#define FUSE_KERNEL_MINOR_VERSION 44
/** The node ID of the root inode */
#define FUSE_ROOT_ID 1
@@ -440,6 +444,8 @@ struct fuse_file_lock {
* FUSE_OVER_IO_URING: Indicate that client supports io-uring
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
+ * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
+ * operations.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -487,6 +493,7 @@ struct fuse_file_lock {
#define FUSE_ALLOW_IDMAP (1ULL << 40)
#define FUSE_OVER_IO_URING (1ULL << 41)
#define FUSE_REQUEST_TIMEOUT (1ULL << 42)
+#define FUSE_IOMAP (1ULL << 43)
/**
* CUSE INIT request/reply flags
@@ -655,6 +662,9 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_BEGIN = 4094,
+ FUSE_IOMAP_END = 4095,
+
/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -1286,4 +1296,79 @@ struct fuse_uring_cmd_req {
uint8_t padding[6];
};
+#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (0xFFFF) /* use read mapping data */
+#define FUSE_IOMAP_TYPE_HOLE 0 /* no blocks allocated, need allocation */
+#define FUSE_IOMAP_TYPE_DELALLOC 1 /* delayed allocation blocks */
+#define FUSE_IOMAP_TYPE_MAPPED 2 /* blocks allocated at @addr */
+#define FUSE_IOMAP_TYPE_UNWRITTEN 3 /* blocks allocated at @addr in unwritten state */
+#define FUSE_IOMAP_TYPE_INLINE 4 /* data inline in the inode */
+
+#define FUSE_IOMAP_DEV_FUSEBLK (0U) /* fuseblk sb_dev device cookie */
+#define FUSE_IOMAP_DEV_NULL (~0U) /* null device cookie */
+
+#define FUSE_IOMAP_F_NEW (1U << 0)
+#define FUSE_IOMAP_F_DIRTY (1U << 1)
+#define FUSE_IOMAP_F_SHARED (1U << 2)
+#define FUSE_IOMAP_F_MERGED (1U << 3)
+#define FUSE_IOMAP_F_XATTR (1U << 5)
+#define FUSE_IOMAP_F_BOUNDARY (1U << 6)
+#define FUSE_IOMAP_F_ANON_WRITE (1U << 7)
+#define FUSE_IOMAP_F_ATOMIC_BIO (1U << 8)
+#define FUSE_IOMAP_F_WANT_IOMAP_END (1U << 12) /* want ->iomap_end call */
+
+/* only for iomap_end */
+#define FUSE_IOMAP_F_SIZE_CHANGED (1U << 14)
+#define FUSE_IOMAP_F_STALE (1U << 15)
+
+#define FUSE_IOMAP_OP_WRITE (1 << 0) /* writing, must allocate blocks */
+#define FUSE_IOMAP_OP_ZERO (1 << 1) /* zeroing operation, may skip holes */
+#define FUSE_IOMAP_OP_REPORT (1 << 2) /* report extent status, e.g. FIEMAP */
+#define FUSE_IOMAP_OP_FAULT (1 << 3) /* mapping for page fault */
+#define FUSE_IOMAP_OP_DIRECT (1 << 4) /* direct I/O */
+#define FUSE_IOMAP_OP_NOWAIT (1 << 5) /* do not block */
+#define FUSE_IOMAP_OP_OVERWRITE_ONLY (1 << 6) /* only pure overwrites allowed */
+#define FUSE_IOMAP_OP_UNSHARE (1 << 7) /* unshare_file_range */
+#define FUSE_IOMAP_OP_ATOMIC (1 << 9) /* torn-write protection */
+#define FUSE_IOMAP_OP_DONTCACHE (1 << 10) /* dont retain pagecache */
+
+#define FUSE_IOMAP_NULL_ADDR (-1ULL) /* addr is not valid */
+
+struct fuse_iomap_begin_in {
+ uint32_t opflags; /* FUSE_IOMAP_OP_* */
+ uint32_t reserved; /* zero */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t count; /* operation length, in bytes */
+};
+
+struct fuse_iomap_begin_out {
+ uint64_t offset; /* file offset of mapping, bytes */
+ uint64_t length; /* length of both mappings, bytes */
+
+ uint64_t read_addr; /* disk offset of mapping, bytes */
+ uint16_t read_type; /* FUSE_IOMAP_TYPE_* */
+ uint16_t read_flags; /* FUSE_IOMAP_F_* */
+ uint32_t read_dev; /* device cookie */
+
+ uint64_t write_addr; /* disk offset of mapping, bytes */
+ uint16_t write_type; /* FUSE_IOMAP_TYPE_* */
+ uint16_t write_flags; /* FUSE_IOMAP_F_* */
+ uint32_t write_dev; /* device cookie */
+};
+
+struct fuse_iomap_end_in {
+ uint32_t opflags; /* FUSE_IOMAP_OP_* */
+ uint32_t reserved; /* zero */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t count; /* operation length, in bytes */
+ int64_t written; /* bytes processed */
+
+ uint64_t map_length; /* length of mapping, bytes */
+ uint64_t map_addr; /* disk offset of mapping, bytes */
+ uint16_t map_type; /* FUSE_IOMAP_TYPE_* */
+ uint16_t map_flags; /* FUSE_IOMAP_F_* */
+ uint32_t map_dev; /* device cookie */
+};
+
#endif /* _LINUX_FUSE_H */
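To make the read/write split in struct fuse_iomap_begin_out more concrete, here
is a sketch of how a server might fill a reply. The struct layout and constants
are copied from the hunk above; the sketch_fill_* helpers are hypothetical.
Pairing FUSE_IOMAP_DEV_NULL with a hole, and using
FUSE_IOMAP_TYPE_PURE_OVERWRITE to mean "writes reuse the read mapping", are
assumptions based only on the comments in the header, not on the rest of the
series.

```c
#include <stdint.h>
#include <string.h>

/* Constants and layout copied from the uapi hunk in this patch. */
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF)
#define FUSE_IOMAP_TYPE_HOLE		0
#define FUSE_IOMAP_TYPE_MAPPED		2
#define FUSE_IOMAP_DEV_FUSEBLK		(0U)
#define FUSE_IOMAP_DEV_NULL		(~0U)
#define FUSE_IOMAP_NULL_ADDR		(-1ULL)

struct fuse_iomap_begin_out {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of both mappings, bytes */

	uint64_t read_addr;	/* disk offset of mapping, bytes */
	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
	uint32_t read_dev;	/* device cookie */

	uint64_t write_addr;	/* disk offset of mapping, bytes */
	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
	uint32_t write_dev;	/* device cookie */
};

/* Assumption: a write mapping of this type defers to the read mapping. */
static void sketch_write_same_as_read(struct fuse_iomap_begin_out *out)
{
	out->write_addr = FUSE_IOMAP_NULL_ADDR;
	out->write_type = FUSE_IOMAP_TYPE_PURE_OVERWRITE;
	out->write_dev = FUSE_IOMAP_DEV_NULL;
}

/* Hypothetical helper: report a hole covering [offset, offset + length). */
static void sketch_fill_hole(struct fuse_iomap_begin_out *out,
			     uint64_t offset, uint64_t length)
{
	memset(out, 0, sizeof(*out));
	out->offset = offset;
	out->length = length;
	out->read_addr = FUSE_IOMAP_NULL_ADDR;
	out->read_type = FUSE_IOMAP_TYPE_HOLE;
	out->read_dev = FUSE_IOMAP_DEV_NULL;
	sketch_write_same_as_read(out);
}

/* Hypothetical helper: report blocks allocated at disk_addr on the fuseblk device. */
static void sketch_fill_mapped(struct fuse_iomap_begin_out *out,
			       uint64_t offset, uint64_t length,
			       uint64_t disk_addr)
{
	memset(out, 0, sizeof(*out));
	out->offset = offset;
	out->length = length;
	out->read_addr = disk_addr;
	out->read_type = FUSE_IOMAP_TYPE_MAPPED;
	out->read_dev = FUSE_IOMAP_DEV_FUSEBLK;
	sketch_write_same_as_read(out);
}
```

Both mappings share one offset and length, which is why the struct carries
those fields only once.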
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index ca215a3cba3e31..fc7c5bf1cef52d 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -64,6 +64,29 @@ config FUSE_PASSTHROUGH
If you want to allow passthrough operations, answer Y.
+config FUSE_IOMAP
+ bool "FUSE file IO over iomap"
+ default y
+ depends on FUSE_FS
+ select FS_IOMAP
+ help
+ For supported fuseblk servers, this allows the file IO path to run
+ through the kernel.
+
+config FUSE_IOMAP_BY_DEFAULT
+ bool "FUSE file I/O over iomap by default"
+ default n
+ depends on FUSE_IOMAP
+ help
+ Enable sending FUSE file I/O over iomap by default.
+
+config FUSE_IOMAP_DEBUG
+ bool "Debug FUSE file IO over iomap"
+ default n
+ depends on FUSE_IOMAP
+ help
+ Enable debugging assertions for the fuse iomap code paths.
+
config FUSE_IO_URING
bool "FUSE communication over io-uring"
default y
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3f0f312a31c1cc..63a41ef9336aaa 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -16,5 +16,6 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
+fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
new file mode 100644
index 00000000000000..dfa0c309803113
--- /dev/null
+++ b/fs/fuse/file_iomap.c
@@ -0,0 +1,280 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "fuse_i.h"
+#include "fuse_trace.h"
+#include <linux/iomap.h>
+
+static bool __read_mostly enable_iomap =
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
+ true;
+#else
+ false;
+#endif
+module_param(enable_iomap, bool, 0644);
+MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap");
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+# define ASSERT(a) do { WARN_ON(!(a)); } while (0)
+#else
+# define ASSERT(a)
+#endif
+
+bool fuse_iomap_enabled(void)
+{
+ return enable_iomap;
+}
+
+static inline bool fuse_iomap_check_type(uint16_t type)
+{
+ BUILD_BUG_ON(FUSE_IOMAP_TYPE_HOLE != IOMAP_HOLE);
+ BUILD_BUG_ON(FUSE_IOMAP_TYPE_DELALLOC != IOMAP_DELALLOC);
+ BUILD_BUG_ON(FUSE_IOMAP_TYPE_MAPPED != IOMAP_MAPPED);
+ BUILD_BUG_ON(FUSE_IOMAP_TYPE_UNWRITTEN != IOMAP_UNWRITTEN);
+ BUILD_BUG_ON(FUSE_IOMAP_TYPE_INLINE != IOMAP_INLINE);
+
+ switch (type) {
+ case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+ case FUSE_IOMAP_TYPE_HOLE:
+ case FUSE_IOMAP_TYPE_DELALLOC:
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ case FUSE_IOMAP_TYPE_INLINE:
+ return true;
+ }
+
+ return false;
+}
+
+#define FUSE_IOMAP_F_ALL (FUSE_IOMAP_F_NEW | \
+ FUSE_IOMAP_F_DIRTY | \
+ FUSE_IOMAP_F_SHARED | \
+ FUSE_IOMAP_F_MERGED | \
+ FUSE_IOMAP_F_XATTR | \
+ FUSE_IOMAP_F_BOUNDARY | \
+ FUSE_IOMAP_F_ANON_WRITE | \
+ FUSE_IOMAP_F_ATOMIC_BIO | \
+ FUSE_IOMAP_F_WANT_IOMAP_END)
+
+static inline bool fuse_iomap_check_flags(uint16_t flags)
+{
+ BUILD_BUG_ON(FUSE_IOMAP_F_NEW != IOMAP_F_NEW);
+ BUILD_BUG_ON(FUSE_IOMAP_F_DIRTY != IOMAP_F_DIRTY);
+ BUILD_BUG_ON(FUSE_IOMAP_F_SHARED != IOMAP_F_SHARED);
+ BUILD_BUG_ON(FUSE_IOMAP_F_MERGED != IOMAP_F_MERGED);
+ BUILD_BUG_ON(FUSE_IOMAP_F_XATTR != IOMAP_F_XATTR);
+ BUILD_BUG_ON(FUSE_IOMAP_F_BOUNDARY != IOMAP_F_BOUNDARY);
+ BUILD_BUG_ON(FUSE_IOMAP_F_ANON_WRITE != IOMAP_F_ANON_WRITE);
+ BUILD_BUG_ON(FUSE_IOMAP_F_ATOMIC_BIO != IOMAP_F_ATOMIC_BIO);
+ BUILD_BUG_ON(FUSE_IOMAP_F_WANT_IOMAP_END != IOMAP_F_PRIVATE);
+
+ return (flags & ~FUSE_IOMAP_F_ALL) == 0;
+}
+
+/* Check the incoming mappings to make sure they're not nonsense */
+static inline int fuse_iomap_validate(const struct fuse_iomap_begin_out *outarg,
+ unsigned opflags, loff_t pos)
+{
+ BUILD_BUG_ON(FUSE_IOMAP_OP_WRITE != IOMAP_WRITE);
+ BUILD_BUG_ON(FUSE_IOMAP_OP_ZERO != IOMAP_ZERO);
+ BUILD_BUG_ON(FUSE_IOMAP_OP_REPORT != IOMAP_REPORT);
+ BUILD_BUG_ON(FUSE_IOMAP_OP_FAULT != IOMAP_FAULT);
+ BUILD_BUG_ON(FUSE_IOMAP_OP_DIRECT != IOMAP_DIRECT);
+ BUILD_BUG_ON(FUSE_IOMAP_OP_NOWAIT != IOMAP_NOWAIT);
+ BUILD_BUG_ON(FUSE_IOMAP_OP_OVERWRITE_ONLY != IOMAP_OVERWRITE_ONLY);
+ BUILD_BUG_ON(FUSE_IOMAP_OP_UNSHARE != IOMAP_UNSHARE);
+ BUILD_BUG_ON(FUSE_IOMAP_OP_ATOMIC != IOMAP_ATOMIC);
+ BUILD_BUG_ON(FUSE_IOMAP_OP_DONTCACHE != IOMAP_DONTCACHE);
+
+ if (outarg->read_dev == FUSE_IOMAP_DEV_NULL) {
+ ASSERT(outarg->read_dev != FUSE_IOMAP_DEV_NULL);
+ return -EIO;
+ }
+ if (outarg->write_dev == FUSE_IOMAP_DEV_NULL) {
+ ASSERT(outarg->write_dev != FUSE_IOMAP_DEV_NULL);
+ return -EIO;
+ }
+ if (outarg->offset > pos) {
+ ASSERT(outarg->offset <= pos);
+ return -EIO;
+ }
+ if (outarg->length == 0) {
+ ASSERT(outarg->length != 0);
+ return -EIO;
+ }
+ if (outarg->offset + outarg->length <= pos) {
+ ASSERT(outarg->offset + outarg->length > pos);
+ return -EIO;
+ }
+ if (!fuse_iomap_check_type(outarg->write_type)) {
+ ASSERT(fuse_iomap_check_type(outarg->write_type));
+ return -EIO;
+ }
+ if (!fuse_iomap_check_flags(outarg->write_flags)) {
+ ASSERT(fuse_iomap_check_flags(outarg->write_flags));
+ return -EIO;
+ }
+ if (!fuse_iomap_check_type(outarg->read_type)) {
+ ASSERT(fuse_iomap_check_type(outarg->read_type));
+ return -EIO;
+ }
+ if (!fuse_iomap_check_flags(outarg->read_flags)) {
+ ASSERT(fuse_iomap_check_flags(outarg->read_flags));
+ return -EIO;
+ }
+
+ if (!(opflags & FUSE_IOMAP_OP_REPORT)) {
+ /*
+ * XXX inline data reads and writes are not supported, how do
+ * we do this?
+ */
+ ASSERT(outarg->read_type != FUSE_IOMAP_TYPE_INLINE);
+ ASSERT(outarg->write_type != FUSE_IOMAP_TYPE_INLINE);
+
+ if (outarg->read_type == FUSE_IOMAP_TYPE_INLINE)
+ return -EIO;
+ if (outarg->write_type == FUSE_IOMAP_TYPE_INLINE)
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_begin_in inarg = {
+ .attr_ino = fi->orig_ino,
+ .opflags = opflags,
+ .pos = pos,
+ .count = count,
+ };
+ struct fuse_iomap_begin_out outarg = { };
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ FUSE_ARGS(args);
+ int err;
+
+ trace_fuse_iomap_begin(inode, pos, count, opflags);
+
+ args.opcode = FUSE_IOMAP_BEGIN;
+ args.nodeid = get_node_id(inode);
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ args.out_numargs = 1;
+ args.out_args[0].size = sizeof(outarg);
+ args.out_args[0].value = &outarg;
+ err = fuse_simple_request(fm, &args);
+ if (err) {
+ trace_fuse_iomap_begin_error(inode, pos, count, opflags, err);
+ return err;
+ }
+
+ trace_fuse_iomap_read_map(inode, &outarg);
+ trace_fuse_iomap_write_map(inode, &outarg);
+
+ err = fuse_iomap_validate(&outarg, opflags, pos);
+ if (err)
+ return err;
+
+ if ((opflags & IOMAP_WRITE) &&
+ outarg.write_type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+ /*
+ * For an out of place write, we must supply the write mapping
+ * via @iomap, and the read mapping via @srcmap.
+ */
+ iomap->addr = outarg.write_addr;
+ iomap->offset = outarg.offset;
+ iomap->length = outarg.length;
+ iomap->type = outarg.write_type;
+ iomap->flags = outarg.write_flags;
+ iomap->bdev = inode->i_sb->s_bdev;
+
+ srcmap->addr = outarg.read_addr;
+ srcmap->offset = outarg.offset;
+ srcmap->length = outarg.length;
+ srcmap->type = outarg.read_type;
+ srcmap->flags = outarg.read_flags;
+ srcmap->bdev = inode->i_sb->s_bdev;
+ } else {
+ /*
+ * For everything else (reads, reporting, and pure overwrites),
+ * we can return the sole mapping through @iomap and leave
+ * @srcmap unchanged from its default (HOLE).
+ */
+ iomap->addr = outarg.read_addr;
+ iomap->offset = outarg.offset;
+ iomap->length = outarg.length;
+ iomap->type = outarg.read_type;
+ iomap->flags = outarg.read_flags;
+ iomap->bdev = inode->i_sb->s_bdev;
+ }
+
+ return 0;
+}
+
+static bool fuse_want_iomap_end(const struct iomap *iomap, unsigned int opflags,
+ loff_t count, ssize_t written)
+{
+ /* Caller demanded an iomap_end call. */
+ if (iomap->flags & FUSE_IOMAP_F_WANT_IOMAP_END)
+ return true;
+
+ /* Reads and reporting should never affect the filesystem metadata */
+ if (!(opflags & (IOMAP_WRITE | IOMAP_ZERO)))
+ return false;
+
+ /* Appending writes get an iomap_end call */
+ if (iomap->flags & IOMAP_F_SIZE_CHANGED)
+ return true;
+
+ /* Short writes get an iomap_end call to clean up delalloc */
+ return written < count;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
+ ssize_t written, unsigned opflags,
+ struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_end_in inarg = {
+ .opflags = opflags,
+ .attr_ino = fi->orig_ino,
+ .pos = pos,
+ .count = count,
+ .written = written,
+
+ .map_addr = iomap->addr,
+ .map_length = iomap->length,
+ .map_type = iomap->type,
+ .map_flags = iomap->flags,
+ };
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ FUSE_ARGS(args);
+ int err;
+
+ if (!fuse_want_iomap_end(iomap, opflags, count, written))
+ return 0;
+
+ trace_fuse_iomap_end(inode, &inarg);
+
+ args.opcode = FUSE_IOMAP_END;
+ args.nodeid = get_node_id(inode);
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ err = fuse_simple_request(fm, &args);
+
+ trace_fuse_iomap_end_error(inode, &inarg, err);
+
+ return err;
+}
+
+const struct iomap_ops fuse_iomap_ops = {
+ .iomap_begin = fuse_iomap_begin,
+ .iomap_end = fuse_iomap_end,
+};
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index fd48e8d37f2edc..88730d26c9b5e2 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1438,6 +1438,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
if (flags & FUSE_REQUEST_TIMEOUT)
timeout = arg->request_timeout;
+
+ if ((flags & FUSE_IOMAP) && fuse_iomap_enabled())
+ fc->iomap = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1506,6 +1509,8 @@ void fuse_send_init(struct fuse_mount *fm)
*/
if (fuse_uring_enabled())
flags |= FUSE_OVER_IO_URING;
+ if (fuse_iomap_enabled())
+ flags |= FUSE_IOMAP;
ia->in.flags = flags;
ia->in.flags2 = flags >> 32;
* Re: [PATCH 03/11] fuse: implement the basic iomap mechanisms
2025-05-22 0:03 ` [PATCH 03/11] fuse: implement the basic iomap mechanisms Darrick J. Wong
@ 2025-05-29 22:15 ` Joanne Koong
2025-05-29 23:15 ` Joanne Koong
0 siblings, 1 reply; 23+ messages in thread
From: Joanne Koong @ 2025-05-29 22:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, miklos, linux-xfs, bernd, John
On Wed, May 21, 2025 at 5:03 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Implement functions to enable upcalling of iomap_begin and iomap_end to
> userspace fuse servers.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 38 ++++++
> fs/fuse/fuse_trace.h | 258 +++++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/fuse.h | 87 ++++++++++++++
> fs/fuse/Kconfig | 23 ++++
> fs/fuse/Makefile | 1
> fs/fuse/file_iomap.c | 280 +++++++++++++++++++++++++++++++++++++++++++++
> fs/fuse/inode.c | 5 +
> 7 files changed, 691 insertions(+), 1 deletion(-)
> create mode 100644 fs/fuse/file_iomap.c
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d56d4fd956db99..aa51f25856697d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -895,6 +895,9 @@ struct fuse_conn {
> /* Is link not implemented by fs? */
> unsigned int no_link:1;
>
> + /* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
> + unsigned int iomap:1;
> +
> /* Use io_uring for communication */
> unsigned int io_uring;
>
> @@ -1017,6 +1020,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> return sb->s_fs_info;
> }
>
> +static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
> +{
> + return sb->s_fs_info;
> +}
> +
Instead of adding this new helper (and the ones below), what about
modifying the existing (non-const) versions of these helpers to take
const pointer arguments, e.g.:
-static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
+static inline struct fuse_mount *get_fuse_mount_super(const struct super_block *sb)
{
return sb->s_fs_info;
}
Then, doing something like "const struct fuse_mount *mt =
get_fuse_mount(inode);" would enforce the same guarantees as "const
struct fuse_mount *mt = get_fuse_mount_c(inode);" and we wouldn't need
2 sets of helpers that pretty much do the same thing.
> static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
> {
> return get_fuse_mount_super(sb)->fc;
> @@ -1027,16 +1035,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
> return get_fuse_mount_super(inode->i_sb);
> }
>
> +static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
> +{
> + return get_fuse_mount_super_c(inode->i_sb);
> +}
> +
> static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
> {
> return get_fuse_mount_super(inode->i_sb)->fc;
> }
>
> +static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
> +{
> + return get_fuse_mount_super_c(inode->i_sb)->fc;
> +}
> +
> static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
> {
> return container_of(inode, struct fuse_inode, inode);
> }
>
> +static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
> +{
> + return container_of(inode, struct fuse_inode, inode);
> +}
> +
> static inline u64 get_node_id(struct inode *inode)
> {
> return get_fuse_inode(inode)->nodeid;
> @@ -1577,4 +1600,19 @@ extern void fuse_sysctl_unregister(void);
> #define fuse_sysctl_unregister() do { } while (0)
> #endif /* CONFIG_SYSCTL */
>
> +#if IS_ENABLED(CONFIG_FUSE_IOMAP)
> +# include <linux/fiemap.h>
> +# include <linux/iomap.h>
> +
> +bool fuse_iomap_enabled(void);
> +
> +static inline bool fuse_has_iomap(const struct inode *inode)
> +{
> + return get_fuse_conn_c(inode)->iomap;
> +}
> +#else
> +# define fuse_iomap_enabled(...) (false)
> +# define fuse_has_iomap(...) (false)
> +#endif
> +
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index ca215a3cba3e31..fc7c5bf1cef52d 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -64,6 +64,29 @@ config FUSE_PASSTHROUGH
>
> If you want to allow passthrough operations, answer Y.
>
> +config FUSE_IOMAP
> + bool "FUSE file IO over iomap"
> + default y
> + depends on FUSE_FS
> + select FS_IOMAP
> + help
> + For supported fuseblk servers, this allows the file IO path to run
> + through the kernel.
I have config FUSE_FS select FS_IOMAP in my patchset (not yet
submitted) that changes fuse buffered writes / writeback handling to
use iomap. Could we just have config FUSE_FS automatically opt into
FS_IOMAP here or do you see a reason that this needs to be a separate
config?
Thanks,
Joanne
> +
> +config FUSE_IOMAP_BY_DEFAULT
> + bool "FUSE file I/O over iomap by default"
> + default n
> + depends on FUSE_IOMAP
> + help
> + Enable sending FUSE file I/O over iomap by default.
> +
> +config FUSE_IOMAP_DEBUG
> + bool "Debug FUSE file IO over iomap"
> + default n
> + depends on FUSE_IOMAP
> + help
> + Enable debugging assertions for the fuse iomap code paths.
> +
> config FUSE_IO_URING
> bool "FUSE communication over io-uring"
> default y
* Re: [PATCH 03/11] fuse: implement the basic iomap mechanisms
2025-05-29 22:15 ` Joanne Koong
@ 2025-05-29 23:15 ` Joanne Koong
2025-06-03 0:13 ` Darrick J. Wong
0 siblings, 1 reply; 23+ messages in thread
From: Joanne Koong @ 2025-05-29 23:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, miklos, linux-xfs, bernd, John
On Thu, May 29, 2025 at 3:15 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, May 21, 2025 at 5:03 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Implement functions to enable upcalling of iomap_begin and iomap_end to
> > userspace fuse servers.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/fuse_i.h | 38 ++++++
> > fs/fuse/fuse_trace.h | 258 +++++++++++++++++++++++++++++++++++++++++
> > include/uapi/linux/fuse.h | 87 ++++++++++++++
> > fs/fuse/Kconfig | 23 ++++
> > fs/fuse/Makefile | 1
> > fs/fuse/file_iomap.c | 280 +++++++++++++++++++++++++++++++++++++++++++++
> > fs/fuse/inode.c | 5 +
> > 7 files changed, 691 insertions(+), 1 deletion(-)
> > create mode 100644 fs/fuse/file_iomap.c
> >
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index d56d4fd956db99..aa51f25856697d 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -895,6 +895,9 @@ struct fuse_conn {
> > /* Is link not implemented by fs? */
> > unsigned int no_link:1;
> >
> > + /* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
> > + unsigned int iomap:1;
> > +
> > /* Use io_uring for communication */
> > unsigned int io_uring;
> >
> > @@ -1017,6 +1020,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> > return sb->s_fs_info;
> > }
> >
> > +static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
> > +{
> > + return sb->s_fs_info;
> > +}
> > +
>
> Instead of adding this new helper (and the ones below), what about
> modifying the existing (non-const) versions of these helpers to take
> in const * input args, eg
>
> -static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> +static inline struct fuse_mount *get_fuse_mount_super(const struct
> super_block *sb)
> {
> return sb->s_fs_info;
> }
>
> Then, doing something like "const struct fuse_mount *mt =
> get_fuse_mount(inode);" would enforce the same guarantees as "const
> struct fuse_mount *mt = get_fuse_mount_c(inode);" and we wouldn't need
> 2 sets of helpers that pretty much do the same thing.
>
> > static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
> > {
> > return get_fuse_mount_super(sb)->fc;
> > @@ -1027,16 +1035,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
> > return get_fuse_mount_super(inode->i_sb);
> > }
> >
> > +static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
> > +{
> > + return get_fuse_mount_super_c(inode->i_sb);
> > +}
> > +
> > static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
> > {
> > return get_fuse_mount_super(inode->i_sb)->fc;
> > }
> >
> > +static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
> > +{
> > + return get_fuse_mount_super_c(inode->i_sb)->fc;
> > +}
> > +
> > static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
> > {
> > return container_of(inode, struct fuse_inode, inode);
> > }
> >
> > +static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
> > +{
> > + return container_of(inode, struct fuse_inode, inode);
> > +}
> > +
> > static inline u64 get_node_id(struct inode *inode)
> > {
> > return get_fuse_inode(inode)->nodeid;
> > @@ -1577,4 +1600,19 @@ extern void fuse_sysctl_unregister(void);
> > #define fuse_sysctl_unregister() do { } while (0)
> > #endif /* CONFIG_SYSCTL */
> >
> > +#if IS_ENABLED(CONFIG_FUSE_IOMAP)
> > +# include <linux/fiemap.h>
> > +# include <linux/iomap.h>
> > +
> > +bool fuse_iomap_enabled(void);
> > +
> > +static inline bool fuse_has_iomap(const struct inode *inode)
> > +{
> > + return get_fuse_conn_c(inode)->iomap;
> > +}
> > +#else
> > +# define fuse_iomap_enabled(...) (false)
> > +# define fuse_has_iomap(...) (false)
> > +#endif
> > +
> > #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > index ca215a3cba3e31..fc7c5bf1cef52d 100644
> > --- a/fs/fuse/Kconfig
> > +++ b/fs/fuse/Kconfig
> > @@ -64,6 +64,29 @@ config FUSE_PASSTHROUGH
> >
> > If you want to allow passthrough operations, answer Y.
> >
> > +config FUSE_IOMAP
> > + bool "FUSE file IO over iomap"
> > + default y
> > + depends on FUSE_FS
> > + select FS_IOMAP
> > + help
> > + For supported fuseblk servers, this allows the file IO path to run
> > + through the kernel.
>
> I have config FUSE_FS select FS_IOMAP in my patchset (not yet
> submitted) that changes fuse buffered writes / writeback handling to
> use iomap. Could we just have config FUSE_FS automatically opt into
> FS_IOMAP here or do you see a reason that this needs to be a separate
> config?
Thinking about it some more, the iomap stuff you're adding also
requires a "depends on BLOCK", so this will need to be a separate
config anyway, regardless of whether FUSE_FS ends up always doing
"select FS_IOMAP".
Thanks,
Joanne
>
>
> Thanks,
> Joanne
> > +
> > +config FUSE_IOMAP_BY_DEFAULT
> > + bool "FUSE file I/O over iomap by default"
> > + default n
> > + depends on FUSE_IOMAP
> > + help
> > + Enable sending FUSE file I/O over iomap by default.
> > +
> > +config FUSE_IOMAP_DEBUG
> > + bool "Debug FUSE file IO over iomap"
> > + default n
> > + depends on FUSE_IOMAP
> > + help
> > + Enable debugging assertions for the fuse iomap code paths.
> > +
> > config FUSE_IO_URING
> > bool "FUSE communication over io-uring"
> > default y
* Re: [PATCH 03/11] fuse: implement the basic iomap mechanisms
2025-05-29 23:15 ` Joanne Koong
@ 2025-06-03 0:13 ` Darrick J. Wong
0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-06-03 0:13 UTC (permalink / raw)
To: Joanne Koong; +Cc: linux-fsdevel, miklos, linux-xfs, bernd, John
On Thu, May 29, 2025 at 04:15:57PM -0700, Joanne Koong wrote:
> On Thu, May 29, 2025 at 3:15 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, May 21, 2025 at 5:03 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Implement functions to enable upcalling of iomap_begin and iomap_end to
> > > userspace fuse servers.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > > fs/fuse/fuse_i.h | 38 ++++++
> > > fs/fuse/fuse_trace.h | 258 +++++++++++++++++++++++++++++++++++++++++
> > > include/uapi/linux/fuse.h | 87 ++++++++++++++
> > > fs/fuse/Kconfig | 23 ++++
> > > fs/fuse/Makefile | 1
> > > fs/fuse/file_iomap.c | 280 +++++++++++++++++++++++++++++++++++++++++++++
> > > fs/fuse/inode.c | 5 +
> > > 7 files changed, 691 insertions(+), 1 deletion(-)
> > > create mode 100644 fs/fuse/file_iomap.c
> > >
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index d56d4fd956db99..aa51f25856697d 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -895,6 +895,9 @@ struct fuse_conn {
> > > /* Is link not implemented by fs? */
> > > unsigned int no_link:1;
> > >
> > > + /* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
> > > + unsigned int iomap:1;
> > > +
> > > /* Use io_uring for communication */
> > > unsigned int io_uring;
> > >
> > > @@ -1017,6 +1020,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> > > return sb->s_fs_info;
> > > }
> > >
> > > +static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
> > > +{
> > > + return sb->s_fs_info;
> > > +}
> > > +
> >
> > Instead of adding this new helper (and the ones below), what about
> > modifying the existing (non-const) versions of these helpers to take
> > in const * input args, eg
> >
> > -static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> > +static inline struct fuse_mount *get_fuse_mount_super(const struct
> > super_block *sb)
> > {
> > return sb->s_fs_info;
> > }
> >
> > Then, doing something like "const struct fuse_mount *mt =
> > get_fuse_mount(inode);" would enforce the same guarantees as "const
> > struct fuse_mount *mt = get_fuse_mount_c(inode);" and we wouldn't need
> > 2 sets of helpers that pretty much do the same thing.
> >
> > > static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
> > > {
> > > return get_fuse_mount_super(sb)->fc;
> > > @@ -1027,16 +1035,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
> > > return get_fuse_mount_super(inode->i_sb);
> > > }
> > >
> > > +static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
> > > +{
> > > + return get_fuse_mount_super_c(inode->i_sb);
> > > +}
> > > +
> > > static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
> > > {
> > > return get_fuse_mount_super(inode->i_sb)->fc;
> > > }
> > >
> > > +static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
> > > +{
> > > + return get_fuse_mount_super_c(inode->i_sb)->fc;
> > > +}
> > > +
> > > static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
> > > {
> > > return container_of(inode, struct fuse_inode, inode);
> > > }
> > >
> > > +static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
> > > +{
> > > + return container_of(inode, struct fuse_inode, inode);
> > > +}
> > > +
> > > static inline u64 get_node_id(struct inode *inode)
> > > {
> > > return get_fuse_inode(inode)->nodeid;
> > > @@ -1577,4 +1600,19 @@ extern void fuse_sysctl_unregister(void);
> > > #define fuse_sysctl_unregister() do { } while (0)
> > > #endif /* CONFIG_SYSCTL */
> > >
> > > +#if IS_ENABLED(CONFIG_FUSE_IOMAP)
> > > +# include <linux/fiemap.h>
> > > +# include <linux/iomap.h>
> > > +
> > > +bool fuse_iomap_enabled(void);
> > > +
> > > +static inline bool fuse_has_iomap(const struct inode *inode)
> > > +{
> > > + return get_fuse_conn_c(inode)->iomap;
> > > +}
> > > +#else
> > > +# define fuse_iomap_enabled(...) (false)
> > > +# define fuse_has_iomap(...) (false)
> > > +#endif
> > > +
> > > #endif /* _FS_FUSE_I_H */
> > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > > index ca215a3cba3e31..fc7c5bf1cef52d 100644
> > > --- a/fs/fuse/Kconfig
> > > +++ b/fs/fuse/Kconfig
> > > @@ -64,6 +64,29 @@ config FUSE_PASSTHROUGH
> > >
> > > If you want to allow passthrough operations, answer Y.
> > >
> > > +config FUSE_IOMAP
> > > + bool "FUSE file IO over iomap"
> > > + default y
> > > + depends on FUSE_FS
> > > + select FS_IOMAP
> > > + help
> > > + For supported fuseblk servers, this allows the file IO path to run
> > > + through the kernel.
> >
> > I have config FUSE_FS select FS_IOMAP in my patchset (not yet
> > submitted) that changes fuse buffered writes / writeback handling to
> > use iomap. Could we just have config FUSE_FS automatically opt into
> > FS_IOMAP here or do you see a reason that this needs to be a separate
> > config?
>
> Thinking about it some more, the iomap stuff you're adding also
> requires a "depends on BLOCK", so this will need to be a separate
> config anyways regardless of whether the FUSE_FS will always "select
> FS_IOMAP"
I'll add that, thanks. I forgot that FS_IOMAP no longer selects BLOCK
all the time. :)
--D
>
> Thanks,
> Joanne
>
> >
> >
> > Thanks,
> > Joanne
> > > +
> > > +config FUSE_IOMAP_BY_DEFAULT
> > > + bool "FUSE file I/O over iomap by default"
> > > + default n
> > > + depends on FUSE_IOMAP
> > > + help
> > > + Enable sending FUSE file I/O over iomap by default.
> > > +
> > > +config FUSE_IOMAP_DEBUG
> > > + bool "Debug FUSE file IO over iomap"
> > > + default n
> > > + depends on FUSE_IOMAP
> > > + help
> > > + Enable debugging assertions for the fuse iomap code paths.
> > > +
> > > config FUSE_IO_URING
> > > bool "FUSE communication over io-uring"
> > > default y
* [PATCH 04/11] fuse: add a notification to add new iomap devices
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2025-05-22 0:03 ` [PATCH 03/11] fuse: implement the basic iomap mechanisms Darrick J. Wong
@ 2025-05-22 0:03 ` Darrick J. Wong
2025-05-22 16:46 ` Amir Goldstein
2025-05-22 0:03 ` [PATCH 05/11] fuse: send FUSE_DESTROY to userspace when tearing down an iomap connection Darrick J. Wong
` (7 subsequent siblings)
11 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:03 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
Add a new notification so that fuse servers can add extra block devices
to use with iomap.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 19 +++++++
fs/fuse/fuse_trace.h | 36 ++++++++++++++
include/uapi/linux/fuse.h | 8 +++
fs/fuse/dev.c | 23 +++++++++
fs/fuse/file_iomap.c | 119 ++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/inode.c | 9 +++
6 files changed, 211 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index aa51f25856697d..4eb75ed90db300 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -619,6 +619,12 @@ struct fuse_sync_bucket {
struct rcu_head rcu;
};
+struct fuse_iomap {
+ /* array of file objects that reference block devices for iomap */
+ struct file **files;
+ unsigned int nr_files;
+};
+
/**
* A Fuse connection.
*
@@ -970,6 +976,10 @@ struct fuse_conn {
struct fuse_ring *ring;
#endif
+#ifdef CONFIG_FUSE_IOMAP
+ struct fuse_iomap iomap_conn;
+#endif
+
/** Only used if the connection opts into request timeouts */
struct {
/* Worker for checking if any requests have timed out */
@@ -1610,9 +1620,18 @@ static inline bool fuse_has_iomap(const struct inode *inode)
{
return get_fuse_conn_c(inode)->iomap;
}
+
+void fuse_iomap_init_reply(struct fuse_mount *fm);
+void fuse_iomap_conn_put(struct fuse_conn *fc);
+
+int fuse_iomap_add_device(struct fuse_conn *fc,
+ const struct fuse_iomap_add_device_out *outarg);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
+# define fuse_iomap_init_reply(...) ((void)0)
+# define fuse_iomap_conn_put(...) ((void)0)
+# define fuse_iomap_add_device(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index f9a316c9788e06..e1a2e491d2581a 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -380,6 +380,42 @@ TRACE_EVENT(fuse_iomap_end_error,
__entry->pos, __entry->count, __entry->written,
__entry->error)
);
+
+TRACE_EVENT(fuse_iomap_dev_class,
+ TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
+ const struct file *file),
+
+ TP_ARGS(fc, idx, file),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(unsigned int, idx)
+ __field(dev_t, bdev)
+ ),
+
+ TP_fast_assign(
+ struct inode *inode = file_inode(file);
+
+ __entry->connection = fc->dev;
+ __entry->idx = idx;
+ if (S_ISBLK(inode->i_mode)) {
+ __entry->bdev = inode->i_rdev;
+ } else
+ __entry->bdev = 0;
+ ),
+
+ TP_printk("connection %u idx %u dev %u:%u",
+ __entry->connection,
+ __entry->idx,
+ MAJOR(__entry->bdev), MINOR(__entry->bdev))
+);
+#define DEFINE_FUSE_IOMAP_DEV_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_dev_class, name, \
+ TP_PROTO(const struct fuse_conn *fc, unsigned int idx, \
+ const struct file *file), \
+ TP_ARGS(fc, idx, file))
+DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_add_dev);
+DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_remove_dev);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index ce6c9960f2418f..ea8992e980a015 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -236,6 +236,7 @@
* 7.44
* - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
* SEEK_{DATA,HOLE} support
+ * - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
*/
#ifndef _LINUX_FUSE_H
@@ -681,6 +682,7 @@ enum fuse_notify_code {
FUSE_NOTIFY_RETRIEVE = 5,
FUSE_NOTIFY_DELETE = 6,
FUSE_NOTIFY_RESEND = 7,
+ FUSE_NOTIFY_ADD_IOMAP_DEVICE = 8,
FUSE_NOTIFY_CODE_MAX,
};
@@ -1371,4 +1373,10 @@ struct fuse_iomap_end_in {
uint32_t map_dev; /* device cookie * */
};
+struct fuse_iomap_add_device_out {
+ int32_t fd; /* fd of the open device to add */
+ uint32_t reserved; /* must be zero */
+ uint32_t *map_dev; /* location to receive device cookie */
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6dcbaa218b7a16..9d7064ec170cf6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1824,6 +1824,26 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
return err;
}
+static int fuse_notify_add_iomap_device(struct fuse_conn *fc, unsigned int size,
+ struct fuse_copy_state *cs)
+{
+ struct fuse_iomap_add_device_out outarg;
+ int err = -EINVAL;
+
+ if (size != sizeof(outarg))
+ goto err;
+
+ err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+ if (err)
+ goto err;
+ fuse_copy_finish(cs);
+
+ return fuse_iomap_add_device(fc, &outarg);
+err:
+ fuse_copy_finish(cs);
+ return err;
+}
+
struct fuse_retrieve_args {
struct fuse_args_pages ap;
struct fuse_notify_retrieve_in inarg;
@@ -2049,6 +2069,9 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
case FUSE_NOTIFY_RESEND:
return fuse_notify_resend(fc);
+ case FUSE_NOTIFY_ADD_IOMAP_DEVICE:
+ return fuse_notify_add_iomap_device(fc, size, cs);
+
default:
fuse_copy_finish(cs);
return -EINVAL;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index dfa0c309803113..faefd29a273bf3 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -142,6 +142,26 @@ static inline int fuse_iomap_validate(const struct fuse_iomap_begin_out *outarg,
return 0;
}
+static inline struct block_device *fuse_iomap_bdev(struct fuse_mount *fm,
+ unsigned int idx)
+{
+ struct fuse_conn *fc = fm->fc;
+ struct file *file = NULL;
+
+ spin_lock(&fc->lock);
+ if (idx < fc->iomap_conn.nr_files)
+ file = fc->iomap_conn.files[idx];
+ spin_unlock(&fc->lock);
+
+ if (!file)
+ return NULL;
+
+ if (!S_ISBLK(file_inode(file)->i_mode))
+ return NULL;
+
+ return I_BDEV(file->f_mapping->host);
+}
+
static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
unsigned opflags, struct iomap *iomap,
struct iomap *srcmap)
@@ -155,6 +175,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
};
struct fuse_iomap_begin_out outarg = { };
struct fuse_mount *fm = get_fuse_mount(inode);
+ struct block_device *read_bdev;
FUSE_ARGS(args);
int err;
@@ -181,8 +202,18 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
if (err)
return err;
+ read_bdev = fuse_iomap_bdev(fm, outarg.read_dev);
+ if (!read_bdev)
+ return -ENODEV;
+
if ((opflags & IOMAP_WRITE) &&
outarg.write_type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+ struct block_device *write_bdev =
+ fuse_iomap_bdev(fm, outarg.write_dev);
+
+ if (!write_bdev)
+ return -ENODEV;
+
/*
* For an out of place write, we must supply the write mapping
* via @iomap, and the read mapping via @srcmap.
@@ -192,14 +223,14 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
iomap->length = outarg.length;
iomap->type = outarg.write_type;
iomap->flags = outarg.write_flags;
- iomap->bdev = inode->i_sb->s_bdev;
+ iomap->bdev = write_bdev;
srcmap->addr = outarg.read_addr;
srcmap->offset = outarg.offset;
srcmap->length = outarg.length;
srcmap->type = outarg.read_type;
srcmap->flags = outarg.read_flags;
- srcmap->bdev = inode->i_sb->s_bdev;
+ srcmap->bdev = read_bdev;
} else {
/*
* For everything else (reads, reporting, and pure overwrites),
@@ -211,7 +242,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
iomap->length = outarg.length;
iomap->type = outarg.read_type;
iomap->flags = outarg.read_flags;
- iomap->bdev = inode->i_sb->s_bdev;
+ iomap->bdev = read_bdev;
}
return 0;
@@ -278,3 +309,85 @@ const struct iomap_ops fuse_iomap_ops = {
.iomap_begin = fuse_iomap_begin,
.iomap_end = fuse_iomap_end,
};
+
+void fuse_iomap_conn_put(struct fuse_conn *fc)
+{
+ unsigned int i;
+
+ for (i = 0; i < fc->iomap_conn.nr_files; i++) {
+ struct file *file = fc->iomap_conn.files[i];
+
+ trace_fuse_iomap_remove_dev(fc, i, file);
+
+ fc->iomap_conn.files[i] = NULL;
+ fput(file);
+ }
+
+ kfree(fc->iomap_conn.files);
+ fc->iomap_conn.nr_files = 0;
+}
+
+/* Add a bdev to the fuse connection, returns the index or a negative errno */
+static int __fuse_iomap_add_device(struct fuse_conn *fc, struct file *file)
+{
+ struct file **new_files;
+ int ret;
+
+ if (fc->iomap_conn.nr_files >= PAGE_SIZE / sizeof(unsigned int))
+ return -EMFILE;
+
+ new_files = krealloc_array(fc->iomap_conn.files,
+ fc->iomap_conn.nr_files + 1,
+ sizeof(struct file *),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!new_files)
+ return -ENOMEM;
+
+ spin_lock(&fc->lock);
+ fc->iomap_conn.files = new_files;
+ fc->iomap_conn.files[fc->iomap_conn.nr_files] = get_file(file);
+ ret = fc->iomap_conn.nr_files++;
+ spin_unlock(&fc->lock);
+
+ trace_fuse_iomap_add_dev(fc, ret, file);
+
+ return ret;
+}
+
+void fuse_iomap_init_reply(struct fuse_mount *fm)
+{
+ struct fuse_conn *fc = fm->fc;
+ struct super_block *sb = fm->sb;
+
+ if (sb->s_bdev)
+ __fuse_iomap_add_device(fc, sb->s_bdev_file);
+}
+
+int fuse_iomap_add_device(struct fuse_conn *fc,
+ const struct fuse_iomap_add_device_out *outarg)
+{
+ struct file *file;
+ int ret;
+
+ if (!fc->iomap)
+ return -EINVAL;
+
+ if (outarg->reserved)
+ return -EINVAL;
+
+ CLASS(fd, somefd)(outarg->fd);
+ if (fd_empty(somefd))
+ return -EBADF;
+ file = fd_file(somefd);
+
+ if (!S_ISBLK(file_inode(file)->i_mode))
+ return -ENODEV;
+
+ down_read(&fc->killsb);
+ ret = __fuse_iomap_add_device(fc, file);
+ up_read(&fc->killsb);
+ if (ret < 0)
+ return ret;
+
+ return put_user(ret, outarg->map_dev);
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 88730d26c9b5e2..84b7cd5ffe843b 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1010,6 +1010,8 @@ void fuse_conn_put(struct fuse_conn *fc)
struct fuse_iqueue *fiq = &fc->iq;
struct fuse_sync_bucket *bucket;
+ if (fc->iomap)
+ fuse_iomap_conn_put(fc);
if (IS_ENABLED(CONFIG_FUSE_DAX))
fuse_dax_conn_free(fc);
if (fc->timeout.req_timeout)
@@ -1449,6 +1451,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
init_server_timeout(fc, timeout);
+ if (fc->iomap)
+ fuse_iomap_init_reply(fm);
+
fm->sb->s_bdi->ra_pages =
min(fm->sb->s_bdi->ra_pages, ra_pages);
fc->minor = arg->minor;
@@ -1886,6 +1891,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
err_free_dax:
if (IS_ENABLED(CONFIG_FUSE_DAX))
fuse_dax_conn_free(fc);
+ /*
+ * No need to call fuse_iomap_conn_put here because we don't add
+ * devices until the init reply.
+ */
err:
return err;
}
* Re: [PATCH 04/11] fuse: add a notification to add new iomap devices
2025-05-22 0:03 ` [PATCH 04/11] fuse: add a notification to add new iomap devices Darrick J. Wong
@ 2025-05-22 16:46 ` Amir Goldstein
2025-05-22 17:11 ` Darrick J. Wong
0 siblings, 1 reply; 23+ messages in thread
From: Amir Goldstein @ 2025-05-22 16:46 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
On Thu, May 22, 2025 at 2:03 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Add a new notification so that fuse servers can add extra block devices
> to use with iomap.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 19 +++++++
> fs/fuse/fuse_trace.h | 36 ++++++++++++++
> include/uapi/linux/fuse.h | 8 +++
> fs/fuse/dev.c | 23 +++++++++
> fs/fuse/file_iomap.c | 119 ++++++++++++++++++++++++++++++++++++++++++++-
> fs/fuse/inode.c | 9 +++
> 6 files changed, 211 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index aa51f25856697d..4eb75ed90db300 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -619,6 +619,12 @@ struct fuse_sync_bucket {
> struct rcu_head rcu;
> };
>
> +struct fuse_iomap {
> + /* array of file objects that reference block devices for iomap */
> + struct file **files;
> + unsigned int nr_files;
> +};
> +
> /**
> * A Fuse connection.
> *
> @@ -970,6 +976,10 @@ struct fuse_conn {
> struct fuse_ring *ring;
> #endif
>
> +#ifdef CONFIG_FUSE_IOMAP
> + struct fuse_iomap iomap_conn;
> +#endif
> +
> /** Only used if the connection opts into request timeouts */
> struct {
> /* Worker for checking if any requests have timed out */
> @@ -1610,9 +1620,18 @@ static inline bool fuse_has_iomap(const struct inode *inode)
> {
> return get_fuse_conn_c(inode)->iomap;
> }
> +
> +void fuse_iomap_init_reply(struct fuse_mount *fm);
> +void fuse_iomap_conn_put(struct fuse_conn *fc);
> +
> +int fuse_iomap_add_device(struct fuse_conn *fc,
> + const struct fuse_iomap_add_device_out *outarg);
> #else
> # define fuse_iomap_enabled(...) (false)
> # define fuse_has_iomap(...) (false)
> +# define fuse_iomap_init_reply(...) ((void)0)
> +# define fuse_iomap_conn_put(...) ((void)0)
> +# define fuse_iomap_add_device(...) (-ENOSYS)
> #endif
>
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
> index f9a316c9788e06..e1a2e491d2581a 100644
> --- a/fs/fuse/fuse_trace.h
> +++ b/fs/fuse/fuse_trace.h
> @@ -380,6 +380,42 @@ TRACE_EVENT(fuse_iomap_end_error,
> __entry->pos, __entry->count, __entry->written,
> __entry->error)
> );
> +
> +TRACE_EVENT(fuse_iomap_dev_class,
> + TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
> + const struct file *file),
> +
> + TP_ARGS(fc, idx, file),
> +
> + TP_STRUCT__entry(
> + __field(dev_t, connection)
> + __field(unsigned int, idx)
> + __field(dev_t, bdev)
> + ),
> +
> + TP_fast_assign(
> + struct inode *inode = file_inode(file);
> +
> + __entry->connection = fc->dev;
> + __entry->idx = idx;
> + if (S_ISBLK(inode->i_mode)) {
> + __entry->bdev = inode->i_rdev;
> + } else
> + __entry->bdev = 0;
> + ),
> +
> + TP_printk("connection %u idx %u dev %u:%u",
> + __entry->connection,
> + __entry->idx,
> + MAJOR(__entry->bdev), MINOR(__entry->bdev))
> +);
> +#define DEFINE_FUSE_IOMAP_DEV_EVENT(name) \
> +DEFINE_EVENT(fuse_iomap_dev_class, name, \
> + TP_PROTO(const struct fuse_conn *fc, unsigned int idx, \
> + const struct file *file), \
> + TP_ARGS(fc, idx, file))
> +DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_add_dev);
> +DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_remove_dev);
> #endif /* CONFIG_FUSE_IOMAP */
>
> #endif /* _TRACE_FUSE_H */
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index ce6c9960f2418f..ea8992e980a015 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -236,6 +236,7 @@
> * 7.44
> * - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
> * SEEK_{DATA,HOLE} support
> + * - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -681,6 +682,7 @@ enum fuse_notify_code {
> FUSE_NOTIFY_RETRIEVE = 5,
> FUSE_NOTIFY_DELETE = 6,
> FUSE_NOTIFY_RESEND = 7,
> + FUSE_NOTIFY_ADD_IOMAP_DEVICE = 8,
> FUSE_NOTIFY_CODE_MAX,
> };
>
> @@ -1371,4 +1373,10 @@ struct fuse_iomap_end_in {
> uint32_t map_dev; /* device cookie * */
> };
>
> +struct fuse_iomap_add_device_out {
> + int32_t fd; /* fd of the open device to add */
> + uint32_t reserved; /* must be zero */
> + uint32_t *map_dev; /* location to receive device cookie */
> +};
> +
> #endif /* _LINUX_FUSE_H */
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 6dcbaa218b7a16..9d7064ec170cf6 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1824,6 +1824,26 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
> return err;
> }
>
> +static int fuse_notify_add_iomap_device(struct fuse_conn *fc, unsigned int size,
> + struct fuse_copy_state *cs)
> +{
> + struct fuse_iomap_add_device_out outarg;
> + int err = -EINVAL;
> +
> + if (size != sizeof(outarg))
> + goto err;
> +
> + err = fuse_copy_one(cs, &outarg, sizeof(outarg));
> + if (err)
> + goto err;
> + fuse_copy_finish(cs);
> +
> + return fuse_iomap_add_device(fc, &outarg);
> +err:
> + fuse_copy_finish(cs);
> + return err;
> +}
> +
> struct fuse_retrieve_args {
> struct fuse_args_pages ap;
> struct fuse_notify_retrieve_in inarg;
> @@ -2049,6 +2069,9 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
> case FUSE_NOTIFY_RESEND:
> return fuse_notify_resend(fc);
>
> + case FUSE_NOTIFY_ADD_IOMAP_DEVICE:
> + return fuse_notify_add_iomap_device(fc, size, cs);
> +
> default:
> fuse_copy_finish(cs);
> return -EINVAL;
> diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> index dfa0c309803113..faefd29a273bf3 100644
> --- a/fs/fuse/file_iomap.c
> +++ b/fs/fuse/file_iomap.c
> @@ -142,6 +142,26 @@ static inline int fuse_iomap_validate(const struct fuse_iomap_begin_out *outarg,
> return 0;
> }
>
> +static inline struct block_device *fuse_iomap_bdev(struct fuse_mount *fm,
> + unsigned int idx)
> +{
> + struct fuse_conn *fc = fm->fc;
> + struct file *file = NULL;
> +
> + spin_lock(&fc->lock);
> + if (idx < fc->iomap_conn.nr_files)
> + file = fc->iomap_conn.files[idx];
> + spin_unlock(&fc->lock);
> +
> + if (!file)
> + return NULL;
> +
> + if (!S_ISBLK(file_inode(file)->i_mode))
> + return NULL;
> +
> + return I_BDEV(file->f_mapping->host);
> +}
> +
> static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> unsigned opflags, struct iomap *iomap,
> struct iomap *srcmap)
> @@ -155,6 +175,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> };
> struct fuse_iomap_begin_out outarg = { };
> struct fuse_mount *fm = get_fuse_mount(inode);
> + struct block_device *read_bdev;
> FUSE_ARGS(args);
> int err;
>
> @@ -181,8 +202,18 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> if (err)
> return err;
>
> + read_bdev = fuse_iomap_bdev(fm, outarg.read_dev);
> + if (!read_bdev)
> + return -ENODEV;
> +
> if ((opflags & IOMAP_WRITE) &&
> outarg.write_type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
> + struct block_device *write_bdev =
> + fuse_iomap_bdev(fm, outarg.write_dev);
> +
> + if (!write_bdev)
> + return -ENODEV;
> +
> /*
> * For an out of place write, we must supply the write mapping
> * via @iomap, and the read mapping via @srcmap.
> @@ -192,14 +223,14 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> iomap->length = outarg.length;
> iomap->type = outarg.write_type;
> iomap->flags = outarg.write_flags;
> - iomap->bdev = inode->i_sb->s_bdev;
> + iomap->bdev = write_bdev;
>
> srcmap->addr = outarg.read_addr;
> srcmap->offset = outarg.offset;
> srcmap->length = outarg.length;
> srcmap->type = outarg.read_type;
> srcmap->flags = outarg.read_flags;
> - srcmap->bdev = inode->i_sb->s_bdev;
> + srcmap->bdev = read_bdev;
> } else {
> /*
> * For everything else (reads, reporting, and pure overwrites),
> @@ -211,7 +242,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
> iomap->length = outarg.length;
> iomap->type = outarg.read_type;
> iomap->flags = outarg.read_flags;
> - iomap->bdev = inode->i_sb->s_bdev;
> + iomap->bdev = read_bdev;
> }
>
> return 0;
> @@ -278,3 +309,85 @@ const struct iomap_ops fuse_iomap_ops = {
> .iomap_begin = fuse_iomap_begin,
> .iomap_end = fuse_iomap_end,
> };
> +
> +void fuse_iomap_conn_put(struct fuse_conn *fc)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < fc->iomap_conn.nr_files; i++) {
> + struct file *file = fc->iomap_conn.files[i];
> +
> + trace_fuse_iomap_remove_dev(fc, i, file);
> +
> + fc->iomap_conn.files[i] = NULL;
> + fput(file);
> + }
> +
> + kfree(fc->iomap_conn.files);
> + fc->iomap_conn.nr_files = 0;
> +}
> +
> +/* Add a bdev to the fuse connection, returns the index or a negative errno */
> +static int __fuse_iomap_add_device(struct fuse_conn *fc, struct file *file)
> +{
> + struct file **new_files;
> + int ret;
> +
> + if (fc->iomap_conn.nr_files >= PAGE_SIZE / sizeof(unsigned int))
> + return -EMFILE;
> +
> + new_files = krealloc_array(fc->iomap_conn.files,
> + fc->iomap_conn.nr_files + 1,
> + sizeof(struct file *),
> + GFP_KERNEL | __GFP_ZERO);
> + if (!new_files)
> + return -ENOMEM;
> +
> + spin_lock(&fc->lock);
> + fc->iomap_conn.files = new_files;
> + fc->iomap_conn.files[fc->iomap_conn.nr_files] = get_file(file);
> + ret = fc->iomap_conn.nr_files++;
> + spin_unlock(&fc->lock);
> +
> + trace_fuse_iomap_add_dev(fc, ret, file);
> +
> + return ret;
> +}
> +
> +void fuse_iomap_init_reply(struct fuse_mount *fm)
> +{
> + struct fuse_conn *fc = fm->fc;
> + struct super_block *sb = fm->sb;
> +
> + if (sb->s_bdev)
> + __fuse_iomap_add_device(fc, sb->s_bdev_file);
> +}
> +
> +int fuse_iomap_add_device(struct fuse_conn *fc,
> + const struct fuse_iomap_add_device_out *outarg)
> +{
> + struct file *file;
> + int ret;
> +
> + if (!fc->iomap)
> + return -EINVAL;
> +
> + if (outarg->reserved)
> + return -EINVAL;
> +
> + CLASS(fd, somefd)(outarg->fd);
> + if (fd_empty(somefd))
> + return -EBADF;
> + file = fd_file(somefd);
> +
> + if (!S_ISBLK(file_inode(file)->i_mode))
> + return -ENODEV;
> +
> + down_read(&fc->killsb);
> + ret = __fuse_iomap_add_device(fc, file);
> + up_read(&fc->killsb);
> + if (ret < 0)
> + return ret;
> +
> + return put_user(ret, outarg->map_dev);
> +}
This very much reminds me of FUSE_DEV_IOC_BACKING_OPEN,
which gives the kernel an fd to remember for later file operations.

FUSE_DEV_IOC_BACKING_OPEN was implemented as an ioctl
because of security concerns about passing an fd to the kernel via write().

Speaking of security concerns, we need to consider whether this requires some
privileges to allow setting up direct access to a blockdev.

But also, apart from the fact that those are block device fds,
how does iomap_conn.files[] differ from fc->backing_files_map?

Miklos had envisioned this (backing blockdev) use case as one of the
special cases of fuse passthrough.

Instead of an identity mapping to a backing file created at open time,
it's an extent mapping to a backing blockdev created at data access time.

I am not saying that you need to reuse anything from the fuse passthrough
code, because the use cases probably do not overlap, but hopefully
you can avoid falling into the same pits that we have already managed to avoid.
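
For reference, this is roughly what the fd registration mentioned above
looks like from the server side. It is only a sketch under stated
assumptions: the struct layout and ioctl number are local mirrors of what
I believe include/uapi/linux/fuse.h defines for FUSE_DEV_IOC_BACKING_OPEN,
and register_backing_fd() is a made-up helper name. Real code would
include <linux/fuse.h> instead of redefining anything.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Assumed mirror of struct fuse_backing_map from the uapi header. */
struct backing_map_sketch {
	int32_t  fd;      /* fd the kernel should take a reference to */
	uint32_t flags;   /* expected to be zero */
	uint64_t padding; /* expected to be zero */
};

static_assert(sizeof(struct backing_map_sketch) == 16,
	      "layout must not depend on pointer size");

/* Assumed mirror of FUSE_DEV_IOC_MAGIC and FUSE_DEV_IOC_BACKING_OPEN. */
#define SKETCH_FUSE_DEV_IOC_MAGIC	229
#define SKETCH_BACKING_OPEN \
	_IOW(SKETCH_FUSE_DEV_IOC_MAGIC, 1, struct backing_map_sketch)

/*
 * Hypothetical helper: ask the kernel to remember backing_fd for the
 * connection behind fuse_dev_fd.  Returns the id handed back by the
 * ioctl on success, or a negative errno on failure.
 */
int register_backing_fd(int fuse_dev_fd, int backing_fd)
{
	struct backing_map_sketch map;
	int id;

	memset(&map, 0, sizeof(map));
	map.fd = backing_fd;

	id = ioctl(fuse_dev_fd, SKETCH_BACKING_OPEN, &map);
	return id < 0 ? -errno : id;
}
```

Note that the argument struct carries only fixed-width integers and the id
comes back as the ioctl return value, so nothing in it depends on the
caller's pointer size.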
Thanks,
Amir.
* Re: [PATCH 04/11] fuse: add a notification to add new iomap devices
2025-05-22 16:46 ` Amir Goldstein
@ 2025-05-22 17:11 ` Darrick J. Wong
0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 17:11 UTC (permalink / raw)
To: Amir Goldstein
Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
On Thu, May 22, 2025 at 06:46:14PM +0200, Amir Goldstein wrote:
> On Thu, May 22, 2025 at 2:03 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Add a new notification so that fuse servers can add extra block devices
> > to use with iomap.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
<snip>
> > +int fuse_iomap_add_device(struct fuse_conn *fc,
> > + const struct fuse_iomap_add_device_out *outarg)
> > +{
> > + struct file *file;
> > + int ret;
> > +
> > + if (!fc->iomap)
> > + return -EINVAL;
> > +
> > + if (outarg->reserved)
> > + return -EINVAL;
> > +
> > + CLASS(fd, somefd)(outarg->fd);
> > + if (fd_empty(somefd))
> > + return -EBADF;
> > + file = fd_file(somefd);
> > +
> > + if (!S_ISBLK(file_inode(file)->i_mode))
> > + return -ENODEV;
> > +
> > + down_read(&fc->killsb);
> > + ret = __fuse_iomap_add_device(fc, file);
> > + up_read(&fc->killsb);
> > + if (ret < 0)
> > + return ret;
> > +
> > + return put_user(ret, outarg->map_dev);
> > +}
>
> This very much reminds me of FUSE_DEV_IOC_BACKING_OPEN,
> which gives the kernel an fd to remember for later file operations.
>
> FUSE_DEV_IOC_BACKING_OPEN was implemented as an ioctl
> because of security concerns about passing an fd to the kernel via write().
>
> Speaking of security concerns, we need to consider whether this requires some
> privileges to allow setting up direct access to a blockdev.
Yeah, I was assuming that if the fuse server can open the bdev, then
that's enough. But I suppose I at least need to check that it's opened
in write mode too.
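
Roughly the checks I have in mind, sketched here as a userspace analog so
that it can be compiled and poked at. The kernel side would look at
file_inode(file)->i_mode and file->f_mode rather than calling fstat() and
fcntl(). The helper name check_iomap_dev_fd() and the choice of -EACCES
for a read-only fd are assumptions made for illustration, not something
this patch defines.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: return 0 if fd refers to a block device that was
 * opened for writing, or a negative errno describing why it was rejected.
 */
int check_iomap_dev_fd(int fd)
{
	struct stat st;
	int flags, acc;

	if (fstat(fd, &st) < 0)
		return -errno;

	/* Same test the patch applies: only block devices qualify. */
	if (!S_ISBLK(st.st_mode))
		return -ENODEV;

	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -errno;

	/* Extra test: refuse fds that cannot be written through. */
	acc = flags & O_ACCMODE;
	if (acc != O_WRONLY && acc != O_RDWR)
		return -EACCES;

	return 0;
}
```

Passing it an fd for /dev/null opened O_RDWR fails with -ENODEV, because
that is a character device, and that is the failure mode the S_ISBLK test
exists to catch.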
> But also, apart from the fact that those are block device fds,
> how does iomap_conn.files[] differ from fc->backing_files_map?
Oh, so that's what that does! Yes, I'd rather pile on to that than
introduce more ABI. :)
> Miklos had envisioned this (backing blockdev) use case as one of the
> special cases of fuse passthrough.
>
> Instead of an identity mapping to a backing file created at open time,
> it's an extent mapping to a backing blockdev created at data access time.
>
> I am not saying that you need to reuse anything from the fuse passthrough
> code, because the use cases probably do not overlap, but hopefully
> you can avoid falling into the same pits that we have already managed to avoid.
<nod> The one downside is that fsiomap requires the file to point at
either a block device or (in theory) a dax device, so we'd have to check
that on every access. But aside from that I think I could reuse this
piece. Thanks for bringing that to my attention! :)
--D
> Thanks,
> Amir.
* [PATCH 05/11] fuse: send FUSE_DESTROY to userspace when tearing down an iomap connection
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (3 preceding siblings ...)
2025-05-22 0:03 ` [PATCH 04/11] fuse: add a notification to add new iomap devices Darrick J. Wong
@ 2025-05-22 0:03 ` Darrick J. Wong
2025-05-22 0:04 ` [PATCH 06/11] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
` (6 subsequent siblings)
11 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:03 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
When we're destroying a fuse connection, send a FUSE_DESTROY command to
userspace so that it has time to react (closing block devices, reporting
latent errors, etc) before the mount actually goes away.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 84b7cd5ffe843b..224fb9e7610cc5 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -2056,7 +2056,7 @@ void fuse_conn_destroy(struct fuse_mount *fm)
{
struct fuse_conn *fc = fm->fc;
- if (fc->destroy)
+ if (fc->destroy || fc->iomap)
fuse_send_destroy(fm);
fuse_abort_conn(fc);
* [PATCH 06/11] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (4 preceding siblings ...)
2025-05-22 0:03 ` [PATCH 05/11] fuse: send FUSE_DESTROY to userspace when tearing down an iomap connection Darrick J. Wong
@ 2025-05-22 0:04 ` Darrick J. Wong
2025-05-22 0:04 ` [PATCH 07/11] fuse: implement direct IO with iomap Darrick J. Wong
` (5 subsequent siblings)
11 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:04 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
Implement the basic file mapping reporting functions like FIEMAP, BMAP,
and SEEK_DATA/HOLE.
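
From userspace, the SEEK_DATA/SEEK_HOLE part of this is exercised with
plain lseek(). A minimal sketch follows; next_data_extent() is an
illustrative helper and not part of this patch, and the fallback values
for SEEK_DATA and SEEK_HOLE are the Linux ones, used only if the libc
headers do not provide them.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA	3
#define SEEK_HOLE	4
#endif

/*
 * Hypothetical helper: find the first run of data at or after 'start'.
 * On success, returns 0 and stores the start of the data in *data_off and
 * the start of the hole that follows it in *hole_off.  Returns -ENXIO if
 * there is no data at or after 'start', or another negative errno.
 */
int next_data_extent(int fd, off_t start, off_t *data_off, off_t *hole_off)
{
	off_t d, h;

	d = lseek(fd, start, SEEK_DATA);
	if (d < 0)
		return -errno;

	/* Every file has an implicit hole at EOF, so this finds one. */
	h = lseek(fd, d, SEEK_HOLE);
	if (h < 0)
		return -errno;

	*data_off = d;
	*hole_off = h;
	return 0;
}
```

Calling this in a loop, restarting each time from the returned hole
offset, walks all the data extents of a file until it returns -ENXIO.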
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 8 ++++++
fs/fuse/fuse_trace.h | 57 +++++++++++++++++++++++++++++++++++++++++
fs/fuse/dir.c | 1 +
fs/fuse/file.c | 13 +++++++++
fs/fuse/file_iomap.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 149 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4eb75ed90db300..a39e45eeec2e3e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1626,12 +1626,20 @@ void fuse_iomap_conn_put(struct fuse_conn *fc);
int fuse_iomap_add_device(struct fuse_conn *fc,
const struct fuse_iomap_add_device_out *outarg);
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 length);
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
# define fuse_iomap_init_reply(...) ((void)0)
# define fuse_iomap_conn_put(...) ((void)0)
# define fuse_iomap_add_device(...) (-ENOSYS)
+# define fuse_iomap_fiemap NULL
+# define fuse_iomap_lseek(...) (-ENOSYS)
+# define fuse_iomap_bmap(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index e1a2e491d2581a..252eab698287bd 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -416,6 +416,63 @@ DEFINE_EVENT(fuse_iomap_dev_class, name, \
TP_ARGS(fc, idx, file))
DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_add_dev);
DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_remove_dev);
+
+TRACE_EVENT(fuse_iomap_fiemap,
+ TP_PROTO(const struct inode *inode, u64 start, u64 count,
+ unsigned int flags),
+
+ TP_ARGS(inode, start, count, flags),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(u64, start)
+ __field(u64, count)
+ __field(unsigned int, flags)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->start = start;
+ __entry->count = count;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("connection %u ino %llu flags 0x%x start 0x%llx count 0x%llx",
+ __entry->connection, __entry->ino, __entry->flags,
+ __entry->start, __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_lseek,
+ TP_PROTO(const struct inode *inode, loff_t offset, int whence),
+
+ TP_ARGS(inode, offset, whence),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, offset)
+ __field(int, whence)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->offset = offset;
+ __entry->whence = whence;
+ ),
+
+ TP_printk("connection %u ino %llu offset 0x%llx whence %d",
+ __entry->connection, __entry->ino, __entry->offset,
+ __entry->whence)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 83ac192e7fdd19..be75a515c4f8b6 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2230,6 +2230,7 @@ static const struct inode_operations fuse_common_inode_operations = {
.set_acl = fuse_set_acl,
.fileattr_get = fuse_fileattr_get,
.fileattr_set = fuse_fileattr_set,
+ .fiemap = fuse_iomap_fiemap,
};
static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ada1ed9e653e42..6b54b9a8f8a84d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2844,6 +2844,12 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
struct fuse_bmap_out outarg;
int err;
+ if (fuse_has_iomap(inode)) {
+ sector_t alt_sec = fuse_iomap_bmap(mapping, block);
+ if (alt_sec > 0)
+ return alt_sec;
+ }
+
if (!inode->i_sb->s_bdev || fm->fc->no_bmap)
return 0;
@@ -2879,6 +2885,13 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
struct fuse_lseek_out outarg;
int err;
+ if (fuse_has_iomap(inode)) {
+ loff_t alt_pos = fuse_iomap_lseek(file, offset, whence);
+
+ if (alt_pos != -ENOSYS)
+ return alt_pos;
+ }
+
if (fm->fc->no_lseek)
goto fallback;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index faefd29a273bf3..f943cb3334a787 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -391,3 +391,73 @@ int fuse_iomap_add_device(struct fuse_conn *fc,
return put_user(ret, outarg->map_dev);
}
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 count)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ int error;
+
+ /*
+ * We are called directly from the vfs so we need to check per-inode
+ * support here explicitly.
+ */
+ if (!fuse_has_iomap(inode))
+ return -EOPNOTSUPP;
+
+ if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR)
+ return -EOPNOTSUPP;
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ if (!fuse_allow_current_process(fc))
+ return -EACCES;
+
+ trace_fuse_iomap_fiemap(inode, start, count, fieinfo->fi_flags);
+
+ inode_lock_shared(inode);
+ error = iomap_fiemap(inode, fieinfo, start, count,
+ &fuse_iomap_ops);
+ inode_unlock_shared(inode);
+
+ return error;
+}
+
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block)
+{
+ ASSERT(fuse_has_iomap(mapping->host));
+
+ return iomap_bmap(mapping, block, &fuse_iomap_ops);
+}
+
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ if (!fuse_allow_current_process(fc))
+ return -EACCES;
+
+ trace_fuse_iomap_lseek(inode, offset, whence);
+
+ switch (whence) {
+ case SEEK_HOLE:
+ offset = iomap_seek_hole(inode, offset, &fuse_iomap_ops);
+ break;
+ case SEEK_DATA:
+ offset = iomap_seek_data(inode, offset, &fuse_iomap_ops);
+ break;
+ default:
+ return -ENOSYS;
+ }
+
+ if (offset < 0)
+ return offset;
+ return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
+}
* [PATCH 07/11] fuse: implement direct IO with iomap
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (5 preceding siblings ...)
2025-05-22 0:04 ` [PATCH 06/11] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
@ 2025-05-22 0:04 ` Darrick J. Wong
2025-05-22 0:04 ` [PATCH 08/11] fuse: implement buffered " Darrick J. Wong
` (4 subsequent siblings)
11 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:04 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
Implement direct IO with iomap if it's available.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 24 ++++
fs/fuse/fuse_trace.h | 186 +++++++++++++++++++++++++++++++++
include/uapi/linux/fuse.h | 27 +++++
fs/fuse/dir.c | 7 +
fs/fuse/file.c | 16 +++
fs/fuse/file_iomap.c | 256 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 4 +
7 files changed, 519 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a39e45eeec2e3e..51a373bc7b03d9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -904,6 +904,9 @@ struct fuse_conn {
/* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
unsigned int iomap:1;
+ /* Use fs/iomap for direct I/O operations */
+ unsigned int iomap_directio:1;
+
/* Use io_uring for communication */
unsigned int io_uring;
@@ -1631,6 +1634,22 @@ int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 length);
loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
+
+void fuse_iomap_open(struct inode *inode, struct file *file);
+
+static inline bool fuse_has_iomap_direct_io(const struct inode *inode)
+{
+ return get_fuse_conn_c(inode)->iomap_directio;
+}
+
+static inline bool fuse_want_iomap_direct_io(const struct kiocb *iocb)
+{
+ return (iocb->ki_flags & IOCB_DIRECT) &&
+ fuse_has_iomap_direct_io(file_inode(iocb->ki_filp));
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1640,6 +1659,11 @@ sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
# define fuse_iomap_fiemap NULL
# define fuse_iomap_lseek(...) (-ENOSYS)
# define fuse_iomap_bmap(...) (-ENOSYS)
+# define fuse_iomap_open(...) ((void)0)
+# define fuse_has_iomap_direct_io(...) (false)
+# define fuse_want_iomap_direct_io(...) (false)
+# define fuse_iomap_direct_read(...) (-ENOSYS)
+# define fuse_iomap_direct_write(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 252eab698287bd..da7c317b664a10 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -60,6 +60,7 @@
EM( FUSE_STATX, "FUSE_STATX") \
EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \
EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \
+ EM( FUSE_IOMAP_IOEND, "FUSE_IOMAP_IOEND") \
EMe(CUSE_INIT, "CUSE_INIT")
/*
@@ -161,6 +162,17 @@ TRACE_EVENT(fuse_request_end,
{ FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \
{ FUSE_IOMAP_TYPE_INLINE, "inline" }
+#define FUSE_IOMAP_IOEND_STRINGS \
+ { FUSE_IOMAP_IOEND_SHARED, "shared" }, \
+ { FUSE_IOMAP_IOEND_UNWRITTEN, "unwritten" }, \
+ { FUSE_IOMAP_IOEND_BOUNDARY, "boundary" }, \
+ { FUSE_IOMAP_IOEND_DIRECT, "direct" }, \
+ { FUSE_IOMAP_IOEND_APPEND, "append" }
+
+#define IOMAP_DIOEND_STRINGS \
+ { IOMAP_DIO_UNWRITTEN, "unwritten" }, \
+ { IOMAP_DIO_COW, "cow" }
+
TRACE_EVENT(fuse_iomap_begin,
TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
unsigned opflags),
@@ -381,6 +393,79 @@ TRACE_EVENT(fuse_iomap_end_error,
__entry->error)
);
+TRACE_EVENT(fuse_iomap_ioend,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_ioend_in *inarg),
+
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(unsigned, ioendflags)
+ __field(int, error)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(uint64_t, new_addr)
+ __field(size_t, written)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->ioendflags = inarg->ioendflags;
+ __entry->error = inarg->error;
+ __entry->pos = inarg->pos;
+ __entry->new_addr = inarg->new_addr;
+ __entry->written = inarg->written;
+ ),
+
+ TP_printk("connection %u ino %llu ioendflags (%s) pos 0x%llx written %zd error %d new_addr 0x%llx",
+ __entry->connection, __entry->ino,
+ __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+ __entry->pos, __entry->written, __entry->error,
+ __entry->new_addr)
+);
+
+TRACE_EVENT(fuse_iomap_ioend_error,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_ioend_in *inarg,
+ int error),
+
+ TP_ARGS(inode, inarg, error),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(unsigned, ioendflags)
+ __field(int, error)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(uint64_t, new_addr)
+ __field(size_t, written)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->ioendflags = inarg->ioendflags;
+ __entry->error = error;
+ __entry->pos = inarg->pos;
+ __entry->new_addr = inarg->new_addr;
+ __entry->written = inarg->written;
+ ),
+
+ TP_printk("connection %u ino %llu ioendflags (%s) pos 0x%llx written %zd error %d new_addr 0x%llx",
+ __entry->connection, __entry->ino,
+ __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+ __entry->pos, __entry->written, __entry->error,
+ __entry->new_addr)
+);
+
TRACE_EVENT(fuse_iomap_dev_class,
TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
const struct file *file),
@@ -473,6 +558,107 @@ TRACE_EVENT(fuse_iomap_lseek,
__entry->connection, __entry->ino, __entry->offset,
__entry->whence)
);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_io_class,
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter),
+ TP_ARGS(iocb, iter),
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, size)
+ __field(loff_t, offset)
+ __field(size_t, count)
+ ),
+ TP_fast_assign(
+ const struct inode *inode = file_inode(iocb->ki_filp);
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->size = i_size_read(inode);
+ __entry->offset = iocb->ki_pos;
+ __entry->count = iov_iter_count(iter);
+ ),
+ TP_printk("connection %u ino %llu disize 0x%llx pos 0x%llx bytecount 0x%zx",
+ __entry->connection, __entry->ino, __entry->size,
+ __entry->offset, __entry->count)
+)
+#define DEFINE_FUSE_IOMAP_FILE_IO_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_file_io_class, name, \
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter), \
+ TP_ARGS(iocb, iter))
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
+ ssize_t ret),
+ TP_ARGS(iocb, iter, ret),
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, size)
+ __field(loff_t, offset)
+ __field(size_t, count)
+ __field(ssize_t, ret)
+ ),
+ TP_fast_assign(
+ const struct inode *inode = file_inode(iocb->ki_filp);
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->size = i_size_read(inode);
+ __entry->offset = iocb->ki_pos;
+ __entry->count = iov_iter_count(iter);
+ __entry->ret = ret;
+ ),
+ TP_printk("connection %u ino %llu disize 0x%llx pos 0x%llx bytecount 0x%zx ret 0x%zx",
+ __entry->connection, __entry->ino, __entry->size,
+ __entry->offset, __entry->count, __entry->ret)
+)
+#define DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_file_ioend_class, name, \
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter, \
+ ssize_t ret), \
+ TP_ARGS(iocb, iter, ret))
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+
+TRACE_EVENT(fuse_iomap_dio_write_end_io,
+ TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
+ int error, unsigned flags),
+
+ TP_ARGS(inode, pos, written, error, flags),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(unsigned, dioendflags)
+ __field(int, error)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(size_t, written)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->dioendflags = flags;
+ __entry->error = error;
+ __entry->pos = pos;
+ __entry->written = written;
+ ),
+
+ TP_printk("connection %u ino %llu dioendflags (%s) pos 0x%llx written %zd error %d",
+ __entry->connection, __entry->ino,
+ __print_flags(__entry->dioendflags, "|", IOMAP_DIOEND_STRINGS),
+ __entry->pos, __entry->written, __entry->error)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index ea8992e980a015..4611f912003593 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -237,6 +237,7 @@
* - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
* SEEK_{DATA,HOLE} support
* - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
+ * - add FUSE_IOMAP_DIRECTIO for direct I/O support
*/
#ifndef _LINUX_FUSE_H
@@ -447,6 +448,7 @@ struct fuse_file_lock {
* init_out.request_timeout contains the timeout (in secs)
* FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
* operations.
+ * FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -495,6 +497,7 @@ struct fuse_file_lock {
#define FUSE_OVER_IO_URING (1ULL << 41)
#define FUSE_REQUEST_TIMEOUT (1ULL << 42)
#define FUSE_IOMAP (1ULL << 43)
+#define FUSE_IOMAP_DIRECTIO (1ULL << 44)
/**
* CUSE INIT request/reply flags
@@ -663,6 +666,7 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_IOEND = 4093,
FUSE_IOMAP_BEGIN = 4094,
FUSE_IOMAP_END = 4095,
@@ -1379,4 +1383,27 @@ struct fuse_iomap_add_device_out {
uint32_t *map_dev; /* location to receive device cookie */
};
+/* out of place write extent */
+#define FUSE_IOMAP_IOEND_SHARED (1U << 0)
+/* unwritten extent */
+#define FUSE_IOMAP_IOEND_UNWRITTEN (1U << 1)
+/* don't merge into previous ioend */
+#define FUSE_IOMAP_IOEND_BOUNDARY (1U << 2)
+/* is direct I/O */
+#define FUSE_IOMAP_IOEND_DIRECT (1U << 3)
+
+/* is append ioend */
+#define FUSE_IOMAP_IOEND_APPEND (1U << 15)
+
+struct fuse_iomap_ioend_in {
+ uint16_t ioendflags; /* FUSE_IOMAP_IOEND_* */
+ uint16_t reserved; /* zero */
+ int32_t error; /* negative errno or 0 */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t new_addr; /* disk offset of new mapping, in bytes */
+ uint32_t written; /* bytes processed */
+ uint32_t reserved1; /* zero */
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index be75a515c4f8b6..c947ad50a9a8eb 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -704,6 +704,10 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
d_instantiate(entry, inode);
fuse_change_entry_timeout(entry, &outentry);
fuse_dir_changed(dir);
+
+ if (fuse_has_iomap(inode))
+ fuse_iomap_open(inode, file);
+
err = generic_file_open(inode, file);
if (!err) {
file->private_data = ff;
@@ -1692,6 +1696,9 @@ static int fuse_dir_open(struct inode *inode, struct file *file)
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_has_iomap(inode))
+ fuse_iomap_open(inode, file);
+
err = generic_file_open(inode, file);
if (err)
return err;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6b54b9a8f8a84d..7e8b20f56dd823 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -244,6 +244,9 @@ static int fuse_open(struct inode *inode, struct file *file)
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_has_iomap(inode))
+ fuse_iomap_open(inode, file);
+
err = generic_file_open(inode, file);
if (err)
return err;
@@ -1778,10 +1781,17 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct file *file = iocb->ki_filp;
struct fuse_file *ff = file->private_data;
struct inode *inode = file_inode(file);
+ ssize_t ret;
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_want_iomap_direct_io(iocb)) {
+ ret = fuse_iomap_direct_read(iocb, to);
+ if (ret != -ENOSYS)
+ return ret;
+ }
+
if (FUSE_IS_DAX(inode))
return fuse_dax_read_iter(iocb, to);
@@ -1803,6 +1813,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_want_iomap_direct_io(iocb)) {
+ ssize_t ret = fuse_iomap_direct_write(iocb, from);
+ if (ret != -ENOSYS)
+ return ret;
+ }
+
if (FUSE_IS_DAX(inode))
return fuse_dax_write_iter(iocb, from);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index f943cb3334a787..077ef51ee47452 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -310,6 +310,70 @@ const struct iomap_ops fuse_iomap_ops = {
.iomap_end = fuse_iomap_end,
};
+static inline bool fuse_want_ioend(const struct fuse_iomap_ioend_in *inarg)
+{
+ /* Always send an ioend for errors. */
+ if (inarg->error)
+ return true;
+
+ /* Send an ioend if we performed an IO involving metadata changes. */
+ return inarg->written > 0 &&
+ (inarg->ioendflags & (FUSE_IOMAP_IOEND_SHARED |
+ FUSE_IOMAP_IOEND_UNWRITTEN |
+ FUSE_IOMAP_IOEND_APPEND));
+}
+
+static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
+ int error, unsigned ioendflags, sector_t new_addr)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_ioend_in inarg = {
+ .ioendflags = ioendflags,
+ .error = error,
+ .attr_ino = fi->orig_ino,
+ .pos = pos,
+ .written = written,
+ .new_addr = new_addr,
+ };
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ FUSE_ARGS(args);
+ int err = 0;
+
+ if (pos + written > i_size_read(inode))
+ inarg.ioendflags |= FUSE_IOMAP_IOEND_APPEND;
+
+ trace_fuse_iomap_ioend(inode, &inarg);
+
+ if (!fuse_want_ioend(&inarg))
+ goto out;
+
+ args.opcode = FUSE_IOMAP_IOEND;
+ args.nodeid = get_node_id(inode);
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ err = fuse_simple_request(fm, &args);
+
+ trace_fuse_iomap_ioend_error(inode, &inarg, err);
+
+ /*
+ * Preserve the original error code if userspace didn't respond or
+ * returned success despite the error we passed along via the ioend.
+ */
+ if (error && (err == 0 || err == -ENOSYS))
+ err = error;
+
+out:
+ /*
+ * If there weren't any ioend errors, update the incore isize, which
+ * confusingly takes the new i_size as "pos".
+ */
+ if (!error && !err)
+ fuse_write_update_attr(inode, pos + written, written);
+
+ return err;
+}
+
void fuse_iomap_conn_put(struct fuse_conn *fc)
{
unsigned int i;
@@ -461,3 +525,195 @@ loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
return offset;
return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
}
+
+void fuse_iomap_open(struct inode *inode, struct file *file)
+{
+ if (fuse_has_iomap_direct_io(inode))
+ file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+}
+
+enum fuse_ilock_type {
+ SHARED,
+ EXCL,
+};
+
+static int fuse_iomap_ilock_iocb(const struct kiocb *iocb,
+ enum fuse_ilock_type type)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ switch (type) {
+ case SHARED:
+ return inode_trylock_shared(inode) ? 0 : -EAGAIN;
+ case EXCL:
+ return inode_trylock(inode) ? 0 : -EAGAIN;
+ default:
+ ASSERT(0);
+ return -EIO;
+ }
+ } else {
+ switch (type) {
+ case SHARED:
+ inode_lock_shared(inode);
+ break;
+ case EXCL:
+ inode_lock(inode);
+ break;
+ default:
+ ASSERT(0);
+ return -EIO;
+ }
+ }
+
+ return 0;
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ ASSERT(fuse_has_iomap_direct_io(inode));
+
+ trace_fuse_iomap_direct_read(iocb, to);
+
+ if (!iov_iter_count(to))
+ return 0; /* skip atime */
+
+ file_accessed(iocb->ki_filp);
+
+ ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+ if (ret)
+ return ret;
+ ret = iomap_dio_rw(iocb, to, &fuse_iomap_ops, NULL, 0, NULL, 0);
+ inode_unlock_shared(inode);
+
+ trace_fuse_iomap_direct_read_end(iocb, to, ret);
+ return ret;
+}
+
+static int fuse_iomap_dio_write_end_io(struct kiocb *iocb, ssize_t written,
+ int error, unsigned dioflags)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ unsigned int nofs_flag;
+ unsigned int ioendflags = FUSE_IOMAP_IOEND_DIRECT;
+ int ret;
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ ASSERT(fuse_has_iomap_direct_io(inode));
+
+ trace_fuse_iomap_dio_write_end_io(inode, iocb->ki_pos, written, error,
+ dioflags);
+
+ if (dioflags & IOMAP_DIO_COW)
+ ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+ if (dioflags & IOMAP_DIO_UNWRITTEN)
+ ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+ /*
+ * We can allocate memory here while doing writeback on behalf of
+ * memory reclaim. To avoid memory allocation deadlocks set the
+ * task-wide nofs context for the following operations.
+ */
+ nofs_flag = memalloc_nofs_save();
+ ret = fuse_iomap_ioend(inode, iocb->ki_pos, written, error, ioendflags,
+ FUSE_IOMAP_NULL_ADDR);
+ memalloc_nofs_restore(nofs_flag);
+ return ret;
+}
+
+static const struct iomap_dio_ops fuse_iomap_dio_write_ops = {
+ .end_io = fuse_iomap_dio_write_end_io,
+};
+
+static int fuse_iomap_direct_write_sync(struct kiocb *iocb, loff_t start,
+ size_t count)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ loff_t end = start + count - 1;
+ int err;
+
+ /* Flush the file metadata, not the page cache. */
+ err = sync_inode_metadata(inode, 1);
+ if (err)
+ return err;
+
+ if (fc->no_fsync)
+ return 0;
+
+ err = fuse_fsync_common(iocb->ki_filp, start, end, iocb_is_dsync(iocb),
+ FUSE_FSYNC);
+ if (err == -ENOSYS) {
+ fc->no_fsync = 1;
+ err = 0;
+ }
+ return err;
+}
+
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ loff_t blockmask = i_blocksize(inode) - 1;
+ loff_t pos = iocb->ki_pos;
+ size_t count = iov_iter_count(from);
+ bool was_dsync = false;
+ ssize_t ret;
+
+ ASSERT(fuse_has_iomap_direct_io(inode));
+
+ trace_fuse_iomap_direct_write(iocb, from);
+
+ /*
+ * direct I/O must be aligned to the fsblock size or we fall back to
+ * the old paths
+ */
+ if ((iocb->ki_pos | count) & blockmask)
+ return -ENOTBLK;
+
+ /* fuse doesn't support S_SYNC, so complain if we see this. */
+ if (IS_SYNC(inode)) {
+ ASSERT(!IS_SYNC(inode));
+ return -EIO;
+ }
+
+ /*
+ * Strip off IOCB_DSYNC so that we can run the fsync ourselves because
+ * we hold inode_lock; iomap_dio_rw calls generic_write_sync; and
+ * fuse_fsync tries to take inode_lock again.
+ */
+ if (iocb_is_dsync(iocb)) {
+ was_dsync = true;
+ iocb->ki_flags &= ~IOCB_DSYNC;
+ }
+
+ ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+ if (ret)
+ goto out_dsync;
+ ret = generic_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out_unlock;
+
+ ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops,
+ &fuse_iomap_dio_write_ops, 0, NULL, 0);
+ if (ret)
+ goto out_unlock;
+
+ if (was_dsync) {
+ /* Restore IOCB_DSYNC and call our sync function */
+ iocb->ki_flags |= IOCB_DSYNC;
+ ret = fuse_iomap_direct_write_sync(iocb, pos, count);
+ }
+
+out_unlock:
+ inode_unlock(inode);
+out_dsync:
+ trace_fuse_iomap_direct_write_end(iocb, from, ret);
+ if (was_dsync)
+ iocb->ki_flags |= IOCB_DSYNC;
+ return ret;
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 224fb9e7610cc5..0b3ad7bf89b52d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1443,6 +1443,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
if ((flags & FUSE_IOMAP) && fuse_iomap_enabled())
fc->iomap = 1;
+ if ((flags & FUSE_IOMAP_DIRECTIO) && fc->iomap)
+ fc->iomap_directio = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1515,7 +1517,7 @@ void fuse_send_init(struct fuse_mount *fm)
if (fuse_uring_enabled())
flags |= FUSE_OVER_IO_URING;
if (fuse_iomap_enabled())
- flags |= FUSE_IOMAP;
+ flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO;
ia->in.flags = flags;
ia->in.flags2 = flags >> 32;
* [PATCH 08/11] fuse: implement buffered IO with iomap
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (6 preceding siblings ...)
2025-05-22 0:04 ` [PATCH 07/11] fuse: implement direct IO with iomap Darrick J. Wong
@ 2025-05-22 0:04 ` Darrick J. Wong
2025-05-22 0:04 ` [PATCH 09/11] fuse: implement large folios for iomap pagecache files Darrick J. Wong
` (3 subsequent siblings)
11 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:04 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
Implement pagecache IO with iomap, complete with hooks into truncate and
fallocate so that the fuse server needn't implement disk block zeroing
of post-EOF and unaligned punch/zero regions.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
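Not part of the commit: a rough userspace sketch of what "unaligned
punch/zero regions" means in the message above. It only illustrates the
arithmetic of splitting a byte range into a partial head block, a run of
whole blocks, and a partial tail block; it does not claim to be how the
kernel code in this patch is structured. The names byte_range, punch_plan
and plan_punch() are hypothetical, and the block size is assumed to be a
power of two.

```c
#include <stdint.h>

struct byte_range {
	uint64_t start;	/* first byte of the range */
	uint64_t len;	/* length in bytes, 0 if the range is empty */
};

struct punch_plan {
	struct byte_range head;		/* partial block: zero in the pagecache */
	struct byte_range whole;	/* aligned blocks: server may unmap them */
	struct byte_range tail;		/* partial block: zero in the pagecache */
};

/*
 * Hypothetical helper: split [offset, offset + length) on block boundaries.
 * Assumes blocksize is a power of two and that the sums do not overflow.
 */
static struct punch_plan plan_punch(uint64_t offset, uint64_t length,
				    uint64_t blocksize)
{
	uint64_t mask = blocksize - 1;
	uint64_t end = offset + length;
	uint64_t first_whole = (offset + mask) & ~mask;	/* round up */
	uint64_t last_whole = end & ~mask;		/* round down */
	struct punch_plan plan = { { 0, 0 }, { 0, 0 }, { 0, 0 } };

	if (first_whole > last_whole) {
		/* Range sits inside one block and touches neither edge. */
		plan.head.start = offset;
		plan.head.len = length;
		return plan;
	}

	plan.head.start = offset;
	plan.head.len = first_whole - offset;
	plan.whole.start = first_whole;
	plan.whole.len = last_whole - first_whole;
	plan.tail.start = last_whole;
	plan.tail.len = end - last_whole;
	return plan;
}
```

With a 4096-byte block size, punching 10000 bytes at offset 1000 leaves a
3096-byte head and a 2808-byte tail that cannot be handled by unmapping
blocks. Those partial regions are the ones the commit message says the
server no longer has to zero by itself.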
fs/fuse/fuse_i.h | 42 +++
fs/fuse/fuse_trace.h | 308 ++++++++++++++++++++
include/uapi/linux/fuse.h | 3
fs/fuse/dir.c | 6
fs/fuse/file.c | 48 +++
fs/fuse/file_iomap.c | 684 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 7
7 files changed, 1088 insertions(+), 10 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 51a373bc7b03d9..8481b1d0299df0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -164,6 +164,13 @@ struct fuse_inode {
/* List of writepage requestst (pending or sent) */
struct rb_root writepages;
+
+#ifdef CONFIG_FUSE_IOMAP
+ /* pending io completions */
+ spinlock_t ioend_lock;
+ struct work_struct ioend_work;
+ struct list_head ioend_list;
+#endif
};
/* readdir cache (directory only) */
@@ -907,6 +914,9 @@ struct fuse_conn {
/* Use fs/iomap for direct I/O operations */
unsigned int iomap_directio:1;
+ /* Use fs/iomap for pagecache I/O operations */
+ unsigned int iomap_pagecache:1;
+
/* Use io_uring for communication */
unsigned int io_uring;
@@ -1613,6 +1623,9 @@ extern void fuse_sysctl_unregister(void);
#define fuse_sysctl_unregister() do { } while (0)
#endif /* CONFIG_SYSCTL */
+sector_t fuse_bmap(struct address_space *mapping, sector_t block);
+ssize_t fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
# include <linux/fiemap.h>
# include <linux/iomap.h>
@@ -1650,6 +1663,26 @@ static inline bool fuse_want_iomap_direct_io(const struct kiocb *iocb)
ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
+
+static inline bool fuse_has_iomap_pagecache(const struct inode *inode)
+{
+ return get_fuse_conn_c(inode)->iomap_pagecache;
+}
+
+static inline bool fuse_want_iomap_buffered_io(const struct kiocb *iocb)
+{
+ return fuse_has_iomap_pagecache(file_inode(iocb->ki_filp));
+}
+
+void fuse_iomap_init_pagecache(struct inode *inode);
+void fuse_iomap_destroy_pagecache(struct inode *inode);
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma);
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from);
+int fuse_iomap_setsize(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct iattr *iattr);
+int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
+ loff_t length, loff_t new_size);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1664,6 +1697,15 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
# define fuse_want_iomap_direct_io(...) (false)
# define fuse_iomap_direct_read(...) (-ENOSYS)
# define fuse_iomap_direct_write(...) (-ENOSYS)
+# define fuse_has_iomap_pagecache(...) (false)
+# define fuse_want_iomap_buffered_io(...) (false)
+# define fuse_iomap_init_pagecache(...) ((void)0)
+# define fuse_iomap_destroy_pagecache(...) ((void)0)
+# define fuse_iomap_mmap(...) (-ENOSYS)
+# define fuse_iomap_buffered_read(...) (-ENOSYS)
+# define fuse_iomap_buffered_write(...) (-ENOSYS)
+# define fuse_iomap_setsize(...) (-ENOSYS)
+# define fuse_iomap_fallocate(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index da7c317b664a10..ef86cfa9195070 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -173,6 +173,12 @@ TRACE_EVENT(fuse_request_end,
{ IOMAP_DIO_UNWRITTEN, "unwritten" }, \
{ IOMAP_DIO_COW, "cow" }
+#define IOMAP_IOEND_STRINGS \
+ { IOMAP_IOEND_SHARED, "shared" }, \
+ { IOMAP_IOEND_UNWRITTEN, "unwritten" }, \
+ { IOMAP_IOEND_BOUNDARY, "boundary" }, \
+ { IOMAP_IOEND_DIRECT, "direct" }
+
TRACE_EVENT(fuse_iomap_begin,
TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
unsigned opflags),
@@ -590,6 +596,9 @@ DEFINE_EVENT(fuse_iomap_file_io_class, name, \
TP_ARGS(iocb, iter))
DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_write_zero_eof);
DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
@@ -626,6 +635,8 @@ DEFINE_EVENT(fuse_iomap_file_ioend_class, name, \
TP_ARGS(iocb, iter, ret))
DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_write_end);
TRACE_EVENT(fuse_iomap_dio_write_end_io,
TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
@@ -659,6 +670,303 @@ TRACE_EVENT(fuse_iomap_dio_write_end_io,
__print_flags(__entry->dioendflags, "|", IOMAP_DIOEND_STRINGS),
__entry->pos, __entry->written, __entry->error)
);
+
+TRACE_EVENT(fuse_iomap_end_ioend,
+ TP_PROTO(const struct iomap_ioend *ioend),
+
+ TP_ARGS(ioend),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, offset)
+ __field(size_t, size)
+ __field(unsigned int, ioendflags)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ const struct inode *inode = ioend->io_inode;
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->offset = ioend->io_offset;
+ __entry->size = ioend->io_size;
+ __entry->ioendflags = ioend->io_flags;
+ __entry->error =
+ blk_status_to_errno(ioend->io_bio.bi_status);
+ ),
+
+ TP_printk("connection %u ino %llu offset 0x%llx size %zu ioendflags (%s) error %d",
+ __entry->connection, __entry->ino, __entry->offset,
+ __entry->size,
+ __print_flags(__entry->ioendflags, "|", IOMAP_IOEND_STRINGS),
+ __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_map_blocks,
+ TP_PROTO(const struct inode *inode, loff_t offset, unsigned int count),
+
+ TP_ARGS(inode, offset, count),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, offset)
+ __field(unsigned int, count)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->offset = offset;
+ __entry->count = count;
+ ),
+
+ TP_printk("connection %u ino %llu offset 0x%llx count %u",
+ __entry->connection, __entry->ino, __entry->offset,
+ __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_submit_ioend,
+ TP_PROTO(const struct inode *inode, unsigned int nr_folios, int error),
+
+ TP_ARGS(inode, nr_folios, error),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(unsigned int, nr_folios)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->nr_folios = nr_folios;
+ __entry->error = error;
+ ),
+
+ TP_printk("connection %u ino %llu nr_folios %u error %d",
+ __entry->connection, __entry->ino, __entry->nr_folios,
+ __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_discard_folio,
+ TP_PROTO(const struct inode *inode, loff_t offset, size_t count),
+
+ TP_ARGS(inode, offset, count),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, offset)
+ __field(size_t, count)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->offset = offset;
+ __entry->count = count;
+ ),
+
+ TP_printk("connection %u ino %llu offset 0x%llx count 0x%zx",
+ __entry->connection, __entry->ino, __entry->offset,
+ __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_writepages,
+ TP_PROTO(const struct inode *inode, const struct writeback_control *wbc),
+
+ TP_ARGS(inode, wbc),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, start)
+ __field(loff_t, end)
+ __field(long, nr_to_write)
+ __field(bool, sync_all)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->start = wbc->range_start;
+ __entry->end = wbc->range_end;
+ __entry->nr_to_write = wbc->nr_to_write;
+ __entry->sync_all = wbc->sync_mode == WB_SYNC_ALL;
+ ),
+
+ TP_printk("connection %u ino %llu start 0x%llx end 0x%llx nr %ld sync_all? %d",
+ __entry->connection, __entry->ino, __entry->start,
+ __entry->end, __entry->nr_to_write, __entry->sync_all)
+);
+
+TRACE_EVENT(fuse_iomap_read_folio,
+ TP_PROTO(const struct folio *folio),
+
+ TP_ARGS(folio),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(size_t, count)
+ ),
+
+ TP_fast_assign(
+ const struct inode *inode = folio->mapping->host;
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->pos = folio_pos(folio);
+ __entry->count = folio_size(folio);
+ ),
+
+ TP_printk("connection %u ino %llu offset 0x%llx count 0x%zx",
+ __entry->connection, __entry->ino, __entry->pos,
+ __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_readahead,
+ TP_PROTO(const struct readahead_control *rac),
+
+ TP_ARGS(rac),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(size_t, count)
+ ),
+
+ TP_fast_assign(
+ const struct inode *inode = file_inode(rac->file);
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+ struct readahead_control *mutrac = (struct readahead_control *)rac;
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->pos = readahead_pos(mutrac);
+ __entry->count = readahead_length(mutrac);
+ ),
+
+ TP_printk("connection %u ino %llu offset 0x%llx count 0x%zx",
+ __entry->connection, __entry->ino, __entry->pos,
+ __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_page_mkwrite,
+ TP_PROTO(const struct vm_fault *vmf),
+
+ TP_ARGS(vmf),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, pos)
+ __field(size_t, count)
+ ),
+
+ TP_fast_assign(
+ const struct inode *inode = file_inode(vmf->vma->vm_file);
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+ struct folio *folio = page_folio(vmf->page);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->pos = folio_pos(folio);
+ __entry->count = folio_size(folio);
+ ),
+
+ TP_printk("connection %u ino %llu offset 0x%llx count 0x%zx",
+ __entry->connection, __entry->ino, __entry->pos,
+ __entry->count)
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_range_class,
+ TP_PROTO(const struct inode *inode, loff_t offset, loff_t length),
+ TP_ARGS(inode, offset, length),
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, size)
+ __field(loff_t, offset)
+ __field(loff_t, length)
+ ),
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->size = i_size_read(inode);
+ __entry->offset = offset;
+ __entry->length = length;
+ ),
+ TP_printk("connection %u ino %llu disize 0x%llx pos 0x%llx bytecount 0x%llx",
+ __entry->connection, __entry->ino, __entry->size,
+ __entry->offset, __entry->length)
+)
+#define DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_file_range_class, name, \
+ TP_PROTO(const struct inode *inode, loff_t offset, loff_t length), \
+ TP_ARGS(inode, offset, length))
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_up);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize);
+
+TRACE_EVENT(fuse_iomap_fallocate,
+ TP_PROTO(const struct inode *inode, int mode, loff_t offset,
+ loff_t length, loff_t newsize),
+ TP_ARGS(inode, mode, offset, length, newsize),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(uint64_t, ino)
+ __field(loff_t, offset)
+ __field(loff_t, length)
+ __field(loff_t, newsize)
+ __field(int, mode)
+ ),
+
+ TP_fast_assign(
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+ const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+ __entry->connection = fm->fc->dev;
+ __entry->ino = fi->orig_ino;
+ __entry->mode = mode;
+ __entry->offset = offset;
+ __entry->length = length;
+ __entry->newsize = newsize;
+ ),
+
+ TP_printk("connection %u ino %llu mode 0x%x offset 0x%llx length 0x%llx newsize 0x%llx",
+ __entry->connection, __entry->ino, __entry->mode,
+ __entry->offset, __entry->length, __entry->newsize)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 4611f912003593..c9402f2b2a335c 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -238,6 +238,7 @@
* SEEK_{DATA,HOLE} support
* - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
* - add FUSE_IOMAP_DIRECTIO for direct I/O support
+ * - add FUSE_IOMAP_PAGECACHE for buffered I/O support
*/
#ifndef _LINUX_FUSE_H
@@ -449,6 +450,7 @@ struct fuse_file_lock {
* FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
* operations.
* FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
+ * FUSE_IOMAP_PAGECACHE: Client supports iomap for pagecache I/O operations.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -498,6 +500,7 @@ struct fuse_file_lock {
#define FUSE_REQUEST_TIMEOUT (1ULL << 42)
#define FUSE_IOMAP (1ULL << 43)
#define FUSE_IOMAP_DIRECTIO (1ULL << 44)
+#define FUSE_IOMAP_PAGECACHE (1ULL << 45)
/**
* CUSE INIT request/reply flags
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index c947ad50a9a8eb..2b6c5f3c99338f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2012,6 +2012,12 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
if (trust_local_cmtime && attr->ia_size != inode->i_size)
attr->ia_valid |= ATTR_MTIME | ATTR_CTIME;
+
+ if (fuse_has_iomap_pagecache(inode)) {
+ err = fuse_iomap_setsize(idmap, dentry, attr);
+ if (err)
+ goto error;
+ }
}
memset(&inarg, 0, sizeof(inarg));
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7e8b20f56dd823..a3e9df5f9788d6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -384,7 +384,7 @@ static int fuse_release(struct inode *inode, struct file *file)
* Dirty pages might remain despite write_inode_now() call from
* fuse_flush() due to writes racing with the close.
*/
- if (fc->writeback_cache)
+ if (fc->writeback_cache || fuse_has_iomap_pagecache(inode))
write_inode_now(inode, 1);
fuse_release_common(file, false);
@@ -1734,8 +1734,6 @@ static ssize_t __fuse_direct_read(struct fuse_io_priv *io,
return res;
}
-static ssize_t fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
-
static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
ssize_t res;
@@ -1792,6 +1790,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
return ret;
}
+ if (fuse_want_iomap_buffered_io(iocb))
+ return fuse_iomap_buffered_read(iocb, to);
+
if (FUSE_IS_DAX(inode))
return fuse_dax_read_iter(iocb, to);
@@ -1815,10 +1816,29 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (fuse_want_iomap_direct_io(iocb)) {
ssize_t ret = fuse_iomap_direct_write(iocb, from);
- if (ret != -ENOSYS)
+ switch (ret) {
+ case -ENOTBLK:
+ /*
+ * If we're going to fall back to the iomap buffered
+ * write path only, then try the write again as a
+ * synchronous buffered write. Otherwise we let it
+ * drop through to the old ->direct_IO path.
+ */
+ if (fuse_want_iomap_buffered_io(iocb))
+ iocb->ki_flags |= IOCB_SYNC;
+ fallthrough;
+ case -ENOSYS:
+ /* no implementation, fall through */
+ break;
+ default:
+ /* errors, no progress, or even partial progress */
return ret;
+ }
}
+ if (fuse_want_iomap_buffered_io(iocb))
+ return fuse_iomap_buffered_write(iocb, from);
+
if (FUSE_IS_DAX(inode))
return fuse_dax_write_iter(iocb, from);
@@ -2653,6 +2673,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
struct inode *inode = file_inode(file);
int rc;
+ if (fuse_has_iomap_pagecache(inode))
+ return fuse_iomap_mmap(file, vma);
+
/* DAX mmap is superior to direct_io mmap */
if (FUSE_IS_DAX(inode))
return fuse_dax_mmap(file, vma);
@@ -2851,7 +2874,7 @@ static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
return err;
}
-static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
+sector_t fuse_bmap(struct address_space *mapping, sector_t block)
{
struct inode *inode = mapping->host;
struct fuse_mount *fm = get_fuse_mount(inode);
@@ -3107,8 +3130,7 @@ static inline loff_t fuse_round_up(struct fuse_conn *fc, loff_t off)
return round_up(off, fc->max_pages << PAGE_SHIFT);
}
-static ssize_t
-fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
DECLARE_COMPLETION_ONSTACK(wait);
ssize_t ret = 0;
@@ -3227,6 +3249,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
.length = length,
.mode = mode
};
+ loff_t newsize = 0;
int err;
bool block_faults = FUSE_IS_DAX(inode) &&
(!(mode & FALLOC_FL_KEEP_SIZE) ||
@@ -3260,6 +3283,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
err = inode_newsize_ok(inode, offset + length);
if (err)
goto out;
+ newsize = offset + length;
}
err = file_modified(file);
@@ -3282,6 +3306,14 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (err)
goto out;
+ if (fuse_has_iomap_pagecache(inode)) {
+ err = fuse_iomap_fallocate(file, mode, offset, length,
+ newsize);
+ if (err)
+ goto out;
+ file_update_time(file);
+ }
+
/* we could have extended the file */
if (!(mode & FALLOC_FL_KEEP_SIZE)) {
if (fuse_write_update_attr(inode, offset + length, length))
@@ -3480,4 +3512,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
if (IS_ENABLED(CONFIG_FUSE_DAX))
fuse_dax_inode_init(inode, flags);
+ if (fuse_has_iomap_pagecache(inode))
+ fuse_iomap_init_pagecache(inode);
}
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 077ef51ee47452..345610768edc80 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -6,6 +6,8 @@
#include "fuse_i.h"
#include "fuse_trace.h"
#include <linux/iomap.h>
+#include <linux/pagemap.h>
+#include <linux/falloc.h>
static bool __read_mostly enable_iomap =
#if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
@@ -530,6 +532,8 @@ void fuse_iomap_open(struct inode *inode, struct file *file)
{
if (fuse_has_iomap_direct_io(inode))
file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+ if (fuse_has_iomap_pagecache(inode))
+ file->f_mode |= FMODE_NOWAIT;
}
enum fuse_ilock_type {
@@ -655,6 +659,109 @@ static int fuse_iomap_direct_write_sync(struct kiocb *iocb, loff_t start,
return err;
}
+static int
+fuse_iomap_zero_range(
+ struct inode *inode,
+ loff_t pos,
+ loff_t len,
+ bool *did_zero)
+{
+ return iomap_zero_range(inode, pos, len, did_zero, &fuse_iomap_ops,
+ NULL);
+}
+
+/* Take care of zeroing post-EOF blocks when they might exist. */
+static ssize_t
+fuse_iomap_write_zero_eof(
+ struct kiocb *iocb,
+ struct iov_iter *from,
+ bool *drained_dio)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
+ loff_t isize;
+ int error;
+
+ /*
+ * We need to serialise against EOF updates that occur in IO
+ * completions here. We want to make sure that nobody is changing the
+ * size while we do this check until we have placed an IO barrier (i.e.
+ * hold i_rwsem exclusively) that prevents new IO from being
+ * dispatched. The spinlock effectively forms a memory barrier once we
+ * have i_rwsem exclusively so we are guaranteed to see the latest EOF
+ * value and hence be able to correctly determine if we need to run
+ * zeroing.
+ */
+ spin_lock(&fi->lock);
+ isize = i_size_read(inode);
+ if (iocb->ki_pos <= isize) {
+ spin_unlock(&fi->lock);
+ return 0;
+ }
+ spin_unlock(&fi->lock);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+
+ if (!(*drained_dio)) {
+ /*
+ * We now have an IO submission barrier in place, but AIO can
+ * do EOF updates during IO completion and hence we now need to
+ * wait for all of them to drain. Non-AIO DIO will have
+ * drained before we are given the exclusive i_rwsem, and so
+ * for most cases this wait is a no-op.
+ */
+ inode_dio_wait(inode);
+ *drained_dio = true;
+ return 1;
+ }
+
+ trace_fuse_iomap_write_zero_eof(iocb, from);
+
+ filemap_invalidate_lock(mapping);
+ error = fuse_iomap_zero_range(inode, isize, iocb->ki_pos - isize, NULL);
+ filemap_invalidate_unlock(mapping);
+
+ return error;
+}
+
+static ssize_t
+fuse_iomap_write_checks(
+ struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ ssize_t error;
+ bool drained_dio = false;
+
+restart:
+ error = generic_write_checks(iocb, from);
+ if (error <= 0)
+ return error;
+
+ /*
+ * If the offset is beyond the size of the file, we need to zero all
+ * blocks that fall between the existing EOF and the start of this
+ * write.
+ *
+ * We can do an unlocked check for i_size here safely as I/O completion
+ * can only extend EOF. Truncate is locked out at this point, so the
+ * EOF cannot move backwards, only forwards. Hence we only need to take
+ * the slow path when we are at or beyond the current EOF.
+ */
+ if (fuse_has_iomap_pagecache(inode) &&
+ iocb->ki_pos > i_size_read(inode)) {
+ error = fuse_iomap_write_zero_eof(iocb, from, &drained_dio);
+ if (error == 1)
+ goto restart;
+ if (error)
+ return error;
+ }
+
+ return kiocb_modified(iocb);
+}
+
ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
@@ -694,8 +801,9 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
ret = fuse_iomap_ilock_iocb(iocb, EXCL);
if (ret)
goto out_dsync;
- ret = generic_write_checks(iocb, from);
- if (ret <= 0)
+
+ ret = fuse_iomap_write_checks(iocb, from);
+ if (ret)
goto out_unlock;
ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops,
@@ -717,3 +825,575 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
iocb->ki_flags |= IOCB_DSYNC;
return ret;
}
+
+struct fuse_writepage_ctx {
+ struct iomap_writepage_ctx ctx;
+};
+
+static void fuse_iomap_end_ioend(struct iomap_ioend *ioend)
+{
+ struct inode *inode = ioend->io_inode;
+ unsigned int ioendflags = 0;
+ unsigned int nofs_flag;
+ int error = blk_status_to_errno(ioend->io_bio.bi_status);
+
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ if (fuse_is_bad(inode))
+ return;
+
+ trace_fuse_iomap_end_ioend(ioend);
+
+ if (ioend->io_flags & IOMAP_IOEND_SHARED)
+ ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+ if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
+ ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+ /*
+ * We can allocate memory here while doing writeback on behalf of
+ * memory reclaim. To avoid memory allocation deadlocks set the
+ * task-wide nofs context for the following operations.
+ */
+ nofs_flag = memalloc_nofs_save();
+ fuse_iomap_ioend(inode, ioend->io_offset, ioend->io_size, error,
+ ioendflags, FUSE_IOMAP_NULL_ADDR);
+ iomap_finish_ioends(ioend, error);
+ memalloc_nofs_restore(nofs_flag);
+}
+
+/*
+ * Finish all pending IO completions that require transactional modifications.
+ *
+ * We try to merge physically and logically contiguous ioends before completion to
+ * minimise the number of transactions we need to perform during IO completion.
+ * Both unwritten extent conversion and COW remapping need to iterate and modify
+ * one physical extent at a time, so we gain nothing by merging physically
+ * discontiguous extents here.
+ *
+ * The ioend chain length that we can be processing here is largely unbounded in
+ * length and we may have to perform significant amounts of work on each ioend
+ * to complete it. Hence we have to be careful about holding the CPU for too
+ * long in this loop.
+ */
+static void fuse_iomap_end_io(struct work_struct *work)
+{
+ struct fuse_inode *fi =
+ container_of(work, struct fuse_inode, ioend_work);
+ struct iomap_ioend *ioend;
+ struct list_head tmp;
+ unsigned long flags;
+
+ spin_lock_irqsave(&fi->ioend_lock, flags);
+ list_replace_init(&fi->ioend_list, &tmp);
+ spin_unlock_irqrestore(&fi->ioend_lock, flags);
+
+ iomap_sort_ioends(&tmp);
+ while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
+ io_list))) {
+ list_del_init(&ioend->io_list);
+ iomap_ioend_try_merge(ioend, &tmp);
+ fuse_iomap_end_ioend(ioend);
+ cond_resched();
+ }
+}
+
+static void fuse_iomap_end_bio(struct bio *bio)
+{
+ struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+ struct inode *inode = ioend->io_inode;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ unsigned long flags;
+
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ spin_lock_irqsave(&fi->ioend_lock, flags);
+ if (list_empty(&fi->ioend_list))
+ WARN_ON_ONCE(!queue_work(system_unbound_wq, &fi->ioend_work));
+ list_add_tail(&ioend->io_list, &fi->ioend_list);
+ spin_unlock_irqrestore(&fi->ioend_lock, flags);
+}
+
+/*
+ * Fast revalidation of the cached writeback mapping. Return true if the current
+ * mapping is valid, false otherwise.
+ */
+static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc,
+ loff_t offset)
+{
+ if (offset < wpc->iomap.offset ||
+ offset >= wpc->iomap.offset + wpc->iomap.length)
+ return false;
+
+ /* XXX actually use revalidation cookie */
+ return true;
+}
+
+static int fuse_iomap_map_blocks(struct iomap_writepage_ctx *wpc,
+ struct inode *inode, loff_t offset,
+ unsigned int len)
+{
+ struct iomap write_iomap, dontcare;
+ int ret;
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ trace_fuse_iomap_map_blocks(inode, offset, len);
+
+ if (fuse_iomap_revalidate_writeback(wpc, offset))
+ return 0;
+
+ /* Pretend that this is a directio write */
+ ret = fuse_iomap_begin(inode, offset, len, IOMAP_DIRECT | IOMAP_WRITE,
+ &write_iomap, &dontcare);
+ if (ret)
+ return ret;
+
+ /*
+ * Landed in a hole or beyond EOF? Send that to iomap, it'll skip
+ * writing back the file range.
+ */
+ if (write_iomap.offset > offset) {
+ write_iomap.length = write_iomap.offset - offset;
+ write_iomap.offset = offset;
+ write_iomap.type = IOMAP_HOLE;
+ }
+
+ memcpy(&wpc->iomap, &write_iomap, sizeof(struct iomap));
+ return 0;
+}
+
+static int fuse_iomap_submit_ioend(struct iomap_writepage_ctx *wpc, int status)
+{
+ struct iomap_ioend *ioend = wpc->ioend;
+
+ ASSERT(fuse_has_iomap_pagecache(ioend->io_inode));
+
+ trace_fuse_iomap_submit_ioend(ioend->io_inode, wpc->nr_folios, status);
+
+ /* always call our ioend function, even if we cancel the bio */
+ ioend->io_bio.bi_end_io = fuse_iomap_end_bio;
+
+ if (status)
+ return status;
+ submit_bio(&ioend->io_bio);
+ return 0;
+}
+
+/*
+ * If the folio has delalloc blocks on it, the caller is asking us to punch them
+ * out. If we don't, we can leave a stale delalloc mapping covered by a clean
+ * page that needs to be dirtied again before the delalloc mapping can be
+ * converted. This stale delalloc mapping can trip up a later direct I/O read
+ * operation on the same region.
+ *
+ * We prevent this by truncating away the delalloc regions on the folio. Because
+ * they are delalloc, we can do this without needing a transaction. Indeed - if
+ * we get ENOSPC errors, we have to be able to do this truncation without a
+ * transaction as there is no space left for block reservation (typically why
+ * we see an ENOSPC in writeback).
+ */
+static void fuse_iomap_discard_folio(struct folio *folio, loff_t pos)
+{
+ struct inode *inode = folio->mapping->host;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ if (fuse_is_bad(inode))
+ return;
+
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ trace_fuse_iomap_discard_folio(inode, pos, folio_size(folio));
+
+ printk_ratelimited(KERN_ERR
+ "page discard on page %px, inode 0x%llx, pos %llu.",
+ folio, fi->orig_ino, pos);
+
+ /* XXX actually punch the new delalloc ranges? */
+}
+
+static const struct iomap_writeback_ops fuse_iomap_writeback_ops = {
+ .map_blocks = fuse_iomap_map_blocks,
+ .submit_ioend = fuse_iomap_submit_ioend,
+ .discard_folio = fuse_iomap_discard_folio,
+};
+
+static int fuse_iomap_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct fuse_writepage_ctx wpc = { };
+
+ ASSERT(fuse_has_iomap_pagecache(mapping->host));
+
+ trace_fuse_iomap_writepages(mapping->host, wbc);
+
+ return iomap_writepages(mapping, wbc, &wpc.ctx,
+ &fuse_iomap_writeback_ops);
+}
+
+static int fuse_iomap_read_folio(struct file *file, struct folio *folio)
+{
+ ASSERT(fuse_has_iomap_pagecache(file_inode(file)));
+
+ trace_fuse_iomap_read_folio(folio);
+
+ return iomap_read_folio(folio, &fuse_iomap_ops);
+}
+
+static void fuse_iomap_readahead(struct readahead_control *rac)
+{
+ ASSERT(fuse_has_iomap_pagecache(file_inode(rac->file)));
+
+ trace_fuse_iomap_readahead(rac);
+
+ iomap_readahead(rac, &fuse_iomap_ops);
+}
+
+const struct address_space_operations fuse_iomap_aops = {
+ .read_folio = fuse_iomap_read_folio,
+ .readahead = fuse_iomap_readahead,
+ .writepages = fuse_iomap_writepages,
+ .dirty_folio = iomap_dirty_folio,
+ .release_folio = iomap_release_folio,
+ .invalidate_folio = iomap_invalidate_folio,
+ .migrate_folio = filemap_migrate_folio,
+ .is_partially_uptodate = iomap_is_partially_uptodate,
+ .error_remove_folio = generic_error_remove_folio,
+
+ /* These aren't pagecache operations per se */
+ .bmap = fuse_bmap,
+ .direct_IO = fuse_direct_IO,
+};
+
+void fuse_iomap_init_pagecache(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ /* Manage timestamps ourselves, don't make the fuse server do it */
+ inode->i_flags &= ~S_NOCMTIME;
+ inode->i_flags &= ~S_NOATIME;
+ inode->i_data.a_ops = &fuse_iomap_aops;
+
+ INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
+ INIT_LIST_HEAD(&fi->ioend_list);
+ spin_lock_init(&fi->ioend_lock);
+}
+
+void fuse_iomap_destroy_pagecache(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+ ASSERT(list_empty(&fi->ioend_list));
+}
+
+/*
+ * Locking for serialisation of IO during page faults. This results in a lock
+ * ordering of:
+ *
+ * mmap_lock (MM)
+ * sb_start_pagefault(vfs, freeze)
+ * invalidate_lock (vfs - truncate serialisation)
+ * page_lock (MM)
+ * i_lock (FUSE - extent map serialisation)
+ */
+static vm_fault_t fuse_iomap_page_mkwrite(struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ vm_fault_t ret;
+
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ trace_fuse_iomap_page_mkwrite(vmf);
+
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vmf->vma->vm_file);
+
+ filemap_invalidate_lock_shared(mapping);
+ ret = iomap_page_mkwrite(vmf, &fuse_iomap_ops, NULL);
+ filemap_invalidate_unlock_shared(mapping);
+
+ sb_end_pagefault(inode->i_sb);
+ return ret;
+}
+
+static const struct vm_operations_struct fuse_iomap_vm_ops = {
+ .fault = filemap_fault,
+ .map_pages = filemap_map_pages,
+ .page_mkwrite = fuse_iomap_page_mkwrite,
+};
+
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct inode *inode = file_inode(file);
+
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ file_accessed(file);
+ vma->vm_ops = &fuse_iomap_vm_ops;
+ return 0;
+}
+
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ trace_fuse_iomap_buffered_read(iocb, to);
+
+ if (!iov_iter_count(to))
+ return 0; /* skip atime */
+
+ file_accessed(iocb->ki_filp);
+
+ ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+ if (ret)
+ return ret;
+ ret = generic_file_read_iter(iocb, to);
+ inode_unlock_shared(inode);
+
+ trace_fuse_iomap_buffered_read_end(iocb, to, ret);
+ return ret;
+}
+
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ loff_t pos = iocb->ki_pos;
+ ssize_t ret;
+
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ trace_fuse_iomap_buffered_write(iocb, from);
+
+ ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+ if (ret)
+ return ret;
+
+ ret = fuse_iomap_write_checks(iocb, from);
+ if (ret)
+ goto out_unlock;
+
+ if (inode->i_size < pos + iov_iter_count(from))
+ set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+ ret = iomap_file_buffered_write(iocb, from, &fuse_iomap_ops, NULL);
+
+ if (ret > 0)
+ fuse_write_update_attr(inode, pos + ret, ret);
+ clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+out_unlock:
+ inode_unlock(inode);
+
+ if (ret > 0) {
+ /* Handle various SYNC-type writes */
+ ret = generic_write_sync(iocb, ret);
+ }
+ trace_fuse_iomap_buffered_write_end(iocb, from, ret);
+ return ret;
+}
+
+static int
+fuse_iomap_truncate_page(
+ struct inode *inode,
+ loff_t pos,
+ bool *did_zero)
+{
+ return iomap_truncate_page(inode, pos, did_zero, &fuse_iomap_ops,
+ NULL);
+}
+/*
+ * Truncate file. Must have write permission and not be a directory.
+ *
+ * Caution: The caller of this function is responsible for calling
+ * setattr_prepare() or otherwise verifying the change is fine.
+ */
+static int
+fuse_iomap_setattr_size(
+ struct mnt_idmap *idmap,
+ struct dentry *dentry,
+ struct inode *inode,
+ struct iattr *iattr)
+{
+ loff_t oldsize, newsize;
+ int error;
+ bool did_zeroing = false;
+
+ //xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
+ ASSERT(S_ISREG(inode->i_mode));
+ ASSERT((iattr->ia_valid & (ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
+ ATTR_MTIME_SET|ATTR_TIMES_SET)) == 0);
+
+ oldsize = inode->i_size;
+ newsize = iattr->ia_size;
+
+ /*
+ * Wait for all direct I/O to complete.
+ */
+ inode_dio_wait(inode);
+
+ /*
+ * File data changes must be complete and flushed to disk before we
+ * call userspace to modify the inode.
+ *
+ * Start with zeroing any data beyond EOF that we may expose on file
+ * extension, or zeroing out the rest of the block on a downward
+ * truncate.
+ */
+ if (newsize > oldsize) {
+ trace_fuse_iomap_truncate_up(inode, oldsize, newsize - oldsize);
+
+ error = fuse_iomap_zero_range(inode, oldsize, newsize - oldsize,
+ &did_zeroing);
+ } else {
+ trace_fuse_iomap_truncate_down(inode, newsize,
+ oldsize - newsize);
+
+ error = fuse_iomap_truncate_page(inode, newsize, &did_zeroing);
+ }
+ if (error)
+ return error;
+
+ /*
+ * We've already locked out new page faults, so now we can safely
+ * remove pages from the page cache knowing they won't get refaulted
+ * until we drop the mapping invalidation lock after the extent
+ * manipulations are complete. The truncate_setsize() call also cleans
+ * folios spanning EOF on extending truncates and hence ensures
+ * sub-page block size filesystems are correctly handled, too.
+ *
+ * And we update in-core i_size and truncate page cache beyond newsize
+ * before writing back the whole file, so we're guaranteed not to write
+ * stale data past the new EOF on truncate down.
+ */
+ truncate_setsize(inode, newsize);
+
+ /*
+ * We are going to tell userspace to log the inode size change so any
+ * previous writes that are beyond the on disk EOF and the new EOF that
+ * have not been written out need to be written here. If we do not
+ * write the data out, we expose ourselves to the null files problem.
+ * Note that this includes any block zeroing we did above; otherwise
+ * those blocks may not be zeroed after a crash. It's really clumsy
+ * to flush the entire file, but we don't know the ondisk inode size
+ * so we use a big hammer instead.
+ */
+ if (did_zeroing || newsize > 0) {
+ error = filemap_write_and_wait(inode->i_mapping);
+ if (error)
+ return error;
+ }
+
+ return 0;
+}
+
+int
+fuse_iomap_setsize(
+ struct mnt_idmap *idmap,
+ struct dentry *dentry,
+ struct iattr *iattr)
+{
+ struct inode *inode = d_inode(dentry);
+ int error;
+
+ ASSERT(fuse_has_iomap(inode));
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ trace_fuse_iomap_setsize(inode, iattr->ia_size, 0);
+
+ error = inode_newsize_ok(inode, iattr->ia_size);
+ if (error)
+ return error;
+ return fuse_iomap_setattr_size(idmap, dentry, inode, iattr);
+}
+
+static int fuse_iomap_punch_range(struct inode *inode, loff_t offset,
+ loff_t length)
+{
+ loff_t isize = i_size_read(inode);
+ int error;
+
+ trace_fuse_iomap_punch_range(inode, offset, length);
+
+ /*
+ * Now that we've unmapped all full blocks we'll have to zero out any
+ * partial block at the beginning and/or end. iomap_zero_range is
+ * smart enough to skip holes and unwritten extents, including those we
+ * just created, but we must take care not to zero beyond EOF, which
+ * would enlarge i_size.
+ */
+ if (offset >= isize)
+ return 0;
+ if (offset + length > isize)
+ length = isize - offset;
+ error = fuse_iomap_zero_range(inode, offset, length, NULL);
+ if (error)
+ return error;
+
+ /*
+ * If we zeroed right up to EOF and EOF straddles a page boundary we
+ * must make sure that the post-EOF area is also zeroed because the
+ * page could be mmap'd and iomap_zero_range doesn't do that for us.
+ * Writeback of the eof page will do this, albeit clumsily.
+ */
+ if (offset + length >= isize && offset_in_page(offset + length) > 0) {
+ error = filemap_write_and_wait_range(inode->i_mapping,
+ round_down(offset + length, PAGE_SIZE),
+ LLONG_MAX);
+ }
+
+ return error;
+}
+
+int
+fuse_iomap_fallocate(
+ struct file *file,
+ int mode,
+ loff_t offset,
+ loff_t length,
+ loff_t new_size)
+{
+ struct inode *inode = file_inode(file);
+ int error;
+
+ ASSERT(fuse_has_iomap(inode));
+ ASSERT(fuse_has_iomap_pagecache(inode));
+
+ trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size);
+
+ /*
+ * If we unmapped blocks from the file range, then we zero the
+ * pagecache for those regions and push them to disk rather than make
+ * the fuse server manually zero the disk blocks.
+ */
+ if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) {
+ error = fuse_iomap_punch_range(inode, offset, length);
+ if (error)
+ return error;
+ }
+
+ /*
+ * If this is an extending write, we need to zero the bytes beyond the
+ * new EOF.
+ */
+ if (new_size) {
+ struct iattr iattr = {
+ .ia_valid = ATTR_SIZE,
+ .ia_size = new_size,
+ };
+
+ return fuse_iomap_setsize(file_mnt_idmap(file),
+ file_dentry(file), &iattr);
+ }
+
+ return 0;
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0b3ad7bf89b52d..2f185b7d9349b7 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -193,6 +193,9 @@ static void fuse_evict_inode(struct inode *inode)
WARN_ON(!list_empty(&fi->write_files));
WARN_ON(!list_empty(&fi->queued_writes));
}
+
+ if (S_ISREG(inode->i_mode) && fuse_has_iomap_pagecache(inode))
+ fuse_iomap_destroy_pagecache(inode);
}
static int fuse_reconfigure(struct fs_context *fsc)
@@ -1445,6 +1448,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
fc->iomap = 1;
if ((flags & FUSE_IOMAP_DIRECTIO) && fc->iomap)
fc->iomap_directio = 1;
+ if ((flags & FUSE_IOMAP_PAGECACHE) && fc->iomap)
+ fc->iomap_pagecache = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1517,7 +1522,7 @@ void fuse_send_init(struct fuse_mount *fm)
if (fuse_uring_enabled())
flags |= FUSE_OVER_IO_URING;
if (fuse_iomap_enabled())
- flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO;
+ flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_PAGECACHE;
ia->in.flags = flags;
ia->in.flags2 = flags >> 32;
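The INIT hunk above carries the 64-bit feature mask as two 32-bit fields, so the new FUSE_IOMAP_PAGECACHE bit (bit 45) ends up in flags2 rather than flags. A minimal userspace sketch of that arithmetic follows; the bit values are copied from the uapi hunk in this patch, while `struct sketch_init_flags` and both helpers are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions copied from the uapi hunk in this patch. */
#define FUSE_IOMAP		(1ULL << 43)
#define FUSE_IOMAP_DIRECTIO	(1ULL << 44)
#define FUSE_IOMAP_PAGECACHE	(1ULL << 45)

/* Hypothetical container for the two 32-bit halves of the feature mask. */
struct sketch_init_flags {
	uint32_t flags;		/* low 32 bits */
	uint32_t flags2;	/* high 32 bits */
};

/* Hypothetical helper: rebuild the 64-bit mask from its two halves. */
static uint64_t sketch_join_flags(struct sketch_init_flags in)
{
	return (uint64_t)in.flags | ((uint64_t)in.flags2 << 32);
}

/* Hypothetical helper: split the mask the way fuse_send_init() does. */
static struct sketch_init_flags sketch_split_flags(uint64_t flags)
{
	struct sketch_init_flags out = {
		.flags = (uint32_t)flags,
		.flags2 = (uint32_t)(flags >> 32),
	};

	/* Splitting and rejoining must not lose any bits. */
	assert(sketch_join_flags(out) == flags);
	return out;
}
```

Under these definitions all three iomap bits live in the upper half: bits 43, 44 and 45 of the mask become bits 11, 12 and 13 of flags2, and the low word carries none of them.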
* [PATCH 09/11] fuse: implement large folios for iomap pagecache files
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (7 preceding siblings ...)
2025-05-22 0:04 ` [PATCH 08/11] fuse: implement buffered " Darrick J. Wong
@ 2025-05-22 0:04 ` Darrick J. Wong
2025-05-22 0:05 ` [PATCH 10/11] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
` (2 subsequent siblings)
11 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:04 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
Use large folios when we're using iomap.
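As a rough illustration of what the hunk below computes: the minimum folio order is the number of page-size doublings needed for one folio to cover a whole filesystem block. The sketch is a userspace approximation that assumes 4 KiB pages (`SKETCH_PAGE_SHIFT` of 12); the helper name `sketch_min_folio_order` is hypothetical and not part of the patch.

```c
#include <assert.h>

/* Assumption: 4 KiB pages; the kernel uses the architecture's PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12u

/*
 * Hypothetical helper mirroring the arithmetic added to
 * fuse_iomap_init_pagecache(): a block larger than a page needs folios of
 * at least (blkbits - PAGE_SHIFT) order so one folio spans a full block.
 */
static unsigned int sketch_min_folio_order(unsigned int blkbits)
{
	assert(blkbits < 64);	/* sanity check for this sketch only */

	if (blkbits > SKETCH_PAGE_SHIFT)
		return blkbits - SKETCH_PAGE_SHIFT;
	return 0;
}
```

With these assumptions, a 64 KiB block size (blkbits of 16) gives order 4, that is, 16-page folios, while block sizes of 4 KiB or smaller leave the minimum order at 0.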
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file_iomap.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 345610768edc80..c58ac812598d8f 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1070,6 +1070,7 @@ const struct address_space_operations fuse_iomap_aops = {
void fuse_iomap_init_pagecache(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
+ unsigned int min_order = 0;
ASSERT(fuse_has_iomap(inode));
@@ -1081,6 +1082,11 @@ void fuse_iomap_init_pagecache(struct inode *inode)
INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
INIT_LIST_HEAD(&fi->ioend_list);
spin_lock_init(&fi->ioend_lock);
+
+ if (inode->i_blkbits > PAGE_SHIFT)
+ min_order = inode->i_blkbits - PAGE_SHIFT;
+
+ mapping_set_folio_min_order(inode->i_mapping, min_order);
}
void fuse_iomap_destroy_pagecache(struct inode *inode)
* [PATCH 10/11] fuse: use an unrestricted backing device with iomap pagecache io
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (8 preceding siblings ...)
2025-05-22 0:04 ` [PATCH 09/11] fuse: implement large folios for iomap pagecache files Darrick J. Wong
@ 2025-05-22 0:05 ` Darrick J. Wong
2025-05-22 0:05 ` [PATCH 11/11] fuse: advertise support for iomap Darrick J. Wong
2025-05-22 0:21 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
11 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:05 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
With iomap support turned on for the pagecache, the kernel issues
writeback directly to block devices and we no longer have to push all
those pages through the fuse device to userspace. Therefore, we don't
need the tight dirty limits (~1M) that are used for regular fuse. This
dramatically increases the performance of fuse's pagecache IO.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file_iomap.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index c58ac812598d8f..746d9ae192dc55 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -427,6 +427,28 @@ void fuse_iomap_init_reply(struct fuse_mount *fm)
if (sb->s_bdev)
__fuse_iomap_add_device(fc, sb->s_bdev_file);
+
+ if (fc->iomap_pagecache) {
+ struct backing_dev_info *old_bdi = sb->s_bdi;
+ char *suffix = sb->s_bdev ? "-fuseblk" : "-fuse";
+ int err;
+
+ /*
+ * sb->s_bdi points to the initial private bdi, but we want to
+ * redirect it to a new private bdi with default dirty and
+ * readahead settings because iomap writeback won't be pushing
+ * a ton of dirty data through the fuse device.
+ */
+ sb->s_bdi = &noop_backing_dev_info;
+ err = super_setup_bdi_name(sb, "%u:%u%s.iomap", MAJOR(fc->dev),
+ MINOR(fc->dev), suffix);
+ if (err) {
+ sb->s_bdi = old_bdi;
+ } else {
+ bdi_unregister(old_bdi);
+ bdi_put(old_bdi);
+ }
+ }
}
int fuse_iomap_add_device(struct fuse_conn *fc,
* [PATCH 11/11] fuse: advertise support for iomap
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (9 preceding siblings ...)
2025-05-22 0:05 ` [PATCH 10/11] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
@ 2025-05-22 0:05 ` Darrick J. Wong
2025-05-22 0:21 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
11 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:05 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John
From: Darrick J. Wong <djwong@kernel.org>
Advertise our new IO paths programmatically by creating an ioctl that
can return the capabilities of the kernel.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 4 ++++
include/uapi/linux/fuse.h | 13 +++++++++++++
fs/fuse/dev.c | 3 +++
fs/fuse/file_iomap.c | 18 ++++++++++++++++++
4 files changed, 38 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 8481b1d0299df0..5b14e8b23f305f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1683,6 +1683,9 @@ int fuse_iomap_setsize(struct mnt_idmap *idmap, struct dentry *dentry,
struct iattr *iattr);
int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
loff_t length, loff_t new_size);
+
+int fuse_iomap_ioc_support(struct file *file,
+ struct fuse_iomap_support __user *argp);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1706,6 +1709,7 @@ int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
# define fuse_iomap_buffered_write(...) (-ENOSYS)
# define fuse_iomap_setsize(...) (-ENOSYS)
# define fuse_iomap_fallocate(...) (-ENOSYS)
+# define fuse_iomap_ioc_support(...) (-ENOTTY)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c9402f2b2a335c..cbef70ae05c73b 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1135,12 +1135,25 @@ struct fuse_backing_map {
uint64_t padding;
};
+/* basic reporting functionality */
+#define FUSE_IOMAP_SUPPORT_BASICS (1ULL << 0)
+/* fuse driver can do direct io */
+#define FUSE_IOMAP_SUPPORT_DIRECTIO (1ULL << 1)
+/* fuse driver can do buffered io */
+#define FUSE_IOMAP_SUPPORT_PAGECACHE (1ULL << 2)
+struct fuse_iomap_support {
+ uint64_t flags;
+ uint64_t padding;
+};
+
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
#define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 3, \
+ struct fuse_iomap_support)
struct fuse_lseek_in {
uint64_t fh;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9d7064ec170cf6..91beafbbcf7c02 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2620,6 +2620,9 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
case FUSE_DEV_IOC_BACKING_CLOSE:
return fuse_dev_ioctl_backing_close(file, argp);
+ case FUSE_DEV_IOC_IOMAP_SUPPORT:
+ return fuse_iomap_ioc_support(file, argp);
+
default:
return -ENOTTY;
}
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 746d9ae192dc55..60e1242b32fd7c 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1425,3 +1425,21 @@ fuse_iomap_fallocate(
return 0;
}
+
+int fuse_iomap_ioc_support(struct file *file,
+ struct fuse_iomap_support __user *argp)
+{
+ struct fuse_iomap_support ios = { };
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (fuse_iomap_enabled())
+ ios.flags = FUSE_IOMAP_SUPPORT_BASICS |
+ FUSE_IOMAP_SUPPORT_DIRECTIO |
+ FUSE_IOMAP_SUPPORT_PAGECACHE;
+
+ if (copy_to_user(argp, &ios, sizeof(ios)))
+ return -EFAULT;
+ return 0;
+}
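
For illustration, a fuse server might probe for these capabilities roughly as follows. This is a sketch under stated assumptions, not part of the patch: the struct, flag, and ioctl definitions are copied from the uapi hunk above because they are not in any released header, and query_iomap_support() and iomap_supports() are hypothetical helper names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Copied from the include/uapi/linux/fuse.h hunk in this patch. */
#define FUSE_IOMAP_SUPPORT_BASICS	(1ULL << 0)
#define FUSE_IOMAP_SUPPORT_DIRECTIO	(1ULL << 1)
#define FUSE_IOMAP_SUPPORT_PAGECACHE	(1ULL << 2)

struct fuse_iomap_support {
	uint64_t flags;
	uint64_t padding;
};

#define FUSE_DEV_IOC_MAGIC		229
#define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 3, \
					     struct fuse_iomap_support)

/*
 * Hypothetical helper: ask the kernel which iomap features it offers.
 * fuse_fd is an open fuse device fd (e.g. /dev/fuse).  Returns 0 and
 * fills *flags on success, or a negative errno.
 */
static int query_iomap_support(int fuse_fd, uint64_t *flags)
{
	struct fuse_iomap_support ios;

	assert(flags != NULL);

	memset(&ios, 0, sizeof(ios));
	if (ioctl(fuse_fd, FUSE_DEV_IOC_IOMAP_SUPPORT, &ios) < 0)
		return -errno;

	*flags = ios.flags;
	return 0;
}

/* Hypothetical helper: true if every bit in wanted is set in flags. */
static int iomap_supports(uint64_t flags, uint64_t wanted)
{
	return (flags & wanted) == wanted;
}
```

Going by the code in this patch, a caller should expect -EPERM without CAP_SYS_ADMIN and -ENOTTY from a kernel that lacks the ioctl or was built without iomap support, and should treat both as "no iomap". A zero flags word on success means iomap is compiled in but currently disabled.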
* Re: [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance
2025-05-22 0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (10 preceding siblings ...)
2025-05-22 0:05 ` [PATCH 11/11] fuse: advertise support for iomap Darrick J. Wong
@ 2025-05-22 0:21 ` Darrick J. Wong
11 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2025-05-22 0:21 UTC (permalink / raw)
To: linux-xfs
Whoops, my eyes are tired and my script to strip out linux-xfs and cem
from the cc lines didn't work because I inverted a space and a comma in
the regexp.
Sorry about the noise. Can we move off patchbomb development already?
--D
On Wed, May 21, 2025 at 05:01:22PM -0700, Darrick J. Wong wrote:
> Hi all,
>
> This series connects fuse (the userspace filesystem layer) to fs-iomap to get
> fuse servers out of the business of handling file I/O themselves. By keeping
> the IO path mostly within the kernel, we can dramatically improve the speed of
> disk-based filesystems. This enables us to move all the filesystem metadata
> parsing code out of the kernel and into userspace, which means that we can
> containerize them for security without losing a lot of performance.
>
> If you're going to start using this code, I strongly recommend pulling
> from my git trees, which are linked below.
>
> This has been running on the djcloud for months with no problems. Enjoy!
> Comments and questions are, as always, welcome.
>
> --D
>
> kernel git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap
> ---
> Commits in this patchset:
> * fuse: fix livelock in synchronous file put from fuseblk workers
> * iomap: exit early when iomap_iter is called with zero length
> * fuse: implement the basic iomap mechanisms
> * fuse: add a notification to add new iomap devices
> * fuse: send FUSE_DESTROY to userspace when tearing down an iomap connection
> * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> * fuse: implement direct IO with iomap
> * fuse: implement buffered IO with iomap
> * fuse: implement large folios for iomap pagecache files
> * fuse: use an unrestricted backing device with iomap pagecache io
> * fuse: advertise support for iomap
> ---
> fs/fuse/fuse_i.h | 135 ++++
> fs/fuse/fuse_trace.h | 845 ++++++++++++++++++++++++++
> include/uapi/linux/fuse.h | 138 ++++
> fs/fuse/Kconfig | 23 +
> fs/fuse/Makefile | 1
> fs/fuse/dev.c | 26 +
> fs/fuse/dir.c | 14
> fs/fuse/file.c | 85 ++-
> fs/fuse/file_iomap.c | 1445 +++++++++++++++++++++++++++++++++++++++++++++
> fs/fuse/inode.c | 23 +
> fs/iomap/iter.c | 5
> 11 files changed, 2730 insertions(+), 10 deletions(-)
> create mode 100644 fs/fuse/file_iomap.c
>
>