linux-fsdevel.vger.kernel.org archive mirror
* [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
@ 2025-05-21 23:58 Darrick J. Wong
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                   ` (4 more replies)
  0 siblings, 5 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-21 23:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4,
	Theodore Ts'o

Hi everyone,

DO NOT MERGE THIS.

This is the very first request for comments of a prototype to connect
the Linux fuse driver to fs-iomap for regular file IO operations to and
from files whose contents persist to locally attached storage devices.

Why would you want to do that?  Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence.  Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.

willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code.  Eeeugh.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ioend calls within iomap are turned into upcalls
to the fuse server via a trio of new fuse commands.  This is suitable
for very simple filesystems that don't do tricky things with mappings
(e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
but solving that is for the next sprint.
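
To make the upcall idea concrete, here is a rough sketch of the kind of
question a server has to answer for an iomap_begin request: given a file
range, which mapping covers its start?  This is not code from the
patchset.  The toy_* names, the extent table, and the two mapping types
are made up for illustration, and the real request and reply layouts are
the ones defined in the UAPI header changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping types; the real values live in the patchset's UAPI. */
enum toy_map_type { TOY_HOLE, TOY_MAPPED };

/* One contiguous run of file bytes backed by device bytes. */
struct toy_extent {
	uint64_t file_off;	/* byte offset within the file */
	uint64_t dev_addr;	/* byte address on the device */
	uint64_t len;		/* length in bytes */
};

/* Reply to a begin-style query. */
struct toy_mapping {
	enum toy_map_type type;
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* bytes covered by the mapping */
	uint64_t addr;		/* device address, meaningful only if mapped */
};

/*
 * Find the mapping that covers the start of [pos, pos + count).
 * The extent table must be sorted by file_off and must not overlap.
 * The reply is trimmed to the request, which keeps the sketch simple.
 */
static struct toy_mapping toy_iomap_begin(const struct toy_extent *ext,
					  size_t nr, uint64_t pos,
					  uint64_t count)
{
	struct toy_mapping map = { TOY_HOLE, pos, count, 0 };
	size_t i;

	assert(count > 0);

	for (i = 0; i < nr; i++) {
		uint64_t end = ext[i].file_off + ext[i].len;

		if (pos >= end)
			continue;	/* extent ends before pos */

		if (pos < ext[i].file_off) {
			/* Hole that runs up to the next extent. */
			uint64_t gap = ext[i].file_off - pos;

			if (gap < count)
				map.length = gap;
			return map;
		}

		/* pos falls inside this extent. */
		map.type = TOY_MAPPED;
		map.addr = ext[i].dev_addr + (pos - ext[i].file_off);
		map.length = end - pos;
		if (map.length > count)
			map.length = count;
		return map;
	}

	/* Past the last extent: the whole request is a hole. */
	return map;
}
```

A filesystem with simple static mappings (the FAT/HFS case above) can
answer every request this way.  The trouble for ext4 starts when the
answer can change underneath writeback.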

With this overly simplistic RFC, I aim to show that it's possible to
build a fuse server for a real filesystem (ext4) that runs entirely in
userspace yet maintains most of its performance.  At this early stage I
get about 95% of the kernel ext4 driver's streaming directio performance
and 110% of its streaming buffered IO performance.
Random buffered IO suffers a 90% hit on writes due to unwritten extent
conversions.  Random direct IO is about 60% as fast as the kernel; see
the cover letter for the fuse2fs iomap changes for more details.

There are some major warts remaining:

1. The iomap cookie validation is not present, which can lead to subtle
races between pagecache zeroing and writeback on filesystems that
support unwritten and delalloc mappings.

2. Mappings ought to be cached in the kernel for more speed.

3. iomap doesn't support things like fscrypt or fsverity, and I haven't
yet figured out how inline data is supposed to work.

4. I would like to be able to turn on fuse+iomap on a per-inode basis,
which currently isn't possible because the kernel fuse driver will iget
inodes prior to calling FUSE_GETATTR to discover the properties of the
inode it just read.

5. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.

6. iomap is an inode-based service, not a file-based service.  This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
to index its incore inode, so we have to pass those too so that
notifications work properly.

I'll work on these in June, but for now here's an unmergeable RFC to
start some discussion.

--Darrick


* [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
@ 2025-05-22  0:01 ` Darrick J. Wong
  2025-05-22  0:02   ` [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
                     ` (10 more replies)
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                   ` (3 subsequent siblings)
  4 siblings, 11 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:01 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

Hi all,

This series connects fuse (the userspace filesystem layer) to fs-iomap to get
fuse servers out of the business of handling file I/O themselves.  By keeping
the IO path mostly within the kernel, we can dramatically improve the speed of
disk-based filesystems.  This enables us to move all the filesystem metadata
parsing code out of the kernel and into userspace, which means that we can
containerize them for security without losing a lot of performance.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap
---
Commits in this patchset:
 * fuse: fix livelock in synchronous file put from fuseblk workers
 * iomap: exit early when iomap_iter is called with zero length
 * fuse: implement the basic iomap mechanisms
 * fuse: add a notification to add new iomap devices
 * fuse: send FUSE_DESTROY to userspace when tearing down an iomap connection
 * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
 * fuse: implement direct IO with iomap
 * fuse: implement buffered IO with iomap
 * fuse: implement large folios for iomap pagecache files
 * fuse: use an unrestricted backing device with iomap pagecache io
 * fuse: advertise support for iomap
---
 fs/fuse/fuse_i.h          |  135 ++++
 fs/fuse/fuse_trace.h      |  845 ++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |  138 ++++
 fs/fuse/Kconfig           |   23 +
 fs/fuse/Makefile          |    1 
 fs/fuse/dev.c             |   26 +
 fs/fuse/dir.c             |   14 
 fs/fuse/file.c            |   85 ++-
 fs/fuse/file_iomap.c      | 1445 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c           |   23 +
 fs/iomap/iter.c           |    5 
 11 files changed, 2730 insertions(+), 10 deletions(-)
 create mode 100644 fs/fuse/file_iomap.c



* [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-05-22  0:01 ` Darrick J. Wong
  2025-05-22  0:05   ` [PATCH 1/8] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version Darrick J. Wong
                     ` (7 more replies)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                   ` (2 subsequent siblings)
  4 siblings, 8 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:01 UTC (permalink / raw)
  To: bschubert, djwong; +Cc: linux-fsdevel, bernd, John, joannelkoong, miklos

Hi all,

This series connects libfuse to the iomap-enabled fuse driver in Linux to get
fuse servers out of the business of handling file I/O themselves.  By keeping
the IO path mostly within the kernel, we can dramatically improve the speed of
disk-based filesystems.  This enables us to move all the filesystem metadata
parsing code out of the kernel and into userspace, which means that we can
containerize them for security without losing a lot of performance.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap
---
Commits in this patchset:
 * libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version
 * libfuse: add fuse commands for iomap_begin and end
 * libfuse: add upper level iomap commands
 * libfuse: add a notification to add a new device to iomap
 * libfuse: add iomap ioend low level handler
 * libfuse: add upper level iomap ioend commands
 * libfuse: add FUSE_IOMAP_PAGECACHE
 * libfuse: allow discovery of the kernel's iomap capabilities
---
 include/fuse.h          |   20 ++++++
 include/fuse_common.h   |   80 ++++++++++++++++++++++
 include/fuse_kernel.h   |   89 ++++++++++++++++++++++++-
 include/fuse_lowlevel.h |   95 ++++++++++++++++++++++++++
 lib/fuse.c              |  142 +++++++++++++++++++++++++++++++++++++++
 lib/fuse_lowlevel.c     |  170 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    2 +
 lib/meson.build         |    2 -
 8 files changed, 597 insertions(+), 3 deletions(-)



* [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-05-22  0:02 ` Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
                     ` (9 more replies)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
  4 siblings, 10 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:02 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

Hi all,

In preparation for connecting fuse, iomap, and fuse2fs for a much more
performant file IO path, make some changes to the Unix IO manager in
libext2fs so that we can have better IO.  First we start by making
filesystem flushes a lot more efficient by eliding fsyncs when they're
not necessary, and allowing library clients to turn off the racy code
that writes the superblock byte by byte but exposes stale checksums.
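
As a rough illustration of the fsync elision idea (this is not the
actual unix_io.c code; struct toy_channel and the toy_* helpers are
hypothetical names), the IO channel only needs to remember whether it
has written anything since the last flush:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical IO channel state; not the real libext2fs structure. */
struct toy_channel {
	int fd;			/* open device or image file */
	bool dirty;		/* written to since the last fsync? */
	unsigned long fsyncs;	/* successful fsync calls, for illustration */
};

/* Write to the device and remember that a flush is now owed. */
static ssize_t toy_pwrite(struct toy_channel *ch, const void *buf,
			  size_t len, off_t off)
{
	ssize_t ret;

	assert(ch != NULL);

	ret = pwrite(ch->fd, buf, len, off);
	if (ret > 0)
		ch->dirty = true;
	return ret;
}

/* Flush to stable storage, skipping the fsync if nothing was written. */
static int toy_flush(struct toy_channel *ch)
{
	assert(ch != NULL);

	if (!ch->dirty)
		return 0;	/* nothing written, so nothing to sync */

	if (fsync(ch->fd) < 0)
		return -1;	/* stay dirty so a later flush retries */

	ch->fsyncs++;
	ch->dirty = false;
	return 0;
}
```

Read-mostly clients then pay nothing for a flush, and a failed fsync
leaves the dirty flag set so that the next flush tries again.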

XXX: The second part of this series adds IO tagging so that we could tag
IOs by inode number to distinguish file data blocks in cache from
everything else.  This is temporary scaffolding whilst we're in the
middle of adding directio and later buffered writes.  Once we can use the
pagecache for all file IO activity I think we could drop the back half
of this series.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=libext2fs-iomap-prep
---
Commits in this patchset:
 * libext2fs: always fsync the device when flushing the cache
 * libext2fs: always fsync the device when closing the unix IO manager
 * libext2fs: only fsync the unix fd if we wrote to the device
 * libext2fs: invalidate cached blocks when freeing them
 * libext2fs: add tagged block IO for better caching
 * libext2fs: add tagged block IO caching to the unix IO manager
 * libext2fs: only flush affected blocks in unix_write_byte
 * libext2fs: allow unix_write_byte when the write would be aligned
 * libext2fs: allow clients to ask to write full superblocks
 * libext2fs: allow callers to disallow I/O to file data blocks
---
 lib/ext2fs/ext2_io.h         |   29 ++++
 lib/ext2fs/ext2fs.h          |    4 +
 debian/libext2fs2t64.symbols |    5 +
 lib/ext2fs/alloc_stats.c     |    7 +
 lib/ext2fs/closefs.c         |    7 +
 lib/ext2fs/fileio.c          |   26 +++-
 lib/ext2fs/io_manager.c      |   56 ++++++++
 lib/ext2fs/unix_io.c         |  281 +++++++++++++++++++++++++++++++++++-------
 8 files changed, 362 insertions(+), 53 deletions(-)



* [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
@ 2025-05-22  0:02 ` Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
                     ` (15 more replies)
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
  4 siblings, 16 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:02 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

Hi all,

Switch fuse2fs to use the new iomap file data IO paths instead of
pushing it very slowly through the /dev/fuse connection.  For local
filesystems, all we have to do is respond to requests for file to device
mappings; the rest of the IO hot path stays within the kernel.  This
means that we can get rid of all file data block processing within
fuse2fs.

Because we're not pinning dirty pages through a potentially slow network
connection, we don't need the heavy BDI throttling for which most fuse
servers have become infamous.  Yes, mapping lookups for writeback can
stall, but mappings are small as compared to data and this situation
exists for all kernel filesystems as well.

The performance of this new data path is quite stunning: on a warm
system, streaming reads and writes through the pagecache go from
60-90MB/s to 2-2.5GB/s.  Direct IO reads and writes improve from the
same baseline to 2.5-8GB/s.  FIEMAP and SEEK_DATA/SEEK_HOLE now work
too.  The kernel ext4 driver can manage about 1.6GB/s for pagecache IO
and about 2.6-8.5GB/s for direct IO, which means that fuse2fs is about
as fast as the kernel for streaming file IO.

Random 4k buffered IO is not so good: plain fuse2fs pokes along at
25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s.  The kernel
can do 900-1300MB/s.  Random directio is worse: plain fuse2fs does
20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does
40-55MB/s.  I suspect that metadata heavy workloads do not perform well
on fuse2fs because libext2fs wasn't designed for that and it doesn't
even have a journal to absorb all the fsync writes.  We also probably
need iomap caching really badly.

These performance numbers are slanted: my machine is 12 years old, and
fuse2fs is VERY poorly optimized for performance.  It contains a single
Big Filesystem Lock which nukes multi-threaded scalability.  There's no
inode cache nor is there a proper buffer cache, which means that fuse2fs
reads metadata in from disk and checksums it on EVERY ACCESS.  Sad!

Despite these gaps, this RFC demonstrates that it's feasible to run the
metadata parsing parts of a filesystem in userspace while not
sacrificing much performance.  We now have a vehicle to move the
filesystems out of the kernel, where they can be containerized so that
malicious filesystems can be contained, somewhat.

iomap mode also calls FUSE_DESTROY before unmounting the filesystem, so
for capable systems, fuse2fs doesn't need to run in fuseblk mode
anymore.

However, there are some major warts remaining:

1. The iomap cookie validation is not present, which can lead to subtle
races between pagecache zeroing and writeback on filesystems that
support unwritten and delalloc mappings.

2. Mappings ought to be cached in the kernel for more speed.

3. iomap doesn't support things like fscrypt or fsverity, and I haven't
yet figured out how inline data is supposed to work.

4. I would like to be able to turn on fuse+iomap on a per-inode basis,
which currently isn't possible because the kernel fuse driver will iget
inodes prior to calling FUSE_GETATTR to discover the properties of the
inode it just read.

5. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.

6. iomap is an inode-based service, not a file-based service.  This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
to index its incore inode, so we have to pass those too so that
notifications work properly.

I'll work on these in June, but for now here's an unmergeable RFC to
start some discussion.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap
---
Commits in this patchset:
 * fuse2fs: implement bare minimum iomap for file mapping reporting
 * fuse2fs: register block devices for use with iomap
 * fuse2fs: always use directio disk reads with fuse2fs
 * fuse2fs: implement directio file reads
 * fuse2fs: use tagged block IO for zeroing sub-block regions
 * fuse2fs: only flush the cache for the file under directio read
 * fuse2fs: add extent dump function for debugging
 * fuse2fs: implement direct write support
 * fuse2fs: turn on iomap for pagecache IO
 * fuse2fs: flush and invalidate the buffer cache on trim
 * fuse2fs: improve tracing for fallocate
 * fuse2fs: don't zero bytes in punch hole
 * fuse2fs: don't do file data block IO when iomap is enabled
 * fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
 * fuse2fs: re-enable the block device pagecache for metadata IO
 * fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
---
 configure       |   47 ++
 configure.ac    |   32 +
 lib/config.h.in |    3 
 misc/fuse2fs.c  | 1251 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 1312 insertions(+), 21 deletions(-)



* [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-05-22  0:02   ` Darrick J. Wong
  2025-05-29 11:08     ` Miklos Szeredi
  2025-05-22  0:02   ` [PATCH 02/11] iomap: exit early when iomap_iter is called with zero length Darrick J. Wong
                     ` (9 subsequent siblings)
  10 siblings, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:02 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

I observed a hang when running generic/323 against a fuseblk server.
This test opens a file, initiates a lot of AIO writes to that file
descriptor, and closes the file descriptor before the writes complete.
Unsurprisingly, the AIO exerciser threads are mostly stuck waiting for
responses from the fuseblk server:

# cat /proc/372265/task/372313/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_do_getattr+0xfc/0x1f0 [fuse]
[<0>] fuse_file_read_iter+0xbe/0x1c0 [fuse]
[<0>] aio_read+0x130/0x1e0
[<0>] io_submit_one+0x542/0x860
[<0>] __x64_sys_io_submit+0x98/0x1a0
[<0>] do_syscall_64+0x37/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

But the /weird/ part is that the fuseblk server threads are waiting for
responses from the server itself:

# cat /proc/372210/task/372232/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_file_put+0x9a/0xd0 [fuse]
[<0>] fuse_release+0x36/0x50 [fuse]
[<0>] __fput+0xec/0x2b0
[<0>] task_work_run+0x55/0x90
[<0>] syscall_exit_to_user_mode+0xe9/0x100
[<0>] do_syscall_64+0x43/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

The fuseblk server is fuse2fs so there's nothing all that exciting in
the server itself.  So why is the fuse server calling fuse_file_put?
The commit message for the fstest sheds some light on that:

"By closing the file descriptor before calling io_destroy, you pretty
much guarantee that the last put on the ioctx will be done in interrupt
context (during I/O completion)."

Aha.  AIO fgets a new struct file from the fd when it queues the ioctx.
The completion of the FUSE_WRITE command from userspace causes the fuse
server to call the AIO completion function.  The completion puts the
struct file, queuing a delayed fput to the fuse server task.  When the
fuse server task returns to userspace, it has to run the delayed fput,
which in the case of a fuseblk server, it does synchronously.

Sending the FUSE_RELEASE command synchronously from fuse server threads
is a bad idea because a client program can initiate enough simultaneous
AIOs such that all the fuse server threads end up in delayed_fput, and
now there aren't any threads left to handle the queued fuse commands.

Fix this by only using synchronous fputs for fuseblk servers if the
process doesn't have PF_LOCAL_THROTTLE.  Hopefully the fuseblk server
had the good sense to call PR_SET_IO_FLUSHER to mark itself as a
filesystem server.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/file.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
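
For reference, a minimal sketch of how a server thread could mark itself
with PR_SET_IO_FLUSHER, which is what sets PF_LOCAL_THROTTLE on the
calling task.  The toy_* helper names are made up for this example, the
call needs CAP_SYS_RESOURCE, and because the flag is per-task I would
expect each worker thread to have to make the call itself.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Fallback for older userspace headers; values match the Linux UAPI. */
#ifndef PR_SET_IO_FLUSHER
# define PR_SET_IO_FLUSHER	57
# define PR_GET_IO_FLUSHER	58
#endif

/* Returns 1 if the calling thread is an IO flusher, 0 if not, or -errno. */
static int toy_is_io_flusher(void)
{
	int ret = prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);

	return ret < 0 ? -errno : ret;
}

/*
 * Mark the calling thread as part of the IO path.  Returns 0 on success
 * or a negative errno (-EPERM without CAP_SYS_RESOURCE, -EINVAL on
 * kernels that predate the option).
 */
static int toy_mark_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1UL, 0UL, 0UL, 0UL) < 0)
		return -errno;

	/* Sanity check: the kernel should now report the flag as set. */
	assert(toy_is_io_flusher() == 1);
	return 0;
}
```

A server that gets -EPERM here keeps running, but the kernel cannot tell
it apart from an ordinary client, so it would still get synchronous
releases.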


diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 754378dd9f7159..ada1ed9e653e42 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -355,8 +355,16 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
 	 * Make the release synchronous if this is a fuseblk mount,
 	 * synchronous RELEASE is allowed (and desirable) in this case
 	 * because the server can be trusted not to screw up.
+	 *
+	 * If we're a LOCAL_THROTTLE thread, use the asynchronous put
+	 * because the current thread might be a fuse server.  This can
+	 * happen if a process starts some aio and closes the fd before
+	 * the aio completes.  Since aio takes its own ref to the file,
+	 * the IO completion has to drop the ref, which is how the fuse
+	 * server can end up closing its own clients' files.
 	 */
-	fuse_file_put(ff, ff->fm->fc->destroy);
+	fuse_file_put(ff, ff->fm->fc->destroy &&
+			  (current->flags & PF_LOCAL_THROTTLE) == 0);
 }
 
 void fuse_release_common(struct file *file, bool isdir)



* [PATCH 02/11] iomap: exit early when iomap_iter is called with zero length
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
  2025-05-22  0:02   ` [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
@ 2025-05-22  0:02   ` Darrick J. Wong
  2025-05-22  0:03   ` [PATCH 03/11] fuse: implement the basic iomap mechanisms Darrick J. Wong
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:02 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

If iomap_iter::len is zero on the first call to iomap_iter(), we should
just return zero instead of calling ->iomap_begin with zero count.  This
obviates the need for ->iomap_begin implementations to handle that
"correctly" by not returning a zero-length mapping.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/iter.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
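
The intended behavior can be modelled in userspace with a toy iterator.
This is only a sketch: struct toy_iter and the toy_* functions are
invented for illustration and are far simpler than the real iomap_iter,
which also handles partial advances and ->iomap_end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct iomap_iter. */
struct toy_iter {
	uint64_t len;		/* bytes left to process */
	uint64_t map_length;	/* current mapping length, 0 if none yet */
	unsigned int begin_calls;	/* times the begin hook has run */
};

/* Stand-in for ->iomap_begin: map everything that remains. */
static int toy_begin(struct toy_iter *it)
{
	it->begin_calls++;
	it->map_length = it->len;
	return 0;
}

/*
 * Returns 1 if the caller has a mapping to process, 0 when iteration is
 * finished, or a negative error code.
 */
static int toy_iter_next(struct toy_iter *it)
{
	int ret;

	assert(it != NULL);

	if (!it->map_length) {
		/* First call: never ask for a mapping of zero bytes. */
		if (it->len == 0)
			return 0;
		goto begin;
	}

	/* The toy caller always consumes the whole mapping. */
	it->len -= it->map_length;
	it->map_length = 0;
	if (it->len == 0)
		return 0;

begin:
	ret = toy_begin(it);
	if (ret < 0)
		return ret;
	return 1;
}
```

With a zero-length request the begin hook never runs, so implementations
do not need a special case for it.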


diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 6ffc6a7b9ba502..b86a6a08627126 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -66,8 +66,11 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 
 	trace_iomap_iter(iter, ops, _RET_IP_);
 
-	if (!iter->iomap.length)
+	if (!iter->iomap.length) {
+		if (iter->len == 0)
+			return 0;
 		goto begin;
+	}
 
 	/*
 	 * Calculate how far the iter was advanced and the original length bytes



* [PATCH 03/11] fuse: implement the basic iomap mechanisms
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
  2025-05-22  0:02   ` [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
  2025-05-22  0:02   ` [PATCH 02/11] iomap: exit early when iomap_iter is called with zero length Darrick J. Wong
@ 2025-05-22  0:03   ` Darrick J. Wong
  2025-05-29 22:15     ` Joanne Koong
  2025-05-22  0:03   ` [PATCH 04/11] fuse: add a notification to add new iomap devices Darrick J. Wong
                     ` (7 subsequent siblings)
  10 siblings, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:03 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

Implement functions to enable upcalling of iomap_begin and iomap_end to
userspace fuse servers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h          |   38 ++++++
 fs/fuse/fuse_trace.h      |  258 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |   87 ++++++++++++++
 fs/fuse/Kconfig           |   23 ++++
 fs/fuse/Makefile          |    1 
 fs/fuse/file_iomap.c      |  280 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c           |    5 +
 7 files changed, 691 insertions(+), 1 deletion(-)
 create mode 100644 fs/fuse/file_iomap.c


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d56d4fd956db99..aa51f25856697d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -895,6 +895,9 @@ struct fuse_conn {
 	/* Is link not implemented by fs? */
 	unsigned int no_link:1;
 
+	/* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
+	unsigned int iomap:1;
+
 	/* Use io_uring for communication */
 	unsigned int io_uring;
 
@@ -1017,6 +1020,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
 	return sb->s_fs_info;
 }
 
+static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
 {
 	return get_fuse_mount_super(sb)->fc;
@@ -1027,16 +1035,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
 	return get_fuse_mount_super(inode->i_sb);
 }
 
+static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
+{
+	return get_fuse_mount_super_c(inode->i_sb);
+}
+
 static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
 {
 	return get_fuse_mount_super(inode->i_sb)->fc;
 }
 
+static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
+{
+	return get_fuse_mount_super_c(inode->i_sb)->fc;
+}
+
 static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
 {
 	return container_of(inode, struct fuse_inode, inode);
 }
 
+static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
+{
+	return container_of(inode, struct fuse_inode, inode);
+}
+
 static inline u64 get_node_id(struct inode *inode)
 {
 	return get_fuse_inode(inode)->nodeid;
@@ -1577,4 +1600,19 @@ extern void fuse_sysctl_unregister(void);
 #define fuse_sysctl_unregister()	do { } while (0)
 #endif /* CONFIG_SYSCTL */
 
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+# include <linux/fiemap.h>
+# include <linux/iomap.h>
+
+bool fuse_iomap_enabled(void);
+
+static inline bool fuse_has_iomap(const struct inode *inode)
+{
+	return get_fuse_conn_c(inode)->iomap;
+}
+#else
+# define fuse_iomap_enabled(...)		(false)
+# define fuse_has_iomap(...)			(false)
+#endif
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index bbe9ddd8c71696..f9a316c9788e06 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -58,6 +58,8 @@
 	EM( FUSE_SYNCFS,		"FUSE_SYNCFS")		\
 	EM( FUSE_TMPFILE,		"FUSE_TMPFILE")		\
 	EM( FUSE_STATX,			"FUSE_STATX")		\
+	EM( FUSE_IOMAP_BEGIN,		"FUSE_IOMAP_BEGIN")	\
+	EM( FUSE_IOMAP_END,		"FUSE_IOMAP_END")	\
 	EMe(CUSE_INIT,			"CUSE_INIT")
 
 /*
@@ -124,6 +126,262 @@ TRACE_EVENT(fuse_request_end,
 		  __entry->unique, __entry->len, __entry->error)
 );
 
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+
+#define FUSE_IOMAP_F_STRINGS \
+	{ FUSE_IOMAP_F_NEW,			"new" }, \
+	{ FUSE_IOMAP_F_DIRTY,			"dirty" }, \
+	{ FUSE_IOMAP_F_SHARED,			"shared" }, \
+	{ FUSE_IOMAP_F_MERGED,			"merged" }, \
+	{ FUSE_IOMAP_F_XATTR,			"xattr" }, \
+	{ FUSE_IOMAP_F_BOUNDARY,		"boundary" }, \
+	{ FUSE_IOMAP_F_ANON_WRITE,		"anon_write" }, \
+	{ FUSE_IOMAP_F_ATOMIC_BIO,		"atomic" }, \
+	{ FUSE_IOMAP_F_WANT_IOMAP_END,		"iomap_end" }, \
+	{ FUSE_IOMAP_F_SIZE_CHANGED,		"append" }, \
+	{ FUSE_IOMAP_F_STALE,			"stale" }
+
+#define FUSE_IOMAP_OP_STRINGS \
+	{ FUSE_IOMAP_OP_WRITE,			"write" }, \
+	{ FUSE_IOMAP_OP_ZERO,			"zero" }, \
+	{ FUSE_IOMAP_OP_REPORT,			"report" }, \
+	{ FUSE_IOMAP_OP_FAULT,			"fault" }, \
+	{ FUSE_IOMAP_OP_DIRECT,			"direct" }, \
+	{ FUSE_IOMAP_OP_NOWAIT,			"nowait" }, \
+	{ FUSE_IOMAP_OP_OVERWRITE_ONLY,		"overwrite" }, \
+	{ FUSE_IOMAP_OP_UNSHARE,		"unshare" }, \
+	{ FUSE_IOMAP_OP_ATOMIC,			"atomic" }, \
+	{ FUSE_IOMAP_OP_DONTCACHE,		"dontcache" }
+
+#define FUSE_IOMAP_TYPE_STRINGS \
+	{ FUSE_IOMAP_TYPE_PURE_OVERWRITE,	"overwrite" }, \
+	{ FUSE_IOMAP_TYPE_HOLE,			"hole" }, \
+	{ FUSE_IOMAP_TYPE_DELALLOC,		"delalloc" }, \
+	{ FUSE_IOMAP_TYPE_MAPPED,		"mapped" }, \
+	{ FUSE_IOMAP_TYPE_UNWRITTEN,		"unwritten" }, \
+	{ FUSE_IOMAP_TYPE_INLINE,		"inline" }
+
+TRACE_EVENT(fuse_iomap_begin,
+	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+		 unsigned opflags),
+
+	TP_ARGS(inode, pos, count, opflags),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(loff_t,		count)
+		__field(unsigned,	opflags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->pos		=	pos;
+		__entry->count		=	count;
+		__entry->opflags	=	opflags;
+	),
+
+	TP_printk("connection %u ino %llu opflags (%s) pos 0x%llx count 0x%llx",
+		  __entry->connection, __entry->ino,
+		  __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+		  __entry->pos, __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_begin_error,
+	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+		 unsigned opflags, int error),
+
+	TP_ARGS(inode, pos, count, opflags, error),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(loff_t,		count)
+		__field(unsigned,	opflags)
+		__field(int,		error)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->pos		=	pos;
+		__entry->count		=	count;
+		__entry->opflags	=	opflags;
+		__entry->error		=	error;
+	),
+
+	TP_printk("connection %u ino %llu opflags (%s) pos 0x%llx count 0x%llx err %d",
+		  __entry->connection, __entry->ino,
+		  __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+		  __entry->pos, __entry->count, __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_read_map,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_begin_out *outarg),
+
+	TP_ARGS(inode, outarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		offset)
+		__field(loff_t,		length)
+		__field(uint32_t,	dev)
+		__field(uint64_t,	addr)
+		__field(uint16_t,	type)
+		__field(uint16_t,	mapflags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->offset		=	outarg->offset;
+		__entry->length		=	outarg->length;
+		__entry->dev		=	outarg->read_dev;
+		__entry->addr		=	outarg->read_addr;
+		__entry->type		=	outarg->read_type;
+		__entry->mapflags	=	outarg->read_flags;
+	),
+
+	TP_printk("connection %u ino %llu read offset 0x%llx count 0x%llx dev %u addr 0x%llx type %s mapflags (%s)",
+		  __entry->connection, __entry->ino, __entry->offset,
+		  __entry->length, __entry->dev, __entry->addr,
+		  __print_symbolic(__entry->type, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS))
+);
+
+TRACE_EVENT(fuse_iomap_write_map,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_begin_out *outarg),
+
+	TP_ARGS(inode, outarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		offset)
+		__field(loff_t,		length)
+		__field(uint32_t,	dev)
+		__field(uint64_t,	addr)
+		__field(uint16_t,	type)
+		__field(uint16_t,	mapflags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->offset		=	outarg->offset;
+		__entry->length		=	outarg->length;
+		__entry->dev		=	outarg->write_dev;
+		__entry->addr		=	outarg->write_addr;
+		__entry->type		=	outarg->write_type;
+		__entry->mapflags	=	outarg->write_flags;
+	),
+
+	TP_printk("connection %u ino %llu write offset 0x%llx count 0x%llx dev %u addr 0x%llx type %s mapflags (%s)",
+		  __entry->connection, __entry->ino, __entry->offset,
+		  __entry->length, __entry->dev, __entry->addr,
+		  __print_symbolic(__entry->type, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS))
+);
+
+TRACE_EVENT(fuse_iomap_end,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_end_in *inarg),
+
+	TP_ARGS(inode, inarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(loff_t,		count)
+		__field(unsigned,	opflags)
+		__field(size_t,		written)
+
+		__field(uint32_t,	dev)
+		__field(uint64_t,	addr)
+		__field(uint16_t,	type)
+		__field(uint16_t,	mapflags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->pos		=	inarg->pos;
+		__entry->count		=	inarg->count;
+		__entry->opflags	=	inarg->opflags;
+		__entry->written	=	inarg->written;
+		__entry->dev		=	inarg->map_dev;
+		__entry->addr		=	inarg->map_addr;
+		__entry->type		=	inarg->map_type;
+		__entry->mapflags	=	inarg->map_flags;
+	),
+
+	TP_printk("connection %u ino %llu opflags (%s) pos 0x%llx count 0x%llx written %zd dev %u addr 0x%llx type 0x%x mapflags (%s)",
+		  __entry->connection, __entry->ino,
+		  __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+		  __entry->pos, __entry->count, __entry->written, __entry->dev,
+		  __entry->addr, __entry->type,
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS))
+);
+
+TRACE_EVENT(fuse_iomap_end_error,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_end_in *inarg, int error),
+
+	TP_ARGS(inode, inarg, error),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(loff_t,		count)
+		__field(unsigned,	opflags)
+		__field(size_t,		written)
+		__field(int,		error)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->pos		=	inarg->pos;
+		__entry->count		=	inarg->count;
+		__entry->opflags	=	inarg->opflags;
+		__entry->written	=	inarg->written;
+		__entry->error		=	error;
+	),
+
+	TP_printk("connection %u ino %llu opflags (%s) pos 0x%llx count 0x%llx written %zd error %d",
+		  __entry->connection, __entry->ino,
+		  __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+		  __entry->pos, __entry->count, __entry->written,
+		  __entry->error)
+);
+#endif /* CONFIG_FUSE_IOMAP */
+
 #endif /* _TRACE_FUSE_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5ec43ecbceb783..ce6c9960f2418f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -232,6 +232,10 @@
  *
  *  7.43
  *  - add FUSE_REQUEST_TIMEOUT
+ *
+ *  7.44
+ *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
+ *    SEEK_{DATA,HOLE} support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -267,7 +271,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 43
+#define FUSE_KERNEL_MINOR_VERSION 44
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -440,6 +444,8 @@ struct fuse_file_lock {
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
  * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
  *			 init_out.request_timeout contains the timeout (in secs)
+ * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
+ *	       operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -487,6 +493,7 @@ struct fuse_file_lock {
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
+#define FUSE_IOMAP		(1ULL << 43)
 
 /**
  * CUSE INIT request/reply flags
@@ -655,6 +662,9 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_BEGIN	= 4094,
+	FUSE_IOMAP_END		= 4095,
+
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
 
@@ -1286,4 +1296,79 @@ struct fuse_uring_cmd_req {
 	uint8_t padding[6];
 };
 
+#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
+#define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
+#define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
+#define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
+#define FUSE_IOMAP_TYPE_UNWRITTEN	3	/* blocks allocated at @addr in unwritten state */
+#define FUSE_IOMAP_TYPE_INLINE		4	/* data inline in the inode */
+
+#define FUSE_IOMAP_DEV_FUSEBLK		(0U)	/* fuseblk sb_dev device cookie */
+#define FUSE_IOMAP_DEV_NULL		(~0U)	/* null device cookie */
+
+#define FUSE_IOMAP_F_NEW		(1U << 0)
+#define FUSE_IOMAP_F_DIRTY		(1U << 1)
+#define FUSE_IOMAP_F_SHARED		(1U << 2)
+#define FUSE_IOMAP_F_MERGED		(1U << 3)
+#define FUSE_IOMAP_F_XATTR		(1U << 5)
+#define FUSE_IOMAP_F_BOUNDARY		(1U << 6)
+#define FUSE_IOMAP_F_ANON_WRITE		(1U << 7)
+#define FUSE_IOMAP_F_ATOMIC_BIO		(1U << 8)
+#define FUSE_IOMAP_F_WANT_IOMAP_END	(1U << 12) /* want ->iomap_end call */
+
+/* only for iomap_end */
+#define FUSE_IOMAP_F_SIZE_CHANGED	(1U << 14)
+#define FUSE_IOMAP_F_STALE		(1U << 15)
+
+#define FUSE_IOMAP_OP_WRITE		(1 << 0) /* writing, must allocate blocks */
+#define FUSE_IOMAP_OP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
+#define FUSE_IOMAP_OP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
+#define FUSE_IOMAP_OP_FAULT		(1 << 3) /* mapping for page fault */
+#define FUSE_IOMAP_OP_DIRECT		(1 << 4) /* direct I/O */
+#define FUSE_IOMAP_OP_NOWAIT		(1 << 5) /* do not block */
+#define FUSE_IOMAP_OP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
+#define FUSE_IOMAP_OP_UNSHARE		(1 << 7) /* unshare_file_range */
+#define FUSE_IOMAP_OP_ATOMIC		(1 << 9) /* torn-write protection */
+#define FUSE_IOMAP_OP_DONTCACHE		(1 << 10) /* don't retain pagecache */
+
+#define FUSE_IOMAP_NULL_ADDR		(-1ULL)	/* addr is not valid */
+
+struct fuse_iomap_begin_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+};
+
+struct fuse_iomap_begin_out {
+	uint64_t offset;	/* file offset of mapping, bytes */
+	uint64_t length;	/* length of both mappings, bytes */
+
+	uint64_t read_addr;	/* disk offset of mapping, bytes */
+	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t read_dev;	/* device cookie */
+
+	uint64_t write_addr;	/* disk offset of mapping, bytes */
+	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t write_dev;	/* device cookie */
+};
+
+struct fuse_iomap_end_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+	int64_t written;	/* bytes processed */
+
+	uint64_t map_length;	/* length of mapping, bytes */
+	uint64_t map_addr;	/* disk offset of mapping, bytes */
+	uint16_t map_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t map_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t map_dev;	/* device cookie */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index ca215a3cba3e31..fc7c5bf1cef52d 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -64,6 +64,29 @@ config FUSE_PASSTHROUGH
 
 	  If you want to allow passthrough operations, answer Y.
 
+config FUSE_IOMAP
+	bool "FUSE file IO over iomap"
+	default y
+	depends on FUSE_FS
+	select FS_IOMAP
+	help
+	  For supported fuseblk servers, this lets the kernel perform file IO
+	  through iomap, using file mappings supplied by the fuse server.
+
+config FUSE_IOMAP_BY_DEFAULT
+	bool "FUSE file I/O over iomap by default"
+	default n
+	depends on FUSE_IOMAP
+	help
+	  Enable sending FUSE file I/O over iomap by default.
+
+config FUSE_IOMAP_DEBUG
+	bool "Debug FUSE file IO over iomap"
+	default n
+	depends on FUSE_IOMAP
+	help
+	  Enable debugging assertions for the fuse iomap code paths.
+
 config FUSE_IO_URING
 	bool "FUSE communication over io-uring"
 	default y
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3f0f312a31c1cc..63a41ef9336aaa 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -16,5 +16,6 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
 fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
 fuse-$(CONFIG_SYSCTL) += sysctl.o
 fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
+fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
 
 virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
new file mode 100644
index 00000000000000..dfa0c309803113
--- /dev/null
+++ b/fs/fuse/file_iomap.c
@@ -0,0 +1,280 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "fuse_i.h"
+#include "fuse_trace.h"
+#include <linux/iomap.h>
+
+static bool __read_mostly enable_iomap =
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
+	true;
+#else
+	false;
+#endif
+module_param(enable_iomap, bool, 0644);
+MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap");
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+# define ASSERT(a)	do { WARN_ON(!(a)); } while (0)
+#else
+# define ASSERT(a)
+#endif
+
+bool fuse_iomap_enabled(void)
+{
+	return enable_iomap;
+}
+
+static inline bool fuse_iomap_check_type(uint16_t type)
+{
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_HOLE	!= IOMAP_HOLE);
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_DELALLOC	!= IOMAP_DELALLOC);
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_MAPPED	!= IOMAP_MAPPED);
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_UNWRITTEN	!= IOMAP_UNWRITTEN);
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_INLINE	!= IOMAP_INLINE);
+
+	switch (type) {
+	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+	case FUSE_IOMAP_TYPE_HOLE:
+	case FUSE_IOMAP_TYPE_DELALLOC:
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+	case FUSE_IOMAP_TYPE_INLINE:
+		return true;
+	}
+
+	return false;
+}
+
+#define FUSE_IOMAP_F_ALL (FUSE_IOMAP_F_NEW | \
+			  FUSE_IOMAP_F_DIRTY | \
+			  FUSE_IOMAP_F_SHARED | \
+			  FUSE_IOMAP_F_MERGED | \
+			  FUSE_IOMAP_F_XATTR | \
+			  FUSE_IOMAP_F_BOUNDARY | \
+			  FUSE_IOMAP_F_ANON_WRITE | \
+			  FUSE_IOMAP_F_ATOMIC_BIO | \
+			  FUSE_IOMAP_F_WANT_IOMAP_END)
+
+static inline bool fuse_iomap_check_flags(uint16_t flags)
+{
+	BUILD_BUG_ON(FUSE_IOMAP_F_NEW		!= IOMAP_F_NEW);
+	BUILD_BUG_ON(FUSE_IOMAP_F_DIRTY		!= IOMAP_F_DIRTY);
+	BUILD_BUG_ON(FUSE_IOMAP_F_SHARED	!= IOMAP_F_SHARED);
+	BUILD_BUG_ON(FUSE_IOMAP_F_MERGED	!= IOMAP_F_MERGED);
+	BUILD_BUG_ON(FUSE_IOMAP_F_XATTR		!= IOMAP_F_XATTR);
+	BUILD_BUG_ON(FUSE_IOMAP_F_BOUNDARY	!= IOMAP_F_BOUNDARY);
+	BUILD_BUG_ON(FUSE_IOMAP_F_ANON_WRITE	!= IOMAP_F_ANON_WRITE);
+	BUILD_BUG_ON(FUSE_IOMAP_F_ATOMIC_BIO	!= IOMAP_F_ATOMIC_BIO);
+	BUILD_BUG_ON(FUSE_IOMAP_F_WANT_IOMAP_END != IOMAP_F_PRIVATE);
+
+	return (flags & ~FUSE_IOMAP_F_ALL) == 0;
+}
+
+/* Check the incoming mappings to make sure they're not nonsense */
+static inline int fuse_iomap_validate(const struct fuse_iomap_begin_out *outarg,
+				      unsigned opflags, loff_t pos)
+{
+	BUILD_BUG_ON(FUSE_IOMAP_OP_WRITE	!= IOMAP_WRITE);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_ZERO		!= IOMAP_ZERO);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_REPORT	!= IOMAP_REPORT);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_FAULT	!= IOMAP_FAULT);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_DIRECT	!= IOMAP_DIRECT);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_NOWAIT	!= IOMAP_NOWAIT);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_OVERWRITE_ONLY != IOMAP_OVERWRITE_ONLY);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_UNSHARE	!= IOMAP_UNSHARE);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_ATOMIC	!= IOMAP_ATOMIC);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_DONTCACHE	!= IOMAP_DONTCACHE);
+
+	if (outarg->read_dev == FUSE_IOMAP_DEV_NULL) {
+		ASSERT(outarg->read_dev != FUSE_IOMAP_DEV_NULL);
+		return -EIO;
+	}
+	if (outarg->write_dev == FUSE_IOMAP_DEV_NULL) {
+		ASSERT(outarg->write_dev != FUSE_IOMAP_DEV_NULL);
+		return -EIO;
+	}
+	if (outarg->offset > pos) {
+		ASSERT(outarg->offset <= pos);
+		return -EIO;
+	}
+	if (outarg->length == 0) {
+		ASSERT(outarg->length != 0);
+		return -EIO;
+	}
+	if (outarg->offset + outarg->length <= pos) {
+		ASSERT(outarg->offset + outarg->length > pos);
+		return -EIO;
+	}
+	if (!fuse_iomap_check_type(outarg->write_type)) {
+		ASSERT(fuse_iomap_check_type(outarg->write_type));
+		return -EIO;
+	}
+	if (!fuse_iomap_check_flags(outarg->write_flags)) {
+		ASSERT(fuse_iomap_check_flags(outarg->write_flags));
+		return -EIO;
+	}
+	if (!fuse_iomap_check_type(outarg->read_type)) {
+		ASSERT(fuse_iomap_check_type(outarg->read_type));
+		return -EIO;
+	}
+	if (!fuse_iomap_check_flags(outarg->read_flags)) {
+		ASSERT(fuse_iomap_check_flags(outarg->read_flags));
+		return -EIO;
+	}
+
+	if (!(opflags & FUSE_IOMAP_OP_REPORT)) {
+		/*
+		 * XXX inline data reads and writes are not supported, how do
+		 * we do this?
+		 */
+		ASSERT(outarg->read_type != FUSE_IOMAP_TYPE_INLINE);
+		ASSERT(outarg->write_type != FUSE_IOMAP_TYPE_INLINE);
+
+		if (outarg->read_type == FUSE_IOMAP_TYPE_INLINE)
+			return -EIO;
+		if (outarg->write_type == FUSE_IOMAP_TYPE_INLINE)
+			return -EIO;
+	}
+
+	return 0;
+}
+
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
+			    unsigned opflags, struct iomap *iomap,
+			    struct iomap *srcmap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_begin_in inarg = {
+		.attr_ino = fi->orig_ino,
+		.opflags = opflags,
+		.pos = pos,
+		.count = count,
+	};
+	struct fuse_iomap_begin_out outarg = { };
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	FUSE_ARGS(args);
+	int err;
+
+	trace_fuse_iomap_begin(inode, pos, count, opflags);
+
+	args.opcode = FUSE_IOMAP_BEGIN;
+	args.nodeid = get_node_id(inode);
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	args.out_numargs = 1;
+	args.out_args[0].size = sizeof(outarg);
+	args.out_args[0].value = &outarg;
+	err = fuse_simple_request(fm, &args);
+	if (err) {
+		trace_fuse_iomap_begin_error(inode, pos, count, opflags, err);
+		return err;
+	}
+
+	trace_fuse_iomap_read_map(inode, &outarg);
+	trace_fuse_iomap_write_map(inode, &outarg);
+
+	err = fuse_iomap_validate(&outarg, opflags, pos);
+	if (err)
+		return err;
+
+	if ((opflags & IOMAP_WRITE) &&
+	    outarg.write_type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+		/*
+		 * For an out of place write, we must supply the write mapping
+		 * via @iomap, and the read mapping via @srcmap.
+		 */
+		iomap->addr = outarg.write_addr;
+		iomap->offset = outarg.offset;
+		iomap->length = outarg.length;
+		iomap->type = outarg.write_type;
+		iomap->flags = outarg.write_flags;
+		iomap->bdev = inode->i_sb->s_bdev;
+
+		srcmap->addr = outarg.read_addr;
+		srcmap->offset = outarg.offset;
+		srcmap->length = outarg.length;
+		srcmap->type = outarg.read_type;
+		srcmap->flags = outarg.read_flags;
+		srcmap->bdev = inode->i_sb->s_bdev;
+	} else {
+		/*
+		 * For everything else (reads, reporting, and pure overwrites),
+		 * we can return the sole mapping through @iomap and leave
+		 * @srcmap unchanged from its default (HOLE).
+		 */
+		iomap->addr = outarg.read_addr;
+		iomap->offset = outarg.offset;
+		iomap->length = outarg.length;
+		iomap->type = outarg.read_type;
+		iomap->flags = outarg.read_flags;
+		iomap->bdev = inode->i_sb->s_bdev;
+	}
+
+	return 0;
+}
+
+static bool fuse_want_iomap_end(const struct iomap *iomap, unsigned int opflags,
+				loff_t count, ssize_t written)
+{
+	/* Caller demanded an iomap_end call. */
+	if (iomap->flags & FUSE_IOMAP_F_WANT_IOMAP_END)
+		return true;
+
+	/* Reads and reporting should never affect the filesystem metadata */
+	if (!(opflags & (IOMAP_WRITE | IOMAP_ZERO)))
+		return false;
+
+	/* Appending writes get an iomap_end call */
+	if (iomap->flags & IOMAP_F_SIZE_CHANGED)
+		return true;
+
+	/* Short writes get an iomap_end call to clean up delalloc */
+	return written < count;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
+			  ssize_t written, unsigned opflags,
+			  struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_end_in inarg = {
+		.opflags = opflags,
+		.attr_ino = fi->orig_ino,
+		.pos = pos,
+		.count = count,
+		.written = written,
+
+		.map_addr = iomap->addr,
+		.map_length = iomap->length,
+		.map_type = iomap->type,
+		.map_flags = iomap->flags,
+	};
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	FUSE_ARGS(args);
+	int err;
+
+	if (!fuse_want_iomap_end(iomap, opflags, count, written))
+		return 0;
+
+	trace_fuse_iomap_end(inode, &inarg);
+
+	args.opcode = FUSE_IOMAP_END;
+	args.nodeid = get_node_id(inode);
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	err = fuse_simple_request(fm, &args);
+
+	trace_fuse_iomap_end_error(inode, &inarg, err);
+
+	return err;
+}
+
+const struct iomap_ops fuse_iomap_ops = {
+	.iomap_begin		= fuse_iomap_begin,
+	.iomap_end		= fuse_iomap_end,
+};
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index fd48e8d37f2edc..88730d26c9b5e2 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1438,6 +1438,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 
 			if (flags & FUSE_REQUEST_TIMEOUT)
 				timeout = arg->request_timeout;
+
+			if ((flags & FUSE_IOMAP) && fuse_iomap_enabled())
+				fc->iomap = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1506,6 +1509,8 @@ void fuse_send_init(struct fuse_mount *fm)
 	 */
 	if (fuse_uring_enabled())
 		flags |= FUSE_OVER_IO_URING;
+	if (fuse_iomap_enabled())
+		flags |= FUSE_IOMAP;
 
 	ia->in.flags = flags;
 	ia->in.flags2 = flags >> 32;

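For anyone writing a fuse server against this, the range checks that
fuse_iomap_validate() applies to a FUSE_IOMAP_BEGIN reply could be
mirrored on the server side before replying, so that a bad mapping shows
up as a server bug rather than as -EIO from the kernel.  What follows is
only a rough sketch under assumptions: struct sketch_mapping and
sketch_mapping_covers() are made-up names that do not exist in the patch
or in libfuse, and the sketch covers only the offset/length checks, not
the type, flag, or device cookie checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical server-side view of the offset/length pair in a reply. */
struct sketch_mapping {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
};

/*
 * Return true if the mapping is non-empty and contains @pos, which is
 * what the kernel-side validation expects of a FUSE_IOMAP_BEGIN reply.
 */
bool sketch_mapping_covers(const struct sketch_mapping *m, uint64_t pos)
{
	assert(m != NULL);

	if (m->length == 0)
		return false;
	if (m->offset > pos)
		return false;

	/*
	 * Compare the distance into the mapping against the length rather
	 * than computing offset + length, which could wrap around.
	 */
	return pos - m->offset < m->length;
}
```

A server would call something like this with the pos it received in
fuse_iomap_begin_in just before sending its reply.  The subtraction form
is a choice made for this sketch only; it is not taken from the patch.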


* [PATCH 04/11] fuse: add a notification to add new iomap devices
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-05-22  0:03   ` [PATCH 03/11] fuse: implement the basic iomap mechanisms Darrick J. Wong
@ 2025-05-22  0:03   ` Darrick J. Wong
  2025-05-22 16:46     ` Amir Goldstein
  2025-05-22  0:03   ` [PATCH 05/11] fuse: send FUSE_DESTROY to userspace when tearing down an iomap connection Darrick J. Wong
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:03 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

Add a new notification so that fuse servers can add extra block devices
to use with iomap.
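
As a rough illustration of what a server might write to the fuse device
to send this notification, here is a sketch that assembles the message in
a buffer.  It relies on the existing fuse convention that a notification
is an out header with unique set to zero and the notify code carried in
the error field.  The sketch_* structs are local copies made only so the
example compiles without the patched uapi header, and
sketch_build_add_device() is a hypothetical helper, not something libfuse
provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the layout of struct fuse_out_header. */
struct sketch_out_header {
	uint32_t len;		/* total message length, header included */
	int32_t error;		/* holds the notify code for notifications */
	uint64_t unique;	/* zero marks the message as a notification */
};

/* Local copy of struct fuse_iomap_add_device_out from this patch. */
struct sketch_add_device_out {
	int32_t fd;		/* fd of the open device to add */
	uint32_t reserved;	/* must be zero */
	uint32_t *map_dev;	/* location to receive device cookie */
};

/* Value of FUSE_NOTIFY_ADD_IOMAP_DEVICE in this patch. */
#define SKETCH_NOTIFY_ADD_IOMAP_DEVICE	8

/*
 * Fill @buf with an add-device notification for @fd.  Returns the number
 * of bytes to write to the fuse device, or 0 if @buf is too small.
 */
size_t sketch_build_add_device(void *buf, size_t buflen, int fd,
			       uint32_t *cookie)
{
	struct sketch_out_header hdr;
	struct sketch_add_device_out arg;
	size_t total = sizeof(hdr) + sizeof(arg);

	assert(buf != NULL);
	if (buflen < total)
		return 0;

	memset(&hdr, 0, sizeof(hdr));
	hdr.len = (uint32_t)total;
	hdr.error = SKETCH_NOTIFY_ADD_IOMAP_DEVICE;
	hdr.unique = 0;

	memset(&arg, 0, sizeof(arg));
	arg.fd = fd;
	arg.map_dev = cookie;

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy((char *)buf + sizeof(hdr), &arg, sizeof(arg));
	return total;
}
```

The server would then write() the returned number of bytes to its fuse
device fd and, on success, read the new device cookie back from *cookie
for use in the read_dev and write_dev fields of later mappings.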

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h          |   19 +++++++
 fs/fuse/fuse_trace.h      |   36 ++++++++++++++
 include/uapi/linux/fuse.h |    8 +++
 fs/fuse/dev.c             |   23 +++++++++
 fs/fuse/file_iomap.c      |  119 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/inode.c           |    9 +++
 6 files changed, 211 insertions(+), 3 deletions(-)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index aa51f25856697d..4eb75ed90db300 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -619,6 +619,12 @@ struct fuse_sync_bucket {
 	struct rcu_head rcu;
 };
 
+struct fuse_iomap {
+	/* array of file objects that reference block devices for iomap */
+	struct file **files;
+	unsigned int nr_files;
+};
+
 /**
  * A Fuse connection.
  *
@@ -970,6 +976,10 @@ struct fuse_conn {
 	struct fuse_ring *ring;
 #endif
 
+#ifdef CONFIG_FUSE_IOMAP
+	struct fuse_iomap iomap_conn;
+#endif
+
 	/** Only used if the connection opts into request timeouts */
 	struct {
 		/* Worker for checking if any requests have timed out */
@@ -1610,9 +1620,18 @@ static inline bool fuse_has_iomap(const struct inode *inode)
 {
 	return get_fuse_conn_c(inode)->iomap;
 }
+
+void fuse_iomap_init_reply(struct fuse_mount *fm);
+void fuse_iomap_conn_put(struct fuse_conn *fc);
+
+int fuse_iomap_add_device(struct fuse_conn *fc,
+			  const struct fuse_iomap_add_device_out *outarg);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
+# define fuse_iomap_init_reply(...)		((void)0)
+# define fuse_iomap_conn_put(...)		((void)0)
+# define fuse_iomap_add_device(...)		(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index f9a316c9788e06..e1a2e491d2581a 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -380,6 +380,42 @@ TRACE_EVENT(fuse_iomap_end_error,
 		  __entry->pos, __entry->count, __entry->written,
 		  __entry->error)
 );
+
+TRACE_EVENT(fuse_iomap_dev_class,
+	TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
+		 const struct file *file),
+
+	TP_ARGS(fc, idx, file),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(unsigned int,	idx)
+		__field(dev_t,		bdev)
+	),
+
+	TP_fast_assign(
+		struct inode *inode = file_inode(file);
+
+		__entry->connection	=	fc->dev;
+		__entry->idx		=	idx;
+		if (S_ISBLK(inode->i_mode)) {
+			__entry->bdev	=	inode->i_rdev;
+		} else
+			__entry->bdev	=	0;
+	),
+
+	TP_printk("connection %u idx %u dev %u:%u",
+		  __entry->connection,
+		  __entry->idx,
+		  MAJOR(__entry->bdev), MINOR(__entry->bdev))
+);
+#define DEFINE_FUSE_IOMAP_DEV_EVENT(name)		\
+DEFINE_EVENT(fuse_iomap_dev_class, name,		\
+	TP_PROTO(const struct fuse_conn *fc, unsigned int idx, \
+		 const struct file *file), \
+	TP_ARGS(fc, idx, file))
+DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_add_dev);
+DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_remove_dev);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index ce6c9960f2418f..ea8992e980a015 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -236,6 +236,7 @@
  *  7.44
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
  *    SEEK_{DATA,HOLE} support
+ *  - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
  */
 
 #ifndef _LINUX_FUSE_H
@@ -681,6 +682,7 @@ enum fuse_notify_code {
 	FUSE_NOTIFY_RETRIEVE = 5,
 	FUSE_NOTIFY_DELETE = 6,
 	FUSE_NOTIFY_RESEND = 7,
+	FUSE_NOTIFY_ADD_IOMAP_DEVICE = 8,
 	FUSE_NOTIFY_CODE_MAX,
 };
 
@@ -1371,4 +1373,10 @@ struct fuse_iomap_end_in {
 	uint32_t map_dev;	/* device cookie */
 };
 
+struct fuse_iomap_add_device_out {
+	int32_t fd;		/* fd of the open device to add */
+	uint32_t reserved;	/* must be zero */
+	uint32_t *map_dev;	/* location to receive device cookie */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6dcbaa218b7a16..9d7064ec170cf6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1824,6 +1824,26 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
 	return err;
 }
 
+static int fuse_notify_add_iomap_device(struct fuse_conn *fc, unsigned int size,
+					struct fuse_copy_state *cs)
+{
+	struct fuse_iomap_add_device_out outarg;
+	int err = -EINVAL;
+
+	if (size != sizeof(outarg))
+		goto err;
+
+	err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+	if (err)
+		goto err;
+	fuse_copy_finish(cs);
+
+	return fuse_iomap_add_device(fc, &outarg);
+err:
+	fuse_copy_finish(cs);
+	return err;
+}
+
 struct fuse_retrieve_args {
 	struct fuse_args_pages ap;
 	struct fuse_notify_retrieve_in inarg;
@@ -2049,6 +2069,9 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
 	case FUSE_NOTIFY_RESEND:
 		return fuse_notify_resend(fc);
 
+	case FUSE_NOTIFY_ADD_IOMAP_DEVICE:
+		return fuse_notify_add_iomap_device(fc, size, cs);
+
 	default:
 		fuse_copy_finish(cs);
 		return -EINVAL;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index dfa0c309803113..faefd29a273bf3 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -142,6 +142,26 @@ static inline int fuse_iomap_validate(const struct fuse_iomap_begin_out *outarg,
 	return 0;
 }
 
+static inline struct block_device *fuse_iomap_bdev(struct fuse_mount *fm,
+						   unsigned int idx)
+{
+	struct fuse_conn *fc = fm->fc;
+	struct file *file = NULL;
+
+	spin_lock(&fc->lock);
+	if (idx < fc->iomap_conn.nr_files)
+		file = fc->iomap_conn.files[idx];
+	spin_unlock(&fc->lock);
+
+	if (!file)
+		return NULL;
+
+	if (!S_ISBLK(file_inode(file)->i_mode))
+		return NULL;
+
+	return I_BDEV(file->f_mapping->host);
+}
+
 static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 			    unsigned opflags, struct iomap *iomap,
 			    struct iomap *srcmap)
@@ -155,6 +175,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 	};
 	struct fuse_iomap_begin_out outarg = { };
 	struct fuse_mount *fm = get_fuse_mount(inode);
+	struct block_device *read_bdev;
 	FUSE_ARGS(args);
 	int err;
 
@@ -181,8 +202,18 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 	if (err)
 		return err;
 
+	read_bdev = fuse_iomap_bdev(fm, outarg.read_dev);
+	if (!read_bdev)
+		return -ENODEV;
+
 	if ((opflags & IOMAP_WRITE) &&
 	    outarg.write_type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+		struct block_device *write_bdev =
+			fuse_iomap_bdev(fm, outarg.write_dev);
+
+		if (!write_bdev)
+			return -ENODEV;
+
 		/*
 		 * For an out of place write, we must supply the write mapping
 		 * via @iomap, and the read mapping via @srcmap.
@@ -192,14 +223,14 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 		iomap->length = outarg.length;
 		iomap->type = outarg.write_type;
 		iomap->flags = outarg.write_flags;
-		iomap->bdev = inode->i_sb->s_bdev;
+		iomap->bdev = write_bdev;
 
 		srcmap->addr = outarg.read_addr;
 		srcmap->offset = outarg.offset;
 		srcmap->length = outarg.length;
 		srcmap->type = outarg.read_type;
 		srcmap->flags = outarg.read_flags;
-		srcmap->bdev = inode->i_sb->s_bdev;
+		srcmap->bdev = read_bdev;
 	} else {
 		/*
 		 * For everything else (reads, reporting, and pure overwrites),
@@ -211,7 +242,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 		iomap->length = outarg.length;
 		iomap->type = outarg.read_type;
 		iomap->flags = outarg.read_flags;
-		iomap->bdev = inode->i_sb->s_bdev;
+		iomap->bdev = read_bdev;
 	}
 
 	return 0;
@@ -278,3 +309,85 @@ const struct iomap_ops fuse_iomap_ops = {
 	.iomap_begin		= fuse_iomap_begin,
 	.iomap_end		= fuse_iomap_end,
 };
+
+void fuse_iomap_conn_put(struct fuse_conn *fc)
+{
+	unsigned int i;
+
+	for (i = 0; i < fc->iomap_conn.nr_files; i++) {
+		struct file *file = fc->iomap_conn.files[i];
+
+		trace_fuse_iomap_remove_dev(fc, i, file);
+
+		fc->iomap_conn.files[i] = NULL;
+		fput(file);
+	}
+
+	kfree(fc->iomap_conn.files);
+	fc->iomap_conn.nr_files = 0;
+}
+
+/* Add a bdev to the fuse connection, returns the index or a negative errno */
+static int __fuse_iomap_add_device(struct fuse_conn *fc, struct file *file)
+{
+	struct file **new_files;
+	int ret;
+
+	if (fc->iomap_conn.nr_files >= PAGE_SIZE / sizeof(unsigned int))
+		return -EMFILE;
+
+	new_files = krealloc_array(fc->iomap_conn.files,
+				   fc->iomap_conn.nr_files + 1,
+				   sizeof(struct file *),
+				   GFP_KERNEL | __GFP_ZERO);
+	if (!new_files)
+		return -ENOMEM;
+
+	spin_lock(&fc->lock);
+	fc->iomap_conn.files = new_files;
+	fc->iomap_conn.files[fc->iomap_conn.nr_files] = get_file(file);
+	ret = fc->iomap_conn.nr_files++;
+	spin_unlock(&fc->lock);
+
+	trace_fuse_iomap_add_dev(fc, ret, file);
+
+	return ret;
+}
+
+void fuse_iomap_init_reply(struct fuse_mount *fm)
+{
+	struct fuse_conn *fc = fm->fc;
+	struct super_block *sb = fm->sb;
+
+	if (sb->s_bdev)
+		__fuse_iomap_add_device(fc, sb->s_bdev_file);
+}
+
+int fuse_iomap_add_device(struct fuse_conn *fc,
+			  const struct fuse_iomap_add_device_out *outarg)
+{
+	struct file *file;
+	int ret;
+
+	if (!fc->iomap)
+		return -EINVAL;
+
+	if (outarg->reserved)
+		return -EINVAL;
+
+	CLASS(fd, somefd)(outarg->fd);
+	if (fd_empty(somefd))
+		return -EBADF;
+	file = fd_file(somefd);
+
+	if (!S_ISBLK(file_inode(file)->i_mode))
+		return -ENODEV;
+
+	down_read(&fc->killsb);
+	ret = __fuse_iomap_add_device(fc, file);
+	up_read(&fc->killsb);
+	if (ret < 0)
+		return ret;
+
+	return put_user(ret, outarg->map_dev);
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 88730d26c9b5e2..84b7cd5ffe843b 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1010,6 +1010,8 @@ void fuse_conn_put(struct fuse_conn *fc)
 		struct fuse_iqueue *fiq = &fc->iq;
 		struct fuse_sync_bucket *bucket;
 
+		if (fc->iomap)
+			fuse_iomap_conn_put(fc);
 		if (IS_ENABLED(CONFIG_FUSE_DAX))
 			fuse_dax_conn_free(fc);
 		if (fc->timeout.req_timeout)
@@ -1449,6 +1451,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 
 		init_server_timeout(fc, timeout);
 
+		if (fc->iomap)
+			fuse_iomap_init_reply(fm);
+
 		fm->sb->s_bdi->ra_pages =
 				min(fm->sb->s_bdi->ra_pages, ra_pages);
 		fc->minor = arg->minor;
@@ -1886,6 +1891,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
  err_free_dax:
 	if (IS_ENABLED(CONFIG_FUSE_DAX))
 		fuse_dax_conn_free(fc);
+	/*
+	 * No need to call fuse_iomap_conn_put here because we don't add
+	 * devices until the init reply.
+	 */
  err:
 	return err;
 }



* [PATCH 05/11] fuse: send FUSE_DESTROY to userspace when tearing down an iomap connection
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-05-22  0:03   ` [PATCH 04/11] fuse: add a notification to add new iomap devices Darrick J. Wong
@ 2025-05-22  0:03   ` Darrick J. Wong
  2025-05-22  0:04   ` [PATCH 06/11] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:03 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

When we're destroying a fuse connection that uses iomap, always send a
FUSE_DESTROY command to userspace, even if fc->destroy is not set, so
that the server has time to react (closing block devices, reporting
latent errors, etc.) before the mount actually goes away.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/inode.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 84b7cd5ffe843b..224fb9e7610cc5 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -2056,7 +2056,7 @@ void fuse_conn_destroy(struct fuse_mount *fm)
 {
 	struct fuse_conn *fc = fm->fc;
 
-	if (fc->destroy)
+	if (fc->destroy || fc->iomap)
 		fuse_send_destroy(fm);
 
 	fuse_abort_conn(fc);



* [PATCH 06/11] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-05-22  0:03   ` [PATCH 05/11] fuse: send FUSE_DESTROY to userspace when tearing down an iomap connection Darrick J. Wong
@ 2025-05-22  0:04   ` Darrick J. Wong
  2025-05-22  0:04   ` [PATCH 07/11] fuse: implement direct IO with iomap Darrick J. Wong
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:04 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

Implement the basic file mapping reporting functions like FIEMAP, BMAP,
and SEEK_DATA/HOLE.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
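For anyone who wants to poke at this from userspace, here is a minimal
sketch of how a program walks a file's data extents with SEEK_DATA and
SEEK_HOLE, which is the path that fuse_iomap_lseek serves once the
server enables iomap.  Nothing in it is fuse specific.  The helper names
next_extent and make_temp_file are made up for illustration and are not
part of this patch.

```c
#define _GNU_SOURCE		/* for SEEK_DATA and SEEK_HOLE */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback for older or stricter headers; these are the Linux values. */
#ifndef SEEK_DATA
# define SEEK_DATA	3
# define SEEK_HOLE	4
#endif

/*
 * Find the next data extent at or after @start.  Returns 0 and fills in
 * [*data, *hole) on success, or -1 if lseek fails (errno is ENXIO when
 * there is no more data before EOF).
 */
int next_extent(int fd, off_t start, off_t *data, off_t *hole)
{
	off_t d, h;

	d = lseek(fd, start, SEEK_DATA);
	if (d < 0)
		return -1;

	/* Every file has an implicit hole at EOF, so this finds an end. */
	h = lseek(fd, d, SEEK_HOLE);
	if (h < 0)
		return -1;

	*data = d;
	*hole = h;
	return 0;
}

/* Hypothetical demo helper: unlinked temp file holding @len bytes. */
int make_temp_file(size_t len)
{
	char path[] = "/tmp/seekdemoXXXXXX";
	char *buf;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	buf = malloc(len);
	if (!buf) {
		close(fd);
		return -1;
	}
	memset(buf, 'x', len);
	if (write(fd, buf, len) != (ssize_t)len) {
		free(buf);
		close(fd);
		return -1;
	}
	free(buf);
	return fd;
}
```

Calling next_extent in a loop, passing the previous *hole as the next
@start until it returns -1, visits each data extent in turn.
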
 fs/fuse/fuse_i.h     |    8 ++++++
 fs/fuse/fuse_trace.h |   57 +++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dir.c        |    1 +
 fs/fuse/file.c       |   13 +++++++++
 fs/fuse/file_iomap.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 149 insertions(+)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4eb75ed90db300..a39e45eeec2e3e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1626,12 +1626,20 @@ void fuse_iomap_conn_put(struct fuse_conn *fc);
 
 int fuse_iomap_add_device(struct fuse_conn *fc,
 			  const struct fuse_iomap_add_device_out *outarg);
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		      u64 start, u64 length);
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
 # define fuse_iomap_init_reply(...)		((void)0)
 # define fuse_iomap_conn_put(...)		((void)0)
 # define fuse_iomap_add_device(...)		(-ENOSYS)
+# define fuse_iomap_fiemap			NULL
+# define fuse_iomap_lseek(...)			(-ENOSYS)
+# define fuse_iomap_bmap(...)			(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index e1a2e491d2581a..252eab698287bd 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -416,6 +416,63 @@ DEFINE_EVENT(fuse_iomap_dev_class, name,		\
 	TP_ARGS(fc, idx, file))
 DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_add_dev);
 DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_remove_dev);
+
+TRACE_EVENT(fuse_iomap_fiemap,
+	TP_PROTO(const struct inode *inode, u64 start, u64 count,
+		unsigned int flags),
+
+	TP_ARGS(inode, start, count, flags),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(u64,		start)
+		__field(u64,		count)
+		__field(unsigned int,	flags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->start		=	start;
+		__entry->count		=	count;
+		__entry->flags		=	flags;
+	),
+
+	TP_printk("connection %u ino %llu flags 0x%x start 0x%llx count 0x%llx",
+		  __entry->connection, __entry->ino, __entry->flags,
+		  __entry->start, __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_lseek,
+	TP_PROTO(const struct inode *inode, loff_t offset, int whence),
+
+	TP_ARGS(inode, offset, whence),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		offset)
+		__field(int,		whence)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->offset		=	offset;
+		__entry->whence		=	whence;
+	),
+
+	TP_printk("connection %u ino %llu offset 0x%llx whence %d",
+		  __entry->connection, __entry->ino, __entry->offset,
+		  __entry->whence)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 83ac192e7fdd19..be75a515c4f8b6 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2230,6 +2230,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.set_acl	= fuse_set_acl,
 	.fileattr_get	= fuse_fileattr_get,
 	.fileattr_set	= fuse_fileattr_set,
+	.fiemap		= fuse_iomap_fiemap,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ada1ed9e653e42..6b54b9a8f8a84d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2844,6 +2844,12 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
 	struct fuse_bmap_out outarg;
 	int err;
 
+	if (fuse_has_iomap(inode)) {
+		sector_t alt_sec = fuse_iomap_bmap(mapping, block);
+		if (alt_sec > 0)
+			return alt_sec;
+	}
+
 	if (!inode->i_sb->s_bdev || fm->fc->no_bmap)
 		return 0;
 
@@ -2879,6 +2885,13 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
 	struct fuse_lseek_out outarg;
 	int err;
 
+	if (fuse_has_iomap(inode)) {
+		loff_t alt_pos = fuse_iomap_lseek(file, offset, whence);
+
+		if (alt_pos != -ENOSYS)
+			return alt_pos;
+	}
+
 	if (fm->fc->no_lseek)
 		goto fallback;
 
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index faefd29a273bf3..f943cb3334a787 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -391,3 +391,73 @@ int fuse_iomap_add_device(struct fuse_conn *fc,
 
 	return put_user(ret, outarg->map_dev);
 }
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		      u64 start, u64 count)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	int error;
+
+	/*
+	 * We are called directly from the vfs so we need to check per-inode
+	 * support here explicitly.
+	 */
+	if (!fuse_has_iomap(inode))
+		return -EOPNOTSUPP;
+
+	if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR)
+		return -EOPNOTSUPP;
+
+	if (fuse_is_bad(inode))
+		return -EIO;
+
+	if (!fuse_allow_current_process(fc))
+		return -EACCES;
+
+	trace_fuse_iomap_fiemap(inode, start, count, fieinfo->fi_flags);
+
+	inode_lock_shared(inode);
+	error = iomap_fiemap(inode, fieinfo, start, count,
+			&fuse_iomap_ops);
+	inode_unlock_shared(inode);
+
+	return error;
+}
+
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block)
+{
+	ASSERT(fuse_has_iomap(mapping->host));
+
+	return iomap_bmap(mapping, block, &fuse_iomap_ops);
+}
+
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
+{
+	struct inode *inode = file->f_mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	ASSERT(fuse_has_iomap(inode));
+
+	if (fuse_is_bad(inode))
+		return -EIO;
+
+	if (!fuse_allow_current_process(fc))
+		return -EACCES;
+
+	trace_fuse_iomap_lseek(inode, offset, whence);
+
+	switch (whence) {
+	case SEEK_HOLE:
+		offset = iomap_seek_hole(inode, offset, &fuse_iomap_ops);
+		break;
+	case SEEK_DATA:
+		offset = iomap_seek_data(inode, offset, &fuse_iomap_ops);
+		break;
+	default:
+		return -ENOSYS;
+	}
+
+	if (offset < 0)
+		return offset;
+	return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
+}



* [PATCH 07/11] fuse: implement direct IO with iomap
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-05-22  0:04   ` [PATCH 06/11] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
@ 2025-05-22  0:04   ` Darrick J. Wong
  2025-05-22  0:04   ` [PATCH 08/11] fuse: implement buffered " Darrick J. Wong
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:04 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

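Add a FUSE_IOMAP_DIRECTIO init flag.  When the server negotiates it,
route O_DIRECT reads and writes through iomap_dio_rw, falling back to
the existing fuse paths only if the iomap path returns -ENOSYS.  Direct
write completions that failed or that changed file metadata (out of
place writes, unwritten extent conversions, and file extensions) are
reported to the server with a new FUSE_IOMAP_IOEND command.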

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
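The rule for when a direct write completion is worth an upcall can be
restated as a small standalone sketch.  This is a userspace paraphrase
of fuse_want_ioend and the APPEND detection in fuse_iomap_ioend from
the diff below, not the kernel code itself.  The sk_ names and the
cut-down struct are made up for illustration; the flag values are
copied from the fuse.h additions in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values mirror the FUSE_IOMAP_IOEND_* definitions in this patch. */
#define SK_IOEND_SHARED		(1U << 0)
#define SK_IOEND_UNWRITTEN	(1U << 1)
#define SK_IOEND_DIRECT		(1U << 3)
#define SK_IOEND_APPEND		(1U << 15)

/* Cut-down, hypothetical stand-in for struct fuse_iomap_ioend_in. */
struct sk_ioend {
	uint16_t ioendflags;
	int32_t error;
	uint64_t pos;
	uint32_t written;
};

/* Build an ioend and mark it APPEND if the IO ends past the old size. */
struct sk_ioend sk_make_ioend(uint64_t pos, uint32_t written, int32_t error,
			      uint16_t flags, uint64_t isize)
{
	struct sk_ioend io = {
		.ioendflags = flags,
		.error = error,
		.pos = pos,
		.written = written,
	};

	if (pos + written > isize)
		io.ioendflags |= SK_IOEND_APPEND;
	return io;
}

/* Should this completion be sent to the fuse server? */
bool sk_want_ioend(const struct sk_ioend *io)
{
	/* Errors are always reported. */
	if (io->error)
		return true;

	/* Otherwise report only IO that implies a metadata change. */
	return io->written > 0 &&
	       (io->ioendflags & (SK_IOEND_SHARED | SK_IOEND_UNWRITTEN |
				  SK_IOEND_APPEND));
}
```

A plain overwrite of already written blocks inside EOF therefore costs
no extra round trip to the server, while an extending write or an
unwritten extent conversion does.
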
 fs/fuse/fuse_i.h          |   24 ++++
 fs/fuse/fuse_trace.h      |  186 +++++++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |   27 +++++
 fs/fuse/dir.c             |    7 +
 fs/fuse/file.c            |   16 +++
 fs/fuse/file_iomap.c      |  256 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c           |    4 +
 7 files changed, 519 insertions(+), 1 deletion(-)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a39e45eeec2e3e..51a373bc7b03d9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -904,6 +904,9 @@ struct fuse_conn {
 	/* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
 	unsigned int iomap:1;
 
+	/* Use fs/iomap for direct I/O operations */
+	unsigned int iomap_directio:1;
+
 	/* Use io_uring for communication */
 	unsigned int io_uring;
 
@@ -1631,6 +1634,22 @@ int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		      u64 start, u64 length);
 loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
 sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
+
+void fuse_iomap_open(struct inode *inode, struct file *file);
+
+static inline bool fuse_has_iomap_direct_io(const struct inode *inode)
+{
+	return get_fuse_conn_c(inode)->iomap_directio;
+}
+
+static inline bool fuse_want_iomap_direct_io(const struct kiocb *iocb)
+{
+	return (iocb->ki_flags & IOCB_DIRECT) &&
+		fuse_has_iomap_direct_io(file_inode(iocb->ki_filp));
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1640,6 +1659,11 @@ sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
 # define fuse_iomap_fiemap			NULL
 # define fuse_iomap_lseek(...)			(-ENOSYS)
 # define fuse_iomap_bmap(...)			(-ENOSYS)
+# define fuse_iomap_open(...)			((void)0)
+# define fuse_has_iomap_direct_io(...)		(false)
+# define fuse_want_iomap_direct_io(...)		(false)
+# define fuse_iomap_direct_read(...)		(-ENOSYS)
+# define fuse_iomap_direct_write(...)		(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 252eab698287bd..da7c317b664a10 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -60,6 +60,7 @@
 	EM( FUSE_STATX,			"FUSE_STATX")		\
 	EM( FUSE_IOMAP_BEGIN,		"FUSE_IOMAP_BEGIN")	\
 	EM( FUSE_IOMAP_END,		"FUSE_IOMAP_END")	\
+	EM( FUSE_IOMAP_IOEND,		"FUSE_IOMAP_IOEND")	\
 	EMe(CUSE_INIT,			"CUSE_INIT")
 
 /*
@@ -161,6 +162,17 @@ TRACE_EVENT(fuse_request_end,
 	{ FUSE_IOMAP_TYPE_UNWRITTEN,		"unwritten" }, \
 	{ FUSE_IOMAP_TYPE_INLINE,		"inline" }
 
+#define FUSE_IOMAP_IOEND_STRINGS \
+	{ FUSE_IOMAP_IOEND_SHARED,		"shared" }, \
+	{ FUSE_IOMAP_IOEND_UNWRITTEN,		"unwritten" }, \
+	{ FUSE_IOMAP_IOEND_BOUNDARY,		"boundary" }, \
+	{ FUSE_IOMAP_IOEND_DIRECT,		"direct" }, \
+	{ FUSE_IOMAP_IOEND_APPEND,		"append" }
+
+#define IOMAP_DIOEND_STRINGS \
+	{ IOMAP_DIO_UNWRITTEN,			"unwritten" }, \
+	{ IOMAP_DIO_COW,			"cow" }
+
 TRACE_EVENT(fuse_iomap_begin,
 	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
 		 unsigned opflags),
@@ -381,6 +393,79 @@ TRACE_EVENT(fuse_iomap_end_error,
 		  __entry->error)
 );
 
+TRACE_EVENT(fuse_iomap_ioend,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_ioend_in *inarg),
+
+	TP_ARGS(inode, inarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(unsigned,	ioendflags)
+		__field(int,		error)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(uint64_t,	new_addr)
+		__field(size_t,		written)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->ioendflags	=	inarg->ioendflags;
+		__entry->error		=	inarg->error;
+		__entry->pos		=	inarg->pos;
+		__entry->new_addr	=	inarg->new_addr;
+		__entry->written	=	inarg->written;
+	),
+
+	TP_printk("connection %u ino %llu ioendflags (%s) pos 0x%llx written %zd error %d new_addr 0x%llx",
+		  __entry->connection, __entry->ino,
+		  __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+		  __entry->pos, __entry->written, __entry->error,
+		  __entry->new_addr)
+);
+
+TRACE_EVENT(fuse_iomap_ioend_error,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_ioend_in *inarg,
+		 int error),
+
+	TP_ARGS(inode, inarg, error),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(unsigned,	ioendflags)
+		__field(int,		error)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(uint64_t,	new_addr)
+		__field(size_t,		written)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->ioendflags	=	inarg->ioendflags;
+		__entry->error		=	error;
+		__entry->pos		=	inarg->pos;
+		__entry->new_addr	=	inarg->new_addr;
+		__entry->written	=	inarg->written;
+	),
+
+	TP_printk("connection %u ino %llu ioendflags (%s) pos 0x%llx written %zd error %d new_addr 0x%llx",
+		  __entry->connection, __entry->ino,
+		  __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+		  __entry->pos, __entry->written, __entry->error,
+		  __entry->new_addr)
+);
+
 TRACE_EVENT(fuse_iomap_dev_class,
 	TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
 		 const struct file *file),
@@ -473,6 +558,107 @@ TRACE_EVENT(fuse_iomap_lseek,
 		  __entry->connection, __entry->ino, __entry->offset,
 		  __entry->whence)
 );
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_io_class,
+	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter),
+	TP_ARGS(iocb, iter),
+	TP_STRUCT__entry(
+		__field(dev_t, connection)
+		__field(uint64_t, ino)
+		__field(loff_t, size)
+		__field(loff_t, offset)
+		__field(size_t, count)
+	),
+	TP_fast_assign(
+		const struct inode *inode = file_inode(iocb->ki_filp);
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->size		=	i_size_read(inode);
+		__entry->offset		=	iocb->ki_pos;
+		__entry->count		=	iov_iter_count(iter);
+	),
+	TP_printk("connection %u ino %llu disize 0x%llx pos 0x%llx bytecount 0x%zx",
+		  __entry->connection, __entry->ino, __entry->size,
+		  __entry->offset, __entry->count)
+)
+#define DEFINE_FUSE_IOMAP_FILE_IO_EVENT(name)		\
+DEFINE_EVENT(fuse_iomap_file_io_class, name,		\
+	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter), \
+	TP_ARGS(iocb, iter))
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
+	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
+		 ssize_t ret),
+	TP_ARGS(iocb, iter, ret),
+	TP_STRUCT__entry(
+		__field(dev_t, connection)
+		__field(uint64_t, ino)
+		__field(loff_t, size)
+		__field(loff_t, offset)
+		__field(size_t, count)
+		__field(ssize_t, ret)
+	),
+	TP_fast_assign(
+		const struct inode *inode = file_inode(iocb->ki_filp);
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->size		=	i_size_read(inode);
+		__entry->offset		=	iocb->ki_pos;
+		__entry->count		=	iov_iter_count(iter);
+		__entry->ret		=	ret;
+	),
+	TP_printk("connection %u ino %llu disize 0x%llx pos 0x%llx bytecount 0x%zx ret 0x%zx",
+		  __entry->connection, __entry->ino, __entry->size,
+		  __entry->offset, __entry->count, __entry->ret)
+)
+#define DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(name)	\
+DEFINE_EVENT(fuse_iomap_file_ioend_class, name,		\
+	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter, \
+		 ssize_t ret), \
+	TP_ARGS(iocb, iter, ret))
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+
+TRACE_EVENT(fuse_iomap_dio_write_end_io,
+	TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
+		 int error, unsigned flags),
+
+	TP_ARGS(inode, pos, written, error, flags),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(unsigned,	dioendflags)
+		__field(int,		error)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(size_t,		written)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->dioendflags	=	flags;
+		__entry->error		=	error;
+		__entry->pos		=	pos;
+		__entry->written	=	written;
+	),
+
+	TP_printk("connection %u ino %llu dioendflags (%s) pos 0x%llx written %zd error %d",
+		  __entry->connection, __entry->ino,
+		  __print_flags(__entry->dioendflags, "|", IOMAP_DIOEND_STRINGS),
+		  __entry->pos, __entry->written, __entry->error)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index ea8992e980a015..4611f912003593 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -237,6 +237,7 @@
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
  *    SEEK_{DATA,HOLE} support
  *  - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
+ *  - add FUSE_IOMAP_DIRECTIO for direct I/O support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -447,6 +448,7 @@ struct fuse_file_lock {
  *			 init_out.request_timeout contains the timeout (in secs)
  * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
  *	       operations.
+ * FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -495,6 +497,7 @@ struct fuse_file_lock {
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
 #define FUSE_IOMAP		(1ULL << 43)
+#define FUSE_IOMAP_DIRECTIO	(1ULL << 44)
 
 /**
  * CUSE INIT request/reply flags
@@ -663,6 +666,7 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_IOEND	= 4093,
 	FUSE_IOMAP_BEGIN	= 4094,
 	FUSE_IOMAP_END		= 4095,
 
@@ -1379,4 +1383,27 @@ struct fuse_iomap_add_device_out {
 	uint32_t *map_dev;	/* location to receive device cookie */
 };
 
+/* out of place write extent */
+#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
+/* unwritten extent */
+#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
+/* don't merge into previous ioend */
+#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)
+/* is direct I/O */
+#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)
+
+/* is append ioend */
+#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)
+
+struct fuse_iomap_ioend_in {
+	uint16_t ioendflags;	/* FUSE_IOMAP_IOEND_* */
+	uint16_t reserved;	/* zero */
+	int32_t error;		/* negative errno or 0 */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t new_addr;	/* disk offset of new mapping, in bytes */
+	uint32_t written;	/* bytes processed */
+	uint32_t reserved1;	/* zero */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index be75a515c4f8b6..c947ad50a9a8eb 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -704,6 +704,10 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
 	d_instantiate(entry, inode);
 	fuse_change_entry_timeout(entry, &outentry);
 	fuse_dir_changed(dir);
+
+	if (fuse_has_iomap(inode))
+		fuse_iomap_open(inode, file);
+
 	err = generic_file_open(inode, file);
 	if (!err) {
 		file->private_data = ff;
@@ -1692,6 +1696,9 @@ static int fuse_dir_open(struct inode *inode, struct file *file)
 	if (fuse_is_bad(inode))
 		return -EIO;
 
+	if (fuse_has_iomap(inode))
+		fuse_iomap_open(inode, file);
+
 	err = generic_file_open(inode, file);
 	if (err)
 		return err;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6b54b9a8f8a84d..7e8b20f56dd823 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -244,6 +244,9 @@ static int fuse_open(struct inode *inode, struct file *file)
 	if (fuse_is_bad(inode))
 		return -EIO;
 
+	if (fuse_has_iomap(inode))
+		fuse_iomap_open(inode, file);
+
 	err = generic_file_open(inode, file);
 	if (err)
 		return err;
@@ -1778,10 +1781,17 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
 	struct inode *inode = file_inode(file);
+	ssize_t ret;
 
 	if (fuse_is_bad(inode))
 		return -EIO;
 
+	if (fuse_want_iomap_direct_io(iocb)) {
+		ret = fuse_iomap_direct_read(iocb, to);
+		if (ret != -ENOSYS)
+			return ret;
+	}
+
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_read_iter(iocb, to);
 
@@ -1803,6 +1813,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (fuse_is_bad(inode))
 		return -EIO;
 
+	if (fuse_want_iomap_direct_io(iocb)) {
+		ssize_t ret = fuse_iomap_direct_write(iocb, from);
+		if (ret != -ENOSYS)
+			return ret;
+	}
+
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_write_iter(iocb, from);
 
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index f943cb3334a787..077ef51ee47452 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -310,6 +310,70 @@ const struct iomap_ops fuse_iomap_ops = {
 	.iomap_end		= fuse_iomap_end,
 };
 
+static inline bool fuse_want_ioend(const struct fuse_iomap_ioend_in *inarg)
+{
+	/* Always send an ioend for errors. */
+	if (inarg->error)
+		return true;
+
+	/* Send an ioend if we performed an IO involving metadata changes. */
+	return inarg->written > 0 &&
+	       (inarg->ioendflags & (FUSE_IOMAP_IOEND_SHARED |
+				     FUSE_IOMAP_IOEND_UNWRITTEN |
+				     FUSE_IOMAP_IOEND_APPEND));
+}
+
+static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
+			    int error, unsigned ioendflags, sector_t new_addr)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_ioend_in inarg = {
+		.ioendflags = ioendflags,
+		.error = error,
+		.attr_ino = fi->orig_ino,
+		.pos = pos,
+		.written = written,
+		.new_addr = new_addr,
+	};
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	FUSE_ARGS(args);
+	int err = 0;
+
+	if (pos + written > i_size_read(inode))
+		inarg.ioendflags |= FUSE_IOMAP_IOEND_APPEND;
+
+	trace_fuse_iomap_ioend(inode, &inarg);
+
+	if (!fuse_want_ioend(&inarg))
+		goto out;
+
+	args.opcode = FUSE_IOMAP_IOEND;
+	args.nodeid = get_node_id(inode);
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	err = fuse_simple_request(fm, &args);
+
+	trace_fuse_iomap_ioend_error(inode, &inarg, err);
+
+	/*
+	 * Preserve the original error code if userspace didn't respond or
+	 * returned success despite the error we passed along via the ioend.
+	 */
+	if (error && (err == 0 || err == -ENOSYS))
+		err = error;
+
+out:
+	/*
+	 * If there weren't any ioend errors, update the incore isize, which
+	 * confusingly takes the new i_size as "pos".
+	 */
+	if (!error && !err)
+		fuse_write_update_attr(inode, pos + written, written);
+
+	return err;
+}
+
 void fuse_iomap_conn_put(struct fuse_conn *fc)
 {
 	unsigned int i;
@@ -461,3 +525,195 @@ loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
 		return offset;
 	return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
 }
+
+void fuse_iomap_open(struct inode *inode, struct file *file)
+{
+	if (fuse_has_iomap_direct_io(inode))
+		file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+}
+
+enum fuse_ilock_type {
+	SHARED,
+	EXCL,
+};
+
+static int fuse_iomap_ilock_iocb(const struct kiocb *iocb,
+				 enum fuse_ilock_type type)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		switch (type) {
+		case SHARED:
+			return inode_trylock_shared(inode) ? 0 : -EAGAIN;
+		case EXCL:
+			return inode_trylock(inode) ? 0 : -EAGAIN;
+		default:
+			ASSERT(0);
+			return -EIO;
+		}
+	} else {
+		switch (type) {
+		case SHARED:
+			inode_lock_shared(inode);
+			break;
+		case EXCL:
+			inode_lock(inode);
+			break;
+		default:
+			ASSERT(0);
+			return -EIO;
+		}
+	}
+
+	return 0;
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	ASSERT(fuse_has_iomap_direct_io(inode));
+
+	trace_fuse_iomap_direct_read(iocb, to);
+
+	if (!iov_iter_count(to))
+		return 0; /* skip atime */
+
+	file_accessed(iocb->ki_filp);
+
+	ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+	if (ret)
+		return ret;
+	ret = iomap_dio_rw(iocb, to, &fuse_iomap_ops, NULL, 0, NULL, 0);
+	inode_unlock_shared(inode);
+
+	trace_fuse_iomap_direct_read_end(iocb, to, ret);
+	return ret;
+}
+
+static int fuse_iomap_dio_write_end_io(struct kiocb *iocb, ssize_t written,
+				       int error, unsigned dioflags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	unsigned int nofs_flag;
+	unsigned int ioendflags = FUSE_IOMAP_IOEND_DIRECT;
+	int ret;
+
+	if (fuse_is_bad(inode))
+		return -EIO;
+
+	ASSERT(fuse_has_iomap_direct_io(inode));
+
+	trace_fuse_iomap_dio_write_end_io(inode, iocb->ki_pos, written, error,
+					  dioflags);
+
+	if (dioflags & IOMAP_DIO_COW)
+		ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+	if (dioflags & IOMAP_DIO_UNWRITTEN)
+		ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+	/*
+	 * We can allocate memory here while doing writeback on behalf of
+	 * memory reclaim.  To avoid memory allocation deadlocks set the
+	 * task-wide nofs context for the following operations.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	ret = fuse_iomap_ioend(inode, iocb->ki_pos, written, error, ioendflags,
+			       FUSE_IOMAP_NULL_ADDR);
+	memalloc_nofs_restore(nofs_flag);
+	return ret;
+}
+
+static const struct iomap_dio_ops fuse_iomap_dio_write_ops = {
+	.end_io		= fuse_iomap_dio_write_end_io,
+};
+
+static int fuse_iomap_direct_write_sync(struct kiocb *iocb, loff_t start,
+					size_t count)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	loff_t end = start + count - 1;
+	int err;
+
+	/* Flush the file metadata, not the page cache. */
+	err = sync_inode_metadata(inode, 1);
+	if (err)
+		return err;
+
+	if (fc->no_fsync)
+		return 0;
+
+	err = fuse_fsync_common(iocb->ki_filp, start, end, iocb_is_dsync(iocb),
+				FUSE_FSYNC);
+	if (err == -ENOSYS) {
+		fc->no_fsync = 1;
+		err = 0;
+	}
+	return err;
+}
+
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	loff_t blockmask = i_blocksize(inode) - 1;
+	loff_t pos = iocb->ki_pos;
+	size_t count = iov_iter_count(from);
+	bool was_dsync = false;
+	ssize_t ret;
+
+	ASSERT(fuse_has_iomap_direct_io(inode));
+
+	trace_fuse_iomap_direct_write(iocb, from);
+
+	/*
+	 * direct I/O must be aligned to the fsblock size or we fall back to
+	 * the old paths
+	 */
+	if ((iocb->ki_pos | count) & blockmask)
+		return -ENOTBLK;
+
+	/* fuse doesn't support S_SYNC, so complain if we see this. */
+	if (IS_SYNC(inode)) {
+		ASSERT(!IS_SYNC(inode));
+		return -EIO;
+	}
+
+	/*
+	 * Strip off IOCB_DSYNC so that we can run the fsync ourselves because
+	 * we hold inode_lock; iomap_dio_rw calls generic_write_sync; and
+	 * fuse_fsync tries to take inode_lock again.
+	 */
+	if (iocb_is_dsync(iocb)) {
+		was_dsync = true;
+		iocb->ki_flags &= ~IOCB_DSYNC;
+	}
+
+	ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+	if (ret)
+		goto out_dsync;
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out_unlock;
+
+	ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops,
+			&fuse_iomap_dio_write_ops, 0, NULL, 0);
+	if (ret <= 0)
+		goto out_unlock;
+
+	if (was_dsync) {
+		/* Restore IOCB_DSYNC and call our sync function */
+		iocb->ki_flags |= IOCB_DSYNC;
+		ret = fuse_iomap_direct_write_sync(iocb, pos, count) ?: ret;
+	}
+
+out_unlock:
+	inode_unlock(inode);
+out_dsync:
+	trace_fuse_iomap_direct_write_end(iocb, from, ret);
+	if (was_dsync)
+		iocb->ki_flags |= IOCB_DSYNC;
+	return ret;
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 224fb9e7610cc5..0b3ad7bf89b52d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1443,6 +1443,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 
 			if ((flags & FUSE_IOMAP) && fuse_iomap_enabled())
 				fc->iomap = 1;
+			if ((flags & FUSE_IOMAP_DIRECTIO) && fc->iomap)
+				fc->iomap_directio = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1515,7 +1517,7 @@ void fuse_send_init(struct fuse_mount *fm)
 	if (fuse_uring_enabled())
 		flags |= FUSE_OVER_IO_URING;
 	if (fuse_iomap_enabled())
-		flags |= FUSE_IOMAP;
+		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO;
 
 	ia->in.flags = flags;
 	ia->in.flags2 = flags >> 32;



* [PATCH 08/11] fuse: implement buffered IO with iomap
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-05-22  0:04   ` [PATCH 07/11] fuse: implement direct IO with iomap Darrick J. Wong
@ 2025-05-22  0:04   ` Darrick J. Wong
  2025-05-22  0:04   ` [PATCH 09/11] fuse: implement large folios for iomap pagecache files Darrick J. Wong
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:04 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

Implement pagecache IO with iomap, complete with hooks into truncate and
fallocate so that the fuse server needn't implement disk block zeroing
of post-EOF and unaligned punch/zero regions.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
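To make "unaligned punch/zero regions" concrete, here is a rough sketch
of how a byte range splits into a partial head block, a run of whole
blocks, and a partial tail block.  The assumption is that only the
partial pieces need zeroing through the pagecache, while the whole
blocks in the middle can be handed to the server to unmap or convert.
This is an illustration only and is not the code in this patch; the
sk_ names are hypothetical.

```c
#include <stdint.h>

struct sk_range {
	uint64_t start;
	uint64_t len;
};

struct sk_split {
	struct sk_range head;	/* partial block at the front */
	struct sk_range body;	/* whole, block-aligned middle */
	struct sk_range tail;	/* partial block at the end */
};

/* Split [offset, offset + length) on @blocksize boundaries. */
struct sk_split sk_split_range(uint64_t offset, uint64_t length,
			       uint64_t blocksize)
{
	struct sk_split s = { {0, 0}, {0, 0}, {0, 0} };
	uint64_t end = offset + length;
	/* First block boundary at or after offset. */
	uint64_t first = (offset + blocksize - 1) / blocksize * blocksize;
	/* Last block boundary at or before end. */
	uint64_t last = end / blocksize * blocksize;

	if (first > last) {
		/* The range sits inside one block; all of it is partial. */
		s.head.start = offset;
		s.head.len = length;
		return s;
	}

	s.head.start = offset;
	s.head.len = first - offset;
	s.body.start = first;
	s.body.len = last - first;
	s.tail.start = last;
	s.tail.len = end - last;
	return s;
}
```

For example, with 4096-byte blocks a request covering bytes 100 through
8291 has a 3996-byte head, one whole block in the middle, and a 100-byte
tail.
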
 fs/fuse/fuse_i.h          |   42 +++
 fs/fuse/fuse_trace.h      |  308 ++++++++++++++++++++
 include/uapi/linux/fuse.h |    3 
 fs/fuse/dir.c             |    6 
 fs/fuse/file.c            |   48 +++
 fs/fuse/file_iomap.c      |  684 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c           |    7 
 7 files changed, 1088 insertions(+), 10 deletions(-)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 51a373bc7b03d9..8481b1d0299df0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -164,6 +164,13 @@ struct fuse_inode {
 
 			/* List of writepage requestst (pending or sent) */
 			struct rb_root writepages;
+
+#ifdef CONFIG_FUSE_IOMAP
+			/* pending io completions */
+			spinlock_t ioend_lock;
+			struct work_struct ioend_work;
+			struct list_head ioend_list;
+#endif
 		};
 
 		/* readdir cache (directory only) */
@@ -907,6 +914,9 @@ struct fuse_conn {
 	/* Use fs/iomap for direct I/O operations */
 	unsigned int iomap_directio:1;
 
+	/* Use fs/iomap for pagecache I/O operations */
+	unsigned int iomap_pagecache:1;
+
 	/* Use io_uring for communication */
 	unsigned int io_uring;
 
@@ -1613,6 +1623,9 @@ extern void fuse_sysctl_unregister(void);
 #define fuse_sysctl_unregister()	do { } while (0)
 #endif /* CONFIG_SYSCTL */
 
+sector_t fuse_bmap(struct address_space *mapping, sector_t block);
+ssize_t fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
+
 #if IS_ENABLED(CONFIG_FUSE_IOMAP)
 # include <linux/fiemap.h>
 # include <linux/iomap.h>
@@ -1650,6 +1663,26 @@ static inline bool fuse_want_iomap_direct_io(const struct kiocb *iocb)
 
 ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
 ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
+
+static inline bool fuse_has_iomap_pagecache(const struct inode *inode)
+{
+	return get_fuse_conn_c(inode)->iomap_pagecache;
+}
+
+static inline bool fuse_want_iomap_buffered_io(const struct kiocb *iocb)
+{
+	return fuse_has_iomap_pagecache(file_inode(iocb->ki_filp));
+}
+
+void fuse_iomap_init_pagecache(struct inode *inode);
+void fuse_iomap_destroy_pagecache(struct inode *inode);
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma);
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from);
+int fuse_iomap_setsize(struct mnt_idmap *idmap, struct dentry *dentry,
+		       struct iattr *iattr);
+int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
+			 loff_t length, loff_t new_size);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1664,6 +1697,15 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
 # define fuse_want_iomap_direct_io(...)		(false)
 # define fuse_iomap_direct_read(...)		(-ENOSYS)
 # define fuse_iomap_direct_write(...)		(-ENOSYS)
+# define fuse_has_iomap_pagecache(...)		(false)
+# define fuse_want_iomap_buffered_io(...)	(false)
+# define fuse_iomap_init_pagecache(...)		((void)0)
+# define fuse_iomap_destroy_pagecache(...)	((void)0)
+# define fuse_iomap_mmap(...)			(-ENOSYS)
+# define fuse_iomap_buffered_read(...)		(-ENOSYS)
+# define fuse_iomap_buffered_write(...)		(-ENOSYS)
+# define fuse_iomap_setsize(...)		(-ENOSYS)
+# define fuse_iomap_fallocate(...)		(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index da7c317b664a10..ef86cfa9195070 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -173,6 +173,12 @@ TRACE_EVENT(fuse_request_end,
 	{ IOMAP_DIO_UNWRITTEN,			"unwritten" }, \
 	{ IOMAP_DIO_COW,			"cow" }
 
+#define IOMAP_IOEND_STRINGS \
+	{ IOMAP_IOEND_SHARED,			"shared" }, \
+	{ IOMAP_IOEND_UNWRITTEN,		"unwritten" }, \
+	{ IOMAP_IOEND_BOUNDARY,			"boundary" }, \
+	{ IOMAP_IOEND_DIRECT,			"direct" }
+
 TRACE_EVENT(fuse_iomap_begin,
 	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
 		 unsigned opflags),
@@ -590,6 +596,9 @@ DEFINE_EVENT(fuse_iomap_file_io_class, name,		\
 	TP_ARGS(iocb, iter))
 DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
 DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_write_zero_eof);
 
 DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
 	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
@@ -626,6 +635,8 @@ DEFINE_EVENT(fuse_iomap_file_ioend_class, name,		\
 	TP_ARGS(iocb, iter, ret))
 DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
 DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_write_end);
 
 TRACE_EVENT(fuse_iomap_dio_write_end_io,
 	TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
@@ -659,6 +670,303 @@ TRACE_EVENT(fuse_iomap_dio_write_end_io,
 		  __print_flags(__entry->dioendflags, "|", IOMAP_DIOEND_STRINGS),
 		  __entry->pos, __entry->written, __entry->error)
 );
+
+TRACE_EVENT(fuse_iomap_end_ioend,
+	TP_PROTO(const struct iomap_ioend *ioend),
+
+	TP_ARGS(ioend),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		offset)
+		__field(size_t,		size)
+		__field(unsigned int,	ioendflags)
+		__field(int,		error)
+	),
+
+	TP_fast_assign(
+		const struct inode *inode = ioend->io_inode;
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->offset		=	ioend->io_offset;
+		__entry->size		=	ioend->io_size;
+		__entry->ioendflags	=	ioend->io_flags;
+		__entry->error		=
+				blk_status_to_errno(ioend->io_bio.bi_status);
+	),
+
+	TP_printk("connection %u ino %llu offset 0x%llx size %zu ioendflags (%s) error %d",
+		  __entry->connection, __entry->ino, __entry->offset,
+		  __entry->size,
+		  __print_flags(__entry->ioendflags, "|", IOMAP_IOEND_STRINGS),
+		  __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_map_blocks,
+	TP_PROTO(const struct inode *inode, loff_t offset, unsigned int count),
+
+	TP_ARGS(inode, offset, count),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		offset)
+		__field(unsigned int,	count)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->offset		=	offset;
+		__entry->count		=	count;
+	),
+
+	TP_printk("connection %u ino %llu offset 0x%llx count %u",
+		  __entry->connection, __entry->ino, __entry->offset,
+		  __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_submit_ioend,
+	TP_PROTO(const struct inode *inode, unsigned int nr_folios, int error),
+
+	TP_ARGS(inode, nr_folios, error),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(unsigned int,	nr_folios)
+		__field(int,		error)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nr_folios	=	nr_folios;
+		__entry->error		=	error;
+	),
+
+	TP_printk("connection %u ino %llu nr_folios %u error %d",
+		  __entry->connection, __entry->ino, __entry->nr_folios,
+		  __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_discard_folio,
+	TP_PROTO(const struct inode *inode, loff_t offset, size_t count),
+
+	TP_ARGS(inode, offset, count),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		offset)
+		__field(size_t,		count)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->offset		=	offset;
+		__entry->count		=	count;
+	),
+
+	TP_printk("connection %u ino %llu offset 0x%llx count 0x%zx",
+		  __entry->connection, __entry->ino, __entry->offset,
+		  __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_writepages,
+	TP_PROTO(const struct inode *inode, const struct writeback_control *wbc),
+
+	TP_ARGS(inode, wbc),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		start)
+		__field(loff_t,		end)
+		__field(long,		nr_to_write)
+		__field(bool,		sync_all)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->start		=	wbc->range_start;
+		__entry->end		=	wbc->range_end;
+		__entry->nr_to_write	=	wbc->nr_to_write;
+		__entry->sync_all	=	wbc->sync_mode == WB_SYNC_ALL;
+	),
+
+	TP_printk("connection %u ino %llu start 0x%llx end 0x%llx nr %ld sync_all? %d",
+		  __entry->connection, __entry->ino, __entry->start,
+		  __entry->end, __entry->nr_to_write, __entry->sync_all)
+);
+
+TRACE_EVENT(fuse_iomap_read_folio,
+	TP_PROTO(const struct folio *folio),
+
+	TP_ARGS(folio),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(size_t,		count)
+	),
+
+	TP_fast_assign(
+		const struct inode *inode = folio->mapping->host;
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->pos		=	folio_pos(folio);
+		__entry->count		=	folio_size(folio);
+	),
+
+	TP_printk("connection %u ino %llu offset 0x%llx count 0x%zx",
+		  __entry->connection, __entry->ino, __entry->pos,
+		  __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_readahead,
+	TP_PROTO(const struct readahead_control *rac),
+
+	TP_ARGS(rac),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(size_t,		count)
+	),
+
+	TP_fast_assign(
+		const struct inode *inode = file_inode(rac->file);
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+		struct readahead_control *mutrac = (struct readahead_control *)rac;
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->pos		=	readahead_pos(mutrac);
+		__entry->count		=	readahead_length(mutrac);
+	),
+
+	TP_printk("connection %u ino %llu offset 0x%llx count 0x%zx",
+		  __entry->connection, __entry->ino, __entry->pos,
+		  __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_page_mkwrite,
+	TP_PROTO(const struct vm_fault *vmf),
+
+	TP_ARGS(vmf),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		pos)
+		__field(size_t,		count)
+	),
+
+	TP_fast_assign(
+		const struct inode *inode = file_inode(vmf->vma->vm_file);
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+		struct folio *folio = page_folio(vmf->page);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->pos		=	folio_pos(folio);
+		__entry->count		=	folio_size(folio);
+	),
+
+	TP_printk("connection %u ino %llu offset 0x%llx count 0x%zx",
+		  __entry->connection, __entry->ino, __entry->pos,
+		  __entry->count)
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_range_class,
+	TP_PROTO(const struct inode *inode, loff_t offset, loff_t length),
+	TP_ARGS(inode, offset, length),
+	TP_STRUCT__entry(
+		__field(dev_t, connection)
+		__field(uint64_t, ino)
+		__field(loff_t, size)
+		__field(loff_t, offset)
+		__field(loff_t, length)
+	),
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->size		=	i_size_read(inode);
+		__entry->offset		=	offset;
+		__entry->length		=	length;
+	),
+	TP_printk("connection %u ino %llu disize 0x%llx pos 0x%llx bytecount 0x%llx",
+		  __entry->connection, __entry->ino, __entry->size,
+		  __entry->offset, __entry->length)
+)
+#define DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(name)		\
+DEFINE_EVENT(fuse_iomap_file_range_class, name,		\
+	TP_PROTO(const struct inode *inode, loff_t offset, loff_t length), \
+	TP_ARGS(inode, offset, length))
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_up);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize);
+
+TRACE_EVENT(fuse_iomap_fallocate,
+	TP_PROTO(const struct inode *inode, int mode, loff_t offset,
+		 loff_t length, loff_t newsize),
+	TP_ARGS(inode, mode, offset, length, newsize),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(loff_t,		offset)
+		__field(loff_t,		length)
+		__field(loff_t,		newsize)
+		__field(int,		mode)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->mode		=	mode;
+		__entry->offset		=	offset;
+		__entry->length		=	length;
+		__entry->newsize	=	newsize;
+	),
+
+	TP_printk("connection %u ino %llu mode 0x%x offset 0x%llx length 0x%llx newsize 0x%llx",
+		  __entry->connection, __entry->ino, __entry->mode,
+		  __entry->offset, __entry->length, __entry->newsize)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 4611f912003593..c9402f2b2a335c 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -238,6 +238,7 @@
  *    SEEK_{DATA,HOLE} support
  *  - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
  *  - add FUSE_IOMAP_DIRECTIO for direct I/O support
+ *  - add FUSE_IOMAP_PAGECACHE for buffered I/O support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -449,6 +450,7 @@ struct fuse_file_lock {
  * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
  *	       operations.
  * FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
+ * FUSE_IOMAP_PAGECACHE: Client supports iomap for pagecache I/O operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -498,6 +500,7 @@ struct fuse_file_lock {
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
 #define FUSE_IOMAP		(1ULL << 43)
 #define FUSE_IOMAP_DIRECTIO	(1ULL << 44)
+#define FUSE_IOMAP_PAGECACHE	(1ULL << 45)
 
 /**
  * CUSE INIT request/reply flags
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index c947ad50a9a8eb..2b6c5f3c99338f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2012,6 +2012,12 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 		set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
 		if (trust_local_cmtime && attr->ia_size != inode->i_size)
 			attr->ia_valid |= ATTR_MTIME | ATTR_CTIME;
+
+		if (fuse_has_iomap_pagecache(inode)) {
+			err = fuse_iomap_setsize(idmap, dentry, attr);
+			if (err)
+				goto error;
+		}
 	}
 
 	memset(&inarg, 0, sizeof(inarg));
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7e8b20f56dd823..a3e9df5f9788d6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -384,7 +384,7 @@ static int fuse_release(struct inode *inode, struct file *file)
 	 * Dirty pages might remain despite write_inode_now() call from
 	 * fuse_flush() due to writes racing with the close.
 	 */
-	if (fc->writeback_cache)
+	if (fc->writeback_cache || fuse_has_iomap_pagecache(inode))
 		write_inode_now(inode, 1);
 
 	fuse_release_common(file, false);
@@ -1734,8 +1734,6 @@ static ssize_t __fuse_direct_read(struct fuse_io_priv *io,
 	return res;
 }
 
-static ssize_t fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
-
 static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t res;
@@ -1792,6 +1790,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			return ret;
 	}
 
+	if (fuse_want_iomap_buffered_io(iocb))
+		return fuse_iomap_buffered_read(iocb, to);
+
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_read_iter(iocb, to);
 
@@ -1815,10 +1816,29 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	if (fuse_want_iomap_direct_io(iocb)) {
 		ssize_t ret = fuse_iomap_direct_write(iocb, from);
-		if (ret != -ENOSYS)
+		switch (ret) {
+		case -ENOTBLK:
+			/*
+			 * If the iomap buffered write path is available, retry
+			 * the write there as a synchronous buffered write.
+			 * Otherwise we let it drop through to the old
+			 * ->direct_IO path.
+			 */
+			if (fuse_want_iomap_buffered_io(iocb))
+				iocb->ki_flags |= IOCB_SYNC;
+			fallthrough;
+		case -ENOSYS:
+			/* no implementation, fall through */
+			break;
+		default:
+			/* errors, no progress, or even partial progress */
 			return ret;
+		}
 	}
 
+	if (fuse_want_iomap_buffered_io(iocb))
+		return fuse_iomap_buffered_write(iocb, from);
+
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_write_iter(iocb, from);
 
@@ -2653,6 +2673,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	struct inode *inode = file_inode(file);
 	int rc;
 
+	if (fuse_has_iomap_pagecache(inode))
+		return fuse_iomap_mmap(file, vma);
+
 	/* DAX mmap is superior to direct_io mmap */
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_mmap(file, vma);
@@ -2851,7 +2874,7 @@ static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
 	return err;
 }
 
-static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
+sector_t fuse_bmap(struct address_space *mapping, sector_t block)
 {
 	struct inode *inode = mapping->host;
 	struct fuse_mount *fm = get_fuse_mount(inode);
@@ -3107,8 +3130,7 @@ static inline loff_t fuse_round_up(struct fuse_conn *fc, loff_t off)
 	return round_up(off, fc->max_pages << PAGE_SHIFT);
 }
 
-static ssize_t
-fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
 	ssize_t ret = 0;
@@ -3227,6 +3249,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		.length = length,
 		.mode = mode
 	};
+	loff_t newsize = 0;
 	int err;
 	bool block_faults = FUSE_IS_DAX(inode) &&
 		(!(mode & FALLOC_FL_KEEP_SIZE) ||
@@ -3260,6 +3283,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		err = inode_newsize_ok(inode, offset + length);
 		if (err)
 			goto out;
+		newsize = offset + length;
 	}
 
 	err = file_modified(file);
@@ -3282,6 +3306,14 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	if (err)
 		goto out;
 
+	if (fuse_has_iomap_pagecache(inode)) {
+		err = fuse_iomap_fallocate(file, mode, offset, length,
+					   newsize);
+		if (err)
+			goto out;
+		file_update_time(file);
+	}
+
 	/* we could have extended the file */
 	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
 		if (fuse_write_update_attr(inode, offset + length, length))
@@ -3480,4 +3512,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 
 	if (IS_ENABLED(CONFIG_FUSE_DAX))
 		fuse_dax_inode_init(inode, flags);
+	if (fuse_has_iomap_pagecache(inode))
+		fuse_iomap_init_pagecache(inode);
 }
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 077ef51ee47452..345610768edc80 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -6,6 +6,8 @@
 #include "fuse_i.h"
 #include "fuse_trace.h"
 #include <linux/iomap.h>
+#include <linux/pagemap.h>
+#include <linux/falloc.h>
 
 static bool __read_mostly enable_iomap =
 #if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
@@ -530,6 +532,8 @@ void fuse_iomap_open(struct inode *inode, struct file *file)
 {
 	if (fuse_has_iomap_direct_io(inode))
 		file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+	if (fuse_has_iomap_pagecache(inode))
+		file->f_mode |= FMODE_NOWAIT;
 }
 
 enum fuse_ilock_type {
@@ -655,6 +659,109 @@ static int fuse_iomap_direct_write_sync(struct kiocb *iocb, loff_t start,
 	return err;
 }
 
+static int
+fuse_iomap_zero_range(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			len,
+	bool			*did_zero)
+{
+	return iomap_zero_range(inode, pos, len, did_zero, &fuse_iomap_ops,
+				NULL);
+}
+
+/* Take care of zeroing post-EOF blocks when they might exist. */
+static ssize_t
+fuse_iomap_write_zero_eof(
+	struct kiocb		*iocb,
+	struct iov_iter		*from,
+	bool			*drained_dio)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	loff_t			isize;
+	int			error;
+
+	/*
+	 * We need to serialise against EOF updates that occur in IO
+	 * completions here. We want to make sure that nobody is changing the
+	 * size while we do this check until we have placed an IO barrier (i.e.
+	 * hold i_rwsem exclusively) that prevents new IO from being
+	 * dispatched.  The spinlock effectively forms a memory barrier once we
+	 * have i_rwsem exclusively so we are guaranteed to see the latest EOF
+	 * value and hence be able to correctly determine if we need to run
+	 * zeroing.
+	 */
+	spin_lock(&fi->lock);
+	isize = i_size_read(inode);
+	if (iocb->ki_pos <= isize) {
+		spin_unlock(&fi->lock);
+		return 0;
+	}
+	spin_unlock(&fi->lock);
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		return -EAGAIN;
+
+	if (!(*drained_dio)) {
+		/*
+		 * We now have an IO submission barrier in place, but AIO can
+		 * do EOF updates during IO completion and hence we now need to
+		 * wait for all of them to drain.  Non-AIO DIO will have
+		 * drained before we are given the exclusive i_rwsem, and so
+		 * for most cases this wait is a no-op.
+		 */
+		inode_dio_wait(inode);
+		*drained_dio = true;
+		return 1;
+	}
+
+	trace_fuse_iomap_write_zero_eof(iocb, from);
+
+	filemap_invalidate_lock(mapping);
+	error = fuse_iomap_zero_range(inode, isize, iocb->ki_pos - isize, NULL);
+	filemap_invalidate_unlock(mapping);
+
+	return error;
+}
+
+static ssize_t
+fuse_iomap_write_checks(
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	struct inode		*inode = iocb->ki_filp->f_mapping->host;
+	ssize_t			error;
+	bool			drained_dio = false;
+
+restart:
+	error = generic_write_checks(iocb, from);
+	if (error <= 0)
+		return error;
+
+	/*
+	 * If the offset is beyond the size of the file, we need to zero all
+	 * blocks that fall between the existing EOF and the start of this
+	 * write.
+	 *
+	 * We can do an unlocked check for i_size here safely as I/O completion
+	 * can only extend EOF.  Truncate is locked out at this point, so the
+	 * EOF cannot move backwards, only forwards. Hence we only need to take
+	 * the slow path when the write starts beyond the current EOF.
+	 */
+	if (fuse_has_iomap_pagecache(inode) &&
+	    iocb->ki_pos > i_size_read(inode)) {
+		error = fuse_iomap_write_zero_eof(iocb, from, &drained_dio);
+		if (error == 1)
+			goto restart;
+		if (error)
+			return error;
+	}
+
+	return kiocb_modified(iocb);
+}
+
 ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -694,8 +801,9 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ret = fuse_iomap_ilock_iocb(iocb, EXCL);
 	if (ret)
 		goto out_dsync;
-	ret = generic_write_checks(iocb, from);
-	if (ret <= 0)
+
+	ret = fuse_iomap_write_checks(iocb, from);
+	if (ret)
 		goto out_unlock;
 
 	ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops,
@@ -717,3 +825,575 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		iocb->ki_flags |= IOCB_DSYNC;
 	return ret;
 }
+
+struct fuse_writepage_ctx {
+	struct iomap_writepage_ctx ctx;
+};
+
+static void fuse_iomap_end_ioend(struct iomap_ioend *ioend)
+{
+	struct inode *inode = ioend->io_inode;
+	unsigned int ioendflags = 0;
+	unsigned int nofs_flag;
+	int error = blk_status_to_errno(ioend->io_bio.bi_status);
+
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	if (fuse_is_bad(inode))
+		return;
+
+	trace_fuse_iomap_end_ioend(ioend);
+
+	if (ioend->io_flags & IOMAP_IOEND_SHARED)
+		ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+	if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
+		ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+	/*
+	 * We can allocate memory here while doing writeback on behalf of
+	 * memory reclaim.  To avoid memory allocation deadlocks set the
+	 * task-wide nofs context for the following operations.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	fuse_iomap_ioend(inode, ioend->io_offset, ioend->io_size, error,
+			 ioendflags, FUSE_IOMAP_NULL_ADDR);
+	iomap_finish_ioends(ioend, error);
+	memalloc_nofs_restore(nofs_flag);
+}
+
+/*
+ * Finish all pending IO completions that require transactional modifications.
+ *
+ * We try to merge physically and logically contiguous ioends before completion
+ * to minimise the number of transactions we need to perform during IO
+ * completion.  Both unwritten extent conversion and COW remapping need to
+ * iterate and modify one physical extent at a time, so we gain nothing by
+ * merging physically discontiguous extents here.
+ *
+ * The ioend chain that we process here is largely unbounded in length, and we
+ * may have to perform significant amounts of work on each ioend to complete
+ * it.  Hence we have to be careful about holding the CPU for too long in this
+ * loop.
+ */
+static void fuse_iomap_end_io(struct work_struct *work)
+{
+	struct fuse_inode *fi =
+		container_of(work, struct fuse_inode, ioend_work);
+	struct iomap_ioend *ioend;
+	struct list_head tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&fi->ioend_lock, flags);
+	list_replace_init(&fi->ioend_list, &tmp);
+	spin_unlock_irqrestore(&fi->ioend_lock, flags);
+
+	iomap_sort_ioends(&tmp);
+	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
+			io_list))) {
+		list_del_init(&ioend->io_list);
+		iomap_ioend_try_merge(ioend, &tmp);
+		fuse_iomap_end_ioend(ioend);
+		cond_resched();
+	}
+}
+
+static void fuse_iomap_end_bio(struct bio *bio)
+{
+	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+	struct inode *inode = ioend->io_inode;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	unsigned long flags;
+
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	spin_lock_irqsave(&fi->ioend_lock, flags);
+	if (list_empty(&fi->ioend_list))
+		WARN_ON_ONCE(!queue_work(system_unbound_wq, &fi->ioend_work));
+	list_add_tail(&ioend->io_list, &fi->ioend_list);
+	spin_unlock_irqrestore(&fi->ioend_lock, flags);
+}
+
+/*
+ * Fast revalidation of the cached writeback mapping. Return true if the current
+ * mapping is valid, false otherwise.
+ */
+static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc,
+					    loff_t offset)
+{
+	if (offset < wpc->iomap.offset ||
+	    offset >= wpc->iomap.offset + wpc->iomap.length)
+		return false;
+
+	/* XXX actually use revalidation cookie */
+	return true;
+}
+
+static int fuse_iomap_map_blocks(struct iomap_writepage_ctx *wpc,
+				 struct inode *inode, loff_t offset,
+				 unsigned int len)
+{
+	struct iomap write_iomap, dontcare;
+	int ret;
+
+	if (fuse_is_bad(inode))
+		return -EIO;
+
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	trace_fuse_iomap_map_blocks(inode, offset, len);
+
+	if (fuse_iomap_revalidate_writeback(wpc, offset))
+		return 0;
+
+	/* Pretend that this is a directio write */
+	ret = fuse_iomap_begin(inode, offset, len, IOMAP_DIRECT | IOMAP_WRITE,
+			       &write_iomap, &dontcare);
+	if (ret)
+		return ret;
+
+	/*
+	 * Landed in a hole or beyond EOF?  Send that to iomap, it'll skip
+	 * writing back the file range.
+	 */
+	if (write_iomap.offset > offset) {
+		write_iomap.length = write_iomap.offset - offset;
+		write_iomap.offset = offset;
+		write_iomap.type = IOMAP_HOLE;
+	}
+
+	memcpy(&wpc->iomap, &write_iomap, sizeof(struct iomap));
+	return 0;
+}
+
+static int fuse_iomap_submit_ioend(struct iomap_writepage_ctx *wpc, int status)
+{
+	struct iomap_ioend *ioend = wpc->ioend;
+
+	ASSERT(fuse_has_iomap_pagecache(ioend->io_inode));
+
+	trace_fuse_iomap_submit_ioend(ioend->io_inode, wpc->nr_folios, status);
+
+	/* always call our ioend function, even if we cancel the bio */
+	ioend->io_bio.bi_end_io = fuse_iomap_end_bio;
+
+	if (status)
+		return status;
+	submit_bio(&ioend->io_bio);
+	return 0;
+}
+
+/*
+ * If the folio has delalloc blocks on it, the caller is asking us to punch them
+ * out. If we don't, we can leave a stale delalloc mapping covered by a clean
+ * page that needs to be dirtied again before the delalloc mapping can be
+ * converted. This stale delalloc mapping can trip up a later direct I/O read
+ * operation on the same region.
+ *
+ * We prevent this by truncating away the delalloc regions on the folio. Because
+ * they are delalloc, we can do this without needing a transaction. Indeed, if
+ * we get ENOSPC errors, we have to be able to do this truncation without a
+ * transaction as there is no space left for block reservation (typically why
+ * we see an ENOSPC in writeback).
+ */
+static void fuse_iomap_discard_folio(struct folio *folio, loff_t pos)
+{
+	struct inode *inode = folio->mapping->host;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	if (fuse_is_bad(inode))
+		return;
+
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	trace_fuse_iomap_discard_folio(inode, pos, folio_size(folio));
+
+	printk_ratelimited(KERN_ERR
+		"page discard on page %px, inode 0x%llx, pos %llu.\n",
+			folio, fi->orig_ino, pos);
+
+	/* XXX actually punch the new delalloc ranges? */
+}
+
+static const struct iomap_writeback_ops fuse_iomap_writeback_ops = {
+	.map_blocks		= fuse_iomap_map_blocks,
+	.submit_ioend		= fuse_iomap_submit_ioend,
+	.discard_folio		= fuse_iomap_discard_folio,
+};
+
+static int fuse_iomap_writepages(struct address_space *mapping,
+				 struct writeback_control *wbc)
+{
+	struct fuse_writepage_ctx wpc = { };
+
+	ASSERT(fuse_has_iomap_pagecache(mapping->host));
+
+	trace_fuse_iomap_writepages(mapping->host, wbc);
+
+	return iomap_writepages(mapping, wbc, &wpc.ctx,
+				&fuse_iomap_writeback_ops);
+}
+
+static int fuse_iomap_read_folio(struct file *file, struct folio *folio)
+{
+	ASSERT(fuse_has_iomap_pagecache(file_inode(file)));
+
+	trace_fuse_iomap_read_folio(folio);
+
+	return iomap_read_folio(folio, &fuse_iomap_ops);
+}
+
+static void fuse_iomap_readahead(struct readahead_control *rac)
+{
+	ASSERT(fuse_has_iomap_pagecache(file_inode(rac->file)));
+
+	trace_fuse_iomap_readahead(rac);
+
+	iomap_readahead(rac, &fuse_iomap_ops);
+}
+
+const struct address_space_operations fuse_iomap_aops = {
+	.read_folio		= fuse_iomap_read_folio,
+	.readahead		= fuse_iomap_readahead,
+	.writepages		= fuse_iomap_writepages,
+	.dirty_folio		= iomap_dirty_folio,
+	.release_folio		= iomap_release_folio,
+	.invalidate_folio	= iomap_invalidate_folio,
+	.migrate_folio		= filemap_migrate_folio,
+	.is_partially_uptodate  = iomap_is_partially_uptodate,
+	.error_remove_folio	= generic_error_remove_folio,
+
+	/* These aren't pagecache operations per se */
+	.bmap			= fuse_bmap,
+	.direct_IO		= fuse_direct_IO,
+};
+
+void fuse_iomap_init_pagecache(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	ASSERT(fuse_has_iomap(inode));
+
+	/* Manage timestamps ourselves, don't make the fuse server do it */
+	inode->i_flags &= ~S_NOCMTIME;
+	inode->i_flags &= ~S_NOATIME;
+	inode->i_data.a_ops = &fuse_iomap_aops;
+
+	INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
+	INIT_LIST_HEAD(&fi->ioend_list);
+	spin_lock_init(&fi->ioend_lock);
+}
+
+void fuse_iomap_destroy_pagecache(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	ASSERT(fuse_has_iomap(inode));
+	ASSERT(list_empty(&fi->ioend_list));
+}
+
+/*
+ * Locking for serialisation of IO during page faults. This results in a lock
+ * ordering of:
+ *
+ * mmap_lock (MM)
+ *   sb_start_pagefault(vfs, freeze)
+ *     invalidate_lock (vfs - truncate serialisation)
+ *       page_lock (MM)
+ *         i_lock (FUSE - extent map serialisation)
+ */
+static vm_fault_t fuse_iomap_page_mkwrite(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	vm_fault_t ret;
+
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	trace_fuse_iomap_page_mkwrite(vmf);
+
+	sb_start_pagefault(inode->i_sb);
+	file_update_time(vmf->vma->vm_file);
+
+	filemap_invalidate_lock_shared(mapping);
+	ret = iomap_page_mkwrite(vmf, &fuse_iomap_ops, NULL);
+	filemap_invalidate_unlock_shared(mapping);
+
+	sb_end_pagefault(inode->i_sb);
+	return ret;
+}
+
+static const struct vm_operations_struct fuse_iomap_vm_ops = {
+	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= fuse_iomap_page_mkwrite,
+};
+
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	file_accessed(file);
+	vma->vm_ops = &fuse_iomap_vm_ops;
+	return 0;
+}
+
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	trace_fuse_iomap_buffered_read(iocb, to);
+
+	if (!iov_iter_count(to))
+		return 0; /* skip atime */
+
+	file_accessed(iocb->ki_filp);
+
+	ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+	if (ret)
+		return ret;
+	ret = generic_file_read_iter(iocb, to);
+	inode_unlock_shared(inode);
+
+	trace_fuse_iomap_buffered_read_end(iocb, to, ret);
+	return ret;
+}
+
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	loff_t pos = iocb->ki_pos;
+	ssize_t ret;
+
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	trace_fuse_iomap_buffered_write(iocb, from);
+
+	ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+	if (ret)
+		return ret;
+
+	ret = fuse_iomap_write_checks(iocb, from);
+	if (ret)
+		goto out_unlock;
+
+	if (inode->i_size < pos + iov_iter_count(from))
+		set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+	ret = iomap_file_buffered_write(iocb, from, &fuse_iomap_ops, NULL);
+
+	if (ret > 0)
+		fuse_write_update_attr(inode, pos + ret, ret);
+	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+out_unlock:
+	inode_unlock(inode);
+
+	if (ret > 0) {
+		/* Handle various SYNC-type writes */
+		ret = generic_write_sync(iocb, ret);
+	}
+	trace_fuse_iomap_buffered_write_end(iocb, from, ret);
+	return ret;
+}
+
+static int
+fuse_iomap_truncate_page(
+	struct inode		*inode,
+	loff_t			pos,
+	bool			*did_zero)
+{
+	return iomap_truncate_page(inode, pos, did_zero, &fuse_iomap_ops,
+				   NULL);
+}
+/*
+ * Truncate file.  Must have write permission and not be a directory.
+ *
+ * Caution: The caller of this function is responsible for calling
+ * setattr_prepare() or otherwise verifying the change is fine.
+ */
+static int
+fuse_iomap_setattr_size(
+	struct mnt_idmap	*idmap,
+	struct dentry		*dentry,
+	struct inode		*inode,
+	struct iattr		*iattr)
+{
+	loff_t oldsize, newsize;
+	int			error;
+	bool			did_zeroing = false;
+
+	//xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
+	ASSERT(S_ISREG(inode->i_mode));
+	ASSERT((iattr->ia_valid & (ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
+		ATTR_MTIME_SET|ATTR_TIMES_SET)) == 0);
+
+	oldsize = inode->i_size;
+	newsize = iattr->ia_size;
+
+	/*
+	 * Wait for all direct I/O to complete.
+	 */
+	inode_dio_wait(inode);
+
+	/*
+	 * File data changes must be complete and flushed to disk before we
+	 * call userspace to modify the inode.
+	 *
+	 * Start with zeroing any data beyond EOF that we may expose on file
+	 * extension, or zeroing out the rest of the block on a downward
+	 * truncate.
+	 */
+	if (newsize > oldsize) {
+		trace_fuse_iomap_truncate_up(inode, oldsize, newsize - oldsize);
+
+		error = fuse_iomap_zero_range(inode, oldsize, newsize - oldsize,
+					      &did_zeroing);
+	} else {
+		trace_fuse_iomap_truncate_down(inode, newsize,
+					       oldsize - newsize);
+
+		error = fuse_iomap_truncate_page(inode, newsize, &did_zeroing);
+	}
+	if (error)
+		return error;
+
+	/*
+	 * We've already locked out new page faults, so now we can safely
+	 * remove pages from the page cache knowing they won't get refaulted
+	 * until we drop the mapping invalidation lock after the extent
+	 * manipulations are complete. The truncate_setsize() call also cleans
+	 * folios spanning EOF on extending truncates and hence ensures
+	 * sub-page block size filesystems are correctly handled, too.
+	 *
+	 * And we update in-core i_size and truncate page cache beyond newsize
+	 * before writing back the whole file, so we're guaranteed not to write
+	 * stale data past the new EOF on truncate down.
+	 */
+	truncate_setsize(inode, newsize);
+
+	/*
+	 * We are going to tell userspace to log the inode size change so any
+	 * previous writes that are beyond the on disk EOF and the new EOF that
+	 * have not been written out need to be written here.  If we do not
+	 * write the data out, we expose ourselves to the null files problem.
+	 * Note that this includes any block zeroing we did above; otherwise
+	 * those blocks may not be zeroed after a crash.  It's really clumsy
+	 * to flush the entire file, but we don't know the ondisk inode size
+	 * so we use a big hammer instead.
+	 */
+	if (did_zeroing || newsize > 0) {
+		error = filemap_write_and_wait(inode->i_mapping);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+int
+fuse_iomap_setsize(
+	struct mnt_idmap	*idmap,
+	struct dentry		*dentry,
+	struct iattr		*iattr)
+{
+	struct inode *inode = d_inode(dentry);
+	int error;
+
+	ASSERT(fuse_has_iomap(inode));
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	trace_fuse_iomap_setsize(inode, iattr->ia_size, 0);
+
+	error = inode_newsize_ok(inode, iattr->ia_size);
+	if (error)
+		return error;
+	return fuse_iomap_setattr_size(idmap, dentry, inode, iattr);
+}
+
+static int fuse_iomap_punch_range(struct inode *inode, loff_t offset,
+				  loff_t length)
+{
+	loff_t isize = i_size_read(inode);
+	int error;
+
+	trace_fuse_iomap_punch_range(inode, offset, length);
+
+	/*
+	 * Now that we've unmapped all full blocks we'll have to zero out any
+	 * partial block at the beginning and/or end.  iomap_zero_range is
+	 * smart enough to skip holes and unwritten extents, including those we
+	 * just created, but we must take care not to zero beyond EOF, which
+	 * would enlarge i_size.
+	 */
+	if (offset >= isize)
+		return 0;
+	if (offset + length > isize)
+		length = isize - offset;
+	error = fuse_iomap_zero_range(inode, offset, length, NULL);
+	if (error)
+		return error;
+
+	/*
+	 * If we zeroed right up to EOF and EOF straddles a page boundary we
+	 * must make sure that the post-EOF area is also zeroed because the
+	 * page could be mmap'd and iomap_zero_range doesn't do that for us.
+	 * Writeback of the eof page will do this, albeit clumsily.
+	 */
+	if (offset + length >= isize && offset_in_page(offset + length) > 0) {
+		error = filemap_write_and_wait_range(inode->i_mapping,
+					round_down(offset + length, PAGE_SIZE),
+					LLONG_MAX);
+	}
+
+	return error;
+}
+
+int
+fuse_iomap_fallocate(
+	struct file		*file,
+	int			mode,
+	loff_t			offset,
+	loff_t			length,
+	loff_t			new_size)
+{
+	struct inode *inode = file_inode(file);
+	int error;
+
+	ASSERT(fuse_has_iomap(inode));
+	ASSERT(fuse_has_iomap_pagecache(inode));
+
+	trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size);
+
+	/*
+	 * If we unmapped blocks from the file range, then we zero the
+	 * pagecache for those regions and push them to disk rather than make
+	 * the fuse server manually zero the disk blocks.
+	 */
+	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) {
+		error = fuse_iomap_punch_range(inode, offset, length);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * If this is an extending write, we need to zero the bytes beyond the
+	 * new EOF.
+	 */
+	if (new_size) {
+		struct iattr iattr = {
+			.ia_valid	= ATTR_SIZE,
+			.ia_size	= new_size,
+		};
+
+		return fuse_iomap_setsize(file_mnt_idmap(file),
+					  file_dentry(file), &iattr);
+	}
+
+	return 0;
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0b3ad7bf89b52d..2f185b7d9349b7 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -193,6 +193,9 @@ static void fuse_evict_inode(struct inode *inode)
 		WARN_ON(!list_empty(&fi->write_files));
 		WARN_ON(!list_empty(&fi->queued_writes));
 	}
+
+	if (S_ISREG(inode->i_mode) && fuse_has_iomap_pagecache(inode))
+		fuse_iomap_destroy_pagecache(inode);
 }
 
 static int fuse_reconfigure(struct fs_context *fsc)
@@ -1445,6 +1448,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 				fc->iomap = 1;
 			if ((flags & FUSE_IOMAP_DIRECTIO) && fc->iomap)
 				fc->iomap_directio = 1;
+			if ((flags & FUSE_IOMAP_PAGECACHE) && fc->iomap)
+				fc->iomap_pagecache = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1517,7 +1522,7 @@ void fuse_send_init(struct fuse_mount *fm)
 	if (fuse_uring_enabled())
 		flags |= FUSE_OVER_IO_URING;
 	if (fuse_iomap_enabled())
-		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO;
+		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_PAGECACHE;
 
 	ia->in.flags = flags;
 	ia->in.flags2 = flags >> 32;
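
For readers following fuse_iomap_punch_range() in this patch, the range
clamping and the decision to flush the EOF page can be sketched in plain
userspace C.  This is an illustration only: punch_plan, plan_punch and
SKETCH_PAGE_SIZE are hypothetical names that do not appear in the patch, a
4096-byte page is assumed, and no IO is performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096LL	/* assumption: 4k pages */

/* Hypothetical result type; not part of the patch. */
struct punch_plan {
	bool	zero;		/* call the zero-range helper at all? */
	int64_t	zero_len;	/* clamped number of bytes to zero */
	bool	flush_eof;	/* write back the page straddling EOF? */
	int64_t	flush_start;	/* page-aligned start of that writeback */
};

/* Mirrors the arithmetic in fuse_iomap_punch_range(), without any IO. */
static struct punch_plan plan_punch(int64_t offset, int64_t length,
				    int64_t isize)
{
	struct punch_plan p = { 0 };

	assert(offset >= 0 && length >= 0 && isize >= 0);

	/* Never zero beyond EOF; that would enlarge i_size. */
	if (offset >= isize)
		return p;
	if (offset + length > isize)
		length = isize - offset;

	p.zero = true;
	p.zero_len = length;

	/* Zeroed up to an EOF that is not page aligned: flush the EOF page. */
	if (offset + length >= isize &&
	    (offset + length) % SKETCH_PAGE_SIZE > 0) {
		p.flush_eof = true;
		p.flush_start = (offset + length) / SKETCH_PAGE_SIZE *
				SKETCH_PAGE_SIZE;
	}
	return p;
}
```

For example, punching 8192 bytes at offset 4096 in a 10000-byte file clamps
the zeroing to 5904 bytes and schedules writeback from offset 8192 onward.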



* [PATCH 09/11] fuse: implement large folios for iomap pagecache files
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-05-22  0:04   ` [PATCH 08/11] fuse: implement buffered " Darrick J. Wong
@ 2025-05-22  0:04   ` Darrick J. Wong
  2025-05-22  0:05   ` [PATCH 10/11] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
  2025-05-22  0:05   ` [PATCH 11/11] fuse: advertise support for iomap Darrick J. Wong
  10 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:04 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

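Enable large folios for files that use the iomap pagecache paths.  If the
filesystem block size exceeds the page size, set the minimum folio order
so that every folio covers at least one block.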

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/file_iomap.c |    6 ++++++
 1 file changed, 6 insertions(+)


diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 345610768edc80..c58ac812598d8f 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1070,6 +1070,7 @@ const struct address_space_operations fuse_iomap_aops = {
 void fuse_iomap_init_pagecache(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	unsigned int min_order = 0;
 
 	ASSERT(fuse_has_iomap(inode));
 
@@ -1081,6 +1082,11 @@ void fuse_iomap_init_pagecache(struct inode *inode)
 	INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
 	INIT_LIST_HEAD(&fi->ioend_list);
 	spin_lock_init(&fi->ioend_lock);
+
+	if (inode->i_blkbits > PAGE_SHIFT)
+		min_order = inode->i_blkbits - PAGE_SHIFT;
+
+	mapping_set_folio_min_order(inode->i_mapping, min_order);
 }
 
 void fuse_iomap_destroy_pagecache(struct inode *inode)
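
The minimum folio order chosen above only becomes nonzero when the
filesystem block is larger than a page.  A small sketch of that arithmetic,
assuming 4096-byte pages; SKETCH_PAGE_SHIFT and min_folio_order are
hypothetical names used for illustration and are not part of the patch:

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12U	/* assumption: 4096-byte pages */

/*
 * Hypothetical helper: smallest folio order that covers one fs block.
 * blkbits is log2 of the block size, as in inode->i_blkbits.
 */
static unsigned int min_folio_order(unsigned int blkbits)
{
	assert(blkbits < 32);	/* keep the sketch within sane shift widths */

	if (blkbits > SKETCH_PAGE_SHIFT)
		return blkbits - SKETCH_PAGE_SHIFT;
	return 0;
}
```

With 16k blocks (blkbits of 14) this yields order 2, i.e. folios of at least
four pages; for block sizes at or below the page size it stays at order 0.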



* [PATCH 10/11] fuse: use an unrestricted backing device with iomap pagecache io
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-05-22  0:04   ` [PATCH 09/11] fuse: implement large folios for iomap pagecache files Darrick J. Wong
@ 2025-05-22  0:05   ` Darrick J. Wong
  2025-05-22  0:05   ` [PATCH 11/11] fuse: advertise support for iomap Darrick J. Wong
  10 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:05 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

With iomap support turned on for the pagecache, the kernel issues
writeback directly to block devices and we no longer have to push all
those pages through the fuse device to userspace.  Therefore, we don't
need the tight dirty limits (~1M) that are used for regular fuse.  This
dramatically increases the performance of fuse's pagecache IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/file_iomap.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)


diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index c58ac812598d8f..746d9ae192dc55 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -427,6 +427,28 @@ void fuse_iomap_init_reply(struct fuse_mount *fm)
 
 	if (sb->s_bdev)
 		__fuse_iomap_add_device(fc, sb->s_bdev_file);
+
+	if (fc->iomap_pagecache) {
+		struct backing_dev_info *old_bdi = sb->s_bdi;
+		char *suffix = sb->s_bdev ? "-fuseblk" : "-fuse";
+		int err;
+
+		/*
+		 * sb->s_bdi points to the initial private bdi.  However, we
+		 * want to redirect it to a new private bdi with default dirty
+		 * and readahead settings because iomap writeback won't be
+		 * pushing a ton of dirty data through the fuse device.
+		 */
+		sb->s_bdi = &noop_backing_dev_info;
+		err = super_setup_bdi_name(sb, "%u:%u%s.iomap", MAJOR(fc->dev),
+					   MINOR(fc->dev), suffix);
+		if (err) {
+			sb->s_bdi = old_bdi;
+		} else {
+			bdi_unregister(old_bdi);
+			bdi_put(old_bdi);
+		}
+	}
 }
 
 int fuse_iomap_add_device(struct fuse_conn *fc,



* [PATCH 11/11] fuse: advertise support for iomap
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (9 preceding siblings ...)
  2025-05-22  0:05   ` [PATCH 10/11] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
@ 2025-05-22  0:05   ` Darrick J. Wong
  10 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:05 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

From: Darrick J. Wong <djwong@kernel.org>

Advertise our new IO paths programmatically by creating an ioctl that
can return the capabilities of the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h          |    4 ++++
 include/uapi/linux/fuse.h |   13 +++++++++++++
 fs/fuse/dev.c             |    3 +++
 fs/fuse/file_iomap.c      |   18 ++++++++++++++++++
 4 files changed, 38 insertions(+)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 8481b1d0299df0..5b14e8b23f305f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1683,6 +1683,9 @@ int fuse_iomap_setsize(struct mnt_idmap *idmap, struct dentry *dentry,
 		       struct iattr *iattr);
 int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
 			 loff_t length, loff_t new_size);
+
+int fuse_iomap_ioc_support(struct file *file,
+			   struct fuse_iomap_support __user *argp);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1706,6 +1709,7 @@ int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
 # define fuse_iomap_buffered_write(...)		(-ENOSYS)
 # define fuse_iomap_setsize(...)		(-ENOSYS)
 # define fuse_iomap_fallocate(...)		(-ENOSYS)
+# define fuse_iomap_ioc_support(...)		(-ENOTTY)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c9402f2b2a335c..cbef70ae05c73b 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1135,12 +1135,25 @@ struct fuse_backing_map {
 	uint64_t	padding;
 };
 
+/* basic reporting functionality */
+#define FUSE_IOMAP_SUPPORT_BASICS	(1ULL << 0)
+/* fuse driver can do direct io */
+#define FUSE_IOMAP_SUPPORT_DIRECTIO	(1ULL << 1)
+/* fuse driver can do buffered io */
+#define FUSE_IOMAP_SUPPORT_PAGECACHE	(1ULL << 2)
+struct fuse_iomap_support {
+	uint64_t	flags;
+	uint64_t	padding;
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
 #define FUSE_DEV_IOC_BACKING_OPEN	_IOW(FUSE_DEV_IOC_MAGIC, 1, \
 					     struct fuse_backing_map)
 #define FUSE_DEV_IOC_BACKING_CLOSE	_IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 3, \
+					     struct fuse_iomap_support)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9d7064ec170cf6..91beafbbcf7c02 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2620,6 +2620,9 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 	case FUSE_DEV_IOC_BACKING_CLOSE:
 		return fuse_dev_ioctl_backing_close(file, argp);
 
+	case FUSE_DEV_IOC_IOMAP_SUPPORT:
+		return fuse_iomap_ioc_support(file, argp);
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 746d9ae192dc55..60e1242b32fd7c 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1425,3 +1425,21 @@ fuse_iomap_fallocate(
 
 	return 0;
 }
+
+int fuse_iomap_ioc_support(struct file *file,
+			   struct fuse_iomap_support __user *argp)
+{
+	struct fuse_iomap_support ios = { };
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (fuse_iomap_enabled())
+		ios.flags = FUSE_IOMAP_SUPPORT_BASICS |
+			    FUSE_IOMAP_SUPPORT_DIRECTIO |
+			    FUSE_IOMAP_SUPPORT_PAGECACHE;
+
+	if (copy_to_user(argp, &ios, sizeof(ios)))
+		return -EFAULT;
+	return 0;
+}
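
A fuse server could probe this ioctl roughly as follows.  This is a sketch
under stated assumptions, not a tested client: the struct, flag and ioctl
definitions are local copies of the uapi additions proposed in this patch
(they are not in released headers), and query_iomap_support and
describe_iomap_support are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Local copies of the uapi definitions proposed in this patch. */
#define FUSE_IOMAP_SUPPORT_BASICS	(1ULL << 0)
#define FUSE_IOMAP_SUPPORT_DIRECTIO	(1ULL << 1)
#define FUSE_IOMAP_SUPPORT_PAGECACHE	(1ULL << 2)

struct fuse_iomap_support {
	uint64_t	flags;
	uint64_t	padding;
};

#define FUSE_DEV_IOC_MAGIC		229
#define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 3, \
					     struct fuse_iomap_support)

/* Ask the kernel what it supports; fd is an open /dev/fuse descriptor. */
static int query_iomap_support(int fd, struct fuse_iomap_support *ios)
{
	assert(ios != NULL);

	ios->flags = 0;
	ios->padding = 0;
	return ioctl(fd, FUSE_DEV_IOC_IOMAP_SUPPORT, ios);
}

/* Hypothetical helper: format the flag word for a log message. */
static int describe_iomap_support(uint64_t flags, char *buf, size_t len)
{
	return snprintf(buf, len, "basics=%d directio=%d pagecache=%d",
			!!(flags & FUSE_IOMAP_SUPPORT_BASICS),
			!!(flags & FUSE_IOMAP_SUPPORT_DIRECTIO),
			!!(flags & FUSE_IOMAP_SUPPORT_PAGECACHE));
}
```

Going by the code in this patch, a caller without CAP_SYS_ADMIN gets EPERM,
and a kernel that does not know the ioctl falls through to the ENOTTY default
in fuse_dev_ioctl(), so a server should treat either error as "no iomap".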



* [PATCH 1/8] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-05-22  0:05   ` Darrick J. Wong
  2025-05-22  0:05   ` [PATCH 2/8] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:05 UTC (permalink / raw)
  To: bschubert, djwong; +Cc: linux-fsdevel, bernd, John, joannelkoong, miklos

From: Darrick J. Wong <djwong@kernel.org>

Add some flags to query and request kernel support for filesystem iomap
for regular files.  Bump the minor API version so that the new iomap
symbols don't go bleeding into old programs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h |    5 +++++
 include/fuse_kernel.h |    9 ++++++++-
 lib/fuse_lowlevel.c   |    9 +++++++++
 lib/meson.build       |    2 +-
 4 files changed, 23 insertions(+), 2 deletions(-)


diff --git a/include/fuse_common.h b/include/fuse_common.h
index 249e0c94f81ea4..2394655140dc26 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -520,6 +520,11 @@ struct fuse_loop_config_v1 {
  */
 #define FUSE_CAP_OVER_IO_URING (1UL << 31)
 
+/**
+ * Client supports using iomap for FIEMAP and SEEK_{DATA,HOLE}
+ */
+#define FUSE_CAP_IOMAP (1ULL << 32)
+
 /**
  * Ioctl flags
  *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 5e0eb41d967e9d..f519fb2dc08b3f 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -229,6 +229,10 @@
  *    - FUSE_URING_IN_OUT_HEADER_SZ
  *    - FUSE_URING_OP_IN_OUT_SZ
  *    - enum fuse_uring_cmd
+ *
+ *  7.44
+ *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
+ *    SEEK_{DATA,HOLE} support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -264,7 +268,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 42
+#define FUSE_KERNEL_MINOR_VERSION 44
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -435,6 +439,8 @@ struct fuse_file_lock {
  *		    of the request ID indicates resend requests
  * FUSE_ALLOW_IDMAP: allow creation of idmapped mounts
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
+ * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
+ *	       operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -482,6 +488,7 @@ struct fuse_file_lock {
 #define FUSE_DIRECT_IO_RELAX	FUSE_DIRECT_IO_ALLOW_MMAP
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
+#define FUSE_IOMAP		(1ULL << 43)
 
 /**
  * CUSE INIT request/reply flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 7f4326cb3c14c9..4b03e626dab508 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2544,6 +2544,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			se->conn.capable_ext |= FUSE_CAP_NO_EXPORT_SUPPORT;
 		if (inargflags & FUSE_OVER_IO_URING)
 			se->conn.capable_ext |= FUSE_CAP_OVER_IO_URING;
+		if (inargflags & FUSE_IOMAP)
+			se->conn.capable_ext |= FUSE_CAP_IOMAP;
 
 	} else {
 		se->conn.max_readahead = 0;
@@ -2590,6 +2592,9 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		       FUSE_CAP_READDIRPLUS_AUTO);
 	LL_SET_DEFAULT(1, FUSE_CAP_OVER_IO_URING);
 
+	/* servers need to opt-in to iomap explicitly */
+	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP);
+
 	/* This could safely become default, but libfuse needs an API extension
 	 * to support it
 	 * LL_SET_DEFAULT(1, FUSE_CAP_SETXATTR_EXT);
@@ -2713,6 +2718,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		outargflags |= FUSE_NO_EXPORT_SUPPORT;
 	if (se->conn.want_ext & FUSE_CAP_OVER_IO_URING)
 		outargflags |= FUSE_OVER_IO_URING;
+	if (se->conn.want_ext & FUSE_CAP_IOMAP)
+		outargflags |= FUSE_IOMAP;
 
 	if (inargflags & FUSE_INIT_EXT) {
 		outargflags |= FUSE_INIT_EXT;
@@ -2754,6 +2761,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		if (se->conn.want_ext & FUSE_CAP_PASSTHROUGH)
 			fuse_log(FUSE_LOG_DEBUG, "   max_stack_depth=%u\n",
 				outarg.max_stack_depth);
+		if (se->conn.want_ext & FUSE_CAP_IOMAP)
+			fuse_log(FUSE_LOG_DEBUG, "   iomap=1\n");
 	}
 	if (arg->minor < 5)
 		outargsize = FUSE_COMPAT_INIT_OUT_SIZE;
diff --git a/lib/meson.build b/lib/meson.build
index fcd95741c9d374..2999abe8262afd 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -49,7 +49,7 @@ libfuse = library('fuse3',
                   dependencies: deps,
                   install: true,
                   link_depends: 'fuse_versionscript',
-                  c_args: [ '-DFUSE_USE_VERSION=317',
+                  c_args: [ '-DFUSE_USE_VERSION=318',
                             '-DFUSERMOUNT_DIR="@0@"'.format(fusermount_path) ],
                   link_args: ['-Wl,--version-script,' + meson.current_source_dir()
                               + '/fuse_versionscript' ])



* [PATCH 2/8] libfuse: add fuse commands for iomap_begin and end
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
  2025-05-22  0:05   ` [PATCH 1/8] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version Darrick J. Wong
@ 2025-05-22  0:05   ` Darrick J. Wong
  2025-05-22  0:06   ` [PATCH 3/8] libfuse: add upper level iomap commands Darrick J. Wong
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:05 UTC (permalink / raw)
  To: bschubert, djwong; +Cc: linux-fsdevel, bernd, John, joannelkoong, miklos

From: Darrick J. Wong <djwong@kernel.org>

Teach the low level API how to handle iomap begin and end commands that
we get from the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h   |   53 ++++++++++++++++++++++++++++++++++
 include/fuse_kernel.h   |   41 ++++++++++++++++++++++++++
 include/fuse_lowlevel.h |   54 ++++++++++++++++++++++++++++++++++
 lib/fuse_lowlevel.c     |   74 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 222 insertions(+)


diff --git a/include/fuse_common.h b/include/fuse_common.h
index 2394655140dc26..fb9c2f5c3811e3 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1129,6 +1129,59 @@ static inline bool fuse_get_feature_flag(struct fuse_conn_info *conn,
 	return conn->capable_ext & flag ? true : false;
 }
 
+/**
+ * iomap operations.
+ * These APIs are introduced in version 318 (FUSE_MAKE_VERSION(3, 18)).
+ * Using them in earlier versions will result in errors.
+ */
+#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
+#define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
+#define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
+#define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
+#define FUSE_IOMAP_TYPE_UNWRITTEN	3	/* blocks allocated at @addr in unwritten state */
+#define FUSE_IOMAP_TYPE_INLINE		4	/* data inline in the inode */
+
+#define FUSE_IOMAP_DEV_FUSEBLK		(0U)	/* fuseblk sb_dev device cookie */
+#define FUSE_IOMAP_DEV_NULL		(~0U)	/* null device cookie */
+
+#define FUSE_IOMAP_F_NEW		(1U << 0)
+#define FUSE_IOMAP_F_DIRTY		(1U << 1)
+#define FUSE_IOMAP_F_SHARED		(1U << 2)
+#define FUSE_IOMAP_F_MERGED		(1U << 3)
+#define FUSE_IOMAP_F_XATTR		(1U << 5)
+#define FUSE_IOMAP_F_BOUNDARY		(1U << 6)
+#define FUSE_IOMAP_F_ANON_WRITE		(1U << 7)
+#define FUSE_IOMAP_F_ATOMIC_BIO		(1U << 8)
+#define FUSE_IOMAP_F_WANT_IOMAP_END	(1U << 12) /* want ->iomap_end call */
+
+/* only for iomap_end */
+#define FUSE_IOMAP_F_SIZE_CHANGED	(1U << 14)
+#define FUSE_IOMAP_F_STALE		(1U << 15)
+
+#define FUSE_IOMAP_OP_WRITE		(1 << 0) /* writing, must allocate blocks */
+#define FUSE_IOMAP_OP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
+#define FUSE_IOMAP_OP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
+#define FUSE_IOMAP_OP_FAULT		(1 << 3) /* mapping for page fault */
+#define FUSE_IOMAP_OP_DIRECT		(1 << 4) /* direct I/O */
+#define FUSE_IOMAP_OP_NOWAIT		(1 << 5) /* do not block */
+#define FUSE_IOMAP_OP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
+#define FUSE_IOMAP_OP_UNSHARE		(1 << 7) /* unshare_file_range */
+#define FUSE_IOMAP_OP_ATOMIC		(1 << 9) /* torn-write protection */
+#define FUSE_IOMAP_OP_DONTCACHE		(1 << 10) /* don't retain pagecache */
+
+#define FUSE_IOMAP_NULL_ADDR		(-1ULL)	/* addr is not valid */
+
+struct fuse_iomap {
+	uint64_t addr;		/* disk offset of mapping, bytes */
+	uint64_t offset;	/* file offset of mapping, bytes */
+	uint64_t length;	/* length of mapping, bytes */
+	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
+	uint16_t flags;		/* FUSE_IOMAP_F_* */
+	uint32_t dev;		/* device cookie */
+};
+#endif /* FUSE_USE_VERSION >= 318 */
+
 /* ----------------------------------------------------------- *
  * Compatibility stuff					       *
  * ----------------------------------------------------------- */
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index f519fb2dc08b3f..1b3f6046128bde 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -657,6 +657,9 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_BEGIN	= 4094,
+	FUSE_IOMAP_END		= 4095,
+
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
 
@@ -1287,4 +1290,42 @@ struct fuse_uring_cmd_req {
 	uint8_t padding[6];
 };
 
+struct fuse_iomap_begin_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+};
+
+struct fuse_iomap_begin_out {
+	uint64_t offset;	/* file offset of mapping, bytes */
+	uint64_t length;	/* length of both mappings, bytes */
+
+	uint64_t read_addr;	/* disk offset of mapping, bytes */
+	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t read_dev;	/* FUSE_IOMAP_DEV_* */
+
+	uint64_t write_addr;	/* disk offset of mapping, bytes */
+	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t write_dev;	/* device cookie */
+};
+
+struct fuse_iomap_end_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+	int64_t written;	/* bytes processed */
+
+	uint64_t map_length;	/* length of mapping, bytes */
+	uint64_t map_addr;	/* disk offset of mapping, bytes */
+	uint16_t map_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t map_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t map_dev;	/* device cookie */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 138a78436fe6d2..4950aae4f82e0d 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1325,6 +1325,44 @@ struct fuse_lowlevel_ops {
 	void (*tmpfile) (fuse_req_t req, fuse_ino_t parent,
 			mode_t mode, struct fuse_file_info *fi);
 
+#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+	/**
+	 * Fetch file I/O mappings to begin an operation
+	 *
+	 * Valid replies:
+	 *   fuse_reply_iomap_begin
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param nodeid the inode number
+	 * @param attr_ino inode number as told by fuse_attr::ino
+	 * @param pos position in file, in bytes
+	 * @param count length of operation, in bytes
+	 * @param opflags mask of FUSE_IOMAP_OP_ flags specifying operation
+	 */
+	void (*iomap_begin) (fuse_req_t req, fuse_ino_t nodeid,
+			     uint64_t attr_ino, off_t pos, uint64_t count,
+			     uint32_t opflags);
+
+	/**
+	 * Complete an iomap operation
+	 *
+	 * Valid replies:
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param nodeid the inode number
+	 * @param attr_ino inode number as told by fuse_attr::ino
+	 * @param pos position in file, in bytes
+	 * @param count length of operation, in bytes
+	 * @param written number of bytes processed, or a negative errno
+	 * @param opflags mask of FUSE_IOMAP_OP_ flags specifying operation
+	 * @param iomap file I/O mapping that failed
+	 */
+	void (*iomap_end) (fuse_req_t req, fuse_ino_t nodeid, uint64_t attr_ino,
+			   off_t pos, uint64_t count, uint32_t opflags,
+			   ssize_t written, const struct fuse_iomap *iomap);
+#endif /* FUSE_USE_VERSION >= 318 */
 };
 
 /**
@@ -1705,6 +1743,22 @@ int fuse_reply_poll(fuse_req_t req, unsigned revents);
  */
 int fuse_reply_lseek(fuse_req_t req, off_t off);
 
+#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+/**
+ * Reply with iomappings for an iomap_begin operation
+ *
+ * Possible requests:
+ *   iomap_begin
+ *
+ * @param req request handle
+ * @param read_iomap mapping for file data reads
+ * @param write_iomap mapping for file data writes
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_iomap *read_iomap,
+			   const struct fuse_iomap *write_iomap);
+#endif /* FUSE_USE_VERSION >= 318 */
+
 /* ----------------------------------------------------------- *
  * Notification						       *
  * ----------------------------------------------------------- */
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 4b03e626dab508..56f4789ddb2d0a 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2421,6 +2421,76 @@ static void do_lseek(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg)
 	_do_lseek(req, nodeid, inarg, NULL);
 }
 
+int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_iomap *read_iomap,
+			   const struct fuse_iomap *write_iomap)
+{
+	struct fuse_iomap_begin_out arg = {
+		.offset = read_iomap->offset,
+		.length = read_iomap->length,
+
+		.read_addr = read_iomap->addr,
+		.read_type = read_iomap->type,
+		.read_flags = read_iomap->flags,
+		.read_dev = read_iomap->dev,
+
+		.write_addr = write_iomap->addr,
+		.write_type = write_iomap->type,
+		.write_flags = write_iomap->flags,
+		.write_dev = write_iomap->dev,
+	};
+
+	return send_reply_ok(req, &arg, sizeof(arg));
+}
+
+static void _do_iomap_begin(fuse_req_t req, const fuse_ino_t nodeid,
+			    const void *op_in, const void *in_payload)
+{
+	const struct fuse_iomap_begin_in *arg = op_in;
+	(void)in_payload;
+	(void)nodeid;
+
+	if (req->se->op.iomap_begin)
+		req->se->op.iomap_begin(req, nodeid, arg->attr_ino, arg->pos,
+					arg->count, arg->opflags);
+	else
+		fuse_reply_err(req, ENOSYS);
+}
+
+static void do_iomap_begin(fuse_req_t req, const fuse_ino_t nodeid,
+			   const void *inarg)
+{
+	_do_iomap_begin(req, nodeid, inarg, NULL);
+}
+
+static void _do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
+			  const void *op_in, const void *in_payload)
+{
+	const struct fuse_iomap_end_in *arg = op_in;
+	struct fuse_iomap iomap = {
+		.addr = arg->map_addr,
+		.offset = arg->pos,
+		.length = arg->map_length,
+		.type = arg->map_type,
+		.flags = arg->map_flags,
+		.dev = arg->map_dev,
+	};
+	(void)in_payload;
+	(void)nodeid;
+
+	if (req->se->op.iomap_end)
+		req->se->op.iomap_end(req, nodeid, arg->attr_ino, arg->pos,
+				      arg->count, arg->opflags, arg->written,
+				      &iomap);
+	else
+		fuse_reply_err(req, 0);
+}
+
+static void do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
+			 const void *inarg)
+{
+	_do_iomap_end(req, nodeid, inarg, NULL);
+}
+
 static bool want_flags_valid(uint64_t capable, uint64_t want)
 {
 	uint64_t unknown_flags = want & (~capable);
@@ -3218,6 +3288,8 @@ static struct {
 	[FUSE_RENAME2]     = { do_rename2,      "RENAME2"    },
 	[FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]	   = { do_lseek,       "LSEEK"	     },
+	[FUSE_IOMAP_BEGIN] = { do_iomap_begin,	"IOMAP_BEGIN" },
+	[FUSE_IOMAP_END]   = { do_iomap_end,	"IOMAP_END" },
 	[CUSE_INIT]	   = { cuse_lowlevel_init, "CUSE_INIT"   },
 };
 
@@ -3272,6 +3344,8 @@ static struct {
 	[FUSE_RENAME2]		= { _do_rename2,	"RENAME2" },
 	[FUSE_COPY_FILE_RANGE]	= { _do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]		= { _do_lseek,		"LSEEK" },
+	[FUSE_IOMAP_BEGIN]	= { _do_iomap_begin,	"IOMAP_BEGIN" },
+	[FUSE_IOMAP_END]	= { _do_iomap_end,	"IOMAP_END" },
 	[CUSE_INIT]		= { _cuse_lowlevel_init, "CUSE_INIT" },
 };
 



* [PATCH 3/8] libfuse: add upper level iomap commands
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
  2025-05-22  0:05   ` [PATCH 1/8] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version Darrick J. Wong
  2025-05-22  0:05   ` [PATCH 2/8] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
@ 2025-05-22  0:06   ` Darrick J. Wong
  2025-05-22  0:06   ` [PATCH 4/8] libfuse: add a notification to add a new device to iomap Darrick J. Wong
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:06 UTC (permalink / raw)
  To: bschubert, djwong; +Cc: linux-fsdevel, bernd, John, joannelkoong, miklos

From: Darrick J. Wong <djwong@kernel.org>

Teach the upper level fuse library about the iomap begin and end
operations, and connect it to the lower level.  This is needed for
fuse2fs to start using iomap.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
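For illustration, here is a minimal sketch of what an iomap_begin handler
could compute for a toy filesystem that stores each file as one contiguous
extent.  struct demo_iomap only carries the fields that
fuse_reply_iomap_begin copies (offset, length, addr, type, flags, dev); the
field widths, the DEMO_TYPE_* values, struct demo_file and
demo_iomap_begin() are all hypothetical stand-ins and not the libfuse API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mapping types; not the real FUSE_IOMAP_TYPE_* values. */
enum demo_iomap_type { DEMO_TYPE_HOLE, DEMO_TYPE_MAPPED };

/* Stand-in carrying the fields that fuse_reply_iomap_begin copies. */
struct demo_iomap {
	uint64_t offset;	/* file offset of mapping, in bytes */
	uint64_t length;	/* length of mapping, in bytes */
	uint64_t addr;		/* disk offset of mapping, in bytes */
	uint16_t type;		/* DEMO_TYPE_* */
	uint16_t flags;
	uint32_t dev;		/* device cookie */
};

/* Toy file: a single contiguous extent starting at disk_addr. */
struct demo_file {
	uint64_t disk_addr;
	uint64_t size;
	uint32_t dev;
};

/* Returns zero or a negative errno, like the fuse_operations handlers. */
static int demo_iomap_begin(const struct demo_file *f, uint64_t pos,
			    uint64_t count, struct demo_iomap *read_iomap)
{
	uint64_t avail;

	assert(f && read_iomap);
	if (count == 0)
		return -EINVAL;

	read_iomap->offset = pos;
	read_iomap->flags = 0;
	read_iomap->dev = f->dev;

	if (pos >= f->size) {
		/* Nothing is allocated past EOF, so report a hole. */
		read_iomap->length = count;
		read_iomap->addr = 0;	/* unused for holes in this sketch */
		read_iomap->type = DEMO_TYPE_HOLE;
		return 0;
	}

	/* Trim the mapping so that it never extends past EOF. */
	avail = f->size - pos;
	read_iomap->length = count < avail ? count : avail;
	read_iomap->addr = f->disk_addr + pos;
	read_iomap->type = DEMO_TYPE_MAPPED;
	return 0;
}
```

A real handler would also fill in the write mapping; fuse_lib_iomap_begin
below preloads it with FUSE_IOMAP_TYPE_PURE_OVERWRITE so that simple servers
can leave it alone.
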
 include/fuse.h |   14 ++++++++
 lib/fuse.c     |   97 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 111 insertions(+)


diff --git a/include/fuse.h b/include/fuse.h
index 4582cc7ac99271..fa5543bdf59deb 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -850,6 +850,20 @@ struct fuse_operations {
 	 * Find next data or hole after the specified offset
 	 */
 	off_t (*lseek) (const char *, off_t off, int whence, struct fuse_file_info *);
+
+#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+	/* Start and end an iomap operation */
+	int (*iomap_begin) (const char *path, uint64_t nodeid,
+			    uint64_t attr_ino, off_t pos_in,
+			    uint64_t length_in, uint32_t opflags_in,
+			    struct fuse_iomap *read_iomap_out,
+			    struct fuse_iomap *write_iomap_out);
+
+	int (*iomap_end) (const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos_in, uint64_t length_in,
+			  uint32_t opflags_in, ssize_t written_in,
+			  const struct fuse_iomap *iomap_in);
+#endif /* FUSE_USE_VERSION >= 318 */
 };
 
 /** Extra context that may be needed by some filesystems
diff --git a/lib/fuse.c b/lib/fuse.c
index d89655fc22c844..efec49d35043e0 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -4433,6 +4433,101 @@ static void fuse_lib_lseek(fuse_req_t req, fuse_ino_t ino, off_t off, int whence
 		reply_err(req, res);
 }
 
+static int fuse_fs_iomap_begin(struct fuse_fs *fs, const char *path,
+			       fuse_ino_t nodeid, uint64_t attr_ino, off_t pos,
+			       uint64_t count, uint32_t opflags,
+			       struct fuse_iomap *read_iomap,
+			       struct fuse_iomap *write_iomap)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.iomap_begin)
+		return -ENOSYS;
+
+	if (fs->debug) {
+		fuse_log(FUSE_LOG_DEBUG,
+			 "iomap_begin[%s] nodeid %llu attr_ino %llu pos %llu count %llu opflags 0x%x\n",
+			 path, nodeid, attr_ino, pos, count, opflags);
+	}
+
+	return fs->op.iomap_begin(path, nodeid, attr_ino, pos, count, opflags,
+				  read_iomap, write_iomap);
+}
+
+static void fuse_lib_iomap_begin(fuse_req_t req, fuse_ino_t nodeid,
+				 uint64_t attr_ino, off_t pos, uint64_t count,
+				 uint32_t opflags)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_iomap read_iomap = { };
+	struct fuse_iomap write_iomap = {
+		.type = FUSE_IOMAP_TYPE_PURE_OVERWRITE,
+	};
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path_nullok(f, nodeid, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_iomap_begin(f->fs, path, nodeid, attr_ino, pos, count,
+				  opflags, &read_iomap, &write_iomap);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, nodeid, path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_reply_iomap_begin(req, &read_iomap, &write_iomap);
+}
+
+static int fuse_fs_iomap_end(struct fuse_fs *fs, const char *path,
+			     fuse_ino_t nodeid, uint64_t attr_ino, off_t pos,
+			     uint64_t count, uint32_t opflags, ssize_t written,
+			     const struct fuse_iomap *iomap)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.iomap_end)
+		return 0;
+
+	if (fs->debug) {
+		fuse_log(FUSE_LOG_DEBUG,
+			 "iomap_end[%s] nodeid %llu attr_ino %llu pos %llu count %llu opflags 0x%x written %zd\n",
+			 path, nodeid, attr_ino, pos, count, opflags, written);
+	}
+
+	return fs->op.iomap_end(path, nodeid, attr_ino, pos, count, opflags,
+				written, iomap);
+}
+
+static void fuse_lib_iomap_end(fuse_req_t req, fuse_ino_t nodeid,
+			       uint64_t attr_ino, off_t pos, uint64_t count,
+			       uint32_t opflags, ssize_t written,
+			       const struct fuse_iomap *iomap)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path_nullok(f, nodeid, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_iomap_end(f->fs, path, nodeid, attr_ino, pos, count,
+				opflags, written, iomap);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, nodeid, path);
+	reply_err(req, err);
+}
+
 static int clean_delay(struct fuse *f)
 {
 	/*
@@ -4531,6 +4626,8 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
 	.fallocate = fuse_lib_fallocate,
 	.copy_file_range = fuse_lib_copy_file_range,
 	.lseek = fuse_lib_lseek,
+	.iomap_begin = fuse_lib_iomap_begin,
+	.iomap_end = fuse_lib_iomap_end,
 };
 
 int fuse_notify_poll(struct fuse_pollhandle *ph)



* [PATCH 4/8] libfuse: add a notification to add a new device to iomap
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-05-22  0:06   ` [PATCH 3/8] libfuse: add upper level iomap commands Darrick J. Wong
@ 2025-05-22  0:06   ` Darrick J. Wong
  2025-05-22  0:06   ` [PATCH 5/8] libfuse: add iomap ioend low level handler Darrick J. Wong
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:06 UTC (permalink / raw)
  To: bschubert, djwong; +Cc: linux-fsdevel, bernd, John, joannelkoong, miklos

From: Darrick J. Wong <djwong@kernel.org>

Plumb in the pieces needed to attach block devices to a fuse+iomap mount
for use with iomap operations.  This enables us to have filesystems
where the metadata could live somewhere else, but the actual file IO
goes to locally attached storage.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
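A note on the iovec handling in the new notify helper: only iov[1] is set up
by the caller.  send_notify_iov is not part of this hunk, so the sketch below
assumes that it fills slot 0 with the notification header, as the existing
notify helpers appear to rely on.  struct demo_out_header, struct
demo_add_device and both demo_* functions are hypothetical stand-ins that
only model the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical stand-in for the header that precedes a notification. */
struct demo_out_header {
	uint32_t len;		/* total message length, in bytes */
	int32_t code;		/* notification code */
	uint64_t unique;	/* zero for notifications */
};

/* Hypothetical payload, loosely modelled on fuse_iomap_add_device_out. */
struct demo_add_device {
	int32_t fd;
	uint32_t reserved;
};

/* Fill slot 0 with the header and return the total message length. */
static size_t demo_prepare_notify(struct demo_out_header *hdr, int code,
				  struct iovec *iov, int count)
{
	size_t total = 0;
	int i;

	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);
	for (i = 0; i < count; i++)
		total += iov[i].iov_len;

	hdr->len = (uint32_t)total;
	hdr->code = code;
	hdr->unique = 0;
	return total;
}

/* The caller owns slot 1 and passes a count that includes slot 0. */
static size_t demo_notify_add_device(int fd)
{
	struct demo_add_device payload = { .fd = fd };
	struct demo_out_header hdr;
	struct iovec iov[2];

	iov[1].iov_base = &payload;
	iov[1].iov_len = sizeof(payload);

	/* 8 is FUSE_NOTIFY_ADD_IOMAP_DEVICE in this patch */
	return demo_prepare_notify(&hdr, 8, iov, 2);
}
```

The point is simply that the count passed down is 2 even though the caller
touches one slot.
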
 include/fuse_kernel.h   |    8 ++++++++
 include/fuse_lowlevel.h |   16 ++++++++++++++++
 lib/fuse_lowlevel.c     |   21 +++++++++++++++++++++
 lib/fuse_versionscript  |    1 +
 4 files changed, 46 insertions(+)


diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 1b3f6046128bde..94efb90279579c 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -233,6 +233,7 @@
  *  7.44
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
  *    SEEK_{DATA,HOLE} support
+ *  - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
  */
 
 #ifndef _LINUX_FUSE_H
@@ -676,6 +677,7 @@ enum fuse_notify_code {
 	FUSE_NOTIFY_RETRIEVE = 5,
 	FUSE_NOTIFY_DELETE = 6,
 	FUSE_NOTIFY_RESEND = 7,
+	FUSE_NOTIFY_ADD_IOMAP_DEVICE = 8,
 	FUSE_NOTIFY_CODE_MAX,
 };
 
@@ -1328,4 +1330,10 @@ struct fuse_iomap_end_in {
 	uint32_t map_dev;	/* device cookie */
 };
 
+struct fuse_iomap_add_device_out {
+	int32_t fd;		/* fd of the open device to add */
+	uint32_t reserved;	/* must be zero */
+	uint32_t *map_dev;	/* location to receive device cookie */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 4950aae4f82e0d..c9975f1862a074 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1948,6 +1948,22 @@ int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino,
 int fuse_lowlevel_notify_retrieve(struct fuse_session *se, fuse_ino_t ino,
 				  size_t size, off_t offset, void *cookie);
 
+/**
+ * Attach an open file descriptor to a fuse+iomap mount.  Currently must be
+ * a block device.
+ *
+ * Added in FUSE protocol version 7.44. If the kernel does not support
+ * this (or a newer) version, the function will return -ENOSYS and do
+ * nothing.
+ *
+ * @param se the session object
+ * @param fd file descriptor of an open block device
+ * @param map_dev location to receive the iomap device cookie
+ * @return zero for success, -errno for failure
+ */
+int fuse_lowlevel_notify_iomap_add_device(struct fuse_session *se, int fd,
+					  uint32_t *map_dev);
+
 
 /* ----------------------------------------------------------- *
  * Utility functions					       *
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 56f4789ddb2d0a..ef92ab8c062cbf 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -3110,6 +3110,27 @@ int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino,
 	return res;
 }
 
+int fuse_lowlevel_notify_iomap_add_device(struct fuse_session *se, int fd,
+					  uint32_t *map_dev)
+{
+	struct fuse_iomap_add_device_out outarg = {
+		.fd = fd,
+		.map_dev = map_dev,
+	};
+	struct iovec iov[2];
+
+	if (!se)
+		return -EINVAL;
+
+	if (se->conn.proto_minor < 44)
+		return -ENOSYS;
+
+	iov[1].iov_base = &outarg;
+	iov[1].iov_len = sizeof(outarg);
+
+	return send_notify_iov(se, FUSE_NOTIFY_ADD_IOMAP_DEVICE, iov, 2);
+}
+
 struct fuse_retrieve_req {
 	struct fuse_notify_req nreq;
 	void *cookie;
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 22c59e1af66c95..5c04e204adba33 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -205,6 +205,7 @@ FUSE_3.17 {
 FUSE_3.18 {
 	global:
 		fuse_req_is_uring;
+		fuse_lowlevel_notify_iomap_add_device;
 } FUSE_3.17;
 
 # Local Variables:



* [PATCH 5/8] libfuse: add iomap ioend low level handler
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-05-22  0:06   ` [PATCH 4/8] libfuse: add a notification to add a new device to iomap Darrick J. Wong
@ 2025-05-22  0:06   ` Darrick J. Wong
  2025-05-22  0:06   ` [PATCH 6/8] libfuse: add upper level iomap ioend commands Darrick J. Wong
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:06 UTC (permalink / raw)
  To: bschubert, djwong; +Cc: linux-fsdevel, bernd, John, joannelkoong, miklos

From: Darrick J. Wong <djwong@kernel.org>

Teach the low level library about the iomap ioend handler, which gets
called by the kernel when we finish a file write that isn't a pure
overwrite operation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
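As a rough sketch of how a server might interpret the ioend flags, the
helper below maps them to a set of metadata updates.  The
FUSE_IOMAP_IOEND_* values are the ones from the fuse_common.h hunk in this
patch; the DEMO_ACT_* actions and demo_ioend_actions() are hypothetical, and
the policy (do nothing after a failed or empty write) is an assumption
rather than something the protocol requires.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in the fuse_common.h hunk of this patch. */
#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)
#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)
#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)

/* Hypothetical metadata updates that a server might queue. */
#define DEMO_ACT_REMAP		(1U << 0)	/* switch to the out of place extent */
#define DEMO_ACT_CONVERT	(1U << 1)	/* mark unwritten extent written */

static unsigned int demo_ioend_actions(uint32_t ioendflags, int error,
				       uint32_t written)
{
	unsigned int acts = 0;

	/* A failed or empty write leaves the mappings alone in this sketch. */
	if (error || written == 0)
		return 0;

	if (ioendflags & FUSE_IOMAP_IOEND_SHARED)
		acts |= DEMO_ACT_REMAP;
	if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN)
		acts |= DEMO_ACT_CONVERT;
	return acts;
}
```

BOUNDARY and DIRECT describe how the IO was issued, so this sketch does not
turn them into metadata updates.
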
 include/fuse_common.h   |   17 +++++++++++++++++
 include/fuse_kernel.h   |   15 +++++++++++++++
 include/fuse_lowlevel.h |   20 ++++++++++++++++++++
 lib/fuse_lowlevel.c     |   30 ++++++++++++++++++++++++++++++
 4 files changed, 82 insertions(+)


diff --git a/include/fuse_common.h b/include/fuse_common.h
index fb9c2f5c3811e3..f7bc03427d12e4 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -525,6 +525,11 @@ struct fuse_loop_config_v1 {
  */
 #define FUSE_CAP_IOMAP (1ULL << 32)
 
+/**
+ * Client supports using iomap for direct I/O file operations
+ */
+#define FUSE_CAP_IOMAP_DIRECTIO (1ULL << 33)
+
 /**
  * Ioctl flags
  *
@@ -1182,6 +1187,18 @@ struct fuse_iomap {
 };
 #endif /* FUSE_USE_VERSION >= 318 */
 
+/* out of place write extent */
+#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
+/* unwritten extent */
+#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
+/* don't merge into previous ioend */
+#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)
+/* is direct I/O */
+#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)
+
+/* is append ioend */
+#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)
+
 /* ----------------------------------------------------------- *
  * Compatibility stuff					       *
  * ----------------------------------------------------------- */
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 94efb90279579c..a2c044b5957169 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -234,6 +234,7 @@
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
  *    SEEK_{DATA,HOLE} support
  *  - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
+ *  - add FUSE_IOMAP_DIRECTIO for direct I/O support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -442,6 +443,7 @@ struct fuse_file_lock {
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
  * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
  *	       operations.
+ * FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -490,6 +492,7 @@ struct fuse_file_lock {
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_IOMAP		(1ULL << 43)
+#define FUSE_IOMAP_DIRECTIO	(1ULL << 44)
 
 /**
  * CUSE INIT request/reply flags
@@ -658,6 +661,7 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_IOEND	= 4093,
 	FUSE_IOMAP_BEGIN	= 4094,
 	FUSE_IOMAP_END		= 4095,
 
@@ -1336,4 +1340,15 @@ struct fuse_iomap_add_device_out {
 	uint32_t *map_dev;	/* location to receive device cookie */
 };
 
+struct fuse_iomap_ioend_in {
+	uint16_t ioendflags;	/* FUSE_IOMAP_IOEND_* */
+	uint16_t reserved;	/* zero */
+	int32_t error;		/* negative errno or 0 */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t new_addr;	/* disk offset of new mapping, in bytes */
+	uint32_t written;	/* bytes processed */
+	uint32_t reserved1;	/* zero */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index c9975f1862a074..eb457007a72cbc 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1362,6 +1362,26 @@ struct fuse_lowlevel_ops {
 	void (*iomap_end) (fuse_req_t req, fuse_ino_t nodeid, uint64_t attr_ino,
 			   off_t pos, uint64_t count, uint32_t opflags,
 			   ssize_t written, const struct fuse_iomap *iomap);
+
+	/**
+	 * Complete an iomap IO operation
+	 *
+	 * Valid replies:
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param nodeid the inode number
+	 * @param attr_ino inode number as told by fuse_attr::ino
+	 * @param pos position in file, in bytes
+	 * @param written number of bytes processed
+	 * @param ioendflags mask of FUSE_IOMAP_IOEND_ flags specifying operation
+	 * @param error negative errno of what went wrong, or zero
+	 * @param new_addr disk address of new mapping, in bytes
+	 */
+	void (*iomap_ioend) (fuse_req_t req, fuse_ino_t nodeid,
+			     uint64_t attr_ino, off_t pos, size_t written,
+			     uint32_t ioendflags, int error,
+			     uint64_t new_addr);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index ef92ab8c062cbf..9d07743fe522c6 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2491,6 +2491,27 @@ static void do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
 	_do_iomap_end(req, nodeid, inarg, NULL);
 }
 
+static void _do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid,
+			    const void *op_in, const void *in_payload)
+{
+	const struct fuse_iomap_ioend_in *arg = op_in;
+	(void)in_payload;
+	(void)nodeid;
+
+	if (req->se->op.iomap_ioend)
+		req->se->op.iomap_ioend(req, nodeid, arg->attr_ino, arg->pos,
+					arg->written, arg->ioendflags,
+					arg->error, arg->new_addr);
+	else
+		fuse_reply_err(req, 0);
+}
+
+static void do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid,
+			   const void *inarg)
+{
+	_do_iomap_ioend(req, nodeid, inarg, NULL);
+}
+
 static bool want_flags_valid(uint64_t capable, uint64_t want)
 {
 	uint64_t unknown_flags = want & (~capable);
@@ -2616,6 +2637,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			se->conn.capable_ext |= FUSE_CAP_OVER_IO_URING;
 		if (inargflags & FUSE_IOMAP)
 			se->conn.capable_ext |= FUSE_CAP_IOMAP;
+		if (inargflags & FUSE_IOMAP_DIRECTIO)
+			se->conn.capable_ext |= FUSE_CAP_IOMAP_DIRECTIO;
 
 	} else {
 		se->conn.max_readahead = 0;
@@ -2664,6 +2687,7 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 
 	/* servers need to opt-in to iomap explicitly */
 	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP);
+	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP_DIRECTIO);
 
 	/* This could safely become default, but libfuse needs an API extension
 	 * to support it
@@ -2790,6 +2814,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		outargflags |= FUSE_OVER_IO_URING;
 	if (se->conn.want_ext & FUSE_CAP_IOMAP)
 		outargflags |= FUSE_IOMAP;
+	if (se->conn.want_ext & FUSE_CAP_IOMAP_DIRECTIO)
+		outargflags |= FUSE_IOMAP_DIRECTIO;
 
 	if (inargflags & FUSE_INIT_EXT) {
 		outargflags |= FUSE_INIT_EXT;
@@ -2833,6 +2859,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 				outarg.max_stack_depth);
 		if (se->conn.want_ext & FUSE_CAP_IOMAP)
 			fuse_log(FUSE_LOG_DEBUG, "   iomap=1\n");
+		if (se->conn.want_ext & FUSE_CAP_IOMAP_DIRECTIO)
+			fuse_log(FUSE_LOG_DEBUG, "   iomap_directio=1\n");
 	}
 	if (arg->minor < 5)
 		outargsize = FUSE_COMPAT_INIT_OUT_SIZE;
@@ -3311,6 +3339,7 @@ static struct {
 	[FUSE_LSEEK]	   = { do_lseek,       "LSEEK"	     },
 	[FUSE_IOMAP_BEGIN] = { do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]   = { do_iomap_end,	"IOMAP_END" },
+	[FUSE_IOMAP_IOEND] = { do_iomap_ioend,	"IOMAP_IOEND" },
 	[CUSE_INIT]	   = { cuse_lowlevel_init, "CUSE_INIT"   },
 };
 
@@ -3367,6 +3396,7 @@ static struct {
 	[FUSE_LSEEK]		= { _do_lseek,		"LSEEK" },
 	[FUSE_IOMAP_BEGIN]	= { _do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]	= { _do_iomap_end,	"IOMAP_END" },
+	[FUSE_IOMAP_IOEND]	= { _do_iomap_ioend,	"IOMAP_IOEND" },
 	[CUSE_INIT]		= { _cuse_lowlevel_init, "CUSE_INIT" },
 };
 



* [PATCH 6/8] libfuse: add upper level iomap ioend commands
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-05-22  0:06   ` [PATCH 5/8] libfuse: add iomap ioend low level handler Darrick J. Wong
@ 2025-05-22  0:06   ` Darrick J. Wong
  2025-05-22  0:07   ` [PATCH 7/8] libfuse: add FUSE_IOMAP_PAGECACHE Darrick J. Wong
  2025-05-22  0:07   ` [PATCH 8/8] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
  7 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:06 UTC (permalink / raw)
  To: bschubert, djwong; +Cc: linux-fsdevel, bernd, John, joannelkoong, miklos

From: Darrick J. Wong <djwong@kernel.org>

Teach the upper level fuse library about iomap ioend events, which
happen when a write that isn't a pure overwrite completes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
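To make the append case concrete, here is a minimal sketch of an upper level
ioend handler that moves EOF after a successful appending write.
FUSE_IOMAP_IOEND_APPEND is the value from patch 5/8; struct demo_inode and
demo_iomap_ioend() are hypothetical, and ignoring failed writes is an
assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)	/* from patch 5/8 */

/* Hypothetical in-memory inode kept by the fuse server. */
struct demo_inode {
	uint64_t size;
};

/* Returns zero or a negative errno, like the fuse_operations handlers. */
static int demo_iomap_ioend(struct demo_inode *ip, uint64_t pos,
			    size_t written, uint32_t ioendflags, int error)
{
	uint64_t end = pos + written;

	/* Assumption: a failed write must not move EOF. */
	if (error)
		return 0;

	/* Only appending writes may extend the file, and never shrink it. */
	if ((ioendflags & FUSE_IOMAP_IOEND_APPEND) && end > ip->size)
		ip->size = end;
	return 0;
}
```

A real server would persist the new size as well, and would probably want
to log the error it was handed instead of dropping it.
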
 include/fuse.h |    6 ++++++
 lib/fuse.c     |   45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)


diff --git a/include/fuse.h b/include/fuse.h
index fa5543bdf59deb..5b0e8fb370c27c 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -863,6 +863,12 @@ struct fuse_operations {
 			  off_t pos_in, uint64_t length_in,
 			  uint32_t opflags_in, ssize_t written_in,
 			  const struct fuse_iomap *iomap_in);
+
+	/* Complete an iomap file IO operation */
+	int (*iomap_ioend) (const char *path, uint64_t nodeid,
+			    uint64_t attr_ino, off_t pos_in, size_t written_in,
+			    uint32_t ioendflags_in, int error_in,
+			    uint64_t new_addr_in);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
diff --git a/lib/fuse.c b/lib/fuse.c
index efec49d35043e0..b1404cda0abc74 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -4528,6 +4528,50 @@ static void fuse_lib_iomap_end(fuse_req_t req, fuse_ino_t nodeid,
 	reply_err(req, err);
 }
 
+static int fuse_fs_iomap_ioend(struct fuse_fs *fs, const char *path,
+			       uint64_t nodeid, uint64_t attr_ino, off_t pos,
+			       size_t written, uint32_t ioendflags, int error,
+			       uint64_t new_addr)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.iomap_ioend)
+		return 0;
+
+	if (fs->debug) {
+		fuse_log(FUSE_LOG_DEBUG,
+			 "iomap_ioend[%s] nodeid %llu attr_ino %llu pos %llu written %zu ioendflags 0x%x error %d\n",
+			 path, nodeid, attr_ino, pos, written, ioendflags,
+			 error);
+	}
+
+	return fs->op.iomap_ioend(path, nodeid, attr_ino, pos, written,
+				  ioendflags, error, new_addr);
+}
+
+static void fuse_lib_iomap_ioend(fuse_req_t req, fuse_ino_t nodeid,
+				 uint64_t attr_ino, off_t pos, size_t written,
+				 uint32_t ioendflags, int error,
+				 uint64_t new_addr)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path_nullok(f, nodeid, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_iomap_ioend(f->fs, path, nodeid, attr_ino, pos, written,
+				  ioendflags, error, new_addr);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, nodeid, path);
+	reply_err(req, err);
+}
+
 static int clean_delay(struct fuse *f)
 {
 	/*
@@ -4628,6 +4672,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
 	.lseek = fuse_lib_lseek,
 	.iomap_begin = fuse_lib_iomap_begin,
 	.iomap_end = fuse_lib_iomap_end,
+	.iomap_ioend = fuse_lib_iomap_ioend,
 };
 
 int fuse_notify_poll(struct fuse_pollhandle *ph)



* [PATCH 7/8] libfuse: add FUSE_IOMAP_PAGECACHE
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-05-22  0:06   ` [PATCH 6/8] libfuse: add upper level iomap ioend commands Darrick J. Wong
@ 2025-05-22  0:07   ` Darrick J. Wong
  2025-05-22  0:07   ` [PATCH 8/8] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
  7 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:07 UTC (permalink / raw)
  To: bschubert, djwong; +Cc: linux-fsdevel, bernd, John, joannelkoong, miklos

From: Darrick J. Wong <djwong@kernel.org>

Make it so that fuse servers can ask the kernel fuse driver to use iomap
to support buffered IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
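Since LL_SET_DEFAULT(0, ...) leaves all of the iomap bits off, a server has
to ask for them in its init handler.  A minimal sketch of that opt-in
follows.  The FUSE_CAP_IOMAP* values are the ones defined in this series;
struct demo_conn_info is a hypothetical stand-in holding just the
capable_ext and want_ext masks, and requiring the base FUSE_CAP_IOMAP bit
before asking for the others is my assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Capability bits as defined in fuse_common.h by this series. */
#define FUSE_CAP_IOMAP			(1ULL << 32)
#define FUSE_CAP_IOMAP_DIRECTIO		(1ULL << 33)
#define FUSE_CAP_IOMAP_PAGECACHE	(1ULL << 34)

/* Hypothetical stand-in for the two masks used during negotiation. */
struct demo_conn_info {
	uint64_t capable_ext;	/* what the kernel offers */
	uint64_t want_ext;	/* what the server asks for */
};

static void demo_init(struct demo_conn_info *conn)
{
	const uint64_t file_io = FUSE_CAP_IOMAP_DIRECTIO |
				 FUSE_CAP_IOMAP_PAGECACHE;

	/* Assumption: without the base feature nothing else is usable. */
	if (!(conn->capable_ext & FUSE_CAP_IOMAP))
		return;

	conn->want_ext |= FUSE_CAP_IOMAP;

	/* Only request the file IO modes that the kernel advertised. */
	conn->want_ext |= conn->capable_ext & file_io;
}
```

Masking against capable_ext keeps the server from requesting a bit that the
kernel never offered.
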
 include/fuse_common.h |    5 +++++
 include/fuse_kernel.h |    3 +++
 lib/fuse_lowlevel.c   |    8 +++++++-
 3 files changed, 15 insertions(+), 1 deletion(-)


diff --git a/include/fuse_common.h b/include/fuse_common.h
index f7bc03427d12e4..a102e450944f4a 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -530,6 +530,11 @@ struct fuse_loop_config_v1 {
  */
 #define FUSE_CAP_IOMAP_DIRECTIO (1ULL << 33)
 
+/**
+ * Client supports using iomap for pagecache I/O file operations
+ */
+#define FUSE_CAP_IOMAP_PAGECACHE (1ULL << 34)
+
 /**
  * Ioctl flags
  *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index a2c044b5957169..93ecb98a0bc20f 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -235,6 +235,7 @@
  *    SEEK_{DATA,HOLE} support
  *  - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
  *  - add FUSE_IOMAP_DIRECTIO for direct I/O support
+ *  - add FUSE_IOMAP_PAGECACHE for pagecache I/O support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -444,6 +445,7 @@ struct fuse_file_lock {
  * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
  *	       operations.
  * FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
+ * FUSE_IOMAP_PAGECACHE: Client supports iomap for pagecache I/O operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -493,6 +495,7 @@ struct fuse_file_lock {
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_IOMAP		(1ULL << 43)
 #define FUSE_IOMAP_DIRECTIO	(1ULL << 44)
+#define FUSE_IOMAP_PAGECACHE	(1ULL << 45)
 
 /**
  * CUSE INIT request/reply flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 9d07743fe522c6..fd12daf509cebf 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2639,7 +2639,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			se->conn.capable_ext |= FUSE_CAP_IOMAP;
 		if (inargflags & FUSE_IOMAP_DIRECTIO)
 			se->conn.capable_ext |= FUSE_CAP_IOMAP_DIRECTIO;
-
+		if (inargflags & FUSE_IOMAP_PAGECACHE)
+			se->conn.capable_ext |= FUSE_CAP_IOMAP_PAGECACHE;
 	} else {
 		se->conn.max_readahead = 0;
 	}
@@ -2688,6 +2689,7 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 	/* servers need to opt-in to iomap explicitly */
 	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP);
 	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP_DIRECTIO);
+	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP_PAGECACHE);
 
 	/* This could safely become default, but libfuse needs an API extension
 	 * to support it
@@ -2816,6 +2818,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		outargflags |= FUSE_IOMAP;
 	if (se->conn.want_ext & FUSE_CAP_IOMAP_DIRECTIO)
 		outargflags |= FUSE_IOMAP_DIRECTIO;
+	if (se->conn.want_ext & FUSE_CAP_IOMAP_PAGECACHE)
+		outargflags |= FUSE_IOMAP_PAGECACHE;
 
 	if (inargflags & FUSE_INIT_EXT) {
 		outargflags |= FUSE_INIT_EXT;
@@ -2861,6 +2865,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			fuse_log(FUSE_LOG_DEBUG, "   iomap=1\n");
 		if (se->conn.want_ext & FUSE_CAP_IOMAP_DIRECTIO)
 			fuse_log(FUSE_LOG_DEBUG, "   iomap_directio=1\n");
+		if (se->conn.want_ext & FUSE_CAP_IOMAP_PAGECACHE)
+			fuse_log(FUSE_LOG_DEBUG, "   iomap_pagecache=1\n");
 	}
 	if (arg->minor < 5)
 		outargsize = FUSE_COMPAT_INIT_OUT_SIZE;



* [PATCH 8/8] libfuse: allow discovery of the kernel's iomap capabilities
  2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-05-22  0:07   ` [PATCH 7/8] libfuse: add FUSE_IOMAP_PAGECACHE Darrick J. Wong
@ 2025-05-22  0:07   ` Darrick J. Wong
  7 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:07 UTC (permalink / raw)
  To: bschubert, djwong; +Cc: linux-fsdevel, bernd, John, joannelkoong, miklos

From: Darrick J. Wong <djwong@kernel.org>

Create a library function so that we can discover the kernel's iomap
capabilities ahead of time.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
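One way a server could use the returned mask before mounting is sketched
below.  The FUSE_CAP_IOMAP* values come from this series; enum demo_io_mode
and demo_pick_io_mode() are hypothetical names, and the three-way split is
only one possible policy.

```c
#include <assert.h>
#include <stdint.h>

/* Capability bits as defined in fuse_common.h by this series. */
#define FUSE_CAP_IOMAP			(1ULL << 32)
#define FUSE_CAP_IOMAP_DIRECTIO		(1ULL << 33)
#define FUSE_CAP_IOMAP_PAGECACHE	(1ULL << 34)

/* Hypothetical IO strategies for a fuse server. */
enum demo_io_mode {
	DEMO_IO_CLASSIC,	/* no iomap; regular fuse read/write */
	DEMO_IO_REPORT_ONLY,	/* iomap for FIEMAP and SEEK_{DATA,HOLE} only */
	DEMO_IO_FILE_IO,	/* iomap for direct and/or buffered file IO */
};

/* caps is the mask that fuse_discover_iomap() would return. */
static enum demo_io_mode demo_pick_io_mode(uint64_t caps)
{
	if (!(caps & FUSE_CAP_IOMAP))
		return DEMO_IO_CLASSIC;
	if (caps & (FUSE_CAP_IOMAP_DIRECTIO | FUSE_CAP_IOMAP_PAGECACHE))
		return DEMO_IO_FILE_IO;
	return DEMO_IO_REPORT_ONLY;
}
```

Because fuse_discover_iomap returns zero on any failure, a caller cannot
tell "no support" from "could not open /dev/fuse"; both end up in the
classic mode here.
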
 include/fuse_kernel.h   |   13 +++++++++++++
 include/fuse_lowlevel.h |    5 +++++
 lib/fuse_lowlevel.c     |   28 ++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    1 +
 4 files changed, 47 insertions(+)


diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 93ecb98a0bc20f..71077eb9f49fef 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -1129,12 +1129,25 @@ struct fuse_backing_map {
 	uint64_t	padding;
 };
 
+/* basic reporting functionality */
+#define FUSE_IOMAP_SUPPORT_BASICS	(1ULL << 0)
+/* fuse driver can do direct io */
+#define FUSE_IOMAP_SUPPORT_DIRECTIO	(1ULL << 1)
+/* fuse driver can do buffered io */
+#define FUSE_IOMAP_SUPPORT_PAGECACHE	(1ULL << 2)
+struct fuse_iomap_support {
+	uint64_t	flags;
+	uint64_t	padding;
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
 #define FUSE_DEV_IOC_BACKING_OPEN	_IOW(FUSE_DEV_IOC_MAGIC, 1, \
 					     struct fuse_backing_map)
 #define FUSE_DEV_IOC_BACKING_CLOSE	_IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 3, \
+					     struct fuse_iomap_support)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index eb457007a72cbc..a74d287f9012e9 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2410,6 +2410,11 @@ int fuse_session_receive_buf(struct fuse_session *se, struct fuse_buf *buf);
  */
 bool fuse_req_is_uring(fuse_req_t req);
 
+/**
+ * Discover the kernel's iomap capabilities.  Returns FUSE_CAP_IOMAP* flags or 0.
+ */
+uint64_t fuse_discover_iomap(void);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index fd12daf509cebf..9779e6ea7cc8ac 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -4341,3 +4341,31 @@ int fuse_session_exited(struct fuse_session *se)
 {
 	return se->exited;
 }
+
+uint64_t fuse_discover_iomap(void)
+{
+	struct fuse_iomap_support ios;
+	uint64_t ret = 0;
+	int fd;
+
+	fd = open("/dev/fuse", O_RDONLY | O_CLOEXEC);
+	if (fd < 0)
+		return 0;
+
+	ret = ioctl(fd, FUSE_DEV_IOC_IOMAP_SUPPORT, &ios);
+	if (ret) {
+		ret = 0;
+		goto out_close;
+	}
+
+	if (ios.flags & FUSE_IOMAP_SUPPORT_BASICS)
+		ret |= FUSE_CAP_IOMAP;
+	if (ios.flags & FUSE_IOMAP_SUPPORT_DIRECTIO)
+		ret |= FUSE_CAP_IOMAP_DIRECTIO;
+	if (ios.flags & FUSE_IOMAP_SUPPORT_PAGECACHE)
+		ret |= FUSE_CAP_IOMAP_PAGECACHE;
+
+out_close:
+	close(fd);
+	return ret;
+}
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 5c04e204adba33..210527ce9dd283 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -206,6 +206,7 @@ FUSE_3.18 {
 	global:
 		fuse_req_is_uring;
 		fuse_lowlevel_notify_iomap_add_device;
+		fuse_discover_iomap;
 } FUSE_3.17;
 
 # Local Variables:



* [PATCH 01/10] libext2fs: always fsync the device when flushing the cache
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
@ 2025-05-22  0:08   ` Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:08 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

When we're flushing the unix IO manager's buffer cache, we should always
fsync the block device, because something could have written to the
block device -- either the buffer cache itself, or a direct write.
Regardless, the callers all want all dirtied regions to be persisted to
stable media.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index ede75cf8ee3681..40fd9cc1427c31 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1452,7 +1452,8 @@ static errcode_t unix_flush(io_channel channel)
 	retval = flush_cached_blocks(channel, data, 0);
 #endif
 #ifdef HAVE_FSYNC
-	if (!retval && fsync(data->dev) != 0)
+	/* always fsync the device, even if flushing our own cache failed */
+	if (fsync(data->dev) != 0 && !retval)
 		return errno;
 #endif
 	return retval;

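The reordered condition in unix_flush works because of short-circuit
evaluation: with fsync() on the left-hand side of the &&, it runs no
matter what the cache flush returned.  Here is a minimal sketch of the
difference, assuming a fake sync routine that only counts calls.  All
demo_* names are made up for illustration and are not part of libext2fs.

```c
static int demo_sync_calls;

/* Fake fsync() that only counts invocations and always succeeds. */
static int demo_sync(void)
{
	demo_sync_calls++;
	return 0;
}

/* Old ordering: a failed cache flush short-circuits the sync away. */
static int demo_flush_old(int cache_err)
{
	if (!cache_err && demo_sync() != 0)
		return -1;
	return cache_err;
}

/* New ordering: the sync is evaluated first, so it always runs. */
static int demo_flush_new(int cache_err)
{
	if (demo_sync() != 0 && !cache_err)
		return -1;
	return cache_err;
}
```

In both variants the cache flush error is still what the caller sees;
the only change is whether the device gets synced after that failure.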


* [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
@ 2025-05-22  0:08   ` Darrick J. Wong
  2025-05-22  0:09   ` [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:08 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

unix_close is the last chance that libext2fs has to report write
failures to users.  Although it's likely that ext2fs_close already
called ext2fs_flush and told the IO manager to flush, we could do one
more sync before we close the file descriptor.  Also don't override the
fsync's errno with the close's errno.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 40fd9cc1427c31..7c5cb075d6b6b6 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1136,8 +1136,11 @@ static errcode_t unix_close(io_channel channel)
 #ifndef NO_IO_CACHE
 	retval = flush_cached_blocks(channel, data, 0);
 #endif
+	/* always fsync the device, even if flushing our own cache failed */
+	if (fsync(data->dev) != 0 && !retval)
+		retval = errno;
 
-	if (close(data->dev) < 0)
+	if (close(data->dev) < 0 && !retval)
 		retval = errno;
 	free_cache(data);
 	free(data->cache);

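The error precedence in unix_close can be sketched in isolation: each
step runs unconditionally, but only the first failure is reported.
This is an illustrative stand-in, not the libext2fs code; demo_close
and its cache_err parameter (the result of the earlier cache flush) are
hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/* Sync and close fd, returning the first error seen (0 if none). */
static int demo_close(int fd, int cache_err)
{
	int ret = cache_err;

	/* Last chance to learn about write failures on the device. */
	if (fsync(fd) != 0 && !ret)
		ret = errno;

	/* close() still runs, but must not mask an earlier failure. */
	if (close(fd) < 0 && !ret)
		ret = errno;

	return ret;
}
```

Calling this with an invalid descriptor and no prior error reports the
fsync failure; passing a nonzero cache_err returns that value unchanged
even though fsync and close both fail afterwards.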


* [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
  2025-05-22  0:08   ` [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
@ 2025-05-22  0:09   ` Darrick J. Wong
  2025-05-22  0:09   ` [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:09 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

As an optimization, only fsync the block device fd if we tried to write
to the io channel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |   48 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 6 deletions(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 7c5cb075d6b6b6..0fc83e471ca0fe 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -129,10 +129,13 @@ struct unix_cache {
 #define WRITE_DIRECT_SIZE 4	/* Must be smaller than CACHE_SIZE */
 #define READ_DIRECT_SIZE 4	/* Should be smaller than CACHE_SIZE */
 
+#define UNIX_STATE_DIRTY	(1U << 0) /* device needs fsyncing */
+
 struct unix_private_data {
 	int	magic;
 	int	dev;
 	int	flags;
+	unsigned int	state; /* UNIX_STATE_* */
 	int	align;
 	int	access_time;
 	ext2_loff_t offset;
@@ -1121,10 +1124,37 @@ static errcode_t unix_open(const char *name, int flags,
 	return unix_open_channel(name, fd, flags, channel, unix_io_manager);
 }
 
+static void mark_dirty(io_channel channel)
+{
+	struct unix_private_data *data =
+		(struct unix_private_data *) channel->private_data;
+
+	mutex_lock(data, CACHE_MTX);
+	data->state |= UNIX_STATE_DIRTY;
+	mutex_unlock(data, CACHE_MTX);
+}
+
+static errcode_t maybe_fsync(io_channel channel)
+{
+	struct unix_private_data *data =
+		(struct unix_private_data *) channel->private_data;
+	int was_dirty;
+
+	mutex_lock(data, CACHE_MTX);
+	was_dirty = data->state & UNIX_STATE_DIRTY;
+	data->state &= ~UNIX_STATE_DIRTY;
+	mutex_unlock(data, CACHE_MTX);
+
+	if (was_dirty && fsync(data->dev) != 0)
+		return errno;
+
+	return 0;
+}
+
 static errcode_t unix_close(io_channel channel)
 {
 	struct unix_private_data *data;
-	errcode_t	retval = 0;
+	errcode_t	retval = 0, retval2;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	data = (struct unix_private_data *) channel->private_data;
@@ -1137,8 +1167,9 @@ static errcode_t unix_close(io_channel channel)
 	retval = flush_cached_blocks(channel, data, 0);
 #endif
 	/* always fsync the device, even if flushing our own cache failed */
-	if (fsync(data->dev) != 0 && !retval)
-		retval = errno;
+	retval2 = maybe_fsync(channel);
+	if (retval2 && !retval)
+		retval = retval2;
 
 	if (close(data->dev) < 0 && !retval)
 		retval = errno;
@@ -1306,6 +1337,8 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 	data = (struct unix_private_data *) channel->private_data;
 	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
 
+	mark_dirty(channel);
+
 #ifdef NO_IO_CACHE
 	return raw_write_blk(channel, data, block, count, buf, 0);
 #else
@@ -1430,6 +1463,8 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 	if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0)
 		return errno;
 
+	mark_dirty(channel);
+
 	actual = write(data->dev, buf, size);
 	if (actual < 0)
 		return errno;
@@ -1445,7 +1480,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 static errcode_t unix_flush(io_channel channel)
 {
 	struct unix_private_data *data;
-	errcode_t retval = 0;
+	errcode_t retval = 0, retval2;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
 	data = (struct unix_private_data *) channel->private_data;
@@ -1456,8 +1491,9 @@ static errcode_t unix_flush(io_channel channel)
 #endif
 #ifdef HAVE_FSYNC
 	/* always fsync the device, even if flushing our own cache failed */
-	if (fsync(data->dev) != 0 && !retval)
-		return errno;
+	retval2 = maybe_fsync(channel);
+	if (retval2 && !retval)
+		retval = retval2;
 #endif
 	return retval;
 }

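The dirty tracking above is a test-and-clear of one state bit under the
cache mutex.  A rough standalone sketch follows, assuming a pthread
mutex in place of the CACHE_MTX helpers and a counter in place of the
real fsync() call.  The demo_* names are hypothetical.

```c
#include <pthread.h>

#define DEMO_STATE_DIRTY	(1U << 0)	/* device needs syncing */

struct demo_channel {
	pthread_mutex_t	lock;
	unsigned int	state;		/* DEMO_STATE_* */
	unsigned int	sync_calls;	/* demo only, not thread safe */
};

/* Call before any write is issued to the device. */
static void demo_mark_dirty(struct demo_channel *ch)
{
	pthread_mutex_lock(&ch->lock);
	ch->state |= DEMO_STATE_DIRTY;
	pthread_mutex_unlock(&ch->lock);
}

/* Test and clear the dirty bit; sync only if it was set. */
static int demo_maybe_sync(struct demo_channel *ch)
{
	unsigned int was_dirty;

	pthread_mutex_lock(&ch->lock);
	was_dirty = ch->state & DEMO_STATE_DIRTY;
	ch->state &= ~DEMO_STATE_DIRTY;
	pthread_mutex_unlock(&ch->lock);

	if (was_dirty)
		ch->sync_calls++;	/* a real channel would fsync() here */
	return 0;
}
```

Any number of writes between two flushes collapses into one sync, and
a flush with no intervening write does not sync at all.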


* [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-05-22  0:09   ` [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
@ 2025-05-22  0:09   ` Darrick J. Wong
  2025-05-22  0:09   ` [PATCH 05/10] libext2fs: add tagged block IO for better caching Darrick J. Wong
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:09 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

When we're freeing blocks, we should tell the IO manager to drop them
from any cache that it maintains to improve performance.  Dirty
contents cached for a freed block are discarded, not written back.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2_io.h         |    6 +++++-
 debian/libext2fs2t64.symbols |    1 +
 lib/ext2fs/alloc_stats.c     |    7 +++++++
 lib/ext2fs/io_manager.c      |    8 ++++++++
 lib/ext2fs/unix_io.c         |   32 ++++++++++++++++++++++++++++++++
 5 files changed, 53 insertions(+), 1 deletion(-)


diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index 78c988374c8808..bab7f2a6a44b81 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -103,7 +103,9 @@ struct struct_io_manager {
 	errcode_t (*zeroout)(io_channel channel, unsigned long long block,
 			     unsigned long long count);
 	errcode_t (*get_fd)(io_channel channel, int *fd);
-	long	reserved[13];
+	errcode_t (*invalidate_blk)(io_channel channel,
+				    unsigned long long block);
+	long	reserved[12];
 };
 
 #define IO_FLAG_RW		0x0001
@@ -147,6 +149,8 @@ extern errcode_t io_channel_cache_readahead(io_channel io,
 					    unsigned long long block,
 					    unsigned long long count);
 extern errcode_t io_channel_fd(io_channel io, int *fd);
+extern errcode_t io_channel_invalidate_blk(io_channel io,
+					   unsigned long long block);
 
 #ifdef _WIN32
 /* windows_io.c */
diff --git a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols
index 9cf3b33ca15f91..13870c4b545b2f 100644
--- a/debian/libext2fs2t64.symbols
+++ b/debian/libext2fs2t64.symbols
@@ -689,6 +689,7 @@ libext2fs.so.2 libext2fs2t64 #MINVER#
  io_channel_cache_readahead@Base 1.43
  io_channel_discard@Base 1.42
  io_channel_fd@Base 1.47.3
+ io_channel_invalidate_blk@Base 1.47.3
  io_channel_read_blk64@Base 1.41.1
  io_channel_set_options@Base 1.37
  io_channel_write_blk64@Base 1.41.1
diff --git a/lib/ext2fs/alloc_stats.c b/lib/ext2fs/alloc_stats.c
index 6f98bcc7cbd5f3..4aeaa286b88a7e 100644
--- a/lib/ext2fs/alloc_stats.c
+++ b/lib/ext2fs/alloc_stats.c
@@ -84,6 +84,13 @@ void ext2fs_block_alloc_stats2(ext2_filsys fs, blk64_t blk, int inuse)
 	ext2fs_mark_bb_dirty(fs);
 	if (fs->block_alloc_stats)
 		(fs->block_alloc_stats)(fs, (blk64_t) blk, inuse);
+
+	if (inuse < 0) {
+		unsigned int i;
+
+		for (i = 0; i < EXT2FS_CLUSTER_RATIO(fs); i++)
+			io_channel_invalidate_blk(fs->io, blk + i);
+	}
 }
 
 void ext2fs_block_alloc_stats(ext2_filsys fs, blk_t blk, int inuse)
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index 1bab069de63e12..aa7fc58b846be8 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -158,3 +158,11 @@ errcode_t io_channel_fd(io_channel io, int *fd)
 
 	return io->manager->get_fd(io, fd);
 }
+
+errcode_t io_channel_invalidate_blk(io_channel io, unsigned long long block)
+{
+	if (!io->manager->invalidate_blk)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io->manager->invalidate_blk(io, block);
+}
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 0fc83e471ca0fe..89f7915371307f 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -664,6 +664,23 @@ static errcode_t reuse_cache(io_channel channel,
 #define FLUSH_INVALIDATE	0x01
 #define FLUSH_NOLOCK		0x02
 
+/* Remove a block from the cache.  Dirty contents are discarded. */
+static void invalidate_cached_block(io_channel channel,
+				    struct unix_private_data *data,
+				    unsigned long long block)
+{
+	struct unix_cache	*cache;
+	int			i;
+
+	mutex_lock(data, CACHE_MTX);
+	for (i = 0, cache = data->cache; i < data->cache_size; i++, cache++) {
+		if (!cache->in_use || cache->block != block)
+			continue;
+		cache->in_use = 0;
+	}
+	mutex_unlock(data, CACHE_MTX);
+}
+
 /*
  * Flush all of the blocks in the cache
  */
@@ -1705,6 +1722,19 @@ static errcode_t unix_get_fd(io_channel channel, int *fd)
 	return 0;
 }
 
+static errcode_t unix_invalidate_blk(io_channel channel,
+				     unsigned long long block)
+{
+	struct unix_private_data *data;
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	data = (struct unix_private_data *) channel->private_data;
+	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+	invalidate_cached_block(channel, data, block);
+	return 0;
+}
+
 #if __GNUC_PREREQ (4, 6)
 #pragma GCC diagnostic pop
 #endif
@@ -1727,6 +1757,7 @@ static struct struct_io_manager struct_unix_manager = {
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,
 	.get_fd		= unix_get_fd,
+	.invalidate_blk	= unix_invalidate_blk,
 };
 
 io_manager unix_io_manager = &struct_unix_manager;
@@ -1749,6 +1780,7 @@ static struct struct_io_manager struct_unixfd_manager = {
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,
 	.get_fd		= unix_get_fd,
+	.invalidate_blk	= unix_invalidate_blk,
 };
 
 io_manager unixfd_io_manager = &struct_unixfd_manager;

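To illustrate the invalidation walk: freeing a cluster drops every
cached block inside it, dirty or not, and leaves unrelated entries
alone.  This sketch uses a plain array as the cache and omits locking.
The demo_* types and functions are invented for the example and only
mirror the shape of the unix IO manager code.

```c
#include <stddef.h>

struct demo_cache_entry {
	unsigned long long	block;
	unsigned		dirty:1;
	unsigned		in_use:1;
};

/* Drop every entry caching @block; dirty contents are discarded. */
static void demo_invalidate_block(struct demo_cache_entry *cache, size_t n,
				  unsigned long long block)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!cache[i].in_use || cache[i].block != block)
			continue;
		cache[i].in_use = 0;
	}
}

/* Freeing one cluster invalidates each block within it. */
static void demo_free_cluster(struct demo_cache_entry *cache, size_t n,
			      unsigned long long blk, unsigned int ratio)
{
	unsigned int i;

	for (i = 0; i < ratio; i++)
		demo_invalidate_block(cache, n, blk + i);
}
```

With a cluster ratio of 2, freeing block 100 invalidates cached copies
of blocks 100 and 101 but not an entry for block 200.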


* [PATCH 05/10] libext2fs: add tagged block IO for better caching
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-05-22  0:09   ` [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
@ 2025-05-22  0:09   ` Darrick J. Wong
  2025-05-22  0:09   ` [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager Darrick J. Wong
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:09 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Pass inode numbers from the fileio.c code through the io manager to the
unix io manager so that we can manage the disk cache more effectively.
In the next few patches we'll need the ability to flush and invalidate
the caches for specific files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2_io.h         |   25 +++++++++++++++++++++-
 debian/libext2fs2t64.symbols |    4 ++++
 lib/ext2fs/fileio.c          |   14 +++++++-----
 lib/ext2fs/io_manager.c      |   48 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 84 insertions(+), 7 deletions(-)


diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index bab7f2a6a44b81..64b35b31d669e7 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -39,6 +39,11 @@ typedef struct struct_io_stats *io_stats;
 
 #define io_channel_discard_zeroes_data(i) (i->flags & CHANNEL_FLAGS_DISCARD_ZEROES)
 
+typedef unsigned int	io_channel_tag_t;
+
+/* I/O operation has no associated tag */
+#define IO_CHANNEL_TAG_NULL		(0)
+
 struct struct_io_channel {
 	errcode_t	magic;
 	io_manager	manager;
@@ -105,7 +110,15 @@ struct struct_io_manager {
 	errcode_t (*get_fd)(io_channel channel, int *fd);
 	errcode_t (*invalidate_blk)(io_channel channel,
 				    unsigned long long block);
-	long	reserved[12];
+	errcode_t (*read_tagblk)(io_channel channel, io_channel_tag_t tag,
+				 unsigned long long block, int count,
+				 void *data);
+	errcode_t (*write_tagblk)(io_channel channel, io_channel_tag_t tag,
+				   unsigned long long block, int count,
+				   const void *data);
+	errcode_t (*flush_tag)(io_channel channel, io_channel_tag_t tag);
+	errcode_t (*invalidate_tag)(io_channel channel, io_channel_tag_t tag);
+	long	reserved[8];
 };
 
 #define IO_FLAG_RW		0x0001
@@ -134,9 +147,17 @@ extern errcode_t io_channel_write_byte(io_channel channel,
 extern errcode_t io_channel_read_blk64(io_channel channel,
 				       unsigned long long block,
 				       int count, void *data);
+extern errcode_t io_channel_read_tagblk(io_channel channel,
+					io_channel_tag_t tag,
+					unsigned long long block, int count,
+					void *data);
 extern errcode_t io_channel_write_blk64(io_channel channel,
 					unsigned long long block,
 					int count, const void *data);
+extern errcode_t io_channel_write_tagblk(io_channel channel,
+					 io_channel_tag_t tag,
+					 unsigned long long block, int count,
+					 const void *data);
 extern errcode_t io_channel_discard(io_channel channel,
 				    unsigned long long block,
 				    unsigned long long count);
@@ -151,6 +172,8 @@ extern errcode_t io_channel_cache_readahead(io_channel io,
 extern errcode_t io_channel_fd(io_channel io, int *fd);
 extern errcode_t io_channel_invalidate_blk(io_channel io,
 					   unsigned long long block);
+extern errcode_t io_channel_flush_tag(io_channel io, io_channel_tag_t tag);
+extern errcode_t io_channel_invalidate_tag(io_channel io, io_channel_tag_t tag);
 
 #ifdef _WIN32
 /* windows_io.c */
diff --git a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols
index 13870c4b545b2f..87ed63155702e0 100644
--- a/debian/libext2fs2t64.symbols
+++ b/debian/libext2fs2t64.symbols
@@ -689,11 +689,15 @@ libext2fs.so.2 libext2fs2t64 #MINVER#
  io_channel_cache_readahead@Base 1.43
  io_channel_discard@Base 1.42
  io_channel_fd@Base 1.47.3
+ io_channel_flush_tag@Base 1.47.3
  io_channel_invalidate_blk@Base 1.47.3
+ io_channel_invalidate_tag@Base 1.47.3
  io_channel_read_blk64@Base 1.41.1
+ io_channel_read_tagblk@Base 1.47.3
  io_channel_set_options@Base 1.37
  io_channel_write_blk64@Base 1.41.1
  io_channel_write_byte@Base 1.37
+ io_channel_write_tagblk@Base 1.47.3
  io_channel_zeroout@Base 1.43
  qcow2_read_header@Base 1.42
  qcow2_write_raw_image@Base 1.42
diff --git a/lib/ext2fs/fileio.c b/lib/ext2fs/fileio.c
index 818f7f05420029..1b7e88d990036b 100644
--- a/lib/ext2fs/fileio.c
+++ b/lib/ext2fs/fileio.c
@@ -167,7 +167,8 @@ errcode_t ext2fs_file_flush(ext2_file_t file)
 			return retval;
 	}
 
-	retval = io_channel_write_blk64(fs->io, file->physblock, 1, file->buf);
+	retval = io_channel_write_tagblk(fs->io, file->ino, file->physblock,
+					  1, file->buf);
 	if (retval)
 		return retval;
 
@@ -220,9 +221,10 @@ static errcode_t load_buffer(ext2_file_t file, int dontfill)
 		if (!dontfill) {
 			if (file->physblock &&
 			    !(ret_flags & BMAP_RET_UNINIT)) {
-				retval = io_channel_read_blk64(fs->io,
-							       file->physblock,
-							       1, file->buf);
+				retval = io_channel_read_tagblk(fs->io,
+								 file->ino,
+								 file->physblock,
+								 1, file->buf);
 				if (retval)
 					return retval;
 			} else
@@ -603,13 +605,13 @@ static errcode_t ext2fs_file_zero_past_offset(ext2_file_t file,
 		return retval;
 
 	/* Read/zero/write block */
-	retval = io_channel_read_blk64(fs->io, blk, 1, b);
+	retval = io_channel_read_tagblk(fs->io, file->ino, blk, 1, b);
 	if (retval)
 		goto out;
 
 	memset(b + off, 0, fs->blocksize - off);
 
-	retval = io_channel_write_blk64(fs->io, blk, 1, b);
+	retval = io_channel_write_tagblk(fs->io, file->ino, blk, 1, b);
 	if (retval)
 		goto out;
 
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index aa7fc58b846be8..357a3bc7698129 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -85,6 +85,22 @@ errcode_t io_channel_read_blk64(io_channel channel, unsigned long long block,
 					     count, data);
 }
 
+errcode_t io_channel_read_tagblk(io_channel channel, io_channel_tag_t tag,
+				 unsigned long long block, int count,
+				 void *data)
+{
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+
+	if (channel->manager->read_tagblk)
+		return (channel->manager->read_tagblk)(channel, tag, block,
+						       count, data);
+
+	if (tag != IO_CHANNEL_TAG_NULL)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io_channel_read_blk64(channel, block, count, data);
+}
+
 errcode_t io_channel_write_blk64(io_channel channel, unsigned long long block,
 				 int count, const void *data)
 {
@@ -101,6 +117,22 @@ errcode_t io_channel_write_blk64(io_channel channel, unsigned long long block,
 					     count, data);
 }
 
+errcode_t io_channel_write_tagblk(io_channel channel, io_channel_tag_t tag,
+				  unsigned long long block, int count,
+				  const void *data)
+{
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+
+	if (channel->manager->write_tagblk)
+		return (channel->manager->write_tagblk)(channel, tag, block,
+							count, data);
+
+	if (tag != IO_CHANNEL_TAG_NULL)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io_channel_write_blk64(channel, block, count, data);
+}
+
 errcode_t io_channel_discard(io_channel channel, unsigned long long block,
 			     unsigned long long count)
 {
@@ -166,3 +198,19 @@ errcode_t io_channel_invalidate_blk(io_channel io, unsigned long long block)
 
 	return io->manager->invalidate_blk(io, block);
 }
+
+errcode_t io_channel_flush_tag(io_channel io, io_channel_tag_t tag)
+{
+	if (!io->manager->flush_tag)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io->manager->flush_tag(io, tag);
+}
+
+errcode_t io_channel_invalidate_tag(io_channel io, io_channel_tag_t tag)
+{
+	if (!io->manager->invalidate_tag)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io->manager->invalidate_tag(io, tag);
+}

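The fallback rule for tagged reads and writes can be sketched on its
own: use the tagged method if the manager provides one, refuse a real
tag if it does not, and otherwise fall through to the untagged method.
Everything named demo_* below is hypothetical, and the error value is a
placeholder rather than the real EXT2_ET_OP_NOT_SUPPORTED code.

```c
#include <stddef.h>

typedef unsigned int demo_tag_t;

#define DEMO_TAG_NULL		0
#define DEMO_ERR_NOT_SUPPORTED	1	/* placeholder error code */

struct demo_manager {
	int (*read_blk)(unsigned long long block, void *buf);
	/* optional; NULL if the manager predates tagged IO */
	int (*read_tagblk)(demo_tag_t tag, unsigned long long block,
			   void *buf);
};

static int demo_read_tagblk(const struct demo_manager *m, demo_tag_t tag,
			    unsigned long long block, void *buf)
{
	if (m->read_tagblk)
		return m->read_tagblk(tag, block, buf);

	/* An untagged manager cannot honor a real tag. */
	if (tag != DEMO_TAG_NULL)
		return DEMO_ERR_NOT_SUPPORTED;

	return m->read_blk(block, buf);
}

/* Toy untagged backend: stores the block number in the buffer. */
static int demo_plain_read(unsigned long long block, void *buf)
{
	*(unsigned long long *)buf = block;
	return 0;
}
```

A manager that only fills in read_blk still serves untagged callers,
while a caller passing a nonzero tag gets the not-supported error and
its buffer is left untouched.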


* [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-05-22  0:09   ` [PATCH 05/10] libext2fs: add tagged block IO for better caching Darrick J. Wong
@ 2025-05-22  0:09   ` Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:09 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

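Add tagged block caching to the UNIX IO manager.  Each cache entry now
records the tag of the most recent tagged IO that touched it, cache
flushes can be restricted to the entries carrying a given tag, and the
untagged blk64 entry points become wrappers that pass
IO_CHANNEL_TAG_NULL.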

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |  198 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 154 insertions(+), 44 deletions(-)


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 89f7915371307f..8a8afe47ee4503 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -120,6 +120,7 @@ struct unix_cache {
 	char			*buf;
 	unsigned long long	block;
 	int			access_time;
+	io_channel_tag_t	tag;
 	unsigned		dirty:1;
 	unsigned		in_use:1;
 	unsigned		write_err:1;
@@ -526,6 +527,7 @@ static errcode_t alloc_cache(io_channel channel,
 		cache->access_time = 0;
 		cache->dirty = 0;
 		cache->in_use = 0;
+		cache->tag = IO_CHANNEL_TAG_NULL;
 		if (cache->buf)
 			ext2fs_free_mem(&cache->buf);
 		retval = io_channel_alloc_buf(channel, 0, &cache->buf);
@@ -552,6 +554,7 @@ static void free_cache(struct unix_private_data *data)
 		cache->access_time = 0;
 		cache->dirty = 0;
 		cache->in_use = 0;
+		cache->tag = IO_CHANNEL_TAG_NULL;
 		if (cache->buf)
 			ext2fs_free_mem(&cache->buf);
 	}
@@ -639,8 +642,9 @@ static struct unix_cache *find_cached_block(struct unix_private_data *data,
  * Reuse a particular cache entry for another block.
  */
 static errcode_t reuse_cache(io_channel channel,
-		struct unix_private_data *data, struct unix_cache *cache,
-		unsigned long long block)
+			     struct unix_private_data *data,
+			     struct unix_cache *cache, io_channel_tag_t tag,
+			     unsigned long long block)
 {
 	if (cache->dirty && cache->in_use) {
 		errcode_t retval;
@@ -653,7 +657,16 @@ static errcode_t reuse_cache(io_channel channel,
 		}
 	}
 
+#ifdef DEBUG
+	if (cache->in_use)
+		printf("Reusing cached block %llu(%u) for %llu(%u)\n",
+			cache->block, cache->tag, block, tag);
+	else
+		printf("Using cached block %llu(%u)\n", block, tag);
+#endif
+
 	cache->in_use = 1;
+	cache->tag = tag;
 	cache->dirty = 0;
 	cache->write_err = 0;
 	cache->block = block;
@@ -664,6 +677,17 @@ static errcode_t reuse_cache(io_channel channel,
 #define FLUSH_INVALIDATE	0x01
 #define FLUSH_NOLOCK		0x02
 
+static inline void invalidate_cache(struct unix_cache *cache)
+{
+#ifdef DEBUG
+	if (cache->in_use)
+		printf("Invalidating cache %llu(%u)\n", cache->block,
+				cache->tag);
+#endif
+	cache->in_use = 0;
+	cache->tag = IO_CHANNEL_TAG_NULL;
+}
+
 /* Remove a block from the cache.  Dirty contents are discarded. */
 static void invalidate_cached_block(io_channel channel,
 				    struct unix_private_data *data,
@@ -676,7 +700,7 @@ static void invalidate_cached_block(io_channel channel,
 	for (i = 0, cache = data->cache; i < data->cache_size; i++, cache++) {
 		if (!cache->in_use || cache->block != block)
 			continue;
-		cache->in_use = 0;
+		invalidate_cache(cache);
 	}
 	mutex_unlock(data, CACHE_MTX);
 }
@@ -686,7 +710,7 @@ static void invalidate_cached_block(io_channel channel,
  */
 static errcode_t flush_cached_blocks(io_channel channel,
 				     struct unix_private_data *data,
-				     int flags)
+				     io_channel_tag_t tag, int flags)
 {
 	struct unix_cache	*cache;
 	errcode_t		retval, retval2 = 0;
@@ -698,6 +722,11 @@ static errcode_t flush_cached_blocks(io_channel channel,
 	for (i=0, cache = data->cache; i < data->cache_size; i++, cache++) {
 		if (!cache->in_use)
 			continue;
+		if (tag && cache->tag != tag)
+			continue;
+#ifdef DEBUG
+		printf("Flushing %sblock %llu(%u)\n", cache->dirty ? "dirty " : "", cache->block, cache->tag);
+#endif
 		if (cache->dirty) {
 			int raw_flags = RAW_WRITE_NO_HANDLER;
 
@@ -715,10 +744,10 @@ static errcode_t flush_cached_blocks(io_channel channel,
 				cache->dirty = 0;
 				cache->write_err = 0;
 				if (flags & FLUSH_INVALIDATE)
-					cache->in_use = 0;
+					invalidate_cache(cache);
 			}
 		} else if (flags & FLUSH_INVALIDATE) {
-			cache->in_use = 0;
+			invalidate_cache(cache);
 		}
 	}
 	if ((flags & FLUSH_NOLOCK) == 0)
@@ -737,7 +766,7 @@ static errcode_t flush_cached_blocks(io_channel channel,
 				unsigned long long err_block = cache->block;
 
 				cache->dirty = 0;
-				cache->in_use = 0;
+				invalidate_cache(cache);
 				cache->write_err = 0;
 				if (io_channel_alloc_buf(channel, 0,
 							 &err_buf))
@@ -772,7 +801,7 @@ static errcode_t shrink_cache(io_channel channel,
 
 	mutex_lock(data, CACHE_MTX);
 
-	retval = flush_cached_blocks(channel, data,
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
 			FLUSH_INVALIDATE | FLUSH_NOLOCK);
 	if (retval)
 		goto unlock;
@@ -784,6 +813,7 @@ static errcode_t shrink_cache(io_channel channel,
 		cache->access_time = 0;
 		cache->dirty = 0;
 		cache->in_use = 0;
+		cache->tag = IO_CHANNEL_TAG_NULL;
 		if (cache->buf)
 			ext2fs_free_mem(&cache->buf);
 	}
@@ -814,7 +844,7 @@ static errcode_t grow_cache(io_channel channel,
 
 	mutex_lock(data, CACHE_MTX);
 
-	retval = flush_cached_blocks(channel, data,
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
 			FLUSH_INVALIDATE | FLUSH_NOLOCK);
 	if (retval)
 		goto unlock;
@@ -832,6 +862,7 @@ static errcode_t grow_cache(io_channel channel,
 		cache->access_time = 0;
 		cache->dirty = 0;
 		cache->in_use = 0;
+		cache->tag = IO_CHANNEL_TAG_NULL;
 		retval = io_channel_alloc_buf(channel, 0, &cache->buf);
 		if (retval)
 			goto unlock;
@@ -1181,7 +1212,7 @@ static errcode_t unix_close(io_channel channel)
 		return 0;
 
 #ifndef NO_IO_CACHE
-	retval = flush_cached_blocks(channel, data, 0);
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, 0);
 #endif
 	/* always fsync the device, even if flushing our own cache failed */
 	retval2 = maybe_fsync(channel);
@@ -1220,7 +1251,9 @@ static errcode_t unix_set_blksize(io_channel channel, int blksize)
 		mutex_lock(data, CACHE_MTX);
 		mutex_lock(data, BOUNCE_MTX);
 #ifndef NO_IO_CACHE
-		if ((retval = flush_cached_blocks(channel, data, FLUSH_NOLOCK))){
+		retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
+					     FLUSH_NOLOCK);
+		if (retval) {
 			mutex_unlock(data, BOUNCE_MTX);
 			mutex_unlock(data, CACHE_MTX);
 			return retval;
@@ -1236,8 +1269,9 @@ static errcode_t unix_set_blksize(io_channel channel, int blksize)
 	return retval;
 }
 
-static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
-			       int count, void *buf)
+static errcode_t unix_read_tagblk(io_channel channel, io_channel_tag_t tag,
+				  unsigned long long block, int count,
+				  void *buf)
 {
 	struct unix_private_data *data;
 	struct unix_cache *cache;
@@ -1249,6 +1283,10 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 	data = (struct unix_private_data *) channel->private_data;
 	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
 
+#ifdef DEBUG
+	printf("read block %llu(%u) count %d\n", block, tag, count);
+#endif
+
 #ifdef NO_IO_CACHE
 	return raw_read_blk(channel, data, block, count, buf);
 #else
@@ -1259,7 +1297,8 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 	 * flush out the cache and then do a direct read.
 	 */
 	if (count < 0 || count > WRITE_DIRECT_SIZE) {
-		if ((retval = flush_cached_blocks(channel, data, 0)))
+		retval = flush_cached_blocks(channel, data, tag, 0);
+		if (retval)
 			return retval;
 		return raw_read_blk(channel, data, block, count, buf);
 	}
@@ -1270,9 +1309,11 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 		/* If it's in the cache, use it! */
 		if ((cache = find_cached_block(data, block, NULL))) {
 #ifdef DEBUG
-			printf("Using cached block %lu\n", block);
+			printf("Reading from cached block %llu(%u)\n", block, tag);
 #endif
 			memcpy(cp, cache->buf, channel->block_size);
+			if (tag != IO_CHANNEL_TAG_NULL)
+				cache->tag = tag;
 			count--;
 			block++;
 			cp += channel->block_size;
@@ -1287,7 +1328,7 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 			if (find_cached_block(data, block+i, NULL))
 				break;
 #ifdef DEBUG
-		printf("Reading %d blocks starting at %lu\n", i, block);
+		printf("Reading %d blocks starting at %llu\n", i, block);
 #endif
 		mutex_unlock(data, CACHE_MTX);
 		if ((retval = raw_read_blk(channel, data, block, i, cp)))
@@ -1298,7 +1339,7 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 		for (j=0; j < i; j++) {
 			if (!find_cached_block(data, block, &cache)) {
 				retval = reuse_cache(channel, data,
-						     cache, block);
+						     cache, tag, block);
 				if (retval)
 					goto call_write_handler;
 				memcpy(cache->buf, cp, channel->block_size);
@@ -1317,7 +1358,7 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 		unsigned long long err_block = cache->block;
 
 		cache->dirty = 0;
-		cache->in_use = 0;
+		invalidate_cache(cache);
 		cache->write_err = 0;
 		if (io_channel_alloc_buf(channel, 0, &err_buf))
 			err_buf = NULL;
@@ -1335,14 +1376,22 @@ static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
 #endif /* NO_IO_CACHE */
 }
 
+static errcode_t unix_read_blk64(io_channel channel, unsigned long long block,
+				  int count, void *buf)
+{
+	return unix_read_tagblk(channel, IO_CHANNEL_TAG_NULL, block, count,
+				buf);
+}
+
 static errcode_t unix_read_blk(io_channel channel, unsigned long block,
 			       int count, void *buf)
 {
 	return unix_read_blk64(channel, block, count, buf);
 }
 
-static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
-				int count, const void *buf)
+static errcode_t unix_write_tagblk(io_channel channel, io_channel_tag_t tag,
+				   unsigned long long block, int count,
+				   const void *buf)
 {
 	struct unix_private_data *data;
 	struct unix_cache *cache, *reuse;
@@ -1354,6 +1403,10 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 	data = (struct unix_private_data *) channel->private_data;
 	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
 
+#ifdef DEBUG
+	printf("write block %llu(%u) count %d\n", block, tag, count);
+#endif
+
 	mark_dirty(channel);
 
 #ifdef NO_IO_CACHE
@@ -1366,8 +1419,9 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 	 * flush out the cache completely and then do a direct write.
 	 */
 	if (count < 0 || count > WRITE_DIRECT_SIZE) {
-		if ((retval = flush_cached_blocks(channel, data,
-						  FLUSH_INVALIDATE)))
+		retval = flush_cached_blocks(channel, data, tag,
+					     FLUSH_INVALIDATE);
+		if (retval)
 			return retval;
 		return raw_write_blk(channel, data, block, count, buf, 0);
 	}
@@ -1385,11 +1439,17 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 	mutex_lock(data, CACHE_MTX);
 	while (count > 0) {
 		cache = find_cached_block(data, block, &reuse);
-		if (!cache) {
+		if (cache) {
+#ifdef DEBUG
+			printf("Writing to cached block %llu(%u)\n", block, tag);
+#endif
+			if (tag != IO_CHANNEL_TAG_NULL)
+				cache->tag = tag;
+		} else {
 			errcode_t err;
 
 			cache = reuse;
-			err = reuse_cache(channel, data, cache, block);
+			err = reuse_cache(channel, data, cache, tag, block);
 			if (err)
 				goto call_write_handler;
 		}
@@ -1409,7 +1469,7 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 		unsigned long long err_block = cache->block;
 
 		cache->dirty = 0;
-		cache->in_use = 0;
+		invalidate_cache(cache);
 		cache->write_err = 0;
 		if (io_channel_alloc_buf(channel, 0, &err_buf))
 			err_buf = NULL;
@@ -1427,6 +1487,13 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 #endif /* NO_IO_CACHE */
 }
 
+static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
+				  int count, const void *buf)
+{
+	return unix_write_tagblk(channel, IO_CHANNEL_TAG_NULL, block, count,
+				 buf);
+}
+
 static errcode_t unix_cache_readahead(io_channel channel,
 				      unsigned long long block,
 				      unsigned long long count)
@@ -1473,7 +1540,9 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 	/*
 	 * Flush out the cache completely
 	 */
-	if ((retval = flush_cached_blocks(channel, data, FLUSH_INVALIDATE)))
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
+				     FLUSH_INVALIDATE);
+	if (retval)
 		return retval;
 #endif
 
@@ -1491,28 +1560,60 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 	return 0;
 }
 
+/*
+ * Flush data buffers with the given tag to disk and invalidate them.
+ */
+static errcode_t unix_invalidate_tag(io_channel channel, io_channel_tag_t tag)
+{
+	struct unix_private_data *data;
+	errcode_t retval = 0, retval2;
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	data = (struct unix_private_data *) channel->private_data;
+	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+#ifndef NO_IO_CACHE
+	retval = flush_cached_blocks(channel, data, tag, FLUSH_INVALIDATE);
+#endif
+#ifdef HAVE_FSYNC
+	/* always fsync the device, even if flushing our own cache failed */
+	retval2 = maybe_fsync(channel);
+	if (retval2 && !retval)
+		retval = retval2;
+#endif
+	return retval;
+}
+
+/*
+ * Flush data buffers with the given tag to disk.
+ */
+static errcode_t unix_flush_tag(io_channel channel, io_channel_tag_t tag)
+{
+	struct unix_private_data *data;
+	errcode_t retval = 0, retval2;
+
+	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+	data = (struct unix_private_data *) channel->private_data;
+	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+#ifndef NO_IO_CACHE
+	retval = flush_cached_blocks(channel, data, tag, 0);
+#endif
+#ifdef HAVE_FSYNC
+	/* always fsync the device, even if flushing our own cache failed */
+	retval2 = maybe_fsync(channel);
+	if (retval2 && !retval)
+		retval = retval2;
+#endif
+	return retval;
+}
+
 /*
  * Flush data buffers to disk.
  */
 static errcode_t unix_flush(io_channel channel)
 {
-	struct unix_private_data *data;
-	errcode_t retval = 0, retval2;
-
-	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
-	data = (struct unix_private_data *) channel->private_data;
-	EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
-
-#ifndef NO_IO_CACHE
-	retval = flush_cached_blocks(channel, data, 0);
-#endif
-#ifdef HAVE_FSYNC
-	/* always fsync the device, even if flushing our own cache failed */
-	retval2 = maybe_fsync(channel);
-	if (retval2 && !retval)
-		retval = retval2;
-#endif
-	return retval;
+	return unix_flush_tag(channel, 0);
 }
 
 static errcode_t unix_set_option(io_channel channel, const char *option,
@@ -1547,7 +1648,8 @@ static errcode_t unix_set_option(io_channel channel, const char *option,
 			return 0;
 		}
 		if (!strcmp(arg, "off")) {
-			retval = flush_cached_blocks(channel, data, 0);
+			retval = flush_cached_blocks(channel, data,
+						     IO_CHANNEL_TAG_NULL, 0);
 			data->flags |= IO_FLAG_NOCACHE;
 			return retval;
 		}
@@ -1748,11 +1850,15 @@ static struct struct_io_manager struct_unix_manager = {
 	.read_blk	= unix_read_blk,
 	.write_blk	= unix_write_blk,
 	.flush		= unix_flush,
+	.flush_tag	= unix_flush_tag,
+	.invalidate_tag	= unix_invalidate_tag,
 	.write_byte	= unix_write_byte,
 	.set_option	= unix_set_option,
 	.get_stats	= unix_get_stats,
 	.read_blk64	= unix_read_blk64,
 	.write_blk64	= unix_write_blk64,
+	.read_tagblk	= unix_read_tagblk,
+	.write_tagblk	= unix_write_tagblk,
 	.discard	= unix_discard,
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,
@@ -1771,11 +1877,15 @@ static struct struct_io_manager struct_unixfd_manager = {
 	.read_blk	= unix_read_blk,
 	.write_blk	= unix_write_blk,
 	.flush		= unix_flush,
+	.flush_tag	= unix_flush_tag,
+	.invalidate_tag	= unix_invalidate_tag,
 	.write_byte	= unix_write_byte,
 	.set_option	= unix_set_option,
 	.get_stats	= unix_get_stats,
 	.read_blk64	= unix_read_blk64,
 	.write_blk64	= unix_write_blk64,
+	.read_tagblk	= unix_read_tagblk,
+	.write_tagblk	= unix_write_tagblk,
 	.discard	= unix_discard,
 	.cache_readahead	= unix_cache_readahead,
 	.zeroout	= unix_zeroout,



* [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-05-22  0:09   ` [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager Darrick J. Wong
@ 2025-05-22  0:10   ` Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:10 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

There's no need to invalidate the entire cache when writing a range of
bytes to the device.  Flush the dirty blocks as before, but invalidate
only the cached blocks that overlap the byte range being written.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)
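
For illustration, here is a minimal sketch of the block range math that
picks which cached blocks to invalidate.  The struct and helper names
are hypothetical and not part of libext2fs; the sketch assumes a nonzero
block size and ignores integer overflow.

```c
#include <stdint.h>

/* Half-open range [first, end) of block numbers touched by a write. */
struct blk_range {
	uint64_t first;	/* first block touched */
	uint64_t end;	/* one past the last block touched */
};

/*
 * Hypothetical helper: map the byte range [offset, offset + size) onto
 * block numbers.  The start is rounded down and the end is rounded up,
 * so partially covered blocks are included.
 */
static struct blk_range byte_range_to_blocks(uint64_t offset, uint64_t size,
					     uint64_t block_size)
{
	struct blk_range r;

	r.first = offset / block_size;
	r.end = (offset + size + block_size - 1) / block_size;
	return r;
}
```

With 4096-byte blocks, the 1024-byte primary superblock at byte offset
1024 maps to block 0 only, so a single cached block would be invalidated
rather than the whole cache.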


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 8a8afe47ee4503..4c924ec9ee0760 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1523,6 +1523,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 {
 	struct unix_private_data *data;
 	errcode_t	retval = 0;
+	unsigned long long bno, nbno;
 	ssize_t		actual;
 
 	EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
@@ -1538,12 +1539,18 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 
 #ifndef NO_IO_CACHE
 	/*
-	 * Flush out the cache completely
+	 * Flush all the dirty blocks, then invalidate the blocks we're about
+	 * to write.
 	 */
-	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL,
-				     FLUSH_INVALIDATE);
+	retval = flush_cached_blocks(channel, data, IO_CHANNEL_TAG_NULL, 0);
 	if (retval)
 		return retval;
+
+	bno = offset / channel->block_size;
+	nbno = (offset + size + channel->block_size - 1) / channel->block_size;
+
+	for (; bno < nbno; bno++)
+		invalidate_cached_block(channel, data, bno);
 #endif
 
 	if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0)



* [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-05-22  0:10   ` [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
@ 2025-05-22  0:10   ` Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:10 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

If someone calls write_byte on an IO channel with an alignment
requirement and the range to be written is aligned correctly, go ahead
and do the write.  This will be needed later when we try to speed up
superblock writes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/unix_io.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
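
A minimal sketch of the alignment test, assuming the alignment is a
power of two.  The helper names below are hypothetical; the patch itself
uses the existing IS_ALIGNED macro against channel->align.  Both the
start and the end of the device byte range must be aligned for the write
to be allowed.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n is a multiple of align; align must be a power of two. */
static bool is_aligned(uint64_t n, uint64_t align)
{
	return (n & (align - 1)) == 0;
}

/*
 * Hypothetical helper: dev_offset is where the filesystem starts on the
 * device, offset and size describe the write relative to that start.
 */
static bool write_range_is_aligned(uint64_t dev_offset, uint64_t offset,
				   uint64_t size, uint64_t align)
{
	return is_aligned(dev_offset + offset, align) &&
	       is_aligned(dev_offset + offset + size, align);
}
```

For example, a 1024-byte write at byte offset 1024 passes with a
512-byte alignment requirement but not with a 4096-byte one, in which
case the caller would still see EXT2_ET_UNIMPLEMENTED.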


diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 4c924ec9ee0760..008a5b46ce7f1f 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1534,7 +1534,9 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
 #ifdef ALIGN_DEBUG
 		printf("unix_write_byte: O_DIRECT fallback\n");
 #endif
-		return EXT2_ET_UNIMPLEMENTED;
+		if (!IS_ALIGNED(data->offset + offset, channel->align) ||
+		    !IS_ALIGNED(data->offset + offset + size, channel->align))
+			return EXT2_ET_UNIMPLEMENTED;
 	}
 
 #ifndef NO_IO_CACHE



* [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-05-22  0:10   ` [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
@ 2025-05-22  0:10   ` Darrick J. Wong
  2025-05-22  0:10   ` [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:10 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

write_primary_superblock currently does this weird dance where it will
try to write only the dirty bytes of the primary superblock to disk.  In
theory, this is done so that tune2fs can incrementally update superblock
bytes when the filesystem is mounted; ext2 was famous for allowing
tune2fs to use this dance to set new fs parameters and have them take
effect in real time.

The ability to do this safely was obliterated back in 2001 when ext3 was
introduced with journalling, because tune2fs has no way to know if the
journal has already logged an updated primary superblock but not yet
written it to disk, which means that they can race to write, and changes
can be lost.

This (non-)safety was further obliterated back in 2012 when I added
checksums to all the metadata blocks in ext4 because anyone else with
the block device open can see the primary superblock in an intermediate
state where the checksum does not match the superblock contents.

At this point in 2025 it's kind of stupid to still be doing this, and it
makes fuse2fs syncfs slow because we now perform a bunch of small writes
and introduce extra fsyncs.  It will become especially painful when
fuse2fs turns on iomap, at which point it will need to use directio to
access the disk, which then runs the Really Sad Path where we change the
blocksize and completely obliterate the cache contents.

So, add a new flag to ask for full superblock writes, which fuse2fs will
use later.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2fs.h  |    1 +
 lib/ext2fs/closefs.c |    7 +++++++
 2 files changed, 8 insertions(+)
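
To illustrate the checksum problem described above, here is a toy model
of a checksummed superblock.  Everything in it is an assumption made for
the sketch: the 64-byte layout, the additive checksum (not crc32c), the
function names, and the idea that each memcpy stands for one write that
other readers observe atomically.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_SB_SIZE	64			/* toy superblock size */
#define TOY_CSUM_OFF	(TOY_SB_SIZE - 4)	/* checksum in last 4 bytes */

/* Toy checksum (NOT crc32c): sum of every byte before the checksum. */
static uint32_t toy_csum(const uint8_t *sb)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < TOY_CSUM_OFF; i++)
		sum += sb[i];
	return sum;
}

/* Recompute and store the checksum field. */
static void toy_set_csum(uint8_t *sb)
{
	uint32_t sum = toy_csum(sb);

	memcpy(sb + TOY_CSUM_OFF, &sum, sizeof(sum));
}

/* Nonzero if the stored checksum matches the contents. */
static int toy_sb_valid(const uint8_t *sb)
{
	uint32_t stored;

	memcpy(&stored, sb + TOY_CSUM_OFF, sizeof(stored));
	return stored == toy_csum(sb);
}

/*
 * Copy each run of changed bytes to "disk" as a separate write and
 * count how often a concurrent reader would see a bad checksum.
 */
static int write_dirty_runs(uint8_t *disk, const uint8_t *new_sb)
{
	int torn = 0;
	size_t i = 0;

	while (i < TOY_SB_SIZE) {
		size_t start;

		if (disk[i] == new_sb[i]) {
			i++;
			continue;
		}
		start = i;
		while (i < TOY_SB_SIZE && disk[i] != new_sb[i])
			i++;
		memcpy(disk + start, new_sb + start, i - start);
		if (!toy_sb_valid(disk))
			torn++;
	}
	return torn;
}

/* Replace the whole superblock in one write; 0 means it is valid. */
static int write_full(uint8_t *disk, const uint8_t *new_sb)
{
	memcpy(disk, new_sb, TOY_SB_SIZE);
	return toy_sb_valid(disk) ? 0 : 1;
}
```

In this model, changing one field dirties two separate byte runs (the
field and the checksum), so the incremental path passes through a state
where the checksum is stale, while the full write does not.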


diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 2661e10f57c047..22d56ad7554496 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -220,6 +220,7 @@ typedef struct ext2_file *ext2_file_t;
 #define EXT2_FLAG_IBITMAP_TAIL_PROBLEM	0x2000000
 #define EXT2_FLAG_THREADS		0x4000000
 #define EXT2_FLAG_IGNORE_SWAP_DIRENT	0x8000000
+#define EXT2_FLAG_WRITE_FULL_SUPER	0x10000000
 
 /*
  * Internal flags for use by the ext2fs library only
diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
index 8e5bec03a050de..9a67db76e7b326 100644
--- a/lib/ext2fs/closefs.c
+++ b/lib/ext2fs/closefs.c
@@ -196,6 +196,13 @@ static errcode_t write_primary_superblock(ext2_filsys fs,
 	int		check_idx, write_idx, size;
 	errcode_t	retval;
 
+	if (fs->flags & EXT2_FLAG_WRITE_FULL_SUPER) {
+		retval = io_channel_write_byte(fs->io, SUPERBLOCK_OFFSET,
+					       SUPERBLOCK_SIZE, super);
+		if (!retval)
+			return 0;
+	}
+
 	if (!fs->io->manager->write_byte || !fs->orig_super) {
 	fallback:
 		io_channel_set_blksize(fs->io, SUPERBLOCK_OFFSET);



* [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-05-22  0:10   ` [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
@ 2025-05-22  0:10   ` Darrick J. Wong
  9 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:10 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Add a flag to ext2_file_t to disallow read and write I/O to file data
blocks.  This is needed for fuse2fs iomap support, which will keep all
the file data I/O inside the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/ext2fs/ext2fs.h |    3 +++
 lib/ext2fs/fileio.c |   12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)
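
A minimal sketch of the check ordering, using stand-in names and flag
values that are not the real libext2fs definitions.  It only shows the
intent: inline data is still serviced by the library, while block-mapped
data is refused once the handle carries the new flag.

```c
#include <errno.h>

#define TOY_FILE_NOBLOCKIO	0x0004	/* stand-in for EXT2_FILE_NOBLOCKIO */
#define TOY_INLINE_DATA_FL	0x0001	/* stand-in for EXT4_INLINE_DATA_FL */

struct toy_file {
	unsigned int flags;		/* per-handle open flags */
	unsigned int inode_flags;	/* flags from the on-disk inode */
};

/*
 * Decide whether a read may proceed.  Returns 0 if the library may
 * service it, or -EOPNOTSUPP as a stand-in for
 * EXT2_ET_OP_NOT_SUPPORTED.
 */
static int toy_file_read_check(const struct toy_file *file)
{
	/* Inline data lives in the inode, so it is always allowed. */
	if (file->inode_flags & TOY_INLINE_DATA_FL)
		return 0;

	/* Block-mapped data is off limits for NOBLOCKIO handles. */
	if (file->flags & TOY_FILE_NOBLOCKIO)
		return -EOPNOTSUPP;

	return 0;
}
```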


diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 22d56ad7554496..2c8e2cc2b55416 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -178,6 +178,9 @@ typedef struct ext2_struct_dblist *ext2_dblist;
 #define EXT2_FILE_WRITE		0x0001
 #define EXT2_FILE_CREATE	0x0002
 
+/* no file I/O to disk blocks, only to inline data */
+#define EXT2_FILE_NOBLOCKIO	0x0004
+
 #define EXT2_FILE_MASK		0x00FF
 
 #define EXT2_FILE_BUF_DIRTY	0x4000
diff --git a/lib/ext2fs/fileio.c b/lib/ext2fs/fileio.c
index 1b7e88d990036b..229ae6da7f448b 100644
--- a/lib/ext2fs/fileio.c
+++ b/lib/ext2fs/fileio.c
@@ -300,6 +300,11 @@ errcode_t ext2fs_file_read(ext2_file_t file, void *buf,
 	if (file->inode.i_flags & EXT4_INLINE_DATA_FL)
 		return ext2fs_file_read_inline_data(file, buf, wanted, got);
 
+	if (file->flags & EXT2_FILE_NOBLOCKIO) {
+		retval = EXT2_ET_OP_NOT_SUPPORTED;
+		goto fail;
+	}
+
 	while ((file->pos < EXT2_I_SIZE(&file->inode)) && (wanted > 0)) {
 		retval = sync_buffer_position(file);
 		if (retval)
@@ -416,6 +421,11 @@ errcode_t ext2fs_file_write(ext2_file_t file, const void *buf,
 		retval = 0;
 	}
 
+	if (file->flags & EXT2_FILE_NOBLOCKIO) {
+		retval = EXT2_ET_OP_NOT_SUPPORTED;
+		goto fail;
+	}
+
 	while (nbytes > 0) {
 		retval = sync_buffer_position(file);
 		if (retval)
@@ -584,7 +594,7 @@ static errcode_t ext2fs_file_zero_past_offset(ext2_file_t file,
 	int ret_flags;
 	errcode_t retval;
 
-	if (off == 0)
+	if (off == 0 || (file->flags & EXT2_FILE_NOBLOCKIO))
 		return 0;
 
 	retval = sync_buffer_position(file);



* [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-05-22  0:11   ` Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 02/16] fuse2fs: register block devices for use with iomap Darrick J. Wong
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:11 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Add enough of an iomap implementation that we can do FIEMAP and
SEEK_DATA and SEEK_HOLE.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure       |   47 ++++++
 configure.ac    |   32 ++++
 lib/config.h.in |    3 
 misc/fuse2fs.c  |  453 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 530 insertions(+), 5 deletions(-)
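
For reference, a minimal sketch of how a userspace program might walk
the data segments of a file with SEEK_DATA and SEEK_HOLE, which is one
of the interfaces this patch is meant to serve.  The function name is
hypothetical, and the sketch assumes Linux with glibc.  On filesystems
without sparse-file reporting the kernel treats the whole file as one
data segment.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for SEEK_DATA and SEEK_HOLE */
#endif
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Print the data segments of an open file as half-open byte ranges.
 * Returns the number of data segments found, or -1 on error.
 */
static int print_data_segments(int fd)
{
	off_t end = lseek(fd, 0, SEEK_END);
	off_t pos = 0;
	int nr = 0;

	if (end < 0)
		return -1;

	while (pos < end) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		off_t hole;

		if (data < 0) {
			/* ENXIO means no more data before EOF. */
			if (errno == ENXIO)
				break;
			return -1;
		}

		/* EOF counts as a hole, so this finds the segment end. */
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;

		printf("data: [%lld, %lld)\n", (long long)data,
		       (long long)hole);
		nr++;
		pos = hole;
	}
	return nr;
}
```

Note that the helper moves the file position of fd, so callers that care
about the current offset would need to save and restore it.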


diff --git a/configure b/configure
index 1f7dbe24ee1ab1..c8b63dd448dca8 100755
--- a/configure
+++ b/configure
@@ -14545,6 +14545,53 @@ elif test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=29
 fi
+
+if test "$FUSE_LIB" = "-lfuse3"
+then
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for iomap_begin in libfuse" >&5
+printf %s "checking for iomap_begin in libfuse... " >&6; }
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS	64
+#define FUSE_USE_VERSION 318
+#include <fuse.h>
+
+int
+main (void)
+{
+
+struct fuse_operations fs_ops = {
+	.iomap_begin = NULL,
+	.iomap_end = NULL,
+};
+struct fuse_iomap narf = { };
+
+  ;
+  return 0;
+}
+
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+  have_fuse_iomap=yes
+   { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+else $as_nop
+  { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; }
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+    conftest$ac_exeext conftest.$ac_ext
+if test "$have_fuse_iomap" = yes; then
+  FUSE_USE_VERSION=318
+
+printf "%s\n" "#define HAVE_FUSE_IOMAP 1" >>confdefs.h
+
+fi
+fi
+
 if test -n "$FUSE_USE_VERSION"
 then
 
diff --git a/configure.ac b/configure.ac
index c7f193b4ed06bf..8b12ef3ee542e3 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1429,6 +1429,38 @@ elif test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=29
 fi
+
+if test "$FUSE_LIB" = "-lfuse3"
+then
+dnl
+dnl see if fuse3 supports iomap
+dnl
+AC_MSG_CHECKING(for iomap_begin in libfuse)
+AC_LINK_IFELSE(
+[	AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS	64
+#define FUSE_USE_VERSION 318
+#include <fuse.h>
+	]], [[
+struct fuse_operations fs_ops = {
+	.iomap_begin = NULL,
+	.iomap_end = NULL,
+};
+struct fuse_iomap narf = { };
+	]])
+], have_fuse_iomap=yes
+   AC_MSG_RESULT(yes),
+   AC_MSG_RESULT(no))
+if test "$have_fuse_iomap" = yes; then
+  FUSE_USE_VERSION=318
+  AC_DEFINE(HAVE_FUSE_IOMAP, 1, [Define to 1 if fuse supports iomap])
+fi
+fi
+
+dnl
+dnl set FUSE_USE_VERSION now that we've done all the feature tests
+dnl
 if test -n "$FUSE_USE_VERSION"
 then
 	AC_DEFINE_UNQUOTED(FUSE_USE_VERSION, $FUSE_USE_VERSION,
diff --git a/lib/config.h.in b/lib/config.h.in
index 6cd9751baab9d1..850c5fa573bcf0 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -73,6 +73,9 @@
 /* Define to 1 if PR_SET_IO_FLUSHER is present */
 #undef HAVE_PR_SET_IO_FLUSHER
 
+/* Define to 1 if fuse supports iomap */
+#undef HAVE_FUSE_IOMAP
+
 /* Define to 1 if you have the Mac OS X function
    CFLocaleCopyPreferredLanguages in the CoreFoundation framework. */
 #undef HAVE_CFLOCALECOPYPREFERREDLANGUAGES
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 769bb5babd2738..f9eed078d91152 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -79,6 +79,8 @@
 #define P_(singular, plural, n) ((n) == 1 ? (singular) : (plural))
 #endif
 
+#define min(x, y)	((x) < (y) ? (x) : (y))
+
 #define dbg_printf(fuse2fs, format, ...) \
 	while ((fuse2fs)->debug) { \
 		printf("FUSE2FS (%s): " format, (fuse2fs)->shortdev, ##__VA_ARGS__); \
@@ -144,6 +146,14 @@ struct fuse2fs_file_handle {
 	int open_flags;
 };
 
+#ifdef HAVE_FUSE_IOMAP
+enum fuse2fs_iomap_state {
+	IOMAP_DISABLED,
+	IOMAP_UNKNOWN,
+	IOMAP_ENABLED,
+};
+#endif
+
 /* Main program context */
 #define FUSE2FS_MAGIC		(0xEF53DEADUL)
 struct fuse2fs {
@@ -167,6 +177,9 @@ struct fuse2fs {
 	uint8_t writable;
 
 	int blocklog;
+#ifdef HAVE_FUSE_IOMAP
+	enum fuse2fs_iomap_state iomap_state;
+#endif
 	unsigned int blockmask;
 	int retcode;
 	unsigned long offset;
@@ -694,7 +707,7 @@ static errcode_t open_fs(struct fuse2fs *ff, int libext2_flags)
 {
 	char options[128];
 	int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW |
-		    libext2_flags;
+		    EXT2_FLAG_WRITE_FULL_SUPER | libext2_flags;
 	errcode_t err;
 
 	snprintf(options, sizeof(options) - 1, "offset=%lu", ff->offset);
@@ -945,6 +958,38 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
 }
 #endif
 
+#ifdef HAVE_FUSE_IOMAP
+static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff)
+{
+	int is_bdev;
+	errcode_t err;
+
+	switch (ff->iomap_state) {
+	case IOMAP_UNKNOWN:
+		ff->iomap_state = IOMAP_DISABLED;
+		/* fallthrough */
+	case IOMAP_DISABLED:
+		return 0;
+	case IOMAP_ENABLED:
+		break;
+	}
+
+	err = fs_on_bdev(ff, &is_bdev);
+	if (err)
+		return err;
+
+	/* iomap only works with block devices */
+	if (!is_bdev) {
+		fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP);
+		ff->iomap_state = IOMAP_DISABLED;
+	}
+
+	return 0;
+}
+#else
+# define confirm_iomap(...)	(0)
+#endif
+
 static void *op_init(struct fuse_conn_info *conn
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 			, struct fuse_config *cfg EXT2FS_ATTR((unused))
@@ -972,6 +1017,12 @@ static void *op_init(struct fuse_conn_info *conn
 #ifdef FUSE_CAP_NO_EXPORT_SUPPORT
 	fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	if (ff->iomap_state != IOMAP_DISABLED &&
+	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
+		ff->iomap_state = IOMAP_ENABLED;
+#endif
+
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 	conn->time_gran = 1;
 	cfg->use_ino = 1;
@@ -989,6 +1040,10 @@ static void *op_init(struct fuse_conn_info *conn
 			goto mount_fail;
 		fs = ff->fs;
 
+		err = confirm_iomap(conn, ff);
+		if (err)
+			goto mount_fail;
+
 		if (ff->cache_size) {
 			err = config_fs_cache(ff);
 			if (err)
@@ -1014,6 +1069,10 @@ static void *op_init(struct fuse_conn_info *conn
 		err = mount_fs(ff);
 		if (err)
 			goto mount_fail;
+	} else {
+		err = confirm_iomap(conn, ff);
+		if (err)
+			goto mount_fail;
 	}
 
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
@@ -4575,6 +4634,384 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 # endif /* SUPPORT_FALLOCATE */
 #endif /* FUSE 29 */
 
+#ifdef HAVE_FUSE_IOMAP
+static void handle_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap,
+			      off_t pos, uint64_t count)
+{
+	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->addr = FUSE_IOMAP_NULL_ADDR;
+	iomap->offset = pos;
+	iomap->length = count;
+	iomap->type = FUSE_IOMAP_TYPE_HOLE;
+}
+
+#define DEBUG_IOMAP
+#ifdef DEBUG_IOMAP
+# define __DUMP_EXTENT(ff, func, tag, startoff, err, extent) \
+	do { \
+		dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", \
+			   (func), (tag), (startoff), (err), (extent)->e_lblk, \
+			   (extent)->e_pblk, (extent)->e_len, \
+			   (extent)->e_flags & EXT2_EXTENT_FLAGS_UNINIT); \
+	} while(0)
+# define DUMP_EXTENT(ff, tag, startoff, err, extent) \
+	__DUMP_EXTENT((ff), __func__, (tag), (startoff), (err), (extent))
+#else
+# define __DUMP_EXTENT(...)	((void)0)
+# define DUMP_EXTENT(...)	((void)0)
+#endif
+
+static inline errcode_t __get_mapping_at(struct fuse2fs *ff,
+					 ext2_extent_handle_t handle,
+					 blk64_t startoff,
+					 struct ext2fs_extent *bmap,
+					 const char *func)
+{
+	errcode_t err;
+
+	/*
+	 * Find the file mapping at startoff.  We don't check the return value
+	 * of _goto because _get will error out if _goto failed.  There's a
+	 * subtlety to the outcome of _goto when startoff falls in a sparse
+	 * hole however:
+	 *
+	 * Most of the time, _goto points the cursor at the mapping whose lblk
+	 * is just to the left of startoff.  The mapping may or may not overlap
+	 * startoff; this is ok.  In other words, the tree lookup behaves as if
+	 * we asked it to use a less than or equals comparison.
+	 *
+	 * However, if startoff is to the left of the first mapping in the
+	 * extent tree, _goto points the cursor at that first mapping because
+	 * it doesn't know how to deal with this situation.  In this case,
+	 * the tree lookup behaves as if we asked it to use a greater than
+	 * or equals comparison.
+	 *
+	 * Note: If _get() returns 'no current node', that means that there
+	 * aren't any mappings at all.
+	 */
+	ext2fs_extent_goto(handle, startoff);
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_CURRENT, bmap);
+	__DUMP_EXTENT(ff, func, "lookup", startoff, err, bmap);
+	if (err == EXT2_ET_NO_CURRENT_NODE)
+		err = EXT2_ET_EXTENT_NOT_FOUND;
+	return err;
+}
+
+static inline errcode_t __get_next_mapping(struct fuse2fs *ff,
+					   ext2_extent_handle_t handle,
+					   blk64_t startoff,
+					   struct ext2fs_extent *bmap,
+					   const char *func)
+{
+	struct ext2fs_extent newex, errex;
+	errcode_t err;
+
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT_LEAF, &newex);
+	DUMP_EXTENT(ff, "NEXT", startoff, err, &newex);
+	if (err == EXT2_ET_EXTENT_NO_NEXT)
+		return EXT2_ET_EXTENT_NOT_FOUND;
+	if (err)
+		return err;
+
+	/*
+	 * Try to get the next leaf mapping.  There's a weird and longstanding
+	 * "feature" of EXT2_EXTENT_NEXT_LEAF where walking off the end of the
+	 * mapping recordset causes it to wrap around to the beginning of the
+	 * extent map and we end up with a mapping to the left of the one that
+	 * was passed in.
+	 *
+	 * However, a corrupt extent tree could also have such a record.  The
+	 * only way to be sure is to retrieve the mapping for the extreme right
+	 * edge of the tree and compare it to the mapping that the caller gave
+	 * us.  If they match, then we've hit the end.  If not, something is
+	 * corrupt in the ondisk metadata.
+	 */
+	if (newex.e_lblk <= bmap->e_lblk + bmap->e_len) {
+		err = __get_mapping_at(ff, handle, ~0U, &errex, func);
+		if (err)
+			return err;
+
+		if (memcmp(bmap, &errex, sizeof(errex)) != 0)
+			return EXT2_ET_INODE_CORRUPTED;
+
+		return EXT2_ET_EXTENT_NOT_FOUND;
+	}
+
+	*bmap = newex;
+	return 0;
+}
+
+#define get_mapping_at(ff, handle, startoff, bmap) \
+	__get_mapping_at((ff), (handle), (startoff), (bmap), __func__)
+#define get_next_mapping(ff, handle, startoff, bmap) \
+	__get_next_mapping((ff), (handle), (startoff), (bmap), __func__)
+
+static int extent_iomap_begin(struct fuse2fs *ff, uint64_t ino,
+			      struct ext2_inode_large *inode,
+			      off_t pos, uint64_t count,
+			      uint32_t opflags, struct fuse_iomap *iomap)
+{
+	ext2_extent_handle_t handle;
+	struct ext2fs_extent extent;
+	ext2_filsys fs = ff->fs;
+	const blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	errcode_t err;
+	int ret = 0;
+
+	err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = get_mapping_at(ff, handle, startoff, &extent);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+		/* No mappings at all; the whole range is a hole. */
+		handle_iomap_hole(ff, iomap, pos, count);
+		goto out_handle;
+	}
+	if (err) {
+		ret = translate_error(fs, ino, err);
+		goto out_handle;
+	}
+
+	if (startoff < extent.e_lblk) {
+		/*
+		 * Mapping starts to the right of the current position.
+		 * Synthesize a hole going to that next extent.
+		 */
+		handle_iomap_hole(ff, iomap, FUSE2FS_FSB_TO_B(ff, startoff),
+				FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+		goto out_handle;
+	}
+
+	if (startoff >= extent.e_lblk + extent.e_len) {
+		/*
+		 * Mapping ends to the left of the current position.  Try to
+		 * find the next mapping.  If there is no next mapping, the
+		 * whole range is in a hole.
+		 */
+		err = get_next_mapping(ff, handle, startoff, &extent);
+		if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+			handle_iomap_hole(ff, iomap, pos, count);
+			goto out_handle;
+		}
+
+		/*
+		 * If the new mapping starts to the right of startoff, there's
+		 * a hole from startoff to the start of the new mapping.
+		 */
+		if (startoff < extent.e_lblk) {
+			handle_iomap_hole(ff, iomap,
+				FUSE2FS_FSB_TO_B(ff, startoff),
+				FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+			goto out_handle;
+		}
+
+		/*
+		 * The new mapping starts at startoff.  Something weird
+		 * happened in the extent tree lookup, but we found a valid
+		 * mapping so we'll run with it.
+		 */
+	}
+
+	/* Mapping overlaps startoff, report this. */
+	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
+	iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
+	iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
+	if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)
+		iomap->type = FUSE_IOMAP_TYPE_UNWRITTEN;
+	else
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+
+out_handle:
+	ext2fs_extent_free(handle);
+	return ret;
+}
+
+static int indirect_iomap_begin(struct fuse2fs *ff, uint64_t ino,
+				struct ext2_inode_large *inode, off_t pos,
+				uint64_t count, uint32_t opflags,
+				struct fuse_iomap *iomap)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	uint64_t real_count = min(count, 131072);
+	const blk64_t endoff = FUSE2FS_B_TO_FSB(ff, pos + real_count);
+	blk64_t startblock;
+	errcode_t err;
+
+	err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, startoff, NULL,
+			   &startblock);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->offset = pos;
+	iomap->flags |= FUSE_IOMAP_F_MERGED;
+	if (startblock) {
+		iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock);
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+	} else {
+		iomap->addr = FUSE_IOMAP_NULL_ADDR;
+		iomap->type = FUSE_IOMAP_TYPE_HOLE;
+	}
+	iomap->length = fs->blocksize;
+
+	/* See how long the mapping goes for. */
+	for (startoff++; startoff < endoff; startoff++) {
+		blk64_t prev_startblock = startblock;
+
+		err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0,
+				   startoff, NULL, &startblock);
+		if (err)
+			break;
+
+		if (iomap->type == FUSE_IOMAP_TYPE_MAPPED) {
+			if (startblock == prev_startblock + 1)
+				iomap->length += fs->blocksize;
+			else
+				break;
+		} else {
+			if (startblock != 0)
+				break;
+		}
+	}
+
+	return 0;
+}
+
+static int inline_iomap_begin(struct fuse2fs *ff, off_t pos, uint64_t count,
+			      struct fuse_iomap *iomap)
+{
+	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->addr = FUSE_IOMAP_NULL_ADDR;
+	iomap->offset = pos;
+	iomap->length = count;
+	iomap->type = FUSE_IOMAP_TYPE_INLINE;
+
+	return 0;
+}
+
+static int fuse_iomap_begin_report(struct fuse2fs *ff, ext2_ino_t ino,
+				   struct ext2_inode_large *inode,
+				   off_t pos, uint64_t count, uint32_t opflags,
+				   struct fuse_iomap *read_iomap)
+{
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return inline_iomap_begin(ff, pos, count, read_iomap);
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return extent_iomap_begin(ff, ino, inode, pos, count, opflags,
+					 read_iomap);
+
+	return indirect_iomap_begin(ff, ino, inode, pos, count, opflags,
+				    read_iomap);
+}
+
+static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
+				 struct ext2_inode_large *inode, off_t pos,
+				 uint64_t count, uint32_t opflags,
+				 struct fuse_iomap *read_iomap)
+{
+	return -ENOSYS;
+}
+
+static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
+				  struct ext2_inode_large *inode, off_t pos,
+				  uint64_t count, uint32_t opflags,
+				  struct fuse_iomap *read_iomap)
+{
+	return -ENOSYS;
+}
+
+static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos, uint64_t count, uint32_t opflags,
+			  struct fuse_iomap *read_iomap,
+			  struct fuse_iomap *write_iomap)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct ext2_inode_large inode;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+	fs = ff->fs;
+
+	pthread_mutex_lock(&ff->bfl);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags);
+
+	err = fuse2fs_read_inode(fs, attr_ino, &inode);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	if (opflags & FUSE_IOMAP_OP_REPORT)
+		ret = fuse_iomap_begin_report(ff, attr_ino, &inode, pos, count,
+					      opflags, read_iomap);
+	else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO))
+		ret = fuse_iomap_begin_write(ff, attr_ino, &inode, pos, count,
+					     opflags, read_iomap);
+	else
+		ret = fuse_iomap_begin_read(ff, attr_ino, &inode, pos, count,
+					    opflags, read_iomap);
+	if (ret)
+		goto out_unlock;
+
+	dbg_printf(ff, "%s: nodeid=%llu attr_ino=%llu pos=0x%llx -> addr=0x%llx offset=0x%llx length=0x%llx type=%u\n",
+		   __func__,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)read_iomap->addr,
+		   (unsigned long long)read_iomap->offset,
+		   (unsigned long long)read_iomap->length,
+		   read_iomap->type);
+
+out_unlock:
+	if (ret < 0)
+		dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret);
+	pthread_mutex_unlock(&ff->bfl);
+	return ret;
+}
+
+static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			off_t pos, uint64_t count, uint32_t opflags,
+			ssize_t written, const struct fuse_iomap *iomap)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	pthread_mutex_lock(&ff->bfl);
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x written=0x%zx mapflags 0x%x\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags,
+		   written,
+		   iomap->flags);
+	pthread_mutex_unlock(&ff->bfl);
+
+	return 0;
+}
+#endif /* HAVE_FUSE_IOMAP */
+
 static struct fuse_operations fs_ops = {
 	.init = op_init,
 	.destroy = op_destroy,
@@ -4635,6 +5072,10 @@ static struct fuse_operations fs_ops = {
 	.fallocate = op_fallocate,
 # endif
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	.iomap_begin = op_iomap_begin,
+	.iomap_end = op_iomap_end,
+#endif /* HAVE_FUSE_IOMAP */
 };
 
 static int get_random_bytes(void *p, size_t sz)
@@ -4840,7 +5281,12 @@ static void fuse2fs_com_err_proc(const char *whoami, errcode_t code,
 int main(int argc, char *argv[])
 {
 	struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
-	struct fuse2fs fctx;
+	struct fuse2fs fctx = {
+		.magic = FUSE2FS_MAGIC,
+#ifdef HAVE_FUSE_IOMAP
+		.iomap_state = IOMAP_UNKNOWN,
+#endif
+	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;
 	char *logfile;
@@ -4849,9 +5295,6 @@ int main(int argc, char *argv[])
 	int is_bdev;
 	int ret = 0;
 
-	memset(&fctx, 0, sizeof(fctx));
-	fctx.magic = FUSE2FS_MAGIC;
-
 	fuse_opt_parse(&args, &fctx, fuse2fs_opts, fuse2fs_opt_proc);
 	if (fctx.device == NULL) {
 		fprintf(stderr, "Missing ext4 device/image\n");



* [PATCH 02/16] fuse2fs: register block devices for use with iomap
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
@ 2025-05-22  0:11   ` Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
                     ` (13 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:11 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Register the ext4 block device with the kernel for use with iomap.  For
now this is redundant with using fuseblk mode because the kernel
automatically registers any fuseblk devices, but eventually we'll go
back to regular fuse mode and we'll have to pin the bdev ourselves.
In theory this interface supports strange beasts where the metadata can
exist somewhere else entirely (or be made up by AI) while the file data
persists to real disks.
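
As a rough illustration of the register-on-first-use pattern that this
patch adds to op_iomap_begin, here is a minimal standalone sketch.  It
does not call libfuse; demo_register_device() is a hypothetical stand-in
for the registration call, and DEMO_DEV_NULL is an assumed sentinel whose
value need not match the real FUSE_IOMAP_DEV_NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed sentinel for "no device registered yet" (hypothetical value). */
#define DEMO_DEV_NULL	UINT32_MAX

struct demo_server {
	int bdev_fd;			/* fd of the open block device */
	uint32_t iomap_dev;		/* cookie handed back by the kernel */
	unsigned int nr_registrations;	/* how often we asked the kernel */
};

/* Hypothetical registration call: derives a cookie from the fd. */
static int demo_register_device(struct demo_server *srv, uint32_t *cookie)
{
	assert(srv->bdev_fd >= 0);

	srv->nr_registrations++;
	*cookie = (uint32_t)srv->bdev_fd + 1;
	return 0;
}

/* Register the device on first use; later calls reuse the cached cookie. */
static int demo_get_iomap_dev(struct demo_server *srv, uint32_t *dev)
{
	if (srv->iomap_dev == DEMO_DEV_NULL) {
		int ret = demo_register_device(srv, &srv->iomap_dev);

		if (ret)
			return ret;
		/* A "successful" call that left the sentinel is an error. */
		if (srv->iomap_dev == DEMO_DEV_NULL)
			return -EIO;
	}

	*dev = srv->iomap_dev;
	return 0;
}
```

Every mapping reported afterwards would carry the cached cookie in its
dev field, so the registration cost is paid once per mount.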

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   44 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 40 insertions(+), 4 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f9eed078d91152..92a80753f4f1e8 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -36,6 +36,7 @@
 # define _FILE_OFFSET_BITS 64
 #endif /* _FILE_OFFSET_BITS */
 #include <fuse.h>
+#include <fuse_lowlevel.h>
 #ifdef __SET_FOB_FOR_FUSE
 # undef _FILE_OFFSET_BITS
 #endif /* __SET_FOB_FOR_FUSE */
@@ -179,6 +180,7 @@ struct fuse2fs {
 	int blocklog;
 #ifdef HAVE_FUSE_IOMAP
 	enum fuse2fs_iomap_state iomap_state;
+	uint32_t iomap_dev;
 #endif
 	unsigned int blockmask;
 	int retcode;
@@ -4638,7 +4640,7 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 static void handle_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap,
 			      off_t pos, uint64_t count)
 {
-	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->dev = ff->iomap_dev;
 	iomap->addr = FUSE_IOMAP_NULL_ADDR;
 	iomap->offset = pos;
 	iomap->length = count;
@@ -4815,7 +4817,7 @@ static errcode_t extent_iomap_begin(struct fuse2fs *ff, uint64_t ino,
 	}
 
 	/* Mapping overlaps startoff, report this. */
-	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->dev = ff->iomap_dev;
 	iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
 	iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
 	iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
@@ -4846,7 +4848,7 @@ static int indirect_iomap_begin(struct fuse2fs *ff, uint64_t ino,
 	if (err)
 		return translate_error(fs, ino, err);
 
-	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->dev = ff->iomap_dev;
 	iomap->offset = pos;
 	iomap->flags |= FUSE_IOMAP_F_MERGED;
 	if (startblock) {
@@ -4884,7 +4886,7 @@ static int indirect_iomap_begin(struct fuse2fs *ff, uint64_t ino,
 static int inline_iomap_begin(struct fuse2fs *ff, off_t pos, uint64_t count,
 			      struct fuse_iomap *iomap)
 {
-	iomap->dev = FUSE_IOMAP_DEV_FUSEBLK;
+	iomap->dev = ff->iomap_dev;
 	iomap->addr = FUSE_IOMAP_NULL_ADDR;
 	iomap->offset = pos;
 	iomap->length = count;
@@ -4925,6 +4927,31 @@ static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	return -ENOSYS;
 }
 
+static errcode_t config_iomap_devices(struct fuse_context *ctxt,
+				      struct fuse2fs *ff)
+{
+	struct fuse_session *se = fuse_get_session(ctxt->fuse);
+	errcode_t err;
+	int fd;
+	int ret;
+
+	err = io_channel_fd(ff->fs->io, &fd);
+	if (err)
+		return err;
+
+	ret = fuse_lowlevel_notify_iomap_add_device(se, fd, &ff->iomap_dev);
+
+	dbg_printf(ff, "%s: registering iomap dev fd=%d ret=%d iomap_dev=%u\n",
+		   __func__, fd, ret, ff->iomap_dev);
+
+	if (ret)
+		return ret;
+	if (ff->iomap_dev == FUSE_IOMAP_DEV_NULL)
+		return -EIO;
+
+	return 0;
+}
+
 static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 			  off_t pos, uint64_t count, uint32_t opflags,
 			  struct fuse_iomap *read_iomap,
@@ -4951,6 +4978,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   (unsigned long long)count,
 		   opflags);
 
+	if (ff->iomap_dev == FUSE_IOMAP_DEV_NULL) {
+		err = config_iomap_devices(ctxt, ff);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
 	err = fuse2fs_read_inode(fs, attr_ino, &inode);
 	if (err) {
 		ret = translate_error(fs, attr_ino, err);
@@ -5285,6 +5320,7 @@ int main(int argc, char *argv[])
 		.magic = FUSE2FS_MAGIC,
 #ifdef HAVE_FUSE_IOMAP
 		.iomap_state = IOMAP_UNKNOWN,
+		.iomap_dev = FUSE_IOMAP_DEV_NULL,
 #endif
 	};
 	errcode_t err;



* [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 02/16] fuse2fs: register block devices for use with iomap Darrick J. Wong
@ 2025-05-22  0:11   ` Darrick J. Wong
  2025-05-22  0:11   ` [PATCH 04/16] fuse2fs: implement directio file reads Darrick J. Wong
                     ` (12 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:11 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the kernel writes file data directly to the block device
and does not flush the bdev page cache.  We must open the filesystem in
directio mode to avoid cache coherency issues when reading file data
blocks.  If we can't open the bdev in directio mode, we must not use
iomap.
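
The fallback described above can be sketched outside of fuse2fs with
plain open(2).  This is only an approximation of the idea: the struct
and function names are made up for illustration, and the real code goes
through open_fs() and the libext2fs IO manager rather than opening the
device itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* O_DIRECT on Linux/glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

struct demo_open_result {
	int fd;			/* open file descriptor, or -1 on failure */
	bool use_iomap;		/* true only if the O_DIRECT open worked */
};

/*
 * Try to open @path for direct I/O if iomap is wanted.  If that fails,
 * give up on iomap and retry with a regular buffered open.
 */
static struct demo_open_result demo_open_device(const char *path,
						bool want_iomap)
{
	struct demo_open_result res = { .fd = -1, .use_iomap = false };

	assert(path != NULL);

#ifdef O_DIRECT
	if (want_iomap) {
		res.fd = open(path, O_RDWR | O_DIRECT);
		if (res.fd >= 0) {
			res.use_iomap = true;
			return res;
		}
		/* No direct I/O available; fall through without iomap. */
	}
#endif

	res.fd = open(path, O_RDWR);
	return res;
}
```

The caller only advertises iomap to the kernel when use_iomap comes back
true, which mirrors the "if we can't open the bdev in directio mode, we
must not use iomap" rule.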

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 92a80753f4f1e8..91c0da096bef9c 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -988,8 +988,14 @@ static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff)
 
 	return 0;
 }
+
+static int iomap_enabled(const struct fuse2fs *ff)
+{
+	return ff->iomap_state == IOMAP_ENABLED;
+}
 #else
 # define confirm_iomap(...)	(0)
+# define iomap_enabled(...)	(0)
 #endif
 
 static void *op_init(struct fuse_conn_info *conn
@@ -1001,6 +1007,9 @@ static void *op_init(struct fuse_conn_info *conn
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	ext2_filsys fs = ff->fs;
+#ifdef HAVE_FUSE_IOMAP
+	int was_directio = ff->directio;
+#endif
 	errcode_t err;
 	int ret;
 
@@ -1023,6 +1032,15 @@ static void *op_init(struct fuse_conn_info *conn
 	if (ff->iomap_state != IOMAP_DISABLED &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;
+	/*
+	 * In iomap mode, the kernel writes file data directly to the block
+	 * device and does not flush the bdev page cache.  We must open the
+	 * filesystem in directio mode to avoid cache coherency issues when
+	 * reading file data.  If we can't open the bdev in directio mode, we
+	 * must not use iomap.
+	 */
+	if (iomap_enabled(ff))
+		ff->directio = 1;
 #endif
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
@@ -1038,6 +1056,14 @@ static void *op_init(struct fuse_conn_info *conn
 	 */
 	if (!fs) {
 		err = open_fs(ff, 0);
+#ifdef HAVE_FUSE_IOMAP
+		if (err && iomap_enabled(ff) && !was_directio) {
+			fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP);
+			ff->iomap_state = IOMAP_DISABLED;
+			ff->directio = 0;
+			err = open_fs(ff, 0);
+		}
+#endif
 		if (err)
 			goto mount_fail;
 		fs = ff->fs;



* [PATCH 04/16] fuse2fs: implement directio file reads
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-05-22  0:11   ` [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
@ 2025-05-22  0:11   ` Darrick J. Wong
  2025-05-22  0:12   ` [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
                     ` (11 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:11 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Implement file reads via iomap.  Currently only directio is supported.
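
For block-mapped (non-extent) files the read path reuses the indirect
mapping helper from the first patch, which probes one logical block at a
time and reports a run of physically contiguous blocks as one mapping.
A simplified standalone sketch of that run detection follows.  The array
blockmap[] stands in for ext2fs_bmap2() lookups, all names are
hypothetical, and unlike the patch this sketch also grows hole runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_map_type { DEMO_HOLE, DEMO_MAPPED };

struct demo_mapping {
	enum demo_map_type type;
	uint64_t first_pblk;	/* first physical block, 0 for a hole */
	uint64_t nr_blocks;	/* length of the run, in blocks */
};

/*
 * blockmap[i] is the physical block backing logical block i, or 0 if
 * logical block i is a hole.  Report the run that begins at @start.
 */
static struct demo_mapping demo_map_run(const uint64_t *blockmap, size_t nr,
					size_t start)
{
	struct demo_mapping map;
	size_t i;

	assert(start < nr);

	map.type = blockmap[start] ? DEMO_MAPPED : DEMO_HOLE;
	map.first_pblk = blockmap[start];
	map.nr_blocks = 1;

	for (i = start + 1; i < nr; i++) {
		if (map.type == DEMO_MAPPED) {
			/* Stop at the first physically discontiguous block. */
			if (blockmap[i] != blockmap[i - 1] + 1)
				break;
		} else if (blockmap[i] != 0) {
			/* The hole ends where a mapped block begins. */
			break;
		}
		map.nr_blocks++;
	}

	return map;
}
```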

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 91c0da096bef9c..b1f3002ec8c481 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1103,6 +1103,11 @@ static void *op_init(struct fuse_conn_info *conn
 			goto mount_fail;
 	}
 
+#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_DIRECTIO)
+	if (iomap_enabled(ff))
+		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO);
+#endif
+
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
 	if (ff->writable) {
 		fs->super->s_mnt_count++;
@@ -4942,7 +4947,26 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 				 uint64_t count, uint32_t opflags,
 				 struct fuse_iomap *read_iomap)
 {
-	return -ENOSYS;
+	errcode_t err;
+
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	/* fall back to slow path for inline data reads */
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return -ENOSYS;
+
+	/* flush dirty io_channel buffers to disk before iomap reads them */
+	err = io_channel_flush(ff->fs->io);
+	if (err)
+		return translate_error(ff->fs, ino, err);
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return extent_iomap_begin(ff, ino, inode, pos, count, opflags,
+					 read_iomap);
+
+	return indirect_iomap_begin(ff, ino, inode, pos, count, opflags,
+				    read_iomap);
 }
 
 static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,



* [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-05-22  0:11   ` [PATCH 04/16] fuse2fs: implement directio file reads Darrick J. Wong
@ 2025-05-22  0:12   ` Darrick J. Wong
  2025-05-22  0:12   ` [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
                     ` (10 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:12 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Change the punch hole helpers to use the tagged block IO commands now
that libext2fs uses tagged block IO commands for file IO.  We'll need
this in the next patch when we turn on selective IO manager cache
clearing and invalidation.
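
For readers who haven't looked at the punch hole helpers: they zero part
of a block with a read-modify-write cycle.  A minimal sketch of that
cycle, using pread/pwrite on a file descriptor instead of the io_channel
calls and without modelling the inode tag, might look like this.  The
function name and parameters are made up for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* pread(), pwrite(), fileno() */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Zero @len bytes starting @residue bytes into block @blk of @fd.
 * @buf must be able to hold one block of @blocksize bytes.
 */
static int demo_zero_block_middle(int fd, uint64_t blk, size_t blocksize,
				  size_t residue, size_t len, char *buf)
{
	off_t pos = (off_t)(blk * blocksize);

	assert(residue + len <= blocksize);

	/* Read the whole block so the untouched bytes survive. */
	if (pread(fd, buf, blocksize, pos) != (ssize_t)blocksize) {
		perror("pread");
		return -1;
	}

	memset(buf + residue, 0, len);

	/* Write the whole block back. */
	if (pwrite(fd, buf, blocksize, pos) != (ssize_t)blocksize) {
		perror("pwrite");
		return -1;
	}

	return 0;
}
```

In the patch the same two steps go through the tagged read and write
calls, so the IO manager knows which inode owns the cached block.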

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index b1f3002ec8c481..c0f868e8f01ed4 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4510,13 +4510,13 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	if (!blk || (retflags & BMAP_RET_UNINIT))
 		return 0;
 
-	err = io_channel_read_blk(fs->io, blk, 1, *buf);
+	err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf);
 	if (err)
 		return err;
 
 	memset(*buf + residue, 0, len);
 
-	return io_channel_write_blk(fs->io, blk, 1, *buf);
+	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
 
 static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
@@ -4544,7 +4544,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return err;
 
-	err = io_channel_read_blk(fs->io, blk, 1, *buf);
+	err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf);
 	if (err)
 		return err;
 	if (!blk || (retflags & BMAP_RET_UNINIT))
@@ -4555,7 +4555,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	else
 		memset(*buf + residue, 0, fs->blocksize - residue);
 
-	return io_channel_write_blk(fs->io, blk, 1, *buf);
+	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
 
 static int punch_helper(struct fuse_file_info *fp, int mode, off_t offset,



* [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-05-22  0:12   ` [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
@ 2025-05-22  0:12   ` Darrick J. Wong
  2025-05-22  0:12   ` [PATCH 07/16] fuse2fs: add extent dump function for debugging Darrick J. Wong
                     ` (9 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:12 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

We only need to flush the io_channel's cache for the file that's being
read directly, not everything else.
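
To show why tagging makes a selective flush possible, here is a toy
write-back cache in which every dirty block remembers the inode that
owns it.  This is not the libext2fs IO manager; the structures and
function names are invented for the example, and the "disk write" is
just a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_SLOTS	8

struct demo_cache_entry {
	bool in_use;
	bool dirty;
	uint32_t tag;		/* owning inode number */
	uint64_t blk;		/* physical block number */
};

struct demo_cache {
	struct demo_cache_entry slot[DEMO_CACHE_SLOTS];
	unsigned int writes_issued;	/* simulated disk writes */
};

/* Cache a dirty block on behalf of inode @tag; false if the cache is full. */
static bool demo_cache_write(struct demo_cache *c, uint32_t tag, uint64_t blk)
{
	size_t i;

	for (i = 0; i < DEMO_CACHE_SLOTS; i++) {
		if (c->slot[i].in_use)
			continue;
		c->slot[i] = (struct demo_cache_entry){
			.in_use = true, .dirty = true, .tag = tag, .blk = blk,
		};
		return true;
	}
	return false;
}

/* Write back only the dirty blocks owned by @tag; return how many. */
static unsigned int demo_cache_flush_tag(struct demo_cache *c, uint32_t tag)
{
	unsigned int flushed = 0;
	size_t i;

	assert(c != NULL);

	for (i = 0; i < DEMO_CACHE_SLOTS; i++) {
		struct demo_cache_entry *e = &c->slot[i];

		if (!e->in_use || !e->dirty || e->tag != tag)
			continue;
		e->dirty = false;	/* pretend the block hit the disk */
		c->writes_issued++;
		flushed++;
	}
	return flushed;
}
```

A directio read of one file then costs only that file's dirty blocks,
while other files' cached writes stay in memory.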

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index c0f868e8f01ed4..3ec99310b0f112 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4957,7 +4957,7 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 		return -ENOSYS;
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
-	err = io_channel_flush(ff->fs->io);
+	err = io_channel_flush_tag(ff->fs->io, ino);
 	if (err)
 		return translate_error(ff->fs, ino, err);
 



* [PATCH 07/16] fuse2fs: add extent dump function for debugging
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-05-22  0:12   ` [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
@ 2025-05-22  0:12   ` Darrick J. Wong
  2025-05-22  0:12   ` [PATCH 08/16] fuse2fs: implement direct write support Darrick J. Wong
                     ` (8 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:12 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Add a function to dump an inode's extent map for debugging purposes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 3ec99310b0f112..7e9095766c6624 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -377,6 +377,74 @@ static inline errcode_t fuse2fs_write_inode(ext2_filsys fs, ext2_ino_t ino,
 				       sizeof(*inode));
 }
 
+static inline void dump_ino_extents(struct fuse2fs *ff, ext2_ino_t ino,
+				    struct ext2_inode_large *inode,
+				    const char *why)
+{
+	ext2_filsys fs = ff->fs;
+	unsigned int nr = 0;
+	blk64_t blockcount = 0;
+	struct ext2_inode_large xinode;
+	struct ext2fs_extent extent;
+	ext2_extent_handle_t extents;
+	int op = EXT2_EXTENT_ROOT;
+	errcode_t retval;
+
+	if (!inode) {
+		inode = &xinode;
+
+		retval = fuse2fs_read_inode(fs, ino, inode);
+		if (retval) {
+			com_err(__func__, retval, _("reading ino %u"), ino);
+			return;
+		}
+	}
+
+	if (!(inode->i_flags & EXT4_EXTENTS_FL))
+		return;
+
+	printf("%s: %s ino %u isize %llu iblocks %llu\n", __func__, why, ino,
+	       EXT2_I_SIZE(inode),
+	       (ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)) * 512) /
+	        fs->blocksize);
+	fflush(stdout);
+
+	retval = ext2fs_extent_open(fs, ino, &extents);
+	if (retval) {
+		com_err(__func__, retval, _("opening extents of ino \"%u\""),
+			ino);
+		return;
+	}
+
+	while ((retval = ext2fs_extent_get(extents, op, &extent)) == 0) {
+		op = EXT2_EXTENT_NEXT;
+
+		if (extent.e_flags & EXT2_EXTENT_FLAGS_SECOND_VISIT)
+			continue;
+
+		printf("[%u]: %s lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n",
+		       nr++, why, extent.e_lblk, extent.e_pblk, extent.e_len,
+		       extent.e_flags);
+		fflush(stdout);
+		if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF)
+			blockcount += extent.e_len;
+		else
+			blockcount++;
+	}
+	if (retval == EXT2_ET_EXTENT_NO_NEXT)
+		retval = 0;
+	if (retval) {
+		com_err(__func__, retval, _("getting extents of ino %u"),
+			ino);
+	}
+	if (inode->i_file_acl)
+		blockcount++;
+	printf("%s: %s sum(e_len) %llu\n", __func__, why, blockcount);
+	fflush(stdout);
+
+	ext2fs_extent_free(extents);
+}
+
 static void get_now(struct timespec *now)
 {
 #ifdef CLOCK_REALTIME



* [PATCH 08/16] fuse2fs: implement direct write support
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-05-22  0:12   ` [PATCH 07/16] fuse2fs: add extent dump function for debugging Darrick J. Wong
@ 2025-05-22  0:12   ` Darrick J. Wong
  2025-05-22  0:13   ` [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
                     ` (7 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:12 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Wire up an iomap_begin method that can allocate into holes so that we
can do directio writes.
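
The allocation path in this patch creates unwritten extents
(EXT2_FALLOCATE_FORCE_UNINIT), and convert_unwritten_mapping later
marks the written range as written, splitting the extent when the range
covers only part of it.  The splitting arithmetic is easier to see in
isolation, so here is a standalone sketch.  It works on a plain struct
rather than an ext2_extent_handle_t, and every name in it is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_extent {
	uint64_t lblk;		/* first logical block */
	uint64_t pblk;		/* first physical block */
	uint32_t len;		/* length in blocks */
	bool unwritten;
};

/*
 * Convert the part of unwritten extent @ex that overlaps the half-open
 * block range [start, stop) to written.  The resulting pieces are stored
 * in @out (room for three) and the number of pieces is returned.
 */
static int demo_convert_unwritten(const struct demo_extent *ex,
				  uint64_t start, uint64_t stop,
				  struct demo_extent out[3])
{
	uint64_t ex_end = ex->lblk + ex->len;
	uint64_t cstart = start > ex->lblk ? start : ex->lblk;
	uint64_t cstop = stop < ex_end ? stop : ex_end;
	int n = 0;

	assert(ex->unwritten);
	assert(cstart < cstop);		/* caller guarantees an overlap */

	/* Blocks before the range stay unwritten. */
	if (cstart > ex->lblk) {
		out[n++] = (struct demo_extent){
			.lblk = ex->lblk, .pblk = ex->pblk,
			.len = (uint32_t)(cstart - ex->lblk),
			.unwritten = true,
		};
	}

	/* The overlapping blocks become written. */
	out[n++] = (struct demo_extent){
		.lblk = cstart, .pblk = ex->pblk + (cstart - ex->lblk),
		.len = (uint32_t)(cstop - cstart),
		.unwritten = false,
	};

	/* Blocks after the range stay unwritten. */
	if (cstop < ex_end) {
		out[n++] = (struct demo_extent){
			.lblk = cstop, .pblk = ex->pblk + (cstop - ex->lblk),
			.len = (uint32_t)(ex_end - cstop),
			.unwritten = true,
		};
	}

	return n;
}
```

For example, converting blocks [12, 15) of an unwritten extent that
covers logical blocks 10-19 yields three pieces: unwritten 10-11,
written 12-14, and unwritten 15-19.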

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |  481 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 478 insertions(+), 3 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 7e9095766c6624..ec17f6203b4b70 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5037,12 +5037,99 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 				    read_iomap);
 }
 
+static int fuse_iomap_write_allocate(struct fuse2fs *ff, ext2_ino_t ino,
+				     struct ext2_inode_large *inode, off_t pos,
+				     uint64_t count, uint32_t opflags,
+				     struct fuse_iomap *read_iomap, bool *dirty)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + count);
+	errcode_t err;
+	int ret;
+
+	dbg_printf(ff, "%s: write_alloc ino=%u startoff 0x%llx blockcount 0x%llx\n",
+		   __func__, ino, startoff, stopoff - startoff);
+
+	if (!fs_can_allocate(ff, stopoff - startoff))
+		return -ENOSPC;
+
+	err = ext2fs_fallocate(fs, EXT2_FALLOCATE_FORCE_UNINIT, ino,
+			       EXT2_INODE(inode), 0, startoff,
+			       stopoff - startoff);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* pick up the newly allocated mapping */
+	ret = fuse_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+				     read_iomap);
+	if (ret)
+		return ret;
+
+	read_iomap->flags |= FUSE_IOMAP_F_DIRTY;
+	*dirty = true;
+	return 0;
+}
+
+static off_t max_file_size(const struct fuse2fs *ff,
+			   const struct ext2_inode_large *inode)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t addr_per_block, max_map_block;
+
+	if (inode->i_flags & EXT4_EXTENTS_FL) {
+		max_map_block = (1ULL << 32) - 1;
+	} else {
+		addr_per_block = fs->blocksize >> 2;
+		max_map_block = addr_per_block;
+		max_map_block += addr_per_block * addr_per_block;
+		max_map_block += addr_per_block * addr_per_block * addr_per_block;
+		max_map_block += 12;
+	}
+
+	return FUSE2FS_FSB_TO_B(ff, max_map_block) + (fs->blocksize - 1);
+}
+
 static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 				  struct ext2_inode_large *inode, off_t pos,
 				  uint64_t count, uint32_t opflags,
-				  struct fuse_iomap *read_iomap)
+				  struct fuse_iomap *read_iomap, bool *dirty)
 {
-	return -ENOSYS;
+	off_t max_size = max_file_size(ff, inode);
+	errcode_t err;
+	int ret;
+
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	if (pos >= max_size)
+		return -EFBIG;
+
+	if (pos >= max_size - count)
+		count = max_size - pos;
+
+	ret = fuse_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+				    read_iomap);
+	if (ret)
+		return ret;
+
+	if (read_iomap->type == FUSE_IOMAP_TYPE_HOLE &&
+	    !(opflags & FUSE_IOMAP_OP_ZERO)) {
+		ret = fuse_iomap_write_allocate(ff, ino, inode, pos, count,
+						opflags, read_iomap, dirty);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * flush and invalidate the file's io_channel buffers before iomap
+	 * writes them
+	 */
+	err = io_channel_invalidate_tag(ff->fs->io, ino);
+	if (err)
+		return translate_error(ff->fs, ino, err);
+
+	return 0;
 }
 
 static errcode_t config_iomap_devices(struct fuse_context *ctxt,
@@ -5080,6 +5167,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	struct ext2_inode_large inode;
 	ext2_filsys fs;
 	errcode_t err;
+	bool dirty = false;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
@@ -5115,7 +5203,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 					      opflags, read_iomap);
 	else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO))
 		ret = fuse_iomap_begin_write(ff, attr_ino, &inode, pos, count,
-					     opflags, read_iomap);
+					     opflags, read_iomap, &dirty);
 	else
 		ret = fuse_iomap_begin_read(ff, attr_ino, &inode, pos, count,
 					    opflags, read_iomap);
@@ -5132,6 +5220,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   (unsigned long long)read_iomap->length,
 		   read_iomap->type);
 
+	if (dirty) {
+		err = fuse2fs_write_inode(fs, attr_ino, &inode);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
 out_unlock:
 	if (ret < 0)
 		dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret);
@@ -5163,6 +5259,384 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 
 	return 0;
 }
+
+static inline bool can_merge_mappings(const struct ext2fs_extent *left,
+				      const struct ext2fs_extent *right)
+{
+	uint64_t max_len = (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ?
+				EXT_UNINIT_MAX_LEN : EXT_INIT_MAX_LEN;
+
+	return left->e_lblk + left->e_len == right->e_lblk &&
+	       left->e_pblk + left->e_len == right->e_pblk &&
+	       (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ==
+	        (right->e_flags & EXT2_EXTENT_FLAGS_UNINIT) &&
+	       (uint64_t)left->e_len + right->e_len <= max_len;
+}
+
+static int try_merge_mappings(struct fuse2fs *ff, ext2_ino_t ino,
+			      ext2_extent_handle_t handle, blk64_t startoff)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2fs_extent left, right;
+	errcode_t err;
+
+	/* Look up the mapping before startoff */
+	err = get_mapping_at(ff, handle, startoff - 1, &left);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND)
+		return 0;
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Look up the mapping at startoff */
+	err = get_mapping_at(ff, handle, startoff, &right);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND)
+		return 0;
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Can we combine them? */
+	if (!can_merge_mappings(&left, &right))
+		return 0;
+
+	/*
+	 * Delete the mapping after startoff because libext2fs cannot handle
+	 * overlapping mappings.
+	 */
+	err = ext2fs_extent_delete(handle, 0);
+	DUMP_EXTENT(ff, "remover", startoff, err, &right);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = ext2fs_extent_fix_parents(handle);
+	DUMP_EXTENT(ff, "fixremover", startoff, err, &right);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Move back and lengthen the mapping before startoff */
+	err = ext2fs_extent_goto(handle, left.e_lblk);
+	DUMP_EXTENT(ff, "movel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	left.e_len += right.e_len;
+	err = ext2fs_extent_replace(handle, 0, &left);
+	DUMP_EXTENT(ff, "replacel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = ext2fs_extent_fix_parents(handle);
+	DUMP_EXTENT(ff, "fixreplacel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
+static int convert_unwritten_mapping(struct fuse2fs *ff, ext2_ino_t ino,
+				     struct ext2_inode_large *inode,
+				     ext2_extent_handle_t handle,
+				     blk64_t *cursor, blk64_t stopoff)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2fs_extent extent;
+	blk64_t startoff = *cursor;
+	errcode_t err;
+
+	/*
+	 * Find the mapping at startoff.  Note that we can find holes because
+	 * the mapping data can change due to racing writes.
+	 */
+	err = get_mapping_at(ff, handle, startoff, &extent);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+		/*
+		 * If we didn't find any mappings at all then the file is
+		 * completely sparse.  There's nothing to convert.
+		 */
+		*cursor = stopoff;
+		return 0;
+	}
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/*
+	 * The mapping is completely to the left of the range that we want.
+	 * Let's see what's in the next extent, if there is one.
+	 */
+	if (startoff >= extent.e_lblk + extent.e_len) {
+		/*
+		 * Mapping ends to the left of the current position.  Try to
+		 * find the next mapping.  If there is no next mapping, then
+		 * we're done.
+		 */
+		err = get_next_mapping(ff, handle, startoff, &extent);
+		if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+			*cursor = stopoff;
+			return 0;
+		}
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	/*
+	 * The mapping is completely to the right of the range that we want,
+	 * so we're done.
+	 */
+	if (extent.e_lblk >= stopoff) {
+		*cursor = stopoff;
+		return 0;
+	}
+
+	/*
+	 * At this point, we have a mapping that overlaps [startoff, stopoff).
+	 * If the mapping is already written, move on to the next one.
+	 */
+	if (!(extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT))
+		goto next;
+
+	if (startoff > extent.e_lblk) {
+		struct ext2fs_extent newex = extent;
+
+		/*
+		 * Unwritten mapping starts before startoff.  Shorten
+		 * the previous mapping...
+		 */
+		newex.e_len = startoff - extent.e_lblk;
+		err = ext2fs_extent_replace(handle, 0, &newex);
+		DUMP_EXTENT(ff, "shortenp", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixshortenp", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		/* ...and create new written mapping at startoff. */
+		extent.e_len -= newex.e_len;
+		extent.e_lblk += newex.e_len;
+		extent.e_pblk += newex.e_len;
+		extent.e_flags = newex.e_flags & ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_insert(handle,
+					   EXT2_EXTENT_INSERT_AFTER,
+					   &extent);
+		DUMP_EXTENT(ff, "insertx", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixinsertx", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	if (extent.e_lblk + extent.e_len > stopoff) {
+		struct ext2fs_extent newex = extent;
+
+		/*
+		 * Unwritten mapping ends after stopoff.  Shorten the current
+		 * mapping...
+		 */
+		extent.e_len = stopoff - extent.e_lblk;
+		extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_replace(handle, 0, &extent);
+		DUMP_EXTENT(ff, "shortenn", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixshortenn", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		/* ...and create a new unwritten mapping at stopoff. */
+		newex.e_pblk += extent.e_len;
+		newex.e_lblk += extent.e_len;
+		newex.e_len -= extent.e_len;
+		newex.e_flags |= EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_insert(handle,
+					   EXT2_EXTENT_INSERT_AFTER,
+					   &newex);
+		DUMP_EXTENT(ff, "insertn", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixinsertn", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	/* Still unwritten?  Update the state. */
+	if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) {
+		extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_replace(handle, 0, &extent);
+		DUMP_EXTENT(ff, "replacex", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixreplacex", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+next:
+	/* Try to merge with the previous extent */
+	if (startoff > 0) {
+		err = try_merge_mappings(ff, ino, handle, startoff);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	*cursor = extent.e_lblk + extent.e_len;
+	return 0;
+}
+
+static int convert_unwritten_mappings(struct fuse2fs *ff, ext2_ino_t ino,
+				      struct ext2_inode_large *inode,
+				      off_t pos, size_t written)
+{
+	ext2_extent_handle_t handle;
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	const blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + written);
+	errcode_t err;
+	int ret;
+
+	err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Walk every mapping in the range, converting them. */
+	while (startoff < stopoff) {
+		blk64_t old_startoff = startoff;
+
+		ret = convert_unwritten_mapping(ff, ino, inode, handle,
+					        &startoff, stopoff);
+		if (ret)
+			goto out_handle;
+		if (startoff <= old_startoff) {
+			/* Do not go backwards. */
+			ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED);
+			goto out_handle;
+		}
+	}
+
+	/* Try to merge the right edge */
+	ret = try_merge_mappings(ff, ino, handle, stopoff);
+out_handle:
+	ext2fs_extent_free(handle);
+	return ret;
+}
+
+static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos, size_t written, uint32_t ioendflags,
+			  int error, uint64_t new_addr)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct ext2_inode_large inode;
+	ext2_filsys fs;
+	errcode_t err;
+	bool dirty = false;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+	fs = ff->fs;
+
+	pthread_mutex_lock(&ff->bfl);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx written=0x%zx ioendflags=0x%x error=%d new_addr=%llu\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   written,
+		   ioendflags,
+		   error,
+		   (unsigned long long)new_addr);
+
+	if (error) {
+		ret = error;
+		goto out_unlock;
+	}
+
+	/*
+	 * flush and invalidate the file's io_channel buffers again now that
+	 * iomap wrote them
+	 */
+	if (written > 0) {
+		err = io_channel_invalidate_tag(ff->fs->io, attr_ino);
+		if (err) {
+			ret = translate_error(ff->fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
+	/* should never see these ioend types */
+	if ((ioendflags & FUSE_IOMAP_IOEND_SHARED) ||
+	    new_addr != FUSE_IOMAP_NULL_ADDR) {
+		ret = translate_error(fs, attr_ino,
+				      EXT2_ET_FILESYSTEM_CORRUPTED);
+		goto out_unlock;
+	}
+
+	err = fuse2fs_read_inode(fs, attr_ino, &inode);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) {
+		/* unwritten extents are only supported on extents files */
+		if (!(inode.i_flags & EXT4_EXTENTS_FL)) {
+			ret = translate_error(fs, attr_ino,
+					      EXT2_ET_FILESYSTEM_CORRUPTED);
+			goto out_unlock;
+		}
+
+		ret = convert_unwritten_mappings(ff, attr_ino, &inode, pos,
+						 written);
+		if (ret)
+			goto out_unlock;
+
+		dirty = true;
+	}
+
+	if (ioendflags & FUSE_IOMAP_IOEND_APPEND) {
+		ext2_off64_t isize = EXT2_I_SIZE(&inode);
+
+		if (pos + written > isize) {
+			err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode),
+						    pos + written);
+			if (err) {
+				ret = translate_error(fs, attr_ino, err);
+				goto out_unlock;
+			}
+
+			dirty = true;
+		}
+	}
+
+	if (dirty) {
+		err = fuse2fs_write_inode(fs, attr_ino, &inode);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
+out_unlock:
+	if (ret < 0)
+		dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret);
+	pthread_mutex_unlock(&ff->bfl);
+	return ret;
+}
 #endif /* HAVE_FUSE_IOMAP */
 
 static struct fuse_operations fs_ops = {
@@ -5228,6 +5702,7 @@ static struct fuse_operations fs_ops = {
 #ifdef HAVE_FUSE_IOMAP
 	.iomap_begin = op_iomap_begin,
 	.iomap_end = op_iomap_end,
+	.iomap_ioend = op_iomap_ioend,
 #endif /* HAVE_FUSE_IOMAP */
 };
 


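The unwritten-extent conversion in convert_unwritten_mapping splits one unwritten mapping into as many as three pieces: an unwritten head before the written range, a written middle, and an unwritten tail after it. What follows is only a minimal standalone sketch of that arithmetic. `struct mapping` and `split_unwritten` are hypothetical names invented for illustration; they are not libext2fs types or functions, and the sketch does none of the extent tree updates that the patch performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent record; not the libext2fs type. */
struct mapping {
	uint64_t lblk;		/* first logical block */
	uint64_t pblk;		/* first physical block */
	uint64_t len;		/* length in blocks */
	bool unwritten;
};

/*
 * Split the unwritten mapping @in against the half-open block range
 * [start, stop) and store the pieces in @out, which must have room for
 * three entries.  Returns the number of pieces stored.
 */
static int split_unwritten(const struct mapping *in, uint64_t start,
			   uint64_t stop, struct mapping out[3])
{
	uint64_t in_end = in->lblk + in->len;
	uint64_t lo = start > in->lblk ? start : in->lblk;
	uint64_t hi = stop < in_end ? stop : in_end;
	int n = 0;

	assert(in->unwritten && in->len > 0);
	assert(lo < hi);	/* the caller guarantees an overlap */

	/* Head before the written range stays unwritten. */
	if (lo > in->lblk)
		out[n++] = (struct mapping){ in->lblk, in->pblk,
					     lo - in->lblk, true };

	/* The overlapping middle becomes a written mapping. */
	out[n++] = (struct mapping){ lo, in->pblk + (lo - in->lblk),
				     hi - lo, false };

	/* Tail after the written range stays unwritten. */
	if (hi < in_end)
		out[n++] = (struct mapping){ hi, in->pblk + (hi - in->lblk),
					     in_end - hi, true };

	return n;
}
```

In every piece the physical block advances by the same amount as the logical block, which is why the patch adjusts e_lblk and e_pblk together when it shortens or inserts an extent.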

* [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-05-22  0:12   ` [PATCH 08/16] fuse2fs: implement direct write support Darrick J. Wong
@ 2025-05-22  0:13   ` Darrick J. Wong
  2025-05-22  0:13   ` [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim Darrick J. Wong
                     ` (6 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:13 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Turn on iomap for pagecache IO to regular files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 57 insertions(+), 7 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index ec17f6203b4b70..7152979ed6694e 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1175,6 +1175,10 @@ static void *op_init(struct fuse_conn_info *conn
 	if (iomap_enabled(ff))
 		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO);
 #endif
+#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_PAGECACHE)
+	if (iomap_enabled(ff))
+		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_PAGECACHE);
+#endif
 
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
 	if (ff->writable) {
@@ -5017,9 +5021,6 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 {
 	errcode_t err;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	/* fall back to slow path for inline data reads */
 	if (inode->i_flags & EXT4_INLINE_DATA_FL)
 		return -ENOSYS;
@@ -5099,9 +5100,6 @@ static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	errcode_t err;
 	int ret;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	if (pos >= max_size)
 		return -EFBIG;
 
@@ -5235,12 +5233,51 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	return ret;
 }
 
+static int iomap_append_setsize(struct fuse2fs *ff, ext2_ino_t ino,
+				loff_t newsize)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2_inode_large inode;
+	ext2_off64_t isize;
+	errcode_t err;
+
+	dbg_printf(ff, "%s: ino=%u newsize=%llu\n", __func__, ino,
+		   (unsigned long long)newsize);
+
+	err = fuse2fs_read_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	isize = EXT2_I_SIZE(&inode);
+	if (newsize <= isize)
+		return 0;
+
+	dbg_printf(ff, "%s: ino=%u oldsize=%llu newsize=%llu\n", __func__, ino,
+		   (unsigned long long)isize,
+		   (unsigned long long)newsize);
+
+	/*
+	 * XXX cheesily update the ondisk size even though we only want to do
+	 * the incore size until writeback happens
+	 */
+	err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), newsize);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = fuse2fs_write_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
 static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 			off_t pos, uint64_t count, uint32_t opflags,
 			ssize_t written, const struct fuse_iomap *iomap)
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
 
@@ -5255,9 +5292,22 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   opflags,
 		   written,
 		   iomap->flags);
+
+	if ((opflags & FUSE_IOMAP_OP_WRITE) &&
+	    !(opflags & FUSE_IOMAP_OP_DIRECT) &&
+	    (iomap->flags & FUSE_IOMAP_F_SIZE_CHANGED) &&
+	    written > 0) {
+		ret = iomap_append_setsize(ff, attr_ino, pos + written);
+		if (ret)
+			goto out_unlock;
+	}
+
+out_unlock:
+	if (ret < 0)
+		dbg_printf(ff, "%s: libfuse ret=%d\n", __func__, ret);
 	pthread_mutex_unlock(&ff->bfl);
 
-	return 0;
+	return ret;
 }
 
 static inline bool can_merge_mappings(const struct ext2fs_extent *left,


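The condition that op_iomap_end uses to decide whether to push out the file size can be read as a small predicate: only buffered (non-direct) writes that made progress and that the kernel flagged as size-changing qualify. The sketch below restates that check in isolation. The `SKETCH_*` bit values are placeholders chosen for illustration and are not the real FUSE_IOMAP_* constants, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration; not the libfuse constants. */
#define SKETCH_OP_WRITE		(1U << 0)
#define SKETCH_OP_DIRECT	(1U << 1)
#define SKETCH_F_SIZE_CHANGED	(1U << 0)

/* True if a completed operation should extend the file size. */
static bool should_update_size(uint32_t opflags, uint32_t mapflags,
			       int64_t written)
{
	return (opflags & SKETCH_OP_WRITE) &&
	       !(opflags & SKETCH_OP_DIRECT) &&
	       (mapflags & SKETCH_F_SIZE_CHANGED) &&
	       written > 0;
}

/* The size only grows: a write that ends inside the file leaves it alone. */
static uint64_t size_after_write(uint64_t isize, uint64_t pos, int64_t written)
{
	uint64_t end;

	assert(written > 0);
	end = pos + (uint64_t)written;
	return end > isize ? end : isize;
}
```

The second helper mirrors the early return in iomap_append_setsize when the new size does not exceed the current one.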

* [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-05-22  0:13   ` [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
@ 2025-05-22  0:13   ` Darrick J. Wong
  2025-05-22  0:13   ` [PATCH 11/16] fuse2fs: improve tracing for fallocate Darrick J. Wong
                     ` (5 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:13 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Discard operates directly on the storage device, which means that we
need to flush and invalidate the buffer cache because it could be
caching freed blocks whose contents are about to change.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 7152979ed6694e..219d4bf698d628 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4365,6 +4365,11 @@ static int ioctl_fitrim(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 	cleared = 0;
 	max_blocks = FUSE2FS_B_TO_FSBT(ff, 2048ULL * 1024 * 1024);
 
+	/* flush any dirty data out of the disk cache before trimming */
+	err = io_channel_flush_tag(ff->fs->io, IO_CHANNEL_TAG_NULL);
+	if (err)
+		return translate_error(fs, fh->ino, err);
+
 	fr->len = 0;
 	while (start <= end) {
 		err = ext2fs_find_first_zero_block_bitmap2(fs->block_map,
@@ -4394,6 +4399,16 @@ static int ioctl_fitrim(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 		}
 		start = b + 1;
 	}
+	if (err)
+		goto out;
+
+	/*
+	 * Invalidate the entire disk cache now that we've written zeroes so
+	 * that EXT2_ALLOCRANGE_ZERO_BLOCKS works correctly.
+	 */
+	err = io_channel_invalidate_tag(ff->fs->io, IO_CHANNEL_TAG_NULL);
+	if (err)
+		return translate_error(fs, fh->ino, err);
 
 out:
 	fr->len = cleared;



* [PATCH 11/16] fuse2fs: improve tracing for fallocate
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (9 preceding siblings ...)
  2025-05-22  0:13   ` [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim Darrick J. Wong
@ 2025-05-22  0:13   ` Darrick J. Wong
  2025-05-22  0:13   ` [PATCH 12/16] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
                     ` (4 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:13 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Improve the tracing for fallocate by reporting the inode number and the
file range in all tracepoints.  Make the ranges hexadecimal to make it
easier for the programmer to convert bytes to block numbers and back.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 219d4bf698d628..fe6d97324c1f57 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4529,8 +4529,8 @@ static int fallocate_helper(struct fuse_file_info *fp, int mode, off_t offset,
 	FUSE2FS_CHECK_MAGIC(fs, fh, FUSE2FS_FILE_MAGIC);
 	start = FUSE2FS_B_TO_FSBT(ff, offset);
 	end = FUSE2FS_B_TO_FSBT(ff, offset + len - 1);
-	dbg_printf(ff, "%s: ino=%d mode=0x%x start=%llu end=%llu\n", __func__,
-		   fh->ino, mode, start, end);
+	dbg_printf(ff, "%s: ino=%d mode=0x%x offset=0x%jx len=0x%jx start=%llu end=%llu\n",
+		   __func__, fh->ino, mode, offset, len, start, end);
 	if (!fs_can_allocate(ff, FUSE2FS_B_TO_FSB(ff, len)))
 		return -ENOSPC;
 
@@ -4601,6 +4601,7 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return err;
 
+	dbg_printf(ff, "%s: ino=%d offset=0x%jx len=0x%jx\n", __func__, ino, offset + residue, len);
 	memset(*buf + residue, 0, len);
 
 	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
@@ -4637,10 +4638,13 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	if (!blk || (retflags & BMAP_RET_UNINIT))
 		return 0;
 
-	if (clean_before)
+	if (clean_before) {
+		dbg_printf(ff, "%s: ino=%d before offset=0x%jx len=0x%jx\n", __func__, ino, offset, residue);
 		memset(*buf, 0, residue);
-	else
+	} else {
+		dbg_printf(ff, "%s: ino=%d after offset=0x%jx len=0x%jx\n", __func__, ino, offset, fs->blocksize - residue);
 		memset(*buf + residue, 0, fs->blocksize - residue);
+	}
 
 	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
@@ -4661,7 +4665,6 @@ static int punch_helper(struct fuse_file_info *fp, int mode, off_t offset,
 	FUSE2FS_CHECK_CONTEXT(ff);
 	fs = ff->fs;
 	FUSE2FS_CHECK_MAGIC(fs, fh, FUSE2FS_FILE_MAGIC);
-	dbg_printf(ff, "%s: offset=%jd len=%jd\n", __func__, offset, len);
 
 	/* kernel ext4 punch requires this flag to be set */
 	if (!(mode & FL_KEEP_SIZE_FLAG))
@@ -4670,8 +4673,9 @@ static int punch_helper(struct fuse_file_info *fp, int mode, off_t offset,
 	/* Punch out a bunch of blocks */
 	start = FUSE2FS_B_TO_FSB(ff, offset);
 	end = (offset + len - fs->blocksize) / fs->blocksize;
-	dbg_printf(ff, "%s: ino=%d mode=0x%x start=%llu end=%llu\n", __func__,
-		   fh->ino, mode, start, end);
+
+	dbg_printf(ff, "%s: ino=%d mode=0x%x offset=0x%jx len=0x%jx start=%llu end=%llu\n",
+		   __func__, fh->ino, mode, offset, len, start, end);
 
 	err = fuse2fs_read_inode(fs, fh->ino, &inode);
 	if (err)
@@ -4727,6 +4731,8 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct fuse2fs_file_handle *fh =
+		(struct fuse2fs_file_handle *)(uintptr_t)fp->fh;
 	int ret;
 
 	/* Catch unknown flags */
@@ -4738,6 +4744,12 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 		ret = -EROFS;
 		goto out;
 	}
+
+	dbg_printf(ff, "%s: ino=%d mode=0x%x start=0x%llx end=0x%llx\n", __func__,
+		   fh->ino, mode,
+		   (unsigned long long)offset,
+		   (unsigned long long)offset + len);
+
 	if (mode & FL_ZERO_RANGE_FLAG)
 		ret = zero_helper(fp, mode, offset, len);
 	else if (mode & FL_PUNCH_HOLE_FLAG)


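To illustrate why hexadecimal byte ranges are convenient in these traces: with a power-of-two block size, converting a byte offset to a block number is a shift, so the low hex digits are the offset within the block and the remaining digits are the block number. The helpers below are a hypothetical sketch. They assume that the truncating and rounding-up conversions behave like the FUSE2FS_B_TO_FSBT and FUSE2FS_B_TO_FSB macros used elsewhere in this series, which is an assumption about macros that are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Round a byte offset down to a block number (hypothetical helper). */
static uint64_t bytes_to_blocks_trunc(uint64_t bytes, unsigned int blocklog)
{
	assert(blocklog < 64);
	return bytes >> blocklog;
}

/* Round a byte count up to a whole number of blocks (hypothetical helper). */
static uint64_t bytes_to_blocks_roundup(uint64_t bytes, unsigned int blocklog)
{
	uint64_t mask;

	assert(blocklog < 64);
	mask = ((uint64_t)1 << blocklog) - 1;
	assert(bytes <= UINT64_MAX - mask);	/* avoid wrapping the sum */
	return (bytes + mask) >> blocklog;
}
```

With 4096-byte blocks (blocklog 12), an offset printed as 0x12345 is block 0x12 at byte 0x345 within that block, which can be read straight off the hex digits.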

* [PATCH 12/16] fuse2fs: don't zero bytes in punch hole
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (10 preceding siblings ...)
  2025-05-22  0:13   ` [PATCH 11/16] fuse2fs: improve tracing for fallocate Darrick J. Wong
@ 2025-05-22  0:13   ` Darrick J. Wong
  2025-05-22  0:14   ` [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
                     ` (3 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:13 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the pagecache, it will take care of zeroing the
unaligned parts of punched out regions so we don't have to do it
ourselves.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index fe6d97324c1f57..aeb2b6fbc28401 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -152,6 +152,7 @@ enum fuse2fs_iomap_state {
 	IOMAP_DISABLED,
 	IOMAP_UNKNOWN,
 	IOMAP_ENABLED,
+	IOMAP_FILEIO,	/* enabled and does all file data block IO */
 };
 #endif
 
@@ -1040,6 +1041,7 @@ static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff)
 		/* fallthrough */;
 	case IOMAP_DISABLED:
 		return 0;
+	case IOMAP_FILEIO:
 	case IOMAP_ENABLED:
 		break;
 	}
@@ -1059,11 +1061,17 @@ static errcode_t confirm_iomap(struct fuse_conn_info *conn, struct fuse2fs *ff)
 
 static int iomap_enabled(const struct fuse2fs *ff)
 {
-	return ff->iomap_state == IOMAP_ENABLED;
+	return ff->iomap_state >= IOMAP_ENABLED;
+}
+
+static int iomap_does_fileio(const struct fuse2fs *ff)
+{
+	return ff->iomap_state == IOMAP_FILEIO;
 }
 #else
 # define confirm_iomap(...)	(0)
 # define iomap_enabled(...)	(0)
+# define iomap_does_fileio(...)	(0)
 #endif
 
 static void *op_init(struct fuse_conn_info *conn
@@ -1100,6 +1108,20 @@ static void *op_init(struct fuse_conn_info *conn
 	if (ff->iomap_state != IOMAP_DISABLED &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;
+
+	/*
+	 * If iomap is turned on and the kernel advertises support for both
+	 * direct and pagecache IO, then that means the kernel handles all
+	 * regular file data block IO for us.  That means we can turn off all
+	 * of libext2fs' file data block handling except for inline data.
+	 *
+	 * XXX: kernel doesn't support inline data iomap
+	 */
+	if (iomap_enabled(ff) &&
+	    fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO) &&
+	    fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_PAGECACHE))
+		ff->iomap_state = IOMAP_FILEIO;
+
 	/*
 	 * In iomap mode, the kernel writes file data directly to the block
 	 * device and does not flush the bdev page cache.  We must open the
@@ -4580,6 +4602,10 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	int retflags;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		return 0;
+
 	residue = FUSE2FS_OFF_IN_FSB(ff, offset);
 	if (residue == 0)
 		return 0;
@@ -4617,6 +4643,10 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	off_t residue;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		return 0;
+
 	residue = FUSE2FS_OFF_IN_FSB(ff, offset);
 	if (residue == 0)
 		return 0;


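The change to iomap_enabled relies on the order of the enum values: IOMAP_FILEIO is declared after IOMAP_ENABLED, so a `>=` comparison treats both as enabled. A minimal sketch of that pattern follows, using hypothetical `sketch_` names rather than the fuse2fs definitions. The compile-time check is an addition of this sketch and is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical copy of the state enum; the order of the values matters. */
enum sketch_iomap_state {
	SKETCH_IOMAP_DISABLED,
	SKETCH_IOMAP_UNKNOWN,
	SKETCH_IOMAP_ENABLED,
	SKETCH_IOMAP_FILEIO,	/* enabled, and the kernel does all file IO */
};

/* Catch an accidental reordering of the enum at compile time. */
static_assert(SKETCH_IOMAP_FILEIO > SKETCH_IOMAP_ENABLED,
	      "FILEIO must sort after ENABLED");

static bool sketch_iomap_enabled(enum sketch_iomap_state state)
{
	return state >= SKETCH_IOMAP_ENABLED;
}

static bool sketch_iomap_does_fileio(enum sketch_iomap_state state)
{
	return state == SKETCH_IOMAP_FILEIO;
}
```

Any state added later would have to be placed with this ordering in mind, since a value declared after ENABLED is silently treated as enabled.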

* [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (11 preceding siblings ...)
  2025-05-22  0:13   ` [PATCH 12/16] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
@ 2025-05-22  0:14   ` Darrick J. Wong
  2025-05-22  0:14   ` [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
                     ` (2 subsequent siblings)
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:14 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the page cache, the kernel will take care of
all the file data block IO for us, including zeroing of punched ranges
and post-EOF bytes.  fuse2fs only needs to do IO for inline data.

Therefore, set the NOBLOCKIO ext2_file flag so that libext2fs will not
do any regular file IO to or from disk blocks at all.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index aeb2b6fbc28401..842ea3a191fa44 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -2863,9 +2863,14 @@ static int truncate_helper(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size)
 	ext2_file_t file;
 	__u64 old_isize;
 	errcode_t err;
+	int flags = EXT2_FILE_WRITE;
 	int ret = 0;
 
-	err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file);
+	/* the kernel handles all eof zeroing for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		flags |= EXT2_FILE_NOBLOCKIO;
+
+	err = ext2fs_file_open(fs, ino, flags, &file);
 	if (err)
 		return translate_error(fs, ino, err);
 
@@ -2987,6 +2992,9 @@ static int __op_open(struct fuse2fs *ff, const char *path,
 		file->open_flags |= EXT2_FILE_WRITE;
 		break;
 	}
+	/* the kernel handles all block IO for us in iomap mode */
+	if (iomap_does_fileio(ff))
+		file->open_flags |= EXT2_FILE_NOBLOCKIO;
 	if (fp->flags & O_APPEND) {
 		/* the kernel doesn't allow truncation of an append-only file */
 		if (fp->flags & O_TRUNC) {



* [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (12 preceding siblings ...)
  2025-05-22  0:14   ` [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
@ 2025-05-22  0:14   ` Darrick J. Wong
  2025-05-22  0:14   ` [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
  2025-05-22  0:15   ` [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:14 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Now that fuse2fs uses iomap for pagecache IO, all regular file IO goes
directly to the disk.  There is no need to flush the unix IO manager's
disk cache (or invalidate it) because it does not contain file data.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 842ea3a191fa44..ba8c5f301625c6 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5091,9 +5091,11 @@ static int fuse_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 		return -ENOSYS;
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
-	err = io_channel_flush_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!iomap_does_fileio(ff)) {
+		err = io_channel_flush_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	if (inode->i_flags & EXT4_EXTENTS_FL)
 		return extent_iomap_begin(ff, ino, inode, pos, count, opflags,
@@ -5188,9 +5190,11 @@ static int fuse_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	 * flush and invalidate the file's io_channel buffers before iomap
 	 * writes them
 	 */
-	err = io_channel_invalidate_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!iomap_does_fileio(ff)) {
+		err = io_channel_invalidate_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	return 0;
 }
@@ -5685,7 +5689,7 @@ static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	 * flush and invalidate the file's io_channel buffers again now that
 	 * iomap wrote them
 	 */
-	if (written > 0) {
+	if (written > 0 && !iomap_does_fileio(ff)) {
 		err = io_channel_invalidate_tag(ff->fs->io, attr_ino);
 		if (err) {
 			ret = translate_error(ff->fs, attr_ino, err);



* [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (13 preceding siblings ...)
  2025-05-22  0:14   ` [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
@ 2025-05-22  0:14   ` Darrick J. Wong
  2025-05-22  0:15   ` [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:14 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Back in "fuse2fs: always use directio disk reads with fuse2fs", we
started using directio for all libext2fs disk IO to deal with cache
coherency issues between the unix io manager's disk cache, the block
device page cache, and the file data blocks being read and written to
disk by the kernel itself.

Now that we've turned off all regular file data block IO in libext2fs,
we don't need that and can go back to the old way, which is a lot
faster for metadata operations.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index ba8c5f301625c6..f31aee5af5aad9 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1128,8 +1128,12 @@ static void *op_init(struct fuse_conn_info *conn
 	 * filesystem in directio mode to avoid cache coherency issues when
 	 * reading file data.  If we can't open the bdev in directio mode, we
 	 * must not use iomap.
+	 *
+	 * If we know that the kernel can handle all regular file IO for us,
+	 * then there is no cache coherency issue and we can use buffered reads
+	 * for all IO, which will all be filesystem metadata.
 	 */
-	if (iomap_enabled(ff))
+	if (iomap_enabled(ff) && !iomap_does_fileio(ff))
 		ff->directio = 1;
 #endif
 



* [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (14 preceding siblings ...)
  2025-05-22  0:14   ` [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
@ 2025-05-22  0:15   ` Darrick J. Wong
  15 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22  0:15 UTC (permalink / raw)
  To: tytso; +Cc: John, linux-ext4, miklos, joannelkoong, bernd, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Since fuse in iomap mode guarantees that op_destroy will be called
before umount returns, we don't need to use fuseblk mode to get that
guarantee.  Disable fuseblk mode, which saves us the trouble of closing
and reopening the device.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f31aee5af5aad9..28385d654f5e05 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -787,6 +787,8 @@ static errcode_t open_fs(struct fuse2fs *ff, int libext2_flags)
 	if (ff->directio)
 		flags |= EXT2_FLAG_DIRECT_IO;
 
+	dbg_printf(ff, "opening with flags=0x%x\n", flags);
+
 	err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
 			   &ff->fs);
 	if (err) {
@@ -6153,6 +6155,18 @@ int main(int argc, char *argv[])
 		ret = 32;
 		goto out;
 	}
+#ifdef HAVE_FUSE_IOMAP
+	if (is_bdev && fuse_discover_iomap()) {
+		/*
+		 * fuse-iomap guarantees that op_destroy is called before the
+		 * filesystem is unmounted, so we don't need fuseblk mode.
+		 * This saves us the trouble of reopening the filesystem later,
+		 * and means that fuse2fs itself owns the exclusive lock on the
+		 * block device.
+		 */
+		is_bdev = 0;
+	}
+#endif
 
 	blksize = fctx.fs->blocksize;
 
@@ -6171,14 +6185,14 @@ int main(int argc, char *argv[])
 
 	/* Set up default fuse parameters */
 	snprintf(extra_args, BUFSIZ, "-okernel_cache,subtype=%s,"
-		 "attr_timeout=0" FUSE_PLATFORM_OPTS,
-		 get_subtype(argv[0]));
+		 "attr_timeout=0,fsname=%s" FUSE_PLATFORM_OPTS,
+		 get_subtype(argv[0]), fctx.device);
 	if (fctx.no_default_opts == 0)
 		fuse_opt_add_arg(&args, extra_args);
 
 	if (is_bdev) {
-		snprintf(extra_args, BUFSIZ, "-ofsname=%s,blkdev,blksize=%u",
-			 fctx.device, blksize);
+		snprintf(extra_args, BUFSIZ, "-oblkdev,blksize=%u",
+			 blksize);
 		fuse_opt_add_arg(&args, extra_args);
 	}
 


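The effect of the mount option change is that fsname= moves into the default option string, so it is set whether or not fuseblk mode is used, while blkdev and blksize= stay conditional. A rough sketch of building the two strings follows. The function names are hypothetical, FUSE_PLATFORM_OPTS is left out, and the comma check is an assumption of this sketch rather than something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build the default option string; returns 0 on success, -1 on failure. */
static int format_default_opts(char *buf, size_t buflen, const char *subtype,
			       const char *device)
{
	int n;

	assert(buf != NULL && buflen > 0);

	/* Sketch-only check: a comma would split the option list. */
	if (strchr(device, ',') != NULL)
		return -1;

	n = snprintf(buf, buflen,
		     "-okernel_cache,subtype=%s,attr_timeout=0,fsname=%s",
		     subtype, device);
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Build the extra options that are only added in fuseblk mode. */
static int format_blkdev_opts(char *buf, size_t buflen, unsigned int blksize)
{
	int n;

	assert(buf != NULL && buflen > 0);

	n = snprintf(buf, buflen, "-oblkdev,blksize=%u", blksize);
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Both helpers treat truncation as an error by comparing the snprintf return value with the buffer length. The patch itself does not check for truncation.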

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (3 preceding siblings ...)
  2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-05-22 16:24 ` Amir Goldstein
  2025-05-29 16:45   ` Darrick J. Wong
  2025-06-13 17:37   ` [RFC[RAP] V2] " Darrick J. Wong
  4 siblings, 2 replies; 82+ messages in thread
From: Amir Goldstein @ 2025-05-22 16:24 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Hi everyone,
>
> DO NOT MERGE THIS.
>
> This is the very first request for comments of a prototype to connect
> the Linux fuse driver to fs-iomap for regular file IO operations to and
> from files whose contents persist to locally attached storage devices.
>
> Why would you want to do that?  Most filesystem drivers are seriously
> vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> over almost a decade of its existence.  Faulty code can lead to total
> kernel compromise, and I think there's a very strong incentive to move
> all that parsing out to userspace where we can containerize the fuse
> server process.
>
> willy's folios conversion project (and to a certain degree RH's new
> mount API) have also demonstrated that treewide changes to the core
> mm/pagecache/fs code are very very difficult to pull off and take years
> because you have to understand every filesystem's bespoke use of that
> core code.  Eeeugh.
>
> The fuse command plumbing is very simple -- the ->iomap_begin,
> ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> to the fuse server via a trio of new fuse commands.  This is suitable
> for very simple filesystems that don't do tricky things with mappings
> (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> but solving that is for the next sprint.
>
> With this overly simplistic RFC, I am to show that it's possible to
> build a fuse server for a real filesystem (ext4) that runs entirely in
> userspace yet maintains most of its performance.  At this early stage I
> get about 95% of the kernel ext4 driver's streaming directio performance
> on streaming IO, and 110% of its streaming buffered IO performance.
> Random buffered IO suffers a 90% hit on writes due to unwritten extent
> conversions.  Random direct IO is about 60% as fast as the kernel; see
> the cover letter for the fuse2fs iomap changes for more details.
>

Very cool!

> There are some major warts remaining:
>
> 1. The iomap cookie validation is not present, which can lead to subtle
> races between pagecache zeroing and writeback on filesystems that
> support unwritten and delalloc mappings.
>
> 2. Mappings ought to be cached in the kernel for more speed.
>
> 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> yet figured out how inline data is supposed to work.
>
> 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> which currently isn't possible because the kernel fuse driver will iget
> inodes prior to calling FUSE_GETATTR to discover the properties of the
> inode it just read.

Can you make the decision about enabling iomap on lookup?
The plan for passthrough of inode operations was to allow
setting up the passthrough config of an inode on lookup.

>
> 5. ext4 doesn't support out of place writes so I don't know if that
> actually works correctly.
>
> 6. iomap is an inode-based service, not a file-based service.  This
> means that we /must/ push ext2's inode numbers into the kernel via
> FUSE_GETATTR so that it can report those same numbers back out through
> the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> to index its incore inode, so we have to pass those too so that
> notifications work properly.
>

Again, I might be missing something, but as long as the fuse filesystem
is exposing a single backing filesystem, it should be possible to make
sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
inode number.
See sketch in this WIP branch:
https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 04/11] fuse: add a notification to add new iomap devices
  2025-05-22  0:03   ` [PATCH 04/11] fuse: add a notification to add new iomap devices Darrick J. Wong
@ 2025-05-22 16:46     ` Amir Goldstein
  2025-05-22 17:11       ` Darrick J. Wong
  0 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2025-05-22 16:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

On Thu, May 22, 2025 at 2:03 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Add a new notification so that fuse servers can add extra block devices
> to use with iomap.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/fuse/fuse_i.h          |   19 +++++++
>  fs/fuse/fuse_trace.h      |   36 ++++++++++++++
>  include/uapi/linux/fuse.h |    8 +++
>  fs/fuse/dev.c             |   23 +++++++++
>  fs/fuse/file_iomap.c      |  119 ++++++++++++++++++++++++++++++++++++++++++++-
>  fs/fuse/inode.c           |    9 +++
>  6 files changed, 211 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index aa51f25856697d..4eb75ed90db300 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -619,6 +619,12 @@ struct fuse_sync_bucket {
>         struct rcu_head rcu;
>  };
>
> +struct fuse_iomap {
> +       /* array of file objects that reference block devices for iomap */
> +       struct file **files;
> +       unsigned int nr_files;
> +};
> +
>  /**
>   * A Fuse connection.
>   *
> @@ -970,6 +976,10 @@ struct fuse_conn {
>         struct fuse_ring *ring;
>  #endif
>
> +#ifdef CONFIG_FUSE_IOMAP
> +       struct fuse_iomap iomap_conn;
> +#endif
> +
>         /** Only used if the connection opts into request timeouts */
>         struct {
>                 /* Worker for checking if any requests have timed out */
> @@ -1610,9 +1620,18 @@ static inline bool fuse_has_iomap(const struct inode *inode)
>  {
>         return get_fuse_conn_c(inode)->iomap;
>  }
> +
> +void fuse_iomap_init_reply(struct fuse_mount *fm);
> +void fuse_iomap_conn_put(struct fuse_conn *fc);
> +
> +int fuse_iomap_add_device(struct fuse_conn *fc,
> +                         const struct fuse_iomap_add_device_out *outarg);
>  #else
>  # define fuse_iomap_enabled(...)               (false)
>  # define fuse_has_iomap(...)                   (false)
> +# define fuse_iomap_init_reply(...)            ((void)0)
> +# define fuse_iomap_conn_put(...)              ((void)0)
> +# define fuse_iomap_add_device(...)            (-ENOSYS)
>  #endif
>
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
> index f9a316c9788e06..e1a2e491d2581a 100644
> --- a/fs/fuse/fuse_trace.h
> +++ b/fs/fuse/fuse_trace.h
> @@ -380,6 +380,42 @@ TRACE_EVENT(fuse_iomap_end_error,
>                   __entry->pos, __entry->count, __entry->written,
>                   __entry->error)
>  );
> +
> +TRACE_EVENT(fuse_iomap_dev_class,
> +       TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
> +                const struct file *file),
> +
> +       TP_ARGS(fc, idx, file),
> +
> +       TP_STRUCT__entry(
> +               __field(dev_t,          connection)
> +               __field(unsigned int,   idx)
> +               __field(dev_t,          bdev)
> +       ),
> +
> +       TP_fast_assign(
> +               struct inode *inode = file_inode(file);
> +
> +               __entry->connection     =       fc->dev;
> +               __entry->idx            =       idx;
> +               if (S_ISBLK(inode->i_mode)) {
> +                       __entry->bdev   =       inode->i_rdev;
> +               } else
> +                       __entry->bdev   =       0;
> +       ),
> +
> +       TP_printk("connection %u idx %u dev %u:%u",
> +                 __entry->connection,
> +                 __entry->idx,
> +                 MAJOR(__entry->bdev), MINOR(__entry->bdev))
> +);
> +#define DEFINE_FUSE_IOMAP_DEV_EVENT(name)              \
> +DEFINE_EVENT(fuse_iomap_dev_class, name,               \
> +       TP_PROTO(const struct fuse_conn *fc, unsigned int idx, \
> +                const struct file *file), \
> +       TP_ARGS(fc, idx, file))
> +DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_add_dev);
> +DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_remove_dev);
>  #endif /* CONFIG_FUSE_IOMAP */
>
>  #endif /* _TRACE_FUSE_H */
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index ce6c9960f2418f..ea8992e980a015 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -236,6 +236,7 @@
>   *  7.44
>   *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
>   *    SEEK_{DATA,HOLE} support
> + *  - add FUSE_NOTIFY_ADD_IOMAP_DEVICE for multi-device filesystems
>   */
>
>  #ifndef _LINUX_FUSE_H
> @@ -681,6 +682,7 @@ enum fuse_notify_code {
>         FUSE_NOTIFY_RETRIEVE = 5,
>         FUSE_NOTIFY_DELETE = 6,
>         FUSE_NOTIFY_RESEND = 7,
> +       FUSE_NOTIFY_ADD_IOMAP_DEVICE = 8,
>         FUSE_NOTIFY_CODE_MAX,
>  };
>
> @@ -1371,4 +1373,10 @@ struct fuse_iomap_end_in {
>         uint32_t map_dev;       /* device cookie * */
>  };
>
> +struct fuse_iomap_add_device_out {
> +       int32_t fd;             /* fd of the open device to add */
> +       uint32_t reserved;      /* must be zero */
> +       uint32_t *map_dev;      /* location to receive device cookie */
> +};
> +
>  #endif /* _LINUX_FUSE_H */
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 6dcbaa218b7a16..9d7064ec170cf6 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1824,6 +1824,26 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
>         return err;
>  }
>
> +static int fuse_notify_add_iomap_device(struct fuse_conn *fc, unsigned int size,
> +                                       struct fuse_copy_state *cs)
> +{
> +       struct fuse_iomap_add_device_out outarg;
> +       int err = -EINVAL;
> +
> +       if (size != sizeof(outarg))
> +               goto err;
> +
> +       err = fuse_copy_one(cs, &outarg, sizeof(outarg));
> +       if (err)
> +               goto err;
> +       fuse_copy_finish(cs);
> +
> +       return fuse_iomap_add_device(fc, &outarg);
> +err:
> +       fuse_copy_finish(cs);
> +       return err;
> +}
> +
>  struct fuse_retrieve_args {
>         struct fuse_args_pages ap;
>         struct fuse_notify_retrieve_in inarg;
> @@ -2049,6 +2069,9 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
>         case FUSE_NOTIFY_RESEND:
>                 return fuse_notify_resend(fc);
>
> +       case FUSE_NOTIFY_ADD_IOMAP_DEVICE:
> +               return fuse_notify_add_iomap_device(fc, size, cs);
> +
>         default:
>                 fuse_copy_finish(cs);
>                 return -EINVAL;
> diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
> index dfa0c309803113..faefd29a273bf3 100644
> --- a/fs/fuse/file_iomap.c
> +++ b/fs/fuse/file_iomap.c
> @@ -142,6 +142,26 @@ static inline int fuse_iomap_validate(const struct fuse_iomap_begin_out *outarg,
>         return 0;
>  }
>
> +static inline struct block_device *fuse_iomap_bdev(struct fuse_mount *fm,
> +                                                  unsigned int idx)
> +{
> +       struct fuse_conn *fc = fm->fc;
> +       struct file *file = NULL;
> +
> +       spin_lock(&fc->lock);
> +       if (idx < fc->iomap_conn.nr_files)
> +               file = fc->iomap_conn.files[idx];
> +       spin_unlock(&fc->lock);
> +
> +       if (!file)
> +               return NULL;
> +
> +       if (!S_ISBLK(file_inode(file)->i_mode))
> +               return NULL;
> +
> +       return I_BDEV(file->f_mapping->host);
> +}
> +
>  static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
>                             unsigned opflags, struct iomap *iomap,
>                             struct iomap *srcmap)
> @@ -155,6 +175,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
>         };
>         struct fuse_iomap_begin_out outarg = { };
>         struct fuse_mount *fm = get_fuse_mount(inode);
> +       struct block_device *read_bdev;
>         FUSE_ARGS(args);
>         int err;
>
> @@ -181,8 +202,18 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
>         if (err)
>                 return err;
>
> +       read_bdev = fuse_iomap_bdev(fm, outarg.read_dev);
> +       if (!read_bdev)
> +               return -ENODEV;
> +
>         if ((opflags & IOMAP_WRITE) &&
>             outarg.write_type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
> +               struct block_device *write_bdev =
> +                       fuse_iomap_bdev(fm, outarg.write_dev);
> +
> +               if (!write_bdev)
> +                       return -ENODEV;
> +
>                 /*
>                  * For an out of place write, we must supply the write mapping
>                  * via @iomap, and the read mapping via @srcmap.
> @@ -192,14 +223,14 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
>                 iomap->length = outarg.length;
>                 iomap->type = outarg.write_type;
>                 iomap->flags = outarg.write_flags;
> -               iomap->bdev = inode->i_sb->s_bdev;
> +               iomap->bdev = write_bdev;
>
>                 srcmap->addr = outarg.read_addr;
>                 srcmap->offset = outarg.offset;
>                 srcmap->length = outarg.length;
>                 srcmap->type = outarg.read_type;
>                 srcmap->flags = outarg.read_flags;
> -               srcmap->bdev = inode->i_sb->s_bdev;
> +               srcmap->bdev = read_bdev;
>         } else {
>                 /*
>                  * For everything else (reads, reporting, and pure overwrites),
> @@ -211,7 +242,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
>                 iomap->length = outarg.length;
>                 iomap->type = outarg.read_type;
>                 iomap->flags = outarg.read_flags;
> -               iomap->bdev = inode->i_sb->s_bdev;
> +               iomap->bdev = read_bdev;
>         }
>
>         return 0;
> @@ -278,3 +309,85 @@ const struct iomap_ops fuse_iomap_ops = {
>         .iomap_begin            = fuse_iomap_begin,
>         .iomap_end              = fuse_iomap_end,
>  };
> +
> +void fuse_iomap_conn_put(struct fuse_conn *fc)
> +{
> +       unsigned int i;
> +
> +       for (i = 0; i < fc->iomap_conn.nr_files; i++) {
> +               struct file *file = fc->iomap_conn.files[i];
> +
> +               trace_fuse_iomap_remove_dev(fc, i, file);
> +
> +               fc->iomap_conn.files[i] = NULL;
> +               fput(file);
> +       }
> +
> +       kfree(fc->iomap_conn.files);
> +       fc->iomap_conn.nr_files = 0;
> +}
> +
> +/* Add a bdev to the fuse connection, returns the index or a negative errno */
> +static int __fuse_iomap_add_device(struct fuse_conn *fc, struct file *file)
> +{
> +       struct file **new_files;
> +       int ret;
> +
> +       if (fc->iomap_conn.nr_files >= PAGE_SIZE / sizeof(unsigned int))
> +               return -EMFILE;
> +
> +       new_files = krealloc_array(fc->iomap_conn.files,
> +                                  fc->iomap_conn.nr_files + 1,
> +                                  sizeof(struct file *),
> +                                  GFP_KERNEL | __GFP_ZERO);
> +       if (!new_files)
> +               return -ENOMEM;
> +
> +       spin_lock(&fc->lock);
> +       fc->iomap_conn.files = new_files;
> +       fc->iomap_conn.files[fc->iomap_conn.nr_files] = get_file(file);
> +       ret = fc->iomap_conn.nr_files++;
> +       spin_unlock(&fc->lock);
> +
> +       trace_fuse_iomap_add_dev(fc, ret, file);
> +
> +       return ret;
> +}
> +
> +void fuse_iomap_init_reply(struct fuse_mount *fm)
> +{
> +       struct fuse_conn *fc = fm->fc;
> +       struct super_block *sb = fm->sb;
> +
> +       if (sb->s_bdev)
> +               __fuse_iomap_add_device(fc, sb->s_bdev_file);
> +}
> +
> +int fuse_iomap_add_device(struct fuse_conn *fc,
> +                         const struct fuse_iomap_add_device_out *outarg)
> +{
> +       struct file *file;
> +       int ret;
> +
> +       if (!fc->iomap)
> +               return -EINVAL;
> +
> +       if (outarg->reserved)
> +               return -EINVAL;
> +
> +       CLASS(fd, somefd)(outarg->fd);
> +       if (fd_empty(somefd))
> +               return -EBADF;
> +       file = fd_file(somefd);
> +
> +       if (!S_ISBLK(file_inode(file)->i_mode))
> +               return -ENODEV;
> +
> +       down_read(&fc->killsb);
> +       ret = __fuse_iomap_add_device(fc, file);
> +       up_read(&fc->killsb);
> +       if (ret < 0)
> +               return ret;
> +
> +       return put_user(ret, outarg->map_dev);
> +}

This very much reminds me of FUSE_DEV_IOC_BACKING_OPEN,
which gives the kernel an fd to remember for later file operations.

FUSE_DEV_IOC_BACKING_OPEN was implemented as an ioctl
because of security concerns about passing an fd to the kernel via write().

Speaking of security concerns, we need to consider whether this requires
some privileges to allow setting up direct access to a blockdev.

But also, apart from the fact that those are block device fds,
how does iomap_conn.files[] differ from fc->backing_files_map?

Miklos had envisioned this (backing blockdev) use case as one of the
special cases of fuse passthrough.

Instead of an identity mapping to a backing file created at open time,
it's an extent mapping to a backing blockdev created at data access time.

I am not saying that you need to reuse anything from the fuse passthrough
code, because the use cases probably do not overlap, but hopefully
you can avoid falling into the same pits that we have already managed
to avoid.

Thanks,
Amir.
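
[Editorial sketch, not part of the original message.]  The device table
in the patch quoted above boils down to an append-only array whose index
is the cookie handed back to the server.  A minimal userspace model of
that idea follows, assuming plain ints stand in for the struct file
pointers; every sketch_* name and the SKETCH_MAX_DEVS limit are
hypothetical and do not come from the patch.

```c
#include <errno.h>
#include <stdlib.h>

#define SKETCH_MAX_DEVS 1024u	/* arbitrary cap, chosen for illustration */

/* Append-only table; the array index doubles as the device cookie. */
struct sketch_dev_table {
	int *fds;
	unsigned int nr;
};

/* Append fd to the table; returns the new cookie or a negative errno. */
static int sketch_add_device(struct sketch_dev_table *t, int fd)
{
	int *new_fds;

	if (t->nr >= SKETCH_MAX_DEVS)
		return -EMFILE;

	new_fds = realloc(t->fds, (t->nr + 1) * sizeof(*new_fds));
	if (!new_fds)
		return -ENOMEM;

	t->fds = new_fds;
	t->fds[t->nr] = fd;
	return (int)t->nr++;
}

/* Translate a cookie back to the stored fd, or -ENODEV if out of range. */
static int sketch_lookup_device(const struct sketch_dev_table *t,
				unsigned int idx)
{
	if (idx >= t->nr)
		return -ENODEV;
	return t->fds[idx];
}
```

Because entries are never removed, a cookie stays valid for the life of
the table, which is what lets a mapping reply refer to a device by
index.  This model leaves out the locking and file reference counting
that the kernel code has to do.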

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 04/11] fuse: add a notification to add new iomap devices
  2025-05-22 16:46     ` Amir Goldstein
@ 2025-05-22 17:11       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-22 17:11 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, miklos, joannelkoong, linux-xfs, bernd, John

On Thu, May 22, 2025 at 06:46:14PM +0200, Amir Goldstein wrote:
> On Thu, May 22, 2025 at 2:03 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Add a new notification so that fuse servers can add extra block devices
> > to use with iomap.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

<snip>

> > +int fuse_iomap_add_device(struct fuse_conn *fc,
> > +                         const struct fuse_iomap_add_device_out *outarg)
> > +{
> > +       struct file *file;
> > +       int ret;
> > +
> > +       if (!fc->iomap)
> > +               return -EINVAL;
> > +
> > +       if (outarg->reserved)
> > +               return -EINVAL;
> > +
> > +       CLASS(fd, somefd)(outarg->fd);
> > +       if (fd_empty(somefd))
> > +               return -EBADF;
> > +       file = fd_file(somefd);
> > +
> > +       if (!S_ISBLK(file_inode(file)->i_mode))
> > +               return -ENODEV;
> > +
> > +       down_read(&fc->killsb);
> > +       ret = __fuse_iomap_add_device(fc, file);
> > +       up_read(&fc->killsb);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       return put_user(ret, outarg->map_dev);
> > +}
> 
> This very much reminds of FUSE_DEV_IOC_BACKING_OPEN
> that gives kernel an fd to remember for later file operations.
> 
> FUSE_DEV_IOC_BACKING_OPEN was implemented as an ioctl
> because of security concerns of passing an fd to the kernel via write().
> 
> Speaking of security concerns, we need to consider if this requires some
> privileges to allow setting up direct access to blockdev.

Yeah, I was assuming that if the fuse server can open the bdev, then
that's enough.  But I suppose I at least need to check that it's opened
in write mode too.

> But also, apart from the fact that those are block device fds,
> what does iomap_conn.files[] differ from fc->backing_files_map?

Oh, so that's what that does!  Yes, I'd rather pile on to that than
introduce more ABI. :)

> Miklos had envisioned this (backing blockdev) use case as one of the
> private cases of fuse passthrough.
> 
> Instead of identity mapping to backing file created at open time
> it's extent mapping to backing blockdev created at data access time.
> 
> I am not saying that you need to reuse anything from fuse passthrough
> code, because the use cases probably do not overlap, but hopefully,
> you can avoid falling into the same pits that we have already managed to avoid.

<nod> The one downside is that fsiomap requires the file to point at
either a block device or (in theory) a dax device, so we'd have to check
that on every access.  But aside from that I think I could reuse this
piece.  Thanks for bringing that to my attention! :)

--D

> Thanks,
> Amir.
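
[Editorial sketch, not part of the original message.]  The two checks
discussed above (the fd must refer to a block device, and it should be
open for writing) can be modeled from userspace with fstat() and
fcntl().  This is only an illustration under those assumptions: the
function name and the choice of -EACCES are hypothetical, and an
in-kernel check would presumably look at the struct file's mode bits
instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

/*
 * Return 0 if fd is a block device opened read-write, or a negative
 * errno describing why it is not acceptable.
 */
static int sketch_check_iomap_fd(int fd)
{
	struct stat st;
	int flags;

	if (fstat(fd, &st) < 0)
		return -errno;

	/* iomap needs a block device (dax devices are ignored here) */
	if (!S_ISBLK(st.st_mode))
		return -ENODEV;

	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -errno;

	/* require write access so the device can back writes later */
	if ((flags & O_ACCMODE) != O_RDWR)
		return -EACCES;

	return 0;
}
```

For example, an fd for /dev/null is rejected with -ENODEV because it is
a character device, and an invalid fd is rejected with -EBADF.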

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
  2025-05-22  0:02   ` [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
@ 2025-05-29 11:08     ` Miklos Szeredi
  2025-05-31  1:08       ` Darrick J. Wong
  0 siblings, 1 reply; 82+ messages in thread
From: Miklos Szeredi @ 2025-05-29 11:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John

On Thu, 22 May 2025 at 02:02, Darrick J. Wong <djwong@kernel.org> wrote:

> Fix this by only using synchronous fputs for fuseblk servers if the
> process doesn't have PF_LOCAL_THROTTLE.  Hopefully the fuseblk server
> had the good sense to call PR_SET_IO_FLUSHER to mark itself as a
> filesystem server.

The bug is valid.

I just wonder if we really need to check against the task flag instead
of always sending release async, which would simplify things.

The sync release originates from commit 5a18ec176c93 ("fuse: fix hang
of single threaded fuseblk filesystem"), but then commit baebccbe997d
("fuse: hold inode instead of path after release") made that obsolete.

Does anybody see a reason why sync release for fuseblk is a good idea?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
@ 2025-05-29 16:45   ` Darrick J. Wong
  2025-05-29 19:41     ` Amir Goldstein
  2025-06-13 17:37   ` [RFC[RAP] V2] " Darrick J. Wong
  1 sibling, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-29 16:45 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi everyone,
> >
> > DO NOT MERGE THIS.
> >
> > This is the very first request for comments of a prototype to connect
> > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > from files whose contents persist to locally attached storage devices.
> >
> > Why would you want to do that?  Most filesystem drivers are seriously
> > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > over almost a decade of its existence.  Faulty code can lead to total
> > kernel compromise, and I think there's a very strong incentive to move
> > all that parsing out to userspace where we can containerize the fuse
> > server process.
> >
> > willy's folios conversion project (and to a certain degree RH's new
> > mount API) have also demonstrated that treewide changes to the core
> > mm/pagecache/fs code are very very difficult to pull off and take years
> > because you have to understand every filesystem's bespoke use of that
> > core code.  Eeeugh.
> >
> > The fuse command plumbing is very simple -- the ->iomap_begin,
> > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > to the fuse server via a trio of new fuse commands.  This is suitable
> > for very simple filesystems that don't do tricky things with mappings
> > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > but solving that is for the next sprint.
> >
> > With this overly simplistic RFC, I am to show that it's possible to
> > build a fuse server for a real filesystem (ext4) that runs entirely in
> > userspace yet maintains most of its performance.  At this early stage I
> > get about 95% of the kernel ext4 driver's streaming directio performance
> > on streaming IO, and 110% of its streaming buffered IO performance.
> > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > the cover letter for the fuse2fs iomap changes for more details.
> >
> 
> Very cool!
> 
> > There are some major warts remaining:
> >
> > 1. The iomap cookie validation is not present, which can lead to subtle
> > races between pagecache zeroing and writeback on filesystems that
> > support unwritten and delalloc mappings.
> >
> > 2. Mappings ought to be cached in the kernel for more speed.
> >
> > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > yet figured out how inline data is supposed to work.
> >
> > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > which currently isn't possible because the kernel fuse driver will iget
> > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > inode it just read.
> 
> Can you make the decision about enabling iomap on lookup?
> The plan for passthrough for inode operations was to allow
> setting up passthough config of inode on lookup.

The main requirement (especially for buffered IO) is that we've set the
address space operations structure either to the regular fuse one or to
the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
code assumes that those ops cannot change on a live inode.

So I /think/ we could ask the fuse server at inode instantiation time
(which, if I'm reading the code correctly, is when iget5_locked gives
fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
to userspace at that time.  Alternately I guess we could extend struct
fuse_attr with another FUSE_ATTR_ flag, I think?
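
[Editorial sketch, not part of the original message.]  The constraint
described above, pick the address space operations exactly once while
the inode is still new and never switch them afterwards, can be modeled
in userspace as follows.  SKETCH_ATTR_IOMAP and all sketch_* names are
hypothetical; no such flag exists in the fuse uapi as far as this
thread shows.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_ATTR_IOMAP	(1u << 0)	/* hypothetical attr flag */

struct sketch_aops {
	const char *name;
};

static const struct sketch_aops sketch_plain_aops = { "fuse" };
static const struct sketch_aops sketch_iomap_aops = { "fuse+iomap" };

struct sketch_inode {
	bool is_new;			/* stands in for the INEW state */
	const struct sketch_aops *aops;
};

/* Called once at instantiation: choose the ops, then publish the inode. */
static void sketch_init_inode(struct sketch_inode *inode, uint32_t attr_flags)
{
	inode->aops = (attr_flags & SKETCH_ATTR_IOMAP) ?
			&sketch_iomap_aops : &sketch_plain_aops;
	inode->is_new = false;
}

/* Later attr updates may not flip the choice on a live inode. */
static int sketch_refresh_inode(const struct sketch_inode *inode,
				uint32_t attr_flags)
{
	bool want_iomap = (attr_flags & SKETCH_ATTR_IOMAP) != 0;
	bool have_iomap = inode->aops == &sketch_iomap_aops;

	if (inode->is_new)
		return -EINVAL;		/* not instantiated yet */
	return want_iomap == have_iomap ? 0 : -EINVAL;
}
```

The point of the sketch is only the ordering: the flag has to arrive
with the reply that instantiates the inode, because a later reply that
disagrees can only be rejected, not applied.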

> > 5. ext4 doesn't support out of place writes so I don't know if that
> > actually works correctly.
> >
> > 6. iomap is an inode-based service, not a file-based service.  This
> > means that we /must/ push ext2's inode numbers into the kernel via
> > FUSE_GETATTR so that it can report those same numbers back out through
> > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > to index its incore inode, so we have to pass those too so that
> > notifications work properly.
> >
> 
> Again, I might be missing something, but as long as the fuse filesystem
> is exposing a single backing filesystem, it should be possible to make
> sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> inode number.
> See sketch in this WIP branch:
> https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575

I think this would work in many places, except for filesystems with
64-bit inumbers on 32-bit machines.  That might be a good argument for
continuing to pass along the nodeid and fuse_inode::orig_ino like it
does now.  Plus there are some filesystems that synthesize inode numbers
so tying the two together might not be feasible/desirable anyway.

Though one nice feature of letting fuse have its own nodeids might be
that if the in-memory index switches to a tree structure, then it could
be more compact if the filesystem's inumbers are fairly sparse like xfs.
OTOH the current inode hashtable has been around for a very long time so
that might not be a big concern.  For fuse2fs it doesn't matter since
ext4 inumbers are u32.
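
[Editorial sketch, not part of the original message.]  To illustrate
the 32-bit concern above: if a 64-bit inode number has to fit into a
32-bit in-memory inode number, some folding is needed, and distinct
inumbers can then collide.  The helper below XOR-folds the two halves,
loosely modeled on what I believe the fuse driver does on 32-bit
builds; the name is hypothetical and this is not a copy of kernel code.

```c
#include <stdint.h>

/* Fold a 64-bit inode number into the 32 bits a 32-bit ino_t can hold. */
static uint32_t sketch_squash_ino(uint64_t ino64)
{
	uint32_t ino = (uint32_t)ino64;

	/* mix in the upper half so it is not silently dropped */
	ino ^= (uint32_t)(ino64 >> 32);
	return ino;
}
```

With this folding, 0x100000001 and 0x200000002 both squash to 0, so the
folded value cannot serve as a unique key, whereas any inumber that
already fits in 32 bits (as ext4's do) passes through unchanged.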

--D

> 
> Thanks,
> Amir.
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-29 16:45   ` Darrick J. Wong
@ 2025-05-29 19:41     ` Amir Goldstein
  2025-06-09 22:31       ` Darrick J. Wong
  2025-07-12 10:57       ` Amir Goldstein
  0 siblings, 2 replies; 82+ messages in thread
From: Amir Goldstein @ 2025-05-29 19:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > Hi everyone,
> > >
> > > DO NOT MERGE THIS.
> > >
> > > This is the very first request for comments of a prototype to connect
> > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > from files whose contents persist to locally attached storage devices.
> > >
> > > Why would you want to do that?  Most filesystem drivers are seriously
> > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > over almost a decade of its existence.  Faulty code can lead to total
> > > kernel compromise, and I think there's a very strong incentive to move
> > > all that parsing out to userspace where we can containerize the fuse
> > > server process.
> > >
> > > willy's folios conversion project (and to a certain degree RH's new
> > > mount API) have also demonstrated that treewide changes to the core
> > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > because you have to understand every filesystem's bespoke use of that
> > > core code.  Eeeugh.
> > >
> > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > for very simple filesystems that don't do tricky things with mappings
> > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > but solving that is for the next sprint.
> > >
> > > With this overly simplistic RFC, I am to show that it's possible to
> > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > userspace yet maintains most of its performance.  At this early stage I
> > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > the cover letter for the fuse2fs iomap changes for more details.
> > >
> >
> > Very cool!
> >
> > > There are some major warts remaining:
> > >
> > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > races between pagecache zeroing and writeback on filesystems that
> > > support unwritten and delalloc mappings.
> > >
> > > 2. Mappings ought to be cached in the kernel for more speed.
> > >
> > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > yet figured out how inline data is supposed to work.
> > >
> > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > which currently isn't possible because the kernel fuse driver will iget
> > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > inode it just read.
> >
> > Can you make the decision about enabling iomap on lookup?
> > The plan for passthrough for inode operations was to allow
> > setting up passthough config of inode on lookup.
>
> The main requirement (especially for buffered IO) is that we've set the
> address space operations structure either to the regular fuse one or to
> the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> code assumes that cannot change on a live inode.
>
> So I /think/ we could ask the fuse server at inode instantiation time
> (which, if I'm reading the code correctly, is when iget5_locked gives
> fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> to userspace at that time.  Alternately I guess we could extend struct
> fuse_attr with another FUSE_ATTR_ flag, I think?
>

The latter. Either extend fuse_attr or struct fuse_entry_out,
which is in the responses to FUSE_LOOKUP, FUSE_READDIRPLUS,
FUSE_CREATE, and FUSE_TMPFILE, the commands that instantiate
fuse inodes.

There is a very hand-wavy discussion about this at:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/

In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
command that uses the variable length file handle instead of nodeid
as a key for the inode.

So we will have to extend fuse_entry_out anyway, but TBH I never got to
look at the gritty details of how best to extend all the relevant commands,
so I hope I am not sending you down the wrong path.


> > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > actually works correctly.
> > >
> > > 6. iomap is an inode-based service, not a file-based service.  This
> > > means that we /must/ push ext2's inode numbers into the kernel via
> > > FUSE_GETATTR so that it can report those same numbers back out through
> > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > to index its incore inode, so we have to pass those too so that
> > > notifications work properly.
> > >
> >
> > Again, I might be missing something, but as long as the fuse filesystem
> > is exposing a single backing filesystem, it should be possible to make
> > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > inode number.
> > See sketch in this WIP branch:
> > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
>
> I think this would work in many places, except for filesystems with
> 64-bit inumbers on 32-bit machines.  That might be a good argument for
> continuing to pass along the nodeid and fuse_inode::orig_ino like it
> does now.  Plus there are some filesystems that synthesize inode numbers
> so tying the two together might not be feasible/desirable anyway.
>
> Though one nice feature of letting fuse have its own nodeids might be
> that if the in-memory index switches to a tree structure, then it could
> be more compact if the filesystem's inumbers are fairly sparse like xfs.
> OTOH the current inode hashtable has been around for a very long time so
> that might not be a big concern.  For fuse2fs it doesn't matter since
> ext4 inumbers are u32.
>

I wanted to see if declaring a one-to-one 64-bit ino mapping can simplify
things for the first version of inode ops passthrough.
If this is not the case, or if this is too much of a limitation for
your use case, then never mind.
But if it is a good enough shortcut for the demo and can be extended later,
then why not?
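
To illustrate the 32-bit concern quoted above, here is a small
userspace sketch.  It assumes plain truncation of a 64-bit inode number
into a 32-bit field, which is a simplification for illustration and not
a description of what the kernel actually does; the demo_* names are
made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Models storing a 64-bit inode number in a 32-bit field. */
static uint32_t demo_ino_to_32bit(uint64_t ino64)
{
	return (uint32_t)ino64;
}

/* True if the inode number survives the round trip unchanged. */
static bool demo_ino_fits_32bit(uint64_t ino64)
{
	return (uint64_t)demo_ino_to_32bit(ino64) == ino64;
}

/* True if two distinct 64-bit numbers become indistinguishable. */
static bool demo_inos_collide_32bit(uint64_t a, uint64_t b)
{
	assert(a != b);
	return demo_ino_to_32bit(a) == demo_ino_to_32bit(b);
}
```

With u32 inumbers (the fuse2fs case) nothing is lost, but 0x80 and
0x100000080 collide once the upper half is dropped, which is why a
separate nodeid may still be wanted there.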

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 03/11] fuse: implement the basic iomap mechanisms
  2025-05-22  0:03   ` [PATCH 03/11] fuse: implement the basic iomap mechanisms Darrick J. Wong
@ 2025-05-29 22:15     ` Joanne Koong
  2025-05-29 23:15       ` Joanne Koong
  0 siblings, 1 reply; 82+ messages in thread
From: Joanne Koong @ 2025-05-29 22:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, miklos, linux-xfs, bernd, John

On Wed, May 21, 2025 at 5:03 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Implement functions to enable upcalling of iomap_begin and iomap_end to
> userspace fuse servers.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/fuse/fuse_i.h          |   38 ++++++
>  fs/fuse/fuse_trace.h      |  258 +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fuse.h |   87 ++++++++++++++
>  fs/fuse/Kconfig           |   23 ++++
>  fs/fuse/Makefile          |    1
>  fs/fuse/file_iomap.c      |  280 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/fuse/inode.c           |    5 +
>  7 files changed, 691 insertions(+), 1 deletion(-)
>  create mode 100644 fs/fuse/file_iomap.c
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d56d4fd956db99..aa51f25856697d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -895,6 +895,9 @@ struct fuse_conn {
>         /* Is link not implemented by fs? */
>         unsigned int no_link:1;
>
> +       /* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
> +       unsigned int iomap:1;
> +
>         /* Use io_uring for communication */
>         unsigned int io_uring;
>
> @@ -1017,6 +1020,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
>         return sb->s_fs_info;
>  }
>
> +static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
> +{
> +       return sb->s_fs_info;
> +}
> +

Instead of adding this new helper (and the ones below), what about
modifying the existing (non-const) versions of these helpers to take
const * input args, e.g.:

-static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
+static inline struct fuse_mount *get_fuse_mount_super(const struct super_block *sb)
 {
        return sb->s_fs_info;
 }

Then, doing something like "const struct fuse_mount *mt =
get_fuse_mount(inode);" would enforce the same guarantees as "const
struct fuse_mount *mt = get_fuse_mount_c(inode);" and we wouldn't need
2 sets of helpers that pretty much do the same thing.
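
A standalone sketch of what I mean, with made-up demo_* structures
standing in for the real fuse ones (so this is an illustration under
those assumptions, not the actual fuse_i.h code).  The helpers accept
const input but return a non-const pointer, and a const-only caller
simply assigns the result to a const pointer:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct demo_fuse_conn {
	unsigned int iomap:1;
};

struct demo_fuse_mount {
	struct demo_fuse_conn *fc;
};

struct demo_super_block {
	void *s_fs_info;
};

struct demo_inode {
	struct demo_super_block *i_sb;
};

/* One helper: const input, non-const result. */
static inline struct demo_fuse_mount *
demo_get_fuse_mount_super(const struct demo_super_block *sb)
{
	return sb->s_fs_info;
}

static inline struct demo_fuse_mount *
demo_get_fuse_mount(const struct demo_inode *inode)
{
	return demo_get_fuse_mount_super(inode->i_sb);
}

/* A const-only caller needs no separate _c variant. */
static inline int demo_has_iomap(const struct demo_inode *inode)
{
	const struct demo_fuse_mount *fm = demo_get_fuse_mount(inode);

	assert(fm != NULL && fm->fc != NULL);
	return fm->fc->iomap;
}
```

Callers that do need to modify the mount keep using the same helper
with a non-const local, so only one set of accessors has to exist.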

>  static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
>  {
>         return get_fuse_mount_super(sb)->fc;
> @@ -1027,16 +1035,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
>         return get_fuse_mount_super(inode->i_sb);
>  }
>
> +static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
> +{
> +       return get_fuse_mount_super_c(inode->i_sb);
> +}
> +
>  static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
>  {
>         return get_fuse_mount_super(inode->i_sb)->fc;
>  }
>
> +static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
> +{
> +       return get_fuse_mount_super_c(inode->i_sb)->fc;
> +}
> +
>  static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
>  {
>         return container_of(inode, struct fuse_inode, inode);
>  }
>
> +static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
> +{
> +       return container_of(inode, struct fuse_inode, inode);
> +}
> +
>  static inline u64 get_node_id(struct inode *inode)
>  {
>         return get_fuse_inode(inode)->nodeid;
> @@ -1577,4 +1600,19 @@ extern void fuse_sysctl_unregister(void);
>  #define fuse_sysctl_unregister()       do { } while (0)
>  #endif /* CONFIG_SYSCTL */
>
> +#if IS_ENABLED(CONFIG_FUSE_IOMAP)
> +# include <linux/fiemap.h>
> +# include <linux/iomap.h>
> +
> +bool fuse_iomap_enabled(void);
> +
> +static inline bool fuse_has_iomap(const struct inode *inode)
> +{
> +       return get_fuse_conn_c(inode)->iomap;
> +}
> +#else
> +# define fuse_iomap_enabled(...)               (false)
> +# define fuse_has_iomap(...)                   (false)
> +#endif
> +
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index ca215a3cba3e31..fc7c5bf1cef52d 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -64,6 +64,29 @@ config FUSE_PASSTHROUGH
>
>           If you want to allow passthrough operations, answer Y.
>
> +config FUSE_IOMAP
> +       bool "FUSE file IO over iomap"
> +       default y
> +       depends on FUSE_FS
> +       select FS_IOMAP
> +       help
> +         For supported fuseblk servers, this allows the file IO path to run
> +         through the kernel.

I have config FUSE_FS select FS_IOMAP in my patchset (not yet
submitted) that changes fuse buffered writes / writeback handling to
use iomap. Could we just have config FUSE_FS automatically opt into
FS_IOMAP here or do you see a reason that this needs to be a separate
config?


Thanks,
Joanne
> +
> +config FUSE_IOMAP_BY_DEFAULT
> +       bool "FUSE file I/O over iomap by default"
> +       default n
> +       depends on FUSE_IOMAP
> +       help
> +         Enable sending FUSE file I/O over iomap by default.
> +
> +config FUSE_IOMAP_DEBUG
> +       bool "Debug FUSE file IO over iomap"
> +       default n
> +       depends on FUSE_IOMAP
> +       help
> +         Enable debugging assertions for the fuse iomap code paths.
> +
>  config FUSE_IO_URING
>         bool "FUSE communication over io-uring"
>         default y

* Re: [PATCH 03/11] fuse: implement the basic iomap mechanisms
  2025-05-29 22:15     ` Joanne Koong
@ 2025-05-29 23:15       ` Joanne Koong
  2025-06-03  0:13         ` Darrick J. Wong
  0 siblings, 1 reply; 82+ messages in thread
From: Joanne Koong @ 2025-05-29 23:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, miklos, linux-xfs, bernd, John

On Thu, May 29, 2025 at 3:15 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, May 21, 2025 at 5:03 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Implement functions to enable upcalling of iomap_begin and iomap_end to
> > userspace fuse servers.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  fs/fuse/fuse_i.h          |   38 ++++++
> >  fs/fuse/fuse_trace.h      |  258 +++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/fuse.h |   87 ++++++++++++++
> >  fs/fuse/Kconfig           |   23 ++++
> >  fs/fuse/Makefile          |    1
> >  fs/fuse/file_iomap.c      |  280 +++++++++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/inode.c           |    5 +
> >  7 files changed, 691 insertions(+), 1 deletion(-)
> >  create mode 100644 fs/fuse/file_iomap.c
> >
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index d56d4fd956db99..aa51f25856697d 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -895,6 +895,9 @@ struct fuse_conn {
> >         /* Is link not implemented by fs? */
> >         unsigned int no_link:1;
> >
> > +       /* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
> > +       unsigned int iomap:1;
> > +
> >         /* Use io_uring for communication */
> >         unsigned int io_uring;
> >
> > @@ -1017,6 +1020,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> >         return sb->s_fs_info;
> >  }
> >
> > +static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
> > +{
> > +       return sb->s_fs_info;
> > +}
> > +
>
> Instead of adding this new helper (and the ones below), what about
> modifying the existing (non-const) versions of these helpers to take
> in const * input args,  eg
>
> -static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> +static inline struct fuse_mount *get_fuse_mount_super(const struct super_block *sb)
>  {
>         return sb->s_fs_info;
>  }
>
> Then, doing something like "const struct fuse_mount *mt =
> get_fuse_mount(inode);" would enforce the same guarantees as "const
> struct fuse_mount *mt = get_fuse_mount_c(inode);" and we wouldn't need
> 2 sets of helpers that pretty much do the same thing.
>
> >  static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
> >  {
> >         return get_fuse_mount_super(sb)->fc;
> > @@ -1027,16 +1035,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
> >         return get_fuse_mount_super(inode->i_sb);
> >  }
> >
> > +static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
> > +{
> > +       return get_fuse_mount_super_c(inode->i_sb);
> > +}
> > +
> >  static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
> >  {
> >         return get_fuse_mount_super(inode->i_sb)->fc;
> >  }
> >
> > +static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
> > +{
> > +       return get_fuse_mount_super_c(inode->i_sb)->fc;
> > +}
> > +
> >  static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
> >  {
> >         return container_of(inode, struct fuse_inode, inode);
> >  }
> >
> > +static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
> > +{
> > +       return container_of(inode, struct fuse_inode, inode);
> > +}
> > +
> >  static inline u64 get_node_id(struct inode *inode)
> >  {
> >         return get_fuse_inode(inode)->nodeid;
> > @@ -1577,4 +1600,19 @@ extern void fuse_sysctl_unregister(void);
> >  #define fuse_sysctl_unregister()       do { } while (0)
> >  #endif /* CONFIG_SYSCTL */
> >
> > +#if IS_ENABLED(CONFIG_FUSE_IOMAP)
> > +# include <linux/fiemap.h>
> > +# include <linux/iomap.h>
> > +
> > +bool fuse_iomap_enabled(void);
> > +
> > +static inline bool fuse_has_iomap(const struct inode *inode)
> > +{
> > +       return get_fuse_conn_c(inode)->iomap;
> > +}
> > +#else
> > +# define fuse_iomap_enabled(...)               (false)
> > +# define fuse_has_iomap(...)                   (false)
> > +#endif
> > +
> >  #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > index ca215a3cba3e31..fc7c5bf1cef52d 100644
> > --- a/fs/fuse/Kconfig
> > +++ b/fs/fuse/Kconfig
> > @@ -64,6 +64,29 @@ config FUSE_PASSTHROUGH
> >
> >           If you want to allow passthrough operations, answer Y.
> >
> > +config FUSE_IOMAP
> > +       bool "FUSE file IO over iomap"
> > +       default y
> > +       depends on FUSE_FS
> > +       select FS_IOMAP
> > +       help
> > +         For supported fuseblk servers, this allows the file IO path to run
> > +         through the kernel.
>
> I have config FUSE_FS select FS_IOMAP in my patchset (not yet
> submitted) that changes fuse buffered writes / writeback handling to
> use iomap. Could we just have config FUSE_FS automatically opt into
> FS_IOMAP here or do you see a reason that this needs to be a separate
> config?

Thinking about it some more, the iomap stuff you're adding also
requires a "depends on BLOCK", so this will need to be a separate
config anyway, regardless of whether FUSE_FS will always "select
FS_IOMAP".


Thanks,
Joanne

>
>
> Thanks,
> Joanne
> > +
> > +config FUSE_IOMAP_BY_DEFAULT
> > +       bool "FUSE file I/O over iomap by default"
> > +       default n
> > +       depends on FUSE_IOMAP
> > +       help
> > +         Enable sending FUSE file I/O over iomap by default.
> > +
> > +config FUSE_IOMAP_DEBUG
> > +       bool "Debug FUSE file IO over iomap"
> > +       default n
> > +       depends on FUSE_IOMAP
> > +       help
> > +         Enable debugging assertions for the fuse iomap code paths.
> > +
> >  config FUSE_IO_URING
> >         bool "FUSE communication over io-uring"
> >         default y

* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
  2025-05-29 11:08     ` Miklos Szeredi
@ 2025-05-31  1:08       ` Darrick J. Wong
  2025-06-06 13:54         ` Miklos Szeredi
  0 siblings, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-05-31  1:08 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John

On Thu, May 29, 2025 at 01:08:25PM +0200, Miklos Szeredi wrote:
> On Thu, 22 May 2025 at 02:02, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > Fix this by only using synchronous fputs for fuseblk servers if the
> > process doesn't have PF_LOCAL_THROTTLE.  Hopefully the fuseblk server
> > had the good sense to call PR_SET_IO_FLUSHER to mark itself as a
> > filesystem server.
> 
> The bug is valid.
> 
> I just wonder if we really need to check against the task flag instead
> of always sending release async, which would simplify things.
> 
> The sync release originates from commit 5a18ec176c93 ("fuse: fix hang
> of single threaded fuseblk filesystem"), but then commit baebccbe997d
> ("fuse: hold inode instead of path after release") made that obsolete.
> 
> Does anybody see a reason why sync release for fuseblk is a good idea?

The best reason that I can think of is that normally the process that
owns the fd (and hence is releasing it) should be made to wait for
the release, because normally we want processes that generate file
activity to pay those costs.  It's just this weird case where the fd
already got closed but aio is still going in the background.

(yeah, everyone hates aio ;))
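
For reference, this is roughly what I'd expect a fuseblk server to do
at startup to get PF_LOCAL_THROTTLE set on itself.  It's only a sketch:
the demo_* function names are made up, and the fallback #defines are
there only for older userspace headers.  Note that the prctl needs
CAP_SYS_RESOURCE, so an unprivileged server will just get an error
back:

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for userspace headers that predate Linux 5.6. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER	57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER	58
#endif

/* Mark the calling thread as an IO flusher; 0 or a negative errno. */
static int demo_mark_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/* Query the state: 1 if set, 0 if clear, or a negative errno. */
static int demo_is_io_flusher(void)
{
	int ret = prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);

	if (ret == -1)
		return -errno;
	assert(ret == 0 || ret == 1);
	return ret;
}
```

A server would call demo_mark_io_flusher() once before it starts
servicing requests and decide for itself whether a failure is fatal.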

Also: is it a bug that the kernel only sends FUSE_DESTROY on umount for
fuseblk filesystems?  I'd have thought that you'd want to make umount
block until the fuse server is totally done.  OTOH I guess I could see
an argument for not waiting for potentially hung servers, etc.

--D

> Thanks,
> Miklos

* Re: [PATCH 03/11] fuse: implement the basic iomap mechanisms
  2025-05-29 23:15       ` Joanne Koong
@ 2025-06-03  0:13         ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-06-03  0:13 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, miklos, linux-xfs, bernd, John

On Thu, May 29, 2025 at 04:15:57PM -0700, Joanne Koong wrote:
> On Thu, May 29, 2025 at 3:15 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, May 21, 2025 at 5:03 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Implement functions to enable upcalling of iomap_begin and iomap_end to
> > > userspace fuse servers.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > >  fs/fuse/fuse_i.h          |   38 ++++++
> > >  fs/fuse/fuse_trace.h      |  258 +++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/fuse.h |   87 ++++++++++++++
> > >  fs/fuse/Kconfig           |   23 ++++
> > >  fs/fuse/Makefile          |    1
> > >  fs/fuse/file_iomap.c      |  280 +++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/fuse/inode.c           |    5 +
> > >  7 files changed, 691 insertions(+), 1 deletion(-)
> > >  create mode 100644 fs/fuse/file_iomap.c
> > >
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index d56d4fd956db99..aa51f25856697d 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -895,6 +895,9 @@ struct fuse_conn {
> > >         /* Is link not implemented by fs? */
> > >         unsigned int no_link:1;
> > >
> > > +       /* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
> > > +       unsigned int iomap:1;
> > > +
> > >         /* Use io_uring for communication */
> > >         unsigned int io_uring;
> > >
> > > @@ -1017,6 +1020,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> > >         return sb->s_fs_info;
> > >  }
> > >
> > > +static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
> > > +{
> > > +       return sb->s_fs_info;
> > > +}
> > > +
> >
> > Instead of adding this new helper (and the ones below), what about
> > modifying the existing (non-const) versions of these helpers to take
> > in const * input args,  eg
> >
> > -static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
> > +static inline struct fuse_mount *get_fuse_mount_super(const struct super_block *sb)
> >  {
> >         return sb->s_fs_info;
> >  }
> >
> > Then, doing something like "const struct fuse_mount *mt =
> > get_fuse_mount(inode);" would enforce the same guarantees as "const
> > struct fuse_mount *mt = get_fuse_mount_c(inode);" and we wouldn't need
> > 2 sets of helpers that pretty much do the same thing.
> >
> > >  static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
> > >  {
> > >         return get_fuse_mount_super(sb)->fc;
> > > @@ -1027,16 +1035,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
> > >         return get_fuse_mount_super(inode->i_sb);
> > >  }
> > >
> > > +static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
> > > +{
> > > +       return get_fuse_mount_super_c(inode->i_sb);
> > > +}
> > > +
> > >  static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
> > >  {
> > >         return get_fuse_mount_super(inode->i_sb)->fc;
> > >  }
> > >
> > > +static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
> > > +{
> > > +       return get_fuse_mount_super_c(inode->i_sb)->fc;
> > > +}
> > > +
> > >  static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
> > >  {
> > >         return container_of(inode, struct fuse_inode, inode);
> > >  }
> > >
> > > +static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
> > > +{
> > > +       return container_of(inode, struct fuse_inode, inode);
> > > +}
> > > +
> > >  static inline u64 get_node_id(struct inode *inode)
> > >  {
> > >         return get_fuse_inode(inode)->nodeid;
> > > @@ -1577,4 +1600,19 @@ extern void fuse_sysctl_unregister(void);
> > >  #define fuse_sysctl_unregister()       do { } while (0)
> > >  #endif /* CONFIG_SYSCTL */
> > >
> > > +#if IS_ENABLED(CONFIG_FUSE_IOMAP)
> > > +# include <linux/fiemap.h>
> > > +# include <linux/iomap.h>
> > > +
> > > +bool fuse_iomap_enabled(void);
> > > +
> > > +static inline bool fuse_has_iomap(const struct inode *inode)
> > > +{
> > > +       return get_fuse_conn_c(inode)->iomap;
> > > +}
> > > +#else
> > > +# define fuse_iomap_enabled(...)               (false)
> > > +# define fuse_has_iomap(...)                   (false)
> > > +#endif
> > > +
> > >  #endif /* _FS_FUSE_I_H */
> > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> > > index ca215a3cba3e31..fc7c5bf1cef52d 100644
> > > --- a/fs/fuse/Kconfig
> > > +++ b/fs/fuse/Kconfig
> > > @@ -64,6 +64,29 @@ config FUSE_PASSTHROUGH
> > >
> > >           If you want to allow passthrough operations, answer Y.
> > >
> > > +config FUSE_IOMAP
> > > +       bool "FUSE file IO over iomap"
> > > +       default y
> > > +       depends on FUSE_FS
> > > +       select FS_IOMAP
> > > +       help
> > > +         For supported fuseblk servers, this allows the file IO path to run
> > > +         through the kernel.
> >
> > I have config FUSE_FS select FS_IOMAP in my patchset (not yet
> > submitted) that changes fuse buffered writes / writeback handling to
> > use iomap. Could we just have config FUSE_FS automatically opt into
> > FS_IOMAP here or do you see a reason that this needs to be a separate
> > config?
> 
> Thinking about it some more, the iomap stuff you're adding also
> requires a "depends on BLOCK", so this will need to be a separate
> config anyway, regardless of whether FUSE_FS will always "select
> FS_IOMAP".

I'll add that, thanks.  I forgot that FS_IOMAP no longer selects BLOCK
all the time. :)

--D

> 
> Thanks,
> Joanne
> 
> >
> >
> > Thanks,
> > Joanne
> > > +
> > > +config FUSE_IOMAP_BY_DEFAULT
> > > +       bool "FUSE file I/O over iomap by default"
> > > +       default n
> > > +       depends on FUSE_IOMAP
> > > +       help
> > > +         Enable sending FUSE file I/O over iomap by default.
> > > +
> > > +config FUSE_IOMAP_DEBUG
> > > +       bool "Debug FUSE file IO over iomap"
> > > +       default n
> > > +       depends on FUSE_IOMAP
> > > +       help
> > > +         Enable debugging assertions for the fuse iomap code paths.
> > > +
> > >  config FUSE_IO_URING
> > >         bool "FUSE communication over io-uring"
> > >         default y

* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
  2025-05-31  1:08       ` Darrick J. Wong
@ 2025-06-06 13:54         ` Miklos Szeredi
  2025-06-09 18:13           ` Darrick J. Wong
  0 siblings, 1 reply; 82+ messages in thread
From: Miklos Szeredi @ 2025-06-06 13:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John

On Sat, 31 May 2025 at 03:08, Darrick J. Wong <djwong@kernel.org> wrote:

> The best reason that I can think of is that normally the process that
> owns the fd (and hence is releasing it) should be made to wait for
> the release, because normally we want processes that generate file
> activity to pay those costs.

That argument seems to apply to all fuse variants.  But fuse does get
away with async release and I don't see why fuseblk would be different
in this respect.

Trying to hack around the problems of sync release with a task flag
that servers might or might not have set does not feel like a very robust
solution.

> Also: is it a bug that the kernel only sends FUSE_DESTROY on umount for
> fuseblk filesystems?  I'd have thought that you'd want to make umount
> block until the fuse server is totally done.  OTOH I guess I could see
> an argument for not waiting for potentially hung servers, etc.

It's a potential DoS.  With allow_root we could arguably enable
FUSE_DESTROY, since the mounter is explicitly acknowledging this DoS
possibility.

Thanks,
Miklos

* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
  2025-06-06 13:54         ` Miklos Szeredi
@ 2025-06-09 18:13           ` Darrick J. Wong
  2025-06-09 20:29             ` Darrick J. Wong
  0 siblings, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-06-09 18:13 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John

On Fri, Jun 06, 2025 at 03:54:50PM +0200, Miklos Szeredi wrote:
> On Sat, 31 May 2025 at 03:08, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > The best reason that I can think of is that normally the process that
> > owns the fd (and hence is releasing it) should be made to wait for
> > the release, because normally we want processes that generate file
> > activity to pay those costs.
> 
> That argument seems to apply to all fuse variants.  But fuse does get
> away with async release and I don't see why fuseblk would be different
> in this respect.
> 
> Trying to hack around the problems of sync release with a task flag
> that servers might or might not have set does not feel like a very robust
> solution.
> 
> > Also: is it a bug that the kernel only sends FUSE_DESTROY on umount for
> > fuseblk filesystems?  I'd have thought that you'd want to make umount
> > block until the fuse server is totally done.  OTOH I guess I could see
> > an argument for not waiting for potentially hung servers, etc.
> 
> It's a potential DoS.  With allow_root we could arguably enable
> FUSE_DESTROY, since the mounter is explicitly acknowledging this DoS
> possibility.

<nod> Looking deeper at fuse2fs's op_destroy function, I think most of
the slow functionality (writing group descriptors and the primary super
and fsyncing the device) ought to be done via FUSE_SYNCFS, not
FUSE_DESTROY.  If I made that change, I think op_destroy becomes very
fast -- all it does is close the fs and log a message.  The VFS unmount
code calls sync_filesystem (which initiates a FUSE_SYNCFS) which sounds
like it would work for fuse2fs.

Unhappily, libfuse3 doesn't seem to implement it:

$ git grep FUSE_SYNCFS
doc/libfuse-operations.txt:394:50. FUSE_SYNCFS (50)
include/fuse_kernel.h:186: *  - add FUSE_SYNCFS
include/fuse_kernel.h:670:      FUSE_SYNCFS             = 50,

--D

> Thanks,
> Miklos

* Re: [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers
  2025-06-09 18:13           ` Darrick J. Wong
@ 2025-06-09 20:29             ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-06-09 20:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, joannelkoong, linux-xfs, bernd, John

On Mon, Jun 09, 2025 at 11:13:26AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 06, 2025 at 03:54:50PM +0200, Miklos Szeredi wrote:
> > On Sat, 31 May 2025 at 03:08, Darrick J. Wong <djwong@kernel.org> wrote:
> > 
> > > The best reason that I can think of is that normally the process that
> > > owns the fd (and hence is releasing it) should be made to wait for
> > > the release, because normally we want processes that generate file
> > > activity to pay those costs.
> > 
> > That argument seems to apply to all fuse variants.  But fuse does get
> > away with async release and I don't see why fuseblk would be different
> > in this respect.
> > 
> > Trying to hack around the problems of sync release with a task flag
> > that servers might or might not have set does not feel like a very robust
> > solution.
> > 
> > > Also: is it a bug that the kernel only sends FUSE_DESTROY on umount for
> > > fuseblk filesystems?  I'd have thought that you'd want to make umount
> > > block until the fuse server is totally done.  OTOH I guess I could see
> > > an argument for not waiting for potentially hung servers, etc.
> > 
> > It's a potential DoS.  With allow_root we could arguably enable
> > FUSE_DESTROY, since the mounter is explicitly acknowledging this DoS
> > possibility.
> 
> <nod> Looking deeper at fuse2fs's op_destroy function, I think most of
> the slow functionality (writing group descriptors and the primary super
> and fsyncing the device) ought to be done via FUSE_SYNCFS, not
> FUSE_DESTROY.  If I made that change, I think op_destroy becomes very
> fast -- all it does is close the fs and log a message.  The VFS unmount
> code calls sync_filesystem (which initiates a FUSE_SYNCFS) which sounds
> like it would work for fuse2fs.
> 
> Unhappily, libfuse3 doesn't seem to implement it:
> 
> $ git grep FUSE_SYNCFS
> doc/libfuse-operations.txt:394:50. FUSE_SYNCFS (50)
> include/fuse_kernel.h:186: *  - add FUSE_SYNCFS
> include/fuse_kernel.h:670:      FUSE_SYNCFS             = 50,

...and it won't really work anyway since fuse_sync_fs doesn't upcall to
the fuse server if sb->s_root == NULL; and we can't do anything at that
point anyway because deactivate_locked_super -> fuse_kill_sb_anon has
already called fuse_conn_destroy to tear down the connection.
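
A toy model of that ordering problem, in case it helps.  The demo_*
types are made-up stand-ins, and the only behavior modeled is the
s_root check described above; the real function has more to it:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the fuse connection; counts FUSE_SYNCFS upcalls. */
struct demo_conn {
	int syncfs_upcalls;
};

/* Stand-in for struct super_block. */
struct demo_sb {
	void *s_root;		/* NULL once unmount has dropped the root */
	struct demo_conn *fc;
};

/* Models the check: no upcall once s_root is gone. */
static int demo_sync_fs(struct demo_sb *sb)
{
	assert(sb != NULL && sb->fc != NULL);

	if (!sb->s_root)
		return 0;	/* too late, the server never hears about it */

	sb->fc->syncfs_upcalls++;	/* models sending FUSE_SYNCFS */
	return 0;
}
```

So a sync issued while the filesystem is still mounted reaches the
server, but one issued after the root dentry has been dropped is
silently a no-op.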

--D

> 
> > Thanks,
> > Miklos
> 

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-29 19:41     ` Amir Goldstein
@ 2025-06-09 22:31       ` Darrick J. Wong
  2025-06-10 10:59         ` Amir Goldstein
  2025-07-12 10:57       ` Amir Goldstein
  1 sibling, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-06-09 22:31 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > DO NOT MERGE THIS.
> > > >
> > > > This is the very first request for comments of a prototype to connect
> > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > from files whose contents persist to locally attached storage devices.
> > > >
> > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > kernel compromise, and I think there's a very strong incentive to move
> > > > all that parsing out to userspace where we can containerize the fuse
> > > > server process.
> > > >
> > > > willy's folios conversion project (and to a certain degree RH's new
> > > > mount API) have also demonstrated that treewide changes to the core
> > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > because you have to understand every filesystem's bespoke use of that
> > > > core code.  Eeeugh.
> > > >
> > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > for very simple filesystems that don't do tricky things with mappings
> > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > but solving that is for the next sprint.
> > > >
> > > > With this overly simplistic RFC, I aim to show that it's possible to
> > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > userspace yet maintains most of its performance.  At this early stage I
> > > > get about 95% of the kernel ext4 driver's streaming directio
> > > > performance, and 110% of its streaming buffered IO performance.
> > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > the cover letter for the fuse2fs iomap changes for more details.
> > > >
> > >
> > > Very cool!
> > >
> > > > There are some major warts remaining:
> > > >
> > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > races between pagecache zeroing and writeback on filesystems that
> > > > support unwritten and delalloc mappings.
> > > >
> > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > >
> > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > yet figured out how inline data is supposed to work.
> > > >
> > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > which currently isn't possible because the kernel fuse driver will iget
> > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > inode it just read.
> > >
> > > Can you make the decision about enabling iomap on lookup?
> > > The plan for passthrough for inode operations was to allow
> > > > setting up passthrough config of inode on lookup.
> >
> > The main requirement (especially for buffered IO) is that we've set the
> > address space operations structure either to the regular fuse one or to
> > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > code assumes that cannot change on a live inode.
> >
> > So I /think/ we could ask the fuse server at inode instantiation time
> > (which, if I'm reading the code correctly, is when iget5_locked gives
> > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > to userspace at that time.  Alternately I guess we could extend struct
> > fuse_attr with another FUSE_ATTR_ flag, I think?
> >
> 
> The latter. Either extend fuse_attr or struct fuse_entry_out,
> which is in the responses of FUSE_LOOKUP,
> FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> which instantiate fuse inodes.
> 
> There is a very hand wavy discussion about this at:
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> 
> In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> command that uses the variable length file handle instead of nodeid
> as a key for the inode.
> 
> So we will have to extend fuse_entry_out anyway, but TBH I never got to
> look at the gritty details of how best to extend all the relevant commands,
> so I hope I am not sending you down the wrong path.

I found another twist to this story: the upper level libfuse3 library
assigns distinct nodeids for each directory entry.  These nodeids are
passed into the kernel and appear to be the basis for an iget5_locked call.
IOWs, each nodeid causes a struct fuse_inode to be created in the
kernel.

For a single-linked file this is no big deal, but for a hardlink this
makes iomap a mess because this means that in fuse2fs, an ext2 inode can
map to multiple kernel fuse_inode objects.  This /really/ breaks the
locking model of iomap, which assumes that there's one in-kernel inode
and that it can use i_rwsem to synchronize updates.

So I'm going to have to find a way to deal with this.  I tried trivially
messing with libfuse nodeid assignment but that blew some assertion.
Maybe your LOOKUP_HANDLE thing would work.
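
To make the hardlink problem concrete, here is a toy sketch in plain C.
It does not use libfuse or libext2fs at all; every name in it (toy_fs,
toy_nodeid_per_dirent, toy_nodeid_per_inode, toy_count_kernel_inodes) is
made up for illustration, and the "100 + i" numbering is arbitrary.  It
only models the two nodeid policies discussed above: one nodeid per
directory entry versus one nodeid per ext2 inode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy directory entry: a full path and the ext2 inode it resolves to. */
struct toy_dirent {
	const char *path;
	uint32_t    ext2_ino;
};

/* "/a" and "/b" are hardlinks to ext2 inode 12; "/c" is a separate file. */
static const struct toy_dirent toy_fs[] = {
	{ "/a", 12 },
	{ "/b", 12 },
	{ "/c", 13 },
};
#define TOY_NR (sizeof(toy_fs) / sizeof(toy_fs[0]))

typedef uint64_t (*toy_policy_fn)(const char *path);

/* Policy 1: every directory entry gets its own nodeid (made-up numbering). */
uint64_t toy_nodeid_per_dirent(const char *path)
{
	for (size_t i = 0; i < TOY_NR; i++)
		if (strcmp(toy_fs[i].path, path) == 0)
			return 100 + i;
	return 0;	/* no such path */
}

/* Policy 2: the nodeid is simply the ext2 inode number. */
uint64_t toy_nodeid_per_inode(const char *path)
{
	for (size_t i = 0; i < TOY_NR; i++)
		if (strcmp(toy_fs[i].path, path) == 0)
			return toy_fs[i].ext2_ino;
	return 0;	/* no such path */
}

/*
 * Count the distinct nodeids that a policy hands out for one ext2 inode.
 * In this model, each distinct nodeid stands for one in-kernel fuse_inode.
 */
unsigned int toy_count_kernel_inodes(uint32_t ext2_ino, toy_policy_fn policy)
{
	uint64_t seen[TOY_NR];
	unsigned int nr = 0;

	for (size_t i = 0; i < TOY_NR; i++) {
		uint64_t id;
		unsigned int j;

		if (toy_fs[i].ext2_ino != ext2_ino)
			continue;
		id = policy(toy_fs[i].path);
		for (j = 0; j < nr; j++)
			if (seen[j] == id)
				break;
		if (j == nr)
			seen[nr++] = id;
	}
	assert(nr <= TOY_NR);
	return nr;
}
```

With the per-dirent policy, inode 12 ends up behind two nodeids, so in
this model there would be two i_rwsems for one file.  With the per-inode
policy there is exactly one.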

> > > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > > actually works correctly.
> > > >
> > > > 6. iomap is an inode-based service, not a file-based service.  This
> > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > to index its incore inode, so we have to pass those too so that
> > > > notifications work properly.
> > > >
> > >
> > > Again, I might be missing something, but as long as the fuse filesystem
> > > is exposing a single backing filesystem, it should be possible to make
> > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > > inode number.
> > > See sketch in this WIP branch:
> > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
> >
> > I think this would work in many places, except for filesystems with
> > 64-bit inumbers on 32-bit machines.  That might be a good argument for
> > continuing to pass along the nodeid and fuse_inode::orig_ino like it
> > does now.  Plus there are some filesystems that synthesize inode numbers
> > so tying the two together might not be feasible/desirable anyway.
> >
> > Though one nice feature of letting fuse have its own nodeids might be
> > that if the in-memory index switches to a tree structure, then it could
> > be more compact if the filesystem's inumbers are fairly sparse like xfs.
> > OTOH the current inode hashtable has been around for a very long time so
> > that might not be a big concern.  For fuse2fs it doesn't matter since
> > ext4 inumbers are u32.
> >
> 
> I wanted to see if declaring one-to-one 64bit ino can simplify things
> for the first version of inode ops passthrough.
> If this is not the case, or if this is too much of a limitation for
> your use case
> then nevermind.
> But if it is a good enough shortcut for the demo and can be extended later,
> then why not.

It's very tempting, because it's very confusing to have nodeids and
stat st_ino not be the same thing.

--D

> Thanks,
> Amir.
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-09 22:31       ` Darrick J. Wong
@ 2025-06-10 10:59         ` Amir Goldstein
  2025-06-10 19:00           ` Darrick J. Wong
  0 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2025-06-10 10:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> >  or
> >
> > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > DO NOT MERGE THIS.
> > > > >
> > > > > This is the very first request for comments of a prototype to connect
> > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > from files whose contents persist to locally attached storage devices.
> > > > >
> > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > server process.
> > > > >
> > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > because you have to understand every filesystem's bespoke use of that
> > > > > core code.  Eeeugh.
> > > > >
> > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > but solving that is for the next sprint.
> > > > >
> > > > > With this overly simplistic RFC, I aim to show that it's possible to
> > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > >
> > > >
> > > > Very cool!
> > > >
> > > > > There are some major warts remaining:
> > > > >
> > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > support unwritten and delalloc mappings.
> > > > >
> > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > >
> > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > yet figured out how inline data is supposed to work.
> > > > >
> > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > inode it just read.
> > > >
> > > > Can you make the decision about enabling iomap on lookup?
> > > > The plan for passthrough for inode operations was to allow
> > > > > setting up passthrough config of inode on lookup.
> > >
> > > The main requirement (especially for buffered IO) is that we've set the
> > > address space operations structure either to the regular fuse one or to
> > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > code assumes that cannot change on a live inode.
> > >
> > > So I /think/ we could ask the fuse server at inode instantiation time
> > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > to userspace at that time.  Alternately I guess we could extend struct
> > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > >
> >
> > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > which is in the responses of FUSE_LOOKUP,
> > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> > which instantiate fuse inodes.
> >
> > There is a very hand wavy discussion about this at:
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> >
> > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > command that uses the variable length file handle instead of nodeid
> > as a key for the inode.
> >
> > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > look at the gritty details of how best to extend all the relevant commands,
> > so I hope I am not sending you down the wrong path.
>
> I found another twist to this story: the upper level libfuse3 library
> assigns distinct nodeids for each directory entry.  These nodeids are
> passed into the kernel and appear to be the basis for an iget5_locked call.
> IOWs, each nodeid causes a struct fuse_inode to be created in the
> kernel.
>
> For a single-linked file this is no big deal, but for a hardlink this
> makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> map to multiple kernel fuse_inode objects.  This /really/ breaks the
> locking model of iomap, which assumes that there's one in-kernel inode
> and that it can use i_rwsem to synchronize updates.
>
> So I'm going to have to find a way to deal with this.  I tried trivially
> messing with libfuse nodeid assignment but that blew some assertion.
> Maybe your LOOKUP_HANDLE thing would work.
>

Pull the emergency brake!

In an amateur move, I did not look at fuse2fs.c before commenting on your
work.

High level fuse interface is not the right tool for the job.
It's not even the easiest way to have written fuse2fs in the first place.

High-level fuse API addresses file system objects with full paths.
This is good for writing simple virtual filesystems, but it is neither the
correct nor the easiest choice for writing a userspace driver for ext4.

Low-level fuse interface addresses filesystem objects by nodeid
and requires the server to implement lookup(parent_nodeid, name)
where the server gets to choose the nodeid (not libfuse).

The current fuse2fs code has to go to some effort to convert a full path
to inode + name using ext2fs_namei().

With the low-level fuse API, op_lookup() could have used the native
ext2_lookup(), which would have been much more natural.

You can find the most featureful low-level fuse example at:
https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc

Among other things, the server has an inode cache, where an inode
has in its state 'nopen' (was this inode opened for io) and 'backing_id'
(was this inode mapped for kernel passthrough).

Currently this backing_id mapping is only made on first open of inode,
but the plan is to do that also at lookup time, for example, if the
iomap mode for the inode can be determined at lookup time.
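
As a rough illustration of the low-level contract described above, here
is a toy model in plain C of a server-side inode cache where lookup()
picks the nodeid itself.  It is not libfuse code: toy_dir, toy_inode,
toy_lookup and toy_forget are hypothetical names, the directory table is
hardcoded, and the parent is addressed directly by its ext2 inode number
(any special treatment of the root nodeid is ignored).  The 'nlookup' and
'nopen' fields only loosely mirror the inode state mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy directory table: (parent inode, name) -> inode. */
struct toy_dirent {
	uint32_t    parent_ino;
	const char *name;
	uint32_t    ino;
};

/* "a" and "b" are hardlinks to inode 12 inside directory inode 2. */
static const struct toy_dirent toy_dir[] = {
	{ 2, "a", 12 },
	{ 2, "b", 12 },
	{ 2, "c", 13 },
};
#define TOY_NR_DIRENTS (sizeof(toy_dir) / sizeof(toy_dir[0]))

/* One cached inode per ext2 inode number, however many names point at it. */
struct toy_inode {
	uint32_t ino;		/* 0 means the slot is free */
	uint64_t nlookup;	/* lookups the "kernel" has not forgotten yet */
	uint32_t nopen;		/* open file count; unused in this sketch */
};
#define TOY_CACHE_SIZE 8
static struct toy_inode toy_cache[TOY_CACHE_SIZE];

/* Find the cache slot for an inode number, or NULL if it is not cached. */
struct toy_inode *toy_find(uint32_t ino)
{
	for (size_t i = 0; i < TOY_CACHE_SIZE; i++)
		if (ino != 0 && toy_cache[i].ino == ino)
			return &toy_cache[i];
	return NULL;
}

/* Find or create the cache slot for an inode number. */
static struct toy_inode *toy_iget(uint32_t ino)
{
	struct toy_inode *ip = toy_find(ino);

	if (ip)
		return ip;
	for (size_t i = 0; i < TOY_CACHE_SIZE; i++) {
		if (toy_cache[i].ino == 0) {
			toy_cache[i].ino = ino;
			toy_cache[i].nlookup = 0;
			toy_cache[i].nopen = 0;
			return &toy_cache[i];
		}
	}
	return NULL;	/* cache full */
}

/* lookup(parent_nodeid, name): the server chooses nodeid == ext2 ino. */
uint64_t toy_lookup(uint64_t parent_nodeid, const char *name)
{
	for (size_t i = 0; i < TOY_NR_DIRENTS; i++) {
		const struct toy_dirent *de = &toy_dir[i];
		struct toy_inode *ip;

		if (de->parent_ino != parent_nodeid ||
		    strcmp(de->name, name) != 0)
			continue;
		ip = toy_iget(de->ino);
		if (!ip)
			return 0;
		ip->nlookup++;
		return ip->ino;
	}
	return 0;	/* a real server would reply with ENOENT here */
}

/* forget(nodeid, nlookup): drop references, free the slot at zero. */
void toy_forget(uint64_t nodeid, uint64_t nlookup)
{
	struct toy_inode *ip = toy_find((uint32_t)nodeid);

	if (!ip)
		return;
	assert(ip->nlookup >= nlookup);
	ip->nlookup -= nlookup;
	if (ip->nlookup == 0 && ip->nopen == 0)
		ip->ino = 0;
}
```

Looking up "a" and then "b" in directory 2 returns nodeid 12 both times
and leaves a single cache entry with nlookup == 2, which is the property
that a hardlinked file needs.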


> > > > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > > > actually works correctly.
> > > > >
> > > > > 6. iomap is an inode-based service, not a file-based service.  This
> > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > to index its incore inode, so we have to pass those too so that
> > > > > notifications work properly.
> > > > >
> > > >
> > > > Again, I might be missing something, but as long as the fuse filesystem
> > > > is exposing a single backing filesystem, it should be possible to make
> > > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > > > inode number.
> > > > See sketch in this WIP branch:
> > > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
> > >
> > > I think this would work in many places, except for filesystems with
> > > 64-bit inumbers on 32-bit machines.  That might be a good argument for
> > > continuing to pass along the nodeid and fuse_inode::orig_ino like it
> > > does now.  Plus there are some filesystems that synthesize inode numbers
> > > so tying the two together might not be feasible/desirable anyway.
> > >
> > > Though one nice feature of letting fuse have its own nodeids might be
> > > that if the in-memory index switches to a tree structure, then it could
> > > be more compact if the filesystem's inumbers are fairly sparse like xfs.
> > > OTOH the current inode hashtable has been around for a very long time so
> > > that might not be a big concern.  For fuse2fs it doesn't matter since
> > > ext4 inumbers are u32.
> > >
> >
> > I wanted to see if declaring one-to-one 64bit ino can simplify things
> > for the first version of inode ops passthrough.
> > If this is not the case, or if this is too much of a limitation for
> > your use case
> > then nevermind.
> > But if it is a good enough shortcut for the demo and can be extended later,
> > then why not.
>
> It's very tempting, because it's very confusing to have nodeids and
> stat st_ino not be the same thing.
>

Now that I have explained that fuse2fs should be low-level, it should be
easy to argue that it would have no problem declaring to the kernel, via a
FUSE_PASSTHROUGH_INO flag, that nodeid == st_ino, because I see no
reason to implement fuse2fs with a non one-to-one mapping of
ino <==> nodeid.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-10 10:59         ` Amir Goldstein
@ 2025-06-10 19:00           ` Darrick J. Wong
  2025-06-10 19:51             ` Amir Goldstein
  2025-06-11 11:56             ` Theodore Ts'o
  0 siblings, 2 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-06-10 19:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote:
> On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> > >  or
> > >
> > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > DO NOT MERGE THIS.
> > > > > >
> > > > > > This is the very first request for comments of a prototype to connect
> > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > > from files whose contents persist to locally attached storage devices.
> > > > > >
> > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > server process.
> > > > > >
> > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > core code.  Eeeugh.
> > > > > >
> > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > > but solving that is for the next sprint.
> > > > > >
> > > > > > With this overly simplistic RFC, I aim to show that it's possible to
> > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > > > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > > >
> > > > >
> > > > > Very cool!
> > > > >
> > > > > > There are some major warts remaining:
> > > > > >
> > > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > > support unwritten and delalloc mappings.
> > > > > >
> > > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > > >
> > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > > yet figured out how inline data is supposed to work.
> > > > > >
> > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > > inode it just read.
> > > > >
> > > > > Can you make the decision about enabling iomap on lookup?
> > > > > The plan for passthrough for inode operations was to allow
> > > > > > setting up passthrough config of inode on lookup.
> > > >
> > > > The main requirement (especially for buffered IO) is that we've set the
> > > > address space operations structure either to the regular fuse one or to
> > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > > code assumes that cannot change on a live inode.
> > > >
> > > > So I /think/ we could ask the fuse server at inode instantiation time
> > > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > > to userspace at that time.  Alternately I guess we could extend struct
> > > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > > >
> > >
> > > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > > which is in the responses of FUSE_LOOKUP,
> > > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> > > which instantiate fuse inodes.
> > >
> > > There is a very hand wavy discussion about this at:
> > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> > >
> > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > > command that uses the variable length file handle instead of nodeid
> > > as a key for the inode.
> > >
> > > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > > look at the gritty details of how best to extend all the relevant commands,
> > > so I hope I am not sending you down the wrong path.
> >
> > I found another twist to this story: the upper level libfuse3 library
> > assigns distinct nodeids for each directory entry.  These nodeids are
> > passed into the kernel and appear to be the basis for an iget5_locked call.
> > IOWs, each nodeid causes a struct fuse_inode to be created in the
> > kernel.
> >
> > For a single-linked file this is no big deal, but for a hardlink this
> > makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> > map to multiple kernel fuse_inode objects.  This /really/ breaks the
> > locking model of iomap, which assumes that there's one in-kernel inode
> > and that it can use i_rwsem to synchronize updates.
> >
> > So I'm going to have to find a way to deal with this.  I tried trivially
> > messing with libfuse nodeid assignment but that blew some assertion.
> > Maybe your LOOKUP_HANDLE thing would work.
> >
> 
> Pull the emergency brake!
> 
> In an amateur move, I did not look at fuse2fs.c before commenting on your
> work.
> 
> High level fuse interface is not the right tool for the job.
> It's not even the easiest way to have written fuse2fs in the first place.

At the time I thought it would minimize friction across multiple
operating systems' fuse implementations.

> High-level fuse API addresses file system objects with full paths.
> This is good for writing simple virtual filesystems, but it is neither the
> correct nor the easiest choice for writing a userspace driver for ext4.

Agreed, it's a *terrible* way to implement ext4.

I think, however, that Ted would like to maintain compatibility with
macfuse and freebsd(?) so he's been resistant to rewriting the entire
program to work with the lowlevel library.

That said, I decided just now to do some spelunking into those two fuse
ports and have discovered that freebsd[1] packages the same upstream
libfuse as linux, and macfuse[2] seems to vendor both libfuse 2 and 3.

[1] https://wiki.freebsd.org/FUSEFS
[2] https://github.com/macfuse/macfuse

Seeing as Debian 13 has killed off libfuse2 entirely, maybe I should
think about rewriting all of fuse2fs against the lowlevel library?  It's
really annoying to deal with all the problems of the current codebase.
I think I'll try to stabilize the current fuse+iomap code and then look
into a fuse2fs port.  What would we call it, fuse4fs? :D

> Low-level fuse interface addresses filesystem objects by nodeid
> and requires the server to implement lookup(parent_nodeid, name)
> where the server gets to choose the nodeid (not libfuse).

Does the nodeid for the root directory have to be FUSE_ROOT_ID?  I guess
for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
which cannot be accessed from userspace anyway.
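
Assuming the answer is yes, one way to square that with nodeid == ino
might look like the sketch below.  The constant values are the ones I
believe the real headers use (FUSE_ROOT_ID is 1, the ext2 bad blocks
inode is 1, the ext2 root inode is 2), but the TOY_ names and both
helper functions are hypothetical, and 0 is used here as an "invalid"
return value purely for the sake of the example.

```c
#include <stdint.h>

#define TOY_FUSE_ROOT_ID	1	/* nodeid the kernel uses for the root */
#define TOY_EXT2_BAD_INO	1	/* ext2 bad blocks inode */
#define TOY_EXT2_ROOT_INO	2	/* ext2 root directory inode */

/*
 * Map an ext2 inode number to the nodeid handed to the kernel.
 * The root directory takes over nodeid 1; everything else is identity.
 * Returns 0 for inodes that must never be exposed.
 */
uint64_t toy_ino_to_nodeid(uint32_t ino)
{
	if (ino == 0 || ino == TOY_EXT2_BAD_INO)
		return 0;
	if (ino == TOY_EXT2_ROOT_INO)
		return TOY_FUSE_ROOT_ID;
	return ino;
}

/*
 * Map a nodeid coming back from the kernel to an ext2 inode number.
 * Nodeid 2 is never handed out under this scheme, so it is rejected.
 * Returns 0 for nodeids that do not correspond to any exposed inode.
 */
uint32_t toy_nodeid_to_ino(uint64_t nodeid)
{
	if (nodeid == TOY_FUSE_ROOT_ID)
		return TOY_EXT2_ROOT_INO;
	if (nodeid == 0 || nodeid == TOY_EXT2_ROOT_INO || nodeid > UINT32_MAX)
		return 0;
	return (uint32_t)nodeid;
}
```

The only inode whose nodeid differs from its inode number is the root
directory, and the only inode number that loses its slot is #1, which
is not reachable from userspace anyway.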

> The current fuse2fs code has to go to some effort to convert a full path
> to inode + name using ext2fs_namei().
> 
> With the low-level fuse API, op_lookup() could have used the native
> ext2_lookup(), which would have been much more natural.
> 
> You can find the most featureful low-level fuse example at:
> https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc
> 
> Among other things, the server has an inode cache, where an inode
> has in its state 'nopen' (was this inode opened for io) and 'backing_id'
> (was this inode mapped for kernel passthrough).
> 
> Currently this backing_id mapping is only made on first open of inode,
> but the plan is to do that also at lookup time, for example, if the
> iomap mode for the inode can be determined at lookup time.

<nod>

> > > > > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > > > > actually works correctly.
> > > > > >
> > > > > > 6. iomap is an inode-based service, not a file-based service.  This
> > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > notifications work properly.
> > > > > >
> > > > >
> > > > > Again, I might be missing something, but as long as the fuse filesystem
> > > > > is exposing a single backing filesystem, it should be possible to make
> > > > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > > > > inode number.
> > > > > See sketch in this WIP branch:
> > > > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
> > > >
> > > > I think this would work in many places, except for filesystems with
> > > > 64-bit inumbers on 32-bit machines.  That might be a good argument for
> > > > continuing to pass along the nodeid and fuse_inode::orig_ino like it
> > > > does now.  Plus there are some filesystems that synthesize inode numbers
> > > > so tying the two together might not be feasible/desirable anyway.
> > > >
> > > > Though one nice feature of letting fuse have its own nodeids might be
> > > > that if the in-memory index switches to a tree structure, then it could
> > > > be more compact if the filesystem's inumbers are fairly sparse like xfs.
> > > > OTOH the current inode hashtable has been around for a very long time so
> > > > that might not be a big concern.  For fuse2fs it doesn't matter since
> > > > ext4 inumbers are u32.
> > > >
> > >
> > > I wanted to see if declaring one-to-one 64bit ino can simplify things
> > > for the first version of inode ops passthrough.
> > > If this is not the case, or if this is too much of a limitation for
> > > your use case
> > > then nevermind.
> > > But if it is a good enough shortcut for the demo and can be extended later,
> > > then why not.
> >
> > It's very tempting, because it's very confusing to have nodeids and
> > stat st_ino not be the same thing.
> >
> 
> Now that I have explained that fuse2fs should be low-level, it should be
> easy to argue that it would have no problem declaring to the kernel, via a
> FUSE_PASSTHROUGH_INO flag, that nodeid == st_ino, because I see no
> reason to implement fuse2fs with a non one-to-one mapping of
> ino <==> nodeid.

Agreed!  Thanks for the nudge!

Let's see what Ted thinks when he returns from vacation. :)

--D

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-10 19:00           ` Darrick J. Wong
@ 2025-06-10 19:51             ` Amir Goldstein
  2025-06-11  6:00               ` Darrick J. Wong
  2025-06-11 11:56             ` Theodore Ts'o
  1 sibling, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2025-06-10 19:51 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Tue, Jun 10, 2025 at 9:00 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote:
> > On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> > > >  or
> > > >
> > > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > DO NOT MERGE THIS.
> > > > > > >
> > > > > > > This is the very first request for comments of a prototype to connect
> > > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > > > from files whose contents persist to locally attached storage devices.
> > > > > > >
> > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > server process.
> > > > > > >
> > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > core code.  Eeeugh.
> > > > > > >
> > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > > > but solving that is for the next sprint.
> > > > > > >
> > > > > > > With this overly simplistic RFC, I aim to show that it's possible to
> > > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > > > > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > > > >
> > > > > >
> > > > > > Very cool!
> > > > > >
> > > > > > > There are some major warts remaining:
> > > > > > >
> > > > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > > > support unwritten and delalloc mappings.
> > > > > > >
> > > > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > > > >
> > > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > > > yet figured out how inline data is supposed to work.
> > > > > > >
> > > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > > > inode it just read.
> > > > > >
> > > > > > Can you make the decision about enabling iomap on lookup?
> > > > > > The plan for passthrough for inode operations was to allow
> > > > > > setting up passthrough config of inode on lookup.
> > > > >
> > > > > The main requirement (especially for buffered IO) is that we've set the
> > > > > address space operations structure either to the regular fuse one or to
> > > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > > > code assumes that cannot change on a live inode.
> > > > >
> > > > > So I /think/ we could ask the fuse server at inode instantiation time
> > > > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > > > to userspace at that time.  Alternately I guess we could extend struct
> > > > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > > > >
> > > >
> > > > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > > > which is in the responses of FUSE_LOOKUP,
> > > > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> > > > which instantiate fuse inodes.
> > > >
> > > > There is a very hand wavy discussion about this at:
> > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> > > >
> > > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > > > command that uses the variable length file handle instead of nodeid
> > > > as a key for the inode.
> > > >
> > > > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > > > look at the gritty details of how best to extend all the relevant commands,
> > > > so I hope I am not sending you down the wrong path.
> > >
> > > I found another twist to this story: the upper level libfuse3 library
> > > assigns distinct nodeids for each directory entry.  These nodeids are
> > > passed into the kernel and appear to be the basis for an iget5_locked call.
> > > IOWs, each nodeid causes a struct fuse_inode to be created in the
> > > kernel.
> > >
> > > For a single-linked file this is no big deal, but for a hardlink this
> > > makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> > > map to multiple kernel fuse_inode objects.  This /really/ breaks the
> > > locking model of iomap, which assumes that there's one in-kernel inode
> > > and that it can use i_rwsem to synchronize updates.
> > >
> > > So I'm going to have to find a way to deal with this.  I tried trivially
> > > messing with libfuse nodeid assignment but that blew some assertion.
> > > Maybe your LOOKUP_HANDLE thing would work.
> > >
> >
> > Pull the emergency brake!
> >
> > In an amateur move, I did not look at fuse2fs.c before commenting on your
> > work.
> >
> > High level fuse interface is not the right tool for the job.
> > It's not even the easiest way to have written fuse2fs in the first place.
>
> At the time I thought it would minimize friction across multiple
> operating systems' fuse implementations.
>
> > High-level fuse API addresses file system objects with full paths.
> > This is good for writing simple virtual filesystems, but it is neither the
> > correct nor the easiest choice for writing a userspace driver for ext4.
>
> Agreed, it's a *terrible* way to implement ext4.
>
> I think, however, that Ted would like to maintain compatibility with
> macfuse and freebsd(?) so he's been resistant to rewriting the entire
> program to work with the lowlevel library.
>
> That said, I decided just now to do some spelunking into those two fuse
> ports and have discovered that freebsd[1] packages the same upstream
> libfuse as linux, and macfuse[2] seems to vendor both libfuse 2 and 3.
>
> [1] https://wiki.freebsd.org/FUSEFS
> [2] https://github.com/macfuse/macfuse
>
> Seeing as Debian 13 has killed off libfuse2 entirely, maybe I should
> think about rewriting all of fuse2fs against the lowlevel library?  It's
> really annoying to deal with all the problems of the current codebase.
> I think I'll try to stabilize the current fuse+iomap code and then look
> into a fuse2fs port.  What would we call it, fuse4fs? :D
>
> > Low-level fuse interface addresses filesystem objects by nodeid
> > and requires the server to implement lookup(parent_nodeid, name)
> > where the server gets to choose the nodeid (not libfuse).
>
> Does the nodeid for the root directory have to be FUSE_ROOT_ID?

Yeh, I think that's the case, otherwise FUSE_INIT would need to
tell the kernel the root nodeid, because there is no lookup to
return the root nodeid.

> I guess
> for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> which cannot be accessed from userspace anyway.
>

As long as inode #1 is reserved it should be fine.
We just need to refine the rules of the one-to-one mapping with
this exception.
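
A minimal sketch of what that refined rule could look like, assuming the
server uses the ext2 inode number as the nodeid everywhere except for the
root directory.  The helper names are hypothetical and are not taken from
fuse2fs or libfuse; only the numeric values mirror the real constants
FUSE_ROOT_ID (1), EXT2_BAD_INO (1) and EXT2_ROOT_INO (2).

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the real values of FUSE_ROOT_ID, EXT2_BAD_INO, EXT2_ROOT_INO. */
#define SKETCH_FUSE_ROOT_ID	1ULL
#define SKETCH_EXT2_BAD_INO	1U
#define SKETCH_EXT2_ROOT_INO	2U

/* Hypothetical helper: ext2 inode number -> fuse nodeid; 0 means "never exposed". */
static uint64_t ino_to_nodeid(uint32_t ino)
{
	if (ino == SKETCH_EXT2_ROOT_INO)
		return SKETCH_FUSE_ROOT_ID;	/* the one exception to nodeid == ino */
	if (ino == 0 || ino == SKETCH_EXT2_BAD_INO)
		return 0;			/* reserved, not reachable from userspace */
	return ino;
}

/* Hypothetical helper: fuse nodeid -> ext2 inode number; 0 means "invalid nodeid". */
static uint32_t nodeid_to_ino(uint64_t nodeid)
{
	if (nodeid == SKETCH_FUSE_ROOT_ID)
		return SKETCH_EXT2_ROOT_INO;
	/* nodeid 2 is never handed out because inode 2 is presented as nodeid 1. */
	if (nodeid == 0 || nodeid == SKETCH_EXT2_ROOT_INO || nodeid > UINT32_MAX)
		return 0;
	return (uint32_t)nodeid;
}

/* Sanity check: every inode that is exposed must survive a round trip. */
static void check_round_trip(uint32_t ino)
{
	uint64_t nodeid = ino_to_nodeid(ino);

	if (nodeid != 0)
		assert(nodeid_to_ino(nodeid) == ino);
}
```

With this rule each ext2 inode has exactly one nodeid, so a hardlinked
file would show up as a single in-kernel inode no matter which directory
entry was used to look it up.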

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-10 19:51             ` Amir Goldstein
@ 2025-06-11  6:00               ` Darrick J. Wong
  2025-06-11  8:54                 ` Amir Goldstein
  0 siblings, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-06-11  6:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

On Tue, Jun 10, 2025 at 09:51:55PM +0200, Amir Goldstein wrote:
> On Tue, Jun 10, 2025 at 9:00 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote:
> > > On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> > > > >  or
> > > > >
> > > > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > DO NOT MERGE THIS.
> > > > > > > >
> > > > > > > > This is the very first request for comments of a prototype to connect
> > > > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > > > > from files whose contents persist to locally attached storage devices.
> > > > > > > >
> > > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > > server process.
> > > > > > > >
> > > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > > core code.  Eeeugh.
> > > > > > > >
> > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > > > > but solving that is for the next sprint.
> > > > > > > >
> > > > > > > > With this overly simplistic RFC, I aim to show that it's possible to
> > > > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > > > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > > > > > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > > > > >
> > > > > > >
> > > > > > > Very cool!
> > > > > > >
> > > > > > > > There are some major warts remaining:
> > > > > > > >
> > > > > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > > > > support unwritten and delalloc mappings.
> > > > > > > >
> > > > > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > > > > >
> > > > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > > > > yet figured out how inline data is supposed to work.
> > > > > > > >
> > > > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > > > > inode it just read.
> > > > > > >
> > > > > > > Can you make the decision about enabling iomap on lookup?
> > > > > > > The plan for passthrough for inode operations was to allow
> > > > > > > setting up passthrough config of inode on lookup.
> > > > > >
> > > > > > The main requirement (especially for buffered IO) is that we've set the
> > > > > > address space operations structure either to the regular fuse one or to
> > > > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > > > > code assumes that cannot change on a live inode.
> > > > > >
> > > > > > So I /think/ we could ask the fuse server at inode instantiation time
> > > > > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > > > > to userspace at that time.  Alternately I guess we could extend struct
> > > > > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > > > > >
> > > > >
> > > > > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > > > > which is in the responses of FUSE_LOOKUP,
> > > > > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> > > > > which instantiate fuse inodes.
> > > > >
> > > > > There is a very hand wavy discussion about this at:
> > > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> > > > >
> > > > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > > > > command that uses the variable length file handle instead of nodeid
> > > > > as a key for the inode.
> > > > >
> > > > > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > > > > look at the gritty details of how best to extend all the relevant commands,
> > > > > so I hope I am not sending you down the wrong path.
> > > >
> > > > I found another twist to this story: the upper level libfuse3 library
> > > > assigns distinct nodeids for each directory entry.  These nodeids are
> > > > passed into the kernel and appear to be the basis for an iget5_locked call.
> > > > IOWs, each nodeid causes a struct fuse_inode to be created in the
> > > > kernel.
> > > >
> > > > For a single-linked file this is no big deal, but for a hardlink this
> > > > makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> > > > map to multiple kernel fuse_inode objects.  This /really/ breaks the
> > > > locking model of iomap, which assumes that there's one in-kernel inode
> > > > and that it can use i_rwsem to synchronize updates.
> > > >
> > > > So I'm going to have to find a way to deal with this.  I tried trivially
> > > > messing with libfuse nodeid assignment but that blew some assertion.
> > > > Maybe your LOOKUP_HANDLE thing would work.
> > > >
> > >
> > > Pull the emergency brake!
> > >
> > > In an amateur move, I did not look at fuse2fs.c before commenting on your
> > > work.
> > >
> > > High level fuse interface is not the right tool for the job.
> > > It's not even the easiest way to have written fuse2fs in the first place.
> >
> > At the time I thought it would minimize friction across multiple
> > operating systems' fuse implementations.
> >
> > > High-level fuse API addresses file system objects with full paths.
> > > This is good for writing simple virtual filesystems, but it is neither the
> > > correct nor the easiest choice for writing a userspace driver for ext4.
> >
> > Agreed, it's a *terrible* way to implement ext4.
> >
> > I think, however, that Ted would like to maintain compatibility with
> > macfuse and freebsd(?) so he's been resistant to rewriting the entire
> > program to work with the lowlevel library.
> >
> > That said, I decided just now to do some spelunking into those two fuse
> > ports and have discovered that freebsd[1] packages the same upstream
> > libfuse as linux, and macfuse[2] seems to vendor both libfuse 2 and 3.
> >
> > [1] https://wiki.freebsd.org/FUSEFS
> > [2] https://github.com/macfuse/macfuse
> >
> > Seeing as Debian 13 has killed off libfuse2 entirely, maybe I should
> > think about rewriting all of fuse2fs against the lowlevel library?  It's
> > really annoying to deal with all the problems of the current codebase.
> > I think I'll try to stabilize the current fuse+iomap code and then look
> > into a fuse2fs port.  What would we call it, fuse4fs? :D
> >
> > > Low-level fuse interface addresses filesystem objects by nodeid
> > > and requires the server to implement lookup(parent_nodeid, name)
> > > where the server gets to choose the nodeid (not libfuse).
> >
> > Does the nodeid for the root directory have to be FUSE_ROOT_ID?
> 
> Yeh, I think that's the case, otherwise FUSE_INIT would need to
> tell the kernel the root nodeid, because there is no lookup to
> return the root nodeid.
> 
> > I guess
> > for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> > which cannot be accessed from userspace anyway.
> >
> 
> As long as inode #1 is reserved it should be fine.
> just need to refine the rules of the one-to-one mapping with
> this exception.

Or just make it so that passthrough_ino filesystems can specify the
rootdir inumber?

--D

> Thanks,
> Amir.
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-11  6:00               ` Darrick J. Wong
@ 2025-06-11  8:54                 ` Amir Goldstein
  2025-06-12  5:54                   ` Miklos Szeredi
  0 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2025-06-11  8:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

> > > Does the nodeid for the root directory have to be FUSE_ROOT_ID?
> >
> > Yeh, I think that's the case, otherwise FUSE_INIT would need to
> > tell the kernel the root nodeid, because there is no lookup to
> > return the root nodeid.
> >
> > > I guess
> > > for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> > > which cannot be accessed from userspace anyway.
> > >
> >
> > As long as inode #1 is reserved it should be fine.
> > just need to refine the rules of the one-to-one mapping with
> > this exception.
>
> Or just make it so that passthrough_ino filesystems can specify the
> rootdir inumber?
>

There is already a mount option 'rootmode' for st_mode of root inode
so I suppose we could add the rootino mount option.

Note that currently fuse_fill_super_common() instantiates the root inode
before negotiating FUSE_INIT with the server.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-10 19:00           ` Darrick J. Wong
  2025-06-10 19:51             ` Amir Goldstein
@ 2025-06-11 11:56             ` Theodore Ts'o
  2025-06-12  3:20               ` Darrick J. Wong
  2025-06-20  8:58               ` Allison Karlitskaya
  1 sibling, 2 replies; 82+ messages in thread
From: Theodore Ts'o @ 2025-06-11 11:56 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Allison Karlitskaya

+Allison Karlitskaya

On Tue, Jun 10, 2025 at 12:00:26PM -0700, Darrick J. Wong wrote:
> > High level fuse interface is not the right tool for the job.
> > It's not even the easiest way to have written fuse2fs in the first place.
> 
> At the time I thought it would minimize friction across multiple
> operating systems' fuse implementations.
> 
> > High-level fuse API addresses file system objects with full paths.
> > This is good for writing simple virtual filesystems, but it is neither the
> > correct nor the easiest choice for writing a userspace driver for ext4.
> 
> Agreed, it's a *terrible* way to implement ext4.
> 
> I think, however, that Ted would like to maintain compatibility with
> macfuse and freebsd(?) so he's been resistant to rewriting the entire
> program to work with the lowlevel library.

My priority is to make sure that we have compatibility with other OS's
(in particular MacOS, FreeBSD, if possible Windows, although that's
not something that I develop against or have test vehicles to
validate).  However, from what I can tell, they all support Fuse3 at
this point --- MacFuse, FreeBSD, and WinFSP all have Fuse3 support as
of today.

The only complaint that I've had about breaking support using Fuse2
was from Allison (Cc'ed), who was involved with another Github
project, whose Github Actions break because they were using a very old
version of Ubuntu (LTS 20.04), which only had support for libfuse2.  I
am going to assume that this is probably only because they hadn't
bothered to update their .github/workflows/ci.yaml file, and not
because there was any inherent requirement that we support ancient
versions of Linux distributions.  (When I was at IBM, I remember
having to support customers who used RHEL4, and even in one extreme
case, RHEL3 because there was a customer paying $$$$$ that refused to
update; but that was well over a decade ago, and at this point, I'm
finding it a lot harder to care about that.  :-)

My plan is that after I release 1.47.2 (which will have some
interesting data corruption bugfixes thanks to Darrick and other users
using fuse2fs in deadly earnest, as opposed to as a lightweight way to
copy files in and out of a file system image), I plan to transition
the master and next branches for the future 1.48 release, and the
maint branch will have bug fixes for 1.47.N releases.

At that point, unless I hear some very strong arguments against, for
1.48, my current thinking is that we will drop support for Fuse2.  I
will still care about making sure that fuse2fs will build and work
well enough that casual file copies work on MacOS and FreeBSD, and
I'll accept patches that make fuse2fs work with WinFSP.  In practice,
this means that Linux-specific things like Verity support will need to
be #ifdef'ed so that they will build against MacFUSE, and I assume the
same will be true for fuseblk mode and iomap mode(?).
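
As a rough illustration of that #ifdef pattern (a sketch only, not code
from fuse2fs; the function names are hypothetical), a Linux-only feature
can be compiled to a stub elsewhere so that callers build unchanged
against MacFUSE or FreeBSD:

```c
#include <errno.h>

/* Portable path: plain file copies should work on every platform. */
static int sketch_copy_supported(void)
{
	return 0;
}

#ifdef __linux__
/* Linux-only feature (a stand-in for verity, fuseblk or iomap mode). */
static int sketch_linux_feature_supported(void)
{
	return 0;
}
#else
/* Elsewhere the same symbol exists but reports "not supported". */
static int sketch_linux_feature_supported(void)
{
	return -ENOTSUP;
}
#endif
```

Keeping the stub's signature identical means only the feature itself, and
not every call site, needs a platform check.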

This may break the github actions for composefs-rs[1], but I'm going
to assume that they can figure out a way to transition to Fuse3
(hopefully by just using a newer version of Ubuntu, but I suppose it's
possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
in any case, I don't think it makes sense to hold back fuse2fs
development just for the sake of Ubuntu Focal (LTS 20.04).  And if
necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
sound fair to you?

[1] https://github.com/containers/composefs-rs

Does anyone else have any objections to dropping Fuse2 support?  And
is that sufficient for folks to more easily support iomap mode in
fuse2fs?

Cheers,

							- Ted

P.S.  Greetings from Greenland.  :-)  (We're currently in the middle of
a cruise that started in Iceland, and will be ending in New York City
next week.)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-11 11:56             ` Theodore Ts'o
@ 2025-06-12  3:20               ` Darrick J. Wong
  2025-06-12  6:10                 ` Amir Goldstein
  2025-06-20  8:58               ` Allison Karlitskaya
  1 sibling, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-06-12  3:20 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Allison Karlitskaya

On Wed, Jun 11, 2025 at 10:56:29AM -0100, Theodore Ts'o wrote:
> +Allison Karlitskaya
> 
> On Tue, Jun 10, 2025 at 12:00:26PM -0700, Darrick J. Wong wrote:
> > > High level fuse interface is not the right tool for the job.
> > > It's not even the easiest way to have written fuse2fs in the first place.
> > 
> > At the time I thought it would minimize friction across multiple
> > operating systems' fuse implementations.
> > 
> > > High-level fuse API addresses file system objects with full paths.
> > > This is good for writing simple virtual filesystems, but it is neither the
> > > correct nor the easiest choice for writing a userspace driver for ext4.
> > 
> > Agreed, it's a *terrible* way to implement ext4.
> > 
> > I think, however, that Ted would like to maintain compatibility with
> > macfuse and freebsd(?) so he's been resistant to rewriting the entire
> > program to work with the lowlevel library.
> 
> My priority is to make sure that we have compatibility with other OS's
> (in particular MacOS, FreeBSD, if possible Windows, although that's
> not something that I develop against or have test vehicles to
> validate).  However, from what I can tell, they all support Fuse3 at
> this point --- MacFuse, FreeBSD, and WinFSP all have Fuse3 support as
> of today.
> 
> The only complaint that I've had about breaking support using Fuse2
> was from Allison (Cc'ed), who was involved with another Github
> project, whose Github Actions break because they were using a very old
> version of Ubuntu (LTS 20.04), which only had support for libfuse2.  I
> am going to assume that this is probably only because they hadn't
> bothered to update their .github/workflows/ci.yaml file, and not
> because there was any inherent requirement that we support ancient
> versions of Linux distributions.  (When I was at IBM, I remember
> having to support customers who used RHEL4, and even in one extreme
> case, RHEL3 because there was a customer paying $$$$$ that refused to
> update; but that was well over a decade ago, and at this point, I'm
> finding it a lot harder to care about that.  :-)
> 
> My plan is that after I release 1.47.2 (which will have some
> interesting data corruption bugfixes thanks to Darrick and other users
> using fuse2fs in deadly earnest, as opposed to as a lightweight way to
> copy files in and out of a file system image), I plan to transition
> the master and next branches for the future 1.48 release, and the
> maint branch will have bug fixes for 1.47.N releases.
> 
> At that point, unless I hear some very strong arguments against, for
> 1.48, my current thinking is that we will drop support for Fuse2.  I
> will still care about making sure that fuse2fs will build and work
> well enough that casual file copies work on MacOS and FreeBSD, and
> I'll accept patches that make fuse2fs work with WinFSP.  In practice,
> this means that Linux-specific things like Verity support will need to
> be #ifdef'ed so that they will build against MacFUSE, and I assume the
> same will be true for fuseblk mode and iomap mode(?).

<nod> I might just drop fuseblk mode since it's unusable for
unprivileged userspace and regular files; and is a real pain even for
"I'm pretending to be the kernel" mode.

> This may break the github actions for composefs-rs[1], but I'm going
> to assume that they can figure out a way to transition to Fuse3
> (hopefully by just using a newer version of Ubuntu, but I suppose it's
> possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> in any case, I don't think it makes sense to hold back fuse2fs
> development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> sound fair to you?
> 
> [1] https://github.com/containers/composefs-rs
> 
> Does anyone else have any objections to dropping Fuse2 support?  And
> is that sufficient for folks to more easily support iomap mode in
> fuse2fs?

I don't have any objections to cleaning the fuse2 crud out of fuse2fs.

I /do/ worry that rewriting fuse2fs to target the lowlevel fuse3 library
instead of the highlevel one is going to break the !linux platforms.
Although I *think* macfuse and freebsd fuse actually support the
lowlevel library and will be ok, I do worry that we might lose windows
support.  I can't tell if winfsp or dokan are what you're supposed to
use there, but afaict neither of them support the lowlevel interface.

That said, I could just fork fuse2fs and make the fork ("fuse4fs") talk
to the lowlevel library, and we can see what happens when/if people try
to build it on those platforms.

(Though again I have zero capacity to build macos or windows programs...)

TBH it might be a huge relief to just start with a new fuse4fs codebase
where I can focus on making iomap the single IO path that works really
well, rather than try to support the existing one.  There's a lot of IO
manager changes in the fuse2fs+iomap prototype that I think just go away
if you don't need to support doing the file IO yourself.

Any code that's shareable between fuse[24]fs should of course get split
out, which should ease the maintenance burden of having two fuse
servers.  Most of fuse2fs' "smarts" are just calling libext2fs anyway.
Maybe someday we can pull an egcs. :P

> Cheers,
> 
> 							- Ted
> 
> P.S.  Greetings from Greenland.  :-)  (We're currently in the middle of
> a cruise that started in Iceland, and will be ending in New York City
> next week.)

Heh, enjoy your cruise!!

--D

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-11  8:54                 ` Amir Goldstein
@ 2025-06-12  5:54                   ` Miklos Szeredi
  2025-06-13 17:44                     ` Darrick J. Wong
  0 siblings, 1 reply; 82+ messages in thread
From: Miklos Szeredi @ 2025-06-12  5:54 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, linux-fsdevel, John, bernd, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o

On Wed, 11 Jun 2025 at 10:54, Amir Goldstein <amir73il@gmail.com> wrote:

> There is already a mount option 'rootmode' for st_mode of root inode
> so I suppose we could add the rootino mount option.
>
> Note that currently fuse_fill_super_common() instantiates the root inode
> before negotiating FUSE_INIT with the server.

I'd prefer not to add more mount options like this.

It would be nice to move away from async FUSE_INIT.  It's one of those
things I wish I'd done differently.

Unfortunately I don't think adding FUSE_INIT_SYNC would be sufficient,
as servers might expect the first request to be always FUSE_INIT and
break if it isn't.  Libfuse seems to be okay, but...

One idea is to add an ioctl that the server would call before
mounting, that explicitly allows FUSE_INIT_SYNC.  It's somewhat ugly,
but I can't think of a better solution.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-12  3:20               ` Darrick J. Wong
@ 2025-06-12  6:10                 ` Amir Goldstein
  0 siblings, 0 replies; 82+ messages in thread
From: Amir Goldstein @ 2025-06-12  6:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Theodore Ts'o, linux-fsdevel, John, bernd, miklos,
	joannelkoong, Josef Bacik, linux-ext4, Allison Karlitskaya

On Thu, Jun 12, 2025 at 5:20 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jun 11, 2025 at 10:56:29AM -0100, Theodore Ts'o wrote:
> > +Allison Karlitskaya
> >
> > On Tue, Jun 10, 2025 at 12:00:26PM -0700, Darrick J. Wong wrote:
> > > > High level fuse interface is not the right tool for the job.
> > > > It's not even the easiest way to have written fuse2fs in the first place.
> > >
> > > At the time I thought it would minimize friction across multiple
> > > operating systems' fuse implementations.
> > >
> > > > High-level fuse API addresses file system objects with full paths.
> > > > This is good for writing simple virtual filesystems, but it is neither the
> > > > correct nor the easiest choice for writing a userspace driver for ext4.
> > >
> > > Agreed, it's a *terrible* way to implement ext4.
> > >
> > > I think, however, that Ted would like to maintain compatibility with
> > > macfuse and freebsd(?) so he's been resistant to rewriting the entire
> > > program to work with the lowlevel library.
> >
> > My priority is to make sure that we have compatibility with other OS's
> > (in particular MacOS, FreeBSD, if possible Windows, although that's
> > not something that I develop against or have test vehicles to
> > validate).  However, from what I can tell, they all support Fuse3 at
> > this point --- MacFuse, FreeBSD, and WinFSP all have Fuse3 support as
> > of today.
> >
> > The only complaint that I've had about breaking support using Fuse2
> > was from Allison (Cc'ed), who was involved with another Github
> > project, whose Github Actions break because they were using a very old
> > version of Ubuntu (LTS 20.04), which only had support for libfuse2.  I
> > am going to assume that this is probably only because they hadn't
> > bothered to update their .github/workflows/ci.yaml file, and not
> > because there was any inherent requirement that we support ancient
> > versions of Linux distributions.  (When I was at IBM, I remember
> > having to support customers who used RHEL4, and even in one extreme
> > case, RHEL3 because there was a customer paying $$$$$ that refused to
> > update; but that was well over a decade ago, and at this point, I'm
> > finding it a lot harder to care about that.  :-)
> >
> > My plan is that after I release 1.47.2 (which will have some
> > interesting data corruption bugfixes thanks to Darrick and other users
> > using fuse2fs in deadly earnest, as opposed to as a lightweight way to
> > copy files in and out of a file system image), I plan to transition
> > the master and next branches for the future 1.48 release, and the
> > maint branch will have bug fixes for 1.47.N releases.
> >
> > At that point, unless I hear some very strong arguments against, for
> > 1.48, my current thinking is that we will drop support for Fuse2.  I
> > will still care about making sure that fuse2fs will build and work
> > well enough that casual file copies work on MacOS and FreeBSD, and
> > I'll accept patches that make fuse2fs work with WinFSP.  In practice,
> > this means that Linux-specific things like Verity support will need to
> > be #ifdef'ed so that they will build against MacFUSE, and I assume the
> > same will be true for fuseblk mode and iomap mode(?).
>
> <nod> I might just drop fuseblk mode since it's unusable for
> unprivileged userspace and regular files; and is a real pain even for
> "I'm pretending to be the kernel" mode.
>
> > This may break the github actions for composefs-rs[1], but I'm going
> > to assume that they can figure out a way to transition to Fuse3
> > (hopefully by just using a newer version of Ubuntu, but I suppose it's
> > possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> > in any case, I don't think it makes sense to hold back fuse2fs
> > development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> > necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> > they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> > sound fair to you?
> >
> > [1] https://github.com/containers/composefs-rs
> >
> > Does anyone else have any objections to dropping Fuse2 support?  And
> > is that sufficient for folks to more easily support iomap mode in
> > fuse2fs?
>
> I don't have any objections to cleaning the fuse2 crud out of fuse2fs.
>
> I /do/ worry that rewriting fuse2fs to target the lowlevel fuse3 library
> instead of the highlevel one is going to break the !linux platforms.
> Although I *think* macfuse and freebsd fuse actually support the
> lowlevel library and will be ok, I do worry that we might lose windows
> support.  I can't tell if winfsp or dokan are what you're supposed to
> use there, but afaict neither of them support the lowlevel interface.
>
> That said, I could just fork fuse2fs and make the fork ("fuse4fs") talk
> to the lowlevel library, and we can see what happens when/if people try
> to build it on those platforms.
>
> (Though again I have zero capacity to build macos or windows programs...)
>
> TBH it might be a huge relief to just start with a new fuse4fs codebase
> where I can focus on making iomap the single IO path that works really
> well, rather than try to support the existing one.  There's a lot of IO
> manager changes in the fuse2fs+iomap prototype that I think just go away
> if you don't need to support doing the file IO yourself.
>
> Any code that's shareable between fuse[24]fs should of course get split
> out, which should ease the maintenance burden of having two fuse
> servers.  Most of fuse2fs' "smarts" are just calling libext2fs anyway.

That seems like a good way to focus your energy on the important
goals. I like it.

Thanks,
Amir.


* [RFC[RAP] V2] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
  2025-05-29 16:45   ` Darrick J. Wong
@ 2025-06-13 17:37   ` Darrick J. Wong
  2025-06-23 13:16     ` Miklos Szeredi
  1 sibling, 1 reply; 82+ messages in thread
From: Darrick J. Wong @ 2025-06-13 17:37 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o, Matthew Wilcox

On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi everyone,
> >
> > DO NOT MERGE THIS.

Three weeks later, I've mostly gotten the iomap caching working.  This
is probably most exciting for John, because we were talking earlier
about uploading storage mappings to the fuse driver and this is what
I've come up with.  I'm running around trying to fix all the stuff that
doesn't quite work right.

Top of that list is timestamps and file attributes, because fuse no
longer calls the fuse server for file writes.  As a result, the kernel
inode always has the most up-to-date versions of some file attributes
(i_size, timestamps, mode) and I just want to send FUSE_SETATTR whenever
the dirty inode gets flushed.

After I get that working I'm going to have to rewrite fuse2fs (or more
likely just fork it) to be a lowlevel driver because as I've noted
elsewhere in this thread, the upper level fuse library can assign
multiple fuse nodeids for a single hardlinked inode.  The only reason
that worked for non-iomap fuse2fs is because we have a BKL and disable
all caching.

For fuse+iomap, this discrepancy between fuse nodeids and ext2 inumbers
means that iomap just plain doesn't work with hardlinks because there
are multiple struct fuse_inodes for each ondisk inode and the locking is
just broken; and the iomap callouts are per-inode, not per-file which
leads to multiple layering violations in the upper level fuse library.
Also, as Amir points out, doing path lookups on every operation is just
*slow*.
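
A rough sketch of the direction a lowlevel server could take (all of the
names below are made up for illustration; none of this is libfuse or
fuse2fs code): derive the fuse nodeid directly from the ext2 inode
number, so that every hardlink to an inode resolves to the same nodeid.
The only wrinkle is that fuse reserves nodeid 1 for the root directory
while ext2 uses inode 2 for it.  Inode 1 (the bad blocks inode) should
never be reachable through a directory, so the two slots can be swapped:

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of well-known values; the SKETCH_ names are hypothetical. */
#define SKETCH_FUSE_ROOT_ID	1ULL	/* same value as FUSE_ROOT_ID */
#define SKETCH_EXT2_BAD_INO	1U	/* same value as EXT2_BAD_INO */
#define SKETCH_EXT2_ROOT_INO	2U	/* same value as EXT2_ROOT_INO */

/*
 * Map an ext2 inode number to the nodeid reported to the kernel.
 * Every name that links to the same inode gets the same nodeid.
 */
static uint64_t sketch_ino_to_nodeid(uint32_t ino)
{
	/* Inode 0 is invalid and the bad blocks inode is never exposed. */
	assert(ino != 0 && ino != SKETCH_EXT2_BAD_INO);

	if (ino == SKETCH_EXT2_ROOT_INO)
		return SKETCH_FUSE_ROOT_ID;
	return ino;
}

/* Reverse mapping, used when the kernel sends a request for a nodeid. */
static uint32_t sketch_nodeid_to_ino(uint64_t nodeid)
{
	if (nodeid == SKETCH_FUSE_ROOT_ID)
		return SKETCH_EXT2_ROOT_INO;
	return (uint32_t)nodeid;
}
```

With a mapping like this, lookups through two hardlinked names should
hand the same nodeid back to the kernel, which is what the per-inode
iomap callouts and locking need.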

Interim branches can be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache_2025-06-13
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/libfuse.git/log/?h=fuse-iomap-cache_2025-06-13
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-cache_2025-06-13
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuse2fs_2025-06-13

(I'm not going to respam the list with patches right now because the
quality as told by fstests isn't quite where I want it to be for such a
thing.  fuse2fs+iomap passes 87% of fstests (down from 89% without
iomap) but that's still pretty bad.)

--D


* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-12  5:54                   ` Miklos Szeredi
@ 2025-06-13 17:44                     ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-06-13 17:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o

On Thu, Jun 12, 2025 at 07:54:12AM +0200, Miklos Szeredi wrote:
> On Wed, 11 Jun 2025 at 10:54, Amir Goldstein <amir73il@gmail.com> wrote:
> 
> > There is already a mount option 'rootmode' for st_mode of root inode
> > so I suppose we could add the rootino mount option.
> >
> > Note that currently fuse_fill_super_common() instantiates the root inode
> > before negotiating FUSE_INIT with the server.
> 
> I'd prefer not to add more mount options like this.
> 
> It would be nice to move away from async FUSE_INIT.  It's one of those
> things I wish I'd done differently.
> 
> Unfortunately I don't think adding FUSE_INIT_SYNC would be sufficient,
> as servers might expect the first request to be always FUSE_INIT and
> break if it isn't.   Libfuse seems to be okay, but...
> 
> One idea is to add an ioctl that the server would call before
> mounting, that explicitly allows FUSE_INIT_SYNC.  It's somewhat ugly,
> but I can't think of a better solution.

Hmm, well for iomap the fuse server kinda wants to know if the kernel is
going to accept iomap prior to initializing the filesystem, so it
wouldn't be that weird to have it set a "send INIT_SYNC" flag.

If one were to add an INIT_SYNC upcall, where would the callsite be?
Somewhere just prior to where we need to open the root file?  And would
you want to add more fields to it?  Or just use the same struct and
flags as the existing INIT call?

--D

> 
> Thanks,
> Miklos
> 


* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-11 11:56             ` Theodore Ts'o
  2025-06-12  3:20               ` Darrick J. Wong
@ 2025-06-20  8:58               ` Allison Karlitskaya
  2025-06-20 11:50                 ` Bernd Schubert
  2025-07-01  5:58                 ` Darrick J. Wong
  1 sibling, 2 replies; 82+ messages in thread
From: Allison Karlitskaya @ 2025-06-20  8:58 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Darrick J. Wong, Amir Goldstein, linux-fsdevel, John, bernd,
	miklos, joannelkoong, Josef Bacik, linux-ext4

hi Ted,

Sorry I didn't see this earlier.  I've been travelling.

On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
> This may break the github actions for composefs-rs[1], but I'm going
> to assume that they can figure out a way to transition to Fuse3
> (hopefully by just using a newer version of Ubuntu, but I suppose it's
> possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> in any case, I don't think it makes sense to hold back fuse2fs
> development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> sound fair to you?

To be honest, with a composefs-rs hat on, I don't care at all about
fuse support for ext2/3/4 (although I think it's cool that it exists).
We also use fuse in composefs-rs for unrelated reasons, but even there
we use the fuser rust crate which has a "pure rust" direct syscall
layer that no longer depends on libfuse.  Our use of e2fsprogs is
strictly related to building testing images in CI, and for that we
only use mkfs.ext4.  There's also no specific reason that we're using
old Ubuntu.  I probably just copy-pasted it from another project
without paying too much attention.

Thanks for asking, though!

lis



* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-20  8:58               ` Allison Karlitskaya
@ 2025-06-20 11:50                 ` Bernd Schubert
  2025-07-01  6:02                   ` Darrick J. Wong
  2025-07-01  5:58                 ` Darrick J. Wong
  1 sibling, 1 reply; 82+ messages in thread
From: Bernd Schubert @ 2025-06-20 11:50 UTC (permalink / raw)
  To: Allison Karlitskaya, Theodore Ts'o
  Cc: Darrick J. Wong, Amir Goldstein, linux-fsdevel, John, miklos,
	joannelkoong, Josef Bacik, linux-ext4



On 6/20/25 10:58, Allison Karlitskaya wrote:
> hi Ted,
> 
> Sorry I didn't see this earlier.  I've been travelling.
> 
> On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
>> This may break the github actions for composefs-rs[1], but I'm going
>> to assume that they can figure out a way to transition to Fuse3
>> (hopefully by just using a newer version of Ubuntu, but I suppose it's
>> possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
>> in any case, I don't think it makes sense to hold back fuse2fs
>> development just for the sake of Ubuntu Focal (LTS 20.04).  And if
>> necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
>> they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
>> sound fair to you?
> 
> To be honest, with a composefs-rs hat on, I don't care at all about
> fuse support for ext2/3/4 (although I think it's cool that it exists).
> We also use fuse in composefs-rs for unrelated reasons, but even there
> we use the fuser rust crate which has a "pure rust" direct syscall
> layer that no longer depends on libfuse.  Our use of e2fsprogs is
> strictly related to building testing images in CI, and for that we
> only use mkfs.ext4.  There's also no specific reason that we're using
> old Ubuntu.  I probably just copy-pasted it from another project
> without paying too much attention.


From libfuse point of view I'm too happy about that split into different
libraries. Libfuse already right now misses several features because
they were added to virtiofs, but not to libfuse. I need to find the time
for it, but I guess it makes sense to add rust support to libfuse (and
some parts can be entirely rewritten into rust).



Thanks,
Bernd


* Re: [RFC[RAP] V2] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-13 17:37   ` [RFC[RAP] V2] " Darrick J. Wong
@ 2025-06-23 13:16     ` Miklos Szeredi
  2025-07-01  6:05       ` Darrick J. Wong
  0 siblings, 1 reply; 82+ messages in thread
From: Miklos Szeredi @ 2025-06-23 13:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Matthew Wilcox

On Fri, 13 Jun 2025 at 19:37, Darrick J. Wong <djwong@kernel.org> wrote:

> Top of that list is timestamps and file attributes, because fuse no
> longer calls the fuse server for file writes.  As a result, the kernel
> inode always has the most up-to-date versions of some file attributes
> (i_size, timestamps, mode) and I just want to send FUSE_SETATTR whenever
> the dirty inode gets flushed.

This is already the case for cached writes, no new code should be needed.

Thanks,
Miklos


* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-20  8:58               ` Allison Karlitskaya
  2025-06-20 11:50                 ` Bernd Schubert
@ 2025-07-01  5:58                 ` Darrick J. Wong
  1 sibling, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-07-01  5:58 UTC (permalink / raw)
  To: Allison Karlitskaya
  Cc: Theodore Ts'o, Amir Goldstein, linux-fsdevel, John, bernd,
	miklos, joannelkoong, Josef Bacik, linux-ext4

On Fri, Jun 20, 2025 at 10:58:38AM +0200, Allison Karlitskaya wrote:
> hi Ted,
> 
> Sorry I didn't see this earlier.  I've been travelling.
> 
> On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
> > This may break the github actions for composefs-rs[1], but I'm going
> > to assume that they can figure out a way to transition to Fuse3
> > (hopefully by just using a newer version of Ubuntu, but I suppose it's
> > possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> > in any case, I don't think it makes sense to hold back fuse2fs
> > development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> > necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> > they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> > sound fair to you?
> 
> To be honest, with a composefs-rs hat on, I don't care at all about
> fuse support for ext2/3/4 (although I think it's cool that it exists).
> We also use fuse in composefs-rs for unrelated reasons, but even there
> we use the fuser rust crate which has a "pure rust" direct syscall

Aha, I just stumbled upon that crate.  There are ... too many things on
crates.io that claim to be fuse libraries/wrappers/etc.

It's tempting to go write fuse4fs as an iomap-only Rust server, but I
never quite got the hang of configuring cargo to link against a locally
built .so in the same source tree (i.e. when I was trying to link
xfs_healer against libhandle that ships as part of xfsprogs).  I'm not
even sure I want to explore exposing libext2fs in a Rust-safe way.

> layer that no longer depends on libfuse.  Our use of e2fsprogs is
> strictly related to building testing images in CI, and for that we
> only use mkfs.ext4.  There's also no specific reason that we're using
> old Ubuntu.  I probably just copy-pasted it from another project
> without paying too much attention.
> 
> Thanks for asking, though!

I'm glad to hear that e2fsprogs can drop fuse2 support! :)

--D

> lis
> 
> 


* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-20 11:50                 ` Bernd Schubert
@ 2025-07-01  6:02                   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-07-01  6:02 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Allison Karlitskaya, Theodore Ts'o, Amir Goldstein,
	linux-fsdevel, John, miklos, joannelkoong, Josef Bacik,
	linux-ext4

On Fri, Jun 20, 2025 at 01:50:20PM +0200, Bernd Schubert wrote:
> 
> 
> On 6/20/25 10:58, Allison Karlitskaya wrote:
> > hi Ted,
> > 
> > Sorry I didn't see this earlier.  I've been travelling.
> > 
> > On Wed, 11 Jun 2025 at 21:25, Theodore Ts'o <tytso@mit.edu> wrote:
> > > This may break the github actions for composefs-rs[1], but I'm going
> > > to assume that they can figure out a way to transition to Fuse3
> > > (hopefully by just using a newer version of Ubuntu, but I suppose it's
> > > possible that Rust bindings only exist for Fuse2, and not Fuse3).  But
> > > in any case, I don't think it makes sense to hold back fuse2fs
> > > development just for the sake of Ubuntu Focal (LTS 20.04).  And if
> > > necessary, composefs-rs can just stay back on e2fsprogs 1.47.N until
> > > they can get off of Fuse2 and/or Ubuntu 20.04.  Allison, does that
> > > sound fair to you?
> > 
> > To be honest, with a composefs-rs hat on, I don't care at all about
> > fuse support for ext2/3/4 (although I think it's cool that it exists).
> > We also use fuse in composefs-rs for unrelated reasons, but even there
> > we use the fuser rust crate which has a "pure rust" direct syscall
> > layer that no longer depends on libfuse.  Our use of e2fsprogs is
> > strictly related to building testing images in CI, and for that we
> > only use mkfs.ext4.  There's also no specific reason that we're using
> > old Ubuntu.  I probably just copy-pasted it from another project
> > without paying too much attention.
> 
> 
> From libfuse point of view I'm too happy about that split into different

"too happy"?  I would have thought you would /not/ be too happy about
splits... <confused>

> libraries. Libfuse already right now misses several features because
> they were added to virtiofs, but not to libfuse. I need to find the time
> for it, but I guess it makes sense to add rust support to libfuse (and
> some parts can be entirely rewritten into rust).

Yeah, I noticed a few missing pieces like statx and syncfs support,
which I added to my own libfuse branch (+ fuse2fs).  Eventually I'd like
to get the kernel umount code to flush and wait for all pending fuse
commands, issue a FUSE_SYNCFS and wait for that, and then issue a
FUSE_DESTROY to tell the fuse server to tear itself down and release the
block device(s) it's holding.

--D

> 
> 
> Thanks,
> Bernd
> 


* Re: [RFC[RAP] V2] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-06-23 13:16     ` Miklos Szeredi
@ 2025-07-01  6:05       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2025-07-01  6:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Matthew Wilcox

On Mon, Jun 23, 2025 at 03:16:53PM +0200, Miklos Szeredi wrote:
> On Fri, 13 Jun 2025 at 19:37, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > Top of that list is timestamps and file attributes, because fuse no
> > longer calls the fuse server for file writes.  As a result, the kernel
> > inode always has the most up-to-date versions of some file attributes
> > (i_size, timestamps, mode) and I just want to send FUSE_SETATTR whenever
> > the dirty inode gets flushed.
> 
> This is already the case for cached writes, no new code should be needed.

Are you talking about the fc->writeback_cache stuff?  Yeah, that mostly
works out for fuse2fs.  Though I was wondering, when does atime get
updated?  fs/fuse sets S_NOATIME, so I guess it's up to the fuse server
to update it when it wants to, and a later FUSE_GETATTR can pick it up?
If so, how do fuse servers implement lazytime/relatime?
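
For reference, the relatime rule I have in mind is roughly what the
kernel does for local filesystems: bump atime only if it isn't newer
than mtime or ctime, or if it's at least a day old.  A server that
wanted to apply that policy itself could do something like the sketch
below.  The function names are hypothetical and this isn't taken from
any existing fuse server:

```c
#include <stdbool.h>
#include <time.h>

/* Refresh atime anyway once it is at least this old (one day). */
#define SKETCH_RELATIME_MAX_AGE	(24 * 60 * 60)

/* Return true if timestamp a is later than or equal to timestamp b. */
static bool sketch_ts_ge(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec > b->tv_sec;
	return a->tv_nsec >= b->tv_nsec;
}

/* Decide whether a read at time 'now' should update atime. */
static bool sketch_relatime_need_update(const struct timespec *atime,
					const struct timespec *mtime,
					const struct timespec *ctime,
					const struct timespec *now)
{
	/* The file was modified at or after the last recorded access. */
	if (sketch_ts_ge(mtime, atime))
		return true;
	/* The inode was changed at or after the last recorded access. */
	if (sketch_ts_ge(ctime, atime))
		return true;
	/* The recorded access time is stale. */
	if ((long long)now->tv_sec - (long long)atime->tv_sec >=
	    SKETCH_RELATIME_MAX_AGE)
		return true;
	return false;
}
```

The open question stands, though: the server would have to run
something like this on every read it sees, and with iomap it doesn't
see the reads at all.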

--D

> Thanks,
> Miklos
> 


* Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-05-29 19:41     ` Amir Goldstein
  2025-06-09 22:31       ` Darrick J. Wong
@ 2025-07-12 10:57       ` Amir Goldstein
  1 sibling, 0 replies; 82+ messages in thread
From: Amir Goldstein @ 2025-07-12 10:57 UTC (permalink / raw)
  To: Darrick J. Wong, Bernd Schubert
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o

> On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
...
> > So I /think/ we could ask the fuse server at inode instantiation time
> > (which, if I'm reading the code correctly, is when iget5_locked gives
> > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > to userspace at that time.  Alternately I guess we could extend struct
> > fuse_attr with another FUSE_ATTR_ flag, I think?
> >
>
> The latter. Either extend fuse_attr or struct fuse_entry_out,
> which is in the responses of FUSE_LOOKUP,
> FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE.
> which instantiate fuse inodes.
>

Update:
I went to look at this extension for my inode ops passthrough patches.

What I saw is that while struct fuse_attr and struct fuse_entry_out
are designed to be extended and have been extended in the past:
 * 7.9:
 *  - add blksize field to fuse_attr

Later on, struct fuse_direntplus was introduced
 * 7.21
 *  - add FUSE_READDIRPLUS

With struct fuse_entry_out/fuse_attr embedded in the middle
and I don't see any code in the kernel/lib that is prepared to handle
a change in the FUSE_NAME_OFFSET_DIRENTPLUS constant
(maybe it's there and I am missing it)
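
To illustrate the concern with a toy layout (these are simplified
stand-in structs that I made up, not the real uapi definitions): as far
as I can tell the name offset in a direntplus record is derived from the
size of everything in front of the name, so growing the embedded
entry/attr struct moves the dirent header and the name for any parser
that was compiled against the old layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy "entry" reply before and after appending one field. */
struct sketch_entry_v1 {
	uint64_t nodeid;
	uint64_t attr_valid;
	uint32_t mode;
	uint32_t blksize;
};

struct sketch_entry_v2 {
	uint64_t nodeid;
	uint64_t attr_valid;
	uint32_t mode;
	uint32_t blksize;
	uint64_t backing_id;	/* hypothetical new field */
};

/* Toy dirent header; the name bytes follow it in the buffer. */
struct sketch_dirent {
	uint64_t ino;
	uint64_t off;
	uint32_t namelen;
	uint32_t type;
};

/* The entry is embedded in front of the dirent, as in direntplus. */
struct sketch_direntplus_v1 {
	struct sketch_entry_v1 entry;
	struct sketch_dirent dirent;
};

struct sketch_direntplus_v2 {
	struct sketch_entry_v2 entry;
	struct sketch_dirent dirent;
};

/* Offset of the first name byte within one record. */
#define SKETCH_NAME_OFFSET_V1	sizeof(struct sketch_direntplus_v1)
#define SKETCH_NAME_OFFSET_V2	sizeof(struct sketch_direntplus_v2)

/* Where a v1 parser looks for namelen versus where v2 actually puts it. */
static size_t sketch_namelen_offset_v1(void)
{
	return offsetof(struct sketch_direntplus_v1, dirent.namelen);
}

static size_t sketch_namelen_offset_v2(void)
{
	return offsetof(struct sketch_direntplus_v2, dirent.namelen);
}
```

In this toy case the two offsets differ by the size of the added field,
so a reader built against v1 would misparse namelen and every name after
it unless both sides negotiate the layout first.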

So for my own use, which only requires passing a single int backing_id
I was tempted to try and overload attr_valid{,_nsec} which are
not relevant for passthrough getattr case,
something like {attr_valid = backing_id, attr_valid_nsec = UTIME_OMIT}.
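
A sketch of that overloading idea, under the assumption that the
receiver treats attr_valid_nsec == UTIME_OMIT as a marker (UTIME_OMIT
can't be a legitimate nanoseconds value because it is larger than
999999999).  The struct and helper names are hypothetical and only carry
the two fields involved:

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>	/* UTIME_OMIT */

#ifndef UTIME_OMIT
#define UTIME_OMIT	((1l << 30) - 2l)	/* Linux value, fallback only */
#endif

/* Only the two overloaded fields of the reply, for illustration. */
struct sketch_attr_reply {
	uint64_t attr_valid;
	uint32_t attr_valid_nsec;
};

/* Server side: stash the backing id and mark the reply as overloaded. */
static void sketch_encode_backing_id(struct sketch_attr_reply *r,
				     int32_t backing_id)
{
	r->attr_valid = (uint64_t)(uint32_t)backing_id;
	r->attr_valid_nsec = (uint32_t)UTIME_OMIT;
}

/*
 * Receiver side: returns true and fills *backing_id only when the marker
 * is present; otherwise the fields keep their normal timeout meaning.
 */
static bool sketch_decode_backing_id(const struct sketch_attr_reply *r,
				     int32_t *backing_id)
{
	if (r->attr_valid_nsec != (uint32_t)UTIME_OMIT)
		return false;
	*backing_id = (int32_t)(uint32_t)r->attr_valid;
	return true;
}
```

The marker check is what keeps old-style replies working: anything with
an ordinary nanoseconds value is still read as an attribute timeout.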

In the meantime, as an example I used a hole in struct fuse_attr_out
to implement backing file setup in reply to GETATTR in the wip branch [1].

Bernd,

I wonder if I am missing something w.r.t the intended extensibility of
struct fuse_entry_out/fuse_attr and current readdirplus code?

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fuse-backing-inode-wip/


end of thread, other threads:[~2025-07-12 10:58 UTC | newest]

Thread overview: 82+ messages
2025-05-21 23:58 [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
2025-05-22  0:01 ` [PATCHSET RFC[RAP]] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-05-22  0:02   ` [PATCH 01/11] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
2025-05-29 11:08     ` Miklos Szeredi
2025-05-31  1:08       ` Darrick J. Wong
2025-06-06 13:54         ` Miklos Szeredi
2025-06-09 18:13           ` Darrick J. Wong
2025-06-09 20:29             ` Darrick J. Wong
2025-05-22  0:02   ` [PATCH 02/11] iomap: exit early when iomap_iter is called with zero length Darrick J. Wong
2025-05-22  0:03   ` [PATCH 03/11] fuse: implement the basic iomap mechanisms Darrick J. Wong
2025-05-29 22:15     ` Joanne Koong
2025-05-29 23:15       ` Joanne Koong
2025-06-03  0:13         ` Darrick J. Wong
2025-05-22  0:03   ` [PATCH 04/11] fuse: add a notification to add new iomap devices Darrick J. Wong
2025-05-22 16:46     ` Amir Goldstein
2025-05-22 17:11       ` Darrick J. Wong
2025-05-22  0:03   ` [PATCH 05/11] fuse: send FUSE_DESTROY to userspace when tearing down an iomap connection Darrick J. Wong
2025-05-22  0:04   ` [PATCH 06/11] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
2025-05-22  0:04   ` [PATCH 07/11] fuse: implement direct IO with iomap Darrick J. Wong
2025-05-22  0:04   ` [PATCH 08/11] fuse: implement buffered " Darrick J. Wong
2025-05-22  0:04   ` [PATCH 09/11] fuse: implement large folios for iomap pagecache files Darrick J. Wong
2025-05-22  0:05   ` [PATCH 10/11] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
2025-05-22  0:05   ` [PATCH 11/11] fuse: advertise support for iomap Darrick J. Wong
2025-05-22  0:01 ` [PATCHSET RFC[RAP]] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-05-22  0:05   ` [PATCH 1/8] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version Darrick J. Wong
2025-05-22  0:05   ` [PATCH 2/8] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
2025-05-22  0:06   ` [PATCH 3/8] libfuse: add upper level iomap commands Darrick J. Wong
2025-05-22  0:06   ` [PATCH 4/8] libfuse: add a notification to add a new device to iomap Darrick J. Wong
2025-05-22  0:06   ` [PATCH 5/8] libfuse: add iomap ioend low level handler Darrick J. Wong
2025-05-22  0:06   ` [PATCH 6/8] libfuse: add upper level iomap ioend commands Darrick J. Wong
2025-05-22  0:07   ` [PATCH 7/8] libfuse: add FUSE_IOMAP_PAGECACHE Darrick J. Wong
2025-05-22  0:07   ` [PATCH 8/8] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
2025-05-22  0:02 ` [PATCHSET RFC[RAP] 2/3] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
2025-05-22  0:08   ` [PATCH 01/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
2025-05-22  0:08   ` [PATCH 02/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
2025-05-22  0:09   ` [PATCH 03/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
2025-05-22  0:09   ` [PATCH 04/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
2025-05-22  0:09   ` [PATCH 05/10] libext2fs: add tagged block IO for better caching Darrick J. Wong
2025-05-22  0:09   ` [PATCH 06/10] libext2fs: add tagged block IO caching to the unix IO manager Darrick J. Wong
2025-05-22  0:10   ` [PATCH 07/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
2025-05-22  0:10   ` [PATCH 08/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
2025-05-22  0:10   ` [PATCH 09/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
2025-05-22  0:10   ` [PATCH 10/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
2025-05-22  0:02 ` [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-05-22  0:11   ` [PATCH 01/16] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
2025-05-22  0:11   ` [PATCH 02/16] fuse2fs: register block devices for use with iomap Darrick J. Wong
2025-05-22  0:11   ` [PATCH 03/16] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
2025-05-22  0:11   ` [PATCH 04/16] fuse2fs: implement directio file reads Darrick J. Wong
2025-05-22  0:12   ` [PATCH 05/16] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
2025-05-22  0:12   ` [PATCH 06/16] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
2025-05-22  0:12   ` [PATCH 07/16] fuse2fs: add extent dump function for debugging Darrick J. Wong
2025-05-22  0:12   ` [PATCH 08/16] fuse2fs: implement direct write support Darrick J. Wong
2025-05-22  0:13   ` [PATCH 09/16] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
2025-05-22  0:13   ` [PATCH 10/16] fuse2fs: flush and invalidate the buffer cache on trim Darrick J. Wong
2025-05-22  0:13   ` [PATCH 11/16] fuse2fs: improve tracing for fallocate Darrick J. Wong
2025-05-22  0:13   ` [PATCH 12/16] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
2025-05-22  0:14   ` [PATCH 13/16] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
2025-05-22  0:14   ` [PATCH 14/16] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
2025-05-22  0:14   ` [PATCH 15/16] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
2025-05-22  0:15   ` [PATCH 16/16] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
2025-05-22 16:24 ` [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 Amir Goldstein
2025-05-29 16:45   ` Darrick J. Wong
2025-05-29 19:41     ` Amir Goldstein
2025-06-09 22:31       ` Darrick J. Wong
2025-06-10 10:59         ` Amir Goldstein
2025-06-10 19:00           ` Darrick J. Wong
2025-06-10 19:51             ` Amir Goldstein
2025-06-11  6:00               ` Darrick J. Wong
2025-06-11  8:54                 ` Amir Goldstein
2025-06-12  5:54                   ` Miklos Szeredi
2025-06-13 17:44                     ` Darrick J. Wong
2025-06-11 11:56             ` Theodore Ts'o
2025-06-12  3:20               ` Darrick J. Wong
2025-06-12  6:10                 ` Amir Goldstein
2025-06-20  8:58               ` Allison Karlitskaya
2025-06-20 11:50                 ` Bernd Schubert
2025-07-01  6:02                   ` Darrick J. Wong
2025-07-01  5:58                 ` Darrick J. Wong
2025-07-12 10:57       ` Amir Goldstein
2025-06-13 17:37   ` [RFC[RAP] V2] " Darrick J. Wong
2025-06-23 13:16     ` Miklos Szeredi
2025-07-01  6:05       ` Darrick J. Wong
