[RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4

linux-fsdevel.vger.kernel.org archive mirror

* [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
@ 2025-07-17 23:10 Darrick J. Wong
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
                   ` (10 more replies)
  0 siblings, 11 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: John, bernd, miklos, joannelkoong, Josef Bacik, linux-ext4,
	Theodore Ts'o, Neal Gompa

Hi everyone,

DO NOT MERGE THIS, STILL!

This is the third request for comments of a prototype to connect the
Linux fuse driver to fs-iomap for regular file IO operations to and from
files whose contents persist to locally attached storage devices.

Why would you want to do that?  Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence.  Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.

willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code.  Eeeugh.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands.  Pagecache
writeback is now a directio write.  The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.
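
As a rough illustration of why cached mappings mean zero upcalls for
rereads and pure overwrites, here is a toy model in Python.  It is only
a sketch: MappingCache, the upcall callback, and fake_server are
hypothetical names, not the kernel's or libfuse's data structures, and
the model ignores overlapping mappings and invalidation.

```python
from bisect import bisect_right


class MappingCache:
    """Toy per-inode cache of file-offset -> device-offset mappings."""

    def __init__(self, upcall):
        self.upcall = upcall   # stands in for an iomap_begin upcall to the server
        self.starts = []       # sorted file offsets of cached mappings
        self.maps = {}         # file offset -> (length, device offset)
        self.upcalls = 0

    def upsert(self, offset, length, dev_offset):
        # Simplification: assumes the server never upserts overlapping mappings.
        if offset not in self.maps:
            self.starts.insert(bisect_right(self.starts, offset), offset)
        self.maps[offset] = (length, dev_offset)

    def lookup(self, pos):
        i = bisect_right(self.starts, pos) - 1
        if i >= 0:
            start = self.starts[i]
            length, dev = self.maps[start]
            if pos < start + length:
                return dev + (pos - start)   # cache hit: no upcall needed
        # Cache miss: ask the server, then remember the answer.
        self.upcalls += 1
        offset, length, dev = self.upcall(pos)
        self.upsert(offset, length, dev)
        return dev + (pos - offset)


def fake_server(pos):
    # Pretend each 1 MiB of the file is one extent located 16 MiB further
    # into the device.
    mib = 1 << 20
    start = pos - pos % mib
    return start, mib, start + 16 * mib


cache = MappingCache(fake_server)
first = cache.lookup(4096)   # miss: one upcall
again = cache.lookup(8192)   # hit: served from the cached 1 MiB mapping
```

After both lookups cache.upcalls is 1; the second access never leaves
the "kernel" side of the model.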

With this RFC, I am able to show that it's possible to build a fuse
server for a real filesystem (ext4) that runs entirely in userspace yet
maintains most of its performance.  At this stage I still get about 95%
of the kernel ext4 driver's streaming directio performance, and 110% of
its streaming buffered IO performance.  Random buffered
IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
fast as the kernel; see the cover letter for the fuse2fs iomap changes
for more details.  Unwritten extent conversions on random direct writes
are especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead.  And that's with debugging turned on!

These items have been addressed since the first RFC:

1. The iomap cookie validation is now present, which avoids subtle races
between pagecache zeroing and writeback on filesystems that support
unwritten and delalloc mappings.

2. Mappings can be cached in the kernel for more speed.

3. iomap supports inline data.

4. I can now turn on fuse+iomap on a per-inode basis, which turned out
to be as easy as creating a new ->getattr_iflags callback so that the
fuse server can set fuse_attr::flags.

5. statx and syncfs work on iomap filesystems.

6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
is enabled.

7. The ext4 shutdown ioctl is now supported.
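
The cookie validation in item 1 can be pictured as a generation counter
that the holder of a mapping rechecks before acting on it.  The sketch
below is a simplified model under that assumption; the Inode class and
zero_if_unwritten are hypothetical and do not mirror the actual iomap
or fuse code.

```python
class Inode:
    """Toy inode whose cookie changes whenever its extents change."""

    def __init__(self):
        self.cookie = 0
        self.extents = {0: "unwritten"}   # block number -> state

    def get_mapping(self, block):
        # Hand out the mapping together with the cookie that was current.
        return {"block": block, "state": self.extents[block],
                "cookie": self.cookie}

    def convert_unwritten(self, block):
        # e.g. writeback completion converting an unwritten extent
        self.extents[block] = "written"
        self.cookie += 1                  # invalidates outstanding mappings

    def mapping_valid(self, mapping):
        return mapping["cookie"] == self.cookie


def zero_if_unwritten(inode, mapping):
    """Return the action a pagecache zeroing path should take."""
    if not inode.mapping_valid(mapping):
        return "remap"                    # stale: fetch a fresh mapping, retry
    return "zero" if mapping["state"] == "unwritten" else "skip"


inode = Inode()
stale = inode.get_mapping(0)
inode.convert_unwritten(0)                # races with the holder of `stale`
action_stale = zero_if_unwritten(inode, stale)
action_fresh = zero_if_unwritten(inode, inode.get_mapping(0))
```

Without the cookie check, the holder of the stale mapping would zero a
block that writeback has already converted to written.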

There are some major warts remaining:

a. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.

b. iomap is an inode-based service, not a file-based service.  This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
to index its incore inode, so we have to pass those too so that
notifications work properly.  This is related to #c below:

c. Hardlinks and iomap are not possible for upper-level libfuse clients
because the upper level libfuse likes to abstract kernel nodeids with
its own homebrew dirent/inode cache, which doesn't understand hardlinks.
As a result, a hardlinked file results in two distinct struct inodes in
the kernel, which completely breaks iomap's locking model.  I will have
to rewrite fuse2fs for the lowlevel libfuse library to make this work,
but on the plus side there will be far less path lookup overhead.

d. There are too many changes to the IO manager in libext2fs because I
built things needed to stage the direct/buffered IO paths separately.
These are now unnecessary but I haven't pulled them out yet because
they're sort of useful to verify that iomap file IO never goes through
libext2fs except for inline data.

e. If we're going to use fuse servers as "safe" replacements for kernel
filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
We also need to disable the OOM killer(s) for fuse servers because you
don't want filesystems to unmount abruptly.

f. How do we maximally contain the fuse server to have safe filesystem
mounts?  It's very convenient to use systemd services to configure
isolation declaratively, but fuse2fs still needs to be able to open
/dev/fuse, the ext4 block device, and call mount() in the shared
namespace.  This prevents us from using most of the stronger systemd
protections because they tend to run in a private mount namespace with
various parts of the filesystem either hidden or readonly.

In theory one could design a socket protocol to pass mount options,
block device paths, fds, and responsibility for the mount() call between
a mount helper and a service:

e2fsprogs would define a systemd socket service for fuse2fs that sets
up a dynamic unprivileged user, no network access, and no access to the
host's filesystem aside from readonly access to the root filesystem.

The mount helper (e.g. mount.safe) would then connect to the magic
socket and pass the CLI arguments to the fuse2fs service.  The service
would parse the arguments, find the block device paths, and feed them
back through the socket to mount.safe.  mount.safe would open them and
pass fds back to the fuse2fs service.  The service would then open the
devices, parse the superblock, and if everything was ok, request a mount
through the socket.  The mount helper would then open /dev/fuse and
mount the filesystem, and if successful, pass the /dev/fuse fd through
the socket to the fuse2fs server.  At that point the fuse2fs server
would attach to the /dev/fuse device and handle the usual events.
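
The fd handoff at the heart of that exchange is ordinary SCM_RIGHTS
passing over a Unix socket.  A minimal sketch, assuming Python 3.9+ for
socket.send_fds()/socket.recv_fds(): both ends live in one process here
for brevity, a temporary file stands in for the block device, and the
b"blockdev" message is a made-up placeholder since no wire format has
been defined.

```python
import os
import socket
import tempfile

# Stand-in for the block device; a real helper would open e.g. /dev/sda1.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"fake superblock")
    dev_path = f.name

helper_sock, service_sock = socket.socketpair(socket.AF_UNIX,
                                              socket.SOCK_STREAM)

# Helper side: open the device and pass the fd along with a short message.
dev_fd = os.open(dev_path, os.O_RDONLY)
socket.send_fds(helper_sock, [b"blockdev"], [dev_fd])
os.close(dev_fd)          # the receiver gets its own duplicate of the fd

# Service side: receive the fd and read from it without opening any path.
msg, fds, _flags, _addr = socket.recv_fds(service_sock, 1024, 1)
data = os.read(fds[0], 64)

os.close(fds[0])
helper_sock.close()
service_sock.close()
os.unlink(dev_path)
```

The service never needs filesystem access to the device node; it only
ever sees the descriptor, which is what would let it run inside a
private mount namespace.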

Finally we'd have to train people/daemons to run "mount -t safe.ext4
/dev/sda1 /mnt" to get the contained version of ext4.

(Yeah, #f is all Neal. ;))

g. fuse2fs doesn't support the ext4 journal.  Urk.

I'll work on these in July/August, but for now here's an unmergeable RFC
to start some discussion.

--Darrick

* [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
@ 2025-07-17 23:23 ` Darrick J. Wong
  2025-07-17 23:26   ` [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
                     ` (6 more replies)
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                   ` (9 subsequent siblings)
  10 siblings, 7 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:23 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

Hi all,

In preparation for making fuse use the fs/iomap code for regular file
data IO, fix a few bugs in fuse and apply a couple of tweaks to iomap.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-prep
---
Commits in this patchset:
 * fuse: fix livelock in synchronous file put from fuseblk workers
 * fuse: flush pending fuse events before aborting the connection
 * fuse: capture the unique id of fuse commands being sent
 * fuse: implement file attributes mask for statx
 * iomap: exit early when iomap_iter is called with zero length
 * iomap: trace iomap_zero_iter zeroing activities
 * iomap: error out on file IO when there is no inline_data buffer
---
 fs/fuse/fuse_i.h       |    6 ++++++
 fs/iomap/trace.h       |    1 +
 fs/fuse/dev.c          |   44 +++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/dev_uring.c    |    8 +++++++-
 fs/fuse/dir.c          |    2 ++
 fs/fuse/file.c         |   10 +++++++++-
 fs/fuse/inode.c        |    1 +
 fs/iomap/buffered-io.c |   18 +++++++++++++-----
 fs/iomap/direct-io.c   |    3 +++
 fs/iomap/iter.c        |    5 ++++-
 10 files changed, 89 insertions(+), 9 deletions(-)


* [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
@ 2025-07-17 23:24 ` Darrick J. Wong
  2025-07-17 23:28   ` [PATCH 01/13] fuse: implement the basic iomap mechanisms Darrick J. Wong
                     ` (12 more replies)
  2025-07-17 23:24 ` [PATCHSET RFC v3 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
                   ` (8 subsequent siblings)
  10 siblings, 13 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:24 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

Hi all,

This series connects fuse (the userspace filesystem layer) to fs-iomap
to get fuse servers out of the business of handling file I/O themselves.
By keeping the IO path mostly within the kernel, we can dramatically
improve the speed of disk-based filesystems.  This enables us to move
all the filesystem metadata parsing code out of the kernel and into
userspace, which means that we can containerize them for security
without losing a lot of performance.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap
---
Commits in this patchset:
 * fuse: implement the basic iomap mechanisms
 * fuse: add an ioctl to add new iomap devices
 * fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
 * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
 * fuse: implement direct IO with iomap
 * fuse: implement buffered IO with iomap
 * fuse: enable caching of timestamps
 * fuse: implement large folios for iomap pagecache files
 * fuse: use an unrestricted backing device with iomap pagecache io
 * fuse: advertise support for iomap
 * fuse: query filesystem geometry when using iomap
 * fuse: implement fadvise for iomap files
 * fuse: implement inline data file IO via iomap
---
 fs/fuse/fuse_i.h          |  164 ++++
 fs/fuse/fuse_trace.h      | 1167 ++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |  174 ++++
 fs/fuse/Kconfig           |   24 +
 fs/fuse/Makefile          |    1 
 fs/fuse/dev.c             |   23 +
 fs/fuse/dir.c             |   34 +
 fs/fuse/file.c            |  138 +++
 fs/fuse/file_iomap.c      | 2019 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c           |   53 +
 10 files changed, 3761 insertions(+), 36 deletions(-)
 create mode 100644 fs/fuse/file_iomap.c


* [PATCHSET RFC v3 3/4] fuse: cache iomap mappings for even better file IO performance
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-07-17 23:24 ` Darrick J. Wong
  2025-07-17 23:31   ` [PATCH 1/4] fuse: cache iomaps Darrick J. Wong
                     ` (3 more replies)
  2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                   ` (7 subsequent siblings)
  10 siblings, 4 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:24 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

Hi all,

This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel.  For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem.  For everyone else, it simply
eliminates roundtrips to userspace.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache
---
Commits in this patchset:
 * fuse: cache iomaps
 * fuse: use the iomap cache for iomap_begin
 * fuse: invalidate iomap cache after file updates
 * fuse: enable iomap cache management
---
 fs/fuse/fuse_i.h          |  110 +++
 fs/fuse/fuse_trace.h      |  646 +++++++++++++++++
 fs/fuse/iomap_cache.h     |  122 +++
 include/uapi/linux/fuse.h |   38 +
 fs/fuse/Makefile          |    2 
 fs/fuse/dev.c             |   46 +
 fs/fuse/file.c            |   10 
 fs/fuse/file_iomap.c      |  679 +++++++++++++++++-
 fs/fuse/iomap_cache.c     | 1743 +++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 3377 insertions(+), 19 deletions(-)
 create mode 100644 fs/fuse/iomap_cache.h
 create mode 100644 fs/fuse/iomap_cache.c


* [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2025-07-17 23:24 ` [PATCHSET RFC v3 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-07-17 23:24 ` Darrick J. Wong
  2025-07-17 23:32   ` [PATCH 1/7] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
                     ` (6 more replies)
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                   ` (6 subsequent siblings)
  10 siblings, 7 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:24 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

Hi all,

When iomap is enabled for a fuse file, we try to keep as much of the
file IO path in the kernel as we possibly can.  That means no calling
out to the fuse server in the IO path when we can avoid it.  However,
the existing FUSE architecture defers all file attributes to the fuse
server -- [cm]time updates, ACL metadata management, set[ug]id removal,
and permissions checking thereof, etc.

We'd really rather do all these attribute updates in the kernel, and
only push them to the fuse server when it's actually necessary (e.g.
fsync).  Furthermore, the POSIX ACL code has the weird behavior that if
the access ACL can be represented entirely by i_mode bits, it will
change the mode and delete the ACL, which fuse servers generally don't
seem to implement.
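
To make that ACL behavior concrete, here is a simplified Python model of
the decision.  It is loosely based on what the kernel's
posix_acl_equiv_mode() decides but is not a port of it; the tuple
layout, acl_to_mode, and set_access_acl are hypothetical, and entries
are assumed to arrive in canonical order (mask after group_obj).

```python
def acl_to_mode(acl):
    """Return (mode_bits, needs_acl) for (tag, qualifier, perms) entries."""
    mode = 0
    needs_acl = False
    for tag, _qualifier, perms in acl:
        if tag == "user_obj":
            mode |= perms << 6
        elif tag == "group_obj":
            mode |= perms << 3
        elif tag == "other":
            mode |= perms
        elif tag == "mask":
            # With a mask present, the group bits of the mode show the mask.
            mode = (mode & ~0o070) | (perms << 3)
            needs_acl = True
        else:
            # Named "user"/"group" entries cannot be expressed in i_mode.
            needs_acl = True
    return mode, needs_acl


def set_access_acl(inode, acl):
    mode_bits, needs_acl = acl_to_mode(acl)
    inode["mode"] = (inode["mode"] & ~0o777) | mode_bits
    # An ACL that the mode bits fully describe is dropped, not stored.
    inode["acl"] = list(acl) if needs_acl else None


inode = {"mode": 0o100600, "acl": None}

set_access_acl(inode, [("user_obj", None, 6),
                       ("group_obj", None, 4),
                       ("other", None, 4)])
simple_result = dict(inode)

set_access_acl(inode, [("user_obj", None, 7),
                       ("user", 1000, 5),
                       ("group_obj", None, 5),
                       ("mask", None, 5),
                       ("other", None, 0)])
extended_result = dict(inode)
```

The first call leaves the file with mode 0644 and no ACL at all; the
second keeps the ACL because of the named user entry.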

IOWs, we want consistent and correct (as defined by fstests) behavior
of file attributes in iomap mode.  Let's make the kernel manage all that
and push the results to userspace as needed.  This improves performance
even further, since it's sort of like writeback_cache mode but more
aggressive.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs
---
Commits in this patchset:
 * fuse: force a ctime update after a fileattr_set call when in iomap mode
 * fuse: synchronize inode->i_flags after fileattr_[gs]et
 * fuse: cache atime when in iomap mode
 * fuse: update file mode when updating acls
 * fuse: propagate default and file acls on creation
 * fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
 * fuse: update ctime when updating acls on an iomap inode
---
 fs/fuse/fuse_i.h     |    5 ++
 fs/fuse/fuse_trace.h |  103 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/acl.c        |  104 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dir.c        |  113 ++++++++++++++++++++++++++++++++++++++------------
 fs/fuse/inode.c      |   20 ++++++++-
 fs/fuse/ioctl.c      |  100 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 415 insertions(+), 30 deletions(-)


* [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (3 preceding siblings ...)
  2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-07-17 23:25 ` Darrick J. Wong
  2025-07-17 23:34   ` [PATCH 01/14] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version Darrick J. Wong
                     ` (13 more replies)
  2025-07-17 23:25 ` [PATCHSET RFC v3 2/3] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
                   ` (5 subsequent siblings)
  10 siblings, 14 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:25 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

Hi all,

This series connects libfuse to the iomap-enabled fuse driver in Linux to get
fuse servers out of the business of handling file I/O themselves.  By keeping
the IO path mostly within the kernel, we can dramatically improve the speed of
disk-based filesystems.  This enables us to move all the filesystem metadata
parsing code out of the kernel and into userspace, which means that we can
containerize them for security without losing a lot of performance.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap
---
Commits in this patchset:
 * libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version
 * libfuse: add fuse commands for iomap_begin and end
 * libfuse: add upper level iomap commands
 * libfuse: add a notification to add a new device to iomap
 * libfuse: add iomap ioend low level handler
 * libfuse: add upper level iomap ioend commands
 * libfuse: add a reply function to send FUSE_ATTR_* to the kernel
 * libfuse: connect high level fuse library to fuse_reply_attr_iflags
 * libfuse: add FUSE_IOMAP_DIRECTIO
 * libfuse: add FUSE_IOMAP_FILEIO
 * libfuse: allow discovery of the kernel's iomap capabilities
 * libfuse: add lower level iomap_config implementation
 * libfuse: add upper level iomap_config implementation
 * libfuse: add strictatime/lazytime mount options
---
 include/fuse.h          |   41 +++++
 include/fuse_common.h   |  118 ++++++++++++++
 include/fuse_kernel.h   |  118 +++++++++++++-
 include/fuse_lowlevel.h |  207 +++++++++++++++++++++++-
 lib/fuse.c              |  408 ++++++++++++++++++++++++++++++++++++++++++-----
 lib/fuse_lowlevel.c     |  294 ++++++++++++++++++++++++++++++++--
 lib/fuse_versionscript  |    9 +
 lib/meson.build         |    2 
 lib/mount.c             |   18 ++
 9 files changed, 1147 insertions(+), 68 deletions(-)


* [PATCHSET RFC v3 2/3] libfuse: cache iomap mappings for even better file IO performance
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (4 preceding siblings ...)
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-07-17 23:25 ` Darrick J. Wong
  2025-07-17 23:38   ` [PATCH 1/1] libfuse: enable iomap cache management Darrick J. Wong
  2025-07-17 23:25 ` [PATCHSET RFC v3 3/3] libfuse: implement statx and syncfs Darrick J. Wong
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:25 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

Hi all,

This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel.  For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem.  For everyone else, it simply
eliminates roundtrips to userspace.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache
---
Commits in this patchset:
 * libfuse: enable iomap cache management
---
 include/fuse_common.h   |    9 +++++
 include/fuse_kernel.h   |   34 +++++++++++++++++++
 include/fuse_lowlevel.h |   39 ++++++++++++++++++++++
 lib/fuse_lowlevel.c     |   82 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    2 +
 5 files changed, 166 insertions(+)


* [PATCHSET RFC v3 3/3] libfuse: implement statx and syncfs
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (5 preceding siblings ...)
  2025-07-17 23:25 ` [PATCHSET RFC v3 2/3] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-07-17 23:25 ` Darrick J. Wong
  2025-07-17 23:38   ` [PATCH 1/4] libfuse: wire up FUSE_SYNCFS to the low level library Darrick J. Wong
                     ` (3 more replies)
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                   ` (3 subsequent siblings)
  10 siblings, 4 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:25 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

Hi all,

Implement statx and syncfs in libfuse so that iomap-compatible fuse servers can
receive syncfs commands and provide extended file flags to the kernel.  This
second piece is critical to being able to enforce the IMMUTABLE and APPEND
inode flags.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs
---
Commits in this patchset:
 * libfuse: wire up FUSE_SYNCFS to the low level library
 * libfuse: add syncfs support to the upper library
 * libfuse: add statx support to the lower level library
 * libfuse: add upper level statx hooks
---
 include/fuse.h          |   16 ++++++
 include/fuse_lowlevel.h |   53 +++++++++++++++++++++
 lib/fuse.c              |  120 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_lowlevel.c     |  116 +++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    2 +
 5 files changed, 307 insertions(+)


* [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (6 preceding siblings ...)
  2025-07-17 23:25 ` [PATCHSET RFC v3 3/3] libfuse: implement statx and syncfs Darrick J. Wong
@ 2025-07-17 23:25 ` Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
                     ` (21 more replies)
  2025-07-17 23:26 ` [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                   ` (2 subsequent siblings)
  10 siblings, 22 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:25 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

Hi all,

Switch fuse2fs to use the new iomap file data IO paths instead of
pushing it very slowly through the /dev/fuse connection.  For local
filesystems, all we have to do is respond to requests for file to device
mappings; the rest of the IO hot path stays within the kernel.  This
means that we can get rid of all file data block processing within
fuse2fs.
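
Responding to a mapping request boils down to an extent lookup.  The
following Python sketch shows the idea under simplifying assumptions: a
fixed 4k block size, a flat sorted extent list, and only "mapped" and
"hole" results.  map_range and the tuple layouts are hypothetical and
are not taken from fuse2fs.

```python
BLOCK = 4096


def map_range(extents, pos, count):
    """Describe what backs file bytes [pos, pos + count) with one mapping.

    extents: sorted, non-overlapping (logical_block, nr_blocks, physical_block).
    Returns (type, file_offset, length, device_offset); device_offset is
    None for holes.  The answer may cover less than was asked for, in
    which case the caller asks again for the remainder.
    """
    end = pos + count
    for lblk, nblks, pblk in extents:
        ext_start = lblk * BLOCK
        ext_end = ext_start + nblks * BLOCK
        if pos < ext_start:
            # Hole in front of this extent.
            return ("hole", pos, min(end, ext_start) - pos, None)
        if pos < ext_end:
            dev = pblk * BLOCK + (pos - ext_start)
            return ("mapped", pos, min(end, ext_end) - pos, dev)
    return ("hole", pos, count, None)    # past the last extent


extents = [(0, 4, 100), (8, 2, 200)]
in_extent = map_range(extents, 4096, 8192)
in_hole = map_range(extents, 16384, 32768)
trimmed = map_range(extents, 36864, 65536)
```

Everything after the lookup (submitting the bio, filling the pagecache)
happens in the kernel, which is why no file data has to cross the
/dev/fuse connection.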

Because we're not pinning dirty pages through a potentially slow network
connection, we don't need the heavy BDI throttling for which most fuse
servers have become infamous.  Yes, mapping lookups for writeback can
stall, but mappings are small as compared to data and this situation
exists for all kernel filesystems as well.

The performance of this new data path is quite stunning: on a warm
system, streaming reads and writes through the pagecache go from
60-90MB/s to 2-2.5GB/s.  Direct IO reads and writes improve from the
same baseline to 2.5-8GB/s.  FIEMAP and SEEK_DATA/SEEK_HOLE now work
too.  The kernel ext4 driver can manage about 1.6GB/s for pagecache IO
and about 2.6-8.5GB/s for direct IO, which means that fuse2fs is about as
fast as the kernel for streaming file IO.
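
For a sense of scale, the streaming figures above translate into the
following speedup ranges.  This is plain arithmetic on the quoted
numbers, assuming 1GB/s = 1000MB/s; speedup_range is just a helper for
this example.

```python
def speedup_range(before_mb_s, after_mb_s):
    """Worst- and best-case speedup between two (low, high) throughput ranges."""
    (b_lo, b_hi), (a_lo, a_hi) = before_mb_s, after_mb_s
    return a_lo / b_hi, a_hi / b_lo


# 60-90MB/s -> 2-2.5GB/s for pagecache IO, 2.5-8GB/s for direct IO
buffered = speedup_range((60, 90), (2000, 2500))
direct = speedup_range((60, 90), (2500, 8000))
```

That works out to roughly 22x-42x for buffered streaming IO and roughly
28x-133x for direct streaming IO.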

Random 4k buffered IO is not so good: plain fuse2fs pokes along at
25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s.  The kernel
can do 900-1300MB/s.  Random directio is worse: plain fuse2fs does
20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does
40-55MB/s.  I suspect that metadata heavy workloads do not perform well
on fuse2fs because libext2fs wasn't designed for that and it doesn't
even have a journal to absorb all the fsync writes.  We also probably
need iomap caching really badly.

These performance numbers are slanted: my machine is 12 years old, and
fuse2fs is VERY poorly optimized for performance.  It contains a single
Big Filesystem Lock which nukes multi-threaded scalability.  There's no
inode cache nor is there a proper buffer cache, which means that fuse2fs
reads metadata in from disk and checksums it on EVERY ACCESS.  Sad!

Despite these gaps, this RFC demonstrates that it's feasible to run the
metadata parsing parts of a filesystem in userspace while not
sacrificing much performance.  We now have a vehicle to move the
filesystems out of the kernel, where they can be containerized so that
malicious filesystems can be contained, somewhat.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap
---
Commits in this patchset:
 * fuse2fs: implement bare minimum iomap for file mapping reporting
 * fuse2fs: add iomap= mount option
 * fuse2fs: implement iomap configuration
 * fuse2fs: register block devices for use with iomap
 * fuse2fs: always use directio disk reads with fuse2fs
 * fuse2fs: implement directio file reads
 * fuse2fs: use tagged block IO for zeroing sub-block regions
 * fuse2fs: only flush the cache for the file under directio read
 * fuse2fs: add extent dump function for debugging
 * fuse2fs: implement direct write support
 * fuse2fs: turn on iomap for pagecache IO
 * fuse2fs: improve tracing for fallocate
 * fuse2fs: don't zero bytes in punch hole
 * fuse2fs: don't do file data block IO when iomap is enabled
 * fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
 * fuse2fs: re-enable the block device pagecache for metadata IO
 * fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
 * fuse2fs: don't allow hardlinks for now
 * fuse2fs: enable file IO to inline data files
 * fuse2fs: set iomap-related inode flags
 * fuse2fs: add strictatime/lazytime mount options
 * fuse2fs: configure block device block size
---
 configure       |   47 ++
 configure.ac    |   32 +
 lib/config.h.in |    3 
 misc/fuse2fs.c  | 1567 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 1628 insertions(+), 21 deletions(-)


* [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (7 preceding siblings ...)
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-07-17 23:26 ` Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 1/1] fuse2fs: enable caching of iomaps Darrick J. Wong
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2025-07-18  8:54 ` [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Christian Brauner
  10 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:26 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

Hi all,

This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel.  For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem.  For everyone else, it simply
eliminates roundtrips to userspace.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-cache
---
Commits in this patchset:
 * fuse2fs: enable caching of iomaps
---
 misc/fuse2fs.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)


* [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (8 preceding siblings ...)
  2025-07-17 23:26 ` [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-07-17 23:26 ` Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens Darrick J. Wong
                     ` (9 more replies)
  2025-07-18  8:54 ` [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Christian Brauner
  10 siblings, 10 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:26 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

Hi all,

When iomap is enabled for a fuse file, we try to keep as much of the
file IO path in the kernel as we possibly can.  That means no calling
out to the fuse server in the IO path when we can avoid it.  However,
the existing FUSE architecture defers all file attributes to the fuse
server -- [cm]time updates, ACL metadata management, set[ug]id removal,
and permissions checking thereof, etc.

We'd really rather do all these attribute updates in the kernel, and
only push them to the fuse server when it's actually necessary (e.g.
fsync).  Furthermore, the POSIX ACL code has the weird behavior that if
the access ACL can be represented entirely by i_mode bits, it will
change the mode and delete the ACL, which fuse servers generally don't
seem to implement.
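
To make the ACL behavior above concrete, here is a rough userspace sketch
of the "can this access ACL be folded into i_mode?" decision.  The types
and names (sk_acl_tag, sk_acl_entry, sk_acl_equiv_mode) are made up for
illustration and are not the kernel's or fuse2fs's actual interfaces; the
assumption is that only the owner, owning-group, and other entries map
onto mode bits, and that any named user/group or mask entry forces a real
ACL to be kept.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ACL entry tags, for illustration only. */
enum sk_acl_tag {
	SK_ACL_USER_OBJ,	/* file owner */
	SK_ACL_USER,		/* named user */
	SK_ACL_GROUP_OBJ,	/* owning group */
	SK_ACL_GROUP,		/* named group */
	SK_ACL_MASK,
	SK_ACL_OTHER,
};

struct sk_acl_entry {
	enum sk_acl_tag	tag;
	unsigned int	perm;	/* rwx in the low three bits */
};

/*
 * Return true if the ACL says nothing beyond what the nine permission
 * bits can express, and store those bits in *mode.  Return false if
 * the ACL has to be stored as an ACL.
 */
static bool sk_acl_equiv_mode(const struct sk_acl_entry *e, size_t n,
			      unsigned int *mode)
{
	unsigned int m = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		switch (e[i].tag) {
		case SK_ACL_USER_OBJ:
			m |= (e[i].perm & 7) << 6;
			break;
		case SK_ACL_GROUP_OBJ:
			m |= (e[i].perm & 7) << 3;
			break;
		case SK_ACL_OTHER:
			m |= e[i].perm & 7;
			break;
		default:
			/* named user/group or mask: mode bits can't say this */
			return false;
		}
	}

	*mode = m;
	return true;
}
```

When this returns true, the behavior described above amounts to
updating the mode to the computed bits and dropping the stored ACL.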

IOWs, we want consistent and correct (as defined by fstests) behavior
of file attributes in iomap mode.  Let's make the kernel manage all that
and push the results to userspace as needed.  This improves performance
even further, since it's sort of like writeback_cache mode but more
aggressive.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-attrs
---
Commits in this patchset:
 * fuse2fs: allow O_APPEND and O_TRUNC opens
 * fuse2fs: skip permission checking on utimens when iomap is enabled
 * fuse2fs: let the kernel tell us about acl/mode updates
 * fuse2fs: better debugging for file mode updates
 * fuse2fs: debug timestamp updates
 * fuse2fs: use coarse timestamps for iomap mode
 * fuse2fs: add tracing for retrieving timestamps
 * fuse2fs: enable syncfs
 * fuse2fs: skip the gdt write in op_destroy if syncfs is working
 * fuse2fs: implement statx
---
 misc/fuse2fs.c |  348 ++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 276 insertions(+), 72 deletions(-)



* [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
@ 2025-07-17 23:26   ` Darrick J. Wong
  2025-07-17 23:26   ` [PATCH 2/7] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:26 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

I observed a hang when running generic/323 against a fuseblk server.
This test opens a file, initiates a lot of AIO writes to that file
descriptor, and closes the file descriptor before the writes complete.
Unsurprisingly, the AIO exerciser threads are mostly stuck waiting for
responses from the fuseblk server:

# cat /proc/372265/task/372313/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_do_getattr+0xfc/0x1f0 [fuse]
[<0>] fuse_file_read_iter+0xbe/0x1c0 [fuse]
[<0>] aio_read+0x130/0x1e0
[<0>] io_submit_one+0x542/0x860
[<0>] __x64_sys_io_submit+0x98/0x1a0
[<0>] do_syscall_64+0x37/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

But the /weird/ part is that the fuseblk server threads are waiting for
responses from the server itself:

# cat /proc/372210/task/372232/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_file_put+0x9a/0xd0 [fuse]
[<0>] fuse_release+0x36/0x50 [fuse]
[<0>] __fput+0xec/0x2b0
[<0>] task_work_run+0x55/0x90
[<0>] syscall_exit_to_user_mode+0xe9/0x100
[<0>] do_syscall_64+0x43/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

The fuseblk server is fuse2fs so there's nothing all that exciting in
the server itself.  So why is the fuse server calling fuse_file_put?
The commit message for the fstest sheds some light on that:

"By closing the file descriptor before calling io_destroy, you pretty
much guarantee that the last put on the ioctx will be done in interrupt
context (during I/O completion)."

Aha.  AIO fgets a new struct file from the fd when it queues the ioctx.
The completion of the FUSE_WRITE command from userspace causes the fuse
server to call the AIO completion function.  The completion puts the
struct file, queuing a delayed fput to the fuse server task.  When the
fuse server task returns to userspace, it has to run the delayed fput,
which in the case of a fuseblk server, it does synchronously.

Sending the FUSE_RELEASE command synchronously from fuse server threads
is a bad idea because a client program can initiate enough simultaneous
AIOs such that all the fuse server threads end up in delayed_fput, and
now there aren't any threads left to handle the queued fuse commands.

Fix this by only using synchronous fputs for fuseblk servers if the
process doesn't have PF_LOCAL_THROTTLE.  Hopefully the fuseblk server
had the good sense to call PR_SET_IO_FLUSHER to mark itself as a
filesystem server.
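
For reference, a minimal sketch of what "had the good sense" looks like
on the server side.  prctl(PR_SET_IO_FLUSHER) is a real interface
(Linux 5.6 and later, needs CAP_SYS_RESOURCE) and sets
PF_LOCAL_THROTTLE on the calling thread; the helper name
mark_self_as_io_flusher() is made up, and whether fuse2fs calls it in
exactly this way is an assumption.

```c
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_IO_FLUSHER
# define PR_SET_IO_FLUSHER	57	/* value from linux/prctl.h */
#endif

/*
 * Mark the calling thread as an IO flusher so that the kernel treats
 * it as part of the storage stack.  Returns 0 or a negative errno
 * (-EPERM without CAP_SYS_RESOURCE, -EINVAL on kernels that predate
 * the prctl).
 */
static int mark_self_as_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
		return -errno;
	return 0;
}
```

The flag is per-thread, so a multithreaded server would presumably
want to call this from each worker thread (or before spawning them,
since the flag is inherited across clone).  A failure here is not
fatal; the server just keeps the old synchronous release behavior.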

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/file.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 47006d0753f1cd..ee79cb7bc05805 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -355,8 +355,16 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
 	 * Make the release synchronous if this is a fuseblk mount,
 	 * synchronous RELEASE is allowed (and desirable) in this case
 	 * because the server can be trusted not to screw up.
+	 *
+	 * If we're a LOCAL_THROTTLE thread, use the asynchronous put
+	 * because the current thread might be a fuse server.  This can
+	 * happen if a process starts some aio and closes the fd before
+	 * the aio completes.  Since aio takes its own ref to the file,
+	 * the IO completion has to drop the ref, which is how the fuse
+	 * server can end up closing its own clients' files.
 	 */
-	fuse_file_put(ff, ff->fm->fc->destroy);
+	fuse_file_put(ff, ff->fm->fc->destroy &&
+			  (current->flags & PF_LOCAL_THROTTLE) == 0);
 }

 void fuse_release_common(struct file *file, bool isdir)


* [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
  2025-07-17 23:26   ` [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
@ 2025-07-17 23:26   ` Darrick J. Wong
  2025-07-18 16:37     ` Bernd Schubert
  2025-07-18 22:23     ` Joanne Koong
  2025-07-17 23:27   ` [PATCH 3/7] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
                     ` (4 subsequent siblings)
  6 siblings, 2 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:26 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

generic/488 fails with fuse2fs in the following fashion:

generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
(see /var/tmp/fstests/generic/488.full for details)

This test opens a large number of files, unlinks them (which really just
renames them to fuse hidden files), closes the program, unmounts the
filesystem, and runs fsck to check that there aren't any inconsistencies
in the filesystem.

Unfortunately, the 488.full file shows that there are a lot of hidden
files left over in the filesystem, with incorrect link counts.  Tracing
fuse_request_* shows that there are a large number of FUSE_RELEASE
commands that are queued up on behalf of the unlinked files at the time
that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
aborted, the fuse server would have responded to the RELEASE commands by
removing the hidden files; instead they stick around.

Create a function to push all the background requests to the queue and
then wait for the number of pending events to hit zero, and call this
before fuse_abort_conn.  That way, all the pending events are processed
by the fuse server and we don't end up with a corrupt filesystem.
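
The wait described above boils down to "poll in short slices until the
pending count hits zero or a deadline passes".  Here is a userspace
simulation of that loop shape; it is not the kernel code.  drain_sim,
drain_requests(), and tick_before() are hypothetical names, the clock
is a plain counter standing in for jiffies, and the "server" retires a
fixed number of requests per slice.

```c
#include <stdbool.h>

/* Wraparound-safe "a is earlier than b", same idea as time_before(). */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

struct drain_sim {
	unsigned long	now;		/* simulated tick counter */
	unsigned int	pending;	/* requests still outstanding */
	unsigned int	per_slice;	/* requests retired per slice */
};

/*
 * Wait for pending to reach zero, one slice at a time.  A timeout of
 * zero means wait forever.  Returns true if everything drained, false
 * if the deadline passed first.
 */
static bool drain_requests(struct drain_sim *s, unsigned long timeout,
			   unsigned long slice)
{
	unsigned long deadline = s->now + timeout;

	while (s->pending > 0) {
		if (timeout && !tick_before(s->now, deadline))
			return false;

		/* One slice elapses and the server completes some work. */
		s->now += slice;
		if (s->pending < s->per_slice)
			s->pending = 0;
		else
			s->pending -= s->per_slice;
	}
	return true;
}
```

The signed-difference comparison is what keeps the deadline check
correct when the tick counter wraps between computing the deadline and
reaching it.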

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h |    6 ++++++
 fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c  |    1 +
 3 files changed, 45 insertions(+)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b54f4f57789f7f..78d34c8e445b32 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1256,6 +1256,12 @@ void fuse_request_end(struct fuse_req *req);
 void fuse_abort_conn(struct fuse_conn *fc);
 void fuse_wait_aborted(struct fuse_conn *fc);
 
+/**
+ * Flush all pending requests and wait for them.  Takes an optional timeout
+ * in jiffies.
+ */
+void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout);
+
 /* Check if any requests timed out */
 void fuse_check_timeout(struct work_struct *work);
 
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e80cd8f2c049f9..5387e4239d6aa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -24,6 +24,7 @@
 #include <linux/splice.h>
 #include <linux/sched.h>
 #include <linux/seq_file.h>
+#include <linux/nmi.h>
 
 #define CREATE_TRACE_POINTS
 #include "fuse_trace.h"
@@ -2385,6 +2386,43 @@ static void end_polls(struct fuse_conn *fc)
 	}
 }
 
+/*
+ * Flush all pending requests and wait for them.  Only call this function when
+ * it is no longer possible for other threads to add requests.
+ */
+void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
+{
+	unsigned long deadline;
+
+	spin_lock(&fc->lock);
+	if (!fc->connected) {
+		spin_unlock(&fc->lock);
+		return;
+	}
+
+	/* Push all the background requests to the queue. */
+	spin_lock(&fc->bg_lock);
+	fc->blocked = 0;
+	fc->max_background = UINT_MAX;
+	flush_bg_queue(fc);
+	spin_unlock(&fc->bg_lock);
+	spin_unlock(&fc->lock);
+
+	/*
+	 * Wait up to @timeout jiffies (forever if zero) for all the events to
+	 * complete or abort.  Touch the watchdog once per second so that we
+	 * don't trip the hangcheck timer while waiting for the fuse server.
+	 */
+	deadline = jiffies + timeout;
+	smp_mb();
+	while (fc->connected &&
+	       (!timeout || time_before(jiffies, deadline)) &&
+	       wait_event_timeout(fc->blocked_waitq,
+			!fc->connected || atomic_read(&fc->num_waiting) == 0,
+			HZ) == 0)
+		touch_softlockup_watchdog();
+}
+
 /*
  * Abort all requests.
  *
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9572bdef49eecc..1734c263da3a77 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -2047,6 +2047,7 @@ void fuse_conn_destroy(struct fuse_mount *fm)
 {
 	struct fuse_conn *fc = fm->fc;
 
+	fuse_flush_requests(fc, 30 * HZ);
 	if (fc->destroy)
 		fuse_send_destroy(fm);
 



* [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
  2025-07-17 23:26   ` [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
  2025-07-17 23:26   ` [PATCH 2/7] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
@ 2025-07-17 23:27   ` Darrick J. Wong
  2025-07-18 17:10     ` Bernd Schubert
  2025-07-17 23:27   ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:27 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

The fuse_request_{send,end} tracepoints capture the value of
req->in.h.unique in the trace output.  It would be really nice if we
could use this to match a request to its response for debugging and
latency analysis, but the call to trace_fuse_request_send occurs before
the unique id has been set:

fuse_request_send:    connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
fuse_request_end:     connection 8388608 req 6 len 16 error -2

Move the trace_fuse_request_send callsites to after the unique id has
been set, or to right before we cancel a request that never had one
set.
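
As an illustration of the matching this enables, a small sketch that
pulls the request id out of a trace line so that a send line can be
paired with its end line.  trace_line_req_id() is a made-up helper, and
it assumes the " req <number>" text layout shown in the sample output
above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the number following " req " in a fuse_request_{send,end}
 * trace line.  Returns 0 and fills *id on success, -1 if the line has
 * no request id.
 */
static int trace_line_req_id(const char *line, unsigned long long *id)
{
	const char *p = strstr(line, " req ");

	if (!p)
		return -1;
	return sscanf(p, " req %llu", id) == 1 ? 0 : -1;
}
```

Fed the two sample lines above, this yields ids 0 and 6 for what is
the same request, which is exactly the mismatch this patch is meant to
eliminate.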

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/dev.c       |    6 +++++-
 fs/fuse/dev_uring.c |    8 +++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)


diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5387e4239d6aa6..8dd74cbfbcc6fc 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -376,10 +376,15 @@ static void fuse_dev_queue_req(struct fuse_iqueue *fiq, struct fuse_req *req)
 	if (fiq->connected) {
 		if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
 			req->in.h.unique = fuse_get_unique_locked(fiq);
+
+		/* tracepoint captures in.h.unique */
+		trace_fuse_request_send(req);
+
 		list_add_tail(&req->list, &fiq->pending);
 		fuse_dev_wake_and_unlock(fiq);
 	} else {
 		spin_unlock(&fiq->lock);
+		trace_fuse_request_send(req);
 		req->out.h.error = -ENOTCONN;
 		clear_bit(FR_PENDING, &req->flags);
 		fuse_request_end(req);
@@ -398,7 +403,6 @@ static void fuse_send_one(struct fuse_iqueue *fiq, struct fuse_req *req)
 	req->in.h.len = sizeof(struct fuse_in_header) +
 		fuse_len_args(req->args->in_numargs,
 			      (struct fuse_arg *) req->args->in_args);
-	trace_fuse_request_send(req);
 	fiq->ops->send_req(fiq, req);
 }
 
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 249b210becb1cc..14f263d4419392 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -7,6 +7,7 @@
 #include "fuse_i.h"
 #include "dev_uring_i.h"
 #include "fuse_dev_i.h"
+#include "fuse_trace.h"
 
 #include <linux/fs.h>
 #include <linux/io_uring/cmd.h>
@@ -1265,12 +1266,17 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
 
 	err = -EINVAL;
 	queue = fuse_uring_task_to_queue(ring);
-	if (!queue)
+	if (!queue) {
+		trace_fuse_request_send(req);
 		goto err;
+	}
 
 	if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
 		req->in.h.unique = fuse_get_unique(fiq);
 
+	/* tracepoint captures in.h.unique */
+	trace_fuse_request_send(req);
+
 	spin_lock(&queue->lock);
 	err = -ENOTCONN;
 	if (unlikely(queue->stopped))



* [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:27   ` [PATCH 3/7] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
@ 2025-07-17 23:27   ` Darrick J. Wong
  2025-08-18 15:11     ` Miklos Szeredi
  2025-07-17 23:27   ` [PATCH 5/7] iomap: exit early when iomap_iter is called with zero length Darrick J. Wong
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:27 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Actually copy the attributes/attributes_mask from userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/dir.c |    2 ++
 1 file changed, 2 insertions(+)


diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 45b4c3cc1396af..4d841869ba3d0a 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1285,6 +1285,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
 		stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
 		stat->btime.tv_sec = sx->btime.tv_sec;
 		stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
+		stat->attributes = sx->attributes;
+		stat->attributes_mask = sx->attributes_mask;
 		fuse_fillattr(idmap, inode, &attr, stat);
 		stat->result_mask |= STATX_TYPE;
 	}



* [PATCH 5/7] iomap: exit early when iomap_iter is called with zero length
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-17 23:27   ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
@ 2025-07-17 23:27   ` Darrick J. Wong
  2025-07-17 23:27   ` [PATCH 6/7] iomap: trace iomap_zero_iter zeroing activities Darrick J. Wong
  2025-07-17 23:28   ` [PATCH 7/7] iomap: error out on file IO when there is no inline_data buffer Darrick J. Wong
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:27 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

If iomap_iter::len is zero on the first call to iomap_iter(), we should
just return zero instead of calling ->iomap_begin with zero count.  This
obviates the need for ->iomap_begin implementations to handle that
"correctly" by not returning a zero-length mapping.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/iter.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)


diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 6ffc6a7b9ba502..b86a6a08627126 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -66,8 +66,11 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 
 	trace_iomap_iter(iter, ops, _RET_IP_);
 
-	if (!iter->iomap.length)
+	if (!iter->iomap.length) {
+		if (iter->len == 0)
+			return 0;
 		goto begin;
+	}
 
 	/*
 	 * Calculate how far the iter was advanced and the original length bytes



* [PATCH 6/7] iomap: trace iomap_zero_iter zeroing activities
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-17 23:27   ` [PATCH 5/7] iomap: exit early when iomap_iter is called with zero length Darrick J. Wong
@ 2025-07-17 23:27   ` Darrick J. Wong
  2025-07-17 23:28   ` [PATCH 7/7] iomap: error out on file IO when there is no inline_data buffer Darrick J. Wong
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:27 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Trace which bytes actually get zeroed.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/trace.h       |    1 +
 fs/iomap/buffered-io.c |    3 +++
 2 files changed, 4 insertions(+)


diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 455cc6f90be031..c71e432d96bcdb 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -84,6 +84,7 @@ DEFINE_RANGE_EVENT(iomap_release_folio);
 DEFINE_RANGE_EVENT(iomap_invalidate_folio);
 DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
 DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
+DEFINE_RANGE_EVENT(iomap_zero_iter);
 
 #define IOMAP_TYPE_STRINGS \
 	{ IOMAP_HOLE,		"HOLE" }, \
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f526d634bfeda5..53324b0222de6b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1400,6 +1400,9 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		/* warn about zeroing folios beyond eof that won't write back */
 		WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
 
+		trace_iomap_zero_iter(iter->inode, folio_pos(folio) + offset,
+				bytes);
+
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 



* [PATCH 7/7] iomap: error out on file IO when there is no inline_data buffer
  2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-07-17 23:27   ` [PATCH 6/7] iomap: trace iomap_zero_iter zeroing activities Darrick J. Wong
@ 2025-07-17 23:28   ` Darrick J. Wong
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:28 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Return IO errors if an ->iomap_begin implementation returns an
IOMAP_INLINE buffer but forgets to set the inline_data pointer.
Filesystems should never do this, but we could help fs developers (me)
fix their bugs by handling this more gracefully than crashing the
kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/buffered-io.c |   15 ++++++++++-----
 fs/iomap/direct-io.c   |    3 +++
 2 files changed, 13 insertions(+), 5 deletions(-)


diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 53324b0222de6b..2e5ed4d8fa6a81 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -351,6 +351,9 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	size_t size = i_size_read(iter->inode) - iomap->offset;
 	size_t offset = offset_in_folio(folio, iomap->offset);
 
+	if (WARN_ON_ONCE(iomap->inline_data == NULL))
+		return -EIO;
+
 	if (folio_test_uptodate(folio))
 		return 0;
 
@@ -905,7 +908,7 @@ static bool __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	return true;
 }
 
-static void iomap_write_end_inline(const struct iomap_iter *iter,
+static bool iomap_write_end_inline(const struct iomap_iter *iter,
 		struct folio *folio, loff_t pos, size_t copied)
 {
 	const struct iomap *iomap = &iter->iomap;
@@ -914,12 +917,16 @@ static void iomap_write_end_inline(const struct iomap_iter *iter,
 	WARN_ON_ONCE(!folio_test_uptodate(folio));
 	BUG_ON(!iomap_inline_data_valid(iomap));
 
+	if (WARN_ON_ONCE(iomap->inline_data == NULL))
+		return false;
+
 	flush_dcache_folio(folio);
 	addr = kmap_local_folio(folio, pos);
 	memcpy(iomap_inline_data(iomap, pos), addr, copied);
 	kunmap_local(addr);
 
 	mark_inode_dirty(iter->inode);
+	return true;
 }
 
 /*
@@ -932,10 +939,8 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
 
-	if (srcmap->type == IOMAP_INLINE) {
-		iomap_write_end_inline(iter, folio, pos, copied);
-		return true;
-	}
+	if (srcmap->type == IOMAP_INLINE)
+		return iomap_write_end_inline(iter, folio, pos, copied);
 
 	if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
 		size_t bh_written;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 1bf450f00f01c3..03fc68eaa16c6c 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -528,6 +528,9 @@ static int iomap_dio_inline_iter(struct iomap_iter *iomi, struct iomap_dio *dio)
 	loff_t pos = iomi->pos;
 	u64 copied;
 
+	if (WARN_ON_ONCE(inline_data == NULL))
+		return -EIO;
+
 	if (WARN_ON_ONCE(!iomap_inline_data_valid(iomap)))
 		return -EIO;
 



* [PATCH 01/13] fuse: implement the basic iomap mechanisms
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-07-17 23:28   ` Darrick J. Wong
  2025-07-17 23:28   ` [PATCH 02/13] fuse: add an ioctl to add new iomap devices Darrick J. Wong
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:28 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Implement functions to enable upcalling of iomap_begin and iomap_end to
userspace fuse servers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h          |   38 +++++
 fs/fuse/fuse_trace.h      |  288 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |   86 +++++++++++
 fs/fuse/Kconfig           |   24 +++
 fs/fuse/Makefile          |    1 
 fs/fuse/file_iomap.c      |  358 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c           |    5 +
 7 files changed, 799 insertions(+), 1 deletion(-)
 create mode 100644 fs/fuse/file_iomap.c


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 78d34c8e445b32..b6dc9226f3d77f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -895,6 +895,9 @@ struct fuse_conn {
 	/* Is link not implemented by fs? */
 	unsigned int no_link:1;
 
+	/* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
+	unsigned int iomap:1;
+
 	/* Use io_uring for communication */
 	unsigned int io_uring;
 
@@ -1017,6 +1020,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
 	return sb->s_fs_info;
 }
 
+static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
 {
 	return get_fuse_mount_super(sb)->fc;
@@ -1027,16 +1035,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
 	return get_fuse_mount_super(inode->i_sb);
 }
 
+static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
+{
+	return get_fuse_mount_super_c(inode->i_sb);
+}
+
 static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
 {
 	return get_fuse_mount_super(inode->i_sb)->fc;
 }
 
+static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
+{
+	return get_fuse_mount_super_c(inode->i_sb)->fc;
+}
+
 static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
 {
 	return container_of(inode, struct fuse_inode, inode);
 }
 
+static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
+{
+	return container_of(inode, struct fuse_inode, inode);
+}
+
 static inline u64 get_node_id(struct inode *inode)
 {
 	return get_fuse_inode(inode)->nodeid;
@@ -1583,4 +1606,19 @@ extern void fuse_sysctl_unregister(void);
 #define fuse_sysctl_unregister()	do { } while (0)
 #endif /* CONFIG_SYSCTL */
 
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+# include <linux/fiemap.h>
+# include <linux/iomap.h>
+
+bool fuse_iomap_enabled(void);
+
+static inline bool fuse_has_iomap(const struct inode *inode)
+{
+	return get_fuse_conn_c(inode)->iomap;
+}
+#else
+# define fuse_iomap_enabled(...)		(false)
+# define fuse_has_iomap(...)			(false)
+#endif
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index bbe9ddd8c71696..ecf9332321a1e6 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -58,6 +58,8 @@
 	EM( FUSE_SYNCFS,		"FUSE_SYNCFS")		\
 	EM( FUSE_TMPFILE,		"FUSE_TMPFILE")		\
 	EM( FUSE_STATX,			"FUSE_STATX")		\
+	EM( FUSE_IOMAP_BEGIN,		"FUSE_IOMAP_BEGIN")	\
+	EM( FUSE_IOMAP_END,		"FUSE_IOMAP_END")	\
 	EMe(CUSE_INIT,			"CUSE_INIT")
 
 /*
@@ -124,6 +126,292 @@ TRACE_EVENT(fuse_request_end,
 		  __entry->unique, __entry->len, __entry->error)
 );
 
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+
+#define FUSE_IOMAP_F_STRINGS \
+	{ FUSE_IOMAP_F_NEW,			"new" }, \
+	{ FUSE_IOMAP_F_DIRTY,			"dirty" }, \
+	{ FUSE_IOMAP_F_SHARED,			"shared" }, \
+	{ FUSE_IOMAP_F_MERGED,			"merged" }, \
+	{ FUSE_IOMAP_F_XATTR,			"xattr" }, \
+	{ FUSE_IOMAP_F_BOUNDARY,		"boundary" }, \
+	{ FUSE_IOMAP_F_ANON_WRITE,		"anon_write" }, \
+	{ FUSE_IOMAP_F_ATOMIC_BIO,		"atomic" }, \
+	{ FUSE_IOMAP_F_WANT_IOMAP_END,		"iomap_end" }, \
+	{ FUSE_IOMAP_F_SIZE_CHANGED,		"append" }, \
+	{ FUSE_IOMAP_F_STALE,			"stale" }
+
+#define FUSE_IOMAP_OP_STRINGS \
+	{ FUSE_IOMAP_OP_WRITE,			"write" }, \
+	{ FUSE_IOMAP_OP_ZERO,			"zero" }, \
+	{ FUSE_IOMAP_OP_REPORT,			"report" }, \
+	{ FUSE_IOMAP_OP_FAULT,			"fault" }, \
+	{ FUSE_IOMAP_OP_DIRECT,			"direct" }, \
+	{ FUSE_IOMAP_OP_NOWAIT,			"nowait" }, \
+	{ FUSE_IOMAP_OP_OVERWRITE_ONLY,		"overwrite" }, \
+	{ FUSE_IOMAP_OP_UNSHARE,		"unshare" }, \
+	{ FUSE_IOMAP_OP_ATOMIC,			"atomic" }, \
+	{ FUSE_IOMAP_OP_DONTCACHE,		"dontcache" }
+
+#define FUSE_IOMAP_TYPE_STRINGS \
+	{ FUSE_IOMAP_TYPE_PURE_OVERWRITE,	"overwrite" }, \
+	{ FUSE_IOMAP_TYPE_HOLE,			"hole" }, \
+	{ FUSE_IOMAP_TYPE_DELALLOC,		"delalloc" }, \
+	{ FUSE_IOMAP_TYPE_MAPPED,		"mapped" }, \
+	{ FUSE_IOMAP_TYPE_UNWRITTEN,		"unwritten" }, \
+	{ FUSE_IOMAP_TYPE_INLINE,		"inline" }
+
+TRACE_EVENT(fuse_iomap_begin,
+	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+		 unsigned opflags),
+
+	TP_ARGS(inode, pos, count, opflags),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(loff_t,		count)
+		__field(unsigned,	opflags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->pos		=	pos;
+		__entry->count		=	count;
+		__entry->opflags	=	opflags;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx opflags (%s) pos 0x%llx count 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+		  __entry->pos, __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_begin_error,
+	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+		 unsigned opflags, int error),
+
+	TP_ARGS(inode, pos, count, opflags, error),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(loff_t,		count)
+		__field(unsigned,	opflags)
+		__field(int,		error)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->pos		=	pos;
+		__entry->count		=	count;
+		__entry->opflags	=	opflags;
+		__entry->error		=	error;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx opflags (%s) pos 0x%llx count 0x%llx err %d",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+		  __entry->pos, __entry->count, __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_read_map,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_begin_out *outarg),
+
+	TP_ARGS(inode, outarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		offset)
+		__field(loff_t,		length)
+		__field(uint32_t,	dev)
+		__field(uint64_t,	addr)
+		__field(uint16_t,	type)
+		__field(uint16_t,	mapflags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	outarg->offset;
+		__entry->length		=	outarg->length;
+		__entry->dev		=	outarg->read_dev;
+		__entry->addr		=	outarg->read_addr;
+		__entry->type		=	outarg->read_type;
+		__entry->mapflags	=	outarg->read_flags;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx read offset 0x%llx count 0x%llx dev %u addr 0x%llx type %s mapflags (%s)",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->length,
+		  __entry->dev, __entry->addr,
+		  __print_symbolic(__entry->type, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS))
+);
+
+TRACE_EVENT(fuse_iomap_write_map,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_begin_out *outarg),
+
+	TP_ARGS(inode, outarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		offset)
+		__field(loff_t,		length)
+		__field(uint32_t,	dev)
+		__field(uint64_t,	addr)
+		__field(uint16_t,	type)
+		__field(uint16_t,	mapflags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	outarg->offset;
+		__entry->length		=	outarg->length;
+		__entry->dev		=	outarg->write_dev;
+		__entry->addr		=	outarg->write_addr;
+		__entry->type		=	outarg->write_type;
+		__entry->mapflags	=	outarg->write_flags;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx write offset 0x%llx count 0x%llx dev %u addr 0x%llx type %s mapflags (%s)",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->length,
+		  __entry->dev, __entry->addr,
+		  __print_symbolic(__entry->type, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS))
+);
+
+TRACE_EVENT(fuse_iomap_end,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_end_in *inarg),
+
+	TP_ARGS(inode, inarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(loff_t,		count)
+		__field(unsigned,	opflags)
+		__field(size_t,		written)
+
+		__field(uint32_t,	dev)
+		__field(uint64_t,	addr)
+		__field(uint16_t,	type)
+		__field(uint16_t,	mapflags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->pos		=	inarg->pos;
+		__entry->count		=	inarg->count;
+		__entry->opflags	=	inarg->opflags;
+		__entry->written	=	inarg->written;
+		__entry->dev		=	inarg->map_dev;
+		__entry->addr		=	inarg->map_addr;
+		__entry->type		=	inarg->map_type;
+		__entry->mapflags	=	inarg->map_flags;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx opflags (%s) pos 0x%llx count 0x%llx written %zd dev %u addr 0x%llx type 0x%x mapflags (%s)",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+		  __entry->pos, __entry->count, __entry->written, __entry->dev,
+		  __entry->addr, __entry->type,
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS))
+);
+
+TRACE_EVENT(fuse_iomap_end_error,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_end_in *inarg, int error),
+
+	TP_ARGS(inode, inarg, error),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(loff_t,		count)
+		__field(unsigned,	opflags)
+		__field(size_t,		written)
+		__field(int,		error)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->pos		=	inarg->pos;
+		__entry->count		=	inarg->count;
+		__entry->opflags	=	inarg->opflags;
+		__entry->written	=	inarg->written;
+		__entry->error		=	error;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx opflags (%s) pos 0x%llx count 0x%llx written %zd error %d",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS),
+		  __entry->pos, __entry->count, __entry->written,
+		  __entry->error)
+);
+#endif /* CONFIG_FUSE_IOMAP */
+
 #endif /* _TRACE_FUSE_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 122d6586e8d4da..501f4d838e654f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -235,6 +235,10 @@
  *
  *  7.44
  *  - add FUSE_NOTIFY_INC_EPOCH
+ *
+ *  7.99
+ *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
+ *    SEEK_{DATA,HOLE} support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -270,7 +274,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 44
+#define FUSE_KERNEL_MINOR_VERSION 99
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -443,6 +447,8 @@ struct fuse_file_lock {
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
  * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
  *			 init_out.request_timeout contains the timeout (in secs)
+ * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
+ *	       operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -490,6 +496,7 @@ struct fuse_file_lock {
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
+#define FUSE_IOMAP		(1ULL << 43)
 
 /**
  * CUSE INIT request/reply flags
@@ -658,6 +665,9 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_BEGIN	= 4094,
+	FUSE_IOMAP_END		= 4095,
+
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
 
@@ -1290,4 +1300,78 @@ struct fuse_uring_cmd_req {
 	uint8_t padding[6];
 };
 
+#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
+#define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
+#define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
+#define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
+#define FUSE_IOMAP_TYPE_UNWRITTEN	3	/* blocks allocated at @addr in unwritten state */
+#define FUSE_IOMAP_TYPE_INLINE		4	/* data inline in the inode */
+
+#define FUSE_IOMAP_DEV_NULL		(0U)	/* null device cookie */
+
+#define FUSE_IOMAP_F_NEW		(1U << 0)
+#define FUSE_IOMAP_F_DIRTY		(1U << 1)
+#define FUSE_IOMAP_F_SHARED		(1U << 2)
+#define FUSE_IOMAP_F_MERGED		(1U << 3)
+#define FUSE_IOMAP_F_XATTR		(1U << 5)
+#define FUSE_IOMAP_F_BOUNDARY		(1U << 6)
+#define FUSE_IOMAP_F_ANON_WRITE		(1U << 7)
+#define FUSE_IOMAP_F_ATOMIC_BIO		(1U << 8)
+#define FUSE_IOMAP_F_WANT_IOMAP_END	(1U << 12) /* want ->iomap_end call */
+
+/* only for iomap_end */
+#define FUSE_IOMAP_F_SIZE_CHANGED	(1U << 14)
+#define FUSE_IOMAP_F_STALE		(1U << 15)
+
+#define FUSE_IOMAP_OP_WRITE		(1 << 0) /* writing, must allocate blocks */
+#define FUSE_IOMAP_OP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
+#define FUSE_IOMAP_OP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
+#define FUSE_IOMAP_OP_FAULT		(1 << 3) /* mapping for page fault */
+#define FUSE_IOMAP_OP_DIRECT		(1 << 4) /* direct I/O */
+#define FUSE_IOMAP_OP_NOWAIT		(1 << 5) /* do not block */
+#define FUSE_IOMAP_OP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
+#define FUSE_IOMAP_OP_UNSHARE		(1 << 7) /* unshare_file_range */
+#define FUSE_IOMAP_OP_ATOMIC		(1 << 9) /* torn-write protection */
+#define FUSE_IOMAP_OP_DONTCACHE		(1 << 10) /* don't retain pagecache */
+
+#define FUSE_IOMAP_NULL_ADDR		(-1ULL)	/* addr is not valid */
+
+struct fuse_iomap_begin_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+};
+
+struct fuse_iomap_begin_out {
+	uint64_t offset;	/* file offset of mapping, bytes */
+	uint64_t length;	/* length of both mappings, bytes */
+
+	uint64_t read_addr;	/* disk offset of mapping, bytes */
+	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t read_dev;	/* device cookie */
+
+	uint64_t write_addr;	/* disk offset of mapping, bytes */
+	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t write_dev;	/* device cookie */
+};
+
+struct fuse_iomap_end_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+	int64_t written;	/* bytes processed */
+
+	uint64_t map_length;	/* length of mapping, bytes */
+	uint64_t map_addr;	/* disk offset of mapping, bytes */
+	uint16_t map_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t map_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t map_dev;	/* device cookie */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index ca215a3cba3e31..b8a453570161d6 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -64,6 +64,30 @@ config FUSE_PASSTHROUGH
 
 	  If you want to allow passthrough operations, answer Y.
 
+config FUSE_IOMAP
+	bool "FUSE file IO over iomap"
+	default y
+	depends on FUSE_FS
+	depends on BLOCK
+	select FS_IOMAP
+	help
+	  For supported fuseblk servers, this allows the file I/O path to run
+	  through the kernel's iomap code.
+
+config FUSE_IOMAP_BY_DEFAULT
+	bool "FUSE file I/O over iomap by default"
+	default n
+	depends on FUSE_IOMAP
+	help
+	  Enable sending FUSE file I/O over iomap by default.
+
+config FUSE_IOMAP_DEBUG
+	bool "Debug FUSE file IO over iomap"
+	default n
+	depends on FUSE_IOMAP
+	help
+	  Enable debugging assertions for the fuse iomap code paths.
+
 config FUSE_IO_URING
 	bool "FUSE communication over io-uring"
 	default y
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3f0f312a31c1cc..63a41ef9336aaa 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -16,5 +16,6 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
 fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
 fuse-$(CONFIG_SYSCTL) += sysctl.o
 fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
+fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
 
 virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
new file mode 100644
index 00000000000000..a206a9254df3fe
--- /dev/null
+++ b/fs/fuse/file_iomap.c
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "fuse_i.h"
+#include "fuse_trace.h"
+#include <linux/iomap.h>
+
+static bool __read_mostly enable_iomap =
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
+	true;
+#else
+	false;
+#endif
+module_param(enable_iomap, bool, 0644);
+MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap");
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+# define ASSERT(a)		do { WARN(!(a), "Assertion failed: %s, func: %s, line: %d", #a, __func__, __LINE__); } while (0)
+# define BAD_DATA(condition)	(WARN(condition, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__))
+#else
+# define ASSERT(a)
+# define BAD_DATA(condition)	(condition)
+#endif
+
+bool fuse_iomap_enabled(void)
+{
+	/*
+	 * There are fears that a fuse+iomap server could somehow DoS the
+	 * system by doing things like going out to lunch during a writeback
+	 * related iomap request.  Only allow iomap access if the fuse server
+	 * has rawio capabilities since those processes can mess things up
+	 * quite well even without our help.
+	 */
+	return enable_iomap && has_capability_noaudit(current, CAP_SYS_RAWIO);
+}
+
+static inline bool fuse_iomap_check_type(uint16_t type)
+{
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_HOLE	!= IOMAP_HOLE);
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_DELALLOC	!= IOMAP_DELALLOC);
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_MAPPED	!= IOMAP_MAPPED);
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_UNWRITTEN	!= IOMAP_UNWRITTEN);
+	BUILD_BUG_ON(FUSE_IOMAP_TYPE_INLINE	!= IOMAP_INLINE);
+
+	switch (type) {
+	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+	case FUSE_IOMAP_TYPE_HOLE:
+	case FUSE_IOMAP_TYPE_DELALLOC:
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+	case FUSE_IOMAP_TYPE_INLINE:
+		return true;
+	}
+
+	return false;
+}
+
+#define FUSE_IOMAP_F_ALL (FUSE_IOMAP_F_NEW | \
+			  FUSE_IOMAP_F_DIRTY | \
+			  FUSE_IOMAP_F_SHARED | \
+			  FUSE_IOMAP_F_MERGED | \
+			  FUSE_IOMAP_F_XATTR | \
+			  FUSE_IOMAP_F_BOUNDARY | \
+			  FUSE_IOMAP_F_ANON_WRITE | \
+			  FUSE_IOMAP_F_ATOMIC_BIO | \
+			  FUSE_IOMAP_F_WANT_IOMAP_END)
+
+static inline bool fuse_iomap_check_flags(uint16_t flags)
+{
+	BUILD_BUG_ON(FUSE_IOMAP_F_NEW		!= IOMAP_F_NEW);
+	BUILD_BUG_ON(FUSE_IOMAP_F_DIRTY		!= IOMAP_F_DIRTY);
+	BUILD_BUG_ON(FUSE_IOMAP_F_SHARED	!= IOMAP_F_SHARED);
+	BUILD_BUG_ON(FUSE_IOMAP_F_MERGED	!= IOMAP_F_MERGED);
+	BUILD_BUG_ON(FUSE_IOMAP_F_XATTR		!= IOMAP_F_XATTR);
+	BUILD_BUG_ON(FUSE_IOMAP_F_BOUNDARY	!= IOMAP_F_BOUNDARY);
+	BUILD_BUG_ON(FUSE_IOMAP_F_ANON_WRITE	!= IOMAP_F_ANON_WRITE);
+	BUILD_BUG_ON(FUSE_IOMAP_F_ATOMIC_BIO	!= IOMAP_F_ATOMIC_BIO);
+	BUILD_BUG_ON(FUSE_IOMAP_F_WANT_IOMAP_END != IOMAP_F_PRIVATE);
+
+	return (flags & ~FUSE_IOMAP_F_ALL) == 0;
+}
+
+/* Check the incoming mappings to make sure they're not nonsense */
+static inline int
+fuse_iomap_begin_validate(const struct fuse_iomap_begin_out *outarg,
+			  const struct inode *inode,
+			  unsigned opflags, loff_t pos)
+{
+	const unsigned int blocksize = i_blocksize(inode);
+	uint64_t end;
+
+	BUILD_BUG_ON(FUSE_IOMAP_OP_WRITE	!= IOMAP_WRITE);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_ZERO		!= IOMAP_ZERO);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_REPORT	!= IOMAP_REPORT);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_FAULT	!= IOMAP_FAULT);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_DIRECT	!= IOMAP_DIRECT);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_NOWAIT	!= IOMAP_NOWAIT);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_OVERWRITE_ONLY != IOMAP_OVERWRITE_ONLY);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_UNSHARE	!= IOMAP_UNSHARE);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_ATOMIC	!= IOMAP_ATOMIC);
+	BUILD_BUG_ON(FUSE_IOMAP_OP_DONTCACHE	!= IOMAP_DONTCACHE);
+
+	/* No garbage mapping types or flags */
+	if (BAD_DATA(!fuse_iomap_check_type(outarg->read_type)))
+		return -EIO;
+	if (BAD_DATA(!fuse_iomap_check_flags(outarg->read_flags)))
+		return -EIO;
+
+	if (BAD_DATA(!fuse_iomap_check_type(outarg->write_type)))
+		return -EIO;
+	if (BAD_DATA(!fuse_iomap_check_flags(outarg->write_flags)))
+		return -EIO;
+
+	/*
+	 * Must have returned a mapping for at least the first byte in the
+	 * range.
+	 */
+	if (BAD_DATA(outarg->offset > pos))
+		return -EIO;
+	if (BAD_DATA(outarg->length == 0))
+		return -EIO;
+
+	/* File range must be aligned to blocksize */
+	if (BAD_DATA(!IS_ALIGNED(outarg->offset, blocksize)))
+		return -EIO;
+	if (BAD_DATA(!IS_ALIGNED(outarg->length, blocksize)))
+		return -EIO;
+
+	/* No overflows in the file range */
+	if (BAD_DATA(check_add_overflow(outarg->offset, outarg->length, &end)))
+		return -EIO;
+	if (BAD_DATA(end <= pos))
+		return -EIO;
+
+	/* File range cannot start past maxbytes */
+	if (BAD_DATA(outarg->offset >= inode->i_sb->s_maxbytes))
+		return -EIO;
+
+	switch (outarg->read_type) {
+	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+		/* "Pure overwrite" only allowed for write mapping */
+		BAD_DATA(outarg->read_type == FUSE_IOMAP_TYPE_PURE_OVERWRITE);
+		return -EIO;
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		/* Mappings backed by space must have a device/addr */
+		if (BAD_DATA(outarg->read_dev == FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(outarg->read_addr == FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	case FUSE_IOMAP_TYPE_DELALLOC:
+	case FUSE_IOMAP_TYPE_HOLE:
+	case FUSE_IOMAP_TYPE_INLINE:
+		/* Mappings not backed by space cannot have a device addr. */
+		if (BAD_DATA(outarg->read_dev != FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(outarg->read_addr != FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	default:
+		/* should have been caught already */
+		return -EIO;
+	}
+
+	switch (outarg->write_type) {
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		/* Mappings backed by space must have a device/addr */
+		if (BAD_DATA(outarg->write_dev == FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(outarg->write_addr == FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+	case FUSE_IOMAP_TYPE_HOLE:
+	case FUSE_IOMAP_TYPE_DELALLOC:
+	case FUSE_IOMAP_TYPE_INLINE:
+		/* Mappings not backed by space cannot have a device addr. */
+		if (BAD_DATA(outarg->write_dev != FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(outarg->write_addr != FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	default:
+		/* should have been caught already */
+		return -EIO;
+	}
+
+	/* XXX: Check the device cookie */
+	ASSERT(outarg->read_dev == 0);
+
+	/* No overflows in the device range, if supplied */
+	if (outarg->read_addr != FUSE_IOMAP_NULL_ADDR &&
+	    BAD_DATA(check_add_overflow(outarg->read_addr, outarg->length, &end)))
+		return -EIO;
+
+	if (outarg->write_addr != FUSE_IOMAP_NULL_ADDR &&
+	    BAD_DATA(check_add_overflow(outarg->write_addr, outarg->length, &end)))
+		return -EIO;
+
+	if (!(opflags & FUSE_IOMAP_OP_REPORT)) {
+		/*
+		 * XXX inline data reads and writes are not supported, how do
+		 * we do this?
+		 */
+		if (BAD_DATA(outarg->read_type == FUSE_IOMAP_TYPE_INLINE))
+			return -EIO;
+		if (BAD_DATA(outarg->write_type == FUSE_IOMAP_TYPE_INLINE))
+			return -EIO;
+	}
+
+	return 0;
+}
+
+static inline bool fuse_is_iomap_file_write(unsigned int opflags)
+{
+	return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
+}
+
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
+			    unsigned opflags, struct iomap *iomap,
+			    struct iomap *srcmap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_begin_in inarg = {
+		.attr_ino = fi->orig_ino,
+		.opflags = opflags,
+		.pos = pos,
+		.count = count,
+	};
+	struct fuse_iomap_begin_out outarg = { };
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	FUSE_ARGS(args);
+	int err;
+
+	trace_fuse_iomap_begin(inode, pos, count, opflags);
+
+	args.opcode = FUSE_IOMAP_BEGIN;
+	args.nodeid = get_node_id(inode);
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	args.out_numargs = 1;
+	args.out_args[0].size = sizeof(outarg);
+	args.out_args[0].value = &outarg;
+	err = fuse_simple_request(fm, &args);
+	if (err) {
+		trace_fuse_iomap_begin_error(inode, pos, count, opflags, err);
+		return err;
+	}
+
+	trace_fuse_iomap_read_map(inode, &outarg);
+	trace_fuse_iomap_write_map(inode, &outarg);
+
+	err = fuse_iomap_begin_validate(&outarg, inode, opflags, pos);
+	if (err)
+		return err;
+
+	if (fuse_is_iomap_file_write(opflags) &&
+	    outarg.write_type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+		/*
+		 * For an out of place write, we must supply the write mapping
+		 * via @iomap, and the read mapping via @srcmap.
+		 */
+		iomap->addr = outarg.write_addr;
+		iomap->offset = outarg.offset;
+		iomap->length = outarg.length;
+		iomap->type = outarg.write_type;
+		iomap->flags = outarg.write_flags;
+		iomap->bdev = inode->i_sb->s_bdev;
+
+		srcmap->addr = outarg.read_addr;
+		srcmap->offset = outarg.offset;
+		srcmap->length = outarg.length;
+		srcmap->type = outarg.read_type;
+		srcmap->flags = outarg.read_flags;
+		srcmap->bdev = inode->i_sb->s_bdev;
+	} else {
+		/*
+		 * For everything else (reads, reporting, and pure overwrites),
+		 * we can return the sole mapping through @iomap and leave
+		 * @srcmap unchanged from its default (HOLE).
+		 */
+		iomap->addr = outarg.read_addr;
+		iomap->offset = outarg.offset;
+		iomap->length = outarg.length;
+		iomap->type = outarg.read_type;
+		iomap->flags = outarg.read_flags;
+		iomap->bdev = inode->i_sb->s_bdev;
+	}
+
+	return 0;
+}
+
+static bool fuse_want_iomap_end(const struct iomap *iomap, unsigned int opflags,
+				loff_t count, ssize_t written)
+{
+	/* Caller demanded an iomap_end call. */
+	if (iomap->flags & FUSE_IOMAP_F_WANT_IOMAP_END)
+		return true;
+
+	/* Reads and reporting should never affect the filesystem metadata */
+	if (!fuse_is_iomap_file_write(opflags))
+		return false;
+
+	/* Appending writes get an iomap_end call */
+	if (iomap->flags & IOMAP_F_SIZE_CHANGED)
+		return true;
+
+	/* Short writes get an iomap_end call to clean up delalloc */
+	return written < count;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
+			  ssize_t written, unsigned opflags,
+			  struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_end_in inarg = {
+		.opflags = opflags,
+		.attr_ino = fi->orig_ino,
+		.pos = pos,
+		.count = count,
+		.written = written,
+
+		.map_addr = iomap->addr,
+		.map_length = iomap->length,
+		.map_type = iomap->type,
+		.map_flags = iomap->flags,
+	};
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	FUSE_ARGS(args);
+	int err;
+
+	if (!fuse_want_iomap_end(iomap, opflags, count, written))
+		return 0;
+
+	trace_fuse_iomap_end(inode, &inarg);
+
+	args.opcode = FUSE_IOMAP_END;
+	args.nodeid = get_node_id(inode);
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	err = fuse_simple_request(fm, &args);
+	if (err)
+		trace_fuse_iomap_end_error(inode, &inarg, err);
+
+	return err;
+}
+
+const struct iomap_ops fuse_iomap_ops = {
+	.iomap_begin		= fuse_iomap_begin,
+	.iomap_end		= fuse_iomap_end,
+};
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1734c263da3a77..6173795d3826d0 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1443,6 +1443,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 
 			if (flags & FUSE_REQUEST_TIMEOUT)
 				timeout = arg->request_timeout;
+
+			if ((flags & FUSE_IOMAP) && fuse_iomap_enabled())
+				fc->iomap = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1511,6 +1514,8 @@ void fuse_send_init(struct fuse_mount *fm)
 	 */
 	if (fuse_uring_enabled())
 		flags |= FUSE_OVER_IO_URING;
+	if (fuse_iomap_enabled())
+		flags |= FUSE_IOMAP;
 
 	ia->in.flags = flags;
 	ia->in.flags2 = flags >> 32;
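
For anyone writing a server against this, the file-range checks that
fuse_iomap_begin_validate() applies to a FUSE_IOMAP_BEGIN reply can be
restated as a small userspace sketch.  This is only an illustration, not
kernel code: struct reply_range and range_is_sane() are hypothetical names,
the block size is passed in explicitly and assumed to be a power of two, and
the s_maxbytes, type, flag, and device checks are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset/length pair in fuse_iomap_begin_out. */
struct reply_range {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
};

/*
 * Restates the file-range checks from the patch: the mapping must start at
 * or before @pos, be non-empty, be block aligned, not overflow, and end
 * after @pos.  @blocksize is assumed to be a power of two.
 */
static bool range_is_sane(const struct reply_range *r, uint64_t pos,
			  uint32_t blocksize)
{
	uint64_t end;

	if (r->offset > pos || r->length == 0)
		return false;
	if ((r->offset & (blocksize - 1)) || (r->length & (blocksize - 1)))
		return false;
	if (__builtin_add_overflow(r->offset, r->length, &end))
		return false;
	return end > pos;
}
```

A reply that fails any of these checks is rejected by the kernel with -EIO,
so a server may want to run an equivalent check before replying.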



* [PATCH 02/13] fuse: add an ioctl to add new iomap devices
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
  2025-07-17 23:28   ` [PATCH 01/13] fuse: implement the basic iomap mechanisms Darrick J. Wong
@ 2025-07-17 23:28   ` Darrick J. Wong
  2025-07-17 23:28   ` [PATCH 03/13] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount Darrick J. Wong
                     ` (10 subsequent siblings)
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:28 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Add an ioctl that allows fuse servers to register block devices for use
with iomap.  This is (for now) separate from the backing file open/close
ioctl (despite using the same struct) to keep the codepaths separate.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
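A rough sketch of how a fuse server might call the new ioctl, for
illustration only.  The sketch_* names are hypothetical; the struct is a
local copy of the existing struct fuse_backing_map layout and the ioctl
number is the one added in this patch, so that the example builds without
the patched <linux/fuse.h>.  The assumption that a successful call returns
the device cookie is made by analogy with FUSE_DEV_IOC_BACKING_OPEN and has
not been checked against fuse_iomap_dev_add().

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Local copy of the uapi struct fuse_backing_map layout. */
struct sketch_backing_map {
	int32_t		fd;
	uint32_t	flags;
	uint64_t	padding;
};

/* FUSE_DEV_IOC_MAGIC is 229; command number 3 is the one this patch adds. */
#define SKETCH_FUSE_DEV_IOC_MAGIC	229
#define SKETCH_IOC_IOMAP_DEV_ADD \
	_IOW(SKETCH_FUSE_DEV_IOC_MAGIC, 3, struct sketch_backing_map)

/*
 * Register an already-open block device with the fuse connection behind
 * @fuse_dev_fd.  Assumption: a positive return value is the cookie to place
 * in read_dev/write_dev; -1 means failure with errno set.
 */
static int sketch_iomap_dev_add(int fuse_dev_fd, int bdev_fd)
{
	struct sketch_backing_map map = { .fd = bdev_fd };

	return ioctl(fuse_dev_fd, SKETCH_IOC_IOMAP_DEV_ADD, &map);
}
```

A server would presumably do this once per block device after FUSE_INIT
negotiates FUSE_IOMAP, then reuse the cookie in every mapping it returns.
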
 fs/fuse/fuse_i.h          |   27 +++++
 fs/fuse/fuse_trace.h      |   62 +++++++++++
 include/uapi/linux/fuse.h |    3 +
 fs/fuse/dev.c             |   21 ++++
 fs/fuse/file_iomap.c      |  243 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/inode.c           |   13 ++
 6 files changed, 361 insertions(+), 8 deletions(-)
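
The cookie lookup rule that fuse_iomap_find_dev() enforces can be modelled
in userspace as follows.  This is a simplified sketch: the sk_* names and
the fixed-size table are hypothetical stand-ins for the per-connection idr,
and refcounting and RCU are left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the FUSE_IOMAP_TYPE_* and FUSE_IOMAP_DEV_NULL defines. */
#define SK_TYPE_HOLE		0
#define SK_TYPE_MAPPED		2
#define SK_TYPE_UNWRITTEN	3
#define SK_DEV_NULL		0U
#define SK_MAX_DEVS		8

/* Hypothetical per-connection table; slot 0 is never used, as in the idr. */
struct sk_dev_table {
	const char *names[SK_MAX_DEVS];
};

/*
 * Returns 0 and sets *out (possibly to NULL) when the cookie is acceptable
 * for the mapping type, or -1 when a space-backed mapping names no device.
 */
static int sk_find_dev(const struct sk_dev_table *t, uint16_t type,
		       uint32_t cookie, const char **out)
{
	const char *dev = NULL;

	if (cookie != SK_DEV_NULL && cookie < SK_MAX_DEVS)
		dev = t->names[cookie];

	/* Mappings backed by space must resolve to a registered device. */
	if ((type == SK_TYPE_MAPPED || type == SK_TYPE_UNWRITTEN) && !dev)
		return -1;

	*out = dev;
	return 0;
}
```

Holes and delalloc mappings pass with a NULL device, which matches
fuse_iomap_set_device() leaving iomap->bdev NULL in that case.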


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b6dc9226f3d77f..12c462a29fe0c4 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -616,6 +616,19 @@ struct fuse_sync_bucket {
 	struct rcu_head rcu;
 };
 
+struct fuse_iomap_conn {
+	struct idr device_map;
+};
+
+struct fuse_iomap_dev {
+	struct file *file;
+	struct block_device *bdev;
+
+	/** refcount */
+	refcount_t count;
+	struct rcu_head rcu;
+};
+
 /**
  * A Fuse connection.
  *
@@ -970,6 +983,10 @@ struct fuse_conn {
 	struct fuse_ring *ring;
 #endif
 
+#ifdef CONFIG_FUSE_IOMAP
+	struct fuse_iomap_conn iomap_conn;
+#endif
+
 	/** Only used if the connection opts into request timeouts */
 	struct {
 		/* Worker for checking if any requests have timed out */
@@ -1616,9 +1633,19 @@ static inline bool fuse_has_iomap(const struct inode *inode)
 {
 	return get_fuse_conn_c(inode)->iomap;
 }
+
+bool fuse_iomap_fill_super(struct fuse_mount *fm);
+int fuse_iomap_conn_alloc(struct fuse_conn *fc);
+void fuse_iomap_conn_put(struct fuse_conn *fc);
+
+int fuse_iomap_dev_add(struct fuse_conn *fc, const struct fuse_backing_map *map);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
+# define fuse_iomap_fill_super(...)		(true)
+# define fuse_iomap_conn_alloc(...)		(0)
+# define fuse_iomap_conn_put(...)		((void)0)
+# define fuse_iomap_dev_add(...)		(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index ecf9332321a1e6..5c8533053f8eed 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -410,6 +410,68 @@ TRACE_EVENT(fuse_iomap_end_error,
 		  __entry->pos, __entry->count, __entry->written,
 		  __entry->error)
 );
+
+TRACE_EVENT(fuse_iomap_dev_add,
+	TP_PROTO(const struct fuse_conn *fc,
+		 const struct fuse_backing_map *map),
+
+	TP_ARGS(fc, map),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(int,		fd)
+		__field(unsigned int,	flags)
+	),
+
+	TP_fast_assign(
+		__entry->connection	=	fc->dev;
+		__entry->fd		=	map->fd;
+		__entry->flags		=	map->flags;
+	),
+
+	TP_printk("connection %u fd %d flags 0x%x",
+		  __entry->connection,
+		  __entry->fd,
+		  __entry->flags)
+);
+
+TRACE_EVENT(fuse_iomap_dev_class,
+	TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
+		 const struct fuse_iomap_dev *fb),
+
+	TP_ARGS(fc, idx, fb),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(unsigned int,	idx)
+		__field(dev_t,		bdev)
+	),
+
+	TP_fast_assign(
+		__entry->connection	=	fc->dev;
+		__entry->idx		=	idx;
+
+		if (fb) {
+			struct inode *inode = file_inode(fb->file);
+
+			__entry->bdev	=	inode->i_rdev;
+		} else {
+			__entry->bdev	=	0;
+		}
+	),
+
+	TP_printk("connection %u idx %u dev %u:%u",
+		  __entry->connection,
+		  __entry->idx,
+		  MAJOR(__entry->bdev), MINOR(__entry->bdev))
+);
+#define DEFINE_FUSE_IOMAP_DEV_EVENT(name)		\
+DEFINE_EVENT(fuse_iomap_dev_class, name,		\
+	TP_PROTO(const struct fuse_conn *fc, unsigned int idx, \
+		 const struct fuse_iomap_dev *fb), \
+	TP_ARGS(fc, idx, fb))
+DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_add_dev);
+DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_remove_dev);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 501f4d838e654f..2fe83fc196b021 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -239,6 +239,7 @@
  *  7.99
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
  *    SEEK_{DATA,HOLE} support
+ *  - add FUSE_DEV_IOC_IOMAP_DEV_ADD to configure block devices for iomap
  */
 
 #ifndef _LINUX_FUSE_H
@@ -1136,6 +1137,8 @@ struct fuse_backing_map {
 #define FUSE_DEV_IOC_BACKING_OPEN	_IOW(FUSE_DEV_IOC_MAGIC, 1, \
 					     struct fuse_backing_map)
 #define FUSE_DEV_IOC_BACKING_CLOSE	_IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_IOMAP_DEV_ADD	_IOW(FUSE_DEV_IOC_MAGIC, 3, \
+					     struct fuse_backing_map)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8dd74cbfbcc6fc..49ff2c6654e768 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2633,6 +2633,24 @@ static long fuse_dev_ioctl_backing_open(struct file *file,
 	return fuse_backing_open(fud->fc, &map);
 }
 
+static long fuse_dev_ioctl_iomap_dev_add(struct file *file,
+					 struct fuse_backing_map __user *argp)
+{
+	struct fuse_dev *fud = fuse_get_dev(file);
+	struct fuse_backing_map map;
+
+	if (!fud)
+		return -EPERM;
+
+	if (!IS_ENABLED(CONFIG_FUSE_IOMAP))
+		return -EOPNOTSUPP;
+
+	if (copy_from_user(&map, argp, sizeof(map)))
+		return -EFAULT;
+
+	return fuse_iomap_dev_add(fud->fc, &map);
+}
+
 static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
 {
 	struct fuse_dev *fud = fuse_get_dev(file);
@@ -2665,6 +2683,9 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 	case FUSE_DEV_IOC_BACKING_CLOSE:
 		return fuse_dev_ioctl_backing_close(file, argp);
 
+	case FUSE_DEV_IOC_IOMAP_DEV_ADD:
+		return fuse_dev_ioctl_iomap_dev_add(file, argp);
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index a206a9254df3fe..535429023d37e7 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -189,9 +189,6 @@ fuse_iomap_begin_validate(const struct fuse_iomap_begin_out *outarg,
 		return -EIO;
 	}
 
-	/* XXX: Check the device cookie */
-	ASSERT(outarg->read_dev == 0);
-
 	/* No overflows in the device range, if supplied */
 	if (outarg->read_addr != FUSE_IOMAP_NULL_ADDR &&
 	    BAD_DATA(check_add_overflow(outarg->read_addr, outarg->length, &end)))
@@ -220,6 +217,98 @@ static inline bool fuse_is_iomap_file_write(unsigned int opflags)
 	return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
 }
 
+static struct fuse_iomap_dev *fuse_iomap_dev_get(struct fuse_iomap_dev *fb)
+{
+	if (fb && refcount_inc_not_zero(&fb->count))
+		return fb;
+	return NULL;
+}
+
+static void fuse_iomap_dev_free(struct fuse_iomap_dev *fb)
+{
+	if (fb->file)
+		fput(fb->file);
+	kfree_rcu(fb, rcu);
+}
+
+static void fuse_iomap_dev_put(struct fuse_iomap_dev *fb)
+{
+	if (fb && refcount_dec_and_test(&fb->count))
+		fuse_iomap_dev_free(fb);
+}
+
+static int fuse_iomap_dev_id_alloc(struct fuse_conn *fc,
+				   struct fuse_iomap_dev *fb)
+{
+	int id;
+
+	idr_preload(GFP_KERNEL);
+	spin_lock(&fc->lock);
+	id = idr_alloc_cyclic(&fc->iomap_conn.device_map, fb, 1, 0,
+			      GFP_ATOMIC);
+	spin_unlock(&fc->lock);
+	idr_preload_end();
+
+	trace_fuse_iomap_add_dev(fc, id, fb);
+
+	return id;
+}
+
+static struct fuse_iomap_dev *fuse_iomap_dev_id_remove(struct fuse_conn *fc,
+						       int id)
+{
+	struct fuse_iomap_dev *fb;
+
+	spin_lock(&fc->lock);
+	fb = idr_remove(&fc->iomap_conn.device_map, id);
+	spin_unlock(&fc->lock);
+
+	if (fb)
+		trace_fuse_iomap_remove_dev(fc, id, fb);
+
+	return fb;
+}
+
+static inline struct fuse_iomap_dev *
+fuse_iomap_dev_id_find(struct fuse_conn *fc, int idx)
+{
+	struct fuse_iomap_dev *fb;
+
+	rcu_read_lock();
+	fb = idr_find(&fc->iomap_conn.device_map, idx);
+	fb = fuse_iomap_dev_get(fb);
+	rcu_read_unlock();
+
+	return fb;
+}
+
+static inline struct fuse_iomap_dev *
+fuse_iomap_find_dev(struct fuse_conn *fc, uint16_t map_type, uint32_t map_dev)
+{
+	struct fuse_iomap_dev *ret = NULL;
+
+	if (map_dev != FUSE_IOMAP_DEV_NULL && map_dev < INT_MAX)
+		ret = fuse_iomap_dev_id_find(fc, map_dev);
+
+	switch (map_type) {
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		/* Mappings backed by space must have a device/addr */
+		if (BAD_DATA(ret == NULL))
+			return ERR_PTR(-EIO);
+		break;
+	}
+
+	return ret;
+}
+
+static inline void
+fuse_iomap_set_device(struct iomap *iomap, const struct fuse_iomap_dev *fb)
+{
+	iomap->bdev = fb ? fb->bdev : NULL;
+	iomap->dax_dev = NULL;
+}
+
 static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 			    unsigned opflags, struct iomap *iomap,
 			    struct iomap *srcmap)
@@ -233,6 +322,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 	};
 	struct fuse_iomap_begin_out outarg = { };
 	struct fuse_mount *fm = get_fuse_mount(inode);
+	struct fuse_iomap_dev *read_dev = NULL;
+	struct fuse_iomap_dev *write_dev = NULL;
 	FUSE_ARGS(args);
 	int err;
 
@@ -259,8 +350,21 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 	if (err)
 		return err;
 
+	read_dev = fuse_iomap_find_dev(fm->fc, outarg.read_type,
+				       outarg.read_dev);
+	if (IS_ERR(read_dev))
+		return PTR_ERR(read_dev);
+
 	if (fuse_is_iomap_file_write(opflags) &&
 	    outarg.write_type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+
+		write_dev = fuse_iomap_find_dev(fm->fc, outarg.write_type,
+						outarg.write_dev);
+		if (IS_ERR(write_dev)) {
+			err = PTR_ERR(write_dev);
+			goto out_read_dev;
+		}
+
 		/*
 		 * For an out of place write, we must supply the write mapping
 		 * via @iomap, and the read mapping via @srcmap.
@@ -270,14 +374,14 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 		iomap->length = outarg.length;
 		iomap->type = outarg.write_type;
 		iomap->flags = outarg.write_flags;
-		iomap->bdev = inode->i_sb->s_bdev;
+		fuse_iomap_set_device(iomap, write_dev);
 
 		srcmap->addr = outarg.read_addr;
 		srcmap->offset = outarg.offset;
 		srcmap->length = outarg.length;
 		srcmap->type = outarg.read_type;
 		srcmap->flags = outarg.read_flags;
-		srcmap->bdev = inode->i_sb->s_bdev;
+		fuse_iomap_set_device(srcmap, read_dev);
 	} else {
 		/*
 		 * For everything else (reads, reporting, and pure overwrites),
@@ -289,10 +393,19 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 		iomap->length = outarg.length;
 		iomap->type = outarg.read_type;
 		iomap->flags = outarg.read_flags;
-		iomap->bdev = inode->i_sb->s_bdev;
+		fuse_iomap_set_device(iomap, read_dev);
 	}
 
-	return 0;
+	/*
+	 * XXX: if we ever want to support closing devices, we need a way to
+	 * track the fuse_iomap_dev refcount all the way through bio endios.
+	 * For now we put the refcount here because you can't remove an iomap
+	 * device until unmount time.
+	 */
+	fuse_iomap_dev_put(write_dev);
+out_read_dev:
+	fuse_iomap_dev_put(read_dev);
+	return err;
 }
 
 static bool fuse_want_iomap_end(const struct iomap *iomap, unsigned int opflags,
@@ -356,3 +469,119 @@ const struct iomap_ops fuse_iomap_ops = {
 	.iomap_begin		= fuse_iomap_begin,
 	.iomap_end		= fuse_iomap_end,
 };
+
+int fuse_iomap_conn_alloc(struct fuse_conn *fc)
+{
+	idr_init(&fc->iomap_conn.device_map);
+	return 0;
+}
+
+static int fuse_iomap_dev_id_free(int id, void *p, void *data)
+{
+	struct fuse_iomap_dev *fb = p;
+	struct fuse_conn *fc = data;
+
+	trace_fuse_iomap_remove_dev(fc, id, fb);
+
+	WARN_ON_ONCE(refcount_read(&fb->count) != 1);
+	fuse_iomap_dev_free(fb);
+	return 0;
+}
+
+void fuse_iomap_conn_put(struct fuse_conn *fc)
+{
+	idr_for_each(&fc->iomap_conn.device_map, fuse_iomap_dev_id_free, fc);
+	idr_destroy(&fc->iomap_conn.device_map);
+}
+
+static struct fuse_iomap_dev *fuse_iomap_dev_alloc(struct file *file)
+{
+	struct fuse_iomap_dev *fb =
+			kmalloc(sizeof(struct fuse_iomap_dev), GFP_KERNEL);
+
+	if (!fb)
+		return NULL;
+
+	fb->file = file;
+	fb->bdev = I_BDEV(file->f_mapping->host);
+	refcount_set(&fb->count, 1);
+
+	return fb;
+}
+
+bool fuse_iomap_fill_super(struct fuse_mount *fm)
+{
+	struct fuse_conn *fc = fm->fc;
+	struct super_block *sb = fm->sb;
+	int res;
+
+	if (sb->s_bdev) {
+		/*
+		 * Try to install s_bdev as the first iomap device, if this
+		 * is a block-device filesystem.
+		 */
+		struct fuse_iomap_dev *fb =
+					fuse_iomap_dev_alloc(sb->s_bdev_file);
+
+		if (!fb)
+			return false;
+
+		res = fuse_iomap_dev_id_alloc(fc, fb);
+		if (res < 0)
+			return false;
+		if (res != 1) {
+			struct fuse_iomap_dev *bad =
+					fuse_iomap_dev_id_remove(fc, res);
+
+			ASSERT(res == 1);
+			ASSERT(bad == fb);
+			fuse_iomap_dev_put(bad);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+int fuse_iomap_dev_add(struct fuse_conn *fc, const struct fuse_backing_map *map)
+{
+	struct file *file;
+	struct fuse_iomap_dev *fb = NULL;
+	int res;
+
+	trace_fuse_iomap_dev_add(fc, map);
+
+	res = -EPERM;
+	if (!fc->iomap)
+		goto out;
+
+	res = -EINVAL;
+	if (map->flags || map->padding)
+		goto out;
+
+	file = fget_raw(map->fd);
+	res = -EBADF;
+	if (!file)
+		goto out;
+
+	res = -ENODEV;
+	if (!S_ISBLK(file_inode(file)->i_mode))
+		goto out_fput;
+	res = -ENOMEM;
+	fb = fuse_iomap_dev_alloc(file);
+	if (!fb)
+		goto out_fput;
+
+	res = fuse_iomap_dev_id_alloc(fc, fb);
+	if (res < 0) {
+		fuse_iomap_dev_free(fb);
+		goto out;
+	}
+
+	return res;
+
+out_fput:
+	fput(file);
+out:
+	return res;
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 6173795d3826d0..8266f30bc8a954 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1015,6 +1015,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		struct fuse_iqueue *fiq = &fc->iq;
 		struct fuse_sync_bucket *bucket;
 
+		fuse_iomap_conn_put(fc);
 		if (IS_ENABLED(CONFIG_FUSE_DAX))
 			fuse_dax_conn_free(fc);
 		if (fc->timeout.req_timeout)
@@ -1454,6 +1455,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 
 		init_server_timeout(fc, timeout);
 
+		if (fc->iomap && !fuse_iomap_fill_super(fm))
+			ok = false;
+
 		fm->sb->s_bdi->ra_pages =
 				min(fm->sb->s_bdi->ra_pages, ra_pages);
 		fc->minor = arg->minor;
@@ -1823,10 +1827,15 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 
 	sb->s_subtype = ctx->subtype;
 	ctx->subtype = NULL;
+
+	err = fuse_iomap_conn_alloc(fc);
+	if (err)
+		goto err;
+
 	if (IS_ENABLED(CONFIG_FUSE_DAX)) {
 		err = fuse_dax_conn_alloc(fc, ctx->dax_mode, ctx->dax_dev);
 		if (err)
-			goto err;
+			goto err_free_iomap;
 	}
 
 	if (ctx->fudptr) {
@@ -1888,6 +1897,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
  err_dev_free:
 	if (fud)
 		fuse_dev_free(fud);
+ err_free_iomap:
+	fuse_iomap_conn_put(fc);
  err_free_dax:
 	if (IS_ENABLED(CONFIG_FUSE_DAX))
 		fuse_dax_conn_free(fc);


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 03/13] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
  2025-07-17 23:28   ` [PATCH 01/13] fuse: implement the basic iomap mechanisms Darrick J. Wong
  2025-07-17 23:28   ` [PATCH 02/13] fuse: add an ioctl to add new iomap devices Darrick J. Wong
@ 2025-07-17 23:28   ` Darrick J. Wong
  2025-07-17 23:29   ` [PATCH 04/13] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
                     ` (9 subsequent siblings)
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:28 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

At unmount time, there are a few things that we need to ask the fuse
server to do.

First, we need to flush queued events to userspace to give the fuse
server a chance to process the events.  This is how we make sure that
the server processes FUSE_RELEASE events before the connection goes
down.

Second, to ensure that all those metadata updates are persisted to disk
before we tell the fuse server to destroy itself, send FUSE_SYNCFS after
waiting for the queued events.

Finally, we need to send FUSE_DESTROY to the fuse server so that it
closes the filesystem and the device fds before unmount returns.  That
way, a script that does something like "umount /dev/sda ; e2fsck -fn
/dev/sda" will not fail the e2fsck because the fd closure races with
e2fsck startup.  Obviously, we need to wait for FUSE_SYNCFS.

This is a major behavior change, and there is no telling what existing
code it might break, so we hide it behind iomap mode.
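
For anyone trying to follow the ordering, here is a rough userspace model
of the teardown sequence described above.  It is only a sketch: the step
names, struct umount_log, and model_conn_destroy() are made up for
illustration and do not exist in the kernel; the two bool arguments stand
in for the connection's iomap and destroy flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step names; they only label the ordering described above. */
enum umount_step {
	STEP_FLUSH_QUEUED,	/* wait for queued events, e.g. FUSE_RELEASE */
	STEP_SYNCFS,		/* send FUSE_SYNCFS */
	STEP_FLUSH_SYNCFS,	/* wait for the FUSE_SYNCFS reply */
	STEP_DESTROY,		/* send FUSE_DESTROY */
	STEP_ABORT,		/* tear down the connection */
};

#define MAX_STEPS 8

struct umount_log {
	enum umount_step steps[MAX_STEPS];
	size_t nr;
};

static void log_step(struct umount_log *log, enum umount_step step)
{
	assert(log->nr < MAX_STEPS);
	log->steps[log->nr++] = step;
}

/*
 * Model of the teardown order.  In iomap mode the syncfs and destroy steps
 * always run; otherwise destroy is sent only if the server asked for it.
 */
static void model_conn_destroy(struct umount_log *log, bool iomap, bool destroy)
{
	log_step(log, STEP_FLUSH_QUEUED);
	if (iomap) {
		log_step(log, STEP_SYNCFS);
		log_step(log, STEP_FLUSH_SYNCFS);
		log_step(log, STEP_DESTROY);
	} else if (destroy) {
		log_step(log, STEP_DESTROY);
	}
	log_step(log, STEP_ABORT);
}
```

The point of the model is that FUSE_DESTROY is never issued before the
syncfs reply has been waited for, and that non-iomap servers keep the old
behavior.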

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h     |    5 +++++
 fs/fuse/file_iomap.c |   23 +++++++++++++++++++++++
 fs/fuse/inode.c      |    6 ++++--
 3 files changed, 32 insertions(+), 2 deletions(-)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 12c462a29fe0c4..850c187434a61a 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1380,6 +1380,9 @@ int fuse_init_fs_context_submount(struct fs_context *fsc);
  */
 void fuse_conn_destroy(struct fuse_mount *fm);
 
+/* Send the FUSE_DESTROY command. */
+void fuse_send_destroy(struct fuse_mount *fm);
+
 /* Drop the connection and free the fuse mount */
 void fuse_mount_destroy(struct fuse_mount *fm);
 
@@ -1639,6 +1642,7 @@ int fuse_iomap_conn_alloc(struct fuse_conn *fc);
 void fuse_iomap_conn_put(struct fuse_conn *fc);
 
 int fuse_iomap_dev_add(struct fuse_conn *fc, const struct fuse_backing_map *map);
+void fuse_iomap_conn_destroy(struct fuse_mount *fm);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1646,6 +1650,7 @@ int fuse_iomap_dev_add(struct fuse_conn *fc, const struct fuse_backing_map *map)
 # define fuse_iomap_conn_alloc(...)		(0)
 # define fuse_iomap_conn_put(...)		((void)0)
 # define fuse_iomap_dev_add(...)		(-ENOSYS)
+# define fuse_iomap_conn_destroy(...)		((void)0)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 535429023d37e7..4724d5678112db 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -540,6 +540,12 @@ bool fuse_iomap_fill_super(struct fuse_mount *fm)
 		}
 	}
 
+	/*
+	 * Enable syncfs for iomap fuse servers so that we can send a final
+	 * flush at unmount time.  This also means that we can support
+	 * freeze/thaw properly.
+	 */
+	fc->sync_fs = true;
 	return true;
 }
 
@@ -585,3 +591,20 @@ int fuse_iomap_dev_add(struct fuse_conn *fc, const struct fuse_backing_map *map)
 out:
 	return res;
 }
+
+void fuse_iomap_conn_destroy(struct fuse_mount *fm)
+{
+	struct fuse_conn *fc = fm->fc;
+
+	/*
+	 * Flush all pending commands, syncfs, flush that, and send a destroy
+	 * command.  This gives the fuse server a chance to process all the
+	 * pending releases, write the last bits of metadata changes to disk,
+	 * and close the iomap block devices before we return from the umount
+	 * call.  The caller already flushed previously pending requests, so we
+	 * only need the flush to wait for syncfs.
+	 */
+	sync_filesystem(fm->sb);
+	fuse_flush_requests(fc, 60 * HZ);
+	fuse_send_destroy(fm);
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8266f30bc8a954..8b12284bced7e6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -618,7 +618,7 @@ static void fuse_umount_begin(struct super_block *sb)
 		retire_super(sb);
 }
 
-static void fuse_send_destroy(struct fuse_mount *fm)
+void fuse_send_destroy(struct fuse_mount *fm)
 {
 	if (fm->fc->conn_init) {
 		FUSE_ARGS(args);
@@ -2064,7 +2064,9 @@ void fuse_conn_destroy(struct fuse_mount *fm)
 	struct fuse_conn *fc = fm->fc;
 
 	fuse_flush_requests(fc, 30 * HZ);
-	if (fc->destroy)
+	if (fc->iomap)
+		fuse_iomap_conn_destroy(fm);
+	else if (fc->destroy)
 		fuse_send_destroy(fm);
 
 	fuse_abort_conn(fc);


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 04/13] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:28   ` [PATCH 03/13] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount Darrick J. Wong
@ 2025-07-17 23:29   ` Darrick J. Wong
  2025-07-17 23:29   ` [PATCH 05/13] fuse: implement direct IO with iomap Darrick J. Wong
                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:29 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Implement the basic file mapping reporting functions like FIEMAP, BMAP,
and SEEK_DATA/HOLE.
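
As a reminder of what the userspace side of SEEK_DATA/SEEK_HOLE looks
like, here is a minimal sketch that walks the data extents of an open
file with lseek(2).  The helper names (struct data_extent,
next_data_extent, print_data_extents) are invented for this example, and
the fallback SEEK_DATA/SEEK_HOLE values are the Linux ones, assumed only
for the case where the libc headers do not expose them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the headers did not provide them. */
#ifndef SEEK_DATA
# define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
# define SEEK_HOLE 4
#endif

/* Hypothetical helper type: one run of data, [start, end). */
struct data_extent {
	off_t start;
	off_t end;
};

/*
 * Find the first data extent at or after @pos.  Returns 1 and fills @ext
 * if one exists, 0 if there is no more data (ENXIO), or -1 on any other
 * error.  Note that lseek() moves the file position as a side effect.
 */
static int next_data_extent(int fd, off_t pos, struct data_extent *ext)
{
	off_t data, hole;

	assert(ext != NULL);

	data = lseek(fd, pos, SEEK_DATA);
	if (data < 0)
		return errno == ENXIO ? 0 : -1;

	hole = lseek(fd, data, SEEK_HOLE);
	if (hole < 0)
		return -1;

	ext->start = data;
	ext->end = hole;
	return 1;
}

/* Print every data extent; returns the extent count or -1 on error. */
static int print_data_extents(int fd, FILE *out)
{
	struct data_extent ext;
	off_t pos = 0;
	int nr = 0, ret;

	while ((ret = next_data_extent(fd, pos, &ext)) == 1) {
		fprintf(out, "data [%lld, %lld)\n",
			(long long)ext.start, (long long)ext.end);
		pos = ext.end;
		nr++;
	}
	return ret < 0 ? -1 : nr;
}
```

On a filesystem without real hole reporting, the whole file shows up as a
single data extent ending at EOF, so the loop still terminates.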

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h     |    8 ++++++
 fs/fuse/fuse_trace.h |   66 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dir.c        |    1 +
 fs/fuse/file.c       |   13 +++++++++
 fs/fuse/file_iomap.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 158 insertions(+)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 850c187434a61a..4df51454858146 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1643,6 +1643,11 @@ void fuse_iomap_conn_put(struct fuse_conn *fc);
 
 int fuse_iomap_dev_add(struct fuse_conn *fc, const struct fuse_backing_map *map);
 void fuse_iomap_conn_destroy(struct fuse_mount *fm);
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		      u64 start, u64 length);
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1651,6 +1656,9 @@ void fuse_iomap_conn_destroy(struct fuse_mount *fm);
 # define fuse_iomap_conn_put(...)		((void)0)
 # define fuse_iomap_dev_add(...)		(-ENOSYS)
 # define fuse_iomap_conn_destroy(...)		((void)0)
+# define fuse_iomap_fiemap			NULL
+# define fuse_iomap_lseek(...)			(-ENOSYS)
+# define fuse_iomap_bmap(...)			(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 5c8533053f8eed..9c02ca07571e1c 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -472,6 +472,72 @@ DEFINE_EVENT(fuse_iomap_dev_class, name,		\
 	TP_ARGS(fc, idx, fb))
 DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_add_dev);
 DEFINE_FUSE_IOMAP_DEV_EVENT(fuse_iomap_remove_dev);
+
+TRACE_EVENT(fuse_iomap_fiemap,
+	TP_PROTO(const struct inode *inode, u64 start, u64 count,
+		unsigned int flags),
+
+	TP_ARGS(inode, start, count, flags),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(u64,		start)
+		__field(u64,		count)
+		__field(unsigned int,	flags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->start		=	start;
+		__entry->count		=	count;
+		__entry->flags		=	flags;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx flags 0x%x start 0x%llx count 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->flags, __entry->start,
+		  __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_lseek,
+	TP_PROTO(const struct inode *inode, loff_t offset, int whence),
+
+	TP_ARGS(inode, offset, whence),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		offset)
+		__field(int,		whence)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	offset;
+		__entry->whence		=	whence;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx offset 0x%llx whence %d",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->whence)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 4d841869ba3d0a..5efd763d188559 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2257,6 +2257,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.set_acl	= fuse_set_acl,
 	.fileattr_get	= fuse_fileattr_get,
 	.fileattr_set	= fuse_fileattr_set,
+	.fiemap		= fuse_iomap_fiemap,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ee79cb7bc05805..d143990d9ed931 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2569,6 +2569,12 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
 	struct fuse_bmap_out outarg;
 	int err;
 
+	if (fuse_has_iomap(inode)) {
+		sector_t alt_sec = fuse_iomap_bmap(mapping, block);
+		if (alt_sec > 0)
+			return alt_sec;
+	}
+
 	if (!inode->i_sb->s_bdev || fm->fc->no_bmap)
 		return 0;
 
@@ -2604,6 +2610,13 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
 	struct fuse_lseek_out outarg;
 	int err;
 
+	if (fuse_has_iomap(inode)) {
+		loff_t alt_pos = fuse_iomap_lseek(file, offset, whence);
+
+		if (alt_pos != -ENOSYS)
+			return alt_pos;
+	}
+
 	if (fm->fc->no_lseek)
 		goto fallback;
 
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 4724d5678112db..fb33185852ff0b 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -608,3 +608,73 @@ void fuse_iomap_conn_destroy(struct fuse_mount *fm)
 	fuse_flush_requests(fc, 60 * HZ);
 	fuse_send_destroy(fm);
 }
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		      u64 start, u64 count)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	int error;
+
+	/*
+	 * We are called directly from the vfs so we need to check per-inode
+	 * support here explicitly.
+	 */
+	if (!fuse_has_iomap(inode))
+		return -EOPNOTSUPP;
+
+	if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR)
+		return -EOPNOTSUPP;
+
+	if (fuse_is_bad(inode))
+		return -EIO;
+
+	if (!fuse_allow_current_process(fc))
+		return -EACCES;
+
+	trace_fuse_iomap_fiemap(inode, start, count, fieinfo->fi_flags);
+
+	inode_lock_shared(inode);
+	error = iomap_fiemap(inode, fieinfo, start, count,
+			&fuse_iomap_ops);
+	inode_unlock_shared(inode);
+
+	return error;
+}
+
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block)
+{
+	ASSERT(fuse_has_iomap(mapping->host));
+
+	return iomap_bmap(mapping, block, &fuse_iomap_ops);
+}
+
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
+{
+	struct inode *inode = file->f_mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	ASSERT(fuse_has_iomap(inode));
+
+	if (fuse_is_bad(inode))
+		return -EIO;
+
+	if (!fuse_allow_current_process(fc))
+		return -EACCES;
+
+	trace_fuse_iomap_lseek(inode, offset, whence);
+
+	switch (whence) {
+	case SEEK_HOLE:
+		offset = iomap_seek_hole(inode, offset, &fuse_iomap_ops);
+		break;
+	case SEEK_DATA:
+		offset = iomap_seek_data(inode, offset, &fuse_iomap_ops);
+		break;
+	default:
+		return -ENOSYS;
+	}
+
+	if (offset < 0)
+		return offset;
+	return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
+}


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 05/13] fuse: implement direct IO with iomap
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-17 23:29   ` [PATCH 04/13] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
@ 2025-07-17 23:29   ` Darrick J. Wong
  2025-07-17 23:29   ` [PATCH 06/13] fuse: implement buffered " Darrick J. Wong
                     ` (7 subsequent siblings)
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:29 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Implement direct IO with iomap if it's available.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h          |   33 +++++
 fs/fuse/fuse_trace.h      |  257 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |   29 ++++
 fs/fuse/dir.c             |    7 +
 fs/fuse/file.c            |   17 +++
 fs/fuse/file_iomap.c      |  292 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c           |    6 +
 7 files changed, 640 insertions(+), 1 deletion(-)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4df51454858146..67e428da4391aa 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -226,6 +226,8 @@ enum {
 	FUSE_I_BTIME,
 	/* Wants or already has page cache IO */
 	FUSE_I_CACHE_IO_MODE,
+	/* Use iomap for directio reads and writes */
+	FUSE_I_IOMAP_DIRECTIO,
 };
 
 struct fuse_conn;
@@ -911,6 +913,9 @@ struct fuse_conn {
 	/* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
 	unsigned int iomap:1;
 
+	/* Use fs/iomap for direct I/O operations */
+	unsigned int iomap_directio:1;
+
 	/* Use io_uring for communication */
 	unsigned int io_uring;
 
@@ -1648,6 +1653,27 @@ int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		      u64 start, u64 length);
 loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
 sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
+
+void fuse_iomap_open(struct inode *inode, struct file *file);
+
+void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags);
+void fuse_iomap_evict_inode(struct inode *inode);
+
+static inline bool fuse_has_iomap_directio(const struct inode *inode)
+{
+	const struct fuse_inode *fi = get_fuse_inode_c(inode);
+
+	return test_bit(FUSE_I_IOMAP_DIRECTIO, &fi->state);
+}
+
+static inline bool fuse_want_iomap_directio(const struct kiocb *iocb)
+{
+	return (iocb->ki_flags & IOCB_DIRECT) &&
+		fuse_has_iomap_directio(file_inode(iocb->ki_filp));
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1659,6 +1685,13 @@ sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
 # define fuse_iomap_fiemap			NULL
 # define fuse_iomap_lseek(...)			(-ENOSYS)
 # define fuse_iomap_bmap(...)			(-ENOSYS)
+# define fuse_iomap_open(...)			((void)0)
+# define fuse_iomap_init_inode(...)		((void)0)
+# define fuse_iomap_evict_inode(...)		((void)0)
+# define fuse_has_iomap_directio(...)		(false)
+# define fuse_want_iomap_directio(...)		(false)
+# define fuse_iomap_direct_read(...)		(-ENOSYS)
+# define fuse_iomap_direct_write(...)		(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 9c02ca07571e1c..b888ae40e1116e 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -60,6 +60,7 @@
 	EM( FUSE_STATX,			"FUSE_STATX")		\
 	EM( FUSE_IOMAP_BEGIN,		"FUSE_IOMAP_BEGIN")	\
 	EM( FUSE_IOMAP_END,		"FUSE_IOMAP_END")	\
+	EM( FUSE_IOMAP_IOEND,		"FUSE_IOMAP_IOEND")	\
 	EMe(CUSE_INIT,			"CUSE_INIT")
 
 /*
@@ -161,6 +162,34 @@ TRACE_EVENT(fuse_request_end,
 	{ FUSE_IOMAP_TYPE_UNWRITTEN,		"unwritten" }, \
 	{ FUSE_IOMAP_TYPE_INLINE,		"inline" }
 
+#define FUSE_IOMAP_IOEND_STRINGS \
+	{ FUSE_IOMAP_IOEND_SHARED,		"shared" }, \
+	{ FUSE_IOMAP_IOEND_UNWRITTEN,		"unwritten" }, \
+	{ FUSE_IOMAP_IOEND_BOUNDARY,		"boundary" }, \
+	{ FUSE_IOMAP_IOEND_DIRECT,		"direct" }, \
+	{ FUSE_IOMAP_IOEND_APPEND,		"append" }
+
+#define IOMAP_DIOEND_STRINGS \
+	{ IOMAP_DIO_UNWRITTEN,			"unwritten" }, \
+	{ IOMAP_DIO_COW,			"cow" }
+
+TRACE_DEFINE_ENUM(FUSE_I_ADVISE_RDPLUS);
+TRACE_DEFINE_ENUM(FUSE_I_INIT_RDPLUS);
+TRACE_DEFINE_ENUM(FUSE_I_SIZE_UNSTABLE);
+TRACE_DEFINE_ENUM(FUSE_I_BAD);
+TRACE_DEFINE_ENUM(FUSE_I_BTIME);
+TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
+TRACE_DEFINE_ENUM(FUSE_I_IOMAP_DIRECTIO);
+
+#define FUSE_IFLAG_STRINGS \
+	{ 1 << FUSE_I_ADVISE_RDPLUS,		"advise_rdplus" }, \
+	{ 1 << FUSE_I_INIT_RDPLUS,		"init_rdplus" }, \
+	{ 1 << FUSE_I_SIZE_UNSTABLE,		"size_unstable" }, \
+	{ 1 << FUSE_I_BAD,			"bad" }, \
+	{ 1 << FUSE_I_BTIME,			"btime" }, \
+	{ 1 << FUSE_I_CACHE_IO_MODE,		"cacheio" }, \
+	{ 1 << FUSE_I_IOMAP_DIRECTIO,		"iomap_dio" }
+
 TRACE_EVENT(fuse_iomap_begin,
 	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
 		 unsigned opflags),
@@ -411,6 +440,89 @@ TRACE_EVENT(fuse_iomap_end_error,
 		  __entry->error)
 );
 
+TRACE_EVENT(fuse_iomap_ioend,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_ioend_in *inarg),
+
+	TP_ARGS(inode, inarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(unsigned,	ioendflags)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(int,		error)
+		__field(uint64_t,	new_addr)
+		__field(size_t,		written)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->ioendflags	=	inarg->ioendflags;
+		__entry->error		=	inarg->error;
+		__entry->pos		=	inarg->pos;
+		__entry->new_addr	=	inarg->new_addr;
+		__entry->written	=	inarg->written;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx ioendflags (%s) pos 0x%llx written %zd error %d new_addr 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+		  __entry->pos, __entry->written, __entry->error,
+		  __entry->new_addr)
+);
+
+TRACE_EVENT(fuse_iomap_ioend_error,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_ioend_in *inarg,
+		 int error),
+
+	TP_ARGS(inode, inarg, error),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(unsigned,	ioendflags)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(int,		error)
+		__field(uint64_t,	new_addr)
+		__field(size_t,		written)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->ioendflags	=	inarg->ioendflags;
+		__entry->error		=	error;
+		__entry->pos		=	inarg->pos;
+		__entry->new_addr	=	inarg->new_addr;
+		__entry->written	=	inarg->written;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx ioendflags (%s) pos 0x%llx written %zd error %d new_addr 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+		  __entry->pos, __entry->written, __entry->error,
+		  __entry->new_addr)
+);
+
 TRACE_EVENT(fuse_iomap_dev_add,
 	TP_PROTO(const struct fuse_conn *fc,
 		 const struct fuse_backing_map *map),
@@ -538,6 +650,151 @@ TRACE_EVENT(fuse_iomap_lseek,
 		  __entry->connection, __entry->ino, __entry->nodeid,
 		  __entry->isize, __entry->offset, __entry->whence)
 );
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_io_class,
+	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter),
+	TP_ARGS(iocb, iter),
+	TP_STRUCT__entry(
+		__field(dev_t, connection)
+		__field(uint64_t, ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t, isize)
+		__field(loff_t, offset)
+		__field(size_t, count)
+	),
+	TP_fast_assign(
+		const struct inode *inode = file_inode(iocb->ki_filp);
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	iocb->ki_pos;
+		__entry->count		=	iov_iter_count(iter);
+	),
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx pos 0x%llx bytecount 0x%zx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->count)
+)
+#define DEFINE_FUSE_IOMAP_FILE_IO_EVENT(name)		\
+DEFINE_EVENT(fuse_iomap_file_io_class, name,		\
+	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter), \
+	TP_ARGS(iocb, iter))
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
+	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
+		 ssize_t ret),
+	TP_ARGS(iocb, iter, ret),
+	TP_STRUCT__entry(
+		__field(dev_t, connection)
+		__field(uint64_t, ino)
+		__field(uint64_t, nodeid)
+		__field(loff_t, isize)
+		__field(loff_t, offset)
+		__field(size_t, count)
+		__field(ssize_t, ret)
+	),
+	TP_fast_assign(
+		const struct inode *inode = file_inode(iocb->ki_filp);
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	iocb->ki_pos;
+		__entry->count		=	iov_iter_count(iter);
+		__entry->ret		=	ret;
+	),
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx pos 0x%llx bytecount 0x%zx ret 0x%zx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->count, __entry->ret)
+)
+#define DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(name)	\
+DEFINE_EVENT(fuse_iomap_file_ioend_class, name,		\
+	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter, \
+		 ssize_t ret), \
+	TP_ARGS(iocb, iter, ret))
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+
+TRACE_EVENT(fuse_iomap_dio_write_end_io,
+	TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
+		 int error, unsigned flags),
+
+	TP_ARGS(inode, pos, written, error, flags),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(unsigned,	dioendflags)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(size_t,		written)
+		__field(int,		error)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->dioendflags	=	flags;
+		__entry->error		=	error;
+		__entry->pos		=	pos;
+		__entry->written	=	written;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx dioendflags (%s) pos 0x%llx written %zd error %d",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->dioendflags, "|", IOMAP_DIOEND_STRINGS),
+		  __entry->pos, __entry->written, __entry->error)
+);
+
+DECLARE_EVENT_CLASS(fuse_inode_state_class,
+	TP_PROTO(const struct inode *inode),
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(unsigned long,	state)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->state		=	fi->state;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx state (%s)",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->state, "|", FUSE_IFLAG_STRINGS))
+);
+#define DEFINE_FUSE_INODE_STATE_EVENT(name)	\
+DEFINE_EVENT(fuse_inode_state_class, name,	\
+	TP_PROTO(const struct inode *inode),	\
+	TP_ARGS(inode))
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 2fe83fc196b021..17ea82e23d7ef7 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -240,6 +240,7 @@
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
  *    SEEK_{DATA,HOLE} support
  *  - add FUSE_DEV_IOC_IOMAP_DEV_ADD to configure block devices for iomap
+ *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -450,6 +451,7 @@ struct fuse_file_lock {
  *			 init_out.request_timeout contains the timeout (in secs)
  * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
  *	       operations.
+ * FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -498,6 +500,7 @@ struct fuse_file_lock {
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
 #define FUSE_IOMAP		(1ULL << 43)
+#define FUSE_IOMAP_DIRECTIO	(1ULL << 44)
 
 /**
  * CUSE INIT request/reply flags
@@ -581,9 +584,11 @@ struct fuse_file_lock {
  *
  * FUSE_ATTR_SUBMOUNT: Object is a submount root
  * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
+ * FUSE_ATTR_IOMAP_DIRECTIO: Use iomap for directio
  */
 #define FUSE_ATTR_SUBMOUNT      (1 << 0)
 #define FUSE_ATTR_DAX		(1 << 1)
+#define FUSE_ATTR_IOMAP_DIRECTIO	(1 << 2)
 
 /**
  * Open flags
@@ -666,6 +671,7 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_IOEND	= 4093,
 	FUSE_IOMAP_BEGIN	= 4094,
 	FUSE_IOMAP_END		= 4095,
 
@@ -1377,4 +1383,27 @@ struct fuse_iomap_end_in {
 	uint32_t map_dev;	/* device cookie * */
 };
 
+/* out of place write extent */
+#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
+/* unwritten extent */
+#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
+/* don't merge into previous ioend */
+#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)
+/* is direct I/O */
+#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)
+
+/* is append ioend */
+#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)
+
+struct fuse_iomap_ioend_in {
+	uint16_t ioendflags;	/* FUSE_IOMAP_IOEND_* */
+	uint16_t reserved;	/* zero */
+	int32_t error;		/* negative errno or 0 */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t new_addr;	/* disk offset of new mapping, in bytes */
+	uint32_t written;	/* bytes processed */
+	uint32_t reserved1;	/* zero */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 5efd763d188559..e991bc1943e6f6 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -713,6 +713,10 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
 	entry->d_time = epoch;
 	fuse_change_entry_timeout(entry, &outentry);
 	fuse_dir_changed(dir);
+
+	if (fuse_has_iomap(inode))
+		fuse_iomap_open(inode, file);
+
 	err = generic_file_open(inode, file);
 	if (!err) {
 		file->private_data = ff;
@@ -1708,6 +1712,9 @@ static int fuse_dir_open(struct inode *inode, struct file *file)
 	if (fuse_is_bad(inode))
 		return -EIO;
 
+	if (fuse_has_iomap(inode))
+		fuse_iomap_open(inode, file);
+
 	err = generic_file_open(inode, file);
 	if (err)
 		return err;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d143990d9ed931..06223e56955ca3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -244,6 +244,9 @@ static int fuse_open(struct inode *inode, struct file *file)
 	if (fuse_is_bad(inode))
 		return -EIO;
 
+	if (fuse_has_iomap(inode))
+		fuse_iomap_open(inode, file);
+
 	err = generic_file_open(inode, file);
 	if (err)
 		return err;
@@ -1712,10 +1715,17 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
 	struct inode *inode = file_inode(file);
+	ssize_t ret;
 
 	if (fuse_is_bad(inode))
 		return -EIO;
 
+	if (fuse_want_iomap_directio(iocb)) {
+		ret = fuse_iomap_direct_read(iocb, to);
+		if (ret != -ENOSYS)
+			return ret;
+	}
+
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_read_iter(iocb, to);
 
@@ -1737,6 +1747,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (fuse_is_bad(inode))
 		return -EIO;
 
+	if (fuse_want_iomap_directio(iocb)) {
+		ssize_t ret = fuse_iomap_direct_write(iocb, from);
+		if (ret != -ENOSYS)
+			return ret;
+	}
+
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_write_iter(iocb, from);
 
@@ -3191,4 +3207,5 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 
 	if (IS_ENABLED(CONFIG_FUSE_DAX))
 		fuse_dax_inode_init(inode, flags);
+	fuse_iomap_init_inode(inode, flags);
 }
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index fb33185852ff0b..3f96cab5de1fb4 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -470,6 +470,70 @@ const struct iomap_ops fuse_iomap_ops = {
 	.iomap_end		= fuse_iomap_end,
 };
 
+static inline bool fuse_want_ioend(const struct fuse_iomap_ioend_in *inarg)
+{
+	/* Always send an ioend for errors. */
+	if (inarg->error)
+		return true;
+
+	/* Send an ioend if we performed an IO involving metadata changes. */
+	return inarg->written > 0 &&
+	       (inarg->ioendflags & (FUSE_IOMAP_IOEND_SHARED |
+				     FUSE_IOMAP_IOEND_UNWRITTEN |
+				     FUSE_IOMAP_IOEND_APPEND));
+}
+
+static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
+			    int error, unsigned ioendflags, sector_t new_addr)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_ioend_in inarg = {
+		.ioendflags = ioendflags,
+		.error = error,
+		.attr_ino = fi->orig_ino,
+		.pos = pos,
+		.written = written,
+		.new_addr = new_addr,
+	};
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	FUSE_ARGS(args);
+	int err = 0;
+
+	if (pos + written > i_size_read(inode))
+		inarg.ioendflags |= FUSE_IOMAP_IOEND_APPEND;
+
+	trace_fuse_iomap_ioend(inode, &inarg);
+
+	if (!fuse_want_ioend(&inarg))
+		goto out;
+
+	args.opcode = FUSE_IOMAP_IOEND;
+	args.nodeid = get_node_id(inode);
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	err = fuse_simple_request(fm, &args);
+
+	trace_fuse_iomap_ioend_error(inode, &inarg, err);
+
+	/*
+	 * Preserve the original error code if userspace didn't respond or
+	 * returned success despite the error we passed along via the ioend.
+	 */
+	if (error && (err == 0 || err == -ENOSYS))
+		err = error;
+
+out:
+	/*
+	 * If there weren't any ioend errors, update the incore isize, which
+	 * confusingly takes the new i_size as "pos".
+	 */
+	if (!error && !err)
+		fuse_write_update_attr(inode, pos + written, written);
+
+	return err;
+}
+
 int fuse_iomap_conn_alloc(struct fuse_conn *fc)
 {
 	idr_init(&fc->iomap_conn.device_map);
@@ -678,3 +742,231 @@ loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
 		return offset;
 	return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
 }
+
+void fuse_iomap_open(struct inode *inode, struct file *file)
+{
+	if (fuse_has_iomap_directio(inode))
+		file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+}
+
+enum fuse_ilock_type {
+	SHARED,
+	EXCL,
+};
+
+static int fuse_iomap_ilock_iocb(const struct kiocb *iocb,
+				 enum fuse_ilock_type type)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		switch (type) {
+		case SHARED:
+			return inode_trylock_shared(inode) ? 0 : -EAGAIN;
+		case EXCL:
+			return inode_trylock(inode) ? 0 : -EAGAIN;
+		default:
+			ASSERT(0);
+			return -EIO;
+		}
+	} else {
+		switch (type) {
+		case SHARED:
+			inode_lock_shared(inode);
+			break;
+		case EXCL:
+			inode_lock(inode);
+			break;
+		default:
+			ASSERT(0);
+			return -EIO;
+		}
+	}
+
+	return 0;
+}
+
+static inline void fuse_iomap_set_directio(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	ASSERT(get_fuse_conn_c(inode)->iomap_directio);
+
+	set_bit(FUSE_I_IOMAP_DIRECTIO, &fi->state);
+}
+
+static inline void fuse_iomap_clear_directio(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	ASSERT(get_fuse_conn_c(inode)->iomap_directio);
+
+	clear_bit(FUSE_I_IOMAP_DIRECTIO, &fi->state);
+}
+
+void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
+{
+	struct fuse_conn *conn = get_fuse_conn(inode);
+
+	if (conn->iomap_directio && (attr_flags & FUSE_ATTR_IOMAP_DIRECTIO))
+		fuse_iomap_set_directio(inode);
+
+	trace_fuse_iomap_init_inode(inode);
+}
+
+void fuse_iomap_evict_inode(struct inode *inode)
+{
+	trace_fuse_iomap_evict_inode(inode);
+
+	if (fuse_has_iomap_directio(inode))
+		fuse_iomap_clear_directio(inode);
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	ASSERT(fuse_has_iomap_directio(inode));
+
+	trace_fuse_iomap_direct_read(iocb, to);
+
+	if (!iov_iter_count(to))
+		return 0; /* skip atime */
+
+	file_accessed(iocb->ki_filp);
+
+	ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+	if (ret)
+		return ret;
+	ret = iomap_dio_rw(iocb, to, &fuse_iomap_ops, NULL, 0, NULL, 0);
+	inode_unlock_shared(inode);
+
+	trace_fuse_iomap_direct_read_end(iocb, to, ret);
+	return ret;
+}
+
+static int fuse_iomap_dio_write_end_io(struct kiocb *iocb, ssize_t written,
+				       int error, unsigned dioflags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	unsigned int nofs_flag;
+	unsigned int ioendflags = FUSE_IOMAP_IOEND_DIRECT;
+	int ret;
+
+	if (fuse_is_bad(inode))
+		return -EIO;
+
+	ASSERT(fuse_has_iomap_directio(inode));
+
+	trace_fuse_iomap_dio_write_end_io(inode, iocb->ki_pos, written, error,
+					  dioflags);
+
+	if (dioflags & IOMAP_DIO_COW)
+		ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+	if (dioflags & IOMAP_DIO_UNWRITTEN)
+		ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+	/*
+	 * We can allocate memory here while doing writeback on behalf of
+	 * memory reclaim.  To avoid memory allocation deadlocks set the
+	 * task-wide nofs context for the following operations.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	ret = fuse_iomap_ioend(inode, iocb->ki_pos, written, error, ioendflags,
+			       FUSE_IOMAP_NULL_ADDR);
+	memalloc_nofs_restore(nofs_flag);
+	return ret;
+}
+
+static const struct iomap_dio_ops fuse_iomap_dio_write_ops = {
+	.end_io		= fuse_iomap_dio_write_end_io,
+};
+
+static int fuse_iomap_direct_write_sync(struct kiocb *iocb, loff_t start,
+					size_t count)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	loff_t end = start + count - 1;
+	int err;
+
+	/* Flush the file metadata, not the page cache. */
+	err = sync_inode_metadata(inode, 1);
+	if (err)
+		return err;
+
+	if (fc->no_fsync)
+		return 0;
+
+	err = fuse_fsync_common(iocb->ki_filp, start, end, iocb_is_dsync(iocb),
+				FUSE_FSYNC);
+	if (err == -ENOSYS) {
+		fc->no_fsync = 1;
+		err = 0;
+	}
+	return err;
+}
+
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	loff_t blockmask = i_blocksize(inode) - 1;
+	loff_t pos = iocb->ki_pos;
+	size_t count = iov_iter_count(from);
+	bool was_dsync = false;
+	ssize_t ret;
+
+	ASSERT(fuse_has_iomap_directio(inode));
+
+	trace_fuse_iomap_direct_write(iocb, from);
+
+	/*
+	 * direct I/O must be aligned to the fsblock size or we fall back to
+	 * the old paths
+	 */
+	if ((iocb->ki_pos | count) & blockmask)
+		return -ENOTBLK;
+
+	/* fuse doesn't support S_SYNC, so complain if we see this. */
+	if (IS_SYNC(inode)) {
+		ASSERT(!IS_SYNC(inode));
+		return -EIO;
+	}
+
+	/*
+	 * Strip off IOCB_DSYNC and run the fsync ourselves.  We hold
+	 * inode_lock, iomap_dio_rw would call generic_write_sync, and
+	 * fuse_fsync would then try to take inode_lock again and deadlock.
+	 */
+	if (iocb_is_dsync(iocb)) {
+		was_dsync = true;
+		iocb->ki_flags &= ~IOCB_DSYNC;
+	}
+
+	ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+	if (ret)
+		goto out_dsync;
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out_unlock;
+
+	ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops,
+			&fuse_iomap_dio_write_ops, 0, NULL, 0);
+	if (ret)
+		goto out_unlock;
+
+	if (was_dsync) {
+		/* Restore IOCB_DSYNC and call our sync function */
+		iocb->ki_flags |= IOCB_DSYNC;
+		ret = fuse_iomap_direct_write_sync(iocb, pos, count);
+	}
+
+out_unlock:
+	inode_unlock(inode);
+out_dsync:
+	trace_fuse_iomap_direct_write_end(iocb, from, ret);
+	if (was_dsync)
+		iocb->ki_flags |= IOCB_DSYNC;
+	return ret;
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8b12284bced7e6..1a17983753c367 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -197,6 +197,8 @@ static void fuse_evict_inode(struct inode *inode)
 		WARN_ON(!list_empty(&fi->write_files));
 		WARN_ON(!list_empty(&fi->queued_writes));
 	}
+
+	fuse_iomap_evict_inode(inode);
 }
 
 static int fuse_reconfigure(struct fs_context *fsc)
@@ -1447,6 +1449,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 
 			if ((flags & FUSE_IOMAP) && fuse_iomap_enabled())
 				fc->iomap = 1;
+			if ((flags & FUSE_IOMAP_DIRECTIO) && fc->iomap)
+				fc->iomap_directio = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1519,7 +1523,7 @@ void fuse_send_init(struct fuse_mount *fm)
 	if (fuse_uring_enabled())
 		flags |= FUSE_OVER_IO_URING;
 	if (fuse_iomap_enabled())
-		flags |= FUSE_IOMAP;
+		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO;
 
 	ia->in.flags = flags;
 	ia->in.flags2 = flags >> 32;
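
For readers following the ioend plumbing in this patch, the filtering rule
in fuse_want_ioend() can be tried outside the kernel.  What follows is a
minimal userspace sketch, not kernel code.  The flag values and the struct
layout are copied from the uapi hunk in this patch; mark_append() and
want_ioend() are hypothetical stand-ins for the corresponding kernel logic,
and the 40-byte size check is just the sum of the listed fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the include/uapi/linux/fuse.h hunk. */
#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)
#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)
#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)

/* Same field order and types as struct fuse_iomap_ioend_in in the patch. */
struct fuse_iomap_ioend_in {
	uint16_t ioendflags;	/* FUSE_IOMAP_IOEND_* */
	uint16_t reserved;	/* zero */
	int32_t error;		/* negative errno or 0 */
	uint64_t attr_ino;	/* matches fuse_attr:ino */
	uint64_t pos;		/* file position, in bytes */
	uint64_t new_addr;	/* disk offset of new mapping, in bytes */
	uint32_t written;	/* bytes processed */
	uint32_t reserved1;	/* zero */
};

/* The fields add up to 40 bytes with no implicit padding. */
static_assert(sizeof(struct fuse_iomap_ioend_in) == 40,
	      "unexpected padding in fuse_iomap_ioend_in");

/* Hypothetical helper: flag a write that ends past the current file size. */
static void mark_append(struct fuse_iomap_ioend_in *in, uint64_t isize)
{
	if (in->pos + in->written > isize)
		in->ioendflags |= FUSE_IOMAP_IOEND_APPEND;
}

/* Hypothetical userspace mirror of the fuse_want_ioend() rule. */
static bool want_ioend(const struct fuse_iomap_ioend_in *in)
{
	/* Errors are always reported. */
	if (in->error)
		return true;

	/* Otherwise only IO that implies a metadata change is reported. */
	return in->written > 0 &&
	       (in->ioendflags & (FUSE_IOMAP_IOEND_SHARED |
				  FUSE_IOMAP_IOEND_UNWRITTEN |
				  FUSE_IOMAP_IOEND_APPEND));
}
```

Under this rule a pure overwrite inside EOF (only FUSE_IOMAP_IOEND_DIRECT
set, no error) produces no upcall, while the same write extending the file
does.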



* [PATCH 06/13] fuse: implement buffered IO with iomap
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-17 23:29   ` [PATCH 05/13] fuse: implement direct IO with iomap Darrick J. Wong
@ 2025-07-17 23:29   ` Darrick J. Wong
  2025-07-18 15:10     ` Amir Goldstein
  2025-07-17 23:29   ` [PATCH 07/13] fuse: enable caching of timestamps Darrick J. Wong
                     ` (6 subsequent siblings)
  12 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:29 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Implement pagecache IO with iomap, complete with hooks into truncate and
fallocate so that the fuse server needn't implement disk block zeroing
of post-EOF and unaligned punch/zero regions.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h          |   46 +++
 fs/fuse/fuse_trace.h      |  391 ++++++++++++++++++++++++
 include/uapi/linux/fuse.h |    5 
 fs/fuse/dir.c             |   23 +
 fs/fuse/file.c            |   90 +++++-
 fs/fuse/file_iomap.c      |  723 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c           |   14 +
 7 files changed, 1268 insertions(+), 24 deletions(-)
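
As a rough illustration of what "unaligned punch/zero regions" means in the
description above, the sketch below splits a byte range into the partial
blocks at either edge (which would need zeroing) and the whole blocks in
between.  This is arithmetic only and is not taken from the patch: the
struct and function names are hypothetical, the block size is assumed to be
a power of two, and overflow of offset + length is ignored.

```c
#include <stdint.h>

struct byte_range {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical result type: the three pieces of a punch/zero request. */
struct punch_split {
	struct byte_range head;		/* partial block at the front */
	struct byte_range middle;	/* whole blocks */
	struct byte_range tail;		/* partial block at the end */
};

/* blocksize must be a power of two for the masks below to work. */
static uint64_t round_up_blk(uint64_t x, uint64_t blocksize)
{
	return (x + blocksize - 1) & ~(blocksize - 1);
}

static uint64_t round_down_blk(uint64_t x, uint64_t blocksize)
{
	return x & ~(blocksize - 1);
}

static struct punch_split split_punch(uint64_t offset, uint64_t length,
				      uint64_t blocksize)
{
	struct punch_split s = { { 0, 0 }, { 0, 0 }, { 0, 0 } };
	uint64_t end = offset + length;
	uint64_t first = round_up_blk(offset, blocksize);
	uint64_t last = round_down_blk(end, blocksize);

	if (first > last) {
		/* The range sits strictly inside a single block. */
		s.head.start = offset;
		s.head.len = length;
		return s;
	}

	s.head.start = offset;
	s.head.len = first - offset;
	s.middle.start = first;
	s.middle.len = last - first;
	s.tail.start = last;
	s.tail.len = end - last;
	return s;
}
```

With a 4096-byte block size, split_punch(1000, 10000, 4096) yields a head
of 3096 bytes at offset 1000, one whole block at offset 4096, and a tail of
2808 bytes at offset 8192.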


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 67e428da4391aa..f33b348d296d5e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -161,6 +161,13 @@ struct fuse_inode {
 
 			/* waitq for direct-io completion */
 			wait_queue_head_t direct_io_waitq;
+
+#ifdef CONFIG_FUSE_IOMAP
+			/* pending io completions */
+			spinlock_t ioend_lock;
+			struct work_struct ioend_work;
+			struct list_head ioend_list;
+#endif
 		};
 
 		/* readdir cache (directory only) */
@@ -228,6 +235,8 @@ enum {
 	FUSE_I_CACHE_IO_MODE,
 	/* Use iomap for directio reads and writes */
 	FUSE_I_IOMAP_DIRECTIO,
+	/* Use iomap for buffered reads and writes */
+	FUSE_I_IOMAP_FILEIO,
 };
 
 struct fuse_conn;
@@ -916,6 +925,9 @@ struct fuse_conn {
 	/* Use fs/iomap for direct I/O operations */
 	unsigned int iomap_directio:1;
 
+	/* Use fs/iomap for buffered I/O operations */
+	unsigned int iomap_fileio:1;
+
 	/* Use io_uring for communication */
 	unsigned int io_uring;
 
@@ -1631,6 +1643,9 @@ extern void fuse_sysctl_unregister(void);
 #define fuse_sysctl_unregister()	do { } while (0)
 #endif /* CONFIG_SYSCTL */
 
+sector_t fuse_bmap(struct address_space *mapping, sector_t block);
+ssize_t fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
+
 #if IS_ENABLED(CONFIG_FUSE_IOMAP)
 # include <linux/fiemap.h>
 # include <linux/iomap.h>
@@ -1674,6 +1689,28 @@ static inline bool fuse_want_iomap_directio(const struct kiocb *iocb)
 
 ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
 ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
+
+static inline bool fuse_has_iomap_fileio(const struct inode *inode)
+{
+	const struct fuse_inode *fi = get_fuse_inode_c(inode);
+
+	return test_bit(FUSE_I_IOMAP_FILEIO, &fi->state);
+}
+
+static inline bool fuse_want_iomap_buffered_io(const struct kiocb *iocb)
+{
+	return fuse_has_iomap_fileio(file_inode(iocb->ki_filp));
+}
+
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma);
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from);
+int fuse_iomap_setsize(struct inode *inode, loff_t newsize);
+void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits);
+int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
+			 loff_t length, loff_t new_size);
+int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
+				 loff_t endpos);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1692,6 +1729,15 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
 # define fuse_want_iomap_directio(...)		(false)
 # define fuse_iomap_direct_read(...)		(-ENOSYS)
 # define fuse_iomap_direct_write(...)		(-ENOSYS)
+# define fuse_has_iomap_fileio(...)		(false)
+# define fuse_want_iomap_buffered_io(...)	(false)
+# define fuse_iomap_mmap(...)			(-ENOSYS)
+# define fuse_iomap_buffered_read(...)		(-ENOSYS)
+# define fuse_iomap_buffered_write(...)		(-ENOSYS)
+# define fuse_iomap_setsize(...)		(-ENOSYS)
+# define fuse_iomap_set_i_blkbits(...)		((void)0)
+# define fuse_iomap_fallocate(...)		(-ENOSYS)
+# define fuse_iomap_flush_unmap_range(...)	(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index b888ae40e1116e..5d9b5a4e93fca5 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -180,6 +180,7 @@ TRACE_DEFINE_ENUM(FUSE_I_BAD);
 TRACE_DEFINE_ENUM(FUSE_I_BTIME);
 TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
 TRACE_DEFINE_ENUM(FUSE_I_IOMAP_DIRECTIO);
+TRACE_DEFINE_ENUM(FUSE_I_IOMAP_FILEIO);
 
 #define FUSE_IFLAG_STRINGS \
 	{ 1 << FUSE_I_ADVISE_RDPLUS,		"advise_rdplus" }, \
@@ -188,7 +189,14 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP_DIRECTIO);
 	{ 1 << FUSE_I_BAD,			"bad" }, \
 	{ 1 << FUSE_I_BTIME,			"btime" }, \
 	{ 1 << FUSE_I_CACHE_IO_MODE,		"cacheio" }, \
-	{ 1 << FUSE_I_IOMAP_DIRECTIO,		"iomap_dio" }
+	{ 1 << FUSE_I_IOMAP_DIRECTIO,		"iomap_dio" }, \
+	{ 1 << FUSE_I_IOMAP_FILEIO,		"iomap_fileio" }
+
+#define IOMAP_IOEND_STRINGS \
+	{ IOMAP_IOEND_SHARED,			"shared" }, \
+	{ IOMAP_IOEND_UNWRITTEN,		"unwritten" }, \
+	{ IOMAP_IOEND_BOUNDARY,			"boundary" }, \
+	{ IOMAP_IOEND_DIRECT,			"direct" }
 
 TRACE_EVENT(fuse_iomap_begin,
 	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
@@ -684,6 +692,9 @@ DEFINE_EVENT(fuse_iomap_file_io_class, name,		\
 	TP_ARGS(iocb, iter))
 DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
 DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_write_zero_eof);
 
 DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
 	TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
@@ -722,6 +733,8 @@ DEFINE_EVENT(fuse_iomap_file_ioend_class, name,		\
 	TP_ARGS(iocb, iter, ret))
 DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
 DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_write_end);
 
 TRACE_EVENT(fuse_iomap_dio_write_end_io,
 	TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
@@ -795,6 +808,382 @@ DEFINE_EVENT(fuse_inode_state_class, name,	\
 	TP_ARGS(inode))
 DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
 DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
+
+TRACE_EVENT(fuse_iomap_end_ioend,
+	TP_PROTO(const struct iomap_ioend *ioend),
+
+	TP_ARGS(ioend),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		offset)
+		__field(size_t,		size)
+		__field(unsigned int,	ioendflags)
+		__field(int,		error)
+	),
+
+	TP_fast_assign(
+		const struct inode *inode = ioend->io_inode;
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	ioend->io_offset;
+		__entry->size		=	ioend->io_size;
+		__entry->ioendflags	=	ioend->io_flags;
+		__entry->error		=
+				blk_status_to_errno(ioend->io_bio.bi_status);
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx offset 0x%llx size %zu ioendflags (%s) error %d",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->size,
+		  __print_flags(__entry->ioendflags, "|", IOMAP_IOEND_STRINGS),
+		  __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_map_blocks,
+	TP_PROTO(const struct inode *inode, loff_t offset, unsigned int count),
+
+	TP_ARGS(inode, offset, count),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		offset)
+		__field(unsigned int,	count)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	offset;
+		__entry->count		=	count;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx offset 0x%llx count 0x%x",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_submit_ioend,
+	TP_PROTO(const struct iomap_writepage_ctx *wpc, int error),
+
+	TP_ARGS(wpc, error),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(size_t,		len)
+		__field(unsigned int,	nr_folios)
+		__field(u64,		addr)
+		__field(int,		error)
+	),
+
+	TP_fast_assign(
+		const struct inode *inode = wpc->ioend->io_inode;
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->nr_folios	=	wpc->nr_folios;
+		__entry->pos		=	wpc->ioend->io_offset;
+		__entry->len		=	wpc->ioend->io_size;
+		__entry->addr		=	wpc->ioend->io_sector << 9;
+		__entry->error		=	error;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx pos 0x%llx len 0x%zx addr 0x%llx nr_folios %u error %d",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->pos, __entry->len, __entry->addr,
+		  __entry->nr_folios, __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_discard_folio,
+	TP_PROTO(const struct inode *inode, loff_t offset, size_t count),
+
+	TP_ARGS(inode, offset, count),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		offset)
+		__field(size_t,		count)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	offset;
+		__entry->count		=	count;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx offset 0x%llx count 0x%zx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_writepages,
+	TP_PROTO(const struct inode *inode, const struct writeback_control *wbc),
+
+	TP_ARGS(inode, wbc),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		start)
+		__field(loff_t,		end)
+		__field(long,		nr_to_write)
+		__field(bool,		sync_all)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->start		=	wbc->range_start;
+		__entry->end		=	wbc->range_end;
+		__entry->nr_to_write	=	wbc->nr_to_write;
+		__entry->sync_all	=	wbc->sync_mode == WB_SYNC_ALL;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx start 0x%llx end 0x%llx nr %ld sync_all? %d",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->start, __entry->end,
+		  __entry->nr_to_write, __entry->sync_all)
+);
+
+TRACE_EVENT(fuse_iomap_read_folio,
+	TP_PROTO(const struct folio *folio),
+
+	TP_ARGS(folio),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(size_t,		count)
+	),
+
+	TP_fast_assign(
+		const struct inode *inode = folio->mapping->host;
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->pos		=	folio_pos(folio);
+		__entry->count		=	folio_size(folio);
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx offset 0x%llx count 0x%zx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->pos, __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_readahead,
+	TP_PROTO(const struct readahead_control *rac),
+
+	TP_ARGS(rac),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(size_t,		count)
+	),
+
+	TP_fast_assign(
+		const struct inode *inode = file_inode(rac->file);
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+		struct readahead_control *mutrac = (struct readahead_control *)rac;
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->pos		=	readahead_pos(mutrac);
+		__entry->count		=	readahead_length(mutrac);
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx offset 0x%llx count 0x%zx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->pos, __entry->count)
+);
+
+TRACE_EVENT(fuse_iomap_page_mkwrite,
+	TP_PROTO(const struct vm_fault *vmf),
+
+	TP_ARGS(vmf),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		pos)
+		__field(size_t,		count)
+	),
+
+	TP_fast_assign(
+		const struct inode *inode = file_inode(vmf->vma->vm_file);
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+		struct folio *folio = page_folio(vmf->page);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->pos		=	folio_pos(folio);
+		__entry->count		=	folio_size(folio);
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx offset 0x%llx count 0x%zx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->pos, __entry->count)
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_range_class,
+	TP_PROTO(const struct inode *inode, loff_t offset, loff_t length),
+	TP_ARGS(inode, offset, length),
+	TP_STRUCT__entry(
+		__field(dev_t, connection)
+		__field(uint64_t, ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t, offset)
+		__field(loff_t, length)
+	),
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	offset;
+		__entry->length		=	length;
+	),
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx pos 0x%llx bytecount 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->length)
+)
+#define DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(name)		\
+DEFINE_EVENT(fuse_iomap_file_range_class, name,		\
+	TP_PROTO(const struct inode *inode, loff_t offset, loff_t length), \
+	TP_ARGS(inode, offset, length))
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_up);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_flush_unmap_range);
+
+TRACE_EVENT(fuse_iomap_set_i_blkbits,
+	TP_PROTO(const struct inode *inode, u8 new_blkbits),
+	TP_ARGS(inode, new_blkbits),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(u8,		old_blkbits)
+		__field(u8,		new_blkbits)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->old_blkbits	=	inode->i_blkbits;
+		__entry->new_blkbits	=	new_blkbits;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx old_blkbits %u new_blkbits %u",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->old_blkbits, __entry->new_blkbits)
+);
+
+TRACE_EVENT(fuse_iomap_fallocate,
+	TP_PROTO(const struct inode *inode, int mode, loff_t offset,
+		 loff_t length, loff_t newsize),
+	TP_ARGS(inode, mode, offset, length, newsize),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		offset)
+		__field(loff_t,		length)
+		__field(loff_t,		newsize)
+		__field(int,		mode)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->mode		=	mode;
+		__entry->offset		=	offset;
+		__entry->length		=	length;
+		__entry->newsize	=	newsize;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx mode 0x%x offset 0x%llx length 0x%llx newsize 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->mode, __entry->offset,
+		  __entry->length, __entry->newsize)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 17ea82e23d7ef7..71129db79a1dd0 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -241,6 +241,7 @@
  *    SEEK_{DATA,HOLE} support
  *  - add FUSE_DEV_IOC_IOMAP_DEV_ADD to configure block devices for iomap
  *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
+ *  - add FUSE_IOMAP_FILEIO/FUSE_ATTR_IOMAP_FILEIO for buffered I/O support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -452,6 +453,7 @@ struct fuse_file_lock {
  * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
  *	       operations.
  * FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
+ * FUSE_IOMAP_FILEIO: Client supports iomap for buffered I/O operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -501,6 +503,7 @@ struct fuse_file_lock {
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
 #define FUSE_IOMAP		(1ULL << 43)
 #define FUSE_IOMAP_DIRECTIO	(1ULL << 44)
+#define FUSE_IOMAP_FILEIO	(1ULL << 45)
 
 /**
  * CUSE INIT request/reply flags
@@ -585,10 +588,12 @@ struct fuse_file_lock {
  * FUSE_ATTR_SUBMOUNT: Object is a submount root
  * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
  * FUSE_ATTR_IOMAP_DIRECTIO: Use iomap for directio
+ * FUSE_ATTR_IOMAP_FILEIO: Use iomap for buffered io
  */
 #define FUSE_ATTR_SUBMOUNT      (1 << 0)
 #define FUSE_ATTR_DAX		(1 << 1)
 #define FUSE_ATTR_IOMAP_DIRECTIO	(1 << 2)
+#define FUSE_ATTR_IOMAP_FILEIO	(1 << 3)
 
 /**
  * Open flags
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index e991bc1943e6f6..7a398e42e9818b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1984,7 +1984,10 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 		is_truncate = true;
 	}
 
-	if (FUSE_IS_DAX(inode) && is_truncate) {
+	if (fuse_has_iomap_fileio(inode) && is_truncate) {
+		filemap_invalidate_lock(mapping);
+		fault_blocked = true;
+	} else if (FUSE_IS_DAX(inode) && is_truncate) {
 		filemap_invalidate_lock(mapping);
 		fault_blocked = true;
 		err = fuse_dax_break_layouts(inode, 0, -1);
@@ -1999,6 +2002,18 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 		WARN_ON(!(attr->ia_valid & ATTR_SIZE));
 		WARN_ON(attr->ia_size != 0);
 		if (fc->atomic_o_trunc) {
+			if (fuse_has_iomap_fileio(inode)) {
+				/*
+				 * fuse_open already set the size to zero and
+				 * truncated the pagecache, and we've since
+				 * cycled the inode locks.  Another thread
+				 * could have performed an appending write, so
+				 * we don't want to touch the file further.
+				 */
+				filemap_invalidate_unlock(mapping);
+				return 0;
+			}
+
 			/*
 			 * No need to send request to userspace, since actual
 			 * truncation has already been done by OPEN.  But still
@@ -2071,6 +2086,12 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 		goto error;
 	}
 
+	if (fuse_has_iomap_fileio(inode) && is_truncate) {
+		err = fuse_iomap_setsize(inode, outarg.attr.size);
+		if (err)
+			goto error;
+	}
+
 	spin_lock(&fi->lock);
 	/* the kernel maintains i_mtime locally */
 	if (trust_local_cmtime) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 06223e56955ca3..2dd4e5c2933c0f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -384,7 +384,7 @@ static int fuse_release(struct inode *inode, struct file *file)
 	 * Dirty pages might remain despite write_inode_now() call from
 	 * fuse_flush() due to writes racing with the close.
 	 */
-	if (fc->writeback_cache)
+	if (fc->writeback_cache || fuse_has_iomap_fileio(inode))
 		write_inode_now(inode, 1);
 
 	fuse_release_common(file, false);
@@ -1668,8 +1668,6 @@ static ssize_t __fuse_direct_read(struct fuse_io_priv *io,
 	return res;
 }
 
-static ssize_t fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
-
 static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t res;
@@ -1726,6 +1724,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			return ret;
 	}
 
+	if (fuse_want_iomap_buffered_io(iocb))
+		return fuse_iomap_buffered_read(iocb, to);
+
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_read_iter(iocb, to);
 
@@ -1749,10 +1750,29 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	if (fuse_want_iomap_directio(iocb)) {
 		ssize_t ret = fuse_iomap_direct_write(iocb, from);
-		if (ret != -ENOSYS)
+		switch (ret) {
+		case -ENOTBLK:
+			/*
+			 * If we're going to fall back to the iomap buffered
+			 * write path only, then try the write again as a
+			 * synchronous buffered write.  Otherwise we let it
+			 * drop through to the old ->direct_IO path.
+			 */
+			if (fuse_want_iomap_buffered_io(iocb))
+				iocb->ki_flags |= IOCB_SYNC;
+			fallthrough;
+		case -ENOSYS:
+			/* no implementation, fall through */
+			break;
+		default:
+			/* errors, no progress, or even partial progress */
 			return ret;
+		}
 	}
 
+	if (fuse_want_iomap_buffered_io(iocb))
+		return fuse_iomap_buffered_write(iocb, from);
+
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_write_iter(iocb, from);
 
@@ -2378,6 +2398,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	struct inode *inode = file_inode(file);
 	int rc;
 
+	if (fuse_has_iomap_fileio(inode))
+		return fuse_iomap_mmap(file, vma);
+
 	/* DAX mmap is superior to direct_io mmap */
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_mmap(file, vma);
@@ -2576,7 +2599,7 @@ static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
 	return err;
 }
 
-static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
+sector_t fuse_bmap(struct address_space *mapping, sector_t block)
 {
 	struct inode *inode = mapping->host;
 	struct fuse_mount *fm = get_fuse_mount(inode);
@@ -2832,8 +2855,7 @@ static inline loff_t fuse_round_up(struct fuse_conn *fc, loff_t off)
 	return round_up(off, fc->max_pages << PAGE_SHIFT);
 }
 
-static ssize_t
-fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
 	ssize_t ret = 0;
@@ -2930,8 +2952,12 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end)
 {
-	int err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
+	int err;
 
+	if (fuse_has_iomap_fileio(inode))
+		return fuse_iomap_flush_unmap_range(inode, start, end);
+
+	err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
 	if (!err)
 		fuse_sync_writes(inode);
 
@@ -2952,6 +2978,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		.length = length,
 		.mode = mode
 	};
+	loff_t newsize = 0;
 	int err;
 	bool block_faults = FUSE_IS_DAX(inode) &&
 		(!(mode & FALLOC_FL_KEEP_SIZE) ||
@@ -2965,7 +2992,10 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		return -EOPNOTSUPP;
 
 	inode_lock(inode);
-	if (block_faults) {
+	if (fuse_has_iomap_fileio(inode)) {
+		filemap_invalidate_lock(inode->i_mapping);
+		block_faults = true;
+	} else if (block_faults) {
 		filemap_invalidate_lock(inode->i_mapping);
 		err = fuse_dax_break_layouts(inode, 0, -1);
 		if (err)
@@ -2980,11 +3010,23 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 			goto out;
 	}
 
+	/*
+	 * If we are using iomap for file IO, fallocate must wait for all AIO
+	 * to complete before we continue as AIO can change the file size on
+	 * completion without holding any locks we currently hold. We must do
+	 * this first because AIO can update the in-memory inode size, and the
+	 * operations that follow require the in-memory size to be fully
+	 * up-to-date.
+	 */
+	if (fuse_has_iomap_fileio(inode))
+		inode_dio_wait(inode);
+
 	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
 	    offset + length > i_size_read(inode)) {
 		err = inode_newsize_ok(inode, offset + length);
 		if (err)
 			goto out;
+		newsize = offset + length;
 	}
 
 	err = file_modified(file);
@@ -3007,14 +3049,23 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	if (err)
 		goto out;
 
-	/* we could have extended the file */
-	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
-		if (fuse_write_update_attr(inode, offset + length, length))
-			file_update_time(file);
-	}
+	if (fuse_has_iomap_fileio(inode)) {
+		err = fuse_iomap_fallocate(file, mode, offset, length,
+					   newsize);
+		if (err)
+			goto out;
+	} else {
+		/* we could have extended the file */
+		if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+			if (fuse_write_update_attr(inode, offset + length,
+						   length))
+				file_update_time(file);
+		}
 
-	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
-		truncate_pagecache_range(inode, offset, offset + length - 1);
+		if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
+			truncate_pagecache_range(inode, offset,
+						 offset + length - 1);
+	}
 
 	fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
 
@@ -3100,6 +3151,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (err)
 		goto out;
 
+	/* See inode_dio_wait comment in fuse_file_fallocate */
+	if (fuse_has_iomap_fileio(inode_out))
+		inode_dio_wait(inode_out);
+
 	if (is_unstable)
 		set_bit(FUSE_I_SIZE_UNSTABLE, &fi_out->state);
 
@@ -3119,7 +3174,8 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (err)
 		goto out;
 
-	truncate_inode_pages_range(inode_out->i_mapping,
+	if (!fuse_has_iomap_fileio(inode_out))
+		truncate_inode_pages_range(inode_out->i_mapping,
 				   ALIGN_DOWN(pos_out, PAGE_SIZE),
 				   ALIGN(pos_out + outarg.size, PAGE_SIZE) - 1);
 
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 3f96cab5de1fb4..ab0dee6460a7dd 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -6,6 +6,8 @@
 #include "fuse_i.h"
 #include "fuse_trace.h"
 #include <linux/iomap.h>
+#include <linux/pagemap.h>
+#include <linux/falloc.h>
 
 static bool __read_mostly enable_iomap =
 #if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
@@ -747,6 +749,8 @@ void fuse_iomap_open(struct inode *inode, struct file *file)
 {
 	if (fuse_has_iomap_directio(inode))
 		file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+	if (fuse_has_iomap_fileio(inode))
+		file->f_mode |= FMODE_NOWAIT;
 }
 
 enum fuse_ilock_type {
@@ -804,12 +808,26 @@ static inline void fuse_iomap_clear_directio(struct inode *inode)
 	clear_bit(FUSE_I_IOMAP_DIRECTIO, &fi->state);
 }
 
+static inline void fuse_iomap_set_fileio(struct inode *inode);
+
+static inline void fuse_iomap_clear_fileio(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	ASSERT(get_fuse_conn_c(inode)->iomap_fileio);
+	ASSERT(list_empty(&fi->ioend_list));
+
+	clear_bit(FUSE_I_IOMAP_FILEIO, &fi->state);
+}
+
 void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
 {
 	struct fuse_conn *conn = get_fuse_conn(inode);
 
 	if (conn->iomap_directio && (attr_flags & FUSE_ATTR_IOMAP_DIRECTIO))
 		fuse_iomap_set_directio(inode);
+	if (conn->iomap_fileio && (attr_flags & FUSE_ATTR_IOMAP_FILEIO))
+		fuse_iomap_set_fileio(inode);
 
 	trace_fuse_iomap_init_inode(inode);
 }
@@ -820,6 +838,8 @@ void fuse_iomap_evict_inode(struct inode *inode)
 
 	if (fuse_has_iomap_directio(inode))
 		fuse_iomap_clear_directio(inode);
+	if (fuse_has_iomap_fileio(inode))
+		fuse_iomap_clear_fileio(inode);
 }
 
 ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
@@ -908,6 +928,109 @@ static int fuse_iomap_direct_write_sync(struct kiocb *iocb, loff_t start,
 	return err;
 }
 
+static int
+fuse_iomap_zero_range(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			len,
+	bool			*did_zero)
+{
+	return iomap_zero_range(inode, pos, len, did_zero, &fuse_iomap_ops,
+				NULL);
+}
+
+/* Take care of zeroing post-EOF blocks when they might exist. */
+static ssize_t
+fuse_iomap_write_zero_eof(
+	struct kiocb		*iocb,
+	struct iov_iter		*from,
+	bool			*drained_dio)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	loff_t			isize;
+	int			error;
+
+	/*
+	 * We need to serialise against EOF updates that occur in IO
+	 * completions here. We want to make sure that nobody is changing the
+	 * size while we do this check until we have placed an IO barrier (i.e.
+	 * hold i_rwsem exclusively) that prevents new IO from being
+	 * dispatched.  The spinlock effectively forms a memory barrier once we
+	 * have i_rwsem exclusively so we are guaranteed to see the latest EOF
+	 * value and hence be able to correctly determine if we need to run
+	 * zeroing.
+	 */
+	spin_lock(&fi->lock);
+	isize = i_size_read(inode);
+	if (iocb->ki_pos <= isize) {
+		spin_unlock(&fi->lock);
+		return 0;
+	}
+	spin_unlock(&fi->lock);
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		return -EAGAIN;
+
+	if (!(*drained_dio)) {
+		/*
+		 * We now have an IO submission barrier in place, but AIO can
+		 * do EOF updates during IO completion and hence we now need to
+		 * wait for all of them to drain.  Non-AIO DIO will have
+		 * drained before we are given the exclusive i_rwsem, and so
+		 * for most cases this wait is a no-op.
+		 */
+		inode_dio_wait(inode);
+		*drained_dio = true;
+		return 1;
+	}
+
+	trace_fuse_iomap_write_zero_eof(iocb, from);
+
+	filemap_invalidate_lock(mapping);
+	error = fuse_iomap_zero_range(inode, isize, iocb->ki_pos - isize, NULL);
+	filemap_invalidate_unlock(mapping);
+
+	return error;
+}
+
+static ssize_t
+fuse_iomap_write_checks(
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	struct inode		*inode = iocb->ki_filp->f_mapping->host;
+	ssize_t			error;
+	bool			drained_dio = false;
+
+restart:
+	error = generic_write_checks(iocb, from);
+	if (error <= 0)
+		return error;
+
+	/*
+	 * If the offset is beyond the size of the file, we need to zero all
+	 * blocks that fall between the existing EOF and the start of this
+	 * write.
+	 *
+	 * We can do an unlocked check for i_size here safely as I/O completion
+	 * can only extend EOF.  Truncate is locked out at this point, so the
+	 * EOF cannot move backwards, only forwards. Hence we only need to take
+	 * the slow path when we are at or beyond the current EOF.
+	 */
+	if (fuse_has_iomap_fileio(inode) &&
+	    iocb->ki_pos > i_size_read(inode)) {
+		error = fuse_iomap_write_zero_eof(iocb, from, &drained_dio);
+		if (error == 1)
+			goto restart;
+		if (error)
+			return error;
+	}
+
+	return kiocb_modified(iocb);
+}
+
 ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -947,8 +1070,9 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ret = fuse_iomap_ilock_iocb(iocb, EXCL);
 	if (ret)
 		goto out_dsync;
-	ret = generic_write_checks(iocb, from);
-	if (ret <= 0)
+
+	ret = fuse_iomap_write_checks(iocb, from);
+	if (ret)
 		goto out_unlock;
 
 	ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops,
@@ -970,3 +1094,598 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		iocb->ki_flags |= IOCB_DSYNC;
 	return ret;
 }
+
+struct fuse_writepage_ctx {
+	struct iomap_writepage_ctx ctx;
+};
+
+static void fuse_iomap_end_ioend(struct iomap_ioend *ioend)
+{
+	struct inode *inode = ioend->io_inode;
+	unsigned int ioendflags = 0;
+	unsigned int nofs_flag;
+	int error = blk_status_to_errno(ioend->io_bio.bi_status);
+
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	if (fuse_is_bad(inode))
+		return;
+
+	trace_fuse_iomap_end_ioend(ioend);
+
+	if (ioend->io_flags & IOMAP_IOEND_SHARED)
+		ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+	if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
+		ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+	/*
+	 * We can allocate memory here while doing writeback on behalf of
+	 * memory reclaim.  To avoid memory allocation deadlocks set the
+	 * task-wide nofs context for the following operations.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	fuse_iomap_ioend(inode, ioend->io_offset, ioend->io_size, error,
+			 ioendflags, FUSE_IOMAP_NULL_ADDR);
+	iomap_finish_ioends(ioend, error);
+	memalloc_nofs_restore(nofs_flag);
+}
+
+/*
+ * Finish all pending IO completions that require transactional modifications.
+ *
+ * We try to merge physically and logically contiguous ioends before completion to
+ * minimise the number of transactions we need to perform during IO completion.
+ * Both unwritten extent conversion and COW remapping need to iterate and modify
+ * one physical extent at a time, so we gain nothing by merging physically
+ * discontiguous extents here.
+ *
+ * The ioend chain length that we can be processing here is largely unbounded in
+ * length and we may have to perform significant amounts of work on each ioend
+ * to complete it. Hence we have to be careful about holding the CPU for too
+ * long in this loop.
+ */
+static void fuse_iomap_end_io(struct work_struct *work)
+{
+	struct fuse_inode *fi =
+		container_of(work, struct fuse_inode, ioend_work);
+	struct iomap_ioend *ioend;
+	struct list_head tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&fi->ioend_lock, flags);
+	list_replace_init(&fi->ioend_list, &tmp);
+	spin_unlock_irqrestore(&fi->ioend_lock, flags);
+
+	iomap_sort_ioends(&tmp);
+	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
+			io_list))) {
+		list_del_init(&ioend->io_list);
+		iomap_ioend_try_merge(ioend, &tmp);
+		fuse_iomap_end_ioend(ioend);
+		cond_resched();
+	}
+}
+
+static void fuse_iomap_end_bio(struct bio *bio)
+{
+	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+	struct inode *inode = ioend->io_inode;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	unsigned long flags;
+
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	spin_lock_irqsave(&fi->ioend_lock, flags);
+	if (list_empty(&fi->ioend_list))
+		WARN_ON_ONCE(!queue_work(system_unbound_wq, &fi->ioend_work));
+	list_add_tail(&ioend->io_list, &fi->ioend_list);
+	spin_unlock_irqrestore(&fi->ioend_lock, flags);
+}
+
+/*
+ * Fast revalidation of the cached writeback mapping. Return true if the current
+ * mapping is valid, false otherwise.
+ */
+static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc,
+					    loff_t offset)
+{
+	if (offset < wpc->iomap.offset ||
+	    offset >= wpc->iomap.offset + wpc->iomap.length)
+		return false;
+
+	/* XXX actually use revalidation cookie */
+	return true;
+}
+
+static int fuse_iomap_map_blocks(struct iomap_writepage_ctx *wpc,
+				 struct inode *inode, loff_t offset,
+				 unsigned int len)
+{
+	struct iomap write_iomap, dontcare;
+	int ret;
+
+	if (fuse_is_bad(inode))
+		return -EIO;
+
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	trace_fuse_iomap_map_blocks(inode, offset, len);
+
+	if (fuse_iomap_revalidate_writeback(wpc, offset))
+		return 0;
+
+	/* Pretend that this is a directio write */
+	ret = fuse_iomap_begin(inode, offset, len, IOMAP_DIRECT | IOMAP_WRITE,
+			       &write_iomap, &dontcare);
+	if (ret)
+		return ret;
+
+	/*
+	 * Landed in a hole or beyond EOF?  Send that to iomap, it'll skip
+	 * writing back the file range.
+	 */
+	if (write_iomap.offset > offset) {
+		write_iomap.length = write_iomap.offset - offset;
+		write_iomap.offset = offset;
+		write_iomap.type = IOMAP_HOLE;
+	}
+
+	memcpy(&wpc->iomap, &write_iomap, sizeof(struct iomap));
+	return 0;
+}
+
+static int fuse_iomap_submit_ioend(struct iomap_writepage_ctx *wpc, int status)
+{
+	struct iomap_ioend *ioend = wpc->ioend;
+
+	ASSERT(fuse_has_iomap_fileio(ioend->io_inode));
+
+	trace_fuse_iomap_submit_ioend(wpc, status);
+
+	/* always call our ioend function, even if we cancel the bio */
+	ioend->io_bio.bi_end_io = fuse_iomap_end_bio;
+
+	if (status)
+		return status;
+	submit_bio(&ioend->io_bio);
+	return 0;
+}
+
+/*
+ * If the folio has delalloc blocks on it, the caller is asking us to punch them
+ * out. If we don't, we can leave a stale delalloc mapping covered by a clean
+ * page that needs to be dirtied again before the delalloc mapping can be
+ * converted. This stale delalloc mapping can trip up a later direct I/O read
+ * operation on the same region.
+ *
+ * We prevent this by truncating away the delalloc regions on the folio. Because
+ * they are delalloc, we can do this without needing a transaction. Indeed - if
+ * we get ENOSPC errors, we have to be able to do this truncation without a
+ * transaction as there is no space left for block reservation (typically why
+ * we see an ENOSPC in writeback).
+ */
+static void fuse_iomap_discard_folio(struct folio *folio, loff_t pos)
+{
+	struct inode *inode = folio->mapping->host;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	if (fuse_is_bad(inode))
+		return;
+
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	trace_fuse_iomap_discard_folio(inode, pos, folio_size(folio));
+
+	printk_ratelimited(KERN_ERR
+		"page discard on page %px, inode 0x%llx, pos %llu.\n",
+			folio, fi->orig_ino, pos);
+
+	/* XXX actually punch the new delalloc ranges? */
+}
+
+static const struct iomap_writeback_ops fuse_iomap_writeback_ops = {
+	.map_blocks		= fuse_iomap_map_blocks,
+	.submit_ioend		= fuse_iomap_submit_ioend,
+	.discard_folio		= fuse_iomap_discard_folio,
+};
+
+static int fuse_iomap_writepages(struct address_space *mapping,
+				 struct writeback_control *wbc)
+{
+	struct fuse_writepage_ctx wpc = { };
+
+	ASSERT(fuse_has_iomap_fileio(mapping->host));
+
+	trace_fuse_iomap_writepages(mapping->host, wbc);
+
+	return iomap_writepages(mapping, wbc, &wpc.ctx,
+				&fuse_iomap_writeback_ops);
+}
+
+static int fuse_iomap_read_folio(struct file *file, struct folio *folio)
+{
+	ASSERT(fuse_has_iomap_fileio(file_inode(file)));
+
+	trace_fuse_iomap_read_folio(folio);
+
+	return iomap_read_folio(folio, &fuse_iomap_ops);
+}
+
+static void fuse_iomap_readahead(struct readahead_control *rac)
+{
+	ASSERT(fuse_has_iomap_fileio(file_inode(rac->file)));
+
+	trace_fuse_iomap_readahead(rac);
+
+	iomap_readahead(rac, &fuse_iomap_ops);
+}
+
+static const struct address_space_operations fuse_iomap_aops = {
+	.read_folio		= fuse_iomap_read_folio,
+	.readahead		= fuse_iomap_readahead,
+	.writepages		= fuse_iomap_writepages,
+	.dirty_folio		= iomap_dirty_folio,
+	.release_folio		= iomap_release_folio,
+	.invalidate_folio	= iomap_invalidate_folio,
+	.migrate_folio		= filemap_migrate_folio,
+	.is_partially_uptodate  = iomap_is_partially_uptodate,
+	.error_remove_folio	= generic_error_remove_folio,
+
+	/* These aren't pagecache operations per se */
+	.bmap			= fuse_bmap,
+	.direct_IO		= fuse_direct_IO,
+};
+
+static inline void fuse_iomap_set_fileio(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	ASSERT(get_fuse_conn_c(inode)->iomap_fileio);
+
+	inode->i_data.a_ops = &fuse_iomap_aops;
+
+	INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
+	INIT_LIST_HEAD(&fi->ioend_list);
+	spin_lock_init(&fi->ioend_lock);
+	set_bit(FUSE_I_IOMAP_FILEIO, &fi->state);
+}
+
+/*
+ * Locking for serialisation of IO during page faults. This results in a lock
+ * ordering of:
+ *
+ * mmap_lock (MM)
+ *   sb_start_pagefault(vfs, freeze)
+ *     invalidate_lock (vfs - truncate serialisation)
+ *       page_lock (MM)
+ *         i_lock (FUSE - extent map serialisation)
+ */
+static vm_fault_t fuse_iomap_page_mkwrite(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	vm_fault_t ret;
+
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	trace_fuse_iomap_page_mkwrite(vmf);
+
+	sb_start_pagefault(inode->i_sb);
+	file_update_time(vmf->vma->vm_file);
+
+	filemap_invalidate_lock_shared(mapping);
+	ret = iomap_page_mkwrite(vmf, &fuse_iomap_ops, NULL);
+	filemap_invalidate_unlock_shared(mapping);
+
+	sb_end_pagefault(inode->i_sb);
+	return ret;
+}
+
+static const struct vm_operations_struct fuse_iomap_vm_ops = {
+	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= fuse_iomap_page_mkwrite,
+};
+
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	ASSERT(fuse_has_iomap_fileio(file_inode(file)));
+
+	file_accessed(file);
+	vma->vm_ops = &fuse_iomap_vm_ops;
+	return 0;
+}
+
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	trace_fuse_iomap_buffered_read(iocb, to);
+
+	if (!iov_iter_count(to))
+		return 0; /* skip atime */
+
+	file_accessed(iocb->ki_filp);
+
+	ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+	if (ret)
+		return ret;
+	ret = generic_file_read_iter(iocb, to);
+	inode_unlock_shared(inode);
+
+	trace_fuse_iomap_buffered_read_end(iocb, to, ret);
+	return ret;
+}
+
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	loff_t pos = iocb->ki_pos;
+	ssize_t ret;
+
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	trace_fuse_iomap_buffered_write(iocb, from);
+
+	ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+	if (ret)
+		return ret;
+
+	ret = fuse_iomap_write_checks(iocb, from);
+	if (ret)
+		goto out_unlock;
+
+	if (inode->i_size < pos + iov_iter_count(from))
+		set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+	ret = iomap_file_buffered_write(iocb, from, &fuse_iomap_ops, NULL);
+
+	if (ret > 0)
+		fuse_write_update_attr(inode, pos + ret, ret);
+	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+out_unlock:
+	inode_unlock(inode);
+
+	if (ret > 0) {
+		/* Handle various SYNC-type writes */
+		ret = generic_write_sync(iocb, ret);
+	}
+	trace_fuse_iomap_buffered_write_end(iocb, from, ret);
+	return ret;
+}
+
+static int
+fuse_iomap_truncate_page(
+	struct inode		*inode,
+	loff_t			pos,
+	bool			*did_zero)
+{
+	return iomap_truncate_page(inode, pos, did_zero, &fuse_iomap_ops,
+				   NULL);
+}
+/*
+ * Truncate file.  Must have write permission and not be a directory.
+ *
+ * Caution: The caller of this function is responsible for calling
+ * setattr_prepare() or otherwise verifying the change is fine.
+ */
+static int
+fuse_iomap_setattr_size(
+	struct inode		*inode,
+	loff_t			newsize)
+{
+	loff_t			oldsize = i_size_read(inode);
+	int			error;
+	bool			did_zeroing = false;
+
+	rwsem_assert_held_write(&inode->i_rwsem);
+	rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
+	ASSERT(S_ISREG(inode->i_mode));
+
+	/*
+	 * Wait for all direct I/O to complete.
+	 */
+	inode_dio_wait(inode);
+
+	/*
+	 * File data changes must be complete and flushed to disk before we
+	 * call userspace to modify the inode.
+	 *
+	 * Start with zeroing any data beyond EOF that we may expose on file
+	 * extension, or zeroing out the rest of the block on a downward
+	 * truncate.
+	 */
+	if (newsize > oldsize) {
+		trace_fuse_iomap_truncate_up(inode, oldsize, newsize - oldsize);
+
+		error = fuse_iomap_zero_range(inode, oldsize, newsize - oldsize,
+					      &did_zeroing);
+	} else {
+		trace_fuse_iomap_truncate_down(inode, newsize,
+					       oldsize - newsize);
+
+		error = fuse_iomap_truncate_page(inode, newsize, &did_zeroing);
+	}
+	if (error)
+		return error;
+
+	/*
+	 * We've already locked out new page faults, so now we can safely
+	 * remove pages from the page cache knowing they won't get refaulted
+	 * until we drop the mapping invalidation lock after the extent
+	 * manipulations are complete. The truncate_setsize() call also cleans
+	 * folios spanning EOF on extending truncates and hence ensures
+	 * sub-page block size filesystems are correctly handled, too.
+	 *
+	 * And we update in-core i_size and truncate page cache beyond newsize
+	 * before writing back the whole file, so we're guaranteed not to write
+	 * stale data past the new EOF on truncate down.
+	 */
+	truncate_setsize(inode, newsize);
+
+	/*
+	 * Flush the entire pagecache to ensure the fuse server logs the inode
+	 * size change and all dirty data that might be associated with it.
+	 * We don't know the ondisk inode size, so we only have this clumsy
+	 * hammer.
+	 */
+	return filemap_write_and_wait(inode->i_mapping);
+}
+
+int
+fuse_iomap_setsize(
+	struct inode		*inode,
+	loff_t			newsize)
+{
+	int error;
+
+	ASSERT(fuse_has_iomap(inode));
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	trace_fuse_iomap_setsize(inode, newsize, 0);
+
+	error = inode_newsize_ok(inode, newsize);
+	if (error)
+		return error;
+	return fuse_iomap_setattr_size(inode, newsize);
+}
+
+/*
+ * Prepare for a file data block remapping operation by flushing and unmapping
+ * all pagecache for the entire range.
+ */
+int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
+				 loff_t endpos)
+{
+	loff_t			start, end;
+	unsigned int		rounding;
+	int			error;
+
+	/*
+	 * Make sure we extend the flush out to extent alignment boundaries so
+	 * any extent range overlapping the start/end of the modification we
+	 * are about to do is clean and idle.
+	 */
+	rounding = max_t(unsigned int, i_blocksize(inode), PAGE_SIZE);
+	start = round_down(pos, rounding);
+	end = round_up(endpos + 1, rounding) - 1;
+
+	trace_fuse_iomap_flush_unmap_range(inode, start, end + 1 - start);
+
+	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
+	if (error)
+		return error;
+	truncate_pagecache_range(inode, start, end);
+	return 0;
+}
+
+static int fuse_iomap_punch_range(struct inode *inode, loff_t offset,
+				  loff_t length)
+{
+	loff_t isize = i_size_read(inode);
+	int error;
+
+	trace_fuse_iomap_punch_range(inode, offset, length);
+
+	/*
+	 * Now that we've unmapped all full blocks we'll have to zero out any
+	 * partial block at the beginning and/or end.  iomap_zero_range is
+	 * smart enough to skip holes and unwritten extents, including those we
+	 * just created, but we must take care not to zero beyond EOF, which
+	 * would enlarge i_size.
+	 */
+	if (offset >= isize)
+		return 0;
+	if (offset + length > isize)
+		length = isize - offset;
+	error = fuse_iomap_zero_range(inode, offset, length, NULL);
+	if (error)
+		return error;
+
+	/*
+	 * If we zeroed right up to EOF and EOF straddles a page boundary we
+	 * must make sure that the post-EOF area is also zeroed because the
+	 * page could be mmap'd and iomap_zero_range doesn't do that for us.
+	 * Writeback of the eof page will do this, albeit clumsily.
+	 */
+	if (offset + length >= isize && offset_in_page(offset + length) > 0) {
+		error = filemap_write_and_wait_range(inode->i_mapping,
+					round_down(offset + length, PAGE_SIZE),
+					LLONG_MAX);
+	}
+
+	return error;
+}
+
+void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
+{
+	trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
+
+	if (inode->i_blkbits == new_blkbits)
+		return;
+
+	if (!S_ISREG(inode->i_mode))
+		goto set_it;
+
+	/*
+	 * iomap attaches per-block state to each folio, so we cannot allow
+	 * the file block size to change if there's anything in the page cache.
+	 * In theory, fuse servers should never be doing this.
+	 */
+	if (inode->i_mapping->nrpages > 0) {
+		WARN_ON(inode->i_blkbits != new_blkbits &&
+			inode->i_mapping->nrpages > 0);
+		return;
+	}
+
+set_it:
+	inode->i_blkbits = new_blkbits;
+}
+
+int
+fuse_iomap_fallocate(
+	struct file		*file,
+	int			mode,
+	loff_t			offset,
+	loff_t			length,
+	loff_t			new_size)
+{
+	struct inode *inode = file_inode(file);
+	int error;
+
+	ASSERT(fuse_has_iomap(inode));
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size);
+
+	/*
+	 * If we unmapped blocks from the file range, then we zero the
+	 * pagecache for those regions and push them to disk rather than make
+	 * the fuse server manually zero the disk blocks.
+	 */
+	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) {
+		error = fuse_iomap_punch_range(inode, offset, length);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * If this is an extending write, we need to zero the bytes beyond the
+	 * new EOF and bounce the new size out to userspace.
+	 */
+	if (new_size) {
+		error = fuse_iomap_setsize(inode, new_size);
+		if (error)
+			return error;
+
+		fuse_write_update_attr(inode, new_size, length);
+	}
+
+	file_update_time(file);
+	return 0;
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1a17983753c367..3e92a29d1030c9 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -231,6 +231,7 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	u8 new_blkbits;
 
 	lockdep_assert_held(&fi->lock);
 
@@ -292,9 +293,14 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	}
 
 	if (attr->blksize != 0)
-		inode->i_blkbits = ilog2(attr->blksize);
+		new_blkbits = ilog2(attr->blksize);
 	else
-		inode->i_blkbits = inode->i_sb->s_blocksize_bits;
+		new_blkbits = inode->i_sb->s_blocksize_bits;
+
+	if (fuse_has_iomap_fileio(inode))
+		fuse_iomap_set_i_blkbits(inode, new_blkbits);
+	else
+		inode->i_blkbits = new_blkbits;
 
 	/*
 	 * Don't set the sticky bit in i_mode, unless we want the VFS
@@ -1451,6 +1457,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 				fc->iomap = 1;
 			if ((flags & FUSE_IOMAP_DIRECTIO) && fc->iomap)
 				fc->iomap_directio = 1;
+			if ((flags & FUSE_IOMAP_FILEIO) && fc->iomap)
+				fc->iomap_fileio = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1523,7 +1531,7 @@ void fuse_send_init(struct fuse_mount *fm)
 	if (fuse_uring_enabled())
 		flags |= FUSE_OVER_IO_URING;
 	if (fuse_iomap_enabled())
-		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO;
+		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_FILEIO;
 
 	ia->in.flags = flags;
 	ia->in.flags2 = flags >> 32;

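Reviewer note on the EOF clamping in fuse_iomap_punch_range() above: the arithmetic is easy to get wrong, so here is a minimal userspace sketch of it. It assumes 4 KiB pages, and the names plan_zeroing, struct zero_plan and SKETCH_PAGE_SIZE are made up for illustration. It only models the decisions (how much to zero, whether the EOF page needs writeback) and does no I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096LL	/* assumption: 4 KiB pages */

/* Hypothetical result type: what the kernel helper would decide to do. */
struct zero_plan {
	bool	do_zero;	/* is any part of the range inside EOF? */
	int64_t	length;		/* length after clamping to EOF */
	bool	flush_eof;	/* does the EOF page need writeback? */
	int64_t	flush_start;	/* page-aligned start of that writeback */
};

static struct zero_plan plan_zeroing(int64_t offset, int64_t length,
				     int64_t isize)
{
	struct zero_plan p = { 0 };
	int64_t end;

	assert(offset >= 0 && length >= 0 && isize >= 0);

	/* Never zero beyond EOF, that would enlarge i_size. */
	if (offset >= isize)
		return p;
	if (offset + length > isize)
		length = isize - offset;

	p.do_zero = true;
	p.length = length;

	/* Zeroed up to an EOF that is not page aligned: flush that page. */
	end = offset + length;
	if (end >= isize && (end % SKETCH_PAGE_SIZE) > 0) {
		p.flush_eof = true;
		p.flush_start = end - (end % SKETCH_PAGE_SIZE);
	}
	return p;
}
```

For a 5000-byte file, punching [1000, 11000) is clamped to 4000 bytes, and because byte 5000 sits in the middle of a page, writeback would start at offset 4096.
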


* [PATCH 07/13] fuse: enable caching of timestamps
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-07-17 23:29   ` [PATCH 06/13] fuse: implement buffered " Darrick J. Wong
@ 2025-07-17 23:29   ` Darrick J. Wong
  2025-07-17 23:30   ` [PATCH 08/13] fuse: implement large folios for iomap pagecache files Darrick J. Wong
                     ` (5 subsequent siblings)
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:29 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Cache the timestamps in the kernel so that the kernel sends FUSE_SETATTR
calls to the fuse server after writes, because the iomap infrastructure
won't do that for us.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
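Note, not part of the commit: the rule this patch implements is "trust the locally cached mtime, ctime and size for regular files whenever either writeback_cache or iomap file IO is active". A minimal userspace sketch of that truth table follows. The function name cache_mask and the SK_ prefixed constants are made up for illustration; the constant values match the STATX_* bits in <linux/stat.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same values as STATX_MTIME, STATX_CTIME and STATX_SIZE in <linux/stat.h>. */
#define SK_STATX_MTIME	0x00000040U
#define SK_STATX_CTIME	0x00000080U
#define SK_STATX_SIZE	0x00000200U

/*
 * Hypothetical model of fuse_get_cache_mask() after this patch: which
 * attributes the kernel keeps authoritative copies of.
 */
static uint32_t cache_mask(bool is_regular_file, bool iomap_fileio,
			   bool writeback_cache)
{
	assert(SK_STATX_MTIME != SK_STATX_CTIME);

	if (is_regular_file && (iomap_fileio || writeback_cache))
		return SK_STATX_MTIME | SK_STATX_CTIME | SK_STATX_SIZE;
	return 0;
}
```

Directories and other non-regular inodes always get a zero mask, so their attributes keep coming from the fuse server.
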
 fs/fuse/dir.c        |    3 ++-
 fs/fuse/file.c       |   19 +++++++++++++------
 fs/fuse/file_iomap.c |    6 ++++++
 fs/fuse/inode.c      |   13 +++++++------
 4 files changed, 28 insertions(+), 13 deletions(-)


diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 7a398e42e9818b..1e9d5bf1811c6a 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1964,7 +1964,8 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	struct fuse_setattr_in inarg;
 	struct fuse_attr_out outarg;
 	bool is_truncate = false;
-	bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode);
+	bool is_wb = S_ISREG(inode->i_mode) &&
+			(fuse_has_iomap_fileio(inode) || fc->writeback_cache);
 	loff_t oldsize;
 	int err;
 	bool trust_local_cmtime = is_wb;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 2dd4e5c2933c0f..207836e2e09cc4 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -238,7 +238,8 @@ static int fuse_open(struct inode *inode, struct file *file)
 	struct fuse_file *ff;
 	int err;
 	bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
-	bool is_wb_truncate = is_truncate && fc->writeback_cache;
+	bool is_wb_truncate = is_truncate && (fuse_has_iomap_fileio(inode) ||
+					      fc->writeback_cache);
 	bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
 
 	if (fuse_is_bad(inode))
@@ -458,7 +459,9 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	if (fuse_is_bad(inode))
 		return -EIO;
 
-	if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
+	if ((ff->open_flags & FOPEN_NOFLUSH) &&
+	    !fm->fc->writeback_cache &&
+	    !fuse_has_iomap_fileio(inode))
 		return 0;
 
 	err = write_inode_now(inode, 1);
@@ -494,7 +497,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	 * In memory i_blocks is not maintained by fuse, if writeback cache is
 	 * enabled, i_blocks from cached attr may not be accurate.
 	 */
-	if (!err && fm->fc->writeback_cache)
+	if (!err && (fuse_has_iomap_fileio(inode) || fm->fc->writeback_cache))
 		fuse_invalidate_attr_mask(inode, STATX_BLOCKS);
 	return err;
 }
@@ -792,8 +795,10 @@ static void fuse_short_read(struct inode *inode, u64 attr_ver, size_t num_read,
 	 * If writeback_cache is enabled, a short read means there's a hole in
 	 * the file.  Some data after the hole is in page cache, but has not
 	 * reached the client fs yet.  So the hole is not present there.
+	 * If iomap is enabled, a short read means we hit EOF so there's
+	 * nothing to adjust.
 	 */
-	if (!fc->writeback_cache) {
+	if (!fc->writeback_cache && !fuse_has_iomap_fileio(inode)) {
 		loff_t pos = folio_pos(ap->folios[0]) + num_read;
 		fuse_read_update_size(inode, pos, attr_ver);
 	}
@@ -1935,7 +1940,7 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 	 * Do this only if writeback_cache is not enabled.  If writeback_cache
 	 * is enabled, we trust local ctime/mtime.
 	 */
-	if (!fc->writeback_cache)
+	if (!fc->writeback_cache && !fuse_has_iomap_fileio(inode))
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
 	spin_lock(&fi->lock);
 	fi->writectr--;
@@ -2266,6 +2271,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
 	int err = -ENOMEM;
 
 	WARN_ON(!fc->writeback_cache);
+	WARN_ON(fuse_has_iomap_fileio(mapping->host));
 
 	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
 			mapping_gfp_mask(mapping));
@@ -3108,7 +3114,8 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 	ssize_t err;
 	/* mark unstable when write-back is not used, and file_out gets
 	 * extended */
-	bool is_unstable = (!fc->writeback_cache) &&
+	bool is_unstable = (!fc->writeback_cache &&
+			    !fuse_has_iomap_fileio(inode_out)) &&
 			   ((pos_out + len) > inode_out->i_size);
 
 	if (fc->no_copy_file_range)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index ab0dee6460a7dd..112cbb6cabb015 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1342,6 +1342,12 @@ static inline void fuse_iomap_set_fileio(struct inode *inode)
 
 	ASSERT(get_fuse_conn_c(inode)->iomap_fileio);
 
+	/*
+	 * Manage timestamps ourselves, don't make the fuse server do it.  This
+	 * is critical for mtime updates to work correctly with page_mkwrite.
+	 */
+	inode->i_flags &= ~S_NOCMTIME;
+	inode->i_flags &= ~S_NOATIME;
 	inode->i_data.a_ops = &fuse_iomap_aops;
 
 	INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 3e92a29d1030c9..d67cc635612cff 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -328,10 +328,11 @@ u32 fuse_get_cache_mask(struct inode *inode)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
-	if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
-		return 0;
+	if (S_ISREG(inode->i_mode) &&
+	    (fuse_has_iomap_fileio(inode) || fc->writeback_cache))
+		return STATX_MTIME | STATX_CTIME | STATX_SIZE;
 
-	return STATX_MTIME | STATX_CTIME | STATX_SIZE;
+	return 0;
 }
 
 static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr,
@@ -346,9 +347,9 @@ static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr
 
 	spin_lock(&fi->lock);
 	/*
-	 * In case of writeback_cache enabled, writes update mtime, ctime and
-	 * may update i_size.  In these cases trust the cached value in the
-	 * inode.
+	 * In case of writeback_cache or iomap enabled, writes update mtime,
+	 * ctime and may update i_size.  In these cases trust the cached value
+	 * in the inode.
 	 */
 	cache_mask = fuse_get_cache_mask(inode);
 	if (cache_mask & STATX_SIZE)



* [PATCH 08/13] fuse: implement large folios for iomap pagecache files
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-07-17 23:29   ` [PATCH 07/13] fuse: enable caching of timestamps Darrick J. Wong
@ 2025-07-17 23:30   ` Darrick J. Wong
  2025-07-17 23:30   ` [PATCH 09/13] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:30 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

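Use large folios when we're using iomap.  If the file block size exceeds
the page size, set the minimum folio order so that each folio covers at
least one block.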

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
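Note, not part of the commit: the minimum folio order is just the difference between the block size shift and the page shift, floored at zero. A minimal sketch of that arithmetic, with folio_min_order as a made-up name:

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the computation in fuse_iomap_set_fileio():
 * a folio of order N spans (1 << N) pages, so a block of (1 << blkbits)
 * bytes needs order (blkbits - page_shift) when it is larger than a page.
 */
static unsigned int folio_min_order(unsigned int blkbits,
				    unsigned int page_shift)
{
	assert(page_shift > 0);

	return blkbits > page_shift ? blkbits - page_shift : 0;
}
```

With 4 KiB pages (page_shift 12), a 64 KiB block size (blkbits 16) gives a minimum order of 4, i.e. 16 pages per folio; block sizes at or below the page size leave the minimum at order 0.
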
 fs/fuse/file_iomap.c |    6 ++++++
 1 file changed, 6 insertions(+)


diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 112cbb6cabb015..0983eabe58ffef 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1339,6 +1339,7 @@ static const struct address_space_operations fuse_iomap_aops = {
 static inline void fuse_iomap_set_fileio(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	unsigned int min_order = 0;
 
 	ASSERT(get_fuse_conn_c(inode)->iomap_fileio);
 
@@ -1353,6 +1354,11 @@ static inline void fuse_iomap_set_fileio(struct inode *inode)
 	INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
 	INIT_LIST_HEAD(&fi->ioend_list);
 	spin_lock_init(&fi->ioend_lock);
+
+	if (inode->i_blkbits > PAGE_SHIFT)
+		min_order = inode->i_blkbits - PAGE_SHIFT;
+
+	mapping_set_folio_min_order(inode->i_mapping, min_order);
 	set_bit(FUSE_I_IOMAP_FILEIO, &fi->state);
 }
 



* [PATCH 09/13] fuse: use an unrestricted backing device with iomap pagecache io
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-07-17 23:30   ` [PATCH 08/13] fuse: implement large folios for iomap pagecache files Darrick J. Wong
@ 2025-07-17 23:30   ` Darrick J. Wong
  2025-07-17 23:30   ` [PATCH 10/13] fuse: advertise support for iomap Darrick J. Wong
                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:30 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

With iomap support turned on for the pagecache, the kernel issues
writeback directly to block devices and we no longer have to push all
those pages through the fuse device to userspace.  Therefore, we don't
need the tight dirty limits (~1M) that are used for regular fuse.  This
dramatically increases the performance of fuse's pagecache IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
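Note, not part of the commit: the replacement bdi gets a name built from the format string "%u:%u%s.iomap" passed to super_setup_bdi_name(). A small userspace sketch of what that name looks like follows; the helper names and the device numbers used with them are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: build the bdi name the same way the patch's format does. */
static int iomap_bdi_name(char *buf, size_t len, unsigned int major,
			  unsigned int minor, int has_bdev)
{
	const char *suffix = has_bdev ? "-fuseblk" : "-fuse";

	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%u:%u%s.iomap", major, minor, suffix);
}

/* Hypothetical: does a bdi name carry the ".iomap" marker? */
static int is_iomap_bdi(const char *name)
{
	static const char marker[] = ".iomap";
	size_t nlen = strlen(name);
	size_t mlen = strlen(marker);

	return nlen >= mlen && strcmp(name + nlen - mlen, marker) == 0;
}
```

The ".iomap" suffix is what tells the new bdi apart from the initial fuse bdi, which the kernel falls back to if the setup fails.
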
 fs/fuse/file_iomap.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)


diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 0983eabe58ffef..6ecca237196ac4 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -606,6 +606,29 @@ bool fuse_iomap_fill_super(struct fuse_mount *fm)
 		}
 	}
 
+	if (fc->iomap_fileio) {
+		struct backing_dev_info *old_bdi = sb->s_bdi;
+		char *suffix = sb->s_bdev ? "-fuseblk" : "-fuse";
+		int err;
+
+		/*
+		 * sb->s_bdi points to the initial private bdi.  However, we
+		 * want to redirect it to a new private bdi with default dirty
+		 * and readahead settings because iomap writeback won't be
+		 * pushing a ton of dirty data through the fuse device.  If
+		 * this fails we fall back to the initial fuse bdi.
+		 */
+		sb->s_bdi = &noop_backing_dev_info;
+		err = super_setup_bdi_name(sb, "%u:%u%s.iomap", MAJOR(fc->dev),
+					   MINOR(fc->dev), suffix);
+		if (err) {
+			sb->s_bdi = old_bdi;
+		} else {
+			bdi_unregister(old_bdi);
+			bdi_put(old_bdi);
+		}
+	}
+
 	/*
 	 * Enable syncfs for iomap fuse servers so that we can send a final
 	 * flush at unmount time.  This also means that we can support



* [PATCH 10/13] fuse: advertise support for iomap
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-07-17 23:30   ` [PATCH 09/13] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
@ 2025-07-17 23:30   ` Darrick J. Wong
  2025-07-17 23:31   ` [PATCH 11/13] fuse: query filesystem geometry when using iomap Darrick J. Wong
                     ` (2 subsequent siblings)
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:30 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Advertise our new IO paths programmatically by creating an ioctl that
can return the capabilities of the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
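Note, not part of the commit: a rough sketch of how a fuse server might probe for iomap before mounting. The struct, flag and ioctl definitions are copied from this patch because no released <linux/fuse.h> carries them; query_iomap_support and can_use_buffered_iomap are made-up helper names. The ioctl number could still change while this is an RFC, so treat this as an illustration only.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Copied from this patch, not from any released kernel header. */
#define FUSE_DEV_IOC_MAGIC		229
#define FUSE_IOMAP_SUPPORT_BASICS	(1ULL << 0)
#define FUSE_IOMAP_SUPPORT_DIRECTIO	(1ULL << 1)
#define FUSE_IOMAP_SUPPORT_FILEIO	(1ULL << 2)

struct fuse_iomap_support {
	uint64_t	flags;
	uint64_t	padding;
};

#define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 4, \
					     struct fuse_iomap_support)

/* Returns 0 and fills *flags, or -1 if the kernel could not be asked. */
static int query_iomap_support(const char *devpath, uint64_t *flags)
{
	struct fuse_iomap_support ios = { 0 };
	int fd, ret;

	assert(devpath != NULL && flags != NULL);

	fd = open(devpath, O_RDWR);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, FUSE_DEV_IOC_IOMAP_SUPPORT, &ios);
	close(fd);
	if (ret < 0)
		return -1;	/* e.g. ENOTTY on a kernel without this patch */

	*flags = ios.flags;
	return 0;
}

/* Buffered iomap IO needs both the basic and the fileio capability. */
static int can_use_buffered_iomap(uint64_t flags)
{
	const uint64_t want = FUSE_IOMAP_SUPPORT_BASICS |
			      FUSE_IOMAP_SUPPORT_FILEIO;

	return (flags & want) == want;
}
```

A server would call query_iomap_support("/dev/fuse", &flags) and fall back to regular fuse IO paths if the call fails or the flags it needs are missing.
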
 fs/fuse/fuse_i.h          |    4 ++++
 include/uapi/linux/fuse.h |   13 +++++++++++++
 fs/fuse/dev.c             |    2 ++
 fs/fuse/file_iomap.c      |   15 +++++++++++++++
 4 files changed, 34 insertions(+)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f33b348d296d5e..136b9e5aabaf51 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1711,6 +1711,9 @@ int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
 			 loff_t length, loff_t new_size);
 int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
 				 loff_t endpos);
+
+int fuse_dev_ioctl_iomap_support(struct file *file,
+				 struct fuse_iomap_support __user *argp);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1738,6 +1741,7 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
 # define fuse_iomap_set_i_blkbits(...)		((void)0)
 # define fuse_iomap_fallocate(...)		(-ENOSYS)
 # define fuse_iomap_flush_unmap_range(...)	(-ENOSYS)
+# define fuse_dev_ioctl_iomap_support(...)	(-EOPNOTSUPP)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 71129db79a1dd0..cd484de60a7c09 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1142,6 +1142,17 @@ struct fuse_backing_map {
 	uint64_t	padding;
 };
 
+/* basic reporting functionality */
+#define FUSE_IOMAP_SUPPORT_BASICS	(1ULL << 0)
+/* fuse driver can do direct io */
+#define FUSE_IOMAP_SUPPORT_DIRECTIO	(1ULL << 1)
+/* fuse driver can do buffered io */
+#define FUSE_IOMAP_SUPPORT_FILEIO	(1ULL << 2)
+struct fuse_iomap_support {
+	uint64_t	flags;
+	uint64_t	padding;
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
@@ -1150,6 +1161,8 @@ struct fuse_backing_map {
 #define FUSE_DEV_IOC_BACKING_CLOSE	_IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
 #define FUSE_DEV_IOC_IOMAP_DEV_ADD	_IOW(FUSE_DEV_IOC_MAGIC, 3, \
 					     struct fuse_backing_map)
+#define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 4, \
+					     struct fuse_iomap_support)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 49ff2c6654e768..4ad90d212379ff 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2685,6 +2685,8 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 
 	case FUSE_DEV_IOC_IOMAP_DEV_ADD:
 		return fuse_dev_ioctl_iomap_dev_add(file, argp);
+	case FUSE_DEV_IOC_IOMAP_SUPPORT:
+		return fuse_dev_ioctl_iomap_support(file, argp);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 6ecca237196ac4..673647ddda0ccd 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1724,3 +1724,18 @@ fuse_iomap_fallocate(
 	file_update_time(file);
 	return 0;
 }
+
+int fuse_dev_ioctl_iomap_support(struct file *file,
+				 struct fuse_iomap_support __user *argp)
+{
+	struct fuse_iomap_support ios = { };
+
+	if (fuse_iomap_enabled())
+		ios.flags = FUSE_IOMAP_SUPPORT_BASICS |
+			    FUSE_IOMAP_SUPPORT_DIRECTIO |
+			    FUSE_IOMAP_SUPPORT_FILEIO;
+
+	if (copy_to_user(argp, &ios, sizeof(ios)))
+		return -EFAULT;
+	return 0;
+}



* [PATCH 11/13] fuse: query filesystem geometry when using iomap
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (9 preceding siblings ...)
  2025-07-17 23:30   ` [PATCH 10/13] fuse: advertise support for iomap Darrick J. Wong
@ 2025-07-17 23:31   ` Darrick J. Wong
  2025-07-17 23:31   ` [PATCH 12/13] fuse: implement fadvise for iomap files Darrick J. Wong
  2025-07-17 23:31   ` [PATCH 13/13] fuse: implement inline data file IO via iomap Darrick J. Wong
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:31 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Add a new upcall to the fuse server so that the kernel can request
filesystem geometry bits when iomap mode is in use.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
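Note, not part of the commit: a sketch of what the server side of FUSE_IOMAP_CONFIG could look like, together with a userspace mirror of the sanity checks that fuse_iomap_config() applies to the reply. The struct layout and flag values are copied from this patch; fill_config, config_reply_ok and the example time limits are made up for illustration, and how a server actually sends the reply depends on libfuse plumbing that is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copied from this patch, not from any released kernel header. */
#define FUSE_IOMAP_CONFIG_SID		(1ULL << 0)
#define FUSE_IOMAP_CONFIG_UUID		(1ULL << 1)
#define FUSE_IOMAP_CONFIG_BLOCKSIZE	(1ULL << 2)
#define FUSE_IOMAP_CONFIG_MAX_LINKS	(1ULL << 3)
#define FUSE_IOMAP_CONFIG_TIME		(1ULL << 4)
#define FUSE_IOMAP_CONFIG_MAXBYTES	(1ULL << 5)
#define FUSE_IOMAP_CONFIG_ALL		((1ULL << 6) - 1)

struct fuse_iomap_config_out {
	uint64_t flags;
	char s_id[32];
	char s_uuid[16];
	uint8_t s_uuid_len;
	uint8_t s_pad[3];
	uint32_t s_blocksize;
	uint32_t s_max_links;
	uint32_t s_time_gran;
	int64_t s_time_min;
	int64_t s_time_max;
	int64_t s_maxbytes;
};

/* Hypothetical server helper: report uuid, block size and time limits. */
static void fill_config(struct fuse_iomap_config_out *out,
			const unsigned char uuid[16], uint32_t blocksize)
{
	memset(out, 0, sizeof(*out));	/* also zeroes s_pad, as required */
	out->flags = FUSE_IOMAP_CONFIG_UUID | FUSE_IOMAP_CONFIG_BLOCKSIZE |
		     FUSE_IOMAP_CONFIG_TIME;
	memcpy(out->s_uuid, uuid, sizeof(out->s_uuid));
	out->s_uuid_len = sizeof(out->s_uuid);
	out->s_blocksize = blocksize;
	out->s_time_gran = 1;		/* example: nanosecond granularity */
	out->s_time_min = INT32_MIN;	/* example limits only */
	out->s_time_max = INT32_MAX;
}

/* Mirror of the kernel's checks: 1 if the reply would be accepted. */
static int config_reply_ok(const struct fuse_iomap_config_out *out)
{
	size_t i;

	if (out->flags & ~FUSE_IOMAP_CONFIG_ALL)
		return 0;	/* unknown flag bits */
	if (out->s_uuid_len > sizeof(out->s_uuid))
		return 0;	/* uuid longer than the buffer */
	for (i = 0; i < sizeof(out->s_pad); i++)
		if (out->s_pad[i])
			return 0;	/* padding must be zero */
	return 1;
}
```

Only fields whose flag bit is set are consumed by the kernel, so a server can leave the rest zeroed. A reply that fails any of these checks makes the kernel return -EINVAL from fuse_iomap_config().
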
 fs/fuse/fuse_trace.h      |   48 +++++++++++++++++++++++
 include/uapi/linux/fuse.h |   38 +++++++++++++++++++
 fs/fuse/file_iomap.c      |   92 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 178 insertions(+)


diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 5d9b5a4e93fca5..0078a9ad2a2871 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -58,6 +58,7 @@
 	EM( FUSE_SYNCFS,		"FUSE_SYNCFS")		\
 	EM( FUSE_TMPFILE,		"FUSE_TMPFILE")		\
 	EM( FUSE_STATX,			"FUSE_STATX")		\
+	EM( FUSE_IOMAP_CONFIG,		"FUSE_IOMAP_CONFIG")	\
 	EM( FUSE_IOMAP_BEGIN,		"FUSE_IOMAP_BEGIN")	\
 	EM( FUSE_IOMAP_END,		"FUSE_IOMAP_END")	\
 	EM( FUSE_IOMAP_IOEND,		"FUSE_IOMAP_IOEND")	\
@@ -198,6 +199,14 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP_FILEIO);
 	{ IOMAP_IOEND_BOUNDARY,			"boundary" }, \
 	{ IOMAP_IOEND_DIRECT,			"direct" }
 
+#define FUSE_IOMAP_CONFIG_STRINGS \
+	{ FUSE_IOMAP_CONFIG_SID,		"sid" }, \
+	{ FUSE_IOMAP_CONFIG_UUID,		"uuid" }, \
+	{ FUSE_IOMAP_CONFIG_BLOCKSIZE,		"blocksize" }, \
+	{ FUSE_IOMAP_CONFIG_MAX_LINKS,		"max_links" }, \
+	{ FUSE_IOMAP_CONFIG_TIME,		"time" }, \
+	{ FUSE_IOMAP_CONFIG_MAXBYTES,		"maxbytes" }
+
 TRACE_EVENT(fuse_iomap_begin,
 	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
 		 unsigned opflags),
@@ -1184,6 +1193,45 @@ TRACE_EVENT(fuse_iomap_fallocate,
 		  __entry->isize, __entry->mode, __entry->offset,
 		  __entry->length, __entry->newsize)
 );
+
+TRACE_EVENT(fuse_iomap_config,
+	TP_PROTO(const struct fuse_mount *fm,
+		 const struct fuse_iomap_config_out *outarg),
+	TP_ARGS(fm, outarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+
+		__field(uint32_t,		flags)
+		__field(uint32_t,		blocksize)
+		__field(uint32_t,		max_links)
+		__field(uint32_t,		time_gran)
+
+		__field(int64_t,		time_min)
+		__field(int64_t,		time_max)
+		__field(int64_t,		maxbytes)
+		__field(uint8_t,		uuid_len)
+	),
+
+	TP_fast_assign(
+		__entry->connection	=	fm->fc->dev;
+		__entry->flags		=	outarg->flags;
+		__entry->blocksize	=	outarg->s_blocksize;
+		__entry->max_links	=	outarg->s_max_links;
+		__entry->time_gran	=	outarg->s_time_gran;
+		__entry->time_min	=	outarg->s_time_min;
+		__entry->time_max	=	outarg->s_time_max;
+		__entry->maxbytes	=	outarg->s_maxbytes;
+		__entry->uuid_len	=	outarg->s_uuid_len;
+	),
+
+	TP_printk("connection %u flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u",
+		  __entry->connection,
+		  __print_flags(__entry->flags, "|", FUSE_IOMAP_CONFIG_STRINGS),
+		  __entry->blocksize, __entry->max_links, __entry->time_gran,
+		  __entry->time_min, __entry->time_max, __entry->maxbytes,
+		  __entry->uuid_len)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index cd484de60a7c09..2aac5a0c4cef0a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -242,6 +242,7 @@
  *  - add FUSE_DEV_IOC_IOMAP_DEV_ADD to configure block devices for iomap
  *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
  *  - add FUSE_IOMAP_FILEIO/FUSE_ATTR_IOMAP_FILEIO for buffered I/O support
+ *  - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
  */
 
 #ifndef _LINUX_FUSE_H
@@ -676,6 +677,7 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_CONFIG	= 4092,
 	FUSE_IOMAP_IOEND	= 4093,
 	FUSE_IOMAP_BEGIN	= 4094,
 	FUSE_IOMAP_END		= 4095,
@@ -1424,4 +1426,40 @@ struct fuse_iomap_ioend_in {
 	uint32_t reserved1;	/* zero */
 };
 
+struct fuse_iomap_config_in {
+	uint64_t flags;		/* zero for now */
+	int64_t maxbytes;	/* maximum supported file size */
+};
+
+/* Which fields are set in fuse_iomap_config_out? */
+#define FUSE_IOMAP_CONFIG_SID		(1ULL << 0)
+#define FUSE_IOMAP_CONFIG_UUID		(1ULL << 1)
+#define FUSE_IOMAP_CONFIG_BLOCKSIZE	(1ULL << 2)
+#define FUSE_IOMAP_CONFIG_MAX_LINKS	(1ULL << 3)
+#define FUSE_IOMAP_CONFIG_TIME		(1ULL << 4)
+#define FUSE_IOMAP_CONFIG_MAXBYTES	(1ULL << 5)
+
+struct fuse_iomap_config_out {
+	uint64_t flags;		/* FUSE_IOMAP_CONFIG_* */
+
+	char s_id[32];		/* Informational name */
+	char s_uuid[16];	/* UUID */
+
+	uint8_t s_uuid_len;	/* length of s_uuid */
+
+	uint8_t s_pad[3];	/* must be zeroes */
+
+	uint32_t s_blocksize;	/* fs block size */
+	uint32_t s_max_links;	/* max hard links */
+
+	/* Granularity of c/m/atime in ns (cannot be worse than a second) */
+	uint32_t s_time_gran;
+
+	/* Time limits for c/m/atime in seconds */
+	int64_t s_time_min;
+	int64_t s_time_max;
+
+	int64_t s_maxbytes;	/* max file size */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 673647ddda0ccd..5253f7ef88c110 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -575,12 +575,104 @@ static struct fuse_iomap_dev *fuse_iomap_dev_alloc(struct file *file)
 	return fb;
 }
 
+#define FUSE_IOMAP_CONFIG_ALL (FUSE_IOMAP_CONFIG_SID | \
+			       FUSE_IOMAP_CONFIG_UUID | \
+			       FUSE_IOMAP_CONFIG_BLOCKSIZE | \
+			       FUSE_IOMAP_CONFIG_MAX_LINKS | \
+			       FUSE_IOMAP_CONFIG_TIME | \
+			       FUSE_IOMAP_CONFIG_MAXBYTES)
+
+static int fuse_iomap_config(struct fuse_mount *fm)
+{
+	struct fuse_iomap_config_in inarg = {
+		.maxbytes = MAX_LFS_FILESIZE,
+	};
+	struct fuse_iomap_config_out outarg = { };
+	FUSE_ARGS(args);
+	struct super_block *sb = fm->sb;
+	int err;
+
+	args.opcode = FUSE_IOMAP_CONFIG;
+	args.nodeid = 0;
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	args.out_numargs = 1;
+	args.out_args[0].size = sizeof(outarg);
+	args.out_args[0].value = &outarg;
+	args.force = true;
+	args.nocreds = true;
+	err = fuse_simple_request(fm, &args);
+	if (err == -ENOSYS)
+		return 0;
+	if (err)
+		return err;
+
+	trace_fuse_iomap_config(fm, &outarg);
+
+	if (outarg.flags & ~FUSE_IOMAP_CONFIG_ALL)
+		return -EINVAL;
+
+	if (outarg.s_uuid_len > sizeof(outarg.s_uuid))
+		return -EINVAL;
+
+	if (memchr_inv(outarg.s_pad, 0, sizeof(outarg.s_pad)))
+		return -EINVAL;
+
+	if (outarg.flags & FUSE_IOMAP_CONFIG_BLOCKSIZE) {
+		if (sb->s_bdev) {
+#ifdef CONFIG_BLOCK
+			if (!sb_set_blocksize(sb, outarg.s_blocksize))
+				return -EINVAL;
+#else
+			/*
+			 * XXX: how do we have a bdev filesystem without
+			 * CONFIG_BLOCK???
+			 */
+			return -EINVAL;
+#endif
+		} else {
+			sb->s_blocksize = outarg.s_blocksize;
+			sb->s_blocksize_bits = blksize_bits(outarg.s_blocksize);
+		}
+	}
+
+	if (outarg.flags & FUSE_IOMAP_CONFIG_SID)
+		memcpy(sb->s_id, outarg.s_id, sizeof(sb->s_id));
+
+	if (outarg.flags & FUSE_IOMAP_CONFIG_UUID) {
+		memcpy(&sb->s_uuid, outarg.s_uuid, outarg.s_uuid_len);
+		sb->s_uuid_len = outarg.s_uuid_len;
+	}
+
+	if (outarg.flags & FUSE_IOMAP_CONFIG_MAX_LINKS)
+		sb->s_max_links = outarg.s_max_links;
+
+	if (outarg.flags & FUSE_IOMAP_CONFIG_TIME) {
+		sb->s_time_gran = outarg.s_time_gran;
+		sb->s_time_min = outarg.s_time_min;
+		sb->s_time_max = outarg.s_time_max;
+	}
+
+	if (outarg.flags & FUSE_IOMAP_CONFIG_MAXBYTES)
+		sb->s_maxbytes = outarg.s_maxbytes;
+
+	return 0;
+}
+
 bool fuse_iomap_fill_super(struct fuse_mount *fm)
 {
 	struct fuse_conn *fc = fm->fc;
 	struct super_block *sb = fm->sb;
 	int res;
 
+	res = fuse_iomap_config(fm);
+	if (res) {
+		printk(KERN_ERR "%s: could not configure iomap, err=%d\n",
+		       sb->s_id, res);
+		return false;
+	}
+
 	if (sb->s_bdev) {
 		/*
 		 * Try to install s_bdev as the first iomap device, if this



* [PATCH 12/13] fuse: implement fadvise for iomap files
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (10 preceding siblings ...)
  2025-07-17 23:31   ` [PATCH 11/13] fuse: query filesystem geometry when using iomap Darrick J. Wong
@ 2025-07-17 23:31   ` Darrick J. Wong
  2025-07-17 23:31   ` [PATCH 13/13] fuse: implement inline data file IO via iomap Darrick J. Wong
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:31 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

If userspace asks us to perform readahead on a file, take i_rwsem so
that it can't race with hole punching or writes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
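Note, not part of the commit: the only advice that takes the lock is POSIX_FADV_WILLNEED on an iomap file, which userspace triggers with posix_fadvise(). A minimal sketch of such a caller, with prefetch_file as a made-up name:

```c
#define _POSIX_C_SOURCE 200112L	/* for posix_fadvise() under strict -std */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to read a whole file into the page
 * cache.  Returns -1 if the file cannot be opened, otherwise the value of
 * posix_fadvise(), which is 0 on success or an errno value on failure.
 */
static int prefetch_file(const char *path)
{
	int fd, err;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* A length of 0 means "from offset to the end of the file". */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
	close(fd);
	return err;
}
```

On an iomap-enabled fuse file, this request would reach fuse_iomap_fadvise(), which holds i_rwsem shared around generic_fadvise() while readahead populates the page cache.
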
 fs/fuse/fuse_i.h     |    3 +++
 fs/fuse/file.c       |    1 +
 fs/fuse/file_iomap.c |   20 ++++++++++++++++++++
 3 files changed, 24 insertions(+)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 136b9e5aabaf51..5fba84c75f4a64 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1714,6 +1714,8 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
 
 int fuse_dev_ioctl_iomap_support(struct file *file,
 				 struct fuse_iomap_support __user *argp);
+
+int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1742,6 +1744,7 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
 # define fuse_iomap_fallocate(...)		(-ENOSYS)
 # define fuse_iomap_flush_unmap_range(...)	(-ENOSYS)
 # define fuse_dev_ioctl_iomap_support(...)	(-EOPNOTSUPP)
+# define fuse_iomap_fadvise			NULL
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 207836e2e09cc4..78e776878427e3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3236,6 +3236,7 @@ static const struct file_operations fuse_file_operations = {
 	.poll		= fuse_file_poll,
 	.fallocate	= fuse_file_fallocate,
 	.copy_file_range = fuse_copy_file_range,
+	.fadvise	= fuse_iomap_fadvise,
 };
 
 static const struct address_space_operations fuse_file_aops  = {
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 5253f7ef88c110..3f6e0496c4744b 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -8,6 +8,7 @@
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
 #include <linux/falloc.h>
+#include <linux/fadvise.h>
 
 static bool __read_mostly enable_iomap =
 #if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
@@ -1831,3 +1832,22 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
 		return -EFAULT;
 	return 0;
 }
+
+int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice)
+{
+	struct inode *inode = file_inode(file);
+	bool needlock = advice == POSIX_FADV_WILLNEED &&
+			fuse_has_iomap_fileio(inode);
+	int ret;
+
+	/*
+	 * Operations creating pages in page cache need protection from hole
+	 * punching and similar ops
+	 */
+	if (needlock)
+		inode_lock_shared(inode);
+	ret = generic_fadvise(file, start, end, advice);
+	if (needlock)
+		inode_unlock_shared(inode);
+	return ret;
+}



* [PATCH 13/13] fuse: implement inline data file IO via iomap
  2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (11 preceding siblings ...)
  2025-07-17 23:31   ` [PATCH 12/13] fuse: implement fadvise for iomap files Darrick J. Wong
@ 2025-07-17 23:31   ` Darrick J. Wong
  12 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:31 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Implement inline data file IO by issuing FUSE_READ/FUSE_WRITE commands
in response to an inline data mapping.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h |   57 +++++++++++++++
 fs/fuse/file_iomap.c |  188 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 234 insertions(+), 11 deletions(-)


diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 0078a9ad2a2871..20257aed0cd89f 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -1232,6 +1232,63 @@ TRACE_EVENT(fuse_iomap_config,
 		  __entry->time_min, __entry->time_max, __entry->maxbytes,
 		  __entry->uuid_len)
 );
+
+DECLARE_EVENT_CLASS(fuse_iomap_inline_class,
+	TP_PROTO(const struct inode *inode, loff_t pos, uint64_t count,
+		 const struct iomap *map),
+	TP_ARGS(inode, pos, count, map),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(loff_t,			isize)
+		__field(loff_t,			pos)
+		__field(uint64_t,		count)
+		__field(loff_t,			offset)
+		__field(uint64_t,		length)
+		__field(uint16_t,		maptype)
+		__field(uint16_t,		mapflags)
+		__field(bool,			has_buf)
+		__field(uint64_t,		validity_cookie)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->pos		=	pos;
+		__entry->count		=	count;
+		__entry->offset		=	map->offset;
+		__entry->length		=	map->length;
+		__entry->maptype	=	map->type;
+		__entry->mapflags	=	map->flags;
+		__entry->has_buf	=	map->inline_data != NULL;
+		__entry->validity_cookie=	map->validity_cookie;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx pos 0x%llx count 0x%llx offset 0x%llx length 0x%llx type %s mapflags (%s) has_buf? %d cookie 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __entry->pos, __entry->count,
+		  __entry->offset, __entry->length,
+		  __print_symbolic(__entry->maptype, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS),
+		  __entry->has_buf, __entry->validity_cookie)
+);
+#define DEFINE_FUSE_IOMAP_INLINE_EVENT(name)	\
+DEFINE_EVENT(fuse_iomap_inline_class, name,	\
+	TP_PROTO(const struct inode *inode, loff_t pos, uint64_t count, \
+		 const struct iomap *map), \
+	TP_ARGS(inode, pos, count, map))
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_read);
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 3f6e0496c4744b..5ef9fa67db807e 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -201,17 +201,6 @@ fuse_iomap_begin_validate(const struct fuse_iomap_begin_out *outarg,
 	    BAD_DATA(check_add_overflow(outarg->write_addr, outarg->length, &end)))
 		return -EIO;
 
-	if (!(opflags & FUSE_IOMAP_OP_REPORT)) {
-		/*
-		 * XXX inline data reads and writes are not supported, how do
-		 * we do this?
-		 */
-		if (BAD_DATA(outarg->read_type == FUSE_IOMAP_TYPE_INLINE))
-			return -EIO;
-		if (BAD_DATA(outarg->write_type == FUSE_IOMAP_TYPE_INLINE))
-			return -EIO;
-	}
-
 	return 0;
 }
 
@@ -312,6 +301,157 @@ fuse_iomap_set_device(struct iomap *iomap, const struct fuse_iomap_dev *fb)
 	iomap->dax_dev = NULL;
 }
 
+static inline int fuse_iomap_inline_alloc(struct iomap *iomap)
+{
+	ASSERT(iomap->inline_data == NULL);
+	ASSERT(iomap->length > 0);
+
+	iomap->inline_data = kvzalloc(iomap->length, GFP_KERNEL);
+	return iomap->inline_data ? 0 : -ENOMEM;
+}
+
+static inline void fuse_iomap_inline_free(struct iomap *iomap)
+{
+	kvfree(iomap->inline_data);
+	iomap->inline_data = NULL;
+}
+
+/*
+ * Use the FUSE_READ command to read inline file data from the fuse server.
+ * Note that there's no file handle attached, so the fuse server must be able
+ * to reconnect to the inode via the nodeid.
+ */
+static int fuse_iomap_inline_read(struct inode *inode, loff_t pos,
+				  loff_t count, struct iomap *iomap)
+{
+	struct fuse_read_in in = {
+		.offset = pos,
+		.size = count,
+	};
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	FUSE_ARGS(args);
+	ssize_t ret;
+
+	if (BAD_DATA(!iomap_inline_data_valid(iomap)))
+		return -EIO;
+
+	trace_fuse_iomap_inline_read(inode, pos, count, iomap);
+
+	args.opcode = FUSE_READ;
+	args.nodeid = fi->nodeid;
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(in);
+	args.in_args[0].value = &in;
+	args.out_argvar = true;
+	args.out_numargs = 1;
+	args.out_args[0].size = count;
+	args.out_args[0].value = iomap_inline_data(iomap, pos);
+
+	ret = fuse_simple_request(fm, &args);
+	if (ret < 0) {
+		fuse_iomap_inline_free(iomap);
+		return ret;
+	}
+	/* a zero-length read means something bad happened */
+	if (ret == 0) {
+		fuse_iomap_inline_free(iomap);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+/*
+ * Use the FUSE_WRITE command to write inline file data to the fuse server.
+ * Note that there's no file handle attached, so the fuse server must be able
+ * to reconnect to the inode via the nodeid.
+ */
+static int fuse_iomap_inline_write(struct inode *inode, loff_t pos,
+				   loff_t count, struct iomap *iomap)
+{
+	struct fuse_write_in in = {
+		.offset = pos,
+		.size = count,
+	};
+	struct fuse_write_out out = { };
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	FUSE_ARGS(args);
+	ssize_t ret;
+
+	if (BAD_DATA(!iomap_inline_data_valid(iomap)))
+		return -EIO;
+
+	trace_fuse_iomap_inline_write(inode, pos, count, iomap);
+
+	args.opcode = FUSE_WRITE;
+	args.nodeid = fi->nodeid;
+	args.in_numargs = 2;
+	args.in_args[0].size = sizeof(in);
+	args.in_args[0].value = &in;
+	args.in_args[1].size = count;
+	args.in_args[1].value = iomap_inline_data(iomap, pos);
+	args.out_numargs = 1;
+	args.out_args[0].size = sizeof(out);
+	args.out_args[0].value = &out;
+
+	ret = fuse_simple_request(fm, &args);
+	if (ret < 0) {
+		fuse_iomap_inline_free(iomap);
+		return ret;
+	}
+	/* short write means something bad happened */
+	if (out.size < count) {
+		fuse_iomap_inline_free(iomap);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+/* Set up inline data buffers for iomap_begin */
+static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags,
+				 loff_t pos, loff_t count,
+				 struct iomap *iomap, struct iomap *srcmap)
+{
+	int err;
+
+	if (opflags & FUSE_IOMAP_OP_REPORT)
+		return 0;
+
+	if (fuse_is_iomap_file_write(opflags)) {
+		if (iomap->type == IOMAP_INLINE) {
+			err = fuse_iomap_inline_alloc(iomap);
+			if (err)
+				return err;
+		}
+
+		if (srcmap->type == IOMAP_INLINE) {
+			err = fuse_iomap_inline_alloc(srcmap);
+			if (!err)
+				err = fuse_iomap_inline_read(inode, pos, count,
+							     srcmap);
+			if (err) {
+				fuse_iomap_inline_free(iomap);
+				return err;
+			}
+		}
+	} else if (iomap->type == IOMAP_INLINE) {
+		/* inline data read */
+		err = fuse_iomap_inline_alloc(iomap);
+		if (!err)
+			err = fuse_iomap_inline_read(inode, pos, count, iomap);
+		if (err)
+			return err;
+	}
+
+	trace_fuse_iomap_set_inline_iomap(inode, pos, count, iomap);
+	trace_fuse_iomap_set_inline_srcmap(inode, pos, count, srcmap);
+
+	return 0;
+}
+
 static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 			    unsigned opflags, struct iomap *iomap,
 			    struct iomap *srcmap)
@@ -399,12 +539,20 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 		fuse_iomap_set_device(iomap, read_dev);
 	}
 
+	if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) {
+		err = fuse_iomap_set_inline(inode, opflags, pos, count, iomap,
+					    srcmap);
+		if (err)
+			goto out_write_dev;
+	}
+
 	/*
 	 * XXX: if we ever want to support closing devices, we need a way to 
 	 * track the fuse_iomap_dev refcount all the way through bio endios.
 	 * For now we put the refcount here because you can't remove an iomap
 	 * device until unmount time.
 	 */
+out_write_dev:
 	fuse_iomap_dev_put(write_dev);
 out_read_dev:
 	fuse_iomap_dev_put(read_dev);
@@ -448,9 +596,26 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
 		.map_flags = iomap->flags,
 	};
 	struct fuse_mount *fm = get_fuse_mount(inode);
+	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
+	struct iomap *srcmap = &iter->srcmap;
 	FUSE_ARGS(args);
 	int err;
 
+	if (srcmap->inline_data)
+		fuse_iomap_inline_free(srcmap);
+
+	if (iomap->inline_data) {
+		if (fuse_is_iomap_file_write(opflags) && written > 0) {
+			err = fuse_iomap_inline_write(inode, pos, written,
+						      iomap);
+			fuse_iomap_inline_free(iomap);
+			if (err)
+				goto out_err;
+		} else {
+			fuse_iomap_inline_free(iomap);
+		}
+	}
+
 	if (!fuse_want_iomap_end(iomap, opflags, count, written))
 		return 0;
 
@@ -463,6 +628,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
 	args.in_args[0].value = &inarg;
 	err = fuse_simple_request(fm, &args);
 
+out_err:
 	trace_fuse_iomap_end_error(inode, &inarg, err);
 
 	return err;
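
For anyone skimming the thread, here is a rough userspace model of the inline data lifecycle that this patch sets up: allocate a buffer in iomap_begin, fill it from the server for reads, push it back to the server in iomap_end for writes, and always free it afterwards.  This is only a sketch.  Every toy_* name is hypothetical, the "server" is a plain in-memory struct instead of a FUSE_READ/FUSE_WRITE upcall, and the file position is ignored (everything happens at offset 0).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the fuse server's copy of a small file. */
struct toy_server {
	char	data[64];
	size_t	size;
};

/* Hypothetical stand-in for the inline-data fields of struct iomap. */
struct toy_iomap {
	void	*inline_data;
	size_t	length;
};

static void toy_inline_free(struct toy_iomap *map)
{
	free(map->inline_data);
	map->inline_data = NULL;
}

/* Begin: allocate a zeroed buffer; reads also fill it from the server. */
static int toy_begin(struct toy_server *srv, struct toy_iomap *map,
		     bool is_write)
{
	size_t n;

	assert(map->inline_data == NULL);
	assert(map->length > 0);

	map->inline_data = calloc(1, map->length);
	if (!map->inline_data)
		return -ENOMEM;
	if (is_write)
		return 0;

	n = srv->size < map->length ? srv->size : map->length;
	if (n == 0) {
		/* the server returned no bytes; give up */
		toy_inline_free(map);
		return -EIO;
	}
	memcpy(map->inline_data, srv->data, n);
	return 0;
}

/* End: writes push the bytes back to the server; the buffer is always freed. */
static int toy_end(struct toy_server *srv, struct toy_iomap *map,
		   bool is_write, size_t written)
{
	int err = 0;

	if (!map->inline_data)
		return 0;

	if (is_write && written > 0) {
		if (written > map->length || written > sizeof(srv->data)) {
			err = -EIO;
		} else {
			memcpy(srv->data, map->inline_data, written);
			if (written > srv->size)
				srv->size = written;
		}
	}
	toy_inline_free(map);
	return err;
}
```

The point of the model is the ownership rule: whoever allocates the buffer in the begin step hands it to the end step, and the end step frees it on every path, including the error path.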



* [PATCH 1/4] fuse: cache iomaps
  2025-07-17 23:24 ` [PATCHSET RFC v3 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-07-17 23:31   ` Darrick J. Wong
  2025-07-17 23:32   ` [PATCH 2/4] fuse: use the iomap cache for iomap_begin Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:31 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Cache a file's iomaps in the kernel so that we don't have to upcall the
fuse server to look up mappings that are already cached.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
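To illustrate the upsert semantics declared in fuse_i.h (remove whatever is cached for the range, then add the new record), here is a small userspace sketch.  It is not the extent tree from iomap_cache.c: the toy_* names are made up, the cache is a flat fixed-size array, and the trimming and splitting of partially overlapped records is an assumption about how a range removal would behave.  Overflow of off + len is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_MAPS	16

struct toy_map {
	uint64_t	offset;	/* file offset of mapping, bytes */
	uint64_t	length;	/* length of mapping, bytes */
	uint64_t	addr;	/* disk offset of mapping, bytes */
};

struct toy_cache {
	struct toy_map	maps[TOY_MAX_MAPS];
	int		nr;
};

static int toy_cache_add(struct toy_cache *c, const struct toy_map *m)
{
	if (c->nr >= TOY_MAX_MAPS)
		return -1;
	c->maps[c->nr++] = *m;
	return 0;
}

/* Drop [off, off + len) from the cache, trimming or splitting records. */
static int toy_cache_remove(struct toy_cache *c, uint64_t off, uint64_t len)
{
	uint64_t end = off + len;
	int i = 0;

	while (i < c->nr) {
		struct toy_map *m = &c->maps[i];
		uint64_t m_end = m->offset + m->length;

		if (m_end <= off || m->offset >= end) {
			i++;				/* no overlap */
		} else if (m->offset < off && m_end > end) {
			/* hole punched in the middle: keep head and tail */
			struct toy_map tail = {
				.offset	= end,
				.length	= m_end - end,
				.addr	= m->addr + (end - m->offset),
			};

			if (toy_cache_add(c, &tail))
				return -1;
			m->length = off - m->offset;
			i++;
		} else if (m->offset < off) {
			m->length = off - m->offset;	/* keep the head */
			i++;
		} else if (m_end > end) {
			m->addr += end - m->offset;	/* keep the tail */
			m->length = m_end - end;
			m->offset = end;
			i++;
		} else {
			*m = c->maps[--c->nr];		/* fully covered */
		}
	}
	return 0;
}

/* Remove anything cached for the new record's range, then add it. */
static int toy_cache_upsert(struct toy_cache *c, const struct toy_map *m)
{
	int err = toy_cache_remove(c, m->offset, m->length);

	if (err)
		return err;
	return toy_cache_add(c, m);
}

/* Return true (a hit) if some cached record covers file offset pos. */
static bool toy_cache_lookup(const struct toy_cache *c, uint64_t pos,
			     struct toy_map *out)
{
	for (int i = 0; i < c->nr; i++) {
		const struct toy_map *m = &c->maps[i];

		if (pos >= m->offset && pos - m->offset < m->length) {
			*out = *m;
			return true;
		}
	}
	return false;
}
```

Upserting a 1k record into the middle of an 8k record leaves three records behind, and the tail's addr is advanced by the same amount as its file offset, so lookups on either side still resolve to the right disk address.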
 fs/fuse/fuse_i.h          |   87 ++
 fs/fuse/fuse_trace.h      |  436 ++++++++++++
 fs/fuse/iomap_cache.h     |  119 +++
 include/uapi/linux/fuse.h |    4 
 fs/fuse/Makefile          |    2 
 fs/fuse/dev.c             |    1 
 fs/fuse/file_iomap.c      |   32 +
 fs/fuse/iomap_cache.c     | 1651 +++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 2322 insertions(+), 10 deletions(-)
 create mode 100644 fs/fuse/iomap_cache.h
 create mode 100644 fs/fuse/iomap_cache.c
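
The im_seq counter and the validity_cookie field suggest the usual sample-and-compare revalidation pattern.  Below is a minimal userspace sketch of that pattern, assuming that every change to the cache bumps the counter; when the counter actually changes is not spelled out in this message, so treat that part as a guess.  The toy_* names are hypothetical and C11 atomics stand in for READ_ONCE().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the im_seq part of struct fuse_iomap_cache. */
struct toy_seq_cache {
	_Atomic uint64_t	seq;	/* validity counter */
};

/* Hypothetical stand-in for a mapping handed out to an IO path. */
struct toy_mapping {
	uint64_t	offset;
	uint64_t	length;
	uint64_t	validity_cookie;
};

/* Stamp the mapping with the counter value seen at lookup time. */
static void toy_sample(struct toy_seq_cache *c, struct toy_mapping *m)
{
	m->validity_cookie = atomic_load(&c->seq);
}

/* Assumption: any insert or remove in the cache bumps the counter. */
static void toy_cache_changed(struct toy_seq_cache *c)
{
	atomic_fetch_add(&c->seq, 1);
}

/* The mapping is stale if the cache changed after it was sampled. */
static bool toy_mapping_valid(struct toy_seq_cache *c,
			      const struct toy_mapping *m)
{
	return m->validity_cookie == atomic_load(&c->seq);
}
```

A caller that finds its mapping stale would look the range up again and take a fresh sample before continuing.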


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 5fba84c75f4a64..196d2b57e80bb1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -106,6 +106,24 @@ struct fuse_backing {
 	struct rcu_head rcu;
 };
 
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+/*
+ * File incore extent information, present for each of the read & write forks.
+ */
+struct fuse_ifork {
+	int64_t			if_bytes;	/* bytes in if_data */
+	void			*if_data;	/* extent tree root */
+	int			if_height;	/* height of the extent tree */
+};
+
+struct fuse_iomap_cache {
+	struct fuse_ifork	im_read;
+	struct fuse_ifork	*im_write;
+	uint64_t		im_seq;		/* validity counter */
+	struct rw_semaphore	im_lock;	/* mapping lock */
+};
+#endif
+
 /** FUSE inode */
 struct fuse_inode {
 	/** Inode data */
@@ -167,6 +185,7 @@ struct fuse_inode {
 			spinlock_t ioend_lock;
 			struct work_struct ioend_work;
 			struct list_head ioend_list;
+			struct fuse_iomap_cache cache;
 #endif
 		};
 
@@ -237,6 +256,11 @@ enum {
 	FUSE_I_IOMAP_DIRECTIO,
 	/* Use iomap for buffered read and writes */
 	FUSE_I_IOMAP_FILEIO,
+	/*
+	 * Cache iomaps in the kernel.  This is required for any filesystem
+	 * that needs to synchronize pagecache writes and writeback.
+	 */
+	FUSE_I_IOMAP_CACHE,
 };
 
 struct fuse_conn;
@@ -1716,6 +1740,65 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
 				 struct fuse_iomap_support __user *argp);
 
 int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
+
+enum fuse_iomap_fork {
+	FUSE_IOMAP_READ_FORK,
+	FUSE_IOMAP_WRITE_FORK,
+};
+
+struct fuse_iomap {
+	uint64_t		addr;	/* disk offset of mapping, bytes */
+	loff_t			offset;	/* file offset of mapping, bytes */
+	uint64_t		length;	/* length of mapping, bytes */
+	uint16_t		type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t		flags;	/* FUSE_IOMAP_F_* */
+	uint32_t		dev;	/* device cookie */
+	uint64_t		validity_cookie; /* used with .iomap_valid() */
+};
+
+static inline bool fuse_has_iomap_cache(const struct inode *inode)
+{
+	const struct fuse_inode *fi = get_fuse_inode_c(inode);
+
+	return test_bit(FUSE_I_IOMAP_CACHE, &fi->state);
+}
+
+int fuse_iomap_cache_remove(struct inode *inode,
+			    enum fuse_iomap_fork whichfork,
+			    loff_t off, uint64_t len);
+
+int fuse_iomap_cache_add(struct inode *inode,
+			 enum fuse_iomap_fork whichfork,
+			 const struct fuse_iomap *map);
+
+static inline int fuse_iomap_cache_upsert(struct inode *inode,
+					  enum fuse_iomap_fork whichfork,
+					  const struct fuse_iomap *map)
+{
+	int err = fuse_iomap_cache_remove(inode, whichfork, map->offset,
+					  map->length);
+	if (err)
+		return err;
+
+	return fuse_iomap_cache_add(inode, whichfork, map);
+}
+
+static inline uint64_t fuse_iext_read_seq(struct fuse_iomap_cache *ip)
+{
+	return (uint64_t)READ_ONCE(ip->im_seq);
+}
+
+enum fuse_iomap_lookup_result {
+	LOOKUP_HIT,
+	LOOKUP_MISS,
+	LOOKUP_NOFORK,
+};
+
+enum fuse_iomap_lookup_result
+fuse_iomap_cache_lookup(struct inode *inode,
+			enum fuse_iomap_fork whichfork,
+			loff_t off, uint64_t len,
+			struct fuse_iomap *mval);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1745,6 +1828,10 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
 # define fuse_iomap_flush_unmap_range(...)	(-ENOSYS)
 # define fuse_dev_ioctl_iomap_support(...)	(-EOPNOTSUPP)
 # define fuse_iomap_fadvise			NULL
+# define fuse_has_iomap_cache(...)		(false)
+# define fuse_iomap_cache_remove(...)		(-ENOSYS)
+# define fuse_iomap_cache_add(...)		(-ENOSYS)
+# define fuse_iomap_cache_upsert(...)		(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 20257aed0cd89f..598c0e603a32b1 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -129,6 +129,7 @@ TRACE_EVENT(fuse_request_end,
 );
 
 #if IS_ENABLED(CONFIG_FUSE_IOMAP)
+struct fuse_iext_cursor;
 
 #define FUSE_IOMAP_F_STRINGS \
 	{ FUSE_IOMAP_F_NEW,			"new" }, \
@@ -182,6 +183,7 @@ TRACE_DEFINE_ENUM(FUSE_I_BTIME);
 TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
 TRACE_DEFINE_ENUM(FUSE_I_IOMAP_DIRECTIO);
 TRACE_DEFINE_ENUM(FUSE_I_IOMAP_FILEIO);
+TRACE_DEFINE_ENUM(FUSE_I_IOMAP_CACHE);
 
 #define FUSE_IFLAG_STRINGS \
 	{ 1 << FUSE_I_ADVISE_RDPLUS,		"advise_rdplus" }, \
@@ -191,7 +193,8 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP_FILEIO);
 	{ 1 << FUSE_I_BTIME,			"btime" }, \
 	{ 1 << FUSE_I_CACHE_IO_MODE,		"cacheio" }, \
 	{ 1 << FUSE_I_IOMAP_DIRECTIO,		"iomap_dio" }, \
-	{ 1 << FUSE_I_IOMAP_FILEIO,		"iomap_fileio" }
+	{ 1 << FUSE_I_IOMAP_FILEIO,		"iomap_fileio" }, \
+	{ 1 << FUSE_I_IOMAP_CACHE,		"iomap_cache" }
 
 #define IOMAP_IOEND_STRINGS \
 	{ IOMAP_IOEND_SHARED,			"shared" }, \
@@ -207,6 +210,26 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP_FILEIO);
 	{ FUSE_IOMAP_CONFIG_TIME,		"time" }, \
 	{ FUSE_IOMAP_CONFIG_MAXBYTES,		"maxbytes" }
 
+TRACE_DEFINE_ENUM(FUSE_IOMAP_READ_FORK);
+TRACE_DEFINE_ENUM(FUSE_IOMAP_WRITE_FORK);
+
+#define FUSE_IOMAP_FORK_STRINGS \
+	{ FUSE_IOMAP_READ_FORK,			"read" }, \
+	{ FUSE_IOMAP_WRITE_FORK,		"write" }
+
+#define FUSE_IOMAP_CACHE_LOCK_STRINGS \
+	{ FUSE_IOMAP_LOCK_SHARED,		"shared" }, \
+	{ FUSE_IOMAP_LOCK_EXCL,			"exclusive" }
+
+#define FUSE_IEXT_STATE_STRINGS \
+	{ FUSE_IEXT_LEFT_CONTIG,		"l_cont" }, \
+	{ FUSE_IEXT_RIGHT_CONTIG,		"r_cont" }, \
+	{ FUSE_IEXT_LEFT_FILLING,		"l_fill" }, \
+	{ FUSE_IEXT_RIGHT_FILLING,		"r_fill" }, \
+	{ FUSE_IEXT_LEFT_VALID,			"l_valid" }, \
+	{ FUSE_IEXT_RIGHT_VALID,		"r_valid" }, \
+	{ FUSE_IEXT_WRITEFORK,			"writefork" }
+
 TRACE_EVENT(fuse_iomap_begin,
 	TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
 		 unsigned opflags),
@@ -1289,6 +1312,417 @@ DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_read);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
+
+DECLARE_EVENT_CLASS(fuse_iomap_cache_lock_class,
+	TP_PROTO(const struct inode *inode, unsigned int lock_flags,
+		 unsigned long caller_ip),
+	TP_ARGS(inode, lock_flags, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, connection)
+		__field(uint64_t, ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(unsigned int, lock_flags)
+		__field(unsigned long, caller_ip)
+	),
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->lock_flags	=	lock_flags;
+		__entry->caller_ip	=	caller_ip;
+	),
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx lock (%s) caller %pS",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->lock_flags, "|", FUSE_IOMAP_CACHE_LOCK_STRINGS),
+		  (void *)__entry->caller_ip)
+);
+#define DEFINE_FUSE_IOMAP_CACHE_LOCK_EVENT(name)	\
+DEFINE_EVENT(fuse_iomap_cache_lock_class, name,		\
+	TP_PROTO(const struct inode *inode, unsigned int lock_flags, \
+		 unsigned long caller_ip), \
+	TP_ARGS(inode, lock_flags, caller_ip))
+DEFINE_FUSE_IOMAP_CACHE_LOCK_EVENT(fuse_iomap_cache_lock);
+DEFINE_FUSE_IOMAP_CACHE_LOCK_EVENT(fuse_iomap_cache_unlock);
+
+DECLARE_EVENT_CLASS(fuse_iext_class,
+	TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur,
+		 int state, unsigned long caller_ip),
+
+	TP_ARGS(inode, cur, state, caller_ip),
+
+	TP_STRUCT__entry(
+		__field(dev_t, connection)
+		__field(uint64_t, ino)
+		__field(void *, leaf)
+		__field(int, pos)
+		__field(loff_t, offset)
+		__field(uint64_t, addr)
+		__field(uint64_t, length)
+		__field(uint16_t, type)
+		__field(uint16_t, mapflags)
+		__field(uint32_t, dev)
+		__field(int, iext_state)
+		__field(unsigned long, caller_ip)
+	),
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+		const struct fuse_ifork *ifp;
+		struct fuse_iomap	r = { };
+
+		if (state & FUSE_IEXT_WRITEFORK)
+			ifp = fi->cache.im_write;
+		else
+			ifp = &fi->cache.im_read;
+		if (ifp)
+			fuse_iext_get_extent(ifp, cur, &r);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->leaf		=	cur->leaf;
+		__entry->pos		=	cur->pos;
+		__entry->offset		=	r.offset;
+		__entry->addr		=	r.addr;
+		__entry->length		=	r.length;
+		__entry->dev		=	r.dev;
+		__entry->type		=	r.type;
+		__entry->mapflags	=	r.flags;
+		__entry->iext_state	=	state;
+		__entry->caller_ip	=	caller_ip;
+	),
+	TP_printk("connection %u ino %llu state (%s) cur %p/%d "
+		  "offset 0x%llx addr 0x%llx length 0x%llx type %s mapflags (%s) dev %u caller %pS",
+		  __entry->connection, __entry->ino,
+		  __print_flags(__entry->iext_state, "|", FUSE_IEXT_STATE_STRINGS),
+		  __entry->leaf,
+		  __entry->pos,
+		  __entry->offset,
+		  __entry->addr,
+		  __entry->length,
+		  __print_symbolic(__entry->type, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS),
+		  __entry->dev,
+		  (void *)__entry->caller_ip)
+);
+
+#define DEFINE_IEXT_EVENT(name) \
+DEFINE_EVENT(fuse_iext_class, name, \
+	TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur, \
+		 int state, unsigned long caller_ip), \
+	TP_ARGS(inode, cur, state, caller_ip))
+DEFINE_IEXT_EVENT(fuse_iext_insert);
+DEFINE_IEXT_EVENT(fuse_iext_remove);
+DEFINE_IEXT_EVENT(fuse_iext_pre_update);
+DEFINE_IEXT_EVENT(fuse_iext_post_update);
+
+DECLARE_EVENT_CLASS(fuse_iext_update_class,
+	TP_PROTO(const struct inode *inode, uint32_t iext_state,
+		 const struct fuse_iomap *map),
+	TP_ARGS(inode, iext_state, map),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(loff_t,			isize)
+
+		__field(loff_t,			map_offset)
+		__field(loff_t,			map_length)
+		__field(uint16_t,		map_type)
+		__field(uint16_t,		map_flags)
+		__field(uint32_t,		map_dev)
+		__field(uint64_t,		map_addr)
+
+		__field(uint32_t,		iext_state)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+
+		__entry->map_offset	=	map->offset;
+		__entry->map_length	=	map->length;
+		__entry->map_type	=	map->type;
+		__entry->map_flags	=	map->flags;
+		__entry->map_dev	=	map->dev;
+		__entry->map_addr	=	map->addr;
+
+		__entry->iext_state	=	iext_state;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx state (%s) offset 0x%llx length 0x%llx type %s mapflags (%s) dev %u addr 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_flags(__entry->iext_state, "|", FUSE_IEXT_STATE_STRINGS),
+		  __entry->map_offset, __entry->map_length,
+		  __print_symbolic(__entry->map_type, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->map_flags, "|", FUSE_IOMAP_F_STRINGS),
+		  __entry->map_dev, __entry->map_addr)
+);
+#define DEFINE_IEXT_UPDATE_EVENT(name) \
+DEFINE_EVENT(fuse_iext_update_class, name, \
+	TP_PROTO(const struct inode *inode, uint32_t iext_state, \
+		 const struct fuse_iomap *map), \
+	TP_ARGS(inode, iext_state, map))
+DEFINE_IEXT_UPDATE_EVENT(fuse_iext_del_mapping);
+DEFINE_IEXT_UPDATE_EVENT(fuse_iext_add_mapping);
+
+DECLARE_EVENT_CLASS(fuse_iext_alt_update_class,
+	TP_PROTO(const struct inode *inode, const struct fuse_iomap *map),
+	TP_ARGS(inode, map),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+
+		__field(loff_t,			map_offset)
+		__field(loff_t,			map_length)
+		__field(uint16_t,		map_type)
+		__field(uint16_t,		map_flags)
+		__field(uint32_t,		map_dev)
+		__field(uint64_t,		map_addr)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+
+		__entry->map_offset	=	map->offset;
+		__entry->map_length	=	map->length;
+		__entry->map_type	=	map->type;
+		__entry->map_flags	=	map->flags;
+		__entry->map_dev	=	map->dev;
+		__entry->map_addr	=	map->addr;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu offset 0x%llx length 0x%llx type %s mapflags (%s) dev %u addr 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->map_offset, __entry->map_length,
+		  __print_symbolic(__entry->map_type, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->map_flags, "|", FUSE_IOMAP_F_STRINGS),
+		  __entry->map_dev, __entry->map_addr)
+);
+#define DEFINE_IEXT_ALT_UPDATE_EVENT(name) \
+DEFINE_EVENT(fuse_iext_alt_update_class, name, \
+	TP_PROTO(const struct inode *inode, const struct fuse_iomap *map), \
+	TP_ARGS(inode, map))
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_del_mapping_got);
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_left);
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_right);
+
+TRACE_EVENT(fuse_iomap_cache_remove,
+	TP_PROTO(const struct inode *inode, enum fuse_iomap_fork whichfork,
+		 loff_t offset, uint64_t length, unsigned long caller_ip),
+	TP_ARGS(inode, whichfork, offset, length, caller_ip),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(loff_t,			isize)
+		__field(enum fuse_iomap_fork,	whichfork)
+		__field(loff_t,			offset)
+		__field(uint64_t,		length)
+		__field(unsigned long,		caller_ip)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->whichfork	=	whichfork;
+		__entry->offset		=	offset;
+		__entry->length		=	length;
+		__entry->caller_ip	=	caller_ip;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx whichfork %s offset 0x%llx length 0x%llx caller %pS",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_symbolic(__entry->whichfork, FUSE_IOMAP_FORK_STRINGS),
+		  __entry->offset, __entry->length, (void *)__entry->caller_ip)
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_mapping_class,
+	TP_PROTO(const struct inode *inode, enum fuse_iomap_fork whichfork,
+		 const struct fuse_iomap *map, unsigned long caller_ip),
+	TP_ARGS(inode, whichfork, map, caller_ip),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(loff_t,			isize)
+		__field(enum fuse_iomap_fork,	whichfork)
+		__field(loff_t,			offset)
+		__field(loff_t,			length)
+		__field(uint16_t,		maptype)
+		__field(uint16_t,		mapflags)
+		__field(uint32_t,		dev)
+		__field(uint64_t,		addr)
+		__field(unsigned long,		caller_ip)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->whichfork	=	whichfork;
+		__entry->offset		=	map->offset;
+		__entry->length		=	map->length;
+		__entry->maptype	=	map->type;
+		__entry->mapflags	=	map->flags;
+		__entry->dev		=	map->dev;
+		__entry->addr		=	map->addr;
+		__entry->caller_ip	=	caller_ip;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx whichfork %s offset 0x%llx length 0x%llx type %s mapflags (%s) dev %u addr 0x%llx caller %pS",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_symbolic(__entry->whichfork, FUSE_IOMAP_FORK_STRINGS),
+		  __entry->offset, __entry->length,
+		  __print_symbolic(__entry->maptype, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS),
+		  __entry->dev, __entry->addr, (void *)__entry->caller_ip)
+);
+#define DEFINE_FUSE_IOMAP_MAPPING_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_mapping_class, name, \
+	TP_PROTO(const struct inode *inode, enum fuse_iomap_fork whichfork, \
+		 const struct fuse_iomap *map, unsigned long caller_ip), \
+	TP_ARGS(inode, whichfork, map, caller_ip))
+DEFINE_FUSE_IOMAP_MAPPING_EVENT(fuse_iomap_cache_add);
+DEFINE_FUSE_IOMAP_MAPPING_EVENT(fuse_iext_check_mapping);
+
+TRACE_EVENT(fuse_iomap_cache_lookup,
+	TP_PROTO(const struct inode *inode, enum fuse_iomap_fork whichfork,
+		 loff_t pos, uint64_t count, unsigned long caller_ip),
+	TP_ARGS(inode, whichfork, pos, count, caller_ip),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(loff_t,			isize)
+		__field(enum fuse_iomap_fork,	whichfork)
+		__field(loff_t,			pos)
+		__field(uint64_t,		count)
+		__field(unsigned long,		caller_ip)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->whichfork	=	whichfork;
+		__entry->pos		=	pos;
+		__entry->count		=	count;
+		__entry->caller_ip	=	caller_ip;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx whichfork %s pos 0x%llx count 0x%llx caller %pS",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_symbolic(__entry->whichfork, FUSE_IOMAP_FORK_STRINGS),
+		  __entry->pos, __entry->count,
+		  (void *)__entry->caller_ip)
+);
+
+TRACE_EVENT(fuse_iomap_cache_lookup_result,
+	TP_PROTO(const struct inode *inode, enum fuse_iomap_fork whichfork,
+		 loff_t pos, uint64_t count, const struct fuse_iomap *got,
+		 const struct fuse_iomap *map),
+	TP_ARGS(inode, whichfork, pos, count, got, map),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(loff_t,			isize)
+		__field(enum fuse_iomap_fork,	whichfork)
+		__field(loff_t,			pos)
+		__field(uint64_t,		count)
+
+		__field(loff_t,			got_offset)
+		__field(uint64_t,		got_length)
+		__field(uint64_t,		got_addr)
+
+		__field(loff_t,			map_offset)
+		__field(uint64_t,		map_length)
+		__field(uint16_t,		map_type)
+		__field(uint16_t,		map_flags)
+		__field(uint32_t,		map_dev)
+		__field(uint64_t,		map_addr)
+
+		__field(uint64_t,		validity_cookie)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->whichfork	=	whichfork;
+		__entry->pos		=	pos;
+		__entry->count		=	count;
+
+		__entry->got_offset	=	got->offset;
+		__entry->got_length	=	got->length;
+		__entry->got_addr	=	got->addr;
+
+		__entry->map_offset	=	map->offset;
+		__entry->map_length	=	map->length;
+		__entry->map_type	=	map->type;
+		__entry->map_flags	=	map->flags;
+		__entry->map_dev	=	map->dev;
+		__entry->map_addr	=	map->addr;
+
+		__entry->validity_cookie=	map->validity_cookie;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx whichfork %s pos 0x%llx count 0x%llx map offset 0x%llx length 0x%llx type %s mapflags (%s) dev %u addr 0x%llx got offset 0x%llx length 0x%llx addr 0x%llx cookie 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize,
+		  __print_symbolic(__entry->whichfork, FUSE_IOMAP_FORK_STRINGS),
+		  __entry->pos, __entry->count,
+		  __entry->map_offset, __entry->map_length,
+		  __print_symbolic(__entry->map_type, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->map_flags, "|", FUSE_IOMAP_F_STRINGS),
+		  __entry->map_dev, __entry->map_addr, __entry->got_offset,
+		  __entry->got_length, __entry->got_addr,
+		  __entry->validity_cookie)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/iomap_cache.h b/fs/fuse/iomap_cache.h
new file mode 100644
index 00000000000000..7efa23be18d155
--- /dev/null
+++ b/fs/fuse/iomap_cache.h
@@ -0,0 +1,119 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2017 Christoph Hellwig.
+ */
+
+#ifndef _FS_FUSE_IOMAP_CACHE_H
+#define _FS_FUSE_IOMAP_CACHE_H
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+# define ASSERT(a)		do { WARN(!(a), "Assertion failed: %s, func: %s, line: %d", #a, __func__, __LINE__); } while (0)
+# define BAD_DATA(condition)	(WARN(condition, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__))
+#else
+# define ASSERT(a)
+# define BAD_DATA(condition)	(condition)
+#endif
+
+#define FUSE_IOMAP_LOCK_SHARED	(1U << 0)
+#define FUSE_IOMAP_LOCK_EXCL	(1U << 1)
+
+void fuse_iomap_cache_lock(struct inode *inode, unsigned int lock_flags);
+void fuse_iomap_cache_unlock(struct inode *inode, unsigned int lock_flags);
+
+#define FUSE_IOMAP_MAX_LEN	((loff_t)(1ULL << 63))
+
+struct fuse_iext_leaf;
+
+struct fuse_iext_cursor {
+	struct fuse_iext_leaf	*leaf;
+	int			pos;
+};
+
+#define FUSE_IEXT_LEFT_CONTIG	(1u << 0)
+#define FUSE_IEXT_RIGHT_CONTIG	(1u << 1)
+#define FUSE_IEXT_LEFT_FILLING	(1u << 2)
+#define FUSE_IEXT_RIGHT_FILLING	(1u << 3)
+#define FUSE_IEXT_LEFT_VALID	(1u << 4)
+#define FUSE_IEXT_RIGHT_VALID	(1u << 5)
+#define FUSE_IEXT_WRITEFORK	(1u << 6)
+
+struct fuse_ifork *fuse_iext_state_to_fork(struct fuse_iomap_cache *ip,
+		unsigned int state);
+
+uint64_t	fuse_iext_count(const struct fuse_ifork *ifp);
+void		fuse_iext_insert_raw(struct fuse_iomap_cache *ip,
+			struct fuse_ifork *ifp,
+			struct fuse_iext_cursor *cur,
+			const struct fuse_iomap *irec);
+void		fuse_iext_insert(struct fuse_iomap_cache *,
+			struct fuse_iext_cursor *cur,
+			const struct fuse_iomap *, int);
+void		fuse_iext_remove(struct fuse_iomap_cache *,
+			struct fuse_iext_cursor *,
+			int);
+void		fuse_iext_destroy(struct fuse_ifork *);
+
+bool		fuse_iext_lookup_extent(struct fuse_iomap_cache *ip,
+			struct fuse_ifork *ifp, loff_t bno,
+			struct fuse_iext_cursor *cur,
+			struct fuse_iomap *gotp);
+bool		fuse_iext_lookup_extent_before(struct fuse_iomap_cache *ip,
+			struct fuse_ifork *ifp, loff_t *end,
+			struct fuse_iext_cursor *cur,
+			struct fuse_iomap *gotp);
+bool		fuse_iext_get_extent(const struct fuse_ifork *ifp,
+			const struct fuse_iext_cursor *cur,
+			struct fuse_iomap *gotp);
+void		fuse_iext_update_extent(struct fuse_iomap_cache *ip, int state,
+			struct fuse_iext_cursor *cur,
+			struct fuse_iomap *gotp);
+
+void		fuse_iext_first(struct fuse_ifork *, struct fuse_iext_cursor *);
+void		fuse_iext_last(struct fuse_ifork *, struct fuse_iext_cursor *);
+void		fuse_iext_next(struct fuse_ifork *, struct fuse_iext_cursor *);
+void		fuse_iext_prev(struct fuse_ifork *, struct fuse_iext_cursor *);
+
+static inline bool fuse_iext_next_extent(struct fuse_ifork *ifp,
+		struct fuse_iext_cursor *cur, struct fuse_iomap *gotp)
+{
+	fuse_iext_next(ifp, cur);
+	return fuse_iext_get_extent(ifp, cur, gotp);
+}
+
+static inline bool fuse_iext_prev_extent(struct fuse_ifork *ifp,
+		struct fuse_iext_cursor *cur, struct fuse_iomap *gotp)
+{
+	fuse_iext_prev(ifp, cur);
+	return fuse_iext_get_extent(ifp, cur, gotp);
+}
+
+/*
+ * Return the extent after cur in gotp without updating the cursor.
+ */
+static inline bool fuse_iext_peek_next_extent(struct fuse_ifork *ifp,
+		struct fuse_iext_cursor *cur, struct fuse_iomap *gotp)
+{
+	struct fuse_iext_cursor ncur = *cur;
+
+	fuse_iext_next(ifp, &ncur);
+	return fuse_iext_get_extent(ifp, &ncur, gotp);
+}
+
+/*
+ * Return the extent before cur in gotp without updating the cursor.
+ */
+static inline bool fuse_iext_peek_prev_extent(struct fuse_ifork *ifp,
+		struct fuse_iext_cursor *cur, struct fuse_iomap *gotp)
+{
+	struct fuse_iext_cursor ncur = *cur;
+
+	fuse_iext_prev(ifp, &ncur);
+	return fuse_iext_get_extent(ifp, &ncur, gotp);
+}
+
+#define for_each_fuse_iext(ifp, ext, got)		\
+	for (fuse_iext_first((ifp), (ext));		\
+	     fuse_iext_get_extent((ifp), (ext), (got));	\
+	     fuse_iext_next((ifp), (ext)))
+
+#endif /* _FS_FUSE_IOMAP_CACHE_H */
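
As a rough illustration of how the cursor helpers declared above compose,
here is a userspace sketch of the for_each_fuse_iext() pattern: position the
cursor at the first record, copy out a record while the cursor is valid, then
advance. This is not the kernel code. The toy_* names, the flat array standing
in for the btree, and the helper toy_mapped_bytes() are all hypothetical and
exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for fuse_iomap, fuse_ifork and fuse_iext_cursor. */
struct toy_map { uint64_t offset; uint64_t length; };
struct toy_fork { const struct toy_map *recs; int nr; };
struct toy_cursor { int pos; };

static void toy_first(const struct toy_fork *ifp, struct toy_cursor *cur)
{
	(void)ifp;
	cur->pos = 0;
}

static void toy_next(const struct toy_fork *ifp, struct toy_cursor *cur)
{
	(void)ifp;
	cur->pos++;
}

/* Copy out the record under the cursor; false once the cursor is invalid. */
static bool toy_get(const struct toy_fork *ifp, const struct toy_cursor *cur,
		    struct toy_map *got)
{
	if (cur->pos < 0 || cur->pos >= ifp->nr)
		return false;
	/* As in the patch, a zero-length record counts as an empty slot. */
	if (ifp->recs[cur->pos].length == 0)
		return false;
	*got = ifp->recs[cur->pos];
	return true;
}

#define for_each_toy(ifp, cur, got)		\
	for (toy_first((ifp), (cur));		\
	     toy_get((ifp), (cur), (got));	\
	     toy_next((ifp), (cur)))

/* Sum the bytes covered by every mapping reachable from the first record. */
static uint64_t toy_mapped_bytes(const struct toy_fork *ifp)
{
	struct toy_cursor cur;
	struct toy_map got;
	uint64_t total = 0;

	assert(ifp != NULL);
	for_each_toy(ifp, &cur, &got)
		total += got.length;
	return total;
}
```

The loop condition doubles as the validity check, which is why the macro needs
no separate end test: iteration stops at the first slot that is out of range
or empty.
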
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 2aac5a0c4cef0a..a9b2d68b4b79c3 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1330,6 +1330,7 @@ struct fuse_uring_cmd_req {
 };
 
 #define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
+#define FUSE_IOMAP_TYPE_NULL		(0xFFFE) /* no record here */
 #define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
 #define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
 #define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
@@ -1462,4 +1463,7 @@ struct fuse_iomap_config_out {
 	int64_t s_maxbytes;	/* max file size */
 };
 
+/* invalidate all cached iomap mappings up to EOF */
+#define FUSE_IOMAP_INVAL_TO_EOF		(~0ULL)
+
 #endif /* _LINUX_FUSE_H */
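
How the kernel consumes FUSE_IOMAP_INVAL_TO_EOF is not shown in this part of
the patch. The sketch below only illustrates one plausible reading, namely
that the sentinel is passed as the length of an invalidation request and means
"everything from offset onward". The helper toy_inval_end() is hypothetical,
and the saturating overflow handling is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

#define FUSE_IOMAP_INVAL_TO_EOF	(~0ULL)	/* value copied from the patch */

/*
 * Hypothetical helper: compute the exclusive end of an invalidation range.
 * Assumes the sentinel arrives as the length; any overflow saturates.
 */
static uint64_t toy_inval_end(uint64_t offset, uint64_t length)
{
	if (length == FUSE_IOMAP_INVAL_TO_EOF)
		return UINT64_MAX;
	if (length > UINT64_MAX - offset)
		return UINT64_MAX;
	return offset + length;
}
```
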
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 63a41ef9336aaa..cf5c242be09f84 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -16,6 +16,6 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
 fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
 fuse-$(CONFIG_SYSCTL) += sysctl.o
 fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
-fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
+fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o iomap_cache.o
 
 virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 4ad90d212379ff..3dd04c2fdae7ba 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -9,6 +9,7 @@
 #include "dev_uring_i.h"
 #include "fuse_i.h"
 #include "fuse_dev_i.h"
+#include "iomap_cache.h"
 
 #include <linux/init.h>
 #include <linux/module.h>
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 5ef9fa67db807e..66e1be93592023 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -4,6 +4,7 @@
  * Author: Darrick J. Wong <djwong@kernel.org.
  */
 #include "fuse_i.h"
+#include "iomap_cache.h"
 #include "fuse_trace.h"
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
@@ -19,14 +20,6 @@ static bool __read_mostly enable_iomap =
 module_param(enable_iomap, bool, 0644);
 MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap");
 
-#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
-# define ASSERT(a)		do { WARN(!(a), "Assertion failed: %s, func: %s, line: %d", #a, __func__, __LINE__); } while (0)
-# define BAD_DATA(condition)	(WARN(condition, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__))
-#else
-# define ASSERT(a)
-# define BAD_DATA(condition)	(condition)
-#endif
-
 bool fuse_iomap_enabled(void)
 {
 	/*
@@ -1102,6 +1095,21 @@ static inline void fuse_iomap_clear_fileio(struct inode *inode)
 	clear_bit(FUSE_I_IOMAP_FILEIO, &fi->state);
 }
 
+static inline void fuse_iomap_clear_cache(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	ASSERT(fuse_has_iomap(inode));
+
+	clear_bit(FUSE_I_IOMAP_CACHE, &fi->state);
+
+	fuse_iext_destroy(&fi->cache.im_read);
+	if (fi->cache.im_write) {
+		fuse_iext_destroy(fi->cache.im_write);
+		kfree(fi->cache.im_write);
+	}
+}
+
 void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
 {
 	struct fuse_conn *conn = get_fuse_conn(inode);
@@ -1122,6 +1130,8 @@ void fuse_iomap_evict_inode(struct inode *inode)
 		fuse_iomap_clear_directio(inode);
 	if (fuse_has_iomap_fileio(inode))
 		fuse_iomap_clear_fileio(inode);
+	if (fuse_has_iomap_cache(inode))
+		fuse_iomap_clear_cache(inode);
 }
 
 ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
@@ -1641,6 +1651,12 @@ static inline void fuse_iomap_set_fileio(struct inode *inode)
 		min_order = inode->i_blkbits - PAGE_SHIFT;
 
 	mapping_set_folio_min_order(inode->i_mapping, min_order);
+
+	memset(&fi->cache.im_read, 0, sizeof(fi->cache.im_read));
+	fi->cache.im_seq = 0;
+	fi->cache.im_write = NULL;
+
+	init_rwsem(&fi->cache.im_lock);
 	set_bit(FUSE_I_IOMAP_FILEIO, &fi->state);
 }
 
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
new file mode 100644
index 00000000000000..6244352f543f03
--- /dev/null
+++ b/fs/fuse/iomap_cache.c
@@ -0,0 +1,1651 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fuse_iext* code adapted from xfs_iext_tree.c:
+ * Copyright (c) 2017 Christoph Hellwig.
+ *
+ * fuse_iomap_cache*lock* code adapted from xfs_inode.c:
+ * Copyright (c) 2000-2006 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * Copyright (C) 2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "fuse_i.h"
+#include "iomap_cache.h"
+#include "fuse_trace.h"
+#include <linux/iomap.h>
+
+static inline void fuse_iomap_cache_lock_flags_assert(unsigned int lock_flags)
+{
+	ASSERT((lock_flags & (FUSE_IOMAP_LOCK_SHARED | FUSE_IOMAP_LOCK_EXCL)) !=
+		(FUSE_IOMAP_LOCK_SHARED | FUSE_IOMAP_LOCK_EXCL));
+	ASSERT(lock_flags != 0);
+}
+
+void fuse_iomap_cache_lock(struct inode *inode, unsigned int lock_flags)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_cache *ip = &fi->cache;
+
+	fuse_iomap_cache_lock_flags_assert(lock_flags);
+
+	if (lock_flags & FUSE_IOMAP_LOCK_EXCL)
+		down_write(&ip->im_lock);
+	else if (lock_flags & FUSE_IOMAP_LOCK_SHARED)
+		down_read(&ip->im_lock);
+
+	trace_fuse_iomap_cache_lock(inode, lock_flags, _RET_IP_);
+}
+
+void fuse_iomap_cache_unlock(struct inode *inode, unsigned int lock_flags)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_iomap_cache *ip = &fi->cache;
+
+	fuse_iomap_cache_lock_flags_assert(lock_flags);
+
+	trace_fuse_iomap_cache_unlock(inode, lock_flags, _RET_IP_);
+
+	if (lock_flags & FUSE_IOMAP_LOCK_EXCL)
+		up_write(&ip->im_lock);
+	else if (lock_flags & FUSE_IOMAP_LOCK_SHARED)
+		up_read(&ip->im_lock);
+}
+
+static inline void fuse_iomap_assert_locked(struct fuse_iomap_cache *ip,
+					    unsigned int lock_flags)
+{
+	if (lock_flags & FUSE_IOMAP_LOCK_SHARED)
+		rwsem_assert_held(&ip->im_lock);
+	else if (lock_flags & FUSE_IOMAP_LOCK_EXCL)
+		rwsem_assert_held_write_nolockdep(&ip->im_lock);
+}
+
+struct fuse_iext_rec {
+	uint64_t		addr;	/* disk offset of mapping, bytes */
+	loff_t			offset;	/* file offset of mapping, bytes */
+	uint64_t		length;	/* length of mapping, bytes */
+	uint16_t		type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t		flags;	/* FUSE_IOMAP_F_* */
+	uint32_t		dev;	/* device cookie */
+};
+
+static inline struct fuse_inode *FUSE_I(struct fuse_iomap_cache *ip)
+{
+	return container_of(ip, struct fuse_inode, cache);
+}
+
+static inline struct inode *VFS_I(struct fuse_iomap_cache *ip)
+{
+	struct fuse_inode *fi = FUSE_I(ip);
+
+	return &fi->inode;
+}
+
+static inline uint32_t
+fuse_iomap_fork_to_state(const struct fuse_iomap_cache *ip,
+			 const struct fuse_ifork *ifp)
+{
+	ASSERT(ifp == ip->im_write || ifp == &ip->im_read);
+
+	if (ifp == ip->im_write)
+		return FUSE_IEXT_WRITEFORK;
+	return 0;
+}
+
+/* Convert FUSE_IEXT_* state flags to a mapping cache fork. */
+struct fuse_ifork *
+fuse_iext_state_to_fork(
+	struct fuse_iomap_cache	*ip,
+	unsigned int		state)
+{
+	if (state & FUSE_IEXT_WRITEFORK)
+		return ip->im_write;
+	return &ip->im_read;
+}
+
+static bool fuse_iext_rec_is_empty(const struct fuse_iext_rec *rec)
+{
+	return rec->length == 0;
+}
+
+static inline void fuse_iext_rec_clear(struct fuse_iext_rec *rec)
+{
+	memset(rec, 0, sizeof(*rec));
+}
+
+static void
+fuse_iext_set(
+	struct fuse_iext_rec	*rec,
+	const struct fuse_iomap	*irec)
+{
+	ASSERT(irec->length > 0);
+
+	rec->addr = irec->addr;
+	rec->offset = irec->offset;
+	rec->length = irec->length;
+	rec->type = irec->type;
+	rec->flags = irec->flags;
+	rec->dev = irec->dev;
+}
+
+static void
+fuse_iext_get(
+	struct fuse_iomap		*irec,
+	const struct fuse_iext_rec	*rec)
+{
+	irec->addr = rec->addr;
+	irec->offset = rec->offset;
+	irec->length = rec->length;
+	irec->type = rec->type;
+	irec->flags = rec->flags;
+	irec->dev = rec->dev;
+	/* validity cookie is set at the end of lookup */
+}
+
+enum {
+	NODE_SIZE	= 256,
+	KEYS_PER_NODE	= NODE_SIZE / (sizeof(uint64_t) + sizeof(void *)),
+	RECS_PER_LEAF	= (NODE_SIZE - (2 * sizeof(struct fuse_iext_leaf *))) /
+				sizeof(struct fuse_iext_rec),
+};
+
+/*
+ * In-core extent btree block layout:
+ *
+ * There are two types of blocks in the btree: leaf and inner (non-leaf) blocks.
+ *
+ * The leaf blocks are made up of %RECS_PER_LEAF mapping records, each of which
+ * contains the disk address, file offset, length, type, flags, and device
+ * cookie (see struct fuse_iext_rec above), followed by pointers to the
+ * previous and next leaf blocks (if there are any).
+ *
+ * The inner (non-leaf) blocks first contain KEYS_PER_NODE lookup keys, followed
+ * by an equal number of pointers to the btree blocks at the next lower level.
+ *
+ *		+-------+-------+-------+-------+-------+----------+----------+
+ * Leaf:	| rec 1 | rec 2 | rec 3 | rec 4 | rec N | prev-ptr | next-ptr |
+ *		+-------+-------+-------+-------+-------+----------+----------+
+ *
+ *		+-------+-------+-------+-------+-------+-------+------+-------+
+ * Inner:	| key 1 | key 2 | key 3 | key N | ptr 1 | ptr 2 | ptr3 | ptr N |
+ *		+-------+-------+-------+-------+-------+-------+------+-------+
+ */
+struct fuse_iext_node {
+	uint64_t		keys[KEYS_PER_NODE];
+#define FUSE_IEXT_KEY_INVALID	(1ULL << 63)
+	void			*ptrs[KEYS_PER_NODE];
+};
+
+struct fuse_iext_leaf {
+	struct fuse_iext_rec	recs[RECS_PER_LEAF];
+	struct fuse_iext_leaf	*prev;
+	struct fuse_iext_leaf	*next;
+};
+
+inline uint64_t fuse_iext_count(const struct fuse_ifork *ifp)
+{
+	return ifp->if_bytes / sizeof(struct fuse_iext_rec);
+}
+
+static inline int fuse_iext_max_recs(const struct fuse_ifork *ifp)
+{
+	if (ifp->if_height == 1)
+		return fuse_iext_count(ifp);
+	return RECS_PER_LEAF;
+}
+
+static inline struct fuse_iext_rec *cur_rec(const struct fuse_iext_cursor *cur)
+{
+	return &cur->leaf->recs[cur->pos];
+}
+
+static inline bool fuse_iext_valid(const struct fuse_ifork *ifp,
+				   const struct fuse_iext_cursor *cur)
+{
+	if (!cur->leaf)
+		return false;
+	if (cur->pos < 0 || cur->pos >= fuse_iext_max_recs(ifp))
+		return false;
+	if (fuse_iext_rec_is_empty(cur_rec(cur)))
+		return false;
+	return true;
+}
+
+static void *
+fuse_iext_find_first_leaf(
+	struct fuse_ifork	*ifp)
+{
+	struct fuse_iext_node	*node = ifp->if_data;
+	int			height;
+
+	if (!ifp->if_height)
+		return NULL;
+
+	for (height = ifp->if_height; height > 1; height--) {
+		node = node->ptrs[0];
+		ASSERT(node);
+	}
+
+	return node;
+}
+
+static void *
+fuse_iext_find_last_leaf(
+	struct fuse_ifork	*ifp)
+{
+	struct fuse_iext_node	*node = ifp->if_data;
+	int			height, i;
+
+	if (!ifp->if_height)
+		return NULL;
+
+	for (height = ifp->if_height; height > 1; height--) {
+		for (i = 1; i < KEYS_PER_NODE; i++)
+			if (!node->ptrs[i])
+				break;
+		node = node->ptrs[i - 1];
+		ASSERT(node);
+	}
+
+	return node;
+}
+
+void
+fuse_iext_first(
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*cur)
+{
+	cur->pos = 0;
+	cur->leaf = fuse_iext_find_first_leaf(ifp);
+}
+
+void
+fuse_iext_last(
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*cur)
+{
+	int			i;
+
+	cur->leaf = fuse_iext_find_last_leaf(ifp);
+	if (!cur->leaf) {
+		cur->pos = 0;
+		return;
+	}
+
+	for (i = 1; i < fuse_iext_max_recs(ifp); i++) {
+		if (fuse_iext_rec_is_empty(&cur->leaf->recs[i]))
+			break;
+	}
+	cur->pos = i - 1;
+}
+
+void
+fuse_iext_next(
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*cur)
+{
+	if (!cur->leaf) {
+		ASSERT(cur->pos <= 0 || cur->pos >= RECS_PER_LEAF);
+		fuse_iext_first(ifp, cur);
+		return;
+	}
+
+	ASSERT(cur->pos >= 0);
+	ASSERT(cur->pos < fuse_iext_max_recs(ifp));
+
+	cur->pos++;
+	if (ifp->if_height > 1 && !fuse_iext_valid(ifp, cur) &&
+	    cur->leaf->next) {
+		cur->leaf = cur->leaf->next;
+		cur->pos = 0;
+	}
+}
+
+void
+fuse_iext_prev(
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*cur)
+{
+	if (!cur->leaf) {
+		ASSERT(cur->pos <= 0 || cur->pos >= RECS_PER_LEAF);
+		fuse_iext_last(ifp, cur);
+		return;
+	}
+
+	ASSERT(cur->pos >= 0);
+	ASSERT(cur->pos <= RECS_PER_LEAF);
+
+recurse:
+	do {
+		cur->pos--;
+		if (fuse_iext_valid(ifp, cur))
+			return;
+	} while (cur->pos > 0);
+
+	if (ifp->if_height > 1 && cur->leaf->prev) {
+		cur->leaf = cur->leaf->prev;
+		cur->pos = RECS_PER_LEAF;
+		goto recurse;
+	}
+}
+
+static inline int
+fuse_iext_key_cmp(
+	struct fuse_iext_node	*node,
+	int			n,
+	loff_t			offset)
+{
+	if (node->keys[n] > offset)
+		return 1;
+	if (node->keys[n] < offset)
+		return -1;
+	return 0;
+}
+
+static inline int
+fuse_iext_rec_cmp(
+	struct fuse_iext_rec	*rec,
+	loff_t			offset)
+{
+	if (rec->offset > offset)
+		return 1;
+	if (rec->offset + rec->length <= offset)
+		return -1;
+	return 0;
+}
+
+static void *
+fuse_iext_find_level(
+	struct fuse_ifork	*ifp,
+	loff_t			offset,
+	int			level)
+{
+	struct fuse_iext_node	*node = ifp->if_data;
+	int			height, i;
+
+	if (!ifp->if_height)
+		return NULL;
+
+	for (height = ifp->if_height; height > level; height--) {
+		for (i = 1; i < KEYS_PER_NODE; i++)
+			if (fuse_iext_key_cmp(node, i, offset) > 0)
+				break;
+
+		node = node->ptrs[i - 1];
+		if (!node)
+			break;
+	}
+
+	return node;
+}
+
+static int
+fuse_iext_node_pos(
+	struct fuse_iext_node	*node,
+	loff_t			offset)
+{
+	int			i;
+
+	for (i = 1; i < KEYS_PER_NODE; i++) {
+		if (fuse_iext_key_cmp(node, i, offset) > 0)
+			break;
+	}
+
+	return i - 1;
+}
+
+static int
+fuse_iext_node_insert_pos(
+	struct fuse_iext_node	*node,
+	loff_t			offset)
+{
+	int			i;
+
+	for (i = 0; i < KEYS_PER_NODE; i++) {
+		if (fuse_iext_key_cmp(node, i, offset) > 0)
+			return i;
+	}
+
+	return KEYS_PER_NODE;
+}
+
+static int
+fuse_iext_node_nr_entries(
+	struct fuse_iext_node	*node,
+	int			start)
+{
+	int			i;
+
+	for (i = start; i < KEYS_PER_NODE; i++) {
+		if (node->keys[i] == FUSE_IEXT_KEY_INVALID)
+			break;
+	}
+
+	return i;
+}
+
+static int
+fuse_iext_leaf_nr_entries(
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_leaf	*leaf,
+	int			start)
+{
+	int			i;
+
+	for (i = start; i < fuse_iext_max_recs(ifp); i++) {
+		if (fuse_iext_rec_is_empty(&leaf->recs[i]))
+			break;
+	}
+
+	return i;
+}
+
+static inline uint64_t
+fuse_iext_leaf_key(
+	struct fuse_iext_leaf	*leaf,
+	int			n)
+{
+	return leaf->recs[n].offset;
+}
+
+static inline void *
+fuse_iext_alloc_node(
+	int	size)
+{
+	return kzalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL);
+}
+
+static void
+fuse_iext_grow(
+	struct fuse_ifork	*ifp)
+{
+	struct fuse_iext_node	*node = fuse_iext_alloc_node(NODE_SIZE);
+	int			i;
+
+	if (ifp->if_height == 1) {
+		struct fuse_iext_leaf *prev = ifp->if_data;
+
+		node->keys[0] = fuse_iext_leaf_key(prev, 0);
+		node->ptrs[0] = prev;
+	} else {
+		struct fuse_iext_node *prev = ifp->if_data;
+
+		ASSERT(ifp->if_height > 1);
+
+		node->keys[0] = prev->keys[0];
+		node->ptrs[0] = prev;
+	}
+
+	for (i = 1; i < KEYS_PER_NODE; i++)
+		node->keys[i] = FUSE_IEXT_KEY_INVALID;
+
+	ifp->if_data = node;
+	ifp->if_height++;
+}
+
+static void
+fuse_iext_update_node(
+	struct fuse_ifork	*ifp,
+	loff_t			old_offset,
+	loff_t			new_offset,
+	int			level,
+	void			*ptr)
+{
+	struct fuse_iext_node	*node = ifp->if_data;
+	int			height, i;
+
+	for (height = ifp->if_height; height > level; height--) {
+		for (i = 0; i < KEYS_PER_NODE; i++) {
+			if (i > 0 && fuse_iext_key_cmp(node, i, old_offset) > 0)
+				break;
+			if (node->keys[i] == old_offset)
+				node->keys[i] = new_offset;
+		}
+		node = node->ptrs[i - 1];
+		ASSERT(node);
+	}
+
+	ASSERT(node == ptr);
+}
+
+static struct fuse_iext_node *
+fuse_iext_split_node(
+	struct fuse_iext_node	**nodep,
+	int			*pos,
+	int			*nr_entries)
+{
+	struct fuse_iext_node	*node = *nodep;
+	struct fuse_iext_node	*new = fuse_iext_alloc_node(NODE_SIZE);
+	const int		nr_move = KEYS_PER_NODE / 2;
+	int			nr_keep = nr_move + (KEYS_PER_NODE & 1);
+	int			i = 0;
+
+	/* for sequential append operations just spill over into the new node */
+	if (*pos == KEYS_PER_NODE) {
+		*nodep = new;
+		*pos = 0;
+		*nr_entries = 0;
+		goto done;
+	}
+
+
+	for (i = 0; i < nr_move; i++) {
+		new->keys[i] = node->keys[nr_keep + i];
+		new->ptrs[i] = node->ptrs[nr_keep + i];
+
+		node->keys[nr_keep + i] = FUSE_IEXT_KEY_INVALID;
+		node->ptrs[nr_keep + i] = NULL;
+	}
+
+	if (*pos >= nr_keep) {
+		*nodep = new;
+		*pos -= nr_keep;
+		*nr_entries = nr_move;
+	} else {
+		*nr_entries = nr_keep;
+	}
+done:
+	for (; i < KEYS_PER_NODE; i++)
+		new->keys[i] = FUSE_IEXT_KEY_INVALID;
+	return new;
+}
+
+static void
+fuse_iext_insert_node(
+	struct fuse_ifork	*ifp,
+	uint64_t		offset,
+	void			*ptr,
+	int			level)
+{
+	struct fuse_iext_node	*node, *new;
+	int			i, pos, nr_entries;
+
+again:
+	if (ifp->if_height < level)
+		fuse_iext_grow(ifp);
+
+	new = NULL;
+	node = fuse_iext_find_level(ifp, offset, level);
+	pos = fuse_iext_node_insert_pos(node, offset);
+	nr_entries = fuse_iext_node_nr_entries(node, pos);
+
+	ASSERT(pos >= nr_entries || fuse_iext_key_cmp(node, pos, offset) != 0);
+	ASSERT(nr_entries <= KEYS_PER_NODE);
+
+	if (nr_entries == KEYS_PER_NODE)
+		new = fuse_iext_split_node(&node, &pos, &nr_entries);
+
+	/*
+	 * Update the pointers in higher levels if the first entry changes
+	 * in an existing node.
+	 */
+	if (node != new && pos == 0 && nr_entries > 0)
+		fuse_iext_update_node(ifp, node->keys[0], offset, level, node);
+
+	for (i = nr_entries; i > pos; i--) {
+		node->keys[i] = node->keys[i - 1];
+		node->ptrs[i] = node->ptrs[i - 1];
+	}
+	node->keys[pos] = offset;
+	node->ptrs[pos] = ptr;
+
+	if (new) {
+		offset = new->keys[0];
+		ptr = new;
+		level++;
+		goto again;
+	}
+}
+
+static struct fuse_iext_leaf *
+fuse_iext_split_leaf(
+	struct fuse_iext_cursor	*cur,
+	int			*nr_entries)
+{
+	struct fuse_iext_leaf	*leaf = cur->leaf;
+	struct fuse_iext_leaf	*new = fuse_iext_alloc_node(NODE_SIZE);
+	const int		nr_move = RECS_PER_LEAF / 2;
+	int			nr_keep = nr_move + (RECS_PER_LEAF & 1);
+	int			i;
+
+	/* for sequential append operations just spill over into the new node */
+	if (cur->pos == RECS_PER_LEAF) {
+		cur->leaf = new;
+		cur->pos = 0;
+		*nr_entries = 0;
+		goto done;
+	}
+
+	for (i = 0; i < nr_move; i++) {
+		new->recs[i] = leaf->recs[nr_keep + i];
+		fuse_iext_rec_clear(&leaf->recs[nr_keep + i]);
+	}
+
+	if (cur->pos >= nr_keep) {
+		cur->leaf = new;
+		cur->pos -= nr_keep;
+		*nr_entries = nr_move;
+	} else {
+		*nr_entries = nr_keep;
+	}
+done:
+	if (leaf->next)
+		leaf->next->prev = new;
+	new->next = leaf->next;
+	new->prev = leaf;
+	leaf->next = new;
+	return new;
+}
+
+static void
+fuse_iext_alloc_root(
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*cur)
+{
+	ASSERT(ifp->if_bytes == 0);
+
+	ifp->if_data = fuse_iext_alloc_node(sizeof(struct fuse_iext_rec));
+	ifp->if_height = 1;
+
+	/* now that we have a node step into it */
+	cur->leaf = ifp->if_data;
+	cur->pos = 0;
+}
+
+static void
+fuse_iext_realloc_root(
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*cur)
+{
+	int64_t new_size = ifp->if_bytes + sizeof(struct fuse_iext_rec);
+	void *new;
+
+	/* account for the prev/next pointers */
+	if (new_size / sizeof(struct fuse_iext_rec) == RECS_PER_LEAF)
+		new_size = NODE_SIZE;
+
+	new = krealloc(ifp->if_data, new_size,
+			GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL);
+	memset(new + ifp->if_bytes, 0, new_size - ifp->if_bytes);
+	ifp->if_data = new;
+	cur->leaf = new;
+}
+
+/*
+ * Increment the sequence counter on extent tree changes.  WRITE_ONCE keeps the
+ * store from being torn or elided for lockless readers of im_seq; it does not
+ * by itself order the store against the modifications to the extent tree.
+ */
+static inline void fuse_iext_inc_seq(struct fuse_iomap_cache *ip)
+{
+	WRITE_ONCE(ip->im_seq, READ_ONCE(ip->im_seq) + 1);
+}
+
+void
+fuse_iext_insert_raw(
+	struct fuse_iomap_cache	*ip,
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*cur,
+	const struct fuse_iomap	*irec)
+{
+	loff_t			offset = irec->offset;
+	struct fuse_iext_leaf	*new = NULL;
+	int			nr_entries, i;
+
+	fuse_iext_inc_seq(ip);
+
+	if (ifp->if_height == 0)
+		fuse_iext_alloc_root(ifp, cur);
+	else if (ifp->if_height == 1)
+		fuse_iext_realloc_root(ifp, cur);
+
+	nr_entries = fuse_iext_leaf_nr_entries(ifp, cur->leaf, cur->pos);
+	ASSERT(nr_entries <= RECS_PER_LEAF);
+	ASSERT(cur->pos >= nr_entries ||
+	       fuse_iext_rec_cmp(cur_rec(cur), irec->offset) != 0);
+
+	if (nr_entries == RECS_PER_LEAF)
+		new = fuse_iext_split_leaf(cur, &nr_entries);
+
+	/*
+	 * Update the pointers in higher levels if the first entry changes
+	 * in an existing node.
+	 */
+	if (cur->leaf != new && cur->pos == 0 && nr_entries > 0) {
+		fuse_iext_update_node(ifp, fuse_iext_leaf_key(cur->leaf, 0),
+				offset, 1, cur->leaf);
+	}
+
+	for (i = nr_entries; i > cur->pos; i--)
+		cur->leaf->recs[i] = cur->leaf->recs[i - 1];
+	fuse_iext_set(cur_rec(cur), irec);
+	ifp->if_bytes += sizeof(struct fuse_iext_rec);
+
+	if (new)
+		fuse_iext_insert_node(ifp, fuse_iext_leaf_key(new, 0), new, 2);
+}
+
+void
+fuse_iext_insert(
+	struct fuse_iomap_cache	*ip,
+	struct fuse_iext_cursor	*cur,
+	const struct fuse_iomap	*irec,
+	int			state)
+{
+	struct fuse_ifork	*ifp = fuse_iext_state_to_fork(ip, state);
+
+	fuse_iext_insert_raw(ip, ifp, cur, irec);
+	trace_fuse_iext_insert(VFS_I(ip), cur, state, _RET_IP_);
+}
+
+static struct fuse_iext_node *
+fuse_iext_rebalance_node(
+	struct fuse_iext_node	*parent,
+	int			*pos,
+	struct fuse_iext_node	*node,
+	int			nr_entries)
+{
+	/*
+	 * If the neighbouring nodes are completely full, or have different
+	 * parents, we might never be able to merge our node, and will only
+	 * delete it once the number of entries hits zero.
+	 */
+	if (nr_entries == 0)
+		return node;
+
+	if (*pos > 0) {
+		struct fuse_iext_node *prev = parent->ptrs[*pos - 1];
+		int nr_prev = fuse_iext_node_nr_entries(prev, 0), i;
+
+		if (nr_prev + nr_entries <= KEYS_PER_NODE) {
+			for (i = 0; i < nr_entries; i++) {
+				prev->keys[nr_prev + i] = node->keys[i];
+				prev->ptrs[nr_prev + i] = node->ptrs[i];
+			}
+			return node;
+		}
+	}
+
+	if (*pos + 1 < fuse_iext_node_nr_entries(parent, *pos)) {
+		struct fuse_iext_node *next = parent->ptrs[*pos + 1];
+		int nr_next = fuse_iext_node_nr_entries(next, 0), i;
+
+		if (nr_entries + nr_next <= KEYS_PER_NODE) {
+			/*
+			 * Merge the next node into this node so that we don't
+			 * have to do an additional update of the keys in the
+			 * higher levels.
+			 */
+			for (i = 0; i < nr_next; i++) {
+				node->keys[nr_entries + i] = next->keys[i];
+				node->ptrs[nr_entries + i] = next->ptrs[i];
+			}
+
+			++*pos;
+			return next;
+		}
+	}
+
+	return NULL;
+}
+
+static void
+fuse_iext_remove_node(
+	struct fuse_ifork	*ifp,
+	loff_t			offset,
+	void			*victim)
+{
+	struct fuse_iext_node	*node, *parent;
+	int			level = 2, pos, nr_entries, i;
+
+	ASSERT(level <= ifp->if_height);
+	node = fuse_iext_find_level(ifp, offset, level);
+	pos = fuse_iext_node_pos(node, offset);
+again:
+	ASSERT(node->ptrs[pos]);
+	ASSERT(node->ptrs[pos] == victim);
+	kfree(victim);
+
+	nr_entries = fuse_iext_node_nr_entries(node, pos) - 1;
+	offset = node->keys[0];
+	for (i = pos; i < nr_entries; i++) {
+		node->keys[i] = node->keys[i + 1];
+		node->ptrs[i] = node->ptrs[i + 1];
+	}
+	node->keys[nr_entries] = FUSE_IEXT_KEY_INVALID;
+	node->ptrs[nr_entries] = NULL;
+
+	if (pos == 0 && nr_entries > 0) {
+		fuse_iext_update_node(ifp, offset, node->keys[0], level, node);
+		offset = node->keys[0];
+	}
+
+	if (nr_entries >= KEYS_PER_NODE / 2)
+		return;
+
+	if (level < ifp->if_height) {
+		/*
+		 * If we aren't at the root yet try to find a neighbour node to
+		 * merge with (or delete the node if it is empty), and then
+		 * recurse up to the next level.
+		 */
+		level++;
+		parent = fuse_iext_find_level(ifp, offset, level);
+		pos = fuse_iext_node_pos(parent, offset);
+
+		ASSERT(pos != KEYS_PER_NODE);
+		ASSERT(parent->ptrs[pos] == node);
+
+		node = fuse_iext_rebalance_node(parent, &pos, node, nr_entries);
+		if (node) {
+			victim = node;
+			node = parent;
+			goto again;
+		}
+	} else if (nr_entries == 1) {
+		/*
+		 * If we are at the root and only one entry is left we can just
+		 * free this node and update the root pointer.
+		 */
+		ASSERT(node == ifp->if_data);
+		ifp->if_data = node->ptrs[0];
+		ifp->if_height--;
+		kfree(node);
+	}
+}
+
+static void
+fuse_iext_rebalance_leaf(
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*cur,
+	struct fuse_iext_leaf	*leaf,
+	loff_t			offset,
+	int			nr_entries)
+{
+	/*
+	 * If the neighbouring nodes are completely full we might never be able
+	 * to merge our node, and will only delete it once the number of
+	 * entries hits zero.
+	 */
+	if (nr_entries == 0)
+		goto remove_node;
+
+	if (leaf->prev) {
+		int nr_prev = fuse_iext_leaf_nr_entries(ifp, leaf->prev, 0), i;
+
+		if (nr_prev + nr_entries <= RECS_PER_LEAF) {
+			for (i = 0; i < nr_entries; i++)
+				leaf->prev->recs[nr_prev + i] = leaf->recs[i];
+
+			if (cur->leaf == leaf) {
+				cur->leaf = leaf->prev;
+				cur->pos += nr_prev;
+			}
+			goto remove_node;
+		}
+	}
+
+	if (leaf->next) {
+		int nr_next = fuse_iext_leaf_nr_entries(ifp, leaf->next, 0), i;
+
+		if (nr_entries + nr_next <= RECS_PER_LEAF) {
+			/*
+			 * Merge the next node into this node so that we don't
+			 * have to do an additional update of the keys in the
+			 * higher levels.
+			 */
+			for (i = 0; i < nr_next; i++) {
+				leaf->recs[nr_entries + i] =
+					leaf->next->recs[i];
+			}
+
+			if (cur->leaf == leaf->next) {
+				cur->leaf = leaf;
+				cur->pos += nr_entries;
+			}
+
+			offset = fuse_iext_leaf_key(leaf->next, 0);
+			leaf = leaf->next;
+			goto remove_node;
+		}
+	}
+
+	return;
+remove_node:
+	if (leaf->prev)
+		leaf->prev->next = leaf->next;
+	if (leaf->next)
+		leaf->next->prev = leaf->prev;
+	fuse_iext_remove_node(ifp, offset, leaf);
+}
+
+static void
+fuse_iext_free_last_leaf(
+	struct fuse_ifork	*ifp)
+{
+	ifp->if_height--;
+	kfree(ifp->if_data);
+	ifp->if_data = NULL;
+}
+
+void
+fuse_iext_remove(
+	struct fuse_iomap_cache	*ip,
+	struct fuse_iext_cursor	*cur,
+	int			state)
+{
+	struct fuse_ifork	*ifp = fuse_iext_state_to_fork(ip, state);
+	struct fuse_iext_leaf	*leaf = cur->leaf;
+	loff_t			offset = fuse_iext_leaf_key(leaf, 0);
+	int			i, nr_entries;
+
+	trace_fuse_iext_remove(VFS_I(ip), cur, state, _RET_IP_);
+
+	ASSERT(ifp->if_height > 0);
+	ASSERT(ifp->if_data != NULL);
+	ASSERT(fuse_iext_valid(ifp, cur));
+
+	fuse_iext_inc_seq(ip);
+
+	nr_entries = fuse_iext_leaf_nr_entries(ifp, leaf, cur->pos) - 1;
+	for (i = cur->pos; i < nr_entries; i++)
+		leaf->recs[i] = leaf->recs[i + 1];
+	fuse_iext_rec_clear(&leaf->recs[nr_entries]);
+	ifp->if_bytes -= sizeof(struct fuse_iext_rec);
+
+	if (cur->pos == 0 && nr_entries > 0) {
+		fuse_iext_update_node(ifp, offset, fuse_iext_leaf_key(leaf, 0), 1,
+				leaf);
+		offset = fuse_iext_leaf_key(leaf, 0);
+	} else if (cur->pos == nr_entries) {
+		if (ifp->if_height > 1 && leaf->next)
+			cur->leaf = leaf->next;
+		else
+			cur->leaf = NULL;
+		cur->pos = 0;
+	}
+
+	if (nr_entries >= RECS_PER_LEAF / 2)
+		return;
+
+	if (ifp->if_height > 1)
+		fuse_iext_rebalance_leaf(ifp, cur, leaf, offset, nr_entries);
+	else if (nr_entries == 0)
+		fuse_iext_free_last_leaf(ifp);
+}
+
+/*
+ * Look up the extent covering offset.
+ *
+ * If there is an extent covering offset, return true, store the expanded
+ * extent structure in *gotp, and store the extent cursor in *cur.
+ * If there is no extent covering offset, but there is an extent after it (e.g.
+ * it lies in a hole) return that extent in *gotp and its cursor in *cur
+ * instead.
+ * If offset is beyond the last extent return false, and return an invalid
+ * cursor value.
+ */
+bool
+fuse_iext_lookup_extent(
+	struct fuse_iomap_cache	*ip,
+	struct fuse_ifork	*ifp,
+	loff_t			offset,
+	struct fuse_iext_cursor	*cur,
+	struct fuse_iomap	*gotp)
+{
+	cur->leaf = fuse_iext_find_level(ifp, offset, 1);
+	if (!cur->leaf) {
+		cur->pos = 0;
+		return false;
+	}
+
+	for (cur->pos = 0; cur->pos < fuse_iext_max_recs(ifp); cur->pos++) {
+		struct fuse_iext_rec *rec = cur_rec(cur);
+
+		if (fuse_iext_rec_is_empty(rec))
+			break;
+		if (fuse_iext_rec_cmp(rec, offset) >= 0)
+			goto found;
+	}
+
+	/* Try looking in the next node for an entry > offset */
+	if (ifp->if_height == 1 || !cur->leaf->next)
+		return false;
+	cur->leaf = cur->leaf->next;
+	cur->pos = 0;
+	if (!fuse_iext_valid(ifp, cur))
+		return false;
+found:
+	fuse_iext_get(gotp, cur_rec(cur));
+	return true;
+}
+
+/*
+ * Returns the last extent before end, and if this extent doesn't cover
+ * end, update end to the end of the extent.
+ */
+bool
+fuse_iext_lookup_extent_before(
+	struct fuse_iomap_cache	*ip,
+	struct fuse_ifork	*ifp,
+	loff_t			*end,
+	struct fuse_iext_cursor	*cur,
+	struct fuse_iomap	*gotp)
+{
+	/* could be optimized to not even look up the next on a match.. */
+	if (fuse_iext_lookup_extent(ip, ifp, *end - 1, cur, gotp) &&
+	    gotp->offset <= *end - 1)
+		return true;
+	if (!fuse_iext_prev_extent(ifp, cur, gotp))
+		return false;
+	*end = gotp->offset + gotp->length;
+	return true;
+}
+
+void
+fuse_iext_update_extent(
+	struct fuse_iomap_cache	*ip,
+	int			state,
+	struct fuse_iext_cursor	*cur,
+	struct fuse_iomap	*new)
+{
+	struct fuse_ifork	*ifp = fuse_iext_state_to_fork(ip, state);
+
+	fuse_iext_inc_seq(ip);
+
+	if (cur->pos == 0) {
+		struct fuse_iomap	old;
+
+		fuse_iext_get(&old, cur_rec(cur));
+		if (new->offset != old.offset) {
+			fuse_iext_update_node(ifp, old.offset,
+					new->offset, 1, cur->leaf);
+		}
+	}
+
+	trace_fuse_iext_pre_update(VFS_I(ip), cur, state, _RET_IP_);
+	fuse_iext_set(cur_rec(cur), new);
+	trace_fuse_iext_post_update(VFS_I(ip), cur, state, _RET_IP_);
+}
+
+/*
+ * Return true if the cursor points at an extent and return the extent structure
+ * in gotp.  Else return false.
+ */
+bool
+fuse_iext_get_extent(
+	const struct fuse_ifork		*ifp,
+	const struct fuse_iext_cursor	*cur,
+	struct fuse_iomap		*gotp)
+{
+	if (!fuse_iext_valid(ifp, cur))
+		return false;
+	fuse_iext_get(gotp, cur_rec(cur));
+	return true;
+}
+
+/*
+ * This function is recursive, so we need to be extremely careful with
+ * stack usage.
+ */
+static void
+fuse_iext_destroy_node(
+	struct fuse_iext_node	*node,
+	int			level)
+{
+	int			i;
+
+	if (level > 1) {
+		for (i = 0; i < KEYS_PER_NODE; i++) {
+			if (node->keys[i] == FUSE_IEXT_KEY_INVALID)
+				break;
+			fuse_iext_destroy_node(node->ptrs[i], level - 1);
+		}
+	}
+
+	kfree(node);
+}
+
+void
+fuse_iext_destroy(
+	struct fuse_ifork	*ifp)
+{
+	fuse_iext_destroy_node(ifp->if_data, ifp->if_height);
+
+	ifp->if_bytes = 0;
+	ifp->if_height = 0;
+	ifp->if_data = NULL;
+}
+
+static inline struct fuse_ifork *
+fuse_iomap_fork_ptr(
+	struct fuse_iomap_cache	*ip,
+	enum fuse_iomap_fork	whichfork)
+{
+	switch (whichfork) {
+	case FUSE_IOMAP_READ_FORK:
+		return &ip->im_read;
+	case FUSE_IOMAP_WRITE_FORK:
+		return ip->im_write;
+	default:
+		ASSERT(0);
+		return NULL;
+	}
+}
+
+static inline bool fuse_iomap_addrs_adjacent(const struct fuse_iomap *left,
+					     const struct fuse_iomap *right)
+{
+	switch (left->type) {
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		return left->addr + left->length == right->addr;
+	default:
+		return left->addr  == FUSE_IOMAP_NULL_ADDR &&
+		       right->addr == FUSE_IOMAP_NULL_ADDR;
+	}
+}
+
+static inline bool fuse_iomap_can_merge(const struct fuse_iomap *left,
+					const struct fuse_iomap *right)
+{
+	return (left->dev == right->dev &&
+		left->offset + left->length == right->offset &&
+		left->type  == right->type &&
+		fuse_iomap_addrs_adjacent(left, right) &&
+		left->flags == right->flags &&
+		left->length + right->length <= FUSE_IOMAP_MAX_LEN);
+}
+
+static inline bool fuse_iomap_can_merge3(const struct fuse_iomap *left,
+					 const struct fuse_iomap *new,
+					 const struct fuse_iomap *right)
+{
+	return left->length + new->length + right->length <= FUSE_IOMAP_MAX_LEN;
+}
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static void fuse_iext_check_mappings(struct inode *inode,
+				      struct fuse_iomap_cache *ip,
+				      struct fuse_ifork *ifp)
+{
+	struct fuse_inode	*fi = FUSE_I(ip);
+	struct fuse_iext_cursor	icur;
+	struct fuse_iomap	prev, got;
+	unsigned long long	nr = 0;
+	enum fuse_iomap_fork	whichfork;
+
+	if (!ifp)
+		return;
+
+	if (ifp == ip->im_write)
+		whichfork = FUSE_IOMAP_WRITE_FORK;
+	else
+		whichfork = FUSE_IOMAP_READ_FORK;
+
+	fuse_iext_first(ifp, &icur);
+	if (!fuse_iext_get_extent(ifp, &icur, &prev))
+		return;
+	trace_fuse_iext_check_mapping(inode, whichfork, &prev, _RET_IP_);
+	nr++;
+
+	fuse_iext_next(ifp, &icur);
+	while (fuse_iext_get_extent(ifp, &icur, &got)) {
+		trace_fuse_iext_check_mapping(inode, whichfork, &got, _RET_IP_);
+		if (got.length == 0 ||
+		    got.offset < prev.offset + prev.length ||
+		    fuse_iomap_can_merge(&prev, &got)) {
+			printk(KERN_ERR "FUSE IOMAP CORRUPTION ino=%llu nr=%llu\n",
+			       fi->orig_ino, nr);
+			printk(KERN_ERR "prev: offset=%llu length=%llu type=%u flags=0x%x cookie=%llu dev=%u addr=%llu\n",
+			       prev.offset, prev.length, prev.type, prev.flags,
+			       prev.validity_cookie, prev.dev, prev.addr);
+			printk(KERN_ERR "curr: offset=%llu length=%llu type=%u flags=0x%x cookie=%llu dev=%u addr=%llu\n",
+			       got.offset, got.length, got.type, got.flags,
+			       got.validity_cookie, got.dev, got.addr);
+		}
+
+		prev = got;
+		nr++;
+		fuse_iext_next(ifp, &icur);
+	}
+}
+#else
+# define fuse_iext_check_mappings(...)	((void)0)
+#endif
+
+static void
+fuse_iext_del_mapping(
+	struct fuse_iomap_cache	*ip,
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*icur,
+	struct fuse_iomap	*got,	/* current extent entry */
+	struct fuse_iomap	*del)	/* data to remove from extents */
+{
+	struct fuse_iomap	new;	/* new record to be inserted */
+	/* first addr (fsblock aligned) past del */
+	uint64_t		del_endaddr;
+	/* first offset (fsblock aligned) past del */
+	uint64_t		del_endoff = del->offset + del->length;
+	/* first offset (fsblock aligned) past got */
+	uint64_t		got_endoff = got->offset + got->length;
+	uint32_t		state = fuse_iomap_fork_to_state(ip, ifp);
+
+	ASSERT(del->length > 0);
+	ASSERT(got->offset <= del->offset);
+	ASSERT(got_endoff >= del_endoff);
+
+	switch (del->type) {
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		del_endaddr = del->addr + del->length;
+		break;
+	default:
+		del_endaddr = FUSE_IOMAP_NULL_ADDR;
+		break;
+	}
+
+	if (got->offset == del->offset)
+		state |= FUSE_IEXT_LEFT_FILLING;
+	if (got_endoff == del_endoff)
+		state |= FUSE_IEXT_RIGHT_FILLING;
+
+	trace_fuse_iext_del_mapping(VFS_I(ip), state, del);
+	trace_fuse_iext_del_mapping_got(VFS_I(ip), got);
+
+	switch (state & (FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING)) {
+	case FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING:
+		/*
+		 * Matches the whole extent.  Delete the entry.
+		 */
+		fuse_iext_remove(ip, icur, state);
+		fuse_iext_prev(ifp, icur);
+		break;
+	case FUSE_IEXT_LEFT_FILLING:
+		/*
+		 * Deleting the first part of the extent.
+		 */
+		got->offset = del_endoff;
+		got->addr = del_endaddr;
+		got->length -= del->length;
+		fuse_iext_update_extent(ip, state, icur, got);
+		break;
+	case FUSE_IEXT_RIGHT_FILLING:
+		/*
+		 * Deleting the last part of the extent.
+		 */
+		got->length -= del->length;
+		fuse_iext_update_extent(ip, state, icur, got);
+		break;
+	case 0:
+		/*
+		 * Deleting the middle of the extent.
+		 */
+		got->length = del->offset - got->offset;
+		fuse_iext_update_extent(ip, state, icur, got);
+
+		new.offset = del_endoff;
+		new.length = got_endoff - del_endoff;
+		new.type = got->type;
+		new.flags = got->flags;
+		new.addr = del_endaddr;
+		new.dev = got->dev;
+
+		fuse_iext_next(ifp, icur);
+		fuse_iext_insert(ip, icur, &new, state);
+		break;
+	}
+}
+
+int
+fuse_iomap_cache_remove(
+	struct inode		*inode,
+	enum fuse_iomap_fork	whichfork,
+	loff_t			start,		/* first file offset deleted */
+	uint64_t		len)		/* length to unmap */
+{
+	struct fuse_iext_cursor	icur;
+	struct fuse_iomap	got;		/* current extent record */
+	struct fuse_iomap	del;		/* extent being deleted */
+	loff_t			end;
+	struct fuse_inode	*fi = get_fuse_inode(inode);
+	struct fuse_iomap_cache	*ip = &fi->cache;
+	struct fuse_ifork	*ifp = fuse_iomap_fork_ptr(ip, whichfork);
+	bool			wasreal;
+	bool			done = false;
+	int			ret = 0;
+
+	fuse_iomap_assert_locked(ip, FUSE_IOMAP_LOCK_EXCL);
+
+	trace_fuse_iomap_cache_remove(inode, whichfork, start, len, _RET_IP_);
+
+	if (!ifp || fuse_iext_count(ifp) == 0)
+		return 0;
+
+	/* Fast shortcut if the caller wants to erase everything */
+	if (start == 0 && len >= inode->i_sb->s_maxbytes) {
+		fuse_iext_destroy(ifp);
+		return 0;
+	}
+
+	if (!len)
+		goto out;
+
+	/*
+	 * If the caller wants us to remove everything to EOF, we set the end
+	 * of the removal range to the maximum file offset.  We don't support
+	 * unsigned file offsets.
+	 */
+	if (len == FUSE_IOMAP_INVAL_TO_EOF) {
+		const unsigned int blocksize = i_blocksize(inode);
+
+		len = round_up(inode->i_sb->s_maxbytes, blocksize) - start;
+	}
+
+	/*
+	 * Now that we've settled len, look up the extent before the end of the
+	 * range.
+	 */
+	end = start + len;
+	if (!fuse_iext_lookup_extent_before(ip, ifp, &end, &icur, &got))
+		goto out;
+	end--;
+
+	while (end != -1 && end >= start) {
+		/*
+		 * Is the found extent after a hole in which end lives?
+		 * Just back up to the previous extent, if so.
+		 */
+		if (got.offset > end &&
+		    !fuse_iext_prev_extent(ifp, &icur, &got)) {
+			done = true;
+			break;
+		}
+		/*
+		 * Is the last block of this extent before the range
+		 * we're supposed to delete?  If so, we're done.
+		 */
+		end = min_t(loff_t, end, got.offset + got.length - 1);
+		if (end < start)
+			break;
+		/*
+		 * Then deal with the (possibly delayed) allocated space
+		 * we found.
+		 */
+		del = got;
+		switch (del.type) {
+		case FUSE_IOMAP_TYPE_DELALLOC:
+		case FUSE_IOMAP_TYPE_HOLE:
+		case FUSE_IOMAP_TYPE_INLINE:
+		case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+			wasreal = false;
+			break;
+		case FUSE_IOMAP_TYPE_MAPPED:
+		case FUSE_IOMAP_TYPE_UNWRITTEN:
+			wasreal = true;
+			break;
+		default:
+			ASSERT(0);
+			ret = -EUCLEAN;
+			goto out;
+		}
+
+		if (got.offset < start) {
+			del.offset = start;
+			del.length -= start - got.offset;
+			if (wasreal)
+				del.addr += start - got.offset;
+		}
+		if (del.offset + del.length > end + 1)
+			del.length = end + 1 - del.offset;
+
+		fuse_iext_del_mapping(ip, ifp, &icur, &got, &del);
+		end = del.offset - 1;
+
+		/*
+		 * If not done go on to the next (previous) record.
+		 */
+		if (end != -1 && end >= start) {
+			if (!fuse_iext_get_extent(ifp, &icur, &got) ||
+			    (got.offset > end &&
+			     !fuse_iext_prev_extent(ifp, &icur, &got))) {
+				done = true;
+				break;
+			}
+		}
+	}
+
+	/* Should have removed everything */
+	if (len == 0 || done || end == (loff_t)-1 || end < start)
+		ret = 0;
+	else
+		ret = -EUCLEAN;
+
+out:
+	fuse_iext_check_mappings(inode, ip, ifp);
+	return ret;
+}
+
+static void
+fuse_iext_add_mapping(
+	struct fuse_iomap_cache	*ip,
+	struct fuse_ifork	*ifp,
+	struct fuse_iext_cursor	*icur,
+	const struct fuse_iomap	*new)	/* new extent entry */
+{
+	struct fuse_iomap	left;	/* left neighbor extent entry */
+	struct fuse_iomap	right;	/* right neighbor extent entry */
+	uint32_t		state = fuse_iomap_fork_to_state(ip, ifp);
+
+	/*
+	 * Check and set flags if this segment has a left neighbor.
+	 */
+	if (fuse_iext_peek_prev_extent(ifp, icur, &left))
+		state |= FUSE_IEXT_LEFT_VALID;
+
+	/*
+	 * Check and set flags if this segment has a current value.
+	 * Not true if we're inserting into the "hole" at eof.
+	 */
+	if (fuse_iext_get_extent(ifp, icur, &right))
+		state |= FUSE_IEXT_RIGHT_VALID;
+
+	/*
+	 * We're inserting a real allocation between "left" and "right".
+	 * Set the contiguity flags.  Don't let extents get too large.
+	 */
+	if ((state & FUSE_IEXT_LEFT_VALID) && fuse_iomap_can_merge(&left, new))
+		state |= FUSE_IEXT_LEFT_CONTIG;
+
+	if ((state & FUSE_IEXT_RIGHT_VALID) &&
+	    fuse_iomap_can_merge(new, &right) &&
+	    (!(state & FUSE_IEXT_LEFT_CONTIG) ||
+	     fuse_iomap_can_merge3(&left, new, &right)))
+		state |= FUSE_IEXT_RIGHT_CONTIG;
+
+	trace_fuse_iext_add_mapping(VFS_I(ip), state, new);
+	if (state & FUSE_IEXT_LEFT_VALID)
+		trace_fuse_iext_add_mapping_left(VFS_I(ip), &left);
+	if (state & FUSE_IEXT_RIGHT_VALID)
+		trace_fuse_iext_add_mapping_right(VFS_I(ip), &right);
+
+	/*
+	 * Select which case we're in here, and implement it.
+	 */
+	switch (state & (FUSE_IEXT_LEFT_CONTIG | FUSE_IEXT_RIGHT_CONTIG)) {
+	case FUSE_IEXT_LEFT_CONTIG | FUSE_IEXT_RIGHT_CONTIG:
+		/*
+		 * New allocation is contiguous with real allocations on the
+		 * left and on the right.
+		 * Merge all three into a single extent record.
+		 */
+		left.length += new->length + right.length;
+
+		fuse_iext_remove(ip, icur, state);
+		fuse_iext_prev(ifp, icur);
+		fuse_iext_update_extent(ip, state, icur, &left);
+		break;
+
+	case FUSE_IEXT_LEFT_CONTIG:
+		/*
+		 * New allocation is contiguous with a real allocation
+		 * on the left.
+		 * Merge the new allocation with the left neighbor.
+		 */
+		left.length += new->length;
+
+		fuse_iext_prev(ifp, icur);
+		fuse_iext_update_extent(ip, state, icur, &left);
+		break;
+
+	case FUSE_IEXT_RIGHT_CONTIG:
+		/*
+		 * New allocation is contiguous with a real allocation
+		 * on the right.
+		 * Merge the new allocation with the right neighbor.
+		 */
+		right.offset = new->offset;
+		right.addr = new->addr;
+		right.length += new->length;
+		fuse_iext_update_extent(ip, state, icur, &right);
+		break;
+
+	case 0:
+		/*
+		 * New allocation is not contiguous with another
+		 * real allocation.
+		 * Insert a new entry.
+		 */
+		fuse_iext_insert(ip, icur, new, state);
+		break;
+	}
+}
+
+int
+fuse_iomap_cache_add(
+	struct inode		*inode,
+	enum fuse_iomap_fork	whichfork,
+	const struct fuse_iomap	*new)
+{
+	struct fuse_iext_cursor	icur;
+	struct fuse_iomap	got;
+	struct fuse_inode	*fi = get_fuse_inode(inode);
+	struct fuse_iomap_cache	*ip = &fi->cache;
+	struct fuse_ifork	*ifp = fuse_iomap_fork_ptr(ip, whichfork);
+
+	fuse_iomap_assert_locked(ip, FUSE_IOMAP_LOCK_EXCL);
+	ASSERT(new->length > 0);
+	ASSERT(new->offset < inode->i_sb->s_maxbytes);
+
+	trace_fuse_iomap_cache_add(inode, whichfork, new, _RET_IP_);
+
+	if (!ifp) {
+		ifp = kzalloc(sizeof(struct fuse_ifork),
+			      GFP_KERNEL | __GFP_NOFAIL);
+		if (!ifp)
+			return -ENOMEM;
+
+		ip->im_write = ifp;
+	}
+
+	if (fuse_iext_lookup_extent(ip, ifp, new->offset, &icur, &got)) {
+		/* make sure we only add into a hole. */
+		ASSERT(got.offset > new->offset);
+		ASSERT(got.offset - new->offset >= new->length);
+
+		if (got.offset <= new->offset ||
+		    got.offset - new->offset < new->length)
+			return -EUCLEAN;
+	}
+
+	fuse_iext_add_mapping(ip, ifp, &icur, new);
+	fuse_iext_check_mappings(inode, ip, ifp);
+	return 0;
+}
+
+/*
+ * Trim the returned map to the required bounds
+ */
+static void
+fuse_iomap_trim(
+	struct inode		*inode,
+	struct fuse_iomap	*mval,
+	const struct fuse_iomap	*got,
+	loff_t			off,
+	loff_t			len)
+{
+	const unsigned int blocksize = i_blocksize(inode);
+	const loff_t aligned_off = round_down(off, blocksize);
+	const loff_t aligned_end = round_up(off + len, blocksize);
+	const loff_t aligned_len = aligned_end - aligned_off;
+
+	ASSERT(aligned_off >= got->offset);
+
+	switch (got->type) {
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		mval->addr = got->addr + (aligned_off - got->offset);
+		break;
+	default:
+		mval->addr = FUSE_IOMAP_NULL_ADDR;
+		break;
+	}
+	mval->offset = aligned_off;
+	mval->length = min_t(loff_t, aligned_len,
+			     got->length - (aligned_off - got->offset));
+	mval->type = got->type;
+	mval->flags = got->flags;
+	mval->dev = got->dev;
+}
+
+enum fuse_iomap_lookup_result
+fuse_iomap_cache_lookup(
+	struct inode		*inode,
+	enum fuse_iomap_fork	whichfork,
+	loff_t			off,
+	uint64_t		len,
+	struct fuse_iomap	*mval)
+{
+	struct fuse_iomap	got;
+	struct fuse_iext_cursor	icur;
+	struct fuse_inode	*fi = get_fuse_inode(inode);
+	struct fuse_iomap_cache	*ip = &fi->cache;
+	struct fuse_ifork	*ifp = fuse_iomap_fork_ptr(ip, whichfork);
+
+	fuse_iomap_assert_locked(ip, FUSE_IOMAP_LOCK_SHARED |
+				     FUSE_IOMAP_LOCK_EXCL);
+
+	trace_fuse_iomap_cache_lookup(inode, whichfork, off, len, _RET_IP_);
+
+	if (!ifp) {
+		/*
+		 * No write fork at all means this filesystem doesn't do out of
+		 * place writes.
+		 */
+		return LOOKUP_NOFORK;
+	}
+
+	if (!fuse_iext_lookup_extent(ip, ifp, off, &icur, &got)) {
+		/*
+		 * This fork does not contain a mapping at or beyond off,
+		 * which is a cache miss.
+		 */
+		return LOOKUP_MISS;
+	}
+
+	if (got.offset > off) {
+		/*
+		 * Found a mapping, but it doesn't cover the start of the
+		 * range, which is effectively a miss.
+		 */
+		return LOOKUP_MISS;
+	}
+
+	/* Found a mapping in the cache, return it */
+	fuse_iomap_trim(inode, mval, &got, off, len);
+	mval->validity_cookie = fuse_iext_read_seq(ip);
+	trace_fuse_iomap_cache_lookup_result(inode, whichfork, off, len, &got,
+					     mval);
+	return LOOKUP_HIT;
+}
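
Illustration (not part of the patch): the four cases handled by
fuse_iext_del_mapping() can be modelled in userspace as follows.  This is
a minimal sketch under stated assumptions: struct ext and ext_delete() are
hypothetical names invented for this note, only mapped extents (where the
device address advances with the file offset) are modelled, and the caller
is assumed to pass a deletion range that lies entirely inside the extent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct fuse_iomap. */
struct ext {
	uint64_t offset;	/* file offset of the first byte */
	uint64_t length;	/* number of bytes covered */
	uint64_t addr;		/* device address of the first byte */
};

/*
 * Remove @del from @got, where @del lies entirely within @got.
 * Writes the surviving pieces to @out (at most two) and returns how
 * many pieces remain.
 */
static int ext_delete(const struct ext *got, const struct ext *del,
		      struct ext out[2])
{
	uint64_t got_end = got->offset + got->length;
	uint64_t del_end = del->offset + del->length;
	int n = 0;

	assert(del->length > 0);
	assert(got->offset <= del->offset);
	assert(got_end >= del_end);

	/* Left remainder: whatever precedes the deleted range. */
	if (got->offset < del->offset) {
		out[n].offset = got->offset;
		out[n].length = del->offset - got->offset;
		out[n].addr = got->addr;
		n++;
	}

	/* Right remainder: whatever follows the deleted range. */
	if (del_end < got_end) {
		out[n].offset = del_end;
		out[n].length = got_end - del_end;
		out[n].addr = got->addr + (del_end - got->offset);
		n++;
	}

	return n;
}
```

A return value of 0 corresponds to LEFT_FILLING | RIGHT_FILLING (the record
is removed), 1 corresponds to either flag alone (the record is updated in
place), and 2 corresponds to case 0 (the record is shortened and a new one
is inserted after it).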



* [PATCH 2/4] fuse: use the iomap cache for iomap_begin
  2025-07-17 23:24 ` [PATCHSET RFC v3 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
  2025-07-17 23:31   ` [PATCH 1/4] fuse: cache iomaps Darrick J. Wong
@ 2025-07-17 23:32   ` Darrick J. Wong
  2025-07-17 23:32   ` [PATCH 3/4] fuse: invalidate iomap cache after file updates Darrick J. Wong
  2025-07-17 23:32   ` [PATCH 4/4] fuse: enable iomap cache management Darrick J. Wong
  3 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:32 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

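Look inside the iomap cache to try to satisfy iomap_begin before upcalling
the fuse server.  If the server replies with a null mapping, query the cache
again on the assumption that the server has populated it.  Use the cache
sequence counter as the validity cookie so that stale mappings are caught
during buffered writes and writeback.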

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
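Illustration (not part of the patch): once a cached mapping has landed in
iomap, fuse_iomap_try_cache() still has to decide whether that mapping can
serve the operation or whether the fuse server must be asked anyway.  The
decision table can be sketched as below.  The enum, the OP_* flags and
cached_mapping_usable() are hypothetical stand-ins for the kernel's IOMAP_*
definitions, and the caller is assumed to have already worked out whether
the operation counts as a file write.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's IOMAP_* mapping types. */
enum map_type { MAP_HOLE, MAP_DELALLOC, MAP_MAPPED, MAP_UNWRITTEN };

/* Hypothetical stand-ins for the IOMAP_* operation flags. */
#define OP_ZERO		(1u << 0)
#define OP_UNSHARE	(1u << 1)
#define OP_DIRECT	(1u << 2)

/*
 * Return true if a cached mapping of @type can satisfy the operation
 * without an upcall to the fuse server.
 */
static bool cached_mapping_usable(bool is_write, enum map_type type,
				  unsigned int opflags)
{
	/* Reads can use whatever the cache returned. */
	if (!is_write)
		return true;

	switch (type) {
	case MAP_HOLE:
		/*
		 * A cached hole is only good enough for zeroing and
		 * unsharing, presumably because neither needs space to be
		 * allocated; any other write must go to the server.
		 */
		return (opflags & (OP_ZERO | OP_UNSHARE)) != 0;
	case MAP_DELALLOC:
		/*
		 * A delayed allocation has no device address yet, so a
		 * direct write cannot use it.
		 */
		return (opflags & OP_DIRECT) == 0;
	default:
		return true;
	}
}
```

A false result maps onto the "return 0" path in the patch, after which
fuse_iomap_begin() falls through to the FUSE_IOMAP_BEGIN upcall.
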
 fs/fuse/fuse_trace.h  |   46 ++++++++
 fs/fuse/iomap_cache.h |    3 +
 fs/fuse/file_iomap.c  |  270 ++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/iomap_cache.c |   63 +++++++++++
 4 files changed, 377 insertions(+), 5 deletions(-)


diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 598c0e603a32b1..88f1dd2ccbc9d5 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -158,6 +158,7 @@ struct fuse_iext_cursor;
 
 #define FUSE_IOMAP_TYPE_STRINGS \
 	{ FUSE_IOMAP_TYPE_PURE_OVERWRITE,	"overwrite" }, \
+	{ FUSE_IOMAP_TYPE_NULL,			"null" }, \
 	{ FUSE_IOMAP_TYPE_HOLE,			"hole" }, \
 	{ FUSE_IOMAP_TYPE_DELALLOC,		"delalloc" }, \
 	{ FUSE_IOMAP_TYPE_MAPPED,		"mapped" }, \
@@ -1723,6 +1724,51 @@ TRACE_EVENT(fuse_iomap_cache_lookup_result,
 		  __entry->got_length, __entry->got_addr,
 		  __entry->validity_cookie)
 );
+
+TRACE_EVENT(fuse_iomap_invalid,
+	TP_PROTO(const struct inode *inode, const struct iomap *map,
+		 uint64_t validity_cookie),
+	TP_ARGS(inode, map, validity_cookie),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(loff_t,			isize)
+		__field(loff_t,			offset)
+		__field(uint64_t,		length)
+		__field(uint16_t,		maptype)
+		__field(uint16_t,		mapflags)
+		__field(uint64_t,		addr)
+		__field(uint64_t,		old_validity_cookie)
+		__field(uint64_t,		validity_cookie)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	map->offset;
+		__entry->length		=	map->length;
+		__entry->maptype	=	map->type;
+		__entry->mapflags	=	map->flags;
+		__entry->addr		=	map->addr;
+		__entry->old_validity_cookie=	map->validity_cookie;
+		__entry->validity_cookie=	validity_cookie;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx offset 0x%llx length 0x%llx type %s mapflags (%s) addr 0x%llx old_cookie 0x%llx new_cookie 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->length,
+		  __print_symbolic(__entry->maptype, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->mapflags, "|", FUSE_IOMAP_F_STRINGS),
+		  __entry->addr, __entry->old_validity_cookie,
+		  __entry->validity_cookie)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/iomap_cache.h b/fs/fuse/iomap_cache.h
index 7efa23be18d155..2edcc8dc94b145 100644
--- a/fs/fuse/iomap_cache.h
+++ b/fs/fuse/iomap_cache.h
@@ -20,6 +20,9 @@
 void fuse_iomap_cache_lock(struct inode *inode, unsigned int lock_flags);
 void fuse_iomap_cache_unlock(struct inode *inode, unsigned int lock_flags);
 
+bool fuse_iomap_check_type(uint16_t type);
+bool fuse_iomap_check_flags(uint16_t flags);
+
 #define FUSE_IOMAP_MAX_LEN	((loff_t)(1ULL << 63))
 
 struct fuse_iext_leaf;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 66e1be93592023..122860af4bc42f 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -32,7 +32,7 @@ bool fuse_iomap_enabled(void)
 	return enable_iomap && has_capability_noaudit(current, CAP_SYS_RAWIO);
 }
 
-static inline bool fuse_iomap_check_type(uint16_t type)
+inline bool fuse_iomap_check_type(uint16_t type)
 {
 	BUILD_BUG_ON(FUSE_IOMAP_TYPE_HOLE	!= IOMAP_HOLE);
 	BUILD_BUG_ON(FUSE_IOMAP_TYPE_DELALLOC	!= IOMAP_DELALLOC);
@@ -42,6 +42,7 @@ static inline bool fuse_iomap_check_type(uint16_t type)
 
 	switch (type) {
 	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+	case FUSE_IOMAP_TYPE_NULL:
 	case FUSE_IOMAP_TYPE_HOLE:
 	case FUSE_IOMAP_TYPE_DELALLOC:
 	case FUSE_IOMAP_TYPE_MAPPED:
@@ -63,7 +64,7 @@ static inline bool fuse_iomap_check_type(uint16_t type)
 			  FUSE_IOMAP_F_ATOMIC_BIO | \
 			  FUSE_IOMAP_F_WANT_IOMAP_END)
 
-static inline bool fuse_iomap_check_flags(uint16_t flags)
+inline bool fuse_iomap_check_flags(uint16_t flags)
 {
 	BUILD_BUG_ON(FUSE_IOMAP_F_NEW		!= IOMAP_F_NEW);
 	BUILD_BUG_ON(FUSE_IOMAP_F_DIRTY		!= IOMAP_F_DIRTY);
@@ -147,6 +148,14 @@ fuse_iomap_begin_validate(const struct fuse_iomap_begin_out *outarg,
 		if (BAD_DATA(outarg->read_addr == FUSE_IOMAP_NULL_ADDR))
 			return -EIO;
 		break;
+	case FUSE_IOMAP_TYPE_NULL:
+		/*
+		 * We only accept null mappings if we have a cache to query.
+		 * There must not be a device addr.
+		 */
+		if (BAD_DATA(!fuse_has_iomap_cache(inode)))
+			return -EIO;
+		fallthrough;
 	case FUSE_IOMAP_TYPE_DELALLOC:
 	case FUSE_IOMAP_TYPE_HOLE:
 	case FUSE_IOMAP_TYPE_INLINE:
@@ -170,6 +179,14 @@ fuse_iomap_begin_validate(const struct fuse_iomap_begin_out *outarg,
 		if (BAD_DATA(outarg->write_addr == FUSE_IOMAP_NULL_ADDR))
 			return -EIO;
 		break;
+	case FUSE_IOMAP_TYPE_NULL:
+		/*
+		 * We only accept null mappings if we have a cache to query.
+		 * There must not be a device addr.
+		 */
+		if (BAD_DATA(!fuse_has_iomap_cache(inode)))
+			return -EIO;
+		fallthrough;
 	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
 	case FUSE_IOMAP_TYPE_HOLE:
 	case FUSE_IOMAP_TYPE_DELALLOC:
@@ -445,6 +462,220 @@ static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags,
 	return 0;
 }
 
+static bool fuse_iomap_revalidate(struct inode *inode,
+				  const struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	uint64_t validity_cookie = fuse_iext_read_seq(&fi->cache);
+
+	if (iomap->validity_cookie != validity_cookie) {
+		trace_fuse_iomap_invalid(inode, iomap, validity_cookie);
+		return false;
+	}
+
+	return true;
+}
+
+static const struct iomap_folio_ops fuse_iomap_folio_ops = {
+	.iomap_valid		= fuse_iomap_revalidate,
+};
+
+static int fuse_iomap_from_cache(struct inode *inode, struct iomap *iomap,
+				 const struct fuse_iomap *fmap)
+{
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	struct fuse_iomap_dev *fb;
+
+	fb = fuse_iomap_find_dev(fm->fc, fmap->type, fmap->dev);
+	if (IS_ERR(fb))
+		return PTR_ERR(fb);
+
+	iomap->addr = fmap->addr;
+	iomap->offset = fmap->offset;
+	iomap->length = fmap->length;
+	iomap->type = fmap->type;
+	iomap->flags = fmap->flags;
+	iomap->folio_ops = &fuse_iomap_folio_ops;
+	iomap->validity_cookie = fmap->validity_cookie;
+	fuse_iomap_set_device(iomap, fb);
+
+	fuse_iomap_dev_put(fb);
+	return 0;
+}
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static inline int fuse_iomap_validate_cached(const struct inode *inode,
+					     enum fuse_iomap_fork whichfork,
+					     unsigned opflags,
+					     const struct fuse_iomap *fmap)
+{
+	uint64_t end;
+
+	/* No garbage mapping types or flags */
+	if (BAD_DATA(!fuse_iomap_check_type(fmap->type)))
+		return -EIO;
+	if (BAD_DATA(!fuse_iomap_check_flags(fmap->flags)))
+		return -EIO;
+
+	/* Must have returned a mapping for the first byte in the range */
+	if (BAD_DATA(fmap->length == 0))
+		return -EIO;
+
+	/* No overflows in the file range */
+	if (BAD_DATA(check_add_overflow(fmap->offset, fmap->length, &end)))
+		return -EIO;
+
+	/* File range cannot start past maxbytes */
+	if (BAD_DATA(fmap->offset >= inode->i_sb->s_maxbytes))
+		return -EIO;
+
+	switch (fmap->type) {
+	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+		/* "Pure overwrite" only allowed for write mapping */
+		if (BAD_DATA(whichfork != FUSE_IOMAP_WRITE_FORK))
+			return -EIO;
+		break;
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		/* Mappings backed by space must have a device/addr */
+		if (BAD_DATA(fmap->dev == FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(fmap->addr == FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	case FUSE_IOMAP_TYPE_DELALLOC:
+	case FUSE_IOMAP_TYPE_HOLE:
+	case FUSE_IOMAP_TYPE_INLINE:
+		/* Mappings not backed by space cannot have a device addr. */
+		if (BAD_DATA(fmap->dev != FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(fmap->addr != FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	case FUSE_IOMAP_TYPE_NULL:
+		/* Cache itself cannot contain null mappings */
+		BAD_DATA(fmap->type == FUSE_IOMAP_TYPE_NULL);
+		return -EIO;
+	default:
+		/* should have been caught already */
+		return -EIO;
+	}
+
+	/* No overflows in the device range, if supplied */
+	if (fmap->addr != FUSE_IOMAP_NULL_ADDR &&
+	    BAD_DATA(check_add_overflow(fmap->addr, fmap->length, &end)))
+		return -EIO;
+
+	return 0;
+}
+#else
+# define fuse_iomap_validate_cached(...)	(0)
+#endif
+
+/*
+ * Look up iomappings from the cache.  Returns 1 if iomap and srcmap were
+ * satisfied from cache; 0 if not; or a negative errno.
+ */
+static int fuse_iomap_try_cache(struct inode *inode, loff_t pos, loff_t count,
+				unsigned opflags, struct iomap *iomap,
+				struct iomap *srcmap)
+{
+	struct fuse_iomap map;
+	struct iomap *dest = iomap;
+	enum fuse_iomap_lookup_result res;
+	int ret;
+
+	if (!fuse_has_iomap_cache(inode))
+		return 0;
+
+	fuse_iomap_cache_lock(inode, FUSE_IOMAP_LOCK_SHARED);
+
+	if (fuse_is_iomap_file_write(opflags)) {
+		res = fuse_iomap_cache_lookup(inode, FUSE_IOMAP_WRITE_FORK,
+					      pos, count, &map);
+		switch (res) {
+		case LOOKUP_HIT:
+			ret = fuse_iomap_validate_cached(inode,
+					FUSE_IOMAP_WRITE_FORK, opflags, &map);
+			if (ret)
+				goto out_unlock;
+
+			if (map.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+				ret = fuse_iomap_from_cache(inode, dest, &map);
+				if (ret)
+					goto out_unlock;
+
+				dest = srcmap;
+			}
+			fallthrough;
+		case LOOKUP_NOFORK:
+			/* move on to the read fork */
+			break;
+		case LOOKUP_MISS:
+			ret = 0;
+			goto out_unlock;
+		}
+	}
+
+	res = fuse_iomap_cache_lookup(inode, FUSE_IOMAP_READ_FORK, pos, count,
+				      &map);
+	switch (res) {
+	case LOOKUP_HIT:
+		break;
+	case LOOKUP_NOFORK:
+		ASSERT(res != LOOKUP_NOFORK);
+		ret = -EIO;
+		goto out_unlock;
+	case LOOKUP_MISS:
+		ret = 0;
+		goto out_unlock;
+	}
+
+	ret = fuse_iomap_validate_cached(inode, FUSE_IOMAP_READ_FORK, opflags,
+					 &map);
+	if (ret)
+		goto out_unlock;
+
+	ret = fuse_iomap_from_cache(inode, dest, &map);
+	if (ret)
+		goto out_unlock;
+
+	if (fuse_is_iomap_file_write(opflags)) {
+		switch (iomap->type) {
+		case IOMAP_HOLE:
+			if (opflags & (IOMAP_ZERO | IOMAP_UNSHARE))
+				ret = 1;
+			else
+				ret = 0;
+			break;
+		case IOMAP_DELALLOC:
+			if (opflags & IOMAP_DIRECT)
+				ret = 0;
+			else
+				ret = 1;
+			break;
+		default:
+			ret = 1;
+			break;
+		}
+	} else {
+		ret = 1;
+	}
+
+out_unlock:
+	fuse_iomap_cache_unlock(inode, FUSE_IOMAP_LOCK_SHARED);
+	if (ret < 1)
+		return ret;
+
+	if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) {
+		ret = fuse_iomap_set_inline(inode, opflags, pos, count, iomap,
+					    srcmap);
+		if (ret)
+			return ret;
+	}
+	return 1;
+}
+
 static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 			    unsigned opflags, struct iomap *iomap,
 			    struct iomap *srcmap)
@@ -465,6 +696,17 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 
 	trace_fuse_iomap_begin(inode, pos, count, opflags);
 
+	/*
+	 * Try to read mappings from the cache; if we find something then use
+	 * it; otherwise we upcall the fuse server.
+	 */
+	err = fuse_iomap_try_cache(inode, pos, count, opflags, iomap, srcmap);
+	if (err < 0)
+		return err;
+	if (err == 1)
+		return 0;
+
+retry:
 	args.opcode = FUSE_IOMAP_BEGIN;
 	args.nodeid = get_node_id(inode);
 	args.in_numargs = 1;
@@ -486,6 +728,24 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
 	if (err)
 		return err;
 
+	/*
+	 * If the fuse server returned null mappings, we'll try the cache again
+	 * assuming that the fuse server populated the cache.  Note that we
+	 * dropped the cache lock, so it's entirely possible that another
+	 * thread could have invalidated the cache.
+	 */
+	if (outarg.read_type == FUSE_IOMAP_TYPE_NULL) {
+		err = fuse_iomap_try_cache(inode, pos, count, opflags, iomap,
+					   srcmap);
+		if (err < 0)
+			return err;
+		if (err == 1)
+			return 0;
+		if (signal_pending(current))
+			return -EINTR;
+		goto retry;
+	}
+
 	read_dev = fuse_iomap_find_dev(fm->fc, outarg.read_type,
 				       outarg.read_dev);
 	if (IS_ERR(read_dev))
@@ -1479,14 +1739,14 @@ static void fuse_iomap_end_bio(struct bio *bio)
  * mapping is valid, false otherwise.
  */
 static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc,
+					    struct inode *inode,
 					    loff_t offset)
 {
 	if (offset < wpc->iomap.offset ||
 	    offset >= wpc->iomap.offset + wpc->iomap.length)
 		return false;
 
-	/* XXX actually use revalidation cookie */
-	return true;
+	return fuse_iomap_revalidate(inode, &wpc->iomap);
 }
 
 static int fuse_iomap_map_blocks(struct iomap_writepage_ctx *wpc,
@@ -1503,7 +1763,7 @@ static int fuse_iomap_map_blocks(struct iomap_writepage_ctx *wpc,
 
 	trace_fuse_iomap_map_blocks(inode, offset, len);
 
-	if (fuse_iomap_revalidate_writeback(wpc, offset))
+	if (fuse_iomap_revalidate_writeback(wpc, inode, offset))
 		return 0;
 
 	/* Pretend that this is a directio write */
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
index 6244352f543f03..239441d2903cc8 100644
--- a/fs/fuse/iomap_cache.c
+++ b/fs/fuse/iomap_cache.c
@@ -1564,6 +1564,67 @@ fuse_iomap_cache_add(
 	return 0;
 }
 
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static inline void
+fuse_iomap_cache_validate_lookup(const struct inode *inode,
+				 enum fuse_iomap_fork whichfork,
+				 const struct fuse_iomap *fmap)
+{
+	const unsigned int blocksize = i_blocksize(inode);
+	uint64_t end;
+
+	/* No garbage mapping types or flags */
+	BAD_DATA(!fuse_iomap_check_type(fmap->type));
+	BAD_DATA(!fuse_iomap_check_flags(fmap->flags));
+
+	/* Must have returned a mapping for the first byte in the range */
+	BAD_DATA(fmap->length == 0);
+
+	/* File range must be aligned to blocksize */
+	BAD_DATA(!IS_ALIGNED(fmap->offset, blocksize));
+	BAD_DATA(!IS_ALIGNED(fmap->length, blocksize));
+
+	/* No overflows in the file range */
+	BAD_DATA(check_add_overflow(fmap->offset, fmap->length, &end));
+
+	/* File range cannot start past maxbytes */
+	BAD_DATA(fmap->offset >= inode->i_sb->s_maxbytes);
+
+	switch (fmap->type) {
+	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+		/* "Pure overwrite" only allowed for write mapping */
+		BAD_DATA(whichfork != FUSE_IOMAP_WRITE_FORK);
+		break;
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		/* Mappings backed by space must have a device/addr */
+		BAD_DATA(fmap->dev == FUSE_IOMAP_DEV_NULL);
+		BAD_DATA(fmap->addr == FUSE_IOMAP_NULL_ADDR);
+		break;
+	case FUSE_IOMAP_TYPE_DELALLOC:
+	case FUSE_IOMAP_TYPE_HOLE:
+	case FUSE_IOMAP_TYPE_INLINE:
+		/* Mappings not backed by space cannot have a device addr. */
+		BAD_DATA(fmap->dev != FUSE_IOMAP_DEV_NULL);
+		BAD_DATA(fmap->addr != FUSE_IOMAP_NULL_ADDR);
+		break;
+	case FUSE_IOMAP_TYPE_NULL:
+		/* Cache itself cannot contain null mappings */
+		BAD_DATA(fmap->type == FUSE_IOMAP_TYPE_NULL);
+		break;
+	default:
+		BAD_DATA(1);
+		break;
+	}
+
+	/* No overflows in the device range, if supplied */
+	if (fmap->addr != FUSE_IOMAP_NULL_ADDR)
+		BAD_DATA(check_add_overflow(fmap->addr, fmap->length, &end));
+}
+#else
+# define fuse_iomap_cache_validate_lookup(...)	((void)0)
+#endif
+
 /*
  * Trim the returned map to the required bounds
  */
@@ -1642,6 +1703,8 @@ fuse_iomap_cache_lookup(
 		return LOOKUP_MISS;
 	}
 
+	fuse_iomap_cache_validate_lookup(inode, whichfork, &got);
+
 	/* Found a mapping in the cache, return it */
 	fuse_iomap_trim(inode, mval, &got, off, len);
 	mval->validity_cookie = fuse_iext_read_seq(ip);
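
Illustration (not part of the patch): the trimming step that follows a
cache hit widens the requested byte range to block boundaries and then
clips it to the cached extent.  The arithmetic can be sketched as below.
struct map, rnd_down(), rnd_up() and map_trim() are hypothetical names
invented for this note; the block size is assumed to be a power of two and
only mapped extents (which carry a device address) are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct fuse_iomap. */
struct map {
	uint64_t offset;	/* file offset of the first byte */
	uint64_t length;	/* number of bytes covered */
	uint64_t addr;		/* device address of the first byte */
};

/* Round @x down or up to a multiple of @bs, a power of two. */
static uint64_t rnd_down(uint64_t x, uint64_t bs)
{
	return x & ~(bs - 1);
}

static uint64_t rnd_up(uint64_t x, uint64_t bs)
{
	return rnd_down(x + bs - 1, bs);
}

/*
 * Return the part of @got that covers the block-aligned range around
 * [@off, @off + @len).  @off must fall inside @got.
 */
static struct map map_trim(const struct map *got, uint64_t off,
			   uint64_t len, uint64_t blocksize)
{
	uint64_t aligned_off = rnd_down(off, blocksize);
	uint64_t aligned_end = rnd_up(off + len, blocksize);
	uint64_t skip;
	struct map ret;

	assert(aligned_off >= got->offset);
	skip = aligned_off - got->offset;
	assert(skip < got->length);

	ret.offset = aligned_off;
	ret.addr = got->addr + skip;	/* address moves with the offset */
	ret.length = aligned_end - aligned_off;
	if (ret.length > got->length - skip)
		ret.length = got->length - skip;	/* clip to the extent */
	return ret;
}
```

The returned mapping may be shorter than the request when the cached extent
ends first; the caller is expected to come back for the remainder.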



* [PATCH 3/4] fuse: invalidate iomap cache after file updates
  2025-07-17 23:24 ` [PATCHSET RFC v3 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
  2025-07-17 23:31   ` [PATCH 1/4] fuse: cache iomaps Darrick J. Wong
  2025-07-17 23:32   ` [PATCH 2/4] fuse: use the iomap cache for iomap_begin Darrick J. Wong
@ 2025-07-17 23:32   ` Darrick J. Wong
  2025-07-17 23:32   ` [PATCH 4/4] fuse: enable iomap cache management Darrick J. Wong
  3 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:32 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

The kernel doesn't know what the fuse server might have done in response
to truncate, fallocate, or ioend events.  Therefore, it must invalidate
the mapping cache after those operations to ensure cache coherency.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
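Illustration (not part of the patch): invalidation works together with the
validity cookie that the previous patch compares in
fuse_iomap_revalidate().  The idea can be sketched as below.  struct cache,
struct mapping and the three helpers are hypothetical names invented for
this note, and the sketch assumes that every change to the cached mappings
bumps the sequence counter that fuse_iext_read_seq() samples.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-inode mapping cache. */
struct cache {
	uint64_t seq;	/* assumed to change whenever the mappings change */
};

/* Hypothetical model of a mapping handed out to an I/O path. */
struct mapping {
	uint64_t validity_cookie;	/* value of seq at lookup time */
};

/* Hand out a mapping stamped with the current sequence number. */
static struct mapping cache_lookup(const struct cache *c)
{
	struct mapping m = { .validity_cookie = c->seq };

	return m;
}

/* Drop cached mappings, e.g. after truncate, fallocate, or ioend. */
static void cache_invalidate(struct cache *c)
{
	c->seq++;
}

/* A mapping is stale as soon as the cache has changed underneath it. */
static bool mapping_still_valid(const struct cache *c,
				const struct mapping *m)
{
	return m->validity_cookie == c->seq;
}
```

A mapping that fails this check has to be looked up again, either from the
cache or through a fresh upcall to the fuse server.
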
 fs/fuse/fuse_i.h      |   16 +++++++++++++
 fs/fuse/fuse_trace.h  |   59 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/file.c        |   10 ++++++--
 fs/fuse/file_iomap.c  |   42 ++++++++++++++++++++++++++++++++++-
 fs/fuse/iomap_cache.c |   29 ++++++++++++++++++++++++
 5 files changed, 152 insertions(+), 4 deletions(-)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 196d2b57e80bb1..3b51aa6b50b8ab 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1735,6 +1735,9 @@ int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
 			 loff_t length, loff_t new_size);
 int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
 				 loff_t endpos);
+void fuse_iomap_open_truncate(struct inode *inode);
+void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
+				  size_t written);
 
 int fuse_dev_ioctl_iomap_support(struct file *file,
 				 struct fuse_iomap_support __user *argp);
@@ -1799,6 +1802,15 @@ fuse_iomap_cache_lookup(struct inode *inode,
 			enum fuse_iomap_fork whichfork,
 			loff_t off, uint64_t len,
 			struct fuse_iomap *mval);
+
+int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset,
+				      uint64_t length);
+static inline int fuse_iomap_cache_invalidate(struct inode *inode,
+					      loff_t offset)
+{
+	return fuse_iomap_cache_invalidate_range(inode, offset,
+						 FUSE_IOMAP_INVAL_TO_EOF);
+}
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1826,12 +1838,16 @@ fuse_iomap_cache_lookup(struct inode *inode,
 # define fuse_iomap_set_i_blkbits(...)		((void)0)
 # define fuse_iomap_fallocate(...)		(-ENOSYS)
 # define fuse_iomap_flush_unmap_range(...)	(-ENOSYS)
+# define fuse_iomap_open_truncate(...)		((void)0)
+# define fuse_iomap_copied_file_range(...)	((void)0)
 # define fuse_dev_ioctl_iomap_support(...)	(-EOPNOTSUPP)
 # define fuse_iomap_fadvise			NULL
 # define fuse_has_iomap_cache(...)		(false)
 # define fuse_iomap_cache_remove(...)		(-ENOSYS)
 # define fuse_iomap_cache_add(...)		(-ENOSYS)
 # define fuse_iomap_cache_upsert(...)		(-ENOSYS)
+# define fuse_iomap_cache_invalidate_range(...)	(-ENOSYS)
+# define fuse_iomap_cache_invalidate(...)	(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 88f1dd2ccbc9d5..547c548163ab54 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -1151,6 +1151,7 @@ DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down);
 DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range);
 DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize);
 DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_flush_unmap_range);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_cache_invalidate_range);
 
 TRACE_EVENT(fuse_iomap_set_i_blkbits,
 	TP_PROTO(const struct inode *inode, u8 new_blkbits),
@@ -1314,6 +1315,64 @@ DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
 DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
 
+TRACE_EVENT(fuse_iomap_open_truncate,
+	TP_PROTO(const struct inode *inode),
+
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize)
+);
+
+TRACE_EVENT(fuse_iomap_copied_file_range,
+	TP_PROTO(const struct inode *inode, loff_t offset,
+		 size_t written),
+	TP_ARGS(inode, offset, written),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(loff_t,		offset)
+		__field(size_t,		written)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->offset		=	offset;
+		__entry->written	=	written;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx offset 0x%llx written 0x%zx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->offset, __entry->written)
+);
+
 DECLARE_EVENT_CLASS(fuse_iomap_cache_lock_class,
 	TP_PROTO(const struct inode *inode, unsigned int lock_flags,
 		 unsigned long caller_ip),
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 78e776878427e3..b390041f5c6659 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -278,9 +278,11 @@ static int fuse_open(struct inode *inode, struct file *file)
 	if (is_wb_truncate || dax_truncate)
 		fuse_release_nowrite(inode);
 	if (!err) {
-		if (is_truncate)
+		if (is_truncate) {
 			truncate_pagecache(inode, 0);
-		else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
+			if (fuse_has_iomap_fileio(inode))
+				fuse_iomap_open_truncate(inode);
+		} else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
 			invalidate_inode_pages2(inode->i_mapping);
 	}
 	if (dax_truncate)
@@ -3181,7 +3183,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (err)
 		goto out;
 
-	if (!fuse_has_iomap_fileio(inode_out))
+	if (fuse_has_iomap_fileio(inode_out))
+		fuse_iomap_copied_file_range(inode_out, pos_out, outarg.size);
+	else
 		truncate_inode_pages_range(inode_out->i_mapping,
 				   ALIGN_DOWN(pos_out, PAGE_SIZE),
 				   ALIGN(pos_out + outarg.size, PAGE_SIZE) - 1);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 122860af4bc42f..bffadbf5660bff 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -864,6 +864,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
 			fuse_iomap_inline_free(iomap);
 			if (err)
 				goto out_err;
+			fuse_iomap_cache_invalidate_range(inode, pos, written);
 		} else {
 			fuse_iomap_inline_free(iomap);
 		}
@@ -938,6 +939,13 @@ static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
 
 	trace_fuse_iomap_ioend_error(inode, &inarg, err);
 
+	/*
+	 * If the ioend completed successfully, invalidate the range that we
+	 * just completed.
+	 */
+	if (!err)
+		fuse_iomap_cache_invalidate_range(inode, pos, written);
+
 	/*
 	 * Preserve the original error code if userspace didn't respond or
 	 * returned success despite the error we passed along via the ioend.
@@ -2122,7 +2130,10 @@ fuse_iomap_setsize(
 	error = inode_newsize_ok(inode, newsize);
 	if (error)
 		return error;
-	return fuse_iomap_setattr_size(inode, newsize);
+	error = fuse_iomap_setattr_size(inode, newsize);
+	if (error)
+		return error;
+	return fuse_iomap_cache_invalidate(inode, newsize);
 }
 
 /*
@@ -2233,6 +2244,14 @@ fuse_iomap_fallocate(
 
 	trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size);
 
+	if (mode & (FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_INSERT_RANGE))
+		error = fuse_iomap_cache_invalidate(inode, offset);
+	else
+		error = fuse_iomap_cache_invalidate_range(inode, offset,
+							  length);
+	if (error)
+		return error;
+
 	/*
 	 * If we unmapped blocks from the file range, then we zero the
 	 * pagecache for those regions and push them to disk rather than make
@@ -2293,3 +2312,24 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice)
 		inode_unlock_shared(inode);
 	return ret;
 }
+
+void fuse_iomap_open_truncate(struct inode *inode)
+{
+	ASSERT(fuse_has_iomap(inode));
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	trace_fuse_iomap_open_truncate(inode);
+
+	fuse_iomap_cache_invalidate(inode, 0);
+}
+
+void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
+				  size_t written)
+{
+	ASSERT(fuse_has_iomap(inode));
+	ASSERT(fuse_has_iomap_fileio(inode));
+
+	trace_fuse_iomap_copied_file_range(inode, offset, written);
+
+	fuse_iomap_cache_invalidate_range(inode, offset, written);
+}
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
index 239441d2903cc8..87f03d8c9a76aa 100644
--- a/fs/fuse/iomap_cache.c
+++ b/fs/fuse/iomap_cache.c
@@ -1427,6 +1427,35 @@ fuse_iomap_cache_remove(
 	return ret;
 }
 
+int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset,
+				      uint64_t length)
+{
+	loff_t aligned_offset;
+	const unsigned int blocksize = i_blocksize(inode);
+	int ret, ret2;
+
+	if (!fuse_has_iomap_cache(inode))
+		return 0;
+
+	trace_fuse_iomap_cache_invalidate_range(inode, offset, length);
+
+	aligned_offset = round_down(offset, blocksize);
+	if (length != FUSE_IOMAP_INVAL_TO_EOF) {
+		length += offset - aligned_offset;
+		length = round_up(length, blocksize);
+	}
+
+	fuse_iomap_cache_lock(inode, FUSE_IOMAP_LOCK_EXCL);
+	ret = fuse_iomap_cache_remove(inode, FUSE_IOMAP_READ_FORK,
+			aligned_offset, length);
+	ret2 = fuse_iomap_cache_remove(inode, FUSE_IOMAP_WRITE_FORK,
+			aligned_offset, length);
+	fuse_iomap_cache_unlock(inode, FUSE_IOMAP_LOCK_EXCL);
+	if (ret)
+		return ret;
+	return ret2;
+}
+
 static void
 fuse_iext_add_mapping(
 	struct fuse_iomap_cache	*ip,


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 4/4] fuse: enable iomap cache management
  2025-07-17 23:24 ` [PATCHSET RFC v3 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:32   ` [PATCH 3/4] fuse: invalidate iomap cache after file updates Darrick J. Wong
@ 2025-07-17 23:32   ` Darrick J. Wong
  3 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:32 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Provide a means for the fuse server to upload iomappings to the kernel
and invalidate them.  This is how we enable iomap caching for better
performance.  This is also required for correct synchronization between
pagecache writes and writeback.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
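Note for readers: a rough sketch of what the server side of an
invalidation might look like on the wire.  It assumes the usual fuse
notification convention (an out header with unique == 0 and the notify
code carried in the error field) and mirrors the structures locally
instead of including the uapi header, since the new definitions only
exist with this series applied.  All demo_* names are hypothetical.
Because the kernel rejects any payload whose size differs from
sizeof(outarg), the sketch copies exactly one header plus one payload.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct fuse_out_header from the fuse uapi. */
struct demo_out_header {
	uint32_t len;		/* total message length, bytes */
	int32_t  error;		/* carries the notify code when unique == 0 */
	uint64_t unique;	/* 0 marks an unsolicited notification */
};

/* Local mirror of struct fuse_iomap_inval_out from this patch. */
struct demo_iomap_inval_out {
	uint64_t nodeid;
	uint64_t attr_ino;
	uint64_t read_offset;
	uint64_t read_length;
	uint64_t write_offset;
	uint64_t write_length;
};

#define DEMO_NOTIFY_IOMAP_INVAL	10	/* FUSE_NOTIFY_IOMAP_INVAL here */
#define DEMO_INVAL_TO_EOF	(~0ULL)	/* FUSE_IOMAP_INVAL_TO_EOF */

/*
 * Build a notification asking the kernel to drop every cached read and
 * write mapping of one inode.  Returns the number of bytes the server
 * would write to the fuse device, or 0 if buf is too small.
 */
static size_t demo_build_inval_all(void *buf, size_t buflen,
				   uint64_t nodeid, uint64_t attr_ino)
{
	struct demo_out_header hdr;
	struct demo_iomap_inval_out arg;
	const size_t total = sizeof(hdr) + sizeof(arg);

	if (buflen < total)
		return 0;

	memset(&hdr, 0, sizeof(hdr));
	hdr.len = (uint32_t)total;
	hdr.error = DEMO_NOTIFY_IOMAP_INVAL;
	hdr.unique = 0;

	memset(&arg, 0, sizeof(arg));
	arg.nodeid = nodeid;
	arg.attr_ino = attr_ino;
	arg.read_offset = 0;			/* block aligned */
	arg.read_length = DEMO_INVAL_TO_EOF;
	arg.write_offset = 0;
	arg.write_length = DEMO_INVAL_TO_EOF;

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy((char *)buf + sizeof(hdr), &arg, sizeof(arg));
	return total;
}
```

A real server would presumably go through whatever libfuse wrapper
ends up accompanying this series rather than hand-rolling the buffer;
the point is only to show the fixed-size payload and that a zero
length leaves the corresponding fork alone.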
 fs/fuse/fuse_i.h          |    7 +
 fs/fuse/fuse_trace.h      |  105 ++++++++++++++
 include/uapi/linux/fuse.h |   34 +++++
 fs/fuse/dev.c             |   45 ++++++
 fs/fuse/file_iomap.c      |  335 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 526 insertions(+)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3b51aa6b50b8ab..e7da75d8a5741d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1811,6 +1811,11 @@ static inline int fuse_iomap_cache_invalidate(struct inode *inode,
 	return fuse_iomap_cache_invalidate_range(inode, offset,
 						 FUSE_IOMAP_INVAL_TO_EOF);
 }
+
+int fuse_iomap_upsert(struct fuse_conn *fc,
+		      const struct fuse_iomap_upsert_out *outarg);
+int fuse_iomap_inval(struct fuse_conn *fc,
+		     const struct fuse_iomap_inval_out *outarg);
 #else
 # define fuse_iomap_enabled(...)		(false)
 # define fuse_has_iomap(...)			(false)
@@ -1848,6 +1853,8 @@ static inline int fuse_iomap_cache_invalidate(struct inode *inode,
 # define fuse_iomap_cache_upsert(...)		(-ENOSYS)
 # define fuse_iomap_cache_invalidate_range(...)	(-ENOSYS)
 # define fuse_iomap_cache_invalidate(...)	(-ENOSYS)
+# define fuse_iomap_upsert(...)			(-ENOSYS)
+# define fuse_iomap_inval(...)			(-ENOSYS)
 #endif
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 547c548163ab54..cc22635790b68c 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -841,6 +841,7 @@ DEFINE_EVENT(fuse_inode_state_class, name,	\
 	TP_ARGS(inode))
 DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
 DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_cache_enable);
 
 TRACE_EVENT(fuse_iomap_end_ioend,
 	TP_PROTO(const struct iomap_ioend *ioend),
@@ -1828,6 +1829,110 @@ TRACE_EVENT(fuse_iomap_invalid,
 		  __entry->addr, __entry->old_validity_cookie,
 		  __entry->validity_cookie)
 );
+
+TRACE_EVENT(fuse_iomap_upsert,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_upsert_out *outarg),
+	TP_ARGS(inode, outarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(loff_t,			isize)
+		__field(uint64_t,		attr_ino)
+
+		__field(uint64_t,		read_offset)
+		__field(uint64_t,		read_length)
+		__field(uint64_t,		read_addr)
+		__field(uint16_t,		read_maptype)
+		__field(uint16_t,		read_mapflags)
+		__field(uint32_t,		read_dev)
+
+		__field(uint64_t,		write_offset)
+		__field(uint64_t,		write_length)
+		__field(uint64_t,		write_addr)
+		__field(uint16_t,		write_maptype)
+		__field(uint16_t,		write_mapflags)
+		__field(uint32_t,		write_dev)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	outarg->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->attr_ino	=	outarg->attr_ino;
+		__entry->read_offset	=	outarg->read_offset;
+		__entry->read_length	=	outarg->read_length;
+		__entry->read_addr	=	outarg->read_addr;
+		__entry->read_maptype	=	outarg->read_type;
+		__entry->read_mapflags	=	outarg->read_flags;
+		__entry->read_dev	=	outarg->read_dev;
+		__entry->write_offset	=	outarg->write_offset;
+		__entry->write_length	=	outarg->write_length;
+		__entry->write_addr	=	outarg->write_addr;
+		__entry->write_maptype	=	outarg->write_type;
+		__entry->write_mapflags	=	outarg->write_flags;
+		__entry->write_dev	=	outarg->write_dev;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx attr_ino 0x%llx read_offset 0x%llx read_length 0x%llx read_addr 0x%llx read_maptype %s read_mapflags (%s) read_dev %u write_offset 0x%llx write_length 0x%llx write_addr 0x%llx write_maptype %s write_mapflags (%s) write_dev %u",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->attr_ino, __entry->read_offset,
+		  __entry->read_length, __entry->read_addr,
+		  __print_symbolic(__entry->read_maptype, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->read_mapflags, "|", FUSE_IOMAP_F_STRINGS),
+		  __entry->read_dev, __entry->write_offset,
+		  __entry->write_length, __entry->write_addr,
+		  __print_symbolic(__entry->write_maptype, FUSE_IOMAP_TYPE_STRINGS),
+		  __print_flags(__entry->write_mapflags, "|", FUSE_IOMAP_F_STRINGS),
+		  __entry->write_dev)
+);
+
+TRACE_EVENT(fuse_iomap_inval,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_iomap_inval_out *outarg),
+	TP_ARGS(inode, outarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(loff_t,			isize)
+		__field(uint64_t,		attr_ino)
+
+		__field(uint64_t,		read_offset)
+		__field(uint64_t,		read_length)
+
+		__field(uint64_t,		write_offset)
+		__field(uint64_t,		write_length)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	outarg->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->attr_ino	=	outarg->attr_ino;
+		__entry->read_offset	=	outarg->read_offset;
+		__entry->read_length	=	outarg->read_length;
+		__entry->write_offset	=	outarg->write_offset;
+		__entry->write_length	=	outarg->write_length;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx attr_ino 0x%llx read_offset 0x%llx read_length 0x%llx write_offset 0x%llx write_length 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->attr_ino, __entry->read_offset,
+		  __entry->read_length, __entry->write_offset,
+		  __entry->write_length)
+);
 #endif /* CONFIG_FUSE_IOMAP */
 
 #endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index a9b2d68b4b79c3..0068bc32a920a7 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -243,6 +243,8 @@
  *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
  *  - add FUSE_IOMAP_FILEIO/FUSE_ATTR_IOMAP_FILEIO for buffered I/O support
  *  - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
+ *  - add FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL so fuse servers
+ *    can cache iomappings in the kernel
  */
 
 #ifndef _LINUX_FUSE_H
@@ -699,6 +701,8 @@ enum fuse_notify_code {
 	FUSE_NOTIFY_DELETE = 6,
 	FUSE_NOTIFY_RESEND = 7,
 	FUSE_NOTIFY_INC_EPOCH = 8,
+	FUSE_NOTIFY_IOMAP_UPSERT = 9,
+	FUSE_NOTIFY_IOMAP_INVAL = 10,
 	FUSE_NOTIFY_CODE_MAX,
 };
 
@@ -1466,4 +1470,34 @@ struct fuse_iomap_config_out {
 /* invalidate all cached iomap mappings up to EOF */
 #define FUSE_IOMAP_INVAL_TO_EOF		(~0ULL)
 
+struct fuse_iomap_inval_out {
+	uint64_t nodeid;	/* Inode ID */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+
+	uint64_t read_offset;	/* range to invalidate read iomaps, bytes */
+	uint64_t read_length;	/* can be FUSE_IOMAP_INVAL_TO_EOF */
+
+	uint64_t write_offset;	/* range to invalidate write iomaps, bytes */
+	uint64_t write_length;	/* can be FUSE_IOMAP_INVAL_TO_EOF */
+};
+
+struct fuse_iomap_upsert_out {
+	uint64_t nodeid;	/* Inode ID */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+
+	uint64_t read_offset;	/* file offset of mapping, bytes */
+	uint64_t read_length;	/* length of mapping, bytes */
+	uint64_t read_addr;	/* disk offset of mapping, bytes */
+	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t read_dev;	/* device cookie */
+
+	uint64_t write_offset;	/* file offset of mapping, bytes */
+	uint64_t write_length;	/* length of mapping, bytes */
+	uint64_t write_addr;	/* disk offset of mapping, bytes */
+	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t write_dev;	/* device cookie */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 3dd04c2fdae7ba..abb24f99ed163e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1835,6 +1835,46 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
 	return err;
 }
 
+static int fuse_notify_iomap_upsert(struct fuse_conn *fc, unsigned int size,
+				    struct fuse_copy_state *cs)
+{
+	struct fuse_iomap_upsert_out outarg;
+	int err = -EINVAL;
+
+	if (size != sizeof(outarg))
+		goto err;
+
+	err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+	if (err)
+		goto err;
+	fuse_copy_finish(cs);
+
+	return fuse_iomap_upsert(fc, &outarg);
+err:
+	fuse_copy_finish(cs);
+	return err;
+}
+
+static int fuse_notify_iomap_inval(struct fuse_conn *fc, unsigned int size,
+				   struct fuse_copy_state *cs)
+{
+	struct fuse_iomap_inval_out outarg;
+	int err = -EINVAL;
+
+	if (size != sizeof(outarg))
+		goto err;
+
+	err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+	if (err)
+		goto err;
+	fuse_copy_finish(cs);
+
+	return fuse_iomap_inval(fc, &outarg);
+err:
+	fuse_copy_finish(cs);
+	return err;
+}
+
 struct fuse_retrieve_args {
 	struct fuse_args_pages ap;
 	struct fuse_notify_retrieve_in inarg;
@@ -2081,6 +2121,11 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
 	case FUSE_NOTIFY_INC_EPOCH:
 		return fuse_notify_inc_epoch(fc);
 
+	case FUSE_NOTIFY_IOMAP_UPSERT:
+		return fuse_notify_iomap_upsert(fc, size, cs);
+	case FUSE_NOTIFY_IOMAP_INVAL:
+		return fuse_notify_iomap_inval(fc, size, cs);
+
 	default:
 		fuse_copy_finish(cs);
 		return -EINVAL;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index bffadbf5660bff..7bf522283f2e72 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -2333,3 +2333,338 @@ void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
 
 	fuse_iomap_cache_invalidate_range(inode, offset, written);
 }
+
+static inline int
+fuse_iomap_upsert_validate_dev(
+	const struct fuse_iomap_dev	*fb,
+	uint16_t			map_type,
+	uint64_t			map_addr,
+	uint64_t			map_length)
+{
+	uint64_t			map_end;
+	sector_t			device_bytes;
+
+	if (!fb) {
+		if (BAD_DATA(map_addr != FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+
+		return 0;
+	}
+
+	if (BAD_DATA(map_addr == FUSE_IOMAP_NULL_ADDR))
+		return -EIO;
+
+	if (BAD_DATA(check_add_overflow(map_addr, map_length, &map_end)))
+		return -EIO;
+
+	device_bytes = bdev_nr_sectors(fb->bdev) << SECTOR_SHIFT;
+	if (BAD_DATA(map_end > device_bytes))
+		return -EIO;
+
+	return 0;
+}
+
+/* Check the incoming mappings to make sure they're not nonsense */
+static inline int
+fuse_iomap_upsert_validate(struct fuse_conn *fc,
+			   const struct fuse_iomap_upsert_out *outarg)
+{
+	uint64_t n;
+	int ret;
+
+	/* No garbage mapping types or flags */
+	if (BAD_DATA(!fuse_iomap_check_type(outarg->write_type)))
+		return -EIO;
+	if (BAD_DATA(!fuse_iomap_check_flags(outarg->write_flags)))
+		return -EIO;
+
+	if (BAD_DATA(!fuse_iomap_check_type(outarg->read_type)))
+		return -EIO;
+	if (BAD_DATA(!fuse_iomap_check_flags(outarg->read_flags)))
+		return -EIO;
+
+	/* No zero-length mappings; we'll check offset/maxbytes later */
+	if (BAD_DATA(outarg->read_length == 0))
+		return -EIO;
+	if (BAD_DATA(outarg->write_length == 0))
+		return -EIO;
+
+	/* No overflows in the file range */
+	if (BAD_DATA(check_add_overflow(outarg->read_offset,
+					outarg->read_length, &n)))
+		return -EIO;
+	if (BAD_DATA(check_add_overflow(outarg->write_offset,
+					outarg->write_length, &n)))
+		return -EIO;
+
+	switch (outarg->read_type) {
+	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+		/* "Pure overwrite" only allowed for write mapping */
+		BAD_DATA(outarg->read_type == FUSE_IOMAP_TYPE_PURE_OVERWRITE);
+		return -EIO;
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		/* Mappings backed by space must have a device/addr */
+		if (BAD_DATA(outarg->read_dev == FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(outarg->read_addr == FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	case FUSE_IOMAP_TYPE_DELALLOC:
+	case FUSE_IOMAP_TYPE_HOLE:
+	case FUSE_IOMAP_TYPE_INLINE:
+		/* Mappings not backed by space cannot have a device addr. */
+		if (BAD_DATA(outarg->read_dev != FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(outarg->read_addr != FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	case FUSE_IOMAP_TYPE_NULL:
+		/* We're ignoring this mapping */
+		break;
+	default:
+		/* should have been caught already */
+		return -EIO;
+	}
+
+	switch (outarg->write_type) {
+	case FUSE_IOMAP_TYPE_MAPPED:
+	case FUSE_IOMAP_TYPE_UNWRITTEN:
+		/* Mappings backed by space must have a device/addr */
+		if (BAD_DATA(outarg->write_dev == FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(outarg->write_addr == FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+	case FUSE_IOMAP_TYPE_DELALLOC:
+	case FUSE_IOMAP_TYPE_HOLE:
+	case FUSE_IOMAP_TYPE_INLINE:
+		/* Mappings not backed by space cannot have a device addr. */
+		if (BAD_DATA(outarg->write_dev != FUSE_IOMAP_DEV_NULL))
+			return -EIO;
+		if (BAD_DATA(outarg->write_addr != FUSE_IOMAP_NULL_ADDR))
+			return -EIO;
+		break;
+	case FUSE_IOMAP_TYPE_NULL:
+		/* We're ignoring this mapping */
+		break;
+	default:
+		/* should have been caught already */
+		return -EIO;
+	}
+
+	if (outarg->read_type != FUSE_IOMAP_TYPE_NULL) {
+		struct fuse_iomap_dev *fb = fuse_iomap_find_dev(fc,
+							outarg->read_type,
+							outarg->read_dev);
+
+		if (IS_ERR(fb))
+			return PTR_ERR(fb);
+
+		ret = fuse_iomap_upsert_validate_dev(fb, outarg->read_type,
+						     outarg->read_addr,
+						     outarg->read_length);
+		fuse_iomap_dev_put(fb);
+		if (ret)
+			return ret;
+	}
+
+	if (outarg->write_type != FUSE_IOMAP_TYPE_NULL) {
+		struct fuse_iomap_dev *fb = fuse_iomap_find_dev(fc,
+							outarg->write_type,
+							outarg->write_dev);
+
+		if (IS_ERR(fb))
+			return PTR_ERR(fb);
+
+		ret = fuse_iomap_upsert_validate_dev(fb, outarg->write_type,
+						     outarg->write_addr,
+						     outarg->write_length);
+		fuse_iomap_dev_put(fb);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static inline int
+fuse_iomap_upsert_validate_range(const struct inode *inode,
+				 const struct fuse_iomap *map)
+{
+	const unsigned int blocksize = i_blocksize(inode);
+
+	/* Mapping can't start beyond maxbytes */
+	if (BAD_DATA(map->offset >= inode->i_sb->s_maxbytes))
+		return -EIO;
+
+	/* File range must be aligned to blocksize */
+	if (BAD_DATA(!IS_ALIGNED(map->offset, blocksize)))
+		return -EIO;
+	if (BAD_DATA(!IS_ALIGNED(map->length, blocksize)))
+		return -EIO;
+
+	return 0;
+}
+
+int fuse_iomap_upsert(struct fuse_conn *fc,
+		      const struct fuse_iomap_upsert_out *outarg)
+{
+	struct inode *inode;
+	struct fuse_inode *fi;
+	struct fuse_iomap read_map = {
+		.offset		= outarg->read_offset,
+		.length		= outarg->read_length,
+		.addr		= outarg->read_addr,
+		.type		= outarg->read_type,
+		.flags		= outarg->read_flags,
+		.dev		= outarg->read_dev,
+	};
+	struct fuse_iomap write_map = {
+		.offset		= outarg->write_offset,
+		.length		= outarg->write_length,
+		.addr		= outarg->write_addr,
+		.type		= outarg->write_type,
+		.flags		= outarg->write_flags,
+		.dev		= outarg->write_dev,
+	};
+	int ret;
+
+	if (!fc->iomap)
+		return -EINVAL;
+
+	ret = fuse_iomap_upsert_validate(fc, outarg);
+	if (ret)
+		return ret;
+
+	down_read(&fc->killsb);
+	inode = fuse_ilookup(fc, outarg->nodeid, NULL);
+	if (!inode) {
+		ret = -ESTALE;
+		goto out_sb;
+	}
+
+	trace_fuse_iomap_upsert(inode, outarg);
+
+	fi = get_fuse_inode(inode);
+	if (fi->orig_ino != outarg->attr_ino) {
+		ret = -EINVAL;
+		goto out_inode;
+	}
+
+	if (fuse_is_bad(inode)) {
+		ret = -EIO;
+		goto out_inode;
+	}
+
+	if (read_map.type != FUSE_IOMAP_TYPE_NULL) {
+		ret = fuse_iomap_upsert_validate_range(inode, &read_map);
+		if (ret)
+			goto out_inode;
+	}
+
+	if (write_map.type != FUSE_IOMAP_TYPE_NULL) {
+		ret = fuse_iomap_upsert_validate_range(inode, &write_map);
+		if (ret)
+			goto out_inode;
+	}
+
+	fuse_iomap_cache_lock(inode, FUSE_IOMAP_LOCK_EXCL);
+
+	if (!test_and_set_bit(FUSE_I_IOMAP_CACHE, &fi->state))
+		trace_fuse_iomap_cache_enable(inode);
+
+	if (read_map.type != FUSE_IOMAP_TYPE_NULL) {
+		ret = fuse_iomap_cache_upsert(inode, FUSE_IOMAP_READ_FORK,
+					      &read_map);
+		if (ret)
+			goto out_unlock;
+	}
+
+	if (write_map.type != FUSE_IOMAP_TYPE_NULL) {
+		ret = fuse_iomap_cache_upsert(inode, FUSE_IOMAP_WRITE_FORK,
+					      &write_map);
+		if (ret)
+			goto out_unlock;
+	}
+
+out_unlock:
+	fuse_iomap_cache_unlock(inode, FUSE_IOMAP_LOCK_EXCL);
+out_inode:
+	iput(inode);
+out_sb:
+	up_read(&fc->killsb);
+	return ret;
+}
+
+static inline int fuse_iomap_inval_validate(const struct inode *inode,
+					    uint64_t offset, uint64_t length)
+{
+	const unsigned int blocksize = i_blocksize(inode);
+
+	/* Range can't start beyond maxbytes */
+	if (BAD_DATA(offset >= inode->i_sb->s_maxbytes))
+		return -EIO;
+
+	/* File range must be aligned to blocksize */
+	if (BAD_DATA(!IS_ALIGNED(offset, blocksize)))
+		return -EIO;
+	if (length != FUSE_IOMAP_INVAL_TO_EOF &&
+	    BAD_DATA(!IS_ALIGNED(length, blocksize)))
+		return -EIO;
+
+	return 0;
+}
+
+int fuse_iomap_inval(struct fuse_conn *fc,
+		     const struct fuse_iomap_inval_out *outarg)
+{
+	struct inode *inode;
+	uint64_t read_length = outarg->read_length;
+	uint64_t write_length = outarg->write_length;
+	int ret = 0, ret2 = 0;
+
+	if (!fc->iomap)
+		return -EINVAL;
+
+	down_read(&fc->killsb);
+	inode = fuse_ilookup(fc, outarg->nodeid, NULL);
+	if (!inode) {
+		ret = -ESTALE;
+		goto out_sb;
+	}
+
+	trace_fuse_iomap_inval(inode, outarg);
+
+	if (fuse_is_bad(inode)) {
+		ret = -EIO;
+		goto out_inode;
+	}
+
+	if (write_length)
+		ret = fuse_iomap_inval_validate(inode, outarg->write_offset,
+						write_length);
+	if (read_length)
+		ret2 = fuse_iomap_inval_validate(inode, outarg->read_offset,
+						 read_length);
+	if (ret || ret2)
+		goto out_inode;
+
+	fuse_iomap_cache_lock(inode, FUSE_IOMAP_LOCK_EXCL);
+	if (read_length)
+		ret2 = fuse_iomap_cache_remove(inode, FUSE_IOMAP_READ_FORK,
+					       outarg->read_offset,
+					       read_length);
+	if (write_length)
+		ret = fuse_iomap_cache_remove(inode, FUSE_IOMAP_WRITE_FORK,
+					      outarg->write_offset,
+					      write_length);
+	fuse_iomap_cache_unlock(inode, FUSE_IOMAP_LOCK_EXCL);
+
+out_inode:
+	iput(inode);
+out_sb:
+	up_read(&fc->killsb);
+	return ret ? ret : ret2;
+}


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 1/7] fuse: force a ctime update after a fileattr_set call when in iomap mode
  2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-07-17 23:32   ` Darrick J. Wong
  2025-07-17 23:33   ` [PATCH 2/7] fuse: synchronize inode->i_flags after fileattr_[gs]et Darrick J. Wong
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:32 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the kernel is in charge of driving ctime updates to
the fuse server and ignores updates coming from the fuse server.
Therefore, when someone calls fileattr_set to change file attributes, we
must force a ctime update.

Found by generic/277.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
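Note for readers: the change below boils down to "snapshot the old
attributes, apply the new ones, and bump ctime only if something
actually changed and the kernel owns ctime".  Here is a small userspace
model of that pattern.  struct demo_fileattr and struct demo_inode are
hypothetical, padding-free stand-ins rather than the kernel's struct
fileattr and struct inode, and a counter stands in for
fuse_update_ctime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct fileattr; four u32s, no padding. */
struct demo_fileattr {
	uint32_t flags;
	uint32_t fsx_xflags;
	uint32_t fsx_extsize;
	uint32_t fsx_projid;
};

/* Hypothetical stand-in for the inode state we care about. */
struct demo_inode {
	struct demo_fileattr attr;
	uint64_t ctime_bumps;	/* counts forced ctime updates */
};

/*
 * Apply new attributes.  When the kernel is in charge of ctime, compare
 * the pre-call snapshot with the requested attributes and force a ctime
 * update only if they differ.
 */
static void demo_fileattr_set(struct demo_inode *ino,
			      const struct demo_fileattr *fa,
			      bool kernel_owns_ctime)
{
	struct demo_fileattr old;

	memset(&old, 0, sizeof(old));
	if (kernel_owns_ctime)
		old = ino->attr;	/* snapshot before the change */

	ino->attr = *fa;		/* the "server" applies the change */

	/* byte-wise compare is only safe because the struct has no padding */
	if (kernel_owns_ctime && memcmp(&old, fa, sizeof(old)) != 0)
		ino->ctime_bumps++;
}
```

Setting the same attributes twice bumps the counter once, and a change
made while the kernel does not own ctime never bumps it.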
 fs/fuse/ioctl.c |   11 +++++++++++
 1 file changed, 11 insertions(+)


diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index 2d9abf48828f94..5be73609dfe979 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -546,8 +546,13 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
 	struct fuse_file *ff;
 	unsigned int flags = fa->flags;
 	struct fsxattr xfa;
+	struct fileattr old_ma = { };
+	bool is_wb = (fuse_get_cache_mask(inode) & STATX_CTIME);
 	int err;
 
+	if (is_wb)
+		vfs_fileattr_get(dentry, &old_ma);
+
 	ff = fuse_priv_ioctl_prepare(inode);
 	if (IS_ERR(ff))
 		return PTR_ERR(ff);
@@ -571,6 +576,12 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
 
 cleanup:
 	fuse_priv_ioctl_cleanup(inode, ff);
+	/*
+	 * If we cache ctime updates and the fileattr changed, then force a
+	 * ctime update.
+	 */
+	if (is_wb && memcmp(&old_ma, fa, sizeof(old_ma)))
+		fuse_update_ctime(inode);
 
 	return err;
 }


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 2/7] fuse: synchronize inode->i_flags after fileattr_[gs]et
  2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2025-07-17 23:32   ` [PATCH 1/7] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
@ 2025-07-17 23:33   ` Darrick J. Wong
  2025-07-17 23:33   ` [PATCH 3/7] fuse: cache atime when in iomap mode Darrick J. Wong
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:33 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

There are three inode flags (immutable, append, sync) that are enforced
by the VFS.  Whenever we go around setting iflags, let's update the VFS
state so that they actually work.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
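Note for readers: a userspace sketch of how the three VFS-enforced
flags might be mirrored from fileattr flags into inode flags.  The
DEMO_* bit values are illustrative stand-ins for the FS_*_FL and S_*
constants, and the clearing branch of demo_update_iflag is an
assumption about the update_iflag helper added in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the fileattr (FS_*_FL) flag bits. */
#define DEMO_FL_SYNC		0x08u
#define DEMO_FL_IMMUTABLE	0x10u
#define DEMO_FL_APPEND		0x20u

/* Illustrative stand-ins for the in-core inode (S_*) flag bits. */
#define DEMO_S_SYNC		(1u << 0)
#define DEMO_S_APPEND		(1u << 2)
#define DEMO_S_IMMUTABLE	(1u << 3)
#define DEMO_S_OTHER		(1u << 5)	/* some unrelated inode flag */

/* Set or clear one inode flag and return the new flag word. */
static inline uint32_t demo_update_iflag(uint32_t i_flags, uint32_t iflag,
					 bool set)
{
	return set ? (i_flags | iflag) : (i_flags & ~iflag);
}

/*
 * Mirror the immutable, append, and sync fileattr flags into the inode
 * flag word, leaving every other inode flag untouched.
 */
static uint32_t demo_sync_iflags(uint32_t i_flags, uint32_t fa_flags)
{
	i_flags = demo_update_iflag(i_flags, DEMO_S_SYNC,
				    fa_flags & DEMO_FL_SYNC);
	i_flags = demo_update_iflag(i_flags, DEMO_S_IMMUTABLE,
				    fa_flags & DEMO_FL_IMMUTABLE);
	i_flags = demo_update_iflag(i_flags, DEMO_S_APPEND,
				    fa_flags & DEMO_FL_APPEND);
	return i_flags;
}
```

Updating each bit separately, rather than overwriting the whole word,
is what keeps unrelated inode flags intact across a fileattr change.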
 fs/fuse/fuse_i.h     |    1 +
 fs/fuse/fuse_trace.h |   31 +++++++++++++++++
 fs/fuse/dir.c        |    1 +
 fs/fuse/inode.c      |    1 +
 fs/fuse/ioctl.c      |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 123 insertions(+)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e7da75d8a5741d..3058d02cd65cc7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1579,6 +1579,7 @@ long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
 int fuse_fileattr_get(struct dentry *dentry, struct fileattr *fa);
 int fuse_fileattr_set(struct mnt_idmap *idmap,
 		      struct dentry *dentry, struct fileattr *fa);
+void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr);
 
 /* iomode.c */
 int fuse_file_cached_io_open(struct inode *inode, struct fuse_file *ff);
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index cc22635790b68c..e5a41be1bfd6cf 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -128,6 +128,37 @@ TRACE_EVENT(fuse_request_end,
 		  __entry->unique, __entry->len, __entry->error)
 );
 
+TRACE_EVENT(fuse_fileattr_update_inode,
+	TP_PROTO(const struct inode *inode, unsigned int old_iflags),
+
+	TP_ARGS(inode, old_iflags),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		connection)
+		__field(uint64_t,	ino)
+		__field(uint64_t,	nodeid)
+		__field(loff_t,		isize)
+		__field(unsigned int,	old_iflags)
+		__field(unsigned int,	new_iflags)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->old_iflags	=	old_iflags;
+		__entry->new_iflags	=	inode->i_flags;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu isize 0x%llx old_iflags 0x%x iflags 0x%x",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->isize, __entry->old_iflags, __entry->new_iflags)
+);
+
 #if IS_ENABLED(CONFIG_FUSE_IOMAP)
 struct fuse_iext_cursor;
 
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 1e9d5bf1811c6a..56ef73dd58e3b6 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1213,6 +1213,7 @@ static void fuse_fillattr(struct mnt_idmap *idmap, struct inode *inode,
 		blkbits = inode->i_sb->s_blocksize_bits;
 
 	stat->blksize = 1 << blkbits;
+	generic_fill_statx_attr(inode, stat);
 }
 
 static void fuse_statx_to_attr(struct fuse_statx *sx, struct fuse_attr *attr)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d67cc635612cff..84f68dc37db64f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -521,6 +521,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		fuse_init_inode(inode, attr, fc);
+		fuse_fileattr_init(inode, attr);
 		unlock_new_inode(inode);
 	} else if (fuse_stale_inode(inode, generation, attr)) {
 		/* nodeid was reused, any I/O on the old inode should fail */
diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index 5be73609dfe979..2c5002fc3ee9e0 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -4,6 +4,7 @@
  */
 
 #include "fuse_i.h"
+#include "fuse_trace.h"
 
 #include <linux/uio.h>
 #include <linux/compat.h>
@@ -502,6 +503,91 @@ static void fuse_priv_ioctl_cleanup(struct inode *inode, struct fuse_file *ff)
 	fuse_file_release(inode, ff, O_RDONLY, NULL, S_ISDIR(inode->i_mode));
 }
 
+static inline void update_iflag(struct inode *inode, unsigned int iflag,
+				bool set)
+{
+	if (set)
+		inode->i_flags |= iflag;
+	else
+		inode->i_flags &= ~iflag;
+}
+
+static void fuse_fileattr_update_inode(struct inode *inode,
+				       const struct fileattr *fa)
+{
+	unsigned int old_iflags = inode->i_flags;
+
+	/*
+	 * Prior to iomap, the fuse driver sent all file IO operations to the
+	 * fuse server, which was wholly responsible for enforcing the
+	 * immutable and append bits.  With iomap, we let more of the kernel IO
+	 * path stay within the kernel, so we actually have to set the VFS
+	 * flags now so that the enforcement can take place inside the kernel.
+	 */
+	if (!fuse_has_iomap(inode))
+		return;
+
+	/*
+	 * Configure VFS enforcement of the three inode flags that we support.
+	 * XXX: still need to figure out what's going on wrt NOATIME in fuse.
+	 */
+	if (fa->flags_valid) {
+		update_iflag(inode, S_SYNC, fa->flags & FS_SYNC_FL);
+		update_iflag(inode, S_IMMUTABLE, fa->flags & FS_IMMUTABLE_FL);
+		update_iflag(inode, S_APPEND, fa->flags & FS_APPEND_FL);
+	} else if (fa->fsx_valid) {
+		update_iflag(inode, S_SYNC, fa->fsx_xflags & FS_XFLAG_SYNC);
+		update_iflag(inode, S_IMMUTABLE,
+					fa->fsx_xflags & FS_XFLAG_IMMUTABLE);
+		update_iflag(inode, S_APPEND, fa->fsx_xflags & FS_XFLAG_APPEND);
+	}
+
+	trace_fuse_fileattr_update_inode(inode, old_iflags);
+
+	if (old_iflags != inode->i_flags)
+		fuse_invalidate_attr(inode);
+}
+
+void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr)
+{
+	struct fileattr fa;
+	struct fsxattr xfa = { };
+	struct fuse_file *ff;
+	unsigned int flags = 0;
+	int err;
+
+	if (!fuse_has_iomap(inode))
+		return;
+
+	/*
+	 * Don't do this when we're setting up the root inode because the
+	 * connection workers haven't been set up yet.
+	 */
+	if (attr->ino == FUSE_ROOT_ID && attr->blksize == 0)
+		return;
+
+	ff = fuse_priv_ioctl_prepare(inode);
+	if (IS_ERR(ff))
+		return;
+
+	err = fuse_priv_ioctl(inode, ff, FS_IOC_FSGETXATTR, &xfa, sizeof(xfa));
+	if (!err) {
+		fileattr_fill_xflags(&fa, xfa.fsx_xflags);
+		fuse_fileattr_update_inode(inode, &fa);
+		goto cleanup;
+	}
+
+	err = fuse_priv_ioctl(inode, ff, FS_IOC_GETFLAGS, &flags, sizeof(flags));
+	if (!err) {
+		fileattr_fill_flags(&fa, flags);
+		fuse_fileattr_update_inode(inode, &fa);
+		goto cleanup;
+	}
+
+cleanup:
+	fuse_priv_ioctl_cleanup(inode, ff);
+}
+
 int fuse_fileattr_get(struct dentry *dentry, struct fileattr *fa)
 {
 	struct inode *inode = d_inode(dentry);
@@ -572,7 +658,10 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
 
 		err = fuse_priv_ioctl(inode, ff, FS_IOC_FSSETXATTR,
 				      &xfa, sizeof(xfa));
+		if (err)
+			goto cleanup;
 	}
+	fuse_fileattr_update_inode(inode, fa);
 
 cleanup:
 	fuse_priv_ioctl_cleanup(inode, ff);


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 3/7] fuse: cache atime when in iomap mode
  2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2025-07-17 23:32   ` [PATCH 1/7] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
  2025-07-17 23:33   ` [PATCH 2/7] fuse: synchronize inode->i_flags after fileattr_[gs]et Darrick J. Wong
@ 2025-07-17 23:33   ` Darrick J. Wong
  2025-07-17 23:33   ` [PATCH 4/7] fuse: update file mode when updating acls Darrick J. Wong
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:33 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

When we're running in iomap mode, allow the kernel to cache the access
timestamp to further reduce the number of roundtrips to the fuse server.
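
A rough userspace sketch of the cache mask idea, in case it helps: a
timestamp that the kernel caches is not overwritten by whatever the
server reports later.  The DEMO_* names, the struct, and the helper
signatures are all hypothetical and only loosely modelled on the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask bits for this sketch; not the kernel's STATX_* values. */
#define DEMO_CACHE_ATIME	(1u << 0)
#define DEMO_CACHE_MTIME	(1u << 1)

struct demo_times {
	int64_t atime;
	int64_t mtime;
};

/* Decide which timestamps the local side owns for a given file. */
unsigned int demo_cache_mask(bool is_regular, bool iomap, bool writeback_cache)
{
	if (!is_regular)
		return 0;
	if (iomap)
		return DEMO_CACHE_ATIME | DEMO_CACHE_MTIME;
	if (writeback_cache)
		return DEMO_CACHE_MTIME;
	return 0;
}

/* Apply server-reported times, skipping any field that is cached locally. */
void demo_apply_server_times(struct demo_times *cached,
			     const struct demo_times *from_server,
			     unsigned int cache_mask)
{
	if (!(cache_mask & DEMO_CACHE_ATIME))
		cached->atime = from_server->atime;
	if (!(cache_mask & DEMO_CACHE_MTIME))
		cached->mtime = from_server->mtime;
}

/* Usage: in iomap mode a stale server atime does not clobber the cache. */
void demo_cache_mask_example(void)
{
	struct demo_times cached = { .atime = 200, .mtime = 200 };
	struct demo_times server = { .atime = 100, .mtime = 100 };

	demo_apply_server_times(&cached, &server,
				demo_cache_mask(true, true, false));
	assert(cached.atime == 200 && cached.mtime == 200);
}
```

The flip side, as the patch notes, is that a cached field has to be
seeded from the server once at instantiation and flushed back later.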

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/dir.c   |    5 +++++
 fs/fuse/inode.c |   19 ++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)


diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 56ef73dd58e3b6..33a375a21b2da1 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1936,6 +1936,11 @@ int fuse_flush_times(struct inode *inode, struct fuse_file *ff)
 		inarg.ctime = inode_get_ctime_sec(inode);
 		inarg.ctimensec = inode_get_ctime_nsec(inode);
 	}
+	if (fuse_has_iomap(inode)) {
+		inarg.valid |= FATTR_ATIME;
+		inarg.atime = inode_get_atime_sec(inode);
+		inarg.atimensec = inode_get_atime_nsec(inode);
+	}
 	if (ff) {
 		inarg.valid |= FATTR_FH;
 		inarg.fh = ff->fh;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 84f68dc37db64f..19d51a44793e0c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -264,7 +264,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	attr->mtimensec = min_t(u32, attr->mtimensec, NSEC_PER_SEC - 1);
 	attr->ctimensec = min_t(u32, attr->ctimensec, NSEC_PER_SEC - 1);
 
-	inode_set_atime(inode, attr->atime, attr->atimensec);
+	if (!(cache_mask & STATX_ATIME))
+		inode_set_atime(inode, attr->atime, attr->atimensec);
 	/* mtime from server may be stale due to local buffered write */
 	if (!(cache_mask & STATX_MTIME)) {
 		inode_set_mtime(inode, attr->mtime, attr->mtimensec);
@@ -328,8 +329,12 @@ u32 fuse_get_cache_mask(struct inode *inode)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
-	if (S_ISREG(inode->i_mode) &&
-	    (fuse_has_iomap_fileio(inode) || fc->writeback_cache))
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+
+	if (fuse_has_iomap_fileio(inode))
+		return STATX_MTIME | STATX_CTIME | STATX_ATIME | STATX_SIZE;
+	if (fc->writeback_cache)
 		return STATX_MTIME | STATX_CTIME | STATX_SIZE;
 
 	return 0;
@@ -448,6 +453,14 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr,
 				   new_decode_dev(attr->rdev));
 	} else
 		BUG();
+
+	/*
+	 * iomap caches atime too, so we must load it from the fuse server
+	 * at instantiation time.
+	 */
+	if (fuse_has_iomap(inode))
+		inode_set_atime(inode, attr->atime, attr->atimensec);
+
 	/*
 	 * Ensure that we don't cache acls for daemons without FUSE_POSIX_ACL
 	 * so they see the exact same behavior as before.



* [PATCH 4/7] fuse: update file mode when updating acls
  2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:33   ` [PATCH 3/7] fuse: cache atime when in iomap mode Darrick J. Wong
@ 2025-07-17 23:33   ` Darrick J. Wong
  2025-07-17 23:33   ` [PATCH 5/7] fuse: propagate default and file acls on creation Darrick J. Wong
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:33 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

If someone sets ACLs on a file that can be expressed fully as Unix DAC
mode bits, most filesystems will then update the mode bits and drop the
ACL xattr to reduce inefficiency in the file access paths.  Let's do
that too.
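
To illustrate what "can be expressed fully as mode bits" means, here is
a simplified userspace sketch.  The demo_* types and function are
hypothetical, and the sketch deliberately rejects mask entries outright,
which is stricter than what real implementations do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ACL entry kinds for this sketch. */
enum demo_tag {
	DEMO_USER_OBJ,		/* file owner */
	DEMO_USER,		/* a named user */
	DEMO_GROUP_OBJ,		/* owning group */
	DEMO_GROUP,		/* a named group */
	DEMO_MASK,		/* upper bound for the group class */
	DEMO_OTHER,		/* everyone else */
};

struct demo_ace {
	enum demo_tag tag;
	unsigned int perm;	/* rwx in the low three bits */
};

/*
 * If the ACL has only owner/group/other entries, fold it into the 0777
 * bits of *mode and return true; the caller may then drop the ACL.
 * Otherwise return false and leave *mode untouched.
 */
bool demo_acl_to_mode(const struct demo_ace *acl, size_t count,
		      unsigned int *mode)
{
	unsigned int perms = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		switch (acl[i].tag) {
		case DEMO_USER_OBJ:
			perms |= (acl[i].perm & 7u) << 6;
			break;
		case DEMO_GROUP_OBJ:
			perms |= (acl[i].perm & 7u) << 3;
			break;
		case DEMO_OTHER:
			perms |= acl[i].perm & 7u;
			break;
		default:
			/* Named entries or a mask need a real ACL. */
			return false;
		}
	}

	*mode = (*mode & ~0777u) | perms;
	return true;
}

/* Usage: u=rw,g=r,o= turns a 0644 file into 0640 and needs no ACL. */
void demo_acl_to_mode_example(void)
{
	const struct demo_ace acl[] = {
		{ DEMO_USER_OBJ, 6 }, { DEMO_GROUP_OBJ, 4 }, { DEMO_OTHER, 0 },
	};
	unsigned int mode = 0100644;

	assert(demo_acl_to_mode(acl, 3, &mode));
	assert(mode == 0100640);
}
```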

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/acl.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)


diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 8f484b105f13ab..b892976d9e284c 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -98,6 +98,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 	struct inode *inode = d_inode(dentry);
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	const char *name;
+	umode_t mode = inode->i_mode;
 	int ret;
 
 	if (fuse_is_bad(inode))
@@ -113,6 +114,20 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 	else
 		return -EINVAL;
 
+	/*
+	 * If the ACL can be represented entirely with changes to the mode
+	 * bits, then most filesystems will update the mode bits and delete
+	 * the ACL xattr.  Note that we only started doing this after the main
+	 * ACL implementation was merged, so that's why it's gated on regular
+	 * iomap.  XXX: This should be some sort of separate flag?
+	 */
+	if (acl && type == ACL_TYPE_ACCESS &&
+	    fuse_has_iomap(inode) && fc->posix_acl) {
+		ret = posix_acl_update_mode(idmap, inode, &mode, &acl);
+		if (ret)
+			return ret;
+	}
+
 	if (acl) {
 		unsigned int extra_flags = 0;
 		/*
@@ -143,7 +158,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 		 * through POSIX ACLs. Such daemons don't expect setgid bits to
 		 * be stripped.
 		 */
-		if (fc->posix_acl &&
+		if (fc->posix_acl && mode == inode->i_mode &&
 		    !in_group_or_capable(idmap, inode,
 					 i_gid_into_vfsgid(idmap, inode)))
 			extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID;
@@ -152,6 +167,19 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 		kfree(value);
 	} else {
 		ret = fuse_removexattr(inode, name);
+		/* If the acl didn't exist to start with that's fine. */
+		if (ret == -ENODATA)
+			ret = 0;
+	}
+
+	/* If we scheduled a mode update above, push that to userspace now. */
+	if (!ret && mode != inode->i_mode) {
+		struct iattr attr = {
+			.ia_valid = ATTR_MODE,
+			.ia_mode = mode,
+		};
+
+		ret = fuse_do_setattr(idmap, dentry, &attr, NULL);
 	}
 
 	if (fc->posix_acl) {



* [PATCH 5/7] fuse: propagate default and file acls on creation
  2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-17 23:33   ` [PATCH 4/7] fuse: update file mode when updating acls Darrick J. Wong
@ 2025-07-17 23:33   ` Darrick J. Wong
  2025-07-17 23:34   ` [PATCH 6/7] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
  2025-07-17 23:34   ` [PATCH 7/7] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:33 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Propagate the default and file access ACLs to new children when creating
them, just like the other kernel filesystems.
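
As a rough illustration of why the umask handling has to move, here is
a simplified userspace sketch of how the creation mode is usually
derived.  The demo_* names are hypothetical, and the sketch ignores
mask entries and named users/groups entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified default ACL: three rwx triplets only. */
struct demo_default_acl {
	unsigned int owner;	/* rwx for the file owner */
	unsigned int group;	/* rwx for the owning group */
	unsigned int other;	/* rwx for everyone else */
};

/*
 * Compute the permission bits of a new file.  Without a default ACL on
 * the parent directory the umask applies.  With one, the umask is
 * ignored and the requested mode is limited by the inherited entries.
 */
unsigned int demo_create_mode(unsigned int mode, unsigned int umask_bits,
			      const struct demo_default_acl *dacl)
{
	unsigned int perms;

	if (dacl == NULL)
		return mode & ~umask_bits;

	perms = ((dacl->owner & 7u) << 6) |
		((dacl->group & 7u) << 3) |
		(dacl->other & 7u);

	/* Keep the non-permission bits, intersect the permission bits. */
	return (mode & ~0777u) | (mode & perms);
}

/* Usage: the same request yields different modes with and without an ACL. */
void demo_create_mode_example(void)
{
	const struct demo_default_acl dacl = { 7, 7, 0 };

	assert(demo_create_mode(0666, 022, NULL) == 0644);
	assert(demo_create_mode(0666, 022, &dacl) == 0660);
}
```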

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_i.h |    4 ++
 fs/fuse/acl.c    |   65 ++++++++++++++++++++++++++++++++++++++
 fs/fuse/dir.c    |   92 +++++++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 138 insertions(+), 23 deletions(-)


diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3058d02cd65cc7..a8caee5e896871 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1539,6 +1539,10 @@ struct posix_acl *fuse_get_acl(struct mnt_idmap *idmap,
 			       struct dentry *dentry, int type);
 int fuse_set_acl(struct mnt_idmap *, struct dentry *dentry,
 		 struct posix_acl *acl, int type);
+int fuse_acl_create(struct inode *dir, umode_t *mode,
+		    struct posix_acl **default_acl, struct posix_acl **acl);
+int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl,
+		   const struct posix_acl *acl);
 
 /* readdir.c */
 int fuse_readdir(struct file *file, struct dir_context *ctx);
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index b892976d9e284c..26776e7a0b88fa 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -193,3 +193,68 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 
 	return ret;
 }
+
+int fuse_acl_create(struct inode *dir, umode_t *mode,
+		    struct posix_acl **default_acl, struct posix_acl **acl)
+{
+	struct fuse_conn *fc = get_fuse_conn(dir);
+
+	if (fuse_is_bad(dir))
+		return -EIO;
+
+	if (fuse_has_iomap(dir) && IS_POSIXACL(dir))
+		return posix_acl_create(dir, mode, default_acl, acl);
+
+	if (!fc->dont_mask)
+		*mode &= ~current_umask();
+
+	*default_acl = NULL;
+	*acl = NULL;
+	return 0;
+}
+
+static int __fuse_set_acl(struct inode *inode, const char *name,
+			  const struct posix_acl *acl)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	size_t size = posix_acl_xattr_size(acl->a_count);
+	void *value;
+	int ret;
+
+	if (size > PAGE_SIZE)
+		return -E2BIG;
+
+	value = kmalloc(size, GFP_KERNEL);
+	if (!value)
+		return -ENOMEM;
+
+	ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
+	if (ret < 0)
+		goto out_value;
+
+	ret = fuse_setxattr(inode, name, value, size, 0, 0);
+out_value:
+	kfree(value);
+	return ret;
+}
+
+int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl,
+		   const struct posix_acl *acl)
+{
+	int ret;
+
+	if (default_acl) {
+		ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_DEFAULT,
+				     default_acl);
+		if (ret)
+			return ret;
+	}
+
+	if (acl) {
+		ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_ACCESS, acl);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 33a375a21b2da1..4cdd3ef0793379 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -635,26 +635,28 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
 	struct fuse_entry_out outentry;
 	struct fuse_inode *fi;
 	struct fuse_file *ff;
+	struct posix_acl *default_acl = NULL, *acl = NULL;
 	int epoch, err;
 	bool trunc = flags & O_TRUNC;
 
 	/* Userspace expects S_IFREG in create mode */
 	BUG_ON((mode & S_IFMT) != S_IFREG);
 
+	err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+	if (err)
+		return err;
+
 	epoch = atomic_read(&fm->fc->epoch);
 	forget = fuse_alloc_forget();
 	err = -ENOMEM;
 	if (!forget)
-		goto out_err;
+		goto out_acl_release;
 
 	err = -ENOMEM;
 	ff = fuse_file_alloc(fm, true);
 	if (!ff)
 		goto out_put_forget_req;
 
-	if (!fm->fc->dont_mask)
-		mode &= ~current_umask();
-
 	flags &= ~O_NOCTTY;
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outentry, 0, sizeof(outentry));
@@ -706,12 +708,16 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
 		fuse_sync_release(NULL, ff, flags);
 		fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
 		err = -ENOMEM;
-		goto out_err;
+		goto out_acl_release;
 	}
 	kfree(forget);
 	d_instantiate(entry, inode);
 	entry->d_time = epoch;
 	fuse_change_entry_timeout(entry, &outentry);
+
+	err = fuse_init_acls(inode, default_acl, acl);
+	if (err)
+		goto out_acl_release;
 	fuse_dir_changed(dir);
 
 	if (fuse_has_iomap(inode))
@@ -737,7 +743,9 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
 	fuse_file_free(ff);
 out_put_forget_req:
 	kfree(forget);
-out_err:
+out_acl_release:
+	posix_acl_release(default_acl);
+	posix_acl_release(acl);
 	return err;
 }
 
@@ -796,7 +804,9 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
  */
 static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_mount *fm,
 				       struct fuse_args *args, struct inode *dir,
-				       struct dentry *entry, umode_t mode)
+				       struct dentry *entry, umode_t mode,
+				       struct posix_acl *default_acl,
+				       struct posix_acl *acl)
 {
 	struct fuse_entry_out outarg;
 	struct inode *inode;
@@ -804,14 +814,18 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
 	struct fuse_forget_link *forget;
 	int epoch, err;
 
-	if (fuse_is_bad(dir))
-		return ERR_PTR(-EIO);
+	if (fuse_is_bad(dir)) {
+		err = -EIO;
+		goto out_acl_release;
+	}
 
 	epoch = atomic_read(&fm->fc->epoch);
 
 	forget = fuse_alloc_forget();
-	if (!forget)
-		return ERR_PTR(-ENOMEM);
+	if (!forget) {
+		err = -ENOMEM;
+		goto out_acl_release;
+	}
 
 	memset(&outarg, 0, sizeof(outarg));
 	args->nodeid = get_node_id(dir);
@@ -841,7 +855,8 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
 			  &outarg.attr, ATTR_TIMEOUT(&outarg), 0, 0);
 	if (!inode) {
 		fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1);
-		return ERR_PTR(-ENOMEM);
+		err = -ENOMEM;
+		goto out_acl_release;
 	}
 	kfree(forget);
 
@@ -857,19 +872,31 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
 		entry->d_time = epoch;
 		fuse_change_entry_timeout(entry, &outarg);
 	}
+
+	err = fuse_init_acls(inode, default_acl, acl);
+	if (err)
+		goto out_acl_release;
 	fuse_dir_changed(dir);
+
+	posix_acl_release(default_acl);
+	posix_acl_release(acl);
 	return d;
 
  out_put_forget_req:
 	if (err == -EEXIST)
 		fuse_invalidate_entry(entry);
 	kfree(forget);
+ out_acl_release:
+	posix_acl_release(default_acl);
+	posix_acl_release(acl);
 	return ERR_PTR(err);
 }
 
 static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm,
 			     struct fuse_args *args, struct inode *dir,
-			     struct dentry *entry, umode_t mode)
+			     struct dentry *entry, umode_t mode,
+			     struct posix_acl *default_acl,
+			     struct posix_acl *acl)
 {
 	/*
 	 * Note that when creating anything other than a directory we
@@ -880,7 +907,8 @@ static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm,
 	 */
 	WARN_ON_ONCE(S_ISDIR(mode));
 
-	return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode));
+	return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode,
+					default_acl, acl));
 }
 
 static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
@@ -888,10 +916,13 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
 {
 	struct fuse_mknod_in inarg;
 	struct fuse_mount *fm = get_fuse_mount(dir);
+	struct posix_acl *default_acl, *acl;
 	FUSE_ARGS(args);
+	int err;
 
-	if (!fm->fc->dont_mask)
-		mode &= ~current_umask();
+	err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+	if (err)
+		return err;
 
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.mode = mode;
@@ -903,7 +934,8 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	args.in_args[0].value = &inarg;
 	args.in_args[1].size = entry->d_name.len + 1;
 	args.in_args[1].value = entry->d_name.name;
-	return create_new_nondir(idmap, fm, &args, dir, entry, mode);
+	return create_new_nondir(idmap, fm, &args, dir, entry, mode,
+				 default_acl, acl);
 }
 
 static int fuse_create(struct mnt_idmap *idmap, struct inode *dir,
@@ -935,13 +967,17 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 {
 	struct fuse_mkdir_in inarg;
 	struct fuse_mount *fm = get_fuse_mount(dir);
+	struct posix_acl *default_acl, *acl;
 	FUSE_ARGS(args);
+	int err;
 
-	if (!fm->fc->dont_mask)
-		mode &= ~current_umask();
+	mode |= S_IFDIR;	/* vfs doesn't set S_IFDIR for us */
+	err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+	if (err)
+		return ERR_PTR(err);
 
 	memset(&inarg, 0, sizeof(inarg));
-	inarg.mode = mode;
+	inarg.mode = mode & ~S_IFDIR;
 	inarg.umask = current_umask();
 	args.opcode = FUSE_MKDIR;
 	args.in_numargs = 2;
@@ -949,7 +985,8 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 	args.in_args[0].value = &inarg;
 	args.in_args[1].size = entry->d_name.len + 1;
 	args.in_args[1].value = entry->d_name.name;
-	return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR);
+	return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR,
+				default_acl, acl);
 }
 
 static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
@@ -957,7 +994,14 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
 {
 	struct fuse_mount *fm = get_fuse_mount(dir);
 	unsigned len = strlen(link) + 1;
+	struct posix_acl *default_acl, *acl;
+	umode_t mode = S_IFLNK | 0777;
 	FUSE_ARGS(args);
+	int err;
+
+	err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+	if (err)
+		return err;
 
 	args.opcode = FUSE_SYMLINK;
 	args.in_numargs = 3;
@@ -966,7 +1010,8 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
 	args.in_args[1].value = entry->d_name.name;
 	args.in_args[2].size = len;
 	args.in_args[2].value = link;
-	return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK);
+	return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK,
+				 default_acl, acl);
 }
 
 void fuse_flush_time_update(struct inode *inode)
@@ -1166,7 +1211,8 @@ static int fuse_link(struct dentry *entry, struct inode *newdir,
 	args.in_args[0].value = &inarg;
 	args.in_args[1].size = newent->d_name.len + 1;
 	args.in_args[1].value = newent->d_name.name;
-	err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, inode->i_mode);
+	err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent,
+				inode->i_mode, NULL, NULL);
 	if (!err)
 		fuse_update_ctime_in_cache(inode);
 	else if (err == -EINTR)



* [PATCH 6/7] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
  2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-17 23:33   ` [PATCH 5/7] fuse: propagate default and file acls on creation Darrick J. Wong
@ 2025-07-17 23:34   ` Darrick J. Wong
  2025-07-17 23:34   ` [PATCH 7/7] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:34 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

Let the kernel handle killing the suid/sgid bits because the
write/falloc/truncate/chown code already does this, and we don't have to
worry about external modifications that are only visible to the fuse
server (i.e. we're not a cluster fs).
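
For context, a minimal userspace sketch of the classic rule for which
bits get killed when file contents or ownership change.  The function
name is hypothetical, and the sketch ignores the caller's privileges and
group membership, which a real implementation also takes into account.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Return the mode with setuid/setgid dropped the way a write, truncate,
 * or chown traditionally would for an unprivileged caller:
 *  - setuid is always cleared;
 *  - setgid is cleared only when group-execute is also set.
 */
mode_t demo_kill_suid_sgid(mode_t mode)
{
	mode &= ~(mode_t)S_ISUID;

	if ((mode & S_ISGID) && (mode & S_IXGRP))
		mode &= ~(mode_t)S_ISGID;

	return mode;
}

/* Usage: setuid goes away; setgid without group-exec is left alone. */
void demo_kill_suid_sgid_example(void)
{
	assert(demo_kill_suid_sgid(04755) == 0755);
	assert(demo_kill_suid_sgid(02755) == 0755);
	assert(demo_kill_suid_sgid(02745) == 02745);
}
```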

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/fuse_trace.h |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dir.c        |   15 ++++++++--
 2 files changed, 84 insertions(+), 3 deletions(-)


diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index e5a41be1bfd6cf..c6b6757bd8bc3c 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -159,6 +159,78 @@ TRACE_EVENT(fuse_fileattr_update_inode,
 		  __entry->isize, __entry->old_iflags, __entry->new_iflags)
 );
 
+TRACE_EVENT(fuse_setattr_fill,
+	TP_PROTO(const struct inode *inode,
+		 const struct fuse_setattr_in *inarg),
+	TP_ARGS(inode, inarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(umode_t,		mode)
+		__field(loff_t,			isize)
+
+		__field(uint32_t,		valid)
+		__field(umode_t,		new_mode)
+		__field(uint64_t,		new_size)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->valid		=	inarg->valid;
+		__entry->new_mode	=	inarg->mode;
+		__entry->new_size	=	inarg->size;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu mode 0%o isize 0x%llx valid 0x%x new_mode 0%o new_size 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->mode, __entry->isize, __entry->valid,
+		  __entry->new_mode, __entry->new_size)
+);
+
+TRACE_EVENT(fuse_setattr,
+	TP_PROTO(const struct inode *inode,
+		 const struct iattr *inarg),
+	TP_ARGS(inode, inarg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			connection)
+		__field(uint64_t,		ino)
+		__field(uint64_t,		nodeid)
+		__field(umode_t,		mode)
+		__field(loff_t,			isize)
+
+		__field(uint32_t,		valid)
+		__field(umode_t,		new_mode)
+		__field(uint64_t,		new_size)
+	),
+
+	TP_fast_assign(
+		const struct fuse_inode *fi = get_fuse_inode_c(inode);
+		const struct fuse_mount *fm = get_fuse_mount_c(inode);
+
+		__entry->connection	=	fm->fc->dev;
+		__entry->ino		=	fi->orig_ino;
+		__entry->nodeid		=	fi->nodeid;
+		__entry->isize		=	i_size_read(inode);
+		__entry->valid		=	inarg->ia_valid;
+		__entry->new_mode	=	inarg->ia_mode;
+		__entry->new_size	=	inarg->ia_size;
+	),
+
+	TP_printk("connection %u ino %llu nodeid %llu mode 0%o isize 0x%llx valid 0x%x new_mode 0%o new_size 0x%llx",
+		  __entry->connection, __entry->ino, __entry->nodeid,
+		  __entry->mode, __entry->isize, __entry->valid,
+		  __entry->new_mode, __entry->new_size)
+);
+
 #if IS_ENABLED(CONFIG_FUSE_IOMAP)
 struct fuse_iext_cursor;
 
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 4cdd3ef0793379..8422310d070665 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -7,6 +7,7 @@
 */
 
 #include "fuse_i.h"
+#include "fuse_trace.h"
 
 #include <linux/pagemap.h>
 #include <linux/file.h>
@@ -1951,6 +1952,8 @@ static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_args *args,
 			      struct fuse_setattr_in *inarg_p,
 			      struct fuse_attr_out *outarg_p)
 {
+	trace_fuse_setattr_fill(inode, inarg_p);
+
 	args->opcode = FUSE_SETATTR;
 	args->nodeid = get_node_id(inode);
 	args->in_numargs = 1;
@@ -2219,15 +2222,21 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
 	if (!fuse_allow_current_process(get_fuse_conn(inode)))
 		return -EACCES;
 
-	if (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID)) {
+	trace_fuse_setattr(inode, attr);
+
+	if (!fuse_has_iomap(inode) &&
+	    (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))) {
 		attr->ia_valid &= ~(ATTR_KILL_SUID | ATTR_KILL_SGID |
 				    ATTR_MODE);
 
 		/*
 		 * The only sane way to reliably kill suid/sgid is to do it in
-		 * the userspace filesystem
+		 * the userspace filesystem if this isn't an iomap file.  For
+		 * iomap filesystems we let the kernel kill the setuid/setgid
+		 * bits.
 		 *
-		 * This should be done on write(), truncate() and chown().
+		 * This should be done on write(), truncate(), chown(), and
+		 * fallocate().
 		 */
 		if (!fc->handle_killpriv && !fc->handle_killpriv_v2) {
 			/*



* [PATCH 7/7] fuse: update ctime when updating acls on an iomap inode
  2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-07-17 23:34   ` [PATCH 6/7] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
@ 2025-07-17 23:34   ` Darrick J. Wong
  6 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:34 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the fuse kernel driver is in charge of updating file
attributes, so we need to update ctime after an ACL change.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/fuse/acl.c |   21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)


diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 26776e7a0b88fa..578b139a1d3380 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -99,6 +99,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	const char *name;
 	umode_t mode = inode->i_mode;
+	bool is_iomap = fuse_has_iomap(inode);
 	int ret;
 
 	if (fuse_is_bad(inode))
@@ -121,8 +122,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 	 * ACL implementation was merged, so that's why it's gated on regular
 	 * iomap.  XXX: This should be some sort of separate flag?
 	 */
-	if (acl && type == ACL_TYPE_ACCESS &&
-	    fuse_has_iomap(inode) && fc->posix_acl) {
+	if (acl && type == ACL_TYPE_ACCESS && is_iomap && fc->posix_acl) {
 		ret = posix_acl_update_mode(idmap, inode, &mode, &acl);
 		if (ret)
 			return ret;
@@ -172,13 +172,22 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 			ret = 0;
 	}
 
-	/* If we scheduled a mode update above, push that to userspace now. */
-	if (!ret && mode != inode->i_mode) {
+	/*
+	 * When we're running in iomap mode, we need to update mode and ctime
+	 * ourselves instead of letting the fuse server figure that out.
+	 */
+	if (!ret && is_iomap) {
 		struct iattr attr = {
-			.ia_valid = ATTR_MODE,
-			.ia_mode = mode,
+			.ia_valid = ATTR_CTIME,
 		};
 
+		inode_set_ctime_current(inode);
+		attr.ia_ctime = inode_get_ctime(inode);
+		if (mode != inode->i_mode) {
+			attr.ia_valid |= ATTR_MODE;
+			attr.ia_mode = mode;
+		}
+
 		ret = fuse_do_setattr(idmap, dentry, &attr, NULL);
 	}
 



* [PATCH 01/14] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-07-17 23:34   ` Darrick J. Wong
  2025-07-17 23:34   ` [PATCH 02/14] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:34 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Add some flags to query and request kernel support for filesystem iomap
for regular files.  Bump the minor API version so that the new iomap
symbols don't go bleeding into old programs.
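
The query-and-request pattern can be sketched in userspace roughly as
follows.  The demo_* struct, helpers, and bit values are hypothetical
stand-ins and are not libfuse's API; the point is only that a feature
which defaults to off needs an explicit request from the server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits for this sketch; not libfuse's values. */
#define DEMO_CAP_URING	(1ULL << 0)	/* on by default when available */
#define DEMO_CAP_IOMAP	(1ULL << 1)	/* opt-in only */

struct demo_conn {
	uint64_t capable;	/* what the kernel side advertised */
	uint64_t want;		/* what the server side asks for */
};

/* Library defaults: turn on the default-on feature, leave iomap alone. */
void demo_set_defaults(struct demo_conn *conn)
{
	if (conn->capable & DEMO_CAP_URING)
		conn->want |= DEMO_CAP_URING;
}

/* Server opt-in: succeeds only if the kernel advertised the feature. */
bool demo_request_cap(struct demo_conn *conn, uint64_t cap)
{
	if ((conn->capable & cap) != cap)
		return false;
	conn->want |= cap;
	return true;
}

/* What ends up enabled on the connection. */
uint64_t demo_negotiated(const struct demo_conn *conn)
{
	return conn->capable & conn->want;
}

/* Usage: iomap stays off until the server explicitly asks for it. */
void demo_negotiation_example(void)
{
	struct demo_conn conn = {
		.capable = DEMO_CAP_URING | DEMO_CAP_IOMAP,
	};

	demo_set_defaults(&conn);
	assert(demo_negotiated(&conn) == DEMO_CAP_URING);

	assert(demo_request_cap(&conn, DEMO_CAP_IOMAP));
	assert(demo_negotiated(&conn) == (DEMO_CAP_URING | DEMO_CAP_IOMAP));
}
```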

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h |    5 +++++
 include/fuse_kernel.h |    9 ++++++++-
 lib/fuse_lowlevel.c   |    9 +++++++++
 lib/meson.build       |    2 +-
 4 files changed, 23 insertions(+), 2 deletions(-)


diff --git a/include/fuse_common.h b/include/fuse_common.h
index b82f2c41deb30c..8f87263d78f999 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -520,6 +520,11 @@ struct fuse_loop_config_v1 {
  */
 #define FUSE_CAP_OVER_IO_URING (1UL << 31)
 
+/**
+ * Client supports using iomap for FIEMAP and SEEK_{DATA,HOLE}
+ */
+#define FUSE_CAP_IOMAP (1ULL << 32)
+
 /**
  * Ioctl flags
  *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 122d6586e8d4da..b1e42d3cf86e81 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -235,6 +235,10 @@
  *
  *  7.44
  *  - add FUSE_NOTIFY_INC_EPOCH
+ *
+ *  7.99
+ *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
+ *    SEEK_{DATA,HOLE} support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -270,7 +274,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 44
+#define FUSE_KERNEL_MINOR_VERSION 99
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -443,6 +447,8 @@ struct fuse_file_lock {
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
  * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
  *			 init_out.request_timeout contains the timeout (in secs)
+ * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
+ *	       operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -490,6 +496,7 @@ struct fuse_file_lock {
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
+#define FUSE_IOMAP		(1ULL << 43)
 
 /**
  * CUSE INIT request/reply flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 91f42440fca4b3..392e898a5e8ec1 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2624,6 +2624,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			se->conn.capable_ext |= FUSE_CAP_NO_EXPORT_SUPPORT;
 		if (inargflags & FUSE_OVER_IO_URING)
 			se->conn.capable_ext |= FUSE_CAP_OVER_IO_URING;
+		if (inargflags & FUSE_IOMAP)
+			se->conn.capable_ext |= FUSE_CAP_IOMAP;
 
 	} else {
 		se->conn.max_readahead = 0;
@@ -2670,6 +2672,9 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		       FUSE_CAP_READDIRPLUS_AUTO);
 	LL_SET_DEFAULT(1, FUSE_CAP_OVER_IO_URING);
 
+	/* servers need to opt-in to iomap explicitly */
+	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP);
+
 	/* This could safely become default, but libfuse needs an API extension
 	 * to support it
 	 * LL_SET_DEFAULT(1, FUSE_CAP_SETXATTR_EXT);
@@ -2788,6 +2793,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		outargflags |= FUSE_REQUEST_TIMEOUT;
 		outarg.request_timeout = se->conn.request_timeout;
 	}
+	if (se->conn.want_ext & FUSE_CAP_IOMAP)
+		outargflags |= FUSE_IOMAP;
 
 	if (inargflags & FUSE_INIT_EXT) {
 		outargflags |= FUSE_INIT_EXT;
@@ -2829,6 +2836,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		if (se->conn.want_ext & FUSE_CAP_PASSTHROUGH)
 			fuse_log(FUSE_LOG_DEBUG, "   max_stack_depth=%u\n",
 				outarg.max_stack_depth);
+		if (se->conn.want_ext & FUSE_CAP_IOMAP)
+			fuse_log(FUSE_LOG_DEBUG, "   iomap=1\n");
 	}
 	if (arg->minor < 5)
 		outargsize = FUSE_COMPAT_INIT_OUT_SIZE;
diff --git a/lib/meson.build b/lib/meson.build
index fcd95741c9d374..2999abe8262afd 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -49,7 +49,7 @@ libfuse = library('fuse3',
                   dependencies: deps,
                   install: true,
                   link_depends: 'fuse_versionscript',
-                  c_args: [ '-DFUSE_USE_VERSION=317',
+                  c_args: [ '-DFUSE_USE_VERSION=318',
                             '-DFUSERMOUNT_DIR="@0@"'.format(fusermount_path) ],
                   link_args: ['-Wl,--version-script,' + meson.current_source_dir()
                               + '/fuse_versionscript' ])


* [PATCH 02/14] libfuse: add fuse commands for iomap_begin and end
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
  2025-07-17 23:34   ` [PATCH 01/14] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version Darrick J. Wong
@ 2025-07-17 23:34   ` Darrick J. Wong
  2025-07-17 23:35   ` [PATCH 03/14] libfuse: add upper level iomap commands Darrick J. Wong
                     ` (11 subsequent siblings)
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:34 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Teach the low level API how to handle iomap begin and end commands that
we get from the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
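fuse_reply_iomap_begin() below sends one offset/length pair, taken from
the read mapping, for both mappings.  A server that fills in a real
write mapping therefore presumably needs both mappings to describe the
same file range.  A minimal sketch of a check a server could run before
replying, assuming that interpretation; the struct and constants are
copied from this patch so that the block compiles without the patched
headers, and toy_check_mappings() is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the fuse_common.h hunk in this patch. */
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF)
#define FUSE_IOMAP_TYPE_MAPPED		2

struct fuse_iomap {
	uint64_t addr;		/* disk offset of mapping, bytes */
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
	uint16_t flags;		/* FUSE_IOMAP_F_* */
	uint32_t dev;		/* device cookie */
};

/*
 * Hypothetical helper: returns 0 if the pair can be sent in a single
 * fuse_iomap_begin_out, or -EINVAL if the write mapping would be
 * reinterpreted with the read mapping's offset and length.
 */
static int toy_check_mappings(const struct fuse_iomap *read_iomap,
			      const struct fuse_iomap *write_iomap)
{
	assert(read_iomap != NULL && write_iomap != NULL);

	/* A pure overwrite reuses the read mapping; nothing to compare. */
	if (write_iomap->type == FUSE_IOMAP_TYPE_PURE_OVERWRITE)
		return 0;

	if (write_iomap->offset != read_iomap->offset ||
	    write_iomap->length != read_iomap->length)
		return -EINVAL;

	return 0;
}
```

A server would call this just before fuse_reply_iomap_begin() and answer
with fuse_reply_err() if the check fails.
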
 include/fuse_common.h   |   52 +++++++++++++++++++++++++++++++++
 include/fuse_kernel.h   |   41 ++++++++++++++++++++++++++
 include/fuse_lowlevel.h |   54 ++++++++++++++++++++++++++++++++++
 lib/fuse_lowlevel.c     |   74 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    2 +
 5 files changed, 223 insertions(+)


diff --git a/include/fuse_common.h b/include/fuse_common.h
index 8f87263d78f999..f48724b0d1ea0f 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1147,6 +1147,58 @@ int fuse_convert_to_conn_want_ext(struct fuse_conn_info *conn);
 
 
 
+/**
+ * iomap operations.
+ * These APIs are introduced in version 318 (FUSE_MAKE_VERSION(3, 18)).
+ * Using them in earlier versions will result in errors.
+ */
+#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
+#define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
+#define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
+#define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
+#define FUSE_IOMAP_TYPE_UNWRITTEN	3	/* blocks allocated at @addr in unwritten state */
+#define FUSE_IOMAP_TYPE_INLINE		4	/* data inline in the inode */
+
+#define FUSE_IOMAP_DEV_NULL		(0U)	/* null device cookie */
+
+#define FUSE_IOMAP_F_NEW		(1U << 0)
+#define FUSE_IOMAP_F_DIRTY		(1U << 1)
+#define FUSE_IOMAP_F_SHARED		(1U << 2)
+#define FUSE_IOMAP_F_MERGED		(1U << 3)
+#define FUSE_IOMAP_F_XATTR		(1U << 5)
+#define FUSE_IOMAP_F_BOUNDARY		(1U << 6)
+#define FUSE_IOMAP_F_ANON_WRITE		(1U << 7)
+#define FUSE_IOMAP_F_ATOMIC_BIO		(1U << 8)
+#define FUSE_IOMAP_F_WANT_IOMAP_END	(1U << 12) /* want ->iomap_end call */
+
+/* only for iomap_end */
+#define FUSE_IOMAP_F_SIZE_CHANGED	(1U << 14)
+#define FUSE_IOMAP_F_STALE		(1U << 15)
+
+#define FUSE_IOMAP_OP_WRITE		(1 << 0) /* writing, must allocate blocks */
+#define FUSE_IOMAP_OP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
+#define FUSE_IOMAP_OP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
+#define FUSE_IOMAP_OP_FAULT		(1 << 3) /* mapping for page fault */
+#define FUSE_IOMAP_OP_DIRECT		(1 << 4) /* direct I/O */
+#define FUSE_IOMAP_OP_NOWAIT		(1 << 5) /* do not block */
+#define FUSE_IOMAP_OP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
+#define FUSE_IOMAP_OP_UNSHARE		(1 << 7) /* unshare_file_range */
+#define FUSE_IOMAP_OP_ATOMIC		(1 << 9) /* torn-write protection */
+#define FUSE_IOMAP_OP_DONTCACHE		(1 << 10) /* don't retain pagecache */
+
+#define FUSE_IOMAP_NULL_ADDR		(-1ULL)	/* addr is not valid */
+
+struct fuse_iomap {
+	uint64_t addr;		/* disk offset of mapping, bytes */
+	uint64_t offset;	/* file offset of mapping, bytes */
+	uint64_t length;	/* length of mapping, bytes */
+	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
+	uint16_t flags;		/* FUSE_IOMAP_F_* */
+	uint32_t dev;		/* device cookie */
+};
+#endif /* FUSE_USE_VERSION >= 318 */
+
 /* ----------------------------------------------------------- *
  * Compatibility stuff					       *
  * ----------------------------------------------------------- */
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index b1e42d3cf86e81..eb59ff687b2e7d 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -665,6 +665,9 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_BEGIN	= 4094,
+	FUSE_IOMAP_END		= 4095,
+
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
 
@@ -1297,4 +1300,42 @@ struct fuse_uring_cmd_req {
 	uint8_t padding[6];
 };
 
+struct fuse_iomap_begin_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+};
+
+struct fuse_iomap_begin_out {
+	uint64_t offset;	/* file offset of mapping, bytes */
+	uint64_t length;	/* length of both mappings, bytes */
+
+	uint64_t read_addr;	/* disk offset of mapping, bytes */
+	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t read_dev;	/* FUSE_IOMAP_DEV_* */
+
+	uint64_t write_addr;	/* disk offset of mapping, bytes */
+	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t write_dev;	/* device cookie */
+};
+
+struct fuse_iomap_end_in {
+	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
+	uint32_t reserved;	/* zero */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t count;		/* operation length, in bytes */
+	int64_t written;	/* bytes processed */
+
+	uint64_t map_length;	/* length of mapping, bytes */
+	uint64_t map_addr;	/* disk offset of mapping, bytes */
+	uint16_t map_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t map_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t map_dev;	/* device cookie */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 75e084d09167de..d3de87897c47b8 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1325,6 +1325,44 @@ struct fuse_lowlevel_ops {
 	void (*tmpfile) (fuse_req_t req, fuse_ino_t parent,
 			mode_t mode, struct fuse_file_info *fi);
 
+#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+	/**
+	 * Fetch file I/O mappings to begin an operation
+	 *
+	 * Valid replies:
+	 *   fuse_reply_iomap_begin
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param nodeid the inode number
+	 * @param attr_ino inode number as told by fuse_attr::ino
+	 * @param pos position in file, in bytes
+	 * @param count length of operation, in bytes
+	 * @param opflags mask of FUSE_IOMAP_OP_ flags specifying operation
+	 */
+	void (*iomap_begin) (fuse_req_t req, fuse_ino_t nodeid,
+			     uint64_t attr_ino, off_t pos, uint64_t count,
+			     uint32_t opflags);
+
+	/**
+	 * Complete an iomap operation
+	 *
+	 * Valid replies:
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param nodeid the inode number
+	 * @param attr_ino inode number as told by fuse_attr::ino
+	 * @param pos position in file, in bytes
+	 * @param count length of operation, in bytes
+	 * @param written number of bytes processed, or a negative errno
+	 * @param opflags mask of FUSE_IOMAP_OP_ flags specifying operation
+	 * @param iomap file I/O mapping that the operation used
+	 */
+	void (*iomap_end) (fuse_req_t req, fuse_ino_t nodeid, uint64_t attr_ino,
+			   off_t pos, uint64_t count, uint32_t opflags,
+			   ssize_t written, const struct fuse_iomap *iomap);
+#endif /* FUSE_USE_VERSION >= 318 */
 };
 
 /**
@@ -1705,6 +1743,22 @@ int fuse_reply_poll(fuse_req_t req, unsigned revents);
  */
 int fuse_reply_lseek(fuse_req_t req, off_t off);
 
+#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+/**
+ * Reply with iomappings for an iomap_begin operation
+ *
+ * Possible requests:
+ *   iomap_begin
+ *
+ * @param req request handle
+ * @param read_iomap mapping for file data reads
+ * @param write_iomap mapping for file data writes
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_iomap *read_iomap,
+			   const struct fuse_iomap *write_iomap);
+#endif /* FUSE_USE_VERSION >= 318 */
+
 /* ----------------------------------------------------------- *
  * Notification						       *
  * ----------------------------------------------------------- */
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 392e898a5e8ec1..875d2345461251 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2428,6 +2428,76 @@ static void do_lseek(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg)
 	_do_lseek(req, nodeid, inarg, NULL);
 }
 
+int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_iomap *read_iomap,
+			   const struct fuse_iomap *write_iomap)
+{
+	struct fuse_iomap_begin_out arg = {
+		.offset = read_iomap->offset,
+		.length = read_iomap->length,
+
+		.read_addr = read_iomap->addr,
+		.read_type = read_iomap->type,
+		.read_flags = read_iomap->flags,
+		.read_dev = read_iomap->dev,
+
+		.write_addr = write_iomap->addr,
+		.write_type = write_iomap->type,
+		.write_flags = write_iomap->flags,
+		.write_dev = write_iomap->dev,
+	};
+
+	return send_reply_ok(req, &arg, sizeof(arg));
+}
+
+static void _do_iomap_begin(fuse_req_t req, const fuse_ino_t nodeid,
+			    const void *op_in, const void *in_payload)
+{
+	const struct fuse_iomap_begin_in *arg = op_in;
+	(void)in_payload;
+	(void)nodeid;
+
+	if (req->se->op.iomap_begin)
+		req->se->op.iomap_begin(req, nodeid, arg->attr_ino, arg->pos,
+					arg->count, arg->opflags);
+	else
+		fuse_reply_err(req, ENOSYS);
+}
+
+static void do_iomap_begin(fuse_req_t req, const fuse_ino_t nodeid,
+			   const void *inarg)
+{
+	_do_iomap_begin(req, nodeid, inarg, NULL);
+}
+
+static void _do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
+			    const void *op_in, const void *in_payload)
+{
+	const struct fuse_iomap_end_in *arg = op_in;
+	struct fuse_iomap iomap = {
+		.addr = arg->map_addr,
+		.offset = arg->pos,
+		.length = arg->map_length,
+		.type = arg->map_type,
+		.flags = arg->map_flags,
+		.dev = arg->map_dev,
+	};
+	(void)in_payload;
+	(void)nodeid;
+
+	if (req->se->op.iomap_end)
+		req->se->op.iomap_end(req, nodeid, arg->attr_ino, arg->pos,
+				      arg->count, arg->opflags, arg->written,
+				      &iomap);
+	else
+		fuse_reply_err(req, 0);
+}
+
+static void do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
+			   const void *inarg)
+{
+	_do_iomap_end(req, nodeid, inarg, NULL);
+}
+
 static bool want_flags_valid(uint64_t capable, uint64_t want)
 {
 	uint64_t unknown_flags = want & (~capable);
@@ -3306,6 +3376,8 @@ static struct {
 	[FUSE_RENAME2]     = { do_rename2,      "RENAME2"    },
 	[FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]	   = { do_lseek,       "LSEEK"	     },
+	[FUSE_IOMAP_BEGIN] = { do_iomap_begin,	"IOMAP_BEGIN" },
+	[FUSE_IOMAP_END]   = { do_iomap_end,	"IOMAP_END" },
 	[CUSE_INIT]	   = { cuse_lowlevel_init, "CUSE_INIT"   },
 };
 
@@ -3360,6 +3432,8 @@ static struct {
 	[FUSE_RENAME2]		= { _do_rename2,	"RENAME2" },
 	[FUSE_COPY_FILE_RANGE]	= { _do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]		= { _do_lseek,		"LSEEK" },
+	[FUSE_IOMAP_BEGIN]	= { _do_iomap_begin,	"IOMAP_BEGIN" },
+	[FUSE_IOMAP_END]	= { _do_iomap_end,	"IOMAP_END" },
 	[CUSE_INIT]		= { _cuse_lowlevel_init, "CUSE_INIT" },
 };
 
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 2d8884d7eae090..2b4c16abdaf519 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -212,6 +212,8 @@ FUSE_3.18 {
 
 		# Not part of public API, for internal test use only
 		fuse_convert_to_conn_want_ext;
+
+		fuse_reply_iomap_begin;
 } FUSE_3.17;
 
 # Local Variables:


* [PATCH 03/14] libfuse: add upper level iomap commands
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
  2025-07-17 23:34   ` [PATCH 01/14] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version Darrick J. Wong
  2025-07-17 23:34   ` [PATCH 02/14] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
@ 2025-07-17 23:35   ` Darrick J. Wong
  2025-07-17 23:35   ` [PATCH 04/14] libfuse: add a notification to add a new device to iomap Darrick J. Wong
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:35 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Teach the upper level fuse library about the iomap begin and end
operations, and connect them to the lower level.  This is needed for
fuse2fs to start using iomap.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
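fuse_lib_iomap_begin() below pre-fills the read mapping as a hole and
the write mapping as a pure overwrite, so a callback that only serves
reads needs to fill in read_iomap_out alone.  A minimal sketch of the
mapping arithmetic such a callback might perform, assuming a toy layout
in which the file has exactly one extent; struct toy_extent and
toy_fill_read_mapping() are hypothetical, and the other definitions are
copied from this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Copied from the fuse_common.h hunks earlier in this series. */
#define FUSE_IOMAP_TYPE_HOLE	0
#define FUSE_IOMAP_TYPE_MAPPED	2
#define FUSE_IOMAP_DEV_NULL	(0U)
#define FUSE_IOMAP_NULL_ADDR	(-1ULL)

struct fuse_iomap {
	uint64_t addr;		/* disk offset of mapping, bytes */
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
	uint16_t flags;		/* FUSE_IOMAP_F_* */
	uint32_t dev;		/* device cookie */
};

/* Hypothetical layout: the whole file is one extent on one device. */
struct toy_extent {
	uint64_t file_off;	/* first file byte covered */
	uint64_t disk_off;	/* where that byte lives on disk */
	uint64_t len;		/* bytes covered */
	uint32_t dev;		/* device cookie */
};

static int toy_fill_read_mapping(const struct toy_extent *ext, off_t pos,
				 uint64_t count, struct fuse_iomap *map)
{
	uint64_t start, end;

	assert(ext != NULL && map != NULL);
	if (pos < 0 || count == 0)
		return -EINVAL;
	start = (uint64_t)pos;
	end = ext->file_off + ext->len;

	map->offset = start;
	map->flags = 0;
	if (start >= ext->file_off && start < end) {
		/* Inside the extent: trim the mapping at the extent end. */
		map->type = FUSE_IOMAP_TYPE_MAPPED;
		map->dev = ext->dev;
		map->addr = ext->disk_off + (start - ext->file_off);
		map->length = count < end - start ? count : end - start;
		return 0;
	}

	/* Outside the extent: report a hole. */
	map->type = FUSE_IOMAP_TYPE_HOLE;
	map->dev = FUSE_IOMAP_DEV_NULL;
	map->addr = FUSE_IOMAP_NULL_ADDR;
	if (start < ext->file_off && count > ext->file_off - start)
		map->length = ext->file_off - start; /* stop where data begins */
	else
		map->length = count;
	return 0;
}
```

The returned mapping may be shorter than the requested range.  With
iomap, the caller is expected to ask again for the remainder.
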
 include/fuse.h |   19 ++++++++++
 lib/fuse.c     |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+)


diff --git a/include/fuse.h b/include/fuse.h
index b99004334c99f3..6b25586e768285 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -850,6 +850,25 @@ struct fuse_operations {
 	 * Find next data or hole after the specified offset
 	 */
 	off_t (*lseek) (const char *, off_t off, int whence, struct fuse_file_info *);
+
+#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+	/**
+	 * Send a mapping to the kernel so that a file IO operation can run.
+	 */
+	int (*iomap_begin) (const char *path, uint64_t nodeid,
+			    uint64_t attr_ino, off_t pos_in,
+			    uint64_t length_in, uint32_t opflags_in,
+			    struct fuse_iomap *read_iomap_out,
+			    struct fuse_iomap *write_iomap_out);
+
+	/**
+	 * Respond to the outcome of a previous file mapping operation.
+	 */
+	int (*iomap_end) (const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos_in, uint64_t length_in,
+			  uint32_t opflags_in, ssize_t written_in,
+			  const struct fuse_iomap *iomap_in);
+#endif /* FUSE_USE_VERSION >= 318 */
 };
 
 /** Extra context that may be needed by some filesystems
diff --git a/lib/fuse.c b/lib/fuse.c
index 68b61ce6953d6f..aa4287e0896761 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2737,6 +2737,45 @@ int fuse_fs_chmod(struct fuse_fs *fs, const char *path, mode_t mode,
 	return fs->op.chmod(path, mode, fi);
 }
 
+static int fuse_fs_iomap_begin(struct fuse_fs *fs, const char *path,
+			       fuse_ino_t nodeid, uint64_t attr_ino, off_t pos,
+			       uint64_t count, uint32_t opflags,
+			       struct fuse_iomap *read_iomap,
+			       struct fuse_iomap *write_iomap)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.iomap_begin)
+		return -ENOSYS;
+
+	if (fs->debug) {
+		fuse_log(FUSE_LOG_DEBUG,
+			 "iomap_begin[%s] nodeid %llu attr_ino %llu pos %llu count %llu opflags 0x%x\n",
+			 path, nodeid, attr_ino, pos, count, opflags);
+	}
+
+	return fs->op.iomap_begin(path, nodeid, attr_ino, pos, count, opflags,
+				  read_iomap, write_iomap);
+}
+
+static int fuse_fs_iomap_end(struct fuse_fs *fs, const char *path,
+			     fuse_ino_t nodeid, uint64_t attr_ino, off_t pos,
+			     uint64_t count, uint32_t opflags, ssize_t written,
+			     const struct fuse_iomap *iomap)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.iomap_end)
+		return 0;
+
+	if (fs->debug) {
+		fuse_log(FUSE_LOG_DEBUG,
+			 "iomap_end[%s] nodeid %llu attr_ino %llu pos %llu count %llu opflags 0x%x written %zd\n",
+			 path, nodeid, attr_ino, pos, count, opflags, written);
+	}
+
+	return fs->op.iomap_end(path, nodeid, attr_ino, pos, count, opflags,
+				written, iomap);
+}
+
 static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
 			     int valid, struct fuse_file_info *fi)
 {
@@ -4361,6 +4400,72 @@ static void fuse_lib_lseek(fuse_req_t req, fuse_ino_t ino, off_t off, int whence
 		reply_err(req, res);
 }
 
+static void fuse_lib_iomap_begin(fuse_req_t req, fuse_ino_t nodeid,
+				 uint64_t attr_ino, off_t pos, uint64_t count,
+				 uint32_t opflags)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_iomap read_iomap = {
+		.offset = pos,
+		.length = count,
+		.type = FUSE_IOMAP_TYPE_HOLE,
+		.dev  = FUSE_IOMAP_DEV_NULL,
+		.addr = FUSE_IOMAP_NULL_ADDR,
+	};
+	struct fuse_iomap write_iomap = {
+		.offset = pos,
+		.length = count,
+		.type = FUSE_IOMAP_TYPE_PURE_OVERWRITE,
+		.dev  = FUSE_IOMAP_DEV_NULL,
+		.addr = FUSE_IOMAP_NULL_ADDR,
+	};
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path_nullok(f, nodeid, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_iomap_begin(f->fs, path, nodeid, attr_ino, pos, count,
+				  opflags, &read_iomap, &write_iomap);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, nodeid, path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_reply_iomap_begin(req, &read_iomap, &write_iomap);
+}
+
+static void fuse_lib_iomap_end(fuse_req_t req, fuse_ino_t nodeid,
+			       uint64_t attr_ino, off_t pos, uint64_t count,
+			       uint32_t opflags, ssize_t written,
+			       const struct fuse_iomap *iomap)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path_nullok(f, nodeid, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_iomap_end(f->fs, path, nodeid, attr_ino, pos, count,
+				opflags, written, iomap);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, nodeid, path);
+	reply_err(req, err);
+}
+
 static int clean_delay(struct fuse *f)
 {
 	/*
@@ -4459,6 +4564,8 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
 	.fallocate = fuse_lib_fallocate,
 	.copy_file_range = fuse_lib_copy_file_range,
 	.lseek = fuse_lib_lseek,
+	.iomap_begin = fuse_lib_iomap_begin,
+	.iomap_end = fuse_lib_iomap_end,
 };
 
 int fuse_notify_poll(struct fuse_pollhandle *ph)


* [PATCH 04/14] libfuse: add a notification to add a new device to iomap
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:35   ` [PATCH 03/14] libfuse: add upper level iomap commands Darrick J. Wong
@ 2025-07-17 23:35   ` Darrick J. Wong
  2025-07-17 23:35   ` [PATCH 05/14] libfuse: add iomap ioend low level handler Darrick J. Wong
                     ` (9 subsequent siblings)
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:35 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Plumb in the pieces needed to attach block devices to a fuse+iomap mount
for use with iomap operations.  This enables us to have filesystems
where the metadata could live somewhere else, but the actual file IO
goes to locally attached storage.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
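fuse_iomap_add_device() below returns a positive device id on success
and zero on failure, and zero is also FUSE_IOMAP_DEV_NULL.  A minimal
sketch of how a server might record the cookie, assuming it wants to
turn the zero return into an errno.  toy_add_device_fn mirrors the
function's signature with void * in place of struct fuse_session *, and
the two toy_add_* functions are stand-ins so that the sketch runs
without a mount; all toy_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define FUSE_IOMAP_DEV_NULL	(0U)	/* copied from this series */

/* Same shape as fuse_iomap_add_device(), minus the session type. */
typedef int (*toy_add_device_fn)(void *se, int fd, unsigned int flags);

struct toy_fs {
	uint32_t data_dev;	/* cookie to place in fuse_iomap.dev */
};

static int toy_attach_device(struct toy_fs *fs, toy_add_device_fn add,
			     void *se, int fd)
{
	int id;

	assert(fs != NULL && add != NULL);
	fs->data_dev = FUSE_IOMAP_DEV_NULL;

	id = add(se, fd, 0);	/* no flags are defined so far */
	if (id <= 0)
		return -ENODEV;	/* zero means the attach failed */

	fs->data_dev = (uint32_t)id;
	return 0;
}

/* Stand-ins for the real call, used only to exercise the sketch. */
static int toy_add_ok(void *se, int fd, unsigned int flags)
{
	(void)se; (void)fd; (void)flags;
	return 1;
}

static int toy_add_fail(void *se, int fd, unsigned int flags)
{
	(void)se; (void)fd; (void)flags;
	return 0;
}
```

Keeping the cookie at FUSE_IOMAP_DEV_NULL after a failure means a later
mapping cannot accidentally reference a device that was never attached.
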
 include/fuse_kernel.h   |    3 +++
 include/fuse_lowlevel.h |   16 ++++++++++++++++
 lib/fuse_lowlevel.c     |   17 +++++++++++++++++
 lib/fuse_versionscript  |    1 +
 4 files changed, 37 insertions(+)


diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index eb59ff687b2e7d..97ca55f0114b1d 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -239,6 +239,7 @@
  *  7.99
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
  *    SEEK_{DATA,HOLE} support
+ *  - add FUSE_DEV_IOC_IOMAP_DEV_ADD to configure block devices for iomap
  */
 
 #ifndef _LINUX_FUSE_H
@@ -1136,6 +1137,8 @@ struct fuse_backing_map {
 #define FUSE_DEV_IOC_BACKING_OPEN	_IOW(FUSE_DEV_IOC_MAGIC, 1, \
 					     struct fuse_backing_map)
 #define FUSE_DEV_IOC_BACKING_CLOSE	_IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_IOMAP_DEV_ADD	_IOW(FUSE_DEV_IOC_MAGIC, 3, \
+					     struct fuse_backing_map)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index d3de87897c47b8..d3e505ed52815b 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1962,6 +1962,22 @@ int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino,
 int fuse_lowlevel_notify_retrieve(struct fuse_session *se, fuse_ino_t ino,
 				  size_t size, off_t offset, void *cookie);
 
+#if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+/**
+ * Attach an open file descriptor to a fuse+iomap mount.  Currently must be
+ * a block device.
+ *
+ * Added in FUSE protocol version 7.99. If the kernel does not support
+ * this (or a newer) version, the function will log an error, return
+ * zero, and do nothing else.
+ *
+ * @param se the session object
+ * @param fd file descriptor of an open block device
+ * @param flags flags for the operation; none defined so far
+ * @return positive device id for success, zero for failure
+ */
+int fuse_iomap_add_device(struct fuse_session *se, int fd, unsigned int flags);
+#endif
 
 /* ----------------------------------------------------------- *
  * Utility functions					       *
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 875d2345461251..5df0cdd4ac461a 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -580,6 +580,23 @@ int fuse_passthrough_close(fuse_req_t req, int backing_id)
 	return ret;
 }
 
+int fuse_iomap_add_device(struct fuse_session *se, int fd, unsigned int flags)
+{
+	struct fuse_backing_map map = {
+		.fd = fd,
+		.flags = flags,
+	};
+	int ret;
+
+	ret = ioctl(se->fd, FUSE_DEV_IOC_IOMAP_DEV_ADD, &map);
+	if (ret <= 0) {
+		fuse_log(FUSE_LOG_ERR, "fuse: iomap_dev_add: %s\n", strerror(errno));
+		return 0;
+	}
+
+	return ret;
+}
+
 int fuse_reply_open(fuse_req_t req, const struct fuse_file_info *f)
 {
 	struct fuse_open_out arg;
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 2b4c16abdaf519..4cdae6a6a42051 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -214,6 +214,7 @@ FUSE_3.18 {
 		fuse_convert_to_conn_want_ext;
 
 		fuse_reply_iomap_begin;
+		fuse_iomap_add_device;
 } FUSE_3.17;
 
 # Local Variables:


* [PATCH 05/14] libfuse: add iomap ioend low level handler
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-17 23:35   ` [PATCH 04/14] libfuse: add a notification to add a new device to iomap Darrick J. Wong
@ 2025-07-17 23:35   ` Darrick J. Wong
  2025-07-17 23:35   ` [PATCH 06/14] libfuse: add upper level iomap ioend commands Darrick J. Wong
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:35 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Teach the low level library about the iomap ioend handler, which gets
called by the kernel when we finish a file write that isn't a pure
overwrite operation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
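The ioend flags added below tell the server what kind of write just
finished.  A minimal sketch of how a handler might turn them into
metadata updates, assuming a server that converts unwritten extents,
remaps out-of-place writes, and grows the file size on appends.  That
policy is an assumption, not something this patch prescribes; the flag
values are copied from this patch and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the fuse_common.h hunk in this patch. */
#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)

/* Hypothetical summary of the metadata work an ioend calls for. */
struct toy_ioend_plan {
	int remap_cow;		/* point the file at the new blocks */
	int convert_unwritten;	/* mark the extent as written */
	uint64_t new_size;	/* file size after the write */
};

/* Returns 0 or the negative errno that the kernel reported. */
static int toy_plan_ioend(uint32_t ioendflags, int error, uint64_t pos,
			  uint64_t written, uint64_t cur_size,
			  struct toy_ioend_plan *plan)
{
	assert(plan != NULL);
	plan->remap_cow = 0;
	plan->convert_unwritten = 0;
	plan->new_size = cur_size;

	/* A failed write changes no metadata in this sketch. */
	if (error)
		return error;

	plan->remap_cow = (ioendflags & FUSE_IOMAP_IOEND_SHARED) != 0;
	plan->convert_unwritten =
		(ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) != 0;
	if ((ioendflags & FUSE_IOMAP_IOEND_APPEND) &&
	    pos + written > cur_size)
		plan->new_size = pos + written;
	return 0;
}
```

An iomap_ioend handler would compute the plan, apply it to its own
metadata, and then answer with fuse_reply_err().
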
 include/fuse_common.h   |   13 +++++++++++++
 include/fuse_kernel.h   |   12 ++++++++++++
 include/fuse_lowlevel.h |   20 ++++++++++++++++++++
 lib/fuse_lowlevel.c     |   24 +++++++++++++++++++++++-
 4 files changed, 68 insertions(+), 1 deletion(-)


diff --git a/include/fuse_common.h b/include/fuse_common.h
index f48724b0d1ea0f..66c25afe15ec76 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1197,6 +1197,19 @@ struct fuse_iomap {
 	uint16_t flags;		/* FUSE_IOMAP_F_* */
 	uint32_t dev;		/* device cookie */
 };
+
+/* out of place write extent */
+#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
+/* unwritten extent */
+#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
+/* don't merge into previous ioend */
+#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)
+/* is direct I/O */
+#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)
+
+/* is append ioend */
+#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)
+
 #endif /* FUSE_USE_VERSION >= 318 */
 
 /* ----------------------------------------------------------- *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 97ca55f0114b1d..a06c16243a7885 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -666,6 +666,7 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_IOEND	= 4093,
 	FUSE_IOMAP_BEGIN	= 4094,
 	FUSE_IOMAP_END		= 4095,
 
@@ -1341,4 +1342,15 @@ struct fuse_iomap_end_in {
 	uint32_t map_dev;	/* device cookie */
 };
 
+struct fuse_iomap_ioend_in {
+	uint16_t ioendflags;	/* FUSE_IOMAP_IOEND_* */
+	uint16_t reserved;	/* zero */
+	int32_t error;		/* negative errno or 0 */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+	uint64_t pos;		/* file position, in bytes */
+	uint64_t new_addr;	/* disk offset of new mapping, in bytes */
+	uint32_t written;	/* bytes processed */
+	uint32_t reserved1;	/* zero */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index d3e505ed52815b..1b856431de0a60 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1362,6 +1362,26 @@ struct fuse_lowlevel_ops {
 	void (*iomap_end) (fuse_req_t req, fuse_ino_t nodeid, uint64_t attr_ino,
 			   off_t pos, uint64_t count, uint32_t opflags,
 			   ssize_t written, const struct fuse_iomap *iomap);
+
+	/**
+	 * Complete an iomap IO operation
+	 *
+	 * Valid replies:
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param nodeid the inode number
+	 * @param attr_ino inode number as told by fuse_attr::ino
+	 * @param pos position in file, in bytes
+	 * @param written number of bytes processed
+	 * @param ioendflags mask of FUSE_IOMAP_IOEND_ flags specifying operation
+	 * @param error negative errno code if the IO failed, or zero
+	 * @param new_addr disk address of new mapping, in bytes
+	 */
+	void (*iomap_ioend) (fuse_req_t req, fuse_ino_t nodeid,
+			     uint64_t attr_ino, off_t pos, size_t written,
+			     uint32_t ioendflags, int error,
+			     uint64_t new_addr);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 5df0cdd4ac461a..d26043fa54c036 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2515,6 +2515,27 @@ static void do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
 	_do_iomap_end(req, nodeid, inarg, NULL);
 }
 
+static void _do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid,
+			    const void *op_in, const void *in_payload)
+{
+	const struct fuse_iomap_ioend_in *arg = op_in;
+	(void)in_payload;
+	(void)nodeid;
+
+	if (req->se->op.iomap_ioend)
+		req->se->op.iomap_ioend(req, nodeid, arg->attr_ino, arg->pos,
+					arg->written, arg->ioendflags,
+					arg->error, arg->new_addr);
+	else
+		fuse_reply_err(req, 0);
+}
+
+static void do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid,
+			   const void *inarg)
+{
+	_do_iomap_ioend(req, nodeid, inarg, NULL);
+}
+
 static bool want_flags_valid(uint64_t capable, uint64_t want)
 {
 	uint64_t unknown_flags = want & (~capable);
@@ -2713,7 +2734,6 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			se->conn.capable_ext |= FUSE_CAP_OVER_IO_URING;
 		if (inargflags & FUSE_IOMAP)
 			se->conn.capable_ext |= FUSE_CAP_IOMAP;
-
 	} else {
 		se->conn.max_readahead = 0;
 	}
@@ -3395,6 +3415,7 @@ static struct {
 	[FUSE_LSEEK]	   = { do_lseek,       "LSEEK"	     },
 	[FUSE_IOMAP_BEGIN] = { do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]   = { do_iomap_end,	"IOMAP_END" },
+	[FUSE_IOMAP_IOEND] = { do_iomap_ioend,	"IOMAP_IOEND" },
 	[CUSE_INIT]	   = { cuse_lowlevel_init, "CUSE_INIT"   },
 };
 
@@ -3451,6 +3472,7 @@ static struct {
 	[FUSE_LSEEK]		= { _do_lseek,		"LSEEK" },
 	[FUSE_IOMAP_BEGIN]	= { _do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]	= { _do_iomap_end,	"IOMAP_END" },
+	[FUSE_IOMAP_IOEND]	= { _do_iomap_ioend,	"IOMAP_IOEND" },
 	[CUSE_INIT]		= { _cuse_lowlevel_init, "CUSE_INIT" },
 };
 


* [PATCH 06/14] libfuse: add upper level iomap ioend commands
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-17 23:35   ` [PATCH 05/14] libfuse: add iomap ioend low level handler Darrick J. Wong
@ 2025-07-17 23:35   ` Darrick J. Wong
  2025-07-17 23:36   ` [PATCH 07/14] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong
                     ` (7 subsequent siblings)
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:35 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Teach the upper level fuse library about iomap ioend events, which
happen when a write that isn't a pure overwrite completes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
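For anyone wiring this up in a high-level server, below is a minimal
sketch of what an ->iomap_ioend handler might look like.  It only
borrows the parameter list from this patch.  struct sketch_inode,
sketch_file and SKETCH_IOEND_APPEND are hypothetical stand-ins so that
the snippet builds without the patched headers, and the policy (ignore
failed IO, extend the size on append) is an assumption, not something
this series mandates.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Local copy of FUSE_IOMAP_IOEND_APPEND from this series' fuse_common.h. */
#define SKETCH_IOEND_APPEND	(1U << 15)

/* Hypothetical in-memory inode; a real server has its own bookkeeping. */
struct sketch_inode {
	uint64_t ino;
	off_t size;
};

static struct sketch_inode sketch_file = { .ino = 42, .size = 4096 };

/* Takes the same arguments as fuse_operations::iomap_ioend in this patch. */
static int sketch_iomap_ioend(const char *path, uint64_t nodeid,
			      uint64_t attr_ino, off_t pos_in,
			      size_t written_in, uint32_t ioendflags_in,
			      int error_in, uint64_t new_addr_in)
{
	/* path may be NULL because the library uses get_path_nullok(). */
	(void)path;
	(void)nodeid;
	(void)new_addr_in;

	assert(pos_in >= 0);

	if (attr_ino != sketch_file.ino)
		return -ENOENT;

	/* Assumption: a failed IO leaves the metadata untouched. */
	if (error_in)
		return 0;

	/* Assumption: an append ioend may need to push out the file size. */
	if ((ioendflags_in & SKETCH_IOEND_APPEND) &&
	    pos_in + (off_t)written_in > sketch_file.size)
		sketch_file.size = pos_in + (off_t)written_in;

	return 0;
}
```

The handler returns 0 or a negative errno, which fuse_lib_iomap_ioend
passes to reply_err().  A server would hook it up with
.iomap_ioend = sketch_iomap_ioend in its struct fuse_operations.
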
 include/fuse.h |    8 ++++++++
 lib/fuse.c     |   45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)


diff --git a/include/fuse.h b/include/fuse.h
index 6b25586e768285..e2e7c950bf144d 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -868,6 +868,14 @@ struct fuse_operations {
 			  off_t pos_in, uint64_t length_in,
 			  uint32_t opflags_in, ssize_t written_in,
 			  const struct fuse_iomap *iomap_in);
+
+	/**
+	 * Respond to the outcome of a file IO operation.
+	 */
+	int (*iomap_ioend) (const char *path, uint64_t nodeid,
+			    uint64_t attr_ino, off_t pos_in, size_t written_in,
+			    uint32_t ioendflags_in, int error_in,
+			    uint64_t new_addr_in);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
diff --git a/lib/fuse.c b/lib/fuse.c
index aa4287e0896761..8dbf88877dd37c 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2776,6 +2776,26 @@ static int fuse_fs_iomap_end(struct fuse_fs *fs, const char *path,
 				written, iomap);
 }
 
+static int fuse_fs_iomap_ioend(struct fuse_fs *fs, const char *path,
+			       uint64_t nodeid, uint64_t attr_ino, off_t pos,
+			       size_t written, uint32_t ioendflags, int error,
+			       uint64_t new_addr)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.iomap_ioend)
+		return 0;
+
+	if (fs->debug) {
+		fuse_log(FUSE_LOG_DEBUG,
+			 "iomap_ioend[%s] nodeid %llu attr_ino %llu pos %llu written %zu ioendflags 0x%x error %d\n",
+			 path, (unsigned long long)nodeid, (unsigned long long)attr_ino,
+			 (unsigned long long)pos, written, ioendflags, error);
+	}
+
+	return fs->op.iomap_ioend(path, nodeid, attr_ino, pos, written,
+				  ioendflags, error, new_addr);
+}
+
 static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
 			     int valid, struct fuse_file_info *fi)
 {
@@ -4466,6 +4486,30 @@ static void fuse_lib_iomap_end(fuse_req_t req, fuse_ino_t nodeid,
 	reply_err(req, err);
 }
 
+static void fuse_lib_iomap_ioend(fuse_req_t req, fuse_ino_t nodeid,
+				 uint64_t attr_ino, off_t pos, size_t written,
+				 uint32_t ioendflags, int error,
+				 uint64_t new_addr)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path_nullok(f, nodeid, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_iomap_ioend(f->fs, path, nodeid, attr_ino, pos, written,
+				  ioendflags, error, new_addr);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, nodeid, path);
+	reply_err(req, err);
+}
+
 static int clean_delay(struct fuse *f)
 {
 	/*
@@ -4566,6 +4610,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
 	.lseek = fuse_lib_lseek,
 	.iomap_begin = fuse_lib_iomap_begin,
 	.iomap_end = fuse_lib_iomap_end,
+	.iomap_ioend = fuse_lib_iomap_ioend,
 };
 
 int fuse_notify_poll(struct fuse_pollhandle *ph)



* [PATCH 07/14] libfuse: add a reply function to send FUSE_ATTR_* to the kernel
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-07-17 23:35   ` [PATCH 06/14] libfuse: add upper level iomap ioend commands Darrick J. Wong
@ 2025-07-17 23:36   ` Darrick J. Wong
  2025-07-18 14:10     ` Amir Goldstein
  2025-07-17 23:36   ` [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong
                     ` (6 subsequent siblings)
  13 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:36 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Create new fuse_reply_{attr,create,entry}_iflags functions, plus
fuse_add_direntry_plus_iflags for readdirplus, so that we can send
FUSE_ATTR_* flags to the kernel when instantiating an inode.  Servers
pass FUSE_IFLAG_* values, which libfuse translates into the FUSE_ATTR_*
flags that the kernel understands.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
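To illustrate why the translation step exists, here is a small
standalone sketch of the mapping that convert_stat() performs in this
patch.  SKETCH_IFLAG_DAX mirrors FUSE_IFLAG_DAX from the diff below.
SKETCH_ATTR_DAX is my assumption of the kernel's FUSE_ATTR_DAX value
(1 << 1 in the uapi header as far as I know), and
sketch_iflags_to_attr_flags() is a hypothetical helper, not a libfuse
function.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins so the sketch builds without the patched headers. */
#define SKETCH_IFLAG_DAX	(1U << 0)	/* mirrors FUSE_IFLAG_DAX */
#define SKETCH_ATTR_DAX		(1U << 1)	/* assumed FUSE_ATTR_DAX value */

/* The two namespaces use different bits, so a plain copy would be wrong. */
static_assert(SKETCH_IFLAG_DAX != SKETCH_ATTR_DAX,
	      "server-facing and wire-format bits differ");

/* Translate server-facing iflags into wire-format attr flags. */
static uint32_t sketch_iflags_to_attr_flags(unsigned int iflags)
{
	uint32_t flags = 0;

	if (iflags & SKETCH_IFLAG_DAX)
		flags |= SKETCH_ATTR_DAX;

	/* Bits without a known translation are dropped, not forwarded. */
	return flags;
}
```

Keeping a separate FUSE_IFLAG_* namespace means servers never depend on
the kernel's bit assignments.
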
 include/fuse_common.h   |    3 ++
 include/fuse_lowlevel.h |   87 +++++++++++++++++++++++++++++++++++++++++++++--
 lib/fuse_lowlevel.c     |   69 ++++++++++++++++++++++++++++++-------
 lib/fuse_versionscript  |    4 ++
 4 files changed, 146 insertions(+), 17 deletions(-)


diff --git a/include/fuse_common.h b/include/fuse_common.h
index 66c25afe15ec76..11eb22d011896c 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1210,6 +1210,9 @@ struct fuse_iomap {
 /* is append ioend */
 #define FUSE_IOMAP_IOEND_APPEND		(1U << 15)
 
+/* enable fsdax */
+#define FUSE_IFLAG_DAX			(1U << 0)
+
 #endif /* FUSE_USE_VERSION >= 318 */
 
 /* ----------------------------------------------------------- *
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 1b856431de0a60..07748abcf079cf 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -240,6 +240,7 @@ struct fuse_lowlevel_ops {
 	 *
 	 * Valid replies:
 	 *   fuse_reply_entry
+	 *   fuse_reply_entry_iflags
 	 *   fuse_reply_err
 	 *
 	 * @param req request handle
@@ -299,6 +300,7 @@ struct fuse_lowlevel_ops {
 	 *
 	 * Valid replies:
 	 *   fuse_reply_attr
+	 *   fuse_reply_attr_iflags
 	 *   fuse_reply_err
 	 *
 	 * @param req request handle
@@ -334,6 +336,7 @@ struct fuse_lowlevel_ops {
 	 *
 	 * Valid replies:
 	 *   fuse_reply_attr
+	 *   fuse_reply_attr_iflags
 	 *   fuse_reply_err
 	 *
 	 * @param req request handle
@@ -364,7 +367,7 @@ struct fuse_lowlevel_ops {
 	 * socket node.
 	 *
 	 * Valid replies:
-	 *   fuse_reply_entry
+	 *   fuse_reply_entry_iflags
 	 *   fuse_reply_err
 	 *
 	 * @param req request handle
@@ -380,7 +383,7 @@ struct fuse_lowlevel_ops {
 	 * Create a directory
 	 *
 	 * Valid replies:
-	 *   fuse_reply_entry
+	 *   fuse_reply_entry_iflags
 	 *   fuse_reply_err
 	 *
 	 * @param req request handle
@@ -429,7 +432,7 @@ struct fuse_lowlevel_ops {
 	 * Create a symbolic link
 	 *
 	 * Valid replies:
-	 *   fuse_reply_entry
+	 *   fuse_reply_entry_iflags
 	 *   fuse_reply_err
 	 *
 	 * @param req request handle
@@ -477,7 +480,7 @@ struct fuse_lowlevel_ops {
 	 * Create a hard link
 	 *
 	 * Valid replies:
-	 *   fuse_reply_entry
+	 *   fuse_reply_entry_iflags
 	 *   fuse_reply_err
 	 *
 	 * @param req request handle
@@ -969,6 +972,7 @@ struct fuse_lowlevel_ops {
 	 *
 	 * Valid replies:
 	 *   fuse_reply_create
+	 *   fuse_reply_create_iflags
 	 *   fuse_reply_err
 	 *
 	 * @param req request handle
@@ -1315,6 +1319,7 @@ struct fuse_lowlevel_ops {
 	 *
 	 * Valid replies:
 	 *   fuse_reply_create
+	 *   fuse_reply_create_iflags
 	 *   fuse_reply_err
 	 *
 	 * @param req request handle
@@ -1435,6 +1440,23 @@ void fuse_reply_none(fuse_req_t req);
  */
 int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e);
 
+/**
+ * Reply with a directory entry and FUSE_IFLAG_*
+ *
+ * Possible requests:
+ *   lookup, mknod, mkdir, symlink, link
+ *
+ * Side effects:
+ *   increments the lookup count on success
+ *
+ * @param req request handle
+ * @param e the entry parameters
+ * @param iflags	FUSE_IFLAG_*
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_entry_iflags(fuse_req_t req, const struct fuse_entry_param *e,
+			    unsigned int iflags);
+
 /**
  * Reply with a directory entry and open parameters
  *
@@ -1456,6 +1478,29 @@ int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e);
 int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
 		      const struct fuse_file_info *fi);
 
+/**
+ * Reply with a directory entry, open parameters and FUSE_IFLAG_*
+ *
+ * currently the following members of 'fi' are used:
+ *   fh, direct_io, keep_cache, cache_readdir, nonseekable, noflush,
+ *   parallel_direct_writes
+ *
+ * Possible requests:
+ *   create
+ *
+ * Side effects:
+ *   increments the lookup count on success
+ *
+ * @param req request handle
+ * @param e the entry parameters
+ * @param iflags	FUSE_IFLAG_*
+ * @param fi file information
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_create_iflags(fuse_req_t req, const struct fuse_entry_param *e,
+			     unsigned int iflags,
+			     const struct fuse_file_info *fi);
+
 /**
  * Reply with attributes
  *
@@ -1470,6 +1515,21 @@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
 int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
 		    double attr_timeout);
 
+/**
+ * Reply with attributes and FUSE_IFLAG_* flags
+ *
+ * Possible requests:
+ *   getattr, setattr
+ *
+ * @param req request handle
+ * @param attr the attributes
+ * @param iflags	set of FUSE_IFLAG_* flags
+ * @param attr_timeout	validity timeout (in seconds) for the attributes
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr,
+			   unsigned int iflags, double attr_timeout);
+
 /**
  * Reply with the contents of a symbolic link
  *
@@ -1697,6 +1757,25 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
 			      const char *name,
 			      const struct fuse_entry_param *e, off_t off);
 
+/**
+ * Add a directory entry and FUSE_IFLAG_* to the buffer with the attributes
+ *
+ * See documentation of `fuse_add_direntry_plus()` for more details.
+ *
+ * @param req request handle
+ * @param buf the point where the new entry will be added to the buffer
+ * @param bufsize remaining size of the buffer
+ * @param name the name of the entry
+ * @param iflags	FUSE_IFLAG_*
+ * @param e the directory entry
+ * @param off the offset of the next entry
+ * @return the space needed for the entry
+ */
+size_t fuse_add_direntry_plus_iflags(fuse_req_t req, char *buf, size_t bufsize,
+				     const char *name, unsigned int iflags,
+				     const struct fuse_entry_param *e,
+				     off_t off);
+
 /**
  * Reply to ask for data fetch and output buffer preparation.  ioctl
  * will be retried with the specified input data fetched and output
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index d26043fa54c036..568db13502a7d7 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -102,7 +102,8 @@ static void trace_request_reply(uint64_t unique, unsigned int len,
 }
 #endif
 
-static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr)
+static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
+			 unsigned int iflags)
 {
 	attr->ino	= stbuf->st_ino;
 	attr->mode	= stbuf->st_mode;
@@ -119,6 +120,10 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr)
 	attr->atimensec = ST_ATIM_NSEC(stbuf);
 	attr->mtimensec = ST_MTIM_NSEC(stbuf);
 	attr->ctimensec = ST_CTIM_NSEC(stbuf);
+
+	attr->flags	= 0;
+	if (iflags & FUSE_IFLAG_DAX)
+		attr->flags |= FUSE_ATTR_DAX;
 }
 
 static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
@@ -438,7 +443,8 @@ static unsigned int calc_timeout_nsec(double t)
 }
 
 static void fill_entry(struct fuse_entry_out *arg,
-		       const struct fuse_entry_param *e)
+		       const struct fuse_entry_param *e,
+		       unsigned int iflags)
 {
 	arg->nodeid = e->ino;
 	arg->generation = e->generation;
@@ -446,14 +452,15 @@ static void fill_entry(struct fuse_entry_out *arg,
 	arg->entry_valid_nsec = calc_timeout_nsec(e->entry_timeout);
 	arg->attr_valid = calc_timeout_sec(e->attr_timeout);
 	arg->attr_valid_nsec = calc_timeout_nsec(e->attr_timeout);
-	convert_stat(&e->attr, &arg->attr);
+	convert_stat(&e->attr, &arg->attr, iflags);
 }
 
 /* `buf` is allowed to be empty so that the proper size may be
    allocated by the caller */
-size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
-			      const char *name,
-			      const struct fuse_entry_param *e, off_t off)
+size_t fuse_add_direntry_plus_iflags(fuse_req_t req, char *buf, size_t bufsize,
+				     const char *name, unsigned int iflags,
+				     const struct fuse_entry_param *e,
+				     off_t off)
 {
 	(void)req;
 	size_t namelen;
@@ -468,7 +475,7 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
 
 	struct fuse_direntplus *dp = (struct fuse_direntplus *) buf;
 	memset(&dp->entry_out, 0, sizeof(dp->entry_out));
-	fill_entry(&dp->entry_out, e);
+	fill_entry(&dp->entry_out, e, iflags);
 
 	struct fuse_dirent *dirent = &dp->dirent;
 	dirent->ino = e->attr.st_ino;
@@ -481,6 +488,14 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
 	return entlen_padded;
 }
 
+size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
+			      const char *name,
+			      const struct fuse_entry_param *e, off_t off)
+{
+	return fuse_add_direntry_plus_iflags(req, buf, bufsize, name, 0, e,
+					     off);
+}
+
 static void fill_open(struct fuse_open_out *arg,
 		      const struct fuse_file_info *f)
 {
@@ -503,7 +518,8 @@ static void fill_open(struct fuse_open_out *arg,
 		arg->open_flags |= FOPEN_PARALLEL_DIRECT_WRITES;
 }
 
-int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e)
+int fuse_reply_entry_iflags(fuse_req_t req, const struct fuse_entry_param *e,
+			    unsigned int iflags)
 {
 	struct fuse_entry_out arg;
 	size_t size = req->se->conn.proto_minor < 9 ?
@@ -515,12 +531,18 @@ int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e)
 		return fuse_reply_err(req, ENOENT);
 
 	memset(&arg, 0, sizeof(arg));
-	fill_entry(&arg, e);
+	fill_entry(&arg, e, iflags);
 	return send_reply_ok(req, &arg, size);
 }
 
-int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
-		      const struct fuse_file_info *f)
+int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e)
+{
+	return fuse_reply_entry_iflags(req, e, 0);
+}
+
+int fuse_reply_create_iflags(fuse_req_t req, const struct fuse_entry_param *e,
+			     unsigned int iflags,
+			     const struct fuse_file_info *f)
 {
 	alignas(uint64_t) char buf[sizeof(struct fuse_entry_out) + sizeof(struct fuse_open_out)];
 	size_t entrysize = req->se->conn.proto_minor < 9 ?
@@ -529,12 +551,18 @@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
 	struct fuse_open_out *oarg = (struct fuse_open_out *) (buf + entrysize);
 
 	memset(buf, 0, sizeof(buf));
-	fill_entry(earg, e);
+	fill_entry(earg, e, iflags);
 	fill_open(oarg, f);
 	return send_reply_ok(req, buf,
 			     entrysize + sizeof(struct fuse_open_out));
 }
 
+int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
+		      const struct fuse_file_info *f)
+{
+	return fuse_reply_create_iflags(req, e, 0, f);
+}
+
 int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
 		    double attr_timeout)
 {
@@ -545,7 +573,22 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
 	memset(&arg, 0, sizeof(arg));
 	arg.attr_valid = calc_timeout_sec(attr_timeout);
 	arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
-	convert_stat(attr, &arg.attr);
+	convert_stat(attr, &arg.attr, 0);
+
+	return send_reply_ok(req, &arg, size);
+}
+
+int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr,
+			   unsigned int iflags, double attr_timeout)
+{
+	struct fuse_attr_out arg;
+	size_t size = req->se->conn.proto_minor < 9 ?
+		FUSE_COMPAT_ATTR_OUT_SIZE : sizeof(arg);
+
+	memset(&arg, 0, sizeof(arg));
+	arg.attr_valid = calc_timeout_sec(attr_timeout);
+	arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
+	convert_stat(attr, &arg.attr, iflags);
 
 	return send_reply_ok(req, &arg, size);
 }
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 4cdae6a6a42051..9207145624ba83 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -215,6 +215,10 @@ FUSE_3.18 {
 
 		fuse_reply_iomap_begin;
 		fuse_iomap_add_device;
+		fuse_reply_attr_iflags;
+		fuse_reply_create_iflags;
+		fuse_reply_entry_iflags;
+		fuse_add_direntry_plus_iflags;
 } FUSE_3.17;
 
 # Local Variables:



* [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-07-17 23:36   ` [PATCH 07/14] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong
@ 2025-07-17 23:36   ` Darrick J. Wong
  2025-07-18 14:27     ` Amir Goldstein
  2025-07-17 23:36   ` [PATCH 09/14] libfuse: add FUSE_IOMAP_DIRECTIO Darrick J. Wong
                     ` (5 subsequent siblings)
  13 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:36 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Create a new ->getattr_iflags function so that iomap filesystems can set
the appropriate in-kernel inode flags on instantiation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse.h |    7 ++
 lib/fuse.c     |  219 ++++++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 180 insertions(+), 46 deletions(-)


diff --git a/include/fuse.h b/include/fuse.h
index e2e7c950bf144d..f894dd5da0d106 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -876,6 +876,13 @@ struct fuse_operations {
 			    uint64_t attr_ino, off_t pos_in, size_t written_in,
 			    uint32_t ioendflags_in, int error_in,
 			    uint64_t new_addr_in);
+
+	/**
+	 * Get file attributes and FUSE_IFLAG_* flags.  Otherwise the same as
+	 * getattr.
+	 */
+	int (*getattr_iflags) (const char *path, struct stat *buf,
+			       unsigned int *iflags, struct fuse_file_info *fi);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
diff --git a/lib/fuse.c b/lib/fuse.c
index 8dbf88877dd37c..685d0181e569d0 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -123,6 +123,7 @@ struct fuse {
 	struct list_head partial_slabs;
 	struct list_head full_slabs;
 	pthread_t prune_thread;
+	bool want_iflags;
 };
 
 struct lock {
@@ -144,6 +145,7 @@ struct node {
 	char *name;
 	uint64_t nlookup;
 	int open_count;
+	unsigned int iflags;
 	struct timespec stat_updated;
 	struct timespec mtime;
 	off_t size;
@@ -1605,6 +1607,24 @@ int fuse_fs_getattr(struct fuse_fs *fs, const char *path, struct stat *buf,
 	return fs->op.getattr(path, buf, fi);
 }
 
+static int fuse_fs_getattr_iflags(struct fuse_fs *fs, const char *path,
+				  struct stat *buf, unsigned int *iflags,
+				  struct fuse_file_info *fi)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.getattr_iflags)
+		return -ENOSYS;
+
+	if (fs->debug) {
+		char buf[10];
+
+		fuse_log(FUSE_LOG_DEBUG, "getattr_iflags[%s] %s\n",
+			file_info_string(fi, buf, sizeof(buf)),
+			path);
+	}
+	return fs->op.getattr_iflags(path, buf, iflags, fi);
+}
+
 int fuse_fs_rename(struct fuse_fs *fs, const char *oldpath,
 		   const char *newpath, unsigned int flags)
 {
@@ -2417,7 +2437,7 @@ static void update_stat(struct node *node, const struct stat *stbuf)
 }
 
 static int do_lookup(struct fuse *f, fuse_ino_t nodeid, const char *name,
-		     struct fuse_entry_param *e)
+		     struct fuse_entry_param *e, unsigned int *iflags)
 {
 	struct node *node;
 
@@ -2435,25 +2455,59 @@ static int do_lookup(struct fuse *f, fuse_ino_t nodeid, const char *name,
 		pthread_mutex_unlock(&f->lock);
 	}
 	set_stat(f, e->ino, &e->attr);
+	*iflags = node->iflags;
+	return 0;
+}
+
+static int lookup_and_update(struct fuse *f, fuse_ino_t nodeid,
+			     const char *name, struct fuse_entry_param *e,
+			     unsigned int iflags)
+{
+	struct node *node;
+
+	node = find_node(f, nodeid, name);
+	if (node == NULL)
+		return -ENOMEM;
+
+	e->ino = node->nodeid;
+	e->generation = node->generation;
+	e->entry_timeout = f->conf.entry_timeout;
+	e->attr_timeout = f->conf.attr_timeout;
+	if (f->conf.auto_cache) {
+		pthread_mutex_lock(&f->lock);
+		update_stat(node, &e->attr);
+		pthread_mutex_unlock(&f->lock);
+	}
+	set_stat(f, e->ino, &e->attr);
+	node->iflags = iflags;
 	return 0;
 }
 
 static int lookup_path(struct fuse *f, fuse_ino_t nodeid,
 		       const char *name, const char *path,
-		       struct fuse_entry_param *e, struct fuse_file_info *fi)
+		       struct fuse_entry_param *e, unsigned int *iflags,
+		       struct fuse_file_info *fi)
 {
 	int res;
 
 	memset(e, 0, sizeof(struct fuse_entry_param));
-	res = fuse_fs_getattr(f->fs, path, &e->attr, fi);
-	if (res == 0) {
-		res = do_lookup(f, nodeid, name, e);
-		if (res == 0 && f->conf.debug) {
-			fuse_log(FUSE_LOG_DEBUG, "   NODEID: %llu\n",
-				(unsigned long long) e->ino);
-		}
-	}
-	return res;
+	*iflags = 0;
+	if (f->want_iflags)
+		res = fuse_fs_getattr_iflags(f->fs, path, &e->attr, iflags, fi);
+	else
+		res = fuse_fs_getattr(f->fs, path, &e->attr, fi);
+	if (res)
+		return res;
+
+	res = lookup_and_update(f, nodeid, name, e, *iflags);
+	if (res)
+		return res;
+
+	if (f->conf.debug)
+		fuse_log(FUSE_LOG_DEBUG, "   NODEID: %llu iflags 0x%x\n",
+			(unsigned long long) e->ino, *iflags);
+
+	return 0;
 }
 
 static struct fuse_context_i *fuse_get_context_internal(void)
@@ -2537,11 +2591,17 @@ static inline void reply_err(fuse_req_t req, int err)
 }
 
 static void reply_entry(fuse_req_t req, const struct fuse_entry_param *e,
-			int err)
+			unsigned int iflags, int err)
 {
 	if (!err) {
 		struct fuse *f = req_fuse(req);
-		if (fuse_reply_entry(req, e) == -ENOENT) {
+		int entry_res;
+
+		if (f->want_iflags)
+			entry_res = fuse_reply_entry_iflags(req, e, iflags);
+		else
+			entry_res = fuse_reply_entry(req, e);
+		if (entry_res == -ENOENT) {
 			/* Skip forget for negative result */
 			if  (e->ino != 0)
 				forget_node(f, e->ino, 1);
@@ -2582,6 +2642,9 @@ static void fuse_lib_init(void *data, struct fuse_conn_info *conn)
 		/* Disable the receiving and processing of FUSE_INTERRUPT requests */
 		conn->no_interrupt = 1;
 	}
+
+	if (fuse_get_feature_flag(conn, FUSE_CAP_IOMAP))
+		f->want_iflags = true;
 }
 
 void fuse_fs_destroy(struct fuse_fs *fs)
@@ -2605,6 +2668,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
 	struct fuse *f = req_fuse_prepare(req);
 	struct fuse_entry_param e;
 	char *path;
+	unsigned int iflags = 0;
 	int err;
 	struct node *dot = NULL;
 
@@ -2619,7 +2683,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
 				dot = get_node_nocheck(f, parent);
 				if (dot == NULL) {
 					pthread_mutex_unlock(&f->lock);
-					reply_entry(req, &e, -ESTALE);
+					reply_entry(req, &e, 0, -ESTALE);
 					return;
 				}
 				dot->refctr++;
@@ -2639,7 +2703,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
 		if (f->conf.debug)
 			fuse_log(FUSE_LOG_DEBUG, "LOOKUP %s\n", path);
 		fuse_prepare_interrupt(f, req, &d);
-		err = lookup_path(f, parent, name, path, &e, NULL);
+		err = lookup_path(f, parent, name, path, &e, &iflags, NULL);
 		if (err == -ENOENT && f->conf.negative_timeout != 0.0) {
 			e.ino = 0;
 			e.entry_timeout = f->conf.negative_timeout;
@@ -2653,7 +2717,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
 		unref_node(f, dot);
 		pthread_mutex_unlock(&f->lock);
 	}
-	reply_entry(req, &e, err);
+	reply_entry(req, &e, iflags, err);
 }
 
 static void do_forget(struct fuse *f, fuse_ino_t ino, uint64_t nlookup)
@@ -2689,6 +2753,7 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino,
 	struct fuse *f = req_fuse_prepare(req);
 	struct stat buf;
 	char *path;
+	unsigned int iflags = 0;
 	int err;
 
 	memset(&buf, 0, sizeof(buf));
@@ -2700,7 +2765,11 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino,
 	if (!err) {
 		struct fuse_intr_data d;
 		fuse_prepare_interrupt(f, req, &d);
-		err = fuse_fs_getattr(f->fs, path, &buf, fi);
+		if (f->want_iflags)
+			err = fuse_fs_getattr_iflags(f->fs, path, &buf,
+						     &iflags, fi);
+		else
+			err = fuse_fs_getattr(f->fs, path, &buf, fi);
 		fuse_finish_interrupt(f, req, &d);
 		free_path(f, ino, path);
 	}
@@ -2713,9 +2782,14 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino,
 			buf.st_nlink--;
 		if (f->conf.auto_cache)
 			update_stat(node, &buf);
+		node->iflags = iflags;
 		pthread_mutex_unlock(&f->lock);
 		set_stat(f, ino, &buf);
-		fuse_reply_attr(req, &buf, f->conf.attr_timeout);
+		if (f->want_iflags)
+			fuse_reply_attr_iflags(req, &buf, iflags,
+					       f->conf.attr_timeout);
+		else
+			fuse_reply_attr(req, &buf, f->conf.attr_timeout);
 	} else
 		reply_err(req, err);
 }
@@ -2802,6 +2876,7 @@ static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
 	struct fuse *f = req_fuse_prepare(req);
 	struct stat buf;
 	char *path;
+	unsigned int iflags = 0;
 	int err;
 
 	memset(&buf, 0, sizeof(buf));
@@ -2860,19 +2935,30 @@ static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
 			err = fuse_fs_utimens(f->fs, path, tv, fi);
 		}
 		if (!err) {
-			err = fuse_fs_getattr(f->fs, path, &buf, fi);
+			if (f->want_iflags)
+				err = fuse_fs_getattr_iflags(f->fs, path, &buf,
+							     &iflags, fi);
+			else
+				err = fuse_fs_getattr(f->fs, path, &buf, fi);
 		}
 		fuse_finish_interrupt(f, req, &d);
 		free_path(f, ino, path);
 	}
 	if (!err) {
-		if (f->conf.auto_cache) {
-			pthread_mutex_lock(&f->lock);
-			update_stat(get_node(f, ino), &buf);
-			pthread_mutex_unlock(&f->lock);
-		}
+		struct node *node;
+
+		pthread_mutex_lock(&f->lock);
+		node = get_node(f, ino);
+		if (f->conf.auto_cache)
+			update_stat(node, &buf);
+		node->iflags = iflags;
+		pthread_mutex_unlock(&f->lock);
 		set_stat(f, ino, &buf);
-		fuse_reply_attr(req, &buf, f->conf.attr_timeout);
+		if (f->want_iflags)
+			fuse_reply_attr_iflags(req, &buf, iflags,
+					       f->conf.attr_timeout);
+		else
+			fuse_reply_attr(req, &buf, f->conf.attr_timeout);
 	} else
 		reply_err(req, err);
 }
@@ -2923,6 +3009,7 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name,
 	struct fuse *f = req_fuse_prepare(req);
 	struct fuse_entry_param e;
 	char *path;
+	unsigned int iflags = 0;
 	int err;
 
 	err = get_path_name(f, parent, name, &path);
@@ -2939,7 +3026,7 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name,
 			err = fuse_fs_create(f->fs, path, mode, &fi);
 			if (!err) {
 				err = lookup_path(f, parent, name, path, &e,
-						  &fi);
+						  &iflags, &fi);
 				fuse_fs_release(f->fs, path, &fi);
 			}
 		}
@@ -2947,12 +3034,12 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name,
 			err = fuse_fs_mknod(f->fs, path, mode, rdev);
 			if (!err)
 				err = lookup_path(f, parent, name, path, &e,
-						  NULL);
+						  &iflags, NULL);
 		}
 		fuse_finish_interrupt(f, req, &d);
 		free_path(f, parent, path);
 	}
-	reply_entry(req, &e, err);
+	reply_entry(req, &e, iflags, err);
 }
 
 static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name,
@@ -2961,6 +3048,7 @@ static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name,
 	struct fuse *f = req_fuse_prepare(req);
 	struct fuse_entry_param e;
 	char *path;
+	unsigned int iflags = 0;
 	int err;
 
 	err = get_path_name(f, parent, name, &path);
@@ -2970,11 +3058,12 @@ static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name,
 		fuse_prepare_interrupt(f, req, &d);
 		err = fuse_fs_mkdir(f->fs, path, mode);
 		if (!err)
-			err = lookup_path(f, parent, name, path, &e, NULL);
+			err = lookup_path(f, parent, name, path, &e, &iflags,
+					  NULL);
 		fuse_finish_interrupt(f, req, &d);
 		free_path(f, parent, path);
 	}
-	reply_entry(req, &e, err);
+	reply_entry(req, &e, iflags, err);
 }
 
 static void fuse_lib_unlink(fuse_req_t req, fuse_ino_t parent,
@@ -3044,6 +3133,7 @@ static void fuse_lib_symlink(fuse_req_t req, const char *linkname,
 	struct fuse *f = req_fuse_prepare(req);
 	struct fuse_entry_param e;
 	char *path;
+	unsigned int iflags = 0;
 	int err;
 
 	err = get_path_name(f, parent, name, &path);
@@ -3053,11 +3143,12 @@ static void fuse_lib_symlink(fuse_req_t req, const char *linkname,
 		fuse_prepare_interrupt(f, req, &d);
 		err = fuse_fs_symlink(f->fs, linkname, path);
 		if (!err)
-			err = lookup_path(f, parent, name, path, &e, NULL);
+			err = lookup_path(f, parent, name, path, &e, &iflags,
+					  NULL);
 		fuse_finish_interrupt(f, req, &d);
 		free_path(f, parent, path);
 	}
-	reply_entry(req, &e, err);
+	reply_entry(req, &e, iflags, err);
 }
 
 static void fuse_lib_rename(fuse_req_t req, fuse_ino_t olddir,
@@ -3105,6 +3196,7 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent,
 	struct fuse_entry_param e;
 	char *oldpath;
 	char *newpath;
+	unsigned int iflags = 0;
 	int err;
 
 	err = get_path2(f, ino, NULL, newparent, newname,
@@ -3116,11 +3208,11 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent,
 		err = fuse_fs_link(f->fs, oldpath, newpath);
 		if (!err)
 			err = lookup_path(f, newparent, newname, newpath,
-					  &e, NULL);
+					  &e, &iflags, NULL);
 		fuse_finish_interrupt(f, req, &d);
 		free_path2(f, ino, newparent, NULL, NULL, oldpath, newpath);
 	}
-	reply_entry(req, &e, err);
+	reply_entry(req, &e, iflags, err);
 }
 
 static void fuse_do_release(struct fuse *f, fuse_ino_t ino, const char *path,
@@ -3163,6 +3255,7 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent,
 	struct fuse_intr_data d;
 	struct fuse_entry_param e;
 	char *path;
+	unsigned int iflags = 0;
 	int err;
 
 	err = get_path_name(f, parent, name, &path);
@@ -3170,7 +3263,8 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent,
 		fuse_prepare_interrupt(f, req, &d);
 		err = fuse_fs_create(f->fs, path, mode, fi);
 		if (!err) {
-			err = lookup_path(f, parent, name, path, &e, fi);
+			err = lookup_path(f, parent, name, path, &e,
+					  &iflags, fi);
 			if (err)
 				fuse_fs_release(f->fs, path, fi);
 			else if (!S_ISREG(e.attr.st_mode)) {
@@ -3190,10 +3284,18 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent,
 		fuse_finish_interrupt(f, req, &d);
 	}
 	if (!err) {
+		int create_res;
+
 		pthread_mutex_lock(&f->lock);
 		get_node(f, e.ino)->open_count++;
 		pthread_mutex_unlock(&f->lock);
-		if (fuse_reply_create(req, &e, fi) == -ENOENT) {
+
+		if (f->want_iflags)
+			create_res = fuse_reply_create_iflags(req, &e, iflags,
+							      fi);
+		else
+			create_res = fuse_reply_create(req, &e, fi);
+		if (create_res == -ENOENT) {
 			/* The open syscall was interrupted, so it
 			   must be cancelled */
 			fuse_do_release(f, e.ino, path, fi);
@@ -3227,13 +3329,21 @@ static void open_auto_cache(struct fuse *f, fuse_ino_t ino, const char *path,
 		if (diff_timespec(&now, &node->stat_updated) >
 		    f->conf.ac_attr_timeout) {
 			struct stat stbuf;
+			unsigned int iflags = 0;
 			int err;
+
 			pthread_mutex_unlock(&f->lock);
-			err = fuse_fs_getattr(f->fs, path, &stbuf, fi);
+			if (f->want_iflags)
+				err = fuse_fs_getattr_iflags(f->fs, path,
+							     &stbuf, &iflags,
+							     fi);
+			else
+				err = fuse_fs_getattr(f->fs, path, &stbuf, fi);
 			pthread_mutex_lock(&f->lock);
-			if (!err)
+			if (!err) {
 				update_stat(node, &stbuf);
-			else
+				node->iflags = iflags;
+			} else
 				node->cache_valid = 0;
 		}
 	}
@@ -3562,6 +3672,7 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
 		.ino = 0,
 	};
 	struct fuse *f = dh->fuse;
+	unsigned int iflags = 0;
 	int res;
 
 	if ((flags & ~FUSE_FILL_DIR_PLUS) != 0) {
@@ -3586,6 +3697,7 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
 
 	if (off) {
 		size_t newlen;
+		size_t thislen;
 
 		if (dh->filled) {
 			dh->error = -EIO;
@@ -3601,7 +3713,8 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
 
 		if (statp && (flags & FUSE_FILL_DIR_PLUS)) {
 			if (!is_dot_or_dotdot(name)) {
-				res = do_lookup(f, dh->nodeid, name, &e);
+				res = do_lookup(f, dh->nodeid, name, &e,
+						&iflags);
 				if (res) {
 					dh->error = res;
 					return 1;
@@ -3609,10 +3722,17 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
 			}
 		}
 
-		newlen = dh->len +
-			fuse_add_direntry_plus(dh->req, dh->contents + dh->len,
-					       dh->needlen - dh->len, name,
-					       &e, off);
+		if (f->want_iflags)
+			thislen = fuse_add_direntry_plus_iflags(dh->req,
+					dh->contents + dh->len,
+					dh->needlen - dh->len, name, iflags,
+					&e, off);
+		else
+			thislen = fuse_add_direntry_plus(dh->req,
+					dh->contents + dh->len,
+					dh->needlen - dh->len, name, &e, off);
+		newlen = dh->len + thislen;
+
 		if (newlen > dh->needlen)
 			return 1;
 		dh->len = newlen;
@@ -3679,6 +3799,7 @@ static int readdir_fill(struct fuse *f, fuse_req_t req, fuse_ino_t ino,
 static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
 				  off_t off, enum fuse_readdir_flags flags)
 {
+	struct fuse *f = req_fuse_prepare(req);
 	off_t pos;
 	struct fuse_direntry *de = dh->first;
 	int res;
@@ -3699,6 +3820,7 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
 		unsigned rem = dh->needlen - dh->len;
 		unsigned thislen;
 		unsigned newlen;
+		unsigned int iflags = 0;
 		pos++;
 
 		if (flags & FUSE_READDIR_PLUS) {
@@ -3710,14 +3832,19 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
 			if (de->flags & FUSE_FILL_DIR_PLUS &&
 			    !is_dot_or_dotdot(de->name)) {
 				res = do_lookup(dh->fuse, dh->nodeid,
-						de->name, &e);
+						de->name, &e, &iflags);
 				if (res) {
 					dh->error = res;
 					return 1;
 				}
 			}
 
-			thislen = fuse_add_direntry_plus(req, p, rem,
+			if (f->want_iflags)
+				thislen = fuse_add_direntry_plus_iflags(req, p,
+							 rem, de->name, iflags,
+							 &e, pos);
+			else
+				thislen = fuse_add_direntry_plus(req, p, rem,
 							 de->name, &e, pos);
 		} else {
 			thislen = fuse_add_direntry(req, p, rem,



* [PATCH 09/14] libfuse: add FUSE_IOMAP_DIRECTIO
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-07-17 23:36   ` [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong
@ 2025-07-17 23:36   ` Darrick J. Wong
  2025-07-17 23:37   ` [PATCH 10/14] libfuse: add FUSE_IOMAP_FILEIO Darrick J. Wong
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:36 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Make it so that fuse servers can ask the kernel fuse driver to use iomap
to support direct IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h |    7 +++++++
 include/fuse_kernel.h |    5 +++++
 lib/fuse_lowlevel.c   |    9 +++++++++
 3 files changed, 21 insertions(+)
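
For readers following along, here is a minimal sketch of how a server's
init path might opt in to iomap direct IO using the capability bits added
below.  The struct demo_conn type and demo_want_iomap_directio() are
hypothetical stand-ins for struct fuse_conn_info and a server's own init
handler.  Requesting FUSE_CAP_IOMAP together with FUSE_CAP_IOMAP_DIRECTIO
is an assumption about sensible usage, not something this patch enforces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Capability bits copied from this series (include/fuse_common.h). */
#define FUSE_CAP_IOMAP			(1ULL << 32)
#define FUSE_CAP_IOMAP_DIRECTIO		(1ULL << 33)

/* Hypothetical stand-in for the two struct fuse_conn_info fields used here. */
struct demo_conn {
	uint64_t capable_ext;	/* what the kernel advertised */
	uint64_t want_ext;	/* what the server asks for */
};

/*
 * Ask for iomap direct IO only if the kernel advertised both the base
 * iomap capability and the directio capability.  Returns true if the
 * request was recorded in want_ext, false if nothing was changed.
 */
bool demo_want_iomap_directio(struct demo_conn *conn)
{
	const uint64_t need = FUSE_CAP_IOMAP | FUSE_CAP_IOMAP_DIRECTIO;

	if ((conn->capable_ext & need) != need)
		return false;

	conn->want_ext |= need;
	return true;
}
```

Because LL_SET_DEFAULT(0, ...) leaves the bit off by default, a server
that never sets it keeps the existing direct IO path.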


diff --git a/include/fuse_common.h b/include/fuse_common.h
index 11eb22d011896c..657256b6309284 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -525,6 +525,11 @@ struct fuse_loop_config_v1 {
  */
 #define FUSE_CAP_IOMAP (1ULL << 32)
 
+/**
+ * Client supports using iomap for direct I/O file operations
+ */
+#define FUSE_CAP_IOMAP_DIRECTIO (1ULL << 33)
+
 /**
  * Ioctl flags
  *
@@ -1212,6 +1217,8 @@ struct fuse_iomap {
 
 /* enable fsdax */
 #define FUSE_IFLAG_DAX			(1U << 0)
+/* use iomap for directio */
+#define FUSE_IFLAG_IOMAP_DIRECTIO	(1U << 1)
 
 #endif /* FUSE_USE_VERSION >= 318 */
 
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index a06c16243a7885..7205de018634b9 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -240,6 +240,7 @@
  *  - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
  *    SEEK_{DATA,HOLE} support
  *  - add FUSE_DEV_IOC_IOMAP_DEV_ADD to configure block devices for iomap
+ *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -450,6 +451,7 @@ struct fuse_file_lock {
  *			 init_out.request_timeout contains the timeout (in secs)
  * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
  *	       operations.
+ * FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -498,6 +500,7 @@ struct fuse_file_lock {
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
 #define FUSE_IOMAP		(1ULL << 43)
+#define FUSE_IOMAP_DIRECTIO	(1ULL << 44)
 
 /**
  * CUSE INIT request/reply flags
@@ -581,9 +584,11 @@ struct fuse_file_lock {
  *
  * FUSE_ATTR_SUBMOUNT: Object is a submount root
  * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
+ * FUSE_ATTR_IOMAP_DIRECTIO: Use iomap for directio
  */
 #define FUSE_ATTR_SUBMOUNT      (1 << 0)
 #define FUSE_ATTR_DAX		(1 << 1)
+#define FUSE_ATTR_IOMAP_DIRECTIO	(1 << 2)
 
 /**
  * Open flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 568db13502a7d7..f98900c51d4a9b 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -124,6 +124,8 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
 	attr->flags	= 0;
 	if (iflags & FUSE_IFLAG_DAX)
 		attr->flags |= FUSE_ATTR_DAX;
+	if (iflags & FUSE_IFLAG_IOMAP_DIRECTIO)
+		attr->flags |= FUSE_ATTR_IOMAP_DIRECTIO;
 }
 
 static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
@@ -2777,6 +2779,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			se->conn.capable_ext |= FUSE_CAP_OVER_IO_URING;
 		if (inargflags & FUSE_IOMAP)
 			se->conn.capable_ext |= FUSE_CAP_IOMAP;
+		if (inargflags & FUSE_IOMAP_DIRECTIO)
+			se->conn.capable_ext |= FUSE_CAP_IOMAP_DIRECTIO;
 	} else {
 		se->conn.max_readahead = 0;
 	}
@@ -2824,6 +2828,7 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 
 	/* servers need to opt-in to iomap explicitly */
 	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP);
+	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP_DIRECTIO);
 
 	/* This could safely become default, but libfuse needs an API extension
 	 * to support it
@@ -2945,6 +2950,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 	}
 	if (se->conn.want_ext & FUSE_CAP_IOMAP)
 		outargflags |= FUSE_IOMAP;
+	if (se->conn.want_ext & FUSE_CAP_IOMAP_DIRECTIO)
+		outargflags |= FUSE_IOMAP_DIRECTIO;
 
 	if (inargflags & FUSE_INIT_EXT) {
 		outargflags |= FUSE_INIT_EXT;
@@ -2988,6 +2995,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 				outarg.max_stack_depth);
 		if (se->conn.want_ext & FUSE_CAP_IOMAP)
 			fuse_log(FUSE_LOG_DEBUG, "   iomap=1\n");
+		if (se->conn.want_ext & FUSE_CAP_IOMAP_DIRECTIO)
+			fuse_log(FUSE_LOG_DEBUG, "   iomap_directio=1\n");
 	}
 	if (arg->minor < 5)
 		outargsize = FUSE_COMPAT_INIT_OUT_SIZE;



* [PATCH 10/14] libfuse: add FUSE_IOMAP_FILEIO
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-07-17 23:36   ` [PATCH 09/14] libfuse: add FUSE_IOMAP_DIRECTIO Darrick J. Wong
@ 2025-07-17 23:37   ` Darrick J. Wong
  2025-07-17 23:37   ` [PATCH 11/14] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
                     ` (3 subsequent siblings)
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:37 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Make it so that fuse servers can ask the kernel fuse driver to use iomap
to support buffered IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h |    7 +++++++
 include/fuse_kernel.h |    5 +++++
 lib/fuse_lowlevel.c   |    9 +++++++++
 3 files changed, 21 insertions(+)
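
The server-facing FUSE_IFLAG_* bits and the wire-format FUSE_ATTR_* bits
do not share bit positions, so convert_stat() translates them one by one.
The sketch below restates that mapping as a standalone function.  The
constants are copied from the hunks in this patch, and
demo_iflags_to_attr_flags() is a hypothetical name used only for
illustration.

```c
#include <stdint.h>

/* Server-facing inode flags, as defined in include/fuse_common.h here. */
#define FUSE_IFLAG_DAX			(1U << 0)
#define FUSE_IFLAG_IOMAP_DIRECTIO	(1U << 1)
#define FUSE_IFLAG_IOMAP_FILEIO		(1U << 2)

/* Wire-format attr flags, as defined in include/fuse_kernel.h here. */
#define FUSE_ATTR_SUBMOUNT		(1 << 0)
#define FUSE_ATTR_DAX			(1 << 1)
#define FUSE_ATTR_IOMAP_DIRECTIO	(1 << 2)
#define FUSE_ATTR_IOMAP_FILEIO		(1 << 3)

/* Translate each iflag bit to its attr bit; unknown bits are dropped. */
uint32_t demo_iflags_to_attr_flags(unsigned int iflags)
{
	uint32_t attr_flags = 0;

	if (iflags & FUSE_IFLAG_DAX)
		attr_flags |= FUSE_ATTR_DAX;
	if (iflags & FUSE_IFLAG_IOMAP_DIRECTIO)
		attr_flags |= FUSE_ATTR_IOMAP_DIRECTIO;
	if (iflags & FUSE_IFLAG_IOMAP_FILEIO)
		attr_flags |= FUSE_ATTR_IOMAP_FILEIO;

	return attr_flags;
}
```

FUSE_ATTR_SUBMOUNT has no iflag counterpart in this series, so it never
appears in the result.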


diff --git a/include/fuse_common.h b/include/fuse_common.h
index 657256b6309284..8bc21677b6e5c7 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -530,6 +530,11 @@ struct fuse_loop_config_v1 {
  */
 #define FUSE_CAP_IOMAP_DIRECTIO (1ULL << 33)
 
+/**
+ * Client supports using iomap for buffered I/O file operations
+ */
+#define FUSE_CAP_IOMAP_FILEIO (1ULL << 34)
+
 /**
  * Ioctl flags
  *
@@ -1219,6 +1224,8 @@ struct fuse_iomap {
 #define FUSE_IFLAG_DAX			(1U << 0)
 /* use iomap for directio */
 #define FUSE_IFLAG_IOMAP_DIRECTIO	(1U << 1)
+/* use iomap for buffered io */
+#define FUSE_IFLAG_IOMAP_FILEIO		(1U << 2)
 
 #endif /* FUSE_USE_VERSION >= 318 */
 
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 7205de018634b9..17ab74255cbf33 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -241,6 +241,7 @@
  *    SEEK_{DATA,HOLE} support
  *  - add FUSE_DEV_IOC_IOMAP_DEV_ADD to configure block devices for iomap
  *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
+ *  - add FUSE_IOMAP_FILEIO/FUSE_ATTR_IOMAP_FILEIO for buffered I/O support
  */
 
 #ifndef _LINUX_FUSE_H
@@ -452,6 +453,7 @@ struct fuse_file_lock {
  * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
  *	       operations.
  * FUSE_IOMAP_DIRECTIO: Client supports iomap for direct I/O operations.
+ * FUSE_IOMAP_FILEIO: Client supports iomap for buffered I/O operations.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -501,6 +503,7 @@ struct fuse_file_lock {
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
 #define FUSE_IOMAP		(1ULL << 43)
 #define FUSE_IOMAP_DIRECTIO	(1ULL << 44)
+#define FUSE_IOMAP_FILEIO	(1ULL << 45)
 
 /**
  * CUSE INIT request/reply flags
@@ -585,10 +588,12 @@ struct fuse_file_lock {
  * FUSE_ATTR_SUBMOUNT: Object is a submount root
  * FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
  * FUSE_ATTR_IOMAP_DIRECTIO: Use iomap for directio
+ * FUSE_ATTR_IOMAP_FILEIO: Use iomap for buffered io
  */
 #define FUSE_ATTR_SUBMOUNT      (1 << 0)
 #define FUSE_ATTR_DAX		(1 << 1)
 #define FUSE_ATTR_IOMAP_DIRECTIO	(1 << 2)
+#define FUSE_ATTR_IOMAP_FILEIO	(1 << 3)
 
 /**
  * Open flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index f98900c51d4a9b..d354b947a4fb6b 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -126,6 +126,8 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
 		attr->flags |= FUSE_ATTR_DAX;
 	if (iflags & FUSE_IFLAG_IOMAP_DIRECTIO)
 		attr->flags |= FUSE_ATTR_IOMAP_DIRECTIO;
+	if (iflags & FUSE_IFLAG_IOMAP_FILEIO)
+		attr->flags |= FUSE_ATTR_IOMAP_FILEIO;
 }
 
 static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
@@ -2781,6 +2783,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			se->conn.capable_ext |= FUSE_CAP_IOMAP;
 		if (inargflags & FUSE_IOMAP_DIRECTIO)
 			se->conn.capable_ext |= FUSE_CAP_IOMAP_DIRECTIO;
+		if (inargflags & FUSE_IOMAP_FILEIO)
+			se->conn.capable_ext |= FUSE_CAP_IOMAP_FILEIO;
 	} else {
 		se->conn.max_readahead = 0;
 	}
@@ -2829,6 +2833,7 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 	/* servers need to opt-in to iomap explicitly */
 	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP);
 	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP_DIRECTIO);
+	LL_SET_DEFAULT(0, FUSE_CAP_IOMAP_FILEIO);
 
 	/* This could safely become default, but libfuse needs an API extension
 	 * to support it
@@ -2952,6 +2957,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 		outargflags |= FUSE_IOMAP;
 	if (se->conn.want_ext & FUSE_CAP_IOMAP_DIRECTIO)
 		outargflags |= FUSE_IOMAP_DIRECTIO;
+	if (se->conn.want_ext & FUSE_CAP_IOMAP_FILEIO)
+		outargflags |= FUSE_IOMAP_FILEIO;
 
 	if (inargflags & FUSE_INIT_EXT) {
 		outargflags |= FUSE_INIT_EXT;
@@ -2997,6 +3004,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
 			fuse_log(FUSE_LOG_DEBUG, "   iomap=1\n");
 		if (se->conn.want_ext & FUSE_CAP_IOMAP_DIRECTIO)
 			fuse_log(FUSE_LOG_DEBUG, "   iomap_directio=1\n");
+		if (se->conn.want_ext & FUSE_CAP_IOMAP_FILEIO)
+			fuse_log(FUSE_LOG_DEBUG, "   iomap_fileio=1\n");
 	}
 	if (arg->minor < 5)
 		outargsize = FUSE_COMPAT_INIT_OUT_SIZE;



* [PATCH 11/14] libfuse: allow discovery of the kernel's iomap capabilities
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (9 preceding siblings ...)
  2025-07-17 23:37   ` [PATCH 10/14] libfuse: add FUSE_IOMAP_FILEIO Darrick J. Wong
@ 2025-07-17 23:37   ` Darrick J. Wong
  2025-07-17 23:37   ` [PATCH 12/14] libfuse: add lower level iomap_config implementation Darrick J. Wong
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:37 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Create a library function so that we can discover the kernel's iomap
capabilities ahead of time.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_kernel.h   |   13 +++++++++++++
 include/fuse_lowlevel.h |    5 +++++
 lib/fuse_lowlevel.c     |   28 ++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    1 +
 4 files changed, 47 insertions(+)
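
As a rough illustration of what fuse_discover_iomap() hands back, the
sketch below isolates the translation from the FUSE_IOMAP_SUPPORT_* bits
reported by the ioctl into FUSE_CAP_IOMAP_* bits.  The constants are
copied from this series; demo_support_to_caps() is a hypothetical helper
that only mirrors the tail of the function, with the open() and ioctl()
steps left out so that it can run anywhere.

```c
#include <stdint.h>

/* Bits reported by FUSE_DEV_IOC_IOMAP_SUPPORT (include/fuse_kernel.h). */
#define FUSE_IOMAP_SUPPORT_BASICS	(1ULL << 0)
#define FUSE_IOMAP_SUPPORT_DIRECTIO	(1ULL << 1)
#define FUSE_IOMAP_SUPPORT_FILEIO	(1ULL << 2)

/* Capability bits seen by fuse servers (include/fuse_common.h). */
#define FUSE_CAP_IOMAP			(1ULL << 32)
#define FUSE_CAP_IOMAP_DIRECTIO		(1ULL << 33)
#define FUSE_CAP_IOMAP_FILEIO		(1ULL << 34)

/* Map kernel support bits to capability bits; unknown bits are ignored. */
uint64_t demo_support_to_caps(uint64_t support_flags)
{
	uint64_t caps = 0;

	if (support_flags & FUSE_IOMAP_SUPPORT_BASICS)
		caps |= FUSE_CAP_IOMAP;
	if (support_flags & FUSE_IOMAP_SUPPORT_DIRECTIO)
		caps |= FUSE_CAP_IOMAP_DIRECTIO;
	if (support_flags & FUSE_IOMAP_SUPPORT_FILEIO)
		caps |= FUSE_CAP_IOMAP_FILEIO;

	return caps;
}
```

A return value of zero covers both "the ioctl failed" and "the kernel
supports nothing", which matches how fuse_discover_iomap() reports either
case to its caller.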


diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 17ab74255cbf33..7a1226d6bc2c0a 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -1142,6 +1142,17 @@ struct fuse_backing_map {
 	uint64_t	padding;
 };
 
+/* basic reporting functionality */
+#define FUSE_IOMAP_SUPPORT_BASICS	(1ULL << 0)
+/* fuse driver can do direct io */
+#define FUSE_IOMAP_SUPPORT_DIRECTIO	(1ULL << 1)
+/* fuse driver can do buffered io */
+#define FUSE_IOMAP_SUPPORT_FILEIO	(1ULL << 2)
+struct fuse_iomap_support {
+	uint64_t	flags;
+	uint64_t	padding;
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
@@ -1150,6 +1161,8 @@ struct fuse_backing_map {
 #define FUSE_DEV_IOC_BACKING_CLOSE	_IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
 #define FUSE_DEV_IOC_IOMAP_DEV_ADD	_IOW(FUSE_DEV_IOC_MAGIC, 3, \
 					     struct fuse_backing_map)
+#define FUSE_DEV_IOC_IOMAP_SUPPORT	_IOR(FUSE_DEV_IOC_MAGIC, 4, \
+					     struct fuse_iomap_support)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 07748abcf079cf..a529a112998d6e 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2503,6 +2503,11 @@ int fuse_session_receive_buf(struct fuse_session *se, struct fuse_buf *buf);
  */
 bool fuse_req_is_uring(fuse_req_t req);
 
+/**
+ * Discover the kernel's iomap capabilities.  Returns FUSE_CAP_IOMAP_* flags.
+ */
+uint64_t fuse_discover_iomap(void);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index d354b947a4fb6b..0c7d5cc99945ee 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -4490,3 +4490,31 @@ int fuse_session_exited(struct fuse_session *se)
 
 	return exited ? 1 : 0;
 }
+
+uint64_t fuse_discover_iomap(void)
+{
+	struct fuse_iomap_support ios;
+	uint64_t ret = 0;
+	int fd;
+
+	fd = open("/dev/fuse", O_RDONLY | O_CLOEXEC);
+	if (fd < 0)
+		return 0;
+
+	ret = ioctl(fd, FUSE_DEV_IOC_IOMAP_SUPPORT, &ios);
+	if (ret) {
+		ret = 0;
+		goto out_close;
+	}
+
+	if (ios.flags & FUSE_IOMAP_SUPPORT_BASICS)
+		ret |= FUSE_CAP_IOMAP;
+	if (ios.flags & FUSE_IOMAP_SUPPORT_DIRECTIO)
+		ret |= FUSE_CAP_IOMAP_DIRECTIO;
+	if (ios.flags & FUSE_IOMAP_SUPPORT_FILEIO)
+		ret |= FUSE_CAP_IOMAP_FILEIO;
+
+out_close:
+	close(fd);
+	return ret;
+}
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 9207145624ba83..606fdc6127462e 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -219,6 +219,7 @@ FUSE_3.18 {
 		fuse_reply_create_iflags;
 		fuse_reply_entry_iflags;
 		fuse_add_direntry_plus_iflags;
+		fuse_discover_iomap;
 } FUSE_3.17;
 
 # Local Variables:



* [PATCH 12/14] libfuse: add lower level iomap_config implementation
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (10 preceding siblings ...)
  2025-07-17 23:37   ` [PATCH 11/14] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
@ 2025-07-17 23:37   ` Darrick J. Wong
  2025-07-17 23:37   ` [PATCH 13/14] libfuse: add upper " Darrick J. Wong
  2025-07-17 23:38   ` [PATCH 14/14] libfuse: add strictatime/lazytime mount options Darrick J. Wong
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:37 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Add FUSE_IOMAP_CONFIG helpers to the low level fuse library.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h   |   31 ++++++++++++++++++++++++++
 include/fuse_kernel.h   |   30 ++++++++++++++++++++++++++
 include/fuse_lowlevel.h |   25 +++++++++++++++++++++
 lib/fuse_lowlevel.c     |   55 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    1 +
 5 files changed, 142 insertions(+)
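
One behaviour of fuse_reply_iomap_config() worth spelling out: only
fields whose FUSE_IOMAP_CONFIG_* bit is set are copied to the wire, and
the UUID length is clamped to the size of the buffer.  The sketch below
reproduces that logic on a reduced, hypothetical struct
(demo_iomap_config) so it can be compiled on its own.  The bit values
match the ones in this patch.

```c
#include <stdint.h>
#include <string.h>

/* Same bit values as FUSE_IOMAP_CONFIG_{UUID,BLOCKSIZE} in this patch. */
#define FUSE_IOMAP_CONFIG_UUID		(1ULL << 1)
#define FUSE_IOMAP_CONFIG_BLOCKSIZE	(1ULL << 2)

/* Reduced, hypothetical copy of struct fuse_iomap_config. */
struct demo_iomap_config {
	uint64_t flags;
	char s_uuid[16];
	uint8_t s_uuid_len;
	uint32_t s_blocksize;
};

/* Copy only the flagged fields from cfg to out; everything else is zero. */
void demo_pack_config(struct demo_iomap_config *out,
		      const struct demo_iomap_config *cfg)
{
	memset(out, 0, sizeof(*out));
	out->flags = cfg->flags;

	if (cfg->flags & FUSE_IOMAP_CONFIG_BLOCKSIZE)
		out->s_blocksize = cfg->s_blocksize;

	if (cfg->flags & FUSE_IOMAP_CONFIG_UUID) {
		/* Clamp an oversized length instead of overrunning s_uuid. */
		out->s_uuid_len = cfg->s_uuid_len;
		if (out->s_uuid_len > sizeof(out->s_uuid))
			out->s_uuid_len = sizeof(out->s_uuid);
		memcpy(out->s_uuid, cfg->s_uuid, out->s_uuid_len);
	}
}
```

The practical consequence for a server is that setting s_blocksize
without FUSE_IOMAP_CONFIG_BLOCKSIZE has no effect on the reply.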


diff --git a/include/fuse_common.h b/include/fuse_common.h
index 8bc21677b6e5c7..98cb8f656efd13 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1227,6 +1227,37 @@ struct fuse_iomap {
 /* use iomap for buffered io */
 #define FUSE_IFLAG_IOMAP_FILEIO		(1U << 2)
 
+/* Which fields are set in fuse_iomap_config_out? */
+#define FUSE_IOMAP_CONFIG_SID		(1ULL << 0)
+#define FUSE_IOMAP_CONFIG_UUID		(1ULL << 1)
+#define FUSE_IOMAP_CONFIG_BLOCKSIZE	(1ULL << 2)
+#define FUSE_IOMAP_CONFIG_MAX_LINKS	(1ULL << 3)
+#define FUSE_IOMAP_CONFIG_TIME		(1ULL << 4)
+#define FUSE_IOMAP_CONFIG_MAXBYTES	(1ULL << 5)
+
+struct fuse_iomap_config{
+	uint64_t flags;		/* FUSE_IOMAP_CONFIG_* */
+
+	char s_id[32];		/* Informational name */
+	char s_uuid[16];	/* UUID */
+
+	uint8_t s_uuid_len;	/* length of s_uuid */
+
+	uint8_t s_pad[3];	/* must be zeroes */
+
+	uint32_t s_blocksize;	/* fs block size */
+	uint32_t s_max_links;	/* max hard links */
+
+	/* Granularity of c/m/atime in ns (cannot be worse than a second) */
+	uint32_t s_time_gran;
+
+	/* Time limits for c/m/atime in seconds */
+	int64_t s_time_min;
+	int64_t s_time_max;
+
+	int64_t s_maxbytes;	/* max file size */
+};
+
 #endif /* FUSE_USE_VERSION >= 318 */
 
 /* ----------------------------------------------------------- *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 7a1226d6bc2c0a..3c704f03434693 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -242,6 +242,7 @@
  *  - add FUSE_DEV_IOC_IOMAP_DEV_ADD to configure block devices for iomap
  *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
  *  - add FUSE_IOMAP_FILEIO/FUSE_ATTR_IOMAP_FILEIO for buffered I/O support
+ *  - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
  */
 
 #ifndef _LINUX_FUSE_H
@@ -676,6 +677,7 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	FUSE_IOMAP_CONFIG	= 4092,
 	FUSE_IOMAP_IOEND	= 4093,
 	FUSE_IOMAP_BEGIN	= 4094,
 	FUSE_IOMAP_END		= 4095,
@@ -1376,4 +1378,32 @@ struct fuse_iomap_ioend_in {
 	uint32_t reserved1;	/* zero */
 };
 
+struct fuse_iomap_config_in {
+	uint64_t flags;		/* zero for now */
+	int64_t maxbytes;	/* max supported file size */
+};
+
+struct fuse_iomap_config_out {
+	uint64_t flags;		/* FUSE_IOMAP_CONFIG_* */
+
+	char s_id[32];		/* Informational name */
+	char s_uuid[16];	/* UUID */
+
+	uint8_t s_uuid_len;	/* length of s_uuid */
+
+	uint8_t s_pad[3];	/* must be zeroes */
+
+	uint32_t s_blocksize;	/* fs block size */
+	uint32_t s_max_links;	/* max hard links */
+
+	/* Granularity of c/m/atime in ns (cannot be worse than a second) */
+	uint32_t s_time_gran;
+
+	/* Time limits for c/m/atime in seconds */
+	int64_t s_time_min;
+	int64_t s_time_max;
+
+	int64_t s_maxbytes;	/* max file size */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index a529a112998d6e..fd7df5c2c11e16 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1387,6 +1387,19 @@ struct fuse_lowlevel_ops {
 			     uint64_t attr_ino, off_t pos, size_t written,
 			     uint32_t ioendflags, int error,
 			     uint64_t new_addr);
+
+	/**
+	 * Configure the filesystem geometry for iomap mode
+	 *
+	 * Valid replies:
+	 *   fuse_reply_iomap_config
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param flags currently zero
+	 * @param maxbytes maximum supported file size
+	 */
+	void (*iomap_config) (fuse_req_t req, uint32_t flags, int64_t maxbytes);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
@@ -1856,6 +1869,18 @@ int fuse_reply_lseek(fuse_req_t req, off_t off);
  */
 int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_iomap *read_iomap,
 			   const struct fuse_iomap *write_iomap);
+
+/**
+ * Reply with iomap configuration
+ *
+ * Possible requests:
+ *   iomap_config
+ *
+ * @param req request handle
+ * @param cfg iomap configuration
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_iomap_config(fuse_req_t req, const struct fuse_iomap_config *cfg);
 #endif /* FUSE_USE_VERSION >= 318 */
 
 /* ----------------------------------------------------------- *
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 0c7d5cc99945ee..ed9464d592c8a1 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2583,6 +2583,59 @@ static void do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid,
 	_do_iomap_ioend(req, nodeid, inarg, NULL);
 }
 
+int fuse_reply_iomap_config(fuse_req_t req, const struct fuse_iomap_config *cfg)
+{
+	struct fuse_iomap_config_out arg = {
+		.flags = cfg->flags,
+	};
+
+	if (cfg->flags & FUSE_IOMAP_CONFIG_BLOCKSIZE)
+		arg.s_blocksize = cfg->s_blocksize;
+
+	if (cfg->flags & FUSE_IOMAP_CONFIG_SID)
+		memcpy(arg.s_id, cfg->s_id, sizeof(arg.s_id));
+
+	if (cfg->flags & FUSE_IOMAP_CONFIG_UUID) {
+		arg.s_uuid_len = cfg->s_uuid_len;
+		if (arg.s_uuid_len > sizeof(arg.s_uuid))
+			arg.s_uuid_len = sizeof(arg.s_uuid);
+		memcpy(arg.s_uuid, cfg->s_uuid, arg.s_uuid_len);
+	}
+
+	if (cfg->flags & FUSE_IOMAP_CONFIG_MAX_LINKS)
+		arg.s_max_links = cfg->s_max_links;
+
+	if (cfg->flags & FUSE_IOMAP_CONFIG_TIME) {
+		arg.s_time_gran = cfg->s_time_gran;
+		arg.s_time_min = cfg->s_time_min;
+		arg.s_time_max = cfg->s_time_max;
+	}
+
+	if (cfg->flags & FUSE_IOMAP_CONFIG_MAXBYTES)
+		arg.s_maxbytes = cfg->s_maxbytes;
+
+	return send_reply_ok(req, &arg, sizeof(arg));
+}
+
+static void _do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid,
+		      const void *op_in, const void *in_payload)
+{
+	(void)nodeid;
+	(void)in_payload;
+	const struct fuse_iomap_config_in *arg = op_in;
+
+	if (req->se->op.iomap_config)
+		req->se->op.iomap_config(req, arg->flags, arg->maxbytes);
+	else
+		fuse_reply_err(req, ENOSYS);
+}
+
+static void do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid,
+			    const void *inarg)
+{
+	_do_iomap_config(req, nodeid, inarg, NULL);
+}
+
 static bool want_flags_valid(uint64_t capable, uint64_t want)
 {
 	uint64_t unknown_flags = want & (~capable);
@@ -3474,6 +3527,7 @@ static struct {
 	[FUSE_RENAME2]     = { do_rename2,      "RENAME2"    },
 	[FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]	   = { do_lseek,       "LSEEK"	     },
+	[FUSE_IOMAP_CONFIG]= { do_iomap_config, "IOMAP_CONFIG" },
 	[FUSE_IOMAP_BEGIN] = { do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]   = { do_iomap_end,	"IOMAP_END" },
 	[FUSE_IOMAP_IOEND] = { do_iomap_ioend,	"IOMAP_IOEND" },
@@ -3531,6 +3585,7 @@ static struct {
 	[FUSE_RENAME2]		= { _do_rename2,	"RENAME2" },
 	[FUSE_COPY_FILE_RANGE]	= { _do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]		= { _do_lseek,		"LSEEK" },
+	[FUSE_IOMAP_CONFIG]	= { _do_iomap_config,	"IOMAP_CONFIG" },
 	[FUSE_IOMAP_BEGIN]	= { _do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]	= { _do_iomap_end,	"IOMAP_END" },
 	[FUSE_IOMAP_IOEND]	= { _do_iomap_ioend,	"IOMAP_IOEND" },
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 606fdc6127462e..9cb46d8a7afdd2 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -220,6 +220,7 @@ FUSE_3.18 {
 		fuse_reply_entry_iflags;
 		fuse_add_direntry_plus_iflags;
 		fuse_discover_iomap;
+		fuse_reply_iomap_config;
 } FUSE_3.17;
 
 # Local Variables:



* [PATCH 13/14] libfuse: add upper level iomap_config implementation
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (11 preceding siblings ...)
  2025-07-17 23:37   ` [PATCH 12/14] libfuse: add lower level iomap_config implementation Darrick J. Wong
@ 2025-07-17 23:37   ` Darrick J. Wong
  2025-07-17 23:38   ` [PATCH 14/14] libfuse: add strictatime/lazytime mount options Darrick J. Wong
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:37 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Add FUSE_IOMAP_CONFIG helpers to the upper level fuse library.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse.h |    7 +++++++
 lib/fuse.c     |   37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)
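
To make the new high-level hook more concrete, here is a sketch of what
a server's iomap_config handler could look like.  The struct is a local
copy of the definition from the previous patch so the example stands
alone.  demo_iomap_config(), DEMO_FS_MAX_FILE_SIZE, and all of the
geometry values are hypothetical.  Treating the maxbytes argument as an
upper bound imposed by the kernel is an assumption based on its "max
supported file size" comment.

```c
#include <stdint.h>
#include <sys/types.h>

/* Same bit values as the FUSE_IOMAP_CONFIG_* flags in this series. */
#define FUSE_IOMAP_CONFIG_BLOCKSIZE	(1ULL << 2)
#define FUSE_IOMAP_CONFIG_TIME		(1ULL << 4)
#define FUSE_IOMAP_CONFIG_MAXBYTES	(1ULL << 5)

/* Local copy of struct fuse_iomap_config from include/fuse_common.h. */
struct fuse_iomap_config {
	uint64_t flags;
	char s_id[32];
	char s_uuid[16];
	uint8_t s_uuid_len;
	uint8_t s_pad[3];
	uint32_t s_blocksize;
	uint32_t s_max_links;
	uint32_t s_time_gran;
	int64_t s_time_min;
	int64_t s_time_max;
	int64_t s_maxbytes;
};

/* Hypothetical on-disk format limit: 16 TiB. */
#define DEMO_FS_MAX_FILE_SIZE	((int64_t)1 << 44)

/* Sketch of a handler with the same signature as ->iomap_config. */
int demo_iomap_config(uint32_t flags, off_t maxbytes,
		      struct fuse_iomap_config *cfg)
{
	int64_t limit = DEMO_FS_MAX_FILE_SIZE;

	(void)flags;	/* documented as currently zero */

	/* Assumption: never report more than the kernel said it supports. */
	if ((int64_t)maxbytes < limit)
		limit = (int64_t)maxbytes;

	cfg->flags = FUSE_IOMAP_CONFIG_BLOCKSIZE | FUSE_IOMAP_CONFIG_TIME |
		     FUSE_IOMAP_CONFIG_MAXBYTES;
	cfg->s_blocksize = 4096;	/* hypothetical fs block size */
	cfg->s_time_gran = 1;		/* nanosecond timestamps */
	cfg->s_time_min = INT32_MIN;	/* hypothetical 32-bit time range */
	cfg->s_time_max = INT32_MAX;
	cfg->s_maxbytes = limit;
	return 0;
}
```

Since fuse_lib_iomap_config() zero-initializes cfg before calling the
handler, fields that the handler does not touch stay zero and are not
flagged in the reply.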


diff --git a/include/fuse.h b/include/fuse.h
index f894dd5da0d106..6ce6ccfd102386 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -883,6 +883,13 @@ struct fuse_operations {
 	 */
 	int (*getattr_iflags) (const char *path, struct stat *buf,
 			       unsigned int *iflags, struct fuse_file_info *fi);
+
+	/**
+	 * Configure the filesystem geometry that will be used by iomap
+	 * files.
+	 */
+	int (*iomap_config) (uint32_t flags, off_t maxbytes,
+			     struct fuse_iomap_config *cfg);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
diff --git a/lib/fuse.c b/lib/fuse.c
index 685d0181e569d0..b722a1b526e3de 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2870,6 +2870,23 @@ static int fuse_fs_iomap_ioend(struct fuse_fs *fs, const char *path,
 				  ioendflags, error, new_addr);
 }
 
+static int fuse_fs_iomap_config(struct fuse_fs *fs, uint32_t flags,
+				int64_t maxbytes,
+				struct fuse_iomap_config *cfg)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.iomap_config)
+		return -ENOSYS;
+
+	if (fs->debug) {
+		fuse_log(FUSE_LOG_DEBUG,
+			 "iomap_config flags 0x%x maxbytes %lld\n",
+			 flags, (long long)maxbytes);
+	}
+
+	return fs->op.iomap_config(flags, maxbytes, cfg);
+}
+
 static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
 			     int valid, struct fuse_file_info *fi)
 {
@@ -4637,6 +4654,25 @@ static void fuse_lib_iomap_ioend(fuse_req_t req, fuse_ino_t nodeid,
 	reply_err(req, err);
 }
 
+static void fuse_lib_iomap_config(fuse_req_t req, uint32_t flags,
+				  int64_t maxbytes)
+{
+	struct fuse_iomap_config cfg = { };
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_intr_data d;
+	int err;
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_iomap_config(f->fs, flags, maxbytes, &cfg);
+	fuse_finish_interrupt(f, req, &d);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_reply_iomap_config(req, &cfg);
+}
+
 static int clean_delay(struct fuse *f)
 {
 	/*
@@ -4738,6 +4774,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
 	.iomap_begin = fuse_lib_iomap_begin,
 	.iomap_end = fuse_lib_iomap_end,
 	.iomap_ioend = fuse_lib_iomap_ioend,
+	.iomap_config = fuse_lib_iomap_config,
 };
 
 int fuse_notify_poll(struct fuse_pollhandle *ph)



* [PATCH 14/14] libfuse: add strictatime/lazytime mount options
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
                     ` (12 preceding siblings ...)
  2025-07-17 23:37   ` [PATCH 13/14] libfuse: add upper " Darrick J. Wong
@ 2025-07-17 23:38   ` Darrick J. Wong
  13 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:38 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

fuse+iomap leaves the kernel completely in charge of handling
timestamps.  Add the lazytime and strictatime mount options so that
fuse+iomap filesystems can take advantage of those options.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 lib/mount.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)
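
The mount_flags table pairs each option name with a mount flag and an
on/off value, which is why both "lazytime" and "nolazytime" map to the
same bit.  The sketch below shows how such a table can be applied to a
flags word.  The DEMO_MS_* constants carry the Linux values of
MS_STRICTATIME and MS_LAZYTIME but are renamed to avoid clashing with
<sys/mount.h>.  demo_apply_mount_opt() is a hypothetical helper written
for this illustration, not the lookup code in lib/mount.c.

```c
#include <stddef.h>
#include <string.h>

/* Linux values of MS_STRICTATIME and MS_LAZYTIME, renamed for the demo. */
#define DEMO_MS_STRICTATIME	(1UL << 24)
#define DEMO_MS_LAZYTIME	(1UL << 25)

struct demo_mount_flag {
	const char *opt;	/* option name as typed by the user */
	unsigned long flag;	/* mount flag bit the option controls */
	int on;			/* 1: set the bit, 0: clear it */
};

static const struct demo_mount_flag demo_mount_flags[] = {
	{"lazytime",	  DEMO_MS_LAZYTIME,	1},
	{"nolazytime",	  DEMO_MS_LAZYTIME,	0},
	{"strictatime",	  DEMO_MS_STRICTATIME,	1},
	{"nostrictatime", DEMO_MS_STRICTATIME,	0},
	{NULL,		  0,			0}
};

/* Apply one option to *flags.  Returns 1 if the option was recognized. */
int demo_apply_mount_opt(const char *opt, unsigned long *flags)
{
	const struct demo_mount_flag *mf;

	for (mf = demo_mount_flags; mf->opt != NULL; mf++) {
		if (strcmp(opt, mf->opt) != 0)
			continue;
		if (mf->on)
			*flags |= mf->flag;
		else
			*flags &= ~mf->flag;
		return 1;
	}
	return 0;
}
```

Options are applied in the order given, so a later "nolazytime" undoes
an earlier "lazytime".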


diff --git a/lib/mount.c b/lib/mount.c
index 2eb967399c9606..3d021428a2ecfc 100644
--- a/lib/mount.c
+++ b/lib/mount.c
@@ -116,9 +116,16 @@ static const struct fuse_opt fuse_mount_opts[] = {
 	FUSE_OPT_KEY("dirsync",			KEY_KERN_FLAG),
 	FUSE_OPT_KEY("noatime",			KEY_KERN_FLAG),
 	FUSE_OPT_KEY("nodiratime",		KEY_KERN_FLAG),
-	FUSE_OPT_KEY("nostrictatime",		KEY_KERN_FLAG),
 	FUSE_OPT_KEY("symfollow",		KEY_KERN_FLAG),
 	FUSE_OPT_KEY("nosymfollow",		KEY_KERN_FLAG),
+#ifdef MS_LAZYTIME
+	FUSE_OPT_KEY("lazytime",		KEY_KERN_FLAG),
+	FUSE_OPT_KEY("nolazytime",		KEY_KERN_FLAG),
+#endif
+#ifdef MS_STRICTATIME
+	FUSE_OPT_KEY("strictatime",		KEY_KERN_FLAG),
+	FUSE_OPT_KEY("nostrictatime",		KEY_KERN_FLAG),
+#endif
 	FUSE_OPT_END
 };
 
@@ -189,11 +196,18 @@ static const struct mount_flags mount_flags[] = {
 	{"noatime", MS_NOATIME,	    1},
 	{"nodiratime",	    MS_NODIRATIME,	1},
 	{"norelatime",	    MS_RELATIME,	0},
-	{"nostrictatime",   MS_STRICTATIME,	0},
 	{"symfollow",	    MS_NOSYMFOLLOW,	0},
 	{"nosymfollow",	    MS_NOSYMFOLLOW,	1},
 #ifndef __NetBSD__
 	{"dirsync", MS_DIRSYNC,	    1},
+#endif
+#ifdef MS_LAZYTIME
+	{"lazytime",	    MS_LAZYTIME,	1},
+	{"nolazytime",	    MS_LAZYTIME,	0},
+#endif
+#ifdef MS_STRICTATIME
+	{"strictatime",	    MS_STRICTATIME,	1},
+	{"nostrictatime",   MS_STRICTATIME,	0},
 #endif
 	{NULL,	    0,		    0}
 };



* [PATCH 1/1] libfuse: enable iomap cache management
  2025-07-17 23:25 ` [PATCHSET RFC v3 2/3] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-07-17 23:38   ` Darrick J. Wong
  2025-07-18 16:16     ` Bernd Schubert
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:38 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Add the library methods so that fuse servers can manage an in-kernel
iomap cache.  This enables better performance on small IOs and is
required if the filesystem needs synchronization between pagecache
writes and writeback.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_common.h   |    9 +++++
 include/fuse_kernel.h   |   34 +++++++++++++++++++
 include/fuse_lowlevel.h |   39 ++++++++++++++++++++++
 lib/fuse_lowlevel.c     |   82 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    2 +
 5 files changed, 166 insertions(+)


diff --git a/include/fuse_common.h b/include/fuse_common.h
index 98cb8f656efd13..1237cc2656b9c4 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1164,6 +1164,7 @@ int fuse_convert_to_conn_want_ext(struct fuse_conn_info *conn);
  */
 #if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
 #define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
+#define FUSE_IOMAP_TYPE_NULL		(0xFFFE) /* no mapping here */
 #define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
 #define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
 #define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
@@ -1208,6 +1209,11 @@ struct fuse_iomap {
 	uint32_t dev;		/* device cookie */
 };
 
+struct fuse_iomap_inval {
+	uint64_t offset;	/* file offset to invalidate, bytes */
+	uint64_t length;	/* length to invalidate, bytes */
+};
+
 /* out of place write extent */
 #define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
 /* unwritten extent */
@@ -1258,6 +1264,9 @@ struct fuse_iomap_config{
 	int64_t s_maxbytes;	/* max file size */
 };
 
+/* invalidate to end of file */
+#define FUSE_IOMAP_INVAL_TO_EOF		(~0ULL)
+
 #endif /* FUSE_USE_VERSION >= 318 */
 
 /* ----------------------------------------------------------- *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 3c704f03434693..f1a93dbd1ff443 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -243,6 +243,8 @@
  *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
  *  - add FUSE_IOMAP_FILEIO/FUSE_ATTR_IOMAP_FILEIO for buffered I/O support
  *  - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
+ *  - add FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL so fuse servers
+ *    can cache iomappings in the kernel
  */
 
 #ifndef _LINUX_FUSE_H
@@ -699,6 +701,8 @@ enum fuse_notify_code {
 	FUSE_NOTIFY_DELETE = 6,
 	FUSE_NOTIFY_RESEND = 7,
 	FUSE_NOTIFY_INC_EPOCH = 8,
+	FUSE_NOTIFY_IOMAP_UPSERT = 9,
+	FUSE_NOTIFY_IOMAP_INVAL = 10,
 	FUSE_NOTIFY_CODE_MAX,
 };
 
@@ -1406,4 +1410,34 @@ struct fuse_iomap_config_out {
 	int64_t s_maxbytes;	/* max file size */
 };
 
+struct fuse_iomap_upsert_out {
+	uint64_t nodeid;	/* Inode ID */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+
+	uint64_t read_offset;	/* file offset of mapping, bytes */
+	uint64_t read_length;	/* length of mapping, bytes */
+	uint64_t read_addr;	/* disk offset of mapping, bytes */
+	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t read_dev;	/* device cookie */
+
+	uint64_t write_offset;	/* file offset of mapping, bytes */
+	uint64_t write_length;	/* length of mapping, bytes */
+	uint64_t write_addr;	/* disk offset of mapping, bytes */
+	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
+	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
+	uint32_t write_dev;	/* device cookie */
+};
+
+struct fuse_iomap_inval_out {
+	uint64_t nodeid;	/* Inode ID */
+	uint64_t attr_ino;	/* matches fuse_attr:ino */
+
+	uint64_t read_offset;	/* range to invalidate read iomaps, bytes */
+	uint64_t read_length;	/* can be FUSE_IOMAP_INVAL_TO_EOF */
+
+	uint64_t write_offset;	/* range to invalidate write iomaps, bytes */
+	uint64_t write_length;	/* can be FUSE_IOMAP_INVAL_TO_EOF */
+};
+
 #endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index fd7df5c2c11e16..f690c62fcdd61c 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2101,6 +2101,45 @@ int fuse_lowlevel_notify_retrieve(struct fuse_session *se, fuse_ino_t ino,
  * @return positive device id for success, zero for failure
  */
 int fuse_iomap_add_device(struct fuse_session *se, int fd, unsigned int flags);
+
+/**
+ * Upsert some file mapping information into the kernel.  This is necessary
+ * for filesystems that require coordination of mapping state changes between
+ * buffered writes and writeback, and desirable for better performance
+ * elsewhere.
+ *
+ * Added in FUSE protocol version 7.99. If the kernel does not support
+ * this (or a newer) version, the function will return -ENOSYS and do
+ * nothing.
+ *
+ * @param se the session object
+ * @param nodeid the inode number
+ * @param attr_ino inode number as told by fuse_attr::ino
+ * @param read_iomap mapping information for file reads
+ * @param write_iomap mapping information for file writes
+ * @return zero for success, -errno for failure
+ */
+int fuse_lowlevel_notify_iomap_upsert(struct fuse_session *se,
+				      fuse_ino_t nodeid, uint64_t attr_ino,
+				      const struct fuse_iomap *read_iomap,
+				      const struct fuse_iomap *write_iomap);
+
+/**
+ * Invalidate some file mapping information in the kernel.
+ *
+ * Added in FUSE protocol version 7.99. If the kernel does not support
+ * this (or a newer) version, the function will return -ENOSYS and do
+ * nothing.
+ *
+ * @param se the session object
+ * @param read read mapping range to invalidate
+ * @param write write mapping range to invalidate
+ * @return zero for success, -errno for failure
+ */
+int fuse_lowlevel_notify_iomap_inval(struct fuse_session *se,
+				     fuse_ino_t nodeid,
+				     const struct fuse_iomap_inval *read,
+				     const struct fuse_iomap_inval *write);
 #endif
 
 /* ----------------------------------------------------------- *
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index ed9464d592c8a1..e31ce96593a9b3 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -3349,6 +3349,88 @@ int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino,
 	return res;
 }
 
+int fuse_lowlevel_notify_iomap_upsert(struct fuse_session *se,
+				      fuse_ino_t nodeid, uint64_t attr_ino,
+				      const struct fuse_iomap *read_iomap,
+				      const struct fuse_iomap *write_iomap)
+{
+	struct fuse_iomap_upsert_out outarg = {
+		.nodeid		= nodeid,
+		.attr_ino	= attr_ino,
+	};
+	struct iovec iov[2];
+
+	if (!se)
+		return -EINVAL;
+
+	if (se->conn.proto_minor < 44)
+		return -ENOSYS;
+
+	if (!read_iomap && !write_iomap)
+		return 0;
+
+	if (read_iomap) {
+		outarg.read_offset = read_iomap->offset;
+		outarg.read_length = read_iomap->length;
+		outarg.read_addr = read_iomap->addr;
+		outarg.read_type = read_iomap->type;
+		outarg.read_flags = read_iomap->flags;
+		outarg.read_dev = read_iomap->dev;
+	} else {
+		outarg.read_type = FUSE_IOMAP_TYPE_NULL;
+	}
+
+	if (write_iomap) {
+		outarg.write_offset = write_iomap->offset;
+		outarg.write_length = write_iomap->length;
+		outarg.write_addr = write_iomap->addr;
+		outarg.write_type = write_iomap->type;
+		outarg.write_flags = write_iomap->flags;
+		outarg.write_dev = write_iomap->dev;
+	} else {
+		outarg.write_type = FUSE_IOMAP_TYPE_NULL;
+	}
+
+	iov[1].iov_base = &outarg;
+	iov[1].iov_len = sizeof(outarg);
+
+	return send_notify_iov(se, FUSE_NOTIFY_IOMAP_UPSERT, iov, 2);
+}
+
+int fuse_lowlevel_notify_iomap_inval(struct fuse_session *se,
+				     fuse_ino_t nodeid,
+				     const struct fuse_iomap_inval *read,
+				     const struct fuse_iomap_inval *write)
+{
+	struct fuse_iomap_inval_out outarg = {
+		.nodeid		= nodeid,
+	};
+	struct iovec iov[2];
+
+	if (!se)
+		return -EINVAL;
+
+	if (se->conn.proto_minor < 44)
+		return -ENOSYS;
+
+	if (!read && !write)
+		return 0;
+
+	if (read) {
+		outarg.read_offset = read->offset;
+		outarg.read_length = read->length;
+	}
+	if (write) {
+		outarg.write_offset = write->offset;
+		outarg.write_length = write->length;
+	}
+
+	iov[1].iov_base = &outarg;
+	iov[1].iov_len = sizeof(outarg);
+
+	return send_notify_iov(se, FUSE_NOTIFY_IOMAP_INVAL, iov, 2);
+}
+
 struct fuse_retrieve_req {
 	struct fuse_notify_req nreq;
 	void *cookie;
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 9cb46d8a7afdd2..dc9fa2428b5325 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -221,6 +221,8 @@ FUSE_3.18 {
 		fuse_add_direntry_plus_iflags;
 		fuse_discover_iomap;
 		fuse_reply_iomap_config;
+		fuse_lowlevel_notify_iomap_upsert;
+		fuse_lowlevel_notify_iomap_inval;
 } FUSE_3.17;
 
 # Local Variables:

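The packing rule in fuse_lowlevel_notify_iomap_upsert() above (copy each
mapping that was supplied, mark a missing one as "no mapping here", and send
nothing if both are missing) can be sketched in isolation.  The sketch_*
structs below are local stand-ins that only loosely mirror fuse_iomap and
fuse_iomap_upsert_out; they are assumptions for illustration and do not
reproduce the wire layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values mirror FUSE_IOMAP_TYPE_NULL and FUSE_IOMAP_TYPE_MAPPED above. */
#define SKETCH_IOMAP_TYPE_NULL		0xFFFE
#define SKETCH_IOMAP_TYPE_MAPPED	2

/* Local stand-in for struct fuse_iomap. */
struct sketch_iomap {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	uint64_t addr;		/* disk offset of mapping, bytes */
	uint16_t type;
	uint16_t flags;
	uint32_t dev;		/* device cookie */
};

/* Local stand-in for the notification payload. */
struct sketch_upsert_out {
	uint64_t nodeid;
	uint64_t attr_ino;
	struct sketch_iomap read;
	struct sketch_iomap write;
};

/* Copy one side, or mark it as absent if the caller passed NULL. */
static void sketch_pack_side(struct sketch_iomap *dst,
			     const struct sketch_iomap *src)
{
	assert(dst != NULL);

	if (src) {
		*dst = *src;
	} else {
		memset(dst, 0, sizeof(*dst));
		dst->type = SKETCH_IOMAP_TYPE_NULL;
	}
}

/* Returns 1 if there is something to send, 0 if both mappings are NULL. */
static int sketch_pack_upsert(struct sketch_upsert_out *out,
			      uint64_t nodeid, uint64_t attr_ino,
			      const struct sketch_iomap *read_iomap,
			      const struct sketch_iomap *write_iomap)
{
	assert(out != NULL);

	if (!read_iomap && !write_iomap)
		return 0;

	memset(out, 0, sizeof(*out));
	out->nodeid = nodeid;
	out->attr_ino = attr_ino;
	sketch_pack_side(&out->read, read_iomap);
	sketch_pack_side(&out->write, write_iomap);
	return 1;
}
```

A server that only knows the read mapping would pass NULL for the write side
and let the NULL type tell the kernel that nothing is being cached there.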


* [PATCH 1/4] libfuse: wire up FUSE_SYNCFS to the low level library
  2025-07-17 23:25 ` [PATCHSET RFC v3 3/3] libfuse: implement statx and syncfs Darrick J. Wong
@ 2025-07-17 23:38   ` Darrick J. Wong
  2025-07-17 23:38   ` [PATCH 2/4] libfuse: add syncfs support to the upper library Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:38 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Create hooks in the lowlevel library for syncfs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_lowlevel.h |   16 ++++++++++++++++
 lib/fuse_lowlevel.c     |   19 +++++++++++++++++++
 2 files changed, 35 insertions(+)


diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index f690c62fcdd61c..77685e433e4f7d 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1400,6 +1400,22 @@ struct fuse_lowlevel_ops {
 	 * @param maxbytes maximum supported file size
 	 */
 	void (*iomap_config) (fuse_req_t req, uint32_t flags, int64_t maxbytes);
+
+	/**
+	 * Flush the entire filesystem to disk.
+	 *
+	 * If this request is answered with an error code of ENOSYS, this is
+	 * treated as a permanent failure, i.e. all future syncfs() requests
+	 * will fail with the same error code without being sent to the
+	 * filesystem process.
+	 *
+	 * Valid replies:
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param ino the inode number
+	 */
+	void (*syncfs) (fuse_req_t req, fuse_ino_t ino);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index e31ce96593a9b3..ec30ebc4cdd074 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2636,6 +2636,23 @@ static void do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid,
 	_do_iomap_config(req, nodeid, inarg, NULL);
 }
 
+static void _do_syncfs(fuse_req_t req, const fuse_ino_t nodeid,
+		      const void *op_in, const void *in_payload)
+{
+	(void)op_in;
+	(void)in_payload;
+
+	if (req->se->op.syncfs)
+		req->se->op.syncfs(req, nodeid);
+	else
+		fuse_reply_err(req, ENOSYS);
+}
+
+static void do_syncfs(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg)
+{
+	_do_syncfs(req, nodeid, inarg, NULL);
+}
+
 static bool want_flags_valid(uint64_t capable, uint64_t want)
 {
 	uint64_t unknown_flags = want & (~capable);
@@ -3609,6 +3626,7 @@ static struct {
 	[FUSE_RENAME2]     = { do_rename2,      "RENAME2"    },
 	[FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]	   = { do_lseek,       "LSEEK"	     },
+	[FUSE_SYNCFS]	   = { do_syncfs,	"SYNCFS"     },
 	[FUSE_IOMAP_CONFIG]= { do_iomap_config, "IOMAP_CONFIG" },
 	[FUSE_IOMAP_BEGIN] = { do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]   = { do_iomap_end,	"IOMAP_END" },
@@ -3667,6 +3685,7 @@ static struct {
 	[FUSE_RENAME2]		= { _do_rename2,	"RENAME2" },
 	[FUSE_COPY_FILE_RANGE]	= { _do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]		= { _do_lseek,		"LSEEK" },
+	[FUSE_SYNCFS]		= { _do_syncfs,		"SYNCFS" },
 	[FUSE_IOMAP_CONFIG]	= { _do_iomap_config,	"IOMAP_CONFIG" },
 	[FUSE_IOMAP_BEGIN]	= { _do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]	= { _do_iomap_end,	"IOMAP_END" },

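The dispatch pattern in _do_syncfs() above (call the handler if the server
registered one, otherwise answer ENOSYS) is small enough to show on its own.
This is a sketch with a hypothetical one-member ops struct, not the real
fuse_lowlevel_ops; it returns the errno value that would be passed to
fuse_reply_err().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical ops table; NULL means the server has no syncfs handler. */
struct sketch_ops {
	int (*syncfs)(void *fs_private);
};

/* Returns the errno to reply with: 0 on success, ENOSYS if unimplemented. */
static int sketch_dispatch_syncfs(const struct sketch_ops *ops,
				  void *fs_private)
{
	assert(ops != NULL);

	if (!ops->syncfs)
		return ENOSYS;
	return ops->syncfs(fs_private);
}

/* Example handler: counts how often it was asked to flush. */
static int sketch_count_syncs(void *fs_private)
{
	int *counter = fs_private;

	(*counter)++;
	return 0;
}
```

Per the header comment in this patch, an ENOSYS reply is treated as
permanent, so a server that wants syncfs must register the handler before
the first request arrives.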


* [PATCH 2/4] libfuse: add syncfs support to the upper library
  2025-07-17 23:25 ` [PATCHSET RFC v3 3/3] libfuse: implement statx and syncfs Darrick J. Wong
  2025-07-17 23:38   ` [PATCH 1/4] libfuse: wire up FUSE_SYNCFS to the low level library Darrick J. Wong
@ 2025-07-17 23:38   ` Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 3/4] libfuse: add statx support to the lower level library Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 4/4] libfuse: add upper level statx hooks Darrick J. Wong
  3 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:38 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Support syncfs in the upper level library.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse.h |    5 +++++
 lib/fuse.c     |   31 +++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)


diff --git a/include/fuse.h b/include/fuse.h
index 6ce6ccfd102386..a59f43e0701e1a 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -890,6 +890,11 @@ struct fuse_operations {
 	 */
 	int (*iomap_config) (uint32_t flags, off_t maxbytes,
 			     struct fuse_iomap_config *cfg);
+
+	/**
+	 * Flush the entire filesystem to disk.
+	 */
+	int (*syncfs) (const char *path);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
diff --git a/lib/fuse.c b/lib/fuse.c
index b722a1b526e3de..c3fa6dad589cb0 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2887,6 +2887,16 @@ static int fuse_fs_iomap_config(struct fuse_fs *fs, uint32_t flags,
 	return fs->op.iomap_config(flags, maxbytes, cfg);
 }
 
+static int fuse_fs_syncfs(struct fuse_fs *fs, const char *path)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.syncfs)
+		return -ENOSYS;
+	if (fs->debug)
+		fuse_log(FUSE_LOG_DEBUG, "syncfs[%s]\n", path);
+	return fs->op.syncfs(path);
+}
+
 static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
 			     int valid, struct fuse_file_info *fi)
 {
@@ -4673,6 +4683,26 @@ static void fuse_lib_iomap_config(fuse_req_t req, uint32_t flags,
 	fuse_reply_iomap_config(req, &cfg);
 }
 
+static void fuse_lib_syncfs(fuse_req_t req, fuse_ino_t ino)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct fuse_intr_data d;
+	char *path;
+	int err;
+
+	err = get_path(f, ino, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_syncfs(f->fs, path);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, ino, path);
+	reply_err(req, err);
+}
+
 static int clean_delay(struct fuse *f)
 {
 	/*
@@ -4771,6 +4801,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
 	.fallocate = fuse_lib_fallocate,
 	.copy_file_range = fuse_lib_copy_file_range,
 	.lseek = fuse_lib_lseek,
+	.syncfs = fuse_lib_syncfs,
 	.iomap_begin = fuse_lib_iomap_begin,
 	.iomap_end = fuse_lib_iomap_end,
 	.iomap_ioend = fuse_lib_iomap_ioend,



* [PATCH 3/4] libfuse: add statx support to the lower level library
  2025-07-17 23:25 ` [PATCHSET RFC v3 3/3] libfuse: implement statx and syncfs Darrick J. Wong
  2025-07-17 23:38   ` [PATCH 1/4] libfuse: wire up FUSE_SYNCFS to the low level library Darrick J. Wong
  2025-07-17 23:38   ` [PATCH 2/4] libfuse: add syncfs support to the upper library Darrick J. Wong
@ 2025-07-17 23:39   ` Darrick J. Wong
  2025-07-18 13:28     ` Amir Goldstein
  2025-07-17 23:39   ` [PATCH 4/4] libfuse: add upper level statx hooks Darrick J. Wong
  3 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:39 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Add statx support to the lower level fuse library.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse_lowlevel.h |   37 ++++++++++++++++++
 lib/fuse_lowlevel.c     |   97 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/fuse_versionscript  |    2 +
 3 files changed, 136 insertions(+)


diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 77685e433e4f7d..f4d62cee22870a 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1416,6 +1416,26 @@ struct fuse_lowlevel_ops {
 	 * @param ino the inode number
 	 */
 	void (*syncfs) (fuse_req_t req, fuse_ino_t ino);
+
+	/**
+	 * Fetch extended stat information about a file
+	 *
+	 * If this request is answered with an error code of ENOSYS, this is
+	 * treated as a permanent failure, i.e. all future statx() requests
+	 * will fail with the same error code without being sent to the
+	 * filesystem process.
+	 *
+	 * Valid replies:
+	 *   fuse_reply_statx
+	 *   fuse_reply_err
+	 *
+	 * @param req request handle
+	 * @param statx_flags AT_STATX_* flags
+	 * @param statx_mask desired STATX_* attribute mask
+	 * @param fi file information
+	 */
+	void (*statx) (fuse_req_t req, fuse_ino_t ino, uint32_t statx_flags,
+		       uint32_t statx_mask, struct fuse_file_info *fi);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
@@ -1897,6 +1917,23 @@ int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_iomap *read_iomap,
  * @return zero for success, -errno for failure to send reply
  */
 int fuse_reply_iomap_config(fuse_req_t req, const struct fuse_iomap_config *cfg);
+
+struct statx;
+
+/**
+ * Reply with statx attributes
+ *
+ * Possible requests:
+ *   statx
+ *
+ * @param req request handle
+ * @param statx the attributes
+ * @param size the size of the statx structure
+ * @param attr_timeout	validity timeout (in seconds) for the attributes
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
+		     double attr_timeout);
 #endif /* FUSE_USE_VERSION >= 318 */
 
 /* ----------------------------------------------------------- *
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index ec30ebc4cdd074..8eeb6a8547da91 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -144,6 +144,43 @@ static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
 	ST_CTIM_NSEC_SET(stbuf, attr->ctimensec);
 }
 
+#ifdef STATX_BASIC_STATS
+static int convert_statx(struct fuse_statx *stbuf, const struct statx *stx,
+			 size_t size)
+{
+	if (sizeof(struct statx) != size)
+		return EOPNOTSUPP;
+
+	stbuf->mask = stx->stx_mask & (STATX_BASIC_STATS | STATX_BTIME);
+	stbuf->blksize		= stx->stx_blksize;
+	stbuf->attributes	= stx->stx_attributes;
+	stbuf->nlink		= stx->stx_nlink;
+	stbuf->uid		= stx->stx_uid;
+	stbuf->gid		= stx->stx_gid;
+	stbuf->mode		= stx->stx_mode;
+	stbuf->ino		= stx->stx_ino;
+	stbuf->size		= stx->stx_size;
+	stbuf->blocks		= stx->stx_blocks;
+	stbuf->attributes_mask	= stx->stx_attributes_mask;
+	stbuf->rdev_major	= stx->stx_rdev_major;
+	stbuf->rdev_minor	= stx->stx_rdev_minor;
+	stbuf->dev_major	= stx->stx_dev_major;
+	stbuf->dev_minor	= stx->stx_dev_minor;
+
+	stbuf->atime.tv_sec	= stx->stx_atime.tv_sec;
+	stbuf->btime.tv_sec	= stx->stx_btime.tv_sec;
+	stbuf->ctime.tv_sec	= stx->stx_ctime.tv_sec;
+	stbuf->mtime.tv_sec	= stx->stx_mtime.tv_sec;
+
+	stbuf->atime.tv_nsec	= stx->stx_atime.tv_nsec;
+	stbuf->btime.tv_nsec	= stx->stx_btime.tv_nsec;
+	stbuf->ctime.tv_nsec	= stx->stx_ctime.tv_nsec;
+	stbuf->mtime.tv_nsec	= stx->stx_mtime.tv_nsec;
+
+	return 0;
+}
+#endif
+
 static	size_t iov_length(const struct iovec *iov, size_t count)
 {
 	size_t seg;
@@ -2653,6 +2690,64 @@ static void do_syncfs(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg
 	_do_syncfs(req, nodeid, inarg, NULL);
 }
 
+#ifdef STATX_BASIC_STATS
+int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
+		     double attr_timeout)
+{
+	struct fuse_statx_out arg = {
+		.attr_valid = calc_timeout_sec(attr_timeout),
+		.attr_valid_nsec = calc_timeout_nsec(attr_timeout),
+	};
+
+	int err = convert_statx(&arg.stat, statx, size);
+	if (err) {
+		fuse_reply_err(req, err);
+		return -err;
+	}
+
+	return send_reply_ok(req, &arg, sizeof(arg));
+}
+
+static void _do_statx(fuse_req_t req, const fuse_ino_t nodeid,
+		      const void *op_in, const void *in_payload)
+{
+	(void)in_payload;
+	const struct fuse_statx_in *arg = op_in;
+	struct fuse_file_info *fip = NULL;
+	struct fuse_file_info fi;
+
+	if (arg->getattr_flags & FUSE_GETATTR_FH) {
+		memset(&fi, 0, sizeof(fi));
+		fi.fh = arg->fh;
+		fip = &fi;
+	}
+
+	if (req->se->op.statx)
+		req->se->op.statx(req, nodeid, arg->sx_flags, arg->sx_mask,
+				  fip);
+	else
+		fuse_reply_err(req, ENOSYS);
+}
+#else
+int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
+		     double attr_timeout)
+{
+	fuse_reply_err(req, ENOSYS);
+	return -ENOSYS;
+}
+
+static void _do_statx(fuse_req_t req, const fuse_ino_t nodeid,
+		      const void *op_in, const void *in_payload)
+{
+	fuse_reply_err(req, ENOSYS);
+}
+#endif /* STATX_BASIC_STATS */
+
+static void do_statx(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg)
+{
+	_do_statx(req, nodeid, inarg, NULL);
+}
+
 static bool want_flags_valid(uint64_t capable, uint64_t want)
 {
 	uint64_t unknown_flags = want & (~capable);
@@ -3627,6 +3722,7 @@ static struct {
 	[FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]	   = { do_lseek,       "LSEEK"	     },
 	[FUSE_SYNCFS]	   = { do_syncfs,	"SYNCFS"     },
+	[FUSE_STATX]	   = { do_statx,       "STATX"	     },
 	[FUSE_IOMAP_CONFIG]= { do_iomap_config, "IOMAP_CONFIG" },
 	[FUSE_IOMAP_BEGIN] = { do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]   = { do_iomap_end,	"IOMAP_END" },
@@ -3686,6 +3782,7 @@ static struct {
 	[FUSE_COPY_FILE_RANGE]	= { _do_copy_file_range, "COPY_FILE_RANGE" },
 	[FUSE_LSEEK]		= { _do_lseek,		"LSEEK" },
 	[FUSE_SYNCFS]		= { _do_syncfs,		"SYNCFS" },
+	[FUSE_STATX]		= { _do_statx,		"STATX" },
 	[FUSE_IOMAP_CONFIG]	= { _do_iomap_config,	"IOMAP_CONFIG" },
 	[FUSE_IOMAP_BEGIN]	= { _do_iomap_begin,	"IOMAP_BEGIN" },
 	[FUSE_IOMAP_END]	= { _do_iomap_end,	"IOMAP_END" },
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index dc9fa2428b5325..a67b1802770335 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -223,6 +223,8 @@ FUSE_3.18 {
 		fuse_reply_iomap_config;
 		fuse_lowlevel_notify_iomap_upsert;
 		fuse_lowlevel_notify_iomap_inval;
+
+		fuse_reply_statx;
 } FUSE_3.17;
 
 # Local Variables:

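Two details of convert_statx() above are easy to miss: it refuses a caller
whose struct size differs from the library's, and it strips any mask bits
beyond the basic stats and btime before replying.  A reduced sketch of just
those two checks follows.  The mask values mirror STATX_BASIC_STATS,
STATX_BTIME and STATX_MNT_ID from <linux/stat.h>; the two-field structs are
hypothetical stand-ins, not struct statx or struct fuse_statx.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror the kernel's STATX_* mask bits. */
#define SKETCH_STATX_BASIC_STATS	0x000007ffU
#define SKETCH_STATX_BTIME		0x00000800U
#define SKETCH_STATX_MNT_ID		0x00001000U	/* not forwarded here */

/* Hypothetical, trimmed-down stand-ins for the source and wire structs. */
struct sketch_statx_in {
	uint32_t mask;
	uint64_t size;
};

struct sketch_statx_out {
	uint32_t mask;
	uint64_t size;
};

/* Returns 0 on success or a positive errno, like convert_statx(). */
static int sketch_convert(struct sketch_statx_out *out,
			  const struct sketch_statx_in *in, size_t in_size)
{
	assert(out != NULL && in != NULL);

	/* A mismatched size means caller and library disagree on layout. */
	if (in_size != sizeof(*in))
		return EOPNOTSUPP;

	/* Only advertise the attributes the wire format can carry. */
	out->mask = in->mask & (SKETCH_STATX_BASIC_STATS | SKETCH_STATX_BTIME);
	out->size = in->size;
	return 0;
}
```

The exact-size comparison is the strictest possible choice; whether a
smaller-or-equal check would be preferable for forward compatibility is left
open here.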


* [PATCH 4/4] libfuse: add upper level statx hooks
  2025-07-17 23:25 ` [PATCHSET RFC v3 3/3] libfuse: implement statx and syncfs Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:39   ` [PATCH 3/4] libfuse: add statx support to the lower level library Darrick J. Wong
@ 2025-07-17 23:39   ` Darrick J. Wong
  3 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:39 UTC (permalink / raw)
  To: djwong, bschubert; +Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos

From: Darrick J. Wong <djwong@kernel.org>

Connect statx to the upper level library.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/fuse.h |   11 +++++++
 lib/fuse.c     |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)


diff --git a/include/fuse.h b/include/fuse.h
index a59f43e0701e1a..03af42c9884acd 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -323,6 +323,7 @@ struct fuse_config {
 	uint64_t reserved[48];
 };
 
+struct statx;
 
 /**
  * The file system operations:
@@ -895,6 +896,16 @@ struct fuse_operations {
 	 * Flush the entire filesystem to disk.
 	 */
 	int (*syncfs) (const char *path);
+
+	/**
+	 * Return detailed attributes about a file.
+	 *
+	 * File information should be written to the statx struct.
+	 * The size parameter is the size of the statx structure.
+	 */
+	int (*statx) (const char *path, uint32_t statx_flags,
+		      uint32_t statx_mask, struct statx *statx, size_t size,
+		      struct fuse_file_info *fi);
 #endif /* FUSE_USE_VERSION >= 318 */
 };
 
diff --git a/lib/fuse.c b/lib/fuse.c
index c3fa6dad589cb0..41e37f2760356c 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -41,6 +41,8 @@
 #include <sys/time.h>
 #include <sys/mman.h>
 #include <sys/file.h>
+#include <sys/stat.h>
+#include <sys/sysmacros.h>
 
 #define FUSE_NODE_SLAB 1
 
@@ -2897,6 +2899,24 @@ static int fuse_fs_syncfs(struct fuse_fs *fs, const char *path)
 	return fs->op.syncfs(path);
 }
 
+static int fuse_fs_statx(struct fuse_fs *fs, const char *path,
+			 uint32_t statx_flags, uint32_t statx_mask,
+			 struct statx *statx, size_t size,
+			 struct fuse_file_info *fi)
+{
+	fuse_get_context()->private_data = fs->user_data;
+	if (!fs->op.statx)
+		return -ENOSYS;
+	if (fs->debug) {
+		char buf[10];
+
+		fuse_log(FUSE_LOG_DEBUG, "statx[%s] 0x%x 0x%x\n",
+			file_info_string(fi, buf, sizeof(buf)),
+			statx_flags, statx_mask);
+	}
+	return fs->op.statx(path, statx_flags, statx_mask, statx, size, fi);
+}
+
 static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
 			     int valid, struct fuse_file_info *fi)
 {
@@ -4703,6 +4723,74 @@ static void fuse_lib_syncfs(fuse_req_t req, fuse_ino_t ino)
 	reply_err(req, err);
 }
 
+#ifdef STATX_BASIC_STATS
+static void from_statx(struct stat *buf, const struct statx *stx)
+{
+	buf->st_dev		= makedev(stx->stx_dev_major,
+					  stx->stx_dev_minor);
+	buf->st_ino		= stx->stx_ino;
+	buf->st_mode		= stx->stx_mode;
+	buf->st_nlink		= stx->stx_nlink;
+	buf->st_uid		= stx->stx_uid;
+	buf->st_gid		= stx->stx_gid;
+	buf->st_rdev		= makedev(stx->stx_rdev_major,
+					  stx->stx_rdev_minor);
+	buf->st_size		= stx->stx_size;
+	buf->st_blksize		= stx->stx_blksize;
+	buf->st_blocks		= stx->stx_blocks;
+
+	buf->st_atime		= stx->stx_atime.tv_sec;
+	buf->st_mtime		= stx->stx_mtime.tv_sec;
+	buf->st_ctime		= stx->stx_ctime.tv_sec;
+
+	/* XXX do we care about tv_nsec? */
+}
+
+static void fuse_lib_statx(fuse_req_t req, fuse_ino_t ino, uint32_t statx_flags,
+			   uint32_t statx_mask, struct fuse_file_info *fi)
+{
+	struct fuse *f = req_fuse_prepare(req);
+	struct node *node;
+	struct fuse_intr_data d;
+	struct statx statx = { };
+	struct stat buf = { };
+	char *path;
+	int err;
+
+	if (fi != NULL)
+		err = get_path_nullok(f, ino, &path);
+	else
+		err = get_path(f, ino, &path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	fuse_prepare_interrupt(f, req, &d);
+	err = fuse_fs_statx(f->fs, path, statx_flags, statx_mask, &statx,
+			    sizeof(statx), fi);
+	fuse_finish_interrupt(f, req, &d);
+	free_path(f, ino, path);
+	if (err) {
+		reply_err(req, err);
+		return;
+	}
+
+	pthread_mutex_lock(&f->lock);
+	node = get_node(f, ino);
+	if (node->is_hidden && statx.stx_nlink > 0)
+		statx.stx_nlink--;
+	from_statx(&buf, &statx);
+	if (f->conf.auto_cache)
+		update_stat(node, &buf);
+	pthread_mutex_unlock(&f->lock);
+	set_stat(f, ino, &buf);
+	fuse_reply_statx(req, &statx, sizeof(statx), f->conf.attr_timeout);
+}
+#else
+# define fuse_lib_statx		NULL
+#endif /* STATX_BASIC_STATS */
+
 static int clean_delay(struct fuse *f)
 {
 	/*
@@ -4802,6 +4890,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
 	.copy_file_range = fuse_lib_copy_file_range,
 	.lseek = fuse_lib_lseek,
 	.syncfs = fuse_lib_syncfs,
+	.statx = fuse_lib_statx,
 	.iomap_begin = fuse_lib_iomap_begin,
 	.iomap_end = fuse_lib_iomap_end,
 	.iomap_ioend = fuse_lib_iomap_ioend,

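from_statx() above copies only the seconds part of each timestamp and leaves
tv_nsec as an open question.  If nanosecond precision turns out to matter,
the conversion could look roughly like the sketch below.  The
sketch_statx_ts struct is a local stand-in for the kernel's statx timestamp
(signed 64-bit seconds, unsigned 32-bit nanoseconds), and the helper name is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Local stand-in for a statx timestamp. */
struct sketch_statx_ts {
	int64_t  tv_sec;	/* seconds since the epoch */
	uint32_t tv_nsec;	/* nanoseconds, 0..999999999 */
};

/* Convert one statx-style timestamp into a struct timespec. */
static struct timespec sketch_ts_from_statx(const struct sketch_statx_ts *ts)
{
	struct timespec out;

	assert(ts != NULL);

	out.tv_sec = (time_t)ts->tv_sec;
	out.tv_nsec = (long)ts->tv_nsec;
	return out;
}
```

On platforms whose struct stat exposes st_atim, st_mtim and st_ctim, the
result could be assigned to those members directly instead of to the
seconds-only st_atime, st_mtime and st_ctime.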


* [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-07-17 23:39   ` Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 02/22] fuse2fs: add iomap= mount option Darrick J. Wong
                     ` (20 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:39 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Add enough of an iomap implementation that we can do FIEMAP and
SEEK_DATA and SEEK_HOLE.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure       |   47 +++++
 configure.ac    |   32 ++++
 lib/config.h.in |    3 
 misc/fuse2fs.c  |  500 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 576 insertions(+), 6 deletions(-)


diff --git a/configure b/configure
index 0dc027d21280dc..ffa98829757788 100755
--- a/configure
+++ b/configure
@@ -14719,6 +14719,53 @@ elif test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=29
 fi
+
+if test "$FUSE_USE_VERSION" -ge 30
+then
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for iomap_begin in libfuse" >&5
+printf %s "checking for iomap_begin in libfuse... " >&6; }
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS	64
+#define FUSE_USE_VERSION 318
+#include <fuse.h>
+
+int
+main (void)
+{
+
+struct fuse_operations fs_ops = {
+	.iomap_begin = NULL,
+	.iomap_end = NULL,
+};
+struct fuse_iomap narf = { };
+
+  ;
+  return 0;
+}
+
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+  have_fuse_iomap=yes
+   { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+else $as_nop
+  { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; }
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+    conftest$ac_exeext conftest.$ac_ext
+if test "$have_fuse_iomap" = yes; then
+  FUSE_USE_VERSION=318
+
+printf "%s\n" "#define HAVE_FUSE_IOMAP 1" >>confdefs.h
+
+fi
+fi
+
 if test -n "$FUSE_USE_VERSION"
 then
 
diff --git a/configure.ac b/configure.ac
index 9f0e74c209b0f2..a4e122ac37880e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1447,6 +1447,38 @@ elif test -n "$FUSE_LIB"
 then
 	FUSE_USE_VERSION=29
 fi
+
+if test "$FUSE_USE_VERSION" -ge 30
+then
+dnl
+dnl see if fuse3 supports iomap
+dnl
+AC_MSG_CHECKING(for iomap_begin in libfuse)
+AC_LINK_IFELSE(
+[	AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS	64
+#define FUSE_USE_VERSION 318
+#include <fuse.h>
+	]], [[
+struct fuse_operations fs_ops = {
+	.iomap_begin = NULL,
+	.iomap_end = NULL,
+};
+struct fuse_iomap narf = { };
+	]])
+], have_fuse_iomap=yes
+   AC_MSG_RESULT(yes),
+   AC_MSG_RESULT(no))
+if test "$have_fuse_iomap" = yes; then
+  FUSE_USE_VERSION=318
+  AC_DEFINE(HAVE_FUSE_IOMAP, 1, [Define to 1 if fuse supports iomap])
+fi
+fi
+
+dnl
+dnl set FUSE_USE_VERSION now that we've done all the feature tests
+dnl
 if test -n "$FUSE_USE_VERSION"
 then
 	AC_DEFINE_UNQUOTED(FUSE_USE_VERSION, $FUSE_USE_VERSION,
diff --git a/lib/config.h.in b/lib/config.h.in
index f6597e69a7df8a..f054a1c1642a39 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -73,6 +73,9 @@
 /* Define to 1 if PR_SET_IO_FLUSHER is present */
 #undef HAVE_PR_SET_IO_FLUSHER
 
+/* Define to 1 if fuse supports iomap */
+#undef HAVE_FUSE_IOMAP
+
 /* Define to 1 if you have the Mac OS X function
    CFLocaleCopyPreferredLanguages in the CoreFoundation framework. */
 #undef HAVE_CFLOCALECOPYPREFERREDLANGUAGES
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 526c928f735ea2..e688772ddd8b60 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -145,6 +145,9 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
 	return b - m;
 }
 
+#define max(a, b)	((a) > (b) ? (a) : (b))
+#define min(x, y)	((x) < (y) ? (x) : (y))
+
 #define dbg_printf(fuse2fs, format, ...) \
 	while ((fuse2fs)->debug) { \
 		printf("FUSE2FS (%s): " format, (fuse2fs)->shortdev, ##__VA_ARGS__); \
@@ -216,6 +219,14 @@ enum fuse2fs_opstate {
 	F2OP_SHUTDOWN,
 };
 
+#ifdef HAVE_FUSE_IOMAP
+enum fuse2fs_iomap_state {
+	IOMAP_DISABLED,
+	IOMAP_UNKNOWN,
+	IOMAP_ENABLED,
+};
+#endif
+
 /* Main program context */
 #define FUSE2FS_MAGIC		(0xEF53DEADUL)
 struct fuse2fs {
@@ -241,6 +252,9 @@ struct fuse2fs {
 
 	enum fuse2fs_opstate opstate;
 	int blocklog;
+#ifdef HAVE_FUSE_IOMAP
+	enum fuse2fs_iomap_state iomap_state;
+#endif
 	unsigned int blockmask;
 	int retcode;
 	unsigned long offset;
@@ -462,6 +476,15 @@ static inline void __fuse2fs_finish(struct fuse2fs *ff, int ret,
 }
 #define fuse2fs_finish(ff, ret) __fuse2fs_finish((ff), (ret), __func__)
 
+#ifdef HAVE_FUSE_IOMAP
+static int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
+{
+	return ff->iomap_state >= IOMAP_ENABLED;
+}
+#else
+# define fuse2fs_iomap_enabled(...)	(0)
+#endif
+
 static void get_now(struct timespec *now)
 {
 #ifdef CLOCK_REALTIME
@@ -856,7 +879,7 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff, int libext2_flags)
 {
 	char options[128];
 	int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW |
-		    libext2_flags;
+		    EXT2_FLAG_WRITE_FULL_SUPER | libext2_flags;
 	errcode_t err;
 
 	if (ff->lockfile) {
@@ -1105,6 +1128,30 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
 }
 #endif
 
+#ifdef HAVE_FUSE_IOMAP
+static void fuse2fs_iomap_confirm(struct fuse_conn_info *conn,
+				  struct fuse2fs *ff)
+{
+	switch (ff->iomap_state) {
+	case IOMAP_UNKNOWN:
+		ff->iomap_state = IOMAP_DISABLED;
+		return;
+	case IOMAP_DISABLED:
+		return;
+	case IOMAP_ENABLED:
+		break;
+	}
+
+	/* iomap only works with block devices */
+	if (!fuse2fs_on_bdev(ff)) {
+		fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP);
+		ff->iomap_state = IOMAP_DISABLED;
+	}
+}
+#else
+# define fuse2fs_iomap_confirm(...)	((void)0)
+#endif
+
 static void *op_init(struct fuse_conn_info *conn
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 			, struct fuse_config *cfg EXT2FS_ATTR((unused))
@@ -1132,6 +1179,12 @@ static void *op_init(struct fuse_conn_info *conn
 #ifdef FUSE_CAP_NO_EXPORT_SUPPORT
 	fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	if (ff->iomap_state != IOMAP_DISABLED &&
+	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
+		ff->iomap_state = IOMAP_ENABLED;
+#endif
+
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 	conn->time_gran = 1;
 	cfg->use_ino = 1;
@@ -1151,6 +1204,8 @@ static void *op_init(struct fuse_conn_info *conn
 			goto mount_fail;
 		fs = ff->fs;
 
+		fuse2fs_iomap_confirm(conn, ff);
+
 		if (ff->cache_size) {
 			err = fuse2fs_config_cache(ff);
 			if (err)
@@ -1176,8 +1231,17 @@ static void *op_init(struct fuse_conn_info *conn
 		err = fuse2fs_mount(ff);
 		if (err)
 			goto mount_fail;
+	} else {
+		fuse2fs_iomap_confirm(conn, ff);
 	}
 
+	/*
+	 * If we're mounting in iomap mode, we need to unmount in op_destroy
+	 * so that the block device will be released before umount(2) returns.
+	 */
+	if (fuse2fs_iomap_enabled(ff))
+		ff->unmount_in_destroy = 1;
+
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
 	if (ff->opstate == F2OP_WRITABLE) {
 		fs->super->s_mnt_count++;
@@ -4734,6 +4798,424 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 # endif /* SUPPORT_FALLOCATE */
 #endif /* FUSE 29 */
 
+#ifdef HAVE_FUSE_IOMAP
+static void fuse2fs_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap,
+			       off_t pos, uint64_t count)
+{
+	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->addr = FUSE_IOMAP_NULL_ADDR;
+	iomap->offset = pos;
+	iomap->length = count;
+	iomap->type = FUSE_IOMAP_TYPE_HOLE;
+}
+
+static void fuse2fs_iomap_hole_to_eof(struct fuse2fs *ff,
+				      struct fuse_iomap *iomap, off_t pos,
+				      off_t count,
+				      const struct ext2_inode_large *inode)
+{
+	ext2_filsys fs = ff->fs;
+	uint64_t isize = EXT2_I_SIZE(inode);
+
+	/*
+	 * We have to be careful about handling a hole to the right of the
+	 * entire mapping tree.  First, the mapping must start and end on a
+	 * block boundary because they must be aligned to at least an LBA for
+	 * the block layer; and to the fsblock for smoother operation.
+	 *
+	 * As for the length -- we could return a mapping all the way to
+	 * i_size, but i_size could be less than pos/count if we're zeroing the
+	 * EOF block in anticipation of a truncate operation.  Similarly, we
+	 * don't want to end the mapping at pos+count because we know there's
+	 * nothing mapped beyond here.
+	 */
+	uint64_t startoff = round_down(pos, fs->blocksize);
+	uint64_t eofoff = round_up(max(pos + count, isize), fs->blocksize);
+
+	dbg_printf(ff,
+ "pos=0x%llx count=0x%llx isize=0x%llx startoff=0x%llx eofoff=0x%llx\n",
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   (unsigned long long)isize,
+		   (unsigned long long)startoff,
+		   (unsigned long long)eofoff);
+
+	fuse2fs_iomap_hole(ff, iomap, startoff, eofoff - startoff);
+}
+
+#define DEBUG_IOMAP
+#ifdef DEBUG_IOMAP
+# define __DUMP_EXTENT(ff, func, tag, startoff, err, extent) \
+	do { \
+		dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", \
+			   (func), (tag), (startoff), (err), (extent)->e_lblk, \
+			   (extent)->e_pblk, (extent)->e_len, \
+			   (extent)->e_flags & EXT2_EXTENT_FLAGS_UNINIT); \
+	} while(0)
+# define DUMP_EXTENT(ff, tag, startoff, err, extent) \
+	__DUMP_EXTENT((ff), __func__, (tag), (startoff), (err), (extent))
+#else
+# define __DUMP_EXTENT(...)	((void)0)
+# define DUMP_EXTENT(...)	((void)0)
+#endif
+
+static inline errcode_t __fuse2fs_get_mapping_at(struct fuse2fs *ff,
+						 ext2_extent_handle_t handle,
+						 blk64_t startoff,
+						 struct ext2fs_extent *bmap,
+						 const char *func)
+{
+	errcode_t err;
+
+	/*
+	 * Find the file mapping at startoff.  We don't check the return value
+	 * of _goto because _get will error out if _goto failed.  There's a
+	 * subtlety to the outcome of _goto when startoff falls in a sparse
+	 * hole however:
+	 *
+	 * Most of the time, _goto points the cursor at the mapping whose lblk
+	 * is just to the left of startoff.  The mapping may or may not overlap
+	 * startoff; this is ok.  In other words, the tree lookup behaves as if
+	 * we asked it to use a less than or equals comparison.
+	 *
+	 * However, if startoff is to the left of the first mapping in the
+	 * extent tree, _goto points the cursor at that first mapping because
+	 * it doesn't know how to deal with this situation.  In this case,
+	 * the tree lookup behaves as if we asked it to use a greater than
+	 * or equals comparison.
+	 *
+	 * Note: If _get() returns 'no current node', that means that there
+	 * aren't any mappings at all.
+	 */
+	ext2fs_extent_goto(handle, startoff);
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_CURRENT, bmap);
+	__DUMP_EXTENT(ff, func, "lookup", startoff, err, bmap);
+	if (err == EXT2_ET_NO_CURRENT_NODE)
+		err = EXT2_ET_EXTENT_NOT_FOUND;
+	return err;
+}
+
+static inline errcode_t __fuse2fs_get_next_mapping(struct fuse2fs *ff,
+						   ext2_extent_handle_t handle,
+						   blk64_t startoff,
+						   struct ext2fs_extent *bmap,
+						   const char *func)
+{
+	struct ext2fs_extent newex, errex;
+	errcode_t err;
+
+	err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT_LEAF, &newex);
+	DUMP_EXTENT(ff, "NEXT", startoff, err, &newex);
+	if (err == EXT2_ET_EXTENT_NO_NEXT)
+		return EXT2_ET_EXTENT_NOT_FOUND;
+	if (err)
+		return err;
+
+	/*
+	 * Try to get the next leaf mapping.  There's a weird and longstanding
+	 * "feature" of EXT2_EXTENT_NEXT_LEAF where walking off the end of the
+	 * mapping recordset causes it to wrap around to the beginning of the
+	 * extent map and we end up with a mapping to the left of the one that
+	 * was passed in.
+	 *
+	 * However, a corrupt extent tree could also have such a record.  The
+	 * only way to be sure is to retrieve the mapping for the extreme right
+	 * edge of the tree and compare it to the mapping that the caller gave
+	 * us.  If they match, then we've hit the end.  If not, something is
+	 * corrupt in the ondisk metadata.
+	 */
+	if (newex.e_lblk < bmap->e_lblk + bmap->e_len) {
+		err = __fuse2fs_get_mapping_at(ff, handle, ~0U, &errex, func);
+		if (err)
+			return err;
+
+		if (memcmp(bmap, &errex, sizeof(errex)) != 0)
+			return EXT2_ET_INODE_CORRUPTED;
+
+		return EXT2_ET_EXTENT_NOT_FOUND;
+	}
+
+	*bmap = newex;
+	return 0;
+}
+
+#define fuse2fs_get_mapping_at(ff, handle, startoff, bmap) \
+	__fuse2fs_get_mapping_at((ff), (handle), (startoff), (bmap), __func__)
+#define fuse2fs_get_next_mapping(ff, handle, startoff, bmap) \
+	__fuse2fs_get_next_mapping((ff), (handle), (startoff), (bmap), __func__)
+
+static errcode_t fuse2fs_iomap_begin_extent(struct fuse2fs *ff, uint64_t ino,
+					    struct ext2_inode_large *inode,
+					    off_t pos, uint64_t count,
+					    uint32_t opflags,
+					    struct fuse_iomap *iomap)
+{
+	ext2_extent_handle_t handle;
+	struct ext2fs_extent extent;
+	ext2_filsys fs = ff->fs;
+	const blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	errcode_t err;
+	int ret = 0;
+
+	err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = fuse2fs_get_mapping_at(ff, handle, startoff, &extent);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+		/* No mappings at all; the whole range is a hole. */
+		fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+		goto out_handle;
+	}
+	if (err) {
+		ret = translate_error(fs, ino, err);
+		goto out_handle;
+	}
+
+	if (startoff < extent.e_lblk) {
+		/*
+		 * Mapping starts to the right of the current position.
+		 * Synthesize a hole going to that next extent.
+		 */
+		fuse2fs_iomap_hole(ff, iomap, FUSE2FS_FSB_TO_B(ff, startoff),
+				FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+		goto out_handle;
+	}
+
+	if (startoff >= extent.e_lblk + extent.e_len) {
+		/*
+		 * Mapping ends to the left of the current position.  Try to
+		 * find the next mapping.  If there is no next mapping, the
+		 * whole range is in a hole.
+		 */
+		err = fuse2fs_get_next_mapping(ff, handle, startoff, &extent);
+		if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+			fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+			goto out_handle;
+		}
+
+		/*
+		 * If the new mapping starts to the right of startoff, there's
+		 * a hole from startoff to the start of the new mapping.
+		 */
+		if (startoff < extent.e_lblk) {
+			fuse2fs_iomap_hole(ff, iomap,
+				FUSE2FS_FSB_TO_B(ff, startoff),
+				FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+			goto out_handle;
+		}
+
+		/*
+		 * The new mapping starts at startoff.  Something weird
+		 * happened in the extent tree lookup, but we found a valid
+		 * mapping so we'll run with it.
+		 */
+	}
+
+	/* Mapping overlaps startoff, report this. */
+	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
+	iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
+	iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
+	if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)
+		iomap->type = FUSE_IOMAP_TYPE_UNWRITTEN;
+	else
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+
+out_handle:
+	ext2fs_extent_free(handle);
+	return ret;
+}
+
+static int fuse2fs_iomap_begin_indirect(struct fuse2fs *ff, uint64_t ino,
+					struct ext2_inode_large *inode,
+					off_t pos, uint64_t count,
+					uint32_t opflags,
+					struct fuse_iomap *iomap)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	uint64_t real_count = min(count, 131072);
+	const blk64_t endoff = FUSE2FS_B_TO_FSB(ff, pos + real_count);
+	blk64_t startblock;
+	errcode_t err;
+
+	err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, startoff, NULL,
+			   &startblock);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->offset = pos;
+	iomap->flags |= FUSE_IOMAP_F_MERGED;
+	if (startblock) {
+		iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock);
+		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+	} else {
+		iomap->addr = FUSE_IOMAP_NULL_ADDR;
+		iomap->type = FUSE_IOMAP_TYPE_HOLE;
+	}
+	iomap->length = fs->blocksize;
+
+	/* See how long the mapping goes for. */
+	for (startoff++; startoff < endoff; startoff++) {
+		blk64_t prev_startblock = startblock;
+
+		err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0,
+				   startoff, NULL, &startblock);
+		if (err)
+			break;
+
+		if (iomap->type == FUSE_IOMAP_TYPE_MAPPED) {
+			if (startblock == prev_startblock + 1)
+				iomap->length += fs->blocksize;
+			else
+				break;
+		} else {
+			if (startblock != 0)
+				break;
+		}
+	}
+
+	return 0;
+}
+
+static int fuse2fs_iomap_begin_inline(struct fuse2fs *ff, ext2_ino_t ino,
+				      struct ext2_inode_large *inode, off_t pos,
+				      uint64_t count, struct fuse_iomap *iomap)
+{
+	uint64_t one_fsb = FUSE2FS_FSB_TO_B(ff, 1);
+
+	if (pos >= one_fsb) {
+		fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+	} else {
+		/* ext4 only supports inline data files up to 1 fsb */
+		iomap->dev = FUSE_IOMAP_DEV_NULL;
+		iomap->addr = FUSE_IOMAP_NULL_ADDR;
+		iomap->offset = 0;
+		iomap->length = one_fsb;
+		iomap->type = FUSE_IOMAP_TYPE_INLINE;
+	}
+
+	return 0;
+}
+
+static int fuse2fs_iomap_begin_report(struct fuse2fs *ff, ext2_ino_t ino,
+				      struct ext2_inode_large *inode,
+				      off_t pos, uint64_t count,
+				      uint32_t opflags,
+				      struct fuse_iomap *read_iomap)
+{
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return fuse2fs_iomap_begin_inline(ff, ino, inode, pos, count,
+						  read_iomap);
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
+						  opflags, read_iomap);
+
+	return fuse2fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+					    opflags, read_iomap);
+}
+
+static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
+				    struct ext2_inode_large *inode, off_t pos,
+				    uint64_t count, uint32_t opflags,
+				    struct fuse_iomap *read_iomap)
+{
+	return -ENOSYS;
+}
+
+static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
+				     struct ext2_inode_large *inode, off_t pos,
+				     uint64_t count, uint32_t opflags,
+				     struct fuse_iomap *read_iomap)
+{
+	return -ENOSYS;
+}
+
+static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos, uint64_t count, uint32_t opflags,
+			  struct fuse_iomap *read_iomap,
+			  struct fuse_iomap *write_iomap)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct ext2_inode_large inode;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags);
+
+	fs = fuse2fs_start(ff);
+	err = fuse2fs_read_inode(fs, attr_ino, &inode);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	if (opflags & FUSE_IOMAP_OP_REPORT)
+		ret = fuse2fs_iomap_begin_report(ff, attr_ino, &inode, pos,
+						 count, opflags, read_iomap);
+	else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO))
+		ret = fuse2fs_iomap_begin_write(ff, attr_ino, &inode, pos,
+						count, opflags, read_iomap);
+	else
+		ret = fuse2fs_iomap_begin_read(ff, attr_ino, &inode, pos,
+					       count, opflags, read_iomap);
+	if (ret)
+		goto out_unlock;
+
+	dbg_printf(ff, "%s: nodeid=%llu attr_ino=%llu pos=0x%llx -> addr=0x%llx offset=0x%llx length=0x%llx type=%u\n",
+		   __func__,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)read_iomap->addr,
+		   (unsigned long long)read_iomap->offset,
+		   (unsigned long long)read_iomap->length,
+		   read_iomap->type);
+
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
+}
+
+static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			off_t pos, uint64_t count, uint32_t opflags,
+			ssize_t written, const struct fuse_iomap *iomap)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x written=0x%zx mapflags 0x%x\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   (unsigned long long)count,
+		   opflags,
+		   written,
+		   iomap->flags);
+
+	return 0;
+}
+#endif /* HAVE_FUSE_IOMAP */
+
 static struct fuse_operations fs_ops = {
 	.init = op_init,
 	.destroy = op_destroy,
@@ -4794,6 +5276,10 @@ static struct fuse_operations fs_ops = {
 	.fallocate = op_fallocate,
 # endif
 #endif
+#ifdef HAVE_FUSE_IOMAP
+	.iomap_begin = op_iomap_begin,
+	.iomap_end = op_iomap_end,
+#endif /* HAVE_FUSE_IOMAP */
 };
 
 static int get_random_bytes(void *p, size_t sz)
@@ -5010,17 +5496,19 @@ static void fuse2fs_com_err_proc(const char *whoami, errcode_t code,
 int main(int argc, char *argv[])
 {
 	struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
-	struct fuse2fs fctx;
+	struct fuse2fs fctx = {
+		.magic = FUSE2FS_MAGIC,
+		.opstate = F2OP_WRITABLE,
+#ifdef HAVE_FUSE_IOMAP
+		.iomap_state = IOMAP_UNKNOWN,
+#endif
+	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;
 	char *logfile;
 	char extra_args[BUFSIZ];
 	int ret;
 
-	memset(&fctx, 0, sizeof(fctx));
-	fctx.magic = FUSE2FS_MAGIC;
-	fctx.opstate = F2OP_WRITABLE;
-
 	ret = fuse_opt_parse(&args, &fctx, fuse2fs_opts, fuse2fs_opt_proc);
 	if (ret)
 		exit(1);
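
Not part of the patch: the rounding done for a hole that runs off the right
edge of the mapping tree can be checked in isolation.  This is a minimal
sketch, assuming only what the comment in fuse2fs_iomap_hole_to_eof says
(start rounded down to a block, end rounded up to the larger of pos+count
and i_size).  struct byte_range, rnd_down, rnd_up and hole_to_eof are
hypothetical names for illustration and do not exist in fuse2fs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset/length pair of a struct fuse_iomap. */
struct byte_range {
	uint64_t offset;
	uint64_t length;
};

static uint64_t rnd_down(uint64_t v, uint64_t align)
{
	return v - (v % align);
}

static uint64_t rnd_up(uint64_t v, uint64_t align)
{
	return rnd_down(v + align - 1, align);
}

/*
 * Return the block-aligned hole that covers [pos, pos + count) and also
 * reaches i_size when i_size lies further to the right.
 */
static struct byte_range hole_to_eof(uint64_t pos, uint64_t count,
				     uint64_t isize, uint64_t blocksize)
{
	struct byte_range r;
	uint64_t end = pos + count > isize ? pos + count : isize;

	assert(blocksize != 0);

	r.offset = rnd_down(pos, blocksize);
	r.length = rnd_up(end, blocksize) - r.offset;
	return r;
}
```

With 4k blocks, a request at pos=5000 count=100 in a 20000 byte file yields
a hole starting at 4096; a request past i_size (for example while zeroing
ahead of a truncate) still gets a mapping that covers the whole request.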


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 02/22] fuse2fs: add iomap= mount option
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
@ 2025-07-17 23:39   ` Darrick J. Wong
  2025-07-17 23:40   ` [PATCH 03/22] fuse2fs: implement iomap configuration Darrick J. Wong
                     ` (19 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:39 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Add a mount option to control iomap usage so that we can test before and
after scenarios.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
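Not part of the commit: a minimal standalone sketch of the tri-state
parsing that the FUSE2FS_IOMAP case performs, for anyone who wants to poke
at the accepted spellings without building fuse2fs.  enum toggle and
parse_iomap_opt are hypothetical names; unlike the patch, the sketch checks
the "iomap=" prefix itself because it does not sit behind fuse_opt_parse.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of enum fuse2fs_feature_toggle. */
enum toggle { T_DISABLE, T_ENABLE, T_DEFAULT };

/* Returns 0 and fills *out on success, -1 for an unknown spelling. */
static int parse_iomap_opt(const char *arg, enum toggle *out)
{
	const char *val;

	assert(arg != NULL && out != NULL);

	/* A bare "iomap" means the same as "iomap=1". */
	if (strcmp(arg, "iomap") == 0) {
		*out = T_ENABLE;
		return 0;
	}
	if (strncmp(arg, "iomap=", 6) != 0)
		return -1;

	val = arg + 6;
	if (strcmp(val, "1") == 0)
		*out = T_ENABLE;
	else if (strcmp(val, "0") == 0)
		*out = T_DISABLE;
	else if (strcmp(val, "default") == 0)
		*out = T_DEFAULT;
	else
		return -1;
	return 0;
}
```

In the patch itself, FT_ENABLE makes the mount fail if iomap cannot be
turned on, FT_DISABLE forces IOMAP_DISABLED before op_init runs, and
FT_DEFAULT leaves the decision to feature negotiation.
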
 misc/fuse2fs.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e688772ddd8b60..d4912dee08d43f 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -219,6 +219,12 @@ enum fuse2fs_opstate {
 	F2OP_SHUTDOWN,
 };
 
+enum fuse2fs_feature_toggle {
+	FT_DISABLE,
+	FT_ENABLE,
+	FT_DEFAULT,
+};
+
 #ifdef HAVE_FUSE_IOMAP
 enum fuse2fs_iomap_state {
 	IOMAP_DISABLED,
@@ -253,6 +259,7 @@ struct fuse2fs {
 	enum fuse2fs_opstate opstate;
 	int blocklog;
 #ifdef HAVE_FUSE_IOMAP
+	enum fuse2fs_feature_toggle iomap_want;
 	enum fuse2fs_iomap_state iomap_state;
 #endif
 	unsigned int blockmask;
@@ -1235,6 +1242,13 @@ static void *op_init(struct fuse_conn_info *conn
 		fuse2fs_iomap_confirm(conn, ff);
 	}
 
+#if defined(HAVE_FUSE_IOMAP)
+	if (ff->iomap_want == FT_ENABLE && !fuse2fs_iomap_enabled(ff)) {
+		err_printf(ff, "%s\n", _("could not enable iomap."));
+		goto mount_fail;
+	}
+#endif
+
 	/*
 	 * If we're mounting in iomap mode, we need to unmount in op_destroy
 	 * so that the block device will be released before umount(2) returns.
@@ -5307,6 +5321,9 @@ enum {
 	FUSE2FS_CACHE_SIZE,
 	FUSE2FS_DIRSYNC,
 	FUSE2FS_ERRORS_BEHAVIOR,
+#ifdef HAVE_FUSE_IOMAP
+	FUSE2FS_IOMAP,
+#endif
 };
 
 #define FUSE2FS_OPT(t, p, v) { t, offsetof(struct fuse2fs, p), v }
@@ -5335,6 +5352,10 @@ static struct fuse_opt fuse2fs_opts[] = {
 	FUSE_OPT_KEY("cache_size=%s",	FUSE2FS_CACHE_SIZE),
 	FUSE_OPT_KEY("dirsync",		FUSE2FS_DIRSYNC),
 	FUSE_OPT_KEY("errors=%s",	FUSE2FS_ERRORS_BEHAVIOR),
+#ifdef HAVE_FUSE_IOMAP
+	FUSE_OPT_KEY("iomap=%s",	FUSE2FS_IOMAP),
+	FUSE_OPT_KEY("iomap",		FUSE2FS_IOMAP),
+#endif
 
 	FUSE_OPT_KEY("-V",             FUSE2FS_VERSION),
 	FUSE_OPT_KEY("--version",      FUSE2FS_VERSION),
@@ -5386,6 +5407,23 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
 
 		/* do not pass through to libfuse */
 		return 0;
+#ifdef HAVE_FUSE_IOMAP
+	case FUSE2FS_IOMAP:
+		if (strcmp(arg, "iomap") == 0 || strcmp(arg + 6, "1") == 0)
+			ff->iomap_want = FT_ENABLE;
+		else if (strcmp(arg + 6, "0") == 0)
+			ff->iomap_want = FT_DISABLE;
+		else if (strcmp(arg + 6, "default") == 0)
+			ff->iomap_want = FT_DEFAULT;
+		else {
+			fprintf(stderr, "%s: %s\n", arg,
+ _("unknown iomap= behavior."));
+			return -1;
+		}
+
+		/* do not pass through to libfuse */
+		return 0;
+#endif
 	case FUSE2FS_IGNORED:
 		return 0;
 	case FUSE2FS_HELP:
@@ -5413,6 +5451,9 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
 	"    -o cache_size=N[KMG]   use a disk cache of this size\n"
 	"    -o errors=             behavior when an error is encountered:\n"
 	"                           continue|remount-ro|panic\n"
+#ifdef HAVE_FUSE_IOMAP
+	"    -o iomap=              0 to disable iomap, 1 to enable iomap\n"
+#endif
 	"\n",
 			outargs->argv[0]);
 		if (key == FUSE2FS_HELPFULL) {
@@ -5500,6 +5541,7 @@ int main(int argc, char *argv[])
 		.magic = FUSE2FS_MAGIC,
 		.opstate = F2OP_WRITABLE,
 #ifdef HAVE_FUSE_IOMAP
+		.iomap_want = FT_DEFAULT,
 		.iomap_state = IOMAP_UNKNOWN,
 #endif
 	};
@@ -5518,6 +5560,11 @@ int main(int argc, char *argv[])
 		exit(1);
 	}
 
+#ifdef HAVE_FUSE_IOMAP
+	if (fctx.iomap_want == FT_DISABLE)
+		fctx.iomap_state = IOMAP_DISABLED;
+#endif
+
 	/* /dev/sda -> sda for reporting */
 	fctx.shortdev = strrchr(fctx.device, '/');
 	if (fctx.shortdev)


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 03/22] fuse2fs: implement iomap configuration
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
  2025-07-17 23:39   ` [PATCH 02/22] fuse2fs: add iomap= mount option Darrick J. Wong
@ 2025-07-17 23:40   ` Darrick J. Wong
  2025-07-17 23:40   ` [PATCH 04/22] fuse2fs: register block devices for use with iomap Darrick J. Wong
                     ` (18 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:40 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Upload the filesystem geometry to the kernel when asked.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
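Not part of the commit: a worked version of the s_maxbytes arithmetic in
fuse2fs_max_size, as a standalone sketch.  max_file_size is a hypothetical
name, and the huge_file flag is passed in as a plain int instead of being
read from the superblock.  The 1k..64k block size range in the assertion is
my assumption about what ext4 permits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * blocklog is log2 of the fs block size; upper_limit is the limit handed
 * down by the kernel (maxbytes in op_iomap_config).
 */
static int64_t max_file_size(unsigned int blocklog, int huge_file,
			     int64_t upper_limit)
{
	int64_t res;

	assert(blocklog >= 10 && blocklog <= 16);

	if (!huge_file) {
		/* 2^32 - 1 units of 512 bytes, truncated to whole blocks */
		upper_limit = (1LL << 32) - 1;
		upper_limit >>= (blocklog - 9);
		upper_limit <<= blocklog;
	}

	/* 32-bit logical block number, minus one block so ee_len fits */
	res = ((1LL << 32) - 1) << blocklog;

	return res > upper_limit ? upper_limit : res;
}
```

For 4k blocks this works out to 2^44 - 4096 bytes with huge_file set and
2^41 - 4096 bytes without it, provided the kernel's own limit is not
smaller.
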
 misc/fuse2fs.c |   96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 93 insertions(+), 3 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index d4912dee08d43f..fb71886b58f215 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -194,6 +194,10 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
 # define FL_ZERO_RANGE_FLAG (0)
 #endif
 
+#ifndef NSEC_PER_SEC
+# define NSEC_PER_SEC	(1000000000L)
+#endif
+
 errcode_t ext2fs_run_ext3_journal(ext2_filsys *fs);
 
 const char *err_shortdev;
@@ -575,9 +579,9 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
 	EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode);
 	get_now(&now);
 
-	datime = atime.tv_sec + ((double)atime.tv_nsec / 1000000000);
-	dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / 1000000000);
-	dnow = now.tv_sec + ((double)now.tv_nsec / 1000000000);
+	datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC);
+	dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
+	dnow = now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC);
 
 	/*
 	 * If atime is newer than mtime and atime hasn't been updated in thirty
@@ -5228,6 +5232,91 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 
 	return 0;
 }
+
+/*
+ * Maximal extent format file size.
+ * Resulting logical blkno at s_maxbytes must fit in our on-disk
+ * extent format containers, within a sector_t, and within i_blocks
+ * in the vfs.  ext4 inode has 48 bits of i_block in fsblock units,
+ * so that won't be a limiting factor.
+ *
+ * However, there is another limiting factor. We do store extents in the form
+ * of starting block and length, hence the resulting length of the extent
+ * covering maximum file size must fit into on-disk format containers as
+ * well. Given that length is always by 1 unit bigger than max unit (because
+ * we count 0 as well) we have to lower the s_maxbytes by one fs block.
+ *
+ * Note, this does *not* consider any metadata overhead for vfs i_blocks.
+ */
+static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit)
+{
+	off_t res;
+
+	if (!ext2fs_has_feature_huge_file(ff->fs->super)) {
+		upper_limit = (1LL << 32) - 1;
+
+		/* total blocks in file system block size */
+		upper_limit >>= (ff->blocklog - 9);
+		upper_limit <<= ff->blocklog;
+	}
+
+	/*
+	 * 32-bit extent-start container, ee_block. We lower the maxbytes
+	 * by one fs block, so ee_len can cover the extent of maximum file
+	 * size
+	 */
+	res = (1LL << 32) - 1;
+	res <<= ff->blocklog;
+
+	/* Sanity check against vm- & vfs- imposed limits */
+	if (res > upper_limit)
+		res = upper_limit;
+
+	return res;
+}
+
+static int op_iomap_config(uint32_t flags, off_t maxbytes,
+			   struct fuse_iomap_config *cfg)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	ext2_filsys fs;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	dbg_printf(ff, "%s: flags=0x%x maxbytes=0x%llx\n", __func__, flags,
+		   (long long)maxbytes);
+	fs = fuse2fs_start(ff);
+
+	cfg->flags |= FUSE_IOMAP_CONFIG_UUID;
+	memcpy(cfg->s_uuid, fs->super->s_uuid, sizeof(cfg->s_uuid));
+	cfg->s_uuid_len = sizeof(fs->super->s_uuid);
+
+	cfg->flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE;
+	cfg->s_blocksize = FUSE2FS_FSB_TO_B(ff, 1);
+
+	/*
+	 * If the inode is large enough to house i_[acm]time_extra then we
+	 * can turn on nanosecond timestamps; i_crtime was the next field added
+	 * after i_atime_extra.
+	 */
+	cfg->flags |= FUSE_IOMAP_CONFIG_TIME;
+	if (fs->super->s_inode_size >=
+	    offsetof(struct ext2_inode_large, i_crtime)) {
+		cfg->s_time_gran = 1;
+		cfg->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX;
+	} else {
+		cfg->s_time_gran = NSEC_PER_SEC;
+		cfg->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX;
+	}
+	cfg->s_time_min = EXT4_TIMESTAMP_MIN;
+
+	cfg->flags |= FUSE_IOMAP_CONFIG_MAXBYTES;
+	cfg->s_maxbytes = fuse2fs_max_size(ff, maxbytes);
+
+	fuse2fs_finish(ff, 0);
+	return 0;
+}
 #endif /* HAVE_FUSE_IOMAP */
 
 static struct fuse_operations fs_ops = {
@@ -5293,6 +5382,7 @@ static struct fuse_operations fs_ops = {
 #ifdef HAVE_FUSE_IOMAP
 	.iomap_begin = op_iomap_begin,
 	.iomap_end = op_iomap_end,
+	.iomap_config = op_iomap_config,
 #endif /* HAVE_FUSE_IOMAP */
 };
 


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 04/22] fuse2fs: register block devices for use with iomap
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:40   ` [PATCH 03/22] fuse2fs: implement iomap configuration Darrick J. Wong
@ 2025-07-17 23:40   ` Darrick J. Wong
  2025-07-17 23:40   ` [PATCH 05/22] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
                     ` (17 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:40 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Register the ext4 block device with the kernel for use with iomap.  For
now this is redundant with using fuseblk mode because the kernel
automatically registers any fuseblk devices, but eventually we'll go
back to regular fuse mode and we'll have to pin the bdev ourselves.
In theory this interface supports strange beasts where the metadata can
exist somewhere else entirely (or be made up by AI) while the file data
persists to real disks.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index fb71886b58f215..9eb067e1737054 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -40,6 +40,7 @@
 # define _FILE_OFFSET_BITS 64
 #endif /* _FILE_OFFSET_BITS */
 #include <fuse.h>
+#include <fuse_lowlevel.h>
 #ifdef __SET_FOB_FOR_FUSE
 # undef _FILE_OFFSET_BITS
 #endif /* __SET_FOB_FOR_FUSE */
@@ -265,6 +266,7 @@ struct fuse2fs {
 #ifdef HAVE_FUSE_IOMAP
 	enum fuse2fs_feature_toggle iomap_want;
 	enum fuse2fs_iomap_state iomap_state;
+	uint32_t iomap_dev;
 #endif
 	unsigned int blockmask;
 	int retcode;
@@ -5032,7 +5034,7 @@ static errcode_t fuse2fs_iomap_begin_extent(struct fuse2fs *ff, uint64_t ino,
 	}
 
 	/* Mapping overlaps startoff, report this. */
-	iomap->dev = FUSE_IOMAP_DEV_NULL;
+	iomap->dev = ff->iomap_dev;
 	iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
 	iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
 	iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
@@ -5064,13 +5066,14 @@ static int fuse2fs_iomap_begin_indirect(struct fuse2fs *ff, uint64_t ino,
 	if (err)
 		return translate_error(fs, ino, err);
 
-	iomap->dev = FUSE_IOMAP_DEV_NULL;
 	iomap->offset = pos;
 	iomap->flags |= FUSE_IOMAP_F_MERGED;
 	if (startblock) {
+		iomap->dev = ff->iomap_dev;
 		iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock);
 		iomap->type = FUSE_IOMAP_TYPE_MAPPED;
 	} else {
+		iomap->dev = FUSE_IOMAP_DEV_NULL;
 		iomap->addr = FUSE_IOMAP_NULL_ADDR;
 		iomap->type = FUSE_IOMAP_TYPE_HOLE;
 	}
@@ -5275,12 +5278,38 @@ static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit)
 	return res;
 }
 
+static errcode_t fuse2fs_iomap_config_devices(struct fuse_context *ctxt,
+					      struct fuse2fs *ff)
+{
+	struct fuse_session *se = fuse_get_session(ctxt->fuse);
+	errcode_t err;
+	int fd;
+	int ret;
+
+	err = io_channel_fd(ff->fs->io, &fd);
+	if (err)
+		return err;
+
+	ret = fuse_iomap_add_device(se, fd, 0);
+
+	dbg_printf(ff, "%s: registering iomap dev fd=%d ret=%d iomap_dev=%u\n",
+		   __func__, fd, ret, ff->iomap_dev);
+
+	if (ret < 1)
+		return -EIO;
+
+	ff->iomap_dev = ret;
+	return 0;
+}
+
 static int op_iomap_config(uint32_t flags, off_t maxbytes,
 			   struct fuse_iomap_config *cfg)
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
 
@@ -5314,8 +5343,15 @@ static int op_iomap_config(uint32_t flags, off_t maxbytes,
 	cfg->flags |= FUSE_IOMAP_CONFIG_MAXBYTES;
 	cfg->s_maxbytes = fuse2fs_max_size(ff, maxbytes);
 
-	fuse2fs_finish(ff, 0);
-	return 0;
+	err = fuse2fs_iomap_config_devices(ctxt, ff);
+	if (err) {
+		ret = translate_error(fs, 0, err);
+		goto out_unlock;
+	}
+
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
 }
 #endif /* HAVE_FUSE_IOMAP */
 
@@ -5633,6 +5669,7 @@ int main(int argc, char *argv[])
 #ifdef HAVE_FUSE_IOMAP
 		.iomap_want = FT_DEFAULT,
 		.iomap_state = IOMAP_UNKNOWN,
+		.iomap_dev = FUSE_IOMAP_DEV_NULL,
 #endif
 	};
 	errcode_t err;


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 05/22] fuse2fs: always use directio disk reads with fuse2fs
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-17 23:40   ` [PATCH 04/22] fuse2fs: register block devices for use with iomap Darrick J. Wong
@ 2025-07-17 23:40   ` Darrick J. Wong
  2025-07-17 23:40   ` [PATCH 06/22] fuse2fs: implement directio file reads Darrick J. Wong
                     ` (16 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:40 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the kernel writes file data directly to the block device
and does not flush the bdev page cache.  We must open the filesystem in
directio mode to avoid cache coherency issues when reading file data
blocks.  If we can't open the bdev in directio mode, we must not use
iomap.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
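
The fallback described in the commit message can be sketched in isolation
as below.  This is only a minimal illustration under stated assumptions:
struct mount_state, open_fn, and the two opener helpers are hypothetical
stand-ins and are not part of fuse2fs.  The point is the ordering: force
directio when iomap is wanted, and if the open fails and the user did not
ask for directio themselves, give up on iomap and retry once.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fuse2fs mount state; not the real struct. */
struct mount_state {
	bool iomap_enabled;	/* kernel performs file IO through iomap */
	bool directio;		/* block device is opened for direct IO */
};

/* Hypothetical opener; returns 0 on success and nonzero on failure. */
typedef int (*open_fn)(const struct mount_state *ms);

/* Models a device that cannot be opened in directio mode. */
static int open_ok_unless_directio(const struct mount_state *ms)
{
	return ms->directio ? -1 : 0;
}

/* Models a device that can always be opened. */
static int open_always_ok(const struct mount_state *ms)
{
	(void)ms;
	return 0;
}

/*
 * Force directio on when iomap is enabled.  If the open fails and directio
 * was not requested by the user, turn off iomap and retry once without it.
 */
static int open_for_iomap(struct mount_state *ms, open_fn do_open)
{
	bool was_directio;
	int err;

	assert(ms && do_open);

	was_directio = ms->directio;
	if (ms->iomap_enabled)
		ms->directio = true;

	err = do_open(ms);
	if (err && ms->iomap_enabled && !was_directio) {
		ms->iomap_enabled = false;
		ms->directio = false;
		err = do_open(ms);
	}
	return err;
}
```

If the user asked for directio explicitly and the open fails, the sketch
does not retry, which mirrors the !was_directio test in the diff.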


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 9eb067e1737054..72b9ec837209ca 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1174,6 +1174,9 @@ static void *op_init(struct fuse_conn_info *conn
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	ext2_filsys fs = ff->fs;
+#ifdef HAVE_FUSE_IOMAP
+	int was_directio = ff->directio;
+#endif
 	errcode_t err;
 	int ret;
 
@@ -1196,6 +1199,15 @@ static void *op_init(struct fuse_conn_info *conn
 	if (ff->iomap_state != IOMAP_DISABLED &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;
+	/*
+	 * In iomap mode, the kernel writes file data directly to the block
+	 * device and does not flush the bdev page cache.  We must open the
+	 * filesystem in directio mode to avoid cache coherency issues when
+	 * reading file data.  If we can't open the bdev in directio mode, we
+	 * must not use iomap.
+	 */
+	if (fuse2fs_iomap_enabled(ff))
+		ff->directio = 1;
 #endif
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
@@ -1213,6 +1225,14 @@ static void *op_init(struct fuse_conn_info *conn
 	 */
 	if (!fs) {
 		err = fuse2fs_open(ff, 0);
+#ifdef HAVE_FUSE_IOMAP
+		if (err && fuse2fs_iomap_enabled(ff) && !was_directio) {
+			fuse_unset_feature_flag(conn, FUSE_CAP_IOMAP);
+			ff->iomap_state = IOMAP_DISABLED;
+			ff->directio = 0;
+			err = fuse2fs_open(ff, 0);
+		}
+#endif
 		if (err)
 			goto mount_fail;
 		fs = ff->fs;


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 06/22] fuse2fs: implement directio file reads
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-17 23:40   ` [PATCH 05/22] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
@ 2025-07-17 23:40   ` Darrick J. Wong
  2025-07-17 23:41   ` [PATCH 07/22] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
                     ` (15 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:40 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Implement file reads via iomap.  Currently only directio is supported.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)
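
The decision that the read path makes can be sketched on its own, leaving
out the io_channel flush.  The flag values and the enum below are
illustrative assumptions, not the real constants from the fuse or ext2fs
headers.  Returning -ENOSYS stands for "let the caller fall back to the
slow path", as the comment in the diff puts it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative flag values only; the real ones live in other headers. */
#define SKETCH_OP_DIRECT	(1U << 0)
#define SKETCH_INLINE_DATA_FL	(1U << 1)
#define SKETCH_EXTENTS_FL	(1U << 2)

enum read_path {
	READ_PATH_EXTENT,	/* look up the mapping in the extent tree */
	READ_PATH_INDIRECT,	/* look up the mapping via block maps */
};

/*
 * Returns 0 and sets *path if iomap can serve this read, or -ENOSYS if the
 * read must use the non-iomap path (not directio, or inline data).
 */
static int pick_read_path(uint32_t opflags, uint32_t i_flags,
			  enum read_path *path)
{
	assert(path);

	if (!(opflags & SKETCH_OP_DIRECT))
		return -ENOSYS;

	if (i_flags & SKETCH_INLINE_DATA_FL)
		return -ENOSYS;

	*path = (i_flags & SKETCH_EXTENTS_FL) ? READ_PATH_EXTENT :
						READ_PATH_INDIRECT;
	return 0;
}
```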


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 72b9ec837209ca..209858aeb9307c 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1274,6 +1274,10 @@ static void *op_init(struct fuse_conn_info *conn
 		goto mount_fail;
 	}
 #endif
+#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_DIRECTIO)
+	if (fuse2fs_iomap_enabled(ff))
+		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO);
+#endif
 
 	/*
 	 * If we're mounting in iomap mode, we need to unmount in op_destroy
@@ -5165,7 +5169,26 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 				    uint64_t count, uint32_t opflags,
 				    struct fuse_iomap *read_iomap)
 {
-	return -ENOSYS;
+	errcode_t err;
+
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	/* fall back to slow path for inline data reads */
+	if (inode->i_flags & EXT4_INLINE_DATA_FL)
+		return -ENOSYS;
+
+	/* flush dirty io_channel buffers to disk before iomap reads them */
+	err = io_channel_flush(ff->fs->io);
+	if (err)
+		return translate_error(ff->fs, ino, err);
+
+	if (inode->i_flags & EXT4_EXTENTS_FL)
+		return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
+						  opflags, read_iomap);
+
+	return fuse2fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+					    opflags, read_iomap);
 }
 
 static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 07/22] fuse2fs: use tagged block IO for zeroing sub-block regions
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-07-17 23:40   ` [PATCH 06/22] fuse2fs: implement directio file reads Darrick J. Wong
@ 2025-07-17 23:41   ` Darrick J. Wong
  2025-07-17 23:41   ` [PATCH 08/22] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
                     ` (14 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:41 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Change the punch hole helpers to use the tagged block IO commands now
that libext2fs uses tagged block IO commands for file IO.  We'll need
this in the next patch when we turn on selective IO manager cache
clearing and invalidation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
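
For readers who do not have the punch hole helpers in front of them, the
read-modify-write pattern that they use can be sketched as follows.  The
in-memory device and the sketch_*_tagblk helpers are hypothetical and only
record the tag; they are not the libext2fs io_channel functions.  The tag
is the inode number that owns the block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLOCKSIZE	16
#define SKETCH_NR_BLOCKS	4

/* Hypothetical in-memory "device" and the last tag seen by the IO layer. */
static unsigned char sketch_dev[SKETCH_NR_BLOCKS][SKETCH_BLOCKSIZE];
static unsigned int sketch_last_tag;

/* Stand-in for a tagged block read: one block, tagged with an inode. */
static int sketch_read_tagblk(unsigned int tag, size_t blk, void *buf)
{
	if (blk >= SKETCH_NR_BLOCKS)
		return -1;
	sketch_last_tag = tag;
	memcpy(buf, sketch_dev[blk], SKETCH_BLOCKSIZE);
	return 0;
}

/* Stand-in for a tagged block write. */
static int sketch_write_tagblk(unsigned int tag, size_t blk, const void *buf)
{
	if (blk >= SKETCH_NR_BLOCKS)
		return -1;
	sketch_last_tag = tag;
	memcpy(sketch_dev[blk], buf, SKETCH_BLOCKSIZE);
	return 0;
}

/* Zero len bytes starting at byte offset residue inside block blk. */
static int sketch_zero_middle(unsigned int ino, size_t blk, size_t residue,
			      size_t len)
{
	unsigned char buf[SKETCH_BLOCKSIZE];
	int err;

	assert(residue + len <= SKETCH_BLOCKSIZE);

	err = sketch_read_tagblk(ino, blk, buf);
	if (err)
		return err;

	memset(buf + residue, 0, len);

	return sketch_write_tagblk(ino, blk, buf);
}
```

Because both halves of the read-modify-write carry the same tag, a later
per-file flush or invalidation can find the buffers that belong to that
file.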


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 209858aeb9307c..64aca0f962daaf 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4675,13 +4675,13 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	if (!blk || (retflags & BMAP_RET_UNINIT))
 		return 0;
 
-	err = io_channel_read_blk(fs->io, blk, 1, *buf);
+	err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf);
 	if (err)
 		return err;
 
 	memset(*buf + residue, 0, len);
 
-	return io_channel_write_blk(fs->io, blk, 1, *buf);
+	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
 
 static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
@@ -4709,7 +4709,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return err;
 
-	err = io_channel_read_blk(fs->io, blk, 1, *buf);
+	err = io_channel_read_tagblk(fs->io, ino, blk, 1, *buf);
 	if (err)
 		return err;
 	if (!blk || (retflags & BMAP_RET_UNINIT))
@@ -4720,7 +4720,7 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	else
 		memset(*buf + residue, 0, fs->blocksize - residue);
 
-	return io_channel_write_blk(fs->io, blk, 1, *buf);
+	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
 
 static int fuse2fs_punch_range(struct fuse2fs *ff,


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 08/22] fuse2fs: only flush the cache for the file under directio read
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-07-17 23:41   ` [PATCH 07/22] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
@ 2025-07-17 23:41   ` Darrick J. Wong
  2025-07-17 23:41   ` [PATCH 09/22] fuse2fs: add extent dump function for debugging Darrick J. Wong
                     ` (13 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:41 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

We only need to flush the io_channel's cache for the file that's being
read directly, not everything else.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
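
The difference between flushing everything and flushing one file's buffers
can be shown with a toy cache.  This is a sketch under stated assumptions:
struct sketch_buf and both helpers are hypothetical and do not reflect how
the libext2fs IO manager stores its cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache entry: tag is the owning inode number. */
struct sketch_buf {
	unsigned int tag;
	bool dirty;
};

/* Write back every dirty buffer; returns how many were written. */
static size_t sketch_flush_all(struct sketch_buf *bufs, size_t nr)
{
	size_t i, written = 0;

	assert(bufs || nr == 0);

	for (i = 0; i < nr; i++) {
		if (!bufs[i].dirty)
			continue;
		bufs[i].dirty = false;	/* pretend the block went to disk */
		written++;
	}
	return written;
}

/* Write back only the dirty buffers that carry the given tag. */
static size_t sketch_flush_tag(struct sketch_buf *bufs, size_t nr,
			       unsigned int tag)
{
	size_t i, written = 0;

	assert(bufs || nr == 0);

	for (i = 0; i < nr; i++) {
		if (!bufs[i].dirty || bufs[i].tag != tag)
			continue;
		bufs[i].dirty = false;	/* pretend the block went to disk */
		written++;
	}
	return written;
}
```

A directio read of inode 5 only needs the buffers tagged 5 to reach the
disk first; dirty buffers of other files can stay in the cache.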


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 64aca0f962daaf..88b71af417c0d7 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5179,7 +5179,7 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 		return -ENOSYS;
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
-	err = io_channel_flush(ff->fs->io);
+	err = io_channel_flush_tag(ff->fs->io, ino);
 	if (err)
 		return translate_error(ff->fs, ino, err);
 


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 09/22] fuse2fs: add extent dump function for debugging
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-07-17 23:41   ` [PATCH 08/22] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
@ 2025-07-17 23:41   ` Darrick J. Wong
  2025-07-17 23:41   ` [PATCH 10/22] fuse2fs: implement direct write support Darrick J. Wong
                     ` (12 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:41 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Add a function to dump an inode's extent map for debugging purposes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 88b71af417c0d7..0137403b7a25b9 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -498,6 +498,74 @@ static int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
 # define fuse2fs_iomap_enabled(...)	(0)
 #endif
 
+static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
+					struct ext2_inode_large *inode,
+					const char *why)
+{
+	ext2_filsys fs = ff->fs;
+	unsigned int nr = 0;
+	blk64_t blockcount = 0;
+	struct ext2_inode_large xinode;
+	struct ext2fs_extent extent;
+	ext2_extent_handle_t extents;
+	int op = EXT2_EXTENT_ROOT;
+	errcode_t retval;
+
+	if (!inode) {
+		inode = &xinode;
+
+		retval = fuse2fs_read_inode(fs, ino, inode);
+		if (retval) {
+			com_err(__func__, retval, _("reading ino %u"), ino);
+			return;
+		}
+	}
+
+	if (!(inode->i_flags & EXT4_EXTENTS_FL))
+		return;
+
+	printf("%s: %s ino=%u isize %llu iblocks %llu\n", __func__, why, ino,
+	       EXT2_I_SIZE(inode),
+	       (ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)) * 512) /
+	        fs->blocksize);
+	fflush(stdout);
+
+	retval = ext2fs_extent_open(fs, ino, &extents);
+	if (retval) {
+		com_err(__func__, retval, _("opening extents of ino \"%u\""),
+			ino);
+		return;
+	}
+
+	while ((retval = ext2fs_extent_get(extents, op, &extent)) == 0) {
+		op = EXT2_EXTENT_NEXT;
+
+		if (extent.e_flags & EXT2_EXTENT_FLAGS_SECOND_VISIT)
+			continue;
+
+		printf("[%u]: %s ino=%u lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n",
+		       nr++, why, ino, extent.e_lblk, extent.e_pblk,
+		       extent.e_len, extent.e_flags);
+		fflush(stdout);
+		if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF)
+			blockcount += extent.e_len;
+		else
+			blockcount++;
+	}
+	if (retval == EXT2_ET_EXTENT_NO_NEXT)
+		retval = 0;
+	if (retval) {
+		com_err(__func__, retval, _("getting extents of ino %u"),
+			ino);
+	}
+	if (inode->i_file_acl)
+		blockcount++;
+	printf("%s: %s sum(e_len) %llu\n", __func__, why, blockcount);
+	fflush(stdout);
+
+	ext2fs_extent_free(extents);
+}
+
 static void get_now(struct timespec *now)
 {
 #ifdef CLOCK_REALTIME


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 10/22] fuse2fs: implement direct write support
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-07-17 23:41   ` [PATCH 09/22] fuse2fs: add extent dump function for debugging Darrick J. Wong
@ 2025-07-17 23:41   ` Darrick J. Wong
  2025-07-17 23:42   ` [PATCH 11/22] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
                     ` (11 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:41 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Wire up an iomap_begin method that can allocate into holes so that we
can do directio writes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |  482 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 479 insertions(+), 3 deletions(-)
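
Most of this patch is the ioend handler that converts unwritten mappings
to written ones after a directio write lands.  The core arithmetic can be
sketched without libext2fs: one unwritten extent that overlaps the written
range [start, stop) splits into at most three pieces.  struct
sketch_extent and sketch_convert() are hypothetical names for
illustration; the real code edits the extent tree in place through an
extent handle and also merges neighbours afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of one extent mapping. */
struct sketch_extent {
	uint64_t lblk;		/* first logical block */
	uint64_t pblk;		/* first physical block */
	uint32_t len;		/* length in blocks */
	bool unwritten;
};

/*
 * Mark the blocks of ex that fall in [start, stop) as written.  Stores
 * one to three extents in out[] and returns how many were stored.
 */
static size_t sketch_convert(const struct sketch_extent *ex, uint64_t start,
			     uint64_t stop, struct sketch_extent out[3])
{
	uint64_t ex_end;
	size_t n = 0;

	assert(ex && out);
	ex_end = ex->lblk + ex->len;

	/* Clamp the written range to this extent. */
	if (start < ex->lblk)
		start = ex->lblk;
	if (stop > ex_end)
		stop = ex_end;

	/* Already written, or no overlap: nothing to convert. */
	if (!ex->unwritten || start >= stop) {
		out[0] = *ex;
		return 1;
	}

	/* Unwritten piece to the left of the written range. */
	if (start > ex->lblk)
		out[n++] = (struct sketch_extent){
			.lblk = ex->lblk,
			.pblk = ex->pblk,
			.len = (uint32_t)(start - ex->lblk),
			.unwritten = true,
		};

	/* The written piece itself. */
	out[n++] = (struct sketch_extent){
		.lblk = start,
		.pblk = ex->pblk + (start - ex->lblk),
		.len = (uint32_t)(stop - start),
		.unwritten = false,
	};

	/* Unwritten piece to the right of the written range. */
	if (stop < ex_end)
		out[n++] = (struct sketch_extent){
			.lblk = stop,
			.pblk = ex->pblk + (stop - ex->lblk),
			.len = (uint32_t)(ex_end - stop),
			.unwritten = true,
		};

	return n;
}
```

For example, an unwritten extent covering logical blocks 100 to 109 with a
write to blocks 104 to 106 yields an unwritten piece of 4 blocks, a written
piece of 3 blocks, and an unwritten piece of 3 blocks.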


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 0137403b7a25b9..8c3cc7adc72579 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5259,12 +5259,100 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 					    opflags, read_iomap);
 }
 
+static int fuse2fs_iomap_write_allocate(struct fuse2fs *ff, ext2_ino_t ino,
+				     struct ext2_inode_large *inode, off_t pos,
+				     uint64_t count, uint32_t opflags,
+				     struct fuse_iomap *read_iomap, bool *dirty)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + count);
+	errcode_t err;
+	int ret;
+
+	dbg_printf(ff, "%s: write_alloc ino=%u startoff 0x%llx blockcount 0x%llx\n",
+		   __func__, ino, startoff, stopoff - startoff);
+
+	if (!fs_can_allocate(ff, stopoff - startoff))
+		return -ENOSPC;
+
+	err = ext2fs_fallocate(fs, EXT2_FALLOCATE_FORCE_UNINIT, ino,
+			       EXT2_INODE(inode), ~0ULL, startoff,
+			       stopoff - startoff);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* pick up the newly allocated mapping */
+	ret = fuse2fs_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+				       read_iomap);
+	if (ret)
+		return ret;
+
+	read_iomap->flags |= FUSE_IOMAP_F_DIRTY;
+	*dirty = true;
+	return 0;
+}
+
+static off_t fuse2fs_max_file_size(const struct fuse2fs *ff,
+				   const struct ext2_inode_large *inode)
+{
+	ext2_filsys fs = ff->fs;
+	blk64_t addr_per_block, max_map_block;
+
+	if (inode->i_flags & EXT4_EXTENTS_FL) {
+		max_map_block = (1ULL << 32) - 1;
+	} else {
+		addr_per_block = fs->blocksize >> 2;
+		max_map_block = addr_per_block;
+		max_map_block += addr_per_block * addr_per_block;
+		max_map_block += addr_per_block * addr_per_block * addr_per_block;
+		max_map_block += 12;
+	}
+
+	return FUSE2FS_FSB_TO_B(ff, max_map_block) + (fs->blocksize - 1);
+}
+
 static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 				     struct ext2_inode_large *inode, off_t pos,
 				     uint64_t count, uint32_t opflags,
-				     struct fuse_iomap *read_iomap)
+				     struct fuse_iomap *read_iomap,
+				     bool *dirty)
 {
-	return -ENOSYS;
+	off_t max_size = fuse2fs_max_file_size(ff, inode);
+	errcode_t err;
+	int ret;
+
+	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+		return -ENOSYS;
+
+	if (pos >= max_size)
+		return -EFBIG;
+
+	if (pos >= max_size - count)
+		count = max_size - pos;
+
+	ret = fuse2fs_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+				       read_iomap);
+	if (ret)
+		return ret;
+
+	if (read_iomap->type == FUSE_IOMAP_TYPE_HOLE &&
+	    !(opflags & FUSE_IOMAP_OP_ZERO)) {
+		ret = fuse2fs_iomap_write_allocate(ff, ino, inode, pos, count,
+						   opflags, read_iomap, dirty);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * flush and invalidate the file's io_channel buffers before iomap
+	 * writes them
+	 */
+	err = io_channel_invalidate_tag(ff->fs->io, ino);
+	if (err)
+		return translate_error(ff->fs, ino, err);
+
+	return 0;
 }
 
 static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
@@ -5277,6 +5365,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	struct ext2_inode_large inode;
 	ext2_filsys fs;
 	errcode_t err;
+	bool dirty = false;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
@@ -5302,7 +5391,8 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 						 count, opflags, read_iomap);
 	else if (opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO))
 		ret = fuse2fs_iomap_begin_write(ff, attr_ino, &inode, pos,
-						count, opflags, read_iomap);
+						count, opflags, read_iomap,
+						&dirty);
 	else
 		ret = fuse2fs_iomap_begin_read(ff, attr_ino, &inode, pos,
 					       count, opflags, read_iomap);
@@ -5319,6 +5409,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   (unsigned long long)read_iomap->length,
 		   read_iomap->type);
 
+	if (dirty) {
+		err = fuse2fs_write_inode(fs, attr_ino, &inode);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
 out_unlock:
 	fuse2fs_finish(ff, ret);
 	return ret;
@@ -5460,6 +5558,383 @@ static int op_iomap_config(uint32_t flags, off_t maxbytes,
 		goto out_unlock;
 	}
 
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
+}
+
+static inline bool fuse2fs_can_merge_mappings(const struct ext2fs_extent *left,
+					      const struct ext2fs_extent *right)
+{
+	uint64_t max_len = (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ?
+				EXT_UNINIT_MAX_LEN : EXT_INIT_MAX_LEN;
+
+	return left->e_lblk + left->e_len == right->e_lblk &&
+	       left->e_pblk + left->e_len == right->e_pblk &&
+	       (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ==
+	        (right->e_flags & EXT2_EXTENT_FLAGS_UNINIT) &&
+	       (uint64_t)left->e_len + right->e_len <= max_len;
+}
+
+static int fuse2fs_try_merge_mappings(struct fuse2fs *ff, ext2_ino_t ino,
+				      ext2_extent_handle_t handle,
+				      blk64_t startoff)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2fs_extent left, right;
+	errcode_t err;
+
+	/* Look up the mapping before startoff */
+	err = fuse2fs_get_mapping_at(ff, handle, startoff - 1, &left);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND)
+		return 0;
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Look up the mapping at startoff */
+	err = fuse2fs_get_mapping_at(ff, handle, startoff, &right);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND)
+		return 0;
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Can we combine them? */
+	if (!fuse2fs_can_merge_mappings(&left, &right))
+		return 0;
+
+	/*
+	 * Delete the mapping at startoff because libext2fs cannot handle
+	 * overlapping mappings.
+	 */
+	err = ext2fs_extent_delete(handle, 0);
+	DUMP_EXTENT(ff, "remover", startoff, err, &right);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = ext2fs_extent_fix_parents(handle);
+	DUMP_EXTENT(ff, "fixremover", startoff, err, &right);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Move back and lengthen the mapping before startoff */
+	err = ext2fs_extent_goto(handle, left.e_lblk);
+	DUMP_EXTENT(ff, "movel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	left.e_len += right.e_len;
+	err = ext2fs_extent_replace(handle, 0, &left);
+	DUMP_EXTENT(ff, "replacel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = ext2fs_extent_fix_parents(handle);
+	DUMP_EXTENT(ff, "fixreplacel", startoff - 1, err, &left);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
+static int fuse2fs_convert_unwritten_mapping(struct fuse2fs *ff,
+					     ext2_ino_t ino,
+					     struct ext2_inode_large *inode,
+					     ext2_extent_handle_t handle,
+					     blk64_t *cursor, blk64_t stopoff)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2fs_extent extent;
+	blk64_t startoff = *cursor;
+	errcode_t err;
+
+	/*
+	 * Find the mapping at startoff.  Note that we can find holes because
+	 * the mapping data can change due to racing writes.
+	 */
+	err = fuse2fs_get_mapping_at(ff, handle, startoff, &extent);
+	if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+		/*
+		 * If we didn't find any mappings at all then the file is
+		 * completely sparse.  There's nothing to convert.
+		 */
+		*cursor = stopoff;
+		return 0;
+	}
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/*
+	 * The mapping is completely to the left of the range that we want.
+	 * Let's see what's in the next extent, if there is one.
+	 */
+	if (startoff >= extent.e_lblk + extent.e_len) {
+		/*
+		 * Mapping ends to the left of the current position.  Try to
+		 * find the next mapping.  If there is no next mapping, then
+		 * we're done.
+		 */
+		err = fuse2fs_get_next_mapping(ff, handle, startoff, &extent);
+		if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+			*cursor = stopoff;
+			return 0;
+		}
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	/*
+	 * The mapping is completely to the right of the range that we want,
+	 * so we're done.
+	 */
+	if (extent.e_lblk >= stopoff) {
+		*cursor = stopoff;
+		return 0;
+	}
+
+	/*
+	 * At this point, we have a mapping that overlaps [startoff, stopoff).
+	 * If the mapping is already written, move on to the next one.
+	 */
+	if (!(extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT))
+		goto next;
+
+	if (startoff > extent.e_lblk) {
+		struct ext2fs_extent newex = extent;
+
+		/*
+		 * Unwritten mapping starts before startoff.  Shorten
+		 * the previous mapping...
+		 */
+		newex.e_len = startoff - extent.e_lblk;
+		err = ext2fs_extent_replace(handle, 0, &newex);
+		DUMP_EXTENT(ff, "shortenp", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixshortenp", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		/* ...and create a new written mapping at startoff. */
+		extent.e_len -= newex.e_len;
+		extent.e_lblk += newex.e_len;
+		extent.e_pblk += newex.e_len;
+		extent.e_flags = newex.e_flags & ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_insert(handle,
+					   EXT2_EXTENT_INSERT_AFTER,
+					   &extent);
+		DUMP_EXTENT(ff, "insertx", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixinsertx", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	if (extent.e_lblk + extent.e_len > stopoff) {
+		struct ext2fs_extent newex = extent;
+
+		/*
+		 * Unwritten mapping ends after stopoff.  Shorten the current
+		 * mapping...
+		 */
+		extent.e_len = stopoff - extent.e_lblk;
+		extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_replace(handle, 0, &extent);
+		DUMP_EXTENT(ff, "shortenn", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixshortenn", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		/* ...and create a new unwritten mapping at stopoff. */
+		newex.e_pblk += extent.e_len;
+		newex.e_lblk += extent.e_len;
+		newex.e_len -= extent.e_len;
+		newex.e_flags |= EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_insert(handle,
+					   EXT2_EXTENT_INSERT_AFTER,
+					   &newex);
+		DUMP_EXTENT(ff, "insertn", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixinsertn", startoff, err, &newex);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	/* Still unwritten?  Update the state. */
+	if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) {
+		extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+		err = ext2fs_extent_replace(handle, 0, &extent);
+		DUMP_EXTENT(ff, "replacex", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+
+		err = ext2fs_extent_fix_parents(handle);
+		DUMP_EXTENT(ff, "fixreplacex", startoff, err, &extent);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+next:
+	/* Try to merge with the previous extent */
+	if (startoff > 0) {
+		err = fuse2fs_try_merge_mappings(ff, ino, handle, startoff);
+		if (err)
+			return translate_error(fs, ino, err);
+	}
+
+	*cursor = extent.e_lblk + extent.e_len;
+	return 0;
+}
+
+static int fuse2fs_convert_unwritten_mappings(struct fuse2fs *ff,
+					      ext2_ino_t ino,
+					      struct ext2_inode_large *inode,
+					      off_t pos, size_t written)
+{
+	ext2_extent_handle_t handle;
+	ext2_filsys fs = ff->fs;
+	blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+	const blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + written);
+	errcode_t err;
+	int ret;
+
+	err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	/* Walk every mapping in the range, converting them. */
+	while (startoff < stopoff) {
+		blk64_t old_startoff = startoff;
+
+		ret = fuse2fs_convert_unwritten_mapping(ff, ino, inode, handle,
+							&startoff, stopoff);
+		if (ret)
+			goto out_handle;
+		if (startoff <= old_startoff) {
+			/* Do not go backwards. */
+			ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED);
+			goto out_handle;
+		}
+	}
+
+	/* Try to merge the right edge */
+	ret = fuse2fs_try_merge_mappings(ff, ino, handle, stopoff);
+out_handle:
+	ext2fs_extent_free(handle);
+	return ret;
+}
+
+static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
+			  off_t pos, size_t written, uint32_t ioendflags,
+			  int error, uint64_t new_addr)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct ext2_inode_large inode;
+	ext2_filsys fs;
+	errcode_t err;
+	bool dirty = false;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+
+	dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx written=0x%zx ioendflags=0x%x error=%d new_addr=%llu\n",
+		   __func__, path,
+		   (unsigned long long)nodeid,
+		   (unsigned long long)attr_ino,
+		   (unsigned long long)pos,
+		   written,
+		   ioendflags,
+		   error,
+		   (unsigned long long)new_addr);
+
+	fs = fuse2fs_start(ff);
+	if (error) {
+		ret = error;
+		goto out_unlock;
+	}
+
+	/*
+	 * flush and invalidate the file's io_channel buffers again now that
+	 * iomap wrote them
+	 */
+	if (written > 0) {
+		err = io_channel_invalidate_tag(ff->fs->io, attr_ino);
+		if (err) {
+			ret = translate_error(ff->fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
+	/* should never see these ioend types */
+	if ((ioendflags & FUSE_IOMAP_IOEND_SHARED) ||
+	    new_addr != FUSE_IOMAP_NULL_ADDR) {
+		ret = translate_error(fs, attr_ino,
+				      EXT2_ET_FILESYSTEM_CORRUPTED);
+		goto out_unlock;
+	}
+
+	err = fuse2fs_read_inode(fs, attr_ino, &inode);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) {
+		/* unwritten extents are only supported on extents files */
+		if (!(inode.i_flags & EXT4_EXTENTS_FL)) {
+			ret = translate_error(fs, attr_ino,
+					      EXT2_ET_FILESYSTEM_CORRUPTED);
+			goto out_unlock;
+		}
+
+		ret = fuse2fs_convert_unwritten_mappings(ff, attr_ino, &inode,
+							 pos, written);
+		if (ret)
+			goto out_unlock;
+
+		dirty = true;
+	}
+
+	if (ioendflags & FUSE_IOMAP_IOEND_APPEND) {
+		ext2_off64_t isize = EXT2_I_SIZE(&inode);
+
+		if (pos + written > isize) {
+			err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode),
+						    pos + written);
+			if (err) {
+				ret = translate_error(fs, attr_ino, err);
+				goto out_unlock;
+			}
+
+			dirty = true;
+		}
+	}
+
+	if (dirty) {
+		err = fuse2fs_write_inode(fs, attr_ino, &inode);
+		if (err) {
+			ret = translate_error(fs, attr_ino, err);
+			goto out_unlock;
+		}
+	}
+
 out_unlock:
 	fuse2fs_finish(ff, ret);
 	return ret;
@@ -5530,6 +6005,7 @@ static struct fuse_operations fs_ops = {
 	.iomap_begin = op_iomap_begin,
 	.iomap_end = op_iomap_end,
 	.iomap_config = op_iomap_config,
+	.iomap_ioend = op_iomap_ioend,
 #endif /* HAVE_FUSE_IOMAP */
 };
 


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 11/22] fuse2fs: turn on iomap for pagecache IO
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (9 preceding siblings ...)
  2025-07-17 23:41   ` [PATCH 10/22] fuse2fs: implement direct write support Darrick J. Wong
@ 2025-07-17 23:42   ` Darrick J. Wong
  2025-07-17 23:42   ` [PATCH 12/22] fuse2fs: improve tracing for fallocate Darrick J. Wong
                     ` (10 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:42 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Turn on iomap for pagecache IO to regular files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   65 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 58 insertions(+), 7 deletions(-)
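
The condition under which iomap_end extends the file size in this patch
can be sketched as a pure function.  The flag values and
sketch_size_after_end() are illustrative assumptions and not the real fuse
constants; the sketch only restates the test in the diff (buffered write,
size changed, something written) together with the "never shrink" check in
the setsize helper.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values only; the real ones come from the fuse headers. */
#define SKETCH_OP_WRITE		(1U << 0)
#define SKETCH_OP_DIRECT	(1U << 1)
#define SKETCH_F_SIZE_CHANGED	(1U << 2)

/*
 * Return the file size that should be recorded after an iomap operation
 * ends.  Only a buffered write that changed the size and wrote at least
 * one byte may grow the file; the size never shrinks here.
 */
static uint64_t sketch_size_after_end(uint64_t isize, uint32_t opflags,
				      uint32_t mapflags, uint64_t pos,
				      int64_t written)
{
	uint64_t newsize;

	assert(written >= 0 || written == -1);

	if (!(opflags & SKETCH_OP_WRITE) ||
	    (opflags & SKETCH_OP_DIRECT) ||
	    !(mapflags & SKETCH_F_SIZE_CHANGED) ||
	    written <= 0)
		return isize;

	newsize = pos + (uint64_t)written;
	return newsize > isize ? newsize : isize;
}
```

Directio writes are excluded here because the ioend handler from the
previous patch already updates the size for appending direct writes.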


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 8c3cc7adc72579..a8fb18650ec080 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1346,6 +1346,10 @@ static void *op_init(struct fuse_conn_info *conn
 	if (fuse2fs_iomap_enabled(ff))
 		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO);
 #endif
+#if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_FILEIO)
+	if (fuse2fs_iomap_enabled(ff))
+		fuse_set_feature_flag(conn, FUSE_CAP_IOMAP_FILEIO);
+#endif
 
 	/*
 	 * If we're mounting in iomap mode, we need to unmount in op_destroy
@@ -5239,9 +5243,6 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 {
 	errcode_t err;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	/* fall back to slow path for inline data reads */
 	if (inode->i_flags & EXT4_INLINE_DATA_FL)
 		return -ENOSYS;
@@ -5322,9 +5323,6 @@ static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	errcode_t err;
 	int ret;
 
-	if (!(opflags & FUSE_IOMAP_OP_DIRECT))
-		return -ENOSYS;
-
 	if (pos >= max_size)
 		return -EFBIG;
 
@@ -5422,12 +5420,51 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	return ret;
 }
 
+static int fuse2fs_iomap_append_setsize(struct fuse2fs *ff, ext2_ino_t ino,
+					loff_t newsize)
+{
+	ext2_filsys fs = ff->fs;
+	struct ext2_inode_large inode;
+	ext2_off64_t isize;
+	errcode_t err;
+
+	dbg_printf(ff, "%s: ino=%u newsize=%llu\n", __func__, ino,
+		   (unsigned long long)newsize);
+
+	err = fuse2fs_read_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	isize = EXT2_I_SIZE(&inode);
+	if (newsize <= isize)
+		return 0;
+
+	dbg_printf(ff, "%s: ino=%u oldsize=%llu newsize=%llu\n", __func__, ino,
+		   (unsigned long long)isize,
+		   (unsigned long long)newsize);
+
+	/*
+	 * XXX cheesily update the ondisk size even though we only want to do
+	 * the incore size until writeback happens
+	 */
+	err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), newsize);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	err = fuse2fs_write_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	return 0;
+}
+
 static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 			off_t pos, uint64_t count, uint32_t opflags,
 			ssize_t written, const struct fuse_iomap *iomap)
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
 
@@ -5442,7 +5479,21 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		   written,
 		   iomap->flags);
 
-	return 0;
+	fuse2fs_start(ff);
+
+	/* XXX is this really necessary? */
+	if ((opflags & FUSE_IOMAP_OP_WRITE) &&
+	    !(opflags & FUSE_IOMAP_OP_DIRECT) &&
+	    (iomap->flags & FUSE_IOMAP_F_SIZE_CHANGED) &&
+	    written > 0) {
+		ret = fuse2fs_iomap_append_setsize(ff, attr_ino, pos + written);
+		if (ret)
+			goto out_unlock;
+	}
+
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
 }
 
 /*


^ permalink raw reply related	[flat|nested] 174+ messages in thread

* [PATCH 12/22] fuse2fs: improve tracing for fallocate
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (10 preceding siblings ...)
  2025-07-17 23:42   ` [PATCH 11/22] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
@ 2025-07-17 23:42   ` Darrick J. Wong
  2025-07-17 23:42   ` [PATCH 13/22] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
                     ` (9 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:42 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Improve the tracing for fallocate by reporting the inode number and the
file range in all tracepoints.  Make the ranges hexadecimal to make it
easier for the programmer to convert bytes to block numbers and back.
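
As a rough illustration of why hex helps (a sketch only, not code from
this patch): with a power-of-two block size, the block number is the
byte offset shifted right by the block size log2.  For 4k blocks that
means dropping the three low hex digits.  The helper names below are
hypothetical and assume a blocklog value like the one fuse2fs keeps.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: fs block number that contains byte offset @off. */
static uint64_t byte_to_fsblock(uint64_t off, unsigned int blocklog)
{
	assert(blocklog < 64);
	return off >> blocklog;
}

/*
 * Hypothetical helper: format a byte range in hex next to the block range
 * in decimal, in the same spirit as the tracepoints in this patch.
 * Returns whatever snprintf returns.
 */
static int format_range(char *buf, size_t buflen, uint64_t off, uint64_t len,
			unsigned int blocklog)
{
	assert(len > 0);
	return snprintf(buf, buflen,
			"offset=0x%" PRIx64 " len=0x%" PRIx64
			" start=%" PRIu64 " end=%" PRIu64,
			off, len,
			byte_to_fsblock(off, blocklog),
			byte_to_fsblock(off + len - 1, blocklog));
}
```

With blocklog=12, offset 0x3000 and len 0x2000 cover bytes 0x3000 through
0x4fff, which is blocks 3 and 4; that can be read straight off the hex
digits without a calculator.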

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index a8fb18650ec080..f7d17737459c11 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4683,8 +4683,8 @@ static int fuse2fs_allocate_range(struct fuse2fs *ff,
 
 	start = FUSE2FS_B_TO_FSBT(ff, offset);
 	end = FUSE2FS_B_TO_FSBT(ff, offset + len - 1);
-	dbg_printf(ff, "%s: ino=%d mode=0x%x start=%llu end=%llu\n", __func__,
-		   fh->ino, mode, start, end);
+	dbg_printf(ff, "%s: ino=%d mode=0x%x offset=0x%jx len=0x%jx start=%llu end=%llu\n",
+		   __func__, fh->ino, mode, offset, len, start, end);
 	if (!fs_can_allocate(ff, FUSE2FS_B_TO_FSB(ff, len)))
 		return -ENOSPC;
 
@@ -4751,6 +4751,7 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return err;
 
+	dbg_printf(ff, "%s: ino=%d offset=0x%jx len=0x%jx\n", __func__, ino, offset + residue, len);
 	memset(*buf + residue, 0, len);
 
 	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
@@ -4787,10 +4788,15 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	if (!blk || (retflags & BMAP_RET_UNINIT))
 		return 0;
 
-	if (clean_before)
+	if (clean_before) {
+		dbg_printf(ff, "%s: ino=%d before offset=0x%jx len=0x%jx\n",
+			   __func__, ino, offset, residue);
 		memset(*buf, 0, residue);
-	else
+	} else {
+		dbg_printf(ff, "%s: ino=%d after offset=0x%jx len=0x%jx\n",
+			   __func__, ino, offset, fs->blocksize - residue);
 		memset(*buf + residue, 0, fs->blocksize - residue);
+	}
 
 	return io_channel_write_tagblk(fs->io, ino, blk, 1, *buf);
 }
@@ -4805,9 +4811,6 @@ static int fuse2fs_punch_range(struct fuse2fs *ff,
 	errcode_t err;
 	char *buf = NULL;
 
-	dbg_printf(ff, "%s: offset=%jd len=%jd\n", __func__,
-		   (intmax_t) offset, (intmax_t) len);
-
 	/* kernel ext4 punch requires this flag to be set */
 	if (!(mode & FL_KEEP_SIZE_FLAG))
 		return -EINVAL;
@@ -4900,6 +4903,12 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 		ret = -EROFS;
 		goto out;
 	}
+
+	dbg_printf(ff, "%s: ino=%d mode=0x%x start=0x%llx end=0x%llx\n", __func__,
+		   fh->ino, mode,
+		   (unsigned long long)offset,
+		   (unsigned long long)offset + len);
+
 	if (mode & FL_ZERO_RANGE_FLAG)
 		ret = fuse2fs_zero_range(ff, fh, mode, offset, len);
 	else if (mode & FL_PUNCH_HOLE_FLAG)



* [PATCH 13/22] fuse2fs: don't zero bytes in punch hole
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (11 preceding siblings ...)
  2025-07-17 23:42   ` [PATCH 12/22] fuse2fs: improve tracing for fallocate Darrick J. Wong
@ 2025-07-17 23:42   ` Darrick J. Wong
  2025-07-17 23:43   ` [PATCH 14/22] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
                     ` (8 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:42 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the pagecache, it will take care of zeroing the
unaligned parts of punched out regions so we don't have to do it
ourselves.
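
For readers unfamiliar with what "unaligned parts" means here, the
following is a minimal standalone sketch (not the fuse2fs code, and all
names are hypothetical) that computes the partial-block byte ranges at
the head and tail of a punched range.  Those are the bytes that someone
has to zero, because whole blocks in the middle are simply unmapped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: partial-block ranges that need zeroing. */
struct punch_edges {
	uint64_t head_off, head_len;	/* unaligned bytes at the start */
	uint64_t tail_off, tail_len;	/* unaligned bytes at the end */
};

/* Compute the unaligned edges of the byte range [off, off + len). */
static struct punch_edges punch_unaligned_edges(uint64_t off, uint64_t len,
						uint64_t blocksize)
{
	struct punch_edges e = { 0, 0, 0, 0 };
	uint64_t end = off + len;	/* exclusive */
	uint64_t first_aligned, last_aligned;

	/* blocksize must be a nonzero power of two */
	assert(blocksize && !(blocksize & (blocksize - 1)));
	assert(len > 0);

	first_aligned = (off + blocksize - 1) & ~(blocksize - 1); /* round up */
	last_aligned = end & ~(blocksize - 1);			  /* round down */

	if (first_aligned > last_aligned) {
		/* the range sits strictly inside a single block */
		e.head_off = off;
		e.head_len = len;
		return e;
	}

	e.head_off = off;
	e.head_len = first_aligned - off;
	e.tail_off = last_aligned;
	e.tail_len = end - last_aligned;
	return e;
}
```

With 4k blocks, punching [1000, 11000) leaves a head of 3096 bytes at
offset 1000 and a tail of 2808 bytes at offset 8192; a fully aligned
punch has no edges at all.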

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f7d17737459c11..45eec59d85faf4 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -235,6 +235,7 @@ enum fuse2fs_iomap_state {
 	IOMAP_DISABLED,
 	IOMAP_UNKNOWN,
 	IOMAP_ENABLED,
+	IOMAP_FILEIO,	/* enabled and does all file data block IO */
 };
 #endif
 
@@ -494,8 +495,14 @@ static int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
 {
 	return ff->iomap_state >= IOMAP_ENABLED;
 }
+
+static int fuse2fs_iomap_does_fileio(const struct fuse2fs *ff)
+{
+	return ff->iomap_state == IOMAP_FILEIO;
+}
 #else
 # define fuse2fs_iomap_enabled(...)	(0)
+# define fuse2fs_iomap_does_fileio(...)	(0)
 #endif
 
 static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
@@ -1219,6 +1226,7 @@ static void fuse2fs_iomap_confirm(struct fuse_conn_info *conn,
 		return;
 	case IOMAP_DISABLED:
 		return;
+	case IOMAP_FILEIO:
 	case IOMAP_ENABLED:
 		break;
 	}
@@ -1267,6 +1275,20 @@ static void *op_init(struct fuse_conn_info *conn
 	if (ff->iomap_state != IOMAP_DISABLED &&
 	    fuse_set_feature_flag(conn, FUSE_CAP_IOMAP))
 		ff->iomap_state = IOMAP_ENABLED;
+
+	/*
+	 * If iomap is turned on and the kernel advertises support for both
+	 * direct and buffered IO, then that means the kernel handles all
+	 * regular file data block IO for us.  That means we can turn off all
+	 * of libext2fs' file data block handling except for inline data.
+	 *
+	 * XXX: kernel doesn't support inline data iomap
+	 */
+	if (fuse2fs_iomap_enabled(ff) &&
+	    fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_DIRECTIO) &&
+	    fuse_get_feature_flag(conn, FUSE_CAP_IOMAP_FILEIO))
+		ff->iomap_state = IOMAP_FILEIO;
+
 	/*
 	 * In iomap mode, the kernel writes file data directly to the block
 	 * device and does not flush the bdev page cache.  We must open the
@@ -4734,6 +4756,10 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
 	int retflags;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (fuse2fs_iomap_does_fileio(ff))
+		return 0;
+
 	if (!*buf) {
 		err = ext2fs_get_mem(fs->blocksize, buf);
 		if (err)
@@ -4767,6 +4793,10 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
 	off_t residue;
 	errcode_t err;
 
+	/* the kernel does this for us in iomap mode */
+	if (fuse2fs_iomap_does_fileio(ff))
+		return 0;
+
 	residue = FUSE2FS_OFF_IN_FSB(ff, offset);
 	if (residue == 0)
 		return 0;



* [PATCH 14/22] fuse2fs: don't do file data block IO when iomap is enabled
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (12 preceding siblings ...)
  2025-07-17 23:42   ` [PATCH 13/22] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
@ 2025-07-17 23:43   ` Darrick J. Wong
  2025-07-17 23:43   ` [PATCH 15/22] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:43 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

When iomap is in use for the page cache, the kernel will take care of
all the file data block IO for us, including zeroing of punched ranges
and post-EOF bytes.  fuse2fs only needs to do IO for inline data.

Therefore, set the NOBLOCKIO ext2_file flag so that libext2fs will not
do any regular file IO to or from disk blocks at all.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 45eec59d85faf4..989f9f17cae0a9 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -3059,9 +3059,14 @@ static int truncate_helper(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size)
 	ext2_file_t file;
 	__u64 old_isize;
 	errcode_t err;
+	int flags = EXT2_FILE_WRITE;
 	int ret = 0;
 
-	err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file);
+	/* the kernel handles all eof zeroing for us in iomap mode */
+	if (fuse2fs_iomap_does_fileio(ff))
+		flags |= EXT2_FILE_NOBLOCKIO;
+
+	err = ext2fs_file_open(fs, ino, flags, &file);
 	if (err)
 		return translate_error(fs, ino, err);
 
@@ -3181,6 +3186,9 @@ static int __op_open(struct fuse2fs *ff, const char *path,
 		file->open_flags |= EXT2_FILE_WRITE;
 		break;
 	}
+	/* the kernel handles all block IO for us in iomap mode */
+	if (fuse2fs_iomap_does_fileio(ff))
+		file->open_flags |= EXT2_FILE_NOBLOCKIO;
 	if (fp->flags & O_APPEND) {
 		/* the kernel doesn't allow truncation of an append-only file */
 		if (fp->flags & O_TRUNC) {



* [PATCH 15/22] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (13 preceding siblings ...)
  2025-07-17 23:43   ` [PATCH 14/22] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
@ 2025-07-17 23:43   ` Darrick J. Wong
  2025-07-17 23:43   ` [PATCH 16/22] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
                     ` (6 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:43 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Now that fuse2fs uses iomap for pagecache IO, all regular file IO goes
directly to the disk.  There is no need to flush the unix IO manager's
disk cache (or invalidate it) because it does not contain file data.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 989f9f17cae0a9..9604f06e69bc90 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5295,9 +5295,11 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 		return -ENOSYS;
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
-	err = io_channel_flush_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!fuse2fs_iomap_does_fileio(ff)) {
+		err = io_channel_flush_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	if (inode->i_flags & EXT4_EXTENTS_FL)
 		return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
@@ -5393,9 +5395,11 @@ static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
 	 * flush and invalidate the file's io_channel buffers before iomap
 	 * writes them
 	 */
-	err = io_channel_invalidate_tag(ff->fs->io, ino);
-	if (err)
-		return translate_error(ff->fs, ino, err);
+	if (!fuse2fs_iomap_does_fileio(ff)) {
+		err = io_channel_invalidate_tag(ff->fs->io, ino);
+		if (err)
+			return translate_error(ff->fs, ino, err);
+	}
 
 	return 0;
 }
@@ -5972,7 +5976,7 @@ static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
 	 * flush and invalidate the file's io_channel buffers again now that
 	 * iomap wrote them
 	 */
-	if (written > 0) {
+	if (written > 0 && !fuse2fs_iomap_does_fileio(ff)) {
 		err = io_channel_invalidate_tag(ff->fs->io, attr_ino);
 		if (err) {
 			ret = translate_error(ff->fs, attr_ino, err);



* [PATCH 16/22] fuse2fs: re-enable the block device pagecache for metadata IO
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (14 preceding siblings ...)
  2025-07-17 23:43   ` [PATCH 15/22] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
@ 2025-07-17 23:43   ` Darrick J. Wong
  2025-07-17 23:43   ` [PATCH 17/22] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
                     ` (5 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:43 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Back in "fuse2fs: always use directio disk reads with fuse2fs", we
started using directio for all libext2fs disk IO to deal with cache
coherency issues between the unix io manager's disk cache, the block
device page cache, and the file data blocks being read and written to
disk by the kernel itself.

Now that we've turned off all regular file data block IO in libext2fs,
we don't need that and can go back to the old way, which is a lot
faster for metadata operations.
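
The decision can be restated as a small standalone sketch.  The enum and
predicate names mirror the ones used earlier in this series, but this is
an illustration rather than the fuse2fs code itself; note that the
ordering of the enum values matters because "enabled" is tested with >=.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the state enum from this series; the order is significant. */
enum iomap_state {
	IOMAP_DISABLED,
	IOMAP_UNKNOWN,
	IOMAP_ENABLED,
	IOMAP_FILEIO,	/* enabled and does all file data block IO */
};

static bool iomap_enabled(enum iomap_state s)
{
	return s >= IOMAP_ENABLED;	/* also true for IOMAP_FILEIO */
}

static bool iomap_does_fileio(enum iomap_state s)
{
	return s == IOMAP_FILEIO;
}

/*
 * Directio on the block device is only needed when iomap is on but
 * libext2fs might still touch file data blocks.
 */
static bool need_bdev_directio(enum iomap_state s)
{
	assert(s >= IOMAP_DISABLED && s <= IOMAP_FILEIO);
	return iomap_enabled(s) && !iomap_does_fileio(s);
}
```

Only the plain IOMAP_ENABLED state still needs directio; once the
kernel does all file IO, buffered metadata reads are safe again.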

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 9604f06e69bc90..9a62971f8dbba7 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1295,8 +1295,12 @@ static void *op_init(struct fuse_conn_info *conn
 	 * filesystem in directio mode to avoid cache coherency issues when
 	 * reading file data.  If we can't open the bdev in directio mode, we
 	 * must not use iomap.
+	 *
+	 * If we know that the kernel can handle all regular file IO for us,
+	 * then there is no cache coherency issue and we can use buffered reads
+	 * for all IO, which will all be filesystem metadata.
 	 */
-	if (fuse2fs_iomap_enabled(ff))
+	if (fuse2fs_iomap_enabled(ff) && !fuse2fs_iomap_does_fileio(ff))
 		ff->directio = 1;
 #endif
 



* [PATCH 17/22] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (15 preceding siblings ...)
  2025-07-17 23:43   ` [PATCH 16/22] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
@ 2025-07-17 23:43   ` Darrick J. Wong
  2025-07-17 23:44   ` [PATCH 18/22] fuse2fs: don't allow hardlinks for now Darrick J. Wong
                     ` (4 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:43 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Since fuse in iomap mode guarantees that op_destroy will be called
before umount returns, we don't need to use fuseblk mode to get that
guarantee.  Disable fuseblk mode, which saves us the trouble of closing
and reopening the device.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 9a62971f8dbba7..82b59c1ac89774 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -982,6 +982,8 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff, int libext2_flags)
 	if (ff->directio)
 		flags |= EXT2_FLAG_DIRECT_IO;
 
+	dbg_printf(ff, "opening with flags=0x%x\n", flags);
+
 	err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
 			   &ff->fs);
 	if (err == EPERM) {
@@ -6333,10 +6335,24 @@ static unsigned long long default_cache_size(void)
 	return ret;
 }
 
+#ifdef HAVE_FUSE_IOMAP
+static inline bool fuse2fs_discover_iomap(const struct fuse2fs *ff)
+{
+	if (ff->iomap_want == FT_DISABLE)
+		return false;
+
+	return fuse_discover_iomap();
+}
+#else
+# define fuse2fs_discover_iomap(...)	(false)
+#endif
+
 static inline bool fuse2fs_want_fuseblk(const struct fuse2fs *ff)
 {
 	if (ff->noblkdev)
 		return false;
+	if (fuse2fs_discover_iomap(ff))
+		return false;
 
 	return fuse2fs_on_bdev(ff);
 }
@@ -6499,6 +6515,12 @@ int main(int argc, char *argv[])
 		 * device) so that unmount will wait until op_destroy
 		 * completes.  If this is not a block device, we cannot use
 		 * fuseblk mode and should leave the filesystem open.
+		 *
+		 * However, fuse+iomap guarantees that op_destroy is called
+		 * before the filesystem is unmounted, so we don't need fuseblk
+		 * mode.  This saves us the trouble of reopening the filesystem
+		 * later, and means that fuse2fs itself owns the exclusive lock
+		 * on the block device.
 		 */
 		fuse2fs_unmount(&fctx);
 



* [PATCH 18/22] fuse2fs: don't allow hardlinks for now
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (16 preceding siblings ...)
  2025-07-17 23:43   ` [PATCH 17/22] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
@ 2025-07-17 23:44   ` Darrick J. Wong
  2025-07-17 23:44   ` [PATCH 19/22] fuse2fs: enable file IO to inline data files Darrick J. Wong
                     ` (3 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:44 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

XXX see the comment for why we have to do this bellicosely stupid thing.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 82b59c1ac89774..e281b5fc589d82 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -261,6 +261,7 @@ struct fuse2fs {
 	uint8_t dirsync;
 	uint8_t unmount_in_destroy;
 	uint8_t noblkdev;
+	uint8_t can_hardlink;
 
 	enum fuse2fs_opstate opstate;
 	int blocklog;
@@ -1382,9 +1383,31 @@ static void *op_init(struct fuse_conn_info *conn
 	/*
 	 * If we're mounting in iomap mode, we need to unmount in op_destroy
 	 * so that the block device will be released before umount(2) returns.
+	 *
+	 * XXX: It turns out that libfuse creates internal node ids that have
+	 * nothing to do with the ext2_ino_t that we give it.  These internal
+	 * node ids are what actually gets igetted in the kernel, which means
+	 * that there can be multiple fuse_inode objects for the same fuse2fs
+	 * inode.
+	 *
+	 * What this means, horrifyingly, is that on a fuse filesystem that
+	 * supports hard links, the in-kernel i_rwsem does not protect against
+	 * concurrent writes between files that point to the same inode.  That
+	 * in turn means that the file mode and size can get desynchronized
+	 * between the multiple fuse_inode objects.  This also means that we
+	 * cannot cache iomaps in the kernel AT ALL because the caches will
+	 * get out of sync, leading to WARN_ONs from the iomap zeroing code and
+	 * probably data corruption after that.
+	 *
+	 * So for now we just disable hardlinking on iomap to see if the weird
+	 * fstests failures (particularly g/476) go away.  Long term it means
+	 * we probably have to find a way around this, like porting fuse2fs
+	 * to be a low level fuse driver.
 	 */
-	if (fuse2fs_iomap_enabled(ff))
+	if (fuse2fs_iomap_enabled(ff)) {
 		ff->unmount_in_destroy = 1;
+		ff->can_hardlink = 0;
+	}
 
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
 	if (ff->opstate == F2OP_WRITABLE) {
@@ -2751,6 +2774,10 @@ static int op_link(const char *src, const char *dest)
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
+
+	if (!ff->can_hardlink)
+		return -ENOSYS;
+
 	dbg_printf(ff, "%s: src=%s dest=%s\n", __func__, src, dest);
 	temp_path = strdup(dest);
 	if (!temp_path) {
@@ -6380,6 +6407,7 @@ int main(int argc, char *argv[])
 		.iomap_state = IOMAP_UNKNOWN,
 		.iomap_dev = FUSE_IOMAP_DEV_NULL,
 #endif
+		.can_hardlink = 1,
 	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;



* [PATCH 19/22] fuse2fs: enable file IO to inline data files
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (17 preceding siblings ...)
  2025-07-17 23:44   ` [PATCH 18/22] fuse2fs: don't allow hardlinks for now Darrick J. Wong
@ 2025-07-17 23:44   ` Darrick J. Wong
  2025-07-17 23:44   ` [PATCH 20/22] fuse2fs: set iomap-related inode flags Darrick J. Wong
                     ` (2 subsequent siblings)
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:44 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Enable file reads from and writes to inline data files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e281b5fc589d82..c21a95b6920d5c 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1407,6 +1407,14 @@ static void *op_init(struct fuse_conn_info *conn
 	if (fuse2fs_iomap_enabled(ff)) {
 		ff->unmount_in_destroy = 1;
 		ff->can_hardlink = 0;
+
+		/*
+		 * XXX: inline data file io depends on op_read/write being fed
+		 * a path, so we have to slow everyone down to look up the path
+		 * from the nodeid
+		 */
+		if (ext2fs_has_feature_inline_data(ff->fs->super))
+			cfg->nullpath_ok = 0;
 	}
 
 	/* Clear the valid flag so that an unclean shutdown forces a fsck */
@@ -3294,6 +3302,9 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
 		   size_t len, off_t offset,
 		   struct fuse_file_info *fp)
 {
+	struct fuse2fs_file_handle fhurk = {
+		.magic = FUSE2FS_FILE_MAGIC,
+	};
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	struct fuse2fs_file_handle *fh =
@@ -3305,10 +3316,21 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
+
+	if (!fh)
+		fh = &fhurk;
+
 	FUSE2FS_CHECK_HANDLE(ff, fh);
 	dbg_printf(ff, "%s: ino=%d off=%jd len=%jd\n", __func__, fh->ino,
 		   (intmax_t) offset, len);
 	fs = fuse2fs_start(ff);
+
+	if (fh == &fhurk) {
+		ret = fuse2fs_file_ino(ff, path, NULL, &fhurk.ino);
+		if (ret)
+			goto out;
+	}
+
 	err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
 	if (err) {
 		ret = translate_error(fs, fh->ino, err);
@@ -3350,6 +3372,10 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
 		    const char *buf, size_t len, off_t offset,
 		    struct fuse_file_info *fp)
 {
+	struct fuse2fs_file_handle fhurk = {
+		.magic = FUSE2FS_FILE_MAGIC,
+		.open_flags = EXT2_FILE_WRITE,
+	};
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	struct fuse2fs_file_handle *fh =
@@ -3361,6 +3387,10 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
+
+	if (!fh)
+		fh = &fhurk;
+
 	FUSE2FS_CHECK_HANDLE(ff, fh);
 	dbg_printf(ff, "%s: ino=%d off=%jd len=%jd\n", __func__, fh->ino,
 		   (intmax_t) offset, (intmax_t) len);
@@ -3375,6 +3405,12 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
 		goto out;
 	}
 
+	if (fh == &fhurk) {
+		ret = fuse2fs_file_ino(ff, path, NULL, &fhurk.ino);
+		if (ret)
+			goto out;
+	}
+
 	err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
 	if (err) {
 		ret = translate_error(fs, fh->ino, err);
@@ -5325,7 +5361,8 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
 
 	/* fall back to slow path for inline data reads */
 	if (inode->i_flags & EXT4_INLINE_DATA_FL)
-		return -ENOSYS;
+		return fuse2fs_iomap_begin_inline(ff, ino, inode, pos, count,
+						  read_iomap);
 
 	/* flush dirty io_channel buffers to disk before iomap reads them */
 	if (!fuse2fs_iomap_does_fileio(ff)) {



* [PATCH 20/22] fuse2fs: set iomap-related inode flags
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (18 preceding siblings ...)
  2025-07-17 23:44   ` [PATCH 19/22] fuse2fs: enable file IO to inline data files Darrick J. Wong
@ 2025-07-17 23:44   ` Darrick J. Wong
  2025-07-17 23:44   ` [PATCH 21/22] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 22/22] fuse2fs: configure block device block size Darrick J. Wong
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:44 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Set FUSE_IFLAG_* when we do a getattr, so that all files will have iomap
enabled.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index c21a95b6920d5c..e71fcbaeeaf0c6 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1571,6 +1571,25 @@ static int op_getattr(const char *path, struct stat *statbuf
 	return ret;
 }
 
+#ifdef HAVE_FUSE_IOMAP
+static int op_getattr_iflags(const char *path, struct stat *statbuf,
+			     unsigned int *iflags, struct fuse_file_info *fi)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	int ret = op_getattr(path, statbuf, fi);
+
+	if (ret)
+		return ret;
+
+	if (fuse2fs_iomap_does_fileio(ff))
+		*iflags |= FUSE_IFLAG_IOMAP_DIRECTIO | FUSE_IFLAG_IOMAP_FILEIO;
+
+	return 0;
+}
+#endif
+
+
 static int op_readlink(const char *path, char *buf, size_t len)
 {
 	struct fuse_context *ctxt = fuse_get_context();
@@ -6178,6 +6197,7 @@ static struct fuse_operations fs_ops = {
 	.iomap_end = op_iomap_end,
 	.iomap_config = op_iomap_config,
 	.iomap_ioend = op_iomap_ioend,
+	.getattr_iflags = op_getattr_iflags,
 #endif /* HAVE_FUSE_IOMAP */
 };
 



* [PATCH 21/22] fuse2fs: add strictatime/lazytime mount options
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (19 preceding siblings ...)
  2025-07-17 23:44   ` [PATCH 20/22] fuse2fs: set iomap-related inode flags Darrick J. Wong
@ 2025-07-17 23:44   ` Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 22/22] fuse2fs: configure block device block size Darrick J. Wong
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:44 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, we can support the strictatime/lazytime mount options.
Add them to fuse2fs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e71fcbaeeaf0c6..b5f665ada36991 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -262,6 +262,7 @@ struct fuse2fs {
 	uint8_t unmount_in_destroy;
 	uint8_t noblkdev;
 	uint8_t can_hardlink;
+	uint8_t iomap_passthrough_options;
 
 	enum fuse2fs_opstate opstate;
 	int blocklog;
@@ -1370,6 +1371,10 @@ static void *op_init(struct fuse_conn_info *conn
 		err_printf(ff, "%s\n", _("could not enable iomap."));
 		goto mount_fail;
 	}
+	if (ff->iomap_passthrough_options && !fuse2fs_iomap_enabled(ff)) {
+		err_printf(ff, "%s\n", _("some mount options require iomap."));
+		goto mount_fail;
+	}
 #endif
 #if defined(HAVE_FUSE_IOMAP) && defined(FUSE_CAP_IOMAP_DIRECTIO)
 	if (fuse2fs_iomap_enabled(ff))
@@ -6228,6 +6233,7 @@ enum {
 	FUSE2FS_ERRORS_BEHAVIOR,
 #ifdef HAVE_FUSE_IOMAP
 	FUSE2FS_IOMAP,
+	FUSE2FS_IOMAP_PASSTHROUGH,
 #endif
 };
 
@@ -6251,6 +6257,17 @@ static struct fuse_opt fuse2fs_opts[] = {
 	FUSE2FS_OPT("lockfile=%s",	lockfile,		0),
 	FUSE2FS_OPT("noblkdev",		noblkdev,		1),
 
+#ifdef HAVE_FUSE_IOMAP
+#ifdef MS_LAZYTIME
+	FUSE_OPT_KEY("lazytime",	FUSE2FS_IOMAP_PASSTHROUGH),
+	FUSE_OPT_KEY("nolazytime",	FUSE2FS_IOMAP_PASSTHROUGH),
+#endif
+#ifdef MS_STRICTATIME
+	FUSE_OPT_KEY("strictatime",	FUSE2FS_IOMAP_PASSTHROUGH),
+	FUSE_OPT_KEY("nostrictatime",	FUSE2FS_IOMAP_PASSTHROUGH),
+#endif
+#endif
+
 	FUSE_OPT_KEY("user_xattr",	FUSE2FS_IGNORED),
 	FUSE_OPT_KEY("noblock_validity", FUSE2FS_IGNORED),
 	FUSE_OPT_KEY("nodelalloc",	FUSE2FS_IGNORED),
@@ -6277,6 +6294,12 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
 	struct fuse2fs *ff = data;
 
 	switch (key) {
+#ifdef HAVE_FUSE_IOMAP
+	case FUSE2FS_IOMAP_PASSTHROUGH:
+		ff->iomap_passthrough_options = 1;
+		/* pass through to libfuse */
+		return 1;
+#endif
 	case FUSE2FS_DIRSYNC:
 		ff->dirsync = 1;
 		/* pass through to libfuse */



* [PATCH 22/22] fuse2fs: configure block device block size
  2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
                     ` (20 preceding siblings ...)
  2025-07-17 23:44   ` [PATCH 21/22] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
@ 2025-07-17 23:45   ` Darrick J. Wong
  21 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:45 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Set the blocksize of the block device to the filesystem blocksize.
This prevents the bdev pagecache from caching file data blocks that
iomap will read and write directly.  Cache duplication is dangerous.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index b5f665ada36991..d0478af036a25e 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5683,6 +5683,42 @@ static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit)
 	return res;
 }
 
+/*
+ * Set the block device's blocksize to the fs blocksize.
+ *
+ * This is required to avoid creating uptodate bdev pagecache that aliases file
+ * data blocks because iomap reads and writes directly to file data blocks.
+ */
+static int fuse2fs_set_bdev_blocksize(struct fuse2fs *ff, int fd)
+{
+	int blocksize = ff->fs->blocksize;
+	int set_error;
+	int ret;
+
+	ret = ioctl(fd, BLKBSZSET, &blocksize);
+	if (!ret)
+		return 0;
+
+	/*
+	 * Save the original errno so we can report that if the block device
+	 * blocksize isn't set in an agreeable way.
+	 */
+	set_error = errno;
+
+	ret = ioctl(fd, BLKBSZGET, &blocksize);
+	if (ret)
+		goto out_bad;
+
+	if (blocksize <= ff->fs->blocksize)
+		return 0;
+
+	set_error = EINVAL;
+out_bad:
+	err_printf(ff, "%s: cannot set blocksize %u: %s\n", __func__,
+		   blocksize, strerror(set_error));
+	return -EIO;
+}
+
 static errcode_t fuse2fs_iomap_config_devices(struct fuse_context *ctxt,
 					      struct fuse2fs *ff)
 {
@@ -5695,6 +5731,10 @@ static errcode_t fuse2fs_iomap_config_devices(struct fuse_context *ctxt,
 	if (err)
 		return err;
 
+	ret = fuse2fs_set_bdev_blocksize(ff, fd);
+	if (ret)
+		return ret;
+
 	ret = fuse_iomap_add_device(se, fd, 0);
 
 	dbg_printf(ff, "%s: registering iomap dev fd=%d ret=%d iomap_dev=%u\n",



* [PATCH 1/1] fuse2fs: enable caching of iomaps
  2025-07-17 23:26 ` [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-07-17 23:45   ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:45 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Cache the iomaps that fuse2fs generates in the kernel so that subsequent
I/O to the same ranges can be served without another upcall.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index d0478af036a25e..f863042a4db074 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5505,6 +5505,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	struct fuse_session *se = fuse_get_session(ctxt->fuse);
 	struct ext2_inode_large inode;
 	ext2_filsys fs;
 	errcode_t err;
@@ -5560,6 +5561,24 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
 		}
 	}
 
+	/*
+	 * Cache the mappings in the kernel so that we can reuse them for
+	 * subsequent IO.  Note that we have to return NULL mappings to the
+	 * kernel to prompt it to re-try the cache.
+	 */
+	write_iomap->type = FUSE_IOMAP_TYPE_NULL;
+	err = fuse_lowlevel_notify_iomap_upsert(se, nodeid, attr_ino,
+						read_iomap, write_iomap);
+	if (err) {
+		ret = translate_error(fs, attr_ino, err);
+		goto out_unlock;
+	}
+
+	/* Null out the read mapping to encourage a retry. */
+	read_iomap->type = FUSE_IOMAP_TYPE_NULL;
+	read_iomap->dev = FUSE_IOMAP_DEV_NULL;
+	read_iomap->addr = FUSE_IOMAP_NULL_ADDR;
+
 out_unlock:
 	fuse2fs_finish(ff, ret);
 	return ret;



* [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-07-17 23:45   ` Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled Darrick J. Wong
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:45 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Commit 9f69dfc4e275cc didn't quite get the permissions checking correct:

generic/362       - output mismatch (see /var/tmp/fstests/generic/362.out.bad)
    --- tests/generic/362.out   2025-04-30 16:20:44.563833050 -0700
    +++ /var/tmp/fstests/generic/362.out.bad    2025-06-11 17:04:24.061193618 -0700
    @@ -1,2 +1,3 @@
     QA output created by 362
    +Failed to open/create file: Operation not permitted
     Silence is golden
    ...
    (Run 'diff -u /run/fstests/bin/tests/generic/362.out /var/tmp/fstests/generic/362.out.bad'  to see the entire diff)

The kernel allows opening a file for append and truncation.  What it
doesn't allow is opening an append-only file for truncation.  Note that
this causes generic/079 to regress, but the root cause of that problem
is actually that fuse oddly supports FS_IOC_[GS]ETFLAGS but doesn't
actually set the VFS inode flags.

Fixes: 9f69dfc4e275cc ("fuse2fs: implement O_APPEND correctly")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
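As a quick illustration of the kernel behavior described above, the
following sketch opens an ordinary file with O_APPEND | O_TRUNC and
checks that the open succeeds, truncates, and then appends.  It runs
against whatever local filesystem backs /tmp (an assumption), not
against fuse2fs, and the helper name is made up for this note.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if an ordinary file can be opened with
 * O_APPEND | O_TRUNC, is emptied by the open, and accepts an append.
 */
static int append_trunc_open_works(void)
{
	char path[] = "/tmp/append_trunc_XXXXXX";
	struct stat sb;
	int ok = 0;
	int fd;

	/* Create a scratch file with some contents. */
	fd = mkstemp(path);
	if (fd < 0)
		return 0;
	if (write(fd, "hello", 5) != 5)
		goto out_close;
	close(fd);

	/* Reopen for append and truncation at the same time. */
	fd = open(path, O_WRONLY | O_APPEND | O_TRUNC);
	if (fd < 0)
		goto out_unlink;

	/* The open should have emptied the file... */
	if (fstat(fd, &sb) || sb.st_size != 0)
		goto out_close;

	/* ...and appending should still work afterwards. */
	if (write(fd, "x", 1) != 1)
		goto out_close;
	if (fstat(fd, &sb) || sb.st_size != 1)
		goto out_close;

	ok = 1;
out_close:
	close(fd);
out_unlink:
	unlink(path);
	return ok;
}
```

The case that the kernel does reject is different: truncating a file
that carries the append-only inode flag, which this sketch does not
exercise.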
 misc/fuse2fs.c |    9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f863042a4db074..f9151ae6acb4e5 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -3254,15 +3254,8 @@ static int __op_open(struct fuse2fs *ff, const char *path,
 	/* the kernel handles all block IO for us in iomap mode */
 	if (fuse2fs_iomap_does_fileio(ff))
 		file->open_flags |= EXT2_FILE_NOBLOCKIO;
-	if (fp->flags & O_APPEND) {
-		/* the kernel doesn't allow truncation of an append-only file */
-		if (fp->flags & O_TRUNC) {
-			ret = -EPERM;
-			goto out;
-		}
-
+	if (fp->flags & O_APPEND)
 		check |= A_OK;
-	}
 
 	detect_linux_executable_open(fp->flags, &check, &file->open_flags);
 



* [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens Darrick J. Wong
@ 2025-07-17 23:45   ` Darrick J. Wong
  2025-07-17 23:46   ` [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:45 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

When iomap is enabled, the kernel is in charge of enforcing permission
checks on timestamp updates for files, so we needn't do that in
userspace anymore.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f9151ae6acb4e5..5d75cffa8f6bca 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4334,11 +4334,12 @@ static int op_utimens(const char *path, const struct timespec ctv[2]
 
 	/*
 	 * ext4 allows timestamp updates of append-only files but only if we're
-	 * setting to current time
+	 * setting to current time.  If iomap is enabled, the kernel does the
+	 * permission checking for timestamp updates and we can skip the check.
 	 */
 	if (ctv[0].tv_nsec == UTIME_NOW && ctv[1].tv_nsec == UTIME_NOW)
 		access |= A_OK;
-	ret = check_inum_access(ff, ino, access);
+	ret = fuse2fs_iomap_enabled(ff) ? 0 : check_inum_access(ff, ino, access);
 	if (ret)
 		goto out;
 



* [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens Darrick J. Wong
  2025-07-17 23:45   ` [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled Darrick J. Wong
@ 2025-07-17 23:46   ` Darrick J. Wong
  2025-07-17 23:46   ` [PATCH 04/10] fuse2fs: better debugging for file mode updates Darrick J. Wong
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:46 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

When the kernel is running in iomap mode, it will also manage all the
ACL updates and the resulting file mode changes for us.  Disable the
manual implementation of it in fuse2fs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 5d75cffa8f6bca..e580622d39b1d1 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1739,7 +1739,7 @@ static int propagate_default_acls(struct fuse2fs *ff, ext2_ino_t parent,
 	size_t deflen;
 	int ret;
 
-	if (!ff->acl)
+	if (!ff->acl || fuse2fs_iomap_does_fileio(ff))
 		return 0;
 
 	ret = __getxattr(ff, parent, XATTR_NAME_POSIX_ACL_DEFAULT, &def,
@@ -2999,7 +2999,7 @@ static int op_chmod(const char *path, mode_t mode
 	 * of the user's groups, but FUSE only tells us about the primary
 	 * group.
 	 */
-	if (!is_superuser(ff, ctxt)) {
+	if (!fuse2fs_iomap_does_fileio(ff) && !is_superuser(ff, ctxt)) {
 		ret = in_file_group(ctxt, &inode);
 		if (ret < 0)
 			goto out;



* [PATCH 04/10] fuse2fs: better debugging for file mode updates
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-17 23:46   ` [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
@ 2025-07-17 23:46   ` Darrick J. Wong
  2025-07-17 23:46   ` [PATCH 05/10] fuse2fs: debug timestamp updates Darrick J. Wong
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:46 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Improve the tracing of a chmod operation so that we can debug file mode
updates.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
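The mode computation that this patch pulls into new_mode can be shown
in isolation.  The sketch below uses a made-up helper name and macro;
only the 0xFFF mask and the expression itself come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* 0xFFF == 07777: rwx bits for user/group/other plus setuid/setgid/sticky. */
#define MODE_PERM_MASK	0xFFF

/*
 * Hypothetical helper: keep the file type bits of @old_mode and take the
 * permission bits from @requested.
 */
static uint16_t merge_mode(uint16_t old_mode, uint32_t requested)
{
	return (old_mode & ~MODE_PERM_MASK) | (requested & MODE_PERM_MASK);
}
```

For a regular file with mode 0100644, merge_mode(0100644, 0755) yields
0100755: the S_IFREG bit survives and only the low twelve bits change.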
 misc/fuse2fs.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e580622d39b1d1..f2cb44a4e53b4c 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -2964,12 +2964,13 @@ static int op_chmod(const char *path, mode_t mode
 #endif
 			)
 {
+	struct ext2_inode_large inode;
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
 	ext2_filsys fs;
 	errcode_t err;
 	ext2_ino_t ino;
-	struct ext2_inode_large inode;
+	mode_t new_mode;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
@@ -3008,11 +3009,12 @@ static int op_chmod(const char *path, mode_t mode
 			mode &= ~S_ISGID;
 	}
 
-	inode.i_mode &= ~0xFFF;
-	inode.i_mode |= mode & 0xFFF;
+	new_mode = (inode.i_mode & ~0xFFF) | (mode & 0xFFF);
 
-	dbg_printf(ff, "%s: path=%s new_mode=0%o ino=%d\n", __func__,
-		   path, inode.i_mode, ino);
+	dbg_printf(ff, "%s: path=%s old_mode=0%o new_mode=0%o ino=%d\n",
+		   __func__, path, inode.i_mode, new_mode, ino);
+
+	inode.i_mode = new_mode;
 
 	ret = update_ctime(fs, ino, &inode);
 	if (ret)



* [PATCH 05/10] fuse2fs: debug timestamp updates
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-17 23:46   ` [PATCH 04/10] fuse2fs: better debugging for file mode updates Darrick J. Wong
@ 2025-07-17 23:46   ` Darrick J. Wong
  2025-07-17 23:46   ` [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:46 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

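Add tracing for timestamp updates to files.  The timestamp helpers now
take a struct fuse2fs pointer so that they can call dbg_printf, and they
gain a fuse2fs_ prefix to match.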

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   99 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 62 insertions(+), 37 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f2cb44a4e53b4c..ddc647f32c5df6 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -599,7 +599,8 @@ static void increment_version(struct ext2_inode_large *inode)
 		inode->i_version_hi = ver >> 32;
 }
 
-static void init_times(struct ext2_inode_large *inode)
+static void fuse2fs_init_timestamps(struct fuse2fs *ff, ext2_ino_t ino,
+				    struct ext2_inode_large *inode)
 {
 	struct timespec now;
 
@@ -609,11 +610,15 @@ static void init_times(struct ext2_inode_large *inode)
 	EXT4_INODE_SET_XTIME(i_mtime, &now, inode);
 	EXT4_EINODE_SET_XTIME(i_crtime, &now, inode);
 	increment_version(inode);
+
+	dbg_printf(ff, "%s: ino=%u time %ld:%lu\n", __func__, ino, now.tv_sec,
+		   now.tv_nsec);
 }
 
-static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
-			struct ext2_inode_large *pinode)
+static int fuse2fs_update_ctime(struct fuse2fs *ff, ext2_ino_t ino,
+				struct ext2_inode_large *pinode)
 {
+	ext2_filsys fs = ff->fs;
 	errcode_t err;
 	struct timespec now;
 	struct ext2_inode_large inode;
@@ -624,6 +629,10 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
 	if (pinode) {
 		increment_version(pinode);
 		EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
+
+		dbg_printf(ff, "%s: ino=%u ctime %ld:%lu\n", __func__, ino,
+			   now.tv_sec, now.tv_nsec);
+
 		return 0;
 	}
 
@@ -635,6 +644,9 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
 	increment_version(&inode);
 	EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
 
+	dbg_printf(ff, "%s: ino=%u ctime %ld:%lu\n", __func__, ino,
+		   now.tv_sec, now.tv_nsec);
+
 	err = fuse2fs_write_inode(fs, ino, &inode);
 	if (err)
 		return translate_error(fs, ino, err);
@@ -642,8 +654,9 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
 	return 0;
 }
 
-static int update_atime(ext2_filsys fs, ext2_ino_t ino)
+static int fuse2fs_update_atime(struct fuse2fs *ff, ext2_ino_t ino)
 {
+	ext2_filsys fs = ff->fs;
 	errcode_t err;
 	struct ext2_inode_large inode, *pinode;
 	struct timespec atime, mtime, now;
@@ -662,6 +675,10 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
 	dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
 	dnow = now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC);
 
+	dbg_printf(ff, "%s: ino=%u atime %ld:%lu mtime %ld:%lu now %ld:%lu\n",
+		   __func__, ino, atime.tv_sec, atime.tv_nsec, mtime.tv_sec,
+		   mtime.tv_nsec, now.tv_sec, now.tv_nsec);
+
 	/*
 	 * If atime is newer than mtime and atime hasn't been updated in thirty
 	 * seconds, skip the atime update.  Same idea as Linux "relatime".  Use
@@ -678,9 +695,10 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
 	return 0;
 }
 
-static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
-			struct ext2_inode_large *pinode)
+static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino,
+				struct ext2_inode_large *pinode)
 {
+	ext2_filsys fs = ff->fs;
 	errcode_t err;
 	struct ext2_inode_large inode;
 	struct timespec now;
@@ -690,6 +708,10 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
 		EXT4_INODE_SET_XTIME(i_mtime, &now, pinode);
 		EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
 		increment_version(pinode);
+
+		dbg_printf(ff, "%s: ino=%u mtime/ctime %ld:%lu\n",
+			   __func__, ino, now.tv_sec, now.tv_nsec);
+
 		return 0;
 	}
 
@@ -702,6 +724,9 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
 	EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
 	increment_version(&inode);
 
+	dbg_printf(ff, "%s: ino=%u mtime/ctime %ld:%lu\n",
+		   __func__, ino, now.tv_sec, now.tv_nsec);
+
 	err = fuse2fs_write_inode(fs, ino, &inode);
 	if (err)
 		return translate_error(fs, ino, err);
@@ -1660,7 +1685,7 @@ static int op_readlink(const char *path, char *buf, size_t len)
 	buf[len] = 0;
 
 	if (fuse2fs_is_writeable(ff)) {
-		ret = update_atime(fs, ino);
+		ret = fuse2fs_update_atime(ff, ino);
 		if (ret)
 			goto out;
 	}
@@ -1927,7 +1952,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
 		goto out2;
 	}
 
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -1950,7 +1975,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
 	}
 
 	inode.i_generation = ff->next_generation++;
-	init_times(&inode);
+	fuse2fs_init_timestamps(ff, child, &inode);
 	err = fuse2fs_write_inode(fs, child, &inode);
 	if (err) {
 		ret = translate_error(fs, child, err);
@@ -2036,7 +2061,7 @@ static int op_mkdir(const char *path, mode_t mode)
 		goto out2;
 	}
 
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -2063,7 +2088,7 @@ static int op_mkdir(const char *path, mode_t mode)
 	if (parent_sgid)
 		inode.i_mode |= S_ISGID;
 	inode.i_generation = ff->next_generation++;
-	init_times(&inode);
+	fuse2fs_init_timestamps(ff, child, &inode);
 
 	err = fuse2fs_write_inode(fs, child, &inode);
 	if (err) {
@@ -2146,7 +2171,7 @@ static int fuse2fs_unlink(struct fuse2fs *ff, const char *path,
 	if (err)
 		return translate_error(fs, dir, err);
 
-	ret = update_mtime(fs, dir, NULL);
+	ret = fuse2fs_update_mtime(ff, dir, NULL);
 	if (ret)
 		return ret;
 
@@ -2215,7 +2240,7 @@ static int remove_inode(struct fuse2fs *ff, ext2_ino_t ino)
 		inode.i_links_count--;
 	}
 
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out;
 
@@ -2394,7 +2419,7 @@ static int __op_rmdir(struct fuse2fs *ff, const char *path)
 		}
 		if (inode.i_links_count > 1)
 			inode.i_links_count--;
-		ret = update_mtime(fs, rds.parent, &inode);
+		ret = fuse2fs_update_mtime(ff, rds.parent, &inode);
 		if (ret)
 			goto out;
 		err = fuse2fs_write_inode(fs, rds.parent, &inode);
@@ -2488,7 +2513,7 @@ static int op_symlink(const char *src, const char *dest)
 	}
 
 	/* Update parent dir's mtime */
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -2512,7 +2537,7 @@ static int op_symlink(const char *src, const char *dest)
 	fuse2fs_set_uid(&inode, ctxt->uid);
 	fuse2fs_set_gid(&inode, gid);
 	inode.i_generation = ff->next_generation++;
-	init_times(&inode);
+	fuse2fs_init_timestamps(ff, child, &inode);
 
 	err = fuse2fs_write_inode(fs, child, &inode);
 	if (err) {
@@ -2762,11 +2787,11 @@ static int op_rename(const char *from, const char *to
 	}
 
 	/* Update timestamps */
-	ret = update_ctime(fs, from_ino, NULL);
+	ret = fuse2fs_update_ctime(ff, from_ino, NULL);
 	if (ret)
 		goto out2;
 
-	ret = update_mtime(fs, to_dir_ino, NULL);
+	ret = fuse2fs_update_mtime(ff, to_dir_ino, NULL);
 	if (ret)
 		goto out2;
 
@@ -2860,7 +2885,7 @@ static int op_link(const char *src, const char *dest)
 		goto out2;
 
 	inode.i_links_count++;
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out2;
 
@@ -2879,7 +2904,7 @@ static int op_link(const char *src, const char *dest)
 		goto out2;
 	}
 
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -3016,7 +3041,7 @@ static int op_chmod(const char *path, mode_t mode
 
 	inode.i_mode = new_mode;
 
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out;
 
@@ -3086,7 +3111,7 @@ static int op_chown(const char *path, uid_t owner, gid_t group
 		fuse2fs_set_gid(&inode, group);
 	}
 
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out;
 
@@ -3159,7 +3184,7 @@ static int truncate_helper(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size)
 	if (err)
 		return translate_error(fs, ino, err);
 
-	ret = update_mtime(fs, ino, NULL);
+	ret = fuse2fs_update_mtime(ff, ino, NULL);
 	if (ret)
 		return ret;
 
@@ -3378,7 +3403,7 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
 	}
 
 	if (fuse2fs_is_writeable(ff)) {
-		ret = update_atime(fs, fh->ino);
+		ret = fuse2fs_update_atime(ff, fh->ino);
 		if (ret)
 			goto out;
 	}
@@ -3464,7 +3489,7 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
 		goto out;
 	}
 
-	ret = update_mtime(fs, fh->ino, NULL);
+	ret = fuse2fs_update_mtime(ff, fh->ino, NULL);
 	if (ret)
 		goto out;
 
@@ -3834,7 +3859,7 @@ static int op_setxattr(const char *path EXT2FS_ATTR((unused)),
 		goto out2;
 	}
 
-	ret = update_ctime(fs, ino, NULL);
+	ret = fuse2fs_update_ctime(ff, ino, NULL);
 out2:
 	err = ext2fs_xattrs_close(&h);
 	if (!ret && err)
@@ -3929,7 +3954,7 @@ static int op_removexattr(const char *path, const char *key)
 		goto out2;
 	}
 
-	ret = update_ctime(fs, ino, NULL);
+	ret = fuse2fs_update_ctime(ff, ino, NULL);
 out2:
 	err = ext2fs_xattrs_close(&h);
 	if (err && !ret)
@@ -4067,7 +4092,7 @@ static int op_readdir(const char *path EXT2FS_ATTR((unused)),
 	}
 
 	if (fuse2fs_is_writeable(ff)) {
-		ret = update_atime(i.fs, fh->ino);
+		ret = fuse2fs_update_atime(ff, fh->ino);
 		if (ret)
 			goto out;
 	}
@@ -4173,7 +4198,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
 		goto out2;
 	}
 
-	ret = update_mtime(fs, parent, NULL);
+	ret = fuse2fs_update_mtime(ff, parent, NULL);
 	if (ret)
 		goto out2;
 
@@ -4204,7 +4229,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
 	}
 
 	inode.i_generation = ff->next_generation++;
-	init_times(&inode);
+	fuse2fs_init_timestamps(ff, child, &inode);
 	err = fuse2fs_write_inode(fs, child, &inode);
 	if (err) {
 		ret = translate_error(fs, child, err);
@@ -4277,7 +4302,7 @@ static int op_ftruncate(const char *path EXT2FS_ATTR((unused)),
 		goto out;
 	}
 
-	ret = update_mtime(fs, fh->ino, NULL);
+	ret = fuse2fs_update_mtime(ff, fh->ino, NULL);
 	if (ret)
 		goto out;
 
@@ -4365,7 +4390,7 @@ static int op_utimens(const char *path, const struct timespec ctv[2]
 	if (tv[1].tv_nsec != UTIME_OMIT)
 		EXT4_INODE_SET_XTIME(i_mtime, &tv[1], &inode);
 #endif /* UTIME_OMIT */
-	ret = update_ctime(fs, ino, &inode);
+	ret = fuse2fs_update_ctime(ff, ino, &inode);
 	if (ret)
 		goto out;
 
@@ -4433,7 +4458,7 @@ static int ioctl_setflags(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 	if (ret)
 		return ret;
 
-	ret = update_ctime(fs, fh->ino, &inode);
+	ret = fuse2fs_update_ctime(ff, fh->ino, &inode);
 	if (ret)
 		return ret;
 
@@ -4480,7 +4505,7 @@ static int ioctl_setversion(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 
 	inode.i_generation = generation;
 
-	ret = update_ctime(fs, fh->ino, &inode);
+	ret = fuse2fs_update_ctime(ff, fh->ino, &inode);
 	if (ret)
 		return ret;
 
@@ -4585,7 +4610,7 @@ static int ioctl_fssetxattr(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
 	if (ext2fs_inode_includes(inode_size, i_projid))
 		inode.i_projid = fsx->fsx_projid;
 
-	ret = update_ctime(fs, fh->ino, &inode);
+	ret = fuse2fs_update_ctime(ff, fh->ino, &inode);
 	if (ret)
 		return ret;
 
@@ -4832,7 +4857,7 @@ static int fuse2fs_allocate_range(struct fuse2fs *ff,
 		}
 	}
 
-	err = update_mtime(fs, fh->ino, &inode);
+	err = fuse2fs_update_mtime(ff, fh->ino, &inode);
 	if (err)
 		return err;
 
@@ -4986,7 +5011,7 @@ static int fuse2fs_punch_range(struct fuse2fs *ff,
 			return translate_error(fs, fh->ino, err);
 	}
 
-	err = update_mtime(fs, fh->ino, &inode);
+	err = fuse2fs_update_mtime(ff, fh->ino, &inode);
 	if (err)
 		return err;
 



* [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-17 23:46   ` [PATCH 05/10] fuse2fs: debug timestamp updates Darrick J. Wong
@ 2025-07-17 23:46   ` Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:46 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

In iomap mode, the kernel is responsible for maintaining timestamps
because file writes don't upcall to fuse2fs.  The kernel decides whether
[cm]time needs updating by checking whether [cm]time exactly matches the
coarse clock, rather than checking that [cm]time < coarse_clock.  If
fuse2fs sets a fine-grained timestamp that is slightly ahead of the
coarse clock, timestamps can therefore appear to go backwards.
generic/423 doesn't like seeing btime > ctime from statx, so use the
coarse clock in iomap mode.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index ddc647f32c5df6..54f501b36d808b 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -575,8 +575,24 @@ static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
 	ext2fs_extent_free(extents);
 }
 
-static void get_now(struct timespec *now)
+static void fuse2fs_get_now(struct fuse2fs *ff, struct timespec *now)
 {
+#ifdef CLOCK_REALTIME_COARSE
+	/*
+	 * In iomap mode, the kernel is responsible for maintaining timestamps
+	 * because file writes don't upcall to fuse2fs.  The kernel's predicate
+	 * for deciding if [cm]time should be updated bases its decisions off
+	 * [cm]time being an exact match for the coarse clock (instead of
+	 * checking that [cm]time < coarse_clock) which means that fuse2fs
+	 * setting a fine-grained timestamp that is slightly ahead of the
+	 * coarse clock can result in timestamps appearing to go backwards.
+	 * generic/423 doesn't like seeing btime > ctime from statx, so we'll
+	 * use the coarse clock in iomap mode.
+	 */
+	if (fuse2fs_iomap_does_fileio(ff) &&
+	    !clock_gettime(CLOCK_REALTIME_COARSE, now))
+		return;
+#endif
 #ifdef CLOCK_REALTIME
 	if (!clock_gettime(CLOCK_REALTIME, now))
 		return;
@@ -604,7 +620,7 @@ static void fuse2fs_init_timestamps(struct fuse2fs *ff, ext2_ino_t ino,
 {
 	struct timespec now;
 
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 	EXT4_INODE_SET_XTIME(i_atime, &now, inode);
 	EXT4_INODE_SET_XTIME(i_ctime, &now, inode);
 	EXT4_INODE_SET_XTIME(i_mtime, &now, inode);
@@ -623,7 +639,7 @@ static int fuse2fs_update_ctime(struct fuse2fs *ff, ext2_ino_t ino,
 	struct timespec now;
 	struct ext2_inode_large inode;
 
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 
 	/* If user already has a inode buffer, just update that */
 	if (pinode) {
@@ -669,7 +685,7 @@ static int fuse2fs_update_atime(struct fuse2fs *ff, ext2_ino_t ino)
 	pinode = &inode;
 	EXT4_INODE_GET_XTIME(i_atime, &atime, pinode);
 	EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode);
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 
 	datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC);
 	dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
@@ -704,7 +720,7 @@ static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino,
 	struct timespec now;
 
 	if (pinode) {
-		get_now(&now);
+		fuse2fs_get_now(ff, &now);
 		EXT4_INODE_SET_XTIME(i_mtime, &now, pinode);
 		EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
 		increment_version(pinode);
@@ -719,7 +735,7 @@ static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino,
 	if (err)
 		return translate_error(fs, ino, err);
 
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 	EXT4_INODE_SET_XTIME(i_mtime, &now, &inode);
 	EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
 	increment_version(&inode);
@@ -4380,9 +4396,9 @@ static int op_utimens(const char *path, const struct timespec ctv[2]
 	tv[1] = ctv[1];
 #ifdef UTIME_NOW
 	if (tv[0].tv_nsec == UTIME_NOW)
-		get_now(tv);
+		fuse2fs_get_now(ff, tv);
 	if (tv[1].tv_nsec == UTIME_NOW)
-		get_now(tv + 1);
+		fuse2fs_get_now(ff, tv + 1);
 #endif /* UTIME_NOW */
 #ifdef UTIME_OMIT
 	if (tv[0].tv_nsec != UTIME_OMIT)
@@ -6917,7 +6933,7 @@ static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
 			error_message(err), func, line);
 
 	/* Make a note in the error log */
-	get_now(&now);
+	fuse2fs_get_now(ff, &now);
 	ext2fs_set_tstamp(fs->super, s_last_error_time, now.tv_sec);
 	fs->super->s_last_error_ino = ino;
 	fs->super->s_last_error_line = line;



* [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-07-17 23:46   ` [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
@ 2025-07-17 23:47   ` Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 08/10] fuse2fs: enable syncfs Darrick J. Wong
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:47 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Add tracing for retrieving timestamps so that we can debug unexpected
timestamp values.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
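One note for anyone reading the resulting trace: the new dbg_printf
prints seconds and nanoseconds as two plain integers separated by a
dot, so the output is a pair of fields rather than a decimal number of
seconds.  If a decimal rendering is ever wanted, zero-padding the
nanoseconds is one way to get it.  The helper below is a made-up
illustration, not something this patch adds.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: render @ts as decimal seconds with nine
 * fractional digits.  Returns what snprintf returns.
 */
static int format_timespec(char *buf, size_t len, const struct timespec *ts)
{
	return snprintf(buf, len, "%lld.%09ld",
			(long long)ts->tv_sec, (long)ts->tv_nsec);
}
```

With tv_sec = 5 and tv_nsec = 7 this produces "5.000000007", whereas
the unpadded "%lld.%ld" form prints "5.7" for the same value.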
 misc/fuse2fs.c |   20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 54f501b36d808b..15595fdf0b19ba 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1502,9 +1502,11 @@ static void *op_init(struct fuse_conn_info *conn
 	goto out;
 }
 
-static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf)
+static int fuse2fs_stat(struct fuse2fs *ff, ext2_ino_t ino,
+			struct stat *statbuf)
 {
 	struct ext2_inode_large inode;
+	ext2_filsys fs = ff->fs;
 	dev_t fakedev = 0;
 	errcode_t err;
 	int ret = 0;
@@ -1543,6 +1545,13 @@ static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf)
 #else
 	statbuf->st_ctime = tv.tv_sec;
 #endif
+
+	dbg_printf(ff, "%s: ino=%d atime=%lld.%ld mtime=%lld.%ld ctime=%lld.%ld\n",
+		   __func__, ino,
+		   (long long int)statbuf->st_atim.tv_sec, statbuf->st_atim.tv_nsec,
+		   (long long int)statbuf->st_mtim.tv_sec, statbuf->st_mtim.tv_nsec,
+		   (long long int)statbuf->st_ctim.tv_sec, statbuf->st_ctim.tv_nsec);
+
 	if (LINUX_S_ISCHR(inode.i_mode) ||
 	    LINUX_S_ISBLK(inode.i_mode)) {
 		if (inode.i_block[0])
@@ -1602,16 +1611,15 @@ static int op_getattr(const char *path, struct stat *statbuf
 {
 	struct fuse_context *ctxt = fuse_get_context();
 	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
-	ext2_filsys fs;
 	ext2_ino_t ino;
 	int ret = 0;
 
 	FUSE2FS_CHECK_CONTEXT(ff);
-	fs = fuse2fs_start(ff);
+	fuse2fs_start(ff);
 	ret = fuse2fs_file_ino(ff, path, fi, &ino);
 	if (ret)
 		goto out;
-	ret = stat_inode(fs, ino, statbuf);
+	ret = fuse2fs_stat(ff, ino, statbuf);
 out:
 	fuse2fs_finish(ff, ret);
 	return ret;
@@ -4051,7 +4059,7 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)),
 
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
 	if (i->flags == FUSE_READDIR_PLUS) {
-		ret = stat_inode(i->fs, dirent->inode, &stat);
+		ret = fuse2fs_stat(i->ff, dirent->inode, &stat);
 		if (ret)
 			return DIRENT_ABORT;
 	}
@@ -4342,7 +4350,7 @@ static int op_fgetattr(const char *path EXT2FS_ATTR((unused)),
 	FUSE2FS_CHECK_HANDLE(ff, fh);
 	dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
 	fs = fuse2fs_start(ff);
-	ret = stat_inode(fs, fh->ino, statbuf);
+	ret = fuse2fs_stat(ff, fh->ino, statbuf);
 	fuse2fs_finish(ff, ret);
 
 	return ret;



* [PATCH 08/10] fuse2fs: enable syncfs
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-07-17 23:47   ` [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
@ 2025-07-17 23:47   ` Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 09/10] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 10/10] fuse2fs: implement statx Darrick J. Wong
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:47 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

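Implement the syncfs operation in fuse2fs.  If the filesystem is
writable, update the group descriptor checksums and flush the filesystem
to disk.  The handler is only built against libfuse 3.18 or newer.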

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 15595fdf0b19ba..66baca72ad49d1 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5099,6 +5099,42 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
 # endif /* SUPPORT_FALLOCATE */
 #endif /* FUSE 29 */
 
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+static int op_syncfs(const char *path)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	ext2_filsys fs;
+	errcode_t err;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+	dbg_printf(ff, "%s: path=%s\n", __func__, path);
+	fs = fuse2fs_start(ff);
+
+	if (ff->opstate == F2OP_WRITABLE) {
+		if (fs->super->s_error_count)
+			fs->super->s_state |= EXT2_ERROR_FS;
+		ext2fs_mark_super_dirty(fs);
+		err = ext2fs_set_gdt_csum(fs);
+		if (err) {
+			ret = translate_error(fs, 0, err);
+			goto out_unlock;
+		}
+
+		err = ext2fs_flush2(fs, 0);
+		if (err) {
+			ret = translate_error(fs, 0, err);
+			goto out_unlock;
+		}
+	}
+
+out_unlock:
+	fuse2fs_finish(ff, ret);
+	return ret;
+}
+#endif
+
 #ifdef HAVE_FUSE_IOMAP
 static void fuse2fs_iomap_hole(struct fuse2fs *ff, struct fuse_iomap *iomap,
 			       off_t pos, uint64_t count)
@@ -6301,6 +6337,9 @@ static struct fuse_operations fs_ops = {
 	.fallocate = op_fallocate,
 # endif
 #endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+	.syncfs = op_syncfs,
+#endif
 #ifdef HAVE_FUSE_IOMAP
 	.iomap_begin = op_iomap_begin,
 	.iomap_end = op_iomap_end,



* [PATCH 09/10] fuse2fs: skip the gdt write in op_destroy if syncfs is working
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-07-17 23:47   ` [PATCH 08/10] fuse2fs: enable syncfs Darrick J. Wong
@ 2025-07-17 23:47   ` Darrick J. Wong
  2025-07-17 23:47   ` [PATCH 10/10] fuse2fs: implement statx Darrick J. Wong
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:47 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

As an umount-time performance enhancement, don't bother to write the
group descriptor tables in op_destroy if we know that op_syncfs will do
it for us.  That only happens if iomap is enabled.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 66baca72ad49d1..3bded0fdd21e2a 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -263,6 +263,7 @@ struct fuse2fs {
 	uint8_t noblkdev;
 	uint8_t can_hardlink;
 	uint8_t iomap_passthrough_options;
+	uint8_t write_gdt_on_destroy;
 
 	enum fuse2fs_opstate opstate;
 	int blocklog;
@@ -1212,9 +1213,11 @@ static void op_destroy(void *p EXT2FS_ATTR((unused)))
 		if (fs->super->s_error_count)
 			fs->super->s_state |= EXT2_ERROR_FS;
 		ext2fs_mark_super_dirty(fs);
-		err = ext2fs_set_gdt_csum(fs);
-		if (err)
-			translate_error(fs, 0, err);
+		if (ff->write_gdt_on_destroy) {
+			err = ext2fs_set_gdt_csum(fs);
+			if (err)
+				translate_error(fs, 0, err);
+		}
 
 		err = ext2fs_flush2(fs, 0);
 		if (err)
@@ -5129,6 +5132,15 @@ static int op_syncfs(const char *path)
 		}
 	}
 
+	/*
+	 * When iomap is enabled, the kernel will call syncfs right before
+	 * calling the destroy method.  If any syncfs succeeds, then we know
+	 * that there will be a last syncfs and that it will write the GDT, so
+	 * destroy doesn't need to waste time doing that.
+	 */
+	if (fuse2fs_iomap_enabled(ff))
+		ff->write_gdt_on_destroy = 0;
+
 out_unlock:
 	fuse2fs_finish(ff, ret);
 	return ret;
@@ -6631,6 +6643,7 @@ int main(int argc, char *argv[])
 		.iomap_dev = FUSE_IOMAP_DEV_NULL,
 #endif
 		.can_hardlink = 1,
+		.write_gdt_on_destroy = 1,
 	};
 	errcode_t err;
 	FILE *orig_stderr = stderr;
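
To make the life cycle of the new flag concrete, here is a small standalone
model of how syncfs and destroy interact.  This is only a sketch under stated
assumptions: struct model, model_syncfs() and model_destroy() are hypothetical
stand-ins rather than fuse2fs functions, and the GDT checksum write is reduced
to a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant fields of struct fuse2fs. */
struct model {
	bool iomap_enabled;
	bool write_gdt_on_destroy;	/* starts out true, as in main() */
	int gdt_writes;			/* counts simulated GDT checksum writes */
};

/* Simulated syncfs: err != 0 models a failed GDT write or flush. */
static int model_syncfs(struct model *m, int err)
{
	if (err)
		return err;	/* failed sync: leave the flag alone */
	m->gdt_writes++;
	/* A successful syncfs under iomap means destroy can skip the GDT. */
	if (m->iomap_enabled)
		m->write_gdt_on_destroy = false;
	return 0;
}

/* Simulated destroy: only writes the GDT if nobody did it already. */
static void model_destroy(struct model *m)
{
	if (m->write_gdt_on_destroy)
		m->gdt_writes++;
}
```

With iomap enabled, one successful model_syncfs() followed by model_destroy()
performs a single write; without iomap, or if the sync failed, destroy still
does the write itself.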



* [PATCH 10/10] fuse2fs: implement statx
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-07-17 23:47   ` [PATCH 09/10] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
@ 2025-07-17 23:47   ` Darrick J. Wong
  9 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:47 UTC (permalink / raw)
  To: tytso; +Cc: joannelkoong, miklos, John, linux-fsdevel, bernd, linux-ext4,
	neal

From: Darrick J. Wong <djwong@kernel.org>

Implement statx.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 misc/fuse2fs.c |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)


diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 3bded0fdd21e2a..6d2ed7da9cc09e 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -23,6 +23,7 @@
 #include <sys/xattr.h>
 #endif
 #include <sys/ioctl.h>
+#include <sys/sysmacros.h>
 #include <unistd.h>
 #include <ctype.h>
 #define FUSE_DARWIN_ENABLE_EXTENSIONS 0
@@ -1646,6 +1647,111 @@ static int op_getattr_iflags(const char *path, struct stat *statbuf,
 }
 #endif
 
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18) && defined(STATX_BASIC_STATS)
+static inline void fuse2fs_set_statx_attr(struct statx *stx,
+					  uint64_t statx_flag, int set)
+{
+	if (set)
+		stx->stx_attributes |= statx_flag;
+	stx->stx_attributes_mask |= statx_flag;
+}
+
+static int fuse2fs_statx(struct fuse2fs *ff, ext2_ino_t ino,
+			 uint32_t statx_mask, struct statx *stx, size_t size)
+{
+	struct ext2_inode_large inode;
+	ext2_filsys fs = ff->fs;
+	dev_t fakedev = 0;
+	errcode_t err;
+	struct timespec tv;
+
+	if (size < sizeof(struct statx))
+		return translate_error(fs, ino, EOPNOTSUPP);
+
+	err = fuse2fs_read_inode(fs, ino, &inode);
+	if (err)
+		return translate_error(fs, ino, err);
+
+	memcpy(&fakedev, fs->super->s_uuid, sizeof(fakedev));
+	stx->stx_mask = STATX_BASIC_STATS | STATX_BTIME;
+	stx->stx_dev_major = major(fakedev);
+	stx->stx_dev_minor = minor(fakedev);
+	stx->stx_ino = ino;
+	stx->stx_mode = inode.i_mode;
+	stx->stx_nlink = inode.i_links_count;
+	stx->stx_uid = inode_uid(inode);
+	stx->stx_gid = inode_gid(inode);
+	stx->stx_size = EXT2_I_SIZE(&inode);
+	stx->stx_blksize = fs->blocksize;
+	stx->stx_blocks = ext2fs_get_stat_i_blocks(fs,
+						EXT2_INODE(&inode));
+	EXT4_INODE_GET_XTIME(i_atime, &tv, &inode);
+	stx->stx_atime.tv_sec = tv.tv_sec;
+	stx->stx_atime.tv_nsec = tv.tv_nsec;
+
+	EXT4_INODE_GET_XTIME(i_mtime, &tv, &inode);
+	stx->stx_mtime.tv_sec = tv.tv_sec;
+	stx->stx_mtime.tv_nsec = tv.tv_nsec;
+
+	EXT4_INODE_GET_XTIME(i_ctime, &tv, &inode);
+	stx->stx_ctime.tv_sec = tv.tv_sec;
+	stx->stx_ctime.tv_nsec = tv.tv_nsec;
+
+	EXT4_INODE_GET_XTIME(i_crtime, &tv, &inode);
+	stx->stx_btime.tv_sec = tv.tv_sec;
+	stx->stx_btime.tv_nsec = tv.tv_nsec;
+
+	dbg_printf(ff, "%s: ino=%d atime=%lld.%d mtime=%lld.%d ctime=%lld.%d btime=%lld.%d\n",
+		   __func__, ino,
+		   (long long int)stx->stx_atime.tv_sec, stx->stx_atime.tv_nsec,
+		   (long long int)stx->stx_mtime.tv_sec, stx->stx_mtime.tv_nsec,
+		   (long long int)stx->stx_ctime.tv_sec, stx->stx_ctime.tv_nsec,
+		   (long long int)stx->stx_btime.tv_sec, stx->stx_btime.tv_nsec);
+
+	if (LINUX_S_ISCHR(inode.i_mode) ||
+	    LINUX_S_ISBLK(inode.i_mode)) {
+		if (inode.i_block[0]) {
+			stx->stx_rdev_major = major(inode.i_block[0]);
+			stx->stx_rdev_minor = minor(inode.i_block[0]);
+		} else {
+			stx->stx_rdev_major = major(inode.i_block[1]);
+			stx->stx_rdev_minor = minor(inode.i_block[1]);
+		}
+	}
+
+	fuse2fs_set_statx_attr(stx, STATX_ATTR_COMPRESSED,
+			       inode.i_flags & EXT2_COMPR_FL);
+	fuse2fs_set_statx_attr(stx, STATX_ATTR_IMMUTABLE,
+			       inode.i_flags & EXT2_IMMUTABLE_FL);
+	fuse2fs_set_statx_attr(stx, STATX_ATTR_APPEND,
+			       inode.i_flags & EXT2_APPEND_FL);
+	fuse2fs_set_statx_attr(stx, STATX_ATTR_NODUMP,
+			       inode.i_flags & EXT2_NODUMP_FL);
+
+	return 0;
+}
+
+static int op_statx(const char *path, uint32_t statx_flags, uint32_t statx_mask,
+		    struct statx *stx, size_t size, struct fuse_file_info *fi)
+{
+	struct fuse_context *ctxt = fuse_get_context();
+	struct fuse2fs *ff = (struct fuse2fs *)ctxt->private_data;
+	ext2_ino_t ino;
+	int ret = 0;
+
+	FUSE2FS_CHECK_CONTEXT(ff);
+	fuse2fs_start(ff);
+	ret = fuse2fs_file_ino(ff, path, fi, &ino);
+	if (ret)
+		goto out;
+	ret = fuse2fs_statx(ff, ino, statx_mask, stx, size);
+out:
+	fuse2fs_finish(ff, ret);
+	return ret;
+}
+#else
+# define op_statx		NULL
+#endif
 
 static int op_readlink(const char *path, char *buf, size_t len)
 {
@@ -6351,6 +6457,7 @@ static struct fuse_operations fs_ops = {
 #endif
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18)
 	.syncfs = op_syncfs,
+	.statx = op_statx,
 #endif
 #ifdef HAVE_FUSE_IOMAP
 	.iomap_begin = op_iomap_begin,
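
The attribute helper in this patch always sets the bit in stx_attributes_mask
(meaning "this filesystem can report the attribute") and sets the bit in
stx_attributes only when the inode flag is present.  A minimal standalone
sketch of that behaviour follows; the demo_* names and the DEMO_*_FL constants
are illustrative assumptions (the constants carry the same values as
EXT2_IMMUTABLE_FL and EXT2_APPEND_FL), not part of fuse2fs.

```c
#include <stdint.h>
#include <linux/stat.h>		/* struct statx, STATX_ATTR_* */

/* Illustrative copies of two ext2 inode flag bits. */
#define DEMO_IMMUTABLE_FL	0x00000010
#define DEMO_APPEND_FL		0x00000020

/* Mirror of the helper's logic: mask is unconditional, value is not. */
static void demo_set_statx_attr(struct statx *stx, uint64_t flag, int set)
{
	if (set)
		stx->stx_attributes |= flag;
	stx->stx_attributes_mask |= flag;
}

/* Translate the two demo inode flags into statx attributes. */
static void demo_fill_attrs(struct statx *stx, uint32_t i_flags)
{
	demo_set_statx_attr(stx, STATX_ATTR_IMMUTABLE,
			    (i_flags & DEMO_IMMUTABLE_FL) != 0);
	demo_set_statx_attr(stx, STATX_ATTR_APPEND,
			    (i_flags & DEMO_APPEND_FL) != 0);
}
```

For an append-only inode, the mask advertises both attributes while only
STATX_ATTR_APPEND is reported as set.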



* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
                   ` (9 preceding siblings ...)
  2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-07-18  8:54 ` Christian Brauner
  2025-07-18 11:55   ` Amir Goldstein
  10 siblings, 1 reply; 174+ messages in thread
From: Christian Brauner @ 2025-07-18  8:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, John, bernd, miklos, joannelkoong, Josef Bacik,
	linux-ext4, Theodore Ts'o, Neal Gompa

On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> Hi everyone,
> 
> DO NOT MERGE THIS, STILL!
> 
> This is the third request for comments of a prototype to connect the
> Linux fuse driver to fs-iomap for regular file IO operations to and from
> files whose contents persist to locally attached storage devices.
> 
> Why would you want to do that?  Most filesystem drivers are seriously
> vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> over almost a decade of its existence.  Faulty code can lead to total
> kernel compromise, and I think there's a very strong incentive to move
> all that parsing out to userspace where we can containerize the fuse
> server process.
> 
> willy's folios conversion project (and to a certain degree RH's new
> mount API) have also demonstrated that treewide changes to the core
> mm/pagecache/fs code are very very difficult to pull off and take years
> because you have to understand every filesystem's bespoke use of that
> core code.  Eeeugh.
> 
> The fuse command plumbing is very simple -- the ->iomap_begin,
> ->iomap_end, and iomap ->ioend calls within iomap are turned into
> upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> writeback is now a directio write.  The fuse server is now able to
> upsert mappings into the kernel for cached access (== zero upcalls for
> rereads and pure overwrites!) and the iomap cache revalidation code
> works.
> 
> With this RFC, I am able to show that it's possible to build a fuse
> server for a real filesystem (ext4) that runs entirely in userspace yet
> maintains most of its performance.  At this stage I still get about 95%
> of the kernel ext4 driver's streaming directio performance on streaming
> IO, and 110% of its streaming buffered IO performance.  Random buffered
> IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> fast as the kernel; see the cover letter for the fuse2fs iomap changes
> for more details.  Unwritten extent conversions on random direct writes
> are especially painful for fuse+iomap (~90% more overhead) due to upcall
> overhead.  And that's with debugging turned on!
> 
> These items have been addressed since the first RFC:
> 
> 1. The iomap cookie validation is now present, which avoids subtle races
> between pagecache zeroing and writeback on filesystems that support
> unwritten and delalloc mappings.
> 
> 2. Mappings can be cached in the kernel for more speed.
> 
> 3. iomap supports inline data.
> 
> 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> to be as easy as creating a new ->getattr_iflags callback so that the
> fuse server can set fuse_attr::flags.
> 
> 5. statx and syncfs work on iomap filesystems.
> 
> 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> is enabled.
> 
> 7. The ext4 shutdown ioctl is now supported.
> 
> There are some major warts remaining:
> 
> a. ext4 doesn't support out of place writes so I don't know if that
> actually works correctly.
> 
> b. iomap is an inode-based service, not a file-based service.  This
> means that we /must/ push ext2's inode numbers into the kernel via
> FUSE_GETATTR so that it can report those same numbers back out through
> the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> to index its incore inode, so we have to pass those too so that
> notifications work properly.  This is related to #c below:
> 
> c. Hardlinks and iomap are not possible for upper-level libfuse clients
> because the upper level libfuse likes to abstract kernel nodeids with
> its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> As a result, a hardlinked file results in two distinct struct inodes in
> the kernel, which completely breaks iomap's locking model.  I will have
> to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> but on the plus side there will be far less path lookup overhead.
> 
> d. There are too many changes to the IO manager in libext2fs because I
> built things needed to stage the direct/buffered IO paths separately.
> These are now unnecessary but I haven't pulled them out yet because
> they're sort of useful to verify that iomap file IO never goes through
> libext2fs except for inline data.
> 
> e. If we're going to use fuse servers as "safe" replacements for kernel
> filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> We also need to disable the OOM killer(s) for fuse servers because you
> don't want filesystems to unmount abruptly.
> 
> f. How do we maximally contain the fuse server to have safe filesystem
> mounts?  It's very convenient to use systemd services to configure
> isolation declaratively, but fuse2fs still needs to be able to open
> /dev/fuse, the ext4 block device, and call mount() in the shared
> namespace.  This prevents us from using most of the stronger systemd

I'm happy to help you here.

First, I think using a character device for namespaced drivers is always
a mistake. FUSE predates all that ofc. They're incredibly terrible for
delegation because of devtmpfs not being namespaced as well as devices
in general. And having device nodes on anything other than tmpfs is just
wrong (TM).

In systemd I ultimately want a bpf LSM program that prevents the
creation of device nodes outside of tmpfs. They don't belong on
persistent storage imho. But anyway, that's beside the point.

Opening the block device should be done by systemd-mountfsd but I think
/dev/fuse should really be openable by the service itself.

So we can try and allowlist /dev/fuse in vfs_mknod() similar to
whiteouts. That means you can do mknod() in the container to create
/dev/fuse (Personally, I would even restrict this to tmpfs right off the
bat so that containers can only do this on their private tmpfs mount at
/dev.)

The downside of this would be to give unprivileged containers access to
FUSE by default. I don't think that's a problem per se but it is a uapi
change.

Let me think a bit about alternatives. I have one crazy idea but I'm not
sure enough about it to spill it.

> protections because they tend to run in a private mount namespace with
> various parts of the filesystem either hidden or readonly.
> 
> In theory one could design a socket protocol to pass mount options,
> block device paths, fds, and responsibility for the mount() call between
> a mount helper and a service:

This isn't a problem really. This should just be an extension to
systemd-mountfsd.

> 
> e2fsprogs would define a systemd socket service for fuse2fs that sets
> up a dynamic unprivileged user, no network access, and no access to the
> host's filesystem aside from readonly access to the root filesystem.
> 
> The mount helper (e.g. mount.safe) would then connect to the magic
> socket and pass the CLI arguments to the fuse2fs service.  The service
> would parse the arguments, find the block device paths, and feed them
> back through the socket to mount.safe.  mount.safe would open them and
> pass fds back to the fuse2fs service.  The service would then open the
> devices, parse the superblock, and if everything was ok, request a mount
> through the socket.  The mount helper would then open /dev/fuse and
> mount the filesystem, and if successful, pass the /dev/fuse fd through
> the socket to the fuse2fs server.  At that point the fuse2fs server
> would attach to the /dev/fuse device and handle the usual events.
> 
> Finally we'd have to train people/daemons to run "mount -t safe.ext4
> /dev/sda1 /mnt" to get the contained version of ext4.
> 
> (Yeah, #f is all Neal. ;))
> 
> g. fuse2fs doesn't support the ext4 journal.  Urk.
> 
> I'll work on these in July/August, but for now here's an unmergeable RFC
> to start some discussion.
> 
> --Darrick
> 


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18  8:54 ` [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Christian Brauner
@ 2025-07-18 11:55   ` Amir Goldstein
  2025-07-18 19:31     ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Amir Goldstein @ 2025-07-18 11:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > Hi everyone,
> >
> > DO NOT MERGE THIS, STILL!
> >
> > This is the third request for comments of a prototype to connect the
> > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > files whose contents persist to locally attached storage devices.
> >
> > Why would you want to do that?  Most filesystem drivers are seriously
> > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > over almost a decade of its existence.  Faulty code can lead to total
> > kernel compromise, and I think there's a very strong incentive to move
> > all that parsing out to userspace where we can containerize the fuse
> > server process.
> >
> > willy's folios conversion project (and to a certain degree RH's new
> > mount API) have also demonstrated that treewide changes to the core
> > mm/pagecache/fs code are very very difficult to pull off and take years
> > because you have to understand every filesystem's bespoke use of that
> > core code.  Eeeugh.
> >
> > The fuse command plumbing is very simple -- the ->iomap_begin,
> > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > writeback is now a directio write.  The fuse server is now able to
> > upsert mappings into the kernel for cached access (== zero upcalls for
> > rereads and pure overwrites!) and the iomap cache revalidation code
> > works.
> >
> > With this RFC, I am able to show that it's possible to build a fuse
> > server for a real filesystem (ext4) that runs entirely in userspace yet
> > maintains most of its performance.  At this stage I still get about 95%
> > of the kernel ext4 driver's streaming directio performance on streaming
> > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > for more details.  Unwritten extent conversions on random direct writes
> > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > overhead.  And that's with debugging turned on!
> >
> > These items have been addressed since the first RFC:
> >
> > 1. The iomap cookie validation is now present, which avoids subtle races
> > between pagecache zeroing and writeback on filesystems that support
> > unwritten and delalloc mappings.
> >
> > 2. Mappings can be cached in the kernel for more speed.
> >
> > 3. iomap supports inline data.
> >
> > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > to be as easy as creating a new ->getattr_iflags callback so that the
> > fuse server can set fuse_attr::flags.
> >
> > 5. statx and syncfs work on iomap filesystems.
> >
> > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > is enabled.
> >
> > 7. The ext4 shutdown ioctl is now supported.
> >
> > There are some major warts remaining:
> >
> > a. ext4 doesn't support out of place writes so I don't know if that
> > actually works correctly.
> >
> > b. iomap is an inode-based service, not a file-based service.  This
> > means that we /must/ push ext2's inode numbers into the kernel via
> > FUSE_GETATTR so that it can report those same numbers back out through
> > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > to index its incore inode, so we have to pass those too so that
> > notifications work properly.  This is related to #c below:
> >
> > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > because the upper level libfuse likes to abstract kernel nodeids with
> > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > As a result, a hardlinked file results in two distinct struct inodes in
> > the kernel, which completely breaks iomap's locking model.  I will have
> > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > but on the plus side there will be far less path lookup overhead.
> >
> > d. There are too many changes to the IO manager in libext2fs because I
> > built things needed to stage the direct/buffered IO paths separately.
> > These are now unnecessary but I haven't pulled them out yet because
> > they're sort of useful to verify that iomap file IO never goes through
> > libext2fs except for inline data.
> >
> > e. If we're going to use fuse servers as "safe" replacements for kernel
> > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > We also need to disable the OOM killer(s) for fuse servers because you
> > don't want filesystems to unmount abruptly.
> >
> > f. How do we maximally contain the fuse server to have safe filesystem
> > mounts?  It's very convenient to use systemd services to configure
> > isolation declaratively, but fuse2fs still needs to be able to open
> > /dev/fuse, the ext4 block device, and call mount() in the shared
> > namespace.  This prevents us from using most of the stronger systemd
>
> I'm happy to help you here.
>
> First, I think using a character device for namespaced drivers is always
> a mistake. FUSE predates all that ofc. They're incredibly terrible for
> delegation because of devtmpfs not being namespaced as well as devices
> in general. And having device nodes on anything other than tmpfs is just
> wrong (TM).
>
> In systemd I ultimately want a bpf LSM program that prevents the
> creation of device nodes outside of tmpfs. They don't belong on
> persistent storage imho. But anyway, that's beside the point.
>
> Opening the block device should be done by systemd-mountfsd but I think
> /dev/fuse should really be openable by the service itself.
>
> So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> whiteouts. That means you can do mknod() in the container to create
> /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> bat so that containers can only do this on their private tmpfs mount at
> /dev.)
>
> The downside of this would be to give unprivileged containers access to
> FUSE by default. I don't think that's a problem per se but it is a uapi
> change.
>
> Let me think a bit about alternatives. I have one crazy idea but I'm not
> sure enough about it to spill it.
>

I don't think there is a hard requirement for the fuse fd to be opened from
a device driver.
With fuse io_uring communication, the open fd doesn't even need to do io.

> > protections because they tend to run in a private mount namespace with
> > various parts of the filesystem either hidden or readonly.
> >
> > In theory one could design a socket protocol to pass mount options,
> > block device paths, fds, and responsibility for the mount() call between
> > a mount helper and a service:
>
> This isn't a problem really. This should just be an extension to
> systemd-mountfsd.

This is relevant not only to systemd env.

I have been experimenting with this mount helper service to mount fuse fs
inside an unprivileged kubernetes container, where opening of /dev/fuse
is restricted by LSM policy:

https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach
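
For readers unfamiliar with the mechanism, handing an already-open fd (such as
/dev/fuse or a block device) from a privileged helper to a confined server is
ordinary SCM_RIGHTS passing over a Unix socket.  A minimal sketch follows; it
is not taken from the plugin above, and send_fd(), recv_fd() and
demo_roundtrip() are hypothetical names.  The demo passes a pipe fd instead of
/dev/fuse so that it runs unprivileged.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one fd. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send one file descriptor over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Pass a pipe's write end across a socketpair and write through the copy. */
static int demo_roundtrip(void)
{
	int sv[2], p[2], fd;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) || pipe(p))
		return -1;
	if (send_fd(sv[0], p[1]))
		return -1;
	fd = recv_fd(sv[1]);
	if (fd < 0)
		return -1;
	if (write(fd, "x", 1) != 1 || read(p[0], &c, 1) != 1)
		return -1;
	close(fd); close(p[0]); close(p[1]); close(sv[0]); close(sv[1]);
	return c == 'x' ? 0 : -1;
}
```

The receiver ends up with its own descriptor for the same open file, so the
helper can close its copy once the handoff is done.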

Thanks,
Amir.


* Re: [PATCH 3/4] libfuse: add statx support to the lower level library
  2025-07-17 23:39   ` [PATCH 3/4] libfuse: add statx support to the lower level library Darrick J. Wong
@ 2025-07-18 13:28     ` Amir Goldstein
  2025-07-18 15:58       ` Darrick J. Wong
  2025-07-18 16:27       ` Darrick J. Wong
  0 siblings, 2 replies; 174+ messages in thread
From: Amir Goldstein @ 2025-07-18 13:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: bschubert, John, joannelkoong, linux-fsdevel, bernd, neal, miklos

On Fri, Jul 18, 2025 at 1:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Add statx support to the lower level fuse library.

This looked familiar.
Merged 3 days ago:
https://github.com/libfuse/libfuse/pull/1026

>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  include/fuse_lowlevel.h |   37 ++++++++++++++++++
>  lib/fuse_lowlevel.c     |   97 +++++++++++++++++++++++++++++++++++++++++++++++
>  lib/fuse_versionscript  |    2 +
>  3 files changed, 136 insertions(+)
>
>
> diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
> index 77685e433e4f7d..f4d62cee22870a 100644
> --- a/include/fuse_lowlevel.h
> +++ b/include/fuse_lowlevel.h
> @@ -1416,6 +1416,26 @@ struct fuse_lowlevel_ops {
>          * @param ino the inode number
>          */
>         void (*syncfs) (fuse_req_t req, fuse_ino_t ino);
> +
> +       /**
> +        * Fetch extended stat information about a file
> +        *
> +        * If this request is answered with an error code of ENOSYS, this is
> +        * treated as a permanent failure, i.e. all future statx() requests
> +        * will fail with the same error code without being sent to the
> +        * filesystem process.
> +        *
> +        * Valid replies:
> +        *   fuse_reply_statx
> +        *   fuse_reply_err
> +        *
> +        * @param req request handle
> +        * @param statx_flags AT_STATX_* flags
> +        * @param statx_mask desired STATX_* attribute mask
> +        * @param fi file information
> +        */
> +       void (*statx) (fuse_req_t req, fuse_ino_t ino, uint32_t statx_flags,
> +                      uint32_t statx_mask, struct fuse_file_info *fi);
>  #endif /* FUSE_USE_VERSION >= 318 */
>  };
>
> @@ -1897,6 +1917,23 @@ int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_iomap *read_iomap,
>   * @return zero for success, -errno for failure to send reply
>   */
>  int fuse_reply_iomap_config(fuse_req_t req, const struct fuse_iomap_config *cfg);
> +
> +struct statx;
> +
> +/**
> + * Reply with statx attributes
> + *
> + * Possible requests:
> + *   statx
> + *
> + * @param req request handle
> + * @param statx the attributes
> + * @param size the size of the statx structure
> + * @param attr_timeout validity timeout (in seconds) for the attributes
> + * @return zero for success, -errno for failure to send reply
> + */
> +int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
> +                    double attr_timeout);
>  #endif /* FUSE_USE_VERSION >= 318 */
>
>  /* ----------------------------------------------------------- *
> diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> index ec30ebc4cdd074..8eeb6a8547da91 100644
> --- a/lib/fuse_lowlevel.c
> +++ b/lib/fuse_lowlevel.c
> @@ -144,6 +144,43 @@ static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
>         ST_CTIM_NSEC_SET(stbuf, attr->ctimensec);
>  }
>
> +#ifdef STATX_BASIC_STATS
> +static int convert_statx(struct fuse_statx *stbuf, const struct statx *stx,
> +                        size_t size)
> +{
> +       if (sizeof(struct statx) != size)
> +               return EOPNOTSUPP;
> +
> +       stbuf->mask = stx->stx_mask & (STATX_BASIC_STATS | STATX_BTIME);
> +       stbuf->blksize          = stx->stx_blksize;
> +       stbuf->attributes       = stx->stx_attributes;
> +       stbuf->nlink            = stx->stx_nlink;
> +       stbuf->uid              = stx->stx_uid;
> +       stbuf->gid              = stx->stx_gid;
> +       stbuf->mode             = stx->stx_mode;
> +       stbuf->ino              = stx->stx_ino;
> +       stbuf->size             = stx->stx_size;
> +       stbuf->blocks           = stx->stx_blocks;
> +       stbuf->attributes_mask  = stx->stx_attributes_mask;
> +       stbuf->rdev_major       = stx->stx_rdev_major;
> +       stbuf->rdev_minor       = stx->stx_rdev_minor;
> +       stbuf->dev_major        = stx->stx_dev_major;
> +       stbuf->dev_minor        = stx->stx_dev_minor;
> +
> +       stbuf->atime.tv_sec     = stx->stx_atime.tv_sec;
> +       stbuf->btime.tv_sec     = stx->stx_btime.tv_sec;
> +       stbuf->ctime.tv_sec     = stx->stx_ctime.tv_sec;
> +       stbuf->mtime.tv_sec     = stx->stx_mtime.tv_sec;
> +
> +       stbuf->atime.tv_nsec    = stx->stx_atime.tv_nsec;
> +       stbuf->btime.tv_nsec    = stx->stx_btime.tv_nsec;
> +       stbuf->ctime.tv_nsec    = stx->stx_ctime.tv_nsec;
> +       stbuf->mtime.tv_nsec    = stx->stx_mtime.tv_nsec;
> +
> +       return 0;
> +}
> +#endif
> +

Why is this conversion not needed in the merged version?
What am I missing?

Thanks,
Amir.

>  static size_t iov_length(const struct iovec *iov, size_t count)
>  {
>         size_t seg;
> @@ -2653,6 +2690,64 @@ static void do_syncfs(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg
>         _do_syncfs(req, nodeid, inarg, NULL);
>  }
>
> +#ifdef STATX_BASIC_STATS
> +int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
> +                    double attr_timeout)
> +{
> +       struct fuse_statx_out arg = {
> +               .attr_valid = calc_timeout_sec(attr_timeout),
> +               .attr_valid_nsec = calc_timeout_nsec(attr_timeout),
> +       };
> +
> +       int err = convert_statx(&arg.stat, statx, size);
> +       if (err) {
> +               fuse_reply_err(req, err);
> +               return err;
> +       }
> +
> +       return send_reply_ok(req, &arg, sizeof(arg));
> +}
> +
> +static void _do_statx(fuse_req_t req, const fuse_ino_t nodeid,
> +                     const void *op_in, const void *in_payload)
> +{
> +       (void)in_payload;
> +       const struct fuse_statx_in *arg = op_in;
> +       struct fuse_file_info *fip = NULL;
> +       struct fuse_file_info fi;
> +
> +       if (arg->getattr_flags & FUSE_GETATTR_FH) {
> +               memset(&fi, 0, sizeof(fi));
> +               fi.fh = arg->fh;
> +               fip = &fi;
> +       }
> +
> +       if (req->se->op.statx)
> +               req->se->op.statx(req, nodeid, arg->sx_flags, arg->sx_mask,
> +                                 fip);
> +       else
> +               fuse_reply_err(req, ENOSYS);
> +}
> +#else
> +int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
> +                    double attr_timeout)
> +{
> +       fuse_reply_err(req, ENOSYS);
> +       return -ENOSYS;
> +}
> +
> +static void _do_statx(fuse_req_t req, const fuse_ino_t nodeid,
> +                     const void *op_in, const void *in_payload)
> +{
> +       fuse_reply_err(req, ENOSYS);
> +}
> +#endif /* STATX_BASIC_STATS */
> +
> +static void do_statx(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg)
> +{
> +       _do_statx(req, nodeid, inarg, NULL);
> +}
> +
>  static bool want_flags_valid(uint64_t capable, uint64_t want)
>  {
>         uint64_t unknown_flags = want & (~capable);
> @@ -3627,6 +3722,7 @@ static struct {
>         [FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
>         [FUSE_LSEEK]       = { do_lseek,       "LSEEK"       },
>         [FUSE_SYNCFS]      = { do_syncfs,       "SYNCFS"     },
> +       [FUSE_STATX]       = { do_statx,       "STATX"       },
>         [FUSE_IOMAP_CONFIG]= { do_iomap_config, "IOMAP_CONFIG" },
>         [FUSE_IOMAP_BEGIN] = { do_iomap_begin,  "IOMAP_BEGIN" },
>         [FUSE_IOMAP_END]   = { do_iomap_end,    "IOMAP_END" },
> @@ -3686,6 +3782,7 @@ static struct {
>         [FUSE_COPY_FILE_RANGE]  = { _do_copy_file_range, "COPY_FILE_RANGE" },
>         [FUSE_LSEEK]            = { _do_lseek,          "LSEEK" },
>         [FUSE_SYNCFS]           = { _do_syncfs,         "SYNCFS" },
> +       [FUSE_STATX]            = { _do_statx,          "STATX" },
>         [FUSE_IOMAP_CONFIG]     = { _do_iomap_config,   "IOMAP_CONFIG" },
>         [FUSE_IOMAP_BEGIN]      = { _do_iomap_begin,    "IOMAP_BEGIN" },
>         [FUSE_IOMAP_END]        = { _do_iomap_end,      "IOMAP_END" },
> diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
> index dc9fa2428b5325..a67b1802770335 100644
> --- a/lib/fuse_versionscript
> +++ b/lib/fuse_versionscript
> @@ -223,6 +223,8 @@ FUSE_3.18 {
>                 fuse_reply_iomap_config;
>                 fuse_lowlevel_notify_iomap_upsert;
>                 fuse_lowlevel_notify_iomap_inval;
> +
> +               fuse_reply_statx;
>  } FUSE_3.17;
>
>  # Local Variables:
>
>


* Re: [PATCH 07/14] libfuse: add a reply function to send FUSE_ATTR_* to the kernel
  2025-07-17 23:36   ` [PATCH 07/14] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong
@ 2025-07-18 14:10     ` Amir Goldstein
  2025-07-18 15:48       ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Amir Goldstein @ 2025-07-18 14:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: bschubert, John, joannelkoong, linux-fsdevel, bernd, neal, miklos

On Fri, Jul 18, 2025 at 1:36 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Create new fuse_reply_{attr,create,entry}_iflags functions so that we
> can send FUSE_ATTR_* flags to the kernel when instantiating an inode.
> Servers are expected to send FUSE_IFLAG_* values, which will be
> translated into what the kernel can understand.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  include/fuse_common.h   |    3 ++
>  include/fuse_lowlevel.h |   87 +++++++++++++++++++++++++++++++++++++++++++++--
>  lib/fuse_lowlevel.c     |   69 ++++++++++++++++++++++++++++++-------
>  lib/fuse_versionscript  |    4 ++
>  4 files changed, 146 insertions(+), 17 deletions(-)
>
>
> diff --git a/include/fuse_common.h b/include/fuse_common.h
> index 66c25afe15ec76..11eb22d011896c 100644
> --- a/include/fuse_common.h
> +++ b/include/fuse_common.h
> @@ -1210,6 +1210,9 @@ struct fuse_iomap {
>  /* is append ioend */
>  #define FUSE_IOMAP_IOEND_APPEND                (1U << 15)
>
> +/* enable fsdax */
> +#define FUSE_IFLAG_DAX                 (1U << 0)
> +
>  #endif /* FUSE_USE_VERSION >= 318 */
>
>  /* ----------------------------------------------------------- *
> diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
> index 1b856431de0a60..07748abcf079cf 100644
> --- a/include/fuse_lowlevel.h
> +++ b/include/fuse_lowlevel.h
> @@ -240,6 +240,7 @@ struct fuse_lowlevel_ops {
>          *
>          * Valid replies:
>          *   fuse_reply_entry
> +        *   fuse_reply_entry_iflags
>          *   fuse_reply_err
>          *
>          * @param req request handle
> @@ -299,6 +300,7 @@ struct fuse_lowlevel_ops {
>          *
>          * Valid replies:
>          *   fuse_reply_attr
> +        *   fuse_reply_attr_iflags
>          *   fuse_reply_err
>          *
>          * @param req request handle
> @@ -334,6 +336,7 @@ struct fuse_lowlevel_ops {
>          *
>          * Valid replies:
>          *   fuse_reply_attr
> +        *   fuse_reply_attr_iflags
>          *   fuse_reply_err
>          *
>          * @param req request handle
> @@ -364,7 +367,7 @@ struct fuse_lowlevel_ops {
>          * socket node.
>          *
>          * Valid replies:
> -        *   fuse_reply_entry
> +        *   fuse_reply_entry_iflags
>          *   fuse_reply_err
>          *
>          * @param req request handle
> @@ -380,7 +383,7 @@ struct fuse_lowlevel_ops {
>          * Create a directory
>          *
>          * Valid replies:
> -        *   fuse_reply_entry
> +        *   fuse_reply_entry_iflags
>          *   fuse_reply_err
>          *
>          * @param req request handle
> @@ -429,7 +432,7 @@ struct fuse_lowlevel_ops {
>          * Create a symbolic link
>          *
>          * Valid replies:
> -        *   fuse_reply_entry
> +        *   fuse_reply_entry_iflags
>          *   fuse_reply_err
>          *
>          * @param req request handle
> @@ -477,7 +480,7 @@ struct fuse_lowlevel_ops {
>          * Create a hard link
>          *
>          * Valid replies:
> -        *   fuse_reply_entry
> +        *   fuse_reply_entry_iflags
>          *   fuse_reply_err
>          *
>          * @param req request handle
> @@ -969,6 +972,7 @@ struct fuse_lowlevel_ops {
>          *
>          * Valid replies:
>          *   fuse_reply_create
> +        *   fuse_reply_create_iflags
>          *   fuse_reply_err
>          *
>          * @param req request handle
> @@ -1315,6 +1319,7 @@ struct fuse_lowlevel_ops {
>          *
>          * Valid replies:
>          *   fuse_reply_create
> +        *   fuse_reply_create_iflags
>          *   fuse_reply_err
>          *
>          * @param req request handle
> @@ -1435,6 +1440,23 @@ void fuse_reply_none(fuse_req_t req);
>   */
>  int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e);
>
> +/**
> + * Reply with a directory entry and FUSE_IFLAG_*
> + *
> + * Possible requests:
> + *   lookup, mknod, mkdir, symlink, link
> + *
> + * Side effects:
> + *   increments the lookup count on success
> + *
> + * @param req request handle
> + * @param e the entry parameters
> + * @param iflags       FUSE_IFLAG_*
> + * @return zero for success, -errno for failure to send reply
> + */
> +int fuse_reply_entry_iflags(fuse_req_t req, const struct fuse_entry_param *e,
> +                           unsigned int iflags);
> +
>  /**
>   * Reply with a directory entry and open parameters
>   *
> @@ -1456,6 +1478,29 @@ int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e);
>  int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
>                       const struct fuse_file_info *fi);
>
> +/**
> + * Reply with a directory entry, open parameters and FUSE_IFLAG_*
> + *
> + * currently the following members of 'fi' are used:
> + *   fh, direct_io, keep_cache, cache_readdir, nonseekable, noflush,
> + *   parallel_direct_writes
> + *
> + * Possible requests:
> + *   create
> + *
> + * Side effects:
> + *   increments the lookup count on success
> + *
> + * @param req request handle
> + * @param e the entry parameters
> + * @param iflags       FUSE_IFLAG_*
> + * @param fi file information
> + * @return zero for success, -errno for failure to send reply
> + */
> +int fuse_reply_create_iflags(fuse_req_t req, const struct fuse_entry_param *e,
> +                            unsigned int iflags,
> +                            const struct fuse_file_info *fi);
> +
>  /**
>   * Reply with attributes
>   *
> @@ -1470,6 +1515,21 @@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
>  int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
>                     double attr_timeout);
>
> +/**
> + * Reply with attributes and FUSE_IFLAG_* flags
> + *
> + * Possible requests:
> + *   getattr, setattr
> + *
> + * @param req request handle
> + * @param attr the attributes
> + * @param attr_timeout validity timeout (in seconds) for the attributes
> + * @param iflags       set of FUSE_IFLAG_* flags
> + * @return zero for success, -errno for failure to send reply
> + */
> +int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr,
> +                          unsigned int iflags, double attr_timeout);
> +
>  /**
>   * Reply with the contents of a symbolic link
>   *
> @@ -1697,6 +1757,25 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
>                               const char *name,
>                               const struct fuse_entry_param *e, off_t off);
>
> +/**
> + * Add a directory entry and FUSE_IFLAG_* to the buffer with the attributes
> + *
> + * See documentation of `fuse_add_direntry_plus()` for more details.
> + *
> + * @param req request handle
> + * @param buf the point where the new entry will be added to the buffer
> + * @param bufsize remaining size of the buffer
> + * @param name the name of the entry
> + * @param iflags       FUSE_IFLAG_*
> + * @param e the directory entry
> + * @param off the offset of the next entry
> + * @return the space needed for the entry
> + */
> +size_t fuse_add_direntry_plus_iflags(fuse_req_t req, char *buf, size_t bufsize,
> +                                    const char *name, unsigned int iflags,
> +                                    const struct fuse_entry_param *e,
> +                                    off_t off);
> +
>  /**
>   * Reply to ask for data fetch and output buffer preparation.  ioctl
>   * will be retried with the specified input data fetched and output
> diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> index d26043fa54c036..568db13502a7d7 100644
> --- a/lib/fuse_lowlevel.c
> +++ b/lib/fuse_lowlevel.c
> @@ -102,7 +102,8 @@ static void trace_request_reply(uint64_t unique, unsigned int len,
>  }
>  #endif
>
> -static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr)
> +static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
> +                        unsigned int iflags)
>  {
>         attr->ino       = stbuf->st_ino;
>         attr->mode      = stbuf->st_mode;
> @@ -119,6 +120,10 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr)
>         attr->atimensec = ST_ATIM_NSEC(stbuf);
>         attr->mtimensec = ST_MTIM_NSEC(stbuf);
>         attr->ctimensec = ST_CTIM_NSEC(stbuf);
> +
> +       attr->flags     = 0;
> +       if (iflags & FUSE_IFLAG_DAX)
> +               attr->flags |= FUSE_ATTR_DAX;
>  }
>
>  static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
> @@ -438,7 +443,8 @@ static unsigned int calc_timeout_nsec(double t)
>  }
>
>  static void fill_entry(struct fuse_entry_out *arg,
> -                      const struct fuse_entry_param *e)
> +                      const struct fuse_entry_param *e,
> +                      unsigned int iflags)
>  {
>         arg->nodeid = e->ino;
>         arg->generation = e->generation;
> @@ -446,14 +452,15 @@ static void fill_entry(struct fuse_entry_out *arg,
>         arg->entry_valid_nsec = calc_timeout_nsec(e->entry_timeout);
>         arg->attr_valid = calc_timeout_sec(e->attr_timeout);
>         arg->attr_valid_nsec = calc_timeout_nsec(e->attr_timeout);
> -       convert_stat(&e->attr, &arg->attr);
> +       convert_stat(&e->attr, &arg->attr, iflags);
>  }
>
>  /* `buf` is allowed to be empty so that the proper size may be
>     allocated by the caller */
> -size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
> -                             const char *name,
> -                             const struct fuse_entry_param *e, off_t off)
> +size_t fuse_add_direntry_plus_iflags(fuse_req_t req, char *buf, size_t bufsize,
> +                                    const char *name, unsigned int iflags,
> +                                    const struct fuse_entry_param *e,
> +                                    off_t off)
>  {
>         (void)req;
>         size_t namelen;
> @@ -468,7 +475,7 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
>
>         struct fuse_direntplus *dp = (struct fuse_direntplus *) buf;
>         memset(&dp->entry_out, 0, sizeof(dp->entry_out));
> -       fill_entry(&dp->entry_out, e);
> +       fill_entry(&dp->entry_out, e, iflags);
>
>         struct fuse_dirent *dirent = &dp->dirent;
>         dirent->ino = e->attr.st_ino;
> @@ -481,6 +488,14 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
>         return entlen_padded;
>  }
>
> +size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
> +                             const char *name,
> +                             const struct fuse_entry_param *e, off_t off)
> +{
> +       return fuse_add_direntry_plus_iflags(req, buf, bufsize, name, 0, e,
> +                                            off);
> +}
> +
>  static void fill_open(struct fuse_open_out *arg,
>                       const struct fuse_file_info *f)
>  {
> @@ -503,7 +518,8 @@ static void fill_open(struct fuse_open_out *arg,
>                 arg->open_flags |= FOPEN_PARALLEL_DIRECT_WRITES;
>  }
>
> -int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e)
> +int fuse_reply_entry_iflags(fuse_req_t req, const struct fuse_entry_param *e,
> +                           unsigned int iflags)
>  {
>         struct fuse_entry_out arg;
>         size_t size = req->se->conn.proto_minor < 9 ?
> @@ -515,12 +531,18 @@ int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e)
>                 return fuse_reply_err(req, ENOENT);
>
>         memset(&arg, 0, sizeof(arg));
> -       fill_entry(&arg, e);
> +       fill_entry(&arg, e, iflags);
>         return send_reply_ok(req, &arg, size);
>  }
>
> -int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
> -                     const struct fuse_file_info *f)
> +int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e)
> +{
> +       return fuse_reply_entry_iflags(req, e, 0);
> +}
> +
> +int fuse_reply_create_iflags(fuse_req_t req, const struct fuse_entry_param *e,
> +                            unsigned int iflags,
> +                            const struct fuse_file_info *f)
>  {
>         alignas(uint64_t) char buf[sizeof(struct fuse_entry_out) + sizeof(struct fuse_open_out)];
>         size_t entrysize = req->se->conn.proto_minor < 9 ?
> @@ -529,12 +551,18 @@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
>         struct fuse_open_out *oarg = (struct fuse_open_out *) (buf + entrysize);
>
>         memset(buf, 0, sizeof(buf));
> -       fill_entry(earg, e);
> +       fill_entry(earg, e, iflags);
>         fill_open(oarg, f);
>         return send_reply_ok(req, buf,
>                              entrysize + sizeof(struct fuse_open_out));
>  }
>
> +int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
> +                     const struct fuse_file_info *f)
> +{
> +       return fuse_reply_create_iflags(req, e, 0, f);
> +}
> +
>  int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
>                     double attr_timeout)
>  {
> @@ -545,7 +573,22 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
>         memset(&arg, 0, sizeof(arg));
>         arg.attr_valid = calc_timeout_sec(attr_timeout);
>         arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> -       convert_stat(attr, &arg.attr);
> +       convert_stat(attr, &arg.attr, 0);
> +
> +       return send_reply_ok(req, &arg, size);
> +}
> +
> +int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr,
> +                          unsigned int iflags, double attr_timeout)
> +{
> +       struct fuse_attr_out arg;
> +       size_t size = req->se->conn.proto_minor < 9 ?
> +               FUSE_COMPAT_ATTR_OUT_SIZE : sizeof(arg);
> +
> +       memset(&arg, 0, sizeof(arg));
> +       arg.attr_valid = calc_timeout_sec(attr_timeout);
> +       arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> +       convert_stat(attr, &arg.attr, iflags);
>
>         return send_reply_ok(req, &arg, size);
>  }

I wonder why fuse_reply_attr() is not implemented as a wrapper to
fuse_reply_attr_iflags()?
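
To make the question concrete, here is a minimal sketch of the wrapper
arrangement, assuming stand-in types and names (everything prefixed
sketch_/SKETCH_ is hypothetical and is not the libfuse API; the flag values
are placeholders). It only illustrates the shape: one implementation that
translates iflags, and a legacy entry point that calls it with zero.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in values for this sketch only; the real definitions live in the
 * libfuse and kernel headers. */
#define SKETCH_IFLAG_DAX	(1U << 0)
#define SKETCH_ATTR_DAX		(1U << 1)

/* Hypothetical stand-in for the reply payload (struct fuse_attr_out in
 * the patch). */
struct sketch_attr_out {
	uint64_t ino;
	uint32_t flags;
};

/* Single implementation: fill the reply and translate the server-facing
 * iflags into the wire-format attribute flags. */
static int sketch_reply_attr_iflags(struct sketch_attr_out *out, uint64_t ino,
				    unsigned int iflags)
{
	memset(out, 0, sizeof(*out));
	out->ino = ino;
	if (iflags & SKETCH_IFLAG_DAX)
		out->flags |= SKETCH_ATTR_DAX;
	return 0;
}

/* Legacy entry point reduced to a thin wrapper that passes no iflags. */
static int sketch_reply_attr(struct sketch_attr_out *out, uint64_t ino)
{
	return sketch_reply_attr_iflags(out, ino, 0);
}
```

This mirrors how the same patch already treats fuse_reply_entry() and
fuse_reply_create(), which forward to their _iflags variants with iflags = 0.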

FWIW, the flags field was added in minor version 23 for
FUSE_ATTR_SUBMOUNT, but I guess that doesn't matter here.

Thanks,
Amir.


* Re: [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags
  2025-07-17 23:36   ` [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong
@ 2025-07-18 14:27     ` Amir Goldstein
  2025-07-18 15:55       ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Amir Goldstein @ 2025-07-18 14:27 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: bschubert, John, joannelkoong, linux-fsdevel, bernd, neal, miklos

On Fri, Jul 18, 2025 at 1:36 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Create a new ->getattr_iflags function so that iomap filesystems can set
> the appropriate in-kernel inode flags on instantiation.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  include/fuse.h |    7 ++
>  lib/fuse.c     |  219 ++++++++++++++++++++++++++++++++++++++++++++------------
>  2 files changed, 180 insertions(+), 46 deletions(-)
>
>
> diff --git a/include/fuse.h b/include/fuse.h
> index e2e7c950bf144d..f894dd5da0d106 100644
> --- a/include/fuse.h
> +++ b/include/fuse.h
> @@ -876,6 +876,13 @@ struct fuse_operations {
>                             uint64_t attr_ino, off_t pos_in, size_t written_in,
>                             uint32_t ioendflags_in, int error_in,
>                             uint64_t new_addr_in);
> +
> +       /**
> +        * Get file attributes and FUSE_IFLAG_* flags.  Otherwise the same as
> +        * getattr.
> +        */
> +       int (*getattr_iflags) (const char *path, struct stat *buf,
> +                              unsigned int *iflags, struct fuse_file_info *fi);
>  #endif /* FUSE_USE_VERSION >= 318 */
>  };
>
> diff --git a/lib/fuse.c b/lib/fuse.c
> index 8dbf88877dd37c..685d0181e569d0 100644
> --- a/lib/fuse.c
> +++ b/lib/fuse.c
> @@ -123,6 +123,7 @@ struct fuse {
>         struct list_head partial_slabs;
>         struct list_head full_slabs;
>         pthread_t prune_thread;
> +       bool want_iflags;
>  };
>
>  struct lock {
> @@ -144,6 +145,7 @@ struct node {
>         char *name;
>         uint64_t nlookup;
>         int open_count;
> +       unsigned int iflags;
>         struct timespec stat_updated;
>         struct timespec mtime;
>         off_t size;
> @@ -1605,6 +1607,24 @@ int fuse_fs_getattr(struct fuse_fs *fs, const char *path, struct stat *buf,
>         return fs->op.getattr(path, buf, fi);
>  }
>
> +static int fuse_fs_getattr_iflags(struct fuse_fs *fs, const char *path,
> +                                 struct stat *buf, unsigned int *iflags,
> +                                 struct fuse_file_info *fi)
> +{
> +       fuse_get_context()->private_data = fs->user_data;
> +       if (!fs->op.getattr_iflags)
> +               return -ENOSYS;
> +
> +       if (fs->debug) {
> +               char buf[10];
> +
> +               fuse_log(FUSE_LOG_DEBUG, "getattr_iflags[%s] %s\n",
> +                       file_info_string(fi, buf, sizeof(buf)),
> +                       path);
> +       }
> +       return fs->op.getattr_iflags(path, buf, iflags, fi);
> +}
> +
>  int fuse_fs_rename(struct fuse_fs *fs, const char *oldpath,
>                    const char *newpath, unsigned int flags)
>  {
> @@ -2417,7 +2437,7 @@ static void update_stat(struct node *node, const struct stat *stbuf)
>  }
>
>  static int do_lookup(struct fuse *f, fuse_ino_t nodeid, const char *name,
> -                    struct fuse_entry_param *e)
> +                    struct fuse_entry_param *e, unsigned int *iflags)
>  {
>         struct node *node;
>
> @@ -2435,25 +2455,59 @@ static int do_lookup(struct fuse *f, fuse_ino_t nodeid, const char *name,
>                 pthread_mutex_unlock(&f->lock);
>         }
>         set_stat(f, e->ino, &e->attr);
> +       *iflags = node->iflags;
> +       return 0;
> +}
> +
> +static int lookup_and_update(struct fuse *f, fuse_ino_t nodeid,
> +                            const char *name, struct fuse_entry_param *e,
> +                            unsigned int iflags)
> +{
> +       struct node *node;
> +
> +       node = find_node(f, nodeid, name);
> +       if (node == NULL)
> +               return -ENOMEM;
> +
> +       e->ino = node->nodeid;
> +       e->generation = node->generation;
> +       e->entry_timeout = f->conf.entry_timeout;
> +       e->attr_timeout = f->conf.attr_timeout;
> +       if (f->conf.auto_cache) {
> +               pthread_mutex_lock(&f->lock);
> +               update_stat(node, &e->attr);
> +               pthread_mutex_unlock(&f->lock);
> +       }
> +       set_stat(f, e->ino, &e->attr);
> +       node->iflags = iflags;
>         return 0;
>  }
>
>  static int lookup_path(struct fuse *f, fuse_ino_t nodeid,
>                        const char *name, const char *path,
> -                      struct fuse_entry_param *e, struct fuse_file_info *fi)
> +                      struct fuse_entry_param *e, unsigned int *iflags,
> +                      struct fuse_file_info *fi)
>  {
>         int res;
>
>         memset(e, 0, sizeof(struct fuse_entry_param));
> -       res = fuse_fs_getattr(f->fs, path, &e->attr, fi);
> -       if (res == 0) {
> -               res = do_lookup(f, nodeid, name, e);
> -               if (res == 0 && f->conf.debug) {
> -                       fuse_log(FUSE_LOG_DEBUG, "   NODEID: %llu\n",
> -                               (unsigned long long) e->ino);
> -               }
> -       }
> -       return res;
> +       *iflags = 0;
> +       if (f->want_iflags)
> +               res = fuse_fs_getattr_iflags(f->fs, path, &e->attr, iflags, fi);
> +       else
> +               res = fuse_fs_getattr(f->fs, path, &e->attr, fi);
> +       if (res)
> +               return res;
> +
> +       res = lookup_and_update(f, nodeid, name, e, *iflags);
> +       if (res)
> +               return res;
> +
> +       if (f->conf.debug)
> +               fuse_log(FUSE_LOG_DEBUG, "   NODEID: %llu iflags 0x%x\n",
> +                       (unsigned long long) e->ino, *iflags);
> +
> +       return 0;
>  }
>
>  static struct fuse_context_i *fuse_get_context_internal(void)
> @@ -2537,11 +2591,17 @@ static inline void reply_err(fuse_req_t req, int err)
>  }
>
>  static void reply_entry(fuse_req_t req, const struct fuse_entry_param *e,
> -                       int err)
> +                       unsigned int iflags, int err)
>  {
>         if (!err) {
>                 struct fuse *f = req_fuse(req);
> -               if (fuse_reply_entry(req, e) == -ENOENT) {
> +               int entry_res;
> +
> +               if (f->want_iflags)
> +                       entry_res = fuse_reply_entry_iflags(req, e, iflags);
> +               else
> +                       entry_res = fuse_reply_entry(req, e);
> +               if (entry_res == -ENOENT) {
>                         /* Skip forget for negative result */
>                         if  (e->ino != 0)
>                                 forget_node(f, e->ino, 1);
> @@ -2582,6 +2642,9 @@ static void fuse_lib_init(void *data, struct fuse_conn_info *conn)
>                 /* Disable the receiving and processing of FUSE_INTERRUPT requests */
>                 conn->no_interrupt = 1;
>         }
> +
> +       if (fuse_get_feature_flag(conn, FUSE_CAP_IOMAP))
> +               f->want_iflags = true;
>  }
>
>  void fuse_fs_destroy(struct fuse_fs *fs)
> @@ -2605,6 +2668,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
>         struct fuse *f = req_fuse_prepare(req);
>         struct fuse_entry_param e;
>         char *path;
> +       unsigned int iflags = 0;
>         int err;
>         struct node *dot = NULL;
>
> @@ -2619,7 +2683,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
>                                 dot = get_node_nocheck(f, parent);
>                                 if (dot == NULL) {
>                                         pthread_mutex_unlock(&f->lock);
> -                                       reply_entry(req, &e, -ESTALE);
> +                                       reply_entry(req, &e, 0, -ESTALE);
>                                         return;
>                                 }
>                                 dot->refctr++;
> @@ -2639,7 +2703,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
>                 if (f->conf.debug)
>                         fuse_log(FUSE_LOG_DEBUG, "LOOKUP %s\n", path);
>                 fuse_prepare_interrupt(f, req, &d);
> -               err = lookup_path(f, parent, name, path, &e, NULL);
> +               err = lookup_path(f, parent, name, path, &e, &iflags, NULL);
>                 if (err == -ENOENT && f->conf.negative_timeout != 0.0) {
>                         e.ino = 0;
>                         e.entry_timeout = f->conf.negative_timeout;
> @@ -2653,7 +2717,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
>                 unref_node(f, dot);
>                 pthread_mutex_unlock(&f->lock);
>         }
> -       reply_entry(req, &e, err);
> +       reply_entry(req, &e, iflags, err);
>  }
>
>  static void do_forget(struct fuse *f, fuse_ino_t ino, uint64_t nlookup)
> @@ -2689,6 +2753,7 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino,
>         struct fuse *f = req_fuse_prepare(req);
>         struct stat buf;
>         char *path;
> +       unsigned int iflags = 0;
>         int err;
>
>         memset(&buf, 0, sizeof(buf));
> @@ -2700,7 +2765,11 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino,
>         if (!err) {
>                 struct fuse_intr_data d;
>                 fuse_prepare_interrupt(f, req, &d);
> -               err = fuse_fs_getattr(f->fs, path, &buf, fi);
> +               if (f->want_iflags)
> +                       err = fuse_fs_getattr_iflags(f->fs, path, &buf,
> +                                                    &iflags, fi);
> +               else
> +                       err = fuse_fs_getattr(f->fs, path, &buf, fi);
>                 fuse_finish_interrupt(f, req, &d);
>                 free_path(f, ino, path);
>         }
> @@ -2713,9 +2782,14 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino,
>                         buf.st_nlink--;
>                 if (f->conf.auto_cache)
>                         update_stat(node, &buf);
> +               node->iflags = iflags;
>                 pthread_mutex_unlock(&f->lock);
>                 set_stat(f, ino, &buf);
> -               fuse_reply_attr(req, &buf, f->conf.attr_timeout);
> +               if (f->want_iflags)
> +                       fuse_reply_attr_iflags(req, &buf, iflags,
> +                                              f->conf.attr_timeout);
> +               else
> +                       fuse_reply_attr(req, &buf, f->conf.attr_timeout);
>         } else
>                 reply_err(req, err);
>  }
> @@ -2802,6 +2876,7 @@ static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
>         struct fuse *f = req_fuse_prepare(req);
>         struct stat buf;
>         char *path;
> +       unsigned int iflags = 0;
>         int err;
>
>         memset(&buf, 0, sizeof(buf));
> @@ -2860,19 +2935,30 @@ static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
>                         err = fuse_fs_utimens(f->fs, path, tv, fi);
>                 }
>                 if (!err) {
> -                       err = fuse_fs_getattr(f->fs, path, &buf, fi);
> +                       if (f->want_iflags)
> +                               err = fuse_fs_getattr_iflags(f->fs, path, &buf,
> +                                                            &iflags, fi);
> +                       else
> +                               err = fuse_fs_getattr(f->fs, path, &buf, fi);
>                 }
>                 fuse_finish_interrupt(f, req, &d);
>                 free_path(f, ino, path);
>         }
>         if (!err) {
> -               if (f->conf.auto_cache) {
> -                       pthread_mutex_lock(&f->lock);
> -                       update_stat(get_node(f, ino), &buf);
> -                       pthread_mutex_unlock(&f->lock);
> -               }
> +               struct node *node;
> +
> +               pthread_mutex_lock(&f->lock);
> +               node = get_node(f, ino);
> +               if (f->conf.auto_cache)
> +                       update_stat(node, &buf);
> +               node->iflags = iflags;
> +               pthread_mutex_unlock(&f->lock);
>                 set_stat(f, ino, &buf);
> -               fuse_reply_attr(req, &buf, f->conf.attr_timeout);
> +               if (f->want_iflags)
> +                       fuse_reply_attr_iflags(req, &buf, iflags,
> +                                              f->conf.attr_timeout);
> +               else
> +                       fuse_reply_attr(req, &buf, f->conf.attr_timeout);
>         } else
>                 reply_err(req, err);
>  }
> @@ -2923,6 +3009,7 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name,
>         struct fuse *f = req_fuse_prepare(req);
>         struct fuse_entry_param e;
>         char *path;
> +       unsigned int iflags = 0;
>         int err;
>
>         err = get_path_name(f, parent, name, &path);
> @@ -2939,7 +3026,7 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name,
>                         err = fuse_fs_create(f->fs, path, mode, &fi);
>                         if (!err) {
>                                 err = lookup_path(f, parent, name, path, &e,
> -                                                 &fi);
> +                                                 &iflags, &fi);
>                                 fuse_fs_release(f->fs, path, &fi);
>                         }
>                 }
> @@ -2947,12 +3034,12 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name,
>                         err = fuse_fs_mknod(f->fs, path, mode, rdev);
>                         if (!err)
>                                 err = lookup_path(f, parent, name, path, &e,
> -                                                 NULL);
> +                                                 &iflags, NULL);
>                 }
>                 fuse_finish_interrupt(f, req, &d);
>                 free_path(f, parent, path);
>         }
> -       reply_entry(req, &e, err);
> +       reply_entry(req, &e, iflags, err);
>  }
>
>  static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name,
> @@ -2961,6 +3048,7 @@ static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name,
>         struct fuse *f = req_fuse_prepare(req);
>         struct fuse_entry_param e;
>         char *path;
> +       unsigned int iflags = 0;
>         int err;
>
>         err = get_path_name(f, parent, name, &path);
> @@ -2970,11 +3058,12 @@ static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name,
>                 fuse_prepare_interrupt(f, req, &d);
>                 err = fuse_fs_mkdir(f->fs, path, mode);
>                 if (!err)
> -                       err = lookup_path(f, parent, name, path, &e, NULL);
> +                       err = lookup_path(f, parent, name, path, &e, &iflags,
> +                                         NULL);
>                 fuse_finish_interrupt(f, req, &d);
>                 free_path(f, parent, path);
>         }
> -       reply_entry(req, &e, err);
> +       reply_entry(req, &e, iflags, err);
>  }
>
>  static void fuse_lib_unlink(fuse_req_t req, fuse_ino_t parent,
> @@ -3044,6 +3133,7 @@ static void fuse_lib_symlink(fuse_req_t req, const char *linkname,
>         struct fuse *f = req_fuse_prepare(req);
>         struct fuse_entry_param e;
>         char *path;
> +       unsigned int iflags = 0;
>         int err;
>
>         err = get_path_name(f, parent, name, &path);
> @@ -3053,11 +3143,12 @@ static void fuse_lib_symlink(fuse_req_t req, const char *linkname,
>                 fuse_prepare_interrupt(f, req, &d);
>                 err = fuse_fs_symlink(f->fs, linkname, path);
>                 if (!err)
> -                       err = lookup_path(f, parent, name, path, &e, NULL);
> +                       err = lookup_path(f, parent, name, path, &e, &iflags,
> +                                         NULL);
>                 fuse_finish_interrupt(f, req, &d);
>                 free_path(f, parent, path);
>         }
> -       reply_entry(req, &e, err);
> +       reply_entry(req, &e, iflags, err);
>  }
>
>  static void fuse_lib_rename(fuse_req_t req, fuse_ino_t olddir,
> @@ -3105,6 +3196,7 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent,
>         struct fuse_entry_param e;
>         char *oldpath;
>         char *newpath;
> +       unsigned int iflags = 0;
>         int err;
>
>         err = get_path2(f, ino, NULL, newparent, newname,
> @@ -3116,11 +3208,11 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent,
>                 err = fuse_fs_link(f->fs, oldpath, newpath);
>                 if (!err)
>                         err = lookup_path(f, newparent, newname, newpath,
> -                                         &e, NULL);
> +                                         &e, &iflags, NULL);
>                 fuse_finish_interrupt(f, req, &d);
>                 free_path2(f, ino, newparent, NULL, NULL, oldpath, newpath);
>         }
> -       reply_entry(req, &e, err);
> +       reply_entry(req, &e, iflags, err);
>  }
>
>  static void fuse_do_release(struct fuse *f, fuse_ino_t ino, const char *path,
> @@ -3163,6 +3255,7 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent,
>         struct fuse_intr_data d;
>         struct fuse_entry_param e;
>         char *path;
> +       unsigned int iflags;
>         int err;
>
>         err = get_path_name(f, parent, name, &path);
> @@ -3170,7 +3263,8 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent,
>                 fuse_prepare_interrupt(f, req, &d);
>                 err = fuse_fs_create(f->fs, path, mode, fi);
>                 if (!err) {
> -                       err = lookup_path(f, parent, name, path, &e, fi);
> +                       err = lookup_path(f, parent, name, path, &e,
> +                                         &iflags, fi);
>                         if (err)
>                                 fuse_fs_release(f->fs, path, fi);
>                         else if (!S_ISREG(e.attr.st_mode)) {
> @@ -3190,10 +3284,18 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent,
>                 fuse_finish_interrupt(f, req, &d);
>         }
>         if (!err) {
> +               int create_res;
> +
>                 pthread_mutex_lock(&f->lock);
>                 get_node(f, e.ino)->open_count++;
>                 pthread_mutex_unlock(&f->lock);
> -               if (fuse_reply_create(req, &e, fi) == -ENOENT) {
> +
> +               if (f->want_iflags)
> +                       create_res = fuse_reply_create_iflags(req, &e, iflags,
> +                                                             fi);
> +               else
> +                       create_res = fuse_reply_create(req, &e, fi);
> +               if (create_res == -ENOENT) {
>                         /* The open syscall was interrupted, so it
>                            must be cancelled */
>                         fuse_do_release(f, e.ino, path, fi);
> @@ -3227,13 +3329,21 @@ static void open_auto_cache(struct fuse *f, fuse_ino_t ino, const char *path,
>                 if (diff_timespec(&now, &node->stat_updated) >
>                     f->conf.ac_attr_timeout) {
>                         struct stat stbuf;
> +                       unsigned int iflags = 0;
>                         int err;
> +
>                         pthread_mutex_unlock(&f->lock);
> -                       err = fuse_fs_getattr(f->fs, path, &stbuf, fi);
> +                       if (f->want_iflags)
> +                               err = fuse_fs_getattr_iflags(f->fs, path,
> +                                                            &stbuf, &iflags,
> +                                                            fi);
> +                       else
> +                               err = fuse_fs_getattr(f->fs, path, &stbuf, fi);
>                         pthread_mutex_lock(&f->lock);
> -                       if (!err)
> +                       if (!err) {
>                                 update_stat(node, &stbuf);
> -                       else
> +                               node->iflags = iflags;
> +                       } else
>                                 node->cache_valid = 0;
>                 }
>         }
> @@ -3562,6 +3672,7 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
>                 .ino = 0,
>         };
>         struct fuse *f = dh->fuse;
> +       unsigned int iflags = 0;
>         int res;
>
>         if ((flags & ~FUSE_FILL_DIR_PLUS) != 0) {
> @@ -3586,6 +3697,7 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
>
>         if (off) {
>                 size_t newlen;
> +               size_t thislen;
>
>                 if (dh->filled) {
>                         dh->error = -EIO;
> @@ -3601,7 +3713,8 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
>
>                 if (statp && (flags & FUSE_FILL_DIR_PLUS)) {
>                         if (!is_dot_or_dotdot(name)) {
> -                               res = do_lookup(f, dh->nodeid, name, &e);
> +                               res = do_lookup(f, dh->nodeid, name, &e,
> +                                               &iflags);
>                                 if (res) {
>                                         dh->error = res;
>                                         return 1;
> @@ -3609,10 +3722,17 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
>                         }
>                 }
>
> -               newlen = dh->len +
> -                       fuse_add_direntry_plus(dh->req, dh->contents + dh->len,
> -                                              dh->needlen - dh->len, name,
> -                                              &e, off);
> +               if (f->want_iflags)
> +                       thislen = fuse_add_direntry_plus_iflags(dh->req,
> +                                       dh->contents + dh->len,
> +                                       dh->needlen - dh->len, name, iflags,
> +                                       &e, off);
> +               else
> +                       thislen = fuse_add_direntry_plus(dh->req,
> +                                       dh->contents + dh->len,
> +                                       dh->needlen - dh->len, name, &e, off);
> +               newlen = dh->len + thislen;
> +
>                 if (newlen > dh->needlen)
>                         return 1;
>                 dh->len = newlen;
> @@ -3679,6 +3799,7 @@ static int readdir_fill(struct fuse *f, fuse_req_t req, fuse_ino_t ino,
>  static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
>                                   off_t off, enum fuse_readdir_flags flags)
>  {
> +       struct fuse *f = req_fuse_prepare(req);
>         off_t pos;
>         struct fuse_direntry *de = dh->first;
>         int res;
> @@ -3699,6 +3820,7 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
>                 unsigned rem = dh->needlen - dh->len;
>                 unsigned thislen;
>                 unsigned newlen;
> +               unsigned int iflags = 0;
>                 pos++;
>
>                 if (flags & FUSE_READDIR_PLUS) {
> @@ -3710,14 +3832,19 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
>                         if (de->flags & FUSE_FILL_DIR_PLUS &&
>                             !is_dot_or_dotdot(de->name)) {
>                                 res = do_lookup(dh->fuse, dh->nodeid,
> -                                               de->name, &e);
> +                                               de->name, &e, &iflags);
>                                 if (res) {
>                                         dh->error = res;
>                                         return 1;
>                                 }
>                         }
>
> -                       thislen = fuse_add_direntry_plus(req, p, rem,
> +                       if (f->want_iflags)
> +                               thislen = fuse_add_direntry_plus_iflags(req, p,
> +                                                        rem, de->name, iflags,
> +                                                        &e, pos);
> +                       else
> +                               thislen = fuse_add_direntry_plus(req, p, rem,
>                                                          de->name, &e, pos);


All those conditional statements look pretty moot.
Can't we just force iflags to 0 if (!f->want_iflags)
and always call the *_iflags functions?

Thanks,
Amir.


* Re: [PATCH 06/13] fuse: implement buffered IO with iomap
  2025-07-17 23:29   ` [PATCH 06/13] fuse: implement buffered " Darrick J. Wong
@ 2025-07-18 15:10     ` Amir Goldstein
  2025-07-18 18:01       ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Amir Goldstein @ 2025-07-18 15:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

On Fri, Jul 18, 2025 at 1:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Implement pagecache IO with iomap, complete with hooks into truncate and
> fallocate so that the fuse server needn't implement disk block zeroing
> of post-EOF and unaligned punch/zero regions.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/fuse/fuse_i.h          |   46 +++
>  fs/fuse/fuse_trace.h      |  391 ++++++++++++++++++++++++
>  include/uapi/linux/fuse.h |    5
>  fs/fuse/dir.c             |   23 +
>  fs/fuse/file.c            |   90 +++++-
>  fs/fuse/file_iomap.c      |  723 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/fuse/inode.c           |   14 +
>  7 files changed, 1268 insertions(+), 24 deletions(-)
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 67e428da4391aa..f33b348d296d5e 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -161,6 +161,13 @@ struct fuse_inode {
>
>                         /* waitq for direct-io completion */
>                         wait_queue_head_t direct_io_waitq;
> +
> +#ifdef CONFIG_FUSE_IOMAP
> +                       /* pending io completions */
> +                       spinlock_t ioend_lock;
> +                       struct work_struct ioend_work;
> +                       struct list_head ioend_list;
> +#endif
>                 };

This union member you are changing is declared for
/* read/write io cache (regular file only) */
but actually it is also for parallel dio and passthrough mode

IIUC, there should be zero intersection between these io modes and
 /* iomap cached fileio (regular file only) */

Right?

So it can use its own union member without increasing fuse_inode size.
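
A minimal sketch of the layout argument, assuming simplified stand-in
structs (every demo_* name and placeholder field type below is hypothetical
and does not match the real struct fuse_inode). Union members overlap, so a
second member for iomap state costs no extra space as long as it is no
larger than the existing cached-io member:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the existing cached-io union member. */
struct demo_cached_io {
	void *write_files[2];
	void *queued_writes[2];
	int writectr;
	int iocachectr;
	void *page_waitq[3];
	void *direct_io_waitq[3];
};

/* Hypothetical stand-in for a separate iomap member holding ioend state. */
struct demo_iomap_io {
	int ioend_lock;
	void *ioend_work[4];
	void *ioend_list[2];
};

struct demo_inode {
	unsigned long nodeid;
	union {
		/* read/write io cache, parallel dio, passthrough */
		struct demo_cached_io cached;
		/* iomap cached fileio (regular file only) */
		struct demo_iomap_io iomap;
	};
};

/* Fail the build if the iomap member would make the union grow. */
static_assert(sizeof(struct demo_iomap_io) <= sizeof(struct demo_cached_io),
	      "iomap state must fit in the space cached-io already uses");
```

The static_assert is one possible way to keep the "no size increase"
property from silently regressing; it only holds if the two io modes really
never coexist on one inode.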

Just need to be careful in fuse_init_file_inode(), fuse_evict_inode() and
fuse_file_io_release() which do not assume a specific inode io mode.

Was it your intention to allow filesystems to configure some inodes to be
in file_iomap mode and other inodes to be in regular cached/direct/passthrough
io modes?

I can't say that I see a big benefit in allowing such setups.
It certainly adds a lot of complication to the test matrix if we allow that.
My instinct is that the initial version should only allow opening files in
FILE_IOMAP or DIRECT_IOMAP mode for inodes of a filesystem that supports
those modes.

Perhaps later we can allow (and maybe fall back to) FOPEN_DIRECT_IO
(without parallel dio) if a server does not configure IOMAP for some inode,
so that the server can provide the data for that specific inode directly.

fuse_file_io_open/release() can help you manage those restrictions and
set ff->iomode = IOM_FILE_IOMAP when a file is opened for file iomap.
I did not look closely enough to see if file_iomap code ends up setting
ff->iomode = IOM_CACHED/UNCACHED or always remains IOM_NONE.

Thanks,
Amir.


* Re: [PATCH 07/14] libfuse: add a reply function to send FUSE_ATTR_* to the kernel
  2025-07-18 14:10     ` Amir Goldstein
@ 2025-07-18 15:48       ` Darrick J. Wong
  2025-07-19  7:34         ` Amir Goldstein
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 15:48 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: bschubert, John, joannelkoong, linux-fsdevel, bernd, neal, miklos

On Fri, Jul 18, 2025 at 04:10:18PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 1:36 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Create new fuse_reply_{attr,create,entry}_iflags functions so that we
> > can send FUSE_ATTR_* flags to the kernel when instantiating an inode.
> > Servers are expected to send FUSE_IFLAG_* values, which will be
> > translated into what the kernel can understand.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  include/fuse_common.h   |    3 ++
> >  include/fuse_lowlevel.h |   87 +++++++++++++++++++++++++++++++++++++++++++++--
> >  lib/fuse_lowlevel.c     |   69 ++++++++++++++++++++++++++++++-------
> >  lib/fuse_versionscript  |    4 ++
> >  4 files changed, 146 insertions(+), 17 deletions(-)

<snip for brevity>

> > diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> > index d26043fa54c036..568db13502a7d7 100644
> > --- a/lib/fuse_lowlevel.c
> > +++ b/lib/fuse_lowlevel.c
> > @@ -545,7 +573,22 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
> >         memset(&arg, 0, sizeof(arg));
> >         arg.attr_valid = calc_timeout_sec(attr_timeout);
> >         arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> > -       convert_stat(attr, &arg.attr);
> > +       convert_stat(attr, &arg.attr, 0);
> > +
> > +       return send_reply_ok(req, &arg, size);
> > +}
> > +
> > +int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr,
> > +                          unsigned int iflags, double attr_timeout)
> > +{
> > +       struct fuse_attr_out arg;
> > +       size_t size = req->se->conn.proto_minor < 9 ?
> > +               FUSE_COMPAT_ATTR_OUT_SIZE : sizeof(arg);
> > +
> > +       memset(&arg, 0, sizeof(arg));
> > +       arg.attr_valid = calc_timeout_sec(attr_timeout);
> > +       arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> > +       convert_stat(attr, &arg.attr, iflags);
> >
> >         return send_reply_ok(req, &arg, size);
> >  }
> 
> I wonder why fuse_reply_attr() is not implemented as a wrapper to
> fuse_reply_attr_iflags()?

oops.  I meant to convert this one, and apparently forgot. :(
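
A rough sketch of the wrapper shape being discussed, using hypothetical
demo_* stand-ins rather than the real libfuse types and reply path: the
older entry point keeps its signature and delegates with iflags set to
zero, so the reply logic lives in one place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the reply payload; not struct fuse_attr_out. */
struct demo_attr_out {
	unsigned int flags;
	double attr_timeout;
};

/* Full variant: the only function that fills in the reply. */
int demo_reply_attr_iflags(struct demo_attr_out *out, unsigned int iflags,
			   double attr_timeout)
{
	assert(out != NULL);
	out->flags = iflags;
	out->attr_timeout = attr_timeout;
	return 0;
}

/* Legacy variant: a thin wrapper that passes no inode flags. */
int demo_reply_attr(struct demo_attr_out *out, double attr_timeout)
{
	return demo_reply_attr_iflags(out, 0, attr_timeout);
}
```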

> FWIW, the flags field was added in minor version 23 for
> FUSE_ATTR_SUBMOUNT, but I guess that doesn't matter here.

<nod> Hopefully nobody will call fuse_reply_attr_iflags when
proto_minor < 23.  Do I need to check for that explicitly in libfuse and
zero out iflags?  Or is it safe enough to assume that the os kernel
ignores flags bits that it doesn't understand and/or are not enabled on
the fuse_mount?
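
If an explicit check turns out to be wanted, it could look roughly like the
sketch below. The helper name and the macro are hypothetical; the only fact
taken from this thread is that the flags field arrived in protocol minor
version 23.

```c
#include <stdint.h>

/* Per the review comment above, attr flags were added in minor 23. */
#define DEMO_ATTR_FLAGS_MIN_MINOR 23

/*
 * Hypothetical helper: drop all inode flags when the negotiated protocol
 * predates the flags field, otherwise pass them through unchanged.
 */
uint32_t demo_sanitize_iflags(uint32_t proto_minor, uint32_t iflags)
{
	if (proto_minor < DEMO_ATTR_FLAGS_MIN_MINOR)
		return 0;
	return iflags;
}
```

Whether the check is needed at all depends on the answer to the question
above, i.e. whether the kernel already ignores flag bits it does not know.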

(I'm not sure if the lowlevel fuse library exists on mac/bsdfuse, though
afaict they ship the same source code so ... probably?)

Also: how aggressively do the syzbot people go after /dev/fuse?

--D

> Thanks,
> Amir.
> 


* Re: [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags
  2025-07-18 14:27     ` Amir Goldstein
@ 2025-07-18 15:55       ` Darrick J. Wong
  2025-07-21 18:51         ` Bernd Schubert
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 15:55 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: bschubert, John, joannelkoong, linux-fsdevel, bernd, neal, miklos

On Fri, Jul 18, 2025 at 04:27:50PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 1:36 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Create a new ->getattr_iflags function so that iomap filesystems can set
> > the appropriate in-kernel inode flags on instantiation.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

<snip for brevity>

> > diff --git a/lib/fuse.c b/lib/fuse.c
> > index 8dbf88877dd37c..685d0181e569d0 100644
> > --- a/lib/fuse.c
> > +++ b/lib/fuse.c
> > @@ -3710,14 +3832,19 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
> >                         if (de->flags & FUSE_FILL_DIR_PLUS &&
> >                             !is_dot_or_dotdot(de->name)) {
> >                                 res = do_lookup(dh->fuse, dh->nodeid,
> > -                                               de->name, &e);
> > +                                               de->name, &e, &iflags);
> >                                 if (res) {
> >                                         dh->error = res;
> >                                         return 1;
> >                                 }
> >                         }
> >
> > -                       thislen = fuse_add_direntry_plus(req, p, rem,
> > +                       if (f->want_iflags)
> > +                               thislen = fuse_add_direntry_plus_iflags(req, p,
> > +                                                        rem, de->name, iflags,
> > +                                                        &e, pos);
> > +                       else
> > +                               thislen = fuse_add_direntry_plus(req, p, rem,
> >                                                          de->name, &e, pos);
> 
> 
> All those conditional statements look pretty moot.
> Can't we just force iflags to 0 if (!f->want_iflags)
> and always call the *_iflags functions?

Heh, it already is zero, so yes, this could be a straight call to
fuse_add_direntry_plus_iflags without the want_iflags check.  Will fix
up this and the other thing you mentioned in the previous patch.
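
A small sketch of the simplified call site, with hypothetical demo_*
stand-ins in place of the real libfuse functions: iflags starts at zero and
is only set when flags are wanted, so the *_iflags variant can be called
unconditionally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a readdirplus entry; not a libfuse type. */
struct demo_dirent {
	const char *name;
	unsigned int iflags;
};

/* Hypothetical stand-in for fuse_add_direntry_plus_iflags(). */
size_t demo_add_direntry_iflags(struct demo_dirent *out, const char *name,
				unsigned int iflags)
{
	assert(out != NULL);
	out->name = name;
	out->iflags = iflags;
	return sizeof(*out);
}

/*
 * Caller modeled loosely on fill_dir_plus(): no branch between two
 * add_direntry variants, because iflags is already zero by default.
 */
size_t demo_fill(struct demo_dirent *out, const char *name, int want_iflags,
		 unsigned int looked_up_iflags)
{
	unsigned int iflags = 0;

	if (want_iflags)
		iflags = looked_up_iflags;

	return demo_add_direntry_iflags(out, name, iflags);
}
```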

Thanks for the code review!

Having said that, the significant difficulties with iomap and the
upper level fuse library still exist.  To summarize -- upper libfuse has
its own nodeids which don't necessarily correspond to the filesystem's,
and struct node/nodeid are duplicated for hardlinked files.  As a
result, the kernel has multiple struct inodes for an ondisk ext4 inode,
which completely breaks the locking for the iomap file IO model.
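
To make the hardlink problem concrete, here is a toy model (hypothetical
demo_* names; this is not how lib/fuse.c is actually structured, and keying
purely by name is an assumption made for illustration). A table keyed by
name hands out one nodeid per name, while a table keyed by ondisk inode
number hands out one nodeid per file:

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_NODES 16

/* Toy node table entry; hypothetical, not the real struct node. */
struct demo_node {
	const char *name;		/* directory entry name */
	unsigned long ondisk_ino;	/* inode number on the backing fs */
};

struct demo_table {
	struct demo_node nodes[DEMO_MAX_NODES];
	size_t count;
};

/* Append a node and return its 1-based nodeid, or 0 if the table is full. */
unsigned long demo_add_node(struct demo_table *t, const char *name,
			    unsigned long ondisk_ino)
{
	if (t->count == DEMO_MAX_NODES)
		return 0;
	t->nodes[t->count].name = name;
	t->nodes[t->count].ondisk_ino = ondisk_ino;
	return ++t->count;
}

/* Name-keyed lookup: every distinct name gets its own nodeid. */
unsigned long demo_nodeid_by_name(struct demo_table *t, const char *name,
				  unsigned long ondisk_ino)
{
	size_t i;

	for (i = 0; i < t->count; i++)
		if (strcmp(t->nodes[i].name, name) == 0)
			return i + 1;
	return demo_add_node(t, name, ondisk_ino);
}

/* Inode-keyed lookup: all names of one ondisk inode share a nodeid. */
unsigned long demo_nodeid_by_ino(struct demo_table *t, const char *name,
				 unsigned long ondisk_ino)
{
	size_t i;

	for (i = 0; i < t->count; i++)
		if (t->nodes[i].ondisk_ino == ondisk_ino)
			return i + 1;
	return demo_add_node(t, name, ondisk_ino);
}
```

With two hard links "a" and "b" to the same ondisk inode, the name-keyed
table returns two different nodeids, which is the situation that leaves the
kernel with two struct inodes for one file; the inode-keyed table returns
the same nodeid for both names.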

That forces me to port fuse2fs to the lowlevel library, so I might
remove the lib/fuse.c patches entirely.  Are there plans to make the
upper libfuse handle hardlinks better?

--D

> Thanks,
> Amir.
> 


* Re: [PATCH 3/4] libfuse: add statx support to the lower level library
  2025-07-18 13:28     ` Amir Goldstein
@ 2025-07-18 15:58       ` Darrick J. Wong
  2025-07-18 16:27       ` Darrick J. Wong
  1 sibling, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 15:58 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: bschubert, John, joannelkoong, linux-fsdevel, bernd, neal, miklos

On Fri, Jul 18, 2025 at 03:28:25PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 1:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Add statx support to the lower level fuse library.
> 
> This looked familiar.
> Merged 3 days ago:
> https://github.com/libfuse/libfuse/pull/1026

Heheheh ok I'll rebase then.  I see Joanne's version is more complete
than mine anyway. :)

That said, is there any interest in adding the newer statx fields
(subvol id, directio geometry, atomic write geometry) to the FUSE_STATX
UABI?  fuse+iomap could support atomic writes pretty easily AFAICT.

(But first things first, there's at least one or two lingering data
corruption bugs in non-iomap fuse2fs that I ought to squash :P)

--D

> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  include/fuse_lowlevel.h |   37 ++++++++++++++++++
> >  lib/fuse_lowlevel.c     |   97 +++++++++++++++++++++++++++++++++++++++++++++++
> >  lib/fuse_versionscript  |    2 +
> >  3 files changed, 136 insertions(+)
> >
> >
> > diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
> > index 77685e433e4f7d..f4d62cee22870a 100644
> > --- a/include/fuse_lowlevel.h
> > +++ b/include/fuse_lowlevel.h
> > @@ -1416,6 +1416,26 @@ struct fuse_lowlevel_ops {
> >          * @param ino the inode number
> >          */
> >         void (*syncfs) (fuse_req_t req, fuse_ino_t ino);
> > +
> > +       /**
> > +        * Fetch extended stat information about a file
> > +        *
> > +        * If this request is answered with an error code of ENOSYS, this is
> > +        * treated as a permanent failure, i.e. all future statx() requests
> > +        * will fail with the same error code without being sent to the
> > +        * filesystem process.
> > +        *
> > +        * Valid replies:
> > +        *   fuse_reply_statx
> > +        *   fuse_reply_err
> > +        *
> > +        * @param req request handle
> > +        * @param statx_flags AT_STATX_* flags
> > +        * @param statx_mask desired STATX_* attribute mask
> > +        * @param fi file information
> > +        */
> > +       void (*statx) (fuse_req_t req, fuse_ino_t ino, uint32_t statx_flags,
> > +                      uint32_t statx_mask, struct fuse_file_info *fi);
> >  #endif /* FUSE_USE_VERSION >= 318 */
> >  };
> >
> > @@ -1897,6 +1917,23 @@ int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_iomap *read_iomap,
> >   * @return zero for success, -errno for failure to send reply
> >   */
> >  int fuse_reply_iomap_config(fuse_req_t req, const struct fuse_iomap_config *cfg);
> > +
> > +struct statx;
> > +
> > +/**
> > + * Reply with statx attributes
> > + *
> > + * Possible requests:
> > + *   statx
> > + *
> > + * @param req request handle
> > + * @param statx the attributes
> > + * @param size the size of the statx structure
> > + * @param attr_timeout validity timeout (in seconds) for the attributes
> > + * @return zero for success, -errno for failure to send reply
> > + */
> > +int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
> > +                    double attr_timeout);
> >  #endif /* FUSE_USE_VERSION >= 318 */
> >
> >  /* ----------------------------------------------------------- *
> > diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> > index ec30ebc4cdd074..8eeb6a8547da91 100644
> > --- a/lib/fuse_lowlevel.c
> > +++ b/lib/fuse_lowlevel.c
> > @@ -144,6 +144,43 @@ static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
> >         ST_CTIM_NSEC_SET(stbuf, attr->ctimensec);
> >  }
> >
> > +#ifdef STATX_BASIC_STATS
> > +static int convert_statx(struct fuse_statx *stbuf, const struct statx *stx,
> > +                        size_t size)
> > +{
> > +       if (sizeof(struct statx) != size)
> > +               return EOPNOTSUPP;
> > +
> > +       stbuf->mask = stx->stx_mask & (STATX_BASIC_STATS | STATX_BTIME);
> > +       stbuf->blksize          = stx->stx_blksize;
> > +       stbuf->attributes       = stx->stx_attributes;
> > +       stbuf->nlink            = stx->stx_nlink;
> > +       stbuf->uid              = stx->stx_uid;
> > +       stbuf->gid              = stx->stx_gid;
> > +       stbuf->mode             = stx->stx_mode;
> > +       stbuf->ino              = stx->stx_ino;
> > +       stbuf->size             = stx->stx_size;
> > +       stbuf->blocks           = stx->stx_blocks;
> > +       stbuf->attributes_mask  = stx->stx_attributes_mask;
> > +       stbuf->rdev_major       = stx->stx_rdev_major;
> > +       stbuf->rdev_minor       = stx->stx_rdev_minor;
> > +       stbuf->dev_major        = stx->stx_dev_major;
> > +       stbuf->dev_minor        = stx->stx_dev_minor;
> > +
> > +       stbuf->atime.tv_sec     = stx->stx_atime.tv_sec;
> > +       stbuf->btime.tv_sec     = stx->stx_btime.tv_sec;
> > +       stbuf->ctime.tv_sec     = stx->stx_ctime.tv_sec;
> > +       stbuf->mtime.tv_sec     = stx->stx_mtime.tv_sec;
> > +
> > +       stbuf->atime.tv_nsec    = stx->stx_atime.tv_nsec;
> > +       stbuf->btime.tv_nsec    = stx->stx_btime.tv_nsec;
> > +       stbuf->ctime.tv_nsec    = stx->stx_ctime.tv_nsec;
> > +       stbuf->mtime.tv_nsec    = stx->stx_mtime.tv_nsec;
> > +
> > +       return 0;
> > +}
> > +#endif
> > +
> 
> Why is this conversion not needed in the merged version?
> What am I missing?
> 
> Thanks,
> Amir.
> 
> >  static size_t iov_length(const struct iovec *iov, size_t count)
> >  {
> >         size_t seg;
> > @@ -2653,6 +2690,64 @@ static void do_syncfs(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg
> >         _do_syncfs(req, nodeid, inarg, NULL);
> >  }
> >
> > +#ifdef STATX_BASIC_STATS
> > +int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
> > +                    double attr_timeout)
> > +{
> > +       struct fuse_statx_out arg = {
> > +               .attr_valid = calc_timeout_sec(attr_timeout),
> > +               .attr_valid_nsec = calc_timeout_nsec(attr_timeout),
> > +       };
> > +
> > +       int err = convert_statx(&arg.stat, statx, size);
> > +       if (err) {
> > +               fuse_reply_err(req, err);
> > +               return err;
> > +       }
> > +
> > +       return send_reply_ok(req, &arg, sizeof(arg));
> > +}
> > +
> > +static void _do_statx(fuse_req_t req, const fuse_ino_t nodeid,
> > +                     const void *op_in, const void *in_payload)
> > +{
> > +       (void)in_payload;
> > +       const struct fuse_statx_in *arg = op_in;
> > +       struct fuse_file_info *fip = NULL;
> > +       struct fuse_file_info fi;
> > +
> > +       if (arg->getattr_flags & FUSE_GETATTR_FH) {
> > +               memset(&fi, 0, sizeof(fi));
> > +               fi.fh = arg->fh;
> > +               fip = &fi;
> > +       }
> > +
> > +       if (req->se->op.statx)
> > +               req->se->op.statx(req, nodeid, arg->sx_flags, arg->sx_mask,
> > +                                 fip);
> > +       else
> > +               fuse_reply_err(req, ENOSYS);
> > +}
> > +#else
> > > +int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
> > > +                    double attr_timeout)
> > +{
> > +       fuse_reply_err(req, ENOSYS);
> > +       return -ENOSYS;
> > +}
> > +
> > +static void _do_statx(fuse_req_t req, const fuse_ino_t nodeid,
> > +                     const void *op_in, const void *in_payload)
> > +{
> > +       fuse_reply_err(req, ENOSYS);
> > +}
> > +#endif /* STATX_BASIC_STATS */
> > +
> > +static void do_statx(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg)
> > +{
> > +       _do_statx(req, nodeid, inarg, NULL);
> > +}
> > +
> >  static bool want_flags_valid(uint64_t capable, uint64_t want)
> >  {
> >         uint64_t unknown_flags = want & (~capable);
> > @@ -3627,6 +3722,7 @@ static struct {
> >         [FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
> >         [FUSE_LSEEK]       = { do_lseek,       "LSEEK"       },
> >         [FUSE_SYNCFS]      = { do_syncfs,       "SYNCFS"     },
> > +       [FUSE_STATX]       = { do_statx,       "STATX"       },
> >         [FUSE_IOMAP_CONFIG]= { do_iomap_config, "IOMAP_CONFIG" },
> >         [FUSE_IOMAP_BEGIN] = { do_iomap_begin,  "IOMAP_BEGIN" },
> >         [FUSE_IOMAP_END]   = { do_iomap_end,    "IOMAP_END" },
> > @@ -3686,6 +3782,7 @@ static struct {
> >         [FUSE_COPY_FILE_RANGE]  = { _do_copy_file_range, "COPY_FILE_RANGE" },
> >         [FUSE_LSEEK]            = { _do_lseek,          "LSEEK" },
> >         [FUSE_SYNCFS]           = { _do_syncfs,         "SYNCFS" },
> > +       [FUSE_STATX]            = { _do_statx,          "STATX" },
> >         [FUSE_IOMAP_CONFIG]     = { _do_iomap_config,   "IOMAP_CONFIG" },
> >         [FUSE_IOMAP_BEGIN]      = { _do_iomap_begin,    "IOMAP_BEGIN" },
> >         [FUSE_IOMAP_END]        = { _do_iomap_end,      "IOMAP_END" },
> > diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
> > index dc9fa2428b5325..a67b1802770335 100644
> > --- a/lib/fuse_versionscript
> > +++ b/lib/fuse_versionscript
> > @@ -223,6 +223,8 @@ FUSE_3.18 {
> >                 fuse_reply_iomap_config;
> >                 fuse_lowlevel_notify_iomap_upsert;
> >                 fuse_lowlevel_notify_iomap_inval;
> > +
> > +               fuse_reply_statx;
> >  } FUSE_3.17;
> >
> >  # Local Variables:
> >
> >
> 


* Re: [PATCH 1/1] libfuse: enable iomap cache management
  2025-07-17 23:38   ` [PATCH 1/1] libfuse: enable iomap cache management Darrick J. Wong
@ 2025-07-18 16:16     ` Bernd Schubert
  2025-07-18 18:22       ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 16:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, bernd@bsbernd.com, neal@gompa.dev,
	miklos@szeredi.hu

On 7/18/25 01:38, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add the library methods so that fuse servers can manage an in-kernel
> iomap cache.  This enables better performance on small IOs and is
> required if the filesystem needs synchronization between pagecache
> writes and writeback.

Sorry, is this ready to be merged? I don't see it in linux master. Or is
it part of your other patches (it will take some time to go through these)?

> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  include/fuse_common.h   |    9 +++++
>  include/fuse_kernel.h   |   34 +++++++++++++++++++
>  include/fuse_lowlevel.h |   39 ++++++++++++++++++++++
>  lib/fuse_lowlevel.c     |   82 +++++++++++++++++++++++++++++++++++++++++++++++
>  lib/fuse_versionscript  |    2 +
>  5 files changed, 166 insertions(+)
> 
> 
> diff --git a/include/fuse_common.h b/include/fuse_common.h
> index 98cb8f656efd13..1237cc2656b9c4 100644
> --- a/include/fuse_common.h
> +++ b/include/fuse_common.h
> @@ -1164,6 +1164,7 @@ int fuse_convert_to_conn_want_ext(struct fuse_conn_info *conn);
>   */
>  #if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
>  #define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
> +#define FUSE_IOMAP_TYPE_NULL		(0xFFFE) /* no mapping here */
>  #define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
>  #define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
>  #define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
> @@ -1208,6 +1209,11 @@ struct fuse_iomap {
>  	uint32_t dev;		/* device cookie */
>  };
>  
> +struct fuse_iomap_inval {
> +	uint64_t offset;	/* file offset to invalidate, bytes */
> +	uint64_t length;	/* length to invalidate, bytes */
> +};
> +
>  /* out of place write extent */
>  #define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
>  /* unwritten extent */
> @@ -1258,6 +1264,9 @@ struct fuse_iomap_config{
>  	int64_t s_maxbytes;	/* max file size */
>  };
>  
> +/* invalidate to end of file */
> +#define FUSE_IOMAP_INVAL_TO_EOF		(~0ULL)
> +
>  #endif /* FUSE_USE_VERSION >= 318 */
>  
>  /* ----------------------------------------------------------- *
> diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
> index 3c704f03434693..f1a93dbd1ff443 100644
> --- a/include/fuse_kernel.h
> +++ b/include/fuse_kernel.h
> @@ -243,6 +243,8 @@
>   *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
>   *  - add FUSE_IOMAP_FILEIO/FUSE_ATTR_IOMAP_FILEIO for buffered I/O support
>   *  - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
> + *  - add FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL so fuse servers
> + *    can cache iomappings in the kernel


Personally I prefer a preparation patch that just syncs the entire
fuse_kernel.h from linux-<version>. Also, this file might get renamed to
fuse_kernel_linux.h; there seems to be interest from BSD and OSX in having
their own headers.

>   */
>  
>  #ifndef _LINUX_FUSE_H
> @@ -699,6 +701,8 @@ enum fuse_notify_code {
>  	FUSE_NOTIFY_DELETE = 6,
>  	FUSE_NOTIFY_RESEND = 7,
>  	FUSE_NOTIFY_INC_EPOCH = 8,
> +	FUSE_NOTIFY_IOMAP_UPSERT = 9,
> +	FUSE_NOTIFY_IOMAP_INVAL = 10,
>  	FUSE_NOTIFY_CODE_MAX,
>  };
>  
> @@ -1406,4 +1410,34 @@ struct fuse_iomap_config_out {
>  	int64_t s_maxbytes;	/* max file size */
>  };
>  
> +struct fuse_iomap_upsert_out {
> +	uint64_t nodeid;	/* Inode ID */
> +	uint64_t attr_ino;	/* matches fuse_attr:ino */
> +
> +	uint64_t read_offset;	/* file offset of mapping, bytes */
> +	uint64_t read_length;	/* length of mapping, bytes */
> +	uint64_t read_addr;	/* disk offset of mapping, bytes */
> +	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
> +	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
> +	uint32_t read_dev;	/* device cookie */
> +
> +	uint64_t write_offset;	/* file offset of mapping, bytes */
> +	uint64_t write_length;	/* length of mapping, bytes */
> +	uint64_t write_addr;	/* disk offset of mapping, bytes */
> +	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
> +	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
> +	uint32_t write_dev;	/* device cookie */
> +};
> +
> +struct fuse_iomap_inval_out {
> +	uint64_t nodeid;	/* Inode ID */
> +	uint64_t attr_ino;	/* matches fuse_attr:ino */
> +
> +	uint64_t read_offset;	/* range to invalidate read iomaps, bytes */
> +	uint64_t read_length;	/* can be FUSE_IOMAP_INVAL_TO_EOF */
> +
> +	uint64_t write_offset;	/* range to invalidate write iomaps, bytes */
> +	uint64_t write_length;	/* can be FUSE_IOMAP_INVAL_TO_EOF */
> +};
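
A minimal sketch of how a server might reason about an invalidation range,
assuming that a length of FUSE_IOMAP_INVAL_TO_EOF means "everything from
offset onwards" as the patch comment suggests. The demo_* names are
hypothetical and are not part of libfuse:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the value of FUSE_IOMAP_INVAL_TO_EOF from the quoted patch. */
#define DEMO_INVAL_TO_EOF	(~0ULL)

/*
 * Hypothetical helper: does the range [offset, offset + length) cover pos?
 * A length of DEMO_INVAL_TO_EOF is assumed to mean "to end of file".
 * Comparing against (pos - offset) avoids overflowing offset + length.
 */
static bool demo_inval_covers(uint64_t offset, uint64_t length, uint64_t pos)
{
	if (pos < offset)
		return false;
	if (length == DEMO_INVAL_TO_EOF)
		return true;
	return pos - offset < length;
}
```

The subtraction form matters because offset + length can wrap for large
values even when the sentinel is not in use.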
> +
>  #endif /* _LINUX_FUSE_H */
> diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
> index fd7df5c2c11e16..f690c62fcdd61c 100644
> --- a/include/fuse_lowlevel.h
> +++ b/include/fuse_lowlevel.h
> @@ -2101,6 +2101,45 @@ int fuse_lowlevel_notify_retrieve(struct fuse_session *se, fuse_ino_t ino,
>   * @return positive device id for success, zero for failure
>   */
>  int fuse_iomap_add_device(struct fuse_session *se, int fd, unsigned int flags);
> +
> +/**
> + * Upsert some file mapping information into the kernel.  This is necessary
> + * for filesystems that require coordination of mapping state changes between
> + * buffered writes and writeback, and desirable for better performance
> + * elsewhere.
> + *
> + * Added in FUSE protocol version 7.99. If the kernel does not support

7.99?



Thanks,
Bernd


* Re: [PATCH 3/4] libfuse: add statx support to the lower level library
  2025-07-18 13:28     ` Amir Goldstein
  2025-07-18 15:58       ` Darrick J. Wong
@ 2025-07-18 16:27       ` Darrick J. Wong
  2025-07-18 16:54         ` Bernd Schubert
  1 sibling, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 16:27 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: bschubert, John, joannelkoong, linux-fsdevel, bernd, neal, miklos

On Fri, Jul 18, 2025 at 03:28:25PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 1:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Add statx support to the lower level fuse library.
> 
> This looked familiar.
> Merged 3 days ago:
> https://github.com/libfuse/libfuse/pull/1026
> 
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  include/fuse_lowlevel.h |   37 ++++++++++++++++++
> >  lib/fuse_lowlevel.c     |   97 +++++++++++++++++++++++++++++++++++++++++++++++
> >  lib/fuse_versionscript  |    2 +
> >  3 files changed, 136 insertions(+)

<snip>

> > diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> > index ec30ebc4cdd074..8eeb6a8547da91 100644
> > --- a/lib/fuse_lowlevel.c
> > +++ b/lib/fuse_lowlevel.c
> > @@ -144,6 +144,43 @@ static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
> >         ST_CTIM_NSEC_SET(stbuf, attr->ctimensec);
> >  }
> >
> > +#ifdef STATX_BASIC_STATS
> > +static int convert_statx(struct fuse_statx *stbuf, const struct statx *stx,
> > +                        size_t size)
> > +{
> > +       if (sizeof(struct statx) != size)
> > +               return EOPNOTSUPP;
> > +
> > +       stbuf->mask = stx->stx_mask & (STATX_BASIC_STATS | STATX_BTIME);
> > +       stbuf->blksize          = stx->stx_blksize;
> > +       stbuf->attributes       = stx->stx_attributes;
> > +       stbuf->nlink            = stx->stx_nlink;
> > +       stbuf->uid              = stx->stx_uid;
> > +       stbuf->gid              = stx->stx_gid;
> > +       stbuf->mode             = stx->stx_mode;
> > +       stbuf->ino              = stx->stx_ino;
> > +       stbuf->size             = stx->stx_size;
> > +       stbuf->blocks           = stx->stx_blocks;
> > +       stbuf->attributes_mask  = stx->stx_attributes_mask;
> > +       stbuf->rdev_major       = stx->stx_rdev_major;
> > +       stbuf->rdev_minor       = stx->stx_rdev_minor;
> > +       stbuf->dev_major        = stx->stx_dev_major;
> > +       stbuf->dev_minor        = stx->stx_dev_minor;
> > +
> > +       stbuf->atime.tv_sec     = stx->stx_atime.tv_sec;
> > +       stbuf->btime.tv_sec     = stx->stx_btime.tv_sec;
> > +       stbuf->ctime.tv_sec     = stx->stx_ctime.tv_sec;
> > +       stbuf->mtime.tv_sec     = stx->stx_mtime.tv_sec;
> > +
> > +       stbuf->atime.tv_nsec    = stx->stx_atime.tv_nsec;
> > +       stbuf->btime.tv_nsec    = stx->stx_btime.tv_nsec;
> > +       stbuf->ctime.tv_nsec    = stx->stx_ctime.tv_nsec;
> > +       stbuf->mtime.tv_nsec    = stx->stx_mtime.tv_nsec;
> > +
> > +       return 0;
> > +}
> > +#endif
> > +
> 
> Why is this conversion not needed in the merged version?
> What am I missing?

The patch in upstream memcpy's struct statx to struct fuse_statx:

	memset(&arg, 0, sizeof(arg));
	arg.flags = flags;
	arg.attr_valid = calc_timeout_sec(attr_timeout);
	arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
	memcpy(&arg.stat, statx, sizeof(arg.stat));

As long as the fields in the two are kept exactly in sync, this isn't a
problem and no explicit struct conversion is necessary.
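
One way to make the "kept exactly in sync" assumption checkable at build
time is a set of static assertions next to the memcpy. This is only a
sketch: the two structs below are cut-down, hypothetical stand-ins for
struct statx and struct fuse_statx, not the real definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct statx. */
struct demo_statx {
	uint32_t stx_mask;
	uint32_t stx_blksize;
	uint64_t stx_size;
};

/* Hypothetical stand-in for struct fuse_statx. */
struct demo_fuse_statx {
	uint32_t mask;
	uint32_t blksize;
	uint64_t size;
};

/* Fail the build if the two layouts ever drift apart. */
static_assert(sizeof(struct demo_statx) == sizeof(struct demo_fuse_statx),
	      "struct sizes differ");
static_assert(offsetof(struct demo_statx, stx_blksize) ==
	      offsetof(struct demo_fuse_statx, blksize),
	      "blksize field moved");
static_assert(offsetof(struct demo_statx, stx_size) ==
	      offsetof(struct demo_fuse_statx, size),
	      "size field moved");

/* With the layout pinned down, a plain memcpy is a valid conversion. */
static void demo_copy_statx(struct demo_fuse_statx *dst,
			    const struct demo_statx *src)
{
	memcpy(dst, src, sizeof(*dst));
}
```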

I also noticed that the !HAVE_STATX variant of _do_statx doesn't call
fuse_reply_err(req, ENOSYS).  I think that means a new kernel calling
an old userspace would never receive a reply to a FUSE_STATX command
and ... time out?
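
The underlying rule is that every request gets exactly one reply, even
when no handler exists. A toy sketch of such a fallback path, using
hypothetical demo_* names rather than the libfuse types:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical request record; not the libfuse fuse_req_t. */
struct demo_req {
	int replied;	/* number of replies sent so far */
	int error;	/* errno carried by the reply */
};

/* Send an error reply; a request must be answered exactly once. */
static void demo_reply_err(struct demo_req *req, int err)
{
	assert(req->replied == 0);
	req->replied++;
	req->error = err;
}

typedef void (*demo_statx_fn)(struct demo_req *req);

/* Dispatch a statx request, replying ENOSYS when there is no handler. */
static void demo_do_statx(struct demo_req *req, demo_statx_fn handler)
{
	if (handler)
		handler(req);
	else
		demo_reply_err(req, ENOSYS);
}
```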

My version also has explicit sizing of struct statx, but I concede that
if that struct ever gets bigger we're going to have to rev the whole
syscall anyway.  I was being perhaps a bit paranoid.

BTW, where are libfuse patches reviewed?  I guess all the reviews are
done via GitHub PRs?

--D

> Thanks,
> Amir.
> 
> >  static size_t iov_length(const struct iovec *iov, size_t count)
> >  {
> >         size_t seg;
> > @@ -2653,6 +2690,64 @@ static void do_syncfs(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg
> >         _do_syncfs(req, nodeid, inarg, NULL);
> >  }
> >
> > +#ifdef STATX_BASIC_STATS
> > +int fuse_reply_statx(fuse_req_t req, const struct statx *statx, size_t size,
> > +                    double attr_timeout)
> > +{
> > +       struct fuse_statx_out arg = {
> > +               .attr_valid = calc_timeout_sec(attr_timeout),
> > +               .attr_valid_nsec = calc_timeout_nsec(attr_timeout),
> > +       };
> > +
> > +       int err = convert_statx(&arg.stat, statx, size);
> > +       if (err) {
> > +               fuse_reply_err(req, err);
> > +               return err;
> > +       }
> > +
> > +       return send_reply_ok(req, &arg, sizeof(arg));
> > +}
> > +
> > +static void _do_statx(fuse_req_t req, const fuse_ino_t nodeid,
> > +                     const void *op_in, const void *in_payload)
> > +{
> > +       (void)in_payload;
> > +       const struct fuse_statx_in *arg = op_in;
> > +       struct fuse_file_info *fip = NULL;
> > +       struct fuse_file_info fi;
> > +
> > +       if (arg->getattr_flags & FUSE_GETATTR_FH) {
> > +               memset(&fi, 0, sizeof(fi));
> > +               fi.fh = arg->fh;
> > +               fip = &fi;
> > +       }
> > +
> > +       if (req->se->op.statx)
> > +               req->se->op.statx(req, nodeid, arg->sx_flags, arg->sx_mask,
> > +                                 fip);
> > +       else
> > +               fuse_reply_err(req, ENOSYS);
> > +}
> > +#else
> > +int fuse_reply_statx(fuse_req_t req, const struct statx *statx,
> > +                    double attr_timeout)
> > +{
> > +       fuse_reply_err(req, ENOSYS);
> > +       return -ENOSYS;
> > +}
> > +
> > +static void _do_statx(fuse_req_t req, const fuse_ino_t nodeid,
> > +                     const void *op_in, const void *in_payload)
> > +{
> > +       fuse_reply_err(req, ENOSYS);
> > +}
> > +#endif /* STATX_BASIC_STATS */
> > +
> > +static void do_statx(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg)
> > +{
> > +       _do_statx(req, nodeid, inarg, NULL);
> > +}
> > +
> >  static bool want_flags_valid(uint64_t capable, uint64_t want)
> >  {
> >         uint64_t unknown_flags = want & (~capable);
> > @@ -3627,6 +3722,7 @@ static struct {
> >         [FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
> >         [FUSE_LSEEK]       = { do_lseek,       "LSEEK"       },
> >         [FUSE_SYNCFS]      = { do_syncfs,       "SYNCFS"     },
> > +       [FUSE_STATX]       = { do_statx,       "STATX"       },
> >         [FUSE_IOMAP_CONFIG]= { do_iomap_config, "IOMAP_CONFIG" },
> >         [FUSE_IOMAP_BEGIN] = { do_iomap_begin,  "IOMAP_BEGIN" },
> >         [FUSE_IOMAP_END]   = { do_iomap_end,    "IOMAP_END" },
> > @@ -3686,6 +3782,7 @@ static struct {
> >         [FUSE_COPY_FILE_RANGE]  = { _do_copy_file_range, "COPY_FILE_RANGE" },
> >         [FUSE_LSEEK]            = { _do_lseek,          "LSEEK" },
> >         [FUSE_SYNCFS]           = { _do_syncfs,         "SYNCFS" },
> > +       [FUSE_STATX]            = { _do_statx,          "STATX" },
> >         [FUSE_IOMAP_CONFIG]     = { _do_iomap_config,   "IOMAP_CONFIG" },
> >         [FUSE_IOMAP_BEGIN]      = { _do_iomap_begin,    "IOMAP_BEGIN" },
> >         [FUSE_IOMAP_END]        = { _do_iomap_end,      "IOMAP_END" },
> > diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
> > index dc9fa2428b5325..a67b1802770335 100644
> > --- a/lib/fuse_versionscript
> > +++ b/lib/fuse_versionscript
> > @@ -223,6 +223,8 @@ FUSE_3.18 {
> >                 fuse_reply_iomap_config;
> >                 fuse_lowlevel_notify_iomap_upsert;
> >                 fuse_lowlevel_notify_iomap_inval;
> > +
> > +               fuse_reply_statx;
> >  } FUSE_3.17;
> >
> >  # Local Variables:
> >
> >
> 


* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-17 23:26   ` [PATCH 2/7] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
@ 2025-07-18 16:37     ` Bernd Schubert
  2025-07-18 17:50       ` Joanne Koong
  2025-07-18 18:07       ` Bernd Schubert
  2025-07-18 22:23     ` Joanne Koong
  1 sibling, 2 replies; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 16:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, miklos, joannelkoong

[-- Attachment #1: Type: text/plain, Size: 4444 bytes --]



On 7/18/25 01:26, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> generic/488 fails with fuse2fs in the following fashion:
> 
> generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> (see /var/tmp/fstests/generic/488.full for details)
> 
> This test opens a large number of files, unlinks them (which really just
> renames them to fuse hidden files), closes the program, unmounts the
> filesystem, and runs fsck to check that there aren't any inconsistencies
> in the filesystem.
> 
> Unfortunately, the 488.full file shows that there are a lot of hidden
> files left over in the filesystem, with incorrect link counts.  Tracing
> fuse_request_* shows that there are a large number of FUSE_RELEASE
> commands that are queued up on behalf of the unlinked files at the time
> that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> aborted, the fuse server would have responded to the RELEASE commands by
> removing the hidden files; instead they stick around.
> 
> Create a function to push all the background requests to the queue and
> then wait for the number of pending events to hit zero, and call this
> before fuse_abort_conn.  That way, all the pending events are processed
> by the fuse server and we don't end up with a corrupt filesystem.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/fuse/fuse_i.h |    6 ++++++
>  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
>  fs/fuse/inode.c  |    1 +
>  3 files changed, 45 insertions(+)
> 
> 
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index b54f4f57789f7f..78d34c8e445b32 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1256,6 +1256,12 @@ void fuse_request_end(struct fuse_req *req);
>  void fuse_abort_conn(struct fuse_conn *fc);
>  void fuse_wait_aborted(struct fuse_conn *fc);
>  
> +/**
> + * Flush all pending requests and wait for them.  Takes an optional timeout
> + * in jiffies.
> + */
> +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout);
> +
>  /* Check if any requests timed out */
>  void fuse_check_timeout(struct work_struct *work);
>  
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index e80cd8f2c049f9..5387e4239d6aa6 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -24,6 +24,7 @@
>  #include <linux/splice.h>
>  #include <linux/sched.h>
>  #include <linux/seq_file.h>
> +#include <linux/nmi.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include "fuse_trace.h"
> @@ -2385,6 +2386,43 @@ static void end_polls(struct fuse_conn *fc)
>  	}
>  }
>  
> +/*
> + * Flush all pending requests and wait for them.  Only call this function when
> + * it is no longer possible for other threads to add requests.
> + */
> +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)

I wonder if this should have "abort" in its name. Because it is not a
simple flush attempt, but also sets fc->blocked and fc->max_background.

> +{
> +	unsigned long deadline;
> +
> +	spin_lock(&fc->lock);
> +	if (!fc->connected) {
> +		spin_unlock(&fc->lock);
> +		return;
> +	}
> +
> +	/* Push all the background requests to the queue. */
> +	spin_lock(&fc->bg_lock);
> +	fc->blocked = 0;
> +	fc->max_background = UINT_MAX;
> +	flush_bg_queue(fc);
> +	spin_unlock(&fc->bg_lock);
> +	spin_unlock(&fc->lock);
> +
> +	/*
> +	 * Wait 30s for all the events to complete or abort.  Touch the
> +	 * watchdog once per second so that we don't trip the hangcheck timer
> +	 * while waiting for the fuse server.
> +	 */
> +	deadline = jiffies + timeout;
> +	smp_mb();
> +	while (fc->connected &&
> +	       (!timeout || time_before(jiffies, deadline)) &&
> +	       wait_event_timeout(fc->blocked_waitq,
> +			!fc->connected || atomic_read(&fc->num_waiting) == 0,
> +			HZ) == 0)
> +		touch_softlockup_watchdog();
> +}
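
The deadline handling in the loop above relies on wraparound-safe jiffies
comparison. Below is a userspace sketch of that idea, not the kernel code:
demo_time_before() imitates the kernel's time_before() with a signed
difference, and demo_wait_idle() is a hypothetical toy model in which each
poll completes one request so that the loop terminates:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Imitation of time_before(): true if a is before b, even across a wrap. */
static bool demo_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Toy model of the wait loop: poll once per tick until *pending drops to
 * zero or the deadline passes.  A timeout of zero waits with no deadline.
 * Returns the number of ticks spent waiting.
 */
static unsigned long demo_wait_idle(unsigned long now, unsigned long timeout,
				    unsigned int *pending)
{
	unsigned long deadline = now + timeout;
	unsigned long ticks = 0;

	while (*pending > 0 &&
	       (!timeout || demo_time_before(now, deadline))) {
		(*pending)--;	/* stand-in for the server finishing a request */
		now++;
		ticks++;
	}
	return ticks;
}
```

Starting the clock just below ULONG_MAX exercises the wrap: the deadline
lands at a small number, and the signed difference still orders the two
values correctly.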
> +
>  /*
>   * Abort all requests.
>   *
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 9572bdef49eecc..1734c263da3a77 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -2047,6 +2047,7 @@ void fuse_conn_destroy(struct fuse_mount *fm)
>  {
>  	struct fuse_conn *fc = fm->fc;
>  
> +	fuse_flush_requests(fc, 30 * HZ);

I think fc->connected should be set to 0 here, to prevent new requests
from being allocated.

>  	if (fc->destroy)
>  		fuse_send_destroy(fm);
>  
> 


Please see the two attached patches, which are needed for fuse-io-uring.
I can also send them separately, if you prefer.


Thanks,
Bernd

[-- Attachment #2: 01-flush-io-uring-queue --]
[-- Type: text/plain, Size: 3552 bytes --]

fuse: Refactor io-uring bg queue flush and queue abort

From: Bernd Schubert <bschubert@ddn.com>

This is preparation to allow a fuse-io-uring bg queue
flush from flush_bg_queue().

This does two function renames:
fuse_uring_flush_bg -> fuse_uring_flush_queue_bg
fuse_uring_abort_end_requests -> fuse_uring_flush_bg

And fuse_uring_abort_end_queue_requests() is moved to
fuse_uring_stop_queues().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c   |   14 +++++++-------
 fs/fuse/dev_uring_i.h |    4 ++--
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 249b210becb1..eca457d1005e 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -47,7 +47,7 @@ static struct fuse_ring_ent *uring_cmd_to_ring_ent(struct io_uring_cmd *cmd)
 	return pdu->ent;
 }
 
-static void fuse_uring_flush_bg(struct fuse_ring_queue *queue)
+static void fuse_uring_flush_queue_bg(struct fuse_ring_queue *queue)
 {
 	struct fuse_ring *ring = queue->ring;
 	struct fuse_conn *fc = ring->fc;
@@ -88,7 +88,7 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
 	if (test_bit(FR_BACKGROUND, &req->flags)) {
 		queue->active_background--;
 		spin_lock(&fc->bg_lock);
-		fuse_uring_flush_bg(queue);
+		fuse_uring_flush_queue_bg(queue);
 		spin_unlock(&fc->bg_lock);
 	}
 
@@ -117,11 +117,11 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
 	fuse_dev_end_requests(&req_list);
 }
 
-void fuse_uring_abort_end_requests(struct fuse_ring *ring)
+void fuse_uring_flush_bg(struct fuse_conn *fc)
 {
 	int qid;
 	struct fuse_ring_queue *queue;
-	struct fuse_conn *fc = ring->fc;
+	struct fuse_ring *ring = fc->ring;
 
 	for (qid = 0; qid < ring->nr_queues; qid++) {
 		queue = READ_ONCE(ring->queues[qid]);
@@ -133,10 +133,9 @@ void fuse_uring_abort_end_requests(struct fuse_ring *ring)
 		WARN_ON_ONCE(ring->fc->max_background != UINT_MAX);
 		spin_lock(&queue->lock);
 		spin_lock(&fc->bg_lock);
-		fuse_uring_flush_bg(queue);
+		fuse_uring_flush_queue_bg(queue);
 		spin_unlock(&fc->bg_lock);
 		spin_unlock(&queue->lock);
-		fuse_uring_abort_end_queue_requests(queue);
 	}
 }
 
@@ -475,6 +474,7 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
 		if (!queue)
 			continue;
 
+		fuse_uring_abort_end_queue_requests(queue);
 		fuse_uring_teardown_entries(queue);
 	}
 
@@ -1326,7 +1326,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 	fc->num_background++;
 	if (fc->num_background == fc->max_background)
 		fc->blocked = 1;
-	fuse_uring_flush_bg(queue);
+	fuse_uring_flush_queue_bg(queue);
 	spin_unlock(&fc->bg_lock);
 
 	/*
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 51a563922ce1..55f52508de3c 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -138,7 +138,7 @@ struct fuse_ring {
 bool fuse_uring_enabled(void);
 void fuse_uring_destruct(struct fuse_conn *fc);
 void fuse_uring_stop_queues(struct fuse_ring *ring);
-void fuse_uring_abort_end_requests(struct fuse_ring *ring);
+void fuse_uring_flush_bg(struct fuse_conn *fc);
 int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req);
 bool fuse_uring_queue_bq_req(struct fuse_req *req);
@@ -153,7 +153,7 @@ static inline void fuse_uring_abort(struct fuse_conn *fc)
 		return;
 
 	if (atomic_read(&ring->queue_refs) > 0) {
-		fuse_uring_abort_end_requests(ring);
+		fuse_uring_flush_bg(fc);
 		fuse_uring_stop_queues(ring);
 	}
 }

[-- Attachment #3: 02-flush-uring-bg --]
[-- Type: text/plain, Size: 1533 bytes --]

fuse: Flush the io-uring bg queue from fuse_uring_flush_bg

From: Bernd Schubert <bschubert@ddn.com>

It is useful to have a single API to flush background requests,
for example when the bg queue gets flushed before
the remainder of fuse_conn_destroy().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c         |    2 ++
 fs/fuse/dev_uring_i.h |   10 +++++++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5387e4239d6a..3f5f168cc28a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -426,6 +426,8 @@ static void flush_bg_queue(struct fuse_conn *fc)
 		fc->active_background++;
 		fuse_send_one(fiq, req);
 	}
+
+	fuse_uring_flush_bg(fc);
 }
 
 /*
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 55f52508de3c..fca2184e8d94 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -152,10 +152,10 @@ static inline void fuse_uring_abort(struct fuse_conn *fc)
 	if (ring == NULL)
 		return;
 
-	if (atomic_read(&ring->queue_refs) > 0) {
-		fuse_uring_flush_bg(fc);
+	/* Assumes bg queues were already flushed before */
+
+	if (atomic_read(&ring->queue_refs) > 0)
 		fuse_uring_stop_queues(ring);
-	}
 }
 
 static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
@@ -206,6 +206,10 @@ static inline bool fuse_uring_request_expired(struct fuse_conn *fc)
 	return false;
 }
 
+static inline void fuse_uring_flush_bg(struct fuse_conn *fc)
+{
+}
+
 #endif /* CONFIG_FUSE_IO_URING */
 
 #endif /* _FS_FUSE_DEV_URING_I_H */


* Re: [PATCH 3/4] libfuse: add statx support to the lower level library
  2025-07-18 16:27       ` Darrick J. Wong
@ 2025-07-18 16:54         ` Bernd Schubert
  2025-07-18 18:42           ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 16:54 UTC (permalink / raw)
  To: Darrick J. Wong, Amir Goldstein
  Cc: John, joannelkoong, linux-fsdevel, bernd, neal, miklos



On 7/18/25 18:27, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 03:28:25PM +0200, Amir Goldstein wrote:
>> On Fri, Jul 18, 2025 at 1:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
>>>
>>> From: Darrick J. Wong <djwong@kernel.org>
>>>
>>> Add statx support to the lower level fuse library.
>>
>> This looked familiar.
>> Merged 3 days ago:
>> https://github.com/libfuse/libfuse/pull/1026
>>
>>>
>>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
>>> ---
>>>  include/fuse_lowlevel.h |   37 ++++++++++++++++++
>>>  lib/fuse_lowlevel.c     |   97 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  lib/fuse_versionscript  |    2 +
>>>  3 files changed, 136 insertions(+)
> 
> <snip>
> 
>>> diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
>>> index ec30ebc4cdd074..8eeb6a8547da91 100644
>>> --- a/lib/fuse_lowlevel.c
>>> +++ b/lib/fuse_lowlevel.c
>>> @@ -144,6 +144,43 @@ static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
>>>         ST_CTIM_NSEC_SET(stbuf, attr->ctimensec);
>>>  }
>>>
>>> +#ifdef STATX_BASIC_STATS
>>> +static int convert_statx(struct fuse_statx *stbuf, const struct statx *stx,
>>> +                        size_t size)
>>> +{
>>> +       if (sizeof(struct statx) != size)
>>> +               return EOPNOTSUPP;
>>> +
>>> +       stbuf->mask = stx->stx_mask & (STATX_BASIC_STATS | STATX_BTIME);
>>> +       stbuf->blksize          = stx->stx_blksize;
>>> +       stbuf->attributes       = stx->stx_attributes;
>>> +       stbuf->nlink            = stx->stx_nlink;
>>> +       stbuf->uid              = stx->stx_uid;
>>> +       stbuf->gid              = stx->stx_gid;
>>> +       stbuf->mode             = stx->stx_mode;
>>> +       stbuf->ino              = stx->stx_ino;
>>> +       stbuf->size             = stx->stx_size;
>>> +       stbuf->blocks           = stx->stx_blocks;
>>> +       stbuf->attributes_mask  = stx->stx_attributes_mask;
>>> +       stbuf->rdev_major       = stx->stx_rdev_major;
>>> +       stbuf->rdev_minor       = stx->stx_rdev_minor;
>>> +       stbuf->dev_major        = stx->stx_dev_major;
>>> +       stbuf->dev_minor        = stx->stx_dev_minor;
>>> +
>>> +       stbuf->atime.tv_sec     = stx->stx_atime.tv_sec;
>>> +       stbuf->btime.tv_sec     = stx->stx_btime.tv_sec;
>>> +       stbuf->ctime.tv_sec     = stx->stx_ctime.tv_sec;
>>> +       stbuf->mtime.tv_sec     = stx->stx_mtime.tv_sec;
>>> +
>>> +       stbuf->atime.tv_nsec    = stx->stx_atime.tv_nsec;
>>> +       stbuf->btime.tv_nsec    = stx->stx_btime.tv_nsec;
>>> +       stbuf->ctime.tv_nsec    = stx->stx_ctime.tv_nsec;
>>> +       stbuf->mtime.tv_nsec    = stx->stx_mtime.tv_nsec;
>>> +
>>> +       return 0;
>>> +}
>>> +#endif
>>> +
>>
>> Why is this conversion not needed in the merged version?
>> What am I missing?
> 
> The patch in upstream memcpy's struct statx to struct fuse_statx:
> 
> 	memset(&arg, 0, sizeof(arg));
> 	arg.flags = flags;
> 	arg.attr_valid = calc_timeout_sec(attr_timeout);
> 	arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> 	memcpy(&arg.stat, statx, sizeof(arg.stat));
> 
> As long as the fields in the two are kept exactly in sync, this isn't a
> problem and no explicit struct conversion is necessary.
> 
> I also noticed that the !HAVE_STATX variant of _do_statx doesn't call
> fuse_reply_err(req, ENOSYS).  I think that means a new kernel calling
> an old userspace would never receive a reply to a FUSE_STATX command
> and ... time out?
> 
> My version also has explicit sizing of struct statx, but I concede that
> if that struct ever gets bigger we're going to have to rev the whole
> syscall anyway.  I was being perhaps a bit paranoid.
> 
> BTW, where are libfuse patches reviewed?  I guess all the review are
> done via github PRs?

Yeah, the typical procedure is a GitHub PR. If you prefer to post these
complex patches here, that is fine with me, especially if others like Amir
are going to review :)


Thanks,
Bernd





* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
  2025-07-17 23:27   ` [PATCH 3/7] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
@ 2025-07-18 17:10     ` Bernd Schubert
  2025-07-18 18:13       ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 17:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, miklos, joannelkoong



On 7/18/25 01:27, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> The fuse_request_{send,end} tracepoints capture the value of
> req->in.h.unique in the trace output.  It would be really nice if we
> could use this to match a request to its response for debugging and
> latency analysis, but the call to trace_fuse_request_send occurs before
> the unique id has been set:
> 
> fuse_request_send:    connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
> fuse_request_end:     connection 8388608 req 6 len 16 error -2
> 
> Move the callsites to trace_fuse_request_send to after the unique id has
> been set, or right before we decide to cancel a request having not set
> one.

Sorry, my fault, I have a branch for that already. I was just busy and
then never sent v4.

https://lore.kernel.org/all/20250403-fuse-io-uring-trace-points-v3-0-35340aa31d9c@ddn.com/

The updated branch is here

https://github.com/bsbernd/linux/commits/fuse-io-uring-trace-points/

Any objections if we go with that version? It adds a few more tracepoints
and removes the lock to get the unique ID.
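
For illustration only, here is one way a unique ID could be handed out
without a lock, using C11 atomics in userspace. This is not taken from the
branch above; the demo_* names and the step of 2 (which keeps bit 0 free
for use as a flag) are assumptions:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed: IDs advance in steps of 2, leaving bit 0 free as a flag. */
#define DEMO_REQ_ID_STEP	2ULL

/* Hypothetical input queue holding only the request counter. */
struct demo_iqueue {
	_Atomic uint64_t reqctr;
};

/* Hand out the next request ID with a single atomic add, no lock taken. */
static uint64_t demo_get_unique(struct demo_iqueue *fiq)
{
	return atomic_fetch_add(&fiq->reqctr, DEMO_REQ_ID_STEP) +
	       DEMO_REQ_ID_STEP;
}
```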

Thanks,
Bernd



* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-18 16:37     ` Bernd Schubert
@ 2025-07-18 17:50       ` Joanne Koong
  2025-07-18 17:57         ` Bernd Schubert
  2025-07-18 18:07       ` Bernd Schubert
  1 sibling, 1 reply; 174+ messages in thread
From: Joanne Koong @ 2025-07-18 17:50 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Darrick J. Wong, linux-fsdevel, neal, John, miklos

On Fri, Jul 18, 2025 at 9:37 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>
> On 7/18/25 01:26, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > +/*
> > + * Flush all pending requests and wait for them.  Only call this function when
> > + * it is no longer possible for other threads to add requests.
> > + */
> > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
>
> I wonder if this should have "abort" in its name. Because it is not a
> simple flush attempt, but also sets fc->blocked and fc->max_background.
>
> > +{
> > +     unsigned long deadline;
> > +
> > +     spin_lock(&fc->lock);
> > +     if (!fc->connected) {
> > +             spin_unlock(&fc->lock);
> > +             return;
> > +     }
> > +
> > +     /* Push all the background requests to the queue. */
> > +     spin_lock(&fc->bg_lock);
> > +     fc->blocked = 0;
> > +     fc->max_background = UINT_MAX;
> > +     flush_bg_queue(fc);
> > +     spin_unlock(&fc->bg_lock);
> > +     spin_unlock(&fc->lock);
> > +
> > +     /*
> > +      * Wait 30s for all the events to complete or abort.  Touch the
> > +      * watchdog once per second so that we don't trip the hangcheck timer
> > +      * while waiting for the fuse server.
> > +      */
> > +     deadline = jiffies + timeout;
> > +     smp_mb();
> > +     while (fc->connected &&
> > +            (!timeout || time_before(jiffies, deadline)) &&
> > +            wait_event_timeout(fc->blocked_waitq,
> > +                     !fc->connected || atomic_read(&fc->num_waiting) == 0,
> > +                     HZ) == 0)
> > +             touch_softlockup_watchdog();
> > +}
> > +
> >  /*
> >   * Abort all requests.
> >   *
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 9572bdef49eecc..1734c263da3a77 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -2047,6 +2047,7 @@ void fuse_conn_destroy(struct fuse_mount *fm)
> >  {
> >       struct fuse_conn *fc = fm->fc;
> >
> > +     fuse_flush_requests(fc, 30 * HZ);
>
> I think fc->connected should be set to 0, to avoid that new requests can
> be allocated.

The fuse_abort_conn() logic is gated on "if (fc->connected)", so I think
fc->connected can only get set to 0 within fuse_abort_conn().
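
A toy model of why that gating matters, with hypothetical demo_* names and
none of the real locking: if something else cleared the flag first, the
abort path would return early and skip its cleanup.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified connection state. */
struct demo_conn {
	bool connected;
	int queued;	/* requests still waiting for a reply */
	int ended;	/* requests ended by the abort path */
};

/* Toy abort: only a call on a still-connected connection does any work. */
static void demo_abort_conn(struct demo_conn *fc)
{
	if (!fc->connected)
		return;
	fc->connected = false;
	fc->ended += fc->queued;
	fc->queued = 0;
}
```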


Thanks,
Joanne


* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-18 17:50       ` Joanne Koong
@ 2025-07-18 17:57         ` Bernd Schubert
  2025-07-18 18:38           ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 17:57 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Darrick J. Wong, linux-fsdevel, neal, John, miklos



On 7/18/25 19:50, Joanne Koong wrote:
> On Fri, Jul 18, 2025 at 9:37 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>>
>> On 7/18/25 01:26, Darrick J. Wong wrote:
>>> From: Darrick J. Wong <djwong@kernel.org>
>>>
>>> +/*
>>> + * Flush all pending requests and wait for them.  Only call this function when
>>> + * it is no longer possible for other threads to add requests.
>>> + */
>>> +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
>>
>> I wonder if this should have "abort" in its name. Because it is not a
>> simple flush attempt, but also sets fc->blocked and fc->max_background.
>>
>>> +{
>>> +     unsigned long deadline;
>>> +
>>> +     spin_lock(&fc->lock);
>>> +     if (!fc->connected) {
>>> +             spin_unlock(&fc->lock);
>>> +             return;
>>> +     }
>>> +
>>> +     /* Push all the background requests to the queue. */
>>> +     spin_lock(&fc->bg_lock);
>>> +     fc->blocked = 0;
>>> +     fc->max_background = UINT_MAX;
>>> +     flush_bg_queue(fc);
>>> +     spin_unlock(&fc->bg_lock);
>>> +     spin_unlock(&fc->lock);
>>> +
>>> +     /*
>>> +      * Wait 30s for all the events to complete or abort.  Touch the
>>> +      * watchdog once per second so that we don't trip the hangcheck timer
>>> +      * while waiting for the fuse server.
>>> +      */
>>> +     deadline = jiffies + timeout;
>>> +     smp_mb();
>>> +     while (fc->connected &&
>>> +            (!timeout || time_before(jiffies, deadline)) &&
>>> +            wait_event_timeout(fc->blocked_waitq,
>>> +                     !fc->connected || atomic_read(&fc->num_waiting) == 0,
>>> +                     HZ) == 0)
>>> +             touch_softlockup_watchdog();
>>> +}
>>> +
>>>  /*
>>>   * Abort all requests.
>>>   *
>>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>>> index 9572bdef49eecc..1734c263da3a77 100644
>>> --- a/fs/fuse/inode.c
>>> +++ b/fs/fuse/inode.c
>>> @@ -2047,6 +2047,7 @@ void fuse_conn_destroy(struct fuse_mount *fm)
>>>  {
>>>       struct fuse_conn *fc = fm->fc;
>>>
>>> +     fuse_flush_requests(fc, 30 * HZ);
>>
>> I think fc->connected should be set to 0, to avoid that new requests can
>> be allocated.
> 
> fuse_abort_conn() logic is gated on "if (fc->connected)" so I think
> fc->connected can only get set to 0 within fuse_abort_conn()

Hmm yeah, I wonder if we should allow multiple values in there.  For
example, fuse_abort_conn() could set (and check for) UINT64_MAX, and
other functions could set values in between?  We could add another
variable, but given that fc->connected is checked on every request
allocation, it might be better to avoid too many conditions.


Thanks,
Bernd



* Re: [PATCH 06/13] fuse: implement buffered IO with iomap
  2025-07-18 15:10     ` Amir Goldstein
@ 2025-07-18 18:01       ` Darrick J. Wong
  2025-07-18 18:39         ` Bernd Schubert
  2025-07-18 19:45         ` Amir Goldstein
  0 siblings, 2 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 18:01 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

On Fri, Jul 18, 2025 at 05:10:14PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 1:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Implement pagecache IO with iomap, complete with hooks into truncate and
> > fallocate so that the fuse server needn't implement disk block zeroing
> > of post-EOF and unaligned punch/zero regions.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  fs/fuse/fuse_i.h          |   46 +++
> >  fs/fuse/fuse_trace.h      |  391 ++++++++++++++++++++++++
> >  include/uapi/linux/fuse.h |    5
> >  fs/fuse/dir.c             |   23 +
> >  fs/fuse/file.c            |   90 +++++-
> >  fs/fuse/file_iomap.c      |  723 +++++++++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/inode.c           |   14 +
> >  7 files changed, 1268 insertions(+), 24 deletions(-)
> >
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 67e428da4391aa..f33b348d296d5e 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -161,6 +161,13 @@ struct fuse_inode {
> >
> >                         /* waitq for direct-io completion */
> >                         wait_queue_head_t direct_io_waitq;
> > +
> > +#ifdef CONFIG_FUSE_IOMAP
> > +                       /* pending io completions */
> > +                       spinlock_t ioend_lock;
> > +                       struct work_struct ioend_work;
> > +                       struct list_head ioend_list;
> > +#endif
> >                 };
> 
> This union member you are changing is declared for
> /* read/write io cache (regular file only) */
> but actually it is also for parallel dio and passthrough mode
> 
> IIUC, there should be zero intersection between these io modes and
>  /* iomap cached fileio (regular file only) */
> 
> Right?

Right.  iomap will get very very confused if you switch file IO paths on
a live file.  I think it's /possible/ to switch if you flush and
truncate the whole page cache while holding inode_lock() but I don't
think anyone has ever tried.

> So it can use its own union member without increasing fuse_inode size.
> 
> Just need to be careful in fuse_init_file_inode(), fuse_evict_inode() and
> fuse_file_io_release() which do not assume a specific inode io mode.

Yes, I think it's possible to put the iomap stuff in a separate struct
within that union so that we're not increasing the fuse_inode size
unnecessarily.  That's worth doing before merging, but for now
prototyping is /much/ easier if I don't have to do that.

Making that change will require a lot of careful auditing.  First I want
to make sure you all agree with the iomap approach because it's much
different from what I see in the other fuse IO paths. :)

Eeeyiks, struct fuse_inode shrinks from 1272 bytes to 1152 if I push the
iomap stuff into its own union struct.
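
A minimal userspace sketch of why the separate union member helps.  This
is not the real struct fuse_inode: every demo_* name and every field size
below is made up for illustration, so it will not reproduce the 1272 and
1152 byte figures.  It only shows that a union costs as much as its
largest member, whereas appending fields to one member grows that member.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's list_head; sizes here are illustrative only. */
struct demo_list_head { struct demo_list_head *next, *prev; };

/* Hypothetical: state used by the existing cached/direct IO paths. */
struct demo_cached_io {
	struct demo_list_head write_files;
	struct demo_list_head queued_writes;
	int writectr;
	char direct_io_waitq[24];
};

/* Hypothetical: state used only by the iomap IO path. */
struct demo_iomap_io {
	int ioend_lock;
	char ioend_work[32];
	struct demo_list_head ioend_list;
};

/* Layout A: iomap fields appended to the existing union member. */
struct demo_inode_appended {
	unsigned long nodeid;
	union {
		struct {
			struct demo_cached_io cached;
			struct demo_iomap_io iomap;
		} file;
		struct { unsigned long rdc_size; } dir;
	} u;
};

/* Layout B: iomap fields live in their own union member. */
struct demo_inode_separate {
	unsigned long nodeid;
	union {
		struct demo_cached_io cached;
		struct demo_iomap_io iomap;
		struct { unsigned long rdc_size; } dir;
	} u;
};

/* Bytes saved by layout B relative to layout A. */
size_t demo_bytes_saved(void)
{
	return sizeof(struct demo_inode_appended) -
	       sizeof(struct demo_inode_separate);
}
```

Layout B is only safe if an inode never uses the cached and iomap state
at the same time, which is the "zero intersection" property discussed
above.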

> Was it your intention to allow filesystems to configure some inodes to be
> in file_iomap mode and other inodes to be in regular cached/direct/passthrough
> io modes?

That was a deliberate design decision on my part -- maybe a fuse server
would be capable of serving up some files from a local disk, and others
from (say) a network filesystem.  Or maybe it would like to expose an
administrative fd for the filesystem (like the xfs_healer event stream)
that isn't backed by storage.

> I can't say that I see a big benefit in allowing such setups.
> It certainly adds a lot of complication to the test matrix if we allow that.
> My instinct is that for the initial version, we only allow opening files
> in FILE_IOMAP or DIRECT_IOMAP mode for inodes of a filesystem that
> supports those modes.

I was thinking about combining FUSE_ATTR_IOMAP_(DIRECTIO|FILEIO) for the
next RFC because I can't imagine any scenario where you don't want
directio support if you already use iomap for the pagecache.  fuse iomap
requires directio write support for writeback, so the server *must*
support IOMAP_WRITE|IOMAP_DIRECT.

> Perhaps later we can allow (and maybe fallback to) FOPEN_DIRECT_IO
> (without parallel dio) if a server does not configure IOMAP to some inode
> to allow a server to provide the data for a specific inode directly.

Hrmm.  Is FOPEN_DIRECT_IO the magic flag that fuse passes to the fuse
server to tell it that a file is open in directio mode?  There are a few
fstests that initiate aio+dio writes to a dm-error device that currently
fail in non-iomap mode because fuse2fs writes everything to the bdev
pagecache.

> fuse_file_io_open/release() can help you manage those restrictions and
> set ff->iomode = IOM_FILE_IOMAP when a file is opened for file iomap.
> I did not look closely enough to see if file_iomap code ends up setting
> ff->iomode = IOM_CACHED/UNCACHED or always remains IOM_NONE.

I don't touch ff->iomode because iomap is a per-inode property, not a
per-file property... but I suppose that would be a good place to look.

Thank you for the feedback!

--D


* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-18 16:37     ` Bernd Schubert
  2025-07-18 17:50       ` Joanne Koong
@ 2025-07-18 18:07       ` Bernd Schubert
  2025-07-18 18:13         ` Bernd Schubert
  2025-07-18 19:34         ` Darrick J. Wong
  1 sibling, 2 replies; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 18:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, neal, John, miklos, joannelkoong, Horst Birthelmer

[-- Attachment #1: Type: text/plain, Size: 845 bytes --]


> 
> Please see the two attached patches, which are needed for fuse-io-uring.
> I can also send them separately, if you prefer.

We (actually Horst) are just testing it, as Horst sees failing xfs tests in
our branch with tmp page removal.

Patch 2 needs this addition (might be more, as I didn't test). 
I had it in first, but then split the patch and missed that.

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index eca457d1005e..acf11eadbf3b 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -123,6 +123,9 @@ void fuse_uring_flush_bg(struct fuse_conn *fc)
        struct fuse_ring_queue *queue;
        struct fuse_ring *ring = fc->ring;
 
+       if (!ring)
+               return;
+
        for (qid = 0; qid < ring->nr_queues; qid++) {
                queue = READ_ONCE(ring->queues[qid]);
                if (!queue)



[-- Attachment #2: 01-flush-io-uring-queue --]
[-- Type: text/plain, Size: 3552 bytes --]

fuse: Refactor io-uring bg queue flush and queue abort

From: Bernd Schubert <bschubert@ddn.com>

This is a preparation to allow fuse-io-uring bg queue
flush from flush_bg_queue()

This does two function renames:
fuse_uring_flush_bg -> fuse_uring_flush_queue_bg
fuse_uring_abort_end_requests -> fuse_uring_flush_bg

And fuse_uring_abort_end_queue_requests() is moved to
fuse_uring_stop_queues().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c   |   14 +++++++-------
 fs/fuse/dev_uring_i.h |    4 ++--
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 249b210becb1..eca457d1005e 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -47,7 +47,7 @@ static struct fuse_ring_ent *uring_cmd_to_ring_ent(struct io_uring_cmd *cmd)
 	return pdu->ent;
 }
 
-static void fuse_uring_flush_bg(struct fuse_ring_queue *queue)
+static void fuse_uring_flush_queue_bg(struct fuse_ring_queue *queue)
 {
 	struct fuse_ring *ring = queue->ring;
 	struct fuse_conn *fc = ring->fc;
@@ -88,7 +88,7 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
 	if (test_bit(FR_BACKGROUND, &req->flags)) {
 		queue->active_background--;
 		spin_lock(&fc->bg_lock);
-		fuse_uring_flush_bg(queue);
+		fuse_uring_flush_queue_bg(queue);
 		spin_unlock(&fc->bg_lock);
 	}
 
@@ -117,11 +117,11 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
 	fuse_dev_end_requests(&req_list);
 }
 
-void fuse_uring_abort_end_requests(struct fuse_ring *ring)
+void fuse_uring_flush_bg(struct fuse_conn *fc)
 {
 	int qid;
 	struct fuse_ring_queue *queue;
-	struct fuse_conn *fc = ring->fc;
+	struct fuse_ring *ring = fc->ring;
 
 	for (qid = 0; qid < ring->nr_queues; qid++) {
 		queue = READ_ONCE(ring->queues[qid]);
@@ -133,10 +133,9 @@ void fuse_uring_abort_end_requests(struct fuse_ring *ring)
 		WARN_ON_ONCE(ring->fc->max_background != UINT_MAX);
 		spin_lock(&queue->lock);
 		spin_lock(&fc->bg_lock);
-		fuse_uring_flush_bg(queue);
+		fuse_uring_flush_queue_bg(queue);
 		spin_unlock(&fc->bg_lock);
 		spin_unlock(&queue->lock);
-		fuse_uring_abort_end_queue_requests(queue);
 	}
 }
 
@@ -475,6 +474,7 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
 		if (!queue)
 			continue;
 
+		fuse_uring_abort_end_queue_requests(queue);
 		fuse_uring_teardown_entries(queue);
 	}
 
@@ -1326,7 +1326,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 	fc->num_background++;
 	if (fc->num_background == fc->max_background)
 		fc->blocked = 1;
-	fuse_uring_flush_bg(queue);
+	fuse_uring_flush_queue_bg(queue);
 	spin_unlock(&fc->bg_lock);
 
 	/*
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 51a563922ce1..55f52508de3c 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -138,7 +138,7 @@ struct fuse_ring {
 bool fuse_uring_enabled(void);
 void fuse_uring_destruct(struct fuse_conn *fc);
 void fuse_uring_stop_queues(struct fuse_ring *ring);
-void fuse_uring_abort_end_requests(struct fuse_ring *ring);
+void fuse_uring_flush_bg(struct fuse_conn *fc);
 int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req);
 bool fuse_uring_queue_bq_req(struct fuse_req *req);
@@ -153,7 +153,7 @@ static inline void fuse_uring_abort(struct fuse_conn *fc)
 		return;
 
 	if (atomic_read(&ring->queue_refs) > 0) {
-		fuse_uring_abort_end_requests(ring);
+		fuse_uring_flush_bg(fc);
 		fuse_uring_stop_queues(ring);
 	}
 }

[-- Attachment #3: 02-flush-uring-bg --]
[-- Type: text/plain, Size: 1984 bytes --]

fuse: Flush the io-uring bg queue from fuse_uring_flush_bg

From: Bernd Schubert <bschubert@ddn.com>

This is useful to have a single API to flush background requests,
for example when the bg queue gets flushed before
the remainder of fuse_conn_destroy().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c         |    2 ++
 fs/fuse/dev_uring.c   |    3 +++
 fs/fuse/dev_uring_i.h |   10 +++++++---
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5387e4239d6a..3f5f168cc28a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -426,6 +426,8 @@ static void flush_bg_queue(struct fuse_conn *fc)
 		fc->active_background++;
 		fuse_send_one(fiq, req);
 	}
+
+	fuse_uring_flush_bg(fc);
 }
 
 /*
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index eca457d1005e..acf11eadbf3b 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -123,6 +123,9 @@ void fuse_uring_flush_bg(struct fuse_conn *fc)
 	struct fuse_ring_queue *queue;
 	struct fuse_ring *ring = fc->ring;
 
+	if (!ring)
+		return;
+
 	for (qid = 0; qid < ring->nr_queues; qid++) {
 		queue = READ_ONCE(ring->queues[qid]);
 		if (!queue)
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 55f52508de3c..fca2184e8d94 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -152,10 +152,10 @@ static inline void fuse_uring_abort(struct fuse_conn *fc)
 	if (ring == NULL)
 		return;
 
-	if (atomic_read(&ring->queue_refs) > 0) {
-		fuse_uring_flush_bg(fc);
+	/* Assumes bg queues were already flushed before */
+
+	if (atomic_read(&ring->queue_refs) > 0)
 		fuse_uring_stop_queues(ring);
-	}
 }
 
 static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
@@ -206,6 +206,10 @@ static inline bool fuse_uring_request_expired(struct fuse_conn *fc)
 	return false;
 }
 
+static inline void fuse_uring_flush_bg(struct fuse_conn *fc)
+{
+}
+
 #endif /* CONFIG_FUSE_IO_URING */
 
 #endif /* _FS_FUSE_DEV_URING_I_H */


* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
  2025-07-18 17:10     ` Bernd Schubert
@ 2025-07-18 18:13       ` Darrick J. Wong
  2025-07-22 22:20         ` Bernd Schubert
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 18:13 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-fsdevel, neal, John, miklos, joannelkoong

On Fri, Jul 18, 2025 at 07:10:37PM +0200, Bernd Schubert wrote:
> 
> 
> On 7/18/25 01:27, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > The fuse_request_{send,end} tracepoints capture the value of
> > req->in.h.unique in the trace output.  It would be really nice if we
> > could use this to match a request to its response for debugging and
> > latency analysis, but the call to trace_fuse_request_send occurs before
> > the unique id has been set:
> > 
> > fuse_request_send:    connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
> > fuse_request_end:     connection 8388608 req 6 len 16 error -2
> > 
> > Move the callsites to trace_fuse_request_send to after the unique id has
> > been set, or right before we decide to cancel a request having not set
> > one.
> 
> Sorry, my fault, I have a branch for that already.  I was just busy and
> then didn't send v4.
> 
> https://lore.kernel.org/all/20250403-fuse-io-uring-trace-points-v3-0-35340aa31d9c@ddn.com/

(Aha, that was before I started paying attention to the fuse patches on
fsdevel.)

> The updated branch is here
> 
> https://github.com/bsbernd/linux/commits/fuse-io-uring-trace-points/
> 
> Any objections if we go with that version?  It adds a few more
> tracepoints and removes the lock to get the unique ID.

Let me look through the branch --

 * fuse: Make the fuse unique value a per-cpu counter

Is there any reason you didn't use percpu_counter_init() ?  It does the
same per-cpu batching that (I think) your version does.

 * fuse: Set request unique on allocation
 * fuse: {io-uring} Avoid _send code dup

Looks good,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

 * fuse: fine-grained request ftraces

Are these three new tracepoints exactly identical except in name?
If you declare an event class for them, that will save a lot of memory
(~5K per tracepoint according to rostedt) over defining them
individually.

 * per cpu cntr fix

I think you can avoid this if you use the kernel struct percpu_counter.

--D


* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-18 18:07       ` Bernd Schubert
@ 2025-07-18 18:13         ` Bernd Schubert
  2025-07-18 19:34         ` Darrick J. Wong
  1 sibling, 0 replies; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 18:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, neal, John, miklos, joannelkoong, Horst Birthelmer



On 7/18/25 20:07, Bernd Schubert wrote:
> 
>>
>> Please see the two attached patches, which are needed for fuse-io-uring.
>> I can also send them separately, if you prefer.
> 
> We (actually Horst) are just testing it, as Horst sees failing xfs tests in
> our branch with tmp page removal.
> 
> Patch 2 needs this addition (might be more, as I didn't test). 
> I had it in first, but then split the patch and missed that.
> 
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index eca457d1005e..acf11eadbf3b 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -123,6 +123,9 @@ void fuse_uring_flush_bg(struct fuse_conn *fc)
>         struct fuse_ring_queue *queue;
>         struct fuse_ring *ring = fc->ring;
>  
> +       if (!ring)
> +               return;
> +
>         for (qid = 0; qid < ring->nr_queues; qid++) {
>                 queue = READ_ONCE(ring->queues[qid]);
>                 if (!queue)


More changes are needed: we don't want to iterate over all queues in
fuse_request_end, since dev_uring.c already handles the queue that ends
a request.


* Re: [PATCH 1/1] libfuse: enable iomap cache management
  2025-07-18 16:16     ` Bernd Schubert
@ 2025-07-18 18:22       ` Darrick J. Wong
  2025-07-18 18:35         ` Bernd Schubert
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 18:22 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, bernd@bsbernd.com, neal@gompa.dev,
	miklos@szeredi.hu

On Fri, Jul 18, 2025 at 04:16:28PM +0000, Bernd Schubert wrote:
> On 7/18/25 01:38, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add the library methods so that fuse servers can manage an in-kernel
> > iomap cache.  This enables better performance on small IOs and is
> > required if the filesystem needs synchronization between pagecache
> > writes and writeback.
> 
> Sorry, is this ready to be merged?  I don't see it in linux master.  Or
> is it part of your other patches (it will take some time to go through
> these)?

No, everything you see in here is all RFC status and not for merging.
We're past -rc6, it's far too late to be trying to get anything new
merged in the kernel.

Though I say that as a former iomap maintainer who wouldn't take big
core code changes after -rc4 or XFS changes after -rc6.  I think I was
much more conservative about that than most maintainers. :)

(The cover letter yells very loudly about do not merge any of this,
btw.)

> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  include/fuse_common.h   |    9 +++++
> >  include/fuse_kernel.h   |   34 +++++++++++++++++++
> >  include/fuse_lowlevel.h |   39 ++++++++++++++++++++++
> >  lib/fuse_lowlevel.c     |   82 +++++++++++++++++++++++++++++++++++++++++++++++
> >  lib/fuse_versionscript  |    2 +
> >  5 files changed, 166 insertions(+)
> > 
> > 
> > diff --git a/include/fuse_common.h b/include/fuse_common.h
> > index 98cb8f656efd13..1237cc2656b9c4 100644
> > --- a/include/fuse_common.h
> > +++ b/include/fuse_common.h
> > @@ -1164,6 +1164,7 @@ int fuse_convert_to_conn_want_ext(struct fuse_conn_info *conn);
> >   */
> >  #if FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
> >  #define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
> > +#define FUSE_IOMAP_TYPE_NULL		(0xFFFE) /* no mapping here */
> >  #define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
> >  #define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
> >  #define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
> > @@ -1208,6 +1209,11 @@ struct fuse_iomap {
> >  	uint32_t dev;		/* device cookie */
> >  };
> >  
> > +struct fuse_iomap_inval {
> > +	uint64_t offset;	/* file offset to invalidate, bytes */
> > +	uint64_t length;	/* length to invalidate, bytes */
> > +};
> > +
> >  /* out of place write extent */
> >  #define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
> >  /* unwritten extent */
> > @@ -1258,6 +1264,9 @@ struct fuse_iomap_config{
> >  	int64_t s_maxbytes;	/* max file size */
> >  };
> >  
> > +/* invalidate to end of file */
> > +#define FUSE_IOMAP_INVAL_TO_EOF		(~0ULL)
> > +
> >  #endif /* FUSE_USE_VERSION >= 318 */
> >  
> >  /* ----------------------------------------------------------- *
> > diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
> > index 3c704f03434693..f1a93dbd1ff443 100644
> > --- a/include/fuse_kernel.h
> > +++ b/include/fuse_kernel.h
> > @@ -243,6 +243,8 @@
> >   *  - add FUSE_IOMAP_DIRECTIO/FUSE_ATTR_IOMAP_DIRECTIO for direct I/O support
> >   *  - add FUSE_IOMAP_FILEIO/FUSE_ATTR_IOMAP_FILEIO for buffered I/O support
> >   *  - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
> > + *  - add FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL so fuse servers
> > + *    can cache iomappings in the kernel
> 
> 
> Personally I prefer a preparation patch, that just syncs the entire
> fuse_kernel.h from linux-<version>.

<nod>

>                                     Also this file might get renamed to
> fuse_kernel_linux.h, there seems to be interest from BSD and OSX to have
> their own headers.

That's a good idea.

> >   */
> >  
> >  #ifndef _LINUX_FUSE_H
> > @@ -699,6 +701,8 @@ enum fuse_notify_code {
> >  	FUSE_NOTIFY_DELETE = 6,
> >  	FUSE_NOTIFY_RESEND = 7,
> >  	FUSE_NOTIFY_INC_EPOCH = 8,
> > +	FUSE_NOTIFY_IOMAP_UPSERT = 9,
> > +	FUSE_NOTIFY_IOMAP_INVAL = 10,
> >  	FUSE_NOTIFY_CODE_MAX,
> >  };
> >  
> > @@ -1406,4 +1410,34 @@ struct fuse_iomap_config_out {
> >  	int64_t s_maxbytes;	/* max file size */
> >  };
> >  
> > +struct fuse_iomap_upsert_out {
> > +	uint64_t nodeid;	/* Inode ID */
> > +	uint64_t attr_ino;	/* matches fuse_attr:ino */
> > +
> > +	uint64_t read_offset;	/* file offset of mapping, bytes */
> > +	uint64_t read_length;	/* length of mapping, bytes */
> > +	uint64_t read_addr;	/* disk offset of mapping, bytes */
> > +	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
> > +	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
> > +	uint32_t read_dev;	/* device cookie */
> > +
> > +	uint64_t write_offset;	/* file offset of mapping, bytes */
> > +	uint64_t write_length;	/* length of mapping, bytes */
> > +	uint64_t write_addr;	/* disk offset of mapping, bytes */
> > +	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
> > +	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
> > +	uint32_t write_dev;	/* device cookie * */
> > +};
> > +
> > +struct fuse_iomap_inval_out {
> > +	uint64_t nodeid;	/* Inode ID */
> > +	uint64_t attr_ino;	/* matches fuse_attr:ino */
> > +
> > +	uint64_t read_offset;	/* range to invalidate read iomaps, bytes */
> > +	uint64_t read_length;	/* can be FUSE_IOMAP_INVAL_TO_EOF */
> > +
> > +	uint64_t write_offset;	/* range to invalidate write iomaps, bytes */
> > +	uint64_t write_length;	/* can be FUSE_IOMAP_INVAL_TO_EOF */
> > +};
> > +
> >  #endif /* _LINUX_FUSE_H */
> > diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
> > index fd7df5c2c11e16..f690c62fcdd61c 100644
> > --- a/include/fuse_lowlevel.h
> > +++ b/include/fuse_lowlevel.h
> > @@ -2101,6 +2101,45 @@ int fuse_lowlevel_notify_retrieve(struct fuse_session *se, fuse_ino_t ino,
> >   * @return positive device id for success, zero for failure
> >   */
> >  int fuse_iomap_add_device(struct fuse_session *se, int fd, unsigned int flags);
> > +
> > +/**
> > + * Upsert some file mapping information into the kernel.  This is necessary
> > + * for filesystems that require coordination of mapping state changes between
> > + * buffered writes and writeback, and desirable for better performance
> > + * elsewhere.
> > + *
> > + * Added in FUSE protocol version 7.99. If the kernel does not support
> 
> 7.99?

I set the minor versions to 99 and just today did the same thing for
libfuse itself ("3.99") to make it obvious where all the code changes
lie.  When these patches are ready for merging I'll rework them to pick
up whatever version of libfuse is current.

Doing so reduces rebasing collisions when others' ABI changes get merged
upstream.  I've found it a useful trick/crutch for a patchset that I
think is going to take a long time to get integrated.  See previous
comments about being a former XFS maintainer. ;)
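
A small sketch of how the placeholder minor version interacts with the
version gate.  It assumes the same arithmetic that the quoted header
implies (FUSE_MAKE_VERSION(3, 18) is commented as 318); the DEMO_* macros
and demo_* functions are made-up names so that this compiles without the
libfuse headers.

```c
#include <assert.h>

/* Assumed to mirror libfuse's version arithmetic: major * 100 + minor. */
#define DEMO_MAKE_VERSION(maj, min)	((maj) * 100 + (min))

/* Hypothetical: the version an out-of-tree prototype would request. */
#define DEMO_USE_VERSION		DEMO_MAKE_VERSION(3, 99)

/* Prototype-only API is gated on the placeholder version. */
#if DEMO_USE_VERSION >= DEMO_MAKE_VERSION(3, 99)
# define DEMO_HAVE_IOMAP_CACHE	1
#else
# define DEMO_HAVE_IOMAP_CACHE	0
#endif

/* Returns 1 if the prototype-only API is compiled in. */
int demo_have_iomap_cache(void)
{
	return DEMO_HAVE_IOMAP_CACHE;
}

/* Returns nonzero if the placeholder sorts after release maj.min. */
int demo_placeholder_is_newer(int maj, int min)
{
	return DEMO_USE_VERSION > DEMO_MAKE_VERSION(maj, min);
}
```

Because 99 sorts after any realistic 3.x minor, ABI additions merged by
others in the meantime land below the placeholder and don't collide with
it on rebase.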

--D

> 
> 
> Thanks,
> Bernd


* Re: [PATCH 1/1] libfuse: enable iomap cache management
  2025-07-18 18:22       ` Darrick J. Wong
@ 2025-07-18 18:35         ` Bernd Schubert
  2025-07-18 18:40           ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 18:35 UTC (permalink / raw)
  To: Darrick J. Wong, Bernd Schubert
  Cc: John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, neal@gompa.dev, miklos@szeredi.hu



On 7/18/25 20:22, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 04:16:28PM +0000, Bernd Schubert wrote:
>> On 7/18/25 01:38, Darrick J. Wong wrote:
>>> From: Darrick J. Wong <djwong@kernel.org>
>>>
>>> Add the library methods so that fuse servers can manage an in-kernel
>>> iomap cache.  This enables better performance on small IOs and is
>>> required if the filesystem needs synchronization between pagecache
>>> writes and writeback.
>>
>> Sorry, is this ready to be merged?  I don't see it in linux master.  Or
>> is it part of your other patches (it will take some time to go through
>> these)?
> 
> No, everything you see in here is all RFC status and not for merging.
> We're past -rc6, it's far too late to be trying to get anything new
> merged in the kernel.
> 
> Though I say that as a former iomap maintainer who wouldn't take big
> core code changes after -rc4 or XFS changes after -rc6.  I think I was
> much more conservative about that than most maintainers. :)
> 
> (The cover letter yells very loudly about do not merge any of this,
> btw.)


This is [PATCH 1/1] and when I wrote the mail it was not sorted in
threaded form - I didn't see a cover letter for this specific patch.
That might also be because some mails go to my ddn address and some to
my own one.  I use the DDN address for patches to give DDN credit
for the work, but fastmail provides so much better filtering - I
prefer my private address for CCs.

So I asked because I was confused about this [1/1] - it made it look
like it is ready.

Thanks,
Bernd


* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-18 17:57         ` Bernd Schubert
@ 2025-07-18 18:38           ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 18:38 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Joanne Koong, linux-fsdevel, neal, John, miklos

On Fri, Jul 18, 2025 at 07:57:15PM +0200, Bernd Schubert wrote:
> 
> 
> On 7/18/25 19:50, Joanne Koong wrote:
> > On Fri, Jul 18, 2025 at 9:37 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> >>
> >> On 7/18/25 01:26, Darrick J. Wong wrote:
> >>> From: Darrick J. Wong <djwong@kernel.org>
> >>>
> >>> +/*
> >>> + * Flush all pending requests and wait for them.  Only call this function when
> >>> + * it is no longer possible for other threads to add requests.
> >>> + */
> >>> +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> >>
> >> I wonder if this should have "abort" in its name. Because it is not a
> >> simple flush attempt, but also sets fc->blocked and fc->max_background.

I don't want to abort the connection here, because later I'll use this
same function to flush pending commands before sending a syncfs to the
fuse server and waiting for that as well:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=d4936bb06a81886de844f089ded85f1461b41e59

The server is still alive and accepting requests, so what I want is
to push all the FUSE_RELEASE requests to the server so that it will
delete all the "unlinked" open files (aka the .fusehiddenXXXX files) in
the filesystem and wait for that to complete.

> >>> +{
> >>> +     unsigned long deadline;
> >>> +
> >>> +     spin_lock(&fc->lock);
> >>> +     if (!fc->connected) {
> >>> +             spin_unlock(&fc->lock);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     /* Push all the background requests to the queue. */
> >>> +     spin_lock(&fc->bg_lock);
> >>> +     fc->blocked = 0;
> >>> +     fc->max_background = UINT_MAX;

Yeah, I was a little confused about this -- it looked like these two
lines will push all the pending background commands into the queue and
turn off max_background throttling.  That might not be optimal for
what's otherwise still a live fuse server.

All I need here is for fc->bg_queue to be empty when flush_bg_queue
returns.  I suppose I could wait in a loop, too:

	while (!list_empty(&fc->bg_queue)) {
		flush_bg_queue(fc);
		wait_event_timeout(..., fc->active_background > 0, HZ);
	}

But that's more complicated. ;)

> >>> +     flush_bg_queue(fc);
> >>> +     spin_unlock(&fc->bg_lock);
> >>> +     spin_unlock(&fc->lock);
> >>> +
> >>> +     /*
> >>> +      * Wait 30s for all the events to complete or abort.  Touch the
> >>> +      * watchdog once per second so that we don't trip the hangcheck timer
> >>> +      * while waiting for the fuse server.
> >>> +      */
> >>> +     deadline = jiffies + timeout;
> >>> +     smp_mb();
> >>> +     while (fc->connected &&
> >>> +            (!timeout || time_before(jiffies, deadline)) &&
> >>> +            wait_event_timeout(fc->blocked_waitq,
> >>> +                     !fc->connected || atomic_read(&fc->num_waiting) == 0,
> >>> +                     HZ) == 0)
> >>> +             touch_softlockup_watchdog();
> >>> +}
> >>> +
> >>>  /*
> >>>   * Abort all requests.
> >>>   *
> >>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> >>> index 9572bdef49eecc..1734c263da3a77 100644
> >>> --- a/fs/fuse/inode.c
> >>> +++ b/fs/fuse/inode.c
> >>> @@ -2047,6 +2047,7 @@ void fuse_conn_destroy(struct fuse_mount *fm)
> >>>  {
> >>>       struct fuse_conn *fc = fm->fc;
> >>>
> >>> +     fuse_flush_requests(fc, 30 * HZ);
> >>
> >> I think fc->connected should be set to 0 so that no new requests can
> >> be allocated.
> > 
> > fuse_abort_conn() logic is gated on "if (fc->connected)" so I think
> > fc->connected can only get set to 0 within fuse_abort_conn()

Keep in mind that the function says that it should not be used when
other threads can add new requests.  All current callers are in the
unmount call stack so the only thread that could add a new request is
the current one.

> Hmm yeah, I wonder if we should allow multiple values in there. Like
> fuse_abort_conn sets UINT64_MAX and checks that and other functions
> could set values in between? We could add another variable, but given
> that it is used on every request allocation might be better to avoid too
> many conditions.

<shrug> It /would/ be nifty if fuse requests were associated with an
epoch and one could wait for an epoch to complete.  But for something
that only gets called during unmount I didn't think it was worth the
extra surgery and object bloat.
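
To make the epoch idea concrete, a minimal sketch might look like the
following.  This is purely hypothetical: struct toy_epochs and the
toy_epoch_* helpers do not exist anywhere in fs/fuse, there is no
locking, and only the most recently closed epoch can be waited for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical epoch tracker, single-threaded for simplicity. */
struct toy_epochs {
	uint64_t current;	/* epoch stamped on new requests */
	uint64_t inflight[2];	/* in-flight requests, by epoch parity */
};

/* Start a request; returns the epoch to hand to toy_epoch_end(). */
static uint64_t toy_epoch_start(struct toy_epochs *te)
{
	te->inflight[te->current & 1]++;
	return te->current;
}

/* Finish a request that was started in @epoch. */
static void toy_epoch_end(struct toy_epochs *te, uint64_t epoch)
{
	assert(te->inflight[epoch & 1] > 0);
	te->inflight[epoch & 1]--;
}

/*
 * Close the current epoch so that new requests land in the next one.
 * The previously closed epoch must have drained already, because its
 * counter slot is about to be reused.
 */
static uint64_t toy_epoch_close(struct toy_epochs *te)
{
	assert(te->inflight[(te->current + 1) & 1] == 0);
	return te->current++;
}

/* Has every request from the (closed) @epoch completed? */
static bool toy_epoch_drained(const struct toy_epochs *te, uint64_t epoch)
{
	return te->inflight[epoch & 1] == 0;
}
```

An unmount path would close the epoch and then wait for
toy_epoch_drained() on it, while later requests would not hold up the
wait.  The cost is the extra per-request epoch field mentioned above.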

--D

> 
> 
> Thanks,
> Bernd
> 
> 


* Re: [PATCH 06/13] fuse: implement buffered IO with iomap
  2025-07-18 18:01       ` Darrick J. Wong
@ 2025-07-18 18:39         ` Bernd Schubert
  2025-07-18 18:46           ` Darrick J. Wong
  2025-07-18 19:45         ` Amir Goldstein
  1 sibling, 1 reply; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 18:39 UTC (permalink / raw)
  To: Darrick J. Wong, Amir Goldstein
  Cc: linux-fsdevel, neal, John, miklos, joannelkoong



On 7/18/25 20:01, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 05:10:14PM +0200, Amir Goldstein wrote:
>> On Fri, Jul 18, 2025 at 1:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
>>>
>>> From: Darrick J. Wong <djwong@kernel.org>
>>>
>>> Implement pagecache IO with iomap, complete with hooks into truncate and
>>> fallocate so that the fuse server needn't implement disk block zeroing
>>> of post-EOF and unaligned punch/zero regions.
>>>
>>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
>>> ---
>>>  fs/fuse/fuse_i.h          |   46 +++
>>>  fs/fuse/fuse_trace.h      |  391 ++++++++++++++++++++++++
>>>  include/uapi/linux/fuse.h |    5
>>>  fs/fuse/dir.c             |   23 +
>>>  fs/fuse/file.c            |   90 +++++-
>>>  fs/fuse/file_iomap.c      |  723 +++++++++++++++++++++++++++++++++++++++++++++
>>>  fs/fuse/inode.c           |   14 +
>>>  7 files changed, 1268 insertions(+), 24 deletions(-)
>>>
>>>
>>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>>> index 67e428da4391aa..f33b348d296d5e 100644
>>> --- a/fs/fuse/fuse_i.h
>>> +++ b/fs/fuse/fuse_i.h
>>> @@ -161,6 +161,13 @@ struct fuse_inode {
>>>
>>>                         /* waitq for direct-io completion */
>>>                         wait_queue_head_t direct_io_waitq;
>>> +
>>> +#ifdef CONFIG_FUSE_IOMAP
>>> +                       /* pending io completions */
>>> +                       spinlock_t ioend_lock;
>>> +                       struct work_struct ioend_work;
>>> +                       struct list_head ioend_list;
>>> +#endif
>>>                 };
>>
>> This union member you are changing is declared for
>> /* read/write io cache (regular file only) */
>> but actually it is also for parallel dio and passthrough mode
>>
>> IIUC, there should be zero intersection between these io modes and
>>  /* iomap cached fileio (regular file only) */
>>
>> Right?
> 
> Right.  iomap will get very very confused if you switch file IO paths on
> a live file.  I think it's /possible/ to switch if you flush and
> truncate the whole page cache while holding inode_lock() but I don't
> think anyone has ever tried.
> 
>> So it can use its own union member without increasing fuse_inode size.
>>
>> Just need to be careful in fuse_init_file_inode(), fuse_evict_inode() and
>> fuse_file_io_release() which do not assume a specific inode io mode.
> 
> Yes, I think it's possible to put the iomap stuff in a separate struct
> within that union so that we're not increasing the fuse_inode size
> unnecessarily.  That's a desirable thing to do before merging, but
> for now prototyping is /much/ easier if I don't have to do that.
> 
> Making that change will require a lot of careful auditing; first I want
> to make sure you all agree with the iomap approach because it's much
> different from what I see in the other fuse IO paths. :)
> 
> Eeeyiks, struct fuse_inode shrinks from 1272 bytes to 1152 if I push the
> iomap stuff into its own union struct.
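
A toy illustration of why the separate union member helps: a union is
as large as its biggest member, so a new member only grows it if that
member is the biggest one.  The structs and byte counts below are
invented for the example and are not the real fuse_inode layout.

```c
#include <stddef.h>

/* Invented stand-ins; sizes are arbitrary. */
struct toy_cache_state { char bytes[96]; };	/* existing rw cache fields */
struct toy_iomap_state { char bytes[64]; };	/* ioend lock/work/list */

/* Prototype layout: iomap fields appended to the existing member. */
union toy_appended {
	struct {
		struct toy_cache_state cache;
		struct toy_iomap_state iomap;
	} rw;
	char other[80];				/* some other union member */
};

/* Alternative: iomap fields get a union member of their own. */
union toy_separate {
	struct toy_cache_state rw;
	struct toy_iomap_state iomap;
	char other[80];
};

/* Bytes saved by the second layout. */
static size_t toy_union_saving(void)
{
	return sizeof(union toy_appended) - sizeof(union toy_separate);
}
```

This only works if the two sets of fields are never live at the same
time for one inode, which is the "zero intersection" point above.
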
> 
>> Was it your intention to allow filesystems to configure some inodes to be
>> in file_iomap mode and other inodes to be in regular cached/direct/passthrough
>> io modes?
> 
> That was a deliberate design decision on my part -- maybe a fuse server
> would be capable of serving up some files from a local disk, and others
> from (say) a network filesystem.  Or maybe it would like to expose an
> administrative fd for the filesystem (like the xfs_healer event stream)
> that isn't backed by storage.
> 
>> I can't say that I see a big benefit in allowing such setups.
>> It certainly adds a lot of complication to the test matrix if we allow that.
>> My instinct is that, for the initial version, a filesystem that supports
>> those modes should only allow opening its inodes in FILE_IOMAP or
>> DIRECT_IOMAP mode.
> 
> I was thinking about combining FUSE_ATTR_IOMAP_(DIRECTIO|FILEIO) for the
> next RFC because I can't imagine any scenario where you don't want
> directio support if you already use iomap for the pagecache.  fuse iomap
> requires directio write support for writeback, so the server *must*
> support IOMAP_WRITE|IOMAP_DIRECT.
> 
>> Perhaps later we can allow (and maybe fallback to) FOPEN_DIRECT_IO
>> (without parallel dio) if a server does not configure IOMAP to some inode
>> to allow a server to provide the data for a specific inode directly.
> 
> Hrmm.  Is FOPEN_DIRECT_IO the magic flag that fuse passes to the fuse
> server to tell it that a file is open in directio mode?  There's a few
> fstests that initiate aio+dio writes to a dm-error device that currently
> fail in non-iomap mode because fuse2fs writes everything to the bdev
> pagecache.


The other way around: FOPEN_DIRECT_IO is a flag with which the fuse server
tells the kernel that it wants to bypass the page cache.  It also allows
parallel DIO (shared vs. exclusive lock).
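
As a rough sketch of the direction of that flag (server to kernel, in
the reply to an open request): the struct and helper below are
simplified stand-ins, not the real struct fuse_open_out or any libfuse
call.  Only the bit value is meant to mirror FOPEN_DIRECT_IO from
include/uapi/linux/fuse.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the value of FOPEN_DIRECT_IO in the fuse uapi header. */
#define TOY_FOPEN_DIRECT_IO	(1U << 0)

/* Hypothetical, trimmed-down stand-in for the open reply. */
struct toy_open_out {
	uint64_t fh;		/* server-chosen file handle */
	uint32_t open_flags;	/* FOPEN_* bits, set by the server */
};

/* Server side: fill in the reply that goes back to the kernel. */
static void toy_fill_open_reply(struct toy_open_out *out, uint64_t fh,
				bool bypass_pagecache)
{
	out->fh = fh;
	out->open_flags = bypass_pagecache ? TOY_FOPEN_DIRECT_IO : 0;
}
```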


Thanks,
Bernd


* Re: [PATCH 1/1] libfuse: enable iomap cache management
  2025-07-18 18:35         ` Bernd Schubert
@ 2025-07-18 18:40           ` Darrick J. Wong
  2025-07-18 18:51             ` Bernd Schubert
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 18:40 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, neal@gompa.dev, miklos@szeredi.hu

On Fri, Jul 18, 2025 at 08:35:29PM +0200, Bernd Schubert wrote:
> 
> 
> On 7/18/25 20:22, Darrick J. Wong wrote:
> > On Fri, Jul 18, 2025 at 04:16:28PM +0000, Bernd Schubert wrote:
> >> On 7/18/25 01:38, Darrick J. Wong wrote:
> >>> From: Darrick J. Wong <djwong@kernel.org>
> >>>
> >>> Add the library methods so that fuse servers can manage an in-kernel
> >>> iomap cache.  This enables better performance on small IOs and is
> >>> required if the filesystem needs synchronization between pagecache
> >>> writes and writeback.
> >>
> >> Sorry, is this ready to be merged? I don't see it in linux master. Or is
> >> it part of your other patches (it will take some time to go through these)?
> > 
> > No, everything you see in here is all RFC status and not for merging.
> > We're past -rc6, it's far too late to be trying to get anything new
> > merged in the kernel.
> > 
> > Though I say that as a former iomap maintainer who wouldn't take big
> > core code changes after -rc4 or XFS changes after -rc6.  I think I was
> > much more conservative about that than most maintainers. :)
> > 
> > (The cover letter yells very loudly about do not merge any of this,
> > btw.)
> 
> 
> This is  [PATCH 1/1] and when I wrote the mail it was not sorted in
> threaded form - I didn't see a cover letter for this specific patch.
> Might also be because some mails go to my ddn address and some to 
> my own one. I use the DDN address for patches to give DDN credits 
> for the work, but fastmail provides so much better filtering - I
> prefer my private address for CCs.
> 
> So I asked because I was confused about this [1/1] - it made it look
> like it is ready.

Ah, yeah.  My stgit maintainer^Wwrapper scripts only know how to put the
RFC tag on the cover letter, not the patches themselves.  Would you
prefer that I send to your bsbernd.com domain from now on so the emails
all end up in the same place?

--D

> Thanks,
> Bernd
> 


* Re: [PATCH 3/4] libfuse: add statx support to the lower level library
  2025-07-18 16:54         ` Bernd Schubert
@ 2025-07-18 18:42           ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 18:42 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Amir Goldstein, John, joannelkoong, linux-fsdevel, bernd, neal,
	miklos

On Fri, Jul 18, 2025 at 06:54:44PM +0200, Bernd Schubert wrote:
> 
> 
> On 7/18/25 18:27, Darrick J. Wong wrote:
> > On Fri, Jul 18, 2025 at 03:28:25PM +0200, Amir Goldstein wrote:
> >> On Fri, Jul 18, 2025 at 1:39 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >>>
> >>> From: Darrick J. Wong <djwong@kernel.org>
> >>>
> >>> Add statx support to the lower level fuse library.
> >>
> >> This looked familiar.
> >> Merged 3 days ago:
> >> https://github.com/libfuse/libfuse/pull/1026
> >>
> >>>
> >>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> >>> ---
> >>>  include/fuse_lowlevel.h |   37 ++++++++++++++++++
> >>>  lib/fuse_lowlevel.c     |   97 +++++++++++++++++++++++++++++++++++++++++++++++
> >>>  lib/fuse_versionscript  |    2 +
> >>>  3 files changed, 136 insertions(+)
> > 
> > <snip>
> > 
> >>> diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> >>> index ec30ebc4cdd074..8eeb6a8547da91 100644
> >>> --- a/lib/fuse_lowlevel.c
> >>> +++ b/lib/fuse_lowlevel.c
> >>> @@ -144,6 +144,43 @@ static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
> >>>         ST_CTIM_NSEC_SET(stbuf, attr->ctimensec);
> >>>  }
> >>>
> >>> +#ifdef STATX_BASIC_STATS
> >>> +static int convert_statx(struct fuse_statx *stbuf, const struct statx *stx,
> >>> +                        size_t size)
> >>> +{
> >>> +       if (sizeof(struct statx) != size)
> >>> +               return EOPNOTSUPP;
> >>> +
> >>> +       stbuf->mask = stx->stx_mask & (STATX_BASIC_STATS | STATX_BTIME);
> >>> +       stbuf->blksize          = stx->stx_blksize;
> >>> +       stbuf->attributes       = stx->stx_attributes;
> >>> +       stbuf->nlink            = stx->stx_nlink;
> >>> +       stbuf->uid              = stx->stx_uid;
> >>> +       stbuf->gid              = stx->stx_gid;
> >>> +       stbuf->mode             = stx->stx_mode;
> >>> +       stbuf->ino              = stx->stx_ino;
> >>> +       stbuf->size             = stx->stx_size;
> >>> +       stbuf->blocks           = stx->stx_blocks;
> >>> +       stbuf->attributes_mask  = stx->stx_attributes_mask;
> >>> +       stbuf->rdev_major       = stx->stx_rdev_major;
> >>> +       stbuf->rdev_minor       = stx->stx_rdev_minor;
> >>> +       stbuf->dev_major        = stx->stx_dev_major;
> >>> +       stbuf->dev_minor        = stx->stx_dev_minor;
> >>> +
> >>> +       stbuf->atime.tv_sec     = stx->stx_atime.tv_sec;
> >>> +       stbuf->btime.tv_sec     = stx->stx_btime.tv_sec;
> >>> +       stbuf->ctime.tv_sec     = stx->stx_ctime.tv_sec;
> >>> +       stbuf->mtime.tv_sec     = stx->stx_mtime.tv_sec;
> >>> +
> >>> +       stbuf->atime.tv_nsec    = stx->stx_atime.tv_nsec;
> >>> +       stbuf->btime.tv_nsec    = stx->stx_btime.tv_nsec;
> >>> +       stbuf->ctime.tv_nsec    = stx->stx_ctime.tv_nsec;
> >>> +       stbuf->mtime.tv_nsec    = stx->stx_mtime.tv_nsec;
> >>> +
> >>> +       return 0;
> >>> +}
> >>> +#endif
> >>> +
> >>
> >> Why is this conversion not needed in the merged version?
> >> What am I missing?
> > 
> > The patch in upstream memcpy's struct statx to struct fuse_statx:
> > 
> > 	memset(&arg, 0, sizeof(arg));
> > 	arg.flags = flags;
> > 	arg.attr_valid = calc_timeout_sec(attr_timeout);
> > 	arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> > 	memcpy(&arg.stat, statx, sizeof(arg.stat));
> > 
> > As long as the fields in the two are kept exactly in sync, this isn't a
> > problem and no explicit struct conversion is necessary.
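
One way to enforce "kept exactly in sync" at build time is a set of
static assertions on field offsets and sizes, so that a layout drift
breaks the build instead of silently corrupting the memcpy.  The sketch
below uses two invented toy structs rather than the real struct statx
and struct fuse_statx, and the TOY_SAME_FIELD macro is made up for this
example.  It needs C11 for _Static_assert.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Invented stand-ins for the syscall struct and the wire struct. */
struct toy_src  { uint32_t mask; uint32_t blksize; uint64_t size; };
struct toy_wire { uint32_t mask; uint32_t blksize; uint64_t size; };

/* Fail the build if field @f differs in offset or size. */
#define TOY_SAME_FIELD(a, b, f) \
	_Static_assert(offsetof(a, f) == offsetof(b, f) && \
		       sizeof(((a *)0)->f) == sizeof(((b *)0)->f), \
		       "field " #f " differs between structs")

TOY_SAME_FIELD(struct toy_src, struct toy_wire, mask);
TOY_SAME_FIELD(struct toy_src, struct toy_wire, blksize);
TOY_SAME_FIELD(struct toy_src, struct toy_wire, size);
_Static_assert(sizeof(struct toy_src) == sizeof(struct toy_wire),
	       "struct sizes differ");

/* With the layouts pinned down, a plain memcpy is a valid conversion. */
static void toy_copy(struct toy_wire *dst, const struct toy_src *src)
{
	memcpy(dst, src, sizeof(*dst));
}
```
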
> > 
> > I also noticed that the !HAVE_STATX variant of _do_statx doesn't call
> > fuse_reply_err(req, ENOSYS).  I think that means a new kernel calling
> > an old userspace would never receive a reply to a FUSE_STATX command
> > and ... time out?
> > 
> > My version also has explicit sizing of struct statx, but I concede that
> > if that struct ever gets bigger we're going to have to rev the whole
> > syscall anyway.  I was being perhaps a bit paranoid.
> > 
> > BTW, where are libfuse patches reviewed?  I guess all the reviews are
> > done via github PRs?
> 
> Yeah, the typical procedure is a github PR. If preferred for these complex
> patches, it's fine with me to post them here, especially if others like
> Amir are going to review :)

Ok.  I prefer emailing fsdevel for the broader reach and the
LF-maintained permanent archives of the discussions.

(I really dislike email for actually sending things that are ready for
merging and would rather send PRs though.  In XFSland the PRs are more
or less a formality that comes _after_ months of arguing. :P)

--D

> 
> Thanks,
> Bernd
> 
> 
> 
> 


* Re: [PATCH 06/13] fuse: implement buffered IO with iomap
  2025-07-18 18:39         ` Bernd Schubert
@ 2025-07-18 18:46           ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 18:46 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Amir Goldstein, linux-fsdevel, neal, John, miklos, joannelkoong

On Fri, Jul 18, 2025 at 08:39:18PM +0200, Bernd Schubert wrote:
> 
> 
> On 7/18/25 20:01, Darrick J. Wong wrote:
> > On Fri, Jul 18, 2025 at 05:10:14PM +0200, Amir Goldstein wrote:
> >> On Fri, Jul 18, 2025 at 1:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >>>
> >>> From: Darrick J. Wong <djwong@kernel.org>
> >>>
> >>> Implement pagecache IO with iomap, complete with hooks into truncate and
> >>> fallocate so that the fuse server needn't implement disk block zeroing
> >>> of post-EOF and unaligned punch/zero regions.
> >>>
> >>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> >>> ---
> >>>  fs/fuse/fuse_i.h          |   46 +++
> >>>  fs/fuse/fuse_trace.h      |  391 ++++++++++++++++++++++++
> >>>  include/uapi/linux/fuse.h |    5
> >>>  fs/fuse/dir.c             |   23 +
> >>>  fs/fuse/file.c            |   90 +++++-
> >>>  fs/fuse/file_iomap.c      |  723 +++++++++++++++++++++++++++++++++++++++++++++
> >>>  fs/fuse/inode.c           |   14 +
> >>>  7 files changed, 1268 insertions(+), 24 deletions(-)
> >>>
> >>>
> >>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> >>> index 67e428da4391aa..f33b348d296d5e 100644
> >>> --- a/fs/fuse/fuse_i.h
> >>> +++ b/fs/fuse/fuse_i.h
> >>> @@ -161,6 +161,13 @@ struct fuse_inode {
> >>>
> >>>                         /* waitq for direct-io completion */
> >>>                         wait_queue_head_t direct_io_waitq;
> >>> +
> >>> +#ifdef CONFIG_FUSE_IOMAP
> >>> +                       /* pending io completions */
> >>> +                       spinlock_t ioend_lock;
> >>> +                       struct work_struct ioend_work;
> >>> +                       struct list_head ioend_list;
> >>> +#endif
> >>>                 };
> >>
> >> This union member you are changing is declared for
> >> /* read/write io cache (regular file only) */
> >> but actually it is also for parallel dio and passthrough mode
> >>
> >> IIUC, there should be zero intersection between these io modes and
> >>  /* iomap cached fileio (regular file only) */
> >>
> >> Right?
> > 
> > Right.  iomap will get very very confused if you switch file IO paths on
> > a live file.  I think it's /possible/ to switch if you flush and
> > truncate the whole page cache while holding inode_lock() but I don't
> > think anyone has ever tried.
> > 
> >> So it can use its own union member without increasing fuse_inode size.
> >>
> >> Just need to be careful in fuse_init_file_inode(), fuse_evict_inode() and
> >> fuse_file_io_release() which do not assume a specific inode io mode.
> > 
> > Yes, I think it's possible to put the iomap stuff in a separate struct
> > within that union so that we're not increasing the fuse_inode size
> > unnecessarily.  That's a desirable thing to do before merging, but
> > for now prototyping is /much/ easier if I don't have to do that.
> > 
> > Making that change will require a lot of careful auditing; first I want
> > to make sure you all agree with the iomap approach because it's much
> > different from what I see in the other fuse IO paths. :)
> > 
> > Eeeyiks, struct fuse_inode shrinks from 1272 bytes to 1152 if I push the
> > iomap stuff into its own union struct.
> > 
> >> Was it your intention to allow filesystems to configure some inodes to be
> >> in file_iomap mode and other inodes to be in regular cached/direct/passthrough
> >> io modes?
> > 
> > That was a deliberate design decision on my part -- maybe a fuse server
> > would be capable of serving up some files from a local disk, and others
> > from (say) a network filesystem.  Or maybe it would like to expose an
> > administrative fd for the filesystem (like the xfs_healer event stream)
> > that isn't backed by storage.
> > 
> >> I can't say that I see a big benefit in allowing such setups.
> >> It certainly adds a lot of complication to the test matrix if we allow that.
> >> My instinct is that, for the initial version, a filesystem that supports
> >> those modes should only allow opening its inodes in FILE_IOMAP or
> >> DIRECT_IOMAP mode.
> > 
> > I was thinking about combining FUSE_ATTR_IOMAP_(DIRECTIO|FILEIO) for the
> > next RFC because I can't imagine any scenario where you don't want
> > directio support if you already use iomap for the pagecache.  fuse iomap
> > requires directio write support for writeback, so the server *must*
> > support IOMAP_WRITE|IOMAP_DIRECT.
> > 
> >> Perhaps later we can allow (and maybe fallback to) FOPEN_DIRECT_IO
> >> (without parallel dio) if a server does not configure IOMAP to some inode
> >> to allow a server to provide the data for a specific inode directly.
> > 
> > Hrmm.  Is FOPEN_DIRECT_IO the magic flag that fuse passes to the fuse
> > server to tell it that a file is open in directio mode?  There's a few
> > fstests that initiate aio+dio writes to a dm-error device that currently
> > fail in non-iomap mode because fuse2fs writes everything to the bdev
> > pagecache.
> 
> 
> The other way around: FOPEN_DIRECT_IO is a flag with which the fuse server
> tells the kernel that it wants to bypass the page cache.  It also allows
> parallel DIO (shared vs. exclusive lock).

Oh ok.  iomap supports parallel directio writes, but one has to be
careful to drop to synchronous mode for file extending and unaligned
writes so I've left it out of the prototype for now.  (Parallel reads
are supported by default.)
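
The rule in the paragraph above could be modeled roughly as the check
below.  It is only a model of that sentence, not the iomap API: the
function name and parameters are assumptions, and blocksize is assumed
to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a direct write of @len bytes at @pos must be
 * serialized instead of running in parallel with other writes.
 */
static bool toy_dio_write_must_serialize(uint64_t pos, uint64_t len,
					 uint64_t isize, uint32_t blocksize)
{
	/* File-extending write: i_size changes when the IO completes. */
	if (pos + len > isize)
		return true;

	/* Unaligned write: one end lands in the middle of a block. */
	if ((pos | len) & (blocksize - 1))
		return true;

	return false;
}
```

Aligned overwrites inside EOF are the only case that this sketch lets
run in parallel.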

Hrmm I'll have to study these more...

--D

> 
> Thanks,
> Bernd
> 


* Re: [PATCH 1/1] libfuse: enable iomap cache management
  2025-07-18 18:40           ` Darrick J. Wong
@ 2025-07-18 18:51             ` Bernd Schubert
  0 siblings, 0 replies; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 18:51 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bernd Schubert, John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, neal@gompa.dev, miklos@szeredi.hu



On 7/18/25 20:40, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 08:35:29PM +0200, Bernd Schubert wrote:
>>
>>
>> On 7/18/25 20:22, Darrick J. Wong wrote:
>>> On Fri, Jul 18, 2025 at 04:16:28PM +0000, Bernd Schubert wrote:
>>>> On 7/18/25 01:38, Darrick J. Wong wrote:
>>>>> From: Darrick J. Wong <djwong@kernel.org>
>>>>>
>>>>> Add the library methods so that fuse servers can manage an in-kernel
>>>>> iomap cache.  This enables better performance on small IOs and is
>>>>> required if the filesystem needs synchronization between pagecache
>>>>> writes and writeback.
>>>>
>>>> Sorry, is this ready to be merged? I don't see it in linux master. Or is
>>>> it part of your other patches (it will take some time to go through these)?
>>>
>>> No, everything you see in here is all RFC status and not for merging.
>>> We're past -rc6, it's far too late to be trying to get anything new
>>> merged in the kernel.
>>>
>>> Though I say that as a former iomap maintainer who wouldn't take big
>>> core code changes after -rc4 or XFS changes after -rc6.  I think I was
>>> much more conservative about that than most maintainers. :)
>>>
>>> (The cover letter yells very loudly about do not merge any of this,
>>> btw.)
>>
>>
>> This is  [PATCH 1/1] and when I wrote the mail it was not sorted in
>> threaded form - I didn't see a cover letter for this specific patch.
>> Might also be because some mails go to my ddn address and some to 
>> my own one. I use the DDN address for patches to give DDN credits 
>> for the work, but fastmail provides so much better filtering - I
>> prefer my private address for CCs.
>>
>> So I asked because I was confused about this [1/1] - it made it look
>> like it is ready.
> 
> Ah, yeah.  My stgit maintainer^Wwrapper scripts only know how to put the
> RFC tag on the cover letter, not the patches themselves.  Would you
> prefer that I send to your bsbernd.com domain from now on so the emails
> all end up in the same place?

Yes please! bsbernd.com gets routed to fastmail and I have quite some
mail sorting rules there.

Btw, interesting that you manage to handle cover letters with stgit.
Must be your wrapper scripts.  I basically switched to b4 to send
patches because stgit doesn't handle them well.


Thanks,
Bernd


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18 11:55   ` Amir Goldstein
@ 2025-07-18 19:31     ` Darrick J. Wong
  2025-07-18 19:56       ` Amir Goldstein
  2025-07-23 13:05       ` Christian Brauner
  0 siblings, 2 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 19:31 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, linux-fsdevel, John, bernd, miklos,
	joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o,
	Neal Gompa

On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > Hi everyone,
> > >
> > > DO NOT MERGE THIS, STILL!
> > >
> > > This is the third request for comments of a prototype to connect the
> > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > files whose contents persist to locally attached storage devices.
> > >
> > > Why would you want to do that?  Most filesystem drivers are seriously
> > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > over almost a decade of its existence.  Faulty code can lead to total
> > > kernel compromise, and I think there's a very strong incentive to move
> > > all that parsing out to userspace where we can containerize the fuse
> > > server process.
> > >
> > > willy's folios conversion project (and to a certain degree RH's new
> > > mount API) have also demonstrated that treewide changes to the core
> > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > because you have to understand every filesystem's bespoke use of that
> > > core code.  Eeeugh.
> > >
> > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > writeback is now a directio write.  The fuse server is now able to
> > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > works.
> > >
> > > With this RFC, I am able to show that it's possible to build a fuse
> > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > maintains most of its performance.  At this stage I still get about 95%
> > > of the kernel ext4 driver's streaming directio performance on streaming
> > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > for more details.  Unwritten extent conversions on random direct writes
> > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > overhead.  And that's with debugging turned on!
> > >
> > > These items have been addressed since the first RFC:
> > >
> > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > between pagecache zeroing and writeback on filesystems that support
> > > unwritten and delalloc mappings.
> > >
> > > 2. Mappings can be cached in the kernel for more speed.
> > >
> > > 3. iomap supports inline data.
> > >
> > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > fuse server can set fuse_attr::flags.
> > >
> > > 5. statx and syncfs work on iomap filesystems.
> > >
> > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > is enabled.
> > >
> > > 7. The ext4 shutdown ioctl is now supported.
> > >
> > > There are some major warts remaining:
> > >
> > > a. ext4 doesn't support out of place writes so I don't know if that
> > > actually works correctly.
> > >
> > > b. iomap is an inode-based service, not a file-based service.  This
> > > means that we /must/ push ext2's inode numbers into the kernel via
> > > FUSE_GETATTR so that it can report those same numbers back out through
> > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > to index its incore inode, so we have to pass those too so that
> > > notifications work properly.  This is related to item c below:
> > >
> > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > because the upper level libfuse likes to abstract kernel nodeids with
> > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > As a result, a hardlinked file results in two distinct struct inodes in
> > > the kernel, which completely breaks iomap's locking model.  I will have
> > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > but on the plus side there will be far less path lookup overhead.
> > >
> > > d. There are too many changes to the IO manager in libext2fs because I
> > > built things needed to stage the direct/buffered IO paths separately.
> > > These are now unnecessary but I haven't pulled them out yet because
> > > they're sort of useful to verify that iomap file IO never goes through
> > > libext2fs except for inline data.
> > >
> > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > We also need to disable the OOM killer(s) for fuse servers because you
> > > don't want filesystems to unmount abruptly.
> > >
> > > f. How do we maximally contain the fuse server to have safe filesystem
> > > mounts?  It's very convenient to use systemd services to configure
> > > isolation declaratively, but fuse2fs still needs to be able to open
> > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > namespace.  This prevents us from using most of the stronger systemd
> >
> > I'm happy to help you here.
> >
> > First, I think using a character device for namespaced drivers is always
> > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > delegation because of devtmpfs not being namespaced as well as devices
> > in general. And having device nodes on anything other than tmpfs is just
> > wrong (TM).
> >
> > In systemd I ultimately want a bpf LSM program that prevents the
> > creation of device nodes outside of tmpfs. They don't belong on
> > persistent storage imho. But anyway, that's beside the point.
> >
> > Opening the block device should be done by systemd-mountfsd but I think
> > /dev/fuse should really be openable by the service itself.

/me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
Can you pass an fsopen fd to an unprivileged process and have that
second process call fsmount?

If so, then it would be more convenient if mount.safe/systemd-mountfsd
could pass open fds for /dev/fuse and fsopen; then the fuse server
wouldn't need any special /dev access at all.  I think then the fuse
server's service could have:

DynamicUser=true
ProtectSystem=true
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
DevicePolicy=strict

(I think most of those are redundant with DynamicUser=true but a lot of
my systemd-fu is paged out ATM.)

My goal here is extreme containment -- the code doing the fs metadata
parsing has no privileges, no write access except to the fds it was
given, no network access, and no ability to read anything outside the
root filesystem.  Then I can get back to writing buffer
overflows^W^Whigh quality filesystem code in peace.
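
The fd handoff itself could ride on a plain AF_UNIX socket with
SCM_RIGHTS.  Below is a minimal sketch, assuming the mount helper and
the fuse server share a connected socket.  toy_send_fd() and
toy_recv_fd() are illustrative names, and whether an fsopen fd handed
over this way can then be used for fsmount is exactly the open question
above.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union toy_fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send @fd over the connected AF_UNIX socket @sock.  Returns 0 or -1. */
static int toy_send_fd(int sock, int fd)
{
	char byte = 'F';	/* at least one data byte must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union toy_fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from @sock.  Returns the new descriptor or -1. */
static int toy_recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union toy_fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own descriptor for the same open file, so
the fuse server never has to open anything under /dev itself.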

> > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > whiteouts. That means you can do mknod() in the container to create
> > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > bat so that containers can only do this on their private tmpfs mount at
> > /dev.)
> >
> > The downside of this would be to give unprivileged containers access to
> > FUSE by default. I don't think that's a problem per se but it is a uapi
> > change.

Yeah, that is a new risk.  It's still better than metadata parsing
within the kernel address space ... though who knows how thoroughly fuse
has been fuzzed by syzbot :P

> > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > sure enough about it to spill it.

Please do share, #f is my crazy unbaked idea. :)

> I don't think there is a hard requirement for the fuse fd to be opened from
> a device driver.
> With fuse io_uring communication, the open fd doesn't even need to do io.
> 
> > > protections because they tend to run in a private mount namespace with
> > > various parts of the filesystem either hidden or readonly.
> > >
> > > In theory one could design a socket protocol to pass mount options,
> > > block device paths, fds, and responsibility for the mount() call between
> > > a mount helper and a service:
> >
> > This isn't a problem really. This should just be an extension to
> > systemd-mountfsd.

I suppose mount.safe could very well call systemd-mount to go do all the
systemd-related service setup, and that would take care of udisks as
well.

> This is relevant not only to systemd env.
> 
> I have been experimenting with this mount helper service to mount fuse fs
> inside an unprivileged kubernetes container, where opening of /dev/fuse
> is restricted by LSM policy:
> 
> https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach

That sounds similar to what I was thinking about, though there are a lot
of TLAs that I don't understand.

--D

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-18 18:07       ` Bernd Schubert
  2025-07-18 18:13         ` Bernd Schubert
@ 2025-07-18 19:34         ` Darrick J. Wong
  2025-07-18 21:03           ` Bernd Schubert
  1 sibling, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 19:34 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: linux-fsdevel, neal, John, miklos, joannelkoong, Horst Birthelmer

On Fri, Jul 18, 2025 at 08:07:30PM +0200, Bernd Schubert wrote:
> 
> > 
> > Please see the two attached patches, which are needed for fuse-io-uring.
> > I can also send them separately, if you prefer.
> 
> We (actually Horst) is just testing it as Horst sees failing xfs tests in
> our branch with tmp page removal
> 
> Patch 2 needs this addition (might be more, as I didn't test). 
> I had it in first, but then split the patch and missed that.

Aha, I noticed that the flush didn't quite work when uring was enabled.
I don't generally enable uring for testing because I already wrote a lot
of shaky code and uring support is new.

Though I'm afraid I have no opinion on this, because I haven't looked
deeply into dev_uring.c.

--D

> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index eca457d1005e..acf11eadbf3b 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -123,6 +123,9 @@ void fuse_uring_flush_bg(struct fuse_conn *fc)
>         struct fuse_ring_queue *queue;
>         struct fuse_ring *ring = fc->ring;
>  
> +       if (!ring)
> +               return;
> +
>         for (qid = 0; qid < ring->nr_queues; qid++) {
>                 queue = READ_ONCE(ring->queues[qid]);
>                 if (!queue)
> 
> 

> fuse: Refactor io-uring bg queue flush and queue abort
> 
> From: Bernd Schubert <bschubert@ddn.com>
> 
> This is a preparation to allow fuse-io-uring bg queue
> flush from flush_bg_queue()
> 
> This does two function renames:
> fuse_uring_flush_bg -> fuse_uring_flush_queue_bg
> fuse_uring_abort_end_requests -> fuse_uring_flush_bg
> 
> And fuse_uring_abort_end_queue_requests() is moved to
> fuse_uring_stop_queues().
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev_uring.c   |   14 +++++++-------
>  fs/fuse/dev_uring_i.h |    4 ++--
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 249b210becb1..eca457d1005e 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -47,7 +47,7 @@ static struct fuse_ring_ent *uring_cmd_to_ring_ent(struct io_uring_cmd *cmd)
>  	return pdu->ent;
>  }
>  
> -static void fuse_uring_flush_bg(struct fuse_ring_queue *queue)
> +static void fuse_uring_flush_queue_bg(struct fuse_ring_queue *queue)
>  {
>  	struct fuse_ring *ring = queue->ring;
>  	struct fuse_conn *fc = ring->fc;
> @@ -88,7 +88,7 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
>  	if (test_bit(FR_BACKGROUND, &req->flags)) {
>  		queue->active_background--;
>  		spin_lock(&fc->bg_lock);
> -		fuse_uring_flush_bg(queue);
> +		fuse_uring_flush_queue_bg(queue);
>  		spin_unlock(&fc->bg_lock);
>  	}
>  
> @@ -117,11 +117,11 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
>  	fuse_dev_end_requests(&req_list);
>  }
>  
> -void fuse_uring_abort_end_requests(struct fuse_ring *ring)
> +void fuse_uring_flush_bg(struct fuse_conn *fc)
>  {
>  	int qid;
>  	struct fuse_ring_queue *queue;
> -	struct fuse_conn *fc = ring->fc;
> +	struct fuse_ring *ring = fc->ring;
>  
>  	for (qid = 0; qid < ring->nr_queues; qid++) {
>  		queue = READ_ONCE(ring->queues[qid]);
> @@ -133,10 +133,9 @@ void fuse_uring_abort_end_requests(struct fuse_ring *ring)
>  		WARN_ON_ONCE(ring->fc->max_background != UINT_MAX);
>  		spin_lock(&queue->lock);
>  		spin_lock(&fc->bg_lock);
> -		fuse_uring_flush_bg(queue);
> +		fuse_uring_flush_queue_bg(queue);
>  		spin_unlock(&fc->bg_lock);
>  		spin_unlock(&queue->lock);
> -		fuse_uring_abort_end_queue_requests(queue);
>  	}
>  }
>  
> @@ -475,6 +474,7 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
>  		if (!queue)
>  			continue;
>  
> +		fuse_uring_abort_end_queue_requests(queue);
>  		fuse_uring_teardown_entries(queue);
>  	}
>  
> @@ -1326,7 +1326,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
>  	fc->num_background++;
>  	if (fc->num_background == fc->max_background)
>  		fc->blocked = 1;
> -	fuse_uring_flush_bg(queue);
> +	fuse_uring_flush_queue_bg(queue);
>  	spin_unlock(&fc->bg_lock);
>  
>  	/*
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 51a563922ce1..55f52508de3c 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -138,7 +138,7 @@ struct fuse_ring {
>  bool fuse_uring_enabled(void);
>  void fuse_uring_destruct(struct fuse_conn *fc);
>  void fuse_uring_stop_queues(struct fuse_ring *ring);
> -void fuse_uring_abort_end_requests(struct fuse_ring *ring);
> +void fuse_uring_flush_bg(struct fuse_conn *fc);
>  int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>  void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req);
>  bool fuse_uring_queue_bq_req(struct fuse_req *req);
> @@ -153,7 +153,7 @@ static inline void fuse_uring_abort(struct fuse_conn *fc)
>  		return;
>  
>  	if (atomic_read(&ring->queue_refs) > 0) {
> -		fuse_uring_abort_end_requests(ring);
> +		fuse_uring_flush_bg(fc);
>  		fuse_uring_stop_queues(ring);
>  	}
>  }

> fuse: Flush the io-uring bg queue from fuse_uring_flush_bg
> 
> From: Bernd Schubert <bschubert@ddn.com>
> 
> This is useful to have a unique API to flush background requests.
> For example when the bg queue gets flushed before
> the remaining of fuse_conn_destroy().
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev.c         |    2 ++
>  fs/fuse/dev_uring.c   |    3 +++
>  fs/fuse/dev_uring_i.h |   10 +++++++---
>  3 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 5387e4239d6a..3f5f168cc28a 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -426,6 +426,8 @@ static void flush_bg_queue(struct fuse_conn *fc)
>  		fc->active_background++;
>  		fuse_send_one(fiq, req);
>  	}
> +
> +	fuse_uring_flush_bg(fc);
>  }
>  
>  /*
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index eca457d1005e..acf11eadbf3b 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -123,6 +123,9 @@ void fuse_uring_flush_bg(struct fuse_conn *fc)
>  	struct fuse_ring_queue *queue;
>  	struct fuse_ring *ring = fc->ring;
>  
> +	if (!ring)
> +		return;
> +
>  	for (qid = 0; qid < ring->nr_queues; qid++) {
>  		queue = READ_ONCE(ring->queues[qid]);
>  		if (!queue)
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 55f52508de3c..fca2184e8d94 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -152,10 +152,10 @@ static inline void fuse_uring_abort(struct fuse_conn *fc)
>  	if (ring == NULL)
>  		return;
>  
> -	if (atomic_read(&ring->queue_refs) > 0) {
> -		fuse_uring_flush_bg(fc);
> +	/* Assumes bg queues were already flushed before */
> +
> +	if (atomic_read(&ring->queue_refs) > 0)
>  		fuse_uring_stop_queues(ring);
> -	}
>  }
>  
>  static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
> @@ -206,6 +206,10 @@ static inline bool fuse_uring_request_expired(struct fuse_conn *fc)
>  	return false;
>  }
>  
> +static inline void fuse_uring_flush_bg(struct fuse_conn *fc)
> +{
> +}
> +
>  #endif /* CONFIG_FUSE_IO_URING */
>  
>  #endif /* _FS_FUSE_DEV_URING_I_H */


* Re: [PATCH 06/13] fuse: implement buffered IO with iomap
  2025-07-18 18:01       ` Darrick J. Wong
  2025-07-18 18:39         ` Bernd Schubert
@ 2025-07-18 19:45         ` Amir Goldstein
  2025-07-18 20:20           ` Darrick J. Wong
  1 sibling, 1 reply; 174+ messages in thread
From: Amir Goldstein @ 2025-07-18 19:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

On Fri, Jul 18, 2025 at 8:01 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Jul 18, 2025 at 05:10:14PM +0200, Amir Goldstein wrote:
> > On Fri, Jul 18, 2025 at 1:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Implement pagecache IO with iomap, complete with hooks into truncate and
> > > fallocate so that the fuse server needn't implement disk block zeroing
> > > of post-EOF and unaligned punch/zero regions.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > >  fs/fuse/fuse_i.h          |   46 +++
> > >  fs/fuse/fuse_trace.h      |  391 ++++++++++++++++++++++++
> > >  include/uapi/linux/fuse.h |    5
> > >  fs/fuse/dir.c             |   23 +
> > >  fs/fuse/file.c            |   90 +++++-
> > >  fs/fuse/file_iomap.c      |  723 +++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/fuse/inode.c           |   14 +
> > >  7 files changed, 1268 insertions(+), 24 deletions(-)
> > >
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 67e428da4391aa..f33b348d296d5e 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -161,6 +161,13 @@ struct fuse_inode {
> > >
> > >                         /* waitq for direct-io completion */
> > >                         wait_queue_head_t direct_io_waitq;
> > > +
> > > +#ifdef CONFIG_FUSE_IOMAP
> > > +                       /* pending io completions */
> > > +                       spinlock_t ioend_lock;
> > > +                       struct work_struct ioend_work;
> > > +                       struct list_head ioend_list;
> > > +#endif
> > >                 };
> >
> > This union member you are changing is declared for
> > /* read/write io cache (regular file only) */
> > but actually it is also for parallel dio and passthrough mode
> >
> > IIUC, there should be zero intersection between these io modes and
> >  /* iomap cached fileio (regular file only) */
> >
> > Right?
>
> Right.  iomap will get very very confused if you switch file IO paths on
> a live file.  I think it's /possible/ to switch if you flush and
> truncate the whole page cache while holding inode_lock() but I don't
> think anyone has ever tried.
>
> > So it can use its own union member without increasing fuse_inode size.
> >
> > Just need to be careful in fuse_init_file_inode(), fuse_evict_inode() and
> > fuse_file_io_release() which do not assume a specific inode io mode.
>
> Yes, I think it's possible to put the iomap stuff in a separate struct
> within that union so that we're not increasing the fuse_inode size
> unnecessarily.  That's desirable for something to do before merging,
> but for now prototyping is /much/ easier if I don't have to do that.
>

Understood. You can deal with that later; I just wanted to leave a TODO note.

> Making that change will require a lot of careful auditing, first I want
> to make sure you all agree with the iomap approach because it's much
> different from what I see in the other fuse IO paths. :)
>

Indeed a good audit will be required, but
*if* you can guarantee that iomap is always configured at inode
initialization, then in fuse_init_file_inode() it is clear which member
of the union is being initialized, and this mode has to stick with the
inode until evict anyway.

So basically, all you need to do is never allow configuring iomap on an
already initialized inode.
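
To make the rule concrete, here is a minimal userspace sketch of "pick the
union member once at init, never reconfigure".  All demo_* names and fields
are hypothetical stand-ins, not the real struct fuse_inode layout, and
-EBUSY is just an illustrative error code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real struct fuse_inode layout. */
enum demo_io_mode { DEMO_IO_UNSET = 0, DEMO_IO_CACHED, DEMO_IO_IOMAP };

struct demo_cached_state {	/* stands in for the read/write io cache fields */
	int writectr;
	void *queued_writes;
};

struct demo_iomap_state {	/* stands in for ioend_lock/ioend_work/ioend_list */
	int ioend_lock;
	void *ioend_list;
};

struct demo_inode {
	enum demo_io_mode mode;	/* fixed at initialization, never changed */
	union {			/* only the member selected by mode is valid */
		struct demo_cached_state cached;
		struct demo_iomap_state iomap;
	};
};

/* Pick the union member exactly once; refuse to reconfigure afterwards. */
static int demo_inode_init(struct demo_inode *di, bool use_iomap)
{
	assert(di != NULL);

	if (di->mode != DEMO_IO_UNSET)
		return -EBUSY;	/* already initialized: mode is const now */

	if (use_iomap) {
		di->mode = DEMO_IO_IOMAP;
		di->iomap.ioend_lock = 0;
		di->iomap.ioend_list = NULL;
	} else {
		di->mode = DEMO_IO_CACHED;
		di->cached.writectr = 0;
		di->cached.queued_writes = NULL;
	}
	return 0;
}
```

Because the mode never changes after init, evict-time code can switch on
di->mode to decide which union member to tear down.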

> Eeeyiks, struct fuse_inode shrinks from 1272 bytes to 1152 if I push the
> iomap stuff into its own union struct.
>
> > Was it your intention to allow filesystems to configure some inodes to be
> > in file_iomap mode and other inodes to be in regular cached/direct/passthrough
> > io modes?
>
> That was a deliberate design decision on my part -- maybe a fuse server
> would be capable of serving up some files from a local disk, and others
> from (say) a network filesystem.  Or maybe it would like to expose an
> administrative fd for the filesystem (like the xfs_healer event stream)
> that isn't backed by storage.
>

Understood.

But the filesystem should be able to make the decision at inode
initialization time (lookup), and once made, this decision sticks
throughout the inode lifetime. Right?

> > I can't say that I see a big benefit in allowing such setups.
> > It certainly adds a lot of complication to the test matrix if we allow that.
> > My instinct is for initial version, either allow only opening files in
> > FILE_IOMAP or
> > DIRECT_IOMAP to inodes for a filesystem that supports those modes.
>
> I was thinking about combining FUSE_ATTR_IOMAP_(DIRECTIO|FILEIO) for the
> next RFC because I can't imagine any scenario where you don't want
> directio support if you already use iomap for the pagecache.  fuse iomap
> requires directio write support for writeback, so the server *must*
> support IOMAP_WRITE|IOMAP_DIRECT.
>
> > Perhaps later we can allow (and maybe fallback to) FOPEN_DIRECT_IO
> > (without parallel dio) if a server does not configure IOMAP to some inode
> > to allow a server to provide the data for a specific inode directly.
>
> Hrmm.  Is FOPEN_DIRECT_IO the magic flag that fuse passes to the fuse
> server to tell it that a file is open in directio mode?  There's a few
> fstests that initiate aio+dio writes to a dm-error device that currently
> fail in non-iomap mode because fuse2fs writes everything to the bdev
> pagecache.
>

Not exactly, but never mind; you can use much simpler logic for what
you described:
iomap has to be configured on inode instantiation and never changed afterwards.
Other inodes are not going to be affected by iomap at all from that point on.

> > fuse_file_io_open/release() can help you manage those restrictions and
> > set ff->iomode = IOM_FILE_IOMAP when a file is opened for file iomap.
> > I did not look closely enough to see if file_iomap code ends up setting
> > ff->iomode = IOM_CACHED/UNCACHED or always remains IOM_NONE.
>
> I don't touch ff->iomode because iomap is a per-inode property, not a
> per-file property... but I suppose that would be a good place to look.
>

Right, with cached/direct/passthrough the inode may change the iomode
after all files are closed, but we *do* keep the mode in the inode,
so we know that files cannot be opened in conflicting modes on the same inode.

The purpose of ff->iomode is to know if the file contributes to cached mode
positive iocachectr or to a negative passthrough mode refcount.

So setting ff->iomode = IOM_IOMAP just helps for annotating how the
file was opened, in case we are tracing it. There is no functional need to
define and set this mode on the file when the mode of the inode is const.
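
A toy model of that counter, as I understand it: the inode holds one signed
count (positive for cached opens, negative for passthrough opens), and each
file remembers which side it contributed to so that release can undo it.
The demo_* names and the -EBUSY return are hypothetical, for illustration
only; this is not the code in fs/fuse:

```c
#include <assert.h>
#include <errno.h>

enum demo_iomode { DEMO_IOM_NONE = 0, DEMO_IOM_CACHED, DEMO_IOM_PASSTHROUGH };

/* Per-inode: > 0 counts cached opens, < 0 counts passthrough opens. */
struct demo_inode_io { int iocachectr; };

/* Per-file: remembers which side of the counter this open contributed to. */
struct demo_file { enum demo_iomode iomode; };

static int demo_open(struct demo_inode_io *io, struct demo_file *f,
		     enum demo_iomode want)
{
	if (want == DEMO_IOM_CACHED) {
		if (io->iocachectr < 0)
			return -EBUSY;	/* inode is in passthrough mode */
		io->iocachectr++;
	} else if (want == DEMO_IOM_PASSTHROUGH) {
		if (io->iocachectr > 0)
			return -EBUSY;	/* inode is in cached mode */
		io->iocachectr--;
	} else {
		return -EINVAL;
	}
	f->iomode = want;
	return 0;
}

static void demo_release(struct demo_inode_io *io, struct demo_file *f)
{
	assert(f->iomode != DEMO_IOM_NONE);

	/* Undo exactly the contribution recorded at open time. */
	if (f->iomode == DEMO_IOM_CACHED)
		io->iocachectr--;
	else
		io->iocachectr++;
	f->iomode = DEMO_IOM_NONE;
}
```

With a const per-inode iomap mode there is no counter to balance, which is
why a per-file IOM_IOMAP value would only be an annotation.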

Thanks,
Amir.

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18 19:31     ` Darrick J. Wong
@ 2025-07-18 19:56       ` Amir Goldstein
  2025-07-18 20:21         ` Darrick J. Wong
  2025-07-23 13:05       ` Christian Brauner
  1 sibling, 1 reply; 174+ messages in thread
From: Amir Goldstein @ 2025-07-18 19:56 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christian Brauner, linux-fsdevel, John, bernd, miklos,
	joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o,
	Neal Gompa

On Fri, Jul 18, 2025 at 9:31 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > Hi everyone,
> > > >
> > > > DO NOT MERGE THIS, STILL!
> > > >
> > > > This is the third request for comments of a prototype to connect the
> > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > files whose contents persist to locally attached storage devices.
> > > >
> > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > kernel compromise, and I think there's a very strong incentive to move
> > > > all that parsing out to userspace where we can containerize the fuse
> > > > server process.
> > > >
> > > > willy's folios conversion project (and to a certain degree RH's new
> > > > mount API) have also demonstrated that treewide changes to the core
> > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > because you have to understand every filesystem's bespoke use of that
> > > > core code.  Eeeugh.
> > > >
> > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > writeback is now a directio write.  The fuse server is now able to
> > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > works.
> > > >
> > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > maintains most of its performance.  At this stage I still get about 95%
> > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > for more details.  Unwritten extent conversions on random direct writes
> > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > overhead.  And that's with debugging turned on!
> > > >
> > > > These items have been addressed since the first RFC:
> > > >
> > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > between pagecache zeroing and writeback on filesystems that support
> > > > unwritten and delalloc mappings.
> > > >
> > > > 2. Mappings can be cached in the kernel for more speed.
> > > >
> > > > 3. iomap supports inline data.
> > > >
> > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > fuse server can set fuse_attr::flags.
> > > >
> > > > 5. statx and syncfs work on iomap filesystems.
> > > >
> > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > is enabled.
> > > >
> > > > 7. The ext4 shutdown ioctl is now supported.
> > > >
> > > > There are some major warts remaining:
> > > >
> > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > actually works correctly.
> > > >
> > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > to index its incore inode, so we have to pass those too so that
> > > > notifications work properly.  This is related to #3 below:
> > > >
> > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > but on the plus side there will be far less path lookup overhead.
> > > >
> > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > built things needed to stage the direct/buffered IO paths separately.
> > > > These are now unnecessary but I haven't pulled them out yet because
> > > > they're sort of useful to verify that iomap file IO never goes through
> > > > libext2fs except for inline data.
> > > >
> > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > don't want filesystems to unmount abruptly.
> > > >
> > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > mounts?  It's very convenient to use systemd services to configure
> > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > namespace.  This prevents us from using most of the stronger systemd
> > >
> > > I'm happy to help you here.
> > >
> > > First, I think using a character device for namespaced drivers is always
> > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > delegation because of devtmpfs not being namespaced as well as devices
> > > in general. And having device nodes on anything other than tmpfs is just
> > > wrong (TM).
> > >
> > > In systemd I ultimately want a bpf LSM program that prevents the
> > > creation of device nodes outside of tmpfs. They don't belong on
> > > persistent storage imho. But anyway, that's besides the point.
> > >
> > > Opening the block device should be done by systemd-mountfsd but I think
> > > /dev/fuse should really be openable by the service itself.
>
> /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> Can you pass an fsopen fd to an unprivileged process and have that
> second process call fsmount?
>
> If so, then it would be more convenient if mount.safe/systemd-mountfsd
> could pass open fds for /dev/fuse fsopen then the fuse server wouldn't
> need any special /dev access at all.  I think then the fuse server's
> service could have:
>
> DynamicUser=true
> ProtectSystem=true
> ProtectHome=true
> PrivateTmp=true
> PrivateDevices=true
> DevicePolicy=strict
>
> (I think most of those are redundant with DynamicUser=true but a lot of
> my systemd-fu is paged out ATM.)
>
> My goal here is extreme containment -- the code doing the fs metadata
> parsing has no privileges, no write access except to the fds it was
> given, no network access, and no ability to read anything outside the
> root filesystem.  Then I can get back to writing buffer
> overflows^W^Whigh quality filesystem code in peace.
>
> > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > whiteouts. That means you can do mknod() in the container to create
> > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > bat so that containers can only do this on their private tmpfs mount at
> > > /dev.)
> > >
> > > The downside of this would be to give unprivileged containers access to
> > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > change.
>
> Yeah, that is a new risk.  It's still better than metadata parsing
> within the kernel address space ... though who knows how thoroughly fuse
> has been fuzzed by syzbot :P
>
> > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > sure enough about it to spill it.
>
> Please do share, #f is my crazy unbaked idea. :)
>
> > I don't think there is a hard requirement for the fuse fd to be opened from
> > a device driver.
> > With fuse io_uring communication, the open fd doesn't even need to do io.
> >
> > > > protections because they tend to run in a private mount namespace with
> > > > various parts of the filesystem either hidden or readonly.
> > > >
> > > > In theory one could design a socket protocol to pass mount options,
> > > > block device paths, fds, and responsibility for the mount() call between
> > > > a mount helper and a service:
> > >
> > > This isn't a problem really. This should just be an extension to
> > > systemd-mountfsd.
>
> I suppose mount.safe could very well call systemd-mount to go do all the
> systemd-related service setup, and that would take care of udisks as
> well.
>
> > This is relevant not only to systemd env.
> >
> > I have been experimenting with this mount helper service to mount fuse fs
> > inside an unprivileged kubernetes container, where opening of /dev/fuse
> > is restricted by LSM policy:
> >
> > https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach
>
> That sounds similar to what I was thinking about, though there are a lot
> of TLAs that I don't understand.

Heh. UDS is Unix Domain Socket if that's what you missed (?)
All the rest don't matter.
It's just a privileged service to mount fuse filesystems.
The interesting thing is the trick with replacing fusermount3
to make existing fuse filesystems work out of the box, but the
principle is simply what you described.
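
For anyone who hasn't done fd passing over a UDS before, a minimal sketch
using SCM_RIGHTS follows.  The send_fd/recv_fd/demo_roundtrip names are made
up for this example; a pipe stands in for the /dev/fuse fd that the
privileged helper would open, and a socketpair stands in for the socket
between helper and fuse server:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open fd over a Unix domain socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {			/* union forces cmsghdr alignment of the buffer */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd; returns the new descriptor, or -1 on failure. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Pass a pipe's write end across a socketpair, then write through it. */
static int demo_roundtrip(void)
{
	int sv[2], p[2], got = -1, ret = -1;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (pipe(p) < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	if (send_fd(sv[0], p[1]) == 0 && (got = recv_fd(sv[1])) >= 0) {
		if (write(got, "x", 1) == 1 &&
		    read(p[0], &c, 1) == 1 && c == 'x')
			ret = 0;
		close(got);
	}

	close(p[0]);
	close(p[1]);
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

The receiver never needs permission to open the underlying object itself,
which is the whole point for a container that is denied /dev/fuse.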

Thanks,
Amir.

* Re: [PATCH 06/13] fuse: implement buffered IO with iomap
  2025-07-18 19:45         ` Amir Goldstein
@ 2025-07-18 20:20           ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 20:20 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong

On Fri, Jul 18, 2025 at 09:45:17PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 8:01 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Fri, Jul 18, 2025 at 05:10:14PM +0200, Amir Goldstein wrote:
> > > On Fri, Jul 18, 2025 at 1:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Implement pagecache IO with iomap, complete with hooks into truncate and
> > > > fallocate so that the fuse server needn't implement disk block zeroing
> > > > of post-EOF and unaligned punch/zero regions.
> > > >
> > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > ---
> > > >  fs/fuse/fuse_i.h          |   46 +++
> > > >  fs/fuse/fuse_trace.h      |  391 ++++++++++++++++++++++++
> > > >  include/uapi/linux/fuse.h |    5
> > > >  fs/fuse/dir.c             |   23 +
> > > >  fs/fuse/file.c            |   90 +++++-
> > > >  fs/fuse/file_iomap.c      |  723 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/fuse/inode.c           |   14 +
> > > >  7 files changed, 1268 insertions(+), 24 deletions(-)
> > > >
> > > >
> > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > index 67e428da4391aa..f33b348d296d5e 100644
> > > > --- a/fs/fuse/fuse_i.h
> > > > +++ b/fs/fuse/fuse_i.h
> > > > @@ -161,6 +161,13 @@ struct fuse_inode {
> > > >
> > > >                         /* waitq for direct-io completion */
> > > >                         wait_queue_head_t direct_io_waitq;
> > > > +
> > > > +#ifdef CONFIG_FUSE_IOMAP
> > > > +                       /* pending io completions */
> > > > +                       spinlock_t ioend_lock;
> > > > +                       struct work_struct ioend_work;
> > > > +                       struct list_head ioend_list;
> > > > +#endif
> > > >                 };
> > >
> > > This union member you are changing is declared for
> > > /* read/write io cache (regular file only) */
> > > but actually it is also for parallel dio and passthrough mode
> > >
> > > IIUC, there should be zero intersection between these io modes and
> > >  /* iomap cached fileio (regular file only) */
> > >
> > > Right?
> >
> > Right.  iomap will get very very confused if you switch file IO paths on
> > a live file.  I think it's /possible/ to switch if you flush and
> > truncate the whole page cache while holding inode_lock() but I don't
> > think anyone has ever tried.
> >
> > > So it can use its own union member without increasing fuse_inode size.
> > >
> > > Just need to be careful in fuse_init_file_inode(), fuse_evict_inode() and
> > > fuse_file_io_release() which do not assume a specific inode io mode.
> >
> > Yes, I think it's possible to put the iomap stuff in a separate struct
> > within that union so that we're not increasing the fuse_inode size
> > unnecessarily.  That's desirable for something to do before merging,
> > but for now prototyping is /much/ easier if I don't have to do that.
> >
> 
> Understood. You can deal with that later; I just wanted to leave a TODO note.

<nod> I'll leave an XXX comment then.

> > Making that change will require a lot of careful auditing, first I want
> > to make sure you all agree with the iomap approach because it's much
> > different from what I see in the other fuse IO paths. :)
> >
> 
> Indeed a good audit will be required, but
> *if* you can guarantee that iomap is always configured at inode
> initialization, then in fuse_init_file_inode() it is clear which member
> of the union is being initialized, and this mode has to stick with the
> inode until evict anyway.
> 
> So basically, all you need to do is never allow configuring iomap on an
> already initialized inode.

Right.  iomap has to be initialized at INEW time and cannot be changed.

> > Eeeyiks, struct fuse_inode shrinks from 1272 bytes to 1152 if I push the
> > iomap stuff into its own union struct.
> >
> > > Was it your intention to allow filesystems to configure some inodes to be
> > > in file_iomap mode and other inodes to be in regular cached/direct/passthrough
> > > io modes?
> >
> > That was a deliberate design decision on my part -- maybe a fuse server
> > would be capable of serving up some files from a local disk, and others
> > from (say) a network filesystem.  Or maybe it would like to expose an
> > administrative fd for the filesystem (like the xfs_healer event stream)
> > that isn't backed by storage.
> >
> 
> Understood.
> 
> But the filesystem should be able to make the decision at inode
> initialization time (lookup), and once made, this decision sticks
> throughout the inode lifetime. Right?

Correct.

> > > I can't say that I see a big benefit in allowing such setups.
> > > It certainly adds a lot of complication to the test matrix if we allow that.
> > > My instinct is for initial version, either allow only opening files in
> > > FILE_IOMAP or
> > > DIRECT_IOMAP to inodes for a filesystem that supports those modes.
> >
> > I was thinking about combining FUSE_ATTR_IOMAP_(DIRECTIO|FILEIO) for the
> > next RFC because I can't imagine any scenario where you don't want
> > directio support if you already use iomap for the pagecache.  fuse iomap
> > requires directio write support for writeback, so the server *must*
> > support IOMAP_WRITE|IOMAP_DIRECT.
> >
> > > Perhaps later we can allow (and maybe fallback to) FOPEN_DIRECT_IO
> > > (without parallel dio) if a server does not configure IOMAP to some inode
> > > to allow a server to provide the data for a specific inode directly.
> >
> > Hrmm.  Is FOPEN_DIRECT_IO the magic flag that fuse passes to the fuse
> > server to tell it that a file is open in directio mode?  There's a few
> > fstests that initiate aio+dio writes to a dm-error device that currently
> > fail in non-iomap mode because fuse2fs writes everything to the bdev
> > pagecache.
> >
> 
> Not exactly, but never mind; you can use much simpler logic for what
> you described:
> iomap has to be configured on inode instantiation and never changed afterwards.
> Other inodes are not going to be affected by iomap at all from that point on.

<nod>

> > > fuse_file_io_open/release() can help you manage those restrictions and
> > > set ff->iomode = IOM_FILE_IOMAP when a file is opened for file iomap.
> > > I did not look closely enough to see if file_iomap code ends up setting
> > > ff->iomode = IOM_CACHED/UNCACHED or always remains IOM_NONE.
> >
> > I don't touch ff->iomode because iomap is a per-inode property, not a
> > per-file property... but I suppose that would be a good place to look.
> >
> 
> Right, with cached/direct/passthrough the inode may change the iomode
> after all files are closed, but we *do* keep the mode in the inode,
> so we know that files cannot be opened in conflicting modes on the same inode.
> 
> The purpose of ff->iomode is to know if the file contributes to cached mode
> positive iocachectr or to a negative passthrough mode refcount.
> 
> So setting ff->iomode = IOM_IOMAP just helps for annotating how the
> file was opened, in case we are tracing it. There is no functional need to
> define and set this mode on the file when the mode of the inode is const.

Ah ok.  I'll go add that for the next rfc, thanks!

--D

> Thanks,
> Amir.
> 

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18 19:56       ` Amir Goldstein
@ 2025-07-18 20:21         ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-18 20:21 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, linux-fsdevel, John, bernd, miklos,
	joannelkoong, Josef Bacik, linux-ext4, Theodore Ts'o,
	Neal Gompa

On Fri, Jul 18, 2025 at 09:56:56PM +0200, Amir Goldstein wrote:
> On Fri, Jul 18, 2025 at 9:31 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > Hi everyone,
> > > > >
> > > > > DO NOT MERGE THIS, STILL!
> > > > >
> > > > > This is the third request for comments of a prototype to connect the
> > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > files whose contents persist to locally attached storage devices.
> > > > >
> > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > server process.
> > > > >
> > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > because you have to understand every filesystem's bespoke use of that
> > > > > core code.  Eeeugh.
> > > > >
> > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > works.
> > > > >
> > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > overhead.  And that's with debugging turned on!
> > > > >
> > > > > These items have been addressed since the first RFC:
> > > > >
> > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > unwritten and delalloc mappings.
> > > > >
> > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > >
> > > > > 3. iomap supports inline data.
> > > > >
> > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > fuse server can set fuse_attr::flags.
> > > > >
> > > > > 5. statx and syncfs work on iomap filesystems.
> > > > >
> > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > is enabled.
> > > > >
> > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > >
> > > > > There are some major warts remaining:
> > > > >
> > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > actually works correctly.
> > > > >
> > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > to index its incore inode, so we have to pass those too so that
> > > > > notifications work properly.  This is related to item c below:
> > > > >
> > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > but on the plus side there will be far less path lookup overhead.
> > > > >
> > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > libext2fs except for inline data.
> > > > >
> > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > don't want filesystems to unmount abruptly.
> > > > >
> > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > namespace.  This prevents us from using most of the stronger systemd
> > > >
> > > > I'm happy to help you here.
> > > >
> > > > First, I think using a character device for namespaced drivers is always
> > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > in general. And having device nodes on anything other than tmpfs is just
> > > > wrong (TM).
> > > >
> > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > persistent storage imho. But anyway, that's besides the point.
> > > >
> > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > /dev/fuse should really be openable by the service itself.
> >
> > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > Can you pass an fsopen fd to an unprivileged process and have that
> > second process call fsmount?
> >
> > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > could pass open fds for /dev/fuse and fsopen; then the fuse server wouldn't
> > need any special /dev access at all.  I think then the fuse server's
> > service could have:
> >
> > DynamicUser=true
> > ProtectSystem=true
> > ProtectHome=true
> > PrivateTmp=true
> > PrivateDevices=true
> > DevicePolicy=strict
> >
> > (I think most of those are redundant with DynamicUser=true but a lot of
> > my systemd-fu is paged out ATM.)
> >
> > My goal here is extreme containment -- the code doing the fs metadata
> > parsing has no privileges, no write access except to the fds it was
> > given, no network access, and no ability to read anything outside the
> > root filesystem.  Then I can get back to writing buffer
> > overflows^W^Whigh quality filesystem code in peace.
> >
> > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > whiteouts. That means you can do mknod() in the container to create
> > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > bat so that containers can only do this on their private tmpfs mount at
> > > > /dev.)
> > > >
> > > > The downside of this would be to give unprivileged containers access to
> > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > change.
> >
> > Yeah, that is a new risk.  It's still better than metadata parsing
> > within the kernel address space ... though who knows how thoroughly fuse
> > has been fuzzed by syzbot :P
> >
> > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > sure enough about it to spill it.
> >
> > Please do share, #f is my crazy unbaked idea. :)
> >
> > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > a device driver.
> > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > >
> > > > > protections because they tend to run in a private mount namespace with
> > > > > various parts of the filesystem either hidden or readonly.
> > > > >
> > > > > In theory one could design a socket protocol to pass mount options,
> > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > a mount helper and a service:
> > > >
> > > > This isn't a problem really. This should just be an extension to
> > > > systemd-mountfsd.
> >
> > I suppose mount.safe could very well call systemd-mount to go do all the
> > systemd-related service setup, and that would take care of udisks as
> > well.
> >
> > > This is relevant not only to systemd env.
> > >
> > > I have been experimenting with this mount helper service to mount fuse fs
> > > inside an unprivileged kubernetes container, where opening of /dev/fuse
> > > is restricted by LSM policy:
> > >
> > > https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach
> >
> > That sounds similar to what I was thinking about, though there are a lot
> > of TLAs that I don't understand.
> 
> Heh. UDS is Unix Domain Socket if that's what you missed (?)
> All the rest don't matter.

I was wondering what that was.

> It's just a privileged service to mount fuse filesystems.
> The interesting thing is the trick with replacing fusermount3
> to make existing fuse filesystems work out of the box, but the
> principle is simply what you described.

<nod> Got it.
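
The principle discussed above (a privileged helper opens the descriptors and hands them to an unprivileged server over a Unix domain socket) rests on SCM_RIGHTS ancillary messages.  What follows is only a minimal sketch, assuming an already connected AF_UNIX socket; the helper names send_fd() and recv_fd() are hypothetical and are not taken from fusermount3, the CSI plugin, or systemd-mountfsd.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open fd over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* one byte of ordinary data carries the ancillary message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {			/* union guarantees cmsghdr alignment */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In this picture the mount helper would call send_fd() once per descriptor (the /dev/fuse fd, the block device fd) and the contained server would call recv_fd() for each.  The received descriptor refers to the same open file description as the sender's, so the server needs no /dev access of its own.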

--D

> Thanks,
> Amir.
> 

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-18 19:34         ` Darrick J. Wong
@ 2025-07-18 21:03           ` Bernd Schubert
  0 siblings, 0 replies; 174+ messages in thread
From: Bernd Schubert @ 2025-07-18 21:03 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, neal, John, miklos, joannelkoong, Horst Birthelmer



On 7/18/25 21:34, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 08:07:30PM +0200, Bernd Schubert wrote:
>>
>>>
>>> Please see the two attached patches, which are needed for fuse-io-uring.
>>> I can also send them separately, if you prefer.
>>
>> We (actually Horst) are just testing it as Horst sees failing xfs tests in
>> our branch with tmp page removal
>>
>> Patch 2 needs this addition (might be more, as I didn't test). 
>> I had it in first, but then split the patch and missed that.
> 
> Aha, I noticed that the flush didn't quite work when uring was enabled.
> I don't generally enable uring for testing because I already wrote a lot
> of shaky code and uring support is new.

Yeah, I can understand.

> 
> Though I'm afraid I have no opinion on this, because I haven't looked
> deeply into dev_uring.c.

The updated patches in my branch seem to work. Going to post them
separately, but with reference to your series tomorrow. Difference is
that we cannot call fuse_uring_flush_bg() from flush_bg_queue(), because
the latter is also called from fuse_request_end(), which would result in
a double lock, and even if it didn't, flushing over all queues there is
not desirable.


Thanks,
Bernd



* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-17 23:26   ` [PATCH 2/7] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
  2025-07-18 16:37     ` Bernd Schubert
@ 2025-07-18 22:23     ` Joanne Koong
  2025-07-19  0:32       ` Darrick J. Wong
  2025-07-19  7:18       ` Amir Goldstein
  1 sibling, 2 replies; 174+ messages in thread
From: Joanne Koong @ 2025-07-18 22:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, miklos, bernd

On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> generic/488 fails with fuse2fs in the following fashion:
>
> generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> (see /var/tmp/fstests/generic/488.full for details)
>
> This test opens a large number of files, unlinks them (which really just
> renames them to fuse hidden files), closes the program, unmounts the
> filesystem, and runs fsck to check that there aren't any inconsistencies
> in the filesystem.
>
> Unfortunately, the 488.full file shows that there are a lot of hidden
> files left over in the filesystem, with incorrect link counts.  Tracing
> fuse_request_* shows that there are a large number of FUSE_RELEASE
> commands that are queued up on behalf of the unlinked files at the time
> that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> aborted, the fuse server would have responded to the RELEASE commands by
> removing the hidden files; instead they stick around.

Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
of synchronous. For example for fuse servers that cache their data and
only write the buffer out to some remote filesystem when the file gets
closed, it seems useful for them to (like nfs) be able to return an
error to the client for close() if there's a failure committing that
data; that also has clearer API semantics imo, eg users are guaranteed
that when close() returns, all the processing/cleanup for that file
has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
the server holds local locks that get released in FUSE_RELEASE, if a
subsequent FUSE_OPEN happens before FUSE_RELEASE and then depends on
grabbing that lock, then we end up deadlocked if the server is
single-threaded.

I saw in your first patch that sending FUSE_RELEASE synchronously
leads to a deadlock under AIO but AFAICT, that happens because we
execute req->args->end() in fuse_request_end() synchronously; I think
if we execute that release asynchronously on a worker thread then that
gets rid of the deadlock.

If FUSE_RELEASE must be asynchronous though, then your approach makes
sense to me.

>
> Create a function to push all the background requests to the queue and
> then wait for the number of pending events to hit zero, and call this
> before fuse_abort_conn.  That way, all the pending events are processed
> by the fuse server and we don't end up with a corrupt filesystem.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/fuse/fuse_i.h |    6 ++++++
>  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
>  fs/fuse/inode.c  |    1 +
>  3 files changed, 45 insertions(+)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> +/*
> + * Flush all pending requests and wait for them.  Only call this function when
> + * it is no longer possible for other threads to add requests.
> + */
> +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)

It might be worth renaming this to something like
'fuse_flush_bg_requests' to make it more clear that this only flushes
background requests

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-18 22:23     ` Joanne Koong
@ 2025-07-19  0:32       ` Darrick J. Wong
  2025-07-21 20:32         ` Joanne Koong
  2025-07-22 12:30         ` Jeff Layton
  2025-07-19  7:18       ` Amir Goldstein
  1 sibling, 2 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-19  0:32 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, neal, John, miklos, bernd

On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > generic/488 fails with fuse2fs in the following fashion:
> >
> > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > (see /var/tmp/fstests/generic/488.full for details)
> >
> > This test opens a large number of files, unlinks them (which really just
> > renames them to fuse hidden files), closes the program, unmounts the
> > filesystem, and runs fsck to check that there aren't any inconsistencies
> > in the filesystem.
> >
> > Unfortunately, the 488.full file shows that there are a lot of hidden
> > files left over in the filesystem, with incorrect link counts.  Tracing
> > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > commands that are queued up on behalf of the unlinked files at the time
> > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > aborted, the fuse server would have responded to the RELEASE commands by
> > removing the hidden files; instead they stick around.
> 
> Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> of synchronous. For example for fuse servers that cache their data and
> only write the buffer out to some remote filesystem when the file gets
> closed, it seems useful for them to (like nfs) be able to return an
> error to the client for close() if there's a failure committing that

I don't think supplying a return value for close() is as helpful as it
seems -- the manpage says that there is no guarantee that data has been
flushed to disk; and if the file is removed from the process' fd table
then the operation succeeded no matter the return value. :P

(Also C programmers tend to be sloppy and not check the return value.)

> data; that also has clearer API semantics imo, eg users are guaranteed
> that when close() returns, all the processing/cleanup for that file
> has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> the server holds local locks that get released in FUSE_RELEASE, if a

Yes.  I think it's only useful for the case outlined in that patch, which
is that a program started an asyncio operation and then closed the fd.
In that particular case the program unambiguously doesn't care about the
return value of close so it's ok to perform the release asynchronously.

> subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> grabbing that lock, then we end up deadlocked if the server is
> single-threaded.

Hrm.  I suppose if you had a script that ran two programs one after the
other, each of which expected to be able to open and lock the same file,
then you could run into problems if the lock isn't released by the time
the second program is ready to open the file.

But having said that, some other program could very well open and lock
the file as soon as the lock drops.

> I saw in your first patch that sending FUSE_RELEASE synchronously
> leads to a deadlock under AIO but AFAICT, that happens because we
> execute req->args->end() in fuse_request_end() synchronously; I think
> if we execute that release asynchronously on a worker thread then that
> gets rid of the deadlock.

<nod> Last time I think someone replied that maybe they should all be
asynchronous.

> If FUSE_RELEASE must be asynchronous though, then your approach makes
> sense to me.

I think it only has to be asynchronous for the weird case outlined in
that patch (fuse server gets stuck closing its own client's fds).
Personally I think release ought to be synchronous at least as far as
the kernel doing all the stuff that close() says it has to do (removal
of record locks, deleting the fd table entry).

Note that doesn't necessarily mean that the kernel has to be completely
done with all the work that entails.  XFS defers freeing of unlinked
files until a background garbage collector gets around to doing that.
Other filesystems will actually make you wait while they free all the
data blocks and the inode.  But the kernel has no idea what the fuse
server actually does.

> > Create a function to push all the background requests to the queue and
> > then wait for the number of pending events to hit zero, and call this
> > before fuse_abort_conn.  That way, all the pending events are processed
> > by the fuse server and we don't end up with a corrupt filesystem.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  fs/fuse/fuse_i.h |    6 ++++++
> >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/inode.c  |    1 +
> >  3 files changed, 45 insertions(+)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > +/*
> > + * Flush all pending requests and wait for them.  Only call this function when
> > + * it is no longer possible for other threads to add requests.
> > + */
> > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> 
> It might be worth renaming this to something like
> 'fuse_flush_bg_requests' to make it more clear that this only flushes
> background requests

Hum.  Did I not understand the code correctly?  I thought that
flush_bg_queue puts all the background requests onto the active queue
and issues them to the fuse server; and the wait_event_timeout sits
around waiting for all the requests to receive their replies?

I could be mistaken though.  This is my rough understanding of what
happens to background requests:

1. Request created
2. Put request on bg_queue
3. <wait>
4. Request removed from bg_queue
5. Request sent
6. <wait>
7. Reply received
8. Request ends and is _put.

Non-background (foreground?) requests skip steps 2-4.  Meanwhile,
fc->waiting tracks the number of requests that are anywhere between the
end of step 1 and the start of step 8.

In any case, I want to push all the bg requests and wait until there are
no more requests in the system.
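
The lifecycle listed above can be modeled in a few lines of userspace C.  This is a toy model under stated assumptions, not the kernel code: struct toy_conn and the toy_* functions are hypothetical names, the `waiting` counter stands in for fc->waiting as described above, and the model ignores any limit on how many background requests may be active at once.

```c
#include <assert.h>

/* Toy stand-in for the connection state; this is NOT struct fuse_conn. */
struct toy_conn {
	unsigned int waiting;	/* requests between creation (step 1) and end (step 8) */
	unsigned int bg_queued;	/* background requests parked on the bg queue (steps 2-3) */
	unsigned int in_flight;	/* requests sent to the server, awaiting a reply (steps 5-6) */
};

/* Steps 1-2: create a background request and park it on the bg queue. */
static void toy_queue_bg(struct toy_conn *tc)
{
	tc->waiting++;
	tc->bg_queued++;
}

/* Steps 1 and 5: a foreground request skips the bg queue entirely. */
static void toy_send_fg(struct toy_conn *tc)
{
	tc->waiting++;
	tc->in_flight++;
}

/* Steps 4-5: push every parked background request to the server. */
static unsigned int toy_flush_bg(struct toy_conn *tc)
{
	unsigned int moved = tc->bg_queued;

	tc->in_flight += moved;
	tc->bg_queued = 0;
	return moved;
}

/* Steps 7-8: the server replied to one request, which then ends. */
static void toy_reply(struct toy_conn *tc)
{
	assert(tc->in_flight > 0);
	tc->in_flight--;
	tc->waiting--;
}

/* The condition a flush would wait for: no requests left anywhere. */
static int toy_idle(const struct toy_conn *tc)
{
	return tc->waiting == 0;
}
```

In this model, pushing the background queue alone never makes the connection idle; only the replies do, which is why the flush has to both push the queue and then wait for the counter to reach zero.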

--D

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-18 22:23     ` Joanne Koong
  2025-07-19  0:32       ` Darrick J. Wong
@ 2025-07-19  7:18       ` Amir Goldstein
  2025-07-21 20:05         ` Joanne Koong
  1 sibling, 1 reply; 174+ messages in thread
From: Amir Goldstein @ 2025-07-19  7:18 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Darrick J. Wong, linux-fsdevel, neal, John, miklos, bernd

On Sat, Jul 19, 2025 at 12:23 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > generic/488 fails with fuse2fs in the following fashion:
> >
> > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > (see /var/tmp/fstests/generic/488.full for details)
> >
> > This test opens a large number of files, unlinks them (which really just
> > renames them to fuse hidden files), closes the program, unmounts the
> > filesystem, and runs fsck to check that there aren't any inconsistencies
> > in the filesystem.
> >
> > Unfortunately, the 488.full file shows that there are a lot of hidden
> > files left over in the filesystem, with incorrect link counts.  Tracing
> > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > commands that are queued up on behalf of the unlinked files at the time
> > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > aborted, the fuse server would have responded to the RELEASE commands by
> > removing the hidden files; instead they stick around.
>
> Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> of synchronous. For example for fuse servers that cache their data and
> only write the buffer out to some remote filesystem when the file gets
> closed, it seems useful for them to (like nfs) be able to return an
> error to the client for close() if there's a failure committing that
> data; that also has clearer API semantics imo, eg users are guaranteed
> that when close() returns, all the processing/cleanup for that file
> has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> the server holds local locks that get released in FUSE_RELEASE, if a
> subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> grabbing that lock, then we end up deadlocked if the server is
> single-threaded.
>

There is a very good reason for keeping FUSE_FLUSH and FUSE_RELEASE
(as well as those vfs ops) separate.

A filesystem can decide if it needs synchronous close() (not release).
And with FOPEN_NOFLUSH, the filesystem can decide that per open file,
(unless it conflicts with a config like writeback cache).

I have a filesystem which can do very slow io and some clients
can get stuck doing open;fstat;close if close is always synchronous.
I actually found the libfuse feature of async flush (FUSE_RELEASE_FLUSH)
quite useful for my filesystem, so I carry a kernel patch to support it.

The issue of racing that you mentioned sounds odd.
First of all, who runs a single threaded fuse server?
Second, what does it matter if release is sync or async,
FUSE_RELEASE will not be triggered by the same
task calling FUSE_OPEN, so if there is a deadlock, it will happen
with sync release as well.

Thanks,
Amir.

* Re: [PATCH 07/14] libfuse: add a reply function to send FUSE_ATTR_* to the kernel
  2025-07-18 15:48       ` Darrick J. Wong
@ 2025-07-19  7:34         ` Amir Goldstein
  0 siblings, 0 replies; 174+ messages in thread
From: Amir Goldstein @ 2025-07-19  7:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: bschubert, John, joannelkoong, linux-fsdevel, bernd, neal, miklos

On Fri, Jul 18, 2025 at 5:48 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Jul 18, 2025 at 04:10:18PM +0200, Amir Goldstein wrote:
> > On Fri, Jul 18, 2025 at 1:36 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Create new fuse_reply_{attr,create,entry}_iflags functions so that we
> > > can send FUSE_ATTR_* flags to the kernel when instantiating an inode.
> > > Servers are expected to send FUSE_IFLAG_* values, which will be
> > > translated into what the kernel can understand.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > >  include/fuse_common.h   |    3 ++
> > >  include/fuse_lowlevel.h |   87 +++++++++++++++++++++++++++++++++++++++++++++--
> > >  lib/fuse_lowlevel.c     |   69 ++++++++++++++++++++++++++++++-------
> > >  lib/fuse_versionscript  |    4 ++
> > >  4 files changed, 146 insertions(+), 17 deletions(-)
>
> <snip for brevity>
>
> > > diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> > > index d26043fa54c036..568db13502a7d7 100644
> > > --- a/lib/fuse_lowlevel.c
> > > +++ b/lib/fuse_lowlevel.c
> > > @@ -545,7 +573,22 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
> > >         memset(&arg, 0, sizeof(arg));
> > >         arg.attr_valid = calc_timeout_sec(attr_timeout);
> > >         arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> > > -       convert_stat(attr, &arg.attr);
> > > +       convert_stat(attr, &arg.attr, 0);
> > > +
> > > +       return send_reply_ok(req, &arg, size);
> > > +}
> > > +
> > > +int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr,
> > > +                          unsigned int iflags, double attr_timeout)
> > > +{
> > > +       struct fuse_attr_out arg;
> > > +       size_t size = req->se->conn.proto_minor < 9 ?
> > > +               FUSE_COMPAT_ATTR_OUT_SIZE : sizeof(arg);
> > > +
> > > +       memset(&arg, 0, sizeof(arg));
> > > +       arg.attr_valid = calc_timeout_sec(attr_timeout);
> > > +       arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> > > +       convert_stat(attr, &arg.attr, iflags);
> > >
> > >         return send_reply_ok(req, &arg, size);
> > >  }
> >
> > I wonder why fuse_reply_attr() is not implemented as a wrapper to
> > fuse_reply_attr_iflags()?
>
> oops.  I meant to convert this one, and apparently forgot. :(
>
> > FWIW, the flags field was added in minor version 23 for
> > FUSE_ATTR_SUBMOUNT, but I guess that doesn't matter here.
>
> <nod> Hopefully nobody will call fuse_reply_attr_iflags when
> proto_minor < 23.  Do I need to check for that explicitly in libfuse and
> zero out iflags?  Or is it safe enough to assume that the os kernel
> ignores flags bits that it doesn't understand and/or are not enabled on
> the fuse_mount?
>

AFAICS the kernel ignores other flags, so I think it's fine.
It only ever checks the bits it knows about in fuse_iget() and in
fuse_dentry_revalidate() to make sure they did not change.
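
To illustrate the two points above (making the plain reply function a wrapper, and what an explicit check in the library could look like), here is a self-contained sketch.  Everything in it is a stand-in: struct toy_attr, TOY_IFLAG_IOMAP, TOY_KNOWN_IFLAGS and the toy_reply_* functions are hypothetical and only mimic the shape of fuse_reply_attr() and fuse_reply_attr_iflags().  The masking step shows the defensive alternative that was asked about, not something libfuse is known to do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bit and mask, for illustration only. */
#define TOY_IFLAG_IOMAP		(1u << 0)
#define TOY_KNOWN_IFLAGS	(TOY_IFLAG_IOMAP)

/* Stand-in for the attr structure that is sent to the kernel. */
struct toy_attr {
	uint64_t ino;
	uint32_t flags;
};

/* The one real implementation: fills the reply, including the flags. */
static int toy_reply_attr_iflags(struct toy_attr *out, uint64_t ino,
				 unsigned int iflags, unsigned int proto_minor)
{
	assert(out != NULL);
	memset(out, 0, sizeof(*out));
	out->ino = ino;

	/*
	 * Per the thread, the flags field only exists from minor version 23
	 * onward.  Leave it zero for older peers and drop unknown bits.
	 */
	if (proto_minor >= 23)
		out->flags = iflags & TOY_KNOWN_IFLAGS;
	return 0;
}

/* The plain variant is a thin wrapper that passes no flags. */
static int toy_reply_attr(struct toy_attr *out, uint64_t ino,
			  unsigned int proto_minor)
{
	return toy_reply_attr_iflags(out, ino, 0, proto_minor);
}
```

With this shape there is a single code path to maintain, and callers that never set flags get exactly the old behavior.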

Thanks,
Amir.

* Re: [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags
  2025-07-18 15:55       ` Darrick J. Wong
@ 2025-07-21 18:51         ` Bernd Schubert
  2025-07-23 17:50           ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Bernd Schubert @ 2025-07-21 18:51 UTC (permalink / raw)
  To: Darrick J. Wong, Amir Goldstein
  Cc: John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, bernd@bsbernd.com, neal@gompa.dev,
	miklos@szeredi.hu

On 7/18/25 17:55, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 04:27:50PM +0200, Amir Goldstein wrote:
>> On Fri, Jul 18, 2025 at 1:36 AM Darrick J. Wong <djwong@kernel.org> wrote:
>>>
>>> From: Darrick J. Wong <djwong@kernel.org>
>>>
>>> Create a new ->getattr_iflags function so that iomap filesystems can set
>>> the appropriate in-kernel inode flags on instantiation.
>>>
>>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> <snip for brevity>
> 
>>> diff --git a/lib/fuse.c b/lib/fuse.c
>>> index 8dbf88877dd37c..685d0181e569d0 100644
>>> --- a/lib/fuse.c
>>> +++ b/lib/fuse.c
>>> @@ -3710,14 +3832,19 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
>>>                         if (de->flags & FUSE_FILL_DIR_PLUS &&
>>>                             !is_dot_or_dotdot(de->name)) {
>>>                                 res = do_lookup(dh->fuse, dh->nodeid,
>>> -                                               de->name, &e);
>>> +                                               de->name, &e, &iflags);
>>>                                 if (res) {
>>>                                         dh->error = res;
>>>                                         return 1;
>>>                                 }
>>>                         }
>>>
>>> -                       thislen = fuse_add_direntry_plus(req, p, rem,
>>> +                       if (f->want_iflags)
>>> +                               thislen = fuse_add_direntry_plus_iflags(req, p,
>>> +                                                        rem, de->name, iflags,
>>> +                                                        &e, pos);
>>> +                       else
>>> +                               thislen = fuse_add_direntry_plus(req, p, rem,
>>>                                                          de->name, &e, pos);
>>
>>
>> All those conditional statements look pretty moot.
>> Can't we just force iflags to 0 if (!f->want_iflags)
>> and always call the *_iflags functions?
> 
> Heh, it already is zero, so yes, this could be a straight call to
> fuse_add_direntry_plus_iflags without the want_iflags check.  Will fix
> up this and the other thing you mentioned in the previous patch.
> 
> Thanks for the code review!
> 
> Having said that, the significant difficulties with iomap and the
> upper level fuse library still exist.  To summarize -- upper libfuse has
> its own nodeids which don't necessarily correspond to the filesystem's,
> and struct node/nodeid are duplicated for hardlinked files.  As a
> result, the kernel has multiple struct inodes for an ondisk ext4 inode,
> which completely breaks the locking for the iomap file IO model.
> 
> That forces me to port fuse2fs to the lowlevel library, so I might
> remove the lib/fuse.c patches entirely.  Are there plans to make the
> upper libfuse handle hardlinks better?

I don't have plans for high level improvements. To be honest, I didn't
know about the hard link issue at all. 
Also a bit surprising to see all your lowlevel work and then fuse high
level coming ;)

Btw, I will go on vacation on Wednesday and still have other things queued;
I'm going to try to review in the evenings (but not before next Saturday).

Cheers,
Bernd

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-19  7:18       ` Amir Goldstein
@ 2025-07-21 20:05         ` Joanne Koong
  2025-07-23 17:06           ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Joanne Koong @ 2025-07-21 20:05 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Darrick J. Wong, linux-fsdevel, neal, John, miklos, bernd

On Sat, Jul 19, 2025 at 12:18 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Sat, Jul 19, 2025 at 12:23 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > generic/488 fails with fuse2fs in the following fashion:
> > >
> > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > commands that are queued up on behalf of the unlinked files at the time
> > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > aborted, the fuse server would have responded to the RELEASE commands by
> > > removing the hidden files; instead they stick around.
> >
> > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > of synchronous. For example for fuse servers that cache their data and
> > only write the buffer out to some remote filesystem when the file gets
> > closed, it seems useful for them to (like nfs) be able to return an
> > error to the client for close() if there's a failure committing that
> > data; that also has clearer API semantics imo, eg users are guaranteed
> > that when close() returns, all the processing/cleanup for that file
> > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > the server holds local locks that get released in FUSE_RELEASE, if a
> > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > grabbing that lock, then we end up deadlocked if the server is
> > single-threaded.
> >
>
> There is a very good reason for keeping FUSE_FLUSH and FUSE_RELEASE
> (as well as those vfs ops) separate.

Oh interesting, I didn't realize FUSE_FLUSH also gets sent on the
release path. I had assumed FUSE_FLUSH was for the sync()/fsync()
case. But I see now that you're right, close() makes a call to
filp_flush() in the vfs layer. (and I now see there's FUSE_FSYNC for
the fsync() case)

>
> A filesystem can decide if it needs synchronous close() (not release).
> And with FOPEN_NOFLUSH, the filesystem can decide that per open file,
> (unless it conflicts with a config like writeback cache).
>
> I have a filesystem which can do very slow io and some clients
> can get stuck doing open;fstat;close if close is always synchronous.
> I actually found the libfuse feature of async flush (FUSE_RELEASE_FLUSH)
> quite useful for my filesystem, so I carry a kernel patch to support it.
>
> The issue of racing that you mentioned sounds odd.
> First of all, who runs a single threaded fuse server?
> Second, what does it matter if release is sync or async,
> FUSE_RELEASE will not be triggered by the same
> task calling FUSE_OPEN, so if there is a deadlock, it will happen
> with sync release as well.

If the server is single-threaded, I think the FUSE_RELEASE would have
to happen on the same task as FUSE_OPEN, so if the release is
synchronous, this would avoid the deadlock because that guarantees the
FUSE_RELEASE happens before the next FUSE_OPEN.

However now that you pointed out FUSE_FLUSH gets sent on the release
path, that addresses my worry about async FUSE_RELEASE returning
before the server has gotten a chance to write out their local buffer
cache.

Thanks,
Joanne
>
> Thanks,
> Amir.

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-19  0:32       ` Darrick J. Wong
@ 2025-07-21 20:32         ` Joanne Koong
  2025-07-23 17:34           ` Darrick J. Wong
  2025-07-22 12:30         ` Jeff Layton
  1 sibling, 1 reply; 174+ messages in thread
From: Joanne Koong @ 2025-07-21 20:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, miklos, bernd

On Fri, Jul 18, 2025 at 5:32 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > generic/488 fails with fuse2fs in the following fashion:
> > >
> > > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > (see /var/tmp/fstests/generic/488.full for details)
> > >
> > > This test opens a large number of files, unlinks them (which really just
> > > renames them to fuse hidden files), closes the program, unmounts the
> > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > in the filesystem.
> > >
> > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > commands that are queued up on behalf of the unlinked files at the time
> > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > aborted, the fuse server would have responded to the RELEASE commands by
> > > removing the hidden files; instead they stick around.
> >
> > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > of synchronous. For example for fuse servers that cache their data and
> > only write the buffer out to some remote filesystem when the file gets
> > closed, it seems useful for them to (like nfs) be able to return an
> > error to the client for close() if there's a failure committing that
>
> I don't think supplying a return value for close() is as helpful as it
> seems -- the manpage says that there is no guarantee that data has been
> flushed to disk; and if the file is removed from the process' fd table
> then the operation succeeded no matter the return value. :P
>
> (Also C programmers tend to be sloppy and not check the return value.)

Amir pointed out FUSE_FLUSH gets sent on the FUSE_RELEASE path so that
addresses my worry. FUSE_FLUSH is sent synchronously (and close() will
propagate any flush errors too), so now if there's an abort or
something right after close() returns, the client is guaranteed that
any data they wrote into a local cache has been flushed by the server.

>
> > data; that also has clearer API semantics imo, eg users are guaranteed
> > that when close() returns, all the processing/cleanup for that file
> > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > the server holds local locks that get released in FUSE_RELEASE, if a
>
> Yes.  I think it's only useful for the case outlined in that patch, which
> is that a program started an asyncio operation and then closed the fd.
> In that particular case the program unambiguously doesn't care about the
> return value of close so it's ok to perform the release asynchronously.

I wonder why fuseblk devices need to be synchronously released. The
comment says "Make the release synchronous if this is a fuseblk
mount, synchronous RELEASE is allowed (and desirable)". Why is it
desirable?

>
> > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > grabbing that lock, then we end up deadlocked if the server is
> > single-threaded.
>
> Hrm.  I suppose if you had a script that ran two programs one after the
> other, each of which expected to be able to open and lock the same file,
> then you could run into problems if the lock isn't released by the time
> the second program is ready to open the file.

I think in your scenario with the two programs, the worst outcome is
that acquiring the open/lock can take a while, but in the (contrived
and probably far-fetched) scenario where the server is single-threaded,
it would result in a complete deadlock.

>
> But having said that, some other program could very well open and lock
> the file as soon as the lock drops.
>
> > I saw in your first patch that sending FUSE_RELEASE synchronously
> > leads to a deadlock under AIO but AFAICT, that happens because we
> > execute req->args->end() in fuse_request_end() synchronously; I think
> > if we execute that release asynchronously on a worker thread then that
> > gets rid of the deadlock.
>
> <nod> Last time I think someone replied that maybe they should all be
> asynchronous.
>
> > If FUSE_RELEASE must be asynchronous though, then your approach makes
> > sense to me.
>
> I think it only has to be asynchronous for the weird case outlined in
> that patch (fuse server gets stuck closing its own client's fds).
> Personally I think release ought to be synchronous at least as far as
> the kernel doing all the stuff that close() says it has to do (removal
> of record locks, deleting the fd table entry).
>
> Note that doesn't necessarily mean that the kernel has to be completely
> done with all the work that entails.  XFS defers freeing of unlinked
> files until a background garbage collector gets around to doing that.
> Other filesystems will actually make you wait while they free all the
> data blocks and the inode.  But the kernel has no idea what the fuse
> server actually does.

I guess if that's important enough to the server, we could add
something like an FOPEN flag that servers could set on the file
handle if they want synchronous release?

After Amir's point about FUSE_FLUSH, I'm now in favor of FUSE_RELEASE
being asynchronous.
>
> > > Create a function to push all the background requests to the queue and
> > > then wait for the number of pending events to hit zero, and call this
> > > before fuse_abort_conn.  That way, all the pending events are processed
> > > by the fuse server and we don't end up with a corrupt filesystem.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > >  fs/fuse/fuse_i.h |    6 ++++++
> > >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> > >  fs/fuse/inode.c  |    1 +
> > >  3 files changed, 45 insertions(+)
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > +/*
> > > + * Flush all pending requests and wait for them.  Only call this function when
> > > + * it is no longer possible for other threads to add requests.
> > > + */
> > > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> >
> > It might be worth renaming this to something like
> > 'fuse_flush_bg_requests' to make it more clear that this only flushes
> > background requests
>
> Hum.  Did I not understand the code correctly?  I thought that
> flush_bg_queue puts all the background requests onto the active queue
> and issues them to the fuse server; and the wait_event_timeout sits
> around waiting for all the requests to receive their replies?

Sorry, didn't mean to be confusing with my previous comment. What I
was trying to say is that "fuse_flush_requests" implies that all
requests get flushed to userspace but here only the background
requests get flushed.

Thanks,
Joanne
>
> I could be mistaken though.  This is my rough understanding of what
> happens to background requests:
>
> 1. Request created
> 2. Put request on bg_queue
> 3. <wait>
> 4. Request removed from bg_queue
> 5. Request sent
> 6. <wait>
> 7. Reply received
> 8. Request ends and is _put.
>
> Non-background (foreground?) requests skip steps 2-4.  Meanwhile,
> fc->waiting tracks the number of requests that are anywhere between the
> end of step 1 and the start of step 8.
>
> In any case, I want to push all the bg requests and wait until there are
> no more requests in the system.
>
> --D
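
The lifecycle quoted above can be sketched as a small single-threaded
userspace model.  This is only a toy under stated assumptions: every name
here (toy_conn, toy_flush_requests, and so on) is hypothetical, it is not
the code from the patch, and it ignores details such as the max_background
limit.  It does show the point under discussion: the flush step only
touches background requests, while the wait covers every request counted
in "waiting".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REQS 16

/* Toy connection state; field names are illustrative, not struct fuse_conn. */
struct toy_conn {
	int bg_queue[TOY_MAX_REQS];	/* request ids parked in step 2 */
	size_t nr_bg;
	int sent[TOY_MAX_REQS];		/* request ids in flight, steps 5-6 */
	size_t nr_sent;
	int waiting;			/* requests between steps 1 and 8 */
};

/* Steps 1-2: create a background request (id >= 0) and park it. */
static void toy_queue_bg(struct toy_conn *c, int id)
{
	assert(id >= 0 && c->nr_bg < TOY_MAX_REQS);
	c->waiting++;
	c->bg_queue[c->nr_bg++] = id;
}

/* Steps 1 and 5: a foreground request skips the background queue. */
static void toy_send_fg(struct toy_conn *c, int id)
{
	assert(id >= 0 && c->nr_sent < TOY_MAX_REQS);
	c->waiting++;
	c->sent[c->nr_sent++] = id;
}

/* Steps 4-5: move everything off the background queue and send it. */
static void toy_flush_bg_queue(struct toy_conn *c)
{
	for (size_t i = 0; i < c->nr_bg; i++) {
		assert(c->nr_sent < TOY_MAX_REQS);
		c->sent[c->nr_sent++] = c->bg_queue[i];
	}
	c->nr_bg = 0;
}

/* Steps 7-8: the server replies to one in-flight request, or -1 if none. */
static int toy_server_reply(struct toy_conn *c)
{
	if (c->nr_sent == 0)
		return -1;
	c->nr_sent--;
	c->waiting--;
	return c->sent[c->nr_sent];
}

/* Flush, then drain until nothing is waiting; returns replies processed. */
static int toy_flush_requests(struct toy_conn *c)
{
	int replies = 0;

	toy_flush_bg_queue(c);
	while (c->waiting > 0 && toy_server_reply(c) >= 0)
		replies++;
	return replies;
}
```

Queueing two background requests and one foreground request and then
calling toy_flush_requests() drains all three, which is the "wait until
there are no more requests in the system" behaviour described above.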

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-19  0:32       ` Darrick J. Wong
  2025-07-21 20:32         ` Joanne Koong
@ 2025-07-22 12:30         ` Jeff Layton
  2025-07-22 12:38           ` Jeff Layton
  1 sibling, 1 reply; 174+ messages in thread
From: Jeff Layton @ 2025-07-22 12:30 UTC (permalink / raw)
  To: Darrick J. Wong, Joanne Koong; +Cc: linux-fsdevel, neal, John, miklos, bernd

On Fri, 2025-07-18 at 17:32 -0700, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > 
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > generic/488 fails with fuse2fs in the following fashion:
> > > 
> > > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > (see /var/tmp/fstests/generic/488.full for details)
> > > 
> > > This test opens a large number of files, unlinks them (which really just
> > > renames them to fuse hidden files), closes the program, unmounts the
> > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > in the filesystem.
> > > 
> > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > commands that are queued up on behalf of the unlinked files at the time
> > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > aborted, the fuse server would have responded to the RELEASE commands by
> > > removing the hidden files; instead they stick around.
> > 
> > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > of synchronous. For example for fuse servers that cache their data and
> > only write the buffer out to some remote filesystem when the file gets
> > closed, it seems useful for them to (like nfs) be able to return an
> > error to the client for close() if there's a failure committing that
> 
> I don't think supplying a return value for close() is as helpful as it
> seems -- the manpage says that there is no guarantee that data has been
> flushed to disk; and if the file is removed from the process' fd table
> then the operation succeeded no matter the return value. :P
> 
> (Also C programmers tend to be sloppy and not check the return value.)
> 

The POSIX spec and manpage for close(2) make no mention of writeback
errors, so it's not 100% clear that returning them there is at all OK.
Everyone sort of assumes that it makes sense to do so, but it can be
actively harmful. Suppose we do this:

open() = 1
write(1)
close(1) 
open() = 2
fsync(2) = ???

Now, assume there was a writeback error that happens either before or
after the close.

With the way this works today, you will get back an error on that final
fsync() even if fd 2 was opened _after_ the writeback error occurred,
because nothing will have scraped it yet.

If you scrape the error to return it on the close though, then the
result of that fsync() would be inconclusive. If the error happens
before the close(), then fsync() will return 0. If it fails after the
close(), then the fsync() will see an error.
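
To make the two outcomes concrete, here is a toy userspace model of the
sequence above.  It is loosely inspired by the idea of a per-file error
cursor that is sampled at open time; all names (toy_errseq,
toy_final_fsync, ...) are hypothetical and this is a sketch of the
reasoning, not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a writeback error cursor (hypothetical names). */
struct toy_errseq {
	int err;		/* most recent writeback error, negative errno */
	unsigned int seq;	/* bumped every time a new error is recorded */
	bool seen;		/* has anyone been told about it yet? */
};

/* Record a new writeback error. */
static void toy_set(struct toy_errseq *e, int err)
{
	assert(e && err < 0);
	e->err = err;
	e->seq++;
	e->seen = false;
}

/* Called at open(): an unseen error is deliberately left reportable. */
static unsigned int toy_sample(const struct toy_errseq *e)
{
	return e->seen ? e->seq : 0;
}

/* Called at fsync(), or at close() if close scrapes errors. */
static int toy_check_and_advance(struct toy_errseq *e, unsigned int *since)
{
	assert(e && since);
	if (*since == e->seq)
		return 0;
	*since = e->seq;
	e->seen = true;
	return e->err;
}

/* Replay open/write/close/open/fsync; return what the final fsync() sees. */
static int toy_final_fsync(bool close_scrapes, bool error_before_close)
{
	struct toy_errseq wb = { 0, 0, false };
	unsigned int fd1 = toy_sample(&wb);		/* open() = 1 */
	unsigned int fd2;

	if (error_before_close)
		toy_set(&wb, -EIO);			/* writeback fails early */
	if (close_scrapes)
		(void)toy_check_and_advance(&wb, &fd1);	/* close(1) reports */
	if (!error_before_close)
		toy_set(&wb, -EIO);			/* writeback fails late */

	fd2 = toy_sample(&wb);				/* open() = 2 */
	return toy_check_and_advance(&wb, &fd2);	/* fsync(2) */
}
```

In this model, toy_final_fsync(false, ...) returns -EIO regardless of
timing, whereas toy_final_fsync(true, true) returns 0 and
toy_final_fsync(true, false) returns -EIO, which is the inconclusive
result described above.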

> > data; that also has clearer API semantics imo, eg users are guaranteed
> > that when close() returns, all the processing/cleanup for that file
> > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > the server holds local locks that get released in FUSE_RELEASE, if a
> 
> Yes.  I think it's only useful for the case outlined in that patch, which
> is that a program started an asyncio operation and then closed the fd.
> In that particular case the program unambiguously doesn't care about the
> return value of close so it's ok to perform the release asynchronously.
> 
> > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > grabbing that lock, then we end up deadlocked if the server is
> > single-threaded.
> 
> Hrm.  I suppose if you had a script that ran two programs one after the
> other, each of which expected to be able to open and lock the same file,
> then you could run into problems if the lock isn't released by the time
> the second program is ready to open the file.
> 
> But having said that, some other program could very well open and lock
> the file as soon as the lock drops.
> 
> > I saw in your first patch that sending FUSE_RELEASE synchronously
> > leads to a deadlock under AIO but AFAICT, that happens because we
> > execute req->args->end() in fuse_request_end() synchronously; I think
> > if we execute that release asynchronously on a worker thread then that
> > gets rid of the deadlock.
> 
> <nod> Last time I think someone replied that maybe they should all be
> asynchronous.
> 
> > If FUSE_RELEASE must be asynchronous though, then your approach makes
> > sense to me.
> 
> I think it only has to be asynchronous for the weird case outlined in
> that patch (fuse server gets stuck closing its own client's fds).
> Personally I think release ought to be synchronous at least as far as
> the kernel doing all the stuff that close() says it has to do (removal
> of record locks, deleting the fd table entry).
> 
> Note that doesn't necessarily mean that the kernel has to be completely
> done with all the work that entails.  XFS defers freeing of unlinked
> files until a background garbage collector gets around to doing that.
> Other filesystems will actually make you wait while they free all the
> data blocks and the inode.  But the kernel has no idea what the fuse
> server actually does.
> 
> > > Create a function to push all the background requests to the queue and
> > > then wait for the number of pending events to hit zero, and call this
> > > before fuse_abort_conn.  That way, all the pending events are processed
> > > by the fuse server and we don't end up with a corrupt filesystem.
> > > 
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > >  fs/fuse/fuse_i.h |    6 ++++++
> > >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> > >  fs/fuse/inode.c  |    1 +
> > >  3 files changed, 45 insertions(+)
> > > 
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > +/*
> > > + * Flush all pending requests and wait for them.  Only call this function when
> > > + * it is no longer possible for other threads to add requests.
> > > + */
> > > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> > 
> > It might be worth renaming this to something like
> > 'fuse_flush_bg_requests' to make it more clear that this only flushes
> > background requests
> 
> Hum.  Did I not understand the code correctly?  I thought that
> flush_bg_queue puts all the background requests onto the active queue
> and issues them to the fuse server; and the wait_event_timeout sits
> around waiting for all the requests to receive their replies?
> 
> I could be mistaken though.  This is my rough understanding of what
> happens to background requests:
> 
> 1. Request created
> 2. Put request on bg_queue
> 3. <wait>
> 4. Request removed from bg_queue
> 5. Request sent
> 6. <wait>
> 7. Reply received
> 8. Request ends and is _put.
> 
> Non-background (foreground?) requests skip steps 2-4.  Meanwhile,
> fc->waiting tracks the number of requests that are anywhere between the
> end of step 1 and the start of step 8.
> 
> In any case, I want to push all the bg requests and wait until there are
> no more requests in the system.
> 
> --D

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-22 12:30         ` Jeff Layton
@ 2025-07-22 12:38           ` Jeff Layton
  2025-07-23 15:37             ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Jeff Layton @ 2025-07-22 12:38 UTC (permalink / raw)
  To: Darrick J. Wong, Joanne Koong; +Cc: linux-fsdevel, neal, John, miklos, bernd

On Tue, 2025-07-22 at 08:30 -0400, Jeff Layton wrote:
> On Fri, 2025-07-18 at 17:32 -0700, Darrick J. Wong wrote:
> > On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > 
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > generic/488 fails with fuse2fs in the following fashion:
> > > > 
> > > > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > > (see /var/tmp/fstests/generic/488.full for details)
> > > > 
> > > > This test opens a large number of files, unlinks them (which really just
> > > > renames them to fuse hidden files), closes the program, unmounts the
> > > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > > in the filesystem.
> > > > 
> > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > commands that are queued up on behalf of the unlinked files at the time
> > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > removing the hidden files; instead they stick around.
> > > 
> > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > of synchronous. For example for fuse servers that cache their data and
> > > only write the buffer out to some remote filesystem when the file gets
> > > closed, it seems useful for them to (like nfs) be able to return an
> > > error to the client for close() if there's a failure committing that
> > 
> > I don't think supplying a return value for close() is as helpful as it
> > seems -- the manpage says that there is no guarantee that data has been
> > flushed to disk; and if the file is removed from the process' fd table
> > then the operation succeeded no matter the return value. :P
> > 
> > (Also C programmers tend to be sloppy and not check the return value.)
> > 
> 
> The POSIX spec and manpage for close(2) make no mention of writeback
> errors, so it's not 100% clear that returning them there is at all OK.
> Everyone sort of assumes that it makes sense to do so, but it can be
> actively harmful.
> 

Actually, they do mention this, but I still argue that it's not a good
idea to do so. If you want writeback errors use fsync() (or maybe the
new ioctl() that someone was plumbing in that scrapes errors without
doing writeback).
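
For illustration, a userspace program that cares about writeback errors
might follow a pattern like the one below.  This is a minimal sketch using
only standard POSIX calls; the helper name write_then_fsync is made up for
this example, and it assumes Linux behaviour where the fd is released even
if close() reports an error.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 on success or a negative errno. */
static int write_then_fsync(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd, ret = 0;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -errno;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			ret = -errno;
			break;
		}
		if (n == 0) {
			/* No progress; give up rather than spin forever. */
			ret = -EIO;
			break;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Ask for writeback errors explicitly instead of relying on close(). */
	if (ret == 0 && fsync(fd) < 0)
		ret = -errno;

	/* Assumed Linux behaviour: the fd is gone either way, so never retry. */
	if (close(fd) < 0 && ret == 0)
		ret = -errno;

	return ret;
}
```

The close() result is still checked, but only as a last resort; the
fsync() call is the one the caller relies on for durability errors.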

> Suppose we do this:
> 
> open() = 1
> write(1)
> close(1) 
> open() = 2
> fsync(2) = ???
> 
> Now, assume there was a writeback error that happens either before or
> after the close.
> 
> With the way this works today, you will get back an error on that final
> fsync() even if fd 2 was opened _after_ the writeback error occurred,
> because nothing will have scraped it yet.
> 
> If you scrape the error to return it on the close though, then the
> result of that fsync() would be inconclusive. If the error happens
> before the close(), then fsync() will return 0. If it fails after the
> close(), then the fsync() will see an error.
> 
> > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > that when close() returns, all the processing/cleanup for that file
> > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > the server holds local locks that get released in FUSE_RELEASE, if a
> > 
> > Yes.  I think it's only useful for the case outlined in that patch, which
> > is that a program started an asyncio operation and then closed the fd.
> > In that particular case the program unambiguously doesn't care about the
> > return value of close so it's ok to perform the release asynchronously.
> > 
> > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > grabbing that lock, then we end up deadlocked if the server is
> > > single-threaded.
> > 
> > Hrm.  I suppose if you had a script that ran two programs one after the
> > other, each of which expected to be able to open and lock the same file,
> > then you could run into problems if the lock isn't released by the time
> > the second program is ready to open the file.
> > 
> > But having said that, some other program could very well open and lock
> > the file as soon as the lock drops.
> > 
> > > I saw in your first patch that sending FUSE_RELEASE synchronously
> > > leads to a deadlock under AIO but AFAICT, that happens because we
> > > execute req->args->end() in fuse_request_end() synchronously; I think
> > > if we execute that release asynchronously on a worker thread then that
> > > gets rid of the deadlock.
> > 
> > <nod> Last time I think someone replied that maybe they should all be
> > asynchronous.
> > 
> > > If FUSE_RELEASE must be asynchronous though, then your approach makes
> > > sense to me.
> > 
> > I think it only has to be asynchronous for the weird case outlined in
> > that patch (fuse server gets stuck closing its own client's fds).
> > Personally I think release ought to be synchronous at least as far as
> > the kernel doing all the stuff that close() says it has to do (removal
> > of record locks, deleting the fd table entry).
> > 
> > Note that doesn't necessarily mean that the kernel has to be completely
> > done with all the work that entails.  XFS defers freeing of unlinked
> > files until a background garbage collector gets around to doing that.
> > Other filesystems will actually make you wait while they free all the
> > data blocks and the inode.  But the kernel has no idea what the fuse
> > server actually does.
> > 
> > > > Create a function to push all the background requests to the queue and
> > > > then wait for the number of pending events to hit zero, and call this
> > > > before fuse_abort_conn.  That way, all the pending events are processed
> > > > by the fuse server and we don't end up with a corrupt filesystem.
> > > > 
> > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > ---
> > > >  fs/fuse/fuse_i.h |    6 ++++++
> > > >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> > > >  fs/fuse/inode.c  |    1 +
> > > >  3 files changed, 45 insertions(+)
> > > > 
> > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > +/*
> > > > + * Flush all pending requests and wait for them.  Only call this function when
> > > > + * it is no longer possible for other threads to add requests.
> > > > + */
> > > > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> > > 
> > > It might be worth renaming this to something like
> > > 'fuse_flush_bg_requests' to make it more clear that this only flushes
> > > background requests
> > 
> > Hum.  Did I not understand the code correctly?  I thought that
> > flush_bg_queue puts all the background requests onto the active queue
> > and issues them to the fuse server; and the wait_event_timeout sits
> > around waiting for all the requests to receive their replies?
> > 
> > I could be mistaken though.  This is my rough understanding of what
> > happens to background requests:
> > 
> > 1. Request created
> > 2. Put request on bg_queue
> > 3. <wait>
> > 4. Request removed from bg_queue
> > 5. Request sent
> > 6. <wait>
> > 7. Reply received
> > 8. Request ends and is _put.
> > 
> > Non-background (foreground?) requests skip steps 2-4.  Meanwhile,
> > fc->waiting tracks the number of requests that are anywhere between the
> > end of step 1 and the start of step 8.
> > 
> > In any case, I want to push all the bg requests and wait until there are
> > no more requests in the system.
> > 
> > --D

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
  2025-07-18 18:13       ` Darrick J. Wong
@ 2025-07-22 22:20         ` Bernd Schubert
  0 siblings, 0 replies; 174+ messages in thread
From: Bernd Schubert @ 2025-07-22 22:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, miklos, joannelkoong



On 7/18/25 20:13, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 07:10:37PM +0200, Bernd Schubert wrote:
>>
>>
>> On 7/18/25 01:27, Darrick J. Wong wrote:
>>> From: Darrick J. Wong <djwong@kernel.org>
>>>
>>> The fuse_request_{send,end} tracepoints capture the value of
>>> req->in.h.unique in the trace output.  It would be really nice if we
>>> could use this to match a request to its response for debugging and
>>> latency analysis, but the call to trace_fuse_request_send occurs before
>>> the unique id has been set:
>>>
>>> fuse_request_send:    connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
>>> fuse_request_end:     connection 8388608 req 6 len 16 error -2
>>>
>>> Move the callsites to trace_fuse_request_send to after the unique id has
>>> been set, or right before we decide to cancel a request having not set
>>> one.
>>
>> Sorry, my fault, I have a branch for that already. Just occupied and
>> then just didn't send v4.
>>
>> https://lore.kernel.org/all/20250403-fuse-io-uring-trace-points-v3-0-35340aa31d9c@ddn.com/
> 
> (Aha, that was before I started paying attention to the fuse patches on
> fsdevel.)
> 
>> The updated branch is here
>>
>> https://github.com/bsbernd/linux/commits/fuse-io-uring-trace-points/
>>
>> Objections if we go with that version, as it adds a few more tracepoints
>> and removes the lock to get the unique ID.
> 
> Let me look through the branch --
> 
>  * fuse: Make the fuse unique value a per-cpu counter
> 
> Is there any reason you didn't use percpu_counter_init() ?  It does the
> same per-cpu batching that (I think) your version does.
> 
>  * fuse: Set request unique on allocation
>  * fuse: {io-uring} Avoid _send code dup
> 
> Looks good,
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 
>  * fuse: fine-grained request ftraces
> 
> Are these three new tracepoints exactly identical except in name?
> If you declare an event class for them, that will save a lot of memory
> (~5K per tracepoint according to rostedt) over defining them
> individually.
> 
>  * per cpu cntr fix
> 
> I think you can avoid this if you use the kernel struct percpu_counter.

Thanks a lot for your review! I was hoping I would get it updated before
I got on vacation, but much too late now. I will see that I get some work
done next week, but no way before Saturday - traveling the next few days.


Thanks,
Bernd

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-18 19:31     ` Darrick J. Wong
  2025-07-18 19:56       ` Amir Goldstein
@ 2025-07-23 13:05       ` Christian Brauner
  2025-07-23 18:04         ` Darrick J. Wong
  1 sibling, 1 reply; 174+ messages in thread
From: Christian Brauner @ 2025-07-23 13:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > Hi everyone,
> > > >
> > > > DO NOT MERGE THIS, STILL!
> > > >
> > > > This is the third request for comments of a prototype to connect the
> > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > files whose contents persist to locally attached storage devices.
> > > >
> > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > kernel compromise, and I think there's a very strong incentive to move
> > > > all that parsing out to userspace where we can containerize the fuse
> > > > server process.
> > > >
> > > > willy's folios conversion project (and to a certain degree RH's new
> > > > mount API) have also demonstrated that treewide changes to the core
> > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > because you have to understand every filesystem's bespoke use of that
> > > > core code.  Eeeugh.
> > > >
> > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > writeback is now a directio write.  The fuse server is now able to
> > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > works.
> > > >
> > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > maintains most of its performance.  At this stage I still get about 95%
> > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > for more details.  Unwritten extent conversions on random direct writes
> > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > overhead.  And that's with debugging turned on!
> > > >
> > > > These items have been addressed since the first RFC:
> > > >
> > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > between pagecache zeroing and writeback on filesystems that support
> > > > unwritten and delalloc mappings.
> > > >
> > > > 2. Mappings can be cached in the kernel for more speed.
> > > >
> > > > 3. iomap supports inline data.
> > > >
> > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > fuse server can set fuse_attr::flags.
> > > >
> > > > 5. statx and syncfs work on iomap filesystems.
> > > >
> > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > is enabled.
> > > >
> > > > 7. The ext4 shutdown ioctl is now supported.
> > > >
> > > > There are some major warts remaining:
> > > >
> > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > actually works correctly.
> > > >
> > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > to index its incore inode, so we have to pass those too so that
> > > > notifications work properly.  This is related to #3 below:
> > > >
> > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > but on the plus side there will be far less path lookup overhead.
> > > >
> > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > built things needed to stage the direct/buffered IO paths separately.
> > > > These are now unnecessary but I haven't pulled them out yet because
> > > > they're sort of useful to verify that iomap file IO never goes through
> > > > libext2fs except for inline data.
> > > >
> > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > don't want filesystems to unmount abruptly.
> > > >
> > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > mounts?  It's very convenient to use systemd services to configure
> > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > namespace.  This prevents us from using most of the stronger systemd
> > >
> > > I'm happy to help you here.
> > >
> > > First, I think using a character device for namespaced drivers is always
> > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > delegation because of devtmpfs not being namespaced as well as devices
> > > in general. And having device nodes on anything other than tmpfs is just
> > > wrong (TM).
> > >
> > > In systemd I ultimately want a bpf LSM program that prevents the
> > > creation of device nodes outside of tmpfs. They don't belong on
> > > persistent storage imho. But anyway, that's beside the point.
> > >
> > > Opening the block device should be done by systemd-mountfsd but I think
> > > /dev/fuse should really be openable by the service itself.
> 
> /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> Can you pass an fsopen fd to an unprivileged process and have that
> second process call fsmount?

Yes, but remember that at some point you must call
fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
fses that requires CAP_SYS_ADMIN so that has to be done by the
privileged process. All the rest can be done by the unprivileged process
though. That's exactly how bpf tokens work.
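
To make the handoff concrete: once the privileged helper has done
fsopen() and fsconfig(FSCONFIG_CMD_CREATE), the fd can be handed to the
unprivileged process over an AF_UNIX socket with SCM_RIGHTS.  The sketch
below shows only that fd-passing step, not the mount API calls
themselves.  send_fd() and recv_fd() are hypothetical helper names, not
part of any existing tool, and a real helper would need a proper
protocol around this.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send one open fd over a connected AF_UNIX socket.  Returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* one byte of real data travels with the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	assert(sock >= 0 && fd >= 0);
	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from the socket.  Returns the new fd or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	/* The kernel installed a fresh fd number in this process. */
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets its own fd number referring to the same open file
description, so whatever state the privileged side configured comes
along with it.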

> 
> If so, then it would be more convenient if mount.safe/systemd-mountfsd
> could pass open fds for /dev/fuse fsopen then the fuse server wouldn't

Yes, that would work.

> need any special /dev access at all.  I think then the fuse server's
> service could have:
> 
> DynamicUser=true
> ProtectSystem=true
> ProtectHome=true
> PrivateTmp=true
> PrivateDevices=true
> DevicePolicy=strict
> 
> (I think most of those are redundant with DynamicUser=true but a lot of
> my systemd-fu is paged out ATM.)
> 
> My goal here is extreme containment -- the code doing the fs metadata
> parsing has no privileges, no write access except to the fds it was
> given, no network access, and no ability to read anything outside the
> root filesystem.  Then I can get back to writing buffer
> overflows^W^Whigh quality filesystem code in peace.

Yeah, sounds about right.

> 
> > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > whiteouts. That means you can do mknod() in the container to create
> > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > bat so that containers can only do this on their private tmpfs mount at
> > > /dev.)
> > >
> > > The downside of this would be to give unprivileged containers access to
> > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > change.
> 
> Yeah, that is a new risk.  It's still better than metadata parsing
> within the kernel address space ... though who knows how thoroughly fuse
> has been fuzzed by syzbot :P
> 
> > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > sure enough about it to spill it.
> 
> Please do share, #f is my crazy unbaked idea. :)
> 
> > I don't think there is a hard requirement for the fuse fd to be opened from
> > a device driver.
> > With fuse io_uring communication, the open fd doesn't even need to do io.
> > 
> > > > protections because they tend to run in a private mount namespace with
> > > > various parts of the filesystem either hidden or readonly.
> > > >
> > > > In theory one could design a socket protocol to pass mount options,
> > > > block device paths, fds, and responsibility for the mount() call between
> > > > a mount helper and a service:
> > >
> > > This isn't a problem really. This should just be an extension to
> > > systemd-mountfsd.
> 
> I suppose mount.safe could very well call systemd-mount to go do all the
> systemd-related service setup, and that would take care of udisks as
> well.

The ultimate goal is to teach mount(8)/libmount to use that daemon when
it's available. Because that would just make unprivileged mounting work
without userspace noticing anything.

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-22 12:38           ` Jeff Layton
@ 2025-07-23 15:37             ` Darrick J. Wong
  2025-07-23 16:24               ` Jeff Layton
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-23 15:37 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Joanne Koong, linux-fsdevel, neal, John, miklos, bernd

On Tue, Jul 22, 2025 at 08:38:08AM -0400, Jeff Layton wrote:
> On Tue, 2025-07-22 at 08:30 -0400, Jeff Layton wrote:
> > On Fri, 2025-07-18 at 17:32 -0700, Darrick J. Wong wrote:
> > > On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> > > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > 
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > generic/488 fails with fuse2fs in the following fashion:
> > > > > 
> > > > > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > > > (see /var/tmp/fstests/generic/488.full for details)
> > > > > 
> > > > > This test opens a large number of files, unlinks them (which really just
> > > > > renames them to fuse hidden files), closes the program, unmounts the
> > > > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > > > in the filesystem.
> > > > > 
> > > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > > commands that are queued up on behalf of the unlinked files at the time
> > > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > > removing the hidden files; instead they stick around.
> > > > 
> > > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > > of synchronous. For example for fuse servers that cache their data and
> > > > only write the buffer out to some remote filesystem when the file gets
> > > > closed, it seems useful for them to (like nfs) be able to return an
> > > > error to the client for close() if there's a failure committing that
> > > 
> > > I don't think supplying a return value for close() is as helpful as it
> > > seems -- the manpage says that there is no guarantee that data has been
> > > flushed to disk; and if the file is removed from the process' fd table
> > > then the operation succeeded no matter the return value. :P
> > > 
> > > (Also C programmers tend to be sloppy and not check the return value.)
> > > 
> > 
> > The POSIX spec and manpage for close(2) make no mention of writeback
> > errors, so it's not 100% clear that returning them there is at all OK.
> > Everyone sort of assumes that it makes sense to do so, but it can be
> > actively harmful.
> > 
> 
> Actually, they do mention this, but I still argue that it's not a good
> idea to do so. If you want writeback errors use fsync() (or maybe the
> new ioctl() that someone was plumbing in that scrapes errors without
> doing writeback).
> 
> > Suppose we do this:
> > 
> > open() = 1
> > write(1)
> > close(1) 
> > open() = 2
> > fsync(2) = ???
> > 
> > Now, assume there was a writeback error that happens either before or
> > after the close.
> > 
> > With the way this works today, you will get back an error on that final
> > fsync() even if fd 2 was opened _after_ the writeback error occurred,
> > because nothing will have scraped it yet.
> > 
> > If you scrape the error to return it on the close though, then the
> > result of that fsync() would be inconclusive. If the error happens
> > before the close(), then fsync() will return 0. If it fails after the
> > close(), then the fsync() will see an error.

<nod> Given the horrible legacy of C programmers not really checking the
return value from close(), I think that /if/ the kernel is going to
check for writeback errors at close, it should sample the error state
but not clear it, so that the fsync returns accumulated errors.

(That said, my opinion is that after years of all of us telling
programmers that fsync is the gold standard for checking if bad stuff
happened, we really ought only be clearing error state during fsync.)
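
To illustrate the difference between sampling and clearing, here is a
toy userspace model of the open/write/close/open/fsync sequence quoted
above.  It is deliberately simplified and is NOT the kernel's errseq_t
implementation; struct wb_state and the wb_*() helpers are hypothetical
names made up for this sketch.  It only covers the case where the
writeback error lands before close().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of a mapping's writeback error state. */
struct wb_state {
	unsigned int seq;	/* bumped every time an error is recorded */
	int err;		/* most recent error, as a negative errno */
	bool seen;		/* has anyone consumed the latest error? */
};

/* Record a writeback failure. */
void wb_set_err(struct wb_state *wb, int err)
{
	assert(err < 0);
	wb->err = err;
	wb->seq++;
	wb->seen = false;
}

/* open(): pick a starting cursor.  Unseen errors stay visible to new fds. */
unsigned int wb_sample(const struct wb_state *wb)
{
	return wb->seen ? wb->seq : 0;
}

/* Peek: report an error newer than @since without consuming it. */
int wb_check(const struct wb_state *wb, unsigned int since)
{
	return wb->seq != since ? wb->err : 0;
}

/* Consume: report an error newer than *@since, advance, and mark it seen. */
int wb_check_and_advance(struct wb_state *wb, unsigned int *since)
{
	if (wb->seq == *since)
		return 0;
	*since = wb->seq;
	wb->seen = true;
	return wb->err;
}

/*
 * Replay the sequence with the error landing before close(1).
 * Returns what the final fsync() on the second fd reports.
 */
int replay(bool close_consumes)
{
	struct wb_state wb = { 0 };
	unsigned int fd1 = wb_sample(&wb);		/* open() = 1 */
	unsigned int fd2;

	wb_set_err(&wb, -EIO);				/* writeback fails */
	if (close_consumes)
		(void)wb_check_and_advance(&wb, &fd1);	/* close(1) scrapes */
	else
		(void)wb_check(&wb, fd1);		/* close(1) only peeks */

	fd2 = wb_sample(&wb);				/* open() = 2 */
	return wb_check_and_advance(&wb, &fd2);		/* fsync(2) */
}
```

In this model, a close() that only peeks leaves the error for the later
fsync() to report, while a close() that consumes it makes the later
fsync() return 0.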

Evidently some projects do fsync-after-open assuming that close doesn't
flush and wait for writeback:
https://despairlabs.com/blog/posts/2025-03-13-fsync-after-open-is-an-elaborate-no-op/

--D

> > > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > > that when close() returns, all the processing/cleanup for that file
> > > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > > the server holds local locks that get released in FUSE_RELEASE, if a
> > > 
> > > Yes.  I think it's only useful for the case outlined in that patch, which
> > > is that a program started an asyncio operation and then closed the fd.
> > > In that particular case the program unambiguously doesn't care about the
> > > return value of close so it's ok to perform the release asynchronously.
> > > 
> > > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > > grabbing that lock, then we end up deadlocked if the server is
> > > > single-threaded.
> > > 
> > > Hrm.  I suppose if you had a script that ran two programs one after the
> > > other, each of which expected to be able to open and lock the same file,
> > > then you could run into problems if the lock isn't released by the time
> > > the second program is ready to open the file.
> > > 
> > > But having said that, some other program could very well open and lock
> > > the file as soon as the lock drops.
> > > 
> > > > I saw in your first patch that sending FUSE_RELEASE synchronously
> > > > leads to a deadlock under AIO but AFAICT, that happens because we
> > > > execute req->args->end() in fuse_request_end() synchronously; I think
> > > > if we execute that release asynchronously on a worker thread then that
> > > > gets rid of the deadlock.
> > > 
> > > <nod> Last time I think someone replied that maybe they should all be
> > > asynchronous.
> > > 
> > > > If FUSE_RELEASE must be asynchronous though, then your approach makes
> > > > sense to me.
> > > 
> > > I think it only has to be asynchronous for the weird case outlined in
> > > that patch (fuse server gets stuck closing its own client's fds).
> > > Personally I think release ought to be synchronous at least as far as
> > > the kernel doing all the stuff that close() says it has to do (removal
> > > of record locks, deleting the fd table entry).
> > > 
> > > Note that doesn't necessarily mean that the kernel has to be completely
> > > done with all the work that entails.  XFS defers freeing of unlinked
> > > files until a background garbage collector gets around to doing that.
> > > Other filesystems will actually make you wait while they free all the
> > > data blocks and the inode.  But the kernel has no idea what the fuse
> > > server actually does.
> > > 
> > > > > Create a function to push all the background requests to the queue and
> > > > > then wait for the number of pending events to hit zero, and call this
> > > > > before fuse_abort_conn.  That way, all the pending events are processed
> > > > > by the fuse server and we don't end up with a corrupt filesystem.
> > > > > 
> > > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > ---
> > > > >  fs/fuse/fuse_i.h |    6 ++++++
> > > > >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> > > > >  fs/fuse/inode.c  |    1 +
> > > > >  3 files changed, 45 insertions(+)
> > > > > 
> > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > +/*
> > > > > + * Flush all pending requests and wait for them.  Only call this function when
> > > > > + * it is no longer possible for other threads to add requests.
> > > > > + */
> > > > > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> > > > 
> > > > It might be worth renaming this to something like
> > > > 'fuse_flush_bg_requests' to make it more clear that this only flushes
> > > > background requests
> > > 
> > > Hum.  Did I not understand the code correctly?  I thought that
> > > flush_bg_queue puts all the background requests onto the active queue
> > > and issues them to the fuse server; and the wait_event_timeout sits
> > > around waiting for all the requests to receive their replies?
> > > 
> > > I could be mistaken though.  This is my rough understanding of what
> > > happens to background requests:
> > > 
> > > 1. Request created
> > > 2. Put request on bg_queue
> > > 3. <wait>
> > > 4. Request removed from bg_queue
> > > 5. Request sent
> > > 6. <wait>
> > > 7. Reply received
> > > 8. Request ends and is _put.
> > > 
> > > Non-background (foreground?) requests skip steps 2-4.  Meanwhile,
> > > fc->waiting tracks the number of requests that are anywhere between the
> > > end of step 1 and the start of step 8.
> > > 
> > > In any case, I want to push all the bg requests and wait until there are
> > > no more requests in the system.
> > > 
> > > --D
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-23 15:37             ` Darrick J. Wong
@ 2025-07-23 16:24               ` Jeff Layton
  2025-07-31  9:45                 ` Christian Brauner
  0 siblings, 1 reply; 174+ messages in thread
From: Jeff Layton @ 2025-07-23 16:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Joanne Koong, linux-fsdevel, neal, John, miklos, bernd

On Wed, 2025-07-23 at 08:37 -0700, Darrick J. Wong wrote:
> On Tue, Jul 22, 2025 at 08:38:08AM -0400, Jeff Layton wrote:
> > On Tue, 2025-07-22 at 08:30 -0400, Jeff Layton wrote:
> > > On Fri, 2025-07-18 at 17:32 -0700, Darrick J. Wong wrote:
> > > > On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> > > > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > 
> > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > > 
> > > > > > generic/488 fails with fuse2fs in the following fashion:
> > > > > > 
> > > > > > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > > > > (see /var/tmp/fstests/generic/488.full for details)
> > > > > > 
> > > > > > This test opens a large number of files, unlinks them (which really just
> > > > > > renames them to fuse hidden files), closes the program, unmounts the
> > > > > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > > > > in the filesystem.
> > > > > > 
> > > > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > > > commands that are queued up on behalf of the unlinked files at the time
> > > > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > > > removing the hidden files; instead they stick around.
> > > > > 
> > > > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > > > of synchronous. For example for fuse servers that cache their data and
> > > > > only write the buffer out to some remote filesystem when the file gets
> > > > > closed, it seems useful for them to (like nfs) be able to return an
> > > > > error to the client for close() if there's a failure committing that
> > > > 
> > > > I don't think supplying a return value for close() is as helpful as it
> > > > seems -- the manpage says that there is no guarantee that data has been
> > > > flushed to disk; and if the file is removed from the process' fd table
> > > > then the operation succeeded no matter the return value. :P
> > > > 
> > > > (Also C programmers tend to be sloppy and not check the return value.)
> > > > 
> > > 
> > > The POSIX spec and manpage for close(2) make no mention of writeback
> > > errors, so it's not 100% clear that returning them there is at all OK.
> > > Everyone sort of assumes that it makes sense to do so, but it can be
> > > actively harmful.
> > > 
> > 
> > Actually, they do mention this, but I still argue that it's not a good
> > idea to do so. If you want writeback errors use fsync() (or maybe the
> > new ioctl() that someone was plumbing in that scrapes errors without
> > doing writeback).
> > 
> > > Suppose we do this:
> > > 
> > > open() = 1
> > > write(1)
> > > close(1) 
> > > open() = 2
> > > fsync(2) = ???
> > > 
> > > Now, assume there was a writeback error that happens either before or
> > > after the close.
> > > 
> > > With the way this works today, you will get back an error on that final
> > > fsync() even if fd 2 was opened _after_ the writeback error occurred,
> > > because nothing will have scraped it yet.
> > > 
> > > If you scrape the error to return it on the close though, then the
> > > result of that fsync() would be inconclusive. If the error happens
> > > before the close(), then fsync() will return 0. If it fails after the
> > > close(), then the fsync() will see an error.
> 
> <nod> Given the horrible legacy of C programmers not really checking the
> return value from close(), I think that /if/ the kernel is going to
> check for writeback errors at close, it should sample the error state
> but not clear it, so that the fsync returns accumulated errors.
> 
> (That said, my opinion is that after years of all of us telling
> programmers that fsync is the gold standard for checking if bad stuff
> happened, we really ought only be clearing error state during fsync.)
> 

That is pretty doable. The only question is whether it's something we
*want* to do. Something like this would probably be enough if so:

diff --git a/fs/open.c b/fs/open.c
index 7828234a7caa..a20657a85ee1 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1582,6 +1582,10 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 
        retval = filp_flush(file, current->files);
 
+       /* Do an opportunistic writeback error check before returning. */
+       if (likely(retval == 0))
+               retval = filemap_check_wb_err(file_inode(file)->i_mapping, file->f_wb_err);
+
        /*
         * We're returning to user space. Don't bother
         * with any delayed fput() cases.


> Evidently some projects do fsync-after-open assuming that close doesn't
> flush and wait for writeback:
> https://despairlabs.com/blog/posts/2025-03-13-fsync-after-open-is-an-elaborate-no-op/
>
> > > > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > > > that when close() returns, all the processing/cleanup for that file
> > > > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > > > the server holds local locks that get released in FUSE_RELEASE, if a
> > > > 
> > > > Yes.  I think it's only useful for the case outlined in that patch, which
> > > > is that a program started an asyncio operation and then closed the fd.
> > > > In that particular case the program unambiguously doesn't care about the
> > > > return value of close so it's ok to perform the release asynchronously.
> > > > 
> > > > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > > > grabbing that lock, then we end up deadlocked if the server is
> > > > > single-threaded.
> > > > 
> > > > Hrm.  I suppose if you had a script that ran two programs one after the
> > > > other, each of which expected to be able to open and lock the same file,
> > > > then you could run into problems if the lock isn't released by the time
> > > > the second program is ready to open the file.
> > > > 
> > > > But having said that, some other program could very well open and lock
> > > > the file as soon as the lock drops.
> > > > 
> > > > > I saw in your first patch that sending FUSE_RELEASE synchronously
> > > > > leads to a deadlock under AIO but AFAICT, that happens because we
> > > > > execute req->args->end() in fuse_request_end() synchronously; I think
> > > > > if we execute that release asynchronously on a worker thread then that
> > > > > gets rid of the deadlock.
> > > > 
> > > > <nod> Last time I think someone replied that maybe they should all be
> > > > asynchronous.
> > > > 
> > > > > If FUSE_RELEASE must be asynchronous though, then your approach makes
> > > > > sense to me.
> > > > 
> > > > I think it only has to be asynchronous for the weird case outlined in
> > > > that patch (fuse server gets stuck closing its own client's fds).
> > > > Personally I think release ought to be synchronous at least as far as
> > > > the kernel doing all the stuff that close() says it has to do (removal
> > > > of record locks, deleting the fd table entry).
> > > > 
> > > > Note that doesn't necessarily mean that the kernel has to be completely
> > > > done with all the work that entails.  XFS defers freeing of unlinked
> > > > files until a background garbage collector gets around to doing that.
> > > > Other filesystems will actually make you wait while they free all the
> > > > data blocks and the inode.  But the kernel has no idea what the fuse
> > > > server actually does.
> > > > 
> > > > > > Create a function to push all the background requests to the queue and
> > > > > > then wait for the number of pending events to hit zero, and call this
> > > > > > before fuse_abort_conn.  That way, all the pending events are processed
> > > > > > by the fuse server and we don't end up with a corrupt filesystem.
> > > > > > 
> > > > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > > ---
> > > > > >  fs/fuse/fuse_i.h |    6 ++++++
> > > > > >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> > > > > >  fs/fuse/inode.c  |    1 +
> > > > > >  3 files changed, 45 insertions(+)
> > > > > > 
> > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > > +/*
> > > > > > + * Flush all pending requests and wait for them.  Only call this function when
> > > > > > + * it is no longer possible for other threads to add requests.
> > > > > > + */
> > > > > > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> > > > > 
> > > > > It might be worth renaming this to something like
> > > > > 'fuse_flush_bg_requests' to make it more clear that this only flushes
> > > > > background requests
> > > > 
> > > > Hum.  Did I not understand the code correctly?  I thought that
> > > > flush_bg_queue puts all the background requests onto the active queue
> > > > and issues them to the fuse server; and the wait_event_timeout sits
> > > > around waiting for all the requests to receive their replies?
> > > > 
> > > > I could be mistaken though.  This is my rough understanding of what
> > > > happens to background requests:
> > > > 
> > > > 1. Request created
> > > > 2. Put request on bg_queue
> > > > 3. <wait>
> > > > 4. Request removed from bg_queue
> > > > 5. Request sent
> > > > 6. <wait>
> > > > 7. Reply received
> > > > 8. Request ends and is _put.
> > > > 
> > > > Non-background (foreground?) requests skip steps 2-4.  Meanwhile,
> > > > fc->waiting tracks the number of requests that are anywhere between the
> > > > end of step 1 and the start of step 8.
> > > > 
> > > > In any case, I want to push all the bg requests and wait until there are
> > > > no more requests in the system.
> > > > 
> > > > --D
> > 
> > -- 
> > Jeff Layton <jlayton@kernel.org>
> > 

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-21 20:05         ` Joanne Koong
@ 2025-07-23 17:06           ` Darrick J. Wong
  2025-07-23 20:27             ` Joanne Koong
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-23 17:06 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Amir Goldstein, linux-fsdevel, neal, John, miklos, bernd

On Mon, Jul 21, 2025 at 01:05:02PM -0700, Joanne Koong wrote:
> On Sat, Jul 19, 2025 at 12:18 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Sat, Jul 19, 2025 at 12:23 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > generic/488 fails with fuse2fs in the following fashion:
> > > >
> > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > commands that are queued up on behalf of the unlinked files at the time
> > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > removing the hidden files; instead they stick around.
> > >
> > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > of synchronous. For example for fuse servers that cache their data and
> > > only write the buffer out to some remote filesystem when the file gets
> > > closed, it seems useful for them to (like nfs) be able to return an
> > > error to the client for close() if there's a failure committing that
> > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > that when close() returns, all the processing/cleanup for that file
> > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > the server holds local locks that get released in FUSE_RELEASE, if a
> > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > grabbing that lock, then we end up deadlocked if the server is
> > > single-threaded.
> > >
> >
> > There is a very good reason for keeping FUSE_FLUSH and FUSE_RELEASE
> > (as well as those vfs ops) separate.
> 
> Oh interesting, I didn't realize FUSE_FLUSH gets also sent on the
> release path. I had assumed FUSE_FLUSH was for the sync()/fsync()

(That's FUSE_FSYNC)

> case. But I see now that you're right, close() makes a call to
> filp_flush() in the vfs layer. (and I now see there's FUSE_FSYNC for
> the fsync() case)

Yeah, flush-on-close (FUSE_FLUSH) is generally a good idea for
"unreliable" filesystems -- either because they're remote, or because
the local storage they're on could get yanked at any time.  It's slow,
but it papers over a lot of bugs and "bad" usage.
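
Seen from the application side, this means a flush failure only becomes
visible if the caller looks at the return value of close().  A minimal
sketch of that, assuming nothing beyond POSIX: close_checked() is a
hypothetical helper name, and the fsync() call (FUSE_FSYNC on a fuse
mount) is an extra step for callers that want the data pushed out before
the close-time flush.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: fsync and close @fd, returning 0 or a negative errno.
 * The first error seen wins.  The fd number is gone after close() whether or
 * not an error is reported, so the caller must not retry the close.
 */
static int close_checked(int fd)
{
	int err = 0;

	/* Push dirty data out explicitly before the fd goes away. */
	if (fsync(fd) < 0)
		err = -errno;

	/* close() is where a flush-on-close failure would surface. */
	if (close(fd) < 0 && !err)
		err = -errno;

	return err;
}
```

A caller would write `if (close_checked(fd)) { /* report data loss */ }`
instead of dropping the return value on the floor.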

> > A filesystem can decide if it needs synchronous close() (not release).
> > And with FOPEN_NOFLUSH, the filesystem can decide that per open file,
> > (unless it conflicts with a config like writeback cache).
> >
> > I have a filesystem which can do very slow io and some clients
> > can get stuck doing open;fstat;close if close is always synchronous.
> > I actually found the libfuse feature of async flush (FUSE_RELEASE_FLUSH)
> > quite useful for my filesystem, so I carry a kernel patch to support it.
> >
> > The issue of racing that you mentioned sounds odd.
> > First of all, who runs a single threaded fuse server?
> > Second, what does it matter if release is sync or async,
> > FUSE_RELEASE will not be triggered by the same
> > task calling FUSE_OPEN, so if there is a deadlock, it will happen
> > with sync release as well.
> 
> If the server is single-threaded, I think the FUSE_RELEASE would have
> to happen on the same task as FUSE_OPEN, so if the release is
> synchronous, this would avoid the deadlock because that guarantees the
> FUSE_RELEASE happens before the next FUSE_OPEN.

On a single-threaded server(!) I would hope that the release would be
issued to the fuse server before the open.  (I'm not sure I understand
where this part of the thread went, because why would that happen?  And
why would the fuse server hold a lock across requests?)

> However now that you pointed out FUSE_FLUSH gets sent on the release
> path, that addresses my worry about async FUSE_RELEASE returning
> before the server has gotten a chance to write out their local buffer
> cache.

<nod>

--D

> Thanks,
> Joanne
> >
> > Thanks,
> > Amir.
> 


* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-21 20:32         ` Joanne Koong
@ 2025-07-23 17:34           ` Darrick J. Wong
  2025-07-23 21:02             ` Joanne Koong
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-23 17:34 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, neal, John, miklos, bernd

On Mon, Jul 21, 2025 at 01:32:43PM -0700, Joanne Koong wrote:
> On Fri, Jul 18, 2025 at 5:32 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > generic/488 fails with fuse2fs in the following fashion:
> > > >
> > > > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > > (see /var/tmp/fstests/generic/488.full for details)
> > > >
> > > > This test opens a large number of files, unlinks them (which really just
> > > > renames them to fuse hidden files), closes the program, unmounts the
> > > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > > in the filesystem.
> > > >
> > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > commands that are queued up on behalf of the unlinked files at the time
> > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > removing the hidden files; instead they stick around.
> > >
> > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > of synchronous. For example for fuse servers that cache their data and
> > > only write the buffer out to some remote filesystem when the file gets
> > > closed, it seems useful for them to (like nfs) be able to return an
> > > error to the client for close() if there's a failure committing that
> >
> > I don't think supplying a return value for close() is as helpful as it
> > seems -- the manpage says that there is no guarantee that data has been
> > flushed to disk; and if the file is removed from the process' fd table
> > then the operation succeeded no matter the return value. :P
> >
> > (Also C programmers tend to be sloppy and not check the return value.)
> 
> Amir pointed out FUSE_FLUSH gets sent on the FUSE_RELEASE path so that
> addresses my worry. FUSE_FLUSH is sent synchronously (and close() will
> propagate any flush errors too), so now if there's an abort or
> something right after close() returns, the client is guaranteed that
> any data they wrote into a local cache has been flushed by the server.

<nod>

> >
> > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > that when close() returns, all the processing/cleanup for that file
> > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > the server holds local locks that get released in FUSE_RELEASE, if a
> >
> > Yes.  I think it's only useful for the case outlined in that patch, which
> > is that a program started an asyncio operation and then closed the fd.
> > In that particular case the program unambiguously doesn't care about the
> > return value of close so it's ok to perform the release asynchronously.
> 
> I wonder why fuseblk devices need to be synchronously released. The
> comment says " Make the release synchronous if this is a fuseblk
> mount, synchronous RELEASE is allowed (and desirable)". Why is it
> desirable?

Err, which are you asking about?

Are you asking why it is that fuseblk mounts call FUSE_DESTROY from
unmount instead of letting libfuse synthesize it once the event loop
terminates?  I think that's because in the fuseblk case, the kernel has
the block device open for itself, so the fuse server must write and
flush all dirty data before the unmount() returns to the caller.

Or were you asking why synchronous RELEASE is done on fuseblk
filesystems?  Here is my speculation:

Synchronous RELEASE was added back in commit 5a18ec176c934c ("fuse: fix
hang of single threaded fuseblk filesystem").  I /think/ the idea behind
that patch was that for fuseblk servers, we're ok with issuing a
FUSE_DESTROY request from the kernel and waiting on it.

However, for that to work correctly, all previous pending requests
anywhere in the fuse mount have to be flushed to and completed by the
fuse server before we can send DESTROY, because destroy closes the
filesystem.

So I think the idea behind 5a18ec176c934c is that we make FUSE_RELEASE
synchronous so it's not possible to umount(8) until all the release
requests are finished.

> > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > grabbing that lock, then we end up deadlocked if the server is
> > > single-threaded.
> >
> > Hrm.  I suppose if you had a script that ran two programs one after the
> > other, each of which expected to be able to open and lock the same file,
> > then you could run into problems if the lock isn't released by the time
> > the second program is ready to open the file.
> 
> I think in your scenario with the two programs, the worst outcome is
> that the open/lock acquiring can take a while but in the (contrived
> and probably far-fetched) scenario where it's single threaded, it
> would result in a complete deadlock.

<nod> I concede it's a minor point. :)

> > But having said that, some other program could very well open and lock
> > the file as soon as the lock drops.
> >
> > > I saw in your first patch that sending FUSE_RELEASE synchronously
> > > leads to a deadlock under AIO but AFAICT, that happens because we
> > > execute req->args->end() in fuse_request_end() synchronously; I think
> > > if we execute that release asynchronously on a worker thread then that
> > > gets rid of the deadlock.
> >
> > <nod> Last time I think someone replied that maybe they should all be
> > asynchronous.
> >
> > > If FUSE_RELEASE must be asynchronous though, then your approach makes
> > > sense to me.
> >
> > I think it only has to be asynchronous for the weird case outlined in
> > that patch (fuse server gets stuck closing its own client's fds).
> > Personally I think release ought to be synchronous at least as far as
> > the kernel doing all the stuff that close() says it has to do (removal
> > of record locks, deleting the fd table entry).
> >
> > Note that doesn't necessarily mean that the kernel has to be completely
> > done with all the work that entails.  XFS defers freeing of unlinked
> > files until a background garbage collector gets around to doing that.
> > Other filesystems will actually make you wait while they free all the
> > data blocks and the inode.  But the kernel has no idea what the fuse
> > server actually does.
> 
> I guess if that's important enough to the server, we could add
> something like an FOPEN flag that servers could set on the file
> handle if they want synchronous release?

If a fuse server /did/ have background garbage collection, there are a
few things it could do -- every time it sees a FUSE_RELEASE of an
unlinked file, it could set a timer (say 50ms) after which it would kick
the gc thread to do its thing.  Or it could wake up the background
thread in response to a FUSE_SYNCFS command and hope it finishes by the
time FUSE_DESTROY comes around.

(Speaking of which, can we enable syncfs for all fuse servers?)

But that said, not everyone wants the fancy background gc stuff that XFS
does.  FUSE_RELEASE would then be doing a lot of work.

> after Amir's point about FUSE_FLUSH, I'm in favor now of FUSE_RELEASE
> being asynchronous.
> >
> > > > Create a function to push all the background requests to the queue and
> > > > then wait for the number of pending events to hit zero, and call this
> > > > before fuse_abort_conn.  That way, all the pending events are processed
> > > > by the fuse server and we don't end up with a corrupt filesystem.
> > > >
> > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > ---
> > > >  fs/fuse/fuse_i.h |    6 ++++++
> > > >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> > > >  fs/fuse/inode.c  |    1 +
> > > >  3 files changed, 45 insertions(+)
> > > >
> > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > +/*
> > > > + * Flush all pending requests and wait for them.  Only call this function when
> > > > + * it is no longer possible for other threads to add requests.
> > > > + */
> > > > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> > >
> > > It might be worth renaming this to something like
> > > 'fuse_flush_bg_requests' to make it more clear that this only flushes
> > > background requests
> >
> > Hum.  Did I not understand the code correctly?  I thought that
> > flush_bg_queue puts all the background requests onto the active queue
> > and issues them to the fuse server; and the wait_event_timeout sits
> > around waiting for all the requests to receive their replies?
> 
> Sorry, didn't mean to be confusing with my previous comment. What I
> was trying to say is that "fuse_flush_requests" implies that all
> requests get flushed to userspace but here only the background
> requests get flushed.

Oh, I see now, I /was/ mistaken.  Synchronous requests are ...

Wait, no, still confused :(

fuse_flush_requests waits until fuse_conn::num_waiting is zero.

Synchronous requests (aka the ones sent through fuse_simple_request)
bump num_waiting either directly in the args->force case or indirectly
via fuse_get_req.  num_waiting is decremented in fuse_put_request.
Therefore waiting for num_waiting to hit zero implements waiting for all
the requests that were in flight before fuse_flush_requests was called.

Background requests (aka the ones sent via fuse_simple_background) have
num_waiting set in the !args->force case or indirectly in
fuse_request_queue_background.  num_waiting is decremented in
fuse_put_request the same as is done for synchronous requests.

Therefore, it's correct to say that waiting for num_waiting to become 0
is sufficient to wait for all pending requests anywhere in the
fuse_mount to complete.

Right?

Maybe this should be called fuse_flush_requests_and_wait. :)
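
To make the accounting argument concrete, here is a single-threaded toy
model of it.  This is only a sketch: struct toy_conn and the toy_*
helpers are made-up names, the fields only loosely mirror the kernel's,
and the real code uses locking and wait_event_timeout rather than a
loop.  The step numbers in the comments refer to the eight-step request
lifecycle quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a fuse connection's request accounting. */
struct toy_conn {
	unsigned int num_waiting;	/* requests created but not yet put */
	unsigned int bg_queued;		/* background requests not yet sent */
	unsigned int in_flight;		/* requests sent, awaiting a reply */
};

/* Steps 1-2: create a background request and park it on the bg queue. */
static void toy_queue_background(struct toy_conn *fc)
{
	fc->num_waiting++;
	fc->bg_queued++;
}

/* Steps 1 and 5: a synchronous request is created and sent right away. */
static void toy_send_sync(struct toy_conn *fc)
{
	fc->num_waiting++;
	fc->in_flight++;
}

/* Steps 4-5: hand every queued background request to the server. */
static void toy_flush_bg_queue(struct toy_conn *fc)
{
	fc->in_flight += fc->bg_queued;
	fc->bg_queued = 0;
}

/* Steps 7-8: the server replies to one request, which is then put. */
static bool toy_server_reply(struct toy_conn *fc)
{
	if (fc->in_flight == 0)
		return false;
	assert(fc->num_waiting > 0);
	fc->in_flight--;
	fc->num_waiting--;
	return true;
}

/*
 * Push the background queue, then let the "server" reply until nothing is
 * in flight.  Returns true if num_waiting reached zero.
 */
static bool toy_flush_requests(struct toy_conn *fc)
{
	toy_flush_bg_queue(fc);
	while (toy_server_reply(fc))
		;
	return fc->num_waiting == 0;
}
```

In this model, replies alone never drain requests that are still parked
on the background queue, which is why the flush has to happen before the
wait.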

--D

> Thanks,
> Joanne
> >
> > I could be mistaken though.  This is my rough understanding of what
> > happens to background requests:
> >
> > 1. Request created
> > 2. Put request on bg_queue
> > 3. <wait>
> > 4. Request removed from bg_queue
> > 5. Request sent
> > 6. <wait>
> > 7. Reply received
> > 8. Request ends and is _put.
> >
> > Non-background (foreground?) requests skip steps 2-4.  Meanwhile,
> > fc->num_waiting tracks the number of requests that are anywhere between the
> > end of step 1 and the start of step 8.
> >
> > In any case, I want to push all the bg requests and wait until there are
> > no more requests in the system.
> >
> > --D
> 


* Re: [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags
  2025-07-21 18:51         ` Bernd Schubert
@ 2025-07-23 17:50           ` Darrick J. Wong
  2025-07-24 19:56             ` Amir Goldstein
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-23 17:50 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Amir Goldstein, John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, bernd@bsbernd.com, neal@gompa.dev,
	miklos@szeredi.hu

On Mon, Jul 21, 2025 at 06:51:00PM +0000, Bernd Schubert wrote:
> On 7/18/25 17:55, Darrick J. Wong wrote:
> > On Fri, Jul 18, 2025 at 04:27:50PM +0200, Amir Goldstein wrote:
> >> On Fri, Jul 18, 2025 at 1:36 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >>>
> >>> From: Darrick J. Wong <djwong@kernel.org>
> >>>
> >>> Create a new ->getattr_iflags function so that iomap filesystems can set
> >>> the appropriate in-kernel inode flags on instantiation.
> >>>
> >>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > 
> > <snip for brevity>
> > 
> >>> diff --git a/lib/fuse.c b/lib/fuse.c
> >>> index 8dbf88877dd37c..685d0181e569d0 100644
> >>> --- a/lib/fuse.c
> >>> +++ b/lib/fuse.c
> >>> @@ -3710,14 +3832,19 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
> >>>                         if (de->flags & FUSE_FILL_DIR_PLUS &&
> >>>                             !is_dot_or_dotdot(de->name)) {
> >>>                                 res = do_lookup(dh->fuse, dh->nodeid,
> >>> -                                               de->name, &e);
> >>> +                                               de->name, &e, &iflags);
> >>>                                 if (res) {
> >>>                                         dh->error = res;
> >>>                                         return 1;
> >>>                                 }
> >>>                         }
> >>>
> >>> -                       thislen = fuse_add_direntry_plus(req, p, rem,
> >>> +                       if (f->want_iflags)
> >>> +                               thislen = fuse_add_direntry_plus_iflags(req, p,
> >>> +                                                        rem, de->name, iflags,
> >>> +                                                        &e, pos);
> >>> +                       else
> >>> +                               thislen = fuse_add_direntry_plus(req, p, rem,
> >>>                                                          de->name, &e, pos);
> >>
> >>
> >> All those conditional statements look pretty moot.
> >> Can't we just force iflags to 0 if (!f->want_iflags)
> >> and always call the *_iflags functions?
> > 
> > Heh, it already is zero, so yes, this could be a straight call to
> > fuse_add_direntry_plus_iflags without the want_iflags check.  Will fix
> > up this and the other thing you mentioned in the previous patch.
> > 
> > Thanks for the code review!
> > 
> > Having said that, the significant difficulties with iomap and the
> > upper level fuse library still exist.  To summarize -- upper libfuse has
> > its own nodeids which don't necessarily correspond to the filesystem's,
> > and struct node/nodeid are duplicated for hardlinked files.  As a
> > result, the kernel has multiple struct inodes for an ondisk ext4 inode,
> > which completely breaks the locking for the iomap file IO model.
> > 
> > That forces me to port fuse2fs to the lowlevel library, so I might
> > remove the lib/fuse.c patches entirely.  Are there plans to make the
> > upper libfuse handle hardlinks better?
> 
> I don't have plans for high level improvements. To be honest, I didn't
> know about the hard link issue at all.

Assuming "I didn't know" means you're not familiar with what I'm
talking about, let me provide a brief overview:

So you know how fuse.c implements a directory entry cache in
fuse::name_table?  Every time someone uses the cache to walk a path and
misses a path, it'll alloc_node() a new struct node, hash it, and add it
to the name_table.

Allocating a node assigns a new nodeid, which is then passed into the
kernel and the kernel uses the nodeid to index the struct fuse_inode
objects.

Unfortunately, if the filesystem supports hardlinks, the name_table
creates two nodeids for the same ondisk inode.  IOWs, if the directory
tree is:

$ <mount fuse server>
$ mkdir /mnt/a /mnt/b
$ touch /mnt/a/foo
$ ln /mnt/a/foo /mnt/b/bar
$ umount /mnt
$ <mount fuse server>
$ ls /mnt/a/foo /mnt/b/bar

Then the fuse library will create one struct node for foo and another
one for bar.  They both refer to the same ondisk inode, but in memory
they have separate nodeids and hence separate struct fuse_inodes in the
kernel.

For a regular fuse server (no writeback caching, no iomap) this works
out because all the file IO requests get forwarded to the fuse server.
If the server is sane it'll coordinate access to its internal inode
structure to process the requests.  fuse is careful enough to revalidate
the cached file attributes very frequently, so out of date metadata is
barely noticeable.

For a fuse+iomap server, having separate fuse_inodes for the same ondisk
inode isn't going to work because iomap relies on i_rwsem in the kernel
struct fuse_inode to coordinate writes among all writer threads, no
matter what path they used to open the file.
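
A toy illustration of the difference, in case it helps.  None of this is
libfuse code: struct toy_table, lookup_by_path() and lookup_by_ino() are
hypothetical names, and the fixed-size array stands in for the real
hashed name_table.  The point is only that a cache keyed by path hands
out a second nodeid for a second hardlink, while keying by ondisk inode
number cannot.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_NODES	16

struct toy_node {
	char path[64];		/* cache key of the path-keyed table */
	uint64_t ino;		/* ondisk inode number */
	uint64_t nodeid;	/* id handed to the kernel */
};

struct toy_table {
	struct toy_node nodes[TOY_MAX_NODES];
	unsigned int nr;
	uint64_t next_nodeid;
};

/* Path-keyed lookup: a cache miss on the path allocates a fresh nodeid. */
static uint64_t lookup_by_path(struct toy_table *t, const char *path,
			       uint64_t ino)
{
	struct toy_node *n;
	unsigned int i;

	for (i = 0; i < t->nr; i++)
		if (!strcmp(t->nodes[i].path, path))
			return t->nodes[i].nodeid;

	assert(t->nr < TOY_MAX_NODES);
	n = &t->nodes[t->nr++];
	snprintf(n->path, sizeof(n->path), "%s", path);
	n->ino = ino;
	n->nodeid = ++t->next_nodeid;
	return n->nodeid;
}

/* Inode-keyed lookup: the ondisk inode number is the nodeid; path unused. */
static uint64_t lookup_by_ino(const char *path, uint64_t ino)
{
	(void)path;
	return ino;
}
```

With /a/foo and /b/bar both pointing at the same ondisk inode, the
path-keyed table returns two different nodeids (so the kernel ends up
with two fuse_inodes and two i_rwsems), whereas the inode-keyed variant
returns the same nodeid for both names.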

> Also a bit surprising to see all your lowlevel work and then fuse high
> level coming ;)

Right now fuse2fs is a high level fuse server, so I hacked whatever I
needed into fuse.c to make it sort of work, awkwardly.  That stuff
doesn't need to live forever.

In the long run, the lowlevel server will probably have better
performance because fuse2fs++ can pass ext2 inode numbers to the kernel
as the nodeids, and libext2fs can look up inodes via nodeid.  No more
path construction overhead!

> Btw, I will go on vacation on Wednesday and still other things queued,
> going to try to review in the evenings (but not before next Saturday).

<nod> Enjoy your vacation!

--D

> 
> 
> Cheers,
> Bernd


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-23 13:05       ` Christian Brauner
@ 2025-07-23 18:04         ` Darrick J. Wong
  2025-07-31 10:13           ` Christian Brauner
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-23 18:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > Hi everyone,
> > > > >
> > > > > DO NOT MERGE THIS, STILL!
> > > > >
> > > > > This is the third request for comments of a prototype to connect the
> > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > files whose contents persist to locally attached storage devices.
> > > > >
> > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > server process.
> > > > >
> > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > because you have to understand every filesystem's bespoke use of that
> > > > > core code.  Eeeugh.
> > > > >
> > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > works.
> > > > >
> > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > overhead.  And that's with debugging turned on!
> > > > >
> > > > > These items have been addressed since the first RFC:
> > > > >
> > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > unwritten and delalloc mappings.
> > > > >
> > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > >
> > > > > 3. iomap supports inline data.
> > > > >
> > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > fuse server can set fuse_attr::flags.
> > > > >
> > > > > 5. statx and syncfs work on iomap filesystems.
> > > > >
> > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > is enabled.
> > > > >
> > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > >
> > > > > There are some major warts remaining:
> > > > >
> > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > actually works correctly.
> > > > >
> > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > to index its incore inode, so we have to pass those too so that
> > > > > notifications work properly.  This is related to item c below:
> > > > >
> > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > but on the plus side there will be far less path lookup overhead.
> > > > >
> > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > libext2fs except for inline data.
> > > > >
> > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > don't want filesystems to unmount abruptly.
> > > > >
> > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > namespace.  This prevents us from using most of the stronger systemd
> > > >
> > > > I'm happy to help you here.
> > > >
> > > > First, I think using a character device for namespaced drivers is always
> > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > in general. And having device nodes on anything other than tmpfs is just
> > > > wrong (TM).
> > > >
> > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > persistent storage imho. But anyway, that's beside the point.
> > > >
> > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > /dev/fuse should really be openable by the service itself.
> > 
> > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > Can you pass an fsopen fd to an unprivileged process and have that
> > second process call fsmount?
> 
> Yes, but remember that at some point you must call
> fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> fses that requires CAP_SYS_ADMIN so that has to be done by the
> privileged process. All the rest can be done by the unprivileged process
> though. That's exactly how bpf tokens work.

Hrm.  Assuming the fsopen mount sequence is still:

	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
	...
	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
for move_mount() to have the intended effect.

Can two processes share the same fsopen fd?  If so then systemd-mountfsd
could pass the fsopen fd to the fuse server (whilst retaining its own
copy).  The fuse server could do its own mount option parsing, call
FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
the create/fsmount/move_mount part.

systemd-mountfsd would have to be running in the desired fs namespace
and with sufficient privileges to open block devices, but I'm guessing
that's already a requirement?
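
For reference, the usual way to hand an open fd to another process is
SCM_RIGHTS over an AF_UNIX socket.  A minimal sketch of just that
mechanism follows, assuming an fsopen fd can be passed like any other
fd; send_fd() and recv_fd() are hypothetical helper names, and the
signalling back to the privileged side for create/fsmount/move_mount is
not shown.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open fd over a connected AF_UNIX socket.  Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* at least one byte of real data is needed */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from the socket.  Returns the new fd number or -1. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new fd number that refers to the same open file
description, so both sides keep a usable reference, which matches the
"whilst retaining its own copy" part above.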

> > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > could pass open fds for /dev/fuse and fsopen, then the fuse server wouldn't
> 
> Yes, that would work.

Oh goody :)

> > need any special /dev access at all.  I think then the fuse server's
> > service could have:
> > 
> > DynamicUser=true
> > ProtectSystem=true
> > ProtectHome=true
> > PrivateTmp=true
> > PrivateDevices=true
> > DevicePolicy=strict
> > 
> > (I think most of those are redundant with DynamicUser=true but a lot of
> > my systemd-fu is paged out ATM.)
> > 
> > My goal here is extreme containment -- the code doing the fs metadata
> > parsing has no privileges, no write access except to the fds it was
> > given, no network access, and no ability to read anything outside the
> > root filesystem.  Then I can get back to writing buffer
> > overflows^W^Whigh quality filesystem code in peace.
> 
> Yeah, sounds about right.
> 
> > 
> > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > whiteouts. That means you can do mknod() in the container to create
> > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > bat so that containers can only do this on their private tmpfs mount at
> > > > /dev.)
> > > >
> > > > The downside of this would be to give unprivileged containers access to
> > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > change.
> > 
> > Yeah, that is a new risk.  It's still better than metadata parsing
> > within the kernel address space ... though who knows how thoroughly fuse
> > has been fuzzed by syzbot :P
> > 
> > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > sure enough about it to spill it.
> > 
> > Please do share, #f is my crazy unbaked idea. :)
> > 
> > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > a device driver.
> > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > 
> > > > > protections because they tend to run in a private mount namespace with
> > > > > various parts of the filesystem either hidden or readonly.
> > > > >
> > > > > In theory one could design a socket protocol to pass mount options,
> > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > a mount helper and a service:
> > > >
> > > > This isn't a problem really. This should just be an extension to
> > > > systemd-mountfsd.
> > 
> > I suppose mount.safe could very well call systemd-mount to go do all the
> > systemd-related service setup, and that would take care of udisks as
> > well.
> 
> The ultimate goal is to teach mount(8)/libmount to use that daemon when
> it's available. Because that would just make unprivileged mounting work
> without userspace noticing anything.

That sounds really neat. :)

--D


* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-23 17:06           ` Darrick J. Wong
@ 2025-07-23 20:27             ` Joanne Koong
  2025-07-24 22:34               ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Joanne Koong @ 2025-07-23 20:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Amir Goldstein, linux-fsdevel, neal, John, miklos, bernd

On Wed, Jul 23, 2025 at 10:06 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Mon, Jul 21, 2025 at 01:05:02PM -0700, Joanne Koong wrote:
> > On Sat, Jul 19, 2025 at 12:18 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Sat, Jul 19, 2025 at 12:23 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > >
> > > > > generic/488 fails with fuse2fs in the following fashion:
> > > > >
> > > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > > commands that are queued up on behalf of the unlinked files at the time
> > > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > > removing the hidden files; instead they stick around.
> > > >
> > > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > > of synchronous. For example for fuse servers that cache their data and
> > > > only write the buffer out to some remote filesystem when the file gets
> > > > closed, it seems useful for them to (like nfs) be able to return an
> > > > error to the client for close() if there's a failure committing that
> > > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > > that when close() returns, all the processing/cleanup for that file
> > > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > > the server holds local locks that get released in FUSE_RELEASE, if a
> > > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > > grabbing that lock, then we end up deadlocked if the server is
> > > > single-threaded.
> > > >
> > >
> > > There is a very good reason for keeping FUSE_FLUSH and FUSE_RELEASE
> > > (as well as those vfs ops) separate.
> >
> > Oh interesting, I didn't realize FUSE_FLUSH gets also sent on the
> > release path. I had assumed FUSE_FLUSH was for the sync()/fsync()
>
> (That's FUSE_FSYNC)
>
> > case. But I see now that you're right, close() makes a call to
> > filp_flush() in the vfs layer. (and I now see there's FUSE_FSYNC for
> > the fsync() case)
>
> Yeah, flush-on-close (FUSE_FLUSH) is generally a good idea for
> "unreliable" filesystems -- either because they're remote, or because
> the local storage they're on could get yanked at any time.  It's slow,
> but it papers over a lot of bugs and "bad" usage.
>
> > > A filesystem can decide if it needs synchronous close() (not release).
> > > And with FOPEN_NOFLUSH, the filesystem can decide that per open file,
> > > (unless it conflicts with a config like writeback cache).
> > >
> > > I have a filesystem which can do very slow io and some clients
> > > can get stuck doing open;fstat;close if close is always synchronous.
> > > I actually found the libfuse feature of async flush (FUSE_RELEASE_FLUSH)
> > > quite useful for my filesystem, so I carry a kernel patch to support it.
> > >
> > > The issue of racing that you mentioned sounds odd.
> > > First of all, who runs a single threaded fuse server?
> > > Second, what does it matter if release is sync or async,
> > > FUSE_RELEASE will not be triggered by the same
> > > task calling FUSE_OPEN, so if there is a deadlock, it will happen
> > > with sync release as well.
> >
> > If the server is single-threaded, I think the FUSE_RELEASE would have
> > to happen on the same task as FUSE_OPEN, so if the release is
> > synchronous, this would avoid the deadlock because that guarantees the
> > FUSE_RELEASE happens before the next FUSE_OPEN.
>
> On a single-threaded server(!) I would hope that the release would be
> issued to the fuse server before the open.  (I'm not sure I understand

I don't think this is 100% guaranteed if fuse sends the release
request asynchronously rather than synchronously (e.g. the request gets
stalled on the bg queue if active_background >= max_background).
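
To illustrate the stall described above, here is a minimal userspace
model of a throttled background queue. It is only a sketch: the struct
and the bg_model_* helpers are hypothetical names made up for this
example, and only the two counters (active_background, max_background)
echo what is discussed in the thread. It is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not a kernel structure. */
struct bg_model {
	unsigned max_background;    /* cap on background requests sent at once */
	unsigned active_background; /* background requests sent, not yet answered */
	unsigned queued;            /* background requests still parked on the queue */
};

/* Dispatch parked requests while there is room under the cap. */
static unsigned bg_model_flush(struct bg_model *m)
{
	unsigned sent = 0;

	while (m->active_background < m->max_background && m->queued > 0) {
		m->queued--;
		m->active_background++;
		sent++;
	}
	return sent;
}

/*
 * Queue one background request (e.g. a RELEASE).  Returns true if it was
 * dispatched right away, false if it is stalled behind the cap.
 */
static bool bg_model_queue(struct bg_model *m)
{
	m->queued++;
	return bg_model_flush(m) > 0;
}

/* The server answered one background request; a slot frees up. */
static unsigned bg_model_complete(struct bg_model *m)
{
	assert(m->active_background > 0);
	m->active_background--;
	return bg_model_flush(m);
}
```

With max_background set to 1, a second queued RELEASE stays parked until
the first one is answered, which is the window in which a later
foreground OPEN could reach the server first.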

> where this part of the thread went, because why would that happen?  And
> why would the fuse server hold a lock across requests?)

The fuse server holding a lock across requests example was a contrived
one to illustrate that an async release could be racy if a fuse server
implementation has the (standard?) expectation that release and opens
are always received in order.
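
A rough sketch of that contrived case, under the stated assumptions
only: a single-threaded server that takes a per-file lock in OPEN and
drops it in RELEASE. The enum and function names are hypothetical and
exist just for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request kinds for the model. */
enum model_op { MODEL_OPEN, MODEL_RELEASE };

/*
 * Returns true if a single-threaded server that handles `ops` strictly in
 * arrival order would block forever.  `lock_held` is the lock state left
 * behind by an earlier OPEN whose RELEASE may still be pending.
 */
static bool model_single_thread_deadlocks(const enum model_op *ops, size_t n,
					  bool lock_held)
{
	for (size_t i = 0; i < n; i++) {
		if (ops[i] == MODEL_OPEN) {
			if (lock_held)
				return true; /* the only thread blocks; RELEASE never runs */
			lock_held = true;
		} else {
			lock_held = false;   /* RELEASE drops the lock */
		}
	}
	return false;
}
```

If the RELEASE arrives first the model completes; if the OPEN overtakes
it, the lone thread waits on a lock that only the still-queued RELEASE
would drop.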

>
> > However now that you pointed out FUSE_FLUSH gets sent on the release
> > path, that addresses my worry about async FUSE_RELEASE returning
> > before the server has gotten a chance to write out their local buffer
> > cache.
>
> <nod>
>
> --D
>
> > Thanks,
> > Joanne
> > >
> > > Thanks,
> > > Amir.
> >


* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-23 17:34           ` Darrick J. Wong
@ 2025-07-23 21:02             ` Joanne Koong
  2025-07-23 21:11               ` Joanne Koong
  2025-07-24 22:28               ` Darrick J. Wong
  0 siblings, 2 replies; 174+ messages in thread
From: Joanne Koong @ 2025-07-23 21:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, miklos, bernd

On Wed, Jul 23, 2025 at 10:34 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Mon, Jul 21, 2025 at 01:32:43PM -0700, Joanne Koong wrote:
> > On Fri, Jul 18, 2025 at 5:32 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> > > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > >
> > > > > generic/488 fails with fuse2fs in the following fashion:
> > > > >
> > > > > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > > > (see /var/tmp/fstests/generic/488.full for details)
> > > > >
> > > > > This test opens a large number of files, unlinks them (which really just
> > > > > renames them to fuse hidden files), closes the program, unmounts the
> > > > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > > > in the filesystem.
> > > > >
> > > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > > commands that are queued up on behalf of the unlinked files at the time
> > > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > > removing the hidden files; instead they stick around.
> > > >
> > > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > > of synchronous. For example for fuse servers that cache their data and
> > > > only write the buffer out to some remote filesystem when the file gets
> > > > closed, it seems useful for them to (like nfs) be able to return an
> > > > error to the client for close() if there's a failure committing that
> > >
> > > I don't think supplying a return value for close() is as helpful as it
> > > > seems -- the manpage says that there is no guarantee that data has been
> > > flushed to disk; and if the file is removed from the process' fd table
> > > then the operation succeeded no matter the return value. :P
> > >
> > > (Also C programmers tend to be sloppy and not check the return value.)
> >
> > Amir pointed out FUSE_FLUSH gets sent on the FUSE_RELEASE path so that
> > addresses my worry. FUSE_FLUSH is sent synchronously (and close() will
> > propagate any flush errors too), so now if there's an abort or
> > something right after close() returns, the client is guaranteed that
> > any data they wrote into a local cache has been flushed by the server.
>
> <nod>
>
> > >
> > > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > > that when close() returns, all the processing/cleanup for that file
> > > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > > the server holds local locks that get released in FUSE_RELEASE, if a
> > >
> > > > Yes.  I think it's only useful for the case outlined in that patch, which
> > > is that a program started an asyncio operation and then closed the fd.
> > > In that particular case the program unambiguously doesn't care about the
> > > return value of close so it's ok to perform the release asynchronously.
> >
> > I wonder why fuseblk devices need to be synchronously released. The
> > > comment says "Make the release synchronous if this is a fuseblk
> > mount, synchronous RELEASE is allowed (and desirable)". Why is it
> > desirable?
>
> Err, which are you asking about?
>
> Are you asking why it is that fuseblk mounts call FUSE_DESTROY from
> unmount instead of letting libfuse synthesize it once the event loop
> terminates?  I think that's because in the fuseblk case, the kernel has
> the block device open for itself, so the fuse server must write and
> flush all dirty data before the unmount() returns to the caller.
>
> Or were you asking why synchronous RELEASE is done on fuseblk
> filesystems?  Here is my speculation:
>
> Synchronous RELEASE was added back in commit 5a18ec176c934c ("fuse: fix
> hang of single threaded fuseblk filesystem").  I /think/ the idea behind
> that patch was that for fuseblk servers, we're ok with issuing a
> FUSE_DESTROY request from the kernel and waiting on it.
>
> However, for that to work correctly, all previous pending requests
> anywhere in the fuse mount have to be flushed to and completed by the
> fuse server before we can send DESTROY, because destroy closes the
> filesystem.
>
> So I think the idea behind 5a18ec176c934c is that we make FUSE_RELEASE
> synchronous so it's not possible to umount(8) until all the release
> requests are finished.

Thanks for the explanation. With the fix you added in this patch then,
it seems there's no reason fuseblk requests shouldn't now also be
asynchronous, since your fix ensures that all pending requests have
been flushed and completed before issuing the DESTROY.

>
> > > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > > grabbing that lock, then we end up deadlocked if the server is
> > > > single-threaded.
> > >
> > > Hrm.  I suppose if you had a script that ran two programs one after the
> > > other, each of which expected to be able to open and lock the same file,
> > > then you could run into problems if the lock isn't released by the time
> > > the second program is ready to open the file.
> >
> > I think in your scenario with the two programs, the worst outcome is
> > that the open/lock acquiring can take a while but in the (contrived
> > and probably far-fetched) scenario where it's single threaded, it
> > would result in a complete deadlock.
>
> <nod> I concede it's a minor point. :)
>
> > > But having said that, some other program could very well open and lock
> > > the file as soon as the lock drops.
> > >
> > > > I saw in your first patch that sending FUSE_RELEASE synchronously
> > > > leads to a deadlock under AIO but AFAICT, that happens because we
> > > > execute req->args->end() in fuse_request_end() synchronously; I think
> > > > if we execute that release asynchronously on a worker thread then that
> > > > gets rid of the deadlock.
> > >
> > > <nod> Last time I think someone replied that maybe they should all be
> > > asynchronous.
> > >
> > > > If FUSE_RELEASE must be asynchronous though, then your approach makes
> > > > sense to me.
> > >
> > > I think it only has to be asynchronous for the weird case outlined in
> > > that patch (fuse server gets stuck closing its own client's fds).
> > > Personally I think release ought to be synchronous at least as far as
> > > the kernel doing all the stuff that close() says it has to do (removal
> > > of record locks, deleting the fd table entry).
> > >
> > > Note that doesn't necessarily mean that the kernel has to be completely
> > > done with all the work that entails.  XFS defers freeing of unlinked
> > > files until a background garbage collector gets around to doing that.
> > > Other filesystems will actually make you wait while they free all the
> > > data blocks and the inode.  But the kernel has no idea what the fuse
> > > server actually does.
> >
> > I guess if that's important enough to the server, we could add
> > > something like an FOPEN flag for that, which servers could set on the file
> > handle if they want synchronous release?
>
> If a fuse server /did/ have background garbage collection, there are a
> few things it could do -- every time it sees a FUSE_RELEASE of an
> unlinked file, it could set a timer (say 50ms) after which it would kick
> the gc thread to do its thing.  Or it could wake up the background
> thread in response to a FUSE_SYNCFS command and hope it finishes by the
> time FUSE_DESTROY comes around.
>
> (Speaking of which, can we enable syncfs for all fuse servers?)

I'm not sure what you mean by this - I thought whether FUSE_SYNCFS is
implemented depends on each server, i.e. on whether it has set a
callback for it or not? Speaking of which, it doesn't look like
FUSE_SYNCFS support has been added to libfuse yet.

>
> But that said, not everyone wants the fancy background gc stuff that XFS
> does.  FUSE_RELEASE would then be doing a lot of work.
>
> > after Amir's point about FUSE_FLUSH, I'm in favor now of FUSE_RELEASE
> > being asynchronous.
> > >
> > > > > Create a function to push all the background requests to the queue and
> > > > > then wait for the number of pending events to hit zero, and call this
> > > > > before fuse_abort_conn.  That way, all the pending events are processed
> > > > > by the fuse server and we don't end up with a corrupt filesystem.
> > > > >
> > > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > ---
> > > > >  fs/fuse/fuse_i.h |    6 ++++++
> > > > >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> > > > >  fs/fuse/inode.c  |    1 +
> > > > >  3 files changed, 45 insertions(+)
> > > > >
> > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > +/*
> > > > > + * Flush all pending requests and wait for them.  Only call this function when
> > > > > + * it is no longer possible for other threads to add requests.
> > > > > + */
> > > > > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> > > >
> > > > It might be worth renaming this to something like
> > > > 'fuse_flush_bg_requests' to make it more clear that this only flushes
> > > > background requests
> > >
> > > Hum.  Did I not understand the code correctly?  I thought that
> > > flush_bg_queue puts all the background requests onto the active queue
> > > and issues them to the fuse server; and the wait_event_timeout sits
> > > around waiting for all the requests to receive their replies?
> >
> > Sorry, didn't mean to be confusing with my previous comment. What I
> > was trying to say is that "fuse_flush_requests" implies that all
> > requests get flushed to userspace but here only the background
> > requests get flushed.
>
> Oh, I see now, I /was/ mistaken.  Synchronous requests are ...
>
> Wait, no, still confused :(
>
> fuse_flush_requests waits until fuse_conn::num_waiting is zero.
>
> Synchronous requests (aka the ones sent through fuse_simple_request)
> bump num_waiting either directly in the args->force case or indirectly
> via fuse_get_req.  num_waiting is decremented in fuse_put_request.
> Therefore waiting for num_waiting to hit zero implements waiting for all
> the requests that were in flight before fuse_flush_requests was called.
>
> Background requests (aka the ones sent via fuse_simple_background) have
> num_waiting set in the !args->force case or indirectly in
> fuse_request_queue_background.  num_waiting is decremented in
> fuse_put_request the same as is done for synchronous requests.
>
> Therefore, it's correct to say that waiting for num_waiting to become 0
> is sufficient to wait for all pending requests anywhere in the
> fuse_mount to complete.

You're right, good point, waiting on fc->num_waiting == 0 also ensures
foreground requests have been completed. Sorry for the confusion!
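
For reference, the counting argument can be sketched in a few lines of
userspace C. This is a simplified model under assumptions, not the
kernel implementation: struct conn_model and the model_* helpers are
hypothetical names, and only num_waiting corresponds to the counter
named in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-connection accounting. */
struct conn_model {
	unsigned num_waiting; /* requests created but not yet put */
	unsigned bg_queued;   /* background requests not yet sent to the server */
	unsigned in_flight;   /* requests sent and awaiting a reply */
};

/* Foreground request: counted, then sent immediately. */
static void model_start_foreground(struct conn_model *c)
{
	c->num_waiting++;
	c->in_flight++;
}

/* Background request: counted, then parked on the background queue. */
static void model_start_background(struct conn_model *c)
{
	c->num_waiting++;
	c->bg_queued++;
}

/* Push every parked background request out to the server. */
static void model_flush_bg(struct conn_model *c)
{
	c->in_flight += c->bg_queued;
	c->bg_queued = 0;
}

/* A reply arrived; the request ends and its count is dropped. */
static void model_reply(struct conn_model *c)
{
	assert(c->in_flight > 0 && c->num_waiting > 0);
	c->in_flight--;
	c->num_waiting--;
}

/* The wait condition: nothing of either kind is pending. */
static bool model_idle(const struct conn_model *c)
{
	return c->num_waiting == 0;
}
```

Because both kinds of request bump the same counter, model_idle() only
becomes true once every foreground and background request has been
answered, which is why flushing the background queue first and then
waiting on the single counter covers both.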

Connections can also be aborted through the
/sys/fs/fuse/connections/*/abort interface or through request timeouts
(eg fuse_check_timeout()) - should those places too flush pending
requests and wait for them before aborting the connection?

>
> Right?
>
> Maybe this should be called fuse_flush_requests_and_wait. :)
>
> --D
>
> > Thanks,
> > Joanne
> > >
> > > I could be mistaken though.  This is my rough understanding of what
> > > happens to background requests:
> > >
> > > 1. Request created
> > > 2. Put request on bg_queue
> > > 3. <wait>
> > > 4. Request removed from bg_queue
> > > 5. Request sent
> > > 6. <wait>
> > > 7. Reply received
> > > 8. Request ends and is _put.
> > >
> > > Non-background (foreground?) requests skip steps 2-4.  Meanwhile,
> > > > fc->num_waiting tracks the number of requests that are anywhere between the
> > > end of step 1 and the start of step 8.
> > >
> > > In any case, I want to push all the bg requests and wait until there are
> > > no more requests in the system.
> > >
> > > --D
> >


* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-23 21:02             ` Joanne Koong
@ 2025-07-23 21:11               ` Joanne Koong
  2025-07-24 22:28               ` Darrick J. Wong
  1 sibling, 0 replies; 174+ messages in thread
From: Joanne Koong @ 2025-07-23 21:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, miklos, bernd

On Wed, Jul 23, 2025 at 2:02 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Jul 23, 2025 at 10:34 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Mon, Jul 21, 2025 at 01:32:43PM -0700, Joanne Koong wrote:
> > > On Fri, Jul 18, 2025 at 5:32 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> > > > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > >
> > > > > > generic/488 fails with fuse2fs in the following fashion:
> > > > > >
> > > > > > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > > > > (see /var/tmp/fstests/generic/488.full for details)
> > > > > >
> > > > > > This test opens a large number of files, unlinks them (which really just
> > > > > > renames them to fuse hidden files), closes the program, unmounts the
> > > > > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > > > > in the filesystem.
> > > > > >
> > > > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > > > commands that are queued up on behalf of the unlinked files at the time
> > > > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > > > removing the hidden files; instead they stick around.
> > > > >
> > > > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > > > of synchronous. For example for fuse servers that cache their data and
> > > > > only write the buffer out to some remote filesystem when the file gets
> > > > > closed, it seems useful for them to (like nfs) be able to return an
> > > > > error to the client for close() if there's a failure committing that
> > > >
> > > > I don't think supplying a return value for close() is as helpful as it
> > > > > seems -- the manpage says that there is no guarantee that data has been
> > > > flushed to disk; and if the file is removed from the process' fd table
> > > > then the operation succeeded no matter the return value. :P
> > > >
> > > > (Also C programmers tend to be sloppy and not check the return value.)
> > >
> > > Amir pointed out FUSE_FLUSH gets sent on the FUSE_RELEASE path so that
> > > addresses my worry. FUSE_FLUSH is sent synchronously (and close() will
> > > propagate any flush errors too), so now if there's an abort or
> > > something right after close() returns, the client is guaranteed that
> > > any data they wrote into a local cache has been flushed by the server.
> >
> > <nod>
> >
> > > >
> > > > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > > > that when close() returns, all the processing/cleanup for that file
> > > > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > > > the server holds local locks that get released in FUSE_RELEASE, if a
> > > >
> > > > > Yes.  I think it's only useful for the case outlined in that patch, which
> > > > is that a program started an asyncio operation and then closed the fd.
> > > > In that particular case the program unambiguously doesn't care about the
> > > > return value of close so it's ok to perform the release asynchronously.
> > >
> > > I wonder why fuseblk devices need to be synchronously released. The
> > > > comment says "Make the release synchronous if this is a fuseblk
> > > mount, synchronous RELEASE is allowed (and desirable)". Why is it
> > > desirable?
> >
> > Err, which are you asking about?
> >
> > Are you asking why it is that fuseblk mounts call FUSE_DESTROY from
> > unmount instead of letting libfuse synthesize it once the event loop
> > terminates?  I think that's because in the fuseblk case, the kernel has
> > the block device open for itself, so the fuse server must write and
> > flush all dirty data before the unmount() returns to the caller.
> >
> > Or were you asking why synchronous RELEASE is done on fuseblk
> > filesystems?  Here is my speculation:
> >
> > Synchronous RELEASE was added back in commit 5a18ec176c934c ("fuse: fix
> > hang of single threaded fuseblk filesystem").  I /think/ the idea behind
> > that patch was that for fuseblk servers, we're ok with issuing a
> > FUSE_DESTROY request from the kernel and waiting on it.
> >
> > However, for that to work correctly, all previous pending requests
> > anywhere in the fuse mount have to be flushed to and completed by the
> > fuse server before we can send DESTROY, because destroy closes the
> > filesystem.
> >
> > So I think the idea behind 5a18ec176c934c is that we make FUSE_RELEASE
> > synchronous so it's not possible to umount(8) until all the release
> > requests are finished.
>
> Thanks for the explanation. With the fix you added in this patch then,
> it seems there's no reason fuseblk requests shouldn't now also be
> asynchronous since your fix ensures that all pending requests have
> been flushed and completed before issuing the DESTROY
>
> >
> > > > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > > > grabbing that lock, then we end up deadlocked if the server is
> > > > > single-threaded.
> > > >
> > > > Hrm.  I suppose if you had a script that ran two programs one after the
> > > > other, each of which expected to be able to open and lock the same file,
> > > > then you could run into problems if the lock isn't released by the time
> > > > the second program is ready to open the file.
> > >
> > > I think in your scenario with the two programs, the worst outcome is
> > > that the open/lock acquiring can take a while but in the (contrived
> > > and probably far-fetched) scenario where it's single threaded, it
> > > would result in a complete deadlock.
> >
> > <nod> I concede it's a minor point. :)
> >
> > > > But having said that, some other program could very well open and lock
> > > > the file as soon as the lock drops.
> > > >
> > > > > I saw in your first patch that sending FUSE_RELEASE synchronously
> > > > > leads to a deadlock under AIO but AFAICT, that happens because we
> > > > > execute req->args->end() in fuse_request_end() synchronously; I think
> > > > > if we execute that release asynchronously on a worker thread then that
> > > > > gets rid of the deadlock.
> > > >
> > > > <nod> Last time I think someone replied that maybe they should all be
> > > > asynchronous.
> > > >
> > > > > If FUSE_RELEASE must be asynchronous though, then your approach makes
> > > > > sense to me.
> > > >
> > > > I think it only has to be asynchronous for the weird case outlined in
> > > > that patch (fuse server gets stuck closing its own client's fds).
> > > > Personally I think release ought to be synchronous at least as far as
> > > > the kernel doing all the stuff that close() says it has to do (removal
> > > > of record locks, deleting the fd table entry).
> > > >
> > > > Note that doesn't necessarily mean that the kernel has to be completely
> > > > done with all the work that entails.  XFS defers freeing of unlinked
> > > > files until a background garbage collector gets around to doing that.
> > > > Other filesystems will actually make you wait while they free all the
> > > > data blocks and the inode.  But the kernel has no idea what the fuse
> > > > server actually does.
> > >
> > > I guess if that's important enough to the server, we could add
> > > > something like an FOPEN flag for that, which servers could set on the file
> > > handle if they want synchronous release?
> >
> > If a fuse server /did/ have background garbage collection, there are a
> > few things it could do -- every time it sees a FUSE_RELEASE of an
> > unlinked file, it could set a timer (say 50ms) after which it would kick
> > the gc thread to do its thing.  Or it could wake up the background
> > thread in response to a FUSE_SYNCFS command and hope it finishes by the
> > time FUSE_DESTROY comes around.
> >
> > (Speaking of which, can we enable syncfs for all fuse servers?)
>
> I'm not sure what you mean by this - i thought the implementation of
> FUSE_SYNCFS is dependent on each server's logic depending on if
> they've set a callback for it or not? Speaking of which, it doesn't
> look like FUSE_SYNCFS support has been added to libfuse yet.
>
> >
> > But that said, not everyone wants the fancy background gc stuff that XFS
> > does.  FUSE_RELEASE would then be doing a lot of work.
> >
> > > after Amir's point about FUSE_FLUSH, I'm in favor now of FUSE_RELEASE
> > > being asynchronous.
> > > >
> > > > > > Create a function to push all the background requests to the queue and
> > > > > > then wait for the number of pending events to hit zero, and call this
> > > > > > before fuse_abort_conn.  That way, all the pending events are processed
> > > > > > by the fuse server and we don't end up with a corrupt filesystem.
> > > > > >
> > > > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > > ---
> > > > > >  fs/fuse/fuse_i.h |    6 ++++++
> > > > > >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> > > > > >  fs/fuse/inode.c  |    1 +
> > > > > >  3 files changed, 45 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > > +/*
> > > > > > + * Flush all pending requests and wait for them.  Only call this function when
> > > > > > + * it is no longer possible for other threads to add requests.
> > > > > > + */
> > > > > > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> > > > >
> > > > > It might be worth renaming this to something like
> > > > > 'fuse_flush_bg_requests' to make it more clear that this only flushes
> > > > > background requests
> > > >
> > > > Hum.  Did I not understand the code correctly?  I thought that
> > > > flush_bg_queue puts all the background requests onto the active queue
> > > > and issues them to the fuse server; and the wait_event_timeout sits
> > > > around waiting for all the requests to receive their replies?
> > >
> > > Sorry, didn't mean to be confusing with my previous comment. What I
> > > was trying to say is that "fuse_flush_requests" implies that all
> > > requests get flushed to userspace but here only the background
> > > requests get flushed.
> >
> > Oh, I see now, I /was/ mistaken.  Synchronous requests are ...
> >
> > Wait, no, still confused :(
> >
> > fuse_flush_requests waits until fuse_conn::num_waiting is zero.
> >
> > Synchronous requests (aka the ones sent through fuse_simple_request)
> > bump num_waiting either directly in the args->force case or indirectly
> > via fuse_get_req.  num_waiting is decremented in fuse_put_request.
> > Therefore waiting for num_waiting to hit zero implements waiting for all
> > the requests that were in flight before fuse_flush_requests was called.
> >
> > Background requests (aka the ones sent via fuse_simple_background) have
> > num_waiting set in the !args->force case or indirectly in
> > fuse_request_queue_background.  num_waiting is decremented in
> > fuse_put_request the same as is done for synchronous requests.
> >
> > Therefore, it's correct to say that waiting for num_waiting to become 0
> > is sufficient to wait for all pending requests anywhere in the
> > fuse_mount to complete.
>
> You're right, good point, waiting on fc->num_waiting == 0 also ensures
> foreground requests have been completed. sorry for the confusion!
>
> Connections can also be aborted through the
> /sys/fs/fuse/connections/*/abort interface or through request timeouts
> (eg fuse_check_timeout()) - should those places too flush pending
> requests and wait for them before aborting the connection?
>

Or I guess just the FUSE_RELEASE requests, since those seem to be the
only ones that could lead to on-disk inconsistencies if not completed.


* Re: [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags
  2025-07-23 17:50           ` Darrick J. Wong
@ 2025-07-24 19:56             ` Amir Goldstein
  2025-07-29  5:35               ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Amir Goldstein @ 2025-07-24 19:56 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bernd Schubert, John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, bernd@bsbernd.com, neal@gompa.dev,
	miklos@szeredi.hu

> > Also a bit surprising to see all your lowlevel work and then fuse high
> > level coming ;)
>
> Right now fuse2fs is a high level fuse server, so I hacked whatever I
> needed into fuse.c to make it sort of work, awkwardly.  That stuff
> doesn't need to live forever.
>
> In the long run, the lowlevel server will probably have better
> performance because fuse2fs++ can pass ext2 inode numbers to the kernel
> as the nodeids, and libext2fs can look up inodes via nodeid.  No more
> path construction overhead!
>

I was wondering how well an LLM would do at the mechanical task of
converting fuse2fs to a low level fuse fs, so I was tempted to try.

Feel free to use it or lose it or use as a reference, because at least
for basic testing it seems to work:
https://github.com/amir73il/e2fsprogs/commits/fuse4fs/

Thanks,
Amir.

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-23 21:02             ` Joanne Koong
  2025-07-23 21:11               ` Joanne Koong
@ 2025-07-24 22:28               ` Darrick J. Wong
  1 sibling, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-24 22:28 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, neal, John, miklos, bernd

On Wed, Jul 23, 2025 at 02:02:19PM -0700, Joanne Koong wrote:
> On Wed, Jul 23, 2025 at 10:34 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Mon, Jul 21, 2025 at 01:32:43PM -0700, Joanne Koong wrote:
> > > On Fri, Jul 18, 2025 at 5:32 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Fri, Jul 18, 2025 at 03:23:30PM -0700, Joanne Koong wrote:
> > > > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > >
> > > > > > generic/488 fails with fuse2fs in the following fashion:
> > > > > >
> > > > > > generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > > > > (see /var/tmp/fstests/generic/488.full for details)
> > > > > >
> > > > > > This test opens a large number of files, unlinks them (which really just
> > > > > > renames them to fuse hidden files), closes the program, unmounts the
> > > > > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > > > > in the filesystem.
> > > > > >
> > > > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > > > commands that are queued up on behalf of the unlinked files at the time
> > > > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > > > removing the hidden files; instead they stick around.
> > > > >
> > > > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > > > of synchronous. For example for fuse servers that cache their data and
> > > > > only write the buffer out to some remote filesystem when the file gets
> > > > > closed, it seems useful for them to (like nfs) be able to return an
> > > > > error to the client for close() if there's a failure committing that
> > > >
> > > > I don't think supplying a return value for close() is as helpful as it
> > > > seems -- the manpage says that there is no guarantee that data has been
> > > > flushed to disk; and if the file is removed from the process' fd table
> > > > then the operation succeeded no matter the return value. :P
> > > >
> > > > (Also C programmers tend to be sloppy and not check the return value.)
> > >
> > > Amir pointed out FUSE_FLUSH gets sent on the FUSE_RELEASE path so that
> > > addresses my worry. FUSE_FLUSH is sent synchronously (and close() will
> > > propagate any flush errors too), so now if there's an abort or
> > > something right after close() returns, the client is guaranteed that
> > > any data they wrote into a local cache has been flushed by the server.
> >
> > <nod>
> >
> > > >
> > > > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > > > that when close() returns, all the processing/cleanup for that file
> > > > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > > > the server holds local locks that get released in FUSE_RELEASE, if a
> > > >
> > > > Yes.  I think it's only useful for the case outlined in that patch, which
> > > > is that a program started an asyncio operation and then closed the fd.
> > > > In that particular case the program unambiguously doesn't care about the
> > > > return value of close so it's ok to perform the release asynchronously.
> > >
> > > I wonder why fuseblk devices need to be synchronously released. The
> > > comment says " Make the release synchronous if this is a fuseblk
> > > mount, synchronous RELEASE is allowed (and desirable)". Why is it
> > > desirable?
> >
> > Err, which are you asking about?
> >
> > Are you asking why it is that fuseblk mounts call FUSE_DESTROY from
> > unmount instead of letting libfuse synthesize it once the event loop
> > terminates?  I think that's because in the fuseblk case, the kernel has
> > the block device open for itself, so the fuse server must write and
> > flush all dirty data before the unmount() returns to the caller.
> >
> > Or were you asking why synchronous RELEASE is done on fuseblk
> > filesystems?  Here is my speculation:
> >
> > Synchronous RELEASE was added back in commit 5a18ec176c934c ("fuse: fix
> > hang of single threaded fuseblk filesystem").  I /think/ the idea behind
> > that patch was that for fuseblk servers, we're ok with issuing a
> > FUSE_DESTROY request from the kernel and waiting on it.
> >
> > However, for that to work correctly, all previous pending requests
> > anywhere in the fuse mount have to be flushed to and completed by the
> > fuse server before we can send DESTROY, because destroy closes the
> > filesystem.
> >
> > So I think the idea behind 5a18ec176c934c is that we make FUSE_RELEASE
> > synchronous so it's not possible to umount(8) until all the releases
> > requests are finished.
> 
> Thanks for the explanation. With the fix you added in this patch then,
> it seems there's no reason fuseblk requests shouldn't now also be
> asynchronous since your fix ensures that all pending requests have
> been flushed and completed before issuing the DESTROY

<nod>

> >
> > > > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > > > grabbing that lock, then we end up deadlocked if the server is
> > > > > single-threaded.
> > > >
> > > > Hrm.  I suppose if you had a script that ran two programs one after the
> > > > other, each of which expected to be able to open and lock the same file,
> > > > then you could run into problems if the lock isn't released by the time
> > > > the second program is ready to open the file.
> > >
> > > I think in your scenario with the two programs, the worst outcome is
> > > that the open/lock acquiring can take a while but in the (contrived
> > > and probably far-fetched) scenario where it's single threaded, it
> > > would result in a complete deadlock.
> >
> > <nod> I concede it's a minor point. :)
> >
> > > > But having said that, some other program could very well open and lock
> > > > the file as soon as the lock drops.
> > > >
> > > > > I saw in your first patch that sending FUSE_RELEASE synchronously
> > > > > leads to a deadlock under AIO but AFAICT, that happens because we
> > > > > execute req->args->end() in fuse_request_end() synchronously; I think
> > > > > if we execute that release asynchronously on a worker thread then that
> > > > > gets rid of the deadlock.
> > > >
> > > > <nod> Last time I think someone replied that maybe they should all be
> > > > asynchronous.
> > > >
> > > > > If FUSE_RELEASE must be asynchronous though, then your approach makes
> > > > > sense to me.
> > > >
> > > > I think it only has to be asynchronous for the weird case outlined in
> > > > that patch (fuse server gets stuck closing its own client's fds).
> > > > Personally I think release ought to be synchronous at least as far as
> > > > the kernel doing all the stuff that close() says it has to do (removal
> > > > of record locks, deleting the fd table entry).
> > > >
> > > > Note that doesn't necessarily mean that the kernel has to be completely
> > > > done with all the work that entails.  XFS defers freeing of unlinked
> > > > files until a background garbage collector gets around to doing that.
> > > > Other filesystems will actually make you wait while they free all the
> > > > data blocks and the inode.  But the kernel has no idea what the fuse
> > > > server actually does.
> > >
> > > I guess if that's important enough to the server, we could add
> > > something like an FOPEN flag that servers could set on the file
> > > handle if they want synchronous release?
> >
> > If a fuse server /did/ have background garbage collection, there are a
> > few things it could do -- every time it sees a FUSE_RELEASE of an
> > unlinked file, it could set a timer (say 50ms) after which it would kick
> > the gc thread to do its thing.  Or it could wake up the background
> > thread in response to a FUSE_SYNCFS command and hope it finishes by the
> > time FUSE_DESTROY comes around.
> >
> > (Speaking of which, can we enable syncfs for all fuse servers?)
> 
> I'm not sure what you mean by this - I thought the implementation of
> FUSE_SYNCFS is dependent on each server's logic depending on if
> they've set a callback for it or not? Speaking of which, it doesn't
> look like FUSE_SYNCFS support has been added to libfuse yet.

Curiously, it's only enabled for virtiofs:

$ grep -rnw sync_fs fs/fuse/
fs/fuse/virtio_fs.c:1702:       fc->sync_fs = true;
fs/fuse/file.c:2022:    if (!fc->sync_fs)
fs/fuse/fuse_i.h:920:   unsigned int sync_fs:1;
fs/fuse/inode.c:770:    if (!fc->sync_fs)
fs/fuse/inode.c:785:            fc->sync_fs = 0;
fs/fuse/inode.c:1243:   .sync_fs        = fuse_sync_fs,

This is in contrast to the usual mechanism, where fuse turns a command
on by default and turns it off if the server ever returns ENOSYS.  You're
correct that it hasn't been wired up to libfuse yet.
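
A minimal userspace sketch of that "on by default, off after the first
ENOSYS" pattern follows.  It is only an illustration: struct demo_conn and
every demo_* function are made-up names, not the fs/fuse code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a connection; not the kernel's struct fuse_conn. */
struct demo_conn {
	bool no_syncfs;		/* set once the server has answered ENOSYS */
	int upcalls;		/* how many requests actually reached the server */
};

/* Hypothetical server that does not implement the command. */
static int demo_server_without_syncfs(void)
{
	return -ENOSYS;
}

/* Hypothetical server that implements the command. */
static int demo_server_with_syncfs(void)
{
	return 0;
}

/*
 * Send the command unless an earlier attempt already showed that the
 * server lacks it.  ENOSYS is remembered and reported as success so
 * that callers do not fail merely because the server is old.
 */
static int demo_syncfs(struct demo_conn *conn, int (*server)(void))
{
	int err;

	assert(conn != NULL && server != NULL);

	if (conn->no_syncfs)
		return 0;

	conn->upcalls++;
	err = server();
	if (err == -ENOSYS) {
		conn->no_syncfs = true;
		err = 0;
	}
	return err;
}
```

With this shape, a server that lacks the command costs exactly one wasted
upcall per connection, and a server that has it needs no opt-in.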

One other thing I noticed after rebasing libfuse -- why are the ->statx
definitions in fuse.h/fuse_lowlevel.h protected by "#ifdef HAVE_STATX"?
AFAICT that symbol is defined by the build system for libfuse if the
system headers have a struct statx, right?  So I guess the idea is that
if you're building new libfuse on an old userspace, the stubs will
return ENOSYS to the fuse client?

Unfortunately, fuse{,_lowlevel}.h are public header files, and not all
downstreams are expected to have defined a HAVE_STATX symbol, right?

I would've thought they'd be protected by a
FUSE_USE_VERSION >= FUSE_MAKE_VERSION(3, 18)
but evidently that doesn't happen for the ops structures?
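
For illustration only, this is roughly what a version-gated member in a
public header could look like.  Every DEMO_* and demo_* name is
hypothetical; this is not the libfuse header and not a claim about how
libfuse should do it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* All DEMO_* names are hypothetical; this is not the libfuse header. */
#define DEMO_MAKE_VERSION(maj, min)	((maj) * 100 + (min))

/* A consumer would normally define this before including the header. */
#ifndef DEMO_USE_VERSION
#define DEMO_USE_VERSION	DEMO_MAKE_VERSION(3, 18)
#endif

struct demo_operations {
	int (*getattr)(const char *path);
#if DEMO_USE_VERSION >= DEMO_MAKE_VERSION(3, 18)
	/* Only visible to consumers that opted in to the newer API. */
	int (*statx)(const char *path, unsigned int mask);
#endif
};

/* Dispatch to the handler if one is visible and set; otherwise ENOSYS. */
static int demo_call_statx(const struct demo_operations *ops, const char *path)
{
	assert(ops != NULL);
#if DEMO_USE_VERSION >= DEMO_MAKE_VERSION(3, 18)
	if (ops->statx)
		return ops->statx(path, 0);
#endif
	(void)path;
	return -ENOSYS;
}
```

One caveat with this approach: gating a member changes the size of the
struct from one consumer to the next, so the library would have to learn
the caller's struct size at runtime.  That may be part of why ops
structures are not gated this way, though that is a guess.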

<confused>

> > But that said, not everyone wants the fancy background gc stuff that XFS
> > does.  FUSE_RELEASE would then be doing a lot of work.
> >
> > > after Amir's point about FUSE_FLUSH, I'm in favor now of FUSE_RELEASE
> > > being asynchronous.
> > > >
> > > > > > Create a function to push all the background requests to the queue and
> > > > > > then wait for the number of pending events to hit zero, and call this
> > > > > > before fuse_abort_conn.  That way, all the pending events are processed
> > > > > > by the fuse server and we don't end up with a corrupt filesystem.
> > > > > >
> > > > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > > ---
> > > > > >  fs/fuse/fuse_i.h |    6 ++++++
> > > > > >  fs/fuse/dev.c    |   38 ++++++++++++++++++++++++++++++++++++++
> > > > > >  fs/fuse/inode.c  |    1 +
> > > > > >  3 files changed, 45 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > > +/*
> > > > > > + * Flush all pending requests and wait for them.  Only call this function when
> > > > > > + * it is no longer possible for other threads to add requests.
> > > > > > + */
> > > > > > +void fuse_flush_requests(struct fuse_conn *fc, unsigned long timeout)
> > > > >
> > > > > It might be worth renaming this to something like
> > > > > 'fuse_flush_bg_requests' to make it more clear that this only flushes
> > > > > background requests
> > > >
> > > > Hum.  Did I not understand the code correctly?  I thought that
> > > > flush_bg_queue puts all the background requests onto the active queue
> > > > and issues them to the fuse server; and the wait_event_timeout sits
> > > > around waiting for all the requests to receive their replies?
> > >
> > > Sorry, didn't mean to be confusing with my previous comment. What I
> > > was trying to say is that "fuse_flush_requests" implies that all
> > > requests get flushed to userspace but here only the background
> > > requests get flushed.
> >
> > Oh, I see now, I /was/ mistaken.  Synchronous requests are ...
> >
> > Wait, no, still confused :(
> >
> > fuse_flush_requests waits until fuse_conn::num_waiting is zero.
> >
> > Synchronous requests (aka the ones sent through fuse_simple_request)
> > bump num_waiting either directly in the args->force case or indirectly
> > via fuse_get_req.  num_waiting is decremented in fuse_put_request.
> > Therefore waiting for num_waiting to hit zero implements waiting for all
> > the requests that were in flight before fuse_flush_requests was called.
> >
> > Background requests (aka the ones sent via fuse_simple_background) have
> > num_waiting set in the !args->force case or indirectly in
> > fuse_request_queue_background.  num_waiting is decremented in
> > fuse_put_request the same as is done for synchronous requests.
> >
> > Therefore, it's correct to say that waiting for num_waiting to become 0
> > is sufficient to wait for all pending requests anywhere in the
> > fuse_mount to complete.
> 
> You're right, good point, waiting on fc->num_waiting == 0 also ensures
> foreground requests have been completed. sorry for the confusion!

Ah, no worries.  I'm glad this pushed me to figure out why this really
worked. :)
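
A toy, single-threaded model of the counting argument above, assuming only
what is described in this thread: every request bumps a counter when it is
created and drops it when it is put, and the flush helper pushes parked
background requests before waiting.  The demo_* names are invented; the
real code uses locks and wait queues that are not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a connection; not the kernel's struct fuse_conn. */
struct demo_conn {
	int num_waiting;	/* requests created but not yet put */
	int bg_queued;		/* background requests parked on the bg queue */
	int active;		/* requests already handed to the server */
};

/* Synchronous request: counted, then sent straight to the server. */
static void demo_send_sync(struct demo_conn *c)
{
	c->num_waiting++;
	c->active++;
}

/* Background request: counted, then parked until something flushes it. */
static void demo_send_background(struct demo_conn *c)
{
	c->num_waiting++;
	c->bg_queued++;
}

/* The server replies to one active request; the request ends and is put. */
static int demo_server_reply(struct demo_conn *c)
{
	if (c->active == 0)
		return 0;
	c->active--;
	c->num_waiting--;
	return 1;
}

/*
 * Push every parked background request to the server, then keep taking
 * replies until nothing is waiting.  Returns the number of replies seen.
 */
static int demo_flush_and_wait(struct demo_conn *c)
{
	int replies = 0;

	assert(c != NULL);
	c->active += c->bg_queued;
	c->bg_queued = 0;

	/* Everything still counted is now in the server's hands. */
	assert(c->active == c->num_waiting);

	while (c->num_waiting > 0)
		replies += demo_server_reply(c);
	return replies;
}
```

The point of the model is that one counter covers both kinds of request, so
waiting for it to reach zero covers foreground requests as well as the
background ones that were just pushed.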

> Connections can also be aborted through the
> /sys/fs/fuse/connections/*/abort interface or through request timeouts
> (eg fuse_check_timeout()) - should those places too flush pending
> requests and wait for them before aborting the connection?

I'm not sure since I wasn't around when they added that, but I imagine
if things get to the point where a sysadmin or whoever needs to kill a
fuse mount via sysfs then they probably want to pull down the
connection ASAP.

--D

> >
> > Right?
> >
> > Maybe this should be called fuse_flush_requests_and_wait. :)
> >
> > --D
> >
> > > Thanks,
> > > Joanne
> > > >
> > > > I could be mistaken though.  This is my rough understanding of what
> > > > happens to background requests:
> > > >
> > > > 1. Request created
> > > > 2. Put request on bg_queue
> > > > 3. <wait>
> > > > 4. Request removed from bg_queue
> > > > 5. Request sent
> > > > 6. <wait>
> > > > 7. Reply received
> > > > 8. Request ends and is _put.
> > > >
> > > > Non-background (foreground?) requests skip steps 2-4.  Meanwhile,
> > > > fc->num_waiting tracks the number of requests that are anywhere between the
> > > > end of step 1 and the start of step 8.
> > > >
> > > > In any case, I want to push all the bg requests and wait until there are
> > > > no more requests in the system.
> > > >
> > > > --D
> > >
> 

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-23 20:27             ` Joanne Koong
@ 2025-07-24 22:34               ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-24 22:34 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Amir Goldstein, linux-fsdevel, neal, John, miklos, bernd

On Wed, Jul 23, 2025 at 01:27:44PM -0700, Joanne Koong wrote:
> On Wed, Jul 23, 2025 at 10:06 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Mon, Jul 21, 2025 at 01:05:02PM -0700, Joanne Koong wrote:
> > > On Sat, Jul 19, 2025 at 12:18 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > On Sat, Jul 19, 2025 at 12:23 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jul 17, 2025 at 4:26 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > >
> > > > > > generic/488 fails with fuse2fs in the following fashion:
> > > > > >
> > > > > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > > > > files left over in the filesystem, with incorrect link counts.  Tracing
> > > > > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > > > > commands that are queued up on behalf of the unlinked files at the time
> > > > > > that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
> > > > > > aborted, the fuse server would have responded to the RELEASE commands by
> > > > > > removing the hidden files; instead they stick around.
> > > > >
> > > > > Tbh it's still weird to me that FUSE_RELEASE is asynchronous instead
> > > > > of synchronous. For example for fuse servers that cache their data and
> > > > > only write the buffer out to some remote filesystem when the file gets
> > > > > closed, it seems useful for them to (like nfs) be able to return an
> > > > > error to the client for close() if there's a failure committing that
> > > > > data; that also has clearer API semantics imo, eg users are guaranteed
> > > > > that when close() returns, all the processing/cleanup for that file
> > > > > has been completed.  Async FUSE_RELEASE also seems kind of racy, eg if
> > > > > the server holds local locks that get released in FUSE_RELEASE, if a
> > > > > subsequent FUSE_OPEN happens before FUSE_RELEASE then depends on
> > > > > grabbing that lock, then we end up deadlocked if the server is
> > > > > single-threaded.
> > > > >
> > > >
> > > > There is a very good reason for keeping FUSE_FLUSH and FUSE_RELEASE
> > > > (as well as those vfs ops) separate.
> > >
> > > Oh interesting, I didn't realize FUSE_FLUSH gets also sent on the
> > > release path. I had assumed FUSE_FLUSH was for the sync()/fsync()
> >
> > (That's FUSE_FSYNC)
> >
> > > case. But I see now that you're right, close() makes a call to
> > > filp_flush() in the vfs layer. (and I now see there's FUSE_FSYNC for
> > > the fsync() case)
> >
> > Yeah, flush-on-close (FUSE_FLUSH) is generally a good idea for
> > "unreliable" filesystems -- either because they're remote, or because
> > the local storage they're on could get yanked at any time.  It's slow,
> > but it papers over a lot of bugs and "bad" usage.
> >
> > > > A filesystem can decide if it needs synchronous close() (not release).
> > > > And with FOPEN_NOFLUSH, the filesystem can decide that per open file,
> > > > (unless it conflicts with a config like writeback cache).
> > > >
> > > > I have a filesystem which can do very slow io and some clients
> > > > can get stuck doing open;fstat;close if close is always synchronous.
> > > > I actually found the libfuse feature of async flush (FUSE_RELEASE_FLUSH)
> > > > quite useful for my filesystem, so I carry a kernel patch to support it.
> > > >
> > > > The issue of racing that you mentioned sounds odd.
> > > > First of all, who runs a single threaded fuse server?
> > > > Second, what does it matter if release is sync or async,
> > > > FUSE_RELEASE will not be triggered by the same
> > > > task calling FUSE_OPEN, so if there is a deadlock, it will happen
> > > > with sync release as well.
> > >
> > > If the server is single-threaded, I think the FUSE_RELEASE would have
> > > to happen on the same task as FUSE_OPEN, so if the release is
> > > synchronous, this would avoid the deadlock because that guarantees the
> > > FUSE_RELEASE happens before the next FUSE_OPEN.
> >
> > On a single-threaded server(!) I would hope that the release would be
> > issued to the fuse server before the open.  (I'm not sure I understand
> 
> I don't think this is 100% guaranteed if fuse sends the release
> request asynchronously rather than synchronously (eg the request gets
> stalled on the bg queue if active_background >= max_background)

Humm, that /is/ a weird one.  I guess there's nothing to prevent an OPEN
from racing with a RELEASE, since those two operations concern
themselves with *files*.  I suppose that means that if a fuse server
wants to hold a lock across fuse commands, then it had better be really
careful about that.
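
To make the stall concrete, here is a small model of the scenario described
above: a background RELEASE sits on the bg queue because the background
limit has been reached, and a later synchronous OPEN reaches the server
first.  All demo_* names and the array sizes are invented for the sketch.

```c
#include <assert.h>
#include <string.h>

#define DEMO_BG_MAX	8
#define DEMO_SEEN_MAX	16

enum demo_op { DEMO_OPEN, DEMO_RELEASE, DEMO_OTHER };

/* Hypothetical model of a connection; not the kernel's struct fuse_conn. */
struct demo_conn {
	int max_background;
	int active_background;
	enum demo_op bg_queue[DEMO_BG_MAX];	/* parked background requests */
	int bg_len;
	enum demo_op seen[DEMO_SEEN_MAX];	/* order the server receives them */
	int seen_len;
};

/* Record that the server has received a request. */
static void demo_deliver(struct demo_conn *c, enum demo_op op)
{
	assert(c->seen_len < DEMO_SEEN_MAX);
	c->seen[c->seen_len++] = op;
}

/* Hand parked requests to the server while there is room for them. */
static void demo_flush_bg_queue(struct demo_conn *c)
{
	int i = 0;

	while (i < c->bg_len && c->active_background < c->max_background) {
		c->active_background++;
		demo_deliver(c, c->bg_queue[i++]);
	}
	memmove(c->bg_queue, c->bg_queue + i,
		(size_t)(c->bg_len - i) * sizeof(c->bg_queue[0]));
	c->bg_len -= i;
}

/* Background requests always go through the bg queue. */
static void demo_send_background(struct demo_conn *c, enum demo_op op)
{
	assert(c->bg_len < DEMO_BG_MAX);
	c->bg_queue[c->bg_len++] = op;
	demo_flush_bg_queue(c);
}

/* Synchronous requests bypass the bg queue entirely. */
static void demo_send_sync(struct demo_conn *c, enum demo_op op)
{
	demo_deliver(c, op);
}

/* A background request completes, which frees a slot. */
static void demo_background_done(struct demo_conn *c)
{
	assert(c->active_background > 0);
	c->active_background--;
	demo_flush_bg_queue(c);
}
```

With max_background set to 1 and one unrelated background request already
active, sending a RELEASE and then an OPEN makes the model's server see the
OPEN before the RELEASE.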

> > where this part of the thread went, because why would that happen?  And
> > why would the fuse server hold a lock across requests?)
> 
> The fuse server holding a lock across requests example was a contrived
> one to illustrate that an async release could be racy if a fuse server
> implementation has the (standard?) expectation that release and opens
> are always received in order.

<nod> I think it's quite common, since each open() call in userspace
creates a new struct file, even though they all point to the same inode.
That might be why you can't normally open-and-lock a resource.  opens
shouldn't stall indefinitely...(?)

--D

> >
> > > However now that you pointed out FUSE_FLUSH gets sent on the release
> > > path, that addresses my worry about async FUSE_RELEASE returning
> > > before the server has gotten a chance to write out their local buffer
> > > cache.
> >
> > <nod>
> >
> > --D
> >
> > > Thanks,
> > > Joanne
> > > >
> > > > Thanks,
> > > > Amir.
> > >
> 

* Re: [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags
  2025-07-24 19:56             ` Amir Goldstein
@ 2025-07-29  5:35               ` Darrick J. Wong
  2025-07-29  7:50                 ` Amir Goldstein
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-29  5:35 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Bernd Schubert, John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, bernd@bsbernd.com, neal@gompa.dev,
	miklos@szeredi.hu

On Thu, Jul 24, 2025 at 09:56:16PM +0200, Amir Goldstein wrote:
> > > Also a bit surprising to see all your lowlevel work and then fuse high
> > > level coming ;)
> >
> > Right now fuse2fs is a high level fuse server, so I hacked whatever I
> > needed into fuse.c to make it sort of work, awkwardly.  That stuff
> > doesn't need to live forever.
> >
> > In the long run, the lowlevel server will probably have better
> > performance because fuse2fs++ can pass ext2 inode numbers to the kernel
> > as the nodeids, and libext2fs can look up inodes via nodeid.  No more
> > path construction overhead!
> >
> 
> I was wondering how well an LLM would do at the mechanical task of
> converting fuse2fs to a low level fuse fs, so I was tempted to try.
> 
> Feel free to use it or lose it or use as a reference, because at least
> for basic testing it seems to work:
> https://github.com/amir73il/e2fsprogs/commits/fuse4fs/

Heh, I'll take a closer look in the morning, but it looks like a
reasonable conversion.  Are you willing to add a "Co-developed-by" tag
per Sasha's recent proposal[1] if I pull it in?

--D

[1] https://lore.kernel.org/lkml/20250727195802.2222764-1-sashal@kernel.org/

> Thanks,
> Amir.
> 

* Re: [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags
  2025-07-29  5:35               ` Darrick J. Wong
@ 2025-07-29  7:50                 ` Amir Goldstein
  2025-07-29 14:22                   ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Amir Goldstein @ 2025-07-29  7:50 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bernd Schubert, John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, bernd@bsbernd.com, neal@gompa.dev,
	miklos@szeredi.hu

On Tue, Jul 29, 2025 at 7:35 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Jul 24, 2025 at 09:56:16PM +0200, Amir Goldstein wrote:
> > > > Also a bit surprising to see all your lowlevel work and then fuse high
> > > > level coming ;)
> > >
> > > Right now fuse2fs is a high level fuse server, so I hacked whatever I
> > > needed into fuse.c to make it sort of work, awkwardly.  That stuff
> > > doesn't need to live forever.
> > >
> > > In the long run, the lowlevel server will probably have better
> > > performance because fuse2fs++ can pass ext2 inode numbers to the kernel
> > > as the nodeids, and libext2fs can look up inodes via nodeid.  No more
> > > path construction overhead!
> > >
> >
> > I was wondering how well an LLM would do at the mechanical task of
> > converting fuse2fs to a low level fuse fs, so I was tempted to try.
> >
> > Feel free to use it or lose it or use as a reference, because at least
> > for basic testing it seems to work:
> > https://github.com/amir73il/e2fsprogs/commits/fuse4fs/
>
> Heh, I'll take a closer look in the morning, but it looks like a
> reasonable conversion.  Are you willing to add a "Co-developed-by" tag
> per Sasha's recent proposal[1] if I pull it in?
>
> [1] https://lore.kernel.org/lkml/20250727195802.2222764-1-sashal@kernel.org/
>

Sure. Added and pushed.

FYI, some behind the scenes for the interested:
- The commit titles roughly align to the LLM prompts that I used
- One liner commit message "LLM aided conversion" means it's mostly hands off
- Anything other than the one liner commit message suggests human intervention,
  that was usually done to make the code more human friendly, the patches
  diffstat smaller and frankly, to match my human preferences
- I did not let the agent touch git at all and I took care of applying
  fixes into respective patches manually when needed
- The code compiles, but obviously does not work mid series
- The most interesting part was the last commit of tests, when the agent
  was testing and fixing its own conversion. This comes with some nice
  observations about machine-human collaboration in this context, for example:
- The machine figured out the need to convert
  EXT2_ROOT_INO <=> FUSE_ROOT_INO by itself from self testing,
  created the conversion helpers and used them in lookup and some other
  methods
- Obviously, it would have figured out that the conversion helpers need to
  be used for all methods sooner or later during self testing, but its self
  reflecting cycles can be so long and tedious for an observation that
  looks so trivial, so a nudge from a human, "convert all methods", really helps
  speeding things up, at least with the agent/model/version that I used
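
As a rough idea of what such conversion helpers might look like (this is
not the code from the branch above): the sketch assumes the fuse root
nodeid is 1 and the ext2 root inode is 2, and that ext2 inode 1 is never
handed to the kernel, so the swap cannot collide.  The DEMO_* constants and
demo_* functions are made-up names.

```c
#include <assert.h>
#include <stdint.h>

/* Values assumed for the sketch: fuse root nodeid 1, ext2 root inode 2. */
#define DEMO_FUSE_ROOT_ID	1
#define DEMO_EXT2_ROOT_INO	2

/* Translate a nodeid received from the kernel into an ext2 inode number. */
static uint32_t demo_nodeid_to_ino(uint64_t nodeid)
{
	/* ext2 inode numbers are 32 bits wide. */
	assert(nodeid <= UINT32_MAX);

	if (nodeid == DEMO_FUSE_ROOT_ID)
		return DEMO_EXT2_ROOT_INO;
	return (uint32_t)nodeid;
}

/* Translate an ext2 inode number into the nodeid reported to the kernel. */
static uint64_t demo_ino_to_nodeid(uint32_t ino)
{
	if (ino == DEMO_EXT2_ROOT_INO)
		return DEMO_FUSE_ROOT_ID;
	return ino;
}
```

Every method that receives a nodeid or returns one would need to go
through helpers like these, which is why missing a single method only
shows up once that method is exercised on the root directory.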

I think that language/API conversion is one of the tasks where LLM
can contribute most to humans, as long as the work is meticulously
reviewed by a human that has good knowledge of both source and
target language/dialect.

Thanks,
Amir.

* Re: [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags
  2025-07-29  7:50                 ` Amir Goldstein
@ 2025-07-29 14:22                   ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-29 14:22 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Bernd Schubert, John@groves.net, joannelkoong@gmail.com,
	linux-fsdevel@vger.kernel.org, bernd@bsbernd.com, neal@gompa.dev,
	miklos@szeredi.hu

On Tue, Jul 29, 2025 at 09:50:30AM +0200, Amir Goldstein wrote:
> On Tue, Jul 29, 2025 at 7:35 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, Jul 24, 2025 at 09:56:16PM +0200, Amir Goldstein wrote:
> > > > > Also a bit surprising to see all your lowlevel work and then fuse high
> > > > > level coming ;)
> > > >
> > > > Right now fuse2fs is a high level fuse server, so I hacked whatever I
> > > > needed into fuse.c to make it sort of work, awkwardly.  That stuff
> > > > doesn't need to live forever.
> > > >
> > > > In the long run, the lowlevel server will probably have better
> > > > performance because fuse2fs++ can pass ext2 inode numbers to the kernel
> > > > as the nodeids, and libext2fs can look up inodes via nodeid.  No more
> > > > path construction overhead!
> > > >
> > >
> > > I was wondering how well an LLM would do at the mechanical task of
> > > converting fuse2fs to a low level fuse fs, so I was tempted to try.
> > >
> > > Feel free to use it or lose it or use as a reference, because at least
> > > for basic testing it seems to work:
> > > https://github.com/amir73il/e2fsprogs/commits/fuse4fs/
> >
> > Heh, I'll take a closer look in the morning, but it looks like a
> > reasonable conversion.  Are you willing to add a "Co-developed-by" tag
> > per Sasha's recent proposal[1] if I pull it in?
> >
> > [1] https://lore.kernel.org/lkml/20250727195802.2222764-1-sashal@kernel.org/
> >
> 
> Sure. Added and pushed.
> 
> FYI, some behind the scenes for the interested:
> - The commit titles roughly align to the LLM prompts that I used

Heh.  For reproducibility, I wonder if it would be a good idea for
the commit messages to contain the prompts fed to the LLM?  Maybe I'll
suggest that to Sasha.

> - One liner commit message "LLM aided conversion" means it's mostly hands off
> - Anything other than the one liner commit message suggests human intervention,
>   that was usually done to make the code more human friendly, the patches
>   diffstat smaller and frankly, to match my human preferences

Oh, I was hoping you'd say that you reprompted all the way to working
patches, but I suppose AIs are rather expensive to operate.

> - I did not let the agent touch git at all and I took care of applying
>   fixes into respective patches manually when needed
> - The code compiles, but obviously does not work mid series

ha lol :)

> - The most interesting part was the last commit of tests, when the agent
>   was testing and fixing its own conversion. This comes with some nice
>   observations about machine-human collaboration in this context, for example:
> - The machine figured out the need to convert
>   EXT2_ROOT_INO <=> FUSE_ROOT_INO by itself from self testing,
>   created the conversion helpers and used them in lookup and some other
>   methods

<nod> I think Miklos mentioned that I could work around that by allowing
fuse servers to set the root nodeid with a mount option.

> - Obviously, it would have figured out that the conversion helpers need to
>   be used for all methods sooner or later during self testing, but its self
>   reflecting cycles can be so long and tedious for an observation that
>   looks so trivial, so a nudge from a human, "convert all methods", really helps
>   speeding things up, at least with the agent/model/version that I used

Well we could just do the usual "make main exit(1) for the duration of
the chur^Wchanges" trick to avoid bisection bombs. :)

> I think that language/API conversion is one of the tasks where LLM
> can contribute most to humans, as long as the work is meticulously
> reviewed by a human that has good knowledge of both source and
> target language/dialect.

Yeah.  Though first I need to lay the groundwork by figuring out if
macfuse/freebsd fuse actually provide the lowlevel library.  If not,
then per Ted's direction I'll have to implement both. :/

Maybe I'll try the Oracle codebot this week, though I think they said it
only knows Python.  Anyway, thanks for the inputs. :)

--D

> Thanks,
> Amir.
> 

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-23 16:24               ` Jeff Layton
@ 2025-07-31  9:45                 ` Christian Brauner
  2025-07-31 17:52                   ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Christian Brauner @ 2025-07-31  9:45 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Darrick J. Wong, Joanne Koong, linux-fsdevel, neal, John, miklos,
	bernd

> > (That said, my opinion is that after years of all of us telling
> > programmers that fsync is the golden standard for checking if bad stuff
> > happened, we really ought only be clearing error state during fsync.)
> > 
> 
> That is pretty doable. The only question is whether it's something we
> *want* to do. Something like this would probably be enough if so:
> 
> diff --git a/fs/open.c b/fs/open.c
> index 7828234a7caa..a20657a85ee1 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1582,6 +1582,10 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
>  
>         retval = filp_flush(file, current->files);
>  
> +       /* Do an opportunistic writeback error check before returning. */
> +       if (likely(retval == 0))
> +               retval = filemap_check_wb_err(file_inode(file)->i_mapping, file->f_wb_err);

I think that's a bad idea. 90% of the code will not check close for
any errors so they'll never see any of this anyway. 1% will be the very
interested users that may care about it. 9% will be tests that suddenly
start failing because they assert on close(fd), I'm pretty sure.

So I don't think this provides a lot of value. At least I can't see it yet.

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-23 18:04         ` Darrick J. Wong
@ 2025-07-31 10:13           ` Christian Brauner
  2025-07-31 17:22             ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Christian Brauner @ 2025-07-31 10:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > >
> > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > DO NOT MERGE THIS, STILL!
> > > > > >
> > > > > > This is the third request for comments of a prototype to connect the
> > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > files whose contents persist to locally attached storage devices.
> > > > > >
> > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > server process.
> > > > > >
> > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > core code.  Eeeugh.
> > > > > >
> > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > works.
> > > > > >
> > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > overhead.  And that's with debugging turned on!
> > > > > >
> > > > > > These items have been addressed since the first RFC:
> > > > > >
> > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > unwritten and delalloc mappings.
> > > > > >
> > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > >
> > > > > > 3. iomap supports inline data.
> > > > > >
> > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > fuse server can set fuse_attr::flags.
> > > > > >
> > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > >
> > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > is enabled.
> > > > > >
> > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > >
> > > > > > There are some major warts remaining:
> > > > > >
> > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > actually works correctly.
> > > > > >
> > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > notifications work properly.  This is related to #3 below:
> > > > > >
> > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > >
> > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > libext2fs except for inline data.
> > > > > >
> > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > don't want filesystems to unmount abruptly.
> > > > > >
> > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > >
> > > > > I'm happy to help you here.
> > > > >
> > > > > First, I think using a character device for namespaced drivers is always
> > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > wrong (TM).
> > > > >
> > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > persistent storage imho. But anyway, that's beside the point.
> > > > >
> > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > /dev/fuse should really be openable by the service itself.
> > > 
> > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > Can you pass an fsopen fd to an unprivileged process and have that
> > > second process call fsmount?
> > 
> > Yes, but remember that at some point you must call
> > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > privileged process. All the rest can be done by the unprivileged process
> > though. That's exactly how bpf tokens work.
> 
> Hrm.  Assuming the fsopen mount sequence is still:
> 
> 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> 	...
> 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> 	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> 
> Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> for move_mount() to have the intended effect.

Yes-ish.

At fsopen() time the user namespace of the caller is recorded in
fs_context->user_ns. If the filesystem is mountable inside of a user
namespace then fs_context->user_ns will be used to perform the
CAP_SYS_ADMIN check.

For filesystems that aren't mountable inside of user namespaces (ext4,
xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
to mount a filesystem with a non-initial userns if it's not marked as
mountable. That used to be possible but it's an invitation for extremely
subtle bugs and you gain control over the superblock itself.

TL;DR the user namespace the superblock belongs to is usually determined
at fsopen() time.
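
To make the two cases concrete, here is a minimal userspace model of that
capability decision. It is only a sketch: the struct and function names
(model_fs_type, model_caller, model_mount_capable) are made up for
illustration and are not kernel types. The real check is the kernel's
mount_capable(), which works on the fs_context and the caller's credentials.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these types exist in the kernel. */
struct model_fs_type {
	const char *name;
	bool userns_mount;	/* stands in for the FS_USERNS_MOUNT flag */
};

struct model_caller {
	/* CAP_SYS_ADMIN in the initial user namespace */
	bool global_cap_sys_admin;
	/* CAP_SYS_ADMIN in the user namespace recorded at fsopen() time */
	bool ctx_cap_sys_admin;
};

/*
 * Filesystems without FS_USERNS_MOUNT ignore the recorded namespace and
 * require global CAP_SYS_ADMIN; the others check the recorded namespace.
 */
static bool model_mount_capable(const struct model_fs_type *type,
				const struct model_caller *caller)
{
	if (!type->userns_mount)
		return caller->global_cap_sys_admin;
	return caller->ctx_cap_sys_admin;
}
```

With this model, a contained caller that only holds CAP_SYS_ADMIN in its own
user namespace passes the check for a type like fuse (userns_mount set) and
fails it for a type like ext4 (userns_mount clear).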

> 
> Can two processes share the same fsopen fd?  If so then systemd-mountfsd

Yes, they can share and it's synchronized.

> could pass the fsopen fd to the fuse server (whilst retaining its own
> copy).  The fuse server could do its own mount option parsing, call

Yes, systemd-mountfsd already does passing like that.

> FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
> the create/fsmount/move_mount part.

Yes.
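
The handoff above depends on plain file descriptor passing. Below is a
minimal sketch of that one step, assuming the helper and the fuse server
talk over a connected AF_UNIX socket and use SCM_RIGHTS; the thread does not
say which transport systemd-mountfsd uses, and send_fd()/recv_fd() are
hypothetical helper names. The fd being passed could be an fsopen fd or an
open /dev/fuse fd.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket: 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;	/* forces correct alignment of buf */
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor: returns the new fd, or -1 on failure. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets its own descriptor referring to the same open file
description, so the sender can keep its copy, which matches the "whilst
retaining its own copy" arrangement discussed above.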

> 
> The systemd-mountfsd would have to be running in desired fs namespace
> and with sufficient privileges to open block devices, but I'm guessing
> that's already a requirement?

Yes, systemd-mountfsd is a system level service running in the initial
set of namespaces and interacting with systemd-nsresourced (namespace
related stuff). It can obviously also create helpers to setns() into
various namespaces if required.

> 
> > > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > > could pass open fds for /dev/fuse and fsopen, then the fuse server wouldn't

Yes, I would think so.

> > 
> > Yes, that would work.
> 
> Oh goody :)
> 
> > > need any special /dev access at all.  I think then the fuse server's
> > > service could have:
> > > 
> > > DynamicUser=true
> > > ProtectSystem=true
> > > ProtectHome=true
> > > PrivateTmp=true
> > > PrivateDevices=true
> > > DevicePolicy=strict
> > > 
> > > (I think most of those are redundant with DynamicUser=true but a lot of
> > > my systemd-fu is paged out ATM.)
> > > 
> > > My goal here is extreme containment -- the code doing the fs metadata
> > > parsing has no privileges, no write access except to the fds it was
> > > given, no network access, and no ability to read anything outside the
> > > root filesystem.  Then I can get back to writing buffer
> > > overflows^W^Whigh quality filesystem code in peace.
> > 
> > Yeah, sounds about right.
> > 
> > > 
> > > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > > whiteouts. That means you can do mknod() in the container to create
> > > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > > bat so that containers can only do this on their private tmpfs mount at
> > > > > /dev.)
> > > > >
> > > > > The downside of this would be to give unprivileged containers access to
> > > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > > change.
> > > 
> > > Yeah, that is a new risk.  It's still better than metadata parsing
> > > within the kernel address space ... though who knows how thoroughly fuse
> > > has been fuzzed by syzbot :P
> > > 
> > > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > > sure enough about it to spill it.
> > > 
> > > Please do share, #f is my crazy unbaked idea. :)
> > > 
> > > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > > a device driver.
> > > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > > 
> > > > > > protections because they tend to run in a private mount namespace with
> > > > > > various parts of the filesystem either hidden or readonly.
> > > > > >
> > > > > > In theory one could design a socket protocol to pass mount options,
> > > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > > a mount helper and a service:
> > > > >
> > > > > This isn't a problem really. This should just be an extension to
> > > > > systemd-mountfsd.
> > > 
> > > I suppose mount.safe could very well call systemd-mount to go do all the
> > > systemd-related service setup, and that would take care of udisks as
> > > well.
> > 
> > The ultimate goal is to teach mount(8)/libmount to use that daemon when
> > it's available. Because that would just make unprivileged mounting work
> > without userspace noticing anything.
> 
> That sounds really neat. :)
> 
> --D

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-31 10:13           ` Christian Brauner
@ 2025-07-31 17:22             ` Darrick J. Wong
  2025-08-04 10:12               ` Christian Brauner
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-31 17:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Thu, Jul 31, 2025 at 12:13:01PM +0200, Christian Brauner wrote:
> On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> > On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > >
> > > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > DO NOT MERGE THIS, STILL!
> > > > > > >
> > > > > > > This is the third request for comments of a prototype to connect the
> > > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > > files whose contents persist to locally attached storage devices.
> > > > > > >
> > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > server process.
> > > > > > >
> > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > core code.  Eeeugh.
> > > > > > >
> > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > > works.
> > > > > > >
> > > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > > overhead.  And that's with debugging turned on!
> > > > > > >
> > > > > > > These items have been addressed since the first RFC:
> > > > > > >
> > > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > > unwritten and delalloc mappings.
> > > > > > >
> > > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > > >
> > > > > > > 3. iomap supports inline data.
> > > > > > >
> > > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > > fuse server can set fuse_attr::flags.
> > > > > > >
> > > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > > >
> > > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > > is enabled.
> > > > > > >
> > > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > > >
> > > > > > > There are some major warts remaining:
> > > > > > >
> > > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > > actually works correctly.
> > > > > > >
> > > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > > notifications work properly.  This is related to #3 below:
> > > > > > >
> > > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > > >
> > > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > > libext2fs except for inline data.
> > > > > > >
> > > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > > don't want filesystems to unmount abruptly.
> > > > > > >
> > > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > > >
> > > > > > I'm happy to help you here.
> > > > > >
> > > > > > First, I think using a character device for namespaced drivers is always
> > > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > > wrong (TM).
> > > > > >
> > > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > > persistent storage imho. But anyway, that's beside the point.
> > > > > >
> > > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > > /dev/fuse should really be openable by the service itself.
> > > > 
> > > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > > Can you pass an fsopen fd to an unprivileged process and have that
> > > > second process call fsmount?
> > > 
> > > Yes, but remember that at some point you must call
> > > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > > privileged process. All the rest can be done by the unprivileged process
> > > though. That's exactly how bpf tokens work.
> > 
> > Hrm.  Assuming the fsopen mount sequence is still:
> > 
> > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > 	...
> > 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> > 	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > 
> > Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> > CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> > for move_mount() to have the intended effect.
> 
> Yes-ish.
> 
> At fsopen() time the user namespace of the caller is recorded in
> fs_context->user_ns. If the filesystem is mountable inside of a user
> namespace then fs_context->user_ns will be used to perform the
> CAP_SYS_ADMIN check.

Hrmm, well fuse is one of the filesystems that sets FS_USERNS_MOUNT, so
I gather that means that the fuse service server (ugh) could invoke the
mount using the fsopen fd given to it?  That sounds promising.

> For filesystems that aren't mountable inside of user namespaces (ext4,
> xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
> global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
> to mount a filesystem with a non-initial userns if it's not marked as
> mountable. That used to be possible but it's an invitation for extremely
> subtle bugs and you gain control over the superblock itself.

I guess that's commit e1c5ae59c0f22f ("fs: don't allow non-init
s_user_ns for filesystems without FS_USERNS_MOUNT")?  What does it mean
for a filesystem to be "...written with a non-initial s_user_ns in
mind"?  Is there something specific that I should look out for, aside
from the usual "we don't mount parking lot xfs because validating that
is too hard and it might explode the kernel"?

> TL;DR the user namespace the superblock belongs to is usually determined
> at fsopen() time.
> 
> > 
> > Can two processes share the same fsopen fd?  If so then systemd-mountfsd
> 
> Yes, they can share and it's synchronized.

> > could pass the fsopen fd to the fuse server (whilst retaining its own
> > copy).  The fuse server could do its own mount option parsing, call
> 
> Yes, systemd-mountfsd already does passing like that.

Oh!

> > FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
> > the create/fsmount/move_mount part.
> 
> Yes.

If the fsopen fd tracks the userns of whoever initiated the mount
attempt, then maybe the fuse server can do that part too?  I guess the
weird part would be that the fuse server would effectively be passing a
path from the caller's ns, despite not having access to that ns.

> > The systemd-mountfsd would have to be running in desired fs namespace
> > and with sufficient privileges to open block devices, but I'm guessing
> > that's already a requirement?
> 
> Yes, systemd-mountfsd is a system level service running in the initial
> set of namespaces and interacting with systemd-nsresourced (namespace
> related stuff). It can obviously also create helpers to setns() into
> various namespaces if required.

<nod> I think I saw something else from you about a file descriptor
store, so I'll go look there next.

--D

> > 
> > > > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > > > could pass open fds for /dev/fuse and fsopen, then the fuse server wouldn't
> 
> Yes, I would think so.
> 
> > > 
> > > Yes, that would work.
> > 
> > Oh goody :)
> > 
> > > > need any special /dev access at all.  I think then the fuse server's
> > > > service could have:
> > > > 
> > > > DynamicUser=true
> > > > ProtectSystem=true
> > > > ProtectHome=true
> > > > PrivateTmp=true
> > > > PrivateDevices=true
> > > > DevicePolicy=strict
> > > > 
> > > > (I think most of those are redundant with DynamicUser=true but a lot of
> > > > my systemd-fu is paged out ATM.)
> > > > 
> > > > My goal here is extreme containment -- the code doing the fs metadata
> > > > parsing has no privileges, no write access except to the fds it was
> > > > given, no network access, and no ability to read anything outside the
> > > > root filesystem.  Then I can get back to writing buffer
> > > > overflows^W^Whigh quality filesystem code in peace.
> > > 
> > > Yeah, sounds about right.
> > > 
> > > > 
> > > > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > > > whiteouts. That means you can do mknod() in the container to create
> > > > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > > > bat so that containers can only do this on their private tmpfs mount at
> > > > > > /dev.)
> > > > > >
> > > > > > The downside of this would be to give unprivileged containers access to
> > > > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > > > change.
> > > > 
> > > > Yeah, that is a new risk.  It's still better than metadata parsing
> > > > within the kernel address space ... though who knows how thoroughly fuse
> > > > has been fuzzed by syzbot :P
> > > > 
> > > > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > > > sure enough about it to spill it.
> > > > 
> > > > Please do share, #f is my crazy unbaked idea. :)
> > > > 
> > > > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > > > a device driver.
> > > > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > > > 
> > > > > > > protections because they tend to run in a private mount namespace with
> > > > > > > various parts of the filesystem either hidden or readonly.
> > > > > > >
> > > > > > > In theory one could design a socket protocol to pass mount options,
> > > > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > > > a mount helper and a service:
> > > > > >
> > > > > > This isn't a problem really. This should just be an extension to
> > > > > > systemd-mountfsd.
> > > > 
> > > > I suppose mount.safe could very well call systemd-mount to go do all the
> > > > systemd-related service setup, and that would take care of udisks as
> > > > well.
> > > 
> > > The ultimate goal is to teach mount(8)/libmount to use that daemon when
> > > it's available. Because that would just make unprivileged mounting work
> > > without userspace noticing anything.
> > 
> > That sounds really neat. :)
> > 
> > --D

* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
  2025-07-31  9:45                 ` Christian Brauner
@ 2025-07-31 17:52                   ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-07-31 17:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jeff Layton, Joanne Koong, linux-fsdevel, neal, John, miklos,
	bernd

On Thu, Jul 31, 2025 at 11:45:37AM +0200, Christian Brauner wrote:
> > > (That said, my opinion is that after years of all of us telling
> > > programmers that fsync is the gold standard for checking if bad stuff
> > > happened, we really ought only be clearing error state during fsync.)
> > > 
> > 
> > That is pretty doable. The only question is whether it's something we
> > *want* to do. Something like this would probably be enough if so:
> > 
> > diff --git a/fs/open.c b/fs/open.c
> > index 7828234a7caa..a20657a85ee1 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -1582,6 +1582,10 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
> >  
> >         retval = filp_flush(file, current->files);
> >  
> > +       /* Do an opportunistic writeback error check before returning. */
> > +       if (likely(retval == 0))
> > +               retval = filemap_check_wb_err(file_inode(file)->i_mapping, file->f_wb_err);
> 
> I think that's a bad idea. 90% of the code will not check close for
> any errors so they'll never see any of this anyway. 1% will be the very
> interested users that may care about it. 9% will be tests that suddenly
> start failing because they assert on close(fd), I'm pretty sure.
> 
> So I don't think this provides a lot of value. At least I can't see it yet.

Yeah, I think I've changed my mind: it's sensible to say that if @fd was
removed from the file descriptor table then close() returns 0 no matter
what else happened to the file.
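
For callers, the practical upshot is to collect IO errors from fsync() and
not from close(). A minimal sketch of that pattern follows; the helper name
write_and_sync() is hypothetical, and it only illustrates the userspace
side, not any change to fs/open.c.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write buf to path and report write or writeback
 * errors via fsync().  Returns 0 on success or a negative errno.
 */
static int write_and_sync(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd, ret = 0;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -errno;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			ret = -errno;
			break;
		}
		p += n;
		len -= n;
	}

	/* fsync() is the call whose return value we trust for IO errors. */
	if (ret == 0 && fsync(fd) < 0)
		ret = -errno;

	/*
	 * The descriptor is released here; the return value of close() is
	 * deliberately not used to detect writeback errors.
	 */
	close(fd);
	return ret;
}
```

A caller would treat a nonzero return as "the data may not be durable" and
never needs to look at what close() returned.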

--D

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-07-31 17:22             ` Darrick J. Wong
@ 2025-08-04 10:12               ` Christian Brauner
  2025-08-12 20:20                 ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Christian Brauner @ 2025-08-04 10:12 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Thu, Jul 31, 2025 at 10:22:06AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 31, 2025 at 12:13:01PM +0200, Christian Brauner wrote:
> > On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> > > On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > > > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > >
> > > > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > DO NOT MERGE THIS, STILL!
> > > > > > > >
> > > > > > > > This is the third request for comments of a prototype to connect the
> > > > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > > > files whose contents persist to locally attached storage devices.
> > > > > > > >
> > > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > > server process.
> > > > > > > >
> > > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > > core code.  Eeeugh.
> > > > > > > >
> > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > > > works.
> > > > > > > >
> > > > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > > > overhead.  And that's with debugging turned on!
> > > > > > > >
> > > > > > > > These items have been addressed since the first RFC:
> > > > > > > >
> > > > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > > > unwritten and delalloc mappings.
> > > > > > > >
> > > > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > > > >
> > > > > > > > 3. iomap supports inline data.
> > > > > > > >
> > > > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > > > fuse server can set fuse_attr::flags.
> > > > > > > >
> > > > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > > > >
> > > > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > > > is enabled.
> > > > > > > >
> > > > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > > > >
> > > > > > > > There are some major warts remaining:
> > > > > > > >
> > > > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > > > actually works correctly.
> > > > > > > >
> > > > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > > > notifications work properly.  This is related to #3 below:
> > > > > > > >
> > > > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > > > >
> > > > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > > > libext2fs except for inline data.
> > > > > > > >
> > > > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > > > don't want filesystems to unmount abruptly.
> > > > > > > >
> > > > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > > > >
> > > > > > > I'm happy to help you here.
> > > > > > >
> > > > > > > First, I think using a character device for namespaced drivers is always
> > > > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > > > wrong (TM).
> > > > > > >
> > > > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > > > persistent storage imho. But anyway, that's besides the point.
> > > > > > >
> > > > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > > > /dev/fuse should really be openable by the service itself.
> > > > > 
> > > > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > > > Can you pass an fsopen fd to an unprivileged process and have that
> > > > > second process call fsmount?
> > > > 
> > > > Yes, but remember that at some point you must call
> > > > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > > > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > > > privileged process. All the rest can be done by the unprivileged process
> > > > though. That's exactly how bpf tokens work.
> > > 
> > > Hrm.  Assuming the fsopen mount sequence is still:
> > > 
> > > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > > 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > > 	...
> > > 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > > 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> > > 	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > > 
> > > Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> > > CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> > > for move_mount() to have the intended effect.
> > 
> > Yes-ish.
> > 
> > At fsopen() time the user namespace of the caller is recorded in
> > fs_context->user_ns. If the filesystem is mountable inside of a user
> > namespace then fs_context->user_ns will be used to perform the
> > CAP_SYS_ADMIN check.
> 
> Hrmm, well fuse is one of the filesystems that sets FS_USERNS_MOUNT, so
> I gather that means that the fuse service server (ugh) could invoke the
> mount using the fsopen fd given to it?  That sounds promising.

Yes, it could, provided fsopen() was called in a user namespace that the
service holds privileges over.

> 
> > For filesystems that aren't mountable inside of user namespaces (ext4,
> > xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
> > global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
> > to mount a filesystem with a non-initial userns if it's not marked as
> > mountable. That used to be possible but it's an invitation for extremely
> > subtle bugs and you gain control over the superblock itself.
> 
> I guess that's commit e1c5ae59c0f22f ("fs: don't allow non-init
> s_user_ns for filesystems without FS_USERNS_MOUNT")?  What does it mean
> for a filesystem to be "...written with a non-initial s_user_ns in
> mind"?  Is there something specific that I should look out for, aside
> from the usual "we don't mount parking lot xfs because validating that
> is too hard and it might explode the kernel"?

So there are two sides on how to view this:

(1) The filesystem is mountable   in a user namespace.
(2) The filesystem is delegatable to a user namespace.

These are two different things. Allowing (1) is difficult because of the
usual complexities involved even though everyone always seems to believe
that their block-based filesystem is reliable enough to be mounted with
any corrupted image.

But (2) is something that's doable and in fact something we do allow
currently for e.g., bpffs. In order to allow containers to use bpf the
container must have a bpffs instance mounted.

To do this fsopen() must be called in the container's user namespace. To
allow specific bpf features and to actually create the superblock,
CAP_SYS_ADMIN or CAP_BPF in the initial user namespace is required.
Then a new bpffs instance will be created that is owned by the user
namespace of the container.

IOW, to delegate a superblock/filesystem to an unprivileged container,
capabilities are still required, but ultimately the filesystem will be
owned by the container.

One story I always found worth exploring to get at (1) is if we had
dm-verity directly integrated into the filesystem. And I don't mean
fsverity, I mean dm-verity and in a way such that it's explicitly not
part of the on-disk image in contrast to fsverity where each filesystem
integrates this very differently into their on-disk format. It basically
would be as dumb as it gets. Static, simple arithmetic, appended,
pre-pended, whatever.

> 
> > TL;DR the user namespace the superblock belongs to is usually determined
> > at fsopen() time.
> > 
> > > 
> > > Can two processes share the same fsopen fd?  If so then systemd-mountfsd
> > 
> > Yes, they can share and it's synchronized.
> 
> > > could pass the fsopen fd to the fuse server (whilst retaining its own
> > > copy).  The fuse server could do its own mount option parsing, call
> > 
> > Yes, systemd-mountfsd already does passing like that.
> 
> Oh!
> 
> > > FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
> > > the create/fsmount/move_mount part.
> > 
> > Yes.
> 
> If the fsopen fd tracks the userns of whoever initiated the mount
> attempt, then maybe the fuse server can do that part too?  I guess the
> weird part would be that the fuse server would effectively be passing a
> path from the caller's ns, despite not having access to that ns.

Remind me why the FUSE server would want to track the userns?

> 
> > > The systemd-mountfsd would have to be running in desired fs namespace
> > > and with sufficient privileges to open block devices, but I'm guessing
> > > that's already a requirement?
> > 
> > Yes, systemd-mountfsd is a system level service running in the initial
> > set of namespaces and interacting with systemd-nsresourced (namespace
> > related stuff). It can obviously also create helper to setns() into
> > various namespaces if required. 
> 
> <nod> I think I saw something else from you about a file descriptor
> store, so I'll go look there next.
> 
> --D
> 
> > > 
> > > > > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > > > > could pass open fds for /dev/fuse fsopen then the fuse server wouldn't
> > 
> > Yes, I would think so.
> > 
> > > > 
> > > > Yes, that would work.
> > > 
> > > Oh goody :)
> > > 
> > > > > need any special /dev access at all.  I think then the fuse server's
> > > > > service could have:
> > > > > 
> > > > > DynamicUser=true
> > > > > ProtectSystem=true
> > > > > ProtectHome=true
> > > > > PrivateTmp=true
> > > > > PrivateDevices=true
> > > > > DevicePolicy=strict
> > > > > 
> > > > > (I think most of those are redundant with DynamicUser=true but a lot of
> > > > > my systemd-fu is paged out ATM.)
> > > > > 
> > > > > My goal here is extreme containment -- the code doing the fs metadata
> > > > > parsing has no privileges, no write access except to the fds it was
> > > > > given, no network access, and no ability to read anything outside the
> > > > > root filesystem.  Then I can get back to writing buffer
> > > > > overflows^W^Whigh quality filesystem code in peace.
> > > > 
> > > > Yeah, sounds about right.
> > > > 
> > > > > 
> > > > > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > > > > whiteouts. That means you can do mknod() in the container to create
> > > > > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > > > > bat so that containers can only do this on their private tmpfs mount at
> > > > > > > /dev.)
> > > > > > >
> > > > > > > The downside of this would be to give unprivileged containers access to
> > > > > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > > > > change.
> > > > > 
> > > > > Yeah, that is a new risk.  It's still better than metadata parsing
> > > > > within the kernel address space ... though who knows how thoroughly fuse
> > > > > has been fuzzed by syzbot :P
> > > > > 
> > > > > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > > > > sure enough about it to spill it.
> > > > > 
> > > > > Please do share, #f is my crazy unbaked idea. :)
> > > > > 
> > > > > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > > > > a device driver.
> > > > > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > > > > 
> > > > > > > > protections because they tend to run in a private mount namespace with
> > > > > > > > various parts of the filesystem either hidden or readonly.
> > > > > > > >
> > > > > > > > In theory one could design a socket protocol to pass mount options,
> > > > > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > > > > a mount helper and a service:
> > > > > > >
> > > > > > > This isn't a problem really. This should just be an extension to
> > > > > > > systemd-mountfsd.
> > > > > 
> > > > > I suppose mount.safe could very well call systemd-mount to go do all the
> > > > > systemd-related service setup, and that would take care of udisks as
> > > > > well.
> > > > 
> > > > The ultimate goal is to teach mount(8)/libmount to use that daemon when
> > > > it's available. Because that would just make unprivileged mounting work
> > > > without userspace noticing anything.
> > > 
> > > That sounds really neat. :)
> > > 
> > > --D


* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-08-04 10:12               ` Christian Brauner
@ 2025-08-12 20:20                 ` Darrick J. Wong
  2025-08-15 14:20                   ` Christian Brauner
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-08-12 20:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Mon, Aug 04, 2025 at 12:12:24PM +0200, Christian Brauner wrote:
> On Thu, Jul 31, 2025 at 10:22:06AM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 31, 2025 at 12:13:01PM +0200, Christian Brauner wrote:
> > > On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > > > > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > > > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > DO NOT MERGE THIS, STILL!
> > > > > > > > >
> > > > > > > > > This is the third request for comments of a prototype to connect the
> > > > > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > > > > files whose contents persist to locally attached storage devices.
> > > > > > > > >
> > > > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > > > server process.
> > > > > > > > >
> > > > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > > > core code.  Eeeugh.
> > > > > > > > >
> > > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > > > > works.
> > > > > > > > >
> > > > > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > > > > overhead.  And that's with debugging turned on!
> > > > > > > > >
> > > > > > > > > These items have been addressed since the first RFC:
> > > > > > > > >
> > > > > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > > > > unwritten and delalloc mappings.
> > > > > > > > >
> > > > > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > > > > >
> > > > > > > > > 3. iomap supports inline data.
> > > > > > > > >
> > > > > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > > > > fuse server can set fuse_attr::flags.
> > > > > > > > >
> > > > > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > > > > >
> > > > > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > > > > is enabled.
> > > > > > > > >
> > > > > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > > > > >
> > > > > > > > > There are some major warts remaining:
> > > > > > > > >
> > > > > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > > > > actually works correctly.
> > > > > > > > >
> > > > > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > > > > notifications work properly.  This is related to #3 below:
> > > > > > > > >
> > > > > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > > > > >
> > > > > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > > > > libext2fs except for inline data.
> > > > > > > > >
> > > > > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > > > > don't want filesystems to unmount abruptly.
> > > > > > > > >
> > > > > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > > > > >
> > > > > > > > I'm happy to help you here.
> > > > > > > >
> > > > > > > > First, I think using a character device for namespaced drivers is always
> > > > > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > > > > wrong (TM).
> > > > > > > >
> > > > > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > > > > persistent storage imho. But anyway, that's besides the point.
> > > > > > > >
> > > > > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > > > > /dev/fuse should really be openable by the service itself.
> > > > > > 
> > > > > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > > > > Can you pass an fsopen fd to an unprivileged process and have that
> > > > > > second process call fsmount?
> > > > > 
> > > > > Yes, but remember that at some point you must call
> > > > > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > > > > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > > > > privileged process. All the rest can be done by the unprivileged process
> > > > > though. That's exactly how bpf tokens work.
> > > > 
> > > > Hrm.  Assuming the fsopen mount sequence is still:
> > > > 
> > > > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > > > 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > > > 	...
> > > > 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > > > 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> > > > 	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > > > 
> > > > Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> > > > CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> > > > for move_mount() to have the intended effect.
> > > 
> > > Yes-ish.
> > > 
> > > At fsopen() time the user namespace of the caller is recorded in
> > > fs_context->user_ns. If the filesystem is mountable inside of a user
> > > namespace then fs_context->user_ns will be used to perform the
> > > CAP_SYS_ADMIN check.
> > 
> > Hrmm, well fuse is one of the filesystems that sets FS_USERNS_MOUNT, so
> > I gather that means that the fuse service server (ugh) could invoke the
> > mount using the fsopen fd given to it?  That sounds promising.
> 
> Yes, it could, provided fsopen() was called in a user namespace that the
> service holds privileges over.
> 
> > 
> > > For filesystems that aren't mountable inside of user namespaces (ext4,
> > > xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
> > > global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
> > > to mount a filesystem with a non-initial userns if it's not marked as
> > > mountable. That used to be possible but it's an invitation for extremely
> > > subtle bugs and you gain control over the superblock itself.
> > 
> > I guess that's commit e1c5ae59c0f22f ("fs: don't allow non-init
> > s_user_ns for filesystems without FS_USERNS_MOUNT")?  What does it mean
> > for a filesystem to be "...written with a non-initial s_user_ns in
> > mind"?  Is there something specific that I should look out for, aside
> > from the usual "we don't mount parking lot xfs because validating that
> > is too hard and it might explode the kernel"?
> 
> So there are two sides on how to view this:
> 
> (1) The filesystem is mountable   in a user namespace.
> (2) The filesystem is delegatable to a user namespace.
> 
> These are two different things. Allowing (1) is difficult because of the
> usual complexities involved even though everyone always seems to believe
> that their block-based filesystem is reliable enough to be mounted with
> any corrupted image.
> 
> But (2) is something that's doable and in fact something we do allow
> currently for e.g., bpffs. In order to allow containers to use bpf the
> container must have a bpffs instance mounted.
> 
> To do this fsopen() must be called in the container's user namespace. To
> allow specific bpf features and to actually create the superblock,
> CAP_SYS_ADMIN or CAP_BPF in the initial user namespace is required.
> Then a new bpffs instance will be created that is owned by the user
> namespace of the container.
> 
> IOW, to delegate a superblock/filesystem to an unprivileged container,
> capabilities are still required, but ultimately the filesystem will be
> owned by the container.

<nod>

> One story I always found worth exploring to get at (1) is if we had
> dm-verity directly integrated into the filesystem. And I don't mean
> fsverity, I mean dm-verity and in a way such that it's explicitly not
> part of the on-disk image in contrast to fsverity where each filesystem
> integrates this very differently into their on-disk format. It basically
> would be as dumb as it gets. Static, simple arithmetic, appended,
> pre-pended, whatever.

That would work as long as you don't need to write to the filesystem,
ever.  For gold master rootfs that would work fine, less so for "my
container needs a writable data partition but the bofh doesn't want us
compromising kernel memory".

> > 
> > > TL;DR the user namespace the superblock belongs to is usually determined
> > > at fsopen() time.
> > > 
> > > > 
> > > > Can two processes share the same fsopen fd?  If so then systemd-mountfsd
> > > 
> > > Yes, they can share and it's synchronized.
> > 
> > > > could pass the fsopen fd to the fuse server (whilst retaining its own
> > > > copy).  The fuse server could do its own mount option parsing, call
> > > 
> > > Yes, systemd-mountfsd already does passing like that.
> > 
> > Oh!
> > 
> > > > FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
> > > > the create/fsmount/move_mount part.
> > > 
> > > Yes.
> > 
> > If the fsopen fd tracks the userns of whoever initiated the mount
> > attempt, then maybe the fuse server can do that part too?  I guess the
> > weird part would be that the fuse server would effectively be passing a
> > path from the caller's ns, despite not having access to that ns.
> 
> Remind me why the FUSE server would want to track the userns?

My wording there might have been confusing -- what I meant is:

1. The fsopen fd tracks the userns of the program that called fsopen.
2. The program from #1 passes the fsopen fd to a fuse server that's
   running in a much more constrained environment (separate systemd
   scope, no privileges at all, resources)
3. The fuse server calls fsmount on the fsopen fd passed to it by #1.

But I also haven't tried *building* any of these pieces, so this is
entirely speculative nonsense on my part. :)
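
For what it's worth, step 2 would presumably be an SCM_RIGHTS handoff
over an AF_UNIX socket, which works for any open fd.  A minimal sketch,
with send_fd() and recv_fd() as hypothetical helper names; nothing here
is specific to fsopen fds or to how systemd-mountfsd actually does it:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open fd across a connected AF_UNIX socket.  Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	/* One byte of ordinary payload carries the ancillary data. */
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd; returns the new descriptor number or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The sender keeps its own copy of the fd, so both processes end up
referring to the same open file description.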

> > > > The systemd-mountfsd would have to be running in desired fs namespace
> > > > and with sufficient privileges to open block devices, but I'm guessing
> > > > that's already a requirement?
> > > 
> > > Yes, systemd-mountfsd is a system level service running in the initial
> > > set of namespaces and interacting with systemd-nsresourced (namespace
> > > related stuff). It can obviously also create helper to setns() into
> > > various namespaces if required. 
> > 
> > <nod> I think I saw something else from you about a file descriptor
> > store, so I'll go look there next.
> > 
> > --D
> > 
> > > > 
> > > > > > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > > > > > could pass open fds for /dev/fuse fsopen then the fuse server wouldn't
> > > 
> > > Yes, I would think so.
> > > 
> > > > > 
> > > > > Yes, that would work.
> > > > 
> > > > Oh goody :)
> > > > 
> > > > > > need any special /dev access at all.  I think then the fuse server's
> > > > > > service could have:
> > > > > > 
> > > > > > DynamicUser=true
> > > > > > ProtectSystem=true
> > > > > > ProtectHome=true
> > > > > > PrivateTmp=true
> > > > > > PrivateDevices=true
> > > > > > DevicePolicy=strict
> > > > > > 
> > > > > > (I think most of those are redundant with DynamicUser=true but a lot of
> > > > > > my systemd-fu is paged out ATM.)
> > > > > > 
> > > > > > My goal here is extreme containment -- the code doing the fs metadata
> > > > > > parsing has no privileges, no write access except to the fds it was
> > > > > > given, no network access, and no ability to read anything outside the
> > > > > > root filesystem.  Then I can get back to writing buffer
> > > > > > overflows^W^Whigh quality filesystem code in peace.
> > > > > 
> > > > > Yeah, sounds about right.
> > > > > 
> > > > > > 
> > > > > > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > > > > > whiteouts. That means you can do mknod() in the container to create
> > > > > > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > > > > > bat so that containers can only do this on their private tmpfs mount at
> > > > > > > > /dev.)
> > > > > > > >
> > > > > > > > The downside of this would be to give unprivileged containers access to
> > > > > > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > > > > > change.
> > > > > > 
> > > > > > Yeah, that is a new risk.  It's still better than metadata parsing
> > > > > > within the kernel address space ... though who knows how thoroughly fuse
> > > > > > has been fuzzed by syzbot :P
> > > > > > 
> > > > > > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > > > > > sure enough about it to spill it.
> > > > > > 
> > > > > > Please do share, #f is my crazy unbaked idea. :)
> > > > > > 
> > > > > > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > > > > > a device driver.
> > > > > > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > > > > > 
> > > > > > > > > protections because they tend to run in a private mount namespace with
> > > > > > > > > various parts of the filesystem either hidden or readonly.
> > > > > > > > >
> > > > > > > > > In theory one could design a socket protocol to pass mount options,
> > > > > > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > > > > > a mount helper and a service:
> > > > > > > >
> > > > > > > > This isn't a problem really. This should just be an extension to
> > > > > > > > systemd-mountfsd.
> > > > > > 
> > > > > > I suppose mount.safe could very well call systemd-mount to go do all the
> > > > > > systemd-related service setup, and that would take care of udisks as
> > > > > > well.
> > > > > 
> > > > > The ultimate goal is to teach mount(8)/libmount to use that daemon when
> > > > > it's available. Because that would just make unprivileged mounting work
> > > > > without userspace noticing anything.
> > > > 
> > > > That sounds really neat. :)
> > > > 
> > > > --D
> 

^ permalink raw reply	[flat|nested] 174+ messages in thread

* Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
  2025-08-12 20:20                 ` Darrick J. Wong
@ 2025-08-15 14:20                   ` Christian Brauner
  0 siblings, 0 replies; 174+ messages in thread
From: Christian Brauner @ 2025-08-15 14:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, John, bernd, miklos, joannelkoong,
	Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa

On Tue, Aug 12, 2025 at 01:20:25PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 04, 2025 at 12:12:24PM +0200, Christian Brauner wrote:
> > On Thu, Jul 31, 2025 at 10:22:06AM -0700, Darrick J. Wong wrote:
> > > On Thu, Jul 31, 2025 at 12:13:01PM +0200, Christian Brauner wrote:
> > > > On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > > > > > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > > > > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > > > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > DO NOT MERGE THIS, STILL!
> > > > > > > > > >
> > > > > > > > > > This is the third request for comments of a prototype to connect the
> > > > > > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > > > > > files whose contents persist to locally attached storage devices.
> > > > > > > > > >
> > > > > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > > > > server process.
> > > > > > > > > >
> > > > > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > > > > core code.  Eeeugh.
> > > > > > > > > >
> > > > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > > > > > works.
> > > > > > > > > >
> > > > > > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > > > > > of the kernel ext4 driver's streaming directio performance on streaming
> > > > > > > > > > IO, and 110% of its streaming buffered IO performance.  Random buffered
> > > > > > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > > > > > overhead.  And that's with debugging turned on!
> > > > > > > > > >
> > > > > > > > > > These items have been addressed since the first RFC:
> > > > > > > > > >
> > > > > > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > > > > > unwritten and delalloc mappings.
> > > > > > > > > >
> > > > > > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > > > > > >
> > > > > > > > > > 3. iomap supports inline data.
> > > > > > > > > >
> > > > > > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > > > > > fuse server can set fuse_attr::flags.
> > > > > > > > > >
> > > > > > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > > > > > >
> > > > > > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > > > > > is enabled.
> > > > > > > > > >
> > > > > > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > > > > > >
> > > > > > > > > > There are some major warts remaining:
> > > > > > > > > >
> > > > > > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > > > > > actually works correctly.
> > > > > > > > > >
> > > > > > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > > > > > > notifications work properly.  This is related to #c below:
> > > > > > > > > >
> > > > > > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > > > > > >
> > > > > > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > > > > > libext2fs except for inline data.
> > > > > > > > > >
> > > > > > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > > > > > don't want filesystems to unmount abruptly.
> > > > > > > > > >
> > > > > > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > > > > > >
> > > > > > > > > I'm happy to help you here.
> > > > > > > > >
> > > > > > > > > First, I think using a character device for namespaced drivers is always
> > > > > > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > > > > > wrong (TM).
> > > > > > > > >
> > > > > > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > > > > > > persistent storage imho. But anyway, that's beside the point.
> > > > > > > > >
> > > > > > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > > > > > /dev/fuse should really be openable by the service itself.
> > > > > > > 
> > > > > > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > > > > > Can you pass an fsopen fd to an unprivileged process and have that
> > > > > > > second process call fsmount?
> > > > > > 
> > > > > > Yes, but remember that at some point you must call
> > > > > > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > > > > > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > > > > > privileged process. All the rest can be done by the unprivileged process
> > > > > > though. That's exactly how bpf tokens work.
> > > > > 
> > > > > Hrm.  Assuming the fsopen mount sequence is still:
> > > > > 
> > > > > 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > > > > 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > > > > 	...
> > > > > 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > > > > 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> > > > > 	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > > > > 
> > > > > Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> > > > > CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> > > > > for move_mount() to have the intended effect.
> > > > 
> > > > Yes-ish.
> > > > 
> > > > At fsopen() time the user namespace of the caller is recorded in
> > > > fs_context->user_ns. If the filesystem is mountable inside of a user
> > > > namespace then fs_context->user_ns will be used to perform the
> > > > CAP_SYS_ADMIN check.
> > > 
> > > Hrmm, well fuse is one of the filesystems that sets FS_USERNS_MOUNT, so
> > > I gather that means that the fuse service server (ugh) could invoke the
> > > mount using the fsopen fd given to it?  That sounds promising.
> > 
> > Yes, it could provided fsopen() was called in a user namespace that the
> > service holds privileges over.
> > 
> > > 
> > > > For filesystems that aren't mountable inside of user namespaces (ext4,
> > > > xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
> > > > global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
> > > > to mount a filesystem with a non-initial userns if it's not marked as
> > > > mountable. That used to be possible but it's an invitation for extremely
> > > > subtle bugs and you gain control over the superblock itself.
> > > 
> > > I guess that's commit e1c5ae59c0f22f ("fs: don't allow non-init
> > > s_user_ns for filesystems without FS_USERNS_MOUNT")?  What does it mean
> > > for a filesystem to be "...written with a non-initial s_user_ns in
> > > mind"?  Is there something specific that I should look out for, aside
> > > from the usual "we don't mount parking lot xfs because validating that
> > > is too hard and it might explode the kernel"?
> > 
> > So there are two sides on how to view this:
> > 
> > (1) The filesystem is mountable   in a user namespace.
> > (2) The filesystem is delegatable to a user namespace.
> > 
> > These are two different things. Allowing (1) is difficult because of the
> > usual complexities involved even though everyone always seems to believe
> > that their block-based filesystem is reliable enough to be mounted with
> > any corrupted image.
> > 
> > But (2) is something that's doable and in fact something we do allow
> > currently for e.g., bpffs. In order to allow containers to use bpf the
> > container must have a bpffs instance mounted.
> > 
> > To do this fsopen() must be called in the container's user namespace. To
> > allow specific bpf features and to actually create the superblock
> > CAP_SYS_ADMIN or CAP_BPF in the initial user namespace is required.
> > Then a new bpffs instance will be created that is owned by the user
> > namespace of the container.
> > 
> > IOW, to delegate a superblock/filesystem to an unprivileged container
> > capabilities are still required but ultimately the filesystem will be
> > owned by the container.
> 
> <nod>
> 
> > One story I always found worth exploring to get at (1) is if we had
> > dm-verity directly integrated into the filesystem. And I don't mean
> > fsverity, I mean dm-verity and in a way such that it's explicitly not
> > part of the on-disk image in contrast to fsverity where each filesystem
> > integrates this very differently into their on-disk format. It basically
> > would be as dumb as it gets. Static, simple arithmetic, appended,
> > pre-pended, whatever.
> 
> That would work as long as you don't need to write to the filesystem,
> ever.  For gold master rootfs that would work fine, less so for "my
> container needs a writable data partition but the bofh doesn't want us
> compromising kernel memory".

Yes, for that use-case you probably almost always want to combine this
with overlayfs. Well, ideally the system would clearly differentiate
between filesystems that contain executable code, which should never
be writable, and filesystems that contain data.

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-07-17 23:27   ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
@ 2025-08-18 15:11     ` Miklos Szeredi
  2025-08-18 20:01       ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Miklos Szeredi @ 2025-08-18 15:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Fri, 18 Jul 2025 at 01:27, Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Actually copy the attributes/attributes_mask from userspace.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/fuse/dir.c |    2 ++
>  1 file changed, 2 insertions(+)
>
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 45b4c3cc1396af..4d841869ba3d0a 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1285,6 +1285,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
>                 stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
>                 stat->btime.tv_sec = sx->btime.tv_sec;
>                 stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
> +               stat->attributes = sx->attributes;
> +               stat->attributes_mask = sx->attributes_mask;

fuse_update_get_attr() has a cached and an uncached branch and these
fields are only getting set in the uncached case.

Thanks,
Miklos

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-18 15:11     ` Miklos Szeredi
@ 2025-08-18 20:01       ` Darrick J. Wong
  2025-08-18 20:04         ` Darrick J. Wong
  2025-08-19 15:01         ` Miklos Szeredi
  0 siblings, 2 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-08-18 20:01 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Mon, Aug 18, 2025 at 05:11:07PM +0200, Miklos Szeredi wrote:
> On Fri, 18 Jul 2025 at 01:27, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Actually copy the attributes/attributes_mask from userspace.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  fs/fuse/dir.c |    2 ++
> >  1 file changed, 2 insertions(+)
> >
> >
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 45b4c3cc1396af..4d841869ba3d0a 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -1285,6 +1285,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
> >                 stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
> >                 stat->btime.tv_sec = sx->btime.tv_sec;
> >                 stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
> > +               stat->attributes = sx->attributes;
> > +               stat->attributes_mask = sx->attributes_mask;
> 
> fuse_update_get_attr() has a cached and an uncached branch and these
> fields are only getting set in the uncached case.

Hrmm, do you want to cache all the various statx attributes in struct
fuse_inode?  Or would you rather that the kernel always call the fuse
server if any of the statx flags outside of (BASIC_STATS|BTIME) are set?

Right now the full version of kstat_from_fuse_statx contains:

	if (sx->mask & STATX_BTIME) {
		stat->btime.tv_sec = sx->btime.tv_sec;
		stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
	}

	if (sx->mask & STATX_DIOALIGN) {
		stat->dio_mem_align = sx->dio_mem_align;
		stat->dio_offset_align = sx->dio_offset_align;
	}

	if (sx->mask & STATX_SUBVOL)
		stat->subvol = sx->subvol;

	if (sx->mask & STATX_WRITE_ATOMIC) {
		stat->atomic_write_unit_min = sx->atomic_write_unit_min;
		stat->atomic_write_unit_max = sx->atomic_write_unit_max;
		stat->atomic_write_unit_max_opt = sx->atomic_write_unit_max_opt;
		stat->atomic_write_segments_max = sx->atomic_write_segments_max;
	}

	if (sx->mask & STATX_DIO_READ_ALIGN)
		stat->dio_read_offset_align = sx->dio_read_offset_align;

In theory only specialty programs are going to be interested in directio
or atomic writes, and only userspace nfs servers and backup programs are
going to care about subvolumes, so I don't know if it's really worth the
trouble to cache all that.

The dio/atomic fields are 7x u32, and the subvol id is u64.  That's 40
bytes per inode, which is kind of a lot.
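
To spell out the arithmetic: seven u32 fields plus one u64 is 36 bytes of
payload, and that rounds up to 40 on ABIs where a u64 wants 8-byte
alignment.  Here is a rough sketch of such a per-inode cache.  The struct
below is hypothetical (nothing like it exists in fuse today); only the
field names are borrowed from the kstat fields listed above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-inode cache of the "new" statx fields.  This only
 * illustrates the size estimate; it is not a proposed fuse_inode layout.
 */
struct fuse_statx_cache_sketch {
	uint64_t subvol;			/* STATX_SUBVOL */
	uint32_t dio_mem_align;			/* STATX_DIOALIGN */
	uint32_t dio_offset_align;		/* STATX_DIOALIGN */
	uint32_t dio_read_offset_align;		/* STATX_DIO_READ_ALIGN */
	uint32_t atomic_write_unit_min;		/* STATX_WRITE_ATOMIC */
	uint32_t atomic_write_unit_max;		/* STATX_WRITE_ATOMIC */
	uint32_t atomic_write_unit_max_opt;	/* STATX_WRITE_ATOMIC */
	uint32_t atomic_write_segments_max;	/* STATX_WRITE_ATOMIC */
};

/* Raw payload: 7 * 4 + 8 bytes, before any padding. */
size_t fuse_statx_cache_payload(void)
{
	return 7 * sizeof(uint32_t) + sizeof(uint64_t);
}

/* What the compiler actually reserves, including tail padding. */
size_t fuse_statx_cache_size(void)
{
	return sizeof(struct fuse_statx_cache_sketch);
}
```

The payload is 36 bytes everywhere; the padded size depends on the ABI's
alignment rules for 64-bit integers, so it can be 36 or 40.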

--D

> Thanks,
> Miklos
> 

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-18 20:01       ` Darrick J. Wong
@ 2025-08-18 20:04         ` Darrick J. Wong
  2025-08-19 15:01         ` Miklos Szeredi
  1 sibling, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-08-18 20:04 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Mon, Aug 18, 2025 at 01:01:55PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 18, 2025 at 05:11:07PM +0200, Miklos Szeredi wrote:
> > On Fri, 18 Jul 2025 at 01:27, Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Actually copy the attributes/attributes_mask from userspace.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > >  fs/fuse/dir.c |    2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > >
> > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > > index 45b4c3cc1396af..4d841869ba3d0a 100644
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -1285,6 +1285,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
> > >                 stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
> > >                 stat->btime.tv_sec = sx->btime.tv_sec;
> > >                 stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
> > > +               stat->attributes = sx->attributes;
> > > +               stat->attributes_mask = sx->attributes_mask;
> > 
> > fuse_update_get_attr() has a cached and an uncached branch and these
> > fields are only getting set in the uncached case.
> 
> Hrmm, do you want to cache all the various statx attributes in struct
> fuse_inode?  Or would you rather that the kernel always call the fuse
> server if any of the statx flags outside of (BASIC_STATS|BTIME) are set?

I should have said explicitly that attributes/attributes_mask need to be
cached because there's no separate STATX_ request flag for the bitfield.

However, the *new* fields that have been added since BASIC_STATS are the
subject of my ramblings below.

--D

> Right now the full version of kstat_from_fuse_statx contains:
> 
> 	if (sx->mask & STATX_BTIME) {
> 		stat->btime.tv_sec = sx->btime.tv_sec;
> 		stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
> 	}
> 
> 	if (sx->mask & STATX_DIOALIGN) {
> 		stat->dio_mem_align = sx->dio_mem_align;
> 		stat->dio_offset_align = sx->dio_offset_align;
> 	}
> 
> 	if (sx->mask & STATX_SUBVOL)
> 		stat->subvol = sx->subvol;
> 
> 	if (sx->mask & STATX_WRITE_ATOMIC) {
> 		stat->atomic_write_unit_min = sx->atomic_write_unit_min;
> 		stat->atomic_write_unit_max = sx->atomic_write_unit_max;
> 		stat->atomic_write_unit_max_opt = sx->atomic_write_unit_max_opt;
> 		stat->atomic_write_segments_max = sx->atomic_write_segments_max;
> 	}
> 
> 	if (sx->mask & STATX_DIO_READ_ALIGN)
> 		stat->dio_read_offset_align = sx->dio_read_offset_align;
> 
> In theory only specialty programs are going to be interested in directio
> or atomic writes, and only userspace nfs servers and backup programs are
> going to care about subvolumes, so I don't know if it's really worth the
> trouble to cache all that.
> 
> The dio/atomic fields are 7x u32, and the subvol id is u64.  That's 40
> bytes per inode, which is kind of a lot.
> 
> --D
> 
> > Thanks,
> > Miklos
> > 
> 

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-18 20:01       ` Darrick J. Wong
  2025-08-18 20:04         ` Darrick J. Wong
@ 2025-08-19 15:01         ` Miklos Szeredi
  2025-08-19 22:51           ` Darrick J. Wong
  1 sibling, 1 reply; 174+ messages in thread
From: Miklos Szeredi @ 2025-08-19 15:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Mon, 18 Aug 2025 at 22:01, Darrick J. Wong <djwong@kernel.org> wrote:

> In theory only specialty programs are going to be interested in directio
> or atomic writes, and only userspace nfs servers and backup programs are
> going to care about subvolumes, so I don't know if it's really worth the
> trouble to cache all that.
>
> The dio/atomic fields are 7x u32, and the subvol id is u64.  That's 40
> bytes per inode, which is kind of a lot.

Agreed.  This should also depend on the sync mode.

AT_STATX_DONT_SYNC: anything not cached should be cleared from the mask.

AT_STATX_FORCE_SYNC: cached values should be ignored and FUSE_STATX
request sent.

AT_STATX_SYNC_AS_STAT: ???

Thanks,
Miklos

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-19 15:01         ` Miklos Szeredi
@ 2025-08-19 22:51           ` Darrick J. Wong
  2025-08-20  9:16             ` Miklos Szeredi
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-08-19 22:51 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Tue, Aug 19, 2025 at 05:01:15PM +0200, Miklos Szeredi wrote:
> On Mon, 18 Aug 2025 at 22:01, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > In theory only specialty programs are going to be interested in directio
> > or atomic writes, and only userspace nfs servers and backup programs are
> > going to care about subvolumes, so I don't know if it's really worth the
> > trouble to cache all that.
> >
> > The dio/atomic fields are 7x u32, and the subvol id is u64.  That's 40
> > bytes per inode, which is kind of a lot.
> 
> Agreed.  This should also depend on the sync mode.
> 
> AT_STATX_DONT_SYNC: anything not cached should be cleared from the mask.
> 
> AT_STATX_FORCE_SYNC: cached values should be ignored and FUSE_STATX
> request sent.

IMO, if the caller asks for the weird statx attributes
(dioalign/subvol/write_atomic) then they probably prefer to wait to get
the attributes they asked for.  I'd be willing to strip them out of the
request_mask if they affirm _DONT_SYNC though.

Something like this, maybe?

#define FUSE_UNCACHED_STATX_MASK	(STATX_DIOALIGN | \
					 STATX_SUBVOL | \
					 STATX_WRITE_ATOMIC)

and then in fuse_update_get_attr,

	if (!request_mask)
		sync = false;
	else if (request_mask & FUSE_UNCACHED_STATX_MASK) {
		if (flags & AT_STATX_DONT_SYNC) {
			request_mask &= ~FUSE_UNCACHED_STATX_MASK;
			sync = false;
		} else {
			sync = true;
		}
	} else if (flags & AT_STATX_FORCE_SYNC)
		sync = true;
	else if (flags & AT_STATX_DONT_SYNC)
		sync = false;
	else if (request_mask & inval_mask & ~cache_mask)
		sync = true;
	else
		sync = time_before64(fi->i_time, get_jiffies_64());
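
For anyone who wants to poke at that decision table outside the kernel,
here is a standalone userspace model of the same chain.  It is only a
sketch: the SK_* constants are local stand-ins for the uapi STATX_* and
AT_STATX_* definitions, sk_need_sync() is a made-up name, and the
cache_expired argument stands in for the time_before64() check against
fi->i_time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for this sketch; kernel code would use the uapi names. */
#define SK_STATX_SIZE		0x00000200U
#define SK_STATX_DIOALIGN	0x00002000U
#define SK_STATX_SUBVOL		0x00008000U
#define SK_STATX_WRITE_ATOMIC	0x00010000U
#define SK_AT_FORCE_SYNC	0x2000U
#define SK_AT_DONT_SYNC		0x4000U

#define SK_UNCACHED_STATX_MASK	(SK_STATX_DIOALIGN | \
				 SK_STATX_SUBVOL | \
				 SK_STATX_WRITE_ATOMIC)

/*
 * Model of the proposed sync decision.  Returns true if the fuse server
 * would have to be asked; may strip uncached bits from *request_mask
 * when the caller passed DONT_SYNC.
 */
bool sk_need_sync(uint32_t *request_mask, unsigned int flags,
		  uint32_t inval_mask, uint32_t cache_mask,
		  bool cache_expired)
{
	if (!*request_mask)
		return false;

	if (*request_mask & SK_UNCACHED_STATX_MASK) {
		if (flags & SK_AT_DONT_SYNC) {
			/* Caller refuses to wait: drop what we cannot cache. */
			*request_mask &= ~SK_UNCACHED_STATX_MASK;
			return false;
		}
		/* Uncached attributes always require an upcall. */
		return true;
	}

	if (flags & SK_AT_FORCE_SYNC)
		return true;
	if (flags & SK_AT_DONT_SYNC)
		return false;
	if (*request_mask & inval_mask & ~cache_mask)
		return true;

	/* Otherwise fall back to the attribute timeout. */
	return cache_expired;
}
```

Note that in this ordering a request containing any uncached bit goes to
the server unless DONT_SYNC is set, regardless of the attribute timeout.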

> AT_STATX_SYNC_AS_STAT: ???

I have no idea what that means. :)

Way back in 2017, dhowells implied that it synchronises the attributes
with the backing store in the same way that network filesystems do[1].
But the question is, does fuse count as a network fs?

I guess it does.  But the discussion from 2016 also provided "this is
very filesystem specific" so I guess we can do whatever we want??  XFS
and ext4 ignore that value.  The statx(2) manpage repeats that "whatever
stat does" language, but the stat(2) and stat(3) manpages don't say a
darned thing.

I was just gonna ignore it.

[1] https://lore.kernel.org/linux-fsdevel/147948603812.5122.5116851833739815967.stgit@warthog.procyon.org.uk/

--D

> Thanks,
> Miklos
> 

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-19 22:51           ` Darrick J. Wong
@ 2025-08-20  9:16             ` Miklos Szeredi
  2025-08-20  9:40               ` Miklos Szeredi
  2025-08-20 15:09               ` Darrick J. Wong
  0 siblings, 2 replies; 174+ messages in thread
From: Miklos Szeredi @ 2025-08-20  9:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Wed, 20 Aug 2025 at 00:51, Darrick J. Wong <djwong@kernel.org> wrote:

> Something like this, maybe?
>
> #define FUSE_UNCACHED_STATX_MASK        (STATX_DIOALIGN | \
>                                          STATX_SUBVOL | \
>                                          STATX_WRITE_ATOMIC)
>
> and then in fuse_update_get_attr,
>
>         if (!request_mask)
>                 sync = false;
>         else if (request_mask & FUSE_UNCACHED_STATX_MASK) {
>                 if (flags & AT_STATX_DONT_SYNC) {
>                         request_mask &= ~FUSE_UNCACHED_STATX_MASK;
>                         sync = false;
>                 } else {
>                         sync = true;
>                 }
>         } else if (flags & AT_STATX_FORCE_SYNC)
>                 sync = true;
>         else if (flags & AT_STATX_DONT_SYNC)
>                 sync = false;
>         else if (request_mask & inval_mask & ~cache_mask)
>                 sync = true;
>         else
>                 sync = time_before64(fi->i_time, get_jiffies_64());

Yes.

> Way back in 2017, dhowells implied that it synchronises the attributes
> with the backing store in the same way that network filesystems do[1].
> But the question is, does fuse count as a network fs?
>
> I guess it does.  But the discussion from 2016 also provided "this is
> very filesystem specific" so I guess we can do whatever we want??  XFS
> and ext4 ignore that value.  The statx(2) manpage repeats that "whatever
> stat does" language, but the stat(2) and stat(3) manpages don't say a
> darned thing.

Actually we can't ignore it, since it's the default (i.e. if neither
FORCE_SYNC nor DONT_SYNC is in effect, then that implies
SYNC_AS_STAT).

I guess the semantics you codified above make sense.  In words:

"If neither forcing nor forbidding sync, then statx shall always
attempt to return attributes that are defined on that filesystem, but
may return stale values."

As an optimization of the above, the filesystem clearing the
request_mask for these uncached attributes means that that attribute
is not supported by the filesystem and that *can* be cheaply cached
(e.g. clearing fi->inval_mask).

Thanks,
Miklos

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-20  9:16             ` Miklos Szeredi
@ 2025-08-20  9:40               ` Miklos Szeredi
  2025-08-20 15:16                 ` Darrick J. Wong
  2025-08-20 15:09               ` Darrick J. Wong
  1 sibling, 1 reply; 174+ messages in thread
From: Miklos Szeredi @ 2025-08-20  9:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Wed, 20 Aug 2025 at 11:16, Miklos Szeredi <miklos@szeredi.hu> wrote:

> As an optimization of the above, the filesystem clearing the
> request_mask for these uncached attributes means that that attribute
> is not supported by the filesystem and that *can* be cheaply cached
> (e.g. clearing fi->inval_mask).

Even better: add sx_supported to fuse_init_out, so that unsupported
ones don't generate unnecessary requests.

Thanks,
Miklos

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-20  9:16             ` Miklos Szeredi
  2025-08-20  9:40               ` Miklos Szeredi
@ 2025-08-20 15:09               ` Darrick J. Wong
  2025-08-20 15:23                 ` Miklos Szeredi
  1 sibling, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-08-20 15:09 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Wed, Aug 20, 2025 at 11:16:42AM +0200, Miklos Szeredi wrote:
> On Wed, 20 Aug 2025 at 00:51, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > Something like this, maybe?
> >
> > #define FUSE_UNCACHED_STATX_MASK        (STATX_DIOALIGN | \
> >                                          STATX_SUBVOL | \
> >                                          STATX_WRITE_ATOMIC)
> >
> > and then in fuse_update_get_attr,
> >
> >         if (!request_mask)
> >                 sync = false;
> >         else if (request_mask & FUSE_UNCACHED_STATX_MASK) {
> >                 if (flags & AT_STATX_DONT_SYNC) {
> >                         request_mask &= ~FUSE_UNCACHED_STATX_MASK;
> >                         sync = false;
> >                 } else {
> >                         sync = true;
> >                 }
> >         } else if (flags & AT_STATX_FORCE_SYNC)
> >                 sync = true;
> >         else if (flags & AT_STATX_DONT_SYNC)
> >                 sync = false;
> >         else if (request_mask & inval_mask & ~cache_mask)
> >                 sync = true;
> >         else
> >                 sync = time_before64(fi->i_time, get_jiffies_64());
> 
> Yes.
> 
> > Way back in 2017, dhowells implied that it synchronises the attributes
> > with the backing store in the same way that network filesystems do[1].
> > But the question is, does fuse count as a network fs?
> >
> > I guess it does.  But the discussion from 2016 also provided "this is
> > very filesystem specific" so I guess we can do whatever we want??  XFS
> > and ext4 ignore that value.  The statx(2) manpage repeats that "whatever
> > stat does" language, but the stat(2) and stat(3) manpages don't say a
> > darned thing.

Ohhh, only now did I notice that it's one of those trickster flag symbols
like O_RDONLY that are #define'd to 0.  That's why there's no
(flags & SYNC_AS_STAT) anywhere in the codebase.
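
A minimal sketch of why a bit test on a zero-valued flag can never fire,
and how to compare the whole sync-type field instead.  The two helper
names are hypothetical; the fallback values are the uapi definitions of
the AT_STATX_* constants, used only if the system headers lack them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdbool.h>

/* Fallbacks for headers that do not expose the statx sync flags. */
#ifndef AT_STATX_SYNC_TYPE
#define AT_STATX_SYNC_TYPE	0x6000
#define AT_STATX_SYNC_AS_STAT	0x0000
#define AT_STATX_FORCE_SYNC	0x2000
#define AT_STATX_DONT_SYNC	0x4000
#endif

/* Hypothetical helper: always false, because AT_STATX_SYNC_AS_STAT is 0. */
static bool naive_is_sync_as_stat(unsigned int flags)
{
	return (flags & AT_STATX_SYNC_AS_STAT) != 0;
}

/* Hypothetical helper: compare the whole sync-type field to the default. */
static bool is_sync_as_stat(unsigned int flags)
{
	return (flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_AS_STAT;
}
```

In other words, SYNC_AS_STAT is what you get when neither of the other
two sync bits is set, so it can only be detected by exclusion.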

> Actually we can't ignore it, since it's the default (i.e. if neither
> FORCE_SYNC nor DONT_SYNC is in effect, then that implies
> SYNC_AS_STAT).
> 
> I guess the semantics you codified above make sense.  In words:
> 
> "If neither forcing nor forbidding sync, then statx shall always
> attempt to return attributes that are defined on that filesystem, but
> may return stale values."

Where is that written?  I'd like to read the rest of it to clear my
head. :)

> As an optimization of the above, the filesystem clearing the
> request_mask for these uncached attributes means that that attribute
> is not supported by the filesystem and that *can* be cheaply cached
> (e.g. clearing fi->inval_mask).

Hrmm.  I wouldn't want to set fi->inval_mask bits just because a
FUSE_STATX message ignored a mask bit one time -- imagine a filesystem
with tiered storage.  A file might be on slow hdd storage which means no
fancy things like atomic writes, but later it might get promoted to
faster nvme which does support that.

Anyway I'll send out rfcv4 today, which has the above update_get_attr
logic in it.
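
For anyone following along, here is a rough userspace model of the
update_get_attr decision quoted above.  It is only a sketch: every
DEMO_* name, the demo_attr_state struct, and demo_need_sync() are
made up for illustration, and the "expired" field stands in for the
time_before64(fi->i_time, get_jiffies_64()) check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the AT_STATX_* flags and STATX_* mask bits. */
#define DEMO_FORCE_SYNC		0x2000u
#define DEMO_DONT_SYNC		0x4000u
#define DEMO_STATX_SIZE		0x0200u
#define DEMO_STATX_DIOALIGN	0x2000u
#define DEMO_UNCACHED_MASK	DEMO_STATX_DIOALIGN

/* Hypothetical model of the per-inode attribute cache state. */
struct demo_attr_state {
	uint32_t inval_mask;	/* attributes known to be stale */
	uint32_t cache_mask;	/* attributes maintained locally */
	bool expired;		/* models the attribute timeout check */
};

/*
 * Decide whether a statx call must go to the server.  May strip the
 * uncached bits from *request_mask when the caller forbids syncing.
 */
static bool demo_need_sync(const struct demo_attr_state *st,
			   unsigned int flags, uint32_t *request_mask)
{
	if (!*request_mask)
		return false;
	if (*request_mask & DEMO_UNCACHED_MASK) {
		if (flags & DEMO_DONT_SYNC) {
			/* Cannot serve these from cache, so drop them. */
			*request_mask &= ~DEMO_UNCACHED_MASK;
			return false;
		}
		return true;
	}
	if (flags & DEMO_FORCE_SYNC)
		return true;
	if (flags & DEMO_DONT_SYNC)
		return false;
	if (*request_mask & st->inval_mask & ~st->cache_mask)
		return true;
	return st->expired;
}
```

The ordering matters: the uncached-attribute check comes before the
FORCE_SYNC/DONT_SYNC checks so that DONT_SYNC can strip those bits
instead of returning stale or missing values for them.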

--D

> Thanks,
> Miklos

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-20  9:40               ` Miklos Szeredi
@ 2025-08-20 15:16                 ` Darrick J. Wong
  2025-08-20 15:31                   ` Miklos Szeredi
  0 siblings, 1 reply; 174+ messages in thread
From: Darrick J. Wong @ 2025-08-20 15:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Wed, Aug 20, 2025 at 11:40:50AM +0200, Miklos Szeredi wrote:
> On Wed, 20 Aug 2025 at 11:16, Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > As an optimization of the above, the filesystem clearing the
> > request_mask for these uncached attributes means that that attribute
> > is not supported by the filesystem and that *can* be cheaply cached
> > (e.g. clearing fi->inval_mask).
> 
> Even better: add sx_supported to fuse_init_out, so that unsupported
> ones don't generate unnecessary requests.

That would work better -- if the fuse server knows it'll never respond
to STATX_SUBVOL then we could obliterate it from all the statx queries.

How does one add a new field to struct fuse_init_out without breaking
old libfuse / fuse servers which still have the old fuse_init_out?
AFAICT, fuse_send_init sets out_argvar, so fuse_copy_out_args will
handle a short reply from old libfuse.  But a new libfuse running on an
old kernel can't send the kernel what it will think is an oversized
init reply, right?

So I think we end up having to declare a new flags bit for struct
fuse_init_in, and the kernel sets the bit unconditionally.  libfuse
sends the larger fuse_init_out reply if the new flag bit is set, or the
old size if it isn't.  Does that sound correct?

--D

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-20 15:09               ` Darrick J. Wong
@ 2025-08-20 15:23                 ` Miklos Szeredi
  2025-08-20 15:29                   ` Darrick J. Wong
  0 siblings, 1 reply; 174+ messages in thread
From: Miklos Szeredi @ 2025-08-20 15:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Wed, 20 Aug 2025 at 17:09, Darrick J. Wong <djwong@kernel.org> wrote:

> > "If neither forcing nor forbidding sync, then statx shall always
> > attempt to return attributes that are defined on that filesystem, but
> > may return stale values."
>
> Where is that written?  I'd like to read the rest of it to clear my
> head. :)

It's my summary of what you wrote as code.

Thanks,
Miklos

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-20 15:23                 ` Miklos Szeredi
@ 2025-08-20 15:29                   ` Darrick J. Wong
  0 siblings, 0 replies; 174+ messages in thread
From: Darrick J. Wong @ 2025-08-20 15:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Wed, Aug 20, 2025 at 05:23:27PM +0200, Miklos Szeredi wrote:
> On Wed, 20 Aug 2025 at 17:09, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > > "If neither forcing nor forbidding sync, then statx shall always
> > > attempt to return attributes that are defined on that filesystem, but
> > > may return stale values."
> >
> > Where is that written?  I'd like to read the rest of it to clear my
> > head. :)
> 
> It's my summary of what you wrote as code.

Ahhh, thanks.

/me hands himself another cup of coffee. :P

--D

> Thanks,
> Miklos
> 

* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
  2025-08-20 15:16                 ` Darrick J. Wong
@ 2025-08-20 15:31                   ` Miklos Szeredi
  0 siblings, 0 replies; 174+ messages in thread
From: Miklos Szeredi @ 2025-08-20 15:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong

On Wed, 20 Aug 2025 at 17:16, Darrick J. Wong <djwong@kernel.org> wrote:

> How does one add a new field to struct fuse_init_out without breaking
> old libfuse / fuse servers which still have the old fuse_init_out?

There are currently 22 bytes unused at the end, so it's easy unless you
want to add more.

Ideally there should also be a matching feature flag indicating that
a) kernel supports this feature b) field contains valid data.
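
A rough sketch of that idea, assuming a simplified reply layout.  The
demo_* structs below are NOT the real struct fuse_init_out; they only
model a reply that ends in 22 unused bytes.  DEMO_HAS_SX_SUPPORTED and
demo_filter_request_mask() are likewise hypothetical names.

```c
#include <stdint.h>

/* Hypothetical feature flag: "the sx_supported field holds valid data". */
#define DEMO_HAS_SX_SUPPORTED	(1u << 0)

/* Simplified model of the existing reply, ending in 22 unused bytes. */
struct demo_init_out_old {
	uint32_t flags;
	uint32_t flags2;
	uint16_t request_timeout;
	uint16_t unused[11];
};

/* Same size, with a new 32-bit field carved out of the unused tail. */
struct demo_init_out_new {
	uint32_t flags;
	uint32_t flags2;
	uint16_t request_timeout;
	uint16_t pad;		/* keeps sx_supported 4-byte aligned */
	uint32_t sx_supported;
	uint16_t unused[8];
};

/* The wire size must not change, or old peers would reject the reply. */
_Static_assert(sizeof(struct demo_init_out_old) ==
	       sizeof(struct demo_init_out_new),
	       "carving a field out of the padding must not grow the reply");

/* Drop statx bits that the server declared it will never answer. */
static uint32_t demo_filter_request_mask(const struct demo_init_out_new *out,
					 uint32_t request_mask)
{
	/* Without the flag the field is meaningless; assume all supported. */
	uint32_t supported = (out->flags2 & DEMO_HAS_SX_SUPPORTED) ?
			     out->sx_supported : UINT32_MAX;

	return request_mask & supported;
}
```

The flag check is what keeps an old server, which leaves the tail
zeroed, from being misread as "supports no statx attributes at all".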

> AFAICT, fuse_send_init sets out_argvar, so fuse_copy_out_args will
> handle a short reply from old libfuse.  But a new libfuse running on an
> old kernel can't send the kernel what it will think is an oversized
> init reply, right?
>
> So I think we end up having to declare a new flags bit for struct
> fuse_init_in, and the kernel sets the bit unconditionally.  libfuse
> sends the larger fuse_init_out reply if the new flag bit is set, or the
> old size if it isn't.  Does that sound correct?

I think that's exactly what the previous size extension did
(FUSE_INIT_EXT flag).
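
As a sketch of the negotiation described in the quoted text (not the
actual libfuse code): the server picks its reply size from a flag the
kernel set in the init request.  DEMO_INIT_BIG_REPLY, both structs, and
demo_init_reply_size() are hypothetical names for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request flag: "I understand the larger init reply". */
#define DEMO_INIT_BIG_REPLY	(1u << 0)

/* Hypothetical legacy reply layout. */
struct demo_init_out_legacy {
	uint32_t major;
	uint32_t minor;
	uint32_t flags;
};

/* Hypothetical extended reply: the legacy layout plus one new field. */
struct demo_init_out_ext {
	struct demo_init_out_legacy base;
	uint32_t sx_supported;
};

/*
 * Server side: send the larger reply only if the kernel advertised that
 * it can parse it; an old kernel never sets the bit and gets the old size.
 */
static size_t demo_init_reply_size(uint32_t in_flags)
{
	if (in_flags & DEMO_INIT_BIG_REPLY)
		return sizeof(struct demo_init_out_ext);
	return sizeof(struct demo_init_out_legacy);
}
```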

Thanks,
Miklos

end of thread, other threads:[~2025-08-20 15:31 UTC | newest]

Thread overview: 174+ messages
2025-07-17 23:10 [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
2025-07-17 23:23 ` [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
2025-07-17 23:26   ` [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
2025-07-17 23:26   ` [PATCH 2/7] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
2025-07-18 16:37     ` Bernd Schubert
2025-07-18 17:50       ` Joanne Koong
2025-07-18 17:57         ` Bernd Schubert
2025-07-18 18:38           ` Darrick J. Wong
2025-07-18 18:07       ` Bernd Schubert
2025-07-18 18:13         ` Bernd Schubert
2025-07-18 19:34         ` Darrick J. Wong
2025-07-18 21:03           ` Bernd Schubert
2025-07-18 22:23     ` Joanne Koong
2025-07-19  0:32       ` Darrick J. Wong
2025-07-21 20:32         ` Joanne Koong
2025-07-23 17:34           ` Darrick J. Wong
2025-07-23 21:02             ` Joanne Koong
2025-07-23 21:11               ` Joanne Koong
2025-07-24 22:28               ` Darrick J. Wong
2025-07-22 12:30         ` Jeff Layton
2025-07-22 12:38           ` Jeff Layton
2025-07-23 15:37             ` Darrick J. Wong
2025-07-23 16:24               ` Jeff Layton
2025-07-31  9:45                 ` Christian Brauner
2025-07-31 17:52                   ` Darrick J. Wong
2025-07-19  7:18       ` Amir Goldstein
2025-07-21 20:05         ` Joanne Koong
2025-07-23 17:06           ` Darrick J. Wong
2025-07-23 20:27             ` Joanne Koong
2025-07-24 22:34               ` Darrick J. Wong
2025-07-17 23:27   ` [PATCH 3/7] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
2025-07-18 17:10     ` Bernd Schubert
2025-07-18 18:13       ` Darrick J. Wong
2025-07-22 22:20         ` Bernd Schubert
2025-07-17 23:27   ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
2025-08-18 15:11     ` Miklos Szeredi
2025-08-18 20:01       ` Darrick J. Wong
2025-08-18 20:04         ` Darrick J. Wong
2025-08-19 15:01         ` Miklos Szeredi
2025-08-19 22:51           ` Darrick J. Wong
2025-08-20  9:16             ` Miklos Szeredi
2025-08-20  9:40               ` Miklos Szeredi
2025-08-20 15:16                 ` Darrick J. Wong
2025-08-20 15:31                   ` Miklos Szeredi
2025-08-20 15:09               ` Darrick J. Wong
2025-08-20 15:23                 ` Miklos Szeredi
2025-08-20 15:29                   ` Darrick J. Wong
2025-07-17 23:27   ` [PATCH 5/7] iomap: exit early when iomap_iter is called with zero length Darrick J. Wong
2025-07-17 23:27   ` [PATCH 6/7] iomap: trace iomap_zero_iter zeroing activities Darrick J. Wong
2025-07-17 23:28   ` [PATCH 7/7] iomap: error out on file IO when there is no inline_data buffer Darrick J. Wong
2025-07-17 23:24 ` [PATCHSET RFC v3 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-07-17 23:28   ` [PATCH 01/13] fuse: implement the basic iomap mechanisms Darrick J. Wong
2025-07-17 23:28   ` [PATCH 02/13] fuse: add an ioctl to add new iomap devices Darrick J. Wong
2025-07-17 23:28   ` [PATCH 03/13] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount Darrick J. Wong
2025-07-17 23:29   ` [PATCH 04/13] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
2025-07-17 23:29   ` [PATCH 05/13] fuse: implement direct IO with iomap Darrick J. Wong
2025-07-17 23:29   ` [PATCH 06/13] fuse: implement buffered " Darrick J. Wong
2025-07-18 15:10     ` Amir Goldstein
2025-07-18 18:01       ` Darrick J. Wong
2025-07-18 18:39         ` Bernd Schubert
2025-07-18 18:46           ` Darrick J. Wong
2025-07-18 19:45         ` Amir Goldstein
2025-07-18 20:20           ` Darrick J. Wong
2025-07-17 23:29   ` [PATCH 07/13] fuse: enable caching of timestamps Darrick J. Wong
2025-07-17 23:30   ` [PATCH 08/13] fuse: implement large folios for iomap pagecache files Darrick J. Wong
2025-07-17 23:30   ` [PATCH 09/13] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
2025-07-17 23:30   ` [PATCH 10/13] fuse: advertise support for iomap Darrick J. Wong
2025-07-17 23:31   ` [PATCH 11/13] fuse: query filesystem geometry when using iomap Darrick J. Wong
2025-07-17 23:31   ` [PATCH 12/13] fuse: implement fadvise for iomap files Darrick J. Wong
2025-07-17 23:31   ` [PATCH 13/13] fuse: implement inline data file IO via iomap Darrick J. Wong
2025-07-17 23:24 ` [PATCHSET RFC v3 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-07-17 23:31   ` [PATCH 1/4] fuse: cache iomaps Darrick J. Wong
2025-07-17 23:32   ` [PATCH 2/4] fuse: use the iomap cache for iomap_begin Darrick J. Wong
2025-07-17 23:32   ` [PATCH 3/4] fuse: invalidate iomap cache after file updates Darrick J. Wong
2025-07-17 23:32   ` [PATCH 4/4] fuse: enable iomap cache management Darrick J. Wong
2025-07-17 23:24 ` [PATCHSET RFC v3 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-07-17 23:32   ` [PATCH 1/7] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
2025-07-17 23:33   ` [PATCH 2/7] fuse: synchronize inode->i_flags after fileattr_[gs]et Darrick J. Wong
2025-07-17 23:33   ` [PATCH 3/7] fuse: cache atime when in iomap mode Darrick J. Wong
2025-07-17 23:33   ` [PATCH 4/7] fuse: update file mode when updating acls Darrick J. Wong
2025-07-17 23:33   ` [PATCH 5/7] fuse: propagate default and file acls on creation Darrick J. Wong
2025-07-17 23:34   ` [PATCH 6/7] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
2025-07-17 23:34   ` [PATCH 7/7] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-07-17 23:34   ` [PATCH 01/14] libfuse: add kernel gates for FUSE_IOMAP and bump libfuse api version Darrick J. Wong
2025-07-17 23:34   ` [PATCH 02/14] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
2025-07-17 23:35   ` [PATCH 03/14] libfuse: add upper level iomap commands Darrick J. Wong
2025-07-17 23:35   ` [PATCH 04/14] libfuse: add a notification to add a new device to iomap Darrick J. Wong
2025-07-17 23:35   ` [PATCH 05/14] libfuse: add iomap ioend low level handler Darrick J. Wong
2025-07-17 23:35   ` [PATCH 06/14] libfuse: add upper level iomap ioend commands Darrick J. Wong
2025-07-17 23:36   ` [PATCH 07/14] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong
2025-07-18 14:10     ` Amir Goldstein
2025-07-18 15:48       ` Darrick J. Wong
2025-07-19  7:34         ` Amir Goldstein
2025-07-17 23:36   ` [PATCH 08/14] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong
2025-07-18 14:27     ` Amir Goldstein
2025-07-18 15:55       ` Darrick J. Wong
2025-07-21 18:51         ` Bernd Schubert
2025-07-23 17:50           ` Darrick J. Wong
2025-07-24 19:56             ` Amir Goldstein
2025-07-29  5:35               ` Darrick J. Wong
2025-07-29  7:50                 ` Amir Goldstein
2025-07-29 14:22                   ` Darrick J. Wong
2025-07-17 23:36   ` [PATCH 09/14] libfuse: add FUSE_IOMAP_DIRECTIO Darrick J. Wong
2025-07-17 23:37   ` [PATCH 10/14] libfuse: add FUSE_IOMAP_FILEIO Darrick J. Wong
2025-07-17 23:37   ` [PATCH 11/14] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
2025-07-17 23:37   ` [PATCH 12/14] libfuse: add lower level iomap_config implementation Darrick J. Wong
2025-07-17 23:37   ` [PATCH 13/14] libfuse: add upper " Darrick J. Wong
2025-07-17 23:38   ` [PATCH 14/14] libfuse: add strictatime/lazytime mount options Darrick J. Wong
2025-07-17 23:25 ` [PATCHSET RFC v3 2/3] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-07-17 23:38   ` [PATCH 1/1] libfuse: enable iomap cache management Darrick J. Wong
2025-07-18 16:16     ` Bernd Schubert
2025-07-18 18:22       ` Darrick J. Wong
2025-07-18 18:35         ` Bernd Schubert
2025-07-18 18:40           ` Darrick J. Wong
2025-07-18 18:51             ` Bernd Schubert
2025-07-17 23:25 ` [PATCHSET RFC v3 3/3] libfuse: implement statx and syncfs Darrick J. Wong
2025-07-17 23:38   ` [PATCH 1/4] libfuse: wire up FUSE_SYNCFS to the low level library Darrick J. Wong
2025-07-17 23:38   ` [PATCH 2/4] libfuse: add syncfs support to the upper library Darrick J. Wong
2025-07-17 23:39   ` [PATCH 3/4] libfuse: add statx support to the lower level library Darrick J. Wong
2025-07-18 13:28     ` Amir Goldstein
2025-07-18 15:58       ` Darrick J. Wong
2025-07-18 16:27       ` Darrick J. Wong
2025-07-18 16:54         ` Bernd Schubert
2025-07-18 18:42           ` Darrick J. Wong
2025-07-17 23:39   ` [PATCH 4/4] libfuse: add upper level statx hooks Darrick J. Wong
2025-07-17 23:25 ` [PATCHSET RFC v3 1/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-07-17 23:39   ` [PATCH 01/22] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
2025-07-17 23:39   ` [PATCH 02/22] fuse2fs: add iomap= mount option Darrick J. Wong
2025-07-17 23:40   ` [PATCH 03/22] fuse2fs: implement iomap configuration Darrick J. Wong
2025-07-17 23:40   ` [PATCH 04/22] fuse2fs: register block devices for use with iomap Darrick J. Wong
2025-07-17 23:40   ` [PATCH 05/22] fuse2fs: always use directio disk reads with fuse2fs Darrick J. Wong
2025-07-17 23:40   ` [PATCH 06/22] fuse2fs: implement directio file reads Darrick J. Wong
2025-07-17 23:41   ` [PATCH 07/22] fuse2fs: use tagged block IO for zeroing sub-block regions Darrick J. Wong
2025-07-17 23:41   ` [PATCH 08/22] fuse2fs: only flush the cache for the file under directio read Darrick J. Wong
2025-07-17 23:41   ` [PATCH 09/22] fuse2fs: add extent dump function for debugging Darrick J. Wong
2025-07-17 23:41   ` [PATCH 10/22] fuse2fs: implement direct write support Darrick J. Wong
2025-07-17 23:42   ` [PATCH 11/22] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
2025-07-17 23:42   ` [PATCH 12/22] fuse2fs: improve tracing for fallocate Darrick J. Wong
2025-07-17 23:42   ` [PATCH 13/22] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
2025-07-17 23:43   ` [PATCH 14/22] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
2025-07-17 23:43   ` [PATCH 15/22] fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode Darrick J. Wong
2025-07-17 23:43   ` [PATCH 16/22] fuse2fs: re-enable the block device pagecache for metadata IO Darrick J. Wong
2025-07-17 23:43   ` [PATCH 17/22] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
2025-07-17 23:44   ` [PATCH 18/22] fuse2fs: don't allow hardlinks for now Darrick J. Wong
2025-07-17 23:44   ` [PATCH 19/22] fuse2fs: enable file IO to inline data files Darrick J. Wong
2025-07-17 23:44   ` [PATCH 20/22] fuse2fs: set iomap-related inode flags Darrick J. Wong
2025-07-17 23:44   ` [PATCH 21/22] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
2025-07-17 23:45   ` [PATCH 22/22] fuse2fs: configure block device block size Darrick J. Wong
2025-07-17 23:26 ` [PATCHSET RFC v3 2/3] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-07-17 23:45   ` [PATCH 1/1] fuse2fs: enable caching of iomaps Darrick J. Wong
2025-07-17 23:26 ` [PATCHSET RFC v3 3/3] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-07-17 23:45   ` [PATCH 01/10] fuse2fs: allow O_APPEND and O_TRUNC opens Darrick J. Wong
2025-07-17 23:45   ` [PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap is enabled Darrick J. Wong
2025-07-17 23:46   ` [PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
2025-07-17 23:46   ` [PATCH 04/10] fuse2fs: better debugging for file mode updates Darrick J. Wong
2025-07-17 23:46   ` [PATCH 05/10] fuse2fs: debug timestamp updates Darrick J. Wong
2025-07-17 23:46   ` [PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
2025-07-17 23:47   ` [PATCH 07/10] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
2025-07-17 23:47   ` [PATCH 08/10] fuse2fs: enable syncfs Darrick J. Wong
2025-07-17 23:47   ` [PATCH 09/10] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
2025-07-17 23:47   ` [PATCH 10/10] fuse2fs: implement statx Darrick J. Wong
2025-07-18  8:54 ` [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4 Christian Brauner
2025-07-18 11:55   ` Amir Goldstein
2025-07-18 19:31     ` Darrick J. Wong
2025-07-18 19:56       ` Amir Goldstein
2025-07-18 20:21         ` Darrick J. Wong
2025-07-23 13:05       ` Christian Brauner
2025-07-23 18:04         ` Darrick J. Wong
2025-07-31 10:13           ` Christian Brauner
2025-07-31 17:22             ` Darrick J. Wong
2025-08-04 10:12               ` Christian Brauner
2025-08-12 20:20                 ` Darrick J. Wong
2025-08-15 14:20                   ` Christian Brauner
