From: "Darrick J. Wong" <djwong@kernel.org>
To: linux-fsdevel <linux-fsdevel@vger.kernel.org>
Cc: John@groves.net, bernd@bsbernd.com, miklos@szeredi.hu,
joannelkoong@gmail.com, Josef Bacik <josef@toxicpanda.com>,
linux-ext4 <linux-ext4@vger.kernel.org>,
Theodore Ts'o <tytso@mit.edu>, Neal Gompa <neal@gompa.dev>
Subject: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4
Date: Thu, 17 Jul 2025 16:10:38 -0700
Message-ID: <20250717231038.GQ2672029@frogsfrogsfrogs>

Hi everyone,

DO NOT MERGE THIS, STILL!

This is the third request for comments of a prototype to connect the
Linux fuse driver to fs-iomap for regular file IO operations to and from
files whose contents persist to locally attached storage devices.

Why would you want to do that? Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence. Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.

willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code. Eeeugh.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands. Pagecache
writeback is now a directio write. The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.
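
A toy userspace sketch of the "cached mappings mean zero upcalls for
rereads" idea above. This is not the fuse or iomap API: every toy_*
name, the fixed 1MiB extent size, and the round-robin eviction are
assumptions made up for illustration. The "server" is just a function
that counts how often it gets called.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_EXTENT_SIZE	((uint64_t)1 << 20)	/* assumed: 1MiB extents */
#define TOY_CACHE_SLOTS	8			/* assumed: tiny fixed cache */

/* Hypothetical, heavily simplified stand-in for a file mapping. */
struct toy_mapping {
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* number of bytes covered */
	uint64_t addr;		/* device address backing the range */
	int valid;		/* nonzero if this cache slot is in use */
};

/* Hypothetical per-inode state: a mapping cache plus an upcall counter. */
struct toy_inode {
	struct toy_mapping cache[TOY_CACHE_SLOTS];
	unsigned int next_victim;
	unsigned int upcalls;
};

/* Stand-in for the upcall to the fuse server; maps extents linearly. */
static struct toy_mapping toy_server_iomap_begin(struct toy_inode *ip,
						 uint64_t pos)
{
	struct toy_mapping m;

	ip->upcalls++;
	m.offset = pos & ~(TOY_EXTENT_SIZE - 1);
	m.length = TOY_EXTENT_SIZE;
	m.addr = ((uint64_t)1 << 30) + m.offset;
	m.valid = 1;
	return m;
}

/* Return the cached mapping covering pos, or NULL on a cache miss. */
static struct toy_mapping *toy_cache_lookup(struct toy_inode *ip, uint64_t pos)
{
	size_t i;

	for (i = 0; i < TOY_CACHE_SLOTS; i++) {
		struct toy_mapping *m = &ip->cache[i];

		if (m->valid && pos >= m->offset && pos - m->offset < m->length)
			return m;
	}
	return NULL;
}

/* Update a slot with the same start offset, else evict round-robin. */
static void toy_cache_upsert(struct toy_inode *ip, const struct toy_mapping *new)
{
	size_t i;

	for (i = 0; i < TOY_CACHE_SLOTS; i++) {
		if (ip->cache[i].valid && ip->cache[i].offset == new->offset) {
			ip->cache[i] = *new;
			return;
		}
	}
	ip->cache[ip->next_victim] = *new;
	ip->next_victim = (ip->next_victim + 1) % TOY_CACHE_SLOTS;
}

/* Look in the cache first; only a miss costs an upcall to the server. */
static struct toy_mapping toy_iomap_begin(struct toy_inode *ip, uint64_t pos)
{
	struct toy_mapping *hit = toy_cache_lookup(ip, pos);
	struct toy_mapping m;

	if (hit)
		return *hit;
	m = toy_server_iomap_begin(ip, pos);
	toy_cache_upsert(ip, &m);
	return m;
}
```

With a zeroed struct toy_inode, the first toy_iomap_begin() call for an
extent bumps ->upcalls, and later calls anywhere inside that extent are
served from the cache without touching the counter.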

With this RFC, I am able to show that it's possible to build a fuse
server for a real filesystem (ext4) that runs entirely in userspace yet
maintains most of its performance. At this stage I still get about 95%
of the kernel ext4 driver's streaming directio performance, and 110% of
its streaming buffered IO performance. Random buffered IO is about 85%
as fast as the kernel. Random direct IO is about 80% as fast as the
kernel; see the cover letter for the fuse2fs iomap changes for more
details. Unwritten extent conversions on random direct writes are
especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead. And that's with debugging turned on!

These items have been addressed since the first RFC:

1. The iomap cookie validation is now present, which avoids subtle races
between pagecache zeroing and writeback on filesystems that support
unwritten and delalloc mappings.

2. Mappings can be cached in the kernel for more speed.

3. iomap supports inline data.

4. I can now turn on fuse+iomap on a per-inode basis, which turned out
to be as easy as creating a new ->getattr_iflags callback so that the
fuse server can set fuse_attr::flags.

5. statx and syncfs work on iomap filesystems.

6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
is enabled.

7. The ext4 shutdown ioctl is now supported.
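
A toy sketch of the cookie validation idea from item 1, under stated
assumptions: every toy_* name is hypothetical, and the "cookie" is
modeled as a plain per-inode sequence counter that gets bumped whenever
a mapping changes. The point is only that a mapping sampled earlier has
to be rechecked (and resampled if stale) before anyone acts on its type.

```c
#include <stdint.h>

enum toy_map_type { TOY_UNWRITTEN, TOY_MAPPED };

/* Hypothetical inode state; map_seq is bumped on every mapping change. */
struct toy_inode {
	enum toy_map_type type;
	uint64_t map_seq;
};

/* Hypothetical mapping snapshot plus the cookie it was sampled under. */
struct toy_iomap {
	enum toy_map_type type;
	uint64_t cookie;
};

/* Sample the current mapping and remember which generation it came from. */
static void toy_sample_mapping(const struct toy_inode *ip,
			       struct toy_iomap *map)
{
	map->type = ip->type;
	map->cookie = ip->map_seq;
}

/* Model writeback completion: convert to written and bump the cookie. */
static void toy_convert_unwritten(struct toy_inode *ip)
{
	ip->type = TOY_MAPPED;
	ip->map_seq++;
}

/* A snapshot is only trustworthy while its cookie matches the inode. */
static int toy_mapping_valid(const struct toy_inode *ip,
			     const struct toy_iomap *map)
{
	return map->cookie == ip->map_seq;
}

/*
 * What a zeroing pass would ask before deciding how to treat a range:
 * revalidate first, resample if the snapshot went stale, then answer
 * from the fresh mapping instead of the old one.
 */
static int toy_range_is_unwritten(const struct toy_inode *ip,
				  struct toy_iomap *map)
{
	if (!toy_mapping_valid(ip, map))
		toy_sample_mapping(ip, map);
	return map->type == TOY_UNWRITTEN;
}
```

Without the cookie check, a caller holding a snapshot taken before
toy_convert_unwritten() would still believe the range is unwritten,
which is the kind of stale decision the validation is meant to catch.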

There are some major warts remaining:

a. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.

b. iomap is an inode-based service, not a file-based service. This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid
to index its incore inode, so we have to pass those too so that
notifications work properly. This is related to item c below:

c. Hardlinks and iomap are not possible for upper-level libfuse clients
because the upper level libfuse likes to abstract kernel nodeids with
its own homebrew dirent/inode cache, which doesn't understand hardlinks.
As a result, a hardlinked file results in two distinct struct inodes in
the kernel, which completely breaks iomap's locking model. I will have
to rewrite fuse2fs for the lowlevel libfuse library to make this work,
but on the plus side there will be far less path lookup overhead.

d. There are too many changes to the IO manager in libext2fs because I
built things needed to stage the direct/buffered IO paths separately.
These are now unnecessary but I haven't pulled them out yet because
they're sort of useful to verify that iomap file IO never goes through
libext2fs except for inline data.

e. If we're going to use fuse servers as "safe" replacements for kernel
filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
We also need to disable the OOM killer(s) for fuse servers because you
don't want filesystems to unmount abruptly.

f. How do we maximally contain the fuse server to have safe filesystem
mounts? It's very convenient to use systemd services to configure
isolation declaratively, but fuse2fs still needs to be able to open
/dev/fuse, the ext4 block device, and call mount() in the shared
namespace. This prevents us from using most of the stronger systemd
protections because they tend to run in a private mount namespace with
various parts of the filesystem either hidden or readonly.

In theory one could design a socket protocol to pass mount options,
block device paths, fds, and responsibility for the mount() call between
a mount helper and a service:

e2fsprogs would define a systemd socket service for fuse2fs that sets
up a dynamic unprivileged user, no network access, and no access to the
host's filesystem aside from readonly access to the root filesystem.

The mount helper (e.g. mount.safe) would then connect to the magic
socket and pass the CLI arguments to the fuse2fs service. The service
would parse the arguments, find the block device paths, and feed them
back through the socket to mount.safe. mount.safe would open them and
pass fds back to the fuse2fs service. The service would then open the
devices, parse the superblock, and if everything was ok, request a mount
through the socket. The mount helper would then open /dev/fuse and
mount the filesystem, and if successful, pass the /dev/fuse fd through
the socket to the fuse2fs server. At that point the fuse2fs server
would attach to the /dev/fuse device and handle the usual events.

Finally we'd have to train people/daemons to run "mount -t safe.ext4
/dev/sda1 /mnt" to get the contained version of ext4.

(Yeah, #f is all Neal. ;))

g. fuse2fs doesn't support the ext4 journal. Urk.
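
The fd handoff in item f (mount helper opens the block device and
/dev/fuse, then hands the fds to the contained service) would most
likely ride on SCM_RIGHTS ancillary messages over an AF_UNIX socket.
Below is a minimal sketch of just that building block. The send_fd()
and recv_fd() helper names and the one-byte payload are assumptions for
illustration; nothing here is an existing mount.safe or fuse2fs
protocol.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open fd over an AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* at least one byte of real data is sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {			/* union guarantees cmsghdr alignment */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	/* Only accept a single SCM_RIGHTS message carrying one fd. */
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In the scheme sketched in item f, the helper side would call send_fd()
once per opened device and once more for /dev/fuse after a successful
mount(), and the service side would call recv_fd() to pick each one up.
The received fd is a new descriptor in the receiving process that refers
to the same open file, so the service never needs path access to the
device nodes.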

I'll work on these in July/August, but for now here's an unmergeable RFC
to start some discussion.

--Darrick