From: "Darrick J. Wong" <djwong@kernel.org>
To: linux-fsdevel <linux-fsdevel@vger.kernel.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
Bernd Schubert <bernd@bsbernd.com>,
Joanne Koong <joannelkoong@gmail.com>,
John Groves <John@groves.net>, Josef Bacik <josef@toxicpanda.com>,
linux-ext4 <linux-ext4@vger.kernel.org>,
Theodore Ts'o <tytso@mit.edu>, Neal Gompa <neal@gompa.dev>,
Amir Goldstein <amir73il@gmail.com>,
Christian Brauner <brauner@kernel.org>,
Jeff Layton <jlayton@kernel.org>
Subject: [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4
Date: Wed, 20 Aug 2025 17:37:20 -0700 [thread overview]
Message-ID: <20250821003720.GA4194186@frogsfrogsfrogs> (raw)
Hi everyone,
Do not merge this, still!!
This is the fourth request for comments of a prototype to connect the
Linux fuse driver to fs-iomap for regular file IO operations to and from
files whose contents persist to locally attached storage devices.
Why would you want to do that? Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence. Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.
willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code. Eeeugh.
The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands. Pagecache
writeback is now a directio write. The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.
With this RFC, I am able to show that it's possible to build a fuse
server for a real filesystem (ext4) that runs entirely in userspace yet
maintains most of its performance. At this stage I still get about 95%
of the kernel ext4 driver's streaming directio performance on streaming
IO, and 110% of its streaming buffered IO performance. Random buffered
IO is about 85% as fast as the kernel. Random direct IO is about 80% as
fast as the kernel; see the cover letter for the fuse2fs iomap changes
for more details. Unwritten extent conversions on random direct writes
are especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead. And that's with (now dynamic) debugging turned on!
These items have been addressed since the third RFC:
1. fuse2fs has been forked into fuse4fs, which now talks to the low
level fuse interface. This avoids all the path walking that the
high level fuse library provides, which dramatically improves the
performance of fuse4fs. fstests runs in half the time now. Many
thanks to Amir Goldstein for giving me a rough draft of the
conversion!
2. I simplified the configuration protocols -- now there's a per-fs
bit to enable any iomap, and a per-inode bit to enable iomap on a
specific file. Registration of iomap devices now uses the backing
fd registration interface.
3. You can now specify the root nodeid for any fuse mount.
4. Atomic writes are working, at least for single fsblocks.
5. I've ported the cache implementation from xfsprogs to e2fsprogs
libsupport, so the inode and buffer caches can now dynamically grow
to support larger working sets. No more fixed-size caches!
6. Cleaned up the kernel/libfuse ABI quite a bit.
7. fstests passes 97% of the tests that run, when iomap is enabled!
Only 93% pass when iomap is disabled, and I think that's due to some
bugs in the ACL and mode handling code.
There are some major warts remaining:
a. I've a /much/ clearer picture of how one might containerize a
filesystem server, thanks to a lot of input from Christian Brauner
in response to v3. I think I have enough pieces to try setting up
a fd-passing interface into a systemd service ... but I haven't
actually written any of it yet.
b. fsdax isn't implemented. I think I'm going to work on this for
RFC v5 to see if we can simplify the file mapping handling in famfs.
If not, then everyone else gets fsdax for free.
c. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.
d. I've not yet consolidated struct fuse_inode, so the iomap gunk still
eats rather a lot of space per inode.
e. fuse2fs doesn't support the ext4 journal. Urk.
f. There's a VERY large quantity of fuse2fs improvements that need to be
applied before we get to the fuse-iomap parts. I'm not sending these
(or the fstests changes) to keep the size of the patchbomb at
"unreasonably large". :P
I'll work on these in August/Steptember, but for now here's an
unmergeable RFC to start some discussion.
--Darrick
next reply other threads:[~2025-08-21 0:37 UTC|newest]
Thread overview: 72+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-21 0:37 Darrick J. Wong [this message]
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
2025-08-21 1:08 ` [PATCH 01/20] fuse2fs: port fuse2fs to lowlevel libfuse API Darrick J. Wong
2025-08-21 1:08 ` [PATCH 02/20] fuse4fs: drop fuse 2.x support code Darrick J. Wong
2025-08-21 1:08 ` [PATCH 03/20] fuse4fs: namespace some helpers Darrick J. Wong
2025-08-21 1:08 ` [PATCH 04/20] fuse4fs: convert to low level API Darrick J. Wong
2025-08-21 1:09 ` [PATCH 05/20] libsupport: port the kernel list.h to libsupport Darrick J. Wong
2025-08-21 1:09 ` [PATCH 06/20] libsupport: add a cache Darrick J. Wong
2025-08-21 1:09 ` [PATCH 07/20] cache: disable debugging Darrick J. Wong
2025-08-21 1:09 ` [PATCH 08/20] cache: use modern list iterator macros Darrick J. Wong
2025-08-21 1:10 ` [PATCH 09/20] cache: embed struct cache in the owner Darrick J. Wong
2025-08-21 1:10 ` [PATCH 10/20] cache: pass cache pointer to callbacks Darrick J. Wong
2025-08-21 1:10 ` [PATCH 11/20] cache: pass a private data pointer through cache_walk Darrick J. Wong
2025-08-21 1:11 ` [PATCH 12/20] cache: add a helper to grab a new refcount for a cache_node Darrick J. Wong
2025-08-21 1:11 ` [PATCH 13/20] cache: return results of a cache flush Darrick J. Wong
2025-08-21 1:11 ` [PATCH 14/20] cache: add a "get only if incore" flag to cache_node_get Darrick J. Wong
2025-08-21 1:11 ` [PATCH 15/20] cache: support gradual expansion Darrick J. Wong
2025-08-21 1:12 ` [PATCH 16/20] cache: implement automatic shrinking Darrick J. Wong
2025-08-21 1:12 ` [PATCH 17/20] fuse4fs: add cache to track open files Darrick J. Wong
2025-08-21 1:12 ` [PATCH 18/20] fuse4fs: use the orphaned inode list Darrick J. Wong
2025-08-21 1:12 ` [PATCH 19/20] fuse4fs: implement FUSE_TMPFILE Darrick J. Wong
2025-08-21 1:13 ` [PATCH 20/20] fuse4fs: create incore reverse orphan list Darrick J. Wong
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
2025-08-21 1:13 ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
2025-08-21 1:13 ` [PATCH 02/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
2025-08-21 1:13 ` [PATCH 03/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
2025-08-21 1:14 ` [PATCH 04/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
2025-08-21 1:14 ` [PATCH 05/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
2025-08-21 1:14 ` [PATCH 06/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
2025-08-21 1:14 ` [PATCH 07/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
2025-08-21 1:15 ` [PATCH 08/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
2025-08-21 1:15 ` [PATCH 09/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
2025-08-21 1:15 ` [PATCH 10/10] libext2fs: add posix advisory locking to the unix IO manager Darrick J. Wong
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-08-21 1:15 ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
2025-08-21 1:16 ` [PATCH 02/19] fuse2fs: add iomap= mount option Darrick J. Wong
2025-08-21 1:16 ` [PATCH 03/19] fuse2fs: implement iomap configuration Darrick J. Wong
2025-08-21 1:16 ` [PATCH 04/19] fuse2fs: register block devices for use with iomap Darrick J. Wong
2025-08-21 1:17 ` [PATCH 05/19] fuse2fs: implement directio file reads Darrick J. Wong
2025-08-21 1:17 ` [PATCH 06/19] fuse2fs: add extent dump function for debugging Darrick J. Wong
2025-08-21 1:17 ` [PATCH 07/19] fuse2fs: implement direct write support Darrick J. Wong
2025-08-21 1:17 ` [PATCH 08/19] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
2025-08-21 1:18 ` [PATCH 09/19] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
2025-08-21 1:18 ` [PATCH 10/19] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
2025-08-21 1:18 ` [PATCH 11/19] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
2025-08-21 1:18 ` [PATCH 12/19] fuse2fs: enable file IO to inline data files Darrick J. Wong
2025-08-21 1:19 ` [PATCH 13/19] fuse2fs: set iomap-related inode flags Darrick J. Wong
2025-08-21 1:19 ` [PATCH 14/19] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
2025-08-21 1:19 ` [PATCH 15/19] fuse2fs: configure block device block size Darrick J. Wong
2025-08-21 1:19 ` [PATCH 16/19] fuse4fs: don't use inode number translation when possible Darrick J. Wong
2025-08-21 1:20 ` [PATCH 17/19] fuse4fs: separate invalidation Darrick J. Wong
2025-08-21 1:20 ` [PATCH 18/19] fuse2fs: implement statx Darrick J. Wong
2025-08-21 1:20 ` [PATCH 19/19] fuse2fs: enable atomic writes Darrick J. Wong
2025-08-21 0:50 ` [PATCHSET RFC v4 4/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-08-21 1:20 ` [PATCH 1/2] fuse2fs: enable caching of iomaps Darrick J. Wong
2025-08-21 1:21 ` [PATCH 2/2] fuse2fs: be smarter about caching iomaps Darrick J. Wong
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-08-21 1:21 ` [PATCH 1/8] fuse2fs: skip permission checking on utimens " Darrick J. Wong
2025-08-21 1:21 ` [PATCH 2/8] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
2025-08-21 1:21 ` [PATCH 3/8] fuse2fs: better debugging for file mode updates Darrick J. Wong
2025-08-21 1:22 ` [PATCH 4/8] fuse2fs: debug timestamp updates Darrick J. Wong
2025-08-21 1:22 ` [PATCH 5/8] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
2025-08-21 1:22 ` [PATCH 6/8] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
2025-08-21 1:23 ` [PATCH 7/8] fuse2fs: enable syncfs Darrick J. Wong
2025-08-21 1:23 ` [PATCH 8/8] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
2025-08-21 1:23 ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
2025-08-21 1:23 ` [PATCH 2/6] iocache: add the actual buffer cache Darrick J. Wong
2025-08-21 1:24 ` [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses Darrick J. Wong
2025-08-21 1:24 ` [PATCH 4/6] fuse2fs: enable caching IO manager Darrick J. Wong
2025-08-21 1:24 ` [PATCH 5/6] fuse2fs: increase inode cache size Darrick J. Wong
2025-08-21 1:24 ` [PATCH 6/6] libext2fs: improve caching for inodes Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250821003720.GA4194186@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=John@groves.net \
--cc=amir73il@gmail.com \
--cc=bernd@bsbernd.com \
--cc=brauner@kernel.org \
--cc=jlayton@kernel.org \
--cc=joannelkoong@gmail.com \
--cc=josef@toxicpanda.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=miklos@szeredi.hu \
--cc=neal@gompa.dev \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox