public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: tytso@mit.edu
Cc: John@groves.net, bernd@bsbernd.com,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	miklos@szeredi.hu, joannelkoong@gmail.com, neal@gompa.dev
Subject: [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance
Date: Wed, 20 Aug 2025 17:49:53 -0700	[thread overview]
Message-ID: <175573713645.21970.9783397720493472605.stgit@frogsfrogsfrogs> (raw)
In-Reply-To: <20250821003720.GA4194186@frogsfrogsfrogs>

Hi all,

Switch fuse2fs to use the new iomap file data IO paths instead of
pushing it very slowly through the /dev/fuse connection.  For local
filesystems, all we have to do is respond to requests for file to device
mappings; the rest of the IO hot path stays within the kernel.  This
means that we can get rid of all file data block processing within
fuse2fs.

Because we're not pinning dirty pages through a potentially slow network
connection, we don't need the heavy BDI throttling for which most fuse
servers have become infamous.  Yes, mapping lookups for writeback can
stall, but mappings are small as compared to data and this situation
exists for all kernel filesystems as well.

The performance of this new data path is quite stunning: on a warm
system, streaming reads and writes through the pagecache go from
60-90MB/s to 2-2.5GB/s.  Direct IO reads and writes improve from the
same baseline to 2.5-8GB/s.  FIEMAP and SEEK_DATA/SEEK_HOLE now work
too.  The kernel ext4 driver can manage about 1.6GB/s for pagecache IO
and about 2.6-8.5GB/s, which means that fuse2fs is about as fast as the
kernel for streaming file IO.

Random 4k buffered IO is not so good: plain fuse2fs pokes along at
25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s.  The kernel
can do 900-1300MB/s.  Random directio is worse: plain fuse2fs does
20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does
40-55MB/s.  I suspect that metadata heavy workloads do not perform well
on fuse2fs because libext2fs wasn't designed for that and it doesn't
even have a journal to absorb all the fsync writes.  We also probably
need iomap caching really badly.

These performance numbers are slanted: my machine is 12 years old, and
fuse2fs is VERY poorly optimized for performance.  It contains a single
Big Filesystem Lock which nukes multi-threaded scalability.  There's no
inode cache nor is there a proper buffer cache, which means that fuse2fs
reads metadata in from disk and checksums it on EVERY ACCESS.  Sad!

Despite these gaps, this RFC demonstrates that it's feasible to run the
metadata parsing parts of a filesystem in userspace while not
sacrificing much performance.  We now have a vehicle to move the
filesystems out of the kernel, where they can be containerized so that
malicious filesystems can be contained, somewhat.

iomap mode also calls FUSE_DESTROY before unmounting the filesystem, so
for capable systems, fuse2fs doesn't need to run in fuseblk mode
anymore.

However, there are some major warts remaining:

1. The iomap cookie validation is not present, which can lead to subtle
races between pagecache zeroing and writeback on filesystems that
support unwritten and delalloc mappings.

2. Mappings ought to be cached in the kernel for more speed.

3. iomap doesn't support things like fscrypt or fsverity, and I haven't
yet figured out how inline data is supposed to work.

4. I would like to be able to turn on fuse+iomap on a per-inode basis,
which currently isn't possible because the kernel fuse driver will iget
inodes prior to calling FUSE_GETATTR to discover the properties of the
inode it just read.

5. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.

6. iomap is an inode-based service, not a file-based service.  This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
to index its incore inode, so we have to pass those too so that
notifications work properly.

I'll work on these in June, but for now here's an unmergeable RFC to
start some discussion.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap
---
Commits in this patchset:
 * fuse2fs: implement bare minimum iomap for file mapping reporting
 * fuse2fs: add iomap= mount option
 * fuse2fs: implement iomap configuration
 * fuse2fs: register block devices for use with iomap
 * fuse2fs: implement directio file reads
 * fuse2fs: add extent dump function for debugging
 * fuse2fs: implement direct write support
 * fuse2fs: turn on iomap for pagecache IO
 * fuse2fs: don't zero bytes in punch hole
 * fuse2fs: don't do file data block IO when iomap is enabled
 * fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
 * fuse2fs: enable file IO to inline data files
 * fuse2fs: set iomap-related inode flags
 * fuse2fs: add strictatime/lazytime mount options
 * fuse2fs: configure block device block size
 * fuse4fs: don't use inode number translation when possible
 * fuse4fs: separate invalidation
 * fuse2fs: implement statx
 * fuse2fs: enable atomic writes
---
 configure       |   46 +
 configure.ac    |   31 +
 lib/config.h.in |    3 
 misc/fuse2fs.c  | 1777 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 misc/fuse4fs.c  | 1741 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 3556 insertions(+), 42 deletions(-)


  parent reply	other threads:[~2025-08-21  0:49 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-21  0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
2025-08-21  0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
2025-08-21  1:08   ` [PATCH 01/20] fuse2fs: port fuse2fs to lowlevel libfuse API Darrick J. Wong
2025-08-21  1:08   ` [PATCH 02/20] fuse4fs: drop fuse 2.x support code Darrick J. Wong
2025-08-21  1:08   ` [PATCH 03/20] fuse4fs: namespace some helpers Darrick J. Wong
2025-08-21  1:08   ` [PATCH 04/20] fuse4fs: convert to low level API Darrick J. Wong
2025-08-21  1:09   ` [PATCH 05/20] libsupport: port the kernel list.h to libsupport Darrick J. Wong
2025-08-21  1:09   ` [PATCH 06/20] libsupport: add a cache Darrick J. Wong
2025-08-21  1:09   ` [PATCH 07/20] cache: disable debugging Darrick J. Wong
2025-08-21  1:09   ` [PATCH 08/20] cache: use modern list iterator macros Darrick J. Wong
2025-08-21  1:10   ` [PATCH 09/20] cache: embed struct cache in the owner Darrick J. Wong
2025-08-21  1:10   ` [PATCH 10/20] cache: pass cache pointer to callbacks Darrick J. Wong
2025-08-21  1:10   ` [PATCH 11/20] cache: pass a private data pointer through cache_walk Darrick J. Wong
2025-08-21  1:11   ` [PATCH 12/20] cache: add a helper to grab a new refcount for a cache_node Darrick J. Wong
2025-08-21  1:11   ` [PATCH 13/20] cache: return results of a cache flush Darrick J. Wong
2025-08-21  1:11   ` [PATCH 14/20] cache: add a "get only if incore" flag to cache_node_get Darrick J. Wong
2025-08-21  1:11   ` [PATCH 15/20] cache: support gradual expansion Darrick J. Wong
2025-08-21  1:12   ` [PATCH 16/20] cache: implement automatic shrinking Darrick J. Wong
2025-08-21  1:12   ` [PATCH 17/20] fuse4fs: add cache to track open files Darrick J. Wong
2025-08-21  1:12   ` [PATCH 18/20] fuse4fs: use the orphaned inode list Darrick J. Wong
2025-08-21  1:12   ` [PATCH 19/20] fuse4fs: implement FUSE_TMPFILE Darrick J. Wong
2025-08-21  1:13   ` [PATCH 20/20] fuse4fs: create incore reverse orphan list Darrick J. Wong
2025-08-21  0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
2025-08-21  1:13   ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
2025-08-21  1:13   ` [PATCH 02/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
2025-08-21  1:13   ` [PATCH 03/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
2025-08-21  1:14   ` [PATCH 04/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
2025-08-21  1:14   ` [PATCH 05/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
2025-08-21  1:14   ` [PATCH 06/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
2025-08-21  1:14   ` [PATCH 07/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
2025-08-21  1:15   ` [PATCH 08/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
2025-08-21  1:15   ` [PATCH 09/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
2025-08-21  1:15   ` [PATCH 10/10] libext2fs: add posix advisory locking to the unix IO manager Darrick J. Wong
2025-08-21  0:49 ` Darrick J. Wong [this message]
2025-08-21  1:15   ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
2025-08-21  1:16   ` [PATCH 02/19] fuse2fs: add iomap= mount option Darrick J. Wong
2025-08-21  1:16   ` [PATCH 03/19] fuse2fs: implement iomap configuration Darrick J. Wong
2025-08-21  1:16   ` [PATCH 04/19] fuse2fs: register block devices for use with iomap Darrick J. Wong
2025-08-21  1:17   ` [PATCH 05/19] fuse2fs: implement directio file reads Darrick J. Wong
2025-08-21  1:17   ` [PATCH 06/19] fuse2fs: add extent dump function for debugging Darrick J. Wong
2025-08-21  1:17   ` [PATCH 07/19] fuse2fs: implement direct write support Darrick J. Wong
2025-08-21  1:17   ` [PATCH 08/19] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
2025-08-21  1:18   ` [PATCH 09/19] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
2025-08-21  1:18   ` [PATCH 10/19] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
2025-08-21  1:18   ` [PATCH 11/19] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
2025-08-21  1:18   ` [PATCH 12/19] fuse2fs: enable file IO to inline data files Darrick J. Wong
2025-08-21  1:19   ` [PATCH 13/19] fuse2fs: set iomap-related inode flags Darrick J. Wong
2025-08-21  1:19   ` [PATCH 14/19] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
2025-08-21  1:19   ` [PATCH 15/19] fuse2fs: configure block device block size Darrick J. Wong
2025-08-21  1:19   ` [PATCH 16/19] fuse4fs: don't use inode number translation when possible Darrick J. Wong
2025-08-21  1:20   ` [PATCH 17/19] fuse4fs: separate invalidation Darrick J. Wong
2025-08-21  1:20   ` [PATCH 18/19] fuse2fs: implement statx Darrick J. Wong
2025-08-21  1:20   ` [PATCH 19/19] fuse2fs: enable atomic writes Darrick J. Wong
2025-08-21  0:50 ` [PATCHSET RFC v4 4/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-08-21  1:20   ` [PATCH 1/2] fuse2fs: enable caching of iomaps Darrick J. Wong
2025-08-21  1:21   ` [PATCH 2/2] fuse2fs: be smarter about caching iomaps Darrick J. Wong
2025-08-21  0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-08-21  1:21   ` [PATCH 1/8] fuse2fs: skip permission checking on utimens " Darrick J. Wong
2025-08-21  1:21   ` [PATCH 2/8] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
2025-08-21  1:21   ` [PATCH 3/8] fuse2fs: better debugging for file mode updates Darrick J. Wong
2025-08-21  1:22   ` [PATCH 4/8] fuse2fs: debug timestamp updates Darrick J. Wong
2025-08-21  1:22   ` [PATCH 5/8] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
2025-08-21  1:22   ` [PATCH 6/8] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
2025-08-21  1:23   ` [PATCH 7/8] fuse2fs: enable syncfs Darrick J. Wong
2025-08-21  1:23   ` [PATCH 8/8] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
2025-08-21  0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
2025-08-21  1:23   ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
2025-08-21  1:23   ` [PATCH 2/6] iocache: add the actual buffer cache Darrick J. Wong
2025-08-21  1:24   ` [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses Darrick J. Wong
2025-08-21  1:24   ` [PATCH 4/6] fuse2fs: enable caching IO manager Darrick J. Wong
2025-08-21  1:24   ` [PATCH 5/6] fuse2fs: increase inode cache size Darrick J. Wong
2025-08-21  1:24   ` [PATCH 6/6] libext2fs: improve caching for inodes Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=175573713645.21970.9783397720493472605.stgit@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=John@groves.net \
    --cc=bernd@bsbernd.com \
    --cc=joannelkoong@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=neal@gompa.dev \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox