linux-fsdevel.vger.kernel.org archive mirror
From: "Darrick J. Wong" <djwong@kernel.org>
To: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	John@groves.net, bernd@bsbernd.com, miklos@szeredi.hu,
	joannelkoong@gmail.com, Josef Bacik <josef@toxicpanda.com>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	Theodore Ts'o <tytso@mit.edu>
Subject: Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
Date: Tue, 10 Jun 2025 12:00:26 -0700
Message-ID: <20250610190026.GA6134@frogsfrogsfrogs>
In-Reply-To: <CAOQ4uxgUVOLs070MyBpfodt12E0zjUn_SvyaCSJcm_M3SW36Ug@mail.gmail.com>

On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote:
> On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> > >  or
> > >
> > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > DO NOT MERGE THIS.
> > > > > >
> > > > > > This is the very first request for comments of a prototype to connect
> > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > > from files whose contents persist to locally attached storage devices.
> > > > > >
> > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > server process.
> > > > > >
> > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > core code.  Eeeugh.
> > > > > >
> > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > > but solving that is for the next sprint.
> > > > > >
> > > > > > > With this overly simplistic RFC, I aim to show that it's possible to
> > > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > > > get about 95% of the kernel ext4 driver's streaming directio performance,
> > > > > > > and 110% of its streaming buffered IO performance.
> > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > > >
> > > > >
> > > > > Very cool!
> > > > >
> > > > > > There are some major warts remaining:
> > > > > >
> > > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > > support unwritten and delalloc mappings.
> > > > > >
> > > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > > >
> > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > > yet figured out how inline data is supposed to work.
> > > > > >
> > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > > inode it just read.
> > > > >
> > > > > Can you make the decision about enabling iomap on lookup?
> > > > > The plan for passthrough for inode operations was to allow
> > > > > > setting up passthrough config of the inode on lookup.
> > > >
> > > > The main requirement (especially for buffered IO) is that we've set the
> > > > address space operations structure either to the regular fuse one or to
> > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > > code assumes that cannot change on a live inode.
> > > >
> > > > So I /think/ we could ask the fuse server at inode instantiation time
> > > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > > to userspace at that time.  Alternately I guess we could extend struct
> > > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > > >
> > >
> > > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > > which is in the responses of FUSE_LOOKUP,
> > > FUSE_READDIRPLUS, FUSE_CREATE, and FUSE_TMPFILE,
> > > the commands that instantiate fuse inodes.
> > >
> > > There is a very hand wavy discussion about this at:
> > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> > >
> > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > > command that uses the variable length file handle instead of nodeid
> > > as a key for the inode.
> > >
> > > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > > look at the gritty details of how best to extend all the relevant commands,
> > > so I hope I am not sending you down the wrong path.
> >
> > I found another twist to this story: the upper level libfuse3 library
> > assigns distinct nodeids for each directory entry.  These nodeids are
> > passed into the kernel and appear to be the basis for an iget5_locked call.
> > IOWs, each nodeid causes a struct fuse_inode to be created in the
> > kernel.
> >
> > For a single-linked file this is no big deal, but for a hardlink this
> > makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> > map to multiple kernel fuse_inode objects.  This /really/ breaks the
> > locking model of iomap, which assumes that there's one in-kernel inode
> > and that it can use i_rwsem to synchronize updates.
> >
> > So I'm going to have to find a way to deal with this.  I tried trivially
> > messing with libfuse nodeid assignment but that blew some assertion.
> > Maybe your LOOKUP_HANDLE thing would work.
> >
> 
> Pull the emergency brake!
> 
> In an amateur move, I did not look at fuse2fs.c before commenting on your
> work.
> 
> High level fuse interface is not the right tool for the job.
> It's not even the easiest way to have written fuse2fs in the first place.

At the time I thought it would minimize friction across multiple
operating systems' fuse implementations.

> The high-level fuse API addresses filesystem objects with full paths.
> This is good for writing simple virtual filesystems, but it is neither the
> correct nor the easiest choice for writing a userspace driver for ext4.

Agreed, it's a *terrible* way to implement ext4.

I think, however, that Ted would like to maintain compatibility with
macfuse and freebsd(?) so he's been resistant to rewriting the entire
program to work with the lowlevel library.

That said, I decided just now to do some spelunking into those two fuse
ports and have discovered that freebsd[1] packages the same upstream
libfuse as linux, and macfuse[2] seems to vendor both libfuse 2 and 3.

[1] https://wiki.freebsd.org/FUSEFS
[2] https://github.com/macfuse/macfuse

Seeing as Debian 13 has killed off libfuse2 entirely, maybe I should
think about rewriting all of fuse2fs against the lowlevel library?  It's
really annoying to deal with all the problems of the current codebase.
I think I'll try to stabilize the current fuse+iomap code and then look
into a fuse2fs port.  What would we call it, fuse4fs? :D

> Low-level fuse interface addresses filesystem objects by nodeid
> and requires the server to implement lookup(parent_nodeid, name)
> where the server gets to choose the nodeid (not libfuse).

Does the nodeid for the root directory have to be FUSE_ROOT_ID?  I guess
for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
which cannot be accessed from userspace anyway.
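
Assuming the answer is yes, a minimal sketch of the nodeid <-> ext2 inode
translation for a lowlevel server could look like the following.  The two
helper names are hypothetical; the constants are copied from the libfuse and
e2fsprogs headers only so that the snippet stands alone.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the libfuse and e2fsprogs headers for self-containment. */
#define FUSE_ROOT_ID   1u   /* nodeid the kernel uses for the mount root */
#define EXT2_BAD_INO   1u   /* badblocks inode, never visible to userspace */
#define EXT2_ROOT_INO  2u   /* ext2/3/4 root directory inode */

/* Hypothetical helper: nodeid received from the kernel -> ext2 inode number. */
static inline uint32_t nodeid_to_ino(uint64_t nodeid)
{
	if (nodeid == FUSE_ROOT_ID)
		return EXT2_ROOT_INO;
	return (uint32_t)nodeid;
}

/* Hypothetical helper: ext2 inode number -> nodeid handed to the kernel. */
static inline uint64_t ino_to_nodeid(uint32_t ino)
{
	/* The badblocks inode must never be exposed, so slot 1 is free. */
	assert(ino != EXT2_BAD_INO);
	if (ino == EXT2_ROOT_INO)
		return FUSE_ROOT_ID;
	return ino;
}
```

Every inode other than the root keeps nodeid == inode number, and the kernel
never sees nodeid 2 because the root is always reported as FUSE_ROOT_ID.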

> The current fuse2fs code has to go to some effort to convert from a full
> path to inode + name using ext2fs_namei().
> 
> With the low-level fuse API, op_lookup() could have used the native
> ext2_lookup(), which would have been much more natural.
> 
> You can find the most featureful low-level fuse example at:
> https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc
> 
> Among other things, the server has an inode cache, where an inode
> has in its state 'nopen' (was this inode opened for io) and 'backing_id'
> (was this inode mapped for kernel passthrough).
> 
> Currently this backing_id mapping is only made on first open of inode,
> but the plan is to do that also at lookup time, for example, if the
> iomap mode for the inode can be determined at lookup time.

<nod>
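
A rough sketch of what such an inode cache could look like in a lowlevel
server, keyed by ext2 inode number so that every hardlink resolves to the same
entry.  All of the names below are hypothetical and only loosely modelled on
the description above; passthrough_hp.cc remains the real reference.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_BUCKETS 64u

/* Hypothetical per-inode server state. */
struct srv_inode {
	uint32_t ino;           /* ext2 inode number, doubles as the cache key */
	uint64_t nlookup;       /* lookup references handed to the kernel */
	unsigned int nopen;     /* number of open files on this inode */
	int backing_id;         /* 0 until a kernel mapping has been set up */
	struct srv_inode *next; /* hash chain */
};

static struct srv_inode *buckets[CACHE_BUCKETS];

/* Find or create the single cache entry for an inode number. */
static struct srv_inode *srv_inode_get(uint32_t ino)
{
	struct srv_inode **head = &buckets[ino % CACHE_BUCKETS];
	struct srv_inode *si;

	for (si = *head; si; si = si->next)
		if (si->ino == ino)
			goto found;

	si = calloc(1, sizeof(*si));
	if (!si)
		return NULL;
	si->ino = ino;
	si->next = *head;
	*head = si;
found:
	si->nlookup++;
	return si;
}

/* Drop lookup references, as a forget handler would; freeing is omitted. */
static void srv_inode_put(struct srv_inode *si, uint64_t count)
{
	assert(si->nlookup >= count);
	si->nlookup -= count;
}
```

Because the key is the inode number rather than the path, two directory
entries that point at the same ext2 inode share one entry, and therefore one
nodeid and one in-kernel inode.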

> > > > > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > > > > actually works correctly.
> > > > > >
> > > > > > 6. iomap is an inode-based service, not a file-based service.  This
> > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > notifications work properly.
> > > > > >
> > > > >
> > > > > Again, I might be missing something, but as long as the fuse filesystem
> > > > > is exposing a single backing filesystem, it should be possible to make
> > > > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > > > > inode number.
> > > > > See sketch in this WIP branch:
> > > > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
> > > >
> > > > I think this would work in many places, except for filesystems with
> > > > 64-bit inumbers on 32-bit machines.  That might be a good argument for
> > > > continuing to pass along the nodeid and fuse_inode::orig_ino like it
> > > > does now.  Plus there are some filesystems that synthesize inode numbers
> > > > so tying the two together might not be feasible/desirable anyway.
> > > >
> > > > Though one nice feature of letting fuse have its own nodeids might be
> > > > that if the in-memory index switches to a tree structure, then it could
> > > > be more compact if the filesystem's inumbers are fairly sparse like xfs.
> > > > OTOH the current inode hashtable has been around for a very long time so
> > > > that might not be a big concern.  For fuse2fs it doesn't matter since
> > > > ext4 inumbers are u32.
> > > >
> > >
> > > I wanted to see if declaring one-to-one 64bit ino can simplify things
> > > for the first version of inode ops passthrough.
> > > If this is not the case, or if this is too much of a limitation for
> > > your use case, then never mind.
> > > But if it is a good enough shortcut for the demo and can be extended later,
> > > then why not.
> >
> > It's very tempting, because it's very confusing to have nodeids and
> > stat st_ino not be the same thing.
> >
> 
> Now that I have explained that fuse2fs should be low-level, it should be
> trivial to claim that it should have no problem to declare via
> FUSE_PASSTHROUGH_INO flag to the kernel that nodeid == st_ino,
> because I see no reason to implement fuse2fs with non one-to-one
> mapping of ino <==> nodeid.

Agreed!  Thanks for the nudge!

Let's see what Ted thinks when he returns from vacation. :)

--D
