From: "Darrick J. Wong" <djwong@kernel.org>
To: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	John@groves.net, bernd@bsbernd.com, miklos@szeredi.hu,
	joannelkoong@gmail.com, Josef Bacik <josef@toxicpanda.com>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	Theodore Ts'o <tytso@mit.edu>
Subject: Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4
Date: Tue, 10 Jun 2025 23:00:40 -0700
Message-ID: <20250611060040.GC6138@frogsfrogsfrogs>
In-Reply-To: <CAOQ4uxj4G_7E-Yba0hP2kpdeX17Fma0H-dB6Z8=BkbOWsF9NUg@mail.gmail.com>

On Tue, Jun 10, 2025 at 09:51:55PM +0200, Amir Goldstein wrote:
> On Tue, Jun 10, 2025 at 9:00 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote:
> > > On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> > > > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > DO NOT MERGE THIS.
> > > > > > > >
> > > > > > > > This is the very first request for comments of a prototype to connect
> > > > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > > > > from files whose contents persist to locally attached storage devices.
> > > > > > > >
> > > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > > server process.
> > > > > > > >
> > > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > > core code.  Eeeugh.
> > > > > > > >
> > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > > > > but solving that is for the next sprint.
> > > > > > > >
> > > > > > > > With this overly simplistic RFC, I aim to show that it's possible to
> > > > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > > > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > > > > > > and 110% of its streaming buffered IO performance.
> > > > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > > > > >
> > > > > > >
> > > > > > > Very cool!
> > > > > > >
> > > > > > > > There are some major warts remaining:
> > > > > > > >
> > > > > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > > > > support unwritten and delalloc mappings.
> > > > > > > >
> > > > > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > > > > >
> > > > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > > > > yet figured out how inline data is supposed to work.
> > > > > > > >
> > > > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > > > > inode it just read.
> > > > > > >
> > > > > > > Can you make the decision about enabling iomap on lookup?
> > > > > > > The plan for passthrough for inode operations was to allow
> > > > > > > setting up passthrough config of inode on lookup.
> > > > > >
> > > > > > The main requirement (especially for buffered IO) is that we've set the
> > > > > > address space operations structure either to the regular fuse one or to
> > > > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > > > > code assumes that cannot change on a live inode.
> > > > > >
> > > > > > So I /think/ we could ask the fuse server at inode instantiation time
> > > > > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > > > > to userspace at that time.  Alternately I guess we could extend struct
> > > > > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > > > > >
> > > > >
> > > > > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > > > > which is in the responses of FUSE_LOOKUP,
> > > > > FUSE_READDIRPLUS, FUSE_CREATE, and FUSE_TMPFILE,
> > > > > which instantiate fuse inodes.
> > > > >
> > > > > There is a very hand wavy discussion about this at:
> > > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> > > > >
> > > > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > > > > command that uses the variable length file handle instead of nodeid
> > > > > as a key for the inode.
> > > > >
> > > > > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > > > > look at the gritty details of how best to extend all the relevant commands,
> > > > > so I hope I am not sending you down the wrong path.
> > > >
> > > > I found another twist to this story: the upper level libfuse3 library
> > > > assigns distinct nodeids for each directory entry.  These nodeids are
> > > > passed into the kernel and appear to be the basis for an iget5_locked call.
> > > > IOWs, each nodeid causes a struct fuse_inode to be created in the
> > > > kernel.
> > > >
> > > > For a single-linked file this is no big deal, but for a hardlink this
> > > > makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> > > > map to multiple kernel fuse_inode objects.  This /really/ breaks the
> > > > locking model of iomap, which assumes that there's one in-kernel inode
> > > > and that it can use i_rwsem to synchronize updates.
> > > >
> > > > So I'm going to have to find a way to deal with this.  I tried trivially
> > > > messing with libfuse nodeid assignment but that blew some assertion.
> > > > Maybe your LOOKUP_HANDLE thing would work.
> > > >
> > >
> > > Pull the emergency brake!
> > >
> > > In an amateur move, I did not look at fuse2fs.c before commenting on your
> > > work.
> > >
> > > High level fuse interface is not the right tool for the job.
> > > It's not even the easiest way to have written fuse2fs in the first place.
> >
> > At the time I thought it would minimize friction across multiple
> > operating systems' fuse implementations.
> >
> > > High-level fuse API addresses file system objects with full paths.
> > > This is good for writing simple virtual filesystems, but it is neither the
> > > correct nor the easiest choice for writing a userspace driver for ext4.
> >
> > Agreed, it's a *terrible* way to implement ext4.
> >
> > I think, however, that Ted would like to maintain compatibility with
> > macfuse and freebsd(?) so he's been resistant to rewriting the entire
> > program to work with the lowlevel library.
> >
> > That said, I decided just now to do some spelunking into those two fuse
> > ports and have discovered that freebsd[1] packages the same upstream
> > libfuse as linux, and macfuse[2] seems to vendor both libfuse 2 and 3.
> >
> > [1] https://wiki.freebsd.org/FUSEFS
> > [2] https://github.com/macfuse/macfuse
> >
> > Seeing as Debian 13 has killed off libfuse2 entirely, maybe I should
> > think about rewriting all of fuse2fs against the lowlevel library?  It's
> > really annoying to deal with all the problems of the current codebase.
> > I think I'll try to stabilize the current fuse+iomap code and then look
> > into a fuse2fs port.  What would we call it, fuse4fs? :D
> >
> > > Low-level fuse interface addresses filesystem objects by nodeid
> > > and requires the server to implement lookup(parent_nodeid, name)
> > > where the server gets to choose the nodeid (not libfuse).
> >
> > Does the nodeid for the root directory have to be FUSE_ROOT_ID?
> 
> Yeh, I think that's the case, otherwise FUSE_INIT would need to
> tell the kernel the root nodeid, because there is no lookup to
> return the root nodeid.
> 
> > I guess
> > for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> > which cannot be accessed from userspace anyway.
> >
> 
> As long as inode #1 is reserved it should be fine.
> We just need to refine the rules of the one-to-one mapping with
> this exception.

Or just make it so that passthrough_ino filesystems can specify the
rootdir inumber?
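
For comparison, here is a rough sketch of what the "one-to-one mapping with
an exception" rule quoted above might look like in a lowlevel server.  This
is not taken from fuse2fs or libfuse; nodeid_to_ino() and ino_to_nodeid()
are hypothetical helper names, and the constants are redefined locally
(matching the values of FUSE_ROOT_ID, EXT2_BAD_INO, and EXT2_ROOT_INO) so
that the example stands alone.  It assumes every nodeid other than the root
is simply the ext2 inode number.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of well-known values so the sketch needs no fuse/ext2 headers. */
#ifndef FUSE_ROOT_ID
#define FUSE_ROOT_ID	1	/* nodeid the kernel uses for the root directory */
#endif
#ifndef EXT2_BAD_INO
#define EXT2_BAD_INO	1	/* bad blocks inode, never exposed to userspace */
#endif
#ifndef EXT2_ROOT_INO
#define EXT2_ROOT_INO	2	/* ext2 root directory inode */
#endif

/*
 * Hypothetical helper: translate a kernel nodeid into an ext2 inode number.
 * Returns false if the nodeid cannot correspond to any exposed inode.
 */
static bool nodeid_to_ino(uint64_t nodeid, uint32_t *ino)
{
	if (nodeid == FUSE_ROOT_ID) {
		/* The one exception: the kernel's root nodeid maps to inode 2. */
		*ino = EXT2_ROOT_INO;
		return true;
	}

	/*
	 * nodeid 0 is invalid, nodeid 2 would alias the root directory, and
	 * anything wider than 32 bits cannot be an ext2 inode number.
	 */
	if (nodeid == 0 || nodeid == EXT2_ROOT_INO || nodeid > UINT32_MAX)
		return false;

	*ino = (uint32_t)nodeid;
	return true;
}

/*
 * Hypothetical helper: pick the nodeid to hand back to the kernel for an
 * ext2 inode, e.g. when replying to a lookup.
 */
static bool ino_to_nodeid(uint32_t ino, uint64_t *nodeid)
{
	if (ino == EXT2_ROOT_INO) {
		*nodeid = FUSE_ROOT_ID;
		return true;
	}

	/* Inode 0 does not exist and the bad blocks inode stays hidden. */
	if (ino == 0 || ino == EXT2_BAD_INO)
		return false;

	*nodeid = ino;
	return true;
}
```

The two special cases go away if the server is allowed to tell the kernel
which inumber the root directory has, which is the alternative suggested
above.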

--D

> Thanks,
> Amir.
> 
