linux-fsdevel.vger.kernel.org archive mirror
From: "Darrick J. Wong" <djwong@kernel.org>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: John Groves <John@groves.net>,
	Dan Williams <dan.j.williams@intel.com>,
	Bernd Schubert <bschubert@ddn.com>,
	John Groves <jgroves@micron.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	Luis Henriques <luis@igalia.com>,
	Randy Dunlap <rdunlap@infradead.org>,
	Jeff Layton <jlayton@kernel.org>,
	Kent Overstreet <kent.overstreet@linux.dev>,
	Petr Vorel <pvorel@suse.cz>, Brian Foster <bfoster@redhat.com>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	Amir Goldstein <amir73il@gmail.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Stefan Hajnoczi <shajnocz@redhat.com>,
	Joanne Koong <joannelkoong@gmail.com>,
	Josef Bacik <josef@toxicpanda.com>,
	Aravind Ramesh <arramesh@micron.com>,
	Ajay Joshi <ajayjoshi@micron.com>
Subject: Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
Date: Wed, 14 May 2025 19:06:24 -0700
Message-ID: <20250515020624.GP1035866@frogsfrogsfrogs>
In-Reply-To: <CAJfpegt4drCVNomOLqcU8JHM+qLrO1JwaQbp69xnGdjLn5O6wA@mail.gmail.com>

On Tue, May 13, 2025 at 11:14:55AM +0200, Miklos Szeredi wrote:
> On Thu, 8 May 2025 at 17:56, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > Well right now my barely functional prototype exposes this interface
> > for communicating mappings to the kernel.  I've only gotten as far as
> > exposing the ->iomap_{begin,end} and ->iomap_ioend calls to the fuse
> > server with no caching, because the only functions I've implemented so
> > far are FIEMAP, SEEK_{DATA,HOLE}, and directio.
> >
> > So basically the kernel sends a FUSE_IOMAP_BEGIN command with the
> > desired (pos, count) file range to the fuse server, which responds with
> > a struct fuse_iomap_begin_out object that is translated into a struct
> > iomap.
> >
> > The fuse server then responds with a read mapping and a write mapping,
> > which tell the kernel from where to read data, and where to write data.
> 
> So far so good.
> 
> The iomap layer is non-caching, right?   This means that e.g. a
> direct_io request spanning two extents will result in two separate
> requests, since one FUSE_IOMAP_BEGIN can only return one extent.

Originally it wasn't supposed to be cached at all.  Then history taught
us a lesson. :P

In hindsight, there needs to be coordination of the space mapping
manipulations that go on between pagecache writes and reclaim writeback.
Pagecache write can get an unwritten iomap, then go to sleep while it
tries to get a folio.  In the meantime, writeback can find the folio for
that range, write it back to the disk (which converts unwritten to
written) and reclaim the folio.  Now the first process wakes up and
grabs a new folio.  Because its unwritten mapping is now stale, it must
not start zeroing that folio; it needs to go get a new mapping.

So iomap still doesn't need caching per se, but it needs writer threads
to revalidate the mapping after locking a folio.  The reason for caching
iomaps under the fuse_inode somewhere is that I don't want the
revalidations to have to jump all the way out to userspace with a folio
lock held.
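
To make the revalidation step concrete, here is a minimal userspace sketch
of the idea, assuming a per-inode sequence counter that is bumped whenever
a mapping changes.  Every name in it (demo_inode, demo_mapping, and the
demo_* helpers) is hypothetical and is not part of fuse or iomap; it only
models "sample a mapping, lose a race with writeback, notice the mapping
is stale before zeroing".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode state: bumped on every mapping change. */
struct demo_inode {
	uint64_t map_seq;
};

enum demo_map_type { DEMO_UNWRITTEN, DEMO_WRITTEN };

/* Hypothetical cached mapping, stamped with the sequence it was sampled at. */
struct demo_mapping {
	uint64_t offset;
	uint64_t length;
	enum demo_map_type type;
	uint64_t seq;
};

/* Sample a mapping; stands in for the upcall that returns an extent. */
static struct demo_mapping demo_map_begin(const struct demo_inode *ip,
					  uint64_t pos, uint64_t len,
					  enum demo_map_type type)
{
	struct demo_mapping map = {
		.offset = pos,
		.length = len,
		.type = type,
		.seq = ip->map_seq,
	};

	assert(len > 0);
	return map;
}

/* Writeback converted unwritten to written: all older samples go stale. */
static void demo_writeback_convert(struct demo_inode *ip)
{
	ip->map_seq++;
}

/* Called after the folio is locked: is the sampled mapping still current? */
static bool demo_mapping_valid(const struct demo_inode *ip,
			       const struct demo_mapping *map)
{
	return map->seq == ip->map_seq;
}

/* Zeroing is only safe under a mapping that is both current and unwritten. */
static bool demo_may_zero_folio(const struct demo_inode *ip,
				const struct demo_mapping *map)
{
	return demo_mapping_valid(ip, map) && map->type == DEMO_UNWRITTEN;
}
```

A writer that sees demo_mapping_valid() fail would drop the folio and go
fetch a fresh mapping.  The point of keeping the counter (or a cache of
mappings) in the kernel is that this check is a cheap comparison rather
than a round trip to userspace under the folio lock.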

That said, on a VM on this 12-year-old workstation, I can get about
2.0GB/s direct writes in fuse2fs and 2.2GB/s in kernel ext4, and that's
with initiating iomap_begin/end/ioends with no caching of the mappings.
Pagecache writes run at about 1.9GB/s through fuse2fs and 1.5GB/s
through the kernel, but only if I tweak fuse to use large folios and a
relatively unconstrained bdi.  2GB/s might be enough IO for anyone. ;)

> And the next direct_io request may need to repeat the query for the
> same extent as the previous one if the I/O boundary wasn't on the
> extent boundary (which is likely).
> 
> So some sort of caching would make sense, but seeing the multitude of
> FUSE_IOMAP_OP_ types I'm not clearly seeing how that would look.

Yeah, it's confusing.  The design doc tries to clarify this, but this is
roughly what we need for fuse:

FUSE_IOMAP_OP_WRITE being set means we're writing to the file.
FUSE_IOMAP_OP_ZERO being set means we're zeroing the file.
Neither of those being set means we're reading the file.

(3 different operations)

FUSE_IOMAP_OP_DIRECT being set means directio, and it not being set
means pagecache.

(and one flag, for 6 different types of IO)

FUSE_IOMAP_OP_REPORT is set all by itself for things like FIEMAP and
SEEK_DATA/HOLE.
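
The flag combinations above can be sketched as a small decoder.  This is
only an illustration: the DEMO_OP_* names and bit values are placeholders
chosen for the example, not the actual FUSE_IOMAP_OP_* definitions, and
checking WRITE before ZERO when both are set is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Placeholder bit values for illustration only. */
#define DEMO_OP_WRITE	(1u << 0)
#define DEMO_OP_ZERO	(1u << 1)
#define DEMO_OP_REPORT	(1u << 2)
#define DEMO_OP_DIRECT	(1u << 4)

/* Map an op flag word onto the IO types described above. */
static const char *demo_describe_op(unsigned int flags)
{
	bool direct = (flags & DEMO_OP_DIRECT) != 0;

	/* REPORT is set all by itself (FIEMAP, SEEK_DATA/HOLE). */
	if (flags == DEMO_OP_REPORT)
		return "report";

	/* Three operations, each either directio or pagecache. */
	if (flags & DEMO_OP_WRITE)
		return direct ? "directio write" : "pagecache write";
	if (flags & DEMO_OP_ZERO)
		return direct ? "directio zero" : "pagecache zero";
	return direct ? "directio read" : "pagecache read";
}
```

So a flag word of 0 decodes as a pagecache read, and DEMO_OP_WRITE |
DEMO_OP_DIRECT decodes as a directio write.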

> > I'm a little confused, are you talking about FUSE_NOTIFY_INVAL_INODE?
> > If so, then I think that's the wrong layer -- INVAL_INODE invalidates
> > the page cache, whereas I'm talking about caching the file space
> > mappings that iomap uses to construct bios for disk IO, and possibly
> > wanting to invalidate parts of that cache to force the kernel to upcall
> > the fuse server for a new mapping.
> 
> Maybe I'm confused, as the layering is not very clear in my head yet.
> 
> But in your example you did say that invalidation of data as well as
> mapping needs to be invalidated, so I thought that the simplest thing
> to do is to just invalidate the cached mapping from
> FUSE_NOTIFY_INVAL_INODE as well.

For now I want to keep the two invalidation types separate while I build
out more of the prototype so that I can be more sure that I haven't
broken any existing code. :)

The mapping invalidation might be more useful for things like FICLONE on
weird filesystems where the file allocation unit size is larger than the
block size and we actually need to invalidate more mappings than the vfs
knows about.
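
As a rough illustration of why the invalidated range can be larger than
what the vfs asked for, here is a sketch that widens a byte range outward
to allocation unit boundaries.  The demo_range type and the helper are
hypothetical, the range is half-open [start, end), and overflow near
UINT64_MAX is deliberately ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open byte range: [start, end). */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Widen (pos, len) so that both ends land on allocation unit boundaries.
 * A server with a large allocation unit would invalidate this wider range.
 */
static struct demo_range demo_round_to_alloc_unit(uint64_t pos, uint64_t len,
						  uint64_t unit)
{
	struct demo_range r;
	uint64_t end = pos + len;
	uint64_t rem = end % unit;

	assert(unit > 0 && len > 0);

	r.start = pos - (pos % unit);			/* round down */
	r.end = rem ? end + (unit - rem) : end;		/* round up */
	return r;
}
```

With a 64k allocation unit, a 4k change at offset 4k widens to the whole
first 64k of the file, which is more than a caller thinking in blocks
would expect.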

But I'm only 80% sure of that, as I'm still figuring out how to create a
notification and send it from fuse2fs and haven't gotten to the caching
layer yet.

--D

> Thanks,
> Miklos
> 
