public inbox for linux-doc@vger.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: John Groves <John@groves.net>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Bernd Schubert <bschubert@ddn.com>,
	John Groves <jgroves@micron.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	Luis Henriques <luis@igalia.com>,
	Randy Dunlap <rdunlap@infradead.org>,
	Jeff Layton <jlayton@kernel.org>,
	Kent Overstreet <kent.overstreet@linux.dev>,
	Petr Vorel <pvorel@suse.cz>, Brian Foster <bfoster@redhat.com>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	Amir Goldstein <amir73il@gmail.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Stefan Hajnoczi <shajnocz@redhat.com>,
	Joanne Koong <joannelkoong@gmail.com>,
	Josef Bacik <josef@toxicpanda.com>,
	Aravind Ramesh <arramesh@micron.com>,
	Ajay Joshi <ajayjoshi@micron.com>,
	0@groves.net
Subject: Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
Date: Mon, 28 Apr 2025 12:00:10 -0700	[thread overview]
Message-ID: <20250428190010.GB1035866@frogsfrogsfrogs> (raw)
In-Reply-To: <5rwwzsya6f7dkf4de2uje2b3f6fxewrcl4nv5ba6jh6chk36f3@ushxiwxojisf>

On Sun, Apr 27, 2025 at 08:48:30PM -0500, John Groves wrote:
> On 25/04/24 07:38AM, Darrick J. Wong wrote:
> > On Thu, Apr 24, 2025 at 08:43:33AM -0500, John Groves wrote:
> > > On 25/04/20 08:33PM, John Groves wrote:
> > > > On completion of GET_FMAP message/response, setup the full famfs
> > > > metadata such that it's possible to handle read/write/mmap directly to
> > > > dax. Note that the devdax_iomap plumbing is not in yet...
> > > > 
> > > > Update MAINTAINERS for the new files.
> > > > 
> > > > Signed-off-by: John Groves <john@groves.net>
> > > > ---
> > > >  MAINTAINERS               |   9 +
> > > >  fs/fuse/Makefile          |   2 +-
> > > >  fs/fuse/dir.c             |   3 +
> > > >  fs/fuse/famfs.c           | 344 ++++++++++++++++++++++++++++++++++++++
> > > >  fs/fuse/famfs_kfmap.h     |  63 +++++++
> > > >  fs/fuse/fuse_i.h          |  16 +-
> > > >  fs/fuse/inode.c           |   2 +-
> > > >  include/uapi/linux/fuse.h |  42 +++++
> > > >  8 files changed, 477 insertions(+), 4 deletions(-)
> > > >  create mode 100644 fs/fuse/famfs.c
> > > >  create mode 100644 fs/fuse/famfs_kfmap.h
> > > > 
> > 
> > <snip>
> > 
> > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > index d85fb692cf3b..0f6ff1ffb23d 100644
> > > > --- a/include/uapi/linux/fuse.h
> > > > +++ b/include/uapi/linux/fuse.h
> > > > @@ -1286,4 +1286,46 @@ struct fuse_uring_cmd_req {
> > > >  	uint8_t padding[6];
> > > >  };
> > > >  
> > > > +/* Famfs fmap message components */
> > > > +
> > > > +#define FAMFS_FMAP_VERSION 1
> > > > +
> > > > +#define FUSE_FAMFS_MAX_EXTENTS 2
> > > > +#define FUSE_FAMFS_MAX_STRIPS 16
> > > 
> > > FYI, after thinking through the conversation with Darrick,  I'm planning 
> > > to drop FUSE_FAMFS_MAX_(EXTENTS|STRIPS) in the next version.  In the 
> > > response to GET_FMAP, it's the structures below serialized into a message 
> > > buffer. If it fits, it's good - and if not it's invalid. When the
> > > in-memory metadata (defined in famfs_kfmap.h) gets assembled, if there is
> > > a reason to apply limits it can be done - but I don't currently see a reason
> > > do to that (so if I'm currently enforcing limits there, I'll probably drop
> > > that.
> > 
> > You could also define GET_FMAP to have an offset in the request buffer,
> > and have the famfs daemon send back the next offset at the end of its
> > reply (or -1ULL to stop).  Then the kernel can call GET_FMAP again with
> > that new offset to get more mappings.
> > 
> > Though at this point maybe it should go the /other/ way, where the fuse
> > server can sends a "notification" to the kernel to populate its mapping
> > data?  fuse already defines a handful of notifications for invalidating
> > pagecache and directory links.
> > 
> > (Ugly wart: notifications aren't yet implemented for the iouring channel)
> 
> I don't have fully-formed thoughts about notifications yet; thinking...

Me neither.  The existing ones seem like they /could/ be useful for 

> If the fmap stuff may be shared by more than one use case (as has always
> seemed possible), it's a good idea to think through a couple of things: 
> 1) is there anything important missing from this general approach, and 

Well for general iomap caching, I think we'd need to pull in a lot more
of the iomap fields:

struct fuse_iomap {
	u64		addr;	/* disk offset of mapping, bytes */
	loff_t		offset;	/* file offset of mapping, bytes */
	u64		length;	/* length of mapping, bytes */
	u16		type;	/* type of mapping */
	u16		flags;	/* flags for mapping */
	u32		devindex;
	u64		validity_cookie; /* used with .iomap_valid() */
};

fuse would use devindex to find the block_device/dax_device, but
otherwise the fields are exactly the same as struct iomap.  Given that
this is exposed to userspace we'd probably want to add some padding.

The validity cookie I'm not 100% sure about -- buffered IO uses it to
detect stale iomappings after we've locked a folio for write, having
dropped whatever locks protect the iomappings.  The ->iomap_valid
function compares the iomap::validity_cookie against some internal magic
value (this would have to be the iomap cache) to decide if revalidation
is needed.

One way to make this work is to implement the cookie entirely within the
fuse-iomap cache itself -- every time a new mapping comes in (or a range
gets invalidated) the cache bumps its cookie.  The fuse server doesn't
have to implement the cookie itself, but it will have to push a new
mapping or invalidate something every time the mappings change.

Another way would be to have the fuse server implement the cookie
itself, but now we have to find a way to have the kernel and userspace
share a piece of memory where the cookie lives.  I don't like this
option, but it does give the fuse server direct control over when the
cookie value changes.

> 2) do you need to *partially* cache fmaps? (or is the "offset" idea above 
> just to deal with an fmap that might otherwise overflow a response size?)

It's mostly to cap the amount of mapping data being copied into the
kernel in a specific GET_FMAP call.  For famfs I don't think you have
that many mappings, but for (say) an XFS filesystem there could be
billions of them.

Though at that point it might make more sense to populate the cache
piecemeal as file IO actually happens.

I wouldn't split an existing mapping, FWIW.  Think "I have 1,000,000
mappings and I'm only going to upload them 1,000 at a time", not "I'm
going to upload mappings for 100MB worth of file range at a time".

> The current approach lets the kernel retrieve and cache simple and 
> interleaved fmaps (and BTW interleaved can be multi-dev or single-dev - 
> there are current weird cases where that's useful). Also too, FWIW everything
> that can be done with simple ext list fmaps can be done with a collection
> of interleaved extents, each with strip count = 1. But I think there is a
> worthwhile clarity to having both.

<nod> I don't know what Miklos' opinion is about having multiple
fusecmds that do similar things -- on the one hand keeping yours and my
efforts separate explodes the amount of userspace abi that everyone must
maintain, but on the other hand it then doesn't couple our projects
together, which might be a good thing if it turns out that our domain
models are /really/ actually quite different.

(Especially because I suspect that interleaving is the norm for memory,
whereas we try to avoid that for disk filesystems.)

> But the current implementation does not contemplate partially cached fmaps.
> 
> Adding notification could address revoking them post-haste (is that why
> you're thinking about notifications? And if not can you elaborate on what
> you're after there?).

Yeah, invalidating the mapping cache at random places.  If, say, you
implement a clustered filesystem with iomap, the metadata server could
inform the fuse server on the local node that a certain range of inode X
has been written to, at which point you need to revoke any local leases,
invalidate the pagecache, and invalidate the iomapping cache to force
the client to requery the server.

Or if your fuse server wants to implement its own weird operations (e.g.
XFS EXCHANGE-RANGE) this would make that possible without needing to
add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.

--D

> 
> > 
> > --D
> 
> Cheers,
> John
> 
> 

  reply	other threads:[~2025-04-28 19:00 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
2025-04-21  1:33 ` [RFC PATCH 01/19] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c John Groves
2025-04-21  1:33 ` [RFC PATCH 02/19] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
2025-04-21  1:33 ` [RFC PATCH 03/19] dev_dax_iomap: Save the kva from memremap John Groves
2025-04-21  1:33 ` [RFC PATCH 04/19] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
2025-04-21  1:33 ` [RFC PATCH 05/19] dev_dax_iomap: export dax_dev_get() John Groves
2025-04-21  1:33 ` [RFC PATCH 06/19] dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c John Groves
2025-04-21  1:33 ` [RFC PATCH 07/19] famfs_fuse: magic.h: Add famfs magic numbers John Groves
2025-04-21  1:33 ` [RFC PATCH 08/19] famfs_fuse: Kconfig John Groves
2025-04-21  1:33 ` [RFC PATCH 09/19] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
2025-04-21  1:33 ` [RFC PATCH 10/19] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
2025-04-23  1:36   ` Joanne Koong
2025-04-23 20:23     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 11/19] famfs_fuse: Basic famfs mount opts John Groves
2025-04-23  1:51   ` Joanne Koong
2025-04-23 20:19     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response John Groves
2025-05-02  5:48   ` Joanne Koong
2025-05-02 20:35     ` Darrick J. Wong
2025-05-12 16:28     ` John Groves
2025-05-22 15:45       ` Amir Goldstein
2025-05-23  0:30         ` John Groves
2025-04-21  1:33 ` [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps John Groves
2025-04-21 21:57   ` Darrick J. Wong
2025-04-21 22:31     ` John Groves
2025-04-24 13:43   ` John Groves
2025-04-24 14:38     ` Darrick J. Wong
2025-04-28  1:48       ` John Groves
2025-04-28 19:00         ` Darrick J. Wong [this message]
2025-05-06 16:56           ` Miklos Szeredi
2025-05-08 15:56             ` Darrick J. Wong
2025-05-13  9:14               ` Miklos Szeredi
2025-05-15  2:06                 ` Darrick J. Wong
2025-05-16 10:06                   ` Miklos Szeredi
2025-05-16 23:17                     ` Darrick J. Wong
2025-05-12 19:51             ` John Groves
2025-05-13  4:03               ` Darrick J. Wong
2025-04-21  1:33 ` [RFC PATCH 14/19] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
2025-04-21  3:43   ` Randy Dunlap
2025-04-21 20:57     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 15/19] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
2025-04-21  1:33 ` [RFC PATCH 16/19] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
2025-04-21  1:33 ` [RFC PATCH 17/19] famfs_fuse: Add famfs metadata documentation John Groves
2025-04-21  3:51   ` Randy Dunlap
2025-04-21 21:00     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 18/19] famfs_fuse: Add documentation John Groves
2025-04-22  2:10   ` Randy Dunlap
2025-04-28  1:50     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 19/19] famfs_fuse: (ignore) debug cruft John Groves
2025-04-21 18:27 ` [RFC PATCH 00/19] famfs: port into fuse Darrick J. Wong
2025-04-21 22:00   ` John Groves
2025-04-22  1:25     ` Darrick J. Wong
2025-04-22 11:50       ` John Groves
2025-04-30 14:42 ` Alireza Sanaee
2025-05-01  2:13   ` John Groves
2025-05-21 22:30 ` John Groves
2025-05-21 23:11   ` Darrick J. Wong
2025-05-22 15:55   ` Amir Goldstein

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250428190010.GB1035866@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=0@groves.net \
    --cc=John@groves.net \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=ajayjoshi@micron.com \
    --cc=amir73il@gmail.com \
    --cc=arramesh@micron.com \
    --cc=bfoster@redhat.com \
    --cc=brauner@kernel.org \
    --cc=bschubert@ddn.com \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=jack@suse.cz \
    --cc=jgroves@micron.com \
    --cc=jlayton@kernel.org \
    --cc=joannelkoong@gmail.com \
    --cc=josef@toxicpanda.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luis@igalia.com \
    --cc=miklos@szeredi.hu \
    --cc=nvdimm@lists.linux.dev \
    --cc=pvorel@suse.cz \
    --cc=rdunlap@infradead.org \
    --cc=shajnocz@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox