From: "Darrick J. Wong" <djwong@kernel.org>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: John Groves <John@groves.net>,
Dan Williams <dan.j.williams@intel.com>,
Bernd Schubert <bschubert@ddn.com>,
John Groves <jgroves@micron.com>,
Jonathan Corbet <corbet@lwn.net>,
Vishal Verma <vishal.l.verma@intel.com>,
Dave Jiang <dave.jiang@intel.com>,
Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>,
Luis Henriques <luis@igalia.com>,
Randy Dunlap <rdunlap@infradead.org>,
Jeff Layton <jlayton@kernel.org>,
Kent Overstreet <kent.overstreet@linux.dev>,
Petr Vorel <pvorel@suse.cz>, Brian Foster <bfoster@redhat.com>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-fsdevel@vger.kernel.org,
Amir Goldstein <amir73il@gmail.com>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Stefan Hajnoczi <shajnocz@redhat.com>,
Joanne Koong <joannelkoong@gmail.com>,
Josef Bacik <josef@toxicpanda.com>,
Aravind Ramesh <arramesh@micron.com>,
Ajay Joshi <ajayjoshi@micron.com>,
0@groves.net
Subject: Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
Date: Thu, 8 May 2025 08:56:44 -0700
Message-ID: <20250508155644.GM1035866@frogsfrogsfrogs>
In-Reply-To: <CAJfpegtR28rH1VA-442kS_ZCjbHf-WDD+w_FgrAkWDBxvzmN_g@mail.gmail.com>
On Tue, May 06, 2025 at 06:56:29PM +0200, Miklos Szeredi wrote:
> On Mon, 28 Apr 2025 at 21:00, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > <nod> I don't know what Miklos' opinion is about having multiple
> > fusecmds that do similar things -- on the one hand keeping yours and my
> > efforts separate explodes the amount of userspace abi that everyone must
> > maintain, but on the other hand it then doesn't couple our projects
> > together, which might be a good thing if it turns out that our domain
> > models are /really/ actually quite different.
>
> Sharing the interface at least would definitely be worthwhile, as
> there does not seem to be a great deal of difference between the
> generic one and the famfs specific one. Only implementing part of the
> functionality that the generic one provides would be fine.
Well right now my barely functional prototype exposes this interface
for communicating mappings to the kernel. I've only gotten as far as
exposing the ->iomap_{begin,end} and ->iomap_ioend calls to the fuse
server with no caching, because the only functions I've implemented so
far are FIEMAP, SEEK_{DATA,HOLE}, and directio.
So basically the kernel sends a FUSE_IOMAP_BEGIN command with the
desired (pos, count) file range to the fuse server, which responds with
a struct fuse_iomap_begin_out object that is translated into a struct
iomap.
That response carries a read mapping and a write mapping, which tell the
kernel where to read data from and where to write data to.
As a shortcut, the write mapping can be of type
FUSE_IOMAP_TYPE_PURE_OVERWRITE to avoid having to fill out fields twice.
iomap_end is only called if there were errors while processing the
mapping, or if the fuse server sets FUSE_IOMAP_F_WANT_IOMAP_END.
iomap_ioend is called after read or write IOs complete, so that the
filesystem can update mapping metadata (e.g. unwritten extent
conversion, remapping after an out of place write, ondisk isize update).
Some of the flags here might not be needed or workable; I was merely
cutting and pasting the #defines from iomap.h.
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (0xFFFF) /* use read mapping data */
#define FUSE_IOMAP_TYPE_HOLE 0 /* no blocks allocated, need allocation */
#define FUSE_IOMAP_TYPE_DELALLOC 1 /* delayed allocation blocks */
#define FUSE_IOMAP_TYPE_MAPPED 2 /* blocks allocated at @addr */
#define FUSE_IOMAP_TYPE_UNWRITTEN 3 /* blocks allocated at @addr in unwritten state */
#define FUSE_IOMAP_TYPE_INLINE 4 /* data inline in the inode */
#define FUSE_IOMAP_DEV_SBDEV (0) /* use superblock bdev */
#define FUSE_IOMAP_F_NEW (1U << 0)
#define FUSE_IOMAP_F_DIRTY (1U << 1)
#define FUSE_IOMAP_F_SHARED (1U << 2)
#define FUSE_IOMAP_F_MERGED (1U << 3)
#define FUSE_IOMAP_F_XATTR (1U << 5)
#define FUSE_IOMAP_F_BOUNDARY (1U << 6)
#define FUSE_IOMAP_F_ANON_WRITE (1U << 7)
#define FUSE_IOMAP_F_WANT_IOMAP_END (1U << 15) /* want ->iomap_end call */
#define FUSE_IOMAP_OP_WRITE (1 << 0) /* writing, must allocate blocks */
#define FUSE_IOMAP_OP_ZERO (1 << 1) /* zeroing operation, may skip holes */
#define FUSE_IOMAP_OP_REPORT (1 << 2) /* report extent status, e.g. FIEMAP */
#define FUSE_IOMAP_OP_FAULT (1 << 3) /* mapping for page fault */
#define FUSE_IOMAP_OP_DIRECT (1 << 4) /* direct I/O */
#define FUSE_IOMAP_OP_NOWAIT (1 << 5) /* do not block */
#define FUSE_IOMAP_OP_OVERWRITE_ONLY (1 << 6) /* only pure overwrites allowed */
#define FUSE_IOMAP_OP_UNSHARE (1 << 7) /* unshare_file_range */
#define FUSE_IOMAP_OP_ATOMIC (1 << 9) /* torn-write protection */
#define FUSE_IOMAP_OP_DONTCACHE (1 << 10) /* don't retain pagecache */
#define FUSE_IOMAP_NULL_ADDR (-1ULL) /* addr is not valid */
struct fuse_iomap_begin_in {
	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
	uint32_t reserved;
	uint64_t ino;		/* matches st_ino provided by getattr/open */
	uint64_t pos;		/* file position, in bytes */
	uint64_t count;		/* operation length, in bytes */
};
struct fuse_iomap_begin_out {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of both mappings, bytes */
	uint64_t read_addr;	/* disk offset of mapping, bytes */
	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
	uint32_t read_dev;	/* FUSE_IOMAP_DEV_* */
	uint64_t write_addr;	/* disk offset of mapping, bytes */
	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
	uint32_t write_dev;	/* FUSE_IOMAP_DEV_* */
};
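To make the request/response shape concrete, here is a minimal sketch of how
a fuse server might fill out the reply for a file backed by one contiguous
extent. struct toy_extent and toy_iomap_begin() are hypothetical names made
up for this example; only the constants and struct fuse_iomap_begin_out are
copied from the prototype above. Reporting a hole for both mappings outside
the extent is an assumption about a server that never allocates.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF)
#define FUSE_IOMAP_TYPE_HOLE		0
#define FUSE_IOMAP_TYPE_MAPPED		2
#define FUSE_IOMAP_DEV_SBDEV		(0)
#define FUSE_IOMAP_NULL_ADDR		(-1ULL)

struct fuse_iomap_begin_out {
	uint64_t offset;
	uint64_t length;
	uint64_t read_addr;
	uint16_t read_type;
	uint16_t read_flags;
	uint32_t read_dev;
	uint64_t write_addr;
	uint16_t write_type;
	uint16_t write_flags;
	uint32_t write_dev;
};

/* Hypothetical: a file with exactly one extent, for illustration only. */
struct toy_extent {
	uint64_t file_off;	/* start of the extent in the file, bytes */
	uint64_t disk_addr;	/* start of the extent on disk, bytes */
	uint64_t len;		/* extent length, bytes */
};

/* Answer a (pos, count) query against the single extent. */
static void toy_iomap_begin(const struct toy_extent *ext, uint64_t pos,
			    uint64_t count, struct fuse_iomap_begin_out *out)
{
	uint64_t ext_end = ext->file_off + ext->len;
	uint64_t avail;

	assert(count > 0);
	memset(out, 0, sizeof(*out));
	out->offset = pos;
	out->read_dev = FUSE_IOMAP_DEV_SBDEV;
	out->write_dev = FUSE_IOMAP_DEV_SBDEV;
	out->write_addr = FUSE_IOMAP_NULL_ADDR;

	if (pos >= ext->file_off && pos < ext_end) {
		/* Inside the extent: clamp the mapping to the extent end. */
		avail = ext_end - pos;
		out->length = count < avail ? count : avail;
		out->read_type = FUSE_IOMAP_TYPE_MAPPED;
		out->read_addr = ext->disk_addr + (pos - ext->file_off);
		/* Reuse the read mapping rather than filling it in twice. */
		out->write_type = FUSE_IOMAP_TYPE_PURE_OVERWRITE;
		return;
	}

	/* Outside the extent: a hole that stops where the extent starts. */
	out->length = count;
	if (pos < ext->file_off && count > ext->file_off - pos)
		out->length = ext->file_off - pos;
	out->read_type = FUSE_IOMAP_TYPE_HOLE;
	out->read_addr = FUSE_IOMAP_NULL_ADDR;
	out->write_type = FUSE_IOMAP_TYPE_HOLE;
}
```

The returned length may be shorter than the requested count, in which case
the kernel would presumably ask again for the remainder of the range.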
struct fuse_iomap_end_in {
	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
	uint32_t reserved;
	uint64_t ino;		/* matches st_ino provided to iomap_begin */
	uint64_t pos;		/* file position, in bytes */
	uint64_t count;		/* operation length, in bytes */
	int64_t written;	/* bytes processed */
	uint64_t map_length;	/* length of mapping, bytes */
	uint64_t map_addr;	/* disk offset of mapping, bytes */
	uint16_t map_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t map_flags;	/* FUSE_IOMAP_F_* */
	uint32_t map_dev;	/* FUSE_IOMAP_DEV_* */
};
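The rule for when iomap_end gets sent can be sketched as a small predicate.
This is only an illustration: toy_need_iomap_end() is a hypothetical helper,
and treating "errors while processing the mapping" as a negative errno or a
short operation is an assumption, not something the prototype spells out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FUSE_IOMAP_F_WANT_IOMAP_END	(1U << 15)

/*
 * Decide whether to upcall the fuse server with iomap_end.
 * Assumption: an "error" is a negative errno or fewer bytes processed
 * than were requested.
 */
static bool toy_need_iomap_end(uint16_t map_flags, uint64_t count,
			       int64_t written)
{
	assert(count > 0);

	/* The server explicitly asked to hear about the end of the mapping. */
	if (map_flags & FUSE_IOMAP_F_WANT_IOMAP_END)
		return true;

	return written < 0 || (uint64_t)written < count;
}
```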
/* out of place write extent */
#define FUSE_IOMAP_IOEND_SHARED (1U << 0)
/* unwritten extent */
#define FUSE_IOMAP_IOEND_UNWRITTEN (1U << 1)
/* don't merge into previous ioend */
#define FUSE_IOMAP_IOEND_BOUNDARY (1U << 2)
/* is direct I/O */
#define FUSE_IOMAP_IOEND_DIRECT (1U << 3)
/* is append ioend */
#define FUSE_IOMAP_IOEND_APPEND (1U << 15)
struct fuse_iomap_ioend_in {
	uint16_t ioendflags;	/* FUSE_IOMAP_IOEND_* */
	uint16_t reserved;
	int32_t error;		/* negative errno or 0 */
	uint64_t ino;		/* matches st_ino provided to iomap_begin */
	uint64_t pos;		/* file position, in bytes */
	uint64_t addr;		/* disk offset of new mapping, in bytes */
	uint32_t written;	/* bytes processed */
	uint32_t reserved1;
};
> > (Especially because I suspect that interleaving is the norm for memory,
> > whereas we try to avoid that for disk filesystems.)
>
> So interleaved extents are just like normal ones except they repeat,
> right? What about adding a special "repeat last N extent
> descriptions" type of extent?
Yeah, I suppose a mapping cache could do that. From talking to John
last week, it sounds like the mappings are supposed to be static for the
life of the file, as opposed to ext* where truncates and fallocate can
appear at any time.
One thing I forgot to ask John -- can there be multiple sets of
interleaved mappings per file? e.g. the first 32g of a file are split
between 4 memory controllers, whereas the next 64g are split between 4
different domains?
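To illustrate the question, here is a sketch of how a file offset could be
resolved if a file did carry several interleave sets back to back. Everything
here is hypothetical (struct toy_interleave, toy_resolve(), the 2MiB chunk
size in the usage below); it is not famfs's actual fmap format, just
round-robin striping arithmetic under those assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one interleaved region of a file. */
struct toy_interleave {
	uint64_t file_off;	/* where the region starts in the file, bytes */
	uint64_t len;		/* region length, bytes */
	uint64_t chunk;		/* bytes written to a strip before moving on */
	uint32_t nstrips;	/* number of strips in this set */
};

struct toy_target {
	size_t set;		/* which interleave set covers the offset */
	uint32_t strip;		/* which strip within that set */
	uint64_t strip_off;	/* byte offset within that strip */
};

/* Returns 0 and fills @t, or -1 if no set covers @pos. */
static int toy_resolve(const struct toy_interleave *sets, size_t nsets,
		       uint64_t pos, struct toy_target *t)
{
	for (size_t i = 0; i < nsets; i++) {
		const struct toy_interleave *s = &sets[i];
		uint64_t rel, chunk_no;

		assert(s->chunk > 0 && s->nstrips > 0);
		if (pos < s->file_off || pos - s->file_off >= s->len)
			continue;

		rel = pos - s->file_off;
		chunk_no = rel / s->chunk;

		t->set = i;
		/* Chunks are dealt round-robin across the strips. */
		t->strip = (uint32_t)(chunk_no % s->nstrips);
		/* Full rounds already on this strip, plus the remainder. */
		t->strip_off = (chunk_no / s->nstrips) * s->chunk +
			       rel % s->chunk;
		return 0;
	}
	return -1;
}
```

With two sets, say 32g over 4 strips followed by 64g over 4 other strips, a
lookup at any offset lands in exactly one set, which is the property a
"repeat last N extents" encoding would have to preserve per region.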
> > > But the current implementation does not contemplate partially cached fmaps.
> > >
> > > Adding notification could address revoking them post-haste (is that why
> > > you're thinking about notifications? And if not can you elaborate on what
> > > you're after there?).
> >
> > Yeah, invalidating the mapping cache at random places. If, say, you
> > implement a clustered filesystem with iomap, the metadata server could
> > inform the fuse server on the local node that a certain range of inode X
> > has been written to, at which point you need to revoke any local leases,
> > invalidate the pagecache, and invalidate the iomapping cache to force
> > the client to requery the server.
> >
> > Or if your fuse server wants to implement its own weird operations (e.g.
> > XFS EXCHANGE-RANGE) this would make that possible without needing to
> > add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.
>
> Wouldn't existing invalidation framework be sufficient?
I'm a little confused, are you talking about FUSE_NOTIFY_INVAL_INODE?
If so, then I think that's the wrong layer -- INVAL_INODE invalidates
the page cache, whereas I'm talking about caching the file space
mappings that iomap uses to construct bios for disk IO, and possibly
wanting to invalidate parts of that cache to force the kernel to upcall
the fuse server for a new mapping.
(Obviously this only applies to fuse servers for ondisk filesystems.)
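To show the distinction, a mapping cache invalidation would act on cached
file space mappings rather than on pages. The sketch below is purely
hypothetical: struct toy_cached_map and toy_inval_range() do not exist
anywhere in fs/fuse/, and a real cache would not be a flat array. It only
shows that dropping every cached mapping overlapping (pos, count) is what
forces a fresh upcall for that range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached file space mapping for one inode. */
struct toy_cached_map {
	uint64_t offset;	/* file offset of the mapping, bytes */
	uint64_t length;	/* length of the mapping, bytes */
	bool valid;		/* false means "ask the fuse server again" */
};

/* Drop every valid mapping that overlaps [pos, pos + count). */
static unsigned int toy_inval_range(struct toy_cached_map *maps, size_t n,
				    uint64_t pos, uint64_t count)
{
	unsigned int dropped = 0;

	assert(count > 0);
	for (size_t i = 0; i < n; i++) {
		struct toy_cached_map *m = &maps[i];

		if (!m->valid)
			continue;
		/* Two half-open ranges overlap if each starts before the other ends. */
		if (m->offset < pos + count && pos < m->offset + m->length) {
			m->valid = false;
			dropped++;
		}
	}
	return dropped;
}
```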
--D
> Thanks,
> Miklos
>