public inbox for linux-doc@vger.kernel.org
From: Gregory Price <gourry@gourry.net>
To: John Groves <John@groves.net>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Joanne Koong <joannelkoong@gmail.com>,
	Bernd Schubert <bernd@bsbernd.com>,
	John Groves <john@jagalactic.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Bernd Schubert <bschubert@ddn.com>,
	Alison Schofield <alison.schofield@intel.com>,
	John Groves <jgroves@micron.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Shuah Khan <skhan@linuxfoundation.org>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	Jeff Layton <jlayton@kernel.org>,
	Amir Goldstein <amir73il@gmail.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Stefan Hajnoczi <shajnocz@redhat.com>,
	Josef Bacik <josef@toxicpanda.com>,
	Bagas Sanjaya <bagasdotme@gmail.com>,
	Chen Linxuan <chenlinxuan@uniontech.com>,
	James Morse <james.morse@arm.com>, Fuad Tabba <tabba@google.com>,
	Sean Christopherson <seanjc@google.com>,
	Shivank Garg <shivankg@amd.com>,
	Ackerley Tng <ackerleytng@google.com>,
	Aravind Ramesh <arramesh@micron.com>,
	Ajay Joshi <ajayjoshi@micron.com>,
	"venkataravis@micron.com" <venkataravis@micron.com>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"nvdimm@lists.linux.dev" <nvdimm@lists.linux.dev>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	djbw@kernel.org
Subject: Re: [PATCH V10 00/10] famfs: port into fuse
Date: Sun, 19 Apr 2026 20:27:04 -0400
Message-ID: <aeVy2MzucnrLlOQx@gourry-fedora-PF4VCD3F>
In-Reply-To: <aeUU8hMwPij2WvfF@groves.net>

On Sun, Apr 19, 2026 at 03:36:30PM -0500, John Groves wrote:
> On 26/04/15 10:16AM, David Hildenbrand (Arm) wrote:
> > On 4/15/26 00:20, Gregory Price wrote:
> > > On Tue, Apr 14, 2026 at 11:57:40AM -0700, Darrick J. Wong wrote:
> 
> Gregory's code, in the current form, still uses two new fuse messages,
> GET_FMAP and GET_DAXDEV, but it makes the fmap message format opaque by
> removing fmap format structs from the uapi. It also uses two BPF programs.
> One BPF program parses and validates the GET_FMAP payload for every file,
> and hangs it from a 'void *' in each fuse_inode (just like the current famfs
> code). The other BPF program is called during vma faults and reads the 
> fuse_inode->'void *' in order to handle faults the same way famfs-fuse does
> today, but via BPF instead.
> 

I'll just lay out what I've done and why.

For John's sanity, if there are NACKs, knowing sooner rather than later
would be a kindness.

=== Problem: Any lookup() in iomap_begin() is too much overhead.

No dax-backed server will want to eat the cost of a lookup() that
could be multiple microseconds on what should be a 1-5us soft-fault.

Joanne's prototype had this:

   meta = bpf_map_lookup_elem(&inode_map, &nodeid);

But that lookup was standing in for a single pointer dereference:

   struct fuse_inode *fi = get_fuse_inode(inode);
   struct famfs_file_meta *meta = fi->famfs_meta;

Not all O(1) are created equal here.

   A single L3 LLC miss plus page table walk can cost you ~100ns.
   If that pointer was cache-hot, it's almost free.

   A pointer chase through any structure is N x ~100ns.
   This is unlikely to ever be sufficiently cache hot by comparison.

So, let's just avoid this problem altogether.


===  Requirements

1) No hard-coded OMF structures in the FUSE API.

   While RAID0 style interleaving isn't exactly fancy or novel,
   folks think this should not be in the kernel headers.

   (I'm not going to argue, I think the argument is pointless)


2) iomap_begin() needs metadata accessible on the order of a single
   pointer dereference - which is what John has implemented.


3) open() needs to validate the metadata and identify DAX devices

   a) it needs to validate the DAX devices are available and
      acquire them / set them up / etc.  This is a kernel-side op.

   b) it needs to validate the addressing information is valid for
      the relevant dax devices

   Both GET_FMAP and GET_DAXDEV are avoided if the metadata is
   already cached or the DAXDEV is already set up.  So keeping these
   separate is actually important.


Joanne's code deals with #1 - but it doesn't handle #2 or #3.
(It also doesn't handle GET_DAXDEV at all).

John's code manages #2 and #3 by having the fuse-server pass metadata
on open() via GET_FMAP and GET_DAXDEV.

  GET_FMAP acquires the metadata on how dax devices are used

  GET_DAXDEV just translates an ID to a specific dax device.
  iomap_begin() then uses the OMF to do the mapping.

But it does this by hard-coding the format into kernel headers.


===  Observation: Add a BPF dax_fmap_parse() on open() 

Pair Joanne's suggestion with John's GET_FMAP/GET_DAXDEV operations.

  struct fuse_dax_fmap_ops {
      char name[FUSE_DAX_FMAP_OPS_NAME_LEN];   // 16 bytes
      int (*dax_fmap_parse)(struct fuse_dax_fmap_parse_ctx *ctx);
      int (*iomap_begin)(struct fuse_dax_fmap_resolve_ctx *ctx,
                         struct fuse_iomap_io *io);
  };

This parse function is used to do filesystem-specific setup (such as
populating the dax bitmap) based on filesystem-specific per-file metadata.

In John's case, essentially all it does is populate the dax bitmap and
toss the data onto fi->dax_fmap.meta.

Pseudo code:

  fuse_dax_fmap_open(inode):
      fmap_size = send_GET_FMAP(inode, fmap_buf)

      /* Make space to store the metadata */
      meta_buf = kzalloc(meta_size)
      ctx = { ... }
      kern = { .ctx, .blob = fmap_buf, .meta_buf = meta_buf }

      /* Parse the metadata: i.e. fill out the daxdev bitmap */
      fc->dax_fmap_ops->dax_fmap_parse(&ctx)

      /* Call GET_DAXDEV for any new dax devices */
      resolve_dev_bitmap(ctx.dev_bitmap)

      /* cache the meta data on the inode */
      inode_lock()
      fi->dax_fmap.meta      = meta_buf
      ... etc etc ...
      inode_unlock()
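
To make the ordering above concrete, here is a rough userspace mock of the
same flow. Every name in it (parse_ctx, mock_conn, mock_inode,
simple_parse, mock_fmap_open) is made up for illustration and is not the
patch's API, and the "payload" format (one byte per dax device index) is
an assumption chosen only to keep the parser tiny. It shows the three
steps: parse the opaque blob into a device bitmap, resolve only devices
not seen before, then cache the metadata on the inode.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DAXDEVS 64

/* Hypothetical stand-ins; none of these names come from the patch series. */
struct parse_ctx {
    const void *blob;       /* opaque GET_FMAP payload from the server */
    size_t      blob_size;
    void       *meta_buf;   /* where the parser stores per-file metadata */
    size_t      meta_size;
    uint64_t    dev_bitmap; /* bit i set => file uses dax device i */
};

struct mock_inode {
    void   *meta;           /* cached metadata, opaque to the core code */
    size_t  meta_size;
};

struct mock_conn {
    int (*parse)(struct parse_ctx *ctx);
    uint64_t resolved_devs;     /* devices already set up */
    unsigned get_daxdev_calls;  /* how many GET_DAXDEV round trips we made */
};

/* Assumed payload format: an array of one-byte device indices. */
static int simple_parse(struct parse_ctx *ctx)
{
    const uint8_t *ids = ctx->blob;

    if (ctx->blob_size == 0 || ctx->blob_size > ctx->meta_size)
        return -1;
    for (size_t i = 0; i < ctx->blob_size; i++) {
        if (ids[i] >= MAX_DAXDEVS)
            return -1;
        ctx->dev_bitmap |= UINT64_C(1) << ids[i];
    }
    memcpy(ctx->meta_buf, ids, ctx->blob_size);
    return 0;
}

/* Stand-in for "call GET_DAXDEV for any device we have not seen yet". */
static void resolve_dev_bitmap(struct mock_conn *fc, uint64_t bitmap)
{
    uint64_t fresh = bitmap & ~fc->resolved_devs;

    for (unsigned i = 0; i < MAX_DAXDEVS; i++)
        if (fresh & (UINT64_C(1) << i))
            fc->get_daxdev_calls++;
    fc->resolved_devs |= fresh;
}

static int mock_fmap_open(struct mock_conn *fc, struct mock_inode *in,
                          const void *blob, size_t blob_size)
{
    struct parse_ctx ctx = { .blob = blob, .blob_size = blob_size };

    if (in->meta)   /* metadata already cached: nothing to fetch or parse */
        return 0;

    ctx.meta_size = blob_size;
    ctx.meta_buf = calloc(1, ctx.meta_size ? ctx.meta_size : 1);
    if (!ctx.meta_buf)
        return -1;

    if (fc->parse(&ctx)) {          /* server-specific validation */
        free(ctx.meta_buf);
        return -1;
    }

    resolve_dev_bitmap(fc, ctx.dev_bitmap);

    in->meta = ctx.meta_buf;        /* cache on the inode */
    in->meta_size = ctx.meta_size;
    return 0;
}
```

Opening a second file that shares a device with the first only pays for
the devices that are new, and re-opening a file with cached metadata pays
for nothing, which is the reason for keeping the two messages separate.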

And otherwise, iomap_begin() works exactly as Joanne proposed, but with
in-kernel cached data instead of the bpfmap.

  const struct dax_simple_meta *meta = (const struct dax_simple_meta *)
                   bpf_fuse_dax_resolve_get_meta(ctx, 0, sizeof(*meta));

And since both parse() and iomap_begin() are bpf programs - and they're
the only consumers of the metadata - FUSE itself no longer needs to know
anything about the server's particular strategy to use the dax devices.

  struct fuse_inode {
      ...
  #if IS_ENABLED(CONFIG_FUSE_DAX_FMAP)
      struct {
          void    *meta;
          u32      meta_size;
          u64      file_size;
      } dax_fmap;
  #endif
  };

Just a big ol' honkin' void* that otherwise gets ignored.
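
As a sketch of what a resolver might do with that opaque pointer, here is
a userspace mock of RAID0-style interleave resolution. The metadata layout
(struct simple_meta), the mapping struct, and simple_iomap_begin() are all
invented for this example; the point is only that the core code carries a
void * and the parse/resolve pair are the sole parties that know its
shape.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STRIPS 4

/* Hypothetical metadata layout, known only to the parse/resolve pair. */
struct simple_meta {
    uint32_t nstrips;                 /* dax devices interleaved */
    uint64_t chunk_size;              /* bytes per chunk on each strip */
    uint64_t strip_base[MAX_STRIPS];  /* starting device offset per strip */
    uint32_t strip_dev[MAX_STRIPS];   /* dax device index per strip */
};

struct mock_inode {
    void    *meta;        /* opaque to everything except the resolver */
    uint64_t file_size;
};

struct mapping {
    uint32_t dev;         /* which dax device backs this range */
    uint64_t dev_offset;  /* byte offset on that device */
    uint64_t length;      /* contiguous bytes valid from dev_offset */
};

static int simple_iomap_begin(const struct mock_inode *in, uint64_t pos,
                              struct mapping *out)
{
    const struct simple_meta *m = in->meta;   /* the single dereference */
    uint64_t chunk, in_chunk, strip, row;

    if (!m || m->nstrips == 0 || m->nstrips > MAX_STRIPS ||
        m->chunk_size == 0)
        return -1;
    if (pos >= in->file_size)
        return -1;

    chunk    = pos / m->chunk_size;   /* which chunk of the file */
    in_chunk = pos % m->chunk_size;   /* offset inside that chunk */
    strip    = chunk % m->nstrips;    /* chunks rotate across strips */
    row      = chunk / m->nstrips;    /* how far down the strip we are */

    out->dev        = m->strip_dev[strip];
    out->dev_offset = m->strip_base[strip] + row * m->chunk_size + in_chunk;

    /* A mapping never crosses a chunk boundary or the end of the file. */
    out->length = m->chunk_size - in_chunk;
    if (out->length > in->file_size - pos)
        out->length = in->file_size - pos;
    return 0;
}
```

With two strips and 2MiB chunks, file offset 0x500000 lands in chunk 2,
which is the second row of strip 0, so it resolves to offset 0x300000 from
that strip's base with 0x100000 bytes remaining in the chunk.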

(Note: while I'm not a BPF wizard, this pattern seems well established in
       existing BPF code; I found code in the network stack that caches
       data on kernel objects this way as well)

=== Caveats

1) We don't know the overhead BPF introduces in the fault path.

My napkin math (and best understanding of BPF) suggests:

   1) trampoline / vtable for bpf ops (iomap_begin func)
   2) retpoline cost of BPF (assuming this is on, safe assumption)
   3) bpf_fuse_dax_resolve_get_meta() overhead (extra pointer deref)

This *should* (I think) amount to an extra pointer dereference, a longjump,
and a retpoline, which hopefully is <100ns since any extra pointer
derefs here SHOULD be cache-hot (hard to know).

It's not 0 overhead, and if the average fault time is 1us then every
additional 10ns is not an insignificant cost.

But this is napkin math.  John will collect data.
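
For scale, the same napkin math as a tiny helper. The function name is
made up, and the inputs are just the figures quoted above (1-5us faults,
10ns and 100ns of added cost), not measurements.

```c
#include <assert.h>

/*
 * Relative overhead, in tenths of a percent, of adding extra_ns to a
 * fault that currently takes base_ns.  Integer math, rounds down.
 */
static unsigned overhead_permille(unsigned base_ns, unsigned extra_ns)
{
    assert(base_ns > 0);
    return (extra_ns * 1000u) / base_ns;
}
```

On a 1us fault, 10ns works out to 10 permille (1%) and the full 100ns
budget to 100 permille (10%); on a 5us fault the same 100ns is 20 permille
(2%).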


2) FUSE needs to be ok with the BPF-driven changes:

https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/


3) FUSE needs to be ok with GET_FMAP/GET_DAXDEV as opaque meta-data
   handlers for DAX devices.

   That means there is no default parser or format. If you don't
   register ops, these functions are functionally dead.

   (probably fine to enforce during init, which is what I did)


4) As John said: MM needs to be good with it.

   Any server using DAX like this already essentially has CAP_SYS_RAWIO
   for DAX, and most likely some form of CAP_SYS_ADMIN.

   Additionally, as folks have pointed out, the resolution to PTE is
   bounded by dax device extents, so it's not entirely arbitrary.

===

As mentioned at the start - you'd be doing John a kindness if there are
clear and obvious NACKs to be had here.

~Gregory
