Linux filesystem development
 help / color / mirror / Atom feed
From: John Groves <john@groves.net>
To: Miklos Szeredi <mszeredi@redhat.com>
Cc: fuse-devel@lists.linux.dev, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] famfs: FUSE_GET_DAXDEV -> FUSE_DEV_IOC_BACKING_OPEN
Date: Mon, 1 Jun 2026 23:58:56 -0700	[thread overview]
Message-ID: <ah5-40YypaswEG-1@groves.net> (raw)
In-Reply-To: <20260529134302.1989958-1-mszeredi@redhat.com>

On 26/05/29 03:43PM, Miklos Szeredi wrote:
> The difference is that now the server needs to look at the fmap and check
> if any new device needs registering, the device index is now allocated by
> the kernel and returned by the ioctl().
> 
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
> This is the first part of what we were talking about: using the backing
> file API for dax devices.
> 
> It's on top of the famfs-v10 patchset.
> 
> Totally untested, haven't even tried booting it.  Obviously needs
> nontrivial conversion in the server code as well.
> 
> Thanks,
> Miklos

Miklos, thanks for looking at this.

First, a scope check. This patch touches only the daxdev registration -
GET_FMAP, struct fuse_famfs_fmap_header, and the simple/interleave
extent structs in include/uapi/linux/fuse.h are untouched. I'm reading
that as tacit agreement to GET_FMAP and its ABI. Can you confirm? If so
the large piece is settled and this is really just about how a daxdev
gets registered.

Device indices - the part that has to be right
----------------------------------------------

Everything below turns on what a famfs device index *is*, so let me make
it explicit, because the patch's model of it (kernel-allocated) isn't
workable for famfs.

Famfs daxdevs are shared memory. The same physical memory is mapped by
famfs on multiple nodes, and on each node it most likely has a different
name and dev_t. What is invariant across nodes is the daxdev's UUID. So
a daxdev's definitive identity is its UUID, not a path or a dev_t.

The index is assigned by famfs userspace, as the order of appearance of
daxdevs in the famfs metadata log:
  - the primary daxdev (which holds the superblock) is always index 0;
  - additional daxdevs are introduced by ADD_DAXDEV log entries and take
    1..n in order of appearance.
  - ADD_DAXDEV log entries establish the UUID-to-index mapping
  - any file may have extents referencing any or all daxdevs defined
    previously in the log - by index

Two properties make the index a userspace-owned, cluster-invariant value
the kernel cannot allocate:

  1. Bootstrap. The famfs metadata log establishes the indices - each
     ADD_DAXDEV entry defines the next - and an index can then appear in
     any later file fmap. But the log is itself a file on index 0 (with
     the superblock), so index 0 must resolve before the kernel can even
     read the log. The numbering is fixed in the log before the kernel
     sees any of it, so the kernel can't be the allocator.
     (idr_alloc_cyclic never hands out 0 anyway, so a kernel-assigned
     primary couldn't even be index 0.)

  2. Cluster-invariance. fmaps live in the shared-memory log, read
     identically by every node. A given dev_index must denote the same
     daxdev (the same UUID) on every node, even though it most likely has
     a different name and dev_t on each node. A kernel-allocated id is
     node-local and mount-local - the wrong kind of value to live in a
     shared fmap.

So the index must be supplied by userspace (which owns the log and
resolves each index's UUID to a local device), and the kernel looks
devices up by that index. Resolving UUID -> local device is inherently
userspace's job; the kernel trusts that resolution and never needs the
UUID.

This is also why the hand-rolled daxdev table is the right structure, not
incidental cruft: a small self-contained array in famfs.c, keyed directly
by the dense zero-based index, closing the discovery race (valid flag +
wmb) and giving O(1) lookup. That's exactly what this index namespace
calls for.

Identity by reference, not by path - this part is right
-------------------------------------------------------

The genuinely better idea here is getting the kernel out of resolving a
device pathname: the daemon hands over an already-resolved reference
(you do it with an fd, from which the kernel takes inode->i_rdev)
instead of a string the kernel has to kern_path(). I want to adopt that
- credit where it's due.

The minimal way to realize it is to have GET_DAXDEV return a dev_t: the
kernel uses dax_dev_find() and kern_path()/may_open_dev() drop out
entirely. No ABI change - struct fuse_daxdev_out already carries the
index and has reserved space (sized with room for a uuid), so it's just
filling a reserved field, and the index is already first-class in the
reply.

The fd delivers that same dev_t (the kernel already reads it from
f_path), so BACKING_OPEN can realize the same idea - but the dev_t is
the part that's actually the improvement, and the daxdev reply ABI
already models (index, dev_t) while fuse_backing_map models neither. So
realizing it is cleaner without the fd than with it; the costs of going
through BACKING_OPEN are below.

On BACKING_OPEN
---------------

If you feel strongly about FUSE_DEV_IOC_BACKING_OPEN, I'll make it work
- but I want to be honest that for famfs it's a shared-almost-nothing
approach:

  - famfs needs its own handler regardless (the famfs branch of
    fuse_backing_open), and that handler can keep populating the famfs
    devlist exactly as today - so the only things actually shared are the
    ioctl entry and the {fd, flags} struct.

  - it forces 'select FUSE_PASSTHROUGH', pulling in file retention, I/O
    forwarding, refcounting and close - none of which famfs uses. Famfs
    maps device memory directly via iomap/dax_direct_access; there is no
    backing file and no famfs close path (fuse_backing_close already
    returns -EINVAL when famfs_iomap is set).

  - the ABI doesn't fit. struct fuse_backing_map is {int32_t fd; uint32_t
    flags; uint64_t padding} - it's fd-shaped, with no dev_t field and no
    index field. To use it for famfs I'd have to carry the device by fd
    (which forces the daemon to open() the device just to register it -
    the daemon otherwise never opens it) and smuggle the native index
    through padding. And the fd does no real work: in famfs_daxdev_open()
    it's borrowed only to read f_path -> devno, then dropped; exclusivity
    comes from fs_dax_get()'s holder, not the fd.

On access control, in case it's a concern behind the fd: the famfs path
is already gated on the server holding CAP_SYS_RAWIO at init -
FUSE_DAX_FMAP is only offered under capable(CAP_SYS_RAWIO) in
fuse_send_init(), and famfs_iomap (hence GET_FMAP/GET_DAXDEV) is enabled
only if that flag comes back. So the registering daemon is already
raw-IO-trusted; there's no access-control gap an fd's open would close
(may_open_dev() isn't a capability check anyway, just a mount-nodev test).

Net
---

I'm glad to (a) confirm we're aligned on GET_FMAP and its ABI, and (b)
switch GET_DAXDEV to return a dev_t instead of a path. I can't move index
allocation into the kernel - the index is a userspace-owned,
cluster-invariant value for the reasons above - and I'd keep the famfs
daxdev table where it is. Lazy discovery has to keep working either way:
only the primary daxdev is known at mount; the rest are discovered as the
famfs metadata log is replayed and first referenced by an fmap.

If you still want BACKING_OPEN after the above, say so and I'll make it
work, ABI overloads and all - but I think GET_DAXDEV + dev_t is the
smaller and better-fitting change.

<snip>

Thanks,
John


  parent reply	other threads:[~2026-06-02  6:59 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-29 13:43 [PATCH] famfs: FUSE_GET_DAXDEV -> FUSE_DEV_IOC_BACKING_OPEN Miklos Szeredi
2026-05-31 17:36 ` Amir Goldstein
2026-06-01  8:42   ` Miklos Szeredi
2026-06-02  6:58 ` John Groves [this message]
2026-06-02 13:01   ` Amir Goldstein
2026-06-03  8:35   ` Miklos Szeredi
2026-06-15 16:20     ` John Groves
2026-06-03 22:09 ` Joanne Koong
2026-06-04 14:17   ` Miklos Szeredi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ah5-40YypaswEG-1@groves.net \
    --to=john@groves.net \
    --cc=fuse-devel@lists.linux.dev \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mszeredi@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox