public inbox for linux-nfs@vger.kernel.org
From: Demi Marie Obenour <demiobenour@gmail.com>
To: Christian Brauner <brauner@kernel.org>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>,
	lsf-pc@lists.linux-foundation.org,
	aleksandr.mikhalitsyn@futurfusion.io,
	linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
	stgraber@stgraber.org, ksugihara@preferred.jp,
	utam0k@preferred.jp, trondmy@kernel.org, anna@kernel.org,
	jlayton@kernel.org, chuck.lever@oracle.com, neilb@suse.de,
	miklos@szeredi.hu, jack@suse.cz, amir73il@gmail.com,
	trapexit@spawn.link
Subject: Re: [LSF/MM/BPF TOPIC] VFS idmappings support in NFS
Date: Tue, 24 Feb 2026 09:18:21 -0500
Message-ID: <3cc6a84f-2f97-49e8-909a-ea69f9a97fb0@gmail.com>
In-Reply-To: <20260224-bannen-waldlauf-481ad13899a9@brauner>



On 2/24/26 03:54, Christian Brauner wrote:
> On Sat, Feb 21, 2026 at 01:44:26AM -0500, Demi Marie Obenour wrote:
>> On 2/18/26 07:44, Alexander Mikhalitsyn wrote:
>>> Dear friends,
>>>
>>> I would like to propose "VFS idmappings support in NFS" as a topic for discussion at the LSF/MM/BPF Summit.
>>>
>>> Previously, I worked on VFS idmap support for FUSE/virtiofs [4] and cephfs [1] with support/guidance
>>> from Christian.
>>>
>>> This experience with Cephfs & FUSE has shown that VFS idmap semantics, while being very elegant and
>>> intuitive for local filesystems, can be quite challenging to combine with network/network-like (e.g. FUSE)
>>> FSes. In case of Cephfs we had to modify its protocol (!) (see [2]) as a part of our agreement with
>>> ceph folks about the right way to support idmaps.
>>>
>>> One obstacle here was that cephfs has some features that are not very Linux-wayish, I would say.
>>> In particular, system administrator can configure path-based UID/GID restrictions on a *server*-side (Ceph MDS).
>>> Basically, you can say "I expect UID 1000 and GID 2000 for all files under /stuff directory".
>>> The problem here is that these UID/GIDs are taken from a syscall-caller's creds (not from (struct file *)->f_cred)
>>> which makes cephfs FDs not very transferable through unix sockets. [3]
>>>
>>> These path-based UID/GID restrictions mean that server expects client to send UID/GID with every single request,
>>> not only for those OPs where UID/GID needs to be written to the disk (mknod, mkdir, symlink, etc).
>>> The VFS idmaps API is designed to prevent filesystem developers from making mistakes when supporting FS_ALLOW_IDMAP.
>>> For example, (struct mnt_idmap *) is not passed to every single i_op, but instead to only those where it can be
>>> used legitimately. Particularly, readlink/listxattr or rmdir are not expected to use idmapping information anyhow.
>>>
>>> We've seen very similar challenges with FUSE. Not a long time ago on Linux Containers project forum, there
>>> was a discussion about mergerfs (a popular FUSE-based filesystem) & VFS idmaps [5]. And I see that this problem
>>> of "caller UID/GID are needed everywhere" still blocks VFS idmaps adoption in some usecases.
>>> Antonio Musumeci (mergerfs maintainer) claimed that in many cases filesystems behind mergerfs may not be fully
>>> POSIX and basically, when mergerfs does IO on the underlying FSes it needs to do UID/GID switch to caller's UID/GID
>>> (taken from FUSE request header).
>>>
>>> We don't expect NFS to be any simpler :-) I would say that supporting NFS is a final boss. It would be great
>>> to have a deep technical discussion with VFS/FSes maintainers and developers about all these challenges and
>>> make some conclusions and identify a right direction/approach to these problems. From my side, I'm going
>>> to get more familiar with high-level part of NFS (or even make PoC if time permits), identify challenges,
>>> summarize everything and prepare some slides to navigate/plan discussion.
>>>
>>> [1] cephfs https://lore.kernel.org/linux-fsdevel/20230807132626.182101-1-aleksandr.mikhalitsyn@canonical.com
>>> [2] cephfs protocol changes https://github.com/ceph/ceph/pull/52575
>>> [3] cephfs & f_cred https://lore.kernel.org/lkml/CAEivzxeZ6fDgYMnjk21qXYz13tHqZa8rP-cZ2jdxkY0eX+dOjw@mail.gmail.com/
>>> [4] fuse/virtiofs https://lore.kernel.org/linux-fsdevel/20240903151626.264609-1-aleksandr.mikhalitsyn@canonical.com/
>>> [5] mergerfs https://discuss.linuxcontainers.org/t/is-it-the-case-that-you-cannot-use-shift-true-for-disk-devices-where-the-source-is-a-mergerfs-mount-is-there-a-workaround/25336/11?u=amikhalitsyn
>>>
>>> Kind regards,
>>> Alexander Mikhalitsyn @ futurfusion.io
>>
>> The secure case (strong authentication) has similar problems to
>> In both cases, there is no way to store the files with the UID/GID/etc
> 
> It's easy to support idmapped mounts without user namespaces. They're
> completely decoupled from them already for that purpose so they can
> support id squashing and so on going forward. The only thing that's
> needed is to extend the api so that we can specify mappings to be used
> for a mount. That's not difficult and there's no need to adhere to any
> inherent limit on the number of mappings that user namespaces have.
> 
> It's also useful independent of all that for local filesystems that want to
> expose files with different ownership at different locations without
> getting into namespaces at all.

Virtiofsd needs user namespaces to create a mount unless it runs as
real root.

>> that the VFS says they should have.  The server (NFS) or kernel
>> (virtiofsd) simply will not (and, for security reasons, *must not*)
>> allow this.
>>
>> I proposed a workaround for virtiofsd [1] that I will also propose
>> here: store the mapped UID and GID as a user.* xattr.  This requires
> 
> xattrs as an ownership side-channel are an absolute clusterfuck. The
> kernel implementation for POSIX ACLs and filesystem capabilities that
> slap ownership information that the VFS must consume on arbitrary
> inodes should be set on fire. I've burned way too many cycles getting
> this into an even remotely acceptable shape and it still sucks to no
> end. Permission checking is a complete nightmare because we need to go
> fetch stuff from disk, cache it in a global format, then do an in-place
> translation having to parse ownership out of binary data stored
> alongside the inode.

Why is this so bad?  What would be a better way to do this?

> However, if userspace wants to consume ownership information by storing
> arbitrary ownership information as user.* xattrs then I obviously
> couldn't care less but it won't nest, performance will suck, and it will
> be brittle to get this right imho.

It can be nested by repeatedly remapping xattrs: user.o, user.o.o,
and so on.  More efficient schemes undoubtedly exist.
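
To make the naming concrete, here is a minimal sketch of that scheme.
It assumes each nesting level simply appends one more ".o" component
to a base key "user.o"; the helper names are hypothetical and nothing
here is existing virtiofsd or kernel behavior.

```python
# Hypothetical base key; the only thing taken from the discussion is the
# "user.o, user.o.o, ..." naming pattern.
BASE = "user.o"


def owner_xattr_name(depth: int) -> str:
    """Name of the ownership xattr at a given nesting depth (0 = outermost)."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return BASE + ".o" * depth


def owner_xattr_depth(name: str):
    """Inverse of owner_xattr_name(); returns None for unrelated xattrs."""
    if not name.startswith(BASE):
        return None
    suffix = name[len(BASE):]
    # The suffix must consist only of repeated ".o" components.
    if suffix != ".o" * (len(suffix) // 2):
        return None
    return len(suffix) // 2


def name_seen_by_lower_layer(name: str) -> str:
    """A layer stores an ownership xattr from the layer above one level deeper."""
    depth = owner_xattr_depth(name)
    return name if depth is None else owner_xattr_name(depth + 1)
```

Each layer has to rewrite every such name on the way down and back
up, which is one reason this scheme is simple but not efficient.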

Is there a better solution?  The only one I can think of is to include
subordinate UIDs/GIDs in Kerberos tickets, and that fails when the
product of (total users in an installation * number of subordinate
UIDs/GIDs per user) approaches 2^32.  This can happen if both are
65536.
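
The arithmetic, as a small sketch.  It assumes 32-bit IDs and that
every user gets a disjoint block of subordinate IDs; the function
names are made up for illustration.

```python
ID_SPACE = 2**32  # number of distinct values in a 32-bit uid_t/gid_t


def subordinate_ids_needed(users: int, subids_per_user: int) -> int:
    """Total IDs consumed if every user gets a disjoint subordinate range."""
    return users * subids_per_user


def fits_in_id_space(users: int, subids_per_user: int) -> bool:
    """True if the allocation fits.

    Strictly less than ID_SPACE: (uid_t)-1 is reserved as the invalid ID
    on Linux, and this does not even count the users' own primary IDs.
    """
    return subordinate_ids_needed(users, subids_per_user) < ID_SPACE


if __name__ == "__main__":
    print(subordinate_ids_needed(65536, 65536) == ID_SPACE)
    print(fits_in_id_space(65536, 65536))
    print(fits_in_id_space(1000, 65536))
```

With 65536 users and 65536 subordinate IDs each, the product is
exactly 2^32, so the whole ID space is consumed.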

I did suggest replacing UIDs and GIDs with NT-style SIDs, but that
is an absolutely enormous change across the entire system.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


