* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jeff Layton @ 2026-04-01 19:02 UTC
To: Dorjoy Chowdhury
Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
bharathsm, shuah, miklos, hansg
In-Reply-To: <CAFfO_h75dF2s83VNtUaNuRmto1NVVcxo7kN6eAtNtN3ME8mPiQ@mail.gmail.com>
On Mon, 2026-03-30 at 21:07 +0600, Dorjoy Chowdhury wrote:
> On Mon, Mar 30, 2026 at 5:49 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Sat, 2026-03-28 at 23:22 +0600, Dorjoy Chowdhury wrote:
> > > This flag indicates the path should be opened if it's a regular file.
> > > This is useful to write secure programs that want to avoid being
> > > tricked into opening device nodes with special semantics while thinking
> > > they operate on regular files. This is a requested feature from the
> > > uapi-group[1].
> > >
> > > A corresponding error code EFTYPE has been introduced. For example, if
> > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > like FreeBSD, macOS.
> > >
> > > When used in combination with O_CREAT, either the regular file is
> > > created, or if the path already exists, it is opened if it's a regular
> > > file. Otherwise, -EFTYPE is returned.
> > >
> > > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > > as it doesn't make sense to open a path that is both a directory and a
> > > regular file.
> > >
> > > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> > >
> > > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > > ---
> > > arch/alpha/include/uapi/asm/errno.h | 2 ++
> > > arch/alpha/include/uapi/asm/fcntl.h | 1 +
> > > arch/mips/include/uapi/asm/errno.h | 2 ++
> > > arch/parisc/include/uapi/asm/errno.h | 2 ++
> > > arch/parisc/include/uapi/asm/fcntl.h | 1 +
> > > arch/sparc/include/uapi/asm/errno.h | 2 ++
> > > arch/sparc/include/uapi/asm/fcntl.h | 1 +
> > > fs/ceph/file.c | 4 ++++
> > > fs/fcntl.c | 4 ++--
> > > fs/gfs2/inode.c | 6 ++++++
> > > fs/namei.c | 4 ++++
> > > fs/nfs/dir.c | 4 ++++
> > > fs/open.c | 8 +++++---
> > > fs/smb/client/dir.c | 14 +++++++++++++-
> > > include/linux/fcntl.h | 2 ++
> > > include/uapi/asm-generic/errno.h | 2 ++
> > > include/uapi/asm-generic/fcntl.h | 4 ++++
> > > tools/arch/alpha/include/uapi/asm/errno.h | 2 ++
> > > tools/arch/mips/include/uapi/asm/errno.h | 2 ++
> > > tools/arch/parisc/include/uapi/asm/errno.h | 2 ++
> > > tools/arch/sparc/include/uapi/asm/errno.h | 2 ++
> > > tools/include/uapi/asm-generic/errno.h | 2 ++
> > > 22 files changed, 67 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h
> > > index 6791f6508632..1a99f38813c7 100644
> > > --- a/arch/alpha/include/uapi/asm/errno.h
> > > +++ b/arch/alpha/include/uapi/asm/errno.h
> > > @@ -127,4 +127,6 @@
> > >
> > > #define EHWPOISON 139 /* Memory page has hardware error */
> > >
> > > +#define EFTYPE 140 /* Wrong file type for the intended operation */
> > > +
> > > #endif
> > > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > > index 50bdc8e8a271..fe488bf7c18e 100644
> > > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > > @@ -34,6 +34,7 @@
> > >
> > > #define O_PATH 040000000
> > > #define __O_TMPFILE 0100000000
> > > +#define OPENAT2_REGULAR 0200000000
> > >
> > > #define F_GETLK 7
> > > #define F_SETLK 8
> > > diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h
> > > index c01ed91b1ef4..1835a50b69ce 100644
> > > --- a/arch/mips/include/uapi/asm/errno.h
> > > +++ b/arch/mips/include/uapi/asm/errno.h
> > > @@ -126,6 +126,8 @@
> > >
> > > #define EHWPOISON 168 /* Memory page has hardware error */
> > >
> > > +#define EFTYPE 169 /* Wrong file type for the intended operation */
> > > +
> > > #define EDQUOT 1133 /* Quota exceeded */
> > >
> > >
> > > diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h
> > > index 8cbc07c1903e..93194fbb0a80 100644
> > > --- a/arch/parisc/include/uapi/asm/errno.h
> > > +++ b/arch/parisc/include/uapi/asm/errno.h
> > > @@ -124,4 +124,6 @@
> > >
> > > #define EHWPOISON 257 /* Memory page has hardware error */
> > >
> > > +#define EFTYPE 258 /* Wrong file type for the intended operation */
> > > +
> > > #endif
> > > diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
> > > index 03dee816cb13..d46812f2f0f4 100644
> > > --- a/arch/parisc/include/uapi/asm/fcntl.h
> > > +++ b/arch/parisc/include/uapi/asm/fcntl.h
> > > @@ -19,6 +19,7 @@
> > >
> > > #define O_PATH 020000000
> > > #define __O_TMPFILE 040000000
> > > +#define OPENAT2_REGULAR 0100000000
> > >
> > > #define F_GETLK64 8
> > > #define F_SETLK64 9
> > > diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h
> > > index 4a41e7835fd5..71940ec9130b 100644
> > > --- a/arch/sparc/include/uapi/asm/errno.h
> > > +++ b/arch/sparc/include/uapi/asm/errno.h
> > > @@ -117,4 +117,6 @@
> > >
> > > #define EHWPOISON 135 /* Memory page has hardware error */
> > >
> > > +#define EFTYPE 136 /* Wrong file type for the intended operation */
> > > +
> > > #endif
> > > diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
> > > index 67dae75e5274..bb6e9fa94bc9 100644
> > > --- a/arch/sparc/include/uapi/asm/fcntl.h
> > > +++ b/arch/sparc/include/uapi/asm/fcntl.h
> > > @@ -37,6 +37,7 @@
> > >
> > > #define O_PATH 0x1000000
> > > #define __O_TMPFILE 0x2000000
> > > +#define OPENAT2_REGULAR 0x4000000
> > >
> > > #define F_GETOWN 5 /* for sockets. */
> > > #define F_SETOWN 6 /* for sockets. */
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index 66bbf6d517a9..6d8d4c7765e6 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -977,6 +977,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > ceph_init_inode_acls(newino, &as_ctx);
> > > file->f_mode |= FMODE_CREATED;
> > > }
> > > + if ((flags & OPENAT2_REGULAR) && !d_is_reg(dentry)) {
> > > + err = -EFTYPE;
> > > + goto out_req;
> > > + }
> >
> > ^^^
> > This doesn't look quite right. Here's a larger chunk of the code:
> >
> > -------------------------8<--------------------------
> > if (d_in_lookup(dentry)) {
> > dn = ceph_finish_lookup(req, dentry, err);
> > if (IS_ERR(dn))
> > err = PTR_ERR(dn);
> > } else {
> > /* we were given a hashed negative dentry */
> > dn = NULL;
> > }
> > if (err)
> > goto out_req;
> > if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
> > /* make vfs retry on splice, ENOENT, or symlink */
> > doutc(cl, "finish_no_open on dn %p\n", dn);
> > err = finish_no_open(file, dn);
> > } else {
> > if (IS_ENCRYPTED(dir) &&
> > !fscrypt_has_permitted_context(dir, d_inode(dentry))) {
> > pr_warn_client(cl,
> > "Inconsistent encryption context (parent %llx:%llx child %llx:%llx)\n",
> > ceph_vinop(dir), ceph_vinop(d_inode(dentry)));
> > goto out_req;
> > }
> >
> > doutc(cl, "finish_open on dn %p\n", dn);
> > if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
> > struct inode *newino = d_inode(dentry);
> >
> > cache_file_layout(dir, newino);
> > ceph_init_inode_acls(newino, &as_ctx);
> > file->f_mode |= FMODE_CREATED;
> > }
> > err = finish_open(file, dentry, ceph_open);
> > }
> > -------------------------8<--------------------------
> >
> > It looks like this won't handle it correctly if the pathwalk terminates
> > on a symlink (re: d_is_symlink() case). You should either set up a test
> > ceph cluster on your own, or reach out to the ceph community and ask
> > them to test this.
> >
>
> Thanks for reviewing. The d_is_symlink() case seems to be calling
> finish_no_open so shouldn't this be okay?
>
My mistake -- you're correct. I keep forgetting that finish_no_open()
will handle this case regardless of what else happens.
> > > err = finish_open(file, dentry, ceph_open);
> > > }
> > > out_req:
> > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > index beab8080badf..240bb511557a 100644
> > > --- a/fs/fcntl.c
> > > +++ b/fs/fcntl.c
> > > @@ -1169,9 +1169,9 @@ static int __init fcntl_init(void)
> > > * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> > > * is defined as O_NONBLOCK on some platforms and not on others.
> > > */
> > > - BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
> > > + BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> > > HWEIGHT32(
> > > - (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > > + (VALID_OPENAT2_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > > __FMODE_EXEC));
> > >
> > > fasync_cache = kmem_cache_create("fasync_cache",
> > > diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> > > index 8344040ecaf7..4604e2e8a9cc 100644
> > > --- a/fs/gfs2/inode.c
> > > +++ b/fs/gfs2/inode.c
> > > @@ -738,6 +738,12 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
> > > inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
> > > error = PTR_ERR(inode);
> > > if (!IS_ERR(inode)) {
> > > + if (file && (file->f_flags & OPENAT2_REGULAR) && !S_ISREG(inode->i_mode)) {
> >
> > Isn't OPENAT2_REGULAR getting masked off in ->f_flags now?
> >
> Yes, I thought the masking off was happening after this codepath got
> executed. Maybe it's better anyway to pass another flags param to this
> function and forward the flags from the gfs2_atomic_open function and
> in other call sites pass 0 ? What do you think?
>
Also my mistake. That masking happens in do_dentry_open(), which is
called from finish_open(), so you should be OK here.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
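For readers following along: until a flag like OPENAT2_REGULAR exists, userspace can approximate the semantics from the quoted commit message by pinning the path with O_PATH, checking the inode type, and reopening through procfs. The sketch below is only an illustration under stated assumptions: open_regular() is a hypothetical helper, /proc must be mounted, and EINVAL stands in for the proposed EFTYPE, which current userspace headers do not define.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for O_PATH */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): open `path` relative to
 * `dirfd` only if it resolves to a regular file. Returns an fd, or -1
 * with errno set. EINVAL is a stand-in for the proposed EFTYPE.
 */
static int open_regular(int dirfd, const char *path, int flags)
{
	struct stat st;
	char procpath[64];
	int pathfd, fd, saved;

	assert(path != NULL);

	/* O_PATH pins the inode without calling the driver's open(). */
	pathfd = openat(dirfd, path, O_PATH | O_CLOEXEC);
	if (pathfd < 0)
		return -1;

	if (fstat(pathfd, &st) < 0) {
		saved = errno;
		close(pathfd);
		errno = saved;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(pathfd);
		errno = EINVAL;		/* stand-in for EFTYPE */
		return -1;
	}

	/*
	 * Reopen the pinned inode through procfs. Creation and directory
	 * flags make no sense here, and O_NOFOLLOW would reject the
	 * /proc/self/fd symlink itself, so they are masked off.
	 */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", pathfd);
	fd = open(procpath, flags & ~(O_CREAT | O_EXCL | O_NOFOLLOW | O_DIRECTORY));
	saved = errno;
	close(pathfd);
	errno = saved;
	return fd;
}
```

This costs two opens plus an fstat() and depends on procfs, which is part of why a single kernel-side check is attractive.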
* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: David Laight @ 2026-04-01 14:09 UTC
To: Arnd Bergmann
Cc: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Alexander Viro,
Christian Brauner, Jeff Layton, Chuck Lever, shuah,
Greg Kroah-Hartman, H. Peter Anvin, Jan Kara, Alexander Aring,
Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
linux-fsdevel, linux-api, Linux-Arch, linux-kselftest, cmirabil,
Masami Hiramatsu
In-Reply-To: <c2ea52f2-b232-404b-9ec6-75d8efae6bea@app.fastmail.com>
On Tue, 31 Mar 2026 21:13:34 +0200
"Arnd Bergmann" <arnd@arndb.de> wrote:
> On Tue, Mar 31, 2026, at 19:19, Jori Koolstra wrote:
> > Currently there is no way to race-freely create and open a directory.
> > For regular files we have open(O_CREAT) for creating a new file inode,
> > and returning a pinning fd to it. The lack of such functionality for
> > directories means that when populating a directory tree there's always
> > a race involved: the inodes first need to be created, and then opened
> > to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
> > but in the time window between the creation and the opening they might
> > be replaced by something else.
> >
> > Addressing this race without proper APIs is possible (by immediately
> > fstat()ing what was opened, to verify that it has the right inode type),
> > but difficult to get right. Hence, mkdirat_fd() that creates a directory
> > and returns an O_DIRECTORY fd is useful.
> >
> > This feature idea (and description) is taken from the UAPI group:
> > https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
> >
> > Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
>
> I checked that the calling conventions are fine, i.e. this will work
> as expected across all architectures. I assume you are also aware
> that the non-RFC patch will need to add the syscall number to all
> .tbl files.
>
> The hardest problem here does seem to be the naming of the
> new syscall, and I'm sorry to not be able to offer any solution
> either, just two observations:
>
> - mkdirat/mkdirat_fd sounds similar to the existing
> quotactl/quotactl_fd pair, but quotactl_fd() takes a file
> descriptor argument rather than returning it, which makes
> this addition quite confusing.
>
> - the nicest interface IMO would have been a variation of
> openat(dfd, filename, O_CREAT | O_DIRECTORY, mode)
> but that is a minefield of incompatible implementations[1],
> so we can't do that without changing the behavior for
> existing callers that currently run into an error.
Just require O_TMPFILE to be set as well :-)
You know you'll never regret it once Apr-1 is over.
Can something be done with the flags to openat2()?
That might save allocating an extra system call.
David
>
> Arnd
>
> [1] https://lwn.net/Articles/926782/
>
* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Jori Koolstra @ 2026-04-01 10:25 UTC
To: Mateusz Guzik
Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
H. Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <pbobkjhtuli53o3z34ajyxztaosmztwlygxfxhhjq5ajt47inc@ngtoge3ucdm5>
> On 01-04-2026 06:19 CEST, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
>
> On Tue, Mar 31, 2026 at 07:19:58PM +0200, Jori Koolstra wrote:
> > @@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
> > lookup_flags |= LOOKUP_REVAL;
> > goto retry;
> > }
> > +
> > + if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
> > + struct path new_path = { .mnt = path.mnt, .dentry = dentry };
> > + error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
> > + }
> > + end_creating_path(&path, dentry);
> > return error;
>
>
> You can't do it like this. Should it turn out no fd can be allocated,
> the entire thing is going to error out while keeping the newly created
> directory behind. You need to allocate the fd first, then do the hard
> work, and only then fd_install and or free the fd. The FD_ADD machinery
> can probably still be used provided proper wrapping of the real new
> mkdir.
But isn't this exactly what happens in open(O_CREAT) too? Eventually we
call
	error = dir_inode->i_op->create(idmap, dir_inode, dentry,
					mode, open_flag & O_EXCL);
and only then do we assign and install the fd. AFAIK there is no cleanup
happening there either if the FD_ADD step fails. You will just have a
regular file and no descriptor. But I would have to test this to be sure.
>
> On top of that similarly to what other people mentioned the new syscall
> will definitely want to support O_CLOEXEC and probably other flags down
> the line.
>
I agree, and perhaps O_PATH too. Maybe just all open flags relevant to
directories?
> Trying to handle this in open() is a no-go. openat2 is rather
> problematic.
I don't think that is necessarily true. It turned out O_CREAT | O_DIRECTORY
was buggy for a very long time. Christian Brauner eventually fixed it, and
that combination now returns EINVAL. But I think nothing really stops us
from implementing that combination in the expected way, apart from
whatever reasons there were for not allowing it in the first place,
which I don't know about (maybe mixing semantics?).
>
> I tend to agree mkdirat_fd is not a good name for the syscall either,
> but I don't have a suggestion I'm happy with. I think least bad name
> would follow the existing stuff and be mkdirat2 or similar.
>
> The routine would have to start with validating the passed O_ flags, for
> now only allowing O_CLOEXEC and EINVAL-ing otherwise.
Thanks,
Jori
* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Cyril Hrubis @ 2026-04-01 9:44 UTC
To: Mateusz Guzik
Cc: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Alexander Viro,
Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
Alexander Aring, Peter Zijlstra, Oleg Nesterov,
Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <pbobkjhtuli53o3z34ajyxztaosmztwlygxfxhhjq5ajt47inc@ngtoge3ucdm5>
Hi!
> I tend to agree mkdirat_fd is not a good name for the syscall either,
> but I don't have a suggestion I'm happy with. I think least bad name
> would follow the existing stuff and be mkdirat2 or similar.
Why not mkdirat_open() as it does combine these two syscalls into one?
--
Cyril Hrubis
chrubis@suse.cz
* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Mateusz Guzik @ 2026-04-01 4:19 UTC
To: Jori Koolstra
Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
H. Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <20260331172011.3512876-2-jkoolstra@xs4all.nl>
On Tue, Mar 31, 2026 at 07:19:58PM +0200, Jori Koolstra wrote:
> @@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
> lookup_flags |= LOOKUP_REVAL;
> goto retry;
> }
> +
> + if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
> + struct path new_path = { .mnt = path.mnt, .dentry = dentry };
> + error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
> + }
> + end_creating_path(&path, dentry);
> return error;
You can't do it like this. Should it turn out that no fd can be allocated,
the entire thing is going to error out while leaving the newly created
directory behind. You need to allocate the fd first, then do the hard
work, and only then fd_install() or free the fd. The FD_ADD machinery
can probably still be used, provided the real new mkdir is wrapped
properly.
It should be perfectly feasible to de facto wrap existing mkdir
functionality by this syscall.
On top of that, similarly to what other people mentioned, the new syscall
will definitely want to support O_CLOEXEC and probably other flags down
the line.
Trying to handle this in open() is a no-go. openat2 is rather
problematic.
I tend to agree mkdirat_fd is not a good name for the syscall either,
but I don't have a suggestion I'm happy with. I think the least bad name
would follow the existing stuff and be mkdirat2 or similar.
The routine would have to start with validating the passed O_ flags, for
now only allowing O_CLOEXEC and EINVAL-ing otherwise.
* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: H. Peter Anvin @ 2026-03-31 20:42 UTC
To: Yann Droneaud, Jori Koolstra, Andy Lutomirski, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, Alexander Viro,
Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
Shuah Khan, Greg Kroah-Hartman, Jan Kara, Alexander Aring
Cc: Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
linux-fsdevel, linux-api, linux-arch, linux-kselftest, cmirabil,
Masami Hiramatsu (Google)
In-Reply-To: <df5a6fec-ca67-4196-9e7b-cd129c79578e@droneaud.fr>
On March 31, 2026 1:25:03 PM PDT, Yann Droneaud <yann@droneaud.fr> wrote:
>Hi,
>
>On 31/03/2026 at 19:19, Jori Koolstra wrote:
>> Currently there is no way to race-freely create and open a directory.
>> For regular files we have open(O_CREAT) for creating a new file inode,
>> and returning a pinning fd to it. The lack of such functionality for
>> directories means that when populating a directory tree there's always
>> a race involved: the inodes first need to be created, and then opened
>> to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
>> but in the time window between the creation and the opening they might
>> be replaced by something else.
>>
>> Addressing this race without proper APIs is possible (by immediately
>> fstat()ing what was opened, to verify that it has the right inode type),
>> but difficult to get right. Hence, mkdirat_fd() that creates a directory
>> and returns an O_DIRECTORY fd is useful.
>>
>> This feature idea (and description) is taken from the UAPI group:
>> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
>>
>> Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
>> ---
>> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>> fs/internal.h | 1 +
>> fs/namei.c | 26 ++++++++++++++++++++++++--
>> include/linux/fcntl.h | 2 ++
>> include/linux/syscalls.h | 2 ++
>> include/uapi/asm-generic/fcntl.h | 3 +++
>> include/uapi/asm-generic/unistd.h | 5 ++++-
>> scripts/syscall.tbl | 1 +
>> 8 files changed, 38 insertions(+), 3 deletions(-)
>
>> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
>> index a332e79b3207..d2f0fdb82847 100644
>> --- a/include/linux/fcntl.h
>> +++ b/include/linux/fcntl.h
>> @@ -25,6 +25,8 @@
>> #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
>> #endif
>> +#define VALID_MKDIRAT_FD_FLAGS (MKDIRAT_FD_NEED_FD)
>> +
>
>I don't see support for O_CLOEXEC-ish flag, is the file descriptor in close-on-exec mode by default ? If yes, it should be mentioned.
>
>
>> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
>> index 613475285643..621458bf1fbf 100644
>> --- a/include/uapi/asm-generic/fcntl.h
>> +++ b/include/uapi/asm-generic/fcntl.h
>> @@ -95,6 +95,9 @@
>> #define O_NDELAY O_NONBLOCK
>> #endif
>> +/* Flags for mkdirat_fd */
>> +#define MKDIRAT_FD_NEED_FD 0x01
>> +
>
>
>Regards.
>
>
And even if it is, POSIX already has O_CLOFORK and we should expect that that will be needed, too.
* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Yann Droneaud @ 2026-03-31 20:25 UTC
To: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Alexander Viro,
Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
Alexander Aring
Cc: Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
linux-fsdevel, linux-api, linux-arch, linux-kselftest, cmirabil,
Masami Hiramatsu (Google)
In-Reply-To: <20260331172011.3512876-2-jkoolstra@xs4all.nl>
Hi,
On 31/03/2026 at 19:19, Jori Koolstra wrote:
> Currently there is no way to race-freely create and open a directory.
> For regular files we have open(O_CREAT) for creating a new file inode,
> and returning a pinning fd to it. The lack of such functionality for
> directories means that when populating a directory tree there's always
> a race involved: the inodes first need to be created, and then opened
> to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
> but in the time window between the creation and the opening they might
> be replaced by something else.
>
> Addressing this race without proper APIs is possible (by immediately
> fstat()ing what was opened, to verify that it has the right inode type),
> but difficult to get right. Hence, mkdirat_fd() that creates a directory
> and returns an O_DIRECTORY fd is useful.
>
> This feature idea (and description) is taken from the UAPI group:
> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
>
> Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
> ---
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> fs/internal.h | 1 +
> fs/namei.c | 26 ++++++++++++++++++++++++--
> include/linux/fcntl.h | 2 ++
> include/linux/syscalls.h | 2 ++
> include/uapi/asm-generic/fcntl.h | 3 +++
> include/uapi/asm-generic/unistd.h | 5 ++++-
> scripts/syscall.tbl | 1 +
> 8 files changed, 38 insertions(+), 3 deletions(-)
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index a332e79b3207..d2f0fdb82847 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -25,6 +25,8 @@
> #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> #endif
>
> +#define VALID_MKDIRAT_FD_FLAGS (MKDIRAT_FD_NEED_FD)
> +
I don't see support for an O_CLOEXEC-like flag. Is the file descriptor in
close-on-exec mode by default? If so, that should be mentioned.
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..621458bf1fbf 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,9 @@
> #define O_NDELAY O_NONBLOCK
> #endif
>
> +/* Flags for mkdirat_fd */
> +#define MKDIRAT_FD_NEED_FD 0x01
> +
Regards.
* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Arnd Bergmann @ 2026-03-31 19:13 UTC
To: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Alexander Viro,
Christian Brauner, Jeff Layton, Chuck Lever, shuah,
Greg Kroah-Hartman, H. Peter Anvin, Jan Kara, Alexander Aring
Cc: Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
linux-fsdevel, linux-api, Linux-Arch, linux-kselftest, cmirabil,
Masami Hiramatsu
In-Reply-To: <20260331172011.3512876-2-jkoolstra@xs4all.nl>
On Tue, Mar 31, 2026, at 19:19, Jori Koolstra wrote:
> Currently there is no way to race-freely create and open a directory.
> For regular files we have open(O_CREAT) for creating a new file inode,
> and returning a pinning fd to it. The lack of such functionality for
> directories means that when populating a directory tree there's always
> a race involved: the inodes first need to be created, and then opened
> to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
> but in the time window between the creation and the opening they might
> be replaced by something else.
>
> Addressing this race without proper APIs is possible (by immediately
> fstat()ing what was opened, to verify that it has the right inode type),
> but difficult to get right. Hence, mkdirat_fd() that creates a directory
> and returns an O_DIRECTORY fd is useful.
>
> This feature idea (and description) is taken from the UAPI group:
> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
>
> Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
I checked that the calling conventions are fine, i.e. this will work
as expected across all architectures. I assume you are also aware
that the non-RFC patch will need to add the syscall number to all
.tbl files.
The hardest problem here does seem to be the naming of the
new syscall, and I'm sorry to not be able to offer any solution
either, just two observations:
- mkdirat/mkdirat_fd sounds similar to the existing
quotactl/quotactl_fd pair, but quotactl_fd() takes a file
descriptor argument rather than returning it, which makes
this addition quite confusing.
- the nicest interface IMO would have been a variation of
openat(dfd, filename, O_CREAT | O_DIRECTORY, mode)
but that is a minefield of incompatible implementations[1],
so we can't do that without changing the behavior for
existing callers that currently run into an error.
Arnd
[1] https://lwn.net/Articles/926782/
* [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Jori Koolstra @ 2026-03-31 17:19 UTC
To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
H. Peter Anvin, Jan Kara, Alexander Aring
Cc: Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
linux-fsdevel, linux-api, linux-arch, linux-kselftest, cmirabil,
Jori Koolstra, Masami Hiramatsu (Google)
In-Reply-To: <20260331172011.3512876-1-jkoolstra@xs4all.nl>
Currently there is no way to race-freely create and open a directory.
For regular files we have open(O_CREAT) for creating a new file inode,
and returning a pinning fd to it. The lack of such functionality for
directories means that when populating a directory tree there's always
a race involved: the inodes first need to be created, and then opened
to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
but in the time window between the creation and the opening they might
be replaced by something else.
Addressing this race without proper APIs is possible (by immediately
fstat()ing what was opened, to verify that it has the right inode type),
but difficult to get right. Hence, mkdirat_fd() that creates a directory
and returns an O_DIRECTORY fd is useful.
This feature idea (and description) is taken from the UAPI group:
https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/internal.h | 1 +
fs/namei.c | 26 ++++++++++++++++++++++++--
include/linux/fcntl.h | 2 ++
include/linux/syscalls.h | 2 ++
include/uapi/asm-generic/fcntl.h | 3 +++
include/uapi/asm-generic/unistd.h | 5 ++++-
scripts/syscall.tbl | 1 +
8 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..dda920c26941 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,7 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common mkdirat_fd sys_mkdirat_fd
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/internal.h b/fs/internal.h
index cbc384a1aa09..2885a3e4ebdd 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -58,6 +58,7 @@ int filename_unlinkat(int dfd, struct filename *name);
int may_linkat(struct mnt_idmap *idmap, const struct path *link);
int filename_renameat2(int olddfd, struct filename *oldname, int newdfd,
struct filename *newname, unsigned int flags);
+int filename_mkdirat_fd(int dfd, struct filename *name, umode_t mode, unsigned int flags);
int filename_mkdirat(int dfd, struct filename *name, umode_t mode);
int filename_mknodat(int dfd, struct filename *name, umode_t mode, unsigned int dev);
int filename_symlinkat(struct filename *from, int newdfd, struct filename *to);
diff --git a/fs/namei.c b/fs/namei.c
index 1eb9db055292..93252937983e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5256,6 +5256,11 @@ struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
EXPORT_SYMBOL(vfs_mkdir);
int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
+{
+ return filename_mkdirat_fd(dfd, name, mode, 0);
+}
+
+int filename_mkdirat_fd(int dfd, struct filename *name, umode_t mode, unsigned int flags)
{
struct dentry *dentry;
struct path path;
@@ -5263,7 +5268,7 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
unsigned int lookup_flags = LOOKUP_DIRECTORY;
struct delegated_inode delegated_inode = { };
-retry:
+start:
dentry = filename_create(dfd, name, &path, lookup_flags);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
@@ -5276,7 +5281,6 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
if (IS_ERR(dentry))
error = PTR_ERR(dentry);
}
- end_creating_path(&path, dentry);
if (is_delegated(&delegated_inode)) {
error = break_deleg_wait(&delegated_inode);
if (!error)
@@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
lookup_flags |= LOOKUP_REVAL;
goto retry;
}
+
+ if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
+ struct path new_path = { .mnt = path.mnt, .dentry = dentry };
+ error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
+ }
+ end_creating_path(&path, dentry);
return error;
+retry:
+ end_creating_path(&path, dentry);
+ goto start;
+}
+
+SYSCALL_DEFINE4(mkdirat_fd, int, dfd, const char __user *, pathname, umode_t, mode,
+ unsigned int, flags)
+{
+ CLASS(filename, name)(pathname);
+ if (flags & ~VALID_MKDIRAT_FD_FLAGS)
+ return -EINVAL;
+ return filename_mkdirat_fd(dfd, name, mode, flags | MKDIRAT_FD_NEED_FD);
}
SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index a332e79b3207..d2f0fdb82847 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -25,6 +25,8 @@
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
#endif
+#define VALID_MKDIRAT_FD_FLAGS (MKDIRAT_FD_NEED_FD)
+
#if BITS_PER_LONG == 32
#define IS_GETLK32(cmd) ((cmd) == F_GETLK)
#define IS_SETLK32(cmd) ((cmd) == F_SETLK)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 02bd6ddb6278..52e7f09d5525 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
u32 size, u32 flags);
asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_mkdirat_fd(int dfd, const char __user *pathname, umode_t mode,
+ unsigned int flags);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285643..621458bf1fbf 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -95,6 +95,9 @@
#define O_NDELAY O_NONBLOCK
#endif
+/* Flags for mkdirat_fd */
+#define MKDIRAT_FD_NEED_FD 0x01
+
#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
#define F_SETFD 2 /* set/clear close_on_exec */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..5bae1029f5d9 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+#define __NR_mkdirat_fd 472
+__SYSCALL(__NR_mkdirat_fd, sys_mkdirat_fd)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
/*
* 32 bit systems traditionally used different
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..db3bd97d4a1a 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,4 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common mkdirat_fd sys_mkdirat_fd
--
2.53.0
* [RFC PATCH 2/2] selftest: add tests for mkdirat_fd()
From: Jori Koolstra @ 2026-03-31 17:19 UTC (permalink / raw)
To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman
Cc: H . Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
linux-kselftest, cmirabil, Jori Koolstra, Ingo Molnar
In-Reply-To: <20260331172011.3512876-1-jkoolstra@xs4all.nl>
Add some tests for the new mkdirat_fd() syscall to test compliance and
to showcase its behaviour.
Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
tools/include/uapi/asm-generic/unistd.h | 5 +-
tools/testing/selftests/filesystems/Makefile | 4 +-
.../selftests/filesystems/mkdirat_fd_test.c | 139 ++++++++++++++++++
3 files changed, 145 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/filesystems/mkdirat_fd_test.c
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..5bae1029f5d9 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+#define __NR_mkdirat_fd 472
+__SYSCALL(__NR_mkdirat_fd, sys_mkdirat_fd)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
/*
* 32 bit systems traditionally used different
diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile
index 85427d7f19b9..7357769db57a 100644
--- a/tools/testing/selftests/filesystems/Makefile
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
-CFLAGS += $(KHDR_INCLUDES)
-TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test kernfs_test fclog
+CFLAGS += $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
+TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test kernfs_test fclog mkdirat_fd_test
TEST_GEN_PROGS_EXTENDED := dnotify_test
include ../lib.mk
diff --git a/tools/testing/selftests/filesystems/mkdirat_fd_test.c b/tools/testing/selftests/filesystems/mkdirat_fd_test.c
new file mode 100644
index 000000000000..9058be49dc7b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/mkdirat_fd_test.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sys/stat.h>
+
+#include <asm-generic/unistd.h>
+
+#include "kselftest_harness.h"
+
+#ifndef MKDIRAT_FD_NEED_FD
+#define MKDIRAT_FD_NEED_FD 0x01
+#endif
+
+#define mkdirat_fd_checked(dfd, pathname) ({ \
+ struct stat __st; \
+ int __fd = sys_mkdirat_fd(dfd, pathname, S_IRWXU, MKDIRAT_FD_NEED_FD); \
+ ASSERT_GE(__fd, 0); \
+ EXPECT_EQ(fstat(__fd, &__st), 0); \
+ EXPECT_TRUE(S_ISDIR(__st.st_mode)); \
+ __fd; \
+})
+
+static inline int sys_mkdirat_fd(int dfd, const char *pathname, mode_t mode,
+ unsigned int flags)
+{
+ return syscall(__NR_mkdirat_fd, dfd, pathname, mode, flags);
+}
+
+FIXTURE(mkdirat_fd) {
+ char dirpath[PATH_MAX];
+ int dfd;
+};
+
+FIXTURE_SETUP(mkdirat_fd)
+{
+ snprintf(self->dirpath, sizeof(self->dirpath),
+ "/tmp/mkdirat_fd_test.%d", getpid());
+ ASSERT_EQ(mkdir(self->dirpath, S_IRWXU), 0);
+
+ self->dfd = open(self->dirpath, O_DIRECTORY);
+ ASSERT_GE(self->dfd, 0);
+}
+
+FIXTURE_TEARDOWN(mkdirat_fd)
+{
+ close(self->dfd);
+ rmdir(self->dirpath);
+}
+
+/* mkdirat_fd must return a valid fd for the new directory. */
+TEST_F(mkdirat_fd, returns_fd)
+{
+ int fd = mkdirat_fd_checked(self->dfd, "newdir");
+ EXPECT_EQ(close(fd), 0);
+ EXPECT_EQ(unlinkat(self->dfd, "newdir", AT_REMOVEDIR), 0);
+}
+
+/* The fd must refer to the directory that was just created. */
+TEST_F(mkdirat_fd, fd_is_created_dir)
+{
+ int fd;
+ struct stat st_via_fd, st_via_path;
+ char path[PATH_MAX];
+
+ fd = mkdirat_fd_checked(self->dfd, "checkdir");
+
+ ASSERT_EQ(fstat(fd, &st_via_fd), 0);
+
+ snprintf(path, sizeof(path), "%s/checkdir", self->dirpath);
+ ASSERT_EQ(stat(path, &st_via_path), 0);
+
+ EXPECT_EQ(st_via_fd.st_ino, st_via_path.st_ino);
+ EXPECT_EQ(st_via_fd.st_dev, st_via_path.st_dev);
+
+ EXPECT_EQ(close(fd), 0);
+ EXPECT_EQ(rmdir(path), 0);
+}
+
+
+/* Missing parent component must fail with ENOENT. */
+TEST_F(mkdirat_fd, enoent_missing_parent)
+{
+ EXPECT_EQ(sys_mkdirat_fd(self->dfd, "nonexistent/child", S_IRWXU, MKDIRAT_FD_NEED_FD), -1);
+ EXPECT_EQ(errno, ENOENT);
+}
+
+/* An invalid dfd must fail with EBADF. */
+TEST_F(mkdirat_fd, ebadf)
+{
+ EXPECT_EQ(sys_mkdirat_fd(-42, "badfdir", S_IRWXU, MKDIRAT_FD_NEED_FD), -1);
+ EXPECT_EQ(errno, EBADF);
+}
+
+/* A dfd that points to a file (not a directory) must fail with ENOTDIR. */
+TEST_F(mkdirat_fd, enotdir_dfd)
+{
+ int file_fd;
+
+ file_fd = openat(self->dfd, "file",
+ O_CREAT | O_WRONLY, S_IRWXU);
+ ASSERT_GE(file_fd, 0);
+
+ EXPECT_EQ(sys_mkdirat_fd(file_fd, "subdir", S_IRWXU, MKDIRAT_FD_NEED_FD), -1);
+ EXPECT_EQ(errno, ENOTDIR);
+
+ EXPECT_EQ(close(file_fd), 0);
+ EXPECT_EQ(unlinkat(self->dfd, "file", 0), 0);
+}
+
+/*
+ * The returned fd must be usable as a dfd for further *at() calls.
+ */
+TEST_F(mkdirat_fd, fd_usable_as_dfd)
+{
+ int parent_fd, child_fd;
+
+ parent_fd = mkdirat_fd_checked(self->dfd, "parent");
+ child_fd = mkdirat_fd_checked(parent_fd, "child");
+
+ EXPECT_EQ(close(child_fd), 0);
+ EXPECT_EQ(close(parent_fd), 0);
+
+ char path[PATH_MAX];
+ snprintf(path, sizeof(path), "%s/parent/child", self->dirpath);
+ EXPECT_EQ(rmdir(path), 0);
+ snprintf(path, sizeof(path), "%s/parent", self->dirpath);
+ EXPECT_EQ(rmdir(path), 0);
+}
+
+/* Unknown flags must be rejected with EINVAL. */
+TEST_F(mkdirat_fd, einval_unknown_flags)
+{
+ EXPECT_EQ(sys_mkdirat_fd(self->dfd, "flagsdir", S_IRWXU, ~MKDIRAT_FD_NEED_FD), -1);
+ EXPECT_EQ(errno, EINVAL);
+}
+
+TEST_HARNESS_MAIN
--
2.53.0
* [RFC PATCH 0/2] vfs: mkdirat_fd() syscall
From: Jori Koolstra @ 2026-03-31 17:19 UTC (permalink / raw)
To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman
Cc: H . Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
linux-kselftest, cmirabil, Jori Koolstra
This series implements the mkdirat_fd() syscall suggested on the UAPI
group kernel features page [1], together with some tests.
Obviously, if we want this we should also implement mknodat_fd() and
symlinkat_fd(), but I believe they can be implemented in much the same
way.
I have added an unsigned int flags argument, as [2] suggests, and an
example flag that we may want to remove (right now it mainly serves an
internal purpose). It does mark where I would want to place the definitions.
This has been compiled and tested on x86 only. [2] is a bit confusing
here and there, so I hope I have added the proper syscall definitions
everywhere they need to be added.
[1]: https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
[2]: https://www.kernel.org/doc/html/latest/process/adding-syscalls.html
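For context, here is a minimal userspace sketch of the two-step pattern that
mkdirat_fd() is meant to replace. It uses only existing calls; the helper
name mkdir_then_open() is made up for illustration and is not part of this
series.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a directory and then open it, as two separate syscalls.
 * Returns a directory fd on success, or -1 with errno set.
 * (Hypothetical helper, for illustration only.)
 */
static int mkdir_then_open(int dfd, const char *name, mode_t mode)
{
	if (mkdirat(dfd, name, mode) < 0)
		return -1;

	/*
	 * Race window: between mkdirat() and openat() another process may
	 * rename or replace "name". O_DIRECTORY and O_NOFOLLOW reject a
	 * non-directory or a symlink, but they cannot prove that the
	 * directory opened here is the one created above.
	 */
	return openat(dfd, name,
		      O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
}
```

A single call that creates the directory and returns its fd removes that
window, which is the point of the proposed syscall.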
Jori Koolstra (2):
vfs: syscalls: add mkdirat_fd()
selftest: add tests for mkdirat_fd()
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/internal.h | 1 +
fs/namei.c | 26 +++-
include/linux/fcntl.h | 2 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/fcntl.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
scripts/syscall.tbl | 1 +
tools/include/uapi/asm-generic/unistd.h | 5 +-
tools/testing/selftests/filesystems/Makefile | 4 +-
.../selftests/filesystems/mkdirat_fd_test.c | 139 ++++++++++++++++++
11 files changed, 183 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/filesystems/mkdirat_fd_test.c
--
2.53.0
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-03-30 15:07 UTC (permalink / raw)
To: Jeff Layton
Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
bharathsm, shuah, miklos, hansg
In-Reply-To: <e526fbdb450a593b575355c1c9ae21f286427275.camel@kernel.org>
On Mon, Mar 30, 2026 at 5:49 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sat, 2026-03-28 at 23:22 +0600, Dorjoy Chowdhury wrote:
> > This flag indicates the path should be opened if it's a regular file.
> > This is useful to write secure programs that want to avoid being
> > tricked into opening device nodes with special semantics while thinking
> > they operate on regular files. This is a requested feature from the
> > uapi-group[1].
> >
> > A corresponding error code EFTYPE has been introduced. For example, if
> > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > like FreeBSD, macOS.
> >
> > When used in combination with O_CREAT, either the regular file is
> > created, or if the path already exists, it is opened if it's a regular
> > file. Otherwise, -EFTYPE is returned.
> >
> > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > as it doesn't make sense to open a path that is both a directory and a
> > regular file.
> >
> > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> >
> > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > ---
> > arch/alpha/include/uapi/asm/errno.h | 2 ++
> > arch/alpha/include/uapi/asm/fcntl.h | 1 +
> > arch/mips/include/uapi/asm/errno.h | 2 ++
> > arch/parisc/include/uapi/asm/errno.h | 2 ++
> > arch/parisc/include/uapi/asm/fcntl.h | 1 +
> > arch/sparc/include/uapi/asm/errno.h | 2 ++
> > arch/sparc/include/uapi/asm/fcntl.h | 1 +
> > fs/ceph/file.c | 4 ++++
> > fs/fcntl.c | 4 ++--
> > fs/gfs2/inode.c | 6 ++++++
> > fs/namei.c | 4 ++++
> > fs/nfs/dir.c | 4 ++++
> > fs/open.c | 8 +++++---
> > fs/smb/client/dir.c | 14 +++++++++++++-
> > include/linux/fcntl.h | 2 ++
> > include/uapi/asm-generic/errno.h | 2 ++
> > include/uapi/asm-generic/fcntl.h | 4 ++++
> > tools/arch/alpha/include/uapi/asm/errno.h | 2 ++
> > tools/arch/mips/include/uapi/asm/errno.h | 2 ++
> > tools/arch/parisc/include/uapi/asm/errno.h | 2 ++
> > tools/arch/sparc/include/uapi/asm/errno.h | 2 ++
> > tools/include/uapi/asm-generic/errno.h | 2 ++
> > 22 files changed, 67 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h
> > index 6791f6508632..1a99f38813c7 100644
> > --- a/arch/alpha/include/uapi/asm/errno.h
> > +++ b/arch/alpha/include/uapi/asm/errno.h
> > @@ -127,4 +127,6 @@
> >
> > #define EHWPOISON 139 /* Memory page has hardware error */
> >
> > +#define EFTYPE 140 /* Wrong file type for the intended operation */
> > +
> > #endif
> > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > index 50bdc8e8a271..fe488bf7c18e 100644
> > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > @@ -34,6 +34,7 @@
> >
> > #define O_PATH 040000000
> > #define __O_TMPFILE 0100000000
> > +#define OPENAT2_REGULAR 0200000000
> >
> > #define F_GETLK 7
> > #define F_SETLK 8
> > diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h
> > index c01ed91b1ef4..1835a50b69ce 100644
> > --- a/arch/mips/include/uapi/asm/errno.h
> > +++ b/arch/mips/include/uapi/asm/errno.h
> > @@ -126,6 +126,8 @@
> >
> > #define EHWPOISON 168 /* Memory page has hardware error */
> >
> > +#define EFTYPE 169 /* Wrong file type for the intended operation */
> > +
> > #define EDQUOT 1133 /* Quota exceeded */
> >
> >
> > diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h
> > index 8cbc07c1903e..93194fbb0a80 100644
> > --- a/arch/parisc/include/uapi/asm/errno.h
> > +++ b/arch/parisc/include/uapi/asm/errno.h
> > @@ -124,4 +124,6 @@
> >
> > #define EHWPOISON 257 /* Memory page has hardware error */
> >
> > +#define EFTYPE 258 /* Wrong file type for the intended operation */
> > +
> > #endif
> > diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
> > index 03dee816cb13..d46812f2f0f4 100644
> > --- a/arch/parisc/include/uapi/asm/fcntl.h
> > +++ b/arch/parisc/include/uapi/asm/fcntl.h
> > @@ -19,6 +19,7 @@
> >
> > #define O_PATH 020000000
> > #define __O_TMPFILE 040000000
> > +#define OPENAT2_REGULAR 0100000000
> >
> > #define F_GETLK64 8
> > #define F_SETLK64 9
> > diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h
> > index 4a41e7835fd5..71940ec9130b 100644
> > --- a/arch/sparc/include/uapi/asm/errno.h
> > +++ b/arch/sparc/include/uapi/asm/errno.h
> > @@ -117,4 +117,6 @@
> >
> > #define EHWPOISON 135 /* Memory page has hardware error */
> >
> > +#define EFTYPE 136 /* Wrong file type for the intended operation */
> > +
> > #endif
> > diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
> > index 67dae75e5274..bb6e9fa94bc9 100644
> > --- a/arch/sparc/include/uapi/asm/fcntl.h
> > +++ b/arch/sparc/include/uapi/asm/fcntl.h
> > @@ -37,6 +37,7 @@
> >
> > #define O_PATH 0x1000000
> > #define __O_TMPFILE 0x2000000
> > +#define OPENAT2_REGULAR 0x4000000
> >
> > #define F_GETOWN 5 /* for sockets. */
> > #define F_SETOWN 6 /* for sockets. */
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 66bbf6d517a9..6d8d4c7765e6 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -977,6 +977,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > ceph_init_inode_acls(newino, &as_ctx);
> > file->f_mode |= FMODE_CREATED;
> > }
> > + if ((flags & OPENAT2_REGULAR) && !d_is_reg(dentry)) {
> > + err = -EFTYPE;
> > + goto out_req;
> > + }
>
> ^^^
> This doesn't look quite right. Here's a larger chunk of the code:
>
> -------------------------8<--------------------------
> if (d_in_lookup(dentry)) {
> dn = ceph_finish_lookup(req, dentry, err);
> if (IS_ERR(dn))
> err = PTR_ERR(dn);
> } else {
> /* we were given a hashed negative dentry */
> dn = NULL;
> }
> if (err)
> goto out_req;
> if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
> /* make vfs retry on splice, ENOENT, or symlink */
> doutc(cl, "finish_no_open on dn %p\n", dn);
> err = finish_no_open(file, dn);
> } else {
> if (IS_ENCRYPTED(dir) &&
> !fscrypt_has_permitted_context(dir, d_inode(dentry))) {
> pr_warn_client(cl,
> "Inconsistent encryption context (parent %llx:%llx child %llx:%llx)\n",
> ceph_vinop(dir), ceph_vinop(d_inode(dentry)));
> goto out_req;
> }
>
> doutc(cl, "finish_open on dn %p\n", dn);
> if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
> struct inode *newino = d_inode(dentry);
>
> cache_file_layout(dir, newino);
> ceph_init_inode_acls(newino, &as_ctx);
> file->f_mode |= FMODE_CREATED;
> }
> err = finish_open(file, dentry, ceph_open);
> }
> -------------------------8<--------------------------
>
> It looks like this won't handle it correctly if the pathwalk terminates
> on a symlink (re: d_is_symlink() case). You should either set up a test
> ceph cluster on your own, or reach out to the ceph community and ask
> them to test this.
>
Thanks for reviewing. The d_is_symlink() case seems to call
finish_no_open(), so shouldn't this be okay?
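To make the question concrete, the branch structure of the quoted
ceph_atomic_open() tail can be modelled in userspace roughly as below. This
is only a sketch: the enum and function names are invented for the model,
and it does not say what the VFS does after finish_no_open() returns.

```c
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in the kernel. */
enum dent_type { DENT_NEGATIVE, DENT_SYMLINK, DENT_REGULAR, DENT_OTHER };
enum outcome { OUT_RETRY_IN_VFS, OUT_EFTYPE, OUT_FINISH_OPEN };

/*
 * Mirrors the if/else in the quoted code, with the new OPENAT2_REGULAR
 * check placed where the hunk adds it: in the else branch, just before
 * finish_open().
 */
static enum outcome atomic_open_tail(bool spliced, enum dent_type type,
				     bool want_regular)
{
	/* dn || d_really_is_negative() || d_is_symlink() */
	if (spliced || type == DENT_NEGATIVE || type == DENT_SYMLINK)
		return OUT_RETRY_IN_VFS;	/* finish_no_open() */

	/* (flags & OPENAT2_REGULAR) && !d_is_reg(dentry) */
	if (want_regular && type != DENT_REGULAR)
		return OUT_EFTYPE;

	return OUT_FINISH_OPEN;			/* finish_open() */
}
```

In this model a symlink never reaches the new check, so whether the flag is
honoured for that case depends on the generic path taken after the retry.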
> > err = finish_open(file, dentry, ceph_open);
> > }
> > out_req:
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index beab8080badf..240bb511557a 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -1169,9 +1169,9 @@ static int __init fcntl_init(void)
> > * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> > * is defined as O_NONBLOCK on some platforms and not on others.
> > */
> > - BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
> > + BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> > HWEIGHT32(
> > - (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > + (VALID_OPENAT2_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > __FMODE_EXEC));
> >
> > fasync_cache = kmem_cache_create("fasync_cache",
> > diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> > index 8344040ecaf7..4604e2e8a9cc 100644
> > --- a/fs/gfs2/inode.c
> > +++ b/fs/gfs2/inode.c
> > @@ -738,6 +738,12 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
> > inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
> > error = PTR_ERR(inode);
> > if (!IS_ERR(inode)) {
> > + if (file && (file->f_flags & OPENAT2_REGULAR) && !S_ISREG(inode->i_mode)) {
>
> Isn't OPENAT2_REGULAR getting masked off in ->f_flags now?
>
Yes, I thought the masking happened after this code path had executed.
Maybe it's better anyway to add a flags param to this function, forward
the flags from gfs2_atomic_open(), and pass 0 at the other call sites?
What do you think?
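A small userspace model of the difference between the two approaches, as a
sketch only: the struct, the helper names and the flag values are made up,
and the masking step merely imitates the f_flags line in the fs/open.c hunk.

```c
/* Hypothetical flag values, chosen only for this model. */
#define MODEL_O_CREAT		0x1u
#define MODEL_OPENAT2_REGULAR	0x2u

struct file_model {
	unsigned int f_flags;
};

/* Imitates the f_flags masking in the quoted do_dentry_open() hunk. */
static void mask_open_flags(struct file_model *f)
{
	f->f_flags &= ~(MODEL_O_CREAT | MODEL_OPENAT2_REGULAR);
}

/* Variant A: read the flag back from the file; depends on call ordering. */
static int wants_regular_from_file(const struct file_model *f)
{
	return f && (f->f_flags & MODEL_OPENAT2_REGULAR) != 0;
}

/* Variant B: the caller forwards its open flags; other call sites pass 0. */
static int wants_regular_from_flags(unsigned int flags)
{
	return (flags & MODEL_OPENAT2_REGULAR) != 0;
}
```

Variant B gives the same answer regardless of whether the masking has
already run, which is the attraction of passing the flags explicitly.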
Regards,
Dorjoy
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jeff Layton @ 2026-03-30 11:49 UTC (permalink / raw)
To: Dorjoy Chowdhury, linux-fsdevel
Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
v9fs, linux-kselftest, viro, brauner, jack, chuck.lever,
alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
miklos, hansg
In-Reply-To: <20260328172314.45807-2-dorjoychy111@gmail.com>
On Sat, 2026-03-28 at 23:22 +0600, Dorjoy Chowdhury wrote:
> This flag indicates the path should be opened if it's a regular file.
> This is useful to write secure programs that want to avoid being
> tricked into opening device nodes with special semantics while thinking
> they operate on regular files. This is a requested feature from the
> uapi-group[1].
>
> A corresponding error code EFTYPE has been introduced. For example, if
> openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> like FreeBSD, macOS.
>
> When used in combination with O_CREAT, either the regular file is
> created, or if the path already exists, it is opened if it's a regular
> file. Otherwise, -EFTYPE is returned.
>
> When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> as it doesn't make sense to open a path that is both a directory and a
> regular file.
>
> [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
>
> Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> ---
> arch/alpha/include/uapi/asm/errno.h | 2 ++
> arch/alpha/include/uapi/asm/fcntl.h | 1 +
> arch/mips/include/uapi/asm/errno.h | 2 ++
> arch/parisc/include/uapi/asm/errno.h | 2 ++
> arch/parisc/include/uapi/asm/fcntl.h | 1 +
> arch/sparc/include/uapi/asm/errno.h | 2 ++
> arch/sparc/include/uapi/asm/fcntl.h | 1 +
> fs/ceph/file.c | 4 ++++
> fs/fcntl.c | 4 ++--
> fs/gfs2/inode.c | 6 ++++++
> fs/namei.c | 4 ++++
> fs/nfs/dir.c | 4 ++++
> fs/open.c | 8 +++++---
> fs/smb/client/dir.c | 14 +++++++++++++-
> include/linux/fcntl.h | 2 ++
> include/uapi/asm-generic/errno.h | 2 ++
> include/uapi/asm-generic/fcntl.h | 4 ++++
> tools/arch/alpha/include/uapi/asm/errno.h | 2 ++
> tools/arch/mips/include/uapi/asm/errno.h | 2 ++
> tools/arch/parisc/include/uapi/asm/errno.h | 2 ++
> tools/arch/sparc/include/uapi/asm/errno.h | 2 ++
> tools/include/uapi/asm-generic/errno.h | 2 ++
> 22 files changed, 67 insertions(+), 6 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h
> index 6791f6508632..1a99f38813c7 100644
> --- a/arch/alpha/include/uapi/asm/errno.h
> +++ b/arch/alpha/include/uapi/asm/errno.h
> @@ -127,4 +127,6 @@
>
> #define EHWPOISON 139 /* Memory page has hardware error */
>
> +#define EFTYPE 140 /* Wrong file type for the intended operation */
> +
> #endif
> diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> index 50bdc8e8a271..fe488bf7c18e 100644
> --- a/arch/alpha/include/uapi/asm/fcntl.h
> +++ b/arch/alpha/include/uapi/asm/fcntl.h
> @@ -34,6 +34,7 @@
>
> #define O_PATH 040000000
> #define __O_TMPFILE 0100000000
> +#define OPENAT2_REGULAR 0200000000
>
> #define F_GETLK 7
> #define F_SETLK 8
> diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h
> index c01ed91b1ef4..1835a50b69ce 100644
> --- a/arch/mips/include/uapi/asm/errno.h
> +++ b/arch/mips/include/uapi/asm/errno.h
> @@ -126,6 +126,8 @@
>
> #define EHWPOISON 168 /* Memory page has hardware error */
>
> +#define EFTYPE 169 /* Wrong file type for the intended operation */
> +
> #define EDQUOT 1133 /* Quota exceeded */
>
>
> diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h
> index 8cbc07c1903e..93194fbb0a80 100644
> --- a/arch/parisc/include/uapi/asm/errno.h
> +++ b/arch/parisc/include/uapi/asm/errno.h
> @@ -124,4 +124,6 @@
>
> #define EHWPOISON 257 /* Memory page has hardware error */
>
> +#define EFTYPE 258 /* Wrong file type for the intended operation */
> +
> #endif
> diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
> index 03dee816cb13..d46812f2f0f4 100644
> --- a/arch/parisc/include/uapi/asm/fcntl.h
> +++ b/arch/parisc/include/uapi/asm/fcntl.h
> @@ -19,6 +19,7 @@
>
> #define O_PATH 020000000
> #define __O_TMPFILE 040000000
> +#define OPENAT2_REGULAR 0100000000
>
> #define F_GETLK64 8
> #define F_SETLK64 9
> diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h
> index 4a41e7835fd5..71940ec9130b 100644
> --- a/arch/sparc/include/uapi/asm/errno.h
> +++ b/arch/sparc/include/uapi/asm/errno.h
> @@ -117,4 +117,6 @@
>
> #define EHWPOISON 135 /* Memory page has hardware error */
>
> +#define EFTYPE 136 /* Wrong file type for the intended operation */
> +
> #endif
> diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
> index 67dae75e5274..bb6e9fa94bc9 100644
> --- a/arch/sparc/include/uapi/asm/fcntl.h
> +++ b/arch/sparc/include/uapi/asm/fcntl.h
> @@ -37,6 +37,7 @@
>
> #define O_PATH 0x1000000
> #define __O_TMPFILE 0x2000000
> +#define OPENAT2_REGULAR 0x4000000
>
> #define F_GETOWN 5 /* for sockets. */
> #define F_SETOWN 6 /* for sockets. */
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 66bbf6d517a9..6d8d4c7765e6 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -977,6 +977,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> ceph_init_inode_acls(newino, &as_ctx);
> file->f_mode |= FMODE_CREATED;
> }
> + if ((flags & OPENAT2_REGULAR) && !d_is_reg(dentry)) {
> + err = -EFTYPE;
> + goto out_req;
> + }
^^^
This doesn't look quite right. Here's a larger chunk of the code:
-------------------------8<--------------------------
if (d_in_lookup(dentry)) {
dn = ceph_finish_lookup(req, dentry, err);
if (IS_ERR(dn))
err = PTR_ERR(dn);
} else {
/* we were given a hashed negative dentry */
dn = NULL;
}
if (err)
goto out_req;
if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
/* make vfs retry on splice, ENOENT, or symlink */
doutc(cl, "finish_no_open on dn %p\n", dn);
err = finish_no_open(file, dn);
} else {
if (IS_ENCRYPTED(dir) &&
!fscrypt_has_permitted_context(dir, d_inode(dentry))) {
pr_warn_client(cl,
"Inconsistent encryption context (parent %llx:%llx child %llx:%llx)\n",
ceph_vinop(dir), ceph_vinop(d_inode(dentry)));
goto out_req;
}
doutc(cl, "finish_open on dn %p\n", dn);
if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
struct inode *newino = d_inode(dentry);
cache_file_layout(dir, newino);
ceph_init_inode_acls(newino, &as_ctx);
file->f_mode |= FMODE_CREATED;
}
err = finish_open(file, dentry, ceph_open);
}
-------------------------8<--------------------------
It looks like this won't handle it correctly if the pathwalk terminates
on a symlink (re: d_is_symlink() case). You should either set up a test
ceph cluster on your own, or reach out to the ceph community and ask
them to test this.
> err = finish_open(file, dentry, ceph_open);
> }
> out_req:
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index beab8080badf..240bb511557a 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1169,9 +1169,9 @@ static int __init fcntl_init(void)
> * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> * is defined as O_NONBLOCK on some platforms and not on others.
> */
> - BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
> + BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> HWEIGHT32(
> - (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> + (VALID_OPENAT2_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> __FMODE_EXEC));
>
> fasync_cache = kmem_cache_create("fasync_cache",
> diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> index 8344040ecaf7..4604e2e8a9cc 100644
> --- a/fs/gfs2/inode.c
> +++ b/fs/gfs2/inode.c
> @@ -738,6 +738,12 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
> inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
> error = PTR_ERR(inode);
> if (!IS_ERR(inode)) {
> + if (file && (file->f_flags & OPENAT2_REGULAR) && !S_ISREG(inode->i_mode)) {
Isn't OPENAT2_REGULAR getting masked off in ->f_flags now?
JFYI: it's quite simple to set up a single-node gfs2 fs to test this.
> + iput(inode);
> + inode = NULL;
> + error = -EFTYPE;
> + goto fail_gunlock;
> + }
> if (S_ISDIR(inode->i_mode)) {
> iput(inode);
> inode = NULL;
> diff --git a/fs/namei.c b/fs/namei.c
> index 2113958c3b7a..e557c538c238 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -4679,6 +4679,10 @@ static int do_open(struct nameidata *nd,
> if (unlikely(error))
> return error;
> }
> +
> + if ((open_flag & OPENAT2_REGULAR) && !d_is_reg(nd->path.dentry))
> + return -EFTYPE;
> +
> if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
> return -ENOTDIR;
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index ddc3789363a5..bfe9470327c8 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -2195,6 +2195,10 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
> break;
> case -EISDIR:
> case -ENOTDIR:
> + if (open_flags & OPENAT2_REGULAR) {
> + err = -EFTYPE;
> + break;
> + }
> goto no_open;
> case -ELOOP:
> if (!(open_flags & O_NOFOLLOW))
> diff --git a/fs/open.c b/fs/open.c
> index 681d405bc61e..a6f445f72181 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
> if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
> f->f_mode |= FMODE_CAN_ODIRECT;
>
> - f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> + f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);
> f->f_iocb_flags = iocb_flags(f);
>
> file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
> @@ -1183,7 +1183,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> int lookup_flags = 0;
> int acc_mode = ACC_MODE(flags);
>
> - BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPEN_FLAGS),
> + BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPENAT2_FLAGS),
> "struct open_flags doesn't yet handle flags > 32 bits");
>
> /*
> @@ -1196,7 +1196,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> * values before calling build_open_flags(), but openat2(2) checks all
> * of its arguments.
> */
> - if (flags & ~VALID_OPEN_FLAGS)
> + if (flags & ~VALID_OPENAT2_FLAGS)
> return -EINVAL;
> if (how->resolve & ~VALID_RESOLVE_FLAGS)
> return -EINVAL;
> @@ -1235,6 +1235,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> return -EINVAL;
> if (!(acc_mode & MAY_WRITE))
> return -EINVAL;
> + } else if ((flags & O_DIRECTORY) && (flags & OPENAT2_REGULAR)) {
> + return -EINVAL;
> }
> if (flags & O_PATH) {
> /* O_PATH only permits certain other flags to be set. */
> diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
> index 953f1fee8cb8..355681ebacf1 100644
> --- a/fs/smb/client/dir.c
> +++ b/fs/smb/client/dir.c
> @@ -222,6 +222,13 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
> goto cifs_create_get_file_info;
> }
>
> + if ((oflags & OPENAT2_REGULAR) && !S_ISREG(newinode->i_mode)) {
> + CIFSSMBClose(xid, tcon, fid->netfid);
> + iput(newinode);
> + rc = -EFTYPE;
> + goto out;
> + }
> +
> if (S_ISDIR(newinode->i_mode)) {
> CIFSSMBClose(xid, tcon, fid->netfid);
> iput(newinode);
> @@ -436,11 +443,16 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
> goto out_err;
> }
>
> - if (newinode)
> + if (newinode) {
> + if ((oflags & OPENAT2_REGULAR) && !S_ISREG(newinode->i_mode)) {
> + rc = -EFTYPE;
> + goto out_err;
> + }
> if (S_ISDIR(newinode->i_mode)) {
> rc = -EISDIR;
> goto out_err;
> }
> + }
>
> d_drop(direntry);
> d_add(direntry, newinode);
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index a332e79b3207..a80026718217 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -12,6 +12,8 @@
> FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
>
> +#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OPENAT2_REGULAR)
> +
> /* List of all valid flags for the how->resolve argument: */
> #define VALID_RESOLVE_FLAGS \
> (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
> diff --git a/include/uapi/asm-generic/errno.h b/include/uapi/asm-generic/errno.h
> index 92e7ae493ee3..bd78e69e0a43 100644
> --- a/include/uapi/asm-generic/errno.h
> +++ b/include/uapi/asm-generic/errno.h
> @@ -122,4 +122,6 @@
>
> #define EHWPOISON 133 /* Memory page has hardware error */
>
> +#define EFTYPE 134 /* Wrong file type for the intended operation */
> +
> #endif
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..b2c2ddd0edc0 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -88,6 +88,10 @@
> #define __O_TMPFILE 020000000
> #endif
>
> +#ifndef OPENAT2_REGULAR
> +#define OPENAT2_REGULAR 040000000
> +#endif
> +
> /* a horrid kludge trying to make sure that this will fail on old kernels */
> #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
>
> diff --git a/tools/arch/alpha/include/uapi/asm/errno.h b/tools/arch/alpha/include/uapi/asm/errno.h
> index 6791f6508632..1a99f38813c7 100644
> --- a/tools/arch/alpha/include/uapi/asm/errno.h
> +++ b/tools/arch/alpha/include/uapi/asm/errno.h
> @@ -127,4 +127,6 @@
>
> #define EHWPOISON 139 /* Memory page has hardware error */
>
> +#define EFTYPE 140 /* Wrong file type for the intended operation */
> +
> #endif
> diff --git a/tools/arch/mips/include/uapi/asm/errno.h b/tools/arch/mips/include/uapi/asm/errno.h
> index c01ed91b1ef4..1835a50b69ce 100644
> --- a/tools/arch/mips/include/uapi/asm/errno.h
> +++ b/tools/arch/mips/include/uapi/asm/errno.h
> @@ -126,6 +126,8 @@
>
> #define EHWPOISON 168 /* Memory page has hardware error */
>
> +#define EFTYPE 169 /* Wrong file type for the intended operation */
> +
> #define EDQUOT 1133 /* Quota exceeded */
>
>
> diff --git a/tools/arch/parisc/include/uapi/asm/errno.h b/tools/arch/parisc/include/uapi/asm/errno.h
> index 8cbc07c1903e..93194fbb0a80 100644
> --- a/tools/arch/parisc/include/uapi/asm/errno.h
> +++ b/tools/arch/parisc/include/uapi/asm/errno.h
> @@ -124,4 +124,6 @@
>
> #define EHWPOISON 257 /* Memory page has hardware error */
>
> +#define EFTYPE 258 /* Wrong file type for the intended operation */
> +
> #endif
> diff --git a/tools/arch/sparc/include/uapi/asm/errno.h b/tools/arch/sparc/include/uapi/asm/errno.h
> index 4a41e7835fd5..71940ec9130b 100644
> --- a/tools/arch/sparc/include/uapi/asm/errno.h
> +++ b/tools/arch/sparc/include/uapi/asm/errno.h
> @@ -117,4 +117,6 @@
>
> #define EHWPOISON 135 /* Memory page has hardware error */
>
> +#define EFTYPE 136 /* Wrong file type for the intended operation */
> +
> #endif
> diff --git a/tools/include/uapi/asm-generic/errno.h b/tools/include/uapi/asm-generic/errno.h
> index 92e7ae493ee3..bd78e69e0a43 100644
> --- a/tools/include/uapi/asm-generic/errno.h
> +++ b/tools/include/uapi/asm-generic/errno.h
> @@ -122,4 +122,6 @@
>
> #define EHWPOISON 133 /* Memory page has hardware error */
>
> +#define EFTYPE 134 /* Wrong file type for the intended operation */
> +
> #endif
--
Jeff Layton <jlayton@kernel.org>
* [PATCH v6 4/4] mips/fcntl.h: convert O_* flag macros from hex to octal
From: Dorjoy Chowdhury @ 2026-03-28 17:22 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
miklos, hansg
In-Reply-To: <20260328172314.45807-1-dorjoychy111@gmail.com>
Convert the O_* flag macros from hex to octal, following the convention
in include/uapi/asm-generic/fcntl.h and the other architecture-specific
arch/*/include/uapi/asm/fcntl.h files. The values are unchanged.
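A reviewer who wants to confirm that the conversion is purely cosmetic could
compare each old and new spelling mechanically. The sketch below is
illustrative only and is not part of the patch; the table is copied from the
diff that follows, and the names flag_pair, mips_flag_pairs and
count_flag_mismatches are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Old (hex) and new (octal) spellings, copied from the diff in this patch. */
struct flag_pair {
	const char *name;
	unsigned int hex;
	unsigned int octal;
};

static const struct flag_pair mips_flag_pairs[] = {
	{ "O_APPEND",    0x0008, 0000010 },
	{ "O_DSYNC",     0x0010, 0000020 },
	{ "O_NONBLOCK",  0x0080, 0000200 },
	{ "O_CREAT",     0x0100, 0000400 },
	{ "O_TRUNC",     0x0200, 0001000 },
	{ "O_EXCL",      0x0400, 0002000 },
	{ "O_NOCTTY",    0x0800, 0004000 },
	{ "FASYNC",      0x1000, 0010000 },
	{ "O_LARGEFILE", 0x2000, 0020000 },
	{ "__O_SYNC",    0x4000, 0040000 },
	{ "O_DIRECT",    0x8000, 0100000 },
};

/* A single pair can also be checked at compile time. */
static_assert(0x8000 == 0100000, "O_DIRECT must keep its value");

/* Returns the number of entries whose two spellings differ. */
static size_t count_flag_mismatches(void)
{
	size_t i, bad = 0;

	for (i = 0; i < sizeof(mips_flag_pairs) / sizeof(mips_flag_pairs[0]); i++)
		if (mips_flag_pairs[i].hex != mips_flag_pairs[i].octal)
			bad++;
	return bad;
}
```

count_flag_mismatches() should return 0 if every macro kept its value.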
Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
---
arch/mips/include/uapi/asm/fcntl.h | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/arch/mips/include/uapi/asm/fcntl.h b/arch/mips/include/uapi/asm/fcntl.h
index 0369a38e3d4f..6aa3f49df17e 100644
--- a/arch/mips/include/uapi/asm/fcntl.h
+++ b/arch/mips/include/uapi/asm/fcntl.h
@@ -11,15 +11,15 @@
#include <asm/sgidefs.h>
-#define O_APPEND 0x0008
-#define O_DSYNC 0x0010 /* used to be O_SYNC, see below */
-#define O_NONBLOCK 0x0080
-#define O_CREAT 0x0100 /* not fcntl */
-#define O_TRUNC 0x0200 /* not fcntl */
-#define O_EXCL 0x0400 /* not fcntl */
-#define O_NOCTTY 0x0800 /* not fcntl */
-#define FASYNC 0x1000 /* fcntl, for BSD compatibility */
-#define O_LARGEFILE 0x2000 /* allow large file opens */
+#define O_APPEND 0000010
+#define O_DSYNC 0000020 /* used to be O_SYNC, see below */
+#define O_NONBLOCK 0000200
+#define O_CREAT 0000400 /* not fcntl */
+#define O_TRUNC 0001000 /* not fcntl */
+#define O_EXCL 0002000 /* not fcntl */
+#define O_NOCTTY 0004000 /* not fcntl */
+#define FASYNC 0010000 /* fcntl, for BSD compatibility */
+#define O_LARGEFILE 0020000 /* allow large file opens */
/*
* Before Linux 2.6.33 only O_DSYNC semantics were implemented, but using
* the O_SYNC flag. We continue to use the existing numerical value
@@ -33,9 +33,9 @@
*
* Note: __O_SYNC must never be used directly.
*/
-#define __O_SYNC 0x4000
+#define __O_SYNC 0040000
#define O_SYNC (__O_SYNC|O_DSYNC)
-#define O_DIRECT 0x8000 /* direct disk access hint */
+#define O_DIRECT 0100000 /* direct disk access hint */
#define F_GETLK 14
#define F_SETLK 6
--
2.53.0
* [PATCH v6 3/4] sparc/fcntl.h: convert O_* flag macros from hex to octal
From: Dorjoy Chowdhury @ 2026-03-28 17:22 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
miklos, hansg
In-Reply-To: <20260328172314.45807-1-dorjoychy111@gmail.com>
Convert the O_* flag macros from hex to octal, following the convention
in include/uapi/asm-generic/fcntl.h and the other architecture-specific
arch/*/include/uapi/asm/fcntl.h files. The values are unchanged.
Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
---
arch/sparc/include/uapi/asm/fcntl.h | 36 ++++++++++++++---------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index bb6e9fa94bc9..33ce58ec57f6 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -2,23 +2,23 @@
#ifndef _SPARC_FCNTL_H
#define _SPARC_FCNTL_H
-#define O_APPEND 0x0008
-#define FASYNC 0x0040 /* fcntl, for BSD compatibility */
-#define O_CREAT 0x0200 /* not fcntl */
-#define O_TRUNC 0x0400 /* not fcntl */
-#define O_EXCL 0x0800 /* not fcntl */
-#define O_DSYNC 0x2000 /* used to be O_SYNC, see below */
-#define O_NONBLOCK 0x4000
+#define O_APPEND 0000000010
+#define FASYNC 0000000100 /* fcntl, for BSD compatibility */
+#define O_CREAT 0000001000 /* not fcntl */
+#define O_TRUNC 0000002000 /* not fcntl */
+#define O_EXCL 0000004000 /* not fcntl */
+#define O_DSYNC 0000020000 /* used to be O_SYNC, see below */
+#define O_NONBLOCK 0000040000
#if defined(__sparc__) && defined(__arch64__)
-#define O_NDELAY 0x0004
+#define O_NDELAY 0000000004
#else
-#define O_NDELAY (0x0004 | O_NONBLOCK)
+#define O_NDELAY (0000000004 | O_NONBLOCK)
#endif
-#define O_NOCTTY 0x8000 /* not fcntl */
-#define O_LARGEFILE 0x40000
-#define O_DIRECT 0x100000 /* direct disk access hint */
-#define O_NOATIME 0x200000
-#define O_CLOEXEC 0x400000
+#define O_NOCTTY 0000100000 /* not fcntl */
+#define O_LARGEFILE 0001000000
+#define O_DIRECT 0004000000 /* direct disk access hint */
+#define O_NOATIME 0010000000
+#define O_CLOEXEC 0020000000
/*
* Before Linux 2.6.33 only O_DSYNC semantics were implemented, but using
* the O_SYNC flag. We continue to use the existing numerical value
@@ -32,12 +32,12 @@
*
* Note: __O_SYNC must never be used directly.
*/
-#define __O_SYNC 0x800000
+#define __O_SYNC 0040000000
#define O_SYNC (__O_SYNC|O_DSYNC)
-#define O_PATH 0x1000000
-#define __O_TMPFILE 0x2000000
-#define OPENAT2_REGULAR 0x4000000
+#define O_PATH 0100000000
+#define __O_TMPFILE 0200000000
+#define OPENAT2_REGULAR 0400000000
#define F_GETOWN 5 /* for sockets. */
#define F_SETOWN 6 /* for sockets. */
--
2.53.0
* [PATCH v6 2/4] kselftest/openat2: test for OPENAT2_REGULAR flag
From: Dorjoy Chowdhury @ 2026-03-28 17:22 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
miklos, hansg
In-Reply-To: <20260328172314.45807-1-dorjoychy111@gmail.com>
Add a basic test: opening /dev/null with OPENAT2_REGULAR must fail with
EFTYPE.
Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
---
.../testing/selftests/openat2/openat2_test.c | 37 ++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
index 0e161ef9e9e4..e8847f7d416c 100644
--- a/tools/testing/selftests/openat2/openat2_test.c
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -320,8 +320,42 @@ void test_openat2_flags(void)
}
}
+#ifndef OPENAT2_REGULAR
+#define OPENAT2_REGULAR 040000000
+#endif
+
+#ifndef EFTYPE
+#define EFTYPE 134
+#endif
+
+void test_openat2_regular_flag(void)
+{
+ if (!openat2_supported) {
+ ksft_test_result_skip("Skipping %s as openat2 is not supported\n", __func__);
+ return;
+ }
+
+ struct open_how how = {
+ .flags = OPENAT2_REGULAR | O_RDONLY
+ };
+
+ int fd = sys_openat2(AT_FDCWD, "/dev/null", &how);
+
+ if (fd == -ENOENT) {
+ ksft_test_result_skip("Skipping %s as there is no /dev/null\n", __func__);
+ return;
+ }
+
+ if (fd != -EFTYPE) {
+ ksft_test_result_fail("openat2 should return EFTYPE\n");
+ return;
+ }
+
+ ksft_test_result_pass("%s succeeded\n", __func__);
+}
+
#define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \
- NUM_OPENAT2_FLAG_TESTS)
+ NUM_OPENAT2_FLAG_TESTS + 1)
int main(int argc, char **argv)
{
@@ -330,6 +364,7 @@ int main(int argc, char **argv)
test_openat2_struct();
test_openat2_flags();
+ test_openat2_regular_flag();
if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
ksft_exit_fail();
--
2.53.0
* [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-03-28 17:22 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
miklos, hansg
In-Reply-To: <20260328172314.45807-1-dorjoychy111@gmail.com>
This flag indicates that the path should be opened only if it's a regular file.
This is useful to write secure programs that want to avoid being
tricked into opening device nodes with special semantics while thinking
they operate on regular files. This is a requested feature from the
uapi-group[1].
A corresponding error code, EFTYPE, has been introduced. For example, if
openat2 is called on the path /dev/null with OPENAT2_REGULAR set in
how->flags, it will return -EFTYPE. EFTYPE is already used in BSD systems
such as FreeBSD and macOS.
When used in combination with O_CREAT, either the regular file is
created, or if the path already exists, it is opened if it's a regular
file. Otherwise, -EFTYPE is returned.
When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
as it doesn't make sense to open a path that is both a directory and a
regular file.
[1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
---
arch/alpha/include/uapi/asm/errno.h | 2 ++
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/mips/include/uapi/asm/errno.h | 2 ++
arch/parisc/include/uapi/asm/errno.h | 2 ++
arch/parisc/include/uapi/asm/fcntl.h | 1 +
arch/sparc/include/uapi/asm/errno.h | 2 ++
arch/sparc/include/uapi/asm/fcntl.h | 1 +
fs/ceph/file.c | 4 ++++
fs/fcntl.c | 4 ++--
fs/gfs2/inode.c | 6 ++++++
fs/namei.c | 4 ++++
fs/nfs/dir.c | 4 ++++
fs/open.c | 8 +++++---
fs/smb/client/dir.c | 14 +++++++++++++-
include/linux/fcntl.h | 2 ++
include/uapi/asm-generic/errno.h | 2 ++
include/uapi/asm-generic/fcntl.h | 4 ++++
tools/arch/alpha/include/uapi/asm/errno.h | 2 ++
tools/arch/mips/include/uapi/asm/errno.h | 2 ++
tools/arch/parisc/include/uapi/asm/errno.h | 2 ++
tools/arch/sparc/include/uapi/asm/errno.h | 2 ++
tools/include/uapi/asm-generic/errno.h | 2 ++
22 files changed, 67 insertions(+), 6 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h
index 6791f6508632..1a99f38813c7 100644
--- a/arch/alpha/include/uapi/asm/errno.h
+++ b/arch/alpha/include/uapi/asm/errno.h
@@ -127,4 +127,6 @@
#define EHWPOISON 139 /* Memory page has hardware error */
+#define EFTYPE 140 /* Wrong file type for the intended operation */
+
#endif
diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..fe488bf7c18e 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -34,6 +34,7 @@
#define O_PATH 040000000
#define __O_TMPFILE 0100000000
+#define OPENAT2_REGULAR 0200000000
#define F_GETLK 7
#define F_SETLK 8
diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h
index c01ed91b1ef4..1835a50b69ce 100644
--- a/arch/mips/include/uapi/asm/errno.h
+++ b/arch/mips/include/uapi/asm/errno.h
@@ -126,6 +126,8 @@
#define EHWPOISON 168 /* Memory page has hardware error */
+#define EFTYPE 169 /* Wrong file type for the intended operation */
+
#define EDQUOT 1133 /* Quota exceeded */
diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h
index 8cbc07c1903e..93194fbb0a80 100644
--- a/arch/parisc/include/uapi/asm/errno.h
+++ b/arch/parisc/include/uapi/asm/errno.h
@@ -124,4 +124,6 @@
#define EHWPOISON 257 /* Memory page has hardware error */
+#define EFTYPE 258 /* Wrong file type for the intended operation */
+
#endif
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03dee816cb13..d46812f2f0f4 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -19,6 +19,7 @@
#define O_PATH 020000000
#define __O_TMPFILE 040000000
+#define OPENAT2_REGULAR 0100000000
#define F_GETLK64 8
#define F_SETLK64 9
diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h
index 4a41e7835fd5..71940ec9130b 100644
--- a/arch/sparc/include/uapi/asm/errno.h
+++ b/arch/sparc/include/uapi/asm/errno.h
@@ -117,4 +117,6 @@
#define EHWPOISON 135 /* Memory page has hardware error */
+#define EFTYPE 136 /* Wrong file type for the intended operation */
+
#endif
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..bb6e9fa94bc9 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
#define O_PATH 0x1000000
#define __O_TMPFILE 0x2000000
+#define OPENAT2_REGULAR 0x4000000
#define F_GETOWN 5 /* for sockets. */
#define F_SETOWN 6 /* for sockets. */
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 66bbf6d517a9..6d8d4c7765e6 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -977,6 +977,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
ceph_init_inode_acls(newino, &as_ctx);
file->f_mode |= FMODE_CREATED;
}
+ if ((flags & OPENAT2_REGULAR) && !d_is_reg(dentry)) {
+ err = -EFTYPE;
+ goto out_req;
+ }
err = finish_open(file, dentry, ceph_open);
}
out_req:
diff --git a/fs/fcntl.c b/fs/fcntl.c
index beab8080badf..240bb511557a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1169,9 +1169,9 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
- (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
+ (VALID_OPENAT2_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC));
fasync_cache = kmem_cache_create("fasync_cache",
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 8344040ecaf7..4604e2e8a9cc 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -738,6 +738,12 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
error = PTR_ERR(inode);
if (!IS_ERR(inode)) {
+ if (file && (file->f_flags & OPENAT2_REGULAR) && !S_ISREG(inode->i_mode)) {
+ iput(inode);
+ inode = NULL;
+ error = -EFTYPE;
+ goto fail_gunlock;
+ }
if (S_ISDIR(inode->i_mode)) {
iput(inode);
inode = NULL;
diff --git a/fs/namei.c b/fs/namei.c
index 2113958c3b7a..e557c538c238 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4679,6 +4679,10 @@ static int do_open(struct nameidata *nd,
if (unlikely(error))
return error;
}
+
+ if ((open_flag & OPENAT2_REGULAR) && !d_is_reg(nd->path.dentry))
+ return -EFTYPE;
+
if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
return -ENOTDIR;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ddc3789363a5..bfe9470327c8 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2195,6 +2195,10 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
break;
case -EISDIR:
case -ENOTDIR:
+ if (open_flags & OPENAT2_REGULAR) {
+ err = -EFTYPE;
+ break;
+ }
goto no_open;
case -ELOOP:
if (!(open_flags & O_NOFOLLOW))
diff --git a/fs/open.c b/fs/open.c
index 681d405bc61e..a6f445f72181 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
f->f_mode |= FMODE_CAN_ODIRECT;
- f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
+ f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);
f->f_iocb_flags = iocb_flags(f);
file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
@@ -1183,7 +1183,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
int lookup_flags = 0;
int acc_mode = ACC_MODE(flags);
- BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPEN_FLAGS),
+ BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPENAT2_FLAGS),
"struct open_flags doesn't yet handle flags > 32 bits");
/*
@@ -1196,7 +1196,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
* values before calling build_open_flags(), but openat2(2) checks all
* of its arguments.
*/
- if (flags & ~VALID_OPEN_FLAGS)
+ if (flags & ~VALID_OPENAT2_FLAGS)
return -EINVAL;
if (how->resolve & ~VALID_RESOLVE_FLAGS)
return -EINVAL;
@@ -1235,6 +1235,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
return -EINVAL;
if (!(acc_mode & MAY_WRITE))
return -EINVAL;
+ } else if ((flags & O_DIRECTORY) && (flags & OPENAT2_REGULAR)) {
+ return -EINVAL;
}
if (flags & O_PATH) {
/* O_PATH only permits certain other flags to be set. */
diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index 953f1fee8cb8..355681ebacf1 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -222,6 +222,13 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
goto cifs_create_get_file_info;
}
+ if ((oflags & OPENAT2_REGULAR) && !S_ISREG(newinode->i_mode)) {
+ CIFSSMBClose(xid, tcon, fid->netfid);
+ iput(newinode);
+ rc = -EFTYPE;
+ goto out;
+ }
+
if (S_ISDIR(newinode->i_mode)) {
CIFSSMBClose(xid, tcon, fid->netfid);
iput(newinode);
@@ -436,11 +443,16 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
goto out_err;
}
- if (newinode)
+ if (newinode) {
+ if ((oflags & OPENAT2_REGULAR) && !S_ISREG(newinode->i_mode)) {
+ rc = -EFTYPE;
+ goto out_err;
+ }
if (S_ISDIR(newinode->i_mode)) {
rc = -EISDIR;
goto out_err;
}
+ }
d_drop(direntry);
d_add(direntry, newinode);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index a332e79b3207..a80026718217 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -12,6 +12,8 @@
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OPENAT2_REGULAR)
+
/* List of all valid flags for the how->resolve argument: */
#define VALID_RESOLVE_FLAGS \
(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
diff --git a/include/uapi/asm-generic/errno.h b/include/uapi/asm-generic/errno.h
index 92e7ae493ee3..bd78e69e0a43 100644
--- a/include/uapi/asm-generic/errno.h
+++ b/include/uapi/asm-generic/errno.h
@@ -122,4 +122,6 @@
#define EHWPOISON 133 /* Memory page has hardware error */
+#define EFTYPE 134 /* Wrong file type for the intended operation */
+
#endif
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285643..b2c2ddd0edc0 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
#define __O_TMPFILE 020000000
#endif
+#ifndef OPENAT2_REGULAR
+#define OPENAT2_REGULAR 040000000
+#endif
+
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
diff --git a/tools/arch/alpha/include/uapi/asm/errno.h b/tools/arch/alpha/include/uapi/asm/errno.h
index 6791f6508632..1a99f38813c7 100644
--- a/tools/arch/alpha/include/uapi/asm/errno.h
+++ b/tools/arch/alpha/include/uapi/asm/errno.h
@@ -127,4 +127,6 @@
#define EHWPOISON 139 /* Memory page has hardware error */
+#define EFTYPE 140 /* Wrong file type for the intended operation */
+
#endif
diff --git a/tools/arch/mips/include/uapi/asm/errno.h b/tools/arch/mips/include/uapi/asm/errno.h
index c01ed91b1ef4..1835a50b69ce 100644
--- a/tools/arch/mips/include/uapi/asm/errno.h
+++ b/tools/arch/mips/include/uapi/asm/errno.h
@@ -126,6 +126,8 @@
#define EHWPOISON 168 /* Memory page has hardware error */
+#define EFTYPE 169 /* Wrong file type for the intended operation */
+
#define EDQUOT 1133 /* Quota exceeded */
diff --git a/tools/arch/parisc/include/uapi/asm/errno.h b/tools/arch/parisc/include/uapi/asm/errno.h
index 8cbc07c1903e..93194fbb0a80 100644
--- a/tools/arch/parisc/include/uapi/asm/errno.h
+++ b/tools/arch/parisc/include/uapi/asm/errno.h
@@ -124,4 +124,6 @@
#define EHWPOISON 257 /* Memory page has hardware error */
+#define EFTYPE 258 /* Wrong file type for the intended operation */
+
#endif
diff --git a/tools/arch/sparc/include/uapi/asm/errno.h b/tools/arch/sparc/include/uapi/asm/errno.h
index 4a41e7835fd5..71940ec9130b 100644
--- a/tools/arch/sparc/include/uapi/asm/errno.h
+++ b/tools/arch/sparc/include/uapi/asm/errno.h
@@ -117,4 +117,6 @@
#define EHWPOISON 135 /* Memory page has hardware error */
+#define EFTYPE 136 /* Wrong file type for the intended operation */
+
#endif
diff --git a/tools/include/uapi/asm-generic/errno.h b/tools/include/uapi/asm-generic/errno.h
index 92e7ae493ee3..bd78e69e0a43 100644
--- a/tools/include/uapi/asm-generic/errno.h
+++ b/tools/include/uapi/asm-generic/errno.h
@@ -122,4 +122,6 @@
#define EHWPOISON 133 /* Memory page has hardware error */
+#define EFTYPE 134 /* Wrong file type for the intended operation */
+
#endif
--
2.53.0
* [PATCH v6 0/4] OPENAT2_REGULAR flag support for openat2
From: Dorjoy Chowdhury @ 2026-03-28 17:22 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
miklos, hansg
Hi,
I came upon this "Ability to only open regular files" uapi feature suggestion
from https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
and thought it would be something I could do as a first patch and get to
know the kernel code a bit better.
The following filesystems have been tested by building the kernel as an x86
bzImage and booting it in a Fedora 43 VM under QEMU. With OPENAT2_REGULAR, I
have verified that regular files open successfully and that non-regular files
(directories, FIFOs, etc.) return -EFTYPE.
- btrfs
- NFS (loopback)
- SMB (loopback)
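For anyone who wants to repeat this kind of check from userspace, a minimal
sketch of a caller could look like the following. It is not the program used
for the testing above. The helper name open_regular_rdonly is made up, and the
fallback value of OPENAT2_REGULAR is the asm-generic one from this series, so
it is only an assumption that it matches the running kernel (alpha, parisc and
sparc use different values).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: asm-generic value from this series. */
#ifndef OPENAT2_REGULAR
#define OPENAT2_REGULAR 040000000
#endif

/*
 * Hypothetical helper: open `path` read-only, but only if it is a regular
 * file. Returns a file descriptor on success or a negative errno.
 */
static int open_regular_rdonly(const char *path)
{
	struct open_how how = {
		.flags = O_RDONLY | OPENAT2_REGULAR,
	};
	long ret;

	assert(path != NULL);

	/* glibc has no openat2() wrapper, so call the syscall directly. */
	ret = syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof(how));
	if (ret < 0)
		return -errno;
	return (int)ret;
}
```

Called as open_regular_rdonly("/dev/null"), this should return -EFTYPE on a
kernel carrying this series. A kernel that does not know the flag is expected
to return -EINVAL instead, since openat2 rejects unknown flag bits, which also
gives callers a way to detect missing support.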
Changes in v6:
- OPENAT2_REGULAR stripped from file->f_flags in do_dentry_open so that it doesn't leak in fcntl(fd, F_GETFL)
- BUILD_BUG_ON updated to use VALID_OPENAT2_FLAGS instead of VALID_OPEN_FLAGS in build_open_flags and in fcntl_init
- v5 is at: https://lore.kernel.org/linux-fsdevel/20260307140726.70219-1-dorjoychy111@gmail.com/T/
Changes in v5:
- EFTYPE is already used in BSDs mentioned in commit message
- consistently return -EFTYPE in all filesystems
- v4 is at: https://lore.kernel.org/linux-fsdevel/20260221145915.81749-1-dorjoychy111@gmail.com/T/
Changes in v4:
- changed O_REGULAR to OPENAT2_REGULAR
- OPENAT2_REGULAR does not affect O_PATH
- atomic_open codepaths updated to work properly for OPENAT2_REGULAR
- commit message includes the uapi-group URL
- v3 is at: https://lore.kernel.org/linux-fsdevel/20260127180109.66691-1-dorjoychy111@gmail.com/T/
Changes in v3:
- included motivation about O_REGULAR flag in commit message e.g., programs not wanting to be tricked into opening device nodes
- fixed commit message wrongly referencing ENOTREGULAR instead of ENOTREG
- fixed the O_REGULAR flag in arch/parisc/include/uapi/asm/fcntl.h from 060000000 to 0100000000
- added 2 commits converting arch/{mips,sparc}/include/uapi/asm/fcntl.h O_* macros from hex to octal
- v2 is at: https://lore.kernel.org/linux-fsdevel/20260126154156.55723-1-dorjoychy111@gmail.com/T/
Changes in v2:
- rename ENOTREGULAR to ENOTREG
- define ENOTREG in uapi/asm-generic/errno.h (instead of errno-base.h) and in arch/*/include/uapi/asm/errno.h files
- override O_REGULAR in arch/{alpha,sparc,parisc}/include/uapi/asm/fcntl.h due to clash with include/uapi/asm-generic/fcntl.h
- I have kept the kselftest, but now that O_REGULAR and ENOTREG can have different values on different architectures I am not sure it is right
- v1 is at: https://lore.kernel.org/linux-fsdevel/20260125141518.59493-1-dorjoychy111@gmail.com/T/
Thanks.
Regards,
Dorjoy
Dorjoy Chowdhury (4):
openat2: new OPENAT2_REGULAR flag support
kselftest/openat2: test for OPENAT2_REGULAR flag
sparc/fcntl.h: convert O_* flag macros from hex to octal
mips/fcntl.h: convert O_* flag macros from hex to octal
arch/alpha/include/uapi/asm/errno.h | 2 +
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/mips/include/uapi/asm/errno.h | 2 +
arch/mips/include/uapi/asm/fcntl.h | 22 +++++------
arch/parisc/include/uapi/asm/errno.h | 2 +
arch/parisc/include/uapi/asm/fcntl.h | 1 +
arch/sparc/include/uapi/asm/errno.h | 2 +
arch/sparc/include/uapi/asm/fcntl.h | 35 +++++++++---------
fs/ceph/file.c | 4 ++
fs/fcntl.c | 4 +-
fs/gfs2/inode.c | 6 +++
fs/namei.c | 4 ++
fs/nfs/dir.c | 4 ++
fs/open.c | 8 ++--
fs/smb/client/dir.c | 14 ++++++-
include/linux/fcntl.h | 2 +
include/uapi/asm-generic/errno.h | 2 +
include/uapi/asm-generic/fcntl.h | 4 ++
tools/arch/alpha/include/uapi/asm/errno.h | 2 +
tools/arch/mips/include/uapi/asm/errno.h | 2 +
tools/arch/parisc/include/uapi/asm/errno.h | 2 +
tools/arch/sparc/include/uapi/asm/errno.h | 2 +
tools/include/uapi/asm-generic/errno.h | 2 +
.../testing/selftests/openat2/openat2_test.c | 37 ++++++++++++++++++-
24 files changed, 131 insertions(+), 35 deletions(-)
--
2.53.0
* Re: [PATCH 1/4] exec: inherit HWCAPs from the parent process
From: Andrei Vagin @ 2026-03-28 0:21 UTC (permalink / raw)
To: Mark Rutland
Cc: Will Deacon, Kees Cook, Andrew Morton, Marek Szyprowski,
Cyrill Gorcunov, Mike Rapoport, Alexander Mikhalitsyn,
linux-kernel, linux-fsdevel, linux-mm, criu, Catalin Marinas,
linux-arm-kernel, Chen Ridong, Christian Brauner,
David Hildenbrand, Eric Biederman, Lorenzo Stoakes, Michal Koutny,
Alexander Mikhalitsyn, Linux API
In-Reply-To: <acarA3sGKY4Acozw@J2N7QTR9R3.cambridge.arm.com>
On Fri, Mar 27, 2026 at 9:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Mar 24, 2026 at 03:19:49PM -0700, Andrei Vagin wrote:
> > Hi Mark and Will,
> >
> > Thanks for the feedback. Please read the inline comments.
> >
> > On Tue, Mar 24, 2026 at 3:28 AM Will Deacon <will@kernel.org> wrote:
> > >
> > > On Mon, Mar 23, 2026 at 06:21:22PM +0000, Mark Rutland wrote:
> > > > On Mon, Mar 23, 2026 at 05:53:37PM +0000, Andrei Vagin wrote:
> > > > > Introduces a mechanism to inherit hardware capabilities (AT_HWCAP,
> > > > > AT_HWCAP2, etc.) from a parent process when they have been modified via
> > > > > prctl.
> > > > >
> > > > > To support C/R operations (snapshots, live migration) in heterogeneous
> > > > > clusters, we must ensure that processes utilize CPU features available
> > > > > on all potential target nodes. To solve this, we need to advertise a
> > > > > common feature set across the cluster.
> > > > >
> > > > > This patch adds a new mm flag MMF_USER_HWCAP, which is set when the
> > > > > auxiliary vector is modified via prctl(PR_SET_MM, PR_SET_MM_AUXV). When
> > > > > execve() is called, if the current process has MMF_USER_HWCAP set, the
> > > > > HWCAP values are extracted from the current auxiliary vector and stored
> > > > > in the linux_binprm structure. These values are then used to populate
> > > > > the auxiliary vector of the new process, effectively inheriting the
> > > > > hardware capabilities.
> > > > >
> > > > > The inherited HWCAPs are masked with the hardware capabilities supported
> > > > > by the current kernel to ensure that we don't report more features than
> > > > > actually supported. This is important to avoid unexpected behavior,
> > > > > especially for processes with additional privileges.
> > > >
> > > > At a high level, I don't think that's going to be sufficient:
> > > >
> > > > * On an architecture with other userspace accessible feature
> > > > identification mechanism registers (e.g. ID registers), userspace
> > > > might read those. So you might need to hide stuff there too, and
> > > > that's going to require architecture-specific interfaces to manage.
> > > >
> > > > It's possible that some code checks HWCAPs and others check ID
> > > > registers, and mismatch between the two could be problematic.
> > > >
> > > > * If the HWCAPs can be inherited by a more privileged task, then a
> > > > malicious user could use this to hide security features (e.g. shadow
> > > > stack or pointer authentication on arm64), and make it easier to
> > > > attack that task. While not a direct attack, it would undermine those
> > > > features.
> >
> > I agree with Mark that only a privileged process should be able to mask
> > certain hardware features. Currently, PR_SET_MM_AUXV is guarded by
> > CAP_SYS_RESOURCE, but PR_SET_MM_MAP allows changing the auxiliary vector
> > without specific capabilities. This is definitely an issue. To address
> > this, I think we can consider introducing a new prctl command to enable
> > HWCAP inheritance explicitly.
> >
> > > Yeah, this looks like a non-starter to me on arm64. Even if it was
> > > extended to apply the same treatment to the idregs, many of the hwcap
> > > features can't actually be disabled by the kernel and so you still run
> > > the risk of a task that probes for the presence of a feature using
> > > something like a SIGILL handler or, perhaps more likely, assumes that
> > > the presence of one hwcap implies the presence of another. And then
> > > there are the applications that just base everything off the MIDR...
> >
> > The goal of this mechanism is not to provide strict architectural
> > enforcement or to trap the use of hardware features; rather, it is to
> > provide a consistent discovery interface for applications. I chose the
> > HWCAP vector because it mirrors the existing behavior of running an
> > older kernel on newer hardware: while ID registers might report a
> > feature as physically present, the HWCAPs will omit it if the kernel
> > lacks support.
>
> On arm64, the view of the ID registers that userspace gets *only*
> exposes features that the kernel knows about, as userspace reads of
> those registers are trapped+emulated by the kernel. On arm64 it's
> not true to say that something appears in those but not the HWCAPs.
>
> I understand that might be different on other architectures, and so
> maybe this approach is sufficient on other architectures, but it is not
> sufficient on arm64.
>
> > Applications are generally expected to treat HWCAPs as
> > the source of truth for which features are safe to use, even if the
> > underlying hardware is technically capable of more.
>
> I'm fairly certain that there are arm64 applications (and libraries)
> which check only the ID register values, and not the HWCAPs.
>
> Architecturally, there are features which are detected via other
> mechanisms (e.g. CHKFEAT), for which HWCAPs are also irrelevant. Even if
> that happens to be ok today, there are almost certainly future uses that
> will not be compatible with the scheme you propose.
>
> I don't think we can say "applications must check the HWCAPs", when we
> know that applications and libraries legitimately don't always do that.
>
> > Another significant advantage of using HWCAPs is that many
> > applications already rely on them for feature detection. This interface
> > allows these applications to work correctly "out-of-the-box" in a
> > migrated environment without requiring any userspace modifications. I
> > understand that some apps may use other detection methods; however, there
> > is no guarantee that these applications will work correctly after
> > migration to another machine.
>
> I think the existence of applications that detect features by other
> (legitimate!) means implies that there's no guarantee that this feature
> is useful and will remain useful going forwards.
>
> For example, what do you plan to do if an application or library starts
> doing something legitimate that causes it to become incompatible with
> this scheme?
>
> I don't want to be in a position where userspace is asked to steer clear
> of legitimate mechanisms, or where architecture code suddenly has to
> pick up a lot of complexity to make this work.
>
> > > There's also kvm, which provides a roundabout way to query some features
> > > of the underlying hardware.
> > >
> > > You're probably better off using/extending the idreg overrides we have
> > > in arch/arm64/kernel/pi/idreg-override.c so that you can make your
> > > cluster of heterogeneous machines look alike.
> >
> > IIRC, idreg-override/cpuid-masking usually works for an entire machine.
> > We actually need to have a mechanism that will work on a per-container
> > basis. Workloads inside one cluster can have different
> > migration/snapshot requirements. Some are pinned to a specific node,
> > others are never migrated, while others need to be migratable across a
> > cluster or even between clusters. We need a mechanism that can be
> > tunable on a per-container/per-process basis.
>
> I think that's theoretically possible, BUT it will require substantially
> more complexity, to address the issues that Will and I have mentioned. I
> don't think people are very happy to pick up that complexity.
>
> There are many other aspects that are going to be problematic for
> heterogeneous migration. Even if you hide the HWCAP for a stateful
> feature (e.g. SME), it might appear in one machine's signal frames (and
> be mandatory there), but might not appear in another's, and so migration
> might not work either way. Likewise, that state can appear via ptrace.
Hi Mark,
I understand all these points and they are valid. However, as I
mentioned, we are not trying to introduce a mechanism that will strictly
enforce feature sets for every container. While we would like to have
that functionality, as you and Will mentioned, it would require
substantially more complexity to address, and maintainers would be unlikely
to pick up that complexity. Even masking ID registers on a per-container
basis would introduce extra complexity that could make architecture
maintainers unhappy. There were a few attempts to introduce container
CPUID masking on x86_64 in the past.
In CRIU, we are not aiming to handle every possible workload. Our goal
is to target workloads where developers are ready to cooperate and
willing to make adjustments to be C/R compatible. The goal here is to
provide developers with clear instructions on what they can do to ensure
their applications are C/R compatible. When I say "workloads", I mean
this in a broad sense. A container might pack a set of tools with
different runtimes (Go, Java, libc-based). All these runtimes should
detect only allowed features.
Returning to the subject of this patchset: this series extends the role
of hwcaps. With this change, we would establish that hwcaps is the
"source of truth" for which features an application can safely use. Any
other features available on the current CPU would not be guaranteed to
remain available after migration to another machine.
After this discussion, I found that the current version missed one major
thing: there should be a signal indicating that hwcaps must be used for
feature detection. Since we will need to integrate this interface into
libc, Go, and other runtimes, they definitely should not rely just on
hwcaps by default, especially in the early stages. This can be solved
via the prctl command. Libraries like libc would call
prctl(PR_USER_HWCAP_ENABLED). If this returns true, the runtime knows
that only the features explicitly listed in hwcaps should be used.
You are right, the controlled feature set will be limited to features
the kernel knows about. And yes, we would need to report CPU features in
hwcaps even if the kernel isn't directly involved in handling them.
Honestly, I am not certain if this is the "right" interface for that,
and I would be happy to consider other ideas. I understand that these
hwcaps will not work right out of the box, but we need a way to solve
this problem. Having a centralized API for CPU/kernel feature detection
seems like the right direction.
As for signal frame size and extended states like SVE/SME, we are aware
of this problem. However, it is partly mitigated by the fact that if
an application does not use some features, those states are not placed
in the signal frame. In the future, when we construct/reload a signal
frame, we could look at the feature set of the process and generate
a frame according to those features...
Thanks,
Andrei
* Re: [PATCH v2 3/9] kernel/api: add debugfs interface for kernel API specifications
From: Greg Kroah-Hartman @ 2026-03-24 11:45 UTC (permalink / raw)
To: Sasha Levin
Cc: linux-api, linux-kernel, linux-doc, linux-fsdevel, linux-kbuild,
linux-kselftest, workflows, tools, x86, Thomas Gleixner,
Paul E . McKenney, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
Cyril Hrubis, Kees Cook, Jake Edge, David Laight, Askar Safin,
Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
Ingo Molnar, Arnd Bergmann
In-Reply-To: <acJ2gnnA9MP1wO_Z@laps>
On Tue, Mar 24, 2026 at 07:33:22AM -0400, Sasha Levin wrote:
> On Tue, Mar 24, 2026 at 09:20:01AM +0100, Greg Kroah-Hartman wrote:
> > On Mon, Mar 23, 2026 at 07:58:50PM -0400, Sasha Levin wrote:
> > > > But this only works if the kabi stuff is built into the kernel image,
> > > > right? This doesn't work if any of these abi sections are in a module
> > > > or am I missing that logic here?
> > >
> > > That is correct, for now.
> > >
> > > I'm only trying to tackle syscalls to begin with, and since no syscalls live in
> > > modules, we have no need for module support.
> >
> > We used to support syscalls in modules, but thankfully that is now gone.
> > But, how will this work for stuff like usbfs ioctls? That is a module,
> > and our uapi is, by far, in drivers through ioctl "hell" and that would
> > be great to be able to document through all of this. Will that just not
> > be in the debugfs api?
>
> It will. I see it working just like how BTF or trace events do it now.
>
> When a module loads, find_module_sections() extracts the .kapi_specs section
> pointer and element count into new struct module fields. The COMING notifier
> then iterates those specs, registers each via the existing kapi_register_spec()
> dynamic registration path, and creates per-spec debugfs files under the
> existing /sys/kernel/debug/kapi/specs/ directory. The kapi_list_show() function
> is extended to also walk the dynamic_api_specs list (currently it only iterates
> the static __start_kapi_specs..__stop_kapi_specs range). On GOING, all specs
> owned by that module are removed from the list and their debugfs entries
> cleaned up via debugfs_remove().
Sounds good, I was worried about that static range and how that would be
"extended" or not.
greg k-h
* Re: [PATCH v2 3/9] kernel/api: add debugfs interface for kernel API specifications
From: Sasha Levin @ 2026-03-24 11:33 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: linux-api, linux-kernel, linux-doc, linux-fsdevel, linux-kbuild,
linux-kselftest, workflows, tools, x86, Thomas Gleixner,
Paul E . McKenney, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
Cyril Hrubis, Kees Cook, Jake Edge, David Laight, Askar Safin,
Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
Ingo Molnar, Arnd Bergmann
In-Reply-To: <2026032411-paramount-lapdog-41e6@gregkh>
On Tue, Mar 24, 2026 at 09:20:01AM +0100, Greg Kroah-Hartman wrote:
>On Mon, Mar 23, 2026 at 07:58:50PM -0400, Sasha Levin wrote:
>> > But this only works if the kabi stuff is built into the kernel image,
>> > right? This doesn't work if any of these abi sections are in a module
>> > or am I missing that logic here?
>>
>> That is correct, for now.
>>
>> I'm only trying to tackle syscalls to begin with, and since no syscalls live in
>> modules, we have no need for module support.
>
>We used to support syscalls in modules, but thankfully that is now gone.
>But, how will this work for stuff like usbfs ioctls? That is a module,
>and our uapi is, by far, in drivers through ioctl "hell" and that would
>be great to be able to document through all of this. Will that just not
>be in the debugfs api?
It will. I see it working just like how BTF or trace events do it now.
When a module loads, find_module_sections() extracts the .kapi_specs section
pointer and element count into new struct module fields. The COMING notifier
then iterates those specs, registers each via the existing kapi_register_spec()
dynamic registration path, and creates per-spec debugfs files under the
existing /sys/kernel/debug/kapi/specs/ directory. The kapi_list_show() function
is extended to also walk the dynamic_api_specs list (currently it only iterates
the static __start_kapi_specs..__stop_kapi_specs range). On GOING, all specs
owned by that module are removed from the list and their debugfs entries
cleaned up via debugfs_remove().
--
Thanks,
Sasha
* Re: [PATCH v2 3/9] kernel/api: add debugfs interface for kernel API specifications
From: Mauro Carvalho Chehab @ 2026-03-24 9:49 UTC (permalink / raw)
To: Sasha Levin, Greg Kroah-Hartman
Cc: linux-api, linux-kernel, linux-doc, linux-fsdevel, linux-kbuild,
linux-kselftest, workflows, tools, x86, Thomas Gleixner,
Paul E . McKenney, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
Cyril Hrubis, Kees Cook, Jake Edge, David Laight, Askar Safin,
Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
Ingo Molnar, Arnd Bergmann
In-Reply-To: <acHTupVGxJR3gmFT@laps>
Hi Sasha,
On Mon, 23 Mar 2026 19:58:50 -0400
Sasha Levin <sashal@kernel.org> wrote:
> On Mon, Mar 23, 2026 at 02:52:58PM +0100, Greg Kroah-Hartman wrote:
> >On Sun, Mar 22, 2026 at 08:10:17AM -0400, Sasha Levin wrote:
> >> Add a debugfs interface to expose kernel API specifications at runtime.
> >> This allows tools and users to query the complete API specifications
> >> through the debugfs filesystem.
> >>
> >> The interface provides:
> >> - /sys/kernel/debug/kapi/list - lists all available API specifications
> >> - /sys/kernel/debug/kapi/specs/<name> - detailed info for each API
> >>
> >> Each specification file includes:
> >> - Function name, version, and descriptions
> >> - Execution context requirements and flags
> >> - Parameter details with types, flags, and constraints
> >> - Return value specifications and success conditions
> >> - Error codes with descriptions and conditions
> >> - Locking requirements and constraints
> >> - Signal handling specifications
> >> - Examples, notes, and deprecation status
> >>
> >> This enables runtime introspection of kernel APIs for documentation
> >> tools, static analyzers, and debugging purposes.
> >>
> >> Signed-off-by: Sasha Levin <sashal@kernel.org>
> >
> >Debugfs logic looks sane, nice.
>
> Thanks!
>
> >But this only works if the kabi stuff is built into the kernel image,
> >right? This doesn't work if any of these abi sections are in a module
> >or am I missing that logic here?
>
> That is correct, for now.
>
> I'm only trying to tackle syscalls to begin with, and since no syscalls live in
> modules, we have no need for module support.
Have you seen tools/docs/get_abi.py?
It handles debugfs/sysfs descriptions from Documentation/ABI/.
Its "rest" command converts ABI specs from it into RST.
The "undefined" command does realtime introspection for sysfs ABI,
checking if they're documented.
Perhaps you should consider integrating it into your new tool.
--
Thanks,
Mauro
* Re: [PATCH v2 3/9] kernel/api: add debugfs interface for kernel API specifications
From: Greg Kroah-Hartman @ 2026-03-24 8:20 UTC (permalink / raw)
To: Sasha Levin
Cc: linux-api, linux-kernel, linux-doc, linux-fsdevel, linux-kbuild,
linux-kselftest, workflows, tools, x86, Thomas Gleixner,
Paul E . McKenney, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
Cyril Hrubis, Kees Cook, Jake Edge, David Laight, Askar Safin,
Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
Ingo Molnar, Arnd Bergmann
In-Reply-To: <acHTupVGxJR3gmFT@laps>
On Mon, Mar 23, 2026 at 07:58:50PM -0400, Sasha Levin wrote:
> > But this only works if the kabi stuff is built into the kernel image,
> > right? This doesn't work if any of these abi sections are in a module
> > or am I missing that logic here?
>
> That is correct, for now.
>
> I'm only trying to tackle syscalls to begin with, and since no syscalls live in
> modules, we have no need for module support.
We used to support syscalls in modules, but thankfully that is now gone.
But, how will this work for stuff like usbfs ioctls? That is a module,
and our uapi is, by far, in drivers through ioctl "hell" and that would
be great to be able to document through all of this. Will that just not
be in the debugfs api?
thanks,
greg k-h
* Re: [PATCH v2 3/9] kernel/api: add debugfs interface for kernel API specifications
From: Sasha Levin @ 2026-03-23 23:58 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: linux-api, linux-kernel, linux-doc, linux-fsdevel, linux-kbuild,
linux-kselftest, workflows, tools, x86, Thomas Gleixner,
Paul E . McKenney, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
Cyril Hrubis, Kees Cook, Jake Edge, David Laight, Askar Safin,
Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
Ingo Molnar, Arnd Bergmann
In-Reply-To: <2026032309-jargon-stalling-28c2@gregkh>
On Mon, Mar 23, 2026 at 02:52:58PM +0100, Greg Kroah-Hartman wrote:
>On Sun, Mar 22, 2026 at 08:10:17AM -0400, Sasha Levin wrote:
>> Add a debugfs interface to expose kernel API specifications at runtime.
>> This allows tools and users to query the complete API specifications
>> through the debugfs filesystem.
>>
>> The interface provides:
>> - /sys/kernel/debug/kapi/list - lists all available API specifications
>> - /sys/kernel/debug/kapi/specs/<name> - detailed info for each API
>>
>> Each specification file includes:
>> - Function name, version, and descriptions
>> - Execution context requirements and flags
>> - Parameter details with types, flags, and constraints
>> - Return value specifications and success conditions
>> - Error codes with descriptions and conditions
>> - Locking requirements and constraints
>> - Signal handling specifications
>> - Examples, notes, and deprecation status
>>
>> This enables runtime introspection of kernel APIs for documentation
>> tools, static analyzers, and debugging purposes.
>>
>> Signed-off-by: Sasha Levin <sashal@kernel.org>
>
>Debugfs logic looks sane, nice.
Thanks!
>But this only works if the kabi stuff is built into the kernel image,
>right? This doesn't work if any of these abi sections are in a module
>or am I missing that logic here?
That is correct, for now.
I'm only trying to tackle syscalls to begin with, and since no syscalls live in
modules, we have no need for module support.
--
Thanks,
Sasha
* Re: [PATCH v2 3/9] kernel/api: add debugfs interface for kernel API specifications
From: Greg Kroah-Hartman @ 2026-03-23 13:52 UTC (permalink / raw)
To: Sasha Levin
Cc: linux-api, linux-kernel, linux-doc, linux-fsdevel, linux-kbuild,
linux-kselftest, workflows, tools, x86, Thomas Gleixner,
Paul E . McKenney, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
Cyril Hrubis, Kees Cook, Jake Edge, David Laight, Askar Safin,
Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
Ingo Molnar, Arnd Bergmann
In-Reply-To: <20260322121026.869758-4-sashal@kernel.org>
On Sun, Mar 22, 2026 at 08:10:17AM -0400, Sasha Levin wrote:
> Add a debugfs interface to expose kernel API specifications at runtime.
> This allows tools and users to query the complete API specifications
> through the debugfs filesystem.
>
> The interface provides:
> - /sys/kernel/debug/kapi/list - lists all available API specifications
> - /sys/kernel/debug/kapi/specs/<name> - detailed info for each API
>
> Each specification file includes:
> - Function name, version, and descriptions
> - Execution context requirements and flags
> - Parameter details with types, flags, and constraints
> - Return value specifications and success conditions
> - Error codes with descriptions and conditions
> - Locking requirements and constraints
> - Signal handling specifications
> - Examples, notes, and deprecation status
>
> This enables runtime introspection of kernel APIs for documentation
> tools, static analyzers, and debugging purposes.
>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
Debugfs logic looks sane, nice.
But this only works if the kabi stuff is built into the kernel image,
right? This doesn't work if any of these abi sections are in a module
or am I missing that logic here?
thanks,
greg k-h