public inbox for linux-api@vger.kernel.org
* [RFC PATCH 0/2] vfs: mkdirat_fd() syscall
@ 2026-03-31 17:19 Jori Koolstra
  2026-03-31 17:19 ` [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd() Jori Koolstra
  2026-03-31 17:19 ` [RFC PATCH 2/2] selftest: add tests for mkdirat_fd() Jori Koolstra
  0 siblings, 2 replies; 13+ messages in thread
From: Jori Koolstra @ 2026-03-31 17:19 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
	Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman
  Cc: H . Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
	Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Jori Koolstra

This series implements the mkdirat_fd() syscall that was suggested over
at the UAPI group kernel feature page [1] with some tests.

Obviously, if we want this we should also implement mknodat_fd() and
symlinkat_fd(), but I believe they can be implemented in much the same
way.

I have added an unsigned int flags argument, as [2] suggests, along with
an example flag that we may want to remove (right now it mainly serves an
internal purpose). It does mark where I would want to place the definitions.

This has been compiled and tested on x86 only. [2] is a bit confusing
here and there, so I hope I have added the proper syscall definitions
everywhere they need to be added.

[1]: https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
[2]: https://www.kernel.org/doc/html/latest/process/adding-syscalls.html

Jori Koolstra (2):
  vfs: syscalls: add mkdirat_fd()
  selftest: add tests for mkdirat_fd()

 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 fs/internal.h                                 |   1 +
 fs/namei.c                                    |  26 +++-
 include/linux/fcntl.h                         |   2 +
 include/linux/syscalls.h                      |   2 +
 include/uapi/asm-generic/fcntl.h              |   3 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 scripts/syscall.tbl                           |   1 +
 tools/include/uapi/asm-generic/unistd.h       |   5 +-
 tools/testing/selftests/filesystems/Makefile  |   4 +-
 .../selftests/filesystems/mkdirat_fd_test.c   | 139 ++++++++++++++++++
 11 files changed, 183 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/mkdirat_fd_test.c

-- 
2.53.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-03-31 17:19 [RFC PATCH 0/2] vfs: mkdirat_fd() syscall Jori Koolstra
@ 2026-03-31 17:19 ` Jori Koolstra
  2026-03-31 19:13   ` Arnd Bergmann
                     ` (2 more replies)
  2026-03-31 17:19 ` [RFC PATCH 2/2] selftest: add tests for mkdirat_fd() Jori Koolstra
  1 sibling, 3 replies; 13+ messages in thread
From: Jori Koolstra @ 2026-03-31 17:19 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
	Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Kara, Alexander Aring
  Cc: Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
	Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
	Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
	linux-fsdevel, linux-api, linux-arch, linux-kselftest, cmirabil,
	Jori Koolstra, Masami Hiramatsu (Google)

Currently there is no way to race-freely create and open a directory.
For regular files we have open(O_CREAT) for creating a new file inode,
and returning a pinning fd to it. The lack of such functionality for
directories means that when populating a directory tree there's always
a race involved: the inodes first need to be created, and then opened
to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
but in the time window between the creation and the opening they might
be replaced by something else.

Addressing this race without proper APIs is possible (by immediately
fstat()ing what was opened, to verify that it has the right inode type),
but difficult to get right. Hence, mkdirat_fd() that creates a directory
and returns an O_DIRECTORY fd is useful.
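
For illustration, the two-step workaround described above might look
roughly like this in userspace today. This is only a sketch and not part
of the patch; mkdir_then_open() is a made-up helper name, and the fstat()
check can only confirm the inode type, not that the directory is the one
that was just created.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns an fd on success, -1 with errno set. */
static int mkdir_then_open(int dfd, const char *name, mode_t mode)
{
	struct stat st;
	int fd;

	if (mkdirat(dfd, name, mode) != 0)
		return -1;

	/* Race window: "name" may be replaced before openat() runs. */
	fd = openat(dfd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Verify the inode type of what was actually opened. */
	if (fstat(fd, &st) != 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISDIR(st.st_mode)) {
		close(fd);
		errno = ENOTDIR;
		return -1;
	}
	return fd;
}
```

A caller would pass an already opened parent directory fd (or AT_FDCWD)
and then use the returned fd for fchmod()/fchown() and friends.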

This feature idea (and description) is taken from the UAPI group:
https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/internal.h                          |  1 +
 fs/namei.c                             | 26 ++++++++++++++++++++++++--
 include/linux/fcntl.h                  |  2 ++
 include/linux/syscalls.h               |  2 ++
 include/uapi/asm-generic/fcntl.h       |  3 +++
 include/uapi/asm-generic/unistd.h      |  5 ++++-
 scripts/syscall.tbl                    |  1 +
 8 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..dda920c26941 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,7 @@
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
 471	common	rseq_slice_yield	sys_rseq_slice_yield
+472	common	mkdirat_fd		sys_mkdirat_fd
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/internal.h b/fs/internal.h
index cbc384a1aa09..2885a3e4ebdd 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -58,6 +58,7 @@ int filename_unlinkat(int dfd, struct filename *name);
 int may_linkat(struct mnt_idmap *idmap, const struct path *link);
 int filename_renameat2(int olddfd, struct filename *oldname, int newdfd,
 		 struct filename *newname, unsigned int flags);
+int filename_mkdirat_fd(int dfd, struct filename *name, umode_t mode, unsigned int flags);
 int filename_mkdirat(int dfd, struct filename *name, umode_t mode);
 int filename_mknodat(int dfd, struct filename *name, umode_t mode, unsigned int dev);
 int filename_symlinkat(struct filename *from, int newdfd, struct filename *to);
diff --git a/fs/namei.c b/fs/namei.c
index 1eb9db055292..93252937983e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5256,6 +5256,11 @@ struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 EXPORT_SYMBOL(vfs_mkdir);
 
 int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
+{
+	return filename_mkdirat_fd(dfd, name, mode, 0);
+}
+
+int filename_mkdirat_fd(int dfd, struct filename *name, umode_t mode, unsigned int flags)
 {
 	struct dentry *dentry;
 	struct path path;
@@ -5263,7 +5268,7 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
 	struct delegated_inode delegated_inode = { };
 
-retry:
+start:
 	dentry = filename_create(dfd, name, &path, lookup_flags);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
@@ -5276,7 +5281,6 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
 		if (IS_ERR(dentry))
 			error = PTR_ERR(dentry);
 	}
-	end_creating_path(&path, dentry);
 	if (is_delegated(&delegated_inode)) {
 		error = break_deleg_wait(&delegated_inode);
 		if (!error)
@@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
 	}
+
+	if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
+		struct path new_path = { .mnt = path.mnt, .dentry = dentry };
+		error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
+	}
+	end_creating_path(&path, dentry);
 	return error;
+retry:
+	end_creating_path(&path, dentry);
+	goto start;
+}
+
+SYSCALL_DEFINE4(mkdirat_fd, int, dfd, const char __user *, pathname, umode_t, mode,
+		unsigned int, flags)
+{
+	CLASS(filename, name)(pathname);
+	if (flags & ~VALID_MKDIRAT_FD_FLAGS)
+		return -EINVAL;
+	return filename_mkdirat_fd(dfd, name, mode, flags | MKDIRAT_FD_NEED_FD);
 }
 
 SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index a332e79b3207..d2f0fdb82847 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -25,6 +25,8 @@
 #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
 #endif
 
+#define VALID_MKDIRAT_FD_FLAGS	(MKDIRAT_FD_NEED_FD)
+
 #if BITS_PER_LONG == 32
 #define IS_GETLK32(cmd)		((cmd) == F_GETLK)
 #define IS_SETLK32(cmd)		((cmd) == F_SETLK)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 02bd6ddb6278..52e7f09d5525 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
 asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
 				      u32 size, u32 flags);
 asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_mkdirat_fd(int dfd, const char __user *pathname, umode_t mode,
+				     unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285643..621458bf1fbf 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -95,6 +95,9 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+/* Flags for mkdirat_fd */
+#define MKDIRAT_FD_NEED_FD	0x01
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..5bae1029f5d9 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_mkdirat_fd 472
+__SYSCALL(__NR_mkdirat_fd, sys_mkdirat_fd)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
 
 /*
  * 32 bit systems traditionally used different
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..db3bd97d4a1a 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	mkdirat_fd			sys_mkdirat_fd
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 2/2] selftest: add tests for mkdirat_fd()
  2026-03-31 17:19 [RFC PATCH 0/2] vfs: mkdirat_fd() syscall Jori Koolstra
  2026-03-31 17:19 ` [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd() Jori Koolstra
@ 2026-03-31 17:19 ` Jori Koolstra
  1 sibling, 0 replies; 13+ messages in thread
From: Jori Koolstra @ 2026-03-31 17:19 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
	Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman
  Cc: H . Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
	Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Jori Koolstra, Ingo Molnar

Add some tests for the new mkdirat_fd() syscall to test compliance and
to showcase its behaviour.

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
 tools/include/uapi/asm-generic/unistd.h       |   5 +-
 tools/testing/selftests/filesystems/Makefile  |   4 +-
 .../selftests/filesystems/mkdirat_fd_test.c   | 139 ++++++++++++++++++
 3 files changed, 145 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/mkdirat_fd_test.c

diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..5bae1029f5d9 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_mkdirat_fd 472
+__SYSCALL(__NR_mkdirat_fd, sys_mkdirat_fd)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
 
 /*
  * 32 bit systems traditionally used different
diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile
index 85427d7f19b9..7357769db57a 100644
--- a/tools/testing/selftests/filesystems/Makefile
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 
-CFLAGS += $(KHDR_INCLUDES)
-TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test kernfs_test fclog
+CFLAGS += $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
+TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test kernfs_test fclog mkdirat_fd_test
 TEST_GEN_PROGS_EXTENDED := dnotify_test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/filesystems/mkdirat_fd_test.c b/tools/testing/selftests/filesystems/mkdirat_fd_test.c
new file mode 100644
index 000000000000..9058be49dc7b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/mkdirat_fd_test.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sys/stat.h>
+
+#include <asm-generic/unistd.h>
+
+#include "kselftest_harness.h"
+
+#ifndef MKDIRAT_FD_NEED_FD
+#define MKDIRAT_FD_NEED_FD 0x01
+#endif
+
+#define mkdirat_fd_checked(dfd, pathname) ({					\
+	struct stat __st;							\
+	int __fd = sys_mkdirat_fd(dfd, pathname, S_IRWXU, MKDIRAT_FD_NEED_FD);	\
+	ASSERT_GE(__fd, 0);							\
+	EXPECT_EQ(fstat(__fd, &__st), 0);					\
+	EXPECT_TRUE(S_ISDIR(__st.st_mode));					\
+	__fd;									\
+})
+
+static inline int sys_mkdirat_fd(int dfd, const char *pathname, mode_t mode,
+				 unsigned int flags)
+{
+	return syscall(__NR_mkdirat_fd, dfd, pathname, mode, flags);
+}
+
+FIXTURE(mkdirat_fd) {
+	char dirpath[PATH_MAX];
+	int dfd;
+};
+
+FIXTURE_SETUP(mkdirat_fd)
+{
+	snprintf(self->dirpath, sizeof(self->dirpath),
+		 "/tmp/mkdirat_fd_test.%d", getpid());
+	ASSERT_EQ(mkdir(self->dirpath, S_IRWXU), 0);
+
+	self->dfd = open(self->dirpath, O_DIRECTORY);
+	ASSERT_GE(self->dfd, 0);
+}
+
+FIXTURE_TEARDOWN(mkdirat_fd)
+{
+	close(self->dfd);
+	rmdir(self->dirpath);
+}
+
+/* Does mkdirat_fd return a fd at all */
+TEST_F(mkdirat_fd, returns_fd)
+{
+	int fd = mkdirat_fd_checked(self->dfd, "newdir");
+	EXPECT_EQ(close(fd), 0);
+	EXPECT_EQ(unlinkat(self->dfd, "newdir", AT_REMOVEDIR), 0);
+}
+
+/* The fd must refer to the directory that was just created. */
+TEST_F(mkdirat_fd, fd_is_created_dir)
+{
+	int fd;
+	struct stat st_via_fd, st_via_path;
+	char path[PATH_MAX];
+
+	fd = mkdirat_fd_checked(self->dfd, "checkdir");
+
+	ASSERT_EQ(fstat(fd, &st_via_fd), 0);
+
+	snprintf(path, sizeof(path), "%s/checkdir", self->dirpath);
+	ASSERT_EQ(stat(path, &st_via_path), 0);
+
+	EXPECT_EQ(st_via_fd.st_ino, st_via_path.st_ino);
+	EXPECT_EQ(st_via_fd.st_dev, st_via_path.st_dev);
+
+	EXPECT_EQ(close(fd), 0);
+	EXPECT_EQ(rmdir(path), 0);
+}
+
+
+/* Missing parent component must fail with ENOENT. */
+TEST_F(mkdirat_fd, enoent_missing_parent)
+{
+	EXPECT_EQ(sys_mkdirat_fd(self->dfd, "nonexistent/child", S_IRWXU, MKDIRAT_FD_NEED_FD), -1);
+	EXPECT_EQ(errno, ENOENT);
+}
+
+/* An invalid dfd must fail with EBADF. */
+TEST_F(mkdirat_fd, ebadf)
+{
+	EXPECT_EQ(sys_mkdirat_fd(-42, "badfdir", S_IRWXU, MKDIRAT_FD_NEED_FD), -1);
+	EXPECT_EQ(errno, EBADF);
+}
+
+/* A dfd that points to a file (not a directory) must fail with ENOTDIR. */
+TEST_F(mkdirat_fd, enotdir_dfd)
+{
+	int file_fd;
+
+	file_fd = openat(self->dfd, "file",
+			 O_CREAT | O_WRONLY, S_IRWXU);
+	ASSERT_GE(file_fd, 0);
+
+	EXPECT_EQ(sys_mkdirat_fd(file_fd, "subdir", S_IRWXU, MKDIRAT_FD_NEED_FD), -1);
+	EXPECT_EQ(errno, ENOTDIR);
+
+	EXPECT_EQ(close(file_fd), 0);
+	EXPECT_EQ(unlinkat(self->dfd, "file", 0), 0);
+}
+
+/*
+ * The returned fd must be usable as a dfd for further *at() calls.
+ */
+TEST_F(mkdirat_fd, fd_usable_as_dfd)
+{
+	int parent_fd, child_fd;
+
+	parent_fd = mkdirat_fd_checked(self->dfd, "parent");
+	child_fd = mkdirat_fd_checked(parent_fd, "child");
+
+	EXPECT_EQ(close(child_fd), 0);
+	EXPECT_EQ(close(parent_fd), 0);
+
+	char path[PATH_MAX];
+	snprintf(path, sizeof(path), "%s/parent/child", self->dirpath);
+	EXPECT_EQ(rmdir(path), 0);
+	snprintf(path, sizeof(path), "%s/parent", self->dirpath);
+	EXPECT_EQ(rmdir(path), 0);
+}
+
+/* Unknown flags must be rejected with EINVAL. */
+TEST_F(mkdirat_fd, einval_unknown_flags)
+{
+	EXPECT_EQ(sys_mkdirat_fd(self->dfd, "flagsdir", S_IRWXU, ~MKDIRAT_FD_NEED_FD), -1);
+	EXPECT_EQ(errno, EINVAL);
+}
+
+TEST_HARNESS_MAIN
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-03-31 17:19 ` [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd() Jori Koolstra
@ 2026-03-31 19:13   ` Arnd Bergmann
  2026-04-01 14:09     ` David Laight
  2026-03-31 20:25   ` Yann Droneaud
  2026-04-01  4:19   ` Mateusz Guzik
  2 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2026-03-31 19:13 UTC (permalink / raw)
  To: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, shuah,
	Greg Kroah-Hartman, H. Peter Anvin, Jan Kara, Alexander Aring
  Cc: Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
	Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
	Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
	linux-fsdevel, linux-api, Linux-Arch, linux-kselftest, cmirabil,
	Masami Hiramatsu

On Tue, Mar 31, 2026, at 19:19, Jori Koolstra wrote:
> Currently there is no way to race-freely create and open a directory.
> For regular files we have open(O_CREAT) for creating a new file inode,
> and returning a pinning fd to it. The lack of such functionality for
> directories means that when populating a directory tree there's always
> a race involved: the inodes first need to be created, and then opened
> to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
> but in the time window between the creation and the opening they might
> be replaced by something else.
>
> Addressing this race without proper APIs is possible (by immediately
> fstat()ing what was opened, to verify that it has the right inode type),
> but difficult to get right. Hence, mkdirat_fd() that creates a directory
> and returns an O_DIRECTORY fd is useful.
>
> This feature idea (and description) is taken from the UAPI group:
> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
>
> Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>

I checked that the calling conventions are fine, i.e. this will work
as expected across all architectures. I assume you are also aware
that the non-RFC patch will need to add the syscall number to all
.tbl files.

The hardest problem here does seem to be the naming of the
new syscall, and I'm sorry to not be able to offer any solution
either, just two observations:

- mkdirat/mkdirat_fd sounds similar to the existing
  quotactl/quotactl_fd pair, but quotactl_fd() takes a file
  descriptor argument rather than returning it, which makes
  this addition quite confusing.

- the nicest interface IMO would have been a variation of
  openat(dfd, filename, O_CREAT | O_DIRECTORY, mode)
  but that is a minefield of incompatible implementations[1],
  so we can't do that without changing the behavior for
  existing callers that currently run into an error.

       Arnd

[1] https://lwn.net/Articles/926782/
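
To see what a given kernel does with the O_CREAT | O_DIRECTORY
combination mentioned above, a small userspace probe along these lines
could be used. This is a sketch only; probe_creat_directory() is a
made-up name, and the result is deliberately reported rather than
assumed, since behaviour has differed between kernel versions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical probe. "path" should not exist beforehand.
 * Returns 0 if the call succeeded, otherwise the errno it failed with.
 */
static int probe_creat_directory(const char *path)
{
	int fd, err = 0;

	fd = open(path, O_CREAT | O_DIRECTORY | O_RDONLY | O_CLOEXEC, 0700);
	if (fd < 0)
		err = errno;
	else
		close(fd);

	/* Something may have been left behind; remove it either way. */
	if (unlink(path) != 0)
		rmdir(path);

	return err;
}
```

Printing strerror() of the return value on the kernels of interest shows
which of the historical behaviours is in effect.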

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-03-31 17:19 ` [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd() Jori Koolstra
  2026-03-31 19:13   ` Arnd Bergmann
@ 2026-03-31 20:25   ` Yann Droneaud
  2026-03-31 20:42     ` H. Peter Anvin
  2026-04-01  4:19   ` Mateusz Guzik
  2 siblings, 1 reply; 13+ messages in thread
From: Yann Droneaud @ 2026-03-31 20:25 UTC (permalink / raw)
  To: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
	Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
	Alexander Aring
  Cc: Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
	Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
	Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
	linux-fsdevel, linux-api, linux-arch, linux-kselftest, cmirabil,
	Masami Hiramatsu (Google)

Hi,

On 31/03/2026 at 19:19, Jori Koolstra wrote:
> Currently there is no way to race-freely create and open a directory.
> For regular files we have open(O_CREAT) for creating a new file inode,
> and returning a pinning fd to it. The lack of such functionality for
> directories means that when populating a directory tree there's always
> a race involved: the inodes first need to be created, and then opened
> to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
> but in the time window between the creation and the opening they might
> be replaced by something else.
>
> Addressing this race without proper APIs is possible (by immediately
> fstat()ing what was opened, to verify that it has the right inode type),
> but difficult to get right. Hence, mkdirat_fd() that creates a directory
> and returns an O_DIRECTORY fd is useful.
>
> This feature idea (and description) is taken from the UAPI group:
> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
>
> Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
> ---
>   arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>   fs/internal.h                          |  1 +
>   fs/namei.c                             | 26 ++++++++++++++++++++++++--
>   include/linux/fcntl.h                  |  2 ++
>   include/linux/syscalls.h               |  2 ++
>   include/uapi/asm-generic/fcntl.h       |  3 +++
>   include/uapi/asm-generic/unistd.h      |  5 ++++-
>   scripts/syscall.tbl                    |  1 +
>   8 files changed, 38 insertions(+), 3 deletions(-)

> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index a332e79b3207..d2f0fdb82847 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -25,6 +25,8 @@
>   #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
>   #endif
>   
> +#define VALID_MKDIRAT_FD_FLAGS	(MKDIRAT_FD_NEED_FD)
> +

I don't see support for an O_CLOEXEC-ish flag. Is the file descriptor in
close-on-exec mode by default? If yes, it should be mentioned.


> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..621458bf1fbf 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,9 @@
>   #define O_NDELAY	O_NONBLOCK
>   #endif
>   
> +/* Flags for mkdirat_fd */
> +#define MKDIRAT_FD_NEED_FD	0x01
> +


Regards.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-03-31 20:25   ` Yann Droneaud
@ 2026-03-31 20:42     ` H. Peter Anvin
  0 siblings, 0 replies; 13+ messages in thread
From: H. Peter Anvin @ 2026-03-31 20:42 UTC (permalink / raw)
  To: Yann Droneaud, Jori Koolstra, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
	Shuah Khan, Greg Kroah-Hartman, Jan Kara, Alexander Aring
  Cc: Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
	Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
	Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
	linux-fsdevel, linux-api, linux-arch, linux-kselftest, cmirabil,
	Masami Hiramatsu (Google)

On March 31, 2026 1:25:03 PM PDT, Yann Droneaud <yann@droneaud.fr> wrote:
>Hi,
>
>On 31/03/2026 at 19:19, Jori Koolstra wrote:
>> Currently there is no way to race-freely create and open a directory.
>> For regular files we have open(O_CREAT) for creating a new file inode,
>> and returning a pinning fd to it. The lack of such functionality for
>> directories means that when populating a directory tree there's always
>> a race involved: the inodes first need to be created, and then opened
>> to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
>> but in the time window between the creation and the opening they might
>> be replaced by something else.
>> 
>> Addressing this race without proper APIs is possible (by immediately
>> fstat()ing what was opened, to verify that it has the right inode type),
>> but difficult to get right. Hence, mkdirat_fd() that creates a directory
>> and returns an O_DIRECTORY fd is useful.
>> 
>> This feature idea (and description) is taken from the UAPI group:
>> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
>> 
>> Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
>> ---
>>   arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>>   fs/internal.h                          |  1 +
>>   fs/namei.c                             | 26 ++++++++++++++++++++++++--
>>   include/linux/fcntl.h                  |  2 ++
>>   include/linux/syscalls.h               |  2 ++
>>   include/uapi/asm-generic/fcntl.h       |  3 +++
>>   include/uapi/asm-generic/unistd.h      |  5 ++++-
>>   scripts/syscall.tbl                    |  1 +
>>   8 files changed, 38 insertions(+), 3 deletions(-)
>
>> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
>> index a332e79b3207..d2f0fdb82847 100644
>> --- a/include/linux/fcntl.h
>> +++ b/include/linux/fcntl.h
>> @@ -25,6 +25,8 @@
>>   #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
>>   #endif
>>   +#define VALID_MKDIRAT_FD_FLAGS	(MKDIRAT_FD_NEED_FD)
>> +
>
>I don't see support for O_CLOEXEC-ish flag, is the file descriptor in close-on-exec mode by default ? If yes, it should be mentioned.
>
>
>> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
>> index 613475285643..621458bf1fbf 100644
>> --- a/include/uapi/asm-generic/fcntl.h
>> +++ b/include/uapi/asm-generic/fcntl.h
>> @@ -95,6 +95,9 @@
>>   #define O_NDELAY	O_NONBLOCK
>>   #endif
>>   +/* Flags for mkdirat_fd */
>> +#define MKDIRAT_FD_NEED_FD	0x01
>> +
>
>
>Regards.
>
>

And even if it is, POSIX already has O_CLOFORK and we should expect that that will be needed, too.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-03-31 17:19 ` [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd() Jori Koolstra
  2026-03-31 19:13   ` Arnd Bergmann
  2026-03-31 20:25   ` Yann Droneaud
@ 2026-04-01  4:19   ` Mateusz Guzik
  2026-04-01  9:44     ` Cyril Hrubis
                       ` (2 more replies)
  2 siblings, 3 replies; 13+ messages in thread
From: Mateusz Guzik @ 2026-04-01  4:19 UTC (permalink / raw)
  To: Jori Koolstra
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
	Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
	Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)

On Tue, Mar 31, 2026 at 07:19:58PM +0200, Jori Koolstra wrote:
> @@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
>  		lookup_flags |= LOOKUP_REVAL;
>  		goto retry;
>  	}
> +
> +	if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
> +		struct path new_path = { .mnt = path.mnt, .dentry = dentry };
> +		error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
> +	}
> +	end_creating_path(&path, dentry);
>  	return error;


You can't do it like this. Should it turn out that no fd can be allocated,
the entire thing is going to error out while leaving the newly created
directory behind. You need to allocate the fd first, then do the hard
work, and only then either fd_install it or free the fd. The FD_ADD
machinery can probably still be used, provided the real new mkdir is
properly wrapped.
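
The ordering described above (reserve first, do the work, then commit or
release) can be sketched with a toy userspace model. None of this is
kernel code: the two-entry slot table, toy_mkdir() and toy_mkdir_open()
are invented stand-ins that only show why a failed reservation cannot
leave a directory behind.

```c
#include <errno.h>
#include <stdbool.h>

#define NSLOTS 2

/* Toy stand-ins: a two-entry "descriptor table" and a directory counter. */
static bool slot_used[NSLOTS];
static int dirs_created;

static int reserve_slot(void)
{
	for (int i = 0; i < NSLOTS; i++) {
		if (!slot_used[i]) {
			slot_used[i] = true;
			return i;
		}
	}
	return -EMFILE;
}

static void release_slot(int slot)
{
	slot_used[slot] = false;
}

/* Pretend mkdir: fails on request, otherwise leaves a directory behind. */
static int toy_mkdir(bool should_fail)
{
	if (should_fail)
		return -EEXIST;
	dirs_created++;
	return 0;
}

static int toy_mkdir_open(bool mkdir_should_fail)
{
	int slot, err;

	slot = reserve_slot();		/* step 1: reserve, nothing created yet */
	if (slot < 0)
		return slot;

	err = toy_mkdir(mkdir_should_fail);	/* step 2: the real work */
	if (err) {
		release_slot(slot);	/* step 3a: give the reservation back */
		return err;
	}
	return slot;			/* step 3b: commit the reservation */
}
```

With both slots taken, toy_mkdir_open() fails before toy_mkdir() is ever
reached, so dirs_created does not change.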

It should be perfectly feasible to de facto wrap existing mkdir
functionality by this syscall.

On top of that, similarly to what other people mentioned, the new syscall
will definitely want to support O_CLOEXEC and probably other flags down
the line.

Trying to handle this in open() is a no-go. openat2 is rather
problematic.

I tend to agree mkdirat_fd is not a good name for the syscall either,
but I don't have a suggestion I'm happy with. I think the least bad name
would follow the existing stuff and be mkdirat2 or similar.

The routine would have to start with validating the passed O_ flags, for
now only allowing O_CLOEXEC and EINVAL-ing otherwise.
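
A minimal sketch of that validation step, written as a plain userspace
function so it can be tried in isolation. The mask and function names are
hypothetical and not taken from the patch; the only assumption carried
over is that O_CLOEXEC is the sole accepted flag for now.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Hypothetical mask: the only open flag accepted for now. */
#define MKDIR_OPEN_VALID_FLAGS	((unsigned int)O_CLOEXEC)

/* Returns 0 if every set bit is known, -EINVAL otherwise. */
static int check_mkdir_open_flags(unsigned int flags)
{
	if (flags & ~MKDIR_OPEN_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits up front keeps them free to be given a meaning
later without silently changing behaviour for existing callers.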

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-04-01  4:19   ` Mateusz Guzik
@ 2026-04-01  9:44     ` Cyril Hrubis
  2026-04-01 10:25     ` Jori Koolstra
  2026-04-02  2:52     ` Aleksa Sarai
  2 siblings, 0 replies; 13+ messages in thread
From: Cyril Hrubis @ 2026-04-01  9:44 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
	Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
	Alexander Aring, Peter Zijlstra, Oleg Nesterov,
	Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)

Hi!
> I tend to agree mkdirat_fd is not a good name for the syscall either,
> but I don't have a suggestion I'm happy with. I think least bad name
> would follow the existing stuff and be mkdirat2 or similar.

Why not mkdirat_open() as it does combine these two syscalls into one?

-- 
Cyril Hrubis
chrubis@suse.cz

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-04-01  4:19   ` Mateusz Guzik
  2026-04-01  9:44     ` Cyril Hrubis
@ 2026-04-01 10:25     ` Jori Koolstra
  2026-04-07  9:00       ` Mateusz Guzik
  2026-04-02  2:52     ` Aleksa Sarai
  2 siblings, 1 reply; 13+ messages in thread
From: Jori Koolstra @ 2026-04-01 10:25 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
	Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
	Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)


> On 01-04-2026 06:19 CEST, Mateusz Guzik <mjguzik@gmail.com> wrote:
> 
>  
> On Tue, Mar 31, 2026 at 07:19:58PM +0200, Jori Koolstra wrote:
> > @@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
> >  		lookup_flags |= LOOKUP_REVAL;
> >  		goto retry;
> >  	}
> > +
> > +	if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
> > +		struct path new_path = { .mnt = path.mnt, .dentry = dentry };
> > +		error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
> > +	}
> > +	end_creating_path(&path, dentry);
> >  	return error;
> 
> 
> You can't do it like this. Should it turn out no fd can be allocated,
> the entire thing is going to error out while keeping the newly created
> directory behind. You need to allocate the fd first, then do the hard
> work, and only then fd_install and/or free the fd. The FD_ADD machinery
> can probably still be used provided proper wrapping of the real new
> mkdir.

But isn't this exactly what happens in open(O_CREAT) too? Eventually we
call
		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
						mode, open_flag & O_EXCL);

and only then do we assign and install the fd. AFAIK there is no cleanup
happening there either if the FD_ADD step fails. You will just have a
regular file and no descriptor. But I would have to test this to be sure.

> 
> On top of that similarly to what other people mentioned the new syscall
> will definitely want to support O_CLOEXEC and probably other flags down
> the line.
> 

I agree, and perhaps O_PATH too. Maybe just all open flags relevant to
directories?

> Trying to handle this in open() is a no-go. openat2 is rather
> problematic.

I don't think that is necessarily true. It turned out O_CREAT | O_DIRECTORY
was bugged for a very long time. Christian Brauner fixed it eventually, and
that combination now returns EINVAL. But I think there is nothing really
stopping us from implementing that combination in the expected way, apart
from whatever reasons there were for not allowing this in the first place,
which I don't know about (maybe mixing semantics?)

> 
> I tend to agree mkdirat_fd is not a good name for the syscall either,
> but I don't have a suggestion I'm happy with. I think least bad name
> would follow the existing stuff and be mkdirat2 or similar.
> 
> The routine would have to start with validating the passed O_ flags, for
> now only allowing O_CLOEXEC and EINVAL-ing otherwise.

Thanks,
Jori


* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-03-31 19:13   ` Arnd Bergmann
@ 2026-04-01 14:09     ` David Laight
  0 siblings, 0 replies; 13+ messages in thread
From: David Laight @ 2026-04-01 14:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, shuah,
	Greg Kroah-Hartman, H. Peter Anvin, Jan Kara, Alexander Aring,
	Peter Zijlstra, Oleg Nesterov, Andrey Albershteyn, Jiri Olsa,
	Mathieu Desnoyers, Thomas Weißschuh, Namhyung Kim,
	Arnaldo Carvalho de Melo, Aleksa Sarai, linux-kernel,
	linux-fsdevel, linux-api, Linux-Arch, linux-kselftest, cmirabil,
	Masami Hiramatsu

On Tue, 31 Mar 2026 21:13:34 +0200
"Arnd Bergmann" <arnd@arndb.de> wrote:

> On Tue, Mar 31, 2026, at 19:19, Jori Koolstra wrote:
> > Currently there is no way to race-freely create and open a directory.
> > For regular files we have open(O_CREAT) for creating a new file inode,
> > and returning a pinning fd to it. The lack of such functionality for
> > directories means that when populating a directory tree there's always
> > a race involved: the inodes first need to be created, and then opened
> > to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
> > but in the time window between the creation and the opening they might
> > be replaced by something else.
> >
> > Addressing this race without proper APIs is possible (by immediately
> > fstat()ing what was opened, to verify that it has the right inode type),
> > but difficult to get right. Hence, mkdirat_fd() that creates a directory
> > and returns an O_DIRECTORY fd is useful.
> >
> > This feature idea (and description) is taken from the UAPI group:
> > https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
> >
> > Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>  
> 
> I checked that the calling conventions are fine, i.e. this will work
> as expected across all architectures. I assume you are also aware
> that the non-RFC patch will need to add the syscall number to all
> .tbl files.
> 
> The hardest problem here does seem to be the naming of the
> new syscall, and I'm sorry to not be able to offer any solution
> either, just two observations:
> 
> - mkdirat/mkdirat_fd sounds similar to the existing
>   quotactl/quotactl_fd pair, but quotactl_fd() takes a file
>   descriptor argument rather than returning it, which makes
>   this addition quite confusing.
> 
> - the nicest interface IMO would have been a variation of
>   openat(dfd, filename, O_CREAT | O_DIRECTORY, mode)
>   but that is a minefield of incompatible implementations[1],
>   so we can't do that without changing the behavior for
>   existing callers that currently run into an error.

Just require O_TMPFILE to be set as well :-)
You know you'll never regret it once Apr-1 is over.

Can something be done with the flags to openat2()?
That might save allocating an extra system call.

	David


> 
>        Arnd
> 
> [1] https://lwn.net/Articles/926782/
> 



* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-04-01  4:19   ` Mateusz Guzik
  2026-04-01  9:44     ` Cyril Hrubis
  2026-04-01 10:25     ` Jori Koolstra
@ 2026-04-02  2:52     ` Aleksa Sarai
  2026-04-07  8:52       ` Mateusz Guzik
  2 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2026-04-02  2:52 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
	Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
	Alexander Aring, Peter Zijlstra, Oleg Nesterov,
	Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)

On 2026-04-01, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Tue, Mar 31, 2026 at 07:19:58PM +0200, Jori Koolstra wrote:
> > @@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
> >  		lookup_flags |= LOOKUP_REVAL;
> >  		goto retry;
> >  	}
> > +
> > +	if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
> > +		struct path new_path = { .mnt = path.mnt, .dentry = dentry };
> > +		error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
> > +	}
> > +	end_creating_path(&path, dentry);
> >  	return error;
> 
> 
> You can't do it like this. Should it turn out no fd can be allocated,
> the entire thing is going to error out while keeping the newly created
> directory behind. You need to allocate the fd first, then do the hard
> work, and only then fd_install and/or free the fd. The FD_ADD machinery
> can probably still be used provided proper wrapping of the real new
> mkdir.
> 
> It should be perfectly feasible to de facto wrap existing mkdir
> functionality by this syscall.
> 
> On top of that similarly to what other people mentioned the new syscall
> will definitely want to support O_CLOEXEC and probably other flags down
> the line.
> 
> Trying to handle this in open() is a no-go. openat2 is rather
> problematic.

I'm interested in what makes you say that. It would be very nice to be able
to do mkdir + RESOLVE_IN_ROOT and get an fd back all in one syscall. :D

To be fair, build_open_how() will need some more magic to keep openat()
working, and that won't be particularly pretty. If we went with
O_CREAT|O_DIRECTORY we would need to be quite careful to make sure
O_TMPFILE continues to work for both openat() and openat2()...

> I tend to agree mkdirat_fd is not a good name for the syscall either,
> but I don't have a suggestion I'm happy with. I think least bad name
> would follow the existing stuff and be mkdirat2 or similar.
> 
> The routine would have to start with validating the passed O_ flags, for
> now only allowing O_CLOEXEC and EINVAL-ing otherwise.

Please do not use O_* flags! O_CLOEXEC takes up 3 flag bits on different
architectures which makes adding new flags a nightmare.

I think this should take AT_* flags and (like most newer syscalls)
O_CLOEXEC should be automatically set. Userspace can unset it with
fcntl(F_SETFD) in the relatively rare case where they don't want
O_CLOEXEC. Alternatively, we could just bite the bullet and make
AT_NO_CLOEXEC a thing...
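
As a minimal sketch of the userspace side of that idea (not part of the
proposed patch): if the syscall always handed back a close-on-exec fd, a
caller that wants the fd to survive exec could clear the bit afterwards.
The helper name clear_cloexec() is hypothetical; fcntl(), F_GETFD,
F_SETFD and FD_CLOEXEC are the standard interfaces.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: clear FD_CLOEXEC on an fd that was returned with
 * close-on-exec set by default.
 * Returns 0 on success, -1 on error (errno is set by fcntl()).
 */
static int clear_cloexec(int fd)
{
	int fdflags = fcntl(fd, F_GETFD);

	if (fdflags < 0)
		return -1;

	/* Keep any other descriptor flags, drop only FD_CLOEXEC. */
	return fcntl(fd, F_SETFD, fdflags & ~FD_CLOEXEC);
}
```

Until clear_cloexec() has run, a concurrent fork()+exec() in another
thread would not inherit the fd, so the default errs on the safe side.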

But yes, new syscalls *absolutely* need to take some kind of flag
argument. I'd hoped we finally learned our lesson on that one...

-- 
Aleksa Sarai
https://www.cyphar.com/


* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-04-02  2:52     ` Aleksa Sarai
@ 2026-04-07  8:52       ` Mateusz Guzik
  0 siblings, 0 replies; 13+ messages in thread
From: Mateusz Guzik @ 2026-04-07  8:52 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
	Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
	Alexander Aring, Peter Zijlstra, Oleg Nesterov,
	Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)

On Thu, Apr 2, 2026 at 4:52 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2026-04-01, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > Trying to handle this in open() is a no-go. openat2 is rather
> > problematic.
>
> I'm interested in what makes you say that. It would be very nice to be able
> to do mkdir + RESOLVE_IN_ROOT and get an fd back all in one syscall. :D
>

Not handling this in either of open or openat2 does not preclude mkdir
+ RESOLVE_IN_ROOT + getting a fd in one go from existing.

Creating a directory was always a different syscall than creating a
file. I don't see any benefit to squeezing it into open. I do see a
downside because of an extra branchfest to differentiate the cases.

> > The routine would have to start with validating the passed O_ flags, for
> > now only allowing O_CLOEXEC and EINVAL-ing otherwise.
>
> Please do not use O_* flags! O_CLOEXEC takes up 3 flag bits on different
> architectures which makes adding new flags a nightmare.
>

With my proposal there are no new flags added so I don't think that's relevant.

> I think this should take AT_* flags and (like most newer syscalls)
> O_CLOEXEC should be automatically set. Userspace can unset it with
> fcntl(F_SETFD) in the relatively rare case where they don't want
> O_CLOEXEC. Alternatively, we could just bite the bullet and make
> AT_NO_CLOEXEC a thing...
>

I would say that's a pretty weird discrepancy vs what normally happens
with other syscalls, but perhaps it would be fine.


* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
  2026-04-01 10:25     ` Jori Koolstra
@ 2026-04-07  9:00       ` Mateusz Guzik
  0 siblings, 0 replies; 13+ messages in thread
From: Mateusz Guzik @ 2026-04-07  9:00 UTC (permalink / raw)
  To: Jori Koolstra
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
	Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
	Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)

On Wed, Apr 1, 2026 at 12:25 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
>
>
> > On 01-04-2026 06:19 CEST, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> >
> > On Tue, Mar 31, 2026 at 07:19:58PM +0200, Jori Koolstra wrote:
> > > @@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
> > >             lookup_flags |= LOOKUP_REVAL;
> > >             goto retry;
> > >     }
> > > +
> > > +   if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
> > > +           struct path new_path = { .mnt = path.mnt, .dentry = dentry };
> > > +           error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
> > > +   }
> > > +   end_creating_path(&path, dentry);
> > >     return error;
> >
> >
> > You can't do it like this. Should it turn out no fd can be allocated,
> > the entire thing is going to error out while keeping the newly created
> > directory behind. You need to allocate the fd first, then do the hard
> > work, and only then fd_install and/or free the fd. The FD_ADD machinery
> > can probably still be used provided proper wrapping of the real new
> > mkdir.
>
> But isn't this exactly what happens in open(O_CREAT) too? Eventually we
> call
>                 error = dir_inode->i_op->create(idmap, dir_inode, dentry,
>                                                 mode, open_flag & O_EXCL);
>
> and only then do we assign and install the fd. AFAIK there is no cleanup
> happening there either if the FD_ADD step fails. You will just have a
> regular file and no descriptor. But I would have to test this to be sure.
>

FD_ADD(how->flags, do_file_open(dfd, name, &op)) means the fd itself is
allocated upfront and only then does file creation happen, which is how
I'm saying it should be done. With your patch the directory is created
first and the possibly failing fd allocation happens later.
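
The ordering can be sketched with a toy userspace model. This is not
kernel code and does not use FD_ADD; every toy_* name is a hypothetical
stand-in. toy_reserve_fd() plays the role of allocating the fd,
toy_mkdir() the directory creation, and the store into installed[] the
final install step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FDS 4

/* Toy fd table: a slot is reserved first and installed later. */
struct toy_fdtable {
	bool reserved[TOY_MAX_FDS];
	const char *installed[TOY_MAX_FDS];
};

/* Toy filesystem: only counts how many directories were created. */
struct toy_fs {
	int dirs_created;
};

/* Reserve the lowest free slot; -1 if the table is full (think EMFILE). */
static int toy_reserve_fd(struct toy_fdtable *t)
{
	for (int i = 0; i < TOY_MAX_FDS; i++) {
		if (!t->reserved[i]) {
			t->reserved[i] = true;
			return i;
		}
	}
	return -1;
}

/* Give back a reserved slot that was never (or is no longer) needed. */
static void toy_release_fd(struct toy_fdtable *t, int fd)
{
	assert(fd >= 0 && fd < TOY_MAX_FDS);
	t->reserved[fd] = false;
	t->installed[fd] = NULL;
}

/* Stand-in for the real mkdir; fails on a NULL or empty name. */
static int toy_mkdir(struct toy_fs *fs, const char *name)
{
	if (name == NULL || name[0] == '\0')
		return -1;
	fs->dirs_created++;
	return 0;
}

/* Reserve the fd, then create the directory, then install. */
static int toy_mkdir_fd(struct toy_fdtable *t, struct toy_fs *fs,
			const char *name)
{
	int fd = toy_reserve_fd(t);

	if (fd < 0)
		return -1;		/* nothing has been created yet */

	if (toy_mkdir(fs, name) < 0) {
		toy_release_fd(t, fd);	/* undo the reservation */
		return -1;
	}

	t->installed[fd] = name;	/* install step, cannot fail here */
	return fd;
}
```

With the table full, toy_mkdir_fd() fails before toy_mkdir() runs, so no
directory is left behind.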

> >
> > On top of that similarly to what other people mentioned the new syscall
> > will definitely want to support O_CLOEXEC and probably other flags down
> > the line.
> >
>
> I agree, and perhaps O_PATH too. Maybe just all open flags relevant to
> directories?
>

I don't know about O_PATH as is, but certainly the syscall needs to be
able to grab more flags in the future.

> > Trying to handle this in open() is a no-go. openat2 is rather
> > problematic.
>
> I don't think that is necessarily true. It turned out O_CREAT | O_DIRECTORY
> was bugged for a very long time. Christian Brauner fixed it eventually, and
> that combination now returns EINVAL. But I think there is nothing really
> stopping us from implementing that combination in the expected way, apart
> from whatever reasons there were for not allowing this in the first place,
> which I don't know about (maybe mixing semantics?)
>

I am not saying it's impossible. I am saying mkdir was always a
separate codepath and in order to change that you would need to add a
branchfest to open. I don't see any reason to go that route.


end of thread, other threads:[~2026-04-07  9:00 UTC | newest]

Thread overview: 13+ messages
2026-03-31 17:19 [RFC PATCH 0/2] vfs: mkdirat_fd() syscall Jori Koolstra
2026-03-31 17:19 ` [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd() Jori Koolstra
2026-03-31 19:13   ` Arnd Bergmann
2026-04-01 14:09     ` David Laight
2026-03-31 20:25   ` Yann Droneaud
2026-03-31 20:42     ` H. Peter Anvin
2026-04-01  4:19   ` Mateusz Guzik
2026-04-01  9:44     ` Cyril Hrubis
2026-04-01 10:25     ` Jori Koolstra
2026-04-07  9:00       ` Mateusz Guzik
2026-04-02  2:52     ` Aleksa Sarai
2026-04-07  8:52       ` Mateusz Guzik
2026-03-31 17:19 ` [RFC PATCH 2/2] selftest: add tests for mkdirat_fd() Jori Koolstra
