Linux userland API discussions


* Re: [RFC] Modernizing Linux authentication logs (lastlog, btmp, utmp, wtmp) with SQLite
From: Thorsten Kukuk @ 2026-04-10 12:35 UTC
  To: linux-api, linux-kernel, audit, libc-alpha; +Cc: Roman Bakshansky
In-Reply-To: <20260313144508.GA5446@cventin.lip.ens-lyon.fr>

On Fri, Mar 13, 2026 at 3:45 PM Vincent Lefevre <vincent@vinc17.net> wrote:
>
> On 2026-03-13 10:59:11 -0300, Adhemerval Zanella Netto wrote:
> > From the glibc standpoint my plan is just to make the accounting database
> > function no-op [1] (I hopefully to get this in the next 2.44 release).
> >
> > And I think Thorsten Kukuk already adapted most of the usages in current
> > distros [2][3] using similar strategy, along with a better systemd
> > integration.  I am not sure if/when distros are incorporating his work.
> >
> > [1] https://patchwork.sourceware.org/project/glibc/list/?series=37271
> > [2] https://www.thkukuk.de/blog/Y2038_glibc_lastlog_64bit/
> > [3] https://www.thkukuk.de/blog/Y2038_glibc_utmp_64bit/
>
> FYI, utmp has been reintroduced in Debian for libutempter (and thus
> applications that use this library), because systemd was not working
> or at least not sufficiently documented:
>
>   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1125682

They reintroduced the old "hack" to get "wall" working without solving
the underlying problem.
The same thing will happen again: everybody running xterm will get
the wall message in every terminal.
People not using a terminal (most normal users, as opposed to
developers) will not see the message at all, because web browsers and
other graphical applications don't show it.
The correct solution is for the desktop environments to register a
session and, if there is a wall message, show it in a dialog of their
own, so that everybody gets the message once. Not one person 50 times
and the others not at all.

Regards,
Thorsten

-- 
Thorsten Kukuk, Distinguished Engineer, Future Technologies
SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
Nuernberg, Germany
Geschäftsführer: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB
36809, AG Nürnberg)


* Re: [RFC] Modernizing Linux authentication logs (lastlog, btmp, utmp, wtmp) with SQLite
From: Thorsten Kukuk @ 2026-04-10 12:38 UTC
  To: Florian Weimer
  Cc: Roman Bakshansky, linux-api, linux-kernel, audit, libc-alpha
In-Reply-To: <87cy175zrg.fsf@mid.deneb.enyo.de>

On Fri, Mar 13, 2026 at 7:51 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>
> * Roman Bakshansky:
>
> > The full RFC, including preliminary database schemas and API drafts,
> > is available in the discussion repository:
> >
> >      https://github.com/bakshansky/linux-auth-logs
>
> I don't understand how SQLite (without a daemon) addresses the locking
> issue.  WAL mode still uses fcntl locking.

It doesn't; that's why wtmpdb uses a daemon for this.
With pam_lastlog2, the messages aren't important or reliable enough to
justify the overhead. But if you want that, you would need to introduce
a daemon, too.
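
For readers unfamiliar with the primitive mentioned in the quoted
question, the following is a minimal sketch of a POSIX advisory lock
taken with fcntl(). The helper names are made up for illustration; this
is not code from SQLite, wtmpdb, or pam_lastlog2.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to take an exclusive advisory lock on the whole file without
 * blocking. Returns 0 on success, -1 with errno set (EACCES or EAGAIN
 * when another process holds a conflicting lock).
 */
static int try_write_lock(int fd)
{
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,	/* 0 means "up to end of file" */
	};

	return fcntl(fd, F_SETLK, &fl);
}

/* Release the lock taken by try_write_lock(). */
static int unlock_file(int fd)
{
	struct flock fl = {
		.l_type = F_UNLCK,
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,
	};

	return fcntl(fd, F_SETLK, &fl);
}
```

These locks are advisory: they only coordinate processes that also
call fcntl() on the same file. A daemon avoids this by being the only
process that touches the database file.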

Regards,
Thorsten

-- 
Thorsten Kukuk, Distinguished Engineer, Future Technologies
SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
Nuernberg, Germany
Geschäftsführer: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB
36809, AG Nürnberg)


* Re: Avoid reading /sys/kernel/mm/transparent_hugepage/?
From: H.J. Lu @ 2026-04-11  0:12 UTC
  To: Alexey Dobriyan
  Cc: Florian Weimer, GNU C Library, linux-kernel, linux-arch,
	linux-api
In-Reply-To: <d095cc40-5217-4318-ae2e-40e5fe3be47a@p183>

On Fri, Apr 10, 2026 at 4:35 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> On Fri, Apr 10, 2026 at 03:40:30PM +0800, H.J. Lu wrote:
> > On Fri, Apr 10, 2026 at 3:28 PM Florian Weimer <fweimer@redhat.com> wrote:
> > >
> > > * H. J. Lu:
> > >
> > > > To enable THP segment load, ld.so opens and reads 2 files under
> > > > /sys/kernel/mm/transparent_hugepage/.   This requires mounting
> > > > /sys and is expensive.   Is it possible to put such info in vDSO?
> > >
> > > Alexey Dobriyan proposed adding AT_PAGE_SHIFT_LIST to the auxiliary
> >
> > Does it cover
> >
> > [hjl@gnu-tgl-3 linux]$ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > 2097152
> > [hjl@gnu-tgl-3 linux]$
> >
> > > vector a while back, but I don't know the status of that.
>
> Status: nothing happened.
>
> > How can we get
> >
> > [hjl@gnu-tgl-3 linux]$ cat /sys/kernel/mm/transparent_hugepage/enabled
> > always [madvise] never
> > [hjl@gnu-tgl-3 linux]$
>
> This is not covered, see the link:
> https://lore.kernel.org/lkml/ecb049aa-bcac-45c7-bbb1-4612d094935a@p183/
>
> PAGE_SHIFT_MASK should be folded into system call probably.

We need a fast way to check THP status for THP segment load.
A system call to return /sys/kernel/mm/transparent_hugepage/enabled
and /sys/kernel/mm/transparent_hugepage/hpage_pmd_size should
work.

-- 
H.J.


* [RFC PATCH v2 0/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
From: Jori Koolstra @ 2026-04-12 13:54 UTC
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner,
	Arnd Bergmann
  Cc: H . Peter Anvin, Jan Kara, Peter Zijlstra, Andrey Albershteyn,
	Masami Hiramatsu, Jori Koolstra, Jiri Olsa, Thomas Weißschuh,
	Mathieu Desnoyers, Jeff Layton, Aleksa Sarai, cmirabil,
	Greg Kroah-Hartman, linux-kernel, linux-fsdevel, linux-api,
	linux-arch

This series implements the mkdirat2() syscall that was suggested on
the UAPI group's kernel-features page [1], along with some tests.

We probably also want equivalent mknodat2() and symlinkat2() syscalls,
but I believe they can be implemented in much the same way.

This has been compiled and tested on x86 only.

[1]: https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes

v2:
- Use AT_* flags.
- Ensure an fd is allocated only if mkdir and open_dentry succeed.
- The returned fd gets O_CLOEXEC by default.
- Renamed syscall from mkdirat_fd() to mkdirat2().

Jori Koolstra (2):
  vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
  selftest: add tests for mkdirat2()

 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 fs/internal.h                                 |   2 +
 fs/namei.c                                    |  44 +++++-
 include/linux/syscalls.h                      |   2 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 scripts/syscall.tbl                           |   1 +
 tools/include/uapi/asm-generic/unistd.h       |   5 +-
 .../testing/selftests/filesystems/.gitignore  |   1 +
 tools/testing/selftests/filesystems/Makefile  |   4 +-
 .../selftests/filesystems/mkdirat_fd_test.c   | 143 ++++++++++++++++++
 10 files changed, 200 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/mkdirat_fd_test.c

-- 
2.53.0



* [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
From: Jori Koolstra @ 2026-04-12 13:54 UTC
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner,
	Arnd Bergmann
  Cc: H . Peter Anvin, Jan Kara, Peter Zijlstra, Andrey Albershteyn,
	Masami Hiramatsu, Jori Koolstra, Jiri Olsa, Thomas Weißschuh,
	Mathieu Desnoyers, Jeff Layton, Aleksa Sarai, cmirabil,
	Greg Kroah-Hartman, linux-kernel, linux-fsdevel, linux-api,
	linux-arch
In-Reply-To: <20260412135434.3095416-1-jkoolstra@xs4all.nl>

Currently there is no way to race-freely create and open a directory.
For regular files we have open(O_CREAT) for creating a new file inode,
and returning a pinning fd to it. The lack of such functionality for
directories means that when populating a directory tree there's always
a race involved: the inodes first need to be created, and then opened
to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
but in the time window between the creation and the opening they might
be replaced by something else.

Addressing this race without proper APIs is possible (by immediately
fstat()ing what was opened, to verify that it has the right inode type),
but difficult to get right. Hence, mkdirat2() that creates a directory
and returns an O_DIRECTORY fd is useful.
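
For illustration, the existing userspace workaround described above
could look roughly like the sketch below. It uses only existing calls
(mkdirat(), openat(), fstat()); the helper name mkdir_then_open() is
made up and is not part of this patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create "name" under dfd, open it, then verify via fstat() that what
 * was opened is a directory. Returns an fd, or -1 with errno set.
 */
static int mkdir_then_open(int dfd, const char *name, mode_t mode)
{
	struct stat st;
	int fd, saved;

	if (mkdirat(dfd, name, mode) < 0)
		return -1;

	/* Race window: "name" may be replaced before this openat(). */
	fd = openat(dfd, name,
		    O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
	if (fd < 0)
		return -1;

	if (fstat(fd, &st) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISDIR(st.st_mode)) {
		close(fd);
		errno = ENOTDIR;
		return -1;
	}
	/*
	 * Note: this only proves that fd refers to *a* directory, not
	 * that it is the one created by the mkdirat() call above.
	 */
	return fd;
}
```

Even with the type check, the helper cannot tell whether the directory
it opened is the one it created, which is the gap a single
create-and-open syscall closes.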

This feature idea (and description) is taken from the UAPI group:
https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/internal.h                          |  2 ++
 fs/namei.c                             | 44 +++++++++++++++++++++++---
 include/linux/syscalls.h               |  2 ++
 include/uapi/asm-generic/unistd.h      |  5 ++-
 scripts/syscall.tbl                    |  1 +
 6 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..e200ca2067a4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,7 @@
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
 471	common	rseq_slice_yield	sys_rseq_slice_yield
+472	common	mkdirat2		sys_mkdirat2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/internal.h b/fs/internal.h
index cbc384a1aa09..c6a79afadacf 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -59,6 +59,8 @@ int may_linkat(struct mnt_idmap *idmap, const struct path *link);
 int filename_renameat2(int olddfd, struct filename *oldname, int newdfd,
 		 struct filename *newname, unsigned int flags);
 int filename_mkdirat(int dfd, struct filename *name, umode_t mode);
+struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
+		unsigned int flags, bool open);
 int filename_mknodat(int dfd, struct filename *name, umode_t mode, unsigned int dev);
 int filename_symlinkat(struct filename *from, int newdfd, struct filename *to);
 int filename_linkat(int olddfd, struct filename *old, int newdfd,
diff --git a/fs/namei.c b/fs/namei.c
index a880454a6415..6451e96dc225 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5255,18 +5255,36 @@ struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 EXPORT_SYMBOL(vfs_mkdir);
 
-int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
+static int mkdirat_lookup_flags(unsigned int flags)
+{
+	int lookup_flags = LOOKUP_DIRECTORY;
+
+	if (!(flags & AT_SYMLINK_NOFOLLOW))
+		lookup_flags |= LOOKUP_FOLLOW;
+	if (!(flags & AT_NO_AUTOMOUNT))
+		lookup_flags |= LOOKUP_AUTOMOUNT;
+
+	return lookup_flags;
+}
+
+int filename_mkdirat(int dfd, struct filename *name, umode_t mode) {
+	return PTR_ERR_OR_ZERO(do_file_mkdirat(dfd, name, mode, 0, false));
+}
+
+struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
+		unsigned int flags, bool open)
 {
 	struct dentry *dentry;
 	struct path path;
 	int error;
-	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	struct file *filp = NULL;
+	unsigned int lookup_flags = mkdirat_lookup_flags(flags);
 	struct delegated_inode delegated_inode = { };
 
 retry:
 	dentry = filename_create(dfd, name, &path, lookup_flags);
 	if (IS_ERR(dentry))
-		return PTR_ERR(dentry);
+		return ERR_CAST(dentry);
 
 	error = security_path_mkdir(&path, dentry,
 			mode_strip_umask(path.dentry->d_inode, mode));
@@ -5276,6 +5294,10 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
 		if (IS_ERR(dentry))
 			error = PTR_ERR(dentry);
 	}
+	if (open && !error && !is_delegated(&delegated_inode)) {
+		const struct path new_path = { .mnt = path.mnt, .dentry = dentry };
+		filp = dentry_open(&new_path, O_DIRECTORY, current_cred());
+	}
 	end_creating_path(&path, dentry);
 	if (is_delegated(&delegated_inode)) {
 		error = break_deleg_wait(&delegated_inode);
@@ -5286,7 +5308,21 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
 	}
-	return error;
+	if (error)
+		return ERR_PTR(error);
+	return filp;
+}
+
+#define VALID_MKDIRAT2_FLAGS (AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT)
+
+SYSCALL_DEFINE4(mkdirat2, int, dfd, const char __user *, pathname, umode_t, mode,
+		unsigned int, flags)
+{
+	CLASS(filename, name)(pathname);
+	if (flags & ~VALID_MKDIRAT2_FLAGS)
+		return -EINVAL;
+
+	return FD_ADD(O_CLOEXEC, do_file_mkdirat(dfd, name, mode, flags, true));
 }
 
 SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 02bd6ddb6278..b3b4ae26dbdd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
 asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
 				      u32 size, u32 flags);
 asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_mkdirat2(int dfd, const char __user *pathname, umode_t mode,
+				     unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..6efc21779b62 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_mkdirat2 472
+__SYSCALL(__NR_mkdirat2, sys_mkdirat2)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
 
 /*
  * 32 bit systems traditionally used different
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..9d86f29762ae 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	mkdirat2			sys_mkdirat2
-- 
2.53.0



* [RFC PATCH v2 2/2] selftest: add tests for mkdirat2()
From: Jori Koolstra @ 2026-04-12 13:54 UTC
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner,
	Arnd Bergmann
  Cc: H . Peter Anvin, Jan Kara, Peter Zijlstra, Andrey Albershteyn,
	Masami Hiramatsu, Jori Koolstra, Jiri Olsa, Thomas Weißschuh,
	Mathieu Desnoyers, Jeff Layton, Aleksa Sarai, cmirabil,
	Greg Kroah-Hartman, linux-kernel, linux-fsdevel, linux-api,
	linux-arch
In-Reply-To: <20260412135434.3095416-1-jkoolstra@xs4all.nl>

Add some tests for the new mkdirat2() syscall to test compliance and
to showcase its behaviour.

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
 tools/include/uapi/asm-generic/unistd.h       |   5 +-
 .../testing/selftests/filesystems/.gitignore  |   1 +
 tools/testing/selftests/filesystems/Makefile  |   4 +-
 .../selftests/filesystems/mkdirat_fd_test.c   | 143 ++++++++++++++++++
 4 files changed, 150 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/mkdirat_fd_test.c

diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..6efc21779b62 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_mkdirat2 472
+__SYSCALL(__NR_mkdirat2, sys_mkdirat2)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
 
 /*
  * 32 bit systems traditionally used different
diff --git a/tools/testing/selftests/filesystems/.gitignore b/tools/testing/selftests/filesystems/.gitignore
index 64ac0dfa46b7..84e2175d171f 100644
--- a/tools/testing/selftests/filesystems/.gitignore
+++ b/tools/testing/selftests/filesystems/.gitignore
@@ -5,3 +5,4 @@ fclog
 file_stressor
 anon_inode_test
 kernfs_test
+mkdirat_fd_test
diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile
index 85427d7f19b9..7357769db57a 100644
--- a/tools/testing/selftests/filesystems/Makefile
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 
-CFLAGS += $(KHDR_INCLUDES)
-TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test kernfs_test fclog
+CFLAGS += $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
+TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test kernfs_test fclog mkdirat_fd_test
 TEST_GEN_PROGS_EXTENDED := dnotify_test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/filesystems/mkdirat_fd_test.c b/tools/testing/selftests/filesystems/mkdirat_fd_test.c
new file mode 100644
index 000000000000..a02c0223d63b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/mkdirat_fd_test.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sys/stat.h>
+
+#include <asm-generic/unistd.h>
+
+#include "kselftest_harness.h"
+
+#ifndef VALID_MKDIRAT2_FLAGS
+#define VALID_MKDIRAT2_FLAGS (AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT)
+#endif
+
+#define mkdirat2_checked_flags(dfd, pathname, flags) ({		\
+	struct stat __st;					\
+	int __fd = sys_mkdirat2(dfd, pathname, S_IRWXU, flags);	\
+	ASSERT_GE(__fd, 0);					\
+	EXPECT_EQ(fstat(__fd, &__st), 0);			\
+	EXPECT_TRUE(S_ISDIR(__st.st_mode));			\
+	__fd;							\
+})
+
+#define mkdirat2_checked(dfd, pathname) \
+	mkdirat2_checked_flags(dfd, pathname, 0)
+
+
+static inline int sys_mkdirat2(int dfd, const char *pathname, mode_t mode,
+				 unsigned int flags)
+{
+	return syscall(__NR_mkdirat2, dfd, pathname, mode, flags);
+}
+
+FIXTURE(mkdirat2) {
+	char dirpath[PATH_MAX];
+	int dfd;
+};
+
+FIXTURE_SETUP(mkdirat2)
+{
+	snprintf(self->dirpath, sizeof(self->dirpath),
+		 "/tmp/mkdirat2_test.%d", getpid());
+	ASSERT_EQ(mkdir(self->dirpath, S_IRWXU), 0);
+
+	self->dfd = open(self->dirpath, O_DIRECTORY);
+	ASSERT_GE(self->dfd, 0);
+}
+
+FIXTURE_TEARDOWN(mkdirat2)
+{
+	close(self->dfd);
+	rmdir(self->dirpath);
+}
+
+/* Does mkdirat2 return an fd at all? */
+TEST_F(mkdirat2, returns_fd)
+{
+	int fd = mkdirat2_checked(self->dfd, "newdir");
+	EXPECT_EQ(close(fd), 0);
+	EXPECT_EQ(unlinkat(self->dfd, "newdir", AT_REMOVEDIR), 0);
+}
+
+/* The fd must refer to the directory that was just created. */
+TEST_F(mkdirat2, fd_is_created_dir)
+{
+	int fd;
+	struct stat st_via_fd, st_via_path;
+	char path[PATH_MAX];
+
+	fd = mkdirat2_checked(self->dfd, "checkdir");
+
+	ASSERT_EQ(fstat(fd, &st_via_fd), 0);
+
+	snprintf(path, sizeof(path), "%s/checkdir", self->dirpath);
+	ASSERT_EQ(stat(path, &st_via_path), 0);
+
+	EXPECT_EQ(st_via_fd.st_ino, st_via_path.st_ino);
+	EXPECT_EQ(st_via_fd.st_dev, st_via_path.st_dev);
+
+	EXPECT_EQ(close(fd), 0);
+	EXPECT_EQ(rmdir(path), 0);
+}
+
+
+/* Missing parent component must fail with ENOENT. */
+TEST_F(mkdirat2, enoent_missing_parent)
+{
+	EXPECT_EQ(sys_mkdirat2(self->dfd, "nonexistent/child", S_IRWXU, 0), -1);
+	EXPECT_EQ(errno, ENOENT);
+}
+
+/* An invalid dfd must fail with EBADF. */
+TEST_F(mkdirat2, ebadf)
+{
+	EXPECT_EQ(sys_mkdirat2(-42, "badfdir", S_IRWXU, 0), -1);
+	EXPECT_EQ(errno, EBADF);
+}
+
+/* A dfd that points to a file (not a directory) must fail with ENOTDIR. */
+TEST_F(mkdirat2, enotdir_dfd)
+{
+	int file_fd;
+
+	file_fd = openat(self->dfd, "file",
+			 O_CREAT | O_WRONLY, S_IRWXU);
+	ASSERT_GE(file_fd, 0);
+
+	EXPECT_EQ(sys_mkdirat2(file_fd, "subdir", S_IRWXU, 0), -1);
+	EXPECT_EQ(errno, ENOTDIR);
+
+	EXPECT_EQ(close(file_fd), 0);
+	EXPECT_EQ(unlinkat(self->dfd, "file", 0), 0);
+}
+
+/*
+ * The returned fd must be usable as a dfd for further *at() calls.
+ */
+TEST_F(mkdirat2, fd_usable_as_dfd)
+{
+	int parent_fd, child_fd;
+
+	parent_fd = mkdirat2_checked(self->dfd, "parent");
+	child_fd = mkdirat2_checked(parent_fd, "child");
+
+	EXPECT_EQ(close(child_fd), 0);
+	EXPECT_EQ(close(parent_fd), 0);
+
+	char path[PATH_MAX];
+	snprintf(path, sizeof(path), "%s/parent/child", self->dirpath);
+	EXPECT_EQ(rmdir(path), 0);
+	snprintf(path, sizeof(path), "%s/parent", self->dirpath);
+	EXPECT_EQ(rmdir(path), 0);
+}
+
+/* Unknown flags must be rejected with EINVAL. */
+TEST_F(mkdirat2, einval_unknown_flags)
+{
+	EXPECT_EQ(sys_mkdirat2(self->dfd, "flagsdir", S_IRWXU, ~VALID_MKDIRAT2_FLAGS), -1);
+	EXPECT_EQ(errno, EINVAL);
+}
+
+TEST_HARNESS_MAIN
-- 
2.53.0



* Re: [RFC PATCH v3 1/6] uapi: add goldfish_address_space userspace ABI header
From: Arnd Bergmann @ 2026-04-13 16:28 UTC
  To: Wenzhao Liao, rust-for-linux, linux-pci
  Cc: Miguel Ojeda, Danilo Krummrich, bhelgaas,
	Krzysztof Wilczyński, Greg Kroah-Hartman, linux-kernel,
	linux-api
In-Reply-To: <20260406165120.166928-2-wenzhaoliao@ruc.edu.cn>

On Mon, Apr 6, 2026, at 18:51, Wenzhao Liao wrote:

> +struct goldfish_address_space_allocate_block {
> +	__u64 size;
> +	__u64 offset;
> +	__u64 phys_addr;
> +};
> +
> +struct goldfish_address_space_ping {
> +	__u64 offset;
> +	__u64 size;
> +	__u64 metadata;
> +	__u32 version;
> +	__u32 wait_fd;
> +	__u32 wait_flags;
> +	__u32 direction;
> +};
> +
> +struct goldfish_address_space_claim_shared {
> +	__u64 offset;
> +	__u64 size;
> +};

All these ioctl structures are well-formed in the sense that they
are portable across architectures and won't leak kernel data
through implicit padding.
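
As a hedged illustration of the padding claim, the layout of the
largest structure can be pinned down at compile time. The struct is
copied from the quoted patch; the assertions are an addition for
illustration and not part of the series.

```c
#include <stddef.h>
#include <linux/types.h>

/* Copied from the quoted patch. */
struct goldfish_address_space_ping {
	__u64 offset;
	__u64 size;
	__u64 metadata;
	__u32 version;
	__u32 wait_fd;
	__u32 wait_flags;
	__u32 direction;
};

/*
 * 3 * 8 + 4 * 4 = 40 bytes. The __u64 members come first and the four
 * __u32 members fill exactly two 8-byte slots, so there is neither an
 * internal hole nor tail padding.
 */
_Static_assert(sizeof(struct goldfish_address_space_ping) == 40,
	       "unexpected padding in goldfish_address_space_ping");
_Static_assert(offsetof(struct goldfish_address_space_ping, version) == 24,
	       "hole before 'version'");
_Static_assert(offsetof(struct goldfish_address_space_ping, direction) == 36,
	       "hole before 'direction'");
```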

Two of the members are a bit worrying, but that may just
be my own understanding:

- the 'phys_addr' member sounds like it refers to a physical
  memory location in the CPU address space, which in general
  should not be visible to user space, as that tends to
  expose security problems if users with access to the
  device can use this to access data they should not.

- the 'version' field may refer to the version of the ioctl
  command, which is similarly discouraged since it is
  harder to deal with than just coming up with new ioctl
  command codes. If this refers to the version of the
  remote side, this is probably fine.

> +#define GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC 'G'
> +
> +#define GOLDFISH_ADDRESS_SPACE_IOCTL_OP(OP, T) \
> +	_IOWR(GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC, OP, T)

I think it would be better to remove this intermediate macro, since
it prevents easy scraping of ioctl command codes from looking
at the source file with regular expressions.

It is also unusual that all commands are both reading
and writing the data. Please check if you can make some
of them read-only or write-only.
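
A sketch of what the two suggestions could look like together, with
each command spelled out on one line using the standard _IOW()/_IOR()
macros. The command names, numbers, and directions below are
HYPOTHETICAL and chosen only for illustration; only the 'G' magic and
the struct come from the quoted patch.

```c
#include <linux/ioctl.h>
#include <linux/types.h>

/* Copied from the quoted patch. */
struct goldfish_address_space_claim_shared {
	__u64 offset;
	__u64 size;
};

/*
 * Hypothetical examples: one command where userspace only passes data
 * to the kernel (_IOW) and one where it only receives data (_IOR).
 * Each definition stays on a single line so that a regular expression
 * over the header can pick up the magic, number, and argument type.
 */
#define EXAMPLE_CLAIM_SHARED _IOW('G', 13, struct goldfish_address_space_claim_shared)
#define EXAMPLE_QUERY_SIZE _IOR('G', 14, __u64)
```

With definitions in this form, the direction bits encoded in the
command code match what the handler actually does, which is what tools
that decode ioctl numbers rely on.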

     Arnd


* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-04-14  8:02 UTC
  To: Jaegeuk Kim
  Cc: linux-kernel, linux-f2fs-devel, Akilesh Kailash, linux-fsdevel,
	linux-mm, linux-api
In-Reply-To: <adhPZxtbZxgU-37v@google.com>

Please add the relevant mailing lists when adding new user interfaces.

And I'm not sure hacks working around the proper large folio
implementation are something that should be merged upstream.

On Fri, Apr 10, 2026 at 01:16:23AM +0000, Jaegeuk Kim wrote:
> enum {
>        F2FS_XATTR_FADV_LARGEFOLIO,
> };
> 
> unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> 
> 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
>  -> register the inode number for large folio
> 2. chmod(file, 0400)
>  -> make Read-Only
> 3. fsync() && close() && open(READ)
>  -> f2fs_iget() with large folio
> 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
>  -> return error
> 5. close() and open()
>  -> goto #3
> 6. unlink
>  -> deregister the inode number
> 
> Suggested-by: Akilesh Kailash <akailash@google.com>
> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> ---
>  
>   Log from v1:
>    - add a condition in f2fs_drop_inode
>    - add Doc
> 
>  Documentation/filesystems/f2fs.rst | 41 ++++++++++++++++++++++++++----
>  fs/f2fs/checkpoint.c               |  2 +-
>  fs/f2fs/data.c                     |  2 +-
>  fs/f2fs/f2fs.h                     |  1 +
>  fs/f2fs/file.c                     | 11 ++++++--
>  fs/f2fs/inode.c                    | 19 +++++++++++---
>  fs/f2fs/super.c                    |  7 +++++
>  fs/f2fs/xattr.c                    | 35 ++++++++++++++++++++++++-
>  fs/f2fs/xattr.h                    |  6 +++++
>  9 files changed, 111 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
> index 7e4031631286..de899d0d3088 100644
> --- a/Documentation/filesystems/f2fs.rst
> +++ b/Documentation/filesystems/f2fs.rst
> @@ -1044,11 +1044,14 @@ page allocation for significant performance gains. To minimize code complexity,
>  this support is currently excluded from the write path, which requires handling
>  complex optimizations such as compression and block allocation modes.
>  
> -This optional feature is triggered only when a file's immutable bit is set.
> -Consequently, F2FS will return EOPNOTSUPP if a user attempts to open a cached
> -file with write permissions, even immediately after clearing the bit. Write
> -access is only restored once the cached inode is dropped. The usage flow is
> -demonstrated below:
> +This optional feature is triggered by two mechanisms: the file's immutable bit
> +or a specific xattr flag. In both cases, F2FS ensures data integrity by
> +restricting the file to a read-only state while large folios are active.
> +
> +1. Immutable Bit Approach:
> +Triggered when the FS_IMMUTABLE_FL is set. This is a strict enforcement
> +where the file cannot be modified at all until the bit is cleared and
> +the cached inode is dropped.
>  
>  .. code-block::
>  
> @@ -1078,3 +1081,31 @@ demonstrated below:
>     Written 4096 bytes with pattern = zero, total_time = 29 us, max_latency = 28 us
>  
>     # rm /data/testfile_read_seq
> +
> +2. XATTR fadvise Approach:
> +A more flexible registration via extended attributes.
> +
> +.. code-block::
> +
> +    enum {
> +        F2FS_XATTR_FADV_LARGEFOLIO,
> +    };
> +    unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> +
> +    /* Registers the inode number for large folio support in the subsystem.*/
> +    # setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> +
> +    /* The file must be made Read-Only to transition into the large folio path. */
> +    # fchmod(fd, 0400)
> +
> +    /* clean up dirty inode state. */
> +    # fsync(fd)
> +
> +    /* Drop the inode cache. */
> +    # close(fd)
> +
> +    /* f2fs_iget() instantiates the inode with large folio support.*/
> +    # open()
> +
> +    /* Returns -EOPNOTSUPP or error to protect the large folio cache.*/
> +    # open(WRITE), mkwrite on mmap, or chmod(WRITE)
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 01e1ba77263e..fdd62ddc3ed6 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -778,7 +778,7 @@ void f2fs_remove_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type)
>  	__remove_ino_entry(sbi, ino, type);
>  }
>  
> -/* mode should be APPEND_INO, UPDATE_INO or TRANS_DIR_INO */
> +/* mode should be APPEND_INO, UPDATE_INO, LARGE_FOLIO_INO, or TRANS_DIR_INO */
>  bool f2fs_exist_written_data(struct f2fs_sb_info *sbi, nid_t ino, int mode)
>  {
>  	struct inode_management *im = &sbi->im[mode];
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 965d4e6443c6..5e46230398d7 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -2494,7 +2494,7 @@ static int f2fs_read_data_large_folio(struct inode *inode,
>  	int ret = 0;
>  	bool folio_in_bio;
>  
> -	if (!IS_IMMUTABLE(inode) || f2fs_compressed_file(inode)) {
> +	if (f2fs_compressed_file(inode)) {
>  		if (folio)
>  			folio_unlock(folio);
>  		return -EOPNOTSUPP;
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index e40b6b2784ee..02bc6eb96a59 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -381,6 +381,7 @@ enum {
>  /* for the list of ino */
>  enum {
>  	ORPHAN_INO,		/* for orphan ino list */
> +	LARGE_FOLIO_INO,	/* for large folio case */
>  	APPEND_INO,		/* for append ino list */
>  	UPDATE_INO,		/* for update ino list */
>  	TRANS_DIR_INO,		/* for transactions dir ino list */
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index c0220cd7b332..64ba900410fc 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -2068,9 +2068,16 @@ static long f2fs_fallocate(struct file *file, int mode,
>  
>  static int f2fs_release_file(struct inode *inode, struct file *filp)
>  {
> -	if (atomic_dec_and_test(&F2FS_I(inode)->open_count))
> +	if (atomic_dec_and_test(&F2FS_I(inode)->open_count)) {
>  		f2fs_remove_donate_inode(inode);
> -
> +		/*
> +		 * In order to get large folio as soon as possible, let's drop
> +		 * inode cache asap. See also f2fs_drop_inode.
> +		 */
> +		if (f2fs_exist_written_data(F2FS_I_SB(inode),
> +					    inode->i_ino, LARGE_FOLIO_INO))
> +                       d_drop(filp->f_path.dentry);
> +	}
>  	/*
>  	 * f2fs_release_file is called at every close calls. So we should
>  	 * not drop any inmemory pages by close called by other process.
> diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> index 89240be8cc59..e100bc5a378c 100644
> --- a/fs/f2fs/inode.c
> +++ b/fs/f2fs/inode.c
> @@ -565,6 +565,20 @@ static bool is_meta_ino(struct f2fs_sb_info *sbi, unsigned int ino)
>  		ino == F2FS_COMPRESS_INO(sbi);
>  }
>  
> +static void f2fs_mapping_set_large_folio(struct inode *inode)
> +{
> +	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> +
> +	if (f2fs_compressed_file(inode))
> +		return;
> +	if (f2fs_quota_file(sbi, inode->i_ino))
> +		return;
> +	if (IS_IMMUTABLE(inode) ||
> +	    (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> +	     !(inode->i_mode & S_IWUGO)))
> +	    mapping_set_folio_min_order(inode->i_mapping, 0);
> +}
> +
>  struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
>  {
>  	struct f2fs_sb_info *sbi = F2FS_SB(sb);
> @@ -620,9 +634,7 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
>  		inode->i_op = &f2fs_file_inode_operations;
>  		inode->i_fop = &f2fs_file_operations;
>  		inode->i_mapping->a_ops = &f2fs_dblock_aops;
> -		if (IS_IMMUTABLE(inode) && !f2fs_compressed_file(inode) &&
> -		    !f2fs_quota_file(sbi, inode->i_ino))
> -			mapping_set_folio_min_order(inode->i_mapping, 0);
> +		f2fs_mapping_set_large_folio(inode);
>  	} else if (S_ISDIR(inode->i_mode)) {
>  		inode->i_op = &f2fs_dir_inode_operations;
>  		inode->i_fop = &f2fs_dir_operations;
> @@ -895,6 +907,7 @@ void f2fs_evict_inode(struct inode *inode)
>  	f2fs_remove_ino_entry(sbi, inode->i_ino, APPEND_INO);
>  	f2fs_remove_ino_entry(sbi, inode->i_ino, UPDATE_INO);
>  	f2fs_remove_ino_entry(sbi, inode->i_ino, FLUSH_INO);
> +	f2fs_remove_ino_entry(sbi, inode->i_ino, LARGE_FOLIO_INO);
>  
>  	if (!is_sbi_flag_set(sbi, SBI_IS_FREEZING)) {
>  		sb_start_intwrite(inode->i_sb);
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index ccf806b676f5..11d1e0c99ac1 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -1844,6 +1844,13 @@ static int f2fs_drop_inode(struct inode *inode)
>  			return 1;
>  		}
>  	}
> +	/*
> +	 * In order to get large folio as soon as possible, let's drop
> +	 * inode cache asap. See also f2fs_release_file.
> +	 */
> +	if (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> +	    !is_inode_flag_set(inode, FI_DIRTY_INODE))
> +		return 1;
>  
>  	/*
>  	 * This is to avoid a deadlock condition like below.
> diff --git a/fs/f2fs/xattr.c b/fs/f2fs/xattr.c
> index 941dc62a6d6f..0c0e44c2dcdd 100644
> --- a/fs/f2fs/xattr.c
> +++ b/fs/f2fs/xattr.c
> @@ -44,6 +44,16 @@ static void xattr_free(struct f2fs_sb_info *sbi, void *xattr_addr,
>  		kfree(xattr_addr);
>  }
>  
> +static int f2fs_xattr_fadvise_get(struct inode *inode, void *buffer)
> +{
> +	if (!buffer)
> +		goto out;
> +	if (mapping_large_folio_support(inode->i_mapping))
> +		*((unsigned int *)buffer) |= BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> +out:
> +	return sizeof(unsigned int);
> +}
> +
>  static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
>  		struct dentry *unused, struct inode *inode,
>  		const char *name, void *buffer, size_t size)
> @@ -61,10 +71,29 @@ static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
>  	default:
>  		return -EINVAL;
>  	}
> +	if (handler->flags == F2FS_XATTR_INDEX_USER &&
> +	    !strcmp(name, "fadvise"))
> +		return f2fs_xattr_fadvise_get(inode, buffer);
> +
>  	return f2fs_getxattr(inode, handler->flags, name,
>  			     buffer, size, NULL);
>  }
>  
> +static int f2fs_xattr_fadvise_set(struct inode *inode, const void *value)
> +{
> +	unsigned int new_fadvise;
> +
> +	new_fadvise = *(unsigned int *)value;
> +
> +	if (new_fadvise & BIT(F2FS_XATTR_FADV_LARGEFOLIO))
> +		f2fs_add_ino_entry(F2FS_I_SB(inode),
> +				inode->i_ino, LARGE_FOLIO_INO);
> +	else
> +		f2fs_remove_ino_entry(F2FS_I_SB(inode),
> +				inode->i_ino, LARGE_FOLIO_INO);
> +	return 0;
> +}
> +
>  static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
>  		struct mnt_idmap *idmap,
>  		struct dentry *unused, struct inode *inode,
> @@ -84,6 +113,10 @@ static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
>  	default:
>  		return -EINVAL;
>  	}
> +	if (handler->flags == F2FS_XATTR_INDEX_USER &&
> +	    !strcmp(name, "fadvise"))
> +		return f2fs_xattr_fadvise_set(inode, value);
> +
>  	return f2fs_setxattr(inode, handler->flags, name,
>  					value, size, NULL, flags);
>  }
> @@ -842,4 +875,4 @@ int __init f2fs_init_xattr_cache(void)
>  void f2fs_destroy_xattr_cache(void)
>  {
>  	kmem_cache_destroy(inline_xattr_slab);
> -}
> \ No newline at end of file
> +}
> diff --git a/fs/f2fs/xattr.h b/fs/f2fs/xattr.h
> index bce3d93e4755..455f460d014e 100644
> --- a/fs/f2fs/xattr.h
> +++ b/fs/f2fs/xattr.h
> @@ -24,6 +24,7 @@
>  #define F2FS_XATTR_REFCOUNT_MAX         1024
>  
>  /* Name indexes */
> +#define F2FS_USER_FADVISE_NAME			"user.fadvise"
>  #define F2FS_SYSTEM_ADVISE_NAME			"system.advise"
>  #define F2FS_XATTR_INDEX_USER			1
>  #define F2FS_XATTR_INDEX_POSIX_ACL_ACCESS	2
> @@ -39,6 +40,11 @@
>  #define F2FS_XATTR_NAME_ENCRYPTION_CONTEXT	"c"
>  #define F2FS_XATTR_NAME_VERITY			"v"
>  
> +/* used for F2FS_USER_FADVISE_NAME */
> +enum {
> +	F2FS_XATTR_FADV_LARGEFOLIO,
> +};
> +
>  struct f2fs_xattr_header {
>  	__le32  h_magic;        /* magic number for identification */
>  	__le32  h_refcount;     /* reference count */
> -- 
> 2.53.0.1213.gd9a14994de-goog
> 
> 
---end quoted text---

* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-04-14 17:33 UTC (permalink / raw)
  To: linux-fsdevel, brauner
  Cc: Jeff Layton, linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs,
	linux-cifs, v9fs, linux-kselftest, viro, jack, chuck.lever,
	alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
	mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
	andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
	sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
	miklos, hansg
In-Reply-To: <CAFfO_h5FOTv-VMbh2Dmwkp04BFxQu192gsvFLohDFXAWPccRNA@mail.gmail.com>

On Mon, Apr 6, 2026 at 9:30 PM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
>
> On Mon, Apr 6, 2026 at 5:27 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Sat, 2026-04-04 at 21:17 +0600, Dorjoy Chowdhury wrote:
> > > On Thu, Apr 2, 2026 at 1:02 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > >
> > > > On Mon, 2026-03-30 at 21:07 +0600, Dorjoy Chowdhury wrote:
> > > > > On Mon, Mar 30, 2026 at 5:49 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > >
> > > > > > On Sat, 2026-03-28 at 23:22 +0600, Dorjoy Chowdhury wrote:
> > > > > > > This flag indicates the path should be opened if it's a regular file.
> > > > > > > This is useful to write secure programs that want to avoid being
> > > > > > > tricked into opening device nodes with special semantics while thinking
> > > > > > > they operate on regular files. This is a requested feature from the
> > > > > > > uapi-group[1].
> > > > > > >
> > > > > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > > > > like FreeBSD, macOS.
> > > > > > >
> > > > > > > When used in combination with O_CREAT, either the regular file is
> > > > > > > created, or if the path already exists, it is opened if it's a regular
> > > > > > > file. Otherwise, -EFTYPE is returned.
> > > > > > >
> > > > > > > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > > > > > > as it doesn't make sense to open a path that is both a directory and a
> > > > > > > regular file.
> > > > > > >
> > > > > > > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> > > > > > >
> > > > > > > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > > > > > > ---
> > > > > > >  arch/alpha/include/uapi/asm/errno.h        |  2 ++
> > > > > > >  arch/alpha/include/uapi/asm/fcntl.h        |  1 +
> > > > > > >  arch/mips/include/uapi/asm/errno.h         |  2 ++
> > > > > > >  arch/parisc/include/uapi/asm/errno.h       |  2 ++
> > > > > > >  arch/parisc/include/uapi/asm/fcntl.h       |  1 +
> > > > > > >  arch/sparc/include/uapi/asm/errno.h        |  2 ++
> > > > > > >  arch/sparc/include/uapi/asm/fcntl.h        |  1 +
> > > > > > >  fs/ceph/file.c                             |  4 ++++
> > > > > > >  fs/fcntl.c                                 |  4 ++--
> > > > > > >  fs/gfs2/inode.c                            |  6 ++++++
> > > > > > >  fs/namei.c                                 |  4 ++++
> > > > > > >  fs/nfs/dir.c                               |  4 ++++
> > > > > > >  fs/open.c                                  |  8 +++++---
> > > > > > >  fs/smb/client/dir.c                        | 14 +++++++++++++-
> > > > > > >  include/linux/fcntl.h                      |  2 ++
> > > > > > >  include/uapi/asm-generic/errno.h           |  2 ++
> > > > > > >  include/uapi/asm-generic/fcntl.h           |  4 ++++
> > > > > > >  tools/arch/alpha/include/uapi/asm/errno.h  |  2 ++
> > > > > > >  tools/arch/mips/include/uapi/asm/errno.h   |  2 ++
> > > > > > >  tools/arch/parisc/include/uapi/asm/errno.h |  2 ++
> > > > > > >  tools/arch/sparc/include/uapi/asm/errno.h  |  2 ++
> > > > > > >  tools/include/uapi/asm-generic/errno.h     |  2 ++
> > > > > > >  22 files changed, 67 insertions(+), 6 deletions(-)
> > > > > > >
> > > > > > > diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h
> > > > > > > index 6791f6508632..1a99f38813c7 100644
> > > > > > > --- a/arch/alpha/include/uapi/asm/errno.h
> > > > > > > +++ b/arch/alpha/include/uapi/asm/errno.h
> > > > > > > @@ -127,4 +127,6 @@
> > > > > > >
> > > > > > >  #define EHWPOISON    139     /* Memory page has hardware error */
> > > > > > >
> > > > > > > +#define EFTYPE               140     /* Wrong file type for the intended operation */
> > > > > > > +
> > > > > > >  #endif
> > > > > > > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > > > > > > index 50bdc8e8a271..fe488bf7c18e 100644
> > > > > > > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > > > > > > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > > > > > > @@ -34,6 +34,7 @@
> > > > > > >
> > > > > > >  #define O_PATH               040000000
> > > > > > >  #define __O_TMPFILE  0100000000
> > > > > > > +#define OPENAT2_REGULAR      0200000000
> > > > > > >
> > > > > > >  #define F_GETLK              7
> > > > > > >  #define F_SETLK              8
> > > > > > > diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h
> > > > > > > index c01ed91b1ef4..1835a50b69ce 100644
> > > > > > > --- a/arch/mips/include/uapi/asm/errno.h
> > > > > > > +++ b/arch/mips/include/uapi/asm/errno.h
> > > > > > > @@ -126,6 +126,8 @@
> > > > > > >
> > > > > > >  #define EHWPOISON    168     /* Memory page has hardware error */
> > > > > > >
> > > > > > > +#define EFTYPE               169     /* Wrong file type for the intended operation */
> > > > > > > +
> > > > > > >  #define EDQUOT               1133    /* Quota exceeded */
> > > > > > >
> > > > > > >
> > > > > > > diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h
> > > > > > > index 8cbc07c1903e..93194fbb0a80 100644
> > > > > > > --- a/arch/parisc/include/uapi/asm/errno.h
> > > > > > > +++ b/arch/parisc/include/uapi/asm/errno.h
> > > > > > > @@ -124,4 +124,6 @@
> > > > > > >
> > > > > > >  #define EHWPOISON    257     /* Memory page has hardware error */
> > > > > > >
> > > > > > > +#define EFTYPE               258     /* Wrong file type for the intended operation */
> > > > > > > +
> > > > > > >  #endif
> > > > > > > diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
> > > > > > > index 03dee816cb13..d46812f2f0f4 100644
> > > > > > > --- a/arch/parisc/include/uapi/asm/fcntl.h
> > > > > > > +++ b/arch/parisc/include/uapi/asm/fcntl.h
> > > > > > > @@ -19,6 +19,7 @@
> > > > > > >
> > > > > > >  #define O_PATH               020000000
> > > > > > >  #define __O_TMPFILE  040000000
> > > > > > > +#define OPENAT2_REGULAR      0100000000
> > > > > > >
> > > > > > >  #define F_GETLK64    8
> > > > > > >  #define F_SETLK64    9
> > > > > > > diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h
> > > > > > > index 4a41e7835fd5..71940ec9130b 100644
> > > > > > > --- a/arch/sparc/include/uapi/asm/errno.h
> > > > > > > +++ b/arch/sparc/include/uapi/asm/errno.h
> > > > > > > @@ -117,4 +117,6 @@
> > > > > > >
> > > > > > >  #define EHWPOISON    135     /* Memory page has hardware error */
> > > > > > >
> > > > > > > +#define EFTYPE               136     /* Wrong file type for the intended operation */
> > > > > > > +
> > > > > > >  #endif
> > > > > > > diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
> > > > > > > index 67dae75e5274..bb6e9fa94bc9 100644
> > > > > > > --- a/arch/sparc/include/uapi/asm/fcntl.h
> > > > > > > +++ b/arch/sparc/include/uapi/asm/fcntl.h
> > > > > > > @@ -37,6 +37,7 @@
> > > > > > >
> > > > > > >  #define O_PATH               0x1000000
> > > > > > >  #define __O_TMPFILE  0x2000000
> > > > > > > +#define OPENAT2_REGULAR      0x4000000
> > > > > > >
> > > > > > >  #define F_GETOWN     5       /*  for sockets. */
> > > > > > >  #define F_SETOWN     6       /*  for sockets. */
> > > > > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > > > > index 66bbf6d517a9..6d8d4c7765e6 100644
> > > > > > > --- a/fs/ceph/file.c
> > > > > > > +++ b/fs/ceph/file.c
> > > > > > > @@ -977,6 +977,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > > > > >                       ceph_init_inode_acls(newino, &as_ctx);
> > > > > > >                       file->f_mode |= FMODE_CREATED;
> > > > > > >               }
> > > > > > > +             if ((flags & OPENAT2_REGULAR) && !d_is_reg(dentry)) {
> > > > > > > +                     err = -EFTYPE;
> > > > > > > +                     goto out_req;
> > > > > > > +             }
> > > > > >
> > > > > > ^^^
> > > > > > This doesn't look quite right. Here's a larger chunk of the code:
> > > > > >
> > > > > > -------------------------8<--------------------------
> > > > > >         if (d_in_lookup(dentry)) {
> > > > > >                 dn = ceph_finish_lookup(req, dentry, err);
> > > > > >                 if (IS_ERR(dn))
> > > > > >                         err = PTR_ERR(dn);
> > > > > >         } else {
> > > > > >                 /* we were given a hashed negative dentry */
> > > > > >                 dn = NULL;
> > > > > >         }
> > > > > >         if (err)
> > > > > >                 goto out_req;
> > > > > >         if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
> > > > > >                 /* make vfs retry on splice, ENOENT, or symlink */
> > > > > >                 doutc(cl, "finish_no_open on dn %p\n", dn);
> > > > > >                 err = finish_no_open(file, dn);
> > > > > >         } else {
> > > > > >                 if (IS_ENCRYPTED(dir) &&
> > > > > >                     !fscrypt_has_permitted_context(dir, d_inode(dentry))) {
> > > > > >                         pr_warn_client(cl,
> > > > > >                                 "Inconsistent encryption context (parent %llx:%llx child %llx:%llx)\n",
> > > > > >                                 ceph_vinop(dir), ceph_vinop(d_inode(dentry)));
> > > > > >                         goto out_req;
> > > > > >                 }
> > > > > >
> > > > > >                 doutc(cl, "finish_open on dn %p\n", dn);
> > > > > >                 if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
> > > > > >                         struct inode *newino = d_inode(dentry);
> > > > > >
> > > > > >                         cache_file_layout(dir, newino);
> > > > > >                         ceph_init_inode_acls(newino, &as_ctx);
> > > > > >                         file->f_mode |= FMODE_CREATED;
> > > > > >                 }
> > > > > >                 err = finish_open(file, dentry, ceph_open);
> > > > > >         }
> > > > > > -------------------------8<--------------------------
> > > > > >
> > > > > > It looks like this won't handle it correctly if the pathwalk terminates
> > > > > > on a symlink (re: d_is_symlink() case). You should either set up a test
> > > > > > ceph cluster on your own, or reach out to the ceph community and ask
> > > > > > them to test this.
> > > > > >
> > > > >
> > > > > Thanks for reviewing. The d_is_symlink() case seems to be calling
> > > > > finish_no_open so shouldn't this be okay?
> > > > >
> > > >
> > > > My mistake -- you're correct. I keep forgetting that finish_no_open()
> > > > will handle this case regardless of what else happens.
> > > >
> > > > > > >               err = finish_open(file, dentry, ceph_open);
> > > > > > >       }
> > > > > > >  out_req:
> > > > > > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > > > > > index beab8080badf..240bb511557a 100644
> > > > > > > --- a/fs/fcntl.c
> > > > > > > +++ b/fs/fcntl.c
> > > > > > > @@ -1169,9 +1169,9 @@ static int __init fcntl_init(void)
> > > > > > >        * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> > > > > > >        * is defined as O_NONBLOCK on some platforms and not on others.
> > > > > > >        */
> > > > > > > -     BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
> > > > > > > +     BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> > > > > > >               HWEIGHT32(
> > > > > > > -                     (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > > > > > > +                     (VALID_OPENAT2_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > > > > > >                       __FMODE_EXEC));
> > > > > > >
> > > > > > >       fasync_cache = kmem_cache_create("fasync_cache",
> > > > > > > diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> > > > > > > index 8344040ecaf7..4604e2e8a9cc 100644
> > > > > > > --- a/fs/gfs2/inode.c
> > > > > > > +++ b/fs/gfs2/inode.c
> > > > > > > @@ -738,6 +738,12 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
> > > > > > >       inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
> > > > > > >       error = PTR_ERR(inode);
> > > > > > >       if (!IS_ERR(inode)) {
> > > > > > > +             if (file && (file->f_flags & OPENAT2_REGULAR) && !S_ISREG(inode->i_mode)) {
> > > > > >
> > > > > > Isn't OPENAT2_REGULAR getting masked off in ->f_flags now?
> > > > > >
> > > > > Yes, I thought the masking off was happening after this codepath got
> > > > > executed. Maybe it's better anyway to pass another flags param to this
> > > > > function and forward the flags from the gfs2_atomic_open function and
> > > > > in other call sites pass 0 ? What do you think?
> > > > >
> > > >
> > > > Also my mistake. That happens in do_dentry_open() which happens in
> > > > finish_open(), so you should be OK here.
> > > >
> > > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > >
> > > Thanks for patiently reviewing this! I am planning on sending patches
> > > for man-pages and looking into some xfs-tests for this. But I am not
> > > sure if this patch series will get more reviews from others or if it
> > > will be picked up in the vfs branch?
> > >
> >
> > This is a change to rather core VFS infrastructure so yes, you should
> > expect some more review. Assuming no major issues are found, then yes,
> > this should eventually get picked up by the VFS maintainers.
> >
> > Cheers,
> > --
> > Jeff Layton <jlayton@kernel.org>
>
> Ping....
> This patch series got a "Reviewed-by" from Jeff Layton but it probably
> requires more reviews from other maintainers/reviewers as well. So
> requesting for review on this patch series. Thanks!
>

Ping...
Could someone please review this patch series?
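
For anyone joining the thread late, here is roughly what userspace has to
do today without such a flag. This is only a sketch for illustration:
open_regular_fallback() and WRONG_TYPE_ERRNO are made-up names, and since
EFTYPE is only proposed for Linux, EINVAL is used as a stand-in when the
headers do not define it. The sketch also refuses O_CREAT, which the
proposed flag would handle.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stand-in errno: use EFTYPE where the headers have it, else EINVAL. */
#ifdef EFTYPE
#define WRONG_TYPE_ERRNO EFTYPE
#else
#define WRONG_TYPE_ERRNO EINVAL
#endif

/*
 * Hypothetical helper (not from the patch): open `path` and keep the fd
 * only if it refers to a regular file. Returns the fd, or -1 with errno set.
 */
int open_regular_fallback(const char *path, int flags)
{
	struct stat st;
	int fd, saved;

	/* "Regular file" plus O_DIRECTORY makes no sense; O_CREAT would need
	 * a mode argument, which this sketch does not take. */
	if (flags & (O_DIRECTORY | O_CREAT)) {
		errno = EINVAL;
		return -1;
	}

	/* O_NONBLOCK keeps open() from blocking on a FIFO; O_NOCTTY keeps a
	 * terminal from becoming our controlling tty. */
	fd = open(path, flags | O_NONBLOCK | O_NOCTTY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	if (fstat(fd, &st) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		errno = WRONG_TYPE_ERRNO;
		return -1;
	}
	return fd;	/* note: O_NONBLOCK stays set on the returned fd */
}
```

The weakness is that the file type is only checked after open() has
already run, so a device node's open handler has been called by then.
Doing the check inside the kernel's open path avoids that.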

Regards,
Dorjoy

* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-04-15 16:41 UTC (permalink / raw)
  To: linux-kernel, linux-f2fs-devel, linux-api, linux-fsdevel; +Cc: Akilesh Kailash
In-Reply-To: <adhPZxtbZxgU-37v@google.com>

By the way, would it be worth adding some generic APIs, such as:
1) reclaiming a specific inode object when the last open file is closed
2) adding another fadvise hint for large folios
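
For reference, the userspace side of the f2fs-specific flow quoted below
could look roughly like this. It is an untested sketch: the helper names
request_large_folio() and query_large_folio() and the
F2FS_FADV_LARGEFOLIO_MASK macro are made up for illustration, BIT() is
expanded by hand because it is not available in userspace, and the
"user.fadvise" name and bit number are taken from the patch.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Bit number copied from the patch; the mask macro is an assumption. */
enum { F2FS_XATTR_FADV_LARGEFOLIO };
#define F2FS_FADV_LARGEFOLIO_MASK (1U << F2FS_XATTR_FADV_LARGEFOLIO)

/*
 * Hypothetical helper covering steps 1-3 of the flow.
 * Returns a read-only fd on success, -1 on failure.
 */
int request_large_folio(const char *path)
{
	unsigned int value = F2FS_FADV_LARGEFOLIO_MASK;
	int fd;

	/* Step 1: register the inode number for large folio. */
	if (setxattr(path, "user.fadvise", &value, sizeof(value), 0) < 0)
		return -1;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Step 2: make the file read-only, then flush dirty inode state. */
	if (fchmod(fd, 0400) < 0 || fsync(fd) < 0) {
		close(fd);
		return -1;
	}

	/* Step 3: close so the cached inode can be dropped, then reopen. */
	if (close(fd) < 0)
		return -1;
	return open(path, O_RDONLY | O_CLOEXEC);
}

/*
 * Hypothetical helper: ask whether the large folio bit is reported.
 * Returns 1 if set, 0 if clear, -1 on error.
 */
int query_large_folio(const char *path)
{
	unsigned int value = 0;	/* start from "no bits set" */

	if (getxattr(path, "user.fadvise", &value, sizeof(value)) < 0)
		return -1;
	return (value & F2FS_FADV_LARGEFOLIO_MASK) ? 1 : 0;
}
```

On a filesystem other than f2fs, or on a kernel without this patch,
"user.fadvise" is presumably just an ordinary user xattr and nothing
special happens, which is part of why a generic interface might be the
cleaner option.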

On 04/10, Jaegeuk Kim wrote:
> enum {
>        F2FS_XATTR_FADV_LARGEFOLIO,
> };
> 
> unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> 
> 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
>  -> register the inode number for large folio
> 2. chmod(0400, file)
>  -> make Read-Only
> 3. fsync() && close() && open(READ)
>  -> f2fs_iget() with large folio
> 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
>  -> return error
> 5. close() and open()
>  -> goto #3
> 6. unlink
>  -> deregister the inode number
> 
> Suggested-by: Akilesh Kailash <akailash@google.com>
> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> ---
>  
>   Log from v1:
>    - add a condition in f2fs_drop_inode
>    - add Doc
> 
>  Documentation/filesystems/f2fs.rst | 41 ++++++++++++++++++++++++++----
>  fs/f2fs/checkpoint.c               |  2 +-
>  fs/f2fs/data.c                     |  2 +-
>  fs/f2fs/f2fs.h                     |  1 +
>  fs/f2fs/file.c                     | 11 ++++++--
>  fs/f2fs/inode.c                    | 19 +++++++++++---
>  fs/f2fs/super.c                    |  7 +++++
>  fs/f2fs/xattr.c                    | 35 ++++++++++++++++++++++++-
>  fs/f2fs/xattr.h                    |  6 +++++
>  9 files changed, 111 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
> index 7e4031631286..de899d0d3088 100644
> --- a/Documentation/filesystems/f2fs.rst
> +++ b/Documentation/filesystems/f2fs.rst
> @@ -1044,11 +1044,14 @@ page allocation for significant performance gains. To minimize code complexity,
>  this support is currently excluded from the write path, which requires handling
>  complex optimizations such as compression and block allocation modes.
>  
> -This optional feature is triggered only when a file's immutable bit is set.
> -Consequently, F2FS will return EOPNOTSUPP if a user attempts to open a cached
> -file with write permissions, even immediately after clearing the bit. Write
> -access is only restored once the cached inode is dropped. The usage flow is
> -demonstrated below:
> +This optional feature is triggered by two mechanisms: the file's immutable bit
> +or a specific xattr flag. In both cases, F2FS ensures data integrity by
> +restricting the file to a read-only state while large folios are active.
> +
> +1. Immutable Bit Approach:
> +Triggered when the FS_IMMUTABLE_FL is set. This is a strict enforcement
> +where the file cannot be modified at all until the bit is cleared and
> +the cached inode is dropped.
>  
>  .. code-block::
>  
> @@ -1078,3 +1081,31 @@ demonstrated below:
>     Written 4096 bytes with pattern = zero, total_time = 29 us, max_latency = 28 us
>  
>     # rm /data/testfile_read_seq
> +
> +2. XATTR fadvise Approach:
> +A more flexible registration via extended attributes.
> +
> +.. code-block::
> +
> +    enum {
> +        F2FS_XATTR_FADV_LARGEFOLIO,
> +    };
> +    unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> +
> +    /* Registers the inode number for large folio support in the subsystem.*/
> +    # setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> +
> +    /* The file must be made Read-Only to transition into the large folio path. */
> +    # fchmod(fd, 0400)
> +
> +    /* Clean up dirty inode state. */
> +    # fsync(fd)
> +
> +    /* Drop the inode cache. */
> +    # close(fd)
> +
> +    /* f2fs_iget() instantiates the inode with large folio support.*/
> +    # open()
> +
> +    /* Returns -EOPNOTSUPP or error to protect the large folio cache.*/
> +    # open(WRITE), mkwrite on mmap, or chmod(WRITE)
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 01e1ba77263e..fdd62ddc3ed6 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -778,7 +778,7 @@ void f2fs_remove_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type)
>  	__remove_ino_entry(sbi, ino, type);
>  }
>  
> -/* mode should be APPEND_INO, UPDATE_INO or TRANS_DIR_INO */
> +/* mode should be APPEND_INO, UPDATE_INO, LARGE_FOLIO_INO, or TRANS_DIR_INO */
>  bool f2fs_exist_written_data(struct f2fs_sb_info *sbi, nid_t ino, int mode)
>  {
>  	struct inode_management *im = &sbi->im[mode];
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 965d4e6443c6..5e46230398d7 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -2494,7 +2494,7 @@ static int f2fs_read_data_large_folio(struct inode *inode,
>  	int ret = 0;
>  	bool folio_in_bio;
>  
> -	if (!IS_IMMUTABLE(inode) || f2fs_compressed_file(inode)) {
> +	if (f2fs_compressed_file(inode)) {
>  		if (folio)
>  			folio_unlock(folio);
>  		return -EOPNOTSUPP;
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index e40b6b2784ee..02bc6eb96a59 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -381,6 +381,7 @@ enum {
>  /* for the list of ino */
>  enum {
>  	ORPHAN_INO,		/* for orphan ino list */
> +	LARGE_FOLIO_INO,	/* for large folio case */
>  	APPEND_INO,		/* for append ino list */
>  	UPDATE_INO,		/* for update ino list */
>  	TRANS_DIR_INO,		/* for transactions dir ino list */
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index c0220cd7b332..64ba900410fc 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -2068,9 +2068,16 @@ static long f2fs_fallocate(struct file *file, int mode,
>  
>  static int f2fs_release_file(struct inode *inode, struct file *filp)
>  {
> -	if (atomic_dec_and_test(&F2FS_I(inode)->open_count))
> +	if (atomic_dec_and_test(&F2FS_I(inode)->open_count)) {
>  		f2fs_remove_donate_inode(inode);
> -
> +		/*
> +		 * In order to get large folio as soon as possible, let's drop
> +		 * inode cache asap. See also f2fs_drop_inode.
> +		 */
> +		if (f2fs_exist_written_data(F2FS_I_SB(inode),
> +					    inode->i_ino, LARGE_FOLIO_INO))
> +                       d_drop(filp->f_path.dentry);
> +	}
>  	/*
>  	 * f2fs_release_file is called at every close calls. So we should
>  	 * not drop any inmemory pages by close called by other process.
> diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> index 89240be8cc59..e100bc5a378c 100644
> --- a/fs/f2fs/inode.c
> +++ b/fs/f2fs/inode.c
> @@ -565,6 +565,20 @@ static bool is_meta_ino(struct f2fs_sb_info *sbi, unsigned int ino)
>  		ino == F2FS_COMPRESS_INO(sbi);
>  }
>  
> +static void f2fs_mapping_set_large_folio(struct inode *inode)
> +{
> +	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> +
> +	if (f2fs_compressed_file(inode))
> +		return;
> +	if (f2fs_quota_file(sbi, inode->i_ino))
> +		return;
> +	if (IS_IMMUTABLE(inode) ||
> +	    (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> +	     !(inode->i_mode & S_IWUGO)))
> +	    mapping_set_folio_min_order(inode->i_mapping, 0);
> +}
> +
>  struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
>  {
>  	struct f2fs_sb_info *sbi = F2FS_SB(sb);
> @@ -620,9 +634,7 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
>  		inode->i_op = &f2fs_file_inode_operations;
>  		inode->i_fop = &f2fs_file_operations;
>  		inode->i_mapping->a_ops = &f2fs_dblock_aops;
> -		if (IS_IMMUTABLE(inode) && !f2fs_compressed_file(inode) &&
> -		    !f2fs_quota_file(sbi, inode->i_ino))
> -			mapping_set_folio_min_order(inode->i_mapping, 0);
> +		f2fs_mapping_set_large_folio(inode);
>  	} else if (S_ISDIR(inode->i_mode)) {
>  		inode->i_op = &f2fs_dir_inode_operations;
>  		inode->i_fop = &f2fs_dir_operations;
> @@ -895,6 +907,7 @@ void f2fs_evict_inode(struct inode *inode)
>  	f2fs_remove_ino_entry(sbi, inode->i_ino, APPEND_INO);
>  	f2fs_remove_ino_entry(sbi, inode->i_ino, UPDATE_INO);
>  	f2fs_remove_ino_entry(sbi, inode->i_ino, FLUSH_INO);
> +	f2fs_remove_ino_entry(sbi, inode->i_ino, LARGE_FOLIO_INO);
>  
>  	if (!is_sbi_flag_set(sbi, SBI_IS_FREEZING)) {
>  		sb_start_intwrite(inode->i_sb);
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index ccf806b676f5..11d1e0c99ac1 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -1844,6 +1844,13 @@ static int f2fs_drop_inode(struct inode *inode)
>  			return 1;
>  		}
>  	}
> +	/*
> +	 * In order to get large folio as soon as possible, let's drop
> +	 * inode cache asap. See also f2fs_release_file.
> +	 */
> +	if (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> +	    !is_inode_flag_set(inode, FI_DIRTY_INODE))
> +		return 1;
>  
>  	/*
>  	 * This is to avoid a deadlock condition like below.
> diff --git a/fs/f2fs/xattr.c b/fs/f2fs/xattr.c
> index 941dc62a6d6f..0c0e44c2dcdd 100644
> --- a/fs/f2fs/xattr.c
> +++ b/fs/f2fs/xattr.c
> @@ -44,6 +44,16 @@ static void xattr_free(struct f2fs_sb_info *sbi, void *xattr_addr,
>  		kfree(xattr_addr);
>  }
>  
> +static int f2fs_xattr_fadvise_get(struct inode *inode, void *buffer)
> +{
> +	if (!buffer)
> +		goto out;
> +	if (mapping_large_folio_support(inode->i_mapping))
> +		*((unsigned int *)buffer) |= BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> +out:
> +	return sizeof(unsigned int);
> +}
> +
>  static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
>  		struct dentry *unused, struct inode *inode,
>  		const char *name, void *buffer, size_t size)
> @@ -61,10 +71,29 @@ static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
>  	default:
>  		return -EINVAL;
>  	}
> +	if (handler->flags == F2FS_XATTR_INDEX_USER &&
> +	    !strcmp(name, "fadvise"))
> +		return f2fs_xattr_fadvise_get(inode, buffer);
> +
>  	return f2fs_getxattr(inode, handler->flags, name,
>  			     buffer, size, NULL);
>  }
>  
> +static int f2fs_xattr_fadvise_set(struct inode *inode, const void *value)
> +{
> +	unsigned int new_fadvise;
> +
> +	new_fadvise = *(unsigned int *)value;
> +
> +	if (new_fadvise & BIT(F2FS_XATTR_FADV_LARGEFOLIO))
> +		f2fs_add_ino_entry(F2FS_I_SB(inode),
> +				inode->i_ino, LARGE_FOLIO_INO);
> +	else
> +		f2fs_remove_ino_entry(F2FS_I_SB(inode),
> +				inode->i_ino, LARGE_FOLIO_INO);
> +	return 0;
> +}
> +
>  static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
>  		struct mnt_idmap *idmap,
>  		struct dentry *unused, struct inode *inode,
> @@ -84,6 +113,10 @@ static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
>  	default:
>  		return -EINVAL;
>  	}
> +	if (handler->flags == F2FS_XATTR_INDEX_USER &&
> +	    !strcmp(name, "fadvise"))
> +		return f2fs_xattr_fadvise_set(inode, value);
> +
>  	return f2fs_setxattr(inode, handler->flags, name,
>  					value, size, NULL, flags);
>  }
> @@ -842,4 +875,4 @@ int __init f2fs_init_xattr_cache(void)
>  void f2fs_destroy_xattr_cache(void)
>  {
>  	kmem_cache_destroy(inline_xattr_slab);
> -}
> \ No newline at end of file
> +}
> diff --git a/fs/f2fs/xattr.h b/fs/f2fs/xattr.h
> index bce3d93e4755..455f460d014e 100644
> --- a/fs/f2fs/xattr.h
> +++ b/fs/f2fs/xattr.h
> @@ -24,6 +24,7 @@
>  #define F2FS_XATTR_REFCOUNT_MAX         1024
>  
>  /* Name indexes */
> +#define F2FS_USER_FADVISE_NAME			"user.fadvise"
>  #define F2FS_SYSTEM_ADVISE_NAME			"system.advise"
>  #define F2FS_XATTR_INDEX_USER			1
>  #define F2FS_XATTR_INDEX_POSIX_ACL_ACCESS	2
> @@ -39,6 +40,11 @@
>  #define F2FS_XATTR_NAME_ENCRYPTION_CONTEXT	"c"
>  #define F2FS_XATTR_NAME_VERITY			"v"
>  
> +/* used for F2FS_USER_FADVISE_NAME */
> +enum {
> +	F2FS_XATTR_FADV_LARGEFOLIO,
> +};
> +
>  struct f2fs_xattr_header {
>  	__le32  h_magic;        /* magic number for identification */
>  	__le32  h_refcount;     /* reference count */
> -- 
> 2.53.0.1213.gd9a14994de-goog
> 

* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-04-15 16:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-f2fs-devel, Akilesh Kailash, linux-fsdevel,
	linux-mm, linux-api
In-Reply-To: <ad30g9xMs9wNJhFb@infradead.org>

On 04/14, Christoph Hellwig wrote:
> Please add the relevant mailing lists when adding new user interfaces.
> 
> And I'm not sure hacks working around the proper large folio
> implementation are something that should be merged upstream.

Cc'ed linux-api and linux-fsdevel on the patch thread, along with a proposal
that I'm not sure is acceptable.

> 
> On Fri, Apr 10, 2026 at 01:16:23AM +0000, Jaegeuk Kim wrote:
> > enum {
> >        F2FS_XATTR_FADV_LARGEFOLIO,
> > };
> > 
> > unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> > 
> > 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> >  -> register the inode number for large folio
> > 2. chmod(0400, file)
> >  -> make Read-Only
> > 3. fsync() && close() && open(READ)
> >  -> f2fs_iget() with large folio
> > 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
> >  -> return error
> > 5. close() and open()
> >  -> goto #3
> > 6. unlink
> >  -> deregister the inode number
> > 
> > Suggested-by: Akilesh Kailash <akailash@google.com>
> > Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> > ---
> >  
> >   Log from v1:
> >    - add a condition in f2fs_drop_inode
> >    - add Doc
> > 
> >  Documentation/filesystems/f2fs.rst | 41 ++++++++++++++++++++++++++----
> >  fs/f2fs/checkpoint.c               |  2 +-
> >  fs/f2fs/data.c                     |  2 +-
> >  fs/f2fs/f2fs.h                     |  1 +
> >  fs/f2fs/file.c                     | 11 ++++++--
> >  fs/f2fs/inode.c                    | 19 +++++++++++---
> >  fs/f2fs/super.c                    |  7 +++++
> >  fs/f2fs/xattr.c                    | 35 ++++++++++++++++++++++++-
> >  fs/f2fs/xattr.h                    |  6 +++++
> >  9 files changed, 111 insertions(+), 13 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
> > index 7e4031631286..de899d0d3088 100644
> > --- a/Documentation/filesystems/f2fs.rst
> > +++ b/Documentation/filesystems/f2fs.rst
> > @@ -1044,11 +1044,14 @@ page allocation for significant performance gains. To minimize code complexity,
> >  this support is currently excluded from the write path, which requires handling
> >  complex optimizations such as compression and block allocation modes.
> >  
> > -This optional feature is triggered only when a file's immutable bit is set.
> > -Consequently, F2FS will return EOPNOTSUPP if a user attempts to open a cached
> > -file with write permissions, even immediately after clearing the bit. Write
> > -access is only restored once the cached inode is dropped. The usage flow is
> > -demonstrated below:
> > +This optional feature is triggered by two mechanisms: the file's immutable bit
> > +or a specific xattr flag. In both cases, F2FS ensures data integrity by
> > +restricting the file to a read-only state while large folios are active.
> > +
> > +1. Immutable Bit Approach:
> > +Triggered when the FS_IMMUTABLE_FL is set. This is a strict enforcement
> > +where the file cannot be modified at all until the bit is cleared and
> > +the cached inode is dropped.
> >  
> >  .. code-block::
> >  
> > @@ -1078,3 +1081,31 @@ demonstrated below:
> >     Written 4096 bytes with pattern = zero, total_time = 29 us, max_latency = 28 us
> >  
> >     # rm /data/testfile_read_seq
> > +
> > +2. XATTR fadvise Approach:
> > +A more flexible registration via extended attributes.
> > +
> > +.. code-block::
> > +
> > +    enum {
> > +        F2FS_XATTR_FADV_LARGEFOLIO,
> > +    };
> > +    unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> > +
> > +    /* Registers the inode number for large folio support in the subsystem.*/
> > +    # setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> > +
> > +    /* The file must be made Read-Only to transition into the large folio path. */
> > +    # fchmod(fd, 0400)
> > +
> > +    /* clean up dirty inode state. */
> > +    # fsync(fd)
> > +
> > +    /* Drop the inode cache. */
> > +    # close(fd)
> > +
> > +    /* f2fs_iget() instantiates the inode with large folio support.*/
> > +    # open()
> > +
> > +    /* Returns -EOPNOTSUPP or error to protect the large folio cache.*/
> > +    # open(WRITE), mkwrite on mmap, or chmod(WRITE)
> > diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> > index 01e1ba77263e..fdd62ddc3ed6 100644
> > --- a/fs/f2fs/checkpoint.c
> > +++ b/fs/f2fs/checkpoint.c
> > @@ -778,7 +778,7 @@ void f2fs_remove_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type)
> >  	__remove_ino_entry(sbi, ino, type);
> >  }
> >  
> > -/* mode should be APPEND_INO, UPDATE_INO or TRANS_DIR_INO */
> > +/* mode should be APPEND_INO, UPDATE_INO, LARGE_FOLIO_INO, or TRANS_DIR_INO */
> >  bool f2fs_exist_written_data(struct f2fs_sb_info *sbi, nid_t ino, int mode)
> >  {
> >  	struct inode_management *im = &sbi->im[mode];
> > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> > index 965d4e6443c6..5e46230398d7 100644
> > --- a/fs/f2fs/data.c
> > +++ b/fs/f2fs/data.c
> > @@ -2494,7 +2494,7 @@ static int f2fs_read_data_large_folio(struct inode *inode,
> >  	int ret = 0;
> >  	bool folio_in_bio;
> >  
> > -	if (!IS_IMMUTABLE(inode) || f2fs_compressed_file(inode)) {
> > +	if (f2fs_compressed_file(inode)) {
> >  		if (folio)
> >  			folio_unlock(folio);
> >  		return -EOPNOTSUPP;
> > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > index e40b6b2784ee..02bc6eb96a59 100644
> > --- a/fs/f2fs/f2fs.h
> > +++ b/fs/f2fs/f2fs.h
> > @@ -381,6 +381,7 @@ enum {
> >  /* for the list of ino */
> >  enum {
> >  	ORPHAN_INO,		/* for orphan ino list */
> > +	LARGE_FOLIO_INO,	/* for large folio case */
> >  	APPEND_INO,		/* for append ino list */
> >  	UPDATE_INO,		/* for update ino list */
> >  	TRANS_DIR_INO,		/* for transactions dir ino list */
> > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > index c0220cd7b332..64ba900410fc 100644
> > --- a/fs/f2fs/file.c
> > +++ b/fs/f2fs/file.c
> > @@ -2068,9 +2068,16 @@ static long f2fs_fallocate(struct file *file, int mode,
> >  
> >  static int f2fs_release_file(struct inode *inode, struct file *filp)
> >  {
> > -	if (atomic_dec_and_test(&F2FS_I(inode)->open_count))
> > +	if (atomic_dec_and_test(&F2FS_I(inode)->open_count)) {
> >  		f2fs_remove_donate_inode(inode);
> > -
> > +		/*
> > +		 * In order to get large folio as soon as possible, let's drop
> > +		 * inode cache asap. See also f2fs_drop_inode.
> > +		 */
> > +		if (f2fs_exist_written_data(F2FS_I_SB(inode),
> > +					    inode->i_ino, LARGE_FOLIO_INO))
> > +			d_drop(filp->f_path.dentry);
> > +	}
> >  	/*
> >  	 * f2fs_release_file is called at every close calls. So we should
> >  	 * not drop any inmemory pages by close called by other process.
> > diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> > index 89240be8cc59..e100bc5a378c 100644
> > --- a/fs/f2fs/inode.c
> > +++ b/fs/f2fs/inode.c
> > @@ -565,6 +565,20 @@ static bool is_meta_ino(struct f2fs_sb_info *sbi, unsigned int ino)
> >  		ino == F2FS_COMPRESS_INO(sbi);
> >  }
> >  
> > +static void f2fs_mapping_set_large_folio(struct inode *inode)
> > +{
> > +	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> > +
> > +	if (f2fs_compressed_file(inode))
> > +		return;
> > +	if (f2fs_quota_file(sbi, inode->i_ino))
> > +		return;
> > +	if (IS_IMMUTABLE(inode) ||
> > +	    (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> > +	     !(inode->i_mode & S_IWUGO)))
> > +		mapping_set_folio_min_order(inode->i_mapping, 0);
> > +}
> > +
> >  struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> >  {
> >  	struct f2fs_sb_info *sbi = F2FS_SB(sb);
> > @@ -620,9 +634,7 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> >  		inode->i_op = &f2fs_file_inode_operations;
> >  		inode->i_fop = &f2fs_file_operations;
> >  		inode->i_mapping->a_ops = &f2fs_dblock_aops;
> > -		if (IS_IMMUTABLE(inode) && !f2fs_compressed_file(inode) &&
> > -		    !f2fs_quota_file(sbi, inode->i_ino))
> > -			mapping_set_folio_min_order(inode->i_mapping, 0);
> > +		f2fs_mapping_set_large_folio(inode);
> >  	} else if (S_ISDIR(inode->i_mode)) {
> >  		inode->i_op = &f2fs_dir_inode_operations;
> >  		inode->i_fop = &f2fs_dir_operations;
> > @@ -895,6 +907,7 @@ void f2fs_evict_inode(struct inode *inode)
> >  	f2fs_remove_ino_entry(sbi, inode->i_ino, APPEND_INO);
> >  	f2fs_remove_ino_entry(sbi, inode->i_ino, UPDATE_INO);
> >  	f2fs_remove_ino_entry(sbi, inode->i_ino, FLUSH_INO);
> > +	f2fs_remove_ino_entry(sbi, inode->i_ino, LARGE_FOLIO_INO);
> >  
> >  	if (!is_sbi_flag_set(sbi, SBI_IS_FREEZING)) {
> >  		sb_start_intwrite(inode->i_sb);
> > diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> > index ccf806b676f5..11d1e0c99ac1 100644
> > --- a/fs/f2fs/super.c
> > +++ b/fs/f2fs/super.c
> > @@ -1844,6 +1844,13 @@ static int f2fs_drop_inode(struct inode *inode)
> >  			return 1;
> >  		}
> >  	}
> > +	/*
> > +	 * In order to get large folio as soon as possible, let's drop
> > +	 * inode cache asap. See also f2fs_release_file.
> > +	 */
> > +	if (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> > +	    !is_inode_flag_set(inode, FI_DIRTY_INODE))
> > +		return 1;
> >  
> >  	/*
> >  	 * This is to avoid a deadlock condition like below.
> > diff --git a/fs/f2fs/xattr.c b/fs/f2fs/xattr.c
> > index 941dc62a6d6f..0c0e44c2dcdd 100644
> > --- a/fs/f2fs/xattr.c
> > +++ b/fs/f2fs/xattr.c
> > @@ -44,6 +44,16 @@ static void xattr_free(struct f2fs_sb_info *sbi, void *xattr_addr,
> >  		kfree(xattr_addr);
> >  }
> >  
> > +static int f2fs_xattr_fadvise_get(struct inode *inode, void *buffer)
> > +{
> > +	if (!buffer)
> > +		goto out;
> > +	if (mapping_large_folio_support(inode->i_mapping))
> > +		*((unsigned int *)buffer) |= BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> > +out:
> > +	return sizeof(unsigned int);
> > +}
> > +
> >  static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
> >  		struct dentry *unused, struct inode *inode,
> >  		const char *name, void *buffer, size_t size)
> > @@ -61,10 +71,29 @@ static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
> >  	default:
> >  		return -EINVAL;
> >  	}
> > +	if (handler->flags == F2FS_XATTR_INDEX_USER &&
> > +	    !strcmp(name, "fadvise"))
> > +		return f2fs_xattr_fadvise_get(inode, buffer);
> > +
> >  	return f2fs_getxattr(inode, handler->flags, name,
> >  			     buffer, size, NULL);
> >  }
> >  
> > +static int f2fs_xattr_fadvise_set(struct inode *inode, const void *value)
> > +{
> > +	unsigned int new_fadvise;
> > +
> > +	new_fadvise = *(unsigned int *)value;
> > +
> > +	if (new_fadvise & BIT(F2FS_XATTR_FADV_LARGEFOLIO))
> > +		f2fs_add_ino_entry(F2FS_I_SB(inode),
> > +				inode->i_ino, LARGE_FOLIO_INO);
> > +	else
> > +		f2fs_remove_ino_entry(F2FS_I_SB(inode),
> > +				inode->i_ino, LARGE_FOLIO_INO);
> > +	return 0;
> > +}
> > +
> >  static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
> >  		struct mnt_idmap *idmap,
> >  		struct dentry *unused, struct inode *inode,
> > @@ -84,6 +113,10 @@ static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
> >  	default:
> >  		return -EINVAL;
> >  	}
> > +	if (handler->flags == F2FS_XATTR_INDEX_USER &&
> > +	    !strcmp(name, "fadvise"))
> > +		return f2fs_xattr_fadvise_set(inode, value);
> > +
> >  	return f2fs_setxattr(inode, handler->flags, name,
> >  					value, size, NULL, flags);
> >  }
> > @@ -842,4 +875,4 @@ int __init f2fs_init_xattr_cache(void)
> >  void f2fs_destroy_xattr_cache(void)
> >  {
> >  	kmem_cache_destroy(inline_xattr_slab);
> > -}
> > \ No newline at end of file
> > +}
> > diff --git a/fs/f2fs/xattr.h b/fs/f2fs/xattr.h
> > index bce3d93e4755..455f460d014e 100644
> > --- a/fs/f2fs/xattr.h
> > +++ b/fs/f2fs/xattr.h
> > @@ -24,6 +24,7 @@
> >  #define F2FS_XATTR_REFCOUNT_MAX         1024
> >  
> >  /* Name indexes */
> > +#define F2FS_USER_FADVISE_NAME			"user.fadvise"
> >  #define F2FS_SYSTEM_ADVISE_NAME			"system.advise"
> >  #define F2FS_XATTR_INDEX_USER			1
> >  #define F2FS_XATTR_INDEX_POSIX_ACL_ACCESS	2
> > @@ -39,6 +40,11 @@
> >  #define F2FS_XATTR_NAME_ENCRYPTION_CONTEXT	"c"
> >  #define F2FS_XATTR_NAME_VERITY			"v"
> >  
> > +/* used for F2FS_USER_FADVISE_NAME */
> > +enum {
> > +	F2FS_XATTR_FADV_LARGEFOLIO,
> > +};
> > +
> >  struct f2fs_xattr_header {
> >  	__le32  h_magic;        /* magic number for identification */
> >  	__le32  h_refcount;     /* reference count */
> > -- 
> > 2.53.0.1213.gd9a14994de-goog
> > 
> > 
> ---end quoted text---

^ permalink raw reply

* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Matthew Wilcox @ 2026-04-15 17:15 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Christoph Hellwig, linux-kernel, linux-f2fs-devel,
	Akilesh Kailash, linux-fsdevel, linux-mm, linux-api
In-Reply-To: <ad_AVHe7RMnGrGTb@google.com>

On Wed, Apr 15, 2026 at 04:44:04PM +0000, Jaegeuk Kim wrote:
> On 04/14, Christoph Hellwig wrote:
> > Please add the relevant mailing lists when adding new user interfaces.
> > 
> > And I'm not sure hacks working around the proper large folio
> > implementation are something that should be merged upstream.
> 
> Cc'ed linux-api and linux-fsdevel onto the patch thread with a proposal that
> I'm not sure it's acceptable or not. 

You haven't sent a proposal.  This is a reply to a reply to a reply of a
patch.  There's no justification for why f2fs is so special that it
needs this.  What the hell is going on?  You know this is not the way to
get code merged into Linux.

^ permalink raw reply

* Re: [PATCH 1/4] exec: inherit HWCAPs from the parent process
From: Andrei Vagin @ 2026-04-15 19:27 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andrei Vagin, Will Deacon, Kees Cook, Andrew Morton,
	Marek Szyprowski, Cyrill Gorcunov, Mike Rapoport,
	Alexander Mikhalitsyn, linux-kernel, linux-fsdevel, linux-mm,
	criu, Catalin Marinas, linux-arm-kernel, Chen Ridong,
	Christian Brauner, David Hildenbrand, Eric Biederman,
	Lorenzo Stoakes, Michal Koutny, Alexander Mikhalitsyn, Linux API
In-Reply-To: <adUhbk0sKT0ucWhJ@J2N7QTR9R3>

Hi Mark,

Thanks for the feedback, and sorry for the delay; I was on vacation.
Please see my comments inline.

On Tue, Apr 7, 2026 at 8:29 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Fri, Mar 27, 2026 at 05:21:26PM -0700, Andrei Vagin wrote:
> > Hi Mark,
> >
> > I understand all these points and they are valid. However, as I
> > mentioned, we are not trying to introduce a mechanism that will strictly
> > enforce feature sets for every container. While we would like to have
> > that functionality, as you and Will mentioned, it would require
> > substantially more complexity to address, and maintainers would be
> > unlikely to pick up that complexity.
>
> The crux of my complaint here is that unless you do that (to some
> degree), this is not going to work reliably, even with the constraints
> you outline.
>
> Further, I disagree with your proposed solution of pushing more
> constraints onto userspace (to also consider HWCAPs as overriding other
> mechanisms, etc).
>
> I think that as-is, the approach is flawed.

I would really appreciate it if we could move this conversation toward
how we can make it work.

>
> > Even masking ID registers on a per-container basis would introduce
> > extra complexity that could make architecture maintainers unhappy.
> > There were a few attempts to introduce container CPUID masking on
> > x86_64 in the past.
>
> > In CRIU, we are not aiming to handle every possible workload. Our goal
> > is to target workloads where developers are ready to cooperate and
> > willing to make adjustments to be C/R compatible. The goal here is to
> > provide developers with clear instructions on what they can do to ensure
> > their applications are C/R compatible. When I say "workloads", I mean
> > this in a broad sense. A container might pack a set of tools with
> > different runtimes (Go, Java, libc-based). All these runtimes should
> > detect only allowed features.
>
> I do not think that arbitrary applications (and libraries!) should have
> to pick up additional constraints that are unnecessary without CRIU,
> especially where that goes against deliberate design decisions (e.g.
> features in arm64's HINT instruction space, which are designed to be
> usable in fast paths WITHOUT needing explicit checks of things like
> HWCAPs). Note that those typically *do* have kernel controls.
>
> I think there's a much larger problem space than you anticipate, and
> adding an incomplete solution now is just going to introduce a
> maintenance burden.

I am not adding arbitrary constraints for standard non-CRIU use cases.
Previously, I suggested that standard libraries would need to call prctl
to determine if hwcaps should be used for feature detection.  However,
we can avoid this extra syscall by adding the new HWCAP2_CR bit. Then
libraries will simply check this bit in auxv[AT_HWCAP2], meaning the
overhead for "non-criu" cases is just a single bit check.
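
As a rough illustration of that check (a sketch only: HWCAP2_CR does not
exist in any released kernel, so HWCAP2_CR_SKETCH and its bit position
below are placeholders, and the helper names are hypothetical):

```c
#include <stdbool.h>
#include <sys/auxv.h>

/*
 * HYPOTHETICAL: stand-in for the proposed HWCAP2_CR bit. The bit position
 * is a placeholder chosen for this sketch, not an allocated value.
 */
#define HWCAP2_CR_SKETCH (1UL << 31)

/* True when hwcaps are declared the only valid source of feature info. */
bool hwcaps_are_authoritative(unsigned long hwcap2)
{
	return (hwcap2 & HWCAP2_CR_SKETCH) != 0;
}

/*
 * Decide whether a feature may be used. When the marker bit is set, only
 * the hwcap bit counts; otherwise fall back to whatever the runtime's own
 * probe (ID registers, CPUID, ...) reported.
 */
bool feature_usable(unsigned long hwcap2, unsigned long feature_bit,
		    bool probe_says_present)
{
	if (hwcaps_are_authoritative(hwcap2))
		return (hwcap2 & feature_bit) != 0;
	return probe_says_present;
}

/* Same decision, reading AT_HWCAP2 from the auxiliary vector. */
bool feature_usable_now(unsigned long feature_bit, bool probe_says_present)
{
	return feature_usable(getauxval(AT_HWCAP2), feature_bit,
			      probe_says_present);
}
```

A runtime would call feature_usable_now() once per feature at startup and
cache the result, so the non-CRIU path pays for one auxv read and one mask.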

As for HINT instructions, there are two classes of instructions.

The first class doesn't change process state, so it doesn't require any
special handling in terms of checkpoint/restore. If a process is
checkpointed on a newer CPU and restored on an older CPU, the older
hardware will simply skip over those instructions.  The architectural
state (registers, memory) should remain consistent.

The second class, such as PAC, consists of instructions that actually
change process state. These instructions require kernel/userspace
coordination. For example, usage of PAC keys can be controlled from
userspace via prctl. My point is that when kernel support for new
instructions is implemented, we will need to make sure userspace is able
to control them.

>
> > Returning to the subject of this patchset: this series extends the role
> > of hwcaps. With this change, we would establish that hwcaps is the
> > "source of truth" for which features an application can safely use. Any
> > other features available on the current CPU would not be guaranteed to
> > remain available after migration to another machine.
> >
> > After this discussion, I found that the current version missed one major
> > thing: there should be a signal indicating that hwcaps must be used for
> > feature detection. Since we will need to integrate this interface into
> > libc, Go, and other runtimes, they definitely should not rely just on
> > hwcaps by default, especially in the early stages. This can be solved
> > via the prctl command.  Libraries like libc would call
> > prctl(PR_USER_HWCAP_ENABLED). If this returns true, the runtime knows
> > that only the features explicitly listed in hwcaps should be used.
>
> I do not think we should be pushing that shape of constraint onto
> userspace.

See my previous comment.

>
> > You are right, the controlled feature set will be limited to features
> > the kernel knows about. And yes, we would need to report CPU features in
> > hwcaps even if the kernel isn't directly involved in handling them.
>
> To be clear, that is not what I am arguing.
>
> As I mentioned before, the way this works on arm64 is that the kernel
> only exposes what it is aware of, even in the ID regs accessible to
> userspace. We usually *can* hide features, and do that for cases of
> mismatched big.LITTLE, virtual machines, etc.

I understand that. My point was that the kernel would need to report
features in hwcaps even if they don't require specific kernel-side
handling.

>
> > Honestly, I am not certain if this is the "right" interface for that,
> > and I would be happy to consider other ideas. I understand that these
> > hwcaps will not work right out of the box, but we need a way to solve
> > this problem. Having a centralized API for CPU/kernel feature detection
> > seems like the right direction.
>
> I think that for better or worse the approach you are taking here simply
> does not solve enough of the problem to actually be worthwhile.

This approach mimics solutions that some CRIU users are already
implementing in userspace, but those only work when the user controls/
recompiles all their libraries. I am open to other ideas, but we need a
path forward.

>
> > As for signal frame size and extended states like SVE/SME, we are aware
> > of this problem.  However, it is partly mitigated by the fact that if
> > an application does not use some features, those states are not placed
> > in the signal frame.
>
> That is not true. The kernel can and will create signal frames for
> architectural state that a task might never have touched.
>
> Generally arm64 creates signal frames for features when the feature
> *exists*, regardless of whether the task has actively manipulated the
> relevant state. For example, on systems with SVE a trivial SVE signal
> frame gets created even if a task only uses the FPSIMD registers, and on
> systems with SME a TPIDR2 signal frame gets created even if the task has
> never read/written TPIDR2.
>
> When restoring, an unrecognised signal frame is treated as invalid, and
> we can require that certain signal frames are present.

You are right; that was my mistake. My only explanation for why we don't
see this failure often is that C/R is rarely triggered while a process
is actually inside a signal handler. This is definitely a problem that
still needs to be solved.

>
> > In the future, when we construct/reload a signal frame, we could look
> > at a process feature set for a process and generate a frame according
> > to those features...
>
> When you say 'we' here, are you talking about within the kernel, or
> within the userspace C/R mechanism?

... within the kernel.

Thanks,
Andrei

^ permalink raw reply

* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-04-15 22:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-kernel, linux-f2fs-devel,
	Akilesh Kailash, linux-fsdevel, linux-mm, linux-api
In-Reply-To: <ad_HwhzlNPUEKQi6@casper.infradead.org>

On 04/15, Matthew Wilcox wrote:
> On Wed, Apr 15, 2026 at 04:44:04PM +0000, Jaegeuk Kim wrote:
> > On 04/14, Christoph Hellwig wrote:
> > > Please add the relevant mailing lists when adding new user interfaces.
> > > 
> > > And I'm not sure hacks working around the proper large folio
> > > implementation are something that should be merged upstream.
> > 
> > Cc'ed linux-api and linux-fsdevel onto the patch thread with a proposal that
> > I'm not sure it's acceptable or not. 
> 
> You haven't sent a proposal.  This is a reply to a reply to a reply of a
> patch.  There's no justification for why f2fs is so special that it
> needs this.  What the hell is going on?  You know this is not the way to
> get code merged into Linux.

I added two ideas in that email. Have you even tried to understand?

^ permalink raw reply

* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Darrick J. Wong @ 2026-04-15 23:49 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Matthew Wilcox, Christoph Hellwig, linux-kernel, linux-f2fs-devel,
	Akilesh Kailash, linux-fsdevel, linux-mm, linux-api
In-Reply-To: <aeAK8mFxzgMOepmZ@google.com>

On Wed, Apr 15, 2026 at 10:02:26PM +0000, Jaegeuk Kim wrote:
> On 04/15, Matthew Wilcox wrote:
> > On Wed, Apr 15, 2026 at 04:44:04PM +0000, Jaegeuk Kim wrote:
> > > On 04/14, Christoph Hellwig wrote:
> > > > Please add the relevant mailing lists when adding new user interfaces.
> > > > 
> > > > And I'm not sure hacks working around the proper large folio
> > > > implementation are something that should be merged upstream.
> > > 
> > > Cc'ed linux-api and linux-fsdevel onto the patch thread with a proposal that
> > > I'm not sure it's acceptable or not. 
> > 
> > You haven't sent a proposal.  This is a reply to a reply to a reply of a
> > patch.  There's no justification for why f2fs is so special that it
> > needs this.  What the hell is going on?  You know this is not the way to
> > get code merged into Linux.
> 
> I added two ideas in that email. Have you even tried to understand?

You want to establish "user.fadvise" as an extended attribute containing
a bitmask.  The sole bit defined in that attribute means "use large
folios", but you also have to change the file mode and set the IMMUTABLE
bit for it to actually do anything.

Meanwhile, you can't actually persist any of the fadvise(2) advice
flags, so the xattr name doesn't even make sense.  Maybe you meant to
call it "user.madvise" since the closest thing I can think of is
MADV_HUGEPAGE?

I've understood enough.  YUCK.

--D

^ permalink raw reply

* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-04-16  1:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Christoph Hellwig, linux-kernel, linux-f2fs-devel,
	Akilesh Kailash, linux-fsdevel, linux-mm, linux-api
In-Reply-To: <20260415234950.GC114184@frogsfrogsfrogs>

On 04/15, Darrick J. Wong wrote:
> On Wed, Apr 15, 2026 at 10:02:26PM +0000, Jaegeuk Kim wrote:
> > On 04/15, Matthew Wilcox wrote:
> > > On Wed, Apr 15, 2026 at 04:44:04PM +0000, Jaegeuk Kim wrote:
> > > > On 04/14, Christoph Hellwig wrote:
> > > > > Please add the relevant mailing lists when adding new user interfaces.
> > > > > 
> > > > > And I'm not sure hacks working around the proper large folio
> > > > > implementation are something that should be merged upstream.
> > > > 
> > > > Cc'ed linux-api and linux-fsdevel onto the patch thread with a proposal that
> > > > I'm not sure it's acceptable or not. 
> > > 
> > > You haven't sent a proposal.  This is a reply to a reply to a reply of a
> > > patch.  There's no justification for why f2fs is so special that it
> > > needs this.  What the hell is going on?  You know this is not the way to
> > > get code merged into Linux.
> > 
> > I added two ideas in that email. Have you even tried to understand?
> 
> You want to establish "user.fadvise" as an extended attribute containing
> a bitmask.  The sole bit defined in that attribute means "use large
> folios", but you also have to change the file mode and set the IMMUTABLE
> bit for it to actually do anything.

Partly yes. This path has nothing to do with the IMMUTABLE bit. I used to
activate large folios with that bit, but it was a big pain: the bit has to
be cleared before the file can even be deleted.

So, this gives a new way to activate large folios using only chmod(0400)
and setxattr("user.fadvise"), while evicting the inode quickly so that
iget can set up the mapping, and keeping file deletion easy.
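
For reference, the userspace side of that flow can be sketched as follows.
This is only an illustration: the "user.fadvise" name and the
F2FS_XATTR_FADV_LARGEFOLIO bit come from the patch under discussion, not
from any merged kernel, and largefolio_xattr_value() and
register_large_folio() are hypothetical helper names. On a kernel or file
system without the patch, the call at best stores an ordinary user xattr
and has no effect on folio size.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Bit index taken from the enum the patch adds to fs/f2fs/xattr.h. */
#define F2FS_XATTR_FADV_LARGEFOLIO 0

/* Bitmask written into the proposed "user.fadvise" xattr. */
unsigned int largefolio_xattr_value(void)
{
	return 1U << F2FS_XATTR_FADV_LARGEFOLIO;
}

/*
 * Register the file, make it read-only, flush it, and reopen it so that
 * the inode can be re-instantiated. Returns a read-only fd, or -1 with
 * errno set.
 */
int register_large_folio(const char *path)
{
	unsigned int value = largefolio_xattr_value();
	int saved;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	/* 1. Register the inode number (proposed interface). */
	if (fsetxattr(fd, "user.fadvise", &value, sizeof(value), 0) < 0)
		goto fail;
	/* 2. Drop all write permission bits. */
	if (fchmod(fd, 0400) < 0)
		goto fail;
	/* 3. Clean up dirty inode state. */
	if (fsync(fd) < 0)
		goto fail;
	/* 4. Last close lets the patched kernel drop the cached inode. */
	if (close(fd) < 0)
		return -1;
	/* 5. Reopen read-only. */
	return open(path, O_RDONLY);

fail:
	saved = errno;
	close(fd);
	errno = saved;
	return -1;
}
```

Deleting the file afterwards needs no extra step, which is the difference
from the immutable-bit flow.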

I feel the arguable points would be 1) the path that evicts the inode by
calling d_drop in release_file and returning 1 in drop_inode, and 2) how
each file system should be given the hint: fadvise(FADV_LARGE_FOLIO) or
setxattr(user.fadvise).

> 
> Meanwhile, you can't actually persist any of the fadvise(2) advice
> flags, so the xattr name doesn't even make sense.  Maybe you meant to
> call it "user.madvise" since the closest thing I can think of is
> MADV_HUGEPAGE?
> 
> I've understood enough.  YUCK.

Thank you for taking the time to take a look.

> 
> --D

^ permalink raw reply

* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Aleksa Sarai @ 2026-04-16 11:41 UTC (permalink / raw)
  To: Dorjoy Chowdhury
  Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
	linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
	jlayton, chuck.lever, alex.aring, arnd, adilger, mjguzik,
	smfrench, richard.henderson, mattst88, linmag7, tsbogend,
	James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
	slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
	sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <20260328172314.45807-2-dorjoychy111@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2250 bytes --]

On 2026-03-28, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> This flag indicates the path should be opened if it's a regular file.
> This is useful to write secure programs that want to avoid being
> tricked into opening device nodes with special semantics while thinking
> they operate on regular files. This is a requested feature from the
> uapi-group[1].
> 
> A corresponding error code EFTYPE has been introduced. For example, if
> openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> like FreeBSD, macOS.
> 
> When used in combination with O_CREAT, either the regular file is
> created, or if the path already exists, it is opened if it's a regular
> file. Otherwise, -EFTYPE is returned.
> 
> When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> as it doesn't make sense to open a path that is both a directory and a
> regular file.
> 
> [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> 
> Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> ---

Aside from the nit below, feel free to take a

Reviewed-by: Aleksa Sarai <aleksa@amutable.com>

> diff --git a/fs/open.c b/fs/open.c
> index 681d405bc61e..a6f445f72181 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
>  	if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
>  		f->f_mode |= FMODE_CAN_ODIRECT;
>  
> -	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> +	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);

It's not clear to me why you dropped this, I didn't see a review
mentioning it either. (General note: Ideally the cover letter changelog
would mention who suggested a change in brackets after the changelog
line so it's easier to track where a change might've come from.)

I would personally keep it since O_DIRECTORY is not dropped (I do find
it interesting that O_EXCL is dropped too -- you could imagine a
userspace program wanting to know that the file was opened with O_EXCL,
though it provides you very little information).

-- 
Aleksa Sarai
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-04-16 11:58 UTC (permalink / raw)
  To: Aleksa Sarai, jlayton
  Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
	linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
	chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
	richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
	deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
	trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
	bharathsm, shuah, miklos, hansg
In-Reply-To: <2026-04-16-selfless-milky-wasps-shin-p6liRL@cyphar.com>

On Thu, Apr 16, 2026 at 5:41 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2026-03-28, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > This flag indicates the path should be opened if it's a regular file.
> > This is useful to write secure programs that want to avoid being
> > tricked into opening device nodes with special semantics while thinking
> > they operate on regular files. This is a requested feature from the
> > uapi-group[1].
> >
> > A corresponding error code EFTYPE has been introduced. For example, if
> > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > like FreeBSD, macOS.
> >
> > When used in combination with O_CREAT, either the regular file is
> > created, or if the path already exists, it is opened if it's a regular
> > file. Otherwise, -EFTYPE is returned.
> >
> > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > as it doesn't make sense to open a path that is both a directory and a
> > regular file.
> >
> > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> >
> > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > ---
>
> Aside from the nit below, feel free to take a
>
> Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
>

Thanks for reviewing!

> > diff --git a/fs/open.c b/fs/open.c
> > index 681d405bc61e..a6f445f72181 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
> >       if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
> >               f->f_mode |= FMODE_CAN_ODIRECT;
> >
> > -     f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> > +     f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);
>
> It's not clear to me why you dropped this, I didn't see a review
> mentioning it either. (General note: Ideally the cover letter changelog
> would mention who suggested a change in brackets after the changelog
> line so it's easier to track where a change might've come from.)
>

Thanks for the general note. I will keep that in mind.

The review was from Jeff Layton in v5:
https://lore.kernel.org/linux-fsdevel/5fcc2a6e6d92dae0601c6b3b8faa8b2f83981afb.camel@kernel.org/
"1. OPENAT2_REGULAR leaks into f_flags - do_dentry_open() strips
open-time-only flags (O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC) but does not
strip OPENAT2_REGULAR. When a regular file is successfully opened via
openat2() with this flag, the bit persists in file->f_flags and will
be returned by fcntl(fd, F_GETFL)."

I think it makes sense to strip it, since OPENAT2_REGULAR is an
open-time-only flag (like O_CREAT and the others already stripped), right?
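
The existing behaviour for the flags that are already stripped can be
observed from userspace. The following is a small sketch (the helper name
leaked_open_time_flags() and the /tmp scratch path are made up for
illustration) that opens a file with every open-time-only flag and checks
what fcntl(fd, F_GETFL) reports afterwards:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* The flags do_dentry_open() currently clears from f_flags. */
#define OPEN_TIME_ONLY (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC)

/*
 * Create a scratch file with all open-time-only flags set and return the
 * subset of them that F_GETFL still reports, or -1 if setup fails.
 */
int leaked_open_time_flags(void)
{
	char path[] = "/tmp/getfl-sketch-XXXXXX";
	int fd, fl;

	/* mkstemp() only reserves a unique name here. */
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	close(fd);
	/* Remove the file again so that O_CREAT | O_EXCL can succeed. */
	unlink(path);

	fd = open(path, O_WRONLY | OPEN_TIME_ONLY, 0600);
	if (fd < 0)
		return -1;

	fl = fcntl(fd, F_GETFL);
	close(fd);
	unlink(path);

	return fl < 0 ? -1 : (fl & OPEN_TIME_ONLY);
}
```

On Linux this returns 0: none of the open-time-only flags survive into
F_GETFL, which is the behaviour the hunk above extends to OPENAT2_REGULAR.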

Regards,
Dorjoy

^ permalink raw reply

* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Aleksa Sarai @ 2026-04-16 13:05 UTC (permalink / raw)
  To: Dorjoy Chowdhury
  Cc: jlayton, linux-fsdevel, Linus Torvalds, linux-kernel, linux-api,
	ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
	viro, brauner, jack, chuck.lever, alex.aring, arnd, adilger,
	mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
	James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
	slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
	sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CAFfO_h5kWCYszymaY=tPAbpU=PjLFxsND+CWSYtypN4iuW+qPw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4282 bytes --]

On 2026-04-16, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> On Thu, Apr 16, 2026 at 5:41 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2026-03-28, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > This flag indicates the path should be opened if it's a regular file.
> > > This is useful to write secure programs that want to avoid being
> > > tricked into opening device nodes with special semantics while thinking
> > > they operate on regular files. This is a requested feature from the
> > > uapi-group[1].
> > >
> > > A corresponding error code EFTYPE has been introduced. For example, if
> > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > like FreeBSD, macOS.
> > >
> > > When used in combination with O_CREAT, either the regular file is
> > > created, or if the path already exists, it is opened if it's a regular
> > > file. Otherwise, -EFTYPE is returned.
> > >
> > > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > > as it doesn't make sense to open a path that is both a directory and a
> > > regular file.
> > >
> > > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> > >
> > > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > > ---
> >
> > Aside from the nit below, feel free to take a
> >
> > Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
> >
> 
> Thanks for reviewing!
> 
> > > diff --git a/fs/open.c b/fs/open.c
> > > index 681d405bc61e..a6f445f72181 100644
> > > --- a/fs/open.c
> > > +++ b/fs/open.c
> > > @@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
> > >       if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
> > >               f->f_mode |= FMODE_CAN_ODIRECT;
> > >
> > > -     f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> > > +     f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);
> >
> > It's not clear to me why you dropped this, I didn't see a review
> > mentioning it either. (General note: Ideally the cover letter changelog
> > would mention who suggested a change in brackets after the changelog
> > line so it's easier to track where a change might've come from.)
> >
> 
> Thanks for the general note. I will keep that in mind.
> 
> The review was from Jeff Layton in v5
> https://lore.kernel.org/linux-fsdevel/5fcc2a6e6d92dae0601c6b3b8faa8b2f83981afb.camel@kernel.org/
> " 1. OPENAT2_REGULAR leaks into f_flags - do_dentry_open() strips
> open-time-only flags (O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC)
>   but does not strip OPENAT2_REGULAR. When a regular file is
> successfully opened via openat2() with this flag, the bit
>   persists in file->f_flags and will be returned by fcntl(fd, F_GETFL)."
> 
> I think it makes sense to strip it, as OPENAT2_REGULAR is an open-time-only
> flag (like O_CREAT and the others already), right?

Well, O_DIRECTORY isn't stripped so if we want to mirror that behaviour
then it shouldn't be stripped either IMHO.

O_NOCTTY and O_TRUNC make sense to strip (they are not relevant to the
file after it was opened -- truncation only happens at open time and you
can always set your controlling TTY later).

The story with O_CREAT and O_EXCL is a bit more complicated. They are
stripped but the history there is unclear -- the line was added in Linux
0.98.4(!) with no mention in the release note at the time. (Linus: I
wonder if you remember why this was changed at the time? Sorry for the
trip down memory lane...)

However, the existence of F_CREATED_QUERY kind of shows that these kinds
of checks are stuff that userspace can find handy (though FMODE_CREATED
is more useful than O_CREAT|O_EXCL anyway). O_EXCL is used internally
for stuff so it can be re-exposed, I'm just not sure it's a good
precedent to make a decision based on.

Then again, userspace can check with fstat(2) so it's not the end of the
world, but I don't really see a strong reason to hide information from
userspace. Since the mail was from Claude (and it tends to give silly
nits like that) I'm not sure whether Jeff would agree with my view or
not.
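
The O_DIRECTORY comparison and the fstat(2) fallback can be sketched as
follows. Both helper names are hypothetical and the code only assumes the
current Linux behaviour described above:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Return 1 if F_GETFL still reports O_DIRECTORY for a directory opened
 * with that flag, 0 if it does not, -1 on error.
 */
int getfl_keeps_o_directory(const char *dirpath)
{
	int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
	int fl;

	if (fd < 0)
		return -1;
	fl = fcntl(fd, F_GETFL);
	close(fd);
	if (fl < 0)
		return -1;
	return (fl & O_DIRECTORY) != 0;
}

/* Return 1 if fd refers to a regular file, 0 if not, -1 on error. */
int fd_is_regular(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return S_ISREG(st.st_mode) ? 1 : 0;
}
```

So the file type is always recoverable through fd_is_regular(), regardless
of whether the open flag itself survives in f_flags.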

-- 
Aleksa Sarai
https://www.cyphar.com/

* Re: [PATCH v6 0/4] OPENAT2_REGULAR flag support for openat2
From: Christian Brauner @ 2026-04-16 13:07 UTC (permalink / raw)
  To: linux-fsdevel, Dorjoy Chowdhury
  Cc: Christian Brauner, linux-kernel, linux-api, ceph-devel, gfs2,
	linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, jack, jlayton,
	chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
	richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
	deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
	trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
	bharathsm, shuah, miklos, hansg
In-Reply-To: <20260328172314.45807-1-dorjoychy111@gmail.com>

On Sat, 28 Mar 2026 23:22:21 +0600, Dorjoy Chowdhury wrote:
> I came upon this "Ability to only open regular files" uapi feature suggestion
> from https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> and thought it would be something I could do as a first patch and get to
> know the kernel code a bit better.
> 
> The following filesystems have been tested by building and booting the kernel
> x86 bzImage in a Fedora 43 VM in QEMU. I have tested with OPENAT2_REGULAR that
> regular files can be successfully opened and non-regular files (directory, fifo etc)
> return -EFTYPE.
> - btrfs
> - NFS (loopback)
> - SMB (loopback)
> 
> [...]

- I've added an explanation why OPENAT2_REGULAR is only needed for some
  ->atomic_open() implementers but not others. What I don't like is that
  we need all that custom handling in there but it's manageable.

- I dropped the topmost style conversions. They really don't belong
  there and if we switch to something better we should use (1 << <nr>).

- I split the EFTYPE errno introduction into a separate patch.

---

Applied to the vfs-7.2.openat.regular branch of the vfs/vfs.git tree.
Patches in the vfs-7.2.openat.regular branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: master

[1/4] openat2: new OPENAT2_REGULAR flag support
      https://git.kernel.org/vfs/vfs/c/0b649c4d70f7
[2/4] kselftest/openat2: test for OPENAT2_REGULAR flag
      https://git.kernel.org/vfs/vfs/c/d7dc36df8fa7
[3/4] sparc/fcntl.h: convert O_* flag macros from hex to octal
      (dropped)
[4/4] mips/fcntl.h: convert O_* flag macros from hex to octal
      (dropped)

* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jeff Layton @ 2026-04-16 13:28 UTC (permalink / raw)
  To: Aleksa Sarai, Dorjoy Chowdhury
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, linux-api,
	ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
	viro, brauner, jack, chuck.lever, alex.aring, arnd, adilger,
	mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
	James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
	slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
	sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <2026-04-16-raunchy-random-curfew-guide-GmtLJR@cyphar.com>

On Thu, 2026-04-16 at 23:05 +1000, Aleksa Sarai wrote:
> On 2026-04-16, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > On Thu, Apr 16, 2026 at 5:41 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > 
> > > On 2026-03-28, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > > This flag indicates the path should be opened if it's a regular file.
> > > > This is useful to write secure programs that want to avoid being
> > > > tricked into opening device nodes with special semantics while thinking
> > > > they operate on regular files. This is a requested feature from the
> > > > uapi-group[1].
> > > > 
> > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > like FreeBSD, macOS.
> > > > 
> > > > When used in combination with O_CREAT, either the regular file is
> > > > created, or if the path already exists, it is opened if it's a regular
> > > > file. Otherwise, -EFTYPE is returned.
> > > > 
> > > > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > > > as it doesn't make sense to open a path that is both a directory and a
> > > > regular file.
> > > > 
> > > > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> > > > 
> > > > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > > > ---
> > > 
> > > Aside from the nit below, feel free to take a
> > > 
> > > Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
> > > 
> > 
> > Thanks for reviewing!
> > 
> > > > diff --git a/fs/open.c b/fs/open.c
> > > > index 681d405bc61e..a6f445f72181 100644
> > > > --- a/fs/open.c
> > > > +++ b/fs/open.c
> > > > @@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
> > > >       if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
> > > >               f->f_mode |= FMODE_CAN_ODIRECT;
> > > > 
> > > > -     f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> > > > +     f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);
> > > 
> > > It's not clear to me why you dropped this, I didn't see a review
> > > mentioning it either. (General note: Ideally the cover letter changelog
> > > would mention who suggested a change in brackets after the changelog
> > > line so it's easier to track where a change might've come from.)
> > > 
> > 
> > Thanks for the general note. I will keep that in mind.
> > 
> > The review was from Jeff Layton in v5
> > https://lore.kernel.org/linux-fsdevel/5fcc2a6e6d92dae0601c6b3b8faa8b2f83981afb.camel@kernel.org/
> > " 1. OPENAT2_REGULAR leaks into f_flags - do_dentry_open() strips
> > open-time-only flags (O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC)
> >   but does not strip OPENAT2_REGULAR. When a regular file is
> > successfully opened via openat2() with this flag, the bit
> >   persists in file->f_flags and will be returned by fcntl(fd, F_GETFL)."
> > 
> > I think it makes sense to strip it, as OPENAT2_REGULAR is an open-time-only
> > flag (like O_CREAT and the others already), right?
> 
> Well, O_DIRECTORY isn't stripped so if we want to mirror that behaviour
> then it shouldn't be stripped either IMHO.
> 
> O_NOCTTY and O_TRUNC make sense to strip (they are not relevant to the
> file after it was opened -- truncation only happens at open time and you
> can always set your controlling TTY later).
> 
> The story with O_CREAT and O_EXCL is a bit more complicated. They are
> stripped but the history there is unclear -- the line was added in Linux
> 0.98.4(!) with no mention in the release note at the time. (Linus: I
> wonder if you remember why this was changed at the time? Sorry for the
> trip down memory lane...)
> 
> However, the existence of F_CREATED_QUERY kind of shows that these kinds
> of checks are stuff that userspace can find handy (though FMODE_CREATED
> is more useful than O_CREAT|O_EXCL anyway). O_EXCL is used internally
> for stuff so it can be re-exposed, I'm just not sure it's a good
> precedent to make a decision based on.
> 
> Then again, userspace can check with fstat(2) so it's not the end of the
> world, but I don't really see a strong reason to hide information from
> userspace. Since the mail was from Claude (and it tends to give silly
> nits like that) I'm not sure whether Jeff would agree with my view or
> not.

I don't have a strong feeling either way, but it "feels" like O_REGULAR
is not particularly useful to return in F_GETFL.

Once the file is open, then O_REGULAR really doesn't matter anymore. We
_know_ it's a regular file at that point or the open wouldn't have
happened. F_GETFL is more useful for showing flags that actually affect
how the file description works (e.g. O_DIRECT, O_ASYNC, etc.).
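
A small illustration of that distinction, using O_NONBLOCK on a pipe
rather than O_DIRECT or O_ASYNC simply because its effect is easy to
observe. The helper name is hypothetical:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Set O_NONBLOCK on the read end of an empty pipe via F_SETFL.
 * Return 1 if F_GETFL then reports the flag and read() fails with
 * EAGAIN instead of blocking, 0 otherwise, -1 on setup error.
 */
int nonblock_is_visible_and_effective(void)
{
	int p[2], fl, ok;
	ssize_t n;
	char c;

	if (pipe(p) < 0)
		return -1;

	fl = fcntl(p[0], F_GETFL);
	if (fl < 0 || fcntl(p[0], F_SETFL, fl | O_NONBLOCK) < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}

	fl = fcntl(p[0], F_GETFL);
	n = read(p[0], &c, 1);	/* write end is still open, pipe is empty */
	ok = fl >= 0 && (fl & O_NONBLOCK) && n < 0 && errno == EAGAIN;

	close(p[0]);
	close(p[1]);
	return ok;
}
```

Here the flag reported by F_GETFL changes what read() does on the
description, which is not the case for a flag that only matters while the
open is being resolved.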

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jori Koolstra @ 2026-04-16 13:52 UTC (permalink / raw)
  To: Dorjoy Chowdhury
  Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
	linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
	jlayton, chuck.lever, alex.aring, arnd, adilger, mjguzik,
	smfrench, richard.henderson, mattst88, linmag7, tsbogend,
	James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
	slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
	sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <20260328172314.45807-2-dorjoychy111@gmail.com>

On Sat, Mar 28, 2026 at 11:22:22PM +0600, Dorjoy Chowdhury wrote:
> diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> index 50bdc8e8a271..fe488bf7c18e 100644
> --- a/arch/alpha/include/uapi/asm/fcntl.h
> +++ b/arch/alpha/include/uapi/asm/fcntl.h
> @@ -34,6 +34,7 @@
>  
>  #define O_PATH		040000000
>  #define __O_TMPFILE	0100000000
> +#define OPENAT2_REGULAR	0200000000
>

I don't quite understand why we are adding OPENAT2_REGULAR inside the
O_* flag range. Wasn't this supposed to be only supported for openat2()?
If so, I don't see the need to waste an O_* flag bit. But maybe I am
missing something.
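
For reference, each octal value in the alpha hunk above sets a single
bit, and OPENAT2_REGULAR takes the next bit after __O_TMPFILE. A sketch
that maps such a value to its bit index; flag_bit_index() is a
hypothetical helper, and the numbers are only the alpha ones quoted above
(other architectures use different values):

```c
#include <stdint.h>

/* Return the bit index if exactly one bit is set in flag, else -1. */
int flag_bit_index(uint32_t flag)
{
	int n = 0;

	if (flag == 0 || (flag & (flag - 1)) != 0)
		return -1;
	while (!(flag & 1u)) {
		flag >>= 1;
		n++;
	}
	return n;
}
```

With the alpha values, 040000000 (O_PATH) is bit 23, 0100000000
(__O_TMPFILE) is bit 24 and 0200000000 (OPENAT2_REGULAR) is bit 25, so the
new flag does consume one of the remaining bits of a 32-bit flags word.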

Thanks,
Jori.

* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-04-16 14:21 UTC (permalink / raw)
  To: Jori Koolstra, Dorjoy Chowdhury, linux-fsdevel, linux-kernel,
	linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs,
	linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
	alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
	mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
	andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
	sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
	miklos, hansg
In-Reply-To: <aeDpIgfDaIKEaBcL@lt-jori.localdomain>

On Thu, Apr 16, 2026 at 7:52 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
>
> On Sat, Mar 28, 2026 at 11:22:22PM +0600, Dorjoy Chowdhury wrote:
> > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > index 50bdc8e8a271..fe488bf7c18e 100644
> > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > @@ -34,6 +34,7 @@
> >
> >  #define O_PATH               040000000
> >  #define __O_TMPFILE  0100000000
> > +#define OPENAT2_REGULAR      0200000000
> >
>
> I don't quite understand why we are adding OPENAT2_REGULAR inside the
> O_* flag range. Wasn't this supposed to be only supported for openat2()?
> If so, I don't see the need to waste an O_* flag bit. But maybe I am
> missing something.
>

Yes, OPENAT2_REGULAR is only supported for openat2. I am not sure if I
got a specific review to not add OPENAT2_REGULAR in the O_* flag 32
bit range. But as far as I understand, we can't easily add new O_*
flags to the old open system calls: those codepaths silently ignore
unknown bits instead of rejecting them as openat2 does, so giving a
previously ignored bit a meaning could break userspace programs. So I
guess it's okay to add OPENAT2_REGULAR in the 32 bits range anyway?
(Also lots of code paths take 32bit flags param right now and those
would need changing to take uint64_t instead but this is of course not
a reason to not add the new flag outside of the 32 bits).
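
To illustrate the difference between the two entry points: as far as I
know, open() and openat() mask off flag bits they do not recognise, while
openat2() fails with EINVAL. The sketch below makes several assumptions:
it mirrors the v0 layout of struct open_how locally (open_how_v0 is a
made-up name) so it builds without <linux/openat2.h>, it falls back to
syscall number 437 if the libc headers lack SYS_openat2, and it picks
bit 30 and bit 63 as "unknown" bits, which holds for the kernels I am
aware of:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437	/* assumption: the number used on most architectures */
#endif

/* Local mirror of the kernel's struct open_how (v0 layout). */
struct open_how_v0 {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* open(2) with an undefined flag bit: 0 on success, -errno on failure. */
int try_open_unknown_bit(const char *path)
{
	int fd = open(path, O_RDONLY | (1 << 30));

	if (fd < 0)
		return -errno;
	close(fd);
	return 0;
}

/* openat2(2) with an undefined flag bit: 0 on success, -errno on failure. */
int try_openat2_unknown_bit(const char *path)
{
	struct open_how_v0 how = { .flags = O_RDONLY | (1ULL << 63) };
	long fd = syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof(how));

	if (fd < 0)
		return -errno;
	close((int)fd);
	return 0;
}
```

On a kernel with openat2() I would expect try_open_unknown_bit("/") to
return 0 and try_openat2_unknown_bit("/") to return -EINVAL. A sandbox
that filters the syscall may return -ENOSYS or -EPERM instead.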

Regards,
Dorjoy

* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jori Koolstra @ 2026-04-16 15:03 UTC (permalink / raw)
  To: Dorjoy Chowdhury, linux-fsdevel, linux-kernel, linux-api,
	ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
	viro, brauner, jack, jlayton, chuck.lever, alex.aring, arnd,
	adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
	tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
	amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
	ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg,
	Aleksa Sarai
In-Reply-To: <CAFfO_h6pkyX=uN5uoXda6toTtT6KsahfBNBLom9i21HdZ7JOmQ@mail.gmail.com>


> On 16-04-2026 16:21 CEST, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> 
>  
> On Thu, Apr 16, 2026 at 7:52 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
> >
> > On Sat, Mar 28, 2026 at 11:22:22PM +0600, Dorjoy Chowdhury wrote:
> > > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > > index 50bdc8e8a271..fe488bf7c18e 100644
> > > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > > @@ -34,6 +34,7 @@
> > >
> > >  #define O_PATH               040000000
> > >  #define __O_TMPFILE  0100000000
> > > +#define OPENAT2_REGULAR      0200000000
> > >
> >
> > I don't quite understand why we are adding OPENAT2_REGULAR inside the
> > O_* flag range. Wasn't this supposed to be only supported for openat2()?
> > If so, I don't see the need to waste an O_* flag bit. But maybe I am
> > missing something.
> >
> 
> Yes, OPENAT2_REGULAR is only supported for openat2. I am not sure if I
> got a specific review to not add OPENAT2_REGULAR in the O_* flag 32
> bit range. But as far as I understand, we can't easily add new O_*
> flags to the old open system calls: those codepaths silently ignore
> unknown bits instead of rejecting them as openat2 does, so giving a
> previously ignored bit a meaning could break userspace programs.

If I recall correctly, Aleksa has suggested we might also want to add
O_EMPTYPATH to openat() instead of only allowing this for openat2().
I am waiting to see what Christian thinks of this.

I guess in that case it is relatively harmless to change UAPI
behavior because openat() with an empty path never works; so it
would be silly if there are userspace programs that make
this call, which always fails and does nothing, and somehow rely on
that.
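
The "never works" part is easy to confirm: on Linux, an empty pathname
passed to openat() without any empty-path flag fails with ENOENT. A
minimal sketch, with a hypothetical helper name:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Call openat() with an empty pathname and return the resulting errno,
 * or 0 if the call unexpectedly succeeds.
 */
int openat_empty_path_errno(void)
{
	int fd = openat(AT_FDCWD, "", O_RDONLY);

	if (fd >= 0) {
		close(fd);
		return 0;
	}
	return errno;
}
```

A program would have to depend on that ENOENT to be affected by giving the
call a meaning.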

> So I guess it's okay to add OPENAT2_REGULAR in the 32 bits
> range anyway? (Also lots of code paths take 32bit flags param right
> now and those would need changing to take uint64_t instead but this is
> of course not a reason to not add the new flag outside of the 32
> bits).
> 
> Regards,
> Dorjoy

Thanks,
Jori.

* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Christian Brauner @ 2026-04-16 15:15 UTC (permalink / raw)
  To: Jori Koolstra
  Cc: Dorjoy Chowdhury, linux-fsdevel, linux-kernel, linux-api,
	ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
	viro, jack, jlayton, chuck.lever, alex.aring, arnd, adilger,
	mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
	James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
	slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
	sprasad, tom, bharathsm, shuah, miklos, hansg, Aleksa Sarai
In-Reply-To: <1714293523.333222.1776351806025@kpc.webmail.kpnmail.nl>

On Thu, Apr 16, 2026 at 05:03:26PM +0200, Jori Koolstra wrote:
> 
> > On 16-04-2026 16:21 CEST, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > 
> >  
> > On Thu, Apr 16, 2026 at 7:52 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
> > >
> > > On Sat, Mar 28, 2026 at 11:22:22PM +0600, Dorjoy Chowdhury wrote:
> > > > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > > > index 50bdc8e8a271..fe488bf7c18e 100644
> > > > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > > > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > > > @@ -34,6 +34,7 @@
> > > >
> > > >  #define O_PATH               040000000
> > > >  #define __O_TMPFILE  0100000000
> > > > +#define OPENAT2_REGULAR      0200000000
> > > >
> > >
> > > I don't quite understand why we are adding OPENAT2_REGULAR inside the
> > > O_* flag range. Wasn't this supposed to be only supported for openat2()?
> > > If so, I don't see the need to waste an O_* flag bit. But maybe I am
> > > missing something.
> > >
> > 
> > Yes, OPENAT2_REGULAR is only supported for openat2. I am not sure if I
> > got a specific review to not add OPENAT2_REGULAR in the O_* flag 32
> > bit range. But as far as I understand, we can't easily add new O_*
> > flags to the old open system calls: those codepaths silently ignore
> > unknown bits instead of rejecting them as openat2 does, so giving a
> > previously ignored bit a meaning could break userspace programs.
> 
> If I recall correctly, Aleksa has suggested we might also want to add
> O_EMPTYPATH to openat() instead of only allowing this for openat2().
> I am waiting to see what Christian thinks of this.

We can do that, yes. For O_EMPTYPATH that is workable.

I don't mind too much if we leave OPENAT2_REGULAR in the 32-bit flag
space. It'll silently be ignored but the flag name should give it away.
