* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-04-14 17:33 UTC
To: linux-fsdevel, brauner
Cc: Jeff Layton, linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs,
linux-cifs, v9fs, linux-kselftest, viro, jack, chuck.lever,
alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
miklos, hansg
In-Reply-To: <CAFfO_h5FOTv-VMbh2Dmwkp04BFxQu192gsvFLohDFXAWPccRNA@mail.gmail.com>
On Mon, Apr 6, 2026 at 9:30 PM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
>
> On Mon, Apr 6, 2026 at 5:27 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Sat, 2026-04-04 at 21:17 +0600, Dorjoy Chowdhury wrote:
> > > On Thu, Apr 2, 2026 at 1:02 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > >
> > > > On Mon, 2026-03-30 at 21:07 +0600, Dorjoy Chowdhury wrote:
> > > > > On Mon, Mar 30, 2026 at 5:49 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > >
> > > > > > On Sat, 2026-03-28 at 23:22 +0600, Dorjoy Chowdhury wrote:
> > > > > > > This flag indicates the path should be opened if it's a regular file.
> > > > > > > This is useful to write secure programs that want to avoid being
> > > > > > > tricked into opening device nodes with special semantics while thinking
> > > > > > > they operate on regular files. This is a requested feature from the
> > > > > > > uapi-group[1].
> > > > > > >
> > > > > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > > > > like FreeBSD, macOS.
> > > > > > >
> > > > > > > When used in combination with O_CREAT, either the regular file is
> > > > > > > created, or if the path already exists, it is opened if it's a regular
> > > > > > > file. Otherwise, -EFTYPE is returned.
> > > > > > >
> > > > > > > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > > > > > > as it doesn't make sense to open a path that is both a directory and a
> > > > > > > regular file.
> > > > > > >
> > > > > > > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> > > > > > >
> > > > > > > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > > > > > > ---
> > > > > > > arch/alpha/include/uapi/asm/errno.h | 2 ++
> > > > > > > arch/alpha/include/uapi/asm/fcntl.h | 1 +
> > > > > > > arch/mips/include/uapi/asm/errno.h | 2 ++
> > > > > > > arch/parisc/include/uapi/asm/errno.h | 2 ++
> > > > > > > arch/parisc/include/uapi/asm/fcntl.h | 1 +
> > > > > > > arch/sparc/include/uapi/asm/errno.h | 2 ++
> > > > > > > arch/sparc/include/uapi/asm/fcntl.h | 1 +
> > > > > > > fs/ceph/file.c | 4 ++++
> > > > > > > fs/fcntl.c | 4 ++--
> > > > > > > fs/gfs2/inode.c | 6 ++++++
> > > > > > > fs/namei.c | 4 ++++
> > > > > > > fs/nfs/dir.c | 4 ++++
> > > > > > > fs/open.c | 8 +++++---
> > > > > > > fs/smb/client/dir.c | 14 +++++++++++++-
> > > > > > > include/linux/fcntl.h | 2 ++
> > > > > > > include/uapi/asm-generic/errno.h | 2 ++
> > > > > > > include/uapi/asm-generic/fcntl.h | 4 ++++
> > > > > > > tools/arch/alpha/include/uapi/asm/errno.h | 2 ++
> > > > > > > tools/arch/mips/include/uapi/asm/errno.h | 2 ++
> > > > > > > tools/arch/parisc/include/uapi/asm/errno.h | 2 ++
> > > > > > > tools/arch/sparc/include/uapi/asm/errno.h | 2 ++
> > > > > > > tools/include/uapi/asm-generic/errno.h | 2 ++
> > > > > > > 22 files changed, 67 insertions(+), 6 deletions(-)
> > > > > > >
> > > > > > > diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h
> > > > > > > index 6791f6508632..1a99f38813c7 100644
> > > > > > > --- a/arch/alpha/include/uapi/asm/errno.h
> > > > > > > +++ b/arch/alpha/include/uapi/asm/errno.h
> > > > > > > @@ -127,4 +127,6 @@
> > > > > > >
> > > > > > > #define EHWPOISON 139 /* Memory page has hardware error */
> > > > > > >
> > > > > > > +#define EFTYPE 140 /* Wrong file type for the intended operation */
> > > > > > > +
> > > > > > > #endif
> > > > > > > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > > > > > > index 50bdc8e8a271..fe488bf7c18e 100644
> > > > > > > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > > > > > > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > > > > > > @@ -34,6 +34,7 @@
> > > > > > >
> > > > > > > #define O_PATH 040000000
> > > > > > > #define __O_TMPFILE 0100000000
> > > > > > > +#define OPENAT2_REGULAR 0200000000
> > > > > > >
> > > > > > > #define F_GETLK 7
> > > > > > > #define F_SETLK 8
> > > > > > > diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h
> > > > > > > index c01ed91b1ef4..1835a50b69ce 100644
> > > > > > > --- a/arch/mips/include/uapi/asm/errno.h
> > > > > > > +++ b/arch/mips/include/uapi/asm/errno.h
> > > > > > > @@ -126,6 +126,8 @@
> > > > > > >
> > > > > > > #define EHWPOISON 168 /* Memory page has hardware error */
> > > > > > >
> > > > > > > +#define EFTYPE 169 /* Wrong file type for the intended operation */
> > > > > > > +
> > > > > > > #define EDQUOT 1133 /* Quota exceeded */
> > > > > > >
> > > > > > >
> > > > > > > diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h
> > > > > > > index 8cbc07c1903e..93194fbb0a80 100644
> > > > > > > --- a/arch/parisc/include/uapi/asm/errno.h
> > > > > > > +++ b/arch/parisc/include/uapi/asm/errno.h
> > > > > > > @@ -124,4 +124,6 @@
> > > > > > >
> > > > > > > #define EHWPOISON 257 /* Memory page has hardware error */
> > > > > > >
> > > > > > > +#define EFTYPE 258 /* Wrong file type for the intended operation */
> > > > > > > +
> > > > > > > #endif
> > > > > > > diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
> > > > > > > index 03dee816cb13..d46812f2f0f4 100644
> > > > > > > --- a/arch/parisc/include/uapi/asm/fcntl.h
> > > > > > > +++ b/arch/parisc/include/uapi/asm/fcntl.h
> > > > > > > @@ -19,6 +19,7 @@
> > > > > > >
> > > > > > > #define O_PATH 020000000
> > > > > > > #define __O_TMPFILE 040000000
> > > > > > > +#define OPENAT2_REGULAR 0100000000
> > > > > > >
> > > > > > > #define F_GETLK64 8
> > > > > > > #define F_SETLK64 9
> > > > > > > diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h
> > > > > > > index 4a41e7835fd5..71940ec9130b 100644
> > > > > > > --- a/arch/sparc/include/uapi/asm/errno.h
> > > > > > > +++ b/arch/sparc/include/uapi/asm/errno.h
> > > > > > > @@ -117,4 +117,6 @@
> > > > > > >
> > > > > > > #define EHWPOISON 135 /* Memory page has hardware error */
> > > > > > >
> > > > > > > +#define EFTYPE 136 /* Wrong file type for the intended operation */
> > > > > > > +
> > > > > > > #endif
> > > > > > > diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
> > > > > > > index 67dae75e5274..bb6e9fa94bc9 100644
> > > > > > > --- a/arch/sparc/include/uapi/asm/fcntl.h
> > > > > > > +++ b/arch/sparc/include/uapi/asm/fcntl.h
> > > > > > > @@ -37,6 +37,7 @@
> > > > > > >
> > > > > > > #define O_PATH 0x1000000
> > > > > > > #define __O_TMPFILE 0x2000000
> > > > > > > +#define OPENAT2_REGULAR 0x4000000
> > > > > > >
> > > > > > > #define F_GETOWN 5 /* for sockets. */
> > > > > > > #define F_SETOWN 6 /* for sockets. */
> > > > > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > > > > index 66bbf6d517a9..6d8d4c7765e6 100644
> > > > > > > --- a/fs/ceph/file.c
> > > > > > > +++ b/fs/ceph/file.c
> > > > > > > @@ -977,6 +977,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > > > > > ceph_init_inode_acls(newino, &as_ctx);
> > > > > > > file->f_mode |= FMODE_CREATED;
> > > > > > > }
> > > > > > > + if ((flags & OPENAT2_REGULAR) && !d_is_reg(dentry)) {
> > > > > > > + err = -EFTYPE;
> > > > > > > + goto out_req;
> > > > > > > + }
> > > > > >
> > > > > > ^^^
> > > > > > This doesn't look quite right. Here's a larger chunk of the code:
> > > > > >
> > > > > > -------------------------8<--------------------------
> > > > > > if (d_in_lookup(dentry)) {
> > > > > > dn = ceph_finish_lookup(req, dentry, err);
> > > > > > if (IS_ERR(dn))
> > > > > > err = PTR_ERR(dn);
> > > > > > } else {
> > > > > > /* we were given a hashed negative dentry */
> > > > > > dn = NULL;
> > > > > > }
> > > > > > if (err)
> > > > > > goto out_req;
> > > > > > if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
> > > > > > /* make vfs retry on splice, ENOENT, or symlink */
> > > > > > doutc(cl, "finish_no_open on dn %p\n", dn);
> > > > > > err = finish_no_open(file, dn);
> > > > > > } else {
> > > > > > if (IS_ENCRYPTED(dir) &&
> > > > > > !fscrypt_has_permitted_context(dir, d_inode(dentry))) {
> > > > > > pr_warn_client(cl,
> > > > > > "Inconsistent encryption context (parent %llx:%llx child %llx:%llx)\n",
> > > > > > ceph_vinop(dir), ceph_vinop(d_inode(dentry)));
> > > > > > goto out_req;
> > > > > > }
> > > > > >
> > > > > > doutc(cl, "finish_open on dn %p\n", dn);
> > > > > > if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
> > > > > > struct inode *newino = d_inode(dentry);
> > > > > >
> > > > > > cache_file_layout(dir, newino);
> > > > > > ceph_init_inode_acls(newino, &as_ctx);
> > > > > > file->f_mode |= FMODE_CREATED;
> > > > > > }
> > > > > > err = finish_open(file, dentry, ceph_open);
> > > > > > }
> > > > > > -------------------------8<--------------------------
> > > > > >
> > > > > > It looks like this won't handle it correctly if the pathwalk terminates
> > > > > > on a symlink (re: d_is_symlink() case). You should either set up a test
> > > > > > ceph cluster on your own, or reach out to the ceph community and ask
> > > > > > them to test this.
> > > > > >
> > > > >
> > > > > Thanks for reviewing. The d_is_symlink() case seems to be calling
> > > > > finish_no_open so shouldn't this be okay?
> > > > >
> > > >
> > > > My mistake -- you're correct. I keep forgetting that finish_no_open()
> > > > will handle this case regardless of what else happens.
> > > >
> > > > > > > err = finish_open(file, dentry, ceph_open);
> > > > > > > }
> > > > > > > out_req:
> > > > > > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > > > > > index beab8080badf..240bb511557a 100644
> > > > > > > --- a/fs/fcntl.c
> > > > > > > +++ b/fs/fcntl.c
> > > > > > > @@ -1169,9 +1169,9 @@ static int __init fcntl_init(void)
> > > > > > > * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> > > > > > > * is defined as O_NONBLOCK on some platforms and not on others.
> > > > > > > */
> > > > > > > - BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
> > > > > > > + BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> > > > > > > HWEIGHT32(
> > > > > > > - (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > > > > > > + (VALID_OPENAT2_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > > > > > > __FMODE_EXEC));
> > > > > > >
> > > > > > > fasync_cache = kmem_cache_create("fasync_cache",
> > > > > > > diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> > > > > > > index 8344040ecaf7..4604e2e8a9cc 100644
> > > > > > > --- a/fs/gfs2/inode.c
> > > > > > > +++ b/fs/gfs2/inode.c
> > > > > > > @@ -738,6 +738,12 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
> > > > > > > inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
> > > > > > > error = PTR_ERR(inode);
> > > > > > > if (!IS_ERR(inode)) {
> > > > > > > + if (file && (file->f_flags & OPENAT2_REGULAR) && !S_ISREG(inode->i_mode)) {
> > > > > >
> > > > > > Isn't OPENAT2_REGULAR getting masked off in ->f_flags now?
> > > > > >
> > > > > Yes, I thought the masking off was happening after this codepath got
> > > > > executed. Maybe it's better anyway to pass another flags param to this
> > > > > function and forward the flags from the gfs2_atomic_open function and
> > > > > in other call sites pass 0 ? What do you think?
> > > > >
> > > >
> > > > Also my mistake. That happens in do_dentry_open() which happens in
> > > > finish_open(), so you should be OK here.
> > > >
> > > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > >
> > > Thanks for patiently reviewing this! I am planning on sending patches
> > > for man-pages and looking into some xfs-tests for this. But I am not
> > > sure if this patch series will get more reviews from others or if it
> > > will be picked up in the vfs branch?
> > >
> >
> > This is a change to rather core VFS infrastructure so yes, you should
> > expect some more review. Assuming no major issues are found, then yes,
> > this should eventually get picked up by the VFS maintainers.
> >
> > Cheers,
> > --
> > Jeff Layton <jlayton@kernel.org>
>
> Ping....
> This patch series got a "Reviewed-by" from Jeff Layton but it probably
> requires more reviews from other maintainers/reviewers as well. So
> requesting for review on this patch series. Thanks!
>
Ping...
Requesting review of this patch series, please.
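For anyone skimming the thread, here is a rough userspace sketch of the behaviour the flag is meant to provide. It is only an approximation under stated assumptions: open_regular_only() is a hypothetical helper name, and -EINVAL stands in for the proposed -EFTYPE, since userspace headers do not define EFTYPE on Linux today. Unlike the in-kernel check, this fallback still opens the node before it can inspect it, so any open-time side effects of a device node have already happened. Avoiding that is the point of doing the check in the kernel.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Userspace approximation of the proposed OPENAT2_REGULAR semantics.
 * open_regular_only() is a hypothetical name, not something the patch adds.
 * Returns an fd on success or a negative errno value on failure.
 */
static int open_regular_only(const char *path, int flags)
{
	struct stat st;
	int fd, err;

	/* O_NONBLOCK keeps FIFOs from blocking; O_NOCTTY avoids grabbing a tty. */
	fd = open(path, flags | O_NONBLOCK | O_NOCTTY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	if (fstat(fd, &st) < 0) {
		err = -errno;
		close(fd);
		return err;
	}

	if (!S_ISREG(st.st_mode)) {
		/* The patch would return -EFTYPE; -EINVAL is a stand-in here. */
		close(fd);
		return -EINVAL;
	}

	return fd;
}
```

With this helper a directory or a device node is rejected after the fact, while a regular file yields a usable descriptor (with O_NONBLOCK still set on it, which should be harmless for regular files).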
Regards,
Dorjoy
* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-04-15 16:41 UTC
To: linux-kernel, linux-f2fs-devel, linux-api, linux-fsdevel; +Cc: Akilesh Kailash
In-Reply-To: <adhPZxtbZxgU-37v@google.com>
By the way, would it be worth adding some generic APIs, such as:
1) reclaim a specific inode object when the last file is closed
2) add another fadvise hint for large folios
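For reference, steps 1 to 3 of the flow quoted below would look roughly like this from userspace. This is a minimal sketch, assuming a kernel with this patch applied: the "user.fadvise" name and the F2FS_XATTR_FADV_LARGEFOLIO bit index come from the patch and are not in any released header, and f2fs_request_large_folio() is a hypothetical helper name. It uses the standard chmod(path, mode) argument order and 1u << n in place of the kernel-only BIT() macro.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Bit index as proposed in fs/f2fs/xattr.h by this patch (assumption). */
enum { F2FS_XATTR_FADV_LARGEFOLIO };

/*
 * Hypothetical helper walking steps 1-3 of the proposed flow.
 * Returns a read-only fd on success or a negative errno value on failure.
 */
static int f2fs_request_large_folio(const char *path)
{
	unsigned int value = 1u << F2FS_XATTR_FADV_LARGEFOLIO;
	int fd, err;

	/* 1. Register the inode number for large folio. */
	if (setxattr(path, "user.fadvise", &value, sizeof(value), 0) < 0)
		return -errno;

	/* 2. Make the file read-only. */
	if (chmod(path, 0400) < 0)
		return -errno;

	/* 3. Flush dirty state, then close so the cached inode can be dropped. */
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;
	if (fsync(fd) < 0) {
		err = -errno;
		close(fd);
		return err;
	}
	close(fd);

	/* Reopening should go through f2fs_iget() on a patched kernel. */
	fd = open(path, O_RDONLY | O_CLOEXEC);
	return fd < 0 ? -errno : fd;
}
```

On an unpatched kernel, or on another filesystem, the setxattr() call either stores an ordinary user xattr or fails, so the helper only has the intended effect on f2fs with this patch.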
On 04/10, Jaegeuk Kim wrote:
> enum {
> F2FS_XATTR_FADV_LARGEFOLIO,
> };
>
> unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
>
> 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> -> register the inode number for large folio
> 2. chmod(0400, file)
> -> make Read-Only
> 3. fsync() && close() && open(READ)
> -> f2fs_iget() with large folio
> 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
> -> return error
> 5. close() and open()
> -> goto #3
> 6. unlink
> -> deregister the inode number
>
> Suggested-by: Akilesh Kailash <akailash@google.com>
> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> ---
>
> Log from v1:
> - add a condition in f2fs_drop_inode
> - add Doc
>
> Documentation/filesystems/f2fs.rst | 41 ++++++++++++++++++++++++++----
> fs/f2fs/checkpoint.c | 2 +-
> fs/f2fs/data.c | 2 +-
> fs/f2fs/f2fs.h | 1 +
> fs/f2fs/file.c | 11 ++++++--
> fs/f2fs/inode.c | 19 +++++++++++---
> fs/f2fs/super.c | 7 +++++
> fs/f2fs/xattr.c | 35 ++++++++++++++++++++++++-
> fs/f2fs/xattr.h | 6 +++++
> 9 files changed, 111 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
> index 7e4031631286..de899d0d3088 100644
> --- a/Documentation/filesystems/f2fs.rst
> +++ b/Documentation/filesystems/f2fs.rst
> @@ -1044,11 +1044,14 @@ page allocation for significant performance gains. To minimize code complexity,
> this support is currently excluded from the write path, which requires handling
> complex optimizations such as compression and block allocation modes.
>
> -This optional feature is triggered only when a file's immutable bit is set.
> -Consequently, F2FS will return EOPNOTSUPP if a user attempts to open a cached
> -file with write permissions, even immediately after clearing the bit. Write
> -access is only restored once the cached inode is dropped. The usage flow is
> -demonstrated below:
> +This optional feature is triggered by two mechanisms: the file's immutable bit
> +or a specific xattr flag. In both cases, F2FS ensures data integrity by
> +restricting the file to a read-only state while large folios are active.
> +
> +1. Immutable Bit Approach:
> +Triggered when the FS_IMMUTABLE_FL is set. This is a strict enforcement
> +where the file cannot be modified at all until the bit is cleared and
> +the cached inode is dropped.
>
> .. code-block::
>
> @@ -1078,3 +1081,31 @@ demonstrated below:
> Written 4096 bytes with pattern = zero, total_time = 29 us, max_latency = 28 us
>
> # rm /data/testfile_read_seq
> +
> +2. XATTR fadvise Approach:
> +A more flexible registration via extended attributes.
> +
> +.. code-block::
> +
> + enum {
> + F2FS_XATTR_FADV_LARGEFOLIO,
> + };
> + unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> +
> + /* Registers the inode number for large folio support in the subsystem.*/
> + # setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> +
> + /* The file must be made Read-Only to transition into the large folio path. */
> + # fchmod(fd, 0400)
> +
> + /* clean up dirty inode state. */
> + # fsync(fd)
> +
> + /* Drop the inode cache. */
> + # close(fd)
> +
> + /* f2fs_iget() instantiates the inode with large folio support.*/
> + # open()
> +
> + /* Returns -EOPNOTSUPP or error to protect the large folio cache.*/
> + # open(WRITE), mkwrite on mmap, or chmod(WRITE)
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 01e1ba77263e..fdd62ddc3ed6 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -778,7 +778,7 @@ void f2fs_remove_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type)
> __remove_ino_entry(sbi, ino, type);
> }
>
> -/* mode should be APPEND_INO, UPDATE_INO or TRANS_DIR_INO */
> +/* mode should be APPEND_INO, UPDATE_INO, LARGE_FOLIO_INO, or TRANS_DIR_INO */
> bool f2fs_exist_written_data(struct f2fs_sb_info *sbi, nid_t ino, int mode)
> {
> struct inode_management *im = &sbi->im[mode];
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 965d4e6443c6..5e46230398d7 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -2494,7 +2494,7 @@ static int f2fs_read_data_large_folio(struct inode *inode,
> int ret = 0;
> bool folio_in_bio;
>
> - if (!IS_IMMUTABLE(inode) || f2fs_compressed_file(inode)) {
> + if (f2fs_compressed_file(inode)) {
> if (folio)
> folio_unlock(folio);
> return -EOPNOTSUPP;
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index e40b6b2784ee..02bc6eb96a59 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -381,6 +381,7 @@ enum {
> /* for the list of ino */
> enum {
> ORPHAN_INO, /* for orphan ino list */
> + LARGE_FOLIO_INO, /* for large folio case */
> APPEND_INO, /* for append ino list */
> UPDATE_INO, /* for update ino list */
> TRANS_DIR_INO, /* for transactions dir ino list */
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index c0220cd7b332..64ba900410fc 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -2068,9 +2068,16 @@ static long f2fs_fallocate(struct file *file, int mode,
>
> static int f2fs_release_file(struct inode *inode, struct file *filp)
> {
> - if (atomic_dec_and_test(&F2FS_I(inode)->open_count))
> + if (atomic_dec_and_test(&F2FS_I(inode)->open_count)) {
> f2fs_remove_donate_inode(inode);
> -
> + /*
> + * In order to get large folio as soon as possible, let's drop
> + * inode cache asap. See also f2fs_drop_inode.
> + */
> + if (f2fs_exist_written_data(F2FS_I_SB(inode),
> + inode->i_ino, LARGE_FOLIO_INO))
> + d_drop(filp->f_path.dentry);
> + }
> /*
> * f2fs_release_file is called at every close calls. So we should
> * not drop any inmemory pages by close called by other process.
> diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> index 89240be8cc59..e100bc5a378c 100644
> --- a/fs/f2fs/inode.c
> +++ b/fs/f2fs/inode.c
> @@ -565,6 +565,20 @@ static bool is_meta_ino(struct f2fs_sb_info *sbi, unsigned int ino)
> ino == F2FS_COMPRESS_INO(sbi);
> }
>
> +static void f2fs_mapping_set_large_folio(struct inode *inode)
> +{
> + struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> +
> + if (f2fs_compressed_file(inode))
> + return;
> + if (f2fs_quota_file(sbi, inode->i_ino))
> + return;
> + if (IS_IMMUTABLE(inode) ||
> + (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> + !(inode->i_mode & S_IWUGO)))
> + mapping_set_folio_min_order(inode->i_mapping, 0);
> +}
> +
> struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> {
> struct f2fs_sb_info *sbi = F2FS_SB(sb);
> @@ -620,9 +634,7 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> inode->i_op = &f2fs_file_inode_operations;
> inode->i_fop = &f2fs_file_operations;
> inode->i_mapping->a_ops = &f2fs_dblock_aops;
> - if (IS_IMMUTABLE(inode) && !f2fs_compressed_file(inode) &&
> - !f2fs_quota_file(sbi, inode->i_ino))
> - mapping_set_folio_min_order(inode->i_mapping, 0);
> + f2fs_mapping_set_large_folio(inode);
> } else if (S_ISDIR(inode->i_mode)) {
> inode->i_op = &f2fs_dir_inode_operations;
> inode->i_fop = &f2fs_dir_operations;
> @@ -895,6 +907,7 @@ void f2fs_evict_inode(struct inode *inode)
> f2fs_remove_ino_entry(sbi, inode->i_ino, APPEND_INO);
> f2fs_remove_ino_entry(sbi, inode->i_ino, UPDATE_INO);
> f2fs_remove_ino_entry(sbi, inode->i_ino, FLUSH_INO);
> + f2fs_remove_ino_entry(sbi, inode->i_ino, LARGE_FOLIO_INO);
>
> if (!is_sbi_flag_set(sbi, SBI_IS_FREEZING)) {
> sb_start_intwrite(inode->i_sb);
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index ccf806b676f5..11d1e0c99ac1 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -1844,6 +1844,13 @@ static int f2fs_drop_inode(struct inode *inode)
> return 1;
> }
> }
> + /*
> + * In order to get large folio as soon as possible, let's drop
> + * inode cache asap. See also f2fs_release_file.
> + */
> + if (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> + !is_inode_flag_set(inode, FI_DIRTY_INODE))
> + return 1;
>
> /*
> * This is to avoid a deadlock condition like below.
> diff --git a/fs/f2fs/xattr.c b/fs/f2fs/xattr.c
> index 941dc62a6d6f..0c0e44c2dcdd 100644
> --- a/fs/f2fs/xattr.c
> +++ b/fs/f2fs/xattr.c
> @@ -44,6 +44,16 @@ static void xattr_free(struct f2fs_sb_info *sbi, void *xattr_addr,
> kfree(xattr_addr);
> }
>
> +static int f2fs_xattr_fadvise_get(struct inode *inode, void *buffer)
> +{
> + if (!buffer)
> + goto out;
> + if (mapping_large_folio_support(inode->i_mapping))
> + *((unsigned int *)buffer) |= BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> +out:
> + return sizeof(unsigned int);
> +}
> +
> static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
> struct dentry *unused, struct inode *inode,
> const char *name, void *buffer, size_t size)
> @@ -61,10 +71,29 @@ static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
> default:
> return -EINVAL;
> }
> + if (handler->flags == F2FS_XATTR_INDEX_USER &&
> + !strcmp(name, "fadvise"))
> + return f2fs_xattr_fadvise_get(inode, buffer);
> +
> return f2fs_getxattr(inode, handler->flags, name,
> buffer, size, NULL);
> }
>
> +static int f2fs_xattr_fadvise_set(struct inode *inode, const void *value)
> +{
> + unsigned int new_fadvise;
> +
> + new_fadvise = *(unsigned int *)value;
> +
> + if (new_fadvise & BIT(F2FS_XATTR_FADV_LARGEFOLIO))
> + f2fs_add_ino_entry(F2FS_I_SB(inode),
> + inode->i_ino, LARGE_FOLIO_INO);
> + else
> + f2fs_remove_ino_entry(F2FS_I_SB(inode),
> + inode->i_ino, LARGE_FOLIO_INO);
> + return 0;
> +}
> +
> static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
> struct mnt_idmap *idmap,
> struct dentry *unused, struct inode *inode,
> @@ -84,6 +113,10 @@ static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
> default:
> return -EINVAL;
> }
> + if (handler->flags == F2FS_XATTR_INDEX_USER &&
> + !strcmp(name, "fadvise"))
> + return f2fs_xattr_fadvise_set(inode, value);
> +
> return f2fs_setxattr(inode, handler->flags, name,
> value, size, NULL, flags);
> }
> @@ -842,4 +875,4 @@ int __init f2fs_init_xattr_cache(void)
> void f2fs_destroy_xattr_cache(void)
> {
> kmem_cache_destroy(inline_xattr_slab);
> -}
> \ No newline at end of file
> +}
> diff --git a/fs/f2fs/xattr.h b/fs/f2fs/xattr.h
> index bce3d93e4755..455f460d014e 100644
> --- a/fs/f2fs/xattr.h
> +++ b/fs/f2fs/xattr.h
> @@ -24,6 +24,7 @@
> #define F2FS_XATTR_REFCOUNT_MAX 1024
>
> /* Name indexes */
> +#define F2FS_USER_FADVISE_NAME "user.fadvise"
> #define F2FS_SYSTEM_ADVISE_NAME "system.advise"
> #define F2FS_XATTR_INDEX_USER 1
> #define F2FS_XATTR_INDEX_POSIX_ACL_ACCESS 2
> @@ -39,6 +40,11 @@
> #define F2FS_XATTR_NAME_ENCRYPTION_CONTEXT "c"
> #define F2FS_XATTR_NAME_VERITY "v"
>
> +/* used for F2FS_USER_FADVISE_NAME */
> +enum {
> + F2FS_XATTR_FADV_LARGEFOLIO,
> +};
> +
> struct f2fs_xattr_header {
> __le32 h_magic; /* magic number for identification */
> __le32 h_refcount; /* reference count */
> --
> 2.53.0.1213.gd9a14994de-goog
>
* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-04-15 16:44 UTC
To: Christoph Hellwig
Cc: linux-kernel, linux-f2fs-devel, Akilesh Kailash, linux-fsdevel,
linux-mm, linux-api
In-Reply-To: <ad30g9xMs9wNJhFb@infradead.org>
On 04/14, Christoph Hellwig wrote:
> Please add the relevant mailing lists when adding new user interfaces.
>
> And I'm not sure hacks working around the proper large folio
> implementation are something that should be merged upstream.
I've Cc'ed linux-api and linux-fsdevel on the patch thread, along with a
proposal that I'm not sure is acceptable.
>
> On Fri, Apr 10, 2026 at 01:16:23AM +0000, Jaegeuk Kim wrote:
> > enum {
> > F2FS_XATTR_FADV_LARGEFOLIO,
> > };
> >
> > unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> >
> > 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> > -> register the inode number for large folio
> > 2. chmod(0400, file)
> > -> make Read-Only
> > 3. fsync() && close() && open(READ)
> > -> f2fs_iget() with large folio
> > 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
> > -> return error
> > 5. close() and open()
> > -> goto #3
> > 6. unlink
> > -> deregister the inode number
> >
> > Suggested-by: Akilesh Kailash <akailash@google.com>
> > Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> > ---
> >
> > Log from v1:
> > - add a condition in f2fs_drop_inode
> > - add Doc
> >
> > Documentation/filesystems/f2fs.rst | 41 ++++++++++++++++++++++++++----
> > fs/f2fs/checkpoint.c | 2 +-
> > fs/f2fs/data.c | 2 +-
> > fs/f2fs/f2fs.h | 1 +
> > fs/f2fs/file.c | 11 ++++++--
> > fs/f2fs/inode.c | 19 +++++++++++---
> > fs/f2fs/super.c | 7 +++++
> > fs/f2fs/xattr.c | 35 ++++++++++++++++++++++++-
> > fs/f2fs/xattr.h | 6 +++++
> > 9 files changed, 111 insertions(+), 13 deletions(-)
> >
> > diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
> > index 7e4031631286..de899d0d3088 100644
> > --- a/Documentation/filesystems/f2fs.rst
> > +++ b/Documentation/filesystems/f2fs.rst
> > @@ -1044,11 +1044,14 @@ page allocation for significant performance gains. To minimize code complexity,
> > this support is currently excluded from the write path, which requires handling
> > complex optimizations such as compression and block allocation modes.
> >
> > -This optional feature is triggered only when a file's immutable bit is set.
> > -Consequently, F2FS will return EOPNOTSUPP if a user attempts to open a cached
> > -file with write permissions, even immediately after clearing the bit. Write
> > -access is only restored once the cached inode is dropped. The usage flow is
> > -demonstrated below:
> > +This optional feature is triggered by two mechanisms: the file's immutable bit
> > +or a specific xattr flag. In both cases, F2FS ensures data integrity by
> > +restricting the file to a read-only state while large folios are active.
> > +
> > +1. Immutable Bit Approach:
> > +Triggered when the FS_IMMUTABLE_FL is set. This is a strict enforcement
> > +where the file cannot be modified at all until the bit is cleared and
> > +the cached inode is dropped.
> >
> > .. code-block::
> >
> > @@ -1078,3 +1081,31 @@ demonstrated below:
> > Written 4096 bytes with pattern = zero, total_time = 29 us, max_latency = 28 us
> >
> > # rm /data/testfile_read_seq
> > +
> > +2. XATTR fadvise Approach:
> > +A more flexible registration via extended attributes.
> > +
> > +.. code-block::
> > +
> > + enum {
> > + F2FS_XATTR_FADV_LARGEFOLIO,
> > + };
> > + unsigned int value = BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> > +
> > + /* Registers the inode number for large folio support in the subsystem.*/
> > + # setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> > +
> > + /* The file must be made Read-Only to transition into the large folio path. */
> > + # fchmod(fd, 0400)
> > +
> > + /* clean up dirty inode state. */
> > + # fsync(fd)
> > +
> > + /* Drop the inode cache. */
> > + # close(fd)
> > +
> > + /* f2fs_iget() instantiates the inode with large folio support.*/
> > + # open()
> > +
> > + /* Returns -EOPNOTSUPP or error to protect the large folio cache.*/
> > + # open(WRITE), mkwrite on mmap, or chmod(WRITE)
> > diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> > index 01e1ba77263e..fdd62ddc3ed6 100644
> > --- a/fs/f2fs/checkpoint.c
> > +++ b/fs/f2fs/checkpoint.c
> > @@ -778,7 +778,7 @@ void f2fs_remove_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type)
> > __remove_ino_entry(sbi, ino, type);
> > }
> >
> > -/* mode should be APPEND_INO, UPDATE_INO or TRANS_DIR_INO */
> > +/* mode should be APPEND_INO, UPDATE_INO, LARGE_FOLIO_INO, or TRANS_DIR_INO */
> > bool f2fs_exist_written_data(struct f2fs_sb_info *sbi, nid_t ino, int mode)
> > {
> > struct inode_management *im = &sbi->im[mode];
> > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> > index 965d4e6443c6..5e46230398d7 100644
> > --- a/fs/f2fs/data.c
> > +++ b/fs/f2fs/data.c
> > @@ -2494,7 +2494,7 @@ static int f2fs_read_data_large_folio(struct inode *inode,
> > int ret = 0;
> > bool folio_in_bio;
> >
> > - if (!IS_IMMUTABLE(inode) || f2fs_compressed_file(inode)) {
> > + if (f2fs_compressed_file(inode)) {
> > if (folio)
> > folio_unlock(folio);
> > return -EOPNOTSUPP;
> > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > index e40b6b2784ee..02bc6eb96a59 100644
> > --- a/fs/f2fs/f2fs.h
> > +++ b/fs/f2fs/f2fs.h
> > @@ -381,6 +381,7 @@ enum {
> > /* for the list of ino */
> > enum {
> > ORPHAN_INO, /* for orphan ino list */
> > + LARGE_FOLIO_INO, /* for large folio case */
> > APPEND_INO, /* for append ino list */
> > UPDATE_INO, /* for update ino list */
> > TRANS_DIR_INO, /* for transactions dir ino list */
> > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > index c0220cd7b332..64ba900410fc 100644
> > --- a/fs/f2fs/file.c
> > +++ b/fs/f2fs/file.c
> > @@ -2068,9 +2068,16 @@ static long f2fs_fallocate(struct file *file, int mode,
> >
> > static int f2fs_release_file(struct inode *inode, struct file *filp)
> > {
> > - if (atomic_dec_and_test(&F2FS_I(inode)->open_count))
> > + if (atomic_dec_and_test(&F2FS_I(inode)->open_count)) {
> > f2fs_remove_donate_inode(inode);
> > -
> > + /*
> > + * In order to get large folio as soon as possible, let's drop
> > + * inode cache asap. See also f2fs_drop_inode.
> > + */
> > + if (f2fs_exist_written_data(F2FS_I_SB(inode),
> > + inode->i_ino, LARGE_FOLIO_INO))
> > + d_drop(filp->f_path.dentry);
> > + }
> > /*
> > * f2fs_release_file is called at every close calls. So we should
> > * not drop any inmemory pages by close called by other process.
> > diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> > index 89240be8cc59..e100bc5a378c 100644
> > --- a/fs/f2fs/inode.c
> > +++ b/fs/f2fs/inode.c
> > @@ -565,6 +565,20 @@ static bool is_meta_ino(struct f2fs_sb_info *sbi, unsigned int ino)
> > ino == F2FS_COMPRESS_INO(sbi);
> > }
> >
> > +static void f2fs_mapping_set_large_folio(struct inode *inode)
> > +{
> > + struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> > +
> > + if (f2fs_compressed_file(inode))
> > + return;
> > + if (f2fs_quota_file(sbi, inode->i_ino))
> > + return;
> > + if (IS_IMMUTABLE(inode) ||
> > + (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> > + !(inode->i_mode & S_IWUGO)))
> > + mapping_set_folio_min_order(inode->i_mapping, 0);
> > +}
> > +
> > struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> > {
> > struct f2fs_sb_info *sbi = F2FS_SB(sb);
> > @@ -620,9 +634,7 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> > inode->i_op = &f2fs_file_inode_operations;
> > inode->i_fop = &f2fs_file_operations;
> > inode->i_mapping->a_ops = &f2fs_dblock_aops;
> > - if (IS_IMMUTABLE(inode) && !f2fs_compressed_file(inode) &&
> > - !f2fs_quota_file(sbi, inode->i_ino))
> > - mapping_set_folio_min_order(inode->i_mapping, 0);
> > + f2fs_mapping_set_large_folio(inode);
> > } else if (S_ISDIR(inode->i_mode)) {
> > inode->i_op = &f2fs_dir_inode_operations;
> > inode->i_fop = &f2fs_dir_operations;
> > @@ -895,6 +907,7 @@ void f2fs_evict_inode(struct inode *inode)
> > f2fs_remove_ino_entry(sbi, inode->i_ino, APPEND_INO);
> > f2fs_remove_ino_entry(sbi, inode->i_ino, UPDATE_INO);
> > f2fs_remove_ino_entry(sbi, inode->i_ino, FLUSH_INO);
> > + f2fs_remove_ino_entry(sbi, inode->i_ino, LARGE_FOLIO_INO);
> >
> > if (!is_sbi_flag_set(sbi, SBI_IS_FREEZING)) {
> > sb_start_intwrite(inode->i_sb);
> > diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> > index ccf806b676f5..11d1e0c99ac1 100644
> > --- a/fs/f2fs/super.c
> > +++ b/fs/f2fs/super.c
> > @@ -1844,6 +1844,13 @@ static int f2fs_drop_inode(struct inode *inode)
> > return 1;
> > }
> > }
> > + /*
> > + * In order to get large folio as soon as possible, let's drop
> > + * inode cache asap. See also f2fs_release_file.
> > + */
> > + if (f2fs_exist_written_data(sbi, inode->i_ino, LARGE_FOLIO_INO) &&
> > + !is_inode_flag_set(inode, FI_DIRTY_INODE))
> > + return 1;
> >
> > /*
> > * This is to avoid a deadlock condition like below.
> > diff --git a/fs/f2fs/xattr.c b/fs/f2fs/xattr.c
> > index 941dc62a6d6f..0c0e44c2dcdd 100644
> > --- a/fs/f2fs/xattr.c
> > +++ b/fs/f2fs/xattr.c
> > @@ -44,6 +44,16 @@ static void xattr_free(struct f2fs_sb_info *sbi, void *xattr_addr,
> > kfree(xattr_addr);
> > }
> >
> > +static int f2fs_xattr_fadvise_get(struct inode *inode, void *buffer)
> > +{
> > + if (!buffer)
> > + goto out;
> > + if (mapping_large_folio_support(inode->i_mapping))
> > + *((unsigned int *)buffer) |= BIT(F2FS_XATTR_FADV_LARGEFOLIO);
> > +out:
> > + return sizeof(unsigned int);
> > +}
> > +
> > static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
> > struct dentry *unused, struct inode *inode,
> > const char *name, void *buffer, size_t size)
> > @@ -61,10 +71,29 @@ static int f2fs_xattr_generic_get(const struct xattr_handler *handler,
> > default:
> > return -EINVAL;
> > }
> > + if (handler->flags == F2FS_XATTR_INDEX_USER &&
> > + !strcmp(name, "fadvise"))
> > + return f2fs_xattr_fadvise_get(inode, buffer);
> > +
> > return f2fs_getxattr(inode, handler->flags, name,
> > buffer, size, NULL);
> > }
> >
> > +static int f2fs_xattr_fadvise_set(struct inode *inode, const void *value)
> > +{
> > + unsigned int new_fadvise;
> > +
> > + new_fadvise = *(unsigned int *)value;
> > +
> > + if (new_fadvise & BIT(F2FS_XATTR_FADV_LARGEFOLIO))
> > + f2fs_add_ino_entry(F2FS_I_SB(inode),
> > + inode->i_ino, LARGE_FOLIO_INO);
> > + else
> > + f2fs_remove_ino_entry(F2FS_I_SB(inode),
> > + inode->i_ino, LARGE_FOLIO_INO);
> > + return 0;
> > +}
> > +
> > static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
> > struct mnt_idmap *idmap,
> > struct dentry *unused, struct inode *inode,
> > @@ -84,6 +113,10 @@ static int f2fs_xattr_generic_set(const struct xattr_handler *handler,
> > default:
> > return -EINVAL;
> > }
> > + if (handler->flags == F2FS_XATTR_INDEX_USER &&
> > + !strcmp(name, "fadvise"))
> > + return f2fs_xattr_fadvise_set(inode, value);
> > +
> > return f2fs_setxattr(inode, handler->flags, name,
> > value, size, NULL, flags);
> > }
> > @@ -842,4 +875,4 @@ int __init f2fs_init_xattr_cache(void)
> > void f2fs_destroy_xattr_cache(void)
> > {
> > kmem_cache_destroy(inline_xattr_slab);
> > -}
> > \ No newline at end of file
> > +}
> > diff --git a/fs/f2fs/xattr.h b/fs/f2fs/xattr.h
> > index bce3d93e4755..455f460d014e 100644
> > --- a/fs/f2fs/xattr.h
> > +++ b/fs/f2fs/xattr.h
> > @@ -24,6 +24,7 @@
> > #define F2FS_XATTR_REFCOUNT_MAX 1024
> >
> > /* Name indexes */
> > +#define F2FS_USER_FADVISE_NAME "user.fadvise"
> > #define F2FS_SYSTEM_ADVISE_NAME "system.advise"
> > #define F2FS_XATTR_INDEX_USER 1
> > #define F2FS_XATTR_INDEX_POSIX_ACL_ACCESS 2
> > @@ -39,6 +40,11 @@
> > #define F2FS_XATTR_NAME_ENCRYPTION_CONTEXT "c"
> > #define F2FS_XATTR_NAME_VERITY "v"
> >
> > +/* used for F2FS_USER_FADVISE_NAME */
> > +enum {
> > + F2FS_XATTR_FADV_LARGEFOLIO,
> > +};
> > +
> > struct f2fs_xattr_header {
> > __le32 h_magic; /* magic number for identification */
> > __le32 h_refcount; /* reference count */
> > --
> > 2.53.0.1213.gd9a14994de-goog
> >
> >
> ---end quoted text---
* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Matthew Wilcox @ 2026-04-15 17:15 UTC (permalink / raw)
To: Jaegeuk Kim
Cc: Christoph Hellwig, linux-kernel, linux-f2fs-devel,
Akilesh Kailash, linux-fsdevel, linux-mm, linux-api
In-Reply-To: <ad_AVHe7RMnGrGTb@google.com>
On Wed, Apr 15, 2026 at 04:44:04PM +0000, Jaegeuk Kim wrote:
> On 04/14, Christoph Hellwig wrote:
> > Please add the relevant mailing lists when adding new user interfaces.
> >
> > And I'm not sure hacks working around the proper large folio
> > implementation are something that should be merged upstream.
>
> Cc'ed linux-api and linux-fsdevel onto the patch thread with a proposal that
> I'm not sure it's acceptable or not.
You haven't sent a proposal. This is a reply to a reply to a reply of a
patch. There's no justification for why f2fs is so special that it
needs this. What the hell is going on? You know this is not the way to
get code merged into Linux.
* Re: [PATCH 1/4] exec: inherit HWCAPs from the parent process
From: Andrei Vagin @ 2026-04-15 19:27 UTC (permalink / raw)
To: Mark Rutland
Cc: Andrei Vagin, Will Deacon, Kees Cook, Andrew Morton,
Marek Szyprowski, Cyrill Gorcunov, Mike Rapoport,
Alexander Mikhalitsyn, linux-kernel, linux-fsdevel, linux-mm,
criu, Catalin Marinas, linux-arm-kernel, Chen Ridong,
Christian Brauner, David Hildenbrand, Eric Biederman,
Lorenzo Stoakes, Michal Koutny, Alexander Mikhalitsyn, Linux API
In-Reply-To: <adUhbk0sKT0ucWhJ@J2N7QTR9R3>
Hi Mark,
Thanks for the feedback and sorry for the delay, was on vacation.
Please see my comments inline.
On Tue, Apr 7, 2026 at 8:29 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Fri, Mar 27, 2026 at 05:21:26PM -0700, Andrei Vagin wrote:
> > Hi Mark,
> >
> > I understand all these points and they are valid. However, as I
> > mentioned, we are not trying to introduce a mechanism that will strictly
> > enforce feature sets for every container. While we would like to have
> > that functionality, as you and Will mentioned, it would require
> > substantially more complexity to address, and maintainers would be
> > unlikely to pick up that complexity.
>
> The crux of my complaint here is that unless you do that (to some
> degree), this is not going to work reliably, even with the constraints
> you outline.
>
> Further, I disagree with your proposed solution of pushing more
> constraints onto userspace (to also consider HWCAPs as overriding other
> mechanisms, etc).
>
> I think that as-is, the approach is flawed.
I would really appreciate it if we could move this conversation toward
how we can make it work.
>
> > Even masking ID registers on a per-container basis would introduce
> > extra complexity that could make architecture maintainers unhappy.
> > There were a few attempts to introduce container CPUID masking on
> > x86_64 in the past.
>
> > In CRIU, we are not aiming to handle every possible workload. Our goal
> > is to target workloads where developers are ready to cooperate and
> > willing to make adjustments to be C/R compatible. The goal here is to
> > provide developers with clear instructions on what they can do to ensure
> > their applications are C/R compatible. When I say "workloads", I mean
> > this in a broad sense. A container might pack a set of tools with
> > different runtimes (Go, Java, libc-based). All these runtimes should
> > detect only allowed features.
>
> I do not think that arbitrary applications (and libraries!) should have
> to pick up additional constraints that are unnecessary without CRIU,
> especially where that goes against deliberate design decisions (e.g.
> features in arm64's HINT instruction space, which are designed to be
> usable in fast paths WITHOUT needing explicit checks of things like
> HWCAPs). Note that those typically *do* have kernel controls.
>
> I think there's a much larger problem space than you anticipate, and
> adding an incomplete solution now is just going to introduce a
> maintenance burden.
I am not adding arbitrary constraints for standard non-CRIU use cases.
Previously, I suggested that standard libraries would need to call prctl
to determine if hwcaps should be used for feature detection. However,
we can avoid this extra syscall by adding the new HWCAP2_CR bit. Then
libraries will simply check this bit in auxv[AT_HWCAP2], meaning the
overhead for "non-criu" cases is just a single bit check.
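A minimal sketch of what that single bit check could look like in a runtime's feature-detection code. HWCAP2_CR is only proposed in this thread and has no assigned value, so the placeholder bit below is an assumption, as are the helper names:

```c
#include <stdbool.h>
#include <sys/auxv.h>

/*
 * HYPOTHETICAL: HWCAP2_CR has no assigned value. The top bit of the
 * word is used here purely as a placeholder for illustration.
 */
#define HWCAP2_CR_PLACEHOLDER (1UL << (sizeof(unsigned long) * 8 - 1))

/* True if the (hypothetical) bit says hwcaps are the source of truth. */
bool hwcaps_are_authoritative(unsigned long hwcap2)
{
	return (hwcap2 & HWCAP2_CR_PLACEHOLDER) != 0;
}

/*
 * hwcap_says: the feature bit is set in AT_HWCAP/AT_HWCAP2.
 * probe_says: the feature was found by other means (ID registers, ...).
 * With the bit set, only hwcaps count; otherwise behaviour is unchanged.
 */
bool feature_allowed(unsigned long hwcap2, bool hwcap_says, bool probe_says)
{
	if (hwcaps_are_authoritative(hwcap2))
		return hwcap_says;
	return hwcap_says || probe_says;
}

/* Same decision, using the auxiliary vector of the running process. */
bool feature_allowed_here(bool hwcap_says, bool probe_says)
{
	return feature_allowed(getauxval(AT_HWCAP2), hwcap_says, probe_says);
}
```

The auxv word is passed in as a parameter so the decision can be exercised without a kernel that sets the bit.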
As for HINT instructions, there are two classes of instructions.
The first class doesn't change process state, so these instructions
don't require any special handling in terms of checkpoint/restore. If a
process is checkpointed on a newer CPU and restored on an older CPU, the
older hardware will simply skip over those instructions. The
architectural state (registers, memory) should remain consistent.
The second class, such as PAC, consists of instructions that actually
change process state. These instructions require kernel/userspace
coordination. For example, usage of PAC keys can be controlled from
userspace via prctl. What I mean is that when support for new
instructions is implemented in the kernel, we will need to make sure
that userspace is able to control them.
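As a rough illustration of the existing prctl control for PAC keys, the sketch below queries whether the APIA key is enabled for the calling thread. The fallback constant values are for older headers and the helper name is made up for this example; the call is only meaningful on arm64 with pointer authentication and fails elsewhere:

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values as in linux/prctl.h. */
#ifndef PR_PAC_GET_ENABLED_KEYS
#define PR_PAC_GET_ENABLED_KEYS 61
#endif
#ifndef PR_PAC_APIAKEY
#define PR_PAC_APIAKEY (1UL << 0)
#endif

/*
 * Returns 1 if the APIA key is enabled, 0 if it is disabled, and -1 if
 * the state cannot be queried (not arm64, no PAC, or an older kernel).
 */
int pac_apia_key_enabled(void)
{
	int keys = prctl(PR_PAC_GET_ENABLED_KEYS, 0, 0, 0, 0);

	if (keys < 0)
		return -1;
	return (keys & PR_PAC_APIAKEY) ? 1 : 0;
}
```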
>
> > Returning to the subject of this patchset: this series extends the role
> > of hwcaps. With this change, we would establish that hwcaps is the
> > "source of truth" for which features an application can safely use. Any
> > other features available on the current CPU would not be guaranteed to
> > remain available after migration to another machine.
> >
> > After this discussion, I found that the current version missed one major
> > thing: there should be a signal indicating that hwcaps must be used for
> > feature detection. Since we will need to integrate this interface into
> > libc, Go, and other runtimes, they definitely should not rely just on
> > hwcaps by default, especially in the early stages. This can be solved
> > via the prctl command. Libraries like libc would call
> > prctl(PR_USER_HWCAP_ENABLED). If this returns true, the runtime knows
> > that only the features explicitly listed in hwcaps should be used.
>
> I do not think we should be pushing that shape of constraint onto
> userspace.
Please see my previous comment above.
>
> > You are right, the controlled feature set will be limited to features
> > the kernel knows about. And yes, we would need to report CPU features in
> > hwcaps even if the kernel isn't directly involved in handling them.
>
> To be clear, that is not what I am arguing.
>
> As I mentioned before, the way this works on arm64 is that the kernel
> only exposes what it is aware of, even in the ID regs accessible to
> userspace. We usually *can* hide features, and do that for cases of
> mismatched big.LITTLE, virtual machines, etc.
I understand that. My point was that the kernel would need to report
features in hwcaps even if they don't require specific kernel-side
handling.
>
> > Honestly, I am not certain if this is the "right" interface for that,
> > and I would be happy to consider other ideas. I understand that these
> > hwcaps will not work right out of the box, but we need a way to solve
> > this problem. Having a centralized API for CPU/kernel feature detection
> > seems like the right direction.
>
> I think that for better or worse the approach you are taking here simply
> does not solve enough of the problem to actually be worthwhile.
This approach mimics solutions that some CRIU users are already
implementing in userspace, but those only work when the user controls/
recompiles all their libraries. I am open to other ideas, but we need a
path forward.
>
> > As for signal frame size and extended states like SVE/SME, we are aware
> > of this problem. However, it is partly mitigated by the fact that if
> > an application does not use some features, those states are not placed
> > in the signal frame.
>
> That is not true. The kernel can and will create signal frames for
> architectural state that a task might never have touched.
>
> Generally arm64 creates signal frames for features when the feature
> *exists*, regardless of whether the task has actively manipulated the
> relevant state. For example, on systems with SVE a trivial SVE signal
> frame gets created even if a task only uses the FPSIMD registers, and on
> systems with SME a TPIDR2 signal frame gets created even if the task has
> never read/written TPIDR2.
>
> When restoring, an unrecognised signal frame is treated as invalid, and
> we can require that certain signal frames are present.
You are right; that was my mistake. My only explanation for why we don't
see this failure often is that C/R is rarely triggered while a process
is actually inside a signal handler. This is definitely a problem that
still needs to be solved.
>
> > In the future, when we construct/reload a signal frame, we could look
> > at a process feature set for a process and generate a frame according
> > to those features...
>
> When you say 'we' here, are you talking about within the kernel, or
> within the userspace C/R mechanism?
... within the kernel.
Thanks,
Andrei
* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-04-15 22:02 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, linux-kernel, linux-f2fs-devel,
Akilesh Kailash, linux-fsdevel, linux-mm, linux-api
In-Reply-To: <ad_HwhzlNPUEKQi6@casper.infradead.org>
On 04/15, Matthew Wilcox wrote:
> On Wed, Apr 15, 2026 at 04:44:04PM +0000, Jaegeuk Kim wrote:
> > On 04/14, Christoph Hellwig wrote:
> > > Please add the relevant mailing lists when adding new user interfaces.
> > >
> > > And I'm not sure hacks working around the proper large folio
> > > implementation are something that should be merged upstream.
> >
> > Cc'ed linux-api and linux-fsdevel onto the patch thread with a proposal that
> > I'm not sure it's acceptable or not.
>
> You haven't sent a proposal. This is a reply to a reply to a reply of a
> patch. There's no justification for why f2fs is so special that it
> needs this. What the hell is going on? You know this is not the way to
> get code merged into Linux.
I added two ideas in that email. Have you even tried to understand?
* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Darrick J. Wong @ 2026-04-15 23:49 UTC (permalink / raw)
To: Jaegeuk Kim
Cc: Matthew Wilcox, Christoph Hellwig, linux-kernel, linux-f2fs-devel,
Akilesh Kailash, linux-fsdevel, linux-mm, linux-api
In-Reply-To: <aeAK8mFxzgMOepmZ@google.com>
On Wed, Apr 15, 2026 at 10:02:26PM +0000, Jaegeuk Kim wrote:
> On 04/15, Matthew Wilcox wrote:
> > On Wed, Apr 15, 2026 at 04:44:04PM +0000, Jaegeuk Kim wrote:
> > > On 04/14, Christoph Hellwig wrote:
> > > > Please add the relevant mailing lists when adding new user interfaces.
> > > >
> > > > And I'm not sure hacks working around the proper large folio
> > > > implementation are something that should be merged upstream.
> > >
> > > Cc'ed linux-api and linux-fsdevel onto the patch thread with a proposal that
> > > I'm not sure it's acceptable or not.
> >
> > You haven't sent a proposal. This is a reply to a reply to a reply of a
> > patch. There's no justification for why f2fs is so special that it
> > needs this. What the hell is going on? You know this is not the way to
> > get code merged into Linux.
>
> I added two ideas in that email. Have you even tried to understand?
You want to establish "user.fadvise" as an extended attribute containing
a bitmask. The sole bit defined in that attribute means "use large
folios", but you also have to change the file mode and set the IMMUTABLE
bit for it to actually do anything.
Meanwhile, you can't actually persist any of the fadvise(2) advice
flags, so the xattr name doesn't even make sense. Maybe you meant to
call it "user.madvise" since the closest thing I can think of is
MADV_HUGEPAGE?
I've understood enough. YUCK.
--D
* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-04-16 1:19 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox, Christoph Hellwig, linux-kernel, linux-f2fs-devel,
Akilesh Kailash, linux-fsdevel, linux-mm, linux-api
In-Reply-To: <20260415234950.GC114184@frogsfrogsfrogs>
On 04/15, Darrick J. Wong wrote:
> On Wed, Apr 15, 2026 at 10:02:26PM +0000, Jaegeuk Kim wrote:
> > On 04/15, Matthew Wilcox wrote:
> > > On Wed, Apr 15, 2026 at 04:44:04PM +0000, Jaegeuk Kim wrote:
> > > > On 04/14, Christoph Hellwig wrote:
> > > > > Please add the relevant mailing lists when adding new user interfaces.
> > > > >
> > > > > And I'm not sure hacks working around the proper large folio
> > > > > implementation are something that should be merged upstream.
> > > >
> > > > Cc'ed linux-api and linux-fsdevel onto the patch thread with a proposal that
> > > > I'm not sure it's acceptable or not.
> > >
> > > You haven't sent a proposal. This is a reply to a reply to a reply of a
> > > patch. There's no justification for why f2fs is so special that it
> > > needs this. What the hell is going on? You know this is not the way to
> > > get code merged into Linux.
> >
> > I added two ideas in that email. Have you even tried to understand?
>
> You want to establish "user.fadvise" as an extended attribute containing
> a bitmask. The sole bit defined in that attribute means "use large
> folios", but you also have to change the file mode and set the IMMUTABLE
> bit for it to actually do anything.
Partly yes. This path has nothing to do with the IMMUTABLE bit. I used to
activate large folios with that bit, but hit a big pain point: the bit has
to be cleared every time the file is simply deleted.
So, this gives a new way to activate large folios using only chmod(0400) and
setxattr("user.fadvise"), while providing quick inode eviction so that the
mapping gets set by iget, and allowing the file to be deleted easily.
I feel the arguable points would be 1) the path to evict the inode by calling
d_drop in release_file and returning 1 in drop_inode, and 2) how the hint
should be given, either fadvise(FADV_LARGE_FOLIO) or setxattr(user.fadvise),
by each individual file system.
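For reference, a minimal sketch of the userspace side of this proposal, assuming the interface exactly as quoted in the patch (a 32-bit bitmask in "user.fadvise" with F2FS_XATTR_FADV_LARGEFOLIO as bit 0). This interface is not merged, and the helper names are made up for illustration:

```c
#include <sys/stat.h>
#include <sys/xattr.h>

/* Mirrors the enum in the quoted patch: F2FS_XATTR_FADV_LARGEFOLIO == 0. */
#define FADV_LARGEFOLIO_BIT 0

/* Build the bitmask stored in the proposed "user.fadvise" xattr. */
unsigned int fadvise_mask(int want_large_folio)
{
	return want_large_folio ? (1u << FADV_LARGEFOLIO_BIT) : 0u;
}

/*
 * Request large folios for a file under the proposed scheme.
 * The xattr is set before the chmod because writing user.* xattrs
 * needs write permission on the file, which mode 0400 removes for
 * an unprivileged owner. Returns 0 on success, -1 with errno set.
 */
int request_large_folio(const char *path)
{
	unsigned int mask = fadvise_mask(1);

	if (setxattr(path, "user.fadvise", &mask, sizeof(mask), 0) != 0)
		return -1;
	return chmod(path, 0400);
}
```

Going by the quoted patch, the mapping is only configured in f2fs_iget(), so the hint would take effect the next time the inode is read in, which is what the d_drop and drop_inode changes are meant to speed up.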
>
> Meanwhile, you can't actually persist any of the fadvise(2) advice
> flags, so the xattr name doesn't even make sense. Maybe you meant to
> call it "user.madvise" since the closest thing I can think of is
> MADV_HUGEPAGE?
>
> I've understood enough. YUCK.
Thank you for taking the time to take a look.
>
> --D
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Aleksa Sarai @ 2026-04-16 11:41 UTC (permalink / raw)
To: Dorjoy Chowdhury
Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
jlayton, chuck.lever, alex.aring, arnd, adilger, mjguzik,
smfrench, richard.henderson, mattst88, linmag7, tsbogend,
James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <20260328172314.45807-2-dorjoychy111@gmail.com>
On 2026-03-28, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> This flag indicates the path should be opened if it's a regular file.
> This is useful to write secure programs that want to avoid being
> tricked into opening device nodes with special semantics while thinking
> they operate on regular files. This is a requested feature from the
> uapi-group[1].
>
> A corresponding error code EFTYPE has been introduced. For example, if
> openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> like FreeBSD, macOS.
>
> When used in combination with O_CREAT, either the regular file is
> created, or if the path already exists, it is opened if it's a regular
> file. Otherwise, -EFTYPE is returned.
>
> When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> as it doesn't make sense to open a path that is both a directory and a
> regular file.
>
> [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
>
> Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> ---
Aside from the nit below, feel free to take a
Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
> diff --git a/fs/open.c b/fs/open.c
> index 681d405bc61e..a6f445f72181 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
> if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
> f->f_mode |= FMODE_CAN_ODIRECT;
>
> - f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> + f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);
It's not clear to me why you dropped this, I didn't see a review
mentioning it either. (General note: Ideally the cover letter changelog
would mention who suggested a change in brackets after the changelog
line so it's easier to track where a change might've come from.)
I would personally keep it since O_DIRECTORY is not dropped (I do find
it interesting that O_EXCL is dropped too -- you could imagine a
userspace program wanting to know that the file was opened with O_EXCL,
though it provides you very little information).
--
Aleksa Sarai
https://www.cyphar.com/
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-04-16 11:58 UTC (permalink / raw)
To: Aleksa Sarai, jlayton
Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
bharathsm, shuah, miklos, hansg
In-Reply-To: <2026-04-16-selfless-milky-wasps-shin-p6liRL@cyphar.com>
On Thu, Apr 16, 2026 at 5:41 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2026-03-28, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > This flag indicates the path should be opened if it's a regular file.
> > This is useful to write secure programs that want to avoid being
> > tricked into opening device nodes with special semantics while thinking
> > they operate on regular files. This is a requested feature from the
> > uapi-group[1].
> >
> > A corresponding error code EFTYPE has been introduced. For example, if
> > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > like FreeBSD, macOS.
> >
> > When used in combination with O_CREAT, either the regular file is
> > created, or if the path already exists, it is opened if it's a regular
> > file. Otherwise, -EFTYPE is returned.
> >
> > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > as it doesn't make sense to open a path that is both a directory and a
> > regular file.
> >
> > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> >
> > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > ---
>
> Aside from the nit below, feel free to take a
>
> Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
>
Thanks for reviewing!
> > diff --git a/fs/open.c b/fs/open.c
> > index 681d405bc61e..a6f445f72181 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
> > if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
> > f->f_mode |= FMODE_CAN_ODIRECT;
> >
> > - f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> > + f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);
>
> It's not clear to me why you dropped this, I didn't see a review
> mentioning it either. (General note: Ideally the cover letter changelog
> would mention who suggested a change in brackets after the changelog
> line so it's easier to track where a change might've come from.)
>
Thanks for the general note. I will keep that in mind.
The review was from Jeff Layton in v5
https://lore.kernel.org/linux-fsdevel/5fcc2a6e6d92dae0601c6b3b8faa8b2f83981afb.camel@kernel.org/
" 1. OPENAT2_REGULAR leaks into f_flags - do_dentry_open() strips
open-time-only flags (O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC)
but does not strip OPENAT2_REGULAR. When a regular file is
successfully opened via openat2() with this flag, the bit
persists in file->f_flags and will be returned by fcntl(fd, F_GETFL)."
I think it makes sense to strip off as OPENAT2_REGULAR is an open time
only flag (like O_CREAT and the others already), right?
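The effect of that stripping can be observed from userspace today with the existing open-time-only flags. A small sketch (the helper name is made up for this example) that opens a path and returns what fcntl(fd, F_GETFL) reports:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open path with the given flags and return what F_GETFL reports for
 * the resulting descriptor, or -1 on failure. The descriptor is closed
 * again before returning.
 */
int open_and_getfl(const char *path, int flags, mode_t mode)
{
	int fd = open(path, flags, mode);
	int fl;

	if (fd < 0)
		return -1;
	fl = fcntl(fd, F_GETFL);
	close(fd);
	return fl;
}
```

Opening with O_CREAT | O_TRUNC | O_WRONLY and inspecting the result should show the access mode but neither O_CREAT nor O_TRUNC, since do_dentry_open() clears them from f_flags.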
Regards,
Dorjoy
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Aleksa Sarai @ 2026-04-16 13:05 UTC (permalink / raw)
To: Dorjoy Chowdhury
Cc: jlayton, linux-fsdevel, Linus Torvalds, linux-kernel, linux-api,
ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
viro, brauner, jack, chuck.lever, alex.aring, arnd, adilger,
mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CAFfO_h5kWCYszymaY=tPAbpU=PjLFxsND+CWSYtypN4iuW+qPw@mail.gmail.com>
On 2026-04-16, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> On Thu, Apr 16, 2026 at 5:41 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2026-03-28, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > This flag indicates the path should be opened if it's a regular file.
> > > This is useful to write secure programs that want to avoid being
> > > tricked into opening device nodes with special semantics while thinking
> > > they operate on regular files. This is a requested feature from the
> > > uapi-group[1].
> > >
> > > A corresponding error code EFTYPE has been introduced. For example, if
> > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > like FreeBSD, macOS.
> > >
> > > When used in combination with O_CREAT, either the regular file is
> > > created, or if the path already exists, it is opened if it's a regular
> > > file. Otherwise, -EFTYPE is returned.
> > >
> > > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > > as it doesn't make sense to open a path that is both a directory and a
> > > regular file.
> > >
> > > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> > >
> > > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > > ---
> >
> > Aside from the nit below, feel free to take a
> >
> > Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
> >
>
> Thanks for reviewing!
>
> > > diff --git a/fs/open.c b/fs/open.c
> > > index 681d405bc61e..a6f445f72181 100644
> > > --- a/fs/open.c
> > > +++ b/fs/open.c
> > > @@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
> > > if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
> > > f->f_mode |= FMODE_CAN_ODIRECT;
> > >
> > > - f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> > > + f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);
> >
> > It's not clear to me why you dropped this, I didn't see a review
> > mentioning it either. (General note: Ideally the cover letter changelog
> > would mention who suggested a change in brackets after the changelog
> > line so it's easier to track where a change might've come from.)
> >
>
> Thanks for the general note. I will keep that in mind.
>
> The review was from Jeff Layton in v5
> https://lore.kernel.org/linux-fsdevel/5fcc2a6e6d92dae0601c6b3b8faa8b2f83981afb.camel@kernel.org/
> " 1. OPENAT2_REGULAR leaks into f_flags - do_dentry_open() strips
> open-time-only flags (O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC)
> but does not strip OPENAT2_REGULAR. When a regular file is
> successfully opened via openat2() with this flag, the bit
> persists in file->f_flags and will be returned by fcntl(fd, F_GETFL)."
>
> I think it makes sense to strip off as OPENAT2_REGULAR is an open time
> only flag (like O_CREAT and the others already), right?
Well, O_DIRECTORY isn't stripped so if we want to mirror that behaviour
then it shouldn't be stripped either IMHO.
O_NOCTTY and O_TRUNC make sense to strip (they are not relevant to the
file after it was opened -- truncation only happens at open time and you
can always set your controlling TTY later).
The story with O_CREAT and O_EXCL is a bit more complicated. They are
stripped but the history there is unclear -- the line was added in Linux
0.98.4(!) with no mention in the release note at the time. (Linus: I
wonder if you remember why this was changed at the time? Sorry for the
trip down memory lane...)
However, the existence of F_CREATED_QUERY kind of shows that these kinds
of checks are stuff that userspace can find handy (though FMODE_CREATED
is more useful than O_CREAT|O_EXCL anyway). O_EXCL is used internally
for stuff so it can be re-exposed, I'm just not sure it's a good
precedent to make a decision based on.
Then again, userspace can check with fstat(2) so it's not the end of the
world, but I don't really see a strong reason to hide information from
userspace. Since the mail was from Claude (and it tends to give silly
nits like that) I'm not sure whether Jeff would agree with my view or
not.
--
Aleksa Sarai
https://www.cyphar.com/
* Re: [PATCH v6 0/4] OPENAT2_REGULAR flag support for openat2
From: Christian Brauner @ 2026-04-16 13:07 UTC (permalink / raw)
To: linux-fsdevel, Dorjoy Chowdhury
Cc: Christian Brauner, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, jack, jlayton,
chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
bharathsm, shuah, miklos, hansg
In-Reply-To: <20260328172314.45807-1-dorjoychy111@gmail.com>
On Sat, 28 Mar 2026 23:22:21 +0600, Dorjoy Chowdhury wrote:
> I came upon this "Ability to only open regular files" uapi feature suggestion
> from https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> and thought it would be something I could do as a first patch and get to
> know the kernel code a bit better.
>
> The following filesystems have been tested by building and booting the kernel
> x86 bzImage in a Fedora 43 VM in QEMU. I have tested with OPENAT2_REGULAR that
> regular files can be successfully opened and non-regular files (directory, fifo etc)
> return -EFTYPE.
> - btrfs
> - NFS (loopback)
> - SMB (loopback)
>
> [...]
- I've added an explanation why OPENAT2_REGULAR is only needed for some
->atomic_open() implementers but not others. What I don't like is that
we need all that custom handling in there but it's manageable.
- I dropped the topmost style conversions. They really don't belong
there and if we switch to something better we should use (1 << <nr>).
- I split the EFTYPE errno introduction into a separate patch.
---
Applied to the vfs-7.2.openat.regular branch of the vfs/vfs.git tree.
Patches in the vfs-7.2.openat.regular branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: master
[1/4] openat2: new OPENAT2_REGULAR flag support
https://git.kernel.org/vfs/vfs/c/0b649c4d70f7
[2/4] kselftest/openat2: test for OPENAT2_REGULAR flag
https://git.kernel.org/vfs/vfs/c/d7dc36df8fa7
[3/4] sparc/fcntl.h: convert O_* flag macros from hex to octal
(dropped)
[4/4] mips/fcntl.h: convert O_* flag macros from hex to octal
(dropped)
^ permalink raw reply
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jeff Layton @ 2026-04-16 13:28 UTC (permalink / raw)
To: Aleksa Sarai, Dorjoy Chowdhury
Cc: linux-fsdevel, Linus Torvalds, linux-kernel, linux-api,
ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
viro, brauner, jack, chuck.lever, alex.aring, arnd, adilger,
mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <2026-04-16-raunchy-random-curfew-guide-GmtLJR@cyphar.com>
On Thu, 2026-04-16 at 23:05 +1000, Aleksa Sarai wrote:
> On 2026-04-16, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > On Thu, Apr 16, 2026 at 5:41 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > >
> > > On 2026-03-28, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > > This flag indicates the path should be opened if it's a regular file.
> > > > This is useful to write secure programs that want to avoid being
> > > > tricked into opening device nodes with special semantics while thinking
> > > > they operate on regular files. This is a requested feature from the
> > > > uapi-group[1].
> > > >
> > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > like FreeBSD, macOS.
> > > >
> > > > When used in combination with O_CREAT, either the regular file is
> > > > created, or if the path already exists, it is opened if it's a regular
> > > > file. Otherwise, -EFTYPE is returned.
> > > >
> > > > When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
> > > > as it doesn't make sense to open a path that is both a directory and a
> > > > regular file.
> > > >
> > > > [1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> > > >
> > > > Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
> > > > ---
> > >
> > > Aside from the nit below, feel free to take a
> > >
> > > Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
> > >
> >
> > Thanks for reviewing!
> >
> > > > diff --git a/fs/open.c b/fs/open.c
> > > > index 681d405bc61e..a6f445f72181 100644
> > > > --- a/fs/open.c
> > > > +++ b/fs/open.c
> > > > @@ -960,7 +960,7 @@ static int do_dentry_open(struct file *f,
> > > > if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
> > > > f->f_mode |= FMODE_CAN_ODIRECT;
> > > >
> > > > - f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> > > > + f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | OPENAT2_REGULAR);
> > >
> > > It's not clear to me why you dropped this, I didn't see a review
> > > mentioning it either. (General note: Ideally the cover letter changelog
> > > would mention who suggested a change in brackets after the changelog
> > > line so it's easier to track where a change might've come from.)
> > >
> >
> > Thanks for the general note. I will keep that in mind.
> >
> > The review was from Jeff Layton in v5
> > https://lore.kernel.org/linux-fsdevel/5fcc2a6e6d92dae0601c6b3b8faa8b2f83981afb.camel@kernel.org/
> > " 1. OPENAT2_REGULAR leaks into f_flags - do_dentry_open() strips
> > open-time-only flags (O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC)
> > but does not strip OPENAT2_REGULAR. When a regular file is
> > successfully opened via openat2() with this flag, the bit
> > persists in file->f_flags and will be returned by fcntl(fd, F_GETFL)."
> >
> > I think it makes sense to strip off as OPENAT2_REGULAR is an open time
> > only flag (like O_CREAT and the others already), right?
>
> Well, O_DIRECTORY isn't stripped so if we want to mirror that behaviour
> then it shouldn't be stripped either IMHO.
>
> O_NOCTTY and O_TRUNC make sense to strip (they are not relevant to the
> file after it was opened -- truncation only happens at open time and you
> can always set your controlling TTY later).
>
> The story with O_CREAT and O_EXCL is a bit more complicated. They are
> stripped but the history there is unclear -- the line was added in Linux
> 0.98.4(!) with no mention in the release note at the time. (Linus: I
> wonder if you remember why this was changed at the time? Sorry for the
> trip down memory lane...)
>
> However, the existence of F_CREATED_QUERY kind of shows that these kinds
> of checks are stuff that userspace can find handy (though FMODE_CREATED
> is more useful than O_CREAT|O_EXCL anyway). O_EXCL is used internally
> for stuff so it can be re-exposed, I'm just not sure it's a good
> precedent to make a decision based on.
>
> Then again, userspace can check with fstat(2) so it's not the end of the
> world, but I don't really see a strong reason to hide information from
> userspace. Since the mail was from Claude (and it tends to give silly
> nits like that) I'm not sure whether Jeff would agree with my view or
> not.
I don't have a strong feeling either way, but it "feels" like O_REGULAR
is not particularly useful to return in F_GETFL.
Once the file is open, then O_REGULAR really doesn't matter anymore. We
_know_ it's a regular file at that point or the open wouldn't have
happened. F_GETFL is more useful for showing flags that actually affect
how the file description works (e.g. O_DIRECT, O_ASYNC, etc.).
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jori Koolstra @ 2026-04-16 13:52 UTC (permalink / raw)
To: Dorjoy Chowdhury
Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
jlayton, chuck.lever, alex.aring, arnd, adilger, mjguzik,
smfrench, richard.henderson, mattst88, linmag7, tsbogend,
James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <20260328172314.45807-2-dorjoychy111@gmail.com>
On Sat, Mar 28, 2026 at 11:22:22PM +0600, Dorjoy Chowdhury wrote:
> diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> index 50bdc8e8a271..fe488bf7c18e 100644
> --- a/arch/alpha/include/uapi/asm/fcntl.h
> +++ b/arch/alpha/include/uapi/asm/fcntl.h
> @@ -34,6 +34,7 @@
>
> #define O_PATH 040000000
> #define __O_TMPFILE 0100000000
> +#define OPENAT2_REGULAR 0200000000
>
I don't quite understand why we are adding OPENAT2_REGULAR inside the
O_* flag range. Wasn't this supposed to be only supported for openat2()?
If so, I don't see the need to waste an O_* flag bit. But maybe I am
missing something.
Thanks,
Jori.
^ permalink raw reply
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-04-16 14:21 UTC (permalink / raw)
To: Jori Koolstra, Dorjoy Chowdhury, linux-fsdevel, linux-kernel,
linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs,
linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
miklos, hansg
In-Reply-To: <aeDpIgfDaIKEaBcL@lt-jori.localdomain>
On Thu, Apr 16, 2026 at 7:52 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
>
> On Sat, Mar 28, 2026 at 11:22:22PM +0600, Dorjoy Chowdhury wrote:
> > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > index 50bdc8e8a271..fe488bf7c18e 100644
> > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > @@ -34,6 +34,7 @@
> >
> > #define O_PATH 040000000
> > #define __O_TMPFILE 0100000000
> > +#define OPENAT2_REGULAR 0200000000
> >
>
> I don't quite understand why we are adding OPENAT2_REGULAR inside the
> O_* flag range. Wasn't this supposed to be only supported for openat2()?
> If so, I don't see the need to waste an O_* flag bit. But maybe I am
> missing something.
>
Yes, OPENAT2_REGULAR is only supported for openat2. I don't think I got
a specific review comment asking me not to put OPENAT2_REGULAR in the
32-bit O_* flag range. As far as I understand, we can't easily add new
O_* flags to the old open system calls because those code paths
silently ignore unknown bits instead of rejecting them the way openat2
does, so a new flag could change the behaviour of existing userspace
programs. So I guess it's okay to add OPENAT2_REGULAR in the 32-bit
range anyway? (Also, lots of code paths take a 32-bit flags param right
now and would need changing to take uint64_t instead, but of course
that alone is not a reason to avoid placing the new flag outside the
32 bits.)
Regards,
Dorjoy
^ permalink raw reply
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jori Koolstra @ 2026-04-16 15:03 UTC (permalink / raw)
To: Dorjoy Chowdhury, linux-fsdevel, linux-kernel, linux-api,
ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
viro, brauner, jack, jlayton, chuck.lever, alex.aring, arnd,
adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg,
Aleksa Sarai
In-Reply-To: <CAFfO_h6pkyX=uN5uoXda6toTtT6KsahfBNBLom9i21HdZ7JOmQ@mail.gmail.com>
> Op 16-04-2026 16:21 CEST schreef Dorjoy Chowdhury <dorjoychy111@gmail.com>:
>
>
> On Thu, Apr 16, 2026 at 7:52 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
> >
> > On Sat, Mar 28, 2026 at 11:22:22PM +0600, Dorjoy Chowdhury wrote:
> > > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > > index 50bdc8e8a271..fe488bf7c18e 100644
> > > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > > @@ -34,6 +34,7 @@
> > >
> > > #define O_PATH 040000000
> > > #define __O_TMPFILE 0100000000
> > > +#define OPENAT2_REGULAR 0200000000
> > >
> >
> > I don't quite understand why we are adding OPENAT2_REGULAR inside the
> > O_* flag range. Wasn't this supposed to be only supported for openat2()?
> > If so, I don't see the need to waste an O_* flag bit. But maybe I am
> > missing something.
> >
>
> Yes, OPENAT2_REGULAR is only supported for openat2. I am not sure if I
> got a specific review to not add OPENAT2_REGULAR in the O_* flag 32
> bit range. But as far as I understand, for the old open system calls
> we can't easily add new O_* flags as the older codepaths don't strip
> off unknown bits which openat2 does. It's not easy to add new O_*
> flags for the old open system calls since that could break userspace
> programs.
If I recall correctly, Aleksa has suggested we might also want to add
O_EMPTYPATH to openat() instead of only allowing this for openat2().
I am waiting to see what Christian thinks of this.
I guess in that case it is relatively harmless to change UAPI
behavior because openat() with an empty path never works; so it
would be silly if there are userspace programs that make
this call, which always fails and does nothing, and somehow rely on
that.
> So I guess it's okay to add OPENAT2_REGULAR in the 32 bits
> range anyway? (Also lots of code paths take 32bit flags param right
> now and those would need changing to take uint64_t instead but this is
> of course not a reason to not add the new flag outside of the 32
> bits).
>
> Regards,
> Dorjoy
Thanks,
Jori.
^ permalink raw reply
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Christian Brauner @ 2026-04-16 15:15 UTC (permalink / raw)
To: Jori Koolstra
Cc: Dorjoy Chowdhury, linux-fsdevel, linux-kernel, linux-api,
ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
viro, jack, jlayton, chuck.lever, alex.aring, arnd, adilger,
mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
sprasad, tom, bharathsm, shuah, miklos, hansg, Aleksa Sarai
In-Reply-To: <1714293523.333222.1776351806025@kpc.webmail.kpnmail.nl>
On Thu, Apr 16, 2026 at 05:03:26PM +0200, Jori Koolstra wrote:
>
> > Op 16-04-2026 16:21 CEST schreef Dorjoy Chowdhury <dorjoychy111@gmail.com>:
> >
> >
> > On Thu, Apr 16, 2026 at 7:52 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
> > >
> > > On Sat, Mar 28, 2026 at 11:22:22PM +0600, Dorjoy Chowdhury wrote:
> > > > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > > > index 50bdc8e8a271..fe488bf7c18e 100644
> > > > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > > > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > > > @@ -34,6 +34,7 @@
> > > >
> > > > #define O_PATH 040000000
> > > > #define __O_TMPFILE 0100000000
> > > > +#define OPENAT2_REGULAR 0200000000
> > > >
> > >
> > > I don't quite understand why we are adding OPENAT2_REGULAR inside the
> > > O_* flag range. Wasn't this supposed to be only supported for openat2()?
> > > If so, I don't see the need to waste an O_* flag bit. But maybe I am
> > > missing something.
> > >
> >
> > Yes, OPENAT2_REGULAR is only supported for openat2. I am not sure if I
> > got a specific review to not add OPENAT2_REGULAR in the O_* flag 32
> > bit range. But as far as I understand, for the old open system calls
> > we can't easily add new O_* flags as the older codepaths don't strip
> > off unknown bits which openat2 does. It's not easy to add new O_*
> > flags for the old open system calls since that could break userspace
> > programs.
>
> If I recall correctly, Aleksa has suggested we might also want to add
> O_EMPTYPATH to openat() instead of only allowing this for openat2().
> I am waiting to see what Christian thinks of this.
We can do that, yes. For O_EMPTYPATH that is workable.
I don't mind too much if we leave OPENAT2_REGULAR in the 32-bit flag
space. It'll silently be ignored but the flag name should give it away.
^ permalink raw reply
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Aleksa Sarai @ 2026-04-16 15:15 UTC (permalink / raw)
To: Dorjoy Chowdhury
Cc: Jori Koolstra, linux-fsdevel, linux-kernel, linux-api, ceph-devel,
gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner,
jack, jlayton, chuck.lever, alex.aring, arnd, adilger, mjguzik,
smfrench, richard.henderson, mattst88, linmag7, tsbogend,
James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CAFfO_h6pkyX=uN5uoXda6toTtT6KsahfBNBLom9i21HdZ7JOmQ@mail.gmail.com>
On 2026-04-16, Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> On Thu, Apr 16, 2026 at 7:52 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
> >
> > On Sat, Mar 28, 2026 at 11:22:22PM +0600, Dorjoy Chowdhury wrote:
> > > diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> > > index 50bdc8e8a271..fe488bf7c18e 100644
> > > --- a/arch/alpha/include/uapi/asm/fcntl.h
> > > +++ b/arch/alpha/include/uapi/asm/fcntl.h
> > > @@ -34,6 +34,7 @@
> > >
> > > #define O_PATH 040000000
> > > #define __O_TMPFILE 0100000000
> > > +#define OPENAT2_REGULAR 0200000000
> > >
> >
> > I don't quite understand why we are adding OPENAT2_REGULAR inside the
> > O_* flag range. Wasn't this supposed to be only supported for openat2()?
> > If so, I don't see the need to waste an O_* flag bit. But maybe I am
> > missing something.
> >
>
> Yes, OPENAT2_REGULAR is only supported for openat2. I am not sure if I
> got a specific review to not add OPENAT2_REGULAR in the O_* flag 32
> bit range. But as far as I understand, for the old open system calls
> we can't easily add new O_* flags as the older codepaths don't strip
> off unknown bits which openat2 does. It's not easy to add new O_*
> flags for the old open system calls since that could break userspace
> programs. So I guess it's okay to add OPENAT2_REGULAR in the 32 bits
> range anyway? (Also lots of code paths take 32bit flags param right
> now and those would need changing to take uint64_t instead but this is
> of course not a reason to not add the new flag outside of the 32
> bits).
Oh, I didn't notice that this wasn't mentioned here, we had a separate
discussion about it in a thread with Jori and I must've assumed we
discussed it in both. (My brain is also really not wired up to read
large octal values easily.)
While it is hard to add new O_* flags (hence OPENAT2_REGULAR), it's not
/impossible/ (Jori has a patch for OPENAT2_EMPTY_PATH that is safe to
add to O_* flags because of some fun historical coincidences).
I would have a slight preference towards segregating the bits, ideally
at the top end but even 1<<31 would've been nice. Then again, I'm not
too fussed either way to be honest...
--
Aleksa Sarai
https://www.cyphar.com/
^ permalink raw reply
* Re: [PATCH v6 0/4] OPENAT2_REGULAR flag support for openat2
From: Dorjoy Chowdhury @ 2026-04-16 15:22 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, jack, jlayton,
chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
bharathsm, shuah, miklos, hansg
In-Reply-To: <20260416-abgraben-seeweg-a44ce660957f@brauner>
On Thu, Apr 16, 2026 at 7:07 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Sat, 28 Mar 2026 23:22:21 +0600, Dorjoy Chowdhury wrote:
> > I came upon this "Ability to only open regular files" uapi feature suggestion
> > from https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> > and thought it would be something I could do as a first patch and get to
> > know the kernel code a bit better.
> >
> > The following filesystems have been tested by building and booting the kernel
> > x86 bzImage in a Fedora 43 VM in QEMU. I have tested with OPENAT2_REGULAR that
> > regular files can be successfully opened and non-regular files (directory, fifo etc)
> > return -EFTYPE.
> > - btrfs
> > - NFS (loopback)
> > - SMB (loopback)
> >
> > [...]
>
> - I've added an explanation why OPENAT2_REGULAR is only needed for some
> ->atomic_open() implementers but not others. What I don't like is that
> we need all that custom handling in there but it's manageable.
>
> - I dropped the topmost style conversions. They really don't belong
> there and if we switch to something better we should use (1 << <nr>).
>
> - I split the EFTYPE errno introduction into a separate patch.
>
> ---
Thanks for fixing up and picking this one up!
>
> Applied to the vfs-7.2.openat.regular branch of the vfs/vfs.git tree.
> Patches in the vfs-7.2.openat.regular branch should appear in linux-next soon.
>
I don't see a vfs-7.2.openat.regular branch in the vfs/vfs.git tree on
git.kernel.org. Maybe this hasn't been pushed yet?
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: master
>
I guess you meant vfs-7.2.openat.regular here?
Regards,
Dorjoy
^ permalink raw reply
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jori Koolstra @ 2026-04-16 21:36 UTC (permalink / raw)
To: Christian Brauner
Cc: Dorjoy Chowdhury, linux-fsdevel, linux-kernel, linux-api,
ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
viro, jack, jlayton, chuck.lever, alex.aring, arnd, adilger,
mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
sprasad, tom, bharathsm, shuah, miklos, hansg, Aleksa Sarai
In-Reply-To: <20260416-aufbau-sorgenfrei-cfa87c9ddc11@brauner>
> Op 16-04-2026 17:15 CEST schreef Christian Brauner <brauner@kernel.org>:
>
>
> On Thu, Apr 16, 2026 at 05:03:26PM +0200, Jori Koolstra wrote:
> >
> > If I recall correctly, Aleksa has suggested we might also want to add
> > O_EMPTYPATH to openat() instead of only allowing this for openat2().
> > I am waiting to see what Christian thinks of this.
>
> We can do that, yes. For O_EMPTYPATH that is workable.
All right, then I'll update the patch this weekend.
>
> I don't mind too much if we leave OPENAT2_REGULAR in the 32-bit flag
> space. It'll silently be ignored but the flag name should give it away.
I would also prefer to have the bits separated. Although it is unlikely
that we will add so many O_* that we will ever run out of space, it just
seems cleaner, and at no cost. But it's not too important.
Thanks,
Jori.
^ permalink raw reply
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jori Koolstra @ 2026-04-16 21:42 UTC (permalink / raw)
To: Aleksa Sarai, Dorjoy Chowdhury, brauner
Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, jack, jlayton,
chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
bharathsm, shuah, miklos, hansg
In-Reply-To: <2026-04-16-upstate-capable-deacon-petals-0l25lH@cyphar.com>
> Op 16-04-2026 17:15 CEST schreef Aleksa Sarai <cyphar@cyphar.com>:
>
>
> Oh, I didn't notice that this wasn't mentioned here, we had a separate
> discussion about it in a thread with Jori and I must've assumed we
> discussed it in both. (My brain is also really not wired up to read
> large octal values easily.)
>
> While it is hard to add new O_* flags (hence OPENAT2_REGULAR), it's not
> /impossible/ (Jori has a patch for OPENAT2_EMPTY_PATH that is safe to
> add to O_* flags because of some fun historical coincidences).
But it would change userspace, at least in theory, right? If anyone for
some reason decided to set whatever the bit will be for O_EMPTYPATH
in a call to openat(), and pass an empty string, relying on this to fail,
that will no longer be the case. But that is just really silly. Or are you
hinting at something else?
Thanks,
Jori.
^ permalink raw reply
* Re: [PATCH v6 1/4] openat2: new OPENAT2_REGULAR flag support
From: Aleksa Sarai @ 2026-04-17 7:58 UTC (permalink / raw)
To: Jori Koolstra
Cc: Dorjoy Chowdhury, brauner, linux-fsdevel, linux-kernel, linux-api,
ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
viro, jack, jlayton, chuck.lever, alex.aring, arnd, adilger,
mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <2059025134.378522.1776375762839@kpc.webmail.kpnmail.nl>
On 2026-04-16, Jori Koolstra <jkoolstra@xs4all.nl> wrote:
>
> > Op 16-04-2026 17:15 CEST schreef Aleksa Sarai <cyphar@cyphar.com>:
> >
> >
> > Oh, I didn't notice that this wasn't mentioned here, we had a separate
> > discussion about it in a thread with Jori and I must've assumed we
> > discussed it in both. (My brain is also really not wired up to read
> > large octal values easily.)
> >
> > While it is hard to add new O_* flags (hence OPENAT2_REGULAR), it's not
> > /impossible/ (Jori has a patch for OPENAT2_EMPTY_PATH that is safe to
> > add to O_* flags because of some fun historical coincidences).
>
> But it would change userspace, at least in theory, right? If anyone for
> some reason decided to set whatever the bit will be for O_EMPTYPATH
> in a call to openat(), and pass an empty string, relying on this to fail,
> that will no longer be the case. But that is just really silly. Or are you
> hinting at something else?
Yes, such a program would break, but it is a fairly safe bet that no
such program actually exists in the wild. There is a limit to "never
break userspace" -- it actually needs to break a real userspace program
for it to matter.
Even then there are limits -- in theory someone could write a program
that would error out if any new flag is added to any syscall that
returns -EINVAL for invalid flags (in fact, we have selftests for
openat2(2) that would break because we test the error path) but it
wouldn't make sense to not add features to any syscall because such a
program could theoretically exist.
We change uAPI all the time, the trick is doing it so that userspace
doesn't notice.
For O_EMPTYPATH the logic is that programs that pass regular paths would
work the same way as they do now (i.e., LOOKUP_EMPTY semantics) and
programs that used to pass "" would previously get ENOENT -- it seems
quite unlikely anyone would depend on this for anything (they could
check if the string was empty themselves, after all) and it seems
astronomically unlikely that they would pass garbage *and* depend
on it for anything.
(It is a little funky that open("", O_EMPTYPATH) would give you an fd to
"." but that makes more sense than the alternatives so let's just keep
it consistent.)
--
Aleksa Sarai
https://www.cyphar.com/
^ permalink raw reply
* [PATCH] [PATCH] PM: docs: Add comprehensive wakeup_count documentation
From: chenheyun @ 2026-04-19 7:23 UTC (permalink / raw)
To: rafael, pavel; +Cc: linux-pm, linux-api, linux-kernel, chenheyun
The current Documentation/power/wakeup-count.rst is empty and lacks
description of the race-free suspend mechanism, sysfs ABI semantics,
blocking behavior, and standard userspace usage.
Add complete documentation for /sys/power/wakeup_count, including
overview, interface semantics, usage example, and related interfaces.
Also update Documentation/power/index.rst to include the new document.
Signed-off-by: chenheyun <chen_heyun@163.com>
---
Documentation/power/index.rst | 1 +
Documentation/power/wakeup-count.rst | 63 ++++++++++++++++++++++++++++
2 files changed, 64 insertions(+)
create mode 100644 Documentation/power/wakeup-count.rst
diff --git a/Documentation/power/index.rst b/Documentation/power/index.rst
index b4581e4ae785..901268049d7c 100644
--- a/Documentation/power/index.rst
+++ b/Documentation/power/index.rst
@@ -27,6 +27,7 @@ Power Management
swsusp
video
tricks
+ wakeup-count
userland-swsusp
diff --git a/Documentation/power/wakeup-count.rst b/Documentation/power/wakeup-count.rst
new file mode 100644
index 000000000000..5f3a1ca654ce
--- /dev/null
+++ b/Documentation/power/wakeup-count.rst
@@ -0,0 +1,63 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2025 The Linux Foundation
+
+The wakeup_count mechanism for race-free suspend
+================================================
+
+Overview
+--------
+
+The ``/sys/power/wakeup_count`` sysfs interface provides a stable userspace
+mechanism to perform race-free system suspend transitions. It eliminates the
+well-known race condition between suspend permission check and actual system
+suspend entry.
+
+Userspace may use it in a standard three-step sequence:
+
+1. Read the current global wakeup event counter. The read operation blocks
+ until all ongoing wakeup event processing is finished, returning a stable value.
+2. Perform necessary suspend preparation steps in userspace.
+3. Write the previously-read counter value back to the interface.
+ The write operation will only succeed if no new wakeup events have occurred
+ since the read.
+
+Only after a successful write may userspace safely trigger system suspend.
+
+Interface semantics
+-------------------
+
+``/sys/power/wakeup_count``
+
+**Read**
+ Returns the global monotonically-increasing wakeup event counter.
+ This call blocks until there are no wakeup events under active processing
+ inside the kernel. If interrupted by a signal, it returns -EINTR.
+
+**Write**
+ Accepts the counter value obtained from a prior read.
+ The write succeeds only if the kernel's current counter exactly matches
+ the written value. Mismatch indicates new wakeup events arrived during
+ userspace preparation, and suspend must be aborted.
+
+Standard userspace usage example
+--------------------------------
+
+.. code-block:: shell
+
+ count=$(cat /sys/power/wakeup_count)
+ do_suspend_preparation
+ echo "$count" > /sys/power/wakeup_count && echo mem > /sys/power/state
+
+Blocking behavior
+-----------------
+
+The blocking read ensures that userspace never observes an inconsistent state
+where wakeup events are still being handled within the kernel. This stability
+is the core guarantee of the interface.
+
+Related kernel interfaces
+-------------------------
+
+- ``/sys/power/state``: System suspend state control interface.
+- ``/sys/kernel/debug/wakeup_sources``: Per-device wakeup source statistics.
+- ``Documentation/power/wakeup-events.rst``: General wakeup event framework.
\ No newline at end of file
--
2.25.1
^ permalink raw reply related
* Re: [PATCH v6 0/4] OPENAT2_REGULAR flag support for openat2
From: Christian Brauner @ 2026-04-20 13:20 UTC (permalink / raw)
To: Dorjoy Chowdhury
Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, jack, jlayton,
chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
bharathsm, shuah, miklos, hansg
In-Reply-To: <CAFfO_h5mORm0OuK-d4thzBWWySmyvLSVeVa7phZc4Df-8D=1Cg@mail.gmail.com>
On Thu, Apr 16, 2026 at 09:22:03PM +0600, Dorjoy Chowdhury wrote:
> On Thu, Apr 16, 2026 at 7:07 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Sat, 28 Mar 2026 23:22:21 +0600, Dorjoy Chowdhury wrote:
> > > I came upon this "Ability to only open regular files" uapi feature suggestion
> > > from https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
> > > and thought it would be something I could do as a first patch and get to
> > > know the kernel code a bit better.
> > >
> > > The following filesystems have been tested by building and booting the kernel
> > > x86 bzImage in a Fedora 43 VM in QEMU. I have tested with OPENAT2_REGULAR that
> > > regular files can be successfully opened and non-regular files (directory, fifo etc)
> > > return -EFTYPE.
> > > - btrfs
> > > - NFS (loopback)
> > > - SMB (loopback)
> > >
> > > [...]
> >
> > - I've added an explanation why OPENAT2_REGULAR is only needed for some
> > ->atomic_open() implementers but not others. What I don't like is that
> > we need all that custom handling in there but it's manageable.
> >
> > - I dropped the topmost style conversions. They really don't belong
> > there and if we switch to something better we should use (1 << <nr>).
> >
> > - I split the EFTYPE errno introduction into a separate patch.
> >
> > ---
>
> Thanks for fixing up and picking this one up!
>
> >
> > Applied to the vfs-7.2.openat.regular branch of the vfs/vfs.git tree.
> > Patches in the vfs-7.2.openat.regular branch should appear in linux-next soon.
> >
>
> I don't see a vfs-7.2.openat.regular branch in vfs/vfs.git tree in
> git.kernel.org. Maybe this hasn't been pushed yet?
Nothing will get pushed prior to -rc1 which is due this Sunday.
^ permalink raw reply
* [PATCH bpf-next v12 0/8] bpf: Extend BPF syscall with common attributes support
From: Leon Hwang @ 2026-04-20 14:17 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Leon Hwang, Willem de Bruijn, Jason Xing,
Tao Chen, Mykyta Yatsenko, Kumar Kartikeya Dwivedi,
Anton Protopopov, Amery Hung, Rong Tao, linux-kernel, linux-api,
linux-kselftest, kernel-patches-bot
This patch series builds upon the discussion in
"[PATCH bpf-next v4 0/4] bpf: Improve error reporting for freplace attachment failure" [1].
This patch series introduces support for *common attributes* in the BPF
syscall, providing a unified mechanism for passing shared metadata across
all BPF commands, initially used by BPF_PROG_LOAD, BPF_BTF_LOAD, and
BPF_MAP_CREATE.
The initial set of common attributes includes:
1. 'log_buf': User-provided buffer for storing log output.
2. 'log_size': Size of the provided log buffer.
3. 'log_level': Verbosity level for logging.
4. 'log_true_size': Actual log size reported by kernel.
With this extension, the BPF syscall will be able to return meaningful
error messages (e.g., map creation failures), improving debuggability
and user experience.
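As a rough illustration of how the four log attributes listed above relate to
each other, here is a minimal userspace sketch. Everything in it is
hypothetical: struct sketch_log_attrs and sketch_fail_with_log() are
illustrative names, not the uapi layout or any function from this series, and
the assumption that log_true_size counts the terminating NUL is made only for
the sketch. It models a failing command that writes a possibly truncated
message into log_buf and reports the size the full message would need.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical mirror of the four common log attributes named in the
 * cover letter. This is NOT the uapi struct from the series.
 */
struct sketch_log_attrs {
	char *log_buf;          /* user-provided buffer for log output */
	uint32_t log_size;      /* capacity of log_buf in bytes */
	uint32_t log_level;     /* verbosity; 0 disables logging in this sketch */
	uint32_t log_true_size; /* out: bytes the full log needs (NUL included, assumed) */
};

/*
 * Hypothetical stand-in for a command that fails with a reason.
 * Copies as much of msg as fits, always NUL-terminates, and records
 * the untruncated size so the caller can retry with a larger buffer.
 * Always returns -1 to mimic a failed command.
 */
static int sketch_fail_with_log(struct sketch_log_attrs *attrs, const char *msg)
{
	size_t need, n;

	assert(attrs != NULL && msg != NULL);

	attrs->log_true_size = 0;
	if (attrs->log_level == 0 || attrs->log_buf == NULL || attrs->log_size == 0)
		return -1; /* logging not requested: nothing to report */

	need = strlen(msg) + 1;
	n = need < attrs->log_size ? need : attrs->log_size;
	memcpy(attrs->log_buf, msg, n - 1);
	attrs->log_buf[n - 1] = '\0';
	attrs->log_true_size = (uint32_t)need;
	return -1;
}
```

A caller would fill in log_buf, log_size, and log_level, invoke the command,
and on failure compare log_true_size against log_size to decide whether the
message was truncated and a retry with a larger buffer is worthwhile.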
Links:
[1] https://lore.kernel.org/bpf/20250224153352.64689-1-leon.hwang@linux.dev/
Changes:
v11 -> v12:
* Drop "log_" prefix in struct bpf_log_attr in patch #3.
* Drop "log_" prefix in struct bpf_log_opts in patch #7.
* Copy log_true_size using copy_to_bpfptr_offset() in patch #3 (per Alexei).
* v11: https://lore.kernel.org/bpf/20260216150445.68278-1-leon.hwang@linux.dev/
v10 -> v11:
* Collect Acked-by from Andrii, thanks.
* Validate whether log_buf, log_size, and log_level are valid by reusing
bpf_verifier_log_attr_valid() in patch #4 (per Andrii).
* v10: https://lore.kernel.org/bpf/20260211151115.78013-1-leon.hwang@linux.dev/
v9 -> v10:
* Collect Acked-by from Andrii, thanks.
* Address comments from Andrii:
* Drop log NULL check in bpf_log_attr_finalize().
* Return -EFAULT early in bpf_log_attr_finalize().
* Validate whether log_buf, log_size, and log_level are set.
* Keep log_buf, log_size, log_level, and user-pointer log_true_size in struct
bpf_log_attr.
* Make prog_load and btf_load work with the new struct bpf_log_attr.
* Add comment to log_true_size of struct bpf_log_opts in libbpf.
* Address comment from Alexei:
* Avoid using BPF_LOG_FIXED as log_level in tests.
* v9: https://lore.kernel.org/bpf/20260202144046.30651-1-leon.hwang@linux.dev/
v8 -> v9:
* Rework reporting 'log_true_size' for prog_load, btf_load, and map_create to
simplify struct bpf_log_attr (per Alexei).
* v8: https://lore.kernel.org/bpf/20260126151409.52072-1-leon.hwang@linux.dev/
v7 -> v8:
* Return 0 when fd < 0 and errno != EFAULT in probe_sys_bpf_ext(), then simplify
probe_bpf_syscall_common_attrs() (per Alexei and Andrii).
* v7: https://lore.kernel.org/bpf/20260123032445.125259-1-leon.hwang@linux.dev/
v6 -> v7:
* Return -errno when fd < 0 and errno != EFAULT in probe_sys_bpf_ext().
* Convert return value of probe_sys_bpf_ext() to bool in
probe_bpf_syscall_common_attrs().
* Address comments from Andrii:
* Drop the comment, and handle fd >= 0 case explicitly in
probe_sys_bpf_ext().
* Return an error when fd >= 0 in probe_sys_bpf_ext().
* v6: https://lore.kernel.org/bpf/20260120152424.40766-1-leon.hwang@linux.dev/
v5 -> v6:
* Address comments from Andrii:
* Update some variables' names.
* Drop unnecessary 'close(fd)' in libbpf.
* Rename FEAT_EXTENDED_SYSCALL to FEAT_BPF_SYSCALL_COMMON_ATTRS with
updated description in libbpf.
* Use EINVAL instead of EUSERS, as EUSERS is not used in bpf yet.
* Rename struct bpf_syscall_common_attr_opts to bpf_log_opts in libbpf.
* Add 'OPTS_SET(log_opts, log_true_size, 0);' in libbpf's 'bpf_map_create()'.
* v5: https://lore.kernel.org/bpf/20260112145616.44195-1-leon.hwang@linux.dev/
v4 -> v5:
* Rework reporting 'log_true_size' for prog_load, btf_load, and map_create
(per Alexei).
* v4: https://lore.kernel.org/bpf/20260106172018.57757-1-leon.hwang@linux.dev/
RFC v3 -> v4:
* Drop RFC.
* Address comments from Andrii:
* Add parentheses in 'sys_bpf_ext()'.
* Avoid creating new fd in 'probe_sys_bpf_ext()'.
* Add a new struct to wrap log fields in libbpf.
* Address comments from Alexei:
* Do not skip writing to user space when log_true_size is zero.
* Do not use 'bool' arguments.
* Drop the added WARN_ON_ONCE()s.
* v3: https://lore.kernel.org/bpf/20251002154841.99348-1-leon.hwang@linux.dev/
RFC v2 -> RFC v3:
* Rename probe_sys_bpf_extended to probe_sys_bpf_ext.
* Refactor reporting 'log_true_size' for prog_load.
* Refactor reporting 'btf_log_true_size' for btf_load.
* Add warnings for internal bugs in map_create.
* Check log_true_size in test cases.
* Address comment from Alexei:
* Change kvzalloc/kvfree to kzalloc/kfree.
* Address comments from Andrii:
* Move BPF_COMMON_ATTRS to 'enum bpf_cmd' alongside brief comment.
* Add bpf_check_uarg_tail_zero() for extra checks.
* Rename sys_bpf_extended to sys_bpf_ext.
* Rename sys_bpf_fd_extended to sys_bpf_ext_fd.
* Probe the new feature using NULL and -EFAULT.
* Move probe_sys_bpf_ext to libbpf_internal.h and drop LIBBPF_API.
* Return -EUSERS when log attrs conflict between bpf_attr and
bpf_common_attr.
* Avoid touching bpf_vlog_init().
* Update the reason messages in map_create.
* Finalize the log using __cleanup().
* Report log size to users.
* Change type of log_buf from '__u64' to 'const char *' and cast type
using ptr_to_u64() in bpf_map_create().
* Do not return -EOPNOTSUPP when kernel doesn't support this feature
in bpf_map_create().
* Add log_level support for map creation for consistency.
* Address comment from Eduard:
* Use common_attrs->log_level instead of BPF_LOG_FIXED.
* v2: https://lore.kernel.org/bpf/20250911163328.93490-1-leon.hwang@linux.dev/
RFC v1 -> RFC v2:
* Fix build error reported by test bot.
* Address comments from Alexei:
* Drop new uapi for freplace.
* Add common attributes support for prog_load and btf_load.
* Add common attributes support for map_create.
* v1: https://lore.kernel.org/bpf/20250728142346.95681-1-leon.hwang@linux.dev/
Leon Hwang (8):
bpf: Extend BPF syscall with common attributes support
libbpf: Add support for extended BPF syscall
bpf: Refactor reporting log_true_size for prog_load
bpf: Add syscall common attributes support for prog_load
bpf: Add syscall common attributes support for btf_load
bpf: Add syscall common attributes support for map_create
libbpf: Add syscall common attributes support for map_create
selftests/bpf: Add tests to verify map create failure log
include/linux/bpf.h | 4 +-
include/linux/bpf_verifier.h | 16 ++
include/linux/btf.h | 3 +-
include/linux/syscalls.h | 3 +-
include/uapi/linux/bpf.h | 8 +
kernel/bpf/btf.c | 30 +---
kernel/bpf/log.c | 91 +++++++++-
kernel/bpf/syscall.c | 114 +++++++++---
kernel/bpf/verifier.c | 17 +-
tools/include/uapi/linux/bpf.h | 8 +
tools/lib/bpf/bpf.c | 52 +++++-
tools/lib/bpf/bpf.h | 17 +-
tools/lib/bpf/features.c | 8 +
tools/lib/bpf/libbpf_internal.h | 3 +
.../selftests/bpf/prog_tests/map_init.c | 166 ++++++++++++++++++
15 files changed, 473 insertions(+), 67 deletions(-)
--
2.53.0
^ permalink raw reply