Linux userland API discussions
* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
From: Kanchan Joshi @ 2026-03-10 17:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
	gost.dev, linux-api
In-Reply-To: <20260309163325.GE6033@frogsfrogsfrogs>

On 3/9/2026 10:03 PM, Darrick J. Wong wrote:
>> +struct fs_write_stream {
>> +	__u32		op_flags;	/* IN: operation flags */
>> +	__u32		stream_id;	/* IN/OUT:  stream value to assign/query */
>> +	__u32		max_streams;	/* OUT: max streams values supported */
>> +	__u32		rsvd;
>> +};
> This isn't a very cohesive interface -- GET_MAX probably only needs
> op_flags and max_streams, right?  And GET/SET only use op_flags and
> stream_id, right?

Yeah, right. That's the trade-off with a Swiss-army-knife ioctl that
uses op_flags to decide what to do. Apart from keeping a single ioctl,
I was also thinking about extensibility (anything new could be added as
a new op_flags value using rsvd or a union). But if you feel strongly
about this, I can take the three-ioctl route?

>> +#define FS_WRITE_STREAM_OP_GET_MAX		(1 << 0)
>> +#define FS_WRITE_STREAM_OP_GET			(1 << 1)
>> +#define FS_WRITE_STREAM_OP_SET			(1 << 2)
>> +
>> +#define FS_IOC_WRITE_STREAM		_IOWR('f', 43, struct fs_write_stream)
> EXT4_IOC_CHECKPOINT already took 'f' / 43.  I /think/ there's no problem
> because its argument is a u32 and ioctl definitions incorporate the
> lower bits of the argument size but you might want to be careful
> anyway.

Indeed, thanks!
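
For illustration, a minimal sketch of why the two numbers should not
collide, assuming EXT4_IOC_CHECKPOINT is defined as _IOW('f', 43, __u32)
(an assumption here, not something stated in this thread). The struct is
copied from the patch and is not in any released header, and
check_no_collision() is a hypothetical helper name. The ioctl command
word encodes direction and argument size (a 14-bit field on most
architectures) next to the type and number, so the two commands share
the 'f' / 43 slot but still differ as values:

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Copied from the patch under review; not part of any released uapi header. */
struct fs_write_stream {
	__u32 op_flags;
	__u32 stream_id;
	__u32 max_streams;
	__u32 rsvd;
};

/* Number proposed in the patch. */
#define FS_IOC_WRITE_STREAM	_IOWR('f', 43, struct fs_write_stream)
/* Assumed ext4 definition: write-only, __u32 argument. */
#define EXT4_IOC_CHECKPOINT	_IOW('f', 43, __u32)

/* Hypothetical helper: aborts if the two commands would be indistinguishable. */
static void check_no_collision(void)
{
	/* Both commands occupy the same type/number slot ... */
	assert(_IOC_TYPE(FS_IOC_WRITE_STREAM) == _IOC_TYPE(EXT4_IOC_CHECKPOINT));
	assert(_IOC_NR(FS_IOC_WRITE_STREAM) == _IOC_NR(EXT4_IOC_CHECKPOINT));
	/* ... but the encoded argument size (16 vs 4 bytes) differs, */
	assert(_IOC_SIZE(FS_IOC_WRITE_STREAM) != _IOC_SIZE(EXT4_IOC_CHECKPOINT));
	/* so the full command words differ as well. */
	assert(FS_IOC_WRITE_STREAM != EXT4_IOC_CHECKPOINT);
}
```

This only shows that the encodings differ; a handler that switches on
_IOC_NR() alone rather than the full command word would still confuse
the two, which is presumably why caution was suggested.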


* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Christian Brauner @ 2026-03-10 11:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jeff Layton, Dorjoy Chowdhury, linux-fsdevel, linux-kernel,
	linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs,
	linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
	adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
	tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
	amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
	ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrWjb+V-zrMT412MtmgDCx9y8simJBQ7+45C9MtdiSMnuw@mail.gmail.com>

On Mon, Mar 09, 2026 at 09:50:18AM -0700, Andy Lutomirski wrote:
> On Mon, Mar 9, 2026 at 1:58 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Sun, Mar 08, 2026 at 10:10:05AM -0700, Andy Lutomirski wrote:
> > > On Sun, Mar 8, 2026 at 4:40 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > >
> > > > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > > > On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > > > >
> > > > > > This flag indicates the path should be opened if it's a regular file.
> > > > > > This is useful to write secure programs that want to avoid being
> > > > > > tricked into opening device nodes with special semantics while thinking
> > > > > > they operate on regular files. This is a requested feature from the
> > > > > > uapi-group[1].
> > > > > >
> > > > >
> > > > > I think this needs a lot more clarification as to what "regular"
> > > > > means.  If it's literally
> > > > >
> > > > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > > > like FreeBSD, macOS.
> > > > >
> > > > > I think this needs more clarification as to what "regular" means,
> > > > > since S_IFREG may not be sufficient.  The UAPI group page says:
> > > > >
> > > > > Use-Case: this would be very useful to write secure programs that want
> > > > > to avoid being tricked into opening device nodes with special
> > > > > semantics while thinking they operate on regular files. This is
> > > > > particularly relevant as many device nodes (or even FIFOs) come with
> > > > > blocking I/O (or even blocking open()!) by default, which is not
> > > > > expected from regular files backed by “fast” disk I/O. Consider
> > > > > implementation of a naive web browser which is pointed to
> > > > > file://dev/zero, not expecting an endless amount of data to read.
> > > > >
> > > > > What about procfs?  What about sysfs?  What about /proc/self/fd/17
> > > > > where that fd is a memfd?  What about files backed by non-"fast" disk
> > > > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > > > >
> > > > > Are we concerned about blocking open?  (open blocks as a matter of
> > > > > course.)  Are we concerned about open having strange side effects?
> > > > > Are we concerned about write having strange side effects?  Are we
> > > > > concerned about cases where opening the file as root results in
> > > > > elevated privilege beyond merely gaining the ability to write to that
> > > > > specific path on an ordinary filesystem?
> >
> > I think this is opening up a barrage of questions that I'm not sure are
> > all that useful. The ability to only open regular files isn't intended to
> > defend against hung FUSE or NFS servers or other random Linux
> > special-sauce murder-suicide file descriptor traps. For a lot of those
> > we have O_PATH which can easily function with the new extension. A lot
> > of the other special-sauce files (most anonymous inode fds) cannot even
> > be reopened via e.g., /proc.
> 
> On the flip side, /proc itself can certainly be opened.  Should
> O_REGULAR be able to open the more magical /proc and /sys files?  Are
> there any that are problematic?

If procfs's job isn't to provide problematic files to userspace I'm not
sure what it is. Joking aside, I think in general you are of course
right that procfs is full of files that under a very strict
interpretation of "regular file" should absolutely not count as a
regular file. sysfs probably as well and let's ignore debugfs and
tracefs and all the other magic filesystems or files.

In general, Linux has been so loosey-goosey with "regular file" for such
a long time that making OPENAT2_REGULAR come up with some strict
definition of "this is a regular file - no really, pinky-promise a
regular one" - is just doomed to fail.

The other problem is that we cannot reasonably determine what odd file
the user really wanted to defend against opening with OPENAT2_REGULAR.
A caller may really want to open /proc/kmsg and just be sure that
someone didn't overmount it with a fifo (systemd does that in containers
iirc).

My personal "hot take" is that adding an api built around a regular file
with immediate irreversible side-effects for the caller on VFS
syscall-based open [1] is a bug. Such broken semantics are what ioctl()s
are for.

[1]: I mean specifically open(), openat2() etc. I'm excluding all
     dedicated APIs that return file descriptors that cannot be reopened
     via regular lookup.

From my pov, what would help is if one had a flexible way to scope opens
to, e.g., a filesystem. But imo, that is not policy the kernel can
reasonably express at the syscall api layer - it would look fugly as
hell and how many other knobs would we have to add to satisfy all needs.
I think that is best left to an lsm hooking into security_file_open()
which can maintain a map of files and filesystems to allow or deny - a
bpf lsm can do this quite nicely.


* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Florian Weimer @ 2026-03-09 17:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Brauner, Jeff Layton, Dorjoy Chowdhury, linux-fsdevel,
	linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
	v9fs, linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
	adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
	tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
	amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
	ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrWjb+V-zrMT412MtmgDCx9y8simJBQ7+45C9MtdiSMnuw@mail.gmail.com>

* Andy Lutomirski:

> On the flip side, /proc itself can certainly be opened.  Should
> O_REGULAR be able to open the more magical /proc and /sys files?  Are
> there any that are problematic?

It seems reading from /proc/kmsg is destructive.  The file doesn't have
an end, either.  It's more like a character device.  Apparently,
/sys/kernel/tracing/trace_pipe is similar in that regard.  Maybe that's
sufficient reason for blocking access?  Although the side effect does
not happen on open.

The other issue is the incorrect size reporting in stat, which affects
most (all?) files under /proc and /sys.  Userspace already has to work
around that, though.

Thanks,
Florian



* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Andy Lutomirski @ 2026-03-09 16:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jeff Layton, Dorjoy Chowdhury, linux-fsdevel, linux-kernel,
	linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs,
	linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
	adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
	tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
	amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
	ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <20260309-umsturz-herfallen-067eb2df7ec2@brauner>

On Mon, Mar 9, 2026 at 1:58 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Sun, Mar 08, 2026 at 10:10:05AM -0700, Andy Lutomirski wrote:
> > On Sun, Mar 8, 2026 at 4:40 AM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > > On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > > >
> > > > > This flag indicates the path should be opened if it's a regular file.
> > > > > This is useful to write secure programs that want to avoid being
> > > > > tricked into opening device nodes with special semantics while thinking
> > > > > they operate on regular files. This is a requested feature from the
> > > > > uapi-group[1].
> > > > >
> > > >
> > > > I think this needs a lot more clarification as to what "regular"
> > > > means.  If it's literally
> > > >
> > > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > > like FreeBSD, macOS.
> > > >
> > > > I think this needs more clarification as to what "regular" means,
> > > > since S_IFREG may not be sufficient.  The UAPI group page says:
> > > >
> > > > Use-Case: this would be very useful to write secure programs that want
> > > > to avoid being tricked into opening device nodes with special
> > > > semantics while thinking they operate on regular files. This is
> > > > particularly relevant as many device nodes (or even FIFOs) come with
> > > > blocking I/O (or even blocking open()!) by default, which is not
> > > > expected from regular files backed by “fast” disk I/O. Consider
> > > > implementation of a naive web browser which is pointed to
> > > > file://dev/zero, not expecting an endless amount of data to read.
> > > >
> > > > What about procfs?  What about sysfs?  What about /proc/self/fd/17
> > > > where that fd is a memfd?  What about files backed by non-"fast" disk
> > > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > > >
> > > > Are we concerned about blocking open?  (open blocks as a matter of
> > > > course.)  Are we concerned about open having strange side effects?
> > > > Are we concerned about write having strange side effects?  Are we
> > > > concerned about cases where opening the file as root results in
> > > > elevated privilege beyond merely gaining the ability to write to that
> > > > specific path on an ordinary filesystem?
>
> I think this is opening up a barrage of questions that I'm not sure are
> all that useful. The ability to only open regular files isn't intended to
> defend against hung FUSE or NFS servers or other random Linux
> special-sauce murder-suicide file descriptor traps. For a lot of those
> we have O_PATH which can easily function with the new extension. A lot
> of the other special-sauce files (most anonymous inode fds) cannot even
> be reopened via e.g., /proc.

On the flip side, /proc itself can certainly be opened.  Should
O_REGULAR be able to open the more magical /proc and /sys files?  Are
there any that are problematic?

--Andy


* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
From: Darrick J. Wong @ 2026-03-09 16:33 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
	gost.dev, linux-api
In-Reply-To: <20260309052944.156054-2-joshi.k@samsung.com>

[cc linux-api because this is certainly an API definition]

On Mon, Mar 09, 2026 at 10:59:40AM +0530, Kanchan Joshi wrote:
> Wire up the userspace interface for write stream management via a new
> vfs ioctl 'FS_IOC_WRITE_STREAM'.
> The application communicates the intended operation using the 'op_flags'
> field of the passed 'struct fs_write_stream'.
> Valid flags are:
> FS_WRITE_STREAM_OP_GET_MAX: Returns the number of available streams.
> FS_WRITE_STREAM_OP_SET: Assign a specific stream value to the file.
> FS_WRITE_STREAM_OP_GET: Query what stream value is set on the file.
> 
> Application should query the available streams by using
> FS_WRITE_STREAM_OP_GET_MAX first.
> If returned value is N, valid stream values for the file are 0 to N.
> Stream value 0 implies that no stream is set on the file.
> Setting a larger value than available streams is rejected.
> 
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
>  include/uapi/linux/fs.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 70b2b661f42c..4d0805b52949 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -338,6 +338,18 @@ struct file_attr {
>  /* Get logical block metadata capability details */
>  #define FS_IOC_GETLBMD_CAP		_IOWR(0x15, 2, struct logical_block_metadata_cap)
>  
> +struct fs_write_stream {
> +	__u32		op_flags;	/* IN: operation flags */
> +	__u32		stream_id;	/* IN/OUT:  stream value to assign/query */
> +	__u32		max_streams;	/* OUT: max streams values supported */
> +	__u32		rsvd;
> +};

This isn't a very cohesive interface -- GET_MAX probably only needs
op_flags and max_streams, right?  And GET/SET only use op_flags and
stream_id, right?

> +#define FS_WRITE_STREAM_OP_GET_MAX		(1 << 0)
> +#define FS_WRITE_STREAM_OP_GET			(1 << 1)
> +#define FS_WRITE_STREAM_OP_SET			(1 << 2)
> +
> +#define FS_IOC_WRITE_STREAM		_IOWR('f', 43, struct fs_write_stream)

EXT4_IOC_CHECKPOINT already took 'f' / 43.  I /think/ there's no problem
because its argument is a u32 and ioctl definitions incorporate the
lower bits of the argument size but you might want to be careful
anyway.

--D

>  /*
>   * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
>   *
> -- 
> 2.25.1
> 
> 


* Re: [PATCH v2] sched/deadline: document new sched_getattr() feature for retrieving current parameters for DEADLINE tasks
From: Jonathan Corbet @ 2026-03-09 16:17 UTC (permalink / raw)
  To: Tommaso Cucinotta, Peter Zijlstra
  Cc: Tommaso Cucinotta, linux-api, Juri Lelli, Shuah Khan,
	Shashank Balaji, linux-doc, linux-kernel
In-Reply-To: <20260304102843.1373905-2-tommaso.cucinotta@santannapisa.it>

Tommaso Cucinotta <tommaso.cucinotta@gmail.com> writes:

> Document in Documentation/scheduler/sched-deadline.rst the new capability of
> sched_getattr() to retrieve, for DEADLINE tasks, the runtime left and absolute
> deadline (setting the flags syscall parameter to 1), in addition to the static
> parameters (obtained with flags=0).
>
> Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
> Acked-by: Juri Lelli <juri.lelli@redhat.com>
> ---
>  Documentation/scheduler/sched-deadline.rst | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)

Applied, thanks.
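
For readers unfamiliar with the call, a rough sketch of invoking
sched_getattr() through syscall(2). The struct name dl_sched_attr and
the wrapper get_sched_attr() are hypothetical, chosen to avoid clashing
with libc headers that may already declare struct sched_attr; the field
layout is assumed to follow the kernel's uapi definition. Per the commit
message quoted above, flags=0 returns the static parameters and flags=1
asks for the remaining runtime and absolute deadline of a DEADLINE task.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the attribute layout (assumed to match the kernel uapi). */
struct dl_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* SCHED_DEADLINE: runtime, in ns */
	uint64_t sched_deadline;	/* SCHED_DEADLINE: deadline, in ns */
	uint64_t sched_period;		/* SCHED_DEADLINE: period, in ns */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/*
 * Thin wrapper around the raw syscall. pid 0 means the calling thread.
 * Returns 0 on success, -1 with errno set on failure.
 */
static long get_sched_attr(pid_t pid, struct dl_sched_attr *attr,
			   unsigned int flags)
{
	memset(attr, 0, sizeof(*attr));
	return syscall(SYS_sched_getattr, pid, attr,
		       (unsigned int)sizeof(*attr), flags);
}
```

A caller would try get_sched_attr(0, &attr, 1) first and fall back to
flags=0 if that fails, since kernels without the new feature are
expected to reject a non-zero flags value.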

jon


* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Christian Brauner @ 2026-03-09  8:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jeff Layton, Dorjoy Chowdhury, linux-fsdevel, linux-kernel,
	linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs,
	linux-kselftest, viro, jack, chuck.lever, alex.aring, arnd,
	adilger, mjguzik, smfrench, richard.henderson, mattst88, linmag7,
	tsbogend, James.Bottomley, deller, davem, andreas, idryomov,
	amarkuze, slava, agruenba, trondmy, anna, sfrench, pc,
	ronniesahlberg, sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrVt7o+7JCMfTX3Vu9PANJJgR8hB5Z2THcXzam61kG9Gig@mail.gmail.com>

On Sun, Mar 08, 2026 at 10:10:05AM -0700, Andy Lutomirski wrote:
> On Sun, Mar 8, 2026 at 4:40 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > > On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > > >
> > > > This flag indicates the path should be opened if it's a regular file.
> > > > This is useful to write secure programs that want to avoid being
> > > > tricked into opening device nodes with special semantics while thinking
> > > > they operate on regular files. This is a requested feature from the
> > > > uapi-group[1].
> > > >
> > >
> > > I think this needs a lot more clarification as to what "regular"
> > > means.  If it's literally
> > >
> > > > A corresponding error code EFTYPE has been introduced. For example, if
> > > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > > like FreeBSD, macOS.
> > >
> > > I think this needs more clarification as to what "regular" means,
> > > since S_IFREG may not be sufficient.  The UAPI group page says:
> > >
> > > Use-Case: this would be very useful to write secure programs that want
> > > to avoid being tricked into opening device nodes with special
> > > semantics while thinking they operate on regular files. This is
> > > particularly relevant as many device nodes (or even FIFOs) come with
> > > blocking I/O (or even blocking open()!) by default, which is not
> > > expected from regular files backed by “fast” disk I/O. Consider
> > > implementation of a naive web browser which is pointed to
> > > file://dev/zero, not expecting an endless amount of data to read.
> > >
> > > What about procfs?  What about sysfs?  What about /proc/self/fd/17
> > > where that fd is a memfd?  What about files backed by non-"fast" disk
> > > I/O like something on a flaky USB stick or a network mount or FUSE?
> > >
> > > Are we concerned about blocking open?  (open blocks as a matter of
> > > course.)  Are we concerned about open having strange side effects?
> > > Are we concerned about write having strange side effects?  Are we
> > > concerned about cases where opening the file as root results in
> > > elevated privilege beyond merely gaining the ability to write to that
> > > specific path on an ordinary filesystem?

I think this is opening up a barrage of questions that I'm not sure are
all that useful. The ability to only open regular files isn't intended to
defend against hung FUSE or NFS servers or other random Linux
special-sauce murder-suicide file descriptor traps. For a lot of those
we have O_PATH which can easily function with the new extension. A lot
of the other special-sauce files (most anonymous inode fds) cannot even
be reopened via e.g., /proc.


* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Andy Lutomirski @ 2026-03-08 17:10 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dorjoy Chowdhury, linux-fsdevel, linux-kernel, linux-api,
	ceph-devel, gfs2, linux-nfs, linux-cifs, v9fs, linux-kselftest,
	viro, brauner, jack, chuck.lever, alex.aring, arnd, adilger,
	mjguzik, smfrench, richard.henderson, mattst88, linmag7, tsbogend,
	James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
	slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
	sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <801cf2c42b80d486726ea0a3774e52abcb158100.camel@kernel.org>

On Sun, Mar 8, 2026 at 4:40 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> > On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > >
> > > This flag indicates the path should be opened if it's a regular file.
> > > This is useful to write secure programs that want to avoid being
> > > tricked into opening device nodes with special semantics while thinking
> > > they operate on regular files. This is a requested feature from the
> > > uapi-group[1].
> > >
> >
> > I think this needs a lot more clarification as to what "regular"
> > means.  If it's literally
> >
> > > A corresponding error code EFTYPE has been introduced. For example, if
> > > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > > like FreeBSD, macOS.
> >
> > I think this needs more clarification as to what "regular" means,
> > since S_IFREG may not be sufficient.  The UAPI group page says:
> >
> > Use-Case: this would be very useful to write secure programs that want
> > to avoid being tricked into opening device nodes with special
> > semantics while thinking they operate on regular files. This is
> > particularly relevant as many device nodes (or even FIFOs) come with
> > blocking I/O (or even blocking open()!) by default, which is not
> > expected from regular files backed by “fast” disk I/O. Consider
> > implementation of a naive web browser which is pointed to
> > file://dev/zero, not expecting an endless amount of data to read.
> >
> > What about procfs?  What about sysfs?  What about /proc/self/fd/17
> > where that fd is a memfd?  What about files backed by non-"fast" disk
> > I/O like something on a flaky USB stick or a network mount or FUSE?
> >
> > Are we concerned about blocking open?  (open blocks as a matter of
> > course.)  Are we concerned about open having strange side effects?
> > Are we concerned about write having strange side effects?  Are we
> > concerned about cases where opening the file as root results in
> > elevated privilege beyond merely gaining the ability to write to that
> > specific path on an ordinary filesystem?
> >
>
> Above the use-case, it also says:
>
> "O_REGULAR (inspired by the existing O_DIRECTORY flag for open()),
> which opens a file only if it is of type S_IFREG."
>
> Since we allow programs to open a directory under /proc or /sys using
> O_DIRECTORY, I don't think we should do anything different here. To the
> VFS, all of the examples you gave above are S_IFREG "regular files",
> even if they are backed by something quite irregular.

That's certainly a valid and consistent way to define this, but is it useful?

--Andy


* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Jeff Layton @ 2026-03-08 11:40 UTC (permalink / raw)
  To: Andy Lutomirski, Dorjoy Chowdhury
  Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
	linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
	chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
	richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
	deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
	trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
	bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrXVBA9uGEUdQPEZ2MVdxjLwwcWi5kzhOr1NdOWSSRaROw@mail.gmail.com>

On Sat, 2026-03-07 at 10:56 -0800, Andy Lutomirski wrote:
> On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> > 
> > This flag indicates the path should be opened if it's a regular file.
> > This is useful to write secure programs that want to avoid being
> > tricked into opening device nodes with special semantics while thinking
> > they operate on regular files. This is a requested feature from the
> > uapi-group[1].
> > 
> 
> I think this needs a lot more clarification as to what "regular"
> means.  If it's literally
> 
> > A corresponding error code EFTYPE has been introduced. For example, if
> > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > like FreeBSD, macOS.
> 
> I think this needs more clarification as to what "regular" means,
> since S_IFREG may not be sufficient.  The UAPI group page says:
> 
> Use-Case: this would be very useful to write secure programs that want
> to avoid being tricked into opening device nodes with special
> semantics while thinking they operate on regular files. This is
> particularly relevant as many device nodes (or even FIFOs) come with
> blocking I/O (or even blocking open()!) by default, which is not
> expected from regular files backed by “fast” disk I/O. Consider
> implementation of a naive web browser which is pointed to
> file://dev/zero, not expecting an endless amount of data to read.
>
> What about procfs?  What about sysfs?  What about /proc/self/fd/17
> where that fd is a memfd?  What about files backed by non-"fast" disk
> I/O like something on a flaky USB stick or a network mount or FUSE?
> 
> Are we concerned about blocking open?  (open blocks as a matter of
> course.)  Are we concerned about open having strange side effects?
> Are we concerned about write having strange side effects?  Are we
> concerned about cases where opening the file as root results in
> elevated privilege beyond merely gaining the ability to write to that
> specific path on an ordinary filesystem?
>

Above the use-case, it also says:

"O_REGULAR (inspired by the existing O_DIRECTORY flag for open()),
which opens a file only if it is of type S_IFREG."

Since we allow programs to open a directory under /proc or /sys using
O_DIRECTORY, I don't think we should do anything different here. To the
VFS, all of the examples you gave above are S_IFREG "regular files",
even if they are backed by something quite irregular.
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-03-08  6:31 UTC (permalink / raw)
  To: Andy Lutomirski, brauner
  Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
	linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, jack, jlayton,
	chuck.lever, alex.aring, arnd, adilger, mjguzik, smfrench,
	richard.henderson, mattst88, linmag7, tsbogend, James.Bottomley,
	deller, davem, andreas, idryomov, amarkuze, slava, agruenba,
	trondmy, anna, sfrench, pc, ronniesahlberg, sprasad, tom,
	bharathsm, shuah, miklos, hansg
In-Reply-To: <CALCETrXVBA9uGEUdQPEZ2MVdxjLwwcWi5kzhOr1NdOWSSRaROw@mail.gmail.com>

On Sun, Mar 8, 2026 at 12:56 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
> >
> > This flag indicates the path should be opened if it's a regular file.
> > This is useful to write secure programs that want to avoid being
> > tricked into opening device nodes with special semantics while thinking
> > they operate on regular files. This is a requested feature from the
> > uapi-group[1].
> >
>
> I think this needs a lot more clarification as to what "regular"
> means.  If it's literally
>
> > A corresponding error code EFTYPE has been introduced. For example, if
> > openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> > param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> > like FreeBSD, macOS.
>
> I think this needs more clarification as to what "regular" means,
> since S_IFREG may not be sufficient.  The UAPI group page says:
>
> Use-Case: this would be very useful to write secure programs that want
> to avoid being tricked into opening device nodes with special
> semantics while thinking they operate on regular files. This is
> particularly relevant as many device nodes (or even FIFOs) come with
> blocking I/O (or even blocking open()!) by default, which is not
> expected from regular files backed by “fast” disk I/O. Consider
> implementation of a naive web browser which is pointed to
> file://dev/zero, not expecting an endless amount of data to read.
>
> What about procfs?  What about sysfs?  What about /proc/self/fd/17
> where that fd is a memfd?  What about files backed by non-"fast" disk
> I/O like something on a flaky USB stick or a network mount or FUSE?
>
> Are we concerned about blocking open?  (open blocks as a matter of
> course.)  Are we concerned about open having strange side effects?
> Are we concerned about write having strange side effects?  Are we
> concerned about cases where opening the file as root results in
> elevated privilege beyond merely gaining the ability to write to that
> specific path on an ordinary filesystem?
>

Good questions. I had assumed regular file means S_IFREG when
implementing this as mentioned in the UAPI page:
"O_REGULAR (inspired by the existing O_DIRECTORY flag for open()),
which opens a file only if it is of type S_IFREG"
I think Christian Brauner (cc-d) can better answer your above questions.

Regards,
Dorjoy


* Re: [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Andy Lutomirski @ 2026-03-07 18:56 UTC (permalink / raw)
  To: Dorjoy Chowdhury
  Cc: linux-fsdevel, linux-kernel, linux-api, ceph-devel, gfs2,
	linux-nfs, linux-cifs, v9fs, linux-kselftest, viro, brauner, jack,
	jlayton, chuck.lever, alex.aring, arnd, adilger, mjguzik,
	smfrench, richard.henderson, mattst88, linmag7, tsbogend,
	James.Bottomley, deller, davem, andreas, idryomov, amarkuze,
	slava, agruenba, trondmy, anna, sfrench, pc, ronniesahlberg,
	sprasad, tom, bharathsm, shuah, miklos, hansg
In-Reply-To: <20260307140726.70219-2-dorjoychy111@gmail.com>

On Sat, Mar 7, 2026 at 6:09 AM Dorjoy Chowdhury <dorjoychy111@gmail.com> wrote:
>
> This flag indicates the path should be opened if it's a regular file.
> This is useful to write secure programs that want to avoid being
> tricked into opening device nodes with special semantics while thinking
> they operate on regular files. This is a requested feature from the
> uapi-group[1].
>

I think this needs a lot more clarification as to what "regular"
means.  If it's literally

> A corresponding error code EFTYPE has been introduced. For example, if
> openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
> param, it will return -EFTYPE. EFTYPE is already used in BSD systems
> like FreeBSD, macOS.

I think this needs more clarification as to what "regular" means,
since S_IFREG may not be sufficient.  The UAPI group page says:

Use-Case: this would be very useful to write secure programs that want
to avoid being tricked into opening device nodes with special
semantics while thinking they operate on regular files. This is
particularly relevant as many device nodes (or even FIFOs) come with
blocking I/O (or even blocking open()!) by default, which is not
expected from regular files backed by “fast” disk I/O. Consider
implementation of a naive web browser which is pointed to
file://dev/zero, not expecting an endless amount of data to read.

What about procfs?  What about sysfs?  What about /proc/self/fd/17
where that fd is a memfd?  What about files backed by non-"fast" disk
I/O like something on a flaky USB stick or a network mount or FUSE?

Are we concerned about blocking open?  (open blocks as a matter of
course.)  Are we concerned about open having strange side effects?
Are we concerned about write having strange side effects?  Are we
concerned about cases where opening the file as root results in
elevated privilege beyond merely gaining the ability to write to that
specific path on an ordinary filesystem?

--Andy


* [PATCH v5 4/4] mips/fcntl.h: convert O_* flag macros from hex to octal
From: Dorjoy Chowdhury @ 2026-03-07 14:06 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
	v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
	alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
	mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
	andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
	sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
	miklos, hansg
In-Reply-To: <20260307140726.70219-1-dorjoychy111@gmail.com>

Following the convention in include/uapi/asm-generic/fcntl.h and other
architecture specific arch/*/include/uapi/asm/fcntl.h files.
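
The conversion is meant to be value-preserving. A quick way to convince
yourself is a set of compile-time checks like the sketch below, which uses
the literal pairs from this diff rather than the macro names, so that it
builds on any architecture. It is a reviewer-side check, not part of the
patch.

```c
/* Each hex value removed by the patch must equal the octal value added. */
_Static_assert(0x0008 == 0000010, "O_APPEND");
_Static_assert(0x0010 == 0000020, "O_DSYNC");
_Static_assert(0x0080 == 0000200, "O_NONBLOCK");
_Static_assert(0x0100 == 0000400, "O_CREAT");
_Static_assert(0x0200 == 0001000, "O_TRUNC");
_Static_assert(0x0400 == 0002000, "O_EXCL");
_Static_assert(0x0800 == 0004000, "O_NOCTTY");
_Static_assert(0x1000 == 0010000, "FASYNC");
_Static_assert(0x2000 == 0020000, "O_LARGEFILE");
_Static_assert(0x4000 == 0040000, "__O_SYNC");
_Static_assert(0x8000 == 0100000, "O_DIRECT");
```

If any pair differed, the translation unit would fail to compile with the
flag name in the diagnostic.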

Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
---
 arch/mips/include/uapi/asm/fcntl.h | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/mips/include/uapi/asm/fcntl.h b/arch/mips/include/uapi/asm/fcntl.h
index 0369a38e3d4f..6aa3f49df17e 100644
--- a/arch/mips/include/uapi/asm/fcntl.h
+++ b/arch/mips/include/uapi/asm/fcntl.h
@@ -11,15 +11,15 @@
 
 #include <asm/sgidefs.h>
 
-#define O_APPEND	0x0008
-#define O_DSYNC		0x0010	/* used to be O_SYNC, see below */
-#define O_NONBLOCK	0x0080
-#define O_CREAT		0x0100	/* not fcntl */
-#define O_TRUNC		0x0200	/* not fcntl */
-#define O_EXCL		0x0400	/* not fcntl */
-#define O_NOCTTY	0x0800	/* not fcntl */
-#define FASYNC		0x1000	/* fcntl, for BSD compatibility */
-#define O_LARGEFILE	0x2000	/* allow large file opens */
+#define O_APPEND	0000010
+#define O_DSYNC		0000020	/* used to be O_SYNC, see below */
+#define O_NONBLOCK	0000200
+#define O_CREAT		0000400	/* not fcntl */
+#define O_TRUNC		0001000	/* not fcntl */
+#define O_EXCL		0002000	/* not fcntl */
+#define O_NOCTTY	0004000	/* not fcntl */
+#define FASYNC		0010000	/* fcntl, for BSD compatibility */
+#define O_LARGEFILE	0020000	/* allow large file opens */
 /*
  * Before Linux 2.6.33 only O_DSYNC semantics were implemented, but using
  * the O_SYNC flag.  We continue to use the existing numerical value
@@ -33,9 +33,9 @@
  *
  * Note: __O_SYNC must never be used directly.
  */
-#define __O_SYNC	0x4000
+#define __O_SYNC	0040000
 #define O_SYNC		(__O_SYNC|O_DSYNC)
-#define O_DIRECT	0x8000	/* direct disk access hint */
+#define O_DIRECT	0100000	/* direct disk access hint */
 
 #define F_GETLK		14
 #define F_SETLK		6
-- 
2.53.0



* [PATCH v5 3/4] sparc/fcntl.h: convert O_* flag macros from hex to octal
From: Dorjoy Chowdhury @ 2026-03-07 14:06 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
	v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
	alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
	mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
	andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
	sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
	miklos, hansg
In-Reply-To: <20260307140726.70219-1-dorjoychy111@gmail.com>

Following the convention in include/uapi/asm-generic/fcntl.h and other
architecture specific arch/*/include/uapi/asm/fcntl.h files.

Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
---
 arch/sparc/include/uapi/asm/fcntl.h | 36 ++++++++++++++---------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index bb6e9fa94bc9..33ce58ec57f6 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -2,23 +2,23 @@
 #ifndef _SPARC_FCNTL_H
 #define _SPARC_FCNTL_H
 
-#define O_APPEND	0x0008
-#define FASYNC		0x0040	/* fcntl, for BSD compatibility */
-#define O_CREAT		0x0200	/* not fcntl */
-#define O_TRUNC		0x0400	/* not fcntl */
-#define O_EXCL		0x0800	/* not fcntl */
-#define O_DSYNC		0x2000	/* used to be O_SYNC, see below */
-#define O_NONBLOCK	0x4000
+#define O_APPEND	0000000010
+#define FASYNC		0000000100	/* fcntl, for BSD compatibility */
+#define O_CREAT		0000001000	/* not fcntl */
+#define O_TRUNC		0000002000	/* not fcntl */
+#define O_EXCL		0000004000	/* not fcntl */
+#define O_DSYNC		0000020000	/* used to be O_SYNC, see below */
+#define O_NONBLOCK	0000040000
 #if defined(__sparc__) && defined(__arch64__)
-#define O_NDELAY	0x0004
+#define O_NDELAY	0000000004
 #else
-#define O_NDELAY	(0x0004 | O_NONBLOCK)
+#define O_NDELAY	(0000000004 | O_NONBLOCK)
 #endif
-#define O_NOCTTY	0x8000	/* not fcntl */
-#define O_LARGEFILE	0x40000
-#define O_DIRECT        0x100000 /* direct disk access hint */
-#define O_NOATIME	0x200000
-#define O_CLOEXEC	0x400000
+#define O_NOCTTY	0000100000	/* not fcntl */
+#define O_LARGEFILE	0001000000
+#define O_DIRECT        0004000000 /* direct disk access hint */
+#define O_NOATIME	0010000000
+#define O_CLOEXEC	0020000000
 /*
  * Before Linux 2.6.33 only O_DSYNC semantics were implemented, but using
  * the O_SYNC flag.  We continue to use the existing numerical value
@@ -32,12 +32,12 @@
  *
  * Note: __O_SYNC must never be used directly.
  */
-#define __O_SYNC	0x800000
+#define __O_SYNC	0040000000
 #define O_SYNC		(__O_SYNC|O_DSYNC)
 
-#define O_PATH		0x1000000
-#define __O_TMPFILE	0x2000000
-#define OPENAT2_REGULAR	0x4000000
+#define O_PATH		0100000000
+#define __O_TMPFILE	0200000000
+#define OPENAT2_REGULAR	0400000000
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
-- 
2.53.0



* [PATCH v5 2/4] kselftest/openat2: test for OPENAT2_REGULAR flag
From: Dorjoy Chowdhury @ 2026-03-07 14:06 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
	v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
	alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
	mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
	andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
	sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
	miklos, hansg
In-Reply-To: <20260307140726.70219-1-dorjoychy111@gmail.com>

Just a happy path test.

Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
---
 .../testing/selftests/openat2/openat2_test.c  | 37 ++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
index 0e161ef9e9e4..e8847f7d416c 100644
--- a/tools/testing/selftests/openat2/openat2_test.c
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -320,8 +320,42 @@ void test_openat2_flags(void)
 	}
 }
 
+#ifndef OPENAT2_REGULAR
+#define OPENAT2_REGULAR 040000000
+#endif
+
+#ifndef EFTYPE
+#define EFTYPE 134
+#endif
+
+void test_openat2_regular_flag(void)
+{
+	if (!openat2_supported) {
+		ksft_test_result_skip("Skipping %s as openat2 is not supported\n", __func__);
+		return;
+	}
+
+	struct open_how how = {
+		.flags = OPENAT2_REGULAR | O_RDONLY
+	};
+
+	int fd = sys_openat2(AT_FDCWD, "/dev/null", &how);
+
+	if (fd == -ENOENT) {
+		ksft_test_result_skip("Skipping %s as there is no /dev/null\n", __func__);
+		return;
+	}
+
+	if (fd != -EFTYPE) {
+		ksft_test_result_fail("openat2 should return EFTYPE\n");
+		return;
+	}
+
+	ksft_test_result_pass("%s succeeded\n", __func__);
+}
+
 #define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \
-		   NUM_OPENAT2_FLAG_TESTS)
+		   NUM_OPENAT2_FLAG_TESTS + 1)
 
 int main(int argc, char **argv)
 {
@@ -330,6 +364,7 @@ int main(int argc, char **argv)
 
 	test_openat2_struct();
 	test_openat2_flags();
+	test_openat2_regular_flag();
 
 	if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
 		ksft_exit_fail();
-- 
2.53.0



* [PATCH v5 1/4] openat2: new OPENAT2_REGULAR flag support
From: Dorjoy Chowdhury @ 2026-03-07 14:06 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
	v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
	alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
	mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
	andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
	sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
	miklos, hansg
In-Reply-To: <20260307140726.70219-1-dorjoychy111@gmail.com>

This flag indicates the path should be opened if it's a regular file.
This is useful to write secure programs that want to avoid being
tricked into opening device nodes with special semantics while thinking
they operate on regular files. This is a requested feature from the
uapi-group[1].

A corresponding error code EFTYPE has been introduced. For example, if
openat2 is called on path /dev/null with OPENAT2_REGULAR in the flag
param, it will return -EFTYPE. EFTYPE is already used in BSD systems
like FreeBSD, macOS.

When used in combination with O_CREAT, either the regular file is
created, or if the path already exists, it is opened if it's a regular
file. Otherwise, -EFTYPE is returned.

When OPENAT2_REGULAR is combined with O_DIRECTORY, -EINVAL is returned
as it doesn't make sense to open a path that is both a directory and a
regular file.

[1]: https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files

Signed-off-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
---
 arch/alpha/include/uapi/asm/errno.h        |  2 ++
 arch/alpha/include/uapi/asm/fcntl.h        |  1 +
 arch/mips/include/uapi/asm/errno.h         |  2 ++
 arch/parisc/include/uapi/asm/errno.h       |  2 ++
 arch/parisc/include/uapi/asm/fcntl.h       |  1 +
 arch/sparc/include/uapi/asm/errno.h        |  2 ++
 arch/sparc/include/uapi/asm/fcntl.h        |  1 +
 fs/ceph/file.c                             |  4 ++++
 fs/gfs2/inode.c                            |  6 ++++++
 fs/namei.c                                 |  4 ++++
 fs/nfs/dir.c                               |  4 ++++
 fs/open.c                                  |  4 +++-
 fs/smb/client/dir.c                        | 14 +++++++++++++-
 include/linux/fcntl.h                      |  2 ++
 include/uapi/asm-generic/errno.h           |  2 ++
 include/uapi/asm-generic/fcntl.h           |  4 ++++
 tools/arch/alpha/include/uapi/asm/errno.h  |  2 ++
 tools/arch/mips/include/uapi/asm/errno.h   |  2 ++
 tools/arch/parisc/include/uapi/asm/errno.h |  2 ++
 tools/arch/sparc/include/uapi/asm/errno.h  |  2 ++
 tools/include/uapi/asm-generic/errno.h     |  2 ++
 21 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h
index 6791f6508632..1a99f38813c7 100644
--- a/arch/alpha/include/uapi/asm/errno.h
+++ b/arch/alpha/include/uapi/asm/errno.h
@@ -127,4 +127,6 @@
 
 #define EHWPOISON	139	/* Memory page has hardware error */
 
+#define EFTYPE		140	/* Wrong file type for the intended operation */
+
 #endif
diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..fe488bf7c18e 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -34,6 +34,7 @@
 
 #define O_PATH		040000000
 #define __O_TMPFILE	0100000000
+#define OPENAT2_REGULAR	0200000000
 
 #define F_GETLK		7
 #define F_SETLK		8
diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h
index c01ed91b1ef4..1835a50b69ce 100644
--- a/arch/mips/include/uapi/asm/errno.h
+++ b/arch/mips/include/uapi/asm/errno.h
@@ -126,6 +126,8 @@
 
 #define EHWPOISON	168	/* Memory page has hardware error */
 
+#define EFTYPE		169	/* Wrong file type for the intended operation */
+
 #define EDQUOT		1133	/* Quota exceeded */
 
 
diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h
index 8cbc07c1903e..93194fbb0a80 100644
--- a/arch/parisc/include/uapi/asm/errno.h
+++ b/arch/parisc/include/uapi/asm/errno.h
@@ -124,4 +124,6 @@
 
 #define EHWPOISON	257	/* Memory page has hardware error */
 
+#define EFTYPE		258	/* Wrong file type for the intended operation */
+
 #endif
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03dee816cb13..d46812f2f0f4 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -19,6 +19,7 @@
 
 #define O_PATH		020000000
 #define __O_TMPFILE	040000000
+#define OPENAT2_REGULAR	0100000000
 
 #define F_GETLK64	8
 #define F_SETLK64	9
diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h
index 4a41e7835fd5..71940ec9130b 100644
--- a/arch/sparc/include/uapi/asm/errno.h
+++ b/arch/sparc/include/uapi/asm/errno.h
@@ -117,4 +117,6 @@
 
 #define EHWPOISON	135	/* Memory page has hardware error */
 
+#define EFTYPE		136	/* Wrong file type for the intended operation */
+
 #endif
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..bb6e9fa94bc9 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH		0x1000000
 #define __O_TMPFILE	0x2000000
+#define OPENAT2_REGULAR	0x4000000
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 66bbf6d517a9..6d8d4c7765e6 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -977,6 +977,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 			ceph_init_inode_acls(newino, &as_ctx);
 			file->f_mode |= FMODE_CREATED;
 		}
+		if ((flags & OPENAT2_REGULAR) && !d_is_reg(dentry)) {
+			err = -EFTYPE;
+			goto out_req;
+		}
 		err = finish_open(file, dentry, ceph_open);
 	}
 out_req:
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 8344040ecaf7..4604e2e8a9cc 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -738,6 +738,12 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
 	error = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
+		if (file && (file->f_flags & OPENAT2_REGULAR) && !S_ISREG(inode->i_mode)) {
+			iput(inode);
+			inode = NULL;
+			error = -EFTYPE;
+			goto fail_gunlock;
+		}
 		if (S_ISDIR(inode->i_mode)) {
 			iput(inode);
 			inode = NULL;
diff --git a/fs/namei.c b/fs/namei.c
index 58f715f7657e..2a47289262bd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4651,6 +4651,10 @@ static int do_open(struct nameidata *nd,
 		if (unlikely(error))
 			return error;
 	}
+
+	if ((open_flag & OPENAT2_REGULAR) && !d_is_reg(nd->path.dentry))
+		return -EFTYPE;
+
 	if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
 		return -ENOTDIR;
 
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 2402f57c8e7d..d8037c119317 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2195,6 +2195,10 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 			break;
 		case -EISDIR:
 		case -ENOTDIR:
+			if (open_flags & OPENAT2_REGULAR) {
+				err = -EFTYPE;
+				break;
+			}
 			goto no_open;
 		case -ELOOP:
 			if (!(open_flags & O_NOFOLLOW))
diff --git a/fs/open.c b/fs/open.c
index 4f0a76dc8993..026b59af6124 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1195,7 +1195,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 	 * values before calling build_open_flags(), but openat2(2) checks all
 	 * of its arguments.
 	 */
-	if (flags & ~VALID_OPEN_FLAGS)
+	if (flags & ~VALID_OPENAT2_FLAGS)
 		return -EINVAL;
 	if (how->resolve & ~VALID_RESOLVE_FLAGS)
 		return -EINVAL;
@@ -1234,6 +1234,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 			return -EINVAL;
 		if (!(acc_mode & MAY_WRITE))
 			return -EINVAL;
+	} else if ((flags & O_DIRECTORY) && (flags & OPENAT2_REGULAR)) {
+		return -EINVAL;
 	}
 	if (flags & O_PATH) {
 		/* O_PATH only permits certain other flags to be set. */
diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index 953f1fee8cb8..355681ebacf1 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -222,6 +222,13 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 				goto cifs_create_get_file_info;
 			}
 
+			if ((oflags & OPENAT2_REGULAR) && !S_ISREG(newinode->i_mode)) {
+				CIFSSMBClose(xid, tcon, fid->netfid);
+				iput(newinode);
+				rc = -EFTYPE;
+				goto out;
+			}
+
 			if (S_ISDIR(newinode->i_mode)) {
 				CIFSSMBClose(xid, tcon, fid->netfid);
 				iput(newinode);
@@ -436,11 +443,16 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		goto out_err;
 	}
 
-	if (newinode)
+	if (newinode) {
+		if ((oflags & OPENAT2_REGULAR) && !S_ISREG(newinode->i_mode)) {
+			rc = -EFTYPE;
+			goto out_err;
+		}
 		if (S_ISDIR(newinode->i_mode)) {
 			rc = -EISDIR;
 			goto out_err;
 		}
+	}
 
 	d_drop(direntry);
 	d_add(direntry, newinode);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d1bb87ff70e3..a6c692773af8 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -15,6 +15,8 @@
 	 /* upper 32-bit flags (openat2(2) only) */ \
 	 OPENAT2_EMPTY_PATH)
 
+#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OPENAT2_REGULAR)
+
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
 	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
diff --git a/include/uapi/asm-generic/errno.h b/include/uapi/asm-generic/errno.h
index 92e7ae493ee3..bd78e69e0a43 100644
--- a/include/uapi/asm-generic/errno.h
+++ b/include/uapi/asm-generic/errno.h
@@ -122,4 +122,6 @@
 
 #define EHWPOISON	133	/* Memory page has hardware error */
 
+#define EFTYPE		134	/* Wrong file type for the intended operation */
+
 #endif
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285643..b2c2ddd0edc0 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef OPENAT2_REGULAR
+#define OPENAT2_REGULAR	040000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 
diff --git a/tools/arch/alpha/include/uapi/asm/errno.h b/tools/arch/alpha/include/uapi/asm/errno.h
index 6791f6508632..1a99f38813c7 100644
--- a/tools/arch/alpha/include/uapi/asm/errno.h
+++ b/tools/arch/alpha/include/uapi/asm/errno.h
@@ -127,4 +127,6 @@
 
 #define EHWPOISON	139	/* Memory page has hardware error */
 
+#define EFTYPE		140	/* Wrong file type for the intended operation */
+
 #endif
diff --git a/tools/arch/mips/include/uapi/asm/errno.h b/tools/arch/mips/include/uapi/asm/errno.h
index c01ed91b1ef4..1835a50b69ce 100644
--- a/tools/arch/mips/include/uapi/asm/errno.h
+++ b/tools/arch/mips/include/uapi/asm/errno.h
@@ -126,6 +126,8 @@
 
 #define EHWPOISON	168	/* Memory page has hardware error */
 
+#define EFTYPE		169	/* Wrong file type for the intended operation */
+
 #define EDQUOT		1133	/* Quota exceeded */
 
 
diff --git a/tools/arch/parisc/include/uapi/asm/errno.h b/tools/arch/parisc/include/uapi/asm/errno.h
index 8cbc07c1903e..93194fbb0a80 100644
--- a/tools/arch/parisc/include/uapi/asm/errno.h
+++ b/tools/arch/parisc/include/uapi/asm/errno.h
@@ -124,4 +124,6 @@
 
 #define EHWPOISON	257	/* Memory page has hardware error */
 
+#define EFTYPE		258	/* Wrong file type for the intended operation */
+
 #endif
diff --git a/tools/arch/sparc/include/uapi/asm/errno.h b/tools/arch/sparc/include/uapi/asm/errno.h
index 4a41e7835fd5..71940ec9130b 100644
--- a/tools/arch/sparc/include/uapi/asm/errno.h
+++ b/tools/arch/sparc/include/uapi/asm/errno.h
@@ -117,4 +117,6 @@
 
 #define EHWPOISON	135	/* Memory page has hardware error */
 
+#define EFTYPE		136	/* Wrong file type for the intended operation */
+
 #endif
diff --git a/tools/include/uapi/asm-generic/errno.h b/tools/include/uapi/asm-generic/errno.h
index 92e7ae493ee3..bd78e69e0a43 100644
--- a/tools/include/uapi/asm-generic/errno.h
+++ b/tools/include/uapi/asm-generic/errno.h
@@ -122,4 +122,6 @@
 
 #define EHWPOISON	133	/* Memory page has hardware error */
 
+#define EFTYPE		134	/* Wrong file type for the intended operation */
+
 #endif
-- 
2.53.0



* [PATCH v5 0/4] OPENAT2_REGULAR flag support for openat2
From: Dorjoy Chowdhury @ 2026-03-07 14:06 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, ceph-devel, gfs2, linux-nfs, linux-cifs,
	v9fs, linux-kselftest, viro, brauner, jack, jlayton, chuck.lever,
	alex.aring, arnd, adilger, mjguzik, smfrench, richard.henderson,
	mattst88, linmag7, tsbogend, James.Bottomley, deller, davem,
	andreas, idryomov, amarkuze, slava, agruenba, trondmy, anna,
	sfrench, pc, ronniesahlberg, sprasad, tom, bharathsm, shuah,
	miklos, hansg

Hi,

I came upon this "Ability to only open regular files" uapi feature suggestion
from https://uapi-group.org/kernel-features/#ability-to-only-open-regular-files
and thought it would be something I could do as a first patch and get to
know the kernel code a bit better.

The following filesystems have been tested by building and booting the kernel
x86 bzImage in a Fedora 43 VM in QEMU. I have tested with OPENAT2_REGULAR that
regular files can be successfully opened and non-regular files (directory, fifo etc)
return -EFTYPE.
- btrfs
- NFS (loopback)
- SMB (loopback)

Changes in v5:
- EFTYPE is already used in BSDs mentioned in commit message
- consistently return -EFTYPE in all filesystems

Changes in v4:
- changed O_REGULAR to OPENAT2_REGULAR
- OPENAT2_REGULAR does not affect O_PATH
- atomic_open codepaths updated to work properly for OPENAT2_REGULAR
- commit message includes the uapi-group URL
- v3 is at: https://lore.kernel.org/linux-fsdevel/20260127180109.66691-1-dorjoychy111@gmail.com/T/

Changes in v3:
- included motivation about O_REGULAR flag in commit message e.g., programs not wanting to be tricked into opening device nodes
- fixed commit message wrongly referencing ENOTREGULAR instead of ENOTREG
- fixed the O_REGULAR flag in arch/parisc/include/uapi/asm/fcntl.h from 060000000 to 0100000000
- added 2 commits converting arch/{mips,sparc}/include/uapi/asm/fcntl.h O_* macros from hex to octal
- v2 is at: https://lore.kernel.org/linux-fsdevel/20260126154156.55723-1-dorjoychy111@gmail.com/T/

Changes in v2:
- rename ENOTREGULAR to ENOTREG
- define ENOTREG in uapi/asm-generic/errno.h (instead of errno-base.h) and in arch/*/include/uapi/asm/errno.h files
- override O_REGULAR in arch/{alpha,sparc,parisc}/include/uapi/asm/fcntl.h due to clash with include/uapi/asm-generic/fcntl.h
- I have kept the kselftest but now that O_REGULAR and ENOTREG can have different value on different architectures I am not sure if it's right
- v1 is at: https://lore.kernel.org/linux-fsdevel/20260125141518.59493-1-dorjoychy111@gmail.com/T/
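
To make the clash mentioned in the v2 notes concrete, here is a small
compile-time sketch using the constants visible in the v5 diffs. The enum
names are invented for the illustration, and it assumes the generic flag
value has not changed since v2.

```c
/* Values copied from the v5 patches; names are local to this sketch. */
enum {
	GENERIC_OPENAT2_REGULAR	= 040000000,	/* asm-generic/fcntl.h */

	ALPHA_O_PATH		= 040000000,
	ALPHA_O_TMPFILE		= 0100000000,
	ALPHA_OPENAT2_REGULAR	= 0200000000,

	PARISC_O_PATH		= 020000000,
	PARISC_O_TMPFILE	= 040000000,
	PARISC_OPENAT2_REGULAR	= 0100000000,

	SPARC_O_SYNC		= 0x800000,
	SPARC_O_PATH		= 0x1000000,
	SPARC_O_TMPFILE		= 0x2000000,
	SPARC_OPENAT2_REGULAR	= 0x4000000,
};

/* The generic value is already taken on all three architectures... */
_Static_assert(GENERIC_OPENAT2_REGULAR == ALPHA_O_PATH, "alpha clash");
_Static_assert(GENERIC_OPENAT2_REGULAR == PARISC_O_TMPFILE, "parisc clash");
_Static_assert(GENERIC_OPENAT2_REGULAR == SPARC_O_SYNC, "sparc clash");

/* ...while the per-arch overrides do not overlap the neighbouring flags. */
_Static_assert(!(ALPHA_OPENAT2_REGULAR & (ALPHA_O_PATH | ALPHA_O_TMPFILE)),
	       "alpha override");
_Static_assert(!(PARISC_OPENAT2_REGULAR & (PARISC_O_PATH | PARISC_O_TMPFILE)),
	       "parisc override");
_Static_assert(!(SPARC_OPENAT2_REGULAR &
		 (SPARC_O_SYNC | SPARC_O_PATH | SPARC_O_TMPFILE)),
	       "sparc override");
```

This only checks the flags listed here, not every O_* bit on each
architecture.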

Thanks.

Regards,
Dorjoy

Dorjoy Chowdhury (4):
  openat2: new OPENAT2_REGULAR flag support
  kselftest/openat2: test for OPENAT2_REGULAR flag
  sparc/fcntl.h: convert O_* flag macros from hex to octal
  mips/fcntl.h: convert O_* flag macros from hex to octal

 arch/alpha/include/uapi/asm/errno.h           |  2 +
 arch/alpha/include/uapi/asm/fcntl.h           |  1 +
 arch/mips/include/uapi/asm/errno.h            |  2 +
 arch/mips/include/uapi/asm/fcntl.h            | 22 +++++------
 arch/parisc/include/uapi/asm/errno.h          |  2 +
 arch/parisc/include/uapi/asm/fcntl.h          |  1 +
 arch/sparc/include/uapi/asm/errno.h           |  2 +
 arch/sparc/include/uapi/asm/fcntl.h           | 35 +++++++++---------
 fs/ceph/file.c                                |  4 ++
 fs/gfs2/inode.c                               |  6 +++
 fs/namei.c                                    |  4 ++
 fs/nfs/dir.c                                  |  4 ++
 fs/open.c                                     |  4 +-
 fs/smb/client/dir.c                           | 14 ++++++-
 include/linux/fcntl.h                         |  2 +
 include/uapi/asm-generic/errno.h              |  2 +
 include/uapi/asm-generic/fcntl.h              |  4 ++
 tools/arch/alpha/include/uapi/asm/errno.h     |  2 +
 tools/arch/mips/include/uapi/asm/errno.h      |  2 +
 tools/arch/parisc/include/uapi/asm/errno.h    |  2 +
 tools/arch/sparc/include/uapi/asm/errno.h     |  2 +
 tools/include/uapi/asm-generic/errno.h        |  2 +
 .../testing/selftests/openat2/openat2_test.c  | 37 ++++++++++++++++++-
 23 files changed, 127 insertions(+), 31 deletions(-)

-- 
2.53.0



* [PATCH v2] sched/deadline: document new sched_getattr() feature for retrieving current parameters for DEADLINE tasks
From: Tommaso Cucinotta @ 2026-03-04 10:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tommaso Cucinotta, linux-api, Juri Lelli, Jonathan Corbet,
	Shuah Khan, Shashank Balaji, linux-doc, linux-kernel
In-Reply-To: <20260304102843.1373905-1-tommaso.cucinotta@santannapisa.it>

Document in Documentation/scheduler/sched-deadline.rst the new capability of
sched_getattr() to retrieve, for DEADLINE tasks, the runtime left and absolute
deadline (setting the flags syscall parameter to 1), in addition to the static
parameters (obtained with flags=0).
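
As an illustration of the interface described here, a userspace caller
might look like the sketch below. It rests on a few assumptions: the flag
name and value are taken from the documentation text in this patch, struct
dl_attr is a local mirror of the kernel's struct sched_attr layout (renamed
to avoid clashing with libc headers that already define it), and
dl_getattr() is a made-up wrapper.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: name and value as given in the documentation text. */
#ifndef SCHED_GETATTR_FLAG_DL_DYNAMIC
#define SCHED_GETATTR_FLAG_DL_DYNAMIC 1
#endif

/* Local mirror of the kernel's struct sched_attr (56 bytes). */
struct dl_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/*
 * Hypothetical wrapper around the raw syscall. pid 0 means the calling
 * thread. Returns 0 on success or a negative errno value.
 */
int dl_getattr(pid_t pid, unsigned int flags, struct dl_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	if (syscall(SYS_sched_getattr, pid, attr, sizeof(*attr), flags) < 0)
		return -errno;
	return 0;
}
```

Calling dl_getattr(0, 0, &attr) returns the static parameters. With
SCHED_GETATTR_FLAG_DL_DYNAMIC on a SCHED_DEADLINE task, sched_deadline
should hold the absolute deadline in CLOCK_MONOTONIC nanoseconds, which can
be compared against clock_gettime(CLOCK_MONOTONIC, ...). That the runtime
left is reported in sched_runtime is an assumption of this sketch. Kernels
without the feature are expected to reject a nonzero flags value with
EINVAL.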

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
---
 Documentation/scheduler/sched-deadline.rst | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst
index ec543a12..76fdf435 100644
--- a/Documentation/scheduler/sched-deadline.rst
+++ b/Documentation/scheduler/sched-deadline.rst
@@ -628,10 +628,21 @@ Deadline Task Scheduling
   * the new scheduling related syscalls that manipulate it, i.e.,
     sched_setattr() and sched_getattr() are implemented.
 
- For debugging purposes, the leftover runtime and absolute deadline of a
- SCHED_DEADLINE task can be retrieved through /proc/<pid>/sched (entries
- dl.runtime and dl.deadline, both values in ns). A programmatic way to
- retrieve these values from production code is under discussion.
+ The leftover runtime and absolute deadline of a SCHED_DEADLINE task can be
+ read using the sched_getattr() syscall, setting the last syscall parameter
+ flags to the SCHED_GETATTR_FLAG_DL_DYNAMIC=1 value. This updates the
+ runtime left, converts the absolute deadline to the CLOCK_MONOTONIC reference,
+ then returns these parameters to user-space. The absolute deadline is
+ returned as the number of nanoseconds since the CLOCK_MONOTONIC time
+ reference (boot instant), as a u64 in the sched_deadline field of sched_attr,
+ which can represent nearly 585 years since boot time (calling sched_getattr()
+ with flags=0 causes retrieval of the static parameters instead).
+
+ For debugging purposes, these parameters can also be retrieved through
+ /proc/<pid>/sched (entries dl.runtime and dl.deadline, both values in ns),
+ but: this is highly inefficient; the returned runtime left is not updated as
+ done by sched_getattr(); the deadline is provided in kernel rq_clock time
+ reference, that is not directly usable from user-space.
 
 
 4.3 Default behavior

base-commit: f74d204baf9febf96237af6c1d7eff57fba7de36
-- 
2.45.2



* [PATCH v2] sched/deadline: document new sched_getattr() feature for retrieving current parameters for DEADLINE tasks
From: Tommaso Cucinotta @ 2026-03-04 10:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tommaso Cucinotta, linux-api, Jonathan Corbet, Shuah Khan,
	Juri Lelli, Shashank Balaji, linux-doc, linux-kernel
In-Reply-To: <20260303104215.1324243-1-tommaso.cucinotta@santannapisa.it>


Compared to the initially submitted documentation patch, this
version addresses the issue highlighted by Juri of the wrong wrapping
of the commit message, and the one found by the chatbot of the wrong
use of quotes around the flags parameter. I'm also adding "v2" in
the subject line, as requested by Randy.


* Re: [PATCH bpf-next v10 3/8] bpf: Refactor reporting log_true_size for prog_load
From: Leon Hwang @ 2026-03-04  6:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
	Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
	Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
	Amery Hung, Rong Tao, LKML, Linux API,
	open list:KERNEL SELFTEST FRAMEWORK, kernel-patches-bot
In-Reply-To: <CAADnVQJ4E5L8rL-K=yJJZpCeRBvEJZcSKOEQP0kg2ztowhGmvA@mail.gmail.com>



On 4/3/26 13:58, Alexei Starovoitov wrote:
> On Tue, Mar 3, 2026 at 9:47 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> On 4/3/26 00:32, Alexei Starovoitov wrote:
>>> On Wed, Feb 11, 2026 at 7:13 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>>>
>>
>> [...]
>>
>>>> @@ -6241,7 +6244,11 @@ static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size,
>>>>                 err = map_freeze(&attr);
>>>>                 break;
>>>>         case BPF_PROG_LOAD:
>>>> -               err = bpf_prog_load(&attr, uattr, size);
>>>> +               if (from_user && size >= offsetofend(union bpf_attr, log_true_size))
>>>> +                       log_true_size = uattr.user + offsetof(union bpf_attr, log_true_size);
>>>
>>> So you added 'from_user' gating because
>>> you replaced copy_to_bpfptr_offset() with copy_to_user()?
>>> This is a drastic change in behavior and you don't even talk about
>>> it in the commit log.
>>> You said "refactor". This is not a refactoring!
>>>
>>> This is v10. The common_attr feature is useful, but
>>> you really need to think harder about what your patches
>>> are doing.
>>>
>>
>> Refactoring should not introduce any functional changes. If a functional
>> change is involved, it should be factored out of the refactoring commit
>> into a separate commit with an explanation in the commit log.
>>
>> I'll add this to my self-review checklist.
>>
>> The intention of 'from_user' was to replace copy_to_bpfptr_offset() with
>> copy_to_user(), since the log is always copied to the user-space buffer
>> when the log level is not BPF_LOG_KERNEL in
>> kernel/bpf/log.c::bpf_verifier_vlog().
>>
>> The 'from_user' gating will be dropped in v12 to keep this patch as pure
>> refactoring.
> 
> You were told multiple times to avoid copy pasting AI into your emails.
> Sorry, but this crosses the line for me.
> Your patches will be ignored for 2 weeks.

Oops. The above reply was written by my hand. Possibly, the reply
carried LLM smell because I learnt LLM tongue recently.

As you said, I won't send patches for 2 weeks. :-(

Thanks,
Leon



* Re: [PATCH bpf-next v10 3/8] bpf: Refactor reporting log_true_size for prog_load
From: Alexei Starovoitov @ 2026-03-04  5:58 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
	Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
	Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
	Amery Hung, Rong Tao, LKML, Linux API,
	open list:KERNEL SELFTEST FRAMEWORK, kernel-patches-bot
In-Reply-To: <c9cd645f-810b-4dd4-a1ed-27569dca5055@linux.dev>

On Tue, Mar 3, 2026 at 9:47 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> On 4/3/26 00:32, Alexei Starovoitov wrote:
> > On Wed, Feb 11, 2026 at 7:13 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
>
> [...]
>
> >> @@ -6241,7 +6244,11 @@ static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size,
> >>                 err = map_freeze(&attr);
> >>                 break;
> >>         case BPF_PROG_LOAD:
> >> -               err = bpf_prog_load(&attr, uattr, size);
> >> +               if (from_user && size >= offsetofend(union bpf_attr, log_true_size))
> >> +                       log_true_size = uattr.user + offsetof(union bpf_attr, log_true_size);
> >
> > So you added 'from_user' gating because
> > you replaced copy_to_bpfptr_offset() with copy_to_user()?
> > This is a drastic change in behavior and you don't even talk about
> > it in the commit log.
> > You said "refactor". This is not a refactoring!
> >
> > This is v10. The common_attr feature is useful, but
> > you really need to think harder about what your patches
> > are doing.
> >
>
> Refactoring should not introduce any functional changes. If a functional
> change is involved, it should be factored out of the refactoring commit
> into a separate commit with an explanation in the commit log.
>
> I'll add this to my self-review checklist.
>
> The intention of 'from_user' was to replace copy_to_bpfptr_offset() with
> copy_to_user(), since the log is always copied to the user-space buffer
> when the log level is not BPF_LOG_KERNEL in
> kernel/bpf/log.c::bpf_verifier_vlog().
>
> The 'from_user' gating will be dropped in v12 to keep this patch as pure
> refactoring.

You were told multiple times to avoid copy pasting AI into your emails.
Sorry, but this crosses the line for me.
Your patches will be ignored for 2 weeks.

^ permalink raw reply

* Re: [PATCH bpf-next v10 3/8] bpf: Refactor reporting log_true_size for prog_load
From: Leon Hwang @ 2026-03-04  5:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
	Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
	Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
	Amery Hung, Rong Tao, LKML, Linux API,
	open list:KERNEL SELFTEST FRAMEWORK, kernel-patches-bot
In-Reply-To: <CAADnVQKc5H=k-++CHxs+Y1ggptRSLRcACLgVaMgOmt=QBT=dkA@mail.gmail.com>

On 4/3/26 00:32, Alexei Starovoitov wrote:
> On Wed, Feb 11, 2026 at 7:13 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>

[...]

>> @@ -6241,7 +6244,11 @@ static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size,
>>                 err = map_freeze(&attr);
>>                 break;
>>         case BPF_PROG_LOAD:
>> -               err = bpf_prog_load(&attr, uattr, size);
>> +               if (from_user && size >= offsetofend(union bpf_attr, log_true_size))
>> +                       log_true_size = uattr.user + offsetof(union bpf_attr, log_true_size);
> 
> So you added 'from_user' gating because
> you replaced copy_to_bpfptr_offset() with copy_to_user()?
> This is a drastic change in behavior and you don't even talk about
> it in the commit log.
> You said "refactor". This is not a refactoring!
> 
> This is v10. The common_attr feature is useful, but
> you really need to think harder about what your patches
> are doing.
> 

Refactoring should not introduce any functional changes. If a functional
change is involved, it should be factored out of the refactoring commit
into a separate commit with an explanation in the commit log.

I'll add this to my self-review checklist.

The intention of 'from_user' was to replace copy_to_bpfptr_offset() with
copy_to_user(), since the log is always copied to the user-space buffer
when the log level is not BPF_LOG_KERNEL in
kernel/bpf/log.c::bpf_verifier_vlog().

The 'from_user' gating will be dropped in v12 to keep this patch as pure
refactoring.

Thanks,
Leon


^ permalink raw reply

* Re: [PATCH] sched/deadline: document new sched_getattr() feature for retrieving current parameters for DEADLINE tasks
From: Randy Dunlap @ 2026-03-03 23:20 UTC (permalink / raw)
  To: Tommaso Cucinotta, Peter Zijlstra
  Cc: Tommaso Cucinotta, linux-api, Juri Lelli, Jonathan Corbet,
	Shuah Khan, Shashank Balaji, linux-doc, linux-kernel
In-Reply-To: <20260303184313.1356499-1-tommaso.cucinotta@santannapisa.it>

This patch should be marked as v2, with the differences between
v1 and v2 described.



On 3/3/26 10:42 AM, Tommaso Cucinotta wrote:
> Document in Documentation/scheduler/sched-deadline.rst the new capability of
> sched_getattr() to retrieve, for DEADLINE tasks, the runtime left and absolute
> deadline (setting the flags syscall parameter to 1), in addition to the static
> parameters (obtained with flags=0).
> 
> Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
> Acked-by: Juri Lelli <juri.lelli@redhat.com>
> ---
>  Documentation/scheduler/sched-deadline.rst | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst
> index ec543a12..76fdf435 100644
> --- a/Documentation/scheduler/sched-deadline.rst
> +++ b/Documentation/scheduler/sched-deadline.rst
> @@ -628,10 +628,21 @@ Deadline Task Scheduling
>    * the new scheduling related syscalls that manipulate it, i.e.,
>      sched_setattr() and sched_getattr() are implemented.
>  
> - For debugging purposes, the leftover runtime and absolute deadline of a
> - SCHED_DEADLINE task can be retrieved through /proc/<pid>/sched (entries
> - dl.runtime and dl.deadline, both values in ns). A programmatic way to
> - retrieve these values from production code is under discussion.
> + The leftover runtime and absolute deadline of a SCHED_DEADLINE task can be
> + read using the sched_getattr() syscall, setting the last syscall parameter
> + flags to the SCHED_GETATTR_FLAG_DL_DYNAMIC=1 value. This updates the

About the build warning due to the use of  `flags':
If you want smart quotes, just use 'flags'.
If you want italics, use           `flags`.
If you want a code-look (monotype), use ``flags``.

> + runtime left, converts the absolute deadline to the CLOCK_MONOTONIC reference,
> + then returns these parameters to user-space. The absolute deadline is
> + returned as the number of nanoseconds since the CLOCK_MONOTONIC time
> + reference (boot instant), as a u64 in the sched_deadline field of sched_attr,
> + which can represent nearly 585 years since boot time (calling sched_getattr()
> + with flags=0 causes retrieval of the static parameters instead).
> +
> + For debugging purposes, these parameters can also be retrieved through
> + /proc/<pid>/sched (entries dl.runtime and dl.deadline, both values in ns),
> + but: this is highly inefficient; the returned runtime left is not updated as
> + done by sched_getattr(); the deadline is provided in kernel rq_clock time
> + reference, which is not directly usable from user-space.
>  
>  
>  4.3 Default behavior
> 
> base-commit: f74d204baf9febf96237af6c1d7eff57fba7de36

-- 
~Randy


^ permalink raw reply

* [PATCH] sched/deadline: document new sched_getattr() feature for retrieving current parameters for DEADLINE tasks
From: Tommaso Cucinotta @ 2026-03-03 18:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tommaso Cucinotta, linux-api, Juri Lelli, Jonathan Corbet,
	Shuah Khan, Shashank Balaji, linux-doc, linux-kernel
In-Reply-To: <20260303104215.1324243-1-tommaso.cucinotta@santannapisa.it>

Document in Documentation/scheduler/sched-deadline.rst the new capability of
sched_getattr() to retrieve, for DEADLINE tasks, the runtime left and absolute
deadline (setting the flags syscall parameter to 1), in addition to the static
parameters (obtained with flags=0).

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
---
 Documentation/scheduler/sched-deadline.rst | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst
index ec543a12..76fdf435 100644
--- a/Documentation/scheduler/sched-deadline.rst
+++ b/Documentation/scheduler/sched-deadline.rst
@@ -628,10 +628,21 @@ Deadline Task Scheduling
   * the new scheduling related syscalls that manipulate it, i.e.,
     sched_setattr() and sched_getattr() are implemented.
 
- For debugging purposes, the leftover runtime and absolute deadline of a
- SCHED_DEADLINE task can be retrieved through /proc/<pid>/sched (entries
- dl.runtime and dl.deadline, both values in ns). A programmatic way to
- retrieve these values from production code is under discussion.
+ The leftover runtime and absolute deadline of a SCHED_DEADLINE task can be
+ read using the sched_getattr() syscall, setting the last syscall parameter
+ flags to the SCHED_GETATTR_FLAG_DL_DYNAMIC=1 value. This updates the
+ runtime left, converts the absolute deadline to the CLOCK_MONOTONIC reference,
+ then returns these parameters to user-space. The absolute deadline is
+ returned as the number of nanoseconds since the CLOCK_MONOTONIC time
+ reference (boot instant), as a u64 in the sched_deadline field of sched_attr,
+ which can represent nearly 585 years since boot time (calling sched_getattr()
+ with flags=0 causes retrieval of the static parameters instead).
+
+ For debugging purposes, these parameters can also be retrieved through
+ /proc/<pid>/sched (entries dl.runtime and dl.deadline, both values in ns),
+ but: this is highly inefficient; the returned runtime left is not updated as
+ done by sched_getattr(); the deadline is provided in kernel rq_clock time
+ reference, which is not directly usable from user-space.
 
 
 4.3 Default behavior

base-commit: f74d204baf9febf96237af6c1d7eff57fba7de36
-- 
2.45.2
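
The sched_getattr() usage documented by the patch above can be sketched
from user-space roughly as follows. This is a minimal sketch, not code from
the patch: the flag name and its value of 1 come from the patch text, while
struct dl_sched_attr, dl_getattr() and monotonic_now_ns() are hypothetical
names introduced here for illustration. The struct mirrors only the original
48-byte sched_attr layout and is renamed to avoid clashing with C libraries
that already declare struct sched_attr.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Name and value taken from the patch text; defined locally because
 * installed headers may not carry the flag yet. */
#ifndef SCHED_GETATTR_FLAG_DL_DYNAMIC
#define SCHED_GETATTR_FLAG_DL_DYNAMIC 1
#endif

/* Hypothetical local copy of the original 48-byte sched_attr layout. */
struct dl_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;   /* assumed here to carry the runtime left */
	uint64_t sched_deadline;  /* absolute deadline, per the patch text */
	uint64_t sched_period;
};

/* Thin wrapper around the raw syscall; returns 0, or -1 with errno set. */
static int dl_getattr(pid_t pid, struct dl_sched_attr *attr, unsigned int flags)
{
	assert(attr != NULL);
	memset(attr, 0, sizeof(*attr));
	return (int)syscall(SYS_sched_getattr, pid, attr,
			    (unsigned int)sizeof(*attr), flags);
}

/* Current CLOCK_MONOTONIC time in ns, for comparison with sched_deadline. */
static uint64_t monotonic_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
```

A SCHED_DEADLINE task could call dl_getattr(0, &attr,
SCHED_GETATTR_FLAG_DL_DYNAMIC) and, on success, compare attr.sched_deadline
with monotonic_now_ns() to estimate the time left before its deadline.
Kernels without the feature are expected to reject a non-zero flags value,
so a caller should be prepared to fall back to flags=0.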


^ permalink raw reply related

* Re: [PATCH bpf-next v10 3/8] bpf: Refactor reporting log_true_size for prog_load
From: Alexei Starovoitov @ 2026-03-03 16:32 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
	Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
	Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
	Amery Hung, Rong Tao, LKML, Linux API,
	open list:KERNEL SELFTEST FRAMEWORK, kernel-patches-bot
In-Reply-To: <20260211151115.78013-4-leon.hwang@linux.dev>

On Wed, Feb 11, 2026 at 7:13 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> The next commit will add support for reporting logs via extended common
> attributes, including 'log_true_size'.
>
> To prepare for that, refactor the 'log_true_size' reporting logic by
> introducing a new struct bpf_log_attr to encapsulate log-related behavior:
>
>  * bpf_log_attr_init(): initialize log fields, which will support
>    extended common attributes in the next commit.
>  * bpf_log_attr_finalize(): handle log finalization and write back
>    'log_true_size' to userspace.
>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
>  include/linux/bpf.h          |  4 +++-
>  include/linux/bpf_verifier.h | 11 +++++++++++
>  kernel/bpf/log.c             | 25 +++++++++++++++++++++++++
>  kernel/bpf/syscall.c         | 13 ++++++++++---
>  kernel/bpf/verifier.c        | 17 ++++-------------
>  5 files changed, 53 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index cd9b96434904..d4dbcc7ad156 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2913,7 +2913,9 @@ int bpf_check_uarg_tail_zero(bpfptr_t uaddr, size_t expected_size,
>                              size_t actual_size);
>
>  /* verify correctness of eBPF program */
> -int bpf_check(struct bpf_prog **fp, union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size);
> +struct bpf_log_attr;
> +int bpf_check(struct bpf_prog **fp, union bpf_attr *attr, bpfptr_t uattr,
> +             struct bpf_log_attr *attr_log);
>
>  #ifndef CONFIG_BPF_JIT_ALWAYS_ON
>  void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth);
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index ef8e45a362d9..dbd9bdb955b3 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -635,6 +635,17 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log)
>         return log && log->level;
>  }
>
> +struct bpf_log_attr {
> +       char __user *log_buf;
> +       u32 log_size;
> +       u32 log_level;
> +       u32 __user *log_true_size;
> +};
> +
> +int bpf_log_attr_init(struct bpf_log_attr *log, u64 log_buf, u32 log_size, u32 log_level,
> +                     u32 __user *log_true_size);
> +int bpf_log_attr_finalize(struct bpf_log_attr *attr, struct bpf_verifier_log *log);
> +
>  #define BPF_MAX_SUBPROGS 256
>
>  struct bpf_subprog_arg_info {
> diff --git a/kernel/bpf/log.c b/kernel/bpf/log.c
> index a0c3b35de2ce..e31747b84fe2 100644
> --- a/kernel/bpf/log.c
> +++ b/kernel/bpf/log.c
> @@ -863,3 +863,28 @@ void print_insn_state(struct bpf_verifier_env *env, const struct bpf_verifier_st
>         }
>         print_verifier_state(env, vstate, frameno, false);
>  }
> +
> +int bpf_log_attr_init(struct bpf_log_attr *log, u64 log_buf, u32 log_size, u32 log_level,
> +                     u32 __user *log_true_size)
> +{
> +       memset(log, 0, sizeof(*log));
> +       log->log_buf = u64_to_user_ptr(log_buf);
> +       log->log_size = log_size;
> +       log->log_level = log_level;
> +       log->log_true_size = log_true_size;
> +       return 0;
> +}
> +
> +int bpf_log_attr_finalize(struct bpf_log_attr *attr, struct bpf_verifier_log *log)
> +{
> +       u32 log_true_size;
> +       int err;
> +
> +       err = bpf_vlog_finalize(log, &log_true_size);
> +
> +       if (attr->log_true_size && copy_to_user(attr->log_true_size, &log_true_size,
> +                                               sizeof(log_true_size)))
> +               return -EFAULT;
> +
> +       return err;
> +}
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0e231c0b1d04..e86674811996 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2867,7 +2867,7 @@ static int bpf_prog_mark_insn_arrays_ready(struct bpf_prog *prog)
>  /* last field in 'union bpf_attr' used by this command */
>  #define BPF_PROG_LOAD_LAST_FIELD keyring_id
>
> -static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
> +static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, struct bpf_log_attr *attr_log)
>  {
>         enum bpf_prog_type type = attr->prog_type;
>         struct bpf_prog *prog, *dst_prog = NULL;
> @@ -3085,7 +3085,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
>                 goto free_prog_sec;
>
>         /* run eBPF verifier */
> -       err = bpf_check(&prog, attr, uattr, uattr_size);
> +       err = bpf_check(&prog, attr, uattr, attr_log);
>         if (err < 0)
>                 goto free_used_maps;
>
> @@ -6189,7 +6189,10 @@ static int prog_assoc_struct_ops(union bpf_attr *attr)
>  static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size,
>                      bpfptr_t uattr_common, unsigned int size_common)
>  {
> +       bool from_user = !bpfptr_is_kernel(uattr);
>         struct bpf_common_attr attr_common;
> +       u32 __user *log_true_size = NULL;
> +       struct bpf_log_attr attr_log;
>         union bpf_attr attr;
>         int err;
>
> @@ -6241,7 +6244,11 @@ static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size,
>                 err = map_freeze(&attr);
>                 break;
>         case BPF_PROG_LOAD:
> -               err = bpf_prog_load(&attr, uattr, size);
> +               if (from_user && size >= offsetofend(union bpf_attr, log_true_size))
> +                       log_true_size = uattr.user + offsetof(union bpf_attr, log_true_size);

So you added 'from_user' gating because
you replaced copy_to_bpfptr_offset() with copy_to_user()?
This is a drastic change in behavior and you don't even talk about
it in the commit log.
You said "refactor". This is not a refactoring!

This is v10. The common_attr feature is useful, but
you really need to think harder about what your patches
are doing.
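
To illustrate why this counts as a behavior change, here is a simplified
user-space model. It is only a sketch under stated assumptions: struct
model_ptr and the model_*() functions are hypothetical stand-ins, not the
kernel's bpfptr_t or its helpers, and the "user copy" simply refuses
pointers tagged as kernel memory. The model shows that a copy dispatching on
the pointer tag writes the value back for both kinds of caller, while a
user-only copy behind a from_user-style gate silently writes nothing for a
kernel-origin caller.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a pointer tagged as kernel or user memory. */
struct model_ptr {
	void *addr;
	bool is_kernel;
};

/* Stand-in for a user-only copy: returns the number of bytes NOT copied,
 * and refuses destinations tagged as kernel memory. */
static size_t model_copy_to_user(struct model_ptr dst, size_t off,
				 const void *src, size_t len)
{
	if (dst.is_kernel)
		return len;
	memcpy((char *)dst.addr + off, src, len);
	return 0;
}

/* Stand-in for a tag-dispatching copy: works for both kinds of pointer. */
static int model_copy_to_ptr_offset(struct model_ptr dst, size_t off,
				    const void *src, size_t len)
{
	if (!dst.is_kernel)
		return model_copy_to_user(dst, off, src, len) ? -EFAULT : 0;
	memcpy((char *)dst.addr + off, src, len);
	return 0;
}

/* Models the gated variant: kernel-origin callers get no write-back. */
static int model_report_gated(struct model_ptr uattr, size_t off,
			      unsigned int val)
{
	if (uattr.is_kernel)
		return 0; /* gate closed: the value is silently dropped */
	return model_copy_to_user(uattr, off, &val, sizeof(val)) ? -EFAULT : 0;
}
```

In this model, switching a caller from model_copy_to_ptr_offset() to
model_report_gated() leaves user-origin callers unaffected but changes what
a kernel-origin caller observes, which is the kind of difference a commit
log for a "refactor" would need to call out.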

pw-bot: cr

^ permalink raw reply

* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
From: Mathieu Desnoyers @ 2026-03-02 16:56 UTC (permalink / raw)
  To: Florian Weimer
  Cc: André Almeida, kernel-dev, Liam R . Howlett, linux-api,
	Darren Hart, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
	Carlos O'Donell, Michal Hocko, linux-kernel,
	libc-alpha@sourceware.org, Arnd Bergmann,
	Sebastian Andrzej Siewior
In-Reply-To: <lhuqzq2chdw.fsf@oldenburg.str.redhat.com>

On 2026-03-02 11:42, Florian Weimer wrote:
> * Mathieu Desnoyers:
[...]
>> AFAIU we don't need to evaluate this on context switch. We only need
>> to evaluate it at:
>>
>> (a) Signal delivery,
>> (b) Process exit.
> 
> Ah, missed that part.  It changes the rules somewhat.
> 
>> Also, the tradeoff here is not clear cut to me: the only thing the rseq
>> flag would prevent is comparisons of the instruction pointer against a
>> vDSO range at (a) and (b), which are not as performance critical as
>> context switches. I'm not sure it would warrant the added complexity of
>> the rseq flag, and coupling with rseq. Moreover, I'm not convinced that
>> loading an extra rseq flag field from userspace would be faster than
>> just comparing with a known range of vDSO addresses.
> 
> It wouldn't work for the signal case anyway.  That would need space in
> rseq for some kind of write-ahead log of the operation before it's being
> carried out, so that it can be completed on signal delivery/process
> exit.

The signal handler case can be dealt with by making sure we clear the
pending ops list on signal delivery. AFAIU with that in place we would
not need a write-ahead log. But even then, I don't think the rseq flag
would bring any benefit over simple vDSO instruction pointer ranges
comparisons.

Also the rseq flag set/clear cannot be done atomically with respect
to the mutex unlock (success) and pending ops clear state transitions,
so we'd need instruction pointer comparisons anyway.
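
As a rough illustration of the instruction pointer range comparison
mentioned above, a check of this kind could look like the sketch below. The
names struct ip_range and ip_in_range() are hypothetical, and how the
actual vDSO range would be obtained is not covered here; this is not code
from the RFC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical half-open address range [start, end). */
struct ip_range {
	uintptr_t start;
	uintptr_t end;
};

/* True if ip lies inside the range. The unsigned subtraction folds the
 * lower and upper bound checks into a single comparison: when ip is below
 * start, the difference wraps around to a huge value and the test fails. */
static bool ip_in_range(uintptr_t ip, const struct ip_range *r)
{
	return ip - r->start < r->end - r->start;
}
```

A check like this needs no load from user-space memory, which matches the
argument that it is unlikely to be slower than reading an extra flag field.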

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply

