Linux userland API discussions
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-27  8:19 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826205057.GC1603531@mit.edu>

On Tue, Aug 26, 2025 at 04:50:57PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 26, 2025 at 07:47:30PM +0200, Mickaël Salaün wrote:
> > 
> >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> >   on a regular file and returns 0 if execution of this file would be
> >   allowed, ignoring the file format and then the related interpreter
> >   dependencies (e.g. ELF libraries, script’s shebang).
> 
> But if that's it, why can't the script interpreter (python, bash,
> etc.), before executing the script, check for executability via
> faccessat(2) or fstat(2)?

From commit a5874fde3c08 ("exec: Add a new AT_EXECVE_CHECK flag to
execveat(2)"):

    This is different from faccessat(2) + X_OK which only checks a subset of
    access rights (i.e. inode permission and mount options for regular
    files), but not the full context (e.g. all LSM access checks).  The main
    use case for access(2) is for SUID processes to (partially) check access
    on behalf of their caller.  The main use case for execveat(2) +
    AT_EXECVE_CHECK is to check if a script execution would be allowed,
    according to all the different restrictions in place.  Because the use
    of AT_EXECVE_CHECK follows the exact kernel semantic as for a real
    execution, user space gets the same error codes.


> 
> The whole O_DENY_WRITE discussion seemed to imply that AT_EXECVE_CHECK
> was doing more than just the executability check?

I would say that AT_EXECVE_CHECK does a full executability check
(with the full caller's credentials checked against the currently
enforced security policy).
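
To make the difference concrete, here is a minimal user-space sketch (not
taken from the patch series or the kernel samples) that runs both checks
on a path.  The helper names are made up for illustration, and the
fallback AT_EXECVE_CHECK value is an assumption based on
include/uapi/linux/fcntl.h; kernels that predate the flag should reject
it with EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EXECVE_CHECK
#define AT_EXECVE_CHECK 0x10000 /* assumed value, see uapi/linux/fcntl.h */
#endif

/* Subset check: inode permission bits and mount options only. */
static int check_with_faccessat(const char *path)
{
	return faccessat(AT_FDCWD, path, X_OK, 0) == 0 ? 0 : -errno;
}

/*
 * Full check: goes through the same kernel path as a real execution
 * (including LSM hooks), but nothing is executed when the flag is
 * supported.  Returns 0 if execution would be allowed, -errno otherwise.
 */
static int check_with_execveat(const char *path)
{
	char *const argv[] = { (char *)path, NULL };
	char *const envp[] = { NULL };

	if (syscall(SYS_execveat, AT_FDCWD, path, argv, envp,
		    AT_EXECVE_CHECK) == 0)
		return 0;
	return -errno;
}
```

The two helpers can disagree: faccessat(2) may return 0 for a file that
an LSM policy would refuse to execute, in which case only the second
helper reports the denial.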

The rationale to add O_DENY_WRITE (which is now abandoned) was to avoid a race
condition between the check and the full read.  Indeed, with a full
execveat(2), the kernel write-locks the file to avoid such an issue (which can
lead to other issues).

> 
> > There is no other way for user space to reliably check executability of
> > files (taking into account all enforced security
> > policies/configurations).
> 
> Why doesn't faccessat(2) or fstat(2) suffice?  This is why having a
> more substantive requirements and design doc might be helpful.  It
> appears you have some assumptions that perhaps other kernel developers
> are not aware of.  I certainly seem to be missing something.....

My reasoning was to explain the rationale for a kernel feature in the commit
message, and the user doc (why and how to use it) in the user-facing
documentation.  Documentation improvements are welcome!


* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-27  8:19 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Jeff Xu, Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner,
	Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <CABi2SkUJ1PDm_uri=4o+C13o5wFQD=xA7zVKU-we+unsEDm3dw@mail.gmail.com>

On Tue, Aug 26, 2025 at 01:29:55PM -0700, Jeff Xu wrote:
> Hi Mickaël
> 
> On Tue, Aug 26, 2025 at 5:39 AM Mickaël Salaün <mic@digikod.net> wrote:
> >
> > On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote:
> > > Hi Mickaël
> > >
> > > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > >
> > > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> > > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > >
> > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > > > > > passed file descriptors).  This changes the state of the opened file by
> > > > > > > > making it read-only until it is closed.  The main use case is for script
> > > > > > > > interpreters to get the guarantee that script's content cannot be altered
> > > > > > > > while being read and interpreted.  This is useful for generic distros
> > > > > > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > > > > > >
> > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > > > > > property on files with deny_write_access().  This new O_DENY_WRITE makes
> > > > > > >
> > > > > > > The kernel actually tried to get rid of this behavior on execve() in
> > > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44; but sadly that had
> > > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > > > > > because it broke userspace assumptions.
> > > > > >
> > > > > > Oh, good to know.
> > > > > >
> > > > > > >
> > > > > > > > it widely available.  This is similar to what other OSs may provide
> > > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > > > > > >
> > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > > > > > removed for security reasons; as
> > > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > > > > > >
> > > > > > > |        MAP_DENYWRITE
> > > > > > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > > > > > |               signaled that attempts to write to the underlying file
> > > > > > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > > > > > |               of-service attacks.)"
> > > > > > >
> > > > > > > It seems to me that the same issue applies to your patch - it would
> > > > > > > allow unprivileged processes to essentially lock files such that other
> > > > > > > processes can't write to them anymore. This might allow unprivileged
> > > > > > > users to prevent root from updating config files or stuff like that if
> > > > > > > they're updated in-place.
> > > > > >
> > > > > > Yes, I agree, but since it is the case for executed files I thought it
> > > > > > was worth starting a discussion on this topic.  This new flag could be
> > > > > > restricted to executable files, but we should avoid system-wide locks
> > > > > > like this.  I'm not sure how Windows handles these issues though.
> > > > > >
> > > > > > Anyway, we should rely on the access control policy to control write and
> > > > > > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > > > > > the references and the background!
> > > > >
> > > > > I'm confused.  I understand that there are many contexts in which one
> > > > > would want to prevent execution of unapproved content, which might
> > > > > include preventing a given process from modifying some code and then
> > > > > executing it.
> > > > >
> > > > > I don't understand what these deny-write features have to do with it.
> > > > > These features merely prevent someone from modifying code *that is
> > > > > currently in use*, which is not at all the same thing as preventing
> > > > > modifying code that might get executed -- one can often modify
> > > > > contents *before* executing those contents.
> > > >
> > > > The order of checks would be:
> > > > 1. open script with O_DENY_WRITE
> > > > 2. check executability with AT_EXECVE_CHECK
> > > > 3. read the content and interpret it
> > > >
> > > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving.
> > >
> > > AT_EXECVE_CHECK is not just for scripting languages. It could also
> > > work with bytecodes like Java, for example. If we let the Java runtime
> > > call AT_EXECVE_CHECK before loading the bytecode, the LSM could
> > > develop a policy based on that.
> >
> > Sure, I'm using "script" to make it simple, but this applies to other
> > use cases.
> >
> That makes sense.
> 
> > >
> > > > The deny-write feature was to guarantee that there is no race condition
> > > > between step 2 and 3.  All these checks are supposed to be done by a
> > > > trusted interpreter (which is allowed to be executed).  The
> > > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > > > associated security policies) allowed the *current* content of the file
> > > > to be executed.  Whatever happens before or after that (wrt.
> > > > O_DENY_WRITE) should be covered by the security policy.
> > > >
> > > Agree, the race problem needs to be solved for AT_EXECVE_CHECK to be useful.
> > >
> > > Enforcing non-write for the path that stores scripts or bytecodes can
> > > be challenging due to historical or backward compatibility reasons.
> > > Since AT_EXECVE_CHECK provides a mechanism to check the file right
> > > before it is used, we can assume it will detect any "problem" that
> > > happened before that, (e.g. the file was overwritten). However, that
> > > also imposes two additional requirements:
> > > 1> the file doesn't change while AT_EXECVE_CHECK does the check.
> >
> > This is already the case, so any kind of LSM checks are good.
> >
> May I ask how this is done?  Does some code in do_open_execat() do this?
> Apologies if this is a basic question.

do_open_execat() calls exe_file_deny_write_access()

> 
> > > 2>The file content kept by the process remains unchanged after passing
> > > the AT_EXECVE_CHECK.
> >
> > The goal of this patch was to avoid such race condition in the case
> > where executable files can be updated.  But in most cases it should not
> > be a security issue (because processes allowed to write to executable
> > files should be trusted), but this could still lead to bugs (because of
> > inconsistent file content, half-updated).
> >
> There is also a time gap between:
> a> the time of AT_EXECVE_CHECK
> b> the time that the app opens the file for execution.
> right?  That is another potential attack path (though this is not the case I
> mentioned previously).

As explained in the documentation, to avoid this specific race
condition, interpreters should open the script once, check the FD with
AT_EXECVE_CHECK, and then read the content with the same FD.
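
A minimal sketch of that sequence, assuming a kernel that provides
AT_EXECVE_CHECK (the helper name and the fallback flag value are
illustrative assumptions, not part of any existing API):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef AT_EXECVE_CHECK
#define AT_EXECVE_CHECK 0x10000 /* assumed value, see uapi/linux/fcntl.h */
#endif

/*
 * Open the script once, ask the kernel whether executing this FD would
 * be allowed, then read from the very same FD.  Returns the number of
 * bytes read into buf, or a negative errno.
 */
static ssize_t load_checked_script(const char *path, char *buf, size_t len)
{
	char *const argv[] = { (char *)path, NULL };
	char *const envp[] = { NULL };
	ssize_t n;
	int fd, err;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* Check the opened file itself, not whatever the path names now. */
	if (syscall(SYS_execveat, fd, "", argv, envp,
		    AT_EMPTY_PATH | AT_EXECVE_CHECK) != 0) {
		err = errno;
		close(fd);
		return -err;
	}

	n = read(fd, buf, len);
	err = errno;
	close(fd);
	return n < 0 ? -err : n;
}
```

Because the check and the read share one FD, replacing the path between
the two steps has no effect.  This does not protect against in-place
writes to the same inode, which is the remaining race discussed in this
thread.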

> 
> For the case I mentioned previously, I have to think more if the race
> condition is a bug or security issue.
> IIUC, two solutions are discussed so far:
> 1> the process could write to fs to update the script.  However, for
> execution, the process still uses the copy that passed the
> AT_EXECVE_CHECK. (snapshot solution by Andy Lutomirski)

Yes, the snapshot solution would be the best, but I guess it would rely
on filesystems to support this feature.

> or 2> the process blocks the write while opening the file as read only
> and executing the script. (this seems to be the approach of this
> patch).

Yes, and this is not something we want anymore.

> 
> I wonder if there are other ideas.

I don't see other efficient ways to give the same guarantees.


* Re: [PATCH v3 1/2] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Aleksa Sarai @ 2025-08-27  9:42 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250826083227.2611457-2-safinaskar@zohomail.com>


On 2025-08-26, Askar Safin <safinaskar@zohomail.com> wrote:
> My edit is based on experiments and reading Linux code
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>
> ---
>  man/man2/mount.2 | 27 ++++++++++++++++++++++++---
>  1 file changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 5d83231f9..599c2d6fa 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -405,7 +405,25 @@ flag can be used with
>  to modify only the per-mount-point flags.
>  .\" See https://lwn.net/Articles/281157/
>  This is particularly useful for setting or clearing the "read-only"
> -flag on a mount without changing the underlying filesystem.
> +flag on a mount without changing the underlying filesystem parameters.

When reading the whole sentence, this feels a bit incomplete
("filesystem parameters ... of what?"). Maybe

  This is particularly useful for setting or clearing the "read-only"
  flag on a mount without changing the underlying filesystem's
  filesystem parameters.

or

  This is particularly useful for setting or clearing the "read-only"
  flag on a mount without changing the filesystem parameters of the
  underlying filesystem.

would be better?

That one nit aside, feel free to take my

Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>

> +The
> +.I data
> +argument is ignored if
> +.B MS_REMOUNT
> +and
> +.B MS_BIND
> +are specified.
> +The mount point will
> +have its existing per-mount-point flags
> +cleared and replaced with those in
> +.IR mountflags .
> +This means that
> +if you wish to preserve
> +any existing per-mount-point flags,
> +you need to include them in
> +.IR mountflags ,
> +along with the per-mount-point flags you wish to set
> +(or with the flags you wish to clear missing).
>  Specifying
>  .I mountflags
>  as:
> @@ -416,8 +434,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
>  .EE
>  .in
>  .P
> -will make access through this mountpoint read-only, without affecting
> -other mounts.
> +will make access through this mount point read-only
> +(clearing all other per-mount-point flags),
> +without affecting
> +other mounts
> +of this filesystem.
>  .\"
>  .SS Creating a bind mount
>  If
> -- 
> 2.47.2
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/



* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Aleksa Sarai @ 2025-08-27 10:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Mickaël Salaün, Al Viro, Christian Brauner, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module, Andy Lutomirski, Jeff Xu
In-Reply-To: <CAG48ez1XjUdcFztc_pF2qcoLi7xvfpJ224Ypc=FoGi-Px-qyZw@mail.gmail.com>


On 2025-08-22, Jann Horn <jannh@google.com> wrote:
> On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > passed file descriptors).  This changes the state of the opened file by
> > making it read-only until it is closed.  The main use case is for script
> > interpreters to get the guarantee that script's content cannot be altered
> > while being read and interpreted.  This is useful for generic distros
> > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> >
> > Both execve(2) and the IOCTL to enable fsverity can already set this
> > property on files with deny_write_access().  This new O_DENY_WRITE makes
> 
> The kernel actually tried to get rid of this behavior on execve() in
> commit 2a010c41285345da60cece35575b4e0af7e7bf44; but sadly that had
> to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> because it broke userspace assumptions.

Also the ETXTBSY behaviour for binaries is not always guaranteed to
block writes to the file. When we were discussing this back in 2021 and
when we initially removed it, I remember there being some fairly trivial
ways to get around it anyway (but because process mm is mapped with
MAP_PRIVATE, writes aren't seen by the actual process).

> > it widely available.  This is similar to what other OSs may provide
> > e.g., opening a file with only FILE_SHARE_READ on Windows.
> 
> We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> removed for security reasons; as
> https://man7.org/linux/man-pages/man2/mmap.2.html says:
> 
> |        MAP_DENYWRITE
> |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> |               signaled that attempts to write to the underlying file
> |               should fail with ETXTBSY.  But this was a source of denial-
> |               of-service attacks.)"
> 
> It seems to me that the same issue applies to your patch - it would
> allow unprivileged processes to essentially lock files such that other
> processes can't write to them anymore. This might allow unprivileged
> users to prevent root from updating config files or stuff like that if
> they're updated in-place.

Agreed, and this was one of the major issues with the also-now-removed
mandatory locking as well.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/



* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Aleksa Sarai @ 2025-08-27 10:29 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, Andy Lutomirski, Jeff Xu
In-Reply-To: <20250822170800.2116980-2-mic@digikod.net>


On 2025-08-22, Mickaël Salaün <mic@digikod.net> wrote:
> Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> passed file descriptors).  This changes the state of the opened file by
> making it read-only until it is closed.  The main use case is for script
> interpreters to get the guarantee that script's content cannot be altered
> while being read and interpreted.  This is useful for generic distros
> that may not have a write-xor-execute policy.  See commit a5874fde3c08
> ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> 
> Both execve(2) and the IOCTL to enable fsverity can already set this
> property on files with deny_write_access().  This new O_DENY_WRITE makes
> it widely available.  This is similar to what other OSs may provide
> e.g., opening a file with only FILE_SHARE_READ on Windows.
> 
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jeff Xu <jeffxu@chromium.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Paul Moore <paul@paul-moore.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Reported-by: Robert Waite <rowait@microsoft.com>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Link: https://lore.kernel.org/r/20250822170800.2116980-2-mic@digikod.net
> ---
>  fs/fcntl.c                       | 26 ++++++++++++++++++++++++--
>  fs/file_table.c                  |  2 ++
>  fs/namei.c                       |  6 ++++++
>  include/linux/fcntl.h            |  2 +-
>  include/uapi/asm-generic/fcntl.h |  4 ++++
>  5 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 5598e4d57422..0c80c0fbc706 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -34,7 +34,8 @@
>  
>  #include "internal.h"
>  
> -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
> +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
> +	O_DENY_WRITE)
>  
>  static int setfl(int fd, struct file * filp, unsigned int arg)
>  {
> @@ -80,8 +81,29 @@ static int setfl(int fd, struct file * filp, unsigned int arg)
>  			error = 0;
>  	}
>  	spin_lock(&filp->f_lock);
> +
> +	if (arg & O_DENY_WRITE) {
> +		/* Only regular files. */
> +		if (!S_ISREG(inode->i_mode)) {
> +			error = -EINVAL;
> +			goto unlock;
> +		}
> +
> +		/* Only sets once. */
> +		if (!(filp->f_flags & O_DENY_WRITE)) {
> +			error = exe_file_deny_write_access(filp);
> +			if (error)
> +				goto unlock;
> +		}
> +	} else {
> +		if (filp->f_flags & O_DENY_WRITE)
> +			exe_file_allow_write_access(filp);
> +	}

I appreciate the goal of making this something that can be cleared
(presumably for interpreters that mmap(MAP_PRIVATE) their scripts), but
making a security-related flag this easy to clear seems like a footgun
(any library function could mask O_DENY_WRITE or forget to copy the old
flag values).
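
As an illustration of that footgun, here is a small user-space sketch of
the two common ways code sets O_NONBLOCK through F_SETFL.  O_APPEND
stands in for a security-relevant status flag, since O_DENY_WRITE does
not exist in any released kernel; the function names are hypothetical.

```c
#include <fcntl.h>

/* Correct: read-modify-write keeps every status flag already set. */
static int set_nonblock_preserving(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Buggy: overwrites the status flags, silently clearing the others. */
static int set_nonblock_clobbering(int fd)
{
	return fcntl(fd, F_SETFL, O_NONBLOCK);
}
```

With the proposed setfl() change, the second variant would drop the
deny-write state as a side effect, without any error being reported to
the caller.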

> +
>  	filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
>  	filp->f_iocb_flags = iocb_flags(filp);
> +
> +unlock:
>  	spin_unlock(&filp->f_lock);
>  
>   out:
> @@ -1158,7 +1180,7 @@ static int __init fcntl_init(void)
>  	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
>  	 * is defined as O_NONBLOCK on some platforms and not on others.
>  	 */
> -	BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
> +	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
>  		HWEIGHT32(
>  			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
>  			__FMODE_EXEC));
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 81c72576e548..6ba896b6a53f 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -460,6 +460,8 @@ static void __fput(struct file *file)
>  	locks_remove_file(file);
>  
>  	security_file_release(file);
> +	if (unlikely(file->f_flags & O_DENY_WRITE))
> +		exe_file_allow_write_access(file);
>  	if (unlikely(file->f_flags & FASYNC)) {
>  		if (file->f_op->fasync)
>  			file->f_op->fasync(-1, file, 0);
> diff --git a/fs/namei.c b/fs/namei.c
> index cd43ff89fbaa..366530bf937d 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3885,6 +3885,12 @@ static int do_open(struct nameidata *nd,
>  	error = may_open(idmap, &nd->path, acc_mode, open_flag);
>  	if (!error && !(file->f_mode & FMODE_OPENED))
>  		error = vfs_open(&nd->path, file);
> +	if (!error && (open_flag & O_DENY_WRITE)) {
> +		if (S_ISREG(file_inode(file)->i_mode))
> +			error = exe_file_deny_write_access(file);
> +		else
> +			error = -EINVAL;
> +	}
>  	if (!error)
>  		error = security_file_post_open(file, op->acc_mode);
>  	if (!error && do_truncate)
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index a332e79b3207..dad14101686f 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -10,7 +10,7 @@
>  	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
>  	 O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \
>  	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> -	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> +	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_DENY_WRITE)

I don't like this patch for the same reasons Christian has already said,
but in addition -- you cannot just add new open(2) flags like this.

Unlike openat2(2), classic open(2) does not reject invalid flag bits, so
any new flag must be designed so that old kernels will return an error
for that flag combination, which ensures that:

 * No old programs set those bits inadvertently, which lets us avoid
   breaking userspace in some really fun and hard-to-debug ways.
 * For security-related bits, new programs running on old kernels do
   not think they are getting a security property that they aren't
   actually getting.

O_TMPFILE's bitflag soup is an example of how you can resolve this issue
for open(2), but I would suggest that authors of new O_* flags seriously
consider making their flags openat2(2)-only unless it's trivial to get
the above behaviour.
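
A rough user-space sketch of the difference, assuming kernel headers
that ship <linux/openat2.h> (Linux 5.6 or later) and an architecture
using the asm-generic flag values, where the bit proposed for
O_DENY_WRITE is currently unassigned.  The wrapper names are
illustrative only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Bit proposed for O_DENY_WRITE; not a valid open flag today. */
#define UNKNOWN_OPEN_FLAG 040000000

/* openat2(2): unknown bits in how.flags are rejected with EINVAL. */
static int open_strict(const char *path, unsigned long long flags)
{
	struct open_how how;
	long fd;

	memset(&how, 0, sizeof(how));
	how.flags = flags;
	fd = syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof(how));
	return fd < 0 ? -errno : (int)fd;
}

/* open(2): unknown bits are masked off by the kernel and ignored. */
static int open_legacy(const char *path, int flags)
{
	int fd = open(path, flags);

	return fd < 0 ? -errno : fd;
}
```

On a kernel without O_DENY_WRITE, open_legacy("/dev/null", O_RDONLY |
UNKNOWN_OPEN_FLAG) is expected to succeed as if the flag were absent,
while open_strict() with the same flags is expected to fail, which is
the signal a security-sensitive caller needs.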

>  /* List of all valid flags for the how->resolve argument: */
>  #define VALID_RESOLVE_FLAGS \
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..facd9136f5af 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -91,6 +91,10 @@
>  /* a horrid kludge trying to make sure that this will fail on old kernels */
>  #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
>  
> +#ifndef O_DENY_WRITE
> +#define O_DENY_WRITE	040000000
> +#endif
> +
>  #ifndef O_NDELAY
>  #define O_NDELAY	O_NONBLOCK
>  #endif

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/



* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pratyush Yadav @ 2025-08-27 14:01 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Jason Gunthorpe, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bB9r_pMzd0VbLsAGTwh8kvV_o3rFM_W--drutewomr1ZQ@mail.gmail.com>

On Tue, Aug 26 2025, Pasha Tatashin wrote:

>> > The existing interface, with the addition of passing a pidfd, provides
>> > the necessary flexibility without being invasive. The change would be
>> > localized to the new code that performs the FD retrieval and wouldn't
>> > involve spoofing current or making widespread changes.
>> > For example, to handle cgroup charging for a memfd, the flow inside
>> > memfd_luo_retrieve() would look something like this:
>> >
>> > task = get_pid_task(target_pid, PIDTYPE_PID);
>> > mm = get_task_mm(task);
>> >     // ...
>> >     folio = kho_restore_folio(phys);
>> >     // Charge to the target mm, not 'current->mm'
>> >     mem_cgroup_charge(folio, mm, ...);
>> > mmput(mm);
>> > put_task_struct(task);
>> >
>> > This approach seems quite contained, and does not modify the existing
>> > interfaces. It avoids the need for the kernel to manage the entire
>> > session state and its associated security model.

Even with sessions, I don't think the kernel has to deal with the
security model. /dev/liveupdate can still be single-open only, with only
luod getting access to it. Then the kernel just hands over sessions to
luod (maybe with a new ioctl LIVEUPDATE_IOCTL_CREATE_SESSION), and luod
takes care of the security model and lifecycle. If luod crashes and
loses its handle to /dev/liveupdate, all the sessions associated with it
go away too.

Essentially, sessions from the kernel's perspective would just be a
container to group different resources together. I think this adds a
small bit of complexity on the session management and serialization
side, but I think it will save complexity on participating subsystems.

>>
>> Except it doesn't work like that in all places, iommufd for example
>> uses GFP_KERNEL_ACCOUNT which relies on current.
>
> That's a good point. For kernel allocations, I don't see a clean way
> to account for a different process.
>
> We should not be doing major allocations during the retrieval process
> itself. Ideally, the kernel would restore an FD using only the
> preserved folio data (that we can cleanly charge), and then let the
> user process perform any subsequent actions that might cause new
> kernel memory allocations. However, I can see how that might not be
> practical for all handlers.
>
> Perhaps, we should add session extensions to the kernel as follow-up
> after this series lands, we would also need to rewrite luod design
> accordingly to move some of the sessions logic into the kernel.

I know the KHO is supposed to not be backwards compatible yet. What is
the goal for the LUO APIs? Are they also not backwards compatible? If
not, I think we should also consider how sessions will play into
backwards compatibility. For example, once we add sessions, what happens
to the older versions of luod that directly call preserve or unpreserve?

-- 
Regards,
Pratyush Yadav


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-08-27 15:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162019.GD2130239@nvidia.com>

Hi Jason,

Thanks for the review.

On Tue, Aug 26 2025, Jason Gunthorpe wrote:

> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>
>> +	/*
>> +	 * Most of the space should be taken by preserved folios. So take its
>> +	 * size, plus a page for other properties.
>> +	 */
>> +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
>> +	if (!fdt) {
>> +		err = -ENOMEM;
>> +		goto err_unpin;
>> +	}
>
> This doesn't seem to have any versioning scheme, it really should..

It does. See the "compatible" property.

	static const char memfd_luo_compatible[] = "memfd-v1";

	static struct liveupdate_file_handler memfd_luo_handler = {
		.ops = &memfd_luo_file_ops,
		.compatible = memfd_luo_compatible,
	};

This goes into the LUO FDT:

	static int luo_files_to_fdt(struct xarray *files_xa_out)
	[...]
	xa_for_each(files_xa_out, token, h) {
		[...]
		ret = fdt_property_string(luo_file_fdt_out, "compatible",
					  h->fh->compatible);

So this function only gets called for version 1.
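
As a simplified, stand-alone illustration of the idea (this is not the
LUO code; the struct and the lookup helper are hypothetical stand-ins),
matching on the "compatible" string means data written by an unknown
version never reaches a handler:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct liveupdate_file_handler. */
struct file_handler {
	const char *compatible;
};

static const struct file_handler handlers[] = {
	{ .compatible = "memfd-v1" },
};

/* Return the handler whose compatible string matches exactly, or NULL. */
static const struct file_handler *find_handler(const char *compatible)
{
	size_t i;

	for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++) {
		if (strcmp(handlers[i].compatible, compatible) == 0)
			return &handlers[i];
	}
	return NULL;
}
```

A future format change would register a second entry such as
"memfd-v2" (a made-up name here) instead of reinterpreting v1 data.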

>
>> +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
>> +				       (void **)&preserved_folios);
>> +	if (err) {
>> +		pr_err("Failed to reserve folios property in FDT: %s\n",
>> +		       fdt_strerror(err));
>> +		err = -ENOMEM;
>> +		goto err_free_fdt;
>> +	}
>
> Yuk.
>
> This really wants some luo helper
>
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'
>
> Which would get a linearized list of pages in the vmap to hold the
> array and then allocate some structure to record the page list and
> return back the u64 of the phys_addr of the top of the structure to
> store in whatever.
>
> Getting fdt to allocate the array inside the fds is just not going to
> work for anything of size.

Yep, I agree. This version already runs into size limits of around 1 GiB
due to the FDT being limited to MAX_PAGE_ORDER, since that is the
largest contiguous piece of memory folio_alloc() can give us. On top of
that, the FDT size is limited to 32 bits. While that is very large, it
isn't unreasonable to expect metadata exceeding it for some use cases
(4 GiB is only 0.4% of 1 TiB and there are systems a lot larger than
that around).
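
As a rough sanity check on the ~1 GiB figure, here is a back-of-the-envelope
sketch. It assumes 4 KiB pages, a MAX_PAGE_ORDER of 10, and 16 bytes per
preserved folio entry (two u64 fields, foliodesc and index). All three values
are assumptions for illustration; the real ones depend on the architecture,
the kernel config and the actual struct layout, and the extra page reserved
for other properties is ignored.

```c
#include <stdint.h>

/* Assumed values for illustration only; real values depend on arch/config. */
#define ASSUMED_PAGE_SHIFT      12   /* 4 KiB pages */
#define ASSUMED_MAX_PAGE_ORDER  10   /* largest allocation: 2^10 pages */
#define ASSUMED_ENTRY_SIZE      16   /* two u64 fields per preserved folio */

/* Largest physically contiguous buffer the FDT can live in, in bytes. */
static uint64_t max_fdt_bytes(void)
{
	return (uint64_t)1 << (ASSUMED_PAGE_SHIFT + ASSUMED_MAX_PAGE_ORDER);
}

/* Upper bound on file size if every entry describes one order-0 folio. */
static uint64_t max_preservable_bytes(void)
{
	uint64_t entries = max_fdt_bytes() / ASSUMED_ENTRY_SIZE;

	return entries << ASSUMED_PAGE_SHIFT;
}
```

Under these assumptions the FDT tops out at 4 MiB, which holds 262144 entries
and therefore describes at most 1 GiB of order-0 folios.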

I think we need something like a luo_xarray data structure that users like
memfd (and later hugetlb and guest_memfd and maybe others) can build to
make serialization easier. It will cover both contiguous arrays and
arrays with some holes in them.

I did it this way mainly to keep things simple and get things out. But
Pasha already mentioned he is running into this limit for some tests, so
I think I will experiment around with a serialized xarray design.

>
>> +	for (; i < nr_pfolios; i++) {
>> +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
>> +		phys_addr_t phys;
>> +		u64 index;
>> +		int flags;
>> +
>> +		if (!pfolio->foliodesc)
>> +			continue;
>> +
>> +		phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
>> +		folio = kho_restore_folio(phys);
>> +		if (!folio) {
>> +			pr_err("Unable to restore folio at physical address: %llx\n",
>> +			       phys);
>> +			goto put_file;
>> +		}
>> +		index = pfolio->index;
>> +		flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
>> +
>> +		/* Set up the folio for insertion. */
>> +		/*
>> +		 * TODO: Should find a way to unify this and
>> +		 * shmem_alloc_and_add_folio().
>> +		 */
>> +		__folio_set_locked(folio);
>> +		__folio_set_swapbacked(folio);
>> 
>> +		ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
>> +		if (ret) {
>> +			pr_err("shmem: failed to charge folio index %d: %d\n",
>> +			       i, ret);
>> +			goto unlock_folio;
>> +		}
>
> [..]
>
>> +		folio_add_lru(folio);
>> +		folio_unlock(folio);
>> +		folio_put(folio);
>> +	}
>
> Probably some consolidation will be needed to make this less
> duplicated..

Maybe. I do have that as a TODO item, but I took a quick look today and
I am not sure if it will make things simple enough. There are a few
places that add a folio to the shmem page cache, and all of them have
subtle differences and consolidating them all might be tricky. Let me
give it a shot...

>
> But overall I think just using the memfd_luo_preserved_folio as the
> serialization is entirely fine, I don't think this needs anything more
> complicated.
>
> What it does need is an alternative to the FDT with versioning.

As I explained above, the versioning is already there. Beyond that, why
do you think a raw C struct is better than FDT? It is just another way
of expressing the same information. FDT is a bit more cumbersome to
write and read, but comes at the benefit of more introspect-ability.

>
> Which seems to me to be entirely fine as:
>
>  struct memfd_luo_v0 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios;
>  };
>
>  struct memfd_luo_v0 memfd_luo_v0 = {.size = size, .pos = file->f_pos, .folios = folios};
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>
> Which also shows the actual data needing to be serialized comes from
> more than one struct and has to be marshaled in code, somehow, to a
> single struct.
>
> Then I imagine a fairly simple forwards/backwards story. If something
> new is needed that is non-optional, lets say you compress the folios
> list to optimize holes:
>
>  struct memfd_luo_v1 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios_list_with_holes;
>  };
>
> Obviously a v0 kernel cannot parse this, but in this case a v1 aware
> kernel could optionally duplicate and write out the v0 format as well:
>
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>  luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);

I think what you describe here is essentially how LUO works currently,
just that the mechanisms are a bit different.

For example, instead of the subsystem calling luo_store_object(), the
LUO core calls back into the subsystem at the appropriate time to let it
populate the object. See memfd_luo_prepare() and the data argument. The
version is decided by the compatible string with which the handler was
registered.

Since LUO knows when to start serializing what, I think this flow of
calling into the subsystem and letting it fill in an object that LUO
tracks and hands over makes a lot of sense.

>
> Then the rule is fairly simple, when the successor kernel goes to
> deserialize it asks luo for the versions it supports:
>
>  if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
>     restore_v1(&memfd_luo_v1)
>  else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
>     restore_v0(&memfd_luo_v0)
>  else
>     luo_failure("Do not understand this");

Similarly, on restore side, the new kernel can register handlers of all
the versions it can deal with, and LUO core takes care of calling into
the right callback. See memfd_luo_retrieve() for example. If we now have
a v2, the new kernel can simply define a new handler for v2 and add a
new memfd_luo_retrieve_v2().
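
To make the idea concrete, here is a minimal userspace sketch of dispatching
on the compatible string. It is not the LUO code: struct demo_handler, the
demo_* functions and the "memfd-v2" string are all hypothetical names made up
for illustration, and the real handler struct carries more than this.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a LUO file handler; not the real kernel struct. */
struct demo_handler {
	const char *compatible;            /* e.g. "memfd-v1" */
	int (*retrieve)(const void *data); /* deserialize one object */
};

static int demo_retrieve_v1(const void *data) { (void)data; return 1; }
static int demo_retrieve_v2(const void *data) { (void)data; return 2; }

/* The new kernel registers one handler per format version it understands. */
static const struct demo_handler demo_handlers[] = {
	{ "memfd-v1", demo_retrieve_v1 },
	{ "memfd-v2", demo_retrieve_v2 },  /* hypothetical future version */
};

/*
 * Pick the handler whose compatible string matches the one stored by the
 * previous kernel. Returns -1 if no registered handler understands it.
 */
static int demo_dispatch(const char *compatible, const void *data)
{
	size_t i;

	for (i = 0; i < sizeof(demo_handlers) / sizeof(demo_handlers[0]); i++) {
		if (strcmp(demo_handlers[i].compatible, compatible) == 0)
			return demo_handlers[i].retrieve(data);
	}
	return -1;
}
```

The subsystem never checks a version number itself in this model; an unknown
compatible string simply finds no handler.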

>
> luo core just manages this list of versioned data per serialized
> object. There is only one version per object.

This also holds true.

-- 
Regards,
Pratyush Yadav

* Re: [PATCH v3 17/30] liveupdate: luo_files: luo_ioctl: Unregister all FDs on device close
From: Pratyush Yadav @ 2025-08-27 15:34 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-18-pasha.tatashin@soleen.com>

Hi Pasha,

On Thu, Aug 07 2025, Pasha Tatashin wrote:

> Currently, a file descriptor registered for preservation remains
> globally registered with LUO until it is explicitly unregistered. This
> creates a potential for resource leaks into the next kernel if the
> userspace agent crashes or exits without proper cleanup before a live
> update is fully initiated.
>
> This patch ties the lifetime of FD preservation requests to the lifetime
> of the open file descriptor for /dev/liveupdate, creating an implicit
> "session".
>
> When the /dev/liveupdate file descriptor is closed (either explicitly
> via close() or implicitly on process exit/crash), the .release
> handler, luo_release(), is now called. This handler invokes the new
> function luo_unregister_all_files(), which iterates through all FDs
> that were preserved through that session and unregisters them.

Why special case files here? Shouldn't you undo all the serialization
done for all the subsystems?

Anyway, this is buggy. I found this when testing the memfd patches. If
you preserve a memfd and close the /dev/liveupdate FD before reboot,
luo_unregister_all_files() calls the cancel callback, which calls
kho_unpreserve_folio(). But kho_unpreserve_folio() fails because KHO is
still in finalized state. This doesn't happen when cancelling explicitly
because luo_cancel() calls kho_abort().

I think you should just make the release go through the cancel flow,
since the operation is essentially a cancel anyway. There are subtle
differences here though, since the release might be called before
prepare, so we need to be careful of that.
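
Roughly what I have in mind, as a toy userspace model rather than a patch.
struct demo_kho and the demo_* functions are hypothetical stand-ins for the
KHO/LUO state, and the -1 return is a placeholder, not the real error code.

```c
#include <stdbool.h>

/* Toy model of the relevant state; not the real KHO/LUO structures. */
struct demo_kho {
	bool finalized;  /* models KHO being in the finalized state */
	int preserved;   /* number of folios still preserved */
};

/* Models unpreserve being refused while the state is finalized. */
static int demo_unpreserve(struct demo_kho *k)
{
	if (k->finalized)
		return -1;
	if (k->preserved > 0)
		k->preserved--;
	return 0;
}

/* Models the abort step that the explicit cancel path performs. */
static void demo_abort(struct demo_kho *k)
{
	k->finalized = false;
}

/*
 * Release going through the cancel flow: abort first if we got as far as
 * finalizing, then drop every preservation. If release happens before
 * prepare, finalized is false and the abort step is skipped.
 */
static int demo_release(struct demo_kho *k)
{
	if (k->finalized)
		demo_abort(k);
	while (k->preserved > 0) {
		if (demo_unpreserve(k))
			return -1;
	}
	return 0;
}
```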


>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  kernel/liveupdate/luo_files.c    | 19 +++++++++++++++++++
>  kernel/liveupdate/luo_internal.h |  1 +
>  kernel/liveupdate/luo_ioctl.c    |  1 +
>  3 files changed, 21 insertions(+)
>
> diff --git a/kernel/liveupdate/luo_files.c b/kernel/liveupdate/luo_files.c
> index 33577c9e9a64..63f8b086b785 100644
> --- a/kernel/liveupdate/luo_files.c
> +++ b/kernel/liveupdate/luo_files.c
> @@ -721,6 +721,25 @@ int luo_unregister_file(u64 token)
>  	return ret;
>  }
>  
> +/**
> + * luo_unregister_all_files - Unpreserve all currently registered files.
> + *
> + * Iterates through all file descriptors currently registered for preservation
> + * and unregisters them, freeing all associated resources. This is typically
> + * called when LUO agent exits.
> + */
> +void luo_unregister_all_files(void)
> +{
> +	struct luo_file *luo_file;
> +	unsigned long token;
> +
> +	luo_state_read_enter();
> +	xa_for_each(&luo_files_xa_out, token, luo_file)
> +		__luo_unregister_file(token);
> +	luo_state_read_exit();
> +	WARN_ON_ONCE(atomic64_read(&luo_files_count) != 0);
> +}
> +
>  /**
>   * luo_retrieve_file - Find a registered file instance by its token.
>   * @token: The unique token of the file instance to retrieve.
> diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
> index 5692196fd425..189e032d7738 100644
> --- a/kernel/liveupdate/luo_internal.h
> +++ b/kernel/liveupdate/luo_internal.h
> @@ -37,5 +37,6 @@ void luo_do_subsystems_cancel_calls(void);
>  int luo_retrieve_file(u64 token, struct file **filep);
>  int luo_register_file(u64 token, int fd);
>  int luo_unregister_file(u64 token);
> +void luo_unregister_all_files(void);
>  
>  #endif /* _LINUX_LUO_INTERNAL_H */
> diff --git a/kernel/liveupdate/luo_ioctl.c b/kernel/liveupdate/luo_ioctl.c
> index 6f61569c94e8..7ca33d1c868f 100644
> --- a/kernel/liveupdate/luo_ioctl.c
> +++ b/kernel/liveupdate/luo_ioctl.c
> @@ -137,6 +137,7 @@ static int luo_open(struct inode *inodep, struct file *filep)
>  
>  static int luo_release(struct inode *inodep, struct file *filep)
>  {
> +	luo_unregister_all_files();
>  	atomic_set(&luo_device_in_use, 0);
>  
>  	return 0;

-- 
Regards,
Pratyush Yadav

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-08-27 17:35 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Theodore Ts'o, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <20250826.iewie7Et5aiw@digikod.net>

On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> > Is there a single, unified design and requirements document that
> > describes the threat model, and what you are trying to achieve with
> > AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> > that has landed for AT_EXECVE_CHECK and it really doesn't describe
> > what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> >
> >    "The AT_EXECVE_CHECK execveat(2) flag, and the
> >    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
> >    securebits are intended for script interpreters and dynamic linkers
> >    to enforce a consistent execution security policy handled by the
> >    kernel."
>
> From the documentation:
>
>   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
>   on a regular file and returns 0 if execution of this file would be
>   allowed, ignoring the file format and then the related interpreter
>   dependencies (e.g. ELF libraries, script’s shebang).
>
> >
> > Um, what security policy?
>
> Whether the file is allowed to be executed.  This includes file
> permission, mount point option, ACL, LSM policies...

This needs *waaaaay* more detail for any sort of useful evaluation.
Is an actual credible security policy rolling dice?  Asking ChatGPT?
Looking at security labels?  Does it care who can write to the file,
or who owns the file, or what the file's hash is, or what filesystem
it's on, or where it came from?  Does it dynamically inspect the
contents?  Is it controlled by an unprivileged process?

I can easily come up with security policies for which DENYWRITE is
completely useless.  I can come up with convoluted and
not-really-credible policies where DENYWRITE is important, but I'm
honestly not sure that those policies are actually useful.  I'm
honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted
because it should have been parametrized by *what format is expected*
-- it might be possible to bypass a policy by executing a perfectly
fine Python script using bash, for example.

I genuinely have not come up with a security policy that I believe
makes sense that needs AT_EXECVE_CHECK and DENYWRITE.  I'm not saying
that such a policy does not exist -- I'm saying that I have not
thought of such a thing after a few minutes of thought and reading
these threads.


> > And then on top of it, why can't you do these checks by modifying the
> > script interpreters?
>
> The script interpreter requires modification to use AT_EXECVE_CHECK.
>
> There is no other way for user space to reliably check executability of
> files (taking into account all enforced security
> policies/configurations).
>

As mentioned above, even AT_EXECVE_CHECK does not obviously accomplish
this goal.  If it were genuinely useful, I would much, much prefer a
totally different API: a *syscall* that takes, as input, a file
descriptor of something that an interpreter wants to execute and a
whole lot of context as to what that interpreter wants to do with it.
And I admit I'm *still* not convinced.

Seriously, consider all the unending recent attacks on LLMs an
inspiration.  The implications of viewing an image, downscaling the
image, possibly interpreting the image as something containing text,
possibly following instructions in a given language contained in the
image, etc are all wildly different.  A mechanism for asking for
general permission to "consume this image" is COMPLETELY MISSING THE
POINT.  (Never mind that the current crop of LLMs seem entirely
incapable of constraining their own use of some piece of input, but
that's a different issue and is beside the point here.)

* Re: [PATCH v19 4/8] fork: Add shadow stack support to clone3()
From: Edgecombe, Rick P @ 2025-08-27 17:58 UTC (permalink / raw)
  To: dietmar.eggemann@arm.com, broonie@kernel.org,
	Szabolcs.Nagy@arm.com, brauner@kernel.org,
	dave.hansen@linux.intel.com, debug@rivosinc.com, mgorman@suse.de,
	vincent.guittot@linaro.org, fweimer@redhat.com, mingo@redhat.com,
	rostedt@goodmis.org, hjl.tools@gmail.com, tglx@linutronix.de,
	vschneid@redhat.com, shuah@kernel.org, hpa@zytor.com,
	peterz@infradead.org, bp@alien8.de, bsegall@google.com,
	x86@kernel.org, juri.lelli@redhat.com
  Cc: yury.khrustalev@arm.com, linux-kselftest@vger.kernel.org,
	akpm@linux-foundation.org, jannh@google.com,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	will@kernel.org, wilco.dijkstra@arm.com, kees@kernel.org,
	linux-api@vger.kernel.org
In-Reply-To: <20250819-clone3-shadow-stack-v19-4-bc957075479b@kernel.org>

On Tue, 2025-08-19 at 17:21 +0100, Mark Brown wrote:
> +int arch_shstk_validate_clone(struct task_struct *t,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args)
> +{
> +	/*
> +	 * SSP is aligned, so reserved bits and mode bit are a zero, just mark
> +	 * the token 64-bit.
> +	 */

What is this comment doing here? It doesn't make sense. It looks copied from
create_rstor_token()?

> +	void *maddr = page_address(page);
> +	unsigned long token;
> +	int offset;
> +	u64 expected;
> +
> +	token = args->shadow_stack_token;
> +	expected = (token + SS_FRAME_SIZE) | BIT(0);

Instead of the above comment, I think the important thing to say is that
args->shadow_stack_token is 8 byte aligned, so offset can't overflow out of the
page.

Maybe?

/* kernel_clone_args verification assures token address is 8 byte aligned */
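
To spell out why the alignment matters, here is a small userspace sketch. The
4 KiB page size and the demo_* helper names are assumptions for illustration;
demo_offset_in_page() only mimics what offset_in_page() computes.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size */

/* Offset of an address within its page, like the kernel's offset_in_page(). */
static unsigned int demo_offset_in_page(uint64_t addr)
{
	return (unsigned int)(addr & (DEMO_PAGE_SIZE - 1));
}

/* True if the 8-byte token starting at addr lies entirely within one page. */
static bool demo_token_fits_in_page(uint64_t addr)
{
	return demo_offset_in_page(addr) + sizeof(uint64_t) <= DEMO_PAGE_SIZE;
}
```

For an 8 byte aligned address the largest possible offset is 4088, so the
token always ends at or before the page boundary. An unaligned address such
as one at offset 4092 would straddle two pages.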

> +	offset = offset_in_page(token);
> +
> +	if (!cmpxchg_to_user_page(vma, page, token, (unsigned long *)(maddr + offset),
> +				  expected, 0))
> +		return -EINVAL;
> +	set_page_dirty_lock(page);
> +
> +	return 0;
> +}
> +

With those changes, for the series:

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-27 19:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes,
	Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn,
	Jeff Xu, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <CALCETrW=V9vst_ho2Q4sQUJ5uZECY5h7TnF==sG4JWq8PsWb8Q@mail.gmail.com>

On Wed, Aug 27, 2025 at 10:35:28AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote:
> >
> > On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> > > Is there a single, unified design and requirements document that
> > > describes the threat model, and what you are trying to achieve with
> > > AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> > > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> > > that has landed for AT_EXECVE_CHECK and it really doesn't describe
> > > what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> > >
> > >    "The AT_EXECVE_CHECK execveat(2) flag, and the
> > >    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
> > >    securebits are intended for script interpreters and dynamic linkers
> > >    to enforce a consistent execution security policy handled by the
> > >    kernel."
> >
> > From the documentation:
> >
> >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> >   on a regular file and returns 0 if execution of this file would be
> >   allowed, ignoring the file format and then the related interpreter
> >   dependencies (e.g. ELF libraries, script’s shebang).
> >
> > >
> > > Um, what security policy?
> >
> > Whether the file is allowed to be executed.  This includes file
> > permission, mount point option, ACL, LSM policies...
> 
> This needs *waaaaay* more detail for any sort of useful evaluation.
> Is an actual credible security policy rolling dice?  Asking ChatGPT?
> Looking at security labels?  Does it care who can write to the file,
> or who owns the file, or what the file's hash is, or what filesystem
> it's on, or where it came from?  Does it dynamically inspect the
> contents?  Is it controlled by an unprivileged process?

AT_EXECVE_CHECK only does the same checks as done by other execveat(2)
calls, but without actually executing the file/fd.

> 
> I can easily come up with security policies for which DENYWRITE is
> completely useless.  I can come up with convoluted and
> not-really-credible policies where DENYWRITE is important, but I'm
> honestly not sure that those policies are actually useful.  I'm
> honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted
> because it should have been parametrized by *what format is expected*
> -- it might be possible to bypass a policy by executing a perfectly
> fine Python script using bash, for example.

There has been a lot of bikeshedding for the AT_EXECVE_CHECK patch
series, and a lot of discussions too (you were part of them).  We ended
up with this design, which is simple and follows the kernel semantic
(requested by Linus).

> 
> I genuinely have not come up with a security policy that I believe
> makes sense that needs AT_EXECVE_CHECK and DENYWRITE.  I'm not saying
> that such a policy does not exist -- I'm saying that I have not
> thought of such a thing after a few minutes of thought and reading
> these threads.

A simple use case is for systems that want to enforce a
write-xor-execute policy e.g., thanks to mount point options.

> 
> 
> > > And then on top of it, why can't you do these checks by modifying the
> > > script interpreters?
> >
> > The script interpreter requires modification to use AT_EXECVE_CHECK.
> >
> > There is no other way for user space to reliably check executability of
> > files (taking into account all enforced security
> > policies/configurations).
> >
> 
> As mentioned above, even AT_EXECVE_CHECK does not obviously accomplish
> this goal.  If it were genuinely useful, I would much, much prefer a
> totally different API: a *syscall* that takes, as input, a file
> descriptor of something that an interpreter wants to execute and a
> whole lot of context as to what that interpreter wants to do with it.
> And I admit I'm *still* not convinced.

As mentioned above, AT_EXECVE_CHECK follows the kernel semantic. Nothing
fancy.

> 
> Seriously, consider all the unending recent attacks on LLMs an
> inspiration.  The implications of viewing an image, downscaling the
> image, possibly interpreting the image as something containing text,
> possibly following instructions in a given language contained in the
> image, etc are all wildly different.  A mechanism for asking for
> general permission to "consume this image" is COMPLETELY MISSING THE
> POINT.  (Never mind that the current crop of LLMs seem entirely
> incapable of constraining their own use of some piece of input, but
> that's a different issue and is beside the point here.)

You're asking about what we should consider executable.  This is a good
question, but AT_EXECVE_CHECK is there to answer another question: would
the kernel execute it or not?
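
For reference, a minimal sketch of how an interpreter could ask that question
for an already opened script. demo_exec_check() is a hypothetical helper name,
the fallback #defines mirror the values in the uapi headers for toolchains
that do not expose them, and the argv content is a placeholder.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000    /* only exposed by glibc with _GNU_SOURCE */
#endif
#ifndef AT_EXECVE_CHECK
#define AT_EXECVE_CHECK 0x10000 /* from linux/fcntl.h on recent kernels */
#endif

/*
 * Ask the kernel whether it would allow executing the file behind fd,
 * without executing it. Returns 0 if allowed, or a negative errno.
 * Kernels that do not know the flag are expected to return -EINVAL.
 */
static int demo_exec_check(int fd)
{
	static char *const argv[] = { "check", NULL };  /* placeholder argv */
	static char *const envp[] = { NULL };

	if (syscall(SYS_execveat, fd, "", argv, envp,
		    AT_EMPTY_PATH | AT_EXECVE_CHECK) == 0)
		return 0;
	return -errno;
}
```

What the interpreter does with a negative result is up to it and to the
securebits configuration; this sketch only covers the query itself.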

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-08-27 20:35 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Andy Lutomirski, Theodore Ts'o, Christian Brauner, Al Viro,
	Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <20250827.Fuo1Iel1pa7i@digikod.net>

On Wed, Aug 27, 2025 at 12:07 PM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Wed, Aug 27, 2025 at 10:35:28AM -0700, Andy Lutomirski wrote:
> > On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote:
> > >
> > > On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> > > > Is there a single, unified design and requirements document that
> > > > describes the threat model, and what you are trying to achieve with
> > > > AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> > > > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> > > > that has landed for AT_EXECVE_CHECK and it really doesn't describe
> > > > what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> > > >
> > > >    "The AT_EXECVE_CHECK execveat(2) flag, and the
> > > >    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
> > > >    securebits are intended for script interpreters and dynamic linkers
> > > >    to enforce a consistent execution security policy handled by the
> > > >    kernel."
> > >
> > > From the documentation:
> > >
> > >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> > >   on a regular file and returns 0 if execution of this file would be
> > >   allowed, ignoring the file format and then the related interpreter
> > >   dependencies (e.g. ELF libraries, script’s shebang).
> > >
> > > >
> > > > Um, what security policy?
> > >
> > > Whether the file is allowed to be executed.  This includes file
> > > permission, mount point option, ACL, LSM policies...
> >
> > This needs *waaaaay* more detail for any sort of useful evaluation.
> > Is an actual credible security policy rolling dice?  Asking ChatGPT?
> > Looking at security labels?  Does it care who can write to the file,
> > or who owns the file, or what the file's hash is, or what filesystem
> > it's on, or where it came from?  Does it dynamically inspect the
> > contents?  Is it controlled by an unprivileged process?
>
> AT_EXECVE_CHECK only does the same checks as done by other execveat(2)
> calls, but without actually executing the file/fd.
>

okay... but see below.

> >
> > I can easily come up with security policies for which DENYWRITE is
> > completely useless.  I can come up with convoluted and
> > not-really-credible policies where DENYWRITE is important, but I'm
> > honestly not sure that those policies are actually useful.  I'm
> > honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted
> > because it should have been parametrized by *what format is expected*
> > -- it might be possible to bypass a policy by executing a perfectly
> > fine Python script using bash, for example.
>
> There has been a lot of bikeshedding for the AT_EXECVE_CHECK patch
> series, and a lot of discussions too (you were part of them).  We ended
> up with this design, which is simple and follows the kernel semantic
> (requested by Linus).

I recall this.  That doesn't mean I totally love AT_EXECVE_CHECK.  And
it especially doesn't mean that I believe that it usefully does
something that justifies anything like DENYWRITE.

>
> >
> > I genuinely have not come up with a security policy that I believe
> > makes sense that needs AT_EXECVE_CHECK and DENYWRITE.  I'm not saying
> > that such a policy does not exist -- I'm saying that I have not
> > thought of such a thing after a few minutes of thought and reading
> > these threads.
>
> A simple use case is for systems that want to enforce a
> write-xor-execute policy e.g., thanks to mount point options.

Sure, but I'm contemplating DENYWRITE, and this thread is about
DENYWRITE.  If the kernel is enforcing W^X, then there are really two
almost unrelated things going on:

1. LSM policy that enforces W^X for memory mappings.  This is to
enforce that applications don't do nasty things like having executable
stacks, and it's a mess because no one has really figured out how JITs
are supposed to work in this world.  It has almost nothing to do with
execve except incidentally.

2. LSM policy that enforces that someone doesn't execve (or similar)
something that *that user* can write.  Or that non-root can write.  Or
that anyone at all can write, etc.

I think, but I'm not sure, that you're talking about #2.  So maybe
there's a policy that says that one may only exec things that are on
an fs with the 'exec' mount option.  Or maybe there's a policy that
says that one may only exec things that are on a readonly fs.  In
these specific cases, I believe in AT_EXECVE_CHECK.  *But* I don't
believe in DENYWRITE: in the 'exec' case, if an fs has the exec option
set, that doesn't change if the file is subsequently modified.  And if
an fs is readonly, then the file is quite unlikely to be modified at
all and will certainly not be modified via the mount through which
it's being executed.  And you don't need DENYWRITE.

So I think my question still stands: is there a credible security
policy *that actually benefits from DENYWRITE*?  If so, can you give
an example?

> >
> > Seriously, consider all the unending recent attacks on LLMs an
> > inspiration.  The implications of viewing an image, downscaling the
> > image, possibly interpreting the image as something containing text,
> > possibly following instructions in a given language contained in the
> > image, etc are all wildly different.  A mechanism for asking for
> > general permission to "consume this image" is COMPLETELY MISSING THE
> > POINT.  (Never mind that the current crop of LLMs seem entirely
> > incapable of constraining their own use of some piece of input, but
> > that's a different issue and is beside the point here.)
>
> You're asking about what we should consider executable.  This is a good
> question, but AT_EXECVE_CHECK is there to answer another question: would
> the kernel execute it or not?
>

That's a sort of odd way of putting it.  The kernel won't execute it
because the kernel doesn't know how to :)  But I think I understand
what you're saying.

* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Aleksa Sarai @ 2025-08-27 22:48 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Adhemerval Zanella Netto, Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <5c9fa556-da00-4b76-8a70-8e2d1dddd92d@cs.ucla.edu>

On 2025-08-27, Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 2025-08-26 22:58, Aleksa Sarai wrote:
> > Personally I think both approaches are less than ideal, and having rich
> > feature flags for the entire system would be better but I don't think
> > this is something that would be feasible to apply to everything in the
> > entire kernel.
> 
> Agreed. But I'm not seeing how a hypothetical "give me the supported flags"
> flag would be useful enough to justify the flag.
> 
> I'm looking at this from the user point of view, and it is not ringing a
> bell for me. Granted, the current "try the flag combination you want and see
> whether it works" is not ideal, but it's accurate (which is not always true
> for "give me the supported flags" flag) and you need to do it anyway
> (because the "give me the supported flags" flag is inherently inaccurate),
> so why bother with a "give me the supported flags" flag?
> 
> Here's an example. Suppose we want to extend openat2 so that it also does
> the equivalent of statx atomically with the open, to avoid some races with
> the current openat/fstat pair of system calls. Under the approach you're
> proposing, I suppose we could extend struct open_how so that it has a new
> struct statx member, add new flags to be put into struct open_how's flags
> member, and programs would be able to query the new flags via a "give me the
> supported flags" call.
> 
> But in this scenario, the "give me the supported flags" flag is useless. If
> I'm an old program I can't use the new flags even if I detect them because
> my struct open_how is too small. And if I'm a new program I can simply use
> the new flags - and even if I tested for the new flags (with the "give me
> the supported flags" flag) I'd have to test the result anyway because
> perhaps the new flags are not supported for this particular flag combination
> or file.
> 
> What specific scenario would make the "give me supported flags" flag worth
> the hassle of supporting and documenting and testing such a flag?

"Just try it" leads to programs that have to test dozens of flag
combinations for syscalls at startup, and for many syscalls you cannot
"just try" whether the new flag works (think of a new shutdown(2) flag,
or most clone3(2) flags). What you end up having to do is create an
elaborate setup where if the flag works you get an error (but not
-EINVAL!) so that you can be fairly confident that you didn't modify the
system when doing the check. As someone who has to write this
boilerplate whenever I need to use most system calls, this really
**really** sucks. In some cases you can just try it and then fall back
(caching whether it was supported), but in a lot of programs it is
preferable to know well in advance whether a feature is supported.

A simple example would be mounts -- if MOUNT_BENEATH is not supported
then you need to structure how you construct your mount tree differently
to try to emulate the same behaviour. This means that not knowing if
MOUNT_BENEATH is supported upfront causes you to redo a lot of work in
the fallback case. If changing ID-mappings for mounts hadn't required
adding a new syscall, this would've also been an issue for programs that
needed to change the ID-mappings of mounts.

Some kind of "just tell me what flags are supported" mechanism avoids
this problem by telling you in one shot what features are supported (so
newer programs can take advantage of them). Most systems that expect to
be extended over time have something like this, but it's usually in the
form of string-based feature names (/sys/kernel/cgroup/features, for
instance). I wouldn't be against such an idea (if we could
guarantee that everyone actually used it), but something similar was
proposed back in 2020 and never happened -- CHECK_FIELDS is a very
simple solution to the problem that works for the most common case and
can be implemented per-syscall.
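
For comparison, consuming a string-based feature list takes only a few
lines. This is a minimal sketch that assumes the file holds
whitespace-separated feature names, as /sys/kernel/cgroup/features does;
the helper name read_kernel_features is hypothetical, and it says nothing
about how CHECK_FIELDS itself would be queried.

```python
from pathlib import Path


def read_kernel_features(path: str = "/sys/kernel/cgroup/features") -> frozenset:
    """Return the feature names advertised in a sysfs-style feature file.

    A missing or unreadable file is treated as "no features advertised",
    so callers on older kernels simply see an empty set.
    """
    try:
        text = Path(path).read_text()
    except OSError:
        return frozenset()
    return frozenset(text.split())


if __name__ == "__main__":
    features = read_kernel_features()
    print(sorted(features))
    print("nsdelegate" in features)
```

One read at startup answers every "is X supported?" question for that
subsystem, without side effects on the system.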

I've added linux-api to Cc, as I'm sure there are plenty of other ideas
on how to solve this.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Paul Eggert @ 2025-08-27 23:19 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Adhemerval Zanella Netto, Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <2025-08-27-perky-glossy-dam-spindle-kPpnnk@cyphar.com>

On 2025-08-27 15:48, Aleksa Sarai wrote:
> On 2025-08-27, Paul Eggert <eggert@cs.ucla.edu> wrote:
>> What specific scenario would make the "give me supported flags" flag worth
>> the hassle of supporting and documenting and testing such a flag?
> 
> "Just try it" leads to programs that have to test dozens of flag
> combinations for syscalls at startup,

Although that sort of thing can indeed be a problem in general, I don't 
see how it's a problem for openat2 in particular.

The issue here is whether openat2's API should reflect current behavior 
(where the HOW argument is pointer-to-const) or a potential future 
behavior (where the kernel might modify the struct that HOW points to, 
if some hypothetical future flag is set in that struct). I am skeptical 
that this hypothetical situation is so plausible that it justifies the 
maintenance hassle of a glibc API that doesn't correspond to how openat2 
currently behaves.


> A simple example would be mounts -- if MOUNT_BENEATH is not supported

I don't understand this example. Are you talking about <linux/mount.h>'s 
MOVE_MOUNT_BENEATH? That's a move_mount flag, and I don't see what that 
has to do with openat2. Or are you saying that openat2 might not support 
<linux/openat2.h>'s RESOLVE_BENEATH flag? Under what conditions might 
that be, exactly? Can you give some plausible user code to illustrate 
the openat2 example you're thinking of?

I still fail to understand how a hypothetical "give me the supported 
flags" openat2 flag would be useful enough to justify complicating the 
openat2 API today.


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Aleksa Sarai @ 2025-08-28  0:14 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826.aig5aiShunga@digikod.net>


On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > Nothing has changed in that regard and I'm not interested in stuffing
> > the VFS APIs full of special-purpose behavior to work around the fact
> > that this is work that needs to be done in userspace. Change the apps,
> > stop pushing more and more cruft into the VFS that has no business
> > there.
> 
> It would be interesting to know how to patch user space to get the same
> guarantees...  Do you think I would propose a kernel patch otherwise?

You could mmap the script file with MAP_PRIVATE. This is the *actual*
protection the kernel uses against overwriting binaries (yes, ETXTBSY is
nice but IIRC there are ways to get around it anyway). Of course, most
interpreters don't mmap their scripts, but this is a potential solution.
If the security policy is based on validating the script text in some
way, this avoids the TOCTOU.

Now, in cases where you have IMA or something and you only permit signed
binaries to execute, you could argue there is a different race here (an
attacker creates a malicious script, runs it, and then replaces it with
a valid script's contents and metadata after the fact to get
AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
this is even possible with IMA (can an unprivileged user even set
security.ima?). But even then, I would expect users that really need
this would also probably use fs-verity or dm-verity that would block
this kind of attack since it would render the files read-only anyway.

This is why a more detailed threat model of what kinds of attacks are
relevant is useful. I was there for the talk you gave and subsequent
discussion at last year's LPC, but I felt that your threat model was
not really fleshed out at all. I am still not sure what capabilities you
expect the attacker to have nor what is being used to authenticate
binaries (other than AT_EXECVE_CHECK). Maybe I'm wrong with my above
assumptions, but I can't know without knowing what threat model you have
in mind, *in detail*.

For example, if you are dealing with an attacker that has CAP_SYS_ADMIN,
there are plenty of ways for an attacker to execute their own code
without using interpreters (create a new tmpfs with fsopen(2) for
instance). Executable memfds are even easier and don't require
privileges on most systems (yes, you can block them with vm.memfd_noexec
but CAP_SYS_ADMIN can disable that -- and there's always fsopen(2) or
mount(2)).
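
To see which memfd policy a given system is running, something like the
following would do. It is only a sketch: the helper names are hypothetical,
the sysctl only exists on kernels that have vm.memfd_noexec, and the
descriptions of the three values are my reading of the sysctl
documentation, so double-check them against your kernel.

```python
from pathlib import Path
from typing import Optional

# Meaning of each vm.memfd_noexec value, paraphrased from the kernel's
# sysctl documentation (an assumption to verify for your kernel version).
_MODES = {
    0: "memfd_create() without MFD_EXEC or MFD_NOEXEC_SEAL acts as if MFD_EXEC was set",
    1: "memfd_create() without MFD_EXEC or MFD_NOEXEC_SEAL acts as if MFD_NOEXEC_SEAL was set",
    2: "memfd_create() without MFD_NOEXEC_SEAL is rejected",
}


def memfd_noexec_mode(path: str = "/proc/sys/vm/memfd_noexec") -> Optional[int]:
    """Return the current sysctl value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe_memfd_noexec(mode: Optional[int]) -> str:
    """Map a sysctl value to a human-readable description."""
    if mode is None:
        return "vm.memfd_noexec not available"
    return _MODES.get(mode, "unknown value")


if __name__ == "__main__":
    print(describe_memfd_noexec(memfd_noexec_mode()))
```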

(As an aside, it's a shame that AT_EXECVE_CHECK burned one of the
top-level AT_* bits for a per-syscall flag -- the block comment I added
in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be
allocated") was meant to avoid this happening but it seems you and the
reviewers missed that...)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-08-28  0:32 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <2025-08-27-obscene-great-toy-diary-X1gVRV@cyphar.com>

On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > Nothing has changed in that regard and I'm not interested in stuffing
> > > the VFS APIs full of special-purpose behavior to work around the fact
> > > that this is work that needs to be done in userspace. Change the apps,
> > > stop pushing more and more cruft into the VFS that has no business
> > > there.
> >
> > It would be interesting to know how to patch user space to get the same
> > guarantees...  Do you think I would propose a kernel patch otherwise?
>
> You could mmap the script file with MAP_PRIVATE. This is the *actual*
> protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> nice but IIRC there are ways to get around it anyway).

Wait, really?  MAP_PRIVATE prevents writes to the mapping from
affecting the file, but I don't think that writes to the file will
break the MAP_PRIVATE CoW if it's not already broken.

IPython says:

In [1]: import mmap, tempfile

In [2]: f = tempfile.TemporaryFile()

In [3]: f.write(b'initial contents')
Out[3]: 16

In [4]: f.flush()

In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)

In [6]: map[:]
Out[6]: b'initial contents'

In [7]: f.seek(0)
Out[7]: 0

In [8]: f.write(b'changed')
Out[8]: 7

In [9]: f.flush()

In [10]: map[:]
Out[10]: b'changed contents'


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Aleksa Sarai @ 2025-08-28  0:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes,
	Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn,
	Jeff Xu, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <CALCETrWHKga33bvzUHnd-mRQUeNXTtXSS8Y8+40d5bxv-CqBhw@mail.gmail.com>


On 2025-08-27, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > that this is work that needs to be done in userspace. Change the apps,
> > > > stop pushing more and more cruft into the VFS that has no business
> > > > there.
> > >
> > > It would be interesting to know how to patch user space to get the same
> > > guarantees...  Do you think I would propose a kernel patch otherwise?
> >
> > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > nice but IIRC there are ways to get around it anyway).
> 
> Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> affecting the file, but I don't think that writes to the file will
> break the MAP_PRIVATE CoW if it's not already broken.

Oh I guess you're right -- that's news to me. And from mmap(2):

> MAP_PRIVATE
> [...] It is unspecified whether changes made to the file after the
> mmap() call are visible in the mapped region.

But then what is the protection mechanism (in the absence of -ETXTBSY)
that stops you from overwriting the live text of a binary by just
writing to it?

I would need to go trawling through my old scripts to find the
reproducer that let you get around -ETXTBSY (I think it involved
executable memfds) but I distinctly remember that even if you overwrote
the binary, the live process's mapped text did not change. (Ditto for
the few kernel releases where we had removed -ETXTBSY.) I found this
surprising, but assumed that it was because of MAP_PRIVATE.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Mike Rapoport @ 2025-08-28  7:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162019.GD2130239@nvidia.com>

On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> 
> > +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > +				       (void **)&preserved_folios);
> > +	if (err) {
> > +		pr_err("Failed to reserve folios property in FDT: %s\n",
> > +		       fdt_strerror(err));
> > +		err = -ENOMEM;
> > +		goto err_free_fdt;
> > +	}
> 
> Yuk.
> 
> This really wants some luo helper
> 
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'
> 
> Which would get a linearized list of pages in the vmap to hold the
> array and then allocate some structure to record the page list and
> return back the u64 of the phys_addr of the top of the structure to
> store in whatever.
> 
> Getting fdt to allocate the array inside the fdt is just not going to
> work for anything of size.

I agree that we need a side-car structure for preserving large (potentially
sparse) arrays, but I think it should be a part of KHO rather than LUO.

-- 
Sincerely yours,
Mike.


* [PATCH 01/14] perf test: Fix a build error in x86 topdown test
From: mysteryli @ 2025-08-28  8:24 UTC (permalink / raw)
  To: m13940358460
  Cc: Namhyung Kim, Naresh Kamboju, Paolo Bonzini, kvm, Yury Norov,
	Mark Rutland, x86, Catalin Marinas, Will Deacon, linux-arm-kernel,
	Madhavan Srinivasan, linuxppc-dev, Arnd Bergmann, linux-api,
	Christian Brauner, linux-fsdevel, Michael S. Tsirkin, Jason Wang,
	virtualization, Ian Rogers


From: Namhyung Kim <namhyung@kernel.org>

There's an environment that caused the following build error.  Include
"debug.h" (under util directory) to fix it.

  arch/x86/tests/topdown.c: In function 'event_cb':
  arch/x86/tests/topdown.c:53:25: error: implicit declaration of function 'pr_debug'
                                         [-Werror=implicit-function-declaration]
     53 |                         pr_debug("Broken topdown information for '%s'\n", evsel__name(evsel));
        |                         ^~~~~~~~
  cc1: all warnings being treated as errors

Link: https://lore.kernel.org/r/20250815164122.289651-1-namhyung@kernel.org
Fixes: 5b546de9cc177936 ("perf topdown: Use attribute to see an event is a topdown metic or slots")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/arch/x86/tests/topdown.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/arch/x86/tests/topdown.c b/tools/perf/arch/x86/tests/topdown.c
index 8d0ea7a4bbc1..1eba3b4594ef 100644
--- a/tools/perf/arch/x86/tests/topdown.c
+++ b/tools/perf/arch/x86/tests/topdown.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include "arch-tests.h"
 #include "../util/topdown.h"
+#include "debug.h"
 #include "evlist.h"
 #include "parse-events.h"
 #include "pmu.h"
-- 
2.25.1


From bd842ff41543af424c2473dc16c678ac8ba2b43f Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 02/14] tools headers: Sync KVM headers with the kernel source

To pick up the changes in this cset:

  f55ce5a6cd33211c KVM: arm64: Expose new KVM cap for cacheable PFNMAP
  28224ef02b56fcee KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
  4580dbef5ce0f95a KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
  25e8b1dd4883e6c2 KVM: TDX: Exit to userspace for GetTdVmCallInfo
  cf207eac06f661fb KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
    diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h

Please see tools/include/uapi/README for further details.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/arch/x86/include/uapi/asm/kvm.h |  8 +++++++-
 tools/include/uapi/linux/kvm.h        | 27 +++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 6f3499507c5e..0f15d683817d 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -965,7 +965,13 @@ struct kvm_tdx_cmd {
 struct kvm_tdx_capabilities {
 	__u64 supported_attrs;
 	__u64 supported_xfam;
-	__u64 reserved[254];
+
+	__u64 kernel_tdvmcallinfo_1_r11;
+	__u64 user_tdvmcallinfo_1_r11;
+	__u64 kernel_tdvmcallinfo_1_r12;
+	__u64 user_tdvmcallinfo_1_r12;
+
+	__u64 reserved[250];
 
 	/* Configurable CPUID bits for userspace */
 	struct kvm_cpuid2 cpuid;
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 7415a3863891..f0f0d49d2544 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -178,6 +178,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_TDX              40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -447,6 +448,31 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_TDX */
+		struct {
+			__u64 flags;
+			__u64 nr;
+			union {
+				struct {
+					__u64 ret;
+					__u64 data[5];
+				} unknown;
+				struct {
+					__u64 ret;
+					__u64 gpa;
+					__u64 size;
+				} get_quote;
+				struct {
+					__u64 ret;
+					__u64 leaf;
+					__u64 r11, r12, r13, r14;
+				} get_tdvmcall_info;
+				struct {
+					__u64 ret;
+					__u64 vector;
+				} setup_event_notify;
+			};
+		} tdx;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -935,6 +961,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_EL2 240
 #define KVM_CAP_ARM_EL2_E2H0 241
 #define KVM_CAP_RISCV_MP_STATE_RESET 242
+#define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
-- 
2.25.1


From 6cb8607934d937f4ad24ec9ad26aeb669e266937 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 03/14] tools headers: Sync linux/bits.h with the kernel source

To pick up the changes in this cset:

  104ea1c84b91c9f4 bits: unify the non-asm GENMASK*()
  6d4471252ccc1722 bits: split the definition of the asm and non-asm GENMASK*()

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/linux/bits.h include/linux/bits.h

Please see tools/include/uapi/README for further details.

Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/include/linux/bits.h | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/tools/include/linux/bits.h b/tools/include/linux/bits.h
index 7ad056219115..a40cc861b3a7 100644
--- a/tools/include/linux/bits.h
+++ b/tools/include/linux/bits.h
@@ -2,10 +2,8 @@
 #ifndef __LINUX_BITS_H
 #define __LINUX_BITS_H
 
-#include <linux/const.h>
 #include <vdso/bits.h>
 #include <uapi/linux/bits.h>
-#include <asm/bitsperlong.h>
 
 #define BIT_MASK(nr)		(UL(1) << ((nr) % BITS_PER_LONG))
 #define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
@@ -50,10 +48,14 @@
 	     (type_max(t) << (l) &				\
 	      type_max(t) >> (BITS_PER_TYPE(t) - 1 - (h)))))
 
+#define GENMASK(h, l)		GENMASK_TYPE(unsigned long, h, l)
+#define GENMASK_ULL(h, l)	GENMASK_TYPE(unsigned long long, h, l)
+
 #define GENMASK_U8(h, l)	GENMASK_TYPE(u8, h, l)
 #define GENMASK_U16(h, l)	GENMASK_TYPE(u16, h, l)
 #define GENMASK_U32(h, l)	GENMASK_TYPE(u32, h, l)
 #define GENMASK_U64(h, l)	GENMASK_TYPE(u64, h, l)
+#define GENMASK_U128(h, l)	GENMASK_TYPE(u128, h, l)
 
 /*
  * Fixed-type variants of BIT(), with additional checks like GENMASK_TYPE(). The
@@ -79,28 +81,9 @@
  * BUILD_BUG_ON_ZERO is not available in h files included from asm files,
  * disable the input check if that is the case.
  */
-#define GENMASK_INPUT_CHECK(h, l) 0
+#define GENMASK(h, l)		__GENMASK(h, l)
+#define GENMASK_ULL(h, l)	__GENMASK_ULL(h, l)
 
 #endif /* !defined(__ASSEMBLY__) */
 
-#define GENMASK(h, l) \
-	(GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
-#define GENMASK_ULL(h, l) \
-	(GENMASK_INPUT_CHECK(h, l) + __GENMASK_ULL(h, l))
-
-#if !defined(__ASSEMBLY__)
-/*
- * Missing asm support
- *
- * __GENMASK_U128() depends on _BIT128() which would not work
- * in the asm code, as it shifts an 'unsigned __int128' data
- * type instead of direct representation of 128 bit constants
- * such as long and unsigned long. The fundamental problem is
- * that a 128 bit constant will get silently truncated by the
- * gcc compiler.
- */
-#define GENMASK_U128(h, l) \
-	(GENMASK_INPUT_CHECK(h, l) + __GENMASK_U128(h, l))
-#endif
-
 #endif	/* __LINUX_BITS_H */
-- 
2.25.1


From aa34642f6fc36a436de5ae5b30d414578b3622f5 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 04/14] tools headers: Sync linux/cfi_types.h with the kernel
 source

To pick up the changes in this cset:

  5ccaeedb489b41ce cfi: add C CFI type macro

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/linux/cfi_types.h include/linux/cfi_types.h

Please see tools/include/uapi/README for further details.

Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/include/linux/cfi_types.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/include/linux/cfi_types.h b/tools/include/linux/cfi_types.h
index 6b8713675765..685f7181780f 100644
--- a/tools/include/linux/cfi_types.h
+++ b/tools/include/linux/cfi_types.h
@@ -41,5 +41,28 @@
 	SYM_TYPED_START(name, SYM_L_GLOBAL, SYM_A_ALIGN)
 #endif
 
+#else /* __ASSEMBLY__ */
+
+#ifdef CONFIG_CFI_CLANG
+#define DEFINE_CFI_TYPE(name, func)						\
+	/*									\
+	 * Force a reference to the function so the compiler generates		\
+	 * __kcfi_typeid_<func>.						\
+	 */									\
+	__ADDRESSABLE(func);							\
+	/* u32 name __ro_after_init = __kcfi_typeid_<func> */			\
+	extern u32 name;							\
+	asm (									\
+	"	.pushsection	.data..ro_after_init,\"aw\",\%progbits	\n"	\
+	"	.type	" #name ",\%object				\n"	\
+	"	.globl	" #name "					\n"	\
+	"	.p2align	2, 0x0					\n"	\
+	#name ":							\n"	\
+	"	.4byte	__kcfi_typeid_" #func "				\n"	\
+	"	.size	" #name ", 4					\n"	\
+	"	.popsection						\n"	\
+	);
+#endif
+
 #endif /* __ASSEMBLY__ */
 #endif /* _LINUX_CFI_TYPES_H */
-- 
2.25.1


From 619f55c859014e2235f83ba6cde8c59edc492f39 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 05/14] tools headers: Sync x86 headers with the kernel source

To pick up the changes in this cset:

  7b306dfa326f7011 x86/sev: Evict cache lines during SNP memory validation
  65f55a30176662ee x86/CPU/AMD: Add CPUID faulting support
  d8010d4ba43e9f79 x86/bugs: Add a Transient Scheduler Attacks mitigation
  a3c4f3396b82849a x86/msr-index: Add AMD workload classification MSRs
  17ec2f965344ee3f KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supported

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
    diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h

Please see tools/include/uapi/README for further details.

Cc: x86@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/arch/x86/include/asm/cpufeatures.h | 10 +++++++++-
 tools/arch/x86/include/asm/msr-index.h   |  7 +++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index ee176236c2be..06fc0479a23f 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -218,6 +218,7 @@
 #define X86_FEATURE_FLEXPRIORITY	( 8*32+ 1) /* "flexpriority" Intel FlexPriority */
 #define X86_FEATURE_EPT			( 8*32+ 2) /* "ept" Intel Extended Page Table */
 #define X86_FEATURE_VPID		( 8*32+ 3) /* "vpid" Intel Virtual Processor ID */
+#define X86_FEATURE_COHERENCY_SFW_NO	( 8*32+ 4) /* SNP cache coherency software work around not needed */
 
 #define X86_FEATURE_VMMCALL		( 8*32+15) /* "vmmcall" Prefer VMMCALL to VMCALL */
 #define X86_FEATURE_XENPV		( 8*32+16) /* Xen paravirtual guest */
@@ -456,10 +457,14 @@
 #define X86_FEATURE_NO_NESTED_DATA_BP	(20*32+ 0) /* No Nested Data Breakpoints */
 #define X86_FEATURE_WRMSR_XX_BASE_NS	(20*32+ 1) /* WRMSR to {FS,GS,KERNEL_GS}_BASE is non-serializing */
 #define X86_FEATURE_LFENCE_RDTSC	(20*32+ 2) /* LFENCE always serializing / synchronizes RDTSC */
+#define X86_FEATURE_VERW_CLEAR		(20*32+ 5) /* The memory form of VERW mitigates TSA */
 #define X86_FEATURE_NULL_SEL_CLR_BASE	(20*32+ 6) /* Null Selector Clears Base */
+
 #define X86_FEATURE_AUTOIBRS		(20*32+ 8) /* Automatic IBRS */
 #define X86_FEATURE_NO_SMM_CTL_MSR	(20*32+ 9) /* SMM_CTL MSR is not present */
 
+#define X86_FEATURE_GP_ON_USER_CPUID	(20*32+17) /* User CPUID faulting */
+
 #define X86_FEATURE_PREFETCHI		(20*32+20) /* Prefetch Data/Instruction to Cache Level */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
@@ -487,6 +492,9 @@
 #define X86_FEATURE_PREFER_YMM		(21*32+ 8) /* Avoid ZMM registers due to downclocking */
 #define X86_FEATURE_APX			(21*32+ 9) /* Advanced Performance Extensions */
 #define X86_FEATURE_INDIRECT_THUNK_ITS	(21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
+#define X86_FEATURE_TSA_SQ_NO		(21*32+11) /* AMD CPU not vulnerable to TSA-SQ */
+#define X86_FEATURE_TSA_L1_NO		(21*32+12) /* AMD CPU not vulnerable to TSA-L1 */
+#define X86_FEATURE_CLEAR_CPU_BUF_VM	(21*32+13) /* Clear CPU buffers using VERW before VMRUN */
 
 /*
  * BUG word(s)
@@ -542,5 +550,5 @@
 #define X86_BUG_OLD_MICROCODE		X86_BUG( 1*32+ 6) /* "old_microcode" CPU has old microcode, it is surely vulnerable to something */
 #define X86_BUG_ITS			X86_BUG( 1*32+ 7) /* "its" CPU is affected by Indirect Target Selection */
 #define X86_BUG_ITS_NATIVE_ONLY		X86_BUG( 1*32+ 8) /* "its_native_only" CPU is affected by ITS, VMX is not affected */
-
+#define X86_BUG_TSA			X86_BUG( 1*32+ 9) /* "tsa" CPU is affected by Transient Scheduler Attacks */
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 5cfb5d74dd5f..b65c3ba5fa14 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -419,6 +419,7 @@
 #define DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI	(1UL << 12)
 #define DEBUGCTLMSR_FREEZE_IN_SMM_BIT	14
 #define DEBUGCTLMSR_FREEZE_IN_SMM	(1UL << DEBUGCTLMSR_FREEZE_IN_SMM_BIT)
+#define DEBUGCTLMSR_RTM_DEBUG		BIT(15)
 
 #define MSR_PEBS_FRONTEND		0x000003f7
 
@@ -733,6 +734,11 @@
 #define MSR_AMD64_PERF_CNTR_GLOBAL_CTL		0xc0000301
 #define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR	0xc0000302
 
+/* AMD Hardware Feedback Support MSRs */
+#define MSR_AMD_WORKLOAD_CLASS_CONFIG		0xc0000500
+#define MSR_AMD_WORKLOAD_CLASS_ID		0xc0000501
+#define MSR_AMD_WORKLOAD_HRST			0xc0000502
+
 /* AMD Last Branch Record MSRs */
 #define MSR_AMD64_LBR_SELECT			0xc000010e
 
@@ -831,6 +837,7 @@
 #define MSR_K7_HWCR_SMMLOCK		BIT_ULL(MSR_K7_HWCR_SMMLOCK_BIT)
 #define MSR_K7_HWCR_IRPERF_EN_BIT	30
 #define MSR_K7_HWCR_IRPERF_EN		BIT_ULL(MSR_K7_HWCR_IRPERF_EN_BIT)
+#define MSR_K7_HWCR_CPUID_USER_DIS_BIT	35
 #define MSR_K7_FID_VID_CTL		0xc0010041
 #define MSR_K7_FID_VID_STATUS		0xc0010042
 #define MSR_K7_HWCR_CPB_DIS_BIT		25
-- 
2.25.1


From 14ec8ce45611c767656e4fa575f17b05344aa80a Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 06/14] tools headers: Sync arm64 headers with the kernel
 source
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

To pick up the changes in this cset:

  efe676a1a7554219 arm64: proton-pack: Add new CPUs 'k' values for branch mitigation
  e18c09b204e81702 arm64: Add support for HIP09 Spectre-BHB mitigation
  a9b5bd81b294d30a arm64: cputype: Add MIDR_CORTEX_A76AE
  53a52a0ec7680287 arm64: cputype: Add comments about Qualcomm Kryo 5XX and 6XX cores
  401c3333bb2396aa arm64: cputype: Add QCOM_CPU_PART_KRYO_3XX_GOLD
  86edf6bdcf0571c0 smccc/kvm_guest: Enable errata based on implementation CPUs
  0bc9a9e85fcf4ffb KVM: arm64: Work around x1e's CNTVOFF_EL2 bogosity

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/arm64/include/asm/cputype.h arch/arm64/include/asm/cputype.h

But the following two changes cannot be applied since they introduced
new build errors in util/arm-spe.c.  So it still has the warning after
this change.

  c8c2647e69bedf80 arm64: Make  _midr_in_range_list() an exported function
  e3121298c7fcaf48 arm64: Modify _midr_range() functions to read MIDR/REVIDR internally

Please see tools/include/uapi/README for further details.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf build: [WIP] Fix arm-spe build errors

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/arch/arm64/include/asm/cputype.h | 28 ++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tools/arch/arm64/include/asm/cputype.h b/tools/arch/arm64/include/asm/cputype.h
index 9a5d85cfd1fb..139d5e87dc95 100644
--- a/tools/arch/arm64/include/asm/cputype.h
+++ b/tools/arch/arm64/include/asm/cputype.h
@@ -75,11 +75,13 @@
 #define ARM_CPU_PART_CORTEX_A76		0xD0B
 #define ARM_CPU_PART_NEOVERSE_N1	0xD0C
 #define ARM_CPU_PART_CORTEX_A77		0xD0D
+#define ARM_CPU_PART_CORTEX_A76AE	0xD0E
 #define ARM_CPU_PART_NEOVERSE_V1	0xD40
 #define ARM_CPU_PART_CORTEX_A78		0xD41
 #define ARM_CPU_PART_CORTEX_A78AE	0xD42
 #define ARM_CPU_PART_CORTEX_X1		0xD44
 #define ARM_CPU_PART_CORTEX_A510	0xD46
+#define ARM_CPU_PART_CORTEX_X1C		0xD4C
 #define ARM_CPU_PART_CORTEX_A520	0xD80
 #define ARM_CPU_PART_CORTEX_A710	0xD47
 #define ARM_CPU_PART_CORTEX_A715	0xD4D
@@ -119,9 +121,11 @@
 #define QCOM_CPU_PART_KRYO		0x200
 #define QCOM_CPU_PART_KRYO_2XX_GOLD	0x800
 #define QCOM_CPU_PART_KRYO_2XX_SILVER	0x801
+#define QCOM_CPU_PART_KRYO_3XX_GOLD	0x802
 #define QCOM_CPU_PART_KRYO_3XX_SILVER	0x803
 #define QCOM_CPU_PART_KRYO_4XX_GOLD	0x804
 #define QCOM_CPU_PART_KRYO_4XX_SILVER	0x805
+#define QCOM_CPU_PART_ORYON_X1		0x001
 
 #define NVIDIA_CPU_PART_DENVER		0x003
 #define NVIDIA_CPU_PART_CARMEL		0x004
@@ -129,6 +133,7 @@
 #define FUJITSU_CPU_PART_A64FX		0x001
 
 #define HISI_CPU_PART_TSV110		0xD01
+#define HISI_CPU_PART_HIP09			0xD02
 #define HISI_CPU_PART_HIP12		0xD06
 
 #define APPLE_CPU_PART_M1_ICESTORM	0x022
@@ -159,11 +164,13 @@
 #define MIDR_CORTEX_A76	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76)
 #define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1)
 #define MIDR_CORTEX_A77	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77)
+#define MIDR_CORTEX_A76AE	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76AE)
 #define MIDR_NEOVERSE_V1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V1)
 #define MIDR_CORTEX_A78	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78)
 #define MIDR_CORTEX_A78AE	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78AE)
 #define MIDR_CORTEX_X1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1)
 #define MIDR_CORTEX_A510 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A510)
+#define MIDR_CORTEX_X1C MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1C)
 #define MIDR_CORTEX_A520 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A520)
 #define MIDR_CORTEX_A710 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A710)
 #define MIDR_CORTEX_A715 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A715)
@@ -196,13 +203,26 @@
 #define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
 #define MIDR_QCOM_KRYO_2XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_GOLD)
 #define MIDR_QCOM_KRYO_2XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_SILVER)
+#define MIDR_QCOM_KRYO_3XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_3XX_GOLD)
 #define MIDR_QCOM_KRYO_3XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_3XX_SILVER)
 #define MIDR_QCOM_KRYO_4XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_GOLD)
 #define MIDR_QCOM_KRYO_4XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_SILVER)
+#define MIDR_QCOM_ORYON_X1 MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_ORYON_X1)
+
+/*
+ * NOTES:
+ * - Qualcomm Kryo 5XX Prime / Gold ID themselves as MIDR_CORTEX_A77
+ * - Qualcomm Kryo 5XX Silver IDs itself as MIDR_QCOM_KRYO_4XX_SILVER
+ * - Qualcomm Kryo 6XX Prime IDs itself as MIDR_CORTEX_X1
+ * - Qualcomm Kryo 6XX Gold IDs itself as ARM_CPU_PART_CORTEX_A78
+ * - Qualcomm Kryo 6XX Silver IDs itself as MIDR_CORTEX_A55
+ */
+
 #define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
 #define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
 #define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
 #define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV110)
+#define MIDR_HISI_HIP09 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_HIP09)
 #define MIDR_HISI_HIP12 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_HIP12)
 #define MIDR_APPLE_M1_ICESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM)
 #define MIDR_APPLE_M1_FIRESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM)
@@ -291,6 +311,14 @@ static inline u32 __attribute_const__ read_cpuid_id(void)
 	return read_cpuid(MIDR_EL1);
 }
 
+struct target_impl_cpu {
+	u64 midr;
+	u64 revidr;
+	u64 aidr;
+};
+
+bool cpu_errata_set_target_impl(u64 num, void *impl_cpus);
+
 static inline u64 __attribute_const__ read_cpuid_mpidr(void)
 {
 	return read_cpuid(MPIDR_EL1);
-- 
2.25.1


From c85538c4e3c7111958057d15ea8ee444116891c3 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 07/14] tools headers: Sync powerpc headers with the kernel
 source

To pick up the changes in this cset:

  69bf2053608423cb powerpc: Drop GPL boilerplate text with obsolete FSF address

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/powerpc/include/uapi/asm/kvm.h arch/powerpc/include/uapi/asm/kvm.h

Please see tools/include/uapi/README for further details.

Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/arch/powerpc/include/uapi/asm/kvm.h | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/tools/arch/powerpc/include/uapi/asm/kvm.h b/tools/arch/powerpc/include/uapi/asm/kvm.h
index eaeda001784e..077c5437f521 100644
--- a/tools/arch/powerpc/include/uapi/asm/kvm.h
+++ b/tools/arch/powerpc/include/uapi/asm/kvm.h
@@ -1,18 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
 /*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License, version 2, as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
- *
  * Copyright IBM Corp. 2007
  *
  * Authors: Hollis Blanchard <hollisb@us.ibm.com>
-- 
2.25.1


From 52174e0eb13876654f56701c26a672890aa5e7e3 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 08/14] tools headers: Sync syscall tables with the kernel
 source

To pick up the changes in this cset:

  be7efb2d20d67f33 fs: introduce file_getattr and file_setattr syscalls

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
    diff -u tools/scripts/syscall.tbl scripts/syscall.tbl
    diff -u tools/perf/arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_32.tbl
    diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
    diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
    diff -u tools/perf/arch/arm/entry/syscalls/syscall.tbl arch/arm/tools/syscall.tbl
    diff -u tools/perf/arch/sh/entry/syscalls/syscall.tbl arch/sh/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/sparc/entry/syscalls/syscall.tbl arch/sparc/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/xtensa/entry/syscalls/syscall.tbl arch/xtensa/kernel/syscalls/syscall.tbl

Please see tools/include/uapi/README for further details.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-api@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/include/uapi/asm-generic/unistd.h             | 8 +++++++-
 tools/perf/arch/arm/entry/syscalls/syscall.tbl      | 2 ++
 tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 2 ++
 tools/perf/arch/powerpc/entry/syscalls/syscall.tbl  | 2 ++
 tools/perf/arch/s390/entry/syscalls/syscall.tbl     | 2 ++
 tools/perf/arch/sh/entry/syscalls/syscall.tbl       | 2 ++
 tools/perf/arch/sparc/entry/syscalls/syscall.tbl    | 2 ++
 tools/perf/arch/x86/entry/syscalls/syscall_32.tbl   | 2 ++
 tools/perf/arch/x86/entry/syscalls/syscall_64.tbl   | 2 ++
 tools/perf/arch/xtensa/entry/syscalls/syscall.tbl   | 2 ++
 tools/scripts/syscall.tbl                           | 2 ++
 11 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 2892a45023af..04e0077fb4c9 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -852,8 +852,14 @@ __SYSCALL(__NR_removexattrat, sys_removexattrat)
 #define __NR_open_tree_attr 467
 __SYSCALL(__NR_open_tree_attr, sys_open_tree_attr)
 
+/* fs/inode.c */
+#define __NR_file_getattr 468
+__SYSCALL(__NR_file_getattr, sys_file_getattr)
+#define __NR_file_setattr 469
+__SYSCALL(__NR_file_setattr, sys_file_setattr)
+
 #undef __NR_syscalls
-#define __NR_syscalls 468
+#define __NR_syscalls 470
 
 /*
  * 32 bit systems traditionally used different
diff --git a/tools/perf/arch/arm/entry/syscalls/syscall.tbl b/tools/perf/arch/arm/entry/syscalls/syscall.tbl
index 27c1d5ebcd91..b07e699aaa3c 100644
--- a/tools/perf/arch/arm/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/arm/entry/syscalls/syscall.tbl
@@ -482,3 +482,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index 1e8c44c7b614..7a7049c2c307 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -382,3 +382,5 @@
 465	n64	listxattrat			sys_listxattrat
 466	n64	removexattrat			sys_removexattrat
 467	n64	open_tree_attr			sys_open_tree_attr
+468	n64	file_getattr			sys_file_getattr
+469	n64	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 9a084bdb8926..b453e80dfc00 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -558,3 +558,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index a4569b96ef06..8a6744d658db 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
 465  common	listxattrat		sys_listxattrat			sys_listxattrat
 466  common	removexattrat		sys_removexattrat		sys_removexattrat
 467  common	open_tree_attr		sys_open_tree_attr		sys_open_tree_attr
+468  common	file_getattr		sys_file_getattr		sys_file_getattr
+469  common	file_setattr		sys_file_setattr		sys_file_setattr
diff --git a/tools/perf/arch/sh/entry/syscalls/syscall.tbl b/tools/perf/arch/sh/entry/syscalls/syscall.tbl
index 52a7652fcff6..5e9c9eff5539 100644
--- a/tools/perf/arch/sh/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/sh/entry/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/sparc/entry/syscalls/syscall.tbl b/tools/perf/arch/sparc/entry/syscalls/syscall.tbl
index 83e45eb6c095..ebb7d06d1044 100644
--- a/tools/perf/arch/sparc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/sparc/entry/syscalls/syscall.tbl
@@ -513,3 +513,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl
index ac007ea00979..4877e16da69a 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl
@@ -473,3 +473,5 @@
 465	i386	listxattrat		sys_listxattrat
 466	i386	removexattrat		sys_removexattrat
 467	i386	open_tree_attr		sys_open_tree_attr
+468	i386	file_getattr		sys_file_getattr
+469	i386	file_setattr		sys_file_setattr
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index cfb5ca41e30d..92cf0fe2291e 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -391,6 +391,8 @@
 465	common	listxattrat		sys_listxattrat
 466	common	removexattrat		sys_removexattrat
 467	common	open_tree_attr		sys_open_tree_attr
+468	common	file_getattr		sys_file_getattr
+469	common	file_setattr		sys_file_setattr
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl b/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl
index f657a77314f8..374e4cb788d8 100644
--- a/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl
@@ -438,3 +438,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/scripts/syscall.tbl b/tools/scripts/syscall.tbl
index 580b4e246aec..d1ae5e92c615 100644
--- a/tools/scripts/syscall.tbl
+++ b/tools/scripts/syscall.tbl
@@ -408,3 +408,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
-- 
2.25.1


From b18aabe283a10774977d698c075d2296a2336aef Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 09/14] tools headers: Sync uapi/linux/fcntl.h with the kernel
 source

To pick up the changes in this cset:

  3941e37f62fe2c3c uapi/fcntl: add FD_PIDFS_ROOT
  cd5d2006327b6d84 uapi/fcntl: add FD_INVALID
  67fcec2919e4ed31 fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
  a4c746f06853f91d uapi/fcntl: mark range as reserved

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/fcntl.h include/uapi/linux/fcntl.h

Please see tools/include/uapi/README for further details.

Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 .../trace/beauty/include/uapi/linux/fcntl.h    | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tools/perf/trace/beauty/include/uapi/linux/fcntl.h b/tools/perf/trace/beauty/include/uapi/linux/fcntl.h
index a15ac2fa4b20..f291ab4f94eb 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/fcntl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/fcntl.h
@@ -90,10 +90,28 @@
 #define DN_ATTRIB	0x00000020	/* File changed attibutes */
 #define DN_MULTISHOT	0x80000000	/* Don't remove notifier */
 
+/* Reserved kernel ranges [-100], [-10000, -40000]. */
 #define AT_FDCWD		-100    /* Special value for dirfd used to
 					   indicate openat should use the
 					   current working directory. */
 
+/*
+ * The concept of process and threads in userland and the kernel is a confusing
+ * one - within the kernel every thread is a 'task' with its own individual PID,
+ * however from userland's point of view threads are grouped by a single PID,
+ * which is that of the 'thread group leader', typically the first thread
+ * spawned.
+ *
+ * To cut the Gideon knot, for internal kernel usage, we refer to
+ * PIDFD_SELF_THREAD to refer to the current thread (or task from a kernel
+ * perspective), and PIDFD_SELF_THREAD_GROUP to refer to the current thread
+ * group leader...
+ */
+#define PIDFD_SELF_THREAD		-10000 /* Current thread. */
+#define PIDFD_SELF_THREAD_GROUP		-10001 /* Current thread group leader. */
+
+#define FD_PIDFS_ROOT			-10002 /* Root of the pidfs filesystem */
+#define FD_INVALID			-10009 /* Invalid file descriptor: -10000 - EBADF = -10009 */
 
 /* Generic flags for the *at(2) family of syscalls. */
 
-- 
2.25.1


From 4a4083af03a7a75a86c392fd60cb37ce23ed87b6 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 10/14] tools headers: Sync uapi/linux/fs.h with the kernel
 source

To pick up the changes in this cset:

  76fdb7eb4e1c9108 uapi: export PROCFS_ROOT_INO
  ca115d7e754691c0 tree-wide: s/struct fileattr/struct file_kattr/g
  be7efb2d20d67f33 fs: introduce file_getattr and file_setattr syscalls
  9eb22f7fedfc9eb1 fs: add ioctl to query metadata and protection info capabilities

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/fs.h include/uapi/linux/fs.h

Please see tools/include/uapi/README for further details.

Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 .../perf/trace/beauty/include/uapi/linux/fs.h | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/tools/perf/trace/beauty/include/uapi/linux/fs.h b/tools/perf/trace/beauty/include/uapi/linux/fs.h
index 0098b0ce8ccb..0bd678a4a10e 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/fs.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/fs.h
@@ -60,6 +60,17 @@
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
 
+/*
+ * The root inode of procfs is guaranteed to always have the same inode number.
+ * For programs that make heavy use of procfs, verifying that the root is a
+ * real procfs root and using openat2(RESOLVE_{NO_{XDEV,MAGICLINKS},BENEATH})
+ * will allow you to make sure you are never tricked into operating on the
+ * wrong procfs file.
+ */
+enum procfs_ino {
+	PROCFS_ROOT_INO = 1,
+};
+
 struct file_clone_range {
 	__s64 src_fd;
 	__u64 src_offset;
@@ -91,6 +102,63 @@ struct fs_sysfs_path {
 	__u8			name[128];
 };
 
+/* Protection info capability flags */
+#define	LBMD_PI_CAP_INTEGRITY		(1 << 0)
+#define	LBMD_PI_CAP_REFTAG		(1 << 1)
+
+/* Checksum types for Protection Information */
+#define LBMD_PI_CSUM_NONE		0
+#define LBMD_PI_CSUM_IP			1
+#define LBMD_PI_CSUM_CRC16_T10DIF	2
+#define LBMD_PI_CSUM_CRC64_NVME		4
+
+/* sizeof first published struct */
+#define LBMD_SIZE_VER0			16
+
+/*
+ * Logical block metadata capability descriptor
+ * If the device does not support metadata, all the fields will be zero.
+ * Applications must check lbmd_flags to determine whether metadata is
+ * supported or not.
+ */
+struct logical_block_metadata_cap {
+	/* Bitmask of logical block metadata capability flags */
+	__u32	lbmd_flags;
+	/*
+	 * The amount of data described by each unit of logical block
+	 * metadata
+	 */
+	__u16	lbmd_interval;
+	/*
+	 * Size in bytes of the logical block metadata associated with each
+	 * interval
+	 */
+	__u8	lbmd_size;
+	/*
+	 * Size in bytes of the opaque block tag associated with each
+	 * interval
+	 */
+	__u8	lbmd_opaque_size;
+	/*
+	 * Offset in bytes of the opaque block tag within the logical block
+	 * metadata
+	 */
+	__u8	lbmd_opaque_offset;
+	/* Size in bytes of the T10 PI tuple associated with each interval */
+	__u8	lbmd_pi_size;
+	/* Offset in bytes of T10 PI tuple within the logical block metadata */
+	__u8	lbmd_pi_offset;
+	/* T10 PI guard tag type */
+	__u8	lbmd_guard_tag_type;
+	/* Size in bytes of the T10 PI application tag */
+	__u8	lbmd_app_tag_size;
+	/* Size in bytes of the T10 PI reference tag */
+	__u8	lbmd_ref_tag_size;
+	/* Size in bytes of the T10 PI storage tag */
+	__u8	lbmd_storage_tag_size;
+	__u8	pad;
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
@@ -148,6 +216,24 @@ struct fsxattr {
 	unsigned char	fsx_pad[8];
 };
 
+/*
+ * Variable size structure for file_[sg]et_attr().
+ *
+ * Note. This is alternative to the structure 'struct file_kattr'/'struct fsxattr'.
+ * As this structure is passed to/from userspace with its size, this can
+ * be versioned based on the size.
+ */
+struct file_attr {
+	__u64 fa_xflags;	/* xflags field value (get/set) */
+	__u32 fa_extsize;	/* extsize field value (get/set)*/
+	__u32 fa_nextents;	/* nextents field value (get)   */
+	__u32 fa_projid;	/* project identifier (get/set) */
+	__u32 fa_cowextsize;	/* CoW extsize field value (get/set) */
+};
+
+#define FILE_ATTR_SIZE_VER0 24
+#define FILE_ATTR_SIZE_LATEST FILE_ATTR_SIZE_VER0
+
 /*
  * Flags for the fsx_xflags field
  */
@@ -247,6 +333,8 @@ struct fsxattr {
  * also /sys/kernel/debug/ for filesystems with debugfs exports
  */
 #define FS_IOC_GETFSSYSFSPATH		_IOR(0x15, 1, struct fs_sysfs_path)
+/* Get logical block metadata capability details */
+#define FS_IOC_GETLBMD_CAP		_IOWR(0x15, 2, struct logical_block_metadata_cap)
 
 /*
  * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
-- 
2.25.1


From e7e79e99726190a5a83d158576cd448896d68102 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 11/14] tools headers: Sync uapi/linux/prctl.h with the kernel
 source

To pick up the changes in this cset:

  b1fabef37bd504f3 prctl: Introduce PR_MTE_STORE_ONLY
  a2fc422ed75748ee syscall_user_dispatch: Add PR_SYS_DISPATCH_INCLUSIVE_ON

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/prctl.h include/uapi/linux/prctl.h

Please see tools/include/uapi/README for further details.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/trace/beauty/include/uapi/linux/prctl.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index 3b93fb906e3c..ed3aed264aeb 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -244,6 +244,8 @@ struct prctl_mm_map {
 # define PR_MTE_TAG_MASK		(0xffffUL << PR_MTE_TAG_SHIFT)
 /* Unused; kept only for source compatibility */
 # define PR_MTE_TCF_SHIFT		1
+/* MTE tag check store only */
+# define PR_MTE_STORE_ONLY		(1UL << 19)
 /* RISC-V pointer masking tag length */
 # define PR_PMLEN_SHIFT			24
 # define PR_PMLEN_MASK			(0x7fUL << PR_PMLEN_SHIFT)
@@ -255,7 +257,12 @@ struct prctl_mm_map {
 /* Dispatch syscalls to a userspace handler */
 #define PR_SET_SYSCALL_USER_DISPATCH	59
 # define PR_SYS_DISPATCH_OFF		0
-# define PR_SYS_DISPATCH_ON		1
+/* Enable dispatch except for the specified range */
+# define PR_SYS_DISPATCH_EXCLUSIVE_ON	1
+/* Enable dispatch for the specified range */
+# define PR_SYS_DISPATCH_INCLUSIVE_ON	2
+/* Legacy name for backwards compatibility */
+# define PR_SYS_DISPATCH_ON		PR_SYS_DISPATCH_EXCLUSIVE_ON
 /* The control values for the user space selector when dispatch is enabled */
 # define SYSCALL_DISPATCH_FILTER_ALLOW	0
 # define SYSCALL_DISPATCH_FILTER_BLOCK	1
-- 
2.25.1


From f79a62f4b3c750759e60a402e8fe5180fc5771f0 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 12/14] tools headers: Sync uapi/linux/vhost.h with the kernel
 source

To pick up the changes in this cset:

  7d9896e9f6d02d8a vhost: Reintroduce kthread API and add mode selection
  333c515d189657c9 vhost-net: allow configuring extended features

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/vhost.h include/uapi/linux/vhost.h

Please see tools/include/uapi/README for further details.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux.dev
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 .../trace/beauty/include/uapi/linux/vhost.h   | 35 +++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/tools/perf/trace/beauty/include/uapi/linux/vhost.h b/tools/perf/trace/beauty/include/uapi/linux/vhost.h
index d4b3e2ae1314..c57674a6aa0d 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/vhost.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/vhost.h
@@ -235,4 +235,39 @@
  */
 #define VHOST_VDPA_GET_VRING_SIZE	_IOWR(VHOST_VIRTIO, 0x82,	\
 					      struct vhost_vring_state)
+
+/* Extended features manipulation */
+#define VHOST_GET_FEATURES_ARRAY _IOR(VHOST_VIRTIO, 0x83, \
+				       struct vhost_features_array)
+#define VHOST_SET_FEATURES_ARRAY _IOW(VHOST_VIRTIO, 0x83, \
+				       struct vhost_features_array)
+
+/* fork_owner values for vhost */
+#define VHOST_FORK_OWNER_KTHREAD 0
+#define VHOST_FORK_OWNER_TASK 1
+
+/**
+ * VHOST_SET_FORK_FROM_OWNER - Set the fork_owner flag for the vhost device,
+ * This ioctl must called before VHOST_SET_OWNER.
+ * Only available when CONFIG_VHOST_ENABLE_FORK_OWNER_CONTROL=y
+ *
+ * @param fork_owner: An 8-bit value that determines the vhost thread mode
+ *
+ * When fork_owner is set to VHOST_FORK_OWNER_TASK(default value):
+ *   - Vhost will create vhost worker as tasks forked from the owner,
+ *     inheriting all of the owner's attributes.
+ *
+ * When fork_owner is set to VHOST_FORK_OWNER_KTHREAD:
+ *   - Vhost will create vhost workers as kernel threads.
+ */
+#define VHOST_SET_FORK_FROM_OWNER _IOW(VHOST_VIRTIO, 0x84, __u8)
+
+/**
+ * VHOST_GET_FORK_OWNER - Get the current fork_owner flag for the vhost device.
+ * Only available when CONFIG_VHOST_ENABLE_FORK_OWNER_CONTROL=y
+ *
+ * @return: An 8-bit value indicating the current thread mode.
+ */
+#define VHOST_GET_FORK_FROM_OWNER _IOR(VHOST_VIRTIO, 0x85, __u8)
+
 #endif
-- 
2.25.1


From ba0b7081f7a521d7c28b527a4f18666a148471e7 Mon Sep 17 00:00:00 2001
From: Ian Rogers <irogers@google.com>
Date: Fri, 22 Aug 2025 17:00:23 -0700
Subject: [PATCH 13/14] perf symbol-minimal: Fix ehdr reading in
 filename__read_build_id

The e_ident is part of the ehdr, so reading the full ehdr after a
separate e_ident read, without seeking back, left the ehdr contents
displaced by 16 bytes. Switch from stdio to open/read/lseek syscalls
for similarity with the symbol-elf version of the function and so that
later changes can alter the open flags.
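
As an illustration only, the two-step header read can be sketched as a
standalone program fragment. The names read_ehdr_two_step() and union
ehdr_any are hypothetical and not part of perf; the sketch assumes a
Linux system with <elf.h> and omits the byte swapping that the real
function performs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <elf.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical container: both layouts start with e_ident at offset 0. */
union ehdr_any {
	Elf32_Ehdr ehdr32;
	Elf64_Ehdr ehdr64;
};

/*
 * Hypothetical helper (not perf code): read an ELF header in two steps
 * without reading e_ident twice. Returns 0 on success, -1 on error.
 */
int read_ehdr_two_step(const char *path, union ehdr_any *hdr, bool *elf32)
{
	ssize_t rest;
	int fd, ret = -1;

	assert(path && hdr && elf32);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Step 1: read only e_ident, the first EI_NIDENT bytes of the header. */
	if (read(fd, hdr->ehdr32.e_ident, EI_NIDENT) != EI_NIDENT)
		goto out;
	if (memcmp(hdr->ehdr32.e_ident, ELFMAG, SELFMAG))
		goto out;

	*elf32 = hdr->ehdr32.e_ident[EI_CLASS] == ELFCLASS32;

	/*
	 * Step 2: the file offset is already EI_NIDENT, so read just the
	 * remainder and store it directly after e_ident in the buffer.
	 */
	rest = (ssize_t)(*elf32 ? sizeof(hdr->ehdr32) : sizeof(hdr->ehdr64)) - EI_NIDENT;
	if (read(fd, (char *)hdr + EI_NIDENT, rest) != rest)
		goto out;

	ret = 0;
out:
	close(fd);
	return ret;
}
```

Reading sizeof(ehdr) bytes in step 2 instead of the remainder would
store file bytes 16 onward at buffer offset 0, which is the
displacement described above.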

Fixes: fef8f648bb47 ("perf symbol: Fix use-after-free in filename__read_build_id")
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250823000024.724394-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/symbol-minimal.c | 55 ++++++++++++++++----------------
 1 file changed, 27 insertions(+), 28 deletions(-)

diff --git a/tools/perf/util/symbol-minimal.c b/tools/perf/util/symbol-minimal.c
index 7201494c5c20..8d41bd7842df 100644
--- a/tools/perf/util/symbol-minimal.c
+++ b/tools/perf/util/symbol-minimal.c
@@ -4,7 +4,6 @@
 
 #include <errno.h>
 #include <unistd.h>
-#include <stdio.h>
 #include <fcntl.h>
 #include <string.h>
 #include <stdlib.h>
@@ -88,11 +87,8 @@ int filename__read_debuglink(const char *filename __maybe_unused,
  */
 int filename__read_build_id(const char *filename, struct build_id *bid)
 {
-	FILE *fp;
-	int ret = -1;
+	int fd, ret = -1;
 	bool need_swap = false, elf32;
-	u8 e_ident[EI_NIDENT];
-	int i;
 	union {
 		struct {
 			Elf32_Ehdr ehdr32;
@@ -103,28 +99,27 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 			Elf64_Phdr *phdr64;
 		};
 	} hdrs;
-	void *phdr;
-	size_t phdr_size;
-	void *buf = NULL;
-	size_t buf_size = 0;
+	void *phdr, *buf = NULL;
+	ssize_t phdr_size, ehdr_size, buf_size = 0;
 
-	fp = fopen(filename, "r");
-	if (fp == NULL)
+	fd = open(filename, O_RDONLY);
+	if (fd < 0)
 		return -1;
 
-	if (fread(e_ident, sizeof(e_ident), 1, fp) != 1)
+	if (read(fd, hdrs.ehdr32.e_ident, EI_NIDENT) != EI_NIDENT)
 		goto out;
 
-	if (memcmp(e_ident, ELFMAG, SELFMAG) ||
-	    e_ident[EI_VERSION] != EV_CURRENT)
+	if (memcmp(hdrs.ehdr32.e_ident, ELFMAG, SELFMAG) ||
+	    hdrs.ehdr32.e_ident[EI_VERSION] != EV_CURRENT)
 		goto out;
 
-	need_swap = check_need_swap(e_ident[EI_DATA]);
-	elf32 = e_ident[EI_CLASS] == ELFCLASS32;
+	need_swap = check_need_swap(hdrs.ehdr32.e_ident[EI_DATA]);
+	elf32 = hdrs.ehdr32.e_ident[EI_CLASS] == ELFCLASS32;
+	ehdr_size = (elf32 ? sizeof(hdrs.ehdr32) : sizeof(hdrs.ehdr64)) - EI_NIDENT;
 
-	if (fread(elf32 ? (void *)&hdrs.ehdr32 : (void *)&hdrs.ehdr64,
-		  elf32 ? sizeof(hdrs.ehdr32) : sizeof(hdrs.ehdr64),
-		  1, fp) != 1)
+	if (read(fd,
+		 (elf32 ? (void *)&hdrs.ehdr32 : (void *)&hdrs.ehdr64) + EI_NIDENT,
+		 ehdr_size) != ehdr_size)
 		goto out;
 
 	if (need_swap) {
@@ -138,14 +133,18 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 			hdrs.ehdr64.e_phnum = bswap_16(hdrs.ehdr64.e_phnum);
 		}
 	}
-	phdr_size = elf32 ? hdrs.ehdr32.e_phentsize * hdrs.ehdr32.e_phnum
-			  : hdrs.ehdr64.e_phentsize * hdrs.ehdr64.e_phnum;
+	if ((elf32 && hdrs.ehdr32.e_phentsize != sizeof(Elf32_Phdr)) ||
+	    (!elf32 && hdrs.ehdr64.e_phentsize != sizeof(Elf64_Phdr)))
+		goto out;
+
+	phdr_size = elf32 ? sizeof(Elf32_Phdr) * hdrs.ehdr32.e_phnum
+			  : sizeof(Elf64_Phdr) * hdrs.ehdr64.e_phnum;
 	phdr = malloc(phdr_size);
 	if (phdr == NULL)
 		goto out;
 
-	fseek(fp, elf32 ? hdrs.ehdr32.e_phoff : hdrs.ehdr64.e_phoff, SEEK_SET);
-	if (fread(phdr, phdr_size, 1, fp) != 1)
+	lseek(fd, elf32 ? hdrs.ehdr32.e_phoff : hdrs.ehdr64.e_phoff, SEEK_SET);
+	if (read(fd, phdr, phdr_size) != phdr_size)
 		goto out_free;
 
 	if (elf32)
@@ -153,8 +152,8 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 	else
 		hdrs.phdr64 = phdr;
 
-	for (i = 0; i < elf32 ? hdrs.ehdr32.e_phnum : hdrs.ehdr64.e_phnum; i++) {
-		size_t p_filesz;
+	for (int i = 0; i < (elf32 ? hdrs.ehdr32.e_phnum : hdrs.ehdr64.e_phnum); i++) {
+		ssize_t p_filesz;
 
 		if (need_swap) {
 			if (elf32) {
@@ -180,8 +179,8 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 				goto out_free;
 			buf = tmp;
 		}
-		fseek(fp, elf32 ? hdrs.phdr32[i].p_offset : hdrs.phdr64[i].p_offset, SEEK_SET);
-		if (fread(buf, p_filesz, 1, fp) != 1)
+		lseek(fd, elf32 ? hdrs.phdr32[i].p_offset : hdrs.phdr64[i].p_offset, SEEK_SET);
+		if (read(fd, buf, p_filesz) != p_filesz)
 			goto out_free;
 
 		ret = read_build_id(buf, p_filesz, bid, need_swap);
@@ -194,7 +193,7 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 	free(buf);
 	free(phdr);
 out:
-	fclose(fp);
+	close(fd);
 	return ret;
 }
 
-- 
2.25.1


From 2c369d91d0933aaff96b6b807b22363e6a38a625 Mon Sep 17 00:00:00 2001
From: Ian Rogers <irogers@google.com>
Date: Fri, 22 Aug 2025 17:00:24 -0700
Subject: [PATCH 14/14] perf symbol: Add blocking argument to
 filename__read_build_id

When synthesizing build-ids, for build ID mmap2 events, they will be
added for data mmaps if -d/--data is specified. The files opened for
their build IDs may block on the open causing perf to hang during
synthesis. There is some robustness in existing calls to
filename__read_build_id by checking the file path is to a regular
file, which unfortunately fails for symlinks. Rather than adding more
is_regular_file calls, switch filename__read_build_id to take a
"block" argument and specify O_NONBLOCK when this is false. The
existing is_regular_file checking callers and the event synthesis
callers are made to pass false, thereby avoiding the hang.

Fixes: 53b00ff358dc ("perf record: Make --buildid-mmap the default")
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250823000024.724394-3-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/bench/inject-buildid.c  | 2 +-
 tools/perf/builtin-buildid-cache.c | 8 ++++----
 tools/perf/builtin-inject.c        | 4 ++--
 tools/perf/tests/sdt.c             | 2 +-
 tools/perf/util/build-id.c         | 4 ++--
 tools/perf/util/debuginfo.c        | 8 ++++++--
 tools/perf/util/dsos.c             | 4 ++--
 tools/perf/util/symbol-elf.c       | 9 +++++----
 tools/perf/util/symbol-minimal.c   | 6 +++---
 tools/perf/util/symbol.c           | 8 ++++----
 tools/perf/util/symbol.h           | 2 +-
 tools/perf/util/synthetic-events.c | 2 +-
 12 files changed, 32 insertions(+), 27 deletions(-)
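
Note (not part of the commit): the hang is easiest to picture with a
FIFO. That is an assumed example of a non-regular path; the commit
message itself does not name a file type. The Python sketch below opens
a FIFO that has no writer with O_RDONLY | O_NONBLOCK, the same flag
combination the patch passes when block is false. The helper names and
the file name are hypothetical.

```python
import os
import tempfile


def read_fifo_without_blocking(path):
    """Open `path` read-only with O_NONBLOCK and try to read a few bytes."""
    # With plain O_RDONLY this open() would sleep until a writer appears.
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        # No process has the FIFO open for writing, so read() reports
        # end-of-file instead of waiting for data.
        return os.read(fd, 16)
    finally:
        os.close(fd)


def demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo = os.path.join(tmpdir, "not-a-regular-file")  # hypothetical name
        os.mkfifo(fifo)
        return read_fifo_without_blocking(fifo)
```

A caller that gets an empty read back simply finds no ELF header and
gives up, which is the outcome the non-blocking callers want.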

diff --git a/tools/perf/bench/inject-buildid.c b/tools/perf/bench/inject-buildid.c
index aad572a78d7f..12387ea88b9a 100644
--- a/tools/perf/bench/inject-buildid.c
+++ b/tools/perf/bench/inject-buildid.c
@@ -85,7 +85,7 @@ static int add_dso(const char *fpath, const struct stat *sb __maybe_unused,
 	if (typeflag == FTW_D || typeflag == FTW_SL)
 		return 0;
 
-	if (filename__read_build_id(fpath, &bid) < 0)
+	if (filename__read_build_id(fpath, &bid, /*block=*/true) < 0)
 		return 0;
 
 	dso->name = realpath(fpath, NULL);
diff --git a/tools/perf/builtin-buildid-cache.c b/tools/perf/builtin-buildid-cache.c
index c98104481c8a..2e0f2004696a 100644
--- a/tools/perf/builtin-buildid-cache.c
+++ b/tools/perf/builtin-buildid-cache.c
@@ -180,7 +180,7 @@ static int build_id_cache__add_file(const char *filename, struct nsinfo *nsi)
 	struct nscookie nsc;
 
 	nsinfo__mountns_enter(nsi, &nsc);
-	err = filename__read_build_id(filename, &bid);
+	err = filename__read_build_id(filename, &bid, /*block=*/true);
 	nsinfo__mountns_exit(&nsc);
 	if (err < 0) {
 		pr_debug("Couldn't read a build-id in %s\n", filename);
@@ -204,7 +204,7 @@ static int build_id_cache__remove_file(const char *filename, struct nsinfo *nsi)
 	int err;
 
 	nsinfo__mountns_enter(nsi, &nsc);
-	err = filename__read_build_id(filename, &bid);
+	err = filename__read_build_id(filename, &bid, /*block=*/true);
 	nsinfo__mountns_exit(&nsc);
 	if (err < 0) {
 		pr_debug("Couldn't read a build-id in %s\n", filename);
@@ -280,7 +280,7 @@ static bool dso__missing_buildid_cache(struct dso *dso, int parm __maybe_unused)
 	if (!dso__build_id_filename(dso, filename, sizeof(filename), false))
 		return true;
 
-	if (filename__read_build_id(filename, &bid) == -1) {
+	if (filename__read_build_id(filename, &bid, /*block=*/true) == -1) {
 		if (errno == ENOENT)
 			return false;
 
@@ -309,7 +309,7 @@ static int build_id_cache__update_file(const char *filename, struct nsinfo *nsi)
 	int err;
 
 	nsinfo__mountns_enter(nsi, &nsc);
-	err = filename__read_build_id(filename, &bid);
+	err = filename__read_build_id(filename, &bid, /*block=*/true);
 	nsinfo__mountns_exit(&nsc);
 	if (err < 0) {
 		pr_debug("Couldn't read a build-id in %s\n", filename);
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 40ba6a94f719..a114b3fa1bea 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -680,12 +680,12 @@ static int dso__read_build_id(struct dso *dso)
 
 	mutex_lock(dso__lock(dso));
 	nsinfo__mountns_enter(dso__nsinfo(dso), &nsc);
-	if (filename__read_build_id(dso__long_name(dso), &bid) > 0)
+	if (filename__read_build_id(dso__long_name(dso), &bid, /*block=*/true) > 0)
 		dso__set_build_id(dso, &bid);
 	else if (dso__nsinfo(dso)) {
 		char *new_name = dso__filename_with_chroot(dso, dso__long_name(dso));
 
-		if (new_name && filename__read_build_id(new_name, &bid) > 0)
+		if (new_name && filename__read_build_id(new_name, &bid, /*block=*/true) > 0)
 			dso__set_build_id(dso, &bid);
 		free(new_name);
 	}
diff --git a/tools/perf/tests/sdt.c b/tools/perf/tests/sdt.c
index 93baee2eae42..6132f1af3e22 100644
--- a/tools/perf/tests/sdt.c
+++ b/tools/perf/tests/sdt.c
@@ -31,7 +31,7 @@ static int build_id_cache__add_file(const char *filename)
 	struct build_id bid = { .size = 0, };
 	int err;
 
-	err = filename__read_build_id(filename, &bid);
+	err = filename__read_build_id(filename, &bid, /*block=*/true);
 	if (err < 0) {
 		pr_debug("Failed to read build id of %s\n", filename);
 		return err;
diff --git a/tools/perf/util/build-id.c b/tools/perf/util/build-id.c
index a7018a3b0437..bf7f3268b9a2 100644
--- a/tools/perf/util/build-id.c
+++ b/tools/perf/util/build-id.c
@@ -115,7 +115,7 @@ int filename__snprintf_build_id(const char *pathname, char *sbuild_id, size_t sb
 	struct build_id bid = { .size = 0, };
 	int ret;
 
-	ret = filename__read_build_id(pathname, &bid);
+	ret = filename__read_build_id(pathname, &bid, /*block=*/true);
 	if (ret < 0)
 		return ret;
 
@@ -841,7 +841,7 @@ static int filename__read_build_id_ns(const char *filename,
 	int ret;
 
 	nsinfo__mountns_enter(nsi, &nsc);
-	ret = filename__read_build_id(filename, bid);
+	ret = filename__read_build_id(filename, bid, /*block=*/true);
 	nsinfo__mountns_exit(&nsc);
 
 	return ret;
diff --git a/tools/perf/util/debuginfo.c b/tools/perf/util/debuginfo.c
index a44c70f93156..bb9ebd84ec2d 100644
--- a/tools/perf/util/debuginfo.c
+++ b/tools/perf/util/debuginfo.c
@@ -110,8 +110,12 @@ struct debuginfo *debuginfo__new(const char *path)
 	if (!dso)
 		goto out;
 
-	/* Set the build id for DSO_BINARY_TYPE__BUILDID_DEBUGINFO */
-	if (is_regular_file(path) && filename__read_build_id(path, &bid) > 0)
+	/*
+	 * Set the build id for DSO_BINARY_TYPE__BUILDID_DEBUGINFO. Don't block
+	 * in case the path isn't for a regular file.
+	 */
+	assert(!dso__has_build_id(dso));
+	if (filename__read_build_id(path, &bid, /*block=*/false) > 0)
 		dso__set_build_id(dso, &bid);
 
 	for (type = distro_dwarf_types;
diff --git a/tools/perf/util/dsos.c b/tools/perf/util/dsos.c
index 0a7645c7fae7..64c1d65b0149 100644
--- a/tools/perf/util/dsos.c
+++ b/tools/perf/util/dsos.c
@@ -81,13 +81,13 @@ static int dsos__read_build_ids_cb(struct dso *dso, void *data)
 		return 0;
 	}
 	nsinfo__mountns_enter(dso__nsinfo(dso), &nsc);
-	if (filename__read_build_id(dso__long_name(dso), &bid) > 0) {
+	if (filename__read_build_id(dso__long_name(dso), &bid, /*block=*/true) > 0) {
 		dso__set_build_id(dso, &bid);
 		args->have_build_id = true;
 	} else if (errno == ENOENT && dso__nsinfo(dso)) {
 		char *new_name = dso__filename_with_chroot(dso, dso__long_name(dso));
 
-		if (new_name && filename__read_build_id(new_name, &bid) > 0) {
+		if (new_name && filename__read_build_id(new_name, &bid, /*block=*/true) > 0) {
 			dso__set_build_id(dso, &bid);
 			args->have_build_id = true;
 		}
diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c
index 6d2c280a1730..033c79231a54 100644
--- a/tools/perf/util/symbol-elf.c
+++ b/tools/perf/util/symbol-elf.c
@@ -902,7 +902,7 @@ static int read_build_id(const char *filename, struct build_id *bid)
 
 #else // HAVE_LIBBFD_BUILDID_SUPPORT
 
-static int read_build_id(const char *filename, struct build_id *bid)
+static int read_build_id(const char *filename, struct build_id *bid, bool block)
 {
 	size_t size = sizeof(bid->data);
 	int fd, err = -1;
@@ -911,7 +911,7 @@ static int read_build_id(const char *filename, struct build_id *bid)
 	if (size < BUILD_ID_SIZE)
 		goto out;
 
-	fd = open(filename, O_RDONLY);
+	fd = open(filename, block ? O_RDONLY : (O_RDONLY | O_NONBLOCK));
 	if (fd < 0)
 		goto out;
 
@@ -934,7 +934,7 @@ static int read_build_id(const char *filename, struct build_id *bid)
 
 #endif // HAVE_LIBBFD_BUILDID_SUPPORT
 
-int filename__read_build_id(const char *filename, struct build_id *bid)
+int filename__read_build_id(const char *filename, struct build_id *bid, bool block)
 {
 	struct kmod_path m = { .name = NULL, };
 	char path[PATH_MAX];
@@ -958,9 +958,10 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 		}
 		close(fd);
 		filename = path;
+		block = true;
 	}
 
-	err = read_build_id(filename, bid);
+	err = read_build_id(filename, bid, block);
 
 	if (m.comp)
 		unlink(filename);
diff --git a/tools/perf/util/symbol-minimal.c b/tools/perf/util/symbol-minimal.c
index 8d41bd7842df..41e4ebe5eac5 100644
--- a/tools/perf/util/symbol-minimal.c
+++ b/tools/perf/util/symbol-minimal.c
@@ -85,7 +85,7 @@ int filename__read_debuglink(const char *filename __maybe_unused,
 /*
  * Just try PT_NOTE header otherwise fails
  */
-int filename__read_build_id(const char *filename, struct build_id *bid)
+int filename__read_build_id(const char *filename, struct build_id *bid, bool block)
 {
 	int fd, ret = -1;
 	bool need_swap = false, elf32;
@@ -102,7 +102,7 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 	void *phdr, *buf = NULL;
 	ssize_t phdr_size, ehdr_size, buf_size = 0;
 
-	fd = open(filename, O_RDONLY);
+	fd = open(filename, block ? O_RDONLY : (O_RDONLY | O_NONBLOCK));
 	if (fd < 0)
 		return -1;
 
@@ -323,7 +323,7 @@ int dso__load_sym(struct dso *dso, struct map *map __maybe_unused,
 	if (ret >= 0)
 		RC_CHK_ACCESS(dso)->is_64_bit = ret;
 
-	if (filename__read_build_id(ss->name, &bid) > 0)
+	if (filename__read_build_id(ss->name, &bid, /*block=*/true) > 0)
 		dso__set_build_id(dso, &bid);
 	return 0;
 }
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index e816e4220d33..3fed54de5401 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -1869,14 +1869,14 @@ int dso__load(struct dso *dso, struct map *map)
 
 	/*
 	 * Read the build id if possible. This is required for
-	 * DSO_BINARY_TYPE__BUILDID_DEBUGINFO to work
+	 * DSO_BINARY_TYPE__BUILDID_DEBUGINFO to work. Don't block in case path
+	 * isn't for a regular file.
 	 */
-	if (!dso__has_build_id(dso) &&
-	    is_regular_file(dso__long_name(dso))) {
+	if (!dso__has_build_id(dso)) {
 		struct build_id bid = { .size = 0, };
 
 		__symbol__join_symfs(name, PATH_MAX, dso__long_name(dso));
-		if (filename__read_build_id(name, &bid) > 0)
+		if (filename__read_build_id(name, &bid, /*block=*/false) > 0)
 			dso__set_build_id(dso, &bid);
 	}
 
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 3fb5d146d9b1..347106218799 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -140,7 +140,7 @@ struct symbol *dso__next_symbol(struct symbol *sym);
 
 enum dso_type dso__type_fd(int fd);
 
-int filename__read_build_id(const char *filename, struct build_id *id);
+int filename__read_build_id(const char *filename, struct build_id *id, bool block);
 int sysfs__read_build_id(const char *filename, struct build_id *bid);
 int modules__parse(const char *filename, void *arg,
 		   int (*process_module)(void *arg, const char *name,
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index cb2c1ace304a..fcd1fd13c30e 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -401,7 +401,7 @@ static void perf_record_mmap2__read_build_id(struct perf_record_mmap2 *event,
 	nsi = nsinfo__new(event->pid);
 	nsinfo__mountns_enter(nsi, &nc);
 
-	rc = filename__read_build_id(event->filename, &bid) > 0 ? 0 : -1;
+	rc = filename__read_build_id(event->filename, &bid, /*block=*/false) > 0 ? 0 : -1;
 
 	nsinfo__mountns_exit(&nc);
 	nsinfo__put(nsi);
-- 
2.25.1



* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Aleksa Sarai @ 2025-08-28  8:42 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Adhemerval Zanella Netto, Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <5cbd7011-9c2a-4a23-bbce-84c100877cdb@cs.ucla.edu>

On 2025-08-27, Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 2025-08-27 15:48, Aleksa Sarai wrote:
> > On 2025-08-27, Paul Eggert <eggert@cs.ucla.edu> wrote:
> > > What specific scenario would make the "give me supported flags" flag worth
> > > the hassle of supporting and documenting and testing such a flag?
> > 
> > "Just try it" leads to programs that have to test dozens of flag
> > combinations for syscalls at startup,
> 
> Although that sort of thing can indeed be a problem in general, I don't see
> how it's a problem for openat2 in particular.

While O_* and RESOLVE_* flags are trivial to detect (since you can
always pass -EBADF to force a non-EINVAL error), my goal was to have a
unified interface for extensible-struct syscalls in this department.
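
As a rough illustration of the -EBADF trick (a sketch, not code from any
patch in this thread): call openat2 with an invalid dirfd and a relative
path, then look at errno. EINVAL suggests the kernel rejected the flags,
EBADF suggests it accepted them and only failed on the dirfd. The
syscall number 437 is an assumption that holds on x86-64 and arm64, and
probe_openat2 and OpenHow are hypothetical names.

```python
import ctypes
import errno
import os
import sys

SYS_OPENAT2 = 437        # assumption: correct on x86-64 and arm64
RESOLVE_BENEATH = 0x08   # value from <linux/openat2.h>


class OpenHow(ctypes.Structure):
    """Mirror of struct open_how: three 64-bit fields."""
    _fields_ = [("flags", ctypes.c_uint64),
                ("mode", ctypes.c_uint64),
                ("resolve", ctypes.c_uint64)]


def probe_openat2(flags=0, resolve=0):
    """Return True if the flags look supported, False if rejected, None if unknown."""
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    how = OpenHow(flags=flags, mode=0, resolve=resolve)
    ctypes.set_errno(0)
    ret = libc.syscall(ctypes.c_long(SYS_OPENAT2),
                       ctypes.c_int(-errno.EBADF),   # deliberately invalid dirfd
                       ctypes.c_char_p(b"x"),        # relative path, so dirfd is used
                       ctypes.byref(how),
                       ctypes.c_size_t(ctypes.sizeof(how)))
    if ret >= 0:
        # Not expected with an invalid dirfd; clean up and report "unknown".
        os.close(ret)
        return None
    err = ctypes.get_errno()
    if err == errno.EBADF:
        return True      # flags passed validation, lookup failed on dirfd
    if err == errno.EINVAL:
        return False     # flags (or their combination) were rejected
    return None          # ENOSYS, seccomp denial, or something else
```

For example, probe_openat2(resolve=RESOLVE_BENEATH) needs one syscall
per flag combination, which is the per-flag probing cost referred to
above.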

> The issue here is whether openat2's API should reflect current behavior
> (where the HOW argument is pointer-to-const) or a potential future behavior
> (where the kernel might modify the struct that HOW points to, if some
> hypothetical future flag is set in that struct). I am skeptical that this
> hypothetical situation is so plausible that it justifies the maintenance
> hassle of a glibc API that doesn't correspond to how openat2 currently
> behaves.

I mean, the kernel definition doesn't mark the syscall argument as
"const" so making it const in glibc also means maintaining a divergence
from the kernel. Of course, glibc does this for plenty of other
syscalls so it's not my place to say which is better.

My intention was just to say that this *was* intentional (which was how I
understood the initial question that I was Cc'd on), and if you feel
that intention is misguided / doesn't mesh with what glibc wants then
that's your call.

> > A simple example would be mounts -- if MOUNT_BENEATH is not supported
>
> I don't understand this example. Are you talking about <linux/mount.h>'s
> MOVE_MOUNT_BENEATH? That's a move_mount flag, and I don't see what that has
> to do with openat2. Or are you saying that openat2 might not support
> <linux/openat2.h>'s RESOLVE_BENEATH flag? Under what conditions might that
> be, exactly? Can you give some plausible user code to illustrate the openat2
> example you're thinking of?

I was just giving it as an example where "just try it" is not really
ideal for userspace today. clone3(2) is an extensible-struct syscall
that needs this.

> I still fail to understand how a hypothetical "give me the supported flags"
> openat2 flag would be useful enough to justify complicating the openat2 API
> today.

My only concern is that it would break recompiles if/when we change it
back. If that is not a concern for glibc as a project then you are of
course free to do whatever makes sense for glibc.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-08-28 12:43 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Pasha Tatashin, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs0bjo0yffo.fsf@kernel.org>

On Wed, Aug 27, 2025 at 05:03:55PM +0200, Pratyush Yadav wrote:

> I think we need something like a luo_xarray data structure that users like
> memfd (and later hugetlb and guest_memfd and maybe others) can build to
> make serialization easier. It will cover both contiguous arrays and
> arrays with some holes in them.

I'm not sure xarray is the right way to go, it is a very complex data
structure and building a kho variation of it seems like it is a huge
amount of work.

I'd stick with simple kvalloc type approaches until we really run into
trouble.

You can always map a sparse xarray into a kvalloc linear list by
including the xarray index in each entry.

Especially for memfd where we don't actually expect any sparsity in
real use cases there is no reason to invest a huge effort to optimize
for it.

> As I explained above, the versioning is already there. Beyond that, why
> do you think a raw C struct is better than FDT? It is just another way
> of expressing the same information. FDT is a bit more cumbersome to
> write and read, but comes at the benefit of more introspect-ability.

Doesn't have the size limitations, is easier to work with, runs
faster.

> >  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
> >  luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);
> 
> I think what you describe here is essentially how LUO works currently,
> just that the mechanisms are a bit different.

The bit different is a very important bit though :)

The versioning should be first class, not hidden away as some emergent
property of registering multiple serializers or something like that.

Jason


* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Paul Eggert @ 2025-08-28 13:43 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Adhemerval Zanella Netto, Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <2025-08-28-foreign-swampy-comments-arbor-nOkpXI@cyphar.com>

On 2025-08-28 01:42, Aleksa Sarai wrote:
>> I still fail to understand how a hypothetical "give me the supported flags"
>> openat2 flag would be useful enough to justify complicating the openat2 API
>> today.
> My only concern is that it would break recompiles if/when we change it
> back.

OK, but from what I can see there's no identified possibility that 
openat2 will modify the objects its arguments point to, just as there's 
no identified possibility that plain openat will do so (in a 
hypothetical extension to remove unnecessary slashes from its filename 
argument, say).

In that case it's pretty clear that glibc should mark the open_how 
argument as pointer-to-const, just as glibc already marks the filename 
argument.


* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Adhemerval Zanella Netto @ 2025-08-28 17:06 UTC (permalink / raw)
  To: Paul Eggert, Aleksa Sarai; +Cc: Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <cbbc9639-0443-4bf8-bbd1-9d3fdcb2fd37@cs.ucla.edu>



On 28/08/25 10:43, Paul Eggert wrote:
> On 2025-08-28 01:42, Aleksa Sarai wrote:
>>> I still fail to understand how a hypothetical "give me the supported flags"
>>> openat2 flag would be useful enough to justify complicating the openat2 API
>>> today.
>> My only concern is that it would break recompiles if/when we change it
>> back.
> 
> OK, but from what I can see there's no identified possibility that openat2 will modify the objects its arguments point to, just as there's no identified possibility that plain openat will do so (in a hypothetical extension to remove unnecessary slashes from its filename argument, say).
> 
> In that case it's pretty clear that glibc should mark the open_how argument as pointer-to-const, just as glibc already marks the filename argument.

I am still not sure how a potential CHECK_FIELDS feature would play with 
openat2 in the future, especially since glibc now prefers to first include 
the kernel headers before redefining a minimal API to the syscall usage 
(meaning that programs will have access to potentially new flags depending 
on the installed kernel header).

If the kernel intends to modify the open_how in the future, setting open_how 
as const will only add extra confusion. Users might be exposed to this feature 
without explicitly including the kernel headers.

Another option might be to *not* include the kernel headers and keep syncing the 
kernel definitions on kernel releases (and maybe excluding flags that might 
modify the open_how). As Florian has said, this kind of mediation by glibc was 
historically time-consuming, complex, and subject to subtle bugs (and that's 
why we abandoned this over time).


* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Jeff Xu @ 2025-08-28 20:17 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Jeff Xu, Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner,
	Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <20250827.ieRaeNg4pah3@digikod.net>

Hi Mickaël

On Wed, Aug 27, 2025 at 1:19 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Tue, Aug 26, 2025 at 01:29:55PM -0700, Jeff Xu wrote:
> > Hi Mickaël
> >
> > On Tue, Aug 26, 2025 at 5:39 AM Mickaël Salaün <mic@digikod.net> wrote:
> > >
> > > On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote:
> > > > Hi Mickaël
> > > >
> > > > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > > >
> > > > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> > > > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > >
> > > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > > > > > > passed file descriptors).  This changes the state of the opened file by
> > > > > > > > > making it read-only until it is closed.  The main use case is for script
> > > > > > > > > interpreters to get the guarantee that script' content cannot be altered
> > > > > > > > > while being read and interpreted.  This is useful for generic distros
> > > > > > > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > > > > > > >
> > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > > > > > > property on files with deny_write_access().  This new O_DENY_WRITE make
> > > > > > > >
> > > > > > > > The kernel actually tried to get rid of this behavior on execve() in
> > > > > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44; but sadly that had
> > > > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > > > > > > because it broke userspace assumptions.
> > > > > > >
> > > > > > > Oh, good to know.
> > > > > > >
> > > > > > > >
> > > > > > > > > it widely available.  This is similar to what other OSs may provide
> > > > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > > > > > > >
> > > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > > > > > > removed for security reasons; as
> > > > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > > > > > > >
> > > > > > > > |        MAP_DENYWRITE
> > > > > > > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > > > > > > |               signaled that attempts to write to the underlying file
> > > > > > > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > > > > > > |               of-service attacks.)"
> > > > > > > >
> > > > > > > > It seems to me that the same issue applies to your patch - it would
> > > > > > > > allow unprivileged processes to essentially lock files such that other
> > > > > > > > processes can't write to them anymore. This might allow unprivileged
> > > > > > > > users to prevent root from updating config files or stuff like that if
> > > > > > > > they're updated in-place.
> > > > > > >
> > > > > > > > Yes, I agree, but since it is the case for executed files I thought it
> > > > > > > was worth starting a discussion on this topic.  This new flag could be
> > > > > > > restricted to executable files, but we should avoid system-wide locks
> > > > > > > > like this.  I'm not sure how Windows handles these issues though.
> > > > > > >
> > > > > > > Anyway, we should rely on the access control policy to control write and
> > > > > > > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > > > > > > the references and the background!
> > > > > >
> > > > > > I'm confused.  I understand that there are many contexts in which one
> > > > > > would want to prevent execution of unapproved content, which might
> > > > > > include preventing a given process from modifying some code and then
> > > > > > executing it.
> > > > > >
> > > > > > I don't understand what these deny-write features have to do with it.
> > > > > > These features merely prevent someone from modifying code *that is
> > > > > > currently in use*, which is not at all the same thing as preventing
> > > > > > modifying code that might get executed -- one can often modify
> > > > > > contents *before* executing those contents.
> > > > >
> > > > > The order of checks would be:
> > > > > 1. open script with O_DENY_WRITE
> > > > > 2. check executability with AT_EXECVE_CHECK
> > > > > 3. read the content and interpret it
> > > > >
> > > > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving.
> > > >
> > > > AT_EXECVE_CHECK is not just for scripting languages. It could also
> > > > work with bytecodes like Java, for example. If we let the Java runtime
> > > > call AT_EXECVE_CHECK before loading the bytecode, the LSM could
> > > > develop a policy based on that.
> > >
> > > Sure, I'm using "script" to make it simple, but this applies to other
> > > use cases.
> > >
> > That makes sense.
> >
> > > >
> > > > > The deny-write feature was to guarantee that there is no race condition
> > > > > between step 2 and 3.  All these checks are supposed to be done by a
> > > > > trusted interpreter (which is allowed to be executed).  The
> > > > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > > > > associated security policies) allowed the *current* content of the file
> > > > > to be executed.  Whatever happen before or after that (wrt.
> > > > > O_DENY_WRITE) should be covered by the security policy.
> > > > >
> > > > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK.
> > > >
> > > > Enforcing non-write for the path that stores scripts or bytecodes can
> > > > be challenging due to historical or backward compatibility reasons.
> > > > Since AT_EXECVE_CHECK provides a mechanism to check the file right
> > > > before it is used, we can assume it will detect any "problem" that
> > > > happened before that, (e.g. the file was overwritten). However, that
> > > > also imposes two additional requirements:
> > > > 1> the file doesn't change while AT_EXECVE_CHECK does the check.
> > >
> > > This is already the case, so any kind of LSM checks are good.
> > >
> > May I ask how this is done? some code in do_open_execat() does this ?
> > Apologies if this is a basic question.
>
> do_open_execat() calls exe_file_deny_write_access()
>
Thanks for pointing that out.
With that, I have now read the full history of the discussion regarding this :-)

> >
> > > > 2>The file content kept by the process remains unchanged after passing
> > > > the AT_EXECVE_CHECK.
> > >
> > > The goal of this patch was to avoid such race condition in the case
> > > where executable files can be updated.  But in most cases it should not
> > > be a security issue (because processes allowed to write to executable
> > > files should be trusted), but this could still lead to bugs (because of
> > > inconsistent file content, half-updated).
> > >
> > There is also a time gap between:
> > a> the time of AT_EXECVE_CHECK
> > b> the time that the app opens the file for execution.
> > right ? another potential attack path (though this is not the case I
> > mentioned previously).
>
> As explained in the documentation, to avoid this specific race
> condition, interpreters should open the script once, check the FD with
> AT_EXECVE_CHECK, and then read the content with the same FD.
>
Ya, now I see that in the description of this patch, sorry that I
missed that previously.

> >
> > For the case I mentioned previously, I have to think more if the race
> > condition is a bug or security issue.
> > IIUC, two solutions are discussed so far:
> > 1> the process could write to fs to update the script.  However, for
> > execution, the process still uses the copy that passed the
> > AT_EXECVE_CHECK. (snapshot solution by Andy Lutomirski)
>
> Yes, the snapshot solution would be the best, but I guess it would rely
> on filesystems to support this feature.
>
Snapshot seems to be a reasonable direction to go.

Is this something related to the VMA? E.g. preserving the in-memory
copy of the file when the file on the fs is updated.

According to man mmap:
       MAP_PRIVATE
              Create a private copy-on-write mapping.  Updates to the
              mapping are not visible to other processes mapping the same
              file, and are not carried through to the underlying file.
              It is unspecified whether changes made to the file after
              the mmap() call are visible in the mapped region.

so the direction here is
the process -> update the vma -> doesn't carry to the file.

What we want is the reverse direction: (the unspecified part in the man page)
file updated on fs -> doesn't carry to the vma of this process.
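
The specified direction is easy to check from userspace. A small Python
sketch, assuming Linux and the standard mmap module (the function name
is hypothetical): write through a MAP_PRIVATE mapping, then re-read the
file.

```python
import mmap
import tempfile


def private_write_stays_private():
    """Write through a MAP_PRIVATE mapping and compare it with the file."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"initial contents")
        tmp.flush()
        # A read-only descriptor is enough: MAP_PRIVATE never writes back.
        with open(tmp.name, "rb") as reader:
            m = mmap.mmap(reader.fileno(), 0, flags=mmap.MAP_PRIVATE,
                          prot=mmap.PROT_READ | mmap.PROT_WRITE)
            m[:7] = b"changed"       # triggers copy-on-write for this page
            in_mapping = m[:]
            m.close()
        with open(tmp.name, "rb") as check:
            on_disk = check.read()   # the file itself is untouched
    return in_mapping, on_disk
```

This only covers process -> vma; it says nothing about the reverse
direction, which is the part the man page leaves unspecified.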

> > or 2> the process blocks the write while opening the file as read only
> > and executing the script. (this seems to be the approach of this
> > patch).
>
> Yes, and this is not something we want anymore.
>
right. Thank you for clarifying this.

> >
> > I wonder if there are other ideas.
>
> I don't see other efficient ways to give the same guarantees.
right, me neither.

Thanks and regards,
-Jeff


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Serge E. Hallyn @ 2025-08-28 21:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <CALCETrWHKga33bvzUHnd-mRQUeNXTtXSS8Y8+40d5bxv-CqBhw@mail.gmail.com>

On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > that this is work that needs to be done in userspace. Change the apps,
> > > > stop pushing more and more cruft into the VFS that has no business
> > > > there.
> > >
> > > It would be interesting to know how to patch user space to get the same
> > > guarantees...  Do you think I would propose a kernel patch otherwise?
> >
> > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > nice but IIRC there are ways to get around it anyway).
> 
> Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> affecting the file, but I don't think that writes to the file will
> break the MAP_PRIVATE CoW if it's not already broken.
> 
> IPython says:
> 
> In [1]: import mmap, tempfile
> 
> In [2]: f = tempfile.TemporaryFile()
> 
> In [3]: f.write(b'initial contents')
> Out[3]: 16
> 
> In [4]: f.flush()
> 
> In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> prot=mmap.PROT_READ)
> 
> In [6]: map[:]
> Out[6]: b'initial contents'
> 
> In [7]: f.seek(0)
> Out[7]: 0
> 
> In [8]: f.write(b'changed')
> Out[8]: 7
> 
> In [9]: f.flush()
> 
> In [10]: map[:]
> Out[10]: b'changed contents'

That was surprising to me. However, if I split the reader
and writer into different processes, like so:

P1:
f = open("/tmp/3", "w")
f.write('initial contents')
f.flush()

P2:
import mmap
f = open("/tmp/3", "r")
map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)

Back to P1:
f.seek(0)
f.write('changed')

Back to P2:
map[:]

Then P2 gives me:

b'initial contents'
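
One caveat when repeating this: the transcript above does not show a
flush() after the second write in P1, so Python's userspace buffering
may be holding 'changed' back. Here is a hedged sketch that takes
buffering out of the picture by using an unbuffered writer and a
separate reader descriptor in a single process. The function name is
hypothetical, and the second value is left for the reader to observe,
since mmap(2) calls that behaviour unspecified.

```python
import mmap
import tempfile


def observe_private_mapping():
    """Return the mapping contents before and after the file is rewritten."""
    with tempfile.NamedTemporaryFile() as tmp:
        # buffering=0 makes every write() reach the kernel immediately.
        with open(tmp.name, "wb", buffering=0) as writer:
            writer.write(b"initial contents")
            with open(tmp.name, "rb") as reader:
                m = mmap.mmap(reader.fileno(), 0, flags=mmap.MAP_PRIVATE,
                              prot=mmap.PROT_READ)
                before = m[:]
                writer.seek(0)
                writer.write(b"changed")   # no flush needed: unbuffered
                after = m[:]
                m.close()
    return before, after
```

If `after` still differs between the one-process and two-process runs
with this version, buffering is not the explanation.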

-serge
