Linux userland API discussions


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-09-01 16:25 UTC
  To: Roberto Sassu
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <54e27d05bae55749a975bc7cbe109b237b2b1323.camel@huaweicloud.com>

Can you clarify this a bit for those of us who are not well-versed in
exactly what "measurement" does?

On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
<roberto.sassu@huaweicloud.com> wrote:
> > Now, in cases where you have IMA or something and you only permit signed
> > binaries to execute, you could argue there is a different race here (an
> > attacker creates a malicious script, runs it, and then replaces it with
> > a valid script's contents and metadata after the fact to get
> > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
>
> Uhm, let's consider measurement, which I'm more familiar with.
>
> I think the race you wanted to express was that the attacker replaces
> the good script, verified with AT_EXECVE_CHECK, with the bad script
> after the IMA verification but before the interpreter reads it.
>
> Fortunately, IMA is able to cope with this situation, since this race
> can happen for any file open, where of course the file is not
> necessarily read-locked.

I assume you mean that this has nothing specifically to do with
scripts, as IMA tries to protect ordinary (non-"execute") file access
as well.  Am I right?

>
> If the attacker tries to concurrently open the script for write in this
> race window, IMA will report this event (called violation) in the
> measurement list, and during remote attestation it will be clear that
> the interpreter did not read what was measured.
>
> We just need to run the violation check for the BPRM_CHECK hook too
> (then, probably for us the O_DENY_WRITE flag or alternative solution
> would not be needed, for measurement).

This seems consistent with my interpretation above, but ...

>
> Please, let us know when you apply patches like 2a010c412853 ("fs:
> don't block i_writecount during exec"). We had a discussion [1], but
> probably I missed when it was decided to be applied (I saw now it was
> in the same thread, but didn't get that at the time). We would have
> needed to update our code accordingly. In the future, we will try to
> clarify better our expectations from the VFS.

... I didn't follow this.

Suppose there's some valid contents of /bin/sleep.  I execute
/bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
for write, not by replacing it), and the kernel in question doesn't do
ETXTBSY.  Then the sleep process reads (and executes) the modified
contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
needed?


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Mike Rapoport @ 2025-09-01 16:23 UTC
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162019.GD2130239@nvidia.com>

On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> 
> > +	/*
> > +	 * Most of the space should be taken by preserved folios. So take its
> > +	 * size, plus a page for other properties.
> > +	 */
> > +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> > +	if (!fdt) {
> > +		err = -ENOMEM;
> > +		goto err_unpin;
> > +	}
> 
> This doesn't seem to have any versioning scheme, it really should..
> 
> > +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > +				       (void **)&preserved_folios);
> > +	if (err) {
> > +		pr_err("Failed to reserve folios property in FDT: %s\n",
> > +		       fdt_strerror(err));
> > +		err = -ENOMEM;
> > +		goto err_free_fdt;
> > +	}
> 
> Yuk.
> 
> This really wants some luo helper
> 
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'

We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1

Will wait for kbuild and then send proper patches.
 

-- 
Sincerely yours,
Mike.


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-09-01 16:01 UTC
  To: Jann Horn
  Cc: Serge E. Hallyn, Andy Lutomirski, Aleksa Sarai,
	Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <CAG48ez0p1B9nmG3ZyNRywaSYTtEULSpbxueia912nVpg2Q7WYA@mail.gmail.com>

On Mon, Sep 1, 2025 at 4:06 AM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> > > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > >
> > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > > > that this is work that needs to be done in userspace. Change the apps,
> > > > > > stop pushing more and more cruft into the VFS that has no business
> > > > > > there.
> > > > >
> > > > > It would be interesting to know how to patch user space to get the same
> > > > > guarantees...  Do you think I would propose a kernel patch otherwise?
> > > >
> > > > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > > > nice but IIRC there are ways to get around it anyway).
> > >
> > > Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> > > affecting the file, but I don't think that writes to the file will
> > > break the MAP_PRIVATE CoW if it's not already broken.
> > >
> > > IPython says:
> > >
> > > In [1]: import mmap, tempfile
> > >
> > > In [2]: f = tempfile.TemporaryFile()
> > >
> > > In [3]: f.write(b'initial contents')
> > > Out[3]: 16
> > >
> > > In [4]: f.flush()
> > >
> > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> > > prot=mmap.PROT_READ)
> > >
> > > In [6]: map[:]
> > > Out[6]: b'initial contents'
> > >
> > > In [7]: f.seek(0)
> > > Out[7]: 0
> > >
> > > In [8]: f.write(b'changed')
> > > Out[8]: 7
> > >
> > > In [9]: f.flush()
> > >
> > > In [10]: map[:]
> > > Out[10]: b'changed contents'
> >
> > That was surprising to me, however, if I split the reader
> > and writer into different processes, so
>
> Testing this in python is a terrible idea because it obfuscates the
> actual syscalls from you.
>
> > P1:
> > f = open("/tmp/3", "w")
> > f.write('initial contents')
> > f.flush()
> >
> > P2:
> > import mmap
> > f = open("/tmp/3", "r")
> > map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
> >
> > Back to P1:
> > f.seek(0)
> > f.write('changed')
> >
> > Back to P2:
> > map[:]
> >
> > Then P2 gives me:
> >
> > b'initial contents'
>
> Because when you executed `f.write('changed')`, Python internally
> buffered the write. "changed" is never actually written into the file
> in your example. If you add a `f.flush()` in P1 after this, running
> `map[:]` in P2 again will show you the new data.
>

These days, one can type in Python, ask an LLM to translate to C, and
get almost-correct output :)  Or one can use os.write(), which is
exactly what I should have done.

--Andy
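
For reference, the single-process experiment can be redone with
unbuffered os.write() calls, so that every write is a real write(2).
This is a minimal sketch and not code posted in the thread: the helper
name is made up, and it assumes Linux with a page-cache-backed temporary
directory. POSIX leaves it unspecified whether later writes to the file
show up in a MAP_PRIVATE mapping, so the result noted in the comment is
what Linux typically does, not a guarantee.

```python
import mmap
import os
import tempfile


def private_mapping_snapshots():
    """Return the mapping contents before and after an unbuffered file write.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"initial contents")  # write(2), no userspace buffering
        m = mmap.mmap(fd, 16, flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
        before = m[:]                      # faults the page in, read-only
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, b"changed")           # overwrite the first 7 bytes
        after = m[:]                       # read through the same mapping
        m.close()
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)


before, after = private_mapping_snapshots()
print(before)
# On Linux the untouched private page still aliases the page cache,
# so this typically prints b'changed contents'.
print(after)
```

The mapping is read-only and never written to, so no CoW copy of the
page is ever made; that is the case the thread is arguing about.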


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Serge E. Hallyn @ 2025-09-01 13:18 UTC
  To: Jann Horn
  Cc: Serge E. Hallyn, Andy Lutomirski, Aleksa Sarai,
	Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <CAG48ez0p1B9nmG3ZyNRywaSYTtEULSpbxueia912nVpg2Q7WYA@mail.gmail.com>

On Mon, Sep 01, 2025 at 01:05:16PM +0200, Jann Horn wrote:
> On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> > > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > >
> > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > > > that this is work that needs to be done in userspace. Change the apps,
> > > > > > stop pushing more and more cruft into the VFS that has no business
> > > > > > there.
> > > > >
> > > > > It would be interesting to know how to patch user space to get the same
> > > > > guarantees...  Do you think I would propose a kernel patch otherwise?
> > > >
> > > > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > > > nice but IIRC there are ways to get around it anyway).
> > >
> > > Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> > > affecting the file, but I don't think that writes to the file will
> > > break the MAP_PRIVATE CoW if it's not already broken.
> > >
> > > IPython says:
> > >
> > > In [1]: import mmap, tempfile
> > >
> > > In [2]: f = tempfile.TemporaryFile()
> > >
> > > In [3]: f.write(b'initial contents')
> > > Out[3]: 16
> > >
> > > In [4]: f.flush()
> > >
> > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> > > prot=mmap.PROT_READ)
> > >
> > > In [6]: map[:]
> > > Out[6]: b'initial contents'
> > >
> > > In [7]: f.seek(0)
> > > Out[7]: 0
> > >
> > > In [8]: f.write(b'changed')
> > > Out[8]: 7
> > >
> > > In [9]: f.flush()
> > >
> > > In [10]: map[:]
> > > Out[10]: b'changed contents'
> >
> > That was surprising to me, however, if I split the reader
> > and writer into different processes, so
> 
> Testing this in python is a terrible idea because it obfuscates the
> actual syscalls from you.

Hah, I was just trying to fit in :), but of course you're right.
Redoing it in straight C, I'm getting the updates.

-serge

// mmap-w.c (creates and overwrites)
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define FIRST "Initial contents"
#define SECOND "Updated contents"

int main() {
	int fd, rc;
	char c;

	fd = open("/tmp/m", O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		printf("failed to open /tmp/m: %m\n");
		_exit(1);
	}
	rc = write(fd, FIRST, sizeof(FIRST));
	if (rc < 0) {
		printf("write failed: %m\n");
		_exit(1);
	}
	rc = fsync(fd);
	if (rc < 0) {
		printf("flush failed: %m\n");
		_exit(1);
	}

	read(STDIN_FILENO, &c, 1);

	printf("updating the contents\n");

	rc = lseek(fd, 0, SEEK_SET);
	if (rc < 0) {
		printf("seek failed: %m\n");
		_exit(1);
	}

	rc = write(fd, SECOND, sizeof(SECOND));
	if (rc < 0) {
		printf("write failed: %m\n");
		_exit(1);
	}
	rc = close(fd);
	if (rc < 0) {
		printf("close failed: %m\n");
		_exit(1);
	}
	printf("done\n");
}

// mmap-r.c (checks and re-checks contents)
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

#define FIRST "Initial contents"
#define SECOND "Updated contents"

int main() {
	int fd, rc;
	char *m;
	char c;

	fd = open("/tmp/m", O_RDONLY);
	if (fd < 0) {
		printf("failed to open /tmp/m: %m\n");
		_exit(1);
	}

	m = mmap(NULL, 40, PROT_READ, MAP_PRIVATE, fd, 0);
	if (m == MAP_FAILED) {
		printf("mmap failed: %m\n");
		_exit(1);
	}

	if (strncmp(m, FIRST, 7) != 0) {
		printf("m is %c%c%c%c%c%c%c\n",
			m[0], m[1], m[2], m[3], m[4], m[5], m[6]);
		_exit(1);
	}

	read(STDIN_FILENO, &c, 1);

	if (strncmp(m, SECOND, 7) != 0) {
		printf("m is %c%c%c%c%c%c%c\n",
			m[0], m[1], m[2], m[3], m[4], m[5], m[6]);
		_exit(1);
	}

	printf("done\n");
}


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Jann Horn @ 2025-09-01 11:05 UTC
  To: Serge E. Hallyn
  Cc: Andy Lutomirski, Aleksa Sarai, Mickaël Salaün,
	Christian Brauner, Al Viro, Kees Cook, Paul Moore, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <aLDDk4x7QBKxLmoi@mail.hallyn.com>

On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > >
> > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > > that this is work that needs to be done in userspace. Change the apps,
> > > > > stop pushing more and more cruft into the VFS that has no business
> > > > > there.
> > > >
> > > > It would be interesting to know how to patch user space to get the same
> > > > guarantees...  Do you think I would propose a kernel patch otherwise?
> > >
> > > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > > nice but IIRC there are ways to get around it anyway).
> >
> > Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> > affecting the file, but I don't think that writes to the file will
> > break the MAP_PRIVATE CoW if it's not already broken.
> >
> > IPython says:
> >
> > In [1]: import mmap, tempfile
> >
> > In [2]: f = tempfile.TemporaryFile()
> >
> > In [3]: f.write(b'initial contents')
> > Out[3]: 16
> >
> > In [4]: f.flush()
> >
> > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> > prot=mmap.PROT_READ)
> >
> > In [6]: map[:]
> > Out[6]: b'initial contents'
> >
> > In [7]: f.seek(0)
> > Out[7]: 0
> >
> > In [8]: f.write(b'changed')
> > Out[8]: 7
> >
> > In [9]: f.flush()
> >
> > In [10]: map[:]
> > Out[10]: b'changed contents'
>
> That was surprising to me, however, if I split the reader
> and writer into different processes, so

Testing this in python is a terrible idea because it obfuscates the
actual syscalls from you.

> P1:
> f = open("/tmp/3", "w")
> f.write('initial contents')
> f.flush()
>
> P2:
> import mmap
> f = open("/tmp/3", "r")
> map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
>
> Back to P1:
> f.seek(0)
> f.write('changed')
>
> Back to P2:
> map[:]
>
> Then P2 gives me:
>
> b'initial contents'

Because when you executed `f.write('changed')`, Python internally
buffered the write. "changed" is never actually written into the file
in your example. If you add a `f.flush()` in P1 after this, running
`map[:]` in P2 again will show you the new data.
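
The buffering effect described above can be checked without mmap at
all, by comparing what the kernel sees before and after flush(). This is
a minimal sketch with hypothetical helper names; it assumes CPython's
default buffered text I/O on a regular file.

```python
import os
import tempfile


def kernel_view(path):
    """Read the file with raw syscalls, bypassing Python's file objects."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, 64)
    finally:
        os.close(fd)


def buffered_write_demo():
    """Return the file contents seen before and after flushing a write."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write("initial contents")
            f.flush()                        # reaches the kernel
            f.seek(0)
            f.write("changed")               # stays in Python's buffer
            before_flush = kernel_view(path)
            f.flush()                        # now write(2) is issued
            after_flush = kernel_view(path)
        return before_flush, after_flush
    finally:
        os.unlink(path)


before_flush, after_flush = buffered_write_demo()
print(before_flush)   # b'initial contents'
print(after_flush)    # b'changed contents'
```

Until the second flush() the kernel never sees "changed", so no reader,
mmap-based or otherwise, can observe it.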


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Roberto Sassu @ 2025-09-01  9:24 UTC
  To: Aleksa Sarai, Mickaël Salaün
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <2025-08-27-obscene-great-toy-diary-X1gVRV@cyphar.com>

On Thu, 2025-08-28 at 10:14 +1000, Aleksa Sarai wrote:
> On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > Nothing has changed in that regard and I'm not interested in stuffing
> > > the VFS APIs full of special-purpose behavior to work around the fact
> > > that this is work that needs to be done in userspace. Change the apps,
> > > stop pushing more and more cruft into the VFS that has no business
> > > there.
> > 
> > It would be interesting to know how to patch user space to get the same
> > guarantees...  Do you think I would propose a kernel patch otherwise?
> 
> You could mmap the script file with MAP_PRIVATE. This is the *actual*
> protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> nice but IIRC there are ways to get around it anyway). Of course, most
> interpreters don't mmap their scripts, but this is a potential solution.
> If the security policy is based on validating the script text in some
> way, this avoids the TOCTOU.
> 
> Now, in cases where you have IMA or something and you only permit signed
> binaries to execute, you could argue there is a different race here (an
> attacker creates a malicious script, runs it, and then replaces it with
> a valid script's contents and metadata after the fact to get
> AT_EXECVE_CHECK to permit the execution). However, I'm not sure that

Uhm, let's consider measurement, which I'm more familiar with.

I think the race you wanted to express was that the attacker replaces
the good script, verified with AT_EXECVE_CHECK, with the bad script
after the IMA verification but before the interpreter reads it.

Fortunately, IMA is able to cope with this situation, since this race
can happen for any file open, where of course the file is not
necessarily read-locked.

If the attacker tries to concurrently open the script for write in this
race window, IMA will report this event (called violation) in the
measurement list, and during remote attestation it will be clear that
the interpreter did not read what was measured.

We just need to run the violation check for the BPRM_CHECK hook too
(then, probably for us the O_DENY_WRITE flag or alternative solution
would not be needed, for measurement).

Please, let us know when you apply patches like 2a010c412853 ("fs:
don't block i_writecount during exec"). We had a discussion [1], but
probably I missed when it was decided to be applied (I saw now it was
in the same thread, but didn't get that at the time). We would have
needed to update our code accordingly. In the future, we will try to
clarify better our expectations from the VFS.

Thanks

Roberto

[1]: https://lore.kernel.org/linux-fsdevel/88d5a92379755413e1ec3c981d9a04e6796da110.camel@huaweicloud.com/#t

> this is even possible with IMA (can an unprivileged user even set
> security.ima?). But even then, I would expect users that really need
> this would also probably use fs-verity or dm-verity that would block
> this kind of attack since it would render the files read-only anyway.
> 
> This is why a more detailed threat model of what kinds of attacks are
> relevant is useful. I was there for the talk you gave and subsequent
> discussion at last year's LPC, but I felt that your threat model was
> not really fleshed out at all. I am still not sure what capabilities you
> expect the attacker to have nor what is being used to authenticate
> binaries (other than AT_EXECVE_CHECK). Maybe I'm wrong with my above
> assumptions, but I can't know without knowing what threat model you have
> in mind, *in detail*.
> 
> For example, if you are dealing with an attacker that has CAP_SYS_ADMIN,
> there are plenty of ways for an attacker to execute their own code
> without using interpreters (create a new tmpfs with fsopen(2) for
> instance). Executable memfds are even easier and don't require
> privileges on most systems (yes, you can block them with vm.memfd_noexec
> but CAP_SYS_ADMIN can disable that -- and there's always fsopen(2) or
> mount(2)).
> 
> (As an aside, it's a shame that AT_EXECVE_CHECK burned one of the
> top-level AT_* bits for a per-syscall flag -- the block comment I added
> in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be
> allocated") was meant to avoid this happening but it seems you and the
> reviewers missed that...)
> 



* Re: [PATCH v3 2/2] man2/mount.2: tfix (mountpoint => mount point)
From: Alejandro Colomar @ 2025-08-31  9:16 UTC
  To: Askar Safin
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250826083227.2611457-3-safinaskar@zohomail.com>

On Tue, Aug 26, 2025 at 08:32:27AM +0000, Askar Safin wrote:
> Here we fix the only remaining mention of "mountpoint"
> in all man pages
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>

Patch applied.  Thanks!


Cheers,
Alex

> ---
>  man/man2/mount.2 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 599c2d6fa..9b11fff51 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -311,7 +311,7 @@ Since Linux 2.6.16,
>  can be set or cleared on a per-mount-point basis as well as on
>  the underlying filesystem superblock.
>  The mounted filesystem will be writable only if neither the filesystem
> -nor the mountpoint are flagged as read-only.
> +nor the mount point are flagged as read-only.
>  .\"
>  .SS Remounting an existing mount
>  An existing mount may be remounted by specifying
> -- 
> 2.47.2
> 

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).



* Re: [PATCH v3 1/2] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Alejandro Colomar @ 2025-08-31  9:15 UTC
  To: Aleksa Sarai
  Cc: Askar Safin, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <2025-08-27.1756287489-unsure-quiet-flakes-gymnast-P41YdV@cyphar.com>

Hi Aleksa, Askar,

On Wed, Aug 27, 2025 at 07:42:09PM +1000, Aleksa Sarai wrote:
> On 2025-08-26, Askar Safin <safinaskar@zohomail.com> wrote:
> > My edit is based on experiments and reading Linux code
> > 
> > Signed-off-by: Askar Safin <safinaskar@zohomail.com>

Thanks!  I've applied the patch, with some tweaks.
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=b479b1fe01569d4926cbc59fa31caab8cd01fdad>
(use port 80; this stops AI crawlers.)

> > ---
> >  man/man2/mount.2 | 27 ++++++++++++++++++++++++---
> >  1 file changed, 24 insertions(+), 3 deletions(-)
> > 
> > diff --git a/man/man2/mount.2 b/man/man2/mount.2
> > index 5d83231f9..599c2d6fa 100644
> > --- a/man/man2/mount.2
> > +++ b/man/man2/mount.2
> > @@ -405,7 +405,25 @@ flag can be used with
> >  to modify only the per-mount-point flags.
> >  .\" See https://lwn.net/Articles/281157/
> >  This is particularly useful for setting or clearing the "read-only"
> > -flag on a mount without changing the underlying filesystem.
> > +flag on a mount without changing the underlying filesystem parameters.
> 
> When reading the whole sentence, this feels a bit incomplete
> ("filesystem parameters ... of what?"). Maybe
> 
>   This is particularly useful for setting or clearing the "read-only"
>   flag on a mount without changing the underlying filesystem's
>   filesystem parameters.
> 
> or
> 
>   This is particularly useful for setting or clearing the "read-only"
>   flag on a mount without changing the filesystem parameters of the
>   underlying filesystem.
> 
> would be better?

Yep; I've taken the second proposal.

> 
> That one nit aside, feel free to take my
> 
> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks!  Appended.

> > +The
> > +.I data
> > +argument is ignored if
> > +.B MS_REMOUNT
> > +and
> > +.B MS_BIND
> > +are specified.

I have removed the mention of MS_REMOUNT and MS_BIND, since the first
sentence in the paragraph already mentions them.  Otherwise, it felt a
bit confusing why some sentences mentioned it and others not.

> > +The mount point will
> > +have its existing per-mount-point flags

I have reworded this to use the present tense instead of the future, and
also reversed the order of the clauses; it feels more readable now.


Have a lovely day!
Alex

> > +cleared and replaced with those in
> > +.IR mountflags .
> > +This means that
> > +if you wish to preserve
> > +any existing per-mount-point flags,
> > +you need to include them in
> > +.IR mountflags ,
> > +along with the per-mount-point flags you wish to set
> > +(or with the flags you wish to clear missing).
> >  Specifying
> >  .I mountflags
> >  as:
> > @@ -416,8 +434,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
> >  .EE
> >  .in
> >  .P
> > -will make access through this mountpoint read-only, without affecting
> > -other mounts.
> > +will make access through this mount point read-only
> > +(clearing all other per-mount-point flags),
> > +without affecting
> > +other mounts
> > +of this filesystem.
> >  .\"
> >  .SS Creating a bind mount
> >  If
> > -- 
> > 2.47.2
> > 
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> https://www.cyphar.com/



-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).



* Re: [PATCH v3 09/30] liveupdate: kho: move to kernel/liveupdate
From: Mike Rapoport @ 2025-08-30  8:35 UTC
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-10-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:15AM +0000, Pasha Tatashin wrote:
> Move KHO to kernel/liveupdate/ in preparation of placing all Live Update
> core kernel related files to the same place.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>
> ---
> diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile
> new file mode 100644
> index 000000000000..72cf7a8e6739
> --- /dev/null
> +++ b/kernel/liveupdate/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Makefile for the linux kernel.

Nit: this line does not provide much, let's drop it

> +
> +obj-$(CONFIG_KEXEC_HANDOVER)		+= kexec_handover.o
> +obj-$(CONFIG_KEXEC_HANDOVER_DEBUG)	+= kexec_handover_debug.o
> diff --git a/kernel/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> similarity index 99%
> rename from kernel/kexec_handover.c
> rename to kernel/liveupdate/kexec_handover.c
> index 07755184f44b..05f5694ea057 100644
> --- a/kernel/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -23,8 +23,8 @@
>   * KHO is tightly coupled with mm init and needs access to some of mm
>   * internal APIs.
>   */
> -#include "../mm/internal.h"
> -#include "kexec_internal.h"
> +#include "../../mm/internal.h"
> +#include "../kexec_internal.h"
>  #include "kexec_handover_internal.h"
>  
>  #define KHO_FDT_COMPATIBLE "kho-v1"
> @@ -824,7 +824,7 @@ static int __kho_finalize(void)
>  	err |= fdt_finish_reservemap(root);
>  	err |= fdt_begin_node(root, "");
>  	err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE);
> -	/**
> +	/*
>  	 * Reserve the preserved-memory-map property in the root FDT, so
>  	 * that all property definitions will precede subnodes created by
>  	 * KHO callers.
> diff --git a/kernel/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c
> similarity index 100%
> rename from kernel/kexec_handover_debug.c
> rename to kernel/liveupdate/kexec_handover_debug.c
> diff --git a/kernel/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h
> similarity index 100%
> rename from kernel/kexec_handover_internal.h
> rename to kernel/liveupdate/kexec_handover_internal.h
> -- 
> 2.50.1.565.gc32cd1483b-goog
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Chris Li @ 2025-08-29 19:18 UTC
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162019.GD2130239@nvidia.com>

On Tue, Aug 26, 2025 at 9:20 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>
> > +     /*
> > +      * Most of the space should be taken by preserved folios. So take its
> > +      * size, plus a page for other properties.
> > +      */
> > +     fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> > +     if (!fdt) {
> > +             err = -ENOMEM;
> > +             goto err_unpin;
> > +     }
>
> This doesn't seem to have any versioning scheme, it really should..
>
> > +     err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > +                                    (void **)&preserved_folios);
> > +     if (err) {
> > +             pr_err("Failed to reserve folios property in FDT: %s\n",
> > +                    fdt_strerror(err));
> > +             err = -ENOMEM;
> > +             goto err_free_fdt;
> > +     }
>
> Yuk.
>
> This really wants some luo helper
>
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'

Yes, that will be one step forward.

Another idea is to have a middle layer manage the life cycle of the
reserved memory for you, kind of like a slab allocator for the
preserved memory. It allows a bulk free: if there is an error in the
live update prepare(), you need to free all previously allocated
memory anyway. If some preserved memory needs to stay around long
term after the live update kernel boots up, a special flag can mark
it so that it is not mixed into the free_all pool.
>
> Which would get a linearized list of pages in the vmap to hold the
> array and then allocate some structure to record the page list and
> return back the u64 of the phys_addr of the top of the structure to
> store in whatever.
>
> Getting fdt to allocate the array inside the fds is just not going to
> work for anything of size.
>
> > +     for (; i < nr_pfolios; i++) {
> > +             const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> > +             phys_addr_t phys;
> > +             u64 index;
> > +             int flags;
> > +
> > +             if (!pfolio->foliodesc)
> > +                     continue;
> > +
> > +             phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> > +             folio = kho_restore_folio(phys);
> > +             if (!folio) {
> > +                     pr_err("Unable to restore folio at physical address: %llx\n",
> > +                            phys);
> > +                     goto put_file;
> > +             }
> > +             index = pfolio->index;
> > +             flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
> > +
> > +             /* Set up the folio for insertion. */
> > +             /*
> > +              * TODO: Should find a way to unify this and
> > +              * shmem_alloc_and_add_folio().
> > +              */
> > +             __folio_set_locked(folio);
> > +             __folio_set_swapbacked(folio);
> >
> > +             ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
> > +             if (ret) {
> > +                     pr_err("shmem: failed to charge folio index %d: %d\n",
> > +                            i, ret);
> > +                     goto unlock_folio;
> > +             }
>
> [..]
>
> > +             folio_add_lru(folio);
> > +             folio_unlock(folio);
> > +             folio_put(folio);
> > +     }
>
> Probably some consolidation will be needed to make this less
> duplicated..
>
> But overall I think just using the memfd_luo_preserved_folio as the
> serialization is entirely fine, I don't think this needs anything more
> complicated.
>
> What it does need is an alternative to the FDT with versioning.
>
> Which seems to me to be entirely fine as:
>
>  struct memfd_luo_v0 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios;
>  };
>
>  struct memfd_luo_v0 memfd_luo_v0 = {.size = size, .pos = file->f_pos, .folios = folios};
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>
> Which also shows the actual data needing to be serialized comes from
> more than one struct and has to be marshaled in code, somehow, to a
> single struct.
>
> Then I imagine a fairly simple forwards/backwards story. If something
> new is needed that is non-optional, lets say you compress the folios
> list to optimize holes:
>
>  struct memfd_luo_v1 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios_list_with_holes;
>  };
>
> Obviously a v0 kernel cannot parse this, but in this case a v1 aware
> kernel could optionally duplicate and write out the v0 format as well:
>
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>  luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);

Question: Do we have FDT nodes that match the memfd C structure
hierarchy? Otherwise all the C structs will be lumped into one FDT
node. Maybe one FDT node for all the C structs is fine, but then there
is a risk of overflowing the 4K buffer limit on the FDT node.

I would like the versioning to be independent of FDT.

FDT on the top level sounds OK. Not ideal but workable. We are getting
deeper and deeper into complex internal data structures. Do we still
want every data structure referenced by a FDT identifier?

> Then the rule is fairly simple, when the successor kernel goes to
> deserialize it asks luo for the versions it supports:
>
>  if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
>     restore_v1(&memfd_luo_v1)
>  else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
>     restore_v0(&memfd_luo_v0)
>  else
>     luo_failure("Do not understand this");
>
> luo core just manages this list of versioned data per serialized
> object. There is only one version per object.
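
To make the quoted rule concrete, here is a toy userspace model in
Python, not kernel code. The dictionary standing in for the LUO core,
the Python signatures of luo_store_object() and luo_restore_object(),
and the restore_memfd() helper are all made up for illustration; only
the three-u64 layouts and the "try the newest version first" order come
from the quoted text.

```python
import struct

# Hypothetical stand-in for the LUO core: (object id, version) -> raw bytes.
_store = {}

# Both example structs are three __aligned_u64 fields.
V0 = struct.Struct("<QQQ")  # size, pos, folios
V1 = struct.Struct("<QQQ")  # size, pos, folios_list_with_holes


def luo_store_object(obj_id, version, payload):
    """Record one versioned blob for an object (illustrative signature)."""
    _store[(obj_id, version)] = bytes(payload)


def luo_restore_object(obj_id, version):
    """Return the blob stored for this version, or None if absent."""
    return _store.get((obj_id, version))


def restore_memfd(obj_id, supported=(1, 0)):
    """Walk the versions this kernel understands, newest first."""
    for version in supported:
        blob = luo_restore_object(obj_id, version)
        if blob is not None:
            layout = V1 if version == 1 else V0
            return version, layout.unpack(blob)
    raise LookupError("Do not understand this")
```

A v1-aware predecessor would call luo_store_object() twice, once per
version, and a v0-only successor would pass supported=(0,).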

Obviously, this can be done.

Is that the approach you want to extend to every other C struct as
well? See the FDT node complexity above.

I am getting the feeling that we are hand crafting screws to build an
airplane. Can it be done? Of course. Does it scale well? I am not
sure. There are many developers who are currently hand crafting these
kinds of screws to be used on the different components of the airplane.

We need a machine that can stamp out screws with our specifications,
faster. I want such a machine. Other developers might want one as
well.

The initial discussion of the idea of such a machine was pretty
discouraging. There are huge communication barriers because of the
fixation on hand crafted screws. I understand that exploring such a
machine might distract engineers from hand crafting more screws, but
one of them might realize that, oh, I want such a machine as well.

At this stage, do you see exploring such a machine as beneficial or
harmful to the project? If the idea is considered harmful, we should
stop discussing it altogether and go back to building more batches of
hand crafted screws, which the next critical component is waiting for.

Also, suppose such a machine can produce screws up to your
specification, and stamp them out faster, but they have a different
look and feel than the hand crafted screws. Would you consider putting
such a machined screw on the most critical component of your engine?

Best Regards,

Chris

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Chris Li @ 2025-08-29 18:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Jason Gunthorpe, Pasha Tatashin, pratyush, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <aLABxkpPcbxyv6m_@kernel.org>

On Thu, Aug 28, 2025 at 12:14 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> >
> > > +   err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > > +                                  (void **)&preserved_folios);
> > > +   if (err) {
> > > +           pr_err("Failed to reserve folios property in FDT: %s\n",
> > > +                  fdt_strerror(err));
> > > +           err = -ENOMEM;
> > > +           goto err_free_fdt;
> > > +   }
> >
> > Yuk.
> >
> > This really wants some luo helper
> >
> > 'luo alloc array'
> > 'luo restore array'
> > 'luo free array'
> >
> > Which would get a linearized list of pages in the vmap to hold the
> > array and then allocate some structure to record the page list and
> > return back the u64 of the phys_addr of the top of the structure to
> > store in whatever.
> >
> > Getting fdt to allocate the array inside the fds is just not going to
> > work for anything of size.
>
> I agree that we need a side-car structure for preserving large (potentially
> sparse) arrays, but I think it should be a part of KHO rather than LUO.

I agree this can be used by components outside of LUO as well. Ideally
as some helper library so every component can use it. I don't have a
strong opinion on KHO or the stand alone library. I am fine with both.

Chris

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Chris Li @ 2025-08-28 23:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Pasha Tatashin, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250828124320.GB7333@nvidia.com>

On Thu, Aug 28, 2025 at 5:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Aug 27, 2025 at 05:03:55PM +0200, Pratyush Yadav wrote:
>
> > I think we need something like a luo_xarray data structure that users like
> > memfd (and later hugetlb and guest_memfd and maybe others) can build to
> > make serialization easier. It will cover both contiguous arrays and
> > arrays with some holes in them.
>
> I'm not sure xarray is the right way to go, it is a very complex data
> structure and building a kho variation of it seems like it is a huge
> amount of work.
>
> I'd stick with simple kvalloc type approaches until we really run into
> trouble.
>
> You can always map a sparse xarray into a kvalloc linear list by
> including the xarray index in each entry.

Each entry will be 16 bytes, 8 for the index and 8 for the xarray value, right?
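
As a quick sanity check of that arithmetic, here is a userspace sketch
in Python of flattening a sparse index -> value map into a linear list
of fixed-size entries. The ENTRY layout (two little-endian u64s) and the
flatten()/unflatten() helpers are assumptions for illustration, not an
existing KHO or LUO interface.

```python
import struct

# One entry: u64 index followed by u64 value, 16 bytes in total.
ENTRY = struct.Struct("<QQ")


def flatten(sparse):
    """Serialize {index: value} into a packed, index-ordered byte string."""
    out = bytearray()
    for index in sorted(sparse):
        out += ENTRY.pack(index, sparse[index])
    return bytes(out)


def unflatten(blob):
    """Rebuild the {index: value} map from the packed entries."""
    return {index: value for index, value in ENTRY.iter_unpack(blob)}
```

The cost is a flat 16 bytes per populated slot regardless of how sparse
the index space is, which is why holes are cheap to represent this way.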

> Especially for memfd where we don't actually expect any sparsity in
> real use cases there is no reason to invest a huge effort to optimize
> for it..

Ack.

>
> > As I explained above, the versioning is already there. Beyond that, why
> > do you think a raw C struct is better than FDT? It is just another way
> > of expressing the same information. FDT is a bit more cumbersome to
> > write and read, but comes at the benefit of more introspect-ability.
>
> Doesn't have the size limitations, is easier to work with, runs
> faster.

Yes, especially when you have a large array.

Chris

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Serge E. Hallyn @ 2025-08-28 21:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <CALCETrWHKga33bvzUHnd-mRQUeNXTtXSS8Y8+40d5bxv-CqBhw@mail.gmail.com>

On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > that this is work that needs to be done in userspace. Change the apps,
> > > > stop pushing more and more cruft into the VFS that has no business
> > > > there.
> > >
> > > It would be interesting to know how to patch user space to get the same
> > > guarantees...  Do you think I would propose a kernel patch otherwise?
> >
> > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > nice but IIRC there are ways to get around it anyway).
> 
> Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> affecting the file, but I don't think that writes to the file will
> break the MAP_PRIVATE CoW if it's not already broken.
> 
> IPython says:
> 
> In [1]: import mmap, tempfile
> 
> In [2]: f = tempfile.TemporaryFile()
> 
> In [3]: f.write(b'initial contents')
> Out[3]: 16
> 
> In [4]: f.flush()
> 
> In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> prot=mmap.PROT_READ)
> 
> In [6]: map[:]
> Out[6]: b'initial contents'
> 
> In [7]: f.seek(0)
> Out[7]: 0
> 
> In [8]: f.write(b'changed')
> Out[8]: 7
> 
> In [9]: f.flush()
> 
> In [10]: map[:]
> Out[10]: b'changed contents'

That was surprising to me. However, if I split the reader
and writer into different processes, so

P1:
f = open("/tmp/3", "w")
f.write('initial contents')
f.flush()

P2:
import mmap
f = open("/tmp/3", "r")
map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)

Back to P1:
f.seek(0)
f.write('changed')

Back to P2:
map[:]

Then P2 gives me:

b'initial contents'
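
For anyone who wants to rerun this non-interactively, below is a
single-process sketch with a separate reader file object. The helper
name private_mapping_before_after() is made up here. Every write is
flushed explicitly, since data still sitting in a userspace buffer has
not reached the file and so cannot show up in any mapping. Per mmap(2)
the "after" value is unspecified, so the sketch returns it rather than
asserting a particular result.

```python
import mmap
import tempfile


def private_mapping_before_after():
    """Map a file MAP_PRIVATE, overwrite the file, and report both views."""
    with tempfile.NamedTemporaryFile() as writer:
        writer.write(b"initial contents")
        writer.flush()

        with open(writer.name, "rb") as reader:
            # Length 0 maps the whole file.
            view = mmap.mmap(reader.fileno(), 0,
                             flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
            before = view[:]

            # Overwrite the start of the file through the other file object.
            writer.seek(0)
            writer.write(b"changed")
            writer.flush()

            after = view[:]
            view.close()
    return before, after
```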

-serge

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Jeff Xu @ 2025-08-28 20:17 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Jeff Xu, Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner,
	Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <20250827.ieRaeNg4pah3@digikod.net>

Hi Mickaël

On Wed, Aug 27, 2025 at 1:19 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Tue, Aug 26, 2025 at 01:29:55PM -0700, Jeff Xu wrote:
> > Hi Mickaël
> >
> > On Tue, Aug 26, 2025 at 5:39 AM Mickaël Salaün <mic@digikod.net> wrote:
> > >
> > > On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote:
> > > > Hi Mickaël
> > > >
> > > > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > > >
> > > > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> > > > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > >
> > > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > > > > > > passed file descriptors).  This changes the state of the opened file by
> > > > > > > > > making it read-only until it is closed.  The main use case is for script
> > > > > > > > > > interpreters to get the guarantee that script's content cannot be altered
> > > > > > > > > while being read and interpreted.  This is useful for generic distros
> > > > > > > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > > > > > > >
> > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > > > > > > > property on files with deny_write_access().  This new O_DENY_WRITE makes
> > > > > > > >
> > > > > > > > The kernel actually tried to get rid of this behavior on execve() in
> > > > > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44; but sadly that had
> > > > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > > > > > > because it broke userspace assumptions.
> > > > > > >
> > > > > > > Oh, good to know.
> > > > > > >
> > > > > > > >
> > > > > > > > > it widely available.  This is similar to what other OSs may provide
> > > > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > > > > > > >
> > > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > > > > > > removed for security reasons; as
> > > > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > > > > > > >
> > > > > > > > |        MAP_DENYWRITE
> > > > > > > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > > > > > > |               signaled that attempts to write to the underlying file
> > > > > > > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > > > > > > |               of-service attacks.)"
> > > > > > > >
> > > > > > > > It seems to me that the same issue applies to your patch - it would
> > > > > > > > allow unprivileged processes to essentially lock files such that other
> > > > > > > > processes can't write to them anymore. This might allow unprivileged
> > > > > > > > users to prevent root from updating config files or stuff like that if
> > > > > > > > they're updated in-place.
> > > > > > >
> > > > > > > > Yes, I agree, but since it is the case for executed files I thought it
> > > > > > > was worth starting a discussion on this topic.  This new flag could be
> > > > > > > restricted to executable files, but we should avoid system-wide locks
> > > > > > > > like this.  I'm not sure how Windows handles these issues though.
> > > > > > >
> > > > > > > Anyway, we should rely on the access control policy to control write and
> > > > > > > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > > > > > > the references and the background!
> > > > > >
> > > > > > I'm confused.  I understand that there are many contexts in which one
> > > > > > would want to prevent execution of unapproved content, which might
> > > > > > include preventing a given process from modifying some code and then
> > > > > > executing it.
> > > > > >
> > > > > > I don't understand what these deny-write features have to do with it.
> > > > > > These features merely prevent someone from modifying code *that is
> > > > > > currently in use*, which is not at all the same thing as preventing
> > > > > > modifying code that might get executed -- one can often modify
> > > > > > contents *before* executing those contents.
> > > > >
> > > > > The order of checks would be:
> > > > > 1. open script with O_DENY_WRITE
> > > > > 2. check executability with AT_EXECVE_CHECK
> > > > > 3. read the content and interpret it
> > > > >
> > > > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving.
> > > >
> > > > AT_EXECVE_CHECK is not just for scripting languages. It could also
> > > > work with bytecodes like Java, for example. If we let the Java runtime
> > > > call AT_EXECVE_CHECK before loading the bytecode, the LSM could
> > > > develop a policy based on that.
> > >
> > > Sure, I'm using "script" to make it simple, but this applies to other
> > > use cases.
> > >
> > That makes sense.
> >
> > > >
> > > > > The deny-write feature was to guarantee that there is no race condition
> > > > > between step 2 and 3.  All these checks are supposed to be done by a
> > > > > trusted interpreter (which is allowed to be executed).  The
> > > > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > > > > associated security policies) allowed the *current* content of the file
> > > > > > to be executed.  Whatever happens before or after that (wrt.
> > > > > O_DENY_WRITE) should be covered by the security policy.
> > > > >
> > > > > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK to be reliable.
> > > >
> > > > Enforcing non-write for the path that stores scripts or bytecodes can
> > > > be challenging due to historical or backward compatibility reasons.
> > > > Since AT_EXECVE_CHECK provides a mechanism to check the file right
> > > > before it is used, we can assume it will detect any "problem" that
> > > > happened before that, (e.g. the file was overwritten). However, that
> > > > also imposes two additional requirements:
> > > > 1> the file doesn't change while AT_EXECVE_CHECK does the check.
> > >
> > > This is already the case, so any kind of LSM checks are good.
> > >
> > May I ask how this is done? Does some code in do_open_execat() do this?
> > Apologies if this is a basic question.
>
> do_open_execat() calls exe_file_deny_write_access()
>
Thanks for the pointer.
With that, I have now read the full history of the discussion regarding this :-)

> >
> > > > 2>The file content kept by the process remains unchanged after passing
> > > > the AT_EXECVE_CHECK.
> > >
> > > The goal of this patch was to avoid such race condition in the case
> > > where executable files can be updated.  But in most cases it should not
> > > be a security issue (because processes allowed to write to executable
> > > files should be trusted), but this could still lead to bugs (because of
> > > inconsistent file content, half-updated).
> > >
> > There is also a time gap between:
> > a> the time of AT_EXECVE_CHECK
> > b> the time that the app opens the file for execution.
> > Right? That is another potential attack path (though this is not the
> > case I mentioned previously).
>
> As explained in the documentation, to avoid this specific race
> condition, interpreters should open the script once, check the FD with
> AT_EXECVE_CHECK, and then read the content with the same FD.
>
Yes, now I see that in the description of this patch. Sorry that I
missed it previously.
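
The open-once pattern described above can be sketched from the
interpreter side as follows. This is only an illustration: load_script()
is a made-up helper, and the AT_EXECVE_CHECK step is passed in as a
hypothetical check_fd callable because the Python standard library has
no wrapper for execveat(2) with that flag. What it shows is that the
descriptor that was checked is the same one that is read, so swapping
the path between check and read has no effect. It does not address
in-place modification of the file, which is the remaining race
discussed in this thread.

```python
import os


def load_script(path, check_fd):
    """Open once, run the check on the FD, then read from that same FD.

    check_fd is a hypothetical callable(fd) -> bool standing in for an
    execveat(fd, "", ..., AT_EMPTY_PATH | AT_EXECVE_CHECK) call.
    """
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        if not check_fd(fd):
            raise PermissionError(f"execution of {path!r} denied by policy")
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:  # end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```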

> >
> > For the case I mentioned previously, I have to think more if the race
> > condition is a bug or security issue.
> > IIUC, two solutions are discussed so far:
> > 1> the process could write to fs to update the script.  However, for
> > execution, the process still uses the copy that passed the
> > AT_EXECVE_CHECK. (snapshot solution by Andy Lutomirski)
>
> Yes, the snapshot solution would be the best, but I guess it would rely
> on filesystems to support this feature.
>
Snapshot seems to be a reasonable direction to go.

Is this something related to the VMA? E.g., preserving the in-memory
copy of the file when the file on the fs is updated.

According to man mmap:
       MAP_PRIVATE
              Create a private copy-on-write mapping.  Updates to the
              mapping are not visible to other processes mapping the same
              file, and are not carried through to the underlying file.
              It is unspecified whether changes made to the file after
              the mmap() call are visible in the mapped region.

so the direction here is
the process -> update the vma -> doesn't carry to the file.

What we want is the reverse direction: (the unspecified part in the man page)
file updated on fs -> doesn't carry to the vma of this process.

> > or 2> the process blocks the write while opening the file as read only
> > and executing the script. (this seems to be the approach of this
> > patch).
>
> Yes, and this is not something we want anymore.
>
Right. Thank you for clarifying this.

> >
> > I wonder if there are other ideas.
>
> I don't see other efficient ways to give the same guarantees.
Right, me neither.

Thanks and regards,
-Jeff

^ permalink raw reply

* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Adhemerval Zanella Netto @ 2025-08-28 17:06 UTC (permalink / raw)
  To: Paul Eggert, Aleksa Sarai; +Cc: Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <cbbc9639-0443-4bf8-bbd1-9d3fdcb2fd37@cs.ucla.edu>

On 28/08/25 10:43, Paul Eggert wrote:
> On 2025-08-28 01:42, Aleksa Sarai wrote:
>>> I still fail to understand how a hypothetical "give me the supported flags"
>>> openat2 flag would be useful enough to justify complicating the openat2 API
>>> today.
>> My only concern is that it would break recompiles if/when we change it
>> back.
> 
> OK, but from what I can see there's no identified possibility that openat2 will modify the objects its arguments point to, just as there's no identified possibility that plain openat will do so (in a hypothetical extension to remove unnecessary slashes from its filename argument, say).
> 
> In that case it's pretty clear that glibc should mark the open_how argument as pointer-to-const, just as glibc already marks the filename argument.

I am still not sure how a potential CHECK_FIELDS feature would play with
openat2 in the future, especially since glibc now prefers to first include 
the kernel headers before redefining a minimal API to the syscall usage 
(meaning that programs will have access to potentially new flags depending 
on the installed kernel header).

If the kernel intends to modify the open_how in the future, setting open_how 
as const will only add extra confusion. Users might be exposed to this feature 
without explicitly including the kernel headers.

Another option might be to *not* include the kernel headers and keep syncing the
kernel definitions on kernel releases (and maybe excluding flags that might 
modify the open_how). As Florian has said, this kind of mediation by glibc was 
historically time-consuming, complex, and subject to subtle bugs (and that's 
why we abandoned this over time).

^ permalink raw reply

* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Paul Eggert @ 2025-08-28 13:43 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Adhemerval Zanella Netto, Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <2025-08-28-foreign-swampy-comments-arbor-nOkpXI@cyphar.com>

On 2025-08-28 01:42, Aleksa Sarai wrote:
>> I still fail to understand how a hypothetical "give me the supported flags"
>> openat2 flag would be useful enough to justify complicating the openat2 API
>> today.
> My only concern is that it would break recompiles if/when we change it
> back.

OK, but from what I can see there's no identified possibility that 
openat2 will modify the objects its arguments point to, just as there's 
no identified possibility that plain openat will do so (in a 
hypothetical extension to remove unnecessary slashes from its filename 
argument, say).

In that case it's pretty clear that glibc should mark the open_how 
argument as pointer-to-const, just as glibc already marks the filename 
argument.

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-08-28 12:43 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Pasha Tatashin, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs0bjo0yffo.fsf@kernel.org>

On Wed, Aug 27, 2025 at 05:03:55PM +0200, Pratyush Yadav wrote:

> I think we need something like a luo_xarray data structure that users like
> memfd (and later hugetlb and guest_memfd and maybe others) can build to
> make serialization easier. It will cover both contiguous arrays and
> arrays with some holes in them.

I'm not sure xarray is the right way to go, it is a very complex data
structure and building a kho variation of it seems like it is a huge
amount of work.

I'd stick with simple kvalloc type approaches until we really run into
trouble.

You can always map a sparse xarray into a kvalloc linear list by
including the xarray index in each entry.

Especially for memfd where we don't actually expect any sparsity in
real use cases there is no reason to invest a huge effort to optimize
for it..

> As I explained above, the versioning is already there. Beyond that, why
> do you think a raw C struct is better than FDT? It is just another way
> of expressing the same information. FDT is a bit more cumbersome to
> write and read, but comes at the benefit of more introspect-ability.

Doesn't have the size limitations, is easier to work with, runs
faster.

> >  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
> >  luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);
> 
> I think what you describe here is essentially how LUO works currently,
> just that the mechanisms are a bit different.

The bit different is a very important bit though :)

The versioning should be first class, not hidden away as some emergent
property of registering multiple serializers or something like that.

Jason

^ permalink raw reply

* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Aleksa Sarai @ 2025-08-28  8:42 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Adhemerval Zanella Netto, Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <5cbd7011-9c2a-4a23-bbce-84c100877cdb@cs.ucla.edu>

On 2025-08-27, Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 2025-08-27 15:48, Aleksa Sarai wrote:
> > On 2025-08-27, Paul Eggert <eggert@cs.ucla.edu> wrote:
> > > What specific scenario would make the "give me supported flags" flag worth
> > > the hassle of supporting and documenting and testing such a flag?
> > 
> > "Just try it" leads to programs that have to test dozens of flag
> > combinations for syscalls at startup,
> 
> Although that sort of thing can indeed be a problem in general, I don't see
> how it's a problem for openat2 in particular.

While O_* and RESOLVE_* flags are trivial to detect (since you can
always pass -EBADF to force a non-EINVAL error), my goal was to have a
unified interface for extensible-struct syscalls in this department.
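
As an illustration of that -EBADF trick, here is a Linux-only sketch
using ctypes. The helper name probe_openat2() and the three result
strings are invented for this example, and the syscall number 437 is an
assumption that holds for the generic syscall table but not for every
architecture. The idea is that the flags are validated before the path
lookup, so a supported combination fails later with EBADF (bad dirfd,
relative path) while an unknown flag fails with EINVAL.

```python
import ctypes
import errno
import os

SYS_OPENAT2 = 437  # assumption: generic syscall number, not universal


class OpenHow(ctypes.Structure):
    """Mirror of struct open_how: three 64-bit fields."""
    _fields_ = [("flags", ctypes.c_uint64),
                ("mode", ctypes.c_uint64),
                ("resolve", ctypes.c_uint64)]


_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def probe_openat2(flags=os.O_RDONLY, resolve=0):
    """Return 'supported', 'unsupported', or 'unknown' for a flag set."""
    how = OpenHow(flags=flags, mode=0, resolve=resolve)
    ctypes.set_errno(0)
    ret = _libc.syscall(ctypes.c_long(SYS_OPENAT2),
                        ctypes.c_long(-errno.EBADF),  # deliberately bad dirfd
                        ctypes.c_char_p(b"x"),        # relative path
                        ctypes.byref(how),
                        ctypes.c_size_t(ctypes.sizeof(how)))
    err = ctypes.get_errno()
    if ret >= 0:
        # Not expected with a bad dirfd, but do not leak the descriptor.
        os.close(ret)
        return "supported"
    if err == errno.EBADF:
        return "supported"    # flags accepted, lookup failed as intended
    if err == errno.EINVAL:
        return "unsupported"  # the kernel rejected the flags themselves
    return "unknown"          # e.g. ENOSYS or a seccomp filter
```

Doing this for every flag combination a program cares about is exactly
the startup cost that a "give me the supported flags" interface would
avoid.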

> The issue here is whether openat2's API should reflect current behavior
> (where the HOW argument is pointer-to-const) or a potential future behavior
> (where the kernel might modify the struct that HOW points to, if some
> hypothetical future flag is set in that struct). I am skeptical that this
> hypothetical situation is so plausible that it justifies the maintenance
> hassle of a glibc API that doesn't correspond to how openat2 currently
> behaves.

I mean, the kernel definition doesn't mark the syscall argument as
"const" so making it const in glibc also means maintaining a divergence
from the kernel. Of course, glibc does this for plenty of other
syscalls so it's not my place to say which is better.

My intention was just to say that this *was* intentional (which was how I
understood the initial question that I was Cc'd on), and if you feel
that intention is misguided / doesn't mesh with what glibc wants then
that's your call.

> > A simple example would be mounts -- if MOUNT_BENEATH is not supported
>
> I don't understand this example. Are you talking about <linux/mount.h>'s
> MOVE_MOUNT_BENEATH? That's a move_mount flag, and I don't see what that has
> to do with openat2. Or are you saying that openat2 might not support
> <linux/openat2.h>'s RESOLVE_BENEATH flag? Under what conditions might that
> be, exactly? Can you give some plausible user code to illustrate the openat2
> example you're thinking of?

I was just giving it as an example where "just try it" is not really
ideal for userspace today. clone3(2) is an extensible-struct syscall
that needs this.

> I still fail to understand how a hypothetical "give me the supported flags"
> openat2 flag would be useful enough to justify complicating the openat2 API
> today.

My only concern is that it would break recompiles if/when we change it
back. If that is not a concern for glibc as a project then you are of
course free to do whatever makes sense for glibc.
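
(A sketch of the recompile concern, under stated assumptions. If a libc
first ships a pointer-to-const prototype and later drops the const,
callers that pass a const object directly start getting a
discarded-qualifier diagnostic in C, fatal under -Werror, and a hard
error in C++. The wrapper sketch_openat2, the struct sketch_open_how and
the helper open_beneath are hypothetical names, not the proposed glibc
API. Copying into a mutable local compiles against either prototype.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the initial struct open_how layout (assumption: 3 x u64). */
struct sketch_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Hypothetical wrapper whose 'how' is NOT const, like the kernel definition. */
static long sketch_openat2(int dirfd, const char *path,
			   struct sketch_open_how *how, size_t size)
{
#ifdef SYS_openat2
	return syscall(SYS_openat2, (long)dirfd, path, how, size);
#else
	(void)dirfd; (void)path; (void)how; (void)size;
	errno = ENOSYS;
	return -1;
#endif
}

/* Read-only policy object shared by all callers; 0x08 mirrors RESOLVE_BENEATH. */
static const struct sketch_open_how beneath_policy = {
	.flags = O_RDONLY,
	.resolve = 0x08,
};

static long open_beneath(int dirfd, const char *path)
{
	/*
	 * Passing &beneath_policy directly would discard 'const' against a
	 * non-const prototype. A per-call copy works with both variants and
	 * also tolerates a kernel that writes back into the struct.
	 */
	struct sketch_open_how how = beneath_policy;

	return sketch_openat2(dirfd, path, &how, sizeof(how));
}
```

The copy costs 24 bytes of stack per call, which is negligible next to
the syscall itself.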

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/



* [PATCH 01/14] perf test: Fix a build error in x86 topdown test
From: mysteryli @ 2025-08-28  8:24 UTC (permalink / raw)
  To: m13940358460
  Cc: Namhyung Kim, Naresh Kamboju, Paolo Bonzini, kvm, Yury Norov,
	Mark Rutland, x86, Catalin Marinas, Will Deacon, linux-arm-kernel,
	Madhavan Srinivasan, linuxppc-dev, Arnd Bergmann, linux-api,
	Christian Brauner, linux-fsdevel, Michael S. Tsirkin, Jason Wang,
	virtualization, Ian Rogers

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 60100 bytes --]

From: Namhyung Kim <namhyung@kernel.org>

There's an environment that caused the following build error.  Include
"debug.h" (under util directory) to fix it.

  arch/x86/tests/topdown.c: In function 'event_cb':
  arch/x86/tests/topdown.c:53:25: error: implicit declaration of function 'pr_debug'
                                         [-Werror=implicit-function-declaration]
     53 |                         pr_debug("Broken topdown information for '%s'\n", evsel__name(evsel));
        |                         ^~~~~~~~
  cc1: all warnings being treated as errors

Link: https://lore.kernel.org/r/20250815164122.289651-1-namhyung@kernel.org
Fixes: 5b546de9cc177936 ("perf topdown: Use attribute to see an event is a topdown metic or slots")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/arch/x86/tests/topdown.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/arch/x86/tests/topdown.c b/tools/perf/arch/x86/tests/topdown.c
index 8d0ea7a4bbc1..1eba3b4594ef 100644
--- a/tools/perf/arch/x86/tests/topdown.c
+++ b/tools/perf/arch/x86/tests/topdown.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include "arch-tests.h"
 #include "../util/topdown.h"
+#include "debug.h"
 #include "evlist.h"
 #include "parse-events.h"
 #include "pmu.h"
-- 
2.25.1


From bd842ff41543af424c2473dc16c678ac8ba2b43f Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 02/14] tools headers: Sync KVM headers with the kernel source

To pick up the changes in this cset:

  f55ce5a6cd33211c KVM: arm64: Expose new KVM cap for cacheable PFNMAP
  28224ef02b56fcee KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
  4580dbef5ce0f95a KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
  25e8b1dd4883e6c2 KVM: TDX: Exit to userspace for GetTdVmCallInfo
  cf207eac06f661fb KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
    diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h

Please see tools/include/uapi/README for further details.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/arch/x86/include/uapi/asm/kvm.h |  8 +++++++-
 tools/include/uapi/linux/kvm.h        | 27 +++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 6f3499507c5e..0f15d683817d 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -965,7 +965,13 @@ struct kvm_tdx_cmd {
 struct kvm_tdx_capabilities {
 	__u64 supported_attrs;
 	__u64 supported_xfam;
-	__u64 reserved[254];
+
+	__u64 kernel_tdvmcallinfo_1_r11;
+	__u64 user_tdvmcallinfo_1_r11;
+	__u64 kernel_tdvmcallinfo_1_r12;
+	__u64 user_tdvmcallinfo_1_r12;
+
+	__u64 reserved[250];
 
 	/* Configurable CPUID bits for userspace */
 	struct kvm_cpuid2 cpuid;
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 7415a3863891..f0f0d49d2544 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -178,6 +178,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_TDX              40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -447,6 +448,31 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_TDX */
+		struct {
+			__u64 flags;
+			__u64 nr;
+			union {
+				struct {
+					__u64 ret;
+					__u64 data[5];
+				} unknown;
+				struct {
+					__u64 ret;
+					__u64 gpa;
+					__u64 size;
+				} get_quote;
+				struct {
+					__u64 ret;
+					__u64 leaf;
+					__u64 r11, r12, r13, r14;
+				} get_tdvmcall_info;
+				struct {
+					__u64 ret;
+					__u64 vector;
+				} setup_event_notify;
+			};
+		} tdx;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -935,6 +961,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_EL2 240
 #define KVM_CAP_ARM_EL2_E2H0 241
 #define KVM_CAP_RISCV_MP_STATE_RESET 242
+#define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
-- 
2.25.1


From 6cb8607934d937f4ad24ec9ad26aeb669e266937 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 03/14] tools headers: Sync linux/bits.h with the kernel source

To pick up the changes in this cset:

  104ea1c84b91c9f4 bits: unify the non-asm GENMASK*()
  6d4471252ccc1722 bits: split the definition of the asm and non-asm GENMASK*()

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/linux/bits.h include/linux/bits.h

Please see tools/include/uapi/README for further details.

Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/include/linux/bits.h | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/tools/include/linux/bits.h b/tools/include/linux/bits.h
index 7ad056219115..a40cc861b3a7 100644
--- a/tools/include/linux/bits.h
+++ b/tools/include/linux/bits.h
@@ -2,10 +2,8 @@
 #ifndef __LINUX_BITS_H
 #define __LINUX_BITS_H
 
-#include <linux/const.h>
 #include <vdso/bits.h>
 #include <uapi/linux/bits.h>
-#include <asm/bitsperlong.h>
 
 #define BIT_MASK(nr)		(UL(1) << ((nr) % BITS_PER_LONG))
 #define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
@@ -50,10 +48,14 @@
 	     (type_max(t) << (l) &				\
 	      type_max(t) >> (BITS_PER_TYPE(t) - 1 - (h)))))
 
+#define GENMASK(h, l)		GENMASK_TYPE(unsigned long, h, l)
+#define GENMASK_ULL(h, l)	GENMASK_TYPE(unsigned long long, h, l)
+
 #define GENMASK_U8(h, l)	GENMASK_TYPE(u8, h, l)
 #define GENMASK_U16(h, l)	GENMASK_TYPE(u16, h, l)
 #define GENMASK_U32(h, l)	GENMASK_TYPE(u32, h, l)
 #define GENMASK_U64(h, l)	GENMASK_TYPE(u64, h, l)
+#define GENMASK_U128(h, l)	GENMASK_TYPE(u128, h, l)
 
 /*
  * Fixed-type variants of BIT(), with additional checks like GENMASK_TYPE(). The
@@ -79,28 +81,9 @@
  * BUILD_BUG_ON_ZERO is not available in h files included from asm files,
  * disable the input check if that is the case.
  */
-#define GENMASK_INPUT_CHECK(h, l) 0
+#define GENMASK(h, l)		__GENMASK(h, l)
+#define GENMASK_ULL(h, l)	__GENMASK_ULL(h, l)
 
 #endif /* !defined(__ASSEMBLY__) */
 
-#define GENMASK(h, l) \
-	(GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
-#define GENMASK_ULL(h, l) \
-	(GENMASK_INPUT_CHECK(h, l) + __GENMASK_ULL(h, l))
-
-#if !defined(__ASSEMBLY__)
-/*
- * Missing asm support
- *
- * __GENMASK_U128() depends on _BIT128() which would not work
- * in the asm code, as it shifts an 'unsigned __int128' data
- * type instead of direct representation of 128 bit constants
- * such as long and unsigned long. The fundamental problem is
- * that a 128 bit constant will get silently truncated by the
- * gcc compiler.
- */
-#define GENMASK_U128(h, l) \
-	(GENMASK_INPUT_CHECK(h, l) + __GENMASK_U128(h, l))
-#endif
-
 #endif	/* __LINUX_BITS_H */
-- 
2.25.1


From aa34642f6fc36a436de5ae5b30d414578b3622f5 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 04/14] tools headers: Sync linux/cfi_types.h with the kernel
 source

To pick up the changes in this cset:

  5ccaeedb489b41ce cfi: add C CFI type macro

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/linux/cfi_types.h include/linux/cfi_types.h

Please see tools/include/uapi/README for further details.

Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/include/linux/cfi_types.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/include/linux/cfi_types.h b/tools/include/linux/cfi_types.h
index 6b8713675765..685f7181780f 100644
--- a/tools/include/linux/cfi_types.h
+++ b/tools/include/linux/cfi_types.h
@@ -41,5 +41,28 @@
 	SYM_TYPED_START(name, SYM_L_GLOBAL, SYM_A_ALIGN)
 #endif
 
+#else /* __ASSEMBLY__ */
+
+#ifdef CONFIG_CFI_CLANG
+#define DEFINE_CFI_TYPE(name, func)						\
+	/*									\
+	 * Force a reference to the function so the compiler generates		\
+	 * __kcfi_typeid_<func>.						\
+	 */									\
+	__ADDRESSABLE(func);							\
+	/* u32 name __ro_after_init = __kcfi_typeid_<func> */			\
+	extern u32 name;							\
+	asm (									\
+	"	.pushsection	.data..ro_after_init,\"aw\",\%progbits	\n"	\
+	"	.type	" #name ",\%object				\n"	\
+	"	.globl	" #name "					\n"	\
+	"	.p2align	2, 0x0					\n"	\
+	#name ":							\n"	\
+	"	.4byte	__kcfi_typeid_" #func "				\n"	\
+	"	.size	" #name ", 4					\n"	\
+	"	.popsection						\n"	\
+	);
+#endif
+
 #endif /* __ASSEMBLY__ */
 #endif /* _LINUX_CFI_TYPES_H */
-- 
2.25.1


From 619f55c859014e2235f83ba6cde8c59edc492f39 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 05/14] tools headers: Sync x86 headers with the kernel source

To pick up the changes in this cset:

  7b306dfa326f7011 x86/sev: Evict cache lines during SNP memory validation
  65f55a30176662ee x86/CPU/AMD: Add CPUID faulting support
  d8010d4ba43e9f79 x86/bugs: Add a Transient Scheduler Attacks mitigation
  a3c4f3396b82849a x86/msr-index: Add AMD workload classification MSRs
  17ec2f965344ee3f KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supported

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
    diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h

Please see tools/include/uapi/README for further details.

Cc: x86@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/arch/x86/include/asm/cpufeatures.h | 10 +++++++++-
 tools/arch/x86/include/asm/msr-index.h   |  7 +++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index ee176236c2be..06fc0479a23f 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -218,6 +218,7 @@
 #define X86_FEATURE_FLEXPRIORITY	( 8*32+ 1) /* "flexpriority" Intel FlexPriority */
 #define X86_FEATURE_EPT			( 8*32+ 2) /* "ept" Intel Extended Page Table */
 #define X86_FEATURE_VPID		( 8*32+ 3) /* "vpid" Intel Virtual Processor ID */
+#define X86_FEATURE_COHERENCY_SFW_NO	( 8*32+ 4) /* SNP cache coherency software work around not needed */
 
 #define X86_FEATURE_VMMCALL		( 8*32+15) /* "vmmcall" Prefer VMMCALL to VMCALL */
 #define X86_FEATURE_XENPV		( 8*32+16) /* Xen paravirtual guest */
@@ -456,10 +457,14 @@
 #define X86_FEATURE_NO_NESTED_DATA_BP	(20*32+ 0) /* No Nested Data Breakpoints */
 #define X86_FEATURE_WRMSR_XX_BASE_NS	(20*32+ 1) /* WRMSR to {FS,GS,KERNEL_GS}_BASE is non-serializing */
 #define X86_FEATURE_LFENCE_RDTSC	(20*32+ 2) /* LFENCE always serializing / synchronizes RDTSC */
+#define X86_FEATURE_VERW_CLEAR		(20*32+ 5) /* The memory form of VERW mitigates TSA */
 #define X86_FEATURE_NULL_SEL_CLR_BASE	(20*32+ 6) /* Null Selector Clears Base */
+
 #define X86_FEATURE_AUTOIBRS		(20*32+ 8) /* Automatic IBRS */
 #define X86_FEATURE_NO_SMM_CTL_MSR	(20*32+ 9) /* SMM_CTL MSR is not present */
 
+#define X86_FEATURE_GP_ON_USER_CPUID	(20*32+17) /* User CPUID faulting */
+
 #define X86_FEATURE_PREFETCHI		(20*32+20) /* Prefetch Data/Instruction to Cache Level */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
@@ -487,6 +492,9 @@
 #define X86_FEATURE_PREFER_YMM		(21*32+ 8) /* Avoid ZMM registers due to downclocking */
 #define X86_FEATURE_APX			(21*32+ 9) /* Advanced Performance Extensions */
 #define X86_FEATURE_INDIRECT_THUNK_ITS	(21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
+#define X86_FEATURE_TSA_SQ_NO		(21*32+11) /* AMD CPU not vulnerable to TSA-SQ */
+#define X86_FEATURE_TSA_L1_NO		(21*32+12) /* AMD CPU not vulnerable to TSA-L1 */
+#define X86_FEATURE_CLEAR_CPU_BUF_VM	(21*32+13) /* Clear CPU buffers using VERW before VMRUN */
 
 /*
  * BUG word(s)
@@ -542,5 +550,5 @@
 #define X86_BUG_OLD_MICROCODE		X86_BUG( 1*32+ 6) /* "old_microcode" CPU has old microcode, it is surely vulnerable to something */
 #define X86_BUG_ITS			X86_BUG( 1*32+ 7) /* "its" CPU is affected by Indirect Target Selection */
 #define X86_BUG_ITS_NATIVE_ONLY		X86_BUG( 1*32+ 8) /* "its_native_only" CPU is affected by ITS, VMX is not affected */
-
+#define X86_BUG_TSA			X86_BUG( 1*32+ 9) /* "tsa" CPU is affected by Transient Scheduler Attacks */
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 5cfb5d74dd5f..b65c3ba5fa14 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -419,6 +419,7 @@
 #define DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI	(1UL << 12)
 #define DEBUGCTLMSR_FREEZE_IN_SMM_BIT	14
 #define DEBUGCTLMSR_FREEZE_IN_SMM	(1UL << DEBUGCTLMSR_FREEZE_IN_SMM_BIT)
+#define DEBUGCTLMSR_RTM_DEBUG		BIT(15)
 
 #define MSR_PEBS_FRONTEND		0x000003f7
 
@@ -733,6 +734,11 @@
 #define MSR_AMD64_PERF_CNTR_GLOBAL_CTL		0xc0000301
 #define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR	0xc0000302
 
+/* AMD Hardware Feedback Support MSRs */
+#define MSR_AMD_WORKLOAD_CLASS_CONFIG		0xc0000500
+#define MSR_AMD_WORKLOAD_CLASS_ID		0xc0000501
+#define MSR_AMD_WORKLOAD_HRST			0xc0000502
+
 /* AMD Last Branch Record MSRs */
 #define MSR_AMD64_LBR_SELECT			0xc000010e
 
@@ -831,6 +837,7 @@
 #define MSR_K7_HWCR_SMMLOCK		BIT_ULL(MSR_K7_HWCR_SMMLOCK_BIT)
 #define MSR_K7_HWCR_IRPERF_EN_BIT	30
 #define MSR_K7_HWCR_IRPERF_EN		BIT_ULL(MSR_K7_HWCR_IRPERF_EN_BIT)
+#define MSR_K7_HWCR_CPUID_USER_DIS_BIT	35
 #define MSR_K7_FID_VID_CTL		0xc0010041
 #define MSR_K7_FID_VID_STATUS		0xc0010042
 #define MSR_K7_HWCR_CPB_DIS_BIT		25
-- 
2.25.1


From 14ec8ce45611c767656e4fa575f17b05344aa80a Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 06/14] tools headers: Sync arm64 headers with the kernel
 source
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

To pick up the changes in this cset:

  efe676a1a7554219 arm64: proton-pack: Add new CPUs 'k' values for branch mitigation
  e18c09b204e81702 arm64: Add support for HIP09 Spectre-BHB mitigation
  a9b5bd81b294d30a arm64: cputype: Add MIDR_CORTEX_A76AE
  53a52a0ec7680287 arm64: cputype: Add comments about Qualcomm Kryo 5XX and 6XX cores
  401c3333bb2396aa arm64: cputype: Add QCOM_CPU_PART_KRYO_3XX_GOLD
  86edf6bdcf0571c0 smccc/kvm_guest: Enable errata based on implementation CPUs
  0bc9a9e85fcf4ffb KVM: arm64: Work around x1e's CNTVOFF_EL2 bogosity

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/arm64/include/asm/cputype.h arch/arm64/include/asm/cputype.h

But the following two changes cannot be applied since they introduced
new build errors in util/arm-spe.c.  So it still has the warning after
this change.

  c8c2647e69bedf80 arm64: Make  _midr_in_range_list() an exported function
  e3121298c7fcaf48 arm64: Modify _midr_range() functions to read MIDR/REVIDR internally

Please see tools/include/uapi/README for further details.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf build: [WIP] Fix arm-spe build errors

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/arch/arm64/include/asm/cputype.h | 28 ++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tools/arch/arm64/include/asm/cputype.h b/tools/arch/arm64/include/asm/cputype.h
index 9a5d85cfd1fb..139d5e87dc95 100644
--- a/tools/arch/arm64/include/asm/cputype.h
+++ b/tools/arch/arm64/include/asm/cputype.h
@@ -75,11 +75,13 @@
 #define ARM_CPU_PART_CORTEX_A76		0xD0B
 #define ARM_CPU_PART_NEOVERSE_N1	0xD0C
 #define ARM_CPU_PART_CORTEX_A77		0xD0D
+#define ARM_CPU_PART_CORTEX_A76AE	0xD0E
 #define ARM_CPU_PART_NEOVERSE_V1	0xD40
 #define ARM_CPU_PART_CORTEX_A78		0xD41
 #define ARM_CPU_PART_CORTEX_A78AE	0xD42
 #define ARM_CPU_PART_CORTEX_X1		0xD44
 #define ARM_CPU_PART_CORTEX_A510	0xD46
+#define ARM_CPU_PART_CORTEX_X1C		0xD4C
 #define ARM_CPU_PART_CORTEX_A520	0xD80
 #define ARM_CPU_PART_CORTEX_A710	0xD47
 #define ARM_CPU_PART_CORTEX_A715	0xD4D
@@ -119,9 +121,11 @@
 #define QCOM_CPU_PART_KRYO		0x200
 #define QCOM_CPU_PART_KRYO_2XX_GOLD	0x800
 #define QCOM_CPU_PART_KRYO_2XX_SILVER	0x801
+#define QCOM_CPU_PART_KRYO_3XX_GOLD	0x802
 #define QCOM_CPU_PART_KRYO_3XX_SILVER	0x803
 #define QCOM_CPU_PART_KRYO_4XX_GOLD	0x804
 #define QCOM_CPU_PART_KRYO_4XX_SILVER	0x805
+#define QCOM_CPU_PART_ORYON_X1		0x001
 
 #define NVIDIA_CPU_PART_DENVER		0x003
 #define NVIDIA_CPU_PART_CARMEL		0x004
@@ -129,6 +133,7 @@
 #define FUJITSU_CPU_PART_A64FX		0x001
 
 #define HISI_CPU_PART_TSV110		0xD01
+#define HISI_CPU_PART_HIP09			0xD02
 #define HISI_CPU_PART_HIP12		0xD06
 
 #define APPLE_CPU_PART_M1_ICESTORM	0x022
@@ -159,11 +164,13 @@
 #define MIDR_CORTEX_A76	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76)
 #define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1)
 #define MIDR_CORTEX_A77	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77)
+#define MIDR_CORTEX_A76AE	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76AE)
 #define MIDR_NEOVERSE_V1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V1)
 #define MIDR_CORTEX_A78	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78)
 #define MIDR_CORTEX_A78AE	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78AE)
 #define MIDR_CORTEX_X1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1)
 #define MIDR_CORTEX_A510 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A510)
+#define MIDR_CORTEX_X1C MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1C)
 #define MIDR_CORTEX_A520 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A520)
 #define MIDR_CORTEX_A710 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A710)
 #define MIDR_CORTEX_A715 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A715)
@@ -196,13 +203,26 @@
 #define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
 #define MIDR_QCOM_KRYO_2XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_GOLD)
 #define MIDR_QCOM_KRYO_2XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_SILVER)
+#define MIDR_QCOM_KRYO_3XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_3XX_GOLD)
 #define MIDR_QCOM_KRYO_3XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_3XX_SILVER)
 #define MIDR_QCOM_KRYO_4XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_GOLD)
 #define MIDR_QCOM_KRYO_4XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_SILVER)
+#define MIDR_QCOM_ORYON_X1 MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_ORYON_X1)
+
+/*
+ * NOTES:
+ * - Qualcomm Kryo 5XX Prime / Gold ID themselves as MIDR_CORTEX_A77
+ * - Qualcomm Kryo 5XX Silver IDs itself as MIDR_QCOM_KRYO_4XX_SILVER
+ * - Qualcomm Kryo 6XX Prime IDs itself as MIDR_CORTEX_X1
+ * - Qualcomm Kryo 6XX Gold IDs itself as ARM_CPU_PART_CORTEX_A78
+ * - Qualcomm Kryo 6XX Silver IDs itself as MIDR_CORTEX_A55
+ */
+
 #define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
 #define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
 #define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
 #define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV110)
+#define MIDR_HISI_HIP09 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_HIP09)
 #define MIDR_HISI_HIP12 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_HIP12)
 #define MIDR_APPLE_M1_ICESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM)
 #define MIDR_APPLE_M1_FIRESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM)
@@ -291,6 +311,14 @@ static inline u32 __attribute_const__ read_cpuid_id(void)
 	return read_cpuid(MIDR_EL1);
 }
 
+struct target_impl_cpu {
+	u64 midr;
+	u64 revidr;
+	u64 aidr;
+};
+
+bool cpu_errata_set_target_impl(u64 num, void *impl_cpus);
+
 static inline u64 __attribute_const__ read_cpuid_mpidr(void)
 {
 	return read_cpuid(MPIDR_EL1);
-- 
2.25.1


From c85538c4e3c7111958057d15ea8ee444116891c3 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 07/14] tools headers: Sync powerpc headers with the kernel
 source

To pick up the changes in this cset:

  69bf2053608423cb powerpc: Drop GPL boilerplate text with obsolete FSF address

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/powerpc/include/uapi/asm/kvm.h arch/powerpc/include/uapi/asm/kvm.h

Please see tools/include/uapi/README for further details.

Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/arch/powerpc/include/uapi/asm/kvm.h | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/tools/arch/powerpc/include/uapi/asm/kvm.h b/tools/arch/powerpc/include/uapi/asm/kvm.h
index eaeda001784e..077c5437f521 100644
--- a/tools/arch/powerpc/include/uapi/asm/kvm.h
+++ b/tools/arch/powerpc/include/uapi/asm/kvm.h
@@ -1,18 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
 /*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License, version 2, as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
- *
  * Copyright IBM Corp. 2007
  *
  * Authors: Hollis Blanchard <hollisb@us.ibm.com>
-- 
2.25.1


From 52174e0eb13876654f56701c26a672890aa5e7e3 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 08/14] tools headers: Sync syscall tables with the kernel
 source

To pick up the changes in this cset:

  be7efb2d20d67f33 fs: introduce file_getattr and file_setattr syscalls

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
    diff -u tools/scripts/syscall.tbl scripts/syscall.tbl
    diff -u tools/perf/arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_32.tbl
    diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
    diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
    diff -u tools/perf/arch/arm/entry/syscalls/syscall.tbl arch/arm/tools/syscall.tbl
    diff -u tools/perf/arch/sh/entry/syscalls/syscall.tbl arch/sh/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/sparc/entry/syscalls/syscall.tbl arch/sparc/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/xtensa/entry/syscalls/syscall.tbl arch/xtensa/kernel/syscalls/syscall.tbl

Please see tools/include/uapi/README for further details.

Cc: Arnd Bergmann <arnd@arndb.de>
CC: linux-api@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/include/uapi/asm-generic/unistd.h             | 8 +++++++-
 tools/perf/arch/arm/entry/syscalls/syscall.tbl      | 2 ++
 tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 2 ++
 tools/perf/arch/powerpc/entry/syscalls/syscall.tbl  | 2 ++
 tools/perf/arch/s390/entry/syscalls/syscall.tbl     | 2 ++
 tools/perf/arch/sh/entry/syscalls/syscall.tbl       | 2 ++
 tools/perf/arch/sparc/entry/syscalls/syscall.tbl    | 2 ++
 tools/perf/arch/x86/entry/syscalls/syscall_32.tbl   | 2 ++
 tools/perf/arch/x86/entry/syscalls/syscall_64.tbl   | 2 ++
 tools/perf/arch/xtensa/entry/syscalls/syscall.tbl   | 2 ++
 tools/scripts/syscall.tbl                           | 2 ++
 11 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 2892a45023af..04e0077fb4c9 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -852,8 +852,14 @@ __SYSCALL(__NR_removexattrat, sys_removexattrat)
 #define __NR_open_tree_attr 467
 __SYSCALL(__NR_open_tree_attr, sys_open_tree_attr)
 
+/* fs/inode.c */
+#define __NR_file_getattr 468
+__SYSCALL(__NR_file_getattr, sys_file_getattr)
+#define __NR_file_setattr 469
+__SYSCALL(__NR_file_setattr, sys_file_setattr)
+
 #undef __NR_syscalls
-#define __NR_syscalls 468
+#define __NR_syscalls 470
 
 /*
  * 32 bit systems traditionally used different
diff --git a/tools/perf/arch/arm/entry/syscalls/syscall.tbl b/tools/perf/arch/arm/entry/syscalls/syscall.tbl
index 27c1d5ebcd91..b07e699aaa3c 100644
--- a/tools/perf/arch/arm/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/arm/entry/syscalls/syscall.tbl
@@ -482,3 +482,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index 1e8c44c7b614..7a7049c2c307 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -382,3 +382,5 @@
 465	n64	listxattrat			sys_listxattrat
 466	n64	removexattrat			sys_removexattrat
 467	n64	open_tree_attr			sys_open_tree_attr
+468	n64	file_getattr			sys_file_getattr
+469	n64	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 9a084bdb8926..b453e80dfc00 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -558,3 +558,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index a4569b96ef06..8a6744d658db 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
 465  common	listxattrat		sys_listxattrat			sys_listxattrat
 466  common	removexattrat		sys_removexattrat		sys_removexattrat
 467  common	open_tree_attr		sys_open_tree_attr		sys_open_tree_attr
+468  common	file_getattr		sys_file_getattr		sys_file_getattr
+469  common	file_setattr		sys_file_setattr		sys_file_setattr
diff --git a/tools/perf/arch/sh/entry/syscalls/syscall.tbl b/tools/perf/arch/sh/entry/syscalls/syscall.tbl
index 52a7652fcff6..5e9c9eff5539 100644
--- a/tools/perf/arch/sh/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/sh/entry/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/sparc/entry/syscalls/syscall.tbl b/tools/perf/arch/sparc/entry/syscalls/syscall.tbl
index 83e45eb6c095..ebb7d06d1044 100644
--- a/tools/perf/arch/sparc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/sparc/entry/syscalls/syscall.tbl
@@ -513,3 +513,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl
index ac007ea00979..4877e16da69a 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl
@@ -473,3 +473,5 @@
 465	i386	listxattrat		sys_listxattrat
 466	i386	removexattrat		sys_removexattrat
 467	i386	open_tree_attr		sys_open_tree_attr
+468	i386	file_getattr		sys_file_getattr
+469	i386	file_setattr		sys_file_setattr
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index cfb5ca41e30d..92cf0fe2291e 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -391,6 +391,8 @@
 465	common	listxattrat		sys_listxattrat
 466	common	removexattrat		sys_removexattrat
 467	common	open_tree_attr		sys_open_tree_attr
+468	common	file_getattr		sys_file_getattr
+469	common	file_setattr		sys_file_setattr
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl b/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl
index f657a77314f8..374e4cb788d8 100644
--- a/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl
@@ -438,3 +438,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/scripts/syscall.tbl b/tools/scripts/syscall.tbl
index 580b4e246aec..d1ae5e92c615 100644
--- a/tools/scripts/syscall.tbl
+++ b/tools/scripts/syscall.tbl
@@ -408,3 +408,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
-- 
2.25.1


From b18aabe283a10774977d698c075d2296a2336aef Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 09/14] tools headers: Sync uapi/linux/fcntl.h with the kernel
 source

To pick up the changes in this cset:

  3941e37f62fe2c3c uapi/fcntl: add FD_PIDFS_ROOT
  cd5d2006327b6d84 uapi/fcntl: add FD_INVALID
  67fcec2919e4ed31 fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
  a4c746f06853f91d uapi/fcntl: mark range as reserved

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/fcntl.h include/uapi/linux/fcntl.h

Please see tools/include/uapi/README for further details.

Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 .../trace/beauty/include/uapi/linux/fcntl.h    | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tools/perf/trace/beauty/include/uapi/linux/fcntl.h b/tools/perf/trace/beauty/include/uapi/linux/fcntl.h
index a15ac2fa4b20..f291ab4f94eb 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/fcntl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/fcntl.h
@@ -90,10 +90,28 @@
 #define DN_ATTRIB	0x00000020	/* File changed attibutes */
 #define DN_MULTISHOT	0x80000000	/* Don't remove notifier */
 
+/* Reserved kernel ranges [-100], [-10000, -40000]. */
 #define AT_FDCWD		-100    /* Special value for dirfd used to
 					   indicate openat should use the
 					   current working directory. */
 
+/*
+ * The concept of process and threads in userland and the kernel is a confusing
+ * one - within the kernel every thread is a 'task' with its own individual PID,
+ * however from userland's point of view threads are grouped by a single PID,
+ * which is that of the 'thread group leader', typically the first thread
+ * spawned.
+ *
+ * To cut the Gideon knot, for internal kernel usage, we refer to
+ * PIDFD_SELF_THREAD to refer to the current thread (or task from a kernel
+ * perspective), and PIDFD_SELF_THREAD_GROUP to refer to the current thread
+ * group leader...
+ */
+#define PIDFD_SELF_THREAD		-10000 /* Current thread. */
+#define PIDFD_SELF_THREAD_GROUP		-10001 /* Current thread group leader. */
+
+#define FD_PIDFS_ROOT			-10002 /* Root of the pidfs filesystem */
+#define FD_INVALID			-10009 /* Invalid file descriptor: -10000 - EBADF = -10009 */
 
 /* Generic flags for the *at(2) family of syscalls. */
 
-- 
2.25.1
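The comment on FD_INVALID in the hunk above derives the value from the
base of the reserved range and EBADF. A minimal sketch of that
arithmetic follows; the macro values are copied from the hunk, while
fd_invalid_from_errno() is a hypothetical helper added only for
illustration and is not part of the header.

```c
#include <assert.h>
#include <errno.h>

/* Values copied from the hunk above; guarded in case the installed
 * headers already provide them. */
#ifndef PIDFD_SELF_THREAD
#define PIDFD_SELF_THREAD	-10000
#endif
#ifndef FD_INVALID
#define FD_INVALID		-10009
#endif

/* EBADF is 9, so -10000 - 9 = -10009. */
static_assert(FD_INVALID == PIDFD_SELF_THREAD - EBADF,
	      "FD_INVALID should be the range base minus EBADF");

/* Hypothetical helper: recompute the sentinel at run time. */
static int fd_invalid_from_errno(void)
{
	return PIDFD_SELF_THREAD - EBADF;
}
```

Both the sentinel and AT_FDCWD fall inside the ranges that the new
comment marks as reserved.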


From 4a4083af03a7a75a86c392fd60cb37ce23ed87b6 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 10/14] tools headers: Sync uapi/linux/fs.h with the kernel
 source

To pick up the changes in this cset:

  76fdb7eb4e1c9108 uapi: export PROCFS_ROOT_INO
  ca115d7e754691c0 tree-wide: s/struct fileattr/struct file_kattr/g
  be7efb2d20d67f33 fs: introduce file_getattr and file_setattr syscalls
  9eb22f7fedfc9eb1 fs: add ioctl to query metadata and protection info capabilities

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/fs.h include/uapi/linux/fs.h

Please see tools/include/uapi/README for further details.

Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 .../perf/trace/beauty/include/uapi/linux/fs.h | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/tools/perf/trace/beauty/include/uapi/linux/fs.h b/tools/perf/trace/beauty/include/uapi/linux/fs.h
index 0098b0ce8ccb..0bd678a4a10e 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/fs.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/fs.h
@@ -60,6 +60,17 @@
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
 
+/*
+ * The root inode of procfs is guaranteed to always have the same inode number.
+ * For programs that make heavy use of procfs, verifying that the root is a
+ * real procfs root and using openat2(RESOLVE_{NO_{XDEV,MAGICLINKS},BENEATH})
+ * will allow you to make sure you are never tricked into operating on the
+ * wrong procfs file.
+ */
+enum procfs_ino {
+	PROCFS_ROOT_INO = 1,
+};
+
 struct file_clone_range {
 	__s64 src_fd;
 	__u64 src_offset;
@@ -91,6 +102,63 @@ struct fs_sysfs_path {
 	__u8			name[128];
 };
 
+/* Protection info capability flags */
+#define	LBMD_PI_CAP_INTEGRITY		(1 << 0)
+#define	LBMD_PI_CAP_REFTAG		(1 << 1)
+
+/* Checksum types for Protection Information */
+#define LBMD_PI_CSUM_NONE		0
+#define LBMD_PI_CSUM_IP			1
+#define LBMD_PI_CSUM_CRC16_T10DIF	2
+#define LBMD_PI_CSUM_CRC64_NVME		4
+
+/* sizeof first published struct */
+#define LBMD_SIZE_VER0			16
+
+/*
+ * Logical block metadata capability descriptor
+ * If the device does not support metadata, all the fields will be zero.
+ * Applications must check lbmd_flags to determine whether metadata is
+ * supported or not.
+ */
+struct logical_block_metadata_cap {
+	/* Bitmask of logical block metadata capability flags */
+	__u32	lbmd_flags;
+	/*
+	 * The amount of data described by each unit of logical block
+	 * metadata
+	 */
+	__u16	lbmd_interval;
+	/*
+	 * Size in bytes of the logical block metadata associated with each
+	 * interval
+	 */
+	__u8	lbmd_size;
+	/*
+	 * Size in bytes of the opaque block tag associated with each
+	 * interval
+	 */
+	__u8	lbmd_opaque_size;
+	/*
+	 * Offset in bytes of the opaque block tag within the logical block
+	 * metadata
+	 */
+	__u8	lbmd_opaque_offset;
+	/* Size in bytes of the T10 PI tuple associated with each interval */
+	__u8	lbmd_pi_size;
+	/* Offset in bytes of T10 PI tuple within the logical block metadata */
+	__u8	lbmd_pi_offset;
+	/* T10 PI guard tag type */
+	__u8	lbmd_guard_tag_type;
+	/* Size in bytes of the T10 PI application tag */
+	__u8	lbmd_app_tag_size;
+	/* Size in bytes of the T10 PI reference tag */
+	__u8	lbmd_ref_tag_size;
+	/* Size in bytes of the T10 PI storage tag */
+	__u8	lbmd_storage_tag_size;
+	__u8	pad;
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
@@ -148,6 +216,24 @@ struct fsxattr {
 	unsigned char	fsx_pad[8];
 };
 
+/*
+ * Variable size structure for file_[sg]et_attr().
+ *
+ * Note. This is alternative to the structure 'struct file_kattr'/'struct fsxattr'.
+ * As this structure is passed to/from userspace with its size, this can
+ * be versioned based on the size.
+ */
+struct file_attr {
+	__u64 fa_xflags;	/* xflags field value (get/set) */
+	__u32 fa_extsize;	/* extsize field value (get/set)*/
+	__u32 fa_nextents;	/* nextents field value (get)   */
+	__u32 fa_projid;	/* project identifier (get/set) */
+	__u32 fa_cowextsize;	/* CoW extsize field value (get/set) */
+};
+
+#define FILE_ATTR_SIZE_VER0 24
+#define FILE_ATTR_SIZE_LATEST FILE_ATTR_SIZE_VER0
+
 /*
  * Flags for the fsx_xflags field
  */
@@ -247,6 +333,8 @@ struct fsxattr {
  * also /sys/kernel/debug/ for filesystems with debugfs exports
  */
 #define FS_IOC_GETFSSYSFSPATH		_IOR(0x15, 1, struct fs_sysfs_path)
+/* Get logical block metadata capability details */
+#define FS_IOC_GETLBMD_CAP		_IOWR(0x15, 2, struct logical_block_metadata_cap)
 
 /*
  * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
-- 
2.25.1
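The struct file_attr comment above says the structure is versioned by
its size, and FILE_ATTR_SIZE_VER0 is given as 24. As a rough sanity
check, the sketch below mirrors the layout with stdint types standing
in for __u64/__u32. The _copy/_COPY names and the helper are
assumptions made for illustration, not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of the layout in the hunk above. */
struct file_attr_copy {
	uint64_t fa_xflags;	/* 8 bytes, offset 0 */
	uint32_t fa_extsize;	/* 4 bytes, offset 8 */
	uint32_t fa_nextents;	/* 4 bytes, offset 12 */
	uint32_t fa_projid;	/* 4 bytes, offset 16 */
	uint32_t fa_cowextsize;	/* 4 bytes, offset 20 */
};

#define FILE_ATTR_SIZE_VER0_COPY 24

/* 8 + 4 * 4 = 24 bytes with no padding, matching the VER0 constant. */
static_assert(sizeof(struct file_attr_copy) == FILE_ATTR_SIZE_VER0_COPY,
	      "layout should match the first published size");

/* Hypothetical helper: the size a caller would pass alongside the struct. */
static size_t file_attr_copy_size(void)
{
	return sizeof(struct file_attr_copy);
}
```

Under this scheme a later revision would presumably append fields and
define a larger size constant, leaving the first 24 bytes unchanged.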


From e7e79e99726190a5a83d158576cd448896d68102 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 11/14] tools headers: Sync uapi/linux/prctl.h with the kernel
 source

To pick up the changes in this cset:

  b1fabef37bd504f3 prctl: Introduce PR_MTE_STORE_ONLY
  a2fc422ed75748ee syscall_user_dispatch: Add PR_SYS_DISPATCH_INCLUSIVE_ON

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/prctl.h include/uapi/linux/prctl.h

Please see tools/include/uapi/README for further details.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/trace/beauty/include/uapi/linux/prctl.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index 3b93fb906e3c..ed3aed264aeb 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -244,6 +244,8 @@ struct prctl_mm_map {
 # define PR_MTE_TAG_MASK		(0xffffUL << PR_MTE_TAG_SHIFT)
 /* Unused; kept only for source compatibility */
 # define PR_MTE_TCF_SHIFT		1
+/* MTE tag check store only */
+# define PR_MTE_STORE_ONLY		(1UL << 19)
 /* RISC-V pointer masking tag length */
 # define PR_PMLEN_SHIFT			24
 # define PR_PMLEN_MASK			(0x7fUL << PR_PMLEN_SHIFT)
@@ -255,7 +257,12 @@ struct prctl_mm_map {
 /* Dispatch syscalls to a userspace handler */
 #define PR_SET_SYSCALL_USER_DISPATCH	59
 # define PR_SYS_DISPATCH_OFF		0
-# define PR_SYS_DISPATCH_ON		1
+/* Enable dispatch except for the specified range */
+# define PR_SYS_DISPATCH_EXCLUSIVE_ON	1
+/* Enable dispatch for the specified range */
+# define PR_SYS_DISPATCH_INCLUSIVE_ON	2
+/* Legacy name for backwards compatibility */
+# define PR_SYS_DISPATCH_ON		PR_SYS_DISPATCH_EXCLUSIVE_ON
 /* The control values for the user space selector when dispatch is enabled */
 # define SYSCALL_DISPATCH_FILTER_ALLOW	0
 # define SYSCALL_DISPATCH_FILTER_BLOCK	1
-- 
2.25.1


From f79a62f4b3c750759e60a402e8fe5180fc5771f0 Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Mon, 18 Aug 2025 10:32:18 -0700
Subject: [PATCH 12/14] tools headers: Sync uapi/linux/vhost.h with the kernel
 source

To pick up the changes in this cset:

  7d9896e9f6d02d8a vhost: Reintroduce kthread API and add mode selection
  333c515d189657c9 vhost-net: allow configuring extended features

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/vhost.h include/uapi/linux/vhost.h

Please see tools/include/uapi/README for further details.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux.dev
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 .../trace/beauty/include/uapi/linux/vhost.h   | 35 +++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/tools/perf/trace/beauty/include/uapi/linux/vhost.h b/tools/perf/trace/beauty/include/uapi/linux/vhost.h
index d4b3e2ae1314..c57674a6aa0d 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/vhost.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/vhost.h
@@ -235,4 +235,39 @@
  */
 #define VHOST_VDPA_GET_VRING_SIZE	_IOWR(VHOST_VIRTIO, 0x82,	\
 					      struct vhost_vring_state)
+
+/* Extended features manipulation */
+#define VHOST_GET_FEATURES_ARRAY _IOR(VHOST_VIRTIO, 0x83, \
+				       struct vhost_features_array)
+#define VHOST_SET_FEATURES_ARRAY _IOW(VHOST_VIRTIO, 0x83, \
+				       struct vhost_features_array)
+
+/* fork_owner values for vhost */
+#define VHOST_FORK_OWNER_KTHREAD 0
+#define VHOST_FORK_OWNER_TASK 1
+
+/**
+ * VHOST_SET_FORK_FROM_OWNER - Set the fork_owner flag for the vhost device,
+ * This ioctl must called before VHOST_SET_OWNER.
+ * Only available when CONFIG_VHOST_ENABLE_FORK_OWNER_CONTROL=y
+ *
+ * @param fork_owner: An 8-bit value that determines the vhost thread mode
+ *
+ * When fork_owner is set to VHOST_FORK_OWNER_TASK(default value):
+ *   - Vhost will create vhost worker as tasks forked from the owner,
+ *     inheriting all of the owner's attributes.
+ *
+ * When fork_owner is set to VHOST_FORK_OWNER_KTHREAD:
+ *   - Vhost will create vhost workers as kernel threads.
+ */
+#define VHOST_SET_FORK_FROM_OWNER _IOW(VHOST_VIRTIO, 0x84, __u8)
+
+/**
+ * VHOST_GET_FORK_OWNER - Get the current fork_owner flag for the vhost device.
+ * Only available when CONFIG_VHOST_ENABLE_FORK_OWNER_CONTROL=y
+ *
+ * @return: An 8-bit value indicating the current thread mode.
+ */
+#define VHOST_GET_FORK_FROM_OWNER _IOR(VHOST_VIRTIO, 0x85, __u8)
+
 #endif
-- 
2.25.1
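The new vhost ioctls above are built with the generic _IOW/_IOR
encoding macros, which pack direction, type, number and argument size
into one request code. The sketch below decodes two such codes. The
DEMO_* names are hypothetical stand-ins, and 0xAF is assumed to be the
value of VHOST_VIRTIO in the kernel header.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Assumed to match VHOST_VIRTIO; the DEMO_ prefix marks local names. */
#define DEMO_VHOST_VIRTIO		0xAF
#define DEMO_SET_FORK_FROM_OWNER	_IOW(DEMO_VHOST_VIRTIO, 0x84, __u8)
#define DEMO_GET_FORK_FROM_OWNER	_IOR(DEMO_VHOST_VIRTIO, 0x85, __u8)

/* The command number and argument size are recoverable from the code. */
static_assert(_IOC_NR(DEMO_SET_FORK_FROM_OWNER) == 0x84, "nr field");
static_assert(_IOC_SIZE(DEMO_SET_FORK_FROM_OWNER) == sizeof(__u8), "size field");

/* Hypothetical helper: extract the direction bits of a request code. */
static unsigned int demo_ioctl_dir(unsigned long req)
{
	return _IOC_DIR(req);
}
```

This is also why VHOST_GET_FEATURES_ARRAY and VHOST_SET_FEATURES_ARRAY
can share number 0x83: the direction bits differ, so the resulting
request codes are distinct.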


From ba0b7081f7a521d7c28b527a4f18666a148471e7 Mon Sep 17 00:00:00 2001
From: Ian Rogers <irogers@google.com>
Date: Fri, 22 Aug 2025 17:00:23 -0700
Subject: [PATCH 13/14] perf symbol-minimal: Fix ehdr reading in
 filename__read_build_id

The e_ident is part of the ehdr, so reading it a second time would
mean the ehdr that was read is displaced by 16 bytes. Switch from stdio
to open/read/lseek syscalls for similarity with the symbol-elf version
of the function and so that later changes can alter the open flags.

Fixes: fef8f648bb47 ("perf symbol: Fix use-after-free in filename__read_build_id")
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250823000024.724394-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/symbol-minimal.c | 55 ++++++++++++++++----------------
 1 file changed, 27 insertions(+), 28 deletions(-)

diff --git a/tools/perf/util/symbol-minimal.c b/tools/perf/util/symbol-minimal.c
index 7201494c5c20..8d41bd7842df 100644
--- a/tools/perf/util/symbol-minimal.c
+++ b/tools/perf/util/symbol-minimal.c
@@ -4,7 +4,6 @@
 
 #include <errno.h>
 #include <unistd.h>
-#include <stdio.h>
 #include <fcntl.h>
 #include <string.h>
 #include <stdlib.h>
@@ -88,11 +87,8 @@ int filename__read_debuglink(const char *filename __maybe_unused,
  */
 int filename__read_build_id(const char *filename, struct build_id *bid)
 {
-	FILE *fp;
-	int ret = -1;
+	int fd, ret = -1;
 	bool need_swap = false, elf32;
-	u8 e_ident[EI_NIDENT];
-	int i;
 	union {
 		struct {
 			Elf32_Ehdr ehdr32;
@@ -103,28 +99,27 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 			Elf64_Phdr *phdr64;
 		};
 	} hdrs;
-	void *phdr;
-	size_t phdr_size;
-	void *buf = NULL;
-	size_t buf_size = 0;
+	void *phdr, *buf = NULL;
+	ssize_t phdr_size, ehdr_size, buf_size = 0;
 
-	fp = fopen(filename, "r");
-	if (fp == NULL)
+	fd = open(filename, O_RDONLY);
+	if (fd < 0)
 		return -1;
 
-	if (fread(e_ident, sizeof(e_ident), 1, fp) != 1)
+	if (read(fd, hdrs.ehdr32.e_ident, EI_NIDENT) != EI_NIDENT)
 		goto out;
 
-	if (memcmp(e_ident, ELFMAG, SELFMAG) ||
-	    e_ident[EI_VERSION] != EV_CURRENT)
+	if (memcmp(hdrs.ehdr32.e_ident, ELFMAG, SELFMAG) ||
+	    hdrs.ehdr32.e_ident[EI_VERSION] != EV_CURRENT)
 		goto out;
 
-	need_swap = check_need_swap(e_ident[EI_DATA]);
-	elf32 = e_ident[EI_CLASS] == ELFCLASS32;
+	need_swap = check_need_swap(hdrs.ehdr32.e_ident[EI_DATA]);
+	elf32 = hdrs.ehdr32.e_ident[EI_CLASS] == ELFCLASS32;
+	ehdr_size = (elf32 ? sizeof(hdrs.ehdr32) : sizeof(hdrs.ehdr64)) - EI_NIDENT;
 
-	if (fread(elf32 ? (void *)&hdrs.ehdr32 : (void *)&hdrs.ehdr64,
-		  elf32 ? sizeof(hdrs.ehdr32) : sizeof(hdrs.ehdr64),
-		  1, fp) != 1)
+	if (read(fd,
+		 (elf32 ? (void *)&hdrs.ehdr32 : (void *)&hdrs.ehdr64) + EI_NIDENT,
+		 ehdr_size) != ehdr_size)
 		goto out;
 
 	if (need_swap) {
@@ -138,14 +133,18 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 			hdrs.ehdr64.e_phnum = bswap_16(hdrs.ehdr64.e_phnum);
 		}
 	}
-	phdr_size = elf32 ? hdrs.ehdr32.e_phentsize * hdrs.ehdr32.e_phnum
-			  : hdrs.ehdr64.e_phentsize * hdrs.ehdr64.e_phnum;
+	if ((elf32 && hdrs.ehdr32.e_phentsize != sizeof(Elf32_Phdr)) ||
+	    (!elf32 && hdrs.ehdr64.e_phentsize != sizeof(Elf64_Phdr)))
+		goto out;
+
+	phdr_size = elf32 ? sizeof(Elf32_Phdr) * hdrs.ehdr32.e_phnum
+			  : sizeof(Elf64_Phdr) * hdrs.ehdr64.e_phnum;
 	phdr = malloc(phdr_size);
 	if (phdr == NULL)
 		goto out;
 
-	fseek(fp, elf32 ? hdrs.ehdr32.e_phoff : hdrs.ehdr64.e_phoff, SEEK_SET);
-	if (fread(phdr, phdr_size, 1, fp) != 1)
+	lseek(fd, elf32 ? hdrs.ehdr32.e_phoff : hdrs.ehdr64.e_phoff, SEEK_SET);
+	if (read(fd, phdr, phdr_size) != phdr_size)
 		goto out_free;
 
 	if (elf32)
@@ -153,8 +152,8 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 	else
 		hdrs.phdr64 = phdr;
 
-	for (i = 0; i < elf32 ? hdrs.ehdr32.e_phnum : hdrs.ehdr64.e_phnum; i++) {
-		size_t p_filesz;
+	for (int i = 0; i < (elf32 ? hdrs.ehdr32.e_phnum : hdrs.ehdr64.e_phnum); i++) {
+		ssize_t p_filesz;
 
 		if (need_swap) {
 			if (elf32) {
@@ -180,8 +179,8 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 				goto out_free;
 			buf = tmp;
 		}
-		fseek(fp, elf32 ? hdrs.phdr32[i].p_offset : hdrs.phdr64[i].p_offset, SEEK_SET);
-		if (fread(buf, p_filesz, 1, fp) != 1)
+		lseek(fd, elf32 ? hdrs.phdr32[i].p_offset : hdrs.phdr64[i].p_offset, SEEK_SET);
+		if (read(fd, buf, p_filesz) != p_filesz)
 			goto out_free;
 
 		ret = read_build_id(buf, p_filesz, bid, need_swap);
@@ -194,7 +193,7 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 	free(buf);
 	free(phdr);
 out:
-	fclose(fp);
+	close(fd);
 	return ret;
 }
 
-- 
2.25.1
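The two-step header read described in the commit message can be
sketched in isolation as below. This is a simplified illustration, not
the perf code: read_ehdr64() is a hypothetical helper that handles only
ELFCLASS64 files in native byte order.

```c
#include <assert.h>
#include <elf.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static_assert(sizeof(Elf64_Ehdr) > EI_NIDENT, "e_ident is a prefix of the ehdr");

/* Hypothetical helper: returns 0 on success, -1 on any failure. */
static int read_ehdr64(const char *path, Elf64_Ehdr *ehdr)
{
	ssize_t rest = sizeof(*ehdr) - EI_NIDENT;
	int fd, ret = -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* First read: the 16 identification bytes go straight into the header. */
	if (read(fd, ehdr->e_ident, EI_NIDENT) != EI_NIDENT)
		goto out;
	if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) ||
	    ehdr->e_ident[EI_CLASS] != ELFCLASS64)
		goto out;

	/*
	 * Second read: the file offset is already past e_ident, so the
	 * remainder must land EI_NIDENT bytes into the struct. Reading a
	 * full header at offset 0 of the struct here would shift every
	 * field by 16 bytes.
	 */
	if (read(fd, (unsigned char *)ehdr + EI_NIDENT, rest) != rest)
		goto out;
	ret = 0;
out:
	close(fd);
	return ret;
}
```

On a 64-bit Linux system, read_ehdr64("/proc/self/exe", &ehdr) should
yield an e_phentsize equal to sizeof(Elf64_Phdr), which is the same
consistency check the patch adds before sizing the phdr buffer.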


From 2c369d91d0933aaff96b6b807b22363e6a38a625 Mon Sep 17 00:00:00 2001
From: Ian Rogers <irogers@google.com>
Date: Fri, 22 Aug 2025 17:00:24 -0700
Subject: [PATCH 14/14] perf symbol: Add blocking argument to
 filename__read_build_id

When build IDs are synthesized for mmap2 events, they are also added
for data mmaps if -d/--data is specified. Opening those files to read
their build IDs may block, causing perf to hang during synthesis. Some
existing callers of filename__read_build_id guard against this by
checking that the path refers to a regular file, which unfortunately
fails for symlinks. Rather than adding more is_regular_file calls,
switch filename__read_build_id to take a "block" argument and specify
O_NONBLOCK when it is false. The callers that previously checked
is_regular_file and the event synthesis callers now pass false,
thereby avoiding the hang.

Fixes: 53b00ff358dc ("perf record: Make --buildid-mmap the default")
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250823000024.724394-3-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/bench/inject-buildid.c  | 2 +-
 tools/perf/builtin-buildid-cache.c | 8 ++++----
 tools/perf/builtin-inject.c        | 4 ++--
 tools/perf/tests/sdt.c             | 2 +-
 tools/perf/util/build-id.c         | 4 ++--
 tools/perf/util/debuginfo.c        | 8 ++++++--
 tools/perf/util/dsos.c             | 4 ++--
 tools/perf/util/symbol-elf.c       | 9 +++++----
 tools/perf/util/symbol-minimal.c   | 6 +++---
 tools/perf/util/symbol.c           | 8 ++++----
 tools/perf/util/symbol.h           | 2 +-
 tools/perf/util/synthetic-events.c | 2 +-
 12 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/tools/perf/bench/inject-buildid.c b/tools/perf/bench/inject-buildid.c
index aad572a78d7f..12387ea88b9a 100644
--- a/tools/perf/bench/inject-buildid.c
+++ b/tools/perf/bench/inject-buildid.c
@@ -85,7 +85,7 @@ static int add_dso(const char *fpath, const struct stat *sb __maybe_unused,
 	if (typeflag == FTW_D || typeflag == FTW_SL)
 		return 0;
 
-	if (filename__read_build_id(fpath, &bid) < 0)
+	if (filename__read_build_id(fpath, &bid, /*block=*/true) < 0)
 		return 0;
 
 	dso->name = realpath(fpath, NULL);
diff --git a/tools/perf/builtin-buildid-cache.c b/tools/perf/builtin-buildid-cache.c
index c98104481c8a..2e0f2004696a 100644
--- a/tools/perf/builtin-buildid-cache.c
+++ b/tools/perf/builtin-buildid-cache.c
@@ -180,7 +180,7 @@ static int build_id_cache__add_file(const char *filename, struct nsinfo *nsi)
 	struct nscookie nsc;
 
 	nsinfo__mountns_enter(nsi, &nsc);
-	err = filename__read_build_id(filename, &bid);
+	err = filename__read_build_id(filename, &bid, /*block=*/true);
 	nsinfo__mountns_exit(&nsc);
 	if (err < 0) {
 		pr_debug("Couldn't read a build-id in %s\n", filename);
@@ -204,7 +204,7 @@ static int build_id_cache__remove_file(const char *filename, struct nsinfo *nsi)
 	int err;
 
 	nsinfo__mountns_enter(nsi, &nsc);
-	err = filename__read_build_id(filename, &bid);
+	err = filename__read_build_id(filename, &bid, /*block=*/true);
 	nsinfo__mountns_exit(&nsc);
 	if (err < 0) {
 		pr_debug("Couldn't read a build-id in %s\n", filename);
@@ -280,7 +280,7 @@ static bool dso__missing_buildid_cache(struct dso *dso, int parm __maybe_unused)
 	if (!dso__build_id_filename(dso, filename, sizeof(filename), false))
 		return true;
 
-	if (filename__read_build_id(filename, &bid) == -1) {
+	if (filename__read_build_id(filename, &bid, /*block=*/true) == -1) {
 		if (errno == ENOENT)
 			return false;
 
@@ -309,7 +309,7 @@ static int build_id_cache__update_file(const char *filename, struct nsinfo *nsi)
 	int err;
 
 	nsinfo__mountns_enter(nsi, &nsc);
-	err = filename__read_build_id(filename, &bid);
+	err = filename__read_build_id(filename, &bid, /*block=*/true);
 	nsinfo__mountns_exit(&nsc);
 	if (err < 0) {
 		pr_debug("Couldn't read a build-id in %s\n", filename);
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 40ba6a94f719..a114b3fa1bea 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -680,12 +680,12 @@ static int dso__read_build_id(struct dso *dso)
 
 	mutex_lock(dso__lock(dso));
 	nsinfo__mountns_enter(dso__nsinfo(dso), &nsc);
-	if (filename__read_build_id(dso__long_name(dso), &bid) > 0)
+	if (filename__read_build_id(dso__long_name(dso), &bid, /*block=*/true) > 0)
 		dso__set_build_id(dso, &bid);
 	else if (dso__nsinfo(dso)) {
 		char *new_name = dso__filename_with_chroot(dso, dso__long_name(dso));
 
-		if (new_name && filename__read_build_id(new_name, &bid) > 0)
+		if (new_name && filename__read_build_id(new_name, &bid, /*block=*/true) > 0)
 			dso__set_build_id(dso, &bid);
 		free(new_name);
 	}
diff --git a/tools/perf/tests/sdt.c b/tools/perf/tests/sdt.c
index 93baee2eae42..6132f1af3e22 100644
--- a/tools/perf/tests/sdt.c
+++ b/tools/perf/tests/sdt.c
@@ -31,7 +31,7 @@ static int build_id_cache__add_file(const char *filename)
 	struct build_id bid = { .size = 0, };
 	int err;
 
-	err = filename__read_build_id(filename, &bid);
+	err = filename__read_build_id(filename, &bid, /*block=*/true);
 	if (err < 0) {
 		pr_debug("Failed to read build id of %s\n", filename);
 		return err;
diff --git a/tools/perf/util/build-id.c b/tools/perf/util/build-id.c
index a7018a3b0437..bf7f3268b9a2 100644
--- a/tools/perf/util/build-id.c
+++ b/tools/perf/util/build-id.c
@@ -115,7 +115,7 @@ int filename__snprintf_build_id(const char *pathname, char *sbuild_id, size_t sb
 	struct build_id bid = { .size = 0, };
 	int ret;
 
-	ret = filename__read_build_id(pathname, &bid);
+	ret = filename__read_build_id(pathname, &bid, /*block=*/true);
 	if (ret < 0)
 		return ret;
 
@@ -841,7 +841,7 @@ static int filename__read_build_id_ns(const char *filename,
 	int ret;
 
 	nsinfo__mountns_enter(nsi, &nsc);
-	ret = filename__read_build_id(filename, bid);
+	ret = filename__read_build_id(filename, bid, /*block=*/true);
 	nsinfo__mountns_exit(&nsc);
 
 	return ret;
diff --git a/tools/perf/util/debuginfo.c b/tools/perf/util/debuginfo.c
index a44c70f93156..bb9ebd84ec2d 100644
--- a/tools/perf/util/debuginfo.c
+++ b/tools/perf/util/debuginfo.c
@@ -110,8 +110,12 @@ struct debuginfo *debuginfo__new(const char *path)
 	if (!dso)
 		goto out;
 
-	/* Set the build id for DSO_BINARY_TYPE__BUILDID_DEBUGINFO */
-	if (is_regular_file(path) && filename__read_build_id(path, &bid) > 0)
+	/*
+	 * Set the build id for DSO_BINARY_TYPE__BUILDID_DEBUGINFO. Don't block
+	 * incase the path isn't for a regular file.
+	 */
+	assert(!dso__has_build_id(dso));
+	if (filename__read_build_id(path, &bid, /*block=*/false) > 0)
 		dso__set_build_id(dso, &bid);
 
 	for (type = distro_dwarf_types;
diff --git a/tools/perf/util/dsos.c b/tools/perf/util/dsos.c
index 0a7645c7fae7..64c1d65b0149 100644
--- a/tools/perf/util/dsos.c
+++ b/tools/perf/util/dsos.c
@@ -81,13 +81,13 @@ static int dsos__read_build_ids_cb(struct dso *dso, void *data)
 		return 0;
 	}
 	nsinfo__mountns_enter(dso__nsinfo(dso), &nsc);
-	if (filename__read_build_id(dso__long_name(dso), &bid) > 0) {
+	if (filename__read_build_id(dso__long_name(dso), &bid, /*block=*/true) > 0) {
 		dso__set_build_id(dso, &bid);
 		args->have_build_id = true;
 	} else if (errno == ENOENT && dso__nsinfo(dso)) {
 		char *new_name = dso__filename_with_chroot(dso, dso__long_name(dso));
 
-		if (new_name && filename__read_build_id(new_name, &bid) > 0) {
+		if (new_name && filename__read_build_id(new_name, &bid, /*block=*/true) > 0) {
 			dso__set_build_id(dso, &bid);
 			args->have_build_id = true;
 		}
diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c
index 6d2c280a1730..033c79231a54 100644
--- a/tools/perf/util/symbol-elf.c
+++ b/tools/perf/util/symbol-elf.c
@@ -902,7 +902,7 @@ static int read_build_id(const char *filename, struct build_id *bid)
 
 #else // HAVE_LIBBFD_BUILDID_SUPPORT
 
-static int read_build_id(const char *filename, struct build_id *bid)
+static int read_build_id(const char *filename, struct build_id *bid, bool block)
 {
 	size_t size = sizeof(bid->data);
 	int fd, err = -1;
@@ -911,7 +911,7 @@ static int read_build_id(const char *filename, struct build_id *bid)
 	if (size < BUILD_ID_SIZE)
 		goto out;
 
-	fd = open(filename, O_RDONLY);
+	fd = open(filename, block ? O_RDONLY : (O_RDONLY | O_NONBLOCK));
 	if (fd < 0)
 		goto out;
 
@@ -934,7 +934,7 @@ static int read_build_id(const char *filename, struct build_id *bid)
 
 #endif // HAVE_LIBBFD_BUILDID_SUPPORT
 
-int filename__read_build_id(const char *filename, struct build_id *bid)
+int filename__read_build_id(const char *filename, struct build_id *bid, bool block)
 {
 	struct kmod_path m = { .name = NULL, };
 	char path[PATH_MAX];
@@ -958,9 +958,10 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 		}
 		close(fd);
 		filename = path;
+		block = true;
 	}
 
-	err = read_build_id(filename, bid);
+	err = read_build_id(filename, bid, block);
 
 	if (m.comp)
 		unlink(filename);
diff --git a/tools/perf/util/symbol-minimal.c b/tools/perf/util/symbol-minimal.c
index 8d41bd7842df..41e4ebe5eac5 100644
--- a/tools/perf/util/symbol-minimal.c
+++ b/tools/perf/util/symbol-minimal.c
@@ -85,7 +85,7 @@ int filename__read_debuglink(const char *filename __maybe_unused,
 /*
  * Just try PT_NOTE header otherwise fails
  */
-int filename__read_build_id(const char *filename, struct build_id *bid)
+int filename__read_build_id(const char *filename, struct build_id *bid, bool block)
 {
 	int fd, ret = -1;
 	bool need_swap = false, elf32;
@@ -102,7 +102,7 @@ int filename__read_build_id(const char *filename, struct build_id *bid)
 	void *phdr, *buf = NULL;
 	ssize_t phdr_size, ehdr_size, buf_size = 0;
 
-	fd = open(filename, O_RDONLY);
+	fd = open(filename, block ? O_RDONLY : (O_RDONLY | O_NONBLOCK));
 	if (fd < 0)
 		return -1;
 
@@ -323,7 +323,7 @@ int dso__load_sym(struct dso *dso, struct map *map __maybe_unused,
 	if (ret >= 0)
 		RC_CHK_ACCESS(dso)->is_64_bit = ret;
 
-	if (filename__read_build_id(ss->name, &bid) > 0)
+	if (filename__read_build_id(ss->name, &bid, /*block=*/true) > 0)
 		dso__set_build_id(dso, &bid);
 	return 0;
 }
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index e816e4220d33..3fed54de5401 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -1869,14 +1869,14 @@ int dso__load(struct dso *dso, struct map *map)
 
 	/*
 	 * Read the build id if possible. This is required for
-	 * DSO_BINARY_TYPE__BUILDID_DEBUGINFO to work
+	 * DSO_BINARY_TYPE__BUILDID_DEBUGINFO to work. Don't block in case path
+	 * isn't for a regular file.
 	 */
-	if (!dso__has_build_id(dso) &&
-	    is_regular_file(dso__long_name(dso))) {
+	if (!dso__has_build_id(dso)) {
 		struct build_id bid = { .size = 0, };
 
 		__symbol__join_symfs(name, PATH_MAX, dso__long_name(dso));
-		if (filename__read_build_id(name, &bid) > 0)
+		if (filename__read_build_id(name, &bid, /*block=*/false) > 0)
 			dso__set_build_id(dso, &bid);
 	}
 
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 3fb5d146d9b1..347106218799 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -140,7 +140,7 @@ struct symbol *dso__next_symbol(struct symbol *sym);
 
 enum dso_type dso__type_fd(int fd);
 
-int filename__read_build_id(const char *filename, struct build_id *id);
+int filename__read_build_id(const char *filename, struct build_id *id, bool block);
 int sysfs__read_build_id(const char *filename, struct build_id *bid);
 int modules__parse(const char *filename, void *arg,
 		   int (*process_module)(void *arg, const char *name,
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index cb2c1ace304a..fcd1fd13c30e 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -401,7 +401,7 @@ static void perf_record_mmap2__read_build_id(struct perf_record_mmap2 *event,
 	nsi = nsinfo__new(event->pid);
 	nsinfo__mountns_enter(nsi, &nc);
 
-	rc = filename__read_build_id(event->filename, &bid) > 0 ? 0 : -1;
+	rc = filename__read_build_id(event->filename, &bid, /*block=*/false) > 0 ? 0 : -1;
 
 	nsinfo__mountns_exit(&nc);
 	nsinfo__put(nsi);
-- 
2.25.1
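The hang that the "block" argument avoids can be reproduced with a
FIFO: a plain O_RDONLY open waits for a writer, whereas
O_RDONLY | O_NONBLOCK returns immediately. The sketch below shows the
non-blocking case only; open_for_read() and fifo_open_does_not_hang()
are hypothetical names, and the /tmp path is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static_assert(O_NONBLOCK != 0, "O_NONBLOCK must be a real flag");

/* Same flag selection as in the patch above. */
static int open_for_read(const char *path, bool block)
{
	return open(path, block ? O_RDONLY : (O_RDONLY | O_NONBLOCK));
}

/*
 * Hypothetical self-check: create a FIFO with no writer and open it
 * without blocking. Returns 1 if the open succeeded, 0 if it failed,
 * -1 if the FIFO could not be set up.
 */
static int fifo_open_does_not_hang(void)
{
	char dir[] = "/tmp/nonblock-demo-XXXXXX";
	char path[64];
	int fd, ok;

	if (!mkdtemp(dir))
		return -1;
	snprintf(path, sizeof(path), "%s/fifo", dir);
	if (mkfifo(path, 0600) < 0) {
		rmdir(dir);
		return -1;
	}

	/* With block=true this call would wait until a writer appears. */
	fd = open_for_read(path, false);
	ok = fd >= 0;
	if (ok)
		close(fd);

	unlink(path);
	rmdir(dir);
	return ok;
}
```

A subsequent read() on such a descriptor returns 0 while no writer is
attached, so a build ID reader would see a short read and fail rather
than stall.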


^ permalink raw reply related

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Mike Rapoport @ 2025-08-28  7:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162019.GD2130239@nvidia.com>

On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> 
> > +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > +				       (void **)&preserved_folios);
> > +	if (err) {
> > +		pr_err("Failed to reserve folios property in FDT: %s\n",
> > +		       fdt_strerror(err));
> > +		err = -ENOMEM;
> > +		goto err_free_fdt;
> > +	}
> 
> Yuk.
> 
> This really wants some luo helper
> 
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'
> 
> Which would get a linearized list of pages in the vmap to hold the
> array and then allocate some structure to record the page list and
> return back the u64 of the phys_addr of the top of the structure to
> store in whatever.
> 
> Getting fdt to allocate the array inside the fdt is just not going to
> work for anything of size.

I agree that we need a side-car structure for preserving large (potentially
sparse) arrays, but I think it should be a part of KHO rather than LUO.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Aleksa Sarai @ 2025-08-28  0:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes,
	Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn,
	Jeff Xu, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <CALCETrWHKga33bvzUHnd-mRQUeNXTtXSS8Y8+40d5bxv-CqBhw@mail.gmail.com>

On 2025-08-27, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > that this is work that needs to be done in userspace. Change the apps,
> > > > stop pushing more and more cruft into the VFS that has no business
> > > > there.
> > >
> > > It would be interesting to know how to patch user space to get the same
> > > guarantees...  Do you think I would propose a kernel patch otherwise?
> >
> > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > nice but IIRC there are ways to get around it anyway).
> 
> Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> affecting the file, but I don't think that writes to the file will
> break the MAP_PRIVATE CoW if it's not already broken.

Oh I guess you're right -- that's news to me. And from mmap(2):

> MAP_PRIVATE
> [...] It is unspecified whether changes made to the file after the
> mmap() call are visible in the mapped region.

But then what is the protection mechanism (in the absence of -ETXTBSY)
that stops you from overwriting the live text of a binary by just
writing to it?

I would need to go trawling through my old scripts to find the
reproducer that let you get around -ETXTBSY (I think it involved
executable memfds) but I distinctly remember that even if you overwrote
the binary, the live process's mapped memory would not change.
(Ditto for the few kernel releases in which -ETXTBSY was removed.) I
found this surprising, but assumed that it was because of MAP_PRIVATE.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-08-28  0:32 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <2025-08-27-obscene-great-toy-diary-X1gVRV@cyphar.com>

On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > Nothing has changed in that regard and I'm not interested in stuffing
> > > the VFS APIs full of special-purpose behavior to work around the fact
> > > that this is work that needs to be done in userspace. Change the apps,
> > > stop pushing more and more cruft into the VFS that has no business
> > > there.
> >
> > It would be interesting to know how to patch user space to get the same
> > guarantees...  Do you think I would propose a kernel patch otherwise?
>
> You could mmap the script file with MAP_PRIVATE. This is the *actual*
> protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> nice but IIRC there are ways to get around it anyway).

Wait, really?  MAP_PRIVATE prevents writes to the mapping from
affecting the file, but I don't think that writes to the file will
break the MAP_PRIVATE CoW if it's not already broken.

IPython says:

In [1]: import mmap, tempfile

In [2]: f = tempfile.TemporaryFile()

In [3]: f.write(b'initial contents')
Out[3]: 16

In [4]: f.flush()

In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
   ...:                 prot=mmap.PROT_READ)

In [6]: map[:]
Out[6]: b'initial contents'

In [7]: f.seek(0)
Out[7]: 0

In [8]: f.write(b'changed')
Out[8]: 7

In [9]: f.flush()

In [10]: map[:]
Out[10]: b'changed contents'
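
The same experiment as a standalone script, for anyone who wants to rerun
it outside IPython. This is only a sketch: the function name is made up
here, and the comparison against a plain in-memory copy is an addition
that is not part of the session above. Per mmap(2), whether later file
writes show up through a MAP_PRIVATE mapping is unspecified, so the script
reports what it observes rather than assuming a result.

```python
import mmap
import tempfile


def private_mapping_vs_copy():
    """Return (mapping_bytes, copy_bytes) as seen after the file is rewritten."""
    with tempfile.TemporaryFile() as f:
        f.write(b"initial contents")
        f.flush()
        size = f.tell()

        # Read-only MAP_PRIVATE mapping. Nothing ever writes to these pages,
        # so copy-on-write is never triggered for them.
        view = mmap.mmap(f.fileno(), size,
                         flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
        # Ordinary in-memory copy, fully detached from the file.
        snapshot = bytes(view[:])

        # Overwrite the start of the file through the file descriptor.
        f.seek(0)
        f.write(b"changed")
        f.flush()

        observed = bytes(view[:])
        view.close()
        return observed, snapshot


if __name__ == "__main__":
    observed, snapshot = private_mapping_vs_copy()
    print("mapping :", observed)
    print("snapshot:", snapshot)
```

Only the bytes() copy is guaranteed to keep the original contents; what
the mapping shows depends on the kernel and filesystem.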

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Aleksa Sarai @ 2025-08-28  0:14 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826.aig5aiShunga@digikod.net>

On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > Nothing has changed in that regard and I'm not interested in stuffing
> > the VFS APIs full of special-purpose behavior to work around the fact
> > that this is work that needs to be done in userspace. Change the apps,
> > stop pushing more and more cruft into the VFS that has no business
> > there.
> 
> It would be interesting to know how to patch user space to get the same
> guarantees...  Do you think I would propose a kernel patch otherwise?

You could mmap the script file with MAP_PRIVATE. This is the *actual*
protection the kernel uses against overwriting binaries (yes, ETXTBSY is
nice but IIRC there are ways to get around it anyway). Of course, most
interpreters don't mmap their scripts, but this is a potential solution.
If the security policy is based on validating the script text in some
way, this avoids the TOCTOU.

Now, in cases where you have IMA or something and you only permit signed
binaries to execute, you could argue there is a different race here (an
attacker creates a malicious script, runs it, and then replaces it with
a valid script's contents and metadata after the fact to get
AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
this is even possible with IMA (can an unprivileged user even set
security.ima?). But even then, I would expect that users who really need
this would probably also use fs-verity or dm-verity, which would block
this kind of attack since they render the files read-only anyway.

This is why a more detailed threat model of what kinds of attacks are
relevant is useful. I was there for the talk you gave and subsequent
discussion at last year's LPC, but I felt that your threat model was
not really fleshed out at all. I am still not sure what capabilities you
expect the attacker to have nor what is being used to authenticate
binaries (other than AT_EXECVE_CHECK). Maybe I'm wrong with my above
assumptions, but I can't know without knowing what threat model you have
in mind, *in detail*.

For example, if you are dealing with an attacker that has CAP_SYS_ADMIN,
there are plenty of ways for an attacker to execute their own code
without using interpreters (create a new tmpfs with fsopen(2) for
instance). Executable memfds are even easier and don't require
privileges on most systems (yes, you can block them with vm.memfd_noexec
but CAP_SYS_ADMIN can disable that -- and there's always fsopen(2) or
mount(2)).

(As an aside, it's a shame that AT_EXECVE_CHECK burned one of the
top-level AT_* bits for a per-syscall flag -- the block comment I added
in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be
allocated") was meant to avoid this happening but it seems you and the
reviewers missed that...)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply

* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Paul Eggert @ 2025-08-27 23:19 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Adhemerval Zanella Netto, Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <2025-08-27-perky-glossy-dam-spindle-kPpnnk@cyphar.com>

On 2025-08-27 15:48, Aleksa Sarai wrote:
> On 2025-08-27, Paul Eggert <eggert@cs.ucla.edu> wrote:
>> What specific scenario would make the "give me supported flags" flag worth
>> the hassle of supporting and documenting and testing such a flag?
> 
> "Just try it" leads to programs that have to test dozens of flag
> combinations for syscalls at startup,

Although that sort of thing can indeed be a problem in general, I don't 
see how it's a problem for openat2 in particular.

The issue here is whether openat2's API should reflect current behavior 
(where the HOW argument is pointer-to-const) or a potential future 
behavior (where the kernel might modify the struct that HOW points to, 
if some hypothetical future flag is set in that struct). I am skeptical 
that this hypothetical situation is so plausible that it justifies the 
maintenance hassle of a glibc API that doesn't correspond to how openat2 
currently behaves.
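
For readers following the argument about the struct that HOW points to,
here is a minimal sketch of the original three-field struct open_how as
seen from userspace. openat2(2) takes this struct together with an
explicit size argument. The class and helper names below are made up for
illustration, and the sketch only builds the struct; it does not make the
system call.

```python
import ctypes


class OpenHow(ctypes.Structure):
    """Userspace mirror of the original three-field struct open_how."""

    _fields_ = [
        ("flags", ctypes.c_uint64),    # O_* flags
        ("mode", ctypes.c_uint64),     # file mode for newly created files
        ("resolve", ctypes.c_uint64),  # RESOLVE_* flags
    ]


def how_and_size(flags=0, mode=0, resolve=0):
    """Build the struct plus the size value that would be passed with it."""
    how = OpenHow(flags, mode, resolve)
    return how, ctypes.sizeof(how)


if __name__ == "__main__":
    how, size = how_and_size()
    print("size passed alongside the struct:", size)
```

A program built against this layout always passes the size of these three
fields, which is why it could not use a field appended later without
being rebuilt.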

> A simple example would be mounts -- if MOUNT_BENEATH is not supported

I don't understand this example. Are you talking about <linux/mount.h>'s 
MOVE_MOUNT_BENEATH? That's a move_mount flag, and I don't see what that 
has to do with openat2. Or are you saying that openat2 might not support 
<linux/openat2.h>'s RESOLVE_BENEATH flag? Under what conditions might 
that be, exactly? Can you give some plausible user code to illustrate 
the openat2 example you're thinking of?

I still fail to understand how a hypothetical "give me the supported 
flags" openat2 flag would be useful enough to justify complicating the 
openat2 API today.

^ permalink raw reply

* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Aleksa Sarai @ 2025-08-27 22:48 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Adhemerval Zanella Netto, Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <5c9fa556-da00-4b76-8a70-8e2d1dddd92d@cs.ucla.edu>

On 2025-08-27, Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 2025-08-26 22:58, Aleksa Sarai wrote:
> > Personally I think both approaches are less than ideal, and having rich
> > feature flags for the entire system would be better but I don't think
> > this is something that would be feasible to apply to everything in the
> > entire kernel.
> 
> Agreed. But I'm not seeing how a hypothetical "give me the supported flags"
> flag would be useful enough to justify the flag.
> 
> I'm looking at this from the user point of view, and it is not ringing a
> bell for me. Granted, the current "try the flag combination you want and see
> whether it works" is not ideal, but it's accurate (which is not always true
> for "give me the supported flags" flag) and you need to do it anyway
> (because the "give me the supported flags" flag is inherently inaccurate),
> so why bother with a "give me the supported flags" flag?
> 
> Here's an example. Suppose we want to extend openat2 so that it also does
> the equivalent of statx atomically with the open, to avoid some races with
> the current openat/fstat pair of system calls. Under the approach you're
> proposing, I suppose we could extend struct open_how so that it has a new
> struct statx member, add new flags to be put into struct open_how's flags
> member, and programs would be able to query the new flags via a "give me the
> supported flags" call.
> 
> But in this scenario, the "give me the supported flags" flag is useless. If
> I'm an old program I can't use the new flags even if I detect them because
> my struct open_how is too small. And if I'm a new program I can simply use
> the new flags - and even if I tested for the new flags (with the "give me
> the supported flags" flag) I'd have to test the result anyway because
> perhaps the new flags are not supported for this particular flag combination
> or file.
> 
> What specific scenario would make the "give me supported flags" flag worth
> the hassle of supporting and documenting and testing such a flag?

"Just try it" leads to programs that have to test dozens of flag
combinations for syscalls at startup, and for many syscalls you cannot
"just try" whether the new flag works (think of a new shutdown(2) flag,
or most clone3(2) flags). What you end up having to do is create an
elaborate setup where if the flag works you get an error (but not
-EINVAL!) so that you can be fairly confident that you didn't modify the
system when doing the check. As someone who has to write this
boilerplate whenever I need to use most system calls, this really
**really** sucks. In some cases you can just try it and then fallback
(caching whether it was supported), but in a lot of programs it is
preferable to know well in advance whether a feature is supported.
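
The "try it, then fall back and cache the answer" pattern can be sketched
roughly as below. Everything here is hypothetical: the helper name, the
cache, and above all the assumption that an unsupported flag always
surfaces as EINVAL (or ENOSYS for a missing syscall) while any other
error means the flag was understood. Real syscalls also return EINVAL for
other reasons, which is part of why this approach is fragile.

```python
import errno

# Cache of probe results, keyed by a caller-chosen feature name.
_probe_cache = {}


def feature_supported(name, attempt):
    """Run attempt() at most once per name and remember the verdict.

    attempt is a zero-argument callable that issues the real call.
    Assumption: EINVAL or ENOSYS means "not supported"; success or any
    other OSError means the kernel recognised the flag.
    """
    if name in _probe_cache:
        return _probe_cache[name]
    try:
        attempt()
    except OSError as exc:
        supported = exc.errno not in (errno.EINVAL, errno.ENOSYS)
    else:
        supported = True
    _probe_cache[name] = supported
    return supported
```

The probe still has to be safe to run, so it does nothing for calls like
shutdown(2) where a successful attempt changes the system.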

A simple example would be mounts -- if MOUNT_BENEATH is not supported
then you need to structure how you construct your mount tree differently
to try to emulate the same behaviour. This means that not knowing if
MOUNT_BENEATH is supported upfront causes you to redo a lot of work in
the fallback case. If changing id-mappings for mounts hadn't required
adding a new syscall, this would've also been an issue for programs that
needed to change the ID-mappings of mounts.

Some kind of "just tell me what flags are supported" mechanism avoids
this problem by telling you in one shot what features are supported (so
newer programs can take advantage of them). Most systems that expect to
be extended over time have something like this, but it's usually in the
form of string-based feature names (/sys/kernel/cgroup/features, for
instance). I wouldn't be against such an idea (if we could actually
guarantee that everyone actually used it), but something similar was
proposed back in 2020 and never happened -- CHECK_FIELDS is a very
simple solution to the problem that works for the most common case and
can be implemented per-syscall.
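
As a rough illustration of the string-based style, a program can read a
feature file once and then test membership. The helper name is invented
for this sketch, and it assumes the file lists one feature name per line;
a missing or unreadable file is treated as "no features known".

```python
from pathlib import Path


def read_feature_names(path="/sys/kernel/cgroup/features"):
    """Return the feature names listed in path as a frozenset.

    Assumes one name per line. If the file cannot be read, an empty set
    is returned so callers fall back to the conservative code path.
    """
    try:
        text = Path(path).read_text()
    except OSError:
        return frozenset()
    return frozenset(line.strip() for line in text.splitlines() if line.strip())


if __name__ == "__main__":
    features = read_feature_names()
    print(sorted(features))
```

One read answers every "is X supported?" question up front, with no
trial calls needed.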

I've added linux-api to Cc, as I'm sure there are plenty of other ideas
on how to solve this.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply
