Generic Linux architectural discussions

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Christian Brauner @ 2026-06-10  7:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Li Chen, Kees Cook, Alexander Viro, linux-fsdevel, linux-api,
	linux-kernel, linux-mm, linux-arch, linux-doc, linux-kselftest,
	x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <CALCETrWJQpLR4n1cpichBk8=uExSKLWTMGU3BufGdk_WE_p5UA@mail.gmail.com>

On Mon, Jun 08, 2026 at 05:01:57PM -0700, Andy Lutomirski wrote:
> On Thu, May 28, 2026 at 4:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > > Hi,
> > >
> > > This is an early RFC for an idea that is probably still rough in both the
> > > UAPI and implementation details. Sorry for the rough edges; I am sending
> > > it now to check whether this direction is worth pursuing and to get
> > > feedback on the kernel/userspace boundary.
> >
> > The idea of having a builder api for exec isn't all that crazy. But it
> > should simply be built on top of pidfds and thus pidfs itself instead.
> > It has all the basic infrastructure in place already. Any implementation
> > should also allow userspace to implement posix_spawn() on top of it.
> >
> > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
> >
> > pidfd_config(fd, ...) // modeled similar to fsconfig()
> >
> 
> After contemplating this for a bit... why pidfd?  Doesn't a pidfd
> refer to an actual process that is, or at least was, running?  This
> new thing is a process that we are contemplating spawning.  I can
> imagine that basically all pidfd APIs would be a bit confused by the
> nonexistence of the process in question.

I don't think that would be a problem because every API just needs to
handle ESRCH. Ignoring that for a second: the mount API has a builder fd
(from fsopen()) that fsmount() later turns into a mount fd. The same is
easily doable here. My point is that all the infrastructure building
blocks already exist in pidfs.


* [PATCH bpf-next] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: Gabriele Monaco @ 2026-06-10  9:04 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Arnd Bergmann, bpf,
	linux-arch, linux-kernel
  Cc: Gabriele Monaco, stable, Waiman Long

raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
restores interrupts. This means preemption is enabled (as part of
raw_res_spin_unlock()) while interrupts are still disabled, so it cannot
trigger an actual preemption.
This is inconsistent with other spinlock implementations
(raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
itself).

Adjust the macro to ensure interrupts are enabled before enabling
preemption, allowing a reschedule at that point. Make the same
modification in the error path of raw_res_spin_lock_irqsave().

Fixes: 101acd2e78b1 ("rqspinlock: Add macros for rqspinlock usage")
Cc: stable@vger.kernel.org
Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
New submission of [1]

[1] - https://lore.kernel.org/lkml/20260609094941.56122-1-gmonaco@redhat.com
---
 include/asm-generic/rqspinlock.h | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index 151d267a49..4d46643f46 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -243,12 +243,20 @@ static __always_inline void res_spin_unlock(rqspinlock_t *lock)
 	({                                        \
 		int __ret;                        \
 		local_irq_save(flags);            \
-		__ret = raw_res_spin_lock(lock);  \
-		if (__ret)                        \
+		preempt_disable();                \
+		__ret = res_spin_lock(lock);      \
+		if (__ret) {                      \
 			local_irq_restore(flags); \
+			preempt_enable();         \
+		}                                 \
 		__ret;                            \
 	})
 
-#define raw_res_spin_unlock_irqrestore(lock, flags) ({ raw_res_spin_unlock(lock); local_irq_restore(flags); })
+#define raw_res_spin_unlock_irqrestore(lock, flags) \
+	({                                          \
+		res_spin_unlock(lock);              \
+		local_irq_restore(flags);           \
+		preempt_enable();                   \
+	})
 
 #endif /* __ASM_GENERIC_RQSPINLOCK_H */

base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.54.0
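
For readers following the ordering argument in the commit message, here is a toy userspace model of it. This is not kernel code: struct cpu_model and the model_* helpers are hypothetical names invented for illustration, and the only behaviour assumed is that a reschedule on preempt_enable() happens only when the preempt count drops to zero while interrupts are enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code: all names below are hypothetical. */
struct cpu_model {
	int preempt_count;   /* > 0 means preemption is disabled */
	bool irqs_enabled;   /* models the local interrupt flag */
	bool need_resched;   /* a reschedule was requested while locked */
	int schedule_calls;  /* how many times the model "scheduled" */
};

static void model_preempt_enable(struct cpu_model *cpu)
{
	assert(cpu->preempt_count > 0);
	cpu->preempt_count--;
	/* Only reschedule when preemptible: count is zero and IRQs are on. */
	if (cpu->preempt_count == 0 && cpu->irqs_enabled && cpu->need_resched) {
		cpu->need_resched = false;
		cpu->schedule_calls++;
	}
}

static void model_irq_restore(struct cpu_model *cpu)
{
	cpu->irqs_enabled = true;
}

/* Old order: unlock (which enables preemption) first, then restore IRQs. */
int unlock_old_order(struct cpu_model *cpu)
{
	model_preempt_enable(cpu);
	model_irq_restore(cpu);
	return cpu->schedule_calls;
}

/* Order after the patch: restore IRQs first, then enable preemption. */
int unlock_new_order(struct cpu_model *cpu)
{
	model_irq_restore(cpu);
	model_preempt_enable(cpu);
	return cpu->schedule_calls;
}

/* A CPU inside the critical section with a reschedule already pending. */
struct cpu_model locked_cpu(void)
{
	struct cpu_model cpu = {
		.preempt_count = 1,
		.irqs_enabled = false,
		.need_resched = true,
		.schedule_calls = 0,
	};
	return cpu;
}
```

Starting from locked_cpu(), unlock_old_order() leaves the pending reschedule untouched, while unlock_new_order() acts on it at the unlock itself. In the model, as in the commit message, the old order only defers the preemption point; it does not lose the request.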



* Re: [PATCH bpf-next] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: bot+bpf-ci @ 2026-06-10  9:43 UTC (permalink / raw)
  To: gmonaco, ast, daniel, andrii, eddyz87, memxor, arnd, bpf,
	linux-arch, linux-kernel
  Cc: gmonaco, stable, longman, ast, andrii, daniel, martin.lau,
	eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260610090431.32427-1-gmonaco@redhat.com>

> diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
> index 151d267a496b..4d46643f46cb 100644
> --- a/include/asm-generic/rqspinlock.h
> +++ b/include/asm-generic/rqspinlock.h
> @@ -243,12 +243,20 @@ static __always_inline void res_spin_unlock(rqspinlock_t *lock)
>  	({                                        \
>  		int __ret;                        \
>  		local_irq_save(flags);            \
> -		__ret = raw_res_spin_lock(lock);  \
> -		if (__ret)                        \
> +		preempt_disable();                \
> +		__ret = res_spin_lock(lock);      \
> +		if (__ret) {                      \
>  			local_irq_restore(flags); \
> +			preempt_enable();         \
> +		}                                 \
>  		__ret;                            \
>  	})
>
> -#define raw_res_spin_unlock_irqrestore(lock, flags) ({ raw_res_spin_unlock(lock); local_irq_restore(flags); })
> +#define raw_res_spin_unlock_irqrestore(lock, flags) \
> +	({                                          \
> +		res_spin_unlock(lock);              \
> +		local_irq_restore(flags);           \
> +		preempt_enable();                   \
> +	})
>
>  #endif /* __ASM_GENERIC_RQSPINLOCK_H */

Related issue in kernel/locking/locktorture.c

During the v2 review discussion, the Sashiko AI bot identified a
pre-existing bug in the locktorture module: the return value of
raw_res_spin_lock() is ignored in torture_raw_res_spin_write_lock():

    static int torture_raw_res_spin_write_lock(int tid __maybe_unused)
    {
        raw_res_spin_lock(&rqspinlock);
        return 0;
    }

Since these functions can fail with -EDEADLK or -ETIMEDOUT, ignoring
the return value allows the caller to proceed into the critical section
without holding the lock. When the torture thread later calls
raw_res_spin_unlock(), it will cause an unbalanced preempt_enable()
because the failed lock acquisition path already restored preemption.

Furthermore, forcibly clearing the lock state could unlock a lock
legitimately held by another thread, potentially causing lock state
corruption.

The same issue exists in torture_raw_res_spin_write_lock_irq() which
also ignores the return value from raw_res_spin_lock_irqsave().

Should these torture functions check the return values and handle lock
acquisition failures appropriately?

Reference: https://lore.kernel.org/all/20260610091602.318C01F00893@smtp.kernel.org/
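
A minimal userspace sketch of the concern above, under stated assumptions: struct model_lock and every model_*/write_lock_* name are hypothetical stand-ins rather than the locktorture or rqspinlock API, and whether the locktorture framework acts on a non-zero return from its write-lock callback is not established in this thread.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for rqspinlock_t; this is a userspace model only. */
struct model_lock {
	int held;       /* 1 while some owner holds the lock */
	int fail_with;  /* if non-zero, acquisition fails with this error */
};

/* Models raw_res_spin_lock(): returns 0 on success or a negative errno. */
static int model_res_spin_lock(struct model_lock *lock)
{
	if (lock->fail_with)
		return lock->fail_with;  /* e.g. -EDEADLK or -ETIMEDOUT */
	lock->held = 1;
	return 0;
}

static void model_res_spin_unlock(struct model_lock *lock)
{
	assert(lock->held);  /* unlocking a lock we never took is a bug */
	lock->held = 0;
}

/* Pattern flagged in the review: the result is dropped, caller sees 0. */
int write_lock_ignoring_result(struct model_lock *lock)
{
	(void)model_res_spin_lock(lock);
	return 0;
}

/* One possible fix: propagate the error so the caller can back off. */
int write_lock_checked(struct model_lock *lock)
{
	return model_res_spin_lock(lock);
}

/*
 * Runs one torture-style iteration.
 * Returns 1 if the critical section ran with the lock held, 0 if the
 * acquisition failed and was skipped, -1 if the critical section was
 * entered without the lock.
 */
int run_iteration(struct model_lock *lock,
		  int (*write_lock)(struct model_lock *))
{
	if (write_lock(lock))
		return 0;
	if (!lock->held)
		return -1;
	model_res_spin_unlock(lock);
	return 1;
}
```

With fail_with set to -EDEADLK, run_iteration() with write_lock_checked skips the critical section, whereas write_lock_ignoring_result reaches it without holding the lock, which is the situation the review describes.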


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27265934925


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-10 12:29 UTC (permalink / raw)
  To: John Ericson
  Cc: Andy Lutomirski, Christian Brauner, Kees Cook, Al Viro,
	linux-fsdevel, linux-api, LKML, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <4e049396-377d-48a7-a34c-91318413a876@app.fastmail.com>

Hi John,

 ---- On Wed, 10 Jun 2026 01:27:47 +0800  John Ericson <mail@johnericson.me> wrote --- 
 > 
 > 
 > On Tue, Jun 9, 2026, at 10:43 AM, Li Chen wrote:
 > > Hi Andy,
 > >
 > > ---- On Tue, 09 Jun 2026 08:01:57 +0800  Andy Lutomirski <luto@kernel.org> wrote ---
 > > > [...]
 > > >
 > > > After contemplating this for a bit... why pidfd?  Doesn't a pidfd
 > > > refer to an actual process that is, or at least was, running?  This
 > > > new thing is a process that we are contemplating spawning.  I can
 > > > imagine that basically all pidfd APIs would be a bit confused by the
 > > > nonexistence of the process in question.
 > > >
 > >
 > > Yes, I think that is a real concern.
 > >
 > > In my current local WIP I tried to keep that distinction explicit.
 > > pidfd_spawn_open() returns a pidfs-backed builder fd, not a normal pidfd
 > > referring to a process. The builder fd is allocated as an anonymous pidfs
 > > file with builder-specific file operations:
 > >
 > >     file = pidfs_alloc_anon_file("[pidfd_spawn]",
 > >                                  &pidfd_spawn_builder_fops, builder,
 > >                                  O_RDWR);
 > >
 > 
 > What does your builder fd point to, explicitly? For example in my other reply I
 > talked about how it was "real" process state. In my FreeBSD patch, for example,
 > I found there was already a status for a process "in exec", and I figured that
 > was clean to reuse for one of these "embryonic" processes that also hadn't
 > started running. I would reckon that Linux probably has some similar notions.
 > 
 > > and the normal pidfd helpers still reject it because it does not use the
 > > ordinary pidfd file operations:
 > >
 > >     struct pid *pidfd_pid(const struct file *file)
 > >     {
 > >         if (file->f_op != &pidfs_file_operations)
 > >             return ERR_PTR(-EBADF);
 > >         return file_inode(file)->i_private;
 > >     }
 > >
 > > So the current split is:
 > >
 > >     builder_fd = pidfd_spawn_open(...);       /* builder object */
 > >     pidfd_config(builder_fd, ...);
 > >     child_pidfd = pidfd_spawn_run(builder_fd, ...); /* real pidfd */
 > >
 > > Only the last fd is a normal pidfd for an actual child process. The builder
 > > fd is only accepted by the builder operations.
 > >
 > > This avoids having to define what waitid(P_PIDFD), pidfd_send_signal(),
 > > pidfd_getfd(), poll(), etc. mean before the process exists.
 > 
 > I wouldn't be so sure this is necessary/good. For example, I think it could
 > make sense to wait on a process that has yet to be started; one just waits for
 > both the process to start and the process to exit. Obviously a blocking syscall
 > in the thread that is spawning the process is not useful, but the asynchronous
 > poll variation seems fine.
 > 
 > As long as there is real process state here, it shouldn't be too hard to
 > implement.
 > 
 > > The downside is that it adds a separate open-style entry point and is less
 > > uniform than the pidfd_open(0, PIDFD_EMPTY) spelling Christian sketched.
 > 
 > I do think there is no point having two file descriptors. The file descriptor
 > that previously referred to the builder/embryonic process then can refer to the
 > real process, right?
 > 
 > > If people think there is a better way to represent the pre-spawn builder
 > > state, or if the preference is to integrate it directly into pidfd_open()
 > > with an explicit empty/future-pidfd state, I would be happy to discuss that.
 > 
 > Hope the above answers your question? I suppose my ideas lean more on the
 > "future" than "empty" side --- there is indeed a thread in the thread group,
 > with real VM/namespace/file descriptor etc. state. Moreover, state gets
 > initialized before the process is started, so the actual start is a pretty
 > lightweight step of just letting the scheduler know the now-ready process can
 > be scheduled. The only thing that distinguishes the embryonic process from a
 > real one is simply that it isn't running --- i.e. isn't (yet) available to be
 > scheduled --- so the pidfds holders are free to poke at its state.
 > 
 > Cheers,
 > 
 > John
 > 

Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
closer with P_LINTRANSIT, described as "process in exec or in creation".
Linux does not seem to have a single equivalent today: current->in_execve
is only an LSM hint, while the real synchronization is spread across
exec_update_lock, cred_guard_mutex, and the exec path.

I am switching my local WIP from the two-fd builder model to one fd,
closer to Christian's sketch:

fd = pidfd_open(0, PIDFD_EMPTY);
pidfd_config(fd, ...);
pidfd_spawn_run(fd, ...);

In my current local version, I still use copy_process(), so the fd points
at a real task_struct/pid that is not woken until run. Following
Christian's point that existing APIs can handle this not-yet-running case
with ESRCH, I currently make ordinary pidfd operations that need a real
started process return -ESRCH before start.

I am not sure yet whether Linux should grow a general exec/creation
transition state like that, or whether a narrower future-process
lifecycle is enough for this API. I will think more about that when
working on the pristine process version.
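
As a rough illustration of the one-fd lifecycle described above, here is a toy userspace state machine. It is only a sketch: struct model_pidfd and the model_* functions are hypothetical, pidfd_open(0, PIDFD_EMPTY), pidfd_config() and pidfd_spawn_run() are proposals from this thread rather than existing syscalls, and the -EBUSY choice for late configuration is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lifecycle of the single-fd model; not kernel code. */
enum model_state {
	MODEL_EMBRYONIC,  /* created and configurable, but not yet woken */
	MODEL_RUNNING,    /* started by the run step */
};

struct model_pidfd {
	enum model_state state;
	int nr_config;    /* configuration steps applied so far */
};

/* Stands in for the proposed pidfd_open(0, PIDFD_EMPTY). */
struct model_pidfd model_open_empty(void)
{
	struct model_pidfd fd = { .state = MODEL_EMBRYONIC, .nr_config = 0 };
	return fd;
}

/* Stands in for the proposed pidfd_config(): only valid before start. */
int model_config(struct model_pidfd *fd)
{
	if (fd->state != MODEL_EMBRYONIC)
		return -EBUSY;  /* assumption: this error code is illustrative */
	fd->nr_config++;
	return 0;
}

/* Stands in for the proposed pidfd_spawn_run(): same fd, now started. */
int model_run(struct model_pidfd *fd)
{
	if (fd->state != MODEL_EMBRYONIC)
		return -EBUSY;
	fd->state = MODEL_RUNNING;
	return 0;
}

/* Stands in for an ordinary pidfd operation such as pidfd_send_signal(). */
int model_send_signal(const struct model_pidfd *fd)
{
	assert(fd->state == MODEL_EMBRYONIC || fd->state == MODEL_RUNNING);
	if (fd->state != MODEL_RUNNING)
		return -ESRCH;  /* no started process behind the fd yet */
	return 0;
}
```

In this model a caller that signals before model_run() sees -ESRCH, and the same call succeeds afterwards on the same fd, which is the behaviour the paragraph above describes for the current WIP.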

Regards,
Li



* [PATCH] audit: add missing syscalls to PERM class tables
From: Ricardo Robaina @ 2026-06-10 16:47 UTC (permalink / raw)
  To: audit, linux-kernel, linux-arch
  Cc: paul, eparis, arnd, sgrubb, Ricardo Robaina

Add missing file timestamp and attribute syscalls to the audit PERM
class tables. The most critical gap was the complete absence of
timestamp syscalls from audit_change_attr.h: they failed the
kernel-side AUDIT_PERM_ATTR class check, so rules using perm=a did
not match those operations.

Changes:
- audit_change_attr.h: Add utime, utimes, futimesat, utimensat,
  utimensat_time64, and file_setattr

- audit_read.h: Add quotactl_fd, file_getattr, stat, lstat, fstat,
  newfstatat, and statx

- audit_write.h: Add quotactl_fd

Architecture-specific and conditionally-compiled syscalls are guarded
with #ifdef.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
---
 include/asm-generic/audit_change_attr.h | 16 ++++++++++++++++
 include/asm-generic/audit_read.h        | 19 +++++++++++++++++++
 include/asm-generic/audit_write.h       |  3 +++
 3 files changed, 38 insertions(+)

diff --git a/include/asm-generic/audit_change_attr.h b/include/asm-generic/audit_change_attr.h
index ddd90bbe40df..5cb036695d8a 100644
--- a/include/asm-generic/audit_change_attr.h
+++ b/include/asm-generic/audit_change_attr.h
@@ -40,3 +40,19 @@ __NR_link,
 #ifdef __NR_linkat
 __NR_linkat,
 #endif
+#ifdef __NR_utime
+__NR_utime,
+#endif
+#ifdef __NR_utimes
+__NR_utimes,
+#endif
+#ifdef __NR_futimesat
+__NR_futimesat,
+#endif
+__NR_utimensat,
+#ifdef __NR_utimensat_time64
+__NR_utimensat_time64,
+#endif
+#ifdef __NR_file_setattr
+__NR_file_setattr,
+#endif
diff --git a/include/asm-generic/audit_read.h b/include/asm-generic/audit_read.h
index fb9991f53fb6..8feebc5b4c50 100644
--- a/include/asm-generic/audit_read.h
+++ b/include/asm-generic/audit_read.h
@@ -3,6 +3,9 @@
 __NR_readlink,
 #endif
 __NR_quotactl,
+#ifdef __NR_quotactl_fd
+__NR_quotactl_fd,
+#endif
 __NR_listxattr,
 #ifdef __NR_listxattrat
 __NR_listxattrat,
@@ -18,3 +21,19 @@ __NR_fgetxattr,
 #ifdef __NR_readlinkat
 __NR_readlinkat,
 #endif
+#ifdef __NR_file_getattr
+__NR_file_getattr,
+#endif
+#ifdef __NR_stat
+__NR_stat,
+#endif
+#ifdef __NR_lstat
+__NR_lstat,
+#endif
+#ifdef __NR_fstat
+__NR_fstat,
+#endif
+#ifdef __NR_newfstatat
+__NR_newfstatat,
+#endif
+__NR_statx,
diff --git a/include/asm-generic/audit_write.h b/include/asm-generic/audit_write.h
index f9f1d0ae11d9..378128dc31e3 100644
--- a/include/asm-generic/audit_write.h
+++ b/include/asm-generic/audit_write.h
@@ -5,6 +5,9 @@ __NR_acct,
 __NR_swapon,
 #endif
 __NR_quotactl,
+#ifdef __NR_quotactl_fd
+__NR_quotactl_fd,
+#endif
 #ifdef __NR_truncate
 __NR_truncate,
 #endif
-- 
2.53.0
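
For context on how such a table is consumed, here is a userspace sketch. It assumes, as I understand lib/audit.c, that each of these headers is included into an array initializer that ends in ~0U. The names model_chattr_class and model_class_contains are hypothetical, the linear lookup is a simplification of the kernel's class matching, and every entry is guarded so the sketch builds on architectures that lack a given syscall.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>  /* Linux-specific: provides the __NR_* numbers */

/*
 * Userspace sketch of a syscall class table. The entries mirror the
 * timestamp additions in the patch; the array name is hypothetical.
 */
static const unsigned int model_chattr_class[] = {
#ifdef __NR_utime
	__NR_utime,
#endif
#ifdef __NR_utimes
	__NR_utimes,
#endif
#ifdef __NR_futimesat
	__NR_futimesat,
#endif
#ifdef __NR_utimensat
	__NR_utimensat,
#endif
#ifdef __NR_utimensat_time64
	__NR_utimensat_time64,
#endif
	~0U  /* terminator, so the array is never empty */
};

/* Returns 1 if the syscall number is listed in the class, 0 otherwise. */
int model_class_contains(unsigned int nr)
{
	size_t i;

	for (i = 0; model_chattr_class[i] != ~0U; i++)
		if (model_chattr_class[i] == nr)
			return 1;
	return 0;
}
```

A syscall missing from the list never matches, however the rule is written. A caller can check this with, for example, assert(model_class_contains(__NR_utimensat)) on an architecture that defines it.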



* Re: [PATCH] audit: add missing syscalls to PERM class tables
From: Arnd Bergmann @ 2026-06-10 17:05 UTC (permalink / raw)
  To: Ricardo Robaina, audit, linux-kernel, Linux-Arch
  Cc: Paul Moore, Eric Paris, sgrubb
In-Reply-To: <20260610164719.2668906-1-rrobaina@redhat.com>

On Wed, Jun 10, 2026, at 18:47, Ricardo Robaina wrote:
> diff --git a/include/asm-generic/audit_read.h 
> b/include/asm-generic/audit_read.h
> index fb9991f53fb6..8feebc5b4c50 100644
> --- a/include/asm-generic/audit_read.h
> +++ b/include/asm-generic/audit_read.h
> @@ -18,3 +21,19 @@ __NR_fgetxattr,
>  #ifdef __NR_readlinkat
>  __NR_readlinkat,
>  #endif
> +#ifdef __NR_file_getattr
> +__NR_file_getattr,
> +#endif
> +#ifdef __NR_stat
> +__NR_stat,
> +#endif
> +#ifdef __NR_lstat
> +__NR_lstat,
> +#endif
> +#ifdef __NR_fstat
> +__NR_fstat,
> +#endif
> +#ifdef __NR_newfstatat
> +__NR_newfstatat,
> +#endif
> +__NR_statx,

There are additional variants of 'stat' that I think you need
to cover here:

scripts/syscall.tbl:79	stat64	fstatat64			sys_fstatat64
scripts/syscall.tbl:80	stat64	fstat64				sys_fstat64
arch/x86/entry/syscalls/syscall_32.tbl:18	i386	oldstat			sys_stat
arch/x86/entry/syscalls/syscall_32.tbl:28	i386	oldfstat		sys_fstat
arch/x86/entry/syscalls/syscall_32.tbl:84	i386	oldlstat		sys_lstat
arch/x86/entry/syscalls/syscall_32.tbl:195	i386	stat64			sys_stat64			compat_sys_ia32_stat64
arch/x86/entry/syscalls/syscall_32.tbl:196	i386	lstat64			sys_lstat64			compat_sys_ia32_lstat64
arch/x86/entry/syscalls/syscall_32.tbl:197	i386	fstat64			sys_fstat64			compat_sys_ia32_fstat64
arch/x86/entry/syscalls/syscall_32.tbl:300	i386	fstatat64		sys_fstatat64			compat_sys_ia32_fstatat64
arch/alpha/kernel/syscalls/syscall.tbl:224	common	osf_stat			sys_osf_stat
arch/alpha/kernel/syscalls/syscall.tbl:225	common	osf_lstat			sys_osf_lstat
arch/alpha/kernel/syscalls/syscall.tbl:226	common	osf_fstat			sys_osf_fstat

Not sure about ustat/fstatfs/statfs, I suppose those are a different
category, right?

       Arnd


* Re: [PATCH] audit: add missing syscalls to PERM class tables
From: Ricardo Robaina @ 2026-06-10 17:40 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: audit, linux-kernel, Linux-Arch, Paul Moore, Eric Paris, sgrubb
In-Reply-To: <476e0a44-c6fb-4e6a-af56-a9f1d054518a@app.fastmail.com>

On Wed, Jun 10, 2026 at 2:05 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Wed, Jun 10, 2026, at 18:47, Ricardo Robaina wrote:
> > diff --git a/include/asm-generic/audit_read.h
> > b/include/asm-generic/audit_read.h
> > index fb9991f53fb6..8feebc5b4c50 100644
> > --- a/include/asm-generic/audit_read.h
> > +++ b/include/asm-generic/audit_read.h
> > @@ -18,3 +21,19 @@ __NR_fgetxattr,
> >  #ifdef __NR_readlinkat
> >  __NR_readlinkat,
> >  #endif
> > +#ifdef __NR_file_getattr
> > +__NR_file_getattr,
> > +#endif
> > +#ifdef __NR_stat
> > +__NR_stat,
> > +#endif
> > +#ifdef __NR_lstat
> > +__NR_lstat,
> > +#endif
> > +#ifdef __NR_fstat
> > +__NR_fstat,
> > +#endif
> > +#ifdef __NR_newfstatat
> > +__NR_newfstatat,
> > +#endif
> > +__NR_statx,
>
> There are additional variants of 'stat' that I think you need
> to cover here:
>
> scripts/syscall.tbl:79  stat64  fstatat64                       sys_fstatat64
> scripts/syscall.tbl:80  stat64  fstat64                         sys_fstat64
> arch/x86/entry/syscalls/syscall_32.tbl:18       i386    oldstat                 sys_stat
> arch/x86/entry/syscalls/syscall_32.tbl:28       i386    oldfstat                sys_fstat
> arch/x86/entry/syscalls/syscall_32.tbl:84       i386    oldlstat                sys_lstat
> arch/x86/entry/syscalls/syscall_32.tbl:195      i386    stat64                  sys_stat64                      compat_sys_ia32_stat64
> arch/x86/entry/syscalls/syscall_32.tbl:196      i386    lstat64                 sys_lstat64                     compat_sys_ia32_lstat64
> arch/x86/entry/syscalls/syscall_32.tbl:197      i386    fstat64                 sys_fstat64                     compat_sys_ia32_fstat64
> arch/x86/entry/syscalls/syscall_32.tbl:300      i386    fstatat64               sys_fstatat64                   compat_sys_ia32_fstatat64
> arch/alpha/kernel/syscalls/syscall.tbl:224      common  osf_stat                        sys_osf_stat
> arch/alpha/kernel/syscalls/syscall.tbl:225      common  osf_lstat                       sys_osf_lstat
> arch/alpha/kernel/syscalls/syscall.tbl:226      common  osf_fstat                       sys_osf_fstat
>

Hi Arnd,

Thanks for reviewing this patch! You're right, it seems all these stat
variants should be added as well. Steve and Paul, correct me if I'm
wrong here, please.

> Not sure about ustat/fstatfs/statfs, I suppose those are a different
> category, right?

Yes, I believe these would fall under a different category, since they
are related to filesystem stats. Audit PERM classes are specifically
for file metadata and access operations, not filesystem statistics.

>
>        Arnd
>

I will work on v2 shortly.

Thanks,
-Ricardo



* Re: [PATCH] audit: add missing syscalls to PERM class tables
From: Steve Grubb @ 2026-06-10 18:13 UTC (permalink / raw)
  To: Arnd Bergmann, Ricardo Robaina
  Cc: audit, linux-kernel, Linux-Arch, Paul Moore, Eric Paris
In-Reply-To: <CAABTaaBBVP6eY8D+a1KTgfZ3x8v3egnKTiQLRQfXPiOhpmQXjg@mail.gmail.com>

On Wednesday, June 10, 2026 1:40:47 PM Eastern Daylight Time Ricardo Robaina 
wrote:
> On Wed, Jun 10, 2026 at 2:05 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > On Wed, Jun 10, 2026, at 18:47, Ricardo Robaina wrote:
> > > diff --git a/include/asm-generic/audit_read.h
> > > b/include/asm-generic/audit_read.h
> > > index fb9991f53fb6..8feebc5b4c50 100644
> > > --- a/include/asm-generic/audit_read.h
> > > +++ b/include/asm-generic/audit_read.h
> > > @@ -18,3 +21,19 @@ __NR_fgetxattr,
> > > 
> > >  #ifdef __NR_readlinkat
> > >  __NR_readlinkat,
> > >  #endif
> > > 
> > > +#ifdef __NR_file_getattr
> > > +__NR_file_getattr,
> > > +#endif
> > > +#ifdef __NR_stat
> > > +__NR_stat,
> > > +#endif
> > > +#ifdef __NR_lstat
> > > +__NR_lstat,
> > > +#endif
> > > +#ifdef __NR_fstat
> > > +__NR_fstat,
> > > +#endif
> > > +#ifdef __NR_newfstatat
> > > +__NR_newfstatat,
> > > +#endif
> > > +__NR_statx,
> > 
> > There are additional variants of 'stat' that I think you need
> > to cover here:
> > 
> > scripts/syscall.tbl:79  stat64  fstatat64                       sys_fstatat64
> > scripts/syscall.tbl:80  stat64  fstat64                         sys_fstat64
> > arch/x86/entry/syscalls/syscall_32.tbl:18       i386    oldstat                 sys_stat
> > arch/x86/entry/syscalls/syscall_32.tbl:28       i386    oldfstat                sys_fstat
> > arch/x86/entry/syscalls/syscall_32.tbl:84       i386    oldlstat                sys_lstat
> > arch/x86/entry/syscalls/syscall_32.tbl:195      i386    stat64                  sys_stat64                      compat_sys_ia32_stat64
> > arch/x86/entry/syscalls/syscall_32.tbl:196      i386    lstat64                 sys_lstat64                     compat_sys_ia32_lstat64
> > arch/x86/entry/syscalls/syscall_32.tbl:197      i386    fstat64                 sys_fstat64                     compat_sys_ia32_fstat64
> > arch/x86/entry/syscalls/syscall_32.tbl:300      i386    fstatat64               sys_fstatat64                   compat_sys_ia32_fstatat64
> > arch/alpha/kernel/syscalls/syscall.tbl:224      common  osf_stat                        sys_osf_stat
> > arch/alpha/kernel/syscalls/syscall.tbl:225      common  osf_lstat                       sys_osf_lstat
> > arch/alpha/kernel/syscalls/syscall.tbl:226      common  osf_fstat                       sys_osf_fstat
> Hi Arnd,
> 
> Thanks for reviewing this patch! You're right, it seems all these stat
> variants should be added as well. Steve and Paul, correct me if I'm
> wrong here, please.

Alpha is unsupported. Those are Tru64 compatibility syscalls. You can
include them and #ifdef will filter them out everywhere else.  But, yeah.
I guess the rest are ok. I don't pay much attention to the 32-bit arches.

-Steve




* Re: [PATCH] audit: add missing syscalls to PERM class tables
From: Ricardo Robaina @ 2026-06-10 18:54 UTC (permalink / raw)
  To: Steve Grubb
  Cc: Arnd Bergmann, audit, linux-kernel, Linux-Arch, Paul Moore,
	Eric Paris
In-Reply-To: <2021591.7Z3S40VBb9@x2>

On Wed, Jun 10, 2026 at 3:13 PM Steve Grubb <sgrubb@redhat.com> wrote:
>
> On Wednesday, June 10, 2026 1:40:47 PM Eastern Daylight Time Ricardo Robaina
> wrote:
> > On Wed, Jun 10, 2026 at 2:05 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > On Wed, Jun 10, 2026, at 18:47, Ricardo Robaina wrote:
> > > > diff --git a/include/asm-generic/audit_read.h
> > > > b/include/asm-generic/audit_read.h
> > > > index fb9991f53fb6..8feebc5b4c50 100644
> > > > --- a/include/asm-generic/audit_read.h
> > > > +++ b/include/asm-generic/audit_read.h
> > > > @@ -18,3 +21,19 @@ __NR_fgetxattr,
> > > >
> > > >  #ifdef __NR_readlinkat
> > > >  __NR_readlinkat,
> > > >  #endif
> > > >
> > > > +#ifdef __NR_file_getattr
> > > > +__NR_file_getattr,
> > > > +#endif
> > > > +#ifdef __NR_stat
> > > > +__NR_stat,
> > > > +#endif
> > > > +#ifdef __NR_lstat
> > > > +__NR_lstat,
> > > > +#endif
> > > > +#ifdef __NR_fstat
> > > > +__NR_fstat,
> > > > +#endif
> > > > +#ifdef __NR_newfstatat
> > > > +__NR_newfstatat,
> > > > +#endif
> > > > +__NR_statx,
> > >
> > > There are additional variants of 'stat' that I think you need
> > > to cover here:
> > >
> > > scripts/syscall.tbl:79  stat64  fstatat64                       sys_fstatat64
> > > scripts/syscall.tbl:80  stat64  fstat64                         sys_fstat64
> > > arch/x86/entry/syscalls/syscall_32.tbl:18       i386    oldstat                 sys_stat
> > > arch/x86/entry/syscalls/syscall_32.tbl:28       i386    oldfstat                sys_fstat
> > > arch/x86/entry/syscalls/syscall_32.tbl:84       i386    oldlstat                sys_lstat
> > > arch/x86/entry/syscalls/syscall_32.tbl:195      i386    stat64                  sys_stat64                      compat_sys_ia32_stat64
> > > arch/x86/entry/syscalls/syscall_32.tbl:196      i386    lstat64                 sys_lstat64                     compat_sys_ia32_lstat64
> > > arch/x86/entry/syscalls/syscall_32.tbl:197      i386    fstat64                 sys_fstat64                     compat_sys_ia32_fstat64
> > > arch/x86/entry/syscalls/syscall_32.tbl:300      i386    fstatat64               sys_fstatat64                   compat_sys_ia32_fstatat64
> > > arch/alpha/kernel/syscalls/syscall.tbl:224      common  osf_stat                        sys_osf_stat
> > > arch/alpha/kernel/syscalls/syscall.tbl:225      common  osf_lstat                       sys_osf_lstat
> > > arch/alpha/kernel/syscalls/syscall.tbl:226      common  osf_fstat                       sys_osf_fstat
> > Hi Arnd,
> >
> > Thanks for reviewing this patch! You're right, it seems all these stat
> > variants should be added as well. Steve and Paul, correct me if I'm
> > wrong here, please.
>
> Alpha is unsupported. Those are Tru64 compatibility syscalls. You can
> include them and #ifdef will filter them out everywhere else.  But, yeah.
> I guess the rest are ok. I don't pay much attention to the 32-bit arches.

Thanks, Steve.

>
> -Steve
>
>



* Re: [PATCH v4 6/8] string: introduce memcpy_streaming() helpers
From: Borislav Petkov @ 2026-06-10 19:19 UTC (permalink / raw)
  To: Li Zhe
  Cc: akpm, apopple, arnd, dave.hansen, david, kees, linux-arch,
	linux-hardening, linux-kernel, linux-mm, mingo, rppt, tglx, x86
In-Reply-To: <20260609120132.84323-1-lizhe.67@bytedance.com>

On Tue, Jun 09, 2026 at 08:01:32PM +0800, Li Zhe wrote:
> That said, I see your layering point. If arch/x86/include/asm/string.h
> is the preferred place for the arch-visible wrapper, I can move the
> wrapper there in the next revision while keeping the x86_64-specific
> implementation details in string_64.h.

No, 64-bit only's fine. We don't put any new features into 32-bit anymore
anyway, but it wasn't clear from the commit message what your goal is.

> Thinking about it more, I agree that this is hard to justify for a
> generic helper. For this series, what really matters is that the
> struct page copies in patch 8 can use the existing x86
> memcpy_flushcache() fastpaths where that is beneficial; I do not need
> patch 6 to impose extra selection policy on unrelated callers.

What I am asking is that you show numbers justifying why those helpers need to exist.

Your 0th message is talking about measuring this in VMs. If this workload is
not VM-specific, then those numbers don't matter. They're just handwaving.

So I'd need a good justification why we need the changes before we go any
further.

HTH.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] audit: add missing syscalls to PERM class tables
From: Arnd Bergmann @ 2026-06-10 19:39 UTC (permalink / raw)
  To: Steve Grubb, Ricardo Robaina
  Cc: audit, linux-kernel, Linux-Arch, Paul Moore, Eric Paris
In-Reply-To: <2021591.7Z3S40VBb9@x2>

On Wed, Jun 10, 2026, at 20:13, Steve Grubb wrote:
> On Wednesday, June 10, 2026 1:40:47 PM Eastern Daylight Time Ricardo Robaina 
> wrote:

>> > scripts/syscall.tbl:79  stat64  fstatat64                       sys_fstatat64
>> > scripts/syscall.tbl:80  stat64  fstat64                         sys_fstat64
>> > arch/x86/entry/syscalls/syscall_32.tbl:18       i386    oldstat                 sys_stat
>> > arch/x86/entry/syscalls/syscall_32.tbl:28       i386    oldfstat                sys_fstat
>> > arch/x86/entry/syscalls/syscall_32.tbl:84       i386    oldlstat                sys_lstat
>> > arch/x86/entry/syscalls/syscall_32.tbl:195      i386    stat64                  sys_stat64                      compat_sys_ia32_stat64
>> > arch/x86/entry/syscalls/syscall_32.tbl:196      i386    lstat64                 sys_lstat64                     compat_sys_ia32_lstat64
>> > arch/x86/entry/syscalls/syscall_32.tbl:197      i386    fstat64                 sys_fstat64                     compat_sys_ia32_fstat64
>> > arch/x86/entry/syscalls/syscall_32.tbl:300      i386    fstatat64               sys_fstatat64                   compat_sys_ia32_fstatat64
>> > arch/alpha/kernel/syscalls/syscall.tbl:224      common  osf_stat                        sys_osf_stat
>> > arch/alpha/kernel/syscalls/syscall.tbl:225      common  osf_lstat                       sys_osf_lstat
>> > arch/alpha/kernel/syscalls/syscall.tbl:226      common  osf_fstat                       sys_osf_fstat
>> Hi Arnd,
>> 
>> Thanks for reviewing this patch! You're right, it seems all these stat
>> variants should be added as well. Steve and Paul, correct me if I'm
>> wrong here, please.
>
> Alpha is unsupported. Those are Tru64 compatibility syscalls. You can
> include them and #ifdef will filter them out everywhere else.  But, yeah.
> I guess the rest are ok. I don't pay much attention to the 32-bit arches.

Ah, indeed. I assumed that these were part of the syscalls that
originally came from osf1 but are still used on Linux systems,
but it appears that the stat family was never used like that
with glibc.

The oldstat family is actually in a similar category, as those
were only used on libc5 or earlier. stat64 is definitely
still needed on 32-bit userspace with 32-bit time_t.

     Arnd


* Re: [PATCH] audit: add missing syscalls to PERM class tables
From: Ricardo Robaina @ 2026-06-10 19:53 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Steve Grubb, audit, linux-kernel, Linux-Arch, Paul Moore,
	Eric Paris
In-Reply-To: <4c885308-7556-49a9-836b-37089ac3bafb@app.fastmail.com>

On Wed, Jun 10, 2026 at 4:39 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Wed, Jun 10, 2026, at 20:13, Steve Grubb wrote:
> > On Wednesday, June 10, 2026 1:40:47 PM Eastern Daylight Time Ricardo Robaina
> > wrote:
>
> >> > scripts/syscall.tbl:79  stat64  fstatat64  sys_fstatat64
> >> > scripts/syscall.tbl:80  stat64  fstat64    sys_fstat64
> >> > arch/x86/entry/syscalls/syscall_32.tbl:18   i386  oldstat    sys_stat
> >> > arch/x86/entry/syscalls/syscall_32.tbl:28   i386  oldfstat   sys_fstat
> >> > arch/x86/entry/syscalls/syscall_32.tbl:84   i386  oldlstat   sys_lstat
> >> > arch/x86/entry/syscalls/syscall_32.tbl:195  i386  stat64     sys_stat64     compat_sys_ia32_stat64
> >> > arch/x86/entry/syscalls/syscall_32.tbl:196  i386  lstat64    sys_lstat64    compat_sys_ia32_lstat64
> >> > arch/x86/entry/syscalls/syscall_32.tbl:197  i386  fstat64    sys_fstat64    compat_sys_ia32_fstat64
> >> > arch/x86/entry/syscalls/syscall_32.tbl:300  i386  fstatat64  sys_fstatat64  compat_sys_ia32_fstatat64
> >> > arch/alpha/kernel/syscalls/syscall.tbl:224  common  osf_stat   sys_osf_stat
> >> > arch/alpha/kernel/syscalls/syscall.tbl:225  common  osf_lstat  sys_osf_lstat
> >> > arch/alpha/kernel/syscalls/syscall.tbl:226  common  osf_fstat  sys_osf_fstat
> >> Hi Arnd,
> >>
> >> Thanks for reviewing this patch! You're right, it seems all these stat
> >> variants should be added as well. Steve and Paul, correct me if I'm
> >> wrong here, please.
> >
> > Alpha is unsupported. Those are Tru64 compatibility syscalls. You can
> > include it and #ifdef will filter it everywhere else.  But, yeah. I guess the
> > rest are ok. I don't pay much attention to the 32 bit arches.
>
> Ah, indeed. I assumed that these were part of the syscalls that
> originally came from osf1 but are still used on Linux systems,
> but it appears that the stat family was never used like that
> with glibc.
>
> The oldstat family is actually in a similar category, as those
> were only used on libc5 or earlier. stat64 is definitely
> still needed on 32-bit userspace with 32-bit time_t.
>
>      Arnd
>

Thanks for the context, Arnd and Steve. I'll append just stat64,
lstat64, fstat64 and fstatat64 to the v2 I'm about to send, then.

Please let me know if there's anything else to adjust.

-Ricardo



* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: John Ericson @ 2026-06-10 20:38 UTC (permalink / raw)
  To: Li Chen
  Cc: Andy Lutomirski, Christian Brauner, Kees Cook, Al Viro,
	linux-fsdevel, linux-api, LKML, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <19eb181fdd4.6d028f442844776.3737831021032223216@linux.beauty>

On Wed, Jun 10, 2026, at 8:29 AM, Li Chen wrote:
> Hi John,
>
> [...]
>
> Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
> note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
> closer with P_LINTRANSIT, described as "process in exec or in creation".
> Linux does not seem to have a single equivalent today: current->in_execve
> is only an LSM hint, while the real synchronization is spread across
> exec_update_lock, cred_guard_mutex, and the exec path.

Great! Glad to hear my suggestion (and, I hope, the patch I linked in
the other email too) was useful.

> I am switching my local WIP from the two-fd builder model to one fd,
> closer to Christian's sketch:
>
> fd = pidfd_open(0, PIDFD_EMPTY);
> pidfd_config(fd, ...);
> pidfd_spawn_run(fd, ...);

Glad to hear it is also one-fd now.

> In my current local version, I still use copy_process(), so the fd points
> at a real task_struct/pid that is not woken until run.

So this is an interesting thing to think about. My hunch is that
`copy_process` is, at least in the longer term, still doing too much! In
particular, `struct kernel_clone_args` has many degrees of freedom, and
might also make assumptions about preserving more of the parent process
than is needed in this case.

This is a bit tangential, but one thing I have thought about is having
"null namespaces". I think the current (i.e. existing clone API) default
of "share with parent process" is a poor security practice (more
privileges, i.e. sharing, should always be opt-in). But the opposite
default of "unshare everything" is expensive since creating new
namespaces is non-free. The goal of the null namespaces would be a cheap
way of creating a more isolated and unprivileged process — and "cheap"
here is literal: a null pointer in `nsproxy`, no allocation, no
namespace object, no ID. This null state would be what
`pidfd_open(0, PIDFD_EMPTY)` (using your example above, or really
whatever the first step is) hands back.

Then, from that maximally cheap and unprivileged initial state, the
`pidfd_config(fd, ...);` calls (plural important, I think!) would opt
into either sharing or unsharing namespaces between the child and parent
as the parent sees fit.

The larger point here is that insofar as there are no good defaults for
things, there is pressure, whether in step 1 or step 2, to make a larger
everything-at-once configuration. But when we think a bit outside the
box to create the good defaults where they didn't previously exist, we
can end up in a situation where a minimal initial blank unstarted
process, and the builder pattern to initialize it, are more "natural".

> Following
> Christian's point that existing APIs can handle this not-yet-running case
> with ESRCH, I currently make ordinary pidfd operations that need a real
> started process return -ESRCH before start.

Also glad to hear.

> I am not sure yet whether Linux should grow a general exec/creation
> transition state like that, or whether a narrower future-process
> lifecycle is enough for this API. I will think more about that when
> working on the pristine process version.

Sounds good. As I think you can guess, my preference is for "yes", but I
agree we can see what you end up with in the next patchset and make more
informed decisions based on that.

Cheers,

John


* Re: [PATCH 04/11] treewide: Convert struct kernel_param_ops initializers to DEFINE_KERNEL_PARAM_OPS
From: jim.cromie @ 2026-06-10 21:06 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: Kees Cook, Luis Chamberlain, Pengpeng Hou, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Rafael J. Wysocki, Len Brown,
	Corey Minyard, Gabriel Somlo, Michael S. Tsirkin, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, David Airlie,
	Simona Vetter, Bart Van Assche, Jason Gunthorpe, Leon Romanovsky,
	Laurent Pinchart, Hans de Goede, Mauro Carvalho Chehab,
	Bjorn Helgaas, Hannes Reinecke, James E.J. Bottomley,
	Martin K. Petersen, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Greg Kroah-Hartman, Jiri Slaby, Alan Stern, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Baron, Tiwei Bie, Benjamin Berg,
	Ilpo Järvinen, David E. Box, Maciej W. Rozycki,
	Srinivas Pandruvada, Peter Zijlstra, Heiko Carstens,
	Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <da358ae1-91b4-4a16-ac76-ffab99c230b9@suse.com>

On Mon, May 25, 2026 at 7:35 AM Petr Pavlu <petr.pavlu@suse.com> wrote:
>
> On 5/21/26 3:33 PM, Kees Cook wrote:
> > Using Coccinelle, rewrite every struct kernel_param_ops initializer that
> > sets .get into a DEFINE_KERNEL_PARAM_OPS-family macro invocation,
> > for example:
> >
> > @@
> > declarer name DEFINE_KERNEL_PARAM_OPS;
> > identifier OPS;
> > expression SET, GET;
> > @@
> > - const struct kernel_param_ops OPS = {
> > -       .set = SET,
> > -       .get = GET,
> > - };
> > + DEFINE_KERNEL_PARAM_OPS(OPS, SET, GET);
> >
> > Using the macro for initialization means future changes can manipulate
> > the struct layout and callback prototypes without having to change every
> > initializer.
>
> Nit: For consistency, I suggest also converting the few remaining
> kernel_param_ops instances that specify only .set and no .get, such as
> simdisk_param_ops_filename.
>
> --
> Thanks,
> Petr

for the dynamic-debug changes

Reviewed-by: Jim Cromie <jim.cromie@gmail.com>


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Mateusz Guzik @ 2026-06-10 22:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Li Chen, Kees Cook, Alexander Viro,
	linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
	linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <CAG48ez38OEE8ZPLyU6nr9=cYx-hMsdoh5WRrv-GMZGMDKyyOTA@mail.gmail.com>

On Mon, Jun 8, 2026 at 5:02 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, May 28, 2026 at 2:55 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > This problem is dear to my heart and I have been pondering it on and off
> > for some time now. The entire fork + exec idiom is terrible and needs to
> > be retired.
>
> It seems to me like vfork+exec is a decent UAPI building block, on
> which you can build nice-looking userspace APIs, though I agree that
> this is not an ideal direct interface for application code.
>
> > Additionally there is a known problem where transiently copied file
> > descriptors on fork + exec cause a headache in multithreaded programs
> > doing something like this in parallel. I only did cursory reading, it
> > seems your patchset keeps the same problem in place.
>
> I think we almost have UAPI that would let you avoid this issue?
> You can use clone() with CLONE_FILES, then unshare the FD table with
> close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE). That is not currently
> implemented to be atomic with stuff that happens on other threads, but
> if we changed that, and it doesn't provide a good way to carry some
> FDs across, but it feels to me like this could be fixed with a variant
> of close_range() that removes O_CLOEXEC FDs except ones listed in an
> array.

Suppose you want to exec a binary with the following fd set:
0 is /dev/null
1 is fd 1023 in your process
2 is fd 1023 in your process

You have tons of other fds and you don't want any of them anywhere near this.

A clean interface, from my standpoint, would avoid any unnecessary
overhead and would let you specify exactly what you want.

In this case, whatever the interface, it should provide the ability to
map 1023 to 1 and 2 in the child. With the current syscall set you get
refs taken on these on clone, then you have to manually dup2 them,
which means separate syscalls with extra atomics on top. A fast & elegant
solution would allow you to tell the kernel directly where to install
the 2 files.

Also note that in practical terms userspace likes to closefrom/close_range
anyway to get rid of unwanted fds which happen not to have the cloexec
bit, which is yet another syscall to invoke on the way to exec. A
better interface would avoid the problem outright by not copying
unwanted fds unless asked. To be viable as a foundation for building
posix_spawn, such copying would of course have to be supported.
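
For comparison, here is a minimal sketch of how the fd layout above is
requested with today's posix_spawn() file actions. It is not the
interface being proposed in this thread; the function names
spawn_with_fd_map() and demo_capture() are made up for illustration, and
a pipe write end stands in for "fd 1023". As far as I know, libc still
carries out each action as a separate open/dup2 step in the child, which
is the overhead described above, and unrelated fds stay out of the child
only if they carry the cloexec bit.

```c
#include <assert.h>
#include <fcntl.h>
#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Run `sh -c script` with stdin on /dev/null and both stdout and stderr
 * pointing at src_fd. Returns the child's exit status, or -1 on error.
 */
int spawn_with_fd_map(int src_fd, const char *script)
{
	posix_spawn_file_actions_t fa;
	char *argv[] = { "sh", "-c", (char *)script, NULL };
	pid_t pid;
	int status, err;

	assert(src_fd >= 0);
	if (posix_spawn_file_actions_init(&fa) != 0)
		return -1;

	/* 0 is /dev/null, 1 and 2 are both copies of src_fd. */
	err = posix_spawn_file_actions_addopen(&fa, 0, "/dev/null", O_RDONLY, 0);
	if (!err)
		err = posix_spawn_file_actions_adddup2(&fa, src_fd, 1);
	if (!err)
		err = posix_spawn_file_actions_adddup2(&fa, src_fd, 2);
	if (!err)
		err = posix_spawn(&pid, "/bin/sh", &fa, NULL, argv, environ);
	posix_spawn_file_actions_destroy(&fa);
	if (err)
		return -1;

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Capture what the child wrote to its stdout and stderr into buf. */
int demo_capture(char *buf, size_t len)
{
	int p[2], rc;
	size_t off = 0;
	ssize_t n;

	if (len == 0 || pipe(p) != 0)
		return -1;
	/* cloexec keeps the pipe ends themselves out of the exec'd child. */
	fcntl(p[0], F_SETFD, FD_CLOEXEC);
	fcntl(p[1], F_SETFD, FD_CLOEXEC);

	rc = spawn_with_fd_map(p[1], "echo out; echo err >&2");
	close(p[1]);
	while (off < len - 1 && (n = read(p[0], buf + off, len - 1 - off)) > 0)
		off += (size_t)n;
	buf[off] = '\0';
	close(p[0]);
	return rc;
}
```

The dup2 copies installed at 1 and 2 do not carry the cloexec bit, so
they survive the exec while the original pipe ends do not.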

>
> > There are numerous impactful ways to speed up execs both in terms of
> > single-threaded cost and their multicore scalability, most of which
> > would be immediately usable by all programs without an opt-in. imo these
> > needs to be exhausted before something like a "template" can be
> > considered.
>
> (I think probably a large part of this would be stuff that happens in
> userspace, like dynamic linking.)

I have not investigated userspace. Even putting specific APIs aside,
the kernel has *a lot* of avoidable overhead.

>
> > Per the above, the primary win would stem from *NOT* messing with mm.
>
> As you write below, I think we have that with CLONE_VM? The C function
> vfork() is kind of a terrible API because of its returns-twice
> behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> wrapped by libc in a way similar to clone() (with the child executing
> a separate handler function), or if it was used in the implementation
> of some higher-level process-spawning API, it would be a perfectly
> fine API?
>
> Or am I misunderstanding what you mean by "messing with mm"?
>

I was not aware of this functionality; let's assume it indeed works.
You still have the file issue described above.


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Mateusz Guzik @ 2026-06-10 23:40 UTC (permalink / raw)
  To: Li Chen
  Cc: John Ericson, Andy Lutomirski, Christian Brauner, Kees Cook,
	Al Viro, linux-fsdevel, linux-api, LKML, linux-mm, linux-arch,
	linux-doc, linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <19eb181fdd4.6d028f442844776.3737831021032223216@linux.beauty>

On Wed, Jun 10, 2026 at 08:29:06PM +0800, Li Chen wrote:
>  ---- On Wed, 10 Jun 2026 01:27:47 +0800  John Ericson <mail@johnericson.me> wrote --- 
>  > Hope the above answers your question? I suppose my ideas lean more on the
>  > "future" than "empty" side --- there is indeed a thread in the thread group,
>  > with real VM/namespace/file descriptor etc. state. Moreover, state gets
>  > initialized before the process is started, so the actual start is a pretty
>  > lightweight step of just letting the scheduler know the now-ready process can
>  > be scheduled. The only thing that distinguishes the embryonic process from a
>  > real one is simply that it isn't running --- i.e. isn't (yet) available to be
>  > scheduled --- so the pidfds holders are free to poke at its state.
>  > 
> 
> Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
> note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
> closer with P_LINTRANSIT, described as "process in exec or in creation".
> Linux does not seem to have a single equivalent today: current->in_execve
> is only an LSM hint, while the real synchronization is spread across
> exec_update_lock, cred_guard_mutex, and the exec path.
> 
> I am switching my local WIP from the two-fd builder model to one fd,
> closer to Christian's sketch:
> 
> fd = pidfd_open(0, PIDFD_EMPTY);
> pidfd_config(fd, ...);
> pidfd_spawn_run(fd, ...);
> 
> In my current local version, I still use copy_process(), so the fd points
> at a real task_struct/pid that is not woken until run. Following
> Christian's point that existing APIs can handle this not-yet-running case
> with ESRCH, I currently make ordinary pidfd operations that need a real
> started process return -ESRCH before start.
> 
> I am not sure yet whether Linux should grow a general exec/creation
> transition state like that, or whether a narrower future-process
> lifecycle is enough for this API. I will think more about that when
> working on the pristine process version.
> 

As I tried to explain in my previous e-mail, this approach does not cut
it because of NUMA.

Suppose you have a machine with 2 nodes. The parent-to-be is running
on node 0 and the child is intended to exec something on node 1.

When the parent-to-be allocates and populates stuff, it takes place with
memory backed by node 0. If you allocate task_struct, the file table and
other frequently used (and modified!) objs in this way, you are
guaranteeing performance loss due to interconnect traffic to access it.

Trying to add plumbing so that all allocations respect numa placement is
probably too cumbersome.

The primary example for that is looking up the binary to exec in the
first place.

Userspace likes to pass paths which don't exist, meaning checking for
the binary before any hard work is a useful optimization. Suppose the
binary to be executed is in a container bound with a taskset using
node 1 and the content of the fs part of the container is currently
fully uncached.

When you perform the lookup on node 0, you are populating a bunch of
metadata (inode, dentry) using memory from that domain. But the intended
user will only execute on node 1, again resulting in a performance loss.

In order to not do it you would need to convince VFS to allocate memory
elsewhere.

So I stand by my previous claim that ultimately a pristine child has to
be created (like in this patch), and that child also has to do the work
on its own.

Suppose there is no explicit placement requested anywhere. Even in that
case there are legitimate workloads which will eventually be forced to
exec stuff on another node. Even these have a better chance of retaining
full locality if the child process does all the work.

Per my previous message I don't see a clean interface to do it.
Something quasi-posix_spawn is probably the least bad way out; it will
also allow userspace to easily wrap the new thing with posix_spawn
itself.

Also note there is another issue with the fd-based approach: the fd will
get inherited on fork and will hang out in the child afterwards unless
explicitly closed. Suppose you have a multithreaded program which likes
to both fork(+no exec) and fork+exec. With the fd-based approach you
have no means of stopping another thread from grabbing your state thanks
to Unix defaulting to copying everything. There was an attempt to fix
this aspect with O_CLOFORK, but this got rejected.
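
The inheritance problem is easy to demonstrate with existing calls. A
minimal sketch (the helper name fd_survives_fork() is made up for
illustration): even a descriptor marked cloexec is still present in a
child that forks without exec'ing.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if fd is still a valid descriptor in a forked child,
 * 0 if it is not, and -1 on error.
 */
int fd_survives_fork(int fd)
{
	int status;
	pid_t pid;

	assert(fd >= 0);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: F_GETFD fails only if the fd was not inherited. */
		_exit(fcntl(fd, F_GETFD) == -1 ? 1 : 0);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 0;
}
```

Setting FD_CLOEXEC on the descriptor beforehand does not change the
result, since that flag only takes effect at exec time.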

Whatever exactly happens, NUMA is a sad fact of computing and needs to
be accounted for. The approach as proposed not only does not do it, but
it actively hinders such deployments.


* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-06-11  7:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <20260603120811.GW3493090@noisy.programming.kicks-ass.net>

On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> Also, I think someone should go do some performance runs with
> ARCH_INLINE_SPIN_* set for x86 just like for s390.

As promised, I set ARCH_INLINE_SPIN_UNLOCK{,_BH,_IRQ,_IRQRESTORE} for
x86 and measured the effect on a few real workloads.

Short version: inlining of _raw_spin_unlock() adds measurable kernel
i-cache pressure on every workload I tried, and on a
kernel-i-cache-bound one (nginx connection churn) it costs ~1.27%
throughput. I did not find a workload where it helps.

HOW BENCHMARKS WERE CHOSEN

The cost of inlining unlock is text footprint increase. Every unlock
site grows, and the extra bytes compete for the shared L1i. The bill is
paid by unrelated code, in both kernel and userspace.

Locktorture and similar microbenchmarks can't see this, because they
usually hammer a tiny loop that stays L1i-resident, so they measure
fast-path cycles, where inlining (fewer instructions per unlock) looks
neutral-to-good.

To make the cost visible, the workload has to have real instruction
cache pressure. To achieve that, it has to touch a lot of code.

A good way to screen benchmarks: look for high tma_frontend_bound
fraction from 'perf stat -M TopdownL1' and simultaneously require it to
spend non-trivial time in the kernel (be syscall-heavy).

SETUP

Hardware: 2x Intel Xeon Gold 6138 (Skylake-SP), 20 cores/socket, 40C/80T
with kernel built from locking/core branch. Baseline _raw_spin_unlock()
is out-of-line via UNINLINE_SPIN_UNLOCK=y. Experiment adds the four
selects above (exact patch is at the end of this message). Cache
geometry (lscpu -C):

NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL  SETS PHY-LINE COHERENCY-SIZE
L1d       32K     1.3M    8 Data            1    64        1             64
L1i       32K     1.3M    8 Instruction     1    64        1             64
L2         1M      40M   16 Unified         2  1024        1             64
L3      27.5M      55M   11 Unified         3 40960        1             64

Per run I collected cycles, instructions and L1i-misses. To stay within
the available PMU counters, each run used only 3 events: cycles,
instructions and one L1i filter (:u or :k). The NMI watchdog was off and
every run reported 100% counter enablement (no multiplexing). Userspace
and kernel misses therefore come from separate runs. Each benchmark was
run 20x per side: 10 with the :u counter, 10 with :k. Cycles,
instructions and throughput are pooled across all 20; each L1i split
comes from its own 10.

KERNEL IMAGE SIZE

To give a sense of the code-footprint increase, scripts/bloat-o-meter on
vmlinux, GCC 11, x86_64, defconfig + CONFIG_PARAVIRT_SPINLOCKS=y:

    Total: Before=23838694, After=23977159, chg +0.58%

ROCKSDB (DELETESEQ)

    db_bench -benchmarks=deleteseq

Metric                       Baseline      Experiment     Delta   Sig
----------------------------------------------------------------------
Instructions (total)    9,574,476,543   9,573,602,441    -0.01%   flat
L1i-miss :k (kernel)      198,588,165     216,672,536    +9.11%   **
L1i-miss :u (userspace)   593,276,235     616,433,813    +3.90%   **
Throughput ops/s            431,398         432,897      +0.35%   ns
Cycles (total)          4,681,002,302   4,665,106,876    -0.34%   ns
IPC                          2.045           2.052       +0.33%   ns
Time elapsed (s)            2.4012          2.3865       -0.62%   ns
----------------------------------------------------------------------
L1i-miss: higher = worse. Throughput: higher = better.
** = beyond per-run noise (+-0.1..0.36%), ns = within noise.

At constant instructions, inlining raises L1i misses +9.11% (kernel) and
+3.90% (userspace), both well beyond noise. Throughput, cycles, IPC and
wall-time all stay within run-to-run noise. So the i-cache cost is real,
but at IPC ~2 db_bench isn't fetch-bound at the app level, so it doesn't
surface.

No benefit from _raw_spin_unlock() inlining.

KERNEL BUILD

Building locking/core (defconfig), GCC 11.

    make -j80

Metric              Baseline      Experiment     Delta   Sig
-------------------------------------------------------------
L1i-miss :k          36.72G        37.51G       +2.16%   **
L1i-miss :u         246.99G       246.06G       -0.38%   **
Sys (s)             478.250       482.420       +0.87%   **
Time elapsed (s)    105.221       105.373       +0.14%   ns
User (s)           4022.046      4024.012       +0.05%   flat
Cycles            8,894.10G     8,902.12G       +0.09%   flat
Instructions      8,424.28G     8,426.48G       +0.03%   flat
IPC                   0.947         0.947       -0.06%   flat
-------------------------------------------------------------
L1i-miss/Sys: higher = worse.
** = beyond per-run noise, ns = within noise.

Kernel i-cache misses (+2.16%) and sys time (+0.87%) both rise and are
significant. Wall-time and userspace L1i are flat. Kernel build is
GCC/userspace-bound (User 4022s vs Sys 478s), so the added kernel fetch
cost is real but appears to sit off the critical path.

No benefit from _raw_spin_unlock() inlining.

NGINX

I ran nginx with taskset -c 2.

    perf stat -C 2 ... -- ab -n 100000 -c 80 http://127.0.0.1:8080/

Config for nginx was the following.

  worker_processes 1;
  error_log /tmp/ngx/error.log;
  pid       /tmp/ngx/nginx.pid;
  events { worker_connections 16384; }
  http {
      access_log off;
      server { listen 8080 reuseport; location / { return 200 "ok\n"; } }
  }

I used nginx version 1.20.1 (prebuilt, from CentOS repo).

Metric              Baseline      Experiment     Delta   Sig
------------------------------------------------------------
req/s (ab)           25,113        24,795       -1.27%   **
L1i MPKI :k          70.06         72.10        +2.92%   **
L1i MPKI :u          20.16         20.66        +2.50%   **
instructions          5.86G         5.83G       -0.50%   **
L1i-miss :k           0.41G         0.42G       +2.44%   **
L1i-miss :u           0.12G         0.12G       +1.95%   **
cycles                4.82G         4.81G       -0.28%   ns
IPC                   1.215         1.213       -0.22%   ns
perf time (s)         4.077         4.129       +1.26%   **
failed reqs              0             0          -      valid
------------------------------------------------------------
req/s: higher=better. MPKI: higher=worse.
** = beyond per-run noise, ns = within noise.

nginx connection-churn is the one workload that is genuinely
kernel-fetch-bound: MPKI:k ~70 and IPC ~1.2 (vs db_bench's 2.05). Here
the cost surfaces: req/s −1.27%. Misses rise in both domains (+2.9%
MPKI:k, +2.5% MPKI:u). Unlike kernel build, userspace is hit too,
because nginx runs user and kernel hot on the same core and the kernel
bloat pollutes the shared L1i.
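
For anyone re-deriving the columns: the MPKI and Delta values follow the
usual definitions (misses per thousand instructions, and relative change
against baseline). A small sketch, with helper names of my own choosing:

```c
#include <assert.h>

/* L1i misses per thousand retired instructions. */
double mpki(double misses, double instructions)
{
	assert(instructions > 0);
	return misses / instructions * 1000.0;
}

/* Relative change of the experiment against the baseline, in percent. */
double delta_pct(double baseline, double experiment)
{
	assert(baseline != 0);
	return (experiment - baseline) / baseline * 100.0;
}
```

With the nginx numbers, delta_pct(25113, 24795) comes out at about
-1.27, and mpki(0.41e9, 5.86e9) at roughly 70; the small gap to the
70.06 in the table is because 0.41G is itself a rounded figure.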

And the kicker: instructions fell 0.5% (inlining removed the call/ret)
yet throughput dropped.

Caveat: ab is single-threaded, so it seems the worker core is
under-saturated: cycles is flat (−0.28%, ns) while wall-time rose
(+1.26%).

Measurable throughput regression from _raw_spin_unlock() inlining.

CONCLUSION

Inlining _raw_spin_unlock() raises kernel L1i misses on every workload.
It's an unconditional cost. Whether it costs the application throughput
depends on how kernel-fetch-bound the workload is.

The cost is real everywhere. It only surfaces as throughput regression
where the kernel is on the fetch critical path. And inlining did not
help in any workload I measured. The one micro-effect inlining produced
(-0.5% instructions on nginx) was erased by the added i-cache pressure.

From 99502328caed3c195e20cf194a1e8aa1563f3896 Mon Sep 17 00:00:00 2001
From: Dmitry Ilvokhin <d@ilvokhin.com>
Date: Thu, 4 Jun 2026 07:43:00 -0700
Subject: [PATCH] x86/locking: Inline the spin_unlock()

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
 arch/x86/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fdaef60b46d6..c9a0638225fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -113,6 +113,10 @@ config X86
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_HAVE_EXTRA_ELF_NOTES
+	select ARCH_INLINE_SPIN_UNLOCK
+	select ARCH_INLINE_SPIN_UNLOCK_BH
+	select ARCH_INLINE_SPIN_UNLOCK_IRQ
+	select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE
 	select ARCH_MEMORY_ORDER_TSO
 	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
-- 
2.53.0-Meta


* Re: [PATCH v4 6/8] string: introduce memcpy_streaming() helpers
From: Li Zhe @ 2026-06-11  9:38 UTC (permalink / raw)
  To: bp
  Cc: akpm, apopple, arnd, dave.hansen, david, kees, linux-arch,
	linux-hardening, linux-kernel, linux-mm, lizhe.67, mingo, rppt,
	tglx, x86
In-Reply-To: <20260610191920.GBaim4uMX3z6OqJwHr@fat_crate.local>

On Wed, Jun 10, 2026 at 12:19:20 -0700, bp@alien8.de wrote:

> On Tue, Jun 09, 2026 at 08:01:32PM +0800, Li Zhe wrote:
> > That said, I see your layering point. If arch/x86/include/asm/string.h
> > is the preferred place for the arch-visible wrapper, I can move the
> > wrapper there in the next revision while keeping the x86_64-specific
> > implementation details in string_64.h.
> 
> No, 64-bit only's fine. We don't put any new features into 32-bit already
> anyway but that wasn't clear from the commit message what your goal is.

Thanks, that makes sense.

> > Thinking about it more, I agree that this is hard to justify for a
> > generic helper. For this series, what really matters is that the
> > struct page copies in patch 8 can use the existing x86
> > memcpy_flushcache() fastpaths where that is beneficial; I do not need
> > patch 6 to impose extra selection policy on unrelated callers.
> 
> What I am asking is, you need to show numbers why those helpers exist.
> 
> Your 0th message is talking about measuring this in VMs. If this workload is
> not VM-specific, then those numbers don't matter. They're just handwaving.
> 
> So I'd need a good justification why we need the changes before we go any
> further.

Understood.

I do not currently have access to physical PMEM hardware on my side, so
the numbers I posted so far were all from a VM-based setup. I agree that
this is not sufficient justification for introducing the helper / x86 nt
part of the series.

For the next resend, I will first split out and resend the mm-only
subset, and drop the helper / x86 nt part for now.

If anyone has access to real PMEM hardware and is willing to test
whether that part shows a measurable benefit there, I would greatly
appreciate it.

Otherwise, is there a preferred way to justify or validate that part
without physical PMEM measurements, or is the right approach simply to
keep it out of the series until such data is available?

Thanks,
Zhe


* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Peter Zijlstra @ 2026-06-11 13:44 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <aiphFXe_TPNPxZ_n@shell.ilvokhin.com>

On Thu, Jun 11, 2026 at 07:17:41AM +0000, Dmitry Ilvokhin wrote:
> On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> > Also, I think someone should go do some performance runs with
> > ARCH_INLINE_SPIN_* set for x86 just like for s390.
> 
> As promised, I set ARCH_INLINE_SPIN_UNLOCK{,_BH,_IRQ,_IRQRESTORE} for
> x86 and measured the effect on a few real workloads.
> 
> Short version: inlining of _raw_spin_unlock() adds measurable kernel
> i-cache pressure on every workload I tried, and on a
> kernel-i-cache-bound one (nginx connection churn) it costs ~1.27%
> throughput. I did not find a workload where it helps.

Thanks for checking!


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: John Ericson @ 2026-06-11 18:53 UTC (permalink / raw)
  To: Mateusz Guzik, Li Chen
  Cc: Andy Lutomirski, Christian Brauner, Kees Cook, Al Viro,
	linux-fsdevel, linux-api, LKML, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <hd3i6pxxohsjesyid7nhuic6ppp6nyoxxpwa4mny6riqvpyqec@mylfprni2yaw>

On Wed, Jun 10, 2026, at 7:40 PM, Mateusz Guzik wrote:
> [...]
>
> As I tried to explain in my previous e-mail this approach does not cut
> it because of NUMA.
>
> Suppose you have a machine with 2 nodes. The parent-to-be is running
> on node 0 and the child is intended to exec something on node 1.
>
> When the parent-to-be allocates and populates stuff, it takes place with
> memory backed by node 0. If you allocate task_struct, the file table and
> other frequently used (and modified!) objs in this way, you are
> guaranteeing performance loss due to interconnect traffic to access it.
>
> Trying to add plumbing so that all allocations respect numa placement is
> probably too cumbersome.

Are we sure that last part is true?

Let's assume the initial implementation does not have that plumbing. If
the basic thrust of this work is to replace functions that previously
only worked on the current thread with ones that work on either an
arbitrary (not yet started) thread or the current thread, would that not
prepare us to migrate the allocation choice over time, so that it
reflects the node of the target task (a new parameter) rather than the
node of the current task?

(This assumes the task is pre-placed on a node before it is actually run
there, and that pre-placement happens as early in the allocation process
as possible, so subsequent allocations can read off the
partially-initialized task's node.)
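
The idea of reading the node off the target task can be sketched in
userspace C. This is only an illustration under stated assumptions:
every name here (demo_task, demo_obj, demo_pick_node, demo_alloc_for,
DEMO_NO_NODE) is hypothetical, and plain malloc() stands in for what
would presumably be a node-aware allocator such as kmalloc_node() in
the kernel.

```c
#include <stdlib.h>

#define DEMO_NO_NODE (-1)	/* hypothetical "not placed yet" marker */

/* Hypothetical stand-in for a not-yet-started task. */
struct demo_task {
	int node;		/* preferred NUMA node, or DEMO_NO_NODE */
};

/* Toy object that only remembers which node it was allocated for. */
struct demo_obj {
	int node;
};

/*
 * Prefer the target task's node once it has been pre-placed; otherwise
 * fall back to the caller's node, which is today's behaviour.
 */
static int demo_pick_node(const struct demo_task *target, int current_node)
{
	if (target && target->node != DEMO_NO_NODE)
		return target->node;
	return current_node;
}

/* Allocate an object on behalf of `target` (malloc() is a placeholder). */
static struct demo_obj *demo_alloc_for(const struct demo_task *target,
				       int current_node)
{
	struct demo_obj *obj = malloc(sizeof(*obj));

	if (obj)
		obj->node = demo_pick_node(target, current_node);
	return obj;
}
```

With a parent on node 0 and a child pre-placed on node 1,
demo_alloc_for() records node 1. For an unplaced or NULL target it
records node 0, so callers that never pass a target see no change.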

"Slowly migrating" is good here! It doesn't need to be the fastest thing
out of the gate, but if this new proper spawning API gets popular as I
think it would, and there is a clear path to optimizing it per the
above, then I am confident that over the years it will happen.

> The primary example for that is looking up the binary to exec in the
> first place.
>
> userspace likes to pass paths which don't exist, meaning checking for
> the binary before any hard work is a useful optimization. Suppose the
> binary to be executed is in a container bound with a taskset using
> node 1 and the content of the fs part of the container is currently
> fully uncached.
>
> When you perform the lookup on node 0, you are populating a bunch of
> metadata (inode, dentry) using memory from that domain. But the intended
> user will only execute on node 1, again resulting in a performance loss.
>
> In order to not do it you would need to convince VFS to allocate memory
> elsewhere.

One thing I don't get about this: isn't the real cost the work of
searching PATH through all the directories where the executable
*doesn't* exist? For something like a shell that is going to spawn a
lot of processes, I would think it is *good* to keep all of that
PATH-crawling VFS population on the shell's node, rather than on the
child processes' nodes.
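
To make the cost of the misses concrete, here is a minimal userspace
sketch of a PATH walk that counts how many directories are probed
before the executable is found. path_search() is a hypothetical helper,
not an existing API. It simplifies in two ways: empty PATH components
are skipped, and access(X_OK) is used as the only check.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: walk a PATH-style list looking for `name`.
 * Returns the number of directories probed without success before the
 * hit, or -1 if nothing matched (or on allocation failure). On a hit
 * the full path is left in `out`.
 */
static int path_search(const char *path, const char *name,
		       char *out, size_t outlen)
{
	char *copy = strdup(path);	/* strtok_r() modifies its input */
	char *save = NULL;
	int misses = 0;
	int found = -1;

	if (!copy)
		return -1;

	for (char *dir = strtok_r(copy, ":", &save); dir;
	     dir = strtok_r(NULL, ":", &save)) {
		int n = snprintf(out, outlen, "%s/%s", dir, name);

		if (n < 0 || (size_t)n >= outlen) {
			misses++;	/* candidate does not fit in `out` */
			continue;
		}
		if (access(out, X_OK) == 0) {
			found = misses;
			break;
		}
		misses++;		/* one more failed lookup */
	}
	free(copy);
	return found;
}
```

Each miss is a lookup that never benefits the child, which is the part
of the crawl argued above to belong on the spawning process's node.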

It is only the executable itself, the final step of the VFS crawl, that
should be loaded into the other NUMA nodes. Insofar as (unless I am
missing something) creating the process means finding the inode for the
executable but not loading those pages, aren't we OK here? Only when the
new process is actually scheduled and run must the ELF be paged into
memory, and then that will happen on the correct node.

> So I stand by my previous claim that ultimately a pristine child has to
> be created (like in this patch), but which also has to do the work on
> its own.

I have not been a kernel dev, so my apologies if I am missing things.
But my conclusion is that the FS and other resource access patterns of
*creating a process* and of *that process itself running* do not
necessarily coincide. What you describe as a definite problem might
even be a *good thing*, if the two are in fact quite different.

> Suppose there is no explicit placement requested anywhere. Even in that
> case there are legitimate workloads which will eventually be forced to
> exec stuff on another node. Even these have a better chance of retaining
> full locality if the child process does all the work.
>
> Per my previous message I don't see a clean interface to do it.
> something quasi-posix_spawn is probably the least bad way out, it will
> also allow userspace to easily wrap the new thing with posix_spawn
> itself.
>
> Also note there is another issue with the fd-based approach: the fd will
> get inherited on fork and will hang out in the child afterwards unless
> explicitly closed. Suppose you have a multithreaded program which likes
> to both fork(+no exec) and fork+exec. With the fd-based approach you
> have no means of stopping another thread from grabbing your state thanks
> to unix defaulting to copying everything. There was an attempt to fix
> this aspect with O_CLOFORK, but this got rejected.

I would think we don't need to worry about clone/fork very much, right?
I think the premise of your emails, and just about everyone else's in
this thread too, is that we agree fork+exec is bad, and the problem of
unnecessarily sharing resources is inherent to fork. Furthermore, I
think we all agree that while `O_CLOEXEC` and `O_CLOFORK` may help, both
are unsatisfying solutions because they are opt-out not opt-in, and
global to the parent process / preexec state (respectively) rather than
local to the specific fork / exec in question.
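
The opt-out nature of the flag can be shown with a small userspace
sketch using only open(2) and fcntl(2). The helper names
(fd_is_cloexec, open_default, open_cloexec) are made up for
illustration. O_CLOFORK is deliberately not used, since Linux does not
provide it.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd is marked close-on-exec, 0 if not, -1 on error. */
static int fd_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Default open: the descriptor survives exec unless someone opts out. */
static int open_default(const char *path)
{
	return open(path, O_RDONLY);
}

/* Opt-out open: the caller has to remember the flag at every call site. */
static int open_cloexec(const char *path)
{
	return open(path, O_RDONLY | O_CLOEXEC);
}
```

A descriptor from open_default() reports 0 and one from open_cloexec()
reports 1. The flag is a property of the descriptor, so it applies to
every later exec in the process rather than to one particular spawn.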

pidfds encounter these problems no more than any other
file-descriptor-based UAPI, right? And I don't think it is good to blame
any such file-descriptor-based UAPI when fork/exec are at fault.

Maybe during the transition, when some things use fork and some things
use this new API, stuff will be awkward, but I would rather that just be
an incentive to complete the transition away from fork, not a reason to
second-guess the plan.

Once the transition is complete, and everyone is diligently assembling
their child processes from scratch as is proposed, `O_CLOEXEC` and
`O_CLOFORK` are both unneeded, and oversharing privileges will be much
less common simply because "lazy coding"/"minimal typing" will only
share what is needed --- anything else is more code/keystrokes!

> Whatever exactly happens, NUMA is a sad fact of computing and needs to
> be accounted for. The approach as proposed not only does not do it, but
> it actively hinders such deployments.

Despite everything I said, I want to be clear that I do agree that NUMA
performance should be accounted for. Even if the first version isn't as
great as it could be on that metric, there should be a clear plan for
how future work can conclusively address it.

Cheers,

John

* Re: [PATCH] vfs: missing inode operation should return a consistent error code
From: Jori Koolstra @ 2026-06-11 20:31 UTC (permalink / raw)
  To: Jan Kara, Christian Brauner
  Cc: Jeff Layton, Alexander Viro, Arnd Bergmann,
	open list:FILESYSTEMS (VFS and infrastructure), open list,
	open list:GENERIC INCLUDE/ASM HEADER FILES
In-Reply-To: <m5fqzr2ny4zb36u6zhmrrxgl36ycsxvlqnzf5idvsq4lxpfh3i@g276qtqgfv3f>

@Christian, since you suggested equalizing the error codes for missing
inode ops, what is your opinion?

> Op 01-06-2026 18:50 CEST schreef Jan Kara <jack@suse.cz>:
> 
> We certainly can (and sometimes do) modify the returned errors. It is
> always just a balancing act between the benefit of the change and chances
> somebody will get broken by it. In this case I don't quite see the
> benefit, not that I'd be too worried about a regression but there's
> always the chance...
> 
> 								Honza

No, I get that, and maybe you are right. I feel in this case it would be
really odd if this breaks anyone's application. It would mean you are
rigorously testing for receiving specific error codes, without handling
a general error case. That just seems odd, especially since some of these
error codes are not even listed as a possibility for missing ops (like EACCES).
I would say we change this and revert if anyone complains. I believe that was
also Christian's view in another thread, but I may be mistaken.

But regardless, it's just a nice-to-have, and I can definitely live without it.
It is just a clean-up I came across while working on O_CREAT|O_DIRECTORY.

Best,
Jori.

* Re: [PATCH] audit: add missing syscalls to PERM class tables
From: kernel test robot @ 2026-06-12  1:20 UTC (permalink / raw)
  To: Ricardo Robaina, audit, linux-kernel, linux-arch
  Cc: oe-kbuild-all, paul, eparis, arnd, sgrubb, Ricardo Robaina
In-Reply-To: <20260610164719.2668906-1-rrobaina@redhat.com>

Hi Ricardo,

kernel test robot noticed the following build errors:

[auto build test ERROR on pcmoore-audit/next]
[also build test ERROR on linus/master v7.1-rc7 next-20260610]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ricardo-Robaina/audit-add-missing-syscalls-to-PERM-class-tables/20260611-024240
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit.git next
patch link:    https://lore.kernel.org/r/20260610164719.2668906-1-rrobaina%40redhat.com
patch subject: [PATCH] audit: add missing syscalls to PERM class tables
config: riscv-randconfig-001-20260611 (https://download.01.org/0day-ci/archive/20260612/202606120946.0eNy5YXB-lkp@intel.com/config)
compiler: riscv32-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260612/202606120946.0eNy5YXB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606120946.0eNy5YXB-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from lib/audit.c:23:
>> include/asm-generic/audit_change_attr.h:52:1: error: '__NR_utimensat' undeclared here (not in a function); did you mean 'vfs_utimes'?
    __NR_utimensat,
    ^~~~~~~~~~~~~~
    vfs_utimes

vim +52 include/asm-generic/audit_change_attr.h

  > 52	__NR_utimensat,
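
For context, the error is what one would expect from an unguarded table
entry on an ABI that does not define that syscall number. The riscv32
headers presumably expose only the time64 variant of utimensat. Below is
a minimal userspace sketch of the guarded-table pattern. demo_class and
class_contains() are hypothetical names, and the ~0U terminator follows
what lib/audit.c appears to use for its class arrays.

```c
#include <stddef.h>
#include <sys/syscall.h>

/*
 * Userspace imitation of a PERM class table. Syscall numbers that may be
 * missing on some ABIs are wrapped in #ifdef, so the table still builds
 * where they are absent.
 */
static const unsigned int demo_class[] = {
	__NR_quotactl,
#ifdef __NR_quotactl_fd
	__NR_quotactl_fd,
#endif
#ifdef __NR_utimensat
	__NR_utimensat,
#endif
	~0U	/* terminator */
};

/* Linear scan up to the terminator; returns 1 if nr is in the table. */
static int class_contains(const unsigned int *tbl, unsigned int nr)
{
	for (size_t i = 0; tbl[i] != ~0U; i++)
		if (tbl[i] == nr)
			return 1;
	return 0;
}
```

Without the guard around __NR_utimensat, this table would fail to
compile on such a configuration in the same way as reported above.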

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* [PATCH v2] audit: add missing syscalls to PERM class tables
From: Ricardo Robaina @ 2026-06-12 14:14 UTC (permalink / raw)
  To: audit, linux-kernel, linux-arch
  Cc: paul, eparis, arnd, sgrubb, Ricardo Robaina

Add missing file metadata syscalls to the audit PERM class tables,
addressing gaps where certain file operations were not properly
classified for audit rule matching.

Changes:
- audit_change_attr.h: Add file_setattr

- audit_read.h: Add quotactl_fd, file_getattr, stat, stat64, lstat,
  lstat64, fstat, fstat64, newfstatat, fstatat64, and statx

- audit_write.h: Add quotactl_fd

Architecture-specific and conditionally-compiled syscalls are guarded
with #ifdef.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
---
Changes in v2:
- Added stat64 family syscalls (stat64, lstat64, fstat64, fstatat64) to
  audit_read.h for 32-bit architecture support.
- Dropped timestamp-related syscalls (utime, utimes, utimensat, etc.)
  because of their potential impact on audit log volume. Those will be
  addressed in a separate patch after closer investigation.

 include/asm-generic/audit_change_attr.h |  3 +++
 include/asm-generic/audit_read.h        | 31 +++++++++++++++++++++++++
 include/asm-generic/audit_write.h       |  3 +++
 3 files changed, 37 insertions(+)

diff --git a/include/asm-generic/audit_change_attr.h b/include/asm-generic/audit_change_attr.h
index ddd90bbe40df..94388da3490c 100644
--- a/include/asm-generic/audit_change_attr.h
+++ b/include/asm-generic/audit_change_attr.h
@@ -40,3 +40,6 @@ __NR_link,
 #ifdef __NR_linkat
 __NR_linkat,
 #endif
+#ifdef __NR_file_setattr
+__NR_file_setattr,
+#endif
diff --git a/include/asm-generic/audit_read.h b/include/asm-generic/audit_read.h
index fb9991f53fb6..d8dc3dd6bf63 100644
--- a/include/asm-generic/audit_read.h
+++ b/include/asm-generic/audit_read.h
@@ -3,6 +3,9 @@
 __NR_readlink,
 #endif
 __NR_quotactl,
+#ifdef __NR_quotactl_fd
+__NR_quotactl_fd,
+#endif
 __NR_listxattr,
 #ifdef __NR_listxattrat
 __NR_listxattrat,
@@ -18,3 +21,31 @@ __NR_fgetxattr,
 #ifdef __NR_readlinkat
 __NR_readlinkat,
 #endif
+#ifdef __NR_file_getattr
+__NR_file_getattr,
+#endif
+#ifdef __NR_stat
+__NR_stat,
+#endif
+#ifdef __NR_stat64
+__NR_stat64,
+#endif
+#ifdef __NR_lstat
+__NR_lstat,
+#endif
+#ifdef __NR_lstat64
+__NR_lstat64,
+#endif
+#ifdef __NR_fstat
+__NR_fstat,
+#endif
+#ifdef __NR_fstat64
+__NR_fstat64,
+#endif
+#ifdef __NR_newfstatat
+__NR_newfstatat,
+#endif
+#ifdef __NR_fstatat64
+__NR_fstatat64,
+#endif
+__NR_statx,
diff --git a/include/asm-generic/audit_write.h b/include/asm-generic/audit_write.h
index f9f1d0ae11d9..378128dc31e3 100644
--- a/include/asm-generic/audit_write.h
+++ b/include/asm-generic/audit_write.h
@@ -5,6 +5,9 @@ __NR_acct,
 __NR_swapon,
 #endif
 __NR_quotactl,
+#ifdef __NR_quotactl_fd
+__NR_quotactl_fd,
+#endif
 #ifdef __NR_truncate
 __NR_truncate,
 #endif
-- 
2.53.0


* Re: [PATCH bpf-next] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
From: patchwork-bot+netdevbpf @ 2026-06-13  3:40 UTC (permalink / raw)
  To: Gabriele Monaco
  Cc: ast, daniel, andrii, eddyz87, memxor, arnd, bpf, linux-arch,
	linux-kernel, stable, longman
In-Reply-To: <20260610090431.32427-1-gmonaco@redhat.com>

Hello:

This patch was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Wed, 10 Jun 2026 11:04:29 +0200 you wrote:
> raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
> restores interrupts, this means preemption is enabled when interrupts
> are still disabled (as part of raw_res_spin_unlock()) so this cannot
> trigger an actual preemption.
> This is inconsistent with other spinlock implementations
> (raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
> itself).
> 
> [...]

Here is the summary with links:
  - [bpf-next] rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
    https://git.kernel.org/bpf/bpf-next/c/b48bd16eb9fc

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



* Re: [PATCH] init/main: Expose built-in initcalls and blacklist status via debugfs
From: Aaron Tomlin @ 2026-06-13 21:05 UTC (permalink / raw)
  To: arnd, mcgrof, petr.pavlu, da.gomez, samitolvanen
  Cc: neelx, kees, peterz, akpm, sean, chjohnst, steve, mproche,
	nick.lane, linux-arch, linux-modules, linux-kernel
In-Reply-To: <20260510061301.41341-1-atomlin@atomlin.com>

On Sun, May 10, 2026 at 02:13:01AM -0400, Aaron Tomlin wrote:
> At present, identifying the correct function name to supply to the
> "initcall_blacklist=" kernel command-line parameter requires manual
> inspection of the source code or kernel symbol tables. Furthermore,
> administrators lack a reliable runtime mechanism to verify whether a
> specified built-in module has been successfully blacklisted.
> 
> To resolve this, introduce a new debugfs interface at
> /sys/kernel/debug/modules/builtin_initcalls. This file enumerates all
> built-in modules alongside their corresponding initialisation callbacks
> (e.g., those specified by module_init()) in a simple format:
> "module_name init_callback". If a built-in module has been actively
> blacklisted, the entry is explicitly appended with a " [blacklisted]"
> suffix.

Dear maintainers,

I am writing to politely follow up on this patch, as it has been just over
a month since its initial submission.

To briefly reiterate, this patch introduces a reliable runtime mechanism to
identify built-in initcalls and verify their blacklisted status, thereby
significantly improving the usability of the "initcall_blacklist="
parameter.

I would be most grateful for any feedback, or to know whether any further
refinements are required for it to be considered for inclusion.

Kind regards,
-- 
Aaron Tomlin
