Linux userland API discussions
* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Jann Horn @ 2026-06-09 17:53 UTC
  To: Florian Weimer
  Cc: Mateusz Guzik, Christian Brauner, Li Chen, Kees Cook,
	Alexander Viro, linux-fsdevel, linux-api, linux-kernel, linux-mm,
	linux-arch, linux-doc, linux-kselftest, x86, Arnd Bergmann,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <lhubjdk1c1m.fsf@oldenburg.str.redhat.com>

On Tue, Jun 9, 2026 at 8:08 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Jann Horn:
>
> >> Per the above, the primary win would stem from *NOT* messing with mm.
> >
> > As you write below, I think we have that with CLONE_VM? The C function
> > vfork() is kind of a terrible API because of its returns-twice
> > behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> > wrapped by libc in a way similar to clone() (with the child executing
> > a separate handler function), or if it was used in the implementation
> > of some higher-level process-spawning API, it would be a perfectly
> > fine API?
>
> No, there is still a problem with SIGTSTP handling because we cannot
> atomically unmask the signal during execve.  We need to unblock SIGTSTP
> before execve in the new process, but this means that it can get
> suspended by SIGTSTP.  Consequently, the execve never happens and the
> original process is stuck in vfork:
>
>   posix_spawn: parent can get stuck in uninterruptible sleep if child
>   receives SIGTSTP early enough
>   <https://inbox.sourceware.org/libc-help/2921668c-773e-465d-9480-0abb6f979bf9@www.fastmail.com/>
>
> More on the low-level side, it's difficult to make sure that execve gets
> a consistent snapshot of the environ vector.  Both vfork and execve need
> to be async-signal-safe.  Any locking or memory allocation (except for
> the stack …) persists in the original process after vfork returns.  The

I think that's not entirely accurate; if you call set_robust_list() on
a futex list, then call execve(), the futexes should be released once
the process switches to a new MM, in
begin_new_exec -> exec_mmap -> exec_mm_release -> futex_exec_release
-> futex_cleanup -> exit_robust_list.

So in theory you could use clone() with CLONE_VM and without
CLONE_VFORK, and let the parent either wait for a futex that is
released on exec, or somehow asynchronously check later whether the
futex is still held... probably not the nicest building block but
maybe workable? Though I guess it would fit more nicely if there was a
"munmap() this range on exec" API...

> environ vector can be large, so making a copy on the stack is not ideal.
> It's even harder for getenv/setenv/unsetenv implementations that use
> locking instead of software transactional memory.

Makes sense, that kind of sounds like a pain inherent in being able to
execute from signal handler context...


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: John Ericson @ 2026-06-09 17:27 UTC
  To: Li Chen, Andy Lutomirski
  Cc: Christian Brauner, Kees Cook, Al Viro, linux-fsdevel, linux-api,
	LKML, linux-mm, linux-arch, linux-doc, linux-kselftest, x86,
	Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <19eacd64508.26b92c022125848.262962729296162879@linux.beauty>



On Tue, Jun 9, 2026, at 10:43 AM, Li Chen wrote:
> Hi Andy,
>
> ---- On Tue, 09 Jun 2026 08:01:57 +0800  Andy Lutomirski <luto@kernel.org> wrote ---
> > [...]
> >
> > After contemplating this for a bit... why pidfd?  Doesn't a pidfd
> > refer to an actual process that is, or at least was, running?  This
> > new thing is a process that we are contemplating spawning.  I can
> > imagine that basically all pidfd APIs would be a bit confused by the
> > nonexistence of the process in question.
> >
>
> Yes, I think that is a real concern.
>
> In my current local WIP I tried to keep that distinction explicit.
> pidfd_spawn_open() returns a pidfs-backed builder fd, not a normal pidfd
> referring to a process. The builder fd is allocated as an anonymous pidfs
> file with builder-specific file operations:
>
>     file = pidfs_alloc_anon_file("[pidfd_spawn]",
>                                  &pidfd_spawn_builder_fops, builder,
>                                  O_RDWR);
>

What does your builder fd point to, explicitly? For example in my other reply I
talked about how it was "real" process state. In my FreeBSD patch, for example,
I found there was already a status for a process "in exec", and I figured that
was clean to reuse for one of these "embryonic" processes that also hadn't
started running. I would reckon that Linux probably has some similar notions.

> and the normal pidfd helpers still reject it because it does not use the
> ordinary pidfd file operations:
>
>     struct pid *pidfd_pid(const struct file *file)
>     {
>         if (file->f_op != &pidfs_file_operations)
>             return ERR_PTR(-EBADF);
>         return file_inode(file)->i_private;
>     }
>
> So the current split is:
>
>     builder_fd = pidfd_spawn_open(...);       /* builder object */
>     pidfd_config(builder_fd, ...);
>     child_pidfd = pidfd_spawn_run(builder_fd, ...); /* real pidfd */
>
> Only the last fd is a normal pidfd for an actual child process. The builder
> fd is only accepted by the builder operations.
>
> This avoids having to define what waitid(P_PIDFD), pidfd_send_signal(),
> pidfd_getfd(), poll(), etc. mean before the process exists.

I wouldn't be so sure this is necessary/good. For example, I think it could
make sense to wait on a process that has yet to be started; one just waits for
both the process to start and the process to exit. Obviously a blocking syscall
in the thread that is spawning the process is not useful, but the asynchronous
poll variation seems fine.

As long as there is real process state here, it shouldn't be too hard to
implement.
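
For reference, here is a minimal sketch of what that asynchronous wait looks
like today on an ordinary pidfd, i.e. for a process that already exists. How
it would behave for a not-yet-started process is exactly the open question,
so nothing below says anything about the proposed builder API. The helper
name wait_via_pidfd() is made up for illustration; pidfd_open(), poll() and
waitid(P_PIDFD) are the existing interfaces.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3	/* older libc headers lack this idtype_t value */
#endif

/*
 * Hypothetical helper: forks a child that exits with `code`, waits for it
 * by polling a pidfd, and returns the exit code reported by waitid(), or
 * -1 on error.
 */
static int wait_via_pidfd(int code)
{
	struct pollfd pfd = { .events = POLLIN };
	siginfo_t info = { 0 };
	int ret = -1;
	pid_t pid;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);

	/* No libc wrapper is assumed; call the syscall directly. */
	pfd.fd = (int)syscall(SYS_pidfd_open, pid, 0U);
	if (pfd.fd < 0) {
		waitpid(pid, NULL, 0);	/* still reap the child */
		return -1;
	}

	/* A pidfd polls readable once the process has terminated. */
	if (poll(&pfd, 1, -1) == 1 && (pfd.revents & POLLIN) &&
	    waitid(P_PIDFD, (id_t)pfd.fd, &info, WEXITED) == 0 &&
	    info.si_code == CLD_EXITED)
		ret = info.si_status;

	close(pfd.fd);
	return ret;
}
```

In an event loop the poll() call would of course be replaced by registering
the fd with epoll or similar; the blocking poll() here just keeps the sketch
short.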

> The downside is that it adds a separate open-style entry point and is less
> uniform than the pidfd_open(0, PIDFD_EMPTY) spelling Christian sketched.

I do think there is no point having two file descriptors. The file descriptor
that previously referred to the builder/embryonic process then can refer to the
real process, right?

> If people think there is a better way to represent the pre-spawn builder
> state, or if the preference is to integrate it directly into pidfd_open()
> with an explicit empty/future-pidfd state, I would be happy to discuss that.

Hope the above answers your question? I suppose my ideas lean more on the
"future" than "empty" side --- there is indeed a thread in the thread group,
with real VM/namespace/file descriptor etc. state. Moreover, state gets
initialized before the process is started, so the actual start is a pretty
lightweight step of just letting the scheduler know the now-ready process can
be scheduled. The only thing that distinguishes the embryonic process from a
real one is simply that it isn't running --- i.e. isn't (yet) available to be
scheduled --- so the pidfd holders are free to poke at its state.

Cheers,

John


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-09 14:43 UTC
  To: Andy Lutomirski
  Cc: Christian Brauner, Kees Cook, Alexander Viro, linux-fsdevel,
	linux-api, linux-kernel, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <CALCETrWJQpLR4n1cpichBk8=uExSKLWTMGU3BufGdk_WE_p5UA@mail.gmail.com>

Hi Andy,

 ---- On Tue, 09 Jun 2026 08:01:57 +0800  Andy Lutomirski <luto@kernel.org> wrote --- 
 > On Thu, May 28, 2026 at 4:05 AM Christian Brauner <brauner@kernel.org> wrote:
 > >
 > > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
 > > > Hi,
 > > >
 > > > This is an early RFC for an idea that is probably still rough in both the
 > > > UAPI and implementation details. Sorry for the rough edges; I am sending
 > > > it now to check whether this direction is worth pursuing and to get
 > > > feedback on the kernel/userspace boundary.
 > >
 > > The idea of having a builder api for exec isn't all that crazy. But it
 > > should simply be built on top of pidfds and thus pidfs itself instead.
 > > It has all the basic infrastructure in place already. Any implementation
 > > should also allow userspace to implement posix_spawn() on top of it.
 > >
 > > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
 > >
 > > pidfd_config(fd, ...) // modeled similar to fsconfig()
 > >
 > 
 > After contemplating this for a bit... why pidfd?  Doesn't a pidfd
 > refer to an actual process that is, or at least was, running?  This
 > new thing is a process that we are contemplating spawning.  I can
 > imagine that basically all pidfd APIs would be a bit confused by the
 > nonexistence of the process in question.
 > 

Yes, I think that is a real concern.

In my current local WIP I tried to keep that distinction explicit.
pidfd_spawn_open() returns a pidfs-backed builder fd, not a normal pidfd
referring to a process. The builder fd is allocated as an anonymous pidfs
file with builder-specific file operations:

    file = pidfs_alloc_anon_file("[pidfd_spawn]",
                                 &pidfd_spawn_builder_fops, builder,
                                 O_RDWR);

and the normal pidfd helpers still reject it because it does not use the
ordinary pidfd file operations:

    struct pid *pidfd_pid(const struct file *file)
    {
        if (file->f_op != &pidfs_file_operations)
            return ERR_PTR(-EBADF);
        return file_inode(file)->i_private;
    }

So the current split is:

    builder_fd = pidfd_spawn_open(...);       /* builder object */
    pidfd_config(builder_fd, ...);
    child_pidfd = pidfd_spawn_run(builder_fd, ...); /* real pidfd */

Only the last fd is a normal pidfd for an actual child process. The
builder fd is only accepted by the builder operations.

This avoids having to define what waitid(P_PIDFD), pidfd_send_signal(),
pidfd_getfd(), poll(), etc. mean before the process exists. The downside
is that it adds a separate open-style entry point and is less uniform than
the pidfd_open(0, PIDFD_EMPTY) spelling Christian sketched.

If people think there is a better way to represent the pre-spawn builder
state, or if the preference is to integrate it directly into pidfd_open()
with an explicit empty/future-pidfd state, I would be happy to discuss
that.

Regards,
Li



* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Florian Weimer @ 2026-06-09  6:08 UTC
  To: Jann Horn
  Cc: Mateusz Guzik, Christian Brauner, Li Chen, Kees Cook,
	Alexander Viro, linux-fsdevel, linux-api, linux-kernel, linux-mm,
	linux-arch, linux-doc, linux-kselftest, x86, Arnd Bergmann,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <CAG48ez38OEE8ZPLyU6nr9=cYx-hMsdoh5WRrv-GMZGMDKyyOTA@mail.gmail.com>

* Jann Horn:

>> Per the above, the primary win would stem from *NOT* messing with mm.
>
> As you write below, I think we have that with CLONE_VM? The C function
> vfork() is kind of a terrible API because of its returns-twice
> behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> wrapped by libc in a way similar to clone() (with the child executing
> a separate handler function), or if it was used in the implementation
> of some higher-level process-spawning API, it would be a perfectly
> fine API?

No, there is still a problem with SIGTSTP handling because we cannot
atomically unmask the signal during execve.  We need to unblock SIGTSTP
before execve in the new process, but this means that it can get
suspended by SIGTSTP.  Consequently, the execve never happens and the
original process is stuck in vfork:

  posix_spawn: parent can get stuck in uninterruptible sleep if child
  receives SIGTSTP early enough
  <https://inbox.sourceware.org/libc-help/2921668c-773e-465d-9480-0abb6f979bf9@www.fastmail.com/>

More on the low-level side, it's difficult to make sure that execve gets
a consistent snapshot of the environ vector.  Both vfork and execve need
to be async-signal-safe.  Any locking or memory allocation (except for
the stack …) persists in the original process after vfork returns.  The
environ vector can be large, so making a copy on the stack is not ideal.
It's even harder for getenv/setenv/unsetenv implementations that use
locking instead of software transactional memory.

In general, I prefer the vfork+execve API over things like posix_spawn
because eventually, you have dependencies between the syslets, or need
control flow.  This introduces a lot of complexity.  Conceptually,
vfork+execve is much simpler, and in many ways quite safe (even mutexes
work as long as they do not need a correct TID).

Thanks,
Florian



* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Andy Lutomirski @ 2026-06-09  0:01 UTC
  To: Christian Brauner
  Cc: Li Chen, Kees Cook, Alexander Viro, linux-fsdevel, linux-api,
	linux-kernel, linux-mm, linux-arch, linux-doc, linux-kselftest,
	x86, Arnd Bergmann, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <20260528-madig-fachrichtung-fehlinformation-61117ba640da@brauner>

On Thu, May 28, 2026 at 4:05 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > Hi,
> >
> > This is an early RFC for an idea that is probably still rough in both the
> > UAPI and implementation details. Sorry for the rough edges; I am sending
> > it now to check whether this direction is worth pursuing and to get
> > feedback on the kernel/userspace boundary.
>
> The idea of having a builder api for exec isn't all that crazy. But it
> should simply be built on top of pidfds and thus pidfs itself instead.
> It has all the basic infrastructure in place already. Any implementation
> should also allow userspace to implement posix_spawn() on top of it.
>
> fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
>
> pidfd_config(fd, ...) // modeled similar to fsconfig()
>

After contemplating this for a bit... why pidfd?  Doesn't a pidfd
refer to an actual process that is, or at least was, running?  This
new thing is a process that we are contemplating spawning.  I can
imagine that basically all pidfd APIs would be a bit confused by the
nonexistence of the process in question.


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: John Ericson @ 2026-06-08 23:06 UTC
  To: Li Chen, Christian Brauner
  Cc: Kees Cook, Al Viro, linux-fsdevel, linux-api, LKML, linux-mm,
	linux-arch, linux-doc, linux-kselftest, x86, Arnd Bergmann,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan
In-Reply-To: <19e8113d290.893abab26142069.5024234139508454104@linux.beauty>

Hi all,

I am happy to see this thread appear. I emailed Christian and others ~5 years
ago about this in this thread[1]; it would be great to see it finally happen!

I very much agree that the new process spawning should be pidfd based. I also
want to emphasize that the crux of the matter is that code needed to set up the
initial unscheduled process --- which I do think should be "real state" and
more than a mere template --- is currently chopped up between clone and exec.
So the real meat of the implementation would be factoring out a bunch of stuff
so it can be reused in both the legacy clone+exec and modern code paths.

I'll say a bit more about this "real state" vs "mere template" distinction,
which is that the latter is effectively some sort of ad-hoc operation batching
language, and always runs the risk of falling behind what the kernel actually
supports. The "real state" approach, where we have honest-to-goodness process
state, just in some partially initialized fashion and thus it's not yet
scheduled, always supports everything the kernel supports in principle.

Yes, alternative syscalls that specify which "embryonic" process to operate on
(as opposed to always operating on the current active process) need to be
created, but that is less bad
than trying to stuff things into flags etc. for a single existing system call,
and also one can imagine a world (as described in
https://catern.com/rsys21.pdf) where the exact "which process?" parameter
starts getting added to new process modifying machinery by *default*, with a
sentinel value analogous to `AT_FDCWD` used to mean "the current process" for
the legacy used-between-fork-and-exec usecase.

---

Anyways, years ago, after taking a glance at the relevant code in Linux and
FreeBSD, I figured that it would be easier for me personally to first implement
this functionality in FreeBSD, and then, once I had a feel for some of the
refactoring, take a stab at it in Linux. This is because Linux's feature set,
especially things like `binfmt_misc`, makes its clone and exec quite a bit more
complex, and thus the (IMO) necessary heavy refactoring quite a bit more
extensive too.

I never got around to it in those 5 years, but these days, with LLMs, doing an
"exploratory refactor" (to get a sketch of a patch that is fodder for
discussion, not yet fit for actual submission) is much easier. So, inspired by
this thread, I
took a few hours to do the exploratory FreeBSD refactor in [2]. The man page for
the new syscalls, [3], might be a good place to start reading. (This, being from
a FreeBSD patch, describes the change in terms of "proc fds", but the switch to
Linux's "pidfds" should be self-explanatory. The former after all inspired the
latter.)

Hope discussion of such a patch isn't too off topic here, but there is an
interesting thing to note that would also apply to a Linux implementation. It
took *more* factored out helper functions than I thought. The current count is
over 15(!) --- there didn't seem to be a way to build both the old and new way
of doing things with fewer, coarser building blocks. Now, granted, maybe
someone more familiar with either kernel than me could do a better job, but I
think it will still be a number of functions. This indicates just how much
untangling there is to do. And the number will surely be much higher for Linux.

[1]: https://lore.kernel.org/all/f8457e20-c3cc-6e56-96a4-3090d7da0cb6@JohnEricson.me/

[2]: https://github.com/obsidiansystems/freebsd-src/commit/better-proc-spawn
     239dcdefe6ad244e58d998155b527375e5293ff7 for posterity

[3]: https://raw.githubusercontent.com/obsidiansystems/freebsd-src/refs/heads/better-proc-spawn/lib/libsys/proc_new.2


On Sun, May 31, 2026, at 10:47 PM, Li Chen wrote:
> Hi Christian,
>
> Thanks a lot for your great review!
>
> ---- On Thu, 28 May 2026 19:02:53 +0800  Christian Brauner <brauner@kernel.org> wrote ---
> > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > > Hi,
> > >
> > > This is an early RFC for an idea that is probably still rough in both the
> > > UAPI and implementation details. Sorry for the rough edges; I am sending
> > > it now to check whether this direction is worth pursuing and to get
> > > feedback on the kernel/userspace boundary.
> >
> > The idea of having a builder api for exec isn't all that crazy. But it
> > should simply be built on top of pidfds and thus pidfs itself instead.
> > It has all the basic infrastructure in place already.
>
> Yes, that makes a lot more sense. I was staring too hard at the "hot
> executable" part and made the cache/template the API, which was probably
> the wrong thing to expose. Sorry about that.
>
> > Any implementation
> > should also allow userspace to implement posix_spawn() on top of it.
>
> That's so cool, and this is a really useful point. I had not thought about this as
> something that could sit under posix_spawn(), but that makes the target
> much clearer. It should be a generic exec/spawn builder first, and the
> agent use case should just be one user of it.
>
> > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
> >
> > pidfd_config(fd, ...) // modeled similar to fsconfig()
>
> Reusing pidfd_open() with an empty target is nice because it keeps the API close
> to pidfds, but I wonder if a separate entry point such as
> pidfd_spawn_open() or pidfd_create() would make the "new process
> builder" case a bit more explicit? Either way, the configuration side
> being fsconfig-like makes sense to me.

Yeah, check out my syscalls [3] on that front. It's important to design the
workflow / state machine in a good way. Performance/efficiency, security (share
less state/privileges by default!), and extensibility (where will newer
concepts, like a new type of namespace, fit in?) are all competing concerns,
but I think they mostly pull in the same direction. (Only no ambient authority,
back compat, and extensibility exist in some tension.)

> Thanks again for pointing me in this direction. It helps a lot.
>
> Regards,
> Li

Glad you are sold on pidfds, and more broadly, best of luck! You'll be a hero
to everyone else that has wanted this over the years :)

John


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-06-08 18:42 UTC
  To: Alexander Gordeev, Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, Pedro Falcato, Miklos Szeredi, patches, linux-s390,
	linux-next
In-Reply-To: <20260608171917.3195488Afc-agordeev@linux.ibm.com>

On 6/8/26 19:19, Alexander Gordeev wrote:
> On Sun, May 31, 2026 at 01:01:06AM +0000, Askar Safin wrote:
>> vmsplice behavior on writable pipe became equivalent to pwritev2.
>> vmsplice behavior on readable pipe already was nearly
>> equivalent to preadv2, but I made this explicit. I. e. I made it
>> obvious from code that vmsplice now is equivalent to preadv2/pwritev2.
>>
>> Also I moved vmsplice to fs/read_write.c, because now it arguably
>> belongs there.
>>
>> Note that SPLICE_F_NONBLOCK behavior slightly changed: previously
>> vmsplice ignored whether the pipe was opened with O_NONBLOCK, and mode
>> of operation depended on whether SPLICE_F_NONBLOCK was passed only.
>> Now the operation will be non-blocking if O_NONBLOCK was passed when
>> opening *or* SPLICE_F_NONBLOCK was passed to vmsplice. Previous
>> behavior was arguably buggy, and new behavior is arguably better.
>>
>> Now SPLICE_F_GIFT is always ignored by all 3 syscalls: splice, tee
>> and vmsplice.
>>
>> Signed-off-by: Askar Safin <safinaskar@gmail.com>
>> ---
>>  fs/read_write.c          |  23 +++++
>>  fs/splice.c              | 192 +--------------------------------------
>>  include/linux/skbuff.h   |   4 +-
>>  include/linux/splice.h   |   2 +-
>>  include/linux/syscalls.h |   4 +-
>>  5 files changed, 29 insertions(+), 196 deletions(-)
> 
> Hi All,
> 
> This patch as commit e2c0b2368081b ("vmsplice: make vmsplice a trivial
> wrapper for preadv2/pwritev2") in linux-next on s390 causes the selftest
> tools/testing/selftests/mm/cow.c to hang:
> 
> # [RUN] vmsplice() + unmap in child ... with PTE-mapped THP (128 kB)
> 
> Recently there have been changes in the THP area, so the problem is not
> necessarily linked to this patch per se.

If we reach 128 kB, then 64 kB likely worked. Which might hint at a similar problem
as found by the vmsplice01 ltp test case (blocking instead of returning once the
pipe is full).

https://lore.kernel.org/r/20260603-raumfahrt-unmerklich-ertrugen-c4ecae70d5f9@brauner


-- 
Cheers,

David


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Alexander Gordeev @ 2026-06-08 17:19 UTC
  To: Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches, linux-s390, linux-next
In-Reply-To: <20260531010107.1953702-3-safinaskar@gmail.com>

On Sun, May 31, 2026 at 01:01:06AM +0000, Askar Safin wrote:
> vmsplice behavior on writable pipe became equivalent to pwritev2.
> vmsplice behavior on readable pipe already was nearly
> equivalent to preadv2, but I made this explicit. I. e. I made it
> obvious from code that vmsplice now is equivalent to preadv2/pwritev2.
> 
> Also I moved vmsplice to fs/read_write.c, because now it arguably
> belongs there.
> 
> Note that SPLICE_F_NONBLOCK behavior slightly changed: previously
> vmsplice ignored whether the pipe was opened with O_NONBLOCK, and mode
> of operation depended on whether SPLICE_F_NONBLOCK was passed only.
> Now the operation will be non-blocking if O_NONBLOCK was passed when
> opening *or* SPLICE_F_NONBLOCK was passed to vmsplice. Previous
> behavior was arguably buggy, and new behavior is arguably better.
> 
> Now SPLICE_F_GIFT is always ignored by all 3 syscalls: splice, tee
> and vmsplice.
> 
> Signed-off-by: Askar Safin <safinaskar@gmail.com>
> ---
>  fs/read_write.c          |  23 +++++
>  fs/splice.c              | 192 +--------------------------------------
>  include/linux/skbuff.h   |   4 +-
>  include/linux/splice.h   |   2 +-
>  include/linux/syscalls.h |   4 +-
>  5 files changed, 29 insertions(+), 196 deletions(-)

Hi All,

This patch as commit e2c0b2368081b ("vmsplice: make vmsplice a trivial
wrapper for preadv2/pwritev2") in linux-next on s390 causes the selftest
tools/testing/selftests/mm/cow.c to hang:

# [RUN] vmsplice() + unmap in child ... with PTE-mapped THP (128 kB)

Recently there have been changes in the THP area, so the problem is not
necessarily linked to this patch per se.

Please, let me know if you need any additional information.

Thanks!


* Re: [LTP] [PATCH 0/5] vmsplice: fix some problems in my previous vmsplice patchset
From: Askar Safin @ 2026-06-08 17:12 UTC
  To: andrea.cervesato
  Cc: akpm, axboe, brauner, collin.funk1, david.laight.linux, david,
	dhowells, fuse-devel, hch, jack, joannelkoong, kernel, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, ltp, luto, metze, miklos,
	netdev, patches, pfalcato, rostedt, safinaskar, torvalds, viro, w,
	willy
In-Reply-To: <6a26aa28.283787d8.1f1282.ba36@mx.google.com>

"Andrea Cervesato" <andrea.cervesato@suse.com>:
> Hi Askar,
> 
> the patch-set doesn't apply:

This is a patchset for the Linux kernel.
(It is expected to fix some failing LTP tests, among other things.)

-- 
Askar Safin


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Jann Horn @ 2026-06-08 15:02 UTC
  To: Mateusz Guzik, Christian Brauner
  Cc: Li Chen, Kees Cook, Alexander Viro, linux-fsdevel, linux-api,
	linux-kernel, linux-mm, linux-arch, linux-doc, linux-kselftest,
	x86, Arnd Bergmann, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <vealb52tv5suireenkke4lul2l3wbnaul2rp3ea545ly5wa5ty@yk3aksvp7skt>

On Thu, May 28, 2026 at 2:55 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> This problem is dear to my heart and I have been pondering it on and off
> for some time now. The entire fork + exec idiom is terrible and needs to
> be retired.

It seems to me like vfork+exec is a decent UAPI building block, on
which you can build nice-looking userspace APIs, though I agree that
this is not an ideal direct interface for application code.

> Additionally there is a known problem where transiently copied file
> descriptors on fork + exec cause a headache in multithreaded programs
> doing something like this in parallel. I only did cursory reading, it
> seems your patchset keeps the same problem in place.

I think we almost have UAPI that would let you avoid this issue?
You can use clone() with CLONE_FILES, then unshare the FD table with
close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE). That is not currently
implemented to be atomic with respect to what happens on other
threads, though we could change that. It also doesn't provide a good
way to carry some FDs across, but it feels to me like this could be
fixed with a variant of close_range() that removes O_CLOEXEC FDs
except ones listed in an array.
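
A rough sketch of the existing part of that sequence, assuming a kernel and
headers new enough to provide close_range(): the child is created sharing the
parent's FD table, then takes a private copy and closes everything from 3
upwards in that copy. The names fd_probe, probe_child() and
close_range_unshare_demo() are made up for this illustration, and the sketch
does nothing about the atomicity or carry-FDs-across gaps described above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <linux/close_range.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

struct fd_probe {
	int fd;			/* descriptor >= 3 opened by the parent */
	int closed_in_child;	/* set by the child; visible via CLONE_VM */
};

/* Child side: runs on its own stack, sharing mm and FD table at entry. */
static int probe_child(void *data)
{
	struct fd_probe *probe = data;

	/* Switch to a private FD table, then close 3..UINT_MAX in it. */
	if (syscall(SYS_close_range, 3U, UINT_MAX, CLOSE_RANGE_UNSHARE) != 0)
		_exit(2);
	probe->closed_in_child = fcntl(probe->fd, F_GETFD) == -1;
	_exit(0);
}

/*
 * Returns 0 if the child saw the FD closed while the parent's copy stayed
 * open, 1 if close_range() failed or is unavailable, -1 on other errors.
 */
static int close_range_unshare_demo(void)
{
	enum { STACK_SIZE = 64 * 1024 };
	struct fd_probe probe = { .fd = -1, .closed_in_child = 0 };
	int devnull, status, ret = -1;
	char *stack;
	pid_t pid;

	devnull = open("/dev/null", O_RDONLY);
	if (devnull < 0)
		return -1;
	probe.fd = fcntl(devnull, F_DUPFD, 3);	/* guarantee fd >= 3 */
	close(devnull);

	stack = malloc(STACK_SIZE);
	if (probe.fd < 0 || !stack)
		goto out;

	/* CLONE_VFORK: the parent resumes only after the child has exited. */
	pid = clone(probe_child, stack + STACK_SIZE,
		    CLONE_VM | CLONE_VFORK | CLONE_FILES | SIGCHLD, &probe);
	if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		goto out;

	if (WEXITSTATUS(status) == 2)
		ret = 1;
	else if (WEXITSTATUS(status) == 0 && probe.closed_in_child &&
		 fcntl(probe.fd, F_GETFD) != -1)
		ret = 0;
out:
	free(stack);
	if (probe.fd >= 0)
		close(probe.fd);
	return ret;
}
```

A real spawn path would call execve() in the child instead of probing the
FD; the probe is only there to make the effect of CLOSE_RANGE_UNSHARE
observable.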

> There are numerous impactful ways to speed up execs both in terms of
> single-threaded cost and their multicore scalability, most of which
> would be immediately usable by all programs without an opt-in. imo these
> need to be exhausted before something like a "template" can be
> considered.

(I think probably a large part of this would be stuff that happens in
userspace, like dynamic linking.)

> Per the above, the primary win would stem from *NOT* messing with mm.

As you write below, I think we have that with CLONE_VM? The C function
vfork() is kind of a terrible API because of its returns-twice
behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
wrapped by libc in a way similar to clone() (with the child executing
a separate handler function), or if it was used in the implementation
of some higher-level process-spawning API, it would be a perfectly
fine API?

Or am I misunderstanding what you mean by "messing with mm"?
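
To make the "child executes a separate handler function" shape concrete,
here is a minimal sketch. spawn_args, exec_child and spawn_with_handler
are hypothetical names, not an existing libc interface, and signal
masking is deliberately left out:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct spawn_args {                 /* hypothetical argument bundle */
    const char *path;
    char *const *argv;
    char *const *envp;
};

/* Child side: runs on its own stack but in the parent's address space. */
static int exec_child(void *arg)
{
    struct spawn_args *a = arg;

    execve(a->path, a->argv, a->envp);
    _exit(127);                     /* only reached if execve() failed */
}

/* Returns the child's pid, or -1 with errno set. No returns-twice here:
 * the parent simply resumes after clone() once the child has exec'd or
 * exited, because of CLONE_VFORK. */
pid_t spawn_with_handler(const char *path, char *const argv[],
                         char *const envp[])
{
    const size_t stack_size = 64 * 1024;
    struct spawn_args a = { path, argv, envp };
    char *stack;
    pid_t pid;
    int saved_errno;

    stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* The stack grows down on the usual targets, so pass its top. */
    pid = clone(exec_child, stack + stack_size,
                CLONE_VM | CLONE_VFORK | SIGCHLD, &a);
    saved_errno = errno;
    munmap(stack, stack_size);      /* child no longer uses it */
    errno = saved_errno;
    return pid;
}
```

A caller would reap the child with waitpid() as usual. A real wrapper
would also have to block signals around clone() and reset them in the
child, which this sketch does not attempt.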

> As in, whatever the interface, it needs to create an "empty" target
> process (for lack of a better term).
>
> In terms of userspace-visible APIs, a clean solution escapes me.

I think we already have relatively good API for this - you can use
clone() to create something that initially shares almost all the state
that a thread would, and then incrementally unshare resources and go
through execve().


* Re: [LTP] [PATCH 0/5] vmsplice: fix some problems in my previous vmsplice patchset
From: Andrea Cervesato @ 2026-06-08 11:40 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	The 8472, patches, David Howells, linux-mm, Collin Funk,
	Joanne Koong, Miklos Szeredi, David Laight, Matthew Wilcox,
	Christoph Hellwig, Steven Rostedt, fuse-devel, David Hildenbrand,
	Pedro Falcato, ltp, Jens Axboe, Stefan Metzmacher, netdev,
	linux-kernel, Andy Lutomirski, linux-api, Andrew Morton,
	Linus Torvalds, Willy Tarreau
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

Hi Askar,

the patch-set doesn't apply:

error: fs/read_write.c: does not exist in index
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Applying: vmsplice: open-code do_writev and do_readv
Patch failed at 0001 vmsplice: open-code do_writev and do_readv

https://github.com/linux-test-project/ltp-agent/actions/runs/27129052434/job/80065058557#step:8:21

Please update it to a new version after rebasing with the upstream master
branch.

Regards,
--
Andrea Cervesato
SUSE QE Automation Engineer Linux
andrea.cervesato@suse.com


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-07 13:22 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Christian Brauner, Kees Cook, Alexander Viro, linux-fsdevel,
	linux-api, linux-kernel, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <87fr31xdz3.fsf@mailhost.krisman.be>

Hi Gabriel,

Yes, I looked at Josh's slides and your RFC a few days ago.

I agree that io_uring is a very interesting direction, and I can see why it
fits the "ordered setup operations before exec" model.

My current preference is still to first explore a pidfd/pidfs-based builder,
modeled roughly like fsconfig(). Process creation feels like a core process
lifecycle API, and I think a normal fd-based syscall interface may be easier
for libc, language runtimes, shells, and sandboxing tools to adopt.

My hesitation is practical rather than conceptual. Some important
deployments still disable io_uring entirely; Docker's default seccomp
profile blocks the io_uring syscalls, and Google has disabled or restricted
io_uring in ChromeOS, Android app processes, and production servers.
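
As an illustration of what that means for a runtime, here is a small
probe it could run before choosing an io_uring-based spawn path. This is
only a sketch: io_uring_reachable is a made-up name, the fallback
syscall number is an assumption for old headers, and the errno
interpretation is a heuristic rather than a documented contract:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_io_uring_setup
#define SYS_io_uring_setup 425      /* assumed fallback for old headers */
#endif

/* Returns 1 if io_uring_setup() appears reachable from this process,
 * 0 if it looks blocked or absent (for example ENOSYS, or EPERM from a
 * seccomp filter or a system-wide disable switch). */
int io_uring_reachable(void)
{
    /* A NULL params pointer cannot succeed, so no ring gets created. */
    long rc = syscall(SYS_io_uring_setup, 1UL, (void *)NULL);

    if (rc >= 0) {                  /* not expected; clean up defensively */
        close((int)rc);
        return 1;
    }
    /* Argument errors mean the syscall itself was entered. */
    return errno == EFAULT || errno == EINVAL;
}
```

A runtime that gets 0 here would need a non-io_uring fallback anyway,
which is part of why I lean towards a plain fd-based interface first.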

I will study your io_uring work more carefully and compare the two directions.
One possible outcome is that io_uring can drive/share the same builder object later;
I do not know that yet.

Thanks for pointing this out.

 ---- On Fri, 05 Jun 2026 22:24:00 +0800  Gabriel Krisman Bertazi <krisman@suse.de> wrote --- 
 > Li Chen <me@linux.beauty> writes:
 > 
 > > Hi,
 > >
 > > This is an early RFC for an idea that is probably still rough in both the
 > > UAPI and implementation details. Sorry for the rough edges; I am sending
 > > it now to check whether this direction is worth pursuing and to get
 > > feedback on the kernel/userspace boundary.
 > >
 > > The series is based on linux-next version 20260518.
 > >
 > > This RFC adds spawn_template, a userspace-controlled exec acceleration
 > > mechanism for runtimes that repeatedly start the same executable with
 > > different argv, envp, and per-spawn file descriptor setup.
 > 
 > Have you looked at Josh's proposal to do this over io_uring [1] and my
 > implementation of it at [2]?  I think io_uring is a very natural
 > interface for something like this, it will avoid adding a larger API,
 > since you could, in theory, set up the entire new task context using
 > regular io_uring operations in an io workqueue and then starting it would
 > be a matter of forking the pre-configured io thread with a new io_uring
 > operation.
 > 
 > [1]
 > https://lpc.events/event/16/contributions/1213/attachments/1012/1945/io-uring-spawn.pdf
 > [2] https://lwn.net/Articles/1001622/
 > 
 > >
 > > The main target is agent runtimes. Modern coding agents repeatedly start
 > > short-lived helper tools such as rg, git, sed, awk, python, node, and
 > > shell wrappers while they inspect and edit a workspace. Those runtimes
 > > already know which tools are hot, and they are also the right place to
 > > decide policy. The kernel does not choose names such as rg, git, or sed.
 > > Userspace opts in by creating a template fd for one executable, then uses
 > > that fd for later spawns. Launchers, shells, and build systems have a
 > > similar repeated-startup shape and could use the same primitive, but the
 > > agent runtime case is the main motivation for this RFC.
 > >
 > > The mechanism applies to the executable that userspace asks the kernel to
 > > start. If an agent runtime directly starts /usr/bin/rg, the rg executable
 > > is the template target. If the runtime starts /usr/bin/bash -c "rg ... |
 > > head", the shell is the template target unless the shell itself opts in
 > > when it starts rg and head. The kernel does not parse the shell command
 > > string or rewrite inner commands into template spawns. Userspace has to
 > > call spawn_template for those inner commands explicitly:
 > >
 > >     direct exec                 shell wrapper
 > >     -----------                 -------------
 > >     agent                       agent
 > >       template("/usr/bin/rg")     template("/usr/bin/bash")
 > >       spawn rg argv              spawn bash -c "rg ... | head"
 > >
 > >     kernel target: rg          kernel target: bash
 > >     rg startup benefits        rg/head need shell opt-in
 > >
 > > Several agent runtime discussions are moving toward direct argv-style
 > > exec tools for both security and policy clarity. For example, opencode
 > > issue #2206 proposes an exec tool as a safer alternative to a shell-only
 > > bash tool:
 > >
 > > https://github.com/anomalyco/opencode/issues/2206
 > >
 > > spawn_template is meant to support both models. Direct exec users can
 > > cache the actual hot tool. Shell-wrapper users can cache the shell and
 > > still reduce shell startup cost. If a shell or an agent runtime later
 > > uses the same API for commands started inside a shell command, those
 > > inner tools can benefit too.
 > >
 > > Each spawn still goes through the normal exec path. The template reuses
 > > only metadata that can be revalidated before use. Credential preparation,
 > > permission checks, binary handler checks, secure-exec handling, and LSM
 > > hooks remain on the normal execve path.
 > >
 > > The UAPI has two operations. spawn_template_create() creates an
 > > anonymous-inode template fd from either an executable fd or an absolute
 > > executable path. spawn_template_spawn() starts one child from that
 > > template, applies per-spawn fd, cwd, and signal actions, and returns both
 > > pid and pidfd.
 > >
 > > fd inheritance is deliberately conservative. By default, after the
 > > requested per-spawn actions have run, the child closes fds above stderr.
 > > An agent runtime can still request traditional inheritance explicitly,
 > > but helper tools do not inherit unrelated secret files or sockets by
 > > accident. The create-time actions fields are reserved and rejected in
 > > this RFC because fd numbers are per-process state, not stable reusable
 > > objects. The caller supplies fd actions for each spawn instead.
 > >
 > > A typical agent runtime would keep one template per hot executable and
 > > still build argv, envp, cwd, and pipe wiring for each tool call:
 > >
 > >     rg_tmpl = spawn_template_create("/usr/bin/rg");
 > >
 > >     for each search request:
 > >         out_r, out_w = pipe_cloexec();
 > >         err_r, err_w = pipe_cloexec();
 > >         actions = [
 > >             FCHDIR(worktree_fd),
 > >             DUP2(out_w, STDOUT_FILENO),
 > >             DUP2(err_w, STDERR_FILENO),
 > >         ];
 > >         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
 > >         close(out_w);
 > >         close(err_w);
 > >         read out_r and err_r;
 > >         waitid(P_PIDFD, child.pidfd, ...);
 > >
 > > A shell-wrapper runtime would use the same shape with a template for
 > > /usr/bin/bash and argv such as ["/usr/bin/bash", "-c", command]. That
 > > reduces shell startup cost, but it does not cache rg or head inside that
 > > command unless the shell also opts into spawn_template for commands it
 > > starts internally.
 > >
 > > The template pins the executable and denies writes to that file while the
 > > template fd is alive, so cached executable metadata cannot race with a
 > > writer changing the same inode. This means direct in-place writes to the
 > > executable can fail while a runtime keeps a template open. It does not
 > > block the common package-manager update pattern where a new inode is
 > > written and then atomically renamed over the old path. In that case the
 > > old path-created template becomes stale, spawn_template_spawn() rejects
 > > it with ESTALE, and the runtime should close and recreate the template
 > > for the new executable.
 > >
 > >     in-place write              package-manager update
 > >     --------------              ----------------------
 > >     template pins old inode     write new inode
 > >     write(old inode) denied     rename(new, "/usr/bin/rg")
 > >
 > >     cached metadata safe        old template sees path mismatch
 > >                                 spawn_template_spawn() = -ESTALE
 > >                                 recreate template for new inode
 > >
 > > Each spawn revalidates executable identity before cached metadata is
 > > used. Path-created templates only accept absolute paths: a relative path
 > > such as ./tool depends on cwd, and the same string can name a different
 > > file after chdir. For an absolute path template, each spawn reopens the
 > > path and checks that it still resolves to the executable recorded when
 > > the template was created. If the path now names a replaced file, the
 > > template is stale and userspace should close and recreate it.
 > >
 > > A template fd can be passed over SCM_RIGHTS like any other fd, but this
 > > RFC does not treat that as delegation. spawn_template_spawn() only works
 > > while the caller still has the same struct cred object that created the
 > > template. If another task, or the same task after a credential change,
 > > receives the fd, spawn fails instead of running the executable using the
 > > creator's launch authority:
 > >
 > >     ordinary fd                         spawn_template fd
 > >     -----------                         -----------------
 > >     A: open log                         A: create rg template
 > >     A -> B: SCM_RIGHTS(fd)              A -> B: SCM_RIGHTS(tfd)
 > >
 > >     B: read(fd) = ok                    B: spawn(tfd) = -EACCES
 > >                                         B: create own rg template
 > >                                         B: spawn(own_tfd) = ok
 > >
 > >     open-file use is delegated          spawn authority is not delegated
 > >
 > > The cached state is intentionally small. The template fd keeps the opened
 > > main executable file, an optional absolute path string, the creator
 > > credential pointer, and the deny-write state. The executable identity key
 > > records device, inode, size, mode, owner, ctime, and mtime, and is
 > > rechecked before cached metadata is used. The ELF cache keeps only the
 > > main executable's ELF header, program header table, and program header
 > > count.
 > >
 > >     cached in this RFC          not cached in this RFC
 > >     ------------------          ----------------------
 > >     opened main executable      PT_INTERP metadata
 > >     executable identity key     shared-library graph
 > >     main ELF header             VMA layout metadata
 > >     main ELF program headers    cross-process metadata sharing
 > >     creator cred pointer
 > >     deny-write state
 > >
 > > This RFC does not cache ELF interpreter metadata, shared-library
 > > dependency state, or derived mapping-layout state. Shared-library
 > > resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
 > > RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
 > > state. It also does not share cached executable metadata between template
 > > fds created by different processes. Each template owns its small cached
 > > metadata object in this RFC.
 > >
 > > Performance
 > > ===========
 > >
 > > The numbers below come from my separate local autogen-bench project.
 > > autogen-bench uses AutoGen [1] Core as the agent harness: RoutedAgent
 > > instances run under SingleThreadedAgentRuntime, and RPC-style dispatch
 > > fans out concurrent tool-call requests to worker agents. The workload
 > > definitions, generated test files, and subprocess/spawn_template backends
 > > are local to autogen-bench.
 > >
 > > The agent-tools preset includes direct tool calls and shell-wrapper forms
 > > for:
 > >
 > > rg, grep, sed, awk, cat, head, tail, find, stat, ls, git-status, git-diff,
 > > python-small, node-small, sh-c, and bash-c.
 > >
 > > The benchmark is launch-heavy but not no-op: it searches generated
 > > Python-like source files, reads sample files, runs small Python and
 > > Node.js programs, and runs git status and git diff in a small repository.
 > > It does not include model inference or long-running tool work, so the
 > > numbers mainly describe the short-tool regime.
 > >
 > > The subprocess column starts each tool call through the existing
 > > userspace launch path. The spawn_template column creates templates for
 > > hot executables and uses spawn_template_spawn() for later calls.
 > >
 > > Total in-flight tool calls stay at 16; only the worker-process split
 > > changes. For example, 4x4 means 4 worker processes with 4 in-flight tool
 > > calls each. The two time_s values are subprocess/spawn_template wall
 > > times.
 > >
 > > Workload     Total  subprocess  spawn_template  time_s (s)   Delta
 > > (workers)    calls  (calls/s)   (calls/s)       sub/tmpl     (calls/s)
 > > 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
 > > 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
 > > 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
 > > 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
 > >
 > > The table measures the whole mixed workload, including both process
 > > startup and the short tool work done after exec. Since this workload is
 > > launch-heavy, the possible launch-side savings include:
 > >
 > > - the template fd keeps an opened executable, avoiding repeated ordinary
 > >   open/path setup for that executable;
 > > - the kernel can reuse cached main-executable ELF header and program
 > >   header metadata after revalidation;
 > > - the fork-and-exec-style launch is submitted as one
 > >   spawn_template_spawn() operation;
 > > - fd, cwd, and signal actions run in the child kernel path instead of
 > >   being driven one syscall at a time by userspace child glue;
 > > - pid and pidfd are returned by the same operation, reducing some
 > >   runtime-side bookkeeping.
 > >
 > > In local experiments before this RFC, I also tried caching ELF
 > > interpreter metadata and derived ELF mapping-layout metadata. A focused
 > > repeated-exec benchmark did not show a stable standalone throughput gain
 > > for those two optimizations, so this RFC leaves them out and keeps only
 > > the main executable metadata cache.
 > >
 > > I also tried sharing main-executable ELF metadata across template fds
 > > created by different processes for the same executable identity. That can
 > > reduce duplicated metadata memory when many agent worker processes create
 > > their own templates for /usr/bin/rg, /usr/bin/git, and similar tools, but
 > > it did not show a stable throughput win in local multi-agent tests. It
 > > also adds cache keying, lifetime, invalidation, credential, and namespace
 > > questions to the RFC. This version therefore keeps per-template metadata
 > > ownership and leaves cross-process sharing out.
 > >
 > > Sorry again for the rough edges in this RFC. I would appreciate feedback
 > > on whether this direction is useful and what the right API boundary
 > > should be.
 > >
 > > Thanks,
 > > Li
 > >
 > > [1]: https://github.com/microsoft/autogen
 > >
 > > Li Chen (13):
 > >   exec: factor argument setup out of do_execveat_common()
 > >   exec: add an internal helper for opened executables
 > >   file: expose helpers for in-kernel fd actions
 > >   exec: add spawn template UAPI definitions
 > >   exec: add spawn template file descriptors
 > >   exec: add spawn_template_spawn()
 > >   exec: validate spawn template executable identity
 > >   binfmt_elf: cache ELF metadata for spawn templates
 > >   Documentation: describe spawn templates
 > >   exec: require absolute paths for path-created templates
 > >   exec: let close-range actions target the max fd
 > >   syscalls: add generic spawn template entries
 > >   selftests/exec: cover spawn template basics
 > >
 > >  Documentation/userspace-api/index.rst         |   1 +
 > >  .../userspace-api/spawn_template.rst          | 153 +++
 > >  MAINTAINERS                                   |   6 +
 > >  arch/x86/entry/syscalls/syscall_64.tbl        |   3 +-
 > >  fs/Makefile                                   |   2 +-
 > >  fs/binfmt_elf.c                               | 104 +-
 > >  fs/exec.c                                     | 162 ++-
 > >  fs/file.c                                     |  11 +-
 > >  fs/spawn_template.c                           | 619 +++++++++++
 > >  include/linux/binfmts.h                       |  10 +
 > >  include/linux/fdtable.h                       |   2 +
 > >  include/linux/spawn_template.h                |  72 ++
 > >  include/linux/syscalls.h                      |   7 +
 > >  include/uapi/asm-generic/unistd.h             |   7 +-
 > >  include/uapi/linux/spawn_template.h           |  62 ++
 > >  scripts/syscall.tbl                           |   2 +
 > >  tools/testing/selftests/exec/Makefile         |   1 +
 > >  tools/testing/selftests/exec/spawn_template.c | 997 ++++++++++++++++++
 > >  18 files changed, 2179 insertions(+), 42 deletions(-)
 > >  create mode 100644 Documentation/userspace-api/spawn_template.rst
 > >  create mode 100644 fs/spawn_template.c
 > >  create mode 100644 include/linux/spawn_template.h
 > >  create mode 100644 include/uapi/linux/spawn_template.h
 > >  create mode 100644 tools/testing/selftests/exec/spawn_template.c
 > 
 > -- 
 > Gabriel Krisman Bertazi
 > 

Regards,
Li



* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Richard Patel @ 2026-06-06 23:05 UTC (permalink / raw)
  To: Florian Weimer
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <lhua4t73hz9.fsf@oldenburg.str.redhat.com>

On Sat, Jun 06, 2026 at 03:40:10PM +0200, Florian Weimer wrote:
> * Richard Patel:
> 
> > On Fri, Jun 05, 2026 at 09:34:46PM +0200, Florian Weimer wrote:
> >
> >> How do you detect that handling a signal is complete and IBT can be
> >> re-enabled?  Or is it re-enabled before entering the userspace signal
> >> handler?
> >
> > Hi Florian,
> >
> > In v1, we backed up the IBT CPU state into the (user-accessible) signal
> > frame from FRED/XSAVE, then restored it:
> > https://lore.kernel.org/lkml/20260517183024.16292-4-ripatel@wii.dev/
> >
> > In v2, when entering the signal handler, the kernel just context switches
> > to the new user rip, bypassing IBT checks (continues executing if the
> > signal handler does not begin with endbr).
> 
> What's the reason for this?

Hi Florian,

We just don't have a nice way to include IBT state in the signal frame
right now.  v1 had a uABI change (adding a new bit in ucontext_t uc_flags),
which was originally proposed by Intel years ago.  My preferred way to add
IBT state is to carve out an XSAVE area in fpstate, which works well with all
the existing signal frame code.

But I figured it's better to just keep the first pass at user IBT super
simple, in the hopes upstream is more inclined to accept that.

BTW, OpenBSD uses the v2 approach (don't preserve IBT state across signal
handlers), presumably because it's also hard for them to restore IBT state
on sigreturn.

> >> That's not necessarily a problem because its address cannot be directly
> >> overwritten in userspace.  Not all indirect branches need to be checked,
> >> only those that have tweakable targets.  In fact, fewer ENDBR64 markers
> >> are better (although we wouldn't drop the marker from a signal handler
> >> specifically, of course).
> >
> > Just one concern I have is that people start relying on signal handlers
> > not requiring endbr64, and then a future kernel version breaking them once
> > we enforce it.
> 
> Would software enforcement be a possibility?  The kernel could check if
> the landing pad is there.

Enforcement is the easy part.  I can trivially add back 'check if signal
handler starts with endbr64'.  Just the backup/restore of the pre-signal
handler state ('do I expect an endbr64 after returning') is the tricky part.
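
For reference, the 'starts with endbr64' half really is just a four-byte
compare. A userspace-style sketch follows; starts_with_endbr64 is an
illustrative name, and real kernel code would first have to copy the
bytes from user memory and handle faults:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ENDBR64 is encoded as the four bytes F3 0F 1E FA. */
static const unsigned char endbr64_bytes[4] = { 0xf3, 0x0f, 0x1e, 0xfa };

/* Returns true if the code buffer begins with an ENDBR64 landing pad. */
bool starts_with_endbr64(const unsigned char *code, size_t len)
{
    return len >= sizeof(endbr64_bytes) &&
           memcmp(code, endbr64_bytes, sizeof(endbr64_bytes)) == 0;
}
```

The check is stateless, which is exactly why it is the easy part: it
says nothing about whether the interrupted code was itself waiting for a
landing pad.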

Thank you,
-Richard


* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Florian Weimer @ 2026-06-06 13:40 UTC (permalink / raw)
  To: Richard Patel
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <aiMyaJ8zDl76YOVN@wii.dev>

* Richard Patel:

> On Fri, Jun 05, 2026 at 09:34:46PM +0200, Florian Weimer wrote:
>
>> How do you detect that handling a signal is complete and IBT can be
>> re-enabled?  Or is it re-enabled before entering the userspace signal
>> handler?
>
> Hi Florian,
>
> In v1, we backed up the IBT CPU state into the (user-accessible) signal
> frame from FRED/XSAVE, then restored it:
> https://lore.kernel.org/lkml/20260517183024.16292-4-ripatel@wii.dev/
>
> In v2, when entering the signal handler, the kernel just context switches
> to the new user rip, bypassing IBT checks (continues executing if the
> signal handler does not begin with endbr).

What's the reason for this?

> Some time in the future, ideally:
> - signal handler is *required* to start with endbr (this is easy)
> - sigreturn as in my asm example enforces endbr after returning from a
>   signal handler to an in-progress indirect branch
> - libc (sig)longjmp is made IBT-compatible

I think the compiler already emits ENDBR markers for returns-twice
functions, which is why longjmp does not use a no-track jump.  Other
architectures require such a proliferation of markers because they do
not support no-track jumps at all.  However, longjmp is arguably a
corner case.  It's not completely safe, like loading a function address
from a RELRO GOT and jumping to it.

> Btw, I had self-tests for the v1 design, and {signal handle,rt_sigreturn,
> siglongjmp} with {success case,violation} works flawlessly with Fedora 44
> glibc amd64. With glibc i686 I ran into PLT issues, probably my fault.

There's no IBT support planned for i686, that's why we dropped all
marker instructions in Fedora.

> I was quite surprised that siglongjmp was working, btw, since the glibc
> longjmp code uses 'jmp *reg' (without notrack prefix). I guess you do an
> endbr64 at the setjmp side?

Yes, compilers generate landing pads for returns-twice functions.  Not
ideal, but it's the only way to get setjmp working on targets without
NOTRACK.

>> The ELF GNU note parsing can be added later, but perhaps not
>> cleanly.  I'm still a bit worried we might have to rev the markup
>> because too many binaries are in circulation that claim compatibility,
>> have never been tested, and are actually broken.  If the kernel does not
>> look at the ELF bits, things are slightly simpler.
>
> Phew, I was hoping you'd say that.
>
> If you want, I can sketch out glibc IBT enabling and test it on Debian
> and Fedora, which IIRC already compile with -fcf-protection=branch
> for all OS packages.

For Fedora, please coordinate with Arjun (Cc:ed), who is going through
the motions of enabling SHSTK for real.

>> That's not necessarily a problem because its address cannot be directly
>> overwritten in userspace.  Not all indirect branches need to be checked,
>> only those that have tweakable targets.  In fact, fewer ENDBR64 markers
>> are better (although we wouldn't drop the marker from a signal handler
>> specifically, of course).
>
> Just one concern I have is that people start relying on signal handlers
> not requiring endbr64, and then a future kernel version breaking them once
> we enforce it.

Would software enforcement be a possibility?  The kernel could check if
the landing pad is there.

Thanks,
Florian



* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-06 10:22 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Linus Torvalds, Andy Lutomirski, Askar Safin, akpm, axboe,
	brauner, david, dhowells, hch, jack, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato, viro,
	willy
In-Reply-To: <634c8ae2-3f1c-46b1-b002-1e2ac797dd80@samba.org>

On Fri, 5 Jun 2026 17:20:34 +0200
Stefan Metzmacher <metze@samba.org> wrote:

> Hi David,
> 
> >>> So sendfile() as a concept (whether you use combinations of splice()
> >>> system calls or the sendfile system call itself) isn't necessarily
> >>> only about the zero-copy, it's really also about avoiding the user
> >>> space memory management.  
> >>
> >> I don't think so. Ok, maybe for webservers just serving tiny
> >> html files, that's true. But for me with Samba it's really the
> >> copy_to/from_iter() that is the major factor.  
> > 
> > Is that copy also doing the ip checksum?  
> 
> Not in my tests. I guess there's offload in the network hardware
> for this.

There will be; the question is just whether the checksum in the syscall
path is actually being suppressed.

-- David

> 
> At least at the syscall layer of sendmsg() there's no checksumming
> happening.
> 
> metze
> 



* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-06  9:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Florian Weimer, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wiTQr4YYYUH38srGvWAq3_UpDeAPR+qZWVyf-ZU7z8Hzw@mail.gmail.com>

On Fri, 5 Jun 2026 10:12:05 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 5 Jun 2026 at 09:30, Florian Weimer <fweimer@redhat.com> wrote:
> >  
> > > Uhhuh. But that is only specific to 'bool', right?  
> >
> > Also char and short.  
> 
> That sounds like a complete ABI violation as far as I can tell.
> 
> Scary. Because I would not be surprised if we have code that assumes otherwise.
> 
> Now, the kernel *seldom* uses char/short types, and since compilers
> are typically at least self-consistent in those cases and we don't
> interact directly with untrusted sources.

There are plenty of places where char/short are used for function call
parameters/results (and not for single characters or similar).

I'm sure some people (even some who should really know better) think
the smaller type will save space.

I've always worried about whether the calling or called code is responsible
for ensuring the unused bits are zero (or maybe the sign extension of a
signed value).
Clearly the compiler should obey its own rules - so mostly it is just
an extra instruction to do the masking.
But for interactions with asm code, and possibly code that gets mixed
between gcc and clang (maybe for out of tree modules) it does matter.

You also don't really want to be doing maths on char/short (and there
are quite a lot of those as well). I think it is only x86 and m68k that
actually have 8/16-bit maths instructions (is s390 old enough?);
everywhere else the compiler has to explicitly mask the high bits.
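
A tiny example of the masking I mean; add16 and add32 are just
illustrative names. In add16() both operands are promoted to int, the
add is done at full width, and the result has to be cut back to 16 bits
on return. Whether that costs a separate instruction depends on the
target and compiler, so treat this as a sketch of the semantics only:

```c
#include <stdint.h>

/* 16-bit result: the sum is computed as int, then truncated to 16 bits. */
uint16_t add16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* Same operation on a register-sized type: no truncation step needed. */
uint32_t add32(uint32_t a, uint32_t b)
{
    return a + b;
}
```

With a = 0xffff and b = 1, add16() has to wrap to 0 while add32() simply
yields 0x10000.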

Maybe it is time to 'nuke' all the 'short' locals/parameters/results
(eg from htons()) as well as all the 'long' for values that aren't
dependent on 32/64 bit builds.

-- David



* [PATCH 5/5] vmsplice: make sure we don't wait after writing some data
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

Make sure we don't wait for space in the pipe after writing some data.
This is needed for compatibility with the previous version of vmsplice.
Found by LTP vmsplice01.
See comments in the code and links below for details.

Link: https://lore.kernel.org/all/20260603-raumfahrt-unmerklich-ertrugen-c4ecae70d5f9@brauner/
Link: https://lore.kernel.org/all/CAHk-=wgV-j-G3d+899Zm1pQ=NaJrddPz=GKcL5Yw5DTUM=GaUw@mail.gmail.com/
Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 77487b307..dbd0debc2 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1221,6 +1221,8 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		unsigned long, vlen, unsigned int, flags)
 {
+	struct pipe_inode_info *pipe;
+
 	if (unlikely(flags & ~SPLICE_F_ALL))
 		return -EINVAL;
 
@@ -1229,11 +1231,44 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		return -EBADF;
 
 	/* We do vfs_writev/vfs_readv, so it is okay to pass "false" here */
-	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
+	pipe = get_pipe_info(fd_file(f), /* for_splice = */ false);
+
+	if (!pipe)
 		return -EBADF;
 
 	if (fd_file(f)->f_mode & FMODE_WRITE) {
-		ssize_t ret = vfs_writev(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+		/*
+		 * When writing to the pipe, previous implementation of vmsplice
+		 * first waited for space in the pipe to appear
+		 * (depending on whether SPLICE_F_NONBLOCK was passed),
+		 * then did unconditional non-blocking write to the pipe.
+		 *
+		 * This differs from what pwritev2 does.
+		 *
+		 * For compatibility we do the same thing previous
+		 * implementation did.
+		 *
+		 * We lock the pipe, do pipe_wait_for_space, then unlock
+		 * the pipe, and then do vfs_writev. vfs_writev internally
+		 * locks the pipe again. This may cause TOCTOU: when we
+		 * do vfs_writev, the pipe may become full again. So we
+		 * do a loop.
+		 */
+
+		bool non_block = (flags & SPLICE_F_NONBLOCK) || (fd_file(f)->f_flags & O_NONBLOCK);
+		ssize_t ret;
+
+		do {
+			pipe_lock(pipe);
+			ret = pipe_wait_for_space(pipe, non_block);
+			pipe_unlock(pipe);
+
+			if (ret < 0)
+				break;
+
+			ret = vfs_writev(fd_file(f), vec, vlen, NULL, RWF_NOWAIT);
+		} while (!non_block && ret == -EAGAIN);
+
 		if (ret > 0)
 			add_wchar(current, ret);
 		inc_syscw(current);
-- 
2.47.3


^ permalink raw reply related

* [PATCH 4/5] pipe: move wait_for_space to fs/pipe.c and rename it
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

This is needed, because I plan to use it in fs/read_write.c.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/pipe.c                 | 17 +++++++++++++++++
 fs/splice.c               | 19 +------------------
 include/linux/pipe_fs_i.h |  2 ++
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9..c0ccf21b9 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1451,6 +1451,23 @@ long pipe_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 	return ret;
 }
 
+int pipe_wait_for_space(struct pipe_inode_info *pipe, bool non_block)
+{
+	for (;;) {
+		if (unlikely(!pipe->readers)) {
+			send_sig(SIGPIPE, current, 0);
+			return -EPIPE;
+		}
+		if (!pipe_is_full(pipe))
+			return 0;
+		if (non_block)
+			return -EAGAIN;
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+		pipe_wait_writable(pipe);
+	}
+}
+
 static const struct super_operations pipefs_ops = {
 	.destroy_inode = free_inode_nonrcu,
 	.statfs = simple_statfs,
diff --git a/fs/splice.c b/fs/splice.c
index 707db2c2c..d12243d19 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1239,23 +1239,6 @@ ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(splice_file_range);
 
-static int wait_for_space(struct pipe_inode_info *pipe, bool non_block)
-{
-	for (;;) {
-		if (unlikely(!pipe->readers)) {
-			send_sig(SIGPIPE, current, 0);
-			return -EPIPE;
-		}
-		if (!pipe_is_full(pipe))
-			return 0;
-		if (non_block)
-			return -EAGAIN;
-		if (signal_pending(current))
-			return -ERESTARTSYS;
-		pipe_wait_writable(pipe);
-	}
-}
-
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -1268,7 +1251,7 @@ ssize_t splice_file_to_pipe(struct file *in,
 	ssize_t ret;
 
 	pipe_lock(opipe);
-	ret = wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
+	ret = pipe_wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
 	if (!ret)
 		ret = do_splice_read(in, offset, opipe, len, flags);
 	pipe_unlock(opipe);
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index a1eeed800..be653625d 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -335,4 +335,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file, bool for_splice);
 int create_pipe_files(struct file **, int);
 unsigned int round_pipe_size(unsigned int size);
 
+int pipe_wait_for_space(struct pipe_inode_info *pipe, bool non_block);
+
 #endif
-- 
2.47.3


^ permalink raw reply related

* [PATCH 3/5] splice: turn wait_for_space flags argument into bool
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

I want to do this, because I will move this function to fs/pipe.c.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/splice.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 6ddf7dd72..707db2c2c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1239,7 +1239,7 @@ ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(splice_file_range);
 
-static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
+static int wait_for_space(struct pipe_inode_info *pipe, bool non_block)
 {
 	for (;;) {
 		if (unlikely(!pipe->readers)) {
@@ -1248,7 +1248,7 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 		}
 		if (!pipe_is_full(pipe))
 			return 0;
-		if (flags & SPLICE_F_NONBLOCK)
+		if (non_block)
 			return -EAGAIN;
 		if (signal_pending(current))
 			return -ERESTARTSYS;
@@ -1268,7 +1268,7 @@ ssize_t splice_file_to_pipe(struct file *in,
 	ssize_t ret;
 
 	pipe_lock(opipe);
-	ret = wait_for_space(opipe, flags);
+	ret = wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
 	if (!ret)
 		ret = do_splice_read(in, offset, opipe, len, flags);
 	pipe_unlock(opipe);
-- 
2.47.3


^ permalink raw reply related

* [PATCH 2/5] vmsplice: change argument type back to "int"
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

My previous vmsplice patchset changed vmsplice argument from
"int" to "unsigned long". This may cause problems, so let's
change it back.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c          | 2 +-
 include/linux/syscalls.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index e224e7cb8..77487b307 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1218,7 +1218,7 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 /*
  * Legacy preadv2/pwritev2 wrapper.
  */
-SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
+SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		unsigned long, vlen, unsigned int, flags)
 {
 	if (unlikely(flags & ~SPLICE_F_ALL))
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a86a88207..46a3ec954 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -514,7 +514,7 @@ asmlinkage long sys_ppoll_time32(struct pollfd __user *, unsigned int,
 			  struct old_timespec32 __user *, const sigset_t __user *,
 			  size_t);
 asmlinkage long sys_signalfd4(int ufd, sigset_t __user *user_mask, size_t sizemask, int flags);
-asmlinkage long sys_vmsplice(unsigned long fd, const struct iovec __user *vec,
+asmlinkage long sys_vmsplice(int fd, const struct iovec __user *vec,
 			     unsigned long vlen, unsigned int flags);
 asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
 			   int fd_out, loff_t __user *off_out,
-- 
2.47.3


^ permalink raw reply related

* [PATCH 1/5] vmsplice: open-code do_writev and do_readv
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

My previous vmsplice patch made the following mistake: I did
"CLASS(fd, f)(fd)", then did some checks on resulting "struct file",
then passed numeric (!) file descriptor to a function.

This is somewhat okay in this particular case, but I still think
this is code smell, so I fix this by open-coding do_writev and do_readv.

Also I insert a comment to warn other developers to keep
do_writev and do_readv in sync with vmsplice(2).

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 1e5444f4d..e224e7cb8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1070,6 +1070,7 @@ static ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 			unsigned long vlen, rwf_t flags)
 {
+	/* All future changes to this function should be kept in sync with vmsplice(2). */
 	CLASS(fd_pos, f)(fd);
 	ssize_t ret = -EBADF;
 
@@ -1093,6 +1094,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 			 unsigned long vlen, rwf_t flags)
 {
+	/* All future changes to this function should be kept in sync with vmsplice(2). */
 	CLASS(fd_pos, f)(fd);
 	ssize_t ret = -EBADF;
 
@@ -1226,14 +1228,24 @@ SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
 	if (fd_empty(f))
 		return -EBADF;
 
-	/* We do do_writev/do_readv, so it is okay to pass "false" here */
+	/* We do vfs_writev/vfs_readv, so it is okay to pass "false" here */
 	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
 		return -EBADF;
 
-	if (fd_file(f)->f_mode & FMODE_WRITE)
-		return do_writev(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
-	else
-		return do_readv(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+	if (fd_file(f)->f_mode & FMODE_WRITE) {
+		ssize_t ret = vfs_writev(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+		if (ret > 0)
+			add_wchar(current, ret);
+		inc_syscw(current);
+		return ret;
+	} else {
+		ssize_t ret = vfs_readv(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+
+		if (ret > 0)
+			add_rchar(current, ret);
+		inc_syscr(current);
+		return ret;
+	}
 }
 
 /*
-- 
2.47.3


^ permalink raw reply related

* [PATCH 0/5] vmsplice: fix some problems in my previous vmsplice patchset
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches

This patchset is for VFS. Of course, it depends on my previous vmsplice
patchset.

I fix some problems in my previous patchset.

1. Fix problem with CLASS(fd, f)(fd). See first patch for details.
This is probably not so important, but I fix it anyway.

2. Change "unsigned long" back to "int". See second patch for details.
Again, this is probably not important, but I want to fix this anyway.

3. Fix that LTP vmsplice01 bug.

See patches for details.

Please, run that LTP vmsplice01 test again.

Notes:

- I want to repeat: I change behavior around SPLICE_F_NONBLOCK.
Previously, vmsplice ignored whether the pipe itself was opened as a
non-blocking file. Now it is not ignored. And in my opinion
the new behavior is better.
- vmsplice(2) is now in fs/read_write.c. It is very similar to
preadv2 and pwritev2 now, so I think it belongs to fs/read_write.c now.
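
For illustration only, here is a minimal userspace sketch of the
flag-based non-blocking case. It shows that vmsplice() with
SPLICE_F_NONBLOCK on an already full pipe fails with EAGAIN instead of
sleeping. It does not exercise the changed O_NONBLOCK handling described
above, and the helper names fill_pipe() and
nonblocking_vmsplice_on_full_pipe() are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Fill the pipe behind wfd until no more data fits. wfd must be
 * O_NONBLOCK, otherwise the last write() would sleep.
 * Returns 0 once the pipe is full, -1 on any other error.
 */
static int fill_pipe(int wfd)
{
	static const char chunk[4096];

	assert(wfd >= 0);
	for (;;) {
		ssize_t n = write(wfd, chunk, sizeof(chunk));

		if (n < 0)
			return errno == EAGAIN ? 0 : -1;
	}
}

/*
 * Try one vmsplice() with SPLICE_F_NONBLOCK into a full pipe.
 * Returns the errno value observed, 0 if the call unexpectedly
 * succeeded, or -1 if the setup failed.
 */
int nonblocking_vmsplice_on_full_pipe(void)
{
	static char payload[] = "hello";
	struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
	int fds[2];
	int err = -1;

	if (pipe2(fds, O_NONBLOCK) < 0)
		return -1;

	if (fill_pipe(fds[1]) == 0) {
		ssize_t n = vmsplice(fds[1], &iov, 1, SPLICE_F_NONBLOCK);

		err = n < 0 ? errno : 0;
	}

	close(fds[0]);
	close(fds[1]);
	return err;
}
```

Calling nonblocking_vmsplice_on_full_pipe() from a small main() and
comparing the result with EAGAIN is enough to check this case by hand.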

Please, review this patchset carefully. I'm still a new contributor.
In particular, please, review that do-while loop, I'm not sure I did
everything right.

Tested in Qemu.

Askar Safin (5):
  vmsplice: open-code do_writev and do_readv
  vmsplice: change argument type back to "int"
  splice: turn wait_for_space flags argument into bool
  pipe: move wait_for_space to fs/pipe.c and rename it
  vmsplice: make sure we don't wait after writing some data

 fs/pipe.c                 | 17 +++++++++++
 fs/read_write.c           | 61 ++++++++++++++++++++++++++++++++++-----
 fs/splice.c               | 19 +-----------
 include/linux/pipe_fs_i.h |  2 ++
 include/linux/syscalls.h  |  2 +-
 5 files changed, 75 insertions(+), 26 deletions(-)


base-commit: 8d86fcfc2857d64af85f5c87c193c25655c970af (vfs-7.2.vmsplice)
-- 
2.47.3


^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: The 8472 @ 2026-06-05 20:54 UTC (permalink / raw)
  To: Linus Torvalds, Willy Tarreau
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Christian Brauner,
	Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <CAHk-=wg0e8pP5haNW4qJP1=QwwUEctwjK5k07sv8bskitoMDgg@mail.gmail.com>

On 04/06/2026 17:58, Linus Torvalds wrote:
> On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
>>
>>> It looks like you're actually doing exactly the thing that I thought
>>> was crazy and wouldn't even work reliably: you change the
>>> common_response[] contents dynamically *after* the vmsplice, and
>>> depend on the fact that changing it in user space changes the buffer
>>> in the pipe too.
>>
>> No no, it's definitely not doing that (or it's a bug, but it's not
>> supposed to happen). I'm perfectly aware that one must definitely not
>> do that, and it's a guarantee the user of vmsplice() must provide.
> 
> Whew, good.
> 
> In that case, can you just try the vmsplice patch series (Christian
> already found a bug, but I don't think it will necessarily matter in
> practice - famous last words) and that test patch of mine, and see if
> it all (a) works for you and (b) if you have any numbers for
> performance that would be *great*.
> 
> There aren't many obvious splice users out there, and even if they
> were to exist they are typically specialized enough that you have to
> have a real use case to then tell if the patches make a difference in
> real life or not.

In the Rust standard library we use splice as one of several strategies
in our generic io::copy[0] routine. It selects the strategy[1] based on
source and sink types.

It tries

- copy_file_range
- sendfile
- splice
- fallback to userspace read-write loop

sendfile or splice are skipped when we can't uphold the "callers must ensure
transferred portions in_fd remain unmodified" condition on the manpage,
which unfortunately includes some particularly desirable combinations of
sinks and sources (such as mutable files -> socket).

We primarily want this for reflink copies and to avoid the syscall
overhead of a read-write loop with a small stack buffer.
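
The "try the cheap strategy first, then fall back" ordering can be
sketched in C roughly as follows. This is not the Rust std code: it only
tries copy_file_range and then the userspace loop, skipping sendfile and
splice. The names copy_fd() and copy_read_write(), the buffer sizes, and
the set of errno values treated as "wrong strategy" are assumptions made
for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain userspace read-write loop; returns bytes copied or -1 on error. */
static ssize_t copy_read_write(int in_fd, int out_fd)
{
	char buf[8192];
	ssize_t total = 0;

	for (;;) {
		ssize_t n = read(in_fd, buf, sizeof(buf));

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return total;

		/* write() may be short, so loop until the chunk is out. */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
}

/*
 * Copy in_fd to out_fd until EOF. Use copy_file_range() while it makes
 * progress, then let the read-write loop finish (or confirm EOF).
 */
ssize_t copy_fd(int in_fd, int out_fd)
{
	ssize_t total = 0;
	ssize_t rest;

	assert(in_fd >= 0 && out_fd >= 0);
	for (;;) {
		ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL, 1 << 20, 0);

		if (n > 0) {
			total += n;
			continue;
		}
		if (n == 0)
			break;	/* EOF, or a file type that reports 0 */
		if (errno == EINTR)
			continue;
		/* Assumed list of "this strategy does not apply" errors. */
		if (errno == ENOSYS || errno == EXDEV || errno == EINVAL ||
		    errno == EOPNOTSUPP || errno == EPERM || errno == EBADF)
			break;
		return -1;
	}

	rest = copy_read_write(in_fd, out_fd);
	return rest < 0 ? -1 : total + rest;
}
```

With two pipes as source and sink, copy_file_range is rejected and the
read-write loop does all the work; with two regular files on the same
filesystem the first loop is expected to handle the data.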

Any additional zerocopy benefit, when it doesn't lead to unstable data, is
welcome but not critical. E.g. it'd be nice if sendfile could do the following:
For a 1MB source and a socket with a 64kB sendbuffer it could zerocopy first ~900kB
safely and then memcpy the last 64kB to ensure it can't be modified after the
syscall returns. But a "just memcpy in kernel space instead of zerocopy" flag for
sendfile would be ok too.

We're currently not making use of vmsplice. In theory we'd like to use it for
copying from `&'static [u8]` sources since the type upholds the requirements of
vmsplice, but type specialization currently is not powerful enough to
select based on this lifetime and it's unclear if it'll ever be.


[0] https://doc.rust-lang.org/nightly/std/io/fn.copy.html
[1] https://github.com/rust-lang/rust/blob/ac6f3a3e778a586854bdbf8f15202e11e2348d9f/library/std/src/sys/io/kernel_copy/linux.rs#L210-L259

> 
> So you testing that thing would seem to be a great first test of
> whether any of this is realistic..
> 
>                 Linus
> 


^ permalink raw reply

* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Richard Patel @ 2026-06-05 20:32 UTC (permalink / raw)
  To: Florian Weimer
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <lhu1pek4w89.fsf@oldenburg.str.redhat.com>

On Fri, Jun 05, 2026 at 09:34:46PM +0200, Florian Weimer wrote:

> How do you detect that handling a signal is complete and IBT can be
> re-enabled?  Or is it re-enabled before entering the userspace signal
> handler?

Hi Florian,

In v1, we backed up the IBT CPU state into the (user-accessible) signal
frame from FRED/XSAVE, then restored it:
https://lore.kernel.org/lkml/20260517183024.16292-4-ripatel@wii.dev/

In v2, when entering the signal handler, the kernel just context switches
to the new user rip, bypassing IBT checks (continues executing if the
signal handler does not begin with endbr).

IBT stays enabled in both designs, just the IBT state is preserved in v1,
and lost in v2.

The same thing happens when doing a sigreturn in v2 (e.g. via trampoline),
again IBT is not enforced.  IBT stays enabled when doing a siglongjmp,
though.
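
To make the v1/v2 difference concrete, here is a toy C model of the
tracker state. It is only a sketch: none of these names exist in the
kernel or in this series, and it assumes that handler entry skips the
endbr check in both designs (the text above only states this for v2).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; all identifiers are hypothetical. */
enum ibt_tracker { IBT_IDLE, IBT_WAIT_FOR_ENDBR };
enum ibt_design { DESIGN_V1, DESIGN_V2 };

struct ibt_cpu {
	enum ibt_tracker tracker;	/* models the tracker bit in IA32_U_CET */
	bool faulted;			/* set once a #CP would have been raised */
};

struct ibt_sigframe {
	enum ibt_tracker saved;		/* only used by DESIGN_V1 */
};

/* An indirect jmp/call arms the tracker. */
static void ibt_indirect_branch(struct ibt_cpu *c)
{
	c->tracker = IBT_WAIT_FOR_ENDBR;
}

/* Execute one instruction; is_endbr says whether it is an endbr. */
static void ibt_execute(struct ibt_cpu *c, bool is_endbr)
{
	if (c->tracker == IBT_WAIT_FOR_ENDBR && !is_endbr)
		c->faulted = true;
	c->tracker = IBT_IDLE;
}

/* Signal delivery: v1 saves the tracker in the frame, v2 drops it. */
static void ibt_signal_enter(struct ibt_cpu *c, struct ibt_sigframe *f,
			     enum ibt_design d)
{
	if (d == DESIGN_V1)
		f->saved = c->tracker;
	c->tracker = IBT_IDLE;	/* assumed: handler entry is not checked */
}

/* sigreturn: v1 restores the saved tracker, v2 has nothing to restore. */
static void ibt_sigreturn(struct ibt_cpu *c, const struct ibt_sigframe *f,
			  enum ibt_design d)
{
	c->tracker = (d == DESIGN_V1) ? f->saved : IBT_IDLE;
}

/*
 * Replays: indirect jmp, signal, handler runs, sigreturn, then a nop
 * at the jump target. Returns true if the model raises a fault.
 */
bool ibt_interrupted_jump_faults(enum ibt_design d)
{
	struct ibt_cpu c = { IBT_IDLE, false };
	struct ibt_sigframe f = { IBT_IDLE };

	ibt_indirect_branch(&c);	/* jmp rax */
	ibt_signal_enter(&c, &f, d);	/* signal arrives before the target runs */
	ibt_execute(&c, false);		/* handler body, no endbr at its start */
	assert(!c.faulted);		/* handler entry never faults in this model */
	ibt_sigreturn(&c, &f, d);
	ibt_execute(&c, false);		/* nop at the jump target */
	return c.faulted;
}
```

In this model ibt_interrupted_jump_faults(DESIGN_V1) reports a fault at
the nop, while ibt_interrupted_jump_faults(DESIGN_V2) does not, which is
the gap discussed further down in this mail.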

Some time in the future, ideally:
- signal handler is *required* to start with endbr (this is easy)
- sigreturn as in my asm example enforces endbr after returning from a
  signal handler to an in-progress indirect branch
- libc (sig)longjmp is made IBT-compatible

Btw, I had self-tests for the v1 design, and {signal handle,rt_sigreturn,
siglongjmp} with {success case,violation} works flawlessly with Fedora 44
glibc amd64. With glibc i686 I ran into PLT issues, probably my fault.

I am quite surprised that siglongjmp was working, btw, since the glibc
longjmp code uses 'jmp *reg' (without notrack prefix). I guess you do an
endbr64 at the setjmp side?

> > The main question is whether glibc is happy with this prctl syscall API.
> 
> As far as I can tell, the prctl works for glibc.  Re-use of an
> arch_prctl constant might have been problematic, but the series is not
> doing that.

Nice :-)
The alternative would have been to bolt on stuff to ARCH_SHSTK, or create
an entirely new arch_prctl. Open to any API.

> The ELF GNU note parsing can be added later, but perhaps not
> cleanly.  I'm still a bit worried we might have to rev the markup
> because too many binaries are in circulation that claim compatibility,
> have never been tested, and are actually broken.  If the kernel does not
> look at the ELF bits, things are slightly simpler.

Phew, I was hoping you'd say that.

If you want, I can sketch out glibc IBT enabling and test it on Debian
and Fedora, which IIRC already compile all OS packages with
-fcf-protection=branch.

> > There is one notable gap in this patch series, to do with signals:
> >
> >   000a: mov rax, 0x100a
> >   000f: jmp rax
> >   *** signal occurs ***
> >   *** signal handler runs, does sigreturn ***
> >   100a: nop
> >
> > The above sequence does not crash.
> >
> > With IBT, it should crash at the nop (because an endbr64 is expected there).
> > The IBT state (WAIT_FOR_ENDBR in IA32_U_CET MSR) is not backed up to the
> > signal frame though.  So, when userland does a sigreturn, the CPU has
> > forgotten that it was doing an indirect branch before the signal.
> > (This specifically only occurs with signal handlers that sigreturn.)
> >
> > This is because IA32_U_CET is part of XSAVE 'supervisor' state, so
> > regular XSAVE/XRSTOR can't access it.  Doing a manual backup is tricky.
> 
> That's a bit annoying.  Is this restricted to signal handlers, or does
> it apply to page faults, too?

Only signal handlers, page faults don't reset IBT.

> > A related problem is that the signal handler routine is not checked for
> > endbr preamble.
> 
> That's not necessarily a problem because its address cannot be directly
> overwritten in userspace.  Not all indirect branches need to be checked,
> only those that have tweakable targets.  In fact, fewer ENDBR64 markers
> are better (although we wouldn't drop the marker from a signal handler
> specifically, of course).

Just one concern I have is that people start relying on signal handlers
not requiring endbr64, and then a future kernel version breaking them once
we enforce it.

Really appreciate your review,

-Richard

^ permalink raw reply

* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Florian Weimer @ 2026-06-05 19:34 UTC (permalink / raw)
  To: Richard Patel
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <20260605184715.3383415-2-ripatel@wii.dev>

* Richard Patel:

> Adds basic support for x86 userspace IBT.
>
> IBT is part of Intel CET. It requires indirect call and jump targets
> to start with an endbr{32,64} instruction, otherwise throwing #CP.
>
> In summary, this patch does 3 things:
> - Config wiring ensuring supervisor XSAVE contains IBT state
> - Allow userspace to enable IBT via prctl(PR_CFI_*) for an entire thread
> - Enable IBT support (ENDBR instructions) in VDSO
>
> Unlike the arm64 BTI API:
> - does not support mixed usermode (all or nothing)
> - does not touch page table code
> - not enabled automatically (no ELF GNU note parsing)
> - temporarily disables IBT enforcement when handling signals
> These can all be cleanly added later.

The ELF GNU note parsing can be added later, but perhaps not
cleanly.  I'm still a bit worried we might have to rev the markup
because too many binaries are in circulation that claim compatibility,
have never been tested, and are actually broken.  If the kernel does not
look at the ELF bits, things are slightly simpler.

How do you detect that handling a signal is complete and IBT can be
re-enabled?  Or is it re-enabled before entering the userspace signal
handler?

> The main question is whether glibc is happy with this prctl syscall API.

As far as I can tell, the prctl works for glibc.  Re-use of an
arch_prctl constant might have been problematic, but the series is not
doing that.

> There is one notable gap in this patch series, to do with signals:
>
>   000a: mov rax, 0x100a
>   000f: jmp rax
>   *** signal occurs ***
>   *** signal handler runs, does sigreturn ***
>   100a: nop
>
> The above sequence does not crash.
>
> With IBT, it should crash at the nop (because an endbr64 is expected there).
> The IBT state (WAIT_FOR_ENDBR in IA32_U_CET MSR) is not backed up to the
> signal frame though.  So, when userland does a sigreturn, the CPU has
> forgotten that it was doing an indirect branch before the signal.
> (This specifically only occurs with signal handlers that sigreturn.)
>
> This is because IA32_U_CET is part of XSAVE 'supervisor' state, so
> regular XSAVE/XRSTOR can't access it.  Doing a manual backup is tricky.

That's a bit annoying.  Is this restricted to signal handlers, or does
it apply to page faults, too?

> A related problem is that the signal handler routine is not checked for
> endbr preamble.

That's not necessarily a problem because its address cannot be directly
overwritten in userspace.  Not all indirect branches need to be checked,
only those that have tweakable targets.  In fact, fewer ENDBR64 markers
are better (although we wouldn't drop the marker from a signal handler
specifically, of course).

Thanks,
Florian


^ permalink raw reply

