Linux userland API discussions


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Stefan Metzmacher @ 2026-06-05 15:20 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, Andy Lutomirski, Askar Safin, akpm, axboe,
	brauner, david, dhowells, hch, jack, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato, viro,
	willy
In-Reply-To: <20260605131942.4584728e@pumpkin>

Hi David,

>>> So sendfile() as a concept (whether you use combinations of splice()
>>> system calls or the sendfile system call itself) isn't necessarily
>>> only about the zero-copy, it's really also about avoiding the user
>>> space memory management.
>>
>> I don't think so. Ok, maybe for webservers just serving tiny
>> html files, that's true. But for me with Samba it's really the
>> copy_to/from_iter() that is the major factor.
> 
> Is that copy also doing the ip checksum?

Not in my tests. I guess there's offload in the network hardware
for this.

At least at the syscall layer of sendmsg() there's no checksumming
happening.

metze


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Willy Tarreau @ 2026-06-05 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Christian Brauner,
	Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <aiGkrQnMeyPmEvRB@1wt.eu>

On Thu, Jun 04, 2026 at 06:15:41PM +0200, Willy Tarreau wrote:
> On Thu, Jun 04, 2026 at 08:58:33AM -0700, Linus Torvalds wrote:
> > On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
> > >
> > > > It looks like you're actually doing exactly the thing that I thought
> > > > was crazy and wouldn't even work reliably: you change the
> > > > common_response[] contents dynamically *after* the vmsplice, and
> > > > depend on the fact that changing it in user space changes the buffer
> > > > in the pipe too.
> > >
> > > No no, it's definitely not doing that (or it's a bug, but it's not
> > > supposed to happen). I'm perfectly aware that one must definitely not
> > > do that, and it's a guarantee the user of vmsplice() must provide.
> > 
> > Whew, good.
> > 
> > In that case, can you just try the vmsplice patch series (Christian
> > already found a bug, but I don't think it will necessarily matter in
> > practice - famous last words) and that test patch of mine, and see if
> > it all (a) works for you and (b) if you have any numbers for
> > performance that would be *great*.
> 
> Yes I wanted to do that and noted it on my todo list yesterday when
> noticing the ongoing discussion. Just been super busy with yesterday's
> bi-yearly release ;-) But at least I wanted to share quick feedback in
> this thread about existing uses.

OK so I could run the test this afternoon, with:
  - ddd664bbff63 Merge tag 'net-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
    (v7.1-rc6-178)

  - the same with Christian's vfs-7.2.vmsplice branch merged into it
    ( 8d86fcfc2857 include/linux/splice.h: trivial fix: declerations -> declarations)

Both show 71-72 Gbps of TLS traffic per core on my test utility (I
stopped at 3 cores since I only have 2x100G at the moment), so for
this use case I'm not impacted by the change. I noted that I will
have to reconsider other options for the cache (send(MSG_ZEROCOPY)
probably), but since that code doesn't exist yet in my case, it's not
a userland breakage per se, just a change of plans. I just hope I'll
find my way through the alternate solution.

FWIW for Christian's branch:

Tested-by: Willy Tarreau <w@1wt.eu>

Willy


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 15:54 UTC (permalink / raw)
  To: Florian Weimer
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <87se71jps4.fsf@oldenburg.str.redhat.com>

On Fri, 5 Jun 2026 at 02:33, Florian Weimer <fweimer@redhat.com> wrote:
>
> * Linus Torvalds:
>
> > x86 really doesn't *care*. If the caller zero-extends or leaves high
> > bits set randomly, according to the x86 ABI that's perfectly fine: the
> > callee will only care about the low 32 bits. So the high bits are
> > simply not relevant for the ABI.
>
> Please note that Clang does not implement the x86-64 ABI and requires
> zero extension.  We see increasing problems from that, now that we have
> more C code calling Rust code.

Uhhuh. But that is only specific to 'bool', right?

If it were to have the same issue that powerpc(*) had - that 'unsigned
int' has to be passed to functions with well-defined high bits - that
would be bad.

And I'm pretty sure that clang doesn't do that.

Anyway, for the kernel, this shouldn't be an issue simply because we
typically avoid 'bool' in arguments or structures that are exposed to
outside.

(I say 'typically' because I'm sure it happens in some broken UAPI
thing anyway).

                  Linus

(*) I may mis-remember. Maybe it was s390, not powerpc. The s390
compat layer independently had a similar issue wrt pointers, where bit
31 had to be cleared. s390 dropped the 31-bit code entirely fairly
recently, but it caused some "interesting" code in the already
disgusting syscall argument handling wrapper macros.


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 15:58 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Andy Lutomirski, Askar Safin, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <f1c7fbbf-5be1-48b0-8927-2d9b75a35816@samba.org>

On Fri, 5 Jun 2026 at 08:15, Stefan Metzmacher <metze@samba.org> wrote:
>
> It means the most common workload, e.g. a file only opened for
> file serving (or simple opens in general) would still be able to
> be optimized.

Nope. If your web server opens files with write access, I'd be
extremely surprised.

And if you don't have write access, and you're sending out data from
files you opened just for reading - the only sane case - you hit all
the existing problems with "I can certainly look up pages, but I damn
well shouldn't pass those pages to the networking code without copying
them".

               Linus


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 16:02 UTC (permalink / raw)
  To: Mark Brown
  Cc: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Aiswharya.TCV,
	ltp, Miklos Szeredi, patches
In-Reply-To: <d9806b34-fc73-4878-997a-95c5e8ae4b29@sirena.org.uk>

On Fri, 5 Jun 2026 at 04:02, Mark Brown <broonie@kernel.org> wrote:
>
> FWIW this is triggering a failure in the LTP vmsplice01 test case (which
> sends with a vmsplice() and then tries to read that with a splice()) in
> -next:
>
> | tst_tmpdir.c:316: TINFO: Using /tmp/LTP_vmsp3vEmQ as tmpdir (tmpfs filesystem)
> | tst_test.c:2047: TINFO: LTP version: 20260130
> | tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260604 #1 SMP @1780589917 armv7l
> | tst_kconfig.c:71: TINFO: Couldn't locate kernel config!
> | tst_test.c:1875: TINFO: Overall timeout per run is 0h 00m 30s

I think this is the same thing that Christian already noted (he said
"reported by David", but I don't know which David ;), where the
vmsplice() writev() emulation was done as a blocking write, even
though vmsplice only blocked at the beginning (ie waiting only for
_initial_ space to write, not then blocking afterwards).

                 Linus


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Mark Brown @ 2026-06-05 16:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Aishwharya.TCV,
	ltp, Miklos Szeredi, patches
In-Reply-To: <CAHk-=wjBZAzPdZgEeHAtSiwJpomt8ZZgKbixuiHfRm09a4=PtA@mail.gmail.com>


On Fri, Jun 05, 2026 at 09:02:52AM -0700, Linus Torvalds wrote:
> On Fri, 5 Jun 2026 at 04:02, Mark Brown <broonie@kernel.org> wrote:

> > | tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260604 #1 SMP @1780589917 armv7l
> > | tst_kconfig.c:71: TINFO: Couldn't locate kernel config!
> > | tst_test.c:1875: TINFO: Overall timeout per run is 0h 00m 30s

> I think this is the same thing that Christian already noted (he said
> "reported by David", but I don't know which David ;), where the
> vmsplice() writev() emulation was done as a blocking write, even
> though vmsplice only blocked at the beginning (ie waiting only for
> _initial_ space to write, not then blocking afterwards).

Ah, yes it is - exactly the same issue that's mentioned in[1], I missed
it in the middle of the quite large thread and didn't directly find
David's report.  Sorry for the duplication there.

[1] https://lore.kernel.org/r/20260603-raumfahrt-unmerklich-ertrugen-c4ecae70d5f9@brauner



* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Florian Weimer @ 2026-06-05 16:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wjkZSAhxykvG6tQM5DnBoS30_XCKkYpCsQwEGcxJb=i3Q@mail.gmail.com>

* Linus Torvalds:

> On Fri, 5 Jun 2026 at 02:33, Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Linus Torvalds:
>>
>> > x86 really doesn't *care*. If the caller zero-extends or leaves high
>> > bits set randomly, according to the x86 ABI that's perfectly fine: the
>> > callee will only care about the low 32 bits. So the high bits are
>> > simply not relevant for the ABI.
>>
>> Please note that Clang does not implement the x86-64 ABI and requires
>> zero extension.  We see increasing problems from that, now that we have
>> more C code calling Rust code.
>
> Uhhuh. But that is only specific to 'bool', right?

Also char and short.  This

extern int a[];
int
f (short i)
{
  return a[i];
}

gets turned into:

f:
	movslq	%edi, %rax
	movl	a(,%rax,4), %eax
	retq

This code assumes that the short value has been previously sign-extended
into %edi.

As I read the original psABI, this assumption was not valid, and the
extra bits were unspecified by omission.  And GCC tends to use shorter
instruction encodings without extension if that does not result in
partial register stalls.

> If it were to have the same issue that powerpc(*) had - that 'unsigned
> int' has to be passed to functions with well-defined high bits - that
> would be bad.

I would have to ask around.  It's hard to tell from experiments what the
expectations around int/unsigned arguments are.  Clang and LLVM treat
the upper bits from int/unsigned return values as undefined in some
cases.

> Anyway, for the kernel, this shouldn't be an issue simply because we
> typically avoid 'bool' in arguments or structures that are exposed to
> outside.

Right, but array indexing with u8/u16/s8/s16 function arguments is
impacted, too.

Thanks,
Florian



* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 16:27 UTC (permalink / raw)
  To: Florian Weimer
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wjkZSAhxykvG6tQM5DnBoS30_XCKkYpCsQwEGcxJb=i3Q@mail.gmail.com>

On Fri, 5 Jun 2026 at 08:54, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If it were to have the same issue that powerpc(*) had - that 'unsigned
> int' has to be passed to functions with well-defined high bits - that
> would be bad.
>
> And I'm pretty sure that clang doesn't do that.

It's perhaps worth noting *why* it's horribly bad and why I think the
powerpc ABI is nasty: it means that *some* things are done in 32 bits,
but other things then expect the upper bits to always match.

It caused security issues, where user space would (for example) pass
in an 'int fd' that was valid in the low bits, but then had interesting
upper bits.

The range check in the kernel would then compare fd to max_fds - using
a 32-bit unsigned compare - and see that it is all in range.

Then it would use the *exact same fd variable* to index into the fd
array, but the compiler would use the full 64-bit value for that array
dereference - without having ever checked those upper bits. And it had
passed the unmodified full 64-bit value around the whole time, all the
way from untrusted user space, and the kernel code all looked
"obviously correct" and had all the proper checks in place.

If you want to bleed out of your eyes, take a look at the rather
horrendous macros in <linux/syscalls.h> (and the sometimes even more
horrendous arch 'syscall_wrappers.h' files).

They deal with issues like this - and others - with some truly
inscrutable code. You have to be super-human to be able to read it,
but those wrappers are why we can then just do things like

   SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid)

and it will generate not only infrastructure for tracing etc, but also
the code necessary to force clean up the types for the architecture.
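
To make the fd failure mode described above concrete, here is a minimal
userland sketch. It is not kernel code and does not reproduce the real
wrapper macros: 'raw' merely models the full 64-bit register in which an
'int fd' arrives, and MAX_FDS and all function names are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_FDS 64u	/* hypothetical fd table size */

/* The range check is done as a 32-bit unsigned compare. */
static bool range_check_32(uint64_t raw)
{
	return (uint32_t)raw < MAX_FDS;
}

/* Buggy pattern: the array index is taken from the full 64-bit register. */
static uint64_t index_without_cleanup(uint64_t raw)
{
	return raw;
}

/* Wrapper-style cleanup: truncate to the declared type before any use. */
static uint64_t index_with_cleanup(uint64_t raw)
{
	uint32_t fd = (uint32_t)raw;

	assert(fd == (raw & 0xffffffffu));	/* only the low 32 bits survive */
	return fd;
}
```

With raw = 0xdeadbeef00000003, the 32-bit check passes (it sees 3), the
uncleaned index is far outside the table, and the cleaned index is 3.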

                    Linus


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 17:12 UTC (permalink / raw)
  To: Florian Weimer
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <87wlwdhrvr.fsf@oldenburg.str.redhat.com>

On Fri, 5 Jun 2026 at 09:30, Florian Weimer <fweimer@redhat.com> wrote:
>
> > Uhhuh. But that is only specific to 'bool', right?
>
> Also char and short.

That sounds like a complete ABI violation as far as I can tell.

Scary. Because I would not be surprised if we have code that assumes otherwise.

Now, the kernel *seldom* uses char/short types, and compilers
are typically at least self-consistent in those cases, and we don't
interact directly with untrusted sources.

The system call interface is special, but we wrap that for other
reasons so deeply these days that we'd not be impacted.

But we also do have various assembler code, and I certainly wasn't
aware that apparently compilers have been walking away from the old
ABI rules.

I did find assembler code that clearly uses just 8-bit register
accesses and function calls, but it was all _entirely_ within
assembler. The low-level debug printing in

    arch/x86/kernel/relocate_kernel_64.S

puts the character values in %al and then calls pr_char_8250() or
pr_char_8250_mmio32() with it, but that is *all* in asm code.

I didn't find anything obvious that calls C code with that kind of
argument though (which makes sense - we typically call the other way:
C code calling into asm code, not the other way around).

So at a guess we're fine, but it's still somewhat unsettling.

And maybe others were aware of this, and it's just me that has old
32-bit x86 code in mind.

              Linus


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-06-05 17:21 UTC (permalink / raw)
  To: Mark Brown, Linus Torvalds
  Cc: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, Pedro Falcato, Aishwharya.TCV, ltp, Miklos Szeredi,
	patches
In-Reply-To: <1eac3b42-5dde-41c4-930a-d74cda9e6d68@sirena.org.uk>

On 6/5/26 18:26, Mark Brown wrote:
> On Fri, Jun 05, 2026 at 09:02:52AM -0700, Linus Torvalds wrote:
>> On Fri, 5 Jun 2026 at 04:02, Mark Brown <broonie@kernel.org> wrote:
> 
>>> | tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260604 #1 SMP @1780589917 armv7l
>>> | tst_kconfig.c:71: TINFO: Couldn't locate kernel config!
>>> | tst_test.c:1875: TINFO: Overall timeout per run is 0h 00m 30s
> 
>> I think this is the same thing that Christian already noted (he said
>> "reported by David", but I don't know which David ;), where the
>> vmsplice() writev() emulation was done as a blocking write, even
>> though vmsplice only blocked at the beginning (ie waiting only for
>> _initial_ space to write, not then blocking afterwards).
> 
> Ah, yes it is - exactly the same issue that's mentioned in[1], I missed
> it in the middle of the quite large thread and didn't directly find
> David's report.  Sorry for the duplication there.

Yeah, I quickly discussed this with Christian on a different channel and he
ended up sharing the report with the analysis directly.

-- 
Cheers,

David


* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Florian Weimer @ 2026-06-05 19:34 UTC (permalink / raw)
  To: Richard Patel
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <20260605184715.3383415-2-ripatel@wii.dev>

* Richard Patel:

> Adds basic support for x86 userspace IBT.
>
> IBT is part of Intel CET. It requires indirect call and jump targets
> to start with an endbr{32,64} instruction, otherwise throwing #CP.
>
> In summary, this patch does 3 things:
> - Config wiring ensuring supervisor XSAVE contains IBT state
> - Allow userspace to enable IBT via prctl(PR_CFI_*) for an entire thread
> - Enable IBT support (ENDBR instructions) in VDSO
>
> Unlike the arm64 BTI API:
> - does not support mixed usermode (all or nothing)
> - does not touch page table code
> - not enabled automatically (no ELF GNU note parsing)
> - temporarily disables IBT enforcement when handling signals
> These can all be cleanly added later.

The ELF GNU note parsing can be added later, but perhaps not
cleanly.  I'm still a bit worried we might have to rev the markup
because too many binaries are in circulation that claim compatibility,
have never been tested, and are actually broken.  If the kernel does not
look at the ELF bits, things are slightly simpler.

How do you detect that handling a signal is complete and IBT can be
re-enabled?  Or is it re-enabled before entering the userspace signal
handler?

> The main question is whether glibc is happy with this prctl syscall API.

As far as I can tell, the prctl works for glibc.  Re-use of an
arch_prctl constant might have been problematic, but the series is not
doing that.

> There is one notable gap in this patch series, to do with signals:
>
>   000a: mov rax, 0x100a
>   000f: jmp rax
>   *** signal occurs ***
>   *** signal handler runs, does sigreturn ***
>   100a: nop
>
> The above sequence does not crash.
>
> With IBT, it should crash at the nop (because an endbr64 is expected there).
> The IBT state (WAIT_FOR_ENDBR in IA32_U_CET MSR) is not backed up to the
> signal frame though.  So, when userland does a sigreturn, the CPU has
> forgotten that it was doing an indirect branch before the signal.
> (This specifically only occurs with signal handlers that sigreturn.)
>
> This is because IA32_U_CET is part of XSAVE 'supervisor' state, so
> regular XSAVE/XRSTOR can't access it.  Doing a manual backup is tricky.

That's a bit annoying.  Is this restricted to signal handlers, or does
it apply to page faults, too?

> A related problem is that the signal handler routine is not checked for
> endbr preamble.

That's not necessarily a problem because its address cannot be directly
overwritten in userspace.  Not all indirect branches need to be checked,
only those that have tweakable targets.  In fact, fewer ENDBR64 markers
are better (although we wouldn't drop the marker from a signal handler
specifically, of course).

Thanks,
Florian


* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Richard Patel @ 2026-06-05 20:32 UTC (permalink / raw)
  To: Florian Weimer
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <lhu1pek4w89.fsf@oldenburg.str.redhat.com>

On Fri, Jun 05, 2026 at 09:34:46PM +0200, Florian Weimer wrote:

> How do you detect that handling a signal is complete and IBT can be
> re-enabled?  Or is it re-enabled before entering the userspace signal
> handler?

Hi Florian,

In v1, we backed up the IBT CPU state into the (user-accessible) signal
frame from FRED/XSAVE, then restored it:
https://lore.kernel.org/lkml/20260517183024.16292-4-ripatel@wii.dev/

In v2, when entering the signal handler, the kernel just context switches
to the new user rip, bypassing IBT checks (continues executing if the
signal handler does not begin with endbr).

IBT stays enabled in both designs, just the IBT state is preserved in v1,
and lost in v2.

The same thing happens when doing a sigreturn in v2 (e.g. via trampoline),
again IBT is not enforced.  IBT stays enabled when doing a siglongjmp,
though.

Some time in the future, ideally:
- signal handler is *required* to start with endbr (this is easy)
- sigreturn as in my asm example enforces endbr after returning from a
  signal handler to an in-progress indirect branch
- libc (sig)longjmp is made IBT-compatible

Btw, I had self-tests for the v1 design, and {signal handler,rt_sigreturn,
siglongjmp} with {success case,violation} work flawlessly with Fedora 44
glibc amd64. With glibc i686 I ran into PLT issues, probably my fault.

I was quite surprised that siglongjmp was working, btw, since the glibc
longjmp code uses 'jmp *reg' (without notrack prefix). I guess you do an
endbr64 at the setjmp side?

> > The main question is whether glibc is happy with this prctl syscall API.
> 
> As far as I can tell, the prctl works for glibc.  Re-use of an
> arch_prctl constant might have been problematic, but the series is not
> doing that.

Nice :-)
The alternative would have been to bolt on stuff to ARCH_SHSTK, or create
an entirely new arch_prctl. Open to any API.

> The ELF GNU note parsing can be added later, but perhaps not
> cleanly.  I'm still a bit worried we might have to rev the markup
> because too many binaries are in circulation that claim compatibility,
> have never been tested, and are actually broken.  If the kernel does not
> look at the ELF bits, things are slightly simpler.

Phew, I was hoping you'd say that.

If you want, I can sketch out glibc IBT enabling and test it on Debian
and Fedora, which IIRC already compile all OS packages with
-fcf-protection=branch.

> > There is one notable gap in this patch series, to do with signals:
> >
> >   000a: mov rax, 0x100a
> >   000f: jmp rax
> >   *** signal occurs ***
> >   *** signal handler runs, does sigreturn ***
> >   100a: nop
> >
> > The above sequence does not crash.
> >
> > With IBT, it should crash at the nop (because an endbr64 is expected there).
> > The IBT state (WAIT_FOR_ENDBR in IA32_U_CET MSR) is not backed up to the
> > signal frame though.  So, when userland does a sigreturn, the CPU has
> > forgotten that it was doing an indirect branch before the signal.
> > (This specifically only occurs with signal handlers that sigreturn.)
> >
> > This is because IA32_U_CET is part of XSAVE 'supervisor' state, so
> > regular XSAVE/XRSTOR can't access it.  Doing a manual backup is tricky.
> 
> That's a bit annoying.  Is this restricted to signal handlers, or does
> it apply to page faults, too?

Only signal handlers, page faults don't reset IBT.

> > A related problem is that the signal handler routine is not checked for
> > endbr preamble.
> 
> That's not necessarily a problem because its address cannot be directly
> overwritten in userspace.  Not all indirect branches need to be checked,
> only those that have tweakable targets.  In fact, fewer ENDBR64 markers
> are better (although we wouldn't drop the marker from a signal handler
> specifically, of course).

Just one concern I have is that people start relying on signal handlers
not requiring endbr64, and then a future kernel version breaking them once
we enforce it.

Really appreciate your review,

-Richard


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: The 8472 @ 2026-06-05 20:54 UTC (permalink / raw)
  To: Linus Torvalds, Willy Tarreau
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Christian Brauner,
	Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <CAHk-=wg0e8pP5haNW4qJP1=QwwUEctwjK5k07sv8bskitoMDgg@mail.gmail.com>

On 04/06/2026 17:58, Linus Torvalds wrote:
> On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
>>
>>> It looks like you're actually doing exactly the thing that I thought
>>> was crazy and wouldn't even work reliably: you change the
>>> common_response[] contents dynamically *after* the vmsplice, and
>>> depend on the fact that changing it in user space changes the buffer
>>> in the pipe too.
>>
>> No no, it's definitely not doing that (or it's a bug, but it's not
>> supposed to happen). I'm perfectly aware that one must definitely not
>> do that, and it's a guarantee the user of vmsplice() must provide.
> 
> Whew, good.
> 
> In that case, can you just try the vmsplice patch series (Christian
> already found a bug, but I don't think it will necessarily matter in
> practice - famous last words) and that test patch of mine, and see if
> it all (a) works for you and (b) if you have any numbers for
> performance that would be *great*.
> 
> There aren't many obvious splice users out there, and even if they
> were to exist they are typically specialized enough that you have to
> have a real use case to then tell if the patches make a difference in
> real life or not.

In the Rust standard library we use splice as one of several strategies
in our generic io::copy[0] routine. It selects the strategy[1] based on
source and sink types.

It tries

- copy_file_range
- sendfile
- splice
- fallback to userspace read-write loop

sendfile or splice are skipped when we can't uphold the "callers must ensure
transferred portions of in_fd remain unmodified" condition on the manpage,
which unfortunately includes some particularly desirable combinations of
sinks and sources (such as mutable files -> socket).
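
As a rough illustration of the fallback chain listed above, here is a
minimal C sketch. It is not the Rust standard library's implementation:
copy_with_fallback() and read_write_chunk() are hypothetical names, the
splice step is left out (it needs a pipe on one side), and the sketch
does not check the "source must remain unmodified" condition at all. It
simply moves on to the next strategy whenever the current one fails.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

enum strategy { USE_COPY_FILE_RANGE, USE_SENDFILE, USE_READ_WRITE };

/* Last resort: one read() into a small stack buffer, then write it all out. */
static ssize_t read_write_chunk(int in_fd, int out_fd, size_t len)
{
	char buf[8192];
	size_t want = len < sizeof(buf) ? len : sizeof(buf);
	ssize_t n = read(in_fd, buf, want);
	ssize_t off = 0;

	if (n <= 0)
		return n;
	while (off < n) {
		ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

		if (w < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		off += w;
	}
	return n;
}

/* Copy up to len bytes, downgrading the strategy whenever one fails. */
static ssize_t copy_with_fallback(int in_fd, int out_fd, size_t len)
{
	enum strategy s = USE_COPY_FILE_RANGE;
	size_t done = 0;

	assert(in_fd >= 0 && out_fd >= 0);
	while (done < len) {
		size_t left = len - done;
		ssize_t n;

		if (s == USE_COPY_FILE_RANGE)
			n = copy_file_range(in_fd, NULL, out_fd, NULL, left, 0);
		else if (s == USE_SENDFILE)
			n = sendfile(out_fd, in_fd, NULL, left);
		else
			n = read_write_chunk(in_fd, out_fd, left);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			if (s != USE_READ_WRITE) {
				s++;	/* try the next, more generic strategy */
				continue;
			}
			return -1;
		}
		if (n == 0)
			break;		/* end of input */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

A pipe-to-pipe copy is an easy way to exercise the whole chain, since
copy_file_range() rejects pipes and the copy ends up in the read-write
loop at the latest.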

We primarily want this for reflink copies and to avoid the syscall
overhead of a read-write loop with a small stack buffer.

Any additional zerocopy benefit, when it doesn't lead to unstable data, is
welcome but not critical. E.g. it'd be nice if sendfile could do the following:
For a 1MB source and a socket with a 64kB sendbuffer it could zerocopy first ~900kB
safely and then memcpy the last 64kB to ensure it can't be modified after the
syscall returns. But a "just memcpy in kernel space instead of zerocopy" flag for
sendfile would be ok too.

We're currently not making use of vmsplice. In theory we'd like to use it for
copying from `&'static [u8]` sources since the type upholds the requirements of
vmsplice, but type specialization currently is not powerful enough to
select based on this lifetime and it's unclear if it'll ever be.

[0] https://doc.rust-lang.org/nightly/std/io/fn.copy.html
[1] https://github.com/rust-lang/rust/blob/ac6f3a3e778a586854bdbf8f15202e11e2348d9f/library/std/src/sys/io/kernel_copy/linux.rs#L210-L259

> 
> So you testing that thing would seem to be a great first test of
> whether any of this is realistic..
> 
>                 Linus
> 


* [PATCH 0/5] vmsplice: fix some problems in my previous vmsplice patchset
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches

This patchset is for VFS. Of course, it depends on my previous vmsplice
patchset.

I fix some problems in my previous patchset.

1. Fix problem with CLASS(fd, f)(fd). See first patch for details.
This is probably not so important, but I fix it anyway.

2. Change "unsigned long" back to "int". See second patch for details.
Again, this is probably not important, but I want to fix this anyway.

3. Fix that LTP vmsplice01 bug.

See patches for details.

Please, run that LTP vmsplice01 test again.

Notes:

- I want to repeat: I change the behavior around SPLICE_F_NONBLOCK.
Previously, vmsplice ignored whether the pipe itself was opened as
a non-blocking file. Now it is not ignored. In my opinion, the
new behavior is better.
- vmsplice(2) is now in fs/read_write.c. It is now very similar to
preadv2 and pwritev2, so I think it belongs there.
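
For readers less familiar with the userland side of the SPLICE_F_NONBLOCK
note above, here is a minimal sketch of a per-call non-blocking vmsplice.
The helper name vmsplice_nonblock() is made up for illustration. The
sketch only shows the flag on the call itself; whether O_NONBLOCK on the
pipe is also honored depends on the kernel, as described in the note.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hand one user buffer to the write end of a pipe without blocking.
 * The caller must not modify buf until the reader has consumed it.
 */
static ssize_t vmsplice_nonblock(int pipe_wr, const void *buf, size_t len)
{
	struct iovec iov = {
		.iov_base = (void *)buf,	/* iov_base is not const-qualified */
		.iov_len = len,
	};

	assert(buf != NULL || len == 0);
	return vmsplice(pipe_wr, &iov, 1, SPLICE_F_NONBLOCK);
}
```

On success the return value is the number of bytes placed in the pipe;
on a full pipe the call is expected to fail with EAGAIN rather than wait.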

Please review this patchset carefully. I'm still a new contributor.
In particular, please review that do-while loop; I'm not sure I did
everything right.

Tested in Qemu.

Askar Safin (5):
  vmsplice: open-code do_writev and do_readv
  vmsplice: change argument type back to "int"
  splice: turn wait_for_space flags argument into bool
  pipe: move wait_for_space to fs/pipe.c and rename it
  vmsplice: make sure we don't wait after writing some data

 fs/pipe.c                 | 17 +++++++++++
 fs/read_write.c           | 61 ++++++++++++++++++++++++++++++++++-----
 fs/splice.c               | 19 +-----------
 include/linux/pipe_fs_i.h |  2 ++
 include/linux/syscalls.h  |  2 +-
 5 files changed, 75 insertions(+), 26 deletions(-)


base-commit: 8d86fcfc2857d64af85f5c87c193c25655c970af (vfs-7.2.vmsplice)
-- 
2.47.3



* [PATCH 1/5] vmsplice: open-code do_writev and do_readv
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

My previous vmsplice patch made the following mistake: I did
"CLASS(fd, f)(fd)", then did some checks on resulting "struct file",
then passed numeric (!) file descriptor to a function.

This is somewhat okay in this particular case, but I still think
this is code smell, so I fix this by open-coding do_writev and do_readv.

Also I insert a comment to warn other developers to keep
do_writev and do_readv in sync with vmsplice(2).

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 1e5444f4d..e224e7cb8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1070,6 +1070,7 @@ static ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 			unsigned long vlen, rwf_t flags)
 {
+	/* All future changes to this function should be kept in sync with vmsplice(2). */
 	CLASS(fd_pos, f)(fd);
 	ssize_t ret = -EBADF;
 
@@ -1093,6 +1094,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 			 unsigned long vlen, rwf_t flags)
 {
+	/* All future changes to this function should be kept in sync with vmsplice(2). */
 	CLASS(fd_pos, f)(fd);
 	ssize_t ret = -EBADF;
 
@@ -1226,14 +1228,24 @@ SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
 	if (fd_empty(f))
 		return -EBADF;
 
-	/* We do do_writev/do_readv, so it is okay to pass "false" here */
+	/* We do vfs_writev/vfs_readv, so it is okay to pass "false" here */
 	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
 		return -EBADF;
 
-	if (fd_file(f)->f_mode & FMODE_WRITE)
-		return do_writev(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
-	else
-		return do_readv(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+	if (fd_file(f)->f_mode & FMODE_WRITE) {
+		ssize_t ret = vfs_writev(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+		if (ret > 0)
+			add_wchar(current, ret);
+		inc_syscw(current);
+		return ret;
+	} else {
+		ssize_t ret = vfs_readv(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+
+		if (ret > 0)
+			add_rchar(current, ret);
+		inc_syscr(current);
+		return ret;
+	}
 }
 
 /*
-- 
2.47.3


^ permalink raw reply related

* [PATCH 2/5] vmsplice: change argument type back to "int"
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

My previous vmsplice patchset changed the type of the vmsplice "fd"
argument from "int" to "unsigned long". This may cause problems, so
let's change it back.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c          | 2 +-
 include/linux/syscalls.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index e224e7cb8..77487b307 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1218,7 +1218,7 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 /*
  * Legacy preadv2/pwritev2 wrapper.
  */
-SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
+SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		unsigned long, vlen, unsigned int, flags)
 {
 	if (unlikely(flags & ~SPLICE_F_ALL))
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a86a88207..46a3ec954 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -514,7 +514,7 @@ asmlinkage long sys_ppoll_time32(struct pollfd __user *, unsigned int,
 			  struct old_timespec32 __user *, const sigset_t __user *,
 			  size_t);
 asmlinkage long sys_signalfd4(int ufd, sigset_t __user *user_mask, size_t sizemask, int flags);
-asmlinkage long sys_vmsplice(unsigned long fd, const struct iovec __user *vec,
+asmlinkage long sys_vmsplice(int fd, const struct iovec __user *vec,
 			     unsigned long vlen, unsigned int flags);
 asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
 			   int fd_out, loff_t __user *off_out,
-- 
2.47.3


^ permalink raw reply related

* [PATCH 3/5] splice: turn wait_for_space flags argument into bool
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

I want to do this, because I will move this function to fs/pipe.c.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/splice.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 6ddf7dd72..707db2c2c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1239,7 +1239,7 @@ ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(splice_file_range);
 
-static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
+static int wait_for_space(struct pipe_inode_info *pipe, bool non_block)
 {
 	for (;;) {
 		if (unlikely(!pipe->readers)) {
@@ -1248,7 +1248,7 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 		}
 		if (!pipe_is_full(pipe))
 			return 0;
-		if (flags & SPLICE_F_NONBLOCK)
+		if (non_block)
 			return -EAGAIN;
 		if (signal_pending(current))
 			return -ERESTARTSYS;
@@ -1268,7 +1268,7 @@ ssize_t splice_file_to_pipe(struct file *in,
 	ssize_t ret;
 
 	pipe_lock(opipe);
-	ret = wait_for_space(opipe, flags);
+	ret = wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
 	if (!ret)
 		ret = do_splice_read(in, offset, opipe, len, flags);
 	pipe_unlock(opipe);
-- 
2.47.3


^ permalink raw reply related

* [PATCH 4/5] pipe: move wait_for_space to fs/pipe.c and rename it
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

This is needed, because I plan to use it in fs/read_write.c.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/pipe.c                 | 17 +++++++++++++++++
 fs/splice.c               | 19 +------------------
 include/linux/pipe_fs_i.h |  2 ++
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9..c0ccf21b9 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1451,6 +1451,23 @@ long pipe_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 	return ret;
 }
 
+int pipe_wait_for_space(struct pipe_inode_info *pipe, bool non_block)
+{
+	for (;;) {
+		if (unlikely(!pipe->readers)) {
+			send_sig(SIGPIPE, current, 0);
+			return -EPIPE;
+		}
+		if (!pipe_is_full(pipe))
+			return 0;
+		if (non_block)
+			return -EAGAIN;
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+		pipe_wait_writable(pipe);
+	}
+}
+
 static const struct super_operations pipefs_ops = {
 	.destroy_inode = free_inode_nonrcu,
 	.statfs = simple_statfs,
diff --git a/fs/splice.c b/fs/splice.c
index 707db2c2c..d12243d19 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1239,23 +1239,6 @@ ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(splice_file_range);
 
-static int wait_for_space(struct pipe_inode_info *pipe, bool non_block)
-{
-	for (;;) {
-		if (unlikely(!pipe->readers)) {
-			send_sig(SIGPIPE, current, 0);
-			return -EPIPE;
-		}
-		if (!pipe_is_full(pipe))
-			return 0;
-		if (non_block)
-			return -EAGAIN;
-		if (signal_pending(current))
-			return -ERESTARTSYS;
-		pipe_wait_writable(pipe);
-	}
-}
-
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -1268,7 +1251,7 @@ ssize_t splice_file_to_pipe(struct file *in,
 	ssize_t ret;
 
 	pipe_lock(opipe);
-	ret = wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
+	ret = pipe_wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
 	if (!ret)
 		ret = do_splice_read(in, offset, opipe, len, flags);
 	pipe_unlock(opipe);
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index a1eeed800..be653625d 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -335,4 +335,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file, bool for_splice);
 int create_pipe_files(struct file **, int);
 unsigned int round_pipe_size(unsigned int size);
 
+int pipe_wait_for_space(struct pipe_inode_info *pipe, bool non_block);
+
 #endif
-- 
2.47.3


^ permalink raw reply related

* [PATCH 5/5] vmsplice: make sure we don't wait after writing some data
From: Askar Safin @ 2026-06-06  6:10 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel, ltp,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, Steven Rostedt, The 8472, Willy Tarreau,
	Joanne Koong, patches
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

Make sure we don't wait for space in the pipe after writing some data.
This is needed for compatibility with the previous version of vmsplice.
Found by LTP vmsplice01.
See comments in the code and links below for details.

Link: https://lore.kernel.org/all/20260603-raumfahrt-unmerklich-ertrugen-c4ecae70d5f9@brauner/
Link: https://lore.kernel.org/all/CAHk-=wgV-j-G3d+899Zm1pQ=NaJrddPz=GKcL5Yw5DTUM=GaUw@mail.gmail.com/
Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 77487b307..dbd0debc2 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1221,6 +1221,8 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		unsigned long, vlen, unsigned int, flags)
 {
+	struct pipe_inode_info *pipe;
+
 	if (unlikely(flags & ~SPLICE_F_ALL))
 		return -EINVAL;
 
@@ -1229,11 +1231,44 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		return -EBADF;
 
 	/* We do vfs_writev/vfs_readv, so it is okay to pass "false" here */
-	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
+	pipe = get_pipe_info(fd_file(f), /* for_splice = */ false);
+
+	if (!pipe)
 		return -EBADF;
 
 	if (fd_file(f)->f_mode & FMODE_WRITE) {
-		ssize_t ret = vfs_writev(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+		/*
+		 * When writing to the pipe, previous implementation of vmsplice
+		 * first waited for space in the pipe to appear
+		 * (depending on whether SPLICE_F_NONBLOCK was passed),
+		 * then did unconditional non-blocking write to the pipe.
+		 *
+		 * This differs from what pwritev2 does.
+		 *
+		 * For compatibility we do the same thing previous
+		 * implementation did.
+		 *
+		 * We lock the pipe, do pipe_wait_for_space, then unlock
+		 * the pipe, and then do vfs_writev. vfs_writev internally
+		 * locks the pipe again. This may cause TOCTOU: when we
+		 * do vfs_writev, the pipe may become full again. So we
+		 * do a loop.
+		 */
+
+		bool non_block = (flags & SPLICE_F_NONBLOCK) || (fd_file(f)->f_flags & O_NONBLOCK);
+		ssize_t ret;
+
+		do {
+			pipe_lock(pipe);
+			ret = pipe_wait_for_space(pipe, non_block);
+			pipe_unlock(pipe);
+
+			if (ret < 0)
+				break;
+
+			ret = vfs_writev(fd_file(f), vec, vlen, NULL, RWF_NOWAIT);
+		} while (!non_block && ret == -EAGAIN);
+
 		if (ret > 0)
 			add_wchar(current, ret);
 		inc_syscw(current);
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-06  9:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Florian Weimer, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wiTQr4YYYUH38srGvWAq3_UpDeAPR+qZWVyf-ZU7z8Hzw@mail.gmail.com>

On Fri, 5 Jun 2026 10:12:05 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 5 Jun 2026 at 09:30, Florian Weimer <fweimer@redhat.com> wrote:
> >  
> > > Uhhuh. But that is only specific to 'bool', right?  
> >
> > Also char and short.  
> 
> That sounds like a complete ABI violation as far as I can tell.
> 
> Scary. Because I would not be surprised if we have code that assumes otherwise.
> 
> Now, the kernel *seldom* uses char/short types, and since compilers
> are typically at least self-consistent in those cases and we don't
> interact directly with untrusted sources.

There are plenty of places where char/short are used for function call
parameters/results (and not for single characters or similar).

I'm sure some people (even some who should really know better) think
the smaller type will save space.

I've always worried about whether the calling or the called code is
responsible for ensuring the unused bits are zero (or, for a signed
value, the sign extension).
Clearly the compiler should obey its own rules - so mostly it is just
an extra instruction to do the masking.
But for interactions with asm code, and possibly code that gets mixed
between gcc and clang (maybe for out of tree modules), it does matter.
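
To make the "extra instruction" concrete, here is a small sketch of what a
callee has to do if it cannot trust the caller to have cleaned the upper
bits of a register-sized argument. The function names are invented for
illustration; "raw" stands for the full register value as received.

```c
/* Zero-extend: keep the low 8 bits, discard whatever sits above them. */
static inline unsigned int arg_to_u8(unsigned long raw)
{
	return (unsigned int)(raw & 0xffu);
}

/*
 * Sign-extend the low 16 bits. The xor/subtract form avoids relying on
 * an implementation-defined conversion to a signed type.
 */
static inline int arg_to_s16(unsigned long raw)
{
	unsigned int low = (unsigned int)(raw & 0xffffu);

	return (int)(low ^ 0x8000u) - 0x8000;
}
```

If the ABI guarantees that the caller already extended the value, both
helpers are redundant; if it does not, leaving them out means the callee
sees whatever garbage was in the high bits.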

You also don't really want to be doing maths on char/short (and there
are quite a lot of those as well). I think it is only x86 and m68k that
actually have 8/16 bit maths instructions (is s390 old enough?);
everywhere else the compiler has to explicitly mask the high bits.
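
A short illustration of where that masking comes from at the C level: the
operands of "+" are promoted to int first, so the 8-bit result only exists
once the sum is truncated back. The function names are just for this
example.

```c
/* Both operands are promoted to int before the addition happens. */
static inline int add_promoted(unsigned char a, unsigned char b)
{
	return a + b;
}

/* Storing into an 8-bit type requires truncating the int result. */
static inline unsigned char add_u8(unsigned char a, unsigned char b)
{
	return (unsigned char)(a + b);
}
```

For a = 200 and b = 100 the first function yields 300 and the second 44;
the truncation in the second one is the step that costs an explicit mask
on targets without narrow arithmetic.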

Maybe it is time to 'nuke' all the 'short' locals/parameters/results
(eg from htons()) as well as all the 'long' for values that aren't
dependent on 32/64 bit builds.

-- David

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-06 10:22 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Linus Torvalds, Andy Lutomirski, Askar Safin, akpm, axboe,
	brauner, david, dhowells, hch, jack, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato, viro,
	willy
In-Reply-To: <634c8ae2-3f1c-46b1-b002-1e2ac797dd80@samba.org>

On Fri, 5 Jun 2026 17:20:34 +0200
Stefan Metzmacher <metze@samba.org> wrote:

> Hi David,
> 
> >>> So sendfile() as a concept (whether you use combinations of splice()
> >>> system calls or the sendfile system call itsefl) isn't necessarily
> >>> only about the zero-copy, it's really also about avoiding the user
> >>> space memory management.  
> >>
> >> I don't think so. Ok, maybe for webservers just serving tiny
> >> html files, that's true. But for me with Samba it's really the
> >> copy_to/from_iter() that is the major factor.  
> > 
> > Is that copy also doing the ip checksum?  
> 
> Not in my tests. I guess there's offload in the network hardware
> for this.

There will be, it is just whether the syscall checksum is actually
being suppressed.

-- David

> 
> At least at the syscall layer of sendmsg() there's no checksuming
> happening.
> 
> metze
> 


^ permalink raw reply

* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Florian Weimer @ 2026-06-06 13:40 UTC (permalink / raw)
  To: Richard Patel
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <aiMyaJ8zDl76YOVN@wii.dev>

* Richard Patel:

> On Fri, Jun 05, 2026 at 09:34:46PM +0200, Florian Weimer wrote:
>
>> How do you detect that handling a signal is complete and IBT can be
>> re-enabled?  Or is it re-enabled before entering the userspace signal
>> handler?
>
> Hi Florian,
>
> In v1, we backed up the IBT CPU state into the (user-accessible) signal
> frame from FRED/XSAVE, then restored it:
> https://lore.kernel.org/lkml/20260517183024.16292-4-ripatel@wii.dev/
>
> In v2, when entering the signal handler, the kernel just context switches
> to the new user rip, bypassing IBT checks (continues executing if the
> signal handler does not begin with endbr).

What's the reason for this?

> Some time in the future, ideally:
> - signal handler is *required* to start with endbr (this is easy)
> - sigreturn as in my asm example enforces endbr after returning from a
>   signal handler to an in-progress indirect branch
> - libc (sig)longjmp is made IBT-compatible

I think the compiler already emits ENDBR markers for returns-twice
functions, which is why longjmp does not use a no-track jump.  Other
architectures require such a proliferation of markers because they do
not support no-track jumps at all.  However, longjmp is arguably a
corner case.  It's not completely safe, like loading a function address
from a RELRO GOT and jumping to it.

> Btw, I had self-tests for the v1 design, and {signal handle,rt_sigreturn,
> siglongjmp} with {success case,violation} works flawlessly with Fedora 44
> glibc amd64. With glibc i686 I ran into PLT issues, probably my fault.

There's no IBT support planned for i686, that's why we dropped all
marker instructions in Fedora.

> I was quite surprised that siglongjmp was working, btw, since the glibc
> longjmp code uses 'jmp *reg' (without notrack prefix). I guess you do an
> endbr64 at the setjmp side?

Yes, compilers generate landing pads for returns-twice functions.  Not
ideal, but it's the only way to get setjmp working on targets without
NOTRACK.

>> The ELF GNU note parsing can be added later, but perhaps not
>> cleanly.  I'm still a bit worried we might have to rev the markup
>> because too many binaries are in circulation that claim compatibility,
>> have never been tested, and are actually broken.  If the kernel does not
>> look at the ELF bits, things are slightly simpler.
>
> Phew, I was hoping you'd say that.
>
> If you want, I can sketch out glibc IBT enabling and test it on Debian
> and Fedora, which IIRC already compile with -fcf-protection=branch
> for all OS packages.

For Fedora, please coordinate with Arjun (Cc:ed), who is going through
the motions of enabling SHSTK for real.

>> That's not necessarily a problem because its address cannot be directly
>> overwritten in userspace.  Not all indirect branches need to be checked,
>> only those that have tweakable targets.  In fact, fewer ENDBR64 markers
>> are better (although we wouldn't drop the marker from a signal handler
>> specifically, of course).
>
> Just one concern I have is that people start relying on signal handlers
> not requiring endbr64, and then a future kernel version breaking them once
> we enforce it.

Would software enforcement be a possibility?  The kernel could check if
the landing pad is there.

Thanks,
Florian


^ permalink raw reply

* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Richard Patel @ 2026-06-06 23:05 UTC (permalink / raw)
  To: Florian Weimer
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <lhua4t73hz9.fsf@oldenburg.str.redhat.com>

On Sat, Jun 06, 2026 at 03:40:10PM +0200, Florian Weimer wrote:
> * Richard Patel:
> 
> > On Fri, Jun 05, 2026 at 09:34:46PM +0200, Florian Weimer wrote:
> >
> >> How do you detect that handling a signal is complete and IBT can be
> >> re-enabled?  Or is it re-enabled before entering the userspace signal
> >> handler?
> >
> > Hi Florian,
> >
> > In v1, we backed up the IBT CPU state into the (user-accessible) signal
> > frame from FRED/XSAVE, then restored it:
> > https://lore.kernel.org/lkml/20260517183024.16292-4-ripatel@wii.dev/
> >
> > In v2, when entering the signal handler, the kernel just context switches
> > to the new user rip, bypassing IBT checks (continues executing if the
> > signal handler does not begin with endbr).
> 
> What's the reason for this?

Hi Florian,

We just don't have a nice way to include IBT state in the signal frame
right now.  v1 had an uabi change (adding a new bit in ucontext_t uc_flags),
which was originally proposed by Intel years ago.  My preferred way to add
IBT state is to carve out an XSAVE area in fpstate, which works well with all
the existing signal frame code.

But I figured it's better to just keep the first pass at user IBT super
simple, in the hopes upstream is more inclined to accept that.

BTW, OpenBSD uses the v2 approach (don't preserve IBT state across signal
handlers), presumably because it's also hard for them to restore IBT state
on sigreturn.

> >> That's not necessarily a problem because its address cannot be directly
> >> overwritten in userspace.  Not all indirect branches need to be checked,
> >> only those that have tweakable targets.  In fact, fewer ENDBR64 markers
> >> are better (although we wouldn't drop the marker from a signal handler
> >> specifically, of course).
> >
> > Just one concern I have is that people start relying on signal handlers
> > not requiring endbr64, and then a future kernel version breaking them once
> > we enforce it.
> 
> Would software enforcement be a possibility?  The kernel could check if
> the landing pad is there.

Enforcement is the easy part.  I can trivially add back 'check if signal
handler starts with endbr64'.  Just the backup/restore of the pre-signal
handler state ('do I expect an endbr64 after returning') is the tricky part.
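
As a rough idea of what the 'check if signal handler starts with endbr64'
part could look like, here is a userspace sketch that tests a byte buffer
for the ENDBR64 encoding (F3 0F 1E FA). It is an assumption-laden
illustration, not the kernel code: a real check would have to read the
bytes from user memory safely, and the function name is made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ENDBR64 is encoded as the four bytes F3 0F 1E FA. */
static const unsigned char endbr64_bytes[4] = { 0xf3, 0x0f, 0x1e, 0xfa };

/* Returns true if the code at "code" (of "len" bytes) begins with ENDBR64. */
bool starts_with_endbr64(const unsigned char *code, size_t len)
{
	if (len < sizeof(endbr64_bytes))
		return false;
	return memcmp(code, endbr64_bytes, sizeof(endbr64_bytes)) == 0;
}
```

This covers only the entry check; it says nothing about the harder
backup/restore of the pre-signal tracker state.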

Thank you,
-Richard

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-07 13:22 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Christian Brauner, Kees Cook, Alexander Viro, linux-fsdevel,
	linux-api, linux-kernel, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <87fr31xdz3.fsf@mailhost.krisman.be>

Hi Gabriel,

Yes, I looked at Josh's slides and your RFC a few days ago.

I agree that io_uring is a very interesting direction, and I can see why it
fits the "ordered setup operations before exec" model.

My current preference is still to first explore a pidfd/pidfs-based builder,
modeled roughly like fsconfig(). Process creation feels like a core process
lifecycle API, and I think a normal fd-based syscall interface may be easier
for libc, language runtimes, shells, and sandboxing tools to adopt.

My hesitation is practical rather than conceptual. Some important
deployments still disable io_uring entirely; Docker's default seccomp
profile blocks the io_uring syscalls, and Google has disabled or restricted
io_uring in ChromeOS, Android app processes, and production servers.

I will study your io_uring work more carefully and compare the two directions.
One possible outcome is that io_uring can drive/share the same builder object later;
I do not know that yet.

Thanks for pointing this out.

 ---- On Fri, 05 Jun 2026 22:24:00 +0800  Gabriel Krisman Bertazi <krisman@suse.de> wrote --- 
 > Li Chen <me@linux.beauty> writes:
 > 
 > > Hi,
 > >
 > > This is an early RFC for an idea that is probably still rough in both the
 > > UAPI and implementation details. Sorry for the rough edges; I am sending
 > > it now to check whether this direction is worth pursuing and to get
 > > feedback on the kernel/userspace boundary.
 > >
 > > The series is based on linux-next version 20260518.
 > >
 > > This RFC adds spawn_template, a userspace-controlled exec acceleration
 > > mechanism for runtimes that repeatedly start the same executable with
 > > different argv, envp, and per-spawn file descriptor setup.
 > 
 > Have you looked at Josh's proposal to do this over io_uring [1] and my
 > implementation of it at [2]?  I think io_uring is a very natural
 > interface for something like this, it will avoid adding a larger API,
 > since you could, in theory, set up the entire new task context using
 > regular io_uring operations in an io workqueue and then starting it would
 > be a matter of forking the pre-configured io thread with a new io_uring
 > operation.
 > 
 > [1]
 > https://lpc.events/event/16/contributions/1213/attachments/1012/1945/io-uring-spawn.pdf
 > [2] https://lwn.net/Articles/1001622/
 > 
 > >
 > > The main target is agent runtimes. Modern coding agents repeatedly start
 > > short-lived helper tools such as rg, git, sed, awk, python, node, and
 > > shell wrappers while they inspect and edit a workspace. Those runtimes
 > > already know which tools are hot, and they are also the right place to
 > > decide policy. The kernel does not choose names such as rg, git, or sed.
 > > Userspace opts in by creating a template fd for one executable, then uses
 > > that fd for later spawns. Launchers, shells, and build systems have a
 > > similar repeated-startup shape and could use the same primitive, but the
 > > agent runtime case is the main motivation for this RFC.
 > >
 > > The mechanism applies to the executable that userspace asks the kernel to
 > > start. If an agent runtime directly starts /usr/bin/rg, the rg executable
 > > is the template target. If the runtime starts /usr/bin/bash -c "rg ... |
 > > head", the shell is the template target unless the shell itself opts in
 > > when it starts rg and head. The kernel does not parse the shell command
 > > string or rewrite inner commands into template spawns. Userspace has to
 > > call spawn_template for those inner commands explicitly:
 > >
 > >     direct exec                 shell wrapper
 > >     -----------                 -------------
 > >     agent                       agent
 > >       template("/usr/bin/rg")     template("/usr/bin/bash")
 > >       spawn rg argv              spawn bash -c "rg ... | head"
 > >
 > >     kernel target: rg          kernel target: bash
 > >     rg startup benefits        rg/head need shell opt-in
 > >
 > > Several agent runtime discussions are moving toward direct argv-style
 > > exec tools for both security and policy clarity. For example, opencode
 > > issue #2206 proposes an exec tool as a safer alternative to a shell-only
 > > bash tool:
 > >
 > > https://github.com/anomalyco/opencode/issues/2206
 > >
 > > spawn_template is meant to support both models. Direct exec users can
 > > cache the actual hot tool. Shell-wrapper users can cache the shell and
 > > still reduce shell startup cost. If a shell or an agent runtime later
 > > uses the same API for commands started inside a shell command, those
 > > inner tools can benefit too.
 > >
 > > Each spawn still goes through the normal exec path. The template reuses
 > > only metadata that can be revalidated before use. Credential preparation,
 > > permission checks, binary handler checks, secure-exec handling, and LSM
 > > hooks remain on the normal execve path.
 > >
 > > The UAPI has two operations. spawn_template_create() creates an
 > > anonymous-inode template fd from either an executable fd or an absolute
 > > executable path. spawn_template_spawn() starts one child from that
 > > template, applies per-spawn fd, cwd, and signal actions, and returns both
 > > pid and pidfd.
 > >
 > > fd inheritance is deliberately conservative. By default, after the
 > > requested per-spawn actions have run, the child closes fds above stderr.
 > > An agent runtime can still request traditional inheritance explicitly,
 > > but helper tools do not inherit unrelated secret files or sockets by
 > > accident. The create-time actions fields are reserved and rejected in
 > > this RFC because fd numbers are per-process state, not stable reusable
 > > objects. The caller supplies fd actions for each spawn instead.
 > >
 > > A typical agent runtime would keep one template per hot executable and
 > > still build argv, envp, cwd, and pipe wiring for each tool call:
 > >
 > >     rg_tmpl = spawn_template_create("/usr/bin/rg");
 > >
 > >     for each search request:
 > >         out_r, out_w = pipe_cloexec();
 > >         err_r, err_w = pipe_cloexec();
 > >         actions = [
 > >             FCHDIR(worktree_fd),
 > >             DUP2(out_w, STDOUT_FILENO),
 > >             DUP2(err_w, STDERR_FILENO),
 > >         ];
 > >         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
 > >         close(out_w);
 > >         close(err_w);
 > >         read out_r and err_r;
 > >         waitid(P_PIDFD, child.pidfd, ...);
 > >
 > > A shell-wrapper runtime would use the same shape with a template for
 > > /usr/bin/bash and argv such as ["/usr/bin/bash", "-c", command]. That
 > > reduces shell startup cost, but it does not cache rg or head inside that
 > > command unless the shell also opts into spawn_template for commands it
 > > starts internally.
 > >
 > > The template pins the executable and denies writes to that file while the
 > > template fd is alive, so cached executable metadata cannot race with a
 > > writer changing the same inode. This means direct in-place writes to the
 > > executable can fail while a runtime keeps a template open. It does not
 > > block the common package-manager update pattern where a new inode is
 > > written and then atomically renamed over the old path. In that case the
 > > old path-created template becomes stale, spawn_template_spawn() rejects
 > > it with ESTALE, and the runtime should close and recreate the template
 > > for the new executable.
 > >
 > >     in-place write              package-manager update
 > >     --------------              ----------------------
 > >     template pins old inode     write new inode
 > >     write(old inode) denied     rename(new, "/usr/bin/rg")
 > >
 > >     cached metadata safe        old template sees path mismatch
 > >                                 spawn_template_spawn() = -ESTALE
 > >                                 recreate template for new inode
 > >
 > > Each spawn revalidates executable identity before cached metadata is
 > > used. Path-created templates only accept absolute paths: a relative path
 > > such as ./tool depends on cwd, and the same string can name a different
 > > file after chdir. For an absolute path template, each spawn reopens the
 > > path and checks that it still resolves to the executable recorded when
 > > the template was created. If the path now names a replaced file, the
 > > template is stale and userspace should close and recreate it.
 > >
 > > A template fd can be passed over SCM_RIGHTS like any other fd, but this
 > > RFC does not treat that as delegation. spawn_template_spawn() only works
 > > while the caller still has the same struct cred object that created the
 > > template. If another task, or the same task after a credential change,
 > > receives the fd, spawn fails instead of running the executable using the
 > > creator's launch authority:
 > >
 > >     ordinary fd                         spawn_template fd
 > >     -----------                         -----------------
 > >     A: open log                         A: create rg template
 > >     A -> B: SCM_RIGHTS(fd)              A -> B: SCM_RIGHTS(tfd)
 > >
 > >     B: read(fd) = ok                    B: spawn(tfd) = -EACCES
 > >                                         B: create own rg template
 > >                                         B: spawn(own_tfd) = ok
 > >
 > >     open-file use is delegated          spawn authority is not delegated
 > >
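
The point of the diagram above is that the check is on the identity of the credential object, not on equal credential values. A toy model of that rule, with hypothetical names (struct toy_cred, struct toy_template, toy_spawn_check()) that are not the kernel's types or functions; using -EACCES for the credential-change case is an assumption, since the cover letter only shows that error for the SCM_RIGHTS receiver:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins; these are not the kernel's types. */
struct toy_cred {
	unsigned int uid;
};

struct toy_template {
	const struct toy_cred *creator;	/* cred object current at creation */
};

/* Allow spawn only for the exact cred object that created the template. */
static int toy_spawn_check(const struct toy_template *t,
			   const struct toy_cred *current_cred)
{
	assert(t != NULL && current_cred != NULL);
	/* Pointer identity, not value equality. */
	return t->creator == current_cred ? 0 : -EACCES;
}
```

In this model a second cred object with the same uid is still rejected, which corresponds to "the same task after a credential change" in the text.
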
 > > The cached state is intentionally small. The template fd keeps the opened
 > > main executable file, an optional absolute path string, the creator
 > > credential pointer, and the deny-write state. The executable identity key
 > > records device, inode, size, mode, owner, ctime, and mtime, and is
 > > rechecked before cached metadata is used. The ELF cache keeps only the
 > > main executable's ELF header, program header table, and program header
 > > count.
 > >
 > >     cached in this RFC          not cached in this RFC
 > >     ------------------          ----------------------
 > >     opened main executable      PT_INTERP metadata
 > >     executable identity key     shared-library graph
 > >     main ELF header             VMA layout metadata
 > >     main ELF program headers    cross-process metadata sharing
 > >     creator cred pointer
 > >     deny-write state
 > >
 > > This RFC does not cache ELF interpreter metadata, shared-library
 > > dependency state, or derived mapping-layout state. Shared-library
 > > resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
 > > RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
 > > state. It also does not share cached executable metadata between template
 > > fds created by different processes. Each template owns its small cached
 > > metadata object in this RFC.
 > >
 > > Performance
 > > ===========
 > >
 > > The numbers below come from my separate local autogen-bench project.
 > > autogen-bench uses AutoGen [1] Core as the agent harness: RoutedAgent
 > > instances run under SingleThreadedAgentRuntime, and RPC-style dispatch
 > > fans out concurrent tool-call requests to worker agents. The workload
 > > definitions, generated test files, and subprocess/spawn_template backends
 > > are local to autogen-bench.
 > >
 > > The agent-tools preset includes direct tool calls and shell-wrapper forms
 > > for:
 > >
 > > rg, grep, sed, awk, cat, head, tail, find, stat, ls, git-status, git-diff,
 > > python-small, node-small, sh-c, and bash-c.
 > >
 > > The benchmark is launch-heavy but not no-op: it searches generated
 > > Python-like source files, reads sample files, runs small Python and
 > > Node.js programs, and runs git status and git diff in a small repository.
 > > It does not include model inference or long-running tool work, so the
 > > numbers mainly describe the short-tool regime.
 > >
 > > The subprocess column starts each tool call through the existing
 > > userspace launch path. The spawn_template column creates templates for
 > > hot executables and uses spawn_template_spawn() for later calls.
 > >
 > > Total in-flight tool calls stay at 16; only the worker-process split
 > > changes. For example, 4x4 means 4 worker processes with 4 in-flight tool
 > > calls each. The two time_s values are subprocess/spawn_template wall
 > > times.
 > >
 > > Workload     Calls  subprocess  spawn_template  time_s       Delta
 > > (workers)    total  calls/s     calls/s         sub/tmpl, s  calls/s
 > > 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
 > > 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
 > > 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
 > > 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
 > >
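
The Delta column appears to be the relative change in calls/s between the two backends. A small arithmetic check under that assumption; delta_pct() and calls_per_sec() are hypothetical helper names, and the inputs are the values from the table:

```c
#include <assert.h>

/* Relative throughput change in percent; the baseline must be positive. */
static double delta_pct(double subprocess_cps, double template_cps)
{
	assert(subprocess_cps > 0.0);
	return (template_cps - subprocess_cps) / subprocess_cps * 100.0;
}

/* Throughput implied by a call count and a wall time in seconds. */
static double calls_per_sec(double calls, double seconds)
{
	assert(seconds > 0.0);
	return calls / seconds;
}
```

For the 4x4 row, delta_pct(955.61, 1003.25) comes out just under 5%, and calls_per_sec(6144, 6.43) lands close to the listed subprocess rate; the small differences are consistent with time_s being rounded to two decimals.
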
 > > The table measures the whole mixed workload, including both process
 > > startup and the short tool work done after exec. Since this workload is
 > > launch-heavy, the possible launch-side savings include:
 > >
 > > - the template fd keeps an opened executable, avoiding repeated ordinary
 > >   open/path setup for that executable;
 > > - the kernel can reuse cached main-executable ELF header and program
 > >   header metadata after revalidation;
 > > - the fork-and-exec-style launch is submitted as one
 > >   spawn_template_spawn() operation;
 > > - fd, cwd, and signal actions run in the child kernel path instead of
 > >   being driven one syscall at a time by userspace child glue;
 > > - pid and pidfd are returned by the same operation, reducing some
 > >   runtime-side bookkeeping.
 > >
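
For comparison, the "userspace child glue" in the list above looks roughly like the following in a conventional launch path. This is a minimal sketch assuming plain fork()/execv(); the benchmark's subprocess backend may use a different mechanism (vfork, posix_spawn, or a language runtime helper), and launch_fork_exec() is a hypothetical name.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Launch exe with argv, optionally redirecting stdout and changing cwd.
 * Returns the child's exit status, or -1 on failure or abnormal exit.
 */
static int launch_fork_exec(const char *exe, char *const argv[],
			    const char *cwd, int stdout_fd)
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child glue: each setup step is a separate syscall. */
		if (stdout_fd >= 0 && dup2(stdout_fd, STDOUT_FILENO) < 0)
			_exit(127);
		if (cwd && chdir(cwd) < 0)
			_exit(127);
		signal(SIGPIPE, SIG_DFL);
		execv(exe, argv);
		_exit(127);	/* exec failed */
	}

	/* Parent: wait for the child, retrying if interrupted. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The dup2(), chdir(), and signal() calls in the child, plus the separate path lookup inside execv(), are the steps that the list above describes as being folded into one spawn_template_spawn() operation.
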
 > > In local experiments before this RFC, I also tried caching ELF
 > > interpreter metadata and derived ELF mapping-layout metadata. A focused
 > > repeated-exec benchmark did not show a stable standalone throughput gain
 > > for those two optimizations, so this RFC leaves them out and keeps only
 > > the main executable metadata cache.
 > >
 > > I also tried sharing main-executable ELF metadata across template fds
 > > created by different processes for the same executable identity. That can
 > > reduce duplicated metadata memory when many agent worker processes create
 > > their own templates for /usr/bin/rg, /usr/bin/git, and similar tools, but
 > > it did not show a stable throughput win in local multi-agent tests. It
 > > also adds cache keying, lifetime, invalidation, credential, and namespace
 > > questions to the RFC. This version therefore keeps per-template metadata
 > > ownership and leaves cross-process sharing out.
 > >
 > > Sorry again for the rough edges in this RFC. I would appreciate feedback
 > > on whether this direction is useful and what the right API boundary
 > > should be.
 > >
 > > Thanks,
 > > Li
 > >
 > > [1]: https://github.com/microsoft/autogen
 > >
 > > Li Chen (13):
 > >   exec: factor argument setup out of do_execveat_common()
 > >   exec: add an internal helper for opened executables
 > >   file: expose helpers for in-kernel fd actions
 > >   exec: add spawn template UAPI definitions
 > >   exec: add spawn template file descriptors
 > >   exec: add spawn_template_spawn()
 > >   exec: validate spawn template executable identity
 > >   binfmt_elf: cache ELF metadata for spawn templates
 > >   Documentation: describe spawn templates
 > >   exec: require absolute paths for path-created templates
 > >   exec: let close-range actions target the max fd
 > >   syscalls: add generic spawn template entries
 > >   selftests/exec: cover spawn template basics
 > >
 > >  Documentation/userspace-api/index.rst         |   1 +
 > >  .../userspace-api/spawn_template.rst          | 153 +++
 > >  MAINTAINERS                                   |   6 +
 > >  arch/x86/entry/syscalls/syscall_64.tbl        |   3 +-
 > >  fs/Makefile                                   |   2 +-
 > >  fs/binfmt_elf.c                               | 104 +-
 > >  fs/exec.c                                     | 162 ++-
 > >  fs/file.c                                     |  11 +-
 > >  fs/spawn_template.c                           | 619 +++++++++++
 > >  include/linux/binfmts.h                       |  10 +
 > >  include/linux/fdtable.h                       |   2 +
 > >  include/linux/spawn_template.h                |  72 ++
 > >  include/linux/syscalls.h                      |   7 +
 > >  include/uapi/asm-generic/unistd.h             |   7 +-
 > >  include/uapi/linux/spawn_template.h           |  62 ++
 > >  scripts/syscall.tbl                           |   2 +
 > >  tools/testing/selftests/exec/Makefile         |   1 +
 > >  tools/testing/selftests/exec/spawn_template.c | 997 ++++++++++++++++++
 > >  18 files changed, 2179 insertions(+), 42 deletions(-)
 > >  create mode 100644 Documentation/userspace-api/spawn_template.rst
 > >  create mode 100644 fs/spawn_template.c
 > >  create mode 100644 include/linux/spawn_template.h
 > >  create mode 100644 include/uapi/linux/spawn_template.h
 > >  create mode 100644 tools/testing/selftests/exec/spawn_template.c
 > 
 > -- 
 > Gabriel Krisman Bertazi
 > 

Regards,
Li


^ permalink raw reply

* Re: [LTP] [PATCH 0/5] vmsplice: fix some problems in my previous vmsplice patchset
From: Andrea Cervesato @ 2026-06-08 11:40 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	The 8472, patches, David Howells, linux-mm, Collin Funk,
	Joanne Koong, Miklos Szeredi, David Laight, Matthew Wilcox,
	Christoph Hellwig, Steven Rostedt, fuse-devel, David Hildenbrand,
	Pedro Falcato, ltp, Jens Axboe, Stefan Metzmacher, netdev,
	linux-kernel, Andy Lutomirski, linux-api, Andrew Morton,
	Linus Torvalds, Willy Tarreau
In-Reply-To: <20260606061031.3744880-1-safinaskar@gmail.com>

Hi Askar,

The patch set doesn't apply:

error: fs/read_write.c: does not exist in index
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Applying: vmsplice: open-code do_writev and do_readv
Patch failed at 0001 vmsplice: open-code do_writev and do_readv

https://github.com/linux-test-project/ltp-agent/actions/runs/27129052434/job/80065058557#step:8:21

Please rebase it onto the upstream master branch and send a new
version.

Regards,
--
Andrea Cervesato
SUSE QE Automation Engineer Linux
andrea.cervesato@suse.com

^ permalink raw reply
