Linux userland API discussions


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: The 8472 @ 2026-06-05 20:54 UTC (permalink / raw)
  To: Linus Torvalds, Willy Tarreau
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Christian Brauner,
	Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <CAHk-=wg0e8pP5haNW4qJP1=QwwUEctwjK5k07sv8bskitoMDgg@mail.gmail.com>

On 04/06/2026 17:58, Linus Torvalds wrote:
> On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
>>
>>> It looks like you're actually doing exactly the thing that I thought
>>> was crazy and wouldn't even work reliably: you change the
>>> common_response[] contents dynamically *after* the vmsplice, and
>>> depend on the fact that changing it in user space changes the buffer
>>> in the pipe too.
>>
>> No no, it's definitely not doing that (or it's a bug, but it's not
>> supposed to happen). I'm perfectly aware that one must definitely not
>> do that, and it's a guarantee the user of vmsplice() must provide.
> 
> Whew, good.
> 
> In that case, can you just try the vmsplice patch series (Christian
> already found a bug, but I don't think it will necessarily matter in
> practice - famous last words) and that test patch of mine, and see if
> it all (a) works for you and (b) if you have any numbers for
> performance that would be *great*.
> 
> There aren't many obvious splice users out there, and even if they
> were to exist they are typically specialized enough that you have to
> have a real use case to then tell if the patches make a difference in
> real life or not.

In the Rust standard library we use splice as one of several strategies
in our generic io::copy[0] routine. It selects the strategy[1] based on
source and sink types.

It tries

- copy_file_range
- sendfile
- splice
- fallback to userspace read-write loop

sendfile or splice are skipped when we can't uphold the "callers must ensure
transferred portions in_fd remain unmodified" condition on the manpage,
which unfortunately includes some particularly desirable combinations of
sinks and sources (such as mutable files -> socket).

We primarily want this for reflink copies and to avoid the syscall
overhead of a read-write loop with a small stack buffer.
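
As a rough illustration of the fallback order above, here is a minimal C
sketch. It is not the Rust implementation linked in [1]; the names copy_fd,
copy_strategy and source_is_stable are made up for this example, the splice
step is left out (it needs a pipe on one side), and the error handling is
deliberately simplified to "fall through if the first call fails".

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Which strategy ended up moving the data (hypothetical names). */
enum copy_strategy { USED_COPY_FILE_RANGE, USED_SENDFILE, USED_READ_WRITE };

/* Last resort: userspace read/write loop with a small stack buffer. */
static ssize_t copy_read_write(int in_fd, int out_fd)
{
    char buf[8192];
    ssize_t total = 0, n;

    while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return n < 0 ? -1 : total;
}

/* Try the in-kernel paths first, then fall back to the plain loop. */
ssize_t copy_fd(int in_fd, int out_fd, int source_is_stable,
                enum copy_strategy *used)
{
    const size_t chunk = 1 << 20;
    ssize_t total = 0, n;

    /* 1. copy_file_range: may reflink or copy entirely in the kernel. */
    while ((n = copy_file_range(in_fd, NULL, out_fd, NULL, chunk, 0)) > 0)
        total += n;
    if (n == 0) {
        *used = USED_COPY_FILE_RANGE;
        return total;
    }
    if (total > 0)
        return -1;      /* failed midway: report it, do not retry */

    /* 2. sendfile: only when the caller promises the source is stable. */
    if (source_is_stable) {
        while ((n = sendfile(out_fd, in_fd, NULL, chunk)) > 0)
            total += n;
        if (n == 0) {
            *used = USED_SENDFILE;
            return total;
        }
        if (total > 0)
            return -1;
    }

    /* 3. Userspace fallback. */
    *used = USED_READ_WRITE;
    return copy_read_write(in_fd, out_fd);
}

/* Usage: copy a small temp file and verify it; returns 0 on success. */
int copy_demo(void)
{
    char src[] = "/tmp/copy_src_XXXXXX", dst[] = "/tmp/copy_dst_XXXXXX";
    const char msg[] = "hello from copy_fd";
    char back[sizeof(msg)] = {0};
    enum copy_strategy used;
    int in_fd = mkstemp(src), out_fd = mkstemp(dst);
    int ok;

    if (in_fd < 0 || out_fd < 0)
        return -1;
    ok = write(in_fd, msg, sizeof(msg)) == (ssize_t)sizeof(msg)
        && lseek(in_fd, 0, SEEK_SET) == 0
        && copy_fd(in_fd, out_fd, 1, &used) == (ssize_t)sizeof(msg)
        && pread(out_fd, back, sizeof(back), 0) == (ssize_t)sizeof(back)
        && memcmp(msg, back, sizeof(msg)) == 0;
    close(in_fd);
    close(out_fd);
    unlink(src);
    unlink(dst);
    return ok ? 0 : -1;
}
```

The source_is_stable flag stands in for the type-based decision described
above: when the caller cannot promise that the source stays unmodified, the
sketch skips sendfile and goes straight to the read-write loop.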

Any additional zerocopy benefit, when it doesn't lead to unstable data, is
welcome but not critical. E.g. it'd be nice if sendfile could do the following:
For a 1MB source and a socket with a 64kB sendbuffer it could zerocopy the first ~900kB
safely and then memcpy the last 64kB to ensure it can't be modified after the
syscall returns. But a "just memcpy in kernel space instead of zerocopy" flag for
sendfile would be ok too.

We're currently not making use of vmsplice. In theory we'd like to use it for
copying from `&'static [u8]` sources since the type upholds the requirements of
vmsplice, but type specialization currently is not powerful enough to
select based on this lifetime and it's unclear if it'll ever be.

[0] https://doc.rust-lang.org/nightly/std/io/fn.copy.html
[1] https://github.com/rust-lang/rust/blob/ac6f3a3e778a586854bdbf8f15202e11e2348d9f/library/std/src/sys/io/kernel_copy/linux.rs#L210-L259

> 
> So you testing that thing would seem to be a great first test of
> whether any of this is realistic..
> 
>                 Linus
> 


* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Richard Patel @ 2026-06-05 20:32 UTC (permalink / raw)
  To: Florian Weimer
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <lhu1pek4w89.fsf@oldenburg.str.redhat.com>

On Fri, Jun 05, 2026 at 09:34:46PM +0200, Florian Weimer wrote:

> How do you detect that handling a signal is complete and IBT can be
> re-enabled?  Or is it re-enabled before entering the userspace signal
> handler?

Hi Florian,

In v1, we backed up the IBT CPU state into the (user-accessible) signal
frame from FRED/XSAVE, then restored it:
https://lore.kernel.org/lkml/20260517183024.16292-4-ripatel@wii.dev/

In v2, when entering the signal handler, the kernel just context switches
to the new user rip, bypassing IBT checks (continues executing if the
signal handler does not begin with endbr).

IBT stays enabled in both designs, just the IBT state is preserved in v1,
and lost in v2.

The same thing happens when doing a sigreturn in v2 (e.g. via trampoline),
again IBT is not enforced.  IBT stays enabled when doing a siglongjmp,
though.

Some time in the future, ideally:
- signal handler is *required* to start with endbr (this is easy)
- sigreturn as in my asm example enforces endbr after returning from a
  signal handler to an in-progress indirect branch
- libc (sig)longjmp is made IBT-compatible

Btw, I had self-tests for the v1 design, and {signal handler,rt_sigreturn,
siglongjmp} with {success case,violation} works flawlessly with Fedora 44
glibc amd64. With glibc i686 I ran into PLT issues, probably my fault.

I was quite surprised that siglongjmp was working, btw, since the glibc
longjmp code uses 'jmp *reg' (without notrack prefix). I guess you do an
endbr64 at the setjmp side?

> > The main question is whether glibc is happy with this prctl syscall API.
> 
> As far as I can tell, the prctl works for glibc.  Re-use of an
> arch_prctl constant might have been problematic, but the series is not
> doing that.

Nice :-)
The alternative would have been to bolt on stuff to ARCH_SHSTK, or create
an entirely new arch_prctl. Open to any API.

> The ELF GNU note parsing can be added later, but perhaps not
> cleanly.  I'm still a bit worried we might have to rev the markup
> because too many binaries are in circulation that claim compatibility,
> have never been tested, and are actually broken.  If the kernel does not
> look at the ELF bits, things are slightly simpler.

Phew, I was hoping you'd say that.

If you want, I can sketch out glibc IBT enabling and test it on Debian
and Fedora, which IIRC already compile with -fcf-protection=branch
for all OS packages.

> > There is one notable gap in this patch series, to do with signals:
> >
> >   000a: mov rax, 0x100a
> >   000f: jmp rax
> >   *** signal occurs ***
> >   *** signal handler runs, does sigreturn ***
> >   100a: nop
> >
> > The above sequence does not crash.
> >
> > With IBT, it should crash at the nop (because an endbr64 is expected there).
> > The IBT state (WAIT_FOR_ENDBR in IA32_U_CET MSR) is not backed up to the
> > signal frame though.  So, when userland does a sigreturn, the CPU has
> > forgotten that it was doing an indirect branch before the signal.
> > (This specifically only occurs with signal handlers that sigreturn.)
> >
> > This is because IA32_U_CET is part of XSAVE 'supervisor' state, so
> > regular XSAVE/XRSTOR can't access it.  Doing a manual backup is tricky.
> 
> That's a bit annoying.  Is this restricted to signal handlers, or does
> it apply to page faults, too?

Only signal handlers, page faults don't reset IBT.

> > A related problem is that the signal handler routine is not checked for
> > endbr preamble.
> 
> That's not necessarily a problem because its address cannot be directly
> overwritten in userspace.  Not all indirect branches need to be checked,
> only those that have tweakable targets.  In fact, fewer ENDBR64 markers
> are better (although we wouldn't drop the marker from a signal handler
> specifically, of course).

Just one concern I have is that people start relying on signal handlers
not requiring endbr64, and then a future kernel version breaks them once
we enforce it.

Really appreciate your review,

-Richard


* Re: [PATCH v2 0/5] Usermode Indirect Branch Tracking
From: Florian Weimer @ 2026-06-05 19:34 UTC (permalink / raw)
  To: Richard Patel
  Cc: x86, H. Peter Anvin, Peter Zijlstra, Rick Edgecombe, Yu-cheng Yu,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	David Laight, Andy Lutomirski, Kees Cook, Shuah Khan,
	linux-kselftest, linux-kernel, libc-alpha, linux-api,
	Arjun Shankar
In-Reply-To: <20260605184715.3383415-2-ripatel@wii.dev>

* Richard Patel:

> Adds basic support for x86 userspace IBT.
>
> IBT is part of Intel CET. It requires indirect call and jump targets
> to start with an endbr{32,64} instruction, otherwise throwing #CP.
>
> In summary, this patch does 3 things:
> - Config wiring ensuring supervisor XSAVE contains IBT state
> - Allow userspace to enable IBT via prctl(PR_CFI_*) for an entire thread
> - Enable IBT support (ENDBR instructions) in VDSO
>
> Unlike the arm64 BTI API:
> - does not support mixed usermode (all or nothing)
> - does not touch page table code
> - not enabled automatically (no ELF GNU note parsing)
> - temporarily disables IBT enforcement when handling signals
> These can all be cleanly added later.

The ELF GNU note parsing can be added later, but perhaps not
cleanly.  I'm still a bit worried we might have to rev the markup
because too many binaries are in circulation that claim compatibility,
have never been tested, and are actually broken.  If the kernel does not
look at the ELF bits, things are slightly simpler.

How do you detect that handling a signal is complete and IBT can be
re-enabled?  Or is it re-enabled before entering the userspace signal
handler?

> The main question is whether glibc is happy with this prctl syscall API.

As far as I can tell, the prctl works for glibc.  Re-use of an
arch_prctl constant might have been problematic, but the series is not
doing that.

> There is one notable gap in this patch series, to do with signals:
>
>   000a: mov rax, 0x100a
>   000f: jmp rax
>   *** signal occurs ***
>   *** signal handler runs, does sigreturn ***
>   100a: nop
>
> The above sequence does not crash.
>
> With IBT, it should crash at the nop (because an endbr64 is expected there).
> The IBT state (WAIT_FOR_ENDBR in IA32_U_CET MSR) is not backed up to the
> signal frame though.  So, when userland does a sigreturn, the CPU has
> forgotten that it was doing an indirect branch before the signal.
> (This specifically only occurs with signal handlers that sigreturn.)
>
> This is because IA32_U_CET is part of XSAVE 'supervisor' state, so
> regular XSAVE/XRSTOR can't access it.  Doing a manual backup is tricky.

That's a bit annoying.  Is this restricted to signal handlers, or does
it apply to page faults, too?

> A related problem is that the signal handler routine is not checked for
> endbr preamble.

That's not necessarily a problem because its address cannot be directly
overwritten in userspace.  Not all indirect branches need to be checked,
only those that have tweakable targets.  In fact, fewer ENDBR64 markers
are better (although we wouldn't drop the marker from a signal handler
specifically, of course).

Thanks,
Florian


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-06-05 17:21 UTC (permalink / raw)
  To: Mark Brown, Linus Torvalds
  Cc: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, Pedro Falcato, Aishwharya.TCV, ltp, Miklos Szeredi,
	patches
In-Reply-To: <1eac3b42-5dde-41c4-930a-d74cda9e6d68@sirena.org.uk>

On 6/5/26 18:26, Mark Brown wrote:
> On Fri, Jun 05, 2026 at 09:02:52AM -0700, Linus Torvalds wrote:
>> On Fri, 5 Jun 2026 at 04:02, Mark Brown <broonie@kernel.org> wrote:
> 
>>> | tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260604 #1 SMP @1780589917 armv7l
>>> | tst_kconfig.c:71: TINFO: Couldn't locate kernel config!
>>> | tst_test.c:1875: TINFO: Overall timeout per run is 0h 00m 30s
> 
>> I think this is the same thing that Christian already noted (he said
>> "reported by David", but I don't know which David ;), where the
>> vmsplice() writev() emulation was done as a blocking write, even
>> though vmsplice only blocked at the beginning (ie waiting only for
>> _initial_ space to write, not then blocking afterwards).
> 
> Ah, yes it is - exactly the same issue that's mentioned in[1], I missed
> it in the middle of the quite large thread and didn't directly find
> David's report.  Sorry for the duplication there.

Yeah, I quickly discussed this with Christian on a different channel and he
ended up sharing the report with the analysis directly.

-- 
Cheers,

David


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 17:12 UTC (permalink / raw)
  To: Florian Weimer
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <87wlwdhrvr.fsf@oldenburg.str.redhat.com>

On Fri, 5 Jun 2026 at 09:30, Florian Weimer <fweimer@redhat.com> wrote:
>
> > Uhhuh. But that is only specific to 'bool', right?
>
> Also char and short.

That sounds like a complete ABI violation as far as I can tell.

Scary. Because I would not be surprised if we have code that assumes otherwise.

Now, the kernel *seldom* uses char/short types, compilers are
typically at least self-consistent in those cases, and we don't
interact directly with untrusted sources.

The system call interface is special, but we wrap that for other
reasons so deeply these days that we'd not be impacted.

But we also do have various assembler code, and I certainly wasn't
aware that apparently compilers have been walking away from the old
ABI rules.

I did find assembler code that clearly uses just 8-bit register
accesses and function calls, but it was all _entirely_ within
assembler. The low-level debug printing in

    arch/x86/kernel/relocate_kernel_64.S

puts the character values in %al and then calls pr_char_8250() or
pr_char_8250_mmio32() with it, but that is *all* in asm code.

I didn't find anything obvious that calls C code with that kind of
argument though (which makes sense - we typically call the other way:
C code calling into asm code, not the other way around).

So at a guess we're fine, but it's still somewhat unsettling.

And maybe others were aware of this, and it's just me that has old
32-bit x86 code in mind.

              Linus


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 16:27 UTC (permalink / raw)
  To: Florian Weimer
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wjkZSAhxykvG6tQM5DnBoS30_XCKkYpCsQwEGcxJb=i3Q@mail.gmail.com>

On Fri, 5 Jun 2026 at 08:54, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If it were to have the same issue that powerpc(*) had - that 'unsigned
> int' has to be passed to functions with well-defined high bits - that
> would be bad.
>
> And I'm pretty sure that clang doesn't do that.

It's perhaps worth noting *why* it's horribly bad and why I think the
powerpc ABI is nasty: it means that *some* things are done in 32 bits,
but other things then expect the upper bits to always match.

It caused security issues, where user space would (for example) pass
in an 'int fd' that was valid in the low bits, and then had interesting
upper bits.

The range check in the kernel would then compare fd to max_fds - using
a 32-bit unsigned compare - and see that it is all in range.

Then it would use the *exact same fd variable* to index into the fd
array, but the compiler would use the full 64-bit value for that array
dereference - without having ever checked those upper bits. And it had
passed the unmodified full 64-bit value around the whole time, all the
way from untrusted user space, and the kernel code all looked
"obviously correct" and had all the proper checks in place.

If you want to bleed out of your eyes, take a look at the rather
horrendous macros in <linux/syscalls.h> (and the sometimes even more
horrendous arch 'syscall_wrappers.h' files).

They deal with issues like this - and others - with some truly
inscrutable code. You have to be super-human to be able to read it,
but those wrappers are why we can then just do things like

   SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid)

and it will generate not only infrastructure for tracing etc, but also
the code necessary to force clean up the types for the architecture.
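
The failure mode and the wrapper idea can be sketched together in a small
userspace model. This is only an illustration: uint64_t stands in for the raw
argument register, and DEFINE_WRAPPED2 is a hypothetical toy macro, far simpler
than the real ones in <linux/syscalls.h> (no tracing metadata, no per-arch
handling).

```c
#include <stdint.h>

#define MAX_FDS 16u

/* Bug model: the range check looks only at the low 32 bits... */
int range_check_passes(uint64_t raw_reg)
{
    return (uint32_t)raw_reg < MAX_FDS;
}

/* ...while an array dereference on the full register uses all 64 bits. */
int full_width_index_in_bounds(uint64_t raw_reg)
{
    return raw_reg < MAX_FDS;
}

/*
 * Toy wrapper: the entry point takes raw 64-bit register values and casts
 * each one to its declared type before the real body ever sees it, so stray
 * upper bits are dropped exactly once, at the boundary.
 */
#define DEFINE_WRAPPED2(name, t1, a1, t2, a2)             \
    static int64_t do_##name(t1 a1, t2 a2);                \
    int64_t entry_##name(uint64_t raw1, uint64_t raw2)     \
    {                                                      \
        return do_##name((t1)raw1, (t2)raw2);              \
    }                                                      \
    static int64_t do_##name(t1 a1, t2 a2)

/* The body only ever works with values narrowed to uint32_t. */
DEFINE_WRAPPED2(setregid_demo, uint32_t, rgid, uint32_t, egid)
{
    return (int64_t)rgid + (int64_t)egid;
}
```

With a raw value such as 0xdeadbeef00000003, the 32-bit check passes while the
full-width index is far out of range; entry_setregid_demo() called with garbage
in the upper halves still computes on the low 32 bits only.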

                    Linus


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Florian Weimer @ 2026-06-05 16:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wjkZSAhxykvG6tQM5DnBoS30_XCKkYpCsQwEGcxJb=i3Q@mail.gmail.com>

* Linus Torvalds:

> On Fri, 5 Jun 2026 at 02:33, Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Linus Torvalds:
>>
>> > x86 really doesn't *care*. If the caller zero-extends or leaves high
>> > bits set randomly, according to the x86 ABI that's perfectly fine: the
>> > callee will only care about the low 32 bits. So the high bits are
>> > simply not relevant for the ABI.
>>
>> Please note that Clang does not implement the x86-64 ABI and requires
>> zero extension.  We see increasing problems from that, now that we have
>> more C code calling Rust code.
>
> Uhhuh. But that is only specific to 'bool', right?

Also char and short.  This

extern int a[];
int
f (short i)
{
  return a[i];
}

gets turned into:

f:
	movslq	%edi, %rax
	movl	a(,%rax,4), %eax
	retq

This code assumes that the short value has been previously sign-extended
into %edi.

As I read the original psABI, this assumption was not valid, and the
extra bits were unspecified by omission.  And GCC tends to use shorter
instruction encodings without extension if that does not result in
partial register stalls.
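
To make the two readings concrete, here is a small userspace model (not
compiler output). It treats the argument register as a plain uint64_t and
compares a callee that trusts the caller to have extended the short into the
low 32 bits with one that extends the low 16 bits itself. The function names
are invented for this sketch, and the narrowing casts rely on the usual
two's-complement wrap-around that GCC and Clang implement.

```c
#include <stdint.h>

/*
 * Callee that trusts the caller, like the `movslq %edi, %rax` above:
 * it sign-extends the low 32 bits and uses the result as the index.
 */
int64_t index_trusting_caller(uint64_t rdi)
{
    return (int64_t)(int32_t)(uint32_t)rdi;
}

/*
 * Defensive callee: it sign-extends only the low 16 bits, so whatever
 * the caller left in bits 16..63 cannot influence the index.
 */
int64_t index_defensive(uint64_t rdi)
{
    return (int64_t)(int16_t)(uint16_t)rdi;
}
```

If the caller passes the short value 5 but leaves 0x0007 in bits 16..31, the
trusting callee indexes element 0x70005 while the defensive one indexes
element 5. When the caller did extend properly, both agree.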

> If it were to have the same issue that powerpc(*) had - that 'unsigned
> int' has to be passed to functions with well-defined high bits - that
> would be bad.

I would have to ask around.  It's hard to tell from experiments what the
expectations around int/unsigned arguments are.  Clang and LLVM treat
the upper bits from int/unsigned return values as undefined in some
cases.

> Anyway, for the kernel, this shouldn't be an issue simply because we
> typically avoid 'bool' in arguments or structures that are exposed to
> outside.

Right, but array indexing with u8/u16/s8/s16 function arguments is
impacted, too.

Thanks,
Florian



* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Mark Brown @ 2026-06-05 16:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Aishwharya.TCV,
	ltp, Miklos Szeredi, patches
In-Reply-To: <CAHk-=wjBZAzPdZgEeHAtSiwJpomt8ZZgKbixuiHfRm09a4=PtA@mail.gmail.com>


On Fri, Jun 05, 2026 at 09:02:52AM -0700, Linus Torvalds wrote:
> On Fri, 5 Jun 2026 at 04:02, Mark Brown <broonie@kernel.org> wrote:

> > | tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260604 #1 SMP @1780589917 armv7l
> > | tst_kconfig.c:71: TINFO: Couldn't locate kernel config!
> > | tst_test.c:1875: TINFO: Overall timeout per run is 0h 00m 30s

> I think this is the same thing that Christian already noted (he said
> "reported by David", but I don't know which David ;), where the
> vmsplice() writev() emulation was done as a blocking write, even
> though vmsplice only blocked at the beginning (ie waiting only for
> _initial_ space to write, not then blocking afterwards).

Ah, yes it is - exactly the same issue that's mentioned in[1], I missed
it in the middle of the quite large thread and didn't directly find
David's report.  Sorry for the duplication there.

[1] https://lore.kernel.org/r/20260603-raumfahrt-unmerklich-ertrugen-c4ecae70d5f9@brauner



* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 16:02 UTC (permalink / raw)
  To: Mark Brown
  Cc: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Aiswharya.TCV,
	ltp, Miklos Szeredi, patches
In-Reply-To: <d9806b34-fc73-4878-997a-95c5e8ae4b29@sirena.org.uk>

On Fri, 5 Jun 2026 at 04:02, Mark Brown <broonie@kernel.org> wrote:
>
> FWIW this is triggering a failure in the LTP vmsplice01 test case (which
> sends with a vmsplice() and then tries to read that with a splice()) in
> -next:
>
> | tst_tmpdir.c:316: TINFO: Using /tmp/LTP_vmsp3vEmQ as tmpdir (tmpfs filesystem)
> | tst_test.c:2047: TINFO: LTP version: 20260130
> | tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260604 #1 SMP @1780589917 armv7l
> | tst_kconfig.c:71: TINFO: Couldn't locate kernel config!
> | tst_test.c:1875: TINFO: Overall timeout per run is 0h 00m 30s

I think this is the same thing that Christian already noted (he said
"reported by David", but I don't know which David ;), where the
vmsplice() writev() emulation was done as a blocking write, even
though vmsplice only blocked at the beginning (ie waiting only for
_initial_ space to write, not then blocking afterwards).
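
For readers who have not seen the pattern the test exercises (vmsplice() into
a pipe, then splice() out of it), a minimal sketch looks roughly like the
following. It is not the LTP test itself; vmsplice_roundtrip and the temp file
name are made up here, and the payload is small enough to fit in a default
pipe, so the call with flags 0 does not have to wait for space.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Returns 0 if the payload survives vmsplice -> pipe -> splice -> file. */
int vmsplice_roundtrip(void)
{
    /* Never modified after the call, as vmsplice() users must guarantee. */
    static const char payload[] = "vmsplice test payload";
    char out_path[] = "/tmp/vmsplice_out_XXXXXX";
    char back[sizeof(payload)] = {0};
    struct iovec iov = {
        .iov_base = (void *)payload,
        .iov_len = sizeof(payload),
    };
    const ssize_t len = (ssize_t)sizeof(payload);
    int pipefd[2], out_fd, ok;

    if (pipe(pipefd) < 0)
        return -1;
    out_fd = mkstemp(out_path);
    if (out_fd < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    /* User memory -> pipe, then pipe -> file, then read the file back. */
    ok = vmsplice(pipefd[1], &iov, 1, 0) == len
        && splice(pipefd[0], NULL, out_fd, NULL, sizeof(payload), 0) == len
        && pread(out_fd, back, sizeof(back), 0) == len
        && memcmp(payload, back, sizeof(payload)) == 0;

    close(pipefd[0]);
    close(pipefd[1]);
    close(out_fd);
    unlink(out_path);
    return ok ? 0 : -1;
}
```

A caller that must never block can pass SPLICE_F_NONBLOCK to vmsplice()
instead of 0 and handle EAGAIN.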

                 Linus


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 15:58 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Andy Lutomirski, Askar Safin, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <f1c7fbbf-5be1-48b0-8927-2d9b75a35816@samba.org>

On Fri, 5 Jun 2026 at 08:15, Stefan Metzmacher <metze@samba.org> wrote:
>
> It means the most common workload, e.g. a file only opened for
> file serving (or simple opens in general) would still be able to
> be optimized.

Nope. If your web server opens files with write access, I'd be
extremely surprised.

And if you don't have write access, and you're sending out data from
files you opened just for reading - the only sane case - you hit all
the existing problems with "I can certainly look up pages, but I damn
well shouldn't pass those pages to the networking code without copying
them".

               Linus


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 15:54 UTC (permalink / raw)
  To: Florian Weimer
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <87se71jps4.fsf@oldenburg.str.redhat.com>

On Fri, 5 Jun 2026 at 02:33, Florian Weimer <fweimer@redhat.com> wrote:
>
> * Linus Torvalds:
>
> > x86 really doesn't *care*. If the caller zero-extends or leaves high
> > bits set randomly, according to the x86 ABI that's perfectly fine: the
> > callee will only care about the low 32 bits. So the high bits are
> > simply not relevant for the ABI.
>
> Please note that Clang does not implement the x86-64 ABI and requires
> zero extension.  We see increasing problems from that, now that we have
> more C code calling Rust code.

Uhhuh. But that is only specific to 'bool', right?

If it were to have the same issue that powerpc(*) had - that 'unsigned
int' has to be passed to functions with well-defined high bits - that
would be bad.

And I'm pretty sure that clang doesn't do that.

Anyway, for the kernel, this shouldn't be an issue simply because we
typically avoid 'bool' in arguments or structures that are exposed to
outside.

(I say 'typically' because I'm sure it happens in some broken UAPI
thing anyway).

                  Linus

(*) I may mis-remember. Maybe it was s390, not powerpc. The s390
compat layer independently had a similar issue wrt pointers, where bit
31 had to be cleared. s390 dropped the 31-bit code entirely fairly
recently, but it caused some "interesting" code in the already
disgusting syscall argument handling wrapper macros.


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Willy Tarreau @ 2026-06-05 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Christian Brauner,
	Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <aiGkrQnMeyPmEvRB@1wt.eu>

On Thu, Jun 04, 2026 at 06:15:41PM +0200, Willy Tarreau wrote:
> On Thu, Jun 04, 2026 at 08:58:33AM -0700, Linus Torvalds wrote:
> > On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
> > >
> > > > It looks like you're actually doing exactly the thing that I thought
> > > > was crazy and wouldn't even work reliably: you change the
> > > > common_response[] contents dynamically *after* the vmsplice, and
> > > > depend on the fact that changing it in user space changes the buffer
> > > > in the pipe too.
> > >
> > > No no, it's definitely not doing that (or it's a bug, but it's not
> > > supposed to happen). I'm perfectly aware that one must definitely not
> > > do that, and it's a guarantee the user of vmsplice() must provide.
> > 
> > Whew, good.
> > 
> > In that case, can you just try the vmsplice patch series (Christian
> > already found a bug, but I don't think it will necessarily matter in
> > practice - famous last words) and that test patch of mine, and see if
> > it all (a) works for you and (b) if you have any numbers for
> > performance that would be *great*.
> 
> Yes I wanted to do that and noted it on my todo list yesterday when
> noticing the ongoing discussion. Just been super busy with yesterday's
> bi-yearly release ;-) But at least I wanted to share quick feedback in
> this thread about existing uses.

OK so I could run the test this afternoon, with:
  - ddd664bbff63 Merge tag 'net-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
    (v7.1-rc6-178)

  - the same with Christian's vfs-7.2.vmsplice branch merged into it
    ( 8d86fcfc2857 include/linux/splice.h: trivial fix: declerations -> declarations)

Both show 71-72 Gbps of TLS traffic per core on my test utility (I
stopped at 3 cores since having only 2x100G at the moment), so for
this use case I'm not impacted by the change. I noted that I will
have to reconsider other options for the cache (send(MSG_ZEROCOPY)
probably) but in my case since the code doesn't exist yet it's not
per-se a userland breakage, but a change of plans. I just hope I'll
find my way through the alternate solution.

FWIW for Christian's branch:

Tested-by: Willy Tarreau <w@1wt.eu>

Willy


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Stefan Metzmacher @ 2026-06-05 15:20 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, Andy Lutomirski, Askar Safin, akpm, axboe,
	brauner, david, dhowells, hch, jack, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato, viro,
	willy
In-Reply-To: <20260605131942.4584728e@pumpkin>

Hi David,

>>> So sendfile() as a concept (whether you use combinations of splice()
>>> system calls or the sendfile system call itself) isn't necessarily
>>> only about the zero-copy, it's really also about avoiding the user
>>> space memory management.
>>
>> I don't think so. Ok, maybe for webservers just serving tiny
>> html files, that's true. But for me with Samba it's really the
>> copy_to/from_iter() that is the major factor.
> 
> Is that copy also doing the ip checksum?

Not in my tests. I guess there's offload in the network hardware
for this.

At least at the syscall layer of sendmsg() there's no checksumming
happening.

metze


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Stefan Metzmacher @ 2026-06-05 15:15 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wh5bFj1a7eaGp9sixDg3UXu7xUGfU=YJo+ckpGxGAyhXQ@mail.gmail.com>

Hi Linus,

> On Wed, 3 Jun 2026 at 15:23, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> So I'm suspicious that you've possibly made bugs much (MUCH) harder to
>> exploit, but the underlying awful code and opportunity for bugs is
>> still there.  MSG_SPLICE_PAGES is still around, and there is still
>> (AFAICS) no actual coherent description of what it means.
> 
> I don't disagree. I've only looked at the filesystem side.
> 
> The networking side does some odd stuff too (and I did look at some of
> that, and had to be edumacated by Jakub on some of the subtler rules
> for what skb data sharing is ok and when it's not - really not my
> area).
> 
> But at least MSG_SPLICE_PAGES should be kernel-internal only
> interface, and once you don't share page cache pages with networking
> code I think that kneecaps a lot of the attacks.
> 
> So that's really the aim here for me - at least _attempting_ to go
> "maybe we can just limit splice enough that it doesn't even *matter*
> when networking does something odd and questionable".

While prototyping a smbdirect_splice_to_bvecs() in order to
use rdma_rw_ctx_init_bvec() I found things like pipe_buf_try_steal()
and dived a bit deeper into struct address_space and found things like:
mapping_mapped, mapping_tagged, mapping_deny_writable, mapping_allow_writable
and similar things.

With that I'm wondering if we could allow splicing of
pages only if nobody mmap'ed the file => mapping_mapped() returned 0
and the page is not tagged with any of PAGECACHE_TAG_{DIRTY,WRITEBACK,TOWRITE}
and once a page is spliced we tag the page in the i_pages xarray
with a PAGECACHE_TAG_SPLICED. In all other cases the page would be copied.

Then any call to do_mmap() or vfs_writev() at the high level
and at the lower levels most likely filemap_get_entry()/filemap_map_pages()
will remove the pages marked with PAGECACHE_TAG_SPLICED
and allocate new pages used for the pagecache of the related index.
It would be a bit similar to invalidate_inode_pages2_range() for direct io
writes. Maybe optimizing by clearing PAGECACHE_TAG_SPLICED if the refcount
of the page is 1.
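
As a toy model of the rule I have in mind (plain userspace C, nothing here is
kernel code; toy_page, toy_mapping, the TAG_* values and both function names
are invented for this sketch):

```c
#include <stdbool.h>

/* Stand-ins for the page cache tags; TAG_SPLICED is the proposed new one. */
enum {
    TAG_DIRTY     = 1 << 0,
    TAG_WRITEBACK = 1 << 1,
    TAG_TOWRITE   = 1 << 2,
    TAG_SPLICED   = 1 << 3,
};

struct toy_page    { unsigned tags; int refcount; };
struct toy_mapping { bool mmapped; };   /* models mapping_mapped() != 0 */

/*
 * Splice the page by reference only if nobody has the file mmap'ed and
 * the page carries none of the dirty/writeback/towrite tags. Returning
 * false means the caller has to copy the data instead.
 */
bool try_splice_page(const struct toy_mapping *m, struct toy_page *p)
{
    if (m->mmapped || (p->tags & (TAG_DIRTY | TAG_WRITEBACK | TAG_TOWRITE)))
        return false;
    p->tags |= TAG_SPLICED;
    p->refcount++;              /* the pipe/socket side now holds a ref */
    return true;
}

/*
 * Called before a write or mmap touches the page. Returns true when the
 * page cache must switch to a fresh page because a spliced reference is
 * still out there. If only the page cache reference is left, the tag is
 * simply cleared and the page can be reused in place.
 */
bool must_replace_before_modify(struct toy_page *p)
{
    if (!(p->tags & TAG_SPLICED))
        return false;
    if (p->refcount == 1) {
        p->tags &= ~(unsigned)TAG_SPLICED;
        return false;
    }
    return true;
}
```

The point of the model is only the decision logic: spliced pages are never
modified in place while someone else still references them.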

This would also mean the content of spliced pages won't be changed
by future writes to the file, which removes the problem with unstable
pages and checksums.

It means the most common workload, e.g. a file only opened for
file serving (or simple opens in general) would still be able to
be optimized.

Does that sound useful and doable?

metze


* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Gabriel Krisman Bertazi @ 2026-06-05 14:24 UTC (permalink / raw)
  To: Li Chen
  Cc: Christian Brauner, Kees Cook, Alexander Viro, linux-fsdevel,
	linux-api, linux-kernel, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>

Li Chen <me@linux.beauty> writes:

> Hi,
>
> This is an early RFC for an idea that is probably still rough in both the
> UAPI and implementation details. Sorry for the rough edges; I am sending
> it now to check whether this direction is worth pursuing and to get
> feedback on the kernel/userspace boundary.
>
> The series is based on linux-next version 20260518.
>
> This RFC adds spawn_template, a userspace-controlled exec acceleration
> mechanism for runtimes that repeatedly start the same executable with
> different argv, envp, and per-spawn file descriptor setup.

Have you looked at Josh's proposal to do this over io_uring [1] and my
implementation of it at [2]?  I think io_uring is a very natural
interface for something like this, it will avoid adding a larger API,
since you could, in theory, set up the entire new task context using
regular io_uring operations in an io workqueue and then starting it would
be a matter of forking the pre-configured io thread with a new io_uring
operation.

[1]
https://lpc.events/event/16/contributions/1213/attachments/1012/1945/io-uring-spawn.pdf
[2] https://lwn.net/Articles/1001622/

>
> The main target is agent runtimes. Modern coding agents repeatedly start
> short-lived helper tools such as rg, git, sed, awk, python, node, and
> shell wrappers while they inspect and edit a workspace. Those runtimes
> already know which tools are hot, and they are also the right place to
> decide policy. The kernel does not choose names such as rg, git, or sed.
> Userspace opts in by creating a template fd for one executable, then uses
> that fd for later spawns. Launchers, shells, and build systems have a
> similar repeated-startup shape and could use the same primitive, but the
> agent runtime case is the main motivation for this RFC.
>
> The mechanism applies to the executable that userspace asks the kernel to
> start. If an agent runtime directly starts /usr/bin/rg, the rg executable
> is the template target. If the runtime starts /usr/bin/bash -c "rg ... |
> head", the shell is the template target unless the shell itself opts in
> when it starts rg and head. The kernel does not parse the shell command
> string or rewrite inner commands into template spawns. Userspace has to
> call spawn_template for those inner commands explicitly:
>
>     direct exec                 shell wrapper
>     -----------                 -------------
>     agent                       agent
>       template("/usr/bin/rg")     template("/usr/bin/bash")
>       spawn rg argv              spawn bash -c "rg ... | head"
>
>     kernel target: rg          kernel target: bash
>     rg startup benefits        rg/head need shell opt-in
>
> Several agent runtime discussions are moving toward direct argv-style
> exec tools for both security and policy clarity. For example, opencode
> issue #2206 proposes an exec tool as a safer alternative to a shell-only
> bash tool:
>
> https://github.com/anomalyco/opencode/issues/2206
>
> spawn_template is meant to support both models. Direct exec users can
> cache the actual hot tool. Shell-wrapper users can cache the shell and
> still reduce shell startup cost. If a shell or an agent runtime later
> uses the same API for commands started inside a shell command, those
> inner tools can benefit too.
>
> Each spawn still goes through the normal exec path. The template reuses
> only metadata that can be revalidated before use. Credential preparation,
> permission checks, binary handler checks, secure-exec handling, and LSM
> hooks remain on the normal execve path.
>
> The UAPI has two operations. spawn_template_create() creates an
> anonymous-inode template fd from either an executable fd or an absolute
> executable path. spawn_template_spawn() starts one child from that
> template, applies per-spawn fd, cwd, and signal actions, and returns both
> pid and pidfd.
>
> fd inheritance is deliberately conservative. By default, after the
> requested per-spawn actions have run, the child closes fds above stderr.
> An agent runtime can still request traditional inheritance explicitly,
> but helper tools do not inherit unrelated secret files or sockets by
> accident. The create-time actions fields are reserved and rejected in
> this RFC because fd numbers are per-process state, not stable reusable
> objects. The caller supplies fd actions for each spawn instead.
>
> A typical agent runtime would keep one template per hot executable and
> still build argv, envp, cwd, and pipe wiring for each tool call:
>
>     rg_tmpl = spawn_template_create("/usr/bin/rg");
>
>     for each search request:
>         out_r, out_w = pipe_cloexec();
>         err_r, err_w = pipe_cloexec();
>         actions = [
>             FCHDIR(worktree_fd),
>             DUP2(out_w, STDOUT_FILENO),
>             DUP2(err_w, STDERR_FILENO),
>         ];
>         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
>         close(out_w);
>         close(err_w);
>         read out_r and err_r;
>         waitid(P_PIDFD, child.pidfd, ...);
>
> A shell-wrapper runtime would use the same shape with a template for
> /usr/bin/bash and argv such as ["/usr/bin/bash", "-c", command]. That
> reduces shell startup cost, but it does not cache rg or head inside that
> command unless the shell also opts into spawn_template for commands it
> starts internally.
>
> The template pins the executable and denies writes to that file while the
> template fd is alive, so cached executable metadata cannot race with a
> writer changing the same inode. This means direct in-place writes to the
> executable can fail while a runtime keeps a template open. It does not
> block the common package-manager update pattern where a new inode is
> written and then atomically renamed over the old path. In that case the
> old path-created template becomes stale, spawn_template_spawn() rejects
> it with ESTALE, and the runtime should close and recreate the template
> for the new executable.
>
>     in-place write              package-manager update
>     --------------              ----------------------
>     template pins old inode     write new inode
>     write(old inode) denied     rename(new, "/usr/bin/rg")
>
>     cached metadata safe        old template sees path mismatch
>                                 spawn_template_spawn() = -ESTALE
>                                 recreate template for new inode
>
> Each spawn revalidates executable identity before cached metadata is
> used. Path-created templates only accept absolute paths: a relative path
> such as ./tool depends on cwd, and the same string can name a different
> file after chdir. For an absolute path template, each spawn reopens the
> path and checks that it still resolves to the executable recorded when
> the template was created. If the path now names a replaced file, the
> template is stale and userspace should close and recreate it.
>
> A template fd can be passed over SCM_RIGHTS like any other fd, but this
> RFC does not treat that as delegation. spawn_template_spawn() only works
> while the caller still has the same struct cred object that created the
> template. If another task, or the same task after a credential change,
> receives the fd, spawn fails instead of running the executable using the
> creator's launch authority:
>
>     ordinary fd                         spawn_template fd
>     -----------                         -----------------
>     A: open log                         A: create rg template
>     A -> B: SCM_RIGHTS(fd)              A -> B: SCM_RIGHTS(tfd)
>
>     B: read(fd) = ok                    B: spawn(tfd) = -EACCES
>                                         B: create own rg template
>                                         B: spawn(own_tfd) = ok
>
>     open-file use is delegated          spawn authority is not delegated
>
> The cached state is intentionally small. The template fd keeps the opened
> main executable file, an optional absolute path string, the creator
> credential pointer, and the deny-write state. The executable identity key
> records device, inode, size, mode, owner, ctime, and mtime, and is
> rechecked before cached metadata is used. The ELF cache keeps only the
> main executable's ELF header, program header table, and program header
> count.
>
>     cached in this RFC          not cached in this RFC
>     ------------------          ----------------------
>     opened main executable      PT_INTERP metadata
>     executable identity key     shared-library graph
>     main ELF header             VMA layout metadata
>     main ELF program headers    cross-process metadata sharing
>     creator cred pointer
>     deny-write state
>
> This RFC does not cache ELF interpreter metadata, shared-library
> dependency state, or derived mapping-layout state. Shared-library
> resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
> RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
> state. It also does not share cached executable metadata between template
> fds created by different processes. Each template owns its small cached
> metadata object in this RFC.
>
> Performance
> ===========
>
> The numbers below come from my separate local autogen-bench project.
> autogen-bench uses AutoGen [1] Core as the agent harness: RoutedAgent
> instances run under SingleThreadedAgentRuntime, and RPC-style dispatch
> fans out concurrent tool-call requests to worker agents. The workload
> definitions, generated test files, and subprocess/spawn_template backends
> are local to autogen-bench.
>
> The agent-tools preset includes direct tool calls and shell-wrapper forms
> for:
>
> rg, grep, sed, awk, cat, head, tail, find, stat, ls, git-status, git-diff,
> python-small, node-small, sh-c, and bash-c.
>
> The benchmark is launch-heavy but not no-op: it searches generated
> Python-like source files, reads sample files, runs small Python and
> Node.js programs, and runs git status and git diff in a small repository.
> It does not include model inference or long-running tool work, so the
> numbers mainly describe the short-tool regime.
>
> The subprocess column starts each tool call through the existing
> userspace launch path. The spawn_template column creates templates for
> hot executables and uses spawn_template_spawn() for later calls.
>
> Total in-flight tool calls stay at 16; only the worker-process split
> changes. For example, 4x4 means 4 worker processes with 4 in-flight tool
> calls each. The two time_s values are subprocess/spawn_template wall
> times.
>
> Workload     Total  subprocess  spawn_template  time_s       Delta
> (workers)    calls  calls/s     calls/s         seconds
> 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
> 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
> 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
> 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
>
> The table measures the whole mixed workload, including both process
> startup and the short tool work done after exec. Since this workload is
> launch-heavy, the possible launch-side savings include:
>
> - the template fd keeps an opened executable, avoiding repeated ordinary
>   open/path setup for that executable;
> - the kernel can reuse cached main-executable ELF header and program
>   header metadata after revalidation;
> - the fork-and-exec-style launch is submitted as one
>   spawn_template_spawn() operation;
> - fd, cwd, and signal actions run in the child kernel path instead of
>   being driven one syscall at a time by userspace child glue;
> - pid and pidfd are returned by the same operation, reducing some
>   runtime-side bookkeeping.
>
> In local experiments before this RFC, I also tried caching ELF
> interpreter metadata and derived ELF mapping-layout metadata. A focused
> repeated-exec benchmark did not show a stable standalone throughput gain
> for those two optimizations, so this RFC leaves them out and keeps only
> the main executable metadata cache.
>
> I also tried sharing main-executable ELF metadata across template fds
> created by different processes for the same executable identity. That can
> reduce duplicated metadata memory when many agent worker processes create
> their own templates for /usr/bin/rg, /usr/bin/git, and similar tools, but
> it did not show a stable throughput win in local multi-agent tests. It
> also adds cache keying, lifetime, invalidation, credential, and namespace
> questions to the RFC. This version therefore keeps per-template metadata
> ownership and leaves cross-process sharing out.
>
> Sorry again for the rough edges in this RFC. I would appreciate feedback
> on whether this direction is useful and what the right API boundary
> should be.
>
> Thanks,
> Li
>
> [1]: https://github.com/microsoft/autogen
>
> Li Chen (13):
>   exec: factor argument setup out of do_execveat_common()
>   exec: add an internal helper for opened executables
>   file: expose helpers for in-kernel fd actions
>   exec: add spawn template UAPI definitions
>   exec: add spawn template file descriptors
>   exec: add spawn_template_spawn()
>   exec: validate spawn template executable identity
>   binfmt_elf: cache ELF metadata for spawn templates
>   Documentation: describe spawn templates
>   exec: require absolute paths for path-created templates
>   exec: let close-range actions target the max fd
>   syscalls: add generic spawn template entries
>   selftests/exec: cover spawn template basics
>
>  Documentation/userspace-api/index.rst         |   1 +
>  .../userspace-api/spawn_template.rst          | 153 +++
>  MAINTAINERS                                   |   6 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   3 +-
>  fs/Makefile                                   |   2 +-
>  fs/binfmt_elf.c                               | 104 +-
>  fs/exec.c                                     | 162 ++-
>  fs/file.c                                     |  11 +-
>  fs/spawn_template.c                           | 619 +++++++++++
>  include/linux/binfmts.h                       |  10 +
>  include/linux/fdtable.h                       |   2 +
>  include/linux/spawn_template.h                |  72 ++
>  include/linux/syscalls.h                      |   7 +
>  include/uapi/asm-generic/unistd.h             |   7 +-
>  include/uapi/linux/spawn_template.h           |  62 ++
>  scripts/syscall.tbl                           |   2 +
>  tools/testing/selftests/exec/Makefile         |   1 +
>  tools/testing/selftests/exec/spawn_template.c | 997 ++++++++++++++++++
>  18 files changed, 2179 insertions(+), 42 deletions(-)
>  create mode 100644 Documentation/userspace-api/spawn_template.rst
>  create mode 100644 fs/spawn_template.c
>  create mode 100644 include/linux/spawn_template.h
>  create mode 100644 include/uapi/linux/spawn_template.h
>  create mode 100644 tools/testing/selftests/exec/spawn_template.c

-- 
Gabriel Krisman Bertazi


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-05 12:19 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Linus Torvalds, Andy Lutomirski, Askar Safin, akpm, axboe,
	brauner, david, dhowells, hch, jack, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato, viro,
	willy
In-Reply-To: <512d948f-7883-4d8c-b2c5-a777e70ca975@samba.org>

On Fri, 5 Jun 2026 11:43:45 +0200
Stefan Metzmacher <metze@samba.org> wrote:

> Hi Linus,
> 
> >> Am I understanding correctly that this will completely break zerocopy
> >> sendfile?  
> > 
> > Very much, yes.
> > 
> > And it's worth making it very very clear that ABSOLUTELY NONE of the
> > recent big security bugs were in splice.
> > 
> > They were all in the networking and crypto code that just didn't deal
> > with shared data correctly.
> > 
> > So in that sense, it's a bit sad to discuss castrating splice.
> > 
> > But it's probably still the right thing to at least try.
> > 
> > I've seen very impressive benchmark numbers over the years, but
> > they've often smelled more like benchmarketing than actual real work.
> > 
> > There's also a real possibility that a lot of the sendfile / splice
> > advantage has little to do with zero-copy, and more to do with the
> > cost of mapping and maintaining buffers in user space.
> > 
> > If you are sending file data using plain reads and writes, it's not
> > just the "copy from user space to socket data structures".
> > 
> > There's also the cost of populating user space in the first place:
> > page faults for mmap made *that* historical copy avoidance basically a
> > fairy tale.
> > 
> > And not using mmap means that you have the cost of double caching in
> > the kernel _and_ user space etc.
> > 
> > So sendfile() as a concept (whether you use combinations of splice()
> > system calls or the sendfile system call itself) isn't necessarily
> > only about the zero-copy, it's really also about avoiding the user
> > space memory management.  
> 
> I don't think so. Ok, maybe for webservers just serving tiny
> html files, that's true. But for me with Samba it's really the
> copy_to/from_iter() that is the major factor.

Is that copy also doing the ip checksum?
I really can't tell from the code (it does sometimes, even for tcp).
But I can't help feeling that optimisation is well past its sell by date.

-- David

> 
> We can use io_uring with IOSQE_ASYNC in order to offload
> the memcpy cpu wasting to different cores, but it's still
> wasting a lot of resources.
> 
> For the case of filesystem => socket, we can use
> IORING_OP_SENDMSG_ZC and that at least removes the
> copy_from_iter() in the sendmsg path, but the
> IORING_OP_READV of buffers in the sizes up to 8MBytes
> is wasting cpu in copy_to_iter().
> 
> For the case with smbdirect and RDMA offload with 2x200GBit/s links,
> only ~33GBytes/s are used and the server cpu (even if using multiple
> cores) is the limit. Without the memcpy waste ~46GByte/s is easily
> reached and the limit is just the network link.
> 
> Maybe another solution could be having a version of
> copy_to/from_iter that uses async_memcpy(), but I didn't
> have the time to experiment with that yet. Maybe a new flag
> to preadv2/pwritev2 could control that, so that the
> application can decide what's better.
> 
> But without an alternative please don't kill splice.
> 
> A lot of people are frustrated because they bought hardware
> that is able to handle a lot of throughput, but
> e.g. with the default of smb over tcp they get no
> higher than 3.5GByte/s on a 100GBit/s link that's able
> to handle ~11GBytes/s. And io_uring and splice are
> a key factor to fix that.
> 
> Thanks!
> metze
> 



* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Mark Brown @ 2026-06-05 11:02 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Aiswharya.TCV,
	ltp, Miklos Szeredi, patches
In-Reply-To: <20260531010107.1953702-3-safinaskar@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 7806 bytes --]

On Sun, May 31, 2026 at 01:01:06AM +0000, Askar Safin wrote:

> vmsplice behavior on writable pipe became equivalent to pwritev2.
> vmsplice behavior on readable pipe already was nearly
> equivalent to preadv2, but I made this explicit. I. e. I made it
> obvious from code that vmsplice now is equivalent to preadv2/pwritev2.

> Also I moved vmsplice to fs/read_write.c, because now it arguably
> belongs there.

> Note that SPLICE_F_NONBLOCK behavior slightly changed: previously
> vmsplice ignored whether the pipe was opened with O_NONBLOCK, and mode
> of operation depended on whether SPLICE_F_NONBLOCK was passed only.
> Now the operation will be non-blocking if O_NONBLOCK was passed when
> opening *or* SPLICE_F_NONBLOCK was passed to vmsplice. Previous
> behavior was arguably buggy, and new behavior is arguably better.

FWIW this is triggering a failure in the LTP vmsplice01 test case (which
sends with a vmsplice() and then tries to read that with a splice()) in
-next:

| tst_tmpdir.c:316: TINFO: Using /tmp/LTP_vmsp3vEmQ as tmpdir (tmpfs filesystem)
| tst_test.c:2047: TINFO: LTP version: 20260130
| tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260604 #1 SMP @1780589917 armv7l
| tst_kconfig.c:71: TINFO: Couldn't locate kernel config!
| tst_test.c:1875: TINFO: Overall timeout per run is 0h 00m 30s
| tst_test.c:1632: TINFO: tmpfs is supported by the test
| Test timeouted, sending SIGKILL!
| tst_test.c:1947: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
| tst_test.c:1949: TBROK: Test killed! (timeout?)

As the test itself notes it's not super realistic, but I thought it was
better to mention it since I didn't see any report on the list yet. I've
CCed the LTP list.
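
For anyone without LTP at hand, the pattern being exercised is roughly
the sketch below: queue user memory into a pipe with vmsplice(), then
drain the pipe with splice(). This is not the LTP source; the helper
name vmsplice_then_splice() and the 4096 byte limit are my own
assumptions, the latter chosen so the payload always fits into an
empty pipe and the blocking vmsplice() cannot stall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Push len bytes from buf into a fresh pipe with vmsplice(), then move
 * them from that pipe to out_fd with splice(). out_fd may be a regular
 * file (not O_APPEND), a pipe or a socket.
 * Returns the number of bytes moved, or -1 on any failure. */
static ssize_t vmsplice_then_splice(int out_fd, const void *buf, size_t len)
{
	int pipefd[2];
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t queued, moved = 0;

	assert(buf != NULL);
	if (len == 0 || len > 4096)
		return -1;	/* keep the payload well below pipe capacity */
	if (pipe(pipefd) < 0)
		return -1;

	queued = vmsplice(pipefd[1], &iov, 1, 0);
	/* Close the write side so splice() sees EOF instead of blocking
	 * once the pipe has been drained. */
	close(pipefd[1]);

	while (queued > 0 && moved < queued) {
		ssize_t n = splice(pipefd[0], NULL, out_fd, NULL,
				   (size_t)(queued - moved), 0);
		if (n <= 0)
			break;
		moved += n;
	}
	close(pipefd[0]);
	return (queued > 0 && moved == queued) ? moved : -1;
}
```

Reading the data back from out_fd and comparing it with buf gives a
pass/fail check comparable to what the test does.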

Full log:

   https://lava.sirena.org.uk/scheduler/job/2831199#L4468

bisect log:

# bad: [b99ae45861eccff1e1d8c7b05a13650be805d437] Add linux-next specific files for 20260604
# good: [5e9d583f58c8b56c9f5022639ac70cb7ae6e9fe9] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
# good: [a9f7db50ed2fff96727782456f49bf88def68510] tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
# good: [215701873a7ed1214bba82c8fadcd2583d0246c3] net: mdio: realtek-rtl9300: Add ports to info structure
# good: [b882d807fa443b529ae8bf917d7b640a8d555437] tools/nolibc: stackprotector: Avoid stalling program startup if crng is not init yet
# good: [835fa43a4d36bd66ad0dd052f9fa15f7bd365fa8] tools/nolibc: add ftruncate()
# good: [5d72c6a754ef488f88ad33c83d14556c6f38af3a] Merge branch 'pci/controller/dwc-qcom'
# good: [04b18289834f20bbed25813c4978833ca8bd6e82] Merge branch 'pci/controller/misc'
# good: [94bf692c483ffb325af0b110429ad37f445d40b0] Merge branch 'pci/endpoint'
# good: [1c996a37fd244d19e5dbb715328c1676e28ef607] tools/power turbostat: pmt: Improve sscanf() hygiene
# good: [083c9ab12c402efe9ed55c942ede92eb35f8bf2a] crypto: omap-des - add COMPILE_TEST and fix CONFIG_OF=n build
# good: [49e05bb00f2e8168695f7af4d694c39e1423e8a2] crypto: atmel-sha204a - fail on hwrng registration error in probe path
# good: [86ad8069366642fec18c1bc53c24cad3da720ce5] Documentation: qat_rl: make rate limiting wording clearer
# good: [d58b4a09d7f06750a706b70d068f5a678dad8233] crypto: atmel-sha204a - remove sysfs group before hwrng
# good: [34808ac8ddafc3e2c2a59e84eaab0a410e7a0fdc] regmap-i2c: fix sparse warning in regmap_smbus_word_write_reg16
# good: [dc3ec9af62a92a46378960e599521c2ac5f81343] hwrng: core - use MAX to simplify RNG_BUFFER_SIZE
# good: [513f49c33e91e58975ada7967b44512179f0e703] power: sequencing: print power sequencing device parent in debugfs
# good: [bb2d82d41894cb30d836e9796ff67d2f9a71eccf] Merge tag 'v7.1-rc2' into nolibc/for-next
# good: [07a1a6562ce29e2e0c134a57882d6e52e8758492] kcsan: Silence -Wmaybe-uninitialized when calling __kcsan_check_access()
git bisect start 'b99ae45861eccff1e1d8c7b05a13650be805d437' '5e9d583f58c8b56c9f5022639ac70cb7ae6e9fe9' 'a9f7db50ed2fff96727782456f49bf88def68510' '215701873a7ed1214bba82c8fadcd2583d0246c3' 'b882d807fa443b529ae8bf917d7b640a8d555437' '835fa43a4d36bd66ad0dd052f9fa15f7bd365fa8' '5d72c6a754ef488f88ad33c83d14556c6f38af3a' '04b18289834f20bbed25813c4978833ca8bd6e82' '94bf692c483ffb325af0b110429ad37f445d40b0' '1c996a37fd244d19e5dbb715328c1676e28ef607' '083c9ab12c402efe9ed55c942ede92eb35f8bf2a' '49e05bb00f2e8168695f7af4d694c39e1423e8a2' '86ad8069366642fec18c1bc53c24cad3da720ce5' 'd58b4a09d7f06750a706b70d068f5a678dad8233' '34808ac8ddafc3e2c2a59e84eaab0a410e7a0fdc' 'dc3ec9af62a92a46378960e599521c2ac5f81343' '513f49c33e91e58975ada7967b44512179f0e703' 'bb2d82d41894cb30d836e9796ff67d2f9a71eccf' '07a1a6562ce29e2e0c134a57882d6e52e8758492'
# test job: [a9f7db50ed2fff96727782456f49bf88def68510] https://lava.sirena.org.uk/scheduler/job/2827407
# test job: [215701873a7ed1214bba82c8fadcd2583d0246c3] https://lava.sirena.org.uk/scheduler/job/2804576
# test job: [b882d807fa443b529ae8bf917d7b640a8d555437] https://lava.sirena.org.uk/scheduler/job/2801267
# test job: [835fa43a4d36bd66ad0dd052f9fa15f7bd365fa8] https://lava.sirena.org.uk/scheduler/job/2801414
# test job: [5d72c6a754ef488f88ad33c83d14556c6f38af3a] https://lava.sirena.org.uk/scheduler/job/2827129
# test job: [04b18289834f20bbed25813c4978833ca8bd6e82] https://lava.sirena.org.uk/scheduler/job/2827205
# test job: [94bf692c483ffb325af0b110429ad37f445d40b0] https://lava.sirena.org.uk/scheduler/job/2827087
# test job: [1c996a37fd244d19e5dbb715328c1676e28ef607] https://lava.sirena.org.uk/scheduler/job/2801077
# test job: [083c9ab12c402efe9ed55c942ede92eb35f8bf2a] https://lava.sirena.org.uk/scheduler/job/2804905
# test job: [49e05bb00f2e8168695f7af4d694c39e1423e8a2] https://lava.sirena.org.uk/scheduler/job/2804997
# test job: [86ad8069366642fec18c1bc53c24cad3da720ce5] https://lava.sirena.org.uk/scheduler/job/2804815
# test job: [d58b4a09d7f06750a706b70d068f5a678dad8233] https://lava.sirena.org.uk/scheduler/job/2805032
# test job: [34808ac8ddafc3e2c2a59e84eaab0a410e7a0fdc] https://lava.sirena.org.uk/scheduler/job/2783569
# test job: [dc3ec9af62a92a46378960e599521c2ac5f81343] https://lava.sirena.org.uk/scheduler/job/2804763
# test job: [513f49c33e91e58975ada7967b44512179f0e703] https://lava.sirena.org.uk/scheduler/job/2801614
# test job: [bb2d82d41894cb30d836e9796ff67d2f9a71eccf] https://lava.sirena.org.uk/scheduler/job/2744679
# test job: [07a1a6562ce29e2e0c134a57882d6e52e8758492] https://lava.sirena.org.uk/scheduler/job/2745239
# test job: [b99ae45861eccff1e1d8c7b05a13650be805d437] https://lava.sirena.org.uk/scheduler/job/2831199
# bad: [b99ae45861eccff1e1d8c7b05a13650be805d437] Add linux-next specific files for 20260604
git bisect bad b99ae45861eccff1e1d8c7b05a13650be805d437
# test job: [2d4099641dbaed4b98711fc7d8ec94c5aec0a0f0] https://lava.sirena.org.uk/scheduler/job/2827279
# bad: [2d4099641dbaed4b98711fc7d8ec94c5aec0a0f0] Merge patch series "vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2"
git bisect bad 2d4099641dbaed4b98711fc7d8ec94c5aec0a0f0
# test job: [e2c0b2368081bef7d1f6758cc9e7edde8521237c] https://lava.sirena.org.uk/scheduler/job/2827345
# bad: [e2c0b2368081bef7d1f6758cc9e7edde8521237c] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
git bisect bad e2c0b2368081bef7d1f6758cc9e7edde8521237c
# first bad commit: [e2c0b2368081bef7d1f6758cc9e7edde8521237c] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
# test job: [7d75aa8edfce3e8ac4f0f8fe5728fd1eab8f7f23] https://lava.sirena.org.uk/scheduler/job/2829187
# skip: [7d75aa8edfce3e8ac4f0f8fe5728fd1eab8f7f23] splice: remove PIPE_BUF_FLAG_GIFT
git bisect skip 7d75aa8edfce3e8ac4f0f8fe5728fd1eab8f7f23
# first bad commit: [e2c0b2368081bef7d1f6758cc9e7edde8521237c] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Stefan Metzmacher @ 2026-06-05  9:43 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wizkDXRut5xLXRF-CVUVYMaZ5AOexxeghOAoXPb4yAvQg@mail.gmail.com>

Hi Linus,

>> Am I understanding correctly that this will completely break zerocopy
>> sendfile?
> 
> Very much, yes.
> 
> And it's worth making it very very clear that ABSOLUTELY NONE of the
> recent big security bugs were in splice.
> 
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
> 
> So in that sense, it's a bit sad to discuss castrating splice.
> 
> But it's probably still the right thing to at least try.
> 
> I've seen very impressive benchmark numbers over the years, but
> they've often smelled more like benchmarketing than actual real work.
> 
> There's also a real possibility that a lot of the sendfile / splice
> advantage has little to do with zero-copy, and more to do with the
> cost of mapping and maintaining buffers in user space.
> 
> If you are sending file data using plain reads and writes, it's not
> just the "copy from user space to socket data structures".
> 
> There's also the cost of populating user space in the first place:
> page faults for mmap made *that* historical copy avoidance basically a
> fairy tale.
> 
> And not using mmap means that you have the cost of double caching in
> the kernel _and_ user space etc.
> 
> So sendfile() as a concept (whether you use combinations of splice()
> system calls or the sendfile system call itself) isn't necessarily
> only about the zero-copy, it's really also about avoiding the user
> space memory management.

I don't think so. Ok, maybe for webservers just serving tiny
html files, that's true. But for me with Samba it's really the
copy_to/from_iter() that is the major factor.

We can use io_uring with IOSQE_ASYNC in order to offload
the memcpy cpu wasting to different cores, but it's still
wasting a lot of resources.

For the case of filesystem => socket, we can use
IORING_OP_SENDMSG_ZC and that at least removes the
copy_from_iter() in the sendmsg path, but the
IORING_OP_READV of buffers in the sizes up to 8MBytes
is wasting cpu in copy_to_iter().

For the case with smbdirect and RDMA offload with 2x200GBit/s links,
only ~33GBytes/s are used and the server cpu (even if using multiple
cores) is the limit. Without the memcpy waste ~46GByte/s is easily
reached and the limit is just the network link.

Maybe another solution could be having a version of
copy_to/from_iter that uses async_memcpy(), but I didn't
have the time to experiment with that yet. Maybe a new flag
to preadv2/pwritev2 could control that, so that the
application can decide what's better.

But without an alternative please don't kill splice.

A lot of people are frustrated because they bought hardware
that is able to handle a lot of throughput, but
e.g. with the default of smb over tcp they get no
higher than 3.5GByte/s on a 100GBit/s link that's able
to handle ~11GBytes/s. And io_uring and splice are
a key factor to fix that.
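
To put the figures above in relation to each other, a rough
back-of-the-envelope helper follows. It is a sketch that assumes
decimal units (1 GByte = 8 GBit) and ignores all protocol overhead;
the function names are made up for illustration.

```c
#include <assert.h>

/* Raw ceiling of `links` links of `gbit_per_s` each, in GByte/s. */
static double link_gbytes_per_s(double gbit_per_s, int links)
{
	assert(gbit_per_s > 0 && links > 0);
	return gbit_per_s * links / 8.0;
}

/* Fraction of the raw ceiling that a measured throughput reaches. */
static double link_utilisation(double achieved_gbytes_per_s,
			       double gbit_per_s, int links)
{
	return achieved_gbytes_per_s / link_gbytes_per_s(gbit_per_s, links);
}
```

A single 100GBit/s link has a raw ceiling of 12.5GByte/s, so 3.5GByte/s
is about 28% of it. For 2x200GBit/s the raw ceiling is 50GByte/s, so
~33GByte/s is about 66% and ~46GByte/s is about 92%.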

Thanks!
metze


* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Florian Weimer @ 2026-06-05  9:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wh1Avi8FGbWkt9k0Z-KXWB0spo5mQyowO8PB2sJ3unbkw@mail.gmail.com>

* Linus Torvalds:

> On Thu, 4 Jun 2026 at 14:32, David Laight <david.laight.linux@gmail.com> wrote:
>>
>> I think riscv might sign extend 32bit values in 64bit registers.
>> x86 and arm both zero extend.
>
> That's different.
>
> x86 really doesn't *care*. If the caller zero-extends or leaves high
> bits set randomly, according to the x86 ABI that's perfectly fine: the
> callee will only care about the low 32 bits. So the high bits are
> simply not relevant for the ABI.

Please note that Clang does not implement the x86-64 ABI and requires
zero extension.  We see increasing problems from that, now that we have
more C code calling Rust code.  (The other direction is generally fine.)
Unfortunately, it's difficult to fix in LLVM.

In the original x86-64 psABI, this was left unspecified by omission
except for the special case of _Bool.  However, Clang/LLVM gets the
_Bool case wrong as well, so it's not just a matter of an unclear
specification.

This isn't really specific to x86-64.  _Bool is simply not part of the
ABI that is stable across compilers, a bit like bitfields in structs
passed by value.

Thanks,
Florian

^ permalink raw reply

* (no subject)
From: Collin Funk @ 2026-06-05  8:35 UTC (permalink / raw)
  To: brauner
  Cc: Pádraig Brady, akpm, axboe, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, safinaskar, torvalds, viro, willy
In-Reply-To: <20260601-enthusiasmus-canceln-anlehnen-0e62317a9784@brauner>

Hi all,

Christian Brauner <brauner@kernel.org> wrote:

> On Sun, 31 May 2026 01:01:04 +0000, Askar Safin wrote:
> > This patchset is for VFS.
> > 
> > Recently we got a lot of vulnerabilities in splice/vmsplice.
> > 
> > Also vmsplice already was source of vulnerabilities in the past:
> > CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
> > 
> > [...]
> 
> Applied to the vfs-7.2.vmsplice branch of the vfs/vfs.git tree.
> Patches in the vfs-7.2.vmsplice branch should appear in linux-next soon.
> 
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
> 
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
> 
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs-7.2.vmsplice
> 
> [1/3] tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
>       https://git.kernel.org/vfs/vfs/c/a9f7db50ed2f
> [2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
>       https://git.kernel.org/vfs/vfs/c/e2c0b2368081
> [3/3] splice: remove PIPE_BUF_FLAG_GIFT
>       https://git.kernel.org/vfs/vfs/c/7d75aa8edfce

In GNU coreutils-9.11, released 2026-04-20, Pádraig Brady added the use
of splice and vmsplice to the 'yes' command [1]. Afterward, I added the
use of splice to the 'cat' command, which is now used if copy_file_range
fails or cannot be used [2]. There were some minor adjustments that had
to be made to those patches pre-release. However, as far as I am aware,
they have not had any issues yet which was a bit surprising to me at
least. Now it seems we are a bit unlucky with our timing...
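
As a rough sketch of the fallback order described above (this is not the
actual coreutils code; the helper name copy_chunk and the 64KiB buffer
size are made up for illustration), one step of such a copy loop could
look like this. It treats any failure of one mechanism as "not
applicable here" and simply tries the next one, which a real
implementation would want to refine:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not taken from coreutils: copy up to len bytes
 * from in_fd to out_fd, trying the cheapest mechanism first.
 * Returns the number of bytes copied, 0 at EOF, or -1 on error. */
ssize_t copy_chunk(int in_fd, int out_fd, size_t len)
{
    ssize_t n;

    /* 1. In-kernel file-to-file copy. Fails for pipes, character
     *    devices and similar, in which case we fall through. */
    n = copy_file_range(in_fd, NULL, out_fd, NULL, len, 0);
    if (n >= 0)
        return n;

    /* 2. splice() works when at least one end is a pipe. */
    n = splice(in_fd, NULL, out_fd, NULL, len, 0);
    if (n >= 0)
        return n;

    /* 3. Last resort: read()/write() through a user-space buffer. */
    char buf[65536];
    if (len > sizeof buf)
        len = sizeof buf;
    n = read(in_fd, buf, len);
    if (n <= 0)
        return n;
    for (ssize_t off = 0; off < n; ) {
        ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
        if (w < 0)
            return -1;
        off += w;
    }
    return n;
}
```

A caller would invoke copy_chunk() in a loop until it returns 0. Note
that none of the three steps involves vmsplice, which is only relevant
when the data originates in user memory, as in 'yes'.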

Anyways, I figured you may be interested in seeing how these changes
affect some applications. I built a kernel from the vfs-7.2.vmsplice
branch and used a config based on my recent Fedora kernel.

Here is the throughput on my Fedora kernel:

    $ uname -r
    7.0.10-201.fc44.x86_64
    $ yes --version | head -n 1
    yes (GNU coreutils) 9.11.50-157bd
    $ timeout 1m taskset 1 yes | taskset 2 pv -r > /dev/null
    [36.9GiB/s]
    $ cat --version | head -n 1
    cat (GNU coreutils) 9.11.50-157bd
    $ timeout 1m taskset 1 cat /dev/zero | taskset 2 pv -r > /dev/null
    [9.34GiB/s]

Here is the throughput on the vfs-7.2.vmsplice kernel:

    $ uname -r
    7.1.0-rc1+
    $ yes --version | head -n 1
    yes (GNU coreutils) 9.11.50-157bd
    $ timeout 1m taskset 1 yes | taskset 2 pv -r > /dev/null
    [3.41GiB/s]
    $ cat --version | head -n 1
    cat (GNU coreutils) 9.11.50-157bd
    $ timeout 1m taskset 1 cat /dev/zero | taskset 2 pv -r > /dev/null
    [9.50GiB/s]

Unsurprisingly, 'cat' is not affected since it does not use vmsplice. On
the other hand, 'yes' is roughly 10x slower (36.9GiB/s vs 3.41GiB/s). I
dislike this, obviously. However,
of course I realize that the average person uses the 'yes' command much
less frequently than I do, if they use it at all. To them security is a
far greater concern. Just want to make it clear that this message isn't
an attempt at getting this change reverted or anything like that.
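
For reference, the ratios behind those statements can be checked with a
trivial helper. The function name slowdown_factor is made up for
illustration; the inputs are the pv readings quoted above:

```c
/* Hypothetical helper: how many times slower "after" is than "before".
 * A value below 1.0 means the second run was actually faster. */
double slowdown_factor(double before_gib_s, double after_gib_s)
{
    return before_gib_s / after_gib_s;
}
```

With the 'yes' numbers (36.9 and 3.41) this gives a factor a little
under 11, and with the 'cat' numbers (9.34 and 9.50) a factor just below
1, i.e. within noise.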

Anyways, hope the testing was at least somewhat useful.

Collin

[1] https://github.com/coreutils/coreutils/commit/2b1c059e6a06eebbb721d010b1221ec54200cc33
[2] https://github.com/coreutils/coreutils/commit/457f88513a128ce91160c4a60f821cc1612204be

P.S. It would be fun to test this branch on the machine where we got
'yes' to output at 175GiB/s. Sadly we do not have root access on it to
install a new kernel, though.

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-05  8:23 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Linus Torvalds, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260605015724.GA520134@ax162>

On Thu, 4 Jun 2026 18:57:24 -0700
Nathan Chancellor <nathan@kernel.org> wrote:

> On Thu, Jun 04, 2026 at 10:32:16PM +0100, David Laight wrote:
> > Talking of broken compilers, had you noticed that:
> > struct foo {
> >     int a;
> >     char c[32];
> > };
> > 
> > int b(struct foo *f)
> > {
> >     return __builtin_object_size(f->c, 1);
> > }
> > returns -1 (size unknown/indefinite).
> > You can't use __builtin_object_size() to stop code running off the end
> > of anything referenced by address - even when the size is constant.  
> 
> That is the entire point of using '-fstrict-flex-arrays=3' in the
> kernel:
> 
>   df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3")
>   https://godbolt.org/z/bvfrh7W58
> 
> Without it, all trailing arrays in structures are treated as flexible
> arrays, even those with fixed sizes.
> 

strict-flex-arrays got added in gcc 13.1 and clang 15.0; it isn't supported
by the gcc 12.2 on the debian 12 system I'm building kernels on.
__builtin_object_size() itself is in gcc 4.1.2 and clang 3.0.

Neither are flex arrays mentioned in the gcc docs for __builtin_object_size().

Someone might have used (eg) 'char x[4]' as a flex array to include the
padding, but no one would have used anything that extended the structure.
And the chance of those hitting __builtin_object_size() is even smaller.

-- David


^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Nathan Chancellor @ 2026-06-05  1:57 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260604223216.73468830@pumpkin>

On Thu, Jun 04, 2026 at 10:32:16PM +0100, David Laight wrote:
> Talking of broken compilers, had you noticed that:
> struct foo {
>     int a;
>     char c[32];
> };
> 
> int b(struct foo *f)
> {
>     return __builtin_object_size(f->c, 1);
> }
> returns -1 (size unknown/indefinite).
> You can't use __builtin_object_size() to stop code running off the end
> of anything referenced by address - even when the size is constant.

That is the entire point of using '-fstrict-flex-arrays=3' in the
kernel:

  df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3")
  https://godbolt.org/z/bvfrh7W58

Without it, all trailing arrays in structures are treated as flexible
arrays, even those with fixed sizes.
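
For anyone who wants to try this locally rather than on godbolt, a
minimal sketch follows. The function name trailing_array_size is made up
for illustration. The result depends on the compiler version, the
optimization level and whether -fstrict-flex-arrays=3 is passed: the
report above is "unknown", i.e. (size_t)-1, without the flag, while the
flag is meant to make the compiler treat c[] as a fixed 32-byte array.

```c
#include <stddef.h>

struct foo {
    int a;
    char c[32];   /* trailing array with a fixed size */
};

/* Hypothetical wrapper around the builtin. Mode 1 asks for the size of
 * the closest enclosing subobject (here f->c) rather than of the whole
 * object. (size_t)-1 means the compiler considers the size unknown. */
size_t trailing_array_size(struct foo *f)
{
    return __builtin_object_size(f->c, 1);
}
```

Building the same file twice, once with and once without
-fstrict-flex-arrays=3 (where the compiler supports it), and comparing
the return values shows whether the trailing array is being treated as
a flexible array.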

-- 
Cheers,
Nathan

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-04 23:25 UTC (permalink / raw)
  To: david.laight.linux
  Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, metze, miklos, netdev,
	patches, pfalcato, safinaskar, torvalds, viro, willy
In-Reply-To: <20260604183829.63c35fd9@pumpkin>

David Laight <david.laight.linux@gmail.com>:
> I know it has mattered elsewhere, and is easy to get wrong because
> 'mostly it works'.

I will send a patch, which will change that argument back to "int".
Hopefully today.
Also I will add that "wait_for_space".

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-04 21:42 UTC (permalink / raw)
  To: David Laight
  Cc: Askar Safin, metze, akpm, axboe, brauner, david, dhowells, hch,
	jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
	netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260604223216.73468830@pumpkin>

On Thu, 4 Jun 2026 at 14:32, David Laight <david.laight.linux@gmail.com> wrote:
>
> I think riscv might sign extend 32bit values in 64bit registers.
> x86 and arm both zero extend.

That's different.

x86 really doesn't *care*. If the caller zero-extends or leaves high
bits set randomly, according to the x86 ABI that's perfectly fine: the
callee will only care about the low 32 bits. So the high bits are
simply not relevant for the ABI.

The PowerPC ABI makes the sign extension part of the calling conventions.

So if a caller doesn't sign extend a 32-bit value, the result is
random behavior - if you pass some function an 'int' argument, the
function may end up looking at bit 63 to see if it's negative (except
IBM calls it "bit 0" because they have to be different from everybody
else)
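
A plain C sketch of that failure mode, and of the defensive narrowing a
boundary can apply, might look like this. All three function names are
made up for illustration, this is not the kernel's macro code, and the
conversion to int32_t relies on the wrap-around behavior that gcc and
clang document for out-of-range values:

```c
#include <stdint.h>

/* Hypothetical callee that trusts the caller to have sign-extended a
 * 32-bit 'int' into the 64-bit register: it only looks at bit 63. */
int is_negative_trusting(uint64_t reg)
{
    return (int)((reg >> 63) & 1);
}

/* Defensive version: take the full register, keep only the low 32
 * bits, and redo the sign extension ourselves. */
int64_t normalize_int_arg(uint64_t reg)
{
    return (int64_t)(int32_t)(uint32_t)reg;
}

int is_negative_checked(uint64_t reg)
{
    return normalize_int_arg(reg) < 0;
}
```

Passing 0x00000000ffffffff (an 'int' -1 that was not sign-extended)
makes the trusting version say "not negative", while garbage in the
high bits such as 0xdeadbeef00000005 makes it say "negative" for the
value 5. The checked version gets both cases right because it ignores
whatever the caller left in the upper half.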

            Linus

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-04 21:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Askar Safin, metze, akpm, axboe, brauner, david, dhowells, hch,
	jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
	netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wip3mwLOHOYJ9TtjDxOaq9YUXmuCg2AycyASGgeY6qqUw@mail.gmail.com>

On Thu, 4 Jun 2026 12:30:30 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 4 Jun 2026 at 10:38, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > Bool is another matter entirely, (IIRC from a couple of weeks ago)
> > gcc will assume that the low 8 bits of the parameter register are
> > either 0 or 1 and clang assumes that the low 32 bits are 0 or 1.
> > You can't even check with 'if ((u32)bool_param > 1) error()' because
> > the compiler 'knows' it can't be false.  
> 
> Nobody should ever use 'bool' as a system call argument. Anything that
> takes a boolean should take a 'flags' field with bits.

I was thinking more generally, not just about syscall arguments.
In C you can't really guarantee that a 'bool' variable will always
contain 0 or 1.

Even if you write: https://godbolt.org/z/81P87vv7o
int f(char *p, _Bool b)
{
    return p[b ? 1 : 0];
}
you get *(p + b); none of (int)b, !!b or (b & 1) makes any difference.

Talking of broken compilers, had you noticed that:
struct foo {
    int a;
    char c[32];
};

int b(struct foo *f)
{
    return __builtin_object_size(f->c, 1);
}
returns -1 (size unknown/indefinite).
You can't use __builtin_object_size() to stop code running off the end
of anything referenced by address - even when the size is constant.


> But this is basically what a lot of the SYSCALL_DEFINEx() macros are
> all about - sorting out ABI assumptions.
> 
> For example, on powerpc (iirc - maybe it was 390), a 32-bit argument
> is always sign-extended by the ABI, and the compiler *depends* on
> that.

I think riscv might sign extend 32bit values in 64bit registers.
x86 and arm both zero extend.
Zero extending is more friendly to the kernel where pretty much
all values are non-negative.
I think I've used signed variables slightly more often than fp in the
last 47+ years.

-- David

> But at system call boundaries we can't trust that the user side
> actually follows the ABI, so SYSCALL_DEFINEx() will actually take a
> 'unsigned long' and turn it into a 32-bit argument so that things like
> this are well-defined and you can't fool the kernel by not following
> the ABI rules.
> 
> The same would be the case if some system call actually takes bool
> (but I don't think such garbage exists). The SYSCALL_DEFINE() macro
> magic would take the full register content and *force* it to follow
> the ABI conventions, whatever they are on that platform.
> 
>               Linus


^ permalink raw reply
