Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 16:02 UTC (permalink / raw)
  To: Mark Brown
  Cc: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Aiswharya.TCV,
	ltp, Miklos Szeredi, patches
In-Reply-To: <d9806b34-fc73-4878-997a-95c5e8ae4b29@sirena.org.uk>

On Fri, 5 Jun 2026 at 04:02, Mark Brown <broonie@kernel.org> wrote:
>
> FWIW this is triggering a failure in the LTP vmsplice01 test case (which
> sends with a vmsplice() and then tries to read that with a splice()) in
> -next:
>
> | tst_tmpdir.c:316: TINFO: Using /tmp/LTP_vmsp3vEmQ as tmpdir (tmpfs filesystem)
> | L4471tst_test.c:2047: TINFO: LTP version: 20260130
> | L4472tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260604 #1 SMP @1780589917 armv7l
> | L4473tst_kconfig.c:71: TINFO: Couldn't locate kernel config!
> | L4474tst_test.c:1875: TINFO: Overall timeout per run is 0h 00m 30s

I htink this is the same thing that Christian already noted (he said
"reported by David", but I don't know which David ;), where the
vmsplice() writev() emulation was done as a blocking write, even
though vmsplice only blocked at the beginning (ie waiting only for
_initial_ space to write, not then blocking afterwards).

                 Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 15:58 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Andy Lutomirski, Askar Safin, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <f1c7fbbf-5be1-48b0-8927-2d9b75a35816@samba.org>

On Fri, 5 Jun 2026 at 08:15, Stefan Metzmacher <metze@samba.org> wrote:
>
> It means the most common workload, e.g. a file only opened for
> file serving (or simple opens in general) would still be able to
> be optimized.

Nope. If your web server opens files with write access, I'd be
extremely surprised.

And if you don't have write access, and you're sending out data from
files you opened just for reading - the onle sane case - you hit all
the existing problems with "I can certainly look up pages, but I damn
well shouldn't pass those pages to the networking code without copying
them".

               Linus

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-05 15:54 UTC (permalink / raw)
  To: Florian Weimer
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <87se71jps4.fsf@oldenburg.str.redhat.com>

On Fri, 5 Jun 2026 at 02:33, Florian Weimer <fweimer@redhat.com> wrote:
>
> * Linus Torvalds:
>
> > x86 really doesn't *care*. If the caller zero-extends or leaves high
> > bits set randomly, according to the x86 ABI that's perfectly fine: the
> > callee will only care about the low 32 bits. So the high bits are
> > simply not relevant for the ABI.
>
> Please note that Clang does not implement the x86-64 ABI and requires
> zero extension.  We see increasing problems from that, now that we have
> more C code calling Rust code.

Uhhuh. But that is only specific to 'bool', right?

If it were to have the same issue that powerpc(*) had - that 'unsigned
int' has to be passed to functions with well-defined high bits - that
would be bad.

And I'm pretty sure that clang doesn't do that.

Anyway, for the kernel, this shouldn't be an issue simply because we
typically avoid 'bool' in arguments or structures that are exposed to
outside.

(I say 'typically' because I'm sure it happens in some broken UAPI
thing anyway).

                  Linus

(*) I may mis-remember. Maybe it was s390, not powerpc. The s390
compat layer independently had a similar issue wrt pointers, where bit
31 had to be cleared. s390 dropped the 31-bit code entirely fairly
recently, but it caused some "interesting" code in the already
disgusting syscall argument handling wrapper macros.

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Willy Tarreau @ 2026-06-05 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Christian Brauner,
	Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <aiGkrQnMeyPmEvRB@1wt.eu>

On Thu, Jun 04, 2026 at 06:15:41PM +0200, Willy Tarreau wrote:
> On Thu, Jun 04, 2026 at 08:58:33AM -0700, Linus Torvalds wrote:
> > On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
> > >
> > > > It looks like you're actually doing exactly the thing that I thought
> > > > was crazy and wouldn't even work reliably: you change the
> > > > common_response[] contents dynamically *after* the vmsplice, and
> > > > depend on the fact that changing it in user space changes the buffer
> > > > in the pipe too.
> > >
> > > No no, it's definitely not doing that (or it's a bug, but it's not
> > > supposed to happen). I'm perfectly aware that one must definitely not
> > > do that, and it's a guarantee the user of vmsplice() must provide.
> > 
> > Whew, good.
> > 
> > In that case, can you just try the vmsplice patch series (Christian
> > already found a bug, but I don't think it will necessarily matter in
> > practice - famous last words) and that test patch of mine, and see if
> > it all (a) works for you and (b) if you have any numbers for
> > performance that would be *great*.
> 
> Yes I wanted to do that and noted it on my todo list yesterday when
> noticing the ongoing discussion. Just been super busy with yesterday's
> by-yearly release ;-) But at least I wanted to share quick feedback in
> this thread about existing uses.

OK so I could run the test this afternoon, with:
  - ddd664bbff63 Merge tag 'net-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
    (v7.1-rc6-178)

  - the same with Christian's vfs-7.2.vmsplice branch merged into it
    ( 8d86fcfc2857 include/linux/splice.h: trivial fix: declerations -> declarations)

Both show 71-72 Gbps of TLS traffic per core on my test utility (I
stopped at 3 cores since having only 2x100G at the moment), so for
this use case I'm not impacted by the change. I noted that I will
have to reconsider other options for the cache (send(MSG_ZEROCOPY)
probably) but in my case since the code doesn't exist yet it's not
per-se a userland breakage, but a change of plans. I just hope I'll
find my way through the alternate solution.

FWIW for Christian's branch:

Tested-by: Willy Tarreau <w@1wt.eu>

Willy

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Stefan Metzmacher @ 2026-06-05 15:20 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, Andy Lutomirski, Askar Safin, akpm, axboe,
	brauner, david, dhowells, hch, jack, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato, viro,
	willy
In-Reply-To: <20260605131942.4584728e@pumpkin>

Hi David,

>>> So sendfile() as a concept (whether you use combinations of splice()
>>> system calls or the sendfile system call itsefl) isn't necessarily
>>> only about the zero-copy, it's really also about avoiding the user
>>> space memory management.
>>
>> I don't think so. Ok, maybe for webservers just serving tiny
>> html files, that's true. But for me with Samba it's really the
>> copy_to/from_iter() that is the major factor.
> 
> Is that copy also doing the ip checksum?

Not in my tests. I guess there's offload in the network hardware
for this.

At least at the syscall layer of sendmsg() there's no checksuming
happening.

metze

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Stefan Metzmacher @ 2026-06-05 15:15 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wh5bFj1a7eaGp9sixDg3UXu7xUGfU=YJo+ckpGxGAyhXQ@mail.gmail.com>

Hi Linus,

> On Wed, 3 Jun 2026 at 15:23, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> So I'm suspicious that you've possibly make bugs much (MUCH) harder to
>> exploit, but the underlying awful code and opportunity for bugs is
>> still there.  MSG_SPLICE_PAGES is still around, and there is still
>> (AFAICS) no actual coherent description of what it means.
> 
> I don't disagree. I've only looked at the filesystem side.
> 
> The networking side does some odd stuff too (and I did look at some of
> that, and had to be edumacated by Jakub on some of the subtler rules
> for what skb data sharing is ok and when it's not - really not my
> area).
> 
> But at least MSG_SPLICE_PAGES should be kernel-internal only
> interface, and once you don't share page cache pages with networking
> code I think that kneecaps a lot of the attacks.
> 
> So that's really the aim here for me - at least _attempting_ to go
> "maybe we can just limit splice enough that it doesn't even *matter*
> when networking does something odd and questionable".

While prototyping a smbdirect_splice_to_bvecs() in order to
do use rdma_rw_ctx_init_bvec() I found things like pipe_buf_try_steal()
and dived a bit deeper into struct address_space and found things like:
mapping_mapped, mapping_tagged, mapping_deny_writable, mapping_allow_writable
and similar things.

With that I'm wondering if we could allow splicing of
pages only if nobody mmap'ed the file => mapping_mapped() returned 0
and the page is not tagged with any of PAGECACHE_TAG_{DIRTY,WRITEBACK,TOWRITE}
and once a page is spliced we tag the page in the i_pages xarray
with a PAGECACHE_TAG_SPLICED. In all other cases the page would be copied.

Then any call to do_mmap() or vfs_writev at the highlevel
and at the lower levels most likely filemap_get_entry()/filemap_map_pages()
will remove the pages marked with PAGECACHE_TAG_SPLICED
and allocate new pages used for the pagecache of the related index.
It would be a bit similar to invalidate_inode_pages2_range() for direct io
writes. Maybe optimizing by clearing PAGECACHE_TAG_SPLICED if the refcount
of the page is 1.

This would also mean the content of spliced pages won't be changed
by future writes to the file, which removes the problem with unstable
pages and checksums.

It means the most common workload, e.g. a file only opened for
file serving (or simple opens in general) would still be able to
be optimized.

Does that sound useful and doable?

metze

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Gabriel Krisman Bertazi @ 2026-06-05 14:24 UTC (permalink / raw)
  To: Li Chen
  Cc: Christian Brauner, Kees Cook, Alexander Viro, linux-fsdevel,
	linux-api, linux-kernel, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>

Li Chen <me@linux.beauty> writes:

> Hi,
>
> This is an early RFC for an idea that is probably still rough in both the
> UAPI and implementation details. Sorry for the rough edges; I am sending
> it now to check whether this direction is worth pursuing and to get
> feedback on the kernel/userspace boundary.
>
> The series is based on linux-next version 20260518.
>
> This RFC adds spawn_template, a userspace-controlled exec acceleration
> mechanism for runtimes that repeatedly start the same executable with
> different argv, envp, and per-spawn file descriptor setup.

Have you looked at Josh's proposal to do this over io_uring [1] and my
implementation of it at [2]?  I think io_uring is a very natural
interface for something like this, it will avoid adding a larger API,
since you could, in theory, set up the entire new task context using
regular io_uring operations in an io workqueue and then starting it would
be a matter of forking the pre-configured io thread with a new io_uring
operation.

[1]
https://lpc.events/event/16/contributions/1213/attachments/1012/1945/io-uring-spawn.pdf
[2] https://lwn.net/Articles/1001622/

>
> The main target is agent runtimes. Modern coding agents repeatedly start
> short-lived helper tools such as rg, git, sed, awk, python, node, and
> shell wrappers while they inspect and edit a workspace. Those runtimes
> already know which tools are hot, and they are also the right place to
> decide policy. The kernel does not choose names such as rg, git, or sed.
> Userspace opts in by creating a template fd for one executable, then uses
> that fd for later spawns. Launchers, shells, and build systems have a
> similar repeated-startup shape and could use the same primitive, but the
> agent runtime case is the main motivation for this RFC.
>
> The mechanism applies to the executable that userspace asks the kernel to
> start. If an agent runtime directly starts /usr/bin/rg, the rg executable
> is the template target. If the runtime starts /usr/bin/bash -c "rg ... |
> head", the shell is the template target unless the shell itself opts in
> when it starts rg and head. The kernel does not parse the shell command
> string or rewrite inner commands into template spawns. Userspace has to
> call spawn_template for those inner commands explicitly:
>
>     direct exec                 shell wrapper
>     -----------                 -------------
>     agent                       agent
>       template("/usr/bin/rg")     template("/usr/bin/bash")
>       spawn rg argv              spawn bash -c "rg ... | head"
>
>     kernel target: rg          kernel target: bash
>     rg startup benefits        rg/head need shell opt-in
>
> Several agent runtime discussions are moving toward direct argv-style
> exec tools for both security and policy clarity. For example, opencode
> issue #2206 proposes an exec tool as a safer alternative to a shell-only
> bash tool:
>
> https://github.com/anomalyco/opencode/issues/2206
>
> spawn_template is meant to support both models. Direct exec users can
> cache the actual hot tool. Shell-wrapper users can cache the shell and
> still reduce shell startup cost. If a shell or an agent runtime later
> uses the same API for commands started inside a shell command, those
> inner tools can benefit too.
>
> Each spawn still goes through the normal exec path. The template reuses
> only metadata that can be revalidated before use. Credential preparation,
> permission checks, binary handler checks, secure-exec handling, and LSM
> hooks remain on the normal execve path.
>
> The UAPI has two operations. spawn_template_create() creates an
> anonymous-inode template fd from either an executable fd or an absolute
> executable path. spawn_template_spawn() starts one child from that
> template, applies per-spawn fd, cwd, and signal actions, and returns both
> pid and pidfd.
>
> fd inheritance is deliberately conservative. By default, after the
> requested per-spawn actions have run, the child closes fds above stderr.
> An agent runtime can still request traditional inheritance explicitly,
> but helper tools do not inherit unrelated secret files or sockets by
> accident. The create-time actions fields are reserved and rejected in
> this RFC because fd numbers are per-process state, not stable reusable
> objects. The caller supplies fd actions for each spawn instead.
>
> A typical agent runtime would keep one template per hot executable and
> still build argv, envp, cwd, and pipe wiring for each tool call:
>
>     rg_tmpl = spawn_template_create("/usr/bin/rg");
>
>     for each search request:
>         out_r, out_w = pipe_cloexec();
>         err_r, err_w = pipe_cloexec();
>         actions = [
>             FCHDIR(worktree_fd),
>             DUP2(out_w, STDOUT_FILENO),
>             DUP2(err_w, STDERR_FILENO),
>         ];
>         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
>         close(out_w);
>         close(err_w);
>         read out_r and err_r;
>         waitid(P_PIDFD, child.pidfd, ...);
>
> A shell-wrapper runtime would use the same shape with a template for
> /usr/bin/bash and argv such as ["/usr/bin/bash", "-c", command]. That
> reduces shell startup cost, but it does not cache rg or head inside that
> command unless the shell also opts into spawn_template for commands it
> starts internally.
>
> The template pins the executable and denies writes to that file while the
> template fd is alive, so cached executable metadata cannot race with a
> writer changing the same inode. This means direct in-place writes to the
> executable can fail while a runtime keeps a template open. It does not
> block the common package-manager update pattern where a new inode is
> written and then atomically renamed over the old path. In that case the
> old path-created template becomes stale, spawn_template_spawn() rejects
> it with ESTALE, and the runtime should close and recreate the template
> for the new executable.
>
>     in-place write              package-manager update
>     --------------              ----------------------
>     template pins old inode     write new inode
>     write(old inode) denied     rename(new, "/usr/bin/rg")
>
>     cached metadata safe        old template sees path mismatch
>                                 spawn_template_spawn() = -ESTALE
>                                 recreate template for new inode
>
> Each spawn revalidates executable identity before cached metadata is
> used. Path-created templates only accept absolute paths: a relative path
> such as ./tool depends on cwd, and the same string can name a different
> file after chdir. For an absolute path template, each spawn reopens the
> path and checks that it still resolves to the executable recorded when
> the template was created. If the path now names a replaced file, the
> template is stale and userspace should close and recreate it.
>
> A template fd can be passed over SCM_RIGHTS like any other fd, but this
> RFC does not treat that as delegation. spawn_template_spawn() only works
> while the caller still has the same struct cred object that created the
> template. If another task, or the same task after a credential change,
> receives the fd, spawn fails instead of running the executable using the
> creator's launch authority:
>
>     ordinary fd                         spawn_template fd
>     -----------                         -----------------
>     A: open log                         A: create rg template
>     A -> B: SCM_RIGHTS(fd)              A -> B: SCM_RIGHTS(tfd)
>
>     B: read(fd) = ok                    B: spawn(tfd) = -EACCES
>                                         B: create own rg template
>                                         B: spawn(own_tfd) = ok
>
>     open-file use is delegated          spawn authority is not delegated
>
> The cached state is intentionally small. The template fd keeps the opened
> main executable file, an optional absolute path string, the creator
> credential pointer, and the deny-write state. The executable identity key
> records device, inode, size, mode, owner, ctime, and mtime, and is
> rechecked before cached metadata is used. The ELF cache keeps only the
> main executable's ELF header, program header table, and program header
> count.
>
>     cached in this RFC          not cached in this RFC
>     ------------------          ----------------------
>     opened main executable      PT_INTERP metadata
>     executable identity key     shared-library graph
>     main ELF header             VMA layout metadata
>     main ELF program headers    cross-process metadata sharing
>     creator cred pointer
>     deny-write state
>
> This RFC does not cache ELF interpreter metadata, shared-library
> dependency state, or derived mapping-layout state. Shared-library
> resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
> RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
> state. It also does not share cached executable metadata between template
> fds created by different processes. Each template owns its small cached
> metadata object in this RFC.
>
> Performance
> ===========
>
> The numbers below come from my separate local autogen-bench project.
> autogen-bench uses AutoGen [1] Core as the agent harness: RoutedAgent
> instances run under SingleThreadedAgentRuntime, and RPC-style dispatch
> fans out concurrent tool-call requests to worker agents. The workload
> definitions, generated test files, and subprocess/spawn_template backends
> are local to autogen-bench.
>
> The agent-tools preset includes direct tool calls and shell-wrapper forms
> for:
>
> rg, grep, sed, awk, cat, head, tail, find, stat, ls, git-status, git-diff,
> python-small, node-small, sh-c, and bash-c.
>
> The benchmark is launch-heavy but not no-op: it searches generated
> Python-like source files, reads sample files, runs small Python and
> Node.js programs, and runs git status and git diff in a small repository.
> It does not include model inference or long-running tool work, so the
> numbers mainly describe the short-tool regime.
>
> The subprocess column starts each tool call through the existing
> userspace launch path. The spawn_template column creates templates for
> hot executables and uses spawn_template_spawn() for later calls.
>
> Total in-flight tool calls stay at 16; only the worker-process split
> changes. For example, 4x4 means 4 worker processes with 4 in-flight tool
> calls each. The two time_s values are subprocess/spawn_template wall
> times.
>
> Workload     Calls  subprocess  spawn_template  time_s       Delta
> (workers)    calls  calls/s     calls/s         seconds
> 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
> 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
> 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
> 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
>
> The table measures the whole mixed workload, including both process
> startup and the short tool work done after exec. Since this workload is
> launch-heavy, the possible launch-side savings include:
>
> - the template fd keeps an opened executable, avoiding repeated ordinary
>   open/path setup for that executable;
> - the kernel can reuse cached main-executable ELF header and program
>   header metadata after revalidation;
> - the fork-and-exec-style launch is submitted as one
>   spawn_template_spawn() operation;
> - fd, cwd, and signal actions run in the child kernel path instead of
>   being driven one syscall at a time by userspace child glue;
> - pid and pidfd are returned by the same operation, reducing some
>   runtime-side bookkeeping.
>
> In local experiments before this RFC, I also tried caching ELF
> interpreter metadata and derived ELF mapping-layout metadata. A focused
> repeated-exec benchmark did not show a stable standalone throughput gain
> for those two optimizations, so this RFC leaves them out and keeps only
> the main executable metadata cache.
>
> I also tried sharing main-executable ELF metadata across template fds
> created by different processes for the same executable identity. That can
> reduce duplicated metadata memory when many agent worker processes create
> their own templates for /usr/bin/rg, /usr/bin/git, and similar tools, but
> it did not show a stable throughput win in local multi-agent tests. It
> also adds cache keying, lifetime, invalidation, credential, and namespace
> questions to the RFC. This version therefore keeps per-template metadata
> ownership and leaves cross-process sharing out.
>
> Sorry again for the rough edges in this RFC. I would appreciate feedback
> on whether this direction is useful and what the right API boundary
> should be.
>
> Thanks,
> Li
>
> [1]: https://github.com/microsoft/autogen
>
> Li Chen (13):
>   exec: factor argument setup out of do_execveat_common()
>   exec: add an internal helper for opened executables
>   file: expose helpers for in-kernel fd actions
>   exec: add spawn template UAPI definitions
>   exec: add spawn template file descriptors
>   exec: add spawn_template_spawn()
>   exec: validate spawn template executable identity
>   binfmt_elf: cache ELF metadata for spawn templates
>   Documentation: describe spawn templates
>   exec: require absolute paths for path-created templates
>   exec: let close-range actions target the max fd
>   syscalls: add generic spawn template entries
>   selftests/exec: cover spawn template basics
>
>  Documentation/userspace-api/index.rst         |   1 +
>  .../userspace-api/spawn_template.rst          | 153 +++
>  MAINTAINERS                                   |   6 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   3 +-
>  fs/Makefile                                   |   2 +-
>  fs/binfmt_elf.c                               | 104 +-
>  fs/exec.c                                     | 162 ++-
>  fs/file.c                                     |  11 +-
>  fs/spawn_template.c                           | 619 +++++++++++
>  include/linux/binfmts.h                       |  10 +
>  include/linux/fdtable.h                       |   2 +
>  include/linux/spawn_template.h                |  72 ++
>  include/linux/syscalls.h                      |   7 +
>  include/uapi/asm-generic/unistd.h             |   7 +-
>  include/uapi/linux/spawn_template.h           |  62 ++
>  scripts/syscall.tbl                           |   2 +
>  tools/testing/selftests/exec/Makefile         |   1 +
>  tools/testing/selftests/exec/spawn_template.c | 997 ++++++++++++++++++
>  18 files changed, 2179 insertions(+), 42 deletions(-)
>  create mode 100644 Documentation/userspace-api/spawn_template.rst
>  create mode 100644 fs/spawn_template.c
>  create mode 100644 include/linux/spawn_template.h
>  create mode 100644 include/uapi/linux/spawn_template.h
>  create mode 100644 tools/testing/selftests/exec/spawn_template.c

-- 
Gabriel Krisman Bertazi

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-05 12:19 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Linus Torvalds, Andy Lutomirski, Askar Safin, akpm, axboe,
	brauner, david, dhowells, hch, jack, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato, viro,
	willy
In-Reply-To: <512d948f-7883-4d8c-b2c5-a777e70ca975@samba.org>

On Fri, 5 Jun 2026 11:43:45 +0200
Stefan Metzmacher <metze@samba.org> wrote:

> Hi Linus,
> 
> >> Am I understanding correctly that this will completely break zerocopy
> >> sendfile?  
> > 
> > Very much, yes.
> > 
> > And it's worth making it very very clear that ABSOLUTELY NONE of the
> > recent big security bugs were in splice.
> > 
> > They were all in the networking and crypto code that just didn't deal
> > with shared data correctly.
> > 
> > So in that sense, it's a bit sad to discuss castrating splice.
> > 
> > But it's probably still the right thing to at least try.
> > 
> > I've seen very impressive benchmark numbers over the years, but
> > they've often smelled more like benchmarketing than actual real work.
> > 
> > There's also a real possibility that a lot of the sendfile / splice
> > advantage has little to do with zero-copy, and more to do with the
> > cost of mapping and maintaining buffers in user space.
> > 
> > If you are sending file data using plain reads and writes, it's not
> > just the "copy from user space to socket data structures".
> > 
> > There's also the cost of populating user space in the first place:
> > page faults for mmap made *that* historical copy avoidance basically a
> > fairy tale.
> > 
> > And not using mmap means that you have the cost of double caching in
> > the kernel _and_ user space etc.
> > 
> > So sendfile() as a concept (whether you use combinations of splice()
> > system calls or the sendfile system call itsefl) isn't necessarily
> > only about the zero-copy, it's really also about avoiding the user
> > space memory management.  
> 
> I don't think so. Ok, maybe for webservers just serving tiny
> html files, that's true. But for me with Samba it's really the
> copy_to/from_iter() that is the major factor.

Is that copy also doing the ip checksum?
I really can't tell from the code (it does sometimes, even for tcp).
But I can't help feeling that optimisation is well past its sell by date.

-- David

> 
> We can use io_uring with IOSQE_ASYNC in order to offload
> the memcpy cpu wasting to different cores, but it's still
> wasting a lot of resources.
> 
> For the case of filesystem => socket, we can use
> IORING_OP_SENDMSG_ZC and that at least removes the
> copy_from_iter() in the sendmsg path, but the
> IORING_OP_READV of buffers in the sizes up to 8MBytes
> is wasting cpu in copy_to_iter().
> 
> For the case with smbdirect and RDMA offload with 2x200GBit/s links
> changes from only ~33GBytes/s are used (and the server cpu even if using multiple cores)
> is the limit. Without the memcpy waste ~46GByte/s is easily reached
> and the limit is just the network link.
> 
> Maybe another solution could be having a version of
> copy_to/from_iter that uses async_memcpy(), but didn't
> have the time to experiment with that yet. Maybe a new flag
> to preadv2/pwritev2 could control that, so that the
> application can decide what's better.
> 
> But without an alternative please don't kill splice.
> 
> A lot of people are frustrated because they bought hardware
> that is able to handle a lot of throughput, but
> e.g. with the default of smb over tcp they get no
> higher than 3.5GByte/s on a 100GBit/s link that's able
> to handle ~11GBytes/s. And io_uring and splice are
> a key factor to fix that.
> 
> Thanks!
> metze
> 


^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Mark Brown @ 2026-06-05 11:02 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Aiswharya.TCV,
	ltp, Miklos Szeredi, patches
In-Reply-To: <20260531010107.1953702-3-safinaskar@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 7806 bytes --]

On Sun, May 31, 2026 at 01:01:06AM +0000, Askar Safin wrote:

> vmsplice behavior on writable pipe became equivalent to pwritev2.
> vmsplice behavior on readable pipe already was nearly
> equivalent to preadv2, but I made this explicit. I. e. I made it
> obvious from code that vmsplice now is equivalent to preadv2/pwritev2.

> Also I moved vmsplice to fs/read_write.c, because now it arguably
> belongs there.

> Note that SPLICE_F_NONBLOCK behavior slightly changed: previously
> vmsplice ignored whether the pipe was opened with O_NONBLOCK, and mode
> of operation depended on whether SPLICE_F_NONBLOCK was passed only.
> Now the operation will be non-blocking if O_NONBLOCK was passed when
> opening *or* SPLICE_F_NONBLOCK was passed to vmsplice. Previous
> behavior was arguably buggy, and new behavior is arguably better.

FWIW this is triggering a failure in the LTP vmsplice01 test case (which
sends with a vmsplice() and then tries to read that with a splice()) in
-next:

| tst_tmpdir.c:316: TINFO: Using /tmp/LTP_vmsp3vEmQ as tmpdir (tmpfs filesystem)
| L4471tst_test.c:2047: TINFO: LTP version: 20260130
| L4472tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260604 #1 SMP @1780589917 armv7l
| L4473tst_kconfig.c:71: TINFO: Couldn't locate kernel config!
| L4474tst_test.c:1875: TINFO: Overall timeout per run is 0h 00m 30s
| L4475tst_test.c:1632: TINFO: tmpfs is supported by the test
| L4476Test timeouted, sending SIGKILL!
| L4477tst_test.c:1947: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
| L4478tst_test.c:1949: TBROK: Test killed! (timeout?)

As the test itself notes it's not super realistic but thought it was
better to mention it since I didn't see any report on the list yet, I've
CCed the LTP list.

Full log:

   https://lava.sirena.org.uk/scheduler/job/2831199#L4468

bisect log:

# bad: [b99ae45861eccff1e1d8c7b05a13650be805d437] Add linux-next specific files for 20260604
# good: [5e9d583f58c8b56c9f5022639ac70cb7ae6e9fe9] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
# good: [a9f7db50ed2fff96727782456f49bf88def68510] tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
# good: [215701873a7ed1214bba82c8fadcd2583d0246c3] net: mdio: realtek-rtl9300: Add ports to info structure
# good: [b882d807fa443b529ae8bf917d7b640a8d555437] tools/nolibc: stackprotector: Avoid stalling program startup if crng is not init yet
# good: [835fa43a4d36bd66ad0dd052f9fa15f7bd365fa8] tools/nolibc: add ftruncate()
# good: [5d72c6a754ef488f88ad33c83d14556c6f38af3a] Merge branch 'pci/controller/dwc-qcom'
# good: [04b18289834f20bbed25813c4978833ca8bd6e82] Merge branch 'pci/controller/misc'
# good: [94bf692c483ffb325af0b110429ad37f445d40b0] Merge branch 'pci/endpoint'
# good: [1c996a37fd244d19e5dbb715328c1676e28ef607] tools/power turbostat: pmt: Improve sscanf() hygiene
# good: [083c9ab12c402efe9ed55c942ede92eb35f8bf2a] crypto: omap-des - add COMPILE_TEST and fix CONFIG_OF=n build
# good: [49e05bb00f2e8168695f7af4d694c39e1423e8a2] crypto: atmel-sha204a - fail on hwrng registration error in probe path
# good: [86ad8069366642fec18c1bc53c24cad3da720ce5] Documentation: qat_rl: make rate limiting wording clearer
# good: [d58b4a09d7f06750a706b70d068f5a678dad8233] crypto: atmel-sha204a - remove sysfs group before hwrng
# good: [34808ac8ddafc3e2c2a59e84eaab0a410e7a0fdc] regmap-i2c: fix sparse warning in regmap_smbus_word_write_reg16
# good: [dc3ec9af62a92a46378960e599521c2ac5f81343] hwrng: core - use MAX to simplify RNG_BUFFER_SIZE
# good: [513f49c33e91e58975ada7967b44512179f0e703] power: sequencing: print power sequencing device parent in debugfs
# good: [bb2d82d41894cb30d836e9796ff67d2f9a71eccf] Merge tag 'v7.1-rc2' into nolibc/for-next
# good: [07a1a6562ce29e2e0c134a57882d6e52e8758492] kcsan: Silence -Wmaybe-uninitialized when calling __kcsan_check_access()
git bisect start 'b99ae45861eccff1e1d8c7b05a13650be805d437' '5e9d583f58c8b56c9f5022639ac70cb7ae6e9fe9' 'a9f7db50ed2fff96727782456f49bf88def68510' '215701873a7ed1214bba82c8fadcd2583d0246c3' 'b882d807fa443b529ae8bf917d7b640a8d555437' '835fa43a4d36bd66ad0dd052f9fa15f7bd365fa8' '5d72c6a754ef488f88ad33c83d14556c6f38af3a' '04b18289834f20bbed25813c4978833ca8bd6e82' '94bf692c483ffb325af0b110429ad37f445d40b0' '1c996a37fd244d19e5dbb715328c1676e28ef607' '083c9ab12c402efe9ed55c942ede92eb35f8bf2a' '49e05bb00f2e8168695f7af4d694c39e1423e8a2' '86ad8069366642fec18c1bc53c24cad3da720ce5' 'd58b4a09d7f06750a706b70d068f5a678dad8233' '34808ac8ddafc3e2c2a59e84eaab0a410e7a0fdc' 'dc3ec9af62a92a46378960e599521c2ac5f81343' '513f49c33e91e58975ada7967b44512179f0e703' 'bb2d82d41894cb30d836e9796ff67d2f9a71eccf' '07a1a6562ce29e2e0c134a57882d6e52e8758492'
# test job: [a9f7db50ed2fff96727782456f49bf88def68510] https://lava.sirena.org.uk/scheduler/job/2827407
# test job: [215701873a7ed1214bba82c8fadcd2583d0246c3] https://lava.sirena.org.uk/scheduler/job/2804576
# test job: [b882d807fa443b529ae8bf917d7b640a8d555437] https://lava.sirena.org.uk/scheduler/job/2801267
# test job: [835fa43a4d36bd66ad0dd052f9fa15f7bd365fa8] https://lava.sirena.org.uk/scheduler/job/2801414
# test job: [5d72c6a754ef488f88ad33c83d14556c6f38af3a] https://lava.sirena.org.uk/scheduler/job/2827129
# test job: [04b18289834f20bbed25813c4978833ca8bd6e82] https://lava.sirena.org.uk/scheduler/job/2827205
# test job: [94bf692c483ffb325af0b110429ad37f445d40b0] https://lava.sirena.org.uk/scheduler/job/2827087
# test job: [1c996a37fd244d19e5dbb715328c1676e28ef607] https://lava.sirena.org.uk/scheduler/job/2801077
# test job: [083c9ab12c402efe9ed55c942ede92eb35f8bf2a] https://lava.sirena.org.uk/scheduler/job/2804905
# test job: [49e05bb00f2e8168695f7af4d694c39e1423e8a2] https://lava.sirena.org.uk/scheduler/job/2804997
# test job: [86ad8069366642fec18c1bc53c24cad3da720ce5] https://lava.sirena.org.uk/scheduler/job/2804815
# test job: [d58b4a09d7f06750a706b70d068f5a678dad8233] https://lava.sirena.org.uk/scheduler/job/2805032
# test job: [34808ac8ddafc3e2c2a59e84eaab0a410e7a0fdc] https://lava.sirena.org.uk/scheduler/job/2783569
# test job: [dc3ec9af62a92a46378960e599521c2ac5f81343] https://lava.sirena.org.uk/scheduler/job/2804763
# test job: [513f49c33e91e58975ada7967b44512179f0e703] https://lava.sirena.org.uk/scheduler/job/2801614
# test job: [bb2d82d41894cb30d836e9796ff67d2f9a71eccf] https://lava.sirena.org.uk/scheduler/job/2744679
# test job: [07a1a6562ce29e2e0c134a57882d6e52e8758492] https://lava.sirena.org.uk/scheduler/job/2745239
# test job: [b99ae45861eccff1e1d8c7b05a13650be805d437] https://lava.sirena.org.uk/scheduler/job/2831199
# bad: [b99ae45861eccff1e1d8c7b05a13650be805d437] Add linux-next specific files for 20260604
git bisect bad b99ae45861eccff1e1d8c7b05a13650be805d437
# test job: [2d4099641dbaed4b98711fc7d8ec94c5aec0a0f0] https://lava.sirena.org.uk/scheduler/job/2827279
# bad: [2d4099641dbaed4b98711fc7d8ec94c5aec0a0f0] Merge patch series "vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2"
git bisect bad 2d4099641dbaed4b98711fc7d8ec94c5aec0a0f0
# test job: [e2c0b2368081bef7d1f6758cc9e7edde8521237c] https://lava.sirena.org.uk/scheduler/job/2827345
# bad: [e2c0b2368081bef7d1f6758cc9e7edde8521237c] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
git bisect bad e2c0b2368081bef7d1f6758cc9e7edde8521237c
# first bad commit: [e2c0b2368081bef7d1f6758cc9e7edde8521237c] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
# test job: [7d75aa8edfce3e8ac4f0f8fe5728fd1eab8f7f23] https://lava.sirena.org.uk/scheduler/job/2829187
# skip: [7d75aa8edfce3e8ac4f0f8fe5728fd1eab8f7f23] splice: remove PIPE_BUF_FLAG_GIFT
git bisect skip 7d75aa8edfce3e8ac4f0f8fe5728fd1eab8f7f23
# first bad commit: [e2c0b2368081bef7d1f6758cc9e7edde8521237c] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Stefan Metzmacher @ 2026-06-05  9:43 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wizkDXRut5xLXRF-CVUVYMaZ5AOexxeghOAoXPb4yAvQg@mail.gmail.com>

Hi Linus,

>> Am I understanding correctly that this will completely break zerocopy
>> sendfile?
> 
> Very much, yes.
> 
> And it's worth making it very very clear that ABSOLUTELY NONE of the
> recent big security bugs were in splice.
> 
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
> 
> So in that sense, it's a bit sad to discuss castrating splice.
> 
> But it's probably still the right thing to at least try.
> 
> I've seen very impressive benchmark numbers over the years, but
> they've often smelled more like benchmarketing than actual real work.
> 
> There's also a real possibility that a lot of the sendfile / splice
> advantage has little to do with zero-copy, and more to do with the
> cost of mapping and maintaining buffers in user space.
> 
> If you are sending file data using plain reads and writes, it's not
> just the "copy from user space to socket data structures".
> 
> There's also the cost of populating user space in the first place:
> page faults for mmap made *that* historical copy avoidance basically a
> fairy tale.
> 
> And not using mmap means that you have the cost of double caching in
> the kernel _and_ user space etc.
> 
> So sendfile() as a concept (whether you use combinations of splice()
> system calls or the sendfile system call itsefl) isn't necessarily
> only about the zero-copy, it's really also about avoiding the user
> space memory management.

I don't think so. Ok, maybe for webservers just serving tiny
html files, that's true. But for me with Samba it's really the
copy_to/from_iter() that is the major factor.

We can use io_uring with IOSQE_ASYNC in order to offload
the memcpy cpu wasting to different cores, but it's still
wasting a lot of resources.

For the case of filesystem => socket, we can use
IORING_OP_SENDMSG_ZC and that at least removes the
copy_from_iter() in the sendmsg path, but the
IORING_OP_READV of buffers in the sizes up to 8MBytes
is wasting cpu in copy_to_iter().

For the case with smbdirect and RDMA offload with 2x200GBit/s links
changes from only ~33GBytes/s are used (and the server cpu even if using multiple cores)
is the limit. Without the memcpy waste ~46GByte/s is easily reached
and the limit is just the network link.

Maybe another solution could be having a version of
copy_to/from_iter that uses async_memcpy(), but didn't
have the time to experiment with that yet. Maybe a new flag
to preadv2/pwritev2 could control that, so that the
application can decide what's better.

But without an alternative please don't kill splice.

A lot of people are frustrated because they bought hardware
that is able to handle a lot of throughput, but
e.g. with the default of smb over tcp they get no
higher than 3.5GByte/s on a 100GBit/s link that's able
to handle ~11GBytes/s. And io_uring and splice are
a key factor to fix that.

Thanks!
metze

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Florian Weimer @ 2026-06-05  9:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Laight, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wh1Avi8FGbWkt9k0Z-KXWB0spo5mQyowO8PB2sJ3unbkw@mail.gmail.com>

* Linus Torvalds:

> On Thu, 4 Jun 2026 at 14:32, David Laight <david.laight.linux@gmail.com> wrote:
>>
>> I think riscv might sign extend 32bit values in 64bit registers.
>> x86 and arm both zero extend.
>
> That's different.
>
> x86 really doesn't *care*. If the caller zero-extends or leaves high
> bits set randomly, according to the x86 ABI that's perfectly fine: the
> callee will only care about the low 32 bits. So the high bits are
> simply not relevant for the ABI.

Please note that Clang does not implement the x86-64 ABI and requires
zero extension.  We see increasing problems from that, now that we have
more C code calling Rust code.  (The other direction is generally fine.)
Unfortunately, it's difficult to fix in LLVM.

In the original x86-64 psABI, this was left unspecified by omission
except for the special case of _Bool.  However, Clang/LLVM gets the
_Bool case wrong as well, so it's not just a matter of an unclear
specification.

This isn't really specific to x86-64.  _Bool is simply not part of the
ABI that is stable across compilers, a bit like bitfields in structs
passed by value.

Thanks,
Florian

^ permalink raw reply

* (no subject)
From: Collin Funk @ 2026-06-05  8:35 UTC (permalink / raw)
  To: brauner
  Cc: Pádraig Brady, akpm, axboe, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, safinaskar, torvalds, viro, willy
In-Reply-To: <20260601-enthusiasmus-canceln-anlehnen-0e62317a9784@brauner>

Hi all,

Christian Brauner <brauner@kernel.org> wrote:

> On Sun, 31 May 2026 01:01:04 +0000, Askar Safin wrote:
> > This patchset is for VFS.
> > 
> > Recently we got a lot of vulnerabilities in splice/vmsplice.
> > 
> > Also vmsplice already was source of vulnerabilities in the past:
> > CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
> > 
> > [...]
> 
> Applied to the vfs-7.2.vmsplice branch of the vfs/vfs.git tree.
> Patches in the vfs-7.2.vmsplice branch should appear in linux-next soon.
> 
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
> 
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
> 
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs-7.2.vmsplice
> 
> [1/3] tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
>       https://git.kernel.org/vfs/vfs/c/a9f7db50ed2f
> [2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
>       https://git.kernel.org/vfs/vfs/c/e2c0b2368081
> [3/3] splice: remove PIPE_BUF_FLAG_GIFT
>       https://git.kernel.org/vfs/vfs/c/7d75aa8edfce

In GNU coreutils-9.11, released 2026-04-20, Pádraig Brady added the use
of splice and vmsplice to the 'yes' command [1]. Afterward, I added the
use of splice to the 'cat' command, which is now used if copy_file_range
fails or cannot be used [2]. There were some minor adjustments that had
to be made to those patches pre-release. However, as far as I am aware,
they have not had any issues yet which was a bit surprising to me at
least. Now it seems we are a bit unlucky with our timing...

Anyways, I figured you may be interested in seeing how these changes
affect some applications. I built a kernel from the vfs-7.2.vmsplice
branch and used a config based on my recent Fedora kernel.

Here is the throughput on my Fedora kernel:

    $ uname -r
    7.0.10-201.fc44.x86_64
    $ yes --version | head -n 1
    yes (GNU coreutils) 9.11.50-157bd
    $ timeout 1m taskset 1 yes | taskset 2 pv -r > /dev/null
    [36.9GiB/s]
    $ cat --version | head -n 1
    cat (GNU coreutils) 9.11.50-157bd
    $ timeout 1m taskset 1 cat /dev/zero | taskset 2 pv -r > /dev/null
    [9.34GiB/s]

Here is the throughput on the vfs-7.2.vmsplice kernel:

    $ uname -r
    7.1.0-rc1+
    $ yes --version | head -n 1
    yes (GNU coreutils) 9.11.50-157bd
    $ timeout 1m taskset 1 yes | taskset 2 pv -r > /dev/null
    [3.41GiB/s]
    $ cat --version | head -n 1
    cat (GNU coreutils) 9.11.50-157bd
    $ timeout 1m taskset 1 cat /dev/zero | taskset 2 pv -r > /dev/null
    [9.50GiB/s]

Unsurprisingly, 'cat' is not affected since it does not use vmsplice. On
the other hand 'yes' is 10x slower. I dislike this, obviously. However,
of course I realize that the average person uses the 'yes' command much
less frequently than I do, if they use it at all. To them security is a
far greater concern. Just want to make it clear that this message isn't
an attempt at getting this change reverted or anything like that.

Anyways, hope the testing was at least somewhat useful.

Collin

[1] https://github.com/coreutils/coreutils/commit/2b1c059e6a06eebbb721d010b1221ec54200cc33
[2] https://github.com/coreutils/coreutils/commit/457f88513a128ce91160c4a60f821cc1612204be

P.S. It would be fun to test this branch on the machine where we got
'yes' to output at 175GiB/s. Sadly we do not have root access on it to
install a new kernel, though.

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-05  8:23 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Linus Torvalds, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260605015724.GA520134@ax162>

On Thu, 4 Jun 2026 18:57:24 -0700
Nathan Chancellor <nathan@kernel.org> wrote:

> On Thu, Jun 04, 2026 at 10:32:16PM +0100, David Laight wrote:
> > Talking of broken compilers, had you noticed that:
> > struct foo {
> >     int a;
> >     char c[32];
> > };
> > 
> > int b(struct foo *f)
> > {
> >     return __builtin_object_size(f->c, 1);
> > }
> > returns -1 (size unknown/indefinite).
> > You can't use __builtin_object_size() to stop code running off the end
> > of anything referenced by address - even when the size is constant.  
> 
> That is the entire point of using '-fstrict-flex-arrays=3' in the
> kernel:
> 
>   df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3")
>   https://godbolt.org/z/bvfrh7W58
> 
> Without it, all trailing arrays in structures are treated as flexible
> arrays, even those with fixed sizes.
> 

strict-flex-arrays got added in gcc 13.1 and clang 15.0; it isn't supported
by the gcc 12.2 on the debian 12 system I'm building kernels on.
__buitin_object_size() itself is in gcc 4.1.2 and clang 3.0.

Neither are flex arrays mentioned in the gcc docs for __builtin_object_size().

Someone might have used (eg) 'char x[4]' as a flex array to include the
padding, but no one would have used anything that extended the structure.
And the chance of those hitting __builtin_object_size() is even smaller.

-- David


^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Nathan Chancellor @ 2026-06-05  1:57 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, Askar Safin, metze, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260604223216.73468830@pumpkin>

On Thu, Jun 04, 2026 at 10:32:16PM +0100, David Laight wrote:
> Talking of broken compilers, had you noticed that:
> struct foo {
>     int a;
>     char c[32];
> };
> 
> int b(struct foo *f)
> {
>     return __builtin_object_size(f->c, 1);
> }
> returns -1 (size unknown/indefinite).
> You can't use __builtin_object_size() to stop code running off the end
> of anything referenced by address - even when the size is constant.

That is the entire point of using '-fstrict-flex-arrays=3' in the
kernel:

  df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3")
  https://godbolt.org/z/bvfrh7W58

Without it, all trailing arrays in structures are treated as flexible
arrays, even those with fixed sizes.

-- 
Cheers,
Nathan

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-04 23:25 UTC (permalink / raw)
  To: david.laight.linux
  Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, metze, miklos, netdev,
	patches, pfalcato, safinaskar, torvalds, viro, willy
In-Reply-To: <20260604183829.63c35fd9@pumpkin>

David Laight <david.laight.linux@gmail.com>:
> I know it has mattered elsewhere, and is easy to get wrong because
> 'mostly it works'.

I will send a patch, which will change that argument back to "int".
Hopefully today.
Also I will add that "wait_for_space".

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-04 21:42 UTC (permalink / raw)
  To: David Laight
  Cc: Askar Safin, metze, akpm, axboe, brauner, david, dhowells, hch,
	jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
	netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260604223216.73468830@pumpkin>

On Thu, 4 Jun 2026 at 14:32, David Laight <david.laight.linux@gmail.com> wrote:
>
> I think riscv might sign extend 32bit values in 64bit registers.
> x86 and arm both zero extend.

That's different.

x86 really doesn't *care*. If the caller zero-extends or leaves high
bits set randomly, according to the x86 ABI that's perfectly fine: the
callee will only care about the low 32 bits. So the high bits are
simply not relevant for the ABI.

The Powerpc ABI makes the the sign extension part of the calling conventions.

So if a caller doesn't sign extend a 32-bit value, the result is
random behavior - if you pass some function an 'int' argument, the
function may end up looking at bit 63 to see if it's negative (except
IBM calls it "bit 0" because they have to be different from everybody
else)

            Linus

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-04 21:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Askar Safin, metze, akpm, axboe, brauner, david, dhowells, hch,
	jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
	netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wip3mwLOHOYJ9TtjDxOaq9YUXmuCg2AycyASGgeY6qqUw@mail.gmail.com>

On Thu, 4 Jun 2026 12:30:30 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 4 Jun 2026 at 10:38, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > Bool is another matter entirely, (IIRC from a couple of weeks ago)
> > gcc will assume that the low 8 bits of the parameter register are
> > either 0 or 1 and clang assumes that the low 32 bits are 0 or 1.
> > You can't even check with 'if ((u32)bool_param > 1) error()' because
> > the compiler 'knows' it can't be false.  
> 
> Nobody should ever use 'bool' as a system call argument. Anything that
> takes a boolean should take a 'flags' field with bits.

I was thinking of more generally, not syscall arguments.
In C you can't really guarantee that a 'bool' variable will always
contain 0 or 1.

Even if you write: https://godbolt.org/z/81P87vv7o
int f(char *p, _Bool b)
{
    return p[b ? 1 : 0];
}
you get *(p + b), neither (int)b, !!b or (b & 1) make any difference.

Talking of broken compilers, had you noticed that:
struct foo {
    int a;
    char c[32];
};

int b(struct foo *f)
{
    return __builtin_object_size(f->c, 1);
}
returns -1 (size unknown/indefinite).
You can't use __builtin_object_size() to stop code running off the end
of anything referenced by address - even when the size is constant.


> But this is basically what a lot of the SYSCALL_DEFINEx() macros are
> all about - sorting out ABI assumptions.
> 
> For example, on powerpc (iirc - maybe it was 390), a 32-bit argument
> is always sign-extended by the ABI, and the compiler *depends* on
> that.

I think riscv might sign extend 32bit values in 64bit registers.
x86 and arm both zero extend.
Zero extending is more friendly to the kernel where pretty much
all values are non-negative.
I think I've used signed variables slightly more often than fp in the
last 47+ years.

-- David

> But at system call boundaries we can't trust that the user side
> actually follows the ABI, so SYSCALL_DEFINEx() will actually take a
> 'unsigned long' and turn it into a 32-bit argument so that things like
> this are well-defined and you can't fool the kernel by not following
> the ABI rules.
> 
> The same would be the case if some system call actually takes bool
> (but I don't think such garbage exists). The SYSCALL_DEFINE() macro
> magic would take the full register content and *force* it to follow
> the ABI conventions, whatever they are on that platform.
> 
>               Linus


^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-04 19:30 UTC (permalink / raw)
  To: David Laight
  Cc: Askar Safin, metze, akpm, axboe, brauner, david, dhowells, hch,
	jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
	netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260604183829.63c35fd9@pumpkin>

On Thu, 4 Jun 2026 at 10:38, David Laight <david.laight.linux@gmail.com> wrote:
>
> Bool is another matter entirely, (IIRC from a couple of weeks ago)
> gcc will assume that the low 8 bits of the parameter register are
> either 0 or 1 and clang assumes that the low 32 bits are 0 or 1.
> You can't even check with 'if ((u32)bool_param > 1) error()' because
> the compiler 'knows' it can't be false.

Nobody should ever use 'bool' as a system call argument. Anything that
takes a boolean should take a 'flags' field with bits.

But this is basically what a lot of the SYSCALL_DEFINEx() macros are
all about - sorting out ABI assumptions.

For example, on powerpc (iirc - maybe it was 390), a 32-bit argument
is always sign-extended by the ABI, and the compiler *depends* on
that. But at system call boundaries we can't trust that the user side
actually follows the ABI, so SYSCALL_DEFINEx() will actually take a
'unsigned long' and turn it into a 32-bit argument so that things like
this are well-defined and you can't fool the kernel by not following
the ABI rules.

The same would be the case if some system call actually takes bool
(but I don't think such garbage exists). The SYSCALL_DEFINE() macro
magic would take the full register content and *force* it to follow
the ABI conventions, whatever they are on that platform.

              Linus

^ permalink raw reply

* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Laight @ 2026-06-04 17:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Askar Safin, metze, akpm, axboe, brauner, david, dhowells, hch,
	jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
	netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wjvb56qo27-axALOKCY-CjLsj9_294zeBovPVJaYm14dA@mail.gmail.com>

On Thu, 4 Jun 2026 07:17:10 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 4 Jun 2026 at 02:06, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > Something needs to ensure that the high 32bits of the fd get masked off
> > on 64bit systems.  
> 
> That something already exists: CLASS(fd, f)(fd);
> 
> It ignores the top bits, because 'fdget()' takes an 'unsigned int'.
> 
> We have been a bit random in how we declare the system calls in
> general, and we mix 'unsigned int' and 'int' and 'unsigned long'
> pretty much randomly when it comes to file descriptor arguments to
> system calls.
> 
> fs/read_write.c in particular uses all three cases with no real logic to it all:
> 
>   SYSCALL_DEFINE3(lseek, unsigned int, fd, ..
>   SYSCALL_DEFINE3(readv, unsigned long, fd, ..
>   SYSCALL_DEFINE4(sendfile, int, out_fd, ..
> 
> but then anything that uses fdget() (through one of the helper classes
> or not) will simply not care.
> 
> Does it make sense? Is it pretty? Nope. Does it matter? Also nope.

I know it has mattered elsewhere, and is easy to get wrong because
'mostly it works'.

At least u32/u64 is reasonably sane - the called function has to ignore
the high bits (at least on x86).

Bool is another matter entirely, (IIRC from a couple of weeks ago)
gcc will assume that the low 8 bits of the parameter register are
either 0 or 1 and clang assumes that the low 32 bits are 0 or 1.
You can't even check with 'if ((u32)bool_param > 1) error()' because
the compiler 'knows' it can't be false.
It all dumps you down one of the UB 'rabbit holes'.

-- David

> 
>                   Linus


^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andy Lutomirski @ 2026-06-04 17:25 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Linus Torvalds,
	Christian Brauner, Askar Safin, linux-kernel, linux-mm, linux-api,
	netdev, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches, linux-fsdevel, Jan Kara
In-Reply-To: <aiGjUqI59e966oBu@1wt.eu>

On Thu, Jun 4, 2026 at 9:09 AM Willy Tarreau <w@1wt.eu> wrote:
>
> On Thu, Jun 04, 2026 at 08:53:15AM -0700, Andy Lutomirski wrote:
> > On Wed, Jun 3, 2026 at 11:32 PM Willy Tarreau <w@1wt.eu> wrote:
> > >
> > > On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
> > > > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> > > >
> > > > > On Mon, 1 Jun 2026 18:33:25 +0100
> > > > > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > >
> > > > > > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > > > > >
> > > > > > > TLDR: maybe we could ghet rid of "f_op->splice_read". *That* would be
> > > > > > > a big simplification.
> > > > > >
> > > > > > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > > > > > Communications between the kernel and fuse server at least used to
> > > > > > seriously want that, so that would be one place to look for unhappy
> > > > > > userland...
> > > > > >
> > > > > > splice-related logics in fs/fuse/dev.c is interesting; another place
> > > > > > like this is kernel/trace/, but I'm less familiar with that one.
> > > > > >
> > > > > > rostedt Cc'd (miklos already had been)
> > > > >
> > > > > Thanks for the Cc. The tracing ring buffer was specifically made to be used
> > > > > by splice and the libtracefs has a lot of code to use it as well. As
> > > > > reading the ring buffer literally swaps out the write portion with a blank
> > > > > read portion, that portion (sub-buffer) is used to be directly fed into
> > > > > splice, providing a zero-copy of the trace data from the write of the event
> > > > > to going into a file.
> > > > >
> > > > > trace-cmd defaults to using splice to copy the tracing ring buffer directly
> > > > > into files to avoid as much copying during live recordings as possible.
> > > > >
> > > > > Whatever changes we make, I would like to make sure there's no regressions
> > > > > in performance of trace-cmd record.
> > > >
> > > > Well yes, The patchset seems sensible from a quality POV.  But to make
> > > > a decision we should first have a decent understanding of its downside
> > > > impact.
> > > >
> > > > I haven't seen a description of that impact in the discussion thus far.
> > > > And that description is owed, please.
> > > >
> > > > I assume a small number of specialized applications are using
> > > > vmsplice() to great effect?  What are those applications?  What is the
> > > > impact of this change?
> > >
> > > > Once we are armed with that information, is there some middle ground in
> > > > which we de-feature vmsplice()?  Fall back to pread/pwrite in the
> > > > tricky cases and still permit vmsplicing if the application is
> > > > appropriately restrictive in it usage?
> > >
> > > I'm using vmsplice() + tee() + splice() in high-performance applications,
> > > load generators to be precise, and soon a cache. This is super convenient
> > > and extremely efficient:
> > >
> > >   - vmsplice() is used to prepare a "master" pipe with data to be sent
> > >     over TCP or kTLS
> > >   - then for each request, we do tee() from this master pipe to per-request
> > >     pipes.
> > >   - the per-request pipes are those that are used to deliver the data to
> > >     the socket via splice().
> > >
> > > So we effectively use vmsplice(), tee() and splice() here, and for exactly
> > > the reasons they were designed: only play with page refcount and not copy
> > > data. The code is here for the curious:
> > >
> > >    https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c
> > >
> > > and its ancestor is here:
> > >
> > >    https://github.com/wtarreau/httpterm/blob/master/httpterm.c
> > >
> > > It simply doubles the network bandwidth compared to not using that.
> > > (62 Gbps per core vs 31). I would seriously miss it if I couldn't use
> > > this anymore.
> > >
> >
> > Wait a moment.  This is neat, but it's literally just a benchmark,
> > right?
>
> No, it's a benchmark *tool*: it's being used to stress production code,
> which is important and super hard at high loads. You place it after your
> proxy and you measure the performance of the proxy (which is supposed not
> to be as capable as the testing tools otherwise the methodology revolves
> to testing the testing tools, which is not the point).
>
> > I skimmed the code, and it doesn't look like a production
> > workload, either.  And you manage to get around the awfulness of the
> > vmsplice API's complete failure to tell you when it's done with a
> > buffer by ... never actually changing the contents of the buffer.  Do
> > you have any idea how you would write correct code that uses vmsplice
> > for sends and then *ever* mutates the data without literally
> > munmapping (or madvise or something) the data do you can safely mutate
> > it?
>
> I'm not sure what you mean here Andy. I *do not* need to change the
> data, it's just a pre-made pattern.

What I mean is: this particular pattern seems limited for use in an
actual webserver as a opposed to a load-tester.

> > Or discover that we already have something better, perhaps :)
> >
> > https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html
>
> io_uring is different. We tried it "the dirty way" in the past, by
> emulating a poller, and it's not worth it this way. And in order to
> do it the right way, it needs to be done totally differently, which
> has impacts all over the stack. The code in the file pointed to above
> is just for the httpterm testing feature, but the rest is much more
> complex.

I'm curious how this kludge does:

https://github.com/amluto/zc_bench

I vibe-coded this up without much care, and I don't have the hardware
needed to actually run it in an interesting manner.  But, on a Linux
VM on an Apple M4, I can push about 130Gbps on a single core over
loopback.  In theory this will do zerocopy sends (but not over
loopback), and I would hope that it runs *faster* than vmsplice + tee.

(I have a fancy workstation that can do a whopping 2.5Gbps.  I could
probably jury-rig a test over Thunderbolt at higher speeds.  I have
systems that are not available for this test right now that can do
10Gbps.  But someone probably needs 40Gbps or better hardware for a
genuinely interesting test.)

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Willy Tarreau @ 2026-06-04 16:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Christian Brauner,
	Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <CAHk-=wg0e8pP5haNW4qJP1=QwwUEctwjK5k07sv8bskitoMDgg@mail.gmail.com>

On Thu, Jun 04, 2026 at 08:58:33AM -0700, Linus Torvalds wrote:
> On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
> >
> > > It looks like you're actually doing exactly the thing that I thought
> > > was crazy and wouldn't even work reliably: you change the
> > > common_response[] contents dynamically *after* the vmsplice, and
> > > depend on the fact that changing it in user space changes the buffer
> > > in the pipe too.
> >
> > No no, it's definitely not doing that (or it's a bug, but it's not
> > supposed to happen). I'm perfectly aware that one must definitely not
> > do that, and it's a guarantee the user of vmsplice() must provide.
> 
> Whew, good.
> 
> In that case, can you just try the vmsplice patch series (Christian
> already found a bug, but I don't think it will necessarily matter in
> practice - famous last words) and that test patch of mine, and see if
> it all (a) works for you and (b) if you have any numbers for
> performance that would be *great*.

Yes I wanted to do that and noted it on my todo list yesterday when
noticing the ongoing discussion. Just been super busy with yesterday's
by-yearly release ;-) But at least I wanted to share quick feedback in
this thread about existing uses.

> There aren't many obvious splice users out there, and even if they
> were to exist they are typically specialized enough that you have to
> have a real use case to then tell if the patches make a difference in
> real life or not.

I totally agree, that's why I want to share some feedback. I remember
years ago when splice() was broken in 2.6.25, there were so few users
that I was the one reporting an API issue to Eric who addressed it
early by lack of users. And I even consider that due to the very few
users, it's even acceptable to slightly change the way to use it if
it can provide extra guarantees (like requiring a capability to access
non-anonymous pages for example). It should not break that many apps,
and as long as they can preserve their essential benefits, I think
most will be OK to adapt.

> So you testing that thing would seem to be a great first test of
> whether any of this is realistic..

Absolutely! 

Willy

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Willy Tarreau @ 2026-06-04 16:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Linus Torvalds,
	Christian Brauner, Askar Safin, linux-kernel, linux-mm, linux-api,
	netdev, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches, linux-fsdevel, Jan Kara
In-Reply-To: <CALCETrULMixRGJyGqAAujW7RN6PP2f_Orn2Y_0hpPMjRqQnY7Q@mail.gmail.com>

On Thu, Jun 04, 2026 at 08:53:15AM -0700, Andy Lutomirski wrote:
> On Wed, Jun 3, 2026 at 11:32 PM Willy Tarreau <w@1wt.eu> wrote:
> >
> > On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
> > > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > > On Mon, 1 Jun 2026 18:33:25 +0100
> > > > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > > > >
> > > > > > TLDR: maybe we could ghet rid of "f_op->splice_read". *That* would be
> > > > > > a big simplification.
> > > > >
> > > > > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > > > > Communications between the kernel and fuse server at least used to
> > > > > seriously want that, so that would be one place to look for unhappy
> > > > > userland...
> > > > >
> > > > > splice-related logics in fs/fuse/dev.c is interesting; another place
> > > > > like this is kernel/trace/, but I'm less familiar with that one.
> > > > >
> > > > > rostedt Cc'd (miklos already had been)
> > > >
> > > > Thanks for the Cc. The tracing ring buffer was specifically made to be used
> > > > by splice and the libtracefs has a lot of code to use it as well. As
> > > > reading the ring buffer literally swaps out the write portion with a blank
> > > > read portion, that portion (sub-buffer) is used to be directly fed into
> > > > splice, providing a zero-copy of the trace data from the write of the event
> > > > to going into a file.
> > > >
> > > > trace-cmd defaults to using splice to copy the tracing ring buffer directly
> > > > into files to avoid as much copying during live recordings as possible.
> > > >
> > > > Whatever changes we make, I would like to make sure there's no regressions
> > > > in performance of trace-cmd record.
> > >
> > > Well yes, The patchset seems sensible from a quality POV.  But to make
> > > a decision we should first have a decent understanding of its downside
> > > impact.
> > >
> > > I haven't seen a description of that impact in the discussion thus far.
> > > And that description is owed, please.
> > >
> > > I assume a small number of specialized applications are using
> > > vmsplice() to great effect?  What are those applications?  What is the
> > > impact of this change?
> >
> > > Once we are armed with that information, is there some middle ground in
> > > which we de-feature vmsplice()?  Fall back to pread/pwrite in the
> > > tricky cases and still permit vmsplicing if the application is
> > > appropriately restrictive in it usage?
> >
> > I'm using vmsplice() + tee() + splice() in high-performance applications,
> > load generators to be precise, and soon a cache. This is super convenient
> > and extremely efficient:
> >
> >   - vmsplice() is used to prepare a "master" pipe with data to be sent
> >     over TCP or kTLS
> >   - then for each request, we do tee() from this master pipe to per-request
> >     pipes.
> >   - the per-request pipes are those that are used to deliver the data to
> >     the socket via splice().
> >
> > So we effectively use vmsplice(), tee() and splice() here, and for exactly
> > the reasons they were designed: only play with page refcount and not copy
> > data. The code is here for the curious:
> >
> >    https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c
> >
> > and its ancestor is here:
> >
> >    https://github.com/wtarreau/httpterm/blob/master/httpterm.c
> >
> > It simply doubles the network bandwidth compared to not using that.
> > (62 Gbps per core vs 31). I would seriously miss it if I couldn't use
> > this anymore.
> >
> 
> Wait a moment.  This is neat, but it's literally just a benchmark,
> right?

No, it's a benchmark *tool*: it's being used to stress production code,
which is important and super hard at high loads. You place it after your
proxy and you measure the performance of the proxy (which is supposed not
to be as capable as the testing tools otherwise the methodology revolves
to testing the testing tools, which is not the point).

> I skimmed the code, and it doesn't look like a production
> workload, either.  And you manage to get around the awfulness of the
> vmsplice API's complete failure to tell you when it's done with a
> buffer by ... never actually changing the contents of the buffer.  Do
> you have any idea how you would write correct code that uses vmsplice
> for sends and then *ever* mutates the data without literally
> munmapping (or madvise or something) the data do you can safely mutate
> it?

I'm not sure what you mean here Andy. I *do not* need to change the
data, it's just a pre-made pattern.

> > I also have mid-term plans for using vmsplice() to deliver contents from
> > a cache to sockets as well via splice(). Right now our cache is split into
> > too small chunks (1kB) to make that useful, but as soon as we can move to
> > 4kB pages, it will make sense. There the same gains are expected, and I
> > would particularly dislike the idea of no longer being able to implement
> > zero-copy!
> 
> If I'm understanding you correctly, you see (and measured!) a
> performance improvement, and you would like to use it in production.

The prod for the tool is to be used to benchmark other tools. It does
the job quite well. It's even more important when you use kTLS-enabled
hardware where you can get zero-copy all along the line and delegate
the crypto to the hardware. That's the beauty of all the nice work that
was done in the stack along all these years. That code started to be
used in clear maybe 15 years ago or so, but nowadays the gains are even
more interesting.

> It seems to me that this is an excellent opportunity to remember that
> vmsplice gets a performance boost in a highly synthetic situation that
> sort of resembles a cache scenario and then to deprecate vmsplice and
> build something better!

I've definitely been keeping vmsplice() on my radar for our cache,
and we've progressively implemented various architectural updates in
haproxy precisely for this.

> Or discover that we already have something better, perhaps :)
> 
> https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html

io_uring is different. We tried it "the dirty way" in the past, by
emulating a poller, and it's not worth it this way. And in order to
do it the right way, it needs to be done totally differently, which
has impacts all over the stack. The code in the file pointed to above
is just for the httpterm testing feature, but the rest is much more
complex.

> I see that this can submit a buffer without a syscall (tee + splice is
> *two* syscalls!) and that it has directly addressed what I see as the
> really big deficiency in vmsplice: "This second notification tells the
> application that the memory associated with the send is safe to get
> reused."  If I were writing the user code, I would very much want that
> notification to be an explicit part of the API instead of making a
> wild guess as I think I would need to do with vmsplice.

I agree, for the cache it's something important (not for the load
generator). But IIRC that's something you can also check via SIOCOUTQ
which is normally sufficient for a cache's eviction system (though not
fantastic).

Willy

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-04 15:58 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Christian Brauner,
	Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <aiGfgRch99l_5z11@1wt.eu>

On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
>
> > It looks like you're actually doing exactly the thing that I thought
> > was crazy and wouldn't even work reliably: you change the
> > common_response[] contents dynamically *after* the vmsplice, and
> > depend on the fact that changing it in user space changes the buffer
> > in the pipe too.
>
> No no, it's definitely not doing that (or it's a bug, but it's not
> supposed to happen). I'm perfectly aware that one must definitely not
> do that, and it's a guarantee the user of vmsplice() must provide.

Whew, good.

In that case, can you just try the vmsplice patch series (Christian
already found a bug, but I don't think it will necessarily matter in
practice - famous last words) and that test patch of mine, and see if
it all (a) works for you and (b) if you have any numbers for
performance that would be *great*.

There aren't many obvious splice users out there, and even if they
were to exist they are typically specialized enough that you have to
have a real use case to then tell if the patches make a difference in
real life or not.

So you testing that thing would seem to be a great first test of
whether any of this is realistic..

               Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Willy Tarreau @ 2026-06-04 15:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Christian Brauner,
	Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <CAHk-=wiQB-j53cTs9kM4UeXoXPaFj78aJe3D6Yp1Fohg7i4tWA@mail.gmail.com>

On Thu, Jun 04, 2026 at 07:31:30AM -0700, Linus Torvalds wrote:
> On Wed, 3 Jun 2026 at 23:32, Willy Tarreau <w@1wt.eu> wrote:
> >
> > I'm using vmsplice() + tee() + splice() in high-performance applications,
> > load generators to be precise, and soon a cache. This is super convenient
> > and extremely efficient:
> >
> >   - vmsplice() is used to prepare a "master" pipe with data to be sent
> >     over TCP or kTLS
> >   - then for each request, we do tee() from this master pipe to per-request
> >     pipes.
> >   - the per-request pipes are those that are used to deliver the data to
> >     the socket via splice().
> 
> So most of those would actually not be affected by any of the existing
> patches: the pipe->socket splice would remain, the tee() code would
> still just take a ref to the page count.

OK!

> The vmsplice() would change,

OK but for this use case it's not dramatic (it could be more annyoing
for the cache where I'd like this zero-copy from memory to the wire
though).

> but looking at your haterm.c sources, it
> looks like it's mostly a fairly small thing ("common_response[]" being
> 16kB).

In this one it's indeed a 16kB block that is repeated into the
same pipe by simplicity, in its ancestor it was 64kB. We try to
make as large a pipe as we can, but that's all.

> That is typically *faster* to just copy than look up pages.
> 
> HOWEVER.
> 
> It looks like you're actually doing exactly the thing that I thought
> was crazy and wouldn't even work reliably: you change the
> common_response[] contents dynamically *after* the vmsplice, and
> depend on the fact that changing it in user space changes the buffer
> in the pipe too.

No no, it's definitely not doing that (or it's a bug, but it's not
supposed to happen). I'm perfectly aware that one must definitely not
do that, and it's a guarantee the user of vmsplice() must provide.

> So that would break *entirely* with the vmsplice() changes if I read
> the code right (which I might not do) simply because that looks like
> it really does require that "wrutably shared buffer after the fact".

We agree that this would deliver complete garbage an I'm not interested
in such a "feature" at all.

> Interesting.  Because the vmsplice() code uses get_user_pages_fast(),
> and honestly, it never pinned the page reliably to the original source
> - it breaks COW randomly in one direction or the other after fork()

I must confess I never knew how it deals with pages shared over a
fork(), and have been wondering if two processes could create a
shared memory area on the fly just by using vmsplice() on each side
and end up with the same pages (I don't need this but it could have
very nice use cases).

> (and I thouht even after a page-out, but thinking more about it the
> swap cache may have made it work for that case).
> 
> Uhhuh. That does look like it makes the vmsplice() changes untenable.

No no don't worry, I'm not seeing any value in changing data after
vmsplice() and that would just be a bug. My goal here is only to
pre-fill a buffer with a pattern then prepare the pipe with that
pattern, nothing less, nothing more.

> But I may be reading your haproxy code entirely wrong.

I think so, but I wouldn't be the one blaming you for this ;-)

Thanks for the clarifications!
Willy

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andy Lutomirski @ 2026-06-04 15:53 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Linus Torvalds,
	Christian Brauner, Askar Safin, linux-kernel, linux-mm, linux-api,
	netdev, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches, linux-fsdevel, Jan Kara
In-Reply-To: <aiEb8CTM-ovMIq7-@1wt.eu>

On Wed, Jun 3, 2026 at 11:32 PM Willy Tarreau <w@1wt.eu> wrote:
>
> On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
> > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > > On Mon, 1 Jun 2026 18:33:25 +0100
> > > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > > >
> > > > > TLDR: maybe we could ghet rid of "f_op->splice_read". *That* would be
> > > > > a big simplification.
> > > >
> > > > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > > > Communications between the kernel and fuse server at least used to
> > > > seriously want that, so that would be one place to look for unhappy
> > > > userland...
> > > >
> > > > splice-related logics in fs/fuse/dev.c is interesting; another place
> > > > like this is kernel/trace/, but I'm less familiar with that one.
> > > >
> > > > rostedt Cc'd (miklos already had been)
> > >
> > > Thanks for the Cc. The tracing ring buffer was specifically made to be used
> > > by splice and the libtracefs has a lot of code to use it as well. As
> > > reading the ring buffer literally swaps out the write portion with a blank
> > > read portion, that portion (sub-buffer) is used to be directly fed into
> > > splice, providing a zero-copy of the trace data from the write of the event
> > > to going into a file.
> > >
> > > trace-cmd defaults to using splice to copy the tracing ring buffer directly
> > > into files to avoid as much copying during live recordings as possible.
> > >
> > > Whatever changes we make, I would like to make sure there's no regressions
> > > in performance of trace-cmd record.
> >
> > Well yes, The patchset seems sensible from a quality POV.  But to make
> > a decision we should first have a decent understanding of its downside
> > impact.
> >
> > I haven't seen a description of that impact in the discussion thus far.
> > And that description is owed, please.
> >
> > I assume a small number of specialized applications are using
> > vmsplice() to great effect?  What are those applications?  What is the
> > impact of this change?
>
> > Once we are armed with that information, is there some middle ground in
> > which we de-feature vmsplice()?  Fall back to pread/pwrite in the
> > tricky cases and still permit vmsplicing if the application is
> > appropriately restrictive in it usage?
>
> I'm using vmsplice() + tee() + splice() in high-performance applications,
> load generators to be precise, and soon a cache. This is super convenient
> and extremely efficient:
>
>   - vmsplice() is used to prepare a "master" pipe with data to be sent
>     over TCP or kTLS
>   - then for each request, we do tee() from this master pipe to per-request
>     pipes.
>   - the per-request pipes are those that are used to deliver the data to
>     the socket via splice().
>
> So we effectively use vmsplice(), tee() and splice() here, and for exactly
> the reasons they were designed: only play with page refcount and not copy
> data. The code is here for the curious:
>
>    https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c
>
> and its ancestor is here:
>
>    https://github.com/wtarreau/httpterm/blob/master/httpterm.c
>
> It simply doubles the network bandwidth compared to not using that.
> (62 Gbps per core vs 31). I would seriously miss it if I couldn't use
> this anymore.
>

Wait a moment.  This is neat, but it's literally just a benchmark,
right?  I skimmed the code, and it doesn't look like a production
workload, either.  And you manage to get around the awfulness of the
vmsplice API's complete failure to tell you when it's done with a
buffer by ... never actually changing the contents of the buffer.  Do
you have any idea how you would write correct code that uses vmsplice
for sends and then *ever* mutates the data without literally
munmapping (or madvise or something) the data do you can safely mutate
it?

> I also have mid-term plans for using vmsplice() to deliver contents from
> a cache to sockets as well via splice(). Right now our cache is split into
> too small chunks (1kB) to make that useful, but as soon as we can move to
> 4kB pages, it will make sense. There the same gains are expected, and I
> would particularly dislike the idea of no longer being able to implement
> zero-copy!

If I'm understanding you correctly, you see (and measured!) a
performance improvement, and you would like to use it in production.

It seems to me that this is an excellent opportunity to remember that
vmsplice gets a performance boost in a highly synthetic situation that
sort of resembles a cache scenario and then to deprecate vmsplice and
build something better!  Or discover that we already have something
better, perhaps :)

https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html

I see that this can submit a buffer without a syscall (tee + splice is
*two* syscalls!) and that it has directly addressed what I see as the
really big deficiency in vmsplice: "This second notification tells the
application that the memory associated with the send is safe to get
reused."  If I were writing the user code, I would very much want that
notification to be an explicit part of the API instead of making a
wild guess as I think I would need to do with vmsplice.

--Andy

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox