Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Jakub Kicinski @ 2026-06-03 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Askar Safin, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wizkDXRut5xLXRF-CVUVYMaZ5AOexxeghOAoXPb4yAvQg@mail.gmail.com>

On Tue, 2 Jun 2026 21:20:13 -0700 Linus Torvalds wrote:
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
> 
> So in that sense, it's a bit sad to discuss castrating splice.

+1 IMVHO the networking bugs where people just not knowing what they
were doing. Presumably AI has scrounged all the occurrences of that
bug by now. I'd also hate to render splice optimizations moot based
on those bugs.

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andy Lutomirski @ 2026-06-03 18:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wizkDXRut5xLXRF-CVUVYMaZ5AOexxeghOAoXPb4yAvQg@mail.gmail.com>

> On Jun 2, 2026, at 9:20 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Am I understanding correctly that this will completely break zerocopy
>> sendfile?
>
> Very much, yes.
>
> And it's worth making it very very clear that ABSOLUTELY NONE of the
> recent big security bugs were in splice.
>
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
>
> So in that sense, it's a bit sad to discuss castrating splice.
>
> But it's probably still the right thing to at least try.
>
> I've seen very impressive benchmark numbers over the years, but
> they've often smelled more like benchmarketing than actual real work.
>
> There's also a real possibility that a lot of the sendfile / splice
> advantage has little to do with zero-copy, and more to do with the
> cost of mapping and maintaining buffers in user space.
>
> If you are sending file data using plain reads and writes, it's not
> just the "copy from user space to socket data structures".
>
> There's also the cost of populating user space in the first place:
> page faults for mmap made *that* historical copy avoidance basically a
> fairy tale.
>
> And not using mmap means that you have the cost of double caching in
> the kernel _and_ user space etc.
>
> So sendfile() as a concept (whether you use combinations of splice()
> system calls or the sendfile system call itsefl) isn't necessarily
> only about the zero-copy, it's really also about avoiding the user
> space memory management.

So maybe we should make sure that, if we go down the route of
disabling all the splice magic, that we leave an API, maybe the
existing sendfile or maybe something else, that does an optimized copy
from one fd to another and that is at least capable of sending from a
file to the network with at most one CPU-side copy.

Even if we’re just doing that, I continue to find it strange that we
require that a pipe be involved. What’s so special about pipes that we
allow splicing from file to pipe and then pipe to socket (this
requiring that the pipe retain a reference to the file’s page cache
structures to avoid *two* copies), but we can’t splice straight from
file to socket. Heck, even sendfile is implemented under the hood as a
pair of splices!

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-03 15:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Askar Safin, akpm, axboe, david, dhowells, hch,
	jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
	netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260603-raumfahrt-unmerklich-ertrugen-c4ecae70d5f9@brauner>

On Wed, 3 Jun 2026 at 06:40, Christian Brauner <brauner@kernel.org> wrote:
>
> Prior to the change add_to_pipe() returns -EAGAIN the moment the pipe is
> full. So iter_to_pipe stops and returns a partial count capped at pipe
> capacity. For a 128K buffer over a 64K pipe the first call returns 64K,
> the test drains it, call 2 returns the remaining 64K. Done.
>
> After this change do_writev(... flags & SPLICE_F_NONBLOCK ? RWF_NOWAIT :
> 0) then calls pipe_write which does not stop when the pipe fills. It
> blocks until the entire iovec is consumed.
>
> I kinda think we need to preserve similar semantics.

Ack. We definitely do need to keep the old semantics.

Looking at the patch again, I think it's that

    (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0

thing that is broken. I think splice_to_pipe is *always* nowait - but
has the special conditional _initial_ wait.

So I think the RWF_NOWAIT should be unconditional to the do_writev(),
and instead the code should do something like

        ret = wait_for_space(pipe, flags);
        if (!ret) do_writev(...RWF_NOWAIT);

but admittedly I did not think very much about the details, so I might
miss something.

Which also then probably measn that we should just keep the legacy
wrapper in fs/splice.c and we'd just need to make do_writev() and
do_readv() non-static.

Because I'd rather keep wait_for_space() internal to splice (or
alternatively we'd move it to pipe.c, rename it to
"pipe_wait_for_space()", and change the 'flags' argument to be a
boolean to not make it use that splice-specific flags etc).

            Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Christian Brauner @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Askar Safin, akpm, axboe, david, dhowells, hch,
	jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
	netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260603-navigieren-pleite-stilvoll-60e6da66b1d4@brauner>

On Wed, Jun 03, 2026 at 08:45:18AM +0200, Christian Brauner wrote:
> On Tue, Jun 02, 2026 at 09:20:13PM -0700, Linus Torvalds wrote:
> > On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
> > >
> > > Am I understanding correctly that this will completely break zerocopy
> > > sendfile?
> > 
> > Very much, yes.
> > 
> > And it's worth making it very very clear that ABSOLUTELY NONE of the
> > recent big security bugs were in splice.
> > 
> > They were all in the networking and crypto code that just didn't deal
> > with shared data correctly.
> > 
> > So in that sense, it's a bit sad to discuss castrating splice.
> 
> Well, we're completely ignoring the fact that splice()'s locking and
> interactions with pipe_lock() are complete insanity. So unless someone
> sits down and really thinks about how to rework the locking I think
> degrading splice() is just fine.
> 
> > But it's probably still the right thing to at least try.
> 
> Yes.
> 
> > I just suspect we'll never get real answers without going the "let's
> > just see what happens" route...
> 
> Yes.

Reading this thread again I'm really amazed how willingly people argue
to remain locked into a really broken API even if they're giving a risk
but worthwhile chance to kill it for good. Anway, odd-userspace behavior
time:

David reported vmsplice01 failing in the LTP testsuite after the change:

11297 20:41:02.548383  <LAVA_SIGNAL_STARTTC vmsplice01>
11298 20:41:02.548518  tst_tmpdir.c:316: TINFO: Using /tmp/LTP_vmsZ13ZQj as tmpdir (tmpfs filesystem)
11299 20:41:02.548656  tst_test.c:2047: TINFO: LTP version: 20260130
11300 20:41:02.548793  tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260602 #1 SMP PREEMPT Tue Jun  2 18:13:29 UTC 2026 aarch64
11301 20:41:02.548932  tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
11302 20:41:02.549069  tst_test.c:1875: TINFO: Overall timeout per run is 0h 01m 30s
11303 20:41:02.549205  tst_test.c:1632: TINFO: tmpfs is supported by the test
11304 20:41:02.549340  Test timeouted, sending SIGKILL!
11305 20:41:02.549477  tst_test.c:1947: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
11306 20:41:02.549614  tst_test.c:1949: TBROK: Test killed! (timeout?)
11307 20:41:02.549751  
11308 20:41:02.549887  Summary:
11309 20:41:02.550021  passed   0
11310 20:41:02.550155  failed   0
11311 20:41:02.550290  broken   1
11312 20:41:02.550450  skipped  0
11313 20:41:02.550582  warnings 0
11314 20:41:02.550710  
11315 20:41:02.550838  <LAVA_SIGNAL_ENDTC vmsplice01>

So I looked at the test:

	while (v.iov_len) {
		/*
		 * in a real app you'd be more clever with poll of course,
		 * here we are basically just blocking on output room and
		 * not using the free time for anything interesting.
		 */
		if (poll(&pfd, 1, -1) < 0)
			tst_brk(TBROK | TERRNO, "poll() failed");

		written = vmsplice(pipes[1], &v, 1, 0);
		if (written < 0) {
			tst_brk(TBROK | TERRNO, "vmsplice() failed");
		} else {
			if (written == 0) {
				break;
			} else {
				v.iov_base += written;
				v.iov_len -= written;
			}
		}

		SAFE_SPLICE(pipes[0], NULL, fd_out, &offset, written, 0);
		//printf("offset = %lld\n", (long long)offset);
	}

Prior to the change add_to_pipe() returns -EAGAIN the moment the pipe is
full. So iter_to_pipe stops and returns a partial count capped at pipe
capacity. For a 128K buffer over a 64K pipe the first call returns 64K,
the test drains it, call 2 returns the remaining 64K. Done.

After this change do_writev(... flags & SPLICE_F_NONBLOCK ? RWF_NOWAIT :
0) then calls pipe_write which does not stop when the pipe fills. It
blocks until the entire iovec is consumed.

I kinda think we need to preserve similar semantics.

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Pedro Falcato @ 2026-06-03 11:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Askar Safin, akpm, axboe, brauner, david,
	dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
	linux-mm, miklos, netdev, patches, viro, willy
In-Reply-To: <CALCETrWx8-Q5-rK1KnAPCxCbXaWCd=Yfs_Pr8qVMa8k8L6of1w@mail.gmail.com>

On Tue, Jun 02, 2026 at 08:51:03PM -0700, Andy Lutomirski wrote:
> On Tue, Jun 2, 2026 at 5:12 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Tue, 2 Jun 2026 at 15:54, Askar Safin <safinaskar@gmail.com> wrote:
> > >
> > > Pedro is talking here not about this vmsplice patch, but about
> > > my future hypothetical patch, which will remove splice-pagecache-to-pipe.
> >
> > That absolutely would be my suggested next step.
> >
> > Something like the attached - get rid of filemap_splice_read()
> > entirely, and just replace it with copy_splice_read().
> 
> Am I understanding correctly that this will completely break zerocopy
> sendfile?  sendfile is, internally, splice-to-a-secret-per-task-pipe
> and then splice to the socket.  How much to people care?  These days,
> a lot of high-bandwidth network senders are sending encrypted data,
> which is not zerocopy frompagecache.  But there are surely some users

You can do zerocopy from the page cache, even with TLS on top, by having
your (fancy) NIC do TLS offloading for you. See https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf.
Linux works similarly. Slide 26 is particularly interesting.
(No KTLS I assume is using simple sendmsg()'s from user memory, SW TLS
and NIC KTLS are both sendfile(), per the slides)

TL;DR I really do think it matters.

> 
> Now maybe someone cares about a different path?  Splice from socket to
> pipe to file?  Splice from socket to pipe to other socket?  Does
> anyone do any of this?  One can, of course, recv() directly to an
> mmapped file, but then you pay for page faults, so that probably a bad
> idea in most cases.  At least all of these cases don't have spliced
> buffers that refer to a potentially read-only file.
> 
> 
> But I'm a little concerned that zerocopy sends from files to network
> are actually important.
> 
> --Andy

-- 
Pedro

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Miklos Szeredi @ 2026-06-03  9:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Christian Brauner, Askar Safin, linux-kernel,
	linux-mm, linux-api, netdev, Matthew Wilcox, Jens Axboe,
	Christoph Hellwig, David Howells, Andrew Morton,
	David Hildenbrand, Pedro Falcato, patches, linux-fsdevel,
	Jan Kara, Steven Rostedt, Joanne Koong, fuse-devel
In-Reply-To: <20260601173325.GH2636677@ZenIV>

On Mon, 1 Jun 2026 at 19:33, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
>
> > TLDR: maybe we could ghet rid of "f_op->splice_read". *That* would be
> > a big simplification.
>
> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> Communications between the kernel and fuse server at least used to
> seriously want that, so that would be one place to look for unhappy
> userland...
>
> splice-related logics in fs/fuse/dev.c is interesting; another place
> like this is kernel/trace/, but I'm less familiar with that one.

[Cc: Joanne, fuse-devel]

I'd favor simplification, but care is needed to not regress performance.

Joanne might be in a better position to say something about relative
performance of various transport modes in fuse.

Thanks,
Miklos

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-06-03  7:50 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Linus Torvalds,
	Christian Brauner, Askar Safin, linux-kernel, linux-mm, linux-api,
	netdev, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <20260602184440.GB2503276@google.com>

On 6/2/26 20:44, Eric Biggers wrote:
> On Tue, Jun 02, 2026 at 10:25:06AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/2/26 02:28, Andrew Morton wrote:
>>>
>>>
>>> Well yes, The patchset seems sensible from a quality POV.  But to make
>>> a decision we should first have a decent understanding of its downside
>>> impact.
>>
>> I guess most (all?) of us ... dislike ... vmsplice(), so trying to remove it
>> entirely is certainly very appealing ...
>>
>>>
>>> I haven't seen a description of that impact in the discussion thus far.
>>> And that description is owed, please.
>>>
>>> I assume a small number of specialized applications are using
>>> vmsplice() to great effect?  What are those applications?  What is the
>>> impact of this change?
>>
>>
>> I did some digging, and the kernel crypto API documents using splice/vmsplice
>> for zero-copy[1] and libkcapi [2].
>>
>> I did not find performance numbers, how much vmsplice/splice actually gives us.
>> Playing with the kcapi-speed tool [3] (specifying --vmsplice vs. --sendmsg)
>> doesn't really reveal a big difference at least on my notebook. Not sure if the
>> parameters I specify are reasonable.
>>
>> I don't know whether downgrading vmsplice to preadv2/pwritev2 would perform
>> significantly worse than sendmsg ... and I don't know what the default would
>> usually be (default to vmsplice or sendmsg). I might try finding some time to
>> play with it more, but I doubt it, so if anybody else has time ... :)
> 
> AF_ALG is a mistake and isn't commonly used.  Using a userspace crypto
> library is faster and is what almost everyone does anyway, as it avoids
> the syscall overhead.  There are many other issues with AF_ALG as well.
> 
> 7.2 will mark AF_ALG as deprecated, mostly remove AF_ALG's zero-copy
> support, and remove AF_ALG's async I/O support:
> 
>     https://lore.kernel.org/linux-crypto/20260430011544.31823-1-ebiggers@kernel.org/
>     https://lore.kernel.org/linux-crypto/20260504225328.25356-1-ebiggers@kernel.org/
>     https://lore.kernel.org/linux-crypto/20260523-af-alg-harden-v1-0-c76755c3a5c5@gmail.com/
> 
> In practice, the programs that are keeping Linux distros from disabling
> AF_ALG in their kconfig outright are just iwd, cryptsetup, and bluez.
> They use AF_ALG just because it was mistakenly thought to be easier than
> using a userspace crypto library.  They don't need maximum performance,
> nor do they use vmsplice, splice, or sendfile.
> 
> There is other highly niche code out there that does implement the
> AF_ALG + vmsplice + splice thing, e.g. libkcapi.  But it's just not
> enough of a reason to keep zero-copy support, especially considering
> that AF_ALG has always been the wrong solution in the first place.  The
> fallback to copying the data is fine for this deprecated API.

Cool, thanks for sharing that Eric!

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Christian Brauner @ 2026-06-03  6:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Askar Safin, akpm, axboe, david, dhowells, hch,
	jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
	netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wizkDXRut5xLXRF-CVUVYMaZ5AOexxeghOAoXPb4yAvQg@mail.gmail.com>

On Tue, Jun 02, 2026 at 09:20:13PM -0700, Linus Torvalds wrote:
> On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Am I understanding correctly that this will completely break zerocopy
> > sendfile?
> 
> Very much, yes.
> 
> And it's worth making it very very clear that ABSOLUTELY NONE of the
> recent big security bugs were in splice.
> 
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
> 
> So in that sense, it's a bit sad to discuss castrating splice.

Well, we're completely ignoring the fact that splice()'s locking and
interactions with pipe_lock() are complete insanity. So unless someone
sits down and really thinks about how to rework the locking I think
degrading splice() is just fine.

> But it's probably still the right thing to at least try.

Yes.

> I just suspect we'll never get real answers without going the "let's
> just see what happens" route...

Yes.

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-03  4:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, viro, willy
In-Reply-To: <CALCETrWx8-Q5-rK1KnAPCxCbXaWCd=Yfs_Pr8qVMa8k8L6of1w@mail.gmail.com>

On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Am I understanding correctly that this will completely break zerocopy
> sendfile?

Very much, yes.

And it's worth making it very very clear that ABSOLUTELY NONE of the
recent big security bugs were in splice.

They were all in the networking and crypto code that just didn't deal
with shared data correctly.

So in that sense, it's a bit sad to discuss castrating splice.

But it's probably still the right thing to at least try.

I've seen very impressive benchmark numbers over the years, but
they've often smelled more like benchmarketing than actual real work.

There's also a real possibility that a lot of the sendfile / splice
advantage has little to do with zero-copy, and more to do with the
cost of mapping and maintaining buffers in user space.

If you are sending file data using plain reads and writes, it's not
just the "copy from user space to socket data structures".

There's also the cost of populating user space in the first place:
page faults for mmap made *that* historical copy avoidance basically a
fairy tale.

And not using mmap means that you have the cost of double caching in
the kernel _and_ user space etc.

So sendfile() as a concept (whether you use combinations of splice()
system calls or the sendfile system call itsefl) isn't necessarily
only about the zero-copy, it's really also about avoiding the user
space memory management.

But yes, there's a very real question of performance.

I just suspect we'll never get real answers without going the "let's
just see what happens" route...

                Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andy Lutomirski @ 2026-06-03  3:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wgKy4dP0oQCNKyMQQf3-uVpaigmDyH6_T0Via76gWST9g@mail.gmail.com>

On Tue, Jun 2, 2026 at 5:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 2 Jun 2026 at 15:54, Askar Safin <safinaskar@gmail.com> wrote:
> >
> > Pedro is talking here not about this vmsplice patch, but about
> > my future hypothetical patch, which will remove splice-pagecache-to-pipe.
>
> That absolutely would be my suggested next step.
>
> Something like the attached - get rid of filemap_splice_read()
> entirely, and just replace it with copy_splice_read().

Am I understanding correctly that this will completely break zerocopy
sendfile?  sendfile is, internally, splice-to-a-secret-per-task-pipe
and then splice to the socket.  How much to people care?  These days,
a lot of high-bandwidth network senders are sending encrypted data,
which is not zerocopy frompagecache.  But there are surely some users
that care, for example the person who went to the effort to implement
IORING_OP_SPLICE:

commit 7d67af2c013402537385dae343a2d0f6a4cb3bfd
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Mon Feb 24 11:32:45 2020 +0300

    io_uring: add splice(2) support

Now maybe someone cares about a different path?  Splice from socket to
pipe to file?  Splice from socket to pipe to other socket?  Does
anyone do any of this?  One can, of course, recv() directly to an
mmapped file, but then you pay for page faults, so that probably a bad
idea in most cases.  At least all of these cases don't have spliced
buffers that refer to a potentially read-only file.


But I'm a little concerned that zerocopy sends from files to network
are actually important.

--Andy

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-03  1:08 UTC (permalink / raw)
  To: torvalds
  Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
	pfalcato, safinaskar, viro, willy
In-Reply-To: <CAHk-=wgKy4dP0oQCNKyMQQf3-uVpaigmDyH6_T0Via76gWST9g@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org>:
> That absolutely would be my suggested next step.
> 
> Something like the attached - get rid of filemap_splice_read()
> entirely, and just replace it with copy_splice_read().

Okay, I will post something like this soon.

But I'm slow person, and also I will test things in Qemu, so this will
take some days.

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-03  0:05 UTC (permalink / raw)
  To: Askar Safin
  Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
	pfalcato, viro, willy
In-Reply-To: <20260602225426.122258-1-safinaskar@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 979 bytes --]

On Tue, 2 Jun 2026 at 15:54, Askar Safin <safinaskar@gmail.com> wrote:
>
> Pedro is talking here not about this vmsplice patch, but about
> my future hypothetical patch, which will remove splice-pagecache-to-pipe.

That absolutely would be my suggested next step.

Something like the attached - get rid of filemap_splice_read()
entirely, and just replace it with copy_splice_read().

That also make the whole O_DIRECT and DAX special case just simply go away.

This is - in case there was any question about it - ENTIRELY untested.

It may not compile.

And if it does compile, it may do unspeakable things to your pets.

So think of this as nothing more than a "something like this". It does
leave "splice_read" around, and it intentionally just does that

   #define filemap_splice_read copy_splice_read

to not have to modify all the existing users one by one.

It would be interesting to hear if there are any actual real loads
that would ever notice?

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 10978 bytes --]

 fs/splice.c        |   6 --
 include/linux/fs.h |   4 +-
 mm/filemap.c       | 145 ------------------------------------------------
 mm/internal.h      |   6 --
 mm/shmem.c         | 159 +----------------------------------------------------
 5 files changed, 2 insertions(+), 318 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 9d8f63e2fd1a..37136b9a6612 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -971,12 +971,6 @@ static ssize_t do_splice_read(struct file *in, loff_t *ppos,
 
 	if (unlikely(!in->f_op->splice_read))
 		return warn_unsupported(in, "read");
-	/*
-	 * O_DIRECT and DAX don't deal with the pagecache, so we allocate a
-	 * buffer, copy into it and splice that into the pipe.
-	 */
-	if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
-		return copy_splice_read(in, ppos, pipe, len, flags);
 	return in->f_op->splice_read(in, ppos, pipe, len, flags);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..e623c2804468 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3072,9 +3072,7 @@ ssize_t vfs_iocb_iter_write(struct file *file, struct kiocb *iocb,
 			    struct iov_iter *iter);
 
 /* fs/splice.c */
-ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
-			    struct pipe_inode_info *pipe,
-			    size_t len, unsigned int flags);
+#define filemap_splice_read copy_splice_read
 ssize_t copy_splice_read(struct file *in, loff_t *ppos,
 			 struct pipe_inode_info *pipe,
 			 size_t len, unsigned int flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..c0dbcbb84dba 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2999,151 +2999,6 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 }
 EXPORT_SYMBOL(generic_file_read_iter);
 
-/*
- * Splice subpages from a folio into a pipe.
- */
-size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
-			      struct folio *folio, loff_t fpos, size_t size)
-{
-	struct page *page;
-	size_t spliced = 0, offset = offset_in_folio(folio, fpos);
-
-	page = folio_page(folio, offset / PAGE_SIZE);
-	size = min(size, folio_size(folio) - offset);
-	offset %= PAGE_SIZE;
-
-	while (spliced < size && !pipe_is_full(pipe)) {
-		struct pipe_buffer *buf = pipe_head_buf(pipe);
-		size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);
-
-		*buf = (struct pipe_buffer) {
-			.ops	= &page_cache_pipe_buf_ops,
-			.page	= page,
-			.offset	= offset,
-			.len	= part,
-		};
-		folio_get(folio);
-		pipe->head++;
-		page++;
-		spliced += part;
-		offset = 0;
-	}
-
-	return spliced;
-}
-
-/**
- * filemap_splice_read -  Splice data from a file's pagecache into a pipe
- * @in: The file to read from
- * @ppos: Pointer to the file position to read from
- * @pipe: The pipe to splice into
- * @len: The amount to splice
- * @flags: The SPLICE_F_* flags
- *
- * This function gets folios from a file's pagecache and splices them into the
- * pipe.  Readahead will be called as necessary to fill more folios.  This may
- * be used for blockdevs also.
- *
- * Return: On success, the number of bytes read will be returned and *@ppos
- * will be updated if appropriate; 0 will be returned if there is no more data
- * to be read; -EAGAIN will be returned if the pipe had no space, and some
- * other negative error code will be returned on error.  A short read may occur
- * if the pipe has insufficient space, we reach the end of the data or we hit a
- * hole.
- */
-ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
-			    struct pipe_inode_info *pipe,
-			    size_t len, unsigned int flags)
-{
-	struct folio_batch fbatch;
-	struct kiocb iocb;
-	size_t total_spliced = 0, used, npages;
-	loff_t isize, end_offset;
-	bool writably_mapped;
-	int i, error = 0;
-
-	if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
-		return 0;
-
-	init_sync_kiocb(&iocb, in);
-	iocb.ki_pos = *ppos;
-
-	/* Work out how much data we can actually add into the pipe */
-	used = pipe_buf_usage(pipe);
-	npages = max_t(ssize_t, pipe->max_usage - used, 0);
-	len = min_t(size_t, len, npages * PAGE_SIZE);
-
-	folio_batch_init(&fbatch);
-
-	do {
-		cond_resched();
-
-		if (*ppos >= i_size_read(in->f_mapping->host))
-			break;
-
-		iocb.ki_pos = *ppos;
-		error = filemap_get_pages(&iocb, len, &fbatch, true);
-		if (error < 0)
-			break;
-
-		/*
-		 * i_size must be checked after we know the pages are Uptodate.
-		 *
-		 * Checking i_size after the check allows us to calculate
-		 * the correct value for "nr", which means the zero-filled
-		 * part of the page is not copied back to userspace (unless
-		 * another truncate extends the file - this is desired though).
-		 */
-		isize = i_size_read(in->f_mapping->host);
-		if (unlikely(*ppos >= isize))
-			break;
-		end_offset = min_t(loff_t, isize, *ppos + len);
-
-		/*
-		 * Once we start copying data, we don't want to be touching any
-		 * cachelines that might be contended:
-		 */
-		writably_mapped = mapping_writably_mapped(in->f_mapping);
-
-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
-			struct folio *folio = fbatch.folios[i];
-			size_t n;
-
-			if (folio_pos(folio) >= end_offset)
-				goto out;
-			folio_mark_accessed(folio);
-
-			/*
-			 * If users can be writing to this folio using arbitrary
-			 * virtual addresses, take care of potential aliasing
-			 * before reading the folio on the kernel side.
-			 */
-			if (writably_mapped)
-				flush_dcache_folio(folio);
-
-			n = min_t(loff_t, len, isize - *ppos);
-			n = splice_folio_into_pipe(pipe, folio, *ppos, n);
-			if (!n)
-				goto out;
-			len -= n;
-			total_spliced += n;
-			*ppos += n;
-			in->f_ra.prev_pos = *ppos;
-			if (pipe_is_full(pipe))
-				goto out;
-		}
-
-		folio_batch_release(&fbatch);
-	} while (len);
-
-out:
-	folio_batch_release(&fbatch);
-	file_accessed(in);
-
-	return total_spliced ? total_spliced : error;
-}
-EXPORT_SYMBOL(filemap_splice_read);
-
 static inline loff_t folio_seek_hole_data(struct xa_state *xas,
 		struct address_space *mapping, struct folio *folio,
 		loff_t start, loff_t end, bool seek_data)
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..c0ca0df5ac7e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1521,12 +1521,6 @@ struct migration_target_control {
 	enum migrate_reason reason;
 };
 
-/*
- * mm/filemap.c
- */
-size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
-			      struct folio *folio, loff_t fpos, size_t size);
-
 /*
  * mm/vmalloc.c
  */
diff --git a/mm/shmem.c b/mm/shmem.c
index 3b5dc21b323c..92138b7277b5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3481,163 +3481,6 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return ret;
 }
 
-static bool zero_pipe_buf_get(struct pipe_inode_info *pipe,
-			      struct pipe_buffer *buf)
-{
-	return true;
-}
-
-static void zero_pipe_buf_release(struct pipe_inode_info *pipe,
-				  struct pipe_buffer *buf)
-{
-}
-
-static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe,
-				    struct pipe_buffer *buf)
-{
-	return false;
-}
-
-static const struct pipe_buf_operations zero_pipe_buf_ops = {
-	.release	= zero_pipe_buf_release,
-	.try_steal	= zero_pipe_buf_try_steal,
-	.get		= zero_pipe_buf_get,
-};
-
-static size_t splice_zeropage_into_pipe(struct pipe_inode_info *pipe,
-					loff_t fpos, size_t size)
-{
-	size_t offset = fpos & ~PAGE_MASK;
-
-	size = min_t(size_t, size, PAGE_SIZE - offset);
-
-	if (!pipe_is_full(pipe)) {
-		struct pipe_buffer *buf = pipe_head_buf(pipe);
-
-		*buf = (struct pipe_buffer) {
-			.ops	= &zero_pipe_buf_ops,
-			.page	= ZERO_PAGE(0),
-			.offset	= offset,
-			.len	= size,
-		};
-		pipe->head++;
-	}
-
-	return size;
-}
-
-static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
-				      struct pipe_inode_info *pipe,
-				      size_t len, unsigned int flags)
-{
-	struct inode *inode = file_inode(in);
-	struct address_space *mapping = inode->i_mapping;
-	struct folio *folio = NULL;
-	size_t total_spliced = 0, used, npages, n, part;
-	loff_t isize;
-	int error = 0;
-
-	/* Work out how much data we can actually add into the pipe */
-	used = pipe_buf_usage(pipe);
-	npages = max_t(ssize_t, pipe->max_usage - used, 0);
-	len = min_t(size_t, len, npages * PAGE_SIZE);
-
-	do {
-		bool fallback_page_splice = false;
-		struct page *page = NULL;
-		pgoff_t index;
-		size_t size;
-
-		if (*ppos >= i_size_read(inode))
-			break;
-
-		index = *ppos >> PAGE_SHIFT;
-		error = shmem_get_folio(inode, index, 0, &folio, SGP_READ);
-		if (error) {
-			if (error == -EINVAL)
-				error = 0;
-			break;
-		}
-		if (folio) {
-			folio_unlock(folio);
-
-			page = folio_file_page(folio, index);
-			if (PageHWPoison(page)) {
-				error = -EIO;
-				break;
-			}
-
-			if (folio_test_large(folio) &&
-			    folio_test_has_hwpoisoned(folio))
-				fallback_page_splice = true;
-		}
-
-		/*
-		 * i_size must be checked after we know the pages are Uptodate.
-		 *
-		 * Checking i_size after the check allows us to calculate
-		 * the correct value for "nr", which means the zero-filled
-		 * part of the page is not copied back to userspace (unless
-		 * another truncate extends the file - this is desired though).
-		 */
-		isize = i_size_read(inode);
-		if (unlikely(*ppos >= isize))
-			break;
-		/*
-		 * Fallback to PAGE_SIZE splice if the large folio has hwpoisoned
-		 * pages.
-		 */
-		size = len;
-		if (unlikely(fallback_page_splice)) {
-			size_t offset = *ppos & ~PAGE_MASK;
-
-			size = umin(size, PAGE_SIZE - offset);
-		}
-		part = min_t(loff_t, isize - *ppos, size);
-
-		if (folio) {
-			/*
-			 * If users can be writing to this page using arbitrary
-			 * virtual addresses, take care about potential aliasing
-			 * before reading the page on the kernel side.
-			 */
-			if (mapping_writably_mapped(mapping)) {
-				if (likely(!fallback_page_splice))
-					flush_dcache_folio(folio);
-				else
-					flush_dcache_page(page);
-			}
-			folio_mark_accessed(folio);
-			/*
-			 * Ok, we have the page, and it's up-to-date, so we can
-			 * now splice it into the pipe.
-			 */
-			n = splice_folio_into_pipe(pipe, folio, *ppos, part);
-			folio_put(folio);
-			folio = NULL;
-		} else {
-			n = splice_zeropage_into_pipe(pipe, *ppos, part);
-		}
-
-		if (!n)
-			break;
-		len -= n;
-		total_spliced += n;
-		*ppos += n;
-		in->f_ra.prev_pos = *ppos;
-		if (pipe_is_full(pipe))
-			break;
-
-		cond_resched();
-	} while (len);
-
-	if (folio)
-		folio_put(folio);
-
-	file_accessed(in);
-	return total_spliced ? total_spliced : error;
-}
-
 static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 {
 	struct address_space *mapping = file->f_mapping;
@@ -5223,7 +5066,7 @@ static const struct file_operations shmem_file_operations = {
 	.read_iter	= shmem_file_read_iter,
 	.write_iter	= shmem_file_write_iter,
 	.fsync		= noop_fsync,
-	.splice_read	= shmem_file_splice_read,
+	.splice_read	= copy_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
 	.setlease	= generic_setlease,

^ permalink raw reply related

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-02 23:07 UTC (permalink / raw)
  To: pfalcato
  Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
	safinaskar, torvalds, viro, willy
In-Reply-To: <ah9Yle5pd6mD9Ugr@pedro-suse.lan>

Pedro Falcato <pfalcato@suse.de>:
> (Askar, if I was too hostile, I do sincerely apologize.)

You did nothing wrong.

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-02 22:54 UTC (permalink / raw)
  To: torvalds
  Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
	pfalcato, safinaskar, viro, willy
In-Reply-To: <CAHk-=wiAqf0PdZ4AKj_4riUnnEb=g_ZNPkLnXrByA9BBHYiFRg@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org>:
> That isn't what Askar's patch ever did.
> 
> You apparently didn't even read it.
> 
> Honestly, I think you are the one out of line here.
> 
> Askar did something I suggested years ago, and didn't remove any functionality.
> 
> It just changes vmsplice to be a copying model (one of the directions
> already was). It doesn't change regular splice at all.

Pedro is talking here not about this vmsplice patch, but about
my future hypothetical patch, which will remove splice-pagecache-to-pipe.

Let me clarify, what I want to send: I will make splice-pagecache-to-pipe
be a copy. I. e. this splice direction will continue to work, but will be
possibly slower. I. e. I will do something like this (see end of this email)
(absolutely not tested), and the same thing for other filesystems,
and also I will remove resulting dead code and remove
pipe_buf_operations::confirm (it will likely become unneeded).

If Pedro sends this instead, this will be okay.

diff --git i/fs/ext2/file.c w/fs/ext2/file.c
index d9b1eb34694a..8edcc3769793 100644
--- i/fs/ext2/file.c
+++ w/fs/ext2/file.c
@@ -326,7 +326,7 @@ const struct file_operations ext2_file_operations = {
        .release        = ext2_release_file,
        .fsync          = ext2_fsync,
        .get_unmapped_area = thp_get_unmapped_area,
-       .splice_read    = filemap_splice_read,
+       .splice_read    = copy_splice_read,
        .splice_write   = iter_file_splice_write,
        .setlease       = generic_setlease,
 };

-- 
Askar Safin

^ permalink raw reply related

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Pedro Falcato @ 2026-06-02 22:41 UTC (permalink / raw)
  To: Linus Torvalds, Askar Safin
  Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
	viro, willy
In-Reply-To: <CAHk-=wiAqf0PdZ4AKj_4riUnnEb=g_ZNPkLnXrByA9BBHYiFRg@mail.gmail.com>

On Tue, Jun 02, 2026 at 03:06:07PM -0700, Linus Torvalds wrote:
> On Tue, 2 Jun 2026 at 14:37, Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > Well, that's most definitely part of my patch. Also, you cannot outright
> > remove splice() functionality
> 
> That isn't what Askar's patch ever did.
> 
> You apparently didn't even read it.
 
Well, I was replying to Askar's new idea to remove pagecache-to-pipe splice,
which is what he suggested. And directly intersects with my sysctl-to-disable-splice
patch.

> Honestly, I think you are the one out of line here.
> 
> Askar did something I suggested years ago, and didn't remove any functionality.
> 
> It just changes vmsplice to be a copying model (one of the directions
> already was). It doesn't change regular splice at all.
> 
> And yes, it has the potential to be a visible behavior difference - if
> some insane user uses vmsplice and then modifies the buffer
> *afterwards*, then that would be semantically different between a
> zero-copy and a normal copy.
> 
> But that would be insane behavior, and was never really reliable
> anyway even with zero-copy (ie subsequent writes to user space buffers
> would potentially do COW breaking based purely on timing and memory
> pressure etc, so anybody who relied on it being visible wasn't goign
> to get it realiably anyway)
> 
> Perhaps more importantly, it has the potential to change performance -
> zero-copy *can* be a performance win, although typically it really
> doesn't tend to be (looking up the page mapping is often slower than
> copying).
> 
> I would expect it to be very clear in trivial benchmarks that aren't
> actually real loads. And probably not visible anywhere else.

Yes, vmsplice() sucks, and we know it. Hopefully no one else will see the
difference. I don't think we can say the same for splice(), though.

> Trying to make it look like Askar is the problem is only making you look worse.

To be clear, I don't think Askar is the (or a) problem. I'm glad he's
contributing, and getting rid of bad kernel interfaces is always nice. I was
just a little frustrated with a parallel splice-related-unscrew patch.

(Askar, if I was too hostile, I do sincerely apologize.)

-- 
Pedro

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-02 22:06 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, viro, willy
In-Reply-To: <ah9LaPQayJ6tBE53@pedro-suse.lan>

On Tue, 2 Jun 2026 at 14:37, Pedro Falcato <pfalcato@suse.de> wrote:
>
> Well, that's most definitely part of my patch. Also, you cannot outright
> remove splice() functionality

That isn't what Askar's patch ever did.

You apparently didn't even read it.

Honestly, I think you are the one out of line here.

Askar did something I suggested years ago, and didn't remove any functionality.

It just changes vmsplice to be a copying model (one of the directions
already was). It doesn't change regular splice at all.

And yes, it has the potential to be a visible behavior difference - if
some insane user uses vmsplice and then modifies the buffer
*afterwards*, then that would be semantically different between a
zero-copy and a normal copy.

But that would be insane behavior, and was never really reliable
anyway even with zero-copy (ie subsequent writes to user space buffers
would potentially do COW breaking based purely on timing and memory
pressure etc, so anybody who relied on it being visible wasn't goign
to get it realiably anyway)

Perhaps more importantly, it has the potential to change performance -
zero-copy *can* be a performance win, although typically it really
doesn't tend to be (looking up the page mapping is often slower than
copying).

I would expect it to be very clear in trivial benchmarks that aren't
actually real loads. And probably not visible anywhere else.

But your responses have been making it clear that you didn't seem to
actually look at the patch or the history of it.

Trying to make it look like Askar is the problem is only making you look worse.

Anyway, the vmsplice() thing is queued up in Christian's tree, and I
guess we'll see if anybody even notices anything.

              Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Pedro Falcato @ 2026-06-02 21:37 UTC (permalink / raw)
  To: Askar Safin
  Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
	torvalds, viro, willy
In-Reply-To: <20260602211242.13870-1-safinaskar@gmail.com>

On Wed, Jun 03, 2026 at 12:12:42AM +0300, Askar Safin wrote:
> Pedro Falcato <pfalcato@suse.de>:
> > On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
> > > See recent discussion here:
> > > https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> > 
> > So, you took an ongoing discussion with an ongoing RFC patchset, and you
> > decided to reimplement part of the idea on your own, as a concurrent patchset.
> > 
> > Riiiiiight.... I don't think I have to NAK this, do I?
> 
> Okay, possibly this was indeed inappropriate.
> 
> So this time I'm asking explicitly: is it okay to post new patchset?
> 
> I want to post patchset, which will remove pagecache-to-pipe splice.

Well, that's most definitely part of my patch. Also, you cannot outright
remove splice() functionality, it's pretty important (besides people doing
funky pipe business, it can also used for stuff like "take these pages that
we just got on a socket, put them on a pipe and then ship them off to an
actual file" with minimal copying; doing stuff like sendfile() also uses
splice() internally).

So, I guess I'll be sending the v2 soon.

-- 
Pedro

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-02 21:12 UTC (permalink / raw)
  To: pfalcato
  Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
	safinaskar, torvalds, viro, willy
In-Reply-To: <ahv16ogY8Zx3Rtox@pedro-suse.lan>

Pedro Falcato <pfalcato@suse.de>:
> On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
> > See recent discussion here:
> > https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> 
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.
> 
> Riiiiiight.... I don't think I have to NAK this, do I?

Okay, possibly this was indeed inappropriate.

So this time I'm asking explicitly: is it okay to post new patchset?

I want to post patchset, which will remove pagecache-to-pipe splice.

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Eric Biggers @ 2026-06-02 18:44 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Steven Rostedt, Al Viro, Linus Torvalds,
	Christian Brauner, Askar Safin, linux-kernel, linux-mm, linux-api,
	netdev, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <821ed41e-5b2f-4d17-aeb2-71b0361f8e7f@kernel.org>

On Tue, Jun 02, 2026 at 10:25:06AM +0200, David Hildenbrand (Arm) wrote:
> On 6/2/26 02:28, Andrew Morton wrote:
> > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> >> On Mon, 1 Jun 2026 18:33:25 +0100
> >> Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>
> >>>
> >>>
> >>> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> >>> Communications between the kernel and fuse server at least used to
> >>> seriously want that, so that would be one place to look for unhappy
> >>> userland...
> >>>
> >>> splice-related logics in fs/fuse/dev.c is interesting; another place
> >>> like this is kernel/trace/, but I'm less familiar with that one.
> >>>
> >>> rostedt Cc'd (miklos already had been)
> >>
> >> Thanks for the Cc. The tracing ring buffer was specifically made to be used
> >> by splice and the libtracefs has a lot of code to use it as well. As
> >> reading the ring buffer literally swaps out the write portion with a blank
> >> read portion, that portion (sub-buffer) is used to be directly fed into
> >> splice, providing a zero-copy of the trace data from the write of the event
> >> to going into a file.
> >>
> >> trace-cmd defaults to using splice to copy the tracing ring buffer directly
> >> into files to avoid as much copying during live recordings as possible.
> >>
> >> Whatever changes we make, I would like to make sure there's no regressions
> >> in performance of trace-cmd record.
> > 
> > Well yes, The patchset seems sensible from a quality POV.  But to make
> > a decision we should first have a decent understanding of its downside
> > impact.
> 
> I guess most (all?) of us ... dislike ... vmsplice(), so trying to remove it
> entirely is certainly very appealing ...
> 
> > 
> > I haven't seen a description of that impact in the discussion thus far.
> > And that description is owed, please.
> > 
> > I assume a small number of specialized applications are using
> > vmsplice() to great effect?  What are those applications?  What is the
> > impact of this change?
> 
> 
> I did some digging, and the kernel crypto API documents using splice/vmsplice
> for zero-copy[1] and libkcapi [2].
> 
> I did not find performance numbers, how much vmsplice/splice actually gives us.
> Playing with the kcapi-speed tool [3] (specifying --vmsplice vs. --sendmsg)
> doesn't really reveal a big difference at least on my notebook. Not sure if the
> parameters I specify are reasonable.
> 
> I don't know whether downgrading vmsplice to preadv2/pwritev2 would perform
> significantly worse than sendmsg ... and I don't know what the default would
> usually be (default to vmsplice or sendmsg). I might try finding some time to
> play with it more, but I doubt it, so if anybody else has time ... :)

AF_ALG is a mistake and isn't commonly used.  Using a userspace crypto
library is faster and is what almost everyone does anyway, as it avoids
the syscall overhead.  There are many other issues with AF_ALG as well.

7.2 will mark AF_ALG as deprecated, mostly remove AF_ALG's zero-copy
support, and remove AF_ALG's async I/O support:

    https://lore.kernel.org/linux-crypto/20260430011544.31823-1-ebiggers@kernel.org/
    https://lore.kernel.org/linux-crypto/20260504225328.25356-1-ebiggers@kernel.org/
    https://lore.kernel.org/linux-crypto/20260523-af-alg-harden-v1-0-c76755c3a5c5@gmail.com/

In practice, the programs that are keeping Linux distros from disabling
AF_ALG in their kconfig outright are just iwd, cryptsetup, and bluez.
They use AF_ALG just because it was mistakenly thought to be easier than
using a userspace crypto library.  They don't need maximum performance,
nor do they use vmsplice, splice, or sendfile.

There is other highly niche code out there that does implement the
AF_ALG + vmsplice + splice thing, e.g. libkcapi.  But it's just not
enough of a reason to keep zero-copy support, especially considering
that AF_ALG has always been the wrong solution in the first place.  The
fallback to copying the data is fine for this deprecated API.

- Eric

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-02 12:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Brauner, Kees Cook, Alexander Viro, linux-fsdevel,
	linux-api, linux-kernel, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <CALCETrXqWcqn_79sMKnkyKOSAjg4AmcSHsuyH83oW8zJFoV6Dw@mail.gmail.com>

Hi Andy,

 ---- On Fri, 29 May 2026 02:27:00 +0800  Andy Lutomirski <luto@kernel.org> wrote --- 
 > On Thu, May 28, 2026 at 2:55 AM Li Chen <me@linux.beauty> wrote:
 > >
 > 
 > >
 > > The template pins the executable and denies writes to that file while the
 > > template fd is alive,
 > 
 > Please don't.  *Maybe* detect when it gets modified and clear your cache.
 > 
 > Or develop a generic way to open a new fd that's an immutable view
 > into an existing file such that the fd retains its contents even if
 > the file changes.  (Think a reflink that's not persistent and has no
 > name -- you'll need some way to avoid resource exhaustion.)

 I agree that deny-write is not a good long-term invalidation model. I had
 considered clear-cache-on-modify, but kept this RFC smaller.

 > >
 > > Workload     Calls  subprocess  spawn_template  time_s       Delta
 > > (workers)    calls  calls/s     calls/s         seconds
 > > 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
 > > 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
 > > 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
 > > 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
 > 
 > This is a lot of complexity in the kernel for a teeny tiny gain.
 > 
 > I'm with Christian -- a better spawn API would be great (and much
 > faster than fork/vfork + exec), but that's a different patch.
 
 Thanks, I agree. A pidfd/pidfs spawn builder looks like the much better API shape.

 The cover letter numbers were from a mixed agent-tool workload. For very short
 single-tool runs I saw larger wins, about +14% for printf-style work.
 I should have called that out separately.

 I will work toward a pidfd_config-style builder next.

Regards,

Li​


^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-06-02  8:25 UTC (permalink / raw)
  To: Andrew Morton, Steven Rostedt
  Cc: Al Viro, Linus Torvalds, Christian Brauner, Askar Safin,
	linux-kernel, linux-mm, linux-api, netdev, Matthew Wilcox,
	Jens Axboe, Christoph Hellwig, David Howells, Pedro Falcato,
	Miklos Szeredi, patches, linux-fsdevel, Jan Kara
In-Reply-To: <20260601172825.a51a588ec1c32617a0e12d78@linux-foundation.org>

On 6/2/26 02:28, Andrew Morton wrote:
> On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>> On Mon, 1 Jun 2026 18:33:25 +0100
>> Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>>>
>>>
>>> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
>>> Communications between the kernel and fuse server at least used to
>>> seriously want that, so that would be one place to look for unhappy
>>> userland...
>>>
>>> splice-related logics in fs/fuse/dev.c is interesting; another place
>>> like this is kernel/trace/, but I'm less familiar with that one.
>>>
>>> rostedt Cc'd (miklos already had been)
>>
>> Thanks for the Cc. The tracing ring buffer was specifically made to be used
>> by splice and the libtracefs has a lot of code to use it as well. As
>> reading the ring buffer literally swaps out the write portion with a blank
>> read portion, that portion (sub-buffer) is used to be directly fed into
>> splice, providing a zero-copy of the trace data from the write of the event
>> to going into a file.
>>
>> trace-cmd defaults to using splice to copy the tracing ring buffer directly
>> into files to avoid as much copying during live recordings as possible.
>>
>> Whatever changes we make, I would like to make sure there's no regressions
>> in performance of trace-cmd record.
> 
> Well yes, The patchset seems sensible from a quality POV.  But to make
> a decision we should first have a decent understanding of its downside
> impact.

I guess most (all?) of us ... dislike ... vmsplice(), so trying to remove it
entirely is certainly very appealing ...

> 
> I haven't seen a description of that impact in the discussion thus far.
> And that description is owed, please.
> 
> I assume a small number of specialized applications are using
> vmsplice() to great effect?  What are those applications?  What is the
> impact of this change?


I did some digging, and the kernel crypto API documents using splice/vmsplice
for zero-copy[1] and libkcapi [2].

I did not find performance numbers, how much vmsplice/splice actually gives us.
Playing with the kcapi-speed tool [3] (specifying --vmsplice vs. --sendmsg)
doesn't really reveal a big difference at least on my notebook. Not sure if the
parameters I specify are reasonable.

I don't know whether downgrading vmsplice to preadv2/pwritev2 would perform
significantly worse than sendmsg ... and I don't know what the default would
usually be (default to vmsplice or sendmsg). I might try finding some time to
play with it more, but I doubt it, so if anybody else has time ... :)


I'll note that we have a bunch of selftests (mostly around COW handling) that
rely on vmsplice to test R/O pinning behavior. For R/W pinning, we can use
iouring fixed buffers easily. The only alternative for R/O pinning is using the
gup_test infrastructure that needs to be compiled into the kernel, unfortunately ...

So we'll have to adjust some tests there to use a different interface. I'm sure
I can find someone to work on that once this change here landed and doesn't have
to be yanked immediately again.


[1] https://www.kernel.org/doc/html/latest/crypto/userspace-if.html
[2] https://github.com/smuellerDD/libkcapi/blob/master/lib/kcapi-kernel-if.c
[3] https://github.com/smuellerDD/libkcapi/tree/master/speed-test

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andrew Morton @ 2026-06-02  0:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Al Viro, Linus Torvalds, Christian Brauner, Askar Safin,
	linux-kernel, linux-mm, linux-api, netdev, Matthew Wilcox,
	Jens Axboe, Christoph Hellwig, David Howells, David Hildenbrand,
	Pedro Falcato, Miklos Szeredi, patches, linux-fsdevel, Jan Kara
In-Reply-To: <20260601160455.2c187574@gandalf.local.home>

On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 1 Jun 2026 18:33:25 +0100
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > 
> > > TLDR: maybe we could ghet rid of "f_op->splice_read". *That* would be
> > > a big simplification.  
> > 
> > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > Communications between the kernel and fuse server at least used to
> > seriously want that, so that would be one place to look for unhappy
> > userland...
> > 
> > splice-related logics in fs/fuse/dev.c is interesting; another place
> > like this is kernel/trace/, but I'm less familiar with that one.
> > 
> > rostedt Cc'd (miklos already had been)
> 
> Thanks for the Cc. The tracing ring buffer was specifically made to be used
> by splice and the libtracefs has a lot of code to use it as well. As
> reading the ring buffer literally swaps out the write portion with a blank
> read portion, that portion (sub-buffer) is used to be directly fed into
> splice, providing a zero-copy of the trace data from the write of the event
> to going into a file.
> 
> trace-cmd defaults to using splice to copy the tracing ring buffer directly
> into files to avoid as much copying during live recordings as possible.
> 
> Whatever changes we make, I would like to make sure there's no regressions
> in performance of trace-cmd record.

Well yes, The patchset seems sensible from a quality POV.  But to make
a decision we should first have a decent understanding of its downside
impact.

I haven't seen a description of that impact in the discussion thus far.
And that description is owed, please.

I assume a small number of specialized applications are using
vmsplice() to great effect?  What are those applications?  What is the
impact of this change?

Once we are armed with that information, is there some middle ground in
which we de-feature vmsplice()?  Fall back to pread/pwrite in the
tricky cases and still permit vmsplicing if the application is
appropriately restrictive in it usage?

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Steven Rostedt @ 2026-06-01 20:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Christian Brauner, Askar Safin, linux-kernel,
	linux-mm, linux-api, netdev, Matthew Wilcox, Jens Axboe,
	Christoph Hellwig, David Howells, Andrew Morton,
	David Hildenbrand, Pedro Falcato, Miklos Szeredi, patches,
	linux-fsdevel, Jan Kara
In-Reply-To: <20260601173325.GH2636677@ZenIV>

On Mon, 1 Jun 2026 18:33:25 +0100
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> 
> > TLDR: maybe we could ghet rid of "f_op->splice_read". *That* would be
> > a big simplification.  
> 
> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> Communications between the kernel and fuse server at least used to
> seriously want that, so that would be one place to look for unhappy
> userland...
> 
> splice-related logics in fs/fuse/dev.c is interesting; another place
> like this is kernel/trace/, but I'm less familiar with that one.
> 
> rostedt Cc'd (miklos already had been)

Thanks for the Cc. The tracing ring buffer was specifically made to be used
by splice and the libtracefs has a lot of code to use it as well. As
reading the ring buffer literally swaps out the write portion with a blank
read portion, that portion (sub-buffer) is used to be directly fed into
splice, providing a zero-copy of the trace data from the write of the event
to going into a file.

trace-cmd defaults to using splice to copy the tracing ring buffer directly
into files to avoid as much copying during live recordings as possible.

Whatever changes we make, I would like to make sure there's no regressions
in performance of trace-cmd record.

Thanks,

-- Steve

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Kees Cook @ 2026-06-01 19:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Li Chen, Alexander Viro, linux-fsdevel, linux-api, linux-kernel,
	linux-mm, linux-arch, linux-doc, linux-kselftest, x86,
	Arnd Bergmann, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <20260528-madig-fachrichtung-fehlinformation-61117ba640da@brauner>

On Thu, May 28, 2026 at 01:02:53PM +0200, Christian Brauner wrote:
> On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > Hi,
> > 
> > This is an early RFC for an idea that is probably still rough in both the
> > UAPI and implementation details. Sorry for the rough edges; I am sending
> > it now to check whether this direction is worth pursuing and to get
> > feedback on the kernel/userspace boundary.
> 
> The idea of having a builder api for exec isn't all that crazy. But it
> should simply be built on top of pidfds and thus pidfs itself instead.
> It has all the basic infrastructure in place already. Any implementation
> should also allow userspace to implement posix_spawn() on top of it.
> 
> fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
> 
> pidfd_config(fd, ...) // modeled similar to fsconfig()

FWIW, I agree this should be modelled after fsconfig and built on pidfs.
Doing so will avoid a bunch of design issues, etc.

-Kees

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Al Viro @ 2026-06-01 17:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Askar Safin, linux-kernel, linux-mm, linux-api,
	netdev, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, patches, linux-fsdevel, Jan Kara, Steven Rostedt
In-Reply-To: <CAHk-=wifX_rrDjRGnDnOqE-usptAukuXKrmuPuVDP5bOCBWzGQ@mail.gmail.com>

On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:

> TLDR: maybe we could ghet rid of "f_op->splice_read". *That* would be
> a big simplification.

FUSE might be interesting - fuse_dev_splice_read() and its ilk.
Communications between the kernel and fuse server at least used to
seriously want that, so that would be one place to look for unhappy
userland...

splice-related logics in fs/fuse/dev.c is interesting; another place
like this is kernel/trace/, but I'm less familiar with that one.

rostedt Cc'd (miklos already had been)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox