* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Eric Biggers @ 2026-06-02 18:44 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Andrew Morton, Steven Rostedt, Al Viro, Linus Torvalds,
Christian Brauner, Askar Safin, linux-kernel, linux-mm, linux-api,
netdev, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
David Howells, Pedro Falcato, Miklos Szeredi, patches,
linux-fsdevel, Jan Kara
In-Reply-To: <821ed41e-5b2f-4d17-aeb2-71b0361f8e7f@kernel.org>
On Tue, Jun 02, 2026 at 10:25:06AM +0200, David Hildenbrand (Arm) wrote:
> On 6/2/26 02:28, Andrew Morton wrote:
> > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> >> On Mon, 1 Jun 2026 18:33:25 +0100
> >> Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>
> >>>
> >>>
> >>> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> >>> Communications between the kernel and fuse server at least used to
> >>> seriously want that, so that would be one place to look for unhappy
> >>> userland...
> >>>
> >>> splice-related logic in fs/fuse/dev.c is interesting; another place
> >>> like this is kernel/trace/, but I'm less familiar with that one.
> >>>
> >>> rostedt Cc'd (miklos already had been)
> >>
> >> Thanks for the Cc. The tracing ring buffer was specifically made to be used
> >> by splice and the libtracefs has a lot of code to use it as well. As
> >> reading the ring buffer literally swaps out the write portion with a blank
> >> read portion, that portion (sub-buffer) is used to be directly fed into
> >> splice, providing a zero-copy of the trace data from the write of the event
> >> to going into a file.
> >>
> >> trace-cmd defaults to using splice to copy the tracing ring buffer directly
> >> into files to avoid as much copying during live recordings as possible.
> >>
> >> Whatever changes we make, I would like to make sure there are no regressions
> >> in performance of trace-cmd record.
> >
> > > Well yes, the patchset seems sensible from a quality POV. But to make
> > a decision we should first have a decent understanding of its downside
> > impact.
>
> I guess most (all?) of us ... dislike ... vmsplice(), so trying to remove it
> entirely is certainly very appealing ...
>
> >
> > I haven't seen a description of that impact in the discussion thus far.
> > And that description is owed, please.
> >
> > I assume a small number of specialized applications are using
> > vmsplice() to great effect? What are those applications? What is the
> > impact of this change?
>
>
> I did some digging, and the kernel crypto API documents using splice/vmsplice
> for zero-copy[1] and libkcapi [2].
>
> I did not find performance numbers showing how much vmsplice/splice actually gives us.
> Playing with the kcapi-speed tool [3] (specifying --vmsplice vs. --sendmsg)
> doesn't really reveal a big difference at least on my notebook. Not sure if the
> parameters I specify are reasonable.
>
> I don't know whether downgrading vmsplice to preadv2/pwritev2 would perform
> significantly worse than sendmsg ... and I don't know what the default would
> usually be (default to vmsplice or sendmsg). I might try finding some time to
> play with it more, but I doubt it, so if anybody else has time ... :)
AF_ALG is a mistake and isn't commonly used. Using a userspace crypto
library is faster and is what almost everyone does anyway, as it avoids
the syscall overhead. There are many other issues with AF_ALG as well.
7.2 will mark AF_ALG as deprecated, mostly remove AF_ALG's zero-copy
support, and remove AF_ALG's async I/O support:
https://lore.kernel.org/linux-crypto/20260430011544.31823-1-ebiggers@kernel.org/
https://lore.kernel.org/linux-crypto/20260504225328.25356-1-ebiggers@kernel.org/
https://lore.kernel.org/linux-crypto/20260523-af-alg-harden-v1-0-c76755c3a5c5@gmail.com/
In practice, the programs that are keeping Linux distros from disabling
AF_ALG in their kconfig outright are just iwd, cryptsetup, and bluez.
They use AF_ALG just because it was mistakenly thought to be easier than
using a userspace crypto library. They don't need maximum performance,
nor do they use vmsplice, splice, or sendfile.
There is other highly niche code out there that does implement the
AF_ALG + vmsplice + splice thing, e.g. libkcapi. But it's just not
enough of a reason to keep zero-copy support, especially considering
that AF_ALG has always been the wrong solution in the first place. The
fallback to copying the data is fine for this deprecated API.
- Eric
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-02 21:12 UTC (permalink / raw)
To: pfalcato
Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
safinaskar, torvalds, viro, willy
In-Reply-To: <ahv16ogY8Zx3Rtox@pedro-suse.lan>
Pedro Falcato <pfalcato@suse.de>:
> On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
> > See recent discussion here:
> > https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
>
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.
>
> Riiiiiight.... I don't think I have to NAK this, do I?
Okay, possibly this was indeed inappropriate.
So this time I'm asking explicitly: is it okay to post a new patchset?
I want to post a patchset that will remove pagecache-to-pipe splice.
--
Askar Safin
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Pedro Falcato @ 2026-06-02 21:37 UTC (permalink / raw)
To: Askar Safin
Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
torvalds, viro, willy
In-Reply-To: <20260602211242.13870-1-safinaskar@gmail.com>
On Wed, Jun 03, 2026 at 12:12:42AM +0300, Askar Safin wrote:
> Pedro Falcato <pfalcato@suse.de>:
> > On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
> > > See recent discussion here:
> > > https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> >
> > So, you took an ongoing discussion with an ongoing RFC patchset, and you
> > decided to reimplement part of the idea on your own, as a concurrent patchset.
> >
> > Riiiiiight.... I don't think I have to NAK this, do I?
>
> Okay, possibly this was indeed inappropriate.
>
> So this time I'm asking explicitly: is it okay to post a new patchset?
>
> I want to post a patchset that will remove pagecache-to-pipe splice.
Well, that's most definitely part of my patch. Also, you cannot outright
remove splice() functionality, it's pretty important (besides people doing
funky pipe business, it can also be used for stuff like "take these pages that
we just got on a socket, put them on a pipe and then ship them off to an
actual file" with minimal copying; doing stuff like sendfile() also uses
splice() internally).
So, I guess I'll be sending the v2 soon.
--
Pedro
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-02 22:06 UTC (permalink / raw)
To: Pedro Falcato
Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
patches, viro, willy
In-Reply-To: <ah9LaPQayJ6tBE53@pedro-suse.lan>
On Tue, 2 Jun 2026 at 14:37, Pedro Falcato <pfalcato@suse.de> wrote:
>
> Well, that's most definitely part of my patch. Also, you cannot outright
> remove splice() functionality
That isn't what Askar's patch ever did.
You apparently didn't even read it.
Honestly, I think you are the one out of line here.
Askar did something I suggested years ago, and didn't remove any functionality.
It just changes vmsplice to be a copying model (one of the directions
already was). It doesn't change regular splice at all.
And yes, it has the potential to be a visible behavior difference - if
some insane user uses vmsplice and then modifies the buffer
*afterwards*, then that would be semantically different between a
zero-copy and a normal copy.
But that would be insane behavior, and was never really reliable
anyway even with zero-copy (ie subsequent writes to user space buffers
would potentially do COW breaking based purely on timing and memory
pressure etc, so anybody who relied on it being visible wasn't going
to get it reliably anyway)
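The buffer-reuse case described above can be sketched in a few lines of C. This is only an illustration, assuming Linux with glibc; the helper name vmsplice_then_modify() is hypothetical and does not come from the patch. With a copying vmsplice() the reader would be expected to see the original 'A'; with the zero-copy behavior it may see the later 'B', and which one you get is not something a program should rely on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the patch: vmsplice() one page of 'A'
 * into a pipe, overwrite the user buffer with 'B', then read one byte back.
 * Returns that byte, or -1 on error.
 */
int vmsplice_then_modify(void)
{
	long page = sysconf(_SC_PAGESIZE);
	struct iovec iov;
	void *buf = NULL;
	char first;
	int pfd[2];
	int result = -1;

	assert(page > 0);
	if (pipe(pfd) < 0)
		return -1;
	if (posix_memalign(&buf, (size_t)page, (size_t)page) != 0)
		goto out_close;

	memset(buf, 'A', (size_t)page);
	iov.iov_base = buf;
	iov.iov_len = (size_t)page;
	/* One page fits in an empty default-sized pipe, so this does not block. */
	if (vmsplice(pfd[1], &iov, 1, 0) != (ssize_t)page)
		goto out_free;

	/* The step under discussion: modify the buffer after handing it over. */
	memset(buf, 'B', (size_t)page);

	if (read(pfd[0], &first, 1) == 1)
		result = (unsigned char)first;

out_free:
	free(buf);
out_close:
	close(pfd[0]);
	close(pfd[1]);
	return result;
}
```

A caller would only check that the result is one of the two fill bytes; a program that needs a specific answer should not touch the buffer until the pipe has been drained.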
Perhaps more importantly, it has the potential to change performance -
zero-copy *can* be a performance win, although typically it really
doesn't tend to be (looking up the page mapping is often slower than
copying).
I would expect it to be very clear in trivial benchmarks that aren't
actually real loads. And probably not visible anywhere else.
But your responses have been making it clear that you didn't seem to
actually look at the patch or the history of it.
Trying to make it look like Askar is the problem is only making you look worse.
Anyway, the vmsplice() thing is queued up in Christian's tree, and I
guess we'll see if anybody even notices anything.
Linus
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Pedro Falcato @ 2026-06-02 22:41 UTC (permalink / raw)
To: Linus Torvalds, Askar Safin
Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
viro, willy
In-Reply-To: <CAHk-=wiAqf0PdZ4AKj_4riUnnEb=g_ZNPkLnXrByA9BBHYiFRg@mail.gmail.com>
On Tue, Jun 02, 2026 at 03:06:07PM -0700, Linus Torvalds wrote:
> On Tue, 2 Jun 2026 at 14:37, Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > Well, that's most definitely part of my patch. Also, you cannot outright
> > remove splice() functionality
>
> That isn't what Askar's patch ever did.
>
> You apparently didn't even read it.
Well, I was replying to Askar's new idea to remove pagecache-to-pipe splice,
which is what he suggested, and which directly intersects with my
sysctl-to-disable-splice patch.
> Honestly, I think you are the one out of line here.
>
> Askar did something I suggested years ago, and didn't remove any functionality.
>
> It just changes vmsplice to be a copying model (one of the directions
> already was). It doesn't change regular splice at all.
>
> And yes, it has the potential to be a visible behavior difference - if
> some insane user uses vmsplice and then modifies the buffer
> *afterwards*, then that would be semantically different between a
> zero-copy and a normal copy.
>
> But that would be insane behavior, and was never really reliable
> anyway even with zero-copy (ie subsequent writes to user space buffers
> would potentially do COW breaking based purely on timing and memory
> pressure etc, so anybody who relied on it being visible wasn't going
> to get it reliably anyway)
>
> Perhaps more importantly, it has the potential to change performance -
> zero-copy *can* be a performance win, although typically it really
> doesn't tend to be (looking up the page mapping is often slower than
> copying).
>
> I would expect it to be very clear in trivial benchmarks that aren't
> actually real loads. And probably not visible anywhere else.
Yes, vmsplice() sucks, and we know it. Hopefully no one else will see the
difference. I don't think we can say the same for splice(), though.
> Trying to make it look like Askar is the problem is only making you look worse.
To be clear, I don't think Askar is the (or a) problem. I'm glad he's
contributing, and getting rid of bad kernel interfaces is always nice. I was
just a little frustrated with a parallel splice-related-unscrew patch.
(Askar, if I was too hostile, I do sincerely apologize.)
--
Pedro
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-02 22:54 UTC (permalink / raw)
To: torvalds
Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
pfalcato, safinaskar, viro, willy
In-Reply-To: <CAHk-=wiAqf0PdZ4AKj_4riUnnEb=g_ZNPkLnXrByA9BBHYiFRg@mail.gmail.com>
Linus Torvalds <torvalds@linux-foundation.org>:
> That isn't what Askar's patch ever did.
>
> You apparently didn't even read it.
>
> Honestly, I think you are the one out of line here.
>
> Askar did something I suggested years ago, and didn't remove any functionality.
>
> It just changes vmsplice to be a copying model (one of the directions
> already was). It doesn't change regular splice at all.
Pedro is talking here not about this vmsplice patch, but about
my future hypothetical patch, which will remove splice-pagecache-to-pipe.
Let me clarify what I want to send: I will make splice-pagecache-to-pipe
a copy. I.e. this splice direction will continue to work, but will
possibly be slower. I.e. I will do something like this (see the end of this
email; absolutely not tested), and the same thing for other filesystems.
I will also remove the resulting dead code and remove
pipe_buf_operations::confirm (it will likely become unneeded).
If Pedro sends this instead, this will be okay.
diff --git i/fs/ext2/file.c w/fs/ext2/file.c
index d9b1eb34694a..8edcc3769793 100644
--- i/fs/ext2/file.c
+++ w/fs/ext2/file.c
@@ -326,7 +326,7 @@ const struct file_operations ext2_file_operations = {
.release = ext2_release_file,
.fsync = ext2_fsync,
.get_unmapped_area = thp_get_unmapped_area,
- .splice_read = filemap_splice_read,
+ .splice_read = copy_splice_read,
.splice_write = iter_file_splice_write,
.setlease = generic_setlease,
};
--
Askar Safin
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-02 23:07 UTC (permalink / raw)
To: pfalcato
Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
safinaskar, torvalds, viro, willy
In-Reply-To: <ah9Yle5pd6mD9Ugr@pedro-suse.lan>
Pedro Falcato <pfalcato@suse.de>:
> (Askar, if I was too hostile, I do sincerely apologize.)
You did nothing wrong.
--
Askar Safin
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-03 0:05 UTC (permalink / raw)
To: Askar Safin
Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
pfalcato, viro, willy
In-Reply-To: <20260602225426.122258-1-safinaskar@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 979 bytes --]
On Tue, 2 Jun 2026 at 15:54, Askar Safin <safinaskar@gmail.com> wrote:
>
> Pedro is talking here not about this vmsplice patch, but about
> my future hypothetical patch, which will remove splice-pagecache-to-pipe.
That absolutely would be my suggested next step.
Something like the attached - get rid of filemap_splice_read()
entirely, and just replace it with copy_splice_read().
That also makes the whole O_DIRECT and DAX special case just simply go away.
This is - in case there was any question about it - ENTIRELY untested.
It may not compile.
And if it does compile, it may do unspeakable things to your pets.
So think of this as nothing more than a "something like this". It does
leave "splice_read" around, and it intentionally just does that
#define filemap_splice_read copy_splice_read
to not have to modify all the existing users one by one.
It would be interesting to hear if there are any actual real loads
that would ever notice?
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 10978 bytes --]
fs/splice.c | 6 --
include/linux/fs.h | 4 +-
mm/filemap.c | 145 ------------------------------------------------
mm/internal.h | 6 --
mm/shmem.c | 159 +----------------------------------------------------
5 files changed, 2 insertions(+), 318 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c
index 9d8f63e2fd1a..37136b9a6612 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -971,12 +971,6 @@ static ssize_t do_splice_read(struct file *in, loff_t *ppos,
if (unlikely(!in->f_op->splice_read))
return warn_unsupported(in, "read");
- /*
- * O_DIRECT and DAX don't deal with the pagecache, so we allocate a
- * buffer, copy into it and splice that into the pipe.
- */
- if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
- return copy_splice_read(in, ppos, pipe, len, flags);
return in->f_op->splice_read(in, ppos, pipe, len, flags);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..e623c2804468 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3072,9 +3072,7 @@ ssize_t vfs_iocb_iter_write(struct file *file, struct kiocb *iocb,
struct iov_iter *iter);
/* fs/splice.c */
-ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
- struct pipe_inode_info *pipe,
- size_t len, unsigned int flags);
+#define filemap_splice_read copy_splice_read
ssize_t copy_splice_read(struct file *in, loff_t *ppos,
struct pipe_inode_info *pipe,
size_t len, unsigned int flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..c0dbcbb84dba 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2999,151 +2999,6 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
}
EXPORT_SYMBOL(generic_file_read_iter);
-/*
- * Splice subpages from a folio into a pipe.
- */
-size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
- struct folio *folio, loff_t fpos, size_t size)
-{
- struct page *page;
- size_t spliced = 0, offset = offset_in_folio(folio, fpos);
-
- page = folio_page(folio, offset / PAGE_SIZE);
- size = min(size, folio_size(folio) - offset);
- offset %= PAGE_SIZE;
-
- while (spliced < size && !pipe_is_full(pipe)) {
- struct pipe_buffer *buf = pipe_head_buf(pipe);
- size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);
-
- *buf = (struct pipe_buffer) {
- .ops = &page_cache_pipe_buf_ops,
- .page = page,
- .offset = offset,
- .len = part,
- };
- folio_get(folio);
- pipe->head++;
- page++;
- spliced += part;
- offset = 0;
- }
-
- return spliced;
-}
-
-/**
- * filemap_splice_read - Splice data from a file's pagecache into a pipe
- * @in: The file to read from
- * @ppos: Pointer to the file position to read from
- * @pipe: The pipe to splice into
- * @len: The amount to splice
- * @flags: The SPLICE_F_* flags
- *
- * This function gets folios from a file's pagecache and splices them into the
- * pipe. Readahead will be called as necessary to fill more folios. This may
- * be used for blockdevs also.
- *
- * Return: On success, the number of bytes read will be returned and *@ppos
- * will be updated if appropriate; 0 will be returned if there is no more data
- * to be read; -EAGAIN will be returned if the pipe had no space, and some
- * other negative error code will be returned on error. A short read may occur
- * if the pipe has insufficient space, we reach the end of the data or we hit a
- * hole.
- */
-ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
- struct pipe_inode_info *pipe,
- size_t len, unsigned int flags)
-{
- struct folio_batch fbatch;
- struct kiocb iocb;
- size_t total_spliced = 0, used, npages;
- loff_t isize, end_offset;
- bool writably_mapped;
- int i, error = 0;
-
- if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
- return 0;
-
- init_sync_kiocb(&iocb, in);
- iocb.ki_pos = *ppos;
-
- /* Work out how much data we can actually add into the pipe */
- used = pipe_buf_usage(pipe);
- npages = max_t(ssize_t, pipe->max_usage - used, 0);
- len = min_t(size_t, len, npages * PAGE_SIZE);
-
- folio_batch_init(&fbatch);
-
- do {
- cond_resched();
-
- if (*ppos >= i_size_read(in->f_mapping->host))
- break;
-
- iocb.ki_pos = *ppos;
- error = filemap_get_pages(&iocb, len, &fbatch, true);
- if (error < 0)
- break;
-
- /*
- * i_size must be checked after we know the pages are Uptodate.
- *
- * Checking i_size after the check allows us to calculate
- * the correct value for "nr", which means the zero-filled
- * part of the page is not copied back to userspace (unless
- * another truncate extends the file - this is desired though).
- */
- isize = i_size_read(in->f_mapping->host);
- if (unlikely(*ppos >= isize))
- break;
- end_offset = min_t(loff_t, isize, *ppos + len);
-
- /*
- * Once we start copying data, we don't want to be touching any
- * cachelines that might be contended:
- */
- writably_mapped = mapping_writably_mapped(in->f_mapping);
-
- for (i = 0; i < folio_batch_count(&fbatch); i++) {
- struct folio *folio = fbatch.folios[i];
- size_t n;
-
- if (folio_pos(folio) >= end_offset)
- goto out;
- folio_mark_accessed(folio);
-
- /*
- * If users can be writing to this folio using arbitrary
- * virtual addresses, take care of potential aliasing
- * before reading the folio on the kernel side.
- */
- if (writably_mapped)
- flush_dcache_folio(folio);
-
- n = min_t(loff_t, len, isize - *ppos);
- n = splice_folio_into_pipe(pipe, folio, *ppos, n);
- if (!n)
- goto out;
- len -= n;
- total_spliced += n;
- *ppos += n;
- in->f_ra.prev_pos = *ppos;
- if (pipe_is_full(pipe))
- goto out;
- }
-
- folio_batch_release(&fbatch);
- } while (len);
-
-out:
- folio_batch_release(&fbatch);
- file_accessed(in);
-
- return total_spliced ? total_spliced : error;
-}
-EXPORT_SYMBOL(filemap_splice_read);
-
static inline loff_t folio_seek_hole_data(struct xa_state *xas,
struct address_space *mapping, struct folio *folio,
loff_t start, loff_t end, bool seek_data)
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..c0ca0df5ac7e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1521,12 +1521,6 @@ struct migration_target_control {
enum migrate_reason reason;
};
-/*
- * mm/filemap.c
- */
-size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
- struct folio *folio, loff_t fpos, size_t size);
-
/*
* mm/vmalloc.c
*/
diff --git a/mm/shmem.c b/mm/shmem.c
index 3b5dc21b323c..92138b7277b5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3481,163 +3481,6 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
return ret;
}
-static bool zero_pipe_buf_get(struct pipe_inode_info *pipe,
- struct pipe_buffer *buf)
-{
- return true;
-}
-
-static void zero_pipe_buf_release(struct pipe_inode_info *pipe,
- struct pipe_buffer *buf)
-{
-}
-
-static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe,
- struct pipe_buffer *buf)
-{
- return false;
-}
-
-static const struct pipe_buf_operations zero_pipe_buf_ops = {
- .release = zero_pipe_buf_release,
- .try_steal = zero_pipe_buf_try_steal,
- .get = zero_pipe_buf_get,
-};
-
-static size_t splice_zeropage_into_pipe(struct pipe_inode_info *pipe,
- loff_t fpos, size_t size)
-{
- size_t offset = fpos & ~PAGE_MASK;
-
- size = min_t(size_t, size, PAGE_SIZE - offset);
-
- if (!pipe_is_full(pipe)) {
- struct pipe_buffer *buf = pipe_head_buf(pipe);
-
- *buf = (struct pipe_buffer) {
- .ops = &zero_pipe_buf_ops,
- .page = ZERO_PAGE(0),
- .offset = offset,
- .len = size,
- };
- pipe->head++;
- }
-
- return size;
-}
-
-static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
- struct pipe_inode_info *pipe,
- size_t len, unsigned int flags)
-{
- struct inode *inode = file_inode(in);
- struct address_space *mapping = inode->i_mapping;
- struct folio *folio = NULL;
- size_t total_spliced = 0, used, npages, n, part;
- loff_t isize;
- int error = 0;
-
- /* Work out how much data we can actually add into the pipe */
- used = pipe_buf_usage(pipe);
- npages = max_t(ssize_t, pipe->max_usage - used, 0);
- len = min_t(size_t, len, npages * PAGE_SIZE);
-
- do {
- bool fallback_page_splice = false;
- struct page *page = NULL;
- pgoff_t index;
- size_t size;
-
- if (*ppos >= i_size_read(inode))
- break;
-
- index = *ppos >> PAGE_SHIFT;
- error = shmem_get_folio(inode, index, 0, &folio, SGP_READ);
- if (error) {
- if (error == -EINVAL)
- error = 0;
- break;
- }
- if (folio) {
- folio_unlock(folio);
-
- page = folio_file_page(folio, index);
- if (PageHWPoison(page)) {
- error = -EIO;
- break;
- }
-
- if (folio_test_large(folio) &&
- folio_test_has_hwpoisoned(folio))
- fallback_page_splice = true;
- }
-
- /*
- * i_size must be checked after we know the pages are Uptodate.
- *
- * Checking i_size after the check allows us to calculate
- * the correct value for "nr", which means the zero-filled
- * part of the page is not copied back to userspace (unless
- * another truncate extends the file - this is desired though).
- */
- isize = i_size_read(inode);
- if (unlikely(*ppos >= isize))
- break;
- /*
- * Fallback to PAGE_SIZE splice if the large folio has hwpoisoned
- * pages.
- */
- size = len;
- if (unlikely(fallback_page_splice)) {
- size_t offset = *ppos & ~PAGE_MASK;
-
- size = umin(size, PAGE_SIZE - offset);
- }
- part = min_t(loff_t, isize - *ppos, size);
-
- if (folio) {
- /*
- * If users can be writing to this page using arbitrary
- * virtual addresses, take care about potential aliasing
- * before reading the page on the kernel side.
- */
- if (mapping_writably_mapped(mapping)) {
- if (likely(!fallback_page_splice))
- flush_dcache_folio(folio);
- else
- flush_dcache_page(page);
- }
- folio_mark_accessed(folio);
- /*
- * Ok, we have the page, and it's up-to-date, so we can
- * now splice it into the pipe.
- */
- n = splice_folio_into_pipe(pipe, folio, *ppos, part);
- folio_put(folio);
- folio = NULL;
- } else {
- n = splice_zeropage_into_pipe(pipe, *ppos, part);
- }
-
- if (!n)
- break;
- len -= n;
- total_spliced += n;
- *ppos += n;
- in->f_ra.prev_pos = *ppos;
- if (pipe_is_full(pipe))
- break;
-
- cond_resched();
- } while (len);
-
- if (folio)
- folio_put(folio);
-
- file_accessed(in);
- return total_spliced ? total_spliced : error;
-}
-
static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
{
struct address_space *mapping = file->f_mapping;
@@ -5223,7 +5066,7 @@ static const struct file_operations shmem_file_operations = {
.read_iter = shmem_file_read_iter,
.write_iter = shmem_file_write_iter,
.fsync = noop_fsync,
- .splice_read = shmem_file_splice_read,
+ .splice_read = copy_splice_read,
.splice_write = iter_file_splice_write,
.fallocate = shmem_fallocate,
.setlease = generic_setlease,
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-03 1:08 UTC (permalink / raw)
To: torvalds
Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
linux-fsdevel, linux-kernel, linux-mm, miklos, netdev, patches,
pfalcato, safinaskar, viro, willy
In-Reply-To: <CAHk-=wgKy4dP0oQCNKyMQQf3-uVpaigmDyH6_T0Via76gWST9g@mail.gmail.com>
Linus Torvalds <torvalds@linux-foundation.org>:
> That absolutely would be my suggested next step.
>
> Something like the attached - get rid of filemap_splice_read()
> entirely, and just replace it with copy_splice_read().
Okay, I will post something like this soon.
But I'm a slow person, and I will also test things in Qemu, so this will
take some days.
--
Askar Safin
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andy Lutomirski @ 2026-06-03 3:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wgKy4dP0oQCNKyMQQf3-uVpaigmDyH6_T0Via76gWST9g@mail.gmail.com>
On Tue, Jun 2, 2026 at 5:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 2 Jun 2026 at 15:54, Askar Safin <safinaskar@gmail.com> wrote:
> >
> > Pedro is talking here not about this vmsplice patch, but about
> > my future hypothetical patch, which will remove splice-pagecache-to-pipe.
>
> That absolutely would be my suggested next step.
>
> Something like the attached - get rid of filemap_splice_read()
> entirely, and just replace it with copy_splice_read().
Am I understanding correctly that this will completely break zerocopy
sendfile? sendfile is, internally, splice-to-a-secret-per-task-pipe
and then splice to the socket. How much do people care? These days,
a lot of high-bandwidth network senders are sending encrypted data,
which is not zerocopy from pagecache. But there are surely some users
that care, for example the person who went to the effort to implement
IORING_OP_SPLICE:
commit 7d67af2c013402537385dae343a2d0f6a4cb3bfd
Author: Pavel Begunkov <asml.silence@gmail.com>
Date: Mon Feb 24 11:32:45 2020 +0300
io_uring: add splice(2) support
Now maybe someone cares about a different path? Splice from socket to
pipe to file? Splice from socket to pipe to other socket? Does
anyone do any of this? One can, of course, recv() directly to an
mmapped file, but then you pay for page faults, so that's probably a bad
idea in most cases. At least all of these cases don't have spliced
buffers that refer to a potentially read-only file.
But I'm a little concerned that zerocopy sends from files to network
are actually important.
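For readers less familiar with the two-stage path described above (source to pipe, then pipe to socket), a userspace approximation might look like the following. This is a sketch under assumptions, not the kernel's internal sendfile code: splice_copy() and splice_copy_demo() are hypothetical names, and the temp file plus AF_UNIX socketpair are only there to make the example self-contained. Whether the first stage shares page cache pages or copies them is exactly what the proposed change would alter; the calling code stays the same either way.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from in_fd to out_fd through a
 * private pipe. Returns the number of bytes moved, or -1 on error.
 */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
	ssize_t total = 0;
	int pfd[2];

	if (pipe(pfd) < 0)
		return -1;

	while (len > 0) {
		/* Stage 1: source -> pipe. Returns 0 at end of input. */
		ssize_t in = splice(in_fd, NULL, pfd[1], NULL, len, 0);

		if (in < 0)
			total = -1;
		if (in <= 0)
			break;
		len -= (size_t)in;

		/* Stage 2: drain the pipe into the destination. */
		while (in > 0) {
			ssize_t out = splice(pfd[0], NULL, out_fd, NULL,
					     (size_t)in, 0);

			if (out <= 0) {
				total = -1;
				goto done;
			}
			assert(out <= in); /* never more than requested */
			in -= out;
			total += out;
		}
	}
done:
	close(pfd[0]);
	close(pfd[1]);
	return total;
}

/* Usage sketch: temp file -> pipe -> AF_UNIX socket. Returns 0 on success. */
int splice_copy_demo(void)
{
	static const char msg[] = "file to pipe to socket";
	char got[sizeof(msg)];
	int sv[2], fd, ret = -1;
	FILE *f = tmpfile();

	if (!f)
		return -1;
	fd = fileno(f);
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		fclose(f);
		return -1;
	}
	if (write(fd, msg, sizeof(msg)) == (ssize_t)sizeof(msg) &&
	    lseek(fd, 0, SEEK_SET) == 0 &&
	    splice_copy(fd, sv[0], sizeof(msg)) == (ssize_t)sizeof(msg) &&
	    recv(sv[1], got, sizeof(got), MSG_WAITALL) == (ssize_t)sizeof(got) &&
	    memcmp(got, msg, sizeof(msg)) == 0)
		ret = 0;

	close(sv[0]);
	close(sv[1]);
	fclose(f);
	return ret;
}
```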
--Andy
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-03 4:20 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
patches, pfalcato, viro, willy
In-Reply-To: <CALCETrWx8-Q5-rK1KnAPCxCbXaWCd=Yfs_Pr8qVMa8k8L6of1w@mail.gmail.com>
On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Am I understanding correctly that this will completely break zerocopy
> sendfile?
Very much, yes.
And it's worth making it very very clear that ABSOLUTELY NONE of the
recent big security bugs were in splice.
They were all in the networking and crypto code that just didn't deal
with shared data correctly.
So in that sense, it's a bit sad to discuss castrating splice.
But it's probably still the right thing to at least try.
I've seen very impressive benchmark numbers over the years, but
they've often smelled more like benchmarketing than actual real work.
There's also a real possibility that a lot of the sendfile / splice
advantage has little to do with zero-copy, and more to do with the
cost of mapping and maintaining buffers in user space.
If you are sending file data using plain reads and writes, it's not
just the "copy from user space to socket data structures".
There's also the cost of populating user space in the first place:
page faults for mmap made *that* historical copy avoidance basically a
fairy tale.
And not using mmap means that you have the cost of double caching in
the kernel _and_ user space etc.
So sendfile() as a concept (whether you use combinations of splice()
system calls or the sendfile system call itself) isn't necessarily
only about the zero-copy, it's really also about avoiding the user
space memory management.
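The contrast between plain reads and writes and sendfile() can be made concrete with a small sketch. It assumes Linux with glibc; send_by_copy(), send_by_sendfile() and sendfile_demo() are hypothetical names chosen for illustration. The first helper stages every byte in a user-space buffer, the second never maps the data into user space at all, regardless of whether the kernel copies internally.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: classic loop, data passes through a user buffer. */
ssize_t send_by_copy(int in_fd, int out_fd, size_t len)
{
	char buf[4096];
	ssize_t total = 0;

	while (len > 0) {
		size_t want = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t got = read(in_fd, buf, want);
		ssize_t off = 0;

		if (got < 0)
			return -1;
		if (got == 0)
			break; /* end of input */
		while (off < got) {
			ssize_t put = write(out_fd, buf + off, (size_t)(got - off));

			if (put < 0)
				return -1;
			assert(put <= got - off);
			off += put;
		}
		total += got;
		len -= (size_t)got;
	}
	return total;
}

/* Hypothetical helper: same job, but no user-space buffer is involved. */
ssize_t send_by_sendfile(int in_fd, int out_fd, size_t len)
{
	ssize_t total = 0;

	while (len > 0) {
		/* NULL offset: use and advance in_fd's file position. */
		ssize_t n = sendfile(out_fd, in_fd, NULL, len);

		if (n < 0)
			return -1;
		if (n == 0)
			break; /* end of input */
		total += n;
		len -= (size_t)n;
	}
	return total;
}

/* Usage sketch: both helpers must deliver identical bytes. 0 on success. */
int sendfile_demo(void)
{
	static const char msg[] = "same bytes either way";
	char got[sizeof(msg)];
	int sv[2], fd, pass, ret = 0;
	FILE *f = tmpfile();

	if (!f)
		return -1;
	fd = fileno(f);
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		fclose(f);
		return -1;
	}
	if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		ret = -1;

	for (pass = 0; pass < 2 && ret == 0; pass++) {
		ssize_t sent;

		if (lseek(fd, 0, SEEK_SET) != 0) {
			ret = -1;
			break;
		}
		sent = pass ? send_by_sendfile(fd, sv[0], sizeof(msg))
			    : send_by_copy(fd, sv[0], sizeof(msg));
		if (sent != (ssize_t)sizeof(msg) ||
		    recv(sv[1], got, sizeof(got), MSG_WAITALL) != (ssize_t)sizeof(got) ||
		    memcmp(got, msg, sizeof(msg)) != 0)
			ret = -1;
	}

	close(sv[0]);
	close(sv[1]);
	fclose(f);
	return ret;
}
```

A benchmark built on two helpers like these would be one way to separate the cost of the user-space buffer from the cost of the in-kernel copy, which is the distinction being drawn above.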
But yes, there's a very real question of performance.
I just suspect we'll never get real answers without going the "let's
just see what happens" route...
Linus
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Christian Brauner @ 2026-06-03 6:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Askar Safin, akpm, axboe, david, dhowells, hch,
jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wizkDXRut5xLXRF-CVUVYMaZ5AOexxeghOAoXPb4yAvQg@mail.gmail.com>
On Tue, Jun 02, 2026 at 09:20:13PM -0700, Linus Torvalds wrote:
> On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Am I understanding correctly that this will completely break zerocopy
> > sendfile?
>
> Very much, yes.
>
> And it's worth making it very very clear that ABSOLUTELY NONE of the
> recent big security bugs were in splice.
>
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
>
> So in that sense, it's a bit sad to discuss castrating splice.
Well, we're completely ignoring the fact that splice()'s locking and
interactions with pipe_lock() are complete insanity. So unless someone
sits down and really thinks about how to rework the locking I think
degrading splice() is just fine.
> But it's probably still the right thing to at least try.
Yes.
> I just suspect we'll never get real answers without going the "let's
> just see what happens" route...
Yes.
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-06-03 7:50 UTC (permalink / raw)
To: Eric Biggers
Cc: Andrew Morton, Steven Rostedt, Al Viro, Linus Torvalds,
Christian Brauner, Askar Safin, linux-kernel, linux-mm, linux-api,
netdev, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
David Howells, Pedro Falcato, Miklos Szeredi, patches,
linux-fsdevel, Jan Kara
In-Reply-To: <20260602184440.GB2503276@google.com>
On 6/2/26 20:44, Eric Biggers wrote:
> On Tue, Jun 02, 2026 at 10:25:06AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/2/26 02:28, Andrew Morton wrote:
>>>
>>>
>>> Well yes, The patchset seems sensible from a quality POV. But to make
>>> a decision we should first have a decent understanding of its downside
>>> impact.
>>
>> I guess most (all?) of us ... dislike ... vmsplice(), so trying to remove it
>> entirely is certainly very appealing ...
>>
>>>
>>> I haven't seen a description of that impact in the discussion thus far.
>>> And that description is owed, please.
>>>
>>> I assume a small number of specialized applications are using
>>> vmsplice() to great effect? What are those applications? What is the
>>> impact of this change?
>>
>>
>> I did some digging, and the kernel crypto API documents using splice/vmsplice
>> for zero-copy[1] and libkcapi [2].
>>
>> I did not find performance numbers, how much vmsplice/splice actually gives us.
>> Playing with the kcapi-speed tool [3] (specifying --vmsplice vs. --sendmsg)
>> doesn't really reveal a big difference at least on my notebook. Not sure if the
>> parameters I specify are reasonable.
>>
>> I don't know whether downgrading vmsplice to preadv2/pwritev2 would perform
>> significantly worse than sendmsg ... and I don't know what the default would
>> usually be (default to vmsplice or sendmsg). I might try finding some time to
>> play with it more, but I doubt it, so if anybody else has time ... :)
>
> AF_ALG is a mistake and isn't commonly used. Using a userspace crypto
> library is faster and is what almost everyone does anyway, as it avoids
> the syscall overhead. There are many other issues with AF_ALG as well.
>
> 7.2 will mark AF_ALG as deprecated, mostly remove AF_ALG's zero-copy
> support, and remove AF_ALG's async I/O support:
>
> https://lore.kernel.org/linux-crypto/20260430011544.31823-1-ebiggers@kernel.org/
> https://lore.kernel.org/linux-crypto/20260504225328.25356-1-ebiggers@kernel.org/
> https://lore.kernel.org/linux-crypto/20260523-af-alg-harden-v1-0-c76755c3a5c5@gmail.com/
>
> In practice, the programs that are keeping Linux distros from disabling
> AF_ALG in their kconfig outright are just iwd, cryptsetup, and bluez.
> They use AF_ALG just because it was mistakenly thought to be easier than
> using a userspace crypto library. They don't need maximum performance,
> nor do they use vmsplice, splice, or sendfile.
>
> There is other highly niche code out there that does implement the
> AF_ALG + vmsplice + splice thing, e.g. libkcapi. But it's just not
> enough of a reason to keep zero-copy support, especially considering
> that AF_ALG has always been the wrong solution in the first place. The
> fallback to copying the data is fine for this deprecated API.
Cool, thanks for sharing that Eric!
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Miklos Szeredi @ 2026-06-03 9:57 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Christian Brauner, Askar Safin, linux-kernel,
linux-mm, linux-api, netdev, Matthew Wilcox, Jens Axboe,
Christoph Hellwig, David Howells, Andrew Morton,
David Hildenbrand, Pedro Falcato, patches, linux-fsdevel,
Jan Kara, Steven Rostedt, Joanne Koong, fuse-devel
In-Reply-To: <20260601173325.GH2636677@ZenIV>
On Mon, 1 Jun 2026 at 19:33, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
>
> > TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> > a big simplification.
>
> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> Communications between the kernel and fuse server at least used to
> seriously want that, so that would be one place to look for unhappy
> userland...
>
> splice-related logics in fs/fuse/dev.c is interesting; another place
> like this is kernel/trace/, but I'm less familiar with that one.
[Cc: Joanne, fuse-devel]
I'd favor simplification, but care is needed to not regress performance.
Joanne might be in a better position to say something about relative
performance of various transport modes in fuse.
Thanks,
Miklos
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Pedro Falcato @ 2026-06-03 11:43 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linus Torvalds, Askar Safin, akpm, axboe, brauner, david,
dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
linux-mm, miklos, netdev, patches, viro, willy
In-Reply-To: <CALCETrWx8-Q5-rK1KnAPCxCbXaWCd=Yfs_Pr8qVMa8k8L6of1w@mail.gmail.com>
On Tue, Jun 02, 2026 at 08:51:03PM -0700, Andy Lutomirski wrote:
> On Tue, Jun 2, 2026 at 5:12 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Tue, 2 Jun 2026 at 15:54, Askar Safin <safinaskar@gmail.com> wrote:
> > >
> > > Pedro is talking here not about this vmsplice patch, but about
> > > my future hypothetical patch, which will remove splice-pagecache-to-pipe.
> >
> > That absolutely would be my suggested next step.
> >
> > Something like the attached - get rid of filemap_splice_read()
> > entirely, and just replace it with copy_splice_read().
>
> Am I understanding correctly that this will completely break zerocopy
> sendfile? sendfile is, internally, splice-to-a-secret-per-task-pipe
> and then splice to the socket. How much do people care? These days,
> a lot of high-bandwidth network senders are sending encrypted data,
> which is not zerocopy from pagecache. But there are surely some users
You can do zerocopy from the page cache, even with TLS on top, by having
your (fancy) NIC do TLS offloading for you. See https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf.
Linux works similarly. Slide 26 is particularly interesting.
(No KTLS I assume is using simple sendmsg()'s from user memory, SW TLS
and NIC KTLS are both sendfile(), per the slides)
TL;DR I really do think it matters.
>
> Now maybe someone cares about a different path? Splice from socket to
> pipe to file? Splice from socket to pipe to other socket? Does
> anyone do any of this? One can, of course, recv() directly to an
> mmapped file, but then you pay for page faults, so that's probably a bad
> idea in most cases. At least all of these cases don't have spliced
> buffers that refer to a potentially read-only file.
>
>
> But I'm a little concerned that zerocopy sends from files to network
> are actually important.
>
> --Andy
--
Pedro
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Christian Brauner @ 2026-06-03 13:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Askar Safin, akpm, axboe, david, dhowells, hch,
jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260603-navigieren-pleite-stilvoll-60e6da66b1d4@brauner>
On Wed, Jun 03, 2026 at 08:45:18AM +0200, Christian Brauner wrote:
> On Tue, Jun 02, 2026 at 09:20:13PM -0700, Linus Torvalds wrote:
> > On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
> > >
> > > Am I understanding correctly that this will completely break zerocopy
> > > sendfile?
> >
> > Very much, yes.
> >
> > And it's worth making it very very clear that ABSOLUTELY NONE of the
> > recent big security bugs were in splice.
> >
> > They were all in the networking and crypto code that just didn't deal
> > with shared data correctly.
> >
> > So in that sense, it's a bit sad to discuss castrating splice.
>
> Well, we're completely ignoring the fact that splice()'s locking and
> interactions with pipe_lock() are complete insanity. So unless someone
> sits down and really thinks about how to rework the locking I think
> degrading splice() is just fine.
>
> > But it's probably still the right thing to at least try.
>
> Yes.
>
> > I just suspect we'll never get real answers without going the "let's
> > just see what happens" route...
>
> Yes.
Reading this thread again I'm really amazed how willingly people argue
to remain locked into a really broken API even if they're given a risky
but worthwhile chance to kill it for good. Anyway, odd-userspace behavior
time:
David reported vmsplice01 failing in the LTP testsuite after the change:
11297 20:41:02.548383 <LAVA_SIGNAL_STARTTC vmsplice01>
11298 20:41:02.548518 tst_tmpdir.c:316: TINFO: Using /tmp/LTP_vmsZ13ZQj as tmpdir (tmpfs filesystem)
11299 20:41:02.548656 tst_test.c:2047: TINFO: LTP version: 20260130
11300 20:41:02.548793 tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260602 #1 SMP PREEMPT Tue Jun 2 18:13:29 UTC 2026 aarch64
11301 20:41:02.548932 tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
11302 20:41:02.549069 tst_test.c:1875: TINFO: Overall timeout per run is 0h 01m 30s
11303 20:41:02.549205 tst_test.c:1632: TINFO: tmpfs is supported by the test
11304 20:41:02.549340 Test timeouted, sending SIGKILL!
11305 20:41:02.549477 tst_test.c:1947: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
11306 20:41:02.549614 tst_test.c:1949: TBROK: Test killed! (timeout?)
11307 20:41:02.549751
11308 20:41:02.549887 Summary:
11309 20:41:02.550021 passed 0
11310 20:41:02.550155 failed 0
11311 20:41:02.550290 broken 1
11312 20:41:02.550450 skipped 0
11313 20:41:02.550582 warnings 0
11314 20:41:02.550710
11315 20:41:02.550838 <LAVA_SIGNAL_ENDTC vmsplice01>
So I looked at the test:
	while (v.iov_len) {
		/*
		 * in a real app you'd be more clever with poll of course,
		 * here we are basically just blocking on output room and
		 * not using the free time for anything interesting.
		 */
		if (poll(&pfd, 1, -1) < 0)
			tst_brk(TBROK | TERRNO, "poll() failed");
		written = vmsplice(pipes[1], &v, 1, 0);
		if (written < 0) {
			tst_brk(TBROK | TERRNO, "vmsplice() failed");
		} else {
			if (written == 0) {
				break;
			} else {
				v.iov_base += written;
				v.iov_len -= written;
			}
		}
		SAFE_SPLICE(pipes[0], NULL, fd_out, &offset, written, 0);
		//printf("offset = %lld\n", (long long)offset);
	}
Prior to the change add_to_pipe() returns -EAGAIN the moment the pipe is
full. So iter_to_pipe stops and returns a partial count capped at pipe
capacity. For a 128K buffer over a 64K pipe the first call returns 64K,
the test drains it, call 2 returns the remaining 64K. Done.
After this change do_writev(... flags & SPLICE_F_NONBLOCK ? RWF_NOWAIT :
0) then calls pipe_write which does not stop when the pipe fills. It
blocks until the entire iovec is consumed.
I kinda think we need to preserve similar semantics.
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-03 15:26 UTC (permalink / raw)
To: Christian Brauner
Cc: Andy Lutomirski, Askar Safin, akpm, axboe, david, dhowells, hch,
jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
netdev, patches, pfalcato, viro, willy
In-Reply-To: <20260603-raumfahrt-unmerklich-ertrugen-c4ecae70d5f9@brauner>
On Wed, 3 Jun 2026 at 06:40, Christian Brauner <brauner@kernel.org> wrote:
>
> Prior to the change add_to_pipe() returns -EAGAIN the moment the pipe is
> full. So iter_to_pipe stops and returns a partial count capped at pipe
> capacity. For a 128K buffer over a 64K pipe the first call returns 64K,
> the test drains it, call 2 returns the remaining 64K. Done.
>
> After this change do_writev(... flags & SPLICE_F_NONBLOCK ? RWF_NOWAIT :
> 0) then calls pipe_write which does not stop when the pipe fills. It
> blocks until the entire iovec is consumed.
>
> I kinda think we need to preserve similar semantics.
Ack. We definitely do need to keep the old semantics.
Looking at the patch again, I think it's that
(flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0
thing that is broken. I think splice_to_pipe is *always* nowait - but
has the special conditional _initial_ wait.
So I think the RWF_NOWAIT should be unconditional to the do_writev(),
and instead the code should do something like
ret = wait_for_space(pipe, flags);
if (!ret) do_writev(...RWF_NOWAIT);
but admittedly I did not think very much about the details, so I might
miss something.
Which also then probably means that we should just keep the legacy
wrapper in fs/splice.c and we'd just need to make do_writev() and
do_readv() non-static.
Because I'd rather keep wait_for_space() internal to splice (or
alternatively we'd move it to pipe.c, rename it to
"pipe_wait_for_space()", and change the 'flags' argument to be a
boolean to not make it use that splice-specific flags etc).
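
The wait-then-nowait pattern suggested above has a rough userspace analogue, sketched here under stated assumptions. This is not the kernel patch: poll() stands in for wait_for_space(), and a write() on an O_NONBLOCK pipe stands in for do_writev() with RWF_NOWAIT. The helper names set_nonblocking() and write_after_wait() are invented for the illustration. The point is only that "one conditional initial wait, then never block again" naturally gives a short count when the pipe fills, which is the behavior vmsplice01 relies on.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode. Returns 0 or -1. */
int set_nonblocking(int fd)
{
	int fl = fcntl(fd, F_GETFL);

	if (fl < 0)
		return -1;
	return fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Hypothetical helper mirroring the proposed structure:
 *   1. unless the caller asked for non-blocking behavior, wait once until
 *      the pipe has room (the "initial wait");
 *   2. then issue a single write that never blocks.
 * wfd must already be O_NONBLOCK. Returns the number of bytes written,
 * which may be less than len, or -1 with errno set. With several writers
 * the pipe can fill up between the two steps, in which case the write
 * fails with EAGAIN. */
ssize_t write_after_wait(int wfd, const void *buf, size_t len, int nonblock)
{
	if (!nonblock) {
		struct pollfd pfd = { .fd = wfd, .events = POLLOUT };
		int rc;

		do {
			rc = poll(&pfd, 1, -1);
		} while (rc < 0 && errno == EINTR);
		if (rc < 0)
			return -1;
	}
	return write(wfd, buf, len);
}
```

With a 128K buffer and a pipe of the usual 64K default capacity, the first call is expected to return a short count instead of blocking until the whole buffer has been consumed; the caller drains the pipe and calls again for the rest.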
Linus
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andy Lutomirski @ 2026-06-03 18:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wizkDXRut5xLXRF-CVUVYMaZ5AOexxeghOAoXPb4yAvQg@mail.gmail.com>
> On Jun 2, 2026, at 9:20 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Am I understanding correctly that this will completely break zerocopy
>> sendfile?
>
> Very much, yes.
>
> And it's worth making it very very clear that ABSOLUTELY NONE of the
> recent big security bugs were in splice.
>
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
>
> So in that sense, it's a bit sad to discuss castrating splice.
>
> But it's probably still the right thing to at least try.
>
> I've seen very impressive benchmark numbers over the years, but
> they've often smelled more like benchmarketing than actual real work.
>
> There's also a real possibility that a lot of the sendfile / splice
> advantage has little to do with zero-copy, and more to do with the
> cost of mapping and maintaining buffers in user space.
>
> If you are sending file data using plain reads and writes, it's not
> just the "copy from user space to socket data structures".
>
> There's also the cost of populating user space in the first place:
> page faults for mmap made *that* historical copy avoidance basically a
> fairy tale.
>
> And not using mmap means that you have the cost of double caching in
> the kernel _and_ user space etc.
>
> So sendfile() as a concept (whether you use combinations of splice()
> system calls or the sendfile system call itself) isn't necessarily
> only about the zero-copy, it's really also about avoiding the user
> space memory management.
So maybe we should make sure that, if we go down the route of
disabling all the splice magic, that we leave an API, maybe the
existing sendfile or maybe something else, that does an optimized copy
from one fd to another and that is at least capable of sending from a
file to the network with at most one CPU-side copy.
Even if we’re just doing that, I continue to find it strange that we
require that a pipe be involved. What’s so special about pipes that we
allow splicing from file to pipe and then pipe to socket (this
requiring that the pipe retain a reference to the file’s page cache
structures to avoid *two* copies), but we can’t splice straight from
file to socket? Heck, even sendfile is implemented under the hood as a
pair of splices!
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Jakub Kicinski @ 2026-06-03 18:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Askar Safin, akpm, axboe, brauner, david,
dhowells, hch, jack, linux-api, linux-fsdevel, linux-kernel,
linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wizkDXRut5xLXRF-CVUVYMaZ5AOexxeghOAoXPb4yAvQg@mail.gmail.com>
On Tue, 2 Jun 2026 21:20:13 -0700 Linus Torvalds wrote:
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
>
> So in that sense, it's a bit sad to discuss castrating splice.
+1 IMVHO the networking bugs were people just not knowing what they
were doing. Presumably AI has scrounged all the occurrences of that
bug by now. I'd also hate to render splice optimizations moot based
on those bugs.
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Jakub Kicinski @ 2026-06-03 18:14 UTC (permalink / raw)
To: Pedro Falcato
Cc: Andy Lutomirski, Linus Torvalds, Askar Safin, akpm, axboe,
brauner, david, dhowells, hch, jack, linux-api, linux-fsdevel,
linux-kernel, linux-mm, miklos, netdev, patches, viro, willy
In-Reply-To: <aiAREqlHK1llOw_y@pedro-suse.lan>
On Wed, 3 Jun 2026 12:43:54 +0100 Pedro Falcato wrote:
> > Am I understanding correctly that this will completely break zerocopy
> > sendfile? sendfile is, internally, splice-to-a-secret-per-task-pipe
> > and then splice to the socket. How much do people care? These days,
> > a lot of high-bandwidth network senders are sending encrypted data,
> > which is not zerocopy from pagecache. But there are surely some users
>
> You can do zerocopy from the page cache, even with TLS on top, by having
> your (fancy) NIC do TLS offloading for you. See https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf.
> Linux works similarly. Slide 26 is particularly interesting.
> (No KTLS I assume is using simple sendmsg()'s from user memory, SW TLS
> and NIC KTLS are both sendfile(), per the slides)
FTR this datapoint should come with the caveat that kTLS _offload_ does
not support TLS 1.3 today. So how much that configuration is used in
practice is unclear.
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-03 18:28 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
patches, pfalcato, viro, willy
In-Reply-To: <CALCETrXzxubt4eWue3+wv7Fq9C2m7uu6bWPstqFh6Mo57bPwQQ@mail.gmail.com>
On Wed, 3 Jun 2026 at 11:10, Andy Lutomirski <luto@amacapital.net> wrote:
>
> So maybe we should make sure that, if we go down the route of
> disabling all the splice magic, that we leave an API, maybe the
> existing sendfile or maybe something else, that does an optimized copy
> from one fd to another and that is at least capable of sending from a
> file to the network with at most one CPU-side copy.
Why?
That is *LITERALLY* the attack surface - and the complexity - that we
should be removing.
sendfile() was a mistake. It is literally the "file->socket" thing
that has been buggy.
I absolutely refuse to get rid of splice code but keep the buggy sh*t
cases that caused all the problems in the first place.
Because *THAT* would just be completely insane and pointless.
> Even if we’re just doing that, I continue to find it strange that we
> require that a pipe be involved. What’s so special about pipes
Again: it was never splice or the pipe that was the problem. Stop
barking up the wrong tree.
It was "file data to socket" that was the truly horrendous issue.
That said, to explain the pipe: The reason for the pipe is to act as
the kernel-side buffer.
Now, these days we have much more capable iov_iter interfaces than we
used to, and in that sense the "pipe as a buffer" is certainly not the
obvious choice now.
But even then you need to have a *handle* to the buffers for the
general case, and that's what the pipe fd ends up then still
effectively being.
It was also done to avoid the M:N translation problem, because people
wanted to do zero-copy between other things than just "file ->
socket".
But again: we're ABSOLUTELY NOT keeping that "file -> socket" thing
and getting rid of splice. That's literally keeping the bath-water
and throwing out the baby.
Splice is the *good* part (well, relatively - splice is bad too).
file->socket needs to DIE IN A FIRE considering the security problems it has had.
I hope Jakub is right that the problems have been all fixed, and this
is all theoretical, but having seen just *how* many there were, I'm a
bit sceptical.
Because if people think splice is complicated, you haven't looked at
the skb rules. They are completely arbitrary and complex and spread
all over the tree.
Linus
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Howells @ 2026-06-03 19:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, Andy Lutomirski, Askar Safin, akpm, axboe, brauner,
david, hch, jack, linux-api, linux-fsdevel, linux-kernel,
linux-mm, miklos, netdev, patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wiEwSjfbjfO74xu=UmkkdHXkJg5QNQ8pP-3iYmunmeV9g@mail.gmail.com>
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Because if people think splice is complicated, you haven't looked at
> the skb rules. They are completely arbitrary and complex and spread
> all over the tree.
Yeah - I fell foul of the net loopback driver just reflecting the outgoing
packet back, complete with all the original spliced bufferage. I was
wondering if the loopback driver needs to look at the skbuff, see if it has
zerocopy elements of some sort and, if so, copy it (or drop it if ENOMEM).
David
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Howells @ 2026-06-03 19:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, Matthew Wilcox, Andy Lutomirski, Askar Safin,
linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
linux-kernel, linux-mm, linux-api, netdev, Jens Axboe,
Christoph Hellwig, Andrew Morton, David Hildenbrand,
Pedro Falcato, Miklos Szeredi, patches
In-Reply-To: <CAHk-=wiFuud0Nn3B9YpTWyQja08TeXVk2AB-aAkmVXyigOagbQ@mail.gmail.com>
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Well, since it pretty much is what I suggested a few years ago, I
> certainly won't NAK it.
I've been wanting to get rid of vmsplice for a while, so I'm in favour of this
too.
David
^ permalink raw reply
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-03 19:59 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Askar Safin, akpm, axboe, brauner, david, dhowells, hch, jack,
linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
patches, pfalcato, viro, willy
In-Reply-To: <CAHk-=wiEwSjfbjfO74xu=UmkkdHXkJg5QNQ8pP-3iYmunmeV9g@mail.gmail.com>
On Wed, 3 Jun 2026 at 11:28, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But even then you need to have a *handle* to the buffers for the
> general case, and that's what the pipe fd ends up then still
> effectively being.
Again: for sendfile, you don't need the handle, because you can just
"read the file data again".
But the handle is needed for any buffering that can't do that -
iow pretty much *any* other case than a file-backed source.
So the original use-cases included things like copying media data from
a TV capture card to a GPU for outputting in a window.
There it's actually the intermediate buffer that is the important
thing, and it needs to have a lifetime that is independent of the
system call itself, because the system call may be interrupted by
signals etc, and you can't just "read the data again" when you
restart.
So the whole idea with splice() is that you have an input, an output,
and a stateful buffer between the two that has a lifetime.
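
As a hedged illustration of that three-part shape (input, pipe as the stateful buffer, output), here is a minimal userspace sketch. The function name splice_via_pipe() is invented for the example, and it assumes in_fd and out_fd are descriptors that splice() accepts (for instance two regular files, with out_fd not opened with O_APPEND). It says nothing about whether pages are shared or copied underneath, which is the part under discussion in this thread.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: move up to len bytes from in_fd to out_fd through a
 * private pipe. The pipe is the buffer that holds data between the two
 * stages; its contents survive an interrupted system call, so nothing has
 * to be read from the source twice. Returns bytes moved, or -1 on error. */
ssize_t splice_via_pipe(int in_fd, int out_fd, size_t len)
{
	int p[2];
	size_t done = 0;

	if (pipe(p) < 0)
		return -1;

	while (done < len) {
		/* Stage 1: source -> pipe. Fills at most the pipe's capacity. */
		ssize_t in = splice(in_fd, NULL, p[1], NULL, len - done, 0);

		if (in < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		if (in == 0)
			break;	/* source exhausted */

		/* Stage 2: pipe -> sink, until this batch is fully drained. */
		ssize_t left = in;

		while (left > 0) {
			ssize_t out = splice(p[0], NULL, out_fd, NULL,
					     (size_t)left, 0);

			if (out < 0 && errno == EINTR)
				continue;
			if (out <= 0)
				goto fail;
			left -= out;
			done += (size_t)out;
		}
	}
	close(p[0]);
	close(p[1]);
	return (ssize_t)done;

fail:
	close(p[0]);
	close(p[1]);
	return -1;
}
```

A caller that wants the buffer to outlive a single transfer would create the pipe once and keep it, rather than opening and closing it inside the helper as this sketch does.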
Having just a iov_iter isn't enough - even with the current much more
capable iov_iter we have now (compared to when splice came to be: two
decades ago when the modern iov_iter didn't even exist). You have to
have that notion of a buffer with a lifetime.
(iov_iter came a couple of years later, but it then took many many
years for it to become the powerful thing it is today where you can
put almost arbitrary data into it - it started as purely a user space
iovec iterator, all the bvec/kvec etc stuff that you need for IO
buffering came a decade later)
So there's historical reasons for the use of pipes, but there really
is a very fundamental reason for it too: wanting *generic* data
transfer between two points, not sendfile.
It's worth noticing that in the generic case, zero-copy isn't really
even an issue.
When you think of operations like "splice TV capture input to a pipe",
you typically need to allocate the pages that you then DMA into
*anyway*, and you'd just put those pages into the pipe. And the fact
that you can then just take the data directly from those pages when
you splice from the pipe to whatever GPU engine that does the decoding
is kind of secondary.
So again: the big deal with splice() and the pipe isn't really about
zero-copy. It's the in-kernel buffers where the drivers control the
allocation and you don't have some "user space allocates memory, then
kernel looks that allocation up and uses it" model.
Having fewer copies is kind of incidental. It *might* happen just
because it's natural when some streaming device just gives its data
away and doesn't care after the fact.
The problem with splicing from a file has been exactly the fact that
it's *not* streaming data, and the filesystem zero-copy case gave
direct access to the long-term cache.
Which is undoubtedly good for performance. But it fundamentally
*requires* that the sink is trustworthy. Which has been problematic.
That's why sendfile() is bad. Not because splice itself is a bad
concept, but because you have to have that absolute trust across
components.
Linus
^ permalink raw reply
* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Stefan Metzmacher @ 2026-06-03 20:56 UTC (permalink / raw)
To: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
Jan Kara
Cc: linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
patches
In-Reply-To: <20260531010107.1953702-3-safinaskar@gmail.com>
Hi Askar,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index f5639d5ac331..a86a88207956 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -514,8 +514,8 @@ asmlinkage long sys_ppoll_time32(struct pollfd __user *, unsigned int,
> struct old_timespec32 __user *, const sigset_t __user *,
> size_t);
> asmlinkage long sys_signalfd4(int ufd, sigset_t __user *user_mask, size_t sizemask, int flags);
> -asmlinkage long sys_vmsplice(int fd, const struct iovec __user *iov,
> - unsigned long nr_segs, unsigned int flags);
> +asmlinkage long sys_vmsplice(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, unsigned int flags);
Why is 'int fd' changed to 'unsigned long fd'?
Should that be its own commit if the change is desired?
metze
^ permalink raw reply