From: Jiayuan Chen <jiayuan.chen@linux.dev>
To: bpf@vger.kernel.org, john.fastabend@gmail.com, jakub@cloudflare.com
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Simon Horman <horms@kernel.org>,
Kuniyuki Iwashima <kuniyu@google.com>,
Willem de Bruijn <willemb@google.com>,
David Ahern <dsahern@kernel.org>,
Neal Cardwell <ncardwell@google.com>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Shuah Khan <shuah@kernel.org>,
Jiapeng Chong <jiapeng.chong@linux.alibaba.com>,
Ihor Solodrai <ihor.solodrai@linux.dev>,
Michal Luczaj <mhal@rbox.co>,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org
Subject: [PATCH bpf-next v1 5/7] tcp_bpf: optimize splice_read with zero-copy for non-slab pages
Date: Wed, 4 Mar 2026 14:33:56 +0800
Message-ID: <20260304063643.14581-6-jiayuan.chen@linux.dev>
In-Reply-To: <20260304063643.14581-1-jiayuan.chen@linux.dev>

The previous splice_read implementation copies all data through
intermediate pages (alloc_page + memcpy). This is wasteful for
skb fragment pages, which are allocated from the page allocator
and can be safely referenced via get_page().

Optimize by checking PageSlab() to distinguish between linear
skb data (slab-backed) and fragment pages (page allocator-backed):

- For slab pages (skb linear data): copy to a page fragment via
  sk_page_frag(), matching what linear_to_page() does in the
  standard TCP splice path (skb_splice_bits). get_page() is
  invalid on slab pages, so a copy is unavoidable here.
- For non-slab pages (skb frags): use get_page() directly for
  true zero-copy, the same as skb_splice_bits() does for fragments.

Both paths use nosteal_pipe_buf_ops. The sk_page_frag approach
is more memory-efficient than alloc_page for small linear copies,
as multiple copies can share a single page fragment.
Benchmark results with rx-verdict-ingress mode (loopback, 8 CPUs):

  splice(2) + always-copy: ~2770 MB/s (before this patch)
  splice(2) + zero-copy:   ~4270 MB/s (after this patch, +54%)
  read(2):                 ~4292 MB/s (baseline for reference)

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
net/ipv4/tcp_bpf.c | 41 +++++++++++++++++++++++++++++++----------
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index e85a27e32ea7..13506ba7672f 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -447,6 +447,7 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
struct tcp_bpf_splice_ctx {
struct pipe_inode_info *pipe;
+ struct sock *sk;
};
static int sk_msg_splice_actor(void *arg, struct page *page,
@@ -458,13 +459,33 @@ static int sk_msg_splice_actor(void *arg, struct page *page,
};
ssize_t ret;
- buf.page = alloc_page(GFP_KERNEL);
- if (!buf.page)
- return 0;
+ if (PageSlab(page)) {
+ /*
+ * skb linear data is backed by slab memory where
+ * get_page() is invalid. Copy to a page fragment from
+ * the socket's page allocator, matching what
+ * linear_to_page() does in the standard TCP splice
+ * path (skb_splice_bits).
+ */
+ struct page_frag *pfrag = sk_page_frag(ctx->sk);
+
+ if (!sk_page_frag_refill(ctx->sk, pfrag))
+ return 0;
- memcpy(page_address(buf.page), page_address(page) + offset, len);
- buf.offset = 0;
- buf.len = len;
+ len = min_t(size_t, len, pfrag->size - pfrag->offset);
+ memcpy(page_address(pfrag->page) + pfrag->offset,
+ page_address(page) + offset, len);
+ buf.page = pfrag->page;
+ buf.offset = pfrag->offset;
+ buf.len = len;
+ pfrag->offset += len;
+ } else {
+ buf.page = page;
+ buf.offset = offset;
+ buf.len = len;
+ }
+
+ get_page(buf.page);
/*
* add_to_pipe() calls pipe_buf_release() on failure, which
@@ -481,9 +502,9 @@ static ssize_t tcp_bpf_splice_read(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
{
- struct tcp_bpf_splice_ctx ctx = { .pipe = pipe };
- int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
struct sock *sk = sock->sk;
+ struct tcp_bpf_splice_ctx ctx = { .pipe = pipe, .sk = sk };
+ int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
struct sk_psock *psock;
int ret;
@@ -508,9 +529,9 @@ static ssize_t tcp_bpf_splice_read_parser(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe,
size_t len, unsigned int flags)
{
- struct tcp_bpf_splice_ctx ctx = { .pipe = pipe };
- int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
struct sock *sk = sock->sk;
+ struct tcp_bpf_splice_ctx ctx = { .pipe = pipe, .sk = sk };
+ int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
struct sk_psock *psock;
int ret;
--
2.43.0
Thread overview: 11+ messages
2026-03-04 6:33 [PATCH bpf-next v1 0/7] bpf/sockmap: add splice support for tcp_bpf Jiayuan Chen
2026-03-04 6:33 ` [PATCH bpf-next v1 1/7] net: add splice_read to struct proto and set it in tcp_prot/tcpv6_prot Jiayuan Chen
2026-03-04 6:33 ` [PATCH bpf-next v1 2/7] inet: add inet_splice_read() and use it in inet_stream_ops/inet6_stream_ops Jiayuan Chen
2026-03-04 6:33 ` [PATCH bpf-next v1 3/7] tcp_bpf: refactor recvmsg with read actor abstraction Jiayuan Chen
2026-03-04 7:14 ` bot+bpf-ci
2026-03-04 6:33 ` [PATCH bpf-next v1 4/7] tcp_bpf: add splice_read support for sockmap Jiayuan Chen
2026-03-04 7:27 ` bot+bpf-ci
2026-03-04 6:33 ` Jiayuan Chen [this message]
2026-03-04 6:33 ` [PATCH bpf-next v1 6/7] selftests/bpf: add splice_read tests " Jiayuan Chen
2026-03-06 17:25 ` Mykyta Yatsenko
2026-03-04 6:33 ` [PATCH bpf-next v1 7/7] selftests/bpf: add splice option to sockmap benchmark Jiayuan Chen