public inbox for linux-kernel@vger.kernel.org
From: Jiayuan Chen <jiayuan.chen@linux.dev>
To: bpf@vger.kernel.org, john.fastabend@gmail.com, jakub@cloudflare.com
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	Kuniyuki Iwashima <kuniyu@google.com>,
	Willem de Bruijn <willemb@google.com>,
	David Ahern <dsahern@kernel.org>,
	Neal Cardwell <ncardwell@google.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Shuah Khan <shuah@kernel.org>,
	Jiapeng Chong <jiapeng.chong@linux.alibaba.com>,
	Ihor Solodrai <ihor.solodrai@linux.dev>,
	Michal Luczaj <mhal@rbox.co>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org
Subject: [PATCH bpf-next v1 5/7] tcp_bpf: optimize splice_read with zero-copy for non-slab pages
Date: Wed,  4 Mar 2026 14:33:56 +0800	[thread overview]
Message-ID: <20260304063643.14581-6-jiayuan.chen@linux.dev> (raw)
In-Reply-To: <20260304063643.14581-1-jiayuan.chen@linux.dev>

The previous splice_read implementation copies all data through
intermediate pages (alloc_page + memcpy). This is wasteful for
skb fragment pages, which are allocated from the page allocator
and can be safely referenced via get_page().

Optimize by checking PageSlab() to distinguish between linear
skb data (slab-backed) and fragment pages (page allocator-backed):

- For slab pages (skb linear data): copy to a page fragment via
  sk_page_frag, matching what linear_to_page() does in the
  standard TCP splice path (skb_splice_bits). get_page() is
  invalid on slab pages, so a copy is unavoidable here.
- For non-slab pages (skb frags): use get_page() directly for
  true zero-copy, same as skb_splice_bits does for fragments.

Both paths use nosteal_pipe_buf_ops. The sk_page_frag approach
is more memory-efficient than alloc_page for small linear copies,
as multiple copies can share a single page fragment.

Benchmark results with rx-verdict-ingress mode (loopback, 8 CPUs):

  splice(2) + always-copy:  ~2770 MB/s (before this patch)
  splice(2) + zero-copy:    ~4270 MB/s (after this patch, +54%)
  read(2):                  ~4292 MB/s (baseline for reference)

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 net/ipv4/tcp_bpf.c | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index e85a27e32ea7..13506ba7672f 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -447,6 +447,7 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 struct tcp_bpf_splice_ctx {
 	struct pipe_inode_info *pipe;
+	struct sock *sk;
 };
 
 static int sk_msg_splice_actor(void *arg, struct page *page,
@@ -458,13 +459,33 @@ static int sk_msg_splice_actor(void *arg, struct page *page,
 	};
 	ssize_t ret;
 
-	buf.page = alloc_page(GFP_KERNEL);
-	if (!buf.page)
-		return 0;
+	if (PageSlab(page)) {
+		/*
+		 * skb linear data is backed by slab memory where
+		 * get_page() is invalid. Copy to a page fragment from
+		 * the socket's page allocator, matching what
+		 * linear_to_page() does in the standard TCP splice
+		 * path (skb_splice_bits).
+		 */
+		struct page_frag *pfrag = sk_page_frag(ctx->sk);
+
+		if (!sk_page_frag_refill(ctx->sk, pfrag))
+			return 0;
 
-	memcpy(page_address(buf.page), page_address(page) + offset, len);
-	buf.offset = 0;
-	buf.len = len;
+		len = min_t(size_t, len, pfrag->size - pfrag->offset);
+		memcpy(page_address(pfrag->page) + pfrag->offset,
+		       page_address(page) + offset, len);
+		buf.page = pfrag->page;
+		buf.offset = pfrag->offset;
+		buf.len = len;
+		pfrag->offset += len;
+	} else {
+		buf.page = page;
+		buf.offset = offset;
+		buf.len = len;
+	}
+
+	get_page(buf.page);
 
 	/*
 	 * add_to_pipe() calls pipe_buf_release() on failure, which
@@ -481,9 +502,9 @@ static ssize_t tcp_bpf_splice_read(struct socket *sock, loff_t *ppos,
 				   struct pipe_inode_info *pipe, size_t len,
 				   unsigned int flags)
 {
-	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe };
-	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
 	struct sock *sk = sock->sk;
+	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe, .sk = sk };
+	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
 	struct sk_psock *psock;
 	int ret;
 
@@ -508,9 +529,9 @@ static ssize_t tcp_bpf_splice_read_parser(struct socket *sock, loff_t *ppos,
 					  struct pipe_inode_info *pipe,
 					  size_t len, unsigned int flags)
 {
-	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe };
-	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
 	struct sock *sk = sock->sk;
+	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe, .sk = sk };
+	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
 	struct sk_psock *psock;
 	int ret;
 
-- 
2.43.0



Thread overview: 11+ messages
2026-03-04  6:33 [PATCH bpf-next v1 0/7] bpf/sockmap: add splice support for tcp_bpf Jiayuan Chen
2026-03-04  6:33 ` [PATCH bpf-next v1 1/7] net: add splice_read to struct proto and set it in tcp_prot/tcpv6_prot Jiayuan Chen
2026-03-04  6:33 ` [PATCH bpf-next v1 2/7] inet: add inet_splice_read() and use it in inet_stream_ops/inet6_stream_ops Jiayuan Chen
2026-03-04  6:33 ` [PATCH bpf-next v1 3/7] tcp_bpf: refactor recvmsg with read actor abstraction Jiayuan Chen
2026-03-04  7:14   ` bot+bpf-ci
2026-03-04  6:33 ` [PATCH bpf-next v1 4/7] tcp_bpf: add splice_read support for sockmap Jiayuan Chen
2026-03-04  7:27   ` bot+bpf-ci
2026-03-04  6:33 ` Jiayuan Chen [this message]
2026-03-04  6:33 ` [PATCH bpf-next v1 6/7] selftests/bpf: add splice_read tests " Jiayuan Chen
2026-03-06 17:25   ` Mykyta Yatsenko
2026-03-04  6:33 ` [PATCH bpf-next v1 7/7] selftests/bpf: add splice option to sockmap benchmark Jiayuan Chen
