From: Jiayuan Chen <jiayuan.chen@linux.dev>
To: bpf@vger.kernel.org, john.fastabend@gmail.com, jakub@cloudflare.com
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	Kuniyuki Iwashima <kuniyu@google.com>,
	Willem de Bruijn <willemb@google.com>,
	David Ahern <dsahern@kernel.org>,
	Neal Cardwell <ncardwell@google.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Eduard Zingerman <eddyz87@gmail.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Shuah Khan <shuah@kernel.org>,
	Jiapeng Chong <jiapeng.chong@linux.alibaba.com>,
	Ihor Solodrai <ihor.solodrai@linux.dev>,
	Michal Luczaj <mhal@rbox.co>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org
Subject: [PATCH bpf-next v1 4/7] tcp_bpf: add splice_read support for sockmap
Date: Wed,  4 Mar 2026 14:33:55 +0800
Message-ID: <20260304063643.14581-5-jiayuan.chen@linux.dev>
In-Reply-To: <20260304063643.14581-1-jiayuan.chen@linux.dev>

Implement splice_read for sockmap using an always-copy approach.
Each page from the psock ingress scatterlist is copied to a newly
allocated page before being added to the pipe, avoiding lifetime
and slab-page issues.

Add sk_msg_splice_actor() which allocates a fresh page via
alloc_page(), copies the data with memcpy(), then passes it to
add_to_pipe(). The newly allocated page already has a refcount
of 1, so no additional get_page() is needed. On add_to_pipe()
failure, no explicit cleanup is needed since add_to_pipe()
internally calls pipe_buf_release().

Also fix sk_msg_read_core() to update msg_rx->sg.start when the
actor returns 0 mid-way through processing. The loop processes
msg_rx->sg entries sequentially — if the actor fails (e.g. pipe
full for splice, or user buffer fault for recvmsg), prior entries
may already be consumed with sge->length set to 0. Without
advancing sg.start, subsequent calls would revisit these
zero-length entries and return -EFAULT. This is especially
common with the splice actor since the pipe has a small fixed
capacity (16 slots), but theoretically affects recvmsg as well.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 net/core/skmsg.c   | 10 ++++++
 net/ipv4/tcp_bpf.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 6a906bfe3aa4..2fcbf8eaf4cf 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -445,6 +445,16 @@ int sk_msg_read_core(struct sock *sk, struct sk_psock *psock,
 				copy = actor(actor_arg, page,
 					     sge->offset, copy);
 			if (!copy) {
+				/*
+				 * The loop processes msg_rx->sg entries
+				 * sequentially and prior entries may
+				 * already be consumed. Advance sg.start
+				 * so the next call resumes at the correct
+				 * entry, otherwise it would revisit
+				 * zero-length entries and return -EFAULT.
+				 */
+				if (!peek)
+					msg_rx->sg.start = i;
 				copied = copied ? copied : -EFAULT;
 				goto out;
 			}
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 606c2b079f86..e85a27e32ea7 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -7,6 +7,7 @@
 #include <linux/init.h>
 #include <linux/wait.h>
 #include <linux/util_macros.h>
+#include <linux/splice.h>
 
 #include <net/inet_common.h>
 #include <net/tls.h>
@@ -444,6 +445,85 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	return ret;
 }
 
+struct tcp_bpf_splice_ctx {
+	struct pipe_inode_info *pipe;
+};
+
+static int sk_msg_splice_actor(void *arg, struct page *page,
+			       unsigned int offset, size_t len)
+{
+	struct tcp_bpf_splice_ctx *ctx = arg;
+	struct pipe_buffer buf = {
+		.ops = &nosteal_pipe_buf_ops,
+	};
+	ssize_t ret;
+
+	buf.page = alloc_page(GFP_KERNEL);
+	if (!buf.page)
+		return 0;
+
+	memcpy(page_address(buf.page), page_address(page) + offset, len);
+	buf.offset = 0;
+	buf.len = len;
+
+	/*
+	 * add_to_pipe() calls pipe_buf_release() on failure, which
+	 * handles put_page() via nosteal_pipe_buf_ops, so no explicit
+	 * cleanup is needed here.
+	 */
+	ret = add_to_pipe(ctx->pipe, &buf);
+	if (ret <= 0)
+		return 0;
+	return ret;
+}
+
+static ssize_t tcp_bpf_splice_read(struct socket *sock, loff_t *ppos,
+				   struct pipe_inode_info *pipe, size_t len,
+				   unsigned int flags)
+{
+	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe };
+	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
+	struct sock *sk = sock->sk;
+	struct sk_psock *psock;
+	int ret;
+
+	psock = sk_psock_get(sk);
+	if (unlikely(!psock))
+		return tcp_splice_read(sock, ppos, pipe, len, flags);
+	if (!skb_queue_empty(&sk->sk_receive_queue) &&
+	    sk_psock_queue_empty(psock)) {
+		sk_psock_put(sk, psock);
+		return tcp_splice_read(sock, ppos, pipe, len, flags);
+	}
+
+	ret = __tcp_bpf_recvmsg(sk, psock, sk_msg_splice_actor, &ctx,
+				len, bpf_flags);
+	sk_psock_put(sk, psock);
+	if (!ret)
+		return tcp_splice_read(sock, ppos, pipe, len, flags);
+	return ret;
+}
+
+static ssize_t tcp_bpf_splice_read_parser(struct socket *sock, loff_t *ppos,
+					  struct pipe_inode_info *pipe,
+					  size_t len, unsigned int flags)
+{
+	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe };
+	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
+	struct sock *sk = sock->sk;
+	struct sk_psock *psock;
+	int ret;
+
+	psock = sk_psock_get(sk);
+	if (unlikely(!psock))
+		return tcp_splice_read(sock, ppos, pipe, len, flags);
+
+	ret = __tcp_bpf_recvmsg_parser(sk, psock, sk_msg_splice_actor, &ctx,
+				       len, bpf_flags);
+	sk_psock_put(sk, psock);
+	return ret;
+}
+
 static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 				struct sk_msg *msg, int *copied, int flags)
 {
@@ -671,6 +751,7 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 	prot[TCP_BPF_BASE].destroy		= sock_map_destroy;
 	prot[TCP_BPF_BASE].close		= sock_map_close;
 	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
+	prot[TCP_BPF_BASE].splice_read		= tcp_bpf_splice_read;
 	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;
 	prot[TCP_BPF_BASE].ioctl		= tcp_bpf_ioctl;
 
@@ -679,9 +760,11 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 
 	prot[TCP_BPF_RX]			= prot[TCP_BPF_BASE];
 	prot[TCP_BPF_RX].recvmsg		= tcp_bpf_recvmsg_parser;
+	prot[TCP_BPF_RX].splice_read		= tcp_bpf_splice_read_parser;
 
 	prot[TCP_BPF_TXRX]			= prot[TCP_BPF_TX];
 	prot[TCP_BPF_TXRX].recvmsg		= tcp_bpf_recvmsg_parser;
+	prot[TCP_BPF_TXRX].splice_read		= tcp_bpf_splice_read_parser;
 }
 
 static void tcp_bpf_check_v6_needs_rebuild(struct proto *ops)
-- 
2.43.0

