BPF List
* [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
@ 2026-06-12  1:14 Cong Wang
  2026-06-12  1:14 ` [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice Cong Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Cong Wang @ 2026-06-12  1:14 UTC
  To: netdev
  Cc: bpf, John Fastabend, Jakub Sitnicki, Jiayuan Chen, hemanthmalla,
	zijianzhang, Cong Wang

This series adds an opportunistic "loopback splice" fast path for two
locally-connected TCP sockets that a sock_ops BPF program pairs at
handshake completion. Once paired, sendmsg copies the user payload into
a per-direction in-kernel byte ring and recvmsg drains it on the other
side; both copies happen in their own task's mm, so the fast path incurs
no skb construction, no softirq, and no TCP protocol-state processing.

The underlying TCP connection stays fully real: sequence numbers are
frozen at post-handshake values, so FIN/RST/keepalive keep flowing
through the normal paths and the pair tears down via a regular close.
Pairing is opt-in per flow and fallback is per-message - handshake-style
traffic takes the TCP path, the bulk phase takes the ring, on the same
socket. Nothing leaves the host and applications need no changes: no new
address family, no LD_PRELOAD, no source modification.

The target use cases are co-located endpoints that speak plain TCP:
 - regular TCP loopback (127.0.0.1) between processes on the same host;
 - container sidecar deployments - e.g. a service-mesh sidecar proxy and
   its application in the same pod, talking over loopback or a veth pair -
   where the per-skb veth+bridge cost is exactly what the ring sidesteps.

Highlights (TCP_RR, 1 KB request/response, netperf, pinned CPUs,
baseline TCP vs splice; full tables across message sizes and TCP_STREAM
in patches 1 and 2):

  loopback (127.0.0.1):
    without busy-poll:   105.8k -> 235.1k tps  (2.2x)
    with busy-poll 50us: 106.1k -> 713.0k tps  (6.7x)

  container (netns + veth + bridge):
    without busy-poll:    99.9k -> 233.9k tps  (2.3x)
    with busy-poll 50us: 100.4k -> 704.9k tps  (7.0x)

Synchronous RPC (TCP_RR) at 1 KB wins ~2.2x without busy polling and
~6.7x with it on loopback, and ~2.3x and ~7.0x between containers; the
win grows toward smaller messages and narrows toward 64 KB. The ring
removes the per-cycle kernel TCP receive-path cost, and the receiver can
spin on the ring directly. The generic sk_busy_loop() cannot provide
that on loopback, which delivers via the per-CPU backlog and exposes no
pollable napi_id. Bulk streaming is roughly neutral on bare-metal
loopback but wins decisively (up to ~6x) container-to-container, where
per-skb veth+bridge cost dominates the path the ring sidesteps.

---
Cong Wang (5):
  tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback
    splice
  tcp_bpf: busy-poll the splice ring before parking the receiver
  selftests/bpf: add tcp_splice basic round-trip test
  bpf: allow SO_BUSY_POLL in bpf_setsockopt()
  selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog

 include/linux/skmsg.h                         |   9 +
 include/net/tcp.h                             |   8 +
 net/core/filter.c                             |   1 +
 net/core/skmsg.c                              |   3 +
 net/ipv4/tcp_bpf.c                            | 847 +++++++++++++++++-
 .../selftests/bpf/prog_tests/tcp_splice.c     | 206 +++++
 .../selftests/bpf/progs/test_tcp_splice.c     | 125 +++
 7 files changed, 1198 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_splice.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_splice.c


base-commit: 30dee2c176e7954f63d1fa3e52d172f30beb9bfb
-- 
2.43.0



* [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice
  2026-06-12  1:14 [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Cong Wang
@ 2026-06-12  1:14 ` Cong Wang
  2026-06-12  1:33   ` sashiko-bot
  2026-06-12  2:10   ` bot+bpf-ci
  2026-06-12  1:14 ` [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver Cong Wang
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 18+ messages in thread
From: Cong Wang @ 2026-06-12  1:14 UTC
  To: netdev
  Cc: bpf, John Fastabend, Jakub Sitnicki, Jiayuan Chen, hemanthmalla,
	zijianzhang, Cong Wang, Cong Wang

Two locally-connected TCP sockets can be paired by a sock_ops BPF
program at handshake completion. Once paired, sendmsg copies the user
payload into a per-direction kernel-side byte ring; recvmsg drains the
ring into the user buffer. Both copies happen in their own task's mm,
so no cross-mm pin / kmap dance is needed and the splice fast path
incurs no skb construction, no softirq, and no TCP protocol-state
processing. The TCP wire connection itself never sees the spliced
bytes: sequence numbers stay frozen at post-handshake values, so FIN,
RST, and keepalive continue to work through the regular paths and the
pair tears down via a normal close handshake.

Per-direction ring layout
-------------------------

The ring is a power-of-two byte buffer (16 KiB by default) backed by
order-2 pages from __get_free_pages(), manipulated through
include/linux/circ_buf.h macros. Both per-direction rings are allocated
when the pair is formed, so the data path never allocates. Producer and
consumer are SPSC (one socket each side), so head and tail are updated
with smp_store_release() / smp_load_acquire() without a data-path lock.
Each side keeps a private cache of the other's cursor (the producer
caches ring_tail) and reads the real, cross-CPU cursor only when the
cache is exhausted - standard SPSC cursor caching. sendmsg copies from
the user iov into the ring at head; recvmsg copies out at tail.
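
As a rough userspace illustration of the cursor scheme described above
(not the kernel code: C11 atomics stand in for smp_store_release() /
smp_load_acquire(), memcpy stands in for the iov copies, and the names
spsc_ring, ring_space, ring_write and ring_read are made up for this
sketch), a producer with a cached tail might look like the following.
Unlike the circ_buf.h macros, which keep one byte free, the sketch lets
all RING_SIZE bytes fill:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 16384UL	/* 16 KiB, must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be 2^n");

/* Hypothetical userspace stand-in for one per-direction ring. */
struct spsc_ring {
	unsigned char buf[RING_SIZE];
	_Atomic unsigned long head;	/* advanced by the producer only */
	unsigned long cached_tail;	/* producer-private copy of tail */
	_Atomic unsigned long tail;	/* advanced by the consumer only */
};

/* Free bytes as seen through cached_tail; the consumer-owned tail is
 * re-read only when the cached view says the ring is full.
 */
static size_t ring_space(struct spsc_ring *r)
{
	unsigned long head = atomic_load_explicit(&r->head,
						  memory_order_relaxed);
	size_t space = RING_SIZE - (head - r->cached_tail);

	if (space)
		return space;
	r->cached_tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	return RING_SIZE - (head - r->cached_tail);
}

/* Copy up to len bytes in at head; returns the number of bytes queued. */
static size_t ring_write(struct spsc_ring *r, const void *src, size_t len)
{
	unsigned long head = atomic_load_explicit(&r->head,
						  memory_order_relaxed);
	size_t space = ring_space(r);
	size_t want = len < space ? len : space;
	size_t off = head & (RING_SIZE - 1);
	size_t first = want < RING_SIZE - off ? want : RING_SIZE - off;

	memcpy(r->buf + off, src, first);
	memcpy(r->buf, (const unsigned char *)src + first, want - first);
	/* Release: the payload is visible before the new head is. */
	atomic_store_explicit(&r->head, head + want, memory_order_release);
	return want;
}

/* Copy up to len bytes out at tail; returns the number of bytes read. */
static size_t ring_read(struct spsc_ring *r, void *dst, size_t len)
{
	unsigned long tail = atomic_load_explicit(&r->tail,
						  memory_order_relaxed);
	/* Acquire: pairs with the release store of head in ring_write(). */
	unsigned long head = atomic_load_explicit(&r->head,
						  memory_order_acquire);
	size_t have = head - tail;
	size_t want = len < have ? len : have;
	size_t off = tail & (RING_SIZE - 1);
	size_t first = want < RING_SIZE - off ? want : RING_SIZE - off;

	memcpy(dst, r->buf + off, first);
	memcpy((unsigned char *)dst + first, r->buf, want - first);
	/* Release: the slots are free before the producer sees the advance. */
	atomic_store_explicit(&r->tail, tail + want, memory_order_release);
	return want;
}
```

The producer touches the consumer-owned tail only inside ring_space(),
and only after its cached copy reports a full ring, so a stale cache can
under-report free space but never over-report it.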

The ring is the queue between sender and receiver - it accumulates
across recvmsg calls, so a sequence of small sends amortises into one
wake when the receiver isn't draining synchronously. This is what
makes the splice path viable for streaming workloads without forcing
per-message rendezvous. The cost is one extra in-kernel copy compared
to a sender->user-pages direct mechanism, but the benefit is that the
splice path can stay engaged across arbitrary phasing between sender
and receiver - no app cooperation required.

The sender keeps the peer ring alive across the copy with a per-pair
percpu_ref rather than a per-message socket refcount, and validates
only its own socket's error/shutdown state, as the rest of tcp_bpf does
- a peer reset reaches it over the still-live TCP connection. Both keep
the per-message cost off cross-CPU cachelines.

Sender (splice_send_ring) defers to tcp_sendmsg() when (a) the peer
rcv_queue is non-empty (preserving stream ordering against prior TCP
fallback) or (b) the ring is full (TCP-level backpressure via sndbuf /
snd_wnd absorbs the overflow). Receiver (tcp_bpf_splice_recvmsg)
defers to tcp_recvmsg() when the rcv_queue holds data and the ring is
empty. The end-to-end ordering invariant is: rcv_queue bytes are
always older than any ring bytes drained alongside them, because the
sender only writes to the ring while the peer rcv_queue is empty.
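
The fallback rules above reduce to two small decisions. The sketch
below restates them as pure functions; the enum and function names are
invented for illustration and the arguments stand in for the kernel's
queue and ring checks, so treat it as a reading aid rather than the
patch's logic verbatim:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tx_path { TX_RING, TX_TCP };
enum rx_path { RX_RING, RX_TCP, RX_WAIT };

/* Sender side: fall back to TCP when (a) the peer rcv_queue still holds
 * bytes from an earlier fallback or (b) the ring has no free space.
 */
static enum tx_path choose_tx(bool peer_rcvq_empty, size_t ring_space)
{
	if (!peer_rcvq_empty)
		return TX_TCP;
	if (ring_space == 0)
		return TX_TCP;
	return TX_RING;
}

/* Receiver side: take ring bytes when there are any, defer to TCP when
 * only the rcv_queue has data, otherwise wait for either to fill.
 */
static enum rx_path choose_rx(bool rcvq_empty, size_t ring_bytes)
{
	if (ring_bytes)
		return RX_RING;
	if (!rcvq_empty)
		return RX_TCP;
	return RX_WAIT;
}
```

Both decisions are taken per call, which is what lets one socket move
between the ring and the TCP path message by message.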

For bytes that take the splice fast path, SO_SNDBUF and SO_RCVBUF are
not honored - the sndbuf / rcvbuf accounting machinery is exactly
what splice intentionally bypasses. The associated infrastructure -
sk_mem_charge / sk_mem_uncharge, sk_forward_alloc,
prot->memory_allocated, tcp_memory_pressure, the per-cpu reserves -
is among the most painful parts of TCP to maintain, and spliced bytes
opt out of it as a side effect of having no skb-borne kernel-side
bytes to account for. The ring's own capacity is bounded
(SPLICE_RING_SIZE), giving a hard upper bound on per-pair memory.
SIOCINQ / SIOCOUTQ reflect only the underlying TCP socket's frozen
counters, and getsockopt(TCP_INFO) likewise. Bytes that take the TCP
fallback go through the regular TCP path with all of its normal
accounting.
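
To put a number on that bound, here is a back-of-the-envelope sketch.
It assumes 4 KiB pages in the first case (the page size is an
assumption, not something this patch fixes), and page_order() and
pair_alloc_bytes() are helper names invented here. It shows how a byte
size maps to a page order and what the two rings of one pair cost:

```c
#include <assert.h>
#include <stddef.h>

#define SPLICE_RING_SIZE (16U * 1024U)	/* value taken from the patch */

/* Smallest order such that (page_size << order) >= bytes. */
static unsigned int page_order(size_t bytes, size_t page_size)
{
	unsigned int order = 0;

	while ((page_size << order) < bytes)
		order++;
	return order;
}

/* Bytes allocated for one pair: two rings, each rounded up to a whole
 * power-of-two number of pages.
 */
static size_t pair_alloc_bytes(size_t ring_bytes, size_t page_size)
{
	return 2 * (page_size << page_order(ring_bytes, page_size));
}
```

With 4 KiB pages each ring is an order-2 allocation and a pair holds
32 KiB; with larger pages the allocation rounds up to one whole page
per ring.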

Pairing is opt-in per flow - the BPF program at handshake decides
which connections to splice. Applications that mix handshake-style
traffic and bulk streaming on the same paired socket get the right
behaviour on both phases automatically: the handshake survives via
TCP fallback, the bulk phase runs through the ring.

The receiver parks on the socket waitqueue when the ring is empty.
A following patch adds an optional bounded busy-poll of the ring
before parking, gated on the socket's SO_BUSY_POLL budget; it is off
by default and is what turns the latency-bound TCP_RR case into a
large win once enabled. The numbers below are with busy polling
disabled.
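
The follow-up patch is not reproduced in this message, so the snippet
below is only a generic userspace sketch of the "spin for a bounded
time, then park" idea. The names spin_before_park, flag_is_set and
now_us are hypothetical, the budget is in microseconds as with
SO_BUSY_POLL, and POSIX clock_gettime() stands in for whatever clock
the kernel side uses:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in microseconds. */
static uint64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Example predicate: ready once the int behind arg is non-zero. */
static bool flag_is_set(void *arg)
{
	return *(volatile int *)arg != 0;
}

/* Poll ready() for at most budget_us microseconds. Returns true if the
 * predicate became true while spinning; false means the budget ran out
 * (or was zero) and the caller should park on its waitqueue instead.
 */
static bool spin_before_park(bool (*ready)(void *), void *arg,
			     uint64_t budget_us)
{
	uint64_t deadline;

	if (ready(arg))
		return true;
	if (!budget_us)
		return false;	/* busy polling disabled */

	deadline = now_us() + budget_us;
	while (now_us() < deadline) {
		if (ready(arg))
			return true;
	}
	return false;
}
```

A zero budget degenerates to the behaviour measured below: one check,
then sleep.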

Microbenchmarks
---------------

Pinned to two adjacent CPUs (sender CPU 1, receiver CPU 0), 10s per
run, 3 runs averaged; netperf to 127.0.0.1. Splice without busy
polling.

Bare-metal loopback:

  TCP_STREAM  msg=   64 B:   2577 ->  1680  Mbps  (0.65x)
              msg=  256 B:   9336 ->  8640  Mbps  (0.93x)
              msg=    1 KB: 22416 -> 24136  Mbps  (1.08x)
              msg=    4 KB: 37893 -> 52304  Mbps  (1.38x)
              msg=   16 KB: 48019 -> 53235  Mbps  (1.11x)
              msg=   64 KB: 49686 -> 49418  Mbps  (0.99x)

  TCP_RR      sz=     1 B:  110.2k -> 267.0k tps  (2.42x)
              sz=    64 B:  111.6k -> 265.7k tps  (2.38x)
              sz=     1 KB: 105.8k -> 235.1k tps  (2.22x)
              sz=    16 KB:  40.5k ->  89.6k tps  (2.21x)
              sz=    64 KB:  17.8k ->  20.9k tps  (1.17x)

Container-to-container (two network namespaces connected via veth pair
plus Linux bridge, processes pinned in the same way):

  TCP_STREAM  msg=   64 B:   1420 ->   1643 Mbps  (1.16x)
              msg=    1 KB:  3710 -> 21326 Mbps  (5.75x)
              msg=    4 KB:  8084 -> 48834 Mbps  (6.04x)
              msg=   16 KB: 26083 -> 27788 Mbps  (1.07x)
              msg=   64 KB: 47659 -> 47507 Mbps  (1.00x)

  TCP_RR      sz=     1 B:  105.0k -> 265.0k tps  (2.52x)
              sz=    64 B:  101.2k -> 264.3k tps  (2.61x)
              sz=     1 KB:  99.9k -> 233.9k tps  (2.34x)
              sz=    16 KB:  44.8k ->  91.1k tps  (2.03x)
              sz=    64 KB:  18.1k ->  23.5k tps  (1.30x)

Synchronous-RPC workloads (TCP_RR) win 2.0-2.6x up to 16 KB in both
environments because the ring eliminates the per-cycle overhead of the
kernel TCP receive path; the gain narrows to 1.2-1.3x at 64 KB.
Mid-size streaming wins on loopback (4-16 KB at 1.1-1.4x). Tiny-message
streaming (64 B) on bare-metal loopback regresses to 0.65x because
loopback TCP's TSO super-segments amortise per-batch cost to
~20 ns/msg, below the ring's 2-copy per-cycle floor; this is
structural. In containers the per-packet veth+bridge overhead dwarfs
the ring's floor, so small-message streaming wins instead: 64 B reaches
1.16x, and 1 KB and 4 KB reach 5.75x and 6.04x because TCP's per-skb
cost dominates the container path and the ring sidesteps it entirely.

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 include/linux/skmsg.h |   9 +
 include/net/tcp.h     |   8 +
 net/core/skmsg.c      |   3 +
 net/ipv4/tcp_bpf.c    | 809 +++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 828 insertions(+), 1 deletion(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 19f4f253b4f9..c9b7144cc846 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -80,6 +80,8 @@ struct sk_psock_work_state {
 	u32				off;
 };
 
+struct sk_psock_splice;	/* defined in net/ipv4/tcp_bpf.c */
+
 struct sk_psock {
 	struct sock			*sk;
 	struct sock			*sk_redir;
@@ -121,6 +123,13 @@ struct sk_psock {
 	struct delayed_work		work;
 	struct sock			*sk_pair;
 	struct rcu_work			rwork;
+
+	/* Loopback splice state for paired stream sockets. NULL until the
+	 * first bpf_sock_splice_pair() call on this psock; lazily allocated
+	 * and kept for the lifetime of the psock so that sender/receiver
+	 * paths don't need to revalidate the pointer mid-flight.
+	 */
+	struct sk_psock_splice __rcu	*splice;
 };
 
 int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 98848db62894..c1597accdac9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2855,6 +2855,8 @@ struct sk_psock;
 #ifdef CONFIG_BPF_SYSCALL
 int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
 void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
+void tcp_bpf_splice_unpair(struct sk_psock *psock);
+void tcp_bpf_splice_destroy(struct sk_psock *psock);
 #ifdef CONFIG_BPF_STREAM_PARSER
 struct strparser;
 int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc,
@@ -2880,6 +2882,12 @@ static inline void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
 }
 #endif
 
+#if !defined(CONFIG_BPF_SYSCALL)
+struct sk_psock;
+static inline void tcp_bpf_splice_unpair(struct sk_psock *psock) {}
+static inline void tcp_bpf_splice_destroy(struct sk_psock *psock) {}
+#endif
+
 #ifdef CONFIG_CGROUP_BPF
 static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
 				      struct sk_buff *skb,
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index e1850caf1a71..b39fc249a18d 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -881,12 +881,15 @@ static void sk_psock_destroy(struct work_struct *work)
 		sock_put(psock->sk_redir);
 	if (psock->sk_pair)
 		sock_put(psock->sk_pair);
+	tcp_bpf_splice_destroy(psock);
 	sock_put(psock->sk);
 	kfree(psock);
 }
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
+	tcp_bpf_splice_unpair(psock);
+
 	write_lock_bh(&sk->sk_callback_lock);
 	sk_psock_restore_proto(sk, psock);
 	rcu_assign_sk_user_data(sk, NULL);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index cc0bd73f36b6..549f37077244 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -4,14 +4,31 @@
 #include <linux/skmsg.h>
 #include <linux/filter.h>
 #include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#include <linux/circ_buf.h>
 #include <linux/init.h>
+#include <linux/mm.h>
 #include <linux/wait.h>
 #include <linux/util_macros.h>
+#include <linux/percpu-refcount.h>
 
 #include <net/inet_common.h>
+#include <net/inet_sock.h>
 #include <net/tls.h>
 #include <asm/ioctls.h>
 
+static bool sk_psock_is_spliced(const struct sk_psock *psock);
+static int tcp_bpf_splice_recvmsg(struct sock *sk, struct sk_psock *psock,
+				  struct msghdr *msg, size_t len,
+				  int flags, int *err);
+static int splice_send_ring(struct sock *sk, struct sk_psock *psock,
+			    struct msghdr *msg, size_t size, int flags);
+static int tcp_bpf_splice_sendmsg(struct sock *sk, struct msghdr *msg,
+				  size_t size);
+static void splice_ring_free(struct sk_psock_splice *s);
+static bool tcp_bpf_is_readable(struct sock *sk);
+
 void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tcp;
@@ -365,6 +382,46 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	psock = sk_psock_get(sk);
 	if (unlikely(!psock))
 		return tcp_recvmsg(sk, msg, len, flags);
+
+	/* Splice dispatch.
+	 *
+	 * The sender stays on plain tcp_sendmsg() (preserving Nagle,
+	 * TSO, sk_write_queue coalescing) whenever the peer rcv_queue
+	 * has bytes in flight or the ring is full, and otherwise copies
+	 * the payload into the ring.
+	 *
+	 * tcp_bpf_splice_recvmsg() therefore tries the ring first and
+	 * sleeps only while both the ring and sk_receive_queue are
+	 * empty. There is no rendezvous with the sender: bytes reach
+	 * the receiver through the ring or through the regular TCP
+	 * receive queue, never by a direct user-to-user copy.
+	 *
+	 * splice_recvmsg returns 0 with no error if rcv_queue gained
+	 * bytes during the wait (the sender fell back to TCP), in which
+	 * case the next block below drains them via tcp_recvmsg() and
+	 * stream ordering is preserved end-to-end.
+	 */
+	if (sk_psock_is_spliced(psock)) {
+		int err = 0, rcopied;
+
+		/* tcp_bpf_splice_recvmsg drains the ring first (ring bytes
+		 * predate any rcv_queue bytes when both have data) and only
+		 * returns 0 when both are empty or rcv_queue has the only
+		 * bytes left. The block below then routes the rcv_queue
+		 * drain via tcp_recvmsg().
+		 */
+		rcopied = tcp_bpf_splice_recvmsg(sk, psock, msg, len,
+						 flags, &err);
+		if (rcopied > 0) {
+			sk_psock_put(sk, psock);
+			return rcopied;
+		}
+		if (err) {
+			sk_psock_put(sk, psock);
+			return err;
+		}
+	}
+
 	if (!skb_queue_empty(&sk->sk_receive_queue) &&
 	    sk_psock_queue_empty(psock)) {
 		sk_psock_put(sk, psock);
@@ -626,8 +683,9 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 	prot[TCP_BPF_BASE]			= *base;
 	prot[TCP_BPF_BASE].destroy		= sock_map_destroy;
 	prot[TCP_BPF_BASE].close		= sock_map_close;
+	prot[TCP_BPF_BASE].sendmsg		= tcp_bpf_splice_sendmsg;
 	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
-	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;
+	prot[TCP_BPF_BASE].sock_is_readable	= tcp_bpf_is_readable;
 	prot[TCP_BPF_BASE].ioctl		= tcp_bpf_ioctl;
 
 	prot[TCP_BPF_TX]			= prot[TCP_BPF_BASE];
@@ -756,4 +814,753 @@ void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
 	if (is_insidevar(prot, tcp_bpf_prots))
 		newsk->sk_prot = sk->sk_prot_creator;
 }
+
+/* Per-psock splice state: a SPSC byte ring (this socket reads from
+ * ring_buf; the paired sender writes into it). Sender defers to
+ * tcp_sendmsg() when peer rcv_queue is non-empty (ordering) or the
+ * ring is full (backpressure); receiver defers to tcp_recvmsg() when
+ * rcv_queue has data. Head/tail are monotonic; buffer offset is
+ * (cursor & (ring_size - 1)). Data path is lockless via release/
+ * acquire on head/tail; ->lock serialises only lazy alloc / teardown.
+ */
+struct sk_psock_splice {
+	struct sk_psock		*peer;      /* NULL after unpair */
+	spinlock_t		lock;       /* alloc/teardown only */
+	void			*ring_buf;  /* order-2 pages, ring_size bytes */
+	size_t			ring_size;  /* power of 2 */
+	struct percpu_ref	ring_ref;   /* cross-socket writers into ring_buf */
+
+	/* Producer and consumer cursors live on separate cache lines: the
+	 * writer's release-store of ring_head must not invalidate the
+	 * reader's hot ring_tail line, and vice versa. cached_tail is the
+	 * producer's private cache of ring_tail, kept on the producer's own
+	 * line, so the producer reads the consumer-owned ring_tail only when
+	 * its cache says the ring is full - standard SPSC cursor caching.
+	 */
+	unsigned long		ring_head ____cacheline_aligned_in_smp;
+	unsigned long		cached_tail;
+	unsigned long		ring_tail ____cacheline_aligned_in_smp;
+};
+
+#define SPLICE_RING_SIZE	(16U * 1024U)
+
+/* Wake any waiters parked on @sk. Used at teardown so a sleeping
+ * receiver observes the cleared ->peer and exits. The smp_mb() closes
+ * the same lost-wakeup window as splice_wake_sync() below.
+ */
+static inline void splice_wake(struct sock *sk)
+{
+	wait_queue_head_t *wq = sk_sleep(sk);
+
+	smp_mb();
+	if (wq && waitqueue_active(wq))
+		wake_up_interruptible_all(wq);
+}
+
+/* Wake the receiver after a producer write to the ring. The _poll
+ * variant with EPOLLIN | EPOLLRDNORM is required so poll()/select()/
+ * epoll waiters see the wake (a plain sync wake carries no mask and is
+ * silently dropped by poll waiters); wait_event-style waiters wake on
+ * it too. The smp_mb() orders the ring head publish before the
+ * waitqueue_active() check, pairing with set_current_state() in the
+ * consumer's wait loop - without it the producer can skip the wake
+ * while the consumer concurrently parks with the predicate just-
+ * not-yet-true, a lost wakeup. _sync hints the scheduler to keep the
+ * wakee on the producer's CPU.
+ */
+static inline void splice_wake_sync(struct sock *sk)
+{
+	wait_queue_head_t *wq = sk_sleep(sk);
+
+	smp_mb();
+	if (wq && waitqueue_active(wq))
+		wake_up_interruptible_sync_poll(wq, EPOLLIN | EPOLLRDNORM);
+}
+
+static bool sk_psock_is_spliced(const struct sk_psock *psock)
+{
+	struct sk_psock_splice *s = rcu_dereference(psock->splice);
+
+	return s && rcu_access_pointer(s->peer);
+}
+
+static int tcp_bpf_splice_sendmsg(struct sock *sk, struct msghdr *msg,
+				  size_t size)
+{
+	struct sk_psock *psock;
+	int spliced = 0;
+	int ret;
+
+	psock = sk_psock_get(sk);
+	if (psock) {
+		if (sk_psock_is_spliced(psock)) {
+			int flags = (msg->msg_flags &
+				     ~MSG_SENDPAGE_DECRYPTED) |
+				     MSG_NO_SHARED_FRAGS;
+
+			spliced = splice_send_ring(sk, psock, msg,
+						   size, flags);
+		}
+		sk_psock_put(sk, psock);
+	}
+
+	if ((size_t)spliced < size) {
+		ret = tcp_sendmsg(sk, msg, size - spliced);
+		if (ret < 0)
+			return spliced > 0 ? spliced : ret;
+		return spliced + ret;
+	}
+	return spliced;
+}
+
+/* percpu_ref release: fires after percpu_ref_kill() once every in-flight
+ * cross-socket sender has dropped its hold. Safe to free the ring and the
+ * splice state now.
+ */
+static void splice_ring_ref_release(struct percpu_ref *ref)
+{
+	struct sk_psock_splice *s =
+		container_of(ref, struct sk_psock_splice, ring_ref);
+
+	splice_ring_free(s);
+	percpu_ref_exit(&s->ring_ref);
+	kfree(s);
+}
+
+static struct sk_psock_splice *splice_get_or_alloc(struct sk_psock *psock)
+{
+	struct sk_psock_splice *s, *old;
+
+	s = rcu_dereference_protected(psock->splice, 1);
+	if (s)
+		return s;
+
+	s = kzalloc_obj(*s, GFP_ATOMIC);
+	if (!s)
+		return NULL;
+	spin_lock_init(&s->lock);
+
+	if (percpu_ref_init(&s->ring_ref, splice_ring_ref_release, 0,
+			    GFP_ATOMIC)) {
+		kfree(s);
+		return NULL;
+	}
+
+	old = cmpxchg((struct sk_psock_splice **)&psock->splice, NULL, s);
+	if (old) {
+		percpu_ref_exit(&s->ring_ref);
+		kfree(s);
+		return old;
+	}
+	return s;
+}
+
+static void splice_lock_pair(struct sk_psock_splice *a,
+			     struct sk_psock_splice *b)
+{
+	if (a < b) {
+		spin_lock_bh(&a->lock);
+		spin_lock_nested(&b->lock, SINGLE_DEPTH_NESTING);
+	} else {
+		spin_lock_bh(&b->lock);
+		spin_lock_nested(&a->lock, SINGLE_DEPTH_NESTING);
+	}
+}
+
+static void splice_unlock_pair(struct sk_psock_splice *a,
+			       struct sk_psock_splice *b)
+{
+	if (a < b) {
+		spin_unlock(&b->lock);
+		spin_unlock_bh(&a->lock);
+	} else {
+		spin_unlock(&a->lock);
+		spin_unlock_bh(&b->lock);
+	}
+}
+
+/*
+ * Tear down a splice pair. Idempotent and safe to call from any teardown
+ * path (sk_psock_drop, tcp_close, tcp_disconnect, RST handler). No-op if
+ * the psock was never spliced.
+ *
+ * Note: the sk_psock_splice allocation is NOT freed here - it lives until
+ * sk_psock_destroy. That keeps sender/receiver fast paths free of
+ * lifetime dances.
+ */
+void tcp_bpf_splice_unpair(struct sk_psock *psock)
+{
+	struct sk_psock_splice *self_s, *peer_s;
+	struct sk_psock *peer;
+	bool was_paired = false;
+
+	self_s = rcu_dereference_protected(psock->splice, 1);
+	if (!self_s)
+		return;
+
+	rcu_read_lock();
+	peer = rcu_dereference(self_s->peer);
+	if (!peer) {
+		rcu_read_unlock();
+		return;
+	}
+	if (!sk_psock_get(peer->sk)) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+
+	peer_s = rcu_dereference_protected(peer->splice, 1);
+	if (!peer_s) {
+		sk_psock_put(peer->sk, peer);
+		return;
+	}
+
+	splice_lock_pair(self_s, peer_s);
+	if (self_s->peer == peer && peer_s->peer == psock) {
+		rcu_assign_pointer(self_s->peer, NULL);
+		rcu_assign_pointer(peer_s->peer, NULL);
+		was_paired = true;
+	}
+	splice_unlock_pair(self_s, peer_s);
+
+	/* Wake any blocked rendezvous waiters on either side. They will
+	 * re-check the predicate, see splice->peer == NULL, and exit.
+	 */
+	splice_wake(psock->sk);
+	splice_wake(peer->sk);
+
+	if (was_paired) {
+		/* Drop the pair's psock references. Ring buffers are NOT
+		 * freed here: a recvmsg may be mid-splice_ring_read() on
+		 * either side, holding only sk_psock_get() - it does not
+		 * keep ring_buf alive. Defer the free_pages() to
+		 * tcp_bpf_splice_destroy(), which runs after psock teardown
+		 * has drained all callers.
+		 */
+		sk_psock_put(peer->sk, peer);
+		sk_psock_put(psock->sk, psock);
+	}
+	sk_psock_put(peer->sk, peer);
+}
+EXPORT_SYMBOL_GPL(tcp_bpf_splice_unpair);
+
+void tcp_bpf_splice_destroy(struct sk_psock *psock)
+{
+	struct sk_psock_splice *s;
+
+	/* Kill the ring ref; splice_ring_ref_release() frees the ring and s
+	 * once any in-flight cross-socket sender has dropped its hold.
+	 */
+	s = rcu_dereference_protected(psock->splice, 1);
+	if (s)
+		percpu_ref_kill(&s->ring_ref);
+}
+EXPORT_SYMBOL_GPL(tcp_bpf_splice_destroy);
+
+/* The PASSIVE_ESTABLISHED_CB fires BEFORE the kernel transitions the
+ * accepted child's state from TCP_SYN_RECV to TCP_ESTABLISHED. Accept
+ * SYN_RECV here since we know the callback contract guarantees
+ * imminent ESTABLISHED.
+ */
+static bool splice_state_ok(int state)
+{
+	return state == TCP_ESTABLISHED || state == TCP_SYN_RECV;
+}
+
+static int splice_validate(struct sock *a, struct sock *b)
+{
+	struct tcp_sock *ta = tcp_sk(a), *tb = tcp_sk(b);
+
+	if (a->sk_family != b->sk_family)
+		return -EINVAL;
+	if (a->sk_protocol != IPPROTO_TCP || b->sk_protocol != IPPROTO_TCP)
+		return -EINVAL;
+	if (!splice_state_ok(a->sk_state) || !splice_state_ok(b->sk_state))
+		return -EINVAL;
+	if (ta->repair || tb->repair)
+		return -EINVAL;
+	if (ta->urg_data || tb->urg_data)
+		return -EINVAL;
+	return 0;
+}
+
+static int splice_ring_alloc(struct sk_psock_splice *s)
+{
+	void *buf;
+
+	if (READ_ONCE(s->ring_buf))
+		return 0;
+
+	buf = (void *)__get_free_pages(GFP_ATOMIC | __GFP_NOWARN,
+				       get_order(SPLICE_RING_SIZE));
+	if (!buf)
+		return -ENOMEM;
+
+	spin_lock_bh(&s->lock);
+	if (s->ring_buf) {
+		spin_unlock_bh(&s->lock);
+		free_pages((unsigned long)buf, get_order(SPLICE_RING_SIZE));
+		return 0;
+	}
+	s->ring_buf    = buf;
+	s->ring_size   = SPLICE_RING_SIZE;
+	s->ring_head   = 0;
+	s->ring_tail   = 0;
+	s->cached_tail = 0;
+	spin_unlock_bh(&s->lock);
+	return 0;
+}
+
+static void splice_ring_free(struct sk_psock_splice *s)
+{
+	void *buf;
+
+	spin_lock_bh(&s->lock);
+	buf = s->ring_buf;
+	s->ring_buf    = NULL;
+	s->ring_size   = 0;
+	s->ring_head   = 0;
+	s->ring_tail   = 0;
+	s->cached_tail = 0;
+	spin_unlock_bh(&s->lock);
+
+	if (buf)
+		free_pages((unsigned long)buf, get_order(SPLICE_RING_SIZE));
+}
+
+static size_t splice_ring_write(struct sk_psock_splice *s,
+				struct iov_iter *from, size_t size)
+{
+	unsigned long head, tail, mask;
+	size_t avail, want, to_end, first, second, done;
+
+	if (!s->ring_buf)
+		return 0;
+
+	mask = s->ring_size - 1;
+	head = s->ring_head;
+	/* Use the producer's cached_tail, refreshed by splice_ring_space()
+	 * earlier in this same send. It is conservative - the real ring_tail
+	 * only advances - so the free space computed here never exceeds the
+	 * true free space, and we avoid a second cross-CPU ring_tail read.
+	 */
+	tail = s->cached_tail;
+	avail = CIRC_SPACE(head, tail, s->ring_size);
+	want = min_t(size_t, size, avail);
+	if (!want)
+		return 0;
+
+	to_end = s->ring_size - (head & mask);
+	first  = min_t(size_t, want, to_end);
+
+	done = copy_from_iter(s->ring_buf + (head & mask), first, from);
+	if (done < first) {
+		/* Publish data before head advance. */
+		smp_store_release(&s->ring_head, head + done);
+		return done;
+	}
+	second = want - first;
+	if (second) {
+		done = copy_from_iter(s->ring_buf, second, from);
+		/* Publish data before head advance. */
+		smp_store_release(&s->ring_head, head + first + done);
+		return first + done;
+	}
+	/* Publish data before head advance. */
+	smp_store_release(&s->ring_head, head + first);
+	return first;
+}
+
+static size_t splice_ring_space(struct sk_psock_splice *s)
+{
+	unsigned long head = s->ring_head;
+	size_t space = CIRC_SPACE(head, s->cached_tail, s->ring_size);
+
+	if (space)
+		return space;
+	/* Cache exhausted; refresh from the consumer-owned cursor - the only
+	 * cross-CPU ring_tail read. Pairs with smp_store_release(&ring_tail).
+	 */
+	s->cached_tail = smp_load_acquire(&s->ring_tail);
+	return CIRC_SPACE(head, s->cached_tail, s->ring_size);
+}
+
+static size_t splice_ring_read(struct sk_psock_splice *s,
+			       struct iov_iter *to, size_t size)
+{
+	unsigned long head, tail, mask;
+	size_t have, want, to_end, first, second, done;
+
+	if (!s->ring_buf)
+		return 0;
+
+	mask = s->ring_size - 1;
+	tail = s->ring_tail;
+	/* Pairs with smp_store_release(&ring_head) in splice_ring_write():
+	 * ensure we read producer's data after observing the head advance.
+	 */
+	head = smp_load_acquire(&s->ring_head);
+	have = CIRC_CNT(head, tail, s->ring_size);
+	want = min_t(size_t, size, have);
+	if (!want)
+		return 0;
+
+	to_end = s->ring_size - (tail & mask);
+	first  = min_t(size_t, want, to_end);
+
+	done = copy_to_iter(s->ring_buf + (tail & mask), first, to);
+	if (done < first) {
+		/* Release: free slots before the producer sees the advance. */
+		smp_store_release(&s->ring_tail, tail + done);
+		return done;
+	}
+	second = want - first;
+	if (second) {
+		done = copy_to_iter(s->ring_buf, second, to);
+		/* Release: free slots before the producer sees the advance. */
+		smp_store_release(&s->ring_tail, tail + first + done);
+		return first + done;
+	}
+	/* Release: free slots before the producer sees the advance. */
+	smp_store_release(&s->ring_tail, tail + first);
+	return first;
+}
+
+static bool splice_ring_has_data(const struct sk_psock_splice *s)
+{
+	if (!s->ring_buf)
+		return false;
+	/* Acquire ring_head so any data published by the producer is
+	 * visible if we go on to read it after this check.
+	 */
+	return CIRC_CNT(smp_load_acquire(&s->ring_head),
+			READ_ONCE(s->ring_tail),
+			s->ring_size) > 0;
+}
+
+static bool splice_recv_ready(struct sock *sk, struct sk_psock_splice *s)
+{
+	return splice_ring_has_data(s) ||
+	       !skb_queue_empty(&sk->sk_receive_queue) ||
+	       READ_ONCE(sk->sk_err) ||
+	       (READ_ONCE(sk->sk_shutdown) & RCV_SHUTDOWN) ||
+	       !rcu_access_pointer(s->peer);
+}
+
+static long splice_recv_wait(struct sock *sk, struct sk_psock_splice *s,
+			     long timeo)
+{
+	return wait_event_interruptible_timeout(*sk_sleep(sk),
+					splice_recv_ready(sk, s), timeo);
+}
+
+/* prot->sock_is_readable for paired-splice sockets. tcp_stream_is_readable()
+ * (via tcp_poll() / select() / epoll) consults this to mark POLLIN when
+ * sk_receive_queue is empty - we must also report data sitting in the
+ * splice ring, otherwise poll-driven readers wait forever despite the
+ * sender having produced bytes.
+ */
+static bool tcp_bpf_is_readable(struct sock *sk)
+{
+	struct sk_psock_splice *s;
+	struct sk_psock *psock;
+	bool readable = false;
+
+	rcu_read_lock();
+	psock = sk_psock(sk);
+	if (psock) {
+		s = rcu_dereference(psock->splice);
+		if (s && splice_ring_has_data(s))
+			readable = true;
+		else
+			readable = !list_empty(&psock->ingress_msg);
+	}
+	rcu_read_unlock();
+	return readable;
+}
+
+/*
+ * Drain the ring or sleep until the sender publishes more data.
+ * A spurious wake loops back and re-waits rather than returning 0,
+ * because the dispatcher's TCP/sk_msg fallback is keyed on
+ * sk_receive_queue / psock->ingress_msg - neither observes the ring,
+ * so returning 0 with no error would deadlock the caller in
+ * tcp_msg_wait_data() that the sender's next splice_wake_sync()
+ * cannot satisfy.
+ *
+ * Returning 0 is reserved for: EOF (peer shutdown), pair gone, or
+ * sk_receive_queue gained bytes (sender dropped back to tcp_sendmsg,
+ * defer to the TCP path). Errors are reported via *err.
+ *
+ * Caller must NOT hold sk's socket lock - this function may sleep.
+ */
+static int tcp_bpf_splice_recvmsg(struct sock *sk,
+				  struct sk_psock *psock,
+				  struct msghdr *msg, size_t len,
+				  int flags, int *err)
+{
+	struct sk_psock_splice *s;
+	size_t copied;
+	long timeo;
+
+	*err = 0;
+	/* PEEK is not implemented against the ring (no peek-without-advance
+	 * helper). Return 0 with no error so the dispatcher defers to the
+	 * TCP path; ring contents are invisible to PEEK but the socket
+	 * continues to work for normal apps.
+	 */
+	if (flags & MSG_PEEK)
+		return 0;
+
+	s = rcu_dereference_protected(psock->splice, 1);
+	if (!s)
+		return 0;
+
+	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+
+	for (;;) {
+		copied = splice_ring_read(s, &msg->msg_iter, len);
+		if (copied)
+			return copied;
+
+		/* Stream-ordering: if the sender ever dropped back to
+		 * tcp_sendmsg, those bytes are now in sk_receive_queue
+		 * and predate any future ring writes (sender only writes
+		 * to the ring when peer rcv_queue is empty).
+		 */
+		if (!skb_queue_empty(&sk->sk_receive_queue))
+			return 0;
+
+		if (sk->sk_err) {
+			*err = -sk->sk_err;
+			return 0;
+		}
+		if (sk->sk_shutdown & RCV_SHUTDOWN)
+			return 0; /* EOF */
+		if (!rcu_access_pointer(s->peer))
+			return 0; /* Pair gone */
+		if (signal_pending(current)) {
+			*err = sock_intr_errno(timeo);
+			return 0;
+		}
+		if (!timeo) {
+			*err = -EAGAIN;
+			return 0;
+		}
+
+		timeo = splice_recv_wait(sk, s, timeo);
+	}
+}
+
+static int splice_send_ring(struct sock *sk, struct sk_psock *psock,
+			    struct msghdr *msg, size_t size, int flags)
+{
+	struct sk_psock_splice *self_s, *peer_s;
+	struct sk_psock *peer;
+	int total = 0;
+
+	if (msg->msg_flags & MSG_OOB)
+		return 0;
+
+	self_s = rcu_dereference_protected(psock->splice, 1);
+	if (!self_s)
+		return 0;
+
+	while (size > 0) {
+		size_t done, space = 0;
+
+		/* All peer / peer->sk accesses happen under RCU. If the ring
+		 * has space, grab the peer's ring_ref before dropping RCU: that
+		 * pins peer_s (and its ring) so the copy below can run outside
+		 * RCU and fault/sleep normally. peer_sk is *not* pinned by the
+		 * ref, so it must not be touched after rcu_read_unlock().
+		 */
+		peer_s = NULL;
+		rcu_read_lock();
+		peer = rcu_dereference(self_s->peer);
+		if (peer) {
+			struct sock *peer_sk = peer->sk;
+			struct sk_psock_splice *ps = rcu_dereference(peer->splice);
+
+			if (ps && READ_ONCE(ps->ring_buf) &&
+			    !sk->sk_err && !(sk->sk_shutdown & SEND_SHUTDOWN) &&
+			    skb_queue_empty(&peer_sk->sk_receive_queue)) {
+				space = splice_ring_space(ps);
+				if (space && percpu_ref_tryget_live(&ps->ring_ref))
+					peer_s = ps;
+			}
+		}
+		rcu_read_unlock();
+		if (!peer_s)
+			break;
+
+		/* Holding peer_s->ring_ref: peer_s and its ring stay alive.
+		 * The copy touches only the ring, never peer_sk, so a normal
+		 * faulting copy is safe here.
+		 */
+		done = splice_ring_write(peer_s, &msg->msg_iter,
+					 min(size, space));
+		percpu_ref_put(&peer_s->ring_ref);
+
+		if (!done)
+			break;
+		total += done;
+		size  -= done;
+	}
+
+	/* Wake exactly once, after the loop, re-deref'ing peer under RCU.
+	 * Doing this inside the loop would carry the _sync hint repeatedly
+	 * and cost a redundant wake per wraparound iteration.
+	 */
+	if (total) {
+		rcu_read_lock();
+		peer = rcu_dereference(self_s->peer);
+		if (peer)
+			splice_wake_sync(peer->sk);
+		rcu_read_unlock();
+	}
+	return total;
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_sock_splice_pair - pair two stream sockets for opportunistic
+ *			  loopback splice.
+ * @peer:  the other socket, retrieved via sockhash lookup. This kfunc is
+ *	   KF_RELEASE: it consumes the reference the sockhash
+ *	   bpf_map_lookup_elem acquired on @peer (a sockmap/sockhash lookup
+ *	   is an acquire - see is_acquire_function() in the verifier).
+ *	   Consuming it here is required, not merely convenient: a sock_ops
+ *	   program cannot call bpf_sk_release (the helper is not available
+ *	   to that program type), so a release kfunc is the only way the
+ *	   program can avoid leaking the acquired reference.
+ * @skops: sock_ops context; ctx->sk is one side of the pair.
+ *
+ * Atomically installs the splice peering on both sides. Both sockets
+ * must be SOCK_STREAM, of the same address family, with psocks attached
+ * (typically via prior bpf_sock_hash_update), and neither already
+ * paired. Currently only TCP_ESTABLISHED is accepted; AF_UNIX
+ * SOCK_STREAM support is planned (the generic name reflects that
+ * extension path).
+ *
+ * After this call, sendmsg copies the payload into the peer's in-kernel
+ * byte ring; any bytes the ring did not accept (ring full, or the peer's
+ * TCP rcv_queue is non-empty) fall back to the normal TCP send path so
+ * the sender never blocks. Recvmsg drains the ring and defers to the TCP
+ * receive path whenever sk_receive_queue holds bytes, which preserves
+ * stream ordering. No skb, no sk_msg, and no verdict-program involvement
+ * on the splice fast path.
+ *
+ * Pairing is torn down automatically on close, disconnect, shutdown, or
+ * RST.
+ *
+ * Return: 0 on success; -EEXIST if either side is already paired (race
+ * loser); -EINVAL on state validation failure; -ENOENT if no psock
+ * exists on either side; -ENOMEM on splice-state allocation failure.
+ */
+__bpf_kfunc int bpf_sock_splice_pair(struct sock *peer,
+				     struct bpf_sock_ops_kern *skops)
+{
+	struct sk_psock_splice *self_s, *peer_s;
+	struct sk_psock *p_self, *p_peer;
+	struct sock *sk;
+	int ret;
+
+	if (!skops || !peer) {
+		ret = -EINVAL;
+		goto out_release;
+	}
+	sk = skops->sk;
+	if (!sk || sk == peer) {
+		ret = -EINVAL;
+		goto out_release;
+	}
+
+	ret = splice_validate(sk, peer);
+	if (ret)
+		goto out_release;
+
+	p_self = sk_psock_get(sk);
+	if (!p_self) {
+		ret = -ENOENT;
+		goto out_release;
+	}
+	p_peer = sk_psock_get(peer);
+	if (!p_peer) {
+		sk_psock_put(sk, p_self);
+		ret = -ENOENT;
+		goto out_release;
+	}
+
+	self_s = splice_get_or_alloc(p_self);
+	peer_s = self_s ? splice_get_or_alloc(p_peer) : NULL;
+	if (!self_s || !peer_s) {
+		/* If self_s succeeded but peer_s failed, self_s stays
+		 * attached to p_self; it isn't leaked (freed at psock
+		 * destroy) and is reusable for a future pair attempt.
+		 */
+		ret = -ENOMEM;
+		goto out_put;
+	}
+
+	if (splice_ring_alloc(self_s) || splice_ring_alloc(peer_s)) {
+		ret = -ENOMEM;
+		goto out_put;
+	}
+
+	splice_lock_pair(self_s, peer_s);
+	if (self_s->peer || peer_s->peer) {
+		ret = -EEXIST;
+		goto out_unlock;
+	}
+
+	/* Each side keeps a psock ref on the other for the duration. */
+	if (!sk_psock_get(sk)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+	if (!sk_psock_get(peer)) {
+		sk_psock_put(sk, p_self);
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+	rcu_assign_pointer(self_s->peer, p_peer);
+	rcu_assign_pointer(peer_s->peer, p_self);
+	ret = 0;
+
+out_unlock:
+	splice_unlock_pair(self_s, peer_s);
+out_put:
+	sk_psock_put(peer, p_peer);
+	sk_psock_put(sk, p_self);
+out_release:
+	/* KF_RELEASE: consume the caller's refcount on @peer (taken by
+	 * bpf_map_lookup_elem on the sockhash). All exit paths come
+	 * through here.
+	 */
+	if (peer && sk_is_refcounted(peer))
+		sock_gen_put(peer);
+	return ret;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_tcp_splice_kfunc_set)
+BTF_ID_FLAGS(func, bpf_sock_splice_pair, KF_RELEASE)
+BTF_KFUNCS_END(bpf_tcp_splice_kfunc_set)
+
+static const struct btf_kfunc_id_set bpf_tcp_splice_kfunc_id_set = {
+	.owner = THIS_MODULE,
+	.set   = &bpf_tcp_splice_kfunc_set,
+};
+
+static int __init bpf_tcp_splice_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS,
+					 &bpf_tcp_splice_kfunc_id_set);
+}
+late_initcall(bpf_tcp_splice_init);
+
 #endif /* CONFIG_BPF_SYSCALL */
-- 
2.43.0



* [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver
  2026-06-12  1:14 [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Cong Wang
  2026-06-12  1:14 ` [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice Cong Wang
@ 2026-06-12  1:14 ` Cong Wang
  2026-06-12  3:29   ` sashiko-bot
  2026-06-12  1:14 ` [RFC PATCH bpf-next 3/5] selftests/bpf: add tcp_splice basic round-trip test Cong Wang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Cong Wang @ 2026-06-12  1:14 UTC (permalink / raw)
  To: netdev
  Cc: bpf, John Fastabend, Jakub Sitnicki, Jiayuan Chen, hemanthmalla,
	zijianzhang, Cong Wang, Cong Wang

When a paired-splice receiver finds the ring empty it parks on the
socket waitqueue. For latency-bound synchronous-RPC workloads that is
a wakeup per request-response cycle, which dominates the per-cycle
cost.

Add an optional bounded busy-poll of the ring before parking, reusing
the socket's SO_BUSY_POLL budget (sk_ll_usec) via sk_can_busy_loop()
and sk_busy_loop_timeout(). The default budget of 0 leaves
sk_can_busy_loop() false, so this is a no-op unless the application
(or net.core.busy_read) opted in.

Unlike sk_busy_loop() / napi_busy_loop(), splice_busy_loop() spins on
the in-kernel ring directly rather than polling a NAPI instance, so it
is effective on loopback - which delivers via the per-CPU backlog and
exposes no pollable napi_id. Keeping the receiver hot lets a
synchronous sender's small writes accumulate in the ring without a
wakeup per message; this is what turns the latency-bound TCP_RR case
into a large win once enabled.
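
The spin-then-park idea can be sketched in userspace as follows. This is a minimal illustration, not the kernel code: `spin_for_data()`, `now_usec()` and the `atomic_bool` flag are hypothetical stand-ins for splice_busy_loop(), busy_loop_current_time() and splice_ring_has_data(), and the signal and terminal-condition checks are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock in microseconds. */
static uint64_t now_usec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Hypothetical stand-in for splice_busy_loop(): poll a "ring has data"
 * flag for at most budget_usec microseconds. Returns true if data showed
 * up during the spin, false if the caller should go on to park.
 * A zero budget skips the spin entirely, mirroring sk_can_busy_loop()
 * being false when SO_BUSY_POLL was never set.
 */
static bool spin_for_data(atomic_bool *has_data, uint64_t budget_usec)
{
	uint64_t start;

	assert(has_data);
	if (!budget_usec)
		return false;

	start = now_usec();
	do {
		if (atomic_load_explicit(has_data, memory_order_acquire))
			return true;
	} while (now_usec() - start < budget_usec);

	return false;
}
```

A receiver would call `spin_for_data(&flag, 50)` after finding the ring empty and fall back to a blocking wait only when it returns false, so a sender that publishes within the budget never pays for a wakeup.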

A BPF program enables the budget by setting SO_BUSY_POLL via
bpf_setsockopt() (see the following patch). netperf, pinned CPUs,
3x10s, 50 us budget, baseline TCP vs splice + busy-poll:

  TCP_RR (loopback)    1 B    111.9k -> 1113.8k tps  (9.96x)
                       64 B   111.7k -> 1073.3k tps  (9.61x)
                       1 KB   106.1k ->  713.0k tps  (6.72x)
                       16 KB   40.3k ->  123.7k tps  (3.07x)
                       64 KB   17.8k ->   40.5k tps  (2.28x)

  TCP_RR (container)   1 B    105.6k -> 1103.7k tps  (10.45x)
                       64 B   105.5k -> 1103.9k tps  (10.46x)
                       1 KB   100.4k ->  704.9k tps  (7.02x)
                       16 KB   45.1k ->  114.8k tps  (2.54x)
                       64 KB   18.2k ->   38.8k tps  (2.13x)

Busy polling accounts for a further ~4.2x on top of splice alone in the
1 B loopback case (splice without it is 267.0k tps; see the splice patch).
Baseline TCP is unchanged by
busy_read on both loopback and default (non-XDP) veth: both deliver via
the per-CPU backlog, which has no pollable napi_id, so SO_BUSY_POLL is a
no-op for them (the container baseline TCP_RR measures the same at
busy_read 0 and 50). The gain therefore comes from the splice ring spin,
not from busy_read itself.
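
The split between the two effects can be checked from the published 1 B loopback figures (111.9k baseline, 267.0k splice only, 1113.8k splice + busy-poll). A small sketch, where `speedup()` is a hypothetical helper and the inputs are the rounded numbers quoted above:

```c
#include <assert.h>

/* Throughput ratio between two runs; both arguments in the same unit. */
static double speedup(double baseline_tps, double improved_tps)
{
	assert(baseline_tps > 0.0);
	return improved_tps / baseline_tps;
}
```

With those inputs, `speedup(111.9, 267.0)` is roughly 2.39 (ring alone) and `speedup(267.0, 1113.8)` is roughly 4.17 (busy-poll on top); their product is about 9.95, in line with the reported 9.96x given that the table entries are rounded.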

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 net/ipv4/tcp_bpf.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 549f37077244..9c4421a74225 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -13,6 +13,7 @@
 #include <linux/util_macros.h>
 #include <linux/percpu-refcount.h>
 
+#include <net/busy_poll.h>
 #include <net/inet_common.h>
 #include <net/inet_sock.h>
 #include <net/tls.h>
@@ -1255,6 +1256,33 @@ static long splice_recv_wait(struct sock *sk, struct sk_psock_splice *s,
 					splice_recv_ready(sk, s), timeo);
 }
 
+/* Bounded busy-poll on the ring before parking the receiver. Reuses the
+ * socket's SO_BUSY_POLL budget (sk_ll_usec) via sk_can_busy_loop() and
+ * sk_busy_loop_timeout(); the default budget of 0 makes sk_can_busy_loop()
+ * false so this is a no-op unless the application (or net.core.busy_read)
+ * opted in.
+ *
+ * Unlike sk_busy_loop() / napi_busy_loop(), this spins on the in-kernel
+ * ring directly rather than polling a NAPI instance, so it is effective on
+ * loopback - which delivers via the per-CPU backlog and exposes no
+ * pollable napi_id. Keeping the receiver hot lets a synchronous sender's
+ * small writes accumulate in the ring without a wakeup per message.
+ */
+static void splice_busy_loop(struct sock *sk, struct sk_psock_splice *s)
+{
+	unsigned long start;
+
+	if (!sk_can_busy_loop(sk))
+		return;
+
+	start = busy_loop_current_time();
+	do {
+		cpu_relax();
+		if (splice_recv_ready(sk, s) || signal_pending(current))
+			return;
+	} while (!sk_busy_loop_timeout(sk, start));
+}
+
 /* prot->sock_is_readable for paired-splice sockets. tcp_stream_is_readable()
  * (via tcp_poll() / select() / epoll) consults this to mark POLLIN when
  * sk_receive_queue is empty - we must also report data sitting in the
@@ -1349,6 +1377,16 @@ static int tcp_bpf_splice_recvmsg(struct sock *sk,
 			return 0;
 		}
 
+		/* Spin on the ring for the SO_BUSY_POLL budget before
+		 * sleeping. If the spin observes data, re-read from the
+		 * loop head; otherwise (budget expired or a terminal
+		 * condition) proceed to park - splice_recv_wait() returns
+		 * immediately for terminal conditions.
+		 */
+		splice_busy_loop(sk, s);
+		if (splice_ring_has_data(s))
+			continue;
+
 		timeo = splice_recv_wait(sk, s, timeo);
 	}
 }
-- 
2.43.0



* [RFC PATCH bpf-next 3/5] selftests/bpf: add tcp_splice basic round-trip test
  2026-06-12  1:14 [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Cong Wang
  2026-06-12  1:14 ` [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice Cong Wang
  2026-06-12  1:14 ` [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver Cong Wang
@ 2026-06-12  1:14 ` Cong Wang
  2026-06-12  1:28   ` sashiko-bot
  2026-06-12  1:14 ` [RFC PATCH bpf-next 4/5] bpf: allow SO_BUSY_POLL in bpf_setsockopt() Cong Wang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Cong Wang @ 2026-06-12  1:14 UTC (permalink / raw)
  To: netdev
  Cc: bpf, John Fastabend, Jakub Sitnicki, Jiayuan Chen, hemanthmalla,
	zijianzhang, Cong Wang, Cong Wang

This test loads a sock_ops BPF program that, on each ESTABLISHED
callback, inserts its own socket into a sockhash keyed by the local
4-tuple, looks up the peer using the swapped 4-tuple, and calls the new
bpf_sock_splice_pair kfunc on whichever peer it finds. Two counters
track the outcome: pair_ok counts calls that returned 0, and
pair_other_err counts failures other than the expected -EEXIST of a
race loser.

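Userspace creates a loopback TCP pair, waits for both ESTABLISHED
callbacks to land, then verifies pair_ok >= 1 and pair_other_err == 0.
A receiver thread blocks in recv() before the main thread sends; the
test asserts that the bytes arrive intact over the paired connection.
A second subtest, bidir_write, has both sides send() before either
calls recv() and checks that both writes complete and both banners
arrive intact.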
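
The swapped-tuple lookup described above can be sketched in plain C. `struct endpoint` is a hypothetical stand-in for the bpf_sock_ops fields the program reads; the remote port is modelled as `htonl(port)`, which is an assumption about how the context exposes it that matches the `bpf_ntohl()` conversion used in the BPF program.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the selftest's struct flow_key (12 bytes, no padding). */
struct flow_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* Hypothetical stand-in for the sock_ops context fields. */
struct endpoint {
	uint32_t local_ip4;	/* network byte order */
	uint32_t remote_ip4;	/* network byte order */
	uint32_t local_port;	/* host byte order */
	uint32_t remote_port;	/* network-order port, modelled as htonl(port) */
};

/* Build the endpoint's own key (swap == false) or its peer's (swap == true). */
static void mk_key(const struct endpoint *e, struct flow_key *k, bool swap)
{
	uint16_t lport = (uint16_t)e->local_port;
	uint16_t rport = (uint16_t)ntohl(e->remote_port);

	assert(e && k);
	memset(k, 0, sizeof(*k));
	k->saddr = swap ? e->remote_ip4 : e->local_ip4;
	k->daddr = swap ? e->local_ip4 : e->remote_ip4;
	k->sport = swap ? rport : lport;
	k->dport = swap ? lport : rport;
}

static bool keys_match(const struct flow_key *a, const struct flow_key *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

The point of the scheme is that the key one side computes with the swap equals the key the other side inserted for itself, so whichever endpoint establishes second finds its peer in the map.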

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 .../selftests/bpf/prog_tests/tcp_splice.c     | 206 ++++++++++++++++++
 .../selftests/bpf/progs/test_tcp_splice.c     | 101 +++++++++
 2 files changed, 307 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_splice.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_splice.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_splice.c b/tools/testing/selftests/bpf/prog_tests/tcp_splice.c
new file mode 100644
index 000000000000..b80a1129c6aa
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_splice.c
@@ -0,0 +1,206 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <test_progs.h>
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+#include "test_tcp_splice.skel.h"
+
+#include <pthread.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#define MSG "hello rendezvous"
+#define CLIENT_BANNER "client-banner"
+#define SERVER_BANNER "server-banner"
+
+struct recv_arg {
+	int fd;
+	char buf[64];
+	int n;
+	int err;
+};
+
+static void *recv_thread(void *p)
+{
+	struct recv_arg *a = p;
+
+	a->n = recv(a->fd, a->buf, sizeof(a->buf) - 1, 0);
+	a->err = errno;
+	return NULL;
+}
+
+struct send_arg {
+	int fd;
+	const char *buf;
+	size_t len;
+	int n;
+	int err;
+};
+
+static void *send_thread(void *p)
+{
+	struct send_arg *a = p;
+
+	a->n = send(a->fd, a->buf, a->len, 0);
+	a->err = errno;
+	return NULL;
+}
+
+static int run_basic(int cgroup_fd, struct test_tcp_splice *skel)
+{
+	pthread_t tid;
+	struct recv_arg a = {};
+	int sfd = -1, cfd = -1, lfd = -1;
+	int n, err = -1;
+
+	lfd = start_server(AF_INET, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_GE(lfd, 0, "start_server"))
+		return -1;
+
+	cfd = connect_to_fd(lfd, 0);
+	if (!ASSERT_GE(cfd, 0, "connect_to_fd"))
+		goto out;
+
+	sfd = accept(lfd, NULL, NULL);
+	if (!ASSERT_GE(sfd, 0, "accept"))
+		goto out;
+
+	/* Give both ESTABLISHED sock_ops callbacks a moment to run. */
+	usleep(50 * 1000);
+
+	if (!ASSERT_GE(skel->bss->pair_ok, 1, "splice paired"))
+		goto out;
+	ASSERT_EQ(skel->bss->pair_other_err, 0, "no unexpected pair errors");
+
+	/* Drive the splice fast path: receiver blocks in recv() on the empty
+	 * ring, sender then writes into the ring and wakes it.
+	 */
+	a.fd = sfd;
+	if (!ASSERT_OK(pthread_create(&tid, NULL, recv_thread, &a),
+		       "pthread_create"))
+		goto out;
+	usleep(20 * 1000); /* let recv block */
+
+	n = send(cfd, MSG, strlen(MSG), 0);
+	ASSERT_EQ(n, (int)strlen(MSG), "send length");
+
+	pthread_join(tid, NULL);
+	ASSERT_EQ(a.n, (int)strlen(MSG), "recv length");
+	a.buf[a.n > 0 ? a.n : 0] = 0;
+	ASSERT_STREQ(a.buf, MSG, "recv contents");
+
+	err = 0;
+out:
+	if (cfd >= 0)
+		close(cfd);
+	if (sfd >= 0)
+		close(sfd);
+	if (lfd >= 0)
+		close(lfd);
+	return err;
+}
+
+/* Bidirectional-write deadlock-avoidance test.
+ *
+ * Both sides issue send() before either calls recv(), the classic
+ * pattern that used to deadlock under synchronous rendezvous (and
+ * the actual cause of "kex_exchange_identification: write: Broken
+ * pipe" with SSH on loopback). The bounded-wait fallback in
+ * tcp_bpf_splice_sendmsg() must let both writes complete via the
+ * normal TCP path within ~1 ms, and the banners must arrive intact
+ * on the other side when recv() is called next.
+ */
+static int run_bidir_write(int cgroup_fd, struct test_tcp_splice *skel)
+{
+	pthread_t client_send_tid, server_send_tid;
+	struct send_arg cs = { .buf = CLIENT_BANNER,
+			       .len = sizeof(CLIENT_BANNER) - 1 };
+	struct send_arg ss = { .buf = SERVER_BANNER,
+			       .len = sizeof(SERVER_BANNER) - 1 };
+	struct recv_arg cr = {}, sr = {};
+	int sfd = -1, cfd = -1, lfd = -1;
+	int err = -1;
+
+	lfd = start_server(AF_INET, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_GE(lfd, 0, "start_server"))
+		return -1;
+	cfd = connect_to_fd(lfd, 0);
+	if (!ASSERT_GE(cfd, 0, "connect_to_fd"))
+		goto out;
+	sfd = accept(lfd, NULL, NULL);
+	if (!ASSERT_GE(sfd, 0, "accept"))
+		goto out;
+
+	usleep(50 * 1000); /* let pair complete */
+
+	/* Both sides write first, neither reads yet. Both must return
+	 * within bounded time (no deadlock).
+	 */
+	cs.fd = cfd;
+	ss.fd = sfd;
+	if (!ASSERT_OK(pthread_create(&client_send_tid, NULL, send_thread, &cs),
+		       "client send thread"))
+		goto out;
+	if (!ASSERT_OK(pthread_create(&server_send_tid, NULL, send_thread, &ss),
+		       "server send thread"))
+		goto out;
+
+	pthread_join(client_send_tid, NULL);
+	pthread_join(server_send_tid, NULL);
+	ASSERT_EQ(cs.n, (int)cs.len, "client send length");
+	ASSERT_EQ(ss.n, (int)ss.len, "server send length");
+
+	/* Now read on each side - the bytes the peer wrote should have
+	 * landed via the TCP fallback path.
+	 */
+	cr.fd = cfd;
+	cr.n = recv(cr.fd, cr.buf, sizeof(cr.buf) - 1, 0);
+	ASSERT_EQ(cr.n, (int)ss.len, "client recv length");
+	cr.buf[cr.n > 0 ? cr.n : 0] = 0;
+	ASSERT_STREQ(cr.buf, SERVER_BANNER, "client got server banner");
+
+	sr.fd = sfd;
+	sr.n = recv(sr.fd, sr.buf, sizeof(sr.buf) - 1, 0);
+	ASSERT_EQ(sr.n, (int)cs.len, "server recv length");
+	sr.buf[sr.n > 0 ? sr.n : 0] = 0;
+	ASSERT_STREQ(sr.buf, CLIENT_BANNER, "server got client banner");
+
+	err = 0;
+out:
+	if (cfd >= 0)
+		close(cfd);
+	if (sfd >= 0)
+		close(sfd);
+	if (lfd >= 0)
+		close(lfd);
+	return err;
+}
+
+void test_tcp_splice(void)
+{
+	struct test_tcp_splice *skel;
+	int cgroup_fd, prog_fd;
+
+	cgroup_fd = test__join_cgroup("/tcp_splice");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		return;
+
+	skel = test_tcp_splice__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_load"))
+		goto close_cgroup;
+
+	prog_fd = bpf_program__fd(skel->progs.sockops_splice);
+	if (!ASSERT_OK(bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_SOCK_OPS, 0),
+		       "attach sockops"))
+		goto destroy_skel;
+
+	if (test__start_subtest("basic"))
+		run_basic(cgroup_fd, skel);
+	if (test__start_subtest("bidir_write"))
+		run_bidir_write(cgroup_fd, skel);
+
+destroy_skel:
+	test_tcp_splice__destroy(skel);
+close_cgroup:
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_splice.c b/tools/testing/selftests/bpf/progs/test_tcp_splice.c
new file mode 100644
index 000000000000..09c7f0f9e311
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_splice.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Sock_ops BPF program that pairs locally-connected TCP sockets via the
+ * bpf_sock_splice_pair kfunc. Each side of an established loopback
+ * connection inserts itself into a sockhash keyed by its 4-tuple and
+ * looks up the peer using the swapped tuple. Whichever side finds the
+ * peer attempts to splice; the race loser sees -EEXIST.
+ */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct flow_key {
+	__u32	saddr;
+	__u32	daddr;
+	__u16	sport;
+	__u16	dport;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKHASH);
+	__uint(max_entries, 16);
+	__type(key, struct flow_key);
+	__type(value, __u64);
+} rendezvous SEC(".maps");
+
+int bpf_sock_splice_pair(struct sock *peer, struct bpf_sock_ops_kern *skops) __ksym;
+void *bpf_cast_to_kern_ctx(void *obj) __ksym;
+
+__u32 pair_ok;
+__u32 pair_other_err;
+
+/* IPv4 only: the verifier doesn't accept memcpy from sock_ops ctx
+ * because it lowers to "ctx + reg" pointer arithmetic. IPv6 support
+ * would need explicit field-by-field reads of local_ip6[i] /
+ * remote_ip6[i] at constant indices.
+ */
+static __always_inline void mk_key(struct bpf_sock_ops *s,
+				   struct flow_key *k, int swap)
+{
+	/* skops->local_port is already in host byte order. skops->remote_port
+	 * is laid out as the network-order 16-bit port in the upper half of
+	 * a u32 (see sock_ops_convert_ctx_access); bpf_ntohl produces the
+	 * host-order port directly - no further shift.
+	 */
+	__u16 lport = (__u16)s->local_port;
+	__u16 rport = bpf_ntohl(s->remote_port);
+
+	if (!swap) {
+		k->saddr = s->local_ip4;
+		k->daddr = s->remote_ip4;
+		k->sport = lport;
+		k->dport = rport;
+	} else {
+		k->saddr = s->remote_ip4;
+		k->daddr = s->local_ip4;
+		k->sport = rport;
+		k->dport = lport;
+	}
+}
+
+SEC("sockops")
+int sockops_splice(struct bpf_sock_ops *skops)
+{
+	struct flow_key self_key, peer_key;
+	struct bpf_sock *peer;
+	int ret;
+
+	if (skops->op != BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB &&
+	    skops->op != BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
+		return 0;
+	if (skops->family != 2 /* AF_INET */)
+		return 0;
+
+	mk_key(skops, &self_key, 0);
+	mk_key(skops, &peer_key, 1);
+
+	/* BPF_ANY: a reused 4-tuple after close (e.g. fast reconnect) must
+	 * overwrite the stale entry rather than silently fail.
+	 */
+	bpf_sock_hash_update(skops, &rendezvous, &self_key, BPF_ANY);
+
+	peer = bpf_map_lookup_elem(&rendezvous, &peer_key);
+	if (!peer)
+		return 0;
+
+	/* The sockhash bpf_map_lookup_elem above is an acquire, so @peer
+	 * carries a reference. A sock_ops program cannot call
+	 * bpf_sk_release, so the reference is handed to bpf_sock_splice_pair
+	 * which is KF_RELEASE and consumes it - no explicit release here,
+	 * and none is possible from this program type.
+	 */
+	ret = bpf_sock_splice_pair((struct sock *)peer,
+				   bpf_cast_to_kern_ctx(skops));
+	if (ret == 0)
+		__sync_fetch_and_add(&pair_ok, 1);
+	else if (ret != -17 /* -EEXIST: race loser, expected */)
+		__sync_fetch_and_add(&pair_other_err, 1);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.43.0



* [RFC PATCH bpf-next 4/5] bpf: allow SO_BUSY_POLL in bpf_setsockopt()
  2026-06-12  1:14 [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Cong Wang
                   ` (2 preceding siblings ...)
  2026-06-12  1:14 ` [RFC PATCH bpf-next 3/5] selftests/bpf: add tcp_splice basic round-trip test Cong Wang
@ 2026-06-12  1:14 ` Cong Wang
  2026-06-12  1:14 ` [RFC PATCH bpf-next 5/5] selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog Cong Wang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Cong Wang @ 2026-06-12  1:14 UTC (permalink / raw)
  To: netdev
  Cc: bpf, John Fastabend, Jakub Sitnicki, Jiayuan Chen, hemanthmalla,
	zijianzhang, Cong Wang, Cong Wang

Add SO_BUSY_POLL to the SOL_SOCKET allowlist in sol_socket_sockopt() so a
sock_ops or cgroup BPF program can enable busy polling on a socket (set
sk->sk_ll_usec) without an application setsockopt or the global
net.core.busy_read sysctl. SO_BUSY_POLL needs no CAP_NET_ADMIN in
sk_setsockopt(), so no privilege gating is added; the value is an int and
joins the existing optlen == sizeof(int) group.

This lets a BPF program opt specific flows into busy polling at the point
it has the context to decide. The TCP loopback splice path
(bpf_sock_splice_pair) uses it: the splice receiver busy-polls the ring
instead of parking, turning the latency-bound TCP_RR case into a large
win (numbers are in the splice busy-poll patch).
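
For comparison, this is roughly what the application-side opt-in looks like, i.e. the call that bpf_setsockopt() makes unnecessary. A minimal sketch: `set_busy_poll()` is a hypothetical helper, the fallback value 46 is the one the selftest in this series defines, and whether the kernel accepts the request depends on its configuration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46	/* same fallback value the selftest defines */
#endif

/* Hypothetical helper: request a busy-poll budget of usec microseconds
 * on fd. Returns 0 on success or a negative errno value.
 */
static int set_busy_poll(int fd, int usec)
{
	assert(usec >= 0);
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec)) < 0)
		return -errno;
	return 0;
}
```

An application would call `set_busy_poll(fd, 50)` on each connected socket; with this patch a sock_ops program can apply the same budget per flow without touching the application.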

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 net/core/filter.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 9590877b0714..302dfaf03f39 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5325,6 +5325,7 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
 	case SO_MAX_PACING_RATE:
 	case SO_BINDTOIFINDEX:
 	case SO_TXREHASH:
+	case SO_BUSY_POLL:
 	case SK_BPF_CB_FLAGS:
 		if (*optlen != sizeof(int))
 			return -EINVAL;
-- 
2.43.0



* [RFC PATCH bpf-next 5/5] selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog
  2026-06-12  1:14 [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Cong Wang
                   ` (3 preceding siblings ...)
  2026-06-12  1:14 ` [RFC PATCH bpf-next 4/5] bpf: allow SO_BUSY_POLL in bpf_setsockopt() Cong Wang
@ 2026-06-12  1:14 ` Cong Wang
  2026-06-12  1:26   ` sashiko-bot
  2026-06-12 16:01 ` [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Alexei Starovoitov
  2026-06-12 22:10 ` [syzbot ci] " syzbot ci
  6 siblings, 1 reply; 18+ messages in thread
From: Cong Wang @ 2026-06-12  1:14 UTC (permalink / raw)
  To: netdev
  Cc: bpf, John Fastabend, Jakub Sitnicki, Jiayuan Chen, hemanthmalla,
	zijianzhang, Cong Wang, Cong Wang

Set SO_BUSY_POLL (busy_poll_us) on each paired socket via
bpf_setsockopt() so the splice receiver busy-polls the ring instead of
parking - without net.core.busy_read or an application setsockopt.

The sock_ops prog runs for both the active and passive established
callbacks, so each endpoint sets its own socket. This is done before the
peer-not-found early return: pairing is asymmetric (only the second side
to establish finds a peer and calls bpf_sock_splice_pair), so setting it
only on the pairing side would leave the other end without busy-poll.
bpf_setsockopt acts on skops->sk; the peer sets itself on its own
callback. Busy polling is a receive-path optimization
(splice_busy_loop() in tcp_bpf_splice_recvmsg()); TCP is full-duplex so
both ends are receivers and both need it, which the per-endpoint setting
provides.

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 .../selftests/bpf/progs/test_tcp_splice.c     | 24 +++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/test_tcp_splice.c b/tools/testing/selftests/bpf/progs/test_tcp_splice.c
index 09c7f0f9e311..da43f00046c0 100644
--- a/tools/testing/selftests/bpf/progs/test_tcp_splice.c
+++ b/tools/testing/selftests/bpf/progs/test_tcp_splice.c
@@ -9,6 +9,13 @@
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
 
+#ifndef SOL_SOCKET
+#define SOL_SOCKET 1
+#endif
+#ifndef SO_BUSY_POLL
+#define SO_BUSY_POLL 46
+#endif
+
 struct flow_key {
 	__u32	saddr;
 	__u32	daddr;
@@ -29,6 +36,8 @@ void *bpf_cast_to_kern_ctx(void *obj) __ksym;
 __u32 pair_ok;
 __u32 pair_other_err;
 
+__u32 busy_poll_us;
+
 /* IPv4 only: the verifier doesn't accept memcpy from sock_ops ctx
  * because it lowers to "ctx + reg" pointer arithmetic. IPv6 support
  * would need explicit field-by-field reads of local_ip6[i] /
@@ -71,6 +80,21 @@ int sockops_splice(struct bpf_sock_ops *skops)
 	if (skops->family != 2 /* AF_INET */)
 		return 0;
 
+	/* Enable busy-poll on this socket. Both endpoints run this callback,
+	 * so each sets its own socket; this must happen here, before the
+	 * peer-not-found early return below, because pairing is asymmetric -
+	 * only the second side to establish finds a peer and calls
+	 * bpf_sock_splice_pair. Setting it only on the pairing side would
+	 * leave the other side without busy-poll. bpf_setsockopt acts on
+	 * skops->sk only - there is no variant to set the peer - but the peer
+	 * sets itself when its own ESTABLISHED callback fires.
+	 */
+	if (busy_poll_us) {
+		int us = busy_poll_us;
+
+		bpf_setsockopt(skops, SOL_SOCKET, SO_BUSY_POLL, &us, sizeof(us));
+	}
+
 	mk_key(skops, &self_key, 0);
 	mk_key(skops, &peer_key, 1);
 
-- 
2.43.0



* Re: [RFC PATCH bpf-next 5/5] selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog
  2026-06-12  1:14 ` [RFC PATCH bpf-next 5/5] selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog Cong Wang
@ 2026-06-12  1:26   ` sashiko-bot
  0 siblings, 0 replies; 18+ messages in thread
From: sashiko-bot @ 2026-06-12  1:26 UTC (permalink / raw)
  To: Cong Wang; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] The BPF global variable `busy_poll_us` is never initialized by the userspace test runner, rendering the newly added `SO_BUSY_POLL` test logic unreachable dead code.
- [Low] Multi-line comment does not follow BPF subsystem guidelines.
--

commit c9396935bece9a3118a940f3f318b60c9109d778
Author: Cong Wang <xiyou.wangcong@gmail.com>

selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog

Set SO_BUSY_POLL (busy_poll_us) on each paired socket via
bpf_setsockopt() so the splice receiver busy-polls the ring instead of
parking - without net.core.busy_read or an application setsockopt.

> diff --git a/tools/testing/selftests/bpf/progs/test_tcp_splice.c b/tools/testing/selftests/bpf/progs/test_tcp_splice.c
> index 09c7f0f9e3110..da43f00046c0f 100644
> --- a/tools/testing/selftests/bpf/progs/test_tcp_splice.c
> +++ b/tools/testing/selftests/bpf/progs/test_tcp_splice.c

[ ... ]

> @@ -71,6 +80,21 @@ int sockops_splice(struct bpf_sock_ops *skops)
>  	if (skops->family != 2 /* AF_INET */)
>  		return 0;
>  
> +	/* Enable busy-poll on this socket. Both endpoints run this callback,
> +	 * so each sets its own socket; this must happen here, before the

[Severity: Low]
This isn't a bug, but the BPF subsystem multi-line comment style requires
the opening /* to be on its own line. Should this be updated to match the
modern kernel comment style?

> +	 * peer-not-found early return below, because pairing is asymmetric -
> +	 * only the second side to establish finds a peer and calls
> +	 * bpf_sock_splice_pair. Setting it only on the pairing side would
> +	 * leave the other side without busy-poll. bpf_setsockopt acts on
> +	 * skops->sk only - there is no variant to set the peer - but the peer
> +	 * sets itself when its own ESTABLISHED callback fires.
> +	 */
> +	if (busy_poll_us) {
> +		int us = busy_poll_us;
> +
> +		bpf_setsockopt(skops, SOL_SOCKET, SO_BUSY_POLL, &us, sizeof(us));
> +	}

[Severity: Medium]
Is the global variable busy_poll_us ever initialized by the userspace test
runner?

Since BPF .bss variables are zero-initialized by default, it appears that
skel->bss->busy_poll_us is never assigned a value in the userspace setup
code (like prog_tests/tcp_splice.c). Does this mean the if (busy_poll_us)
condition will always evaluate to false, leaving the new SO_BUSY_POLL logic
unreachable and untested?

> +
>  	mk_key(skops, &self_key, 0);
>  	mk_key(skops, &peer_key, 1);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612011452.134466-1-xiyou.wangcong@gmail.com?part=5


* Re: [RFC PATCH bpf-next 3/5] selftests/bpf: add tcp_splice basic round-trip test
  2026-06-12  1:14 ` [RFC PATCH bpf-next 3/5] selftests/bpf: add tcp_splice basic round-trip test Cong Wang
@ 2026-06-12  1:28   ` sashiko-bot
  0 siblings, 0 replies; 18+ messages in thread
From: sashiko-bot @ 2026-06-12  1:28 UTC (permalink / raw)
  To: Cong Wang; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] If the second `pthread_create()` fails in `run_bidir_write`, the function returns early without joining the first thread, leading to a leaked thread and a Use-After-Free on the stack.
- [Low] Multi-line comments violate the required BPF subsystem comment style.
--

commit 2de2ff2207c8434507b69004f912f22fbc96a325
Author: Cong Wang <xiyou.wangcong@gmail.com>

selftests/bpf: add tcp_splice basic round-trip test

Adds a test for the tcp_splice BPF functionality using a sock_ops program to
pair local TCP sockets via bpf_sock_splice_pair. The test includes basic
round-trip and bidirectional-write subtests.

> diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_splice.c b/tools/testing/selftests/bpf/prog_tests/tcp_splice.c
> new file mode 100644
> index 0000000000000..b80a1129c6aa8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/tcp_splice.c
[ ... ]
> +	/* Drive the splice fast path: receiver enters recv() and publishes
> +	 * its bvec, sender then writes directly into it.
> +	 */

[Severity: Low]
This isn't a bug, but does this comment style violate the BPF subsystem
guidelines? Multi-line comments should have the opening /* on its own line.

[ ... ]
> +/* Bidirectional-write deadlock-avoidance test.
> + *
> + * Both sides issue send() before either calls recv(), the classic

[Severity: Low]
This isn't a bug, but shouldn't the opening /* be on its own line
for multi-line comments in the BPF subsystem?

> + * pattern that used to deadlock under synchronous rendezvous (and
> + * the actual cause of "kex_exchange_identification: write: Broken
> + * pipe" with SSH on loopback). The bounded-wait fallback in
> + * tcp_bpf_splice_sendmsg() must let both writes complete via the
> + * normal TCP path within ~1 ms, and the banners must arrive intact
> + * on the other side when recv() is called next.
> + */
> +static int run_bidir_write(int cgroup_fd, struct test_tcp_splice *skel)
> +{
[ ... ]
> +	/* Both sides write first, neither reads yet. Both must return
> +	 * within bounded time (no deadlock).
> +	 */
> +	cs.fd = cfd;
> +	ss.fd = sfd;
> +	if (!ASSERT_OK(pthread_create(&client_send_tid, NULL, send_thread, &cs),
> +		       "client send thread"))
> +		goto out;
> +	if (!ASSERT_OK(pthread_create(&server_send_tid, NULL, send_thread, &ss),
> +		       "server send thread"))
> +		goto out;

[Severity: Medium]
If the second pthread_create() fails, will this jump to out and return,
leaving client_send_tid running in the background?

Since cs is allocated on the stack, could the background thread's execution of
send_thread() write to cs.n and cs.err after the stack frame is destroyed,
causing a use-after-free memory corruption in the test runner?
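
One possible shape for the cleanup, sketched as standalone userspace C
rather than as the selftest itself: record which threads actually started
and join exactly those before the frame holding their arguments goes away.
The names run_pair(), worker() and the fail_second switch (which stands in
for a failing second pthread_create()) are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

struct send_ctx {
	int n;	/* written by the worker thread */
};

static void *worker(void *arg)
{
	struct send_ctx *ctx = arg;

	ctx->n = 1;	/* touches an object on the caller's stack */
	return NULL;
}

/*
 * Returns 0 on success, -1 if a thread could not be started.
 * fail_second simulates a failing second pthread_create().
 */
static int run_pair(bool fail_second, int *first_result)
{
	struct send_ctx a = { 0 }, b = { 0 };
	bool a_started = false, b_started = false;
	pthread_t ta, tb;
	int ret = -1;

	if (pthread_create(&ta, NULL, worker, &a))
		goto out;
	a_started = true;

	if (fail_second || pthread_create(&tb, NULL, worker, &b))
		goto out;
	b_started = true;
	ret = 0;
out:
	/* Join every thread that started, so a and b outlive their users. */
	if (a_started)
		pthread_join(ta, NULL);
	if (b_started)
		pthread_join(tb, NULL);
	*first_result = a.n;
	return ret;
}
```

With this shape the error path still returns a failure, but only after the
first thread has finished writing to its stack-allocated context.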

> +
> +	pthread_join(client_send_tid, NULL);
> +	pthread_join(server_send_tid, NULL);
[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/test_tcp_splice.c b/tools/testing/selftests/bpf/progs/test_tcp_splice.c
> new file mode 100644
> index 0000000000000..09c7f0f9e3110
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_tcp_splice.c
> @@ -0,0 +1,101 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Sock_ops BPF program that pairs locally-connected TCP sockets via the
> + * bpf_sock_splice_pair kfunc. Each side of an established loopback

[Severity: Low]
This isn't a bug, but shouldn't the opening /* be on its own line
for multi-line comments in the BPF subsystem?

> + * connection inserts itself into a sockhash keyed by its 4-tuple and
> + * looks up the peer using the swapped tuple. Whichever side finds the
> + * peer attempts to splice; the race loser sees -EEXIST.
> + */
[ ... ]
> +static __always_inline void mk_key(struct bpf_sock_ops *s,
> +				   struct flow_key *k, int swap)
> +{
> +	/* skops->local_port is already in host byte order. skops->remote_port
> +	 * is laid out as the network-order 16-bit port in the upper half of

[Severity: Low]
This isn't a bug, but shouldn't the opening /* be on its own line
for multi-line comments in the BPF subsystem?

> +	 * a u32 (see sock_ops_convert_ctx_access); bpf_ntohl produces the
> +	 * host-order port directly - no further shift.
> +	 */
> +	__u16 lport = (__u16)s->local_port;
> +	__u16 rport = bpf_ntohl(s->remote_port);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612011452.134466-1-xiyou.wangcong@gmail.com?part=3


* Re: [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice
  2026-06-12  1:14 ` [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice Cong Wang
@ 2026-06-12  1:33   ` sashiko-bot
  2026-06-12  2:10   ` bot+bpf-ci
  1 sibling, 0 replies; 18+ messages in thread
From: sashiko-bot @ 2026-06-12  1:33 UTC (permalink / raw)
  To: Cong Wang; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 10 potential issue(s) to consider:
- [High] Missing socket lock in spliced sendmsg/recvmsg leading to SPSC ring buffer corruption.
- [High] Permanent memory leak of sockets and psocks due to circular references.
- [High] TCP stream ordering violation due to fast path racing with in-flight TCP data.
- [High] Use-after-free of socket waitqueue due to missing RCU read lock in splice_wake.
- [High] RCU API Violations and lockless data races when accessing `psock->splice`.
- [High] Silent data blackhole when the receiving spliced socket has an ingress BPF program.
- [High] Refcount leak and Use-After-Free due to discarding the pointer returned by `sk_psock_get()`.
- [Medium] Missing __GFP_ACCOUNT on ring buffer allocations allows memcg limits bypass.
- [Medium] MSG_PEEK semantics broken for spliced sockets, causing hangs or silent data omission.
- [Medium] Socket errors are not cleared when fetched, leading to permanent error state.
--

commit 2f2595b5999c95fe8a664fc82201339a3716f692
Author: Cong Wang <xiyou.wangcong@gmail.com>

    tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice

    Two locally-connected TCP sockets can be paired by a sock_ops BPF
    program at handshake completion. Once paired, sendmsg copies the user
    payload into a per-direction kernel-side byte ring; recvmsg drains the
    ring into the user buffer. Both copies happen in their own task's mm,
    so no cross-mm pin / kmap dance is needed and the splice fast path
    incurs no skb construction, no softirq, and no TCP protocol-state
    processing. The TCP wire connection itself never sees the spliced
    bytes: sequence numbers stay frozen at post-handshake values, so FIN,
    RST, and keepalive continue to work through the regular paths and the
    pair tears down via a normal close handshake.

> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index e1850caf1a71a..b39fc249a18d5 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c

[ ... ]

> @@ -881,12 +881,15 @@ static void sk_psock_destroy(struct work_struct *work)
>  		sock_put(psock->sk_redir);
>  	if (psock->sk_pair)
>  		sock_put(psock->sk_pair);
> +	tcp_bpf_splice_destroy(psock);
>  	sock_put(psock->sk);
>  	kfree(psock);
>  }
>  
>  void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>  {
> +	tcp_bpf_splice_unpair(psock);
> +

[Severity: High]
Does this code create a permanent memory leak due to a circular reference?

In bpf_sock_splice_pair(), both sides of the paired connection acquire a
reference to the other's sk_psock via sk_psock_get(). This establishes a
circular reference.

Since the cross-references are only dropped in tcp_bpf_splice_unpair(), and
this is only called from sk_psock_drop() which executes when a psock's
reference count reaches 0, the circular reference guarantees neither refcount
will ever reach 0.
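
The concern can be reproduced with a toy refcount model. This is only an
illustration in plain C, not the psock code: struct node, node_pair(),
node_put() and node_unpair() are hypothetical stand-ins, and node_put()
deliberately mirrors the reported pattern of unpairing only at the final
put.

```c
#include <stddef.h>

struct node {
	int refs;
	struct node *peer;	/* holds one reference on *peer when set */
};

static void node_put(struct node *n);

/* Break the cycle: drop the reference this node holds on its peer. */
static void node_unpair(struct node *n)
{
	struct node *peer = n->peer;

	if (!peer)
		return;
	n->peer = NULL;
	node_put(peer);
}

/* Mirrors the reported pattern: unpair runs only at the final put. */
static void node_put(struct node *n)
{
	if (--n->refs == 0)
		node_unpair(n);
}

/* Each side takes a reference on the other. */
static void node_pair(struct node *a, struct node *b)
{
	a->peer = b;
	b->refs++;
	b->peer = a;
	a->refs++;
}
```

After node_pair(), dropping both owners' references leaves each count at 1,
so the final-put path never runs. Only an explicit node_unpair() from
outside that path (for example at close time) lets both counts reach 0.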

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index cc0bd73f36b6d..549f370772446 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c

[ ... ]

> @@ -626,8 +683,9 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
>  	prot[TCP_BPF_BASE]			= *base;
>  	prot[TCP_BPF_BASE].destroy		= sock_map_destroy;
>  	prot[TCP_BPF_BASE].close		= sock_map_close;
> +	prot[TCP_BPF_BASE].sendmsg		= tcp_bpf_splice_sendmsg;
>  	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
> -	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;
> +	prot[TCP_BPF_BASE].sock_is_readable	= tcp_bpf_is_readable;
>  	prot[TCP_BPF_BASE].ioctl		= tcp_bpf_ioctl;

[Severity: High]
Does this cause a silent data blackhole when the receiving spliced socket has
an ingress BPF program?

The TCP_BPF_RX protocol configuration inherits its base configuration here,
so its sendmsg is assigned tcp_bpf_splice_sendmsg. However, its recvmsg
is explicitly overridden to tcp_bpf_recvmsg_parser, which lacks the splice
ring-buffer logic.

A sender will correctly push data into the splice ring buffer, but the
receiver will only poll the standard sk_receive_queue and ingress_msg queue,
never reading the ring buffer.

[ ... ]

> +static inline void splice_wake(struct sock *sk)
> +{
> +	wait_queue_head_t *wq = sk_sleep(sk);
> +
> +	smp_mb();
> +	if (wq && waitqueue_active(wq))
> +		wake_up_interruptible_all(wq);
> +}

[Severity: High]
Is it possible for sk_sleep(sk) to access a freed waitqueue here?

This function uses sk_sleep(sk) to get the socket's waitqueue, which uses
rcu_dereference_raw(). Without the caller holding the RCU read lock, the
memory backing the waitqueue could be concurrently freed by sock_release().

[ ... ]

> +static bool sk_psock_is_spliced(const struct sk_psock *psock)
> +{
> +	struct sk_psock_splice *s = rcu_dereference(psock->splice);
> +
> +	return s && rcu_access_pointer(s->peer);
> +}

[Severity: High]
Is this missing an RCU read lock?

sk_psock_is_spliced() calls rcu_dereference(psock->splice) from
tcp_bpf_splice_sendmsg() and tcp_bpf_recvmsg() without an active
rcu_read_lock(), which can trigger a lockdep warning.

[ ... ]

> +static int tcp_bpf_splice_sendmsg(struct sock *sk, struct msghdr *msg,
> +				  size_t size)
> +{
> +	struct sk_psock *psock;
> +	int spliced = 0;
> +	int ret;

[Severity: High]
Could the lack of socket locking here lead to SPSC ring buffer corruption?

tcp_bpf_splice_sendmsg() bypasses the standard socket lock (lock_sock(sk)).
Since userspace can call sendmsg or recvmsg from multiple threads concurrently
on the same socket descriptor, multiple producers or consumers will race
when accessing the SPSC ring buffer logic in splice_ring_write() and
tcp_bpf_splice_recvmsg(). This can lead to non-atomic cursor updates and
memory corruption.
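
To illustrate why a single-producer ring needs its writers serialized,
here is a minimal userspace sketch. It is not the patch's
splice_ring_write(): struct ring, ring_write() and RING_SIZE are
hypothetical, and the pthread mutex merely stands in for whatever
serialization the kernel code would use (for example the socket lock).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 16	/* power of two, hypothetical */

struct ring {
	unsigned char buf[RING_SIZE];
	_Atomic size_t head;		/* advanced by the producer */
	_Atomic size_t tail;		/* advanced by the consumer */
	pthread_mutex_t prod_lock;	/* serializes concurrent writers */
};

/* Copy up to len bytes in; returns the number of bytes written. */
static size_t ring_write(struct ring *r, const unsigned char *src, size_t len)
{
	size_t head, tail, space, i;

	pthread_mutex_lock(&r->prod_lock);
	head = atomic_load_explicit(&r->head, memory_order_relaxed);
	tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	space = RING_SIZE - (head - tail);
	if (len > space)
		len = space;
	for (i = 0; i < len; i++)
		r->buf[(head + i) & (RING_SIZE - 1)] = src[i];
	/* Publish the bytes before the new head becomes visible. */
	atomic_store_explicit(&r->head, head + len, memory_order_release);
	pthread_mutex_unlock(&r->prod_lock);
	return len;
}
```

Without prod_lock, two writers could read the same head, copy into the same
slots, and then each store head + len, losing one of the writes.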

[ ... ]

> +void tcp_bpf_splice_unpair(struct sk_psock *psock)
> +{
> +	struct sk_psock_splice *self_s, *peer_s;
> +	struct sk_psock *peer;
> +	bool was_paired = false;
> +
> +	self_s = rcu_dereference_protected(psock->splice, 1);
> +	if (!self_s)
> +		return;
> +
> +	rcu_read_lock();
> +	peer = rcu_dereference(self_s->peer);
> +	if (!peer) {
> +		rcu_read_unlock();
> +		return;
> +	}
> +	if (!sk_psock_get(peer->sk)) {

[Severity: High]
Does discarding the pointer returned by sk_psock_get() here cause a refcount
leak and use-after-free?

If a socket's psock is concurrently replaced, sk_psock_get() returns the new
psock and increments its refcount. Because this code only checks the result
for NULL and continues using the original peer pointer to interact with and
ultimately drop the refcount (via sk_psock_put()), it creates an unbalanced
increment on the new psock (memory leak) and an unbalanced decrement on the
old psock (use-after-free).
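
A toy model of the pattern being asked for, in plain C and with
hypothetical names (struct obj, struct slot, slot_get(),
slot_get_expected()); it only shows the idea of comparing the pinned object
against the expected one and is not the sk_psock API.

```c
#include <stddef.h>

struct obj {
	int refs;
};

struct slot {
	struct obj *cur;	/* may be replaced concurrently in real code */
};

/* Take a reference on whatever the slot currently holds. */
static struct obj *slot_get(struct slot *s)
{
	if (s->cur)
		s->cur->refs++;
	return s->cur;
}

static void obj_put(struct obj *o)
{
	o->refs--;
}

/*
 * Pin the object the caller expects. If the slot now holds something
 * else, give the stray reference back instead of leaking it.
 */
static struct obj *slot_get_expected(struct slot *s, struct obj *expected)
{
	struct obj *got = slot_get(s);

	if (got && got != expected) {
		obj_put(got);
		return NULL;
	}
	return got;
}
```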

[ ... ]

> +	rcu_read_unlock();

[ ... ]

> +	/* Wake any blocked rendezvous waiters on either side. They will
> +	 * re-check the predicate, see splice->peer == NULL, and exit.
> +	 */
> +	splice_wake(psock->sk);
> +	splice_wake(peer->sk);

[Severity: High]
Is it safe to call splice_wake() here after dropping the RCU read lock?

splice_wake() requires the RCU read lock to safely access the socket's
waitqueue without it being concurrently freed.

[ ... ]

> +static int splice_ring_alloc(struct sk_psock_splice *s)
> +{
> +	void *buf;
> +
> +	if (READ_ONCE(s->ring_buf))
> +		return 0;
> +
> +	buf = (void *)__get_free_pages(GFP_ATOMIC | __GFP_NOWARN,
> +				       get_order(SPLICE_RING_SIZE));

[Severity: Medium]
Should this allocation use the __GFP_ACCOUNT flag?

Because this 16 KiB allocation is not charged to the memory cgroup, a local
application could create and pair an arbitrary number of sockets to
intentionally exhaust host memory without being restricted by its container's
memory limits.

[ ... ]

> +static int tcp_bpf_splice_recvmsg(struct sock *sk,
> +				  struct sk_psock *psock,
> +				  struct msghdr *msg, size_t len,
> +				  int flags, int *err)
> +{
> +	struct sk_psock_splice *s;
> +	size_t copied;
> +	long timeo;
> +
> +	*err = 0;
> +	/* PEEK is not implemented against the ring (no peek-without-advance
> +	 * helper). Return 0 with no error so the dispatcher defers to the
> +	 * TCP path; ring contents are invisible to PEEK but the socket
> +	 * continues to work for normal apps.
> +	 */
> +	if (flags & MSG_PEEK)
> +		return 0;

[Severity: Medium]
Does returning 0 here break MSG_PEEK semantics?

Since the data has already been spliced directly into the ring buffer, it is
completely invisible to the TCP stack. A MSG_PEEK call falling back to the
standard TCP sk_receive_queue will falsely report that no data is available,
and a blocking MSG_PEEK could hang indefinitely waiting for normal TCP data.
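
The quoted comment notes that there is no peek-without-advance helper. A
minimal sketch of what such a helper could look like on a byte ring
follows. It is an assumption-laden illustration in userspace C: struct
ring, ring_read() and RING_SIZE are hypothetical and do not reflect the
layout of the patch's ring.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 16	/* power of two, hypothetical */

struct ring {
	unsigned char buf[RING_SIZE];
	size_t head;	/* next byte to write */
	size_t tail;	/* next byte to read */
};

/*
 * Copy up to len bytes out. When peek is true the tail stays put, so
 * the same bytes are returned again by the next call.
 */
static size_t ring_read(struct ring *r, unsigned char *dst, size_t len,
			bool peek)
{
	size_t avail = r->head - r->tail;
	size_t i;

	if (len > avail)
		len = avail;
	for (i = 0; i < len; i++)
		dst[i] = r->buf[(r->tail + i) & (RING_SIZE - 1)];
	if (!peek)
		r->tail += len;
	return len;
}
```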

[ ... ]

> +	s = rcu_dereference_protected(psock->splice, 1);

[Severity: High]
Can using a hardcoded 1 here cause data races?

Using rcu_dereference_protected() with a hardcoded 1 in a lockless,
concurrently accessible path suppresses lockdep checks and omits the data
dependency barriers provided by rcu_dereference() or READ_ONCE(). In a
concurrent environment, this creates data races and risks observing
uninitialized ring buffer state on architectures that require strict data
dependency barriers.

[ ... ]

> +		if (sk->sk_err) {
> +			*err = -sk->sk_err;
> +			return 0;
> +		}

[Severity: Medium]
Should the socket error be cleared when it is fetched?

When this observes an error in sk->sk_err, it sets *err and returns, which
propagates the error to user space. However, sk->sk_err is never cleared
(e.g., via sock_error(sk) or xchg(&sk->sk_err, 0)). Failing to clear the
error means every subsequent recvmsg() call will erroneously fail with the
exact same error, permanently breaking the socket.
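
The fetch-and-clear behaviour can be modelled in userspace with a C11
atomic exchange. struct fake_sock and fetch_and_clear_error() are
hypothetical names used only for this sketch; the point is that reading the
error and resetting it to 0 happen in one step.

```c
#include <stdatomic.h>

struct fake_sock {
	_Atomic int err;	/* pending error, 0 if none */
};

/* Return the pending error as a negative errno and clear it atomically. */
static int fetch_and_clear_error(struct fake_sock *sk)
{
	return -atomic_exchange(&sk->err, 0);
}
```

A second call after the first one reports 0, so a later receive is not
failed again by an error that was already delivered.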

[ ... ]

> +static int splice_send_ring(struct sock *sk, struct sk_psock *psock,
> +			    struct msghdr *msg, size_t size, int flags)
> +{

[ ... ]

> +			if (ps && READ_ONCE(ps->ring_buf) &&
> +			    !sk->sk_err && !(sk->sk_shutdown & SEND_SHUTDOWN) &&
> +			    skb_queue_empty(&peer_sk->sk_receive_queue)) {

[Severity: High]
Does this check violate TCP stream ordering by ignoring in-flight TCP data?

To preserve stream ordering when falling back to the standard TCP stack, the
sender relies on skb_queue_empty(&peer_sk->sk_receive_queue) to determine if
it is safe to write to the ring buffer. However, this check ignores data that
is actively in-flight (queued in the sender's sk_write_queue, the local
qdisc, or softirq backlog). A subsequent sendmsg will see an empty
peer_sk->sk_receive_queue and splice new data directly into the ring buffer,
allowing the receiver to read it before the older in-flight TCP data arrives.

[ ... ]

> +__bpf_kfunc int bpf_sock_splice_pair(struct sock *peer,
> +				     struct bpf_sock_ops_kern *skops)
> +{

[ ... ]

> +	/* Each side keeps a psock ref on the other for the duration. */
> +	if (!sk_psock_get(sk)) {
> +		ret = -ENOENT;
> +		goto out_unlock;
> +	}
> +	if (!sk_psock_get(peer)) {

[Severity: High]
Could discarding the pointer returned by sk_psock_get() here cause a refcount
leak and use-after-free?

Similar to tcp_bpf_splice_unpair(), if a socket's psock is concurrently
replaced, sk_psock_get() returns the new psock and increments its refcount.
Because this code only checks the result for NULL and continues using the
original peer pointer to interact with it, it creates an unbalanced
increment on the new psock and an unbalanced decrement on the old psock.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612011452.134466-1-xiyou.wangcong@gmail.com?part=1


* Re: [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice
  2026-06-12  1:14 ` [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice Cong Wang
  2026-06-12  1:33   ` sashiko-bot
@ 2026-06-12  2:10   ` bot+bpf-ci
  1 sibling, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-06-12  2:10 UTC (permalink / raw)
  To: xiyou.wangcong, netdev
  Cc: bpf, john.fastabend, jakub, jiayuan.chen, hemanthmalla,
	zijianzhang, xiyou.wangcong, cwang, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3111 bytes --]

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index cc0bd73f36b6..549f37077244 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c

[ ... ]

> @@ -756,4 +814,753 @@ void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
>  		newsk->sk_prot = sk->sk_prot_creator;
>  }
>
> +static int splice_send_ring(struct sock *sk, struct sk_psock *psock,
> +			    struct msghdr *msg, size_t size, int flags)
> +{
> +	struct sk_psock_splice *self_s, *peer_s;
> +	struct sk_psock *peer;
> +	int total = 0;
> +
> +	if (msg->msg_flags & MSG_OOB)
> +		return 0;
> +
> +	self_s = rcu_dereference_protected(psock->splice, 1);
> +	if (!self_s)
> +		return 0;
> +
> +	while (size > 0) {
> +		size_t done, space = 0;
> +
> +		/* All peer / peer->sk accesses happen under RCU. If the ring
> +		 * has space, grab the peer's ring_ref before dropping RCU: that
> +		 * pins peer_s (and its ring) so the copy below can run outside
> +		 * RCU and fault/sleep normally. peer_sk is *not* pinned by the
> +		 * ref, so it must not be touched after rcu_read_unlock().
> +		 */
> +		peer_s = NULL;
> +		rcu_read_lock();
> +		peer = rcu_dereference(self_s->peer);
> +		if (peer) {
> +			struct sock *peer_sk = peer->sk;
> +			struct sk_psock_splice *ps = rcu_dereference(peer->splice);
> +
> +			if (ps && READ_ONCE(ps->ring_buf) &&
> +			    !sk->sk_err && !(sk->sk_shutdown & SEND_SHUTDOWN) &&
> +			    skb_queue_empty(&peer_sk->sk_receive_queue)) {
> +				space = splice_ring_space(ps);
> +				if (space && percpu_ref_tryget_live(&ps->ring_ref))
> +					peer_s = ps;
> +			}
> +		}
> +		rcu_read_unlock();
> +		if (!peer_s)
> +			break;
> +
> +		/* Holding peer_s->ring_ref: peer_s and its ring stay alive.
> +		 * The copy touches only the ring, never peer_sk, so a normal
> +		 * faulting copy is safe here.
> +		 */
> +		done = splice_ring_write(peer_s, &msg->msg_iter,
> +					 min(size, space));
> +		percpu_ref_put(&peer_s->ring_ref);
> +
> +		if (!done)
> +			break;
> +		total += done;
> +		size  -= done;
> +	}
> +
> +	/* Wake exactly once, after the loop, re-deref'ing peer under RCU.
> +	 * Doing this inside the loop would carry the _sync hint repeatedly
> +	 * and cost a redundant wake per wraparound iteration.
> +	 */
> +	if (total) {
> +		rcu_read_lock();
> +		peer = rcu_dereference(self_s->peer);
> +		if (peer)
> +			splice_wake_sync(peer->sk);
> +		rcu_read_unlock();
> +	}
> +	return total;
> +}

Is the flags parameter needed? The function signature includes an int
flags parameter, but the function body never references it. The only
flag checked is msg->msg_flags for MSG_OOB.

The caller tcp_bpf_splice_sendmsg computes this value as:

int flags = (msg->msg_flags & ~MSG_SENDPAGE_DECRYPTED) |
            MSG_NO_SHARED_FRAGS;

but it appears to go unused. Should it be dropped from the signature,
or is there intended handling that was omitted?


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27388683867


* Re: [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver
  2026-06-12  1:14 ` [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver Cong Wang
@ 2026-06-12  3:29   ` sashiko-bot
  0 siblings, 0 replies; 18+ messages in thread
From: sashiko-bot @ 2026-06-12  3:29 UTC (permalink / raw)
  To: Cong Wang; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Missing `need_resched()` check in `splice_busy_loop()` allows an unprivileged user to monopolize a CPU, causing soft lockups and denial of service.
- [High] `tcp_bpf_splice_recvmsg` incorrectly returns `-EINTR` instead of `-ERESTARTSYS` for blocking sockets when interrupted by a signal, breaking `SA_RESTART` semantics.
- [High] Providing an invalid memory buffer to `recvmsg` on a spliced socket causes an uninterruptible infinite loop in kernel space, leading to Denial of Service.
--

commit 8f0193a527ddda6d7d50681be8a081f8bc267f16
Author: Cong Wang <xiyou.wangcong@gmail.com>

tcp_bpf: busy-poll the splice ring before parking the receiver

This commit adds an optional bounded busy-poll of the ring before parking,
reusing the socket's SO_BUSY_POLL budget via sk_can_busy_loop() and
sk_busy_loop_timeout(). It spins directly on the in-kernel ring without
polling a NAPI instance.

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 549f370772446..9c4421a74225b 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c

[ ... ]

> @@ -1255,6 +1256,33 @@ static long splice_recv_wait(struct sock *sk, struct sk_psock_splice *s,
>  					splice_recv_ready(sk, s), timeo);
>  }
>  
> +/* Bounded busy-poll on the ring before parking the receiver. Reuses the
> + * socket's SO_BUSY_POLL budget (sk_ll_usec) via sk_can_busy_loop() and
> + * sk_busy_loop_timeout(); the default budget of 0 makes sk_can_busy_loop()
> + * false so this is a no-op unless the application (or net.core.busy_read)
> + * opted in.
> + *
> + * Unlike sk_busy_loop() / napi_busy_loop(), this spins on the in-kernel
> + * ring directly rather than polling a NAPI instance, so it is effective on
> + * loopback - which delivers via the per-CPU backlog and exposes no
> + * pollable napi_id. Keeping the receiver hot lets a synchronous sender's
> + * small writes accumulate in the ring without a wakeup per message.
> + */
> +static void splice_busy_loop(struct sock *sk, struct sk_psock_splice *s)
> +{
> +	unsigned long start;
> +
> +	if (!sk_can_busy_loop(sk))
> +		return;
> +
> +	start = busy_loop_current_time();
> +	do {
> +		cpu_relax();
> +		if (splice_recv_ready(sk, s) || signal_pending(current))
> +			return;
> +	} while (!sk_busy_loop_timeout(sk, start));

[Severity: High]
Does this loop need a need_resched() check or a call to cond_resched()?

Because an unprivileged user can set SO_BUSY_POLL to an extremely large
value (up to INT_MAX microseconds), triggering this path via recvmsg()
could cause the thread to spin on the CPU without yielding.

On non-preemptible kernels, this might completely starve other tasks,
stall RCU grace periods, and trigger watchdog soft lockups.
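
As a userspace analogy only (the kernel loop would use need_resched() or
similar, which cannot be shown in a standalone program), a bounded spin
that also honours a yield request might look like the sketch below.
bounded_spin() and its yield_req flag are hypothetical, and the budget is
measured in CPU time via clock().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/*
 * Spin until *ready is set, the CPU-time budget expires, or *yield_req
 * asks the spinner to give up the CPU. Returns true only if *ready was
 * observed.
 */
static bool bounded_spin(atomic_bool *ready, atomic_bool *yield_req,
			 long long budget_us)
{
	clock_t start = clock();

	for (;;) {
		long long spent_us;

		if (atomic_load(ready))
			return true;
		if (atomic_load(yield_req))
			return false;	/* let the caller fall back to sleeping */
		spent_us = (long long)(clock() - start) * 1000000LL /
			   CLOCKS_PER_SEC;
		if (spent_us >= budget_us)
			return false;
	}
}
```

The yield check makes the worst-case spin independent of the configured
budget: even a very large budget ends as soon as a yield is requested.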

> +}
> +
>  /* prot->sock_is_readable for paired-splice sockets. tcp_stream_is_readable()

[ ... ]

> @@ -1349,6 +1377,16 @@ static int tcp_bpf_splice_recvmsg(struct sock *sk,
>  			return 0;
>  		}
>  
> +		/* Spin on the ring for the SO_BUSY_POLL budget before
> +		 * sleeping. If the spin observes data, re-read from the
> +		 * loop head; otherwise (budget expired or a terminal
> +		 * condition) proceed to park - splice_recv_wait() returns
> +		 * immediately for terminal conditions.
> +		 */
> +		splice_busy_loop(sk, s);
> +		if (splice_ring_has_data(s))
> +			continue;

[Severity: High]
Could this result in an infinite loop if a user provides an invalid
buffer to recvmsg?

If the user buffer is unmapped memory, copy_to_iter() in splice_ring_read()
called earlier in the loop might safely fault and return 0. If
splice_ring_read() returns 0, the ring's tail pointer isn't advanced, and
the data remains in the ring buffer.

tcp_bpf_splice_recvmsg() seems to treat the 0 return as an empty read
rather than an -EFAULT error. When it subsequently checks
splice_ring_has_data() here, it will evaluate to true because the data is
still present.

This triggers the continue statement, which loops back immediately,
potentially creating a tight infinite loop in the kernel.
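
The decision being asked for can be written down as a small pure function.
This is a sketch under the assumption of a non-zero request length;
recv_step() is a hypothetical helper, not a function from the patch.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Decide what the receive loop should do after one read attempt.
 * copied:  bytes the read managed to copy to the user buffer
 * pending: bytes still sitting in the ring after the attempt
 * Returns >0 bytes to report, -EFAULT on a stuck copy, 0 to go wait.
 */
static long recv_step(size_t copied, size_t pending)
{
	if (copied)
		return (long)copied;
	if (pending)
		return -EFAULT;	/* data was there but the copy made no progress */
	return 0;		/* ring empty: safe to busy-poll or sleep */
}
```

Treating "nothing copied while data is pending" as a fault, rather than as
an empty read, is what prevents the loop from retrying the same failing
copy forever.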

> +
>  		timeo = splice_recv_wait(sk, s, timeo);

[Severity: High]
Could this cause the system call to incorrectly return -EINTR instead of
-ERESTARTSYS when interrupted by a signal?

When splice_recv_wait() uses wait_event_interruptible_timeout() and is
interrupted, it returns -ERESTARTSYS (-512). This return value directly
overwrites the timeo variable here.

On the next loop iteration, signal_pending(current) would evaluate to true,
and the error code is computed via sock_intr_errno(timeo). Because timeo
is now -512 instead of MAX_SCHEDULE_TIMEOUT, sock_intr_errno() might
incorrectly return -EINTR instead of -ERESTARTSYS, which breaks SA_RESTART
semantics.
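
The effect can be shown with a small userspace model. The two macros below
copy the kernel's values (ERESTARTSYS is kernel-internal and absent from
the userspace <errno.h>); sock_intr_errno_model(), interrupted_wait() and
the two errno_after_*() helpers are hypothetical names for this sketch.

```c
#include <errno.h>
#include <limits.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS		512	/* kernel-internal value */
#endif
#define MAX_SCHEDULE_TIMEOUT	LONG_MAX

/* Model of sock_intr_errno(): restartable only for an infinite timeout. */
static int sock_intr_errno_model(long timeo)
{
	return timeo == MAX_SCHEDULE_TIMEOUT ? -ERESTARTSYS : -EINTR;
}

/* Model of a wait that was interrupted by a signal. */
static long interrupted_wait(long timeo)
{
	(void)timeo;
	return -ERESTARTSYS;
}

/* Reported shape: the wait result replaces the caller's timeout. */
static int errno_after_overwrite(long timeo)
{
	timeo = interrupted_wait(timeo);
	return sock_intr_errno_model(timeo);
}

/* Alternative shape: keep the original timeout for the errno decision. */
static int errno_after_preserve(long timeo)
{
	long ret = interrupted_wait(timeo);

	return ret < 0 ? sock_intr_errno_model(timeo) : 0;
}
```

For a blocking socket (infinite timeout) the overwriting shape yields
-EINTR, while the preserving shape yields -ERESTARTSYS.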

>  	}
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612011452.134466-1-xiyou.wangcong@gmail.com?part=2


* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
  2026-06-12  1:14 [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Cong Wang
                   ` (4 preceding siblings ...)
  2026-06-12  1:14 ` [RFC PATCH bpf-next 5/5] selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog Cong Wang
@ 2026-06-12 16:01 ` Alexei Starovoitov
  2026-06-12 18:12   ` Cong Wang
  2026-06-13 17:57   ` Jakub Kicinski
  2026-06-12 22:10 ` [syzbot ci] " syzbot ci
  6 siblings, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2026-06-12 16:01 UTC (permalink / raw)
  To: Cong Wang, Jakub Kicinski
  Cc: Network Development, bpf, John Fastabend, Jakub Sitnicki,
	Jiayuan Chen, Hemanth Malla, zijianzhang

On Thu, Jun 11, 2026 at 6:15 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> This series adds an opportunistic "loopback splice" fast path for two
> locally-connected TCP sockets that a sock_ops BPF program pairs at
> handshake completion. Once paired, sendmsg copies the user payload into
> a per-direction in-kernel byte ring and recvmsg drains it on the other
> side; both copies happen in their own task's mm, so the fast path incurs
> no skb construction, no softirq, and no TCP protocol-state processing.
>
> The underlying TCP connection stays fully real: sequence numbers are
> frozen at post-handshake values, so FIN/RST/keepalive keep flowing
> through the normal paths and the pair tears down via a regular close.
> Pairing is opt-in per flow and fallback is per-message - handshake-style
> traffic takes the TCP path, the bulk phase takes the ring, on the same
> socket. Nothing leaves the host and applications need no changes: no new
> address family, no LD_PRELOAD, no source modification.
>
> The target use cases are co-located endpoints that speak plain TCP:
>  - regular TCP loopback (127.0.0.1) between processes on the same host;
>  - container sidecar deployments - e.g. a service-mesh sidecar proxy and
>    its application in the same pod, talking over loopback or a veth pair -
>    where the per-skb veth+bridge cost is exactly what the ring sidesteps.
>
> Highlights (TCP_RR, 1 KB request/response, netperf, pinned CPUs,
> baseline TCP vs splice; full tables across message sizes and TCP_STREAM
> in patches 1 and 2):
>
>   loopback (127.0.0.1):
>     without busy-poll:   105.8k -> 235.1k tps  (2.2x)
>     with busy-poll 50us: 106.1k -> 713.0k tps  (6.7x)
>
>   container (netns + veth + bridge):
>     without busy-poll:    99.9k -> 233.9k tps  (2.3x)
>     with busy-poll 50us: 100.4k -> 704.9k tps  (7.0x)
>
> Synchronous-RPC (TCP_RR) at a 1 KB message wins ~2.2x without busy
> polling and ~6.7x with it (the win grows toward smaller messages and
> narrows toward 64 KB), because the ring removes the per-cycle kernel TCP
> receive-path cost and the receiver can spin on the ring directly -
> loopback delivers via the per-CPU backlog and exposes no pollable
> napi_id, so the generic sk_busy_loop() is a no-op there. Bulk streaming
> is roughly neutral on bare-metal loopback but wins decisively (up to
> ~6x) container-to-container, where per-skb veth+bridge cost dominates
> the path the ring sidesteps.
>
> ---
> Cong Wang (5):
>   tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback
>     splice
>   tcp_bpf: busy-poll the splice ring before parking the receiver
>   selftests/bpf: add tcp_splice basic round-trip test
>   bpf: allow SO_BUSY_POLL in bpf_setsockopt()
>   selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog
>
>  include/linux/skmsg.h                         |   9 +
>  include/net/tcp.h                             |   8 +
>  net/core/filter.c                             |   1 +
>  net/core/skmsg.c                              |   3 +
>  net/ipv4/tcp_bpf.c                            | 847 +++++++++++++++++-
>  .../selftests/bpf/prog_tests/tcp_splice.c     | 206 +++++
>  .../selftests/bpf/progs/test_tcp_splice.c     | 125 +++
>  7 files changed, 1198 insertions(+), 1 deletion(-)

Just saying that the code is free nowadays, so whether it's 1k lines
or 10 lines is irrelevant for the discussion.

As far as the idea goes, I think, it would be interesting in pre-AI era,
but today splice and friends are a prime target for bugs and more bugs.
skmsg and tcp_bpf are reeling from unfixed bugs too,
so my take is that we should not add any new features to skmsg
and instead deprecate what is already there.


* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
  2026-06-12 16:01 ` [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Alexei Starovoitov
@ 2026-06-12 18:12   ` Cong Wang
  2026-06-12 18:34     ` Alexei Starovoitov
  2026-06-13 17:57   ` Jakub Kicinski
  1 sibling, 1 reply; 18+ messages in thread
From: Cong Wang @ 2026-06-12 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Cong Wang, Jakub Kicinski, Network Development, bpf,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Hemanth Malla,
	zijianzhang

On Fri, Jun 12, 2026 at 09:01:43AM -0700, Alexei Starovoitov wrote:
> Just saying that the code is free nowadays, so whether it's 1k lines
> or 10 lines is irrelevant for the discussion.
> 
> As far as the idea goes, I think, it would be interesting in pre-AI era,
> but today splice and friends are a prime target for bugs and more bugs.
> skmsg and tcp_bpf are reeling from unfixed bugs too,
> so my take is that we should not add any new features to skmsg
> and instead deprecate what is already there.

I guess the name may have misled you: it has nothing to do with the splice()
syscall. Its ring buffer is built on top of include/linux/circ_buf.h,
which again has nothing to do with splice()/vmsplice()/pipe().

In case it is not obvious, this patchset does not add any new user-space
interface, only a kfunc that is visible only to sockmap eBPF programs,
which already require the CAP_BPF privilege.

If you have a better name in mind, I am happy to change it.

Thanks.

* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
  2026-06-12 18:12   ` Cong Wang
@ 2026-06-12 18:34     ` Alexei Starovoitov
  2026-06-12 20:17       ` Cong Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2026-06-12 18:34 UTC (permalink / raw)
  To: Cong Wang
  Cc: Cong Wang, Jakub Kicinski, Network Development, bpf,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Hemanth Malla,
	zijianzhang

On Fri, Jun 12, 2026 at 11:12 AM Cong Wang <cwang@multikernel.io> wrote:
>
> On Fri, Jun 12, 2026 at 09:01:43AM -0700, Alexei Starovoitov wrote:
> > Just saying that the code is free nowadays, so whether it's 1k lines
> > or 10 lines is irrelevant for the discussion.
> >
> > As far as the idea goes, I think, it would be interesting in pre-AI era,
> > but today splice and friends are a prime target for bugs and more bugs.
> > skmsg and tcp_bpf are reeling from unfixed bugs too,
> > so my take is that we should not add any new features to skmsg
> > and instead deprecate what is already there.
>
> I guess the name may have misled you: it has nothing to do with the splice()
> syscall. Its ring buffer is built on top of include/linux/circ_buf.h,
> which again has nothing to do with splice()/vmsplice()/pipe().
>
> In case it is not obvious, this patchset does not add any new user-space
> interface, only a kfunc that is visible only to sockmap eBPF programs,
> which already require the CAP_BPF privilege.

Not the name, but the concept. Taking from one socket and feeding
into another already caused a ton of issues for the networking stack.
If you can convince Kuba we can entertain it.

* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
  2026-06-12 18:34     ` Alexei Starovoitov
@ 2026-06-12 20:17       ` Cong Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Cong Wang @ 2026-06-12 20:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Cong Wang, Jakub Kicinski, Network Development, bpf,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Hemanth Malla,
	zijianzhang

On Fri, Jun 12, 2026 at 11:34:25AM -0700, Alexei Starovoitov wrote:
> On Fri, Jun 12, 2026 at 11:12 AM Cong Wang <cwang@multikernel.io> wrote:
> >
> > On Fri, Jun 12, 2026 at 09:01:43AM -0700, Alexei Starovoitov wrote:
> > > Just saying that the code is free nowadays, so whether it's 1k lines
> > > or 10 lines is irrelevant for the discussion.
> > >
> > > As far as the idea goes, I think, it would be interesting in pre-AI era,
> > > but today splice and friends are a prime target for bugs and more bugs.
> > > skmsg and tcp_bpf are reeling from unfixed bugs too,
> > > so my take is that we should not add any new features to skmsg
> > > and instead deprecate what is already there.
> >
> > I guess the name may have misled you: it has nothing to do with the splice()
> > syscall. Its ring buffer is built on top of include/linux/circ_buf.h,
> > which again has nothing to do with splice()/vmsplice()/pipe().
> >
> > In case it is not obvious, this patchset does not add any new user-space
> > interface, only a kfunc that is visible only to sockmap eBPF programs,
> > which already require the CAP_BPF privilege.
> 
> Not the name, but the concept. Taking from one socket and feeding
> into another already caused a ton of issues for the networking stack.
> If you can convince Kuba we can entertain it.

If you could be specific and provide examples, I could give a better
answer and take better action.

Until then, all I can say is that Copy Fail leverages page *references*,
whereas bpf_sock_splice_pair() shares no pages: its ring is a private kernel
allocation, with no pipe_buffer or page-cache involvement at all. Probably
the main thing these two have in common is the name "splice".

In fact, it has 2 copies (not 0, not 1) by design, see details here:
https://multikernel.io/2026/06/11/bpf-sock-splice-pair-two-copies/
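
As a rough illustration of the two-copy design described above, the
following userspace sketch models one direction of a byte ring: the sender
copies its payload into the ring, and the receiver copies it out. The index
arithmetic mirrors CIRC_CNT() and CIRC_SPACE() from include/linux/circ_buf.h;
the struct, the function names, and the 16-byte size are hypothetical and are
not taken from the patch, which also has to handle locking, wakeups, and
user-memory access that this sketch omits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the patch's data structure. */
#define RING_SIZE 16u /* power of two, as circ_buf.h-style masking requires */

static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be 2^n");

struct byte_ring {
	unsigned int head;             /* next write index (sender side) */
	unsigned int tail;             /* next read index (receiver side) */
	unsigned char buf[RING_SIZE];
};

/* Bytes available to read; same arithmetic as CIRC_CNT(). */
static unsigned int ring_cnt(const struct byte_ring *r)
{
	return (r->head - r->tail) & (RING_SIZE - 1);
}

/* Bytes free to write; same arithmetic as CIRC_SPACE(). */
static unsigned int ring_space(const struct byte_ring *r)
{
	return (r->tail - (r->head + 1)) & (RING_SIZE - 1);
}

/* Copy 1: the sender copies from its own buffer into the ring. */
static size_t ring_write(struct byte_ring *r, const void *src, size_t len)
{
	const unsigned char *p = src;
	size_t space = ring_space(r);
	size_t n = len < space ? len : space;
	size_t first = RING_SIZE - r->head; /* contiguous room before wrap */

	if (first > n)
		first = n;
	memcpy(r->buf + r->head, p, first);
	memcpy(r->buf, p + first, n - first); /* wrapped remainder, if any */
	r->head = (unsigned int)((r->head + n) & (RING_SIZE - 1));
	return n; /* may be short when the ring is nearly full */
}

/* Copy 2: the receiver copies from the ring into its own buffer. */
static size_t ring_read(struct byte_ring *r, void *dst, size_t len)
{
	unsigned char *p = dst;
	size_t cnt = ring_cnt(r);
	size_t n = len < cnt ? len : cnt;
	size_t first = RING_SIZE - r->tail; /* contiguous data before wrap */

	if (first > n)
		first = n;
	memcpy(p, r->buf + r->tail, first);
	memcpy(p + first, r->buf, n - first); /* wrapped remainder, if any */
	r->tail = (unsigned int)((r->tail + n) & (RING_SIZE - 1));
	return n;
}
```

A caller would pair ring_write() on the sending side with ring_read() on the
receiving side. Because one slot is always left empty to tell a full ring
from an empty one, a 16-byte ring holds at most 15 bytes, and a write that
does not fit returns a short count rather than overwriting unread data.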

Or if you mean that skmsg or sockmap has a lot of bugs, that is true, but it
is mostly due to TLS (whose codebase is already a mess) and the
complexity of skmsg itself; neither is related to
bpf_sock_splice_pair().

For your reference, this is the data sheet I collected with AI:

  ┌─────────────────────┬─────────┬──────────┐
  │ Code path the fixes │  ~Fix   │ Splice   │
  │       live in       │ commits │  ring    │
  │                     │         │ uses it? │
  ├─────────────────────┼─────────┼──────────┤
  │ sk_msg / verdict /  │         │          │
  │ strparser / skb     │     ~59 │    No    │
  │ redirect            │         │          │
  ├─────────────────────┼─────────┼──────────┤
  │ TLS / ULP layering  │       8 │    No    │
  ├─────────────────────┼─────────┼──────────┤
  │ psock / sock_map    │         │          │
  │ teardown (close,    │     ~10 │   Yes    │
  │ unhash, destroy,    │         │          │
  │ replace, free)      │         │          │
  └─────────────────────┴─────────┴──────────┘

Thanks for your comments!
Cong

* [syzbot ci] Re: tcp: opportunistic loopback splice for BPF-paired sockets
  2026-06-12  1:14 [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Cong Wang
                   ` (5 preceding siblings ...)
  2026-06-12 16:01 ` [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Alexei Starovoitov
@ 2026-06-12 22:10 ` syzbot ci
  6 siblings, 0 replies; 18+ messages in thread
From: syzbot ci @ 2026-06-12 22:10 UTC (permalink / raw)
  To: bpf, cwang, hemanthmalla, jakub, jiayuan.chen, john.fastabend,
	netdev, xiyou.wangcong, zijianzhang
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] tcp: opportunistic loopback splice for BPF-paired sockets
https://lore.kernel.org/all/20260612011452.134466-1-xiyou.wangcong@gmail.com
* [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice
* [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver
* [RFC PATCH bpf-next 3/5] selftests/bpf: add tcp_splice basic round-trip test
* [RFC PATCH bpf-next 4/5] bpf: allow SO_BUSY_POLL in bpf_setsockopt()
* [RFC PATCH bpf-next 5/5] selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog

and found the following issues:
* WARNING: suspicious RCU usage in tcp_bpf_recvmsg
* WARNING: suspicious RCU usage in tcp_bpf_splice_sendmsg

Full report is available here:
https://ci.syzbot.org/series/7c43d5ae-cb19-4b2b-96ad-f7f0806ac63c

***

WARNING: suspicious RCU usage in tcp_bpf_recvmsg

tree:      bpf-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git
base:      30dee2c176e7954f63d1fa3e52d172f30beb9bfb
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/09e43fc4-ebab-492e-b1de-15bb86aa4588/config
syz repro: https://ci.syzbot.org/findings/79db1500-f71c-4550-8525-89bf06044d60/syz_repro

=============================
WARNING: suspicious RCU usage
syzkaller #0 Not tainted
-----------------------------
net/ipv4/tcp_bpf.c:883 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
no locks held by syz.0.17/5822.

stack backtrace:
CPU: 1 UID: 0 PID: 5822 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 lockdep_rcu_suspicious+0x13f/0x1d0 kernel/locking/lockdep.c:6876
 sk_psock_is_spliced net/ipv4/tcp_bpf.c:883 [inline]
 tcp_bpf_recvmsg+0x1780/0x1980 net/ipv4/tcp_bpf.c:405
 sock_recvmsg_nosec+0xee/0x140 net/socket.c:1137
 ____sys_recvmsg+0x3e3/0x4a0 net/socket.c:2916
 ___sys_recvmsg+0x215/0x590 net/socket.c:2960
 do_recvmmsg+0x334/0x800 net/socket.c:3055
 __sys_recvmmsg net/socket.c:3129 [inline]
 __do_sys_recvmmsg net/socket.c:3152 [inline]
 __se_sys_recvmmsg net/socket.c:3145 [inline]
 __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3145
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1af799ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f1af87bc028 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 00007f1af7c15fa0 RCX: 00007f1af799ce59
RDX: 0000000000000002 RSI: 0000200000000400 RDI: 0000000000000003
RBP: 00007f1af7a32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000010051 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f1af7c16038 R14: 00007f1af7c15fa0 R15: 00007ffdef205588
 </TASK>


***

WARNING: suspicious RCU usage in tcp_bpf_splice_sendmsg

tree:      bpf-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git
base:      30dee2c176e7954f63d1fa3e52d172f30beb9bfb
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/09e43fc4-ebab-492e-b1de-15bb86aa4588/config
syz repro: https://ci.syzbot.org/findings/8144de7d-1a2e-454e-ab17-08d1c6586df2/syz_repro

=============================
WARNING: suspicious RCU usage
syzkaller #0 Not tainted
-----------------------------
net/ipv4/tcp_bpf.c:883 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
no locks held by syz.2.19/5817.

stack backtrace:
CPU: 0 UID: 0 PID: 5817 Comm: syz.2.19 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 lockdep_rcu_suspicious+0x13f/0x1d0 kernel/locking/lockdep.c:6876
 sk_psock_is_spliced net/ipv4/tcp_bpf.c:883 [inline]
 tcp_bpf_splice_sendmsg+0x1165/0x1490 net/ipv4/tcp_bpf.c:897
 sock_sendmsg_nosec net/socket.c:787 [inline]
 __sock_sendmsg net/socket.c:802 [inline]
 ____sys_sendmsg+0x80a/0x9f0 net/socket.c:2698
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2752
 __sys_sendmmsg+0x27c/0x4e0 net/socket.c:2841
 __do_sys_sendmmsg net/socket.c:2868 [inline]
 __se_sys_sendmmsg net/socket.c:2865 [inline]
 __x64_sys_sendmmsg+0xa0/0xc0 net/socket.c:2865
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fd1c339ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fd1c42e3028 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00007fd1c3615fa0 RCX: 00007fd1c339ce59
RDX: 0000000000000001 RSI: 0000200000001000 RDI: 0000000000000003
RBP: 00007fd1c3432d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000004008005 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fd1c3616038 R14: 00007fd1c3615fa0 R15: 00007ffdcdbac378
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.

* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
  2026-06-12 16:01 ` [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Alexei Starovoitov
  2026-06-12 18:12   ` Cong Wang
@ 2026-06-13 17:57   ` Jakub Kicinski
  2026-06-13 21:25     ` Cong Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2026-06-13 17:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Cong Wang, Network Development, bpf, John Fastabend,
	Jakub Sitnicki, Jiayuan Chen, Hemanth Malla, zijianzhang

On Fri, 12 Jun 2026 09:01:43 -0700 Alexei Starovoitov wrote:
> Just saying that the code is free nowadays, so whether it's 1k lines
> or 10 lines is irrelevant for the discussion.
> 
> As far as the idea goes, I think, it would be interesting in pre-AI era,
> but today splice and friends are a prime target for bugs and more bugs.
> skmsg and tcp_bpf are reeling from unfixed bugs too,
> so my take is that we should not add any new features to skmsg
> and instead deprecate what is already there.

100% agreed. There are so many unfixed skmsg bugs it's hard to know
where to start :( Kernel "intelligence" to help unoptimized applications
is particularly unappealing right now.

* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
  2026-06-13 17:57   ` Jakub Kicinski
@ 2026-06-13 21:25     ` Cong Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Cong Wang @ 2026-06-13 21:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Cong Wang, Network Development, bpf,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Hemanth Malla,
	zijianzhang

On Sat, Jun 13, 2026 at 10:57:30AM -0700, Jakub Kicinski wrote:
> On Fri, 12 Jun 2026 09:01:43 -0700 Alexei Starovoitov wrote:
> > Just saying that the code is free nowadays, so whether it's 1k lines
> > or 10 lines is irrelevant for the discussion.
> > 
> > As far as the idea goes, I think, it would be interesting in pre-AI era,
> > but today splice and friends are a prime target for bugs and more bugs.
> > skmsg and tcp_bpf are reeling from unfixed bugs too,
> > so my take is that we should not add any new features to skmsg
> > and instead deprecate what is already there.
> 
> 100% agreed. There are so many unfixed skmsg bugs it's hard to know
> where to start :( Kernel "intelligence" to help unoptimized applications
> is particularly unappealing right now.

You are absolutely right. :)

Thanks for offering me the opportunity to profit from it.

Thread overview: 18+ messages
2026-06-12  1:14 [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Cong Wang
2026-06-12  1:14 ` [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice Cong Wang
2026-06-12  1:33   ` sashiko-bot
2026-06-12  2:10   ` bot+bpf-ci
2026-06-12  1:14 ` [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver Cong Wang
2026-06-12  3:29   ` sashiko-bot
2026-06-12  1:14 ` [RFC PATCH bpf-next 3/5] selftests/bpf: add tcp_splice basic round-trip test Cong Wang
2026-06-12  1:28   ` sashiko-bot
2026-06-12  1:14 ` [RFC PATCH bpf-next 4/5] bpf: allow SO_BUSY_POLL in bpf_setsockopt() Cong Wang
2026-06-12  1:14 ` [RFC PATCH bpf-next 5/5] selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog Cong Wang
2026-06-12  1:26   ` sashiko-bot
2026-06-12 16:01 ` [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Alexei Starovoitov
2026-06-12 18:12   ` Cong Wang
2026-06-12 18:34     ` Alexei Starovoitov
2026-06-12 20:17       ` Cong Wang
2026-06-13 17:57   ` Jakub Kicinski
2026-06-13 21:25     ` Cong Wang
2026-06-12 22:10 ` [syzbot ci] " syzbot ci
