From: Cong Wang <xiyou.wangcong@gmail.com>
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, John Fastabend <john.fastabend@gmail.com>,
	Jakub Sitnicki <jakub@cloudflare.com>,
	Jiayuan Chen <jiayuan.chen@linux.dev>,
	hemanthmalla@gmail.com, zijianzhang@bytedance.com,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Cong Wang <cwang@multikernel.io>
Subject: [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver
Date: Thu, 11 Jun 2026 18:14:49 -0700
Message-ID: <20260612011452.134466-3-xiyou.wangcong@gmail.com>
In-Reply-To: <20260612011452.134466-1-xiyou.wangcong@gmail.com>

When a paired-splice receiver finds the ring empty, it parks on the
socket waitqueue. For latency-bound synchronous-RPC workloads that
means one wakeup per request-response cycle, and the wakeup dominates
the per-cycle cost.

Add an optional bounded busy-poll of the ring before parking, reusing
the socket's SO_BUSY_POLL budget (sk_ll_usec) via sk_can_busy_loop()
and sk_busy_loop_timeout(). The default budget of 0 leaves
sk_can_busy_loop() false, so this is a no-op unless the application
(or net.core.busy_read) opted in.

Unlike sk_busy_loop() / napi_busy_loop(), splice_busy_loop() spins on
the in-kernel ring directly rather than polling a NAPI instance, so it
is effective on loopback - which delivers via the per-CPU backlog and
exposes no pollable napi_id. Keeping the receiver hot lets a
synchronous sender's small writes accumulate in the ring without a
wakeup per message; this is what turns the latency-bound TCP_RR case
into a large win once enabled.
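
The shape of the spin can be modeled in userspace. The following is a
toy sketch, not the kernel code: struct toy_ring, now_usec() and
toy_busy_loop() are hypothetical names, an atomic flag stands in for
splice_recv_ready(), and a CLOCK_MONOTONIC deadline stands in for
sk_busy_loop_timeout(). It leaves out cpu_relax() and the
signal_pending() check.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the splice ring: just a "data ready" flag. */
struct toy_ring {
	atomic_bool has_data;
};

/* Monotonic time in microseconds. */
static uint64_t now_usec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Check the ring once, then spin for at most budget_usec waiting for
 * data. Returns true if data was observed, false if the caller should
 * fall back to a blocking wait. A budget of 0 never spins, which
 * models the default SO_BUSY_POLL setting.
 */
static bool toy_busy_loop(struct toy_ring *r, unsigned int budget_usec)
{
	uint64_t start;

	if (atomic_load_explicit(&r->has_data, memory_order_acquire))
		return true;
	if (budget_usec == 0)
		return false;

	start = now_usec();
	do {
		if (atomic_load_explicit(&r->has_data, memory_order_acquire))
			return true;
	} while (now_usec() - start < budget_usec);

	return false;
}
```

A receiver loop built on this would call toy_busy_loop() first and only
block (for example on a condition variable) when it returns false, so a
producer that publishes within the budget never pays for a wakeup.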

A BPF program enables the budget by setting SO_BUSY_POLL via
bpf_setsockopt() (see patch 4/5 of this series). Measured with netperf
on pinned CPUs, 3 x 10 s runs, 50 us busy-poll budget; baseline TCP vs
splice + busy-poll:

  TCP_RR (loopback)    1 B    111.9k -> 1113.8k tps  (9.96x)
                       64 B   111.7k -> 1073.3k tps  (9.61x)
                       1 KB   106.1k ->  713.0k tps  (6.72x)
                       16 KB   40.3k ->  123.7k tps  (3.07x)
                       64 KB   17.8k ->   40.5k tps  (2.28x)

  TCP_RR (container)   1 B    105.6k -> 1103.7k tps  (10.45x)
                       64 B   105.5k -> 1103.9k tps  (10.46x)
                       1 KB   100.4k ->  704.9k tps  (7.02x)
                       16 KB   45.1k ->  114.8k tps  (2.54x)
                       64 KB   18.2k ->   38.8k tps  (2.13x)

Busy polling accounts for a ~4.2x factor of the 1 B loopback win
(splice without it reaches 267.0k tps; see the splice patch). Baseline
TCP is unaffected by busy_read on both loopback and default (non-XDP)
veth: both deliver via the per-CPU backlog, which has no pollable
napi_id, so SO_BUSY_POLL is a no-op for them (the container baseline
TCP_RR measures the same at busy_read 0 and 50). The gain therefore
comes from spinning on the splice ring, not from busy_read itself.
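
The 1 B loopback figures quoted above can be split into a splice-only
factor and a busy-poll factor. The snippet below is only a worked
arithmetic aid using those three numbers (111.9k, 267.0k and 1113.8k
tps); struct rr_result and the helper names are hypothetical.

```c
/* Throughput in thousands of transactions per second (k tps). */
struct rr_result {
	double baseline;	/* plain TCP */
	double splice_only;	/* splice, busy-poll budget 0 */
	double splice_busy;	/* splice + 50 us busy-poll budget */
};

/* Gain from the splice path alone. */
static double splice_factor(const struct rr_result *r)
{
	return r->splice_only / r->baseline;
}

/* Additional gain from busy polling on top of splice. */
static double busy_poll_factor(const struct rr_result *r)
{
	return r->splice_busy / r->splice_only;
}

/* End-to-end gain; equals splice_factor() * busy_poll_factor(). */
static double total_factor(const struct rr_result *r)
{
	return r->splice_busy / r->baseline;
}
```

Filled in with { 111.9, 267.0, 1113.8 }, the splice path alone gives
roughly 2.4x and busy polling roughly 4.2x on top of it; the two
multiply to the end-to-end TCP_RR gain.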

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 net/ipv4/tcp_bpf.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 549f37077244..9c4421a74225 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -13,6 +13,7 @@
 #include <linux/util_macros.h>
 #include <linux/percpu-refcount.h>
 
+#include <net/busy_poll.h>
 #include <net/inet_common.h>
 #include <net/inet_sock.h>
 #include <net/tls.h>
@@ -1255,6 +1256,33 @@ static long splice_recv_wait(struct sock *sk, struct sk_psock_splice *s,
 					splice_recv_ready(sk, s), timeo);
 }
 
+/* Bounded busy-poll on the ring before parking the receiver. Reuses the
+ * socket's SO_BUSY_POLL budget (sk_ll_usec) via sk_can_busy_loop() and
+ * sk_busy_loop_timeout(); the default budget of 0 makes sk_can_busy_loop()
+ * false so this is a no-op unless the application (or net.core.busy_read)
+ * opted in.
+ *
+ * Unlike sk_busy_loop() / napi_busy_loop(), this spins on the in-kernel
+ * ring directly rather than polling a NAPI instance, so it is effective on
+ * loopback - which delivers via the per-CPU backlog and exposes no
+ * pollable napi_id. Keeping the receiver hot lets a synchronous sender's
+ * small writes accumulate in the ring without a wakeup per message.
+ */
+static void splice_busy_loop(struct sock *sk, struct sk_psock_splice *s)
+{
+	unsigned long start;
+
+	if (!sk_can_busy_loop(sk))
+		return;
+
+	start = busy_loop_current_time();
+	do {
+		cpu_relax();
+		if (splice_recv_ready(sk, s) || signal_pending(current))
+			return;
+	} while (!sk_busy_loop_timeout(sk, start));
+}
+
 /* prot->sock_is_readable for paired-splice sockets. tcp_stream_is_readable()
  * (via tcp_poll() / select() / epoll) consults this to mark POLLIN when
  * sk_receive_queue is empty - we must also report data sitting in the
@@ -1349,6 +1377,16 @@ static int tcp_bpf_splice_recvmsg(struct sock *sk,
 			return 0;
 		}
 
+		/* Spin on the ring for the SO_BUSY_POLL budget before
+		 * sleeping. If the spin observes data, re-read from the
+		 * loop head; otherwise (budget expired or a terminal
+		 * condition) proceed to park - splice_recv_wait() returns
+		 * immediately for terminal conditions.
+		 */
+		splice_busy_loop(sk, s);
+		if (splice_ring_has_data(s))
+			continue;
+
 		timeo = splice_recv_wait(sk, s, timeo);
 	}
 }
-- 
2.43.0

