From: Cong Wang <xiyou.wangcong@gmail.com>
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, John Fastabend <john.fastabend@gmail.com>,
    Jakub Sitnicki <jakub@cloudflare.com>,
    Jiayuan Chen <jiayuan.chen@linux.dev>,
    hemanthmalla@gmail.com, zijianzhang@bytedance.com,
    Cong Wang <xiyou.wangcong@gmail.com>,
    Cong Wang <cwang@multikernel.io>
Subject: [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver
Date: Thu, 11 Jun 2026 18:14:49 -0700
Message-ID: <20260612011452.134466-3-xiyou.wangcong@gmail.com>
In-Reply-To: <20260612011452.134466-1-xiyou.wangcong@gmail.com>

When a paired-splice receiver finds the ring empty it parks on the
socket waitqueue. For latency-bound synchronous-RPC workloads that is
a wakeup per request-response cycle, which dominates the per-cycle
cost.

Add an optional bounded busy-poll of the ring before parking, reusing
the socket's SO_BUSY_POLL budget (sk_ll_usec) via sk_can_busy_loop()
and sk_busy_loop_timeout(). The default budget of 0 leaves
sk_can_busy_loop() false, so this is a no-op unless the application
(or net.core.busy_read) opted in.

Unlike sk_busy_loop() / napi_busy_loop(), splice_busy_loop() spins on
the in-kernel ring directly rather than polling a NAPI instance, so it
is effective on loopback - which delivers via the per-CPU backlog and
exposes no pollable napi_id. Keeping the receiver hot lets a
synchronous sender's small writes accumulate in the ring without a
wakeup per message; this is what turns the latency-bound TCP_RR case
into a large win once enabled.

A BPF program enables the budget by setting SO_BUSY_POLL via
bpf_setsockopt() (see patch 4/5). netperf, pinned CPUs, 3x10s, 50 us
budget, baseline TCP vs splice + busy-poll:

  TCP_RR (loopback)     1 B   111.9k -> 1113.8k tps   (9.96x)
                       64 B   111.7k -> 1073.3k tps   (9.61x)
                       1 KB   106.1k ->  713.0k tps   (6.72x)
                      16 KB    40.3k ->  123.7k tps   (3.07x)
                      64 KB    17.8k ->   40.5k tps   (2.28x)
  TCP_RR (container)    1 B   105.6k -> 1103.7k tps  (10.45x)
                       64 B   105.5k -> 1103.9k tps  (10.46x)
                       1 KB   100.4k ->  704.9k tps   (7.02x)
                      16 KB    45.1k ->  114.8k tps   (2.54x)
                      64 KB    18.2k ->   38.8k tps   (2.13x)

Busy polling contributes ~4.2x of the 1 B loopback win (splice without
it is 267.0k tps; see patch 1/5). Baseline TCP is unchanged by
busy_read on both loopback and default (non-XDP) veth: both deliver via
the per-CPU backlog, which has no pollable napi_id, so SO_BUSY_POLL is a
no-op for them (the container baseline TCP_RR measures the same at
busy_read 0 and 50). The gain therefore comes from the splice ring spin,
not from busy_read itself.

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
net/ipv4/tcp_bpf.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
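The bounded-spin-then-park pattern used by splice_busy_loop() can be
sketched in userspace C. This is an analogue under stated assumptions, not
the kernel code: spin_until_ready(), ready_fn, now_usec() and
ready_after_n_polls() are hypothetical names, CLOCK_MONOTONIC stands in for
busy_loop_current_time(), and there is no equivalent of cpu_relax() or
signal_pending(). Unlike the kernel helper, which returns void and lets the
caller re-check the ring, this version returns whether readiness was seen.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical readiness predicate, e.g. "does the ring hold data?" */
typedef bool (*ready_fn)(void *ctx);

/* Monotonic time in microseconds. */
static uint64_t now_usec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Spin on ready(ctx) for at most budget_usec microseconds.
 * Returns true if readiness was observed, false if the budget is 0
 * (no opt-in, so no spin at all) or the budget expired.
 */
static bool spin_until_ready(ready_fn ready, void *ctx, uint64_t budget_usec)
{
	uint64_t start;

	assert(ready != NULL);
	if (budget_usec == 0)
		return false;

	start = now_usec();
	do {
		if (ready(ctx))
			return true;
	} while (now_usec() - start < budget_usec);

	return false;
}

/* Demo predicate: reports not-ready for the first *ctx polls, then ready. */
static bool ready_after_n_polls(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

A caller would spin first and fall back to a blocking wait only when
spin_until_ready() returns false, which corresponds to the choice between
"continue" and splice_recv_wait() in the recvmsg hunk of this patch.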
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 549f37077244..9c4421a74225 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -13,6 +13,7 @@
#include <linux/util_macros.h>
#include <linux/percpu-refcount.h>
+#include <net/busy_poll.h>
#include <net/inet_common.h>
#include <net/inet_sock.h>
#include <net/tls.h>
@@ -1255,6 +1256,33 @@ static long splice_recv_wait(struct sock *sk, struct sk_psock_splice *s,
splice_recv_ready(sk, s), timeo);
}
+/* Bounded busy-poll on the ring before parking the receiver. Reuses the
+ * socket's SO_BUSY_POLL budget (sk_ll_usec) via sk_can_busy_loop() and
+ * sk_busy_loop_timeout(); the default budget of 0 makes sk_can_busy_loop()
+ * false so this is a no-op unless the application (or net.core.busy_read)
+ * opted in.
+ *
+ * Unlike sk_busy_loop() / napi_busy_loop(), this spins on the in-kernel
+ * ring directly rather than polling a NAPI instance, so it is effective on
+ * loopback - which delivers via the per-CPU backlog and exposes no
+ * pollable napi_id. Keeping the receiver hot lets a synchronous sender's
+ * small writes accumulate in the ring without a wakeup per message.
+ */
+static void splice_busy_loop(struct sock *sk, struct sk_psock_splice *s)
+{
+	unsigned long start;
+
+	if (!sk_can_busy_loop(sk))
+		return;
+
+	start = busy_loop_current_time();
+	do {
+		cpu_relax();
+		if (splice_recv_ready(sk, s) || signal_pending(current))
+			return;
+	} while (!sk_busy_loop_timeout(sk, start));
+}
+
/* prot->sock_is_readable for paired-splice sockets. tcp_stream_is_readable()
* (via tcp_poll() / select() / epoll) consults this to mark POLLIN when
* sk_receive_queue is empty - we must also report data sitting in the
@@ -1349,6 +1377,16 @@ static int tcp_bpf_splice_recvmsg(struct sock *sk,
return 0;
}
+		/* Spin on the ring for the SO_BUSY_POLL budget before
+		 * sleeping. If the spin observes data, re-read from the
+		 * loop head; otherwise (budget expired or a terminal
+		 * condition) proceed to park - splice_recv_wait() returns
+		 * immediately for terminal conditions.
+		 */
+		splice_busy_loop(sk, s);
+		if (splice_ring_has_data(s))
+			continue;
+
timeo = splice_recv_wait(sk, s, timeo);
}
}
--
2.43.0