netdev.vger.kernel.org archive mirror
* [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking
@ 2025-07-01  1:11 Cong Wang
  2025-07-01  1:11 ` [Patch bpf-next v4 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Cong Wang @ 2025-07-01  1:11 UTC (permalink / raw)
  To: netdev; +Cc: bpf, john.fastabend, jakub, zijianzhang, zhoufeng.zf, Cong Wang

This patchset improves skmsg ingress redirection performance by a)
sophisticated batching with kworker; b) skmsg allocation caching with
kmem cache.

As a result, our patches significantly outperform the vanilla kernel
in terms of throughput for almost all packet sizes. The percentage
improvement in throughput ranges from 3.13% to 160.92%, with smaller
packets showing the highest improvements.

For latency, the patchset induces slightly higher latency across most
packet sizes than the vanilla kernel, which is expected as a natural
side effect of batching.

Here are the detailed benchmarks:

+-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| Throughput  | 64     | 128    | 256    | 512    | 1k     | 4k     | 16k    | 32k    | 64k    | 128k   | 256k   |
+-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| Vanilla     | 0.17±0.02 | 0.36±0.01 | 0.72±0.02 | 1.37±0.05 | 2.60±0.12 | 8.24±0.44 | 22.38±2.02 | 25.49±1.28 | 43.07±1.36 | 66.87±4.14 | 73.70±7.15 |
| Patched     | 0.41±0.01 | 0.82±0.02 | 1.62±0.05 | 3.33±0.01 | 6.45±0.02 | 21.50±0.08 | 46.22±0.31 | 50.20±1.12 | 45.39±1.29 | 68.96±1.12 | 78.35±1.49 |
| Percentage  | 141.18%   | 127.78%   | 125.00%   | 143.07%   | 148.08%   | 160.92%   | 106.52%    | 97.00%     | 5.38%      | 3.13%      | 6.32%      |
+-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+

+-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Latency     | 64        | 128       | 256       | 512       | 1k        | 4k        | 16k       | 32k       | 63k       |
+-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Vanilla     | 5.80±4.02 | 5.83±3.61 | 5.86±4.10 | 5.91±4.19 | 5.98±4.14 | 6.61±4.47 | 8.60±2.59 | 10.96±5.50| 15.02±6.78|
| Patched     | 6.18±3.03 | 6.23±4.38 | 6.25±4.44 | 6.13±4.35 | 6.32±4.23 | 6.94±4.61 | 8.90±5.49 | 11.12±6.10| 14.88±6.55|
| Percentage  | 6.55%     | 6.87%     | 6.66%     | 3.72%     | 5.68%     | 4.99%     | 3.49%     | 1.46%     |-0.93%     |
+-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

---
v4: pass false instead of 'redir_ingress' to tcp_bpf_sendmsg_redir()

v3: no change, just rebase

v2: improved commit message of patch 3/4
    changed to 'u8' for bitfields, as suggested by Jakub

Cong Wang (2):
  skmsg: rename sk_msg_alloc() to sk_msg_expand()
  skmsg: save some space in struct sk_psock

Zijian Zhang (2):
  skmsg: implement slab allocator cache for sk_msg
  tcp_bpf: improve ingress redirection performance with message corking

 include/linux/skmsg.h |  48 +++++++---
 net/core/skmsg.c      | 173 ++++++++++++++++++++++++++++++++---
 net/ipv4/tcp_bpf.c    | 204 +++++++++++++++++++++++++++++++++++++++---
 net/tls/tls_sw.c      |   6 +-
 net/xfrm/espintcp.c   |   2 +-
 5 files changed, 394 insertions(+), 39 deletions(-)

-- 
2.34.1



* [Patch bpf-next v4 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand()
  2025-07-01  1:11 [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
@ 2025-07-01  1:11 ` Cong Wang
  2025-07-02  9:24   ` Jakub Sitnicki
  2025-07-01  1:11 ` [Patch bpf-next v4 2/4] skmsg: implement slab allocator cache for sk_msg Cong Wang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2025-07-01  1:11 UTC (permalink / raw)
  To: netdev; +Cc: bpf, john.fastabend, jakub, zijianzhang, zhoufeng.zf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

The name sk_msg_alloc() is misleading: the function does not allocate
an sk_msg at all, it simply refills the sock page frags. Rename it to
sk_msg_expand() to better reflect what it actually does.

Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h | 4 ++--
 net/core/skmsg.c      | 6 +++---
 net/ipv4/tcp_bpf.c    | 2 +-
 net/tls/tls_sw.c      | 6 +++---
 net/xfrm/espintcp.c   | 2 +-
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 0b9095a281b8..d6f0a8cd73c4 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -121,8 +121,8 @@ struct sk_psock {
 	struct rcu_work			rwork;
 };
 
-int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
-		 int elem_first_coalesce);
+int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
+		  int elem_first_coalesce);
 int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
 		 u32 off, u32 len);
 void sk_msg_trim(struct sock *sk, struct sk_msg *msg, int len);
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 34c51eb1a14f..0939356828a4 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -24,8 +24,8 @@ static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
 	return false;
 }
 
-int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
-		 int elem_first_coalesce)
+int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
+		  int elem_first_coalesce)
 {
 	struct page_frag *pfrag = sk_page_frag(sk);
 	u32 osize = msg->sg.size;
@@ -82,7 +82,7 @@ int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
 	sk_msg_trim(sk, msg, osize);
 	return ret;
 }
-EXPORT_SYMBOL_GPL(sk_msg_alloc);
+EXPORT_SYMBOL_GPL(sk_msg_expand);
 
 int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
 		 u32 off, u32 len)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index ba581785adb4..85b64ffc20c6 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -530,7 +530,7 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 		}
 
 		osize = msg_tx->sg.size;
-		err = sk_msg_alloc(sk, msg_tx, msg_tx->sg.size + copy, msg_tx->sg.end - 1);
+		err = sk_msg_expand(sk, msg_tx, msg_tx->sg.size + copy, msg_tx->sg.end - 1);
 		if (err) {
 			if (err != -ENOSPC)
 				goto wait_for_memory;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index fc88e34b7f33..1e8d7fc179a8 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -324,7 +324,7 @@ static int tls_alloc_encrypted_msg(struct sock *sk, int len)
 	struct tls_rec *rec = ctx->open_rec;
 	struct sk_msg *msg_en = &rec->msg_encrypted;
 
-	return sk_msg_alloc(sk, msg_en, len, 0);
+	return sk_msg_expand(sk, msg_en, len, 0);
 }
 
 static int tls_clone_plaintext_msg(struct sock *sk, int required)
@@ -619,8 +619,8 @@ static int tls_split_open_record(struct sock *sk, struct tls_rec *from,
 	new = tls_get_rec(sk);
 	if (!new)
 		return -ENOMEM;
-	ret = sk_msg_alloc(sk, &new->msg_encrypted, msg_opl->sg.size +
-			   tx_overhead_size, 0);
+	ret = sk_msg_expand(sk, &new->msg_encrypted, msg_opl->sg.size +
+			    tx_overhead_size, 0);
 	if (ret < 0) {
 		tls_free_rec(sk, new);
 		return ret;
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index fc7a603b04f1..db7585ec7610 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -353,7 +353,7 @@ static int espintcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	sk_msg_init(&emsg->skmsg);
 	while (1) {
 		/* only -ENOMEM is possible since we don't coalesce */
-		err = sk_msg_alloc(sk, &emsg->skmsg, msglen, 0);
+		err = sk_msg_expand(sk, &emsg->skmsg, msglen, 0);
 		if (!err)
 			break;
 
-- 
2.34.1



* [Patch bpf-next v4 2/4] skmsg: implement slab allocator cache for sk_msg
  2025-07-01  1:11 [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
  2025-07-01  1:11 ` [Patch bpf-next v4 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
@ 2025-07-01  1:11 ` Cong Wang
  2025-07-02 11:36   ` Jakub Sitnicki
  2025-07-01  1:12 ` [Patch bpf-next v4 3/4] skmsg: save some space in struct sk_psock Cong Wang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2025-07-01  1:11 UTC (permalink / raw)
  To: netdev; +Cc: bpf, john.fastabend, jakub, zijianzhang, zhoufeng.zf, Cong Wang

From: Zijian Zhang <zijianzhang@bytedance.com>

Optimizing redirect ingress performance requires frequent allocation and
deallocation of sk_msg structures. Introduce a dedicated kmem_cache for
sk_msg to reduce memory allocation overhead and improve performance.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
---
 include/linux/skmsg.h | 21 ++++++++++++---------
 net/core/skmsg.c      | 28 +++++++++++++++++++++-------
 net/ipv4/tcp_bpf.c    |  5 ++---
 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index d6f0a8cd73c4..bf28ce9b5fdb 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -121,6 +121,7 @@ struct sk_psock {
 	struct rcu_work			rwork;
 };
 
+struct sk_msg *sk_msg_alloc(gfp_t gfp);
 int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
 		  int elem_first_coalesce);
 int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
@@ -143,6 +144,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 		   int len, int flags);
 bool sk_msg_is_readable(struct sock *sk);
 
+extern struct kmem_cache *sk_msg_cachep;
+
 static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
 {
 	WARN_ON(i == msg->sg.end && bytes);
@@ -319,6 +322,13 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
 	kfree_skb(skb);
 }
 
+static inline void kfree_sk_msg(struct sk_msg *msg)
+{
+	if (msg->skb)
+		consume_skb(msg->skb);
+	kmem_cache_free(sk_msg_cachep, msg);
+}
+
 static inline bool sk_psock_queue_msg(struct sk_psock *psock,
 				      struct sk_msg *msg)
 {
@@ -330,7 +340,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
 		ret = true;
 	} else {
 		sk_msg_free(psock->sk, msg);
-		kfree(msg);
+		kfree_sk_msg(msg);
 		ret = false;
 	}
 	spin_unlock_bh(&psock->ingress_lock);
@@ -378,13 +388,6 @@ static inline bool sk_psock_queue_empty(const struct sk_psock *psock)
 	return psock ? list_empty(&psock->ingress_msg) : true;
 }
 
-static inline void kfree_sk_msg(struct sk_msg *msg)
-{
-	if (msg->skb)
-		consume_skb(msg->skb);
-	kfree(msg);
-}
-
 static inline void sk_psock_report_error(struct sk_psock *psock, int err)
 {
 	struct sock *sk = psock->sk;
@@ -441,7 +444,7 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
 {
 	if (psock->cork) {
 		sk_msg_free(psock->sk, psock->cork);
-		kfree(psock->cork);
+		kfree_sk_msg(psock->cork);
 		psock->cork = NULL;
 	}
 }
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 0939356828a4..5aafa5817394 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -10,6 +10,8 @@
 #include <net/tls.h>
 #include <trace/events/sock.h>
 
+struct kmem_cache *sk_msg_cachep;
+
 static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
 {
 	if (msg->sg.end > msg->sg.start &&
@@ -503,16 +505,17 @@ bool sk_msg_is_readable(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_msg_is_readable);
 
-static struct sk_msg *alloc_sk_msg(gfp_t gfp)
+struct sk_msg *sk_msg_alloc(gfp_t gfp)
 {
 	struct sk_msg *msg;
 
-	msg = kzalloc(sizeof(*msg), gfp | __GFP_NOWARN);
+	msg = kmem_cache_zalloc(sk_msg_cachep, gfp | __GFP_NOWARN);
 	if (unlikely(!msg))
 		return NULL;
 	sg_init_marker(msg->sg.data, NR_MSG_FRAG_IDS);
 	return msg;
 }
+EXPORT_SYMBOL_GPL(sk_msg_alloc);
 
 static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
 						  struct sk_buff *skb)
@@ -523,7 +526,7 @@ static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
 	if (!sk_rmem_schedule(sk, skb, skb->truesize))
 		return NULL;
 
-	return alloc_sk_msg(GFP_KERNEL);
+	return sk_msg_alloc(GFP_KERNEL);
 }
 
 static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
@@ -598,7 +601,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 	skb_set_owner_r(skb, sk);
 	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
 	if (err < 0)
-		kfree(msg);
+		kfree_sk_msg(msg);
 	return err;
 }
 
@@ -609,7 +612,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb,
 				     u32 off, u32 len, bool take_ref)
 {
-	struct sk_msg *msg = alloc_sk_msg(GFP_ATOMIC);
+	struct sk_msg *msg = sk_msg_alloc(GFP_ATOMIC);
 	struct sock *sk = psock->sk;
 	int err;
 
@@ -618,7 +621,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 	skb_set_owner_r(skb, sk);
 	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
 	if (err < 0)
-		kfree(msg);
+		kfree_sk_msg(msg);
 	return err;
 }
 
@@ -795,7 +798,7 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
 		if (!msg->skb)
 			atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
 		sk_msg_free(psock->sk, msg);
-		kfree(msg);
+		kfree_sk_msg(msg);
 	}
 }
 
@@ -1280,3 +1283,14 @@ void sk_psock_stop_verdict(struct sock *sk, struct sk_psock *psock)
 	sk->sk_data_ready = psock->saved_data_ready;
 	psock->saved_data_ready = NULL;
 }
+
+static int __init sk_msg_cachep_init(void)
+{
+	sk_msg_cachep = kmem_cache_create("sk_msg_cachep",
+					  sizeof(struct sk_msg),
+					  0,
+					  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
+					  NULL);
+	return 0;
+}
+late_initcall(sk_msg_cachep_init);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 85b64ffc20c6..f0ef41c951e2 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -38,7 +38,7 @@ static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
 	struct sk_msg *tmp;
 	int i, ret = 0;
 
-	tmp = kzalloc(sizeof(*tmp), __GFP_NOWARN | GFP_KERNEL);
+	tmp = sk_msg_alloc(GFP_KERNEL);
 	if (unlikely(!tmp))
 		return -ENOMEM;
 
@@ -406,8 +406,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 	    msg->cork_bytes > msg->sg.size && !enospc) {
 		psock->cork_bytes = msg->cork_bytes - msg->sg.size;
 		if (!psock->cork) {
-			psock->cork = kzalloc(sizeof(*psock->cork),
-					      GFP_ATOMIC | __GFP_NOWARN);
+			psock->cork = sk_msg_alloc(GFP_ATOMIC);
 			if (!psock->cork)
 				return -ENOMEM;
 		}
-- 
2.34.1



* [Patch bpf-next v4 3/4] skmsg: save some space in struct sk_psock
  2025-07-01  1:11 [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
  2025-07-01  1:11 ` [Patch bpf-next v4 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
  2025-07-01  1:11 ` [Patch bpf-next v4 2/4] skmsg: implement slab allocator cache for sk_msg Cong Wang
@ 2025-07-01  1:12 ` Cong Wang
  2025-07-02 11:46   ` Jakub Sitnicki
  2025-07-01  1:12 ` [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2025-07-01  1:12 UTC (permalink / raw)
  To: netdev; +Cc: bpf, john.fastabend, jakub, zijianzhang, zhoufeng.zf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

This patch aims to save some space in struct sk_psock and prepares for
the next patch which will add more fields.

psock->eval can only have 4 possible values, so an 8-bit field is
sufficient.

psock->redir_ingress is just a boolean, so a single bit is enough.

Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index bf28ce9b5fdb..7620f170c4b1 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -85,8 +85,8 @@ struct sk_psock {
 	struct sock			*sk_redir;
 	u32				apply_bytes;
 	u32				cork_bytes;
-	u32				eval;
-	bool				redir_ingress; /* undefined if sk_redir is null */
+	u8				eval;
+	u8 				redir_ingress : 1; /* undefined if sk_redir is null */
 	struct sk_msg			*cork;
 	struct sk_psock_progs		progs;
 #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
-- 
2.34.1



* [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-01  1:11 [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
                   ` (2 preceding siblings ...)
  2025-07-01  1:12 ` [Patch bpf-next v4 3/4] skmsg: save some space in struct sk_psock Cong Wang
@ 2025-07-01  1:12 ` Cong Wang
  2025-07-02 12:17   ` Jakub Sitnicki
  2025-07-02 10:22 ` [Patch bpf-next v4 0/4] " Jakub Sitnicki
  2025-07-02 11:05 ` Jakub Sitnicki
  5 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2025-07-01  1:12 UTC (permalink / raw)
  To: netdev
  Cc: bpf, john.fastabend, jakub, zijianzhang, zhoufeng.zf, Amery Hung,
	Cong Wang

From: Zijian Zhang <zijianzhang@bytedance.com>

The TCP_BPF ingress redirection path currently lacks the message corking
mechanism found in standard TCP. This causes the sender to wake up the
receiver for every message, even when messages are small, resulting in
reduced throughput compared to regular TCP in certain scenarios.

This change introduces a kernel worker-based intermediate layer to provide
automatic message corking for TCP_BPF. While this adds a slight latency
overhead, it significantly improves overall throughput by reducing
unnecessary wake-ups and sock lock contention.

Reviewed-by: Amery Hung <amery.hung@bytedance.com>
Co-developed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
---
 include/linux/skmsg.h |  19 ++++
 net/core/skmsg.c      | 139 ++++++++++++++++++++++++++++-
 net/ipv4/tcp_bpf.c    | 197 ++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 347 insertions(+), 8 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 7620f170c4b1..2531428168ad 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -15,6 +15,8 @@
 
 #define MAX_MSG_FRAGS			MAX_SKB_FRAGS
 #define NR_MSG_FRAG_IDS			(MAX_MSG_FRAGS + 1)
+/* GSO size for TCP BPF backlog processing */
+#define TCP_BPF_GSO_SIZE		65536
 
 enum __sk_action {
 	__SK_DROP = 0,
@@ -85,8 +87,10 @@ struct sk_psock {
 	struct sock			*sk_redir;
 	u32				apply_bytes;
 	u32				cork_bytes;
+	u32				backlog_since_notify;
 	u8				eval;
 	u8 				redir_ingress : 1; /* undefined if sk_redir is null */
+	u8				backlog_work_delayed : 1;
 	struct sk_msg			*cork;
 	struct sk_psock_progs		progs;
 #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
@@ -97,6 +101,9 @@ struct sk_psock {
 	struct sk_buff_head		ingress_skb;
 	struct list_head		ingress_msg;
 	spinlock_t			ingress_lock;
+	struct list_head		backlog_msg;
+	/* spin_lock for backlog_msg and backlog_since_notify */
+	spinlock_t			backlog_msg_lock;
 	unsigned long			state;
 	struct list_head		link;
 	spinlock_t			link_lock;
@@ -117,11 +124,13 @@ struct sk_psock {
 	struct mutex			work_mutex;
 	struct sk_psock_work_state	work_state;
 	struct delayed_work		work;
+	struct delayed_work		backlog_work;
 	struct sock			*sk_pair;
 	struct rcu_work			rwork;
 };
 
 struct sk_msg *sk_msg_alloc(gfp_t gfp);
+bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce);
 int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
 		  int elem_first_coalesce);
 int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
@@ -396,9 +405,19 @@ static inline void sk_psock_report_error(struct sk_psock *psock, int err)
 	sk_error_report(sk);
 }
 
+void sk_psock_backlog_msg(struct sk_psock *psock);
 struct sk_psock *sk_psock_init(struct sock *sk, int node);
 void sk_psock_stop(struct sk_psock *psock);
 
+static inline void sk_psock_run_backlog_work(struct sk_psock *psock,
+					     bool delayed)
+{
+	if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+		return;
+	psock->backlog_work_delayed = delayed;
+	schedule_delayed_work(&psock->backlog_work, delayed ? 1 : 0);
+}
+
 #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
 int sk_psock_init_strp(struct sock *sk, struct sk_psock *psock);
 void sk_psock_start_strp(struct sock *sk, struct sk_psock *psock);
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 5aafa5817394..29ac83fc05d8 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -12,7 +12,7 @@
 
 struct kmem_cache *sk_msg_cachep;
 
-static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
+bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
 {
 	if (msg->sg.end > msg->sg.start &&
 	    elem_first_coalesce < msg->sg.end)
@@ -721,6 +721,118 @@ static void sk_psock_backlog(struct work_struct *work)
 	sk_psock_put(psock->sk, psock);
 }
 
+static bool backlog_notify(struct sk_psock *psock, bool m_sched_failed,
+			   bool ingress_empty)
+{
+	/* Notify if:
+	 * 1. We have corked enough bytes
+	 * 2. We have already delayed notification
+	 * 3. Memory allocation failed
+	 * 4. Ingress queue was empty and we're about to add data
+	 */
+	return psock->backlog_since_notify >= TCP_BPF_GSO_SIZE ||
+	       psock->backlog_work_delayed ||
+	       m_sched_failed ||
+	       ingress_empty;
+}
+
+static bool backlog_xfer_to_local(struct sk_psock *psock, struct sock *sk_from,
+				  struct list_head *local_head, u32 *tot_size)
+{
+	struct sock *sk = psock->sk;
+	struct sk_msg *msg, *tmp;
+	u32 size = 0;
+
+	list_for_each_entry_safe(msg, tmp, &psock->backlog_msg, list) {
+		if (msg->sk != sk_from)
+			break;
+
+		if (!__sk_rmem_schedule(sk, msg->sg.size, false))
+			return true;
+
+		list_move_tail(&msg->list, local_head);
+		sk_wmem_queued_add(msg->sk, -msg->sg.size);
+		sock_put(msg->sk);
+		msg->sk = NULL;
+		psock->backlog_since_notify += msg->sg.size;
+		size += msg->sg.size;
+	}
+
+	*tot_size = size;
+	return false;
+}
+
+/* This function handles the transfer of backlogged messages from the sender
+ * backlog queue to the ingress queue of the peer socket. Notification of data
+ * availability will be sent under some conditions.
+ */
+void sk_psock_backlog_msg(struct sk_psock *psock)
+{
+	bool rmem_schedule_failed = false;
+	struct sock *sk_from = NULL;
+	struct sock *sk = psock->sk;
+	LIST_HEAD(local_head);
+	struct sk_msg *msg;
+	bool should_notify;
+	u32 tot_size = 0;
+
+	if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+		return;
+
+	lock_sock(sk);
+	spin_lock(&psock->backlog_msg_lock);
+
+	msg = list_first_entry_or_null(&psock->backlog_msg,
+				       struct sk_msg, list);
+	if (!msg) {
+		should_notify = !list_empty(&psock->ingress_msg);
+		spin_unlock(&psock->backlog_msg_lock);
+		goto notify;
+	}
+
+	sk_from = msg->sk;
+	sock_hold(sk_from);
+
+	rmem_schedule_failed = backlog_xfer_to_local(psock, sk_from,
+						     &local_head, &tot_size);
+	should_notify = backlog_notify(psock, rmem_schedule_failed,
+				       list_empty(&psock->ingress_msg));
+	spin_unlock(&psock->backlog_msg_lock);
+
+	spin_lock_bh(&psock->ingress_lock);
+	list_splice_tail_init(&local_head, &psock->ingress_msg);
+	spin_unlock_bh(&psock->ingress_lock);
+
+	atomic_add(tot_size, &sk->sk_rmem_alloc);
+	sk_mem_charge(sk, tot_size);
+
+notify:
+	if (should_notify) {
+		psock->backlog_since_notify = 0;
+		sk_psock_data_ready(sk, psock);
+		if (!list_empty(&psock->backlog_msg))
+			sk_psock_run_backlog_work(psock, rmem_schedule_failed);
+	} else {
+		sk_psock_run_backlog_work(psock, true);
+	}
+	release_sock(sk);
+
+	if (sk_from) {
+		bool slow = lock_sock_fast(sk_from);
+
+		sk_mem_uncharge(sk_from, tot_size);
+		unlock_sock_fast(sk_from, slow);
+		sock_put(sk_from);
+	}
+}
+
+static void sk_psock_backlog_msg_work(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+
+	sk_psock_backlog_msg(container_of(dwork, struct sk_psock, backlog_work));
+}
+
 struct sk_psock *sk_psock_init(struct sock *sk, int node)
 {
 	struct sk_psock *psock;
@@ -758,8 +870,11 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 
 	INIT_DELAYED_WORK(&psock->work, sk_psock_backlog);
 	mutex_init(&psock->work_mutex);
+	INIT_DELAYED_WORK(&psock->backlog_work, sk_psock_backlog_msg_work);
 	INIT_LIST_HEAD(&psock->ingress_msg);
 	spin_lock_init(&psock->ingress_lock);
+	INIT_LIST_HEAD(&psock->backlog_msg);
+	spin_lock_init(&psock->backlog_msg_lock);
 	skb_queue_head_init(&psock->ingress_skb);
 
 	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
@@ -813,6 +928,26 @@ static void __sk_psock_zap_ingress(struct sk_psock *psock)
 	__sk_psock_purge_ingress_msg(psock);
 }
 
+static void __sk_psock_purge_backlog_msg(struct sk_psock *psock)
+{
+	struct sk_msg *msg, *tmp;
+
+	spin_lock(&psock->backlog_msg_lock);
+	list_for_each_entry_safe(msg, tmp, &psock->backlog_msg, list) {
+		struct sock *sk_from = msg->sk;
+		bool slow;
+
+		list_del(&msg->list);
+		slow = lock_sock_fast(sk_from);
+		sk_wmem_queued_add(sk_from, -msg->sg.size);
+		sock_put(sk_from);
+		sk_msg_free(sk_from, msg);
+		unlock_sock_fast(sk_from, slow);
+		kfree_sk_msg(msg);
+	}
+	spin_unlock(&psock->backlog_msg_lock);
+}
+
 static void sk_psock_link_destroy(struct sk_psock *psock)
 {
 	struct sk_psock_link *link, *tmp;
@@ -842,7 +977,9 @@ static void sk_psock_destroy(struct work_struct *work)
 	sk_psock_done_strp(psock);
 
 	cancel_delayed_work_sync(&psock->work);
+	cancel_delayed_work_sync(&psock->backlog_work);
 	__sk_psock_zap_ingress(psock);
+	__sk_psock_purge_backlog_msg(psock);
 	mutex_destroy(&psock->work_mutex);
 
 	psock_progs_drop(&psock->progs);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index f0ef41c951e2..6f7359157205 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -381,6 +381,183 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	return ret;
 }
 
+static int tcp_bpf_coalesce_msg(struct sk_msg *last, struct sk_msg *msg,
+				u32 *apply_bytes_ptr, u32 *tot_size)
+{
+	struct scatterlist *sge_from, *sge_to;
+	u32 apply_bytes = *apply_bytes_ptr;
+	bool apply = apply_bytes;
+	int i = msg->sg.start;
+	u32 size;
+
+	while (i != msg->sg.end) {
+		int last_sge_idx = last->sg.end;
+
+		if (sk_msg_full(last))
+			break;
+
+		sge_from = sk_msg_elem(msg, i);
+		sk_msg_iter_var_prev(last_sge_idx);
+		sge_to = &last->sg.data[last_sge_idx];
+
+		size = (apply && apply_bytes < sge_from->length) ?
+			apply_bytes : sge_from->length;
+		if (sk_msg_try_coalesce_ok(last, last_sge_idx) &&
+		    sg_page(sge_to) == sg_page(sge_from) &&
+		    sge_to->offset + sge_to->length == sge_from->offset) {
+			sge_to->length += size;
+		} else {
+			sge_to = &last->sg.data[last->sg.end];
+			sg_unmark_end(sge_to);
+			sg_set_page(sge_to, sg_page(sge_from), size,
+				    sge_from->offset);
+			get_page(sg_page(sge_to));
+			sk_msg_iter_next(last, end);
+		}
+
+		sge_from->length -= size;
+		sge_from->offset += size;
+
+		if (sge_from->length == 0) {
+			put_page(sg_page(sge_to));
+			sk_msg_iter_var_next(i);
+		}
+
+		msg->sg.size -= size;
+		last->sg.size += size;
+		*tot_size += size;
+
+		if (apply) {
+			apply_bytes -= size;
+			if (!apply_bytes)
+				break;
+		}
+	}
+
+	if (apply)
+		*apply_bytes_ptr = apply_bytes;
+
+	msg->sg.start = i;
+	return i;
+}
+
+static void tcp_bpf_xfer_msg(struct sk_msg *dst, struct sk_msg *msg,
+			     u32 *apply_bytes_ptr, u32 *tot_size)
+{
+	u32 apply_bytes = *apply_bytes_ptr;
+	bool apply = apply_bytes;
+	struct scatterlist *sge;
+	int i = msg->sg.start;
+	u32 size;
+
+	do {
+		sge = sk_msg_elem(msg, i);
+		size = (apply && apply_bytes < sge->length) ?
+			apply_bytes : sge->length;
+
+		sk_msg_xfer(dst, msg, i, size);
+		*tot_size += size;
+		if (sge->length)
+			get_page(sk_msg_page(dst, i));
+		sk_msg_iter_var_next(i);
+		dst->sg.end = i;
+		if (apply) {
+			apply_bytes -= size;
+			if (!apply_bytes) {
+				if (sge->length)
+					sk_msg_iter_var_prev(i);
+				break;
+			}
+		}
+	} while (i != msg->sg.end);
+
+	if (apply)
+		*apply_bytes_ptr = apply_bytes;
+	msg->sg.start = i;
+}
+
+static int tcp_bpf_ingress_backlog(struct sock *sk, struct sock *sk_redir,
+				   struct sk_msg *msg, u32 apply_bytes)
+{
+	bool ingress_msg_empty = false;
+	bool apply = apply_bytes;
+	struct sk_psock *psock;
+	struct sk_msg *tmp;
+	u32 tot_size = 0;
+	int ret = 0;
+	u8 nonagle;
+
+	psock = sk_psock_get(sk_redir);
+	if (unlikely(!psock))
+		return -EPIPE;
+
+	spin_lock(&psock->backlog_msg_lock);
+	/* If possible, coalesce the curr sk_msg to the last sk_msg from the
+	 * psock->backlog_msg.
+	 */
+	if (!list_empty(&psock->backlog_msg)) {
+		struct sk_msg *last;
+
+		last = list_last_entry(&psock->backlog_msg, struct sk_msg, list);
+		if (last->sk == sk) {
+			int i = tcp_bpf_coalesce_msg(last, msg, &apply_bytes,
+						     &tot_size);
+
+			if (i == msg->sg.end || (apply && !apply_bytes))
+				goto out_unlock;
+		}
+	}
+
+	/* Otherwise, allocate a new sk_msg and transfer the data from the
+	 * passed in msg to it.
+	 */
+	tmp = sk_msg_alloc(GFP_ATOMIC);
+	if (!tmp) {
+		ret = -ENOMEM;
+		spin_unlock(&psock->backlog_msg_lock);
+		goto error;
+	}
+
+	tmp->sk = sk;
+	sock_hold(tmp->sk);
+	tmp->sg.start = msg->sg.start;
+	tcp_bpf_xfer_msg(tmp, msg, &apply_bytes, &tot_size);
+
+	ingress_msg_empty = list_empty(&psock->ingress_msg);
+	list_add_tail(&tmp->list, &psock->backlog_msg);
+
+out_unlock:
+	spin_unlock(&psock->backlog_msg_lock);
+	sk_wmem_queued_add(sk, tot_size);
+
+	/* At this point, the data has been handled well. If one of the
+	 * following conditions is met, we can notify the peer socket in
+	 * the context of this system call immediately.
+	 * 1. If the write buffer has been used up;
+	 * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
+	 * 3. Or, the ingress queue was empty;
+	 * 4. Or, the tcp socket is set to no_delay.
+	 * Otherwise, kick off the backlog work so that we can have some
+	 * time to wait for any incoming messages before sending a
+	 * notification to the peer socket.
+	 */
+	nonagle = tcp_sk(sk)->nonagle;
+	if (!sk_stream_memory_free(sk) ||
+	    tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
+	    (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
+		release_sock(sk);
+		psock->backlog_work_delayed = false;
+		sk_psock_backlog_msg(psock);
+		lock_sock(sk);
+	} else {
+		sk_psock_run_backlog_work(psock, false);
+	}
+
+error:
+	sk_psock_put(sk_redir, psock);
+	return ret;
+}
+
 static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 				struct sk_msg *msg, int *copied, int flags)
 {
@@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 			cork = true;
 			psock->cork = NULL;
 		}
-		release_sock(sk);
 
-		origsize = msg->sg.size;
-		ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
-					    msg, tosend, flags);
-		sent = origsize - msg->sg.size;
+		if (redir_ingress) {
+			ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
+		} else {
+			release_sock(sk);
+
+			origsize = msg->sg.size;
+			ret = tcp_bpf_sendmsg_redir(sk_redir, false,
+						    msg, tosend, flags);
+			sent = origsize - msg->sg.size;
+
+			lock_sock(sk);
+			sk_mem_uncharge(sk, sent);
+		}
 
 		if (eval == __SK_REDIRECT)
 			sock_put(sk_redir);
 
-		lock_sock(sk);
-		sk_mem_uncharge(sk, sent);
 		if (unlikely(ret < 0)) {
 			int free = sk_msg_free(sk, msg);
 
-- 
2.34.1



* Re: [Patch bpf-next v4 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand()
  2025-07-01  1:11 ` [Patch bpf-next v4 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
@ 2025-07-02  9:24   ` Jakub Sitnicki
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Sitnicki @ 2025-07-02  9:24 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, bpf, john.fastabend, zijianzhang, zhoufeng.zf, Cong Wang

On Mon, Jun 30, 2025 at 06:11 PM -07, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> The name sk_msg_alloc is misleading, that function does not allocate
> sk_msg at all, it simply refills sock page frags. Rename it to
> sk_msg_expand() to better reflect what it actually does.
>
> Reviewed-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

Nit: If there happens to be another iteration, might as well add a doc
comment while at it.
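
For example, something along these lines (just a sketch, exact wording
is of course up to you):

/**
 * sk_msg_expand - grow a message by refilling the socket's page frags
 * @sk: socket the memory gets charged to
 * @msg: message whose scatterlist is grown
 * @len: target size in bytes of @msg after expansion
 * @elem_first_coalesce: first sg element considered for coalescing
 *
 * Returns 0 on success. On failure a negative error is returned and
 * @msg is trimmed back to its original size.
 */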

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>


* Re: [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-01  1:11 [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
                   ` (3 preceding siblings ...)
  2025-07-01  1:12 ` [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
@ 2025-07-02 10:22 ` Jakub Sitnicki
  2025-07-03  1:48   ` Zijian Zhang
  2025-07-02 11:05 ` Jakub Sitnicki
  5 siblings, 1 reply; 17+ messages in thread
From: Jakub Sitnicki @ 2025-07-02 10:22 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, bpf, john.fastabend, zijianzhang, zhoufeng.zf

On Mon, Jun 30, 2025 at 06:11 PM -07, Cong Wang wrote:
> This patchset improves skmsg ingress redirection performance by a)
> sophisticated batching with kworker; b) skmsg allocation caching with
> kmem cache.
>
> As a result, our patches significantly outperforms the vanilla kernel
> in terms of throughput for almost all packet sizes. The percentage
> improvement in throughput ranges from 3.13% to 160.92%, with smaller
> packets showing the highest improvements.
>
> For latency, it induces slightly higher latency across most packet sizes
> compared to the vanilla, which is also expected since this is a natural
> side effect of batching.
>
> Here are the detailed benchmarks:
>
> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
> | Throughput  | 64     | 128    | 256    | 512    | 1k     | 4k     | 16k    | 32k    | 64k    | 128k   | 256k   |
> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
> | Vanilla     | 0.17±0.02 | 0.36±0.01 | 0.72±0.02 | 1.37±0.05 | 2.60±0.12 | 8.24±0.44 | 22.38±2.02 | 25.49±1.28 | 43.07±1.36 | 66.87±4.14 | 73.70±7.15 |
> | Patched     | 0.41±0.01 | 0.82±0.02 | 1.62±0.05 | 3.33±0.01 | 6.45±0.02 | 21.50±0.08 | 46.22±0.31 | 50.20±1.12 | 45.39±1.29 | 68.96±1.12 | 78.35±1.49 |
> | Percentage  | 141.18%   | 127.78%   | 125.00%   | 143.07%   | 148.08%   | 160.92%   | 106.52%    | 97.00%     | 5.38%      | 3.13%      | 6.32%      |
> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+

That's a bit easier to read when aligned:

| Throughput |    64     |    128    |    256    |    512    |    1k     |     4k     |    16k     |    32k     |    64k     |    128k    |    256k    |
|------------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+------------+------------|
|    Vanilla | 0.17±0.02 | 0.36±0.01 | 0.72±0.02 | 1.37±0.05 | 2.60±0.12 | 8.24±0.44  | 22.38±2.02 | 25.49±1.28 | 43.07±1.36 | 66.87±4.14 | 73.70±7.15 |
|    Patched | 0.41±0.01 | 0.82±0.02 | 1.62±0.05 | 3.33±0.01 | 6.45±0.02 | 21.50±0.08 | 46.22±0.31 | 50.20±1.12 | 45.39±1.29 | 68.96±1.12 | 78.35±1.49 |
| Percentage |  141.18%  |  127.78%  |  125.00%  |  143.07%  |  148.08%  |  160.92%   |  106.52%   |   97.00%   |   5.38%    |   3.13%    |   6.32%    |

>
> +-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | Latency     | 64        | 128       | 256       | 512       | 1k        | 4k        | 16k       | 32k       | 63k       |
> +-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | Vanilla     | 5.80±4.02 | 5.83±3.61 | 5.86±4.10 | 5.91±4.19 | 5.98±4.14 | 6.61±4.47 | 8.60±2.59 | 10.96±5.50| 15.02±6.78|
> | Patched     | 6.18±3.03 | 6.23±4.38 | 6.25±4.44 | 6.13±4.35 | 6.32±4.23 | 6.94±4.61 | 8.90±5.49 | 11.12±6.10| 14.88±6.55|
> | Percentage  | 6.55%     | 6.87%     | 6.66%     | 3.72%     | 5.68%     | 4.99%     | 3.49%     | 1.46%     |-0.93%     |
> +-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

What are throughput and latency units here?

Which microbenchmark was used?


* Re: [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-01  1:11 [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
                   ` (4 preceding siblings ...)
  2025-07-02 10:22 ` [Patch bpf-next v4 0/4] " Jakub Sitnicki
@ 2025-07-02 11:05 ` Jakub Sitnicki
  5 siblings, 0 replies; 17+ messages in thread
From: Jakub Sitnicki @ 2025-07-02 11:05 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, bpf, john.fastabend, zijianzhang, zhoufeng.zf

On Mon, Jun 30, 2025 at 06:11 PM -07, Cong Wang wrote:
> This patchset improves skmsg ingress redirection performance by a)
> sophisticated batching with kworker; b) skmsg allocation caching with
> kmem cache.
>
> As a result, our patches significantly outperforms the vanilla kernel
> in terms of throughput for almost all packet sizes. The percentage
> improvement in throughput ranges from 3.13% to 160.92%, with smaller
> packets showing the highest improvements.
>
> For latency, it induces slightly higher latency across most packet sizes
> compared to the vanilla, which is also expected since this is a natural
> side effect of batching.
>
> Here are the detailed benchmarks:
>
> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
> | Throughput  | 64     | 128    | 256    | 512    | 1k     | 4k     | 16k    | 32k    | 64k    | 128k   | 256k   |
> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
> | Vanilla     | 0.17±0.02 | 0.36±0.01 | 0.72±0.02 | 1.37±0.05 | 2.60±0.12 | 8.24±0.44 | 22.38±2.02 | 25.49±1.28 | 43.07±1.36 | 66.87±4.14 | 73.70±7.15 |
> | Patched     | 0.41±0.01 | 0.82±0.02 | 1.62±0.05 | 3.33±0.01 | 6.45±0.02 | 21.50±0.08 | 46.22±0.31 | 50.20±1.12 | 45.39±1.29 | 68.96±1.12 | 78.35±1.49 |
> | Percentage  | 141.18%   | 127.78%   | 125.00%   | 143.07%   | 148.08%   | 160.92%   | 106.52%    | 97.00%     | 5.38%      | 3.13%      | 6.32%      |
> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
>
> +-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | Latency     | 64        | 128       | 256       | 512       | 1k        | 4k        | 16k       | 32k       | 63k       |
> +-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | Vanilla     | 5.80±4.02 | 5.83±3.61 | 5.86±4.10 | 5.91±4.19 | 5.98±4.14 | 6.61±4.47 | 8.60±2.59 | 10.96±5.50| 15.02±6.78|
> | Patched     | 6.18±3.03 | 6.23±4.38 | 6.25±4.44 | 6.13±4.35 | 6.32±4.23 | 6.94±4.61 | 8.90±5.49 | 11.12±6.10| 14.88±6.55|
> | Percentage  | 6.55%     | 6.87%     | 6.66%     | 3.72%     | 5.68%     | 4.99%     | 3.49%     | 1.46%     |-0.93%     |
> +-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

Plots for easier review. Courtesy of Gemini.

https://gist.githubusercontent.com/jsitnicki/f255aab05d050a56ab743247475426a1/raw/715011c50396ef01c9b5397a19f07d380d56cd25/sk_msg-redir-ingress-throughput.png
https://gist.githubusercontent.com/jsitnicki/f255aab05d050a56ab743247475426a1/raw/715011c50396ef01c9b5397a19f07d380d56cd25/sk_msg-redir-ingress-latency.png


* Re: [Patch bpf-next v4 2/4] skmsg: implement slab allocator cache for sk_msg
  2025-07-01  1:11 ` [Patch bpf-next v4 2/4] skmsg: implement slab allocator cache for sk_msg Cong Wang
@ 2025-07-02 11:36   ` Jakub Sitnicki
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Sitnicki @ 2025-07-02 11:36 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, bpf, john.fastabend, zijianzhang, zhoufeng.zf, Cong Wang

On Mon, Jun 30, 2025 at 06:11 PM -07, Cong Wang wrote:
> From: Zijian Zhang <zijianzhang@bytedance.com>
>
> Optimizing redirect ingress performance requires frequent allocation and
> deallocation of sk_msg structures. Introduce a dedicated kmem_cache for
> sk_msg to reduce memory allocation overhead and improve performance.
>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> ---

skb ref management looks all right. Meaning I see no double release.

sk_msg_free clears the msg->skb field.

While sk_psock_skb_ingress_enqueue sets msg->skb only when it succeeds.

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>


* Re: [Patch bpf-next v4 3/4] skmsg: save some space in struct sk_psock
  2025-07-01  1:12 ` [Patch bpf-next v4 3/4] skmsg: save some space in struct sk_psock Cong Wang
@ 2025-07-02 11:46   ` Jakub Sitnicki
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Sitnicki @ 2025-07-02 11:46 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, bpf, john.fastabend, zijianzhang, zhoufeng.zf, Cong Wang

On Mon, Jun 30, 2025 at 06:12 PM -07, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> This patch aims to save some space in struct sk_psock and prepares for
> the next patch which will add more fields.
>
> psock->eval can only have 4 possible values, make it 8-bit is
> sufficient.
>
> psock->redir_ingress is just a boolean, using 1 bit is enough.
>
> Reviewed-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

We could probably tweak sk_psock_map_verd to map (SK_PASS, redir=true,
ingress=true) into something like __SK_REDIR_INGRESS in the future, and
do away with the psock->redir_ingress field completely.
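
Roughly like this (a hypothetical sketch; __SK_REDIR_INGRESS does not
exist today, and I'm assuming the current (verdict, redir) signature):

static inline int sk_psock_map_verd(int verdict, bool redir, bool ingress)
{
	if (verdict == SK_PASS && redir)
		return ingress ? __SK_REDIR_INGRESS : __SK_REDIRECT;
	if (verdict == SK_DROP)
		return __SK_DROP;

	return __SK_PASS;
}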

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>


* Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-01  1:12 ` [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
@ 2025-07-02 12:17   ` Jakub Sitnicki
  2025-07-03  2:17     ` Zijian Zhang
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Sitnicki @ 2025-07-02 12:17 UTC (permalink / raw)
  To: Cong Wang, zijianzhang
  Cc: netdev, bpf, john.fastabend, zhoufeng.zf, Amery Hung, Cong Wang

On Mon, Jun 30, 2025 at 06:12 PM -07, Cong Wang wrote:
> From: Zijian Zhang <zijianzhang@bytedance.com>
>
> The TCP_BPF ingress redirection path currently lacks the message corking
> mechanism found in standard TCP. This causes the sender to wake up the
> receiver for every message, even when messages are small, resulting in
> reduced throughput compared to regular TCP in certain scenarios.

I'm curious which scenarios you are referring to. Is it send-to-local or
ingress-to-local? [1]

If the sender is emitting small messages, that's probably intended -
that is, they likely want to get the message across as soon as possible,
since they must have disabled the Nagle algo (set TCP_NODELAY) to do
that.

Otherwise, you get small segment merging on the sender side by default.
And if MTU is a limiting factor, you should also be getting batching
from GRO.

What I'm getting at is that I don't quite follow why you don't see
sufficient batching before the sockmap redirect today?

> This change introduces a kernel worker-based intermediate layer to provide
> automatic message corking for TCP_BPF. While this adds a slight latency
> overhead, it significantly improves overall throughput by reducing
> unnecessary wake-ups and reducing the sock lock contention.

"Slight" for a +5% increase in latency is an understatement :-)

IDK about this being always on for every socket. For send-to-local
[1], sk_msg redirs can be viewed as a form of IPC, where latency
matters.

I do understand that you're trying to optimize for bulk-transfer
workloads, but please consider also request-response workloads.

[1] https://github.com/jsitnicki/kubecon-2024-sockmap/blob/main/cheatsheet-sockmap-redirect.png

> Reviewed-by: Amery Hung <amery.hung@bytedance.com>
> Co-developed-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> ---


* Re: [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-02 10:22 ` [Patch bpf-next v4 0/4] " Jakub Sitnicki
@ 2025-07-03  1:48   ` Zijian Zhang
  0 siblings, 0 replies; 17+ messages in thread
From: Zijian Zhang @ 2025-07-03  1:48 UTC (permalink / raw)
  To: Jakub Sitnicki, Cong Wang; +Cc: netdev, bpf, john.fastabend, zhoufeng.zf

On 7/2/25 6:22 PM, Jakub Sitnicki wrote:
> On Mon, Jun 30, 2025 at 06:11 PM -07, Cong Wang wrote:
>> This patchset improves skmsg ingress redirection performance by a)
>> sophisticated batching with kworker; b) skmsg allocation caching with
>> kmem cache.
>>
>> As a result, our patches significantly outperforms the vanilla kernel
>> in terms of throughput for almost all packet sizes. The percentage
>> improvement in throughput ranges from 3.13% to 160.92%, with smaller
>> packets showing the highest improvements.
>>
>> For latency, it induces slightly higher latency across most packet sizes
>> compared to the vanilla, which is also expected since this is a natural
>> side effect of batching.
>>
>> Here are the detailed benchmarks:
>>
>> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
>> | Throughput  | 64     | 128    | 256    | 512    | 1k     | 4k     | 16k    | 32k    | 64k    | 128k   | 256k   |
>> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
>> | Vanilla     | 0.17±0.02 | 0.36±0.01 | 0.72±0.02 | 1.37±0.05 | 2.60±0.12 | 8.24±0.44 | 22.38±2.02 | 25.49±1.28 | 43.07±1.36 | 66.87±4.14 | 73.70±7.15 |
>> | Patched     | 0.41±0.01 | 0.82±0.02 | 1.62±0.05 | 3.33±0.01 | 6.45±0.02 | 21.50±0.08 | 46.22±0.31 | 50.20±1.12 | 45.39±1.29 | 68.96±1.12 | 78.35±1.49 |
>> | Percentage  | 141.18%   | 127.78%   | 125.00%   | 143.07%   | 148.08%   | 160.92%   | 106.52%    | 97.00%     | 5.38%      | 3.13%      | 6.32%      |
>> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
> 
> That's a bit easier to read when aligned:
> 
> | Throughput |    64     |    128    |    256    |    512    |    1k     |     4k     |    16k     |    32k     |    64k     |    128k    |    256k    |
> |------------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+------------+------------|
> |    Vanilla | 0.17±0.02 | 0.36±0.01 | 0.72±0.02 | 1.37±0.05 | 2.60±0.12 | 8.24±0.44  | 22.38±2.02 | 25.49±1.28 | 43.07±1.36 | 66.87±4.14 | 73.70±7.15 |
> |    Patched | 0.41±0.01 | 0.82±0.02 | 1.62±0.05 | 3.33±0.01 | 6.45±0.02 | 21.50±0.08 | 46.22±0.31 | 50.20±1.12 | 45.39±1.29 | 68.96±1.12 | 78.35±1.49 |
> | Percentage |  141.18%  |  127.78%  |  125.00%  |  143.07%  |  148.08%  |  160.92%   |  106.52%   |   97.00%   |   5.38%    |   3.13%    |   6.32%    |
> 

Thanks for the suggestion!

>>
>> +-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
>> | Latency     | 64        | 128       | 256       | 512       | 1k        | 4k        | 16k       | 32k       | 63k       |
>> +-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
>> | Vanilla     | 5.80±4.02 | 5.83±3.61 | 5.86±4.10 | 5.91±4.19 | 5.98±4.14 | 6.61±4.47 | 8.60±2.59 | 10.96±5.50| 15.02±6.78|
>> | Patched     | 6.18±3.03 | 6.23±4.38 | 6.25±4.44 | 6.13±4.35 | 6.32±4.23 | 6.94±4.61 | 8.90±5.49 | 11.12±6.10| 14.88±6.55|
>> | Percentage  | 6.55%     | 6.87%     | 6.66%     | 3.72%     | 5.68%     | 4.99%     | 3.49%     | 1.46%     |-0.93%     |
>> +-------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> 
> What are throughput and latency units here?
> 
> Which microbenchmark was used?

Let me add some details here,

# Tput Test: iperf 3.18 (cJSON 1.7.15)
# unit is Gbits/sec
iperf3 -4 -s
iperf3 -4 -c $local_host -l $buffer_length

During this process, some metadata is exchanged between the server and
the client via TCP, which also verifies the data integrity of our
patched code.

# Latency Test: sockperf, version 3.10-31
# unit is us
sockperf server -i $local_host --tcp --daemonize
sockperf ping-pong -i $local_host --tcp --time 10 --sender-affinity 0 --receiver-affinity 1


* Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-02 12:17   ` Jakub Sitnicki
@ 2025-07-03  2:17     ` Zijian Zhang
  2025-07-03 11:32       ` Jakub Sitnicki
  0 siblings, 1 reply; 17+ messages in thread
From: Zijian Zhang @ 2025-07-03  2:17 UTC (permalink / raw)
  To: Jakub Sitnicki, Cong Wang
  Cc: netdev, bpf, john.fastabend, zhoufeng.zf, Amery Hung, Cong Wang

On 7/2/25 8:17 PM, Jakub Sitnicki wrote:
> On Mon, Jun 30, 2025 at 06:12 PM -07, Cong Wang wrote:
>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>
>> The TCP_BPF ingress redirection path currently lacks the message corking
>> mechanism found in standard TCP. This causes the sender to wake up the
>> receiver for every message, even when messages are small, resulting in
>> reduced throughput compared to regular TCP in certain scenarios.
> 
> I'm curious what scenarios are you referring to? Is it send-to-local or
> ingress-to-local? [1]
> 

Thanks for your attention and the detailed review!
We are referring to "send-to-local" here.

> If the sender is emitting small messages, that's probably intended -
> that is they likely want to get the message across as soon as possible,
> because They must have disabled the Nagle algo (set TCP_NODELAY) to do
> that.
> 
> Otherwise, you get small segment merging on the sender side by default.
> And if MTU is a limiting factor, you should also be getting batching
> from GRO.
> 
> What I'm getting at is that I don't quite follow why you don't see
> sufficient batching before the sockmap redirect today?
> 

IMHO,

In the “send-to-local” case, both the sender and receiver sockets are
added to the sockmap. Their protocol is modified from TCP to eBPF_TCP,
so that sendmsg invokes “tcp_bpf_sendmsg” instead of “tcp_sendmsg”. The
whole process then amounts to building an skmsg and moving it to the
receiver socket's queue immediately. No sk_buff is generated along the
way, so we cannot benefit from the TCP stack optimizations. As a
result, small segments are not merged by default, which is why I am
implementing skmsg coalescing here.
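
To make this concrete, the user-space side of such a setup looks
roughly like the following (a minimal sketch using libbpf; prog_fd,
map_fd, sender_fd and receiver_fd are assumed to be set up already,
and the names are illustrative rather than our actual test harness):

__u32 key;
__u64 val;	/* sock_map assumed to hold 8-byte values (socket fds) */

/* attach the sk_msg verdict program that performs the redirect */
bpf_prog_attach(prog_fd, map_fd, BPF_SK_MSG_VERDICT, 0);

key = 0; val = sender_fd;
bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
key = 1; val = receiver_fd;
bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);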

>> This change introduces a kernel worker-based intermediate layer to provide
>> automatic message corking for TCP_BPF. While this adds a slight latency
>> overhead, it significantly improves overall throughput by reducing
>> unnecessary wake-ups and reducing the sock lock contention.
> 
> "Slight" for a +5% increase in latency is an understatement :-)
> 
> IDK about this being always on for every socket. For send-to-local
> [1], sk_msg redirs can be viewed as a form of IPC, where latency
> matters.
> 
> I do understand that you're trying to optimize for bulk-transfer
> workloads, but please consider also request-response workloads.
> 
> [1] https://github.com/jsitnicki/kubecon-2024-sockmap/blob/main/cheatsheet-sockmap-redirect.png
> 

Totally understand that request-response workloads are also very
important.

Here are my thoughts:

I had an idea before: when the user sets NO_DELAY, we could follow the
original code path. However, I'm concerned about a specific scenario: if
a user sends part of a message and then sets NO_DELAY to send another
message, it's possible that messages sent via kworker haven't yet
reached the ingress_msg (maybe due to delayed kworker scheduling), while
the messages sent with NO_DELAY have already arrived. This could disrupt
the order. Since there's no TCP packet formation or retransmission
mechanism in this process, once the order is disrupted, it stays that
way.

As a result, I propose:

- When the user sets NO_DELAY, introduce a wait (I haven't determined
the exact location yet; maybe in tcp_bpf_sendmsg) to ensure all messages
queued for the kworker have been delivered before proceeding. Then
follow the original path for packet transmission.

- When the user switches back from NO_DELAY to DELAY, it's less of an 
issue. We can simply follow our current code path.

If 5% degradation is not a blocker for this patchset, and the solution
looks good to you, we will do it in the next patchset.

>> Reviewed-by: Amery Hung <amery.hung@bytedance.com>
>> Co-developed-by: Cong Wang <cong.wang@bytedance.com>
>> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
>> ---



* Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-03  2:17     ` Zijian Zhang
@ 2025-07-03 11:32       ` Jakub Sitnicki
  2025-07-04  4:20         ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Sitnicki @ 2025-07-03 11:32 UTC (permalink / raw)
  To: Zijian Zhang
  Cc: Cong Wang, netdev, bpf, john.fastabend, zhoufeng.zf, Amery Hung,
	Cong Wang

On Thu, Jul 03, 2025 at 10:17 AM +08, Zijian Zhang wrote:
> On 7/2/25 8:17 PM, Jakub Sitnicki wrote:
>> On Mon, Jun 30, 2025 at 06:12 PM -07, Cong Wang wrote:
>>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>>
>>> The TCP_BPF ingress redirection path currently lacks the message corking
>>> mechanism found in standard TCP. This causes the sender to wake up the
>>> receiver for every message, even when messages are small, resulting in
>>> reduced throughput compared to regular TCP in certain scenarios.
>> I'm curious what scenarios are you referring to? Is it send-to-local or
>> ingress-to-local? [1]
>> 
>
> Thanks for your attention and detailed reviewing!
> We are referring to "send-to-local" here.
>
>> If the sender is emitting small messages, that's probably intended -
>> that is they likely want to get the message across as soon as possible,
>> because They must have disabled the Nagle algo (set TCP_NODELAY) to do
>> that.
>> Otherwise, you get small segment merging on the sender side by default.
>> And if MTU is a limiting factor, you should also be getting batching
>> from GRO.
>> What I'm getting at is that I don't quite follow why you don't see
>> sufficient batching before the sockmap redirect today?
>> 
>
> IMHO,
>
> In “send-to-local” case, both sender and receiver sockets are added to
> the sockmap. Their protocol is modified from TCP to eBPF_TCP, so that
> sendmsg will invoke “tcp_bpf_sendmsg” instead of “tcp_sendmsg”. In this
> case, the whole process is building a skmsg and moving it to the
> receiver socket’s queue immediately. In this process, there is no
> sk_buff generated, and we cannot benefit from TCP stack optimizations.
> As a result, small segments will not be merged by default, that's the
> reason why I am implementing skmsg coalescing here.
>
>>> This change introduces a kernel worker-based intermediate layer to provide
>>> automatic message corking for TCP_BPF. While this adds a slight latency
>>> overhead, it significantly improves overall throughput by reducing
>>> unnecessary wake-ups and reducing the sock lock contention.
>> "Slight" for a +5% increase in latency is an understatement :-)
>> IDK about this being always on for every socket. For send-to-local
>> [1], sk_msg redirs can be viewed as a form of IPC, where latency
>> matters.
>> I do understand that you're trying to optimize for bulk-transfer
>> workloads, but please consider also request-response workloads.
>> [1]
>> https://github.com/jsitnicki/kubecon-2024-sockmap/blob/main/cheatsheet-sockmap-redirect.png
>> 
>
> Totally understand that request-response workloads are also very
> important.
>
> Here are my thoughts:
>
> I had an idea before: when the user sets NO_DELAY, we could follow the
> original code path. However, I'm concerned about a specific scenario: if
> a user sends part of a message and then sets NO_DELAY to send another
> message, it's possible that messages sent via kworker haven't yet
> reached the ingress_msg (maybe due to delayed kworker scheduling), while
> the messages sent with NO_DELAY have already arrived. This could disrupt
> the order. Since there's no TCP packet formation or retransmission
> mechanism in this process, once the order is disrupted, it stays that
> way.
>
> As a result, I propose the following:
>
> - When the user sets NO_DELAY, introduce a wait (I haven't determined
> the exact location yet; maybe in tcp_bpf_sendmsg) to ensure all messages
> from kworker are sent before proceeding. Then follow the original path
> for packet transmission.
>
> - When the user switches back from NO_DELAY to DELAY, it's less of an issue. We
>  can simply follow our current code path.
>
> If 5% degradation is not a blocker for this patchset, and the solution
> looks good to you, we will do it in the next patchset.

I'm all for reaping the benefits of batching, but I'm not thrilled about
having a backlog worker on the path. The one we have on the sk_skb path
has been a bottleneck:

1) There's no backpressure propagation so you can have a backlog
build-up. One thing to check is what happens if the receiver closes its
window.

2) There's a scheduling latency. That's why the performance of splicing
sockets with sockmap (ingress-to-egress) looks bleak [1].

So I have to dig deeper...

Have you considered and/or evaluated any alternative designs? For
instance, what stops us from having an auto-corking / coalescing
strategy on the sender side?

[1] https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-03 11:32       ` Jakub Sitnicki
@ 2025-07-04  4:20         ` Cong Wang
  2025-07-07 17:51           ` Jakub Sitnicki
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2025-07-04  4:20 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Zijian Zhang, netdev, bpf, john.fastabend, zhoufeng.zf,
	Amery Hung, Cong Wang

On Thu, Jul 03, 2025 at 01:32:08PM +0200, Jakub Sitnicki wrote:
> I'm all for reaping the benefits of batching, but I'm not thrilled about
> having a backlog worker on the path. The one we have on the sk_skb path
> has been a bottleneck:

It depends on what you compare with. If you compare it with vanilla
TCP_BPF, we did see a 5% latency increase. If you compare it with
regular TCP, it is still much better. Our goal is to make Cilium's
sockops-enable competitive with regular TCP, hence we compare it with
regular TCP.

I hope this makes sense to you. Sorry if this was not clear in our cover
letter.

> 
> 1) There's no backpressure propagation so you can have a backlog
> build-up. One thing to check is what happens if the receiver closes its
> window.

Right, I am sure there is still a lot we can further optimize. The only
question is how much we need for now. How about optimizing it one step
at a time? :)

> 
> 2) There's a scheduling latency. That's why the performance of splicing
> sockets with sockmap (ingress-to-egress) looks bleak [1].

Same for regular TCP, we have to wake up the receiver/worker. But I may
misunderstand this point?

> 
> So I have to dig deeper...
> 
> Have you considered and/or evaluated any alternative designs? For
> instance, what stops us from having an auto-corking / coalescing
> strategy on the sender side?

Auto-corking _may_ not be as easy as in TCP, since essentially we have
no protocol here, just a pure socket layer.

Thanks for your review!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-04  4:20         ` Cong Wang
@ 2025-07-07 17:51           ` Jakub Sitnicki
  2025-07-15  0:26             ` Zijian Zhang
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Sitnicki @ 2025-07-07 17:51 UTC (permalink / raw)
  To: Cong Wang
  Cc: Zijian Zhang, netdev, bpf, john.fastabend, zhoufeng.zf,
	Amery Hung, Cong Wang

On Thu, Jul 03, 2025 at 09:20 PM -07, Cong Wang wrote:
> On Thu, Jul 03, 2025 at 01:32:08PM +0200, Jakub Sitnicki wrote:
>> I'm all for reaping the benefits of batching, but I'm not thrilled about
>> having a backlog worker on the path. The one we have on the sk_skb path
>> has been a bottleneck:
>
> It depends on what you compare with. If you compare it with vanilla
> TCP_BPF, we did see a 5% latency increase. If you compare it with
> regular TCP, it is still much better. Our goal is to make Cilium's
> sockops-enable competitive with regular TCP, hence we compare it with
> regular TCP.
>
> I hope this makes sense to you. Sorry if this was not clear in our cover
> letter.

Latency-wise I think we should be comparing sk_msg send-to-local against
UDS rather than full-stack TCP.

There is quite a bit of guessing on my side as to what you're looking
for because the cover letter doesn't say much about the use case.

For instance, do you control the sender?  Why not do big writes on the
sender side if raw throughput is what you care about?

>> 1) There's no backpressure propagation so you can have a backlog
>> build-up. One thing to check is what happens if the receiver closes its
>> window.
>
> Right, I am sure there is still a lot we can further optimize. The only
> question is how much we need for now. How about optimizing it one step
> at a time? :)

This is introducing quite a bit of complexity from the start. I'd like
to at least explore whether it can be done in a simpler fashion before
committing to it.

You point at wake-ups as being the throughput killer. As an alternative,
can we wake up the receiver conditionally? That is, only if the receiver
has made progress on the queue since the last notification. This could
also be a form of wakeup moderation.
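
Something along these lines, purely to illustrate the idea (pseudo-code;
ingress_copied and wake_copied are made-up fields tracking what the
receiver has consumed and what it had consumed at the last wakeup):

/* Pseudo-code sketch, not a real patch. */
static void sk_psock_queue_and_notify(struct sock *sk,
                                      struct sk_psock *psock,
                                      struct sk_msg *msg)
{
        bool was_empty = sk_psock_queue_empty(psock);
        u32 copied = READ_ONCE(psock->ingress_copied);

        sk_psock_queue_msg(psock, msg);

        /* Always wake an empty queue. Otherwise, only wake if the
         * receiver has consumed something since the last notification;
         * if it hasn't, it already has data and a pending wakeup.
         */
        if (was_empty || copied != READ_ONCE(psock->wake_copied)) {
                WRITE_ONCE(psock->wake_copied, copied);
                sk->sk_data_ready(sk);
        }
}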

>> 2) There's a scheduling latency. That's why the performance of splicing
>> sockets with sockmap (ingress-to-egress) looks bleak [1].
>
> Same for regular TCP, we have to wake up the receiver/worker. But I may
> misunderstand this point?

What I meant is that, in the pessimistic case, to deliver a message we
now have to go through two wakeups:

sender -wakeup-> kworker -wakeup-> receiver

>> So I have to dig deeper...
>> 
>> Have you considered and/or evaluated any alternative designs? For
>> instance, what stops us from having an auto-corking / coalescing
>> strategy on the sender side?
>
> Auto-corking _may_ not be as easy as in TCP, since essentially we have
> no protocol here, just a pure socket layer.

You're right. We don't have a flush signal for auto-corking on the
sender side with sk_msg's.

What about what I mentioned above - can we moderate the wakeups based on
receiver making progress? Does that sound feasible to you?

Thanks,
-jkbs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking
  2025-07-07 17:51           ` Jakub Sitnicki
@ 2025-07-15  0:26             ` Zijian Zhang
  0 siblings, 0 replies; 17+ messages in thread
From: Zijian Zhang @ 2025-07-15  0:26 UTC (permalink / raw)
  To: Jakub Sitnicki, Cong Wang
  Cc: netdev, bpf, john.fastabend, zhoufeng.zf, Amery Hung, Cong Wang

On 7/8/25 1:51 AM, Jakub Sitnicki wrote:
> On Thu, Jul 03, 2025 at 09:20 PM -07, Cong Wang wrote:
>> On Thu, Jul 03, 2025 at 01:32:08PM +0200, Jakub Sitnicki wrote:
>>> I'm all for reaping the benefits of batching, but I'm not thrilled about
>>> having a backlog worker on the path. The one we have on the sk_skb path
>>> has been a bottleneck:
>>
>> It depends on what you compare with. If you compare it with vanilla
>> TCP_BPF, we did see a 5% latency increase. If you compare it with
>> regular TCP, it is still much better. Our goal is to make Cilium's
>> sockops-enable competitive with regular TCP, hence we compare it with
>> regular TCP.
>>
>> I hope this makes sense to you. Sorry if this was not clear in our cover
>> letter.
> 
> Latency-wise I think we should be comparing sk_msg send-to-local against
> UDS rather than full-stack TCP.
> 
> There is quite a bit of guessing on my side as to what you're looking
> for because the cover letter doesn't say much about the use case.
> 

Let me add more details about the use case:

Assume user space code uses TCP to connect to a peer which may be
local or remote. We are trying to use sockmap to transparently
accelerate TCP connections where both the sender and the receiver are
on the same machine. User space code does not need to be modified:
local connections are accelerated, while remote connections remain the
same. Because of this transparency requirement, UDS is not an option.
UDS requires user-space code changes, and it means users must know they
are talking to a local peer.

We assumed that since we bypass the Linux network stack, better
throughput, latency and CPU usage would be observed. However, that's not
the case: throughput is worse when the message size is small (<64k).

It's similar to Cilium's "sockops-enable" config, which was deprecated
mostly because of performance. That config uses sockmap to manage TCP
connections between pods on the same machine.

https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
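
For reference, the send-to-local data path we rely on is essentially a
sk_msg program redirecting into a sockmap, along these lines (a minimal
illustration, not the Cilium program; the key construction is simplified
and a separate sock_ops program, as in the link above, would insert
established local connections into the map):

/* Minimal illustration only, not the Cilium code. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_SOCKHASH);
        __uint(max_entries, 65536);
        __type(key, __u64);
        __type(value, __u64);
} sock_map SEC(".maps");

SEC("sk_msg")
int redir_local(struct sk_msg_md *msg)
{
        __u64 key = ((__u64)msg->remote_ip4 << 32) | msg->remote_port;

        /* If the peer is local and present in the map, bypass the TCP
         * stack and queue the data on its ingress queue directly.
         */
        return bpf_msg_redirect_hash(msg, &sock_map, &key, BPF_F_INGRESS);
}

char _license[] SEC("license") = "GPL";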

> For instance, do you control the sender?  Why not do big writes on the
> sender side if raw throughput is what you care about?
> 

As described above, we assume user space uses TCP, and we cannot change
the user space code.

>>> 1) There's no backpressure propagation so you can have a backlog
>>> build-up. One thing to check is what happens if the receiver closes its
>>> window.
>>
>> Right, I am sure there is still a lot we can further optimize. The only
>> question is how much we need for now. How about optimizing it one step
>> at a time? :)
> 
> This is introducing quite a bit of complexity from the start. I'd like
> to at least explore whether it can be done in a simpler fashion before
> committing to it.
> 
> You point at wake-ups as being the throughput killer. As an alternative,
> can we wake up the receiver conditionally? That is, only if the receiver
> has made progress on the queue since the last notification. This could
> also be a form of wakeup moderation.
> 

Wake-ups are indeed one of the throughput killers, and I agree they can
be mitigated by waking up the receiver conditionally.

IIRC, the sock lock is another __main__ throughput killer. In
tcp_bpf_sendmsg, in the context of the sender process, we need to
lock_sock(sender) -> release_sock(sender) -> lock_sock(recv)
-> release_sock(recv) -> lock_sock(sender) -> release_sock(sender).

This makes the sender somewhat dependent on the receiver: when the
receiver is working, the sender is blocked.

    sender                      receiver
tcp_bpf_sendmsg
                            tcp_bpf_recvmsg (working)
tcp_bpf_sendmsg (blocked)


We introduce the kworker here mainly to solve the sock lock issue: we
want senders to only need the sender's sock lock and receivers to only
need the receiver's sock lock. Only the kworker, as a middle man, needs
to take both the sender and receiver locks to transfer data from the
sender to the receiver. As a result, tcp_bpf_sendmsg and tcp_bpf_recvmsg
can be independent of each other.

    sender                      receiver
tcp_bpf_sendmsg
                            tcp_bpf_recvmsg (working)
tcp_bpf_sendmsg
tcp_bpf_sendmsg
...

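In pseudo-code, the intent is roughly the following (backlog_list,
backlog_work and redir_peer are illustrative names, not the exact
fields added by this patch):

/* Sender context: only the sender's lock is needed to cork the msg. */
static int cork_msg(struct sock *sender, struct sk_psock *psock,
                    struct sk_msg *msg)
{
        lock_sock(sender);
        list_add_tail(&msg->list, &psock->backlog_list);
        release_sock(sender);

        schedule_delayed_work(&psock->backlog_work, 1);
        return 0;
}

/* Kworker: the only context that ever takes both socket locks. */
static void backlog_work_fn(struct work_struct *work)
{
        struct sk_psock *psock = container_of(to_delayed_work(work),
                                              struct sk_psock, backlog_work);
        struct sk_psock *peer = psock->redir_peer;      /* receiver side */
        LIST_HEAD(batch);

        lock_sock(psock->sk);
        list_splice_init(&psock->backlog_list, &batch);
        release_sock(psock->sk);

        lock_sock(peer->sk);
        list_splice_tail(&batch, &peer->ingress_msg);
        release_sock(peer->sk);

        peer->sk->sk_data_ready(peer->sk);      /* one wakeup per batch */
}
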
>>> 2) There's a scheduling latency. That's why the performance of splicing
>>> sockets with sockmap (ingress-to-egress) looks bleak [1].
>>
>> Same for regular TCP, we have to wake up the receiver/worker. But I may
>> misunderstand this point?
> 
> What I meant is that, in the pessimistic case, to deliver a message we
> now have to go through two wakeups:
> 
> sender -wakeup-> kworker -wakeup-> receiver
> 
>>> So I have to dig deeper...
>>>
>>> Have you considered and/or evaluated any alternative designs? For
>>> instance, what stops us from having an auto-corking / coalescing
>>> strategy on the sender side?
>>
>> Auto corking _may_ be not as easy as TCP, since essentially we have no
>> protocol here, just a pure socket layer.
> 
> You're right. We don't have a flush signal for auto-corking on the
> sender side with sk_msg's.
> 
> What about what I mentioned above - can we moderate the wakeups based on
> receiver making progress? Does that sound feasible to you?
> 
> Thanks,
> -jkbs


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-07-15  0:26 UTC | newest]

Thread overview: 17+ messages
2025-07-01  1:11 [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
2025-07-01  1:11 ` [Patch bpf-next v4 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
2025-07-02  9:24   ` Jakub Sitnicki
2025-07-01  1:11 ` [Patch bpf-next v4 2/4] skmsg: implement slab allocator cache for sk_msg Cong Wang
2025-07-02 11:36   ` Jakub Sitnicki
2025-07-01  1:12 ` [Patch bpf-next v4 3/4] skmsg: save some space in struct sk_psock Cong Wang
2025-07-02 11:46   ` Jakub Sitnicki
2025-07-01  1:12 ` [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
2025-07-02 12:17   ` Jakub Sitnicki
2025-07-03  2:17     ` Zijian Zhang
2025-07-03 11:32       ` Jakub Sitnicki
2025-07-04  4:20         ` Cong Wang
2025-07-07 17:51           ` Jakub Sitnicki
2025-07-15  0:26             ` Zijian Zhang
2025-07-02 10:22 ` [Patch bpf-next v4 0/4] " Jakub Sitnicki
2025-07-03  1:48   ` Zijian Zhang
2025-07-02 11:05 ` Jakub Sitnicki
