* [Patch bpf-next v3 0/4] tcp_bpf: improve ingress redirection performance with message corking
@ 2025-05-19 20:36 Cong Wang
2025-05-19 20:36 ` [Patch bpf-next v3 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Cong Wang @ 2025-05-19 20:36 UTC (permalink / raw)
To: netdev; +Cc: bpf, zhoufeng.zf, jakub, john.fastabend, zijianzhang, Cong Wang
This patchset improves skmsg ingress redirection performance with a)
message batching via a kworker and b) sk_msg allocation caching via a
dedicated kmem_cache.
As a result, our patches significantly outperform the vanilla kernel in
terms of throughput for almost all packet sizes. The percentage
improvement in throughput ranges from 3.13% to 160.92%, with smaller
packets showing the largest gains.
Latency is slightly higher across most packet sizes compared to the
vanilla kernel, which is expected: it is a natural side effect of
batching.
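To help interpret the numbers below: the sender-side decision added in
patch 4/4 notifies the peer immediately only under a few conditions.
Here is a condensed, standalone restatement of those conditions from
tcp_bpf_ingress_backlog() (the parameter names are descriptive
placeholders, not the kernel's; 65536 is TCP_BPF_GSO_SIZE):

#include <stdbool.h>

static bool notify_peer_now(bool sndbuf_exhausted, unsigned int corked_bytes,
			    bool peer_ingress_empty, bool nodelay_and_not_corked)
{
	return sndbuf_exhausted ||		/* write buffer used up        */
	       corked_bytes >= 65536 ||		/* enough bytes corked         */
	       peer_ingress_empty ||		/* receiver has nothing queued */
	       nodelay_and_not_corked;		/* TCP_NODELAY, no TCP_CORK    */
}

When none of these hold, the wake-up is deferred to the backlog
kworker, which is where the extra latency comes from.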
Here are the detailed benchmarks:
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Throughput |     64     |    128     |    256     |    512     |     1k     |     4k     |    16k     |    32k     |    64k     |    128k    |    256k    |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Vanilla    | 0.17±0.02  | 0.36±0.01  | 0.72±0.02  | 1.37±0.05  | 2.60±0.12  | 8.24±0.44  | 22.38±2.02 | 25.49±1.28 | 43.07±1.36 | 66.87±4.14 | 73.70±7.15 |
| Patched    | 0.41±0.01  | 0.82±0.02  | 1.62±0.05  | 3.33±0.01  | 6.45±0.02  | 21.50±0.08 | 46.22±0.31 | 50.20±1.12 | 45.39±1.29 | 68.96±1.12 | 78.35±1.49 |
| Percentage | 141.18%    | 127.78%    | 125.00%    | 143.07%    | 148.08%    | 160.92%    | 106.52%    | 97.00%     | 5.38%      | 3.13%      | 6.32%      |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Latency    |     64     |    128     |    256     |    512     |     1k     |     4k     |    16k     |    32k     |    63k     |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Vanilla    | 5.80±4.02  | 5.83±3.61  | 5.86±4.10  | 5.91±4.19  | 5.98±4.14  | 6.61±4.47  | 8.60±2.59  | 10.96±5.50 | 15.02±6.78 |
| Patched    | 6.18±3.03  | 6.23±4.38  | 6.25±4.44  | 6.13±4.35  | 6.32±4.23  | 6.94±4.61  | 8.90±5.49  | 11.12±6.10 | 14.88±6.55 |
| Percentage | 6.55%      | 6.87%      | 6.66%      | 3.72%      | 5.68%      | 4.99%      | 3.49%      | 1.46%      | -0.93%     |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
---
v3: no changes, just a rebase
v2: improved the commit message of patch 3/4;
    changed the bitfields to 'u8', as suggested by Jakub
Cong Wang (2):
skmsg: rename sk_msg_alloc() to sk_msg_expand()
skmsg: save some space in struct sk_psock
Zijian Zhang (2):
skmsg: implement slab allocator cache for sk_msg
tcp_bpf: improve ingress redirection performance with message corking
include/linux/skmsg.h | 48 +++++++---
net/core/skmsg.c | 173 ++++++++++++++++++++++++++++++++---
net/ipv4/tcp_bpf.c | 204 +++++++++++++++++++++++++++++++++++++++---
net/tls/tls_sw.c | 6 +-
net/xfrm/espintcp.c | 2 +-
5 files changed, 394 insertions(+), 39 deletions(-)
--
2.34.1
* [Patch bpf-next v3 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand()
2025-05-19 20:36 [Patch bpf-next v3 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
@ 2025-05-19 20:36 ` Cong Wang
2025-05-28 23:51 ` John Fastabend
2025-05-19 20:36 ` [Patch bpf-next v3 2/4] skmsg: implement slab allocator cache for sk_msg Cong Wang
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Cong Wang @ 2025-05-19 20:36 UTC (permalink / raw)
To: netdev; +Cc: bpf, zhoufeng.zf, jakub, john.fastabend, zijianzhang, Cong Wang
From: Cong Wang <cong.wang@bytedance.com>
The name sk_msg_alloc() is misleading: the function does not allocate
an sk_msg at all, it simply refills sock page frags for an existing
sk_msg. Rename it to sk_msg_expand() to better reflect what it actually
does.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
include/linux/skmsg.h | 4 ++--
net/core/skmsg.c | 6 +++---
net/ipv4/tcp_bpf.c | 2 +-
net/tls/tls_sw.c | 6 +++---
net/xfrm/espintcp.c | 2 +-
5 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 0b9095a281b8..d6f0a8cd73c4 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -121,8 +121,8 @@ struct sk_psock {
struct rcu_work rwork;
};
-int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
- int elem_first_coalesce);
+int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
+ int elem_first_coalesce);
int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
u32 off, u32 len);
void sk_msg_trim(struct sock *sk, struct sk_msg *msg, int len);
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 276934673066..8e85f4a3406e 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -24,8 +24,8 @@ static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
return false;
}
-int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
- int elem_first_coalesce)
+int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
+ int elem_first_coalesce)
{
struct page_frag *pfrag = sk_page_frag(sk);
u32 osize = msg->sg.size;
@@ -82,7 +82,7 @@ int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
sk_msg_trim(sk, msg, osize);
return ret;
}
-EXPORT_SYMBOL_GPL(sk_msg_alloc);
+EXPORT_SYMBOL_GPL(sk_msg_expand);
int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
u32 off, u32 len)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index ba581785adb4..85b64ffc20c6 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -530,7 +530,7 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
}
osize = msg_tx->sg.size;
- err = sk_msg_alloc(sk, msg_tx, msg_tx->sg.size + copy, msg_tx->sg.end - 1);
+ err = sk_msg_expand(sk, msg_tx, msg_tx->sg.size + copy, msg_tx->sg.end - 1);
if (err) {
if (err != -ENOSPC)
goto wait_for_memory;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index fc88e34b7f33..1e8d7fc179a8 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -324,7 +324,7 @@ static int tls_alloc_encrypted_msg(struct sock *sk, int len)
struct tls_rec *rec = ctx->open_rec;
struct sk_msg *msg_en = &rec->msg_encrypted;
- return sk_msg_alloc(sk, msg_en, len, 0);
+ return sk_msg_expand(sk, msg_en, len, 0);
}
static int tls_clone_plaintext_msg(struct sock *sk, int required)
@@ -619,8 +619,8 @@ static int tls_split_open_record(struct sock *sk, struct tls_rec *from,
new = tls_get_rec(sk);
if (!new)
return -ENOMEM;
- ret = sk_msg_alloc(sk, &new->msg_encrypted, msg_opl->sg.size +
- tx_overhead_size, 0);
+ ret = sk_msg_expand(sk, &new->msg_encrypted, msg_opl->sg.size +
+ tx_overhead_size, 0);
if (ret < 0) {
tls_free_rec(sk, new);
return ret;
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index fe82e2d07300..4fd03edb4497 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -351,7 +351,7 @@ static int espintcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
sk_msg_init(&emsg->skmsg);
while (1) {
/* only -ENOMEM is possible since we don't coalesce */
- err = sk_msg_alloc(sk, &emsg->skmsg, msglen, 0);
+ err = sk_msg_expand(sk, &emsg->skmsg, msglen, 0);
if (!err)
break;
--
2.34.1
* [Patch bpf-next v3 2/4] skmsg: implement slab allocator cache for sk_msg
2025-05-19 20:36 [Patch bpf-next v3 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
2025-05-19 20:36 ` [Patch bpf-next v3 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
@ 2025-05-19 20:36 ` Cong Wang
2025-05-29 0:04 ` John Fastabend
2025-05-19 20:36 ` [Patch bpf-next v3 3/4] skmsg: save some space in struct sk_psock Cong Wang
2025-05-19 20:36 ` [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
3 siblings, 1 reply; 13+ messages in thread
From: Cong Wang @ 2025-05-19 20:36 UTC (permalink / raw)
To: netdev; +Cc: bpf, zhoufeng.zf, jakub, john.fastabend, zijianzhang, Cong Wang
From: Zijian Zhang <zijianzhang@bytedance.com>
Optimizing redirect ingress performance requires frequent allocation and
deallocation of sk_msg structures. Introduce a dedicated kmem_cache for
sk_msg to reduce memory allocation overhead and improve performance.
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
---
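For readers unfamiliar with the slab API, the lifecycle of a dedicated
cache looks roughly like the sketch below (a minimal, self-contained
module with hypothetical names; the real sk_msg_cachep in the diff is
created from a late_initcall in net/core/skmsg.c and never destroyed):

#include <linux/module.h>
#include <linux/slab.h>

struct demo_obj { int a, b; };

static struct kmem_cache *demo_cachep;

static int __init demo_init(void)
{
	struct demo_obj *obj;

	demo_cachep = kmem_cache_create("demo_cachep", sizeof(struct demo_obj),
					0, SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
					NULL);
	if (!demo_cachep)
		return -ENOMEM;

	obj = kmem_cache_zalloc(demo_cachep, GFP_KERNEL); /* fast, zeroed alloc */
	if (obj)
		kmem_cache_free(demo_cachep, obj);        /* returned to the cache */
	return 0;
}

static void __exit demo_exit(void)
{
	kmem_cache_destroy(demo_cachep);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");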
include/linux/skmsg.h | 21 ++++++++++++---------
net/core/skmsg.c | 28 +++++++++++++++++++++-------
net/ipv4/tcp_bpf.c | 5 ++---
3 files changed, 35 insertions(+), 19 deletions(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index d6f0a8cd73c4..bf28ce9b5fdb 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -121,6 +121,7 @@ struct sk_psock {
struct rcu_work rwork;
};
+struct sk_msg *sk_msg_alloc(gfp_t gfp);
int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
int elem_first_coalesce);
int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
@@ -143,6 +144,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
int len, int flags);
bool sk_msg_is_readable(struct sock *sk);
+extern struct kmem_cache *sk_msg_cachep;
+
static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
{
WARN_ON(i == msg->sg.end && bytes);
@@ -319,6 +322,13 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
kfree_skb(skb);
}
+static inline void kfree_sk_msg(struct sk_msg *msg)
+{
+ if (msg->skb)
+ consume_skb(msg->skb);
+ kmem_cache_free(sk_msg_cachep, msg);
+}
+
static inline bool sk_psock_queue_msg(struct sk_psock *psock,
struct sk_msg *msg)
{
@@ -330,7 +340,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
ret = true;
} else {
sk_msg_free(psock->sk, msg);
- kfree(msg);
+ kfree_sk_msg(msg);
ret = false;
}
spin_unlock_bh(&psock->ingress_lock);
@@ -378,13 +388,6 @@ static inline bool sk_psock_queue_empty(const struct sk_psock *psock)
return psock ? list_empty(&psock->ingress_msg) : true;
}
-static inline void kfree_sk_msg(struct sk_msg *msg)
-{
- if (msg->skb)
- consume_skb(msg->skb);
- kfree(msg);
-}
-
static inline void sk_psock_report_error(struct sk_psock *psock, int err)
{
struct sock *sk = psock->sk;
@@ -441,7 +444,7 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
{
if (psock->cork) {
sk_msg_free(psock->sk, psock->cork);
- kfree(psock->cork);
+ kfree_sk_msg(psock->cork);
psock->cork = NULL;
}
}
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 8e85f4a3406e..30220185fd45 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -10,6 +10,8 @@
#include <net/tls.h>
#include <trace/events/sock.h>
+struct kmem_cache *sk_msg_cachep;
+
static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
{
if (msg->sg.end > msg->sg.start &&
@@ -503,16 +505,17 @@ bool sk_msg_is_readable(struct sock *sk)
}
EXPORT_SYMBOL_GPL(sk_msg_is_readable);
-static struct sk_msg *alloc_sk_msg(gfp_t gfp)
+struct sk_msg *sk_msg_alloc(gfp_t gfp)
{
struct sk_msg *msg;
- msg = kzalloc(sizeof(*msg), gfp | __GFP_NOWARN);
+ msg = kmem_cache_zalloc(sk_msg_cachep, gfp | __GFP_NOWARN);
if (unlikely(!msg))
return NULL;
sg_init_marker(msg->sg.data, NR_MSG_FRAG_IDS);
return msg;
}
+EXPORT_SYMBOL_GPL(sk_msg_alloc);
static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
struct sk_buff *skb)
@@ -523,7 +526,7 @@ static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
if (!sk_rmem_schedule(sk, skb, skb->truesize))
return NULL;
- return alloc_sk_msg(GFP_KERNEL);
+ return sk_msg_alloc(GFP_KERNEL);
}
static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
@@ -598,7 +601,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
skb_set_owner_r(skb, sk);
err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
if (err < 0)
- kfree(msg);
+ kfree_sk_msg(msg);
return err;
}
@@ -609,7 +612,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb,
u32 off, u32 len, bool take_ref)
{
- struct sk_msg *msg = alloc_sk_msg(GFP_ATOMIC);
+ struct sk_msg *msg = sk_msg_alloc(GFP_ATOMIC);
struct sock *sk = psock->sk;
int err;
@@ -618,7 +621,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
skb_set_owner_r(skb, sk);
err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
if (err < 0)
- kfree(msg);
+ kfree_sk_msg(msg);
return err;
}
@@ -787,7 +790,7 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
if (!msg->skb)
atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
sk_msg_free(psock->sk, msg);
- kfree(msg);
+ kfree_sk_msg(msg);
}
}
@@ -1272,3 +1275,14 @@ void sk_psock_stop_verdict(struct sock *sk, struct sk_psock *psock)
sk->sk_data_ready = psock->saved_data_ready;
psock->saved_data_ready = NULL;
}
+
+static int __init sk_msg_cachep_init(void)
+{
+ sk_msg_cachep = kmem_cache_create("sk_msg_cachep",
+ sizeof(struct sk_msg),
+ 0,
+ SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
+ NULL);
+ return 0;
+}
+late_initcall(sk_msg_cachep_init);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 85b64ffc20c6..f0ef41c951e2 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -38,7 +38,7 @@ static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
struct sk_msg *tmp;
int i, ret = 0;
- tmp = kzalloc(sizeof(*tmp), __GFP_NOWARN | GFP_KERNEL);
+ tmp = sk_msg_alloc(GFP_KERNEL);
if (unlikely(!tmp))
return -ENOMEM;
@@ -406,8 +406,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
msg->cork_bytes > msg->sg.size && !enospc) {
psock->cork_bytes = msg->cork_bytes - msg->sg.size;
if (!psock->cork) {
- psock->cork = kzalloc(sizeof(*psock->cork),
- GFP_ATOMIC | __GFP_NOWARN);
+ psock->cork = sk_msg_alloc(GFP_ATOMIC);
if (!psock->cork)
return -ENOMEM;
}
--
2.34.1
* [Patch bpf-next v3 3/4] skmsg: save some space in struct sk_psock
2025-05-19 20:36 [Patch bpf-next v3 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
2025-05-19 20:36 ` [Patch bpf-next v3 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
2025-05-19 20:36 ` [Patch bpf-next v3 2/4] skmsg: implement slab allocator cache for sk_msg Cong Wang
@ 2025-05-19 20:36 ` Cong Wang
2025-05-30 17:15 ` John Fastabend
2025-05-19 20:36 ` [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
3 siblings, 1 reply; 13+ messages in thread
From: Cong Wang @ 2025-05-19 20:36 UTC (permalink / raw)
To: netdev; +Cc: bpf, zhoufeng.zf, jakub, john.fastabend, zijianzhang, Cong Wang
From: Cong Wang <cong.wang@bytedance.com>
This patch aims to save some space in struct sk_psock and prepares for
the next patch which will add more fields.
psock->eval can only have 4 possible values, so an 8-bit field is
sufficient.
psock->redir_ingress is just a boolean, so a single bit is enough.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
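To illustrate the effect, here is a standalone userspace sketch (not
the real struct sk_psock; it also includes the fields added by patch
4/4, which is where the saved bytes actually pay off; the field names
mirror the patches, everything else is hypothetical):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Old layout: u32 eval plus bool flags, with the patch-4 fields on top. */
struct old_layout {
	uint32_t apply_bytes;
	uint32_t cork_bytes;
	uint32_t backlog_since_notify;
	uint32_t eval;
	bool redir_ingress;
	bool backlog_work_delayed;
	void *cork;
};

/* New layout: u8 eval plus single-bit flags, as in this patch and 4/4. */
struct new_layout {
	uint32_t apply_bytes;
	uint32_t cork_bytes;
	uint32_t backlog_since_notify;
	uint8_t eval;
	uint8_t redir_ingress : 1;
	uint8_t backlog_work_delayed : 1;
	void *cork;
};

int main(void)
{
	/* typically prints 32 vs 24 bytes on x86-64 */
	printf("old: %zu bytes, new: %zu bytes\n",
	       sizeof(struct old_layout), sizeof(struct new_layout));
	return 0;
}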
include/linux/skmsg.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index bf28ce9b5fdb..7620f170c4b1 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -85,8 +85,8 @@ struct sk_psock {
struct sock *sk_redir;
u32 apply_bytes;
u32 cork_bytes;
- u32 eval;
- bool redir_ingress; /* undefined if sk_redir is null */
+ u8 eval;
+ u8 redir_ingress : 1; /* undefined if sk_redir is null */
struct sk_msg *cork;
struct sk_psock_progs progs;
#if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
--
2.34.1
* [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking
2025-05-19 20:36 [Patch bpf-next v3 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
` (2 preceding siblings ...)
2025-05-19 20:36 ` [Patch bpf-next v3 3/4] skmsg: save some space in struct sk_psock Cong Wang
@ 2025-05-19 20:36 ` Cong Wang
2025-05-30 20:07 ` John Fastabend
3 siblings, 1 reply; 13+ messages in thread
From: Cong Wang @ 2025-05-19 20:36 UTC (permalink / raw)
To: netdev
Cc: bpf, zhoufeng.zf, jakub, john.fastabend, zijianzhang, Amery Hung,
Cong Wang
From: Zijian Zhang <zijianzhang@bytedance.com>
The TCP_BPF ingress redirection path currently lacks the message corking
mechanism found in standard TCP. This causes the sender to wake up the
receiver for every message, even when messages are small, resulting in
reduced throughput compared to regular TCP in certain scenarios.
This change introduces a kernel worker-based intermediate layer to provide
automatic message corking for TCP_BPF. While this adds a slight latency
overhead, it significantly improves overall throughput by reducing
unnecessary wake-ups and reducing the sock lock contention.
Reviewed-by: Amery Hung <amery.hung@bytedance.com>
Co-developed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
---
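Usage note (not part of the patch, a minimal sketch only): senders that
prefer latency over throughput can opt out of the corking per socket,
because tcp_bpf_ingress_backlog() below honors the nonagle state of the
redirecting socket. The 'fd' here is a hypothetical TCP socket whose
sendmsg() data is being redirected:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* With TCP_NODELAY set (and TCP_CORK clear) on the redirecting socket,
 * the nonagle check in tcp_bpf_ingress_backlog() flushes the backlog and
 * notifies the peer in the sendmsg() context instead of deferring to the
 * backlog kworker.
 */
int prefer_low_latency(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}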
include/linux/skmsg.h | 19 ++++
net/core/skmsg.c | 139 ++++++++++++++++++++++++++++-
net/ipv4/tcp_bpf.c | 197 ++++++++++++++++++++++++++++++++++++++++--
3 files changed, 347 insertions(+), 8 deletions(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 7620f170c4b1..2531428168ad 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -15,6 +15,8 @@
#define MAX_MSG_FRAGS MAX_SKB_FRAGS
#define NR_MSG_FRAG_IDS (MAX_MSG_FRAGS + 1)
+/* GSO size for TCP BPF backlog processing */
+#define TCP_BPF_GSO_SIZE 65536
enum __sk_action {
__SK_DROP = 0,
@@ -85,8 +87,10 @@ struct sk_psock {
struct sock *sk_redir;
u32 apply_bytes;
u32 cork_bytes;
+ u32 backlog_since_notify;
u8 eval;
u8 redir_ingress : 1; /* undefined if sk_redir is null */
+ u8 backlog_work_delayed : 1;
struct sk_msg *cork;
struct sk_psock_progs progs;
#if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
@@ -97,6 +101,9 @@ struct sk_psock {
struct sk_buff_head ingress_skb;
struct list_head ingress_msg;
spinlock_t ingress_lock;
+ struct list_head backlog_msg;
+ /* spin_lock for backlog_msg and backlog_since_notify */
+ spinlock_t backlog_msg_lock;
unsigned long state;
struct list_head link;
spinlock_t link_lock;
@@ -117,11 +124,13 @@ struct sk_psock {
struct mutex work_mutex;
struct sk_psock_work_state work_state;
struct delayed_work work;
+ struct delayed_work backlog_work;
struct sock *sk_pair;
struct rcu_work rwork;
};
struct sk_msg *sk_msg_alloc(gfp_t gfp);
+bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce);
int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
int elem_first_coalesce);
int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
@@ -396,9 +405,19 @@ static inline void sk_psock_report_error(struct sk_psock *psock, int err)
sk_error_report(sk);
}
+void sk_psock_backlog_msg(struct sk_psock *psock);
struct sk_psock *sk_psock_init(struct sock *sk, int node);
void sk_psock_stop(struct sk_psock *psock);
+static inline void sk_psock_run_backlog_work(struct sk_psock *psock,
+ bool delayed)
+{
+ if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+ return;
+ psock->backlog_work_delayed = delayed;
+ schedule_delayed_work(&psock->backlog_work, delayed ? 1 : 0);
+}
+
#if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
int sk_psock_init_strp(struct sock *sk, struct sk_psock *psock);
void sk_psock_start_strp(struct sock *sk, struct sk_psock *psock);
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 30220185fd45..d0bc1322ac8f 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -12,7 +12,7 @@
struct kmem_cache *sk_msg_cachep;
-static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
+bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
{
if (msg->sg.end > msg->sg.start &&
elem_first_coalesce < msg->sg.end)
@@ -713,6 +713,118 @@ static void sk_psock_backlog(struct work_struct *work)
mutex_unlock(&psock->work_mutex);
}
+static bool backlog_notify(struct sk_psock *psock, bool m_sched_failed,
+ bool ingress_empty)
+{
+ /* Notify if:
+ * 1. We have corked enough bytes
+ * 2. We have already delayed notification
+ * 3. Memory allocation failed
+ * 4. Ingress queue was empty and we're about to add data
+ */
+ return psock->backlog_since_notify >= TCP_BPF_GSO_SIZE ||
+ psock->backlog_work_delayed ||
+ m_sched_failed ||
+ ingress_empty;
+}
+
+static bool backlog_xfer_to_local(struct sk_psock *psock, struct sock *sk_from,
+ struct list_head *local_head, u32 *tot_size)
+{
+ struct sock *sk = psock->sk;
+ struct sk_msg *msg, *tmp;
+ u32 size = 0;
+
+ list_for_each_entry_safe(msg, tmp, &psock->backlog_msg, list) {
+ if (msg->sk != sk_from)
+ break;
+
+ if (!__sk_rmem_schedule(sk, msg->sg.size, false))
+ return true;
+
+ list_move_tail(&msg->list, local_head);
+ sk_wmem_queued_add(msg->sk, -msg->sg.size);
+ sock_put(msg->sk);
+ msg->sk = NULL;
+ psock->backlog_since_notify += msg->sg.size;
+ size += msg->sg.size;
+ }
+
+ *tot_size = size;
+ return false;
+}
+
+/* This function handles the transfer of backlogged messages from the sender
+ * backlog queue to the ingress queue of the peer socket. Notification of data
+ * availability will be sent under some conditions.
+ */
+void sk_psock_backlog_msg(struct sk_psock *psock)
+{
+ bool rmem_schedule_failed = false;
+ struct sock *sk_from = NULL;
+ struct sock *sk = psock->sk;
+ LIST_HEAD(local_head);
+ struct sk_msg *msg;
+ bool should_notify;
+ u32 tot_size = 0;
+
+ if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+ return;
+
+ lock_sock(sk);
+ spin_lock(&psock->backlog_msg_lock);
+
+ msg = list_first_entry_or_null(&psock->backlog_msg,
+ struct sk_msg, list);
+ if (!msg) {
+ should_notify = !list_empty(&psock->ingress_msg);
+ spin_unlock(&psock->backlog_msg_lock);
+ goto notify;
+ }
+
+ sk_from = msg->sk;
+ sock_hold(sk_from);
+
+ rmem_schedule_failed = backlog_xfer_to_local(psock, sk_from,
+ &local_head, &tot_size);
+ should_notify = backlog_notify(psock, rmem_schedule_failed,
+ list_empty(&psock->ingress_msg));
+ spin_unlock(&psock->backlog_msg_lock);
+
+ spin_lock_bh(&psock->ingress_lock);
+ list_splice_tail_init(&local_head, &psock->ingress_msg);
+ spin_unlock_bh(&psock->ingress_lock);
+
+ atomic_add(tot_size, &sk->sk_rmem_alloc);
+ sk_mem_charge(sk, tot_size);
+
+notify:
+ if (should_notify) {
+ psock->backlog_since_notify = 0;
+ sk_psock_data_ready(sk, psock);
+ if (!list_empty(&psock->backlog_msg))
+ sk_psock_run_backlog_work(psock, rmem_schedule_failed);
+ } else {
+ sk_psock_run_backlog_work(psock, true);
+ }
+ release_sock(sk);
+
+ if (sk_from) {
+ bool slow = lock_sock_fast(sk_from);
+
+ sk_mem_uncharge(sk_from, tot_size);
+ unlock_sock_fast(sk_from, slow);
+ sock_put(sk_from);
+ }
+}
+
+static void sk_psock_backlog_msg_work(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+
+ sk_psock_backlog_msg(container_of(dwork, struct sk_psock, backlog_work));
+}
+
struct sk_psock *sk_psock_init(struct sock *sk, int node)
{
struct sk_psock *psock;
@@ -750,8 +862,11 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
INIT_DELAYED_WORK(&psock->work, sk_psock_backlog);
mutex_init(&psock->work_mutex);
+ INIT_DELAYED_WORK(&psock->backlog_work, sk_psock_backlog_msg_work);
INIT_LIST_HEAD(&psock->ingress_msg);
spin_lock_init(&psock->ingress_lock);
+ INIT_LIST_HEAD(&psock->backlog_msg);
+ spin_lock_init(&psock->backlog_msg_lock);
skb_queue_head_init(&psock->ingress_skb);
sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
@@ -805,6 +920,26 @@ static void __sk_psock_zap_ingress(struct sk_psock *psock)
__sk_psock_purge_ingress_msg(psock);
}
+static void __sk_psock_purge_backlog_msg(struct sk_psock *psock)
+{
+ struct sk_msg *msg, *tmp;
+
+ spin_lock(&psock->backlog_msg_lock);
+ list_for_each_entry_safe(msg, tmp, &psock->backlog_msg, list) {
+ struct sock *sk_from = msg->sk;
+ bool slow;
+
+ list_del(&msg->list);
+ slow = lock_sock_fast(sk_from);
+ sk_wmem_queued_add(sk_from, -msg->sg.size);
+ sock_put(sk_from);
+ sk_msg_free(sk_from, msg);
+ unlock_sock_fast(sk_from, slow);
+ kfree_sk_msg(msg);
+ }
+ spin_unlock(&psock->backlog_msg_lock);
+}
+
static void sk_psock_link_destroy(struct sk_psock *psock)
{
struct sk_psock_link *link, *tmp;
@@ -834,7 +969,9 @@ static void sk_psock_destroy(struct work_struct *work)
sk_psock_done_strp(psock);
cancel_delayed_work_sync(&psock->work);
+ cancel_delayed_work_sync(&psock->backlog_work);
__sk_psock_zap_ingress(psock);
+ __sk_psock_purge_backlog_msg(psock);
mutex_destroy(&psock->work_mutex);
psock_progs_drop(&psock->progs);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index f0ef41c951e2..82d437210f6f 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -381,6 +381,183 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
return ret;
}
+static int tcp_bpf_coalesce_msg(struct sk_msg *last, struct sk_msg *msg,
+ u32 *apply_bytes_ptr, u32 *tot_size)
+{
+ struct scatterlist *sge_from, *sge_to;
+ u32 apply_bytes = *apply_bytes_ptr;
+ bool apply = apply_bytes;
+ int i = msg->sg.start;
+ u32 size;
+
+ while (i != msg->sg.end) {
+ int last_sge_idx = last->sg.end;
+
+ if (sk_msg_full(last))
+ break;
+
+ sge_from = sk_msg_elem(msg, i);
+ sk_msg_iter_var_prev(last_sge_idx);
+ sge_to = &last->sg.data[last_sge_idx];
+
+ size = (apply && apply_bytes < sge_from->length) ?
+ apply_bytes : sge_from->length;
+ if (sk_msg_try_coalesce_ok(last, last_sge_idx) &&
+ sg_page(sge_to) == sg_page(sge_from) &&
+ sge_to->offset + sge_to->length == sge_from->offset) {
+ sge_to->length += size;
+ } else {
+ sge_to = &last->sg.data[last->sg.end];
+ sg_unmark_end(sge_to);
+ sg_set_page(sge_to, sg_page(sge_from), size,
+ sge_from->offset);
+ get_page(sg_page(sge_to));
+ sk_msg_iter_next(last, end);
+ }
+
+ sge_from->length -= size;
+ sge_from->offset += size;
+
+ if (sge_from->length == 0) {
+ put_page(sg_page(sge_to));
+ sk_msg_iter_var_next(i);
+ }
+
+ msg->sg.size -= size;
+ last->sg.size += size;
+ *tot_size += size;
+
+ if (apply) {
+ apply_bytes -= size;
+ if (!apply_bytes)
+ break;
+ }
+ }
+
+ if (apply)
+ *apply_bytes_ptr = apply_bytes;
+
+ msg->sg.start = i;
+ return i;
+}
+
+static void tcp_bpf_xfer_msg(struct sk_msg *dst, struct sk_msg *msg,
+ u32 *apply_bytes_ptr, u32 *tot_size)
+{
+ u32 apply_bytes = *apply_bytes_ptr;
+ bool apply = apply_bytes;
+ struct scatterlist *sge;
+ int i = msg->sg.start;
+ u32 size;
+
+ do {
+ sge = sk_msg_elem(msg, i);
+ size = (apply && apply_bytes < sge->length) ?
+ apply_bytes : sge->length;
+
+ sk_msg_xfer(dst, msg, i, size);
+ *tot_size += size;
+ if (sge->length)
+ get_page(sk_msg_page(dst, i));
+ sk_msg_iter_var_next(i);
+ dst->sg.end = i;
+ if (apply) {
+ apply_bytes -= size;
+ if (!apply_bytes) {
+ if (sge->length)
+ sk_msg_iter_var_prev(i);
+ break;
+ }
+ }
+ } while (i != msg->sg.end);
+
+ if (apply)
+ *apply_bytes_ptr = apply_bytes;
+ msg->sg.start = i;
+}
+
+static int tcp_bpf_ingress_backlog(struct sock *sk, struct sock *sk_redir,
+ struct sk_msg *msg, u32 apply_bytes)
+{
+ bool ingress_msg_empty = false;
+ bool apply = apply_bytes;
+ struct sk_psock *psock;
+ struct sk_msg *tmp;
+ u32 tot_size = 0;
+ int ret = 0;
+ u8 nonagle;
+
+ psock = sk_psock_get(sk_redir);
+ if (unlikely(!psock))
+ return -EPIPE;
+
+ spin_lock(&psock->backlog_msg_lock);
+ /* If possible, coalesce the curr sk_msg to the last sk_msg from the
+ * psock->backlog_msg.
+ */
+ if (!list_empty(&psock->backlog_msg)) {
+ struct sk_msg *last;
+
+ last = list_last_entry(&psock->backlog_msg, struct sk_msg, list);
+ if (last->sk == sk) {
+ int i = tcp_bpf_coalesce_msg(last, msg, &apply_bytes,
+ &tot_size);
+
+ if (i == msg->sg.end || (apply && !apply_bytes))
+ goto out_unlock;
+ }
+ }
+
+ /* Otherwise, allocate a new sk_msg and transfer the data from the
+ * passed in msg to it.
+ */
+ tmp = sk_msg_alloc(GFP_ATOMIC);
+ if (!tmp) {
+ ret = -ENOMEM;
+ spin_unlock(&psock->backlog_msg_lock);
+ goto error;
+ }
+
+ tmp->sk = sk;
+ sock_hold(tmp->sk);
+ tmp->sg.start = msg->sg.start;
+ tcp_bpf_xfer_msg(tmp, msg, &apply_bytes, &tot_size);
+
+ ingress_msg_empty = list_empty(&psock->ingress_msg);
+ list_add_tail(&tmp->list, &psock->backlog_msg);
+
+out_unlock:
+ spin_unlock(&psock->backlog_msg_lock);
+ sk_wmem_queued_add(sk, tot_size);
+
+ /* At this point, the data has been handled well. If one of the
+ * following conditions is met, we can notify the peer socket in
+ * the context of this system call immediately.
+ * 1. If the write buffer has been used up;
+ * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
+ * 3. Or, the ingress queue was empty;
+ * 4. Or, the tcp socket is set to no_delay.
+ * Otherwise, kick off the backlog work so that we can have some
+ * time to wait for any incoming messages before sending a
+ * notification to the peer socket.
+ */
+ nonagle = tcp_sk(sk)->nonagle;
+ if (!sk_stream_memory_free(sk) ||
+ tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
+ (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
+ release_sock(sk);
+ psock->backlog_work_delayed = false;
+ sk_psock_backlog_msg(psock);
+ lock_sock(sk);
+ } else {
+ sk_psock_run_backlog_work(psock, false);
+ }
+
+error:
+ sk_psock_put(sk_redir, psock);
+ return ret;
+}
+
static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
struct sk_msg *msg, int *copied, int flags)
{
@@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
cork = true;
psock->cork = NULL;
}
- release_sock(sk);
- origsize = msg->sg.size;
- ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
- msg, tosend, flags);
- sent = origsize - msg->sg.size;
+ if (redir_ingress) {
+ ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
+ } else {
+ release_sock(sk);
+
+ origsize = msg->sg.size;
+ ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
+ msg, tosend, flags);
+ sent = origsize - msg->sg.size;
+
+ lock_sock(sk);
+ sk_mem_uncharge(sk, sent);
+ }
if (eval == __SK_REDIRECT)
sock_put(sk_redir);
- lock_sock(sk);
- sk_mem_uncharge(sk, sent);
if (unlikely(ret < 0)) {
int free = sk_msg_free(sk, msg);
--
2.34.1
* Re: [Patch bpf-next v3 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand()
2025-05-19 20:36 ` [Patch bpf-next v3 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
@ 2025-05-28 23:51 ` John Fastabend
0 siblings, 0 replies; 13+ messages in thread
From: John Fastabend @ 2025-05-28 23:51 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, bpf, zhoufeng.zf, jakub, zijianzhang, Cong Wang
On 2025-05-19 13:36:25, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> The name sk_msg_alloc() is misleading: the function does not allocate
> an sk_msg at all, it simply refills sock page frags for an existing
> sk_msg. Rename it to sk_msg_expand() to better reflect what it actually
> does.
>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
* Re: [Patch bpf-next v3 2/4] skmsg: implement slab allocator cache for sk_msg
2025-05-19 20:36 ` [Patch bpf-next v3 2/4] skmsg: implement slab allocator cache for sk_msg Cong Wang
@ 2025-05-29 0:04 ` John Fastabend
2025-05-29 0:49 ` Zijian Zhang
0 siblings, 1 reply; 13+ messages in thread
From: John Fastabend @ 2025-05-29 0:04 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, bpf, zhoufeng.zf, jakub, zijianzhang, Cong Wang
On 2025-05-19 13:36:26, Cong Wang wrote:
> From: Zijian Zhang <zijianzhang@bytedance.com>
>
> Optimizing redirect ingress performance requires frequent allocation and
> deallocation of sk_msg structures. Introduce a dedicated kmem_cache for
> sk_msg to reduce memory allocation overhead and improve performance.
>
> Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> ---
> include/linux/skmsg.h | 21 ++++++++++++---------
> net/core/skmsg.c | 28 +++++++++++++++++++++-------
> net/ipv4/tcp_bpf.c | 5 ++---
> 3 files changed, 35 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index d6f0a8cd73c4..bf28ce9b5fdb 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -121,6 +121,7 @@ struct sk_psock {
> struct rcu_work rwork;
> };
>
> +struct sk_msg *sk_msg_alloc(gfp_t gfp);
> int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
> int elem_first_coalesce);
> int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
> @@ -143,6 +144,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> int len, int flags);
> bool sk_msg_is_readable(struct sock *sk);
>
> +extern struct kmem_cache *sk_msg_cachep;
> +
> static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
> {
> WARN_ON(i == msg->sg.end && bytes);
> @@ -319,6 +322,13 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
> kfree_skb(skb);
> }
>
> +static inline void kfree_sk_msg(struct sk_msg *msg)
> +{
> + if (msg->skb)
> + consume_skb(msg->skb);
> + kmem_cache_free(sk_msg_cachep, msg);
> +}
> +
> static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> struct sk_msg *msg)
> {
> @@ -330,7 +340,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> ret = true;
> } else {
> sk_msg_free(psock->sk, msg);
> - kfree(msg);
> + kfree_sk_msg(msg);
Isn't this a potential use-after-free on msg->skb? The sk_msg_free() a
line above will consume_skb() if it exists, and it is not set to NULL,
so we would consume_skb() again?
> ret = false;
> }
> spin_unlock_bh(&psock->ingress_lock);
> @@ -378,13 +388,6 @@ static inline bool sk_psock_queue_empty(const struct sk_psock *psock)
> return psock ? list_empty(&psock->ingress_msg) : true;
> }
>
> -static inline void kfree_sk_msg(struct sk_msg *msg)
> -{
> - if (msg->skb)
> - consume_skb(msg->skb);
> - kfree(msg);
> -}
> -
> static inline void sk_psock_report_error(struct sk_psock *psock, int err)
> {
> struct sock *sk = psock->sk;
> @@ -441,7 +444,7 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
> {
> if (psock->cork) {
> sk_msg_free(psock->sk, psock->cork);
> - kfree(psock->cork);
> + kfree_sk_msg(psock->cork);
Same here.
> psock->cork = NULL;
> }
> }
* Re: [Patch bpf-next v3 2/4] skmsg: implement slab allocator cache for sk_msg
2025-05-29 0:04 ` John Fastabend
@ 2025-05-29 0:49 ` Zijian Zhang
2025-05-29 18:38 ` Cong Wang
0 siblings, 1 reply; 13+ messages in thread
From: Zijian Zhang @ 2025-05-29 0:49 UTC (permalink / raw)
To: John Fastabend, Cong Wang; +Cc: netdev, bpf, zhoufeng.zf, jakub, Cong Wang
On 5/28/25 5:04 PM, John Fastabend wrote:
> On 2025-05-19 13:36:26, Cong Wang wrote:
>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>
>> Optimizing redirect ingress performance requires frequent allocation and
>> deallocation of sk_msg structures. Introduce a dedicated kmem_cache for
>> sk_msg to reduce memory allocation overhead and improve performance.
>>
>> Reviewed-by: Cong Wang <cong.wang@bytedance.com>
>> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
>> ---
>> include/linux/skmsg.h | 21 ++++++++++++---------
>> net/core/skmsg.c | 28 +++++++++++++++++++++-------
>> net/ipv4/tcp_bpf.c | 5 ++---
>> 3 files changed, 35 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
>> index d6f0a8cd73c4..bf28ce9b5fdb 100644
>> --- a/include/linux/skmsg.h
>> +++ b/include/linux/skmsg.h
>> @@ -121,6 +121,7 @@ struct sk_psock {
>> struct rcu_work rwork;
>> };
>>
>> +struct sk_msg *sk_msg_alloc(gfp_t gfp);
>> int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
>> int elem_first_coalesce);
>> int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
>> @@ -143,6 +144,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>> int len, int flags);
>> bool sk_msg_is_readable(struct sock *sk);
>>
>> +extern struct kmem_cache *sk_msg_cachep;
>> +
>> static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
>> {
>> WARN_ON(i == msg->sg.end && bytes);
>> @@ -319,6 +322,13 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
>> kfree_skb(skb);
>> }
>>
>> +static inline void kfree_sk_msg(struct sk_msg *msg)
>> +{
>> + if (msg->skb)
>> + consume_skb(msg->skb);
>> + kmem_cache_free(sk_msg_cachep, msg);
>> +}
>> +
>> static inline bool sk_psock_queue_msg(struct sk_psock *psock,
>> struct sk_msg *msg)
>> {
>> @@ -330,7 +340,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
>> ret = true;
>> } else {
>> sk_msg_free(psock->sk, msg);
>> - kfree(msg);
>> + kfree_sk_msg(msg);
>
> Isn't this a potential use-after-free on msg->skb? The sk_msg_free() a
> line above will consume_skb() if it exists, and it is not set to NULL,
> so we would consume_skb() again?
>
Thanks to sk_msg_free(): after consuming the skb, it invokes
sk_msg_init(), which sets msg->skb to NULL and prevents a further
double free.
To avoid the confusion, we can replace kfree_sk_msg() here with
kmem_cache_free().
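A simplified userspace model of that call order (the kernel helpers are
replaced by stubs and only the msg->skb handling is modeled):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_skb { int refs; };
struct fake_msg { struct fake_skb *skb; };

static void consume_skb(struct fake_skb *skb) { skb->refs--; }

/* models sk_msg_free(): releases resources, then re-initializes */
static void msg_free(struct fake_msg *msg)
{
	if (msg->skb)
		consume_skb(msg->skb);
	msg->skb = NULL;		/* the sk_msg_init() part */
}

/* models kfree_sk_msg() from this patch */
static void msg_kfree(struct fake_msg *msg)
{
	if (msg->skb)			/* already NULL, so no second consume */
		consume_skb(msg->skb);
	free(msg);
}

int main(void)
{
	struct fake_skb skb = { .refs = 1 };
	struct fake_msg *msg = calloc(1, sizeof(*msg));

	if (!msg)
		return 1;
	msg->skb = &skb;
	msg_free(msg);			/* sk_msg_free(psock->sk, msg) */
	msg_kfree(msg);			/* kfree_sk_msg(msg) */
	assert(skb.refs == 0);		/* consumed exactly once */
	printf("skb refs after the free path: %d\n", skb.refs);
	return 0;
}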
>> ret = false;
>> }
>> spin_unlock_bh(&psock->ingress_lock);
>> @@ -378,13 +388,6 @@ static inline bool sk_psock_queue_empty(const struct sk_psock *psock)
>> return psock ? list_empty(&psock->ingress_msg) : true;
>> }
>>
>> -static inline void kfree_sk_msg(struct sk_msg *msg)
>> -{
>> - if (msg->skb)
>> - consume_skb(msg->skb);
>> - kfree(msg);
>> -}
>> -
>> static inline void sk_psock_report_error(struct sk_psock *psock, int err)
>> {
>> struct sock *sk = psock->sk;
>> @@ -441,7 +444,7 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
>> {
>> if (psock->cork) {
>> sk_msg_free(psock->sk, psock->cork);
>> - kfree(psock->cork);
>> + kfree_sk_msg(psock->cork);
>
> Same here.
>
>> psock->cork = NULL;
>> }
>> }
* Re: [Patch bpf-next v3 2/4] skmsg: implement slab allocator cache for sk_msg
2025-05-29 0:49 ` Zijian Zhang
@ 2025-05-29 18:38 ` Cong Wang
2025-05-30 6:30 ` John Fastabend
0 siblings, 1 reply; 13+ messages in thread
From: Cong Wang @ 2025-05-29 18:38 UTC (permalink / raw)
To: Zijian Zhang; +Cc: John Fastabend, netdev, bpf, zhoufeng.zf, jakub, Cong Wang
On Wed, May 28, 2025 at 05:49:22PM -0700, Zijian Zhang wrote:
> On 5/28/25 5:04 PM, John Fastabend wrote:
> > On 2025-05-19 13:36:26, Cong Wang wrote:
> > > From: Zijian Zhang <zijianzhang@bytedance.com>
> > >
> > > Optimizing redirect ingress performance requires frequent allocation and
> > > deallocation of sk_msg structures. Introduce a dedicated kmem_cache for
> > > sk_msg to reduce memory allocation overhead and improve performance.
> > >
> > > Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> > > Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> > > ---
> > > include/linux/skmsg.h | 21 ++++++++++++---------
> > > net/core/skmsg.c | 28 +++++++++++++++++++++-------
> > > net/ipv4/tcp_bpf.c | 5 ++---
> > > 3 files changed, 35 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> > > index d6f0a8cd73c4..bf28ce9b5fdb 100644
> > > --- a/include/linux/skmsg.h
> > > +++ b/include/linux/skmsg.h
> > > @@ -121,6 +121,7 @@ struct sk_psock {
> > > struct rcu_work rwork;
> > > };
> > > +struct sk_msg *sk_msg_alloc(gfp_t gfp);
> > > int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
> > > int elem_first_coalesce);
> > > int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
> > > @@ -143,6 +144,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> > > int len, int flags);
> > > bool sk_msg_is_readable(struct sock *sk);
> > > +extern struct kmem_cache *sk_msg_cachep;
> > > +
> > > static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
> > > {
> > > WARN_ON(i == msg->sg.end && bytes);
> > > @@ -319,6 +322,13 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
> > > kfree_skb(skb);
> > > }
> > > +static inline void kfree_sk_msg(struct sk_msg *msg)
> > > +{
> > > + if (msg->skb)
> > > + consume_skb(msg->skb);
> > > + kmem_cache_free(sk_msg_cachep, msg);
> > > +}
> > > +
> > > static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> > > struct sk_msg *msg)
> > > {
> > > @@ -330,7 +340,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> > > ret = true;
> > > } else {
> > > sk_msg_free(psock->sk, msg);
> > > - kfree(msg);
> > > + kfree_sk_msg(msg);
> >
> > Isn't this a potential use-after-free on msg->skb? The sk_msg_free() a
> > line above will consume_skb() if it exists, and it is not set to NULL,
> > so we would consume_skb() again?
> >
>
> Thanks to sk_msg_free(): after consuming the skb, it invokes
> sk_msg_init(), which sets msg->skb to NULL and prevents a further
> double free.
>
> To avoid the confusion, we can replace kfree_sk_msg() here with
> kmem_cache_free().
>
Right, the re-initialization in sk_msg_free() is indeed confusing;
maybe it is time to clean up its logic, for example by separating
sk_msg_init() out from sk_msg_free().
I can add a separate patch for this in the next update, if people
prefer.
Thanks!
* Re: [Patch bpf-next v3 2/4] skmsg: implement slab allocator cache for sk_msg
2025-05-29 18:38 ` Cong Wang
@ 2025-05-30 6:30 ` John Fastabend
0 siblings, 0 replies; 13+ messages in thread
From: John Fastabend @ 2025-05-30 6:30 UTC (permalink / raw)
To: Cong Wang; +Cc: Zijian Zhang, netdev, bpf, zhoufeng.zf, jakub, Cong Wang
On 2025-05-29 11:38:19, Cong Wang wrote:
> On Wed, May 28, 2025 at 05:49:22PM -0700, Zijian Zhang wrote:
> > On 5/28/25 5:04 PM, John Fastabend wrote:
> > > On 2025-05-19 13:36:26, Cong Wang wrote:
> > > > From: Zijian Zhang <zijianzhang@bytedance.com>
> > > >
> > > > Optimizing redirect ingress performance requires frequent allocation and
> > > > deallocation of sk_msg structures. Introduce a dedicated kmem_cache for
> > > > sk_msg to reduce memory allocation overhead and improve performance.
> > > >
> > > > Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> > > > Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> > > > ---
> > > > include/linux/skmsg.h | 21 ++++++++++++---------
> > > > net/core/skmsg.c | 28 +++++++++++++++++++++-------
> > > > net/ipv4/tcp_bpf.c | 5 ++---
> > > > 3 files changed, 35 insertions(+), 19 deletions(-)
> > > >
> > > > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> > > > index d6f0a8cd73c4..bf28ce9b5fdb 100644
> > > > --- a/include/linux/skmsg.h
> > > > +++ b/include/linux/skmsg.h
> > > > @@ -121,6 +121,7 @@ struct sk_psock {
> > > > struct rcu_work rwork;
> > > > };
> > > > +struct sk_msg *sk_msg_alloc(gfp_t gfp);
> > > > int sk_msg_expand(struct sock *sk, struct sk_msg *msg, int len,
> > > > int elem_first_coalesce);
> > > > int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
> > > > @@ -143,6 +144,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> > > > int len, int flags);
> > > > bool sk_msg_is_readable(struct sock *sk);
> > > > +extern struct kmem_cache *sk_msg_cachep;
> > > > +
> > > > static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
> > > > {
> > > > WARN_ON(i == msg->sg.end && bytes);
> > > > @@ -319,6 +322,13 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
> > > > kfree_skb(skb);
> > > > }
> > > > +static inline void kfree_sk_msg(struct sk_msg *msg)
> > > > +{
> > > > + if (msg->skb)
> > > > + consume_skb(msg->skb);
> > > > + kmem_cache_free(sk_msg_cachep, msg);
> > > > +}
> > > > +
> > > > static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> > > > struct sk_msg *msg)
> > > > {
> > > > @@ -330,7 +340,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> > > > ret = true;
> > > > } else {
> > > > sk_msg_free(psock->sk, msg);
> > > > - kfree(msg);
> > > > + kfree_sk_msg(msg);
> > >
> > > Isn't this a potential use-after-free on msg->skb? The sk_msg_free() a
> > > line above will consume_skb() if it exists, and it is not set to NULL,
> > > so we would consume_skb() again?
> > >
> >
> > Thanks to sk_msg_free(): after consuming the skb, it invokes
> > sk_msg_init(), which sets msg->skb to NULL and prevents a further
> > double free.
> >
> > To avoid the confusion, we can replace kfree_sk_msg() here with
> > kmem_cache_free().
> >
>
> Right, the re-initialization in sk_msg_free() is indeed confusing, maybe
> it is time to clean up its logic? For example, separate sk_msg_init()
> out from sk_msg_free().
>
> I can add a separate patch for this in next update, if people prefer.
>
> Thanks!
OK, so it's not a problem since we have the init there. So ACK for
this patch.
Perhaps a follow-up to clean up the different types of 'frees' would be
useful, e.g. merging sk_msg_free() and kfree_sk_msg() into a single
call. But I'm not completely convinced it's worth the churn.
Acked-by: John Fastabend <john.fastabend@gmail.com>
* Re: [Patch bpf-next v3 3/4] skmsg: save some space in struct sk_psock
2025-05-19 20:36 ` [Patch bpf-next v3 3/4] skmsg: save some space in struct sk_psock Cong Wang
@ 2025-05-30 17:15 ` John Fastabend
0 siblings, 0 replies; 13+ messages in thread
From: John Fastabend @ 2025-05-30 17:15 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, bpf, zhoufeng.zf, jakub, zijianzhang, Cong Wang
On 2025-05-19 13:36:27, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> This patch aims to save some space in struct sk_psock and prepares for
> the next patch which will add more fields.
>
> psock->eval can only have 4 possible values, so an 8-bit field is
> sufficient.
>
> psock->redir_ingress is just a boolean, so a single bit is enough.
>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
> include/linux/skmsg.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index bf28ce9b5fdb..7620f170c4b1 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -85,8 +85,8 @@ struct sk_psock {
> struct sock *sk_redir;
> u32 apply_bytes;
> u32 cork_bytes;
> - u32 eval;
> - bool redir_ingress; /* undefined if sk_redir is null */
> + u8 eval;
> + u8 redir_ingress : 1; /* undefined if sk_redir is null */
> struct sk_msg *cork;
> struct sk_psock_progs progs;
> #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
> --
> 2.34.1
>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
* Re: [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking
2025-05-19 20:36 ` [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
@ 2025-05-30 20:07 ` John Fastabend
2025-05-30 20:37 ` Zijian Zhang
0 siblings, 1 reply; 13+ messages in thread
From: John Fastabend @ 2025-05-30 20:07 UTC (permalink / raw)
To: Cong Wang
Cc: netdev, bpf, zhoufeng.zf, jakub, zijianzhang, Amery Hung,
Cong Wang
On 2025-05-19 13:36:28, Cong Wang wrote:
> From: Zijian Zhang <zijianzhang@bytedance.com>
>
> The TCP_BPF ingress redirection path currently lacks the message corking
> mechanism found in standard TCP. This causes the sender to wake up the
> receiver for every message, even when messages are small, resulting in
> reduced throughput compared to regular TCP in certain scenarios.
>
> This change introduces a kernel worker-based intermediate layer to provide
> automatic message corking for TCP_BPF. While this adds a slight latency
> overhead, it significantly improves overall throughput by reducing
> unnecessary wake-ups and reducing the sock lock contention.
>
> Reviewed-by: Amery Hung <amery.hung@bytedance.com>
> Co-developed-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> ---
> include/linux/skmsg.h | 19 ++++
> net/core/skmsg.c | 139 ++++++++++++++++++++++++++++-
> net/ipv4/tcp_bpf.c | 197 ++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 347 insertions(+), 8 deletions(-)
[...]
> + /* At this point, the data has been handled well. If one of the
> + * following conditions is met, we can notify the peer socket in
> + * the context of this system call immediately.
> + * 1. If the write buffer has been used up;
> + * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
> + * 3. Or, the ingress queue was empty;
> + * 4. Or, the tcp socket is set to no_delay.
> + * Otherwise, kick off the backlog work so that we can have some
> + * time to wait for any incoming messages before sending a
> + * notification to the peer socket.
> + */
OK, this series looks like it should work to me. See one small comment
below. Also, judging from the perf numbers in the cover letter, is the
latency difference reduced/removed if the socket is set to no_delay?
> + nonagle = tcp_sk(sk)->nonagle;
> + if (!sk_stream_memory_free(sk) ||
> + tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
> + (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
> + release_sock(sk);
> + psock->backlog_work_delayed = false;
> + sk_psock_backlog_msg(psock);
> + lock_sock(sk);
> + } else {
> + sk_psock_run_backlog_work(psock, false);
> + }
> +
> +error:
> + sk_psock_put(sk_redir, psock);
> + return ret;
> +}
> +
> static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
> struct sk_msg *msg, int *copied, int flags)
> {
> @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
> cork = true;
> psock->cork = NULL;
> }
> - release_sock(sk);
>
> - origsize = msg->sg.size;
> - ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
> - msg, tosend, flags);
> - sent = origsize - msg->sg.size;
> + if (redir_ingress) {
> + ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
> + } else {
> + release_sock(sk);
> +
> + origsize = msg->sg.size;
> + ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
> + msg, tosend, flags);
nit: can we drop redir_ingress from tcp_bpf_sendmsg_redir() at this
point? It no longer handles ingress. A follow-up patch would probably
be fine.
> + sent = origsize - msg->sg.size;
> +
> + lock_sock(sk);
> + sk_mem_uncharge(sk, sent);
> + }
>
> if (eval == __SK_REDIRECT)
> sock_put(sk_redir);
Thanks.
* Re: [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking
2025-05-30 20:07 ` John Fastabend
@ 2025-05-30 20:37 ` Zijian Zhang
0 siblings, 0 replies; 13+ messages in thread
From: Zijian Zhang @ 2025-05-30 20:37 UTC (permalink / raw)
To: John Fastabend, Cong Wang
Cc: netdev, bpf, zhoufeng.zf, jakub, Amery Hung, Cong Wang
On 5/30/25 1:07 PM, John Fastabend wrote:
> On 2025-05-19 13:36:28, Cong Wang wrote:
>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>
>> The TCP_BPF ingress redirection path currently lacks the message corking
>> mechanism found in standard TCP. This causes the sender to wake up the
>> receiver for every message, even when messages are small, resulting in
>> reduced throughput compared to regular TCP in certain scenarios.
>>
>> This change introduces a kernel worker-based intermediate layer to provide
>> automatic message corking for TCP_BPF. While this adds a slight latency
>> overhead, it significantly improves overall throughput by reducing
>> unnecessary wake-ups and reducing the sock lock contention.
>>
>> Reviewed-by: Amery Hung <amery.hung@bytedance.com>
>> Co-developed-by: Cong Wang <cong.wang@bytedance.com>
>> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
>> ---
>> include/linux/skmsg.h | 19 ++++
>> net/core/skmsg.c | 139 ++++++++++++++++++++++++++++-
>> net/ipv4/tcp_bpf.c | 197 ++++++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 347 insertions(+), 8 deletions(-)
>
> [...]
>
>> + /* At this point, the data has been handled well. If one of the
>> + * following conditions is met, we can notify the peer socket in
>> + * the context of this system call immediately.
>> + * 1. If the write buffer has been used up;
>> + * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
>> + * 3. Or, the ingress queue was empty;
>> + * 4. Or, the tcp socket is set to no_delay.
>> + * Otherwise, kick off the backlog work so that we can have some
>> + * time to wait for any incoming messages before sending a
>> + * notification to the peer socket.
>> + */
>
>
> OK, this series looks like it should work to me. See one small comment
> below. Also, judging from the perf numbers in the cover letter, is the
> latency difference reduced/removed if the socket is set to no_delay?
>
Even if the socket is set to no_delay, we still see a minor latency
difference. The main reason is that we now have dynamic sk_msg
allocation and a kworker in the middle, so the path is more complex.
>> + nonagle = tcp_sk(sk)->nonagle;
>> + if (!sk_stream_memory_free(sk) ||
>> + tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
>> + (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
>> + release_sock(sk);
>> + psock->backlog_work_delayed = false;
>> + sk_psock_backlog_msg(psock);
>> + lock_sock(sk);
>> + } else {
>> + sk_psock_run_backlog_work(psock, false);
>> + }
>> +
>> +error:
>> + sk_psock_put(sk_redir, psock);
>> + return ret;
>> +}
>> +
>> static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>> struct sk_msg *msg, int *copied, int flags)
>> {
>> @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>> cork = true;
>> psock->cork = NULL;
>> }
>> - release_sock(sk);
>>
>> - origsize = msg->sg.size;
>> - ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> - msg, tosend, flags);
>> - sent = origsize - msg->sg.size;
>> + if (redir_ingress) {
>> + ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
>> + } else {
>> + release_sock(sk);
>> +
>> + origsize = msg->sg.size;
>> + ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> + msg, tosend, flags);
>
> nit: can we drop redir_ingress from tcp_bpf_sendmsg_redir() at this
> point? It no longer handles ingress. A follow-up patch would probably
> be fine.
>
Indeed, we will do this in a follow-up patch.
>> + sent = origsize - msg->sg.size;
>> +
>> + lock_sock(sk);
>> + sk_mem_uncharge(sk, sent);
>> + }
>>
>> if (eval == __SK_REDIRECT)
>> sock_put(sk_redir);
>
> Thanks.
Thanks for the review!