[PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues


* [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues
@ 2025-04-07 14:21 Jiayuan Chen
  2025-04-07 14:21 ` [PATCH bpf-next v1 1/4] bpf, sockmap: Fix data lost during EAGAIN retries Jiayuan Chen
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Jiayuan Chen @ 2025-04-07 14:21 UTC (permalink / raw)
  To: bpf
  Cc: mrpre, Jiayuan Chen, John Fastabend, Jakub Sitnicki,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, linux-kernel, netdev, linux-kselftest

I was writing a benchmark based on sockmap + TCP and discovered several
issues:

1. When EAGAIN occurs, the direction of the skb is lost, causing data
   loss on retry.
2. When only part of the data is sent, the offset is not honored on
   retry, so duplicate data is sent.
3. An unexpected BUG_ON() assertion in skb_linearize is triggered.
4. The memory of psock->ingress_skb is not limited by the socket buffer
   or by memcg.

Issues 1, 2, and 3 are described in each patch's commit message.

Regarding issue 4, this patchset does not cover it as it is difficult to
handle in practice, and I am still working on it.

Here is a brief description of the issue:
When sockmap is used for skb/stream redirect and the receiving end does
not perform read operations, all data is buffered in ingress_skb.

For example:
'''
// set memory limit to 5000M
cgcreate -g memory:myGroup
cgset -r memory.max="5000M" myGroup

// start benchmark and disable consumer from reading
cgexec -g "memory:myGroup" ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --delay-consumer=-1 -d 100
Iter   0 ( 29.179us): Send Speed 2668.548 MB/s (20360.406 calls/s), ... Rcv Speed    0.000 MB/s (   0.000 calls/s)
Iter   1 ( -7.237us): Send Speed 2694.467 MB/s (20557.149 calls/s), ... Rcv Speed    0.000 MB/s (   0.000 calls/s)
Iter   2 ( -1.918us): Send Speed 2693.404 MB/s (20548.039 calls/s), ... Rcv Speed    0.000 MB/s (   0.000 calls/s)
Iter   3 ( -0.684us): Send Speed 2693.138 MB/s (20548.014 calls/s), ... Rcv Speed    0.000 MB/s (   0.000 calls/s)
Iter   4 (  7.879us): Send Speed 2698.620 MB/s (20588.838 calls/s), ... Rcv Speed    0.000 MB/s (   0.000 calls/s)
Iter   5 ( -3.224us): Send Speed 2696.553 MB/s (20573.066 calls/s), ... Rcv Speed    0.000 MB/s (   0.000 calls/s)
Iter   6 ( -5.409us): Send Speed 2699.705 MB/s (20597.111 calls/s), ... Rcv Speed    0.000 MB/s (   0.000 calls/s)
Iter   7 ( -0.439us): Send Speed 2699.691 MB/s (20597.009 calls/s), ... Rcv Speed    0.000 MB/s (   0.000 calls/s)
...

// memory usage is not limited
cat /proc/slabinfo | grep skb
skbuff_small_head   11824024 11824024    704   46    8 : tunables    0    0    0 : slabdata 257044 257044      0
skbuff_fclone_cache 11822080 11822080    512   32    4 : tunables    0    0    0 : slabdata 369440 369440      0
'''
Thus, a simple socket in a large file upload/download model can eat the
entire OS memory.

We must charge the skb memory to psock->sk, and if we do not want to lose
skbs, we need to feed the error back to read_sock/read_skb when the
enqueue operation on psock->ingress_skb fails.
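
As a rough illustration of that idea, here is a userspace toy model, not
kernel code. All names (toy_sock, toy_ingress_enqueue, toy_ingress_consume)
are hypothetical; the only point is that enqueue charges a budget and
reports -EAGAIN to the caller instead of queueing without bound.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_sock {
	size_t rmem_alloc;	/* bytes currently charged to the socket */
	size_t rcvbuf;		/* receive budget, playing the role of sk_rcvbuf */
};

/* Charge len bytes to the socket, or return -EAGAIN so that the caller
 * (the read path in the description above) can stop reading instead of
 * queueing more data.
 */
static int toy_ingress_enqueue(struct toy_sock *sk, size_t len)
{
	assert(sk);
	if (sk->rmem_alloc + len > sk->rcvbuf)
		return -EAGAIN;
	sk->rmem_alloc += len;
	return 0;
}

/* Uncharge once the receiver has consumed the data. */
static void toy_ingress_consume(struct toy_sock *sk, size_t len)
{
	assert(sk && len <= sk->rmem_alloc);
	sk->rmem_alloc -= len;
}
```

In this model a stalled receiver makes enqueue fail once the budget is
used up, and enqueue succeeds again after toy_ingress_consume() runs.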

---
Another stability-related patch of mine is also waiting for review, when
maintainers can spare some time from their busy schedules:
https://lore.kernel.org/bpf/20250317092257.68760-1-jiayuan.chen@linux.dev/T/#t


Jiayuan Chen (4):
  bpf, sockmap: Fix data lost during EAGAIN retries
  bpf, sockmap: fix duplicated data transmission
  bpf, sockmap: Fix panic when calling skb_linearize
  selftest/bpf/benchs: Add benchmark for sockmap usage

 net/core/skmsg.c                              |  48 +-
 tools/testing/selftests/bpf/Makefile          |   2 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_sockmap.c      | 599 ++++++++++++++++++
 .../selftests/bpf/progs/bench_sockmap_prog.c  |  65 ++
 5 files changed, 697 insertions(+), 21 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_sockmap.c
 create mode 100644 tools/testing/selftests/bpf/progs/bench_sockmap_prog.c

-- 
2.47.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH bpf-next v1 1/4] bpf, sockmap: Fix data lost during EAGAIN retries
  2025-04-07 14:21 [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues Jiayuan Chen
@ 2025-04-07 14:21 ` Jiayuan Chen
  2025-04-07 14:21 ` [PATCH bpf-next v1 2/4] bpf, sockmap: fix duplicated data transmission Jiayuan Chen
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Jiayuan Chen @ 2025-04-07 14:21 UTC (permalink / raw)
  To: bpf
  Cc: mrpre, Jiayuan Chen, John Fastabend, Jakub Sitnicki,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, linux-kernel, netdev, linux-kselftest

We call skb_bpf_redirect_clear() to clear _sk_redir before handling the skb
in the backlog, but when sk_psock_handle_skb() returns -EAGAIN due to the
sk_rcvbuf limit, the redirect info in _sk_redir is not restored.

Fix skb redir loss during EAGAIN retries by restoring _sk_redir
information using skb_bpf_set_redir().

Before this patch:
'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress
Setting up benchmark 'sockmap'...
create socket fd c1:13 p1:14 c2:15 p2:16
Benchmark 'sockmap' started.
Send Speed 1343.172 MB/s, BPF Speed 1343.238 MB/s, Rcv Speed   65.271 MB/s
Send Speed 1352.022 MB/s, BPF Speed 1352.088 MB/s, Rcv Speed   0 MB/s
Send Speed 1354.105 MB/s, BPF Speed 1354.105 MB/s, Rcv Speed   0 MB/s
Send Speed 1355.018 MB/s, BPF Speed 1354.887 MB/s, Rcv Speed   0 MB/s
'''
Due to the high send rate, the RX processing path may frequently hit the
sk_rcvbuf limit. Once triggered, incorrect _sk_redir will cause the flow
to mistakenly enter the "!ingress" path, leading to send failures.
(The Rcv speed depends on tcp_rmem).
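
The effect can be sketched with a userspace toy model. It assumes, as a
simplification, that the redirect info is a target pointer with a
direction flag in its low bit. The toy_* names and TOY_F_INGRESS are
hypothetical and only mimic the clear-then-restore sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_F_INGRESS 1UL	/* hypothetical flag bit kept next to the target */

struct toy_skb {
	uintptr_t redir;	/* target pointer | flags, 0 means "no redirect" */
};

static bool toy_ingress(const struct toy_skb *skb)
{
	return skb->redir & TOY_F_INGRESS;
}

static void toy_set_redir(struct toy_skb *skb, void *target, bool ingress)
{
	/* The low bit must be free so it can carry the flag. */
	assert(((uintptr_t)target & TOY_F_INGRESS) == 0);
	skb->redir = (uintptr_t)target | (ingress ? TOY_F_INGRESS : 0);
}

/* One backlog attempt that hits the receive buffer limit. Returns the
 * direction that a later retry would observe.
 */
static bool toy_retry_direction(struct toy_skb *skb, void *target, bool restore)
{
	bool ingress = toy_ingress(skb);	/* read the direction first */

	skb->redir = 0;				/* clear before handling */
	/* ... handling fails with -EAGAIN here ... */
	if (restore)
		toy_set_redir(skb, target, ingress);
	return toy_ingress(skb);
}
```

Without the restore step the retry sees a non-ingress skb, which matches
the "!ingress" path described above.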

After this patch:
'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress
Setting up benchmark 'sockmap'...
create socket fd c1:13 p1:14 c2:15 p2:16
Benchmark 'sockmap' started.
Send Speed 1347.236 MB/s, BPF Speed 1347.367 MB/s, Rcv Speed   65.402 MB/s
Send Speed 1353.320 MB/s, BPF Speed 1353.320 MB/s, Rcv Speed   65.536 MB/s
Send Speed 1353.186 MB/s, BPF Speed 1353.121 MB/s, Rcv Speed   65.536 MB/s
'''

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 net/core/skmsg.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 0ddc4c718833..29cb5ffd56c0 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -680,7 +680,8 @@ static void sk_psock_backlog(struct work_struct *work)
 			if (ret <= 0) {
 				if (ret == -EAGAIN) {
 					sk_psock_skb_state(psock, state, len, off);
-
+					/* Restore redir info we cleared before */
+					skb_bpf_set_redir(skb, psock->sk, ingress);
 					/* Delay slightly to prioritize any
 					 * other work that might be here.
 					 */
-- 
2.47.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH bpf-next v1 2/4] bpf, sockmap: fix duplicated data transmission
  2025-04-07 14:21 [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues Jiayuan Chen
  2025-04-07 14:21 ` [PATCH bpf-next v1 1/4] bpf, sockmap: Fix data lost during EAGAIN retries Jiayuan Chen
@ 2025-04-07 14:21 ` Jiayuan Chen
  2025-04-07 14:21 ` [PATCH bpf-next v1 3/4] bpf, sockmap: Fix panic when calling skb_linearize Jiayuan Chen
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Jiayuan Chen @ 2025-04-07 14:21 UTC (permalink / raw)
  To: bpf
  Cc: mrpre, Jiayuan Chen, John Fastabend, Jakub Sitnicki,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, linux-kernel, netdev, linux-kselftest

In the !ingress path under sk_psock_handle_skb(), when sending data to the
remote under snd_buf limitations, partial skb data might be transmitted.

Although we preserved the partial transmission state (offset/length), the
state wasn't properly consumed during retries. This caused the retry path
to resend the entire skb data instead of continuing from the previous
offset, resulting in data overlap at the receiver side.
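
A userspace toy model of the intended resume behavior follows. The names
(toy_state, toy_backlog_once) are hypothetical and a byte budget stands
in for the send buffer. The saved state is applied after the per-buffer
defaults, so a retry continues from the saved offset.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_state {
	size_t off;	/* next byte to send */
	size_t len;	/* bytes still to send; 0 means no saved state */
};

/* Try to send one buffer of skb_len bytes while the peer accepts at most
 * *budget bytes. Returns 0 when the buffer is fully sent, or -EAGAIN
 * after saving progress. *sent_total counts bytes put on the wire.
 */
static int toy_backlog_once(struct toy_state *st, size_t skb_len,
			    size_t *budget, size_t *sent_total)
{
	size_t off = 0, len = skb_len;

	assert(st && budget && sent_total);
	/* Resume from the previous partial state. This runs after the
	 * defaults above are set, so the saved values are not overwritten.
	 */
	if (st->len) {
		off = st->off;
		len = st->len;
	}
	while (len) {
		size_t n = len < *budget ? len : *budget;

		if (n == 0) {
			st->off = off;
			st->len = len;
			return -EAGAIN;
		}
		*budget -= n;
		*sent_total += n;
		off += n;
		len -= n;
	}
	/* The whole buffer was sent, clear the saved state. */
	st->off = 0;
	st->len = 0;
	return 0;
}
```

With a 100-byte buffer and a 30-byte budget, the first call saves
off=30/len=70, and a retry with enough budget sends only the remaining
70 bytes.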

Fixes: 405df89dd52c ("bpf, sockmap: Improved check for empty queue")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 net/core/skmsg.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 29cb5ffd56c0..9533b3e40ad7 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -656,11 +656,6 @@ static void sk_psock_backlog(struct work_struct *work)
 	int ret;
 
 	mutex_lock(&psock->work_mutex);
-	if (unlikely(state->len)) {
-		len = state->len;
-		off = state->off;
-	}
-
 	while ((skb = skb_peek(&psock->ingress_skb))) {
 		len = skb->len;
 		off = 0;
@@ -670,6 +665,13 @@ static void sk_psock_backlog(struct work_struct *work)
 			off = stm->offset;
 			len = stm->full_len;
 		}
+
+		/* Resume processing from previous partial state */
+		if (unlikely(state->len)) {
+			len = state->len;
+			off = state->off;
+		}
+
 		ingress = skb_bpf_ingress(skb);
 		skb_bpf_redirect_clear(skb);
 		do {
@@ -698,6 +700,8 @@ static void sk_psock_backlog(struct work_struct *work)
 			len -= ret;
 		} while (len);
 
+		/* The entire skb sent, clear state */
+		sk_psock_skb_state(psock, state, 0, 0);
 		skb = skb_dequeue(&psock->ingress_skb);
 		kfree_skb(skb);
 	}
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH bpf-next v1 3/4] bpf, sockmap: Fix panic when calling skb_linearize
  2025-04-07 14:21 [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues Jiayuan Chen
  2025-04-07 14:21 ` [PATCH bpf-next v1 1/4] bpf, sockmap: Fix data lost during EAGAIN retries Jiayuan Chen
  2025-04-07 14:21 ` [PATCH bpf-next v1 2/4] bpf, sockmap: fix duplicated data transmission Jiayuan Chen
@ 2025-04-07 14:21 ` Jiayuan Chen
  2025-04-07 14:21 ` [PATCH bpf-next v1 4/4] selftest/bpf/benchs: Add benchmark for sockmap usage Jiayuan Chen
  2025-04-10  3:10 ` [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues patchwork-bot+netdevbpf
  4 siblings, 0 replies; 7+ messages in thread
From: Jiayuan Chen @ 2025-04-07 14:21 UTC (permalink / raw)
  To: bpf
  Cc: mrpre, Jiayuan Chen, John Fastabend, Jakub Sitnicki,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, linux-kernel, netdev, linux-kselftest

The panic can be reproduced by executing the command:
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --rx-strp 100000

Then a kernel panic was captured:
'''
[  657.460555] kernel BUG at net/core/skbuff.c:2178!
[  657.462680] Tainted: [W]=WARN
[  657.463287] Workqueue: events sk_psock_backlog
...
[  657.469610]  <TASK>
[  657.469738]  ? die+0x36/0x90
[  657.469916]  ? do_trap+0x1d0/0x270
[  657.470118]  ? pskb_expand_head+0x612/0xf40
[  657.470376]  ? pskb_expand_head+0x612/0xf40
[  657.470620]  ? do_error_trap+0xa3/0x170
[  657.470846]  ? pskb_expand_head+0x612/0xf40
[  657.471092]  ? handle_invalid_op+0x2c/0x40
[  657.471335]  ? pskb_expand_head+0x612/0xf40
[  657.471579]  ? exc_invalid_op+0x2d/0x40
[  657.471805]  ? asm_exc_invalid_op+0x1a/0x20
[  657.472052]  ? pskb_expand_head+0xd1/0xf40
[  657.472292]  ? pskb_expand_head+0x612/0xf40
[  657.472540]  ? lock_acquire+0x18f/0x4e0
[  657.472766]  ? find_held_lock+0x2d/0x110
[  657.472999]  ? __pfx_pskb_expand_head+0x10/0x10
[  657.473263]  ? __kmalloc_cache_noprof+0x5b/0x470
[  657.473537]  ? __pfx___lock_release.isra.0+0x10/0x10
[  657.473826]  __pskb_pull_tail+0xfd/0x1d20
[  657.474062]  ? __kasan_slab_alloc+0x4e/0x90
[  657.474707]  sk_psock_skb_ingress_enqueue+0x3bf/0x510
[  657.475392]  ? __kasan_kmalloc+0xaa/0xb0
[  657.476010]  sk_psock_backlog+0x5cf/0xd70
[  657.476637]  process_one_work+0x858/0x1a20
'''

The panic originates from the assertion BUG_ON(skb_shared(skb)) in
skb_linearize(). A previous commit (see the Fixes tag) introduced skb_get()
to avoid race conditions between skb operations in the backlog and skb
release in the recvmsg path. However, this causes the panic to always
occur when skb_linearize() is executed.

The "--rx-strp 100000" parameter forces the RX path to use the strparser
module which aggregates data until it reaches 100KB before calling sockmap
logic. The 100KB payload exceeds MAX_MSG_FRAGS, triggering skb_linearize.

To fix this issue, just move skb_get into sk_psock_skb_ingress_enqueue.

'''
sk_psock_backlog:
    sk_psock_handle_skb
       skb_get(skb) <== we move it into 'sk_psock_skb_ingress_enqueue'
       sk_psock_skb_ingress____________
                                       ↓
                                       |
                                       | → sk_psock_skb_ingress_self
                                       |      sk_psock_skb_ingress_enqueue
sk_psock_verdict_apply_________________↑          skb_linearize
'''

Note that for the verdict_apply path the skb_get operation is unnecessary,
so we add a 'take_ref' param to control its behavior.
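
The ordering problem can be shown with a userspace toy model. The names
(toy_buf, toy_linearize, toy_enqueue) are hypothetical. Where the kernel
asserts, the model returns -EBUSY so that the failing case can be
observed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_buf {
	int users;	/* reference count */
	bool linear;	/* set once the buffer has been linearized */
};

/* Stand-in for a linearize step that refuses shared buffers. */
static int toy_linearize(struct toy_buf *buf)
{
	if (buf->users > 1)
		return -EBUSY;
	buf->linear = true;
	return 0;
}

/* Linearize first, then take the extra reference only if requested. */
static int toy_enqueue(struct toy_buf *buf, bool take_ref)
{
	int err;

	assert(buf && buf->users >= 1);
	err = toy_linearize(buf);
	if (err)
		return err;
	if (take_ref)
		buf->users++;
	return 0;
}
```

Taking the reference before calling toy_enqueue() reproduces the failure.
Taking it inside, after linearization, does not.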

Fixes: a454d84ee20b ("bpf, sockmap: Fix skb refcnt race after locking changes")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 net/core/skmsg.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 9533b3e40ad7..276934673066 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -530,16 +530,22 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
 					u32 off, u32 len,
 					struct sk_psock *psock,
 					struct sock *sk,
-					struct sk_msg *msg)
+					struct sk_msg *msg,
+					bool take_ref)
 {
 	int num_sge, copied;
 
+	/* skb_to_sgvec will fail when the total number of fragments in
+	 * frag_list and frags exceeds MAX_MSG_FRAGS. For example, the
+	 * caller may aggregate multiple skbs.
+	 */
 	num_sge = skb_to_sgvec(skb, msg->sg.data, off, len);
 	if (num_sge < 0) {
 		/* skb linearize may fail with ENOMEM, but lets simply try again
 		 * later if this happens. Under memory pressure we don't want to
 		 * drop the skb. We need to linearize the skb so that the mapping
 		 * in skb_to_sgvec can not error.
+		 * Note that skb_linearize requires the skb not to be shared.
 		 */
 		if (skb_linearize(skb))
 			return -EAGAIN;
@@ -556,7 +562,7 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
 	msg->sg.start = 0;
 	msg->sg.size = copied;
 	msg->sg.end = num_sge;
-	msg->skb = skb;
+	msg->skb = take_ref ? skb_get(skb) : skb;
 
 	sk_psock_queue_msg(psock, msg);
 	sk_psock_data_ready(sk, psock);
@@ -564,7 +570,7 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
 }
 
 static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb,
-				     u32 off, u32 len);
+				     u32 off, u32 len, bool take_ref);
 
 static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 				u32 off, u32 len)
@@ -578,7 +584,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 	 * correctly.
 	 */
 	if (unlikely(skb->sk == sk))
-		return sk_psock_skb_ingress_self(psock, skb, off, len);
+		return sk_psock_skb_ingress_self(psock, skb, off, len, true);
 	msg = sk_psock_create_ingress_msg(sk, skb);
 	if (!msg)
 		return -EAGAIN;
@@ -590,7 +596,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 	 * into user buffers.
 	 */
 	skb_set_owner_r(skb, sk);
-	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg);
+	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
 	if (err < 0)
 		kfree(msg);
 	return err;
@@ -601,7 +607,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
  * because the skb is already accounted for here.
  */
 static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb,
-				     u32 off, u32 len)
+				     u32 off, u32 len, bool take_ref)
 {
 	struct sk_msg *msg = alloc_sk_msg(GFP_ATOMIC);
 	struct sock *sk = psock->sk;
@@ -610,7 +616,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 	if (unlikely(!msg))
 		return -EAGAIN;
 	skb_set_owner_r(skb, sk);
-	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg);
+	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
 	if (err < 0)
 		kfree(msg);
 	return err;
@@ -619,18 +625,13 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
 			       u32 off, u32 len, bool ingress)
 {
-	int err = 0;
-
 	if (!ingress) {
 		if (!sock_writeable(psock->sk))
 			return -EAGAIN;
 		return skb_send_sock(psock->sk, skb, off, len);
 	}
-	skb_get(skb);
-	err = sk_psock_skb_ingress(psock, skb, off, len);
-	if (err < 0)
-		kfree_skb(skb);
-	return err;
+
+	return sk_psock_skb_ingress(psock, skb, off, len);
 }
 
 static void sk_psock_skb_state(struct sk_psock *psock,
@@ -1019,7 +1020,7 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 				off = stm->offset;
 				len = stm->full_len;
 			}
-			err = sk_psock_skb_ingress_self(psock, skb, off, len);
+			err = sk_psock_skb_ingress_self(psock, skb, off, len, false);
 		}
 		if (err < 0) {
 			spin_lock_bh(&psock->ingress_lock);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH bpf-next v1 4/4] selftest/bpf/benchs: Add benchmark for sockmap usage
  2025-04-07 14:21 [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues Jiayuan Chen
                   ` (2 preceding siblings ...)
  2025-04-07 14:21 ` [PATCH bpf-next v1 3/4] bpf, sockmap: Fix panic when calling skb_linearize Jiayuan Chen
@ 2025-04-07 14:21 ` Jiayuan Chen
  2025-04-10  3:10 ` [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues patchwork-bot+netdevbpf
  4 siblings, 0 replies; 7+ messages in thread
From: Jiayuan Chen @ 2025-04-07 14:21 UTC (permalink / raw)
  To: bpf
  Cc: mrpre, Jiayuan Chen, John Fastabend, Jakub Sitnicki,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, linux-kernel, netdev, linux-kselftest

Add TCP+sockmap-based benchmark.
Since sockmap's own update and delete operations are generally less
critical, the performance of the fast forwarding framework built upon
it is the key aspect.

Also with cgset/cgexec, we can observe the behavior of sockmap under
memory pressure.

The benchmark can be run with:
'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress
'''

In the future, we plan to move socket_helpers.h out of the prog_tests
directory to make it accessible for the benchmark. This will enable
better support for various socket types.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 tools/testing/selftests/bpf/Makefile          |   2 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_sockmap.c      | 599 ++++++++++++++++++
 .../selftests/bpf/progs/bench_sockmap_prog.c  |  65 ++
 4 files changed, 670 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_sockmap.c
 create mode 100644 tools/testing/selftests/bpf/progs/bench_sockmap_prog.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e6a02d5b87d1..c0ef5b174c6c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -815,6 +815,7 @@ $(OUTPUT)/bench_local_storage_create.o: $(OUTPUT)/bench_local_storage_create.ske
 $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
 $(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
 $(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
+$(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -835,6 +836,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_local_storage_create.o \
 		 $(OUTPUT)/bench_htab_mem.o \
 		 $(OUTPUT)/bench_bpf_crypto.o \
+		 $(OUTPUT)/bench_sockmap.o \
 		 #
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 1bd403a5ef7b..c80df9b75094 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -283,6 +283,7 @@ extern struct argp bench_local_storage_create_argp;
 extern struct argp bench_htab_mem_argp;
 extern struct argp bench_trigger_batch_argp;
 extern struct argp bench_crypto_argp;
+extern struct argp bench_sockmap_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -297,6 +298,7 @@ static const struct argp_child bench_parsers[] = {
 	{ &bench_htab_mem_argp, 0, "hash map memory benchmark", 0 },
 	{ &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 },
 	{ &bench_crypto_argp, 0, "bpf crypto benchmark", 0 },
+	{ &bench_sockmap_argp, 0, "bpf sockmap benchmark", 0 },
 	{},
 };
 
@@ -549,6 +551,7 @@ extern const struct bench bench_local_storage_create;
 extern const struct bench bench_htab_mem;
 extern const struct bench bench_crypto_encrypt;
 extern const struct bench bench_crypto_decrypt;
+extern const struct bench bench_sockmap;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -609,6 +612,7 @@ static const struct bench *benchs[] = {
 	&bench_htab_mem,
 	&bench_crypto_encrypt,
 	&bench_crypto_decrypt,
+	&bench_sockmap,
 };
 
 static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/benchs/bench_sockmap.c b/tools/testing/selftests/bpf/benchs/bench_sockmap.c
new file mode 100644
index 000000000000..54f4e7c03cd2
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_sockmap.c
@@ -0,0 +1,599 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <error.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <netinet/in.h>
+#include <sys/sendfile.h>
+#include <arpa/inet.h>
+#include <fcntl.h>
+#include <argp.h>
+#include "bench.h"
+#include "bench_sockmap_prog.skel.h"
+
+#define FILE_SIZE (128 * 1024)
+#define DATA_REPEAT_SIZE 10
+
+static const char snd_data[DATA_REPEAT_SIZE] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
+
+/* c1 <-> [p1, p2] <-> c2
+ * RX bench(BPF_SK_SKB_STREAM_VERDICT):
+ *	ARG_FW_RX_PASS:
+ *		send(p2) -> recv(c2) -> bpf skb passthrough -> recv(c2)
+ *	ARG_FW_RX_VERDICT_EGRESS:
+ *		send(c1) -> verdict skb to tx queue of p2 -> recv(c2)
+ *	ARG_FW_RX_VERDICT_INGRESS:
+ *		send(c1) -> verdict skb to rx queue of c2 -> recv(c2)
+ *
+ * TX bench(BPF_SK_MSG_VERDICT):
+ *	ARG_FW_TX_PASS:
+ *		send(p2) -> bpf msg passthrough -> send(p2) -> recv(c2)
+ *	ARG_FW_TX_VERDICT_INGRESS:
+ *		send(p2) -> verdict msg to rx queue of c2 -> recv(c2)
+ *	ARG_FW_TX_VERDICT_EGRESS:
+ *		send(p1) -> verdict msg to tx queue of p2 -> recv(c2)
+ */
+enum SOCKMAP_ARG_FLAG {
+	ARG_FW_RX_NORMAL = 11000,
+	ARG_FW_RX_PASS,
+	ARG_FW_RX_VERDICT_EGRESS,
+	ARG_FW_RX_VERDICT_INGRESS,
+	ARG_FW_TX_NORMAL,
+	ARG_FW_TX_PASS,
+	ARG_FW_TX_VERDICT_INGRESS,
+	ARG_FW_TX_VERDICT_EGRESS,
+	ARG_CTL_RX_STRP,
+	ARG_CONSUMER_DELAY_TIME,
+	ARG_PRODUCER_DURATION,
+};
+
+#define TXMODE_NORMAL()				\
+	((ctx.mode) == ARG_FW_TX_NORMAL)
+
+#define TXMODE_BPF_INGRESS()			\
+	((ctx.mode) == ARG_FW_TX_VERDICT_INGRESS)
+
+#define TXMODE_BPF_EGRESS()			\
+	((ctx.mode) == ARG_FW_TX_VERDICT_EGRESS)
+
+#define TXMODE_BPF_PASS()			\
+	((ctx.mode) == ARG_FW_TX_PASS)
+
+#define TXMODE_BPF() (				\
+	TXMODE_BPF_PASS() ||			\
+	TXMODE_BPF_INGRESS() ||			\
+	TXMODE_BPF_EGRESS())
+
+#define TXMODE() (				\
+	TXMODE_NORMAL() ||			\
+	TXMODE_BPF())
+
+#define RXMODE_NORMAL()				\
+	((ctx.mode) == ARG_FW_RX_NORMAL)
+
+#define RXMODE_BPF_PASS()			\
+	((ctx.mode) == ARG_FW_RX_PASS)
+
+#define RXMODE_BPF_VERDICT_EGRESS()		\
+	((ctx.mode) == ARG_FW_RX_VERDICT_EGRESS)
+
+#define RXMODE_BPF_VERDICT_INGRESS()		\
+	((ctx.mode) == ARG_FW_RX_VERDICT_INGRESS)
+
+#define RXMODE_BPF_VERDICT() (			\
+	RXMODE_BPF_VERDICT_INGRESS() ||		\
+	RXMODE_BPF_VERDICT_EGRESS())
+
+#define RXMODE_BPF() (				\
+	RXMODE_BPF_PASS() ||			\
+	RXMODE_BPF_VERDICT())
+
+#define RXMODE() (				\
+	RXMODE_NORMAL() ||			\
+	RXMODE_BPF())
+
+static struct socmap_ctx {
+	struct bench_sockmap_prog *skel;
+	enum SOCKMAP_ARG_FLAG mode;
+	#define c1	fds[0]
+	#define p1	fds[1]
+	#define c2	fds[2]
+	#define p2	fds[3]
+	#define sfd	fds[4]
+	int		fds[5];
+	long		send_calls;
+	long		read_calls;
+	long		prod_send;
+	long		user_read;
+	int		file_size;
+	int		delay_consumer;
+	int		prod_run_time;
+	int		strp_size;
+} ctx = {
+	.prod_send	= 0,
+	.user_read	= 0,
+	.file_size	= FILE_SIZE,
+	.mode		= ARG_FW_RX_VERDICT_EGRESS,
+	.fds		= {0},
+	.delay_consumer = 0,
+	.prod_run_time	= 0,
+	.strp_size	= 0,
+};
+
+static void bench_sockmap_prog_destroy(void)
+{
+	int i;
+
+	for (i = 0; i < sizeof(ctx.fds) / sizeof(ctx.fds[0]); i++) {
+		if (ctx.fds[i] > 0)
+			close(ctx.fds[i]);
+	}
+
+	bench_sockmap_prog__destroy(ctx.skel);
+}
+
+static void init_addr(struct sockaddr_storage *ss,
+		      socklen_t *len)
+{
+	struct sockaddr_in *addr4 = memset(ss, 0, sizeof(*ss));
+
+	addr4->sin_family = AF_INET;
+	addr4->sin_port = 0;
+	addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+	*len = sizeof(*addr4);
+}
+
+static bool set_non_block(int fd, bool nonblock)
+{
+	int flags = fcntl(fd, F_GETFL, 0);
+
+	if (flags == -1)
+		return false;
+	flags = nonblock ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
+	return (fcntl(fd, F_SETFL, flags) == 0);
+}
+
+static int create_pair(int *c, int *p, int type)
+{
+	struct sockaddr_storage addr;
+	int err, cfd, pfd;
+	socklen_t addr_len = sizeof(struct sockaddr_storage);
+
+	err = getsockname(ctx.sfd, (struct sockaddr *)&addr, &addr_len);
+	if (err) {
+		fprintf(stderr, "getsockname error %d\n", errno);
+		return err;
+	}
+	cfd = socket(AF_INET, type, 0);
+	if (cfd < 0) {
+		fprintf(stderr, "socket error %d\n", errno);
+		return cfd;
+	}
+
+	err = connect(cfd, (struct sockaddr *)&addr, addr_len);
+	if (err && errno != EINPROGRESS) {
+		fprintf(stderr, "connect error %d\n", errno);
+		return err;
+	}
+
+	pfd = accept(ctx.sfd, NULL, NULL);
+	if (pfd < 0) {
+		fprintf(stderr, "accept error %d\n", errno);
+		return pfd;
+	}
+	*c = cfd;
+	*p = pfd;
+	return 0;
+}
+
+static int create_sockets(void)
+{
+	struct sockaddr_storage addr;
+	int err, one = 1;
+	socklen_t addr_len;
+
+	init_addr(&addr, &addr_len);
+	ctx.sfd = socket(AF_INET, SOCK_STREAM, 0);
+	if (ctx.sfd < 0) {
+		fprintf(stderr, "socket error:%d\n", errno);
+		return ctx.sfd;
+	}
+	err = setsockopt(ctx.sfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
+	if (err) {
+		fprintf(stderr, "setsockopt error:%d\n", errno);
+		return err;
+	}
+
+	err = bind(ctx.sfd, (struct sockaddr *)&addr, addr_len);
+	if (err) {
+		fprintf(stderr, "bind error:%d\n", errno);
+		return err;
+	}
+
+	err = listen(ctx.sfd, SOMAXCONN);
+	if (err) {
+		fprintf(stderr, "listen error:%d\n", errno);
+		return err;
+	}
+
+	err = create_pair(&ctx.c1, &ctx.p1, SOCK_STREAM);
+	if (err) {
+		fprintf(stderr, "create_pair 1 error\n");
+		return err;
+	}
+
+	err = create_pair(&ctx.c2, &ctx.p2, SOCK_STREAM);
+	if (err) {
+		fprintf(stderr, "create_pair 2 error\n");
+		return err;
+	}
+	printf("create socket fd c1:%d p1:%d c2:%d p2:%d\n",
+	       ctx.c1, ctx.p1, ctx.c2, ctx.p2);
+	return 0;
+}
+
+static void validate(void)
+{
+	if (env.consumer_cnt != 2 || env.producer_cnt != 1 ||
+	    !env.affinity)
+		goto err;
+	return;
+err:
+	fprintf(stderr, "argument '-c 2 -p 1 -a' is necessary");
+	exit(1);
+}
+
+static int setup_rx_sockmap(void)
+{
+	int verdict, pass, parser, map;
+	int zero = 0, one = 1;
+	int err;
+
+	parser = bpf_program__fd(ctx.skel->progs.prog_skb_parser);
+	verdict = bpf_program__fd(ctx.skel->progs.prog_skb_verdict);
+	pass = bpf_program__fd(ctx.skel->progs.prog_skb_pass);
+	map = bpf_map__fd(ctx.skel->maps.sock_map_rx);
+
+	if (ctx.strp_size != 0) {
+		ctx.skel->bss->pkt_size = ctx.strp_size;
+		err = bpf_prog_attach(parser, map, BPF_SK_SKB_STREAM_PARSER, 0);
+		if (err)
+			return err;
+	}
+
+	if (RXMODE_BPF_VERDICT())
+		err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	else if (RXMODE_BPF_PASS())
+		err = bpf_prog_attach(pass, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (err)
+		return err;
+
+	if (RXMODE_BPF_PASS())
+		return bpf_map_update_elem(map, &zero, &ctx.c2, BPF_NOEXIST);
+
+	err = bpf_map_update_elem(map, &zero, &ctx.p1, BPF_NOEXIST);
+	if (err < 0)
+		return err;
+
+	if (RXMODE_BPF_VERDICT_INGRESS()) {
+		ctx.skel->bss->verdict_dir = BPF_F_INGRESS;
+		err = bpf_map_update_elem(map, &one, &ctx.c2, BPF_NOEXIST);
+	} else {
+		err = bpf_map_update_elem(map, &one, &ctx.p2, BPF_NOEXIST);
+	}
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int setup_tx_sockmap(void)
+{
+	int zero = 0, one = 1;
+	int prog, map;
+	int err;
+
+	map = bpf_map__fd(ctx.skel->maps.sock_map_tx);
+	prog = TXMODE_BPF_PASS() ?
+		bpf_program__fd(ctx.skel->progs.prog_skmsg_pass) :
+		bpf_program__fd(ctx.skel->progs.prog_skmsg_verdict);
+
+	err = bpf_prog_attach(prog, map, BPF_SK_MSG_VERDICT, 0);
+	if (err)
+		return err;
+
+	if (TXMODE_BPF_EGRESS()) {
+		err = bpf_map_update_elem(map, &zero, &ctx.p1, BPF_NOEXIST);
+		err |= bpf_map_update_elem(map, &one, &ctx.p2, BPF_NOEXIST);
+	} else {
+		ctx.skel->bss->verdict_dir = BPF_F_INGRESS;
+		err = bpf_map_update_elem(map, &zero, &ctx.p2, BPF_NOEXIST);
+		err |= bpf_map_update_elem(map, &one, &ctx.c2, BPF_NOEXIST);
+	}
+
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static void setup(void)
+{
+	int err;
+
+	ctx.skel = bench_sockmap_prog__open_and_load();
+	if (!ctx.skel) {
+		fprintf(stderr, "error loading skel\n");
+		exit(1);
+	}
+
+	if (create_sockets()) {
+		fprintf(stderr, "create_net_mode error\n");
+		goto err;
+	}
+
+	if (RXMODE_BPF()) {
+		err = setup_rx_sockmap();
+		if (err) {
+			fprintf(stderr, "setup_rx_sockmap error:%d\n", err);
+			goto err;
+		}
+	} else if (TXMODE_BPF()) {
+		err = setup_tx_sockmap();
+		if (err) {
+			fprintf(stderr, "setup_tx_sockmap error:%d\n", err);
+			goto err;
+		}
+	} else {
+		fprintf(stderr, "unknown sockmap bench mode: %d\n", ctx.mode);
+		goto err;
+	}
+
+	return;
+
+err:
+	bench_sockmap_prog_destroy();
+	exit(1);
+}
+
+static void measure(struct bench_res *res)
+{
+	res->drops = atomic_swap(&ctx.prod_send, 0);
+	res->hits = atomic_swap(&ctx.skel->bss->process_byte, 0);
+	res->false_hits = atomic_swap(&ctx.user_read, 0);
+	res->important_hits = atomic_swap(&ctx.send_calls, 0);
+	res->important_hits |= atomic_swap(&ctx.read_calls, 0) << 32;
+}
+
+static void verify_data(int *check_pos, char *buf, int rcv)
+{
+	for (int i = 0 ; i < rcv; i++) {
+		if (buf[i] != snd_data[(*check_pos) % DATA_REPEAT_SIZE]) {
+			fprintf(stderr, "verify data fail");
+			exit(1);
+		}
+		(*check_pos)++;
+		if (*check_pos >= FILE_SIZE)
+			*check_pos = 0;
+	}
+}
+
+static void *consumer(void *input)
+{
+	int rcv, sent;
+	int check_pos = 0;
+	int tid = (long)input;
+	int recv_buf_size = FILE_SIZE;
+	char *buf = malloc(recv_buf_size);
+	int delay_read = ctx.delay_consumer;
+
+	if (!buf) {
+		fprintf(stderr, "failed to allocate read buffer\n");
+		return NULL;
+	}
+
+	while (true) {
+		if (tid == 1) {
+			/* consumer 1 is unused for tx test and stream verdict test */
+			if (RXMODE_BPF() || TXMODE())
+				return NULL;
+			/* used only by RX_NORMAL, which acts as a reverse proxy */
+			rcv = read(ctx.p1, buf, recv_buf_size);
+			if (rcv < 0) {
+				fprintf(stderr, "fail to read p1\n");
+				return NULL;
+			}
+
+			sent = send(ctx.p2, buf, rcv, 0);
+			if (sent < 0) {
+				fprintf(stderr, "fail to send p2\n");
+				return NULL;
+			}
+		} else {
+			if (delay_read != 0) {
+				if (delay_read < 0)
+					return NULL;
+				sleep(delay_read);
+				delay_read = 0;
+			}
+			/* read real endpoint by consumer 0 */
+			atomic_inc(&ctx.read_calls);
+			rcv = read(ctx.c2, buf, recv_buf_size);
+			if (rcv < 0 && errno != EAGAIN) {
+				fprintf(stderr, "%s fail to read c2 %d\n", __func__, errno);
+				return NULL;
+			}
+			verify_data(&check_pos, buf, rcv);
+			atomic_add(&ctx.user_read, rcv);
+		}
+	}
+
+	return NULL;
+}
+
+static void *producer(void *input)
+{
+	int off = 0, fp, need_sent, sent;
+	int file_size = ctx.file_size;
+	struct timespec ts1, ts2;
+	int target;
+	FILE *file;
+
+	file = tmpfile();
+	if (!file) {
+		fprintf(stderr, "failed to create file for sendfile\n");
+		return NULL;
+	}
+
+	/* fill the file with a repeating pattern so the consumer can verify it */
+	for (int i = 0; i < file_size; i++) {
+		if (fwrite(&snd_data[off], sizeof(char), 1, file) != 1) {
+			fprintf(stderr, "init tmpfile error\n");
+			return NULL;
+		}
+		if (++off >= sizeof(snd_data))
+			off = 0;
+	}
+	fflush(file);
+	fseek(file, 0, SEEK_SET);
+
+	fp = fileno(file);
+	need_sent = file_size;
+	clock_gettime(CLOCK_MONOTONIC, &ts1);
+
+	if (RXMODE_BPF_VERDICT())
+		target = ctx.c1;
+	else if (TXMODE_BPF_EGRESS())
+		target = ctx.p1;
+	else
+		target = ctx.p2;
+	set_non_block(target, true);
+	while (true) {
+		if (ctx.prod_run_time) {
+			clock_gettime(CLOCK_MONOTONIC, &ts2);
+			if (ts2.tv_sec - ts1.tv_sec > ctx.prod_run_time)
+				return NULL;
+		}
+
+		errno = 0;
+		atomic_inc(&ctx.send_calls);
+		sent = sendfile(target, fp, NULL, need_sent);
+		if (sent < 0) {
+			if (errno != EAGAIN && errno != ENOMEM && errno != ENOBUFS) {
+				fprintf(stderr, "sendfile returned %d, errno %d: %s\n",
+					sent, errno, strerror(errno));
+				return NULL;
+			}
+			continue;
+		} else if (sent < need_sent) {
+			need_sent -= sent;
+			atomic_add(&ctx.prod_send, sent);
+			continue;
+		}
+		atomic_add(&ctx.prod_send, need_sent);
+		need_sent = file_size;
+		lseek(fp, 0, SEEK_SET);
+	}
+
+	return NULL;
+}
+
+static void report_progress(int iter, struct bench_res *res, long delta_ns)
+{
+	double speed_mbs, prod_mbs, bpf_mbs, send_hz, read_hz;
+
+	prod_mbs = res->drops / 1000000.0 / (delta_ns / 1000000000.0);
+	speed_mbs = res->false_hits / 1000000.0 / (delta_ns / 1000000000.0);
+	bpf_mbs = res->hits / 1000000.0 / (delta_ns / 1000000000.0);
+	send_hz = (res->important_hits & 0xFFFFFFFF) / (delta_ns / 1000000000.0);
+	read_hz = (res->important_hits >> 32) / (delta_ns / 1000000000.0);
+
+	printf("Iter %3d (%7.3lfus): ",
+	       iter, (delta_ns - 1000000000) / 1000.0);
+	printf("Send Speed %8.3lf MB/s (%8.3lf calls/s), BPF Speed %8.3lf MB/s, "
+	       "Rcv Speed %8.3lf MB/s (%8.3lf calls/s)\n",
+	       prod_mbs, send_hz, bpf_mbs, speed_mbs, read_hz);
+}
+
+static void report_final(struct bench_res res[], int res_cnt)
+{
+	double verdict_mbs_mean = 0.0;
+	long verdict_total = 0;
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		verdict_mbs_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+		verdict_total += res[i].hits / 1000000.0;
+	}
+
+	printf("Summary: total trans %8.3lu MB \u00B1 %5.3lf MB/s\n",
+	       verdict_total, verdict_mbs_mean);
+}
+
+static const struct argp_option opts[] = {
+	{ "rx-normal", ARG_FW_RX_NORMAL, NULL, 0,
+		"simple reverse-proxy mode, no bpf enabled"},
+	{ "rx-pass", ARG_FW_RX_PASS, NULL, 0,
+		"run bpf prog but no redir applied"},
+	{ "rx-strp", ARG_CTL_RX_STRP, "Byte", 0,
+		"enable strparser and set the encapsulation size"},
+	{ "rx-verdict-egress", ARG_FW_RX_VERDICT_EGRESS, NULL, 0,
+		"forward data to egress queue of another socket with bpf (stream verdict)"},
+	{ "rx-verdict-ingress", ARG_FW_RX_VERDICT_INGRESS, NULL, 0,
+		"forward data to ingress queue of another socket with bpf (stream verdict)"},
+	{ "tx-normal", ARG_FW_TX_NORMAL, NULL, 0,
+		"simple c-s mode, no bpf enabled"},
+	{ "tx-pass", ARG_FW_TX_PASS, NULL, 0,
+		"run bpf prog but no redir applied"},
+	{ "tx-verdict-ingress", ARG_FW_TX_VERDICT_INGRESS, NULL, 0,
+		"forward msg to ingress queue of another socket"},
+	{ "tx-verdict-egress", ARG_FW_TX_VERDICT_EGRESS, NULL, 0,
+		"forward msg to egress queue of another socket"},
+	{ "delay-consumer", ARG_CONSUMER_DELAY_TIME, "SEC", 0,
+		"delay consumer start"},
+	{ "producer-duration", ARG_PRODUCER_DURATION, "SEC", 0,
+		"producer duration"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_FW_RX_NORMAL...ARG_FW_TX_VERDICT_EGRESS:
+		ctx.mode = key;
+		break;
+	case ARG_CONSUMER_DELAY_TIME:
+		ctx.delay_consumer = strtol(arg, NULL, 10);
+		break;
+	case ARG_PRODUCER_DURATION:
+		ctx.prod_run_time = strtol(arg, NULL, 10);
+		break;
+	case ARG_CTL_RX_STRP:
+		ctx.strp_size = strtol(arg, NULL, 10);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+/* exported into benchmark runner */
+const struct argp bench_sockmap_argp = {
+	.options	= opts,
+	.parser		= parse_arg,
+};
+
+/* Benchmark data forwarding performance through sockmap */
+const struct bench bench_sockmap = {
+	.name			= "sockmap",
+	.argp			= &bench_sockmap_argp,
+	.validate		= validate,
+	.setup			= setup,
+	.producer_thread	= producer,
+	.consumer_thread	= consumer,
+	.measure		= measure,
+	.report_progress	= report_progress,
+	.report_final		= report_final,
+};
diff --git a/tools/testing/selftests/bpf/progs/bench_sockmap_prog.c b/tools/testing/selftests/bpf/progs/bench_sockmap_prog.c
new file mode 100644
index 000000000000..079bf3794b3a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bench_sockmap_prog.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+long process_byte = 0;
+int  verdict_dir = 0;
+int  dropped = 0;
+int  pkt_size = 0;
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 20);
+	__type(key, int);
+	__type(value, int);
+} sock_map_rx SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 20);
+	__type(key, int);
+	__type(value, int);
+} sock_map_tx SEC(".maps");
+
+SEC("sk_skb/stream_parser")
+int prog_skb_parser(struct __sk_buff *skb)
+{
+	return pkt_size;
+}
+
+SEC("sk_skb/stream_verdict")
+int prog_skb_verdict(struct __sk_buff *skb)
+{
+	int one = 1;
+	int ret = bpf_sk_redirect_map(skb, &sock_map_rx, one, verdict_dir);
+
+	if (ret == SK_DROP)
+		dropped++;
+	__sync_fetch_and_add(&process_byte, skb->len);
+	return ret;
+}
+
+SEC("sk_skb/stream_verdict")
+int prog_skb_pass(struct __sk_buff *skb)
+{
+	__sync_fetch_and_add(&process_byte, skb->len);
+	return SK_PASS;
+}
+
+SEC("sk_msg")
+int prog_skmsg_verdict(struct sk_msg_md *msg)
+{
+	int one = 1;
+
+	__sync_fetch_and_add(&process_byte, msg->size);
+	return bpf_msg_redirect_map(msg, &sock_map_tx, one, verdict_dir);
+}
+
+SEC("sk_msg")
+int prog_skmsg_pass(struct sk_msg_md *msg)
+{
+	__sync_fetch_and_add(&process_byte, msg->size);
+	return SK_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.47.1
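
A note on the counters in measure() and report_progress() above: the
benchmark reuses the generic bench_res fields (drops = bytes sent by the
producer, hits = bytes seen by the BPF program, false_hits = bytes read by
the consumer) and packs two call counters into important_hits, send calls
in the low 32 bits and read calls in the high 32 bits. The sketch below
illustrates that packing in isolation. The helper names are hypothetical
and not part of the patch, and it assumes each counter fits in 32 bits per
reporting interval and that the packed field is 64 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of the patch. */

/* Pack two 32-bit call counters into one 64-bit value. */
static uint64_t pack_calls(uint32_t send_calls, uint32_t read_calls)
{
	/* Widen before shifting so the shift is done in 64 bits. */
	return (uint64_t)send_calls | ((uint64_t)read_calls << 32);
}

/* Low 32 bits: number of send calls. */
static uint32_t unpack_send_calls(uint64_t packed)
{
	return (uint32_t)(packed & 0xFFFFFFFFu);
}

/* High 32 bits: number of read calls. */
static uint32_t unpack_read_calls(uint64_t packed)
{
	return (uint32_t)(packed >> 32);
}

/* Convert a count taken over delta_ns nanoseconds into a per-second rate. */
static double per_second(uint64_t count, long delta_ns)
{
	assert(delta_ns > 0);
	return count / (delta_ns / 1000000000.0);
}
```

If either counter could exceed 32 bits within one interval, the send count
would spill into the read count, so two separate fields would be the safer
choice in that case.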



* Re: [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues
  2025-04-07 14:21 [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues Jiayuan Chen
                   ` (3 preceding siblings ...)
  2025-04-07 14:21 ` [PATCH bpf-next v1 4/4] selftest/bpf/benchs: Add benchmark for sockmap usage Jiayuan Chen
@ 2025-04-10  3:10 ` patchwork-bot+netdevbpf
  2025-04-10  5:50   ` John Fastabend
  4 siblings, 1 reply; 7+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-04-10  3:10 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, mrpre, john.fastabend, jakub, davem, edumazet, kuba, pabeni,
	horms, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
	linux-kernel, netdev, linux-kselftest

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Mon,  7 Apr 2025 22:21:19 +0800 you wrote:
> I was writing a benchmark based on sockmap + TCP and discovered several
> issues:
> 
> 1. When EAGAIN occurs, the direction of skb is incorrect, causing data
>    loss when retry.
> 2. When sending partial data, the offset is not recorded, leading to
>    duplicate data being sent when retry.
> 3. An unexpected BUG_ON() judgment in skb_linearize is triggered.
> 4. The memory of psock->ingress_skb is not limited by the socket buffer
>    and memcg.
> 
> [...]
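
For readers unfamiliar with the failure modes in items 1 and 2, here is a
userspace analogy only; it is not the kernel fix from this series. When a
non-blocking write is cut short or returns EAGAIN, the sender has to
remember how many bytes were already accepted. Retrying without that
offset resends bytes, and skipping ahead without confirmation loses them.
write_all() is a hypothetical helper name, and the busy retry on EAGAIN
mirrors what the benchmark's producer loop does; real code would probably
wait with poll() instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): write len bytes from buf to
 * fd, resuming from the recorded offset after short writes or EAGAIN.
 * Returns 0 on success, -1 on a hard error (errno is set by write()).
 */
static int write_all(int fd, const char *buf, size_t len)
{
	size_t off = 0;	/* bytes already accepted by the kernel */

	assert(buf != NULL);
	while (off < len) {
		ssize_t n = write(fd, buf + off, len - off);

		if (n < 0) {
			if (errno == EAGAIN || errno == EINTR)
				continue;	/* retry from the same offset */
			return -1;
		}
		off += (size_t)n;	/* advance so a retry never resends data */
	}
	return 0;
}
```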

Here is the summary with links:
  - [bpf-next,v1,1/4] bpf, sockmap: Fix data lost during EAGAIN retries
    https://git.kernel.org/bpf/bpf-next/c/7683167196bd
  - [bpf-next,v1,2/4] bpf, sockmap: fix duplicated data transmission
    https://git.kernel.org/bpf/bpf-next/c/3b4f14b79428
  - [bpf-next,v1,3/4] bpf, sockmap: Fix panic when calling skb_linearize
    https://git.kernel.org/bpf/bpf-next/c/5ca2e29f6834
  - [bpf-next,v1,4/4] selftest/bpf/benchs: Add benchmark for sockmap usage
    https://git.kernel.org/bpf/bpf-next/c/7b2fa44de5e7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues
  2025-04-10  3:10 ` [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues patchwork-bot+netdevbpf
@ 2025-04-10  5:50   ` John Fastabend
  0 siblings, 0 replies; 7+ messages in thread
From: John Fastabend @ 2025-04-10  5:50 UTC (permalink / raw)
  To: patchwork-bot+netdevbpf
  Cc: Jiayuan Chen, bpf, mrpre, jakub, davem, edumazet, kuba, pabeni,
	horms, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
	linux-kernel, netdev, linux-kselftest

On 2025-04-10 03:10:37, patchwork-bot+netdevbpf@kernel.org wrote:
> Hello:
> 
> This series was applied to bpf/bpf-next.git (master)
> by Alexei Starovoitov <ast@kernel.org>:
> 
> On Mon,  7 Apr 2025 22:21:19 +0800 you wrote:
> > I was writing a benchmark based on sockmap + TCP and discovered several
> > issues:
> > 
> > 1. When EAGAIN occurs, the direction of skb is incorrect, causing data
> >    loss when retry.
> > 2. When sending partial data, the offset is not recorded, leading to
> >    duplicate data being sent when retry.
> > 3. An unexpected BUG_ON() judgment in skb_linearize is triggered.
> > 4. The memory of psock->ingress_skb is not limited by the socket buffer
> >    and memcg.
> > 
> > [...]

LGTM, thanks for the fixes, Jiayuan. Good to see someone working through
all the cases.

Already merged, but ACK from me.


> 
> Here is the summary with links:
>   - [bpf-next,v1,1/4] bpf, sockmap: Fix data lost during EAGAIN retries
>     https://git.kernel.org/bpf/bpf-next/c/7683167196bd
>   - [bpf-next,v1,2/4] bpf, sockmap: fix duplicated data transmission
>     https://git.kernel.org/bpf/bpf-next/c/3b4f14b79428
>   - [bpf-next,v1,3/4] bpf, sockmap: Fix panic when calling skb_linearize
>     https://git.kernel.org/bpf/bpf-next/c/5ca2e29f6834
>   - [bpf-next,v1,4/4] selftest/bpf/benchs: Add benchmark for sockmap usage
>     https://git.kernel.org/bpf/bpf-next/c/7b2fa44de5e7
> 
> You are awesome, thank you!
> -- 
> Deet-doot-dot, I am a bot.
> https://korg.docs.kernel.org/patchwork/pwbot.html
> 
> 

Thread overview: 7+ messages
2025-04-07 14:21 [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues Jiayuan Chen
2025-04-07 14:21 ` [PATCH bpf-next v1 1/4] bpf, sockmap: Fix data lost during EAGAIN retries Jiayuan Chen
2025-04-07 14:21 ` [PATCH bpf-next v1 2/4] bpf, sockmap: fix duplicated data transmission Jiayuan Chen
2025-04-07 14:21 ` [PATCH bpf-next v1 3/4] bpf, sockmap: Fix panic when calling skb_linearize Jiayuan Chen
2025-04-07 14:21 ` [PATCH bpf-next v1 4/4] selftest/bpf/benchs: Add benchmark for sockmap usage Jiayuan Chen
2025-04-10  3:10 ` [PATCH bpf-next v1 0/4] bpf, sockmap: Fix data loss and panic issues patchwork-bot+netdevbpf
2025-04-10  5:50   ` John Fastabend