* [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
@ 2024-05-15  8:19 Feng zhou
  2024-05-15  9:48 ` Jakub Sitnicki
  0 siblings, 1 reply; 8+ messages in thread
From: Feng zhou @ 2024-05-15  8:19 UTC (permalink / raw)
  To: edumazet, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, davem,
	dsahern, kuba, pabeni, laoar.shao
  Cc: netdev, linux-kernel, bpf, yangzhenze, wangdongdong.6,
	zhoufeng.zf

From: Feng Zhou <zhoufeng.zf@bytedance.com>

With a bpf prog writing a tcp option into every packet, testing found
a throughput loss of about 20%. If a packet wants to write a tcp
option, it triggers the bpf prog three times: "tcp_send_mss" is called
to calculate mss_cache, "tcp_established_options" to reserve the tcp
opt len, and "bpf_skops_write_hdr_opt" to write the tcp opt; and
"tcp_send_mss" runs before TSO. Through bpftrace tracking, it was
found that during the pressure test the "tcp_send_mss" call frequency
was about 900k/s. Considering that the opt len does not change often,
cache the opt len as an optimization.
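
For reference, the call frequency can be measured with a bpftrace
one-liner along these lines (an illustration, not necessarily the
exact command used in the test):

bpftrace -e 'kprobe:tcp_send_mss { @calls = count(); }
	interval:s:1 { print(@calls); clear(@calls); }'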

Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
---
 include/linux/tcp.h            |  3 +++
 include/uapi/linux/bpf.h       |  8 +++++++-
 net/ipv4/tcp_output.c          | 12 +++++++++++-
 tools/include/uapi/linux/bpf.h |  8 +++++++-
 4 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6a5e08b937b3..74437fcf94a2 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -455,6 +455,9 @@ struct tcp_sock {
 					  * to recur itself by calling
 					  * bpf_setsockopt(TCP_CONGESTION, "itself").
 					  */
+	u8	bpf_opt_len;		/* cached tcp opt len for the
+					 * BPF_SOCK_OPS_HDR_OPT_LEN_CB fast path
+					 */
 #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
 #else
 #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 90706a47f6ff..f2092de1f432 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6892,8 +6892,14 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+	/* Fast path to reserve space in a skb when
+	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
+	 * The opt length rarely changes, so it can be cached in the tcp_sock.
+	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
+	 */
+	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ea7ad7d99245..0e7480a58012 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -488,12 +488,21 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
 {
 	struct bpf_sock_ops_kern sock_ops;
 	int err;
+	struct tcp_sock *th = (struct tcp_sock *)sk;
 
-	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+	if (likely(!BPF_SOCK_OPS_TEST_FLAG(th,
 					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
 	    !*remaining)
 		return;
 
+	if (likely(BPF_SOCK_OPS_TEST_FLAG(th,
+					  BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG)) &&
+	    th->bpf_opt_len) {
+		*remaining -= th->bpf_opt_len;
+		opts->bpf_opt_len = th->bpf_opt_len;
+		return;
+	}
+
 	/* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */
 
 	/* init sock_ops */
@@ -538,6 +547,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
 	opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len;
 	/* round up to 4 bytes */
 	opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3;
+	th->bpf_opt_len = opts->bpf_opt_len;
 
 	*remaining -= opts->bpf_opt_len;
 }
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 90706a47f6ff..f2092de1f432 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6892,8 +6892,14 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+	/* Fast path to reserve space in a skb when
+	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
+	 * The opt length rarely changes, so it can be cached in the tcp_sock.
+	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
+	 */
+	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
 };
 
 /* List of known BPF sock_ops operators.
-- 
2.30.2



* Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
  2024-05-15  8:19 [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance Feng zhou
@ 2024-05-15  9:48 ` Jakub Sitnicki
  2024-05-16  3:15   ` Feng Zhou
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Sitnicki @ 2024-05-15  9:48 UTC (permalink / raw)
  To: Feng zhou
  Cc: edumazet, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, davem,
	dsahern, kuba, pabeni, laoar.shao, netdev, linux-kernel, bpf,
	yangzhenze, wangdongdong.6

On Wed, May 15, 2024 at 04:19 PM +08, Feng zhou wrote:
> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>
> With a bpf prog writing a tcp option into every packet, testing found
> a throughput loss of about 20%. If a packet wants to write a tcp
> option, it triggers the bpf prog three times: "tcp_send_mss" is called
> to calculate mss_cache, "tcp_established_options" to reserve the tcp
> opt len, and "bpf_skops_write_hdr_opt" to write the tcp opt; and
> "tcp_send_mss" runs before TSO. Through bpftrace tracking, it was
> found that during the pressure test the "tcp_send_mss" call frequency
> was about 900k/s. Considering that the opt len does not change often,
> cache the opt len as an optimization.

You could also make your BPF sock_ops program cache the value and return
the cached value when called for BPF_SOCK_OPS_HDR_OPT_LEN_CB.

If that is in your opinion prohibitively expensive, then it would be good
to see a sample program and CPU cycle measurements (bpftool prog profile).
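
For instance, a sketch of caching inside the prog with sk_storage, the
way the tcp_hdr_options selftests do it (untested; the map and the
fixed length 8 are made up for illustration):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, __u8);		/* cached option length */
} opt_len_cache SEC(".maps");

SEC("sockops")
int cache_opt_len(struct bpf_sock_ops *ctx)
{
	struct bpf_sock *sk = ctx->sk;
	__u8 *len;

	if (ctx->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB || !sk)
		return 1;

	len = bpf_sk_storage_get(&opt_len_cache, sk, NULL,
				 BPF_SK_STORAGE_GET_F_CREATE);
	if (!len)
		return 1;
	if (!*len)
		*len = 8;	/* slow path: compute the length once */

	bpf_reserve_hdr_opt(ctx, *len, 0);
	return 1;
}

char _license[] SEC("license") = "GPL";

This would of course still trigger the prog on every callback; it only
avoids recomputing the length inside it.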

>
> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
> ---
>  include/linux/tcp.h            |  3 +++
>  include/uapi/linux/bpf.h       |  8 +++++++-
>  net/ipv4/tcp_output.c          | 12 +++++++++++-
>  tools/include/uapi/linux/bpf.h |  8 +++++++-
>  4 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 6a5e08b937b3..74437fcf94a2 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -455,6 +455,9 @@ struct tcp_sock {
>  					  * to recur itself by calling
>  					  * bpf_setsockopt(TCP_CONGESTION, "itself").
>  					  */
> +	u8	bpf_opt_len;		/* cached tcp opt len for the
> +					 * BPF_SOCK_OPS_HDR_OPT_LEN_CB fast path
> +					 */
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>  #else
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 90706a47f6ff..f2092de1f432 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6892,8 +6892,14 @@ enum {
>  	 * options first before the BPF program does.
>  	 */
>  	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> +	/* Fast path to reserve space in a skb when
> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
> +	 */
> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>  /* Mask of all currently supported cb flags */
> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>  };
>  
>  /* List of known BPF sock_ops operators.
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index ea7ad7d99245..0e7480a58012 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -488,12 +488,21 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  {
>  	struct bpf_sock_ops_kern sock_ops;
>  	int err;
> +	struct tcp_sock *th = (struct tcp_sock *)sk;
>  
> -	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> +	if (likely(!BPF_SOCK_OPS_TEST_FLAG(th,
>  					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
>  	    !*remaining)
>  		return;
>  
> +	if (likely(BPF_SOCK_OPS_TEST_FLAG(th,
> +					  BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG)) &&
> +	    th->bpf_opt_len) {
> +		*remaining -= th->bpf_opt_len;

What if the *remaining value shrinks from one call to the next?

The BPF sock_ops program can't react to the change. Feels like there
should be a safety check to prevent an underflow.
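
One possible shape for such a guard, just a sketch (not code from the
posted patch): fall back to the slow path unless the cached length
still fits:

	if (likely(BPF_SOCK_OPS_TEST_FLAG(th,
					  BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG)) &&
	    th->bpf_opt_len && th->bpf_opt_len <= *remaining) {
		/* reuse the cached length only while it still fits */
		*remaining -= th->bpf_opt_len;
		opts->bpf_opt_len = th->bpf_opt_len;
		return;
	}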

> +		opts->bpf_opt_len = th->bpf_opt_len;
> +		return;
> +	}
> +
>  	/* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */
>  
>  	/* init sock_ops */
> @@ -538,6 +547,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  	opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len;
>  	/* round up to 4 bytes */
>  	opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3;
> +	th->bpf_opt_len = opts->bpf_opt_len;
>  
>  	*remaining -= opts->bpf_opt_len;
>  }
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 90706a47f6ff..f2092de1f432 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -6892,8 +6892,14 @@ enum {
>  	 * options first before the BPF program does.
>  	 */
>  	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> +	/* Fast path to reserve space in a skb when
> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
> +	 */
> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),

Have you considered a bpf_reserve_hdr_opt() flag instead?

An example or test coverage to show this API extension in action
would help.

>  /* Mask of all currently supported cb flags */
> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>  };
>  
>  /* List of known BPF sock_ops operators.


* Re: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
  2024-05-15  9:48 ` Jakub Sitnicki
@ 2024-05-16  3:15   ` Feng Zhou
  2024-05-16 17:15     ` Jakub Sitnicki
  0 siblings, 1 reply; 8+ messages in thread
From: Feng Zhou @ 2024-05-16  3:15 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: edumazet, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, davem,
	dsahern, kuba, pabeni, laoar.shao, netdev, linux-kernel, bpf,
	yangzhenze, wangdongdong.6

On 2024/5/15 17:48, Jakub Sitnicki wrote:
> On Wed, May 15, 2024 at 04:19 PM +08, Feng zhou wrote:
>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>
>> With a bpf prog writing a tcp option into every packet, testing found
>> a throughput loss of about 20%. If a packet wants to write a tcp
>> option, it triggers the bpf prog three times: "tcp_send_mss" is called
>> to calculate mss_cache, "tcp_established_options" to reserve the tcp
>> opt len, and "bpf_skops_write_hdr_opt" to write the tcp opt; and
>> "tcp_send_mss" runs before TSO. Through bpftrace tracking, it was
>> found that during the pressure test the "tcp_send_mss" call frequency
>> was about 900k/s. Considering that the opt len does not change often,
>> cache the opt len as an optimization.
> 
> You could also make your BPF sock_ops program cache the value and return
> the cached value when called for BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> 
> If that is in your opinion prohibitively expensive, then it would be good
> to see a sample program and CPU cycle measurements (bpftool prog profile).
> 

I'm not referring to overhead from the bpf prog itself being
time-consuming. I have tested a bpf prog that does nothing and returns
immediately, and the loss is still 20%. During the pressure test,
the per-second call frequencies of "tcp_send_mss" and
"__tcp_transmit_skb" were:

@[
     bpf_skops_hdr_opt_len.isra.46+1
     tcp_established_options+730
     tcp_current_mss+81
     tcp_send_mss+23
     tcp_sendmsg_locked+285
     tcp_sendmsg+58
     sock_sendmsg+48
     sock_write_iter+151
     new_sync_write+296
     vfs_write+165
     ksys_write+89
     do_syscall_64+89
     entry_SYSCALL_64_after_hwframe+68
]: 3671671

@[
     bpf_skops_write_hdr_opt.isra.47+1
     __tcp_transmit_skb+761
     tcp_write_xmit+822
     __tcp_push_pending_frames+52
     tcp_close+813
     inet_release+60
     __sock_release+55
     sock_close+17
     __fput+179
     task_work_run+112
     exit_to_usermode_loop+245
     do_syscall_64+456
     entry_SYSCALL_64_after_hwframe+68
]: 36125

"tcp_send_mss" before TSO, without packet aggregation, and
"__tcp_transmit_skb" after TSO, the gap between the two is
100 times.
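
For reference, per-stack counts like the above can be collected with
something along these lines (an illustration, not necessarily the
exact command used):

bpftrace -e 'kprobe:bpf_skops_hdr_opt_len*, kprobe:bpf_skops_write_hdr_opt* {
	@[probe, kstack] = count();
} interval:s:1 { print(@); clear(@); }'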

>>
>> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
>> ---
>>   include/linux/tcp.h            |  3 +++
>>   include/uapi/linux/bpf.h       |  8 +++++++-
>>   net/ipv4/tcp_output.c          | 12 +++++++++++-
>>   tools/include/uapi/linux/bpf.h |  8 +++++++-
>>   4 files changed, 28 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> index 6a5e08b937b3..74437fcf94a2 100644
>> --- a/include/linux/tcp.h
>> +++ b/include/linux/tcp.h
>> @@ -455,6 +455,9 @@ struct tcp_sock {
>>   					  * to recur itself by calling
>>   					  * bpf_setsockopt(TCP_CONGESTION, "itself").
>>   					  */
>> +	u8	bpf_opt_len;		/* cached tcp opt len for the
>> +					 * BPF_SOCK_OPS_HDR_OPT_LEN_CB fast path
>> +					 */
>>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>>   #else
>>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 90706a47f6ff..f2092de1f432 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -6892,8 +6892,14 @@ enum {
>>   	 * options first before the BPF program does.
>>   	 */
>>   	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>> +	/* Fast path to reserve space in a skb when
>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
>> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>> +	 */
>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>   /* Mask of all currently supported cb flags */
>> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
>> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>>   };
>>   
>>   /* List of known BPF sock_ops operators.
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index ea7ad7d99245..0e7480a58012 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -488,12 +488,21 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>>   {
>>   	struct bpf_sock_ops_kern sock_ops;
>>   	int err;
>> +	struct tcp_sock *th = (struct tcp_sock *)sk;
>>   
>> -	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
>> +	if (likely(!BPF_SOCK_OPS_TEST_FLAG(th,
>>   					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
>>   	    !*remaining)
>>   		return;
>>   
>> +	if (likely(BPF_SOCK_OPS_TEST_FLAG(th,
>> +					  BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG)) &&
>> +	    th->bpf_opt_len) {
>> +		*remaining -= th->bpf_opt_len;
> 
> What if the *remaining value shrinks from one call to the next?
> 
> The BPF sock_ops program can't react to the change. Feels like there
> should be a safety check to prevent an underflow.
> 

Thanks for the reminder, I'll add a check.

>> +		opts->bpf_opt_len = th->bpf_opt_len;
>> +		return;
>> +	}
>> +
>>   	/* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */
>>   
>>   	/* init sock_ops */
>> @@ -538,6 +547,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>>   	opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len;
>>   	/* round up to 4 bytes */
>>   	opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3;
>> +	th->bpf_opt_len = opts->bpf_opt_len;
>>   
>>   	*remaining -= opts->bpf_opt_len;
>>   }
>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> index 90706a47f6ff..f2092de1f432 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -6892,8 +6892,14 @@ enum {
>>   	 * options first before the BPF program does.
>>   	 */
>>   	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>> +	/* Fast path to reserve space in a skb when
>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
>> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>> +	 */
>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
> 
> Have you considered a bpf_reserve_hdr_opt() flag instead?
> 
> An example or test coverage to show this API extension in action
> would help.
> 

A bpf_reserve_hdr_opt() flag can't accomplish this. I want to ensure
that the bpf prog is not triggered frequently before TSO: provide a
way for users to skip the bpf prog while the opt len is unchanged.
Then, when writing the opt, if the len changes, clear the flag and
update the opt len on the next packet.
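
On the write side, the invalidation might look like the following
sketch (illustrative only, not code from this patch), e.g. in
bpf_skops_write_hdr_opt() after the prog has run:

	/* if the prog did not use exactly the reserved length,
	 * invalidate the cache so the next packet takes the slow
	 * path and re-reserves
	 */
	if (sock_ops.remaining_opt_len) {
		tcp_sk(sk)->bpf_opt_len = 0;
		tcp_sk(sk)->bpf_sock_ops_cb_flags &=
			~BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG;
	}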

In the next version, I will add test cases.

>>   /* Mask of all currently supported cb flags */
>> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
>> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>>   };
>>   
>>   /* List of known BPF sock_ops operators.



* Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
  2024-05-16  3:15   ` Feng Zhou
@ 2024-05-16 17:15     ` Jakub Sitnicki
  2024-05-16 17:22       ` Jakub Sitnicki
  2024-05-17  7:27       ` Feng Zhou
  0 siblings, 2 replies; 8+ messages in thread
From: Jakub Sitnicki @ 2024-05-16 17:15 UTC (permalink / raw)
  To: Feng Zhou
  Cc: edumazet, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, davem,
	dsahern, kuba, pabeni, laoar.shao, netdev, linux-kernel, bpf,
	yangzhenze, wangdongdong.6

On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
> On 2024/5/15 17:48, Jakub Sitnicki wrote:
>> On Wed, May 15, 2024 at 04:19 PM +08, Feng zhou wrote:
>>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>>
>>> With a bpf prog writing a tcp option into every packet, testing found
>>> a throughput loss of about 20%. If a packet wants to write a tcp
>>> option, it triggers the bpf prog three times: "tcp_send_mss" is called
>>> to calculate mss_cache, "tcp_established_options" to reserve the tcp
>>> opt len, and "bpf_skops_write_hdr_opt" to write the tcp opt; and
>>> "tcp_send_mss" runs before TSO. Through bpftrace tracking, it was
>>> found that during the pressure test the "tcp_send_mss" call frequency
>>> was about 900k/s. Considering that the opt len does not change often,
>>> cache the opt len as an optimization.
>> You could also make your BPF sock_ops program cache the value and return
>> the cached value when called for BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>> If that is in your opinion prohibitively expensive, then it would be good
>> to see a sample program and CPU cycle measurements (bpftool prog profile).
>> 
>
> I'm not referring to overhead from the bpf prog itself being
> time-consuming. I have tested a bpf prog that does nothing and returns
> immediately, and the loss is still 20%. During the pressure test,
> the per-second call frequencies of "tcp_send_mss" and
> "__tcp_transmit_skb" were:
>
> @[
>     bpf_skops_hdr_opt_len.isra.46+1
>     tcp_established_options+730
>     tcp_current_mss+81
>     tcp_send_mss+23
>     tcp_sendmsg_locked+285
>     tcp_sendmsg+58
>     sock_sendmsg+48
>     sock_write_iter+151
>     new_sync_write+296
>     vfs_write+165
>     ksys_write+89
>     do_syscall_64+89
>     entry_SYSCALL_64_after_hwframe+68
> ]: 3671671
>
> @[
>     bpf_skops_write_hdr_opt.isra.47+1
>     __tcp_transmit_skb+761
>     tcp_write_xmit+822
>     __tcp_push_pending_frames+52
>     tcp_close+813
>     inet_release+60
>     __sock_release+55
>     sock_close+17
>     __fput+179
>     task_work_run+112
>     exit_to_usermode_loop+245
>     do_syscall_64+456
>     entry_SYSCALL_64_after_hwframe+68
> ]: 36125
>
> "tcp_send_mss" before TSO, without packet aggregation, and
> "__tcp_transmit_skb" after TSO, the gap between the two is
> 100 times.

All right, we are getting somewhere.

So in your workload bpf_skops_hdr_opt_len more times that you like.
And you have determined that by memoizing the BPF
skops/BPF_SOCK_OPS_HDR_OPT_LEN_CB result and skipping over part of
tcp_established_options you get a performance boost.

Did you first check with perf record which ops in
tcp_established_options are taking up so many cycles?

If it's not the BPF prog, which you have ruled out, then where are we
burning cycles? Maybe that is something that can be improved.

Also, in terms of quantifying the improvement - it is 20% in terms of
what? Throughput, pps, cycles? And was that a single data point? For
multiple measurements there must be some variance (+/- X pp).

Would be great to see some data to back it up.

[...]

>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>> index 90706a47f6ff..f2092de1f432 100644
>>> --- a/tools/include/uapi/linux/bpf.h
>>> +++ b/tools/include/uapi/linux/bpf.h
>>> @@ -6892,8 +6892,14 @@ enum {
>>>   	 * options first before the BPF program does.
>>>   	 */
>>>   	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>> +	/* Fast path to reserve space in a skb when
>>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
>>> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>>> +	 */
>>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>> An example or test coverage to show this API extension in action
>> would help.
>> 
>
> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to ensure
> that the bpf prog is not triggered frequently before TSO: provide a
> way for users to skip the bpf prog while the opt len is unchanged.
> Then, when writing the opt, if the len changes, clear the flag and
> update the opt len on the next packet.

I haven't seen a sample using the API extension that you're proposing,
so I can only guess. But you probably have something like:

SEC("sockops")
int sockops_prog(struct bpf_sock_ops *ctx)
{
	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
		bpf_reserve_hdr_opt(ctx, N, 0);
		bpf_sock_ops_cb_flags_set(ctx,
					  ctx->bpf_sock_ops_cb_flags |
					  MY_NEW_FLAG);
		return 1;
	}
	return 1;
}

I don't understand why you're saying it can't be transformed into:

int sockops_prog(struct bpf_sock_ops *ctx)
{
	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
		bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
		return 1;
	}
	return 1;
}

[...]


* Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
  2024-05-16 17:15     ` Jakub Sitnicki
@ 2024-05-16 17:22       ` Jakub Sitnicki
  2024-05-17  7:27       ` Feng Zhou
  1 sibling, 0 replies; 8+ messages in thread
From: Jakub Sitnicki @ 2024-05-16 17:22 UTC (permalink / raw)
  To: Feng Zhou
  Cc: edumazet, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, davem,
	dsahern, kuba, pabeni, laoar.shao, netdev, linux-kernel, bpf,
	yangzhenze, wangdongdong.6

On Thu, May 16, 2024 at 07:15 PM +02, Jakub Sitnicki wrote:
> So in your workload bpf_skops_hdr_opt_len more times that you like.

Oops, I can't type any more at this hour. That should have said:

... bpf_skops_hdr_opt_len *gets called* more times ...


* Re: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
  2024-05-16 17:15     ` Jakub Sitnicki
  2024-05-16 17:22       ` Jakub Sitnicki
@ 2024-05-17  7:27       ` Feng Zhou
  2024-05-31 10:45         ` Jakub Sitnicki
  1 sibling, 1 reply; 8+ messages in thread
From: Feng Zhou @ 2024-05-17  7:27 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: edumazet, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, davem,
	dsahern, kuba, pabeni, laoar.shao, netdev, linux-kernel, bpf,
	yangzhenze, wangdongdong.6

On 2024/5/17 01:15, Jakub Sitnicki wrote:
> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>> On 2024/5/15 17:48, Jakub Sitnicki wrote:
>>> On Wed, May 15, 2024 at 04:19 PM +08, Feng zhou wrote:
>>>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>>>
>>>> With a bpf prog writing a tcp option into every packet, testing found
>>>> a throughput loss of about 20%. If a packet wants to write a tcp
>>>> option, it triggers the bpf prog three times: "tcp_send_mss" is called
>>>> to calculate mss_cache, "tcp_established_options" to reserve the tcp
>>>> opt len, and "bpf_skops_write_hdr_opt" to write the tcp opt; and
>>>> "tcp_send_mss" runs before TSO. Through bpftrace tracking, it was
>>>> found that during the pressure test the "tcp_send_mss" call frequency
>>>> was about 900k/s. Considering that the opt len does not change often,
>>>> cache the opt len as an optimization.
>>> You could also make your BPF sock_ops program cache the value and return
>>> the cached value when called for BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>> If that is in your opinion prohibitively expensive, then it would be good
>>> to see a sample program and CPU cycle measurements (bpftool prog profile).
>>>
>>
>> I'm not referring to overhead from the bpf prog itself being
>> time-consuming. I have tested a bpf prog that does nothing and returns
>> immediately, and the loss is still 20%. During the pressure test,
>> the per-second call frequencies of "tcp_send_mss" and
>> "__tcp_transmit_skb" were:
>>
>> @[
>>      bpf_skops_hdr_opt_len.isra.46+1
>>      tcp_established_options+730
>>      tcp_current_mss+81
>>      tcp_send_mss+23
>>      tcp_sendmsg_locked+285
>>      tcp_sendmsg+58
>>      sock_sendmsg+48
>>      sock_write_iter+151
>>      new_sync_write+296
>>      vfs_write+165
>>      ksys_write+89
>>      do_syscall_64+89
>>      entry_SYSCALL_64_after_hwframe+68
>> ]: 3671671
>>
>> @[
>>      bpf_skops_write_hdr_opt.isra.47+1
>>      __tcp_transmit_skb+761
>>      tcp_write_xmit+822
>>      __tcp_push_pending_frames+52
>>      tcp_close+813
>>      inet_release+60
>>      __sock_release+55
>>      sock_close+17
>>      __fput+179
>>      task_work_run+112
>>      exit_to_usermode_loop+245
>>      do_syscall_64+456
>>      entry_SYSCALL_64_after_hwframe+68
>> ]: 36125
>>

Sorry, the call stack above was copied incorrectly; it should be this one:

     bpf_skops_write_hdr_opt.isra.47+1
     __tcp_transmit_skb+761
     tcp_write_xmit+822
     __tcp_push_pending_frames+52
     tcp_sendmsg_locked+3324
     tcp_sendmsg+58
     sock_sendmsg+48
     sock_write_iter+151
     new_sync_write+296
     vfs_write+165
     ksys_write+89
     do_syscall_64+89
     entry_SYSCALL_64_after_hwframe+68

>> "tcp_send_mss" before TSO, without packet aggregation, and
>> "__tcp_transmit_skb" after TSO, the gap between the two is
>> 100 times.
> 
> All right, we are getting somewhere.
> 
> So in your workload bpf_skops_hdr_opt_len more times that you like.
> And you have determined that by memoizing the BPF
> skops/BPF_SOCK_OPS_HDR_OPT_LEN_CB result and skipping over part of
> tcp_established_options you get a performance boost.
> 
> Did you first check with perf record which ops in
> tcp_established_options are taking up so many cycles?
> 

bpftrace -e 'k:__cgroup_bpf_run_filter_sock_ops {
	@[((struct bpf_sock_ops_kern *)arg1)->op,
	  ((struct bpf_sock_ops_kern *)arg1)->args[0]] = count();
} i:s:1 { print(@); clear(@); }' --include 'linux/filter.h'

@[12, 0]: 2003
@[15, 0]: 40952
@[14, 0]: 40953
@[14, 1]: 1512647

@[12, 0]: 5580
@[15, 0]: 45686
@[14, 0]: 45687
@[14, 1]: 1481577

BPF_SOCK_OPS_HDR_OPT_LEN_CB   14
BPF_SOCK_OPS_WRITE_HDR_OPT_CB 15

BPF_WRITE_HDR_TCP_CURRENT_MSS = 1

It can be seen that during the pressure test the bpf prog is
triggered as "BPF_SOCK_OPS_HDR_OPT_LEN_CB" with
"BPF_WRITE_HDR_TCP_CURRENT_MSS" at a very high frequency, because
"tcp_send_mss" is called before TSO, while the packets have not yet
been aggregated.


> If it's not the BPF prog, which you have ruled out, then where are we
> burning cycles? Maybe that is something that can be improved.
> 
> Also, in terms of quantifying the improvement - it is 20% in terms of
> what? Throughput, pps, cycles? And was that a single data point? For
> multiple measurements there must be some variance (+/- X pp).
> 
> Would be great to see some data to back it up.
> 
> [...]
> 

Pressure measurement method:

server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30

Default mode, no bpf prog:

taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
sockperf: == version #3.10-23.gited92afb185e6 ==
sockperf[CLIENT] send on:
[ 0] IP = x.x.x.x    PORT =  7654 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: Total of 71520808 messages sent in 30.000 sec

sockperf: NOTE: test was performed, using msg-size=1200. For getting 
maximum throughput consider using --msg-size=1472
sockperf: Summary: Message Rate is 2384000 [msg/sec]
sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)

perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
perf report

80.88%--sock_sendmsg
  79.53%--tcp_sendmsg
   42.48%--tcp_sendmsg_locked
    16.23%--_copy_from_iter
    4.24%--tcp_send_mss
     3.25%--tcp_current_mss


perf top -C 8

19.13%  [kernel]            [k] _raw_spin_lock_bh
11.75%  [kernel]            [k] copy_user_enhanced_fast_string
  9.86%  [kernel]            [k] tcp_sendmsg_locked
  4.44%  sockperf            [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
  4.16%  libpthread-2.28.so  [.] __libc_sendto
  3.85%  [kernel]            [k] syscall_return_via_sysret
  2.70%  [kernel]            [k] _copy_from_iter
  2.48%  [kernel]            [k] entry_SYSCALL_64
  2.33%  [kernel]            [k] native_queued_spin_lock_slowpath
  1.89%  [kernel]            [k] __virt_addr_valid
  1.77%  [kernel]            [k] __check_object_size
  1.75%  [kernel]            [k] __sys_sendto
  1.74%  [kernel]            [k] entry_SYSCALL_64_after_hwframe
  1.42%  [kernel]            [k] __fget_light
  1.28%  [kernel]            [k] tcp_push
  1.01%  [kernel]            [k] tcp_established_options
  0.97%  [kernel]            [k] tcp_send_mss
  0.94%  [kernel]            [k] syscall_exit_to_user_mode_prepare
  0.94%  [kernel]            [k] tcp_sendmsg
  0.86%  [kernel]            [k] tcp_current_mss

Having bpf prog to write tcp opt in all pkts:

taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
sockperf: == version #3.10-23.gited92afb185e6 ==
sockperf[CLIENT] send on:
[ 0] IP = x.x.x.x    PORT =  7654 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: Total of 60636218 messages sent in 30.000 sec

sockperf: NOTE: test was performed, using msg-size=1200. For getting 
maximum throughput consider using --msg-size=1472
sockperf: Summary: Message Rate is 2021185 [msg/sec]
sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)

perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
perf report

80.30%--sock_sendmsg
  79.02%--tcp_sendmsg
   54.14%--tcp_sendmsg_locked
    12.82%--_copy_from_iter
    12.51%--tcp_send_mss
     11.77%--tcp_current_mss
      10.10%--tcp_established_options
       8.75%--bpf_skops_hdr_opt_len.isra.54
        5.71%--__cgroup_bpf_run_filter_sock_ops
         3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
   6.61%--__tcp_push_pending_frames
    6.60%--tcp_write_xmit
     5.89%--__tcp_transmit_skb

perf top -C 8

10.98%  [kernel]                           [k] _raw_spin_lock_bh
 9.04%  [kernel]                           [k] copy_user_enhanced_fast_string
 7.78%  [kernel]                           [k] tcp_sendmsg_locked
 3.91%  sockperf                           [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
 3.46%  libpthread-2.28.so                 [.] __libc_sendto
 3.35%  [kernel]                           [k] syscall_return_via_sysret
 2.86%  [kernel]                           [k] bpf_skops_hdr_opt_len.isra.54
 2.16%  [kernel]                           [k] __htab_map_lookup_elem
 2.11%  [kernel]                           [k] _copy_from_iter
 2.09%  [kernel]                           [k] entry_SYSCALL_64
 1.97%  [kernel]                           [k] __virt_addr_valid
 1.95%  [kernel]                           [k] __cgroup_bpf_run_filter_sock_ops
 1.95%  [kernel]                           [k] lookup_nulls_elem_raw
 1.89%  [kernel]                           [k] __fget_light
 1.42%  [kernel]                           [k] __sys_sendto
 1.41%  [kernel]                           [k] entry_SYSCALL_64_after_hwframe
 1.31%  [kernel]                           [k] native_queued_spin_lock_slowpath
 1.22%  [kernel]                           [k] __check_object_size
 1.18%  [kernel]                           [k] tcp_established_options
 1.04%  bpf_prog_e7ccbf819f5be0d0_tcpopt   [k] bpf_prog_e7ccbf819f5be0d0_tcpopt

Comparing the above test results with one CPU fully loaded, the upper
limit of qps or bandwidth drops by nearly 18-20%. And in the CPU
profile, the share of "tcp_send_mss" has increased significantly.

>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>> index 90706a47f6ff..f2092de1f432 100644
>>>> --- a/tools/include/uapi/linux/bpf.h
>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>    	 * options first before the BPF program does.
>>>>    	 */
>>>>    	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>> +	/* Fast path to reserve space in a skb when
>>>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
>>>> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>>>> +	 */
>>>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>> An example or test coverage to show this API extension in action
>>> would help.
>>>
>>
>> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to ensure
>> that the bpf prog is not triggered frequently before TSO: provide a
>> way for users to skip the bpf prog while the opt len is unchanged.
>> Then, when writing the opt, if the len changes, clear the flag and
>> update the opt len on the next packet.
> 
> I haven't seen a sample using the API extension that you're proposing,
> so I can only guess. But you probably have something like:
> 
> SEC("sockops")
> int sockops_prog(struct bpf_sock_ops *ctx)
> {
> 	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
> 	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
> 		bpf_reserve_hdr_opt(ctx, N, 0);
> 		bpf_sock_ops_cb_flags_set(ctx,
> 					  ctx->bpf_sock_ops_cb_flags |
> 					  MY_NEW_FLAG);
> 		return 1;
> 	}
> 	return 1;
> }

Yes, that's what I expected.

> 
> I don't understand why you're saying it can't be transformed into:
> 
> int sockops_prog(struct bpf_sock_ops *ctx)
> {
> 	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
> 	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
> 		bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
> 		return 1;
> 	}
> 	return 1;
> }

"bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"

I don't know what I can do to pass the flag parameter, let
"bpf_reserve_hdr_opt" return quickly? But this is not useful,
because the loss caused by the triggering of bpf prog is very
expensive, and it is still on the hotspot function of sending
packets, and the TSO has not been completed yet.

> 
> [...]



* Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
  2024-05-17  7:27       ` Feng Zhou
@ 2024-05-31 10:45         ` Jakub Sitnicki
  2024-06-05  7:37           ` Feng Zhou
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Sitnicki @ 2024-05-31 10:45 UTC (permalink / raw)
  To: Feng Zhou
  Cc: edumazet, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, davem,
	dsahern, kuba, pabeni, laoar.shao, netdev, linux-kernel, bpf,
	yangzhenze, wangdongdong.6

On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
> On 2024/5/17 01:15, Jakub Sitnicki wrote:
>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>> On 2024/5/15 17:48, Jakub Sitnicki wrote:

[...]

>> If it's not the BPF prog, which you have ruled out, then where are we
>> burning cycles? Maybe that is something that can be improved.
>> Also, in terms of quantifying the improvement - it is 20% in terms of
>> what? Throughput, pps, cycles? And was that a single data point? For
>> multiple measurements there must be some variance (+/- X pp).
>> Would be great to see some data to back it up.
>> [...]
>> 
>
> Pressure measurement method:
>
> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>
> Default mode, no bpf prog:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x    PORT =  7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 71520808 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2384000 [msg/sec]
> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.88%--sock_sendmsg
>  79.53%--tcp_sendmsg
>   42.48%--tcp_sendmsg_locked
>    16.23%--_copy_from_iter
>    4.24%--tcp_send_mss
>     3.25%--tcp_current_mss
>
>
> perf top -C 8
>
> 19.13%  [kernel]            [k] _raw_spin_lock_bh
> 11.75%  [kernel]            [k] copy_user_enhanced_fast_string
>  9.86%  [kernel]            [k] tcp_sendmsg_locked
>  4.44%  sockperf            [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>  4.16%  libpthread-2.28.so  [.] __libc_sendto
>  3.85%  [kernel]            [k] syscall_return_via_sysret
>  2.70%  [kernel]            [k] _copy_from_iter
>  2.48%  [kernel]            [k] entry_SYSCALL_64
>  2.33%  [kernel]            [k] native_queued_spin_lock_slowpath
>  1.89%  [kernel]            [k] __virt_addr_valid
>  1.77%  [kernel]            [k] __check_object_size
>  1.75%  [kernel]            [k] __sys_sendto
>  1.74%  [kernel]            [k] entry_SYSCALL_64_after_hwframe
>  1.42%  [kernel]            [k] __fget_light
>  1.28%  [kernel]            [k] tcp_push
>  1.01%  [kernel]            [k] tcp_established_options
>  0.97%  [kernel]            [k] tcp_send_mss
>  0.94%  [kernel]            [k] syscall_exit_to_user_mode_prepare
>  0.94%  [kernel]            [k] tcp_sendmsg
>  0.86%  [kernel]            [k] tcp_current_mss
>
> Having bpf prog to write tcp opt in all pkts:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x    PORT =  7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 60636218 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2021185 [msg/sec]
> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.30%--sock_sendmsg
>  79.02%--tcp_sendmsg
>   54.14%--tcp_sendmsg_locked
>    12.82%--_copy_from_iter
>    12.51%--tcp_send_mss
>     11.77%--tcp_current_mss
>      10.10%--tcp_established_options
>       8.75%--bpf_skops_hdr_opt_len.isra.54
>        5.71%--__cgroup_bpf_run_filter_sock_ops
>         3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
>   6.61%--__tcp_push_pending_frames
>    6.60%--tcp_write_xmit
>     5.89%--__tcp_transmit_skb
>
> perf top -C 8
>
> 10.98%  [kernel]                           [k] _raw_spin_lock_bh
>  9.04%  [kernel]                           [k] copy_user_enhanced_fast_string
>  7.78%  [kernel]                           [k] tcp_sendmsg_locked
>  3.91%  sockperf                           [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>  3.46%  libpthread-2.28.so                 [.] __libc_sendto
>  3.35%  [kernel]                           [k] syscall_return_via_sysret
>  2.86%  [kernel]                           [k] bpf_skops_hdr_opt_len.isra.54
>  2.16%  [kernel]                           [k] __htab_map_lookup_elem
>  2.11%  [kernel]                           [k] _copy_from_iter
>  2.09%  [kernel]                           [k] entry_SYSCALL_64
>  1.97%  [kernel]                           [k] __virt_addr_valid
>  1.95%  [kernel]                           [k] __cgroup_bpf_run_filter_sock_ops
>  1.95%  [kernel]                           [k] lookup_nulls_elem_raw
>  1.89%  [kernel]                           [k] __fget_light
>  1.42%  [kernel]                           [k] __sys_sendto
>  1.41%  [kernel]                           [k] entry_SYSCALL_64_after_hwframe
>  1.31%  [kernel]                           [k] native_queued_spin_lock_slowpath
>  1.22%  [kernel]                           [k] __check_object_size
>  1.18%  [kernel]                           [k] tcp_established_options
>  1.04%  bpf_prog_e7ccbf819f5be0d0_tcpopt   [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>
> Comparing the above test results with one CPU fully loaded, the upper
> limit of qps or bandwidth drops by nearly 18-20%. And in the CPU
> profile, the share of "tcp_send_mss" has increased significantly.

This helps prove the point, but what I actually had in mind is to check
"perf annotate bpf_skops_hdr_opt_len" and see if there any low hanging
fruit there which we can optimize.

For instance, when I benchmark it in a VM, I see we're spending cycles
mostly memset()/rep stos. I have no idea where the cycles are spent in
your case.

>
>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>>    	 * options first before the BPF program does.
>>>>>    	 */
>>>>>    	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>> +	/* Fast path to reserve space in a skb when
>>>>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
>>>>> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>>>>> +	 */
>>>>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>> An example or test coverage to show this API extension in action
>>>> would help.
>>>>
>>>
>>> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to ensure
>>> that the bpf prog is not triggered frequently before TSO: provide a
>>> way for users to skip the bpf prog while the opt len is unchanged.
>>> Then, when writing the opt, if the len changes, clear the flag and
>>> update the opt len on the next packet.
>> I haven't seen a sample using the API extension that you're proposing,
>> so I can only guess. But you probably have something like:
>> SEC("sockops")
>> int sockops_prog(struct bpf_sock_ops *ctx)
>> {
>> 	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>> 	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>> 		bpf_reserve_hdr_opt(ctx, N, 0);
>> 		bpf_sock_ops_cb_flags_set(ctx,
>> 					  ctx->bpf_sock_ops_cb_flags |
>> 					  MY_NEW_FLAG);
>> 		return 1;
>> 	}
>> 	return 1;
>> }
>
> Yes, that's what I expected.
>
>> I don't understand why you're saying it can't be transformed into:
>> int sockops_prog(struct bpf_sock_ops *ctx)
>> {
>> 	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>> 	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>> 		bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>> 		return 1;
>> 	}
>> 	return 1;
>> }
>
> "bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"
>
> I don't know what I can do to pass the flag parameter, let
> "bpf_reserve_hdr_opt" return quickly? But this is not useful,
> because the loss caused by the triggering of bpf prog is very
> expensive, and it is still on the hotspot function of sending
> packets, and the TSO has not been completed yet.
>
>> [...]

This is not what I'm suggesting.

bpf_reserve_hdr_opt() has access to bpf_sock_ops_kern and even the
sock. You could either signal through bpf_sock_ops_kern to
bpf_skops_hdr_opt_len() that it should not be called again

Or even configure the tcp_sock directly from bpf_reserve_hdr_opt()
because it has access to it via bpf_sock_ops_kern{}.sk.
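
For illustration, a sketch of the second idea inside the helper's
kernel implementation, bpf_sock_ops_reserve_hdr_opt() in
net/core/filter.c (BPF_F_CACHE_OPT_LEN is a made-up flag name):

	if (flags & BPF_F_CACHE_OPT_LEN) {	/* hypothetical flag */
		struct tcp_sock *tp = tcp_sk(bpf_sock->sk);

		/* cache the reserved length so bpf_skops_hdr_opt_len()
		 * can skip the callback for subsequent packets
		 */
		tp->bpf_opt_len = len;
	}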


* Re: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
  2024-05-31 10:45         ` Jakub Sitnicki
@ 2024-06-05  7:37           ` Feng Zhou
  0 siblings, 0 replies; 8+ messages in thread
From: Feng Zhou @ 2024-06-05  7:37 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: edumazet, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, davem,
	dsahern, kuba, pabeni, laoar.shao, netdev, linux-kernel, bpf,
	yangzhenze, wangdongdong.6

On 2024/5/31 18:45, Jakub Sitnicki wrote:
> On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
>> On 2024/5/17 01:15, Jakub Sitnicki wrote:
>>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>>> On 2024/5/15 17:48, Jakub Sitnicki wrote:
> 
> [...]
> 
>>> If it's not the BPF prog, which you have ruled out, then where are we
>>> burning cycles? Maybe that is something that can be improved.
>>> Also, in terms of quantifying the improvement - it is 20% in terms of
>>> what? Throughput, pps, cycles? And was that a single data point? For
>>> multiple measurements there must be some variance (+/- X pp).
>>> Would be great to see some data to back it up.
>>> [...]
>>>
>>
>> Pressure measurement method:
>>
>> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
>> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>>
>> Default mode, no bpf prog:
>>
>> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>> sockperf: == version #3.10-23.gited92afb185e6 ==
>> sockperf[CLIENT] send on:
>> [ 0] IP = x.x.x.x    PORT =  7654 # TCP
>> sockperf: Warmup stage (sending a few dummy messages)...
>> sockperf: Starting test...
>> sockperf: Test end (interrupted by timer)
>> sockperf: Test ended
>> sockperf: Total of 71520808 messages sent in 30.000 sec
>>
>> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
>> throughput consider using --msg-size=1472
>> sockperf: Summary: Message Rate is 2384000 [msg/sec]
>> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>>
>> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
>> perf report
>>
>> 80.88%--sock_sendmsg
>>   79.53%--tcp_sendmsg
>>    42.48%--tcp_sendmsg_locked
>>     16.23%--_copy_from_iter
>>     4.24%--tcp_send_mss
>>      3.25%--tcp_current_mss
>>
>>
>> perf top -C 8
>>
>> 19.13%  [kernel]            [k] _raw_spin_lock_bh
>> 11.75%  [kernel]            [k] copy_user_enhanced_fast_string
>>   9.86%  [kernel]            [k] tcp_sendmsg_locked
>>   4.44%  sockperf            [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>>   4.16%  libpthread-2.28.so  [.] __libc_sendto
>>   3.85%  [kernel]            [k] syscall_return_via_sysret
>>   2.70%  [kernel]            [k] _copy_from_iter
>>   2.48%  [kernel]            [k] entry_SYSCALL_64
>>   2.33%  [kernel]            [k] native_queued_spin_lock_slowpath
>>   1.89%  [kernel]            [k] __virt_addr_valid
>>   1.77%  [kernel]            [k] __check_object_size
>>   1.75%  [kernel]            [k] __sys_sendto
>>   1.74%  [kernel]            [k] entry_SYSCALL_64_after_hwframe
>>   1.42%  [kernel]            [k] __fget_light
>>   1.28%  [kernel]            [k] tcp_push
>>   1.01%  [kernel]            [k] tcp_established_options
>>   0.97%  [kernel]            [k] tcp_send_mss
>>   0.94%  [kernel]            [k] syscall_exit_to_user_mode_prepare
>>   0.94%  [kernel]            [k] tcp_sendmsg
>>   0.86%  [kernel]            [k] tcp_current_mss
>>
>> Having bpf prog to write tcp opt in all pkts:
>>
>> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>> sockperf: == version #3.10-23.gited92afb185e6 ==
>> sockperf[CLIENT] send on:
>> [ 0] IP = x.x.x.x    PORT =  7654 # TCP
>> sockperf: Warmup stage (sending a few dummy messages)...
>> sockperf: Starting test...
>> sockperf: Test end (interrupted by timer)
>> sockperf: Test ended
>> sockperf: Total of 60636218 messages sent in 30.000 sec
>>
>> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
>> throughput consider using --msg-size=1472
>> sockperf: Summary: Message Rate is 2021185 [msg/sec]
>> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>>
>> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
>> perf report
>>
>> 80.30%--sock_sendmsg
>>   79.02%--tcp_sendmsg
>>    54.14%--tcp_sendmsg_locked
>>     12.82%--_copy_from_iter
>>     12.51%--tcp_send_mss
>>      11.77%--tcp_current_mss
>>       10.10%--tcp_established_options
>>        8.75%--bpf_skops_hdr_opt_len.isra.54
>>         5.71%--__cgroup_bpf_run_filter_sock_ops
>>          3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
>>    6.61%--__tcp_push_pending_frames
>>     6.60%--tcp_write_xmit
>>      5.89%--__tcp_transmit_skb
>>
>> perf top -C 8
>>
>> 10.98%  [kernel]                           [k] _raw_spin_lock_bh
>>   9.04%  [kernel]                           [k] copy_user_enhanced_fast_string
>>   7.78%  [kernel]                           [k] tcp_sendmsg_locked
>>   3.91%  sockperf                           [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>>   3.46%  libpthread-2.28.so                 [.] __libc_sendto
>>   3.35%  [kernel]                           [k] syscall_return_via_sysret
>>   2.86%  [kernel]                           [k] bpf_skops_hdr_opt_len.isra.54
>>   2.16%  [kernel]                           [k] __htab_map_lookup_elem
>>   2.11%  [kernel]                           [k] _copy_from_iter
>>   2.09%  [kernel]                           [k] entry_SYSCALL_64
>>   1.97%  [kernel]                           [k] __virt_addr_valid
>>   1.95%  [kernel]                           [k] __cgroup_bpf_run_filter_sock_ops
>>   1.95%  [kernel]                           [k] lookup_nulls_elem_raw
>>   1.89%  [kernel]                           [k] __fget_light
>>   1.42%  [kernel]                           [k] __sys_sendto
>>   1.41%  [kernel]                           [k] entry_SYSCALL_64_after_hwframe
>>   1.31%  [kernel]                           [k] native_queued_spin_lock_slowpath
>>   1.22%  [kernel]                           [k] __check_object_size
>>   1.18%  [kernel]                           [k] tcp_established_options
>>   1.04%  bpf_prog_e7ccbf819f5be0d0_tcpopt   [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>>
>> Comparing the above test results with one CPU fully loaded, the upper
>> limit of qps or bandwidth drops by nearly 18-20%. And in the CPU
>> profile, the share of "tcp_send_mss" has increased significantly.
> 
> This helps prove the point, but what I actually had in mind is to check
> "perf annotate bpf_skops_hdr_opt_len" and see if there any low hanging
> fruit there which we can optimize.
> 
> For instance, when I benchmark it in a VM, I see we're spending cycles
> mostly memset()/rep stos. I have no idea where the cycles are spent in
> your case.
> 
>>

How do you run your pressure test? Can you send it to me to try? Or
you can try my pressure test method. Have you checked the call
frequency of bpf_skops_hdr_opt_len and bpf_skops_write_hdr_opt?

>>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>>>     	 * options first before the BPF program does.
>>>>>>     	 */
>>>>>>     	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>>> +	/* Fast path to reserve space in a skb when
>>>>>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>>> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
>>>>>> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>>>>>> +	 */
>>>>>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>>> An example or test coverage to show this API extension in action
>>>>> would help.
>>>>>
>>>>
>>>> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to ensure
>>>> that the bpf prog is not triggered frequently before TSO: provide a
>>>> way for users to skip the bpf prog while the opt len is unchanged.
>>>> Then, when writing the opt, if the len changes, clear the flag and
>>>> update the opt len on the next packet.
>>> I haven't seen a sample using the API extension that you're proposing,
>>> so I can only guess. But you probably have something like:
>>> SEC("sockops")
>>> int sockops_prog(struct bpf_sock_ops *ctx)
>>> {
>>> 	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>> 	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>> 		bpf_reserve_hdr_opt(ctx, N, 0);
>>> 		bpf_sock_ops_cb_flags_set(ctx,
>>> 					  ctx->bpf_sock_ops_cb_flags |
>>> 					  MY_NEW_FLAG);
>>> 		return 1;
>>> 	}
>>> 	return 1;
>>> }
>>
>> Yes, that's what I expected.
>>
>>> I don't understand why you're saying it can't be transformed into:
>>> int sockops_prog(struct bpf_sock_ops *ctx)
>>> {
>>> 	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>> 	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>> 		bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>>> 		return 1;
>>> 	}
>>> 	return 1;
>>> }
>>
>> "bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"
>>
>> I don't know what I can do to pass the flag parameter, let
>> "bpf_reserve_hdr_opt" return quickly? But this is not useful,
>> because the loss caused by the triggering of bpf prog is very
>> expensive, and it is still on the hotspot function of sending
>> packets, and the TSO has not been completed yet.
>>
>>> [...]
> 
> This is not what I'm suggesting.
> 
> bpf_reserve_hdr_opt() has access to bpf_sock_ops_kern and even the
> sock. You could either signal through bpf_sock_ops_kern to
> bpf_skops_hdr_opt_len() that it should not be called again.
> 
> Or even configure the tcp_sock directly from bpf_reserve_hdr_opt()
> because it has access to it via bpf_sock_ops_kern{}.sk.

Oh, I see what you mean, this will achieve the goal.



