From: Jakub Sitnicki <jakub@cloudflare.com>
To: Feng Zhou <zhoufeng.zf@bytedance.com>
Cc: edumazet@google.com, ast@kernel.org, daniel@iogearbox.net,
andrii@kernel.org, martin.lau@linux.dev, eddyz87@gmail.com,
song@kernel.org, yonghong.song@linux.dev,
john.fastabend@gmail.com, kpsingh@kernel.org, sdf@google.com,
haoluo@google.com, jolsa@kernel.org, davem@davemloft.net,
dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com,
laoar.shao@gmail.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
yangzhenze@bytedance.com, wangdongdong.6@bytedance.com
Subject: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
Date: Fri, 31 May 2024 12:45:03 +0200
Message-ID: <875xuuxntc.fsf@cloudflare.com>
In-Reply-To: <d66d58f1-219e-450a-91fc-bd08337db77d@bytedance.com> (Feng Zhou's message of "Fri, 17 May 2024 15:27:11 +0800")
On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
> On 2024/5/17 01:15, Jakub Sitnicki wrote:
>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>> On 2024/5/15 17:48, Jakub Sitnicki wrote:
[...]
>> If it's not the BPF prog, which you have ruled out, then where are we
>> burning cycles? Maybe that is something that can be improved.
>> Also, in terms of quantifying the improvement - it is 20% in terms of
>> what? Throughput, pps, cycles? And was that a single data point? For
>> multiple measurements there must be some variance (+/- X pp).
>> Would be great to see some data to back it up.
>> [...]
>>
>
> Stress test method:
>
> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>
> Default mode, no bpf prog:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 71520808 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2384000 [msg/sec]
> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.88%--sock_sendmsg
> 79.53%--tcp_sendmsg
> 42.48%--tcp_sendmsg_locked
> 16.23%--_copy_from_iter
> 4.24%--tcp_send_mss
> 3.25%--tcp_current_mss
>
>
> perf top -C 8
>
> 19.13% [kernel] [k] _raw_spin_lock_bh
> 11.75% [kernel] [k] copy_user_enhanced_fast_string
> 9.86% [kernel] [k] tcp_sendmsg_locked
> 4.44% sockperf [.]
> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
> 4.16% libpthread-2.28.so [.] __libc_sendto
> 3.85% [kernel] [k] syscall_return_via_sysret
> 2.70% [kernel] [k] _copy_from_iter
> 2.48% [kernel] [k] entry_SYSCALL_64
> 2.33% [kernel] [k] native_queued_spin_lock_slowpath
> 1.89% [kernel] [k] __virt_addr_valid
> 1.77% [kernel] [k] __check_object_size
> 1.75% [kernel] [k] __sys_sendto
> 1.74% [kernel] [k] entry_SYSCALL_64_after_hwframe
> 1.42% [kernel] [k] __fget_light
> 1.28% [kernel] [k] tcp_push
> 1.01% [kernel] [k] tcp_established_options
> 0.97% [kernel] [k] tcp_send_mss
> 0.94% [kernel] [k] syscall_exit_to_user_mode_prepare
> 0.94% [kernel] [k] tcp_sendmsg
> 0.86% [kernel] [k] tcp_current_mss
>
> Having bpf prog to write tcp opt in all pkts:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 60636218 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2021185 [msg/sec]
> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.30%--sock_sendmsg
> 79.02%--tcp_sendmsg
> 54.14%--tcp_sendmsg_locked
> 12.82%--_copy_from_iter
> 12.51%--tcp_send_mss
> 11.77%--tcp_current_mss
> 10.10%--tcp_established_options
> 8.75%--bpf_skops_hdr_opt_len.isra.54
> 5.71%--__cgroup_bpf_run_filter_sock_ops
> 3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
> 6.61%--__tcp_push_pending_frames
> 6.60%--tcp_write_xmit
> 5.89%--__tcp_transmit_skb
>
> perf top -C 8
>
> 10.98% [kernel] [k] _raw_spin_lock_bh
> 9.04% [kernel] [k] copy_user_enhanced_fast_string
> 7.78% [kernel] [k] tcp_sendmsg_locked
> 3.91% sockperf [.]
> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
> 3.46% libpthread-2.28.so [.] __libc_sendto
> 3.35% [kernel] [k] syscall_return_via_sysret
> 2.86% [kernel] [k] bpf_skops_hdr_opt_len.isra.54
> 2.16% [kernel] [k] __htab_map_lookup_elem
> 2.11% [kernel] [k] _copy_from_iter
> 2.09% [kernel] [k] entry_SYSCALL_64
> 1.97% [kernel] [k] __virt_addr_valid
> 1.95% [kernel] [k] __cgroup_bpf_run_filter_sock_ops
> 1.95% [kernel] [k] lookup_nulls_elem_raw
> 1.89% [kernel] [k] __fget_light
> 1.42% [kernel] [k] __sys_sendto
> 1.41% [kernel] [k] entry_SYSCALL_64_after_hwframe
> 1.31% [kernel] [k] native_queued_spin_lock_slowpath
> 1.22% [kernel] [k] __check_object_size
> 1.18% [kernel] [k] tcp_established_options
> 1.04% bpf_prog_e7ccbf819f5be0d0_tcpopt [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>
> Comparing the above test results, with one CPU saturated, you can
> see that the upper limit of qps or bandwidth drops by nearly 18-20%.
> Looking at CPU occupancy, you can see that "tcp_send_mss" has
> increased significantly.
This helps prove the point, but what I actually had in mind is to check
"perf annotate bpf_skops_hdr_opt_len" and see if there is any
low-hanging fruit there which we can optimize.
For instance, when I benchmark it in a VM, I see we're spending cycles
mostly in memset()/rep stos. I have no idea where the cycles are spent
in your case.
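For example, something along these lines (standard perf-annotate(1)
usage; adjust to your setup) should give a per-instruction breakdown:

    perf record -e cycles:k -C 8 -- sleep 10
    perf annotate --stdio bpf_skops_hdr_opt_len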
>
>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>> * options first before the BPF program does.
>>>>> */
>>>>> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>> +	/* Fast path to reserve space in a skb under
>>>>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>> +	 * The opt length doesn't change often, so it can be cached
>>>>> +	 * in the tcp_sock. Setting BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG
>>>>> +	 * then skips the bpf call.
>>>>> +	 */
>>>>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>> An example or test coverage to show this API extension in action
>>>> would help.
>>>>
>>>
>>> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to
>>> avoid triggering the bpf prog so frequently before TSO, and to
>>> provide a way for users to skip the bpf prog when the opt len is
>>> unchanged. Then, when writing the opt, if the len changes, clear
>>> the flag and update the opt len with the next packet.
>> I haven't seen a sample using the API extension that you're proposing,
>> so I can only guess. But you probably have something like:
>>
>>     SEC("sockops")
>>     int sockops_prog(struct bpf_sock_ops *ctx)
>>     {
>>             if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>                 ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>                     bpf_reserve_hdr_opt(ctx, N, 0);
>>                     bpf_sock_ops_cb_flags_set(ctx,
>>                                               ctx->bpf_sock_ops_cb_flags |
>>                                               MY_NEW_FLAG);
>>                     return 1;
>>             }
>>     }
>
> Yes, that's what I expected.
>
>> I don't understand why you're saying it can't be transformed into:
>>
>>     int sockops_prog(struct bpf_sock_ops *ctx)
>>     {
>>             if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>                 ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>                     bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>>                     return 1;
>>             }
>>     }
>
> "bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"
>
> I don't know what I can do to pass the flag parameter, let
> "bpf_reserve_hdr_opt" return quickly? But this is not useful,
> because the loss caused by the triggering of bpf prog is very
> expensive, and it is still on the hotspot function of sending
> packets, and the TSO has not been completed yet.
>
>> [...]
This is not what I'm suggesting.

bpf_reserve_hdr_opt() has access to bpf_sock_ops_kern and even the
sock. You could either signal through bpf_sock_ops_kern to
bpf_skops_hdr_opt_len() that it should not be called again, or even
configure the tcp_sock directly from bpf_reserve_hdr_opt(), because it
has access to it via bpf_sock_ops_kern{}.sk.
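To make the second option concrete, here is a minimal sketch built on
the existing bpf_sock_ops_reserve_hdr_opt() helper in
net/core/filter.c. BPF_F_CACHE_HDR_OPT_LEN and
tp->bpf_cached_hdr_opt_len are made-up names, only to illustrate the
shape of the change:

    BPF_CALL_3(bpf_sock_ops_reserve_hdr_opt, struct bpf_sock_ops_kern *,
               bpf_sock, u32, len, u64, flags)
    {
            if (bpf_sock->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB ||
                (flags & ~BPF_F_CACHE_HDR_OPT_LEN) || len == 0)
                    return -EINVAL;

            if (len > bpf_sock->remaining_opt_len)
                    return -ENOSPC;

            /* New: remember the reserved length on the tcp_sock,
             * reachable via bpf_sock->sk, so bpf_skops_hdr_opt_len()
             * can reserve it directly and skip running the prog on
             * subsequent skbs.
             */
            if (flags & BPF_F_CACHE_HDR_OPT_LEN)
                    tcp_sk(bpf_sock->sk)->bpf_cached_hdr_opt_len = len;

            bpf_sock->remaining_opt_len -= len;

            return 0;
    }

bpf_skops_hdr_opt_len() would then consume the cached value before
calling __cgroup_bpf_run_filter_sock_ops(), and the prog could clear
it from the WRITE_HDR_OPT_CB callback whenever the option length
changes.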