From: Jakub Sitnicki <jakub@cloudflare.com>
To: Feng Zhou <zhoufeng.zf@bytedance.com>
Cc: edumazet@google.com, ast@kernel.org, daniel@iogearbox.net,
andrii@kernel.org, martin.lau@linux.dev, eddyz87@gmail.com,
song@kernel.org, yonghong.song@linux.dev,
john.fastabend@gmail.com, kpsingh@kernel.org, sdf@google.com,
haoluo@google.com, jolsa@kernel.org, davem@davemloft.net,
dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com,
laoar.shao@gmail.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
yangzhenze@bytedance.com, wangdongdong.6@bytedance.com
Subject: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
Date: Fri, 31 May 2024 12:45:03 +0200
Message-ID: <875xuuxntc.fsf@cloudflare.com>
In-Reply-To: <d66d58f1-219e-450a-91fc-bd08337db77d@bytedance.com> (Feng Zhou's message of "Fri, 17 May 2024 15:27:11 +0800")
On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
> On 2024/5/17 01:15, Jakub Sitnicki wrote:
>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>> On 2024/5/15 17:48, Jakub Sitnicki wrote:
[...]
>> If it's not the BPF prog, which you have ruled out, then where are we
>> burning cycles? Maybe that is something that can be improved.
>> Also, in terms of quantifying the improvement - it is 20% in terms of
>> what? Throughput, pps, cycles? And was that a single data point? For
>> multiple measurements there must be some variance (+/- X pp).
>> Would be great to see some data to back it up.
>> [...]
>>
>
> Stress test method:
>
> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>
> Default mode, no bpf prog:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 71520808 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2384000 [msg/sec]
> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.88%--sock_sendmsg
> 79.53%--tcp_sendmsg
> 42.48%--tcp_sendmsg_locked
> 16.23%--_copy_from_iter
> 4.24%--tcp_send_mss
> 3.25%--tcp_current_mss
>
>
> perf top -C 8
>
> 19.13% [kernel] [k] _raw_spin_lock_bh
> 11.75% [kernel] [k] copy_user_enhanced_fast_string
> 9.86% [kernel] [k] tcp_sendmsg_locked
> 4.44% sockperf [.]
> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
> 4.16% libpthread-2.28.so [.] __libc_sendto
> 3.85% [kernel] [k] syscall_return_via_sysret
> 2.70% [kernel] [k] _copy_from_iter
> 2.48% [kernel] [k] entry_SYSCALL_64
> 2.33% [kernel] [k] native_queued_spin_lock_slowpath
> 1.89% [kernel] [k] __virt_addr_valid
> 1.77% [kernel] [k] __check_object_size
> 1.75% [kernel] [k] __sys_sendto
> 1.74% [kernel] [k] entry_SYSCALL_64_after_hwframe
> 1.42% [kernel] [k] __fget_light
> 1.28% [kernel] [k] tcp_push
> 1.01% [kernel] [k] tcp_established_options
> 0.97% [kernel] [k] tcp_send_mss
> 0.94% [kernel] [k] syscall_exit_to_user_mode_prepare
> 0.94% [kernel] [k] tcp_sendmsg
> 0.86% [kernel] [k] tcp_current_mss
>
> Having bpf prog to write tcp opt in all pkts:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 60636218 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2021185 [msg/sec]
> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.30%--sock_sendmsg
> 79.02%--tcp_sendmsg
> 54.14%--tcp_sendmsg_locked
> 12.82%--_copy_from_iter
> 12.51%--tcp_send_mss
> 11.77%--tcp_current_mss
> 10.10%--tcp_established_options
> 8.75%--bpf_skops_hdr_opt_len.isra.54
> 5.71%--__cgroup_bpf_run_filter_sock_ops
> 3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
> 6.61%--__tcp_push_pending_frames
> 6.60%--tcp_write_xmit
> 5.89%--__tcp_transmit_skb
>
> perf top -C 8
>
> 10.98% [kernel] [k] _raw_spin_lock_bh
> 9.04% [kernel] [k] copy_user_enhanced_fast_string
> 7.78% [kernel] [k] tcp_sendmsg_locked
> 3.91% sockperf [.]
> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
> 3.46% libpthread-2.28.so [.] __libc_sendto
> 3.35% [kernel] [k] syscall_return_via_sysret
> 2.86% [kernel] [k] bpf_skops_hdr_opt_len.isra.54
> 2.16% [kernel] [k] __htab_map_lookup_elem
> 2.11% [kernel] [k] _copy_from_iter
> 2.09% [kernel] [k] entry_SYSCALL_64
> 1.97% [kernel] [k] __virt_addr_valid
> 1.95% [kernel] [k] __cgroup_bpf_run_filter_sock_ops
> 1.95% [kernel] [k] lookup_nulls_elem_raw
> 1.89% [kernel] [k] __fget_light
> 1.42% [kernel] [k] __sys_sendto
> 1.41% [kernel] [k] entry_SYSCALL_64_after_hwframe
> 1.31% [kernel] [k] native_queued_spin_lock_slowpath
> 1.22% [kernel] [k] __check_object_size
> 1.18% [kernel] [k] tcp_established_options
> 1.04% bpf_prog_e7ccbf819f5be0d0_tcpopt [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>
> Comparing the above test results, with one CPU saturated, you can
> see that the upper limit of qps or bandwidth drops by nearly 18-20%.
> Looking at CPU occupancy, you can see that "tcp_send_mss" has
> increased significantly.
This helps prove the point, but what I actually had in mind is to check
"perf annotate bpf_skops_hdr_opt_len" and see if there is any
low-hanging fruit there which we can optimize.
For instance, when I benchmark it in a VM, I see we're spending cycles
mostly in memset()/rep stos. I have no idea where the cycles are spent
in your case.
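For example, something along these lines (standard perf-annotate(1)
usage; adjust to your setup) should give a per-instruction breakdown:

    perf record -e cycles:k -C 8 -- sleep 10
    perf annotate --stdio bpf_skops_hdr_opt_len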
>
>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>> * options first before the BPF program does.
>>>>> */
>>>>> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>> +	/* Fast path to reserve space in a skb under
>>>>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>> +	 * The opt length doesn't change often, so it can be cached
>>>>> +	 * in the tcp_sock. Setting BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG
>>>>> +	 * then skips the bpf call.
>>>>> +	 */
>>>>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>> An example or test coverage to show this API extension in action
>>>> would help.
>>>>
>>>
>>> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to
>>> avoid triggering the bpf prog so frequently before TSO, and to
>>> provide a way for users to skip the bpf prog when the opt len is
>>> unchanged. Then, when writing the opt, if the len changes, clear
>>> the flag and update the opt len with the next packet.
>> I haven't seen a sample using the API extension that you're proposing,
>> so I can only guess. But you probably have something like:
>>
>>     SEC("sockops")
>>     int sockops_prog(struct bpf_sock_ops *ctx)
>>     {
>>             if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>                 ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>                     bpf_reserve_hdr_opt(ctx, N, 0);
>>                     bpf_sock_ops_cb_flags_set(ctx,
>>                                               ctx->bpf_sock_ops_cb_flags |
>>                                               MY_NEW_FLAG);
>>                     return 1;
>>             }
>>     }
>
> Yes, that's what I expected.
>
>> I don't understand why you're saying it can't be transformed into:
>>
>>     int sockops_prog(struct bpf_sock_ops *ctx)
>>     {
>>             if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>                 ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>                     bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>>                     return 1;
>>             }
>>     }
>
> "bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"
>
> I don't know what I can do to pass the flag parameter, let
> "bpf_reserve_hdr_opt" return quickly? But this is not useful,
> because the loss caused by the triggering of bpf prog is very
> expensive, and it is still on the hotspot function of sending
> packets, and the TSO has not been completed yet.
>
>> [...]
This is not what I'm suggesting.

bpf_reserve_hdr_opt() has access to bpf_sock_ops_kern and even the
sock. You could either signal through bpf_sock_ops_kern to
bpf_skops_hdr_opt_len() that it should not be called again, or even
configure the tcp_sock directly from bpf_reserve_hdr_opt(), because it
has access to it via bpf_sock_ops_kern{}.sk.
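To make the second option concrete, here is a minimal sketch built on
the existing bpf_sock_ops_reserve_hdr_opt() helper in
net/core/filter.c. BPF_F_CACHE_HDR_OPT_LEN and
tp->bpf_cached_hdr_opt_len are made-up names, only to illustrate the
shape of the change:

    BPF_CALL_3(bpf_sock_ops_reserve_hdr_opt, struct bpf_sock_ops_kern *,
               bpf_sock, u32, len, u64, flags)
    {
            if (bpf_sock->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB ||
                (flags & ~BPF_F_CACHE_HDR_OPT_LEN) || len == 0)
                    return -EINVAL;

            if (len > bpf_sock->remaining_opt_len)
                    return -ENOSPC;

            /* New: remember the reserved length on the tcp_sock,
             * reachable via bpf_sock->sk, so bpf_skops_hdr_opt_len()
             * can reserve it directly and skip running the prog on
             * subsequent skbs.
             */
            if (flags & BPF_F_CACHE_HDR_OPT_LEN)
                    tcp_sk(bpf_sock->sk)->bpf_cached_hdr_opt_len = len;

            bpf_sock->remaining_opt_len -= len;

            return 0;
    }

bpf_skops_hdr_opt_len() would then consume the cached value before
calling __cgroup_bpf_run_filter_sock_ops(), and the prog could clear
it from the WRITE_HDR_OPT_CB callback whenever the option length
changes.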