All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: Feng Zhou <zhoufeng.zf@bytedance.com>
Cc: edumazet@google.com,  ast@kernel.org,  daniel@iogearbox.net,
	andrii@kernel.org,  martin.lau@linux.dev,  eddyz87@gmail.com,
	song@kernel.org,  yonghong.song@linux.dev,
	 john.fastabend@gmail.com, kpsingh@kernel.org,  sdf@google.com,
	 haoluo@google.com, jolsa@kernel.org,  davem@davemloft.net,
	 dsahern@kernel.org, kuba@kernel.org,  pabeni@redhat.com,
	 laoar.shao@gmail.com, netdev@vger.kernel.org,
	 linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	 yangzhenze@bytedance.com, wangdongdong.6@bytedance.com
Subject: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
Date: Fri, 31 May 2024 12:45:03 +0200	[thread overview]
Message-ID: <875xuuxntc.fsf@cloudflare.com> (raw)
In-Reply-To: <d66d58f1-219e-450a-91fc-bd08337db77d@bytedance.com> (Feng Zhou's message of "Fri, 17 May 2024 15:27:11 +0800")

On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
> 在 2024/5/17 01:15, Jakub Sitnicki 写道:
>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>> 在 2024/5/15 17:48, Jakub Sitnicki 写道:

[...]

>> If it's not the BPF prog, which you have ruled out, then where are we
>> burining cycles? Maybe that is something that can be improved.
>> Also, in terms on quantifying the improvement - it is 20% in terms of
>> what? Throughput, pps, cycles? And was that a single data point? For
>> multiple measurements there must be some variance (+/- X pp).
>> Would be great to see some data to back it up.
>> [...]
>> 
>
> Pressure measurement method:
>
> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>
> Default mode, no bpf prog:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x    PORT =  7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 71520808 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2384000 [msg/sec]
> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.88%--sock_sendmsg
>  79.53%--tcp_sendmsg
>   42.48%--tcp_sendmsg_locked
>    16.23%--_copy_from_iter
>    4.24%--tcp_send_mss
>     3.25%--tcp_current_mss
>
>
> perf top -C 8
>
> 19.13%  [kernel]            [k] _raw_spin_lock_bh
> 11.75%  [kernel]            [k] copy_user_enhanced_fast_string
>  9.86%  [kernel]            [k] tcp_sendmsg_locked
>  4.44%  sockperf            [.]
>  _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>  4.16%  libpthread-2.28.so  [.] __libc_sendto
>  3.85%  [kernel]            [k] syscall_return_via_sysret
>  2.70%  [kernel]            [k] _copy_from_iter
>  2.48%  [kernel]            [k] entry_SYSCALL_64
>  2.33%  [kernel]            [k] native_queued_spin_lock_slowpath
>  1.89%  [kernel]            [k] __virt_addr_valid
>  1.77%  [kernel]            [k] __check_object_size
>  1.75%  [kernel]            [k] __sys_sendto
>  1.74%  [kernel]            [k] entry_SYSCALL_64_after_hwframe
>  1.42%  [kernel]            [k] __fget_light
>  1.28%  [kernel]            [k] tcp_push
>  1.01%  [kernel]            [k] tcp_established_options
>  0.97%  [kernel]            [k] tcp_send_mss
>  0.94%  [kernel]            [k] syscall_exit_to_user_mode_prepare
>  0.94%  [kernel]            [k] tcp_sendmsg
>  0.86%  [kernel]            [k] tcp_current_mss
>
> Having bpf prog to write tcp opt in all pkts:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x    PORT =  7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 60636218 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2021185 [msg/sec]
> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.30%--sock_sendmsg
>  79.02%--tcp_sendmsg
>   54.14%--tcp_sendmsg_locked
>    12.82%--_copy_from_iter
>    12.51%--tcp_send_mss
>     11.77%--tcp_current_mss
>      10.10%--tcp_established_options
>       8.75%--bpf_skops_hdr_opt_len.isra.54
>        5.71%--__cgroup_bpf_run_filter_sock_ops
>         3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
>   6.61%--__tcp_push_pending_frames
>    6.60%--tcp_write_xmit
>     5.89%--__tcp_transmit_skb
>
> perf top -C 8
>
> 10.98%  [kernel]                           [k] _raw_spin_lock_bh
>  9.04%  [kernel]                           [k] copy_user_enhanced_fast_string
>  7.78%  [kernel]                           [k] tcp_sendmsg_locked
>  3.91%  sockperf                           [.]
>  _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>  3.46%  libpthread-2.28.so                 [.] __libc_sendto
>  3.35%  [kernel]                           [k] syscall_return_via_sysret
>  2.86%  [kernel]                           [k] bpf_skops_hdr_opt_len.isra.54
>  2.16%  [kernel]                           [k] __htab_map_lookup_elem
>  2.11%  [kernel]                           [k] _copy_from_iter
>  2.09%  [kernel]                           [k] entry_SYSCALL_64
>  1.97%  [kernel]                           [k] __virt_addr_valid
>  1.95%  [kernel]                           [k] __cgroup_bpf_run_filter_sock_ops
>  1.95%  [kernel]                           [k] lookup_nulls_elem_raw
>  1.89%  [kernel]                           [k] __fget_light
>  1.42%  [kernel]                           [k] __sys_sendto
>  1.41%  [kernel]                           [k] entry_SYSCALL_64_after_hwframe
>  1.31%  [kernel]                           [k] native_queued_spin_lock_slowpath
>  1.22%  [kernel]                           [k] __check_object_size
>  1.18%  [kernel]                           [k] tcp_established_options
>  1.04%  bpf_prog_e7ccbf819f5be0d0_tcpopt   [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>
> Compare the above test results, fill up a CPU, you can find that
> the upper limit of qps or BandWidth has a loss of nearly 18-20%.
> Then CPU occupancy, you can find that "tcp_send_mss" has increased
> significantly.

This helps prove the point, but what I actually had in mind is to check
"perf annotate bpf_skops_hdr_opt_len" and see if there any low hanging
fruit there which we can optimize.

For instance, when I benchmark it in a VM, I see we're spending cycles
mostly memset()/rep stos. I have no idea where the cycles are spent in
your case.

>
>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>>    	 * options first before the BPF program does.
>>>>>    	 */
>>>>>    	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>> +	/* Fast path to reserve space in a skb under
>>>>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>> +	 * opt length doesn't change often, so it can save in the tcp_sock. And
>>>>> +	 * set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to no bpf call.
>>>>> +	 */
>>>>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>> An example or test coverage would to show this API extension in action
>>>> would help.
>>>>
>>>
>>> bpf_reserve_hdr_opt () flag can't finish this. I want to optimize
>>> that bpf prog will not be triggered frequently before TSO. Provide
>>> a way for users to not trigger bpf prog when opt len is unchanged.
>>> Then when writing opt, if len changes, clear the flag, and then
>>> change opt len in the next package.
>> I haven't seen a sample using the API extenstion that you're proposing,
>> so I can only guess. But you probably have something like:
>> SEC("sockops")
>> int sockops_prog(struct bpf_sock_ops *ctx)
>> {
>> 	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>> 	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>> 		bpf_reserve_hdr_opt(ctx, N, 0);
>> 		bpf_sock_ops_cb_flags_set(ctx,
>> 					  ctx->bpf_sock_ops_cb_flags |
>> 					  MY_NEW_FLAG);
>> 		return 1;
>> 	}
>> }
>
> Yes, that's what I expected.
>
>> I don't understand why you're saying it can't be transformed into:
>> int sockops_prog(struct bpf_sock_ops *ctx)
>> {
>> 	if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>> 	    ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>> 		bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>> 		return 1;
>> 	}
>> }
>
> "bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"
>
> I don't know what I can do to pass the flag parameter, let
> "bpf_reserve_hdr_opt" return quickly? But this is not useful,
> because the loss caused by the triggering of bpf prog is very
> expensive, and it is still on the hotspot function of sending
> packets, and the TSO has not been completed yet.
>
>> [...]

This is not what I'm suggesting.

bpf_reserve_hdr_opt() has access to bpf_sock_ops_kern and even the
sock. You could either signal through bpf_sock_ops_kern to
bpf_skops_hdr_opt_len() that it should not be called again

Or even configure the tcp_sock directly from bpf_reserve_hdr_opt()
because it has access to it via bpf_sock_ops_kern{}.sk.

  reply	other threads:[~2024-05-31 10:45 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-15  8:19 [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance Feng zhou
2024-05-15  9:48 ` Jakub Sitnicki
2024-05-16  3:15   ` Feng Zhou
2024-05-16 17:15     ` Jakub Sitnicki
2024-05-16 17:22       ` Jakub Sitnicki
2024-05-17  7:27       ` Feng Zhou
2024-05-31 10:45         ` Jakub Sitnicki [this message]
2024-06-05  7:37           ` Feng Zhou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875xuuxntc.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=eddyz87@gmail.com \
    --cc=edumazet@google.com \
    --cc=haoluo@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=laoar.shao@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=wangdongdong.6@bytedance.com \
    --cc=yangzhenze@bytedance.com \
    --cc=yonghong.song@linux.dev \
    --cc=zhoufeng.zf@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.