Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jakub Sitnicki <jakub@cloudflare.com>
To: Feng zhou <zhoufeng.zf@bytedance.com>
Cc: edumazet@google.com,  ast@kernel.org,  daniel@iogearbox.net,
	andrii@kernel.org,  martin.lau@linux.dev,  eddyz87@gmail.com,
	song@kernel.org,  yonghong.song@linux.dev,
	 john.fastabend@gmail.com, kpsingh@kernel.org,  sdf@google.com,
	 haoluo@google.com, jolsa@kernel.org,  davem@davemloft.net,
	 dsahern@kernel.org, kuba@kernel.org,  pabeni@redhat.com,
	 laoar.shao@gmail.com, netdev@vger.kernel.org,
	 linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	 yangzhenze@bytedance.com, wangdongdong.6@bytedance.com
Subject: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
Date: Wed, 15 May 2024 11:48:09 +0200	[thread overview]
Message-ID: <87seyjwgme.fsf@cloudflare.com> (raw)
In-Reply-To: <20240515081901.91058-1-zhoufeng.zf@bytedance.com> (Feng zhou's message of "Wed, 15 May 2024 16:19:01 +0800")

On Wed, May 15, 2024 at 04:19 PM +08, Feng zhou wrote:
> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>
> Set the full package write tcp option, the test found that the loss
> will be 20%. If a package wants to write tcp option, it will trigger
> bpf prog three times, and call "tcp_send_mss" calculate mss_cache,
> call "tcp_established_options" to reserve tcp opt len, call
> "bpf_skops_write_hdr_opt" to write tcp opt, but "tcp_send_mss" before
> TSO. Through bpftrace tracking, it was found that during the pressure
> test, "tcp_send_mss" call frequency was 90w/s. Considering that opt
> len does not change often, consider caching opt len for optimization.

You could also make your BPF sock_ops program cache the value and return
the cached value when called for BPF_SOCK_OPS_HDR_OPT_LEN_CB.

If that is in your opinion prohibitevely expensive then it would be good
to see a sample program and CPU cycle measurements (bpftool prog profile).

>
> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
> ---
>  include/linux/tcp.h            |  3 +++
>  include/uapi/linux/bpf.h       |  8 +++++++-
>  net/ipv4/tcp_output.c          | 12 +++++++++++-
>  tools/include/uapi/linux/bpf.h |  8 +++++++-
>  4 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 6a5e08b937b3..74437fcf94a2 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -455,6 +455,9 @@ struct tcp_sock {
>  					  * to recur itself by calling
>  					  * bpf_setsockopt(TCP_CONGESTION, "itself").
>  					  */
> +	u8	bpf_opt_len;		/* save tcp opt len implementation
> +					 * BPF_SOCK_OPS_HDR_OPT_LEN_CB fast path
> +					 */
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>  #else
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 90706a47f6ff..f2092de1f432 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6892,8 +6892,14 @@ enum {
>  	 * options first before the BPF program does.
>  	 */
>  	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> +	/* Fast path to reserve space in a skb under
> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> +	 * opt length doesn't change often, so it can save in the tcp_sock. And
> +	 * set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to no bpf call.
> +	 */
> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>  /* Mask of all currently supported cb flags */
> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>  };
>  
>  /* List of known BPF sock_ops operators.
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index ea7ad7d99245..0e7480a58012 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -488,12 +488,21 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  {
>  	struct bpf_sock_ops_kern sock_ops;
>  	int err;
> +	struct tcp_sock *th = (struct tcp_sock *)sk;
>  
> -	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> +	if (likely(!BPF_SOCK_OPS_TEST_FLAG(th,
>  					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
>  	    !*remaining)
>  		return;
>  
> +	if (likely(BPF_SOCK_OPS_TEST_FLAG(th,
> +					  BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG)) &&
> +	    th->bpf_opt_len) {
> +		*remaining -= th->bpf_opt_len;

What if *remaining value shrinks from one call to the next?

BPF sock_ops program can't react to change. Feels like there should be a
safety check to prevent an underflow.

> +		opts->bpf_opt_len = th->bpf_opt_len;
> +		return;
> +	}
> +
>  	/* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */
>  
>  	/* init sock_ops */
> @@ -538,6 +547,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  	opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len;
>  	/* round up to 4 bytes */
>  	opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3;
> +	th->bpf_opt_len = opts->bpf_opt_len;
>  
>  	*remaining -= opts->bpf_opt_len;
>  }
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 90706a47f6ff..f2092de1f432 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -6892,8 +6892,14 @@ enum {
>  	 * options first before the BPF program does.
>  	 */
>  	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> +	/* Fast path to reserve space in a skb under
> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> +	 * opt length doesn't change often, so it can save in the tcp_sock. And
> +	 * set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to no bpf call.
> +	 */
> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),

Have you considered a bpf_reserve_hdr_opt() flag instead?

An example or test coverage would to show this API extension in action
would help.

>  /* Mask of all currently supported cb flags */
> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>  };
>  
>  /* List of known BPF sock_ops operators.

next prev parent reply	other threads:[~2024-05-15  9:48 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-15  8:19 [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance Feng zhou
2024-05-15  9:48 ` Jakub Sitnicki [this message]
2024-05-16  3:15   ` Feng Zhou
2024-05-16 17:15     ` Jakub Sitnicki
2024-05-16 17:22       ` Jakub Sitnicki
2024-05-17  7:27       ` Feng Zhou
2024-05-31 10:45         ` Jakub Sitnicki
2024-06-05  7:37           ` Feng Zhou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87seyjwgme.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=eddyz87@gmail.com \
    --cc=edumazet@google.com \
    --cc=haoluo@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=laoar.shao@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=wangdongdong.6@bytedance.com \
    --cc=yangzhenze@bytedance.com \
    --cc=yonghong.song@linux.dev \
    --cc=zhoufeng.zf@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.