From: John Fastabend <john.fastabend@gmail.com>
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org,
zhoufeng.zf@bytedance.com, jakub@cloudflare.com,
zijianzhang@bytedance.com, Amery Hung <amery.hung@bytedance.com>,
Cong Wang <cong.wang@bytedance.com>
Subject: Re: [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking
Date: Fri, 30 May 2025 13:07:35 -0700
Message-ID: <20250530200735.hhzeicomnb7mbwdl@gmail.com>
In-Reply-To: <20250519203628.203596-5-xiyou.wangcong@gmail.com>
On 2025-05-19 13:36:28, Cong Wang wrote:
> From: Zijian Zhang <zijianzhang@bytedance.com>
>
> The TCP_BPF ingress redirection path currently lacks the message corking
> mechanism found in standard TCP. This causes the sender to wake up the
> receiver for every message, even when messages are small, resulting in
> reduced throughput compared to regular TCP in certain scenarios.
>
> This change introduces a kernel worker-based intermediate layer to provide
> automatic message corking for TCP_BPF. While this adds a slight latency
> overhead, it significantly improves overall throughput by reducing
> unnecessary wake-ups and reducing the sock lock contention.
>
> Reviewed-by: Amery Hung <amery.hung@bytedance.com>
> Co-developed-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> ---
> include/linux/skmsg.h | 19 ++++
> net/core/skmsg.c | 139 ++++++++++++++++++++++++++++-
> net/ipv4/tcp_bpf.c | 197 ++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 347 insertions(+), 8 deletions(-)
[...]
> + /* At this point, the data has been handled well. If one of the
> + * following conditions is met, we can notify the peer socket in
> + * the context of this system call immediately.
> + * 1. If the write buffer has been used up;
> + * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
> + * 3. Or, the ingress queue was empty;
> + * 4. Or, the tcp socket is set to no_delay.
> + * Otherwise, kick off the backlog work so that we can have some
> + * time to wait for any incoming messages before sending a
> + * notification to the peer socket.
> + */
OK, this series looks like it should work to me. See one small comment
below. Also, going by the perf numbers in the cover letter, is the latency
difference reduced or removed if the socket is set to no_delay?
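To make the no_delay question concrete, here is a minimal userspace sketch
of the flush decision in the hunk quoted just below. It is not the patch
code. The struct and the function name should_notify_now() are hypothetical,
NAGLE_OFF and NAGLE_CORK mirror what I understand TCP_NAGLE_OFF and
TCP_NAGLE_CORK to be in include/net/tcp.h, and SKETCH_GSO_SIZE is a
placeholder because the real TCP_BPF_GSO_SIZE value is not shown in this
hunk.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed to match TCP_NAGLE_OFF / TCP_NAGLE_CORK in include/net/tcp.h. */
#define NAGLE_OFF  1u
#define NAGLE_CORK 2u

/* Placeholder only: the real TCP_BPF_GSO_SIZE is defined by the patch. */
#define SKETCH_GSO_SIZE 65536u

/* Hypothetical bundle of the inputs the quoted condition looks at. */
struct flush_inputs {
	bool write_buf_free;     /* stands in for sk_stream_memory_free(sk) */
	size_t tot_size;         /* bytes queued by this sendmsg call */
	bool ingress_msg_empty;  /* peer ingress queue was empty */
	unsigned int nonagle;    /* stands in for tcp_sk(sk)->nonagle */
};

/*
 * Returns true when the peer would be notified in the context of the
 * system call, false when the notification is left to the delayed
 * backlog work so that small messages can be corked together.
 */
static bool should_notify_now(const struct flush_inputs *in)
{
	/* no_delay only counts when the socket is not also corked. */
	bool nodelay = !(in->nonagle & NAGLE_CORK) &&
		       (in->nonagle & NAGLE_OFF);

	return !in->write_buf_free ||
	       in->tot_size >= SKETCH_GSO_SIZE ||
	       in->ingress_msg_empty ||
	       nodelay;
}
```

Under these assumptions a small message on a no_delay socket always takes
the immediate path, so the corking delay should not apply to it unless
TCP_CORK is also set.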
> + nonagle = tcp_sk(sk)->nonagle;
> + if (!sk_stream_memory_free(sk) ||
> + tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
> + (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
> + release_sock(sk);
> + psock->backlog_work_delayed = false;
> + sk_psock_backlog_msg(psock);
> + lock_sock(sk);
> + } else {
> + sk_psock_run_backlog_work(psock, false);
> + }
> +
> +error:
> + sk_psock_put(sk_redir, psock);
> + return ret;
> +}
> +
> static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
> struct sk_msg *msg, int *copied, int flags)
> {
> @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
> cork = true;
> psock->cork = NULL;
> }
> - release_sock(sk);
>
> - origsize = msg->sg.size;
> - ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
> - msg, tosend, flags);
> - sent = origsize - msg->sg.size;
> + if (redir_ingress) {
> + ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
> + } else {
> + release_sock(sk);
> +
> + origsize = msg->sg.size;
> + ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
> + msg, tosend, flags);
nit: can we drop the redir_ingress argument from tcp_bpf_sendmsg_redir()
at this point? It should no longer handle ingress, right? A follow-up
patch would probably be fine.
> + sent = origsize - msg->sg.size;
> +
> + lock_sock(sk);
> + sk_mem_uncharge(sk, sent);
> + }
>
> if (eval == __SK_REDIRECT)
> sock_put(sk_redir);
Thanks.