All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: Zijian Zhang <zijianzhang@bytedance.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>,
	 netdev@vger.kernel.org, bpf@vger.kernel.org,
	 john.fastabend@gmail.com, zhoufeng.zf@bytedance.com,
	 Amery Hung <amery.hung@bytedance.com>,
	 Cong Wang <cong.wang@bytedance.com>
Subject: Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking
Date: Thu, 03 Jul 2025 13:32:08 +0200	[thread overview]
Message-ID: <87a55lmrwn.fsf@cloudflare.com> (raw)
In-Reply-To: <509939c4-2e3e-41a6-888f-cbbf6d4c93cb@bytedance.com> (Zijian Zhang's message of "Thu, 3 Jul 2025 10:17:09 +0800")

On Thu, Jul 03, 2025 at 10:17 AM +08, Zijian Zhang wrote:
> On 7/2/25 8:17 PM, Jakub Sitnicki wrote:
>> On Mon, Jun 30, 2025 at 06:12 PM -07, Cong Wang wrote:
>>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>>
>>> The TCP_BPF ingress redirection path currently lacks the message corking
>>> mechanism found in standard TCP. This causes the sender to wake up the
>>> receiver for every message, even when messages are small, resulting in
>>> reduced throughput compared to regular TCP in certain scenarios.
>> I'm curious what scenarios are you referring to? Is it send-to-local or
>> ingress-to-local? [1]
>> 
>
> Thanks for your attention and detailed reviewing!
> We are referring to "send-to-local" here.
>
>> If the sender is emitting small messages, that's probably intended -
>> that is they likely want to get the message across as soon as possible,
>> because They must have disabled the Nagle algo (set TCP_NODELAY) to do
>> that.
>> Otherwise, you get small segment merging on the sender side by default.
>> And if MTU is a limiting factor, you should also be getting batching
>> from GRO.
>> What I'm getting at is that I don't quite follow why you don't see
>> sufficient batching before the sockmap redirect today?
>> 
>
> IMHO,
>
> In “send-to-local” case, both sender and receiver sockets are added to
> the sockmap. Their protocol is modified from TCP to eBPF_TCP, so that
> sendmsg will invoke “tcp_bpf_sendmsg” instead of “tcp_sendmsg”. In this
> case, the whole process is building a skmsg and moving it to the
> receiver socket’s queue immediately. In this process, there is no
> sk_buff generated, and we cannot benefit from TCP stack optimizations.
> As a result, small segments will not be merged by default, that's the
> reason why I am implementing skmsg coalescing here.
>
>>> This change introduces a kernel worker-based intermediate layer to provide
>>> automatic message corking for TCP_BPF. While this adds a slight latency
>>> overhead, it significantly improves overall throughput by reducing
>>> unnecessary wake-ups and reducing the sock lock contention.
>> "Slight" for a +5% increase in latency is an understatement :-)
>> IDK about this being always on for every socket. For send-to-local
>> [1], sk_msg redirs can be viewed as a form of IPC, where latency
>> matters.
>> I do understand that you're trying to optimize for bulk-transfer
>> workloads, but please consider also request-response workloads.
>> [1]
>> https://github.com/jsitnicki/kubecon-2024-sockmap/blob/main/cheatsheet-sockmap-redirect.png
>> 
>
> Totally understand that request-response workloads are also very
> important.
>
> Here are my thoughts:
>
> I had an idea before: when the user sets NO_DELAY, we could follow the
> original code path. However, I'm concerned about a specific scenario: if
> a user sends part of a message and then sets NO_DELAY to send another
> message, it's possible that messages sent via kworker haven't yet
> reached the ingress_msg (maybe due to delayed kworker scheduling), while
> the messages sent with NO_DELAY have already arrived. This could disrupt
> the order. Since there's no TCP packet formation or retransmission
> mechanism in this process, once the order is disrupted, it stays that
> way.
>
> As a result, I propose,
>
> - When the user sets NO_DELAY, introduce a wait (I haven't determined
> the exact location yet; maybe in tcp_bpf_sendmsg) to ensure all messages
> from kworker are sent before proceeding. Then follow the original path
> for packet transmission.
>
> - When the user switches back from NO_DELAY to DELAY, it's less of an issue. We
>  can simply follow our current code path.
>
> If 5% degradation is not a blocker for this patchset, and the solution
> looks good to you, we will do it in the next patchset.

I'm all for reaping the benefits of batching, but I'm not thrilled about
having a backlog worker on the path. The one we have on the sk_skb path
has been a bottleneck:

1) There's no backpressure propagation so you can have a backlog
build-up. One thing to check is what happens if the receiver closes its
window.

2) There's a scheduling latency. That's why the performance of splicing
sockets with sockmap (ingress-to-egress) looks bleak [1].

So I have to dig deeper...

Have you considered and/or evaluated any alternative designs? For
instance, what stops us from having an auto-corking / coalescing
strategy on the sender side?

[1] https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/

  reply	other threads:[~2025-07-03 11:32 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-01  1:11 [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
2025-07-01  1:11 ` [Patch bpf-next v4 1/4] skmsg: rename sk_msg_alloc() to sk_msg_expand() Cong Wang
2025-07-02  9:24   ` Jakub Sitnicki
2025-07-01  1:11 ` [Patch bpf-next v4 2/4] skmsg: implement slab allocator cache for sk_msg Cong Wang
2025-07-02 11:36   ` Jakub Sitnicki
2025-07-01  1:12 ` [Patch bpf-next v4 3/4] skmsg: save some space in struct sk_psock Cong Wang
2025-07-02 11:46   ` Jakub Sitnicki
2025-07-01  1:12 ` [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking Cong Wang
2025-07-02 12:17   ` Jakub Sitnicki
2025-07-03  2:17     ` Zijian Zhang
2025-07-03 11:32       ` Jakub Sitnicki [this message]
2025-07-04  4:20         ` Cong Wang
2025-07-07 17:51           ` Jakub Sitnicki
2025-07-15  0:26             ` Zijian Zhang
2025-07-02 10:22 ` [Patch bpf-next v4 0/4] " Jakub Sitnicki
2025-07-03  1:48   ` Zijian Zhang
2025-07-02 11:05 ` Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a55lmrwn.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=amery.hung@bytedance.com \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=john.fastabend@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=xiyou.wangcong@gmail.com \
    --cc=zhoufeng.zf@bytedance.com \
    --cc=zijianzhang@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.