public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Jason Xing <kerneljasonxing@gmail.com>,
	 Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: davem@davemloft.net,  edumazet@google.com,  kuba@kernel.org,
	 pabeni@redhat.com,  horms@kernel.org,  willemb@google.com,
	 martin.lau@kernel.org,  netdev@vger.kernel.org,
	 bpf@vger.kernel.org,  Jason Xing <kernelxing@tencent.com>
Subject: Re: [PATCH net-next v2 0/4] bpf-timestamp: convert to push-level granularity
Date: Mon, 06 Apr 2026 10:38:58 -0400	[thread overview]
Message-ID: <willemdebruijn.kernel.1cbb439d5cc4@gmail.com> (raw)
In-Reply-To: <CAL+tcoDuSyQdR83C-07bUpJfYdnrkkcA8gO+0qKZM8PTbiFofA@mail.gmail.com>

Jason Xing wrote:
> On Mon, Apr 6, 2026 at 10:17 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > 1. Design of send-level granularity
> > > Originally, socket timestamping was designed to support tracing each
> > > sendmsg instead of per packet because application needs to issue
> > > multiple extra recvmsg() calls to get the skbs carrying the timestamp
> > > one by one if application chooses tag with different tags(SCHED/DRV/ACK).
> > > It's an obvious huge burden if the application expects to see a finer
> > > grained behavior.
> > > Another point I mentioned a bit in Netdev 0x19[1], supposing the amount of
> > > data that application tries to transfer at one time is split into 100
> > > smaller packets, recording the last skb's timestamps (SCHED/DRV/HARDWARE)
> > > is no longer meaningful because at the moment timestamping only records
> > > 1/100 packets. In this case, only the delta between when to send and when
> > > to ack matters.
> > >
> > > 2. Known missing tag issues in TCP
> > > A critically important thing is that we can miss tagging the last packet
> > > in a few conditions as the patch 3/4 explains. That means we lose track
> > > of the send syscall. Digging into more into how tcp_sendmsg_locked works,
> > > I found it's not feasible to successfully identify the last skb before
> > > push functions get called. With that said, if we want to make the feature
> > > better to cover all of these cases, we inevitably needs to place
> > > tcp_bpf_tx_timestamp() function before each push function.
> > >
> > > 3. Practice at Tencent
> > > In production, we have a version that applies the packet basis policy to
> > > do the exhaustive profiling of each flow for months in order to:
> > > 1) 100% make sure to capture the jitter event. No sampling.
> > > 2) observe the performance, find the bottleneck and improve it.
> > > We're still collecting data and investigating how it helps us in all the
> > > potential aspects before upstreaming. My personal perspective on this is
> > > to replace tcpdump eventually. It's worth mentioning tcpdump no longer
> > > satisfies our micro observation in modern data center.
> > >
> > > 4. The tendency toward finer-grained observability
> > > As we're aware that there are already many various bpf scripts trying to
> > > implement the fine grained monitor of the packets, it's an unstoppable
> > > tendency for the future observability. We're faced with so many latency
> > > reports (like jitter, perf degradation) on a daily basis. Getting the
> > > root cause of each report is exactly what we pursue.
> > > After we know which request causes the problem, if it belongs to kernel,
> > > we will dig into the packet behavior with more useful information
> > > included. This is the process of tracing down the jitter problem.
> > > Likewise, in BPF timestamping that mitigates the impact of calling extra
> > > syscalls, breaking the coarse granularity into smaller ones is a first
> > > good way to go. It shouldn't be the burden like before especially it's
> > > independent of application.
> > >
> > > 5. Details of the series
> > > Now it's time to convert BPF timestamping feature into push-level
> > > granularity by only recording the last skb in each push function, which
> > > is quite similar to how we previously treat each send syscall.
> > > Regarding each push function as a whole, we only care about
> > > the last skb from each push since the skb can be chunked into different
> > > smaller packets. BPF scripts like progs/net_timestamping.c has the
> > > ability to trace each tagged skb and calculate the latency:
> > > 1) delta between send and each tagged skb in tcp_sendmsg_locked()
> > > 2) delta between SCHED/DRV/ACK. Three timestamps are also correlated
> > >    with the sendmsg time.
> > >
> > > In conclusion, push-level is more of a compromise approach which covers
> > > those corner cases and further enhances the capabilities (like a finer
> > > grained observation of jitter and performance issues).
> >
> > # push-level design
> >
> > It it significantly less intuitive than per-syscall, which is under
> > user control. Or even than per-packet. As a fix for missing timestamps
> > I understand these two extensions, even with the unintended side
> > effect of reporting many unnecessary extra skbs in the common case.
> > As a model to advocate for, less so.
> 
> Thanks for your review!
> 
> Fair enough. One mode of our internal kernel module directly hijacks
> tcp_skb_entail() that handles the last skb of this skb if
> fragmentation happens.
> 
> >
> > Would it help if all skbs from the same sendmsg() can still be
> > identified as common from the same syscall? That allows the user
> 
> I have to add more comments about push level:
> the last skb from each push function will always be correlated with
> its own sendmsg. With the help of BPF_SOCK_OPS_TSTAMP_SENDMSG_CB, bpf
> script can do so.
> 
> > to discard all but the last one (if they wish)
> 
> Oh, I just replied with another reply. Let's start the discussion here.
> 
> It would be great if we have a definite finer grained observability.
> 
> >
> >
> > # ABI changes
> >
> > For SO_TIMESTAMPING we would not be able to make this change
> > unconditionally as the behavior change would break existing
> > application expectations.
> 
> Right.
> 
> >
> > That is why historically we have guarded new behabvior behind new
> > TS options flags.
> >
> > The same may be true for BPF.
> 
> How about adding a socket option for a per packet mode, say,
> BPF_SOCK_OPS_TSTAMP_TCP_PACKET_CB around tcp_skb_entail() which works
> very similarly to
> BPF_SOCK_OPS_TSTAMP_SENDMSG_CB. After that, users have a standalone
> option to decide whether to trace all the skbs from the sendmsg.
> 
> If so, the origin BPF timestamping that works exactly like socket
> timestamping is the best effort (we don't 100% guarantee the timestamping
> feature captures every sendmsg call). With the new socket option
> involved, we provide a finer grained vision without harming users who
> favor the origin design with those two issues resolved.
> 
> A kind reminder is that if the skb is fragmented, for instance, due to
> TSO being disabled, only the last smaller/child one will be monitored.

See also my reply in the other thread. Yes, per-packet may be more
informative and understandable as policy than per-push.

      reply	other threads:[~2026-04-06 14:39 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-04 15:04 [PATCH net-next v2 0/4] bpf-timestamp: convert to push-level granularity Jason Xing
2026-04-04 15:04 ` [PATCH net-next v2 1/4] tcp: separate BPF timestamping from tcp_tx_timestamp Jason Xing
2026-04-04 15:04 ` [PATCH net-next v2 2/4] tcp: advance the tsflags check to save cycles Jason Xing
2026-04-06  2:23   ` Willem de Bruijn
2026-04-06 11:48     ` Jason Xing
2026-04-04 15:04 ` [PATCH net-next v2 3/4] bpf-timestamp: keep track of the skb when wait_for_space occurs Jason Xing
2026-04-06  2:28   ` Willem de Bruijn
2026-04-06 11:59     ` Jason Xing
2026-04-06 14:37       ` Willem de Bruijn
2026-04-07  3:33         ` Jason Xing
2026-04-07  7:43           ` Jason Xing
2026-04-04 15:04 ` [PATCH net-next v2 4/4] bpf-timestamp: complete tracing the skb from each push in sendmsg Jason Xing
2026-04-06  2:17 ` [PATCH net-next v2 0/4] bpf-timestamp: convert to push-level granularity Willem de Bruijn
2026-04-06 12:25   ` Jason Xing
2026-04-06 14:38     ` Willem de Bruijn [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=willemdebruijn.kernel.1cbb439d5cc4@gmail.com \
    --to=willemdebruijn.kernel@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=kerneljasonxing@gmail.com \
    --cc=kernelxing@tencent.com \
    --cc=kuba@kernel.org \
    --cc=martin.lau@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox