* [PATCH net-next v2 0/4] bpf-timestamp: convert to push-level granularity
@ 2026-04-04 15:04 Jason Xing
From: Jason Xing @ 2026-04-04 15:04 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, horms, willemb, martin.lau
  Cc: netdev, bpf, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

1. Design of send-level granularity
Originally, socket timestamping was designed to trace each sendmsg call
rather than each packet, because an application that tags packets with
different flags (SCHED/DRV/ACK) needs to issue multiple extra recvmsg()
calls to fetch the skbs carrying the timestamps one by one. That is an
obviously heavy burden if the application expects to see finer-grained
behavior.
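
For reference, here is a minimal userspace sketch of that classic flow
(my own illustration, not part of this series): every tagged skb costs
the application an extra recvmsg(MSG_ERRQUEUE) call to read the
SCM_TIMESTAMPING cmsg off the error queue.

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>
#include <linux/errqueue.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

/* Request SCHED/SND/ACK software timestamps for tagged sends. */
static int enable_tx_timestamping(int fd)
{
	unsigned int flags = SOF_TIMESTAMPING_TX_SCHED |
			     SOF_TIMESTAMPING_TX_SOFTWARE |
			     SOF_TIMESTAMPING_TX_ACK |
			     SOF_TIMESTAMPING_SOFTWARE |
			     SOF_TIMESTAMPING_OPT_ID;

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}

/* One extra syscall per tagged skb: drain one entry of the error queue. */
static void read_one_tx_timestamp(int fd)
{
	char ctrl[512];
	struct msghdr msg = {
		.msg_control	= ctrl,
		.msg_controllen	= sizeof(ctrl),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == SOL_SOCKET &&
		    cm->cmsg_type == SCM_TIMESTAMPING) {
			struct scm_timestamping tss;

			memcpy(&tss, CMSG_DATA(cm), sizeof(tss));
			/* ts[0] holds the software timestamp. */
			printf("sw tstamp %lld.%09ld\n",
			       (long long)tss.ts[0].tv_sec,
			       tss.ts[0].tv_nsec);
		}
	}
}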
Another point, which I touched on at Netdev 0x19 [1]: suppose the data
the application transfers in one call is split into 100 smaller packets.
Recording only the last skb's timestamps (SCHED/DRV/HARDWARE) is no
longer meaningful, because timestamping then records only 1 of the 100
packets. In that case only the delta between when the data was sent and
when it was acked matters.

2. Known missing-tag issues in TCP
A critically important point is that we can miss tagging the last packet
under a few conditions, as patch 3/4 explains. That means we lose track
of that send syscall entirely. Digging further into how
tcp_sendmsg_locked() works, I found it is not feasible to reliably
identify the last skb before the push functions get called. With that
said, if we want the feature to cover all of these cases, we inevitably
need to place the tcp_bpf_tx_timestamp() call before each push function
(roughly sketched below).
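
To illustrate the shape of the change (a sketch only, not the actual
diff; the tcp_bpf_tx_timestamp() argument list below is my guess for
illustration), the hook ends up sitting right in front of every push in
tcp_sendmsg_locked():

		if (forced_push(tp)) {
			tcp_mark_push(tp, skb);
			/* tag the last skb before it is pushed */
			tcp_bpf_tx_timestamp(sk, skb);
			__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
		} else if (skb == tcp_send_head(sk)) {
			tcp_bpf_tx_timestamp(sk, skb);
			tcp_push_one(sk, mss_now);
		}
	...
out:
	if (copied) {
		/* final push for whatever is still pending */
		tcp_bpf_tx_timestamp(sk, tcp_write_queue_tail(sk));
		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
	}

This way every push path (forced push, push-one, and the final push at
the end of the syscall) tags its own last skb, instead of tagging once
per sendmsg and possibly missing it.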

3. Practice at Tencent
In production, we have been running a version that applies a per-packet
policy to do exhaustive profiling of each flow for months, in order to:
1) capture every jitter event with 100% certainty, with no sampling;
2) observe performance, find the bottlenecks and improve them.
We're still collecting data and investigating how it helps us in all the
potential aspects before upstreaming that work. My personal goal is to
eventually replace tcpdump, which no longer satisfies our need for
micro-level observation in a modern data center.

4. The trend toward finer-grained observability
There are already many BPF scripts that implement fine-grained,
per-packet monitoring, and this is clearly where observability is
heading. We face many latency reports (jitter, performance degradation)
on a daily basis, and getting to the root cause of each report is
exactly what we pursue. Once we know which request causes the problem,
and it turns out to be on the kernel side, we dig into the packet
behavior with the extra information included. That is the process of
tracing down a jitter problem. Likewise, since BPF timestamping already
mitigates the impact of the extra syscalls, breaking the coarse
granularity into smaller units is a good first step. Unlike before, it
is no burden on the application, because it is entirely independent of
it.

5. Details of the series
Now it's time to convert the BPF timestamping feature to push-level
granularity by recording only the last skb in each push function, which
is quite similar to how we previously treated each send syscall.
Treating each push as a whole, we only care about its last skb, since
that skb can later be chunked into smaller packets. A BPF program like
progs/net_timestamping.c can then trace each tagged skb and calculate
the latency (see the sketch after this list):
1) the delta between sendmsg and each skb tagged in tcp_sendmsg_locked();
2) the deltas between SCHED/DRV/ACK, with all three timestamps
   correlated with the sendmsg time.
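
For illustration, a minimal sockops sketch in the spirit of
progs/net_timestamping.c (not the actual selftest; it assumes a kernel
and vmlinux.h that already expose the BPF_SOCK_OPS_TSTAMP_* callbacks,
keeps only one in-flight reference timestamp per socket cookie, and
omits the setup that attaches the program and enables the callbacks):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Last sendmsg/tagging time per socket, keyed by socket cookie. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);
	__type(value, __u64);
} sendmsg_ts SEC(".maps");

SEC("sockops")
int tx_timestamping(struct bpf_sock_ops *skops)
{
	__u64 cookie = bpf_get_socket_cookie(skops);
	__u64 now = bpf_ktime_get_ns();
	__u64 *start;

	switch (skops->op) {
	case BPF_SOCK_OPS_TSTAMP_SENDMSG_CB:
		/* Reference point: when the skb was tagged in sendmsg. */
		bpf_map_update_elem(&sendmsg_ts, &cookie, &now, BPF_ANY);
		break;
	case BPF_SOCK_OPS_TSTAMP_SCHED_CB:
	case BPF_SOCK_OPS_TSTAMP_SND_SW_CB:
	case BPF_SOCK_OPS_TSTAMP_ACK_CB:
		/* SCHED/DRV/ACK deltas relative to the sendmsg time. */
		start = bpf_map_lookup_elem(&sendmsg_ts, &cookie);
		if (start)
			bpf_printk("op %u: +%llu ns since sendmsg",
				   skops->op, now - *start);
		break;
	}
	return 1;
}

char _license[] SEC("license") = "GPL";

With push-level granularity, the SENDMSG_CB reference point simply moves
from once per send syscall to once per push, and the same deltas then
describe each pushed chunk.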

In conclusion, push-level granularity is a compromise approach that
covers those corner cases and further extends the capabilities (for
example, a finer-grained view of jitter and performance issues).

[1]: Page 29 of the slides illustrates skb-level granularity:
https://netdevconf.info/0x19/sessions/talk/the-future-of-so_timestamping.html

---
V2
Link: https://lore.kernel.org/all/20260402085831.36983-1-kerneljasonxing@gmail.com/
1. only handle the BPF timestamping feature to cover those issues (Eric, Willem)
2. keep the timestamping functions inline in the send process (Eric)


Jason Xing (4):
  tcp: separate BPF timestamping from tcp_tx_timestamp
  tcp: advance the tsflags check to save cycles
  bpf-timestamp: keep track of the skb when wait_for_space occurs
  bpf-timestamp: complete tracing the skb from each push in sendmsg

 include/net/tcp.h | 20 ++++++++++++++++++++
 net/ipv4/tcp.c    | 23 +++++++++++++----------
 2 files changed, 33 insertions(+), 10 deletions(-)

-- 
2.41.3


