* [PATCH net-next v2 0/4] bpf-timestamp: convert to push-level granularity
@ 2026-04-04 15:04 Jason Xing
From: Jason Xing @ 2026-04-04 15:04 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, horms, willemb, martin.lau
  Cc: netdev, bpf, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

1. Design of send-level granularity
Originally, socket timestamping was designed to trace each sendmsg call
rather than each packet, because an application that tags packets with
different flags (SCHED/DRV/ACK) needs to issue multiple extra recvmsg()
calls to fetch the skbs carrying the timestamps one by one. That is an
obviously heavy burden if the application expects to see finer-grained
behavior.
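
For reference, here is a minimal userspace sketch of that classic flow
(my own illustration, not part of this series): every tagged skb costs
the application an extra recvmsg(MSG_ERRQUEUE) call to read the
SCM_TIMESTAMPING cmsg off the error queue.

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>
#include <linux/errqueue.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

/* Request SCHED/SND/ACK software timestamps for tagged sends. */
static int enable_tx_timestamping(int fd)
{
	unsigned int flags = SOF_TIMESTAMPING_TX_SCHED |
			     SOF_TIMESTAMPING_TX_SOFTWARE |
			     SOF_TIMESTAMPING_TX_ACK |
			     SOF_TIMESTAMPING_SOFTWARE |
			     SOF_TIMESTAMPING_OPT_ID;

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}

/* One extra syscall per tagged skb: drain one entry of the error queue. */
static void read_one_tx_timestamp(int fd)
{
	char ctrl[512];
	struct msghdr msg = {
		.msg_control	= ctrl,
		.msg_controllen	= sizeof(ctrl),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == SOL_SOCKET &&
		    cm->cmsg_type == SCM_TIMESTAMPING) {
			struct scm_timestamping tss;

			memcpy(&tss, CMSG_DATA(cm), sizeof(tss));
			/* ts[0] holds the software timestamp. */
			printf("sw tstamp %lld.%09ld\n",
			       (long long)tss.ts[0].tv_sec,
			       tss.ts[0].tv_nsec);
		}
	}
}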
Another point, which I touched on at Netdev 0x19 [1]: suppose the data
the application transfers in one call is split into 100 smaller packets.
Recording only the last skb's timestamps (SCHED/DRV/HARDWARE) is no
longer meaningful, because timestamping then records only 1 of the 100
packets. In that case only the delta between when the data was sent and
when it was acked matters.

2. Known missing-tag issues in TCP
A critically important point is that we can miss tagging the last packet
under a few conditions, as patch 3/4 explains. That means we lose track
of that send syscall entirely. Digging further into how
tcp_sendmsg_locked() works, I found it is not feasible to reliably
identify the last skb before the push functions get called. With that
said, if we want the feature to cover all of these cases, we inevitably
need to place the tcp_bpf_tx_timestamp() call before each push function
(roughly sketched below).
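
To illustrate the shape of the change (a sketch only, not the actual
diff; the tcp_bpf_tx_timestamp() argument list below is my guess for
illustration), the hook ends up sitting right in front of every push in
tcp_sendmsg_locked():

		if (forced_push(tp)) {
			tcp_mark_push(tp, skb);
			/* tag the last skb before it is pushed */
			tcp_bpf_tx_timestamp(sk, skb);
			__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
		} else if (skb == tcp_send_head(sk)) {
			tcp_bpf_tx_timestamp(sk, skb);
			tcp_push_one(sk, mss_now);
		}
	...
out:
	if (copied) {
		/* final push for whatever is still pending */
		tcp_bpf_tx_timestamp(sk, tcp_write_queue_tail(sk));
		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
	}

This way every push path (forced push, push-one, and the final push at
the end of the syscall) tags its own last skb, instead of tagging once
per sendmsg and possibly missing it.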

3. Practice at Tencent
In production, we have been running a version that applies a per-packet
policy to do exhaustive profiling of each flow for months, in order to:
1) capture every jitter event with 100% certainty, with no sampling;
2) observe performance, find the bottlenecks and improve them.
We're still collecting data and investigating how it helps us in all the
potential aspects before upstreaming that work. My personal goal is to
eventually replace tcpdump, which no longer satisfies our need for
micro-level observation in a modern data center.

4. The trend toward finer-grained observability
There are already many BPF scripts that implement fine-grained,
per-packet monitoring, and this is clearly where observability is
heading. We face many latency reports (jitter, performance degradation)
on a daily basis, and getting to the root cause of each report is
exactly what we pursue. Once we know which request causes the problem,
and it turns out to be on the kernel side, we dig into the packet
behavior with the extra information included. That is the process of
tracing down a jitter problem. Likewise, since BPF timestamping already
mitigates the impact of the extra syscalls, breaking the coarse
granularity into smaller units is a good first step. Unlike before, it
is no burden on the application, because it is entirely independent of
it.

5. Details of the series
Now it's time to convert the BPF timestamping feature to push-level
granularity by recording only the last skb in each push function, which
is quite similar to how we previously treated each send syscall.
Treating each push as a whole, we only care about its last skb, since
that skb can later be chunked into smaller packets. A BPF program like
progs/net_timestamping.c can then trace each tagged skb and calculate
the latency (see the sketch after this list):
1) the delta between sendmsg and each skb tagged in tcp_sendmsg_locked();
2) the deltas between SCHED/DRV/ACK, with all three timestamps
   correlated with the sendmsg time.
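
For illustration, a minimal sockops sketch in the spirit of
progs/net_timestamping.c (not the actual selftest; it assumes a kernel
and vmlinux.h that already expose the BPF_SOCK_OPS_TSTAMP_* callbacks,
keeps only one in-flight reference timestamp per socket cookie, and
omits the setup that attaches the program and enables the callbacks):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Last sendmsg/tagging time per socket, keyed by socket cookie. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);
	__type(value, __u64);
} sendmsg_ts SEC(".maps");

SEC("sockops")
int tx_timestamping(struct bpf_sock_ops *skops)
{
	__u64 cookie = bpf_get_socket_cookie(skops);
	__u64 now = bpf_ktime_get_ns();
	__u64 *start;

	switch (skops->op) {
	case BPF_SOCK_OPS_TSTAMP_SENDMSG_CB:
		/* Reference point: when the skb was tagged in sendmsg. */
		bpf_map_update_elem(&sendmsg_ts, &cookie, &now, BPF_ANY);
		break;
	case BPF_SOCK_OPS_TSTAMP_SCHED_CB:
	case BPF_SOCK_OPS_TSTAMP_SND_SW_CB:
	case BPF_SOCK_OPS_TSTAMP_ACK_CB:
		/* SCHED/DRV/ACK deltas relative to the sendmsg time. */
		start = bpf_map_lookup_elem(&sendmsg_ts, &cookie);
		if (start)
			bpf_printk("op %u: +%llu ns since sendmsg",
				   skops->op, now - *start);
		break;
	}
	return 1;
}

char _license[] SEC("license") = "GPL";

With push-level granularity, the SENDMSG_CB reference point simply moves
from once per send syscall to once per push, and the same deltas then
describe each pushed chunk.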

In conclusion, push-level granularity is a compromise approach that
covers those corner cases and further extends the capabilities (for
example, a finer-grained view of jitter and performance issues).

[1]: Page 29 of the slides illustrates skb-level granularity:
https://netdevconf.info/0x19/sessions/talk/the-future-of-so_timestamping.html

---
V2
Link: https://lore.kernel.org/all/20260402085831.36983-1-kerneljasonxing@gmail.com/
1. only handle the BPF timestamping feature to cover those issues (Eric, Willem)
2. keep the timestamping functions inline in the send process (Eric)


Jason Xing (4):
  tcp: separate BPF timestamping from tcp_tx_timestamp
  tcp: advance the tsflags check to save cycles
  bpf-timestamp: keep track of the skb when wait_for_space occurs
  bpf-timestamp: complete tracing the skb from each push in sendmsg

 include/net/tcp.h | 20 ++++++++++++++++++++
 net/ipv4/tcp.c    | 23 +++++++++++++----------
 2 files changed, 33 insertions(+), 10 deletions(-)

-- 
2.41.3


