* [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs
@ 2026-04-02 8:58 Jason Xing
2026-04-02 14:24 ` Eric Dumazet
0 siblings, 1 reply; 6+ messages in thread
From: Jason Xing @ 2026-04-02 8:58 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, horms, willemb
Cc: netdev, Jason Xing, Yushan Zhou
From: Jason Xing <kernelxing@tencent.com>
Tag the skb in tcp_sendmsg_locked() when wait_for_space occurs even
though it might not carry the last byte of the sendmsg.
If we don't, the application might receive no timestamp at all from
the error queue. The following steps reproduce the problem:
1) skb A is the current last skb before entering the wait_for_space path
2) tcp_push() pushes A without any tag
3) A is transmitted from TCP to the driver without any skb carrying
timestamps (SCHED, DRV/HARDWARE, etc.) being put in the error queue
4) sk_stream_wait_memory() sleeps for a while and then returns with an
error code. Note that the socket lock is released.
5) skb A finally gets acked and removed from the rtx queue.
6) the rest of tcp_sendmsg_locked() continues: it jumps (goto) to the
'do_error' label and then to the 'out' label
7) at this point, skb A turns out to be the last one in this send
syscall, and it has missed the tcp_tx_timestamp() opportunity before
the final tcp_push
8) the application receives no timestamps for this syscall
The original commit ad02c4f54782 ("tcp: provide timestamps for partial writes")
says timestamping is best effort. Now cover the one remaining point
where the record can be missed entirely.
The obvious side effect is that we might record more than once for a
single send syscall, since the skb we tag in this scenario might not be
the last one. But tagging more than one skb is not a bad thing: there is
an emerging and promising trend toward monitoring at per-packet
granularity.
Thanks to the tskey ID, an application that collects and sorts
timestamps can use it to place each record between two consecutive
send syscalls correctly.
Signed-off-by: Yushan Zhou <katrinzhou@tencent.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/ipv4/tcp.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 516087c622ad..2db80d75cfa4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1411,9 +1411,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 wait_for_space:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		tcp_remove_empty_skb(sk);
-		if (copied)
+		if (copied) {
+			tcp_tx_timestamp(sk, &sockc);
 			tcp_push(sk, flags & ~MSG_MORE, mss_now,
 				 TCP_NAGLE_PUSH, size_goal);
+		}
 
 		err = sk_stream_wait_memory(sk, &timeo);
 		if (err != 0)
--
2.41.3
^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs
  2026-04-02  8:58 [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs Jason Xing
@ 2026-04-02 14:24 ` Eric Dumazet
  2026-04-02 15:02   ` Jason Xing
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2026-04-02 14:24 UTC (permalink / raw)
To: Jason Xing
Cc: davem, kuba, pabeni, horms, willemb, netdev, Jason Xing, Yushan Zhou

On Thu, Apr 2, 2026 at 1:58 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
[...]
> The side effect is obvious that we might record more than one time for a
> single send syscall since the skb that we keep track of in this scenario
> might not be the last one. But tracing more than one skb is not a bad
> thing since there is an emerging/promissing trend to do a detailed
> packet granularity monitor.

This is rather weak/lazy, especially if you do not know how long the
thread is put to sleep?

[...]
> @@ -1411,9 +1411,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  wait_for_space:
>                 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>                 tcp_remove_empty_skb(sk);
> -               if (copied)
> +               if (copied) {
> +                       tcp_tx_timestamp(sk, &sockc);
>                         tcp_push(sk, flags & ~MSG_MORE, mss_now,
>                                  TCP_NAGLE_PUSH, size_goal);
> +               }
>
>                 err = sk_stream_wait_memory(sk, &timeo);
>                 if (err != 0)

I think non blocking model should be used by modern applications,
especially ones caring about timestamps.

This patch has performance implications.

tcp_tx_timestamp() is quite big and was inlined because it had a single caller.

After this patch, it is no longer inlined.

scripts/bloat-o-meter -t vmlinux.before vmlinux.after
add/remove: 2/0 grow/shrink: 0/1 up/down: 239/-192 (47)
Function                                     old     new   delta
tcp_tx_timestamp                               -     223    +223
__pfx_tcp_tx_timestamp                         -      16     +16
tcp_sendmsg_locked                          4560    4368    -192
Total: Before=29652698, After=29652745, chg +0.00%

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs
  2026-04-02 14:24 ` Eric Dumazet
@ 2026-04-02 15:02   ` Jason Xing
  2026-04-02 15:39     ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Xing @ 2026-04-02 15:02 UTC (permalink / raw)
To: Eric Dumazet
Cc: davem, kuba, pabeni, horms, willemb, netdev, Jason Xing, Yushan Zhou

On Thu, Apr 2, 2026 at 10:24 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 2, 2026 at 1:58 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
[...]
> > The side effect is obvious that we might record more than one time for a
> > single send syscall since the skb that we keep track of in this scenario
> > might not be the last one. But tracing more than one skb is not a bad
> > thing since there is an emerging/promissing trend to do a detailed
> > packet granularity monitor.
>
> This is rather weak/lazy, especially if you do not know how long the
> thread is put to sleep?

Thanks for the review!

I actually thought about how to recognize which skb should be the last
one in each send syscall. But it turned out that much more complicated
work would be needed, which hurts the stack and common structures like
tcp_sock. That's why I gave up and instead chose a much simpler way to
minimize the impact.

[...]
> I think non blocking model should be used by modern applications,
> especially ones caring about timestamps.

Please note that BPF timestamping has already been merged. It's a
standalone script that monitors all kinds of flows and discovers
strange/unexpected behaviors. So we normally don't look at the
implementation of each application before spotting an exception.

Besides, monitoring the latency of the host sometimes doesn't have much
to do with the mode of the application. People might want to observe
deeper into the underlying layers. Right now, without the patch, the
monitor can end up missing the record for this send syscall, which has
been reliably reproduced by Yushan.

> This patch has performance implications.
>
> tcp_tx_timestamp() is quite big and was inlined because it had a single caller.
>
> After this patch, it is no longer inlined.
>
> scripts/bloat-o-meter -t vmlinux.before vmlinux.after
> add/remove: 2/0 grow/shrink: 0/1 up/down: 239/-192 (47)
> Function                                     old     new   delta
> tcp_tx_timestamp                               -     223    +223
> __pfx_tcp_tx_timestamp                         -      16     +16
> tcp_sendmsg_locked                          4560    4368    -192
> Total: Before=29652698, After=29652745, chg +0.00%

That's right, and I really appreciate your recent effort to mitigate
the impact of function calls.

My take is that this is a trade-off. Tagging one more skb (which adds a
function call) doesn't hurt much, especially in this scenario and the
last-skb scenario; this path is not that hot. There is work left to
improve the socket/BPF timestamping feature, and I plan to push more
patches (carefully) to enhance the ability of BPF timestamping to
observe TCP, which inevitably adds some extra functions and if
statements.

I'm wondering if the above makes sense to you.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs
  2026-04-02 15:02 ` Jason Xing
@ 2026-04-02 15:39   ` Eric Dumazet
  2026-04-02 16:09     ` Jason Xing
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2026-04-02 15:39 UTC (permalink / raw)
To: Jason Xing
Cc: davem, kuba, pabeni, horms, willemb, netdev, Jason Xing, Yushan Zhou

On Thu, Apr 2, 2026 at 8:03 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
[...]
> My take is that it is a trade off. Adding one more track of the skb
> (which adds a function call) actually doesn't hurt much especially for
> this scenario and the last skb scenario. This path basically is not
> that hot. There are some left things to be done to improve the
> socket/BPF timestamping feature. And I plan to push more patches
> (sure, very carefully) to enhance the ability of BPF timestamping to
> observe TCP which is inevitable to add some extra functions and if
> statements.
>
> I'm wondering if the above makes sense to you.

If you mostly care about BPF, I would suggest:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd2c3c4587e133897aa057bd3b26c29df050607f..444254b8870c4406a7105cd3ce177745fb1b6c0a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -506,6 +506,20 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
 			bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TSTAMP_SENDMSG_CB);
 }
 
+static void tcp_tx_timestamp_before_wait_memory(struct sock *sk)
+{
+	struct sk_buff *skb;
+
+	if (!cgroup_bpf_enabled(CGROUP_SOCK_OPS) ||
+	    !SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING))
+		return;
+
+	skb = tcp_write_queue_tail(sk);
+	if (skb)
+		bpf_skops_tx_timestamping(sk, skb,
+					  BPF_SOCK_OPS_TSTAMP_SENDMSG_CB);
+}
+
 /* @wake is one when sk_stream_write_space() calls us.
  * This sends EPOLLOUT only if notsent_bytes is half the limit.
  * This mimics the strategy used in sock_def_write_space().
@@ -1403,10 +1417,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 wait_for_space:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		tcp_remove_empty_skb(sk);
-		if (copied)
+		if (copied) {
+			tcp_tx_timestamp_before_wait_memory(sk);
 			tcp_push(sk, flags & ~MSG_MORE, mss_now,
 				 TCP_NAGLE_PUSH, size_goal);
-
+		}
 		err = sk_stream_wait_memory(sk, &timeo);
 		if (err != 0)
 			goto do_error;

Note: You also could use a different skops to differentiate cases in
your monitoring.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/0 up/down: 69/0 (69)
Function                                     old     new   delta
tcp_sendmsg_locked                          4848    4917     +69
Total: Before=29651496, After=29651565, chg +0.00%

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs
  2026-04-02 15:39 ` Eric Dumazet
@ 2026-04-02 16:09   ` Jason Xing
  2026-04-02 19:18     ` Willem de Bruijn
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Xing @ 2026-04-02 16:09 UTC (permalink / raw)
To: Eric Dumazet
Cc: davem, kuba, pabeni, horms, willemb, netdev, Jason Xing, Yushan Zhou

On Thu, Apr 2, 2026 at 11:40 PM Eric Dumazet <edumazet@google.com> wrote:
>
[...]
> If you mostly care about BPF, I would suggest:

For me, yes, I've been internally working on this stuff for a long
while and still struggling with what part is worth upstreaming. I'm
not sure how Willem thinks of the socket timestamping feature.

> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index bd2c3c4587e133897aa057bd3b26c29df050607f..444254b8870c4406a7105cd3ce177745fb1b6c0a 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -506,6 +506,20 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
>                         bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TSTAMP_SENDMSG_CB);
>  }
>
> +static void tcp_tx_timestamp_before_wait_memory(struct sock *sk)
> +{
> +       struct sk_buff *skb;
> +
> +       if (!cgroup_bpf_enabled(CGROUP_SOCK_OPS) ||
> +           !SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING))
> +               return;
> +
> +       skb = tcp_write_queue_tail(sk);
> +       if (skb)
> +               bpf_skops_tx_timestamping(sk, skb,
> +                                         BPF_SOCK_OPS_TSTAMP_SENDMSG_CB);
> +}
> +
[...]
> @@ -1403,10 +1417,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  wait_for_space:
>                 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>                 tcp_remove_empty_skb(sk);
> -               if (copied)
> +               if (copied) {
> +                       tcp_tx_timestamp_before_wait_memory(sk);
>                         tcp_push(sk, flags & ~MSG_MORE, mss_now,
>                                  TCP_NAGLE_PUSH, size_goal);
> -
> +               }
>                 err = sk_stream_wait_memory(sk, &timeo);
>                 if (err != 0)
>                         goto do_error;

Great, thanks, Eric!!

> Note: You also could use a different skops to differentiate cases in
> your monitoring.
>
> $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
> add/remove: 0/0 grow/shrink: 1/0 up/down: 69/0 (69)
> Function                                     old     new   delta
> tcp_sendmsg_locked                          4848    4917     +69
> Total: Before=29651496, After=29651565, chg +0.00%

Thanks again. I will give it a try tomorrow morning :)

Thanks,
Jason

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs
  2026-04-02 16:09 ` Jason Xing
@ 2026-04-02 19:18   ` Willem de Bruijn
  0 siblings, 0 replies; 6+ messages in thread
From: Willem de Bruijn @ 2026-04-02 19:18 UTC (permalink / raw)
To: Jason Xing, Eric Dumazet
Cc: davem, kuba, pabeni, horms, willemb, netdev, Jason Xing, Yushan Zhou

Jason Xing wrote:
> On Thu, Apr 2, 2026 at 11:40 PM Eric Dumazet <edumazet@google.com> wrote:
> >
[...]
> > If you mostly care about BPF, I would suggest:
>
> For me, yes, I've been internally working on this stuff for a long
> while and still struggling with what part is worth upstreaming. I'm
> not sure how Willem thinks of the socket timestamping feature.

I think it's fine to move forward with BPF only features. We can make
that call on a case-by-case basis.

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-04-02 19:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-02  8:58 [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs Jason Xing
2026-04-02 14:24 ` Eric Dumazet
2026-04-02 15:02   ` Jason Xing
2026-04-02 15:39     ` Eric Dumazet
2026-04-02 16:09       ` Jason Xing
2026-04-02 19:18         ` Willem de Bruijn
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox