Re: [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs

public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Jason Xing <kerneljasonxing@gmail.com>,
	 Eric Dumazet <edumazet@google.com>
Cc: davem@davemloft.net,  kuba@kernel.org,  pabeni@redhat.com,
	 horms@kernel.org,  willemb@google.com,  netdev@vger.kernel.org,
	 Jason Xing <kernelxing@tencent.com>,
	 Yushan Zhou <katrinzhou@tencent.com>
Subject: Re: [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs
Date: Thu, 02 Apr 2026 15:18:52 -0400	[thread overview]
Message-ID: <willemdebruijn.kernel.672c36ae4e6f@gmail.com> (raw)
In-Reply-To: <CAL+tcoBB+d-PwNzx1ugjDwZZBaQf8vxb3XooM3ARw9CtbiFeqw@mail.gmail.com>

Jason Xing wrote:
> On Thu, Apr 2, 2026 at 11:40 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Apr 2, 2026 at 8:03 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Thu, Apr 2, 2026 at 10:24 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Thu, Apr 2, 2026 at 1:58 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > Tag the skb in tcp_sendmsg_locked() when wait_for_space occurs even
> > > > > though it might not carry the last byte of the sendmsg.
> > > > >
> > > > > If we don't do so, we might be faced with no single timestamp that
> > > > > can be received by application from the error queue. The following steps
> > > > > reproduce this:
> > > > > 1) skb A is the current last skb before entering wait_for_space process
> > > > > 2) tcp_push() pushes A without any tag
> > > > > 3) A is transmitted from TCP to driver without putting any skb carring
> > > > >    timestamps in the error queue, like SCHED, DRV/HARDWARE.
> > > > > 4) sk_stream_wait_memory() sleeps for a while and then returns with an
> > > > >    error code. Note that the socket lock is released.
> > > > > 5) skb A finally gets acked and removed from the rtx queue.
> > > > > 6) continue with the rest of tcp_sendmsg_locked(): it will jump to(goto)
> > > > >    'do_error' label and then 'out' label.
> > > > > 7) at this moment, skb A turns out to be the last one in this send
> > > > >    syscall, and miss the following tcp_tx_timestamp() opportunity before
> > > > >    the final tcp_push
> > > > > 8) application receives no timestamps this time
> > > > >
> > > > > The original commit ad02c4f54782 ("tcp: provide timestamps for partial writes")
> > > > > says it is best effort. Now it's time to cover the only potential point
> > > > > to avoid missing record.
> > > > >
> > > > > The side effect is obvious that we might record more than one time for a
> > > > > single send syscall since the skb that we keep track of in this scenario
> > > > > might not be the last one. But tracing more than one skb is not a bad
> > > > > thing since there is an emerging/promissing trend to do a detailed
> > > > > packet granularity monitor.
> > > >
> > > > This is rather weak/lazy, especially if you do not know how long the
> > > > thread is put to sleep?
> > >
> > > Thanks for the review!
> > >
> > > I actually thought about how to recognize which one should be the last
> > > skb in each send syscall. But it turns out that much more complicated
> > > work needs to be done which hurts the stack and common structures like
> > > tcp_sock. That's why I gave up and instead chose a much simpler way to
> > > minimize the impact.
> > >
> > > >
> > > > >
> > > > > Thanks to the great ID, namely, tskey, application that is responsible
> > > > > for the collect/sort of timestamps leverages it to put that record in
> > > > > between two consecutive send syscalls correctly.
> > > > >
> > > > > Signed-off-by: Yushan Zhou <katrinzhou@tencent.com>
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > >  net/ipv4/tcp.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > index 516087c622ad..2db80d75cfa4 100644
> > > > > --- a/net/ipv4/tcp.c
> > > > > +++ b/net/ipv4/tcp.c
> > > > > @@ -1411,9 +1411,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > >  wait_for_space:
> > > > >                 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> > > > >                 tcp_remove_empty_skb(sk);
> > > > > -               if (copied)
> > > > > +               if (copied) {
> > > > > +                       tcp_tx_timestamp(sk, &sockc);
> > > > >                         tcp_push(sk, flags & ~MSG_MORE, mss_now,
> > > > >                                  TCP_NAGLE_PUSH, size_goal);
> > > > > +               }
> > > > >
> > > > >                 err = sk_stream_wait_memory(sk, &timeo);
> > > > >                 if (err != 0)
> > > >
> > > > I think non blocking model should be used by modern applications,
> > > > especially ones caring about timestamps.
> > >
> > > Please note that BPF timestamping has already been merged. It's a
> > > standalone script that monitors all kinds of flows and discovers those
> > > strange/unexpected behaviors. So we normally don't take a look at the
> > > implementation of each application before spotting any exceptions.
> > >
> > > Besides, sometimes monitoring the latency of the host doesn't have
> > > much to do with the mode of application. People might want to observe
> > > more deeper into the underlying layer. Right now, without the patch,
> > > the monitor possibly ends up missing the record of this send syscall,
> > > which has been reliably reproduced by Yushan.
> > >
> > > >
> > > > This patch has performance implications.
> > > >
> > > >  tcp_tx_timestamp() is quite big and was inlined because it had a single caller.
> > > >
> > > > After this patch, it is no longer inlined.
> > > >
> > > > scripts/bloat-o-meter -t vmlinux.before vmlinux.after
> > > > add/remove: 2/0 grow/shrink: 0/1 up/down: 239/-192 (47)
> > > > Function                                     old     new   delta
> > > > tcp_tx_timestamp                               -     223    +223
> > > > __pfx_tcp_tx_timestamp                         -      16     +16
> > > > tcp_sendmsg_locked                          4560    4368    -192
> > > > Total: Before=29652698, After=29652745, chg +0.00%
> > >
> > > That's right and I really appreciate your recent great effort trying
> > > to mitigate the impact of function calls.
> > >
> > > My take is that it is a trade off. Adding one more track of the skb
> > > (which adds a function call) actually doesn't hurt much especially for
> > > this scenario and the last skb scenario. This path basically is not
> > > that hot. There are some left things to be done to improve the
> > > socket/BPF timestamping feature. And I plan to push more patches
> > > (sure, very carefully) to enhance the ability of BPF timestamping to
> > > observe TCP which is inevitable to add some extra functions and if
> > > statements.
> > >
> > > I'm wondering if the above makes sense to you.
> >
> > If you mostly care about BPF, I would suggest:
> 
> For me, yes, I've been internally working on this stuff for a long
> while and still struggling with what part is worth upstreaming. I'm
> not sure how Willem thinks of the socket timestamping feature.

I think it's fine to move forward with BPF only features. We can make
that call on a case-by-case basis.

     prev parent reply	other threads:[~2026-04-02 19:18 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-02  8:58 [PATCH net-next] net-timestamp: take track of the skb when wait_for_space occurs Jason Xing
2026-04-02 14:24 ` Eric Dumazet
2026-04-02 15:02   ` Jason Xing
2026-04-02 15:39     ` Eric Dumazet
2026-04-02 16:09       ` Jason Xing
2026-04-02 19:18         ` Willem de Bruijn [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=willemdebruijn.kernel.672c36ae4e6f@gmail.com \
    --to=willemdebruijn.kernel@gmail.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=katrinzhou@tencent.com \
    --cc=kerneljasonxing@gmail.com \
    --cc=kernelxing@tencent.com \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox