public inbox for bpf@vger.kernel.org
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Jason Xing <kerneljasonxing@gmail.com>,
	 Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: davem@davemloft.net,  edumazet@google.com,  kuba@kernel.org,
	 pabeni@redhat.com,  horms@kernel.org,  willemb@google.com,
	 martin.lau@kernel.org,  netdev@vger.kernel.org,
	 bpf@vger.kernel.org,  Jason Xing <kernelxing@tencent.com>,
	 Yushan Zhou <katrinzhou@tencent.com>
Subject: Re: [PATCH net-next v2 3/4] bpf-timestamp: keep track of the skb when wait_for_space occurs
Date: Tue, 07 Apr 2026 17:17:41 -0400	[thread overview]
Message-ID: <willemdebruijn.kernel.1b49845d3acc7@gmail.com> (raw)
In-Reply-To: <CAL+tcoCctfs=d9zz5ei1S999S94HmUDSToriZ8NJiLT8MmpQTA@mail.gmail.com>

Jason Xing wrote:
> On Tue, Apr 7, 2026 at 11:33 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Mon, Apr 6, 2026 at 10:37 PM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > On Mon, Apr 6, 2026 at 10:28 AM Willem de Bruijn
> > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > >
> > > > > Jason Xing wrote:
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > The patch is the 1/2 part of push-level granularity feature.
> > > > > >
> > > > > > Tag the skb in tcp_sendmsg_locked() when wait_for_space occurs even
> > > > > > though it might not carry the last byte of the sendmsg.
> > > > > >
> > > > > > Prior to this patch, BPF timestamping could not cover this case.
> > > > > > The following steps reproduce it:
> > > > > > 1) skb A is the current last skb before entering wait_for_space process
> > > > > > 2) tcp_push() pushes A without any tag
> > > > > > 3) A is transmitted from TCP to driver without putting any skb carrying
> > > > > >    timestamps in the error queue, like SCHED, DRV/HARDWARE.
> > > > > > 4) sk_stream_wait_memory() sleeps for a while and then returns with an
> > > > > >    error code. Note that the socket lock is released.
> > > > > > 5) skb A finally gets acked and removed from the rtx queue.
> > > > > > 6) continue with the rest of tcp_sendmsg_locked(): it jumps (via goto)
> > > > > >    to the 'do_error' label and then the 'out' label.
> > > > > > 7) at this moment, skb A turns out to be the last one in this send
> > > > > >    syscall, and misses the tcp_bpf_tx_timestamp() opportunity
> > > > > >    before the final tcp_push()
> > > > > > 8) the BPF script fails to see any timestamps this time
> > > > > >
> > > > > > Signed-off-by: Yushan Zhou <katrinzhou@tencent.com>
> > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > > ---
> > > > > >  net/ipv4/tcp.c | 4 +++-
> > > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > > index c603b90057f6..7d030a11d004 100644
> > > > > > --- a/net/ipv4/tcp.c
> > > > > > +++ b/net/ipv4/tcp.c
> > > > > > @@ -1400,9 +1400,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > > > > >  wait_for_space:
> > > > > >               set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> > > > > >               tcp_remove_empty_skb(sk);
> > > > > > -             if (copied)
> > > > > > +             if (copied) {
> > > > > > +                     tcp_bpf_tx_timestamp(sk);
> > > > > >                       tcp_push(sk, flags & ~MSG_MORE, mss_now,
> > > > > >                                TCP_NAGLE_PUSH, size_goal);
> > > > >
> > > > > Now the number of skbs that will be tracked will be unpredictable,
> > > > > varying based on memory pressure.
> > > >
> > > > Right, I put some effort into writing a selftest to check how many
> > > > push functions get called at one time, but failed to do so.
> > > >
> > > > >
> > > > > That sounds hard to use to me. Especially if these extra pushes
> > > > > cannot be identified as such.
> > > > >
> > > > > Perhaps if all skbs from the same sendmsg call can be identified,
> > > > > that would help explain pattern in data resulting from these
> > > > > uncommon extra data points.
> > > >
> > > > You mean moving tcp_bpf_tx_timestamp() before tcp_skb_entail()? That is
> > > > close to a per-packet basis, without considering skb fragmentation :)
> > >
> > > No, I meant somehow in the notification having a way to identify all
> > > the skbs belonging to the same sendmsg call, to allow filtering on
> > > that. But I also don't immediately see how to do that (without adding
> > > yet another counter say).
> >
> > If we don't build the relationship between skb and sendmsg (just like
> > the SENDMSG sock option), we will have no way to correlate them. If
> > we only care about the skb from the view of the syscall layer, it's
> > fine to move tcp_bpf_tx_timestamp() before tcp_skb_entail(). But on
> > a per-skb basis, where skbs are generated beneath TCP due to GSO/TSO,
> > there is only one way to correlate: adding an additional member to the
> > skb structure to store its sendmsg time. This discussion only applies
> > to use cases like net_timestamping.
> >
> > Well, my key point is that, I have to admit, the above (including the
> > existing BPF script net_timestamping) is a less efficient approach
> > that definitely hurts performance because of the extremely frequent
> > look-ups. It's not suitable for 24x7 observability in production.
> > What we've done internally is make the kernel layer as lightweight as
> > possible and let the timestamping feature write each record into a
> > ring buffer that the application can read, sort and process. This
> > architecture preserves performance. But that's simply how our kernel
> > module is designed, given the need for fast deployment in production.
> > I suppose in the future we could build a userspace tool like blktrace
> > to monitor efficiently instead of the selftest sample. Honestly, I
> > don't like the look-up approach.
> >
> > Since we're modifying the kernel, how about adding a new member to
> > record the sendmsg time, which the BPF program is able to read. The
> > whole scenario looks like this:
> > 1) in tcp_sendmsg_locked(), record the sendmsg time for each skb
> > 2) in either tso_fragment() or tcp_gso_tstamp(), each new skb gets
> > a copy of the original skb's timestamp
> > 3) in each stage, the BPF program reads the skb's sendmsg time and the
> > current time, and then effortlessly does the math.
> >
> > At this point, what I have in mind is two options:
> > 1) only handle the skb from the view of the send syscall layer, which
> > is, for sure, very simple but not thorough.
> > 2) stick to a true per-packet basis, in which case adding a new member
> > seems inevitable. So the question becomes where to add it. Space in
> > the skb structure is very precious :(
> 
> Finding a suitable place to put this timestamp is really hard. IIRC,
> we can't expand struct skb_shared_info so easily, since the effect
> is global.
> 
> I'm wondering if we can turn the per-packet mode into an incompatible
> feature by reusing the 'u32 tskey' to store a microsecond timestamp of
> the sendmsg call.

Agreed that an extra field is hard. We should avoid that.

If the purpose is to group skbs by sendmsg call (e.g., to filter out
all but the last one), it is probably also unnecessary.

From the process's PoV: since the process knows each sendmsg length and
each skb's tskey is a byte offset, it can correlate an skb with a given
sendmsg buffer.

The BPF program is under the control of a third-party admin, so that
does not follow directly. But it can be passed additional metadata.

I thought about passing the offset of the skb from the start of the
sendmsg buffer to identify all consecutive skbs for a sendmsg call,
as each new buffer will start with an skb with offset 0 ..

.. but that won't work as there is no guarantee that a sendmsg call
will not append to an existing outstanding skb.

Anyway, the general idea is to pass some relevant signal to the BPF
program through bpf_skops_tx_timestamping, without having to expand
either the skb or the sk itself.

I hear you on measuring every skb being too frequent. But is calling
the BPF program and letting it decide whether to measure too frequent
as well? BPF program invocation itself should be cheap.

If per-push is preferable, with a filter ability like the above, it
seems more useful to me already.



Thread overview: 17+ messages
2026-04-04 15:04 [PATCH net-next v2 0/4] bpf-timestamp: convert to push-level granularity Jason Xing
2026-04-04 15:04 ` [PATCH net-next v2 1/4] tcp: separate BPF timestamping from tcp_tx_timestamp Jason Xing
2026-04-04 15:04 ` [PATCH net-next v2 2/4] tcp: advance the tsflags check to save cycles Jason Xing
2026-04-06  2:23   ` Willem de Bruijn
2026-04-06 11:48     ` Jason Xing
2026-04-04 15:04 ` [PATCH net-next v2 3/4] bpf-timestamp: keep track of the skb when wait_for_space occurs Jason Xing
2026-04-06  2:28   ` Willem de Bruijn
2026-04-06 11:59     ` Jason Xing
2026-04-06 14:37       ` Willem de Bruijn
2026-04-07  3:33         ` Jason Xing
2026-04-07  7:43           ` Jason Xing
2026-04-07 21:17             ` Willem de Bruijn [this message]
2026-04-08  0:35               ` Jason Xing
2026-04-04 15:04 ` [PATCH net-next v2 4/4] bpf-timestamp: complete tracing the skb from each push in sendmsg Jason Xing
2026-04-06  2:17 ` [PATCH net-next v2 0/4] bpf-timestamp: convert to push-level granularity Willem de Bruijn
2026-04-06 12:25   ` Jason Xing
2026-04-06 14:38     ` Willem de Bruijn
