From: Toke Høiland-Jørgensen
To: Jason Xing, Jesper Dangaard Brouer
Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
 pabeni@redhat.com, horms@kernel.org, willemb@google.com,
 kuniyu@google.com, ast@kernel.org, daniel@iogearbox.net,
 andrii@kernel.org, martin.lau@linux.dev, eddyz87@gmail.com,
 memxor@gmail.com, song@kernel.org, yonghong.song@linux.dev,
 jolsa@kernel.org, john.fastabend@gmail.com, sdf@fomichev.me,
 Simon Sundberg, netdev@vger.kernel.org, bpf@vger.kernel.org,
 Jason Xing
Subject: Re: [PATCH net-next 5/6] bpf: enable bpf timestamping rx in TCP layer
References: <20260518082344.96647-1-kerneljasonxing@gmail.com>
 <20260518082344.96647-6-kerneljasonxing@gmail.com>
 <2942dd24-3b6f-4e88-acb2-67d35ea8938b@kernel.org>
Date: Tue, 19 May 2026 11:57:26 +0200
Message-ID: <87lddfn2m1.fsf@toke.dk>
X-Mailing-List: netdev@vger.kernel.org

Jason Xing writes:

> On Tue, May 19, 2026 at 7:16 AM Jason Xing wrote:
>>
>> On Tue, May 19, 2026 at 12:40 AM Jesper Dangaard Brouer wrote:
>> >
>> >
>> >
>> > On 18/05/2026 15.53, Jason Xing wrote:
>> > > On Mon, May 18, 2026 at 9:01 PM Jesper Dangaard Brouer wrote:
>> > >>
>> > >>
>> > >>
>> > >> On 18/05/2026 10.23, Jason Xing wrote:
>> > >>> From: Jason Xing
>> > >>>
>> > >>> Add two if statements to accurately isolate bpf timestamping and so
>> > >>> timestamping. They can work respectively.
>> > >>>
>> > >>> As to so_timestamping, only add a loose condition via report flags
>> > >>> to avoid duplicate strict checks that is done in tcp_recv_timestamp()
>> > >>> and performance impact. If the loose condition is hit,
>> > >>> tcp_recv_timestamp() is able to handle the exact case and doesn't
>> > >>> hamper the existing timestamping feature.
>> > >>>
>> > >>> Make it work in TCP protocol.
>> > >>>
>> > >>> Signed-off-by: Jason Xing
>> > >>> ---
>> > >>>  net/ipv4/tcp.c | 14 ++++++++++++--
>> > >>>  1 file changed, 12 insertions(+), 2 deletions(-)
>> > >>>
>> > >>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> > >>> index 21ece4c71612..64c69bb3578a 100644
>> > >>> --- a/net/ipv4/tcp.c
>> > >>> +++ b/net/ipv4/tcp.c
>> > >>> @@ -2949,8 +2949,18 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags)
>> > >>>          release_sock(sk);
>> > >>>
>> > >>>          if ((cmsg_flags | msg->msg_get_inq) && ret >= 0) {
>> > >>> -                if (cmsg_flags & TCP_CMSG_TS)
>> > >>> -                        tcp_recv_timestamp(msg, sk, &tss);
>> > >>> +                if (cmsg_flags & TCP_CMSG_TS) {
>> > >>> +                        u32 tsflags = READ_ONCE(sk->sk_tsflags);
>> > >>> +
>> > >>> +                        if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
>> > >>> +                            SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_RX_TIMESTAMPING))
>> > >>> +                                bpf_skops_rx_timestamping(sk, &tss,
>> > >>> +                                                          BPF_SOCK_OPS_TSTAMP_RCV_CB);
>> > >>
>> > >> Does this mean I can enable timestamp reading per cgroup?
>> > >
>> > > Yes, I think so, but I didn't try. One of the natures of sockopt
>> > > feature is supporting cgroup attach.
>> > > cgroup_bpf_prog_attach()/cgroup_bpf_link_attach() is probably
>> > > something that you're looking for.
>> > >
>> >
>> > Sound good
>> >
>> > > IIUC, you can attach the prog onto the cgroup where all the sockets
>> > > are set using the bpf timestamping function. So the current impl is
>> > > cleaner and has better isolation (to filter out those unmatched
>> > > flows).
>> > >
>> > >>
>> > >> In Simon's netstacklat[1] tool we are forced process all RX timestamp
>> > >> (hooking fentry/tcp_recv_timestamp), and then we have a BPF filter[2] on
>> > >> the cgroup IDs that we are interested in (which is a significant
>> > >> overhead, as this is deployed at Cloudflare production scale).
>> > >
>> > > I can feel the pain when filtering in this kind of relatively hot
>> > > path, which is what I'm trying to avoid internally.
>> > > What I've done in
>> > > production (to cover those old kernels) is to just let the kernel
>> > > print the information, that's it, and there is an agent continuously
>> > > gathering the data, doing the match and computing latency. But it's
>> > > overall complicated.
>> > >
>> >
>> > I hope you don't mean your internal/old approach was using printk and
>> > then analyzing this data.
>>
>> Of course not :)
>>
>> The internal approach is to cover the old kernels but doesn't mean the
>> approach is old :P
>>
>> Instead, the internal kernel module is super efficient and I'm trying
>> to ship bpf with such an ability. The fact is we've already deployed
>> in production: 7x24 running, zero sampling.
>>
>> Please see page 24 where there is a brief introduction on how to deal
>> with the log part:
>> https://lpc.events/event/19/contributions/2055/#preview:3846
>> I believe this is the promising direction (ring buffer + lightweight
>> kernel + heavy agent) we're taking.
>>
>> The headache part is that I need to provide an agent written in BPF to
>> do the heavy process.
>>
>> >
>> > > Many thanks here, I'm always interested in hearing more useful and
>> > > real requirements and fancy ideas on how to monitor the latency :) Now
>> >
>> > Simon Sundberg have many more fancy ideas on how
>> > to monitor the latency.
>> > The netstacklat tool is part of Simon's PhD thesis:
>> > - https://doi.org/10.59217/qklv6836
>> >
>> > And we even gotten a paper accepted on netstacklat:
>> > -
>> > https://kau.diva-portal.org/smash/record.jsf?pid=diva2%3A2034009&dswid=3032
>>
>> Sorry, I cannot access this link. Could you give me the title of this paper?
>
> Waiting at the Front Door - Continuous Monitoring of Latency in the
> Host Network Stack
>
> Oh, I guess it hasn't been officially published right? This is the
> reason why I have no way to know the content.

No, it's not published yet; I'll send you a copy off-list :)

-Toke