From: Martin KaFai Lau <martin.lau@linux.dev>
To: Jason Xing <kerneljasonxing@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@google.com>,
    Alexei Starovoitov <ast@kernel.org>,
    Daniel Borkmann <daniel@iogearbox.net>,
    Andrii Nakryiko <andrii@kernel.org>,
    Eduard Zingerman <eddyz87@gmail.com>,
    Kumar Kartikeya Dwivedi <memxor@gmail.com>,
    Yonghong Song <yonghong.song@linux.dev>,
    John Fastabend <john.fastabend@gmail.com>,
    Stanislav Fomichev <sdf@fomichev.me>,
    Eric Dumazet <edumazet@google.com>,
    Neal Cardwell <ncardwell@google.com>,
    Willem de Bruijn <willemb@google.com>,
    Tenzin Ukyab <ukyab@berkeley.edu>,
    Kuniyuki Iwashima <kuni1840@gmail.com>,
    bpf@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
Date: Wed, 27 May 2026 12:46:25 -0700
Message-ID: <2026527192229.olv9.martin.lau@linux.dev>
In-Reply-To: <CAL+tcoCSU9KFZsV33LE2R89iUzb5b3Xm+h6QObs5GOuRYR3sMQ@mail.gmail.com>

On Wed, May 27, 2026 at 12:01:11PM +0800, Jason Xing wrote:
> On Wed, May 27, 2026 at 6:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On Tue, May 26, 2026 at 02:21:56PM -0700, Kuniyuki Iwashima wrote:
> > > On Tue, May 26, 2026 at 1:34 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > >
> > > > On Sat, May 23, 2026 at 08:29:32AM +0000, Kuniyuki Iwashima wrote:
> > > > > When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
> > > > > prog can be called with BPF_SOCK_OPS_RCVQ_CB.
> > > > >
> > > > > In this hook, we want to parse the RPC descriptor in the skb
> > > > > and adjust sk->sk_rcvlowat based on the RPC frame size.
> > > > >
> > > > > However, we cannot access payload via bpf_sock_ops.data on
> > > > > modern NICs with TCP header/data split on as the payload is
> > > > > not placed in the linear area.
> > > > >
> > > > > Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
> > > > >
> > > > > Three notes:
> > > > >
> > > > > 1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
> > > > > invoked from recvmsg().
> > > > >
> > > > > 2) Access to bpf_sock_ops.data will be disabled by passing
> > > > > 0 end_offset to bpf_skops_init_skb().
> > > > >
> > > > > 3) ____bpf_skb_load_bytes() is called directly instead of
> > > > > __bpf_skb_load_bytes() to allow compilers to inline it
> > > > > instead of generating a tail-call.
> > > >
> > > > Some observations below.
> > > >
> > > > >
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > > > > ---
> > > > > v2: Explain why using ____ version instead of __
> > > > > ---
> > > > > net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 34 insertions(+)
> > > > >
> > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > index 4a50fe2cd863..fa8a7c7d86eb 100644
> > > > > --- a/net/core/filter.c
> > > > > +++ b/net/core/filter.c
> > > > > @@ -7760,6 +7760,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
> > > > > .arg3_type = ARG_ANYTHING,
> > > > > };
> > > > >
> > > > > +BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
> > > > > + u32, offset, void *, to, u32, len)
> > > > > +{
> > > > > + int err;
> > > > > +
> > > > > + if (bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB) {
> > > >
> > > > bpf_dynptr_from_skb() and bpf_dynptr_slice() kfunc could also be considered.
> > > > One less bpf_sock->op check in filter.c to maintain and could also avoid
> > > > a data copy. There is a bpf_cast_to_kern_ctx() to get to a trusted
> > > > skops_kern pointer but this will need changes in verifier.c to get to
> > > > skops_kern->skb (e.g. in type_is_trusted_or_null) and this is the tradeoff.
> > >
> > > Maybe a dumb question, but does it add extra cost (extra dynptr
> > > function call?) if data overlaps two frags, or can dynptr handle it
> > > seamlessly with a single bpf_dynptr_slice() ?
> >
> > Right, there is an extra bpf_dynptr_from_skb(). I don't think we have
> > benchmarked it.
> >
> > If I read it correctly, unlike bpf_xdp_pointer, the skb_header_pointer
> > will still copy even if the data is in one frag. It works well if the data
> > is in the headlen and the worst case is to copy, which is the same as
> > load_bytes.
> >
> > It is a readonly use case. Maybe the bpf prog can directly read the frag.
> > Regardless, it is useful to have a kfunc/helper to read it.
> >
> > >
> > > In our case, the data copy is ~16 bytes, so the cost will not be
> > > a big problem I think.
> > >
> > >
> > > >
> > > > If this new rcvq callback is added to the 'bpf_tcp_ops' proposal [1],
> > > > all this will go away. 'struct sk_buff *skb' can be directly passed to an
> > > > ops of the 'bpf_tcp_ops'. Supporting '*skb' in a struct_ops has already
> > > > been done in the bpf_qdisc.
> > > >
> > > > [1]: https://lore.kernel.org/bpf/20260519215841.2984970-11-martin.lau@linux.dev/
> > >
> > > Oh I missed the series, the struct_ops conversion looks nice !
> > > Since this work isn't urgent, I can wait for your series if mine
> > > churns it.
> > >
> > > Jason's series is adding a new op, and I guess this can be
> > > integrated too ?
> > > https://lore.kernel.org/bpf/20260521135244.40869-5-kerneljasonxing@gmail.com/
> >
> > imo, a new sock_ops cb should be added as an ops in struct_ops. For example,
> > in patch 4 of that series, bpf_skops_rx_timestamping assigns u64 to 'u32
> > args[4]', which is adding tech debt to the current sock_ops interface.
> > For the timestamping case, it could be a separate ops for the
> > 'struct sock' instead of 'struct tcp_sock' because it should
> > at least work for both TCP and UDP.
>
> Sorry, I don't get it. What is the tech debt in this? And
> bpf_skops_rx_timestamping() only outputs the timestamps, which has
> nothing to do with either 'sock' or 'tcp_sock'.

Did I say the tech debt is in bpf_skops_rx_timestamping?
I said it is tech debt of the bpf_sock_ops[_kern] interface.

Besides always memzeroing the whole &sockops, the way that the bpf prog
needs to assemble a u64 timestamp from u32 args is not thrilling in 2026.
args[4] is also not reliable when an earlier cgroup bpf prog
writes to replylong. We already hit some of these in our rx net
timestamping discussion a few years ago. I suggest you go back
and revisit.

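To make the two args[] complaints above concrete, here is a minimal
userspace sketch, not kernel code. It assumes a simplified stand-in
struct (mock_sock_ops, hypothetical) that only mirrors the way args[],
reply and replylong[] share storage in struct bpf_sock_ops, and it
assumes the u64 is split low word first across two adjacent u32 slots.
The layout actually used by the rx timestamping patch is not shown in
this thread. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the args/reply/replylong
 * union in struct bpf_sock_ops. Only the aliasing matters here.
 */
struct mock_sock_ops {
	union {
		uint32_t args[4];
		uint32_t reply;
		uint32_t replylong[4];
	};
};

/* Assumed layout: low 32 bits in args[idx], high 32 bits in
 * args[idx + 1]. This is an illustration, not the patch's layout.
 */
static void mock_store_u64(struct mock_sock_ops *ops, int idx, uint64_t val)
{
	assert(idx >= 0 && idx < 3);
	ops->args[idx] = (uint32_t)val;
	ops->args[idx + 1] = (uint32_t)(val >> 32);
}

/* What the bpf prog side has to do to get the u64 back. */
static uint64_t mock_load_u64(const struct mock_sock_ops *ops, int idx)
{
	assert(idx >= 0 && idx < 3);
	return ((uint64_t)ops->args[idx + 1] << 32) | ops->args[idx];
}
```

With this sketch, storing a timestamp at index 0 and reading it back
returns the same value, but a write to replylong[0] in between changes
the low word that mock_load_u64() sees, because replylong[] and args[]
occupy the same memory.
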
>
> Could you show me what to do next? Thanks in advance. It sounds like
> the tx side of bpf timestamping should be adjusted accordingly?

imo, the bpf_tcp_ops can land first, and then the net timestamping
ops (the existing tx and the new rx) can be added in bpf_sock_ops.

Thread overview: 33+ messages
2026-05-23 8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
2026-05-23 8:29 ` [PATCH v3 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
2026-05-23 9:06 ` bot+bpf-ci
2026-05-23 8:29 ` [PATCH v3 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-23 8:29 ` [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-26 20:34 ` Martin KaFai Lau
2026-05-26 21:21 ` Kuniyuki Iwashima
2026-05-26 22:18 ` Martin KaFai Lau
2026-05-27 4:01 ` Jason Xing
2026-05-27 19:46 ` Martin KaFai Lau [this message]
2026-05-27 19:52 ` Martin KaFai Lau
2026-05-27 21:39 ` Kuniyuki Iwashima
2026-05-28 0:24 ` Martin KaFai Lau
2026-05-28 0:49 ` Jason Xing
2026-05-23 8:29 ` [PATCH v3 bpf-next 04/11] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
2026-05-23 8:29 ` [PATCH v3 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
2026-05-23 9:06 ` bot+bpf-ci
2026-05-23 8:29 ` [PATCH v3 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive Kuniyuki Iwashima
2026-05-23 9:20 ` bot+bpf-ci
2026-05-24 3:37 ` Kuniyuki Iwashima
2026-05-23 8:29 ` [PATCH v3 bpf-next 07/11] bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-23 8:29 ` [PATCH v3 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty Kuniyuki Iwashima
2026-05-23 9:20 ` bot+bpf-ci
2026-05-23 8:29 ` [PATCH v3 bpf-next 09/11] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
2026-05-23 8:29 ` [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
2026-05-26 20:47 ` Martin KaFai Lau
2026-05-26 21:07 ` Kuniyuki Iwashima
2026-05-26 21:37 ` Amery Hung
2026-05-26 21:51 ` Kuniyuki Iwashima
2026-05-23 8:29 ` [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-23 9:20 ` bot+bpf-ci
2026-05-24 4:03 ` Kuniyuki Iwashima
2026-05-26 21:01 ` Martin KaFai Lau