From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-178.mta1.migadu.com (out-178.mta1.migadu.com [95.215.58.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D98C23ECBFB for ; Tue, 19 May 2026 21:59:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.178 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779227965; cv=none; b=JDjQHnZ74cyBnnzIx6/2Emh8zq122FPk0+j++C3iuNR2YDOmTIFDmKGW5Of4t1uF2hH0UmbQbXOcHMKi/bED9v9NwG7f70v3jbu4leUXjPnmgXzwyI18W8UX4xuZYMsl9MhrP3s+r20+1n5342IGJRHfO6zGlGoyKw8av3wqnVw= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779227965; c=relaxed/simple; bh=ZcGjKbVwQoxeFjHQ5lGnSM9g2YXjd6HchvKDY6b/Ke8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=pJyJ5ID0Pmzfr89nqrN5c2B+yt2nZopBN+aIh+M6Hqn8+zWCCEhzle+ZdtJLcw4oVYJNvjqGeQzOeuFD2irxdTqpIFH2e5Q9KiKuy3+LyGk8B+c318FgtNXjdvloudtdKGZitjCJ0Pl26atYz9OTlV8BfqabO8pUEdKiQb8+XDI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=lo8JeHKj; arc=none smtp.client-ip=95.215.58.178 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="lo8JeHKj" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1779227958; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=yPzwzImi5SRYWFHKZHpwiJt3zqE6j4NwSOR6Ja0AU/8=; b=lo8JeHKjLKA2868h0Z/H9jd5p3617I2ja/DNeANF7CPOQZyeWGRzBBrxK0/oUlMdklWmXF pE0SnyD6dei9M0kKDFsI6W+hc4xc6wfxlyef0b0NRxhw8L5vYSHMW2uqGfGeGblKXGPQlt Fx+ytFUlLMFSHrDM2Hd0m5q7NNiwgDg= From: Martin KaFai Lau To: bpf@vger.kernel.org Cc: 'Alexei Starovoitov ' , 'Andrii Nakryiko ' , 'Daniel Borkmann ' , 'Shakeel Butt ' , 'Roman Gushchin ' , 'Amery Hung ' , netdev@vger.kernel.org Subject: [RFC PATCH bpf-next 10/12] bpf: tcp: Support selected sock_ops callbacks as struct_ops Date: Tue, 19 May 2026 14:58:17 -0700 Message-ID: <20260519215841.2984970-11-martin.lau@linux.dev> In-Reply-To: <20260519215841.2984970-1-martin.lau@linux.dev> References: <20260519215841.2984970-1-martin.lau@linux.dev> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT From: Martin KaFai Lau In LSFMMBPF 2025, I have talked about moving the BPF_PROG_TYPE_SOCK_OPS to a struct_ops interface [1]. The BPF_SOCK_OPS_*_CB enum interface has grown over time as new TCP callback points were added. A BPF_PROG_TYPE_SOCK_OPS program now commonly needs a large switch on sock_ops->op, and the shared bpf_sock_ops_kern context has become harder to extend because different callbacks have different locking, argument, skb, and helper requirements. The existing 'union { u32 args[4]; u32 replylong[4]; }' is also not reliable in passing args to bpf prog when there are multiple progs attached to a cgroup. The above has already been solved in struct_ops. Add a TCP-specific struct_ops type, bpf_tcp_ops, and support attaching it to cgroups. This allows each callback have its own func signature and allows the verifier to select kfuncs/helpers based on the specific struct_ops member being implemented. This patch wires up the following existing sock_ops callbacks: - BPF_SOCK_OPS_TIMEOUT_INIT - BPF_SOCK_OPS_RWND_INIT - BPF_SOCK_OPS_RTT_CB - BPF_SOCK_OPS_STATE_CB - BPF_SOCK_OPS_RETRANS_CB - BPF_SOCK_OPS_TCP_CONNECT_CB - BPF_SOCK_OPS_TCP_LISTEN_CB I don't think BASE_RTT is useful. NEEDS_ECN should be done in bpf-tcp-cc instead. The tstamp ones should be a separate struct_ops (e.g. "bpf_sock_ops") that can work in both TCP and UDP. timeout_init and rwnd_init could have a request_sock pointer. This patch tries a different API and direclty passes the request_sock pointer as an arg The patch is incomplete. TODOs: - A skb argument should be added to the retrans ops. - ACTIVE_ESTABLISHED_CB and PASSIVE_ESTABLISHED_CB - BPF_SOCK_OPS_*HDR related ops - RTO [1], page 13: https://drive.google.com/file/d/1wjKZth6T0llLJ_ONPAL_6Q_jbxbAjByp/view?usp=sharing Signed-off-by: Martin KaFai Lau --- include/net/tcp.h | 74 ++++++++++++++++++++++- net/ipv4/Makefile | 1 + net/ipv4/af_inet.c | 1 + net/ipv4/bpf_tcp_ops.c | 134 +++++++++++++++++++++++++++++++++++++++++ net/ipv4/tcp.c | 4 +- net/ipv4/tcp_output.c | 5 +- 6 files changed, 214 insertions(+), 5 deletions(-) create mode 100644 net/ipv4/bpf_tcp_ops.c diff --git a/include/net/tcp.h b/include/net/tcp.h index ecbadcb3a744..eb9ff07f3c83 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2960,12 +2960,78 @@ static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2, #endif +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_CGROUP_BPF) + +struct bpf_tcp_ops { + int (*timeout_init)(struct sock *sk, struct request_sock *req); + int (*rwnd_init)(struct sock *sk, struct request_sock *req); + void (*rtt)(struct sock *sk, long mrtt, u32 srtt); + void (*set_state)(struct sock *sk, int state); + void (*retrans)(struct sock *sk, int type); + void (*connect)(struct sock *sk); + void (*listen)(struct sock *sk); +}; + +#define bpf_tcp_ops_call(op, sk, ...) \ +do { \ + if (cgroup_bpf_enabled(CGROUP_TCP_SOCK_OPS)) { \ + const struct bpf_prog_array_item *item; \ + const struct bpf_tcp_ops *tcp_ops; \ + struct cgroup *cgrp; \ + \ + cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data); \ + rcu_read_lock_dont_migrate(); \ + bpf_cgroup_struct_ops_foreach(tcp_ops, item, cgrp, \ + CGROUP_TCP_SOCK_OPS) { \ + if (tcp_ops->op) \ + tcp_ops->op(sk, ##__VA_ARGS__); \ + } \ + rcu_read_unlock_migrate(); \ + } \ +} while (0) + +#define bpf_tcp_ops_call_int(op, init_retval, sk, ...) \ +({ \ + int __retval = (init_retval); \ + if (cgroup_bpf_enabled(CGROUP_TCP_SOCK_OPS)) { \ + const struct bpf_prog_array_item *item; \ + const struct bpf_tcp_ops *tcp_ops; \ + struct bpf_run_ctx *old_run_ctx; \ + struct bpf_cg_run_ctx run_ctx; \ + struct sock *__sk = sk_to_full_sk(sk); \ + struct request_sock *req = NULL; \ + struct cgroup *cgrp; \ + \ + if (__sk) { \ + run_ctx.retval = (init_retval); \ + cgrp = sock_cgroup_ptr(&__sk->sk_cgrp_data); \ + if (!sk_fullsock(sk)) \ + req = (struct request_sock *)sk; \ + rcu_read_lock_dont_migrate(); \ + old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);\ + bpf_cgroup_struct_ops_foreach(tcp_ops, item, cgrp, \ + CGROUP_TCP_SOCK_OPS) { \ + if (tcp_ops->op) \ + run_ctx.retval = tcp_ops->op(__sk, req, ##__VA_ARGS__); \ + } \ + bpf_reset_run_ctx(old_run_ctx); \ + rcu_read_unlock_migrate(); \ + __retval = run_ctx.retval; \ + } \ + } \ + __retval; \ +}) +#else +#define bpf_tcp_ops_call(op, sk, ...) do { } while (0) +#define bpf_tcp_ops_call_int(op, init_retval, sk, ...) (init_retval) +#endif + static inline u32 tcp_timeout_init(struct sock *sk) { int timeout; timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL); - + timeout = bpf_tcp_ops_call_int(timeout_init, timeout, sk); if (timeout <= 0) timeout = TCP_TIMEOUT_INIT; return min_t(int, timeout, TCP_RTO_MAX); @@ -2976,7 +3042,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk) int rwnd; rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL); - + rwnd = bpf_tcp_ops_call_int(rwnd_init, rwnd, sk); if (rwnd < 0) rwnd = 0; return rwnd; @@ -2989,8 +3055,10 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk) static inline void tcp_bpf_rtt(struct sock *sk, long mrtt, u32 srtt) { - if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RTT_CB_FLAG)) + if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RTT_CB_FLAG)) { tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_RTT_CB, mrtt, srtt); + bpf_tcp_ops_call(rtt, sk, mrtt, srtt); + } } #if IS_ENABLED(CONFIG_SMC) diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index 7f9f98813986..356335e06b4c 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -71,4 +71,5 @@ obj-$(CONFIG_TCP_AO) += tcp_ao.o ifeq ($(CONFIG_BPF_JIT),y) obj-$(CONFIG_BPF_SYSCALL) += bpf_tcp_ca.o +obj-$(CONFIG_CGROUP_BPF) += bpf_tcp_ops.o endif diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 0e62032e76b1..71f9a171310f 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -227,6 +227,7 @@ int __inet_listen_sk(struct sock *sk, int backlog) return err; tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_LISTEN_CB, 0, NULL); + bpf_tcp_ops_call(listen, sk); } return 0; } diff --git a/net/ipv4/bpf_tcp_ops.c b/net/ipv4/bpf_tcp_ops.c new file mode 100644 index 000000000000..aa647d805882 --- /dev/null +++ b/net/ipv4/bpf_tcp_ops.c @@ -0,0 +1,134 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */ + +#include +#include +#include +#include +#include + +static int timeout_init_stub(struct sock *sk, struct request_sock *req__nullable) +{ + struct bpf_cg_run_ctx *ctx = + container_of(current->bpf_ctx, struct bpf_cg_run_ctx, run_ctx); + + return ctx->retval; +} + +static int rwnd_init_stub(struct sock *sk, struct request_sock *req__nullable) +{ + struct bpf_cg_run_ctx *ctx = + container_of(current->bpf_ctx, struct bpf_cg_run_ctx, run_ctx); + + return ctx->retval; +} + +static void rtt_stub(struct sock *sk, long mrtt, u32 srtt) +{ +} + +static void set_state_stub(struct sock *sk, int state) +{ +} + +static void retrans_stub(struct sock *sk, int type) +{ +} + +static void connect_stub(struct sock *sk) +{ +} + +static void listen_stub(struct sock *sk) +{ +} + +static struct bpf_tcp_ops __bpf_tcp_ops = { + .timeout_init = timeout_init_stub, + .rwnd_init = rwnd_init_stub, + .rtt = rtt_stub, + .set_state = set_state_stub, + .retrans = retrans_stub, + .connect = connect_stub, + .listen = listen_stub, +}; + +static const struct bpf_func_proto * +get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + u32 moff = prog->aux->attach_st_ops_member_off; + + switch (func_id) { + case BPF_FUNC_sk_storage_get: + return &bpf_sk_storage_get_proto; + case BPF_FUNC_sk_storage_delete: + return &bpf_sk_storage_delete_proto; + case BPF_FUNC_setsockopt: + /* The listener is not locked. */ + if (moff == offsetof(struct bpf_tcp_ops, rwnd_init) || + moff == offsetof(struct bpf_tcp_ops, timeout_init)) + return NULL; + return &bpf_sk_setsockopt_proto; + case BPF_FUNC_getsockopt: + if (moff == offsetof(struct bpf_tcp_ops, rwnd_init) || + moff == offsetof(struct bpf_tcp_ops, timeout_init)) + return NULL; + return &bpf_sk_getsockopt_proto; + default: + return bpf_base_func_proto(func_id, prog); + } +} + +static bool is_valid_access(int off, int size, enum bpf_access_type type, + const struct bpf_prog *prog, struct bpf_insn_access_aux *info) +{ + if (!bpf_tracing_btf_ctx_access(off, size, type, prog, info)) + return false; + + if (base_type(info->reg_type) == PTR_TO_BTF_ID && + !bpf_type_has_unsafe_modifiers(info->reg_type) && + info->btf_id == btf_sock_ids[BTF_SOCK_TYPE_SOCK]) + /* promote it to tcp_sock */ + info->btf_id = btf_sock_ids[BTF_SOCK_TYPE_TCP]; + + return true; +} + +static int bpf_tcp_ops_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + return 0; +} + +static int bpf_tcp_ops_init(struct btf *btf) +{ + return 0; +} + +static int bpf_tcp_ops_validate(void *kdata) +{ + return 0; +} + +static const struct bpf_verifier_ops bpf_tcp_ops_verifier = { + .get_func_proto = get_func_proto, + .is_valid_access = is_valid_access, +}; + +static struct bpf_struct_ops bpf_tcp_ops = { + .verifier_ops = &bpf_tcp_ops_verifier, + .init_member = bpf_tcp_ops_init_member, + .init = bpf_tcp_ops_init, + .validate = bpf_tcp_ops_validate, + .name = "bpf_tcp_ops", + .cgroup_atype = CGROUP_TCP_SOCK_OPS, + .cfi_stubs = &__bpf_tcp_ops, + .owner = THIS_MODULE, +}; + +static int __init __bpf_tcp_ops_init(void) +{ + return register_bpf_struct_ops(&bpf_tcp_ops, bpf_tcp_ops); +} +late_initcall(__bpf_tcp_ops_init) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 432fa28e47d4..c0f47a4c7980 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2998,8 +2998,10 @@ void tcp_set_state(struct sock *sk, int state) */ BTF_TYPE_EMIT_ENUM(BPF_TCP_ESTABLISHED); - if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG)) + if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG)) { tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state); + bpf_tcp_ops_call(set_state, sk, state); + } switch (state) { case TCP_ESTABLISHED: diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index f9d8755705f7..ed51713b2216 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3673,9 +3673,11 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC); } - if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG)) + if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG)) { tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RETRANS_CB, TCP_SKB_CB(skb)->seq, segs, err); + bpf_tcp_ops_call(retrans, sk, err); + } if (unlikely(err) && err != -EBUSY) NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL, segs); @@ -4296,6 +4298,7 @@ int tcp_connect(struct sock *sk) int err; tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB, 0, NULL); + bpf_tcp_ops_call(connect, sk); #if defined(CONFIG_TCP_MD5SIG) && defined(CONFIG_TCP_AO) /* Has to be checked late, after setting daddr/saddr/ops. -- 2.53.0-Meta