* [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.
@ 2026-05-22 7:44 Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
` (10 more replies)
0 siblings, 11 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:44 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
This series introduces BPF_SOCK_OPS_RCVQ_CB, a new type of opt-in
hooks for BPF SOCK_OPS prog.
The hooks can be enabled on per-socket basis by bpf_setsockopt():
int flag = BPF_SOCK_OPS_RCVQ_CB_FLAG;
bpf_setsockopt(sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
&flags, sizeof(flags));
or via the SOCK_OPS specific helper:
bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVQ_CB_FLAG);
Once activated, the BPF prog will be invoked with bpf_sock_ops.op
set to BPF_SOCK_OPS_RCVQ_CB upon the following events:
1. TCP stack enqueues skb to sk->sk_receive_queue
2. TCP recvmsg() completes
This allows the BPF prog to dynamically adjust sk->sk_rcvlowat,
suppressing unnecessary EPOLLIN wakeups until sufficient data
is available in the receive queue.
This functionality, which we call "TCP AutoLOWAT", was originally
developed in 2020 by Tenzin Ukyab with the help of Soheil Hassas
Yeganeh, Arjun Roy, and Eric Dumazet. It has served Google RPC
workloads for more than 5 years.
Combined with TCP RX zerocopy, this typically allows us to read an
entire RPC frame with just a single wakeup and a single system call.
While the original implementation was specialised for our
internal RPC format, this series introduces a more flexible
version by leveraging BPF.
The BPF SOCK_OPS prog in the last selftest patch closely mirrors
the core logic of the original implementation to provide a real-world
example.
Overview:
Patch 1 : misc cleanup for testing
Patch 2 : Add BPF_SOCK_OPS_RCVQ_CB with no actual hooks
Patch 3 - 5 : Add bpf helpers
Patch 6 - 8 : Add safe guard for BPF_SOCK_OPS_RCVQ_CB
Patch 9 - 10 : Add BPF_SOCK_OPS_RCVQ_CB hooks
Patch 11 : selftest
Changes:
v2:
Add Patch 6 - 8
Patch 2: s/BPF_SOCK_OPS_RCVLOWAT_CB/BPF_SOCK_OPS_RCVQ_CB/g
Patch 3: Explain why using ____ version instead of __ version
Patch 10: Add explanation of tcp_bpf_rcvlowat() placement.
Patch 11: Make copy_len u64 and swap validation order for it
to pass no_alu32 test case
v1: https://lore.kernel.org/bpf/20260508073355.3916746-1-kuniyu@google.com/
Kuniyuki Iwashima (11):
selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB.
bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
tcp: Split out __tcp_set_rcvlowat().
bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB.
bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty.
bpf: tcp: Factorise bpf_skops_established().
bpf: tcp: Add SOCK_OPS rcvlowat hook.
selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.
include/net/tcp.h | 13 +
include/uapi/linux/bpf.h | 18 +-
net/core/filter.c | 89 ++++-
net/ipv4/tcp.c | 14 +-
net/ipv4/tcp_bpf.c | 2 +
net/ipv4/tcp_fastopen.c | 2 +
net/ipv4/tcp_input.c | 25 +-
tools/include/uapi/linux/bpf.h | 18 +-
tools/testing/selftests/bpf/bpf_kfuncs.h | 4 +
.../selftests/bpf/prog_tests/tcp_autolowat.c | 350 ++++++++++++++++++
.../selftests/bpf/prog_tests/tcpbpf_user.c | 3 +-
.../selftests/bpf/progs/bpf_tracing_net.h | 2 +
.../selftests/bpf/progs/tcp_autolowat.c | 326 ++++++++++++++++
.../selftests/bpf/progs/test_tcpbpf_kern.c | 3 +-
14 files changed, 855 insertions(+), 14 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
create mode 100644 tools/testing/selftests/bpf/progs/tcp_autolowat.c
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
@ 2026-05-22 7:44 ` Kuniyuki Iwashima
2026-05-22 8:22 ` bot+bpf-ci
2026-05-22 7:44 ` [PATCH v2 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
` (9 subsequent siblings)
10 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:44 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
Once bpf_sock_ops_cb_flags_set() supports a new flag,
tcpbpf_user.c fails due to the hard-coded max value, 0x80.
Let's replace 0x80 with BPF_SOCK_OPS_ALL_CB_FLAGS + 1.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c | 3 ++-
tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
index 7e8fe1bad03f..e4849d2a2956 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
@@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
- ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
+ ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
+ "bad_cb_test_rv");
ASSERT_EQ(result->good_cb_test_rv, 0, "good_cb_test_rv");
ASSERT_EQ(result->num_listen, 1, "num_listen");
diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
index 6935f32eeb8f..e30cb1fab079 100644
--- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
@@ -92,7 +92,8 @@ int bpf_testcb(struct bpf_sock_ops *skops)
break;
case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
/* Test failure to set largest cb flag (assumes not defined) */
- global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops, 0x80);
+ global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
+ BPF_SOCK_OPS_ALL_CB_FLAGS + 1);
/* Set callback */
global.good_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
BPF_SOCK_OPS_STATE_CB_FLAG);
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB.
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
@ 2026-05-22 7:44 ` Kuniyuki Iwashima
2026-05-22 8:22 ` bot+bpf-ci
2026-05-22 7:44 ` [PATCH v2 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
` (8 subsequent siblings)
10 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:44 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
We will introduce a new type of opt-in hooks for BPF SOCK_OPS prog.
The hooks can be enabled on per-socket basis by bpf_setsockopt():
int flag = BPF_SOCK_OPS_RCVQ_CB_FLAG;
bpf_setsockopt(sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
&flags, sizeof(flags));
or via the SOCK_OPS specific helper:
bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVQ_CB_FLAG);
Once activated, the BPF prog will be invoked with bpf_sock_ops.op
set to BPF_SOCK_OPS_RCVQ_CB upon the following events:
1. TCP stack enqueues skb to sk->sk_receive_queue
2. TCP recvmsg() completes
This will allow the BPF prog to dynamically adjust sk->sk_rcvlowat,
suppressing unnecessary EPOLLIN wakeups until sufficient data
(e.g., a full RPC frame) is available in the receive queue.
Note that is_locked_tcp_sock_ops() is left unchanged not to enable
bpf_setsockopt() unnecessarily, but bpf_sock_ops_cb_flags_set() is
supported at BPF_SOCK_OPS_RCVQ_CB to disable by itself.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2: s/BPF_SOCK_OPS_RCVLOWAT_CB/BPF_SOCK_OPS_RCVQ_CB/g
---
include/uapi/linux/bpf.h | 18 +++++++++++++++++-
net/core/filter.c | 3 ++-
tools/include/uapi/linux/bpf.h | 18 +++++++++++++++++-
3 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aec171ccb6ef..31130e1b63ea 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6960,6 +6960,9 @@ struct bpf_sock_ops {
* the 3WHS.
* BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
* the 3WHS.
+ * BPF_SOCK_OPS_RCVQ_CB : No header included. The payload is only
+ * accessible by passing bpf_sock_ops to
+ * bpf_skb_load_bytes().
*
* bpf_load_hdr_opt() can also be used to read a particular option.
*/
@@ -7031,8 +7034,16 @@ enum {
* options first before the BPF program does.
*/
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+ /* Call bpf when TCP payload is queued to sk->sk_receive_queue
+ * and after recvmsg(). The bpf prog will be called under
+ * sock_ops->op == BPF_SOCK_OPS_RCVQ_CB.
+ *
+ * It can be used to adjust sk->sk_rcvlowat and suppress
+ * unnecessary wakeups before sufficient data is available.
+ */
+ BPF_SOCK_OPS_RCVQ_CB_FLAG = (1<<7),
/* Mask of all currently supported cb flags */
- BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
+ BPF_SOCK_OPS_ALL_CB_FLAGS = 0xFF,
};
enum {
@@ -7176,6 +7187,11 @@ enum {
* sendmsg timestamp with corresponding
* tskey.
*/
+ BPF_SOCK_OPS_RCVQ_CB, /* Called when TCP payload is queued to
+ * sk->sk_receive_queue and after recvmsg()
+ * to allow adjusting sk->sk_rcvlowat and
+ * to suppress early wakeups.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/filter.c b/net/core/filter.c
index 9590877b0714..4a50fe2cd863 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6002,7 +6002,8 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
struct sock *sk = bpf_sock->sk;
int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
- if (!is_locked_tcp_sock_ops(bpf_sock))
+ if (!is_locked_tcp_sock_ops(bpf_sock) &&
+ bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB)
return -EOPNOTSUPP;
if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 37142e6d911a..3b8f392d8c69 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6960,6 +6960,9 @@ struct bpf_sock_ops {
* the 3WHS.
* BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
* the 3WHS.
+ * BPF_SOCK_OPS_RCVQ_CB : No header included. The payload is only
+ * accessible by passing bpf_sock_ops to
+ * bpf_skb_load_bytes().
*
* bpf_load_hdr_opt() can also be used to read a particular option.
*/
@@ -7031,8 +7034,16 @@ enum {
* options first before the BPF program does.
*/
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+ /* Call bpf when TCP payload is queued to sk->sk_receive_queue
+ * and after recvmsg(). The bpf prog will be called under
+ * sock_ops->op == BPF_SOCK_OPS_RCVQ_CB.
+ *
+ * It can be used to adjust sk->sk_rcvlowat and suppress
+ * unnecessary wakeups before sufficient data is available.
+ */
+ BPF_SOCK_OPS_RCVQ_CB_FLAG = (1<<7),
/* Mask of all currently supported cb flags */
- BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
+ BPF_SOCK_OPS_ALL_CB_FLAGS = 0xFF,
};
enum {
@@ -7176,6 +7187,11 @@ enum {
* sendmsg timestamp with corresponding
* tskey.
*/
+ BPF_SOCK_OPS_RCVQ_CB, /* Called when TCP payload is queued to
+ * sk->sk_receive_queue and after recvmsg()
+ * to allow adjusting sk->sk_rcvlowat and
+ * to suppress early wakeups.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
@ 2026-05-22 7:44 ` Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 04/11] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
` (7 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:44 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
prog can be called with BPF_SOCK_OPS_RCVQ_CB.
In this hook, we want to parse the RPC descriptor in the skb
and adjust sk->sk_rcvlowat based on the RPC frame size.
However, we cannot access payload via bpf_sock_ops.data on
modern NICs with TCP header/data split on as the payload is
not placed in the linear area.
Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
Three notes:
1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
invoked from recvmsg().
2) Access to bpf_sock_ops.data will be disabled by passing
0 end_offset to bpf_skops_init_skb().
3) ____bpf_skb_load_bytes() is called directly instead of
__bpf_skb_load_bytes() to allow compilers to inline it
instead of generating a tail-call.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2: Explain why using ____ version instead of __
---
net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 4a50fe2cd863..fa8a7c7d86eb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7760,6 +7760,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
.arg3_type = ARG_ANYTHING,
};
+BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
+ u32, offset, void *, to, u32, len)
+{
+ int err;
+
+ if (bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB) {
+ err = -EOPNOTSUPP;
+ goto err_clear;
+ }
+
+ if (!bpf_sock->skb) {
+ err = -EPERM;
+ goto err_clear;
+ }
+
+ return ____bpf_skb_load_bytes(bpf_sock->skb, offset, to, len);
+
+err_clear:
+ memset(to, 0, len);
+ return err;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_skb_load_bytes_proto = {
+ .func = bpf_sock_ops_skb_load_bytes,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg4_type = ARG_CONST_SIZE,
+};
+
static const u8 *bpf_search_tcp_opt(const u8 *op, const u8 *opend,
u8 search_kind, const u8 *magic,
u8 magic_len, bool *eol)
@@ -8616,6 +8648,8 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_get_netns_cookie:
return &bpf_get_netns_cookie_sock_ops_proto;
#ifdef CONFIG_INET
+ case BPF_FUNC_skb_load_bytes:
+ return &bpf_sock_ops_skb_load_bytes_proto;
case BPF_FUNC_load_hdr_opt:
return &bpf_sock_ops_load_hdr_opt_proto;
case BPF_FUNC_store_hdr_opt:
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 04/11] tcp: Split out __tcp_set_rcvlowat().
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (2 preceding siblings ...)
2026-05-22 7:44 ` [PATCH v2 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
@ 2026-05-22 7:44 ` Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
` (6 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:44 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
We will add a kfunc for BPF_SOCK_OPS_RCVQ_CB hooks to
adjust sk->sk_rcvlowat.
These hooks will be triggered when:
1. TCP stack enqueues skb to sk->sk_receive_queue
2. TCP recvmsg() completes
In the enqueue path, tcp_data_ready() is always called
after the hooks in tcp_queue_rcv() and tcp_ofo_queue().
If tcp_set_rcvlowat() were used as is, tcp_data_ready()
could be called twice for the same skb, which is redundant
and also confusing.
Let's split out __tcp_set_rcvlowat() and add a flag to
control wakeup behaviour.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/net/tcp.h | 1 +
net/ipv4/tcp.c | 12 +++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ecbadcb3a744..c6a6853909c4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -515,6 +515,7 @@ void tcp_set_keepalive(struct sock *sk, int val);
void tcp_syn_ack_timeout(const struct request_sock *req);
int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
int flags);
+int __tcp_set_rcvlowat(struct sock *sk, int val, bool wakeup);
int tcp_set_rcvlowat(struct sock *sk, int val);
void tcp_set_rcvbuf(struct sock *sk, int val);
int tcp_set_window_clamp(struct sock *sk, int val);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 432fa28e47d4..3afeb69a547a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1829,8 +1829,7 @@ int tcp_peek_len(struct socket *sock)
return tcp_inq(sock->sk);
}
-/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
-int tcp_set_rcvlowat(struct sock *sk, int val)
+int __tcp_set_rcvlowat(struct sock *sk, int val, bool wakeup)
{
struct tcp_sock *tp = tcp_sk(sk);
int space, cap;
@@ -1843,7 +1842,8 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
/* Check if we need to signal EPOLLIN right now */
- tcp_data_ready(sk);
+ if (wakeup)
+ tcp_data_ready(sk);
if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
return 0;
@@ -1858,6 +1858,12 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
return 0;
}
+/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
+int tcp_set_rcvlowat(struct sock *sk, int val)
+{
+ return __tcp_set_rcvlowat(sk, val, true);
+}
+
void tcp_set_rcvbuf(struct sock *sk, int val)
{
tcp_set_window_clamp(sk, tcp_win_from_space(sk, val));
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (3 preceding siblings ...)
2026-05-22 7:44 ` [PATCH v2 bpf-next 04/11] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
@ 2026-05-22 7:44 ` Kuniyuki Iwashima
2026-05-22 8:22 ` bot+bpf-ci
2026-05-22 7:44 ` [PATCH v2 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive Kuniyuki Iwashima
` (5 subsequent siblings)
10 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:44 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
We will invoke BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB
to adjust sk->sk_rcvlowat when
1. TCP stack enqueues skb to sk->sk_receive_queue
2. TCP recvmsg() completes
Let's provide a kfunc to set sk->sk_rcvlowat.
Negative values are clamped to INT_MAX, consistent with SO_RCVLOWAT.
The wakeup flag is determined based on bpf_sock_ops_kern.skb:
* For the enqueue hook, skb is always non-NULL, and wakeup is
set to false because
* tcp_data_ready() is always called after the hooks in
tcp_queue_rcv() and tcp_ofo_queue().
* when tcp_fastopen_add_skb() is called for TFO SYN,
the socket is not yet accept()ed, and when called
for TFO SYN+ACK, the socket is woken up by
sk->sk_state_change() anyway.
* For the recvmsg() hook, skb is always NULL, and wakeup is set
to true because tcp_data_ready() is not called in the path.
An alternative would be to support bpf_setsockopt() by adding
BPF_SOCK_OPS_RCVQ_CB to is_locked_tcp_sock_ops().
However, that approach involves excessive conditionals and an
unnecessary memcpy(), costs we do not want to pay for every skb
in the TCP fast path.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
net/core/filter.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index fa8a7c7d86eb..3608036632a8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -12388,6 +12388,22 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
#endif
}
+__bpf_kfunc int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops,
+ int rcvlowat)
+{
+#ifdef CONFIG_INET
+ if (skops->op != BPF_SOCK_OPS_RCVQ_CB)
+ return -EOPNOTSUPP;
+
+ if (rcvlowat < 0)
+ rcvlowat = INT_MAX;
+
+ return __tcp_set_rcvlowat(skops->sk, rcvlowat, !skops->skb);
+#else
+ return -EOPNOTSUPP;
+#endif
+}
+
__bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
u64 flags)
{
@@ -12539,6 +12555,7 @@ BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)
BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
+BTF_ID_FLAGS(func, bpf_sock_ops_tcp_set_rcvlowat)
BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (4 preceding siblings ...)
2026-05-22 7:44 ` [PATCH v2 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
@ 2026-05-22 7:44 ` Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 07/11] bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
` (4 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:44 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
Both BPF_SOCK_OPS_RCVQ_CB and SOCKMAP can intercept and handle
socket receive queues, leading to overlapping use cases.
While BPF_SOCK_OPS_RCVQ_CB focuses on optimizing single-socket
performance by reducing EPOLLIN wakeups and fully preserves TCP
zerocopy support, SOCKMAP is designed to facilitate multi-socket
routing at the cost of higher overhead and no zerocopy support.
Enabling both features on the same socket makes no sense and
results in unexpected interference between them.
For instance, SOCKMAP calls __tcp_cleanup_rbuf(), where we will
add a BPF_SOCK_OPS_RCVQ_CB hook, and bpf_sock_ops_tcp_set_rcvlowat()
calls sk->sk_data_ready(), which would trigger SOCKMAP.
Let's make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
Both bpf_sol_tcp_setsockopt(TCP_BPF_SOCK_OPS_CB_FLAGS) and
bpf_sock_ops_cb_flags_set() now check if sk->sk_prot is
&tcp_prot or tcpv6_prot, while tcp_bpf_update_proto() checks
if BPF_SOCK_OPS_RCVQ_CB_FLAG is already set. Both checks are
performed under lock_sock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
net/core/filter.c | 29 +++++++++++++++++++++++++++--
net/ipv4/tcp_bpf.c | 2 ++
2 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 3608036632a8..ff7fd415486a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5382,12 +5382,27 @@ static int bpf_sol_tcp_getsockopt(struct sock *sk, int optname,
return 0;
}
+static int bpf_sock_ops_check_rcvq_cb(struct sock *sk, int val)
+{
+ if (val & BPF_SOCK_OPS_RCVQ_CB_FLAG) {
+ bool not_tcp_prot = sk->sk_prot != &tcp_prot;
+
+#if IS_ENABLED(CONFIG_IPV6)
+ not_tcp_prot &= sk->sk_prot != &tcpv6_prot;
+#endif
+ if (not_tcp_prot)
+ return -EBUSY;
+ }
+
+ return 0;
+}
+
static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
char *optval, int optlen)
{
struct tcp_sock *tp = tcp_sk(sk);
unsigned long timeout;
- int val;
+ int val, err;
if (optlen != sizeof(int))
return -EINVAL;
@@ -5424,6 +5439,11 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
case TCP_BPF_SOCK_OPS_CB_FLAGS:
if (val & ~(BPF_SOCK_OPS_ALL_CB_FLAGS))
return -EINVAL;
+
+ err = bpf_sock_ops_check_rcvq_cb(sk, val);
+ if (err)
+ return err;
+
tp->bpf_sock_ops_cb_flags = val;
break;
default:
@@ -5999,8 +6019,9 @@ static const struct bpf_func_proto bpf_sock_ops_getsockopt_proto = {
BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
int, argval)
{
- struct sock *sk = bpf_sock->sk;
int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+ struct sock *sk = bpf_sock->sk;
+ int err;
if (!is_locked_tcp_sock_ops(bpf_sock) &&
bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB)
@@ -6009,6 +6030,10 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
return -EINVAL;
+ err = bpf_sock_ops_check_rcvq_cb(sk, val);
+ if (err)
+ return err;
+
tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index cc0bd73f36b6..5c5c67080740 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -729,6 +729,8 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
sock_replace_proto(sk, psock->sk_proto);
}
return 0;
+ } else if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG)) {
+ return -EBUSY;
}
if (sk->sk_family == AF_INET6) {
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 07/11] bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB.
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (5 preceding siblings ...)
2026-05-22 7:44 ` [PATCH v2 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive Kuniyuki Iwashima
@ 2026-05-22 7:44 ` Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty Kuniyuki Iwashima
` (3 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:44 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
MPTCP has its own sock->ops->set_rcvlowat() / mptcp_set_rcvlowat().
We should not allow calling __tcp_set_rcvlowat() for MPTCP subflows.
Let's disable BPF_SOCK_OPS_RCVQ_CB for MPTCP for now.
If needed in the future, bpf_sock_ops_tcp_set_rcvlowat() could be
extended to properly support MPTCP.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
net/core/filter.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index ff7fd415486a..5913b3be9f1d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5392,6 +5392,9 @@ static int bpf_sock_ops_check_rcvq_cb(struct sock *sk, int val)
#endif
if (not_tcp_prot)
return -EBUSY;
+
+ if (unlikely(sk_is_mptcp(sk)))
+ return -EOPNOTSUPP;
}
return 0;
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty.
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (6 preceding siblings ...)
2026-05-22 7:44 ` [PATCH v2 bpf-next 07/11] bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
@ 2026-05-22 7:44 ` Kuniyuki Iwashima
2026-05-22 7:45 ` [PATCH v2 bpf-next 09/11] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
` (2 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:44 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
Unlike SOCKMAP, BPF_SOCK_OPS_RCVQ_CB does not iterate existing
skbs in the receive queue when it is enabled for the first time.
In practical production use cases, this behavior is usually not
a problem.
We can safely assume that the upper-layer protocol is designed
with specific synchronisation points where the connection is
temporarily quiet.
At these points, the application can completely drain the receive
queue and safely enable BPF_SOCK_OPS_RCVQ_CB while no skbs are
pending.
A prime example is an application transitioning from HTTP to an
RPC protocol:
Client Server
| |
| --- HTTP Upgrade request ---------> |
| | [Drain all skbs]
| | [Enable BPF_SOCK_OPS_RCVQ_CB]
| <-- HTTP 200/Switching protocol --- |
| |
| --- RPC Frame 1 ------------------> |
However, to strictly prevent any potential race conditions arising
from unconventional upper-layer protocol designs, let's explicitly
signal a failure if BPF_SOCK_OPS_RCVQ_CB is enabled while the receive
queue is not empty.
-EUCLEAN is chosen to indicate that the caller needs to clean up
the receive queue before enabling the feature.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
net/core/filter.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 5913b3be9f1d..883e4aaed49e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5395,6 +5395,9 @@ static int bpf_sock_ops_check_rcvq_cb(struct sock *sk, int val)
if (unlikely(sk_is_mptcp(sk)))
return -EOPNOTSUPP;
+
+ if (!skb_queue_empty(&sk->sk_receive_queue))
+ return -EUCLEAN;
}
return 0;
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 09/11] bpf: tcp: Factorise bpf_skops_established().
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (7 preceding siblings ...)
2026-05-22 7:44 ` [PATCH v2 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty Kuniyuki Iwashima
@ 2026-05-22 7:45 ` Kuniyuki Iwashima
2026-05-22 7:45 ` [PATCH v2 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
2026-05-22 7:45 ` [PATCH v2 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
10 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:45 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
We will call BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB.
It requires a similar setup to bpf_skops_established(), and the
only difference is the skb data length.
Let's factor out the common logic into bpf_skops_common_locked().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
net/ipv4/tcp_input.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d5c9e65d9760..c4ba4f1e9d9e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -179,8 +179,9 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
}
-static void bpf_skops_established(struct sock *sk, int bpf_op,
- struct sk_buff *skb)
+static void bpf_skops_common_locked(struct sock *sk, int bpf_op,
+ struct sk_buff *skb,
+ unsigned int end_offset)
{
struct bpf_sock_ops_kern sock_ops;
@@ -191,12 +192,18 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
sock_ops.is_fullsock = 1;
sock_ops.is_locked_tcp_sock = 1;
sock_ops.sk = sk;
- /* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
if (skb)
- bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
+ bpf_skops_init_skb(&sock_ops, skb, end_offset);
BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
}
+
+static void bpf_skops_established(struct sock *sk, int bpf_op,
+ struct sk_buff *skb)
+{
+ /* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
+ bpf_skops_common_locked(sk, bpf_op, skb, skb ? tcp_hdrlen(skb) : 0);
+}
#else
static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
{
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook.
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (8 preceding siblings ...)
2026-05-22 7:45 ` [PATCH v2 bpf-next 09/11] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
@ 2026-05-22 7:45 ` Kuniyuki Iwashima
2026-05-22 7:45 ` [PATCH v2 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
10 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:45 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVQ_CB.
Let's invoke the BPF SOCK_OPS prog when
1. TCP stack enqueues skb to sk->sk_receive_queue
-> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()
2. TCP recvmsg() completes
-> __tcp_cleanup_rbuf()
This will allow the BPF prog to parse each skb and dynamically
adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
until sufficient data (e.g., a full RPC frame) is available
in the receive queue.
Note that the direct access to bpf_sock_ops.data is intentionally
disabled by passing 0 as end_offset.
Instead, the BPF prog is supposed to use bpf_skb_load_bytes()
with bpf_sock_ops because payload is not in the linear area
with TCP header/data split on and skb may contain a RPC
descriptor in skb frag. This also simplifies the BPF prog.
The placement of tcp_bpf_rcvlowat() in tcp_ofo_queue() and
tcp_fastopen_add_skb() is chosen to provide the same snapshot
with tcp_queue_rcv().
For example, if tcp_bpf_rcvlowat() were called before updating
TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog would
need to implement an unlikely if branch to strip SYN.
In addition, TCP stack can queue overlapping skb into recvq.
Once rcv_nxt is updated with a new skb, BPF prog cannot infer
the previous one from skb->len.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2: Add explanation of tcp_bpf_rcvlowat() placement.
---
include/net/tcp.h | 12 ++++++++++++
net/ipv4/tcp.c | 2 ++
net/ipv4/tcp_fastopen.c | 2 ++
net/ipv4/tcp_input.c | 10 ++++++++++
4 files changed, 26 insertions(+)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c6a6853909c4..2247937e385a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2888,12 +2888,24 @@ static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
skops->skb = skb;
skops->skb_data_end = skb->data + end_offset;
}
+
+void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb);
+
+static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
+{
+ if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG))
+ bpf_skops_rcvlowat(sk, skb);
+}
#else
static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
struct sk_buff *skb,
unsigned int end_offset)
{
}
+
+static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
+{
+}
#endif
/* Call BPF_SOCK_OPS program that returns an int. If the return value
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3afeb69a547a..f7e32891bb4e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
tcp_mstamp_refresh(tp);
tcp_send_ack(sk);
}
+
+ tcp_bpf_rcvlowat(sk, NULL);
}
void tcp_cleanup_rbuf(struct sock *sk, int copied)
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 471c78be5513..91bf421fc5b6 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -281,6 +281,8 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
TCP_SKB_CB(skb)->seq++;
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_SYN;
+ tcp_bpf_rcvlowat(sk, skb);
+
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
tcp_add_receive_queue(sk, skb);
tp->syn_data_acked = 1;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c4ba4f1e9d9e..477bcf2ba89d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -204,6 +204,12 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
bpf_skops_common_locked(sk, bpf_op, skb, skb ? tcp_hdrlen(skb) : 0);
}
+
+void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb)
+{
+ /* skb is NULL when called from __tcp_cleanup_rbuf(). */
+ bpf_skops_common_locked(sk, BPF_SOCK_OPS_RCVQ_CB, skb, 0);
+}
#else
static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
{
@@ -5306,6 +5312,8 @@ static void tcp_ofo_queue(struct sock *sk)
continue;
}
+ tcp_bpf_rcvlowat(sk, skb);
+
tail = skb_peek_tail(&sk->sk_receive_queue);
eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
@@ -5509,6 +5517,8 @@ static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
int eaten;
struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+ tcp_bpf_rcvlowat(sk, skb);
+
eaten = (tail &&
tcp_try_coalesce(sk, tail,
skb, fragstolen)) ? 1 : 0;
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (9 preceding siblings ...)
2026-05-22 7:45 ` [PATCH v2 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
@ 2026-05-22 7:45 ` Kuniyuki Iwashima
10 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-22 7:45 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
The test is roughly divided into two stages, and the sequence
is as follows:
I) Setup
1. Attach two BPF programs to a cgroup
2. Establish a TCP connection (@client <-> @child) within the cgroup
3. Enable BPF_SOCK_OPS_RCVQ_CB on @child
II) RPC frame exchange in various patterns
4. Send a partial RPC descriptor from @client to @child
5. Verify that epoll does NOT wake up @child
6. Send the remaining data of the RPC frame
7. Verify that epoll finally wakes up @child
During setup, two BPF programs are attached to simulate
a real-world scenario; one is SOCK_OPS and the other is
CGROUP_SOCKOPT.
While the SOCK_OPS prog handles the dynamic adjustment of
sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable
BPF_SOCK_OPS_RCVQ_CB via userspace setsockopt() using
pseudo options:
#define SOL_BPF 0xdeadbeef
#define BPF_TCP_AUTOLOWAT 0x8badf00d
setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int));
This reflects a common production use case where an application
decides to start parsing RPC frames only at a certain point in
the stream (e.g., after HTTP Upgrade), rather than immediately
after TCP 3WHS (BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, etc).
When BPF_TCP_AUTOLOWAT is enabled, the BPF prog initialises
sk_local_storage for two sequence numbers to manage its state.
Then, for the RPC frame exchange, this test uses a simple format
defined as follows
0 8 16 24 32
+--------+--------+-------+--------+ `.
| header size | |
+--------+--------+-------+--------+ > RPC descriptor (8 bytes)
| payload size | |
+--------+--------+-------+--------+ .'
~ header ~
+--------+--------+-------+--------+
~ payload ~
+--------+--------+-------+--------+
Every time a new skb is enqueued to sk->sk_receive_queue,
the SOCK_OPS prog parses it and updates these sequence numbers:
rpc_desc_seq : the SEQ # of the start of the RPC descriptor
rpc_end_seq : the SEQ # of the end of the RPC frame
=> rpc_desc_seq + 8 + header size + payload size
Assume we receive two RPC descriptors in the following pattern:
1. When we receive skb-1, only a part of RPC descriptor is parsed.
rpc_desc_seq is set to the first byte while rpc_end_seq is
unknown. Thus, sk->sk_rcvlowat is set to the size of the RPC
descriptor (8 bytes).
<- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
+-----------+.................+....................+......
| RPC desc 1 | header + payload | RPC desc 2 | ...
+-----------+.................+....................+......
^ ^-.
`- rpc_desc_seq `- sk->sk_rcvlowat
2. Next, we receive skb-2, which completes the first RPC descriptor.
Now rpc_end_seq is known, so sk->sk_rcvlowat is advanced to it.
<- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
+-----------+-----------------+....................+......
| RPC desc 1 | header + payload | RPC desc 2 | ...
+-----------+-----------------+....................+......
^ ^
'- rpc_desc_seq '- rpc_end_seq
& sk->sk_rcvlowat
3. Once we receive skb-3, which contains the next full RPC descriptor,
rpc_desc_seq is advanced and rpc_end_seq is updated according
to the size of RPC frame 2.
Note that sk->sk_rcvlowat is NOT updated to the new rpc_end_seq
yet. This ensures that the application is woken up to read the
already complete RPC frame 1.
<- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
+-----------+-----------------+--------------------+......
| RPC desc 1 | header + payload | RPC desc 2 | ... |
+-----------+-----------------+--------------------+......
^ ^
rpc_desc_seq -----------' rpc_end_seq ----...-'
& sk->sk_rcvlowat
This sequence corresponds to the 4th test case in rpc_test_cases[],
and we can see helpful output if we "#define DEBUG":
# cat /sys/kernel/tracing/trace_pipe | \
awk '{ if ($0 ~ /AF_/) sub(/^.*AF_/, "AF_"); print $0 }' & \
BGPID=$!; ./test_progs -t tcp_autolowat; kill -9 -$BGPID
...
AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 0, end_seq: 1, len: 1, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 0
AF_INET6 rpc_test_cases[3]: Copied 1 bytes: rpc_desc_buff_len: 1
AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_desc_buff_len: 1
AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 8, actual: 8
AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 1, end_seq: 8, len: 7, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 1
AF_INET6 rpc_test_cases[3]: Copied full descriptor: rpc_desc_seq: 0, rpc_end_seq: 258, header_len: 100, payload_len: 150
AF_INET6 rpc_test_cases[3]: No more descriptor: rpc_end_seq: 258, end_seq: 8
AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 258, rpc_desc_buff_len: 8
AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 258, actual: 258
...
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2:
* Make copy_len u64 and swap validation order for it
for no_alu32.
---
tools/testing/selftests/bpf/bpf_kfuncs.h | 4 +
.../selftests/bpf/prog_tests/tcp_autolowat.c | 350 ++++++++++++++++++
.../selftests/bpf/progs/bpf_tracing_net.h | 2 +
.../selftests/bpf/progs/tcp_autolowat.c | 326 ++++++++++++++++
4 files changed, 682 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
create mode 100644 tools/testing/selftests/bpf/progs/tcp_autolowat.c
diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
index ae71e9b69051..fc4d6f68f247 100644
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
@@ -64,6 +64,10 @@ struct bpf_tcp_req_attrs;
extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym;
+struct bpf_sock_ops_kern;
+extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops_kern,
+ int rcvlowat) __ksym;
+
void *bpf_cast_to_kern_ctx(void *) __ksym;
extern void *bpf_rdonly_cast(const void *obj, __u32 btf_id) __ksym __weak;
diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
new file mode 100644
index 000000000000..5e971c42a32a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
@@ -0,0 +1,350 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2026 Google LLC */
+#include <sys/epoll.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+
+#include "tcp_autolowat.skel.h"
+
+#define SOL_BPF 0xdeadbeef
+#define BPF_TCP_AUTOLOWAT 0x8badf00d
+
+struct rpc_descriptor {
+ u32 header_len;
+ u32 payload_len;
+};
+
+enum rpc_event_type {
+ RPC_EVENT_END,
+ RPC_EVENT_AUTOLOWAT,
+ RPC_EVENT_SEND,
+ RPC_EVENT_RECV,
+ RPC_EVENT_EPOLL,
+ RPC_EVENT_RCVLOWAT,
+};
+
+struct rpc_event {
+ enum rpc_event_type type;
+ union {
+ int len;
+ int nfds;
+ int val;
+ int rcvlowat;
+ };
+};
+
+#define RPC_DESC_SIZE (sizeof(struct rpc_descriptor))
+
+struct rpc_test_case {
+ char data[4096];
+ struct rpc_descriptor desc[32];
+ struct rpc_event event[32];
+} rpc_test_cases[] = {
+ {
+ .desc = {
+ { .header_len = 100, .payload_len = 150 },
+ },
+ .event = {
+ { .type = RPC_EVENT_AUTOLOWAT, .val = 1},
+ /* Single full RPC message in skb. */
+ { .type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE + 100 + 150},
+ { .type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 100 + 150},
+ { .type = RPC_EVENT_EPOLL, .nfds = 1},
+ },
+ },
+ {
+ .desc = {
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 100, .payload_len = 150},
+ },
+ .event = {
+ { .type = RPC_EVENT_AUTOLOWAT, .val = 1},
+ /* Two full RPC messages in skb. */
+ {.type = RPC_EVENT_SEND, .len = (RPC_DESC_SIZE + 100 + 150) * 2},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2},
+ {.type = RPC_EVENT_EPOLL, .nfds = 1},
+ /* Single full RPC message in skb. */
+ { .type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE + 100 + 150},
+ { .type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 3},
+ { .type = RPC_EVENT_EPOLL, .nfds = 1},
+ },
+ },
+ {
+ .desc = {
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 100, .payload_len = 150},
+ },
+ .event = {
+ { .type = RPC_EVENT_AUTOLOWAT, .val = 1},
+ /* Two full RPC messages in skb. */
+ {.type = RPC_EVENT_SEND, .len = (RPC_DESC_SIZE + 100 + 150) * 2},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2},
+ {.type = RPC_EVENT_EPOLL, .nfds = 1},
+ /* Single full RPC message in skb. */
+ { .type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE},
+ { .type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2},
+ { .type = RPC_EVENT_EPOLL, .nfds = 1},
+ },
+ },
+ {
+ .desc = {
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 200, .payload_len = 500},
+ },
+ .event = {
+ { .type = RPC_EVENT_AUTOLOWAT, .val = 1},
+ /* The first descriptor is partial. */
+ {.type = RPC_EVENT_SEND, .len = 1},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE},
+ /* The first descriptor is available. */
+ {.type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE - 1},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 150 + 100},
+ /* The first header is ready. */
+ {.type = RPC_EVENT_SEND, .len = 100},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 150 + 100},
+ /* skb has the first payload and 1 byte of the next descriptor. */
+ {.type = RPC_EVENT_SEND, .len = 150 + 1},
+ {.type = RPC_EVENT_EPOLL, .nfds = 1},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 150 + 100},
+ /* After reading the first RPC message, SO_RCVLOWAT should be RPC_DESC_SIZE. */
+ {.type = RPC_EVENT_RECV, .len = RPC_DESC_SIZE + 150 + 100},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE},
+ /* The second descriptor is available. */
+ {.type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE - 1},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 200 + 500},
+ },
+ },
+};
+
+struct tcp_autolowat_test_cb {
+ int saved_netns;
+ union {
+ int fd[4];
+ struct {
+ int server, client, child;
+ int epoll;
+ };
+ };
+};
+
+static void tcp_autolowat_teardown_cb(struct tcp_autolowat_test_cb *cb)
+{
+ int i, err;
+
+ for (i = 0; i < ARRAY_SIZE(cb->fd); i++) {
+ if (cb->fd[i] != -1)
+ close(cb->fd[i]);
+ }
+
+ if (cb->saved_netns != -1) {
+ err = setns(cb->saved_netns, CLONE_NEWNET);
+ ASSERT_OK(err, "restore netns");
+
+ close(cb->saved_netns);
+ }
+}
+
+static int tcp_autolowat_setup_cb(struct tcp_autolowat_test_cb *cb, int family)
+{
+ struct epoll_event ev = {};
+ int err;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(cb->fd); i++)
+ cb->fd[i] = -1;
+
+ cb->saved_netns = open("/proc/self/ns/net", O_RDONLY);
+ if (!ASSERT_NEQ(cb->saved_netns, -1, "save netns"))
+ goto err;
+
+ err = unshare(CLONE_NEWNET);
+ if (!ASSERT_OK(err, "unshare"))
+ goto err;
+
+ err = system("ip link set dev lo up");
+ if (!ASSERT_OK(err, "set up lo"))
+ goto err;
+
+ cb->server = start_server(family, SOCK_STREAM, NULL, 0, 0);
+ if (!ASSERT_NEQ(cb->server, -1, "start_server"))
+ goto err;
+
+ cb->client = connect_to_fd(cb->server, 0);
+ if (!ASSERT_NEQ(cb->client, -1, "connect_to_fd"))
+ goto err;
+
+ cb->child = accept(cb->server, NULL, NULL);
+ if (!ASSERT_NEQ(cb->child, -1, "accept"))
+ goto err;
+
+ cb->epoll = epoll_create1(0);
+ if (!ASSERT_NEQ(cb->epoll, -1, "epoll_create"))
+ goto err;
+
+ ev.events = EPOLLIN;
+ ev.data.fd = cb->child;
+
+ err = epoll_ctl(cb->epoll, EPOLL_CTL_ADD, cb->child, &ev);
+ if (!ASSERT_OK(err, "epoll_ctl"))
+ goto err;
+
+ return 0;
+
+err:
+ tcp_autolowat_teardown_cb(cb);
+ return -1;
+}
+
+static int tcp_autolowat_build_data(struct rpc_test_case *test_case)
+{
+ struct rpc_descriptor *desc = test_case->desc;
+ char *ptr = test_case->data;
+ int rpc_size;
+
+ memset(ptr, 0, sizeof(test_case->data));
+
+ while (desc->header_len + desc->payload_len) {
+ rpc_size = sizeof(*desc) + desc->header_len + desc->payload_len;
+
+ if (!ASSERT_LE(ptr + rpc_size - test_case->data,
+ sizeof(test_case->data), "data overflow"))
+ return 1;
+
+ memcpy(ptr, desc, sizeof(*desc));
+ ptr += rpc_size;
+ desc++;
+ }
+
+ if (!ASSERT_GT(ptr - test_case->data, 0, "no data"))
+ return 1;
+
+ return 0;
+}
+
+static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
+ struct rpc_test_case *test_case)
+{
+ struct rpc_event *event = test_case->event;
+ char *ptr = test_case->data;
+ struct epoll_event ev;
+ socklen_t optlen;
+ int err, optval;
+ char buf[4096];
+
+ if (tcp_autolowat_build_data(test_case))
+ return;
+
+ while (1) {
+ switch (event->type) {
+ case RPC_EVENT_END:
+ return;
+ case RPC_EVENT_AUTOLOWAT:
+ err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
+ &event->val, sizeof(event->val));
+ if (!ASSERT_OK(err, "setsockopt"))
+ return;
+ break;
+ case RPC_EVENT_SEND:
+ err = send(cb->client, ptr, event->len, 0);
+ if (!ASSERT_EQ(err, event->len, "send"))
+ return;
+
+ ptr += event->len;
+ break;
+ case RPC_EVENT_RECV:
+ err = recv(cb->child, buf, event->len, 0);
+ if (!ASSERT_EQ(err, event->len, "recv"))
+ return;
+ break;
+ case RPC_EVENT_EPOLL:
+ err = epoll_wait(cb->epoll, &ev, 1, 0);
+ if (!ASSERT_EQ(err, event->nfds, "epoll_wait"))
+ return;
+ break;
+ case RPC_EVENT_RCVLOWAT:
+ optval = 0;
+ optlen = sizeof(optval);
+
+ err = getsockopt(cb->child, SOL_SOCKET, SO_RCVLOWAT, &optval, &optlen);
+ if (!ASSERT_OK(err, "getsockopt") ||
+ !ASSERT_EQ(optval, event->rcvlowat, "rcvlowat"))
+ return;
+ break;
+ }
+
+ event++;
+ }
+}
+
+static void tcp_autolowat_run_rpc_tests(struct tcp_autolowat *skel, int family)
+{
+ struct tcp_autolowat_test_cb cb;
+ int err;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(rpc_test_cases); i++) {
+ memset(skel->bss->test_name, 0, sizeof(skel->bss->test_name));
+
+ snprintf(skel->bss->test_name, sizeof(skel->bss->test_name),
+ "AF_INET%c rpc_test_cases[%d]",
+ family == AF_INET ? ' ' : '6', i);
+
+ if (!test__start_subtest(skel->bss->test_name))
+ continue;
+
+ err = tcp_autolowat_setup_cb(&cb, family);
+ if (err)
+ continue;
+
+ tcp_autolowat_run_rpc_test(&cb, &rpc_test_cases[i]);
+ tcp_autolowat_teardown_cb(&cb);
+ }
+}
+
+static void tcp_autolowat_run_tests(struct tcp_autolowat *skel)
+{
+ tcp_autolowat_run_rpc_tests(skel, AF_INET);
+ tcp_autolowat_run_rpc_tests(skel, AF_INET6);
+}
+
+void test_tcp_autolowat(void)
+{
+ struct tcp_autolowat *skel;
+ struct bpf_link *link[2];
+ int cgroup;
+
+ skel = tcp_autolowat__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ cgroup = test__join_cgroup("/tcp_autolowat");
+ if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
+ goto destroy_skel;
+
+ link[0] = bpf_program__attach_cgroup(skel->progs.tcp_autolowat, cgroup);
+ if (!ASSERT_OK_PTR(link[0], "attach_cgroup(SOCK_OPS)"))
+ goto close_cgroup;
+
+ link[1] = bpf_program__attach_cgroup(skel->progs.tcp_autolowat_setsockopt, cgroup);
+ if (!ASSERT_OK_PTR(link[1], "attach_cgroup(SETSOCKOPT)"))
+ goto destroy_sockops;
+
+ tcp_autolowat_run_tests(skel);
+
+ bpf_link__destroy(link[1]);
+destroy_sockops:
+ bpf_link__destroy(link[0]);
+close_cgroup:
+ close(cgroup);
+destroy_skel:
+ tcp_autolowat__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index d8dacef37c16..bdf28d320383 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -74,6 +74,8 @@
#define NEXTHDR_TCP 6
+#define TCPHDR_FIN 0x01
+
#define TCPOPT_NOP 1
#define TCPOPT_EOL 0
#define TCPOPT_MSS 2
diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
new file mode 100644
index 000000000000..41eacfdb34aa
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
@@ -0,0 +1,326 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2026 Google LLC */
+#include "vmlinux.h"
+
+#include <string.h>
+#include <limits.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+#include "bpf_tracing_net.h"
+
+#define SOL_BPF 0xdeadbeef
+#define BPF_TCP_AUTOLOWAT 0x8badf00d
+
+//#define DEBUG /* For verbose output. */
+
+struct rpc_descriptor {
+ u32 header_len;
+ u32 payload_len;
+};
+
+#define RPC_DESC_SIZE (sizeof(struct rpc_descriptor))
+#define MAX_RPC_DESC_PER_SKB 100
+
+struct tcp_autolowat_cb {
+ /* Don't put this field at the end; BPF verifier complains. */
+ char rpc_desc_buf[RPC_DESC_SIZE];
+ u32 rpc_desc_seq;
+ u32 rpc_end_seq;
+#ifdef DEBUG
+ u32 isn;
+#endif
+ u8 rpc_desc_buff_len;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_SK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct tcp_autolowat_cb);
+} tcp_autolowat_map SEC(".maps");
+
+char test_name[64];
+
+#ifdef DEBUG
+#define LOG(str, ...) \
+ bpf_printk("%s: " str, test_name, ##__VA_ARGS__)
+#else
+#define LOG(...)
+#endif
+
+#define SEQ(val) \
+ (val - cb->isn)
+#define TP_SEQ(field) \
+ (tp->field - cb->isn)
+#define CB_SEQ(field) \
+ (cb->field - cb->isn)
+
+static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
+ struct tcp_autolowat_cb *cb,
+ u32 seq, u32 end_seq)
+{
+ struct rpc_descriptor *rpc_desc;
+ u32 rpc_copied_seq;
+ u64 copy_len; /* u32 should work, but not for no_alu32 :/ */
+ u64 rpc_len;
+ int err;
+
+ rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
+
+ if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
+ copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
+ else
+ copy_len = end_seq - rpc_copied_seq;
+
+ /* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
+ * clang omits the "copy_len == 0" check below, which is necessary
+ * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
+ */
+ barrier_var(copy_len);
+
+ /* Do not swap the order of the two copy_len checks below.
+ * The BPF verifier somehow does not properly track the minimum
+ * value for 'copy_len == 0'.
+ *
+ * 91: (15) if r7 == 0x0 goto pc+40 ; R7=scalar(smin=smin32=-247,smax=smax32=8)
+ * 92: (25) if r7 > 0x8 goto pc+39 ; R7=scalar(smin=smin32=0,smax=umax=smax32=umax32=8,var_off=(0x0; 0xf))
+ *
+ * This does not occur if copy_len is u32.
+ */
+ if (copy_len > RPC_DESC_SIZE)
+ goto disable; /* always false, only for verifier. */
+ if (copy_len == 0)
+ goto disable; /* FIN. */
+
+ if (cb->rpc_desc_buf + cb->rpc_desc_buff_len >= &cb->rpc_desc_buf[RPC_DESC_SIZE])
+ goto disable; /* always false, only for verifier. */
+
+ err = bpf_skb_load_bytes(skops, rpc_copied_seq - seq,
+ cb->rpc_desc_buf + cb->rpc_desc_buff_len, copy_len);
+ if (err)
+ goto disable;
+
+ cb->rpc_desc_buff_len += copy_len;
+
+ if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) {
+ LOG("Copied %d bytes: rpc_desc_buff_len: %u", copy_len, cb->rpc_desc_buff_len);
+ goto partial;
+ }
+
+ rpc_desc = (struct rpc_descriptor *)cb->rpc_desc_buf;
+ rpc_len = RPC_DESC_SIZE + rpc_desc->header_len + rpc_desc->payload_len;
+
+ if (rpc_len > INT_MAX)
+ goto disable;
+
+ cb->rpc_end_seq = cb->rpc_desc_seq + rpc_len;
+
+ LOG("Copied full descriptor: rpc_desc_seq: %u, rpc_end_seq: %u,"
+ " header_len: %u, payload_len: %u",
+ CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq),
+ rpc_desc->header_len, rpc_desc->payload_len);
+
+ return 0;
+disable:
+ return -1;
+partial:
+ return 1;
+}
+
+static void tcp_set_autolowat(struct bpf_sock_ops_kern *skops_kern,
+ struct tcp_autolowat_cb *cb,
+ struct tcp_sock *tp)
+{
+ /* To handle wraparound. */
+ u32 val = 0;
+
+ LOG("Setting rcvlowat: tp->copied_seq: %u, rpc_desc_seq: %u, rpc_end_seq: %u, rpc_desc_buff_len: %u",
+ TP_SEQ(copied_seq), CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
+
+ if (before(tp->copied_seq, cb->rpc_desc_seq))
+ val = cb->rpc_desc_seq - tp->copied_seq;
+ else if (cb->rpc_desc_buff_len != RPC_DESC_SIZE)
+ val = RPC_DESC_SIZE;
+ else
+ val = cb->rpc_end_seq - tp->copied_seq;
+
+ if (val != tp->inet_conn.icsk_inet.sk.sk_rcvlowat) {
+ bpf_sock_ops_tcp_set_rcvlowat(skops_kern, val);
+
+ LOG("Set rcvlowat: expected: %u, actual: %d\n",
+ val, tp->inet_conn.icsk_inet.sk.sk_rcvlowat);
+ } else {
+ LOG("No need to set rcvlowat: %u\n", val);
+ }
+}
+
+static void tcp_disable_autolowat(struct bpf_sock_ops *skops,
+ struct bpf_sock_ops_kern *skops_kern)
+{
+ int flags;
+
+ flags = skops->bpf_sock_ops_cb_flags & ~BPF_SOCK_OPS_RCVQ_CB_FLAG;
+ bpf_sock_ops_cb_flags_set(skops, flags);
+
+ bpf_sock_ops_tcp_set_rcvlowat(skops_kern, 1);
+
+ LOG("Disabled autolowat");
+}
+
+static void tcp_do_autolowat(struct bpf_sock_ops *skops,
+ struct tcp_autolowat_cb *cb,
+ struct tcp_sock *tp)
+{
+ struct bpf_sock_ops_kern *skops_kern;
+ struct tcp_skb_cb *tcb;
+ struct sk_buff *skb;
+ u32 seq, end_seq;
+ int ret = 0, i;
+
+ skops_kern = bpf_cast_to_kern_ctx(skops);
+ skb = skops_kern->skb;
+
+ if (!skb)
+ goto update;
+
+ tcb = bpf_core_cast(skb->cb, struct tcp_skb_cb);
+ seq = tcb->seq;
+ end_seq = tcb->end_seq - !!(tcb->tcp_flags & TCPHDR_FIN);
+
+ LOG("Start parsing skb: seq: %u, end_seq: %u, len: %u, "
+ "rpc_desc_seq: %u, rpc_end_seq: %u, rpc_buff_len: %u",
+ SEQ(seq), SEQ(end_seq), end_seq - seq,
+ CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
+
+ if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) {
+ ret = tcp_parse_descriptor(skops, cb, seq, end_seq);
+ if (ret)
+ goto update;
+ }
+
+ i = 0;
+
+ while (1) {
+ if (i++ > MAX_RPC_DESC_PER_SKB) {
+ ret = -1;
+ break;
+ }
+
+ if (after(cb->rpc_end_seq, end_seq)) {
+ LOG("No more descriptor: rpc_end_seq: %u, end_seq: %u",
+ CB_SEQ(rpc_end_seq), SEQ(end_seq));
+ break;
+ }
+
+ cb->rpc_desc_seq = cb->rpc_end_seq;
+ cb->rpc_desc_buff_len = 0;
+
+ if (cb->rpc_end_seq == end_seq)
+ break;
+
+ LOG("Found next descriptor: rpc_end_seq: %u, end_seq: %u, len: %u",
+ CB_SEQ(rpc_end_seq), SEQ(end_seq), end_seq - cb->rpc_end_seq);
+
+ ret = tcp_parse_descriptor(skops, cb, seq, end_seq);
+ if (ret)
+ break;
+ }
+
+update:
+ if (ret >= 0)
+ tcp_set_autolowat(skops_kern, cb, tp);
+ else
+ tcp_disable_autolowat(skops, skops_kern);
+}
+
+SEC("sockops")
+int tcp_autolowat(struct bpf_sock_ops *skops)
+{
+ struct tcp_autolowat_cb *cb;
+ struct bpf_sock *bpf_sk;
+ struct tcp_sock *tp;
+
+ if (skops->op != BPF_SOCK_OPS_RCVQ_CB)
+ goto out;
+
+ bpf_sk = skops->sk;
+ if (!bpf_sk)
+ goto out; /* always false, only for verifier. */
+
+ tp = bpf_skc_to_tcp_sock(bpf_sk);
+ if (!tp)
+ goto out; /* always false, only for verifier. */
+
+ cb = bpf_sk_storage_get(&tcp_autolowat_map, tp, 0, 0);
+ if (!cb)
+ goto out;
+
+ tcp_do_autolowat(skops, cb, tp);
+out:
+ return 1;
+}
+
+static int tcp_init_autolowat_cb(struct bpf_sockopt *sockopt,
+ struct bpf_tcp_sock *btp)
+{
+ struct tcp_autolowat_cb *cb;
+ struct tcp_sock *tp;
+ int flags;
+
+ cb = bpf_sk_storage_get(&tcp_autolowat_map, btp, 0,
+ BPF_SK_STORAGE_GET_F_CREATE);
+ if (!cb)
+ return -1;
+
+ tp = bpf_core_cast(btp, struct tcp_sock);
+ if (!tp)
+ return -1;
+
+ cb->rpc_desc_seq = tp->copied_seq;
+ cb->rpc_end_seq = tp->copied_seq;
+#ifdef DEBUG
+ cb->isn = tp->copied_seq;
+#endif
+
+ if (bpf_getsockopt(sockopt->sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
+ &flags, sizeof(flags)))
+ return -1;
+
+ flags |= BPF_SOCK_OPS_RCVQ_CB_FLAG;
+
+ if (bpf_setsockopt(sockopt->sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
+ &flags, sizeof(flags)))
+ return -1;
+
+ return 0;
+}
+
+SEC("cgroup/setsockopt")
+int tcp_autolowat_setsockopt(struct bpf_sockopt *ctx)
+{
+ void *optval_end = ctx->optval_end;
+ int *optval = ctx->optval;
+ struct bpf_tcp_sock *btp;
+
+ if (ctx->level != SOL_BPF || ctx->optname != BPF_TCP_AUTOLOWAT)
+ goto out;
+
+ if (optval + 1 > optval_end)
+ return 0; /* -EPERM */
+
+ btp = bpf_tcp_sock(ctx->sk);
+ if (!btp)
+ goto out;
+
+ if (*optval && tcp_init_autolowat_cb(ctx, btp))
+ return 0; /* -EPERM */
+
+ ctx->optlen = -1; /* BPF has consumed this option, don't call kernel
+ * setsockopt handler.
+ */
+out:
+ return 1;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.54.0.746.g67dd491aae-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
2026-05-22 7:44 ` [PATCH v2 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
@ 2026-05-22 8:22 ` bot+bpf-ci
0 siblings, 0 replies; 15+ messages in thread
From: bot+bpf-ci @ 2026-05-22 8:22 UTC (permalink / raw)
To: kuniyu, ast, daniel, andrii, martin.lau, eddyz87, memxor
Cc: yonghong.song, john.fastabend, sdf, edumazet, ncardwell, willemb,
ukyab, kuniyu, kuni1840, bpf, netdev, ast, andrii, daniel,
martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 2444 bytes --]
> diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> index 7e8fe1bad03f..e4849d2a2956 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> @@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
> ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
> ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
> ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
> - ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
> + ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
> + "bad_cb_test_rv");
> ASSERT_EQ(result->good_cb_test_rv, 0, "good_cb_test_rv");
> ASSERT_EQ(result->num_listen, 1, "num_listen");
>
[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> index 6935f32eeb8f..e30cb1fab079 100644
> --- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> @@ -92,7 +92,8 @@ int bpf_testcb(struct bpf_sock_ops *skops)
> break;
> case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> /* Test failure to set largest cb flag (assumes not defined) */
> - global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops, 0x80);
> + global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
> + BPF_SOCK_OPS_ALL_CB_FLAGS + 1);
Could BPF_SOCK_OPS_ALL_CB_FLAGS + 1 break on future kernels?
When this test is compiled, BPF_SOCK_OPS_ALL_CB_FLAGS + 1 evaluates to a
hardcoded value. If the compiled test runs on a future kernel where this
value is supported, the helper will successfully set the flag and return 0,
causing the test to fail.
The test also assumes contiguous bitmasks. If a flag is ever deprecated or
skipped, adding 1 could overlap with valid bits.
Would a safely distant bit like 1 << 30 be more robust for forward
compatibility?
(Reference: https://lore.kernel.org/bpf/20260508190227.AFAE9C2BD00@smtp.kernel.org/)
> /* Set callback */
> global.good_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
> BPF_SOCK_OPS_STATE_CB_FLAG);
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26275674572
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
2026-05-22 7:44 ` [PATCH v2 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
@ 2026-05-22 8:22 ` bot+bpf-ci
0 siblings, 0 replies; 15+ messages in thread
From: bot+bpf-ci @ 2026-05-22 8:22 UTC (permalink / raw)
To: kuniyu, ast, daniel, andrii, martin.lau, eddyz87, memxor
Cc: yonghong.song, john.fastabend, sdf, edumazet, ncardwell, willemb,
ukyab, kuniyu, kuni1840, bpf, netdev, ast, andrii, daniel,
martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 1589 bytes --]
> diff --git a/net/core/filter.c b/net/core/filter.c
> index fa8a7c7d86eb..3608036632a8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -12388,6 +12388,22 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
> #endif
> }
>
> +__bpf_kfunc int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops,
> + int rcvlowat)
> +{
> +#ifdef CONFIG_INET
> + if (skops->op != BPF_SOCK_OPS_RCVQ_CB)
> + return -EOPNOTSUPP;
> +
> + if (rcvlowat < 0)
> + rcvlowat = INT_MAX;
> +
> + return __tcp_set_rcvlowat(skops->sk, rcvlowat, !skops->skb);
Could this reproduce the MPTCP subflow issue from commit fcf4692fa39e
("mptcp: prevent BPF accessing lowat from a subflow socket.")?
In the v1 review, Björn Töpel asked whether BPF programs could call this
kfunc on MPTCP subflow sockets. The response was "I'll rather drop support
for MPTCP for now," but there's no check here to prevent this kfunc from
being called on MPTCP subflows.
Without a check like:
if (sk_is_mptcp(skops->sk))
return -EOPNOTSUPP;
or a protocol check, this code could trigger the same warning that
fcf4692fa39e was designed to prevent.
> +#else
> + return -EOPNOTSUPP;
> +#endif
> +}
> +
> __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
> u64 flags)
> {
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26275674572
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB.
2026-05-22 7:44 ` [PATCH v2 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
@ 2026-05-22 8:22 ` bot+bpf-ci
0 siblings, 0 replies; 15+ messages in thread
From: bot+bpf-ci @ 2026-05-22 8:22 UTC (permalink / raw)
To: kuniyu, ast, daniel, andrii, martin.lau, eddyz87, memxor
Cc: yonghong.song, john.fastabend, sdf, edumazet, ncardwell, willemb,
ukyab, kuniyu, kuni1840, bpf, netdev, ast, andrii, daniel,
martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 4073 bytes --]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index aec171ccb6ef..31130e1b63ea 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
[ ... ]
> @@ -7031,8 +7034,16 @@ enum {
> * options first before the BPF program does.
> */
> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> + /* Call bpf when TCP payload is queued to sk->sk_receive_queue
> + * and after recvmsg(). The bpf prog will be called under
> + * sock_ops->op == BPF_SOCK_OPS_RCVQ_CB.
> + *
> + * It can be used to adjust sk->sk_rcvlowat and suppress
> + * unnecessary wakeups before sufficient data is available.
> + */
> + BPF_SOCK_OPS_RCVQ_CB_FLAG = (1<<7),
> /* Mask of all currently supported cb flags */
> - BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
> + BPF_SOCK_OPS_ALL_CB_FLAGS = 0xFF,
> };
The BPF subsystem requires multi-line comments to have the opening /* on
its own line. This comment starts with text on the same line as /*:
/* Call bpf when TCP payload is queued to sk->sk_receive_queue
Should it be formatted as:
/*
* Call bpf when TCP payload is queued to sk->sk_receive_queue
* and after recvmsg(). The bpf prog will be called under
* sock_ops->op == BPF_SOCK_OPS_RCVQ_CB.
*
* It can be used to adjust sk->sk_rcvlowat and suppress
* unnecessary wakeups before sufficient data is available.
*/
[ ... ]
> @@ -7176,6 +7187,11 @@ enum {
> * sendmsg timestamp with corresponding
> * tskey.
> */
> + BPF_SOCK_OPS_RCVQ_CB, /* Called when TCP payload is queued to
> + * sk->sk_receive_queue and after recvmsg()
> + * to allow adjusting sk->sk_rcvlowat and
> + * to suppress early wakeups.
> + */
> };
The multi-line comment here also starts with text on the same line as /*:
BPF_SOCK_OPS_RCVQ_CB, /* Called when TCP payload is queued to
Should it be reformatted with the opening /* on its own line?
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9590877b0714..4a50fe2cd863 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6002,7 +6002,8 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
> struct sock *sk = bpf_sock->sk;
> int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
>
> - if (!is_locked_tcp_sock_ops(bpf_sock))
> + if (!is_locked_tcp_sock_ops(bpf_sock) &&
> + bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB)
> return -EOPNOTSUPP;
>
> if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
> return -EINVAL;
>
> tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
>
> return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
The patch allows bpf_sock_ops_cb_flags_set() to modify
tcp_sk(sk)->bpf_sock_ops_cb_flags without holding the socket lock when
bpf_sock->op == BPF_SOCK_OPS_RCVQ_CB. However, the same field can also
be modified through bpf_setsockopt() which requires the socket lock.
Can there be a race between concurrent modifications?
When a BPF program in BPF_SOCK_OPS_RCVQ_CB context (unlocked) calls
bpf_sock_ops_cb_flags_set():
bpf_sock_ops_cb_flags_set()
tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
Another BPF program in a locked context could simultaneously call
bpf_setsockopt():
do_tcp_setsockopt()
sock_owned_by_me(sk);
...
tp->bpf_sock_ops_cb_flags = flags;
Both perform full writes (= val), not atomic operations. The
tcp_sock.bpf_sock_ops_cb_flags field is a plain u8, not atomic_t, and
no READ_ONCE/WRITE_ONCE is used. While byte writes are atomic on most
architectures, the lack of synchronization means one write can completely
overwrite the other.
Would this lead to incorrect callback flag state if two updates race?
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 37142e6d911a..3b8f392d8c69 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26275674572
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2026-05-22 8:22 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-22 7:44 [PATCH v2 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
2026-05-22 8:22 ` bot+bpf-ci
2026-05-22 7:44 ` [PATCH v2 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-22 8:22 ` bot+bpf-ci
2026-05-22 7:44 ` [PATCH v2 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 04/11] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
2026-05-22 8:22 ` bot+bpf-ci
2026-05-22 7:44 ` [PATCH v2 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 07/11] bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-22 7:44 ` [PATCH v2 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty Kuniyuki Iwashima
2026-05-22 7:45 ` [PATCH v2 bpf-next 09/11] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
2026-05-22 7:45 ` [PATCH v2 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
2026-05-22 7:45 ` [PATCH v2 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox