* [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.
@ 2026-05-08 7:33 Kuniyuki Iwashima
2026-05-08 7:33 ` [PATCH v1 bpf-next 1/8] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
` (7 more replies)
0 siblings, 8 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 7:33 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev
This series introduces BPF_SOCK_OPS_RCVLOWAT_CB, a new type
of opt-in hook for BPF SOCK_OPS programs.

The hooks can be enabled on a per-socket basis by bpf_setsockopt():

  int flags = BPF_SOCK_OPS_RCVLOWAT_CB_FLAG;

  bpf_setsockopt(sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
                 &flags, sizeof(flags));

or via the SOCK_OPS specific helper:

  bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVLOWAT_CB_FLAG);

Once activated, the BPF prog will be invoked with bpf_sock_ops.op
set to BPF_SOCK_OPS_RCVLOWAT_CB upon the following events:

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

This allows the BPF prog to dynamically adjust sk->sk_rcvlowat,
suppressing unnecessary EPOLLIN wakeups until sufficient data
is available in the receive queue.
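As a rough illustration, the skeleton of such a prog might look like
the sketch below (a minimal sketch, not taken from the selftest: the
cgroup attach boilerplate is omitted and the frame parsing, which is
application-specific, is left as a stub):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  SEC("sockops")
  int autolowat(struct bpf_sock_ops *skops)
  {
  	switch (skops->op) {
  	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
  	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
  		/* Opt this socket into the new callback. */
  		bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVLOWAT_CB_FLAG);
  		break;
  	case BPF_SOCK_OPS_RCVLOWAT_CB:
  		/* Parse the queued payload and adjust sk->sk_rcvlowat here,
  		 * using the helpers added later in this series.
  		 */
  		break;
  	}

  	return 1;
  }

  char _license[] SEC("license") = "GPL";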
This functionality, which we call "TCP AutoLOWAT", was originally
developed in 2020 by Tenzin Ukyab with the help of Soheil Hassas
Yeganeh, Arjun Roy, and Eric Dumazet. It has served Google RPC
workloads for more than 5 years.

Combined with TCP RX zerocopy, this typically allows us to read
an entire RPC frame with just a single wakeup and a single system
call.

While the original implementation was specialised for our
internal RPC format, this series introduces a more flexible
version by leveraging BPF.

The BPF SOCK_OPS prog in the last selftest patch closely mirrors
the core logic of the original implementation to provide a
real-world example.

Overview:

  Patch 1     : Misc cleanup for testing
  Patch 2     : Add BPF_SOCK_OPS_RCVLOWAT_CB with no actual hooks
  Patch 3 - 5 : Add BPF helpers
  Patch 6 - 7 : Add BPF_SOCK_OPS_RCVLOWAT_CB hooks
  Patch 8     : Selftest
Kuniyuki Iwashima (8):
selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB.
bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB.
tcp: Split out __tcp_set_rcvlowat().
bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
bpf: tcp: Factorise bpf_skops_established().
bpf: tcp: Add SOCK_OPS rcvlowat hook.
selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB.
include/net/tcp.h | 15 +
include/uapi/linux/bpf.h | 18 +-
net/core/filter.c | 51 +++
net/ipv4/tcp.c | 14 +-
net/ipv4/tcp_fastopen.c | 2 +
net/ipv4/tcp_input.c | 25 +-
tools/include/uapi/linux/bpf.h | 18 +-
tools/testing/selftests/bpf/bpf_kfuncs.h | 4 +
.../selftests/bpf/prog_tests/tcp_autolowat.c | 350 ++++++++++++++++++
.../selftests/bpf/prog_tests/tcpbpf_user.c | 3 +-
.../selftests/bpf/progs/bpf_tracing_net.h | 2 +
.../selftests/bpf/progs/tcp_autolowat.c | 316 ++++++++++++++++
.../selftests/bpf/progs/test_tcpbpf_kern.c | 3 +-
13 files changed, 810 insertions(+), 11 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
create mode 100644 tools/testing/selftests/bpf/progs/tcp_autolowat.c
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply [flat|nested] 30+ messages in thread

* [PATCH v1 bpf-next 1/8] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
2026-05-08 7:33 [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
@ 2026-05-08 7:33 ` Kuniyuki Iwashima
2026-05-08 19:02 ` sashiko-bot
2026-05-08 7:33 ` [PATCH v1 bpf-next 2/8] bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
` (6 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 7:33 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev

Once bpf_sock_ops_cb_flags_set() supports a new flag,
tcpbpf_user.c fails due to the hard-coded max value, 0x80.

Let's replace 0x80 with BPF_SOCK_OPS_ALL_CB_FLAGS + 1.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c | 3 ++-
 tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
index 7e8fe1bad03f..e4849d2a2956 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
@@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
 	ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
 	ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
 	ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
-	ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
+	ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
+		  "bad_cb_test_rv");
 	ASSERT_EQ(result->good_cb_test_rv, 0, "good_cb_test_rv");
 	ASSERT_EQ(result->num_listen, 1, "num_listen");

diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
index 6935f32eeb8f..e30cb1fab079 100644
--- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
@@ -92,7 +92,8 @@ int bpf_testcb(struct bpf_sock_ops *skops)
 		break;
 	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
 		/* Test failure to set largest cb flag (assumes not defined) */
-		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops, 0x80);
+		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
+								  BPF_SOCK_OPS_ALL_CB_FLAGS + 1);
 		/* Set callback */
 		global.good_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
 								   BPF_SOCK_OPS_STATE_CB_FLAG);
--
2.54.0.563.g4f69b47b94-goog

^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 1/8] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
2026-05-08 7:33 ` [PATCH v1 bpf-next 1/8] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
@ 2026-05-08 19:02 ` sashiko-bot
2026-05-08 20:21 ` Kuniyuki Iwashima
0 siblings, 1 reply; 30+ messages in thread
From: sashiko-bot @ 2026-05-08 19:02 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

- [Medium] Using `BPF_SOCK_OPS_ALL_CB_FLAGS + 1` to test invalid BPF callback
  flags creates brittle test logic that assumes contiguous bitmasks and breaks
  forward compatibility.
--

commit dfed6c2387074d0f6f7afb70b6bba777371f7267
Author: Kuniyuki Iwashima <kuniyu@google.com>

    selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.

    Once bpf_sock_ops_cb_flags_set() supports a new flag,
    tcpbpf_user.c fails due to the hard-coded max value, 0x80.

> diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> index 7e8fe1bad03f5..e4849d2a2956f 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> @@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
>  	ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
>  	ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
>  	ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
> -	ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
> +	ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
> +		  "bad_cb_test_rv");

Is BPF_SOCK_OPS_ALL_CB_FLAGS + 1 robust for forward compatibility?

When the test is compiled, BPF_SOCK_OPS_ALL_CB_FLAGS + 1 is evaluated to a
hardcoded value. If this compiled test runs on a future kernel where this
value is supported, the helper will successfully set the flag and return 0.
The userspace test would still expect the hardcoded remainder, causing it
to fail.

Furthermore, this assumes contiguous bitmasks. If a flag is ever deprecated
or skipped, adding 1 could overlap with valid bits.

Would it be better to use a safely distant bit, such as 1 << 30, to
guarantee the flag remains unsupported and the test doesn't break in the
future?

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508073355.3916746-1-kuniyu@google.com?part=1

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 1/8] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
2026-05-08 19:02 ` sashiko-bot
@ 2026-05-08 20:21 ` Kuniyuki Iwashima
0 siblings, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 20:21 UTC (permalink / raw)
To: sashiko; +Cc: bpf

On Fri, May 8, 2026 at 12:02 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] Using `BPF_SOCK_OPS_ALL_CB_FLAGS + 1` to test invalid BPF callback flags creates brittle test logic that assumes contiguous bitmasks and breaks forward compatibility.
> --
>
> commit dfed6c2387074d0f6f7afb70b6bba777371f7267
> Author: Kuniyuki Iwashima <kuniyu@google.com>
>
>     selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
>
>     Once bpf_sock_ops_cb_flags_set() supports a new flag,
>     tcpbpf_user.c fails due to the hard-coded max value, 0x80.
>
> > diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> > index 7e8fe1bad03f5..e4849d2a2956f 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> > @@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
> >  	ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
> >  	ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
> >  	ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
> > -	ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
> > +	ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
> > +		  "bad_cb_test_rv");
>
> Is BPF_SOCK_OPS_ALL_CB_FLAGS + 1 robust for forward compatibility?

This test is not supposed to run on older kernels.

^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v1 bpf-next 2/8] bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB.
2026-05-08 7:33 [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
2026-05-08 7:33 ` [PATCH v1 bpf-next 1/8] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
@ 2026-05-08 7:33 ` Kuniyuki Iwashima
2026-05-08 19:17 ` sashiko-bot
2026-05-08 7:33 ` [PATCH v1 bpf-next 3/8] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
` (5 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 7:33 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev

We will introduce a new type of opt-in hook for BPF SOCK_OPS
programs.

The hooks can be enabled on a per-socket basis by bpf_setsockopt():

  int flags = BPF_SOCK_OPS_RCVLOWAT_CB_FLAG;

  bpf_setsockopt(sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
                 &flags, sizeof(flags));

or via the SOCK_OPS specific helper:

  bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVLOWAT_CB_FLAG);

Once activated, the BPF prog will be invoked with bpf_sock_ops.op
set to BPF_SOCK_OPS_RCVLOWAT_CB upon the following events:

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

This will allow the BPF prog to dynamically adjust sk->sk_rcvlowat,
suppressing unnecessary EPOLLIN wakeups until sufficient data
(e.g., a full RPC frame) is available in the receive queue.

Note that is_locked_tcp_sock_ops() is left unchanged so as not to
enable bpf_setsockopt() unnecessarily.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/uapi/linux/bpf.h       | 18 +++++++++++++++++-
 tools/include/uapi/linux/bpf.h | 18 +++++++++++++++++-
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 552bc5d9afbd..e139a4e94ffd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6952,6 +6952,9 @@ struct bpf_sock_ops {
 	 *			the 3WHS.
 	 *			BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
 	 *			the 3WHS.
+	 *			BPF_SOCK_OPS_RCVLOWAT_CB : No header included. The payload is only
+	 *			accessible by passing bpf_sock_ops to
+	 *			bpf_skb_load_bytes().
 	 *
 	 * bpf_load_hdr_opt() can also be used to read a particular option.
 	 */
@@ -7023,8 +7026,16 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+	/* Call bpf when TCP payload is queued to sk->sk_receive_queue
+	 * and after recvmsg(). The bpf prog will be called under
+	 * sock_ops->op == BPF_SOCK_OPS_RCVLOWAT_CB.
+	 *
+	 * It can be used to adjust sk->sk_rcvlowat and suppress
+	 * unnecessary wakeups before sufficient data is available.
+	 */
+	BPF_SOCK_OPS_RCVLOWAT_CB_FLAG = (1<<7),
 	/* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
+	BPF_SOCK_OPS_ALL_CB_FLAGS = 0xFF,
 };

 enum {
@@ -7168,6 +7179,11 @@ enum {
 	 *			sendmsg timestamp with corresponding
 	 *			tskey.
 	 */
+	BPF_SOCK_OPS_RCVLOWAT_CB,	/* Called when TCP payload is queued to
+					 * sk->sk_receive_queue and after recvmsg()
+					 * to allow adjusting sk->sk_rcvlowat and
+					 * to suppress early wakeups.
+					 */
 };

 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 677be9a47347..b5268a66ecb4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6952,6 +6952,9 @@ struct bpf_sock_ops {
 	 *			the 3WHS.
 	 *			BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
 	 *			the 3WHS.
+	 *			BPF_SOCK_OPS_RCVLOWAT_CB : No header included. The payload is only
+	 *			accessible by passing bpf_sock_ops to
+	 *			bpf_skb_load_bytes().
 	 *
 	 * bpf_load_hdr_opt() can also be used to read a particular option.
 	 */
@@ -7023,8 +7026,16 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+	/* Call bpf when TCP payload is queued to sk->sk_receive_queue
+	 * and after recvmsg(). The bpf prog will be called under
+	 * sock_ops->op == BPF_SOCK_OPS_RCVLOWAT_CB.
+	 *
+	 * It can be used to adjust sk->sk_rcvlowat and suppress
+	 * unnecessary wakeups before sufficient data is available.
+	 */
+	BPF_SOCK_OPS_RCVLOWAT_CB_FLAG = (1<<7),
 	/* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
+	BPF_SOCK_OPS_ALL_CB_FLAGS = 0xFF,
 };

 enum {
@@ -7168,6 +7179,11 @@ enum {
 	 *			sendmsg timestamp with corresponding
 	 *			tskey.
 	 */
+	BPF_SOCK_OPS_RCVLOWAT_CB,	/* Called when TCP payload is queued to
+					 * sk->sk_receive_queue and after recvmsg()
+					 * to allow adjusting sk->sk_rcvlowat and
+					 * to suppress early wakeups.
+					 */
 };

 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
--
2.54.0.563.g4f69b47b94-goog

^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 2/8] bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB.
2026-05-08 7:33 ` [PATCH v1 bpf-next 2/8] bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
@ 2026-05-08 19:17 ` sashiko-bot
2026-05-08 20:26 ` Kuniyuki Iwashima
0 siblings, 1 reply; 30+ messages in thread
From: sashiko-bot @ 2026-05-08 19:17 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

- [High] The deliberate exclusion of BPF_SOCK_OPS_RCVLOWAT_CB from
  is_locked_tcp_sock_ops() disables critical socket helpers, notably
  bpf_sock_ops_cb_flags_set().
--

commit 2ab2e31caf792b1e7312616215a88ad7ae1ef4e1
Author: Kuniyuki Iwashima <kuniyu@google.com>

    bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB.

    This commit introduces a new hook, BPF_SOCK_OPS_RCVLOWAT_CB, which allows
    BPF programs to dynamically adjust sk->sk_rcvlowat upon recvmsg completion
    or when packets are enqueued. The commit leaves is_locked_tcp_sock_ops
    unchanged to prevent unnecessary enablement of bpf_setsockopt.

> This will allow the BPF prog to dynamically adjust sk->sk_rcvlowat,
> suppressing unnecessary EPOLLIN wakeups until sufficient data
> (e.g., a full RPC frame) is available in the receive queue.
>
> Note that is_locked_tcp_sock_ops() is left unchanged so as not to
> enable bpf_setsockopt() unnecessarily.

Does leaving is_locked_tcp_sock_ops() unchanged also prevent the use of
bpf_sock_ops_cb_flags_set() from within the BPF_SOCK_OPS_RCVLOWAT_CB hook?

Because BPF_SOCK_OPS_RCVLOWAT_CB is appended to the enum after
BPF_SOCK_OPS_WRITE_HDR_OPT_CB, is_locked_tcp_sock_ops() returns false
for this new hook.

Both bpf_sock_ops_cb_flags_set() and bpf_setsockopt() require
is_locked_tcp_sock_ops() to return true. If it returns false, they fail
with -EOPNOTSUPP.

Since direct modification of bpf_sock_ops_cb_flags is disallowed by the
verifier, wouldn't a BPF program running in this context be entirely unable
to clear the BPF_SOCK_OPS_RCVLOWAT_CB_FLAG once it finishes parsing?

If the flag cannot be cleared, does this permanently trap the socket in the
fast-path hook for every enqueued packet and recvmsg() call, leading to
performance degradation?

While the potential UAF concern regarding bpf_skb_load_bytes() is a false
positive addressed later in the series by "bpf: tcp: Support
bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB", and the hook omission
is fixed by "bpf: tcp: Add SOCK_OPS rcvlowat hook", this limitation
regarding is_locked_tcp_sock_ops() does not appear to be addressed later
in the series.

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508073355.3916746-1-kuniyu@google.com?part=2

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 2/8] bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB.
2026-05-08 19:17 ` sashiko-bot
@ 2026-05-08 20:26 ` Kuniyuki Iwashima
0 siblings, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 20:26 UTC (permalink / raw)
To: sashiko; +Cc: bpf

On Fri, May 8, 2026 at 12:17 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] The deliberate exclusion of BPF_SOCK_OPS_RCVLOWAT_CB from is_locked_tcp_sock_ops() disables critical socket helpers, notably bpf_sock_ops_cb_flags_set().
> --
>
> commit 2ab2e31caf792b1e7312616215a88ad7ae1ef4e1
> Author: Kuniyuki Iwashima <kuniyu@google.com>
>
>     bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB.
>
>     This commit introduces a new hook, BPF_SOCK_OPS_RCVLOWAT_CB, which allows
>     BPF programs to dynamically adjust sk->sk_rcvlowat upon recvmsg completion
>     or when packets are enqueued. The commit leaves is_locked_tcp_sock_ops
>     unchanged to prevent unnecessary enablement of bpf_setsockopt.
>
> > This will allow the BPF prog to dynamically adjust sk->sk_rcvlowat,
> > suppressing unnecessary EPOLLIN wakeups until sufficient data
> > (e.g., a full RPC frame) is available in the receive queue.
> >
> > Note that is_locked_tcp_sock_ops() is left unchanged so as not to
> > enable bpf_setsockopt() unnecessarily.
>
> Does leaving is_locked_tcp_sock_ops() unchanged also prevent the use of
> bpf_sock_ops_cb_flags_set() from within the BPF_SOCK_OPS_RCVLOWAT_CB hook?

Good catch. I need to support BPF_SOCK_OPS_RCVLOWAT_CB in
bpf_sock_ops_cb_flags_set().

> Because BPF_SOCK_OPS_RCVLOWAT_CB is appended to the enum after
> BPF_SOCK_OPS_WRITE_HDR_OPT_CB, is_locked_tcp_sock_ops() returns false
> for this new hook.
>
> Both bpf_sock_ops_cb_flags_set() and bpf_setsockopt() require
> is_locked_tcp_sock_ops() to return true. If it returns false, they fail
> with -EOPNOTSUPP.
>
> Since direct modification of bpf_sock_ops_cb_flags is disallowed by the
> verifier, wouldn't a BPF program running in this context be entirely unable
> to clear the BPF_SOCK_OPS_RCVLOWAT_CB_FLAG once it finishes parsing?
>
> If the flag cannot be cleared, does this permanently trap the socket in the
> fast-path hook for every enqueued packet and recvmsg() call, leading to
> performance degradation?
>
> While the potential UAF concern regarding bpf_skb_load_bytes() is a false
> positive addressed later in the series by "bpf: tcp: Support
> bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB", and the hook omission
> is fixed by "bpf: tcp: Add SOCK_OPS rcvlowat hook", this limitation
> regarding is_locked_tcp_sock_ops() does not appear to be addressed later
> in the series.

Looks like Sashiko now has context of the series, nice.

^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v1 bpf-next 3/8] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB.
2026-05-08 7:33 [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
2026-05-08 7:33 ` [PATCH v1 bpf-next 1/8] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
2026-05-08 7:33 ` [PATCH v1 bpf-next 2/8] bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
@ 2026-05-08 7:33 ` Kuniyuki Iwashima
2026-05-08 15:15 ` Stanislav Fomichev
2026-05-08 7:33 ` [PATCH v1 bpf-next 4/8] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
` (4 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 7:33 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev

When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
prog can be called with BPF_SOCK_OPS_RCVLOWAT_CB.

In this hook, we want to parse the RPC descriptor in the skb
and adjust sk->sk_rcvlowat based on the RPC frame size.

However, we cannot access payload via bpf_sock_ops.data on
modern NICs with TCP header/data split on as the payload is
not placed in the linear area.

Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB.

Two notes:

1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
   invoked from recvmsg().

2) Access to bpf_sock_ops.data will be disabled by passing
   0 end_offset to bpf_skops_init_skb().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 5fa9189eb772..94d07a15b2ab 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7718,6 +7718,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };

+BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
+	   u32, offset, void *, to, u32, len)
+{
+	int err;
+
+	if (bpf_sock->op != BPF_SOCK_OPS_RCVLOWAT_CB) {
+		err = -EOPNOTSUPP;
+		goto err_clear;
+	}
+
+	if (!bpf_sock->skb) {
+		err = -EPERM;
+		goto err_clear;
+	}
+
+	return ____bpf_skb_load_bytes(bpf_sock->skb, offset, to, len);
+
+err_clear:
+	memset(to, 0, len);
+	return err;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_skb_load_bytes_proto = {
+	.func		= bpf_sock_ops_skb_load_bytes,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg4_type	= ARG_CONST_SIZE,
+};
+
 static const u8 *bpf_search_tcp_opt(const u8 *op, const u8 *opend,
 				    u8 search_kind, const u8 *magic,
 				    u8 magic_len, bool *eol)
@@ -8574,6 +8606,8 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_get_netns_cookie:
 		return &bpf_get_netns_cookie_sock_ops_proto;
 #ifdef CONFIG_INET
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_sock_ops_skb_load_bytes_proto;
 	case BPF_FUNC_load_hdr_opt:
 		return &bpf_sock_ops_load_hdr_opt_proto;
 	case BPF_FUNC_store_hdr_opt:
--
2.54.0.563.g4f69b47b94-goog

^ permalink raw reply related [flat|nested] 30+ messages in thread
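For illustration, the BPF-side usage this enables might look like the
hypothetical sketch below (not part of the patch: it assumes the
RCVLOWAT_CB hook from patch 2 and, purely for the example, a 4-byte
length prefix at the head of the queued skb's payload):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  SEC("sockops")
  int peek_frame_len(struct bpf_sock_ops *skops)
  {
  	__u32 len_prefix;

  	if (skops->op != BPF_SOCK_OPS_RCVLOWAT_CB)
  		return 1;

  	/* Fails with -EPERM in the recvmsg() hook, where skb is NULL. */
  	if (bpf_skb_load_bytes(skops, 0, &len_prefix, sizeof(len_prefix)))
  		return 1;

  	/* len_prefix now holds the first 4 payload bytes of the skb. */
  	return 1;
  }

  char _license[] SEC("license") = "GPL";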
* Re: [PATCH v1 bpf-next 3/8] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB.
2026-05-08 7:33 ` [PATCH v1 bpf-next 3/8] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
@ 2026-05-08 15:15 ` Stanislav Fomichev
2026-05-08 19:45 ` Kuniyuki Iwashima
0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2026-05-08 15:15 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
bpf, netdev

On 05/08, Kuniyuki Iwashima wrote:
> When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
> prog can be called with BPF_SOCK_OPS_RCVLOWAT_CB.
>
> In this hook, we want to parse the RPC descriptor in the skb
> and adjust sk->sk_rcvlowat based on the RPC frame size.
>
> However, we cannot access payload via bpf_sock_ops.data on
> modern NICs with TCP header/data split on as the payload is
> not placed in the linear area.
>
> Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB.
>
> Two notes:
>
> 1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
>    invoked from recvmsg().
>
> 2) Access to bpf_sock_ops.data will be disabled by passing
>    0 end_offset to bpf_skops_init_skb().
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
>  net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 5fa9189eb772..94d07a15b2ab 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -7718,6 +7718,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
>  	.arg3_type	= ARG_ANYTHING,
>  };
>
> +BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
> +	   u32, offset, void *, to, u32, len)
> +{
> +	int err;
> +
> +	if (bpf_sock->op != BPF_SOCK_OPS_RCVLOWAT_CB) {
> +		err = -EOPNOTSUPP;
> +		goto err_clear;
> +	}
> +
> +	if (!bpf_sock->skb) {
> +		err = -EPERM;
> +		goto err_clear;
> +	}
> +
> +	return ____bpf_skb_load_bytes(bpf_sock->skb, offset, to, len);

nit: wondering whether it's better to use __bpf_skb_load_bytes (which
wraps BPF_CALL_4's ____bpf_skb_load_bytes) here? Seems a bit less
magic and used already by dynptr?

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 3/8] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB.
2026-05-08 15:15 ` Stanislav Fomichev
@ 2026-05-08 19:45 ` Kuniyuki Iwashima
2026-05-11 14:56 ` Stanislav Fomichev
0 siblings, 1 reply; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 19:45 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
bpf, netdev

On Fri, May 8, 2026 at 8:15 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> On 05/08, Kuniyuki Iwashima wrote:
> > When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
> > prog can be called with BPF_SOCK_OPS_RCVLOWAT_CB.
> >
> > In this hook, we want to parse the RPC descriptor in the skb
> > and adjust sk->sk_rcvlowat based on the RPC frame size.
> >
> > However, we cannot access payload via bpf_sock_ops.data on
> > modern NICs with TCP header/data split on as the payload is
> > not placed in the linear area.
> >
> > Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB.
> >
> > Two notes:
> >
> > 1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
> >    invoked from recvmsg().
> >
> > 2) Access to bpf_sock_ops.data will be disabled by passing
> >    0 end_offset to bpf_skops_init_skb().
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > ---
> >  net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 5fa9189eb772..94d07a15b2ab 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -7718,6 +7718,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
> >  	.arg3_type	= ARG_ANYTHING,
> >  };
> >
> > +BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
> > +	   u32, offset, void *, to, u32, len)
> > +{
> > +	int err;
> > +
> > +	if (bpf_sock->op != BPF_SOCK_OPS_RCVLOWAT_CB) {
> > +		err = -EOPNOTSUPP;
> > +		goto err_clear;
> > +	}
> > +
> > +	if (!bpf_sock->skb) {
> > +		err = -EPERM;
> > +		goto err_clear;
> > +	}
> > +
> > +	return ____bpf_skb_load_bytes(bpf_sock->skb, offset, to, len);
>
> nit: wondering whether it's better to use __bpf_skb_load_bytes (which
> wraps BPF_CALL_4's ____bpf_skb_load_bytes) here? Seems a bit less
> magic and used already by dynptr?

I guess dynptr used it because dynptr is in a different file under kernel/.

Users in the same file (sk_reuseport_load_bytes() and
sk_reuseport_load_bytes_relative()) call ____bpf_skb_load_bytes()
directly and clang inlined it there (including __bpf_skb_load_bytes())
while dynptr tail-called __bpf_skb_load_bytes().

I see the point (I had to read BPF_CALL_4() macro for ____), but
____bpf_skb_load_bytes() looks better.

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 3/8] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB.
2026-05-08 19:45 ` Kuniyuki Iwashima
@ 2026-05-11 14:56 ` Stanislav Fomichev
0 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2026-05-11 14:56 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
bpf, netdev

On 05/08, Kuniyuki Iwashima wrote:
> On Fri, May 8, 2026 at 8:15 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
> >
> > On 05/08, Kuniyuki Iwashima wrote:
> > > When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
> > > prog can be called with BPF_SOCK_OPS_RCVLOWAT_CB.
> > >
> > > In this hook, we want to parse the RPC descriptor in the skb
> > > and adjust sk->sk_rcvlowat based on the RPC frame size.
> > >
> > > However, we cannot access payload via bpf_sock_ops.data on
> > > modern NICs with TCP header/data split on as the payload is
> > > not placed in the linear area.
> > >
> > > Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB.
> > >
> > > Two notes:
> > >
> > > 1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
> > >    invoked from recvmsg().
> > >
> > > 2) Access to bpf_sock_ops.data will be disabled by passing
> > >    0 end_offset to bpf_skops_init_skb().
> > >
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > > ---
> > >  net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 5fa9189eb772..94d07a15b2ab 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -7718,6 +7718,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
> > >  	.arg3_type	= ARG_ANYTHING,
> > >  };
> > >
> > > +BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
> > > +	   u32, offset, void *, to, u32, len)
> > > +{
> > > +	int err;
> > > +
> > > +	if (bpf_sock->op != BPF_SOCK_OPS_RCVLOWAT_CB) {
> > > +		err = -EOPNOTSUPP;
> > > +		goto err_clear;
> > > +	}
> > > +
> > > +	if (!bpf_sock->skb) {
> > > +		err = -EPERM;
> > > +		goto err_clear;
> > > +	}
> > > +
> > > +	return ____bpf_skb_load_bytes(bpf_sock->skb, offset, to, len);
> >
> > nit: wondering whether it's better to use __bpf_skb_load_bytes (which
> > wraps BPF_CALL_4's ____bpf_skb_load_bytes) here? Seems a bit less
> > magic and used already by dynptr?
>
> I guess dynptr used it because dynptr is in a different file under kernel/.
>
> Users in the same file (sk_reuseport_load_bytes() and
> sk_reuseport_load_bytes_relative()) call ____bpf_skb_load_bytes()
> directly and clang inlined it there (including __bpf_skb_load_bytes())
> while dynptr tail-called __bpf_skb_load_bytes().
>
> I see the point (I had to read BPF_CALL_4() macro for ____), but
> ____bpf_skb_load_bytes() looks better.

SG!

^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v1 bpf-next 4/8] tcp: Split out __tcp_set_rcvlowat().
2026-05-08 7:33 [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (2 preceding siblings ...)
2026-05-08 7:33 ` [PATCH v1 bpf-next 3/8] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
@ 2026-05-08 7:33 ` Kuniyuki Iwashima
2026-05-08 7:33 ` [PATCH v1 bpf-next 5/8] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
` (3 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 7:33 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev

We will add a kfunc for BPF_SOCK_OPS_RCVLOWAT_CB hooks to adjust
sk->sk_rcvlowat. These hooks will be triggered when:

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

In the enqueue path, tcp_data_ready() is always called after the
hooks in tcp_queue_rcv() and tcp_ofo_queue(). If tcp_set_rcvlowat()
were used as is, tcp_data_ready() could be called twice for the
same skb, which is redundant and also confusing.

Let's split out __tcp_set_rcvlowat() and add a flag to control
wakeup behaviour.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/tcp.h |  1 +
 net/ipv4/tcp.c    | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index dfa52ceefd23..4e9e634e276b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -515,6 +515,7 @@ void tcp_set_keepalive(struct sock *sk, int val);
 void tcp_syn_ack_timeout(const struct request_sock *req);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		int flags);
+int __tcp_set_rcvlowat(struct sock *sk, int val, bool wakeup);
 int tcp_set_rcvlowat(struct sock *sk, int val);
 void tcp_set_rcvbuf(struct sock *sk, int val);
 int tcp_set_window_clamp(struct sock *sk, int val);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2014a6408e93..1d9e52fc454f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1829,8 +1829,7 @@ int tcp_peek_len(struct socket *sock)
 	return tcp_inq(sock->sk);
 }

-/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
-int tcp_set_rcvlowat(struct sock *sk, int val)
+int __tcp_set_rcvlowat(struct sock *sk, int val, bool wakeup)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int space, cap;
@@ -1843,7 +1842,8 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 	WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);

 	/* Check if we need to signal EPOLLIN right now */
-	tcp_data_ready(sk);
+	if (wakeup)
+		tcp_data_ready(sk);

 	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
 		return 0;
@@ -1858,6 +1858,12 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 	return 0;
 }

+/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
+int tcp_set_rcvlowat(struct sock *sk, int val)
+{
+	return __tcp_set_rcvlowat(sk, val, true);
+}
+
 void tcp_set_rcvbuf(struct sock *sk, int val)
 {
 	tcp_set_window_clamp(sk, tcp_win_from_space(sk, val));
--
2.54.0.563.g4f69b47b94-goog

^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v1 bpf-next 5/8] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
2026-05-08 7:33 [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (3 preceding siblings ...)
2026-05-08 7:33 ` [PATCH v1 bpf-next 4/8] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
@ 2026-05-08 7:33 ` Kuniyuki Iwashima
2026-05-11 12:34 ` Björn Töpel
2026-05-08 7:33 ` [PATCH v1 bpf-next 6/8] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
` (2 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 7:33 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev

We will invoke BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVLOWAT_CB
to adjust sk->sk_rcvlowat when

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

Let's provide a kfunc to set sk->sk_rcvlowat.

Negative values are clamped to INT_MAX, consistent with SO_RCVLOWAT.

The wakeup flag is determined based on bpf_sock_ops_kern.skb:

  * For the enqueue hook, skb is always non-NULL, and wakeup is
    set to false because

    * tcp_data_ready() is always called after the hooks in
      tcp_queue_rcv() and tcp_ofo_queue().

    * when tcp_fastopen_add_skb() is called for TFO SYN,
      the socket is not yet accept()ed, and when called
      for TFO SYN+ACK, the socket is woken up by
      sk->sk_state_change() anyway.

  * For the recvmsg() hook, skb is always NULL, and wakeup is set
    to true because tcp_data_ready() is not called in the path.

An alternative would be to support bpf_setsockopt() by adding
BPF_SOCK_OPS_RCVLOWAT_CB to is_locked_tcp_sock_ops().

However, that approach involves excessive conditionals and an
unnecessary memcpy(), costs we do not want to pay for every skb
in the TCP fast path.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/core/filter.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 94d07a15b2ab..9c4cd27c6d4e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -12346,6 +12346,22 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
 #endif
 }

+__bpf_kfunc int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops,
+					      int rcvlowat)
+{
+#ifdef CONFIG_INET
+	if (skops->op != BPF_SOCK_OPS_RCVLOWAT_CB)
+		return -EOPNOTSUPP;
+
+	if (rcvlowat < 0)
+		rcvlowat = INT_MAX;
+
+	return __tcp_set_rcvlowat(skops->sk, rcvlowat, !skops->skb);
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
 __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
 					      u64 flags)
 {
@@ -12497,6 +12513,7 @@ BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)

 BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
 BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
+BTF_ID_FLAGS(func, bpf_sock_ops_tcp_set_rcvlowat)
 BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)

 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
--
2.54.0.563.g4f69b47b94-goog

^ permalink raw reply related [flat|nested] 30+ messages in thread
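A hypothetical caller of the new kfunc might look like the sketch below
(assumptions: bpf_cast_to_kern_ctx() is the existing way to obtain the
bpf_sock_ops_kern pointer the kfunc expects, the extern declarations
mirror what a selftest header would provide, and the fixed 4-byte
host-order length prefix exists only for illustration):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  extern void *bpf_cast_to_kern_ctx(void *obj) __ksym;
  extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops,
  					   int rcvlowat) __ksym;

  SEC("sockops")
  int set_lowat(struct bpf_sock_ops *skops)
  {
  	struct bpf_sock_ops_kern *skops_kern;
  	__u32 frame_len;

  	if (skops->op != BPF_SOCK_OPS_RCVLOWAT_CB)
  		return 1;

  	/* Assumed layout: each frame starts at the head of this skb's
  	 * payload with a 4-byte length field.
  	 */
  	if (bpf_skb_load_bytes(skops, 0, &frame_len, sizeof(frame_len)))
  		return 1;

  	skops_kern = bpf_cast_to_kern_ctx(skops);
  	/* Hold off EPOLLIN until a full frame has been queued. */
  	bpf_sock_ops_tcp_set_rcvlowat(skops_kern, frame_len);

  	return 1;
  }

  char _license[] SEC("license") = "GPL";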
* Re: [PATCH v1 bpf-next 5/8] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
2026-05-08 7:33 ` [PATCH v1 bpf-next 5/8] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
@ 2026-05-11 12:34 ` Björn Töpel
0 siblings, 0 replies; 30+ messages in thread
From: Björn Töpel @ 2026-05-11 12:34 UTC (permalink / raw)
To: Kuniyuki Iwashima, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev

Kuniyuki Iwashima <kuniyu@google.com> writes:

> We will invoke BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVLOWAT_CB
> to adjust sk->sk_rcvlowat when
>
>   1. TCP stack enqueues skb to sk->sk_receive_queue
>   2. TCP recvmsg() completes
>
> Let's provide a kfunc to set sk->sk_rcvlowat.
>
> Negative values are clamped to INT_MAX, consistent with SO_RCVLOWAT.
>
> The wakeup flag is determined based on bpf_sock_ops_kern.skb:
>
>   * For the enqueue hook, skb is always non-NULL, and wakeup is
>     set to false because
>
>     * tcp_data_ready() is always called after the hooks in
>       tcp_queue_rcv() and tcp_ofo_queue().
>
>     * when tcp_fastopen_add_skb() is called for TFO SYN,
>       the socket is not yet accept()ed, and when called
>       for TFO SYN+ACK, the socket is woken up by
>       sk->sk_state_change() anyway.
>
>   * For the recvmsg() hook, skb is always NULL, and wakeup is set
>     to true because tcp_data_ready() is not called in the path.
>
> An alternative would be to support bpf_setsockopt() by adding
> BPF_SOCK_OPS_RCVLOWAT_CB to is_locked_tcp_sock_ops().
>
> However, that approach involves excessive conditionals and an
> unnecessary memcpy(), costs we do not want to pay for every skb
> in the TCP fast path.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
>  net/core/filter.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 94d07a15b2ab..9c4cd27c6d4e 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -12346,6 +12346,22 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
>  #endif
>  }
>
> +__bpf_kfunc int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops,
> +					      int rcvlowat)
> +{
> +#ifdef CONFIG_INET
> +	if (skops->op != BPF_SOCK_OPS_RCVLOWAT_CB)
> +		return -EOPNOTSUPP;
> +
> +	if (rcvlowat < 0)
> +		rcvlowat = INT_MAX;
> +
> +	return __tcp_set_rcvlowat(skops->sk, rcvlowat, !skops->skb);
> +#else
> +	return -EOPNOTSUPP;
> +#endif
> +}
> +

(Nice work BTW! I played a bit with this, but took a sockmap/stream
parser approach instead -- this is much nicer!)

Curious if we can get into a situation that commit fcf4692fa39e
("mptcp: prevent BPF accessing lowat from a subflow socket.") fixed?

IOW, we can construct a program that enables autolowat on MPTCP
subflows, right? If so, do we need a similar check (or checking for
subflows) in the kfunc?


Björn

^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v1 bpf-next 6/8] bpf: tcp: Factorise bpf_skops_established().
2026-05-08 7:33 [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (4 preceding siblings ...)
2026-05-08 7:33 ` [PATCH v1 bpf-next 5/8] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
@ 2026-05-08 7:33 ` Kuniyuki Iwashima
2026-05-08 7:33 ` [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
2026-05-08 7:33 ` [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
7 siblings, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 7:33 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev

We will call BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVLOWAT_CB.

It requires a similar setup to bpf_skops_established(), and the
only difference is the skb data length.

Let's factor out the common logic into bpf_skops_common_locked().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/ipv4/tcp_input.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 021f745747c5..7e26503fd96d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -179,8 +179,9 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
 	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
 }

-static void bpf_skops_established(struct sock *sk, int bpf_op,
-				  struct sk_buff *skb)
+static void bpf_skops_common_locked(struct sock *sk, int bpf_op,
+				    struct sk_buff *skb,
+				    unsigned int end_offset)
 {
 	struct bpf_sock_ops_kern sock_ops;

@@ -191,12 +192,18 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
 	sock_ops.is_fullsock = 1;
 	sock_ops.is_locked_tcp_sock = 1;
 	sock_ops.sk = sk;
-	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
 	if (skb)
-		bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
+		bpf_skops_init_skb(&sock_ops, skb, end_offset);

 	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
 }
+
+static void bpf_skops_established(struct sock *sk, int bpf_op,
+				  struct sk_buff *skb)
+{
+	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
+	bpf_skops_common_locked(sk, bpf_op, skb, skb ? tcp_hdrlen(skb) : 0);
+}
 #else
 static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
 {
--
2.54.0.563.g4f69b47b94-goog

^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook.
2026-05-08 7:33 [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
` (5 preceding siblings ...)
2026-05-08 7:33 ` [PATCH v1 bpf-next 6/8] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
@ 2026-05-08 7:33 ` Kuniyuki Iwashima
2026-05-08 10:37 ` Jiayuan Chen
` (2 more replies)
2026-05-08 7:33 ` [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
7 siblings, 3 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 7:33 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
Kuniyuki Iwashima, bpf, netdev

Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVLOWAT_CB.

Let's invoke the BPF SOCK_OPS prog when

  1. TCP stack enqueues skb to sk->sk_receive_queue
     -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()

  2. TCP recvmsg() completes
     -> __tcp_cleanup_rbuf()

This will allow the BPF prog to parse each skb and dynamically
adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
until sufficient data (e.g., a full RPC frame) is available
in the receive queue.

Note that the direct access to bpf_sock_ops.data is intentionally
disabled by passing 0 as end_offset.

Instead, the BPF prog is supposed to use bpf_skb_load_bytes()
with bpf_sock_ops because payload is not in the linear area
with TCP header/data split on and skb may contain a RPC
descriptor in skb frag. This also simplifies the BPF prog.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/tcp.h       | 14 ++++++++++++++
 net/ipv4/tcp.c          |  2 ++
 net/ipv4/tcp_fastopen.c |  2 ++
 net/ipv4/tcp_input.c    | 10 ++++++++++
 4 files changed, 28 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4e9e634e276b..003e46c9b500 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -737,6 +737,20 @@ static inline struct request_sock *cookie_bpf_check(struct net *net, struct sock
 }
 #endif

+#ifdef CONFIG_CGROUP_BPF
+void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb);
+
+static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
+{
+	if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVLOWAT_CB_FLAG))
+		bpf_skops_rcvlowat(sk, skb);
+}
+#else
+static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
+{
+}
+#endif
+
 /* From net/ipv6/syncookies.c */
 int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th);
 struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1d9e52fc454f..80144b97a87a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_mstamp_refresh(tp);
 		tcp_send_ack(sk);
 	}
+
+	tcp_bpf_rcvlowat(sk, NULL);
 }

 void tcp_cleanup_rbuf(struct sock *sk, int copied)
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 471c78be5513..91bf421fc5b6 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -281,6 +281,8 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 	TCP_SKB_CB(skb)->seq++;
 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_SYN;

+	tcp_bpf_rcvlowat(sk, skb);
+
 	tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 	tcp_add_receive_queue(sk, skb);
 	tp->syn_data_acked = 1;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7e26503fd96d..a70a8f583025 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -204,6 +204,12 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
 	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
 	bpf_skops_common_locked(sk, bpf_op, skb, skb ? tcp_hdrlen(skb) : 0);
 }
+
+void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb)
+{
+	/* skb is NULL when called from __tcp_cleanup_rbuf(). */
+	bpf_skops_common_locked(sk, BPF_SOCK_OPS_RCVLOWAT_CB, skb, 0);
+}
 #else
 static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
 {
@@ -5300,6 +5306,8 @@ static void tcp_ofo_queue(struct sock *sk)
 			continue;
 		}

+		tcp_bpf_rcvlowat(sk, skb);
+
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
 		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
@@ -5503,6 +5511,8 @@ static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
 	int eaten;
 	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);

+	tcp_bpf_rcvlowat(sk, skb);
+
 	eaten = (tail &&
 		 tcp_try_coalesce(sk, tail,
 				  skb, fragstolen)) ? 1 : 0;
--
2.54.0.563.g4f69b47b94-goog

^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook.
2026-05-08 7:33 ` [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
@ 2026-05-08 10:37 ` Jiayuan Chen
2026-05-08 11:30 ` Kuniyuki Iwashima
0 siblings, 1 reply; 30+ messages in thread
From: Jiayuan Chen @ 2026-05-08 10:37 UTC (permalink / raw)
To: Kuniyuki Iwashima, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi
Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
bpf, netdev

On 5/8/26 3:33 PM, Kuniyuki Iwashima wrote:
> Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVLOWAT_CB.
>
> Let's invoke the BPF SOCK_OPS prog when
>
>   1. TCP stack enqueues skb to sk->sk_receive_queue
>      -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()
>
>   2. TCP recvmsg() completes
>      -> __tcp_cleanup_rbuf()
>
> This will allow the BPF prog to parse each skb and dynamically
> adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
> until sufficient data (e.g., a full RPC frame) is available
> in the receive queue.
>
> Note that the direct access to bpf_sock_ops.data is intentionally
> disabled by passing 0 as end_offset.
>
> Instead, the BPF prog is supposed to use bpf_skb_load_bytes()
> with bpf_sock_ops because payload is not in the linear area
> with TCP header/data split on and skb may contain a RPC
> descriptor in skb frag. This also simplifies the BPF prog.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
>  include/net/tcp.h       | 14 ++++++++++++++
>  net/ipv4/tcp.c          |  2 ++
>  net/ipv4/tcp_fastopen.c |  2 ++
>  net/ipv4/tcp_input.c    | 10 ++++++++++
>  4 files changed, 28 insertions(+)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 4e9e634e276b..003e46c9b500 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -737,6 +737,20 @@ static inline struct request_sock *cookie_bpf_check(struct net *net, struct sock
>  }
>  #endif
>
> +#ifdef CONFIG_CGROUP_BPF
> +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb);
> +
> +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> +{
> +	if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVLOWAT_CB_FLAG))
> +		bpf_skops_rcvlowat(sk, skb);
> +}
> +#else
> +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> +{
> +}
> +#endif
> +
>  /* From net/ipv6/syncookies.c */
>  int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th);
>  struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1d9e52fc454f..80144b97a87a 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
>  		tcp_mstamp_refresh(tp);
>  		tcp_send_ack(sk);
>  	}
> +
> +	tcp_bpf_rcvlowat(sk, NULL);
>  }
>

tcp_read_skb (process frame 1 and __skb_unlink)
└─ sk_psock_verdict_recv
   └─ sk_psock_verdict_apply
      └─ tcp_eat_skb
         └─ tcp_cleanup_rbuf
            └─ __tcp_cleanup_rbuf
               └─ BPF RCVLOWAT_CB
                  └─ bpf_sock_ops_tcp_set_rcvlowat (wakeup=true)
                     └─ tcp_data_ready
                        └─ sk_psock_verdict_data_ready
                           └─ tcp_read_skb (frame 2)
                              └─ ... → tcp_read_skb (frame 3) ...

For strparser it uses read_sock instead of read_skb, and it will become
more complicated...

I think this will cause a stack overflow with large amounts of skbs in
the receive queue, or an infinite call loop (not tested), for
sockmap/kTLS/strparser.

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook. 2026-05-08 10:37 ` Jiayuan Chen @ 2026-05-08 11:30 ` Kuniyuki Iwashima 2026-05-08 12:19 ` Jiayuan Chen 0 siblings, 1 reply; 30+ messages in thread From: Kuniyuki Iwashima @ 2026-05-08 11:30 UTC (permalink / raw) To: Jiayuan Chen Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev On Fri, May 8, 2026 at 3:37 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > On 5/8/26 3:33 PM, Kuniyuki Iwashima wrote: > > Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVLOWAT_CB. > > > > Let's invoke the BPF SOCK_OPS prog when > > > > 1. TCP stack enqueues skb to sk->sk_receive_queue > > -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb() > > > > 2. TCP recvmsg() completes > > -> __tcp_cleanup_rbuf() > > > > This will allow the BPF prog to parse each skb and dynamically > > adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups > > until sufficient data (e.g., a full RPC frame) is available > > in the receive queue. > > > > Note that the direct access to bpf_sock_ops.data is intentionally > > disabled by passing 0 as end_offset. > > > > Instead, the BPF prog is supposed to use bpf_skb_load_bytes() > > with bpf_sock_ops because payload is not in the linear area > > with TCP header/data split on and skb may contain a RPC > > descriptor in skb frag. This also simplifies the BPF prog. > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > --- > > include/net/tcp.h | 14 ++++++++++++++ > > net/ipv4/tcp.c | 2 ++ > > net/ipv4/tcp_fastopen.c | 2 ++ > > net/ipv4/tcp_input.c | 10 ++++++++++ > > 4 files changed, 28 insertions(+) > > > > diff --git a/include/net/tcp.h b/include/net/tcp.h > > index 4e9e634e276b..003e46c9b500 100644 > > --- a/include/net/tcp.h > > +++ b/include/net/tcp.h > > @@ -737,6 +737,20 @@ static inline struct request_sock *cookie_bpf_check(struct net *net, struct sock > > } > > #endif > > > > +#ifdef CONFIG_CGROUP_BPF > > +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb); > > + > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) > > +{ > > + if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVLOWAT_CB_FLAG)) > > + bpf_skops_rcvlowat(sk, skb); > > +} > > +#else > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) > > +{ > > +} > > +#endif > > + > > /* From net/ipv6/syncookies.c */ > > int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th); > > struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb); > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index 1d9e52fc454f..80144b97a87a 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied) > > tcp_mstamp_refresh(tp); > > tcp_send_ack(sk); > > } > > + > > + tcp_bpf_rcvlowat(sk, NULL); > > } > > > > tcp_read_skb (process frame 1 and __skb_unlink) > └─ sk_psock_verdict_recv > └─ sk_psock_verdict_apply > └─ tcp_eat_skb > └─ tcp_cleanup_rbuf > └─ __tcp_cleanup_rbuf > └─ BPF RCVLOWAT_CB > └─ bpf_sock_ops_tcp_set_rcvlowat (wakeup=true) > └─ tcp_data_ready > └─ sk_psock_verdict_data_ready > └─ tcp_read_skb (frame 2) > └─ ... → tcp_read_skb (frame 3) ... > > For strparser it use read_sock instead of read_skb and it will become > more complicated... 
To be clear, this feature is NOT meant to be used with strparser/sockmap. > > I think this will cause stack overflow with amounts of skbs in receive > queue or infinite call(not tested) for sockmap/kTLS/strparser. > The BPF user is responsible for not doing silly things. tcp_bpf_strp_read_sock() could have loop-detection logic, but only if it's really needed.
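For illustration, the loop detection mentioned above could take the shape of the sketch below. The in_read_sock flag, the signature, and the function body are hypothetical (they are not in this series or in the current tree); the sketch only shows how re-entry through the rcvlowat callback's wakeup could be cut short.

	/* Sketch: guard tcp_bpf_strp_read_sock() against re-entry.
	 * psock->in_read_sock is a made-up flag; the signature is
	 * illustrative as well.
	 */
	static int tcp_bpf_strp_read_sock(struct sk_psock *psock)
	{
		/* The rcvlowat callback may call tcp_data_ready(), which
		 * re-enters this path via sk_psock_verdict_data_ready().
		 * Bail out if the receive queue is already being drained.
		 */
		if (psock->in_read_sock)
			return 0;

		psock->in_read_sock = true;
		/* ... existing receive-queue processing ... */
		psock->in_read_sock = false;

		return 0;
	}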
* Re: [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook. 2026-05-08 11:30 ` Kuniyuki Iwashima @ 2026-05-08 12:19 ` Jiayuan Chen 0 siblings, 0 replies; 30+ messages in thread From: Jiayuan Chen @ 2026-05-08 12:19 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev On 5/8/26 7:30 PM, Kuniyuki Iwashima wrote: > On Fri, May 8, 2026 at 3:37 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: >> >> On 5/8/26 3:33 PM, Kuniyuki Iwashima wrote: >>> Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVLOWAT_CB. >>> >>> Let's invoke the BPF SOCK_OPS prog when >>> >>> 1. TCP stack enqueues skb to sk->sk_receive_queue >>> -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb() >>> >>> 2. TCP recvmsg() completes >>> -> __tcp_cleanup_rbuf() >>> >>> This will allow the BPF prog to parse each skb and dynamically >>> adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups >>> until sufficient data (e.g., a full RPC frame) is available >>> in the receive queue. >>> >>> Note that the direct access to bpf_sock_ops.data is intentionally >>> disabled by passing 0 as end_offset. >>> >>> Instead, the BPF prog is supposed to use bpf_skb_load_bytes() >>> with bpf_sock_ops because payload is not in the linear area >>> with TCP header/data split on and skb may contain a RPC >>> descriptor in skb frag. This also simplifies the BPF prog. >>> >>> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> >>> --- >>> include/net/tcp.h | 14 ++++++++++++++ >>> net/ipv4/tcp.c | 2 ++ >>> net/ipv4/tcp_fastopen.c | 2 ++ >>> net/ipv4/tcp_input.c | 10 ++++++++++ >>> 4 files changed, 28 insertions(+) >>> >>> diff --git a/include/net/tcp.h b/include/net/tcp.h >>> index 4e9e634e276b..003e46c9b500 100644 >>> --- a/include/net/tcp.h >>> +++ b/include/net/tcp.h >>> @@ -737,6 +737,20 @@ static inline struct request_sock *cookie_bpf_check(struct net *net, struct sock >>> } >>> #endif >>> >>> +#ifdef CONFIG_CGROUP_BPF >>> +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb); >>> + >>> +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) >>> +{ >>> + if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVLOWAT_CB_FLAG)) >>> + bpf_skops_rcvlowat(sk, skb); >>> +} >>> +#else >>> +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) >>> +{ >>> +} >>> +#endif >>> + >>> /* From net/ipv6/syncookies.c */ >>> int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th); >>> struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb); >>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c >>> index 1d9e52fc454f..80144b97a87a 100644 >>> --- a/net/ipv4/tcp.c >>> +++ b/net/ipv4/tcp.c >>> @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied) >>> tcp_mstamp_refresh(tp); >>> tcp_send_ack(sk); >>> } >>> + >>> + tcp_bpf_rcvlowat(sk, NULL); >>> } >>> >> tcp_read_skb (process frame 1 and __skb_unlink) >> └─ sk_psock_verdict_recv >> └─ sk_psock_verdict_apply >> └─ tcp_eat_skb >> └─ tcp_cleanup_rbuf >> └─ __tcp_cleanup_rbuf >> └─ BPF RCVLOWAT_CB >> └─ bpf_sock_ops_tcp_set_rcvlowat (wakeup=true) >> └─ tcp_data_ready >> └─ sk_psock_verdict_data_ready >> └─ tcp_read_skb (frame 2) >> └─ ... → tcp_read_skb (frame 3) ... 
>> >> For strparser it use read_sock instead of read_skb and it will become >> more complicated... > To be clear, this feature is NOT meant to be used with strparser/sockmap. >> I think this will cause stack overflow with amounts of skbs in receive >> queue or infinite call(not tested) for sockmap/kTLS/strparser. >> > The BPF user is responsible for not doing silly things. > > tcp_bpf_strp_read_sock() could have loop-detection logic, > but only if it's really needed. Similar infinite recursion problems, for reference: https://lore.kernel.org/r/20220929070407.965581-5-martin.lau@linux.dev https://lore.kernel.org/bpf/20260421155804.135786-1-kafai.wan@linux.dev/ They were solved not on the TCP side but on the ops side. Can we try to handle it on the BPF/OPS side first and only prevent it elsewhere if that's not feasible? Thanks
* Re: [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook. 2026-05-08 7:33 ` [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima 2026-05-08 10:37 ` Jiayuan Chen @ 2026-05-08 15:28 ` Stanislav Fomichev 2026-05-08 20:05 ` Kuniyuki Iwashima 2026-05-08 21:46 ` sashiko-bot 2 siblings, 1 reply; 30+ messages in thread From: Stanislav Fomichev @ 2026-05-08 15:28 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev On 05/08, Kuniyuki Iwashima wrote: > Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVLOWAT_CB. > > Let's invoke the BPF SOCK_OPS prog when > > 1. TCP stack enqueues skb to sk->sk_receive_queue > -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb() > > 2. TCP recvmsg() completes > -> __tcp_cleanup_rbuf() > > This will allow the BPF prog to parse each skb and dynamically > adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups > until sufficient data (e.g., a full RPC frame) is available > in the receive queue. > > Note that the direct access to bpf_sock_ops.data is intentionally > disabled by passing 0 as end_offset. > > Instead, the BPF prog is supposed to use bpf_skb_load_bytes() > with bpf_sock_ops because payload is not in the linear area > with TCP header/data split on and skb may contain a RPC > descriptor in skb frag. This also simplifies the BPF prog. > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> I was reading the series expecting to find some skb_queue_walk-like implementation, but since it's a cgroup hook we obviously don't need to do that.. So at this point BPF_SOCK_OPS_RCVLOWAT_CB_FLAG is basically a "rx queue skb" hook, right? So should we make the name more generic? There is really nothing lowat-specific here besides your new kfunc to read the payload? 
> --- > include/net/tcp.h | 14 ++++++++++++++ > net/ipv4/tcp.c | 2 ++ > net/ipv4/tcp_fastopen.c | 2 ++ > net/ipv4/tcp_input.c | 10 ++++++++++ > 4 files changed, 28 insertions(+) > > diff --git a/include/net/tcp.h b/include/net/tcp.h > index 4e9e634e276b..003e46c9b500 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -737,6 +737,20 @@ static inline struct request_sock *cookie_bpf_check(struct net *net, struct sock > } > #endif > > +#ifdef CONFIG_CGROUP_BPF > +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb); > + > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) > +{ > + if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVLOWAT_CB_FLAG)) > + bpf_skops_rcvlowat(sk, skb); > +} > +#else > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) > +{ > +} > +#endif > + > /* From net/ipv6/syncookies.c */ > int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th); > struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb); > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 1d9e52fc454f..80144b97a87a 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied) > tcp_mstamp_refresh(tp); > tcp_send_ack(sk); > } > + > + tcp_bpf_rcvlowat(sk, NULL); > } > > void tcp_cleanup_rbuf(struct sock *sk, int copied) > diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c > index 471c78be5513..91bf421fc5b6 100644 > --- a/net/ipv4/tcp_fastopen.c > +++ b/net/ipv4/tcp_fastopen.c > @@ -281,6 +281,8 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb) > TCP_SKB_CB(skb)->seq++; > TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_SYN; > > + tcp_bpf_rcvlowat(sk, skb); > + I'm also not sure about the particular placement of some of these.. For example here, why do it before updating tp? Why not after? (and same for tcp_ofo_queue) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook. 2026-05-08 15:28 ` Stanislav Fomichev @ 2026-05-08 20:05 ` Kuniyuki Iwashima 2026-05-11 14:55 ` Stanislav Fomichev 0 siblings, 1 reply; 30+ messages in thread From: Kuniyuki Iwashima @ 2026-05-08 20:05 UTC (permalink / raw) To: Stanislav Fomichev Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev On Fri, May 8, 2026 at 8:28 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote: > > On 05/08, Kuniyuki Iwashima wrote: > > Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVLOWAT_CB. > > > > Let's invoke the BPF SOCK_OPS prog when > > > > 1. TCP stack enqueues skb to sk->sk_receive_queue > > -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb() > > > > 2. TCP recvmsg() completes > > -> __tcp_cleanup_rbuf() > > > > This will allow the BPF prog to parse each skb and dynamically > > adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups > > until sufficient data (e.g., a full RPC frame) is available > > in the receive queue. > > > > Note that the direct access to bpf_sock_ops.data is intentionally > > disabled by passing 0 as end_offset. > > > > Instead, the BPF prog is supposed to use bpf_skb_load_bytes() > > with bpf_sock_ops because payload is not in the linear area > > with TCP header/data split on and skb may contain a RPC > > descriptor in skb frag. This also simplifies the BPF prog. > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > I was reading the series expecting to find some skb_queue_walk-like > implementation, but since it's a cgroup hook we obviously don't > need to do that.. So at this point BPF_SOCK_OPS_RCVLOWAT_CB_FLAG > is basically a "rx queue skb" hook, right? If recvmsg() hook w/o skb is not confusing, yes. > So should we make > the name more generic? There is really nothing lowat-specific > here besides your new kfunc to read the payload? Do you have any suggestions? BPF_SOCK_OPS_RCVQ ? 
> > > --- > > include/net/tcp.h | 14 ++++++++++++++ > > net/ipv4/tcp.c | 2 ++ > > net/ipv4/tcp_fastopen.c | 2 ++ > > net/ipv4/tcp_input.c | 10 ++++++++++ > > 4 files changed, 28 insertions(+) > > > > diff --git a/include/net/tcp.h b/include/net/tcp.h > > index 4e9e634e276b..003e46c9b500 100644 > > --- a/include/net/tcp.h > > +++ b/include/net/tcp.h > > @@ -737,6 +737,20 @@ static inline struct request_sock *cookie_bpf_check(struct net *net, struct sock > > } > > #endif > > > > +#ifdef CONFIG_CGROUP_BPF > > +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb); > > + > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) > > +{ > > + if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVLOWAT_CB_FLAG)) > > + bpf_skops_rcvlowat(sk, skb); > > +} > > +#else > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) > > +{ > > +} > > +#endif > > + > > /* From net/ipv6/syncookies.c */ > > int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th); > > struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb); > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index 1d9e52fc454f..80144b97a87a 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied) > > tcp_mstamp_refresh(tp); > > tcp_send_ack(sk); > > } > > + > > + tcp_bpf_rcvlowat(sk, NULL); > > } > > > > void tcp_cleanup_rbuf(struct sock *sk, int copied) > > diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c > > index 471c78be5513..91bf421fc5b6 100644 > > --- a/net/ipv4/tcp_fastopen.c > > +++ b/net/ipv4/tcp_fastopen.c > > @@ -281,6 +281,8 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb) > > TCP_SKB_CB(skb)->seq++; > > TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_SYN; > > > > + tcp_bpf_rcvlowat(sk, skb); > > + > > I'm also not sure about the particular placement of some of these.. > For example here, why do it before updating tp? Why not after? > > (and same for tcp_ofo_queue) Basically, it's to give the same snapshot with tcp_queue_rcv(). For example, if you call tcp_bpf_rcvlowat() before updating TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog will need to implement an unlikely if branch to strip SYN. Also, TCP stack can queue overlapping skb into recvq. Once you update rcv_nxt with a new skb, you cannot infer the previous one from skb->len. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook. 2026-05-08 20:05 ` Kuniyuki Iwashima @ 2026-05-11 14:55 ` Stanislav Fomichev 0 siblings, 0 replies; 30+ messages in thread From: Stanislav Fomichev @ 2026-05-11 14:55 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev On 05/08, Kuniyuki Iwashima wrote: > On Fri, May 8, 2026 at 8:28 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote: > > > > On 05/08, Kuniyuki Iwashima wrote: > > > Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVLOWAT_CB. > > > > > > Let's invoke the BPF SOCK_OPS prog when > > > > > > 1. TCP stack enqueues skb to sk->sk_receive_queue > > > -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb() > > > > > > 2. TCP recvmsg() completes > > > -> __tcp_cleanup_rbuf() > > > > > > This will allow the BPF prog to parse each skb and dynamically > > > adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups > > > until sufficient data (e.g., a full RPC frame) is available > > > in the receive queue. > > > > > > Note that the direct access to bpf_sock_ops.data is intentionally > > > disabled by passing 0 as end_offset. > > > > > > Instead, the BPF prog is supposed to use bpf_skb_load_bytes() > > > with bpf_sock_ops because payload is not in the linear area > > > with TCP header/data split on and skb may contain a RPC > > > descriptor in skb frag. This also simplifies the BPF prog. > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > > > I was reading the series expecting to find some skb_queue_walk-like > > implementation, but since it's a cgroup hook we obviously don't > > need to do that.. So at this point BPF_SOCK_OPS_RCVLOWAT_CB_FLAG > > is basically a "rx queue skb" hook, right? > > If recvmsg() hook w/o skb is not confusing, yes. > > > So should we make > > the name more generic? There is really nothing lowat-specific > > here besides your new kfunc to read the payload? > > Do you have any suggestions? BPF_SOCK_OPS_RCVQ ? 
Yeah, idk, maybe something like this :-) > > > > > > > --- > > > include/net/tcp.h | 14 ++++++++++++++ > > > net/ipv4/tcp.c | 2 ++ > > > net/ipv4/tcp_fastopen.c | 2 ++ > > > net/ipv4/tcp_input.c | 10 ++++++++++ > > > 4 files changed, 28 insertions(+) > > > > > > diff --git a/include/net/tcp.h b/include/net/tcp.h > > > index 4e9e634e276b..003e46c9b500 100644 > > > --- a/include/net/tcp.h > > > +++ b/include/net/tcp.h > > > @@ -737,6 +737,20 @@ static inline struct request_sock *cookie_bpf_check(struct net *net, struct sock > > > } > > > #endif > > > > > > +#ifdef CONFIG_CGROUP_BPF > > > +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb); > > > + > > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) > > > +{ > > > + if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVLOWAT_CB_FLAG)) > > > + bpf_skops_rcvlowat(sk, skb); > > > +} > > > +#else > > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb) > > > +{ > > > +} > > > +#endif > > > + > > > /* From net/ipv6/syncookies.c */ > > > int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th); > > > struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb); > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > > index 1d9e52fc454f..80144b97a87a 100644 > > > --- a/net/ipv4/tcp.c > > > +++ b/net/ipv4/tcp.c > > > @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied) > > > tcp_mstamp_refresh(tp); > > > tcp_send_ack(sk); > > > } > > > + > > > + tcp_bpf_rcvlowat(sk, NULL); > > > } > > > > > > void tcp_cleanup_rbuf(struct sock *sk, int copied) > > > diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c > > > index 471c78be5513..91bf421fc5b6 100644 > > > --- a/net/ipv4/tcp_fastopen.c > > > +++ b/net/ipv4/tcp_fastopen.c > > > @@ -281,6 +281,8 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb) > > > TCP_SKB_CB(skb)->seq++; > > > TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_SYN; > > > > > > + tcp_bpf_rcvlowat(sk, skb); > > > + > > > > I'm also not sure about the particular placement of some of these.. > > For example here, why do it before updating tp? Why not after? > > > > (and same for tcp_ofo_queue) > > Basically, it's to give the same snapshot with tcp_queue_rcv(). > > For example, if you call tcp_bpf_rcvlowat() before updating > TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF > prog will need to implement an unlikely if branch to strip SYN. > > Also, TCP stack can queue overlapping skb into recvq. > Once you update rcv_nxt with a new skb, you cannot infer > the previous one from skb->len. I see, makes sense, thanks! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook. 2026-05-08 7:33 ` [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima 2026-05-08 10:37 ` Jiayuan Chen 2026-05-08 15:28 ` Stanislav Fomichev @ 2026-05-08 21:46 ` sashiko-bot 2 siblings, 0 replies; 30+ messages in thread From: sashiko-bot @ 2026-05-08 21:46 UTC (permalink / raw) To: Kuniyuki Iwashima; +Cc: bpf Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider: - [High] The new BPF_SOCK_OPS_RCVLOWAT_CB hook is incorrectly rejected by is_locked_tcp_sock_ops(), preventing the BPF program from modifying callback flags. -- commit ef96e613bd920ae4e567781955c5e3d4f11f7f57 Author: Kuniyuki Iwashima <kuniyu@google.com> bpf: tcp: Add SOCK_OPS rcvlowat hook. This commit introduces a new BPF_SOCK_OPS_RCVLOWAT_CB hook that is invoked when TCP enqueues an skb or completes recvmsg(). The hook allows BPF programs to dynamically adjust sk->sk_rcvlowat to suppress unnecessary wakeups. [ ... ] > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 7e26503fd96d4..a70a8f5830258 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -204,6 +204,12 @@ static void bpf_skops_established(struct sock *sk, int bpf_op, > /* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */ > bpf_skops_common_locked(sk, bpf_op, skb, skb ? tcp_hdrlen(skb) : 0); > } > + > +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb) > +{ > + /* skb is NULL when called from __tcp_cleanup_rbuf(). */ > + bpf_skops_common_locked(sk, BPF_SOCK_OPS_RCVLOWAT_CB, skb, 0); > +} Does this regression prevent the BPF program from dynamically disabling the hook once the RPC boundaries are known? Since BPF_SOCK_OPS_RCVLOWAT_CB is greater than BPF_SOCK_OPS_WRITE_HDR_OPT_CB, is_locked_tcp_sock_ops() will return false for this hook: net/core/filter.c:is_locked_tcp_sock_ops() { return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB; } While this intentionally prevents bpf_setsockopt() from being used unnecessarily, bpf_sock_ops_cb_flags_set() also relies on this check: net/core/filter.c:bpf_sock_ops_cb_flags_set() { if (!is_locked_tcp_sock_ops(bpf_sock)) return -EOPNOTSUPP; ... } This means any attempt to clear the callback flag using bpf_sock_ops_cb_flags_set() within the hook will fail with -EOPNOTSUPP. This could force the callback to run on every single SKB enqueue for the lifetime of the connection. Should is_locked_tcp_sock_ops() or bpf_sock_ops_cb_flags_set() be updated to allow disabling the callback flag for this new hook? -- Sashiko AI review · https://sashiko.dev/#/patchset/20260508073355.3916746-1-kuniyu@google.com?part=7 ^ permalink raw reply [flat|nested] 30+ messages in thread
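One possible fix for the flagged regression, sketched against the is_locked_tcp_sock_ops() body quoted in the review (the change is illustrative, not part of the posted series):

	/* net/core/filter.c (sketch): accept the new hook as a locked
	 * one so bpf_sock_ops_cb_flags_set() can clear the flag from
	 * inside the callback. Assumes BPF_SOCK_OPS_RCVLOWAT_CB always
	 * runs under lock_sock()/bh_lock_sock(), as this series states.
	 */
	static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
	{
		return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB ||
		       bpf_sock->op == BPF_SOCK_OPS_RCVLOWAT_CB;
	}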
* [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB. 2026-05-08 7:33 [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima ` (6 preceding siblings ...) 2026-05-08 7:33 ` [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima @ 2026-05-08 7:33 ` Kuniyuki Iwashima 2026-05-08 15:35 ` Stanislav Fomichev 2026-05-08 22:17 ` sashiko-bot 7 siblings, 2 replies; 30+ messages in thread From: Kuniyuki Iwashima @ 2026-05-08 7:33 UTC (permalink / raw) To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev The test is roughly divided into two stages, and the sequence is as follows: I) Setup 1. Attach two BPF programs to a cgroup 2. Establish a TCP connection (@client <-> @child) within the cgroup 3. Enable BPF_SOCK_OPS_RCVLOWAT_CB on @child II) RPC frame exchange in various patterns 4. Send a partial RPC descriptor from @client to @child 5. Verify that epoll does NOT wake up @child 6. Send the remaining data of the RPC frame 7. Verify that epoll finally wakes up @child During setup, two BPF programs are attached to simulate a real-world scenario; one is SOCK_OPS and the other is CGROUP_SOCKOPT. While the SOCK_OPS prog handles the dynamic adjustment of sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable BPF_SOCK_OPS_RCVLOWAT_CB via userspace setsockopt() using pseudo options: #define SOL_BPF 0xdeadbeef #define BPF_TCP_AUTOLOWAT 0x8badf00d setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int)); This reflects a common production use case where an application decides to start parsing RPC frames only at a certain point in the stream (e.g., after HTTP Upgrade), rather than immediately after TCP 3WHS (BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, etc). When BPF_TCP_AUTOLOWAT is enabled, the BPF prog initialises sk_local_storage for two sequence numbers to manage its state. Then, for the RPC frame exchange, this test uses a simple format defined as follows 0 8 16 24 32 +--------+--------+-------+--------+ `. | header size | | +--------+--------+-------+--------+ > RPC descriptor (8 bytes) | payload size | | +--------+--------+-------+--------+ .' ~ header ~ +--------+--------+-------+--------+ ~ payload ~ +--------+--------+-------+--------+ Every time a new skb is enqueued to sk->sk_receive_queue, the SOCK_OPS prog parses it and updates these sequence numbers: rpc_desc_seq : the SEQ # of the start of the RPC descriptor rpc_end_seq : the SEQ # of the end of the RPC frame => rpc_desc_seq + 8 + header size + payload size Assume we receive two RPC descriptors in the following pattern: 1. When we receive skb-1, only a part of RPC descriptor is parsed. rpc_desc_seq is set to the first byte while rpc_end_seq is unknown. Thus, sk->sk_rcvlowat is set to the size of the RPC descriptor (8 bytes). <- skb-1 -> <---- skb-2 ----> <------ skb-3 -----> +-----------+.................+....................+...... | RPC desc 1 | header + payload | RPC desc 2 | ... +-----------+.................+....................+...... ^ ^-. `- rpc_desc_seq `- sk->sk_rcvlowat 2. Next, we receive skb-2, which completes the first RPC descriptor. Now rpc_end_seq is known, so sk->sk_rcvlowat is advanced to it. <- skb-1 -> <---- skb-2 ----> <------ skb-3 -----> +-----------+-----------------+....................+...... 
| RPC desc 1 | header + payload | RPC desc 2 | ... +-----------+-----------------+....................+...... ^ ^ '- rpc_desc_seq '- rpc_end_seq & sk->sk_rcvlowat 3. Once we receive skb-3, which contains the next full RPC descriptor, rpc_desc_seq is advanced and rpc_end_seq is updated according to the size of RPC frame 2. Note that sk->sk_rcvlowat is NOT updated to the new rpc_end_seq yet. This ensures that the application is woken up to read the already complete RPC frame 1. <- skb-1 -> <---- skb-2 ----> <------ skb-3 -----> +-----------+-----------------+--------------------+...... | RPC desc 1 | header + payload | RPC desc 2 | ... | +-----------+-----------------+--------------------+...... ^ ^ rpc_desc_seq -----------' rpc_end_seq ----...-' & sk->sk_rcvlowat This sequence corresponds to the 4th test case in rpc_test_cases[], and we can see helpful output if we "#define DEBUG": # cat /sys/kernel/tracing/trace_pipe | \ awk '{ if ($0 ~ /AF_/) sub(/^.*AF_/, "AF_"); print $0 }' & \ BGPID=$!; ./test_progs -t tcp_autolowat; kill -9 -$BGPID ... AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 0, end_seq: 1, len: 1, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 0 AF_INET6 rpc_test_cases[3]: Copied 1 bytes: rpc_desc_buff_len: 1 AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_desc_buff_len: 1 AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 8, actual: 8 AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 1, end_seq: 8, len: 7, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 1 AF_INET6 rpc_test_cases[3]: Copied full descriptor: rpc_desc_seq: 0, rpc_end_seq: 258, header_len: 100, payload_len: 150 AF_INET6 rpc_test_cases[3]: No more descriptor: rpc_end_seq: 258, end_seq: 8 AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 258, rpc_desc_buff_len: 8 AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 258, actual: 258 ... 
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> --- tools/testing/selftests/bpf/bpf_kfuncs.h | 4 + .../selftests/bpf/prog_tests/tcp_autolowat.c | 350 ++++++++++++++++++ .../selftests/bpf/progs/bpf_tracing_net.h | 2 + .../selftests/bpf/progs/tcp_autolowat.c | 316 ++++++++++++++++ 4 files changed, 672 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c create mode 100644 tools/testing/selftests/bpf/progs/tcp_autolowat.c diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h index ae71e9b69051..fc4d6f68f247 100644 --- a/tools/testing/selftests/bpf/bpf_kfuncs.h +++ b/tools/testing/selftests/bpf/bpf_kfuncs.h @@ -64,6 +64,10 @@ struct bpf_tcp_req_attrs; extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk, struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym; +struct bpf_sock_ops_kern; +extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops_kern, + int rcvlowat) __ksym; + void *bpf_cast_to_kern_ctx(void *) __ksym; extern void *bpf_rdonly_cast(const void *obj, __u32 btf_id) __ksym __weak; diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c new file mode 100644 index 000000000000..5e971c42a32a --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c @@ -0,0 +1,350 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright 2026 Google LLC */ +#include <sys/epoll.h> + +#include "test_progs.h" +#include "cgroup_helpers.h" +#include "network_helpers.h" + +#include "tcp_autolowat.skel.h" + +#define SOL_BPF 0xdeadbeef +#define BPF_TCP_AUTOLOWAT 0x8badf00d + +struct rpc_descriptor { + u32 header_len; + u32 payload_len; +}; + +enum rpc_event_type { + RPC_EVENT_END, + RPC_EVENT_AUTOLOWAT, + RPC_EVENT_SEND, + RPC_EVENT_RECV, + RPC_EVENT_EPOLL, + RPC_EVENT_RCVLOWAT, +}; + +struct rpc_event { + enum rpc_event_type type; + union { + int len; + int nfds; + int val; + int rcvlowat; + }; +}; + +#define RPC_DESC_SIZE (sizeof(struct rpc_descriptor)) + +struct rpc_test_case { + char data[4096]; + struct rpc_descriptor desc[32]; + struct rpc_event event[32]; +} rpc_test_cases[] = { + { + .desc = { + { .header_len = 100, .payload_len = 150 }, + }, + .event = { + { .type = RPC_EVENT_AUTOLOWAT, .val = 1}, + /* Single full RPC message in skb. */ + { .type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE + 100 + 150}, + { .type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 100 + 150}, + { .type = RPC_EVENT_EPOLL, .nfds = 1}, + }, + }, + { + .desc = { + {.header_len = 100, .payload_len = 150}, + {.header_len = 100, .payload_len = 150}, + {.header_len = 100, .payload_len = 150}, + }, + .event = { + { .type = RPC_EVENT_AUTOLOWAT, .val = 1}, + /* Two full RPC messages in skb. */ + {.type = RPC_EVENT_SEND, .len = (RPC_DESC_SIZE + 100 + 150) * 2}, + {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2}, + {.type = RPC_EVENT_EPOLL, .nfds = 1}, + /* Single full RPC message in skb. */ + { .type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE + 100 + 150}, + { .type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 3}, + { .type = RPC_EVENT_EPOLL, .nfds = 1}, + }, + }, + { + .desc = { + {.header_len = 100, .payload_len = 150}, + {.header_len = 100, .payload_len = 150}, + {.header_len = 100, .payload_len = 150}, + }, + .event = { + { .type = RPC_EVENT_AUTOLOWAT, .val = 1}, + /* Two full RPC messages in skb. 
*/ + {.type = RPC_EVENT_SEND, .len = (RPC_DESC_SIZE + 100 + 150) * 2}, + {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2}, + {.type = RPC_EVENT_EPOLL, .nfds = 1}, + /* Single full RPC message in skb. */ + { .type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE}, + { .type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2}, + { .type = RPC_EVENT_EPOLL, .nfds = 1}, + }, + }, + { + .desc = { + {.header_len = 100, .payload_len = 150}, + {.header_len = 200, .payload_len = 500}, + }, + .event = { + { .type = RPC_EVENT_AUTOLOWAT, .val = 1}, + /* The first descriptor is partial. */ + {.type = RPC_EVENT_SEND, .len = 1}, + {.type = RPC_EVENT_EPOLL, .nfds = 0}, + {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE}, + /* The first descriptor is available. */ + {.type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE - 1}, + {.type = RPC_EVENT_EPOLL, .nfds = 0}, + {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 150 + 100}, + /* The first header is ready. */ + {.type = RPC_EVENT_SEND, .len = 100}, + {.type = RPC_EVENT_EPOLL, .nfds = 0}, + {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 150 + 100}, + /* skb has the first payload and 1 byte of the next descriptor. */ + {.type = RPC_EVENT_SEND, .len = 150 + 1}, + {.type = RPC_EVENT_EPOLL, .nfds = 1}, + {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 150 + 100}, + /* After reading the first RPC message, SO_RCVLOWAT should be RPC_DESC_SIZE. */ + {.type = RPC_EVENT_RECV, .len = RPC_DESC_SIZE + 150 + 100}, + {.type = RPC_EVENT_EPOLL, .nfds = 0}, + {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE}, + /* The second descriptor is available. */ + {.type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE - 1}, + {.type = RPC_EVENT_EPOLL, .nfds = 0}, + {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 200 + 500}, + }, + }, +}; + +struct tcp_autolowat_test_cb { + int saved_netns; + union { + int fd[4]; + struct { + int server, client, child; + int epoll; + }; + }; +}; + +static void tcp_autolowat_teardown_cb(struct tcp_autolowat_test_cb *cb) +{ + int i, err; + + for (i = 0; i < ARRAY_SIZE(cb->fd); i++) { + if (cb->fd[i] != -1) + close(cb->fd[i]); + } + + if (cb->saved_netns != -1) { + err = setns(cb->saved_netns, CLONE_NEWNET); + ASSERT_OK(err, "restore netns"); + + close(cb->saved_netns); + } +} + +static int tcp_autolowat_setup_cb(struct tcp_autolowat_test_cb *cb, int family) +{ + struct epoll_event ev = {}; + int err; + int i; + + for (i = 0; i < ARRAY_SIZE(cb->fd); i++) + cb->fd[i] = -1; + + cb->saved_netns = open("/proc/self/ns/net", O_RDONLY); + if (!ASSERT_NEQ(cb->saved_netns, -1, "save netns")) + goto err; + + err = unshare(CLONE_NEWNET); + if (!ASSERT_OK(err, "unshare")) + goto err; + + err = system("ip link set dev lo up"); + if (!ASSERT_OK(err, "set up lo")) + goto err; + + cb->server = start_server(family, SOCK_STREAM, NULL, 0, 0); + if (!ASSERT_NEQ(cb->server, -1, "start_server")) + goto err; + + cb->client = connect_to_fd(cb->server, 0); + if (!ASSERT_NEQ(cb->client, -1, "connect_to_fd")) + goto err; + + cb->child = accept(cb->server, NULL, NULL); + if (!ASSERT_NEQ(cb->child, -1, "accept")) + goto err; + + cb->epoll = epoll_create1(0); + if (!ASSERT_NEQ(cb->epoll, -1, "epoll_create")) + goto err; + + ev.events = EPOLLIN; + ev.data.fd = cb->child; + + err = epoll_ctl(cb->epoll, EPOLL_CTL_ADD, cb->child, &ev); + if (!ASSERT_OK(err, "epoll_ctl")) + goto err; + + return 0; + +err: + tcp_autolowat_teardown_cb(cb); + return -1; +} + +static int tcp_autolowat_build_data(struct rpc_test_case *test_case) +{ + 
struct rpc_descriptor *desc = test_case->desc; + char *ptr = test_case->data; + int rpc_size; + + memset(ptr, 0, sizeof(test_case->data)); + + while (desc->header_len + desc->payload_len) { + rpc_size = sizeof(*desc) + desc->header_len + desc->payload_len; + + if (!ASSERT_LE(ptr + rpc_size - test_case->data, + sizeof(test_case->data), "data overflow")) + return 1; + + memcpy(ptr, desc, sizeof(*desc)); + ptr += rpc_size; + desc++; + } + + if (!ASSERT_GT(ptr - test_case->data, 0, "no data")) + return 1; + + return 0; +} + +static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb, + struct rpc_test_case *test_case) +{ + struct rpc_event *event = test_case->event; + char *ptr = test_case->data; + struct epoll_event ev; + socklen_t optlen; + int err, optval; + char buf[4096]; + + if (tcp_autolowat_build_data(test_case)) + return; + + while (1) { + switch (event->type) { + case RPC_EVENT_END: + return; + case RPC_EVENT_AUTOLOWAT: + err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT, + &event->val, sizeof(event->val)); + if (!ASSERT_OK(err, "setsockopt")) + return; + break; + case RPC_EVENT_SEND: + err = send(cb->client, ptr, event->len, 0); + if (!ASSERT_EQ(err, event->len, "send")) + return; + + ptr += event->len; + break; + case RPC_EVENT_RECV: + err = recv(cb->child, buf, event->len, 0); + if (!ASSERT_EQ(err, event->len, "recv")) + return; + break; + case RPC_EVENT_EPOLL: + err = epoll_wait(cb->epoll, &ev, 1, 0); + if (!ASSERT_EQ(err, event->nfds, "epoll_wait")) + return; + break; + case RPC_EVENT_RCVLOWAT: + optval = 0; + optlen = sizeof(optval); + + err = getsockopt(cb->child, SOL_SOCKET, SO_RCVLOWAT, &optval, &optlen); + if (!ASSERT_OK(err, "getsockopt") || + !ASSERT_EQ(optval, event->rcvlowat, "rcvlowat")) + return; + break; + } + + event++; + } +} + +static void tcp_autolowat_run_rpc_tests(struct tcp_autolowat *skel, int family) +{ + struct tcp_autolowat_test_cb cb; + int err; + int i; + + for (i = 0; i < ARRAY_SIZE(rpc_test_cases); i++) { + memset(skel->bss->test_name, 0, sizeof(skel->bss->test_name)); + + snprintf(skel->bss->test_name, sizeof(skel->bss->test_name), + "AF_INET%c rpc_test_cases[%d]", + family == AF_INET ? 
' ' : '6', i); + + if (!test__start_subtest(skel->bss->test_name)) + continue; + + err = tcp_autolowat_setup_cb(&cb, family); + if (err) + continue; + + tcp_autolowat_run_rpc_test(&cb, &rpc_test_cases[i]); + tcp_autolowat_teardown_cb(&cb); + } +} + +static void tcp_autolowat_run_tests(struct tcp_autolowat *skel) +{ + tcp_autolowat_run_rpc_tests(skel, AF_INET); + tcp_autolowat_run_rpc_tests(skel, AF_INET6); +} + +void test_tcp_autolowat(void) +{ + struct tcp_autolowat *skel; + struct bpf_link *link[2]; + int cgroup; + + skel = tcp_autolowat__open_and_load(); + if (!ASSERT_OK_PTR(skel, "open_and_load")) + return; + + cgroup = test__join_cgroup("/tcp_autolowat"); + if (!ASSERT_GE(cgroup, 0, "join_cgroup")) + goto destroy_skel; + + link[0] = bpf_program__attach_cgroup(skel->progs.tcp_autolowat, cgroup); + if (!ASSERT_OK_PTR(link[0], "attach_cgroup(SOCK_OPS)")) + goto close_cgroup; + + link[1] = bpf_program__attach_cgroup(skel->progs.tcp_autolowat_setsockopt, cgroup); + if (!ASSERT_OK_PTR(link[1], "attach_cgroup(SETSOCKOPT)")) + goto destroy_sockops; + + tcp_autolowat_run_tests(skel); + + bpf_link__destroy(link[1]); +destroy_sockops: + bpf_link__destroy(link[0]); +close_cgroup: + close(cgroup); +destroy_skel: + tcp_autolowat__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h index d8dacef37c16..bdf28d320383 100644 --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h @@ -74,6 +74,8 @@ #define NEXTHDR_TCP 6 +#define TCPHDR_FIN 0x01 + #define TCPOPT_NOP 1 #define TCPOPT_EOL 0 #define TCPOPT_MSS 2 diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c new file mode 100644 index 000000000000..86f2af2fe683 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c @@ -0,0 +1,316 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright 2026 Google LLC */ +#include "vmlinux.h" + +#include <string.h> +#include <limits.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_core_read.h> + +#include "bpf_tracing_net.h" + +#define SOL_BPF 0xdeadbeef +#define BPF_TCP_AUTOLOWAT 0x8badf00d + +//#define DEBUG /* For verbose output. */ + +struct rpc_descriptor { + u32 header_len; + u32 payload_len; +}; + +#define RPC_DESC_SIZE (sizeof(struct rpc_descriptor)) +#define MAX_RPC_DESC_PER_SKB 100 + +struct tcp_autolowat_cb { + /* Don't put this field at the end; BPF verifier complains. */ + char rpc_desc_buf[RPC_DESC_SIZE]; + u32 rpc_desc_seq; + u32 rpc_end_seq; +#ifdef DEBUG + u32 isn; +#endif + u8 rpc_desc_buff_len; +}; + +struct { + __uint(type, BPF_MAP_TYPE_SK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct tcp_autolowat_cb); +} tcp_autolowat_map SEC(".maps"); + +char test_name[64]; + +#ifdef DEBUG +#define LOG(str, ...) \ + bpf_printk("%s: " str, test_name, ##__VA_ARGS__) +#else +#define LOG(...) 
+#endif + +#define SEQ(val) \ + (val - cb->isn) +#define TP_SEQ(field) \ + (tp->field - cb->isn) +#define CB_SEQ(field) \ + (cb->field - cb->isn) + +static int tcp_parse_descriptor(struct bpf_sock_ops *skops, + struct tcp_autolowat_cb *cb, + u32 seq, u32 end_seq) +{ + struct rpc_descriptor *rpc_desc; + u32 rpc_copied_seq; + u32 copy_len; + u64 rpc_len; + int err; + + rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len; + + if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq)) + copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len; + else + copy_len = end_seq - rpc_copied_seq; + + /* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7, + * clang omits the "copy_len == 0" check below, which is necessary + * to satisfy the BPF verifier's range check for bpf_skb_load_bytes(). + */ + barrier_var(copy_len); + + if (copy_len == 0) + goto disable; /* FIN. */ + if (copy_len > RPC_DESC_SIZE) + goto disable; /* always false, only for verifier. */ + if (cb->rpc_desc_buf + cb->rpc_desc_buff_len >= &cb->rpc_desc_buf[RPC_DESC_SIZE]) + goto disable; /* always false, only for verifier. */ + + err = bpf_skb_load_bytes(skops, rpc_copied_seq - seq, + cb->rpc_desc_buf + cb->rpc_desc_buff_len, copy_len); + if (err) + goto disable; + + cb->rpc_desc_buff_len += copy_len; + + if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) { + LOG("Copied %d bytes: rpc_desc_buff_len: %u", copy_len, cb->rpc_desc_buff_len); + goto partial; + } + + rpc_desc = (struct rpc_descriptor *)cb->rpc_desc_buf; + rpc_len = RPC_DESC_SIZE + rpc_desc->header_len + rpc_desc->payload_len; + + if (rpc_len > INT_MAX) + goto disable; + + cb->rpc_end_seq = cb->rpc_desc_seq + rpc_len; + + LOG("Copied full descriptor: rpc_desc_seq: %u, rpc_end_seq: %u," + " header_len: %u, payload_len: %u", + CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), + rpc_desc->header_len, rpc_desc->payload_len); + + return 0; +disable: + return -1; +partial: + return 1; +} + +static void tcp_set_autolowat(struct bpf_sock_ops_kern *skops_kern, + struct tcp_autolowat_cb *cb, + struct tcp_sock *tp) +{ + /* To handle wraparound. 
*/ + u32 val = 0; + + LOG("Setting rcvlowat: tp->copied_seq: %u, rpc_desc_seq: %u, rpc_end_seq: %u, rpc_desc_buff_len: %u", + TP_SEQ(copied_seq), CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len); + + if (before(tp->copied_seq, cb->rpc_desc_seq)) + val = cb->rpc_desc_seq - tp->copied_seq; + else if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) + val = RPC_DESC_SIZE; + else + val = cb->rpc_end_seq - tp->copied_seq; + + if (val != tp->inet_conn.icsk_inet.sk.sk_rcvlowat) { + bpf_sock_ops_tcp_set_rcvlowat(skops_kern, val); + + LOG("Set rcvlowat: expected: %u, actual: %d\n", + val, tp->inet_conn.icsk_inet.sk.sk_rcvlowat); + } else { + LOG("No need to set rcvlowat: %u\n", val); + } +} + +static void tcp_disable_autolowat(struct bpf_sock_ops *skops, + struct bpf_sock_ops_kern *skops_kern) +{ + int flags; + + flags = skops->bpf_sock_ops_cb_flags & ~BPF_SOCK_OPS_RCVLOWAT_CB_FLAG; + bpf_sock_ops_cb_flags_set(skops, flags); + + bpf_sock_ops_tcp_set_rcvlowat(skops_kern, 1); + + LOG("Disabled autolowat"); +} + +static void tcp_do_autolowat(struct bpf_sock_ops *skops, + struct tcp_autolowat_cb *cb, + struct tcp_sock *tp) +{ + struct bpf_sock_ops_kern *skops_kern; + struct tcp_skb_cb *tcb; + struct sk_buff *skb; + u32 seq, end_seq; + int ret = 0, i; + + skops_kern = bpf_cast_to_kern_ctx(skops); + skb = skops_kern->skb; + + if (!skb) + goto update; + + tcb = bpf_core_cast(skb->cb, struct tcp_skb_cb); + seq = tcb->seq; + end_seq = tcb->end_seq - !!(tcb->tcp_flags & TCPHDR_FIN); + + LOG("Start parsing skb: seq: %u, end_seq: %u, len: %u, " + "rpc_desc_seq: %u, rpc_end_seq: %u, rpc_buff_len: %u", + SEQ(seq), SEQ(end_seq), end_seq - seq, + CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len); + + if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) { + ret = tcp_parse_descriptor(skops, cb, seq, end_seq); + if (ret) + goto update; + } + + i = 0; + + while (1) { + if (i++ > MAX_RPC_DESC_PER_SKB) { + ret = -1; + break; + } + + if (after(cb->rpc_end_seq, end_seq)) { + LOG("No more descriptor: rpc_end_seq: %u, end_seq: %u", + CB_SEQ(rpc_end_seq), SEQ(end_seq)); + break; + } + + cb->rpc_desc_seq = cb->rpc_end_seq; + cb->rpc_desc_buff_len = 0; + + if (cb->rpc_end_seq == end_seq) + break; + + LOG("Found next descriptor: rpc_end_seq: %u, end_seq: %u, len: %u", + CB_SEQ(rpc_end_seq), SEQ(end_seq), end_seq - cb->rpc_end_seq); + + ret = tcp_parse_descriptor(skops, cb, seq, end_seq); + if (ret) + break; + } + +update: + if (ret >= 0) + tcp_set_autolowat(skops_kern, cb, tp); + else + tcp_disable_autolowat(skops, skops_kern); +} + +SEC("sockops") +int tcp_autolowat(struct bpf_sock_ops *skops) +{ + struct tcp_autolowat_cb *cb; + struct bpf_sock *bpf_sk; + struct tcp_sock *tp; + + if (skops->op != BPF_SOCK_OPS_RCVLOWAT_CB) + goto out; + + bpf_sk = skops->sk; + if (!bpf_sk) + goto out; /* always false, only for verifier. */ + + tp = bpf_skc_to_tcp_sock(bpf_sk); + if (!tp) + goto out; /* always false, only for verifier. 
*/ + + cb = bpf_sk_storage_get(&tcp_autolowat_map, tp, 0, 0); + if (!cb) + goto out; + + tcp_do_autolowat(skops, cb, tp); +out: + return 1; +} + +static int tcp_init_autolowat_cb(struct bpf_sockopt *sockopt, + struct bpf_tcp_sock *btp) +{ + struct tcp_autolowat_cb *cb; + struct tcp_sock *tp; + int flags; + + cb = bpf_sk_storage_get(&tcp_autolowat_map, btp, 0, + BPF_SK_STORAGE_GET_F_CREATE); + if (!cb) + return -1; + + tp = bpf_core_cast(btp, struct tcp_sock); + if (!tp) + return -1; + + cb->rpc_desc_seq = tp->copied_seq; + cb->rpc_end_seq = tp->copied_seq; +#ifdef DEBUG + cb->isn = tp->copied_seq; +#endif + + if (bpf_getsockopt(sockopt->sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS, + &flags, sizeof(flags))) + return -1; + + flags |= BPF_SOCK_OPS_RCVLOWAT_CB_FLAG; + + if (bpf_setsockopt(sockopt->sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS, + &flags, sizeof(flags))) + return -1; + + return 0; +} + +SEC("cgroup/setsockopt") +int tcp_autolowat_setsockopt(struct bpf_sockopt *ctx) +{ + void *optval_end = ctx->optval_end; + int *optval = ctx->optval; + struct bpf_tcp_sock *btp; + + if (ctx->level != SOL_BPF || ctx->optname != BPF_TCP_AUTOLOWAT) + goto out; + + if (optval + 1 > optval_end) + return 0; /* -EPERM */ + + btp = bpf_tcp_sock(ctx->sk); + if (!btp) + goto out; + + if (*optval && tcp_init_autolowat_cb(ctx, btp)) + return 0; /* -EPERM */ + + ctx->optlen = -1; /* BPF has consumed this option, don't call kernel + * setsockopt handler. + */ +out: + return 1; +} + +char _license[] SEC("license") = "GPL"; -- 2.54.0.563.g4f69b47b94-goog ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB. 2026-05-08 7:33 ` [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima @ 2026-05-08 15:35 ` Stanislav Fomichev 2026-05-08 20:19 ` Kuniyuki Iwashima 2026-05-08 22:17 ` sashiko-bot 1 sibling, 1 reply; 30+ messages in thread From: Stanislav Fomichev @ 2026-05-08 15:35 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev On 05/08, Kuniyuki Iwashima wrote: > The test is roughly divided into two stages, and the sequence > is as follows: > > I) Setup > > 1. Attach two BPF programs to a cgroup > 2. Establish a TCP connection (@client <-> @child) within the cgroup > 3. Enable BPF_SOCK_OPS_RCVLOWAT_CB on @child > > II) RPC frame exchange in various patterns > > 4. Send a partial RPC descriptor from @client to @child > 5. Verify that epoll does NOT wake up @child > 6. Send the remaining data of the RPC frame > 7. Verify that epoll finally wakes up @child > > During setup, two BPF programs are attached to simulate > a real-world scenario; one is SOCK_OPS and the other is > CGROUP_SOCKOPT. > > While the SOCK_OPS prog handles the dynamic adjustment of > sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable > BPF_SOCK_OPS_RCVLOWAT_CB via userspace setsockopt() using > pseudo options: > > #define SOL_BPF 0xdeadbeef > #define BPF_TCP_AUTOLOWAT 0x8badf00d > > setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int)); Hmm, so you do want to have this enable-on-a-per-socket use case.. Then what happens to the skbs already sitting in the rx queue by the time we enable BPF_SOCK_OPS_RCVLOWAT_CB? Isn't there a race? Or am I missing something? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB. 2026-05-08 15:35 ` Stanislav Fomichev @ 2026-05-08 20:19 ` Kuniyuki Iwashima 2026-05-08 21:47 ` Stanislav Fomichev 0 siblings, 1 reply; 30+ messages in thread From: Kuniyuki Iwashima @ 2026-05-08 20:19 UTC (permalink / raw) To: Stanislav Fomichev Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev On Fri, May 8, 2026 at 8:35 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote: > > On 05/08, Kuniyuki Iwashima wrote: > > The test is roughly divided into two stages, and the sequence > > is as follows: > > > > I) Setup > > > > 1. Attach two BPF programs to a cgroup > > 2. Establish a TCP connection (@client <-> @child) within the cgroup > > 3. Enable BPF_SOCK_OPS_RCVLOWAT_CB on @child > > > > II) RPC frame exchange in various patterns > > > > 4. Send a partial RPC descriptor from @client to @child > > 5. Verify that epoll does NOT wake up @child > > 6. Send the remaining data of the RPC frame > > 7. Verify that epoll finally wakes up @child > > > > During setup, two BPF programs are attached to simulate > > a real-world scenario; one is SOCK_OPS and the other is > > CGROUP_SOCKOPT. > > > > While the SOCK_OPS prog handles the dynamic adjustment of > > sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable > > BPF_SOCK_OPS_RCVLOWAT_CB via userspace setsockopt() using > > pseudo options: > > > > #define SOL_BPF 0xdeadbeef > > #define BPF_TCP_AUTOLOWAT 0x8badf00d > > > > setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int)); > > Hmm, so you do want to have this enable-on-a-per-socket use case.. Then what > happens to the skbs already sitting in the rx queue by the time > we enable BPF_SOCK_OPS_RCVLOWAT_CB? Isn't there a race? Or am I missing > something? If the upper protocol is designed that way, it could be racy. For example, HTTP2 could have multiple streams within the same TCP connection while only one stream can be upgraded. But this use case simply does not work with a single sk->sk_rcvlowat in the first place. This feature implicitly assumes a single stream per connection. In our original implementation, it traversed all (technically up to 100) skbs in the queue and parsed them like psock does. But this will be unnecessary if the server drains the queue properly before enabling the feature. If needed, we could return -EBUSY when the queue is not empty. It's always under lock_sock() or bh_lock_sock().
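The -EBUSY idea could look like the sketch below, assuming the check sits in the TCP_BPF_SOCK_OPS_CB_FLAGS setsockopt path under the socket lock (the helper name is hypothetical):

	/* Sketch: refuse to enable the callback while unread data is
	 * queued, so the prog never starts parsing mid-stream.
	 */
	static int tcp_bpf_enable_rcvlowat_cb(struct sock *sk)
	{
		if (!skb_queue_empty(&sk->sk_receive_queue))
			return -EBUSY;

		tcp_sk(sk)->bpf_sock_ops_cb_flags |= BPF_SOCK_OPS_RCVLOWAT_CB_FLAG;
		return 0;
	}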
* Re: [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB. 2026-05-08 20:19 ` Kuniyuki Iwashima @ 2026-05-08 21:47 ` Stanislav Fomichev 2026-05-08 21:58 ` Kuniyuki Iwashima 0 siblings, 1 reply; 30+ messages in thread From: Stanislav Fomichev @ 2026-05-08 21:47 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev On 05/08, Kuniyuki Iwashima wrote: > On Fri, May 8, 2026 at 8:35 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote: > > > > On 05/08, Kuniyuki Iwashima wrote: > > > The test is roughly divided into two stages, and the sequence > > > is as follows: > > > > > > I) Setup > > > > > > 1. Attach two BPF programs to a cgroup > > > 2. Establish a TCP connection (@client <-> @child) within the cgroup > > > 3. Enable BPF_SOCK_OPS_RCVLOWAT_CB on @child > > > > > > II) RPC frame exchange in various patterns > > > > > > 4. Send a partial RPC descriptor from @client to @child > > > 5. Verify that epoll does NOT wake up @child > > > 6. Send the remaining data of the RPC frame > > > 7. Verify that epoll finally wakes up @child > > > > > > During setup, two BPF programs are attached to simulate > > > a real-world scenario; one is SOCK_OPS and the other is > > > CGROUP_SOCKOPT. > > > > > > While the SOCK_OPS prog handles the dynamic adjustment of > > > sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable > > > BPF_SOCK_OPS_RCVLOWAT_CB via userspace setsockopt() using > > > pseudo options: > > > > > > #define SOL_BPF 0xdeadbeef > > > #define BPF_TCP_AUTOLOWAT 0x8badf00d > > > > > > setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int)); > > > > Hmm, so you do want to have this enable-on-a-per-socket use case.. Then what > > happens to the skbs already sitting in the rx queue by the time > > we enable BPF_SOCK_OPS_RCVLOWAT_CB? Isn't there a race? Or am I missing > > something? > > If the upper protocol is designed that way, it could be racy. > > For example, HTTP2 could have multiple streams within the > same TCP connection while only one stream can be upgraded. > But this use case simply does not work with a single > sk->sk_rcvlowat in the first place. > > This feature implicitly assumes a single stream per connection. > > In our original implementation, it traversed all (technically > up to 100) skbs in the queue and parse them like psock. > > But this will be unnecessary if the server drains the queue > properly before enabling the feature. > > If needed, we could return -EBUSY when the queue is not > empty. It's always under lock_sock() or bh_lock_sock(). But even for a single stream it's racy, no? Suppose I do recvmsg() and then setsockopt(BPF_TCP_AUTOLOWAT, 1). There is a window of time between recvmsg() and setsockopt() where we can receive a packet? Or you're saying that if it is an-RPC type protocol, the receiver explicitly signals to the sender that it's ready to receive (by sending it an RPC request). If that's the case, let's explicitly explain this somewhere? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB. 2026-05-08 21:47 ` Stanislav Fomichev @ 2026-05-08 21:58 ` Kuniyuki Iwashima 0 siblings, 0 replies; 30+ messages in thread From: Kuniyuki Iwashima @ 2026-05-08 21:58 UTC (permalink / raw) To: Stanislav Fomichev Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev On Fri, May 8, 2026 at 2:48 PM Stanislav Fomichev <sdf.kernel@gmail.com> wrote: > > On 05/08, Kuniyuki Iwashima wrote: > > On Fri, May 8, 2026 at 8:35 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote: > > > > > > On 05/08, Kuniyuki Iwashima wrote: > > > > The test is roughly divided into two stages, and the sequence > > > > is as follows: > > > > > > > > I) Setup > > > > > > > > 1. Attach two BPF programs to a cgroup > > > > 2. Establish a TCP connection (@client <-> @child) within the cgroup > > > > 3. Enable BPF_SOCK_OPS_RCVLOWAT_CB on @child > > > > > > > > II) RPC frame exchange in various patterns > > > > > > > > 4. Send a partial RPC descriptor from @client to @child > > > > 5. Verify that epoll does NOT wake up @child > > > > 6. Send the remaining data of the RPC frame > > > > 7. Verify that epoll finally wakes up @child > > > > > > > > During setup, two BPF programs are attached to simulate > > > > a real-world scenario; one is SOCK_OPS and the other is > > > > CGROUP_SOCKOPT. > > > > > > > > While the SOCK_OPS prog handles the dynamic adjustment of > > > > sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable > > > > BPF_SOCK_OPS_RCVLOWAT_CB via userspace setsockopt() using > > > > pseudo options: > > > > > > > > #define SOL_BPF 0xdeadbeef > > > > #define BPF_TCP_AUTOLOWAT 0x8badf00d > > > > > > > > setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int)); > > > > > > Hmm, so you do want to have this enable-on-a-per-socket use case.. Then what > > > happens to the skbs already sitting in the rx queue by the time > > > we enable BPF_SOCK_OPS_RCVLOWAT_CB? Isn't there a race? Or am I missing > > > something? > > > > If the upper protocol is designed that way, it could be racy. > > > > For example, HTTP2 could have multiple streams within the > > same TCP connection while only one stream can be upgraded. > > But this use case simply does not work with a single > > sk->sk_rcvlowat in the first place. > > > > This feature implicitly assumes a single stream per connection. > > > > In our original implementation, it traversed all (technically > > up to 100) skbs in the queue and parse them like psock. > > > > But this will be unnecessary if the server drains the queue > > properly before enabling the feature. > > > > If needed, we could return -EBUSY when the queue is not > > empty. It's always under lock_sock() or bh_lock_sock(). > > But even for a single stream it's racy, no? Suppose I do recvmsg() > and then setsockopt(BPF_TCP_AUTOLOWAT, 1). There is a window of time > between recvmsg() and setsockopt() where we can receive a packet? > Or you're saying that if it is an-RPC type protocol, the receiver > explicitly signals to the sender that it's ready to receive (by sending > it an RPC request). Right, we can assume the client stops at a certain point in practical use cases (HTTP 200 for Upgrade:, RPC auth response, etc). > If that's the case, let's explicitly explain this > somewhere? Sure, will add that. 
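As a concrete example of that convention, the application enables the feature only at a quiescent point in the stream, after draining the bytes that complete the protocol switch (a sketch using the pseudo options from the selftest; the helper name is made up):

	/* Sketch: the caller has already read everything up to the
	 * switch point, and the peer sends nothing further until it
	 * receives a reply, so the receive queue is empty here.
	 */
	static int enable_autolowat(int fd)
	{
		int one = 1;

		return setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT,
				  &one, sizeof(one));
	}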
* Re: [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB.
  2026-05-08  7:33 ` [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
  2026-05-08 15:35   ` Stanislav Fomichev
@ 2026-05-08 22:17   ` sashiko-bot
  2026-05-08 22:47     ` Kuniyuki Iwashima
  1 sibling, 1 reply; 30+ messages in thread
From: sashiko-bot @ 2026-05-08 22:17 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 2 potential
issue(s) to consider:
- [Medium] Flaky test design due to race condition between send() and
  epoll_wait()/getsockopt().
- [Medium] Pipelined data causes a TOCTOU race that permanently breaks
  and disables the BPF autolowat program.
--

commit 7be0b88b0617982ff63886ecdcf680fd3a6f0db8
Author: Kuniyuki Iwashima <kuniyu@google.com>

    selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB.

    This commit adds a selftest simulating a production scenario where
    a BPF program dynamically adjusts a socket's SO_RCVLOWAT based on
    the parsing of RPC descriptors. The test exchanges partial and full
    RPC frames over a local connection and verifies that epoll wakes up
    the application at the correct boundaries.

> diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> --- a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c

[ ... ]

> @@ -0,0 +1,350 @@
> +static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
> +					struct rpc_test_case *test_case)
> +{

[ ... ]

> +	switch (event->type) {
> +	case RPC_EVENT_END:
> +		return;
> +	case RPC_EVENT_AUTOLOWAT:
> +		err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
> +				 &event->val, sizeof(event->val));
> +		if (!ASSERT_OK(err, "setsockopt"))
> +			return;
> +		break;
> +	case RPC_EVENT_SEND:
> +		err = send(cb->client, ptr, event->len, 0);
> +		if (!ASSERT_EQ(err, event->len, "send"))
> +			return;
> +
> +		ptr += event->len;
> +		break;

[ ... ]

> +	case RPC_EVENT_EPOLL:
> +		err = epoll_wait(cb->epoll, &ev, 1, 0);
> +		if (!ASSERT_EQ(err, event->nfds, "epoll_wait"))
> +			return;
> +		break;
> +	case RPC_EVENT_RCVLOWAT:
> +		optval = 0;
> +		optlen = sizeof(optval);
> +
> +		err = getsockopt(cb->child, SOL_SOCKET, SO_RCVLOWAT, &optval, &optlen);
> +		if (!ASSERT_OK(err, "getsockopt") ||
> +		    !ASSERT_EQ(optval, event->rcvlowat, "rcvlowat"))
> +			return;
> +		break;
> +	}

Are the epoll and getsockopt checks vulnerable to flaky failures
under load?

Since TCP loopback delivery can be deferred to softirq context, could
epoll_wait with a 0-millisecond timeout return 0 if the packet hasn't
been queued yet?

Similarly, could getsockopt read a stale SO_RCVLOWAT value before the
BPF program gets a chance to update it?

Would using a polling loop or a positive epoll timeout prevent
intermittent test failures?

> diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> --- a/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> +++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c

[ ... ]

> @@ -0,0 +1,316 @@
> +static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
> +				struct tcp_autolowat_cb *cb,
> +				u32 seq, u32 end_seq)
> +{
> +	struct rpc_descriptor *rpc_desc;
> +	u32 rpc_copied_seq;
> +	u32 copy_len;
> +	u64 rpc_len;
> +	int err;
> +
> +	rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
> +
> +	if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
> +		copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
> +	else
> +		copy_len = end_seq - rpc_copied_seq;
> +
> +	/* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
> +	 * clang omits the "copy_len == 0" check below, which is necessary
> +	 * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
> +	 */
> +	barrier_var(copy_len);
> +
> +	if (copy_len == 0)
> +		goto disable;	/* FIN. */
> +	if (copy_len > RPC_DESC_SIZE)
> +		goto disable;	/* always false, only for verifier. */
> +	if (cb->rpc_desc_buf + cb->rpc_desc_buff_len >= &cb->rpc_desc_buf[RPC_DESC_SIZE])
> +		goto disable;	/* always false, only for verifier. */
> +
> +	err = bpf_skb_load_bytes(skops, rpc_copied_seq - seq,
> +				 cb->rpc_desc_buf + cb->rpc_desc_buff_len, copy_len);
> +	if (err)
> +		goto disable;

Is there a race condition if pipelined data is already in the receive
queue when BPF_TCP_AUTOLOWAT is enabled?

Looking at tcp_init_autolowat_cb(), cb->rpc_desc_seq is initialized to
tp->copied_seq. If there is already unread data in the queue, the next
packet arrival triggers BPF_SOCK_OPS_RCVLOWAT_CB.

Because seq (which comes from tcb->seq) corresponds to the new packet
and is greater than tp->copied_seq, will rpc_copied_seq - seq underflow?

If it underflows to a large positive value, won't bpf_skb_load_bytes()
fail with -EFAULT, causing the code to jump to the disable label and
permanently disable the autolowat functionality for this socket?

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508073355.3916746-1-kuniyu@google.com?part=8
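If the 0-timeout epoll_wait() ever proved flaky (the follow-up below
argues it cannot on loopback), the polling loop suggested above might
look like this sketch (epoll_wait_retry() and its timeout values are
assumptions, not code from this series):

  #include <sys/epoll.h>

  /* Retry with small positive timeouts instead of asserting on a
   * single epoll_wait(..., 0) call.
   */
  static int epoll_wait_retry(int epfd, struct epoll_event *ev,
  			    int expected)
  {
  	int i, nfds;

  	if (!expected)
  		/* A positive timeout confirms no wakeup arrives late. */
  		return epoll_wait(epfd, ev, 1, 100);

  	for (i = 0; i < 100; i++) {
  		nfds = epoll_wait(epfd, ev, 1, 10);
  		if (nfds)
  			break;
  	}

  	return nfds;
  }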
* Re: [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB.
  2026-05-08 22:17     ` sashiko-bot
@ 2026-05-08 22:47       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-08 22:47 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Fri, May 8, 2026 at 3:17 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 2 potential
> issue(s) to consider:
> - [Medium] Flaky test design due to race condition between send() and
>   epoll_wait()/getsockopt().
> - [Medium] Pipelined data causes a TOCTOU race that permanently breaks
>   and disables the BPF autolowat program.
> --
>
> commit 7be0b88b0617982ff63886ecdcf680fd3a6f0db8
> Author: Kuniyuki Iwashima <kuniyu@google.com>
>
>     selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB.
>
>     This commit adds a selftest simulating a production scenario where
>     a BPF program dynamically adjusts a socket's SO_RCVLOWAT based on
>     the parsing of RPC descriptors. The test exchanges partial and full
>     RPC frames over a local connection and verifies that epoll wakes up
>     the application at the correct boundaries.
>
> > diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> > --- a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
>
> [ ... ]
>
> > @@ -0,0 +1,350 @@
> > +static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
> > +					struct rpc_test_case *test_case)
> > +{
>
> [ ... ]
>
> > +	switch (event->type) {
> > +	case RPC_EVENT_END:
> > +		return;
> > +	case RPC_EVENT_AUTOLOWAT:
> > +		err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
> > +				 &event->val, sizeof(event->val));
> > +		if (!ASSERT_OK(err, "setsockopt"))
> > +			return;
> > +		break;
> > +	case RPC_EVENT_SEND:
> > +		err = send(cb->client, ptr, event->len, 0);
> > +		if (!ASSERT_EQ(err, event->len, "send"))
> > +			return;
> > +
> > +		ptr += event->len;
> > +		break;
>
> [ ... ]
>
> > +	case RPC_EVENT_EPOLL:
> > +		err = epoll_wait(cb->epoll, &ev, 1, 0);
> > +		if (!ASSERT_EQ(err, event->nfds, "epoll_wait"))
> > +			return;
> > +		break;
> > +	case RPC_EVENT_RCVLOWAT:
> > +		optval = 0;
> > +		optlen = sizeof(optval);
> > +
> > +		err = getsockopt(cb->child, SOL_SOCKET, SO_RCVLOWAT, &optval, &optlen);
> > +		if (!ASSERT_OK(err, "getsockopt") ||
> > +		    !ASSERT_EQ(optval, event->rcvlowat, "rcvlowat"))
> > +			return;
> > +		break;
> > +	}
>
> Are the epoll and getsockopt checks vulnerable to flaky failures
> under load?
>
> Since TCP loopback delivery can be deferred to softirq context, could
> epoll_wait with a 0-millisecond timeout return 0 if the packet hasn't
> been queued yet?

No, it's over "lo", and TX/RX is in the same context.

> Similarly, could getsockopt read a stale SO_RCVLOWAT value before the
> BPF program gets a chance to update it?
>
> Would using a polling loop or a positive epoll timeout prevent
> intermittent test failures?
>
> > diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> > --- a/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> > +++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
>
> [ ... ]
>
> > @@ -0,0 +1,316 @@
> > +static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
> > +				struct tcp_autolowat_cb *cb,
> > +				u32 seq, u32 end_seq)
> > +{
> > +	struct rpc_descriptor *rpc_desc;
> > +	u32 rpc_copied_seq;
> > +	u32 copy_len;
> > +	u64 rpc_len;
> > +	int err;
> > +
> > +	rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
> > +
> > +	if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
> > +		copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
> > +	else
> > +		copy_len = end_seq - rpc_copied_seq;
> > +
> > +	/* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
> > +	 * clang omits the "copy_len == 0" check below, which is necessary
> > +	 * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
> > +	 */
> > +	barrier_var(copy_len);
> > +
> > +	if (copy_len == 0)
> > +		goto disable;	/* FIN. */
> > +	if (copy_len > RPC_DESC_SIZE)
> > +		goto disable;	/* always false, only for verifier. */
> > +	if (cb->rpc_desc_buf + cb->rpc_desc_buff_len >= &cb->rpc_desc_buf[RPC_DESC_SIZE])
> > +		goto disable;	/* always false, only for verifier. */
> > +
> > +	err = bpf_skb_load_bytes(skops, rpc_copied_seq - seq,
> > +				 cb->rpc_desc_buf + cb->rpc_desc_buff_len, copy_len);
> > +	if (err)
> > +		goto disable;
>
> Is there a race condition if pipelined data is already in the receive
> queue when BPF_TCP_AUTOLOWAT is enabled?
>
> Looking at tcp_init_autolowat_cb(), cb->rpc_desc_seq is initialized to
> tp->copied_seq. If there is already unread data in the queue, the next
> packet arrival triggers BPF_SOCK_OPS_RCVLOWAT_CB.
>
> Because seq (which comes from tcb->seq) corresponds to the new packet
> and is greater than tp->copied_seq, will rpc_copied_seq - seq underflow?
>
> If it underflows to a large positive value, won't bpf_skb_load_bytes()
> fail with -EFAULT, causing the code to jump to the disable label and
> permanently disable the autolowat functionality for this socket?

This assumes the client will stop at a certain point; it will not
send the next request speculatively before receiving the previous
response.

In such an irregular case, we want to fall back to no autolowat,
so this is fine.
end of thread, other threads:[~2026-05-11 14:56 UTC | newest]

Thread overview: 30+ messages
2026-05-08  7:33 [PATCH v1 bpf-next 0/8] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
2026-05-08  7:33 ` [PATCH v1 bpf-next 1/8] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
2026-05-08 19:02   ` sashiko-bot
2026-05-08 20:21     ` Kuniyuki Iwashima
2026-05-08  7:33 ` [PATCH v1 bpf-next 2/8] bpf: tcp: Introduce BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
2026-05-08 19:17   ` sashiko-bot
2026-05-08 20:26     ` Kuniyuki Iwashima
2026-05-08  7:33 ` [PATCH v1 bpf-next 3/8] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
2026-05-08 15:15   ` Stanislav Fomichev
2026-05-08 19:45     ` Kuniyuki Iwashima
2026-05-11 14:56       ` Stanislav Fomichev
2026-05-08  7:33 ` [PATCH v1 bpf-next 4/8] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
2026-05-08  7:33 ` [PATCH v1 bpf-next 5/8] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
2026-05-11 12:34   ` Björn Töpel
2026-05-08  7:33 ` [PATCH v1 bpf-next 6/8] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
2026-05-08  7:33 ` [PATCH v1 bpf-next 7/8] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
2026-05-08 10:37   ` Jiayuan Chen
2026-05-08 11:30     ` Kuniyuki Iwashima
2026-05-08 12:19       ` Jiayuan Chen
2026-05-08 15:28   ` Stanislav Fomichev
2026-05-08 20:05     ` Kuniyuki Iwashima
2026-05-11 14:55       ` Stanislav Fomichev
2026-05-08 21:46   ` sashiko-bot
2026-05-08  7:33 ` [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB Kuniyuki Iwashima
2026-05-08 15:35   ` Stanislav Fomichev
2026-05-08 20:19     ` Kuniyuki Iwashima
2026-05-08 21:47       ` Stanislav Fomichev
2026-05-08 21:58         ` Kuniyuki Iwashima
2026-05-08 22:17   ` sashiko-bot
2026-05-08 22:47     ` Kuniyuki Iwashima