[PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.

Netdev List
 help / color / mirror / Atom feed

* [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.
@ 2026-05-23  8:29 Kuniyuki Iwashima
  2026-05-23  8:29 ` [PATCH v3 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

This series introduces BPF_SOCK_OPS_RCVQ_CB, a new type of opt-in
hooks for BPF SOCK_OPS prog.

The hooks can be enabled on per-socket basis by bpf_setsockopt():

  int flag = BPF_SOCK_OPS_RCVQ_CB_FLAG;

  bpf_setsockopt(sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
                 &flags, sizeof(flags));

or via the SOCK_OPS specific helper:

  bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVQ_CB_FLAG);

Once activated, the BPF prog will be invoked with bpf_sock_ops.op
set to BPF_SOCK_OPS_RCVQ_CB upon the following events:

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

This allows the BPF prog to dynamically adjust sk->sk_rcvlowat,
suppressing unnecessary EPOLLIN wakeups until sufficient data
is available in the receive queue.

This functionality, which we call "TCP AutoLOWAT", was originally
developed in 2020 by Tenzin Ukyab with the help of Soheil Hassas
Yeganeh, Arjun Roy, and Eric Dumazet.  It has served Google RPC
workloads for more than 5 years.

Combined with TCP RX zerocopy, this typically allows us to read an
entire RPC frame with just a single wakeup and a single system call.

While the original implementation was specialised for our
internal RPC format, this series introduces a more flexible
version by leveraging BPF.

The BPF SOCK_OPS prog in the last selftest patch closely mirrors
the core logic of the original implementation to provide a real-world
example.

Overview:

  Patch  1      : misc cleanup for testing
  Patch  2      : Add BPF_SOCK_OPS_RCVQ_CB with no actual hooks
  Patch  3 -  5 : Add bpf helpers
  Patch  6 -  8 : Add safe guard for BPF_SOCK_OPS_RCVQ_CB
  Patch  9 - 10 : Add BPF_SOCK_OPS_RCVQ_CB hooks
  Patch 11      : selftest

Changes:
  v3:
    Patch  6: Check sk->sk_prot and update tp->bpf_sock_ops_cb_flags
               under sk->sk_callback_lock, and only when not flagged yet.

  v2: https://lore.kernel.org/bpf/20260522074601.1658705-1-kuniyu@google.com/
    Add Patch 6 - 8
    Patch  2: s/BPF_SOCK_OPS_RCVLOWAT_CB/BPF_SOCK_OPS_RCVQ_CB/g
    Patch  3: Explain why using ____ version instead of __ version
    Patch 10: Add explanation of tcp_bpf_rcvlowat() placement.
    Patch 11: Make copy_len u64 and swap validation order for it
               to pass no_alu32 test case

  v1: https://lore.kernel.org/bpf/20260508073355.3916746-1-kuniyu@google.com/

Kuniyuki Iwashima (11):
  selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
  bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB.
  bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
  tcp: Split out __tcp_set_rcvlowat().
  bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
  bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
  bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB.
  bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty.
  bpf: tcp: Factorise bpf_skops_established().
  bpf: tcp: Add SOCK_OPS rcvlowat hook.
  selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.

 include/net/tcp.h                             |  14 +
 include/uapi/linux/bpf.h                      |  18 +-
 net/core/filter.c                             |  95 ++++-
 net/ipv4/tcp.c                                |  14 +-
 net/ipv4/tcp_bpf.c                            |  12 +
 net/ipv4/tcp_fastopen.c                       |   2 +
 net/ipv4/tcp_input.c                          |  25 +-
 tools/include/uapi/linux/bpf.h                |  18 +-
 tools/testing/selftests/bpf/bpf_kfuncs.h      |   4 +
 .../selftests/bpf/prog_tests/tcp_autolowat.c  | 350 ++++++++++++++++++
 .../selftests/bpf/prog_tests/tcpbpf_user.c    |   3 +-
 .../selftests/bpf/progs/bpf_tracing_net.h     |   2 +
 .../selftests/bpf/progs/tcp_autolowat.c       | 326 ++++++++++++++++
 .../selftests/bpf/progs/test_tcpbpf_kern.c    |   3 +-
 14 files changed, 870 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
 create mode 100644 tools/testing/selftests/bpf/progs/tcp_autolowat.c

-- 
2.54.0.746.g67dd491aae-goog

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-23  9:06   ` bot+bpf-ci
  2026-05-23  8:29 ` [PATCH v3 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

Once bpf_sock_ops_cb_flags_set() supports a new flag,
tcpbpf_user.c fails due to the hard-coded max value, 0x80.

Let's replace 0x80 with BPF_SOCK_OPS_ALL_CB_FLAGS + 1.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c | 3 ++-
 tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
index 7e8fe1bad03f..e4849d2a2956 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
@@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
 	ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
 	ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
 	ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
-	ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
+	ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
+		  "bad_cb_test_rv");
 	ASSERT_EQ(result->good_cb_test_rv, 0, "good_cb_test_rv");
 	ASSERT_EQ(result->num_listen, 1, "num_listen");
 
diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
index 6935f32eeb8f..e30cb1fab079 100644
--- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
@@ -92,7 +92,8 @@ int bpf_testcb(struct bpf_sock_ops *skops)
 		break;
 	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
 		/* Test failure to set largest cb flag (assumes not defined) */
-		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops, 0x80);
+		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
+								  BPF_SOCK_OPS_ALL_CB_FLAGS + 1);
 		/* Set callback */
 		global.good_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
 						 BPF_SOCK_OPS_STATE_CB_FLAG);
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
  2026-05-23  8:29 ` [PATCH v3 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
@ 2026-05-23  9:06   ` bot+bpf-ci
  0 siblings, 0 replies; 27+ messages in thread
From: bot+bpf-ci @ 2026-05-23  9:06 UTC (permalink / raw)
  To: kuniyu, ast, daniel, andrii, martin.lau, eddyz87, memxor
  Cc: yonghong.song, john.fastabend, sdf, edumazet, ncardwell, willemb,
	ukyab, kuniyu, kuni1840, bpf, netdev, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2692 bytes --]

> diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> index 7e8fe1bad03f..e4849d2a2956 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> @@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
>  	ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
>  	ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
>  	ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
> -	ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
> +	ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
> +		  "bad_cb_test_rv");
>  	ASSERT_EQ(result->good_cb_test_rv, 0, "good_cb_test_rv");
>  	ASSERT_EQ(result->num_listen, 1, "num_listen");
>

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> index 6935f32eeb8f..e4cb1fab079 100644
> --- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> @@ -92,7 +92,8 @@ int bpf_testcb(struct bpf_sock_ops *skops)
>  		break;
>  	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
>  		/* Test failure to set largest cb flag (assumes not defined) */
> -		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops, 0x80);
> +		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
> +								  BPF_SOCK_OPS_ALL_CB_FLAGS + 1);

Multiple reviewers raised concerns about forward compatibility with this
approach. When the test is compiled, BPF_SOCK_OPS_ALL_CB_FLAGS + 1 becomes
a hardcoded value. If this compiled test runs on a future kernel where that
value becomes a valid flag, the helper will successfully set the flag and
return 0, causing the test to fail unexpectedly.

Additionally, this assumes contiguous bitmasks. If a flag is ever deprecated
or skipped, adding 1 could overlap with valid bits.

Would it be better to use a safely distant bit, such as 1 << 30, to guarantee
the flag remains unsupported and the test doesn't break in the future?

Reference: https://lore.kernel.org/bpf/20260508190227.AFAE9C2BD00@smtp.kernel.org/

This concern was raised by sashiko-bot@kernel.org in v1 and v2, and also
flagged by bot+bpf-ci@kernel.org in v2. The author's v1 response indicated
the test is not supposed to run on older kernels, but the reviewers are
concerned about FUTURE kernels, not older ones.


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB.
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
  2026-05-23  8:29 ` [PATCH v3 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-23  8:29 ` [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

We will introduce a new type of opt-in hooks for BPF SOCK_OPS prog.

The hooks can be enabled on per-socket basis by bpf_setsockopt():

  int flag = BPF_SOCK_OPS_RCVQ_CB_FLAG;

  bpf_setsockopt(sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
                 &flags, sizeof(flags));

or via the SOCK_OPS specific helper:

  bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVQ_CB_FLAG);

Once activated, the BPF prog will be invoked with bpf_sock_ops.op
set to BPF_SOCK_OPS_RCVQ_CB upon the following events:

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

This will allow the BPF prog to dynamically adjust sk->sk_rcvlowat,
suppressing unnecessary EPOLLIN wakeups until sufficient data
(e.g., a full RPC frame) is available in the receive queue.

Note that is_locked_tcp_sock_ops() is left unchanged not to enable
bpf_setsockopt() unnecessarily, but bpf_sock_ops_cb_flags_set() is
supported at BPF_SOCK_OPS_RCVQ_CB to disable by itself.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2: s/BPF_SOCK_OPS_RCVLOWAT_CB/BPF_SOCK_OPS_RCVQ_CB/g
---
 include/uapi/linux/bpf.h       | 18 +++++++++++++++++-
 net/core/filter.c              |  3 ++-
 tools/include/uapi/linux/bpf.h | 18 +++++++++++++++++-
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aec171ccb6ef..31130e1b63ea 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6960,6 +6960,9 @@ struct bpf_sock_ops {
 	 *					the 3WHS.
 	 * BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
 	 *					the 3WHS.
+	 * BPF_SOCK_OPS_RCVQ_CB : No header included.  The payload is only
+	 *			  accessible by passing bpf_sock_ops to
+	 *			  bpf_skb_load_bytes().
 	 *
 	 * bpf_load_hdr_opt() can also be used to read a particular option.
 	 */
@@ -7031,8 +7034,16 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+	/* Call bpf when TCP payload is queued to sk->sk_receive_queue
+	 * and after recvmsg().  The bpf prog will be called under
+	 * sock_ops->op == BPF_SOCK_OPS_RCVQ_CB.
+	 *
+	 * It can be used to adjust sk->sk_rcvlowat and suppress
+	 * unnecessary wakeups before sufficient data is available.
+	 */
+	BPF_SOCK_OPS_RCVQ_CB_FLAG = (1<<7),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
 };
 
 enum {
@@ -7176,6 +7187,11 @@ enum {
 					 * sendmsg timestamp with corresponding
 					 * tskey.
 					 */
+	BPF_SOCK_OPS_RCVQ_CB,		/* Called when TCP payload is queued to
+					 * sk->sk_receive_queue and after recvmsg()
+					 * to allow adjusting sk->sk_rcvlowat and
+					 * to suppress early wakeups.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/filter.c b/net/core/filter.c
index 9590877b0714..4a50fe2cd863 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6002,7 +6002,8 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
 	struct sock *sk = bpf_sock->sk;
 	int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
 
-	if (!is_locked_tcp_sock_ops(bpf_sock))
+	if (!is_locked_tcp_sock_ops(bpf_sock) &&
+	    bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB)
 		return -EOPNOTSUPP;
 
 	if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 37142e6d911a..3b8f392d8c69 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6960,6 +6960,9 @@ struct bpf_sock_ops {
 	 *					the 3WHS.
 	 * BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
 	 *					the 3WHS.
+	 * BPF_SOCK_OPS_RCVQ_CB : No header included.  The payload is only
+	 *			  accessible by passing bpf_sock_ops to
+	 *			  bpf_skb_load_bytes().
 	 *
 	 * bpf_load_hdr_opt() can also be used to read a particular option.
 	 */
@@ -7031,8 +7034,16 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+	/* Call bpf when TCP payload is queued to sk->sk_receive_queue
+	 * and after recvmsg().  The bpf prog will be called under
+	 * sock_ops->op == BPF_SOCK_OPS_RCVQ_CB.
+	 *
+	 * It can be used to adjust sk->sk_rcvlowat and suppress
+	 * unnecessary wakeups before sufficient data is available.
+	 */
+	BPF_SOCK_OPS_RCVQ_CB_FLAG = (1<<7),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
 };
 
 enum {
@@ -7176,6 +7187,11 @@ enum {
 					 * sendmsg timestamp with corresponding
 					 * tskey.
 					 */
+	BPF_SOCK_OPS_RCVQ_CB,		/* Called when TCP payload is queued to
+					 * sk->sk_receive_queue and after recvmsg()
+					 * to allow adjusting sk->sk_rcvlowat and
+					 * to suppress early wakeups.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
  2026-05-23  8:29 ` [PATCH v3 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
  2026-05-23  8:29 ` [PATCH v3 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-26 20:34   ` Martin KaFai Lau
  2026-05-23  8:29 ` [PATCH v3 bpf-next 04/11] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
prog can be called with BPF_SOCK_OPS_RCVQ_CB.

In this hook, we want to parse the RPC descriptor in the skb
and adjust sk->sk_rcvlowat based on the RPC frame size.

However, we cannot access payload via bpf_sock_ops.data on
modern NICs with TCP header/data split on as the payload is
not placed in the linear area.

Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.

Three notes:

  1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
      invoked from recvmsg().

  2) Access to bpf_sock_ops.data will be disabled by passing
      0 end_offset to bpf_skops_init_skb().

  3) ____bpf_skb_load_bytes() is called directly instead of
     __bpf_skb_load_bytes() to allow compilers to inline it
     instead of generating a tail-call.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2: Explain why using ____ version instead of __
---
 net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 4a50fe2cd863..fa8a7c7d86eb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7760,6 +7760,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
+	   u32, offset, void *, to, u32, len)
+{
+	int err;
+
+	if (bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB) {
+		err = -EOPNOTSUPP;
+		goto err_clear;
+	}
+
+	if (!bpf_sock->skb) {
+		err = -EPERM;
+		goto err_clear;
+	}
+
+	return ____bpf_skb_load_bytes(bpf_sock->skb, offset, to, len);
+
+err_clear:
+	memset(to, 0, len);
+	return err;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_skb_load_bytes_proto = {
+	.func		= bpf_sock_ops_skb_load_bytes,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg4_type	= ARG_CONST_SIZE,
+};
+
 static const u8 *bpf_search_tcp_opt(const u8 *op, const u8 *opend,
 				    u8 search_kind, const u8 *magic,
 				    u8 magic_len, bool *eol)
@@ -8616,6 +8648,8 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_get_netns_cookie:
 		return &bpf_get_netns_cookie_sock_ops_proto;
 #ifdef CONFIG_INET
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_sock_ops_skb_load_bytes_proto;
 	case BPF_FUNC_load_hdr_opt:
 		return &bpf_sock_ops_load_hdr_opt_proto;
 	case BPF_FUNC_store_hdr_opt:
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
  2026-05-23  8:29 ` [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
@ 2026-05-26 20:34   ` Martin KaFai Lau
  2026-05-26 21:21     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2026-05-26 20:34 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song,
	John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell,
	Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev

On Sat, May 23, 2026 at 08:29:32AM +0000, Kuniyuki Iwashima wrote:
> When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
> prog can be called with BPF_SOCK_OPS_RCVQ_CB.
> 
> In this hook, we want to parse the RPC descriptor in the skb
> and adjust sk->sk_rcvlowat based on the RPC frame size.
> 
> However, we cannot access payload via bpf_sock_ops.data on
> modern NICs with TCP header/data split on as the payload is
> not placed in the linear area.
> 
> Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
> 
> Three notes:
> 
>   1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
>       invoked from recvmsg().
> 
>   2) Access to bpf_sock_ops.data will be disabled by passing
>       0 end_offset to bpf_skops_init_skb().
> 
>   3) ____bpf_skb_load_bytes() is called directly instead of
>      __bpf_skb_load_bytes() to allow compilers to inline it
>      instead of generating a tail-call.

Some observations below.

> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
> v2: Explain why using ____ version instead of __
> ---
>  net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4a50fe2cd863..fa8a7c7d86eb 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -7760,6 +7760,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
>  	.arg3_type	= ARG_ANYTHING,
>  };
>  
> +BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
> +	   u32, offset, void *, to, u32, len)
> +{
> +	int err;
> +
> +	if (bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB) {

bpf_dynptr_from_skb() and bpf_dynptr_slice() kfunc could also be considered.
One less bpf_sock->op check in filter.c to maintain and could also avoid
a data copy. There is a bpf_cast_to_kern_ctx() to get to a trusted
skops_kern pointer but this will need changes in verifier.c to get to
skops_kern->skb (e.g. in type_is_trusted_or_null) and this is the tradeoff.

If this new rcvq callback is added to the 'bpf_tcp_ops' proposal [1],
all this will go away. 'struct sk_buff *skb' can be directly passed to an
ops of the 'bpf_tcp_ops'. Supporting '*skb' in a struct_ops has already
been done in the bpf_qdisc.

[1]: https://lore.kernel.org/bpf/20260519215841.2984970-11-martin.lau@linux.dev/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
  2026-05-26 20:34   ` Martin KaFai Lau
@ 2026-05-26 21:21     ` Kuniyuki Iwashima
  2026-05-26 22:18       ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-26 21:21 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Jason Xing, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song,
	John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell,
	Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev

On Tue, May 26, 2026 at 1:34 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On Sat, May 23, 2026 at 08:29:32AM +0000, Kuniyuki Iwashima wrote:
> > When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
> > prog can be called with BPF_SOCK_OPS_RCVQ_CB.
> >
> > In this hook, we want to parse the RPC descriptor in the skb
> > and adjust sk->sk_rcvlowat based on the RPC frame size.
> >
> > However, we cannot access payload via bpf_sock_ops.data on
> > modern NICs with TCP header/data split on as the payload is
> > not placed in the linear area.
> >
> > Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
> >
> > Three notes:
> >
> >   1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
> >       invoked from recvmsg().
> >
> >   2) Access to bpf_sock_ops.data will be disabled by passing
> >       0 end_offset to bpf_skops_init_skb().
> >
> >   3) ____bpf_skb_load_bytes() is called directly instead of
> >      __bpf_skb_load_bytes() to allow compilers to inline it
> >      instead of generating a tail-call.
>
> Some observations below.
>
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > ---
> > v2: Explain why using ____ version instead of __
> > ---
> >  net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 4a50fe2cd863..fa8a7c7d86eb 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -7760,6 +7760,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
> >       .arg3_type      = ARG_ANYTHING,
> >  };
> >
> > +BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
> > +        u32, offset, void *, to, u32, len)
> > +{
> > +     int err;
> > +
> > +     if (bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB) {
>
> bpf_dynptr_from_skb() and bpf_dynptr_slice() kfunc could also be considered.
> One less bpf_sock->op check in filter.c to maintain and could also avoid
> a data copy. There is a bpf_cast_to_kern_ctx() to get to a trusted
> skops_kern pointer but this will need changes in verifier.c to get to
> skops_kern->skb (e.g. in type_is_trusted_or_null) and this is the tradeoff.

Maybe a dumb question, but does it add extra cost (extra dynptr
function call?) if data overlaps two frags, or can dynptr handle it
seamlessly with a single bpf_dynptr_slice() ?

In our case, the data copy is ~16 bytes, so the cost will not be
a big problem I think.


>
> If this new rcvq callback is added to the 'bpf_tcp_ops' proposal [1],
> all this will go away. 'struct sk_buff *skb' can be directly passed to an
> ops of the 'bpf_tcp_ops'. Supporting '*skb' in a struct_ops has already
> been done in the bpf_qdisc.
>
> [1]: https://lore.kernel.org/bpf/20260519215841.2984970-11-martin.lau@linux.dev/

Oh I missed the series, the struct_ops conversion looks nice !
Since this work isn't urgent, I can wait for your series if mine
churns it.

Jason's series is adding a new op, and I guess this can be
integrated too ?
https://lore.kernel.org/bpf/20260521135244.40869-5-kerneljasonxing@gmail.com/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
  2026-05-26 21:21     ` Kuniyuki Iwashima
@ 2026-05-26 22:18       ` Martin KaFai Lau
  0 siblings, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2026-05-26 22:18 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Jason Xing, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song,
	John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell,
	Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev

On Tue, May 26, 2026 at 02:21:56PM -0700, Kuniyuki Iwashima wrote:
> On Tue, May 26, 2026 at 1:34 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On Sat, May 23, 2026 at 08:29:32AM +0000, Kuniyuki Iwashima wrote:
> > > When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
> > > prog can be called with BPF_SOCK_OPS_RCVQ_CB.
> > >
> > > In this hook, we want to parse the RPC descriptor in the skb
> > > and adjust sk->sk_rcvlowat based on the RPC frame size.
> > >
> > > However, we cannot access payload via bpf_sock_ops.data on
> > > modern NICs with TCP header/data split on as the payload is
> > > not placed in the linear area.
> > >
> > > Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
> > >
> > > Three notes:
> > >
> > >   1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
> > >       invoked from recvmsg().
> > >
> > >   2) Access to bpf_sock_ops.data will be disabled by passing
> > >       0 end_offset to bpf_skops_init_skb().
> > >
> > >   3) ____bpf_skb_load_bytes() is called directly instead of
> > >      __bpf_skb_load_bytes() to allow compilers to inline it
> > >      instead of generating a tail-call.
> >
> > Some observations below.
> >
> > >
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > > ---
> > > v2: Explain why using ____ version instead of __
> > > ---
> > >  net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 4a50fe2cd863..fa8a7c7d86eb 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -7760,6 +7760,38 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
> > >       .arg3_type      = ARG_ANYTHING,
> > >  };
> > >
> > > +BPF_CALL_4(bpf_sock_ops_skb_load_bytes, struct bpf_sock_ops_kern *, bpf_sock,
> > > +        u32, offset, void *, to, u32, len)
> > > +{
> > > +     int err;
> > > +
> > > +     if (bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB) {
> >
> > bpf_dynptr_from_skb() and bpf_dynptr_slice() kfunc could also be considered.
> > One less bpf_sock->op check in filter.c to maintain and could also avoid
> > a data copy. There is a bpf_cast_to_kern_ctx() to get to a trusted
> > skops_kern pointer but this will need changes in verifier.c to get to
> > skops_kern->skb (e.g. in type_is_trusted_or_null) and this is the tradeoff.
> 
> Maybe a dumb question, but does it add extra cost (extra dynptr
> function call?) if data overlaps two frags, or can dynptr handle it
> seamlessly with a single bpf_dynptr_slice() ?

Right, there is an extra bpf_dynptr_from_skb(). I don't think we have
benchmarked it.

If I read it correctly, unlike bpf_xdp_pointer, the skb_header_pointer
will still copy even if the data is in one frag. It works well if the data
is in the headlen and the worst case is to copy, which is the same as
load_bytes.

It is a readonly use case. Maybe the bpf prog can directly read the frag.
Regardless, it is useful to have a kfunc/helper to read it.

> 
> In our case, the data copy is ~16 bytes, so the cost will not be
> a big problem I think.
> 
> 
> >
> > If this new rcvq callback is added to the 'bpf_tcp_ops' proposal [1],
> > all this will go away. 'struct sk_buff *skb' can be directly passed to an
> > ops of the 'bpf_tcp_ops'. Supporting '*skb' in a struct_ops has already
> > been done in the bpf_qdisc.
> >
> > [1]: https://lore.kernel.org/bpf/20260519215841.2984970-11-martin.lau@linux.dev/
> 
> Oh I missed the series, the struct_ops conversion looks nice !
> Since this work isn't urgent, I can wait for your series if mine
> churns it.
> 
> Jason's series is adding a new op, and I guess this can be
> integrated too ?
> https://lore.kernel.org/bpf/20260521135244.40869-5-kerneljasonxing@gmail.com/

imo, a new sock_ops cb should be added as an ops in struct_ops. For example,
in patch 4 of that series, bpf_skops_rx_timestamping assigns u64 to 'u32
args[4]', which is adding tech debt to the current sock_ops interface.
For the timestamping case, it could be a separate ops for the
'struct sock' instead of 'struct tcp_sock' because it should
at least work for both TCP and UDP.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 04/11] tcp: Split out __tcp_set_rcvlowat().
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2026-05-23  8:29 ` [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-23  8:29 ` [PATCH v3 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

We will add a kfunc for BPF_SOCK_OPS_RCVQ_CB hooks to
adjust sk->sk_rcvlowat.

These hooks will be triggered when:

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

In the enqueue path, tcp_data_ready() is always called
after the hooks in tcp_queue_rcv() and tcp_ofo_queue().

If tcp_set_rcvlowat() were used as is, tcp_data_ready()
could be called twice for the same skb, which is redundant
and also confusing.

Let's split out __tcp_set_rcvlowat() and add a flag to
control wakeup behaviour.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/tcp.h |  1 +
 net/ipv4/tcp.c    | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index ecbadcb3a744..c6a6853909c4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -515,6 +515,7 @@ void tcp_set_keepalive(struct sock *sk, int val);
 void tcp_syn_ack_timeout(const struct request_sock *req);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		int flags);
+int __tcp_set_rcvlowat(struct sock *sk, int val, bool wakeup);
 int tcp_set_rcvlowat(struct sock *sk, int val);
 void tcp_set_rcvbuf(struct sock *sk, int val);
 int tcp_set_window_clamp(struct sock *sk, int val);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 432fa28e47d4..3afeb69a547a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1829,8 +1829,7 @@ int tcp_peek_len(struct socket *sock)
 	return tcp_inq(sock->sk);
 }
 
-/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
-int tcp_set_rcvlowat(struct sock *sk, int val)
+int __tcp_set_rcvlowat(struct sock *sk, int val, bool wakeup)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int space, cap;
@@ -1843,7 +1842,8 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 	WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
 
 	/* Check if we need to signal EPOLLIN right now */
-	tcp_data_ready(sk);
+	if (wakeup)
+		tcp_data_ready(sk);
 
 	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
 		return 0;
@@ -1858,6 +1858,12 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 	return 0;
 }
 
+/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
+int tcp_set_rcvlowat(struct sock *sk, int val)
+{
+	return __tcp_set_rcvlowat(sk, val, true);
+}
+
 void tcp_set_rcvbuf(struct sock *sk, int val)
 {
 	tcp_set_window_clamp(sk, tcp_win_from_space(sk, val));
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2026-05-23  8:29 ` [PATCH v3 bpf-next 04/11] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-23  9:06   ` bot+bpf-ci
  2026-05-23  8:29 ` [PATCH v3 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

We will invoke BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB
to adjust sk->sk_rcvlowat when

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

Let's provide a kfunc to set sk->sk_rcvlowat.

Negative values are clamped to INT_MAX, consistent with SO_RCVLOWAT.

The wakeup flag is determined based on bpf_sock_ops_kern.skb:

  * For the enqueue hook, skb is always non-NULL, and wakeup is
    set to false because

    * tcp_data_ready() is always called after the hooks in
       tcp_queue_rcv() and tcp_ofo_queue().

    * when tcp_fastopen_add_skb() is called for TFO SYN,
       the socket is not yet accept()ed, and when called
       for TFO SYN+ACK, the socket is woken up by
       sk->sk_state_change() anyway.

  * For the recvmsg() hook, skb is always NULL, and wakeup is set
    to true because tcp_data_ready() is not called in the path.

An alternative would be to support bpf_setsockopt() by adding
BPF_SOCK_OPS_RCVQ_CB to is_locked_tcp_sock_ops().

However, that approach involves excessive conditionals and an
unnecessary memcpy(), costs we do not want to pay for every skb
in the TCP fast path.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/core/filter.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index fa8a7c7d86eb..3608036632a8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -12388,6 +12388,22 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
 #endif
 }
 
+__bpf_kfunc int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops,
+					      int rcvlowat)
+{
+#ifdef CONFIG_INET
+	if (skops->op != BPF_SOCK_OPS_RCVQ_CB)
+		return -EOPNOTSUPP;
+
+	if (rcvlowat < 0)
+		rcvlowat = INT_MAX;
+
+	return __tcp_set_rcvlowat(skops->sk, rcvlowat, !skops->skb);
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
 __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
 					      u64 flags)
 {
@@ -12539,6 +12555,7 @@ BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)
 
 BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
 BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
+BTF_ID_FLAGS(func, bpf_sock_ops_tcp_set_rcvlowat)
 BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
 
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
  2026-05-23  8:29 ` [PATCH v3 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
@ 2026-05-23  9:06   ` bot+bpf-ci
  0 siblings, 0 replies; 27+ messages in thread
From: bot+bpf-ci @ 2026-05-23  9:06 UTC (permalink / raw)
  To: kuniyu, ast, daniel, andrii, martin.lau, eddyz87, memxor
  Cc: yonghong.song, john.fastabend, sdf, edumazet, ncardwell, willemb,
	ukyab, kuniyu, kuni1840, bpf, netdev, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2228 bytes --]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index fa8a7c7d86eb..3608036632a8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -12388,6 +12388,22 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
>  #endif
>  }
>
> +__bpf_kfunc int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops,
> +					      int rcvlowat)
> +{
> +#ifdef CONFIG_INET
> +	if (skops->op != BPF_SOCK_OPS_RCVQ_CB)
> +		return -EOPNOTSUPP;
> +
> +	if (rcvlowat < 0)
> +		rcvlowat = INT_MAX;
> +
> +	return __tcp_set_rcvlowat(skops->sk, rcvlowat, !skops->skb);

Should this kfunc prevent operation on MPTCP subflows? Bjorn Topel
raised this concern in v1, noting the issue that commit fcf4692fa39e
("mptcp: prevent BPF accessing lowat from a subflow socket.") was
designed to prevent.

You acknowledged this and stated you would drop MPTCP support for now,
mentioning that a subsequent patch (patch 7 in the v3 series: "bpf:
mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB.") would disable the callback
for MPTCP.

However, the current code lacks any MPTCP check. Without either disabling
BPF_SOCK_OPS_RCVQ_CB for MPTCP subflows in the callback registration or
adding a check like:

	if (sk_is_mptcp(skops->sk))
		return -EOPNOTSUPP;

couldn't a BPF program invoke this kfunc on MPTCP subflows and trigger
the same problem that fcf4692fa39e was designed to prevent?

Reference: https://lore.kernel.org/bpf/87ik8ujfa1.fsf@all.your.base.are.belong.to.us/

> +#else
> +	return -EOPNOTSUPP;
> +#endif
> +}
> +
>  __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>  					      u64 flags)
>  {
> @@ -12539,6 +12555,7 @@ BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)
>
>  BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
>  BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
> +BTF_ID_FLAGS(func, bpf_sock_ops_tcp_set_rcvlowat)
>  BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
>
>  static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2026-05-23  8:29 ` [PATCH v3 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-23  9:20   ` bot+bpf-ci
  2026-05-23  8:29 ` [PATCH v3 bpf-next 07/11] bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

Both BPF_SOCK_OPS_RCVQ_CB and SOCKMAP can intercept and handle
socket receive queues, leading to overlapping use cases.

While BPF_SOCK_OPS_RCVQ_CB focuses on optimizing single-socket
performance by reducing EPOLLIN wakeups and fully preserves TCP
zerocopy support, SOCKMAP is designed to facilitate multi-socket
routing at the cost of higher overhead and no zerocopy support.

Enabling both features on the same socket makes no sense and
results in unexpected interference between them.

For instance, SOCKMAP calls __tcp_cleanup_rbuf(), where we will
add a BPF_SOCK_OPS_RCVQ_CB hook, and bpf_sock_ops_tcp_set_rcvlowat()
calls sk->sk_data_ready(), which would trigger SOCKMAP.

Let's make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.

Note that it requires write_lock_bh(&sk->sk_callback_lock) to
synchronise with tcp_bpf_update_proto() and check if sk->sk_prot
is one of tcp_bpf_prots[][] because sock_map_update_elem() only
holds bh_lock_sock() without checking sock_owned_by_user().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v3: Check sk->sk_prot and update tp->bpf_sock_ops_cb_flags
    under sk->sk_callback_lock, and only when not flagged yet.
---
 include/net/tcp.h  |  1 +
 net/core/filter.c  | 35 +++++++++++++++++++++++++++++++----
 net/ipv4/tcp_bpf.c | 12 ++++++++++++
 3 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c6a6853909c4..bc95d8e7b62e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2853,6 +2853,7 @@ struct sk_msg;
 struct sk_psock;
 
 #ifdef CONFIG_BPF_SYSCALL
+bool tcp_in_sockmap(const struct sock *sk);
 int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
 void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
 #ifdef CONFIG_BPF_STREAM_PARSER
diff --git a/net/core/filter.c b/net/core/filter.c
index 3608036632a8..1fb63b264b18 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5382,12 +5382,34 @@ static int bpf_sol_tcp_getsockopt(struct sock *sk, int optname,
 	return 0;
 }
 
+static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
+{
+	if (!(val & BPF_SOCK_OPS_RCVQ_CB_FLAG) ||
+	    tcp_sk(sk)->bpf_sock_ops_cb_flags & BPF_SOCK_OPS_RCVQ_CB_FLAG) {
+		tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+		return 0;
+	}
+
+	write_lock_bh(&sk->sk_callback_lock);
+
+	if (unlikely(tcp_in_sockmap(sk))) {
+		write_unlock_bh(&sk->sk_callback_lock);
+		return -EBUSY;
+	}
+
+	tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	return 0;
+}
+
 static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
 				  char *optval, int optlen)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned long timeout;
-	int val;
+	int val, err;
 
 	if (optlen != sizeof(int))
 		return -EINVAL;
@@ -5424,7 +5446,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
 	case TCP_BPF_SOCK_OPS_CB_FLAGS:
 		if (val & ~(BPF_SOCK_OPS_ALL_CB_FLAGS))
 			return -EINVAL;
-		tp->bpf_sock_ops_cb_flags = val;
+		err = __bpf_sock_ops_cb_flags_set(sk, val);
+		if (err)
+			return err;
 		break;
 	default:
 		return -EINVAL;
@@ -5999,8 +6023,9 @@ static const struct bpf_func_proto bpf_sock_ops_getsockopt_proto = {
 BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, argval)
 {
-	struct sock *sk = bpf_sock->sk;
 	int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+	struct sock *sk = bpf_sock->sk;
+	int err;
 
 	if (!is_locked_tcp_sock_ops(bpf_sock) &&
 	    bpf_sock->op != BPF_SOCK_OPS_RCVQ_CB)
@@ -6009,7 +6034,9 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
 	if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
 		return -EINVAL;
 
-	tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+	err = __bpf_sock_ops_cb_flags_set(sk, val);
+	if (err)
+		return err;
 
 	return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
 }
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index cc0bd73f36b6..7e7966b095f9 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -705,6 +705,16 @@ int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc,
 }
 #endif /* CONFIG_BPF_STREAM_PARSER */
 
+bool tcp_in_sockmap(const struct sock *sk)
+{
+	const struct proto *prot = sk->sk_prot;
+
+	lockdep_assert_held(&sk->sk_callback_lock);
+
+	return &tcp_bpf_prots[0][0] <= prot &&
+		prot <= &tcp_bpf_prots[TCP_BPF_NUM_PROTS - 1][TCP_BPF_NUM_CFGS - 1];
+}
+
 int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 {
 	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
@@ -729,6 +739,8 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 			sock_replace_proto(sk, psock->sk_proto);
 		}
 		return 0;
+	} else if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG)) {
+		return -EBUSY;
 	}
 
 	if (sk->sk_family == AF_INET6) {
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
  2026-05-23  8:29 ` [PATCH v3 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive Kuniyuki Iwashima
@ 2026-05-23  9:20   ` bot+bpf-ci
  2026-05-24  3:37     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: bot+bpf-ci @ 2026-05-23  9:20 UTC (permalink / raw)
  To: kuniyu, ast, daniel, andrii, martin.lau, eddyz87, memxor
  Cc: yonghong.song, john.fastabend, sdf, edumazet, ncardwell, willemb,
	ukyab, kuniyu, kuni1840, bpf, netdev, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 5384 bytes --]

> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c6a6853909c4..bc95d8e7b62e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2853,6 +2853,7 @@ struct sk_msg;
>  struct sk_psock;
>
>  #ifdef CONFIG_BPF_SYSCALL
> +bool tcp_in_sockmap(const struct sock *sk);
>  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
>  void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
>  #ifdef CONFIG_BPF_STREAM_PARSER
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3608036632a8..1fb63b264b18 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c

[ ... ]

> @@ -5382,12 +5382,34 @@ static int bpf_sol_tcp_getsockopt(struct sock *sk, int optname,
>  	return 0;
>  }
>
> +static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
> +{
> +	if (!(val & BPF_SOCK_OPS_RCVQ_CB_FLAG) ||
> +	    tcp_sk(sk)->bpf_sock_ops_cb_flags & BPF_SOCK_OPS_RCVQ_CB_FLAG) {
> +		tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> +		return 0;
> +	}
> +
> +	write_lock_bh(&sk->sk_callback_lock);
> +
> +	if (unlikely(tcp_in_sockmap(sk))) {
> +		write_unlock_bh(&sk->sk_callback_lock);
> +		return -EBUSY;
> +	}
> +
> +	tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> +
> +	write_unlock_bh(&sk->sk_callback_lock);
> +
> +	return 0;
> +}
> +

[ ... ]

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index cc0bd73f36b6..7e7966b095f9 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -705,6 +705,16 @@ int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc,
>  }
>  #endif /* CONFIG_BPF_STREAM_PARSER */
>
> +bool tcp_in_sockmap(const struct sock *sk)
> +{
> +	const struct proto *prot = sk->sk_prot;
> +
> +	lockdep_assert_held(&sk->sk_callback_lock);
> +
> +	return &tcp_bpf_prots[0][0] <= prot &&
> +		prot <= &tcp_bpf_prots[TCP_BPF_NUM_PROTS - 1][TCP_BPF_NUM_CFGS - 1];
> +}
> +
>  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  {
>  	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
> @@ -729,6 +739,8 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  			sock_replace_proto(sk, psock->sk_proto);
>  		}
>  		return 0;
> +	} else if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG)) {
> +		return -EBUSY;
                   ^^^^

A subsystem pattern flags this as potentially concerning: the commit
message explains that write_lock_bh(&sk->sk_callback_lock) is taken to
synchronise with tcp_bpf_update_proto(), but does tcp_bpf_update_proto()
actually take sk->sk_callback_lock?

Looking at the call chain, tcp_bpf_update_proto() is invoked from the
SOCKMAP path through:

sock_map_link() in net/core/sock_map.c
  -> sock_map_init_proto()
     -> tcp_bpf_update_proto()

In sock_map_link(), sk_callback_lock is acquired after sock_map_init_proto()
completes, not before. The lock is taken when setting sk_user_data to the
psock, but sock_map_init_proto() runs outside the lock and calls
sock_replace_proto() to write sk->sk_prot without holding sk_callback_lock.

Is there a race here? Consider this interleaving (the commit message
acknowledges sock_map_update_elem only holds bh_lock_sock(), so it can run
concurrently with setsockopt that holds the socket lock):

  CPU 0 (sockmap path)                   CPU 1 (BPF setsockopt RCVQ_CB)
  sock_map_link():                       __bpf_sock_ops_cb_flags_set():
    sk_psock_init():                       (val has RCVQ_CB, current=0)
      lock sk_callback_lock
      set sk_user_data = psock
      unlock sk_callback_lock
    sock_map_init_proto():
      tcp_bpf_update_proto():
        BPF_SOCK_OPS_TEST_FLAG: 0
        (no flag yet, check passes)
                                           write_lock_bh(sk_callback_lock)
                                           tcp_in_sockmap(sk): sk_prot
                                           still old, returns false
                                           tp->bpf_sock_ops_cb_flags = val
                                           (RCVQ_CB set)
                                           write_unlock_bh
        sock_replace_proto(sk,
          &tcp_bpf_prots[][])
        (sk_prot now in tcp_bpf_prots)

Final state: sk_prot is in tcp_bpf_prots[][] AND bpf_sock_ops_cb_flags has
BPF_SOCK_OPS_RCVQ_CB_FLAG. Both features are now enabled, which violates the
mutual exclusion goal.

The validation checks in tcp_bpf_update_proto() and __bpf_sock_ops_cb_flags_set()
each check shared state before the exclusion point, not under common exclusion.
Running a lockset analysis on the two shared variables:

  bpf_sock_ops_cb_flags: written by __bpf_sock_ops_cb_flags_set under
  sk_callback_lock; read by tcp_bpf_update_proto with no lock.
  Lockset intersection = {sk_callback_lock} intersect {} = {}.

  sk->sk_prot: read by tcp_in_sockmap under sk_callback_lock; written by
  sock_replace_proto (called from tcp_bpf_update_proto) with no lock.
  Lockset intersection = {} intersect {sk_callback_lock} = {}.

Could the synchronization be strengthened so both paths access both variables
under the same lock?

>  	}
>
>  	if (sk->sk_family == AF_INET6) {


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
  2026-05-23  9:20   ` bot+bpf-ci
@ 2026-05-24  3:37     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-24  3:37 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: ast, daniel, andrii, martin.lau, eddyz87, memxor, yonghong.song,
	john.fastabend, sdf, edumazet, ncardwell, willemb, ukyab,
	kuni1840, bpf, netdev, martin.lau, clm, ihor.solodrai

On Sat, May 23, 2026 at 2:20 AM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index c6a6853909c4..bc95d8e7b62e 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -2853,6 +2853,7 @@ struct sk_msg;
> >  struct sk_psock;
> >
> >  #ifdef CONFIG_BPF_SYSCALL
> > +bool tcp_in_sockmap(const struct sock *sk);
> >  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
> >  void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
> >  #ifdef CONFIG_BPF_STREAM_PARSER
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 3608036632a8..1fb63b264b18 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
>
> [ ... ]
>
> > @@ -5382,12 +5382,34 @@ static int bpf_sol_tcp_getsockopt(struct sock *sk, int optname,
> >       return 0;
> >  }
> >
> > +static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
> > +{
> > +     if (!(val & BPF_SOCK_OPS_RCVQ_CB_FLAG) ||
> > +         tcp_sk(sk)->bpf_sock_ops_cb_flags & BPF_SOCK_OPS_RCVQ_CB_FLAG) {
> > +             tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> > +             return 0;
> > +     }
> > +
> > +     write_lock_bh(&sk->sk_callback_lock);
> > +
> > +     if (unlikely(tcp_in_sockmap(sk))) {
> > +             write_unlock_bh(&sk->sk_callback_lock);
> > +             return -EBUSY;
> > +     }
> > +
> > +     tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> > +
> > +     write_unlock_bh(&sk->sk_callback_lock);
> > +
> > +     return 0;
> > +}
> > +
>
> [ ... ]
>
> > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> > index cc0bd73f36b6..7e7966b095f9 100644
> > --- a/net/ipv4/tcp_bpf.c
> > +++ b/net/ipv4/tcp_bpf.c
> > @@ -705,6 +705,16 @@ int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc,
> >  }
> >  #endif /* CONFIG_BPF_STREAM_PARSER */
> >
> > +bool tcp_in_sockmap(const struct sock *sk)
> > +{
> > +     const struct proto *prot = sk->sk_prot;
> > +
> > +     lockdep_assert_held(&sk->sk_callback_lock);
> > +
> > +     return &tcp_bpf_prots[0][0] <= prot &&
> > +             prot <= &tcp_bpf_prots[TCP_BPF_NUM_PROTS - 1][TCP_BPF_NUM_CFGS - 1];
> > +}
> > +
> >  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
> >  {
> >       int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
> > @@ -729,6 +739,8 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
> >                       sock_replace_proto(sk, psock->sk_proto);
> >               }
> >               return 0;
> > +     } else if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG)) {
> > +             return -EBUSY;
>                    ^^^^
>
> A subsystem pattern flags this as potentially concerning: the commit
> message explains that write_lock_bh(&sk->sk_callback_lock) is taken to
> synchronise with tcp_bpf_update_proto(), but does tcp_bpf_update_proto()
> actually take sk->sk_callback_lock?
>
> Looking at the call chain, tcp_bpf_update_proto() is invoked from the
> SOCKMAP path through:
>
> sock_map_link() in net/core/sock_map.c
>   -> sock_map_init_proto()
>      -> tcp_bpf_update_proto()
>
> In sock_map_link(), sk_callback_lock is acquired after sock_map_init_proto()
> completes, not before.

Oh I assumed it's under lock since sock_map_del_link() and
sk_psock_drop() call it under the lock.

I'll move sock_map_init_proto() under the lock as well.


> The lock is taken when setting sk_user_data to the
> psock, but sock_map_init_proto() runs outside the lock and calls
> sock_replace_proto() to write sk->sk_prot without holding sk_callback_lock.
>
> Is there a race here? Consider this interleaving (the commit message
> acknowledges sock_map_update_elem only holds bh_lock_sock(), so it can run
> concurrently with setsockopt that holds the socket lock):
>
>   CPU 0 (sockmap path)                   CPU 1 (BPF setsockopt RCVQ_CB)
>   sock_map_link():                       __bpf_sock_ops_cb_flags_set():
>     sk_psock_init():                       (val has RCVQ_CB, current=0)
>       lock sk_callback_lock
>       set sk_user_data = psock
>       unlock sk_callback_lock
>     sock_map_init_proto():
>       tcp_bpf_update_proto():
>         BPF_SOCK_OPS_TEST_FLAG: 0
>         (no flag yet, check passes)
>                                            write_lock_bh(sk_callback_lock)
>                                            tcp_in_sockmap(sk): sk_prot
>                                            still old, returns false
>                                            tp->bpf_sock_ops_cb_flags = val
>                                            (RCVQ_CB set)
>                                            write_unlock_bh
>         sock_replace_proto(sk,
>           &tcp_bpf_prots[][])
>         (sk_prot now in tcp_bpf_prots)
>
> Final state: sk_prot is in tcp_bpf_prots[][] AND bpf_sock_ops_cb_flags has
> BPF_SOCK_OPS_RCVQ_CB_FLAG. Both features are now enabled, which violates the
> mutual exclusion goal.
>
> The validation checks in tcp_bpf_update_proto() and __bpf_sock_ops_cb_flags_set()
> each check shared state before the exclusion point, not under common exclusion.
> Running a lockset analysis on the two shared variables:
>
>   bpf_sock_ops_cb_flags: written by __bpf_sock_ops_cb_flags_set under
>   sk_callback_lock; read by tcp_bpf_update_proto with no lock.
>   Lockset intersection = {sk_callback_lock} intersect {} = {}.
>
>   sk->sk_prot: read by tcp_in_sockmap under sk_callback_lock; written by
>   sock_replace_proto (called from tcp_bpf_update_proto) with no lock.
>   Lockset intersection = {} intersect {sk_callback_lock} = {}.
>
> Could the synchronization be strengthened so both paths access both variables
> under the same lock?
>
> >       }
> >
> >       if (sk->sk_family == AF_INET6) {
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 07/11] bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB.
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2026-05-23  8:29 ` [PATCH v3 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-23  8:29 ` [PATCH v3 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

MPTCP has its own sock->ops->set_rcvlowat() / mptcp_set_rcvlowat().

We should not allow calling __tcp_set_rcvlowat() for MPTCP subflows.

Let's disable BPF_SOCK_OPS_RCVQ_CB for MPTCP for now.

If needed in the future, bpf_sock_ops_tcp_set_rcvlowat() could be
extended to properly support MPTCP.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/core/filter.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 1fb63b264b18..82ec2291d6f0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5390,6 +5390,9 @@ static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
 		return 0;
 	}
 
+	if (unlikely(sk_is_mptcp(sk)))
+		return -EOPNOTSUPP;
+
 	write_lock_bh(&sk->sk_callback_lock);
 
 	if (unlikely(tcp_in_sockmap(sk))) {
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty.
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2026-05-23  8:29 ` [PATCH v3 bpf-next 07/11] bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-23  9:20   ` bot+bpf-ci
  2026-05-23  8:29 ` [PATCH v3 bpf-next 09/11] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

Unlike SOCKMAP, BPF_SOCK_OPS_RCVQ_CB does not iterate existing
skbs in the receive queue when it is enabled for the first time.

In practical production use cases, this behavior is usually not
a problem.

We can safely assume that the upper-layer protocol is designed
with specific synchronisation points where the connection is
temporarily quiet.

At these points, the application can completely drain the receive
queue and safely enable BPF_SOCK_OPS_RCVQ_CB while no skbs are
pending.

A prime example is an application transitioning from HTTP to an
RPC protocol:

     Client                                Server
       |                                     |
       | --- HTTP Upgrade request ---------> |
       |                                     | [Drain all skbs]
       |                                     | [Enable BPF_SOCK_OPS_RCVQ_CB]
       | <-- HTTP 200/Switching protocol --- |
       |                                     |
       | --- RPC Frame 1 ------------------> |

However, to strictly prevent any potential race conditions arising
from unconventional upper-layer protocol designs, let's explicitly
signal a failure if BPF_SOCK_OPS_RCVQ_CB is enabled while the receive
queue is not empty.

-EUCLEAN is chosen to indicate that the caller needs to clean up
the receive queue before enabling the feature.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/core/filter.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 82ec2291d6f0..4041b9fc1c74 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5390,6 +5390,9 @@ static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
 		return 0;
 	}
 
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		return -EUCLEAN;
+
 	if (unlikely(sk_is_mptcp(sk)))
 		return -EOPNOTSUPP;
 
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty.
  2026-05-23  8:29 ` [PATCH v3 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty Kuniyuki Iwashima
@ 2026-05-23  9:20   ` bot+bpf-ci
  0 siblings, 0 replies; 27+ messages in thread
From: bot+bpf-ci @ 2026-05-23  9:20 UTC (permalink / raw)
  To: kuniyu, ast, daniel, andrii, martin.lau, eddyz87, memxor
  Cc: yonghong.song, john.fastabend, sdf, edumazet, ncardwell, willemb,
	ukyab, kuniyu, kuni1840, bpf, netdev, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 1733 bytes --]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 82ec2291d6f0..4041b9fc1c74 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5390,6 +5390,9 @@ static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
>  		return 0;
>  	}
>
> +	if (!skb_queue_empty(&sk->sk_receive_queue))
> +		return -EUCLEAN;
> +

Is there a TOCTOU issue here? The check occurs before acquiring
sk->sk_callback_lock, which means:

Thread A (process context)
__bpf_sock_ops_cb_flags_set()
    if (!skb_queue_empty(&sk->sk_receive_queue))  // queue is empty
        // check passes

Thread B (softirq context)
tcp_rcv_established()
    // adds packet to sk->sk_receive_queue

Thread A (continuing)
__bpf_sock_ops_cb_flags_set()
    write_lock_bh(&sk->sk_callback_lock);
    // sets bpf_sock_ops_cb_flags despite queue being non-empty

Looking at the tcp_in_sockmap() check below, it contains
lockdep_assert_held(&sk->sk_callback_lock), showing that related
validation checks are expected to be performed under this lock:

>  	if (unlikely(sk_is_mptcp(sk)))
>  		return -EOPNOTSUPP;
>
>  	write_lock_bh(&sk->sk_callback_lock);
>
>  	if (unlikely(tcp_in_sockmap(sk))) {
>  		write_unlock_bh(&sk->sk_callback_lock);
>  		return -EBUSY;
>  	}

The commit message says the goal is to "strictly prevent any potential
race conditions" by ensuring the receive queue is empty when
BPF_SOCK_OPS_RCVQ_CB is enabled. Can this check be moved after the
write_lock_bh() call to eliminate the race window?


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 09/11] bpf: tcp: Factorise bpf_skops_established().
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2026-05-23  8:29 ` [PATCH v3 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-23  8:29 ` [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
  2026-05-23  8:29 ` [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
  10 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

We will call BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB.

It requires a similar setup to bpf_skops_established(), and the
only difference is the skb data length.

Let's factor out the common logic into bpf_skops_common_locked().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/ipv4/tcp_input.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d5c9e65d9760..c4ba4f1e9d9e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -179,8 +179,9 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
 	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
 }
 
-static void bpf_skops_established(struct sock *sk, int bpf_op,
-				  struct sk_buff *skb)
+static void bpf_skops_common_locked(struct sock *sk, int bpf_op,
+				    struct sk_buff *skb,
+				    unsigned int end_offset)
 {
 	struct bpf_sock_ops_kern sock_ops;
 
@@ -191,12 +192,18 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
 	sock_ops.is_fullsock = 1;
 	sock_ops.is_locked_tcp_sock = 1;
 	sock_ops.sk = sk;
-	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
 	if (skb)
-		bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
+		bpf_skops_init_skb(&sock_ops, skb, end_offset);
 
 	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
 }
+
+static void bpf_skops_established(struct sock *sk, int bpf_op,
+				  struct sk_buff *skb)
+{
+	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
+	bpf_skops_common_locked(sk, bpf_op, skb, skb ? tcp_hdrlen(skb) : 0);
+}
 #else
 static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
 {
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook.
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2026-05-23  8:29 ` [PATCH v3 bpf-next 09/11] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-26 20:47   ` Martin KaFai Lau
  2026-05-23  8:29 ` [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
  10 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVQ_CB.

Let's invoke the BPF SOCK_OPS prog when

  1. TCP stack enqueues skb to sk->sk_receive_queue
     -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()

  2. TCP recvmsg() completes
     -> __tcp_cleanup_rbuf()

This will allow the BPF prog to parse each skb and dynamically
adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
until sufficient data (e.g., a full RPC frame) is available
in the receive queue.

Note that the direct access to bpf_sock_ops.data is intentionally
disabled by passing 0 as end_offset.

Instead, the BPF prog is supposed to use bpf_skb_load_bytes()
with bpf_sock_ops because payload is not in the linear area
with TCP header/data split on and skb may contain a RPC
descriptor in skb frag.  This also simplifies the BPF prog.

The placement of tcp_bpf_rcvlowat() in tcp_ofo_queue() and
tcp_fastopen_add_skb() is chosen to provide the same snapshot
with tcp_queue_rcv().

For example, if tcp_bpf_rcvlowat() were called before updating
TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog would
need to implement an unlikely if branch to strip SYN.

In addition, TCP stack can queue overlapping skb into recvq.
Once rcv_nxt is updated with a new skb, BPF prog cannot infer
the previous one from skb->len.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2: Add explanation of tcp_bpf_rcvlowat() placement.
---
 include/net/tcp.h       | 12 ++++++++++++
 net/ipv4/tcp.c          |  2 ++
 net/ipv4/tcp_fastopen.c |  2 ++
 net/ipv4/tcp_input.c    | 10 ++++++++++
 4 files changed, 26 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index bc95d8e7b62e..a409f2ea710f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2889,12 +2889,24 @@ static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
 	skops->skb = skb;
 	skops->skb_data_end = skb->data + end_offset;
 }
+
+void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb);
+
+static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
+{
+	if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG))
+		bpf_skops_rcvlowat(sk, skb);
+}
 #else
 static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
 				      struct sk_buff *skb,
 				      unsigned int end_offset)
 {
 }
+
+static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
+{
+}
 #endif
 
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3afeb69a547a..f7e32891bb4e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_mstamp_refresh(tp);
 		tcp_send_ack(sk);
 	}
+
+	tcp_bpf_rcvlowat(sk, NULL);
 }
 
 void tcp_cleanup_rbuf(struct sock *sk, int copied)
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 471c78be5513..91bf421fc5b6 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -281,6 +281,8 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 	TCP_SKB_CB(skb)->seq++;
 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_SYN;
 
+	tcp_bpf_rcvlowat(sk, skb);
+
 	tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 	tcp_add_receive_queue(sk, skb);
 	tp->syn_data_acked = 1;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c4ba4f1e9d9e..477bcf2ba89d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -204,6 +204,12 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
 	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
 	bpf_skops_common_locked(sk, bpf_op, skb, skb ? tcp_hdrlen(skb) : 0);
 }
+
+void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb)
+{
+	/* skb is NULL when called from __tcp_cleanup_rbuf(). */
+	bpf_skops_common_locked(sk, BPF_SOCK_OPS_RCVQ_CB, skb, 0);
+}
 #else
 static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
 {
@@ -5306,6 +5312,8 @@ static void tcp_ofo_queue(struct sock *sk)
 			continue;
 		}
 
+		tcp_bpf_rcvlowat(sk, skb);
+
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
 		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
@@ -5509,6 +5517,8 @@ static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
 	int eaten;
 	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
 
+	tcp_bpf_rcvlowat(sk, skb);
+
 	eaten = (tail &&
 		 tcp_try_coalesce(sk, tail,
 				  skb, fragstolen)) ? 1 : 0;
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook.
  2026-05-23  8:29 ` [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
@ 2026-05-26 20:47   ` Martin KaFai Lau
  2026-05-26 21:07     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2026-05-26 20:47 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song,
	John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell,
	Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev

On Sat, May 23, 2026 at 08:29:39AM +0000, Kuniyuki Iwashima wrote:
> Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVQ_CB.
> 
> Let's invoke the BPF SOCK_OPS prog when
> 
>   1. TCP stack enqueues skb to sk->sk_receive_queue
>      -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()
> 
>   2. TCP recvmsg() completes
>      -> __tcp_cleanup_rbuf()
> 
> This will allow the BPF prog to parse each skb and dynamically
> adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
> until sufficient data (e.g., a full RPC frame) is available
> in the receive queue.
> 
> Note that the direct access to bpf_sock_ops.data is intentionally
> disabled by passing 0 as end_offset.
> 
> Instead, the BPF prog is supposed to use bpf_skb_load_bytes()
> with bpf_sock_ops because payload is not in the linear area
> with TCP header/data split on and skb may contain a RPC
> descriptor in skb frag.  This also simplifies the BPF prog.
> 
> The placement of tcp_bpf_rcvlowat() in tcp_ofo_queue() and
> tcp_fastopen_add_skb() is chosen to provide the same snapshot
> with tcp_queue_rcv().
> 
> For example, if tcp_bpf_rcvlowat() were called before updating
> TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog would
> need to implement an unlikely if branch to strip SYN.
> 
> In addition, TCP stack can queue overlapping skb into recvq.
> Once rcv_nxt is updated with a new skb, BPF prog cannot infer
> the previous one from skb->len.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
> v2: Add explanation of tcp_bpf_rcvlowat() placement.
> ---
>  include/net/tcp.h       | 12 ++++++++++++
>  net/ipv4/tcp.c          |  2 ++
>  net/ipv4/tcp_fastopen.c |  2 ++
>  net/ipv4/tcp_input.c    | 10 ++++++++++
>  4 files changed, 26 insertions(+)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index bc95d8e7b62e..a409f2ea710f 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2889,12 +2889,24 @@ static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
>  	skops->skb = skb;
>  	skops->skb_data_end = skb->data + end_offset;
>  }
> +
> +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb);
> +
> +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> +{
> +	if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG))
> +		bpf_skops_rcvlowat(sk, skb);
> +}
>  #else
>  static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
>  				      struct sk_buff *skb,
>  				      unsigned int end_offset)
>  {
>  }
> +
> +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> +{
> +}
>  #endif
>  
>  /* Call BPF_SOCK_OPS program that returns an int. If the return value
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3afeb69a547a..f7e32891bb4e 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
>  		tcp_mstamp_refresh(tp);
>  		tcp_send_ack(sk);
>  	}
> +
> +	tcp_bpf_rcvlowat(sk, NULL);

hmm... so NULL is a way for the bpf prog to tell where it is called?

With skb NULL, what does the bpf prog usually do?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook.
  2026-05-26 20:47   ` Martin KaFai Lau
@ 2026-05-26 21:07     ` Kuniyuki Iwashima
  2026-05-26 21:37       ` Amery Hung
  0 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-26 21:07 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song,
	John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell,
	Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev

On Tue, May 26, 2026 at 1:47 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On Sat, May 23, 2026 at 08:29:39AM +0000, Kuniyuki Iwashima wrote:
> > Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVQ_CB.
> >
> > Let's invoke the BPF SOCK_OPS prog when
> >
> >   1. TCP stack enqueues skb to sk->sk_receive_queue
> >      -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()
> >
> >   2. TCP recvmsg() completes
> >      -> __tcp_cleanup_rbuf()
> >
> > This will allow the BPF prog to parse each skb and dynamically
> > adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
> > until sufficient data (e.g., a full RPC frame) is available
> > in the receive queue.
> >
> > Note that the direct access to bpf_sock_ops.data is intentionally
> > disabled by passing 0 as end_offset.
> >
> > Instead, the BPF prog is supposed to use bpf_skb_load_bytes()
> > with bpf_sock_ops because payload is not in the linear area
> > with TCP header/data split on and skb may contain a RPC
> > descriptor in skb frag.  This also simplifies the BPF prog.
> >
> > The placement of tcp_bpf_rcvlowat() in tcp_ofo_queue() and
> > tcp_fastopen_add_skb() is chosen to provide the same snapshot
> > with tcp_queue_rcv().
> >
> > For example, if tcp_bpf_rcvlowat() were called before updating
> > TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog would
> > need to implement an unlikely if branch to strip SYN.
> >
> > In addition, TCP stack can queue overlapping skb into recvq.
> > Once rcv_nxt is updated with a new skb, BPF prog cannot infer
> > the previous one from skb->len.
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > ---
> > v2: Add explanation of tcp_bpf_rcvlowat() placement.
> > ---
> >  include/net/tcp.h       | 12 ++++++++++++
> >  net/ipv4/tcp.c          |  2 ++
> >  net/ipv4/tcp_fastopen.c |  2 ++
> >  net/ipv4/tcp_input.c    | 10 ++++++++++
> >  4 files changed, 26 insertions(+)
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index bc95d8e7b62e..a409f2ea710f 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -2889,12 +2889,24 @@ static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
> >       skops->skb = skb;
> >       skops->skb_data_end = skb->data + end_offset;
> >  }
> > +
> > +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb);
> > +
> > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> > +{
> > +     if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG))
> > +             bpf_skops_rcvlowat(sk, skb);
> > +}
> >  #else
> >  static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
> >                                     struct sk_buff *skb,
> >                                     unsigned int end_offset)
> >  {
> >  }
> > +
> > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> > +{
> > +}
> >  #endif
> >
> >  /* Call BPF_SOCK_OPS program that returns an int. If the return value
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 3afeb69a547a..f7e32891bb4e 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
> >               tcp_mstamp_refresh(tp);
> >               tcp_send_ack(sk);
> >       }
> > +
> > +     tcp_bpf_rcvlowat(sk, NULL);
>
> hmm... so NULL is a way for the bpf prog to tell where it is called?

Yes.  The selftest fetches bpf_cast_to_kern_ctx(ops)->skb first to
check it since we need to access skb->cb anyway to read
TCP_SKB_CB()->{seq,end_seq} for non-NULL case.


>
> With skb NULL, what does the bpf prog usually do?

In that case, the prog reads tp->copied_seq and some sequence
numbers from sk local storage and calls kfunc to adjust sk->sk_rcvlowat.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook.
  2026-05-26 21:07     ` Kuniyuki Iwashima
@ 2026-05-26 21:37       ` Amery Hung
  2026-05-26 21:51         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: Amery Hung @ 2026-05-26 21:37 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	bpf, netdev

 a

On Tue, May 26, 2026 at 2:07 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Tue, May 26, 2026 at 1:47 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On Sat, May 23, 2026 at 08:29:39AM +0000, Kuniyuki Iwashima wrote:
> > > Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVQ_CB.
> > >
> > > Let's invoke the BPF SOCK_OPS prog when
> > >
> > >   1. TCP stack enqueues skb to sk->sk_receive_queue
> > >      -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()
> > >
> > >   2. TCP recvmsg() completes
> > >      -> __tcp_cleanup_rbuf()

Not sure if this is just me, but having to differentiating call sites
by checking if skb == NULL seems a bit less ideal (feels a bit
arbitrary and may be restrictive to extend in the future)

If new sockops is going to be built on top of bpf_tcp_ops, it would be
more obvious as different call sites will be different struct_ops op.
Although arguably tcp_queue_rcv(), tcp_ofo_queue(), and
tcp_fastopen_add_skb() may share the same struct_ops op.

> > >
> > > This will allow the BPF prog to parse each skb and dynamically
> > > adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
> > > until sufficient data (e.g., a full RPC frame) is available
> > > in the receive queue.
> > >
> > > Note that the direct access to bpf_sock_ops.data is intentionally
> > > disabled by passing 0 as end_offset.
> > >
> > > Instead, the BPF prog is supposed to use bpf_skb_load_bytes()
> > > with bpf_sock_ops because payload is not in the linear area
> > > with TCP header/data split on and skb may contain a RPC
> > > descriptor in skb frag.  This also simplifies the BPF prog.
> > >
> > > The placement of tcp_bpf_rcvlowat() in tcp_ofo_queue() and
> > > tcp_fastopen_add_skb() is chosen to provide the same snapshot
> > > with tcp_queue_rcv().
> > >
> > > For example, if tcp_bpf_rcvlowat() were called before updating
> > > TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog would
> > > need to implement an unlikely if branch to strip SYN.
> > >
> > > In addition, TCP stack can queue overlapping skb into recvq.
> > > Once rcv_nxt is updated with a new skb, BPF prog cannot infer
> > > the previous one from skb->len.
> > >
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > > ---
> > > v2: Add explanation of tcp_bpf_rcvlowat() placement.
> > > ---
> > >  include/net/tcp.h       | 12 ++++++++++++
> > >  net/ipv4/tcp.c          |  2 ++
> > >  net/ipv4/tcp_fastopen.c |  2 ++
> > >  net/ipv4/tcp_input.c    | 10 ++++++++++
> > >  4 files changed, 26 insertions(+)
> > >
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index bc95d8e7b62e..a409f2ea710f 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -2889,12 +2889,24 @@ static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
> > >       skops->skb = skb;
> > >       skops->skb_data_end = skb->data + end_offset;
> > >  }
> > > +
> > > +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb);
> > > +
> > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> > > +{
> > > +     if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG))
> > > +             bpf_skops_rcvlowat(sk, skb);
> > > +}
> > >  #else
> > >  static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
> > >                                     struct sk_buff *skb,
> > >                                     unsigned int end_offset)
> > >  {
> > >  }
> > > +
> > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> > > +{
> > > +}
> > >  #endif
> > >
> > >  /* Call BPF_SOCK_OPS program that returns an int. If the return value
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 3afeb69a547a..f7e32891bb4e 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
> > >               tcp_mstamp_refresh(tp);
> > >               tcp_send_ack(sk);
> > >       }
> > > +
> > > +     tcp_bpf_rcvlowat(sk, NULL);
> >
> > hmm... so NULL is a way for the bpf prog to tell where it is called?
>
> Yes.  The selftest fetches bpf_cast_to_kern_ctx(ops)->skb first to
> check it since we need to access skb->cb anyway to read
> TCP_SKB_CB()->{seq,end_seq} for non-NULL case.
>
>
> >
> > With skb NULL, what does the bpf prog usually do?
>
> In that case, the prog reads tp->copied_seq and some sequence
> numbers from sk local storage and calls kfunc to adjust sk->sk_rcvlowat.
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook.
  2026-05-26 21:37       ` Amery Hung
@ 2026-05-26 21:51         ` Kuniyuki Iwashima
  0 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-26 21:51 UTC (permalink / raw)
  To: Amery Hung
  Cc: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	bpf, netdev

On Tue, May 26, 2026 at 2:38 PM Amery Hung <ameryhung@gmail.com> wrote:
>
>  a
>
> On Tue, May 26, 2026 at 2:07 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >
> > On Tue, May 26, 2026 at 1:47 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On Sat, May 23, 2026 at 08:29:39AM +0000, Kuniyuki Iwashima wrote:
> > > > Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVQ_CB.
> > > >
> > > > Let's invoke the BPF SOCK_OPS prog when
> > > >
> > > >   1. TCP stack enqueues skb to sk->sk_receive_queue
> > > >      -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()
> > > >
> > > >   2. TCP recvmsg() completes
> > > >      -> __tcp_cleanup_rbuf()
>
> Not sure if this is just me, but having to differentiating call sites
> by checking if skb == NULL seems a bit less ideal (feels a bit
> arbitrary and may be restrictive to extend in the future)

I think Martin shares the feeling :)

I chose the way for two reasons: 1) bpf_skops_established()
already accepts null skb (for unlikely TCP_REPAIR_ON), and
2) adding another op just for the case seems redundant (it adds
more conditionals in the fast path, kfunc) while bpf prog is capable
of distinguishing a null skb.


>
> If new sockops is going to be built on top of bpf_tcp_ops, it would be
> more obvious as different call sites will be different struct_ops op.
> Although arguably tcp_queue_rcv(), tcp_ofo_queue(), and
> tcp_fastopen_add_skb() may share the same struct_ops op.

Yes, this sounds fascinating.

>
> > > >
> > > > This will allow the BPF prog to parse each skb and dynamically
> > > > adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
> > > > until sufficient data (e.g., a full RPC frame) is available
> > > > in the receive queue.
> > > >
> > > > Note that the direct access to bpf_sock_ops.data is intentionally
> > > > disabled by passing 0 as end_offset.
> > > >
> > > > Instead, the BPF prog is supposed to use bpf_skb_load_bytes()
> > > > with bpf_sock_ops because payload is not in the linear area
> > > > with TCP header/data split on and skb may contain a RPC
> > > > descriptor in skb frag.  This also simplifies the BPF prog.
> > > >
> > > > The placement of tcp_bpf_rcvlowat() in tcp_ofo_queue() and
> > > > tcp_fastopen_add_skb() is chosen to provide the same snapshot
> > > > with tcp_queue_rcv().
> > > >
> > > > For example, if tcp_bpf_rcvlowat() were called before updating
> > > > TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog would
> > > > need to implement an unlikely if branch to strip SYN.
> > > >
> > > > In addition, TCP stack can queue overlapping skb into recvq.
> > > > Once rcv_nxt is updated with a new skb, BPF prog cannot infer
> > > > the previous one from skb->len.
> > > >
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > > > ---
> > > > v2: Add explanation of tcp_bpf_rcvlowat() placement.
> > > > ---
> > > >  include/net/tcp.h       | 12 ++++++++++++
> > > >  net/ipv4/tcp.c          |  2 ++
> > > >  net/ipv4/tcp_fastopen.c |  2 ++
> > > >  net/ipv4/tcp_input.c    | 10 ++++++++++
> > > >  4 files changed, 26 insertions(+)
> > > >
> > > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > > index bc95d8e7b62e..a409f2ea710f 100644
> > > > --- a/include/net/tcp.h
> > > > +++ b/include/net/tcp.h
> > > > @@ -2889,12 +2889,24 @@ static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
> > > >       skops->skb = skb;
> > > >       skops->skb_data_end = skb->data + end_offset;
> > > >  }
> > > > +
> > > > +void bpf_skops_rcvlowat(struct sock *sk, struct sk_buff *skb);
> > > > +
> > > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> > > > +{
> > > > +     if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG))
> > > > +             bpf_skops_rcvlowat(sk, skb);
> > > > +}
> > > >  #else
> > > >  static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
> > > >                                     struct sk_buff *skb,
> > > >                                     unsigned int end_offset)
> > > >  {
> > > >  }
> > > > +
> > > > +static inline void tcp_bpf_rcvlowat(struct sock *sk, struct sk_buff *skb)
> > > > +{
> > > > +}
> > > >  #endif
> > > >
> > > >  /* Call BPF_SOCK_OPS program that returns an int. If the return value
> > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > index 3afeb69a547a..f7e32891bb4e 100644
> > > > --- a/net/ipv4/tcp.c
> > > > +++ b/net/ipv4/tcp.c
> > > > @@ -1602,6 +1602,8 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
> > > >               tcp_mstamp_refresh(tp);
> > > >               tcp_send_ack(sk);
> > > >       }
> > > > +
> > > > +     tcp_bpf_rcvlowat(sk, NULL);
> > >
> > > hmm... so NULL is a way for the bpf prog to tell where it is called?
> >
> > Yes.  The selftest fetches bpf_cast_to_kern_ctx(ops)->skb first to
> > check it since we need to access skb->cb anyway to read
> > TCP_SKB_CB()->{seq,end_seq} for non-NULL case.
> >
> >
> > >
> > > With skb NULL, what does the bpf prog usually do?
> >
> > In that case, the prog reads tp->copied_seq and some sequence
> > numbers from sk local storage and calls kfunc to adjust sk->sk_rcvlowat.
> >

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.
  2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2026-05-23  8:29 ` [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
@ 2026-05-23  8:29 ` Kuniyuki Iwashima
  2026-05-23  9:20   ` bot+bpf-ci
  2026-05-26 21:01   ` Martin KaFai Lau
  10 siblings, 2 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-23  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, John Fastabend, Stanislav Fomichev, Eric Dumazet,
	Neal Cardwell, Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima,
	Kuniyuki Iwashima, bpf, netdev

The test is roughly divided into two stages, and the sequence
is as follows:

  I) Setup

    1. Attach two BPF programs to a cgroup
    2. Establish a TCP connection (@client <-> @child) within the cgroup
    3. Enable BPF_SOCK_OPS_RCVQ_CB on @child

 II) RPC frame exchange in various patterns

    4. Send a partial RPC descriptor from @client to @child
    5. Verify that epoll does NOT wake up @child
    6. Send the remaining data of the RPC frame
    7. Verify that epoll finally wakes up @child

During setup, two BPF programs are attached to simulate
a real-world scenario; one is SOCK_OPS and the other is
CGROUP_SOCKOPT.

While the SOCK_OPS prog handles the dynamic adjustment of
sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable
BPF_SOCK_OPS_RCVQ_CB via userspace setsockopt() using
pseudo options:

  #define SOL_BPF               0xdeadbeef
  #define BPF_TCP_AUTOLOWAT     0x8badf00d

  setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int));

This reflects a common production use case where an application
decides to start parsing RPC frames only at a certain point in
the stream (e.g., after HTTP Upgrade), rather than immediately
after TCP 3WHS (BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, etc).

When BPF_TCP_AUTOLOWAT is enabled, the BPF prog initialises
sk_local_storage for two sequence numbers to manage its state.

Then, for the RPC frame exchange, this test uses a simple format
defined as follows

  0        8       16      24       32
  +--------+--------+-------+--------+ `.
  |            header size           |  |
  +--------+--------+-------+--------+   > RPC descriptor (8 bytes)
  |            payload size          |  |
  +--------+--------+-------+--------+ .'
  ~               header             ~
  +--------+--------+-------+--------+
  ~               payload            ~
  +--------+--------+-------+--------+

Every time a new skb is enqueued to sk->sk_receive_queue,
the SOCK_OPS prog parses it and updates these sequence numbers:

  rpc_desc_seq : the SEQ # of the start of the RPC descriptor
  rpc_end_seq  : the SEQ # of the end of the RPC frame
                 => rpc_desc_seq + 8 + header size + payload size

Assume we receive two RPC descriptors in the following pattern:

  1. When we receive skb-1, only a part of RPC descriptor is parsed.
     rpc_desc_seq is set to the first byte while rpc_end_seq is
     unknown.  Thus, sk->sk_rcvlowat is set to the size of the RPC
     descriptor (8 bytes).

   <- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
  +-----------+.................+....................+......
  |  RPC desc 1  |  header + payload  |  RPC desc 2  | ...
  +-----------+.................+....................+......
  ^              ^-.
  `- rpc_desc_seq   `- sk->sk_rcvlowat

  2. Next, we receive skb-2, which completes the first RPC descriptor.
     Now rpc_end_seq is known, so sk->sk_rcvlowat is advanced to it.

   <- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
  +-----------+-----------------+....................+......
  |  RPC desc 1  |  header + payload  |  RPC desc 2  | ...
  +-----------+-----------------+....................+......
  ^                                   ^
  '- rpc_desc_seq                     '- rpc_end_seq
                                           & sk->sk_rcvlowat

  3. Once we receive skb-3, which contains the next full RPC descriptor,
     rpc_desc_seq is advanced and rpc_end_seq is updated according
     to the size of RPC frame 2.

     Note that sk->sk_rcvlowat is NOT updated to the new rpc_end_seq
     yet.  This ensures that the application is woken up to read the
     already complete RPC frame 1.

   <- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
  +-----------+-----------------+--------------------+......
  |  RPC desc 1  |  header + payload  |  RPC desc 2  | ...   |
  +-----------+-----------------+--------------------+......
                                      ^                      ^
              rpc_desc_seq -----------'  rpc_end_seq ----...-'
                & sk->sk_rcvlowat

This sequence corresponds to the 4th test case in rpc_test_cases[],
and we can see helpful output if we "#define DEBUG":

  # cat /sys/kernel/tracing/trace_pipe | \
    awk '{ if ($0 ~ /AF_/) sub(/^.*AF_/, "AF_"); print $0 }' & \
    BGPID=$!; ./test_progs -t tcp_autolowat; kill -9 -$BGPID
  ...
  AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 0, end_seq: 1, len: 1, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 0
  AF_INET6 rpc_test_cases[3]: Copied 1 bytes: rpc_desc_buff_len: 1
  AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_desc_buff_len: 1
  AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 8, actual: 8

  AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 1, end_seq: 8, len: 7, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 1
  AF_INET6 rpc_test_cases[3]: Copied full descriptor: rpc_desc_seq: 0, rpc_end_seq: 258, header_len: 100, payload_len: 150
  AF_INET6 rpc_test_cases[3]: No more descriptor: rpc_end_seq: 258, end_seq: 8
  AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 258, rpc_desc_buff_len: 8
  AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 258, actual: 258
  ...

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2:
  * Make copy_len u64 and swap validation order for it
    for no_alu32.
---
 tools/testing/selftests/bpf/bpf_kfuncs.h      |   4 +
 .../selftests/bpf/prog_tests/tcp_autolowat.c  | 350 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   2 +
 .../selftests/bpf/progs/tcp_autolowat.c       | 326 ++++++++++++++++
 4 files changed, 682 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
 create mode 100644 tools/testing/selftests/bpf/progs/tcp_autolowat.c

diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
index ae71e9b69051..fc4d6f68f247 100644
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
@@ -64,6 +64,10 @@ struct bpf_tcp_req_attrs;
 extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
 				   struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym;
 
+struct bpf_sock_ops_kern;
+extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops_kern,
+					 int rcvlowat) __ksym;
+
 void *bpf_cast_to_kern_ctx(void *) __ksym;
 
 extern void *bpf_rdonly_cast(const void *obj, __u32 btf_id) __ksym __weak;
diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
new file mode 100644
index 000000000000..5e971c42a32a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
@@ -0,0 +1,350 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2026 Google LLC */
+#include <sys/epoll.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+
+#include "tcp_autolowat.skel.h"
+
+#define SOL_BPF			0xdeadbeef
+#define BPF_TCP_AUTOLOWAT	0x8badf00d
+
+struct rpc_descriptor {
+	u32 header_len;
+	u32 payload_len;
+};
+
+enum rpc_event_type {
+	RPC_EVENT_END,
+	RPC_EVENT_AUTOLOWAT,
+	RPC_EVENT_SEND,
+	RPC_EVENT_RECV,
+	RPC_EVENT_EPOLL,
+	RPC_EVENT_RCVLOWAT,
+};
+
+struct rpc_event {
+	enum rpc_event_type type;
+	union {
+		int len;
+		int nfds;
+		int val;
+		int rcvlowat;
+	};
+};
+
+#define RPC_DESC_SIZE (sizeof(struct rpc_descriptor))
+
+struct rpc_test_case {
+	char data[4096];
+	struct rpc_descriptor desc[32];
+	struct rpc_event event[32];
+} rpc_test_cases[] = {
+	{
+		.desc = {
+			{ .header_len = 100, .payload_len = 150 },
+		},
+		.event = {
+			{ .type = RPC_EVENT_AUTOLOWAT,	.val = 1},
+			/* Single full RPC message in skb. */
+			{ .type = RPC_EVENT_SEND,	.len = RPC_DESC_SIZE + 100 + 150},
+			{ .type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 100 + 150},
+			{ .type = RPC_EVENT_EPOLL,	.nfds = 1},
+		},
+	},
+	{
+		.desc = {
+			{.header_len = 100, .payload_len = 150},
+			{.header_len = 100, .payload_len = 150},
+			{.header_len = 100, .payload_len = 150},
+		},
+		.event = {
+			{ .type = RPC_EVENT_AUTOLOWAT,	.val = 1},
+			/* Two full RPC messages in skb. */
+			{.type = RPC_EVENT_SEND,	.len = (RPC_DESC_SIZE + 100 + 150) * 2},
+			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2},
+			{.type = RPC_EVENT_EPOLL,	.nfds = 1},
+			/* Single full RPC message in skb. */
+			{ .type = RPC_EVENT_SEND,	.len = RPC_DESC_SIZE + 100 + 150},
+			{ .type = RPC_EVENT_RCVLOWAT,	.rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 3},
+			{ .type = RPC_EVENT_EPOLL,	.nfds = 1},
+		},
+	},
+	{
+		.desc = {
+			{.header_len = 100, .payload_len = 150},
+			{.header_len = 100, .payload_len = 150},
+			{.header_len = 100, .payload_len = 150},
+		},
+		.event = {
+			{ .type = RPC_EVENT_AUTOLOWAT,	.val = 1},
+			/* Two full RPC messages in skb. */
+			{.type = RPC_EVENT_SEND,	.len = (RPC_DESC_SIZE + 100 + 150) * 2},
+			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2},
+			{.type = RPC_EVENT_EPOLL,	.nfds = 1},
+			/* Single full RPC message in skb. */
+			{ .type = RPC_EVENT_SEND,	.len = RPC_DESC_SIZE},
+			{ .type = RPC_EVENT_RCVLOWAT,	.rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2},
+			{ .type = RPC_EVENT_EPOLL,	.nfds = 1},
+		},
+	},
+	{
+		.desc = {
+			{.header_len = 100, .payload_len = 150},
+			{.header_len = 200, .payload_len = 500},
+		},
+		.event = {
+			{ .type = RPC_EVENT_AUTOLOWAT,	.val = 1},
+			/* The first descriptor is partial. */
+			{.type = RPC_EVENT_SEND,	.len = 1},
+			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
+			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE},
+			/* The first descriptor is available. */
+			{.type = RPC_EVENT_SEND,	.len = RPC_DESC_SIZE - 1},
+			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
+			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
+			/* The first header is ready. */
+			{.type = RPC_EVENT_SEND,	.len = 100},
+			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
+			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
+			/* skb has the first payload and 1 byte of the next descriptor. */
+			{.type = RPC_EVENT_SEND,	.len = 150 + 1},
+			{.type = RPC_EVENT_EPOLL,	.nfds = 1},
+			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
+			/* After reading the first RPC message, SO_RCVLOWAT should be RPC_DESC_SIZE. */
+			{.type = RPC_EVENT_RECV,	.len = RPC_DESC_SIZE + 150 + 100},
+			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
+			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE},
+			/* The second descriptor is available. */
+			{.type = RPC_EVENT_SEND,	.len = RPC_DESC_SIZE - 1},
+			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
+			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 200 + 500},
+		},
+	},
+};
+
+struct tcp_autolowat_test_cb {
+	int saved_netns;
+	union {
+		int fd[4];
+		struct {
+			int server, client, child;
+			int epoll;
+		};
+	};
+};
+
+static void tcp_autolowat_teardown_cb(struct tcp_autolowat_test_cb *cb)
+{
+	int i, err;
+
+	for (i = 0; i < ARRAY_SIZE(cb->fd); i++) {
+		if (cb->fd[i] != -1)
+			close(cb->fd[i]);
+	}
+
+	if (cb->saved_netns != -1) {
+		err = setns(cb->saved_netns, CLONE_NEWNET);
+		ASSERT_OK(err, "restore netns");
+
+		close(cb->saved_netns);
+	}
+}
+
+static int tcp_autolowat_setup_cb(struct tcp_autolowat_test_cb *cb, int family)
+{
+	struct epoll_event ev = {};
+	int err;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(cb->fd); i++)
+		cb->fd[i] = -1;
+
+	cb->saved_netns = open("/proc/self/ns/net", O_RDONLY);
+	if (!ASSERT_NEQ(cb->saved_netns, -1, "save netns"))
+		goto err;
+
+	err = unshare(CLONE_NEWNET);
+	if (!ASSERT_OK(err, "unshare"))
+		goto err;
+
+	err = system("ip link set dev lo up");
+	if (!ASSERT_OK(err, "set up lo"))
+		goto err;
+
+	cb->server = start_server(family, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_NEQ(cb->server, -1, "start_server"))
+		goto err;
+
+	cb->client = connect_to_fd(cb->server, 0);
+	if (!ASSERT_NEQ(cb->client, -1, "connect_to_fd"))
+		goto err;
+
+	cb->child = accept(cb->server, NULL, NULL);
+	if (!ASSERT_NEQ(cb->child, -1, "accept"))
+		goto err;
+
+	cb->epoll = epoll_create1(0);
+	if (!ASSERT_NEQ(cb->epoll, -1, "epoll_create"))
+		goto err;
+
+	ev.events = EPOLLIN;
+	ev.data.fd = cb->child;
+
+	err = epoll_ctl(cb->epoll, EPOLL_CTL_ADD, cb->child, &ev);
+	if (!ASSERT_OK(err, "epoll_ctl"))
+		goto err;
+
+	return 0;
+
+err:
+	tcp_autolowat_teardown_cb(cb);
+	return -1;
+}
+
+static int tcp_autolowat_build_data(struct rpc_test_case *test_case)
+{
+	struct rpc_descriptor *desc = test_case->desc;
+	char *ptr = test_case->data;
+	int rpc_size;
+
+	memset(ptr, 0, sizeof(test_case->data));
+
+	while (desc->header_len + desc->payload_len) {
+		rpc_size = sizeof(*desc) + desc->header_len + desc->payload_len;
+
+		if (!ASSERT_LE(ptr + rpc_size - test_case->data,
+			       sizeof(test_case->data), "data overflow"))
+			return 1;
+
+		memcpy(ptr, desc, sizeof(*desc));
+		ptr += rpc_size;
+		desc++;
+	}
+
+	if (!ASSERT_GT(ptr - test_case->data, 0, "no data"))
+		return 1;
+
+	return 0;
+}
+
+static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
+				       struct rpc_test_case *test_case)
+{
+	struct rpc_event *event = test_case->event;
+	char *ptr = test_case->data;
+	struct epoll_event ev;
+	socklen_t optlen;
+	int err, optval;
+	char buf[4096];
+
+	if (tcp_autolowat_build_data(test_case))
+		return;
+
+	while (1) {
+		switch (event->type) {
+		case RPC_EVENT_END:
+			return;
+		case RPC_EVENT_AUTOLOWAT:
+			err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
+					 &event->val, sizeof(event->val));
+			if (!ASSERT_OK(err, "setsockopt"))
+				return;
+			break;
+		case RPC_EVENT_SEND:
+			err = send(cb->client, ptr, event->len, 0);
+			if (!ASSERT_EQ(err, event->len, "send"))
+				return;
+
+			ptr += event->len;
+			break;
+		case RPC_EVENT_RECV:
+			err = recv(cb->child, buf, event->len, 0);
+			if (!ASSERT_EQ(err, event->len, "recv"))
+				return;
+			break;
+		case RPC_EVENT_EPOLL:
+			err = epoll_wait(cb->epoll, &ev, 1, 0);
+			if (!ASSERT_EQ(err, event->nfds, "epoll_wait"))
+				return;
+			break;
+		case RPC_EVENT_RCVLOWAT:
+			optval = 0;
+			optlen = sizeof(optval);
+
+			err = getsockopt(cb->child, SOL_SOCKET, SO_RCVLOWAT, &optval, &optlen);
+			if (!ASSERT_OK(err, "getsockopt") ||
+			    !ASSERT_EQ(optval, event->rcvlowat, "rcvlowat"))
+				return;
+			break;
+		}
+
+		event++;
+	}
+}
+
+static void tcp_autolowat_run_rpc_tests(struct tcp_autolowat *skel, int family)
+{
+	struct tcp_autolowat_test_cb cb;
+	int err;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(rpc_test_cases); i++) {
+		memset(skel->bss->test_name, 0, sizeof(skel->bss->test_name));
+
+		snprintf(skel->bss->test_name, sizeof(skel->bss->test_name),
+			 "AF_INET%c rpc_test_cases[%d]",
+			 family == AF_INET ? ' ' : '6', i);
+
+		if (!test__start_subtest(skel->bss->test_name))
+			continue;
+
+		err = tcp_autolowat_setup_cb(&cb, family);
+		if (err)
+			continue;
+
+		tcp_autolowat_run_rpc_test(&cb, &rpc_test_cases[i]);
+		tcp_autolowat_teardown_cb(&cb);
+	}
+}
+
+static void tcp_autolowat_run_tests(struct tcp_autolowat *skel)
+{
+	tcp_autolowat_run_rpc_tests(skel, AF_INET);
+	tcp_autolowat_run_rpc_tests(skel, AF_INET6);
+}
+
+void test_tcp_autolowat(void)
+{
+	struct tcp_autolowat *skel;
+	struct bpf_link *link[2];
+	int cgroup;
+
+	skel = tcp_autolowat__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	cgroup = test__join_cgroup("/tcp_autolowat");
+	if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
+		goto destroy_skel;
+
+	link[0] = bpf_program__attach_cgroup(skel->progs.tcp_autolowat, cgroup);
+	if (!ASSERT_OK_PTR(link[0], "attach_cgroup(SOCK_OPS)"))
+		goto close_cgroup;
+
+	link[1] = bpf_program__attach_cgroup(skel->progs.tcp_autolowat_setsockopt, cgroup);
+	if (!ASSERT_OK_PTR(link[1], "attach_cgroup(SETSOCKOPT)"))
+		goto destroy_sockops;
+
+	tcp_autolowat_run_tests(skel);
+
+	bpf_link__destroy(link[1]);
+destroy_sockops:
+	bpf_link__destroy(link[0]);
+close_cgroup:
+	close(cgroup);
+destroy_skel:
+	tcp_autolowat__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index d8dacef37c16..bdf28d320383 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -74,6 +74,8 @@
 
 #define NEXTHDR_TCP		6
 
+#define TCPHDR_FIN		0x01
+
 #define TCPOPT_NOP		1
 #define TCPOPT_EOL		0
 #define TCPOPT_MSS		2
diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
new file mode 100644
index 000000000000..41eacfdb34aa
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
@@ -0,0 +1,326 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2026 Google LLC */
+#include "vmlinux.h"
+
+#include <string.h>
+#include <limits.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+#include "bpf_tracing_net.h"
+
+#define SOL_BPF			0xdeadbeef
+#define BPF_TCP_AUTOLOWAT	0x8badf00d
+
+//#define DEBUG /* For verbose output. */
+
+struct rpc_descriptor {
+	u32 header_len;
+	u32 payload_len;
+};
+
+#define RPC_DESC_SIZE		(sizeof(struct rpc_descriptor))
+#define MAX_RPC_DESC_PER_SKB	100
+
+struct tcp_autolowat_cb {
+	/* Don't put this field at the end; BPF verifier complains. */
+	char rpc_desc_buf[RPC_DESC_SIZE];
+	u32 rpc_desc_seq;
+	u32 rpc_end_seq;
+#ifdef DEBUG
+	u32 isn;
+#endif
+	u8 rpc_desc_buff_len;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct tcp_autolowat_cb);
+} tcp_autolowat_map SEC(".maps");
+
+char test_name[64];
+
+#ifdef DEBUG
+#define LOG(str, ...)							\
+	bpf_printk("%s: " str, test_name, ##__VA_ARGS__)
+#else
+#define LOG(...)
+#endif
+
+#define SEQ(val)				\
+	(val - cb->isn)
+#define TP_SEQ(field)				\
+	(tp->field - cb->isn)
+#define CB_SEQ(field)				\
+	(cb->field - cb->isn)
+
+static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
+				struct tcp_autolowat_cb *cb,
+				u32 seq, u32 end_seq)
+{
+	struct rpc_descriptor *rpc_desc;
+	u32 rpc_copied_seq;
+	u64 copy_len; /* u32 should work, but not for no_alu32 :/ */
+	u64 rpc_len;
+	int err;
+
+	rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
+
+	if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
+		copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
+	else
+		copy_len = end_seq - rpc_copied_seq;
+
+	/* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
+	 * clang omits the "copy_len == 0" check below, which is necessary
+	 * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
+	 */
+	barrier_var(copy_len);
+
+	/* Do not swap the order of the two copy_len checks below.
+	 * The BPF verifier somehow does not properly track the minimum
+	 * value for 'copy_len == 0'.
+	 *
+	 * 91: (15) if r7 == 0x0 goto pc+40      ; R7=scalar(smin=smin32=-247,smax=smax32=8)
+	 * 92: (25) if r7 > 0x8 goto pc+39       ; R7=scalar(smin=smin32=0,smax=umax=smax32=umax32=8,var_off=(0x0; 0xf))
+	 *
+	 * This does not occur if copy_len is u32.
+	 */
+	if (copy_len > RPC_DESC_SIZE)
+		goto disable; /* always false, only for verifier. */
+	if (copy_len == 0)
+		goto disable; /* FIN. */
+
+	if (cb->rpc_desc_buf + cb->rpc_desc_buff_len >= &cb->rpc_desc_buf[RPC_DESC_SIZE])
+		goto disable; /* always false, only for verifier. */
+
+	err = bpf_skb_load_bytes(skops, rpc_copied_seq - seq,
+				 cb->rpc_desc_buf + cb->rpc_desc_buff_len, copy_len);
+	if (err)
+		goto disable;
+
+	cb->rpc_desc_buff_len += copy_len;
+
+	if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) {
+		LOG("Copied %d bytes: rpc_desc_buff_len: %u", copy_len, cb->rpc_desc_buff_len);
+		goto partial;
+	}
+
+	rpc_desc = (struct rpc_descriptor *)cb->rpc_desc_buf;
+	rpc_len = RPC_DESC_SIZE + rpc_desc->header_len + rpc_desc->payload_len;
+
+	if (rpc_len > INT_MAX)
+		goto disable;
+
+	cb->rpc_end_seq = cb->rpc_desc_seq + rpc_len;
+
+	LOG("Copied full descriptor: rpc_desc_seq: %u, rpc_end_seq: %u,"
+	    " header_len: %u, payload_len: %u",
+	    CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq),
+	    rpc_desc->header_len, rpc_desc->payload_len);
+
+	return 0;
+disable:
+	return -1;
+partial:
+	return 1;
+}
+
+static void tcp_set_autolowat(struct bpf_sock_ops_kern *skops_kern,
+			      struct tcp_autolowat_cb *cb,
+			      struct tcp_sock *tp)
+{
+	/* To handle wraparound. */
+	u32 val = 0;
+
+	LOG("Setting rcvlowat: tp->copied_seq: %u, rpc_desc_seq: %u, rpc_end_seq: %u, rpc_desc_buff_len: %u",
+	    TP_SEQ(copied_seq), CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
+
+	if (before(tp->copied_seq, cb->rpc_desc_seq))
+		val = cb->rpc_desc_seq - tp->copied_seq;
+	else if (cb->rpc_desc_buff_len != RPC_DESC_SIZE)
+		val = RPC_DESC_SIZE;
+	else
+		val = cb->rpc_end_seq - tp->copied_seq;
+
+	if (val != tp->inet_conn.icsk_inet.sk.sk_rcvlowat) {
+		bpf_sock_ops_tcp_set_rcvlowat(skops_kern, val);
+
+		LOG("Set rcvlowat: expected: %u, actual: %d\n",
+		    val, tp->inet_conn.icsk_inet.sk.sk_rcvlowat);
+	} else {
+		LOG("No need to set rcvlowat: %u\n", val);
+	}
+}
+
+static void tcp_disable_autolowat(struct bpf_sock_ops *skops,
+				  struct bpf_sock_ops_kern *skops_kern)
+{
+	int flags;
+
+	flags = skops->bpf_sock_ops_cb_flags & ~BPF_SOCK_OPS_RCVQ_CB_FLAG;
+	bpf_sock_ops_cb_flags_set(skops, flags);
+
+	bpf_sock_ops_tcp_set_rcvlowat(skops_kern, 1);
+
+	LOG("Disabled autolowat");
+}
+
+static void tcp_do_autolowat(struct bpf_sock_ops *skops,
+			     struct tcp_autolowat_cb *cb,
+			     struct tcp_sock *tp)
+{
+	struct bpf_sock_ops_kern *skops_kern;
+	struct tcp_skb_cb *tcb;
+	struct sk_buff *skb;
+	u32 seq, end_seq;
+	int ret = 0, i;
+
+	skops_kern = bpf_cast_to_kern_ctx(skops);
+	skb = skops_kern->skb;
+
+	if (!skb)
+		goto update;
+
+	tcb = bpf_core_cast(skb->cb, struct tcp_skb_cb);
+	seq = tcb->seq;
+	end_seq = tcb->end_seq - !!(tcb->tcp_flags & TCPHDR_FIN);
+
+	LOG("Start parsing skb: seq: %u, end_seq: %u, len: %u, "
+	    "rpc_desc_seq: %u, rpc_end_seq: %u, rpc_buff_len: %u",
+	    SEQ(seq), SEQ(end_seq), end_seq - seq,
+	    CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
+
+	if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) {
+		ret = tcp_parse_descriptor(skops, cb, seq, end_seq);
+		if (ret)
+			goto update;
+	}
+
+	i = 0;
+
+	while (1) {
+		if (i++ > MAX_RPC_DESC_PER_SKB) {
+			ret = -1;
+			break;
+		}
+
+		if (after(cb->rpc_end_seq, end_seq)) {
+			LOG("No more descriptor: rpc_end_seq: %u, end_seq: %u",
+			    CB_SEQ(rpc_end_seq), SEQ(end_seq));
+			break;
+		}
+
+		cb->rpc_desc_seq = cb->rpc_end_seq;
+		cb->rpc_desc_buff_len = 0;
+
+		if (cb->rpc_end_seq == end_seq)
+			break;
+
+		LOG("Found next descriptor: rpc_end_seq: %u, end_seq: %u, len: %u",
+		    CB_SEQ(rpc_end_seq), SEQ(end_seq), end_seq - cb->rpc_end_seq);
+
+		ret = tcp_parse_descriptor(skops, cb, seq, end_seq);
+		if (ret)
+			break;
+	}
+
+update:
+	if (ret >= 0)
+		tcp_set_autolowat(skops_kern, cb, tp);
+	else
+		tcp_disable_autolowat(skops, skops_kern);
+}
+
+SEC("sockops")
+int tcp_autolowat(struct bpf_sock_ops *skops)
+{
+	struct tcp_autolowat_cb *cb;
+	struct bpf_sock *bpf_sk;
+	struct tcp_sock *tp;
+
+	if (skops->op != BPF_SOCK_OPS_RCVQ_CB)
+		goto out;
+
+	bpf_sk = skops->sk;
+	if (!bpf_sk)
+		goto out; /* always false, only for verifier. */
+
+	tp = bpf_skc_to_tcp_sock(bpf_sk);
+	if (!tp)
+		goto out; /* always false, only for verifier. */
+
+	cb = bpf_sk_storage_get(&tcp_autolowat_map, tp, 0, 0);
+	if (!cb)
+		goto out;
+
+	tcp_do_autolowat(skops, cb, tp);
+out:
+	return 1;
+}
+
+static int tcp_init_autolowat_cb(struct bpf_sockopt *sockopt,
+				 struct bpf_tcp_sock *btp)
+{
+	struct tcp_autolowat_cb *cb;
+	struct tcp_sock *tp;
+	int flags;
+
+	cb = bpf_sk_storage_get(&tcp_autolowat_map, btp, 0,
+				BPF_SK_STORAGE_GET_F_CREATE);
+	if (!cb)
+		return -1;
+
+	tp = bpf_core_cast(btp, struct tcp_sock);
+	if (!tp)
+		return -1;
+
+	cb->rpc_desc_seq = tp->copied_seq;
+	cb->rpc_end_seq = tp->copied_seq;
+#ifdef DEBUG
+	cb->isn = tp->copied_seq;
+#endif
+
+	if (bpf_getsockopt(sockopt->sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
+			   &flags, sizeof(flags)))
+		return -1;
+
+	flags |= BPF_SOCK_OPS_RCVQ_CB_FLAG;
+
+	if (bpf_setsockopt(sockopt->sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
+			   &flags, sizeof(flags)))
+		return -1;
+
+	return 0;
+}
+
+SEC("cgroup/setsockopt")
+int tcp_autolowat_setsockopt(struct bpf_sockopt *ctx)
+{
+	void *optval_end = ctx->optval_end;
+	int *optval = ctx->optval;
+	struct bpf_tcp_sock *btp;
+
+	if (ctx->level != SOL_BPF || ctx->optname != BPF_TCP_AUTOLOWAT)
+		goto out;
+
+	if (optval + 1 > optval_end)
+		return 0; /* -EPERM */
+
+	btp = bpf_tcp_sock(ctx->sk);
+	if (!btp)
+		goto out;
+
+	if (*optval && tcp_init_autolowat_cb(ctx, btp))
+		return 0; /* -EPERM */
+
+	ctx->optlen = -1; /* BPF has consumed this option, don't call kernel
+			   * setsockopt handler.
+			   */
+out:
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.54.0.746.g67dd491aae-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.
  2026-05-23  8:29 ` [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
@ 2026-05-23  9:20   ` bot+bpf-ci
  2026-05-24  4:03     ` Kuniyuki Iwashima
  2026-05-26 21:01   ` Martin KaFai Lau
  1 sibling, 1 reply; 27+ messages in thread
From: bot+bpf-ci @ 2026-05-23  9:20 UTC (permalink / raw)
  To: kuniyu, ast, daniel, andrii, martin.lau, eddyz87, memxor
  Cc: yonghong.song, john.fastabend, sdf, edumazet, ncardwell, willemb,
	ukyab, kuniyu, kuni1840, bpf, netdev, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 7811 bytes --]

> diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
> --- a/tools/testing/selftests/bpf/bpf_kfuncs.h
> +++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
> @@ -64,6 +64,10 @@
>  extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
>  				   struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym;
>
> +struct bpf_sock_ops_kern;
> +extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops_kern,
> +					 int rcvlowat) __ksym;
> +
>  void *bpf_cast_to_kern_ctx(void *) __ksym;
>
>  extern void *bpf_rdonly_cast(const void *obj, __u32 btf_id) __ksym __weak;

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> @@ -74,6 +74,8 @@ #define ETH_P_IPV6		0x86DD
>
>  #define NEXTHDR_TCP		6
>
> +#define TCPHDR_FIN		0x01
> +
>  #define TCPOPT_NOP		1
>  #define TCPOPT_EOL		0
>  #define TCPOPT_MSS		2

[ ... ]

> diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> @@ -0,0 +1,350 @@

[ ... ]

> +		.event = {
> +			{ .type = RPC_EVENT_AUTOLOWAT,	.val = 1},
> +			/* The first descriptor is partial. */
> +			{.type = RPC_EVENT_SEND,	.len = 1},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE},
> +			/* The first descriptor is available. */
> +			{.type = RPC_EVENT_SEND,	.len = RPC_DESC_SIZE - 1},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
> +			/* The first header is ready. */
> +			{.type = RPC_EVENT_SEND,	.len = 100},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
> +			/* skb has the first payload and 1 byte of the next descriptor. */
> +			{.type = RPC_EVENT_SEND,	.len = 150 + 1},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 1},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
> +			/* After reading the first RPC message, SO_RCVLOWAT should be RPC_DESC_SIZE. */
> +			{.type = RPC_EVENT_RECV,	.len = RPC_DESC_SIZE + 150 + 100},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE},

Does calling recv() actually trigger the BPF program to update SO_RCVLOWAT?

Since the kernel only invokes BPF_SOCK_OPS_RCVQ_CB when a new skb is added
to the receive queue, and not when data is dequeued via recv(), the BPF
program doesn't seem to run here.

Could this cause the test to fail unconditionally because sk_rcvlowat
retains its previous value?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

> +static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
> +				       struct rpc_test_case *test_case)
> +{
> +	struct rpc_event *event = test_case->event;
> +	char *ptr = test_case->data;
> +	struct epoll_event ev;
> +	socklen_t optlen;
> +	int err, optval;
> +	char buf[4096];
> +
> +	if (tcp_autolowat_build_data(test_case))
> +		return;
> +
> +	while (1) {
> +		switch (event->type) {
> +		case RPC_EVENT_END:
> +			return;
> +		case RPC_EVENT_AUTOLOWAT:
> +			err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
> +					 &event->val, sizeof(event->val));
> +			if (!ASSERT_OK(err, "setsockopt"))
> +				return;
> +			break;
> +		case RPC_EVENT_SEND:
> +			err = send(cb->client, ptr, event->len, 0);
> +			if (!ASSERT_EQ(err, event->len, "send"))
> +				return;
> +
> +			ptr += event->len;
> +			break;

Could this lead to race conditions causing flaky tests?

The test calls send() to transmit data and immediately executes the next
event. Because network transmission over the loopback interface involves
asynchronous softirq processing, send() might return before the packet is
fully processed by the receiving socket.

If the next event is RPC_EVENT_EPOLL with a timeout of 0, could
epoll_wait() prematurely return 0? Similarly, for RPC_EVENT_RCVLOWAT,
could getsockopt() read a stale sk_rcvlowat value before the TCP stack and
BPF program have finished processing the packet?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> @@ -0,0 +1,326 @@

[ ... ]

> +static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
> +				struct tcp_autolowat_cb *cb,
> +				u32 seq, u32 end_seq)
> +{
> +	struct rpc_descriptor *rpc_desc;
> +	u32 rpc_copied_seq;
> +	u64 copy_len; /* u32 should work, but not for no_alu32 :/ */
> +	u64 rpc_len;
> +	int err;
> +
> +	rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
> +
> +	if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
> +		copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
> +	else
> +		copy_len = end_seq - rpc_copied_seq;
> +
> +	/* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
> +	 * clang omits the "copy_len == 0" check below, which is necessary
> +	 * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
> +	 */

This isn't a bug, but multi-line comments in the BPF subsystem are
expected to have the opening '/*' on its own line:

	/*
	 * Since LLVM commit ...
	 * ...
	 */

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

> +static void tcp_set_autolowat(struct bpf_sock_ops_kern *skops_kern,
> +			      struct tcp_autolowat_cb *cb,
> +			      struct tcp_sock *tp)
> +{
> +	/* To handle wraparound. */
> +	u32 val = 0;
> +
> +	LOG("Setting rcvlowat: tp->copied_seq: %u, rpc_desc_seq: %u, rpc_end_seq: %u, rpc_desc_buff_len: %u",
> +	    TP_SEQ(copied_seq), CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
> +
> +	if (before(tp->copied_seq, cb->rpc_desc_seq))
> +		val = cb->rpc_desc_seq - tp->copied_seq;
> +	else if (cb->rpc_desc_buff_len != RPC_DESC_SIZE)
> +		val = RPC_DESC_SIZE;
> +	else
> +		val = cb->rpc_end_seq - tp->copied_seq;

Does this code incorrectly set SO_RCVLOWAT to the total combined size of
all unread frames when multiple RPC frames are received in a single skb?

If multiple complete frames arrive in the same skb, the parsing loop in
tcp_do_autolowat() parses all of them and advances cb->rpc_desc_seq to the
end of the last parsed frame. This sets:

val = cb->rpc_desc_seq - tp->copied_seq

which evaluates to the length of all unread frames combined.

If an application sequentially reads one frame at a time, tp->copied_seq
advances and the queue size decreases, but sk_rcvlowat is not updated
because BPF_SOCK_OPS_RCVQ_CB only fires on packet enqueue.

The remaining queue size will then be strictly less than sk_rcvlowat,
causing epoll_wait to block and trapping the remaining fully-received
frames in the receive queue.

Could this hang applications that rely on epoll to read one message at a
time?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.
  2026-05-23  9:20   ` bot+bpf-ci
@ 2026-05-24  4:03     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-24  4:03 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: ast, daniel, andrii, martin.lau, eddyz87, memxor, yonghong.song,
	john.fastabend, sdf, edumazet, ncardwell, willemb, ukyab,
	kuni1840, bpf, netdev, martin.lau, clm, ihor.solodrai

On Sat, May 23, 2026 at 2:20 AM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
> > --- a/tools/testing/selftests/bpf/bpf_kfuncs.h
> > +++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
> > @@ -64,6 +64,10 @@
> >  extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
> >                                  struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym;
> >
> > +struct bpf_sock_ops_kern;
> > +extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops_kern,
> > +                                      int rcvlowat) __ksym;
> > +
> >  void *bpf_cast_to_kern_ctx(void *) __ksym;
> >
> >  extern void *bpf_rdonly_cast(const void *obj, __u32 btf_id) __ksym __weak;
>
> [ ... ]
>
> > diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> > --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> > +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> > @@ -74,6 +74,8 @@ #define ETH_P_IPV6          0x86DD
> >
> >  #define NEXTHDR_TCP          6
> >
> > +#define TCPHDR_FIN           0x01
> > +
> >  #define TCPOPT_NOP           1
> >  #define TCPOPT_EOL           0
> >  #define TCPOPT_MSS           2
>
> [ ... ]
>
> > diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> > @@ -0,0 +1,350 @@
>
> [ ... ]
>
> > +             .event = {
> > +                     { .type = RPC_EVENT_AUTOLOWAT,  .val = 1},
> > +                     /* The first descriptor is partial. */
> > +                     {.type = RPC_EVENT_SEND,        .len = 1},
> > +                     {.type = RPC_EVENT_EPOLL,       .nfds = 0},
> > +                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE},
> > +                     /* The first descriptor is available. */
> > +                     {.type = RPC_EVENT_SEND,        .len = RPC_DESC_SIZE - 1},
> > +                     {.type = RPC_EVENT_EPOLL,       .nfds = 0},
> > +                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE + 150 + 100},
> > +                     /* The first header is ready. */
> > +                     {.type = RPC_EVENT_SEND,        .len = 100},
> > +                     {.type = RPC_EVENT_EPOLL,       .nfds = 0},
> > +                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE + 150 + 100},
> > +                     /* skb has the first payload and 1 byte of the next descriptor. */
> > +                     {.type = RPC_EVENT_SEND,        .len = 150 + 1},
> > +                     {.type = RPC_EVENT_EPOLL,       .nfds = 1},
> > +                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE + 150 + 100},
> > +                     /* After reading the first RPC message, SO_RCVLOWAT should be RPC_DESC_SIZE. */
> > +                     {.type = RPC_EVENT_RECV,        .len = RPC_DESC_SIZE + 150 + 100},
> > +                     {.type = RPC_EVENT_EPOLL,       .nfds = 0},
> > +                     {.type = RPC_EVENT_RCVLOWAT,    .rcvlowat = RPC_DESC_SIZE},
>
> Does calling recv() actually trigger the BPF program to update SO_RCVLOWAT?
>
> Since the kernel only invokes BPF_SOCK_OPS_RCVQ_CB when a new skb is added
> to the receive queue, and not when data is dequeued via recv(),

Again, this is wrong.



> the BPF
> program doesn't seem to run here.
>
> Could this cause the test to fail unconditionally because sk_rcvlowat
> retains its previous value?
>
> Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
> Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)
>
> [ ... ]
>
> > +static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
> > +                                    struct rpc_test_case *test_case)
> > +{
> > +     struct rpc_event *event = test_case->event;
> > +     char *ptr = test_case->data;
> > +     struct epoll_event ev;
> > +     socklen_t optlen;
> > +     int err, optval;
> > +     char buf[4096];
> > +
> > +     if (tcp_autolowat_build_data(test_case))
> > +             return;
> > +
> > +     while (1) {
> > +             switch (event->type) {
> > +             case RPC_EVENT_END:
> > +                     return;
> > +             case RPC_EVENT_AUTOLOWAT:
> > +                     err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
> > +                                      &event->val, sizeof(event->val));
> > +                     if (!ASSERT_OK(err, "setsockopt"))
> > +                             return;
> > +                     break;
> > +             case RPC_EVENT_SEND:
> > +                     err = send(cb->client, ptr, event->len, 0);
> > +                     if (!ASSERT_EQ(err, event->len, "send"))
> > +                             return;
> > +
> > +                     ptr += event->len;
> > +                     break;
>
> Could this lead to race conditions causing flaky tests?

No, it's over loopback.

>
> The test calls send() to transmit data and immediately executes the next
> event. Because network transmission over the loopback interface involves
> asynchronous softirq processing, send() might return before the packet is
> fully processed by the receiving socket.
>
> If the next event is RPC_EVENT_EPOLL with a timeout of 0, could
> epoll_wait() prematurely return 0? Similarly, for RPC_EVENT_RCVLOWAT,
> could getsockopt() read a stale sk_rcvlowat value before the TCP stack and
> BPF program have finished processing the packet?
>
> Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
> Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)
>
> [ ... ]
>
> > diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> > @@ -0,0 +1,326 @@
>
> [ ... ]
>
> > +static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
> > +                             struct tcp_autolowat_cb *cb,
> > +                             u32 seq, u32 end_seq)
> > +{
> > +     struct rpc_descriptor *rpc_desc;
> > +     u32 rpc_copied_seq;
> > +     u64 copy_len; /* u32 should work, but not for no_alu32 :/ */
> > +     u64 rpc_len;
> > +     int err;
> > +
> > +     rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
> > +
> > +     if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
> > +             copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
> > +     else
> > +             copy_len = end_seq - rpc_copied_seq;
> > +
> > +     /* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
> > +      * clang omits the "copy_len == 0" check below, which is necessary
> > +      * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
> > +      */
>
> This isn't a bug, but multi-line comments in the BPF subsystem are
> expected to have the opening '/*' on its own line:
>
>         /*
>          * Since LLVM commit ...
>          * ...
>          */
>
> Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
> Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)
>
> [ ... ]
>
> > +static void tcp_set_autolowat(struct bpf_sock_ops_kern *skops_kern,
> > +                           struct tcp_autolowat_cb *cb,
> > +                           struct tcp_sock *tp)
> > +{
> > +     /* To handle wraparound. */
> > +     u32 val = 0;
> > +
> > +     LOG("Setting rcvlowat: tp->copied_seq: %u, rpc_desc_seq: %u, rpc_end_seq: %u, rpc_desc_buff_len: %u",
> > +         TP_SEQ(copied_seq), CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
> > +
> > +     if (before(tp->copied_seq, cb->rpc_desc_seq))
> > +             val = cb->rpc_desc_seq - tp->copied_seq;
> > +     else if (cb->rpc_desc_buff_len != RPC_DESC_SIZE)
> > +             val = RPC_DESC_SIZE;
> > +     else
> > +             val = cb->rpc_end_seq - tp->copied_seq;
>
> Does this code incorrectly set SO_RCVLOWAT to the total combined size of
> all unread frames when multiple RPC frames are received in a single skb?
>
> If multiple complete frames arrive in the same skb, the parsing loop in
> tcp_do_autolowat() parses all of them and advances cb->rpc_desc_seq to the
> end of the last parsed frame. This sets:
>
> val = cb->rpc_desc_seq - tp->copied_seq
>
> which evaluates to the length of all unread frames combined.
>
> If an application sequentially reads one frame at a time, tp->copied_seq
> advances and the queue size decreases, but sk_rcvlowat is not updated
> because BPF_SOCK_OPS_RCVQ_CB only fires on packet enqueue.

This comment is based on the wrong assumption too :/


>
> The remaining queue size will then be strictly less than sk_rcvlowat,
> causing epoll_wait to block and trapping the remaining fully-received
> frames in the receive queue.
>
> Could this hang applications that rely on epoll to read one message at a
> time?
>
> Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
> Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.
  2026-05-23  8:29 ` [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
  2026-05-23  9:20   ` bot+bpf-ci
@ 2026-05-26 21:01   ` Martin KaFai Lau
  1 sibling, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2026-05-26 21:01 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Yonghong Song,
	John Fastabend, Stanislav Fomichev, Eric Dumazet, Neal Cardwell,
	Willem de Bruijn, Tenzin Ukyab, Kuniyuki Iwashima, bpf, netdev

On Sat, May 23, 2026 at 08:29:40AM +0000, Kuniyuki Iwashima wrote:
> +static void tcp_do_autolowat(struct bpf_sock_ops *skops,
> +			     struct tcp_autolowat_cb *cb,
> +			     struct tcp_sock *tp)
> +{
> +	struct bpf_sock_ops_kern *skops_kern;
> +	struct tcp_skb_cb *tcb;
> +	struct sk_buff *skb;
> +	u32 seq, end_seq;
> +	int ret = 0, i;
> +
> +	skops_kern = bpf_cast_to_kern_ctx(skops);
> +	skb = skops_kern->skb;
> +
> +	if (!skb)
> +		goto update;
> +
> +	tcb = bpf_core_cast(skb->cb, struct tcp_skb_cb);
> +	seq = tcb->seq;
> +	end_seq = tcb->end_seq - !!(tcb->tcp_flags & TCPHDR_FIN);
> +
> +	LOG("Start parsing skb: seq: %u, end_seq: %u, len: %u, "
> +	    "rpc_desc_seq: %u, rpc_end_seq: %u, rpc_buff_len: %u",
> +	    SEQ(seq), SEQ(end_seq), end_seq - seq,
> +	    CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
> +
> +	if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) {
> +		ret = tcp_parse_descriptor(skops, cb, seq, end_seq);
> +		if (ret)
> +			goto update;
> +	}
> +
> +	i = 0;
> +
> +	while (1) {

A nit. It may be useful to take a look at progs/arena_htab.c or
rbtree_search.c on the 'for (i = zero; i < xyz && can_loop; i++)' usage.

#> ./veristat tcp_autolowat.bpf.o
Processing 'tcp_autolowat.bpf.o'...
File                 Program                   Verdict  Duration (us)  Insns  States  Program size  Jited size
-------------------  ------------------------  -------  -------------  -----  ------  ------------  ----------
tcp_autolowat.bpf.o  tcp_autolowat             success         245961  12624     338           141         939

The "Insns" could likely be reduced to ~1000s level.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2026-05-26 22:19 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-23  8:29 [PATCH v3 bpf-next 00/11] bpf: Add SOCK_OPS hooks for TCP AutoLOWAT Kuniyuki Iwashima
2026-05-23  8:29 ` [PATCH v3 bpf-next 01/11] selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv Kuniyuki Iwashima
2026-05-23  9:06   ` bot+bpf-ci
2026-05-23  8:29 ` [PATCH v3 bpf-next 02/11] bpf: tcp: Introduce BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-23  8:29 ` [PATCH v3 bpf-next 03/11] bpf: tcp: Support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-26 20:34   ` Martin KaFai Lau
2026-05-26 21:21     ` Kuniyuki Iwashima
2026-05-26 22:18       ` Martin KaFai Lau
2026-05-23  8:29 ` [PATCH v3 bpf-next 04/11] tcp: Split out __tcp_set_rcvlowat() Kuniyuki Iwashima
2026-05-23  8:29 ` [PATCH v3 bpf-next 05/11] bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat Kuniyuki Iwashima
2026-05-23  9:06   ` bot+bpf-ci
2026-05-23  8:29 ` [PATCH v3 bpf-next 06/11] bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive Kuniyuki Iwashima
2026-05-23  9:20   ` bot+bpf-ci
2026-05-24  3:37     ` Kuniyuki Iwashima
2026-05-23  8:29 ` [PATCH v3 bpf-next 07/11] bpf: mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-23  8:29 ` [PATCH v3 bpf-next 08/11] bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty Kuniyuki Iwashima
2026-05-23  9:20   ` bot+bpf-ci
2026-05-23  8:29 ` [PATCH v3 bpf-next 09/11] bpf: tcp: Factorise bpf_skops_established() Kuniyuki Iwashima
2026-05-23  8:29 ` [PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook Kuniyuki Iwashima
2026-05-26 20:47   ` Martin KaFai Lau
2026-05-26 21:07     ` Kuniyuki Iwashima
2026-05-26 21:37       ` Amery Hung
2026-05-26 21:51         ` Kuniyuki Iwashima
2026-05-23  8:29 ` [PATCH v3 bpf-next 11/11] selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB Kuniyuki Iwashima
2026-05-23  9:20   ` bot+bpf-ci
2026-05-24  4:03     ` Kuniyuki Iwashima
2026-05-26 21:01   ` Martin KaFai Lau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox