From: Kuniyuki Iwashima <kuniyu@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
Stanislav Fomichev <sdf@fomichev.me>,
Eric Dumazet <edumazet@google.com>,
Neal Cardwell <ncardwell@google.com>,
Willem de Bruijn <willemb@google.com>,
Tenzin Ukyab <ukyab@berkeley.edu>,
Kuniyuki Iwashima <kuniyu@google.com>,
Kuniyuki Iwashima <kuni1840@gmail.com>,
bpf@vger.kernel.org, netdev@vger.kernel.org
Subject: [PATCH v1 bpf-next 8/8] selftest: bpf: Add test for BPF_SOCK_OPS_RCVLOWAT_CB.
Date: Fri, 8 May 2026 07:33:29 +0000
Message-ID: <20260508073355.3916746-9-kuniyu@google.com>
In-Reply-To: <20260508073355.3916746-1-kuniyu@google.com>

The test is roughly divided into two stages, and the sequence
is as follows:

I) Setup
1. Attach two BPF programs to a cgroup
2. Establish a TCP connection (@client <-> @child) within the cgroup
3. Enable BPF_SOCK_OPS_RCVLOWAT_CB on @child

II) RPC frame exchange in various patterns
4. Send a partial RPC descriptor from @client to @child
5. Verify that epoll does NOT wake up @child
6. Send the remaining data of the RPC frame
7. Verify that epoll finally wakes up @child

During setup, two BPF programs are attached to simulate
a real-world scenario; one is SOCK_OPS and the other is
CGROUP_SOCKOPT.

While the SOCK_OPS prog handles the dynamic adjustment of
sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable
BPF_SOCK_OPS_RCVLOWAT_CB via userspace setsockopt() using
pseudo options:

  #define SOL_BPF            0xdeadbeef
  #define BPF_TCP_AUTOLOWAT  0x8badf00d

  setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int));

This reflects a common production use case where an application
decides to start parsing RPC frames only at a certain point in
the stream (e.g., after HTTP Upgrade), rather than immediately
after TCP 3WHS (BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, etc.).

When BPF_TCP_AUTOLOWAT is enabled, the BPF prog initialises
sk local storage with two sequence numbers to manage its state.

Then, for the RPC frame exchange, this test uses a simple format
defined as follows:
0 8 16 24 32
+--------+--------+-------+--------+ `.
| header size | |
+--------+--------+-------+--------+ > RPC descriptor (8 bytes)
| payload size | |
+--------+--------+-------+--------+ .'
~ header ~
+--------+--------+-------+--------+
~ payload ~
+--------+--------+-------+--------+

Every time a new skb is enqueued to sk->sk_receive_queue,
the SOCK_OPS prog parses it and updates these sequence numbers:
rpc_desc_seq : the SEQ # of the start of the RPC descriptor
rpc_end_seq : the SEQ # of the end of the RPC frame
=> rpc_desc_seq + 8 + header size + payload size

Assume we receive two RPC descriptors in the following pattern:

1. When we receive skb-1, only part of the RPC descriptor is parsed.
rpc_desc_seq is set to the first byte while rpc_end_seq is
unknown. Thus, sk->sk_rcvlowat is set to the size of the RPC
descriptor (8 bytes).
<- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
+-----------+.................+....................+......
| RPC desc 1 | header + payload | RPC desc 2 | ...
+-----------+.................+....................+......
^ ^-.
`- rpc_desc_seq `- sk->sk_rcvlowat

2. Next, we receive skb-2, which completes the first RPC descriptor.
Now rpc_end_seq is known, so sk->sk_rcvlowat is advanced to it.
<- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
+-----------+-----------------+....................+......
| RPC desc 1 | header + payload | RPC desc 2 | ...
+-----------+-----------------+....................+......
^ ^
'- rpc_desc_seq '- rpc_end_seq
& sk->sk_rcvlowat

3. Once we receive skb-3, which contains the next full RPC descriptor,
rpc_desc_seq is advanced and rpc_end_seq is updated according
to the size of RPC frame 2.
Note that sk->sk_rcvlowat is NOT updated to the new rpc_end_seq
yet. This ensures that the application is woken up to read the
already complete RPC frame 1.
<- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
+-----------+-----------------+--------------------+......
| RPC desc 1 | header + payload | RPC desc 2 | ... |
+-----------+-----------------+--------------------+......
^ ^
rpc_desc_seq -----------' rpc_end_seq ----...-'
& sk->sk_rcvlowat

This sequence corresponds to the 4th test case in rpc_test_cases[],
and we can see helpful output if we "#define DEBUG":

# cat /sys/kernel/tracing/trace_pipe | \
awk '{ if ($0 ~ /AF_/) sub(/^.*AF_/, "AF_"); print $0 }' & \
BGPID=$!; ./test_progs -t tcp_autolowat; kill -9 -$BGPID
...
AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 0, end_seq: 1, len: 1, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 0
AF_INET6 rpc_test_cases[3]: Copied 1 bytes: rpc_desc_buff_len: 1
AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_desc_buff_len: 1
AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 8, actual: 8
AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 1, end_seq: 8, len: 7, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 1
AF_INET6 rpc_test_cases[3]: Copied full descriptor: rpc_desc_seq: 0, rpc_end_seq: 258, header_len: 100, payload_len: 150
AF_INET6 rpc_test_cases[3]: No more descriptor: rpc_end_seq: 258, end_seq: 8
AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 258, rpc_desc_buff_len: 8
AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 258, actual: 258
...

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
tools/testing/selftests/bpf/bpf_kfuncs.h | 4 +
.../selftests/bpf/prog_tests/tcp_autolowat.c | 350 ++++++++++++++++++
.../selftests/bpf/progs/bpf_tracing_net.h | 2 +
.../selftests/bpf/progs/tcp_autolowat.c | 316 ++++++++++++++++
4 files changed, 672 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
create mode 100644 tools/testing/selftests/bpf/progs/tcp_autolowat.c

diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
index ae71e9b69051..fc4d6f68f247 100644
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
@@ -64,6 +64,10 @@ struct bpf_tcp_req_attrs;
extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym;
+struct bpf_sock_ops_kern;
+extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops_kern,
+ int rcvlowat) __ksym;
+
void *bpf_cast_to_kern_ctx(void *) __ksym;
extern void *bpf_rdonly_cast(const void *obj, __u32 btf_id) __ksym __weak;
diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
new file mode 100644
index 000000000000..5e971c42a32a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
@@ -0,0 +1,350 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2026 Google LLC */
+#include <sys/epoll.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+
+#include "tcp_autolowat.skel.h"
+
+#define SOL_BPF 0xdeadbeef
+#define BPF_TCP_AUTOLOWAT 0x8badf00d
+
+struct rpc_descriptor {
+ u32 header_len;
+ u32 payload_len;
+};
+
+enum rpc_event_type {
+ RPC_EVENT_END,
+ RPC_EVENT_AUTOLOWAT,
+ RPC_EVENT_SEND,
+ RPC_EVENT_RECV,
+ RPC_EVENT_EPOLL,
+ RPC_EVENT_RCVLOWAT,
+};
+
+struct rpc_event {
+ enum rpc_event_type type;
+ union {
+ int len;
+ int nfds;
+ int val;
+ int rcvlowat;
+ };
+};
+
+#define RPC_DESC_SIZE (sizeof(struct rpc_descriptor))
+
+struct rpc_test_case {
+ char data[4096];
+ struct rpc_descriptor desc[32];
+ struct rpc_event event[32];
+} rpc_test_cases[] = {
+ {
+ .desc = {
+ { .header_len = 100, .payload_len = 150 },
+ },
+ .event = {
+ { .type = RPC_EVENT_AUTOLOWAT, .val = 1},
+ /* Single full RPC message in skb. */
+ { .type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE + 100 + 150},
+ { .type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 100 + 150},
+ { .type = RPC_EVENT_EPOLL, .nfds = 1},
+ },
+ },
+ {
+ .desc = {
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 100, .payload_len = 150},
+ },
+ .event = {
+ { .type = RPC_EVENT_AUTOLOWAT, .val = 1},
+ /* Two full RPC messages in skb. */
+ {.type = RPC_EVENT_SEND, .len = (RPC_DESC_SIZE + 100 + 150) * 2},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2},
+ {.type = RPC_EVENT_EPOLL, .nfds = 1},
+ /* Single full RPC message in skb. */
+ { .type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE + 100 + 150},
+ { .type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 3},
+ { .type = RPC_EVENT_EPOLL, .nfds = 1},
+ },
+ },
+ {
+ .desc = {
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 100, .payload_len = 150},
+ },
+ .event = {
+ { .type = RPC_EVENT_AUTOLOWAT, .val = 1},
+ /* Two full RPC messages in skb. */
+ {.type = RPC_EVENT_SEND, .len = (RPC_DESC_SIZE + 100 + 150) * 2},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2},
+ {.type = RPC_EVENT_EPOLL, .nfds = 1},
+			/* Only the third RPC descriptor in skb. */
+ { .type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE},
+ { .type = RPC_EVENT_RCVLOWAT, .rcvlowat = (RPC_DESC_SIZE + 100 + 150) * 2},
+ { .type = RPC_EVENT_EPOLL, .nfds = 1},
+ },
+ },
+ {
+ .desc = {
+ {.header_len = 100, .payload_len = 150},
+ {.header_len = 200, .payload_len = 500},
+ },
+ .event = {
+ { .type = RPC_EVENT_AUTOLOWAT, .val = 1},
+ /* The first descriptor is partial. */
+ {.type = RPC_EVENT_SEND, .len = 1},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE},
+ /* The first descriptor is available. */
+ {.type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE - 1},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 150 + 100},
+ /* The first header is ready. */
+ {.type = RPC_EVENT_SEND, .len = 100},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 150 + 100},
+ /* skb has the first payload and 1 byte of the next descriptor. */
+ {.type = RPC_EVENT_SEND, .len = 150 + 1},
+ {.type = RPC_EVENT_EPOLL, .nfds = 1},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 150 + 100},
+ /* After reading the first RPC message, SO_RCVLOWAT should be RPC_DESC_SIZE. */
+ {.type = RPC_EVENT_RECV, .len = RPC_DESC_SIZE + 150 + 100},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE},
+ /* The second descriptor is available. */
+ {.type = RPC_EVENT_SEND, .len = RPC_DESC_SIZE - 1},
+ {.type = RPC_EVENT_EPOLL, .nfds = 0},
+ {.type = RPC_EVENT_RCVLOWAT, .rcvlowat = RPC_DESC_SIZE + 200 + 500},
+ },
+ },
+};
+
+struct tcp_autolowat_test_cb {
+ int saved_netns;
+ union {
+ int fd[4];
+ struct {
+ int server, client, child;
+ int epoll;
+ };
+ };
+};
+
+static void tcp_autolowat_teardown_cb(struct tcp_autolowat_test_cb *cb)
+{
+ int i, err;
+
+ for (i = 0; i < ARRAY_SIZE(cb->fd); i++) {
+ if (cb->fd[i] != -1)
+ close(cb->fd[i]);
+ }
+
+ if (cb->saved_netns != -1) {
+ err = setns(cb->saved_netns, CLONE_NEWNET);
+ ASSERT_OK(err, "restore netns");
+
+ close(cb->saved_netns);
+ }
+}
+
+static int tcp_autolowat_setup_cb(struct tcp_autolowat_test_cb *cb, int family)
+{
+ struct epoll_event ev = {};
+ int err;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(cb->fd); i++)
+ cb->fd[i] = -1;
+
+ cb->saved_netns = open("/proc/self/ns/net", O_RDONLY);
+ if (!ASSERT_NEQ(cb->saved_netns, -1, "save netns"))
+ goto err;
+
+ err = unshare(CLONE_NEWNET);
+ if (!ASSERT_OK(err, "unshare"))
+ goto err;
+
+ err = system("ip link set dev lo up");
+ if (!ASSERT_OK(err, "set up lo"))
+ goto err;
+
+ cb->server = start_server(family, SOCK_STREAM, NULL, 0, 0);
+ if (!ASSERT_NEQ(cb->server, -1, "start_server"))
+ goto err;
+
+ cb->client = connect_to_fd(cb->server, 0);
+ if (!ASSERT_NEQ(cb->client, -1, "connect_to_fd"))
+ goto err;
+
+ cb->child = accept(cb->server, NULL, NULL);
+ if (!ASSERT_NEQ(cb->child, -1, "accept"))
+ goto err;
+
+ cb->epoll = epoll_create1(0);
+ if (!ASSERT_NEQ(cb->epoll, -1, "epoll_create"))
+ goto err;
+
+ ev.events = EPOLLIN;
+ ev.data.fd = cb->child;
+
+ err = epoll_ctl(cb->epoll, EPOLL_CTL_ADD, cb->child, &ev);
+ if (!ASSERT_OK(err, "epoll_ctl"))
+ goto err;
+
+ return 0;
+
+err:
+ tcp_autolowat_teardown_cb(cb);
+ return -1;
+}
+
+static int tcp_autolowat_build_data(struct rpc_test_case *test_case)
+{
+ struct rpc_descriptor *desc = test_case->desc;
+ char *ptr = test_case->data;
+ int rpc_size;
+
+ memset(ptr, 0, sizeof(test_case->data));
+
+ while (desc->header_len + desc->payload_len) {
+ rpc_size = sizeof(*desc) + desc->header_len + desc->payload_len;
+
+ if (!ASSERT_LE(ptr + rpc_size - test_case->data,
+ sizeof(test_case->data), "data overflow"))
+ return 1;
+
+ memcpy(ptr, desc, sizeof(*desc));
+ ptr += rpc_size;
+ desc++;
+ }
+
+ if (!ASSERT_GT(ptr - test_case->data, 0, "no data"))
+ return 1;
+
+ return 0;
+}
+
+static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
+ struct rpc_test_case *test_case)
+{
+ struct rpc_event *event = test_case->event;
+ char *ptr = test_case->data;
+ struct epoll_event ev;
+ socklen_t optlen;
+ int err, optval;
+ char buf[4096];
+
+ if (tcp_autolowat_build_data(test_case))
+ return;
+
+ while (1) {
+ switch (event->type) {
+ case RPC_EVENT_END:
+ return;
+ case RPC_EVENT_AUTOLOWAT:
+ err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
+ &event->val, sizeof(event->val));
+ if (!ASSERT_OK(err, "setsockopt"))
+ return;
+ break;
+ case RPC_EVENT_SEND:
+ err = send(cb->client, ptr, event->len, 0);
+ if (!ASSERT_EQ(err, event->len, "send"))
+ return;
+
+ ptr += event->len;
+ break;
+ case RPC_EVENT_RECV:
+ err = recv(cb->child, buf, event->len, 0);
+ if (!ASSERT_EQ(err, event->len, "recv"))
+ return;
+ break;
+ case RPC_EVENT_EPOLL:
+ err = epoll_wait(cb->epoll, &ev, 1, 0);
+ if (!ASSERT_EQ(err, event->nfds, "epoll_wait"))
+ return;
+ break;
+ case RPC_EVENT_RCVLOWAT:
+ optval = 0;
+ optlen = sizeof(optval);
+
+ err = getsockopt(cb->child, SOL_SOCKET, SO_RCVLOWAT, &optval, &optlen);
+ if (!ASSERT_OK(err, "getsockopt") ||
+ !ASSERT_EQ(optval, event->rcvlowat, "rcvlowat"))
+ return;
+ break;
+ }
+
+ event++;
+ }
+}
+
+static void tcp_autolowat_run_rpc_tests(struct tcp_autolowat *skel, int family)
+{
+ struct tcp_autolowat_test_cb cb;
+ int err;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(rpc_test_cases); i++) {
+ memset(skel->bss->test_name, 0, sizeof(skel->bss->test_name));
+
+ snprintf(skel->bss->test_name, sizeof(skel->bss->test_name),
+ "AF_INET%c rpc_test_cases[%d]",
+ family == AF_INET ? ' ' : '6', i);
+
+ if (!test__start_subtest(skel->bss->test_name))
+ continue;
+
+ err = tcp_autolowat_setup_cb(&cb, family);
+ if (err)
+ continue;
+
+ tcp_autolowat_run_rpc_test(&cb, &rpc_test_cases[i]);
+ tcp_autolowat_teardown_cb(&cb);
+ }
+}
+
+static void tcp_autolowat_run_tests(struct tcp_autolowat *skel)
+{
+ tcp_autolowat_run_rpc_tests(skel, AF_INET);
+ tcp_autolowat_run_rpc_tests(skel, AF_INET6);
+}
+
+void test_tcp_autolowat(void)
+{
+ struct tcp_autolowat *skel;
+ struct bpf_link *link[2];
+ int cgroup;
+
+ skel = tcp_autolowat__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ cgroup = test__join_cgroup("/tcp_autolowat");
+ if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
+ goto destroy_skel;
+
+ link[0] = bpf_program__attach_cgroup(skel->progs.tcp_autolowat, cgroup);
+ if (!ASSERT_OK_PTR(link[0], "attach_cgroup(SOCK_OPS)"))
+ goto close_cgroup;
+
+ link[1] = bpf_program__attach_cgroup(skel->progs.tcp_autolowat_setsockopt, cgroup);
+ if (!ASSERT_OK_PTR(link[1], "attach_cgroup(SETSOCKOPT)"))
+ goto destroy_sockops;
+
+ tcp_autolowat_run_tests(skel);
+
+ bpf_link__destroy(link[1]);
+destroy_sockops:
+ bpf_link__destroy(link[0]);
+close_cgroup:
+ close(cgroup);
+destroy_skel:
+ tcp_autolowat__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index d8dacef37c16..bdf28d320383 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -74,6 +74,8 @@
#define NEXTHDR_TCP 6
+#define TCPHDR_FIN 0x01
+
#define TCPOPT_NOP 1
#define TCPOPT_EOL 0
#define TCPOPT_MSS 2
diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
new file mode 100644
index 000000000000..86f2af2fe683
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2026 Google LLC */
+#include "vmlinux.h"
+
+#include <string.h>
+#include <limits.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+#include "bpf_tracing_net.h"
+
+#define SOL_BPF 0xdeadbeef
+#define BPF_TCP_AUTOLOWAT 0x8badf00d
+
+//#define DEBUG /* For verbose output. */
+
+struct rpc_descriptor {
+ u32 header_len;
+ u32 payload_len;
+};
+
+#define RPC_DESC_SIZE (sizeof(struct rpc_descriptor))
+#define MAX_RPC_DESC_PER_SKB 100
+
+struct tcp_autolowat_cb {
+ /* Don't put this field at the end; BPF verifier complains. */
+ char rpc_desc_buf[RPC_DESC_SIZE];
+ u32 rpc_desc_seq;
+ u32 rpc_end_seq;
+#ifdef DEBUG
+ u32 isn;
+#endif
+ u8 rpc_desc_buff_len;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_SK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct tcp_autolowat_cb);
+} tcp_autolowat_map SEC(".maps");
+
+char test_name[64];
+
+#ifdef DEBUG
+#define LOG(str, ...) \
+ bpf_printk("%s: " str, test_name, ##__VA_ARGS__)
+#else
+#define LOG(...)
+#endif
+
+#define SEQ(val) \
+ (val - cb->isn)
+#define TP_SEQ(field) \
+ (tp->field - cb->isn)
+#define CB_SEQ(field) \
+ (cb->field - cb->isn)
+
+static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
+ struct tcp_autolowat_cb *cb,
+ u32 seq, u32 end_seq)
+{
+ struct rpc_descriptor *rpc_desc;
+ u32 rpc_copied_seq;
+ u32 copy_len;
+ u64 rpc_len;
+ int err;
+
+ rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
+
+ if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
+ copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
+ else
+ copy_len = end_seq - rpc_copied_seq;
+
+ /* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
+ * clang omits the "copy_len == 0" check below, which is necessary
+ * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
+ */
+ barrier_var(copy_len);
+
+ if (copy_len == 0)
+ goto disable; /* FIN. */
+ if (copy_len > RPC_DESC_SIZE)
+ goto disable; /* always false, only for verifier. */
+ if (cb->rpc_desc_buf + cb->rpc_desc_buff_len >= &cb->rpc_desc_buf[RPC_DESC_SIZE])
+ goto disable; /* always false, only for verifier. */
+
+ err = bpf_skb_load_bytes(skops, rpc_copied_seq - seq,
+ cb->rpc_desc_buf + cb->rpc_desc_buff_len, copy_len);
+ if (err)
+ goto disable;
+
+ cb->rpc_desc_buff_len += copy_len;
+
+ if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) {
+ LOG("Copied %d bytes: rpc_desc_buff_len: %u", copy_len, cb->rpc_desc_buff_len);
+ goto partial;
+ }
+
+ rpc_desc = (struct rpc_descriptor *)cb->rpc_desc_buf;
+ rpc_len = RPC_DESC_SIZE + rpc_desc->header_len + rpc_desc->payload_len;
+
+ if (rpc_len > INT_MAX)
+ goto disable;
+
+ cb->rpc_end_seq = cb->rpc_desc_seq + rpc_len;
+
+ LOG("Copied full descriptor: rpc_desc_seq: %u, rpc_end_seq: %u,"
+ " header_len: %u, payload_len: %u",
+ CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq),
+ rpc_desc->header_len, rpc_desc->payload_len);
+
+ return 0;
+disable:
+ return -1;
+partial:
+ return 1;
+}
+
+static void tcp_set_autolowat(struct bpf_sock_ops_kern *skops_kern,
+ struct tcp_autolowat_cb *cb,
+ struct tcp_sock *tp)
+{
+ /* To handle wraparound. */
+ u32 val = 0;
+
+ LOG("Setting rcvlowat: tp->copied_seq: %u, rpc_desc_seq: %u, rpc_end_seq: %u, rpc_desc_buff_len: %u",
+ TP_SEQ(copied_seq), CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
+
+ if (before(tp->copied_seq, cb->rpc_desc_seq))
+ val = cb->rpc_desc_seq - tp->copied_seq;
+ else if (cb->rpc_desc_buff_len != RPC_DESC_SIZE)
+ val = RPC_DESC_SIZE;
+ else
+ val = cb->rpc_end_seq - tp->copied_seq;
+
+ if (val != tp->inet_conn.icsk_inet.sk.sk_rcvlowat) {
+ bpf_sock_ops_tcp_set_rcvlowat(skops_kern, val);
+
+ LOG("Set rcvlowat: expected: %u, actual: %d\n",
+ val, tp->inet_conn.icsk_inet.sk.sk_rcvlowat);
+ } else {
+ LOG("No need to set rcvlowat: %u\n", val);
+ }
+}
+
+static void tcp_disable_autolowat(struct bpf_sock_ops *skops,
+ struct bpf_sock_ops_kern *skops_kern)
+{
+ int flags;
+
+ flags = skops->bpf_sock_ops_cb_flags & ~BPF_SOCK_OPS_RCVLOWAT_CB_FLAG;
+ bpf_sock_ops_cb_flags_set(skops, flags);
+
+ bpf_sock_ops_tcp_set_rcvlowat(skops_kern, 1);
+
+ LOG("Disabled autolowat");
+}
+
+static void tcp_do_autolowat(struct bpf_sock_ops *skops,
+ struct tcp_autolowat_cb *cb,
+ struct tcp_sock *tp)
+{
+ struct bpf_sock_ops_kern *skops_kern;
+ struct tcp_skb_cb *tcb;
+ struct sk_buff *skb;
+ u32 seq, end_seq;
+ int ret = 0, i;
+
+ skops_kern = bpf_cast_to_kern_ctx(skops);
+ skb = skops_kern->skb;
+
+ if (!skb)
+ goto update;
+
+ tcb = bpf_core_cast(skb->cb, struct tcp_skb_cb);
+ seq = tcb->seq;
+ end_seq = tcb->end_seq - !!(tcb->tcp_flags & TCPHDR_FIN);
+
+ LOG("Start parsing skb: seq: %u, end_seq: %u, len: %u, "
+ "rpc_desc_seq: %u, rpc_end_seq: %u, rpc_buff_len: %u",
+ SEQ(seq), SEQ(end_seq), end_seq - seq,
+ CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
+
+ if (cb->rpc_desc_buff_len != RPC_DESC_SIZE) {
+ ret = tcp_parse_descriptor(skops, cb, seq, end_seq);
+ if (ret)
+ goto update;
+ }
+
+ i = 0;
+
+ while (1) {
+ if (i++ > MAX_RPC_DESC_PER_SKB) {
+ ret = -1;
+ break;
+ }
+
+ if (after(cb->rpc_end_seq, end_seq)) {
+ LOG("No more descriptor: rpc_end_seq: %u, end_seq: %u",
+ CB_SEQ(rpc_end_seq), SEQ(end_seq));
+ break;
+ }
+
+ cb->rpc_desc_seq = cb->rpc_end_seq;
+ cb->rpc_desc_buff_len = 0;
+
+ if (cb->rpc_end_seq == end_seq)
+ break;
+
+ LOG("Found next descriptor: rpc_end_seq: %u, end_seq: %u, len: %u",
+ CB_SEQ(rpc_end_seq), SEQ(end_seq), end_seq - cb->rpc_end_seq);
+
+ ret = tcp_parse_descriptor(skops, cb, seq, end_seq);
+ if (ret)
+ break;
+ }
+
+update:
+ if (ret >= 0)
+ tcp_set_autolowat(skops_kern, cb, tp);
+ else
+ tcp_disable_autolowat(skops, skops_kern);
+}
+
+SEC("sockops")
+int tcp_autolowat(struct bpf_sock_ops *skops)
+{
+ struct tcp_autolowat_cb *cb;
+ struct bpf_sock *bpf_sk;
+ struct tcp_sock *tp;
+
+ if (skops->op != BPF_SOCK_OPS_RCVLOWAT_CB)
+ goto out;
+
+ bpf_sk = skops->sk;
+ if (!bpf_sk)
+ goto out; /* always false, only for verifier. */
+
+ tp = bpf_skc_to_tcp_sock(bpf_sk);
+ if (!tp)
+ goto out; /* always false, only for verifier. */
+
+ cb = bpf_sk_storage_get(&tcp_autolowat_map, tp, 0, 0);
+ if (!cb)
+ goto out;
+
+ tcp_do_autolowat(skops, cb, tp);
+out:
+ return 1;
+}
+
+static int tcp_init_autolowat_cb(struct bpf_sockopt *sockopt,
+ struct bpf_tcp_sock *btp)
+{
+ struct tcp_autolowat_cb *cb;
+ struct tcp_sock *tp;
+ int flags;
+
+ cb = bpf_sk_storage_get(&tcp_autolowat_map, btp, 0,
+ BPF_SK_STORAGE_GET_F_CREATE);
+ if (!cb)
+ return -1;
+
+ tp = bpf_core_cast(btp, struct tcp_sock);
+ if (!tp)
+ return -1;
+
+ cb->rpc_desc_seq = tp->copied_seq;
+ cb->rpc_end_seq = tp->copied_seq;
+#ifdef DEBUG
+ cb->isn = tp->copied_seq;
+#endif
+
+ if (bpf_getsockopt(sockopt->sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
+ &flags, sizeof(flags)))
+ return -1;
+
+ flags |= BPF_SOCK_OPS_RCVLOWAT_CB_FLAG;
+
+ if (bpf_setsockopt(sockopt->sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
+ &flags, sizeof(flags)))
+ return -1;
+
+ return 0;
+}
+
+SEC("cgroup/setsockopt")
+int tcp_autolowat_setsockopt(struct bpf_sockopt *ctx)
+{
+ void *optval_end = ctx->optval_end;
+ int *optval = ctx->optval;
+ struct bpf_tcp_sock *btp;
+
+ if (ctx->level != SOL_BPF || ctx->optname != BPF_TCP_AUTOLOWAT)
+ goto out;
+
+ if (optval + 1 > optval_end)
+ return 0; /* -EPERM */
+
+ btp = bpf_tcp_sock(ctx->sk);
+ if (!btp)
+ goto out;
+
+ if (*optval && tcp_init_autolowat_cb(ctx, btp))
+ return 0; /* -EPERM */
+
+ ctx->optlen = -1; /* BPF has consumed this option, don't call kernel
+ * setsockopt handler.
+ */
+out:
+ return 1;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.54.0.563.g4f69b47b94-goog