[PATCH bpf-next v2 0/2] prevent bpf_reserve_hdr_opt() from growing skb larger than MTU

BPF List
 help / color / mirror / Atom feed

* [PATCH bpf-next v2 0/2] prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
@ 2024-08-27  1:37 zijianzhang
  2024-08-27  1:37 ` [PATCH bpf-next v2 1/2] bpf: tcp: " zijianzhang
  2024-08-27  1:37 ` [PATCH bpf-next v2 2/2] bpf: selftests: reserve smaller tcp header options than the actual size zijianzhang
  0 siblings, 2 replies; 14+ messages in thread
From: zijianzhang @ 2024-08-27  1:37 UTC (permalink / raw)
  To: bpf
  Cc: edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf, Amery Hung

From: Amery Hung <amery.hung@bytedance.com>

This series prevents sockops users from accidentally causing packet
drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
reserves different option lengths in tcp_sendmsg().

Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
reserve a space in tcp_send_mss(), which will return the MSS for TSO.
Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
again to calculate the actual tcp_option_size and skb_push() the total
header size.

skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
reserved opt size is smaller than the actual header size, the len of the
skb can exceed the MTU. As a result, ip(6)_fragment will drop the
packet if skb->ignore_df is not set.

To prevent this accidental packet drop, we need to make sure the
second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
not more than the first time. Since this cannot be done during
verification time, we add a runtime sanity check to have
bpf_reserve_hdr_opt return an error instead of causing packet drops later.

We also add a selftests to verify the sanity check. If users accidentally
reserve a small size, bpf_reserve_hdr_opt() should return an appropriate
error value and no packet should be dropped.

Changelog:
  v1 -> v2:
    - I accidentally missed the eBPF prog file in the previous patch
    submission, sorry for the convenience.

Amery Hung (1):
  bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than
    MTU

Zijian Zhang (1):
  bpf: selftests: reserve smaller tcp header options than the actual
    size

 include/net/tcp.h                             |  8 +++
 net/ipv4/tcp_input.c                          |  8 ---
 net/ipv4/tcp_output.c                         | 13 +++-
 .../bpf/prog_tests/tcp_hdr_options.c          | 51 +++++++++++++
 .../bpf/progs/test_reserve_tcp_hdr_options.c  | 71 +++++++++++++++++++
 5 files changed, 141 insertions(+), 10 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_reserve_tcp_hdr_options.c

-- 
2.20.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-08-27  1:37 [PATCH bpf-next v2 0/2] prevent bpf_reserve_hdr_opt() from growing skb larger than MTU zijianzhang
@ 2024-08-27  1:37 ` zijianzhang
  2024-08-28 21:29   ` Martin KaFai Lau
  2024-08-27  1:37 ` [PATCH bpf-next v2 2/2] bpf: selftests: reserve smaller tcp header options than the actual size zijianzhang
  1 sibling, 1 reply; 14+ messages in thread
From: zijianzhang @ 2024-08-27  1:37 UTC (permalink / raw)
  To: bpf
  Cc: edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf, Amery Hung, Zijian Zhang

From: Amery Hung <amery.hung@bytedance.com>

This series prevents sockops users from accidentally causing packet
drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
reserves different option lengths in tcp_sendmsg().

Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
reserve a space in tcp_send_mss(), which will return the MSS for TSO.
Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
again to calculate the actual tcp_option_size and skb_push() the total
header size.

skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
reserved opt size is smaller than the actual header size, the len of the
skb can exceed the MTU. As a result, ip(6)_fragment will drop the
packet if skb->ignore_df is not set.

To prevent this accidental packet drop, we need to make sure the
second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
not more than the first time. Since this cannot be done during
verification time, we add a runtime sanity check to have
bpf_reserve_hdr_opt return an error instead of causing packet drops later.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Co-developed-by: Zijian Zhang <zijianzhang@bytedance.com>
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
---
 include/net/tcp.h     |  8 ++++++++
 net/ipv4/tcp_input.c  |  8 --------
 net/ipv4/tcp_output.c | 13 +++++++++++--
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2aac11e7e1cc..e202eeb19be4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1058,6 +1058,14 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)
 	return TCP_SKB_CB(skb)->tcp_gso_size;
 }
 
+/* I wish gso_size would have a bit more sane initialization than
+ * something-or-zero which complicates things
+ */
+static inline int tcp_skb_seglen(const struct sk_buff *skb)
+{
+	return tcp_skb_pcount(skb) == 1 ? skb->len : tcp_skb_mss(skb);
+}
+
 static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
 {
 	return likely(!TCP_SKB_CB(skb)->eor);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e37488d3453f..c1ffe19b0717 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1550,14 +1550,6 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *prev,
 	return true;
 }
 
-/* I wish gso_size would have a bit more sane initialization than
- * something-or-zero which complicates things
- */
-static int tcp_skb_seglen(const struct sk_buff *skb)
-{
-	return tcp_skb_pcount(skb) == 1 ? skb->len : tcp_skb_mss(skb);
-}
-
 /* Shifting pages past head area doesn't work */
 static int skb_can_shift(const struct sk_buff *skb)
 {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 16c48df8df4c..f5996cdbb2ba 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1033,10 +1033,19 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 	if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
 					    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) {
 		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+		unsigned int old_remaining;
 
-		bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
+		if (skb) {
+			unsigned int reserved_opt_spc;
+
+			reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb);
+			if (reserved_opt_spc < remaining)
+				remaining = reserved_opt_spc;
+		}
 
-		size = MAX_TCP_OPTION_SPACE - remaining;
+		old_remaining = remaining;
+		bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
+		size += old_remaining - remaining;
 	}
 
 	return size;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH bpf-next v2 2/2] bpf: selftests: reserve smaller tcp header options than the actual size
  2024-08-27  1:37 [PATCH bpf-next v2 0/2] prevent bpf_reserve_hdr_opt() from growing skb larger than MTU zijianzhang
  2024-08-27  1:37 ` [PATCH bpf-next v2 1/2] bpf: tcp: " zijianzhang
@ 2024-08-27  1:37 ` zijianzhang
  1 sibling, 0 replies; 14+ messages in thread
From: zijianzhang @ 2024-08-27  1:37 UTC (permalink / raw)
  To: bpf
  Cc: edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf, Amery Hung, Zijian Zhang

From: Amery Hung <amery.hung@bytedance.com>

If eBPF users mistakenly reserve smaller header options than the actual
size in BPF_SOCK_OPS_HDR_OPT_LEN_CB, bpf_reserve_hdr_opt should return an
appropriate error value, and there will be no packet dropping.

Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 .../bpf/prog_tests/tcp_hdr_options.c          | 51 +++++++++++++
 .../bpf/progs/test_reserve_tcp_hdr_options.c  | 71 +++++++++++++++++++
 2 files changed, 122 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/test_reserve_tcp_hdr_options.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
index 56685fc03c7e..9c250b5bf00a 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
@@ -14,6 +14,7 @@
 #include "test_tcp_hdr_options.h"
 #include "test_tcp_hdr_options.skel.h"
 #include "test_misc_tcp_hdr_options.skel.h"
+#include "test_reserve_tcp_hdr_options.skel.h"
 
 #define LO_ADDR6 "::1"
 #define CG_NAME "/tcpbpf-hdr-opt-test"
@@ -25,6 +26,7 @@ static struct bpf_test_option exp_active_fin_in;
 static struct hdr_stg exp_passive_hdr_stg;
 static struct hdr_stg exp_active_hdr_stg = { .active = true, };
 
+static struct test_reserve_tcp_hdr_options *reserve_skel;
 static struct test_misc_tcp_hdr_options *misc_skel;
 static struct test_tcp_hdr_options *skel;
 static int lport_linum_map_fd;
@@ -513,6 +515,49 @@ static void misc(void)
 	bpf_link__destroy(link);
 }
 
+static void reserve_hdr_opt(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+	char send_msg[1500];
+	char recv_msg[sizeof(send_msg)];
+	int ret;
+
+	if (!ASSERT_OK(system("ip link set dev lo mtu 1500"), "set dev lo mtu to 1500"))
+		return;
+
+	lport_linum_map_fd = bpf_map__fd(reserve_skel->maps.lport_linum_map);
+
+	link = bpf_program__attach_cgroup(reserve_skel->progs.reserve_tcp_hdr_options, cg_fd);
+	if (!ASSERT_OK_PTR(link, "attach_cgroup(reserve_tcp_hdr_options)"))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	ret = send(sk_fds.active_fd, send_msg, sizeof(send_msg),
+			   MSG_EOR);
+	if (!ASSERT_EQ(ret, sizeof(send_msg), "send(msg)"))
+		goto check_linum;
+
+	ret = read(sk_fds.passive_fd, recv_msg, sizeof(recv_msg));
+	if (!ASSERT_EQ(ret, sizeof(send_msg), "read(msg)"))
+		goto check_linum;
+
+	if (sk_fds_shutdown(&sk_fds))
+		goto check_linum;
+
+	ASSERT_FALSE(reserve_skel->bss->nr_err_reserve, "unexpected nr_err_reserve");
+	ASSERT_TRUE(reserve_skel->bss->nr_nospc, "unexpected nr_nospc");
+
+check_linum:
+	ASSERT_FALSE(check_error_linum(&sk_fds), "check_error_linum");
+	sk_fds_close(&sk_fds);
+	bpf_link__destroy(link);
+}
+
 struct test {
 	const char *desc;
 	void (*run)(void);
@@ -526,6 +571,7 @@ static struct test tests[] = {
 	DEF_TEST(fastopen_estab),
 	DEF_TEST(fin),
 	DEF_TEST(misc),
+	DEF_TEST(reserve_hdr_opt),
 };
 
 void test_tcp_hdr_options(void)
@@ -540,6 +586,10 @@ void test_tcp_hdr_options(void)
 	if (!ASSERT_OK_PTR(misc_skel, "open and load misc test skel"))
 		goto skel_destroy;
 
+	reserve_skel = test_reserve_tcp_hdr_options__open_and_load();
+	if (!ASSERT_OK_PTR(reserve_skel, "open and load reserve test skel"))
+		goto skel_destroy;
+
 	cg_fd = test__join_cgroup(CG_NAME);
 	if (!ASSERT_GE(cg_fd, 0, "join_cgroup"))
 		goto skel_destroy;
@@ -558,6 +608,7 @@ void test_tcp_hdr_options(void)
 
 	close(cg_fd);
 skel_destroy:
+	test_reserve_tcp_hdr_options__destroy(reserve_skel);
 	test_misc_tcp_hdr_options__destroy(misc_skel);
 	test_tcp_hdr_options__destroy(skel);
 }
diff --git a/tools/testing/selftests/bpf/progs/test_reserve_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_reserve_tcp_hdr_options.c
new file mode 100644
index 000000000000..a40d31c4ae1b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_reserve_tcp_hdr_options.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 ByteDance Inc. */
+
+#include <stddef.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/socket.h>
+#include <linux/bpf.h>
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#define BPF_PROG_TEST_TCP_HDR_OPTIONS
+#include "test_tcp_hdr_options.h"
+
+unsigned int nr_err_reserve = 0;
+unsigned int nr_nospc = 0;
+
+static bool skops_current_mss(const struct bpf_sock_ops *skops)
+{
+	return skops->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS;
+}
+
+static int handle_hdr_opt_len(struct bpf_sock_ops *skops)
+{
+	int err;
+
+	if (skops_current_mss(skops)) {
+		err = bpf_reserve_hdr_opt(skops, 4, 0);
+		if (err) {
+			nr_err_reserve++;
+			RET_CG_ERR(err);
+		}
+	} else {
+		err = bpf_reserve_hdr_opt(skops, 8, 0);
+		if (err) {
+			if (err == -ENOSPC) {
+				nr_nospc++;
+			} else {
+				nr_err_reserve++;
+				RET_CG_ERR(err);
+			}
+		}
+	}
+
+	return CG_OK;
+}
+
+SEC("sockops")
+int reserve_tcp_hdr_options(struct bpf_sock_ops *skops)
+{
+	switch (skops->op) {
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		bpf_sock_ops_cb_flags_set(skops,
+					  skops->bpf_sock_ops_cb_flags |
+					  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
+		break;
+	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
+		return handle_hdr_opt_len(skops);
+	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
+		break;
+	}
+
+	return CG_OK;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-08-27  1:37 ` [PATCH bpf-next v2 1/2] bpf: tcp: " zijianzhang
@ 2024-08-28 21:29   ` Martin KaFai Lau
  2024-08-28 23:01     ` Zijian Zhang
  2024-08-29 16:46     ` Cong Wang
  0 siblings, 2 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2024-08-28 21:29 UTC (permalink / raw)
  To: zijianzhang, Amery Hung, bpf
  Cc: edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf

On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
> From: Amery Hung <amery.hung@bytedance.com>
> 
> This series prevents sockops users from accidentally causing packet
> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
> reserves different option lengths in tcp_sendmsg().
> 
> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
> reserve a space in tcp_send_mss(), which will return the MSS for TSO.
> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
> again to calculate the actual tcp_option_size and skb_push() the total
> header size.
> 
> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
> reserved opt size is smaller than the actual header size, the len of the
> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
> packet if skb->ignore_df is not set.
> 
> To prevent this accidental packet drop, we need to make sure the
> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
> not more than the first time. 

iiuc, it is a bug in the bpf prog itself that did not reserve the same header 
length and caused a drop. It is not the only drop case though for an incorrect 
bpf prog. There are other cases where a bpf prog can accidentally drop a packet.

Do you have an actual use case that the bpf prog cannot reserve the correct 
header length for the same sk ?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-08-28 21:29   ` Martin KaFai Lau
@ 2024-08-28 23:01     ` Zijian Zhang
  2024-08-29  1:00       ` Martin KaFai Lau
  2024-08-29 16:46     ` Cong Wang
  1 sibling, 1 reply; 14+ messages in thread
From: Zijian Zhang @ 2024-08-28 23:01 UTC (permalink / raw)
  To: Martin KaFai Lau, Amery Hung, bpf
  Cc: edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf

On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
>> From: Amery Hung <amery.hung@bytedance.com>
>>
>> This series prevents sockops users from accidentally causing packet
>> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
>> reserves different option lengths in tcp_sendmsg().
>>
>> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
>> reserve a space in tcp_send_mss(), which will return the MSS for TSO.
>> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
>> again to calculate the actual tcp_option_size and skb_push() the total
>> header size.
>>
>> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
>> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
>> reserved opt size is smaller than the actual header size, the len of the
>> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
>> packet if skb->ignore_df is not set.
>>
>> To prevent this accidental packet drop, we need to make sure the
>> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
>> not more than the first time. 
> 
> iiuc, it is a bug in the bpf prog itself that did not reserve the same 
> header length and caused a drop. It is not the only drop case though for 
> an incorrect bpf prog. There are other cases where a bpf prog can 
> accidentally drop a packet.
> 
> Do you have an actual use case that the bpf prog cannot reserve the 
> correct header length for the same sk ?

That's right, it's the bug of the bpf prog itself. We are trying to have
the error reported earlier in eBPF program, instead of successfully
returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
at the end because of it.

By adding this patch, the `remaining` variable passed to the
bpf_skops_hdr_opt_len will be more precise, it takes the previously
reserved size into account. As a result, if users accidentally set an
option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
will return -ENOSPC instead of 0.

We have a use case where we add options to some packets kind of randomly
for the purpose of sampling, and accidentally set a larger option size
than the reserved size. It is the problem of ourselves and takes us
some effort to troubleshoot the root cause.

If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
could be helpful for users to avoid this mistake.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-08-28 23:01     ` Zijian Zhang
@ 2024-08-29  1:00       ` Martin KaFai Lau
  2024-08-30 21:02         ` Zijian Zhang
  0 siblings, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2024-08-29  1:00 UTC (permalink / raw)
  To: Zijian Zhang, Amery Hung, bpf
  Cc: edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf

On 8/28/24 4:01 PM, Zijian Zhang wrote:
> On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
>> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
>>> From: Amery Hung <amery.hung@bytedance.com>
>>>
>>> This series prevents sockops users from accidentally causing packet
>>> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
>>> reserves different option lengths in tcp_sendmsg().
>>>
>>> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
>>> reserve a space in tcp_send_mss(), which will return the MSS for TSO.
>>> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
>>> again to calculate the actual tcp_option_size and skb_push() the total
>>> header size.
>>>
>>> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
>>> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
>>> reserved opt size is smaller than the actual header size, the len of the
>>> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
>>> packet if skb->ignore_df is not set.
>>>
>>> To prevent this accidental packet drop, we need to make sure the
>>> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
>>> not more than the first time. 
>>
>> iiuc, it is a bug in the bpf prog itself that did not reserve the same header 
>> length and caused a drop. It is not the only drop case though for an incorrect 
>> bpf prog. There are other cases where a bpf prog can accidentally drop a packet.
>>
>> Do you have an actual use case that the bpf prog cannot reserve the correct 
>> header length for the same sk ?
> 
> That's right, it's the bug of the bpf prog itself. We are trying to have
> the error reported earlier in eBPF program, instead of successfully
> returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
> at the end because of it.
> 
> By adding this patch, the `remaining` variable passed to the
> bpf_skops_hdr_opt_len will be more precise, it takes the previously
> reserved size into account. As a result, if users accidentally set an
> option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
> will return -ENOSPC instead of 0.

Putting aside it adds more checks, this adds something pretty unique to the bpf 
header option comparing with other dynamic options like sack. e.g. For 
tp->mss_cache, I assume it won't change since the earlier tcp_current_mss() was 
called?

> 
> We have a use case where we add options to some packets kind of randomly
> for the purpose of sampling, and accidentally set a larger option size
> than the reserved size. It is the problem of ourselves and takes us
> some effort to troubleshoot the root cause.
> 
> If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
> could be helpful for users to avoid this mistake.

The bpf_sk_storage can be used to decide if a sk has been sampled.

Also, with bpf_cast_to_kern_ctx and bpf_rdonly_cast, all the checks in this 
patch can be done in the bpf prog itself.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-08-28 21:29   ` Martin KaFai Lau
  2024-08-28 23:01     ` Zijian Zhang
@ 2024-08-29 16:46     ` Cong Wang
  2024-08-30  0:20       ` Martin KaFai Lau
  1 sibling, 1 reply; 14+ messages in thread
From: Cong Wang @ 2024-08-29 16:46 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: zijianzhang, Amery Hung, bpf, edumazet, davem, kuba, pabeni,
	dsahern, ast, daniel, andrii, eddyz87, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
	wangdongdong.6, zhoufeng.zf

On Wed, Aug 28, 2024 at 02:29:17PM -0700, Martin KaFai Lau wrote:
> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
> > From: Amery Hung <amery.hung@bytedance.com>
> > 
> > This series prevents sockops users from accidentally causing packet
> > drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
> > reserves different option lengths in tcp_sendmsg().
> > 
> > Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
> > reserve a space in tcp_send_mss(), which will return the MSS for TSO.
> > Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
> > again to calculate the actual tcp_option_size and skb_push() the total
> > header size.
> > 
> > skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
> > derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
> > reserved opt size is smaller than the actual header size, the len of the
> > skb can exceed the MTU. As a result, ip(6)_fragment will drop the
> > packet if skb->ignore_df is not set.
> > 
> > To prevent this accidental packet drop, we need to make sure the
> > second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
> > not more than the first time.
> 
> iiuc, it is a bug in the bpf prog itself that did not reserve the same
> header length and caused a drop. It is not the only drop case though for an
> incorrect bpf prog. There are other cases where a bpf prog can accidentally
> drop a packet.

But safety is the most important thing for eBPF programs, do we really
allow this kind of bug to happen in eBPF programs?

> 
> Do you have an actual use case that the bpf prog cannot reserve the correct
> header length for the same sk ?

You can think of it as a simple call of bpf_get_prandom_u32():

SEC("sockops")
int bpf_sock_ops_cb(struct bpf_sock_ops *skops)
{
    if (skops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB) {
        return bpf_get_prandom_u32();
    }
    return 0;
}

And eBPF programs are stateful anyway, at least we should not assume
it is stateless since maps are commonly used. Therefore, different invocations
of a same eBPF program are expected to return different values. IMHO,
this is a situation we have to deal with in the kernel, hence stricter
checks are reasonable and necessary.

Thanks.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-08-29 16:46     ` Cong Wang
@ 2024-08-30  0:20       ` Martin KaFai Lau
  0 siblings, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2024-08-30  0:20 UTC (permalink / raw)
  To: Cong Wang
  Cc: zijianzhang, Amery Hung, bpf, edumazet, davem, kuba, pabeni,
	dsahern, ast, daniel, andrii, eddyz87, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
	wangdongdong.6, zhoufeng.zf

On 8/29/24 9:46 AM, Cong Wang wrote:
> On Wed, Aug 28, 2024 at 02:29:17PM -0700, Martin KaFai Lau wrote:
>> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
>>> From: Amery Hung <amery.hung@bytedance.com>
>>>
>>> This series prevents sockops users from accidentally causing packet
>>> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
>>> reserves different option lengths in tcp_sendmsg().
>>>
>>> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
>>> reserve a space in tcp_send_mss(), which will return the MSS for TSO.
>>> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
>>> again to calculate the actual tcp_option_size and skb_push() the total
>>> header size.
>>>
>>> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
>>> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
>>> reserved opt size is smaller than the actual header size, the len of the
>>> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
>>> packet if skb->ignore_df is not set.
>>>
>>> To prevent this accidental packet drop, we need to make sure the
>>> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
>>> not more than the first time.
>>
>> iiuc, it is a bug in the bpf prog itself that did not reserve the same
>> header length and caused a drop. It is not the only drop case though for an
>> incorrect bpf prog. There are other cases where a bpf prog can accidentally
>> drop a packet.
> 
> But safety is the most important thing for eBPF programs, do we really
> allow this kind of bug to happen in eBPF programs?
> 
>>
>> Do you have an actual use case that the bpf prog cannot reserve the correct
>> header length for the same sk ?
> 
> You can think of it as a simple call of bpf_get_prandom_u32():
> 
> SEC("sockops")
> int bpf_sock_ops_cb(struct bpf_sock_ops *skops)
> {
>      if (skops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB) {
>          return bpf_get_prandom_u32();

It compiles but this won't even pass the verifier.

If you want to go to this extreme to consider any random bpf program, it is 
essentially deploying a fuzzer to the production traffic, right? Sure, syzbot 
has programs that are rejected by verifier or cause a packet drop. afaik, I 
don't recall syzbot reported them as a crash.

Is a drop a safety issue as you said? I don't think so. It is why this set is 
tagged as bpf-next also and I agree. What is the hurry here?

Is it something that can be improved? may be. Note that the patchwork status has 
not been changed yet. Instead of wasting time here, please allow Zijian (/Amery) 
to continue the discussion on another thread. I have asked question on the 
changes and also suggested other ways.


>      }
>      return 0;
> }
> 
> And eBPF programs are stateful anyway, at least we should not assume
> it is stateless since maps are commonly used. Therefore, different invocations
> of a same eBPF program are expected to return different values. IMHO,
> this is a situation we have to deal with in the kernel, hence stricter
> checks are reasonable and necessary.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-08-29  1:00       ` Martin KaFai Lau
@ 2024-08-30 21:02         ` Zijian Zhang
  2024-09-03 22:38           ` Martin KaFai Lau
  0 siblings, 1 reply; 14+ messages in thread
From: Zijian Zhang @ 2024-08-30 21:02 UTC (permalink / raw)
  To: Martin KaFai Lau, Amery Hung, bpf
  Cc: edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf

On 8/28/24 6:00 PM, Martin KaFai Lau wrote:
> On 8/28/24 4:01 PM, Zijian Zhang wrote:
>> On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
>>> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
>>>> From: Amery Hung <amery.hung@bytedance.com>
>>>>
>>>> This series prevents sockops users from accidentally causing packet
>>>> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
>>>> reserves different option lengths in tcp_sendmsg().
>>>>
>>>> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be 
>>>> called to
>>>> reserve a space in tcp_send_mss(), which will return the MSS for TSO.
>>>> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in 
>>>> __tcp_transmit_skb()
>>>> again to calculate the actual tcp_option_size and skb_push() the total
>>>> header size.
>>>>
>>>> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
>>>> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
>>>> reserved opt size is smaller than the actual header size, the len of 
>>>> the
>>>> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
>>>> packet if skb->ignore_df is not set.
>>>>
>>>> To prevent this accidental packet drop, we need to make sure the
>>>> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
>>>> not more than the first time. 
>>>
>>> iiuc, it is a bug in the bpf prog itself that did not reserve the 
>>> same header length and caused a drop. It is not the only drop case 
>>> though for an incorrect bpf prog. There are other cases where a bpf 
>>> prog can accidentally drop a packet.
>>>
>>> Do you have an actual use case that the bpf prog cannot reserve the 
>>> correct header length for the same sk ?
>>
>> That's right, it's the bug of the bpf prog itself. We are trying to have
>> the error reported earlier in eBPF program, instead of successfully
>> returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
>> at the end because of it.
>>
>> By adding this patch, the `remaining` variable passed to the
>> bpf_skops_hdr_opt_len will be more precise, it takes the previously
>> reserved size into account. As a result, if users accidentally set an
>> option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
>> will return -ENOSPC instead of 0.
> 
> Putting aside it adds more checks, this adds something pretty unique to 
> the bpf header option comparing with other dynamic options like sack. 
> e.g. For tp->mss_cache, I assume it won't change since the earlier 
> tcp_current_mss() was called?
> 

Agreed, this check is very unique comparing with other dynamic options.
How about moving the check into function bpf_skops_hdr_opt_len? It
already has some logic to check if the context is in tcp_current_mss.
Does it look more reasonable?

`reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb)`
For tp->mss_cache here, yes, I also think it won't change,

I am trying to get the reserved bpf hdr size. Considering other dynamic
options, this equation is not precise, and may change it to
`reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb) - size`?

Or, not sure if adding a field in tcp_sock to precisely cache the
reserved bpf hdr size is a good idea?

>>
>> We have a use case where we add options to some packets kind of randomly
>> for the purpose of sampling, and accidentally set a larger option size
>> than the reserved size. It is the problem of ourselves and takes us
>> some effort to troubleshoot the root cause.
>>
>> If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
>> could be helpful for users to avoid this mistake.
> 
> The bpf_sk_storage can be used to decide if a sk has been sampled.
> 
> Also, with bpf_cast_to_kern_ctx and bpf_rdonly_cast, all the checks in 
> this patch can be done in the bpf prog itself.
> 

Thanks for pointing this out! Agreed, the check can be implemented in
eBPF progs.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-08-30 21:02         ` Zijian Zhang
@ 2024-09-03 22:38           ` Martin KaFai Lau
  2024-09-05 18:19             ` Zijian Zhang
  0 siblings, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2024-09-03 22:38 UTC (permalink / raw)
  To: Zijian Zhang, Amery Hung
  Cc: bpf, edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf

On 8/30/24 2:02 PM, Zijian Zhang wrote:
> On 8/28/24 6:00 PM, Martin KaFai Lau wrote:
>> On 8/28/24 4:01 PM, Zijian Zhang wrote:
>>> On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
>>>> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
>>>>> From: Amery Hung <amery.hung@bytedance.com>
>>>>>
>>>>> This series prevents sockops users from accidentally causing packet
>>>>> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
>>>>> reserves different option lengths in tcp_sendmsg().
>>>>>
>>>>> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
>>>>> reserve a space in tcp_send_mss(), which will return the MSS for TSO.
>>>>> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
>>>>> again to calculate the actual tcp_option_size and skb_push() the total
>>>>> header size.
>>>>>
>>>>> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
>>>>> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
>>>>> reserved opt size is smaller than the actual header size, the len of the
>>>>> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
>>>>> packet if skb->ignore_df is not set.
>>>>>
>>>>> To prevent this accidental packet drop, we need to make sure the
>>>>> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
>>>>> not more than the first time. 
>>>>
>>>> iiuc, it is a bug in the bpf prog itself that did not reserve the same 
>>>> header length and caused a drop. It is not the only drop case though for an 
>>>> incorrect bpf prog. There are other cases where a bpf prog can accidentally 
>>>> drop a packet.
>>>>
>>>> Do you have an actual use case that the bpf prog cannot reserve the correct 
>>>> header length for the same sk ?
>>>
>>> That's right, it's the bug of the bpf prog itself. We are trying to have
>>> the error reported earlier in eBPF program, instead of successfully
>>> returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
>>> at the end because of it.
>>>
>>> By adding this patch, the `remaining` variable passed to the
>>> bpf_skops_hdr_opt_len will be more precise, it takes the previously
>>> reserved size into account. As a result, if users accidentally set an
>>> option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
>>> will return -ENOSPC instead of 0.
>>
>> Putting aside it adds more checks, this adds something pretty unique to the 
>> bpf header option comparing with other dynamic options like sack. e.g. For tp- 
>> >mss_cache, I assume it won't change since the earlier tcp_current_mss() was 
>> called?
>>
> 
> Agreed, this check is very unique comparing with other dynamic options.
> How about moving the check into function bpf_skops_hdr_opt_len? It
> already has some logic to check if the context is in tcp_current_mss.
> Does it look more reasonable?

Yes, it probably may be better to put that into the bpf_skops_hdr_opt_len(). 
This is implementation details which is something for later after figuring out 
if the reserved_opt_spc change is correct and needed.

> 
> `reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb)`
> For tp->mss_cache here, yes, I also think it won't change,

This needs more details and clear explanation on why it is the case and why the 
existing regular bpf prog will continue to work.

afaik, mss_cache and tcp_skb_seglen() must be in-sync and the same between calls 
for this to work . From thinking about it, it should be but I haven't looked at 
all cases. Missing "- size" does not help the confidence here also.

Also, when "skb != NULL", I don't understand why the "MAX_TCP_OPTION_SPACE - 
size" is still being considered. I am likely missing something here. If the 
above formula is correct, why reserved_opt_spc is not always used for the "skb 
!= NULL" case and still need to be compared with the "MAX_TCP_OPTION_SPACE - size"?

> 
> I am trying to get the reserved bpf hdr size. Considering other dynamic
> options, this equation is not precise, and may change it to
> `reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb) - size`?

"- size" is needed. The test did not catch it?

> 
> Or, not sure if adding a field in tcp_sock to precisely cache the
> reserved bpf hdr size is a good idea?

imo, adding one field in tcp_sock for this is overkill.

It seems your bpf prog is randomly reserving space. If this is the case, giving 
a fail signal in bpf_reserve_hdr_opt() does not improve the success rate for 
your random bpf prog to reserve header. I don't think adding a field or the 
change in this patch really solves anything in your randomly reserving space use 
case ?

> 
>>>
>>> We have a use case where we add options to some packets kind of randomly
>>> for the purpose of sampling, and accidentally set a larger option size
>>> than the reserved size. It is the problem of ourselves and takes us
>>> some effort to troubleshoot the root cause.
>>>
>>> If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
>>> could be helpful for users to avoid this mistake.
>>
>> The bpf_sk_storage can be used to decide if a sk has been sampled.
>>
>> Also, with bpf_cast_to_kern_ctx and bpf_rdonly_cast, all the checks in this 
>> patch can be done in the bpf prog itself.
>>
> 
> Thanks for pointing this out! Agreed, the check can be implemented in
> eBPF progs.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-09-03 22:38           ` Martin KaFai Lau
@ 2024-09-05 18:19             ` Zijian Zhang
  2024-09-05 19:38               ` Martin KaFai Lau
  0 siblings, 1 reply; 14+ messages in thread
From: Zijian Zhang @ 2024-09-05 18:19 UTC (permalink / raw)
  To: Martin KaFai Lau, Amery Hung
  Cc: bpf, edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf

On 9/3/24 3:38 PM, Martin KaFai Lau wrote:
> On 8/30/24 2:02 PM, Zijian Zhang wrote:
>> On 8/28/24 6:00 PM, Martin KaFai Lau wrote:
>>> On 8/28/24 4:01 PM, Zijian Zhang wrote:
>>>> On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
>>>>> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
>>>>>> From: Amery Hung <amery.hung@bytedance.com>
>>>>>>
>>>>>> This series prevents sockops users from accidentally causing packet
>>>>>> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
>>>>>> reserves different option lengths in tcp_sendmsg().
>>>>>>
>>>>>> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be 
>>>>>> called to
>>>>>> reserve a space in tcp_send_mss(), which will return the MSS for TSO.
>>>>>> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in 
>>>>>> __tcp_transmit_skb()
>>>>>> again to calculate the actual tcp_option_size and skb_push() the 
>>>>>> total
>>>>>> header size.
>>>>>>
>>>>>> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, 
>>>>>> which is
>>>>>> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
>>>>>> reserved opt size is smaller than the actual header size, the len 
>>>>>> of the
>>>>>> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
>>>>>> packet if skb->ignore_df is not set.
>>>>>>
>>>>>> To prevent this accidental packet drop, we need to make sure the
>>>>>> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
>>>>>> not more than the first time. 
>>>>>
>>>>> iiuc, it is a bug in the bpf prog itself that did not reserve the 
>>>>> same header length and caused a drop. It is not the only drop case 
>>>>> though for an incorrect bpf prog. There are other cases where a bpf 
>>>>> prog can accidentally drop a packet.
>>>>>
>>>>> Do you have an actual use case that the bpf prog cannot reserve the 
>>>>> correct header length for the same sk ?
>>>>
>>>> That's right, it's the bug of the bpf prog itself. We are trying to 
>>>> have
>>>> the error reported earlier in eBPF program, instead of successfully
>>>> returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
>>>> at the end because of it.
>>>>
>>>> By adding this patch, the `remaining` variable passed to the
>>>> bpf_skops_hdr_opt_len will be more precise, it takes the previously
>>>> reserved size into account. As a result, if users accidentally set an
>>>> option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
>>>> will return -ENOSPC instead of 0.
>>>
>>> Putting aside it adds more checks, this adds something pretty unique 
>>> to the bpf header option comparing with other dynamic options like 
>>> sack. e.g. For tp- >mss_cache, I assume it won't change since the 
>>> earlier tcp_current_mss() was called?
>>>
>>
>> Agreed, this check is very unique comparing with other dynamic options.
>> How about moving the check into function bpf_skops_hdr_opt_len? It
>> already has some logic to check if the context is in tcp_current_mss.
>> Does it look more reasonable?
> 
> Yes, it probably may be better to put that into the 
> bpf_skops_hdr_opt_len(). This is implementation details which is 
> something for later after figuring out if the reserved_opt_spc change is 
> correct and needed.
> 
>>
>> `reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb)`
>> For tp->mss_cache here, yes, I also think it won't change,
> 
> This needs more details and clear explanation on why it is the case and 
> why the existing regular bpf prog will continue to work.
> 
> afaik, mss_cache and tcp_skb_seglen() must be in-sync and the same 
> between calls for this to work . From thinking about it, it should be 
> but I haven't looked at all cases. Missing "- size" does not help the 
> confidence here also.
> 
> Also, when "skb != NULL", I don't understand why the 
> "MAX_TCP_OPTION_SPACE - size" is still being considered. I am likely 
> missing something here. If the above formula is correct, why 
> reserved_opt_spc is not always used for the "skb != NULL" case and still 
> need to be compared with the "MAX_TCP_OPTION_SPACE - size"?
> 

Cases I can think of are as follows,
- When it's not a GSO skb, tcp_skb_seglen will simply return skb->len,
it might make `tp->mss_cache - tcp_skb_seglen(skb)` a large number.

- When we are in tcp_mtu_probe, tp->mss_cache will be smaller than
tcp_skb_seglen(skb), which makes the equation again a large number.


"MAX_TCP_OPTION_SPACE - size" is considered here as the strict upper
bound, while reserved_opt_spc is expected to be a stricter upper bound.
In the above or other cases, where the equation malfunctioned, we can
always fall back to the original bound.


I am not sure which way to get the reserved size is better,
1. We could precisely cache the result of the reserved size, may
need to have a new field for it, I agree that a field in tcp_sock
is overkill.

2. In this patch, we use an equation to infer the value of it. There are
some concerns with tp->mss_cache and tcp_skb_seglen(skb) here, I agree.
If tp->mss_cache and tcp_skb_seglen(skb) might not be in-sync, we may
have an underestimated reserved_opt_spc, it may break existing bpf
progs. If this method is preferred I will do more investigations to
verify it or modify the equation :)

Do you have preference to either option?

>>
>> I am trying to get the reserved bpf hdr size. Considering other dynamic
>> options, this equation is not precise, and may change it to
>> `reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb) - size`?
> 
> "- size" is needed. The test did not catch it?
> 
>>
>> Or, not sure if adding a field in tcp_sock to precisely cache the
>> reserved bpf hdr size is a good idea?
> 
> imo, adding one field in tcp_sock for this is overkill.
> 
> It seems your bpf prog is randomly reserving space. If this is the case, 
> giving a fail signal in bpf_reserve_hdr_opt() does not improve the 
> success rate for your random bpf prog to reserve header. I don't think 
> adding a field or the change in this patch really solves anything in 
> your randomly reserving space use case ?
> 

It's true that it won't help with the success rate, but it could help
save packets from being dropped.

The goal is to let bpf_reserve_hdr_opt fail, when it is supposed to.
And, as a bonus effect, it could also save packets from being dropped.

When the accidental mistake happens, we want bpf_reserve_hdr_opt to
fail, so that `sock_ops.remaining_opt_len` won't be updated.
If it is still updated in this case, the packet will be dropped in
ip_fragment because `IPCB(skb)->frag_max_size > mtu`.

>>
>>>>
>>>> We have a use case where we add options to some packets kind of 
>>>> randomly
>>>> for the purpose of sampling, and accidentally set a larger option size
>>>> than the reserved size. It is the problem of ourselves and takes us
>>>> some effort to troubleshoot the root cause.
>>>>
>>>> If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
>>>> could be helpful for users to avoid this mistake.
>>>
>>> The bpf_sk_storage can be used to decide if a sk has been sampled.
>>>
>>> Also, with bpf_cast_to_kern_ctx and bpf_rdonly_cast, all the checks 
>>> in this patch can be done in the bpf prog itself.
>>>
>>
>> Thanks for pointing this out! Agreed, the check can be implemented in
>> eBPF progs.
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-09-05 18:19             ` Zijian Zhang
@ 2024-09-05 19:38               ` Martin KaFai Lau
  2024-09-05 20:20                 ` Zijian Zhang
  0 siblings, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2024-09-05 19:38 UTC (permalink / raw)
  To: Zijian Zhang, Amery Hung
  Cc: bpf, edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf

On 9/5/24 11:19 AM, Zijian Zhang wrote:
> On 9/3/24 3:38 PM, Martin KaFai Lau wrote:
>> On 8/30/24 2:02 PM, Zijian Zhang wrote:
>>> On 8/28/24 6:00 PM, Martin KaFai Lau wrote:
>>>> On 8/28/24 4:01 PM, Zijian Zhang wrote:
>>>>> On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
>>>>>> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
>>>>>>> From: Amery Hung <amery.hung@bytedance.com>
>>>>>>>
>>>>>>> This series prevents sockops users from accidentally causing packet
>>>>>>> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
>>>>>>> reserves different option lengths in tcp_sendmsg().
>>>>>>>
>>>>>>> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
>>>>>>> reserve a space in tcp_send_mss(), which will return the MSS for TSO.
>>>>>>> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
>>>>>>> again to calculate the actual tcp_option_size and skb_push() the total
>>>>>>> header size.
>>>>>>>
>>>>>>> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
>>>>>>> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
>>>>>>> reserved opt size is smaller than the actual header size, the len of the
>>>>>>> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
>>>>>>> packet if skb->ignore_df is not set.
>>>>>>>
>>>>>>> To prevent this accidental packet drop, we need to make sure the
>>>>>>> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
>>>>>>> not more than the first time. 
>>>>>>
>>>>>> iiuc, it is a bug in the bpf prog itself that did not reserve the same 
>>>>>> header length and caused a drop. It is not the only drop case though for 
>>>>>> an incorrect bpf prog. There are other cases where a bpf prog can 
>>>>>> accidentally drop a packet.
>>>>>>
>>>>>> Do you have an actual use case that the bpf prog cannot reserve the 
>>>>>> correct header length for the same sk ?
>>>>>
>>>>> That's right, it's the bug of the bpf prog itself. We are trying to have
>>>>> the error reported earlier in eBPF program, instead of successfully
>>>>> returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
>>>>> at the end because of it.
>>>>>
>>>>> By adding this patch, the `remaining` variable passed to the
>>>>> bpf_skops_hdr_opt_len will be more precise, it takes the previously
>>>>> reserved size into account. As a result, if users accidentally set an
>>>>> option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
>>>>> will return -ENOSPC instead of 0.
>>>>
>>>> Putting aside it adds more checks, this adds something pretty unique to the 
>>>> bpf header option comparing with other dynamic options like sack. e.g. For 
>>>> tp- >mss_cache, I assume it won't change since the earlier tcp_current_mss() 
>>>> was called?
>>>>
>>>
>>> Agreed, this check is very unique comparing with other dynamic options.
>>> How about moving the check into function bpf_skops_hdr_opt_len? It
>>> already has some logic to check if the context is in tcp_current_mss.
>>> Does it look more reasonable?
>>
>> Yes, it probably may be better to put that into the bpf_skops_hdr_opt_len(). 
>> This is implementation details which is something for later after figuring out 
>> if the reserved_opt_spc change is correct and needed.
>>
>>>
>>> `reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb)`
>>> For tp->mss_cache here, yes, I also think it won't change,
>>
>> This needs more details and clear explanation on why it is the case and why 
>> the existing regular bpf prog will continue to work.
>>
>> afaik, mss_cache and tcp_skb_seglen() must be in-sync and the same between 
>> calls for this to work . From thinking about it, it should be but I haven't 
>> looked at all cases. Missing "- size" does not help the confidence here also.
>>
>> Also, when "skb != NULL", I don't understand why the "MAX_TCP_OPTION_SPACE - 
>> size" is still being considered. I am likely missing something here. If the 
>> above formula is correct, why reserved_opt_spc is not always used for the 
>> "skb != NULL" case and still need to be compared with the 
>> "MAX_TCP_OPTION_SPACE - size"?
>>
> 
> Cases I can think of are as follows,
> - When it's not a GSO skb, tcp_skb_seglen will simply return skb->len,
> it might make `tp->mss_cache - tcp_skb_seglen(skb)` a large number.
> 
> - When we are in tcp_mtu_probe, tp->mss_cache will be smaller than
> tcp_skb_seglen(skb), which makes the equation again a large number.

In tcp_mtu_probe, mss_cache is (smaller) than the tcp_skb_seglen(skb)?

> 
> 
> "MAX_TCP_OPTION_SPACE - size" is considered here as the strict upper
> bound, while reserved_opt_spc is expected to be a stricter upper bound.
> In the above or other cases, where the equation malfunctioned, we can
> always fall back to the original bound.

Make sense. Thanks for the explanation. The commit message could have used this 
details.

> 
> 
> I am not sure which way to get the reserved size is better,
> 1. We could precisely cache the result of the reserved size, may
> need to have a new field for it, I agree that a field in tcp_sock
> is overkill.
> 
> 2. In this patch, we use an equation to infer the value of it. There are
> some concerns with tp->mss_cache and tcp_skb_seglen(skb) here, I agree.
> If tp->mss_cache and tcp_skb_seglen(skb) might not be in-sync, we may
> have an underestimated reserved_opt_spc, it may break existing bpf
> progs. If this method is preferred I will do more investigations to
> verify it or modify the equation :)

imo, the bpf prog can always use bpf_sk_storage to add fields to a sock and 
store the previous reserved space there. I think this is the solution you should 
use in your use case which randomly reserves header space. Adding a field in the 
tcp_sock feels wrong especially the only use case I am hearing so far is a bpf 
prog randomly reserving header spaces.

If (2) can be convinced to be correct, improving bpf_reserve_hdr_opt() is fine 
as long as it does not break the existing program. I expect some tests to cover 
this.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-09-05 19:38               ` Martin KaFai Lau
@ 2024-09-05 20:20                 ` Zijian Zhang
  2024-09-05 21:07                   ` Martin KaFai Lau
  0 siblings, 1 reply; 14+ messages in thread
From: Zijian Zhang @ 2024-09-05 20:20 UTC (permalink / raw)
  To: Martin KaFai Lau, Amery Hung
  Cc: bpf, edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf

On 9/5/24 12:38 PM, Martin KaFai Lau wrote:
> On 9/5/24 11:19 AM, Zijian Zhang wrote:
>> On 9/3/24 3:38 PM, Martin KaFai Lau wrote:
>>> On 8/30/24 2:02 PM, Zijian Zhang wrote:
>>>> On 8/28/24 6:00 PM, Martin KaFai Lau wrote:
>>>>> On 8/28/24 4:01 PM, Zijian Zhang wrote:
>>>>>> On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
>>>>>>> On 8/26/24 6:37 PM, zijianzhang@bytedance.com wrote:
>>>>>>>> From: Amery Hung <amery.hung@bytedance.com>
>>>>>>>>
>>>>>>>> This series prevents sockops users from accidentally causing packet
>>>>>>>> drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
>>>>>>>> reserves different option lengths in tcp_sendmsg().
>>>>>>>>
>>>>>>>> Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be 
>>>>>>>> called to
>>>>>>>> reserve a space in tcp_send_mss(), which will return the MSS for 
>>>>>>>> TSO.
>>>>>>>> Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in 
>>>>>>>> __tcp_transmit_skb()
>>>>>>>> again to calculate the actual tcp_option_size and skb_push() the 
>>>>>>>> total
>>>>>>>> header size.
>>>>>>>>
>>>>>>>> skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, 
>>>>>>>> which is
>>>>>>>> derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
>>>>>>>> reserved opt size is smaller than the actual header size, the 
>>>>>>>> len of the
>>>>>>>> skb can exceed the MTU. As a result, ip(6)_fragment will drop the
>>>>>>>> packet if skb->ignore_df is not set.
>>>>>>>>
>>>>>>>> To prevent this accidental packet drop, we need to make sure the
>>>>>>>> second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves 
>>>>>>>> space
>>>>>>>> not more than the first time. 
>>>>>>>
>>>>>>> iiuc, it is a bug in the bpf prog itself that did not reserve the 
>>>>>>> same header length and caused a drop. It is not the only drop 
>>>>>>> case though for an incorrect bpf prog. There are other cases 
>>>>>>> where a bpf prog can accidentally drop a packet.
>>>>>>>
>>>>>>> Do you have an actual use case that the bpf prog cannot reserve 
>>>>>>> the correct header length for the same sk ?
>>>>>>
>>>>>> That's right, it's the bug of the bpf prog itself. We are trying 
>>>>>> to have
>>>>>> the error reported earlier in eBPF program, instead of successfully
>>>>>> returning from bpf_sock_ops_reserve_hdr_opt but leading to packet 
>>>>>> drop
>>>>>> at the end because of it.
>>>>>>
>>>>>> By adding this patch, the `remaining` variable passed to the
>>>>>> bpf_skops_hdr_opt_len will be more precise, it takes the previously
>>>>>> reserved size into account. As a result, if users accidentally set an
>>>>>> option size larger than the reserved size, 
>>>>>> bpf_sock_ops_reserve_hdr_opt
>>>>>> will return -ENOSPC instead of 0.
>>>>>
>>>>> Putting aside it adds more checks, this adds something pretty 
>>>>> unique to the bpf header option comparing with other dynamic 
>>>>> options like sack. e.g. For tp- >mss_cache, I assume it won't 
>>>>> change since the earlier tcp_current_mss() was called?
>>>>>
>>>>
>>>> Agreed, this check is very unique comparing with other dynamic options.
>>>> How about moving the check into function bpf_skops_hdr_opt_len? It
>>>> already has some logic to check if the context is in tcp_current_mss.
>>>> Does it look more reasonable?
>>>
>>> Yes, it probably may be better to put that into the 
>>> bpf_skops_hdr_opt_len(). This is implementation details which is 
>>> something for later after figuring out if the reserved_opt_spc change 
>>> is correct and needed.
>>>
>>>>
>>>> `reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb)`
>>>> For tp->mss_cache here, yes, I also think it won't change,
>>>
>>> This needs more details and clear explanation on why it is the case 
>>> and why the existing regular bpf prog will continue to work.
>>>
>>> afaik, mss_cache and tcp_skb_seglen() must be in-sync and the same 
>>> between calls for this to work . From thinking about it, it should be 
>>> but I haven't looked at all cases. Missing "- size" does not help the 
>>> confidence here also.
>>>
>>> Also, when "skb != NULL", I don't understand why the 
>>> "MAX_TCP_OPTION_SPACE - size" is still being considered. I am likely 
>>> missing something here. If the above formula is correct, why 
>>> reserved_opt_spc is not always used for the "skb != NULL" case and 
>>> still need to be compared with the "MAX_TCP_OPTION_SPACE - size"?
>>>
>>
>> Cases I can think of are as follows,
>> - When it's not a GSO skb, tcp_skb_seglen will simply return skb->len,
>> it might make `tp->mss_cache - tcp_skb_seglen(skb)` a large number.
>>
>> - When we are in tcp_mtu_probe, tp->mss_cache will be smaller than
>> tcp_skb_seglen(skb), which makes the equation again a large number.
> 
> In tcp_mtu_probe, mss_cache is (smaller) than the tcp_skb_seglen(skb)?
> 

```
tcp_init_tso_segs(nskb, nskb->len);
if (!tcp_transmit_skb(sk, nskb, 1, GFP_ATOMIC)) ...
```

In the tcp_transmit_skb inside tcp_mtu_probe, it tries to send an skb
with larger mss, so I assume mss_cache will be smaller than
tcp_skb_seglen(skb). Sorry for the confusion here.

>>
>>
>> "MAX_TCP_OPTION_SPACE - size" is considered here as the strict upper
>> bound, while reserved_opt_spc is expected to be a stricter upper bound.
>> In the above or other cases, where the equation malfunctioned, we can
>> always fall back to the original bound.
> 
> Make sense. Thanks for the explanation. The commit message could have 
> used this details.
> 

No problem ;)

>>
>>
>> I am not sure which way to get the reserved size is better,
>> 1. We could precisely cache the result of the reserved size, may
>> need to have a new field for it, I agree that a field in tcp_sock
>> is overkill.
>>
>> 2. In this patch, we use an equation to infer the value of it. There are
>> some concerns with tp->mss_cache and tcp_skb_seglen(skb) here, I agree.
>> If tp->mss_cache and tcp_skb_seglen(skb) might not be in-sync, we may
>> have an underestimated reserved_opt_spc, it may break existing bpf
>> progs. If this method is preferred I will do more investigations to
>> verify it or modify the equation :)
> 
> imo, the bpf prog can always use bpf_sk_storage to add fields to a sock 
> and store the previous reserved space there. I think this is the 
> solution you should use in your use case which randomly reserves header 
> space. Adding a field in the tcp_sock feels wrong especially the only 
> use case I am hearing so far is a bpf prog randomly reserving header 
> spaces.

Agree, thanks for the suggestion.

> 
> If (2) can be convinced to be correct, improving bpf_reserve_hdr_opt() 
> is fine as long as it does not break the existing program. I expect some 
> tests to cover this.
> 

Got it, will take a further look and do more tests.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU
  2024-09-05 20:20                 ` Zijian Zhang
@ 2024-09-05 21:07                   ` Martin KaFai Lau
  0 siblings, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2024-09-05 21:07 UTC (permalink / raw)
  To: Zijian Zhang, Amery Hung
  Cc: bpf, edumazet, davem, kuba, pabeni, dsahern, ast, daniel, andrii,
	eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, shuah, xiyou.wangcong, wangdongdong.6,
	zhoufeng.zf

On 9/5/24 1:20 PM, Zijian Zhang wrote:
>>>
>>> Cases I can think of are as follows,
>>> - When it's not a GSO skb, tcp_skb_seglen will simply return skb->len,
>>> it might make `tp->mss_cache - tcp_skb_seglen(skb)` a large number.
>>>
>>> - When we are in tcp_mtu_probe, tp->mss_cache will be smaller than
>>> tcp_skb_seglen(skb), which makes the equation again a large number.
>>
>> In tcp_mtu_probe, mss_cache is (smaller) than the tcp_skb_seglen(skb)?
>>
> 
> ```
> tcp_init_tso_segs(nskb, nskb->len);
> if (!tcp_transmit_skb(sk, nskb, 1, GFP_ATOMIC)) ...
> ```
> 
> In the tcp_transmit_skb inside tcp_mtu_probe, it tries to send an skb
> with larger mss, so I assume mss_cache will be smaller than
> tcp_skb_seglen(skb). Sorry for the confusion here.

hmm... "mss_cache - tcp_skb_seglen(skb)" and mss_cache could be smaller...
This is another signal that this approach does not sound right. I am not 
positive tbh. Given that I have already suggested more than one other ways. If 
you really eager to pursue this route to improve bpf_reserve_hdr_opt(), the 
tests coverage has to be convincing enough to cover corner cases like this for 
example.

pw-bot: cr

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-09-05 21:07 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-27  1:37 [PATCH bpf-next v2 0/2] prevent bpf_reserve_hdr_opt() from growing skb larger than MTU zijianzhang
2024-08-27  1:37 ` [PATCH bpf-next v2 1/2] bpf: tcp: " zijianzhang
2024-08-28 21:29   ` Martin KaFai Lau
2024-08-28 23:01     ` Zijian Zhang
2024-08-29  1:00       ` Martin KaFai Lau
2024-08-30 21:02         ` Zijian Zhang
2024-09-03 22:38           ` Martin KaFai Lau
2024-09-05 18:19             ` Zijian Zhang
2024-09-05 19:38               ` Martin KaFai Lau
2024-09-05 20:20                 ` Zijian Zhang
2024-09-05 21:07                   ` Martin KaFai Lau
2024-08-29 16:46     ` Cong Wang
2024-08-30  0:20       ` Martin KaFai Lau
2024-08-27  1:37 ` [PATCH bpf-next v2 2/2] bpf: selftests: reserve smaller tcp header options than the actual size zijianzhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox