* [PATCH bpf v3 0/2] bpf: fix wrong copied_seq calculation and add tests
@ 2024-12-18 5:34 Jiayuan Chen
2024-12-18 5:34 ` [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation Jiayuan Chen
2024-12-18 5:34 ` [PATCH bpf v3 2/2] selftests/bpf: add strparser test for bpf Jiayuan Chen
0 siblings, 2 replies; 10+ messages in thread
From: Jiayuan Chen @ 2024-12-18 5:34 UTC (permalink / raw)
To: bpf
Cc: martin.lau, ast, edumazet, jakub, davem, dsahern, kuba, pabeni,
linux-kernel, song, john.fastabend, andrii, mhal, yonghong.song,
daniel, xiyou.wangcong, horms, Jiayuan Chen
A previous commit, discussed in the thread
http://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com,
updated 'sk->copied_seq' directly in tcp_eat_skb() when the action of a
BPF program was SK_REDIRECT. For other actions, such as SK_PASS, the
update logic for 'sk->copied_seq' was moved to tcp_bpf_recvmsg_parser()
to keep the 'fionread' feature accurate.
That commit works for the stream_verdict-only scenario, as it also
modified the path 'sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb'
so that tcp_read_skb() no longer updates 'sk->copied_seq'.
However, when both stream_parser and stream_verdict are active (the
strparser use case), tcp_read_sock() is used instead of tcp_read_skb()
(sk_data_ready->strp_data_ready->tcp_read_sock), and tcp_read_sock()
still updates 'sk->copied_seq', leading to duplicated updates.
In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
in both tcp_read_sock() and tcp_bpf_recvmsg_parser().
The issue causes incorrect copied_seq calculations, which prevents
correct data reads from the recv() interface in userspace.
Modifying tcp_read_sock() or the strparser implementation directly is
not reasonable, as they are widely used by other modules.
Here, we introduce tcp_bpf_read_sock() and install it in place of
'sk->sk_socket->ops->read_sock' (as 'tls_build_proto()' does in
tls_main.c). A similar replacement is already performed when updating
tcp_bpf_prots in tcp_bpf.c, so the approach is not unusual.
(Note that checkpatch.pl may complain about a missing 'const' qualifier
when we define the BPF-specific 'proto_ops', but we have to drop the
qualifier because we need to update the struct at init time.)
We also remove the strparser check from tcp_eat_skb(), since our custom
tcp_bpf_read_sock() does not update copied_seq.
Since strparser currently supports only TCP, it's sufficient for 'ops' to
inherit inet_stream_ops.
We add test cases for BPF + strparser and split them out of
sockmap_basic, since more strparser test cases will be added in the
future.
Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
---
v2-v3: save skb data offset for ENOMEM. (John Fastabend)
https://lore.kernel.org/bpf/20241209152740.281125-1-mrpre@163.com/#r
v1-v2: fix patchwork fail by adding Fixes tag
---
Jiayuan Chen (2):
bpf: fix wrong copied_seq calculation
selftests/bpf: add strparser test for bpf
include/linux/skmsg.h | 2 +
include/net/tcp.h | 1 +
net/core/skmsg.c | 3 +
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_bpf.c | 108 +++++-
.../selftests/bpf/prog_tests/sockmap_basic.c | 53 ---
.../selftests/bpf/prog_tests/sockmap_strp.c | 344 ++++++++++++++++++
.../selftests/bpf/progs/test_sockmap_strp.c | 51 +++
8 files changed, 507 insertions(+), 57 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_strp.c
base-commit: 5a6ea7022ff4d2a65ae328619c586d6a8909b48b
--
2.43.5
* [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation
2024-12-18 5:34 [PATCH bpf v3 0/2] bpf: fix wrong copied_seq calculation and add tests Jiayuan Chen
@ 2024-12-18 5:34 ` Jiayuan Chen
2024-12-18 15:35 ` Jakub Sitnicki
2024-12-18 5:34 ` [PATCH bpf v3 2/2] selftests/bpf: add strparser test for bpf Jiayuan Chen
1 sibling, 1 reply; 10+ messages in thread
From: Jiayuan Chen @ 2024-12-18 5:34 UTC (permalink / raw)
To: bpf
Cc: martin.lau, ast, edumazet, jakub, davem, dsahern, kuba, pabeni,
linux-kernel, song, john.fastabend, andrii, mhal, yonghong.song,
daniel, xiyou.wangcong, horms, Jiayuan Chen
'sk->copied_seq' was updated in tcp_eat_skb() when the action of a BPF
program was SK_REDIRECT. For other actions, such as SK_PASS, the update
logic for 'sk->copied_seq' was moved to tcp_bpf_recvmsg_parser() to
keep the 'fionread' feature accurate.
It works for the stream_verdict-only scenario, as that commit also
modified the path 'sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb'
so that tcp_read_skb() no longer updates 'sk->copied_seq'.
However, when both stream_parser and stream_verdict are active (the
strparser use case), tcp_read_sock() is used instead of tcp_read_skb()
(sk_data_ready->strp_data_ready->tcp_read_sock), and tcp_read_sock()
still updates 'sk->copied_seq', leading to duplicated updates.
In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
in both tcp_read_sock() and tcp_bpf_recvmsg_parser().
The issue causes incorrect copied_seq calculations, which prevents
correct data reads from the recv() interface in userspace.
Modifying tcp_read_sock() or the strparser implementation directly is
not reasonable, as they are widely used by other modules.
Here, we introduce tcp_bpf_read_sock() and install it in place of
'sk->sk_socket->ops->read_sock' (as 'tls_build_proto()' does in
tls_main.c). A similar replacement is already performed when updating
tcp_bpf_prots in tcp_bpf.c, so the approach is not unusual.
(Note that checkpatch.pl may complain about a missing 'const' qualifier
when we define the BPF-specific 'proto_ops', but we have to drop the
qualifier because we need to update the struct at init time.)
We also remove the strparser check from tcp_eat_skb(), since our custom
tcp_bpf_read_sock() does not update copied_seq.
Since strparser currently supports only TCP, it's sufficient for 'ops' to
inherit inet_stream_ops.
Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
Signed-off-by: Jiayuan Chen <mrpre@163.com>
---
include/linux/skmsg.h | 2 +
include/net/tcp.h | 1 +
net/core/skmsg.c | 3 ++
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_bpf.c | 108 ++++++++++++++++++++++++++++++++++++++++--
5 files changed, 112 insertions(+), 4 deletions(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index d9b03e0746e7..7f91bc67e50f 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -85,6 +85,7 @@ struct sk_psock {
struct sock *sk_redir;
u32 apply_bytes;
u32 cork_bytes;
+ u32 strp_offset;
u32 eval;
bool redir_ingress; /* undefined if sk_redir is null */
struct sk_msg *cork;
@@ -112,6 +113,7 @@ struct sk_psock {
int (*psock_update_sk_prot)(struct sock *sk, struct sk_psock *psock,
bool restore);
struct proto *sk_proto;
+ const struct proto_ops *sk_proto_ops;
struct mutex work_mutex;
struct sk_psock_work_state work_state;
struct delayed_work work;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e9b37b76e894..fb3215936ece 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -353,6 +353,7 @@ ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
unsigned int flags);
struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
bool force_schedule);
+void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb);
static inline void tcp_dec_quickack_mode(struct sock *sk)
{
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index e90fbab703b2..99dd75c9e689 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -702,6 +702,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
{
struct sk_psock *psock;
struct proto *prot;
+ const struct proto_ops *proto_ops;
write_lock_bh(&sk->sk_callback_lock);
@@ -722,9 +723,11 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
}
prot = READ_ONCE(sk->sk_prot);
+ proto_ops = likely(sk->sk_socket) ? sk->sk_socket->ops : NULL;
psock->sk = sk;
psock->eval = __SK_NONE;
psock->sk_proto = prot;
+ psock->sk_proto_ops = proto_ops;
psock->saved_unhash = prot->unhash;
psock->saved_destroy = prot->destroy;
psock->saved_close = prot->close;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..6a07d98017f7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1517,7 +1517,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
__tcp_cleanup_rbuf(sk, copied);
}
-static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
+void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
{
__skb_unlink(skb, &sk->sk_receive_queue);
if (likely(skb->destructor == sock_rfree)) {
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 99cef92e6290..4a089afc09b7 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -19,9 +19,6 @@ void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
if (!skb || !skb->len || !sk_is_tcp(sk))
return;
- if (skb_bpf_strparser(skb))
- return;
-
tcp = tcp_sk(sk);
copied = tcp->copied_seq + skb->len;
WRITE_ONCE(tcp->copied_seq, copied);
@@ -578,6 +575,81 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
return copied > 0 ? copied : err;
}
+static void sock_replace_proto_ops(struct sock *sk,
+ const struct proto_ops *proto_ops)
+{
+ if (sk->sk_socket)
+ WRITE_ONCE(sk->sk_socket->ops, proto_ops);
+}
+
+/* The tcp_bpf_read_sock() is an alternative implementation
+ * of tcp_read_sock(), except that it does not update copied_seq.
+ */
+static int tcp_bpf_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor)
+{
+ struct sk_psock *psock;
+ struct sk_buff *skb;
+ int offset;
+ int copied = 0;
+
+ if (sk->sk_state == TCP_LISTEN)
+ return -ENOTCONN;
+
+ /* We are called from sk_psock_strp_data_ready(), where psock
+ * has already been checked and cannot be NULL.
+ */
+ psock = sk_psock_get(sk);
+ /* The offset keeps track of how much data was processed during
+ * the last call.
+ */
+ offset = psock->strp_offset;
+ while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
+ u8 tcp_flags;
+ int used;
+ size_t len;
+
+ len = skb->len - offset;
+ tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
+ WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
+ used = recv_actor(desc, skb, offset, len);
+ if (used <= 0) {
+ /* None of the data in skb has been consumed;
+ * -ENOMEM or another error may have occurred.
+ */
+ if (!copied)
+ copied = used;
+ break;
+ }
+
+ if (WARN_ON_ONCE(used > len))
+ used = len;
+ copied += used;
+ if (used < len) {
+ /* The strparser clones and consumes the entire input skb
+ * unless -ENOMEM occurs, in which case its framework will
+ * replay the skb later. So we need to keep the offset and
+ * skb for the next retry.
+ */
+ offset += used;
+ break;
+ }
+
+ /* The entire skb was consumed; we don't need it anymore,
+ * so unlink it and reset the offset.
+ */
+ offset = 0;
+ tcp_eat_recv_skb(sk, skb);
+ if (!desc->count)
+ break;
+ if (tcp_flags & TCPHDR_FIN)
+ break;
+ }
+
+ WRITE_ONCE(psock->strp_offset, offset);
+ return copied;
+}
+
enum {
TCP_BPF_IPV4,
TCP_BPF_IPV6,
@@ -595,6 +667,10 @@ enum {
static struct proto *tcpv6_prot_saved __read_mostly;
static DEFINE_SPINLOCK(tcpv6_prot_lock);
static struct proto tcp_bpf_prots[TCP_BPF_NUM_PROTS][TCP_BPF_NUM_CFGS];
+/* We do not use 'const' here because the ops are updated at init time.
+ * checkpatch.pl may warn about the missing qualifier; just ignore it.
+ */
+static struct proto_ops tcp_bpf_proto_ops[TCP_BPF_NUM_PROTS];
static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
struct proto *base)
@@ -615,6 +691,13 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
prot[TCP_BPF_TXRX].recvmsg = tcp_bpf_recvmsg_parser;
}
+static void tcp_bpf_rebuild_proto_ops(struct proto_ops *ops,
+ const struct proto_ops *base)
+{
+ *ops = *base;
+ ops->read_sock = tcp_bpf_read_sock;
+}
+
static void tcp_bpf_check_v6_needs_rebuild(struct proto *ops)
{
if (unlikely(ops != smp_load_acquire(&tcpv6_prot_saved))) {
@@ -627,6 +710,19 @@ static void tcp_bpf_check_v6_needs_rebuild(struct proto *ops)
}
}
+static int __init tcp_bpf_build_proto_ops(void)
+{
+ /* We build the ops separately for future extensibility,
+ * although v4 and v6 currently use the same ops.
+ */
+ tcp_bpf_rebuild_proto_ops(&tcp_bpf_proto_ops[TCP_BPF_IPV4],
+ &inet_stream_ops);
+ tcp_bpf_rebuild_proto_ops(&tcp_bpf_proto_ops[TCP_BPF_IPV6],
+ &inet_stream_ops);
+ return 0;
+}
+late_initcall(tcp_bpf_build_proto_ops);
+
static int __init tcp_bpf_v4_build_proto(void)
{
tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV4], &tcp_prot);
@@ -648,6 +744,7 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
{
int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
int config = psock->progs.msg_parser ? TCP_BPF_TX : TCP_BPF_BASE;
+ bool strp = psock->progs.stream_verdict && psock->progs.stream_parser;
if (psock->progs.stream_verdict || psock->progs.skb_verdict) {
config = (config == TCP_BPF_TX) ? TCP_BPF_TXRX : TCP_BPF_RX;
@@ -666,6 +763,7 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
sk->sk_write_space = psock->saved_write_space;
/* Pairs with lockless read in sk_clone_lock() */
sock_replace_proto(sk, psock->sk_proto);
+ sock_replace_proto_ops(sk, psock->sk_proto_ops);
}
return 0;
}
@@ -679,6 +777,10 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
/* Pairs with lockless read in sk_clone_lock() */
sock_replace_proto(sk, &tcp_bpf_prots[family][config]);
+
+ if (strp)
+ sock_replace_proto_ops(sk, &tcp_bpf_proto_ops[family]);
+
return 0;
}
EXPORT_SYMBOL_GPL(tcp_bpf_update_proto);
--
2.43.5
* [PATCH bpf v3 2/2] selftests/bpf: add strparser test for bpf
2024-12-18 5:34 [PATCH bpf v3 0/2] bpf: fix wrong copied_seq calculation and add tests Jiayuan Chen
2024-12-18 5:34 ` [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation Jiayuan Chen
@ 2024-12-18 5:34 ` Jiayuan Chen
1 sibling, 0 replies; 10+ messages in thread
From: Jiayuan Chen @ 2024-12-18 5:34 UTC (permalink / raw)
To: bpf
Cc: martin.lau, ast, edumazet, jakub, davem, dsahern, kuba, pabeni,
linux-kernel, song, john.fastabend, andrii, mhal, yonghong.song,
daniel, xiyou.wangcong, horms, Jiayuan Chen
Add test cases for BPF + strparser and split them out of
sockmap_basic, since more strparser test cases will be added in the
future.
Signed-off-by: Jiayuan Chen <mrpre@163.com>
---
.../selftests/bpf/prog_tests/sockmap_basic.c | 53 ---
.../selftests/bpf/prog_tests/sockmap_strp.c | 344 ++++++++++++++++++
.../selftests/bpf/progs/test_sockmap_strp.c | 51 +++
3 files changed, 395 insertions(+), 53 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_strp.c
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index fdff0652d7ef..4c0eebc433d8 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -530,57 +530,6 @@ static void test_sockmap_skb_verdict_shutdown(void)
test_sockmap_pass_prog__destroy(skel);
}
-static void test_sockmap_stream_pass(void)
-{
- int zero = 0, sent, recvd;
- int verdict, parser;
- int err, map;
- int c = -1, p = -1;
- struct test_sockmap_pass_prog *pass = NULL;
- char snd[256] = "0123456789";
- char rcv[256] = "0";
-
- pass = test_sockmap_pass_prog__open_and_load();
- verdict = bpf_program__fd(pass->progs.prog_skb_verdict);
- parser = bpf_program__fd(pass->progs.prog_skb_parser);
- map = bpf_map__fd(pass->maps.sock_map_rx);
-
- err = bpf_prog_attach(parser, map, BPF_SK_SKB_STREAM_PARSER, 0);
- if (!ASSERT_OK(err, "bpf_prog_attach stream parser"))
- goto out;
-
- err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0);
- if (!ASSERT_OK(err, "bpf_prog_attach stream verdict"))
- goto out;
-
- err = create_pair(AF_INET, SOCK_STREAM, &c, &p);
- if (err)
- goto out;
-
- /* sk_data_ready of 'p' will be replaced by strparser handler */
- err = bpf_map_update_elem(map, &zero, &p, BPF_NOEXIST);
- if (!ASSERT_OK(err, "bpf_map_update_elem(p)"))
- goto out_close;
-
- /*
- * as 'prog_skb_parser' return the original skb len and
- * 'prog_skb_verdict' return SK_PASS, the kernel will just
- * pass it through to original socket 'p'
- */
- sent = xsend(c, snd, sizeof(snd), 0);
- ASSERT_EQ(sent, sizeof(snd), "xsend(c)");
-
- recvd = recv_timeout(p, rcv, sizeof(rcv), SOCK_NONBLOCK,
- IO_TIMEOUT_SEC);
- ASSERT_EQ(recvd, sizeof(rcv), "recv_timeout(p)");
-
-out_close:
- close(c);
- close(p);
-
-out:
- test_sockmap_pass_prog__destroy(pass);
-}
static void test_sockmap_skb_verdict_fionread(bool pass_prog)
{
@@ -1050,8 +999,6 @@ void test_sockmap_basic(void)
test_sockmap_progs_query(BPF_SK_SKB_VERDICT);
if (test__start_subtest("sockmap skb_verdict shutdown"))
test_sockmap_skb_verdict_shutdown();
- if (test__start_subtest("sockmap stream parser and verdict pass"))
- test_sockmap_stream_pass();
if (test__start_subtest("sockmap skb_verdict fionread"))
test_sockmap_skb_verdict_fionread(true);
if (test__start_subtest("sockmap skb_verdict fionread on drop"))
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
new file mode 100644
index 000000000000..0398658d4787
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
@@ -0,0 +1,344 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <error.h>
+
+#include <test_progs.h>
+#include "sockmap_helpers.h"
+#include "test_skmsg_load_helpers.skel.h"
+#include "test_sockmap_strp.skel.h"
+#define STRP_PACKET_HEAD_LEN 4
+#define STRP_PACKET_BODY_LEN 6
+#define STRP_PACKET_FULL_LEN (STRP_PACKET_HEAD_LEN + STRP_PACKET_BODY_LEN)
+static const char packet[STRP_PACKET_FULL_LEN] = "head+body\0";
+static const int test_packet_num = 100;
+
+static struct test_sockmap_strp *sockmap_strp_init(int *map)
+{
+ struct test_sockmap_strp *strp = NULL;
+ int verdict, parser;
+ int err;
+
+ strp = test_sockmap_strp__open_and_load();
+ verdict = bpf_program__fd(strp->progs.prog_skb_verdict_pass);
+ parser = bpf_program__fd(strp->progs.prog_skb_parser_partial);
+ *map = bpf_map__fd(strp->maps.sock_map);
+
+ err = bpf_prog_attach(parser, *map, BPF_SK_SKB_STREAM_PARSER, 0);
+ if (!ASSERT_OK(err, "bpf_prog_attach stream parser"))
+ goto err;
+
+ err = bpf_prog_attach(verdict, *map, BPF_SK_SKB_STREAM_VERDICT, 0);
+ if (!ASSERT_OK(err, "bpf_prog_attach stream verdict"))
+ goto err;
+
+ return strp;
+err:
+ test_sockmap_strp__destroy(strp);
+ return NULL;
+}
+
+/* we have multiple packets in one skb
+ * ------------ ------------ ------------
+ * | packet1 | packet2 | ...
+ * ------------ ------------ ------------
+ */
+static void test_sockmap_strp_multi_packet(int family, int sotype)
+{
+ int i, zero = 0;
+ int sent, recvd, total;
+ int err, map;
+ int c = -1, p = -1;
+ struct test_sockmap_strp *strp = NULL;
+ char *snd = NULL, *rcv = NULL;
+
+ strp = sockmap_strp_init(&map);
+ if (!ASSERT_TRUE(strp != NULL, "sockmap_strp_init"))
+ return;
+
+ err = create_pair(family, sotype, &c, &p);
+ if (err)
+ goto out;
+
+ err = bpf_map_update_elem(map, &zero, &p, BPF_NOEXIST);
+ if (!ASSERT_OK(err, "bpf_map_update_elem(zero, p)"))
+ goto out_close;
+
+ /* construct multiple packets in one buffer */
+ total = test_packet_num * STRP_PACKET_FULL_LEN;
+ snd = malloc(total);
+ rcv = malloc(total + 1);
+ if (!ASSERT_TRUE(snd != NULL, "malloc(multi block)")
+ || !ASSERT_TRUE(rcv != NULL, "malloc(multi block)"))
+ goto out_close;
+
+ for (i = 0; i < test_packet_num; i++) {
+ memcpy(snd + i * STRP_PACKET_FULL_LEN,
+ packet, STRP_PACKET_FULL_LEN);
+ }
+
+ sent = xsend(c, snd, total, 0);
+ if (!ASSERT_EQ(sent, total, "xsend(c)"))
+ goto out_close;
+
+ /* try to recv one more byte than was sent, to detect any excess data */
+ recvd = recv_timeout(p, rcv, total + 1, MSG_DONTWAIT, IO_TIMEOUT_SEC);
+ if (!ASSERT_EQ(recvd, total, "recv(rcv)"))
+ goto out_close;
+
+ /* we sent one TCP stream carrying multiple framed packets;
+ * check that they are handled correctly
+ */
+ if (!ASSERT_OK(memcmp(snd, rcv, total), "memcmp(snd, rcv)"))
+ goto out_close;
+
+out_close:
+ close(c);
+ close(p);
+ if (snd)
+ free(snd);
+ if (rcv)
+ free(rcv);
+out:
+ test_sockmap_strp__destroy(strp);
+}
+
+static void test_sockmap_strp_partial_read(int family, int sotype)
+{
+ int zero = 0, recvd, off;
+ int verdict, parser;
+ int err, map;
+ int c = -1, p = -1;
+ struct test_sockmap_strp *strp = NULL;
+ char rcv[STRP_PACKET_FULL_LEN + 1] = "0";
+
+ strp = test_sockmap_strp__open_and_load();
+ verdict = bpf_program__fd(strp->progs.prog_skb_verdict_pass);
+ parser = bpf_program__fd(strp->progs.prog_skb_parser_partial);
+ map = bpf_map__fd(strp->maps.sock_map);
+
+ err = bpf_prog_attach(parser, map, BPF_SK_SKB_STREAM_PARSER, 0);
+ if (!ASSERT_OK(err, "bpf_prog_attach stream parser"))
+ goto out;
+
+ err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+ if (!ASSERT_OK(err, "bpf_prog_attach stream verdict"))
+ goto out;
+
+ err = create_pair(family, sotype, &c, &p);
+ if (err)
+ goto out;
+
+ /* sk_data_ready of 'p' will be replaced by strparser handler */
+ err = bpf_map_update_elem(map, &zero, &p, BPF_NOEXIST);
+ if (!ASSERT_OK(err, "bpf_map_update_elem(zero, p)"))
+ goto out_close;
+
+ /* 1.1 send partial head, 1 byte of header left */
+ off = STRP_PACKET_HEAD_LEN - 1;
+ xsend(c, packet, off, 0);
+ recvd = recv_timeout(p, rcv, sizeof(rcv), MSG_DONTWAIT, 5);
+ if (!ASSERT_EQ(-1, recvd, "insufficient head, should no data recvd"))
+ goto out_close;
+
+ /* 1.2 send remaining head and body */
+ xsend(c, packet + off, STRP_PACKET_FULL_LEN - off, 0);
+ recvd = recv_timeout(p, rcv, sizeof(rcv), MSG_DONTWAIT, IO_TIMEOUT_SEC);
+ if (!ASSERT_EQ(recvd, STRP_PACKET_FULL_LEN, "should full data recvd"))
+ goto out_close;
+
+ /* 2.1 send partial head, 1 byte header left */
+ off = STRP_PACKET_HEAD_LEN - 1;
+ xsend(c, packet, off, 0);
+
+ /* 2.2 send remaining head and partial body, 1 byte body left */
+ xsend(c, packet + off, STRP_PACKET_FULL_LEN - off - 1, 0);
+ off = STRP_PACKET_FULL_LEN - 1;
+ recvd = recv_timeout(p, rcv, sizeof(rcv), MSG_DONTWAIT, 1);
+ if (!ASSERT_EQ(-1, recvd, "insufficient body, should no data read"))
+ goto out_close;
+
+ /* 2.3 send remaining body */
+ xsend(c, packet + off, STRP_PACKET_FULL_LEN - off, 0);
+ recvd = recv_timeout(p, rcv, sizeof(rcv), MSG_DONTWAIT, IO_TIMEOUT_SEC);
+ if (!ASSERT_EQ(recvd, STRP_PACKET_FULL_LEN, "should full data recvd"))
+ goto out_close;
+
+out_close:
+ close(c);
+ close(p);
+
+out:
+ test_sockmap_strp__destroy(strp);
+}
+
+static void test_sockmap_strp_pass(int family, int sotype, bool fionread)
+{
+ int zero = 0, pkt_size, sent, recvd, avail;
+ int verdict, parser;
+ int err, map;
+ int c = -1, p = -1;
+ int read_cnt = 10, i;
+ struct test_sockmap_strp *strp = NULL;
+ char rcv[STRP_PACKET_FULL_LEN + 1] = "0";
+
+ strp = test_sockmap_strp__open_and_load();
+ verdict = bpf_program__fd(strp->progs.prog_skb_verdict_pass);
+ parser = bpf_program__fd(strp->progs.prog_skb_parser);
+ map = bpf_map__fd(strp->maps.sock_map);
+
+ err = bpf_prog_attach(parser, map, BPF_SK_SKB_STREAM_PARSER, 0);
+ if (!ASSERT_OK(err, "bpf_prog_attach stream parser"))
+ goto out;
+
+ err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+ if (!ASSERT_OK(err, "bpf_prog_attach stream verdict"))
+ goto out;
+
+ err = create_pair(family, sotype, &c, &p);
+ if (err)
+ goto out;
+
+ /* sk_data_ready of 'p' will be replaced by strparser handler */
+ err = bpf_map_update_elem(map, &zero, &p, BPF_NOEXIST);
+ if (!ASSERT_OK(err, "bpf_map_update_elem(p)"))
+ goto out_close;
+
+ /* Previously, we encountered issues such as deadlocks and
+ * sequence errors that resulted in the inability to read
+ * continuously. Therefore, we perform multiple iterations
+ * of testing here.
+ */
+ pkt_size = STRP_PACKET_FULL_LEN;
+ for (i = 0; i < read_cnt; i++) {
+ sent = xsend(c, packet, pkt_size, 0);
+ if (!ASSERT_EQ(sent, pkt_size, "xsend(c)"))
+ goto out_close;
+
+ recvd = recv_timeout(p, rcv, sizeof(rcv), MSG_DONTWAIT,
+ IO_TIMEOUT_SEC);
+ if (!ASSERT_EQ(recvd, pkt_size, "recv_timeout(p)")
+ || !ASSERT_OK(memcmp(packet, rcv, pkt_size),
+ "recv_timeout(p)"))
+ goto out_close;
+ }
+
+ if (fionread) {
+ sent = xsend(c, packet, pkt_size, 0);
+ if (!ASSERT_EQ(sent, pkt_size, "second xsend(c)"))
+ goto out_close;
+
+ err = ioctl(p, FIONREAD, &avail);
+ if (!ASSERT_OK(err, "ioctl(FIONREAD) error")
+ || !ASSERT_EQ(avail, pkt_size, "ioctl(FIONREAD)"))
+ goto out_close;
+
+ recvd = recv_timeout(p, rcv, sizeof(rcv), MSG_DONTWAIT,
+ IO_TIMEOUT_SEC);
+ if (!ASSERT_EQ(recvd, pkt_size, "second recv_timeout(p)")
+ || !ASSERT_OK(memcmp(packet, rcv, pkt_size),
+ "second recv_timeout(p)"))
+ goto out_close;
+ }
+
+out_close:
+ close(c);
+ close(p);
+
+out:
+ test_sockmap_strp__destroy(strp);
+}
+
+static void test_sockmap_strp_verdict(int family, int sotype)
+{
+ int zero = 0, one = 1, sent, recvd, off;
+ int verdict, parser;
+ int err, map;
+ int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
+ struct test_sockmap_strp *strp = NULL;
+ char rcv[STRP_PACKET_FULL_LEN + 1] = "0";
+
+ strp = test_sockmap_strp__open_and_load();
+ verdict = bpf_program__fd(strp->progs.prog_skb_verdict);
+ parser = bpf_program__fd(strp->progs.prog_skb_parser);
+ map = bpf_map__fd(strp->maps.sock_map);
+
+ err = bpf_prog_attach(parser, map, BPF_SK_SKB_STREAM_PARSER, 0);
+ if (!ASSERT_OK(err, "bpf_prog_attach stream parser"))
+ goto out;
+
+ err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+ if (!ASSERT_OK(err, "bpf_prog_attach stream verdict"))
+ goto out;
+
+ /* We simulate a reverse proxy server.
+ * When p0 receives data from c0, we forward it to p1.
+ * From p1's perspective, it will consider this data
+ * as being sent by c1.
+ */
+ err = create_socket_pairs(family, sotype, &c0, &c1, &p0, &p1);
+ if (!ASSERT_OK(err, "create_socket_pairs()"))
+ goto out;
+
+ err = bpf_map_update_elem(map, &zero, &p0, BPF_NOEXIST);
+ if (!ASSERT_OK(err, "bpf_map_update_elem(p0)"))
+ goto out_close;
+
+ err = bpf_map_update_elem(map, &one, &c1, BPF_NOEXIST);
+ if (!ASSERT_OK(err, "bpf_map_update_elem(c1)"))
+ goto out_close;
+
+ sent = xsend(c0, packet, STRP_PACKET_FULL_LEN, 0);
+ if (!ASSERT_EQ(sent, STRP_PACKET_FULL_LEN, "xsend(c0)"))
+ goto out_close;
+
+ recvd = recv_timeout(p1, rcv, sizeof(rcv), MSG_DONTWAIT,
+ IO_TIMEOUT_SEC);
+ if (!ASSERT_EQ(recvd, STRP_PACKET_FULL_LEN, "recv_timeout(p1)")
+ || !ASSERT_OK(memcmp(packet, rcv, STRP_PACKET_FULL_LEN),
+ "received data does not match the sent data"))
+ goto out_close;
+
+ /* send again to ensure the stream is functioning correctly. */
+ sent = xsend(c0, packet, STRP_PACKET_FULL_LEN, 0);
+ if (!ASSERT_EQ(sent, STRP_PACKET_FULL_LEN, "second xsend(c0)"))
+ goto out_close;
+
+ /* partial read */
+ off = STRP_PACKET_FULL_LEN / 2;
+ recvd = recv_timeout(p1, rcv, off, MSG_DONTWAIT,
+ IO_TIMEOUT_SEC);
+ recvd += recv_timeout(p1, rcv + off, sizeof(rcv) - off, MSG_DONTWAIT,
+ IO_TIMEOUT_SEC);
+
+ if (!ASSERT_EQ(recvd, STRP_PACKET_FULL_LEN, "partial recv_timeout(p1)")
+ || !ASSERT_OK(memcmp(packet, rcv, STRP_PACKET_FULL_LEN),
+ "partial received data does not match the sent data"))
+ goto out_close;
+
+out_close:
+ close(c0);
+ close(c1);
+ close(p0);
+ close(p1);
+out:
+ test_sockmap_strp__destroy(strp);
+}
+
+void test_sockmap_strp(void)
+{
+ if (test__start_subtest("sockmap strp tcp pass"))
+ test_sockmap_strp_pass(AF_INET, SOCK_STREAM, false);
+ if (test__start_subtest("sockmap strp tcp v6 pass"))
+ test_sockmap_strp_pass(AF_INET6, SOCK_STREAM, false);
+ if (test__start_subtest("sockmap strp tcp pass fionread"))
+ test_sockmap_strp_pass(AF_INET, SOCK_STREAM, true);
+ if (test__start_subtest("sockmap strp tcp v6 pass fionread"))
+ test_sockmap_strp_pass(AF_INET6, SOCK_STREAM, true);
+ if (test__start_subtest("sockmap strp tcp verdict"))
+ test_sockmap_strp_verdict(AF_INET, SOCK_STREAM);
+ if (test__start_subtest("sockmap strp tcp v6 verdict"))
+ test_sockmap_strp_verdict(AF_INET6, SOCK_STREAM);
+ if (test__start_subtest("sockmap strp tcp partial read"))
+ test_sockmap_strp_partial_read(AF_INET, SOCK_STREAM);
+ if (test__start_subtest("sockmap strp tcp multiple packets"))
+ test_sockmap_strp_multi_packet(AF_INET, SOCK_STREAM);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_strp.c b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
new file mode 100644
index 000000000000..db2f3b6c87ba
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct {
+ __uint(type, BPF_MAP_TYPE_SOCKMAP);
+ __uint(max_entries, 20);
+ __type(key, int);
+ __type(value, int);
+} sock_map SEC(".maps");
+
+
+SEC("sk_skb/stream_verdict")
+int prog_skb_verdict_pass(struct __sk_buff *skb)
+{
+ return SK_PASS;
+}
+
+
+SEC("sk_skb/stream_verdict")
+int prog_skb_verdict(struct __sk_buff *skb)
+{
+ __u32 one = 1;
+
+ return bpf_sk_redirect_map(skb, &sock_map, one, 0);
+}
+
+SEC("sk_skb/stream_parser")
+int prog_skb_parser(struct __sk_buff *skb)
+{
+ return skb->len;
+}
+
+SEC("sk_skb/stream_parser")
+int prog_skb_parser_partial(struct __sk_buff *skb)
+{
+ /* Agreement with the test program: a 4-byte header and a
+ * 6-byte body.
+ */
+ if (skb->len < 4) {
+ /* need more header to determine full length */
+ return 0;
+ }
+ /* Return the full length decoded from the header. The return
+ * value may be larger than skb->len, which means the framework
+ * must wait for the body to arrive.
+ */
+ return 10;
+}
+char _license[] SEC("license") = "GPL";
--
2.43.5
* Re: [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation
2024-12-18 5:34 ` [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation Jiayuan Chen
@ 2024-12-18 15:35 ` Jakub Sitnicki
2024-12-19 9:30 ` Jiayuan Chen
0 siblings, 1 reply; 10+ messages in thread
From: Jakub Sitnicki @ 2024-12-18 15:35 UTC (permalink / raw)
To: Jiayuan Chen
Cc: bpf, martin.lau, ast, edumazet, davem, dsahern, kuba, pabeni,
linux-kernel, song, john.fastabend, andrii, mhal, yonghong.song,
daniel, xiyou.wangcong, horms
On Wed, Dec 18, 2024 at 01:34 PM +08, Jiayuan Chen wrote:
> 'sk->copied_seq' was updated in tcp_eat_skb() when the action of a BPF
> program was SK_REDIRECT. For other actions, such as SK_PASS, the update
> logic for 'sk->copied_seq' was moved to tcp_bpf_recvmsg_parser() to
> keep the 'fionread' feature accurate.
>
> It works for the stream_verdict-only scenario, as that commit also
> modified the path 'sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb'
> so that tcp_read_skb() no longer updates 'sk->copied_seq'.
>
> However, when both stream_parser and stream_verdict are active (the
> strparser use case), tcp_read_sock() is used instead of tcp_read_skb()
> (sk_data_ready->strp_data_ready->tcp_read_sock), and tcp_read_sock()
> still updates 'sk->copied_seq', leading to duplicated updates.
>
> In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
> in both tcp_read_sock() and tcp_bpf_recvmsg_parser().
>
> The issue causes incorrect copied_seq calculations, which prevents
> correct data reads from the recv() interface in userspace.
>
> Modifying tcp_read_sock() or the strparser implementation directly is
> not reasonable, as they are widely used by other modules.
>
> Here, we introduce tcp_bpf_read_sock() and install it in place of
> 'sk->sk_socket->ops->read_sock' (as 'tls_build_proto()' does in
> tls_main.c). A similar replacement is already performed when updating
> tcp_bpf_prots in tcp_bpf.c, so the approach is not unusual.
> (Note that checkpatch.pl may complain about a missing 'const' qualifier
> when we define the BPF-specific 'proto_ops', but we have to drop the
> qualifier because we need to update the struct at init time.)
>
> We also remove the strparser check from tcp_eat_skb(), since our custom
> tcp_bpf_read_sock() does not update copied_seq.
>
> Since strparser currently supports only TCP, it's sufficient for 'ops' to
> inherit inet_stream_ops.
>
> Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
> Signed-off-by: Jiayuan Chen <mrpre@163.com>
> ---
> include/linux/skmsg.h | 2 +
> include/net/tcp.h | 1 +
> net/core/skmsg.c | 3 ++
> net/ipv4/tcp.c | 2 +-
> net/ipv4/tcp_bpf.c | 108 ++++++++++++++++++++++++++++++++++++++++--
> 5 files changed, 112 insertions(+), 4 deletions(-)
[...]
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 99cef92e6290..4a089afc09b7 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -19,9 +19,6 @@ void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
> if (!skb || !skb->len || !sk_is_tcp(sk))
> return;
>
> - if (skb_bpf_strparser(skb))
> - return;
> -
> tcp = tcp_sk(sk);
> copied = tcp->copied_seq + skb->len;
> WRITE_ONCE(tcp->copied_seq, copied);
> @@ -578,6 +575,81 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> return copied > 0 ? copied : err;
> }
>
> +static void sock_replace_proto_ops(struct sock *sk,
> + const struct proto_ops *proto_ops)
> +{
> + if (sk->sk_socket)
> + WRITE_ONCE(sk->sk_socket->ops, proto_ops);
> +}
> +
> +/* The tcp_bpf_read_sock() is an alternative implementation
> + * of tcp_read_sock(), except that it does not update copied_seq.
> + */
> +static int tcp_bpf_read_sock(struct sock *sk, read_descriptor_t *desc,
> + sk_read_actor_t recv_actor)
> +{
> + struct sk_psock *psock;
> + struct sk_buff *skb;
> + int offset;
> + int copied = 0;
> +
> + if (sk->sk_state == TCP_LISTEN)
> + return -ENOTCONN;
> +
> + /* we are called from sk_psock_strp_data_ready() and
> + * psock has already been checked and can't be NULL.
> + */
> + psock = sk_psock_get(sk);
> + /* The offset keeps track of how much data was processed during
> + * the last call.
> + */
> + offset = psock->strp_offset;
> + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
> + u8 tcp_flags;
> + int used;
> + size_t len;
> +
> + len = skb->len - offset;
> + tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
> + WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
> + used = recv_actor(desc, skb, offset, len);
> + if (used <= 0) {
> + /* None of the data in the skb has been consumed;
> + * -ENOMEM or another error may have occurred.
> + */
> + if (!copied)
> + copied = used;
> + break;
> + }
> +
> + if (WARN_ON_ONCE(used > len))
> + used = len;
> + copied += used;
> + if (used < len) {
> + /* The strparser clones and consumes the entire input
> + * skb unless -ENOMEM occurs, in which case its framework
> + * will replay the skb later. So we need to keep the
> + * offset and skb for the next retry.
> + */
> + offset += used;
> + break;
> + }
> +
> + /* The entire skb was consumed; we don't need it
> + * anymore, so reset the offset.
> + */
> + offset = 0;
> + tcp_eat_recv_skb(sk, skb);
> + if (!desc->count)
> + break;
> + if (tcp_flags & TCPHDR_FIN)
> + break;
> + }
> +
> + WRITE_ONCE(psock->strp_offset, offset);
> + return copied;
> +}
> +
> enum {
> TCP_BPF_IPV4,
> TCP_BPF_IPV6,
[...]
To reiterate my earlier question / suggestion [1] - it would be great if
we can avoid duplicating what tcp_read_skb / tcp_read_sock already do.
Keeping extra state in sk_psock / strparser seems to be the key. I think
you should be able to switch strp_data_ready / str_read_sock to
->read_skb and make an adapter around strp_recv.
Rough code below is what I have in mind. Not tested, compiled
only. Don't expect it to work. And I haven't even looked how to address
the kTLS path. But you get the idea.
[1] https://msgid.link/87o71bx1l4.fsf@cloudflare.com
---8<---
diff --git a/include/net/strparser.h b/include/net/strparser.h
index 41e2ce9e9e10..0dd48c1bc23b 100644
--- a/include/net/strparser.h
+++ b/include/net/strparser.h
@@ -95,9 +95,14 @@ struct strparser {
u32 interrupted : 1;
u32 unrecov_intr : 1;
+ unsigned int need_bytes;
+
struct sk_buff **skb_nextp;
struct sk_buff *skb_head;
- unsigned int need_bytes;
+
+ int rcv_err;
+ unsigned int rcv_off;
+
struct delayed_work msg_timer_work;
struct work_struct work;
struct strp_stats stats;
diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 8299ceb3e373..8a08996429d3 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -18,6 +18,7 @@
#include <linux/poll.h>
#include <linux/rculist.h>
#include <linux/skbuff.h>
+#include <linux/skmsg.h>
#include <linux/socket.h>
#include <linux/uaccess.h>
#include <linux/workqueue.h>
@@ -327,13 +328,39 @@ int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
}
EXPORT_SYMBOL_GPL(strp_process);
-static int strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
- unsigned int orig_offset, size_t orig_len)
+static int strp_read_skb(struct sock *sk, struct sk_buff *skb)
{
- struct strparser *strp = (struct strparser *)desc->arg.data;
-
- return __strp_recv(desc, orig_skb, orig_offset, orig_len,
- strp->sk->sk_rcvbuf, strp->sk->sk_rcvtimeo);
+ struct sk_psock *psock = sk_psock_get(sk);
+ struct strparser *strp = &psock->strp;
+ read_descriptor_t desc = {
+ .arg.data = strp,
+ .count = 1,
+ .error = 0,
+ };
+ unsigned int off;
+ size_t len;
+ int used;
+
+ off = strp->rcv_off;
+ len = skb->len - off;
+ used = __strp_recv(&desc, skb, off, len,
+ sk->sk_rcvbuf, sk->sk_rcvtimeo);
+ /* skb not consumed */
+ if (used <= 0) {
+ strp->rcv_err = used;
+ return used;
+ }
+ /* skb partially consumed */
+ if (used < len) {
+ strp->rcv_err = 0;
+ strp->rcv_off += used;
+ return -EPIPE; /* stop reading */
+ }
+ /* skb fully consumed */
+ strp->rcv_err = 0;
+ strp->rcv_off = 0;
+ tcp_eat_recv_skb(sk, skb);
+ return used;
}
static int default_read_sock_done(struct strparser *strp, int err)
@@ -345,21 +372,14 @@ static int default_read_sock_done(struct strparser *strp, int err)
static int strp_read_sock(struct strparser *strp)
{
struct socket *sock = strp->sk->sk_socket;
- read_descriptor_t desc;
- if (unlikely(!sock || !sock->ops || !sock->ops->read_sock))
+ if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
return -EBUSY;
- desc.arg.data = strp;
- desc.error = 0;
- desc.count = 1; /* give more than one skb per call */
-
/* sk should be locked here, so okay to do read_sock */
- sock->ops->read_sock(strp->sk, &desc, strp_recv);
-
- desc.error = strp->cb.read_sock_done(strp, desc.error);
+ sock->ops->read_skb(strp->sk, strp_read_skb);
- return desc.error;
+ return strp->cb.read_sock_done(strp, strp->rcv_err);
}
/* Lower sock lock held */
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation
2024-12-18 15:35 ` Jakub Sitnicki
@ 2024-12-19 9:30 ` Jiayuan Chen
2024-12-20 7:06 ` John Fastabend
2024-12-23 20:57 ` Jakub Sitnicki
0 siblings, 2 replies; 10+ messages in thread
From: Jiayuan Chen @ 2024-12-19 9:30 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: bpf, martin.lau, ast, edumazet, davem, dsahern, kuba, pabeni,
linux-kernel, song, john.fastabend, andrii, mhal, yonghong.song,
daniel, xiyou.wangcong, horms
On Wed, Dec 18, 2024 at 04:35:53PM +0800, Jakub Sitnicki wrote:
[...]
> On Wed, Dec 18, 2024 at 01:34 PM +08, Jiayuan Chen wrote:
> > + if (tcp_flags & TCPHDR_FIN)
> > + break;
> > + }
> > +
> > + WRITE_ONCE(psock->strp_offset, offset);
> > + return copied;
> > +}
> > +
> > enum {
> > TCP_BPF_IPV4,
> > TCP_BPF_IPV6,
>
> [...]
>
> To reiterate my earlier question / suggestion [1] - it would be great if
> we can avoid duplicating what tcp_read_skb / tcp_read_sock already do.
>
> Keeping extra state in sk_psock / strparser seems to be the key. I think
> you should be able to switch strp_data_ready / str_read_sock to
> ->read_skb and make an adapter around strp_recv.
>
> Rough code below is what I have in mind. Not tested, compiled
> only. Don't expect it to work. And I haven't even looked how to address
> the kTLS path. But you get the idea.
>
> [1] https://msgid.link/87o71bx1l4.fsf@cloudflare.com
>
> ---8<---
>
> diff --git a/include/net/strparser.h b/include/net/strparser.h
> index 41e2ce9e9e10..0dd48c1bc23b 100644
> --- a/include/net/strparser.h
> +++ b/include/net/strparser.h
> @@ -95,9 +95,14 @@ struct strparser {
> u32 interrupted : 1;
> u32 unrecov_intr : 1;
>
> + unsigned int need_bytes;
> +
> struct sk_buff **skb_nextp;
> struct sk_buff *skb_head;
> - unsigned int need_bytes;
> +
> + int rcv_err;
> + unsigned int rcv_off;
> +
> struct delayed_work msg_timer_work;
> struct work_struct work;
> struct strp_stats stats;
> diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
> index 8299ceb3e373..8a08996429d3 100644
> --- a/net/strparser/strparser.c
> +++ b/net/strparser/strparser.c
> @@ -18,6 +18,7 @@
> #include <linux/poll.h>
> #include <linux/rculist.h>
> #include <linux/skbuff.h>
> +#include <linux/skmsg.h>
> #include <linux/socket.h>
> #include <linux/uaccess.h>
> #include <linux/workqueue.h>
> @@ -327,13 +328,39 @@ int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
> }
> EXPORT_SYMBOL_GPL(strp_process);
>
> -static int strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
> - unsigned int orig_offset, size_t orig_len)
> +static int strp_read_skb(struct sock *sk, struct sk_buff *skb)
> {
> - struct strparser *strp = (struct strparser *)desc->arg.data;
> -
> - return __strp_recv(desc, orig_skb, orig_offset, orig_len,
> - strp->sk->sk_rcvbuf, strp->sk->sk_rcvtimeo);
> + struct sk_psock *psock = sk_psock_get(sk);
> + struct strparser *strp = &psock->strp;
> + read_descriptor_t desc = {
> + .arg.data = strp,
> + .count = 1,
> + .error = 0,
> + };
> + unsigned int off;
> + size_t len;
> + int used;
> +
> + off = strp->rcv_off;
> + len = skb->len - off;
> + used = __strp_recv(&desc, skb, off, len,
> + sk->sk_rcvbuf, sk->sk_rcvtimeo);
> + /* skb not consumed */
> + if (used <= 0) {
> + strp->rcv_err = used;
> + return used;
> + }
> + /* skb partially consumed */
> + if (used < len) {
> + strp->rcv_err = 0;
> + strp->rcv_off += used;
> + return -EPIPE; /* stop reading */
> + }
> + /* skb fully consumed */
> + strp->rcv_err = 0;
> + strp->rcv_off = 0;
> + tcp_eat_recv_skb(sk, skb);
> + return used;
> }
>
> static int default_read_sock_done(struct strparser *strp, int err)
> @@ -345,21 +372,14 @@ static int default_read_sock_done(struct strparser *strp, int err)
> static int strp_read_sock(struct strparser *strp)
> {
> struct socket *sock = strp->sk->sk_socket;
> - read_descriptor_t desc;
>
> - if (unlikely(!sock || !sock->ops || !sock->ops->read_sock))
> + if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
> return -EBUSY;
>
> - desc.arg.data = strp;
> - desc.error = 0;
> - desc.count = 1; /* give more than one skb per call */
> -
> /* sk should be locked here, so okay to do read_sock */
> - sock->ops->read_sock(strp->sk, &desc, strp_recv);
> -
> - desc.error = strp->cb.read_sock_done(strp, desc.error);
> + sock->ops->read_skb(strp->sk, strp_read_skb);
>
> - return desc.error;
> + return strp->cb.read_sock_done(strp, strp->rcv_err);
> }
>
> /* Lower sock lock held */
Thanks Jakub Sitnicki.
I understand your point about using tcp_read_skb to replace
tcp_read_sock, avoiding code duplication and reducing the number of
interfaces.
Currently, not all modules using strparser have issues with
copied_seq miscalculation. The issue exists mainly with
bpf::sockmap + strparser because bpf::sockmap implements a
proprietary read interface for user-land: tcp_bpf_recvmsg_parser().
Both this and strp_recv->tcp_read_sock update copied_seq, leading
to errors.
This is why I rewrote the tcp_read_sock() interface specifically for
bpf::sockmap.
So far, I have found two other modules that use the standard strparser module:
1. kcmsock.c
2. espintcp.c (ESP over TCP implementation)
(Interestingly, neither of them has self-tests.)
Take kcm as an example: its custom read interface kcm_recvmsg()
does not conflict with copied_seq updates in tcp_read_sock().
Therefore, for kcmsock, updating copied_seq in tcp_read_sock is
necessary and aligns with the read semantics. espintcp is similar.
In summary, different modules using strp_recv have different needs
for copied_seq. I still insist on implementing tcp_bpf_read_sock()
specifically for bpf::sockmap without affecting others.
Otherwise, we may need tcp_read_skb() to determine whether
to update copied_seq according to the different needs of each module.
Additionally, I've found that kTLS has its own read_sock() and
a strparser-like implementation (in tls_strp.c), separate from the
standard strparser module. Therefore, even with your proposed
solution, kTLS may not be affected.
regards
* Re: [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation
2024-12-19 9:30 ` Jiayuan Chen
@ 2024-12-20 7:06 ` John Fastabend
2024-12-23 20:57 ` Jakub Sitnicki
1 sibling, 0 replies; 10+ messages in thread
From: John Fastabend @ 2024-12-20 7:06 UTC (permalink / raw)
To: Jiayuan Chen, Jakub Sitnicki
Cc: bpf, martin.lau, ast, edumazet, davem, dsahern, kuba, pabeni,
linux-kernel, song, john.fastabend, andrii, mhal, yonghong.song,
daniel, xiyou.wangcong, horms
Jiayuan Chen wrote:
> On Wed, Dec 18, 2024 at 04:35:53PM +0800, Jakub Sitnicki wrote:
> [...]
> > On Wed, Dec 18, 2024 at 01:34 PM +08, Jiayuan Chen wrote:
> > > + if (tcp_flags & TCPHDR_FIN)
> > > + break;
> > > + }
> > > +
> > > + WRITE_ONCE(psock->strp_offset, offset);
> > > + return copied;
> > > +}
> > > +
> > > enum {
> > > TCP_BPF_IPV4,
> > > TCP_BPF_IPV6,
> >
> > [...]
> >
> > To reiterate my earlier question / suggestion [1] - it would be great if
> > we can avoid duplicating what tcp_read_skb / tcp_read_sock already do.
> >
> > Keeping extra state in sk_psock / strparser seems to be the key. I think
> > you should be able to switch strp_data_ready / str_read_sock to
> > ->read_skb and make an adapter around strp_recv.
> >
> > Rough code below is what I have in mind. Not tested, compiled
> > only. Don't expect it to work. And I haven't even looked how to address
> > the kTLS path. But you get the idea.
> >
> > [1] https://msgid.link/87o71bx1l4.fsf@cloudflare.com
> >
> > ---8<---
> >
> > diff --git a/include/net/strparser.h b/include/net/strparser.h
> > index 41e2ce9e9e10..0dd48c1bc23b 100644
> > --- a/include/net/strparser.h
> > +++ b/include/net/strparser.h
> > @@ -95,9 +95,14 @@ struct strparser {
> > u32 interrupted : 1;
> > u32 unrecov_intr : 1;
> >
> > + unsigned int need_bytes;
> > +
> > struct sk_buff **skb_nextp;
> > struct sk_buff *skb_head;
> > - unsigned int need_bytes;
> > +
> > + int rcv_err;
> > + unsigned int rcv_off;
> > +
> > struct delayed_work msg_timer_work;
> > struct work_struct work;
> > struct strp_stats stats;
> > diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
> > index 8299ceb3e373..8a08996429d3 100644
> > --- a/net/strparser/strparser.c
> > +++ b/net/strparser/strparser.c
> > @@ -18,6 +18,7 @@
> > #include <linux/poll.h>
> > #include <linux/rculist.h>
> > #include <linux/skbuff.h>
> > +#include <linux/skmsg.h>
> > #include <linux/socket.h>
> > #include <linux/uaccess.h>
> > #include <linux/workqueue.h>
> > @@ -327,13 +328,39 @@ int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
> > }
> > EXPORT_SYMBOL_GPL(strp_process);
> >
> > -static int strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
> > - unsigned int orig_offset, size_t orig_len)
> > +static int strp_read_skb(struct sock *sk, struct sk_buff *skb)
> > {
> > - struct strparser *strp = (struct strparser *)desc->arg.data;
> > -
> > - return __strp_recv(desc, orig_skb, orig_offset, orig_len,
> > - strp->sk->sk_rcvbuf, strp->sk->sk_rcvtimeo);
> > + struct sk_psock *psock = sk_psock_get(sk);
> > + struct strparser *strp = &psock->strp;
> > + read_descriptor_t desc = {
> > + .arg.data = strp,
> > + .count = 1,
> > + .error = 0,
> > + };
> > + unsigned int off;
> > + size_t len;
> > + int used;
> > +
> > + off = strp->rcv_off;
> > + len = skb->len - off;
> > + used = __strp_recv(&desc, skb, off, len,
> > + sk->sk_rcvbuf, sk->sk_rcvtimeo);
I guess the main complication here is read_skb has already unlinked
the skb so we would lose the skb entirely in some cases here? Easy
example would be ENOMEM on __strp_recv clone.
OTOH you could likely optimize __strp_recv a fair amount for the
good case (if it happens to be true in your case) where all the data
is normally in the skb, and skip the clone or something. Although
it's not clear to me how common that case is.
> > + /* skb not consumed */
> > + if (used <= 0) {
> > + strp->rcv_err = used;
> > + return used;
> > + }
> > + /* skb partially consumed */
> > + if (used < len) {
> > + strp->rcv_err = 0;
> > + strp->rcv_off += used;
> > + return -EPIPE; /* stop reading */
> > + }
> > + /* skb fully consumed */
> > + strp->rcv_err = 0;
> > + strp->rcv_off = 0;
> > + tcp_eat_recv_skb(sk, skb);
> > + return used;
> > }
> >
> > static int default_read_sock_done(struct strparser *strp, int err)
> > @@ -345,21 +372,14 @@ static int default_read_sock_done(struct strparser *strp, int err)
> > static int strp_read_sock(struct strparser *strp)
> > {
> > struct socket *sock = strp->sk->sk_socket;
> > - read_descriptor_t desc;
> >
> > - if (unlikely(!sock || !sock->ops || !sock->ops->read_sock))
> > + if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
> > return -EBUSY;
> >
> > - desc.arg.data = strp;
> > - desc.error = 0;
> > - desc.count = 1; /* give more than one skb per call */
> > -
> > /* sk should be locked here, so okay to do read_sock */
> > - sock->ops->read_sock(strp->sk, &desc, strp_recv);
> > -
> > - desc.error = strp->cb.read_sock_done(strp, desc.error);
> > + sock->ops->read_skb(strp->sk, strp_read_skb);
> >
> > - return desc.error;
> > + return strp->cb.read_sock_done(strp, strp->rcv_err);
> > }
> >
> > /* Lower sock lock held */
>
> Thanks Jakub Sitnicki.
>
> I understand your point about using tcp_read_skb to replace
> tcp_read_sock, avoiding code duplication and reducing the number of
> interfaces.
>
> Currently, not all modules using strparser have issues with
> copied_seq miscalculation. The issue exists mainly with
> bpf::sockmap + strparser because bpf::sockmap implements a
> proprietary read interface for user-land: tcp_bpf_recvmsg_parser().
>
> Both this and strp_recv->tcp_read_sock update copied_seq, leading
> to errors.
>
> This is why I rewrote the tcp_read_sock() interface specifically for
> bpf::sockmap.
>
> So far, I have found two other modules that use the standard strparser module:
>
> 1. kcmsock.c
> 2. espintcp.c (ESP over TCP implementation)
> (Interestingly, neither of them has self-tests.)
>
> Take kcm as an example: its custom read interface kcm_recvmsg()
> does not conflict with copied_seq updates in tcp_read_sock().
>
> Therefore, for kcmsock, updating copied_seq in tcp_read_sock is
> necessary and aligns with the read semantics. espintcp is similar.
>
> In summary, different modules using strp_recv have different needs
> for copied_seq. I still insist on implementing tcp_bpf_read_sock()
> specifically for bpf::sockmap without affecting others.
>
> Otherwise, we may need tcp_read_skb() to determine whether
> to update copied_seq according to the different needs of each module.
>
>
> Additionally, I've found that kTLS has its own read_sock() and
> a strparser-like implementation (in tls_strp.c), separate from the
> standard strparser module. Therefore, even with your proposed
> solution, kTLS may not be affected.
>
> regards
>
>
* Re: [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation
2024-12-19 9:30 ` Jiayuan Chen
2024-12-20 7:06 ` John Fastabend
@ 2024-12-23 20:57 ` Jakub Sitnicki
2024-12-23 22:57 ` Jakub Sitnicki
1 sibling, 1 reply; 10+ messages in thread
From: Jakub Sitnicki @ 2024-12-23 20:57 UTC (permalink / raw)
To: Jiayuan Chen, John Fastabend
Cc: bpf, martin.lau, ast, edumazet, davem, dsahern, kuba, pabeni,
linux-kernel, song, andrii, mhal, yonghong.song, daniel,
xiyou.wangcong, horms
On Thu, Dec 19, 2024 at 05:30 PM +08, Jiayuan Chen wrote:
> Currently, not all modules using strparser have issues with
> copied_seq miscalculation. The issue exists mainly with
> bpf::sockmap + strparser because bpf::sockmap implements a
> proprietary read interface for user-land: tcp_bpf_recvmsg_parser().
>
> Both this and strp_recv->tcp_read_sock update copied_seq, leading
> to errors.
>
> This is why I rewrote the tcp_read_sock() interface specifically for
> bpf::sockmap.
All right. Looks like reusing read_skb is not going to pan out.
But I think we should not give up just yet. It's easy to add new code.
We can try to break up and parametrize tcp_read_sock - if other
maintainers are not against it. Does something like this work for you?
https://github.com/jsitnicki/linux/commits/review/stp-copied_seq/idea-2/
Other minor feedback I have:
- The newly added code is optional and should depend on
CONFIG_BPF_STREAM_PARSER being enabled. Please check that it builds
with CONFIG_BPF_STREAM_PARSER=n as well.
- Let's not add complexity until it's really needed; today we don't
need separate tcp_bpf_proto_ops for IPv4 and IPv6.
- There are style issues with the added test. Please run checkpatch.pl.
* Re: [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation
2024-12-23 20:57 ` Jakub Sitnicki
@ 2024-12-23 22:57 ` Jakub Sitnicki
2024-12-24 7:16 ` Jiayuan Chen
0 siblings, 1 reply; 10+ messages in thread
From: Jakub Sitnicki @ 2024-12-23 22:57 UTC (permalink / raw)
To: Jiayuan Chen, John Fastabend
Cc: bpf, martin.lau, ast, edumazet, davem, dsahern, kuba, pabeni,
linux-kernel, song, andrii, mhal, yonghong.song, daniel,
xiyou.wangcong, horms
On Mon, Dec 23, 2024 at 09:57 PM +01, Jakub Sitnicki wrote:
> On Thu, Dec 19, 2024 at 05:30 PM +08, Jiayuan Chen wrote:
>> Currently, not all modules using strparser have issues with
>> copied_seq miscalculation. The issue exists mainly with
>> bpf::sockmap + strparser because bpf::sockmap implements a
>> proprietary read interface for user-land: tcp_bpf_recvmsg_parser().
>>
>> Both this and strp_recv->tcp_read_sock update copied_seq, leading
>> to errors.
>>
>> This is why I rewrote the tcp_read_sock() interface specifically for
>> bpf::sockmap.
>
> All right. Looks like reusing read_skb is not going to pan out.
>
> But I think we should not give up just yet. It's easy to add new code.
>
> We can try to break up and parametrize tcp_read_sock - if other
> maintainers are not against it. Does something like this work for you?
>
> https://github.com/jsitnicki/linux/commits/review/stp-copied_seq/idea-2/
Actually it reads better if we just add early bailout to tcp_read_sock:
https://github.com/jsitnicki/linux/commits/review/stp-copied_seq/idea-2.1/
---8<---
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6a07d98017f7..6564ea3b6cd4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1565,12 +1565,13 @@ EXPORT_SYMBOL(tcp_recv_skb);
* or for 'peeking' the socket using this routine
* (although both would be easy to implement).
*/
-int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
- sk_read_actor_t recv_actor)
+static inline int __tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor, bool noack,
+ u32 *copied_seq)
{
struct sk_buff *skb;
struct tcp_sock *tp = tcp_sk(sk);
- u32 seq = tp->copied_seq;
+ u32 seq = *copied_seq;
u32 offset;
int copied = 0;
@@ -1624,9 +1625,12 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
tcp_eat_recv_skb(sk, skb);
if (!desc->count)
break;
- WRITE_ONCE(tp->copied_seq, seq);
+ WRITE_ONCE(*copied_seq, seq);
}
- WRITE_ONCE(tp->copied_seq, seq);
+ WRITE_ONCE(*copied_seq, seq);
+
+ if (noack)
+ goto out;
tcp_rcv_space_adjust(sk);
@@ -1635,10 +1639,25 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
tcp_recv_skb(sk, seq, &offset);
tcp_cleanup_rbuf(sk, copied);
}
+out:
return copied;
}
+
+int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor)
+{
+ return __tcp_read_sock(sk, desc, recv_actor, false,
+ &tcp_sk(sk)->copied_seq);
+}
EXPORT_SYMBOL(tcp_read_sock);
+int tcp_read_sock_noack(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor, u32 *copied_seq)
+{
+ return __tcp_read_sock(sk, desc, recv_actor, true, copied_seq);
+}
+EXPORT_SYMBOL(tcp_read_sock_noack);
+
int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
{
struct sk_buff *skb;
* Re: [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation
2024-12-23 22:57 ` Jakub Sitnicki
@ 2024-12-24 7:16 ` Jiayuan Chen
2024-12-27 14:09 ` Jakub Sitnicki
0 siblings, 1 reply; 10+ messages in thread
From: Jiayuan Chen @ 2024-12-24 7:16 UTC (permalink / raw)
To: Jakub Sitnicki, John Fastabend
Cc: John Fastabend, bpf, martin.lau, ast, edumazet, davem, dsahern,
kuba, pabeni, linux-kernel, song, andrii, mhal, yonghong.song,
daniel, xiyou.wangcong, horms
On Mon, Dec 23, 2024 at 11:57:58PM +0800, Jakub Sitnicki wrote:
> On Mon, Dec 23, 2024 at 09:57 PM +01, Jakub Sitnicki wrote:
> > On Thu, Dec 19, 2024 at 05:30 PM +08, Jiayuan Chen wrote:
> >> Currently, not all modules using strparser have issues with
> >> copied_seq miscalculation. The issue exists mainly with
> >> bpf::sockmap + strparser because bpf::sockmap implements a
> >> proprietary read interface for user-land: tcp_bpf_recvmsg_parser().
> >>
> >> Both this and strp_recv->tcp_read_sock update copied_seq, leading
> >> to errors.
> >>
> >> This is why I rewrote the tcp_read_sock() interface specifically for
> >> bpf::sockmap.
> >
> > All right. Looks like reusing read_skb is not going to pan out.
> >
> > But I think we should not give up just yet. It's easy to add new code.
> >
> > We can try to break up and parametrize tcp_read_sock - if other
> > maintainers are not against it. Does something like this work for you?
> >
> > https://github.com/jsitnicki/linux/commits/review/stp-copied_seq/idea-2/
>
> Actually it reads better if we just add early bailout to tcp_read_sock:
>
> https://github.com/jsitnicki/linux/commits/review/stp-copied_seq/idea-2.1/
>
> ---8<---
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 6a07d98017f7..6564ea3b6cd4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1565,12 +1565,13 @@ EXPORT_SYMBOL(tcp_recv_skb);
> * or for 'peeking' the socket using this routine
> * (although both would be easy to implement).
> */
> -int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> - sk_read_actor_t recv_actor)
> +static inline int __tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> + sk_read_actor_t recv_actor, bool noack,
> + u32 *copied_seq)
> {
> struct sk_buff *skb;
> struct tcp_sock *tp = tcp_sk(sk);
> - u32 seq = tp->copied_seq;
> + u32 seq = *copied_seq;
> u32 offset;
> int copied = 0;
>
> @@ -1624,9 +1625,12 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> tcp_eat_recv_skb(sk, skb);
> if (!desc->count)
> break;
> - WRITE_ONCE(tp->copied_seq, seq);
> + WRITE_ONCE(*copied_seq, seq);
> }
> - WRITE_ONCE(tp->copied_seq, seq);
> + WRITE_ONCE(*copied_seq, seq);
> +
> + if (noack)
> + goto out;
>
> tcp_rcv_space_adjust(sk);
>
> @@ -1635,10 +1639,25 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> tcp_recv_skb(sk, seq, &offset);
> tcp_cleanup_rbuf(sk, copied);
> }
> +out:
> return copied;
> }
> +
> +int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> + sk_read_actor_t recv_actor)
> +{
> + return __tcp_read_sock(sk, desc, recv_actor, false,
> + &tcp_sk(sk)->copied_seq);
> +}
> EXPORT_SYMBOL(tcp_read_sock);
>
> +int tcp_read_sock_noack(struct sock *sk, read_descriptor_t *desc,
> + sk_read_actor_t recv_actor, u32 *copied_seq)
> +{
> + return __tcp_read_sock(sk, desc, recv_actor, true, copied_seq);
> +}
> +EXPORT_SYMBOL(tcp_read_sock_noack);
> +
> int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
> {
> struct sk_buff *skb;
This modification definitely reduces code duplication and is more
elegant than my previous idea. Also, if we want to avoid modifying
the strp code and introducing new ops, perhaps we could revert to the
simplest solution:
'''
void sk_psock_start_strp(struct sock *sk, struct sk_psock *psock)
{
...
sk->sk_data_ready = sk_psock_strp_data_ready;
/* Replacement */
psock->saved_read_sock = sk->sk_socket->ops->read_sock;
sk->sk_socket->ops->read_sock = tcp_read_sock_noack;
}
'''
If acceptable, I can incorporate this approach in the next patch version.
BTW, it seems CI runs checkpatch.pl with the '--strict' argument, so I
missed a few warnings compared to CI; I will fix them in the next revision.
* Re: [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation
2024-12-24 7:16 ` Jiayuan Chen
@ 2024-12-27 14:09 ` Jakub Sitnicki
0 siblings, 0 replies; 10+ messages in thread
From: Jakub Sitnicki @ 2024-12-27 14:09 UTC (permalink / raw)
To: Jiayuan Chen
Cc: John Fastabend, bpf, martin.lau, ast, edumazet, davem, dsahern,
kuba, pabeni, linux-kernel, song, andrii, mhal, yonghong.song,
daniel, xiyou.wangcong, horms
On Tue, Dec 24, 2024 at 03:16 PM +08, Jiayuan Chen wrote:
> On Mon, Dec 23, 2024 at 11:57:58PM +0800, Jakub Sitnicki wrote:
>> On Mon, Dec 23, 2024 at 09:57 PM +01, Jakub Sitnicki wrote:
>> > On Thu, Dec 19, 2024 at 05:30 PM +08, Jiayuan Chen wrote:
>> >> Currently, not all modules using strparser have issues with
>> >> copied_seq miscalculation. The issue exists mainly with
>> >> bpf::sockmap + strparser because bpf::sockmap implements a
>> >> proprietary read interface for user-land: tcp_bpf_recvmsg_parser().
>> >>
>> >> Both this and strp_recv->tcp_read_sock update copied_seq, leading
>> >> to errors.
>> >>
>> >> This is why I rewrote the tcp_read_sock() interface specifically for
>> >> bpf::sockmap.
>> >
>> > All right. Looks like reusing read_skb is not going to pan out.
>> >
>> > But I think we should not give up just yet. It's easy to add new code.
>> >
>> > We can try to break up and parametrize tcp_read_sock - if other
>> > maintainers are not against it. Does something like this work for you?
>> >
>> > https://github.com/jsitnicki/linux/commits/review/stp-copied_seq/idea-2/
>>
>> Actually it reads better if we just add early bailout to tcp_read_sock:
>>
>> https://github.com/jsitnicki/linux/commits/review/stp-copied_seq/idea-2.1/
>>
>> ---8<---
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 6a07d98017f7..6564ea3b6cd4 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -1565,12 +1565,13 @@ EXPORT_SYMBOL(tcp_recv_skb);
>> * or for 'peeking' the socket using this routine
>> * (although both would be easy to implement).
>> */
>> -int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>> - sk_read_actor_t recv_actor)
>> +static inline int __tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>> + sk_read_actor_t recv_actor, bool noack,
>> + u32 *copied_seq)
>> {
>> struct sk_buff *skb;
>> struct tcp_sock *tp = tcp_sk(sk);
>> - u32 seq = tp->copied_seq;
>> + u32 seq = *copied_seq;
>> u32 offset;
>> int copied = 0;
>>
>> @@ -1624,9 +1625,12 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>> tcp_eat_recv_skb(sk, skb);
>> if (!desc->count)
>> break;
>> - WRITE_ONCE(tp->copied_seq, seq);
>> + WRITE_ONCE(*copied_seq, seq);
>> }
>> - WRITE_ONCE(tp->copied_seq, seq);
>> + WRITE_ONCE(*copied_seq, seq);
>> +
>> + if (noack)
>> + goto out;
>>
>> tcp_rcv_space_adjust(sk);
>>
>> @@ -1635,10 +1639,25 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>> tcp_recv_skb(sk, seq, &offset);
>> tcp_cleanup_rbuf(sk, copied);
>> }
>> +out:
>> return copied;
>> }
>> +
>> +int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>> + sk_read_actor_t recv_actor)
>> +{
>> + return __tcp_read_sock(sk, desc, recv_actor, false,
>> + &tcp_sk(sk)->copied_seq);
>> +}
>> EXPORT_SYMBOL(tcp_read_sock);
>>
>> +int tcp_read_sock_noack(struct sock *sk, read_descriptor_t *desc,
>> + sk_read_actor_t recv_actor, u32 *copied_seq)
>> +{
>> + return __tcp_read_sock(sk, desc, recv_actor, true, copied_seq);
>> +}
>> +EXPORT_SYMBOL(tcp_read_sock_noack);
>> +
>> int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
>> {
>> struct sk_buff *skb;
>
> This modification definitely reduces code duplication and is more
> elegant than my previous idea. Also, if we want to avoid modifying
> the strp code and introducing new ops, perhaps we could revert to the
> simplest solution:
> '''
> void sk_psock_start_strp(struct sock *sk, struct sk_psock *psock)
> {
> ...
> sk->sk_data_ready = sk_psock_strp_data_ready;
> /* Replacement */
> psock->saved_read_sock = sk->sk_socket->ops->read_sock;
> sk->sk_socket->ops->read_sock = tcp_read_sock_noack;
> }
> '''
> If acceptable, I can incorporate this approach in the next patch version.
SGTM
Thread overview: 10+ messages
2024-12-18 5:34 [PATCH bpf v3 0/2] bpf: fix wrong copied_seq calculation and add tests Jiayuan Chen
2024-12-18 5:34 ` [PATCH bpf v3 1/2] bpf: fix wrong copied_seq calculation Jiayuan Chen
2024-12-18 15:35 ` Jakub Sitnicki
2024-12-19 9:30 ` Jiayuan Chen
2024-12-20 7:06 ` John Fastabend
2024-12-23 20:57 ` Jakub Sitnicki
2024-12-23 22:57 ` Jakub Sitnicki
2024-12-24 7:16 ` Jiayuan Chen
2024-12-27 14:09 ` Jakub Sitnicki
2024-12-18 5:34 ` [PATCH bpf v3 2/2] selftests/bpf: add strparser test for bpf Jiayuan Chen