* [PATCH v7 bpf-next 1/6] tcp: Move tcp_ns_to_ts() to tcp.h
2023-12-21 1:28 [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
@ 2023-12-21 1:28 ` Kuniyuki Iwashima
2023-12-21 1:28 ` [PATCH v7 bpf-next 2/6] tcp: Move skb_steal_sock() to request_sock.h Kuniyuki Iwashima
` (5 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-21 1:28 UTC (permalink / raw)
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
We will support arbitrary SYN Cookie with BPF.
When the BPF prog validates the ACK and the kfunc allocates a reqsk,
we need to call tcp_ns_to_ts() to calculate the TSval offset for
later use:
  time
  t0 : Send SYN+ACK
       -> tsval = Initial TSval (Random Number)

  t1 : Recv ACK of 3WHS
       -> tsoff = TSecr - tcp_ns_to_ts(usec_ts_ok, tcp_clock_ns())
                = Initial TSval - t1

  t2 : Send ACK
       -> tsval = t2 + tsoff
                = Initial TSval + (t2 - t1)
                = Initial TSval + Time Delta (x)

  (x) Note that the time delta does not include the initial RTT
      from t0 to t1.
Let's move tcp_ns_to_ts() to tcp.h.
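The offset arithmetic in the timeline above can be replayed in plain userspace C. This is an illustrative sketch, not kernel code: tsoff_at_t1() and tsval_at() are hypothetical helper names, and all values are 32-bit TS clock ticks, so the subtraction wraps modulo 2^32 just as TCP timestamps do.

```c
#include <stdint.h>

/* Hypothetical model of the tsoff arithmetic from the timeline above.
 * All values are TS clock ticks truncated to u32, so subtraction wraps
 * modulo 2^32, matching TCP timestamp behaviour.
 */

/* t1: TSecr echoes the Initial TSval sent at t0, so the offset between
 * the peer-visible timestamp clock and our local clock is TSecr - t1.
 */
static uint32_t tsoff_at_t1(uint32_t tsecr, uint32_t t1)
{
	return tsecr - t1;
}

/* t2: every later TSval is local clock + tsoff
 *       = Initial TSval + (t2 - t1);
 * the t0..t1 RTT never enters the delta.
 */
static uint32_t tsval_at(uint32_t initial_tsval, uint32_t t1, uint32_t t2)
{
	return t2 + tsoff_at_t1(initial_tsval, t1);
}
```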
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
include/net/tcp.h | 9 +++++++++
net/ipv4/syncookies.c | 9 ---------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f5ca4abaee8b..07dd18e2f3af 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -577,6 +577,15 @@ static inline u32 tcp_cookie_time(void)
return val;
}
+/* Convert one nsec 64bit timestamp to ts (ms or usec resolution) */
+static inline u64 tcp_ns_to_ts(bool usec_ts, u64 val)
+{
+ if (usec_ts)
+ return div_u64(val, NSEC_PER_USEC);
+
+ return div_u64(val, NSEC_PER_MSEC);
+}
+
u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
u16 *mssp);
__u32 cookie_v4_init_sequence(const struct sk_buff *skb, __u16 *mss);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 61f1c96cfe63..981944c22820 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -51,15 +51,6 @@ static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport,
count, &syncookie_secret[c]);
}
-/* Convert one nsec 64bit timestamp to ts (ms or usec resolution) */
-static u64 tcp_ns_to_ts(bool usec_ts, u64 val)
-{
- if (usec_ts)
- return div_u64(val, NSEC_PER_USEC);
-
- return div_u64(val, NSEC_PER_MSEC);
-}
-
/*
* when syncookies are in effect and tcp timestamps are enabled we encode
* tcp options in the lower bits of the timestamp value that will be
--
2.30.2
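For reference, the moved helper can be mirrored in userspace; this is a sketch under the assumption that div_u64() reduces to plain 64-bit division and the NSEC_PER_* constants match the kernel's values.

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
#define NSEC_PER_MSEC 1000000ULL

/* Userspace replica of the helper moved by this patch: convert one nsec
 * 64-bit timestamp to a TS value in usec or msec resolution. div_u64()
 * from the kernel is plain 64-bit division here.
 */
static uint64_t tcp_ns_to_ts(int usec_ts, uint64_t val)
{
	if (usec_ts)
		return val / NSEC_PER_USEC;

	return val / NSEC_PER_MSEC;
}
```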
* [PATCH v7 bpf-next 2/6] tcp: Move skb_steal_sock() to request_sock.h
2023-12-21 1:28 [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
2023-12-21 1:28 ` [PATCH v7 bpf-next 1/6] tcp: Move tcp_ns_to_ts() to tcp.h Kuniyuki Iwashima
@ 2023-12-21 1:28 ` Kuniyuki Iwashima
2023-12-21 1:28 ` [PATCH v7 bpf-next 3/6] bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock() Kuniyuki Iwashima
` (4 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-21 1:28 UTC (permalink / raw)
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
We will support arbitrary SYN Cookie with BPF.
If the BPF prog validates the ACK and the kfunc allocates a reqsk, it
is carried to the TCP stack as skb->sk with req->syncookie set to 1.
In skb_steal_sock(), we need to check inet_reqsk(sk)->syncookie to
see if the reqsk was created by the kfunc. However, inet_reqsk()
is not available in sock.h.
Let's move skb_steal_sock() to request_sock.h.
While at it, we refactor skb_steal_sock() so that it returns early
when skb->sk is NULL, to minimise the diff in the following patch.
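The refactored shape can be sketched with stand-in types; struct sock and struct sk_buff below are minimal mocks, not the kernel definitions. The point of the early return is that the main path stays unindented, so the next patch can add its reqsk branch with a small diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal mocks standing in for the kernel types. */
struct sock { bool refcounted; };
struct sk_buff {
	struct sock *sk;
	void *destructor;
	bool prefetched;
};

static bool skb_sk_is_prefetched(const struct sk_buff *skb) { return skb->prefetched; }
static bool sk_is_refcounted(const struct sock *sk) { return sk->refcounted; }

/* Model of the refactored skb_steal_sock(): return early on !skb->sk so
 * the main path stays flat.
 */
static struct sock *skb_steal_sock(struct sk_buff *skb, bool *refcounted,
				   bool *prefetched)
{
	struct sock *sk = skb->sk;

	if (!sk) {				/* the new early return */
		*prefetched = false;
		*refcounted = false;
		return NULL;
	}

	*prefetched = skb_sk_is_prefetched(skb);
	if (*prefetched)
		*refcounted = sk_is_refcounted(sk);
	else
		*refcounted = true;

	skb->destructor = NULL;			/* caller now owns the socket */
	skb->sk = NULL;
	return sk;
}
```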
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
include/net/request_sock.h | 28 ++++++++++++++++++++++++++++
include/net/sock.h | 25 -------------------------
2 files changed, 28 insertions(+), 25 deletions(-)
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 144c39db9898..26c630c40abb 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -83,6 +83,34 @@ static inline struct sock *req_to_sk(struct request_sock *req)
return (struct sock *)req;
}
+/**
+ * skb_steal_sock - steal a socket from an sk_buff
+ * @skb: sk_buff to steal the socket from
+ * @refcounted: is set to true if the socket is reference-counted
+ * @prefetched: is set to true if the socket was assigned from bpf
+ */
+static inline struct sock *skb_steal_sock(struct sk_buff *skb,
+ bool *refcounted, bool *prefetched)
+{
+ struct sock *sk = skb->sk;
+
+ if (!sk) {
+ *prefetched = false;
+ *refcounted = false;
+ return NULL;
+ }
+
+ *prefetched = skb_sk_is_prefetched(skb);
+ if (*prefetched)
+ *refcounted = sk_is_refcounted(sk);
+ else
+ *refcounted = true;
+
+ skb->destructor = NULL;
+ skb->sk = NULL;
+ return sk;
+}
+
static inline struct request_sock *
reqsk_alloc(const struct request_sock_ops *ops, struct sock *sk_listener,
bool attach_listener)
diff --git a/include/net/sock.h b/include/net/sock.h
index 8b6fe164b218..1aa19c8510cc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2838,31 +2838,6 @@ sk_is_refcounted(struct sock *sk)
return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
}
-/**
- * skb_steal_sock - steal a socket from an sk_buff
- * @skb: sk_buff to steal the socket from
- * @refcounted: is set to true if the socket is reference-counted
- * @prefetched: is set to true if the socket was assigned from bpf
- */
-static inline struct sock *
-skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
-{
- if (skb->sk) {
- struct sock *sk = skb->sk;
-
- *refcounted = true;
- *prefetched = skb_sk_is_prefetched(skb);
- if (*prefetched)
- *refcounted = sk_is_refcounted(sk);
- skb->destructor = NULL;
- skb->sk = NULL;
- return sk;
- }
- *prefetched = false;
- *refcounted = false;
- return NULL;
-}
-
/* Checks if this SKB belongs to an HW offloaded socket
* and whether any SW fallbacks are required based on dev.
* Check decrypted mark in case skb_orphan() cleared socket.
--
2.30.2
* [PATCH v7 bpf-next 3/6] bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock().
2023-12-21 1:28 [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
2023-12-21 1:28 ` [PATCH v7 bpf-next 1/6] tcp: Move tcp_ns_to_ts() to tcp.h Kuniyuki Iwashima
2023-12-21 1:28 ` [PATCH v7 bpf-next 2/6] tcp: Move skb_steal_sock() to request_sock.h Kuniyuki Iwashima
@ 2023-12-21 1:28 ` Kuniyuki Iwashima
2023-12-21 1:28 ` [PATCH v7 bpf-next 4/6] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check() Kuniyuki Iwashima
` (3 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-21 1:28 UTC (permalink / raw)
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
We will support arbitrary SYN Cookie with BPF.
If the BPF prog validates the ACK and the kfunc allocates a reqsk, it
is carried to the TCP stack as skb->sk with req->syncookie set to 1.
Also, the reqsk has its listener as req->rsk_listener, with no refcnt
taken.
When the TCP stack looks up a socket from the skb, we steal
inet_reqsk(skb->sk)->rsk_listener in skb_steal_sock() so that
the skb will be processed in cookie_v[46]_check() with the
listener.
Note that we do not clear skb->sk and skb->destructor so that we
can carry the reqsk to cookie_v[46]_check().
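The listener-stealing branch can be modelled with mock types; everything here is a stand-in (the TCP_NEW_SYN_RECV value is illustrative, and steal_listener() is a hypothetical name for the branch added inside skb_steal_sock()). A kfunc-allocated reqsk (syncookie set, seen on RX) carries its listener with no refcnt, so the steal hands back the listener and detaches it from the reqsk so a later free cannot touch it.

```c
#include <stdbool.h>
#include <stddef.h>

#define TCP_NEW_SYN_RECV 12	/* illustrative constant, not the kernel header */

/* Mock stand-ins: a request_sock embeds a sock as its first member, so
 * inet_reqsk() is a downcast, as in the kernel.
 */
struct sock { int sk_state; };
struct request_sock {
	struct sock sk;
	int syncookie;
	struct sock *rsk_listener;
};

static struct request_sock *inet_reqsk(struct sock *sk)
{
	return (struct request_sock *)sk;
}

/* Model of the branch patch 3 adds: for a kfunc-allocated reqsk, return
 * the listener (no refcnt was taken on it) and clear rsk_listener so the
 * skb destructor cannot drop a reference it never held.
 */
static struct sock *steal_listener(struct sock *sk, bool *refcounted)
{
	if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
		struct request_sock *req = inet_reqsk(sk);

		*refcounted = false;
		sk = req->rsk_listener;
		req->rsk_listener = NULL;
	}
	return sk;
}
```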
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
include/net/request_sock.h | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 26c630c40abb..8839133d6f6b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -101,10 +101,21 @@ static inline struct sock *skb_steal_sock(struct sk_buff *skb,
}
*prefetched = skb_sk_is_prefetched(skb);
- if (*prefetched)
+ if (*prefetched) {
+#if IS_ENABLED(CONFIG_SYN_COOKIES)
+ if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
+ struct request_sock *req = inet_reqsk(sk);
+
+ *refcounted = false;
+ sk = req->rsk_listener;
+ req->rsk_listener = NULL;
+ return sk;
+ }
+#endif
*refcounted = sk_is_refcounted(sk);
- else
+ } else {
*refcounted = true;
+ }
skb->destructor = NULL;
skb->sk = NULL;
--
2.30.2
* [PATCH v7 bpf-next 4/6] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
2023-12-21 1:28 [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
` (2 preceding siblings ...)
2023-12-21 1:28 ` [PATCH v7 bpf-next 3/6] bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock() Kuniyuki Iwashima
@ 2023-12-21 1:28 ` Kuniyuki Iwashima
2023-12-21 1:28 ` [PATCH v7 bpf-next 5/6] bpf: tcp: Support arbitrary SYN Cookie Kuniyuki Iwashima
` (2 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-21 1:28 UTC (permalink / raw)
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
We will support arbitrary SYN Cookie with BPF in the following
patch.
If the BPF prog validates the ACK and the kfunc allocates a reqsk, it
is carried to cookie_v[46]_check() as skb->sk. If skb->sk is not
NULL, we call cookie_bpf_check().
Then, we clear skb->sk and skb->destructor, which is necessary so
that we do not hold refcnts for the reqsk and the listener. See the
following patch for details.
After that, we finish initialisation for the remaining fields with
cookie_tcp_reqsk_init().
Note that the server side WScale is set only for non-BPF SYN Cookie.
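The detach-then-init ordering in cookie_bpf_check() can be sketched in userspace with mock types. This is a hypothetical model, not the kernel code: the `fail` parameter is an added knob to exercise the error path, and reqsk_free() is modelled as free(). The key property is that skb->sk and skb->destructor are cleared before anything can fail, so freeing the skb afterwards never double-handles the reqsk.

```c
#include <stddef.h>
#include <stdlib.h>

/* Mock stand-ins for the kernel types. */
struct request_sock { int initialised; };
struct sk_buff {
	struct request_sock *sk;
	void *destructor;
};

/* Stand-in for cookie_tcp_reqsk_init(); `fail` forces the error path. */
static int cookie_tcp_reqsk_init(struct request_sock *req, int fail)
{
	if (fail)
		return -1;
	req->initialised = 1;
	return 0;
}

/* Model of cookie_bpf_check(): detach the reqsk from the skb first, then
 * finish initialisation; on failure the reqsk is freed and NULL returned.
 */
static struct request_sock *cookie_bpf_check(struct sk_buff *skb, int fail)
{
	struct request_sock *req = skb->sk;

	skb->sk = NULL;
	skb->destructor = NULL;

	if (cookie_tcp_reqsk_init(req, fail)) {
		free(req);		/* models reqsk_free() */
		req = NULL;
	}
	return req;
}
```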
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
include/net/tcp.h | 20 ++++++++++++++++++++
net/ipv4/syncookies.c | 31 +++++++++++++++++++++++++++----
net/ipv6/syncookies.c | 13 +++++++++----
3 files changed, 56 insertions(+), 8 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 07dd18e2f3af..a63916f41f77 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -599,6 +599,26 @@ static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
dst_feature(dst, RTAX_FEATURE_ECN);
}
+#if IS_ENABLED(CONFIG_BPF)
+static inline bool cookie_bpf_ok(struct sk_buff *skb)
+{
+ return skb->sk;
+}
+
+struct request_sock *cookie_bpf_check(struct sock *sk, struct sk_buff *skb);
+#else
+static inline bool cookie_bpf_ok(struct sk_buff *skb)
+{
+ return false;
+}
+
+static inline struct request_sock *cookie_bpf_check(struct sock *sk,
+						    struct sk_buff *skb)
+{
+ return NULL;
+}
+#endif
+
/* From net/ipv6/syncookies.c */
int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th);
struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 981944c22820..be88bf586ff9 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -295,6 +295,24 @@ static int cookie_tcp_reqsk_init(struct sock *sk, struct sk_buff *skb,
return 0;
}
+#if IS_ENABLED(CONFIG_BPF)
+struct request_sock *cookie_bpf_check(struct sock *sk, struct sk_buff *skb)
+{
+ struct request_sock *req = inet_reqsk(skb->sk);
+
+ skb->sk = NULL;
+ skb->destructor = NULL;
+
+ if (cookie_tcp_reqsk_init(sk, skb, req)) {
+ reqsk_free(req);
+ req = NULL;
+ }
+
+ return req;
+}
+EXPORT_SYMBOL_GPL(cookie_bpf_check);
+#endif
+
struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops,
struct sock *sk, struct sk_buff *skb,
struct tcp_options_received *tcp_opt,
@@ -395,9 +413,13 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
!th->ack || th->rst)
goto out;
- req = cookie_tcp_check(net, sk, skb);
- if (IS_ERR(req))
- goto out;
+ if (cookie_bpf_ok(skb)) {
+ req = cookie_bpf_check(sk, skb);
+ } else {
+ req = cookie_tcp_check(net, sk, skb);
+ if (IS_ERR(req))
+ goto out;
+ }
if (!req)
goto out_drop;
@@ -445,7 +467,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
ireq->wscale_ok, &rcv_wscale,
dst_metric(&rt->dst, RTAX_INITRWND));
- ireq->rcv_wscale = rcv_wscale;
+ if (!req->syncookie)
+ ireq->rcv_wscale = rcv_wscale;
ireq->ecn_ok &= cookie_ecn_ok(net, &rt->dst);
ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst);
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index c8d2ca27220c..6b9c69278819 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -182,9 +182,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
!th->ack || th->rst)
goto out;
- req = cookie_tcp_check(net, sk, skb);
- if (IS_ERR(req))
- goto out;
+ if (cookie_bpf_ok(skb)) {
+ req = cookie_bpf_check(sk, skb);
+ } else {
+ req = cookie_tcp_check(net, sk, skb);
+ if (IS_ERR(req))
+ goto out;
+ }
if (!req)
goto out_drop;
@@ -247,7 +251,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
ireq->wscale_ok, &rcv_wscale,
dst_metric(dst, RTAX_INITRWND));
- ireq->rcv_wscale = rcv_wscale;
+ if (!req->syncookie)
+ ireq->rcv_wscale = rcv_wscale;
ireq->ecn_ok &= cookie_ecn_ok(net, dst);
ret = tcp_get_cookie_sock(sk, skb, req, dst);
--
2.30.2
* [PATCH v7 bpf-next 5/6] bpf: tcp: Support arbitrary SYN Cookie.
2023-12-21 1:28 [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
` (3 preceding siblings ...)
2023-12-21 1:28 ` [PATCH v7 bpf-next 4/6] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check() Kuniyuki Iwashima
@ 2023-12-21 1:28 ` Kuniyuki Iwashima
2024-01-12 1:44 ` Martin KaFai Lau
2023-12-21 1:28 ` [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
2024-01-12 6:20 ` [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Martin KaFai Lau
6 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-21 1:28 UTC (permalink / raw)
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
This patch adds a new kfunc available at TC hook to support arbitrary
SYN Cookie.
The basic usage is as follows:
  struct bpf_tcp_req_attrs attrs = {
      .mss = mss,
      .wscale_ok = wscale_ok,
      .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
      .snd_wscale = snd_wscale, /* Client's WScale < 15 */
      .tstamp_ok = tstamp_ok,
      .rcv_tsval = tsval,
      .rcv_tsecr = tsecr, /* Server's Initial TSval */
      .usec_ts_ok = usec_ts_ok,
      .sack_ok = sack_ok,
      .ecn_ok = ecn_ok,
  }

  skc = bpf_skc_lookup_tcp(...);
  sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
  bpf_sk_assign_tcp_reqsk(skb, sk, attrs, sizeof(attrs));
  bpf_sk_release(skc);
bpf_sk_assign_tcp_reqsk() takes skb, a listener sk, and struct
bpf_tcp_req_attrs and allocates reqsk and configures it. Then,
bpf_sk_assign_tcp_reqsk() links reqsk with skb and the listener.
The notable thing here is that we do not hold a refcnt for either the
reqsk or the listener. To differentiate that, we mark
reqsk->syncookie, which is only used on TX for now. So, if
reqsk->syncookie is 1 on RX, it means that the reqsk was allocated
by the kfunc.
When the skb is freed, sock_pfree() checks if reqsk->syncookie is 1,
and in that case, we set reqsk->rsk_listener to NULL before calling
reqsk_free(), as the reqsk does not hold a refcnt on the listener.
When the TCP stack looks up a socket from the skb, we steal the
listener from the reqsk in skb_steal_sock() and create a full sk
in cookie_v[46]_check().
The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock()
after creating a full sk.
Note that we can extend struct bpf_tcp_req_attrs in the future when
we add a new attribute that is determined in 3WHS.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
include/net/tcp.h | 13 ++++++
net/core/filter.c | 113 +++++++++++++++++++++++++++++++++++++++++++++-
net/core/sock.c | 14 +++++-
3 files changed, 136 insertions(+), 4 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a63916f41f77..20619df8819e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -600,6 +600,19 @@ static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
}
#if IS_ENABLED(CONFIG_BPF)
+struct bpf_tcp_req_attrs {
+ u32 rcv_tsval;
+ u32 rcv_tsecr;
+ u16 mss;
+ u8 rcv_wscale;
+ u8 snd_wscale;
+ u8 ecn_ok;
+ u8 wscale_ok;
+ u8 sack_ok;
+ u8 tstamp_ok;
+ u8 usec_ts_ok;
+};
+
static inline bool cookie_bpf_ok(struct sk_buff *skb)
{
return skb->sk;
diff --git a/net/core/filter.c b/net/core/filter.c
index 24061f29c9dd..961c2d30bd72 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11837,6 +11837,105 @@ __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
return 0;
}
+
+__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
+ struct bpf_tcp_req_attrs *attrs, int attrs__sz)
+{
+#if IS_ENABLED(CONFIG_SYN_COOKIES)
+ const struct request_sock_ops *ops;
+ struct inet_request_sock *ireq;
+ struct tcp_request_sock *treq;
+ struct request_sock *req;
+ struct net *net;
+ __u16 min_mss;
+ u32 tsoff = 0;
+
+ if (attrs__sz != sizeof(*attrs))
+ return -EINVAL;
+
+ if (!sk)
+ return -EINVAL;
+
+ if (!skb_at_tc_ingress(skb))
+ return -EINVAL;
+
+ net = dev_net(skb->dev);
+ if (net != sock_net(sk))
+ return -ENETUNREACH;
+
+ switch (skb->protocol) {
+ case htons(ETH_P_IP):
+ ops = &tcp_request_sock_ops;
+ min_mss = 536;
+ break;
+#if IS_BUILTIN(CONFIG_IPV6)
+ case htons(ETH_P_IPV6):
+ ops = &tcp6_request_sock_ops;
+ min_mss = IPV6_MIN_MTU - 60;
+ break;
+#endif
+ default:
+ return -EINVAL;
+ }
+
+ if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN ||
+ sk_is_mptcp(sk))
+ return -EINVAL;
+
+ if (attrs->mss < min_mss)
+ return -EINVAL;
+
+ if (attrs->wscale_ok) {
+ if (!READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
+ return -EINVAL;
+
+ if (attrs->snd_wscale > TCP_MAX_WSCALE ||
+ attrs->rcv_wscale > TCP_MAX_WSCALE)
+ return -EINVAL;
+ }
+
+ if (attrs->sack_ok && !READ_ONCE(net->ipv4.sysctl_tcp_sack))
+ return -EINVAL;
+
+ if (attrs->tstamp_ok) {
+ if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
+ return -EINVAL;
+
+ tsoff = attrs->rcv_tsecr - tcp_ns_to_ts(attrs->usec_ts_ok, tcp_clock_ns());
+ }
+
+ req = inet_reqsk_alloc(ops, sk, false);
+ if (!req)
+ return -ENOMEM;
+
+ ireq = inet_rsk(req);
+ treq = tcp_rsk(req);
+
+ req->rsk_listener = sk;
+ req->syncookie = 1;
+ req->mss = attrs->mss;
+ req->ts_recent = attrs->rcv_tsval;
+
+ ireq->snd_wscale = attrs->snd_wscale;
+ ireq->rcv_wscale = attrs->rcv_wscale;
+ ireq->tstamp_ok = !!attrs->tstamp_ok;
+ ireq->sack_ok = !!attrs->sack_ok;
+ ireq->wscale_ok = !!attrs->wscale_ok;
+ ireq->ecn_ok = !!attrs->ecn_ok;
+
+ treq->req_usec_ts = !!attrs->usec_ts_ok;
+ treq->ts_off = tsoff;
+
+ skb_orphan(skb);
+ skb->sk = req_to_sk(req);
+ skb->destructor = sock_pfree;
+
+ return 0;
+#else
+ return -EOPNOTSUPP;
+#endif
+}
+
__bpf_kfunc_end_defs();
int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags,
@@ -11865,6 +11964,10 @@ BTF_SET8_START(bpf_kfunc_check_set_sock_addr)
BTF_ID_FLAGS(func, bpf_sock_addr_set_sun_path)
BTF_SET8_END(bpf_kfunc_check_set_sock_addr)
+BTF_SET8_START(bpf_kfunc_check_set_tcp_reqsk)
+BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk)
+BTF_SET8_END(bpf_kfunc_check_set_tcp_reqsk)
+
static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
.owner = THIS_MODULE,
.set = &bpf_kfunc_check_set_skb,
@@ -11880,6 +11983,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_addr = {
.set = &bpf_kfunc_check_set_sock_addr,
};
+static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
+ .owner = THIS_MODULE,
+ .set = &bpf_kfunc_check_set_tcp_reqsk,
+};
+
static int __init bpf_kfunc_init(void)
{
int ret;
@@ -11895,8 +12003,9 @@ static int __init bpf_kfunc_init(void)
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_NETFILTER, &bpf_kfunc_set_skb);
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
- return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
- &bpf_kfunc_set_sock_addr);
+ ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+ &bpf_kfunc_set_sock_addr);
+ return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
}
late_initcall(bpf_kfunc_init);
diff --git a/net/core/sock.c b/net/core/sock.c
index 446e945f736b..1116338f8ff4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2575,8 +2575,18 @@ EXPORT_SYMBOL(sock_efree);
#ifdef CONFIG_INET
void sock_pfree(struct sk_buff *skb)
{
- if (sk_is_refcounted(skb->sk))
- sock_gen_put(skb->sk);
+ struct sock *sk = skb->sk;
+
+ if (!sk_is_refcounted(sk))
+ return;
+
+ if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
+ inet_reqsk(sk)->rsk_listener = NULL;
+ reqsk_free(inet_reqsk(sk));
+ return;
+ }
+
+ sock_gen_put(sk);
}
EXPORT_SYMBOL(sock_pfree);
#endif /* CONFIG_INET */
--
2.30.2
* Re: [PATCH v7 bpf-next 5/6] bpf: tcp: Support arbitrary SYN Cookie.
2023-12-21 1:28 ` [PATCH v7 bpf-next 5/6] bpf: tcp: Support arbitrary SYN Cookie Kuniyuki Iwashima
@ 2024-01-12 1:44 ` Martin KaFai Lau
2024-01-15 20:13 ` Kuniyuki Iwashima
0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2024-01-12 1:44 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Kuniyuki Iwashima, bpf, netdev, Eric Dumazet, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Paolo Abeni
On 12/20/23 5:28 PM, Kuniyuki Iwashima wrote:
> This patch adds a new kfunc available at TC hook to support arbitrary
> SYN Cookie.
>
> The basic usage is as follows:
>
> struct bpf_tcp_req_attrs attrs = {
> .mss = mss,
> .wscale_ok = wscale_ok,
> .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
> .snd_wscale = snd_wscale, /* Client's WScale < 15 */
> .tstamp_ok = tstamp_ok,
> .rcv_tsval = tsval,
> .rcv_tsecr = tsecr, /* Server's Initial TSval */
> .usec_ts_ok = usec_ts_ok,
> .sack_ok = sack_ok,
> .ecn_ok = ecn_ok,
> }
>
> skc = bpf_skc_lookup_tcp(...);
> sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
> bpf_sk_assign_tcp_reqsk(skb, sk, attrs, sizeof(attrs));
> bpf_sk_release(skc);
>
> bpf_sk_assign_tcp_reqsk() takes skb, a listener sk, and struct
> bpf_tcp_req_attrs and allocates reqsk and configures it. Then,
> bpf_sk_assign_tcp_reqsk() links reqsk with skb and the listener.
>
> The notable thing here is that we do not hold refcnt for both reqsk
> and listener. To differentiate that, we mark reqsk->syncookie, which
> is only used in TX for now. So, if reqsk->syncookie is 1 in RX, it
> means that the reqsk is allocated by kfunc.
>
> When skb is freed, sock_pfree() checks if reqsk->syncookie is 1,
> and in that case, we set NULL to reqsk->rsk_listener before calling
> reqsk_free() as reqsk does not hold a refcnt of the listener.
>
> When the TCP stack looks up a socket from the skb, we steal the
> listener from the reqsk in skb_steal_sock() and create a full sk
> in cookie_v[46]_check().
>
> The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock()
> after creating a full sk.
>
> Note that we can extend struct bpf_tcp_req_attrs in the future when
> we add a new attribute that is determined in 3WHS.
Noticed a few final details.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
> include/net/tcp.h | 13 ++++++
> net/core/filter.c | 113 +++++++++++++++++++++++++++++++++++++++++++++-
> net/core/sock.c | 14 +++++-
> 3 files changed, 136 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index a63916f41f77..20619df8819e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -600,6 +600,19 @@ static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
> }
>
> #if IS_ENABLED(CONFIG_BPF)
> +struct bpf_tcp_req_attrs {
> + u32 rcv_tsval;
> + u32 rcv_tsecr;
> + u16 mss;
> + u8 rcv_wscale;
> + u8 snd_wscale;
> + u8 ecn_ok;
> + u8 wscale_ok;
> + u8 sack_ok;
> + u8 tstamp_ok;
> + u8 usec_ts_ok;
Add "u8 reserved[3];" for the 3 bytes tail padding.
> +};
> +
> static inline bool cookie_bpf_ok(struct sk_buff *skb)
> {
> return skb->sk;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 24061f29c9dd..961c2d30bd72 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -11837,6 +11837,105 @@ __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
>
> return 0;
> }
> +
> +__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
> + struct bpf_tcp_req_attrs *attrs, int attrs__sz)
> +{
> +#if IS_ENABLED(CONFIG_SYN_COOKIES)
> + const struct request_sock_ops *ops;
> + struct inet_request_sock *ireq;
> + struct tcp_request_sock *treq;
> + struct request_sock *req;
> + struct net *net;
> + __u16 min_mss;
> + u32 tsoff = 0;
> +
> + if (attrs__sz != sizeof(*attrs))
> + return -EINVAL;
> +
> + if (!sk)
> + return -EINVAL;
> +
> + if (!skb_at_tc_ingress(skb))
> + return -EINVAL;
> +
> + net = dev_net(skb->dev);
> + if (net != sock_net(sk))
> + return -ENETUNREACH;
> +
> + switch (skb->protocol) {
> + case htons(ETH_P_IP):
> + ops = &tcp_request_sock_ops;
> + min_mss = 536;
> + break;
> +#if IS_BUILTIN(CONFIG_IPV6)
> + case htons(ETH_P_IPV6):
> + ops = &tcp6_request_sock_ops;
> + min_mss = IPV6_MIN_MTU - 60;
> + break;
> +#endif
> + default:
> + return -EINVAL;
> + }
> +
> + if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN ||
> + sk_is_mptcp(sk))
> + return -EINVAL;
> +
and check for:
if (attrs->reserved[0] || attrs->reserved[1] || attrs->reserved[2])
return -EINVAL;
It will be safer if it needs to extend "struct bpf_tcp_req_attrs". There is an
existing example in __bpf_nf_ct_lookup() when checking the 'struct bpf_ct_opts
*opts'.
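The tail padding the review points out is easy to verify in userspace with fixed-width types mirroring the struct: 2 x u32 + u16 + 7 x u8 is 17 payload bytes, rounded up to 20 by the u32 alignment, leaving exactly the 3 unnamed bytes that "u8 reserved[3]" would make explicit (and checkable for zero).

```c
#include <stdint.h>
#include <stddef.h>

/* Userspace mirror of the struct from the patch, using fixed-width types
 * in place of the kernel's u32/u16/u8. Field order and count are taken
 * verbatim from the diff above.
 */
struct bpf_tcp_req_attrs {
	uint32_t rcv_tsval;
	uint32_t rcv_tsecr;
	uint16_t mss;
	uint8_t rcv_wscale;
	uint8_t snd_wscale;
	uint8_t ecn_ok;
	uint8_t wscale_ok;
	uint8_t sack_ok;
	uint8_t tstamp_ok;
	uint8_t usec_ts_ok;
	/* 3 bytes of implicit tail padding end the struct here; the review
	 * asks to name them "u8 reserved[3]" so a kfunc can reject nonzero
	 * values and reuse them later.
	 */
};
```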
* Re: [PATCH v7 bpf-next 5/6] bpf: tcp: Support arbitrary SYN Cookie.
2024-01-12 1:44 ` Martin KaFai Lau
@ 2024-01-15 20:13 ` Kuniyuki Iwashima
0 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2024-01-15 20:13 UTC (permalink / raw)
To: martin.lau
Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, kuniyu, netdev,
pabeni
From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Thu, 11 Jan 2024 17:44:55 -0800
> On 12/20/23 5:28 PM, Kuniyuki Iwashima wrote:
> > This patch adds a new kfunc available at TC hook to support arbitrary
> > SYN Cookie.
> >
> > The basic usage is as follows:
> >
> > struct bpf_tcp_req_attrs attrs = {
> > .mss = mss,
> > .wscale_ok = wscale_ok,
> > .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
> > .snd_wscale = snd_wscale, /* Client's WScale < 15 */
> > .tstamp_ok = tstamp_ok,
> > .rcv_tsval = tsval,
> > .rcv_tsecr = tsecr, /* Server's Initial TSval */
> > .usec_ts_ok = usec_ts_ok,
> > .sack_ok = sack_ok,
> > .ecn_ok = ecn_ok,
> > }
> >
> > skc = bpf_skc_lookup_tcp(...);
> > sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
> > bpf_sk_assign_tcp_reqsk(skb, sk, attrs, sizeof(attrs));
> > bpf_sk_release(skc);
> >
> > bpf_sk_assign_tcp_reqsk() takes skb, a listener sk, and struct
> > bpf_tcp_req_attrs and allocates reqsk and configures it. Then,
> > bpf_sk_assign_tcp_reqsk() links reqsk with skb and the listener.
> >
> > The notable thing here is that we do not hold refcnt for both reqsk
> > and listener. To differentiate that, we mark reqsk->syncookie, which
> > is only used in TX for now. So, if reqsk->syncookie is 1 in RX, it
> > means that the reqsk is allocated by kfunc.
> >
> > When skb is freed, sock_pfree() checks if reqsk->syncookie is 1,
> > and in that case, we set NULL to reqsk->rsk_listener before calling
> > reqsk_free() as reqsk does not hold a refcnt of the listener.
> >
> > When the TCP stack looks up a socket from the skb, we steal the
> > listener from the reqsk in skb_steal_sock() and create a full sk
> > in cookie_v[46]_check().
> >
> > The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock()
> > after creating a full sk.
> >
> > Note that we can extend struct bpf_tcp_req_attrs in the future when
> > we add a new attribute that is determined in 3WHS.
>
> Notice a few final details.
>
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> > include/net/tcp.h | 13 ++++++
> > net/core/filter.c | 113 +++++++++++++++++++++++++++++++++++++++++++++-
> > net/core/sock.c | 14 +++++-
> > 3 files changed, 136 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index a63916f41f77..20619df8819e 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -600,6 +600,19 @@ static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
> > }
> >
> > #if IS_ENABLED(CONFIG_BPF)
> > +struct bpf_tcp_req_attrs {
> > + u32 rcv_tsval;
> > + u32 rcv_tsecr;
> > + u16 mss;
> > + u8 rcv_wscale;
> > + u8 snd_wscale;
> > + u8 ecn_ok;
> > + u8 wscale_ok;
> > + u8 sack_ok;
> > + u8 tstamp_ok;
> > + u8 usec_ts_ok;
>
> Add "u8 reserved[3];" for the 3 bytes tail padding.
>
> > +};
> > +
> > static inline bool cookie_bpf_ok(struct sk_buff *skb)
> > {
> > return skb->sk;
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 24061f29c9dd..961c2d30bd72 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -11837,6 +11837,105 @@ __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
> >
> > return 0;
> > }
> > +
> > +__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
> > + struct bpf_tcp_req_attrs *attrs, int attrs__sz)
> > +{
> > +#if IS_ENABLED(CONFIG_SYN_COOKIES)
> > + const struct request_sock_ops *ops;
> > + struct inet_request_sock *ireq;
> > + struct tcp_request_sock *treq;
> > + struct request_sock *req;
> > + struct net *net;
> > + __u16 min_mss;
> > + u32 tsoff = 0;
> > +
> > + if (attrs__sz != sizeof(*attrs))
> > + return -EINVAL;
> > +
> > + if (!sk)
> > + return -EINVAL;
> > +
> > + if (!skb_at_tc_ingress(skb))
> > + return -EINVAL;
> > +
> > + net = dev_net(skb->dev);
> > + if (net != sock_net(sk))
> > + return -ENETUNREACH;
> > +
> > + switch (skb->protocol) {
> > + case htons(ETH_P_IP):
> > + ops = &tcp_request_sock_ops;
> > + min_mss = 536;
> > + break;
> > +#if IS_BUILTIN(CONFIG_IPV6)
> > + case htons(ETH_P_IPV6):
> > + ops = &tcp6_request_sock_ops;
> > + min_mss = IPV6_MIN_MTU - 60;
> > + break;
> > +#endif
> > + default:
> > + return -EINVAL;
> > + }
> > +
> > + if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN ||
> > + sk_is_mptcp(sk))
> > + return -EINVAL;
> > +
>
> and check for:
>
> if (attrs->reserved[0] || attrs->reserved[1] || attrs->reserved[2])
> return -EINVAL;
>
> It will be safer if it needs to extend "struct bpf_tcp_req_attrs". There is an
> existing example in __bpf_nf_ct_lookup() when checking the 'struct bpf_ct_opts
> *opts'.
I'll add that test in v8.
Thank you!
* [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-21 1:28 [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
` (4 preceding siblings ...)
2023-12-21 1:28 ` [PATCH v7 bpf-next 5/6] bpf: tcp: Support arbitrary SYN Cookie Kuniyuki Iwashima
@ 2023-12-21 1:28 ` Kuniyuki Iwashima
2023-12-21 6:35 ` Martin KaFai Lau
2023-12-21 16:44 ` Matthieu Baerts
2024-01-12 6:20 ` [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Martin KaFai Lau
6 siblings, 2 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-21 1:28 UTC (permalink / raw)
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
This commit adds a sample selftest to demonstrate how we can use
bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.
The test creates IPv4/IPv6 x TCP connections and transfers messages
over them on lo with the BPF tc prog attached.
The tc prog processes SYN and returns SYN+ACK with the following
ISN and TS. In a real use case, this part would be done by other
hosts.
MSB LSB
ISN: | 31 ... 8 | 7 6 | 5 | 4 | 3 2 1 0 |
| Hash_1 | MSS | ECN | SACK | WScale |
TS: | 31 ... 8 | 7 ... 0 |
| Random | Hash_2 |
WScale in SYN is reused in SYN+ACK.
The client returns ACK, and the tc prog recalculates the ISN and TS
from the ACK and validates the SYN Cookie.
If it's valid, the prog calls the kfunc to allocate a reqsk for the skb
and configures the reqsk based on the argument created from the SYN
Cookie. Later, the reqsk will be processed in cookie_v[46]_check() to
create a connection.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
tools/testing/selftests/bpf/bpf_kfuncs.h | 10 +
tools/testing/selftests/bpf/config | 1 +
.../bpf/prog_tests/tcp_custom_syncookie.c | 154 +++++
.../selftests/bpf/progs/bpf_tracing_net.h | 16 +
.../selftests/bpf/progs/test_siphash.h | 64 ++
.../bpf/progs/test_tcp_custom_syncookie.c | 572 ++++++++++++++++++
.../bpf/progs/test_tcp_custom_syncookie.h | 140 +++++
7 files changed, 957 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
create mode 100644 tools/testing/selftests/bpf/progs/test_siphash.h
create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
index b4e78c1eb37b..23c24f852f4f 100644
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
@@ -51,6 +51,16 @@ extern int bpf_dynptr_clone(const struct bpf_dynptr *ptr, struct bpf_dynptr *clo
extern int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
const __u8 *sun_path, __u32 sun_path__sz) __ksym;
+/* Description
+ * Allocate and configure a reqsk and link it with a listener and skb.
+ * Returns
+ * Error code
+ */
+struct sock;
+struct bpf_tcp_req_attrs;
+extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
+ struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym;
+
void *bpf_cast_to_kern_ctx(void *) __ksym;
void *bpf_rdonly_cast(void *obj, __u32 btf_id) __ksym;
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c125c441abc7..01f241ea2c67 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -81,6 +81,7 @@ CONFIG_NF_NAT=y
CONFIG_RC_CORE=y
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
+CONFIG_SYN_COOKIES=y
CONFIG_TEST_BPF=m
CONFIG_USERFAULTFD=y
CONFIG_VSOCKETS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
new file mode 100644
index 000000000000..ec5ea2c85119
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdlib.h>
+#include <net/if.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+#include "test_tcp_custom_syncookie.skel.h"
+
+#ifndef IPPROTO_MPTCP
+#define IPPROTO_MPTCP 262
+#endif
+
+static struct test_tcp_custom_syncookie_case {
+ int family, type;
+ char addr[16];
+ char name[10];
+} test_cases[] = {
+ {
+ .name = "IPv4 TCP ",
+ .family = AF_INET,
+ .type = SOCK_STREAM,
+ .addr = "127.0.0.1",
+ },
+ {
+ .name = "IPv6 TCP ",
+ .family = AF_INET6,
+ .type = SOCK_STREAM,
+ .addr = "::1",
+ },
+};
+
+static int setup_netns(void)
+{
+ if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
+ return -1;
+
+ if (!ASSERT_OK(system("ip link set dev lo up"), "ip"))
+ goto err;
+
+ if (!ASSERT_OK(write_sysctl("/proc/sys/net/ipv4/tcp_ecn", "1"),
+ "write_sysctl"))
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+static int setup_tc(struct test_tcp_custom_syncookie *skel)
+{
+ LIBBPF_OPTS(bpf_tc_hook, qdisc_lo, .attach_point = BPF_TC_INGRESS);
+ LIBBPF_OPTS(bpf_tc_opts, tc_attach,
+ .prog_fd = bpf_program__fd(skel->progs.tcp_custom_syncookie));
+
+ qdisc_lo.ifindex = if_nametoindex("lo");
+ if (!ASSERT_OK(bpf_tc_hook_create(&qdisc_lo), "qdisc add dev lo clsact"))
+ goto err;
+
+ if (!ASSERT_OK(bpf_tc_attach(&qdisc_lo, &tc_attach),
+ "filter add dev lo ingress"))
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+#define msg "Hello World"
+#define msglen 11
+
+static void transfer_message(int sender, int receiver)
+{
+ char buf[msglen];
+ int ret;
+
+ ret = send(sender, msg, msglen, 0);
+ if (!ASSERT_EQ(ret, msglen, "send"))
+ return;
+
+ memset(buf, 0, sizeof(buf));
+
+ ret = recv(receiver, buf, msglen, 0);
+ if (!ASSERT_EQ(ret, msglen, "recv"))
+ return;
+
+ ret = strncmp(buf, msg, msglen);
+ if (!ASSERT_EQ(ret, 0, "strncmp"))
+ return;
+}
+
+static void create_connection(struct test_tcp_custom_syncookie_case *test_case)
+{
+ int server, client, child;
+
+ server = start_server(test_case->family, test_case->type, test_case->addr, 0, 0);
+ if (!ASSERT_NEQ(server, -1, "start_server"))
+ return;
+
+ client = connect_to_fd(server, 0);
+ if (!ASSERT_NEQ(client, -1, "connect_to_fd"))
+ goto close_server;
+
+ child = accept(server, NULL, 0);
+ if (!ASSERT_NEQ(child, -1, "accept"))
+ goto close_client;
+
+ transfer_message(client, child);
+ transfer_message(child, client);
+
+ close(child);
+close_client:
+ close(client);
+close_server:
+ close(server);
+}
+
+void test_tcp_custom_syncookie(void)
+{
+ struct test_tcp_custom_syncookie *skel;
+ int i;
+
+ if (setup_netns())
+ return;
+
+ skel = test_tcp_custom_syncookie__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ if (setup_tc(skel))
+ goto destroy_skel;
+
+ for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
+ if (!test__start_subtest(test_cases[i].name))
+ continue;
+
+ skel->bss->handled_syn = false;
+ skel->bss->handled_ack = false;
+
+ create_connection(&test_cases[i]);
+
+ ASSERT_EQ(skel->bss->handled_syn, true, "SYN is not handled at tc.");
+ ASSERT_EQ(skel->bss->handled_ack, true, "ACK is not handled at tc");
+ }
+
+destroy_skel:
+ system("tc qdisc del dev lo clsact");
+
+ test_tcp_custom_syncookie__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 1bdc680b0e0e..49e525ad9856 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -51,9 +51,25 @@
#define ICSK_TIME_LOSS_PROBE 5
#define ICSK_TIME_REO_TIMEOUT 6
+#define ETH_ALEN 6
#define ETH_HLEN 14
+#define ETH_P_IP 0x0800
#define ETH_P_IPV6 0x86DD
+#define NEXTHDR_TCP 6
+
+#define TCPOPT_NOP 1
+#define TCPOPT_EOL 0
+#define TCPOPT_MSS 2
+#define TCPOPT_WINDOW 3
+#define TCPOPT_TIMESTAMP 8
+#define TCPOPT_SACK_PERM 4
+
+#define TCPOLEN_MSS 4
+#define TCPOLEN_WINDOW 3
+#define TCPOLEN_TIMESTAMP 10
+#define TCPOLEN_SACK_PERM 2
+
#define CHECKSUM_NONE 0
#define CHECKSUM_PARTIAL 3
diff --git a/tools/testing/selftests/bpf/progs/test_siphash.h b/tools/testing/selftests/bpf/progs/test_siphash.h
new file mode 100644
index 000000000000..5d3a7ec36780
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_siphash.h
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#ifndef _TEST_SIPHASH_H
+#define _TEST_SIPHASH_H
+
+/* include/linux/bitops.h */
+static inline u64 rol64(u64 word, unsigned int shift)
+{
+ return (word << (shift & 63)) | (word >> ((-shift) & 63));
+}
+
+/* include/linux/siphash.h */
+#define SIPHASH_PERMUTATION(a, b, c, d) ( \
+ (a) += (b), (b) = rol64((b), 13), (b) ^= (a), (a) = rol64((a), 32), \
+ (c) += (d), (d) = rol64((d), 16), (d) ^= (c), \
+ (a) += (d), (d) = rol64((d), 21), (d) ^= (a), \
+ (c) += (b), (b) = rol64((b), 17), (b) ^= (c), (c) = rol64((c), 32))
+
+#define SIPHASH_CONST_0 0x736f6d6570736575ULL
+#define SIPHASH_CONST_1 0x646f72616e646f6dULL
+#define SIPHASH_CONST_2 0x6c7967656e657261ULL
+#define SIPHASH_CONST_3 0x7465646279746573ULL
+
+/* lib/siphash.c */
+#define SIPROUND SIPHASH_PERMUTATION(v0, v1, v2, v3)
+
+#define PREAMBLE(len) \
+ u64 v0 = SIPHASH_CONST_0; \
+ u64 v1 = SIPHASH_CONST_1; \
+ u64 v2 = SIPHASH_CONST_2; \
+ u64 v3 = SIPHASH_CONST_3; \
+ u64 b = ((u64)(len)) << 56; \
+ v3 ^= key->key[1]; \
+ v2 ^= key->key[0]; \
+ v1 ^= key->key[1]; \
+ v0 ^= key->key[0];
+
+#define POSTAMBLE \
+ v3 ^= b; \
+ SIPROUND; \
+ SIPROUND; \
+ v0 ^= b; \
+ v2 ^= 0xff; \
+ SIPROUND; \
+ SIPROUND; \
+ SIPROUND; \
+ SIPROUND; \
+ return (v0 ^ v1) ^ (v2 ^ v3);
+
+static inline u64 siphash_2u64(const u64 first, const u64 second, const siphash_key_t *key)
+{
+ PREAMBLE(16)
+ v3 ^= first;
+ SIPROUND;
+ SIPROUND;
+ v0 ^= first;
+ v3 ^= second;
+ SIPROUND;
+ SIPROUND;
+ v0 ^= second;
+ POSTAMBLE
+}
+#endif
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
new file mode 100644
index 000000000000..a5501b29979a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
@@ -0,0 +1,572 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "bpf_tracing_net.h"
+#include "bpf_kfuncs.h"
+#include "test_siphash.h"
+#include "test_tcp_custom_syncookie.h"
+
+/* Hash is calculated for each client and split into ISN and TS.
+ *
+ * MSB LSB
+ * ISN: | 31 ... 8 | 7 6 | 5 | 4 | 3 2 1 0 |
+ * | Hash_1 | MSS | ECN | SACK | WScale |
+ *
+ * TS: | 31 ... 8 | 7 ... 0 |
+ * | Random | Hash_2 |
+ */
+#define COOKIE_BITS 8
+#define COOKIE_MASK (((__u32)1 << COOKIE_BITS) - 1)
+
+enum {
+ /* 0xf is invalid thus means that SYN did not have WScale. */
+ BPF_SYNCOOKIE_WSCALE_MASK = (1 << 4) - 1,
+ BPF_SYNCOOKIE_SACK = (1 << 4),
+ BPF_SYNCOOKIE_ECN = (1 << 5),
+};
+
+#define MSS_LOCAL_IPV4 65495
+#define MSS_LOCAL_IPV6 65476
+
+const __u16 msstab4[] = {
+ 536,
+ 1300,
+ 1460,
+ MSS_LOCAL_IPV4,
+};
+
+const __u16 msstab6[] = {
+ 1280 - 60, /* IPV6_MIN_MTU - 60 */
+ 1480 - 60,
+ 9000 - 60,
+ MSS_LOCAL_IPV6,
+};
+
+static siphash_key_t test_key_siphash = {
+ { 0x0706050403020100ULL, 0x0f0e0d0c0b0a0908ULL }
+};
+
+struct tcp_syncookie {
+ struct __sk_buff *skb;
+ void *data_end;
+ struct ethhdr *eth;
+ struct iphdr *ipv4;
+ struct ipv6hdr *ipv6;
+ struct tcphdr *tcp;
+ union {
+ char *ptr;
+ __be32 *ptr32;
+ };
+ struct bpf_tcp_req_attrs attrs;
+ u32 cookie;
+ u64 first;
+};
+
+bool handled_syn, handled_ack;
+
+static int tcp_load_headers(struct tcp_syncookie *ctx)
+{
+ ctx->data_end = (void *)(long)ctx->skb->data_end;
+ ctx->eth = (struct ethhdr *)(long)ctx->skb->data;
+
+ if (ctx->eth + 1 > ctx->data_end)
+ goto err;
+
+ switch (bpf_ntohs(ctx->eth->h_proto)) {
+ case ETH_P_IP:
+ ctx->ipv4 = (struct iphdr *)(ctx->eth + 1);
+
+ if (ctx->ipv4 + 1 > ctx->data_end)
+ goto err;
+
+ if (ctx->ipv4->ihl != sizeof(*ctx->ipv4) / 4)
+ goto err;
+
+ if (ctx->ipv4->version != 4)
+ goto err;
+
+ if (ctx->ipv4->protocol != IPPROTO_TCP)
+ goto err;
+
+ ctx->tcp = (struct tcphdr *)(ctx->ipv4 + 1);
+ break;
+ case ETH_P_IPV6:
+ ctx->ipv6 = (struct ipv6hdr *)(ctx->eth + 1);
+
+ if (ctx->ipv6 + 1 > ctx->data_end)
+ goto err;
+
+ if (ctx->ipv6->version != 6)
+ goto err;
+
+ if (ctx->ipv6->nexthdr != NEXTHDR_TCP)
+ goto err;
+
+ ctx->tcp = (struct tcphdr *)(ctx->ipv6 + 1);
+ break;
+ default:
+ goto err;
+ }
+
+ if (ctx->tcp + 1 > ctx->data_end)
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+static int tcp_reload_headers(struct tcp_syncookie *ctx)
+{
+ /* Without volatile,
+ * R3 32-bit pointer arithmetic prohibited
+ */
+ volatile u64 data_len = ctx->skb->data_end - ctx->skb->data;
+
+ if (ctx->tcp->doff < sizeof(*ctx->tcp) / 4)
+ goto err;
+
+ /* Needed to calculate csum and parse TCP options. */
+ if (bpf_skb_change_tail(ctx->skb, data_len + 60 - ctx->tcp->doff * 4, 0))
+ goto err;
+
+ ctx->data_end = (void *)(long)ctx->skb->data_end;
+ ctx->eth = (struct ethhdr *)(long)ctx->skb->data;
+ if (ctx->ipv4) {
+ ctx->ipv4 = (struct iphdr *)(ctx->eth + 1);
+ ctx->ipv6 = NULL;
+ ctx->tcp = (struct tcphdr *)(ctx->ipv4 + 1);
+ } else {
+ ctx->ipv4 = NULL;
+ ctx->ipv6 = (struct ipv6hdr *)(ctx->eth + 1);
+ ctx->tcp = (struct tcphdr *)(ctx->ipv6 + 1);
+ }
+
+ if ((void *)ctx->tcp + 60 > ctx->data_end)
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+static __sum16 tcp_v4_csum(struct tcp_syncookie *ctx, __wsum csum)
+{
+ return csum_tcpudp_magic(ctx->ipv4->saddr, ctx->ipv4->daddr,
+ ctx->tcp->doff * 4, IPPROTO_TCP, csum);
+}
+
+static __sum16 tcp_v6_csum(struct tcp_syncookie *ctx, __wsum csum)
+{
+ return csum_ipv6_magic(&ctx->ipv6->saddr, &ctx->ipv6->daddr,
+ ctx->tcp->doff * 4, IPPROTO_TCP, csum);
+}
+
+static int tcp_validate_header(struct tcp_syncookie *ctx)
+{
+ s64 csum;
+
+ if (tcp_reload_headers(ctx))
+ goto err;
+
+ csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4, 0);
+ if (csum < 0)
+ goto err;
+
+ if (ctx->ipv4) {
+ /* check tcp_v4_csum(csum) is 0 if not on lo. */
+
+ csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, ctx->ipv4->ihl * 4, 0);
+ if (csum < 0)
+ goto err;
+
+ if (csum_fold(csum) != 0)
+ goto err;
+ } else if (ctx->ipv6) {
+ /* check tcp_v6_csum(csum) is 0 if not on lo. */
+ }
+
+ return 0;
+err:
+ return -1;
+}
+
+static int tcp_parse_option(__u32 index, struct tcp_syncookie *ctx)
+{
+ char opcode, opsize;
+
+ if (ctx->ptr + 1 > ctx->data_end)
+ goto stop;
+
+ opcode = *ctx->ptr++;
+
+ if (opcode == TCPOPT_EOL)
+ goto stop;
+
+ if (opcode == TCPOPT_NOP)
+ goto next;
+
+ if (ctx->ptr + 1 > ctx->data_end)
+ goto stop;
+
+ opsize = *ctx->ptr++;
+
+ if (opsize < 2)
+ goto stop;
+
+ switch (opcode) {
+ case TCPOPT_MSS:
+ if (opsize == TCPOLEN_MSS && ctx->tcp->syn &&
+ ctx->ptr + (TCPOLEN_MSS - 2) < ctx->data_end)
+ ctx->attrs.mss = get_unaligned_be16(ctx->ptr);
+ break;
+ case TCPOPT_WINDOW:
+ if (opsize == TCPOLEN_WINDOW && ctx->tcp->syn &&
+ ctx->ptr + (TCPOLEN_WINDOW - 2) < ctx->data_end) {
+ ctx->attrs.wscale_ok = 1;
+ ctx->attrs.snd_wscale = *ctx->ptr;
+ }
+ break;
+ case TCPOPT_TIMESTAMP:
+ if (opsize == TCPOLEN_TIMESTAMP &&
+ ctx->ptr + (TCPOLEN_TIMESTAMP - 2) < ctx->data_end) {
+ ctx->attrs.rcv_tsval = get_unaligned_be32(ctx->ptr);
+ ctx->attrs.rcv_tsecr = get_unaligned_be32(ctx->ptr + 4);
+
+ if (ctx->tcp->syn && ctx->attrs.rcv_tsecr)
+ ctx->attrs.tstamp_ok = 0;
+ else
+ ctx->attrs.tstamp_ok = 1;
+ }
+ break;
+ case TCPOPT_SACK_PERM:
+ if (opsize == TCPOLEN_SACK_PERM && ctx->tcp->syn &&
+ ctx->ptr + (TCPOLEN_SACK_PERM - 2) < ctx->data_end)
+ ctx->attrs.sack_ok = 1;
+ break;
+ }
+
+ ctx->ptr += opsize - 2;
+next:
+ return 0;
+stop:
+ return 1;
+}
+
+static void tcp_parse_options(struct tcp_syncookie *ctx)
+{
+ ctx->ptr = (char *)(ctx->tcp + 1);
+
+ bpf_loop(40, tcp_parse_option, ctx, 0);
+}
+
+static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
+{
+ if ((ctx->ipv4 && ctx->attrs.mss != MSS_LOCAL_IPV4) ||
+ (ctx->ipv6 && ctx->attrs.mss != MSS_LOCAL_IPV6))
+ goto err;
+
+ if (!ctx->attrs.wscale_ok || ctx->attrs.snd_wscale != 7)
+ goto err;
+
+ if (!ctx->attrs.tstamp_ok)
+ goto err;
+
+ if (!ctx->attrs.sack_ok)
+ goto err;
+
+ if (!ctx->tcp->ece || !ctx->tcp->cwr)
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+static void tcp_prepare_cookie(struct tcp_syncookie *ctx)
+{
+ u32 seq = bpf_ntohl(ctx->tcp->seq);
+ u64 first = 0, second;
+ int mssind = 0;
+ u32 hash;
+
+ if (ctx->ipv4) {
+ for (mssind = ARRAY_SIZE(msstab4) - 1; mssind; mssind--)
+ if (ctx->attrs.mss >= msstab4[mssind])
+ break;
+
+ ctx->attrs.mss = msstab4[mssind];
+
+ first = (u64)ctx->ipv4->saddr << 32 | ctx->ipv4->daddr;
+ } else if (ctx->ipv6) {
+ for (mssind = ARRAY_SIZE(msstab6) - 1; mssind; mssind--)
+ if (ctx->attrs.mss >= msstab6[mssind])
+ break;
+
+ ctx->attrs.mss = msstab6[mssind];
+
+ first = (u64)ctx->ipv6->saddr.in6_u.u6_addr8[0] << 32 |
+ ctx->ipv6->daddr.in6_u.u6_addr32[0];
+ }
+
+ second = (u64)seq << 32 | ctx->tcp->source << 16 | ctx->tcp->dest;
+ hash = siphash_2u64(first, second, &test_key_siphash);
+
+ if (ctx->attrs.tstamp_ok) {
+ ctx->attrs.rcv_tsecr = bpf_get_prandom_u32();
+ ctx->attrs.rcv_tsecr &= ~COOKIE_MASK;
+ ctx->attrs.rcv_tsecr |= hash & COOKIE_MASK;
+ }
+
+ hash &= ~COOKIE_MASK;
+ hash |= mssind << 6;
+
+ if (ctx->attrs.wscale_ok)
+ hash |= ctx->attrs.snd_wscale & BPF_SYNCOOKIE_WSCALE_MASK;
+
+ if (ctx->attrs.sack_ok)
+ hash |= BPF_SYNCOOKIE_SACK;
+
+ if (ctx->attrs.tstamp_ok && ctx->tcp->ece && ctx->tcp->cwr)
+ hash |= BPF_SYNCOOKIE_ECN;
+
+ ctx->cookie = hash;
+}
+
+static void tcp_write_options(struct tcp_syncookie *ctx)
+{
+ ctx->ptr32 = (__be32 *)(ctx->tcp + 1);
+
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_MSS << 24 | TCPOLEN_MSS << 16 |
+ ctx->attrs.mss);
+
+ if (ctx->attrs.wscale_ok)
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_NOP << 24 |
+ TCPOPT_WINDOW << 16 |
+ TCPOLEN_WINDOW << 8 |
+ ctx->attrs.snd_wscale);
+
+ if (ctx->attrs.tstamp_ok) {
+ if (ctx->attrs.sack_ok)
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_SACK_PERM << 24 |
+ TCPOLEN_SACK_PERM << 16 |
+ TCPOPT_TIMESTAMP << 8 |
+ TCPOLEN_TIMESTAMP);
+ else
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_NOP << 24 |
+ TCPOPT_NOP << 16 |
+ TCPOPT_TIMESTAMP << 8 |
+ TCPOLEN_TIMESTAMP);
+
+ *ctx->ptr32++ = bpf_htonl(ctx->attrs.rcv_tsecr);
+ *ctx->ptr32++ = bpf_htonl(ctx->attrs.rcv_tsval);
+ } else if (ctx->attrs.sack_ok) {
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_NOP << 24 |
+ TCPOPT_NOP << 16 |
+ TCPOPT_SACK_PERM << 8 |
+ TCPOLEN_SACK_PERM);
+ }
+}
+
+static int tcp_handle_syn(struct tcp_syncookie *ctx)
+{
+ s64 csum;
+
+ if (tcp_validate_header(ctx))
+ goto err;
+
+ tcp_parse_options(ctx);
+
+ if (tcp_validate_sysctl(ctx))
+ goto err;
+
+ tcp_prepare_cookie(ctx);
+ tcp_write_options(ctx);
+
+ swap(ctx->tcp->source, ctx->tcp->dest);
+ ctx->tcp->check = 0;
+ ctx->tcp->ack_seq = bpf_htonl(bpf_ntohl(ctx->tcp->seq) + 1);
+ ctx->tcp->seq = bpf_htonl(ctx->cookie);
+ ctx->tcp->doff = ((long)ctx->ptr32 - (long)ctx->tcp) >> 2;
+ ctx->tcp->ack = 1;
+ if (!ctx->attrs.tstamp_ok || !ctx->tcp->ece || !ctx->tcp->cwr)
+ ctx->tcp->ece = 0;
+ ctx->tcp->cwr = 0;
+
+ csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4, 0);
+ if (csum < 0)
+ goto err;
+
+ if (ctx->ipv4) {
+ swap(ctx->ipv4->saddr, ctx->ipv4->daddr);
+ ctx->tcp->check = tcp_v4_csum(ctx, csum);
+
+ ctx->ipv4->check = 0;
+ ctx->ipv4->tos = 0;
+ ctx->ipv4->tot_len = bpf_htons((long)ctx->ptr32 - (long)ctx->ipv4);
+ ctx->ipv4->id = 0;
+ ctx->ipv4->ttl = 64;
+
+ csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, sizeof(*ctx->ipv4), 0);
+ if (csum < 0)
+ goto err;
+
+ ctx->ipv4->check = csum_fold(csum);
+ } else if (ctx->ipv6) {
+ swap(ctx->ipv6->saddr, ctx->ipv6->daddr);
+ ctx->tcp->check = tcp_v6_csum(ctx, csum);
+
+ *(__be32 *)ctx->ipv6 = bpf_htonl(0x60000000);
+ ctx->ipv6->payload_len = bpf_htons((long)ctx->ptr32 - (long)ctx->tcp);
+ ctx->ipv6->hop_limit = 64;
+ }
+
+ swap_array(ctx->eth->h_source, ctx->eth->h_dest);
+
+ if (bpf_skb_change_tail(ctx->skb, (long)ctx->ptr32 - (long)ctx->eth, 0))
+ goto err;
+
+ return bpf_redirect(ctx->skb->ifindex, 0);
+err:
+ return TC_ACT_SHOT;
+}
+
+static int tcp_validate_cookie(struct tcp_syncookie *ctx)
+{
+ u32 cookie = bpf_ntohl(ctx->tcp->ack_seq) - 1;
+ u32 seq = bpf_ntohl(ctx->tcp->seq) - 1;
+ u64 first = 0, second;
+ int mssind;
+ u32 hash;
+
+ if (ctx->ipv4)
+ first = (u64)ctx->ipv4->saddr << 32 | ctx->ipv4->daddr;
+ else if (ctx->ipv6)
+ first = (u64)ctx->ipv6->saddr.in6_u.u6_addr8[0] << 32 |
+ ctx->ipv6->daddr.in6_u.u6_addr32[0];
+
+ second = (u64)seq << 32 | ctx->tcp->source << 16 | ctx->tcp->dest;
+ hash = siphash_2u64(first, second, &test_key_siphash);
+
+ if (ctx->attrs.tstamp_ok)
+ hash -= ctx->attrs.rcv_tsecr & COOKIE_MASK;
+ else
+ hash &= ~COOKIE_MASK;
+
+ hash -= cookie & ~COOKIE_MASK;
+ if (hash)
+ goto err;
+
+ mssind = (cookie & (3 << 6)) >> 6;
+ if (ctx->ipv4) {
+ if (mssind > ARRAY_SIZE(msstab4))
+ goto err;
+
+ ctx->attrs.mss = msstab4[mssind];
+ } else {
+ if (mssind > ARRAY_SIZE(msstab6))
+ goto err;
+
+ ctx->attrs.mss = msstab6[mssind];
+ }
+
+ ctx->attrs.snd_wscale = cookie & BPF_SYNCOOKIE_WSCALE_MASK;
+ ctx->attrs.rcv_wscale = ctx->attrs.snd_wscale;
+ ctx->attrs.wscale_ok = ctx->attrs.snd_wscale == BPF_SYNCOOKIE_WSCALE_MASK;
+ ctx->attrs.sack_ok = cookie & BPF_SYNCOOKIE_SACK;
+ ctx->attrs.ecn_ok = cookie & BPF_SYNCOOKIE_ECN;
+
+ return 0;
+err:
+ return -1;
+}
+
+static int tcp_handle_ack(struct tcp_syncookie *ctx)
+{
+ struct bpf_sock_tuple tuple;
+ struct bpf_sock *skc;
+ int ret = TC_ACT_OK;
+ struct sock *sk;
+ u32 tuple_size;
+
+ if (ctx->ipv4) {
+ tuple.ipv4.saddr = ctx->ipv4->saddr;
+ tuple.ipv4.daddr = ctx->ipv4->daddr;
+ tuple.ipv4.sport = ctx->tcp->source;
+ tuple.ipv4.dport = ctx->tcp->dest;
+ tuple_size = sizeof(tuple.ipv4);
+ } else if (ctx->ipv6) {
+ __builtin_memcpy(tuple.ipv6.saddr, &ctx->ipv6->saddr, sizeof(tuple.ipv6.saddr));
+ __builtin_memcpy(tuple.ipv6.daddr, &ctx->ipv6->daddr, sizeof(tuple.ipv6.daddr));
+ tuple.ipv6.sport = ctx->tcp->source;
+ tuple.ipv6.dport = ctx->tcp->dest;
+ tuple_size = sizeof(tuple.ipv6);
+ } else {
+ goto out;
+ }
+
+ skc = bpf_skc_lookup_tcp(ctx->skb, &tuple, tuple_size, -1, 0);
+ if (!skc)
+ goto out;
+
+ if (skc->state != TCP_LISTEN)
+ goto release;
+
+ sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
+ if (!sk)
+ goto err;
+
+ if (tcp_validate_header(ctx))
+ goto err;
+
+ tcp_parse_options(ctx);
+
+ if (tcp_validate_cookie(ctx))
+ goto err;
+
+ ret = bpf_sk_assign_tcp_reqsk(ctx->skb, sk, &ctx->attrs, sizeof(ctx->attrs));
+ if (ret < 0)
+ goto err;
+
+release:
+ bpf_sk_release(skc);
+out:
+ return ret;
+
+err:
+ ret = TC_ACT_SHOT;
+ goto release;
+}
+
+SEC("tc")
+int tcp_custom_syncookie(struct __sk_buff *skb)
+{
+ struct tcp_syncookie ctx = {
+ .skb = skb,
+ };
+
+ if (tcp_load_headers(&ctx))
+ return TC_ACT_OK;
+
+ if (ctx.tcp->rst)
+ return TC_ACT_OK;
+
+ if (ctx.tcp->syn) {
+ if (ctx.tcp->ack)
+ return TC_ACT_OK;
+
+ handled_syn = true;
+
+ return tcp_handle_syn(&ctx);
+ }
+
+ handled_ack = true;
+
+ return tcp_handle_ack(&ctx);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
new file mode 100644
index 000000000000..29a6a53cf229
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#ifndef _TEST_TCP_SYNCOOKIE_H
+#define _TEST_TCP_SYNCOOKIE_H
+
+#define __packed __attribute__((__packed__))
+#define __force
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+#define swap(a, b) \
+ do { \
+ typeof(a) __tmp = (a); \
+ (a) = (b); \
+ (b) = __tmp; \
+ } while (0)
+
+#define swap_array(a, b) \
+ do { \
+ typeof(a) __tmp[sizeof(a)]; \
+ __builtin_memcpy(__tmp, a, sizeof(a)); \
+ __builtin_memcpy(a, b, sizeof(a)); \
+ __builtin_memcpy(b, __tmp, sizeof(a)); \
+ } while (0)
+
+/* asm-generic/unaligned.h */
+#define __get_unaligned_t(type, ptr) ({ \
+ const struct { type x; } __packed * __pptr = (typeof(__pptr))(ptr); \
+ __pptr->x; \
+})
+
+#define get_unaligned(ptr) __get_unaligned_t(typeof(*(ptr)), (ptr))
+
+static inline u16 get_unaligned_be16(const void *p)
+{
+ return bpf_ntohs(__get_unaligned_t(__be16, p));
+}
+
+static inline u32 get_unaligned_be32(const void *p)
+{
+ return bpf_ntohl(__get_unaligned_t(__be32, p));
+}
+
+/* lib/checksum.c */
+static inline u32 from64to32(u64 x)
+{
+ /* add up 32-bit and 32-bit for 32+c bit */
+ x = (x & 0xffffffff) + (x >> 32);
+ /* add up carry.. */
+ x = (x & 0xffffffff) + (x >> 32);
+ return (u32)x;
+}
+
+static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
+ __u32 len, __u8 proto, __wsum sum)
+{
+ unsigned long long s = (__force u32)sum;
+
+ s += (__force u32)saddr;
+ s += (__force u32)daddr;
+#ifdef __BIG_ENDIAN
+ s += proto + len;
+#else
+ s += (proto + len) << 8;
+#endif
+ return (__force __wsum)from64to32(s);
+}
+
+/* asm-generic/checksum.h */
+static inline __sum16 csum_fold(__wsum csum)
+{
+ u32 sum = (__force u32)csum;
+
+ sum = (sum & 0xffff) + (sum >> 16);
+ sum = (sum & 0xffff) + (sum >> 16);
+ return (__force __sum16)~sum;
+}
+
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
+ __u8 proto, __wsum sum)
+{
+ return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
+}
+
+/* net/ipv6/ip6_checksum.c */
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ const struct in6_addr *daddr,
+ __u32 len, __u8 proto, __wsum csum)
+{
+ int carry;
+ __u32 ulen;
+ __u32 uproto;
+ __u32 sum = (__force u32)csum;
+
+ sum += (__force u32)saddr->in6_u.u6_addr32[0];
+ carry = (sum < (__force u32)saddr->in6_u.u6_addr32[0]);
+ sum += carry;
+
+ sum += (__force u32)saddr->in6_u.u6_addr32[1];
+ carry = (sum < (__force u32)saddr->in6_u.u6_addr32[1]);
+ sum += carry;
+
+ sum += (__force u32)saddr->in6_u.u6_addr32[2];
+ carry = (sum < (__force u32)saddr->in6_u.u6_addr32[2]);
+ sum += carry;
+
+ sum += (__force u32)saddr->in6_u.u6_addr32[3];
+ carry = (sum < (__force u32)saddr->in6_u.u6_addr32[3]);
+ sum += carry;
+
+ sum += (__force u32)daddr->in6_u.u6_addr32[0];
+ carry = (sum < (__force u32)daddr->in6_u.u6_addr32[0]);
+ sum += carry;
+
+ sum += (__force u32)daddr->in6_u.u6_addr32[1];
+ carry = (sum < (__force u32)daddr->in6_u.u6_addr32[1]);
+ sum += carry;
+
+ sum += (__force u32)daddr->in6_u.u6_addr32[2];
+ carry = (sum < (__force u32)daddr->in6_u.u6_addr32[2]);
+ sum += carry;
+
+ sum += (__force u32)daddr->in6_u.u6_addr32[3];
+ carry = (sum < (__force u32)daddr->in6_u.u6_addr32[3]);
+ sum += carry;
+
+ ulen = (__force u32)bpf_htonl((__u32)len);
+ sum += ulen;
+ carry = (sum < ulen);
+ sum += carry;
+
+ uproto = (__force u32)bpf_htonl(proto);
+ sum += uproto;
+ carry = (sum < uproto);
+ sum += carry;
+
+ return csum_fold((__force __wsum)sum);
+}
+#endif
--
2.30.2
^ permalink raw reply related [flat|nested] 15+ messages in thread

* Re: [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-21 1:28 ` [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
@ 2023-12-21 6:35 ` Martin KaFai Lau
2023-12-21 7:04 ` Kuniyuki Iwashima
2024-01-02 19:17 ` Yonghong Song
2023-12-21 16:44 ` Matthieu Baerts
1 sibling, 2 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2023-12-21 6:35 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Kuniyuki Iwashima, bpf, netdev, Eric Dumazet, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Paolo Abeni, Yonghong Song
On 12/20/23 5:28 PM, Kuniyuki Iwashima wrote:
> +static int tcp_validate_header(struct tcp_syncookie *ctx)
> +{
> + s64 csum;
> +
> + if (tcp_reload_headers(ctx))
> + goto err;
> +
> + csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4, 0);
> + if (csum < 0)
> + goto err;
> +
> + if (ctx->ipv4) {
> + /* check tcp_v4_csum(csum) is 0 if not on lo. */
> +
> + csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, ctx->ipv4->ihl * 4, 0);
> + if (csum < 0)
> + goto err;
> +
> + if (csum_fold(csum) != 0)
> + goto err;
> + } else if (ctx->ipv6) {
> + /* check tcp_v6_csum(csum) is 0 if not on lo. */
> + }
> +
> + return 0;
> +err:
> + return -1;
> +}
> +
> +static int tcp_parse_option(__u32 index, struct tcp_syncookie *ctx)
> +{
> + char opcode, opsize;
> +
> + if (ctx->ptr + 1 > ctx->data_end)
> + goto stop;
> +
> + opcode = *ctx->ptr++;
> +
> + if (opcode == TCPOPT_EOL)
> + goto stop;
> +
> + if (opcode == TCPOPT_NOP)
> + goto next;
> +
> + if (ctx->ptr + 1 > ctx->data_end)
> + goto stop;
> +
> + opsize = *ctx->ptr++;
> +
> + if (opsize < 2)
> + goto stop;
> +
> + switch (opcode) {
> + case TCPOPT_MSS:
> + if (opsize == TCPOLEN_MSS && ctx->tcp->syn &&
> + ctx->ptr + (TCPOLEN_MSS - 2) < ctx->data_end)
> + ctx->attrs.mss = get_unaligned_be16(ctx->ptr);
> + break;
> + case TCPOPT_WINDOW:
> + if (opsize == TCPOLEN_WINDOW && ctx->tcp->syn &&
> + ctx->ptr + (TCPOLEN_WINDOW - 2) < ctx->data_end) {
> + ctx->attrs.wscale_ok = 1;
> + ctx->attrs.snd_wscale = *ctx->ptr;
> + }
> + break;
> + case TCPOPT_TIMESTAMP:
> + if (opsize == TCPOLEN_TIMESTAMP &&
> + ctx->ptr + (TCPOLEN_TIMESTAMP - 2) < ctx->data_end) {
> + ctx->attrs.rcv_tsval = get_unaligned_be32(ctx->ptr);
> + ctx->attrs.rcv_tsecr = get_unaligned_be32(ctx->ptr + 4);
> +
> + if (ctx->tcp->syn && ctx->attrs.rcv_tsecr)
> + ctx->attrs.tstamp_ok = 0;
> + else
> + ctx->attrs.tstamp_ok = 1;
> + }
> + break;
> + case TCPOPT_SACK_PERM:
> + if (opsize == TCPOLEN_SACK_PERM && ctx->tcp->syn &&
> + ctx->ptr + (TCPOLEN_SACK_PERM - 2) < ctx->data_end)
> + ctx->attrs.sack_ok = 1;
> + break;
> + }
> +
> + ctx->ptr += opsize - 2;
> +next:
> + return 0;
> +stop:
> + return 1;
> +}
> +
> +static void tcp_parse_options(struct tcp_syncookie *ctx)
> +{
> + ctx->ptr = (char *)(ctx->tcp + 1);
> +
> + bpf_loop(40, tcp_parse_option, ctx, 0);
> +}
> +
> +static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
> +{
> + if ((ctx->ipv4 && ctx->attrs.mss != MSS_LOCAL_IPV4) ||
> + (ctx->ipv6 && ctx->attrs.mss != MSS_LOCAL_IPV6))
> + goto err;
> +
> + if (!ctx->attrs.wscale_ok || ctx->attrs.snd_wscale != 7)
> + goto err;
> +
> + if (!ctx->attrs.tstamp_ok)
The bpf-ci reported an error in cpuv4. The email from bot+bpf-ci@kernel.org
has the link.
I tried the following:
if (!ctx->attrs.tstamp_ok) {
bpf_printk("ctx->attrs.tstamp_ok %u",
ctx->attrs.tstamp_ok);
goto err;
}
The above prints tstamp_ok as 1 even though the "if (!ctx->attrs.tstamp_ok)"
test right before it should have made that impossible.
Yonghong and I debugged it quite a bit. The verifier concluded that
ctx->attrs.tstamp_ok is 0. We found some red herrings, e.g. cpuv4 does less
register spilling, but have not been able to root-cause it yet.
In the meantime, there are existing selftests that parse the TCP header, e.g.
test_parse_tcp_hdr_opt[_dynptr].c. They are not as complete as your
tcp_parse_option() but should be pretty close. They do not use bpf_loop; they
use a bounded loop + a subprog (parse_hdr_opt in the selftests) instead. You
can consider a similar construct to see if it works around the cpuv4 CI issue
for the time being.
pw-bot: cr
> + goto err;
> +
> + if (!ctx->attrs.sack_ok)
> + goto err;
> +
> + if (!ctx->tcp->ece || !ctx->tcp->cwr)
> + goto err;
> +
> + return 0;
> +err:
> + return -1;
> +}
> +
[ ... ]
> +static int tcp_handle_syn(struct tcp_syncookie *ctx)
> +{
> + s64 csum;
> +
> + if (tcp_validate_header(ctx))
> + goto err;
> +
> + tcp_parse_options(ctx);
> +
> + if (tcp_validate_sysctl(ctx))
> + goto err;
> +
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-21 6:35 ` Martin KaFai Lau
@ 2023-12-21 7:04 ` Kuniyuki Iwashima
2023-12-21 8:43 ` Kuniyuki Iwashima
2024-01-02 19:17 ` Yonghong Song
1 sibling, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-21 7:04 UTC (permalink / raw)
To: martin.lau
Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, kuniyu, netdev,
pabeni, yonghong.song
From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Wed, 20 Dec 2023 22:35:26 -0800
> On 12/20/23 5:28 PM, Kuniyuki Iwashima wrote:
> > +static int tcp_validate_header(struct tcp_syncookie *ctx)
> > +{
> > + s64 csum;
> > +
> > + if (tcp_reload_headers(ctx))
> > + goto err;
> > +
> > + csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4, 0);
> > + if (csum < 0)
> > + goto err;
> > +
> > + if (ctx->ipv4) {
> > + /* check tcp_v4_csum(csum) is 0 if not on lo. */
> > +
> > + csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, ctx->ipv4->ihl * 4, 0);
> > + if (csum < 0)
> > + goto err;
> > +
> > + if (csum_fold(csum) != 0)
> > + goto err;
> > + } else if (ctx->ipv6) {
> > + /* check tcp_v6_csum(csum) is 0 if not on lo. */
> > + }
> > +
> > + return 0;
> > +err:
> > + return -1;
> > +}
> > +
> > +static int tcp_parse_option(__u32 index, struct tcp_syncookie *ctx)
> > +{
> > + char opcode, opsize;
> > +
> > + if (ctx->ptr + 1 > ctx->data_end)
> > + goto stop;
> > +
> > + opcode = *ctx->ptr++;
> > +
> > + if (opcode == TCPOPT_EOL)
> > + goto stop;
> > +
> > + if (opcode == TCPOPT_NOP)
> > + goto next;
> > +
> > + if (ctx->ptr + 1 > ctx->data_end)
> > + goto stop;
> > +
> > + opsize = *ctx->ptr++;
> > +
> > + if (opsize < 2)
> > + goto stop;
> > +
> > + switch (opcode) {
> > + case TCPOPT_MSS:
> > + if (opsize == TCPOLEN_MSS && ctx->tcp->syn &&
> > + ctx->ptr + (TCPOLEN_MSS - 2) < ctx->data_end)
> > + ctx->attrs.mss = get_unaligned_be16(ctx->ptr);
> > + break;
> > + case TCPOPT_WINDOW:
> > + if (opsize == TCPOLEN_WINDOW && ctx->tcp->syn &&
> > + ctx->ptr + (TCPOLEN_WINDOW - 2) < ctx->data_end) {
> > + ctx->attrs.wscale_ok = 1;
> > + ctx->attrs.snd_wscale = *ctx->ptr;
> > + }
> > + break;
> > + case TCPOPT_TIMESTAMP:
> > + if (opsize == TCPOLEN_TIMESTAMP &&
> > + ctx->ptr + (TCPOLEN_TIMESTAMP - 2) < ctx->data_end) {
> > + ctx->attrs.rcv_tsval = get_unaligned_be32(ctx->ptr);
> > + ctx->attrs.rcv_tsecr = get_unaligned_be32(ctx->ptr + 4);
> > +
> > + if (ctx->tcp->syn && ctx->attrs.rcv_tsecr)
> > + ctx->attrs.tstamp_ok = 0;
> > + else
> > + ctx->attrs.tstamp_ok = 1;
> > + }
> > + break;
> > + case TCPOPT_SACK_PERM:
> > + if (opsize == TCPOLEN_SACK_PERM && ctx->tcp->syn &&
> > + ctx->ptr + (TCPOLEN_SACK_PERM - 2) < ctx->data_end)
> > + ctx->attrs.sack_ok = 1;
> > + break;
> > + }
> > +
> > + ctx->ptr += opsize - 2;
> > +next:
> > + return 0;
> > +stop:
> > + return 1;
> > +}
> > +
> > +static void tcp_parse_options(struct tcp_syncookie *ctx)
> > +{
> > + ctx->ptr = (char *)(ctx->tcp + 1);
> > +
> > + bpf_loop(40, tcp_parse_option, ctx, 0);
> > +}
> > +
> > +static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
> > +{
> > + if ((ctx->ipv4 && ctx->attrs.mss != MSS_LOCAL_IPV4) ||
> > + (ctx->ipv6 && ctx->attrs.mss != MSS_LOCAL_IPV6))
> > + goto err;
> > +
> > + if (!ctx->attrs.wscale_ok || ctx->attrs.snd_wscale != 7)
> > + goto err;
> > +
> > + if (!ctx->attrs.tstamp_ok)
>
> The bpf-ci reported an error in cpuv4. The email from bot+bpf-ci@kernel.org has the
> link.
I like the mail from the bot, it's useful, but it seems that
it's sent to the patch author only when the CI passes?
But yeah, I found the failed test.
https://github.com/kernel-patches/bpf/actions/runs/7284164398/job/19849657597
>
> I tried the following:
>
> if (!ctx->attrs.tstamp_ok) {
> bpf_printk("ctx->attrs.tstamp_ok %u",
> ctx->attrs.tstamp_ok);
> goto err;
> }
>
>
> The above prints tstamp_ok as 1 while there is a "if (!ctx->attrs.tstamp_ok)"
> test before it.
>
> Yonghong and I debugged it quite a bit. The verifier concluded that
> ctx->attrs.tstamp_ok is 0. We saw some red herrings, like cpuv4 doing less
> register spilling, but have not been able to root cause it yet.
>
> In the meantime, there are existing selftests parsing the TCP header. For
> example, the test_parse_tcp_hdr_opt[_dynptr].c. Not as complete as your
> tcp_parse_option() but should be pretty close. It does not use bpf_loop. It uses
> a bounded loop + a subprog (the parse_hdr_opt in the selftests) instead. You can
> consider a similar construct to see if it works around the cpuv4 CI issue for
> the time being.
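For concreteness, the bounded-loop + subprog shape suggested above could look roughly like the following userspace sketch of the same option-parsing logic (constants, types, and bounds simplified for illustration; this is not the selftest code itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel TCP option constants. */
#define TCPOPT_EOL	0
#define TCPOPT_NOP	1
#define TCPOPT_MSS	2
#define TCPOPT_WINDOW	3
#define TCPOLEN_MSS	4
#define TCPOLEN_WINDOW	3

struct tcp_attrs {
	uint16_t mss;
	uint8_t wscale_ok;
	uint8_t snd_wscale;
};

/* One parsing step (the "subprog"); returns 1 to stop, 0 to continue.
 * Mirrors the shape of tcp_parse_option() in the selftest, but reads
 * from a plain buffer instead of skb data.
 */
static int parse_one(const uint8_t **pp, const uint8_t *end,
		     struct tcp_attrs *attrs)
{
	const uint8_t *p = *pp;
	uint8_t opcode, opsize;

	if (p + 1 > end)
		return 1;
	opcode = *p++;
	if (opcode == TCPOPT_EOL)
		return 1;
	if (opcode == TCPOPT_NOP) {
		*pp = p;
		return 0;
	}
	if (p + 1 > end)
		return 1;
	opsize = *p++;
	if (opsize < 2)
		return 1;

	switch (opcode) {
	case TCPOPT_MSS:
		if (opsize == TCPOLEN_MSS && p + 2 <= end)
			attrs->mss = (uint16_t)(p[0] << 8 | p[1]);
		break;
	case TCPOPT_WINDOW:
		if (opsize == TCPOLEN_WINDOW && p + 1 <= end) {
			attrs->wscale_ok = 1;
			attrs->snd_wscale = p[0];
		}
		break;
	}

	*pp = p + opsize - 2;
	return 0;
}

/* Bounded loop: at most 40 option bytes, as in the selftest. */
static void parse_options(const uint8_t *buf, size_t len,
			  struct tcp_attrs *attrs)
{
	const uint8_t *p = buf, *end = buf + len;

	for (int i = 0; i < 40; i++)
		if (parse_one(&p, end, attrs))
			break;
}
```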
Sure, I'll install the latest clang/llvm and check if the test
passes without bpf_loop().
Thanks!
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-21 7:04 ` Kuniyuki Iwashima
@ 2023-12-21 8:43 ` Kuniyuki Iwashima
0 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-21 8:43 UTC (permalink / raw)
To: kuniyu
Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, martin.lau, netdev,
pabeni, yonghong.song
From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Thu, 21 Dec 2023 16:04:43 +0900
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Wed, 20 Dec 2023 22:35:26 -0800
> > On 12/20/23 5:28 PM, Kuniyuki Iwashima wrote:
> > > +static int tcp_validate_header(struct tcp_syncookie *ctx)
> > > +{
> > > + s64 csum;
> > > +
> > > + if (tcp_reload_headers(ctx))
> > > + goto err;
> > > +
> > > + csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4, 0);
> > > + if (csum < 0)
> > > + goto err;
> > > +
> > > + if (ctx->ipv4) {
> > > + /* check tcp_v4_csum(csum) is 0 if not on lo. */
> > > +
> > > + csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, ctx->ipv4->ihl * 4, 0);
> > > + if (csum < 0)
> > > + goto err;
> > > +
> > > + if (csum_fold(csum) != 0)
> > > + goto err;
> > > + } else if (ctx->ipv6) {
> > > + /* check tcp_v6_csum(csum) is 0 if not on lo. */
> > > + }
> > > +
> > > + return 0;
> > > +err:
> > > + return -1;
> > > +}
> > > +
> > > +static int tcp_parse_option(__u32 index, struct tcp_syncookie *ctx)
> > > +{
> > > + char opcode, opsize;
> > > +
> > > + if (ctx->ptr + 1 > ctx->data_end)
> > > + goto stop;
> > > +
> > > + opcode = *ctx->ptr++;
> > > +
> > > + if (opcode == TCPOPT_EOL)
> > > + goto stop;
> > > +
> > > + if (opcode == TCPOPT_NOP)
> > > + goto next;
> > > +
> > > + if (ctx->ptr + 1 > ctx->data_end)
> > > + goto stop;
> > > +
> > > + opsize = *ctx->ptr++;
> > > +
> > > + if (opsize < 2)
> > > + goto stop;
> > > +
> > > + switch (opcode) {
> > > + case TCPOPT_MSS:
> > > + if (opsize == TCPOLEN_MSS && ctx->tcp->syn &&
> > > + ctx->ptr + (TCPOLEN_MSS - 2) < ctx->data_end)
> > > + ctx->attrs.mss = get_unaligned_be16(ctx->ptr);
> > > + break;
> > > + case TCPOPT_WINDOW:
> > > + if (opsize == TCPOLEN_WINDOW && ctx->tcp->syn &&
> > > + ctx->ptr + (TCPOLEN_WINDOW - 2) < ctx->data_end) {
> > > + ctx->attrs.wscale_ok = 1;
> > > + ctx->attrs.snd_wscale = *ctx->ptr;
> > > + }
> > > + break;
> > > + case TCPOPT_TIMESTAMP:
> > > + if (opsize == TCPOLEN_TIMESTAMP &&
> > > + ctx->ptr + (TCPOLEN_TIMESTAMP - 2) < ctx->data_end) {
> > > + ctx->attrs.rcv_tsval = get_unaligned_be32(ctx->ptr);
> > > + ctx->attrs.rcv_tsecr = get_unaligned_be32(ctx->ptr + 4);
> > > +
> > > + if (ctx->tcp->syn && ctx->attrs.rcv_tsecr)
> > > + ctx->attrs.tstamp_ok = 0;
> > > + else
> > > + ctx->attrs.tstamp_ok = 1;
> > > + }
> > > + break;
> > > + case TCPOPT_SACK_PERM:
> > > + if (opsize == TCPOLEN_SACK_PERM && ctx->tcp->syn &&
> > > + ctx->ptr + (TCPOLEN_SACK_PERM - 2) < ctx->data_end)
> > > + ctx->attrs.sack_ok = 1;
> > > + break;
> > > + }
> > > +
> > > + ctx->ptr += opsize - 2;
> > > +next:
> > > + return 0;
> > > +stop:
> > > + return 1;
> > > +}
> > > +
> > > +static void tcp_parse_options(struct tcp_syncookie *ctx)
> > > +{
> > > + ctx->ptr = (char *)(ctx->tcp + 1);
> > > +
> > > + bpf_loop(40, tcp_parse_option, ctx, 0);
> > > +}
> > > +
> > > +static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
> > > +{
> > > + if ((ctx->ipv4 && ctx->attrs.mss != MSS_LOCAL_IPV4) ||
> > > + (ctx->ipv6 && ctx->attrs.mss != MSS_LOCAL_IPV6))
> > > + goto err;
> > > +
> > > + if (!ctx->attrs.wscale_ok || ctx->attrs.snd_wscale != 7)
> > > + goto err;
> > > +
> > > + if (!ctx->attrs.tstamp_ok)
> >
> > The bpf-ci reported an error in cpuv4. The email from bot+bpf-ci@kernel.org has the
> > link.
>
> I like the mail from the bot, it's useful, but it seems that
> it's sent to the patch author only when the CI passes?
>
> But yeah, I found the failed test.
> https://github.com/kernel-patches/bpf/actions/runs/7284164398/job/19849657597
>
>
> >
> > I tried the following:
> >
> > if (!ctx->attrs.tstamp_ok) {
> > bpf_printk("ctx->attrs.tstamp_ok %u",
> > ctx->attrs.tstamp_ok);
> > goto err;
> > }
> >
> >
> > The above prints tstamp_ok as 1 while there is a "if (!ctx->attrs.tstamp_ok)"
> > test before it.
> >
> > Yonghong and I debugged it quite a bit. The verifier concluded that
> > ctx->attrs.tstamp_ok is 0. We saw some red herrings, like cpuv4 doing less
> > register spilling, but have not been able to root cause it yet.
> >
> > In the meantime, there are existing selftests parsing the TCP header. For
> > example, the test_parse_tcp_hdr_opt[_dynptr].c. Not as complete as your
> > tcp_parse_option() but should be pretty close. It does not use bpf_loop. It uses
> > a bounded loop + a subprog (the parse_hdr_opt in the selftests) instead. You can
> > consider a similar construct to see if it works around the cpuv4 CI issue for
> > the time being.
>
> Sure, I'll install the latest clang/llvm and check if the test
> passes without bpf_loop().
I've tested the simple diff below and several other patterns, but
the prog cannot be loaded. Without bpf_loop(), the parser can loop
only 4 times (s/40/4/), but then, it does not fully parse the necessary
options, so the packet is dropped due to tcp_validate_sysctl(), and the
test fails.
So it seems that tcp_parse_option() cannot work around the issue even
without bpf_loop(), and this series needs to wait for the cpuv4 fix.
---8<---
@@ -259,9 +260,13 @@ static int tcp_parse_option(__u32 index, struct tcp_syncookie *ctx)
static void tcp_parse_options(struct tcp_syncookie *ctx)
{
+ int i;
+
ctx->ptr = (char *)(ctx->tcp + 1);
- bpf_loop(40, tcp_parse_option, ctx, 0);
+ for (i = 0; i < 40; i++)
+ if (tcp_parse_option(i, ctx))
+ break;
}
static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
---8<---
---8<---
BPF program is too large. Processed 1000001 insn
processed 1000001 insns (limit 1000000) max_states_per_insn 30 total_states 41159 peak_states 344 mark_read 55
-- END PROG LOAD LOG --
libbpf: prog 'tcp_custom_syncookie': failed to load: -7
---8<---
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-21 6:35 ` Martin KaFai Lau
2023-12-21 7:04 ` Kuniyuki Iwashima
@ 2024-01-02 19:17 ` Yonghong Song
1 sibling, 0 replies; 15+ messages in thread
From: Yonghong Song @ 2024-01-02 19:17 UTC (permalink / raw)
To: Martin KaFai Lau, Kuniyuki Iwashima
Cc: Kuniyuki Iwashima, bpf, netdev, Eric Dumazet, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Paolo Abeni
On 12/20/23 10:35 PM, Martin KaFai Lau wrote:
> On 12/20/23 5:28 PM, Kuniyuki Iwashima wrote:
>> +static int tcp_validate_header(struct tcp_syncookie *ctx)
>> +{
>> + s64 csum;
>> +
>> + if (tcp_reload_headers(ctx))
>> + goto err;
>> +
>> + csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4,
>> 0);
>> + if (csum < 0)
>> + goto err;
>> +
>> + if (ctx->ipv4) {
>> + /* check tcp_v4_csum(csum) is 0 if not on lo. */
>> +
>> + csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, ctx->ipv4->ihl
>> * 4, 0);
>> + if (csum < 0)
>> + goto err;
>> +
>> + if (csum_fold(csum) != 0)
>> + goto err;
>> + } else if (ctx->ipv6) {
>> + /* check tcp_v6_csum(csum) is 0 if not on lo. */
>> + }
>> +
>> + return 0;
>> +err:
>> + return -1;
>> +}
>> +
>> +static int tcp_parse_option(__u32 index, struct tcp_syncookie *ctx)
>> +{
>> + char opcode, opsize;
>> +
>> + if (ctx->ptr + 1 > ctx->data_end)
>> + goto stop;
>> +
>> + opcode = *ctx->ptr++;
>> +
>> + if (opcode == TCPOPT_EOL)
>> + goto stop;
>> +
>> + if (opcode == TCPOPT_NOP)
>> + goto next;
>> +
>> + if (ctx->ptr + 1 > ctx->data_end)
>> + goto stop;
>> +
>> + opsize = *ctx->ptr++;
>> +
>> + if (opsize < 2)
>> + goto stop;
>> +
>> + switch (opcode) {
>> + case TCPOPT_MSS:
>> + if (opsize == TCPOLEN_MSS && ctx->tcp->syn &&
>> + ctx->ptr + (TCPOLEN_MSS - 2) < ctx->data_end)
>> + ctx->attrs.mss = get_unaligned_be16(ctx->ptr);
>> + break;
>> + case TCPOPT_WINDOW:
>> + if (opsize == TCPOLEN_WINDOW && ctx->tcp->syn &&
>> + ctx->ptr + (TCPOLEN_WINDOW - 2) < ctx->data_end) {
>> + ctx->attrs.wscale_ok = 1;
>> + ctx->attrs.snd_wscale = *ctx->ptr;
>> + }
>> + break;
>> + case TCPOPT_TIMESTAMP:
>> + if (opsize == TCPOLEN_TIMESTAMP &&
>> + ctx->ptr + (TCPOLEN_TIMESTAMP - 2) < ctx->data_end) {
>> + ctx->attrs.rcv_tsval = get_unaligned_be32(ctx->ptr);
>> + ctx->attrs.rcv_tsecr = get_unaligned_be32(ctx->ptr + 4);
>> +
>> + if (ctx->tcp->syn && ctx->attrs.rcv_tsecr)
>> + ctx->attrs.tstamp_ok = 0;
>> + else
>> + ctx->attrs.tstamp_ok = 1;
>> + }
>> + break;
>> + case TCPOPT_SACK_PERM:
>> + if (opsize == TCPOLEN_SACK_PERM && ctx->tcp->syn &&
>> + ctx->ptr + (TCPOLEN_SACK_PERM - 2) < ctx->data_end)
>> + ctx->attrs.sack_ok = 1;
>> + break;
>> + }
>> +
>> + ctx->ptr += opsize - 2;
>> +next:
>> + return 0;
>> +stop:
>> + return 1;
>> +}
>> +
>> +static void tcp_parse_options(struct tcp_syncookie *ctx)
>> +{
>> + ctx->ptr = (char *)(ctx->tcp + 1);
>> +
>> + bpf_loop(40, tcp_parse_option, ctx, 0);
>> +}
>> +
>> +static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
>> +{
>> + if ((ctx->ipv4 && ctx->attrs.mss != MSS_LOCAL_IPV4) ||
>> + (ctx->ipv6 && ctx->attrs.mss != MSS_LOCAL_IPV6))
>> + goto err;
>> +
>> + if (!ctx->attrs.wscale_ok || ctx->attrs.snd_wscale != 7)
>> + goto err;
>> +
>> + if (!ctx->attrs.tstamp_ok)
>
> The bpf-ci reported an error in cpuv4. The email from
> bot+bpf-ci@kernel.org has the link.
>
> I tried the following:
>
> if (!ctx->attrs.tstamp_ok) {
> bpf_printk("ctx->attrs.tstamp_ok %u",
> ctx->attrs.tstamp_ok);
> goto err;
> }
>
>
> The above prints tstamp_ok as 1 while there is a "if
> (!ctx->attrs.tstamp_ok)" test before it.
>
> Yonghong and I debugged it quite a bit. The verifier concluded that
> ctx->attrs.tstamp_ok is 0. We saw some red herrings, like cpuv4 doing
> less register spilling, but have not been able to root cause it yet.
>
> In the meantime, there are existing selftests parsing the TCP header.
> For example, the test_parse_tcp_hdr_opt[_dynptr].c. Not as complete as
> your tcp_parse_option() but should be pretty close. It does not use
> bpf_loop. It uses a bounded loop + a subprog (the parse_hdr_opt in the
> selftests) instead. You can consider a similar construct to see if it
> works around the cpuv4 CI issue for the time being.
I did some investigation and found some issues in the verifier. I need to dig into the details more.
On the other hand, with the following patch
https://lore.kernel.org/bpf/20240102190726.2017424-1-yonghong.song@linux.dev/
the selftest can run successfully with cpuv2/v3/v4.
>
> pw-bot: cr
>
>> + goto err;
>> +
>> + if (!ctx->attrs.sack_ok)
>> + goto err;
>> +
>> + if (!ctx->tcp->ece || !ctx->tcp->cwr)
>> + goto err;
>> +
>> + return 0;
>> +err:
>> + return -1;
>> +}
>> +
>
> [ ... ]
>
>> +static int tcp_handle_syn(struct tcp_syncookie *ctx)
>> +{
>> + s64 csum;
>> +
>> + if (tcp_validate_header(ctx))
>> + goto err;
>> +
>> + tcp_parse_options(ctx);
>> +
>> + if (tcp_validate_sysctl(ctx))
>> + goto err;
>> +
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-21 1:28 ` [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
2023-12-21 6:35 ` Martin KaFai Lau
@ 2023-12-21 16:44 ` Matthieu Baerts
1 sibling, 0 replies; 15+ messages in thread
From: Matthieu Baerts @ 2023-12-21 16:44 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Kuniyuki Iwashima, bpf, netdev, Eric Dumazet, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Paolo Abeni
Hi Kuniyuki,
On 21/12/2023 02:28, Kuniyuki Iwashima wrote:
> This commit adds a sample selftest to demonstrate how we can use
> bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.
(...)
> diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
> new file mode 100644
> index 000000000000..ec5ea2c85119
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
> @@ -0,0 +1,154 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright Amazon.com Inc. or its affiliates. */
> +
> +#define _GNU_SOURCE
> +#include <sched.h>
> +#include <stdlib.h>
> +#include <net/if.h>
> +
> +#include "test_progs.h"
> +#include "cgroup_helpers.h"
> +#include "network_helpers.h"
> +#include "test_tcp_custom_syncookie.skel.h"
> +
> +#ifndef IPPROTO_MPTCP
> +#define IPPROTO_MPTCP 262
> +#endif
A small detail if you need to send a new version: you can remove this
now that MPTCP support has been dropped. If no new version is needed,
it's OK to leave it; it doesn't hurt.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC.
2023-12-21 1:28 [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
` (5 preceding siblings ...)
2023-12-21 1:28 ` [PATCH v7 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
@ 2024-01-12 6:20 ` Martin KaFai Lau
6 siblings, 0 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2024-01-12 6:20 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Kuniyuki Iwashima, bpf, netdev, Eric Dumazet, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Paolo Abeni, Yonghong Song
On 12/20/23 5:28 PM, Kuniyuki Iwashima wrote:
> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> for the connection request until a valid ACK is returned for the SYN+ACK.
>
> The cookie contains two kinds of host-specific bits, a timestamp and
> secrets, so it can only be validated by the generator. This means SYN
> Cookie consumes network resources between the client and the server;
> intermediate nodes must remember which node to route the ACK to for the cookie.
>
> SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
> the edge network. After SYN Proxy completes 3WHS, it forwards SYN to the
> backend server and completes another 3WHS. However, since the server's
> ISN differs from the cookie, the proxy must manage the ISN mappings and
> fix up SEQ/ACK numbers in every packet for each connection. If a proxy
> node goes down, all the connections through it are terminated. Keeping
> state at the proxy is painful from that perspective.
>
> At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
> Our SYN Proxy consists of the front proxy layer and the backend kernel
> module. (See slides of LPC2023 [0], p37 - p48)
>
> The cookie that SYN Proxy generates differs from the kernel's cookie in
> that it contains a secret (called rolling salt) (i) shared by all the proxy
> nodes so that any node can validate ACK and (ii) updated periodically so
> that old cookies cannot be validated and we need not encode a timestamp for
> the cookie. Also, the ISN carries WScale, SACK, and ECN, rather than TS val.
> This avoids sacrificing connection quality for customers who turn off the
> TCP timestamps option due to a retro CVE.
>
> After 3WHS, the proxy restores SYN, encapsulates ACK into SYN, and forwards
> the TCP-in-TCP packet to the backend server. Our kernel module works at
> Netfilter input/output hooks and first feeds SYN to the TCP stack to
> initiate 3WHS. When the module is triggered for SYN+ACK, it looks up the
> corresponding request socket and overwrites tcp_rsk(req)->snt_isn with the
> proxy's cookie. Then, the module can complete 3WHS with the original ACK
> as is.
>
> This way, our SYN Proxy does not manage the ISN mappings nor wait for
> SYN+ACK from the backend and thus can remain stateless. It's working very
> well for multiple-Tbps high-bandwidth services, but we are looking
> for a way to drop the dirty hack and further optimise the sequences.
>
> If we could validate an arbitrary SYN Cookie on the backend server with
> BPF, the proxy would not need to restore SYN nor pass it. After validating
> ACK, the proxy node just needs to forward it, and then the server can do
> the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
> and create a connection from the ACK.
>
> This series allows us to create a full sk from an arbitrary SYN Cookie,
> which is done in 3 steps.
>
> 1) At tc, BPF prog calls a new kfunc to create a reqsk and configure
> it based on the argument populated from SYN Cookie. The reqsk has
> its listener as req->rsk_listener and is passed to the TCP stack as
> skb->sk.
>
> 2) During TCP socket lookup for the skb, skb_steal_sock() returns a
> listener in the reuseport group that inet_reqsk(skb->sk)->rsk_listener
> belongs to.
>
> 3) In cookie_v[46]_check(), the reqsk (skb->sk) is fully initialised and
> a full sk is created.
>
> The kfunc usage is as follows:
>
> struct bpf_tcp_req_attrs attrs = {
> .mss = mss,
> .wscale_ok = wscale_ok,
> .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
> .snd_wscale = snd_wscale, /* Client's WScale < 15 */
> .tstamp_ok = tstamp_ok,
> .rcv_tsval = tsval,
> .rcv_tsecr = tsecr, /* Server's Initial TSval */
> .usec_ts_ok = usec_ts_ok,
> .sack_ok = sack_ok,
> .ecn_ok = ecn_ok,
> }
>
> skc = bpf_skc_lookup_tcp(...);
> sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
> bpf_sk_assign_tcp_reqsk(skb, sk, attrs, sizeof(attrs));
> bpf_sk_release(skc);
>
> [0]: https://lpc.events/event/17/contributions/1645/attachments/1350/2701/SYN_Proxy_at_Scale_with_BPF.pdf
>
>
> Changes:
> v7:
> * Patch 5 & 6
> * Drop MPTCP support
I think Yonghong's (thanks!) cpuv4 patch
(https://lore.kernel.org/bpf/20240110051348.2737007-1-yonghong.song@linux.dev/)
has addressed the issue that the selftest in patch 6 has encountered.
There are some minor comments in v7. Please respin v8 when the cpuv4 patch has
concluded so that it can kick off the CI also.
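As a toy illustration of the cover letter's point that the proxy encodes WScale, SACK, and ECN in the ISN instead of TS val: the real cookie layout is private to the proxy, so the bit layout below is entirely made up, and the hash bits are just a placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cookie layout (the actual proxy layout is not public).
 * The low 6 bits of the ISN carry the negotiated TCP attributes:
 *   bits 0-3: snd_wscale (0-14)
 *   bit  4:   sack_ok
 *   bit  5:   ecn_ok
 * The remaining bits would hold the hashed secret (rolling salt).
 */
struct syn_attrs {
	uint8_t snd_wscale;
	uint8_t sack_ok;
	uint8_t ecn_ok;
};

static uint32_t cookie_pack(uint32_t hash_bits, const struct syn_attrs *a)
{
	uint32_t isn = hash_bits & ~0x3fU;	/* keep upper bits for the hash */

	isn |= a->snd_wscale & 0xf;
	isn |= (uint32_t)(a->sack_ok & 1) << 4;
	isn |= (uint32_t)(a->ecn_ok & 1) << 5;
	return isn;
}

static void cookie_unpack(uint32_t isn, struct syn_attrs *a)
{
	a->snd_wscale = isn & 0xf;
	a->sack_ok = (isn >> 4) & 1;
	a->ecn_ok = (isn >> 5) & 1;
}
```

Any node sharing the salt can recompute the hash bits and recover the attributes from the ACK's acknowledged ISN, which is what makes the scheme stateless across proxy nodes.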
^ permalink raw reply [flat|nested] 15+ messages in thread