* [PATCH v5 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC.
@ 2023-12-11 7:36 Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 1/6] tcp: Move tcp_ns_to_ts() to tcp.h Kuniyuki Iwashima
` (5 more replies)
0 siblings, 6 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-11 7:36 UTC
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
for the connection request until a valid ACK is received for the SYN+ACK.
The cookie contains two kinds of host-specific bits, a timestamp and
secrets, so it can only be validated by its generator. This means SYN
Cookie consumes network resources between the client and the server;
intermediate nodes must remember which node to route the ACK to for the
cookie.
SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
the edge network. After SYN Proxy completes 3WHS, it forwards SYN to the
backend server and completes another 3WHS. However, since the server's
ISN differs from the cookie, the proxy must manage the ISN mappings and
fix up SEQ/ACK numbers in every packet for each connection. If a proxy
node goes down, all the connections through it are terminated. Keeping
state at the proxy is painful from that perspective.
At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
Our SYN Proxy consists of the front proxy layer and the backend kernel
module. (See slides of LPC2023 [0], p37 - p48)
The cookie that SYN Proxy generates differs from the kernel's cookie in
that it contains a secret (called rolling salt) (i) shared by all the proxy
nodes so that any node can validate ACK and (ii) updated periodically so
that old cookies cannot be validated and we need not encode a timestamp for
the cookie. Also, the ISN encodes WScale, SACK, and ECN instead of the
TS val. This avoids sacrificing connection quality for customers who
turn off the TCP timestamps option due to a retro CVE.
After 3WHS, the proxy restores SYN, encapsulates ACK into SYN, and forwards
the TCP-in-TCP packet to the backend server. Our kernel module works at
Netfilter input/output hooks and first feeds SYN to the TCP stack to
initiate 3WHS. When the module is triggered for SYN+ACK, it looks up the
corresponding request socket and overwrites tcp_rsk(req)->snt_isn with the
proxy's cookie. Then, the module can complete 3WHS with the original ACK
as is.
This way, our SYN Proxy does not manage the ISN mappings nor wait for
SYN+ACK from the backend and thus can remain stateless. It has been
working very well for high-bandwidth (multiple-Tbps) services, but we
are looking for a way to drop the dirty hack and further optimise the
sequences.
If we could validate an arbitrary SYN Cookie on the backend server with
BPF, the proxy would not need to restore SYN nor pass it. After validating
ACK, the proxy node just needs to forward it, and then the server can do
the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
and create a connection from the ACK.
This series allows us to create a full sk from an arbitrary SYN Cookie,
which is done in 3 steps.
1) At tc, BPF prog calls a new kfunc to create a reqsk and configure
it based on the argument populated from SYN Cookie. The reqsk has
its listener as req->rsk_listener and is passed to the TCP stack as
skb->sk.
2) During TCP socket lookup for the skb, skb_steal_sock() returns a
listener in the reuseport group that inet_reqsk(skb->sk)->rsk_listener
belongs to.
3) In cookie_v[46]_check(), the reqsk (skb->sk) is fully initialised and
a full sk is created.
The kfunc usage is as follows:
	struct tcp_cookie_attributes attr = {
		.tcp_opt = {
			.mss_clamp = mss,
			.wscale_ok = wscale_ok,
			.rcv_wscale = recv_scale, /* Server's WScale < 15 */
			.snd_wscale = send_scale, /* Client's WScale < 15 */
			.tstamp_ok = tstamp_ok,
			.rcv_tsval = tsval,
			.rcv_tsecr = tsecr,
			.sack_ok = sack_ok,
		},
		.ecn_ok = ecn_ok,
		.usec_ts_ok = usec_ts_ok,
	};

	skc = bpf_skc_lookup_tcp(...);
	sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
	bpf_sk_assign_tcp_reqsk(skb, sk, &attr, sizeof(attr));
	bpf_sk_release(skc);
[0]: https://lpc.events/event/17/contributions/1645/attachments/1350/2701/SYN_Proxy_at_Scale_with_BPF.pdf
Changes:
v5:
* Split patch 1-3
* Patch 3
* Clear req->rsk_listener in skb_steal_sock()
* Patch 4 & 5
* Move sysctl validation and tsoff init from cookie_bpf_check() to kfunc
* Patch 5
* Do not increment LINUX_MIB_SYNCOOKIES(RECV|FAILED)
* Patch 6
* Remove __always_inline
* Test if tcp_handle_{syn,ack}() is executed
* Move some definition to bpf_tracing_net.h
* s/BPF_F_CURRENT_NETNS/-1/
v4: https://lore.kernel.org/bpf/20231205013420.88067-1-kuniyu@amazon.com/
* Patch 1 & 2
* s/CONFIG_SYN_COOKIE/CONFIG_SYN_COOKIES/
* Patch 1
* Don't set rcv_wscale for BPF SYN Cookie case.
* Patch 2
* Add test for tcp_opt.{unused,rcv_wscale} in kfunc
* Modify skb_steal_sock() to avoid resetting skb->sk
* Support SO_REUSEPORT lookup
* Patch 3
* Add CONFIG_SYN_COOKIES to Kconfig for CI
* Define BPF_F_CURRENT_NETNS
v3: https://lore.kernel.org/netdev/20231121184245.69569-1-kuniyu@amazon.com/
* Guard kfunc and req->syncookie part in inet6?_steal_sock() with
CONFIG_SYN_COOKIE
v2: https://lore.kernel.org/netdev/20231120222341.54776-1-kuniyu@amazon.com/
* Drop SOCK_OPS and move SYN Cookie validation logic to TC with kfunc.
* Add cleanup patches to reduce discrepancy between cookie_v[46]_check()
v1: https://lore.kernel.org/bpf/20231013220433.70792-1-kuniyu@amazon.com/
Kuniyuki Iwashima (6):
tcp: Move tcp_ns_to_ts() to tcp.h
tcp: Move skb_steal_sock() to request_sock.h
bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock().
bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
bpf: tcp: Support arbitrary SYN Cookie.
selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
include/net/request_sock.h | 39 ++
include/net/sock.h | 25 -
include/net/tcp.h | 35 ++
net/core/filter.c | 118 +++-
net/core/sock.c | 14 +-
net/ipv4/syncookies.c | 40 +-
net/ipv6/syncookies.c | 13 +-
tools/testing/selftests/bpf/bpf_kfuncs.h | 10 +
tools/testing/selftests/bpf/config | 1 +
.../bpf/prog_tests/tcp_custom_syncookie.c | 169 +++++
.../selftests/bpf/progs/bpf_tracing_net.h | 17 +
.../selftests/bpf/progs/test_siphash.h | 64 ++
.../bpf/progs/test_tcp_custom_syncookie.c | 580 ++++++++++++++++++
.../bpf/progs/test_tcp_custom_syncookie.h | 140 +++++
14 files changed, 1219 insertions(+), 46 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
create mode 100644 tools/testing/selftests/bpf/progs/test_siphash.h
create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
--
2.30.2
* [PATCH v5 bpf-next 1/6] tcp: Move tcp_ns_to_ts() to tcp.h
2023-12-11 7:36 [PATCH v5 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
@ 2023-12-11 7:36 ` Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 2/6] tcp: Move skb_steal_sock() to request_sock.h Kuniyuki Iwashima
` (4 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-11 7:36 UTC
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
We will support arbitrary SYN Cookie with BPF.
When BPF prog validates ACK and kfunc allocates a reqsk, we need
to call tcp_ns_to_ts() to calculate an offset of TSval for later
use:
     time
  t0 : Send SYN+ACK
       -> tsval = Initial TSval (Random Number)

  t1 : Recv ACK of 3WHS
       -> tsoff = TSecr - tcp_ns_to_ts(usec_ts_ok, tcp_clock_ns())
                = Initial TSval - t1

  t2 : Send ACK
       -> tsval = t2 + tsoff
                = Initial TSval + (t2 - t1)
                = Initial TSval + Time Delta (x)

  (x) Note that the time delta does not include the initial RTT
      from t0 to t1.
Let's move tcp_ns_to_ts() to tcp.h.
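For illustration, the t1 step above maps to code roughly as follows
(a sketch mirroring the kfunc added later in this series; variable
names are illustrative):

	/* On the ACK of 3WHS, rcv_tsecr echoes the initial TSval chosen
	 * at t0, so subtracting the current clock yields the offset to
	 * apply to every TSval sent from now on.
	 */
	u32 tsoff = tcp_opt.rcv_tsecr -
		    tcp_ns_to_ts(usec_ts_ok, tcp_clock_ns());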
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
include/net/tcp.h | 9 +++++++++
net/ipv4/syncookies.c | 9 ---------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 973555cb1d3f..c77354d1b86d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -577,6 +577,15 @@ static inline u32 tcp_cookie_time(void)
return val;
}
+/* Convert one nsec 64bit timestamp to ts (ms or usec resolution) */
+static inline u64 tcp_ns_to_ts(bool usec_ts, u64 val)
+{
+ if (usec_ts)
+ return div_u64(val, NSEC_PER_USEC);
+
+ return div_u64(val, NSEC_PER_MSEC);
+}
+
u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
u16 *mssp);
__u32 cookie_v4_init_sequence(const struct sk_buff *skb, __u16 *mss);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 61f1c96cfe63..981944c22820 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -51,15 +51,6 @@ static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport,
count, &syncookie_secret[c]);
}
-/* Convert one nsec 64bit timestamp to ts (ms or usec resolution) */
-static u64 tcp_ns_to_ts(bool usec_ts, u64 val)
-{
- if (usec_ts)
- return div_u64(val, NSEC_PER_USEC);
-
- return div_u64(val, NSEC_PER_MSEC);
-}
-
/*
* when syncookies are in effect and tcp timestamps are enabled we encode
* tcp options in the lower bits of the timestamp value that will be
--
2.30.2
* [PATCH v5 bpf-next 2/6] tcp: Move skb_steal_sock() to request_sock.h
2023-12-11 7:36 [PATCH v5 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 1/6] tcp: Move tcp_ns_to_ts() to tcp.h Kuniyuki Iwashima
@ 2023-12-11 7:36 ` Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 3/6] bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock() Kuniyuki Iwashima
` (3 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-11 7:36 UTC
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
We will support arbitrary SYN Cookie with BPF.
If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to the TCP stack as skb->sk with req->syncookie set to 1.
In skb_steal_sock(), we need to check inet_reqsk(sk)->syncookie
to see if the reqsk is created by kfunc. However, inet_reqsk()
is not available in sock.h.
Let's move skb_steal_sock() to request_sock.h.
While at it, we refactor skb_steal_sock() so it returns early if
skb->sk is NULL to minimise the following patch.
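For context, the caller side (e.g. inet6?_steal_sock(), touched later
in this series) consumes it roughly like this (a sketch, not part of
this patch):

	bool refcounted, prefetched;
	struct sock *sk;

	sk = skb_steal_sock(skb, &refcounted, &prefetched);
	if (!sk) {
		/* no prefetched socket; fall back to a regular lookup */
	}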
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
include/net/request_sock.h | 28 ++++++++++++++++++++++++++++
include/net/sock.h | 25 -------------------------
2 files changed, 28 insertions(+), 25 deletions(-)
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 144c39db9898..26c630c40abb 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -83,6 +83,34 @@ static inline struct sock *req_to_sk(struct request_sock *req)
return (struct sock *)req;
}
+/**
+ * skb_steal_sock - steal a socket from an sk_buff
+ * @skb: sk_buff to steal the socket from
+ * @refcounted: is set to true if the socket is reference-counted
+ * @prefetched: is set to true if the socket was assigned from bpf
+ */
+static inline struct sock *skb_steal_sock(struct sk_buff *skb,
+ bool *refcounted, bool *prefetched)
+{
+ struct sock *sk = skb->sk;
+
+ if (!sk) {
+ *prefetched = false;
+ *refcounted = false;
+ return NULL;
+ }
+
+ *prefetched = skb_sk_is_prefetched(skb);
+ if (*prefetched)
+ *refcounted = sk_is_refcounted(sk);
+ else
+ *refcounted = true;
+
+ skb->destructor = NULL;
+ skb->sk = NULL;
+ return sk;
+}
+
static inline struct request_sock *
reqsk_alloc(const struct request_sock_ops *ops, struct sock *sk_listener,
bool attach_listener)
diff --git a/include/net/sock.h b/include/net/sock.h
index 1d6931caf0c3..0ed77af38000 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2838,31 +2838,6 @@ sk_is_refcounted(struct sock *sk)
return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
}
-/**
- * skb_steal_sock - steal a socket from an sk_buff
- * @skb: sk_buff to steal the socket from
- * @refcounted: is set to true if the socket is reference-counted
- * @prefetched: is set to true if the socket was assigned from bpf
- */
-static inline struct sock *
-skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
-{
- if (skb->sk) {
- struct sock *sk = skb->sk;
-
- *refcounted = true;
- *prefetched = skb_sk_is_prefetched(skb);
- if (*prefetched)
- *refcounted = sk_is_refcounted(sk);
- skb->destructor = NULL;
- skb->sk = NULL;
- return sk;
- }
- *prefetched = false;
- *refcounted = false;
- return NULL;
-}
-
/* Checks if this SKB belongs to an HW offloaded socket
* and whether any SW fallbacks are required based on dev.
* Check decrypted mark in case skb_orphan() cleared socket.
--
2.30.2
* [PATCH v5 bpf-next 3/6] bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock().
2023-12-11 7:36 [PATCH v5 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 1/6] tcp: Move tcp_ns_to_ts() to tcp.h Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 2/6] tcp: Move skb_steal_sock() to request_sock.h Kuniyuki Iwashima
@ 2023-12-11 7:36 ` Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 4/6] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check() Kuniyuki Iwashima
` (2 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-11 7:36 UTC
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
We will support arbitrary SYN Cookie with BPF.
If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to the TCP stack as skb->sk with req->syncookie set to 1. Also,
the reqsk has its listener as req->rsk_listener with no refcnt
taken.
When the TCP stack looks up a socket from the skb, we steal
inet_reqsk(skb->sk)->rsk_listener in skb_steal_sock() so that
the skb will be processed in cookie_v[46]_check() with the
listener or another one in the same reuseport group.
Note that we do not clear skb->sk and skb->destructor so that we
can carry the reqsk to cookie_v[46]_check().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
include/net/request_sock.h | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 26c630c40abb..8839133d6f6b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -101,10 +101,21 @@ static inline struct sock *skb_steal_sock(struct sk_buff *skb,
}
*prefetched = skb_sk_is_prefetched(skb);
- if (*prefetched)
+ if (*prefetched) {
+#if IS_ENABLED(CONFIG_SYN_COOKIES)
+ if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
+ struct request_sock *req = inet_reqsk(sk);
+
+ *refcounted = false;
+ sk = req->rsk_listener;
+ req->rsk_listener = NULL;
+ return sk;
+ }
+#endif
*refcounted = sk_is_refcounted(sk);
- else
+ } else {
*refcounted = true;
+ }
skb->destructor = NULL;
skb->sk = NULL;
--
2.30.2
* [PATCH v5 bpf-next 4/6] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
2023-12-11 7:36 [PATCH v5 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
` (2 preceding siblings ...)
2023-12-11 7:36 ` [PATCH v5 bpf-next 3/6] bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock() Kuniyuki Iwashima
@ 2023-12-11 7:36 ` Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 5/6] bpf: tcp: Support arbitrary SYN Cookie Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-11 7:36 UTC
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
We will support arbitrary SYN Cookie with BPF in the following
patch.
If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to cookie_v[46]_check() as skb->sk. If skb->sk is not
NULL, we call cookie_bpf_check().
Then, we clear skb->sk and skb->destructor so that we do not hold
a refcnt for the reqsk and the listener.
After that, we finish initialisation for the remaining fields with
cookie_tcp_reqsk_init().
Note that the server side WScale is set only for non-BPF SYN Cookie.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
include/net/tcp.h | 20 ++++++++++++++++++++
net/ipv4/syncookies.c | 31 +++++++++++++++++++++++++++----
net/ipv6/syncookies.c | 13 +++++++++----
3 files changed, 56 insertions(+), 8 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c77354d1b86d..b5d7149eb0f8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -599,6 +599,26 @@ static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
dst_feature(dst, RTAX_FEATURE_ECN);
}
+#if IS_ENABLED(CONFIG_BPF)
+static inline bool cookie_bpf_ok(struct sk_buff *skb)
+{
+ return skb->sk;
+}
+
+struct request_sock *cookie_bpf_check(struct sock *sk, struct sk_buff *skb);
+#else
+static inline bool cookie_bpf_ok(struct sk_buff *skb)
+{
+ return false;
+}
+
+static inline struct request_sock *cookie_bpf_check(struct sock *sk,
+ struct sk_buff *skb)
+{
+ return NULL;
+}
+#endif
+
/* From net/ipv6/syncookies.c */
int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th);
struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 981944c22820..be88bf586ff9 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -295,6 +295,24 @@ static int cookie_tcp_reqsk_init(struct sock *sk, struct sk_buff *skb,
return 0;
}
+#if IS_ENABLED(CONFIG_BPF)
+struct request_sock *cookie_bpf_check(struct sock *sk, struct sk_buff *skb)
+{
+ struct request_sock *req = inet_reqsk(skb->sk);
+
+ skb->sk = NULL;
+ skb->destructor = NULL;
+
+ if (cookie_tcp_reqsk_init(sk, skb, req)) {
+ reqsk_free(req);
+ req = NULL;
+ }
+
+ return req;
+}
+EXPORT_SYMBOL_GPL(cookie_bpf_check);
+#endif
+
struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops,
struct sock *sk, struct sk_buff *skb,
struct tcp_options_received *tcp_opt,
@@ -395,9 +413,13 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
!th->ack || th->rst)
goto out;
- req = cookie_tcp_check(net, sk, skb);
- if (IS_ERR(req))
- goto out;
+ if (cookie_bpf_ok(skb)) {
+ req = cookie_bpf_check(sk, skb);
+ } else {
+ req = cookie_tcp_check(net, sk, skb);
+ if (IS_ERR(req))
+ goto out;
+ }
if (!req)
goto out_drop;
@@ -445,7 +467,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
ireq->wscale_ok, &rcv_wscale,
dst_metric(&rt->dst, RTAX_INITRWND));
- ireq->rcv_wscale = rcv_wscale;
+ if (!req->syncookie)
+ ireq->rcv_wscale = rcv_wscale;
ireq->ecn_ok &= cookie_ecn_ok(net, &rt->dst);
ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst);
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index c8d2ca27220c..6b9c69278819 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -182,9 +182,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
!th->ack || th->rst)
goto out;
- req = cookie_tcp_check(net, sk, skb);
- if (IS_ERR(req))
- goto out;
+ if (cookie_bpf_ok(skb)) {
+ req = cookie_bpf_check(sk, skb);
+ } else {
+ req = cookie_tcp_check(net, sk, skb);
+ if (IS_ERR(req))
+ goto out;
+ }
if (!req)
goto out_drop;
@@ -247,7 +251,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
ireq->wscale_ok, &rcv_wscale,
dst_metric(dst, RTAX_INITRWND));
- ireq->rcv_wscale = rcv_wscale;
+ if (!req->syncookie)
+ ireq->rcv_wscale = rcv_wscale;
ireq->ecn_ok &= cookie_ecn_ok(net, dst);
ret = tcp_get_cookie_sock(sk, skb, req, dst);
--
2.30.2
* [PATCH v5 bpf-next 5/6] bpf: tcp: Support arbitrary SYN Cookie.
2023-12-11 7:36 [PATCH v5 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
` (3 preceding siblings ...)
2023-12-11 7:36 ` [PATCH v5 bpf-next 4/6] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check() Kuniyuki Iwashima
@ 2023-12-11 7:36 ` Kuniyuki Iwashima
2023-12-11 7:36 ` [PATCH v5 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-11 7:36 UTC
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
This patch adds a new kfunc available at TC hook to support arbitrary
SYN Cookie.
The basic usage is as follows:
	struct tcp_cookie_attributes attr = {
		.tcp_opt = {
			.mss_clamp = mss,
			.wscale_ok = wscale_ok,
			.rcv_wscale = recv_scale, /* Server's WScale < 15 */
			.snd_wscale = send_scale, /* Client's WScale < 15 */
			.tstamp_ok = tstamp_ok,
			.rcv_tsval = tsval,
			.rcv_tsecr = tsecr,
			.sack_ok = sack_ok,
		},
		.ecn_ok = ecn_ok,
		.usec_ts_ok = usec_ts_ok,
	};

	skc = bpf_skc_lookup_tcp(...);
	sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
	bpf_sk_assign_tcp_reqsk(skb, sk, &attr, sizeof(attr));
	bpf_sk_release(skc);
bpf_sk_assign_tcp_reqsk() takes skb, a listener sk, and struct
tcp_cookie_attributes and allocates reqsk and configures it. Then,
bpf_sk_assign_tcp_reqsk() links reqsk with skb and the listener.
The notable thing here is that we do not hold a refcnt for either the
reqsk or the listener. To differentiate that, we mark reqsk->syncookie,
which is only used in TX for now. So, if reqsk->syncookie is 1 in RX,
it means that the reqsk was allocated by the kfunc.
When skb is freed, sock_pfree() checks if reqsk->syncookie is 1,
and in that case, we set reqsk->rsk_listener to NULL before calling
reqsk_free() as the reqsk does not hold a refcnt on the listener.
When the TCP stack looks up a socket from the skb, we steal the
listener from the reqsk in skb_steal_sock() and create a full sk
in cookie_v[46]_check().
The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock()
after creating a full sk.
Note that we can use the unused bits in struct tcp_options_received
and extend struct tcp_cookie_attributes in the future when we add a
new attribute that is determined in 3WHS.
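As a purely hypothetical illustration (not part of this patch), such an
extension could look like:

	struct tcp_cookie_attributes {
		struct tcp_options_received tcp_opt;
		bool ecn_ok;
		bool usec_ts_ok;
		/* hypothetical future attribute determined in 3WHS: */
		/* bool some_new_flag; */
	} __packed;

where attr__sz would let the kernel tell the layouts apart if needed.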
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
include/net/tcp.h | 6 +++
net/core/filter.c | 118 +++++++++++++++++++++++++++++++++++++++++++++-
net/core/sock.c | 14 +++++-
3 files changed, 134 insertions(+), 4 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b5d7149eb0f8..6e3ee9036ecb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -600,6 +600,12 @@ static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
}
#if IS_ENABLED(CONFIG_BPF)
+struct tcp_cookie_attributes {
+ struct tcp_options_received tcp_opt;
+ bool ecn_ok;
+ bool usec_ts_ok;
+} __packed;
+
static inline bool cookie_bpf_ok(struct sk_buff *skb)
{
return skb->sk;
diff --git a/net/core/filter.c b/net/core/filter.c
index adcfc2c25754..713feacb3ad3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11816,6 +11816,110 @@ __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
return 0;
}
+
+__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
+ struct tcp_cookie_attributes *attr,
+ int attr__sz)
+{
+#if IS_ENABLED(CONFIG_SYN_COOKIES)
+ const struct request_sock_ops *ops;
+ struct inet_request_sock *ireq;
+ struct tcp_request_sock *treq;
+ struct request_sock *req;
+ struct net *net;
+ __u16 min_mss;
+ u32 tsoff = 0;
+
+ if (attr__sz != sizeof(*attr) || attr->tcp_opt.unused)
+ return -EINVAL;
+
+ if (!sk)
+ return -EINVAL;
+
+ if (!skb_at_tc_ingress(skb))
+ return -EINVAL;
+
+ net = dev_net(skb->dev);
+ if (net != sock_net(sk))
+ return -ENETUNREACH;
+
+ switch (skb->protocol) {
+ case htons(ETH_P_IP):
+ ops = &tcp_request_sock_ops;
+ min_mss = 536;
+ break;
+#if IS_BUILTIN(CONFIG_IPV6)
+ case htons(ETH_P_IPV6):
+ ops = &tcp6_request_sock_ops;
+ min_mss = IPV6_MIN_MTU - 60;
+ break;
+#endif
+ default:
+ return -EINVAL;
+ }
+
+ if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN)
+ return -EINVAL;
+
+ if (attr->tcp_opt.mss_clamp < min_mss)
+ return -EINVAL;
+
+ if (attr->tcp_opt.wscale_ok) {
+ if (!READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
+ return -EINVAL;
+
+ if (attr->tcp_opt.snd_wscale > TCP_MAX_WSCALE ||
+ attr->tcp_opt.rcv_wscale > TCP_MAX_WSCALE)
+ return -EINVAL;
+ }
+
+ if (attr->tcp_opt.sack_ok && !READ_ONCE(net->ipv4.sysctl_tcp_sack))
+ return -EINVAL;
+
+ if (attr->tcp_opt.tstamp_ok) {
+ if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
+ return -EINVAL;
+
+ tsoff = attr->tcp_opt.rcv_tsecr -
+ tcp_ns_to_ts(attr->usec_ts_ok, tcp_clock_ns());
+ }
+
+ if (sk_is_mptcp(sk))
+ req = mptcp_subflow_reqsk_alloc(ops, sk, false);
+ else
+ req = inet_reqsk_alloc(ops, sk, false);
+
+ if (!req)
+ return -ENOMEM;
+
+ ireq = inet_rsk(req);
+ treq = tcp_rsk(req);
+
+ req->rsk_listener = sk;
+ req->syncookie = 1;
+ req->mss = attr->tcp_opt.mss_clamp;
+ req->ts_recent = attr->tcp_opt.rcv_tsval;
+
+ ireq->snd_wscale = attr->tcp_opt.snd_wscale;
+ ireq->rcv_wscale = attr->tcp_opt.rcv_wscale;
+ ireq->tstamp_ok = attr->tcp_opt.tstamp_ok;
+ ireq->sack_ok = attr->tcp_opt.sack_ok;
+ ireq->wscale_ok = attr->tcp_opt.wscale_ok;
+ ireq->ecn_ok = attr->ecn_ok;
+
+ treq->req_usec_ts = attr->usec_ts_ok;
+ treq->ts_off = tsoff;
+
+ skb_orphan(skb);
+ skb->sk = req_to_sk(req);
+ skb->destructor = sock_pfree;
+
+ return 0;
+#else
+ return -EOPNOTSUPP;
+#endif
+}
+
__bpf_kfunc_end_defs();
int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags,
@@ -11844,6 +11948,10 @@ BTF_SET8_START(bpf_kfunc_check_set_sock_addr)
BTF_ID_FLAGS(func, bpf_sock_addr_set_sun_path)
BTF_SET8_END(bpf_kfunc_check_set_sock_addr)
+BTF_SET8_START(bpf_kfunc_check_set_tcp_reqsk)
+BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk)
+BTF_SET8_END(bpf_kfunc_check_set_tcp_reqsk)
+
static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
.owner = THIS_MODULE,
.set = &bpf_kfunc_check_set_skb,
@@ -11859,6 +11967,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_addr = {
.set = &bpf_kfunc_check_set_sock_addr,
};
+static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
+ .owner = THIS_MODULE,
+ .set = &bpf_kfunc_check_set_tcp_reqsk,
+};
+
static int __init bpf_kfunc_init(void)
{
int ret;
@@ -11874,8 +11987,9 @@ static int __init bpf_kfunc_init(void)
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_NETFILTER, &bpf_kfunc_set_skb);
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
- return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
- &bpf_kfunc_set_sock_addr);
+ ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+ &bpf_kfunc_set_sock_addr);
+ return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
}
late_initcall(bpf_kfunc_init);
diff --git a/net/core/sock.c b/net/core/sock.c
index fef349dd72fa..998950e97dfe 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2579,8 +2579,18 @@ EXPORT_SYMBOL(sock_efree);
#ifdef CONFIG_INET
void sock_pfree(struct sk_buff *skb)
{
- if (sk_is_refcounted(skb->sk))
- sock_gen_put(skb->sk);
+ struct sock *sk = skb->sk;
+
+ if (!sk_is_refcounted(sk))
+ return;
+
+ if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
+ inet_reqsk(sk)->rsk_listener = NULL;
+ reqsk_free(inet_reqsk(sk));
+ return;
+ }
+
+ sock_gen_put(sk);
}
EXPORT_SYMBOL(sock_pfree);
#endif /* CONFIG_INET */
--
2.30.2
* [PATCH v5 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-11 7:36 [PATCH v5 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
` (4 preceding siblings ...)
2023-12-11 7:36 ` [PATCH v5 bpf-next 5/6] bpf: tcp: Support arbitrary SYN Cookie Kuniyuki Iwashima
@ 2023-12-11 7:36 ` Kuniyuki Iwashima
2023-12-13 20:44 ` Martin KaFai Lau
5 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-11 7:36 UTC
To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev
This commit adds a sample selftest to demonstrate how we can use
bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.
The test creates IPv4/IPv6 x TCP/MPTCP connections and transfers
messages over them on lo with BPF tc prog attached.
The tc prog processes SYN and returns SYN+ACK with the following
ISN and TS. In a real use case, this part will be done by other
hosts.
  MSB                                       LSB
  ISN: | 31 ... 8 | 7 6 |  5  |  4   | 3 2 1 0 |
       |  Hash_1  | MSS | ECN | SACK |  WScale |

  TS:  | 31 ... 8 | 7 ... 0 |
       |  Random  |  Hash_2 |
WScale in SYN is reused in SYN+ACK.
The client returns ACK, and tc prog will recalculate ISN and TS
from ACK and validate SYN Cookie.
If it's valid, the prog calls kfunc to allocate a reqsk for skb and
configure the reqsk based on the argument created from SYN Cookie.
Later, the reqsk will be processed in cookie_v[46]_check() to create
a connection.
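For reference, the tc prog decodes the lower bits of the cookie
following the layout above, roughly (see the masks defined in
test_tcp_custom_syncookie.c):

	wscale  = cookie & BPF_SYNCOOKIE_WSCALE_MASK;	/* bits 3-0 */
	sack_ok = cookie & BPF_SYNCOOKIE_SACK;		/* bit 4 */
	ecn_ok  = cookie & BPF_SYNCOOKIE_ECN;		/* bit 5 */
	mssind  = (cookie & (3 << 6)) >> 6;		/* bits 7-6 */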
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
tools/testing/selftests/bpf/bpf_kfuncs.h | 10 +
tools/testing/selftests/bpf/config | 1 +
.../bpf/prog_tests/tcp_custom_syncookie.c | 169 +++++
.../selftests/bpf/progs/bpf_tracing_net.h | 17 +
.../selftests/bpf/progs/test_siphash.h | 64 ++
.../bpf/progs/test_tcp_custom_syncookie.c | 580 ++++++++++++++++++
.../bpf/progs/test_tcp_custom_syncookie.h | 140 +++++
7 files changed, 981 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
create mode 100644 tools/testing/selftests/bpf/progs/test_siphash.h
create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
index b4e78c1eb37b..e2a1651851cf 100644
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
@@ -51,6 +51,16 @@ extern int bpf_dynptr_clone(const struct bpf_dynptr *ptr, struct bpf_dynptr *clo
extern int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
const __u8 *sun_path, __u32 sun_path__sz) __ksym;
+/* Description
+ * Allocate and configure a reqsk and link it with a listener and skb.
+ * Returns
+ * Error code
+ */
+struct sock;
+struct tcp_cookie_attributes;
+extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
+ struct tcp_cookie_attributes *attr, int attr__sz) __ksym;
+
void *bpf_cast_to_kern_ctx(void *) __ksym;
void *bpf_rdonly_cast(void *obj, __u32 btf_id) __ksym;
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c125c441abc7..01f241ea2c67 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -81,6 +81,7 @@ CONFIG_NF_NAT=y
CONFIG_RC_CORE=y
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
+CONFIG_SYN_COOKIES=y
CONFIG_TEST_BPF=m
CONFIG_USERFAULTFD=y
CONFIG_VSOCKETS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
new file mode 100644
index 000000000000..20b2b4592340
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdlib.h>
+#include <net/if.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+#include "test_tcp_custom_syncookie.skel.h"
+
+#ifndef IPPROTO_MPTCP
+#define IPPROTO_MPTCP 262
+#endif
+
+static struct test_tcp_custom_syncookie_case {
+ int family, type, protocol;
+ char addr[16];
+ char name[10];
+} test_cases[] = {
+ {
+ .name = "IPv4 TCP ",
+ .family = AF_INET,
+ .type = SOCK_STREAM,
+ .addr = "127.0.0.1",
+ },
+ {
+ .name = "IPv6 TCP ",
+ .family = AF_INET6,
+ .type = SOCK_STREAM,
+ .addr = "::1",
+ },
+ {
+ .name = "IPv4 MPTCP",
+ .family = AF_INET,
+ .type = SOCK_STREAM,
+ .protocol = IPPROTO_MPTCP,
+ .addr = "127.0.0.1",
+ },
+ {
+ .name = "IPv6 MPTCP",
+ .family = AF_INET6,
+ .type = SOCK_STREAM,
+ .protocol = IPPROTO_MPTCP,
+ .addr = "::1",
+ },
+};
+
+static int setup_netns(void)
+{
+ if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
+ return -1;
+
+ if (!ASSERT_OK(system("ip link set dev lo up"), "ip"))
+ goto err;
+
+ if (!ASSERT_OK(write_sysctl("/proc/sys/net/ipv4/tcp_ecn", "1"),
+ "write_sysctl"))
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+static int setup_tc(struct test_tcp_custom_syncookie *skel)
+{
+ LIBBPF_OPTS(bpf_tc_hook, qdisc_lo, .attach_point = BPF_TC_INGRESS);
+ LIBBPF_OPTS(bpf_tc_opts, tc_attach,
+ .prog_fd = bpf_program__fd(skel->progs.tcp_custom_syncookie));
+
+ qdisc_lo.ifindex = if_nametoindex("lo");
+ if (!ASSERT_OK(bpf_tc_hook_create(&qdisc_lo), "qdisc add dev lo clsact"))
+ goto err;
+
+ if (!ASSERT_OK(bpf_tc_attach(&qdisc_lo, &tc_attach),
+ "filter add dev lo ingress"))
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+#define msg "Hello World"
+#define msglen 11
+
+static void transfer_message(int sender, int receiver)
+{
+ char buf[msglen];
+ int ret;
+
+ ret = send(sender, msg, msglen, 0);
+ if (!ASSERT_EQ(ret, msglen, "send"))
+ return;
+
+ memset(buf, 0, sizeof(buf));
+
+ ret = recv(receiver, buf, msglen, 0);
+ if (!ASSERT_EQ(ret, msglen, "recv"))
+ return;
+
+ ret = strncmp(buf, msg, msglen);
+ if (!ASSERT_EQ(ret, 0, "strncmp"))
+ return;
+}
+
+static void create_connection(struct test_tcp_custom_syncookie_case *test_case)
+{
+ int server, client, child;
+
+ if (test_case->protocol == IPPROTO_MPTCP)
+ server = start_mptcp_server(test_case->family, test_case->addr, 0, 0);
+ else
+ server = start_server(test_case->family, test_case->type, test_case->addr, 0, 0);
+ if (!ASSERT_NEQ(server, -1, "start_server"))
+ return;
+
+ client = connect_to_fd(server, 0);
+ if (!ASSERT_NEQ(client, -1, "connect_to_fd"))
+ goto close_server;
+
+ child = accept(server, NULL, 0);
+ if (!ASSERT_NEQ(child, -1, "accept"))
+ goto close_client;
+
+ transfer_message(client, child);
+ transfer_message(child, client);
+
+ close(child);
+close_client:
+ close(client);
+close_server:
+ close(server);
+}
+
+void test_tcp_custom_syncookie(void)
+{
+ struct test_tcp_custom_syncookie *skel;
+ int i;
+
+ if (setup_netns())
+ return;
+
+ skel = test_tcp_custom_syncookie__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ if (setup_tc(skel))
+ goto destroy_skel;
+
+ for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
+ skel->bss->handled_syn = false;
+ skel->bss->handled_ack = false;
+
+ test__start_subtest(test_cases[i].name);
+ create_connection(&test_cases[i]);
+
+ ASSERT_EQ(skel->bss->handled_syn, true, "SYN is not handled at tc.");
+ ASSERT_EQ(skel->bss->handled_ack, true, "ACK is not handled at tc");
+ }
+
+destroy_skel:
+ system("tc qdisc del dev lo clsact");
+
+ test_tcp_custom_syncookie__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 0b793a102791..49e525ad9856 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -26,6 +26,7 @@
#define IPV6_AUTOFLOWLABEL 70
#define TC_ACT_UNSPEC (-1)
+#define TC_ACT_OK 0
#define TC_ACT_SHOT 2
#define SOL_TCP 6
@@ -50,9 +51,25 @@
#define ICSK_TIME_LOSS_PROBE 5
#define ICSK_TIME_REO_TIMEOUT 6
+#define ETH_ALEN 6
#define ETH_HLEN 14
+#define ETH_P_IP 0x0800
#define ETH_P_IPV6 0x86DD
+#define NEXTHDR_TCP 6
+
+#define TCPOPT_NOP 1
+#define TCPOPT_EOL 0
+#define TCPOPT_MSS 2
+#define TCPOPT_WINDOW 3
+#define TCPOPT_TIMESTAMP 8
+#define TCPOPT_SACK_PERM 4
+
+#define TCPOLEN_MSS 4
+#define TCPOLEN_WINDOW 3
+#define TCPOLEN_TIMESTAMP 10
+#define TCPOLEN_SACK_PERM 2
+
#define CHECKSUM_NONE 0
#define CHECKSUM_PARTIAL 3
diff --git a/tools/testing/selftests/bpf/progs/test_siphash.h b/tools/testing/selftests/bpf/progs/test_siphash.h
new file mode 100644
index 000000000000..5d3a7ec36780
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_siphash.h
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#ifndef _TEST_SIPHASH_H
+#define _TEST_SIPHASH_H
+
+/* include/linux/bitops.h */
+static inline u64 rol64(u64 word, unsigned int shift)
+{
+ return (word << (shift & 63)) | (word >> ((-shift) & 63));
+}
+
+/* include/linux/siphash.h */
+#define SIPHASH_PERMUTATION(a, b, c, d) ( \
+ (a) += (b), (b) = rol64((b), 13), (b) ^= (a), (a) = rol64((a), 32), \
+ (c) += (d), (d) = rol64((d), 16), (d) ^= (c), \
+ (a) += (d), (d) = rol64((d), 21), (d) ^= (a), \
+ (c) += (b), (b) = rol64((b), 17), (b) ^= (c), (c) = rol64((c), 32))
+
+#define SIPHASH_CONST_0 0x736f6d6570736575ULL
+#define SIPHASH_CONST_1 0x646f72616e646f6dULL
+#define SIPHASH_CONST_2 0x6c7967656e657261ULL
+#define SIPHASH_CONST_3 0x7465646279746573ULL
+
+/* lib/siphash.c */
+#define SIPROUND SIPHASH_PERMUTATION(v0, v1, v2, v3)
+
+#define PREAMBLE(len) \
+ u64 v0 = SIPHASH_CONST_0; \
+ u64 v1 = SIPHASH_CONST_1; \
+ u64 v2 = SIPHASH_CONST_2; \
+ u64 v3 = SIPHASH_CONST_3; \
+ u64 b = ((u64)(len)) << 56; \
+ v3 ^= key->key[1]; \
+ v2 ^= key->key[0]; \
+ v1 ^= key->key[1]; \
+ v0 ^= key->key[0];
+
+#define POSTAMBLE \
+ v3 ^= b; \
+ SIPROUND; \
+ SIPROUND; \
+ v0 ^= b; \
+ v2 ^= 0xff; \
+ SIPROUND; \
+ SIPROUND; \
+ SIPROUND; \
+ SIPROUND; \
+ return (v0 ^ v1) ^ (v2 ^ v3);
+
+static inline u64 siphash_2u64(const u64 first, const u64 second, const siphash_key_t *key)
+{
+ PREAMBLE(16)
+ v3 ^= first;
+ SIPROUND;
+ SIPROUND;
+ v0 ^= first;
+ v3 ^= second;
+ SIPROUND;
+ SIPROUND;
+ v0 ^= second;
+ POSTAMBLE
+}
+#endif
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
new file mode 100644
index 000000000000..21809c69e740
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
@@ -0,0 +1,580 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "bpf_tracing_net.h"
+#include "bpf_kfuncs.h"
+#include "test_siphash.h"
+#include "test_tcp_custom_syncookie.h"
+
+/* Hash is calculated for each client and split into ISN and TS.
+ *
+ * MSB                                       LSB
+ * ISN: | 31 ... 8 | 7 6 |  5  |  4   | 3 2 1 0 |
+ *      |  Hash_1  | MSS | ECN | SACK |  WScale |
+ *
+ * TS:  | 31 ... 8 | 7 ... 0 |
+ *      |  Random  |  Hash_2 |
+ */
+#define COOKIE_BITS 8
+#define COOKIE_MASK (((__u32)1 << COOKIE_BITS) - 1)
+
+enum {
+ /* 0xf is invalid thus means that SYN did not have WScale. */
+ BPF_SYNCOOKIE_WSCALE_MASK = (1 << 4) - 1,
+ BPF_SYNCOOKIE_SACK = (1 << 4),
+ BPF_SYNCOOKIE_ECN = (1 << 5),
+};
+
+#define MSS_LOCAL_IPV4 65495
+#define MSS_LOCAL_IPV6 65476
+
+const __u16 msstab4[] = {
+ 536,
+ 1300,
+ 1460,
+ MSS_LOCAL_IPV4,
+};
+
+const __u16 msstab6[] = {
+ 1280 - 60, /* IPV6_MIN_MTU - 60 */
+ 1480 - 60,
+ 9000 - 60,
+ MSS_LOCAL_IPV6,
+};
+
+static siphash_key_t test_key_siphash = {
+ { 0x0706050403020100ULL, 0x0f0e0d0c0b0a0908ULL }
+};
+
+struct tcp_syncookie {
+ struct __sk_buff *skb;
+ void *data_end;
+ struct ethhdr *eth;
+ struct iphdr *ipv4;
+ struct ipv6hdr *ipv6;
+ struct tcphdr *tcp;
+ union {
+ char *ptr;
+ __be32 *ptr32;
+ };
+ struct tcp_cookie_attributes attr;
+ u32 cookie;
+ u64 first;
+};
+
+bool handled_syn, handled_ack;
+
+static int tcp_load_headers(struct tcp_syncookie *ctx)
+{
+ ctx->data_end = (void *)(long)ctx->skb->data_end;
+ ctx->eth = (struct ethhdr *)(long)ctx->skb->data;
+
+ if (ctx->eth + 1 > ctx->data_end)
+ goto err;
+
+ switch (bpf_ntohs(ctx->eth->h_proto)) {
+ case ETH_P_IP:
+ ctx->ipv4 = (struct iphdr *)(ctx->eth + 1);
+
+ if (ctx->ipv4 + 1 > ctx->data_end)
+ goto err;
+
+ if (ctx->ipv4->ihl != sizeof(*ctx->ipv4) / 4)
+ goto err;
+
+ if (ctx->ipv4->version != 4)
+ goto err;
+
+ if (ctx->ipv4->protocol != IPPROTO_TCP)
+ goto err;
+
+ ctx->tcp = (struct tcphdr *)(ctx->ipv4 + 1);
+ break;
+ case ETH_P_IPV6:
+ ctx->ipv6 = (struct ipv6hdr *)(ctx->eth + 1);
+
+ if (ctx->ipv6 + 1 > ctx->data_end)
+ goto err;
+
+ if (ctx->ipv6->version != 6)
+ goto err;
+
+ if (ctx->ipv6->nexthdr != NEXTHDR_TCP)
+ goto err;
+
+ ctx->tcp = (struct tcphdr *)(ctx->ipv6 + 1);
+ break;
+ default:
+ goto err;
+ }
+
+ if (ctx->tcp + 1 > ctx->data_end)
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+static int tcp_reload_headers(struct tcp_syncookie *ctx)
+{
+ /* Without volatile,
+ * R3 32-bit pointer arithmetic prohibited
+ */
+ volatile u64 data_len = ctx->skb->data_end - ctx->skb->data;
+
+ if (ctx->tcp->doff < sizeof(*ctx->tcp) / 4)
+ goto err;
+
+ /* Needed to calculate csum and parse TCP options. */
+ if (bpf_skb_change_tail(ctx->skb, data_len + 60 - ctx->tcp->doff * 4, 0))
+ goto err;
+
+ ctx->data_end = (void *)(long)ctx->skb->data_end;
+ ctx->eth = (struct ethhdr *)(long)ctx->skb->data;
+ if (ctx->ipv4) {
+ ctx->ipv4 = (struct iphdr *)(ctx->eth + 1);
+ ctx->ipv6 = NULL;
+ ctx->tcp = (struct tcphdr *)(ctx->ipv4 + 1);
+ } else {
+ ctx->ipv4 = NULL;
+ ctx->ipv6 = (struct ipv6hdr *)(ctx->eth + 1);
+ ctx->tcp = (struct tcphdr *)(ctx->ipv6 + 1);
+ }
+
+ if ((void *)ctx->tcp + 60 > ctx->data_end)
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+static __sum16 tcp_v4_csum(struct tcp_syncookie *ctx, __wsum csum)
+{
+ return csum_tcpudp_magic(ctx->ipv4->saddr, ctx->ipv4->daddr,
+ ctx->tcp->doff * 4, IPPROTO_TCP, csum);
+}
+
+static __sum16 tcp_v6_csum(struct tcp_syncookie *ctx, __wsum csum)
+{
+ return csum_ipv6_magic(&ctx->ipv6->saddr, &ctx->ipv6->daddr,
+ ctx->tcp->doff * 4, IPPROTO_TCP, csum);
+}
+
+static int tcp_validate_header(struct tcp_syncookie *ctx)
+{
+ s64 csum;
+
+ if (tcp_reload_headers(ctx))
+ goto err;
+
+ csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4, 0);
+ if (csum < 0)
+ goto err;
+
+ if (ctx->ipv4) {
+ /* check tcp_v4_csum(csum) is 0 if not on lo. */
+
+ csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, ctx->ipv4->ihl * 4, 0);
+ if (csum < 0)
+ goto err;
+
+ if (csum_fold(csum) != 0)
+ goto err;
+ } else if (ctx->ipv6) {
+ /* check tcp_v6_csum(csum) is 0 if not on lo. */
+ }
+
+ return 0;
+err:
+ return -1;
+}
+
+static int tcp_parse_option(__u32 index, struct tcp_syncookie *ctx)
+{
+ struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+ char opcode, opsize;
+
+ if (ctx->ptr + 1 > ctx->data_end)
+ goto stop;
+
+ opcode = *ctx->ptr++;
+
+ if (opcode == TCPOPT_EOL)
+ goto stop;
+
+ if (opcode == TCPOPT_NOP)
+ goto next;
+
+ if (ctx->ptr + 1 > ctx->data_end)
+ goto stop;
+
+ opsize = *ctx->ptr++;
+
+ if (opsize < 2)
+ goto stop;
+
+ switch (opcode) {
+ case TCPOPT_MSS:
+ if (opsize == TCPOLEN_MSS && ctx->tcp->syn &&
+ ctx->ptr + (TCPOLEN_MSS - 2) < ctx->data_end)
+ tcp_opt->mss_clamp = get_unaligned_be16(ctx->ptr);
+ break;
+ case TCPOPT_WINDOW:
+ if (opsize == TCPOLEN_WINDOW && ctx->tcp->syn &&
+ ctx->ptr + (TCPOLEN_WINDOW - 2) < ctx->data_end) {
+ tcp_opt->wscale_ok = 1;
+ tcp_opt->snd_wscale = *ctx->ptr;
+ }
+ break;
+ case TCPOPT_TIMESTAMP:
+ if (opsize == TCPOLEN_TIMESTAMP &&
+ ctx->ptr + (TCPOLEN_TIMESTAMP - 2) < ctx->data_end) {
+ tcp_opt->saw_tstamp = 1;
+ tcp_opt->rcv_tsval = get_unaligned_be32(ctx->ptr);
+ tcp_opt->rcv_tsecr = get_unaligned_be32(ctx->ptr + 4);
+
+ if (ctx->tcp->syn && tcp_opt->rcv_tsecr)
+ tcp_opt->tstamp_ok = 0;
+ else
+ tcp_opt->tstamp_ok = 1;
+ }
+ break;
+ case TCPOPT_SACK_PERM:
+ if (opsize == TCPOLEN_SACK_PERM && ctx->tcp->syn &&
+ ctx->ptr + (TCPOLEN_SACK_PERM - 2) < ctx->data_end)
+ tcp_opt->sack_ok = 1;
+ break;
+ }
+
+ ctx->ptr += opsize - 2;
+next:
+ return 0;
+stop:
+ return 1;
+}
+
+static void tcp_parse_options(struct tcp_syncookie *ctx)
+{
+ ctx->ptr = (char *)(ctx->tcp + 1);
+
+ bpf_loop(40, tcp_parse_option, ctx, 0);
+}
+
+static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
+{
+ struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+
+ if ((ctx->ipv4 && tcp_opt->mss_clamp != MSS_LOCAL_IPV4) ||
+ (ctx->ipv6 && tcp_opt->mss_clamp != MSS_LOCAL_IPV6))
+ goto err;
+
+ if (!tcp_opt->wscale_ok || tcp_opt->snd_wscale != 7)
+ goto err;
+
+ if (!tcp_opt->tstamp_ok)
+ goto err;
+
+ if (!tcp_opt->sack_ok)
+ goto err;
+
+ if (!ctx->tcp->ece || !ctx->tcp->cwr)
+ goto err;
+
+ return 0;
+err:
+ return -1;
+}
+
+static void tcp_prepare_cookie(struct tcp_syncookie *ctx)
+{
+ struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+ u32 seq = bpf_ntohl(ctx->tcp->seq);
+ u64 first = 0, second;
+ int mssind = 0;
+ u32 hash;
+
+ if (ctx->ipv4) {
+ for (mssind = ARRAY_SIZE(msstab4) - 1; mssind; mssind--)
+ if (tcp_opt->mss_clamp >= msstab4[mssind])
+ break;
+
+ tcp_opt->mss_clamp = msstab4[mssind];
+
+ first = (u64)ctx->ipv4->saddr << 32 | ctx->ipv4->daddr;
+ } else if (ctx->ipv6) {
+ for (mssind = ARRAY_SIZE(msstab6) - 1; mssind; mssind--)
+ if (tcp_opt->mss_clamp >= msstab6[mssind])
+ break;
+
+ tcp_opt->mss_clamp = msstab6[mssind];
+
+ first = (u64)ctx->ipv6->saddr.in6_u.u6_addr8[0] << 32 |
+ ctx->ipv6->daddr.in6_u.u6_addr32[0];
+ }
+
+ second = (u64)seq << 32 | ctx->tcp->source << 16 | ctx->tcp->dest;
+ hash = siphash_2u64(first, second, &test_key_siphash);
+
+ if (tcp_opt->tstamp_ok) {
+ tcp_opt->rcv_tsecr = bpf_get_prandom_u32();
+ tcp_opt->rcv_tsecr &= ~COOKIE_MASK;
+ tcp_opt->rcv_tsecr |= hash & COOKIE_MASK;
+ }
+
+ hash &= ~COOKIE_MASK;
+ hash |= mssind << 6;
+
+ if (tcp_opt->wscale_ok)
+ hash |= tcp_opt->snd_wscale & BPF_SYNCOOKIE_WSCALE_MASK;
+
+ if (tcp_opt->sack_ok)
+ hash |= BPF_SYNCOOKIE_SACK;
+
+ if (tcp_opt->tstamp_ok && ctx->tcp->ece && ctx->tcp->cwr)
+ hash |= BPF_SYNCOOKIE_ECN;
+
+ ctx->cookie = hash;
+}
+
+static void tcp_write_options(struct tcp_syncookie *ctx)
+{
+ struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+
+ ctx->ptr32 = (__be32 *)(ctx->tcp + 1);
+
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_MSS << 24 | TCPOLEN_MSS << 16 |
+ tcp_opt->mss_clamp);
+
+ if (tcp_opt->wscale_ok)
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_NOP << 24 |
+ TCPOPT_WINDOW << 16 |
+ TCPOLEN_WINDOW << 8 |
+ tcp_opt->snd_wscale);
+
+ if (tcp_opt->tstamp_ok) {
+ if (tcp_opt->sack_ok)
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_SACK_PERM << 24 |
+ TCPOLEN_SACK_PERM << 16 |
+ TCPOPT_TIMESTAMP << 8 |
+ TCPOLEN_TIMESTAMP);
+ else
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_NOP << 24 |
+ TCPOPT_NOP << 16 |
+ TCPOPT_TIMESTAMP << 8 |
+ TCPOLEN_TIMESTAMP);
+
+ *ctx->ptr32++ = bpf_htonl(tcp_opt->rcv_tsecr);
+ *ctx->ptr32++ = bpf_htonl(tcp_opt->rcv_tsval);
+ } else if (tcp_opt->sack_ok) {
+ *ctx->ptr32++ = bpf_htonl(TCPOPT_NOP << 24 |
+ TCPOPT_NOP << 16 |
+ TCPOPT_SACK_PERM << 8 |
+ TCPOLEN_SACK_PERM);
+ }
+}
+
+static int tcp_handle_syn(struct tcp_syncookie *ctx)
+{
+ s64 csum;
+
+ if (tcp_validate_header(ctx))
+ goto err;
+
+ tcp_parse_options(ctx);
+
+ if (tcp_validate_sysctl(ctx))
+ goto err;
+
+ tcp_prepare_cookie(ctx);
+ tcp_write_options(ctx);
+
+ swap(ctx->tcp->source, ctx->tcp->dest);
+ ctx->tcp->check = 0;
+ ctx->tcp->ack_seq = bpf_htonl(bpf_ntohl(ctx->tcp->seq) + 1);
+ ctx->tcp->seq = bpf_htonl(ctx->cookie);
+ ctx->tcp->doff = ((long)ctx->ptr32 - (long)ctx->tcp) >> 2;
+ ctx->tcp->ack = 1;
+ if (!ctx->attr.tcp_opt.tstamp_ok || !ctx->tcp->ece || !ctx->tcp->cwr)
+ ctx->tcp->ece = 0;
+ ctx->tcp->cwr = 0;
+
+ csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4, 0);
+ if (csum < 0)
+ goto err;
+
+ if (ctx->ipv4) {
+ swap(ctx->ipv4->saddr, ctx->ipv4->daddr);
+ ctx->tcp->check = tcp_v4_csum(ctx, csum);
+
+ ctx->ipv4->check = 0;
+ ctx->ipv4->tos = 0;
+ ctx->ipv4->tot_len = bpf_htons((long)ctx->ptr32 - (long)ctx->ipv4);
+ ctx->ipv4->id = 0;
+ ctx->ipv4->ttl = 64;
+
+ csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, sizeof(*ctx->ipv4), 0);
+ if (csum < 0)
+ goto err;
+
+ ctx->ipv4->check = csum_fold(csum);
+ } else if (ctx->ipv6) {
+ swap(ctx->ipv6->saddr, ctx->ipv6->daddr);
+ ctx->tcp->check = tcp_v6_csum(ctx, csum);
+
+ *(__be32 *)ctx->ipv6 = bpf_htonl(0x60000000);
+ ctx->ipv6->payload_len = bpf_htons((long)ctx->ptr32 - (long)ctx->tcp);
+ ctx->ipv6->hop_limit = 64;
+ }
+
+ swap_array(ctx->eth->h_source, ctx->eth->h_dest);
+
+ if (bpf_skb_change_tail(ctx->skb, (long)ctx->ptr32 - (long)ctx->eth, 0))
+ goto err;
+
+ return bpf_redirect(ctx->skb->ifindex, 0);
+err:
+ return TC_ACT_SHOT;
+}
+
+static int tcp_validate_cookie(struct tcp_syncookie *ctx)
+{
+ struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+ u32 cookie = bpf_ntohl(ctx->tcp->ack_seq) - 1;
+ u32 seq = bpf_ntohl(ctx->tcp->seq) - 1;
+ u64 first = 0, second;
+ int mssind;
+ u32 hash;
+
+ if (ctx->ipv4)
+ first = (u64)ctx->ipv4->saddr << 32 | ctx->ipv4->daddr;
+ else if (ctx->ipv6)
+ first = (u64)ctx->ipv6->saddr.in6_u.u6_addr8[0] << 32 |
+ ctx->ipv6->daddr.in6_u.u6_addr32[0];
+
+ second = (u64)seq << 32 | ctx->tcp->source << 16 | ctx->tcp->dest;
+ hash = siphash_2u64(first, second, &test_key_siphash);
+
+ if (tcp_opt->saw_tstamp)
+ hash -= tcp_opt->rcv_tsecr & COOKIE_MASK;
+ else
+ hash &= ~COOKIE_MASK;
+
+ hash -= cookie & ~COOKIE_MASK;
+ if (hash)
+ goto err;
+
+ mssind = (cookie & (3 << 6)) >> 6;
+ if (ctx->ipv4) {
+ if (mssind > ARRAY_SIZE(msstab4))
+ goto err;
+
+ tcp_opt->mss_clamp = msstab4[mssind];
+ } else {
+ if (mssind > ARRAY_SIZE(msstab6))
+ goto err;
+
+ tcp_opt->mss_clamp = msstab6[mssind];
+ }
+
+ tcp_opt->snd_wscale = cookie & BPF_SYNCOOKIE_WSCALE_MASK;
+ tcp_opt->rcv_wscale = tcp_opt->snd_wscale;
+ tcp_opt->wscale_ok = tcp_opt->snd_wscale != BPF_SYNCOOKIE_WSCALE_MASK;
+ tcp_opt->sack_ok = cookie & BPF_SYNCOOKIE_SACK;
+ ctx->attr.ecn_ok = cookie & BPF_SYNCOOKIE_ECN;
+
+ return 0;
+err:
+ return -1;
+}
+
+static int tcp_handle_ack(struct tcp_syncookie *ctx)
+{
+ struct bpf_sock_tuple tuple;
+ struct bpf_sock *skc;
+ int ret = TC_ACT_OK;
+ struct sock *sk;
+ u32 tuple_size;
+
+ if (ctx->ipv4) {
+ tuple.ipv4.saddr = ctx->ipv4->saddr;
+ tuple.ipv4.daddr = ctx->ipv4->daddr;
+ tuple.ipv4.sport = ctx->tcp->source;
+ tuple.ipv4.dport = ctx->tcp->dest;
+ tuple_size = sizeof(tuple.ipv4);
+ } else if (ctx->ipv6) {
+ __builtin_memcpy(tuple.ipv6.saddr, &ctx->ipv6->saddr, sizeof(tuple.ipv6.saddr));
+ __builtin_memcpy(tuple.ipv6.daddr, &ctx->ipv6->daddr, sizeof(tuple.ipv6.daddr));
+ tuple.ipv6.sport = ctx->tcp->source;
+ tuple.ipv6.dport = ctx->tcp->dest;
+ tuple_size = sizeof(tuple.ipv6);
+ } else {
+ goto out;
+ }
+
+ skc = bpf_skc_lookup_tcp(ctx->skb, &tuple, tuple_size, -1, 0);
+ if (!skc)
+ goto out;
+
+ if (skc->state != TCP_LISTEN)
+ goto release;
+
+ sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
+ if (!sk)
+ goto err;
+
+ if (tcp_validate_header(ctx))
+ goto err;
+
+ tcp_parse_options(ctx);
+
+ if (tcp_validate_cookie(ctx))
+ goto err;
+
+ ret = bpf_sk_assign_tcp_reqsk(ctx->skb, sk, &ctx->attr, sizeof(ctx->attr));
+ if (ret < 0)
+ goto err;
+
+release:
+ bpf_sk_release(skc);
+out:
+ return ret;
+
+err:
+ ret = TC_ACT_SHOT;
+ goto release;
+}
+
+SEC("tc")
+int tcp_custom_syncookie(struct __sk_buff *skb)
+{
+ struct tcp_syncookie ctx = {
+ .skb = skb,
+ };
+
+ if (tcp_load_headers(&ctx))
+ return TC_ACT_OK;
+
+ if (ctx.tcp->rst)
+ return TC_ACT_OK;
+
+ if (ctx.tcp->syn) {
+ if (ctx.tcp->ack)
+ return TC_ACT_OK;
+
+ handled_syn = true;
+
+ return tcp_handle_syn(&ctx);
+ }
+
+ handled_ack = true;
+
+ return tcp_handle_ack(&ctx);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
new file mode 100644
index 000000000000..29a6a53cf229
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#ifndef _TEST_TCP_SYNCOOKIE_H
+#define _TEST_TCP_SYNCOOKIE_H
+
+#define __packed __attribute__((__packed__))
+#define __force
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+#define swap(a, b) \
+ do { \
+ typeof(a) __tmp = (a); \
+ (a) = (b); \
+ (b) = __tmp; \
+ } while (0)
+
+#define swap_array(a, b) \
+ do { \
+ typeof(a) __tmp[sizeof(a)]; \
+ __builtin_memcpy(__tmp, a, sizeof(a)); \
+ __builtin_memcpy(a, b, sizeof(a)); \
+ __builtin_memcpy(b, __tmp, sizeof(a)); \
+ } while (0)
+
+/* asm-generic/unaligned.h */
+#define __get_unaligned_t(type, ptr) ({ \
+ const struct { type x; } __packed * __pptr = (typeof(__pptr))(ptr); \
+ __pptr->x; \
+})
+
+#define get_unaligned(ptr) __get_unaligned_t(typeof(*(ptr)), (ptr))
+
+static inline u16 get_unaligned_be16(const void *p)
+{
+ return bpf_ntohs(__get_unaligned_t(__be16, p));
+}
+
+static inline u32 get_unaligned_be32(const void *p)
+{
+ return bpf_ntohl(__get_unaligned_t(__be32, p));
+}
+
+/* lib/checksum.c */
+static inline u32 from64to32(u64 x)
+{
+ /* add up 32-bit and 32-bit for 32+c bit */
+ x = (x & 0xffffffff) + (x >> 32);
+ /* add up carry.. */
+ x = (x & 0xffffffff) + (x >> 32);
+ return (u32)x;
+}
+
+static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
+ __u32 len, __u8 proto, __wsum sum)
+{
+ unsigned long long s = (__force u32)sum;
+
+ s += (__force u32)saddr;
+ s += (__force u32)daddr;
+#ifdef __BIG_ENDIAN
+ s += proto + len;
+#else
+ s += (proto + len) << 8;
+#endif
+ return (__force __wsum)from64to32(s);
+}
+
+/* asm-generic/checksum.h */
+static inline __sum16 csum_fold(__wsum csum)
+{
+ u32 sum = (__force u32)csum;
+
+ sum = (sum & 0xffff) + (sum >> 16);
+ sum = (sum & 0xffff) + (sum >> 16);
+ return (__force __sum16)~sum;
+}
+
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
+ __u8 proto, __wsum sum)
+{
+ return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
+}
+
+/* net/ipv6/ip6_checksum.c */
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ const struct in6_addr *daddr,
+ __u32 len, __u8 proto, __wsum csum)
+{
+ int carry;
+ __u32 ulen;
+ __u32 uproto;
+ __u32 sum = (__force u32)csum;
+
+ sum += (__force u32)saddr->in6_u.u6_addr32[0];
+ carry = (sum < (__force u32)saddr->in6_u.u6_addr32[0]);
+ sum += carry;
+
+ sum += (__force u32)saddr->in6_u.u6_addr32[1];
+ carry = (sum < (__force u32)saddr->in6_u.u6_addr32[1]);
+ sum += carry;
+
+ sum += (__force u32)saddr->in6_u.u6_addr32[2];
+ carry = (sum < (__force u32)saddr->in6_u.u6_addr32[2]);
+ sum += carry;
+
+ sum += (__force u32)saddr->in6_u.u6_addr32[3];
+ carry = (sum < (__force u32)saddr->in6_u.u6_addr32[3]);
+ sum += carry;
+
+ sum += (__force u32)daddr->in6_u.u6_addr32[0];
+ carry = (sum < (__force u32)daddr->in6_u.u6_addr32[0]);
+ sum += carry;
+
+ sum += (__force u32)daddr->in6_u.u6_addr32[1];
+ carry = (sum < (__force u32)daddr->in6_u.u6_addr32[1]);
+ sum += carry;
+
+ sum += (__force u32)daddr->in6_u.u6_addr32[2];
+ carry = (sum < (__force u32)daddr->in6_u.u6_addr32[2]);
+ sum += carry;
+
+ sum += (__force u32)daddr->in6_u.u6_addr32[3];
+ carry = (sum < (__force u32)daddr->in6_u.u6_addr32[3]);
+ sum += carry;
+
+ ulen = (__force u32)bpf_htonl((__u32)len);
+ sum += ulen;
+ carry = (sum < ulen);
+ sum += carry;
+
+ uproto = (__force u32)bpf_htonl(proto);
+ sum += uproto;
+ carry = (sum < uproto);
+ sum += carry;
+
+ return csum_fold((__force __wsum)sum);
+}
+#endif
--
2.30.2
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v5 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-11 7:36 ` [PATCH v5 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
@ 2023-12-13 20:44 ` Martin KaFai Lau
2023-12-14 3:18 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2023-12-13 20:44 UTC
To: Kuniyuki Iwashima
Cc: Kuniyuki Iwashima, bpf, netdev, Daniel Xu, Eric Dumazet,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
On 12/10/23 11:36 PM, Kuniyuki Iwashima wrote:
> This commit adds a sample selftest to demonstrate how we can use
> bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.
>
> The test creates IPv4/IPv6 x TCP/MPTCP connections and transfers
> messages over them on lo with BPF tc prog attached.
>
> The tc prog processes SYN and returns SYN+ACK with the following
> ISN and TS. In a real use case, this part will be done by other
> hosts.
>
>   MSB                                       LSB
>   ISN: | 31 ... 8 | 7 6 |  5  |  4   | 3 2 1 0 |
>        |  Hash_1  | MSS | ECN | SACK |  WScale |
>
>   TS:  | 31 ... 8 | 7 ... 0 |
>        |  Random  |  Hash_2 |
>
> WScale in SYN is reused in SYN+ACK.
>
> The client returns ACK, and tc prog will recalculate ISN and TS
> from ACK and validate SYN Cookie.
>
> If it's valid, the prog calls kfunc to allocate a reqsk for skb and
> configure the reqsk based on the argument created from SYN Cookie.
>
> Later, the reqsk will be processed in cookie_v[46]_check() to create
> a connection.
The patch set looks good.
One thing I just noticed is about writing/reading bits into/from "struct
tcp_options_received". More on this below.
[ ... ]
> +void test_tcp_custom_syncookie(void)
> +{
> +        struct test_tcp_custom_syncookie *skel;
> +        int i;
> +
> +        if (setup_netns())
> +                return;
> +
> +        skel = test_tcp_custom_syncookie__open_and_load();
> +        if (!ASSERT_OK_PTR(skel, "open_and_load"))
> +                return;
> +
> +        if (setup_tc(skel))
> +                goto destroy_skel;
> +
> +        for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
> +                skel->bss->handled_syn = false;
> +                skel->bss->handled_ack = false;
> +
> +                test__start_subtest(test_cases[i].name);
This should be tested with:

        if (!test__start_subtest(test_cases[i].name))
                continue;

to skip the create_connection(). Probably do it at the beginning of the for loop.
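For illustration, the top of the loop would then look like this (untested
sketch):

        for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
                /* Subtest was filtered out: skip the whole iteration,
                 * including create_connection().
                 */
                if (!test__start_subtest(test_cases[i].name))
                        continue;

                skel->bss->handled_syn = false;
                skel->bss->handled_ack = false;

                create_connection(&test_cases[i]);

                ASSERT_EQ(skel->bss->handled_syn, true, "SYN is not handled at tc");
                ASSERT_EQ(skel->bss->handled_ack, true, "ACK is not handled at tc");
        }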
> +                create_connection(&test_cases[i]);
> +
> +                ASSERT_EQ(skel->bss->handled_syn, true, "SYN is not handled at tc.");
> +                ASSERT_EQ(skel->bss->handled_ack, true, "ACK is not handled at tc");
> +        }
> +
> +destroy_skel:
> +        system("tc qdisc del dev lo clsact");
> +
> +        test_tcp_custom_syncookie__destroy(skel);
> +}
[ ... ]
> +static int tcp_parse_option(__u32 index, struct tcp_syncookie *ctx)
> +{
> +        struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
> +        char opcode, opsize;
> +
> +        if (ctx->ptr + 1 > ctx->data_end)
> +                goto stop;
> +
> +        opcode = *ctx->ptr++;
> +
> +        if (opcode == TCPOPT_EOL)
> +                goto stop;
> +
> +        if (opcode == TCPOPT_NOP)
> +                goto next;
> +
> +        if (ctx->ptr + 1 > ctx->data_end)
> +                goto stop;
> +
> +        opsize = *ctx->ptr++;
> +
> +        if (opsize < 2)
> +                goto stop;
> +
> +        switch (opcode) {
> +        case TCPOPT_MSS:
> +                if (opsize == TCPOLEN_MSS && ctx->tcp->syn &&
> +                    ctx->ptr + (TCPOLEN_MSS - 2) < ctx->data_end)
> +                        tcp_opt->mss_clamp = get_unaligned_be16(ctx->ptr);
> +                break;
> +        case TCPOPT_WINDOW:
> +                if (opsize == TCPOLEN_WINDOW && ctx->tcp->syn &&
> +                    ctx->ptr + (TCPOLEN_WINDOW - 2) < ctx->data_end) {
> +                        tcp_opt->wscale_ok = 1;
> +                        tcp_opt->snd_wscale = *ctx->ptr;
When writing to a bitfield of "struct tcp_options_received", which is a kernel
struct, it needs to use the CO-RE API. BPF_CORE_WRITE_BITFIELD() has not
landed yet:
https://lore.kernel.org/bpf/4d3dd215a4fd57d980733886f9c11a45e1a9adf3.1702325874.git.dxu@dxuuu.xyz/
The same applies to reading bitfields, but BPF_CORE_READ_BITFIELD() is already
implemented in bpf_core_read.h.
Once BPF_CORE_WRITE_BITFIELD() lands, this test needs to be changed to use
BPF_CORE_{READ,WRITE}_BITFIELD().
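With those macros, the accesses here would look something like this (untested
sketch, using the names from the quoted code):

        /* instead of: tcp_opt->wscale_ok = 1; */
        BPF_CORE_WRITE_BITFIELD(tcp_opt, wscale_ok, 1);

        /* instead of: tcp_opt->snd_wscale = *ctx->ptr; */
        BPF_CORE_WRITE_BITFIELD(tcp_opt, snd_wscale, *ctx->ptr);

        /* and reading a bitfield back: */
        __u8 snd_wscale = BPF_CORE_READ_BITFIELD(tcp_opt, snd_wscale);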
> +                }
> +                break;
> +        case TCPOPT_TIMESTAMP:
> +                if (opsize == TCPOLEN_TIMESTAMP &&
> +                    ctx->ptr + (TCPOLEN_TIMESTAMP - 2) < ctx->data_end) {
> +                        tcp_opt->saw_tstamp = 1;
> +                        tcp_opt->rcv_tsval = get_unaligned_be32(ctx->ptr);
> +                        tcp_opt->rcv_tsecr = get_unaligned_be32(ctx->ptr + 4);
> +
> +                        if (ctx->tcp->syn && tcp_opt->rcv_tsecr)
> +                                tcp_opt->tstamp_ok = 0;
> +                        else
> +                                tcp_opt->tstamp_ok = 1;
> +                }
> +                break;
> +        case TCPOPT_SACK_PERM:
> +                if (opsize == TCPOLEN_SACK_PERM && ctx->tcp->syn &&
> +                    ctx->ptr + (TCPOLEN_SACK_PERM - 2) < ctx->data_end)
> +                        tcp_opt->sack_ok = 1;
> +                break;
> +        }
> +
> +        ctx->ptr += opsize - 2;
> +next:
> +        return 0;
> +stop:
> +        return 1;
> +}
* Re: [PATCH v5 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-13 20:44 ` Martin KaFai Lau
@ 2023-12-14 3:18 ` Kuniyuki Iwashima
2023-12-14 6:46 ` Martin KaFai Lau
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-14 3:18 UTC (permalink / raw)
To: martin.lau
Cc: andrii, ast, bpf, daniel, dxu, edumazet, kuni1840, kuniyu, netdev
From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Wed, 13 Dec 2023 12:44:42 -0800
> On 12/10/23 11:36 PM, Kuniyuki Iwashima wrote:
> > This commit adds a sample selftest to demonstrate how we can use
> > bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.
>
> [ ... ]
>
> > +        for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
> > +                skel->bss->handled_syn = false;
> > +                skel->bss->handled_ack = false;
> > +
> > +                test__start_subtest(test_cases[i].name);
>
>
> This should be tested with:
>
>         if (!test__start_subtest(test_cases[i].name))
>                 continue;
>
> to skip the create_connection(). Probably do it at the beginning of the for loop.
Thanks for catching this!
Will fix.
>
> [ ... ]
>
> > +        case TCPOPT_WINDOW:
> > +                if (opsize == TCPOLEN_WINDOW && ctx->tcp->syn &&
> > +                    ctx->ptr + (TCPOLEN_WINDOW - 2) < ctx->data_end) {
> > +                        tcp_opt->wscale_ok = 1;
> > +                        tcp_opt->snd_wscale = *ctx->ptr;
>
> When writing to a bitfield of "struct tcp_options_received", which is a kernel
> struct, it needs to use the CO-RE API. BPF_CORE_WRITE_BITFIELD() has not
> landed yet:
> https://lore.kernel.org/bpf/4d3dd215a4fd57d980733886f9c11a45e1a9adf3.1702325874.git.dxu@dxuuu.xyz/
>
> The same applies to reading bitfields, but BPF_CORE_READ_BITFIELD() is already
> implemented in bpf_core_read.h.
>
> Once BPF_CORE_WRITE_BITFIELD() lands, this test needs to be changed to use
> BPF_CORE_{READ,WRITE}_BITFIELD().
IIUC, the CO-RE API assumes that the offset of bitfields could change.
If the size of struct tcp_cookie_attributes changes, the kfunc will not work
in this test. So, BPF_CORE_WRITE_BITFIELD() helps only when the size of
tcp_cookie_attributes is unchanged but fields in tcp_options_received are
rearranged or expanded to use the unused bits?
Also, do we need to use BPF_CORE_READ() for the other non-bitfields in
struct tcp_options_received (and ecn_ok in struct tcp_cookie_attributes,
just in case other fields are added to tcp_cookie_attributes and ecn_ok
is rearranged)?
Just trying to understand when to use the CO-RE API.
Btw, thanks for merging the BPF_CORE_WRITE_BITFIELD patches!
>
> [ ... ]
* Re: [PATCH v5 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-14 3:18 ` Kuniyuki Iwashima
@ 2023-12-14 6:46 ` Martin KaFai Lau
2023-12-14 7:49 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2023-12-14 6:46 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: andrii, ast, bpf, daniel, dxu, edumazet, kuni1840, netdev,
Yonghong Song
On 12/13/23 7:18 PM, Kuniyuki Iwashima wrote:
> [ ... ]
>
> IIUC, the CO-RE API assumes that the offset of bitfields could change.
>
> If the size of struct tcp_cookie_attributes changes, the kfunc will not work
> in this test. So, BPF_CORE_WRITE_BITFIELD() helps only when the size of
> tcp_cookie_attributes is unchanged but fields in tcp_options_received are
> rearranged or expanded to use the unused bits?
Right, CO-RE helps to figure out the offset of a member in the running kernel.
>
> Also, do we need to use BPF_CORE_READ() for the other non-bitfields in
> struct tcp_options_received (and ecn_ok in struct tcp_cookie_attributes,
> just in case other fields are added to tcp_cookie_attributes and ecn_ok
> is rearranged)?
BPF_CORE_READ() is a CO-RE-friendly macro for using bpf_probe_read_kernel().
bpf_probe_read_kernel() is mostly for the tracing use case, where the ptr is
not safe to read directly.
That is not the case for the tcp_options_received ptr in this tc-bpf use case
or for other stack-allocated objects. In general, there is no need to use
BPF_CORE_READ(); the relocation for, e.g., tcp_opt->mss_clamp is done by
libbpf.
Going back to bitfields: they need BPF_CORE_*_BITFIELD() because the offset
may not be right even with __attribute__((preserve_access_index)), cc:
Yonghong and Andrii, who know more details than I do.
A verifier error has been reported:
https://lore.kernel.org/bpf/391d524c496acc97a8801d8bea80976f58485810.1700676682.git.dxu@dxuuu.xyz/
I also hit an error earlier in
https://lore.kernel.org/all/20220817061847.4182339-1-kafai@fb.com/ when not
using BPF_CORE_READ_BITFIELD(). I don't remember exactly what the instructions
looked like, but it was reading a wrong value instead of raising a verifier
error.
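To illustrate (a sketch; "tp" stands for a hypothetical, unsafe tcp_sock
pointer in a tracing prog):

        /* tc-bpf case: tcp_opt points at the prog's own stack object, so a
         * plain access is fine. libbpf still relocates the field offset
         * because the types in vmlinux.h carry preserve_access_index.
         */
        __u16 mss = tcp_opt->mss_clamp;

        /* tracing case: tp is not safe to dereference directly, so the read
         * must go through bpf_probe_read_kernel(), via BPF_CORE_READ().
         */
        __u16 user_mss = BPF_CORE_READ(tp, rx_opt.user_mss);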
================
Going back to this patch set here.
After sleeping on it longer, I am thinking it is better not to reuse 'struct
tcp_options_received' (meaning no bitfield) in the bpf_sk_assign_tcp_reqsk()
kfunc API.
There is not much benefit in reusing 'tcp_options_received'. If a new tcp
option were ever added to tcp_options_received, it is not as if
bpf_sk_assign_tcp_reqsk() would support it automatically; it would need to
relay the new option back to the allocated req. Unlike tcp_sock or req, which
may exist in large numbers such that a compact tcp_options_received is useful,
the tc-bpf use case here allocates it once on the stack. Also, not all the
members in tcp_options_received are useful; e.g. num_sacks, ts_recent_stamp,
and user_mss are not used. Leaving them there to be ignored by
bpf_sk_assign_tcp_reqsk() is confusing.
How about using a full u8 for each necessary member and adding them directly
to struct tcp_cookie_attributes instead of nesting them in another struct?
After taking out the unnecessary members, the size may not end up much bigger.
The bpf prog can then access attr->tstamp_ok directly, which is more natural.
The changes to patches 5 and 6 should be mostly mechanical.
I would also rename s/tcp_cookie_attributes/bpf_tcp_req_attrs/.
wdyt?
* Re: [PATCH v5 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-14 6:46 ` Martin KaFai Lau
@ 2023-12-14 7:49 ` Kuniyuki Iwashima
2023-12-14 12:26 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-14 7:49 UTC (permalink / raw)
To: martin.lau
Cc: andrii, ast, bpf, daniel, dxu, edumazet, kuni1840, kuniyu, netdev,
yonghong.song
From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Wed, 13 Dec 2023 22:46:11 -0800
> On 12/13/23 7:18 PM, Kuniyuki Iwashima wrote:
>
> [ ... ]
Thank you so much for the detailed explanation!
>
> ================
>
> Going back to this patch set here.
>
> After sleeping on it longer, I am thinking it is better not to reuse 'struct
> tcp_options_received' (meaning no bitfield) in the bpf_sk_assign_tcp_reqsk()
> kfunc API.
>
> [ ... ]
>
> How about using a full u8 for each necessary member and adding them directly
> to struct tcp_cookie_attributes instead of nesting them in another struct?
> After taking out the unnecessary members, the size may not end up much bigger.
>
> The bpf prog can then access attr->tstamp_ok directly, which is more natural.
> The changes to patches 5 and 6 should be mostly mechanical.
>
> I would also rename s/tcp_cookie_attributes/bpf_tcp_req_attrs/.
>
> wdyt?
Totally agree. I reused struct tcp_options_received but had similar thoughts
about the unused fields, the confusing fields (saw_tstamp vs tstamp_ok,
user_mss vs mss_clamp), etc.
And I like bpf_tcp_req_attrs; tcp_cookie_attributes was a bit wordy :)
So probably bpf_tcp_req_attrs would look like this?
struct bpf_tcp_req_attrs {
        u32 rcv_tsval;
        u32 rcv_tsecr;
        u16 mss;
        u8 rcv_scale;
        u8 snd_scale;
        bool ecn_ok;
        bool wscale_ok;
        bool sack_ok;
        bool tstamp_ok;
        bool usec_ts;
} __packed;
Or do you prefer u8 over bool and __packed?
struct bpf_tcp_req_attrs {
        u32 rcv_tsval;
        u32 rcv_tsecr;
        u16 mss;
        u8 rcv_scale;
        u8 snd_scale;
        u8 ecn_ok;
        u8 wscale_ok;
        u8 sack_ok;
        u8 tstamp_ok;
        u8 usec_ts;
}
* Re: [PATCH v5 bpf-next 6/6] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
2023-12-14 7:49 ` Kuniyuki Iwashima
@ 2023-12-14 12:26 ` Kuniyuki Iwashima
0 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-14 12:26 UTC (permalink / raw)
To: kuniyu
Cc: andrii, ast, bpf, daniel, dxu, edumazet, kuni1840, martin.lau,
netdev, yonghong.song
From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Thu, 14 Dec 2023 16:49:55 +0900
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Wed, 13 Dec 2023 22:46:11 -0800
> [ ... ]
>
> So probably bpf_tcp_req_attrs would look like this?
>
> struct bpf_tcp_req_attrs {
>         u32 rcv_tsval;
>         u32 rcv_tsecr;
>         u16 mss;
>         u8 rcv_scale;
>         u8 snd_scale;
>         bool ecn_ok;
>         bool wscale_ok;
>         bool sack_ok;
>         bool tstamp_ok;
>         bool usec_ts;
> } __packed;
>
> Or do you prefer u8 over bool and __packed?
Ah, bool and __packed will require BPF_CORE_{READ,WRITE}_BITFIELD().
I'll use the following struct.
Thank you!
>
> struct bpf_tcp_req_attrs {
>         u32 rcv_tsval;
>         u32 rcv_tsecr;
>         u16 mss;
>         u8 rcv_scale;
>         u8 snd_scale;
>         u8 ecn_ok;
>         u8 wscale_ok;
>         u8 sack_ok;
>         u8 tstamp_ok;
>         u8 usec_ts;
> }
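The tc prog side would then be something like this (a sketch; assuming the
kfunc keeps taking the skb, the listener sk, the attrs, and their size, with
mss, snd_scale, rcv_tsval, and rcv_tsecr standing for values decoded from the
cookie):

        struct bpf_tcp_req_attrs attrs = {
                .mss = mss,
                .wscale_ok = 1,
                .snd_scale = snd_scale,
                .tstamp_ok = 1,
                .rcv_tsval = rcv_tsval,
                .rcv_tsecr = rcv_tsecr,
        };

        err = bpf_sk_assign_tcp_reqsk(skb, sk, &attrs, sizeof(attrs));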