netdev.vger.kernel.org archive mirror
* [PATCH v4 bpf-next 0/3] bpf: tcp: Support arbitrary SYN Cookie at TC.
@ 2023-12-05  1:34 Kuniyuki Iwashima
  2023-12-05  1:34 ` [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check() Kuniyuki Iwashima
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-05  1:34 UTC (permalink / raw)
  To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Under SYN Flood, the TCP stack generates a SYN Cookie so that it can
remain stateless for the connection request until a valid ACK is
returned for the SYN+ACK.

The cookie contains two kinds of host-specific bits, a timestamp and
secrets, so it can only be validated by the generator.  This means a SYN
Cookie consumes network resources between the client and the server;
intermediate nodes must remember which node to route the ACK for the
cookie back to.

SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
backend server and completes another 3WHS.  However, since the server's
ISN differs from the cookie, the proxy must manage the ISN mappings and
fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
node goes down, all the connections through it are terminated.  Keeping
state at the proxy is painful from that perspective.

At AWS, we use a dirty hack to build a truly stateless SYN Proxy at scale.
Our SYN Proxy consists of the front proxy layer and the backend kernel
module.  (See slides of LPC2023 [0], p37 - p48)

The cookie that SYN Proxy generates differs from the kernel's cookie in
that it contains a secret (called rolling salt) that is (i) shared by
all the proxy nodes so that any node can validate the ACK and (ii)
updated periodically so that old cookies cannot be validated and no
timestamp needs to be encoded in the cookie.  Also, the ISN encodes
WScale, SACK, and ECN instead of placing them in the TS val.  This is
to avoid sacrificing any connection quality, as some customers turn off
the TCP timestamps option due to a retro CVE.
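
For illustration, a cookie of this style could be generated roughly as
follows (a minimal sketch; the hash function, the salt handling, and the
exact bit layout are simplified assumptions, not our production code):

    #include <stdint.h>

    /* Stand-in for a keyed hash such as siphash; assumption for the sketch. */
    static uint32_t keyed_hash(uint64_t salt, uint32_t saddr, uint32_t daddr,
                               uint16_t sport, uint16_t dport)
    {
        uint64_t h = salt ^ ((uint64_t)saddr << 32 | daddr) ^
                     ((uint32_t)sport << 16 | dport);

        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL; /* mix step, not cryptographic */
        h ^= h >> 33;
        return (uint32_t)h;
    }

    /* Illustrative bit layout, cf. the selftest in patch 3:
     * bits 31-8: hash, bits 7-6: MSS index, bit 5: ECN,
     * bit 4: SACK, bits 3-0: WScale.
     */
    static uint32_t cookie_isn(uint64_t rolling_salt, /* shared, rotated */
                               uint32_t saddr, uint32_t daddr,
                               uint16_t sport, uint16_t dport,
                               uint8_t mssind, uint8_t wscale,
                               int sack_ok, int ecn_ok)
    {
        uint32_t hash = keyed_hash(rolling_salt, saddr, daddr, sport, dport);

        return (hash & ~(uint32_t)0xff) | (uint32_t)mssind << 6 |
               (uint32_t)!!ecn_ok << 5 | (uint32_t)!!sack_ok << 4 |
               (wscale & 0xf);
    }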

After 3WHS, the proxy restores SYN, encapsulates ACK into SYN, and forwards
the TCP-in-TCP packet to the backend server.  Our kernel module works at
Netfilter input/output hooks and first feeds SYN to the TCP stack to
initiate 3WHS.  When the module is triggered for SYN+ACK, it looks up the
corresponding request socket and overwrites tcp_rsk(req)->snt_isn with the
proxy's cookie.  Then, the module can complete 3WHS with the original ACK
as is.

This way, our SYN Proxy neither manages ISN mappings nor waits for
SYN+ACK from the backend, and thus can remain stateless.  It works very
well for high-bandwidth, multi-Tbps services, but we are looking for a
way to drop the dirty hack and further optimise the sequence.

If we could validate an arbitrary SYN Cookie on the backend server with
BPF, the proxy would not need to restore the SYN nor pass it along.
After validating the ACK, the proxy node just needs to forward it, and
then the server can do a lightweight validation (e.g. check that the
ACK came from a proxy node) and create a connection from the ACK.
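
On the server, that lightweight validation could be as simple as a map
lookup on the source address in the same TC program (sketch; the map
name and layout are assumptions, not part of this series):

    /* Addresses of trusted proxy nodes, populated from user space. */
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __type(key, __be32);
        __type(value, __u8);
        __uint(max_entries, 128);
    } proxy_nodes SEC(".maps");

    static bool from_proxy(const struct iphdr *iph)
    {
        return bpf_map_lookup_elem(&proxy_nodes, &iph->saddr);
    }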

This series adds a new kfunc available on TC to create a reqsk and
configure it based on the argument populated from SYN Cookie.

    Usage:

    struct tcp_cookie_attributes attr = {
        .tcp_opt = {
            .mss_clamp = mss,
            .wscale_ok = wscale_ok,
            .rcv_scale = recv_scale, /* Server's WScale < 15 */
            .snd_scale = send_scale, /* Client's WScale < 15 */
            .tstamp_ok = tstamp_ok,
            .sack_ok = sack_ok,
        },
        .ecn_ok = ecn_ok,
        .usec_ts_ok = usec_ts_ok,
    };

    skc = bpf_skc_lookup_tcp(...);
    sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
    bpf_sk_assign_tcp_reqsk(skb, sk, &attr, sizeof(attr));
    bpf_sk_release(skc);
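
Putting it together, the ACK path of a TC program using the kfunc looks
roughly like this (condensed; parse_headers_and_cookie() is a
placeholder for the header/option parsing and cookie validation that
the selftest in patch 3 implements in full):

    SEC("tc")
    int syncookie_ack(struct __sk_buff *skb)
    {
        struct tcp_cookie_attributes attr = {};
        struct bpf_sock_tuple tuple = {};
        struct bpf_sock *skc;
        struct sock *sk;
        int ret = TC_ACT_OK;

        /* Fill tuple/attr from the ACK; non-zero means a bad cookie. */
        if (parse_headers_and_cookie(skb, &tuple, &attr))
            return TC_ACT_SHOT;

        skc = bpf_skc_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                 BPF_F_CURRENT_NETNS, 0);
        if (!skc)
            return TC_ACT_OK;

        sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
        if (!sk || skc->state != TCP_LISTEN ||
            bpf_sk_assign_tcp_reqsk(skb, sk, &attr, sizeof(attr)))
            ret = TC_ACT_SHOT;

        bpf_sk_release(skc);

        /* On success, cookie_v[46]_check() consumes the reqsk later. */
        return ret;
    }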

[0]: https://lpc.events/event/17/contributions/1645/attachments/1350/2701/SYN_Proxy_at_Scale_with_BPF.pdf


Changes:
  v4:
    * Patch 1 & 2
      * s/CONFIG_SYN_COOKIE/CONFIG_SYN_COOKIES/
    * Patch 1
      * Don't set rcv_wscale for BPF SYN Cookie case.
    * Patch 2
      * Add test for tcp_opt.{unused,rcv_wscale} in kfunc
      * Modify skb_steal_sock() to avoid resetting skb->sk
      * Support SO_REUSEPORT lookup
    * Patch 3
      * Add CONFIG_SYN_COOKIES to Kconfig for CI
      * Define BPF_F_CURRENT_NETNS

  v3: https://lore.kernel.org/netdev/20231121184245.69569-1-kuniyu@amazon.com/
    * Guard kfunc and req->syncookie part in inet6?_steal_sock() with
      CONFIG_SYN_COOKIE

  v2: https://lore.kernel.org/netdev/20231120222341.54776-1-kuniyu@amazon.com/
    * Drop SOCK_OPS and move SYN Cookie validation logic to TC with kfunc.
    * Add cleanup patches to reduce discrepancy between cookie_v[46]_check()

  v1: https://lore.kernel.org/bpf/20231013220433.70792-1-kuniyu@amazon.com/


Kuniyuki Iwashima (3):
  bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
  bpf: tcp: Support arbitrary SYN Cookie.
  selftest: bpf: Test bpf_sk_assign_tcp_reqsk().

 include/net/request_sock.h                    |  35 ++
 include/net/sock.h                            |  25 -
 include/net/tcp.h                             |  27 +
 net/core/filter.c                             | 102 +++-
 net/core/sock.c                               |  14 +-
 net/ipv4/syncookies.c                         |  62 +-
 net/ipv6/syncookies.c                         |   9 +-
 tools/testing/selftests/bpf/bpf_kfuncs.h      |  10 +
 tools/testing/selftests/bpf/config            |   1 +
 .../bpf/prog_tests/tcp_custom_syncookie.c     | 163 +++++
 .../selftests/bpf/progs/test_siphash.h        |  64 ++
 .../bpf/progs/test_tcp_custom_syncookie.c     | 573 ++++++++++++++++++
 .../bpf/progs/test_tcp_custom_syncookie.h     | 162 +++++
 13 files changed, 1214 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_siphash.h
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h

-- 
2.30.2



* [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
  2023-12-05  1:34 [PATCH v4 bpf-next 0/3] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
@ 2023-12-05  1:34 ` Kuniyuki Iwashima
  2023-12-06  0:19   ` Martin KaFai Lau
  2023-12-05  1:34 ` [PATCH v4 bpf-next 2/3] bpf: tcp: Support arbitrary SYN Cookie Kuniyuki Iwashima
  2023-12-05  1:34 ` [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
  2 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-05  1:34 UTC (permalink / raw)
  To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

We will support arbitrary SYN Cookie with BPF in the following
patch.

If the BPF prog validates the ACK and the kfunc allocates a reqsk, it
will be carried to cookie_v[46]_check() as skb->sk.  Then, we call
cookie_bpf_check() to validate the configuration passed to the kfunc.

First, we clear skb->sk, skb->destructor, and req->rsk_listener, which
is necessary because we do not hold a refcnt on the reqsk or the
listener.  See the following patch for details.

Then, we parse the TCP options to check whether tstamp_ok is consistent
with the options on the wire.  If it is not, we increment
LINUX_MIB_SYNCOOKIESFAILED and send RST.  If tstamp_ok is valid, we
increment LINUX_MIB_SYNCOOKIESRECV.

After that, we check sack_ok and wscale_ok against the corresponding
sysctl knobs.  If the test fails, we send RST but do not increment
LINUX_MIB_SYNCOOKIESFAILED.  This behaviour is the same as the non-BPF
cookie handling in cookie_tcp_check().

Finally, we finish initialisation for the remaining fields with
cookie_tcp_reqsk_init().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/tcp.h     | 21 +++++++++++++++
 net/ipv4/syncookies.c | 62 +++++++++++++++++++++++++++++++++++++++++--
 net/ipv6/syncookies.c |  9 +++++--
 3 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 973555cb1d3f..842791997f30 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -590,6 +590,27 @@ static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
 		dst_feature(dst, RTAX_FEATURE_ECN);
 }
 
+#if IS_ENABLED(CONFIG_BPF)
+static inline bool cookie_bpf_ok(struct sk_buff *skb)
+{
+	return skb->sk;
+}
+
+struct request_sock *cookie_bpf_check(struct net *net, struct sock *sk,
+				      struct sk_buff *skb);
+#else
+static inline bool cookie_bpf_ok(struct sk_buff *skb)
+{
+	return false;
+}
+
+static inline struct request_sock *cookie_bpf_check(struct net *net, struct sock *sk,
+						    struct sk_buff *skb)
+{
+	return NULL;
+}
+#endif
+
 /* From net/ipv6/syncookies.c */
 int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th);
 struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 61f1c96cfe63..0f9c3aed2014 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -304,6 +304,59 @@ static int cookie_tcp_reqsk_init(struct sock *sk, struct sk_buff *skb,
 	return 0;
 }
 
+#if IS_ENABLED(CONFIG_BPF)
+struct request_sock *cookie_bpf_check(struct net *net, struct sock *sk,
+				      struct sk_buff *skb)
+{
+	struct request_sock *req = inet_reqsk(skb->sk);
+	struct inet_request_sock *ireq = inet_rsk(req);
+	struct tcp_request_sock *treq = tcp_rsk(req);
+	struct tcp_options_received tcp_opt;
+	int ret;
+
+	skb->sk = NULL;
+	skb->destructor = NULL;
+	req->rsk_listener = NULL;
+
+	memset(&tcp_opt, 0, sizeof(tcp_opt));
+	tcp_parse_options(net, skb, &tcp_opt, 0, NULL);
+
+	if (ireq->tstamp_ok ^ tcp_opt.saw_tstamp) {
+		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
+		goto reset;
+	}
+
+	__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
+
+	if (ireq->tstamp_ok) {
+		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
+			goto reset;
+
+		req->ts_recent = tcp_opt.rcv_tsval;
+		treq->ts_off = tcp_opt.rcv_tsecr - tcp_ns_to_ts(false, tcp_clock_ns());
+	}
+
+	if (ireq->sack_ok && !READ_ONCE(net->ipv4.sysctl_tcp_sack))
+		goto reset;
+
+	if (ireq->wscale_ok && !READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
+		goto reset;
+
+	ret = cookie_tcp_reqsk_init(sk, skb, req);
+	if (ret) {
+		reqsk_free(req);
+		req = NULL;
+	}
+
+	return req;
+
+reset:
+	reqsk_free(req);
+	return ERR_PTR(-EINVAL);
+}
+EXPORT_SYMBOL_GPL(cookie_bpf_check);
+#endif
+
 struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops,
 					    struct sock *sk, struct sk_buff *skb,
 					    struct tcp_options_received *tcp_opt,
@@ -404,7 +457,11 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	    !th->ack || th->rst)
 		goto out;
 
-	req = cookie_tcp_check(net, sk, skb);
+	if (cookie_bpf_ok(skb))
+		req = cookie_bpf_check(net, sk, skb);
+	else
+		req = cookie_tcp_check(net, sk, skb);
+
 	if (IS_ERR(req))
 		goto out;
 	if (!req)
@@ -454,7 +511,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 				  ireq->wscale_ok, &rcv_wscale,
 				  dst_metric(&rt->dst, RTAX_INITRWND));
 
-	ireq->rcv_wscale  = rcv_wscale;
+	if (!req->syncookie)
+		ireq->rcv_wscale  = rcv_wscale;
 	ireq->ecn_ok &= cookie_ecn_ok(net, &rt->dst);
 
 	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst);
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index c8d2ca27220c..24224138ba1a 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -182,7 +182,11 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	    !th->ack || th->rst)
 		goto out;
 
-	req = cookie_tcp_check(net, sk, skb);
+	if (cookie_bpf_ok(skb))
+		req = cookie_bpf_check(net, sk, skb);
+	else
+		req = cookie_tcp_check(net, sk, skb);
+
 	if (IS_ERR(req))
 		goto out;
 	if (!req)
@@ -247,7 +251,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 				  ireq->wscale_ok, &rcv_wscale,
 				  dst_metric(dst, RTAX_INITRWND));
 
-	ireq->rcv_wscale = rcv_wscale;
+	if (!req->syncookie)
+		ireq->rcv_wscale = rcv_wscale;
 	ireq->ecn_ok &= cookie_ecn_ok(net, dst);
 
 	ret = tcp_get_cookie_sock(sk, skb, req, dst);
-- 
2.30.2



* [PATCH v4 bpf-next 2/3] bpf: tcp: Support arbitrary SYN Cookie.
  2023-12-05  1:34 [PATCH v4 bpf-next 0/3] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
  2023-12-05  1:34 ` [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check() Kuniyuki Iwashima
@ 2023-12-05  1:34 ` Kuniyuki Iwashima
  2023-12-06  1:20   ` Martin KaFai Lau
  2023-12-05  1:34 ` [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
  2 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-05  1:34 UTC (permalink / raw)
  To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

This patch adds a new kfunc available at the TC hook to support an
arbitrary SYN Cookie.

The basic usage is as follows:

    struct tcp_cookie_attributes attr = {
        .tcp_opt = {
            .mss_clamp = mss,
            .wscale_ok = wscale_ok,
            .rcv_scale = recv_scale, /* Server's WScale < 15 */
            .snd_scale = send_scale, /* Client's WScale < 15 */
            .tstamp_ok = tstamp_ok,
            .sack_ok = sack_ok,
        },
        .ecn_ok = ecn_ok,
        .usec_ts_ok = usec_ts_ok,
    };

    skc = bpf_skc_lookup_tcp(...);
    sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
    bpf_sk_assign_tcp_reqsk(skb, sk, &attr, sizeof(attr));
    bpf_sk_release(skc);

bpf_sk_assign_tcp_reqsk() takes an skb, a listener sk, and a struct
tcp_cookie_attributes, allocates a reqsk, and configures it.  Then,
bpf_sk_assign_tcp_reqsk() links the reqsk with the skb and the listener.

The notable thing here is that we do not hold a refcnt on either the
reqsk or the listener.  To differentiate that, we mark reqsk->syncookie,
which is only used on the TX path for now.  So, if reqsk->syncookie is 1
on RX, it means that the reqsk was allocated by the kfunc.

When the skb is freed, sock_pfree() checks whether reqsk->syncookie is
1, and in that case, we set reqsk->rsk_listener to NULL before calling
reqsk_free(), as the reqsk does not hold a refcnt on the listener.

When the TCP stack looks up a socket from the skb, we return
inet_reqsk(skb->sk)->rsk_listener in skb_steal_sock().  However,
we do not clear skb->sk and skb->destructor so that we can carry
the reqsk to cookie_v[46]_check().

The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock()
after creating a full sk.

Note that we can use the unused bits in struct tcp_options_received and
extend struct tcp_cookie_attributes in the future when we add a new
attribute that is determined during the 3WHS.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/request_sock.h |  35 +++++++++++++
 include/net/sock.h         |  25 ---------
 include/net/tcp.h          |   6 +++
 net/core/filter.c          | 102 ++++++++++++++++++++++++++++++++++++-
 net/core/sock.c            |  14 ++++-
 5 files changed, 153 insertions(+), 29 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 144c39db9898..2efffe2c05d0 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -83,6 +83,41 @@ static inline struct sock *req_to_sk(struct request_sock *req)
 	return (struct sock *)req;
 }
 
+/**
+ * skb_steal_sock - steal a socket from an sk_buff
+ * @skb: sk_buff to steal the socket from
+ * @refcounted: is set to true if the socket is reference-counted
+ * @prefetched: is set to true if the socket was assigned from bpf
+ */
+static inline struct sock *
+skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
+{
+	struct sock *sk = skb->sk;
+
+	if (!skb->sk) {
+		*prefetched = false;
+		*refcounted = false;
+		return NULL;
+	}
+
+	*prefetched = skb_sk_is_prefetched(skb);
+	if (*prefetched) {
+#if IS_ENABLED(CONFIG_SYN_COOKIES)
+		if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
+			*refcounted = false;
+			return inet_reqsk(sk)->rsk_listener;
+		}
+#endif
+		*refcounted = sk_is_refcounted(sk);
+	} else {
+		*refcounted = true;
+	}
+
+	skb->destructor = NULL;
+	skb->sk = NULL;
+	return sk;
+}
+
 static inline struct request_sock *
 reqsk_alloc(const struct request_sock_ops *ops, struct sock *sk_listener,
 	    bool attach_listener)
diff --git a/include/net/sock.h b/include/net/sock.h
index 1d6931caf0c3..0ed77af38000 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2838,31 +2838,6 @@ sk_is_refcounted(struct sock *sk)
 	return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
 }
 
-/**
- * skb_steal_sock - steal a socket from an sk_buff
- * @skb: sk_buff to steal the socket from
- * @refcounted: is set to true if the socket is reference-counted
- * @prefetched: is set to true if the socket was assigned from bpf
- */
-static inline struct sock *
-skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
-{
-	if (skb->sk) {
-		struct sock *sk = skb->sk;
-
-		*refcounted = true;
-		*prefetched = skb_sk_is_prefetched(skb);
-		if (*prefetched)
-			*refcounted = sk_is_refcounted(sk);
-		skb->destructor = NULL;
-		skb->sk = NULL;
-		return sk;
-	}
-	*prefetched = false;
-	*refcounted = false;
-	return NULL;
-}
-
 /* Checks if this SKB belongs to an HW offloaded socket
  * and whether any SW fallbacks are required based on dev.
  * Check decrypted mark in case skb_orphan() cleared socket.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 842791997f30..373afcfaefa6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -591,6 +591,12 @@ static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
 }
 
 #if IS_ENABLED(CONFIG_BPF)
+struct tcp_cookie_attributes {
+	struct tcp_options_received tcp_opt;
+	bool ecn_ok;
+	bool usec_ts_ok;
+} __packed;
+
 static inline bool cookie_bpf_ok(struct sk_buff *skb)
 {
 	return skb->sk;
diff --git a/net/core/filter.c b/net/core/filter.c
index 0adaa4afa35f..a43f7627c5fd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11816,6 +11816,94 @@ __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
 
 	return 0;
 }
+
+__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
+					struct tcp_cookie_attributes *attr,
+					int attr__sz)
+{
+#if IS_ENABLED(CONFIG_SYN_COOKIES)
+	const struct request_sock_ops *ops;
+	struct inet_request_sock *ireq;
+	struct tcp_request_sock *treq;
+	struct request_sock *req;
+	__u16 min_mss;
+
+	if (attr__sz != sizeof(*attr) || attr->tcp_opt.unused)
+		return -EINVAL;
+
+	if (!sk)
+		return -EINVAL;
+
+	if (!skb_at_tc_ingress(skb))
+		return -EINVAL;
+
+	if (dev_net(skb->dev) != sock_net(sk))
+		return -ENETUNREACH;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		ops = &tcp_request_sock_ops;
+		min_mss = 536;
+		break;
+#if IS_BUILTIN(CONFIG_IPV6)
+	case htons(ETH_P_IPV6):
+		ops = &tcp6_request_sock_ops;
+		min_mss = IPV6_MIN_MTU - 60;
+		break;
+#endif
+	default:
+		return -EINVAL;
+	}
+
+	if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN)
+		return -EINVAL;
+
+	if (attr->tcp_opt.mss_clamp < min_mss) {
+		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
+		return -EINVAL;
+	}
+
+	if (attr->tcp_opt.wscale_ok &&
+	    (attr->tcp_opt.snd_wscale > TCP_MAX_WSCALE ||
+	     attr->tcp_opt.rcv_wscale > TCP_MAX_WSCALE)) {
+		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
+		return -EINVAL;
+	}
+
+	if (sk_is_mptcp(sk))
+		req = mptcp_subflow_reqsk_alloc(ops, sk, false);
+	else
+		req = inet_reqsk_alloc(ops, sk, false);
+
+	if (!req)
+		return -ENOMEM;
+
+	ireq = inet_rsk(req);
+	treq = tcp_rsk(req);
+
+	req->syncookie = 1;
+	req->rsk_listener = sk;
+	req->mss = attr->tcp_opt.mss_clamp;
+
+	ireq->snd_wscale = attr->tcp_opt.snd_wscale;
+	ireq->rcv_wscale = attr->tcp_opt.rcv_wscale;
+	ireq->wscale_ok = attr->tcp_opt.wscale_ok;
+	ireq->tstamp_ok	= attr->tcp_opt.tstamp_ok;
+	ireq->sack_ok = attr->tcp_opt.sack_ok;
+	ireq->ecn_ok = attr->ecn_ok;
+
+	treq->req_usec_ts = attr->usec_ts_ok;
+
+	skb_orphan(skb);
+	skb->sk = req_to_sk(req);
+	skb->destructor = sock_pfree;
+
+	return 0;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
 __bpf_kfunc_end_defs();
 
 int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags,
@@ -11844,6 +11932,10 @@ BTF_SET8_START(bpf_kfunc_check_set_sock_addr)
 BTF_ID_FLAGS(func, bpf_sock_addr_set_sun_path)
 BTF_SET8_END(bpf_kfunc_check_set_sock_addr)
 
+BTF_SET8_START(bpf_kfunc_check_set_tcp_reqsk)
+BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk)
+BTF_SET8_END(bpf_kfunc_check_set_tcp_reqsk)
+
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
 	.owner = THIS_MODULE,
 	.set = &bpf_kfunc_check_set_skb,
@@ -11859,6 +11951,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_addr = {
 	.set = &bpf_kfunc_check_set_sock_addr,
 };
 
+static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_tcp_reqsk,
+};
+
 static int __init bpf_kfunc_init(void)
 {
 	int ret;
@@ -11874,8 +11971,9 @@ static int __init bpf_kfunc_init(void)
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_NETFILTER, &bpf_kfunc_set_skb);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
-	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
-						&bpf_kfunc_set_sock_addr);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+					       &bpf_kfunc_set_sock_addr);
+	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
 }
 late_initcall(bpf_kfunc_init);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index fef349dd72fa..998950e97dfe 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2579,8 +2579,18 @@ EXPORT_SYMBOL(sock_efree);
 #ifdef CONFIG_INET
 void sock_pfree(struct sk_buff *skb)
 {
-	if (sk_is_refcounted(skb->sk))
-		sock_gen_put(skb->sk);
+	struct sock *sk = skb->sk;
+
+	if (!sk_is_refcounted(sk))
+		return;
+
+	if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
+		inet_reqsk(sk)->rsk_listener = NULL;
+		reqsk_free(inet_reqsk(sk));
+		return;
+	}
+
+	sock_gen_put(sk);
 }
 EXPORT_SYMBOL(sock_pfree);
 #endif /* CONFIG_INET */
-- 
2.30.2



* [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
  2023-12-05  1:34 [PATCH v4 bpf-next 0/3] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
  2023-12-05  1:34 ` [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check() Kuniyuki Iwashima
  2023-12-05  1:34 ` [PATCH v4 bpf-next 2/3] bpf: tcp: Support arbitrary SYN Cookie Kuniyuki Iwashima
@ 2023-12-05  1:34 ` Kuniyuki Iwashima
  2023-12-05  2:13   ` Alexei Starovoitov
  2023-12-06  6:39   ` Martin KaFai Lau
  2 siblings, 2 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-05  1:34 UTC (permalink / raw)
  To: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

This commit adds a sample selftest to demonstrate how we can use
bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.

The test creates IPv4/IPv6 x TCP/MPTCP connections and transfers
messages over them on lo with the BPF tc prog attached.

The tc prog processes the SYN and returns a SYN+ACK with the following
ISN and TS.  In a real use case, this part would be done by other
hosts.

        MSB                                   LSB
  ISN:  | 31 ... 8 | 7 6 |   5 |    4 | 3 2 1 0 |
        |   Hash_1 | MSS | ECN | SACK |  WScale |

  TS:   | 31 ... 8 |          7 ... 0           |
        |   Random |           Hash_2           |

  WScale in SYN is reused in SYN+ACK.

The client returns an ACK, and the tc prog recalculates the ISN and TS
from the ACK and validates the SYN Cookie.

If it's valid, the prog calls the kfunc to allocate a reqsk for the skb
and configures it based on the argument created from the SYN Cookie.

Later, the reqsk will be processed in cookie_v[46]_check() to create
a connection.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 tools/testing/selftests/bpf/bpf_kfuncs.h      |  10 +
 tools/testing/selftests/bpf/config            |   1 +
 .../bpf/prog_tests/tcp_custom_syncookie.c     | 163 +++++
 .../selftests/bpf/progs/test_siphash.h        |  64 ++
 .../bpf/progs/test_tcp_custom_syncookie.c     | 573 ++++++++++++++++++
 .../bpf/progs/test_tcp_custom_syncookie.h     | 162 +++++
 6 files changed, 973 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_siphash.h
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h

diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
index b4e78c1eb37b..e2a1651851cf 100644
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
@@ -51,6 +51,16 @@ extern int bpf_dynptr_clone(const struct bpf_dynptr *ptr, struct bpf_dynptr *clo
 extern int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
 				      const __u8 *sun_path, __u32 sun_path__sz) __ksym;
 
+/* Description
+ *  Allocate and configure a reqsk and link it with a listener and skb.
+ * Returns
+ *  Error code
+ */
+struct sock;
+struct tcp_cookie_attributes;
+extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
+				   struct tcp_cookie_attributes *attr, int attr__sz) __ksym;
+
 void *bpf_cast_to_kern_ctx(void *) __ksym;
 
 void *bpf_rdonly_cast(void *obj, __u32 btf_id) __ksym;
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c125c441abc7..01f241ea2c67 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -81,6 +81,7 @@ CONFIG_NF_NAT=y
 CONFIG_RC_CORE=y
 CONFIG_SECURITY=y
 CONFIG_SECURITYFS=y
+CONFIG_SYN_COOKIES=y
 CONFIG_TEST_BPF=m
 CONFIG_USERFAULTFD=y
 CONFIG_VSOCKETS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
new file mode 100644
index 000000000000..fefb52d8222c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_custom_syncookie.c
@@ -0,0 +1,163 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdlib.h>
+#include <net/if.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+#include "test_tcp_custom_syncookie.skel.h"
+
+#ifndef IPPROTO_MPTCP
+#define IPPROTO_MPTCP 262
+#endif
+
+static struct test_tcp_custom_syncookie_case {
+	int family, type, protocol;
+	char addr[16];
+	char name[10];
+} test_cases[] = {
+	{
+		.name = "IPv4 TCP  ",
+		.family = AF_INET,
+		.type = SOCK_STREAM,
+		.addr = "127.0.0.1",
+	},
+	{
+		.name = "IPv6 TCP  ",
+		.family = AF_INET6,
+		.type = SOCK_STREAM,
+		.addr = "::1",
+	},
+	{
+		.name = "IPv4 MPTCP",
+		.family = AF_INET,
+		.type = SOCK_STREAM,
+		.protocol = IPPROTO_MPTCP,
+		.addr = "127.0.0.1",
+	},
+	{
+		.name = "IPv6 MPTCP",
+		.family = AF_INET6,
+		.type = SOCK_STREAM,
+		.protocol = IPPROTO_MPTCP,
+		.addr = "::1",
+	},
+};
+
+static int setup_netns(void)
+{
+	if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link set dev lo up"), "ip"))
+		goto err;
+
+	if (!ASSERT_OK(write_sysctl("/proc/sys/net/ipv4/tcp_ecn", "1"),
+		       "write_sysctl"))
+		goto err;
+
+	return 0;
+err:
+	return -1;
+}
+
+static int setup_tc(struct test_tcp_custom_syncookie *skel)
+{
+	LIBBPF_OPTS(bpf_tc_hook, qdisc_lo, .attach_point = BPF_TC_INGRESS);
+	LIBBPF_OPTS(bpf_tc_opts, tc_attach,
+		    .prog_fd = bpf_program__fd(skel->progs.tcp_custom_syncookie));
+
+	qdisc_lo.ifindex = if_nametoindex("lo");
+	if (!ASSERT_OK(bpf_tc_hook_create(&qdisc_lo), "qdisc add dev lo clsact"))
+		goto err;
+
+	if (!ASSERT_OK(bpf_tc_attach(&qdisc_lo, &tc_attach),
+		       "filter add dev lo ingress"))
+		goto err;
+
+	return 0;
+err:
+	return -1;
+}
+
+#define msg "Hello World"
+#define msglen 11
+
+static void transfer_message(int sender, int receiver)
+{
+	char buf[msglen];
+	int ret;
+
+	ret = send(sender, msg, msglen, 0);
+	if (!ASSERT_EQ(ret, msglen, "send"))
+		return;
+
+	memset(buf, 0, sizeof(buf));
+
+	ret = recv(receiver, buf, msglen, 0);
+	if (!ASSERT_EQ(ret, msglen, "recv"))
+		return;
+
+	ret = strncmp(buf, msg, msglen);
+	if (!ASSERT_EQ(ret, 0, "strncmp"))
+		return;
+}
+
+static void create_connection(struct test_tcp_custom_syncookie_case *test_case)
+{
+	int server, client, child;
+
+	if (test_case->protocol == IPPROTO_MPTCP)
+		server = start_mptcp_server(test_case->family, test_case->addr, 0, 0);
+	else
+		server = start_server(test_case->family, test_case->type, test_case->addr, 0, 0);
+	if (!ASSERT_NEQ(server, -1, "start_server"))
+		return;
+
+	client = connect_to_fd(server, 0);
+	if (!ASSERT_NEQ(client, -1, "connect_to_fd"))
+		goto close_server;
+
+	child = accept(server, NULL, 0);
+	if (!ASSERT_NEQ(child, -1, "accept"))
+		goto close_client;
+
+	transfer_message(client, child);
+	transfer_message(child, client);
+
+	close(child);
+close_client:
+	close(client);
+close_server:
+	close(server);
+}
+
+void test_tcp_custom_syncookie(void)
+{
+	struct test_tcp_custom_syncookie *skel;
+	int i;
+
+	if (setup_netns())
+		return;
+
+	skel = test_tcp_custom_syncookie__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	if (setup_tc(skel))
+		goto destroy_skel;
+
+	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
+		test__start_subtest(test_cases[i].name);
+		create_connection(&test_cases[i]);
+	}
+
+destroy_skel:
+	system("tc qdisc del dev lo clsact");
+
+	test_tcp_custom_syncookie__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_siphash.h b/tools/testing/selftests/bpf/progs/test_siphash.h
new file mode 100644
index 000000000000..25334079f8be
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_siphash.h
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#ifndef _TEST_SIPHASH_H
+#define _TEST_SIPHASH_H
+
+/* include/linux/bitops.h */
+static __always_inline __u64 rol64(__u64 word, unsigned int shift)
+{
+	return (word << (shift & 63)) | (word >> ((-shift) & 63));
+}
+
+/* include/linux/siphash.h */
+#define SIPHASH_PERMUTATION(a, b, c, d) ( \
+	(a) += (b), (b) = rol64((b), 13), (b) ^= (a), (a) = rol64((a), 32), \
+	(c) += (d), (d) = rol64((d), 16), (d) ^= (c), \
+	(a) += (d), (d) = rol64((d), 21), (d) ^= (a), \
+	(c) += (b), (b) = rol64((b), 17), (b) ^= (c), (c) = rol64((c), 32))
+
+#define SIPHASH_CONST_0 0x736f6d6570736575ULL
+#define SIPHASH_CONST_1 0x646f72616e646f6dULL
+#define SIPHASH_CONST_2 0x6c7967656e657261ULL
+#define SIPHASH_CONST_3 0x7465646279746573ULL
+
+/* lib/siphash.c */
+#define SIPROUND SIPHASH_PERMUTATION(v0, v1, v2, v3)
+
+#define PREAMBLE(len) \
+	u64 v0 = SIPHASH_CONST_0; \
+	u64 v1 = SIPHASH_CONST_1; \
+	u64 v2 = SIPHASH_CONST_2; \
+	u64 v3 = SIPHASH_CONST_3; \
+	u64 b = ((u64)(len)) << 56; \
+	v3 ^= key->key[1]; \
+	v2 ^= key->key[0]; \
+	v1 ^= key->key[1]; \
+	v0 ^= key->key[0];
+
+#define POSTAMBLE \
+	v3 ^= b; \
+	SIPROUND; \
+	SIPROUND; \
+	v0 ^= b; \
+	v2 ^= 0xff; \
+	SIPROUND; \
+	SIPROUND; \
+	SIPROUND; \
+	SIPROUND; \
+	return (v0 ^ v1) ^ (v2 ^ v3);
+
+static inline u64 siphash_2u64(const u64 first, const u64 second, const siphash_key_t *key)
+{
+	PREAMBLE(16)
+	v3 ^= first;
+	SIPROUND;
+	SIPROUND;
+	v0 ^= first;
+	v3 ^= second;
+	SIPROUND;
+	SIPROUND;
+	v0 ^= second;
+	POSTAMBLE
+}
+#endif
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
new file mode 100644
index 000000000000..9a6e2fc1379c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
@@ -0,0 +1,573 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "bpf_kfuncs.h"
+#include "test_siphash.h"
+#include "test_tcp_custom_syncookie.h"
+
+/* Hash is calculated for each client and split into ISN and TS.
+ *
+ *       MSB                                   LSB
+ * ISN:  | 31 ... 8 | 7 6 |   5 |    4 | 3 2 1 0 |
+ *       |   Hash_1 | MSS | ECN | SACK |  WScale |
+ *
+ * TS:   | 31 ... 8 |          7 ... 0           |
+ *       |   Random |           Hash_2           |
+ */
+#define COOKIE_BITS	8
+#define COOKIE_MASK	(((__u32)1 << COOKIE_BITS) - 1)
+
+enum {
+	/* 0xf is invalid thus means that SYN did not have WScale. */
+	BPF_SYNCOOKIE_WSCALE_MASK	= (1 << 4) - 1,
+	BPF_SYNCOOKIE_SACK		= (1 << 4),
+	BPF_SYNCOOKIE_ECN		= (1 << 5),
+};
+
+#define MSS_LOCAL_IPV4	65495
+#define MSS_LOCAL_IPV6	65476
+
+const __u16 msstab4[] = {
+	536,
+	1300,
+	1460,
+	MSS_LOCAL_IPV4,
+};
+
+const __u16 msstab6[] = {
+	1280 - 60, /* IPV6_MIN_MTU - 60 */
+	1480 - 60,
+	9000 - 60,
+	MSS_LOCAL_IPV6,
+};
+
+static siphash_key_t test_key_siphash = {
+	{ 0x0706050403020100ULL, 0x0f0e0d0c0b0a0908ULL }
+};
+
+struct tcp_syncookie {
+	struct __sk_buff *skb;
+	void *data_end;
+	struct ethhdr *eth;
+	struct iphdr *ipv4;
+	struct ipv6hdr *ipv6;
+	struct tcphdr *tcp;
+	union {
+		char *ptr;
+		__be32 *ptr32;
+	};
+	struct tcp_cookie_attributes attr;
+	u32 cookie;
+	u64 first;
+};
+
+static __always_inline int tcp_load_headers(struct tcp_syncookie *ctx)
+{
+	ctx->data_end = (void *)(long)ctx->skb->data_end;
+	ctx->eth = (struct ethhdr *)(long)ctx->skb->data;
+
+	if (ctx->eth + 1 > ctx->data_end)
+		goto err;
+
+	switch (bpf_ntohs(ctx->eth->h_proto)) {
+	case ETH_P_IP:
+		ctx->ipv4 = (struct iphdr *)(ctx->eth + 1);
+
+		if (ctx->ipv4 + 1 > ctx->data_end)
+			goto err;
+
+		if (ctx->ipv4->ihl != sizeof(*ctx->ipv4) / 4)
+			goto err;
+
+		if (ctx->ipv4->version != 4)
+			goto err;
+
+		if (ctx->ipv4->protocol != IPPROTO_TCP)
+			goto err;
+
+		ctx->tcp = (struct tcphdr *)(ctx->ipv4 + 1);
+		break;
+	case ETH_P_IPV6:
+		ctx->ipv6 = (struct ipv6hdr *)(ctx->eth + 1);
+
+		if (ctx->ipv6 + 1 > ctx->data_end)
+			goto err;
+
+		if (ctx->ipv6->version != 6)
+			goto err;
+
+		if (ctx->ipv6->nexthdr != NEXTHDR_TCP)
+			goto err;
+
+		ctx->tcp = (struct tcphdr *)(ctx->ipv6 + 1);
+		break;
+	default:
+		goto err;
+	}
+
+	if (ctx->tcp + 1 > ctx->data_end)
+		goto err;
+
+	return 0;
+err:
+	return -1;
+}
+
+static __always_inline int tcp_reload_headers(struct tcp_syncookie *ctx)
+{
+	/* Without volatile,
+	 * R3 32-bit pointer arithmetic prohibited
+	 */
+	volatile u64 data_len = ctx->skb->data_end - ctx->skb->data;
+
+	if (ctx->tcp->doff < sizeof(*ctx->tcp) / 4)
+		goto err;
+
+	/* Needed to calculate csum and parse TCP options. */
+	if (bpf_skb_change_tail(ctx->skb, data_len + 60 - ctx->tcp->doff * 4, 0))
+		goto err;
+
+	ctx->data_end = (void *)(long)ctx->skb->data_end;
+	ctx->eth = (struct ethhdr *)(long)ctx->skb->data;
+	if (ctx->ipv4) {
+		ctx->ipv4 = (struct iphdr *)(ctx->eth + 1);
+		ctx->ipv6 = NULL;
+		ctx->tcp = (struct tcphdr *)(ctx->ipv4 + 1);
+	} else {
+		ctx->ipv4 = NULL;
+		ctx->ipv6 = (struct ipv6hdr *)(ctx->eth + 1);
+		ctx->tcp = (struct tcphdr *)(ctx->ipv6 + 1);
+	}
+
+	if ((void *)ctx->tcp + 60 > ctx->data_end)
+		goto err;
+
+	return 0;
+err:
+	return -1;
+}
+
+static __always_inline __sum16 tcp_v4_csum(struct tcp_syncookie *ctx, __wsum csum)
+{
+	return csum_tcpudp_magic(ctx->ipv4->saddr, ctx->ipv4->daddr,
+				 ctx->tcp->doff * 4, IPPROTO_TCP, csum);
+}
+
+static __always_inline __sum16 tcp_v6_csum(struct tcp_syncookie *ctx, __wsum csum)
+{
+	return csum_ipv6_magic(&ctx->ipv6->saddr, &ctx->ipv6->daddr,
+			       ctx->tcp->doff * 4, IPPROTO_TCP, csum);
+}
+
+static __always_inline int tcp_validate_header(struct tcp_syncookie *ctx)
+{
+	s64 csum;
+
+	if (tcp_reload_headers(ctx))
+		goto err;
+
+	csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4, 0);
+	if (csum < 0)
+		goto err;
+
+	if (ctx->ipv4) {
+		/* check tcp_v4_csum(csum) is 0 if not on lo. */
+
+		csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, ctx->ipv4->ihl * 4, 0);
+		if (csum < 0)
+			goto err;
+
+		if (csum_fold(csum) != 0)
+			goto err;
+	} else if (ctx->ipv6) {
+		/* check tcp_v6_csum(csum) is 0 if not on lo. */
+	}
+
+	return 0;
+err:
+	return -1;
+}
+
+static int tcp_parse_option(__u32 index, struct tcp_syncookie *ctx)
+{
+	struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+	char opcode, opsize;
+
+	if (ctx->ptr + 1 > ctx->data_end)
+		goto stop;
+
+	opcode = *ctx->ptr++;
+
+	if (opcode == TCPOPT_EOL)
+		goto stop;
+
+	if (opcode == TCPOPT_NOP)
+		goto next;
+
+	if (ctx->ptr + 1 > ctx->data_end)
+		goto stop;
+
+	opsize = *ctx->ptr++;
+
+	if (opsize < 2)
+		goto stop;
+
+	switch (opcode) {
+	case TCPOPT_MSS:
+		if (opsize == TCPOLEN_MSS && ctx->tcp->syn &&
+		    ctx->ptr + (TCPOLEN_MSS - 2) < ctx->data_end)
+			tcp_opt->mss_clamp = get_unaligned_be16(ctx->ptr);
+		break;
+	case TCPOPT_WINDOW:
+		if (opsize == TCPOLEN_WINDOW && ctx->tcp->syn &&
+		    ctx->ptr + (TCPOLEN_WINDOW - 2) < ctx->data_end) {
+			tcp_opt->wscale_ok = 1;
+			tcp_opt->snd_wscale = *ctx->ptr;
+		}
+		break;
+	case TCPOPT_TIMESTAMP:
+		if (opsize == TCPOLEN_TIMESTAMP &&
+		    ctx->ptr + (TCPOLEN_TIMESTAMP - 2) < ctx->data_end) {
+			tcp_opt->saw_tstamp = 1;
+			tcp_opt->rcv_tsval = get_unaligned_be32(ctx->ptr);
+			tcp_opt->rcv_tsecr = get_unaligned_be32(ctx->ptr + 4);
+
+			if (ctx->tcp->syn && tcp_opt->rcv_tsecr)
+				tcp_opt->tstamp_ok = 0;
+			else
+				tcp_opt->tstamp_ok = 1;
+		}
+		break;
+	case TCPOPT_SACK_PERM:
+		if (opsize == TCPOLEN_SACK_PERM && ctx->tcp->syn &&
+		    ctx->ptr + (TCPOLEN_SACK_PERM - 2) < ctx->data_end)
+			tcp_opt->sack_ok = 1;
+		break;
+	}
+
+	ctx->ptr += opsize - 2;
+next:
+	return 0;
+stop:
+	return 1;
+}
+
+static __always_inline void tcp_parse_options(struct tcp_syncookie *ctx)
+{
+	ctx->ptr = (char *)(ctx->tcp + 1);
+
+	bpf_loop(40, tcp_parse_option, ctx, 0);
+}
+
+static __always_inline int tcp_validate_sysctl(struct tcp_syncookie *ctx)
+{
+	struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+
+	if ((ctx->ipv4 && tcp_opt->mss_clamp != MSS_LOCAL_IPV4) ||
+	    (ctx->ipv6 && tcp_opt->mss_clamp != MSS_LOCAL_IPV6))
+		goto err;
+
+	if (!tcp_opt->wscale_ok || tcp_opt->snd_wscale != 7)
+		goto err;
+
+	if (!tcp_opt->tstamp_ok)
+		goto err;
+
+	if (!tcp_opt->sack_ok)
+		goto err;
+
+	if (!ctx->tcp->ece || !ctx->tcp->cwr)
+		goto err;
+
+	return 0;
+err:
+	return -1;
+}
+
+static __always_inline void tcp_prepare_cookie(struct tcp_syncookie *ctx)
+{
+	struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+	u32 seq = bpf_ntohl(ctx->tcp->seq);
+	u64 first = 0, second;
+	int mssind = 0;
+	u32 hash;
+
+	if (ctx->ipv4) {
+		for (mssind = ARRAY_SIZE(msstab4) - 1; mssind; mssind--)
+			if (tcp_opt->mss_clamp >= msstab4[mssind])
+				break;
+
+		tcp_opt->mss_clamp = msstab4[mssind];
+
+		first = (u64)ctx->ipv4->saddr << 32 | ctx->ipv4->daddr;
+	} else if (ctx->ipv6) {
+		for (mssind = ARRAY_SIZE(msstab6) - 1; mssind; mssind--)
+			if (tcp_opt->mss_clamp >= msstab6[mssind])
+				break;
+
+		tcp_opt->mss_clamp = msstab6[mssind];
+
+		first = (u64)ctx->ipv6->saddr.in6_u.u6_addr8[0] << 32 |
+			ctx->ipv6->daddr.in6_u.u6_addr32[0];
+	}
+
+	second = (u64)seq << 32 | ctx->tcp->source << 16 | ctx->tcp->dest;
+	hash = siphash_2u64(first, second, &test_key_siphash);
+
+	if (tcp_opt->tstamp_ok) {
+		tcp_opt->rcv_tsecr = bpf_get_prandom_u32();
+		tcp_opt->rcv_tsecr &= ~COOKIE_MASK;
+		tcp_opt->rcv_tsecr |= hash & COOKIE_MASK;
+	}
+
+	hash &= ~COOKIE_MASK;
+	hash |= mssind << 6;
+
+	if (tcp_opt->wscale_ok)
+		hash |= tcp_opt->snd_wscale & BPF_SYNCOOKIE_WSCALE_MASK;
+
+	if (tcp_opt->sack_ok)
+		hash |= BPF_SYNCOOKIE_SACK;
+
+	if (tcp_opt->tstamp_ok && ctx->tcp->ece && ctx->tcp->cwr)
+		hash |= BPF_SYNCOOKIE_ECN;
+
+	ctx->cookie = hash;
+}
+
+static __always_inline void tcp_write_options(struct tcp_syncookie *ctx)
+{
+	struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+
+	ctx->ptr32 = (__be32 *)(ctx->tcp + 1);
+
+	*ctx->ptr32++ = bpf_htonl(TCPOPT_MSS << 24 | TCPOLEN_MSS << 16 |
+				  tcp_opt->mss_clamp);
+
+	if (tcp_opt->wscale_ok)
+		*ctx->ptr32++ = bpf_htonl(TCPOPT_NOP << 24 |
+					  TCPOPT_WINDOW << 16 |
+					  TCPOLEN_WINDOW << 8 |
+					  tcp_opt->snd_wscale);
+
+	if (tcp_opt->tstamp_ok) {
+		if (tcp_opt->sack_ok)
+			*ctx->ptr32++ = bpf_htonl(TCPOPT_SACK_PERM << 24 |
+						  TCPOLEN_SACK_PERM << 16 |
+						  TCPOPT_TIMESTAMP << 8 |
+						  TCPOLEN_TIMESTAMP);
+		else
+			*ctx->ptr32++ = bpf_htonl(TCPOPT_NOP << 24 |
+						  TCPOPT_NOP << 16 |
+						  TCPOPT_TIMESTAMP << 8 |
+						  TCPOLEN_TIMESTAMP);
+
+		*ctx->ptr32++ = bpf_htonl(tcp_opt->rcv_tsecr);
+		*ctx->ptr32++ = bpf_htonl(tcp_opt->rcv_tsval);
+	} else if (tcp_opt->sack_ok) {
+		*ctx->ptr32++ = bpf_htonl(TCPOPT_NOP << 24 |
+					  TCPOPT_NOP << 16 |
+					  TCPOPT_SACK_PERM << 8 |
+					  TCPOLEN_SACK_PERM);
+	}
+}
+
+static __always_inline int tcp_handle_syn(struct tcp_syncookie *ctx)
+{
+	s64 csum;
+
+	if (tcp_validate_header(ctx))
+		goto err;
+
+	tcp_parse_options(ctx);
+
+	if (tcp_validate_sysctl(ctx))
+		goto err;
+
+	tcp_prepare_cookie(ctx);
+	tcp_write_options(ctx);
+
+	swap(ctx->tcp->source, ctx->tcp->dest);
+	ctx->tcp->check = 0;
+	ctx->tcp->ack_seq = bpf_htonl(bpf_ntohl(ctx->tcp->seq) + 1);
+	ctx->tcp->seq = bpf_htonl(ctx->cookie);
+	ctx->tcp->doff = ((long)ctx->ptr32 - (long)ctx->tcp) >> 2;
+	ctx->tcp->ack = 1;
+	if (!ctx->attr.tcp_opt.tstamp_ok || !ctx->tcp->ece || !ctx->tcp->cwr)
+		ctx->tcp->ece = 0;
+	ctx->tcp->cwr = 0;
+
+	csum = bpf_csum_diff(0, 0, (void *)ctx->tcp, ctx->tcp->doff * 4, 0);
+	if (csum < 0)
+		goto err;
+
+	if (ctx->ipv4) {
+		swap(ctx->ipv4->saddr, ctx->ipv4->daddr);
+		ctx->tcp->check = tcp_v4_csum(ctx, csum);
+
+		ctx->ipv4->check = 0;
+		ctx->ipv4->tos = 0;
+		ctx->ipv4->tot_len = bpf_htons((long)ctx->ptr32 - (long)ctx->ipv4);
+		ctx->ipv4->id = 0;
+		ctx->ipv4->ttl = 64;
+
+		csum = bpf_csum_diff(0, 0, (void *)ctx->ipv4, sizeof(*ctx->ipv4), 0);
+		if (csum < 0)
+			goto err;
+
+		ctx->ipv4->check = csum_fold(csum);
+	} else if (ctx->ipv6) {
+		swap(ctx->ipv6->saddr, ctx->ipv6->daddr);
+		ctx->tcp->check = tcp_v6_csum(ctx, csum);
+
+		*(__be32 *)ctx->ipv6 = bpf_htonl(0x60000000);
+		ctx->ipv6->payload_len = bpf_htons((long)ctx->ptr32 - (long)ctx->tcp);
+		ctx->ipv6->hop_limit = 64;
+	}
+
+	swap_array(ctx->eth->h_source, ctx->eth->h_dest);
+
+	if (bpf_skb_change_tail(ctx->skb, (long)ctx->ptr32 - (long)ctx->eth, 0))
+		goto err;
+
+	return bpf_redirect(ctx->skb->ifindex, 0);
+err:
+	return TC_ACT_SHOT;
+}
+
+static __always_inline int tcp_validate_cookie(struct tcp_syncookie *ctx)
+{
+	struct tcp_options_received *tcp_opt = &ctx->attr.tcp_opt;
+	u32 cookie = bpf_ntohl(ctx->tcp->ack_seq) - 1;
+	u32 seq = bpf_ntohl(ctx->tcp->seq) - 1;
+	u64 first = 0, second;
+	int mssind;
+	u32 hash;
+
+	if (ctx->ipv4)
+		first = (u64)ctx->ipv4->saddr << 32 | ctx->ipv4->daddr;
+	else if (ctx->ipv6)
+		first = (u64)ctx->ipv6->saddr.in6_u.u6_addr8[0] << 32 |
+			ctx->ipv6->daddr.in6_u.u6_addr32[0];
+
+	second = (u64)seq << 32 | ctx->tcp->source << 16 | ctx->tcp->dest;
+	hash = siphash_2u64(first, second, &test_key_siphash);
+
+	if (tcp_opt->saw_tstamp)
+		hash -= tcp_opt->rcv_tsecr & COOKIE_MASK;
+	else
+		hash &= ~COOKIE_MASK;
+
+	hash -= cookie & ~COOKIE_MASK;
+	if (hash)
+		goto err;
+
+	mssind = (cookie & (3 << 6)) >> 6;
+	if (ctx->ipv4) {
+		if (mssind > ARRAY_SIZE(msstab4))
+			goto err;
+
+		tcp_opt->mss_clamp = msstab4[mssind];
+	} else {
+		if (mssind > ARRAY_SIZE(msstab6))
+			goto err;
+
+		tcp_opt->mss_clamp = msstab6[mssind];
+	}
+
+	tcp_opt->snd_wscale = cookie & BPF_SYNCOOKIE_WSCALE_MASK;
+	tcp_opt->rcv_wscale = tcp_opt->snd_wscale;
+	tcp_opt->wscale_ok = tcp_opt->snd_wscale == BPF_SYNCOOKIE_WSCALE_MASK;
+	tcp_opt->sack_ok = cookie & BPF_SYNCOOKIE_SACK;
+	ctx->attr.ecn_ok = cookie & BPF_SYNCOOKIE_ECN;
+
+	return 0;
+err:
+	return -1;
+}
+
+static __always_inline int tcp_handle_ack(struct tcp_syncookie *ctx)
+{
+	struct bpf_sock_tuple tuple;
+	struct bpf_sock *skc;
+	int ret = TC_ACT_OK;
+	struct sock *sk;
+	u32 tuple_size;
+
+	if (ctx->ipv4) {
+		tuple.ipv4.saddr = ctx->ipv4->saddr;
+		tuple.ipv4.daddr = ctx->ipv4->daddr;
+		tuple.ipv4.sport = ctx->tcp->source;
+		tuple.ipv4.dport = ctx->tcp->dest;
+		tuple_size = sizeof(tuple.ipv4);
+	} else if (ctx->ipv6) {
+		__builtin_memcpy(tuple.ipv6.saddr, &ctx->ipv6->saddr, sizeof(tuple.ipv6.saddr));
+		__builtin_memcpy(tuple.ipv6.daddr, &ctx->ipv6->daddr, sizeof(tuple.ipv6.daddr));
+		tuple.ipv6.sport = ctx->tcp->source;
+		tuple.ipv6.dport = ctx->tcp->dest;
+		tuple_size = sizeof(tuple.ipv6);
+	} else {
+		goto out;
+	}
+
+	skc = bpf_skc_lookup_tcp(ctx->skb, &tuple, tuple_size, BPF_F_CURRENT_NETNS, 0);
+	if (!skc)
+		goto out;
+
+	if (skc->state != TCP_LISTEN)
+		goto release;
+
+	sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
+	if (!sk)
+		goto err;
+
+	if (tcp_validate_header(ctx))
+		goto err;
+
+	tcp_parse_options(ctx);
+
+	if (tcp_validate_cookie(ctx))
+		goto err;
+
+	ret = bpf_sk_assign_tcp_reqsk(ctx->skb, sk, &ctx->attr, sizeof(ctx->attr));
+	if (ret < 0)
+		goto err;
+
+release:
+	bpf_sk_release(skc);
+out:
+	return ret;
+
+err:
+	ret = TC_ACT_SHOT;
+	goto release;
+}
+
+SEC("tc")
+int tcp_custom_syncookie(struct __sk_buff *skb)
+{
+	struct tcp_syncookie ctx = {
+		.skb = skb,
+	};
+
+	if (tcp_load_headers(&ctx))
+		return TC_ACT_OK;
+
+	if (ctx.tcp->rst)
+		return TC_ACT_OK;
+
+	if (ctx.tcp->syn) {
+		if (ctx.tcp->ack)
+			return TC_ACT_OK;
+
+		return tcp_handle_syn(&ctx);
+	}
+
+	return tcp_handle_ack(&ctx);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
new file mode 100644
index 000000000000..a401f59e46d8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#ifndef _TEST_TCP_SYNCOOKIE_H
+#define _TEST_TCP_SYNCOOKIE_H
+
+#define TC_ACT_OK	0
+#define TC_ACT_SHOT	2
+
+#define ETH_ALEN	6
+#define ETH_P_IP	0x0800
+#define ETH_P_IPV6	0x86DD
+
+#define NEXTHDR_TCP	6
+
+#define TCPOPT_NOP		1
+#define TCPOPT_EOL		0
+#define TCPOPT_MSS		2
+#define TCPOPT_WINDOW		3
+#define TCPOPT_TIMESTAMP	8
+#define TCPOPT_SACK_PERM	4
+
+#define TCPOLEN_MSS		4
+#define TCPOLEN_WINDOW		3
+#define TCPOLEN_TIMESTAMP	10
+#define TCPOLEN_SACK_PERM	2
+#define BPF_F_CURRENT_NETNS	(-1)
+
+#define __packed __attribute__((__packed__))
+#define __force
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+#define swap(a, b)				\
+	do {					\
+		typeof(a) __tmp = (a);		\
+		(a) = (b);			\
+		(b) = __tmp;			\
+	} while (0)
+
+#define swap_array(a, b)				\
+	do {						\
+		typeof(a) __tmp[sizeof(a)];		\
+		__builtin_memcpy(__tmp, a, sizeof(a));	\
+		__builtin_memcpy(a, b, sizeof(a));	\
+		__builtin_memcpy(b, __tmp, sizeof(a));	\
+	} while (0)
+
+/* asm-generic/unaligned.h */
+#define __get_unaligned_t(type, ptr) ({						\
+	const struct { type x; } __packed * __pptr = (typeof(__pptr))(ptr);	\
+	__pptr->x;								\
+})
+
+#define get_unaligned(ptr) __get_unaligned_t(typeof(*(ptr)), (ptr))
+
+static inline u16 get_unaligned_be16(const void *p)
+{
+	return bpf_ntohs(__get_unaligned_t(__be16, p));
+}
+
+static inline u32 get_unaligned_be32(const void *p)
+{
+	return bpf_ntohl(__get_unaligned_t(__be32, p));
+}
+
+/* lib/checksum.c */
+static inline u32 from64to32(u64 x)
+{
+	/* add up 32-bit and 32-bit for 32+c bit */
+	x = (x & 0xffffffff) + (x >> 32);
+	/* add up carry.. */
+	x = (x & 0xffffffff) + (x >> 32);
+	return (u32)x;
+}
+
+static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
+					__u32 len, __u8 proto, __wsum sum)
+{
+	unsigned long long s = (__force u32)sum;
+
+	s += (__force u32)saddr;
+	s += (__force u32)daddr;
+#ifdef __BIG_ENDIAN
+	s += proto + len;
+#else
+	s += (proto + len) << 8;
+#endif
+	return (__force __wsum)from64to32(s);
+}
+
+/* asm-generic/checksum.h */
+static inline __sum16 csum_fold(__wsum csum)
+{
+	u32 sum = (__force u32)csum;
+
+	sum = (sum & 0xffff) + (sum >> 16);
+	sum = (sum & 0xffff) + (sum >> 16);
+	return (__force __sum16)~sum;
+}
+
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
+					__u8 proto, __wsum sum)
+{
+	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
+}
+
+/* net/ipv6/ip6_checksum.c */
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+				      const struct in6_addr *daddr,
+				      __u32 len, __u8 proto, __wsum csum)
+{
+	int carry;
+	__u32 ulen;
+	__u32 uproto;
+	__u32 sum = (__force u32)csum;
+
+	sum += (__force u32)saddr->in6_u.u6_addr32[0];
+	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[0]);
+	sum += carry;
+
+	sum += (__force u32)saddr->in6_u.u6_addr32[1];
+	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[1]);
+	sum += carry;
+
+	sum += (__force u32)saddr->in6_u.u6_addr32[2];
+	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[2]);
+	sum += carry;
+
+	sum += (__force u32)saddr->in6_u.u6_addr32[3];
+	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[3]);
+	sum += carry;
+
+	sum += (__force u32)daddr->in6_u.u6_addr32[0];
+	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[0]);
+	sum += carry;
+
+	sum += (__force u32)daddr->in6_u.u6_addr32[1];
+	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[1]);
+	sum += carry;
+
+	sum += (__force u32)daddr->in6_u.u6_addr32[2];
+	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[2]);
+	sum += carry;
+
+	sum += (__force u32)daddr->in6_u.u6_addr32[3];
+	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[3]);
+	sum += carry;
+
+	ulen = (__force u32)bpf_htonl((__u32)len);
+	sum += ulen;
+	carry = (sum < ulen);
+	sum += carry;
+
+	uproto = (__force u32)bpf_htonl(proto);
+	sum += uproto;
+	carry = (sum < uproto);
+	sum += carry;
+
+	return csum_fold((__force __wsum)sum);
+}
+#endif
-- 
2.30.2



* Re: [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
  2023-12-05  1:34 ` [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
@ 2023-12-05  2:13   ` Alexei Starovoitov
  2023-12-05  3:00     ` Kuniyuki Iwashima
  2023-12-06  6:39   ` Martin KaFai Lau
  1 sibling, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2023-12-05  2:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Kuniyuki Iwashima, bpf,
	Network Development

On Mon, Dec 4, 2023 at 5:36 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> +static __always_inline int tcp_load_headers(struct tcp_syncookie *ctx)

...

> +static __always_inline int tcp_reload_headers(struct tcp_syncookie *ctx)

please remove __always_inline here and in all other places.
The generated code will be much better == faster and the verifier
should be able to understand it.


* Re: [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
  2023-12-05  2:13   ` Alexei Starovoitov
@ 2023-12-05  3:00     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-05  3:00 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, kuniyu, martin.lau,
	netdev

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Mon, 4 Dec 2023 18:13:55 -0800
> On Mon, Dec 4, 2023 at 5:36 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > +static __always_inline int tcp_load_headers(struct tcp_syncookie *ctx)
> 
> ...
> 
> > +static __always_inline int tcp_reload_headers(struct tcp_syncookie *ctx)
> 
> please remove __always_inline here and in all other places.
> The generated code will be much better == faster and the verifier
> should be able to understand it.

I confirmed the test worked without __always_inline.
I'll fix it in v5.

Thanks!


* Re: [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
  2023-12-05  1:34 ` [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check() Kuniyuki Iwashima
@ 2023-12-06  0:19   ` Martin KaFai Lau
  2023-12-06  1:29     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2023-12-06  0:19 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Kuniyuki Iwashima, bpf, netdev, Eric Dumazet, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko

On 12/4/23 5:34 PM, Kuniyuki Iwashima wrote:
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 61f1c96cfe63..0f9c3aed2014 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -304,6 +304,59 @@ static int cookie_tcp_reqsk_init(struct sock *sk, struct sk_buff *skb,
>   	return 0;
>   }
>   
> +#if IS_ENABLED(CONFIG_BPF)
> +struct request_sock *cookie_bpf_check(struct net *net, struct sock *sk,
> +				      struct sk_buff *skb)
> +{
> +	struct request_sock *req = inet_reqsk(skb->sk);
> +	struct inet_request_sock *ireq = inet_rsk(req);
> +	struct tcp_request_sock *treq = tcp_rsk(req);
> +	struct tcp_options_received tcp_opt;
> +	int ret;
> +
> +	skb->sk = NULL;
> +	skb->destructor = NULL;
> +	req->rsk_listener = NULL;
> +
> +	memset(&tcp_opt, 0, sizeof(tcp_opt));
> +	tcp_parse_options(net, skb, &tcp_opt, 0, NULL);

In patch 2, the bpf prog is passing the tcp_opt to the kfunc. The selftest in 
patch 3 is also parsing the tcp-options.

The kernel parses the tcp-options here again to do some checking and req member 
initialization. Can this checking and initialization be done in the 
bpf_sk_assign_tcp_reqsk() kfunc instead to avoid the double tcp-option parsing?

> +
> +	if (ireq->tstamp_ok ^ tcp_opt.saw_tstamp) {
> +		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
> +		goto reset;
> +	}
> +
> +	__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
> +
> +	if (ireq->tstamp_ok) {
> +		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
> +			goto reset;
> +
> +		req->ts_recent = tcp_opt.rcv_tsval;
> +		treq->ts_off = tcp_opt.rcv_tsecr - tcp_ns_to_ts(false, tcp_clock_ns());
> +	}
> +
> +	if (ireq->sack_ok && !READ_ONCE(net->ipv4.sysctl_tcp_sack))
> +		goto reset;
> +
> +	if (ireq->wscale_ok && !READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
> +		goto reset;
> +
> +	ret = cookie_tcp_reqsk_init(sk, skb, req);
> +	if (ret) {
> +		reqsk_free(req);
> +		req = NULL;
> +	}
> +
> +	return req;
> +
> +reset:
> +	reqsk_free(req);
> +	return ERR_PTR(-EINVAL);
> +}
> +EXPORT_SYMBOL_GPL(cookie_bpf_check);
> +#endif



* Re: [PATCH v4 bpf-next 2/3] bpf: tcp: Support arbitrary SYN Cookie.
  2023-12-05  1:34 ` [PATCH v4 bpf-next 2/3] bpf: tcp: Support arbitrary SYN Cookie Kuniyuki Iwashima
@ 2023-12-06  1:20   ` Martin KaFai Lau
  2023-12-06  1:42     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2023-12-06  1:20 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Kuniyuki Iwashima, bpf, netdev, Eric Dumazet, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko

On 12/4/23 5:34 PM, Kuniyuki Iwashima wrote:
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index 144c39db9898..2efffe2c05d0 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -83,6 +83,41 @@ static inline struct sock *req_to_sk(struct request_sock *req)
>   	return (struct sock *)req;
>   }
>   
> +/**
> + * skb_steal_sock - steal a socket from an sk_buff
> + * @skb: sk_buff to steal the socket from
> + * @refcounted: is set to true if the socket is reference-counted
> + * @prefetched: is set to true if the socket was assigned from bpf
> + */
> +static inline struct sock *
> +skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
> +{
> +	struct sock *sk = skb->sk;
> +
> +	if (!skb->sk) {
> +		*prefetched = false;
> +		*refcounted = false;
> +		return NULL;
> +	}
> +
> +	*prefetched = skb_sk_is_prefetched(skb);
> +	if (*prefetched) {
> +#if IS_ENABLED(CONFIG_SYN_COOKIES)
> +		if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
> +			*refcounted = false;
> +			return inet_reqsk(sk)->rsk_listener;

If it does not break later logic, I would set inet_reqsk(sk)->rsk_listener to 
NULL to avoid inconsistency when the later inet[6]_lookup_reuseport() selects 
another listener. skb_steal_sock() steals the inet_reqsk(sk)->rsk_listener in 
this sense.
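
A sketch of what that could look like (a hypothetical shape, not the posted
patch; the next revision may differ):

    if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
        struct request_sock *req = inet_reqsk(sk);

        *refcounted = false;
        sk = req->rsk_listener;
        /* steal the listener reference here so a later
         * inet[6]_lookup_reuseport() cannot observe a stale pointer
         */
        req->rsk_listener = NULL;
        return sk;
    }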


> +		}
> +#endif
> +		*refcounted = sk_is_refcounted(sk);
> +	} else {
> +		*refcounted = true;
> +	}
> +
> +	skb->destructor = NULL;
> +	skb->sk = NULL;
> +	return sk;
> +}
> +

[ ... ]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 0adaa4afa35f..a43f7627c5fd 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -11816,6 +11816,94 @@ __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
>   
>   	return 0;
>   }
> +
> +__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
> +					struct tcp_cookie_attributes *attr,
> +					int attr__sz)
> +{
> +#if IS_ENABLED(CONFIG_SYN_COOKIES)
> +	const struct request_sock_ops *ops;
> +	struct inet_request_sock *ireq;
> +	struct tcp_request_sock *treq;
> +	struct request_sock *req;
> +	__u16 min_mss;
> +
> +	if (attr__sz != sizeof(*attr) || attr->tcp_opt.unused)
> +		return -EINVAL;
> +
> +	if (!sk)
> +		return -EINVAL;
> +
> +	if (!skb_at_tc_ingress(skb))
> +		return -EINVAL;
> +
> +	if (dev_net(skb->dev) != sock_net(sk))
> +		return -ENETUNREACH;
> +
> +	switch (skb->protocol) {
> +	case htons(ETH_P_IP):
> +		ops = &tcp_request_sock_ops;
> +		min_mss = 536;
> +		break;
> +#if IS_BUILTIN(CONFIG_IPV6)
> +	case htons(ETH_P_IPV6):
> +		ops = &tcp6_request_sock_ops;
> +		min_mss = IPV6_MIN_MTU - 60;
> +		break;
> +#endif
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN)
> +		return -EINVAL;
> +
> +	if (attr->tcp_opt.mss_clamp < min_mss) {
> +		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);

hmm... for this one I am not sure whether the kfunc should decide what counts as 
SYNCOOKIESFAILED or not. Besides, the bpf prog should have already rejected the 
skb as part of its cookie validation logic. Thus, reaching here is more like a 
bpf prog error instead.

I would leave the SYNCOOKIESFAILED usage to the kernel tcp layer only. The 
existing bpf_tcp_check_syncookie() helper does not increment SYNCOOKIESFAILED either.
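
i.e. a sketch of the mss_clamp check with the counter dropped:

    if (attr->tcp_opt.mss_clamp < min_mss)
        return -EINVAL;    /* no __NET_INC_STATS() in the kfunc */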

> +		return -EINVAL;
> +	}
> +
> +	if (attr->tcp_opt.wscale_ok &&
> +	    (attr->tcp_opt.snd_wscale > TCP_MAX_WSCALE ||
> +	     attr->tcp_opt.rcv_wscale > TCP_MAX_WSCALE)) {
> +		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);

Same here.

> +		return -EINVAL;
> +	}
> +
> +	if (sk_is_mptcp(sk))
> +		req = mptcp_subflow_reqsk_alloc(ops, sk, false);
> +	else
> +		req = inet_reqsk_alloc(ops, sk, false);
> +
> +	if (!req)
> +		return -ENOMEM;
> +
> +	ireq = inet_rsk(req);
> +	treq = tcp_rsk(req);
> +
> +	req->syncookie = 1;
> +	req->rsk_listener = sk;
> +	req->mss = attr->tcp_opt.mss_clamp;
> +
> +	ireq->snd_wscale = attr->tcp_opt.snd_wscale;
> +	ireq->rcv_wscale = attr->tcp_opt.rcv_wscale;
> +	ireq->wscale_ok = attr->tcp_opt.wscale_ok;
> +	ireq->tstamp_ok	= attr->tcp_opt.tstamp_ok;
> +	ireq->sack_ok = attr->tcp_opt.sack_ok;
> +	ireq->ecn_ok = attr->ecn_ok;
> +
> +	treq->req_usec_ts = attr->usec_ts_ok;
> +
> +	skb_orphan(skb);
> +	skb->sk = req_to_sk(req);
> +	skb->destructor = sock_pfree;
> +
> +	return 0;
> +#else
> +	return -EOPNOTSUPP;
> +#endif
> +}
> +
>   __bpf_kfunc_end_defs();
>   
>   int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags,
> @@ -11844,6 +11932,10 @@ BTF_SET8_START(bpf_kfunc_check_set_sock_addr)
>   BTF_ID_FLAGS(func, bpf_sock_addr_set_sun_path)
>   BTF_SET8_END(bpf_kfunc_check_set_sock_addr)
>   
> +BTF_SET8_START(bpf_kfunc_check_set_tcp_reqsk)
> +BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk)
> +BTF_SET8_END(bpf_kfunc_check_set_tcp_reqsk)
> +
>   static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
>   	.owner = THIS_MODULE,
>   	.set = &bpf_kfunc_check_set_skb,
> @@ -11859,6 +11951,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_addr = {
>   	.set = &bpf_kfunc_check_set_sock_addr,
>   };
>   
> +static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
> +	.owner = THIS_MODULE,
> +	.set = &bpf_kfunc_check_set_tcp_reqsk,
> +};
> +
>   static int __init bpf_kfunc_init(void)
>   {
>   	int ret;
> @@ -11874,8 +11971,9 @@ static int __init bpf_kfunc_init(void)
>   	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
>   	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_NETFILTER, &bpf_kfunc_set_skb);
>   	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
> -	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
> -						&bpf_kfunc_set_sock_addr);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
> +					       &bpf_kfunc_set_sock_addr);
> +	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
>   }
>   late_initcall(bpf_kfunc_init);
>   
> diff --git a/net/core/sock.c b/net/core/sock.c
> index fef349dd72fa..998950e97dfe 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2579,8 +2579,18 @@ EXPORT_SYMBOL(sock_efree);
>   #ifdef CONFIG_INET
>   void sock_pfree(struct sk_buff *skb)
>   {
> -	if (sk_is_refcounted(skb->sk))
> -		sock_gen_put(skb->sk);
> +	struct sock *sk = skb->sk;
> +
> +	if (!sk_is_refcounted(sk))
> +		return;
> +
> +	if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
> +		inet_reqsk(sk)->rsk_listener = NULL;
> +		reqsk_free(inet_reqsk(sk));
> +		return;
> +	}
> +
> +	sock_gen_put(sk);
>   }
>   EXPORT_SYMBOL(sock_pfree);
>   #endif /* CONFIG_INET */



* Re: [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
  2023-12-06  0:19   ` Martin KaFai Lau
@ 2023-12-06  1:29     ` Kuniyuki Iwashima
  2023-12-06  3:11       ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-06  1:29 UTC (permalink / raw)
  To: martin.lau; +Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, kuniyu, netdev

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Tue, 5 Dec 2023 16:19:20 -0800
> On 12/4/23 5:34 PM, Kuniyuki Iwashima wrote:
> > diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> > index 61f1c96cfe63..0f9c3aed2014 100644
> > --- a/net/ipv4/syncookies.c
> > +++ b/net/ipv4/syncookies.c
> > @@ -304,6 +304,59 @@ static int cookie_tcp_reqsk_init(struct sock *sk, struct sk_buff *skb,
> >   	return 0;
> >   }
> >   
> > +#if IS_ENABLED(CONFIG_BPF)
> > +struct request_sock *cookie_bpf_check(struct net *net, struct sock *sk,
> > +				      struct sk_buff *skb)
> > +{
> > +	struct request_sock *req = inet_reqsk(skb->sk);
> > +	struct inet_request_sock *ireq = inet_rsk(req);
> > +	struct tcp_request_sock *treq = tcp_rsk(req);
> > +	struct tcp_options_received tcp_opt;
> > +	int ret;
> > +
> > +	skb->sk = NULL;
> > +	skb->destructor = NULL;
> > +	req->rsk_listener = NULL;
> > +
> > +	memset(&tcp_opt, 0, sizeof(tcp_opt));
> > +	tcp_parse_options(net, skb, &tcp_opt, 0, NULL);
> 
> In patch 2, the bpf prog is passing the tcp_opt to the kfunc. The selftest in 
> patch 3 is also parsing the tcp-options.
> 
> The kernel parses the tcp-options here again to do some checking and req member 
> initialization. Can this checking and initialization be done in the 
> bpf_sk_assign_tcp_reqsk() kfunc instead, to avoid the double tcp-option parsing?

If TS is not used as cookie storage, the bpf prog need not parse it.
OTOH, if a value is encoded into TS, the bpf prog needs to parse it.
In that case, we cannot avoid parsing options in the bpf prog.

The parsing here comes from my paranoia, so... probably we can drop it
and the first test below, and rely on the bpf prog's tcp_opt, especially
tstamp_ok, rcv_tsval, and rcv_tsecr?

I placed the other tests here to align with the normal cookie flow, but
they can be moved to the kfunc.  However, the initialisation assumes skb
points to the TCP header, so here would be a better place, I think.


> 
> > +
> > +	if (ireq->tstamp_ok ^ tcp_opt.saw_tstamp) {
> > +		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
> > +		goto reset;
> > +	}
> > +
> > +	__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
> > +
> > +	if (ireq->tstamp_ok) {
> > +		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
> > +			goto reset;
> > +
> > +		req->ts_recent = tcp_opt.rcv_tsval;
> > +		treq->ts_off = tcp_opt.rcv_tsecr - tcp_ns_to_ts(false, tcp_clock_ns());
> > +	}
> > +
> > +	if (ireq->sack_ok && !READ_ONCE(net->ipv4.sysctl_tcp_sack))
> > +		goto reset;
> > +
> > +	if (ireq->wscale_ok && !READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
> > +		goto reset;
> > +
> > +	ret = cookie_tcp_reqsk_init(sk, skb, req);
> > +	if (ret) {
> > +		reqsk_free(req);
> > +		req = NULL;
> > +	}
> > +
> > +	return req;
> > +
> > +reset:
> > +	reqsk_free(req);
> > +	return ERR_PTR(-EINVAL);
> > +}
> > +EXPORT_SYMBOL_GPL(cookie_bpf_check);
> > +#endif


* Re: [PATCH v4 bpf-next 2/3] bpf: tcp: Support arbitrary SYN Cookie.
  2023-12-06  1:20   ` Martin KaFai Lau
@ 2023-12-06  1:42     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-06  1:42 UTC (permalink / raw)
  To: martin.lau; +Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, kuniyu, netdev

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Tue, 5 Dec 2023 17:20:53 -0800
> On 12/4/23 5:34 PM, Kuniyuki Iwashima wrote:
> > diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> > index 144c39db9898..2efffe2c05d0 100644
> > --- a/include/net/request_sock.h
> > +++ b/include/net/request_sock.h
> > @@ -83,6 +83,41 @@ static inline struct sock *req_to_sk(struct request_sock *req)
> >   	return (struct sock *)req;
> >   }
> >   
> > +/**
> > + * skb_steal_sock - steal a socket from an sk_buff
> > + * @skb: sk_buff to steal the socket from
> > + * @refcounted: is set to true if the socket is reference-counted
> > + * @prefetched: is set to true if the socket was assigned from bpf
> > + */
> > +static inline struct sock *
> > +skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
> > +{
> > +	struct sock *sk = skb->sk;
> > +
> > +	if (!skb->sk) {
> > +		*prefetched = false;
> > +		*refcounted = false;
> > +		return NULL;
> > +	}
> > +
> > +	*prefetched = skb_sk_is_prefetched(skb);
> > +	if (*prefetched) {
> > +#if IS_ENABLED(CONFIG_SYN_COOKIES)
> > +		if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
> > +			*refcounted = false;
> > +			return inet_reqsk(sk)->rsk_listener;
> 
> If it does not break later logic, I would set inet_reqsk(sk)->rsk_listener to 
> NULL to avoid inconsistency when the later inet[6]_lookup_reuseport() selects 
> another listener. skb_steal_sock() steals the inet_reqsk(sk)->rsk_listener in 
> this sense.

That makes sense.
I'll clear rsk_listener in the next spin.


> 
> 
> > +		}
> > +#endif
> > +		*refcounted = sk_is_refcounted(sk);
> > +	} else {
> > +		*refcounted = true;
> > +	}
> > +
> > +	skb->destructor = NULL;
> > +	skb->sk = NULL;
> > +	return sk;
> > +}
> > +
> 
> [ ... ]
> 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 0adaa4afa35f..a43f7627c5fd 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -11816,6 +11816,94 @@ __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
> >   
> >   	return 0;
> >   }
> > +
> > +__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
> > +					struct tcp_cookie_attributes *attr,
> > +					int attr__sz)
> > +{
> > +#if IS_ENABLED(CONFIG_SYN_COOKIES)
> > +	const struct request_sock_ops *ops;
> > +	struct inet_request_sock *ireq;
> > +	struct tcp_request_sock *treq;
> > +	struct request_sock *req;
> > +	__u16 min_mss;
> > +
> > +	if (attr__sz != sizeof(*attr) || attr->tcp_opt.unused)
> > +		return -EINVAL;
> > +
> > +	if (!sk)
> > +		return -EINVAL;
> > +
> > +	if (!skb_at_tc_ingress(skb))
> > +		return -EINVAL;
> > +
> > +	if (dev_net(skb->dev) != sock_net(sk))
> > +		return -ENETUNREACH;
> > +
> > +	switch (skb->protocol) {
> > +	case htons(ETH_P_IP):
> > +		ops = &tcp_request_sock_ops;
> > +		min_mss = 536;
> > +		break;
> > +#if IS_BUILTIN(CONFIG_IPV6)
> > +	case htons(ETH_P_IPV6):
> > +		ops = &tcp6_request_sock_ops;
> > +		min_mss = IPV6_MIN_MTU - 60;
> > +		break;
> > +#endif
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN)
> > +		return -EINVAL;
> > +
> > +	if (attr->tcp_opt.mss_clamp < min_mss) {
> > +		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
> 
> hmm... for this one I am not sure whether the kfunc should decide what counts as 
> SYNCOOKIESFAILED or not. Besides, the bpf prog should have already rejected the 
> skb as part of its cookie validation logic. Thus, reaching here is more like a 
> bpf prog error instead.

Indeed.

> 
> I would leave the SYNCOOKIESFAILED usage to the kernel tcp layer only. The 
> existing bpf_tcp_check_syncookie() helper does not increment SYNCOOKIESFAILED either.

I'll remove the __NET_INC_STATS()s from the kfunc.

Thanks!


> 
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (attr->tcp_opt.wscale_ok &&
> > +	    (attr->tcp_opt.snd_wscale > TCP_MAX_WSCALE ||
> > +	     attr->tcp_opt.rcv_wscale > TCP_MAX_WSCALE)) {
> > +		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
> 
> Same here.
> 
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (sk_is_mptcp(sk))
> > +		req = mptcp_subflow_reqsk_alloc(ops, sk, false);
> > +	else
> > +		req = inet_reqsk_alloc(ops, sk, false);
> > +
> > +	if (!req)
> > +		return -ENOMEM;
> > +
> > +	ireq = inet_rsk(req);
> > +	treq = tcp_rsk(req);
> > +
> > +	req->syncookie = 1;
> > +	req->rsk_listener = sk;
> > +	req->mss = attr->tcp_opt.mss_clamp;
> > +
> > +	ireq->snd_wscale = attr->tcp_opt.snd_wscale;
> > +	ireq->rcv_wscale = attr->tcp_opt.rcv_wscale;
> > +	ireq->wscale_ok = attr->tcp_opt.wscale_ok;
> > +	ireq->tstamp_ok	= attr->tcp_opt.tstamp_ok;
> > +	ireq->sack_ok = attr->tcp_opt.sack_ok;
> > +	ireq->ecn_ok = attr->ecn_ok;
> > +
> > +	treq->req_usec_ts = attr->usec_ts_ok;
> > +
> > +	skb_orphan(skb);
> > +	skb->sk = req_to_sk(req);
> > +	skb->destructor = sock_pfree;
> > +
> > +	return 0;
> > +#else
> > +	return -EOPNOTSUPP;
> > +#endif
> > +}
> > +
> >   __bpf_kfunc_end_defs();
> >   
> >   int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags,
> > @@ -11844,6 +11932,10 @@ BTF_SET8_START(bpf_kfunc_check_set_sock_addr)
> >   BTF_ID_FLAGS(func, bpf_sock_addr_set_sun_path)
> >   BTF_SET8_END(bpf_kfunc_check_set_sock_addr)
> >   
> > +BTF_SET8_START(bpf_kfunc_check_set_tcp_reqsk)
> > +BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk)
> > +BTF_SET8_END(bpf_kfunc_check_set_tcp_reqsk)
> > +
> >   static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
> >   	.owner = THIS_MODULE,
> >   	.set = &bpf_kfunc_check_set_skb,
> > @@ -11859,6 +11951,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_addr = {
> >   	.set = &bpf_kfunc_check_set_sock_addr,
> >   };
> >   
> > +static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
> > +	.owner = THIS_MODULE,
> > +	.set = &bpf_kfunc_check_set_tcp_reqsk,
> > +};
> > +
> >   static int __init bpf_kfunc_init(void)
> >   {
> >   	int ret;
> > @@ -11874,8 +11971,9 @@ static int __init bpf_kfunc_init(void)
> >   	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
> >   	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_NETFILTER, &bpf_kfunc_set_skb);
> >   	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
> > -	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
> > -						&bpf_kfunc_set_sock_addr);
> > +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
> > +					       &bpf_kfunc_set_sock_addr);
> > +	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
> >   }
> >   late_initcall(bpf_kfunc_init);
> >   
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index fef349dd72fa..998950e97dfe 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -2579,8 +2579,18 @@ EXPORT_SYMBOL(sock_efree);
> >   #ifdef CONFIG_INET
> >   void sock_pfree(struct sk_buff *skb)
> >   {
> > -	if (sk_is_refcounted(skb->sk))
> > -		sock_gen_put(skb->sk);
> > +	struct sock *sk = skb->sk;
> > +
> > +	if (!sk_is_refcounted(sk))
> > +		return;
> > +
> > +	if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
> > +		inet_reqsk(sk)->rsk_listener = NULL;
> > +		reqsk_free(inet_reqsk(sk));
> > +		return;
> > +	}
> > +
> > +	sock_gen_put(sk);
> >   }
> >   EXPORT_SYMBOL(sock_pfree);
> >   #endif /* CONFIG_INET */



* Re: [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
  2023-12-06  1:29     ` Kuniyuki Iwashima
@ 2023-12-06  3:11       ` Martin KaFai Lau
  2023-12-06  5:58         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2023-12-06  3:11 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, netdev

On 12/5/23 5:29 PM, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Tue, 5 Dec 2023 16:19:20 -0800
>> On 12/4/23 5:34 PM, Kuniyuki Iwashima wrote:
>>> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
>>> index 61f1c96cfe63..0f9c3aed2014 100644
>>> --- a/net/ipv4/syncookies.c
>>> +++ b/net/ipv4/syncookies.c
>>> @@ -304,6 +304,59 @@ static int cookie_tcp_reqsk_init(struct sock *sk, struct sk_buff *skb,
>>>    	return 0;
>>>    }
>>>    
>>> +#if IS_ENABLED(CONFIG_BPF)
>>> +struct request_sock *cookie_bpf_check(struct net *net, struct sock *sk,
>>> +				      struct sk_buff *skb)
>>> +{
>>> +	struct request_sock *req = inet_reqsk(skb->sk);
>>> +	struct inet_request_sock *ireq = inet_rsk(req);
>>> +	struct tcp_request_sock *treq = tcp_rsk(req);
>>> +	struct tcp_options_received tcp_opt;
>>> +	int ret;
>>> +
>>> +	skb->sk = NULL;
>>> +	skb->destructor = NULL;
>>> +	req->rsk_listener = NULL;
>>> +
>>> +	memset(&tcp_opt, 0, sizeof(tcp_opt));
>>> +	tcp_parse_options(net, skb, &tcp_opt, 0, NULL);
>>
>> In patch 2, the bpf prog is passing the tcp_opt to the kfunc. The selftest in
>> patch 3 is also parsing the tcp-options.
>>
>> The kernel parses the tcp-options here again to do some checking and req member
>> initialization. Can this checking and initialization be done in the
>> bpf_sk_assign_tcp_reqsk() kfunc instead, to avoid the double tcp-option parsing?
> 
> If TS is not used as cookie storage, the bpf prog need not parse it.
> OTOH, if a value is encoded into TS, the bpf prog needs to parse it.
> In that case, we cannot avoid parsing options in the bpf prog.

If I read patch 2 correctly, the ireq->tstamp_ok is set by the kfunc, so I 
assume that the bpf prog has to parse the tcp-option.

Like the "if (ireq->tstamp_ok ^ tcp_opt.saw_tstamp)" test below, ireq->tstamp_ok 
will always be 0 if the bpf prog did not parse the tcp-option.

> 
> The parsing here comes from my paranoia, so... probably we can drop it
> and the first test below, and rely on the bpf prog's tcp_opt, especially
> tstamp_ok, rcv_tsval, and rcv_tsecr?

My preference is that it is clearer to allow the bpf prog to initialize the 
whole tcp_opt instead of only taking tcp_opt.tstamp_ok from the bpf prog but 
ignoring tcp_opt.rcv_tsval/tsecr. The kfunc will then use the tcp_opt to 
initialize the req.

It is also better to detect the following error cases as much as possible in the 
kfunc instead of failing later in the tcp stack, e.g. checking the sysctls should 
be doable in the kfunc.
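
For example, a sketch of the sysctl checks moved from cookie_bpf_check() into
bpf_sk_assign_tcp_reqsk(), reusing the names from the posted patch:

    struct net *net = sock_net(sk);

    /* fail early in the kfunc instead of in cookie_bpf_check() */
    if (attr->tcp_opt.tstamp_ok && !READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
        return -EINVAL;

    if (attr->tcp_opt.sack_ok && !READ_ONCE(net->ipv4.sysctl_tcp_sack))
        return -EINVAL;

    if (attr->tcp_opt.wscale_ok && !READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
        return -EINVAL;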

> 
> I placed the other tests here to align with the normal cookie flow, but
> they can be moved to the kfunc.  However, the initialisation assumes skb
> points to the TCP header, so here would be a better place, I think.
> 
> 
>>
>>> +
>>> +	if (ireq->tstamp_ok ^ tcp_opt.saw_tstamp) {
>>> +		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
>>> +		goto reset;
>>> +	}
>>> +
>>> +	__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
>>> +
>>> +	if (ireq->tstamp_ok) {
>>> +		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
>>> +			goto reset;
>>> +
>>> +		req->ts_recent = tcp_opt.rcv_tsval;
>>> +		treq->ts_off = tcp_opt.rcv_tsecr - tcp_ns_to_ts(false, tcp_clock_ns());
>>> +	}
>>> +
>>> +	if (ireq->sack_ok && !READ_ONCE(net->ipv4.sysctl_tcp_sack))
>>> +		goto reset;
>>> +
>>> +	if (ireq->wscale_ok && !READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
>>> +		goto reset;
>>> +
>>> +	ret = cookie_tcp_reqsk_init(sk, skb, req);
>>> +	if (ret) {
>>> +		reqsk_free(req);
>>> +		req = NULL;
>>> +	}
>>> +
>>> +	return req;
>>> +
>>> +reset:
>>> +	reqsk_free(req);
>>> +	return ERR_PTR(-EINVAL);
>>> +}
>>> +EXPORT_SYMBOL_GPL(cookie_bpf_check);
>>> +#endif



* Re: [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
  2023-12-06  3:11       ` Martin KaFai Lau
@ 2023-12-06  5:58         ` Kuniyuki Iwashima
  0 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-06  5:58 UTC (permalink / raw)
  To: martin.lau; +Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, kuniyu, netdev

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Tue, 5 Dec 2023 19:11:17 -0800
> On 12/5/23 5:29 PM, Kuniyuki Iwashima wrote:
> > From: Martin KaFai Lau <martin.lau@linux.dev>
> > Date: Tue, 5 Dec 2023 16:19:20 -0800
> >> On 12/4/23 5:34 PM, Kuniyuki Iwashima wrote:
> >>> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> >>> index 61f1c96cfe63..0f9c3aed2014 100644
> >>> --- a/net/ipv4/syncookies.c
> >>> +++ b/net/ipv4/syncookies.c
> >>> @@ -304,6 +304,59 @@ static int cookie_tcp_reqsk_init(struct sock *sk, struct sk_buff *skb,
> >>>    	return 0;
> >>>    }
> >>>    
> >>> +#if IS_ENABLED(CONFIG_BPF)
> >>> +struct request_sock *cookie_bpf_check(struct net *net, struct sock *sk,
> >>> +				      struct sk_buff *skb)
> >>> +{
> >>> +	struct request_sock *req = inet_reqsk(skb->sk);
> >>> +	struct inet_request_sock *ireq = inet_rsk(req);
> >>> +	struct tcp_request_sock *treq = tcp_rsk(req);
> >>> +	struct tcp_options_received tcp_opt;
> >>> +	int ret;
> >>> +
> >>> +	skb->sk = NULL;
> >>> +	skb->destructor = NULL;
> >>> +	req->rsk_listener = NULL;
> >>> +
> >>> +	memset(&tcp_opt, 0, sizeof(tcp_opt));
> >>> +	tcp_parse_options(net, skb, &tcp_opt, 0, NULL);
> >>
> >> In patch 2, the bpf prog is passing the tcp_opt to the kfunc. The selftest in
> >> patch 3 is also parsing the tcp-options.
> >>
> >> The kernel parses the tcp-options here again to do some checking and req member
> >> initialization. Can this checking and initialization be done in the
> >> bpf_sk_assign_tcp_reqsk() kfunc instead, to avoid the double tcp-option parsing?
> > 
> > If TS is not used as cookie storage, the bpf prog need not parse it.
> > OTOH, if a value is encoded into TS, the bpf prog needs to parse it.
> > In that case, we cannot avoid parsing options in the bpf prog.
> 
> If I read patch 2 correctly, the ireq->tstamp_ok is set by the kfunc, so I 
> assume that the bpf prog has to parse the tcp-option.
> 
> Like the "if (ireq->tstamp_ok ^ tcp_opt.saw_tstamp)" test below, ireq->tstamp_ok 
> will always be 0 if the bpf prog did not parse the tcp-option.

Ah sorry, I assumed the TS bit was encoded in SYN as disabled.
TCP option parsing is needed at least once for SYN, but we
need not do so for SYN+ACK if the TS bit is in the ISN.


> 
> > 
> > The parsing here comes from my paranoia, so... probably we can drop it
> > and the first test below, and rely on the bpf prog's tcp_opt, especially
> > tstamp_ok, rcv_tsval, and rcv_tsecr?
> 
> My preference is that it is clearer to allow the bpf prog to initialize the 
> whole tcp_opt instead of only taking tcp_opt.tstamp_ok from the bpf prog but 
> ignoring tcp_opt.rcv_tsval/tsecr. The kfunc will then use the tcp_opt to 
> initialize the req.

I'll drop the option parsing in the kernel and allow the bpf prog to fully
initialise tcp_opt.


> 
> It is also better to detect the following error cases as much as possible in the 
> kfunc instead of failing later in the tcp stack, e.g. checking the sysctls should 
> be doable in the kfunc.

Ok, I'll move the sysctl tests and the ts_off init to the kfunc.
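
Roughly, a sketch of that in the kfunc, based on the v4 cookie_bpf_check()
code (the final field names may differ):

    if (attr->tcp_opt.tstamp_ok) {
        req->ts_recent = attr->tcp_opt.rcv_tsval;
        treq->ts_off = attr->tcp_opt.rcv_tsecr -
                       tcp_ns_to_ts(attr->usec_ts_ok, tcp_clock_ns());
    }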


> 
> > 
> > I placed the other tests here to align with the normal cookie flow, but
> > they can be moved to the kfunc.  However, the initialisation assumes skb
> > points to the TCP header, so here would be a better place, I think.
> > 
> > 
> >>
> >>> +
> >>> +	if (ireq->tstamp_ok ^ tcp_opt.saw_tstamp) {
> >>> +		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
> >>> +		goto reset;
> >>> +	}
> >>> +
> >>> +	__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
> >>> +
> >>> +	if (ireq->tstamp_ok) {
> >>> +		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
> >>> +			goto reset;
> >>> +
> >>> +		req->ts_recent = tcp_opt.rcv_tsval;
> >>> +		treq->ts_off = tcp_opt.rcv_tsecr - tcp_ns_to_ts(false, tcp_clock_ns());
> >>> +	}
> >>> +
> >>> +	if (ireq->sack_ok && !READ_ONCE(net->ipv4.sysctl_tcp_sack))
> >>> +		goto reset;
> >>> +
> >>> +	if (ireq->wscale_ok && !READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
> >>> +		goto reset;
> >>> +
> >>> +	ret = cookie_tcp_reqsk_init(sk, skb, req);
> >>> +	if (ret) {
> >>> +		reqsk_free(req);
> >>> +		req = NULL;
> >>> +	}
> >>> +
> >>> +	return req;
> >>> +
> >>> +reset:
> >>> +	reqsk_free(req);
> >>> +	return ERR_PTR(-EINVAL);
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(cookie_bpf_check);
> >>> +#endif


* Re: [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
  2023-12-05  1:34 ` [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
  2023-12-05  2:13   ` Alexei Starovoitov
@ 2023-12-06  6:39   ` Martin KaFai Lau
  2023-12-07  6:56     ` Kuniyuki Iwashima
  1 sibling, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2023-12-06  6:39 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Kuniyuki Iwashima, bpf, netdev, Eric Dumazet, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko

On 12/4/23 5:34 PM, Kuniyuki Iwashima wrote:
> This commit adds a sample selftest to demonstrate how we can use
> bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.
> 
> The test creates IPv4/IPv6 x TCP/MPTCP connections and transfers
> messages over them on lo with a BPF tc prog attached.
> 
> The tc prog will process SYN and return SYN+ACK with the following
> ISN and TS.  In a real use case, this part will be done by other
> hosts.
> 
>          MSB                                   LSB
>    ISN:  | 31 ... 8 | 7 6 |   5 |    4 | 3 2 1 0 |
>          |   Hash_1 | MSS | ECN | SACK |  WScale |
> 
>    TS:   | 31 ... 8 |          7 ... 0           |
>          |   Random |           Hash_2           |

Thanks for the details. It is helpful.
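
As a sketch, validating the returned ACK against this layout unpacks the low
ISN bits and recompares the hash; the variable names below are hypothetical,
the actual code is in patch 3:

    __u32 isn = bpf_ntohl(tcp->ack_seq) - 1;    /* cookie sent in SYN+ACK */
    __u8 wscale = isn & 0xf;                    /* bits 3..0 */
    bool sack_ok = (isn >> 4) & 1;              /* bit 4 */
    bool ecn_ok = (isn >> 5) & 1;               /* bit 5 */
    __u8 mss_idx = (isn >> 6) & 0x3;            /* bits 7..6 */
    __u32 hash1 = isn >> 8;        /* must match the recomputed hash */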

> 
>    WScale in SYN is reused in SYN+ACK.
> 
> The client returns ACK, and tc prog will recalculate ISN and TS
> from ACK and validate SYN Cookie.
> 
> If it's valid, the prog calls kfunc to allocate a reqsk for skb and
> configure the reqsk based on the argument created from SYN Cookie.
> 
> Later, the reqsk will be processed in cookie_v[46]_check() to create
> a connection.
> 

[ ... ]

> +SEC("tc")
> +int tcp_custom_syncookie(struct __sk_buff *skb)
> +{
> +	struct tcp_syncookie ctx = {
> +		.skb = skb,
> +	};
> +
> +	if (tcp_load_headers(&ctx))
> +		return TC_ACT_OK;
> +
> +	if (ctx.tcp->rst)
> +		return TC_ACT_OK;
> +
> +	if (ctx.tcp->syn) {
> +		if (ctx.tcp->ack)
> +			return TC_ACT_OK;
> +
> +		return tcp_handle_syn(&ctx);
> +	}
> +
> +	return tcp_handle_ack(&ctx);

It may be useful to ensure tcp_handle_{syn,ack} are executed instead of the 
kernel doing the regular syncookie handling. A global variable (bool or counter) 
can be used by the prog_tests to check this.
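
A sketch of such a check, with hypothetical names:

    /* globals in the bpf prog, visible via the skeleton's bss */
    __u32 handled_syn, handled_ack;

    /* bumped in tcp_handle_syn()/tcp_handle_ack() on the handled path */
    __sync_fetch_and_add(&handled_syn, 1);

    /* in prog_tests, after the connections complete */
    ASSERT_GT(skel->bss->handled_syn, 0, "handled_syn");
    ASSERT_GT(skel->bss->handled_ack, 0, "handled_ack");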

> +}
> +
> +char _license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
> new file mode 100644
> index 000000000000..a401f59e46d8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
> @@ -0,0 +1,162 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright Amazon.com Inc. or its affiliates. */
> +
> +#ifndef _TEST_TCP_SYNCOOKIE_H
> +#define _TEST_TCP_SYNCOOKIE_H
> +
> +#define TC_ACT_OK	0
> +#define TC_ACT_SHOT	2
> +
> +#define ETH_ALEN	6
> +#define ETH_P_IP	0x0800
> +#define ETH_P_IPV6	0x86DD
> +
> +#define NEXTHDR_TCP	6
> +
> +#define TCPOPT_NOP		1
> +#define TCPOPT_EOL		0
> +#define TCPOPT_MSS		2
> +#define TCPOPT_WINDOW		3
> +#define TCPOPT_TIMESTAMP	8
> +#define TCPOPT_SACK_PERM	4
> +
> +#define TCPOLEN_MSS		4
> +#define TCPOLEN_WINDOW		3
> +#define TCPOLEN_TIMESTAMP	10
> +#define TCPOLEN_SACK_PERM	2

Some of the above are already in bpf_tracing_net.h. Move the missing 
ones to bpf_tracing_net.h as well.

> +#define BPF_F_CURRENT_NETNS	(-1)

This should be already in the vmlinux.h

> +
> +#define __packed __attribute__((__packed__))
> +#define __force
> +
> +#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
> +
> +#define swap(a, b)				\
> +	do {					\
> +		typeof(a) __tmp = (a);		\
> +		(a) = (b);			\
> +		(b) = __tmp;			\
> +	} while (0)
> +
> +#define swap_array(a, b)				\
> +	do {						\
> +		typeof(a) __tmp[sizeof(a)];		\
> +		__builtin_memcpy(__tmp, a, sizeof(a));	\
> +		__builtin_memcpy(a, b, sizeof(a));	\
> +		__builtin_memcpy(b, __tmp, sizeof(a));	\
> +	} while (0)
> +
> +/* asm-generic/unaligned.h */
> +#define __get_unaligned_t(type, ptr) ({						\
> +	const struct { type x; } __packed * __pptr = (typeof(__pptr))(ptr);	\
> +	__pptr->x;								\
> +})
> +
> +#define get_unaligned(ptr) __get_unaligned_t(typeof(*(ptr)), (ptr))
> +
> +static inline u16 get_unaligned_be16(const void *p)
> +{
> +	return bpf_ntohs(__get_unaligned_t(__be16, p));
> +}
> +
> +static inline u32 get_unaligned_be32(const void *p)
> +{
> +	return bpf_ntohl(__get_unaligned_t(__be32, p));
> +}
> +
> +/* lib/checksum.c */
> +static inline u32 from64to32(u64 x)
> +{
> +	/* add up 32-bit and 32-bit for 32+c bit */
> +	x = (x & 0xffffffff) + (x >> 32);
> +	/* add up carry.. */
> +	x = (x & 0xffffffff) + (x >> 32);
> +	return (u32)x;
> +}
> +
> +static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
> +					__u32 len, __u8 proto, __wsum sum)
> +{
> +	unsigned long long s = (__force u32)sum;
> +
> +	s += (__force u32)saddr;
> +	s += (__force u32)daddr;
> +#ifdef __BIG_ENDIAN
> +	s += proto + len;
> +#else
> +	s += (proto + len) << 8;
> +#endif
> +	return (__force __wsum)from64to32(s);
> +}
> +
> +/* asm-generic/checksum.h */
> +static inline __sum16 csum_fold(__wsum csum)
> +{
> +	u32 sum = (__force u32)csum;
> +
> +	sum = (sum & 0xffff) + (sum >> 16);
> +	sum = (sum & 0xffff) + (sum >> 16);
> +	return (__force __sum16)~sum;
> +}
> +
> +static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
> +					__u8 proto, __wsum sum)
> +{
> +	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
> +}
> +
> +/* net/ipv6/ip6_checksum.c */
> +static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> +				      const struct in6_addr *daddr,
> +				      __u32 len, __u8 proto, __wsum csum)
> +{
> +	int carry;
> +	__u32 ulen;
> +	__u32 uproto;
> +	__u32 sum = (__force u32)csum;
> +
> +	sum += (__force u32)saddr->in6_u.u6_addr32[0];
> +	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[0]);
> +	sum += carry;
> +
> +	sum += (__force u32)saddr->in6_u.u6_addr32[1];
> +	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[1]);
> +	sum += carry;
> +
> +	sum += (__force u32)saddr->in6_u.u6_addr32[2];
> +	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[2]);
> +	sum += carry;
> +
> +	sum += (__force u32)saddr->in6_u.u6_addr32[3];
> +	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[3]);
> +	sum += carry;
> +
> +	sum += (__force u32)daddr->in6_u.u6_addr32[0];
> +	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[0]);
> +	sum += carry;
> +
> +	sum += (__force u32)daddr->in6_u.u6_addr32[1];
> +	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[1]);
> +	sum += carry;
> +
> +	sum += (__force u32)daddr->in6_u.u6_addr32[2];
> +	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[2]);
> +	sum += carry;
> +
> +	sum += (__force u32)daddr->in6_u.u6_addr32[3];
> +	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[3]);
> +	sum += carry;
> +
> +	ulen = (__force u32)bpf_htonl((__u32)len);
> +	sum += ulen;
> +	carry = (sum < ulen);
> +	sum += carry;
> +
> +	uproto = (__force u32)bpf_htonl(proto);
> +	sum += uproto;
> +	carry = (sum < uproto);
> +	sum += carry;
> +
> +	return csum_fold((__force __wsum)sum);
> +}

The above helpers are useful for other tests, so it makes sense for them to stay 
in test_tcp_custom_syncookie.h; e.g. in the future, some of the duplicated helpers 
in xdp_synproxy_kern.c can be removed.

> +#endif



* Re: [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
  2023-12-06  6:39   ` Martin KaFai Lau
@ 2023-12-07  6:56     ` Kuniyuki Iwashima
  2023-12-09  1:27       ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: Kuniyuki Iwashima @ 2023-12-07  6:56 UTC (permalink / raw)
  To: martin.lau; +Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, kuniyu, netdev

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Tue, 5 Dec 2023 22:39:33 -0800
> On 12/4/23 5:34 PM, Kuniyuki Iwashima wrote:
> > This commit adds a sample selftest to demonstrate how we can use
> > bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.
> > 
> > The test creates IPv4/IPv6 x TCP/MPTCP connections and transfers
> > messages over them on lo with a BPF tc prog attached.
> > 
> > The tc prog will process SYN and return SYN+ACK with the following
> > ISN and TS.  In a real use case, this part will be done by other
> > hosts.
> > 
> >          MSB                                   LSB
> >    ISN:  | 31 ... 8 | 7 6 |   5 |    4 | 3 2 1 0 |
> >          |   Hash_1 | MSS | ECN | SACK |  WScale |
> > 
> >    TS:   | 31 ... 8 |          7 ... 0           |
> >          |   Random |           Hash_2           |
> 
> Thanks for the details. It is helpful.
> 
> > 
> >    WScale in SYN is reused in SYN+ACK.
> > 
> > The client returns ACK, and tc prog will recalculate ISN and TS
> > from ACK and validate SYN Cookie.
> > 
> > If it's valid, the prog calls kfunc to allocate a reqsk for skb and
> > configure the reqsk based on the argument created from SYN Cookie.
> > 
> > Later, the reqsk will be processed in cookie_v[46]_check() to create
> > a connection.
> > 
> 
> [ ... ]
> 
> > +SEC("tc")
> > +int tcp_custom_syncookie(struct __sk_buff *skb)
> > +{
> > +	struct tcp_syncookie ctx = {
> > +		.skb = skb,
> > +	};
> > +
> > +	if (tcp_load_headers(&ctx))
> > +		return TC_ACT_OK;
> > +
> > +	if (ctx.tcp->rst)
> > +		return TC_ACT_OK;
> > +
> > +	if (ctx.tcp->syn) {
> > +		if (ctx.tcp->ack)
> > +			return TC_ACT_OK;
> > +
> > +		return tcp_handle_syn(&ctx);
> > +	}
> > +
> > +	return tcp_handle_ack(&ctx);
> 
> It may be useful to ensure tcp_handle_{syn,ack} are executed instead of the 
> kernel doing the regular syncookie handling. A global variable (bool or counter) 
> can be used by the prog_tests to check this.

Sure, will add that check.


> 
> > +}
> > +
> > +char _license[] SEC("license") = "GPL";
> > diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
> > new file mode 100644
> > index 000000000000..a401f59e46d8
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.h
> > @@ -0,0 +1,162 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright Amazon.com Inc. or its affiliates. */
> > +
> > +#ifndef _TEST_TCP_SYNCOOKIE_H
> > +#define _TEST_TCP_SYNCOOKIE_H
> > +
> > +#define TC_ACT_OK	0
> > +#define TC_ACT_SHOT	2
> > +
> > +#define ETH_ALEN	6
> > +#define ETH_P_IP	0x0800
> > +#define ETH_P_IPV6	0x86DD
> > +
> > +#define NEXTHDR_TCP	6
> > +
> > +#define TCPOPT_NOP		1
> > +#define TCPOPT_EOL		0
> > +#define TCPOPT_MSS		2
> > +#define TCPOPT_WINDOW		3
> > +#define TCPOPT_TIMESTAMP	8
> > +#define TCPOPT_SACK_PERM	4
> > +
> > +#define TCPOLEN_MSS		4
> > +#define TCPOLEN_WINDOW		3
> > +#define TCPOLEN_TIMESTAMP	10
> > +#define TCPOLEN_SACK_PERM	2
> 
> Some of the above are already in bpf_tracing_net.h. Move the missing 
> ones to bpf_tracing_net.h as well.

Will move it.


> 
> > +#define BPF_F_CURRENT_NETNS	(-1)
> 
> This should be already in the vmlinux.h

I thought so, but the kernel robot complained ...
https://lore.kernel.org/bpf/202311222353.3MM8wxm0-lkp@intel.com/

> 
> > +
> > +#define __packed __attribute__((__packed__))
> > +#define __force
> > +
> > +#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
> > +
> > +#define swap(a, b)				\
> > +	do {					\
> > +		typeof(a) __tmp = (a);		\
> > +		(a) = (b);			\
> > +		(b) = __tmp;			\
> > +	} while (0)
> > +
> > +#define swap_array(a, b)				\
> > +	do {						\
> > +		typeof(a) __tmp[sizeof(a)];		\
> > +		__builtin_memcpy(__tmp, a, sizeof(a));	\
> > +		__builtin_memcpy(a, b, sizeof(a));	\
> > +		__builtin_memcpy(b, __tmp, sizeof(a));	\
> > +	} while (0)
> > +
> > +/* asm-generic/unaligned.h */
> > +#define __get_unaligned_t(type, ptr) ({						\
> > +	const struct { type x; } __packed * __pptr = (typeof(__pptr))(ptr);	\
> > +	__pptr->x;								\
> > +})
> > +
> > +#define get_unaligned(ptr) __get_unaligned_t(typeof(*(ptr)), (ptr))
> > +
> > +static inline u16 get_unaligned_be16(const void *p)
> > +{
> > +	return bpf_ntohs(__get_unaligned_t(__be16, p));
> > +}
> > +
> > +static inline u32 get_unaligned_be32(const void *p)
> > +{
> > +	return bpf_ntohl(__get_unaligned_t(__be32, p));
> > +}
> > +
> > +/* lib/checksum.c */
> > +static inline u32 from64to32(u64 x)
> > +{
> > +	/* add up 32-bit and 32-bit for 32+c bit */
> > +	x = (x & 0xffffffff) + (x >> 32);
> > +	/* add up carry.. */
> > +	x = (x & 0xffffffff) + (x >> 32);
> > +	return (u32)x;
> > +}
> > +
> > +static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
> > +					__u32 len, __u8 proto, __wsum sum)
> > +{
> > +	unsigned long long s = (__force u32)sum;
> > +
> > +	s += (__force u32)saddr;
> > +	s += (__force u32)daddr;
> > +#ifdef __BIG_ENDIAN
> > +	s += proto + len;
> > +#else
> > +	s += (proto + len) << 8;
> > +#endif
> > +	return (__force __wsum)from64to32(s);
> > +}
> > +
> > +/* asm-generic/checksum.h */
> > +static inline __sum16 csum_fold(__wsum csum)
> > +{
> > +	u32 sum = (__force u32)csum;
> > +
> > +	sum = (sum & 0xffff) + (sum >> 16);
> > +	sum = (sum & 0xffff) + (sum >> 16);
> > +	return (__force __sum16)~sum;
> > +}
> > +
> > +static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
> > +					__u8 proto, __wsum sum)
> > +{
> > +	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
> > +}
> > +
> > +/* net/ipv6/ip6_checksum.c */
> > +static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> > +				      const struct in6_addr *daddr,
> > +				      __u32 len, __u8 proto, __wsum csum)
> > +{
> > +	int carry;
> > +	__u32 ulen;
> > +	__u32 uproto;
> > +	__u32 sum = (__force u32)csum;
> > +
> > +	sum += (__force u32)saddr->in6_u.u6_addr32[0];
> > +	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[0]);
> > +	sum += carry;
> > +
> > +	sum += (__force u32)saddr->in6_u.u6_addr32[1];
> > +	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[1]);
> > +	sum += carry;
> > +
> > +	sum += (__force u32)saddr->in6_u.u6_addr32[2];
> > +	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[2]);
> > +	sum += carry;
> > +
> > +	sum += (__force u32)saddr->in6_u.u6_addr32[3];
> > +	carry = (sum < (__force u32)saddr->in6_u.u6_addr32[3]);
> > +	sum += carry;
> > +
> > +	sum += (__force u32)daddr->in6_u.u6_addr32[0];
> > +	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[0]);
> > +	sum += carry;
> > +
> > +	sum += (__force u32)daddr->in6_u.u6_addr32[1];
> > +	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[1]);
> > +	sum += carry;
> > +
> > +	sum += (__force u32)daddr->in6_u.u6_addr32[2];
> > +	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[2]);
> > +	sum += carry;
> > +
> > +	sum += (__force u32)daddr->in6_u.u6_addr32[3];
> > +	carry = (sum < (__force u32)daddr->in6_u.u6_addr32[3]);
> > +	sum += carry;
> > +
> > +	ulen = (__force u32)bpf_htonl((__u32)len);
> > +	sum += ulen;
> > +	carry = (sum < ulen);
> > +	sum += carry;
> > +
> > +	uproto = (__force u32)bpf_htonl(proto);
> > +	sum += uproto;
> > +	carry = (sum < uproto);
> > +	sum += carry;
> > +
> > +	return csum_fold((__force __wsum)sum);
> > +}
> 
> The above helpers are useful for other tests, so it makes sense for them to stay 
> in test_tcp_custom_syncookie.h; e.g. in the future, some of the duplicated helpers 
> in xdp_synproxy_kern.c can be removed.

Will post a follow-up patch after this series is merged.

Thanks!


* Re: [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
  2023-12-07  6:56     ` Kuniyuki Iwashima
@ 2023-12-09  1:27       ` Martin KaFai Lau
  0 siblings, 0 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2023-12-09  1:27 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: andrii, ast, bpf, daniel, edumazet, kuni1840, netdev

On 12/6/23 10:56 PM, Kuniyuki Iwashima wrote:
>>> +#define BPF_F_CURRENT_NETNS	(-1)
>> This should be already in the vmlinux.h
> I thought so, but the kernel robot complained ...
> https://lore.kernel.org/bpf/202311222353.3MM8wxm0-lkp@intel.com/

Maybe BPF_F_CURRENT_NETNS is not used in net/core/filter.c and thus does
not get into the vmlinux BTF...

Just pass -1 for now.
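
e.g. a sketch, assuming the prog looks up the listener with
bpf_skc_lookup_tcp() (the tuple variable here is hypothetical):

    /* -1 == BPF_F_CURRENT_NETNS, open-coded until it is in vmlinux.h */
    sk = bpf_skc_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), -1, 0);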


end of thread, newest: ~2023-12-09  1:27 UTC

Thread overview: 15+ messages
2023-12-05  1:34 [PATCH v4 bpf-next 0/3] bpf: tcp: Support arbitrary SYN Cookie at TC Kuniyuki Iwashima
2023-12-05  1:34 ` [PATCH v4 bpf-next 1/3] bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check() Kuniyuki Iwashima
2023-12-06  0:19   ` Martin KaFai Lau
2023-12-06  1:29     ` Kuniyuki Iwashima
2023-12-06  3:11       ` Martin KaFai Lau
2023-12-06  5:58         ` Kuniyuki Iwashima
2023-12-05  1:34 ` [PATCH v4 bpf-next 2/3] bpf: tcp: Support arbitrary SYN Cookie Kuniyuki Iwashima
2023-12-06  1:20   ` Martin KaFai Lau
2023-12-06  1:42     ` Kuniyuki Iwashima
2023-12-05  1:34 ` [PATCH v4 bpf-next 3/3] selftest: bpf: Test bpf_sk_assign_tcp_reqsk() Kuniyuki Iwashima
2023-12-05  2:13   ` Alexei Starovoitov
2023-12-05  3:00     ` Kuniyuki Iwashima
2023-12-06  6:39   ` Martin KaFai Lau
2023-12-07  6:56     ` Kuniyuki Iwashima
2023-12-09  1:27       ` Martin KaFai Lau
