* [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
@ 2023-05-25  8:19 Lorenz Bauer
  2023-05-25 13:24 ` Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Lorenz Bauer @ 2023-05-25  8:19 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David Ahern,
	Willem de Bruijn, Joe Stringer
  Cc: Lorenz Bauer, Joe Stringer, Martin KaFai Lau, netdev,
	linux-kernel, bpf
Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy, which
configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
TPROXY from our setup.
The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
helpers don't execute SK_REUSEPORT programs. Instead, one of the
reuseport sockets is selected by hash. This could cause dispatch to the
"wrong" socket:
    sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
    bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.
Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.
Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe6 ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Cc: Joe Stringer <joe@cilium.io>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
---
 include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
 include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
 include/net/sock.h             |  7 +++++--
 include/uapi/linux/bpf.h       |  3 ---
 net/core/filter.c              |  2 --
 net/ipv4/inet_hashtables.c     | 15 +++++++-------
 net/ipv4/udp.c                 | 23 +++++++++++++++++++---
 net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
 net/ipv6/udp.c                 | 23 +++++++++++++++++++---
 tools/include/uapi/linux/bpf.h |  3 ---
 10 files changed, 119 insertions(+), 39 deletions(-)
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 56f1286583d3..3ba4dc2703da 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
 					const u16 hnum, const int dif,
 					const int sdif);
 
+struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
+				    struct sk_buff *skb, int doff,
+				    const struct in6_addr *saddr,
+				    __be16 sport,
+				    const struct in6_addr *daddr,
+				    unsigned short hnum);
+
 struct sock *inet6_lookup_listener(struct net *net,
 				   struct inet_hashinfo *hashinfo,
 				   struct sk_buff *skb, int doff,
@@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
 					      int iif, int sdif,
 					      bool *refcounted)
 {
-	struct sock *sk = skb_steal_sock(skb, refcounted);
-
+	bool prefetched;
+	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
+	struct net *net = dev_net(skb_dst(skb)->dev);
+	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+
+	if (prefetched) {
+		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
+							       &ip6h->saddr, sport,
+							       &ip6h->daddr, ntohs(dport));
+		if (reuse_sk) {
+			if (reuse_sk != sk) {
+				if (*refcounted) {
+					sock_put(sk);
+					*refcounted = false;
+				}
+				if (IS_ERR(reuse_sk))
+					return NULL;
+			}
+			return reuse_sk;
+		}
+	}
 	if (sk)
 		return sk;
 
-	return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
-			      doff, &ipv6_hdr(skb)->saddr, sport,
-			      &ipv6_hdr(skb)->daddr, ntohs(dport),
+	return __inet6_lookup(net, hashinfo, skb,
+			      doff, &ip6h->saddr, sport,
+			      &ip6h->daddr, ntohs(dport),
 			      iif, sdif, refcounted);
 }
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 99bd823e97f6..c2af195ca71f 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -379,6 +379,11 @@ struct sock *__inet_lookup_established(struct net *net,
 				       const __be32 daddr, const u16 hnum,
 				       const int dif, const int sdif);
 
+struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
+				   struct sk_buff *skb, int doff,
+				   __be32 saddr, __be16 sport,
+				   __be32 daddr, unsigned short hnum);
+
 static inline struct sock *
 	inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
 				const __be32 saddr, const __be16 sport,
@@ -436,13 +441,31 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
 					     const int sdif,
 					     bool *refcounted)
 {
-	struct sock *sk = skb_steal_sock(skb, refcounted);
+	bool prefetched;
+	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
+	struct net *net = dev_net(skb_dst(skb)->dev);
 	const struct iphdr *iph = ip_hdr(skb);
 
+	if (prefetched) {
+		struct sock *reuse_sk = inet_lookup_reuseport(net, sk, skb, doff,
+							      iph->saddr, sport,
+							      iph->daddr, ntohs(dport));
+		if (reuse_sk) {
+			if (reuse_sk != sk) {
+				if (*refcounted) {
+					sock_put(sk);
+					*refcounted = false;
+				}
+				if (IS_ERR(reuse_sk))
+					return NULL;
+			}
+			return reuse_sk;
+		}
+	}
 	if (sk)
 		return sk;
 
-	return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
+	return __inet_lookup(net, hashinfo, skb,
 			     doff, iph->saddr, sport,
 			     iph->daddr, dport, inet_iif(skb), sdif,
 			     refcounted);
diff --git a/include/net/sock.h b/include/net/sock.h
index 656ea89f60ff..5645570c2a64 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2806,20 +2806,23 @@ sk_is_refcounted(struct sock *sk)
  * skb_steal_sock - steal a socket from an sk_buff
  * @skb: sk_buff to steal the socket from
  * @refcounted: is set to true if the socket is reference-counted
+ * @prefetched: is set to true if the socket was assigned from bpf
  */
 static inline struct sock *
-skb_steal_sock(struct sk_buff *skb, bool *refcounted)
+skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
 {
 	if (skb->sk) {
 		struct sock *sk = skb->sk;
 
 		*refcounted = true;
-		if (skb_sk_is_prefetched(skb))
+		*prefetched = skb_sk_is_prefetched(skb);
+		if (*prefetched)
 			*refcounted = sk_is_refcounted(sk);
 		skb->destructor = NULL;
 		skb->sk = NULL;
 		return sk;
 	}
+	*prefetched = false;
 	*refcounted = false;
 	return NULL;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1bb11a6ee667..2af606a525db 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4144,9 +4144,6 @@ union bpf_attr {
  *		**-EOPNOTSUPP** if the operation is not supported, for example
  *		a call from outside of TC ingress.
  *
- *		**-ESOCKTNOSUPPORT** if the socket type is not supported
- *		(reuseport).
- *
  * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
  *	Description
  *		Helper is overloaded depending on BPF program type. This
diff --git a/net/core/filter.c b/net/core/filter.c
index 968139f4a1ac..5f451260849b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7265,8 +7265,6 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
 		return -EOPNOTSUPP;
 	if (unlikely(dev_net(skb->dev) != sock_net(sk)))
 		return -ENETUNREACH;
-	if (unlikely(sk_fullsock(sk) && sk->sk_reuseport))
-		return -ESOCKTNOSUPPORT;
 	if (sk_is_refcounted(sk) &&
 	    unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
 		return -ENOENT;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index e7391bf310a7..920131e4a65d 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
-static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
-					    struct sk_buff *skb, int doff,
-					    __be32 saddr, __be16 sport,
-					    __be32 daddr, unsigned short hnum)
+struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
+				   struct sk_buff *skb, int doff,
+				   __be32 saddr, __be16 sport,
+				   __be32 daddr, unsigned short hnum)
 {
 	struct sock *reuse_sk = NULL;
 	u32 phash;
@@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
 	}
 	return reuse_sk;
 }
+EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
 
 /*
  * Here are some nice properties to exploit here. The BSD API
@@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
 	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
 		score = compute_score(sk, net, hnum, daddr, dif, sdif);
 		if (score > hiscore) {
-			result = lookup_reuseport(net, sk, skb, doff,
-						  saddr, sport, daddr, hnum);
+			result = inet_lookup_reuseport(net, sk, skb, doff,
+						       saddr, sport, daddr, hnum);
 			if (result)
 				return result;
 
@@ -399,7 +400,7 @@ static inline struct sock *inet_lookup_run_bpf(struct net *net,
 	if (no_reuseport || IS_ERR_OR_NULL(sk))
 		return sk;
 
-	reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
+	reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
 	if (reuse_sk)
 		sk = reuse_sk;
 	return sk;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6893fb867529..c67253386a38 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2426,7 +2426,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct rtable *rt = skb_rtable(skb);
 	__be32 saddr, daddr;
 	struct net *net = dev_net(skb->dev);
-	bool refcounted;
+	bool refcounted, prefetched;
 	int drop_reason;
 
 	drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
@@ -2455,11 +2455,28 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (udp4_csum_init(skb, uh, proto))
 		goto csum_error;
 
-	sk = skb_steal_sock(skb, &refcounted);
+	sk = skb_steal_sock(skb, &refcounted, &prefetched);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
+		if (prefetched) {
+			struct sock *reuse_sk = lookup_reuseport(net, sk, skb,
+								 saddr, uh->source,
+								 daddr, ntohs(uh->dest));
+			if (reuse_sk) {
+				if (reuse_sk != sk) {
+					if (refcounted) {
+						sock_put(sk);
+						refcounted = false;
+					}
+					if (IS_ERR(reuse_sk))
+						goto no_sk;
+				}
+				sk = reuse_sk;
+			}
+		}
+
 		if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
 			udp_sk_rx_dst_set(sk, dst);
 
@@ -2476,7 +2493,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
 	if (sk)
 		return udp_unicast_rcv_skb(sk, skb, uh);
-
+no_sk:
 	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
 		goto drop;
 	nf_reset_ct(skb);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index b64b49012655..b7c56867314e 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -111,12 +111,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
-static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
-					    struct sk_buff *skb, int doff,
-					    const struct in6_addr *saddr,
-					    __be16 sport,
-					    const struct in6_addr *daddr,
-					    unsigned short hnum)
+struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
+				    struct sk_buff *skb, int doff,
+				    const struct in6_addr *saddr,
+				    __be16 sport,
+				    const struct in6_addr *daddr,
+				    unsigned short hnum)
 {
 	struct sock *reuse_sk = NULL;
 	u32 phash;
@@ -127,6 +127,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
 	}
 	return reuse_sk;
 }
+EXPORT_SYMBOL_GPL(inet6_lookup_reuseport);
 
 /* called with rcu_read_lock() */
 static struct sock *inet6_lhash2_lookup(struct net *net,
@@ -143,8 +144,8 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
 	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
 		score = compute_score(sk, net, hnum, daddr, dif, sdif);
 		if (score > hiscore) {
-			result = lookup_reuseport(net, sk, skb, doff,
-						  saddr, sport, daddr, hnum);
+			result = inet6_lookup_reuseport(net, sk, skb, doff,
+							saddr, sport, daddr, hnum);
 			if (result)
 				return result;
 
@@ -175,7 +176,7 @@ static inline struct sock *inet6_lookup_run_bpf(struct net *net,
 	if (no_reuseport || IS_ERR_OR_NULL(sk))
 		return sk;
 
-	reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
+	reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
 	if (reuse_sk)
 		sk = reuse_sk;
 	return sk;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e5a337e6b970..3fede8ec95c4 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -949,7 +949,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct net *net = dev_net(skb->dev);
 	struct udphdr *uh;
 	struct sock *sk;
-	bool refcounted;
+	bool refcounted, prefetched;
 	u32 ulen = 0;
 
 	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
@@ -986,11 +986,28 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		goto csum_error;
 
 	/* Check if the socket is already available, e.g. due to early demux */
-	sk = skb_steal_sock(skb, &refcounted);
+	sk = skb_steal_sock(skb, &refcounted, &prefetched);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
+		if (prefetched) {
+			struct sock *reuse_sk = lookup_reuseport(net, sk, skb,
+								 saddr, uh->source,
+								 daddr, ntohs(uh->dest));
+			if (reuse_sk) {
+				if (reuse_sk != sk) {
+					if (refcounted) {
+						sock_put(sk);
+						refcounted = false;
+					}
+					if (IS_ERR(reuse_sk))
+						goto no_sk;
+				}
+				sk = reuse_sk;
+			}
+		}
+
 		if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
 			udp6_sk_rx_dst_set(sk, dst);
 
@@ -1020,7 +1037,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			goto report_csum_error;
 		return udp6_unicast_rcv_skb(sk, skb, uh);
 	}
-
+no_sk:
 	reason = SKB_DROP_REASON_NO_SOCKET;
 
 	if (!uh->check)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1bb11a6ee667..2af606a525db 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4144,9 +4144,6 @@ union bpf_attr {
  *		**-EOPNOTSUPP** if the operation is not supported, for example
  *		a call from outside of TC ingress.
  *
- *		**-ESOCKTNOSUPPORT** if the socket type is not supported
- *		(reuseport).
- *
  * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
  *	Description
  *		Helper is overloaded depending on BPF program type. This
-- 
2.40.1
^ permalink raw reply related	[flat|nested] 10+ messages in thread- * Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
@ 2023-05-25 13:24 ` Eric Dumazet
  2023-05-25 19:51   ` Daniel Borkmann
  2023-05-25 17:41 ` Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2023-05-25 13:24 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, David Ahern, Willem de Bruijn, Joe Stringer,
	Joe Stringer, Martin KaFai Lau, netdev, linux-kernel, bpf
On Thu, May 25, 2023 at 10:19 AM Lorenz Bauer <lmb@isovalent.com> wrote:
>
> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
> sockets. This means we can't use the helper to steer traffic to Envoy, which
> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
> TPROXY from our setup.
>
> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
> helpers don't execute SK_REUSEPORT programs. Instead, one of the
> reuseport sockets is selected by hash. This could cause dispatch to the
> "wrong" socket:
>
>     sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>     bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>
> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
> helpers unfortunately. In the tc context, L2 headers are at the start
> of the skb, while SK_REUSEPORT expects L3 headers instead.
>
> Instead, we execute the SK_REUSEPORT program when the assigned socket
> is pulled out of the skb, further up the stack. This creates some
> trickiness with regards to refcounting as bpf_sk_assign will put both
> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
> freed. We can infer that the sk_assigned socket is RCU freed if the
> reuseport lookup succeeds, but convincing yourself of this fact isn't
> straight forward. Therefore we defensively check refcounting on the
> sk_assign sock even though it's probably not required in practice.
>
> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
> Fixes: cf7fbe6 ("bpf: Add socket assign support")
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
> Cc: Joe Stringer <joe@cilium.io>
> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
> ---
>  include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>  include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
>  include/net/sock.h             |  7 +++++--
>  include/uapi/linux/bpf.h       |  3 ---
>  net/core/filter.c              |  2 --
>  net/ipv4/inet_hashtables.c     | 15 +++++++-------
>  net/ipv4/udp.c                 | 23 +++++++++++++++++++---
>  net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
>  net/ipv6/udp.c                 | 23 +++++++++++++++++++---
>  tools/include/uapi/linux/bpf.h |  3 ---
>  10 files changed, 119 insertions(+), 39 deletions(-)
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e7391bf310a7..920131e4a65d 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
>         return score;
>  }
>
> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
> -                                           struct sk_buff *skb, int doff,
> -                                           __be32 saddr, __be16 sport,
> -                                           __be32 daddr, unsigned short hnum)
> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> +                                  struct sk_buff *skb, int doff,
> +                                  __be32 saddr, __be16 sport,
> +                                  __be32 daddr, unsigned short hnum)
>  {
>         struct sock *reuse_sk = NULL;
>         u32 phash;
> @@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>         }
>         return reuse_sk;
>  }
> +EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
>
>  /*
>   * Here are some nice properties to exploit here. The BSD API
> @@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
>         sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>                 score = compute_score(sk, net, hnum, daddr, dif, sdif);
>                 if (score > hiscore) {
> -                       result = lookup_reuseport(net, sk, skb, doff,
> -                                                 saddr, sport, daddr, hnum);
> +                       result = inet_lookup_reuseport(net, sk, skb, doff,
> +                                                      saddr, sport, daddr, hnum);
>                         if (result)
>                                 return result;
>
Please split in a series.
First a patch renaming lookup_reuseport() to inet_lookup_reuseport()
and inet6_lookup_reuseport()
(cleanup, no change in behavior)
This would ease review and future bug hunting quite a bit.
^ permalink raw reply	[flat|nested] 10+ messages in thread
- * Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25 13:24 ` Eric Dumazet
@ 2023-05-25 19:51   ` Daniel Borkmann
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Borkmann @ 2023-05-25 19:51 UTC (permalink / raw)
  To: Eric Dumazet, Lorenz Bauer
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, Willem de Bruijn, Joe Stringer, Joe Stringer,
	Martin KaFai Lau, netdev, linux-kernel, bpf
On 5/25/23 3:24 PM, Eric Dumazet wrote:
> On Thu, May 25, 2023 at 10:19 AM Lorenz Bauer <lmb@isovalent.com> wrote:
>>
>> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
>> sockets. This means we can't use the helper to steer traffic to Envoy, which
>> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
>> TPROXY from our setup.
>>
>> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
>> helpers don't execute SK_REUSEPORT programs. Instead, one of the
>> reuseport sockets is selected by hash. This could cause dispatch to the
>> "wrong" socket:
>>
>>      sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>>      bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>>
>> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
>> helpers unfortunately. In the tc context, L2 headers are at the start
>> of the skb, while SK_REUSEPORT expects L3 headers instead.
>>
>> Instead, we execute the SK_REUSEPORT program when the assigned socket
>> is pulled out of the skb, further up the stack. This creates some
>> trickiness with regards to refcounting as bpf_sk_assign will put both
>> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
>> freed. We can infer that the sk_assigned socket is RCU freed if the
>> reuseport lookup succeeds, but convincing yourself of this fact isn't
>> straight forward. Therefore we defensively check refcounting on the
>> sk_assign sock even though it's probably not required in practice.
>>
>> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
>> Fixes: cf7fbe6 ("bpf: Add socket assign support")
>> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
>> Cc: Joe Stringer <joe@cilium.io>
>> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
>> ---
>>   include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>>   include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
>>   include/net/sock.h             |  7 +++++--
>>   include/uapi/linux/bpf.h       |  3 ---
>>   net/core/filter.c              |  2 --
>>   net/ipv4/inet_hashtables.c     | 15 +++++++-------
>>   net/ipv4/udp.c                 | 23 +++++++++++++++++++---
>>   net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
>>   net/ipv6/udp.c                 | 23 +++++++++++++++++++---
>>   tools/include/uapi/linux/bpf.h |  3 ---
>>   10 files changed, 119 insertions(+), 39 deletions(-)
> 
> 
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index e7391bf310a7..920131e4a65d 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
>>          return score;
>>   }
>>
>> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>> -                                           struct sk_buff *skb, int doff,
>> -                                           __be32 saddr, __be16 sport,
>> -                                           __be32 daddr, unsigned short hnum)
>> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
>> +                                  struct sk_buff *skb, int doff,
>> +                                  __be32 saddr, __be16 sport,
>> +                                  __be32 daddr, unsigned short hnum)
>>   {
>>          struct sock *reuse_sk = NULL;
>>          u32 phash;
>> @@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>>          }
>>          return reuse_sk;
>>   }
>> +EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
>>
>>   /*
>>    * Here are some nice properties to exploit here. The BSD API
>> @@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
>>          sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>>                  score = compute_score(sk, net, hnum, daddr, dif, sdif);
>>                  if (score > hiscore) {
>> -                       result = lookup_reuseport(net, sk, skb, doff,
>> -                                                 saddr, sport, daddr, hnum);
>> +                       result = inet_lookup_reuseport(net, sk, skb, doff,
>> +                                                      saddr, sport, daddr, hnum);
>>                          if (result)
>>                                  return result;
>>
> 
> Please split in a series.
> 
> First a patch renaming lookup_reuseport() to inet_lookup_reuseport()
> and inet6_lookup_reuseport()
> (cleanup, no change in behavior)
> 
> This would ease review and future bug hunting quite a bit.
Makes sense and should reduce the churn on the actual change.
I think Lorenz is planning to flush out a v2 next week with this split.
Thanks,
Daniel
^ permalink raw reply	[flat|nested] 10+ messages in thread
 
- * [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
  2023-05-25 13:24 ` Eric Dumazet
@ 2023-05-25 17:41 ` Kuniyuki Iwashima
  2023-05-25 20:01   ` Daniel Borkmann
  2023-05-25 23:42 ` Martin KaFai Lau
  2023-05-26  5:56 ` Joe Stringer
  3 siblings, 1 reply; 10+ messages in thread
From: Kuniyuki Iwashima @ 2023-05-25 17:41 UTC (permalink / raw)
  To: lmb
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo, joe,
	joe, john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	martin.lau, netdev, pabeni, sdf, song, willemdebruijn.kernel, yhs,
	kuniyu
From: Lorenz Bauer <lmb@isovalent.com>
Date: Thu, 25 May 2023 09:19:22 +0100
> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
> sockets. This means we can't use the helper to steer traffic to Envoy, which
> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
> TPROXY from our setup.
> 
> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
> helpers don't execute SK_REUSEPORT programs. Instead, one of the
> reuseport sockets is selected by hash. This could cause dispatch to the
> "wrong" socket:
> 
>     sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>     bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
> 
> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
> helpers unfortunately. In the tc context, L2 headers are at the start
> of the skb, while SK_REUSEPORT expects L3 headers instead.
> 
> Instead, we execute the SK_REUSEPORT program when the assigned socket
> is pulled out of the skb, further up the stack. This creates some
> trickiness with regards to refcounting as bpf_sk_assign will put both
> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
> freed. We can infer that the sk_assigned socket is RCU freed if the
> reuseport lookup succeeds, but convincing yourself of this fact isn't
> straight forward. Therefore we defensively check refcounting on the
> sk_assign sock even though it's probably not required in practice.
> 
> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
> Fixes: cf7fbe6 ("bpf: Add socket assign support")
Please use 12 chars of hash.
---8<---
$ cat ~/.gitconfig
[core]
	abbrev = 12
[pretty]
	fixes = Fixes: %h (\"%s\")
$ git show 8e368dc --pretty=fixes | head -n 1
Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")
---8<---
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
> Cc: Joe Stringer <joe@cilium.io>
> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
> ---
>  include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>  include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
>  include/net/sock.h             |  7 +++++--
>  include/uapi/linux/bpf.h       |  3 ---
>  net/core/filter.c              |  2 --
>  net/ipv4/inet_hashtables.c     | 15 +++++++-------
>  net/ipv4/udp.c                 | 23 +++++++++++++++++++---
>  net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
>  net/ipv6/udp.c                 | 23 +++++++++++++++++++---
>  tools/include/uapi/linux/bpf.h |  3 ---
>  10 files changed, 119 insertions(+), 39 deletions(-)
> 
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index 56f1286583d3..3ba4dc2703da 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
>  					const u16 hnum, const int dif,
>  					const int sdif);
>  
> +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> +				    struct sk_buff *skb, int doff,
> +				    const struct in6_addr *saddr,
> +				    __be16 sport,
> +				    const struct in6_addr *daddr,
> +				    unsigned short hnum);
> +
>  struct sock *inet6_lookup_listener(struct net *net,
>  				   struct inet_hashinfo *hashinfo,
>  				   struct sk_buff *skb, int doff,
> @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
>  					      int iif, int sdif,
>  					      bool *refcounted)
>  {
> -	struct sock *sk = skb_steal_sock(skb, refcounted);
> -
> +	bool prefetched;
> +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> +	struct net *net = dev_net(skb_dst(skb)->dev);
> +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
nit: Reverse Xmas Tree order.  Same for other chunks.
> +
> +	if (prefetched) {
> +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> +							       &ip6h->saddr, sport,
> +							       &ip6h->daddr, ntohs(dport));
> +		if (reuse_sk) {
> +			if (reuse_sk != sk) {
> +				if (*refcounted) {
> +					sock_put(sk);
> +					*refcounted = false;
> +				}
> +				if (IS_ERR(reuse_sk))
> +					return NULL;
> +			}
> +			return reuse_sk;
> +		}
Maybe we can add a hepler to avoid this duplication ?
> +	}
>  	if (sk)
>  		return sk;
>  
> -	return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
> -			      doff, &ipv6_hdr(skb)->saddr, sport,
> -			      &ipv6_hdr(skb)->daddr, ntohs(dport),
> +	return __inet6_lookup(net, hashinfo, skb,
> +			      doff, &ip6h->saddr, sport,
> +			      &ip6h->daddr, ntohs(dport),
>  			      iif, sdif, refcounted);
>  }
>  
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 99bd823e97f6..c2af195ca71f 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -379,6 +379,11 @@ struct sock *__inet_lookup_established(struct net *net,
>  				       const __be32 daddr, const u16 hnum,
>  				       const int dif, const int sdif);
>  
> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> +				   struct sk_buff *skb, int doff,
> +				   __be32 saddr, __be16 sport,
> +				   __be32 daddr, unsigned short hnum);
> +
>  static inline struct sock *
>  	inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
>  				const __be32 saddr, const __be16 sport,
> @@ -436,13 +441,31 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
>  					     const int sdif,
>  					     bool *refcounted)
>  {
> -	struct sock *sk = skb_steal_sock(skb, refcounted);
> +	bool prefetched;
> +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> +	struct net *net = dev_net(skb_dst(skb)->dev);
>  	const struct iphdr *iph = ip_hdr(skb);
>  
> +	if (prefetched) {
> +		struct sock *reuse_sk = inet_lookup_reuseport(net, sk, skb, doff,
> +							      iph->saddr, sport,
> +							      iph->daddr, ntohs(dport));
> +		if (reuse_sk) {
> +			if (reuse_sk != sk) {
> +				if (*refcounted) {
> +					sock_put(sk);
> +					*refcounted = false;
> +				}
> +				if (IS_ERR(reuse_sk))
> +					return NULL;
> +			}
> +			return reuse_sk;
> +		}
> +	}
>  	if (sk)
>  		return sk;
>  
> -	return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
> +	return __inet_lookup(net, hashinfo, skb,
>  			     doff, iph->saddr, sport,
>  			     iph->daddr, dport, inet_iif(skb), sdif,
>  			     refcounted);
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 656ea89f60ff..5645570c2a64 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2806,20 +2806,23 @@ sk_is_refcounted(struct sock *sk)
>   * skb_steal_sock - steal a socket from an sk_buff
>   * @skb: sk_buff to steal the socket from
>   * @refcounted: is set to true if the socket is reference-counted
> + * @prefetched: is set to true if the socket was assigned from bpf
>   */
>  static inline struct sock *
> -skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> +skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
>  {
>  	if (skb->sk) {
>  		struct sock *sk = skb->sk;
>  
>  		*refcounted = true;
> -		if (skb_sk_is_prefetched(skb))
> +		*prefetched = skb_sk_is_prefetched(skb);
> +		if (*prefetched)
>  			*refcounted = sk_is_refcounted(sk);
>  		skb->destructor = NULL;
>  		skb->sk = NULL;
>  		return sk;
>  	}
> +	*prefetched = false;
>  	*refcounted = false;
>  	return NULL;
>  }
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1bb11a6ee667..2af606a525db 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4144,9 +4144,6 @@ union bpf_attr {
>   *		**-EOPNOTSUPP** if the operation is not supported, for example
>   *		a call from outside of TC ingress.
>   *
> - *		**-ESOCKTNOSUPPORT** if the socket type is not supported
> - *		(reuseport).
> - *
>   * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
>   *	Description
>   *		Helper is overloaded depending on BPF program type. This
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 968139f4a1ac..5f451260849b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -7265,8 +7265,6 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
>  		return -EOPNOTSUPP;
>  	if (unlikely(dev_net(skb->dev) != sock_net(sk)))
>  		return -ENETUNREACH;
> -	if (unlikely(sk_fullsock(sk) && sk->sk_reuseport))
> -		return -ESOCKTNOSUPPORT;
>  	if (sk_is_refcounted(sk) &&
>  	    unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
>  		return -ENOENT;
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e7391bf310a7..920131e4a65d 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
>  	return score;
>  }
>  
> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
> -					    struct sk_buff *skb, int doff,
> -					    __be32 saddr, __be16 sport,
> -					    __be32 daddr, unsigned short hnum)
> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> +				   struct sk_buff *skb, int doff,
> +				   __be32 saddr, __be16 sport,
> +				   __be32 daddr, unsigned short hnum)
>  {
>  	struct sock *reuse_sk = NULL;
>  	u32 phash;
> @@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>  	}
>  	return reuse_sk;
>  }
> +EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
>  
>  /*
>   * Here are some nice properties to exploit here. The BSD API
> @@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
>  	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>  		score = compute_score(sk, net, hnum, daddr, dif, sdif);
>  		if (score > hiscore) {
> -			result = lookup_reuseport(net, sk, skb, doff,
> -						  saddr, sport, daddr, hnum);
> +			result = inet_lookup_reuseport(net, sk, skb, doff,
> +						       saddr, sport, daddr, hnum);
>  			if (result)
>  				return result;
>  
> @@ -399,7 +400,7 @@ static inline struct sock *inet_lookup_run_bpf(struct net *net,
>  	if (no_reuseport || IS_ERR_OR_NULL(sk))
>  		return sk;
>  
> -	reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
> +	reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
>  	if (reuse_sk)
>  		sk = reuse_sk;
>  	return sk;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 6893fb867529..c67253386a38 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2426,7 +2426,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  	struct rtable *rt = skb_rtable(skb);
>  	__be32 saddr, daddr;
>  	struct net *net = dev_net(skb->dev);
> -	bool refcounted;
> +	bool refcounted, prefetched;
>  	int drop_reason;
>  
>  	drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> @@ -2455,11 +2455,28 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  	if (udp4_csum_init(skb, uh, proto))
>  		goto csum_error;
>  
> -	sk = skb_steal_sock(skb, &refcounted);
> +	sk = skb_steal_sock(skb, &refcounted, &prefetched);
>  	if (sk) {
>  		struct dst_entry *dst = skb_dst(skb);
>  		int ret;
>  
> +		if (prefetched) {
> +			struct sock *reuse_sk = lookup_reuseport(net, sk, skb,
> +								 saddr, uh->source,
> +								 daddr, ntohs(uh->dest));
> +			if (reuse_sk) {
> +				if (reuse_sk != sk) {
> +					if (refcounted) {
> +						sock_put(sk);
> +						refcounted = false;
> +					}
> +					if (IS_ERR(reuse_sk))
> +						goto no_sk;
> +				}
> +				sk = reuse_sk;
> +			}
> +		}
> +
>  		if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
>  			udp_sk_rx_dst_set(sk, dst);
>  
> @@ -2476,7 +2493,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
>  	if (sk)
>  		return udp_unicast_rcv_skb(sk, skb, uh);
> -
> +no_sk:
>  	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
>  		goto drop;
>  	nf_reset_ct(skb);
> diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> index b64b49012655..b7c56867314e 100644
> --- a/net/ipv6/inet6_hashtables.c
> +++ b/net/ipv6/inet6_hashtables.c
> @@ -111,12 +111,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
>  	return score;
>  }
>  
> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
> -					    struct sk_buff *skb, int doff,
> -					    const struct in6_addr *saddr,
> -					    __be16 sport,
> -					    const struct in6_addr *daddr,
> -					    unsigned short hnum)
> +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> +				    struct sk_buff *skb, int doff,
> +				    const struct in6_addr *saddr,
> +				    __be16 sport,
> +				    const struct in6_addr *daddr,
> +				    unsigned short hnum)
>  {
>  	struct sock *reuse_sk = NULL;
>  	u32 phash;
> @@ -127,6 +127,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>  	}
>  	return reuse_sk;
>  }
> +EXPORT_SYMBOL_GPL(inet6_lookup_reuseport);
>  
>  /* called with rcu_read_lock() */
>  static struct sock *inet6_lhash2_lookup(struct net *net,
> @@ -143,8 +144,8 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
>  	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>  		score = compute_score(sk, net, hnum, daddr, dif, sdif);
>  		if (score > hiscore) {
> -			result = lookup_reuseport(net, sk, skb, doff,
> -						  saddr, sport, daddr, hnum);
> +			result = inet6_lookup_reuseport(net, sk, skb, doff,
> +							saddr, sport, daddr, hnum);
>  			if (result)
>  				return result;
>  
> @@ -175,7 +176,7 @@ static inline struct sock *inet6_lookup_run_bpf(struct net *net,
>  	if (no_reuseport || IS_ERR_OR_NULL(sk))
>  		return sk;
>  
> -	reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
> +	reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
>  	if (reuse_sk)
>  		sk = reuse_sk;
>  	return sk;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index e5a337e6b970..3fede8ec95c4 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -949,7 +949,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  	struct net *net = dev_net(skb->dev);
>  	struct udphdr *uh;
>  	struct sock *sk;
> -	bool refcounted;
> +	bool refcounted, prefetched;
>  	u32 ulen = 0;
>  
>  	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
> @@ -986,11 +986,28 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  		goto csum_error;
>  
>  	/* Check if the socket is already available, e.g. due to early demux */
> -	sk = skb_steal_sock(skb, &refcounted);
> +	sk = skb_steal_sock(skb, &refcounted, &prefetched);
>  	if (sk) {
>  		struct dst_entry *dst = skb_dst(skb);
>  		int ret;
>  
> +		if (prefetched) {
> +			struct sock *reuse_sk = lookup_reuseport(net, sk, skb,
> +								 saddr, uh->source,
> +								 daddr, ntohs(uh->dest));
> +			if (reuse_sk) {
> +				if (reuse_sk != sk) {
> +					if (refcounted) {
> +						sock_put(sk);
> +						refcounted = false;
> +					}
> +					if (IS_ERR(reuse_sk))
> +						goto no_sk;
> +				}
> +				sk = reuse_sk;
> +			}
> +		}
> +
>  		if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
>  			udp6_sk_rx_dst_set(sk, dst);
>  
> @@ -1020,7 +1037,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  			goto report_csum_error;
>  		return udp6_unicast_rcv_skb(sk, skb, uh);
>  	}
> -
> +no_sk:
>  	reason = SKB_DROP_REASON_NO_SOCKET;
>  
>  	if (!uh->check)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 1bb11a6ee667..2af606a525db 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -4144,9 +4144,6 @@ union bpf_attr {
>   *		**-EOPNOTSUPP** if the operation is not supported, for example
>   *		a call from outside of TC ingress.
>   *
> - *		**-ESOCKTNOSUPPORT** if the socket type is not supported
> - *		(reuseport).
> - *
>   * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
>   *	Description
>   *		Helper is overloaded depending on BPF program type. This
> -- 
> 2.40.1
^ permalink raw reply	[flat|nested] 10+ messages in thread
- * Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25 17:41 ` Kuniyuki Iwashima
@ 2023-05-25 20:01   ` Daniel Borkmann
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Borkmann @ 2023-05-25 20:01 UTC (permalink / raw)
  To: Kuniyuki Iwashima, lmb
  Cc: andrii, ast, bpf, davem, dsahern, edumazet, haoluo, joe, joe,
	john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	martin.lau, netdev, pabeni, sdf, song, willemdebruijn.kernel, yhs
On 5/25/23 7:41 PM, Kuniyuki Iwashima wrote:
> From: Lorenz Bauer <lmb@isovalent.com>
> Date: Thu, 25 May 2023 09:19:22 +0100
>> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
>> sockets. This means we can't use the helper to steer traffic to Envoy, which
>> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
>> TPROXY from our setup.
>>
>> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
>> helpers don't execute SK_REUSEPORT programs. Instead, one of the
>> reuseport sockets is selected by hash. This could cause dispatch to the
>> "wrong" socket:
>>
>>      sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>>      bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>>
>> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
>> helpers unfortunately. In the tc context, L2 headers are at the start
>> of the skb, while SK_REUSEPORT expects L3 headers instead.
>>
>> Instead, we execute the SK_REUSEPORT program when the assigned socket
>> is pulled out of the skb, further up the stack. This creates some
>> trickiness with regards to refcounting as bpf_sk_assign will put both
>> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
>> freed. We can infer that the sk_assigned socket is RCU freed if the
>> reuseport lookup succeeds, but convincing yourself of this fact isn't
>> straight forward. Therefore we defensively check refcounting on the
>> sk_assign sock even though it's probably not required in practice.
>>
>> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
>> Fixes: cf7fbe6 ("bpf: Add socket assign support")
> 
> Please use 12 chars of hash.
> 
> $ cat ~/.gitconfig
> [core]
> 	abbrev = 12
> [pretty]
> 	fixes = Fixes: %h (\"%s\")
> 
> $ git show 8e368dc --pretty=fixes | head -n 1
> Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Yeap, not quite sure what happened here but the 12 chars is clear. Will
be fixed up in v2, too, ofc.
>> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
>> Cc: Joe Stringer <joe@cilium.io>
>> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
>> ---
>>   include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>>   include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
>>   include/net/sock.h             |  7 +++++--
>>   include/uapi/linux/bpf.h       |  3 ---
>>   net/core/filter.c              |  2 --
>>   net/ipv4/inet_hashtables.c     | 15 +++++++-------
>>   net/ipv4/udp.c                 | 23 +++++++++++++++++++---
>>   net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
>>   net/ipv6/udp.c                 | 23 +++++++++++++++++++---
>>   tools/include/uapi/linux/bpf.h |  3 ---
>>   10 files changed, 119 insertions(+), 39 deletions(-)
>>
[...]
>> @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
>>   					      int iif, int sdif,
>>   					      bool *refcounted)
>>   {
>> -	struct sock *sk = skb_steal_sock(skb, refcounted);
>> -
>> +	bool prefetched;
>> +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
>> +	struct net *net = dev_net(skb_dst(skb)->dev);
>> +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> 
> nit: Reverse Xmas Tree order.  Same for other chunks.
It is, the prefetched bool is simply used one line below. I don't think
this is much different than most other code from style pov..
>> +
>> +	if (prefetched) {
>> +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
>> +							       &ip6h->saddr, sport,
>> +							       &ip6h->daddr, ntohs(dport));
>> +		if (reuse_sk) {
>> +			if (reuse_sk != sk) {
>> +				if (*refcounted) {
>> +					sock_put(sk);
>> +					*refcounted = false;
>> +				}
>> +				if (IS_ERR(reuse_sk))
>> +					return NULL;
>> +			}
>> +			return reuse_sk;
>> +		}
> 
> Maybe we can add a hepler to avoid this duplication ?
We'll check if it can be made a bit nicer and integrate this into the v2.
Thanks,
Daniel
^ permalink raw reply	[flat|nested] 10+ messages in thread
 
- * Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
  2023-05-25 13:24 ` Eric Dumazet
  2023-05-25 17:41 ` Kuniyuki Iwashima
@ 2023-05-25 23:42 ` Martin KaFai Lau
  2023-05-26  1:43   ` Kuniyuki Iwashima
  2023-05-26  5:56 ` Joe Stringer
  3 siblings, 1 reply; 10+ messages in thread
From: Martin KaFai Lau @ 2023-05-25 23:42 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Joe Stringer, Martin KaFai Lau, netdev, linux-kernel, bpf,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, David Ahern, Willem de Bruijn, Joe Stringer
On 5/25/23 1:19 AM, Lorenz Bauer wrote:
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index 56f1286583d3..3ba4dc2703da 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
>   					const u16 hnum, const int dif,
>   					const int sdif);
>   
> +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> +				    struct sk_buff *skb, int doff,
> +				    const struct in6_addr *saddr,
> +				    __be16 sport,
> +				    const struct in6_addr *daddr,
> +				    unsigned short hnum);
> +
>   struct sock *inet6_lookup_listener(struct net *net,
>   				   struct inet_hashinfo *hashinfo,
>   				   struct sk_buff *skb, int doff,
> @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
>   					      int iif, int sdif,
>   					      bool *refcounted)
>   {
> -	struct sock *sk = skb_steal_sock(skb, refcounted);
> -
> +	bool prefetched;
> +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> +	struct net *net = dev_net(skb_dst(skb)->dev);
> +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> +
> +	if (prefetched) {
> +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
If sk is TCP_ESTABLISHED, I suspect sk->sk_reuseport is 1 (from sk_clone)?
If it is, it should still work other than an extra inet6_ehashfn. Does it worth 
an extra sk->sk_state check or it is overkill?
> +							       &ip6h->saddr, sport,
> +							       &ip6h->daddr, ntohs(dport));
^ permalink raw reply	[flat|nested] 10+ messages in thread
- * Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25 23:42 ` Martin KaFai Lau
@ 2023-05-26  1:43   ` Kuniyuki Iwashima
  2023-05-26  2:49     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 10+ messages in thread
From: Kuniyuki Iwashima @ 2023-05-26  1:43 UTC (permalink / raw)
  To: martin.lau
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo, joe,
	joe, john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	lmb, netdev, pabeni, sdf, song, willemdebruijn.kernel, yhs,
	kuniyu
From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Thu, 25 May 2023 16:42:46 -0700
> On 5/25/23 1:19 AM, Lorenz Bauer wrote:
> > diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> > index 56f1286583d3..3ba4dc2703da 100644
> > --- a/include/net/inet6_hashtables.h
> > +++ b/include/net/inet6_hashtables.h
> > @@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
> >   					const u16 hnum, const int dif,
> >   					const int sdif);
> >   
> > +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> > +				    struct sk_buff *skb, int doff,
> > +				    const struct in6_addr *saddr,
> > +				    __be16 sport,
> > +				    const struct in6_addr *daddr,
> > +				    unsigned short hnum);
> > +
> >   struct sock *inet6_lookup_listener(struct net *net,
> >   				   struct inet_hashinfo *hashinfo,
> >   				   struct sk_buff *skb, int doff,
> > @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
> >   					      int iif, int sdif,
> >   					      bool *refcounted)
> >   {
> > -	struct sock *sk = skb_steal_sock(skb, refcounted);
> > -
> > +	bool prefetched;
> > +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> > +	struct net *net = dev_net(skb_dst(skb)->dev);
> > +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> > +
> > +	if (prefetched) {
> > +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> 
> If sk is TCP_ESTABLISHED, I suspect sk->sk_reuseport is 1 (from sk_clone)?
Exactly, it will cause null-ptr-deref in reuseport_select_sock().
We may want to use rcu_access_pointer(sk->sk_reuseport_cb) in
each lookup_reuseport() instead of adding sk_state check ?
> 
> If it is, it should still work other than an extra inet6_ehashfn. Does it worth 
> an extra sk->sk_state check or it is overkill?
> 
> 
> > +							       &ip6h->saddr, sport,
> > +							       &ip6h->daddr, ntohs(dport));
^ permalink raw reply	[flat|nested] 10+ messages in thread
- * Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-26  1:43   ` Kuniyuki Iwashima
@ 2023-05-26  2:49     ` Kuniyuki Iwashima
  2023-05-26 21:08       ` Martin KaFai Lau
  0 siblings, 1 reply; 10+ messages in thread
From: Kuniyuki Iwashima @ 2023-05-26  2:49 UTC (permalink / raw)
  To: kuniyu
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo, joe,
	joe, john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	lmb, martin.lau, netdev, pabeni, sdf, song, willemdebruijn.kernel,
	yhs
From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Thu, 25 May 2023 18:43:17 -0700
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Thu, 25 May 2023 16:42:46 -0700
> > On 5/25/23 1:19 AM, Lorenz Bauer wrote:
> > > diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> > > index 56f1286583d3..3ba4dc2703da 100644
> > > --- a/include/net/inet6_hashtables.h
> > > +++ b/include/net/inet6_hashtables.h
> > > @@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
> > >   					const u16 hnum, const int dif,
> > >   					const int sdif);
> > >   
> > > +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> > > +				    struct sk_buff *skb, int doff,
> > > +				    const struct in6_addr *saddr,
> > > +				    __be16 sport,
> > > +				    const struct in6_addr *daddr,
> > > +				    unsigned short hnum);
> > > +
> > >   struct sock *inet6_lookup_listener(struct net *net,
> > >   				   struct inet_hashinfo *hashinfo,
> > >   				   struct sk_buff *skb, int doff,
> > > @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
> > >   					      int iif, int sdif,
> > >   					      bool *refcounted)
> > >   {
> > > -	struct sock *sk = skb_steal_sock(skb, refcounted);
> > > -
> > > +	bool prefetched;
> > > +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> > > +	struct net *net = dev_net(skb_dst(skb)->dev);
> > > +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> > > +
> > > +	if (prefetched) {
> > > +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> > 
> > If sk is TCP_ESTABLISHED, I suspect sk->sk_reuseport is 1 (from sk_clone)?
> 
> Exactly, it will cause null-ptr-deref in reuseport_select_sock().
Sorry, this doesn't occur.  reuseport_select_sock() has null check.
> We may want to use rcu_access_pointer(sk->sk_reuseport_cb) in
> each lookup_reuseport() instead of adding sk_state check ?
And if someone has a weird program that creates multiple listeners and
disable SO_REUSEPORT for a listener that hits first in lhash2, checking
sk_reuseport_cb might not work ?  I hope no one does such though, checking
sk_reuseport and sk_state could be better.
> 
> 
> > 
> > If it is, it should still work other than an extra inet6_ehashfn. Does it worth 
> > an extra sk->sk_state check or it is overkill?
> > 
> > 
> > > +							       &ip6h->saddr, sport,
> > > +							       &ip6h->daddr, ntohs(dport));
^ permalink raw reply	[flat|nested] 10+ messages in thread
- * Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-26  2:49     ` Kuniyuki Iwashima
@ 2023-05-26 21:08       ` Martin KaFai Lau
  0 siblings, 0 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2023-05-26 21:08 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo, joe,
	joe, john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	lmb, netdev, pabeni, sdf, song, willemdebruijn.kernel, yhs
On 5/25/23 7:49 PM, Kuniyuki Iwashima wrote:
>> We may want to use rcu_access_pointer(sk->sk_reuseport_cb) in
>> each lookup_reuseport() instead of adding sk_state check ?
> And if someone has a weird program that creates multiple listeners and
> disable SO_REUSEPORT for a listener that hits first in lhash2, checking
> sk_reuseport_cb might not work ?  
I am also not sure what the correct expectation is if the application disable 
SO_REUSEPORT in a sk after bind(). This is not something new or specific from 
the current patch though.
> I hope no one does such though, checking sk_reuseport 
> and sk_state could be better.
My previous comment on checking sk_state is to avoid the unnecessary 
inet_ehashfn() for the TCP_ESTABLISHED sk. Checking sk_reuseport_cb could also 
work since it should be NULL for established sk. However, not sure if it could 
be another cacheline access. The udp one is checking the sk_state also: 
"sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED".
^ permalink raw reply	[flat|nested] 10+ messages in thread 
 
 
 
- * Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
                   ` (2 preceding siblings ...)
  2023-05-25 23:42 ` Martin KaFai Lau
@ 2023-05-26  5:56 ` Joe Stringer
  3 siblings, 0 replies; 10+ messages in thread
From: Joe Stringer @ 2023-05-26  5:56 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David Ahern,
	Willem de Bruijn, Joe Stringer, Joe Stringer, Martin KaFai Lau,
	netdev, linux-kernel, bpf
On Thu, May 25, 2023 at 1:19 AM Lorenz Bauer <lmb@isovalent.com> wrote:
>
> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
> sockets. This means we can't use the helper to steer traffic to Envoy, which
> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
> TPROXY from our setup.
>
> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
> helpers don't execute SK_REUSEPORT programs. Instead, one of the
> reuseport sockets is selected by hash. This could cause dispatch to the
> "wrong" socket:
>
>     sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>     bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>
> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
> helpers unfortunately. In the tc context, L2 headers are at the start
> of the skb, while SK_REUSEPORT expects L3 headers instead.
>
> Instead, we execute the SK_REUSEPORT program when the assigned socket
> is pulled out of the skb, further up the stack. This creates some
> trickiness with regards to refcounting as bpf_sk_assign will put both
> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
> freed. We can infer that the sk_assigned socket is RCU freed if the
> reuseport lookup succeeds, but convincing yourself of this fact isn't
> straight forward. Therefore we defensively check refcounting on the
> sk_assign sock even though it's probably not required in practice.
>
> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
> Fixes: cf7fbe6 ("bpf: Add socket assign support")
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
> Cc: Joe Stringer <joe@cilium.io>
> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
Nice approach to fix this issue, wish I'd thought of it :)
I pulled this and tested out in a little-vm-helper environment with
kind and Cilium's examples/kubernetes/connectivity-check proxy suite,
as well as cilium-cli's connectivity tests and the L7 features seem to
be working as expected with SO_REUSEPORT.
Tested-by: Joe Stringer <joe@cilium.io>
I also glanced through the commit, and the various protocols seem to
be handled consistently at the very least, though I agree it'd be
simpler for review and bisecting if broken down into more incremental
changes.
^ permalink raw reply	[flat|nested] 10+ messages in thread
end of thread, other threads:[~2023-05-26 21:08 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
2023-05-25 13:24 ` Eric Dumazet
2023-05-25 19:51   ` Daniel Borkmann
2023-05-25 17:41 ` Kuniyuki Iwashima
2023-05-25 20:01   ` Daniel Borkmann
2023-05-25 23:42 ` Martin KaFai Lau
2023-05-26  1:43   ` Kuniyuki Iwashima
2023-05-26  2:49     ` Kuniyuki Iwashima
2023-05-26 21:08       ` Martin KaFai Lau
2023-05-26  5:56 ` Joe Stringer
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).