[RFC PATCH 0/3] Allow sk_lookup UDP return traffic to egress.

linux-kselftest.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/3] Allow sk_lookup UDP return traffic to egress.
@ 2024-09-13  9:39 Tiago Lam
  2024-09-13  9:39 ` [RFC PATCH 1/3] ipv4: Run a reverse sk_lookup on sendmsg Tiago Lam
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Tiago Lam @ 2024-09-13  9:39 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan
  Cc: netdev, linux-kernel, bpf, linux-kselftest, Jakub Sitnicki,
	Tiago Lam, kernel-team

Currently, sk_lookup allows an ebpf program to run on the ingress socket
lookup path, and accept traffic not only on a range of addresses, but
also on a range of ports. At Cloudflare we use sk_lookup for two main
cases:
1. Sharing a single port between multiple services - i.e. two services
   (or more) use disjoint IP ranges but share the same port;
2. Receiving traffic on all ports - i.e. a service which accepts traffic
   on specific IP ranges but any port [1].

However, one main challenge we face while using sk_lookup for these use
cases is how to source return UDP traffic:
- On point 1. above, sometimes this range of addresses are not local
  (i.e. there's no local routes for these in the server), which means we
  need IP_TRANSPARENT set to be able to egress traffic from addresses
  we've received traffic on (or simply IP_FREEBIND in the case of IPv6);
- And on point 2. above, allowing traffic to a range of ports means a
  service could get traffic on multiple ports, but currently there's no
  way to set the source UDP port egress traffic should be sourced from -
  it's possible to receive the original destination port using the
  IP_ORIGDSTADDR ancilliary message in recvmsg, but not set it in
  sendmsg.

Both of these limitations can be worked around, but in a sub-optimal
way. Using IP_TRANSPARENT, for instance, requires special privileges.
And while one could use UDP connected sockets to send return traffic,
creating a connected socket for each different address a UDP traffic is
received on does have performance implications.

Given sk_lookup allows services to accept traffic on a range of
addresses or ports, it seems sensible to also allow return traffic to
proceed through as well, without needing extra configurations / set ups.

This patch set allows to do exactly this by performing a reverse socket
lookup on the egress path - where it looks to see if the egress socket
matches a socket in the attached sk_lookup ebpf program for the traffic
that's being sent. If it does, traffic is allowed to proceed.

The downsides to this is that this runs on the egress hot path, although
this work tries to minimise its impact by only performing the reverse
socket lookup when necessary. Further performance measurements are to be
taken, but we're reaching out early for feedback to see what the
technical concerns are and if we can address them.

[1] https://blog.cloudflare.com/how-we-built-spectrum/

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
---
Tiago Lam (3):
      ipv4: Run a reverse sk_lookup on sendmsg.
      ipv6: Run a reverse sk_lookup on sendmsg.
      bpf: Add sk_lookup test to use ORIGDSTADDR cmsg.

 include/net/ip.h                                   |  1 +
 net/ipv4/ip_sockglue.c                             | 11 ++++
 net/ipv4/udp.c                                     | 33 +++++++++-
 net/ipv6/datagram.c                                | 76 ++++++++++++++++++++++
 net/ipv6/udp.c                                     |  8 ++-
 tools/testing/selftests/bpf/prog_tests/sk_lookup.c | 70 +++++++++++++-------
 6 files changed, 174 insertions(+), 25 deletions(-)
---
base-commit: da3ea35007d0af457a0afc87e84fddaebc4e0b63
change-id: 20240909-reverse-sk-lookup-f7bf36292bc4

Best regards,
-- 
Tiago Lam <tiagolam@cloudflare.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH 1/3] ipv4: Run a reverse sk_lookup on sendmsg.
  2024-09-13  9:39 [RFC PATCH 0/3] Allow sk_lookup UDP return traffic to egress Tiago Lam
@ 2024-09-13  9:39 ` Tiago Lam
  2024-09-18 12:45   ` Willem de Bruijn
  2024-09-13  9:39 ` [RFC PATCH 2/3] ipv6: " Tiago Lam
  2024-09-13  9:39 ` [RFC PATCH 3/3] bpf: Add sk_lookup test to use ORIGDSTADDR cmsg Tiago Lam
  2 siblings, 1 reply; 16+ messages in thread
From: Tiago Lam @ 2024-09-13  9:39 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan
  Cc: netdev, linux-kernel, bpf, linux-kselftest, Jakub Sitnicki,
	Tiago Lam, kernel-team

In order to check if egress traffic should be allowed through, we run a
reverse socket lookup (i.e. normal socket lookup with the src/dst
addresses and ports reversed) to check if the corresponding ingress
traffic is allowed in.  Thus, if there's a sk_lookup reverse call
returns a socket that matches the egress socket, we also let the egress
traffic through - following the principle of, allowing return traffic to
proceed if ingress traffic is allowed in.  The reverse lookup is only
performed in case an sk_lookup ebpf program is attached and the source
address and/or port for the return traffic have been modified.

The src address and port can be modified by using ancilliary messages.
Up until now, it was possible to specify a different source address to
sendmsg by providing it in an IP_PKTINFO anciliarry message, but there's
no way to change the source port. This patch also extends the ancilliary
messages supported by sendmsg to support the IP_ORIGDSTADDR ancilliary
message, reusing the same cmsg and struct used in recvmsg - which
already supports specifying a port.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
---
 include/net/ip.h       |  1 +
 net/ipv4/ip_sockglue.c | 11 +++++++++++
 net/ipv4/udp.c         | 33 ++++++++++++++++++++++++++++++++-
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index c5606cadb1a5..e5753abd7247 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -75,6 +75,7 @@ static inline unsigned int ip_hdrlen(const struct sk_buff *skb)
 struct ipcm_cookie {
 	struct sockcm_cookie	sockc;
 	__be32			addr;
+	__be16			port;
 	int			oif;
 	struct ip_options_rcu	*opt;
 	__u8			protocol;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index cf377377b52d..6e55bd25b5f7 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -297,6 +297,17 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 			ipc->addr = info->ipi_spec_dst.s_addr;
 			break;
 		}
+		case IP_ORIGDSTADDR:
+		{
+			struct sockaddr_in *dst_addr;
+
+			if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct sockaddr_in)))
+				return -EINVAL;
+			dst_addr = (struct sockaddr_in *)CMSG_DATA(cmsg);
+			ipc->port = dst_addr->sin_port;
+			ipc->addr = dst_addr->sin_addr.s_addr;
+			break;
+		}
 		case IP_TTL:
 			if (cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
 				return -EINVAL;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 49c622e743e8..b9dc0a88b0c6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1060,6 +1060,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
 	struct flowi4 fl4_stack;
 	struct flowi4 *fl4;
+	__u8 flow_flags = inet_sk_flowi_flags(sk);
 	int ulen = len;
 	struct ipcm_cookie ipc;
 	struct rtable *rt = NULL;
@@ -1179,6 +1180,37 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		}
 	}
 
+	/* If we're egressing with a different source address and/or port, we
+	 * perform a reverse socket lookup.  The rationale behind this is that we
+	 * can allow return UDP traffic that has ingressed through sk_lookup to
+	 * also egress correctly. In case this the reverse lookup fails.
+	 *
+	 * The lookup is performed if either source address and/or port changed, and
+	 * neither is "0".
+	 */
+	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+	    !connected &&
+	    (ipc.port && ipc.addr) &&
+	    (inet->inet_saddr != ipc.addr || inet->inet_sport != ipc.port)) {
+		struct sock *sk_egress;
+
+		bpf_sk_lookup_run_v4(sock_net(sk), IPPROTO_UDP,
+				     daddr, dport, ipc.addr, ntohs(ipc.port), 1, &sk_egress);
+		if (IS_ERR_OR_NULL(sk_egress) ||
+		    atomic64_read(&sk_egress->sk_cookie) != atomic64_read(&sk->sk_cookie)) {
+			net_info_ratelimited("No reverse socket lookup match for local addr %pI4:%d remote addr %pI4:%d\n",
+					     &ipc.addr, ntohs(ipc.port), &daddr, ntohs(dport));
+		} else {
+			/* Override the source port to use with the one we got in cmsg,
+			 * and tell routing to let us use a non-local address. Otherwise
+			 * route lookups will fail with non-local source address when
+			 * IP_TRANSPARENT isn't set.
+			 */
+			inet->inet_sport = ipc.port;
+			flow_flags |= FLOWI_FLAG_ANYSRC;
+		}
+	}
+
 	saddr = ipc.addr;
 	ipc.addr = faddr = daddr;
 
@@ -1223,7 +1255,6 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	if (!rt) {
 		struct net *net = sock_net(sk);
-		__u8 flow_flags = inet_sk_flowi_flags(sk);
 
 		fl4 = &fl4_stack;
 

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/3] ipv4: Run a reverse sk_lookup on sendmsg.
  2024-09-13  9:39 ` [RFC PATCH 1/3] ipv4: Run a reverse sk_lookup on sendmsg Tiago Lam
@ 2024-09-18 12:45   ` Willem de Bruijn
  2024-09-20 16:57     ` Tiago Lam
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2024-09-18 12:45 UTC (permalink / raw)
  To: Tiago Lam, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan
  Cc: netdev, linux-kernel, bpf, linux-kselftest, Jakub Sitnicki,
	Tiago Lam, kernel-team

Tiago Lam wrote:
> In order to check if egress traffic should be allowed through, we run a
> reverse socket lookup (i.e. normal socket lookup with the src/dst
> addresses and ports reversed) to check if the corresponding ingress
> traffic is allowed in.

The subject and this description makes it sound that the change always
runs a reverse sk_lookup on sendmsg.

It also focuses on the mechanism, rather than the purpose.

The feature here adds IP_ORIGDSTADDR as a way to respond from a
user configured address. With the sk_lookup limited to this new
special case, as a safety to allow it.

If I read this correctly, I suggest rewording the cover letter and
commit to make this intent and behavior more explicit.

> Thus, if there's a sk_lookup reverse call
> returns a socket that matches the egress socket, we also let the egress
> traffic through - following the principle of, allowing return traffic to
> proceed if ingress traffic is allowed in.  The reverse lookup is only
> performed in case an sk_lookup ebpf program is attached and the source
> address and/or port for the return traffic have been modified.
> 
> The src address and port can be modified by using ancilliary messages.
> Up until now, it was possible to specify a different source address to
> sendmsg by providing it in an IP_PKTINFO anciliarry message, but there's
> no way to change the source port. This patch also extends the ancilliary
> messages supported by sendmsg to support the IP_ORIGDSTADDR ancilliary
> message, reusing the same cmsg and struct used in recvmsg - which
> already supports specifying a port.
> 
> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/3] ipv4: Run a reverse sk_lookup on sendmsg.
  2024-09-18 12:45   ` Willem de Bruijn
@ 2024-09-20 16:57     ` Tiago Lam
  0 siblings, 0 replies; 16+ messages in thread
From: Tiago Lam @ 2024-09-20 16:57 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, netdev, linux-kernel, bpf,
	linux-kselftest, Jakub Sitnicki, kernel-team

On Wed, Sep 18, 2024 at 08:45:23AM -0400, Willem de Bruijn wrote:
> Tiago Lam wrote:
> > In order to check if egress traffic should be allowed through, we run a
> > reverse socket lookup (i.e. normal socket lookup with the src/dst
> > addresses and ports reversed) to check if the corresponding ingress
> > traffic is allowed in.
> 
> The subject and this description makes it sound that the change always
> runs a reverse sk_lookup on sendmsg.
> 
> It also focuses on the mechanism, rather than the purpose.
> 
> The feature here adds IP_ORIGDSTADDR as a way to respond from a
> user configured address. With the sk_lookup limited to this new
> special case, as a safety to allow it.
> 
> If I read this correctly, I suggest rewording the cover letter and
> commit to make this intent and behavior more explicit.
> 

I think that makes sense, given this is really about two things:
1. Allowing users to use IP_ORIGDSTADDR to set src address and/or port
on sendmsg();
2. When they do, allow for that return traffic to exit without any extra
configuration, thus limiting how users can take advantage of this new
functionality.

I've made a few changes which hopefully makes that clearer in v2, which
I'm sending shortly. Thanks for these suggestions!

Tiago.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.
  2024-09-13  9:39 [RFC PATCH 0/3] Allow sk_lookup UDP return traffic to egress Tiago Lam
  2024-09-13  9:39 ` [RFC PATCH 1/3] ipv4: Run a reverse sk_lookup on sendmsg Tiago Lam
@ 2024-09-13  9:39 ` Tiago Lam
  2024-09-13 18:24   ` Martin KaFai Lau
                     ` (2 more replies)
  2024-09-13  9:39 ` [RFC PATCH 3/3] bpf: Add sk_lookup test to use ORIGDSTADDR cmsg Tiago Lam
  2 siblings, 3 replies; 16+ messages in thread
From: Tiago Lam @ 2024-09-13  9:39 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan
  Cc: netdev, linux-kernel, bpf, linux-kselftest, Jakub Sitnicki,
	Tiago Lam, kernel-team

This follows the same rationale provided for the ipv4 counterpart, where
it now runs a reverse socket lookup when source addresses and/or ports
are changed, on sendmsg, to check whether egress traffic should be
allowed to go through or not.

As with ipv4, the ipv6 sendmsg path is also extended here to support the
IPV6_ORIGDSTADDR ancilliary message to be able to specify a source
address/port.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
---
 net/ipv6/datagram.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/udp.c      |  8 ++++--
 2 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index fff78496803d..4214dda1c320 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
 }
 EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
 
+static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
+				     struct in6_addr *saddr, __be16 sport)
+{
+	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+	    (saddr && sport) &&
+	    (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {
+		struct sock *sk_egress;
+
+		bpf_sk_lookup_run_v6(sock_net(sk), IPPROTO_UDP, &fl6->daddr, fl6->fl6_dport,
+				     saddr, ntohs(sport), 0, &sk_egress);
+		if (!IS_ERR_OR_NULL(sk_egress) &&
+		    atomic64_read(&sk_egress->sk_cookie) == atomic64_read(&sk->sk_cookie))
+			return true;
+
+		net_info_ratelimited("No reverse socket lookup match for local addr %pI6:%d remote addr %pI6:%d\n",
+				     &saddr, ntohs(sport), &fl6->daddr, ntohs(fl6->fl6_dport));
+	}
+
+	return false;
+}
+
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 			  struct msghdr *msg, struct flowi6 *fl6,
 			  struct ipcm6_cookie *ipc6)
@@ -844,7 +865,62 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 
 			break;
 		    }
+		case IPV6_ORIGDSTADDR:
+			{
+			struct sockaddr_in6 *sockaddr_in;
+			struct net_device *dev = NULL;
+
+			if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct sockaddr_in6))) {
+				err = -EINVAL;
+				goto exit_f;
+			}
+
+			sockaddr_in = (struct sockaddr_in6 *)CMSG_DATA(cmsg);
+
+			addr_type = __ipv6_addr_type(&sockaddr_in->sin6_addr);
+
+			if (addr_type & IPV6_ADDR_LINKLOCAL)
+				return -EINVAL;
+
+			/* If we're egressing with a different source address and/or port, we
+			 * perform a reverse socket lookup.  The rationale behind this is that we
+			 * can allow return UDP traffic that has ingressed through sk_lookup to
+			 * also egress correctly. In case the reverse lookup fails, we
+			 * continue with the normal path.
+			 *
+			 * The lookup is performed if either source address and/or port changed, and
+			 * neither is "0".
+			 */
+			if (reverse_sk_lookup(fl6, sk, &sockaddr_in->sin6_addr,
+					      sockaddr_in->sin6_port)) {
+				/* Override the source port and address to use with the one we
+				 * got in cmsg and bail early.
+				 */
+				fl6->saddr = sockaddr_in->sin6_addr;
+				fl6->fl6_sport = sockaddr_in->sin6_port;
+				break;
+			}
 
+			if (addr_type != IPV6_ADDR_ANY) {
+				int strict = __ipv6_addr_src_scope(addr_type) <= IPV6_ADDR_SCOPE_LINKLOCAL;
+
+				if (!ipv6_can_nonlocal_bind(net, inet_sk(sk)) &&
+				    !ipv6_chk_addr_and_flags(net,
+							     &sockaddr_in->sin6_addr,
+							     dev, !strict, 0,
+							     IFA_F_TENTATIVE) &&
+				    !ipv6_chk_acast_addr_src(net, dev,
+							     &sockaddr_in->sin6_addr))
+					err = -EINVAL;
+				else
+					fl6->saddr = sockaddr_in->sin6_addr;
+			}
+
+			if (err)
+				goto exit_f;
+
+			break;
+			}
 		case IPV6_FLOWINFO:
 			if (cmsg->cmsg_len < CMSG_LEN(4)) {
 				err = -EINVAL;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 6602a2e9cdb5..6121cbb71ad3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1476,6 +1476,12 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	fl6->flowi6_uid = sk->sk_uid;
 
+	/* We use fl6's daddr and fl6_sport in the reverse sk_lookup done
+	 * within ip6_datagram_send_ctl() now.
+	 */
+	fl6->daddr = *daddr;
+	fl6->fl6_sport = inet->inet_sport;
+
 	if (msg->msg_controllen) {
 		opt = &opt_space;
 		memset(opt, 0, sizeof(struct ipv6_txoptions));
@@ -1511,10 +1517,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	fl6->flowi6_proto = sk->sk_protocol;
 	fl6->flowi6_mark = ipc6.sockc.mark;
-	fl6->daddr = *daddr;
 	if (ipv6_addr_any(&fl6->saddr) && !ipv6_addr_any(&np->saddr))
 		fl6->saddr = np->saddr;
-	fl6->fl6_sport = inet->inet_sport;
 
 	if (cgroup_bpf_enabled(CGROUP_UDP6_SENDMSG) && !connected) {
 		err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.
  2024-09-13  9:39 ` [RFC PATCH 2/3] ipv6: " Tiago Lam
@ 2024-09-13 18:24   ` Martin KaFai Lau
  2024-09-17 16:15     ` Tiago Lam
  2024-09-14  8:59   ` Simon Horman
  2024-09-14 11:40   ` Eric Dumazet
  2 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2024-09-13 18:24 UTC (permalink / raw)
  To: Tiago Lam
  Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, Shuah Khan, netdev,
	linux-kernel, bpf, linux-kselftest, Jakub Sitnicki, kernel-team

On 9/13/24 2:39 AM, Tiago Lam wrote:
> This follows the same rationale provided for the ipv4 counterpart, where
> it now runs a reverse socket lookup when source addresses and/or ports
> are changed, on sendmsg, to check whether egress traffic should be
> allowed to go through or not.
> 
> As with ipv4, the ipv6 sendmsg path is also extended here to support the
> IPV6_ORIGDSTADDR ancilliary message to be able to specify a source
> address/port.
> 
> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> ---
>   net/ipv6/datagram.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>   net/ipv6/udp.c      |  8 ++++--
>   2 files changed, 82 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> index fff78496803d..4214dda1c320 100644
> --- a/net/ipv6/datagram.c
> +++ b/net/ipv6/datagram.c
> @@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
>   }
>   EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
>   
> +static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
> +				     struct in6_addr *saddr, __be16 sport)
> +{
> +	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
> +	    (saddr && sport) &&
> +	    (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {
> +		struct sock *sk_egress;
> +
> +		bpf_sk_lookup_run_v6(sock_net(sk), IPPROTO_UDP, &fl6->daddr, fl6->fl6_dport,
> +				     saddr, ntohs(sport), 0, &sk_egress);

iirc, in the ingress path, the sk could also be selected by a tc bpf prog doing 
bpf_sk_assign. Then this re-run on sk_lookup may give an incorrect result?

In general, is it necessary to rerun any bpf prog if the user space has 
specified the IP[v6]_ORIGDSTADDR.

> +		if (!IS_ERR_OR_NULL(sk_egress) &&
> +		    atomic64_read(&sk_egress->sk_cookie) == atomic64_read(&sk->sk_cookie))
> +			return true;
> +
> +		net_info_ratelimited("No reverse socket lookup match for local addr %pI6:%d remote addr %pI6:%d\n",
> +				     &saddr, ntohs(sport), &fl6->daddr, ntohs(fl6->fl6_dport));
> +	}
> +
> +	return false;
> +}
> +
>   int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
>   			  struct msghdr *msg, struct flowi6 *fl6,
>   			  struct ipcm6_cookie *ipc6)
> @@ -844,7 +865,62 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
>   
>   			break;
>   		    }
> +		case IPV6_ORIGDSTADDR:
> +			{
> +			struct sockaddr_in6 *sockaddr_in;
> +			struct net_device *dev = NULL;
> +
> +			if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct sockaddr_in6))) {
> +				err = -EINVAL;
> +				goto exit_f;
> +			}
> +
> +			sockaddr_in = (struct sockaddr_in6 *)CMSG_DATA(cmsg);
> +
> +			addr_type = __ipv6_addr_type(&sockaddr_in->sin6_addr);
> +
> +			if (addr_type & IPV6_ADDR_LINKLOCAL)
> +				return -EINVAL;
> +
> +			/* If we're egressing with a different source address and/or port, we
> +			 * perform a reverse socket lookup.  The rationale behind this is that we
> +			 * can allow return UDP traffic that has ingressed through sk_lookup to
> +			 * also egress correctly. In case the reverse lookup fails, we
> +			 * continue with the normal path.
> +			 *
> +			 * The lookup is performed if either source address and/or port changed, and
> +			 * neither is "0".
> +			 */
> +			if (reverse_sk_lookup(fl6, sk, &sockaddr_in->sin6_addr,
> +					      sockaddr_in->sin6_port)) {
> +				/* Override the source port and address to use with the one we
> +				 * got in cmsg and bail early.
> +				 */
> +				fl6->saddr = sockaddr_in->sin6_addr;
> +				fl6->fl6_sport = sockaddr_in->sin6_port;
> +				break;
> +			}
>   
> +			if (addr_type != IPV6_ADDR_ANY) {
> +				int strict = __ipv6_addr_src_scope(addr_type) <= IPV6_ADDR_SCOPE_LINKLOCAL;
> +
> +				if (!ipv6_can_nonlocal_bind(net, inet_sk(sk)) &&
> +				    !ipv6_chk_addr_and_flags(net,
> +							     &sockaddr_in->sin6_addr,
> +							     dev, !strict, 0,
> +							     IFA_F_TENTATIVE) &&
> +				    !ipv6_chk_acast_addr_src(net, dev,
> +							     &sockaddr_in->sin6_addr))
> +					err = -EINVAL;
> +				else
> +					fl6->saddr = sockaddr_in->sin6_addr;
> +			}
> +
> +			if (err)
> +				goto exit_f;
> +
> +			break;
> +			}
>   		case IPV6_FLOWINFO:
>   			if (cmsg->cmsg_len < CMSG_LEN(4)) {
>   				err = -EINVAL;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 6602a2e9cdb5..6121cbb71ad3 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1476,6 +1476,12 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>   
>   	fl6->flowi6_uid = sk->sk_uid;
>   
> +	/* We use fl6's daddr and fl6_sport in the reverse sk_lookup done
> +	 * within ip6_datagram_send_ctl() now.
> +	 */
> +	fl6->daddr = *daddr;
> +	fl6->fl6_sport = inet->inet_sport;
> +
>   	if (msg->msg_controllen) {
>   		opt = &opt_space;
>   		memset(opt, 0, sizeof(struct ipv6_txoptions));
> @@ -1511,10 +1517,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>   
>   	fl6->flowi6_proto = sk->sk_protocol;
>   	fl6->flowi6_mark = ipc6.sockc.mark;
> -	fl6->daddr = *daddr;
>   	if (ipv6_addr_any(&fl6->saddr) && !ipv6_addr_any(&np->saddr))
>   		fl6->saddr = np->saddr;
> -	fl6->fl6_sport = inet->inet_sport;
>   
>   	if (cgroup_bpf_enabled(CGROUP_UDP6_SENDMSG) && !connected) {
>   		err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.
  2024-09-13 18:24   ` Martin KaFai Lau
@ 2024-09-17 16:15     ` Tiago Lam
  2024-09-24 23:58       ` Martin KaFai Lau
  0 siblings, 1 reply; 16+ messages in thread
From: Tiago Lam @ 2024-09-17 16:15 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, Shuah Khan, netdev,
	linux-kernel, bpf, linux-kselftest, Jakub Sitnicki, kernel-team

On Fri, Sep 13, 2024 at 11:24:09AM -0700, Martin KaFai Lau wrote:
> On 9/13/24 2:39 AM, Tiago Lam wrote:
> > This follows the same rationale provided for the ipv4 counterpart, where
> > it now runs a reverse socket lookup when source addresses and/or ports
> > are changed, on sendmsg, to check whether egress traffic should be
> > allowed to go through or not.
> > 
> > As with ipv4, the ipv6 sendmsg path is also extended here to support the
> > IPV6_ORIGDSTADDR ancilliary message to be able to specify a source
> > address/port.
> > 
> > Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> > ---
> >   net/ipv6/datagram.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >   net/ipv6/udp.c      |  8 ++++--
> >   2 files changed, 82 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> > index fff78496803d..4214dda1c320 100644
> > --- a/net/ipv6/datagram.c
> > +++ b/net/ipv6/datagram.c
> > @@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
> >   }
> >   EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
> > +static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
> > +				     struct in6_addr *saddr, __be16 sport)
> > +{
> > +	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
> > +	    (saddr && sport) &&
> > +	    (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {
> > +		struct sock *sk_egress;
> > +
> > +		bpf_sk_lookup_run_v6(sock_net(sk), IPPROTO_UDP, &fl6->daddr, fl6->fl6_dport,
> > +				     saddr, ntohs(sport), 0, &sk_egress);
> 
> iirc, in the ingress path, the sk could also be selected by a tc bpf prog
> doing bpf_sk_assign. Then this re-run on sk_lookup may give an incorrect
> result?
> 

If it does give the incorrect result, we still fallback to the normal
egress path.

> In general, is it necessary to rerun any bpf prog if the user space has
> specified the IP[v6]_ORIGDSTADDR.
> 

More generally, wouldn't that also be the case if someone calls
bpf_sk_assign() in both TC and sk_lookup on ingress? It can lead to some
interference between the two.

It seems like the interesting cases are:
1. Calling bpf_sk_assign() on both TC and sk_lookup ingress: if this
happens sk_lookup on egress should match the correct socket when doing
the reverse lookup;
2. Calling bpf_sk_assign() only on ingress TC: in this case it will
depend if an sk_lookup program is attached or not:
  a. If not, there's no reverse lookup on egress either;
  b. But if yes, although the reverse sk_lookup here won't match the
  initial socket assigned at ingress TC, the packets will still fallback
  to the normal egress path;

You're right in that case 2b above will continue with the same
restrictions as before.

Tiago.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.
  2024-09-17 16:15     ` Tiago Lam
@ 2024-09-24 23:58       ` Martin KaFai Lau
  2024-10-11 11:21         ` Tiago Lam
  0 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2024-09-24 23:58 UTC (permalink / raw)
  To: Tiago Lam
  Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, Shuah Khan, netdev,
	linux-kernel, bpf, linux-kselftest, Jakub Sitnicki, kernel-team

On 9/17/24 6:15 PM, Tiago Lam wrote:
> On Fri, Sep 13, 2024 at 11:24:09AM -0700, Martin KaFai Lau wrote:
>> On 9/13/24 2:39 AM, Tiago Lam wrote:
>>> This follows the same rationale provided for the ipv4 counterpart, where
>>> it now runs a reverse socket lookup when source addresses and/or ports
>>> are changed, on sendmsg, to check whether egress traffic should be
>>> allowed to go through or not.
>>>
>>> As with ipv4, the ipv6 sendmsg path is also extended here to support the
>>> IPV6_ORIGDSTADDR ancilliary message to be able to specify a source
>>> address/port.
>>>
>>> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
>>> Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
>>> ---
>>>    net/ipv6/datagram.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    net/ipv6/udp.c      |  8 ++++--
>>>    2 files changed, 82 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
>>> index fff78496803d..4214dda1c320 100644
>>> --- a/net/ipv6/datagram.c
>>> +++ b/net/ipv6/datagram.c
>>> @@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
>>>    }
>>>    EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
>>> +static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
>>> +				     struct in6_addr *saddr, __be16 sport)
>>> +{
>>> +	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
>>> +	    (saddr && sport) &&
>>> +	    (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {
>>> +		struct sock *sk_egress;
>>> +
>>> +		bpf_sk_lookup_run_v6(sock_net(sk), IPPROTO_UDP, &fl6->daddr, fl6->fl6_dport,
>>> +				     saddr, ntohs(sport), 0, &sk_egress);
>>
>> iirc, in the ingress path, the sk could also be selected by a tc bpf prog
>> doing bpf_sk_assign. Then this re-run on sk_lookup may give an incorrect
>> result?
>>
> 
> If it does give the incorrect result, we still fallback to the normal
> egress path.
> 
>> In general, is it necessary to rerun any bpf prog if the user space has
>> specified the IP[v6]_ORIGDSTADDR.
>>
> 
> More generally, wouldn't that also be the case if someone calls
> bpf_sk_assign() in both TC and sk_lookup on ingress? It can lead to some
> interference between the two.
> 
> It seems like the interesting cases are:
> 1. Calling bpf_sk_assign() on both TC and sk_lookup ingress: if this
> happens sk_lookup on egress should match the correct socket when doing
> the reverse lookup;
> 2. Calling bpf_sk_assign() only on ingress TC: in this case it will
> depend if an sk_lookup program is attached or not:
>    a. If not, there's no reverse lookup on egress either;
>    b. But if yes, although the reverse sk_lookup here won't match the
>    initial socket assigned at ingress TC, the packets will still fallback
>    to the normal egress path;
> 
> You're right in that case 2b above will continue with the same
> restrictions as before.

imo, all these cases you described above is a good signal that neither the TC 
nor the BPF_PROG_TYPE_SK_LOOKUP program type is the right bpf prog to run here 
_if_ a bpf prog was indeed useful here.

I only followed some of the other discussion in v1 and v2. For now, I still 
don't see running a bpf prog is useful here to process the IP[V6]_ORIGDSTADDR. 
Jakub Sitnicki and I had discussed a similar point during the LPC.

If a bpf prog was indeed needed to process a cmsg, this should work closer to 
what Jakub Sitnicki had proposed for getting the meta data during LPC (but I 
believe the verdict there is also that a bpf prog is not needed). It should be a 
bpf prog that can work in a more generic way to process any BPF specific cmsg 
and can do other operations in the future using kfunc (e.g. route lookup or 
something). Saying yes/no to a particular local IP and port could be one of 
things that the bpf prog can do when processing the cmsg.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.
  2024-09-24 23:58       ` Martin KaFai Lau
@ 2024-10-11 11:21         ` Tiago Lam
  0 siblings, 0 replies; 16+ messages in thread
From: Tiago Lam @ 2024-10-11 11:21 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, Shuah Khan, netdev,
	linux-kernel, bpf, linux-kselftest, Jakub Sitnicki, kernel-team

On Tue, Sep 24, 2024 at 04:58:19PM -0700, Martin KaFai Lau wrote:
> On 9/17/24 6:15 PM, Tiago Lam wrote:
> > On Fri, Sep 13, 2024 at 11:24:09AM -0700, Martin KaFai Lau wrote:
> > > On 9/13/24 2:39 AM, Tiago Lam wrote:
> > > > This follows the same rationale provided for the ipv4 counterpart, where
> > > > it now runs a reverse socket lookup when source addresses and/or ports
> > > > are changed, on sendmsg, to check whether egress traffic should be
> > > > allowed to go through or not.
> > > > 
> > > > As with ipv4, the ipv6 sendmsg path is also extended here to support the
> > > > IPV6_ORIGDSTADDR ancilliary message to be able to specify a source
> > > > address/port.
> > > > 
> > > > Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> > > > Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> > > > ---
> > > >    net/ipv6/datagram.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >    net/ipv6/udp.c      |  8 ++++--
> > > >    2 files changed, 82 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> > > > index fff78496803d..4214dda1c320 100644
> > > > --- a/net/ipv6/datagram.c
> > > > +++ b/net/ipv6/datagram.c
> > > > @@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
> > > >    }
> > > >    EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
> > > > +static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
> > > > +				     struct in6_addr *saddr, __be16 sport)
> > > > +{
> > > > +	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
> > > > +	    (saddr && sport) &&
> > > > +	    (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {
> > > > +		struct sock *sk_egress;
> > > > +
> > > > +		bpf_sk_lookup_run_v6(sock_net(sk), IPPROTO_UDP, &fl6->daddr, fl6->fl6_dport,
> > > > +				     saddr, ntohs(sport), 0, &sk_egress);
> > > 
> > > iirc, in the ingress path, the sk could also be selected by a tc bpf prog
> > > doing bpf_sk_assign. Then this re-run on sk_lookup may give an incorrect
> > > result?
> > > 
> > 
> > If it does give the incorrect result, we still fallback to the normal
> > egress path.
> > 
> > > In general, is it necessary to rerun any bpf prog if the user space has
> > > specified the IP[v6]_ORIGDSTADDR.
> > > 
> > 
> > More generally, wouldn't that also be the case if someone calls
> > bpf_sk_assign() in both TC and sk_lookup on ingress? It can lead to some
> > interference between the two.
> > 
> > It seems like the interesting cases are:
> > 1. Calling bpf_sk_assign() on both TC and sk_lookup ingress: if this
> > happens sk_lookup on egress should match the correct socket when doing
> > the reverse lookup;
> > 2. Calling bpf_sk_assign() only on ingress TC: in this case it will
> > depend if an sk_lookup program is attached or not:
> >    a. If not, there's no reverse lookup on egress either;
> >    b. But if yes, although the reverse sk_lookup here won't match the
> >    initial socket assigned at ingress TC, the packets will still fallback
> >    to the normal egress path;
> > 
> > You're right in that case 2b above will continue with the same
> > restrictions as before.
> 
> imo, all these cases you described above is a good signal that neither the
> TC nor the BPF_PROG_TYPE_SK_LOOKUP program type is the right bpf prog to run
> here _if_ a bpf prog was indeed useful here.
> 
> I only followed some of the other discussion in v1 and v2. For now, I still
> don't see running a bpf prog is useful here to process the
> IP[V6]_ORIGDSTADDR. Jakub Sitnicki and I had discussed a similar point
> during the LPC.
> 
> If a bpf prog was indeed needed to process a cmsg, this should work closer
> to what Jakub Sitnicki had proposed for getting the meta data during LPC
> (but I believe the verdict there is also that a bpf prog is not needed). It
> should be a bpf prog that can work in a more generic way to process any BPF
> specific cmsg and can do other operations in the future using kfunc (e.g.
> route lookup or something). Saying yes/no to a particular local IP and port
> could be one of things that the bpf prog can do when processing the cmsg.

Thanks for the feeback here, Martin. And apologies for the delay in
respoding to this.

I think you do have a point, and after syncing up some more with Jakub
about your discussion during the LPC, the argument that applications can
already bind to a specific address + port to send their traffic from
makes sense to me. However, I think we could introduce a new cmsg in
sendmsg to allow apps to set the source port to egress from, extending
what they can already do with IP_PKTINFO, i.e. setting the source IP.
We'd need to take care with priviledged and reserved ports, but this
would avoid applications having to do an extra bind. Do you have any
thoughts on this?

Tiago.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.
  2024-09-13  9:39 ` [RFC PATCH 2/3] ipv6: " Tiago Lam
  2024-09-13 18:24   ` Martin KaFai Lau
@ 2024-09-14  8:59   ` Simon Horman
  2024-09-17 16:06     ` Tiago Lam
  2024-09-14 11:40   ` Eric Dumazet
  2 siblings, 1 reply; 16+ messages in thread
From: Simon Horman @ 2024-09-14  8:59 UTC (permalink / raw)
  To: Tiago Lam
  Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan, netdev, linux-kernel, bpf, linux-kselftest,
	Jakub Sitnicki, kernel-team

On Fri, Sep 13, 2024 at 10:39:20AM +0100, Tiago Lam wrote:
> This follows the same rationale provided for the ipv4 counterpart, where
> it now runs a reverse socket lookup when source addresses and/or ports
> are changed, on sendmsg, to check whether egress traffic should be
> allowed to go through or not.
> 
> As with ipv4, the ipv6 sendmsg path is also extended here to support the
> IPV6_ORIGDSTADDR ancilliary message to be able to specify a source

Hi Tiago Lam,

Some minor nits from my side.

ancilliary -> ancillary

Likewise in patch 3/3.
Flagged by checkpatch.pl --codespell

> address/port.
> 
> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> ---
>  net/ipv6/datagram.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  net/ipv6/udp.c      |  8 ++++--
>  2 files changed, 82 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> index fff78496803d..4214dda1c320 100644
> --- a/net/ipv6/datagram.c
> +++ b/net/ipv6/datagram.c
> @@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
>  }
>  EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
>  
> +static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
> +				     struct in6_addr *saddr, __be16 sport)
> +{
> +	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
> +	    (saddr && sport) &&
> +	    (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {

Please consider, where it can trivially be achieved, limiting Networking
code to 80 columns wide.

Checkpatch can be run with a flag to check for this.

> +		struct sock *sk_egress;
> +
> +		bpf_sk_lookup_run_v6(sock_net(sk), IPPROTO_UDP, &fl6->daddr, fl6->fl6_dport,
> +				     saddr, ntohs(sport), 0, &sk_egress);
> +		if (!IS_ERR_OR_NULL(sk_egress) &&
> +		    atomic64_read(&sk_egress->sk_cookie) == atomic64_read(&sk->sk_cookie))
> +			return true;
> +
> +		net_info_ratelimited("No reverse socket lookup match for local addr %pI6:%d remote addr %pI6:%d\n",
> +				     &saddr, ntohs(sport), &fl6->daddr, ntohs(fl6->fl6_dport));
> +	}
> +
> +	return false;
> +}
> +
>  int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
>  			  struct msghdr *msg, struct flowi6 *fl6,
>  			  struct ipcm6_cookie *ipc6)

...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.
  2024-09-14  8:59   ` Simon Horman
@ 2024-09-17 16:06     ` Tiago Lam
  0 siblings, 0 replies; 16+ messages in thread
From: Tiago Lam @ 2024-09-17 16:06 UTC (permalink / raw)
  To: Simon Horman
  Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan, netdev, linux-kernel, bpf, linux-kselftest,
	Jakub Sitnicki, kernel-team

On Sat, Sep 14, 2024 at 09:59:50AM +0100, Simon Horman wrote:
> On Fri, Sep 13, 2024 at 10:39:20AM +0100, Tiago Lam wrote:
> > This follows the same rationale provided for the ipv4 counterpart, where
> > it now runs a reverse socket lookup when source addresses and/or ports
> > are changed, on sendmsg, to check whether egress traffic should be
> > allowed to go through or not.
> > 
> > As with ipv4, the ipv6 sendmsg path is also extended here to support the
> > IPV6_ORIGDSTADDR ancilliary message to be able to specify a source
> 
> Hi Tiago Lam,
> 
> Some minor nits from my side.
> 
> ancilliary -> ancillary
> 
> Likewise in patch 3/3.
> Flagged by checkpatch.pl --codespell
> 
> > address/port.
> > 
> > Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> > ---
> >  net/ipv6/datagram.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  net/ipv6/udp.c      |  8 ++++--
> >  2 files changed, 82 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> > index fff78496803d..4214dda1c320 100644
> > --- a/net/ipv6/datagram.c
> > +++ b/net/ipv6/datagram.c
> > @@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
> >  }
> >  EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
> >  
> > +static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
> > +				     struct in6_addr *saddr, __be16 sport)
> > +{
> > +	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
> > +	    (saddr && sport) &&
> > +	    (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {
> 
> Please consider, where it can trivially be achieved, limiting Networking
> code to 80 columns wide.
> 
> Checkpatch can be run with a flag to check for this.
> 

Thanks for the hints here, I've addressed these and will include the
changes into the next revision. I use b4 which takes care of some of
this checks, but I'll make sure I change my settings to use
`--max-line-length=80` as well.

Tiago.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.
  2024-09-13  9:39 ` [RFC PATCH 2/3] ipv6: " Tiago Lam
  2024-09-13 18:24   ` Martin KaFai Lau
  2024-09-14  8:59   ` Simon Horman
@ 2024-09-14 11:40   ` Eric Dumazet
  2024-09-17 16:03     ` Tiago Lam
  2 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2024-09-14 11:40 UTC (permalink / raw)
  To: Tiago Lam
  Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
	Willem de Bruijn, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, Shuah Khan, netdev,
	linux-kernel, bpf, linux-kselftest, Jakub Sitnicki, kernel-team

On Fri, Sep 13, 2024 at 11:39 AM Tiago Lam <tiagolam@cloudflare.com> wrote:
>
> This follows the same rationale provided for the ipv4 counterpart, where
> it now runs a reverse socket lookup when source addresses and/or ports
> are changed, on sendmsg, to check whether egress traffic should be
> allowed to go through or not.
>
> As with ipv4, the ipv6 sendmsg path is also extended here to support the
> IPV6_ORIGDSTADDR ancilliary message to be able to specify a source
> address/port.
>
> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> ---
>  net/ipv6/datagram.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  net/ipv6/udp.c      |  8 ++++--
>  2 files changed, 82 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> index fff78496803d..4214dda1c320 100644
> --- a/net/ipv6/datagram.c
> +++ b/net/ipv6/datagram.c
> @@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
>  }
>  EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
>
> +static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
> +                                    struct in6_addr *saddr, __be16 sport)
> +{
> +       if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
> +           (saddr && sport) &&
> +           (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {
> +               struct sock *sk_egress;
> +
> +               bpf_sk_lookup_run_v6(sock_net(sk), IPPROTO_UDP, &fl6->daddr, fl6->fl6_dport,
> +                                    saddr, ntohs(sport), 0, &sk_egress);
> +               if (!IS_ERR_OR_NULL(sk_egress) &&
> +                   atomic64_read(&sk_egress->sk_cookie) == atomic64_read(&sk->sk_cookie))

I do not understand this.

1) sk_cookie is not always initialized. It is done on demand, when/if
__sock_gen_cookie() was called.

2) if sk1 and sk2 share the same sk_cookie, then sk1 == sk2 ???

So why not simply testing sk_egress == sk ?

> +                       return true;
> +
> +               net_info_ratelimited("No reverse socket lookup match for local addr %pI6:%d remote addr %pI6:%d\n",
> +                                    &saddr, ntohs(sport), &fl6->daddr, ntohs(fl6->fl6_dport));
> +       }
> +
> +       return false;
> +}
> +
>  int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
>                           struct msghdr *msg, struct flowi6 *fl6,
>                           struct ipcm6_cookie *ipc6)
> @@ -844,7 +865,62 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
>
>                         break;
>                     }
> +               case IPV6_ORIGDSTADDR:
> +                       {
> +                       struct sockaddr_in6 *sockaddr_in;
> +                       struct net_device *dev = NULL;
> +
> +                       if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct sockaddr_in6))) {
> +                               err = -EINVAL;
> +                               goto exit_f;
> +                       }
> +
> +                       sockaddr_in = (struct sockaddr_in6 *)CMSG_DATA(cmsg);
> +
> +                       addr_type = __ipv6_addr_type(&sockaddr_in->sin6_addr);
> +
> +                       if (addr_type & IPV6_ADDR_LINKLOCAL)
> +                               return -EINVAL;
> +
> +                       /* If we're egressing with a different source address and/or port, we
> +                        * perform a reverse socket lookup.  The rationale behind this is that we
> +                        * can allow return UDP traffic that has ingressed through sk_lookup to
> +                        * also egress correctly. In case the reverse lookup fails, we
> +                        * continue with the normal path.
> +                        *
> +                        * The lookup is performed if either source address and/or port changed, and
> +                        * neither is "0".
> +                        */
> +                       if (reverse_sk_lookup(fl6, sk, &sockaddr_in->sin6_addr,
> +                                             sockaddr_in->sin6_port)) {
> +                               /* Override the source port and address to use with the one we
> +                                * got in cmsg and bail early.
> +                                */
> +                               fl6->saddr = sockaddr_in->sin6_addr;
> +                               fl6->fl6_sport = sockaddr_in->sin6_port;
> +                               break;
> +                       }
>
> +                       if (addr_type != IPV6_ADDR_ANY) {
> +                               int strict = __ipv6_addr_src_scope(addr_type) <= IPV6_ADDR_SCOPE_LINKLOCAL;
> +
> +                               if (!ipv6_can_nonlocal_bind(net, inet_sk(sk)) &&
> +                                   !ipv6_chk_addr_and_flags(net,
> +                                                            &sockaddr_in->sin6_addr,
> +                                                            dev, !strict, 0,
> +                                                            IFA_F_TENTATIVE) &&
> +                                   !ipv6_chk_acast_addr_src(net, dev,
> +                                                            &sockaddr_in->sin6_addr))
> +                                       err = -EINVAL;
> +                               else
> +                                       fl6->saddr = sockaddr_in->sin6_addr;
> +                       }
> +
> +                       if (err)
> +                               goto exit_f;
> +
> +                       break;
> +                       }
>                 case IPV6_FLOWINFO:
>                         if (cmsg->cmsg_len < CMSG_LEN(4)) {
>                                 err = -EINVAL;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 6602a2e9cdb5..6121cbb71ad3 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1476,6 +1476,12 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>
>         fl6->flowi6_uid = sk->sk_uid;
>
> +       /* We use fl6's daddr and fl6_sport in the reverse sk_lookup done
> +        * within ip6_datagram_send_ctl() now.
> +        */
> +       fl6->daddr = *daddr;
> +       fl6->fl6_sport = inet->inet_sport;
> +
>         if (msg->msg_controllen) {
>                 opt = &opt_space;
>                 memset(opt, 0, sizeof(struct ipv6_txoptions));
> @@ -1511,10 +1517,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>
>         fl6->flowi6_proto = sk->sk_protocol;
>         fl6->flowi6_mark = ipc6.sockc.mark;
> -       fl6->daddr = *daddr;
>         if (ipv6_addr_any(&fl6->saddr) && !ipv6_addr_any(&np->saddr))
>                 fl6->saddr = np->saddr;
> -       fl6->fl6_sport = inet->inet_sport;
>
>         if (cgroup_bpf_enabled(CGROUP_UDP6_SENDMSG) && !connected) {
>                 err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
>
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.
  2024-09-14 11:40   ` Eric Dumazet
@ 2024-09-17 16:03     ` Tiago Lam
  0 siblings, 0 replies; 16+ messages in thread
From: Tiago Lam @ 2024-09-17 16:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
	Willem de Bruijn, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, Shuah Khan, netdev,
	linux-kernel, bpf, linux-kselftest, Jakub Sitnicki, kernel-team

On Sat, Sep 14, 2024 at 01:40:25PM +0200, Eric Dumazet wrote:
> On Fri, Sep 13, 2024 at 11:39 AM Tiago Lam <tiagolam@cloudflare.com> wrote:
> >
> > This follows the same rationale provided for the ipv4 counterpart, where
> > it now runs a reverse socket lookup when source addresses and/or ports
> > are changed, on sendmsg, to check whether egress traffic should be
> > allowed to go through or not.
> >
> > As with ipv4, the ipv6 sendmsg path is also extended here to support the
> > IPV6_ORIGDSTADDR ancilliary message to be able to specify a source
> > address/port.
> >
> > Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> > ---
> >  net/ipv6/datagram.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  net/ipv6/udp.c      |  8 ++++--
> >  2 files changed, 82 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> > index fff78496803d..4214dda1c320 100644
> > --- a/net/ipv6/datagram.c
> > +++ b/net/ipv6/datagram.c
> > @@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
> >  }
> >  EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
> >
> > +static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
> > +                                    struct in6_addr *saddr, __be16 sport)
> > +{
> > +       if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
> > +           (saddr && sport) &&
> > +           (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {
> > +               struct sock *sk_egress;
> > +
> > +               bpf_sk_lookup_run_v6(sock_net(sk), IPPROTO_UDP, &fl6->daddr, fl6->fl6_dport,
> > +                                    saddr, ntohs(sport), 0, &sk_egress);
> > +               if (!IS_ERR_OR_NULL(sk_egress) &&
> > +                   atomic64_read(&sk_egress->sk_cookie) == atomic64_read(&sk->sk_cookie))
> 
> I do not understand this.
> 
> 1) sk_cookie is not always initialized. It is done on demand, when/if
> __sock_gen_cookie() was called.
> 
> 2) if sk1 and sk2 share the same sk_cookie, then sk1 == sk2 ???
> 
> So why not simply testing sk_egress == sk ?
> 

Oh, yes, you're right. I'll include this in my next revision, thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH 3/3] bpf: Add sk_lookup test to use ORIGDSTADDR cmsg.
  2024-09-13  9:39 [RFC PATCH 0/3] Allow sk_lookup UDP return traffic to egress Tiago Lam
  2024-09-13  9:39 ` [RFC PATCH 1/3] ipv4: Run a reverse sk_lookup on sendmsg Tiago Lam
  2024-09-13  9:39 ` [RFC PATCH 2/3] ipv6: " Tiago Lam
@ 2024-09-13  9:39 ` Tiago Lam
  2024-09-13 12:10   ` Philo Lu
  2 siblings, 1 reply; 16+ messages in thread
From: Tiago Lam @ 2024-09-13  9:39 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan
  Cc: netdev, linux-kernel, bpf, linux-kselftest, Jakub Sitnicki,
	Tiago Lam, kernel-team

This patch reuses the framework already in place for sk_lookup, allowing
it now to send a reply from the server fd directly, instead of having to
create a socket bound to the original destination address and reply from
there. It does this by passing the source address and port from where to
reply from in a IP_ORIGDSTADDR ancilliary message passed in sendmsg.

Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
---
 tools/testing/selftests/bpf/prog_tests/sk_lookup.c | 70 +++++++++++++++-------
 1 file changed, 48 insertions(+), 22 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
index ae87c00867ba..b99aff2e3976 100644
--- a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
+++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
@@ -75,6 +75,7 @@ struct test {
 	struct inet_addr listen_at;
 	enum server accept_on;
 	bool reuseport_has_conns; /* Add a connected socket to reuseport group */
+	bool dont_bind_reply; /* Don't bind, send direct from server fd. */
 };
 
 struct cb_opts {
@@ -363,7 +364,7 @@ static void v4_to_v6(struct sockaddr_storage *ss)
 	memset(&v6->sin6_addr.s6_addr[0], 0, 10);
 }
 
-static int udp_recv_send(int server_fd)
+static int udp_recv_send(int server_fd, bool dont_bind_reply)
 {
 	char cmsg_buf[CMSG_SPACE(sizeof(struct sockaddr_storage))];
 	struct sockaddr_storage _src_addr = { 0 };
@@ -415,26 +416,41 @@ static int udp_recv_send(int server_fd)
 		v4_to_v6(dst_addr);
 	}
 
-	/* Reply from original destination address. */
-	fd = start_server_addr(SOCK_DGRAM, dst_addr, sizeof(*dst_addr), NULL);
-	if (!ASSERT_OK_FD(fd, "start_server_addr")) {
-		log_err("failed to create tx socket");
-		return -1;
-	}
+	if (dont_bind_reply) {
+		/* Reply directly from server fd, specifying the source address and/or
+		 * port in struct msghdr.
+		 */
+		n = sendmsg(server_fd, &msg, 0);
+		if (CHECK(n <= 0, "sendmsg", "failed\n")) {
+			log_err("failed to send echo reply");
+			return -1;
+		}
+	} else {
+		/* Reply from original destination address. */
+		fd = socket(dst_addr->ss_family, SOCK_DGRAM, 0);
+		if (CHECK(fd < 0, "socket", "failed\n")) {
+			log_err("failed to create tx socket");
+			return -1;
+		}
 
-	msg.msg_control = NULL;
-	msg.msg_controllen = 0;
-	n = sendmsg(fd, &msg, 0);
-	if (CHECK(n <= 0, "sendmsg", "failed\n")) {
-		log_err("failed to send echo reply");
-		ret = -1;
-		goto out;
-	}
+		ret = bind(fd, (struct sockaddr *)dst_addr, sizeof(*dst_addr));
+		if (CHECK(ret, "bind", "failed\n")) {
+			log_err("failed to bind tx socket");
+			close(fd);
+			return ret;
+		}
 
-	ret = 0;
-out:
-	close(fd);
-	return ret;
+		msg.msg_control = NULL;
+		msg.msg_controllen = 0;
+		n = sendmsg(fd, &msg, 0);
+		if (CHECK(n <= 0, "sendmsg", "failed\n")) {
+			log_err("failed to send echo reply");
+			close(fd);
+			return -1;
+		}
+
+		close(fd);
+	}
 }
 
 static int tcp_echo_test(int client_fd, int server_fd)
@@ -454,14 +470,14 @@ static int tcp_echo_test(int client_fd, int server_fd)
 	return 0;
 }
 
-static int udp_echo_test(int client_fd, int server_fd)
+static int udp_echo_test(int client_fd, int server_fd, bool dont_bind_reply)
 {
 	int err;
 
 	err = send_byte(client_fd);
 	if (err)
 		return -1;
-	err = udp_recv_send(server_fd);
+	err = udp_recv_send(server_fd, dont_bind_reply);
 	if (err)
 		return -1;
 	err = recv_byte(client_fd);
@@ -653,7 +669,7 @@ static void run_lookup_prog(const struct test *t)
 	if (t->sotype == SOCK_STREAM)
 		tcp_echo_test(client_fd, server_fds[t->accept_on]);
 	else
-		udp_echo_test(client_fd, server_fds[t->accept_on]);
+		udp_echo_test(client_fd, server_fds[t->accept_on], t->dont_bind_reply);
 
 	close(client_fd);
 close:
@@ -775,6 +791,16 @@ static void test_redirect_lookup(struct test_sk_lookup *skel)
 			.listen_at	= { INT_IP4, INT_PORT },
 			.accept_on	= SERVER_B,
 		},
+		{
+			.desc		= "UDP IPv4 redir different ports",
+			.lookup_prog	= skel->progs.select_sock_a_no_reuseport,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4, INT_PORT },
+			.accept_on	= SERVER_A,
+			.dont_bind_reply = true,
+		},
 		{
 			.desc		= "UDP IPv4 redir and reuseport with conns",
 			.lookup_prog	= skel->progs.select_sock_a,

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 3/3] bpf: Add sk_lookup test to use ORIGDSTADDR cmsg.
  2024-09-13  9:39 ` [RFC PATCH 3/3] bpf: Add sk_lookup test to use ORIGDSTADDR cmsg Tiago Lam
@ 2024-09-13 12:10   ` Philo Lu
  2024-09-17 16:00     ` Tiago Lam
  0 siblings, 1 reply; 16+ messages in thread
From: Philo Lu @ 2024-09-13 12:10 UTC (permalink / raw)
  To: Tiago Lam, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan
  Cc: netdev, linux-kernel, bpf, linux-kselftest, Jakub Sitnicki,
	kernel-team

Hi Tiago,

On 2024/9/13 17:39, Tiago Lam wrote:
> This patch reuses the framework already in place for sk_lookup, allowing
> it now to send a reply from the server fd directly, instead of having to
> create a socket bound to the original destination address and reply from
> there. It does this by passing the source address and port from where to
> reply from in a IP_ORIGDSTADDR ancilliary message passed in sendmsg.
> 
> Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> ---
>   tools/testing/selftests/bpf/prog_tests/sk_lookup.c | 70 +++++++++++++++-------
>   1 file changed, 48 insertions(+), 22 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> index ae87c00867ba..b99aff2e3976 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> @@ -75,6 +75,7 @@ struct test {
>   	struct inet_addr listen_at;
>   	enum server accept_on;
>   	bool reuseport_has_conns; /* Add a connected socket to reuseport group */
> +	bool dont_bind_reply; /* Don't bind, send direct from server fd. */
>   };
>   
>   struct cb_opts {
> @@ -363,7 +364,7 @@ static void v4_to_v6(struct sockaddr_storage *ss)
>   	memset(&v6->sin6_addr.s6_addr[0], 0, 10);
>   }
>   
> -static int udp_recv_send(int server_fd)
> +static int udp_recv_send(int server_fd, bool dont_bind_reply)
>   {
>   	char cmsg_buf[CMSG_SPACE(sizeof(struct sockaddr_storage))];
>   	struct sockaddr_storage _src_addr = { 0 };
> @@ -415,26 +416,41 @@ static int udp_recv_send(int server_fd)
>   		v4_to_v6(dst_addr);
>   	}
>   
> -	/* Reply from original destination address. */
> -	fd = start_server_addr(SOCK_DGRAM, dst_addr, sizeof(*dst_addr), NULL);
> -	if (!ASSERT_OK_FD(fd, "start_server_addr")) {
> -		log_err("failed to create tx socket");
> -		return -1;
> -	}
> +	if (dont_bind_reply) {
> +		/* Reply directly from server fd, specifying the source address and/or
> +		 * port in struct msghdr.
> +		 */
> +		n = sendmsg(server_fd, &msg, 0);
> +		if (CHECK(n <= 0, "sendmsg", "failed\n")) {
> +			log_err("failed to send echo reply");
> +			return -1;
> +		}
> +	} else {
> +		/* Reply from original destination address. */
> +		fd = socket(dst_addr->ss_family, SOCK_DGRAM, 0);
> +		if (CHECK(fd < 0, "socket", "failed\n")) {
> +			log_err("failed to create tx socket");
> +			return -1;
> +		}
>   
Just curious, why not use start_server_addr() here like before?

> -	msg.msg_control = NULL;
> -	msg.msg_controllen = 0;
> -	n = sendmsg(fd, &msg, 0);
> -	if (CHECK(n <= 0, "sendmsg", "failed\n")) {
> -		log_err("failed to send echo reply");
> -		ret = -1;
> -		goto out;
> -	}
> +		ret = bind(fd, (struct sockaddr *)dst_addr, sizeof(*dst_addr));
> +		if (CHECK(ret, "bind", "failed\n")) {
> +			log_err("failed to bind tx socket");
> +			close(fd);
> +			return ret;
> +		}
>   
> -	ret = 0;
> -out:
> -	close(fd);
> -	return ret;
> +		msg.msg_control = NULL;
> +		msg.msg_controllen = 0;
> +		n = sendmsg(fd, &msg, 0);
> +		if (CHECK(n <= 0, "sendmsg", "failed\n")) {
> +			log_err("failed to send echo reply");
> +			close(fd);
> +			return -1;
> +		}
> +
> +		close(fd);
> +	}

nit: "return 0" missed.

>   }
>   

-- 
Philo


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 3/3] bpf: Add sk_lookup test to use ORIGDSTADDR cmsg.
  2024-09-13 12:10   ` Philo Lu
@ 2024-09-17 16:00     ` Tiago Lam
  0 siblings, 0 replies; 16+ messages in thread
From: Tiago Lam @ 2024-09-17 16:00 UTC (permalink / raw)
  To: Philo Lu
  Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan, netdev, linux-kernel, bpf, linux-kselftest,
	Jakub Sitnicki, kernel-team

On Fri, Sep 13, 2024 at 08:10:24PM +0800, Philo Lu wrote:
> Hi Tiago,
> 
> On 2024/9/13 17:39, Tiago Lam wrote:
> > This patch reuses the framework already in place for sk_lookup, allowing
> > it now to send a reply from the server fd directly, instead of having to
> > create a socket bound to the original destination address and reply from
> > there. It does this by passing the source address and port from where to
> > reply from in a IP_ORIGDSTADDR ancilliary message passed in sendmsg.
> > 
> > Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> > ---
> >   tools/testing/selftests/bpf/prog_tests/sk_lookup.c | 70 +++++++++++++++-------
> >   1 file changed, 48 insertions(+), 22 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> > index ae87c00867ba..b99aff2e3976 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> > @@ -75,6 +75,7 @@ struct test {
> >   	struct inet_addr listen_at;
> >   	enum server accept_on;
> >   	bool reuseport_has_conns; /* Add a connected socket to reuseport group */
> > +	bool dont_bind_reply; /* Don't bind, send direct from server fd. */
> >   };
> >   struct cb_opts {
> > @@ -363,7 +364,7 @@ static void v4_to_v6(struct sockaddr_storage *ss)
> >   	memset(&v6->sin6_addr.s6_addr[0], 0, 10);
> >   }
> > -static int udp_recv_send(int server_fd)
> > +static int udp_recv_send(int server_fd, bool dont_bind_reply)
> >   {
> >   	char cmsg_buf[CMSG_SPACE(sizeof(struct sockaddr_storage))];
> >   	struct sockaddr_storage _src_addr = { 0 };
> > @@ -415,26 +416,41 @@ static int udp_recv_send(int server_fd)
> >   		v4_to_v6(dst_addr);
> >   	}
> > -	/* Reply from original destination address. */
> > -	fd = start_server_addr(SOCK_DGRAM, dst_addr, sizeof(*dst_addr), NULL);
> > -	if (!ASSERT_OK_FD(fd, "start_server_addr")) {
> > -		log_err("failed to create tx socket");
> > -		return -1;
> > -	}
> > +	if (dont_bind_reply) {
> > +		/* Reply directly from server fd, specifying the source address and/or
> > +		 * port in struct msghdr.
> > +		 */
> > +		n = sendmsg(server_fd, &msg, 0);
> > +		if (CHECK(n <= 0, "sendmsg", "failed\n")) {
> > +			log_err("failed to send echo reply");
> > +			return -1;
> > +		}
> > +	} else {
> > +		/* Reply from original destination address. */
> > +		fd = socket(dst_addr->ss_family, SOCK_DGRAM, 0);
> > +		if (CHECK(fd < 0, "socket", "failed\n")) {
> > +			log_err("failed to create tx socket");
> > +			return -1;
> > +		}
> Just curious, why not use start_server_addr() here like before?
> 

Thanks, you're right. Initially I rebased this on commit
e3d332aaf898ed755b29c8cdf59be2cfba1cac4b (v6.6.31), which didn't have
start_server_addr(), and used bind() instead. I must have messed up the
rebased.

I've fixed this and included your other suggestion in the patch below
and will fold it into the next revision.

diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
index b99aff2e3976..0628a67681c5 100644
--- a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
+++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
@@ -374,7 +374,7 @@ static int udp_recv_send(int server_fd, bool dont_bind_reply)
        struct iovec iov = { 0 };
        struct cmsghdr *cm;
        char buf[1];
-       int ret, fd;
+       int fd;
        ssize_t n;
 
        iov.iov_base = buf;
@@ -427,19 +427,12 @@ static int udp_recv_send(int server_fd, bool dont_bind_reply)
                }
        } else {
                /* Reply from original destination address. */
-               fd = socket(dst_addr->ss_family, SOCK_DGRAM, 0);
-               if (CHECK(fd < 0, "socket", "failed\n")) {
+               fd = start_server_addr(SOCK_DGRAM, dst_addr, sizeof(*dst_addr), NULL);
+               if (!ASSERT_OK_FD(fd, "start_server_addr")) {
                        log_err("failed to create tx socket");
                        return -1;
                }
 
-               ret = bind(fd, (struct sockaddr *)dst_addr, sizeof(*dst_addr));
-               if (CHECK(ret, "bind", "failed\n")) {
-                       log_err("failed to bind tx socket");
-                       close(fd);
-                       return ret;
-               }
-
                msg.msg_control = NULL;
                msg.msg_controllen = 0;
                n = sendmsg(fd, &msg, 0);
@@ -451,6 +444,8 @@ static int udp_recv_send(int server_fd, bool dont_bind_reply)
 
                close(fd);
        }
+
+       return 0;
 }
 
 static int tcp_echo_test(int client_fd, int server_fd)

^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-10-11 11:22 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-13  9:39 [RFC PATCH 0/3] Allow sk_lookup UDP return traffic to egress Tiago Lam
2024-09-13  9:39 ` [RFC PATCH 1/3] ipv4: Run a reverse sk_lookup on sendmsg Tiago Lam
2024-09-18 12:45   ` Willem de Bruijn
2024-09-20 16:57     ` Tiago Lam
2024-09-13  9:39 ` [RFC PATCH 2/3] ipv6: " Tiago Lam
2024-09-13 18:24   ` Martin KaFai Lau
2024-09-17 16:15     ` Tiago Lam
2024-09-24 23:58       ` Martin KaFai Lau
2024-10-11 11:21         ` Tiago Lam
2024-09-14  8:59   ` Simon Horman
2024-09-17 16:06     ` Tiago Lam
2024-09-14 11:40   ` Eric Dumazet
2024-09-17 16:03     ` Tiago Lam
2024-09-13  9:39 ` [RFC PATCH 3/3] bpf: Add sk_lookup test to use ORIGDSTADDR cmsg Tiago Lam
2024-09-13 12:10   ` Philo Lu
2024-09-17 16:00     ` Tiago Lam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).