netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/3] ip: improve tcp sock multipath routing
@ 2025-04-20 18:04 Willem de Bruijn
  2025-04-20 18:04 ` [PATCH net-next 1/3] ipv4: prefer multipath nexthop that matches source address Willem de Bruijn
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Willem de Bruijn @ 2025-04-20 18:04 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, edumazet, pabeni, dsahern, horms, idosch, kuniyu,
	Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Improve layer 4 multipath hash policy for local tcp connections:

patch 1: Select a source address that matches the nexthop device.
         Due to tcp_v4_connect making separate route lookups for saddr
         and route, the two can currently be inconsistent.

patch 2: Use all paths when opening multiple local tcp connections to
         the same ip address and port.

patch 3: Test the behavior of patch 2. Extend the fib_nexthops.sh
         testsuite with a test that opens many connections, counting
         SYNs on both egress devices.

Willem de Bruijn (3):
  ipv4: prefer multipath nexthop that matches source address
  ip: load balance tcp connections to single dst addr and port
  selftests/net: test tcp connection load balancing

 include/net/flow.h                          |  1 +
 include/net/ip_fib.h                        |  3 +-
 include/net/route.h                         |  3 +
 net/ipv4/fib_semantics.c                    | 39 ++++++----
 net/ipv4/route.c                            | 15 +++-
 net/ipv6/route.c                            | 13 +++-
 net/ipv6/tcp_ipv6.c                         |  2 +
 tools/testing/selftests/net/fib_nexthops.sh | 83 +++++++++++++++++++++
 8 files changed, 137 insertions(+), 22 deletions(-)

-- 
2.49.0.805.g082f7c87e0-goog



* [PATCH net-next 1/3] ipv4: prefer multipath nexthop that matches source address
  2025-04-20 18:04 [PATCH net-next 0/3] ip: improve tcp sock multipath routing Willem de Bruijn
@ 2025-04-20 18:04 ` Willem de Bruijn
  2025-04-22 16:06   ` David Ahern
  2025-04-20 18:04 ` [PATCH net-next 2/3] ip: load balance tcp connections to single dst addr and port Willem de Bruijn
  2025-04-20 18:04 ` [PATCH net-next 3/3] selftests/net: test tcp connection load balancing Willem de Bruijn
  2 siblings, 1 reply; 10+ messages in thread
From: Willem de Bruijn @ 2025-04-20 18:04 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, edumazet, pabeni, dsahern, horms, idosch, kuniyu,
	Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

With multipath routes, try to ensure that packets leave on the device
that is associated with the source address.

Avoid the following tcpdump example:

    veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S]
    veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S]

Which can happen easily with the most straightforward setup:

    ip addr add 10.0.0.1/24 dev veth0
    ip addr add 10.1.0.1/24 dev veth1

    ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \
    			  nexthop via 10.1.0.2 dev veth1

This is apparently considered WAI, based on the comment in
ip_route_output_key_hash_rcu:

    * 2. Moreover, we are allowed to send packets with saddr
    *    of another iface. --ANK

It may be ok for some uses of multipath, but not all. For instance,
when using two ISPs, a router may drop packets with unknown source.
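
On a Linux box acting as that upstream router, this would for instance
be strict reverse path filtering; shown only as an illustration:

    # strict uRPF: drop packets whose source address does not match a
    # route back out of the ingress interface
    sysctl -w net.ipv4.conf.all.rp_filter=1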

The behavior occurs because tcp_v4_connect makes three route
lookups when establishing a connection:

1. ip_route_connect calls to select a source address, with saddr zero.
2. ip_route_connect calls again now that saddr and daddr are known.
3. ip_route_newports calls again after a source port is also chosen.

With a route with multiple nexthops, each lookup may make a different
choice depending on available entropy to fib_select_multipath. So it
is possible for 1 to select the saddr from the first entry, but 3 to
select the second entry. Leading to the above situation.
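
The divergence is easy to observe on the setup above with an L4 hash
policy, assuming a peer answers on 10.2.0.3:8000. A rough sketch, not
the exact commands used:

    # L4 hashing, so that ports influence nexthop selection
    sysctl -w net.ipv4.fib_multipath_hash_policy=1
    tcpdump -ni veth0 'tcp[tcpflags] & tcp-syn != 0' &
    tcpdump -ni veth1 'tcp[tcpflags] & tcp-syn != 0' &
    for i in $(seq 10); do
        timeout 1 bash -c 'echo > /dev/tcp/10.2.0.3/8000'
    done

Any SYN that leaves the device that does not own its source address is
an instance of the problem.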

Address this by preferring a nexthop whose saddr matches the flowi4
saddr, so that lookups 2 and 3 make the same choice as lookup 1.
Continue to update the backup choice until a nexthop that matches
saddr is found.

Do this in fib_select_multipath itself, rather than passing an fl4_oif
constraint, to avoid changing non-multipath route selection. Commit
e6b45241c57a ("ipv4: reset flowi parameters on route connect") shows
how that may cause regressions.

Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to
refresh in the loop.

This does not happen in IPv6, which performs only one lookup.

Signed-off-by: Willem de Bruijn <willemb@google.com>

Side-quest: I wonder if the second route lookup in ip_route_connect
is vestigial since the introduction of the third route lookup with
ip_route_newports. IPv6 has neither second nor third lookup, which
hints that perhaps both can be removed.
---
 include/net/ip_fib.h     |  3 ++-
 net/ipv4/fib_semantics.c | 39 +++++++++++++++++++++++++--------------
 net/ipv4/route.c         |  2 +-
 3 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e3864b74e92a..48bb3cf41469 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -574,7 +574,8 @@ static inline u32 fib_multipath_hash_from_keys(const struct net *net,
 
 int fib_check_nh(struct net *net, struct fib_nh *nh, u32 table, u8 scope,
 		 struct netlink_ext_ack *extack);
-void fib_select_multipath(struct fib_result *res, int hash);
+void fib_select_multipath(struct fib_result *res, int hash,
+			  const struct flowi4 *fl4);
 void fib_select_path(struct net *net, struct fib_result *res,
 		     struct flowi4 *fl4, const struct sk_buff *skb);
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f68bb9e34c34..b5d21763dfaf 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -2168,34 +2168,45 @@ static bool fib_good_nh(const struct fib_nh *nh)
 	return !!(state & NUD_VALID);
 }
 
-void fib_select_multipath(struct fib_result *res, int hash)
+void fib_select_multipath(struct fib_result *res, int hash,
+			  const struct flowi4 *fl4)
 {
 	struct fib_info *fi = res->fi;
 	struct net *net = fi->fib_net;
-	bool first = false;
+	bool found = false;
+	bool use_neigh;
+	__be32 saddr;
 
 	if (unlikely(res->fi->nh)) {
 		nexthop_path_fib_result(res, hash);
 		return;
 	}
 
+	use_neigh = READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh);
+	saddr = fl4 ? fl4->saddr : 0;
+
 	change_nexthops(fi) {
-		if (READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh)) {
-			if (!fib_good_nh(nexthop_nh))
-				continue;
-			if (!first) {
-				res->nh_sel = nhsel;
-				res->nhc = &nexthop_nh->nh_common;
-				first = true;
-			}
+		if (use_neigh && !fib_good_nh(nexthop_nh))
+			continue;
+
+		if (!found) {
+			res->nh_sel = nhsel;
+			res->nhc = &nexthop_nh->nh_common;
+			found = !saddr || nexthop_nh->nh_saddr == saddr;
 		}
 
 		if (hash > atomic_read(&nexthop_nh->fib_nh_upper_bound))
 			continue;
 
-		res->nh_sel = nhsel;
-		res->nhc = &nexthop_nh->nh_common;
-		return;
+		if (!saddr || nexthop_nh->nh_saddr == saddr) {
+			res->nh_sel = nhsel;
+			res->nhc = &nexthop_nh->nh_common;
+			return;
+		}
+
+		if (found)
+			return;
+
 	} endfor_nexthops(fi);
 }
 #endif
@@ -2210,7 +2221,7 @@ void fib_select_path(struct net *net, struct fib_result *res,
 	if (fib_info_num_path(res->fi) > 1) {
 		int h = fib_multipath_hash(net, fl4, skb, NULL);
 
-		fib_select_multipath(res, h);
+		fib_select_multipath(res, h, fl4);
 	}
 	else
 #endif
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 49cffbe83802..e5e4c71be3af 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2154,7 +2154,7 @@ ip_mkroute_input(struct sk_buff *skb, struct fib_result *res,
 	if (res->fi && fib_info_num_path(res->fi) > 1) {
 		int h = fib_multipath_hash(res->fi->fib_net, NULL, skb, hkeys);
 
-		fib_select_multipath(res, h);
+		fib_select_multipath(res, h, NULL);
 		IPCB(skb)->flags |= IPSKB_MULTIPATH;
 	}
 #endif
-- 
2.49.0.805.g082f7c87e0-goog



* [PATCH net-next 2/3] ip: load balance tcp connections to single dst addr and port
  2025-04-20 18:04 [PATCH net-next 0/3] ip: improve tcp sock multipath routing Willem de Bruijn
  2025-04-20 18:04 ` [PATCH net-next 1/3] ipv4: prefer multipath nexthop that matches source address Willem de Bruijn
@ 2025-04-20 18:04 ` Willem de Bruijn
  2025-04-21 13:54   ` Willem de Bruijn
  2025-04-22 16:41   ` David Ahern
  2025-04-20 18:04 ` [PATCH net-next 3/3] selftests/net: test tcp connection load balancing Willem de Bruijn
  2 siblings, 2 replies; 10+ messages in thread
From: Willem de Bruijn @ 2025-04-20 18:04 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, edumazet, pabeni, dsahern, horms, idosch, kuniyu,
	Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Load balance new TCP connections across nexthops also when they
connect to the same service at a single remote address and port.

This affects only port-based multipath hashing:
fib_multipath_hash_policy 1 or 3.
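
For reference, this is the per-netns sysctl; policy 3 additionally
requires the source port to be enabled in fib_multipath_hash_fields
(see Documentation/networking/ip-sysctl.rst):

    sysctl -w net.ipv4.fib_multipath_hash_policy=1
    sysctl -w net.ipv6.fib_multipath_hash_policy=1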

Local connections must choose both a source address and port when
connecting to a remote service, in ip_route_connect. This
"chicken-and-egg problem" (commit 2d7192d6cbab ("ipv4: Sanitize and
simplify ip_route_{connect,newports}()")) is resolved by first
selecting a source address, by looking up a route using the zero
wildcard source port and address.

As a result multiple connections to the same destination address and
port have no entropy in fib_multipath_hash.

This is not a problem when forwarding, as skb-based hashing has a
4-tuple. Nor when establishing UDP connections, as autobind there
selects a port before reaching ip_route_connect.

Load balance also TCP, by using a random port in fib_multipath_hash.
Port assignment in inet_hash_connect is not atomic with
ip_route_connect. Thus ports are unpredictable, effectively random.

Implementation details:

Do not actually pass a random fl4_sport, as that affects not only
hashing, but routing more broadly, and can match a source port based
policy route, which existing wildcard port 0 will not. Instead,
define a new wildcard flowi flag that is used only for hashing.

Selecting a random source is equivalent to just selecting a random
hash entirely. But for code clarity, follow the normal 4-tuple hash
process and only update this field.

fib_multipath_hash can be reached with zero sport from other code
paths, so explicitly pass this flowi flag, rather than trying to infer
this case in the function itself.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/flow.h  |  1 +
 include/net/route.h |  3 +++
 net/ipv4/route.c    | 13 ++++++++++---
 net/ipv6/route.c    | 13 ++++++++++---
 net/ipv6/tcp_ipv6.c |  2 ++
 5 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 2a3f0c42f092..a1839c278d87 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -39,6 +39,7 @@ struct flowi_common {
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
 #define FLOWI_FLAG_L3MDEV_OIF		0x04
+#define FLOWI_FLAG_ANY_SPORT		0x08
 	__u32	flowic_secid;
 	kuid_t  flowic_uid;
 	__u32		flowic_multipath_hash;
diff --git a/include/net/route.h b/include/net/route.h
index c605fd5ec0c0..8e39aa822cf9 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -326,6 +326,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst,
 	if (inet_test_bit(TRANSPARENT, sk))
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 
+	if (IS_ENABLED(CONFIG_IP_ROUTE_MULTIPATH) && !sport)
+		flow_flags |= FLOWI_FLAG_ANY_SPORT;
+
 	flowi4_init_output(fl4, oif, READ_ONCE(sk->sk_mark), ip_sock_rt_tos(sk),
 			   ip_sock_rt_scope(sk), protocol, flow_flags, dst,
 			   src, dport, sport, sk->sk_uid);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e5e4c71be3af..685e8d3b4f5d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2037,8 +2037,12 @@ static u32 fib_multipath_custom_hash_fl4(const struct net *net,
 		hash_keys.addrs.v4addrs.dst = fl4->daddr;
 	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_IP_PROTO)
 		hash_keys.basic.ip_proto = fl4->flowi4_proto;
-	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT)
-		hash_keys.ports.src = fl4->fl4_sport;
+	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT) {
+		if (fl4->flowi4_flags & FLOWI_FLAG_ANY_SPORT)
+			hash_keys.ports.src = get_random_u16();
+		else
+			hash_keys.ports.src = fl4->fl4_sport;
+	}
 	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_DST_PORT)
 		hash_keys.ports.dst = fl4->fl4_dport;
 
@@ -2093,7 +2097,10 @@ int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
 			hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
 			hash_keys.addrs.v4addrs.src = fl4->saddr;
 			hash_keys.addrs.v4addrs.dst = fl4->daddr;
-			hash_keys.ports.src = fl4->fl4_sport;
+			if (fl4->flowi4_flags & FLOWI_FLAG_ANY_SPORT)
+				hash_keys.ports.src = get_random_u16();
+			else
+				hash_keys.ports.src = fl4->fl4_sport;
 			hash_keys.ports.dst = fl4->fl4_dport;
 			hash_keys.basic.ip_proto = fl4->flowi4_proto;
 		}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 945857a8bfe3..39f07cdbbc64 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2492,8 +2492,12 @@ static u32 rt6_multipath_custom_hash_fl6(const struct net *net,
 		hash_keys.basic.ip_proto = fl6->flowi6_proto;
 	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_FLOWLABEL)
 		hash_keys.tags.flow_label = (__force u32)flowi6_get_flowlabel(fl6);
-	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT)
-		hash_keys.ports.src = fl6->fl6_sport;
+	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT) {
+		if (fl6->flowi6_flags & FLOWI_FLAG_ANY_SPORT)
+			hash_keys.ports.src = get_random_u16();
+		else
+			hash_keys.ports.src = fl6->fl6_sport;
+	}
 	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_DST_PORT)
 		hash_keys.ports.dst = fl6->fl6_dport;
 
@@ -2547,7 +2551,10 @@ u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6,
 			hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
 			hash_keys.addrs.v6addrs.src = fl6->saddr;
 			hash_keys.addrs.v6addrs.dst = fl6->daddr;
-			hash_keys.ports.src = fl6->fl6_sport;
+			if (fl6->flowi6_flags & FLOWI_FLAG_ANY_SPORT)
+				hash_keys.ports.src = get_random_u16();
+			else
+				hash_keys.ports.src = fl6->fl6_sport;
 			hash_keys.ports.dst = fl6->fl6_dport;
 			hash_keys.basic.ip_proto = fl6->flowi6_proto;
 		}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 7dcb33f879ee..e8e68a142649 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -267,6 +267,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	fl6.flowi6_mark = sk->sk_mark;
 	fl6.fl6_dport = usin->sin6_port;
 	fl6.fl6_sport = inet->inet_sport;
+	if (IS_ENABLED(CONFIG_IP_ROUTE_MULTIPATH) && !fl6.fl6_sport)
+		fl6.flowi6_flags = FLOWI_FLAG_ANY_SPORT;
 	fl6.flowi6_uid = sk->sk_uid;
 
 	opt = rcu_dereference_protected(np->opt, lockdep_sock_is_held(sk));
-- 
2.49.0.805.g082f7c87e0-goog



* [PATCH net-next 3/3] selftests/net: test tcp connection load balancing
  2025-04-20 18:04 [PATCH net-next 0/3] ip: improve tcp sock multipath routing Willem de Bruijn
  2025-04-20 18:04 ` [PATCH net-next 1/3] ipv4: prefer multipath nexthop that matches source address Willem de Bruijn
  2025-04-20 18:04 ` [PATCH net-next 2/3] ip: load balance tcp connections to single dst addr and port Willem de Bruijn
@ 2025-04-20 18:04 ` Willem de Bruijn
  2025-04-23  9:05   ` Ido Schimmel
  2 siblings, 1 reply; 10+ messages in thread
From: Willem de Bruijn @ 2025-04-20 18:04 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, edumazet, pabeni, dsahern, horms, idosch, kuniyu,
	Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Verify that TCP connections use both routes when connecting multiple
times to a remote service over a two nexthop multipath route.

Use netcat to create the connections. Use tc prio + tc filter to
count routes taken, counting SYN packets across the two egress
devices.

To avoid flaky tests when testing inherently randomized behavior,
set a low bar and pass if even a single SYN is observed on both
devices.

Signed-off-by: Willem de Bruijn <willemb@google.com>

---

Integrated into fib_nexthops.sh as it covers multipath nexthop
routing and can reuse all of its setup(), but technically the test
does not use nexthop *objects* as is, so I can also move it into a
separate file and move common setup code to lib.sh if preferred.
---
 tools/testing/selftests/net/fib_nexthops.sh | 83 +++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
index b39f748c2572..93d19e92bd5b 100755
--- a/tools/testing/selftests/net/fib_nexthops.sh
+++ b/tools/testing/selftests/net/fib_nexthops.sh
@@ -31,6 +31,7 @@ IPV4_TESTS="
 	ipv4_compat_mode
 	ipv4_fdb_grp_fcnal
 	ipv4_mpath_select
+	ipv4_mpath_balance
 	ipv4_torture
 	ipv4_res_torture
 "
@@ -45,6 +46,7 @@ IPV6_TESTS="
 	ipv6_compat_mode
 	ipv6_fdb_grp_fcnal
 	ipv6_mpath_select
+	ipv6_mpath_balance
 	ipv6_torture
 	ipv6_res_torture
 "
@@ -2110,6 +2112,87 @@ ipv4_res_torture()
 	log_test 0 0 "IPv4 resilient nexthop group torture test"
 }
 
+# Install a prio qdisc with separate bands counting IPv4 and IPv6 SYNs
+tc_add_syn_counter() {
+	local -r dev=$1
+
+	# qdisc with band 1 for no-match, band 2 for ipv4, band 3 for ipv6
+	ip netns exec $me tc qdisc add dev $dev root handle 1: prio bands 3 \
+		priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+	ip netns exec $me tc qdisc add dev $dev parent 1:1 handle 2: pfifo
+	ip netns exec $me tc qdisc add dev $dev parent 1:2 handle 4: pfifo
+	ip netns exec $me tc qdisc add dev $dev parent 1:3 handle 6: pfifo
+
+	# ipv4 filter on SYN flag set: band 2
+	ip netns exec $me tc filter add dev $dev parent 1: protocol ip u32 \
+		match ip protocol 6 0xff \
+		match ip dport 8000 0xffff \
+		match u8 0x02 0xff at 33 \
+		flowid 1:2
+
+	# ipv6 filter on SYN flag set: band 3
+	ip netns exec $me tc filter add dev $dev parent 1: protocol ipv6 u32 \
+		match ip6 protocol 6 0xff \
+		match ip6 dport 8000 0xffff \
+		match u8 0x02 0xff at 53 \
+		flowid 1:3
+}
+
+tc_get_syn_counter() {
+	ip netns exec $me tc -j -s qdisc show dev $1 handle $2 | jq .[0].packets
+}
+
+ip_mpath_balance() {
+	local -r ipver="-$1"
+	local -r daddr=$2
+	local -r handle="$1:"
+	local -r num_conn=20
+
+	tc_add_syn_counter veth1
+	tc_add_syn_counter veth3
+
+	for i in $(seq 1 $num_conn); do
+		ip netns exec $remote nc $ipver -l -p 8000 >/dev/null &
+		echo -n a | ip netns exec $me nc $ipver -q 0 $daddr 8000
+	done
+
+	local -r syn0="$(tc_get_syn_counter veth1 $handle)"
+	local -r syn1="$(tc_get_syn_counter veth3 $handle)"
+	local -r syns=$((syn0+syn1))
+
+	[ "$VERBOSE" = "1" ] && echo "multipath: syns seen: ($syn0,$syn1)"
+
+	[[ $syns -ge $num_conn ]] && [[ $syn0 -gt 0 ]] && [[ $syn1 -gt 0 ]]
+}
+
+ipv4_mpath_balance()
+{
+	$IP route add 172.16.101.1 \
+		nexthop via 172.16.1.2 \
+		nexthop via 172.16.2.2
+
+	ip netns exec $me \
+		sysctl -q -w net.ipv4.fib_multipath_hash_policy=1
+
+	ip_mpath_balance 4 172.16.101.1
+
+	log_test $? 0 "Multipath loadbalance"
+}
+
+ipv6_mpath_balance()
+{
+	$IP route add 2001:db8:101::1\
+		nexthop via 2001:db8:91::2 \
+		nexthop via 2001:db8:92::2
+
+	ip netns exec $me \
+		sysctl -q -w net.ipv6.fib_multipath_hash_policy=1
+
+	ip_mpath_balance 6 2001:db8:101::1
+
+	log_test $? 0 "Multipath loadbalance"
+}
+
 basic()
 {
 	echo
-- 
2.49.0.805.g082f7c87e0-goog



* Re: [PATCH net-next 2/3] ip: load balance tcp connections to single dst addr and port
  2025-04-20 18:04 ` [PATCH net-next 2/3] ip: load balance tcp connections to single dst addr and port Willem de Bruijn
@ 2025-04-21 13:54   ` Willem de Bruijn
  2025-04-22 16:41   ` David Ahern
  1 sibling, 0 replies; 10+ messages in thread
From: Willem de Bruijn @ 2025-04-21 13:54 UTC (permalink / raw)
  To: Willem de Bruijn, netdev
  Cc: davem, kuba, edumazet, pabeni, dsahern, horms, idosch, kuniyu,
	Willem de Bruijn

Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Load balance new TCP connections across nexthops also when they
> connect to the same service at a single remote address and port.
> 
> This affects only port-based multipath hashing:
> fib_multipath_hash_policy 1 or 3.
> 
> Local connections must choose both a source address and port when
> connecting to a remote service, in ip_route_connect. This
> "chicken-and-egg problem" (commit 2d7192d6cbab ("ipv4: Sanitize and
> simplify ip_route_{connect,newports}()")) is resolved by first
> selecting a source address, by looking up a route using the zero
> wildcard source port and address.
> 
> As a result multiple connections to the same destination address and
> port have no entropy in fib_multipath_hash.
> 
> This is not a problem when forwarding, as skb-based hashing has a
> 4-tuple. Nor when establishing UDP connections, as autobind there
> selects a port before reaching ip_route_connect.
> 
> Load balance also TCP, by using a random port in fib_multipath_hash.
> Port assignment in inet_hash_connect is not atomic with
> ip_route_connect. Thus ports are unpredictable, effectively random.
> 
> Implementation details:
> 
> Do not actually pass a random fl4_sport, as that affects not only
> hashing, but routing more broadly, and can match a source port based
> policy route, which existing wildcard port 0 will not. Instead,
> define a new wildcard flowi flag that is used only for hashing.
> 
> Selecting a random source is equivalent to just selecting a random
> hash entirely. But for code clarity, follow the normal 4-tuple hash
> process and only update this field.
> 
> fib_multipath_hash can be reached with zero sport from other code
> paths, so explicitly pass this flowi flag, rather than trying to infer
> this case in the function itself.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  include/net/flow.h  |  1 +
>  include/net/route.h |  3 +++
>  net/ipv4/route.c    | 13 ++++++++++---
>  net/ipv6/route.c    | 13 ++++++++++---
>  net/ipv6/tcp_ipv6.c |  2 ++
>  5 files changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/include/net/flow.h b/include/net/flow.h
> index 2a3f0c42f092..a1839c278d87 100644
> --- a/include/net/flow.h
> +++ b/include/net/flow.h
> @@ -39,6 +39,7 @@ struct flowi_common {
>  #define FLOWI_FLAG_ANYSRC		0x01
>  #define FLOWI_FLAG_KNOWN_NH		0x02
>  #define FLOWI_FLAG_L3MDEV_OIF		0x04
> +#define FLOWI_FLAG_ANY_SPORT		0x08
>  	__u32	flowic_secid;
>  	kuid_t  flowic_uid;
>  	__u32		flowic_multipath_hash;
> diff --git a/include/net/route.h b/include/net/route.h
> index c605fd5ec0c0..8e39aa822cf9 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -326,6 +326,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst,
>  	if (inet_test_bit(TRANSPARENT, sk))
>  		flow_flags |= FLOWI_FLAG_ANYSRC;
>  
> +	if (IS_ENABLED(CONFIG_IP_ROUTE_MULTIPATH) && !sport)
> +		flow_flags |= FLOWI_FLAG_ANY_SPORT;
> +
>  	flowi4_init_output(fl4, oif, READ_ONCE(sk->sk_mark), ip_sock_rt_tos(sk),
>  			   ip_sock_rt_scope(sk), protocol, flow_flags, dst,
>  			   src, dport, sport, sk->sk_uid);
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index e5e4c71be3af..685e8d3b4f5d 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2037,8 +2037,12 @@ static u32 fib_multipath_custom_hash_fl4(const struct net *net,
>  		hash_keys.addrs.v4addrs.dst = fl4->daddr;
>  	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_IP_PROTO)
>  		hash_keys.basic.ip_proto = fl4->flowi4_proto;
> -	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT)
> -		hash_keys.ports.src = fl4->fl4_sport;
> +	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT) {
> +		if (fl4->flowi4_flags & FLOWI_FLAG_ANY_SPORT)
> +			hash_keys.ports.src = get_random_u16();
> +		else
> +			hash_keys.ports.src = fl4->fl4_sport;
> +	}
>  	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_DST_PORT)
>  		hash_keys.ports.dst = fl4->fl4_dport;
>  
> @@ -2093,7 +2097,10 @@ int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
>  			hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
>  			hash_keys.addrs.v4addrs.src = fl4->saddr;
>  			hash_keys.addrs.v4addrs.dst = fl4->daddr;
> -			hash_keys.ports.src = fl4->fl4_sport;
> +			if (fl4->flowi4_flags & FLOWI_FLAG_ANY_SPORT)
> +				hash_keys.ports.src = get_random_u16();
> +			else
> +				hash_keys.ports.src = fl4->fl4_sport;
>  			hash_keys.ports.dst = fl4->fl4_dport;
>  			hash_keys.basic.ip_proto = fl4->flowi4_proto;
>  		}
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 945857a8bfe3..39f07cdbbc64 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2492,8 +2492,12 @@ static u32 rt6_multipath_custom_hash_fl6(const struct net *net,
>  		hash_keys.basic.ip_proto = fl6->flowi6_proto;
>  	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_FLOWLABEL)
>  		hash_keys.tags.flow_label = (__force u32)flowi6_get_flowlabel(fl6);
> -	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT)
> -		hash_keys.ports.src = fl6->fl6_sport;
> +	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_SRC_PORT) {
> +		if (fl6->flowi6_flags & FLOWI_FLAG_ANY_SPORT)
> +			hash_keys.ports.src = get_random_u16();
> +		else
> +			hash_keys.ports.src = fl6->fl6_sport;
> +	}
>  	if (hash_fields & FIB_MULTIPATH_HASH_FIELD_DST_PORT)
>  		hash_keys.ports.dst = fl6->fl6_dport;
>  
> @@ -2547,7 +2551,10 @@ u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6,
>  			hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
>  			hash_keys.addrs.v6addrs.src = fl6->saddr;
>  			hash_keys.addrs.v6addrs.dst = fl6->daddr;
> -			hash_keys.ports.src = fl6->fl6_sport;
> +			if (fl6->flowi6_flags & FLOWI_FLAG_ANY_SPORT)
> +				hash_keys.ports.src = get_random_u16();

I missed the __be16 endianness here and in the related cases

That'll teach me to forget running

    make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/route.o [...]


> +			else
> +				hash_keys.ports.src = fl6->fl6_sport;


* Re: [PATCH net-next 1/3] ipv4: prefer multipath nexthop that matches source address
  2025-04-20 18:04 ` [PATCH net-next 1/3] ipv4: prefer multipath nexthop that matches source address Willem de Bruijn
@ 2025-04-22 16:06   ` David Ahern
  0 siblings, 0 replies; 10+ messages in thread
From: David Ahern @ 2025-04-22 16:06 UTC (permalink / raw)
  To: Willem de Bruijn, netdev
  Cc: davem, kuba, edumazet, pabeni, horms, idosch, kuniyu,
	Willem de Bruijn

On 4/20/25 12:04 PM, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> With multipath routes, try to ensure that packets leave on the device
> that is associated with the source address.
> 
> Avoid the following tcpdump example:
> 
>     veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S]
>     veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S]
> 
> Which can happen easily with the most straightforward setup:
> 
>     ip addr add 10.0.0.1/24 dev veth0
>     ip addr add 10.1.0.1/24 dev veth1
> 
>     ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \
>     			  nexthop via 10.1.0.2 dev veth1
> 
> This is apparently considered WAI, based on the comment in
> ip_route_output_key_hash_rcu:
> 
>     * 2. Moreover, we are allowed to send packets with saddr
>     *    of another iface. --ANK
> 
> It may be ok for some uses of multipath, but not all. For instance,
> when using two ISPs, a router may drop packets with unknown source.
> 
> The behavior occurs because tcp_v4_connect makes three route
> lookups when establishing a connection:
> 
> 1. ip_route_connect calls to select a source address, with saddr zero.
> 2. ip_route_connect calls again now that saddr and daddr are known.
> 3. ip_route_newports calls again after a source port is also chosen.
> 
> With a route with multiple nexthops, each lookup may make a different
> choice depending on available entropy to fib_select_multipath. So it
> is possible for 1 to select the saddr from the first entry, but 3 to
> select the second entry. Leading to the above situation.
> 
> Address this by preferring a nexthop whose saddr matches the flowi4
> saddr, so that lookups 2 and 3 make the same choice as lookup 1.
> Continue to update the backup choice until a nexthop that matches
> saddr is found.
> 
> Do this in fib_select_multipath itself, rather than passing an fl4_oif
> constraint, to avoid changing non-multipath route selection. Commit
> e6b45241c57a ("ipv4: reset flowi parameters on route connect") shows
> how that may cause regressions.
> 
> Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to
> refresh in the loop.
> 
> This does not happen in IPv6, which performs only one lookup.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> 
> Side-quest: I wonder if the second route lookup in ip_route_connect
> is vestigial since the introduction of the third route lookup with
> ip_route_newports. IPv6 has neither second nor third lookup, which
> hints that perhaps both can be removed.
> ---
>  include/net/ip_fib.h     |  3 ++-
>  net/ipv4/fib_semantics.c | 39 +++++++++++++++++++++++++--------------
>  net/ipv4/route.c         |  2 +-
>  3 files changed, 28 insertions(+), 16 deletions(-)
> 

Reviewed-by: David Ahern <dsahern@kernel.org>




* Re: [PATCH net-next 2/3] ip: load balance tcp connections to single dst addr and port
  2025-04-20 18:04 ` [PATCH net-next 2/3] ip: load balance tcp connections to single dst addr and port Willem de Bruijn
  2025-04-21 13:54   ` Willem de Bruijn
@ 2025-04-22 16:41   ` David Ahern
  2025-04-22 18:07     ` Willem de Bruijn
  1 sibling, 1 reply; 10+ messages in thread
From: David Ahern @ 2025-04-22 16:41 UTC (permalink / raw)
  To: Willem de Bruijn, netdev
  Cc: davem, kuba, edumazet, pabeni, horms, idosch, kuniyu,
	Willem de Bruijn

On 4/20/25 12:04 PM, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Load balance new TCP connections across nexthops also when they
> connect to the same service at a single remote address and port.
> 
> This affects only port-based multipath hashing:
> fib_multipath_hash_policy 1 or 3.
> 
> Local connections must choose both a source address and port when
> connecting to a remote service, in ip_route_connect. This
> "chicken-and-egg problem" (commit 2d7192d6cbab ("ipv4: Sanitize and
> simplify ip_route_{connect,newports}()")) is resolved by first
> selecting a source address, by looking up a route using the zero
> wildcard source port and address.
> 
> As a result multiple connections to the same destination address and
> port have no entropy in fib_multipath_hash.
> 
> This is not a problem when forwarding, as skb-based hashing has a
> 4-tuple. Nor when establishing UDP connections, as autobind there
> selects a port before reaching ip_route_connect.
> 
> Load balance also TCP, by using a random port in fib_multipath_hash.
> Port assignment in inet_hash_connect is not atomic with
> ip_route_connect. Thus ports are unpredictable, effectively random.
> 

can the call to inet_hash_connect be moved up? Get an actual sport
assignment and then use it for routing lookups.




* Re: [PATCH net-next 2/3] ip: load balance tcp connections to single dst addr and port
  2025-04-22 16:41   ` David Ahern
@ 2025-04-22 18:07     ` Willem de Bruijn
  0 siblings, 0 replies; 10+ messages in thread
From: Willem de Bruijn @ 2025-04-22 18:07 UTC (permalink / raw)
  To: David Ahern, Willem de Bruijn, netdev
  Cc: davem, kuba, edumazet, pabeni, horms, idosch, kuniyu,
	Willem de Bruijn

David Ahern wrote:
> On 4/20/25 12:04 PM, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@google.com>
> > 
> > Load balance new TCP connections across nexthops also when they
> > connect to the same service at a single remote address and port.
> > 
> > This affects only port-based multipath hashing:
> > fib_multipath_hash_policy 1 or 3.
> > 
> > Local connections must choose both a source address and port when
> > connecting to a remote service, in ip_route_connect. This
> > "chicken-and-egg problem" (commit 2d7192d6cbab ("ipv4: Sanitize and
> > simplify ip_route_{connect,newports}()")) is resolved by first
> > selecting a source address, by looking up a route using the zero
> > wildcard source port and address.
> > 
> > As a result multiple connections to the same destination address and
> > port have no entropy in fib_multipath_hash.
> > 
> > This is not a problem when forwarding, as skb-based hashing has a
> > 4-tuple. Nor when establishing UDP connections, as autobind there
> > selects a port before reaching ip_route_connect.
> > 
> > Load balance also TCP, by using a random port in fib_multipath_hash.
> > Port assignment in inet_hash_connect is not atomic with
> > ip_route_connect. Thus ports are unpredictable, effectively random.
> > 
> 
> can the call to inet_hash_connect be moved up? Get an actual sport
> assignment and then use it for routing lookups.

That inverts the chicken-and-egg problem and selects a source
port before a source address. That would be a significant change,
and considerably more risky.

A more concrete concern is that during port selection
__inet(6)_check_established uses inet_rcv_saddr/sk_v6_rcv_saddr to
check for established sockets, so it expects the saddr to already
have been chosen.

Inverting the choice requires matching against all local addresses.




* Re: [PATCH net-next 3/3] selftests/net: test tcp connection load balancing
  2025-04-20 18:04 ` [PATCH net-next 3/3] selftests/net: test tcp connection load balancing Willem de Bruijn
@ 2025-04-23  9:05   ` Ido Schimmel
  2025-04-23 14:18     ` Willem de Bruijn
  0 siblings, 1 reply; 10+ messages in thread
From: Ido Schimmel @ 2025-04-23  9:05 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, davem, kuba, edumazet, pabeni, dsahern, horms, idosch,
	kuniyu, Willem de Bruijn

On Sun, Apr 20, 2025 at 02:04:31PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Verify that TCP connections use both routes when connecting multiple
> times to a remote service over a two nexthop multipath route.
> 
> Use netcat to create the connections. Use tc prio + tc filter to
> count routes taken, counting SYN packets across the two egress
> devices.
> 
> To avoid flaky tests when testing inherently randomized behavior,
> set a low bar and pass if even a single SYN is observed on both
> devices.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> 
> ---
> 
> Integrated into fib_nexthops.sh as it covers multipath nexthop
> routing and can reuse all of its setup(), but technically the test
> does not use nexthop *objects* as is, so I can also move it into a
> separate file and move common setup code to lib.sh if preferred.

No strong preference, but fib_nexthops.sh explicitly tests nexthop
objects, so including here a test that doesn't use them is a bit weird.
Did you consider putting this in fib_tests.sh instead?

> ---
>  tools/testing/selftests/net/fib_nexthops.sh | 83 +++++++++++++++++++++
>  1 file changed, 83 insertions(+)
> 
> diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
> index b39f748c2572..93d19e92bd5b 100755
> --- a/tools/testing/selftests/net/fib_nexthops.sh
> +++ b/tools/testing/selftests/net/fib_nexthops.sh
> @@ -31,6 +31,7 @@ IPV4_TESTS="
>  	ipv4_compat_mode
>  	ipv4_fdb_grp_fcnal
>  	ipv4_mpath_select
> +	ipv4_mpath_balance
>  	ipv4_torture
>  	ipv4_res_torture
>  "
> @@ -45,6 +46,7 @@ IPV6_TESTS="
>  	ipv6_compat_mode
>  	ipv6_fdb_grp_fcnal
>  	ipv6_mpath_select
> +	ipv6_mpath_balance
>  	ipv6_torture
>  	ipv6_res_torture
>  "
> @@ -2110,6 +2112,87 @@ ipv4_res_torture()
>  	log_test 0 0 "IPv4 resilient nexthop group torture test"
>  }
>  
> +# Install a prio qdisc with separate bands counting IPv4 and IPv6 SYNs
> +tc_add_syn_counter() {
> +	local -r dev=$1
> +
> +	# qdisc with band 1 for no-match, band 2 for ipv4, band 3 for ipv6
> +	ip netns exec $me tc qdisc add dev $dev root handle 1: prio bands 3 \
> +		priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> +	ip netns exec $me tc qdisc add dev $dev parent 1:1 handle 2: pfifo
> +	ip netns exec $me tc qdisc add dev $dev parent 1:2 handle 4: pfifo
> +	ip netns exec $me tc qdisc add dev $dev parent 1:3 handle 6: pfifo
> +
> +	# ipv4 filter on SYN flag set: band 2
> +	ip netns exec $me tc filter add dev $dev parent 1: protocol ip u32 \
> +		match ip protocol 6 0xff \
> +		match ip dport 8000 0xffff \
> +		match u8 0x02 0xff at 33 \
> +		flowid 1:2
> +
> +	# ipv6 filter on SYN flag set: band 3
> +	ip netns exec $me tc filter add dev $dev parent 1: protocol ipv6 u32 \
> +		match ip6 protocol 6 0xff \
> +		match ip6 dport 8000 0xffff \
> +		match u8 0x02 0xff at 53 \
> +		flowid 1:3
> +}
> +
> +tc_get_syn_counter() {
> +	ip netns exec $me tc -j -s qdisc show dev $1 handle $2 | jq .[0].packets
> +}
> +
> +ip_mpath_balance() {
> +	local -r ipver="-$1"
> +	local -r daddr=$2
> +	local -r handle="$1:"
> +	local -r num_conn=20
> +
> +	tc_add_syn_counter veth1
> +	tc_add_syn_counter veth3
> +
> +	for i in $(seq 1 $num_conn); do
> +		ip netns exec $remote nc $ipver -l -p 8000 >/dev/null &
> +		echo -n a | ip netns exec $me nc $ipver -q 0 $daddr 8000

I don't have the '-q' option in Fedora:

# ./fib_nexthops.sh -t ipv4_mpath_balance
nc: invalid option -- 'q'
[...]
Tests passed:   0
Tests failed:   1
Tests skipped:  0

We had multiple problems in the past with 'nc' because of different
distributions using different versions. See for example:

ba6fbd383c12dfe6833968e3555ada422720a76f
5e8670610b93158ffacc3241f835454ff26a3469

Maybe use 'socat' instead?
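
Something along these lines perhaps (untested; TCP6 with a bracketed
address for the IPv6 case):

    # listener in $remote
    ip netns exec $remote socat -u TCP4-LISTEN:8000,reuseaddr STDOUT >/dev/null &
    # one-shot client in $me
    echo -n a | ip netns exec $me socat -u STDIN TCP4:$daddr:8000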

> +	done
> +
> +	local -r syn0="$(tc_get_syn_counter veth1 $handle)"
> +	local -r syn1="$(tc_get_syn_counter veth3 $handle)"
> +	local -r syns=$((syn0+syn1))
> +
> +	[ "$VERBOSE" = "1" ] && echo "multipath: syns seen: ($syn0,$syn1)"
> +
> +	[[ $syns -ge $num_conn ]] && [[ $syn0 -gt 0 ]] && [[ $syn1 -gt 0 ]]

IIUC, this only tests that connections to the same destination address
and destination port are load balanced across all the paths (patch #2),
but it doesn't test that each connection uses the source address of the
egress interface (patch #1). Any reason not to test both? I'm asking
because I expect the current test to pass even without both patches.

I noticed that you are using tc-u32 for the matching, but with tc-flower
you can easily match on both 'src_ip' and 'tcp_flags'.
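
Roughly (untested sketch, with $local_addr standing in for the address
on the respective egress device):

    tc filter add dev $dev parent 1: protocol ip flower \
            ip_proto tcp src_ip $local_addr dst_port 8000 \
            tcp_flags 0x2/0x2 classid 1:2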

> +}
> +
> +ipv4_mpath_balance()
> +{
> +	$IP route add 172.16.101.1 \
> +		nexthop via 172.16.1.2 \
> +		nexthop via 172.16.2.2
> +
> +	ip netns exec $me \
> +		sysctl -q -w net.ipv4.fib_multipath_hash_policy=1
> +
> +	ip_mpath_balance 4 172.16.101.1
> +
> +	log_test $? 0 "Multipath loadbalance"
> +}
> +
> +ipv6_mpath_balance()
> +{
> +	$IP route add 2001:db8:101::1\
> +		nexthop via 2001:db8:91::2 \
> +		nexthop via 2001:db8:92::2
> +
> +	ip netns exec $me \
> +		sysctl -q -w net.ipv6.fib_multipath_hash_policy=1
> +
> +	ip_mpath_balance 6 2001:db8:101::1
> +
> +	log_test $? 0 "Multipath loadbalance"
> +}
> +
>  basic()
>  {
>  	echo
> -- 
> 2.49.0.805.g082f7c87e0-goog
> 
> 


* Re: [PATCH net-next 3/3] selftests/net: test tcp connection load balancing
  2025-04-23  9:05   ` Ido Schimmel
@ 2025-04-23 14:18     ` Willem de Bruijn
  0 siblings, 0 replies; 10+ messages in thread
From: Willem de Bruijn @ 2025-04-23 14:18 UTC (permalink / raw)
  To: Ido Schimmel, Willem de Bruijn
  Cc: netdev, davem, kuba, edumazet, pabeni, dsahern, horms, idosch,
	kuniyu, Willem de Bruijn

Ido Schimmel wrote:
> On Sun, Apr 20, 2025 at 02:04:31PM -0400, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@google.com>
> > 
> > Verify that TCP connections use both routes when connecting multiple
> > times to a remote service over a two nexthop multipath route.
> > 
> > Use netcat to create the connections. Use tc prio + tc filter to
> > count routes taken, counting SYN packets across the two egress
> > devices.
> > 
> > To avoid flaky tests when testing inherently randomized behavior,
> > set a low bar and pass if even a single SYN is observed on both
> > devices.
> > 
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > 
> > ---
> > 
> > Integrated into fib_nexthops.sh as it covers multipath nexthop
> > routing and can reuse all of its setup(), but technically the test
> > does not use nexthop *objects* as is, so I can also move it into a
> > separate file and move common setup code to lib.sh if preferred.
> 
> No strong preference, but fib_nexthops.sh explicitly tests nexthop
> objects, so including here a test that doesn't use them is a bit weird.
> Did you consider putting this in fib_tests.sh instead?

Ok, that is a more logical location.

The main reason for fib_nexthops.sh was that it can reuse all of its
setup().

But I can probably use route_setup and manually add ns remote and
veth5 and 6. Will take a look.
 
> > ---
> >  tools/testing/selftests/net/fib_nexthops.sh | 83 +++++++++++++++++++++
> >  1 file changed, 83 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
> > index b39f748c2572..93d19e92bd5b 100755
> > --- a/tools/testing/selftests/net/fib_nexthops.sh
> > +++ b/tools/testing/selftests/net/fib_nexthops.sh
> > @@ -31,6 +31,7 @@ IPV4_TESTS="
> >  	ipv4_compat_mode
> >  	ipv4_fdb_grp_fcnal
> >  	ipv4_mpath_select
> > +	ipv4_mpath_balance
> >  	ipv4_torture
> >  	ipv4_res_torture
> >  "
> > @@ -45,6 +46,7 @@ IPV6_TESTS="
> >  	ipv6_compat_mode
> >  	ipv6_fdb_grp_fcnal
> >  	ipv6_mpath_select
> > +	ipv6_mpath_balance
> >  	ipv6_torture
> >  	ipv6_res_torture
> >  "
> > @@ -2110,6 +2112,87 @@ ipv4_res_torture()
> >  	log_test 0 0 "IPv4 resilient nexthop group torture test"
> >  }
> >  
> > +# Install a prio qdisc with separate bands counting IPv4 and IPv6 SYNs
> > +tc_add_syn_counter() {
> > +	local -r dev=$1
> > +
> > +	# qdisc with band 1 for no-match, band 2 for ipv4, band 3 for ipv6
> > +	ip netns exec $me tc qdisc add dev $dev root handle 1: prio bands 3 \
> > +		priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > +	ip netns exec $me tc qdisc add dev $dev parent 1:1 handle 2: pfifo
> > +	ip netns exec $me tc qdisc add dev $dev parent 1:2 handle 4: pfifo
> > +	ip netns exec $me tc qdisc add dev $dev parent 1:3 handle 6: pfifo
> > +
> > +	# ipv4 filter on SYN flag set: band 2
> > +	ip netns exec $me tc filter add dev $dev parent 1: protocol ip u32 \
> > +		match ip protocol 6 0xff \
> > +		match ip dport 8000 0xffff \
> > +		match u8 0x02 0xff at 33 \
> > +		flowid 1:2
> > +
> > +	# ipv6 filter on SYN flag set: band 3
> > +	ip netns exec $me tc filter add dev $dev parent 1: protocol ipv6 u32 \
> > +		match ip6 protocol 6 0xff \
> > +		match ip6 dport 8000 0xffff \
> > +		match u8 0x02 0xff at 53 \
> > +		flowid 1:3
> > +}
> > +
> > +tc_get_syn_counter() {
> > +	ip netns exec $me tc -j -s qdisc show dev $1 handle $2 | jq .[0].packets
> > +}
> > +
> > +ip_mpath_balance() {
> > +	local -r ipver="-$1"
> > +	local -r daddr=$2
> > +	local -r handle="$1:"
> > +	local -r num_conn=20
> > +
> > +	tc_add_syn_counter veth1
> > +	tc_add_syn_counter veth3
> > +
> > +	for i in $(seq 1 $num_conn); do
> > +		ip netns exec $remote nc $ipver -l -p 8000 >/dev/null &
> > +		echo -n a | ip netns exec $me nc $ipver -q 0 $daddr 8000
> 
> I don't have the '-q' option in Fedora:
> 
> # ./fib_nexthops.sh -t ipv4_mpath_balance
> nc: invalid option -- 'q'
> [...]
> Tests passed:   0
> Tests failed:   1
> Tests skipped:  0
> 
> We had multiple problems in the past with 'nc' because of different
> distributions using different versions. See for example:
> 
> ba6fbd383c12dfe6833968e3555ada422720a76f
> 5e8670610b93158ffacc3241f835454ff26a3469
> 
> Maybe use 'socat' instead?

Ack, will change.

> > +	done
> > +
> > +	local -r syn0="$(tc_get_syn_counter veth1 $handle)"
> > +	local -r syn1="$(tc_get_syn_counter veth3 $handle)"
> > +	local -r syns=$((syn0+syn1))
> > +
> > +	[ "$VERBOSE" = "1" ] && echo "multipath: syns seen: ($syn0,$syn1)"
> > +
> > +	[[ $syns -ge $num_conn ]] && [[ $syn0 -gt 0 ]] && [[ $syn1 -gt 0 ]]
> 
> IIUC, this only tests that connections to the same destination address
> and destination port are load balanced across all the paths (patch #2),
> but it doesn't test that each connection uses the source address of the
> egress interface (patch #1). Any reason not to test both? I'm asking
> because I expect the current test to pass even without both patches.
> 
> I noticed that you are using tc-u32 for the matching, but with tc-flower
> you can easily match on both 'src_ip' and 'tcp_flags'.

Will do. Thanks!

> > +}
> > +
> > +ipv4_mpath_balance()
> > +{
> > +	$IP route add 172.16.101.1 \
> > +		nexthop via 172.16.1.2 \
> > +		nexthop via 172.16.2.2
> > +
> > +	ip netns exec $me \
> > +		sysctl -q -w net.ipv4.fib_multipath_hash_policy=1
> > +
> > +	ip_mpath_balance 4 172.16.101.1
> > +
> > +	log_test $? 0 "Multipath loadbalance"
> > +}
> > +
> > +ipv6_mpath_balance()
> > +{
> > +	$IP route add 2001:db8:101::1\
> > +		nexthop via 2001:db8:91::2 \
> > +		nexthop via 2001:db8:92::2
> > +
> > +	ip netns exec $me \
> > +		sysctl -q -w net.ipv6.fib_multipath_hash_policy=1
> > +
> > +	ip_mpath_balance 6 2001:db8:101::1
> > +
> > +	log_test $? 0 "Multipath loadbalance"
> > +}
> > +
> >  basic()
> >  {
> >  	echo
> > -- 
> > 2.49.0.805.g082f7c87e0-goog
> > 
> > 


