Re: [PATCH net-next v2 1/3] ipv4: prefer multipath nexthop that matches source address

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ido Schimmel <idosch@nvidia.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
	edumazet@google.com, pabeni@redhat.com, dsahern@kernel.org,
	horms@kernel.org, kuniyu@amazon.com,
	Willem de Bruijn <willemb@google.com>
Subject: Re: [PATCH net-next v2 1/3] ipv4: prefer multipath nexthop that matches source address
Date: Sun, 27 Apr 2025 20:30:47 +0300	[thread overview]
Message-ID: <aA5px6qCjTWbHimM@shredder> (raw)
In-Reply-To: <680cf54b983d5_193a06294ab@willemb.c.googlers.com.notmuch>

On Sat, Apr 26, 2025 at 11:01:31AM -0400, Willem de Bruijn wrote:
> Ido Schimmel wrote:
> > On Thu, Apr 24, 2025 at 10:35:18AM -0400, Willem de Bruijn wrote:
> > > From: Willem de Bruijn <willemb@google.com>
> > > 
> > > With multipath routes, try to ensure that packets leave on the device
> > > that is associated with the source address.
> > > 
> > > Avoid the following tcpdump example:
> > > 
> > >     veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S]
> > >     veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S]
> > > 
> > > Which can happen easily with the most straightforward setup:
> > > 
> > >     ip addr add 10.0.0.1/24 dev veth0
> > >     ip addr add 10.1.0.1/24 dev veth1
> > > 
> > >     ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \
> > >     			  nexthop via 10.1.0.2 dev veth1
> > > 
> > > This is apparently considered WAI, based on the comment in
> > > ip_route_output_key_hash_rcu:
> > > 
> > >     * 2. Moreover, we are allowed to send packets with saddr
> > >     *    of another iface. --ANK
> > > 
> > > It may be ok for some uses of multipath, but not all. For instance,
> > > when using two ISPs, a router may drop packets with unknown source.
> > > 
> > > The behavior occurs because tcp_v4_connect makes three route
> > > lookups when establishing a connection:
> > > 
> > > 1. ip_route_connect calls to select a source address, with saddr zero.
> > > 2. ip_route_connect calls again now that saddr and daddr are known.
> > > 3. ip_route_newports calls again after a source port is also chosen.
> > > 
> > > With a route with multiple nexthops, each lookup may make a different
> > > choice depending on available entropy to fib_select_multipath. So it
> > > is possible for 1 to select the saddr from the first entry, but 3 to
> > > select the second entry. Leading to the above situation.
> > > 
> > > Address this by preferring a match that matches the flowi4 saddr. This
> > > will make 2 and 3 make the same choice as 1. Continue to update the
> > > backup choice until a choice that matches saddr is found.
> > > 
> > > Do this in fib_select_multipath itself, rather than passing an fl4_oif
> > > constraint, to avoid changing non-multipath route selection. Commit
> > > e6b45241c57a ("ipv4: reset flowi parameters on route connect") shows
> > > how that may cause regressions.
> > > 
> > > Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to
> > > refresh in the loop.
> > > 
> > > This does not happen in IPv6, which performs only one lookup.
> > > 
> > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > Reviewed-by: David Ahern <dsahern@kernel.org>
> > 
> > Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> > 
> > One note below
> > 
> > [...]
> > 
> > > -void fib_select_multipath(struct fib_result *res, int hash)
> > > +void fib_select_multipath(struct fib_result *res, int hash,
> > > +			  const struct flowi4 *fl4)
> > >  {
> > >  	struct fib_info *fi = res->fi;
> > >  	struct net *net = fi->fib_net;
> > > -	bool first = false;
> > > +	bool found = false;
> > > +	bool use_neigh;
> > > +	__be32 saddr;
> > >  
> > >  	if (unlikely(res->fi->nh)) {
> > >  		nexthop_path_fib_result(res, hash);
> > >  		return;
> > >  	}
> > >  
> > > +	use_neigh = READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh);
> > > +	saddr = fl4 ? fl4->saddr : 0;
> > > +
> > >  	change_nexthops(fi) {
> > > -		if (READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh)) {
> > > -			if (!fib_good_nh(nexthop_nh))
> > > -				continue;
> > > -			if (!first) {
> > > -				res->nh_sel = nhsel;
> > > -				res->nhc = &nexthop_nh->nh_common;
> > > -				first = true;
> > > -			}
> > > +		if (use_neigh && !fib_good_nh(nexthop_nh))
> > > +			continue;
> > > +
> > > +		if (!found) {
> > > +			res->nh_sel = nhsel;
> > > +			res->nhc = &nexthop_nh->nh_common;
> > > +			found = !saddr || nexthop_nh->nh_saddr == saddr;
> > >  		}
> > >  
> > >  		if (hash > atomic_read(&nexthop_nh->fib_nh_upper_bound))
> > >  			continue;
> > 
> > Note that because 'res' is set before comparing the hash with the hash
> > threshold, it's possible to choose a nexthop that does not have a
> > carrier (they are assigned a hash threshold of -1), whereas this did
> > not happen before. Tested with [1].
> 
> This is different from the previous pre-threshold choice if !first,
> because that choice was always tested with fib_good_nh(), while now
> that is optional?

I'm not sure I understood the question, but my point is that we can make
the code a bit clearer and more "correct" with something like this [1]
as a follow-up. It honors the "ignore_routes_with_linkdown" sysctl and
skips over nexthops that do not have a carrier.

I tested with [2] which fails without the patch. fib_tests.sh is also OK
[3] (including the new tests).

In practice, the patch shouldn't make a big difference. For the case of
saddr==0 (e.g., forwarding), it shouldn't make any difference because
you are guaranteed to find a nexthop whose upper bound covers the
calculated hash.

For the case of saddr!=0 (e.g., locally generated traffic) this patch
will not choose a nexthop if it has the correct address but no carrier.
Like I said before, it probably doesn't matter in practice because the
route lookup for the source address wouldn't choose this nexthop /
address in the first place.

[1]
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 2371f311a1e1..ce56fe39b185 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -2188,7 +2188,14 @@ void fib_select_multipath(struct fib_result *res, int hash,
 	saddr = fl4 ? fl4->saddr : 0;
 
 	change_nexthops(fi) {
-		if (use_neigh && !fib_good_nh(nexthop_nh))
+		int nh_upper_bound;
+
+		/* Nexthops without a carrier are assigned an upper bound of
+		 * minus one when "ignore_routes_with_linkdown" is set.
+		 */
+		nh_upper_bound = atomic_read(&nexthop_nh->fib_nh_upper_bound);
+		if (nh_upper_bound == -1 ||
+		    (use_neigh && !fib_good_nh(nexthop_nh)))
 			continue;
 
 		if (!found) {
@@ -2197,7 +2204,7 @@ void fib_select_multipath(struct fib_result *res, int hash,
 			found = !saddr || nexthop_nh->nh_saddr == saddr;
 		}
 
-		if (hash > atomic_read(&nexthop_nh->fib_nh_upper_bound))
+		if (hash > nh_upper_bound)
 			continue;
 
 		if (!saddr || nexthop_nh->nh_saddr == saddr) {

[2]
#!/bin/bash

trap cleanup EXIT

cleanup() {
	ip netns del ns1
}

ip netns add ns1
ip -n ns1 link set dev lo up

ip -n ns1 link add name dummy1 up type dummy
ip -n ns1 link add name dummy2 up type dummy

ip -n ns1 address add 192.0.2.1/28 dev dummy1
ip -n ns1 address add 192.0.2.17/28 dev dummy2

ip -n ns1 route add 198.51.100.0/24 \
	nexthop via 192.0.2.2 dev dummy1 \
	nexthop via 192.0.2.18 dev dummy2

ip netns exec ns1 sysctl -wq net.ipv4.fib_multipath_hash_policy=1
ip netns exec ns1 sysctl -wq net.ipv4.conf.all.ignore_routes_with_linkdown=1

ip -n ns1 link set dev dummy2 carrier off

for i in {1..128}; do
	ip -n ns1 route get to 198.51.100.1 from 192.0.2.17 \
		ipproto tcp sport $i dport $i | grep -q dummy2
	[[ $? -eq 0 ]] && echo "FAIL" && exit
done

echo "SUCCESS"

[3]
# ./fib_tests.sh
[...]
Tests passed: 230
Tests failed:   0

next prev parent reply	other threads:[~2025-04-27 17:31 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-24 14:35 [PATCH net-next v2 0/3] ip: improve tcp sock multipath routing Willem de Bruijn
2025-04-24 14:35 ` [PATCH net-next v2 1/3] ipv4: prefer multipath nexthop that matches source address Willem de Bruijn
2025-04-24 16:47   ` Eric Dumazet
2025-04-25 14:59   ` Ido Schimmel
2025-04-26 15:01     ` Willem de Bruijn
2025-04-27 17:30       ` Ido Schimmel [this message]
2025-04-28 16:26         ` Willem de Bruijn
2025-04-29  7:54           ` Ido Schimmel
2025-04-24 14:35 ` [PATCH net-next v2 2/3] ip: load balance tcp connections to single dst addr and port Willem de Bruijn
2025-04-24 16:05   ` David Ahern
2025-04-24 16:51   ` Eric Dumazet
2025-04-25 15:14   ` Ido Schimmel
2025-04-24 14:35 ` [PATCH net-next v2 3/3] selftests/net: test tcp connection load balancing Willem de Bruijn
2025-04-25 15:47   ` Ido Schimmel
2025-04-29 14:40 ` [PATCH net-next v2 0/3] ip: improve tcp sock multipath routing patchwork-bot+netdevbpf

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:2371f311a1e dfblob:ce56fe39b18 )
 OR (
bs:"Re: [PATCH net-next v2 1/3] ipv4: prefer multipath nexthop that matches source address" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aA5px6qCjTWbHimM@shredder \
    --to=idosch@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=kuba@kernel.org \
    --cc=kuniyu@amazon.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=willemb@google.com \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.