All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ido Schimmel <idosch@nvidia.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
	edumazet@google.com, pabeni@redhat.com, dsahern@kernel.org,
	horms@kernel.org, kuniyu@amazon.com,
	Willem de Bruijn <willemb@google.com>
Subject: Re: [PATCH net-next v2 1/3] ipv4: prefer multipath nexthop that matches source address
Date: Fri, 25 Apr 2025 17:59:48 +0300	[thread overview]
Message-ID: <aAujZJXqlG8VZpJF@shredder> (raw)
In-Reply-To: <20250424143549.669426-2-willemdebruijn.kernel@gmail.com>

On Thu, Apr 24, 2025 at 10:35:18AM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> With multipath routes, try to ensure that packets leave on the device
> that is associated with the source address.
> 
> Avoid the following tcpdump example:
> 
>     veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S]
>     veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S]
> 
> Which can happen easily with the most straightforward setup:
> 
>     ip addr add 10.0.0.1/24 dev veth0
>     ip addr add 10.1.0.1/24 dev veth1
> 
>     ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \
>     			  nexthop via 10.1.0.2 dev veth1
> 
> This is apparently considered WAI, based on the comment in
> ip_route_output_key_hash_rcu:
> 
>     * 2. Moreover, we are allowed to send packets with saddr
>     *    of another iface. --ANK
> 
> It may be ok for some uses of multipath, but not all. For instance,
> when using two ISPs, a router may drop packets with unknown source.
> 
> The behavior occurs because tcp_v4_connect makes three route
> lookups when establishing a connection:
> 
> 1. ip_route_connect calls to select a source address, with saddr zero.
> 2. ip_route_connect calls again now that saddr and daddr are known.
> 3. ip_route_newports calls again after a source port is also chosen.
> 
> With a route with multiple nexthops, each lookup may make a different
> choice depending on available entropy to fib_select_multipath. So it
> is possible for 1 to select the saddr from the first entry, but 3 to
> select the second entry. Leading to the above situation.
> 
> Address this by preferring a match that matches the flowi4 saddr. This
> will make 2 and 3 make the same choice as 1. Continue to update the
> backup choice until a choice that matches saddr is found.
> 
> Do this in fib_select_multipath itself, rather than passing an fl4_oif
> constraint, to avoid changing non-multipath route selection. Commit
> e6b45241c57a ("ipv4: reset flowi parameters on route connect") shows
> how that may cause regressions.
> 
> Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to
> refresh in the loop.
> 
> This does not happen in IPv6, which performs only one lookup.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Reviewed-by: David Ahern <dsahern@kernel.org>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>

One note below

[...]

> -void fib_select_multipath(struct fib_result *res, int hash)
> +void fib_select_multipath(struct fib_result *res, int hash,
> +			  const struct flowi4 *fl4)
>  {
>  	struct fib_info *fi = res->fi;
>  	struct net *net = fi->fib_net;
> -	bool first = false;
> +	bool found = false;
> +	bool use_neigh;
> +	__be32 saddr;
>  
>  	if (unlikely(res->fi->nh)) {
>  		nexthop_path_fib_result(res, hash);
>  		return;
>  	}
>  
> +	use_neigh = READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh);
> +	saddr = fl4 ? fl4->saddr : 0;
> +
>  	change_nexthops(fi) {
> -		if (READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh)) {
> -			if (!fib_good_nh(nexthop_nh))
> -				continue;
> -			if (!first) {
> -				res->nh_sel = nhsel;
> -				res->nhc = &nexthop_nh->nh_common;
> -				first = true;
> -			}
> +		if (use_neigh && !fib_good_nh(nexthop_nh))
> +			continue;
> +
> +		if (!found) {
> +			res->nh_sel = nhsel;
> +			res->nhc = &nexthop_nh->nh_common;
> +			found = !saddr || nexthop_nh->nh_saddr == saddr;
>  		}
>  
>  		if (hash > atomic_read(&nexthop_nh->fib_nh_upper_bound))
>  			continue;

Note that because 'res' is set before comparing the hash with the hash
threshold, it's possible to choose a nexthop that does not have a
carrier (they are assigned a hash threshold of -1), whereas this did
not happen before. Tested with [1].

I guess it's not a problem in practice because the initial route lookup
for the source address wouldn't have chosen the linkdown nexthop to
begin with.

>  
> -		res->nh_sel = nhsel;
> -		res->nhc = &nexthop_nh->nh_common;
> -		return;
> +		if (!saddr || nexthop_nh->nh_saddr == saddr) {
> +			res->nh_sel = nhsel;
> +			res->nhc = &nexthop_nh->nh_common;
> +			return;
> +		}
> +
> +		if (found)
> +			return;
> +
>  	} endfor_nexthops(fi);
>  }

[1]
#!/bin/bash

ip link del dev dummy1 &> /dev/null
ip link del dev dummy2 &> /dev/null

ip link add name dummy1 up type dummy
ip link add name dummy2 up type dummy
ip address add 192.0.2.1/28 dev dummy1
ip address add 192.0.2.17/28 dev dummy2
ip route add 192.0.2.32/28 \
	nexthop via 192.0.2.2 dev dummy1 \
	nexthop via 192.0.2.18 dev dummy2

ip link set dev dummy2 carrier off
sysctl -wq net.ipv4.fib_multipath_hash_policy=1
sysctl -wq net.ipv4.conf.all.ignore_routes_with_linkdown=1

sleep 1

ip route show 192.0.2.32/28
for i in {1..100}; do
	ip route get to 192.0.2.33 from 192.0.2.17 ipproto tcp sport $i dport $i | grep dummy2
done

  parent reply	other threads:[~2025-04-25 15:00 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-24 14:35 [PATCH net-next v2 0/3] ip: improve tcp sock multipath routing Willem de Bruijn
2025-04-24 14:35 ` [PATCH net-next v2 1/3] ipv4: prefer multipath nexthop that matches source address Willem de Bruijn
2025-04-24 16:47   ` Eric Dumazet
2025-04-25 14:59   ` Ido Schimmel [this message]
2025-04-26 15:01     ` Willem de Bruijn
2025-04-27 17:30       ` Ido Schimmel
2025-04-28 16:26         ` Willem de Bruijn
2025-04-29  7:54           ` Ido Schimmel
2025-04-24 14:35 ` [PATCH net-next v2 2/3] ip: load balance tcp connections to single dst addr and port Willem de Bruijn
2025-04-24 16:05   ` David Ahern
2025-04-24 16:51   ` Eric Dumazet
2025-04-25 15:14   ` Ido Schimmel
2025-04-24 14:35 ` [PATCH net-next v2 3/3] selftests/net: test tcp connection load balancing Willem de Bruijn
2025-04-25 15:47   ` Ido Schimmel
2025-04-29 14:40 ` [PATCH net-next v2 0/3] ip: improve tcp sock multipath routing patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aAujZJXqlG8VZpJF@shredder \
    --to=idosch@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=kuba@kernel.org \
    --cc=kuniyu@amazon.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=willemb@google.com \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.