From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: netdev@vger.kernel.org
Cc: davem@davemloft.net, kuba@kernel.org, edumazet@google.com,
pabeni@redhat.com, dsahern@kernel.org, horms@kernel.org,
idosch@nvidia.com, kuniyu@amazon.com,
Willem de Bruijn <willemb@google.com>
Subject: [PATCH net-next v2 1/3] ipv4: prefer multipath nexthop that matches source address
Date: Thu, 24 Apr 2025 10:35:18 -0400 [thread overview]
Message-ID: <20250424143549.669426-2-willemdebruijn.kernel@gmail.com> (raw)
In-Reply-To: <20250424143549.669426-1-willemdebruijn.kernel@gmail.com>
From: Willem de Bruijn <willemb@google.com>
With multipath routes, try to ensure that packets leave on the device
that is associated with the source address.
Avoid the following tcpdump example:
veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S]
veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S]
Which can happen easily with the most straightforward setup:
ip addr add 10.0.0.1/24 dev veth0
ip addr add 10.1.0.1/24 dev veth1
ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \
nexthop via 10.1.0.2 dev veth1
This is apparently considered WAI, based on the comment in
ip_route_output_key_hash_rcu:
* 2. Moreover, we are allowed to send packets with saddr
* of another iface. --ANK
It may be ok for some uses of multipath, but not all. For instance,
when using two ISPs, a router may drop packets with unknown source.
The behavior occurs because tcp_v4_connect makes three route
lookups when establishing a connection:
1. ip_route_connect calls to select a source address, with saddr zero.
2. ip_route_connect calls again now that saddr and daddr are known.
3. ip_route_newports calls again after a source port is also chosen.
With a route with multiple nexthops, each lookup may make a different
choice depending on available entropy to fib_select_multipath. So it
is possible for 1 to select the saddr from the first entry, but 3 to
select the second entry. Leading to the above situation.
Address this by preferring a match that matches the flowi4 saddr. This
will make 2 and 3 make the same choice as 1. Continue to update the
backup choice until a choice that matches saddr is found.
Do this in fib_select_multipath itself, rather than passing an fl4_oif
constraint, to avoid changing non-multipath route selection. Commit
e6b45241c57a ("ipv4: reset flowi parameters on route connect") shows
how that may cause regressions.
Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to
refresh in the loop.
This does not happen in IPv6, which performs only one lookup.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
---
include/net/ip_fib.h | 3 ++-
net/ipv4/fib_semantics.c | 39 +++++++++++++++++++++++++--------------
net/ipv4/route.c | 2 +-
3 files changed, 28 insertions(+), 16 deletions(-)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e3864b74e92a..48bb3cf41469 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -574,7 +574,8 @@ static inline u32 fib_multipath_hash_from_keys(const struct net *net,
int fib_check_nh(struct net *net, struct fib_nh *nh, u32 table, u8 scope,
struct netlink_ext_ack *extack);
-void fib_select_multipath(struct fib_result *res, int hash);
+void fib_select_multipath(struct fib_result *res, int hash,
+ const struct flowi4 *fl4);
void fib_select_path(struct net *net, struct fib_result *res,
struct flowi4 *fl4, const struct sk_buff *skb);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5326f1501af0..2371f311a1e1 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -2170,34 +2170,45 @@ static bool fib_good_nh(const struct fib_nh *nh)
return !!(state & NUD_VALID);
}
-void fib_select_multipath(struct fib_result *res, int hash)
+void fib_select_multipath(struct fib_result *res, int hash,
+ const struct flowi4 *fl4)
{
struct fib_info *fi = res->fi;
struct net *net = fi->fib_net;
- bool first = false;
+ bool found = false;
+ bool use_neigh;
+ __be32 saddr;
if (unlikely(res->fi->nh)) {
nexthop_path_fib_result(res, hash);
return;
}
+ use_neigh = READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh);
+ saddr = fl4 ? fl4->saddr : 0;
+
change_nexthops(fi) {
- if (READ_ONCE(net->ipv4.sysctl_fib_multipath_use_neigh)) {
- if (!fib_good_nh(nexthop_nh))
- continue;
- if (!first) {
- res->nh_sel = nhsel;
- res->nhc = &nexthop_nh->nh_common;
- first = true;
- }
+ if (use_neigh && !fib_good_nh(nexthop_nh))
+ continue;
+
+ if (!found) {
+ res->nh_sel = nhsel;
+ res->nhc = &nexthop_nh->nh_common;
+ found = !saddr || nexthop_nh->nh_saddr == saddr;
}
if (hash > atomic_read(&nexthop_nh->fib_nh_upper_bound))
continue;
- res->nh_sel = nhsel;
- res->nhc = &nexthop_nh->nh_common;
- return;
+ if (!saddr || nexthop_nh->nh_saddr == saddr) {
+ res->nh_sel = nhsel;
+ res->nhc = &nexthop_nh->nh_common;
+ return;
+ }
+
+ if (found)
+ return;
+
} endfor_nexthops(fi);
}
#endif
@@ -2212,7 +2223,7 @@ void fib_select_path(struct net *net, struct fib_result *res,
if (fib_info_num_path(res->fi) > 1) {
int h = fib_multipath_hash(net, fl4, skb, NULL);
- fib_select_multipath(res, h);
+ fib_select_multipath(res, h, fl4);
}
else
#endif
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 49cffbe83802..e5e4c71be3af 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2154,7 +2154,7 @@ ip_mkroute_input(struct sk_buff *skb, struct fib_result *res,
if (res->fi && fib_info_num_path(res->fi) > 1) {
int h = fib_multipath_hash(res->fi->fib_net, NULL, skb, hkeys);
- fib_select_multipath(res, h);
+ fib_select_multipath(res, h, NULL);
IPCB(skb)->flags |= IPSKB_MULTIPATH;
}
#endif
--
2.49.0.805.g082f7c87e0-goog
next prev parent reply other threads:[~2025-04-24 14:35 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-24 14:35 [PATCH net-next v2 0/3] ip: improve tcp sock multipath routing Willem de Bruijn
2025-04-24 14:35 ` Willem de Bruijn [this message]
2025-04-24 16:47 ` [PATCH net-next v2 1/3] ipv4: prefer multipath nexthop that matches source address Eric Dumazet
2025-04-25 14:59 ` Ido Schimmel
2025-04-26 15:01 ` Willem de Bruijn
2025-04-27 17:30 ` Ido Schimmel
2025-04-28 16:26 ` Willem de Bruijn
2025-04-29 7:54 ` Ido Schimmel
2025-04-24 14:35 ` [PATCH net-next v2 2/3] ip: load balance tcp connections to single dst addr and port Willem de Bruijn
2025-04-24 16:05 ` David Ahern
2025-04-24 16:51 ` Eric Dumazet
2025-04-25 15:14 ` Ido Schimmel
2025-04-24 14:35 ` [PATCH net-next v2 3/3] selftests/net: test tcp connection load balancing Willem de Bruijn
2025-04-25 15:47 ` Ido Schimmel
2025-04-29 14:40 ` [PATCH net-next v2 0/3] ip: improve tcp sock multipath routing patchwork-bot+netdevbpf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250424143549.669426-2-willemdebruijn.kernel@gmail.com \
--to=willemdebruijn.kernel@gmail.com \
--cc=davem@davemloft.net \
--cc=dsahern@kernel.org \
--cc=edumazet@google.com \
--cc=horms@kernel.org \
--cc=idosch@nvidia.com \
--cc=kuba@kernel.org \
--cc=kuniyu@amazon.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=willemb@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).