From: Vadim Fedorenko <vadim.fedorenko@linux.dev>
To: Ido Schimmel <idosch@nvidia.com>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	David Ahern <dsahern@kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>, Simon Horman <horms@kernel.org>,
	Willem de Bruijn <willemb@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Shuah Khan <shuah@kernel.org>,
	netdev@vger.kernel.org
Subject: Re: [PATCH net v2 1/2] net: fib: restore ECMP balance from loopback
Date: Sun, 21 Dec 2025 18:49:53 +0000
Message-ID: <1c6ba073-79e5-461b-ae76-4ef22fe04632@linux.dev>
In-Reply-To: <aUgnGahB9uXbvrbh@shredder>

On 21/12/2025 16:58, Ido Schimmel wrote:
> On Sun, Dec 21, 2025 at 10:55:15AM -0500, Willem de Bruijn wrote:
>> Vadim Fedorenko wrote:
>>> Preference of nexthop with source address broke ECMP for packets with
>>> source addresses which are not in the broadcast domain, but rather added
>>> to loopback/dummy interfaces. Original behaviour was to balance over
>>> nexthops while now it uses the latest nexthop from the group.
>>>
>>> For the case with 198.51.100.1/32 assigned to dummy0 and routed using
>>> 192.0.2.0/24 and 203.0.113.0/24 networks:
>>>
>>> 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
>>>      link/ether d6:54:8a:ff:78:f5 brd ff:ff:ff:ff:ff:ff
>>>      inet 198.51.100.1/32 scope global dummy0
>>>         valid_lft forever preferred_lft forever
>>> 7: veth1@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
>>>      link/ether 06:ed:98:87:6d:8a brd ff:ff:ff:ff:ff:ff link-netnsid 0
>>>      inet 192.0.2.2/24 scope global veth1
>>>         valid_lft forever preferred_lft forever
>>>      inet6 fe80::4ed:98ff:fe87:6d8a/64 scope link proto kernel_ll
>>>         valid_lft forever preferred_lft forever
>>> 9: veth3@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
>>>      link/ether ae:75:23:38:a0:d2 brd ff:ff:ff:ff:ff:ff link-netnsid 0
>>>      inet 203.0.113.2/24 scope global veth3
>>>         valid_lft forever preferred_lft forever
>>>      inet6 fe80::ac75:23ff:fe38:a0d2/64 scope link proto kernel_ll
>>>         valid_lft forever preferred_lft forever
>>>
>>> ~ ip ro list:
>>> default
>>> 	nexthop via 192.0.2.1 dev veth1 weight 1
>>> 	nexthop via 203.0.113.1 dev veth3 weight 1
>>> 192.0.2.0/24 dev veth1 proto kernel scope link src 192.0.2.2
>>> 203.0.113.0/24 dev veth3 proto kernel scope link src 203.0.113.2
>>>
>>> before:
>>>     for i in {1..255} ; do ip ro get 10.0.0.$i; done | grep veth | awk ' {print $(NF-2)}' | sort | uniq -c:
>>>      255 veth3
>>>
>>> after:
>>>     for i in {1..255} ; do ip ro get 10.0.0.$i; done | grep veth | awk ' {print $(NF-2)}' | sort | uniq -c:
>>>      122 veth1
>>>      133 veth3
> 
> The commit message only explains the problem, but not the solution...

Well, the solution is to try to restore the original logic. But OK, I'll
explain it explicitly in the commit message.

> 
>>>
>>> Fixes: 32607a332cfe ("ipv4: prefer multipath nexthop that matches source address")
>>> Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
>>> ---
>>> v1 -> v2:
>>>
>>> - add score calculation for nexthop to keep original logic
>>> - adjust commit message to explain the config
>>> - use dummy device instead of loopback
>>> ---
>>>
>>>   net/ipv4/fib_semantics.c | 24 ++++++++----------------
>>>   1 file changed, 8 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>>> index a5f3c8459758..4d3650d20ff2 100644
>>> --- a/net/ipv4/fib_semantics.c
>>> +++ b/net/ipv4/fib_semantics.c
>>> @@ -2167,8 +2167,8 @@ void fib_select_multipath(struct fib_result *res, int hash,
>>>   {
>>>   	struct fib_info *fi = res->fi;
>>>   	struct net *net = fi->fib_net;
>>> -	bool found = false;
>>>   	bool use_neigh;
>>> +	int score = -1;
>>>   	__be32 saddr;
>>>   
>>>   	if (unlikely(res->fi->nh)) {
>>> @@ -2180,7 +2180,7 @@ void fib_select_multipath(struct fib_result *res, int hash,
>>>   	saddr = fl4 ? fl4->saddr : 0;
>>>   
>>>   	change_nexthops(fi) {
>>> -		int nh_upper_bound;
>>> +		int nh_upper_bound, nh_score = 0;
>>>   
>>>   		/* Nexthops without a carrier are assigned an upper bound of
>>>   		 * minus one when "ignore_routes_with_linkdown" is set.
>>> @@ -2190,24 +2190,16 @@ void fib_select_multipath(struct fib_result *res, int hash,
>>>   		    (use_neigh && !fib_good_nh(nexthop_nh)))
>>>   			continue;
>>>   
>>> -		if (!found) {
>>> +		if (saddr && nexthop_nh->nh_saddr == saddr)
>>> +			nh_score += 2;
>>> +		if (hash <= nh_upper_bound)
>>> +			nh_score++;
>>> +		if (score < nh_score) {
>>>   			res->nh_sel = nhsel;
>>>   			res->nhc = &nexthop_nh->nh_common;
>>> -			found = !saddr || nexthop_nh->nh_saddr == saddr;
>>
>> if score == 3 return immediately?
> 
> We can also return early in the input path (!saddr) when score is 1.
> This seems to work:
> 
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 4d3650d20ff2..0caf38e44c73 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -2197,6 +2197,8 @@ void fib_select_multipath(struct fib_result *res, int hash,
>   		if (score < nh_score) {
>   			res->nh_sel = nhsel;
>   			res->nhc = &nexthop_nh->nh_common;
> +			if (nh_score == 3 || (!saddr && nh_score == 1))
> +				return;
>   			score = nh_score;
>   		}
> 

It makes sense to cut the loop short once the best possible score has
been found. I'm going to send v3.
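
For reference, here is a minimal standalone sketch of why the early return
should not change the selection. It is a userspace model, not the kernel
code: struct nh_model, select_full_scan(), select_early_exit() and
variants_agree() are hypothetical names, the neighbour check (fib_good_nh)
is left out, and the hash range is shrunk to small integers. The full scan
keeps the first nexthop with the highest score, and 3 (or 1 when there is
no saddr) is the highest score a nexthop can get, so returning on it should
pick the same nexthop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-nexthop data used by
 * fib_select_multipath(); not the kernel's struct fib_nh.
 */
struct nh_model {
	uint32_t saddr;		/* source address associated with the nexthop */
	int upper_bound;	/* cumulative hash threshold, -1 if unusable */
};

/* v2 logic: scan every nexthop, keep the first one with the best score. */
static int select_full_scan(const struct nh_model *nhs, size_t n,
			    int hash, uint32_t saddr)
{
	int score = -1, sel = -1;

	for (size_t i = 0; i < n; i++) {
		int nh_score = 0;

		if (nhs[i].upper_bound == -1)
			continue;
		if (saddr && nhs[i].saddr == saddr)
			nh_score += 2;	/* source address match */
		if (hash <= nhs[i].upper_bound)
			nh_score++;	/* hash falls within this nexthop */
		if (score < nh_score) {
			sel = (int)i;
			score = nh_score;
		}
	}
	return sel;
}

/* Same scoring, but stop as soon as the best possible score is reached. */
static int select_early_exit(const struct nh_model *nhs, size_t n,
			     int hash, uint32_t saddr)
{
	int score = -1, sel = -1;

	for (size_t i = 0; i < n; i++) {
		int nh_score = 0;

		if (nhs[i].upper_bound == -1)
			continue;
		if (saddr && nhs[i].saddr == saddr)
			nh_score += 2;
		if (hash <= nhs[i].upper_bound)
			nh_score++;
		if (score < nh_score) {
			sel = (int)i;
			if (nh_score == 3 || (!saddr && nh_score == 1))
				return sel;
			score = nh_score;
		}
	}
	return sel;
}

/* Returns 1 if both variants agree for every hash in [0, max_hash]. */
static int variants_agree(const struct nh_model *nhs, size_t n,
			  int max_hash, uint32_t saddr)
{
	assert(max_hash >= 0);
	for (int hash = 0; hash <= max_hash; hash++) {
		if (select_full_scan(nhs, n, hash, saddr) !=
		    select_early_exit(nhs, n, hash, saddr))
			return 0;
	}
	return 1;
}
```

With two nexthops modelled after veth1 (192.0.2.2) and veth3 (203.0.113.2)
and a saddr of 198.51.100.1 that matches neither, both variants fall back
to the hash and so spread lookups over both nexthops, which is the
behaviour the patch is meant to restore.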

> Tested with net/fib_tests.sh and forwarding/router_multipath.sh
> 
>>
>>> +			score = nh_score;
>>>   		}
>>>   
>>> -		if (hash > nh_upper_bound)
>>> -			continue;
>>> -
>>> -		if (!saddr || nexthop_nh->nh_saddr == saddr) {
>>> -			res->nh_sel = nhsel;
>>> -			res->nhc = &nexthop_nh->nh_common;
>>> -			return;
>>> -		}
>>> -
>>> -		if (found)
>>> -			return;
>>> -
>>>   	} endfor_nexthops(fi);
>>>   }
>>>   #endif
>>> -- 
>>> 2.47.3
>>>
>>
>>


Thread overview: 6+ messages
2025-12-20  3:23 [PATCH net v2 1/2] net: fib: restore ECMP balance from loopback Vadim Fedorenko
2025-12-20  3:23 ` [PATCH net v2 2/2] selftests: fib_test: Add test case for ipv4 multi nexthops Vadim Fedorenko
2025-12-21 15:59   ` Willem de Bruijn
2025-12-21 15:55 ` [PATCH net v2 1/2] net: fib: restore ECMP balance from loopback Willem de Bruijn
2025-12-21 16:58   ` Ido Schimmel
2025-12-21 18:49     ` Vadim Fedorenko [this message]
