Re: [PATCH net 1/2] ipv6: Start path selection from the first nexthop

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Ido Schimmel <idosch@nvidia.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: netdev@vger.kernel.org, davem@davemloft.net, pabeni@redhat.com,
	edumazet@google.com, dsahern@kernel.org, horms@kernel.org,
	gnault@redhat.com, stfomichev@gmail.com
Subject: Re: [PATCH net 1/2] ipv6: Start path selection from the first nexthop
Date: Sun, 6 Apr 2025 16:45:11 +0300	[thread overview]
Message-ID: <Z_KFZ5cm7tOaBvw0@shredder> (raw)
In-Reply-To: <67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch>

Hi Willem,

Thanks for taking a look

On Fri, Apr 04, 2025 at 10:40:32AM -0400, Willem de Bruijn wrote:
> Ido Schimmel wrote:
> > Cited commit transitioned IPv6 path selection to use hash-threshold
> > instead of modulo-N. With hash-threshold, each nexthop is assigned a
> > region boundary in the multipath hash function's output space and a
> > nexthop is chosen if the calculated hash is smaller than the nexthop's
> > region boundary.
> > 
> > Hash-threshold does not work correctly if path selection does not start
> > with the first nexthop. For example, if fib6_select_path() is always
> > passed the last nexthop in the group, then it will always be chosen
> > because its region boundary covers the entire hash function's output
> > space.
> > 
> > Fix this by starting the selection process from the first nexthop and do
> > not consider nexthops for which rt6_score_route() provided a negative
> > score.
> > 
> > Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
> > Reported-by: Stanislav Fomichev <stfomichev@gmail.com>
> > Closes: https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/
> > Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> > ---
> >  net/ipv6/route.c | 38 +++++++++++++++++++++++++++++++++++---
> >  1 file changed, 35 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index c3406a0d45bd..864f0002034b 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -412,11 +412,35 @@ static bool rt6_check_expired(const struct rt6_info *rt)
> >  	return false;
> >  }
> >  
> > +static struct fib6_info *
> > +rt6_multipath_first_sibling_rcu(const struct fib6_info *rt)
> > +{
> > +	struct fib6_info *iter;
> > +	struct fib6_node *fn;
> > +
> > +	fn = rcu_dereference(rt->fib6_node);
> > +	if (!fn)
> > +		goto out;
> > +	iter = rcu_dereference(fn->leaf);
> > +	if (!iter)
> > +		goto out;
> > +
> > +	while (iter) {
> > +		if (iter->fib6_metric == rt->fib6_metric &&
> > +		    rt6_qualify_for_ecmp(iter))
> > +			return iter;
> > +		iter = rcu_dereference(iter->fib6_next);
> > +	}
> > +
> > +out:
> > +	return NULL;
> > +}
> 
> The rcu counterpart to rt6_multipath_first_sibling, which is used when
> computing the ranges in rt6_multipath_rebalance.

Right

> 
> > +
> >  void fib6_select_path(const struct net *net, struct fib6_result *res,
> >  		      struct flowi6 *fl6, int oif, bool have_oif_match,
> >  		      const struct sk_buff *skb, int strict)
> >  {
> > -	struct fib6_info *match = res->f6i;
> > +	struct fib6_info *first, *match = res->f6i;
> >  	struct fib6_info *sibling;
> >  
> >  	if (!match->nh && (!match->fib6_nsiblings || have_oif_match))
> > @@ -440,10 +464,18 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
> >  		return;
> >  	}
> >  
> > -	if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound))
> > +	first = rt6_multipath_first_sibling_rcu(match);
> > +	if (!first)
> >  		goto out;
> >  
> > -	list_for_each_entry_rcu(sibling, &match->fib6_siblings,
> > +	if (fl6->mp_hash <= atomic_read(&first->fib6_nh->fib_nh_upper_bound) &&
> > +	    rt6_score_route(first->fib6_nh, first->fib6_flags, oif,
> > +			    strict) >= 0) {
> 
> Does this fix address two issues in one patch: start from the first
> sibling, and check validity of the sibling?

The loop below will only choose a nexthop ('match = sibling') if its
score is not negative. The purpose of the check here is to do the same
for the first nexthop. That is, only choose a nexthop when calculated
hash is smaller than the nexthop's region boundary and the nexthop has a
non negative score.

This was not done before for 'match' because the caller already chose
'match' based on its score.

> The behavior on negative score for the first_sibling appears
> different from that on subsequent siblings in the for_each below:
> in that case the loop breaks, while for the first it skips?
> 
>                 if (fl6->mp_hash > nh_upper_bound)
>                         continue;
>                 if (rt6_score_route(nh, sibling->fib6_flags, oif, strict) < 0)
>                         break;
>                 match = sibling;
>                 break;
> 
> Am I reading that correct and is that intentional?

Hmm, I see. I think it makes sense to have the same behavior for all
nexthops. That is, if nexthop fits in terms of hash but has a negative
score, then fallback to 'match'. How about the following diff?

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ab12b816ab94..210b84cecc24 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -470,10 +470,10 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
                goto out;
 
        hash = fl6->mp_hash;
-       if (hash <= atomic_read(&first->fib6_nh->fib_nh_upper_bound) &&
-           rt6_score_route(first->fib6_nh, first->fib6_flags, oif,
-                           strict) >= 0) {
-               match = first;
+       if (hash <= atomic_read(&first->fib6_nh->fib_nh_upper_bound)) {
+               if (rt6_score_route(first->fib6_nh, first->fib6_flags, oif,
+                                   strict) >= 0)
+                       match = first;
                goto out;
        }

next prev parent reply	other threads:[~2025-04-06 13:45 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-02 11:42 [PATCH net 0/2] ipv6: Multipath routing fixes Ido Schimmel
2025-04-02 11:42 ` [PATCH net 1/2] ipv6: Start path selection from the first nexthop Ido Schimmel
2025-04-04 14:40   ` Willem de Bruijn
2025-04-06 13:45     ` Ido Schimmel [this message]
2025-04-06 18:30       ` Willem de Bruijn
2025-04-07  6:38         ` Ido Schimmel
2025-04-07 14:31           ` Willem de Bruijn
2025-04-02 11:42 ` [PATCH net 2/2] ipv6: Do not consider link down nexthops in path selection Ido Schimmel
2025-04-04 13:22   ` Willem de Bruijn
2025-04-04 14:03     ` Willem de Bruijn
2025-04-04 14:07       ` Willem de Bruijn
2025-04-04 14:40 ` [PATCH net 0/2] ipv6: Multipath routing fixes patchwork-bot+netdevbpf
2025-04-04 14:49   ` Jakub Kicinski
2025-04-04 16:22     ` Willem de Bruijn
2025-04-07 15:12 ` Guillaume Nault

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:ab12b816ab9 dfblob:210b84cecc2 )
 OR (
bs:"Re: [PATCH net 1/2] ipv6: Start path selection from the first nexthop" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z_KFZ5cm7tOaBvw0@shredder \
    --to=idosch@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=gnault@redhat.com \
    --cc=horms@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=stfomichev@gmail.com \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).