Re: [PATCH net 1/2] ipv6: Start path selection from the first nexthop

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ido Schimmel <idosch@nvidia.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: netdev@vger.kernel.org, davem@davemloft.net, pabeni@redhat.com,
	edumazet@google.com, dsahern@kernel.org, horms@kernel.org,
	gnault@redhat.com, stfomichev@gmail.com
Subject: Re: [PATCH net 1/2] ipv6: Start path selection from the first nexthop
Date: Sun, 6 Apr 2025 16:45:11 +0300	[thread overview]
Message-ID: <Z_KFZ5cm7tOaBvw0@shredder> (raw)
In-Reply-To: <67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch>

Hi Willem,

Thanks for taking a look

On Fri, Apr 04, 2025 at 10:40:32AM -0400, Willem de Bruijn wrote:
> Ido Schimmel wrote:
> > Cited commit transitioned IPv6 path selection to use hash-threshold
> > instead of modulo-N. With hash-threshold, each nexthop is assigned a
> > region boundary in the multipath hash function's output space and a
> > nexthop is chosen if the calculated hash is smaller than the nexthop's
> > region boundary.
> > 
> > Hash-threshold does not work correctly if path selection does not start
> > with the first nexthop. For example, if fib6_select_path() is always
> > passed the last nexthop in the group, then it will always be chosen
> > because its region boundary covers the entire hash function's output
> > space.
> > 
> > Fix this by starting the selection process from the first nexthop and do
> > not consider nexthops for which rt6_score_route() provided a negative
> > score.
> > 
> > Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
> > Reported-by: Stanislav Fomichev <stfomichev@gmail.com>
> > Closes: https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/
> > Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> > ---
> >  net/ipv6/route.c | 38 +++++++++++++++++++++++++++++++++++---
> >  1 file changed, 35 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index c3406a0d45bd..864f0002034b 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -412,11 +412,35 @@ static bool rt6_check_expired(const struct rt6_info *rt)
> >  	return false;
> >  }
> >  
> > +static struct fib6_info *
> > +rt6_multipath_first_sibling_rcu(const struct fib6_info *rt)
> > +{
> > +	struct fib6_info *iter;
> > +	struct fib6_node *fn;
> > +
> > +	fn = rcu_dereference(rt->fib6_node);
> > +	if (!fn)
> > +		goto out;
> > +	iter = rcu_dereference(fn->leaf);
> > +	if (!iter)
> > +		goto out;
> > +
> > +	while (iter) {
> > +		if (iter->fib6_metric == rt->fib6_metric &&
> > +		    rt6_qualify_for_ecmp(iter))
> > +			return iter;
> > +		iter = rcu_dereference(iter->fib6_next);
> > +	}
> > +
> > +out:
> > +	return NULL;
> > +}
> 
> The rcu counterpart to rt6_multipath_first_sibling, which is used when
> computing the ranges in rt6_multipath_rebalance.

Right

> 
> > +
> >  void fib6_select_path(const struct net *net, struct fib6_result *res,
> >  		      struct flowi6 *fl6, int oif, bool have_oif_match,
> >  		      const struct sk_buff *skb, int strict)
> >  {
> > -	struct fib6_info *match = res->f6i;
> > +	struct fib6_info *first, *match = res->f6i;
> >  	struct fib6_info *sibling;
> >  
> >  	if (!match->nh && (!match->fib6_nsiblings || have_oif_match))
> > @@ -440,10 +464,18 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
> >  		return;
> >  	}
> >  
> > -	if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound))
> > +	first = rt6_multipath_first_sibling_rcu(match);
> > +	if (!first)
> >  		goto out;
> >  
> > -	list_for_each_entry_rcu(sibling, &match->fib6_siblings,
> > +	if (fl6->mp_hash <= atomic_read(&first->fib6_nh->fib_nh_upper_bound) &&
> > +	    rt6_score_route(first->fib6_nh, first->fib6_flags, oif,
> > +			    strict) >= 0) {
> 
> Does this fix address two issues in one patch: start from the first
> sibling, and check validity of the sibling?

The loop below will only choose a nexthop ('match = sibling') if its
score is not negative. The purpose of the check here is to do the same
for the first nexthop. That is, only choose a nexthop when calculated
hash is smaller than the nexthop's region boundary and the nexthop has a
non negative score.

This was not done before for 'match' because the caller already chose
'match' based on its score.

> The behavior on negative score for the first_sibling appears
> different from that on subsequent siblings in the for_each below:
> in that case the loop breaks, while for the first it skips?
> 
>                 if (fl6->mp_hash > nh_upper_bound)
>                         continue;
>                 if (rt6_score_route(nh, sibling->fib6_flags, oif, strict) < 0)
>                         break;
>                 match = sibling;
>                 break;
> 
> Am I reading that correct and is that intentional?

Hmm, I see. I think it makes sense to have the same behavior for all
nexthops. That is, if nexthop fits in terms of hash but has a negative
score, then fallback to 'match'. How about the following diff?

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ab12b816ab94..210b84cecc24 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -470,10 +470,10 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
                goto out;
 
        hash = fl6->mp_hash;
-       if (hash <= atomic_read(&first->fib6_nh->fib_nh_upper_bound) &&
-           rt6_score_route(first->fib6_nh, first->fib6_flags, oif,
-                           strict) >= 0) {
-               match = first;
+       if (hash <= atomic_read(&first->fib6_nh->fib_nh_upper_bound)) {
+               if (rt6_score_route(first->fib6_nh, first->fib6_flags, oif,
+                                   strict) >= 0)
+                       match = first;
                goto out;
        }

next prev parent reply	other threads:[~2025-04-06 13:45 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-02 11:42 [PATCH net 0/2] ipv6: Multipath routing fixes Ido Schimmel
2025-04-02 11:42 ` [PATCH net 1/2] ipv6: Start path selection from the first nexthop Ido Schimmel
2025-04-04 14:40   ` Willem de Bruijn
2025-04-06 13:45     ` Ido Schimmel [this message]
2025-04-06 18:30       ` Willem de Bruijn
2025-04-07  6:38         ` Ido Schimmel
2025-04-07 14:31           ` Willem de Bruijn
2025-04-02 11:42 ` [PATCH net 2/2] ipv6: Do not consider link down nexthops in path selection Ido Schimmel
2025-04-04 13:22   ` Willem de Bruijn
2025-04-04 14:03     ` Willem de Bruijn
2025-04-04 14:07       ` Willem de Bruijn
2025-04-04 14:40 ` [PATCH net 0/2] ipv6: Multipath routing fixes patchwork-bot+netdevbpf
2025-04-04 14:49   ` Jakub Kicinski
2025-04-04 16:22     ` Willem de Bruijn
2025-04-07 15:12 ` Guillaume Nault

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:ab12b816ab9 dfblob:210b84cecc2 )
 OR (
bs:"Re: [PATCH net 1/2] ipv6: Start path selection from the first nexthop" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z_KFZ5cm7tOaBvw0@shredder \
    --to=idosch@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=gnault@redhat.com \
    --cc=horms@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=stfomichev@gmail.com \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.