Re: [PATCH net-next v2] ipv4: Flush the FIB once per dev nexthop removal

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ido Schimmel <idosch@nvidia.com>
To: Cosmin Ratiu <cratiu@nvidia.com>
Cc: netdev@vger.kernel.org, David Ahern <dsahern@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Simon Horman <horms@kernel.org>,
	Paolo Abeni <pabeni@redhat.com>
Subject: Re: [PATCH net-next v2] ipv4: Flush the FIB once per dev nexthop removal
Date: Wed, 6 May 2026 16:01:29 +0300	[thread overview]
Message-ID: <20260506130129.GA665477@shredder> (raw)
In-Reply-To: <20260506092704.150349-1-cratiu@nvidia.com>

On Wed, May 06, 2026 at 12:27:04PM +0300, Cosmin Ratiu wrote:
> When a device is going down, all nexthops on it are removed, and for
> each nexthop being removed the FIB table is flushed, which does a full
> trie traversal looking for entries marked RTNH_F_DEAD and removing them.
> The performance of this is O(N x R), with N being number of dev nexthops
> and R being number of IPv4 routes.
> 
> The RTNL is held the entire time.
> 
> When there are many nexthops to be removed and many routing entries,
> this can result in the RTNL being held for multiple minutes, which
> causes unhappiness in other processes trying to acquire the RTNL (e.g.
> systemd-networkd for DHCP renewals).
> 
> In a complicated deployment with multiple vxlan devices, each having
> 16K nexthops and a total of 128K ipv4 routes, this is exactly what
> happens:
> 
> nexthop_flush_dev()                # loops over 16K nexthops
>   -> remove_nexthop()
>     -> __remove_nexthop()
>       -> __remove_nexthop_fib()    # marks fi->fib_flags |= RTNH_F_DEAD
>         -> fib_flush()             # for EACH nexthop!
> 	  -> fib_table_flush()     # walks the ENTIRE FIB, 128K entries
> 
> Change that so that a nexthop_flush_dev() does a single fib_flush()
> after all nexthops are removed. This is done with:
> - __remove_nexthop_fib() no longer flushes the FIB, instead returns
>   whether a flush is needed and is marked with __must_check.
> - __remove_nexthop() and remove_nexthop() propagate this return value up
>   with __must_check, which was also added to remove_nexthop_from_groups.
> - A new wrapper is defined, remove_one_nexthop() which calls
>   remove_nexthop() and flushes if necessary.
> - The two direct callers of __remove_nexthop() get a WARN_ON, since the
>   nh about to be removed should not have any FIB entries referencing it
>   when replacing or inserting a new one.
> - Callers which need to remove a single nexthop were migrated to
>   remove_one_nexthop().
> - Callers which need to remove multiple nexthops keep track in a local
>   bool whether a flush is needed and call flush once at the end.
> - This is plumbed through group removal as well, so when removing a leaf
>   nh causes a parent group to lose its last member, the group's flush is
>   also deferred, accumulated via remove_nexthop_from_groups() ->
>   remove_nh_grp_entry() -> remove_nexthop(). remove_nh_grp_entry() gets
>   a __must_check as well.
> 
> This dramatically improves performance from O(N x R) to O(N + R).
> 
> Releasing a nexthop reference in remove_nexthop() now no longer frees
> it. Instead, it is deleted when the last fib_info pointing to it gets
> freed via free_fib_info_rcu(). All routing code is already careful not
> to take into consideration routes marked with RTNH_F_DEAD.
> 
> Tested with:
> DEV=eth2
> ip link set up dev $DEV
> ip link add testnh0 link $DEV type macvlan mode bridge
> ip addr add 198.51.100.1/24 dev testnh0
> ip link set testnh0 up
> 
> seq 1 65536 | \
> sed 's/.*/nexthop add id & via 198.51.100.2 dev testnh0/' | \
> ip -batch -
> 
> i=1
> for a in $(seq 0 255); do
>   for b in $(seq 0 255); do
>     echo "route add 10.${a}.${b}.0/32 nhid $i"
>     i=$((i + 1))
>   done
> done | ip -batch -
> 
> time ip link set testnh0 down
> ip link del testnh0
> 
> Without this patch:
> real	0m32.601s
> user	0m0.000s
> sys	0m32.511s
> 
> With this patch:
> real	0m0.209s
> user	0m0.000s
> sys	0m0.153s
> 
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 88 +++++++++++++++++++++++++++++-----------------
>  1 file changed, 56 insertions(+), 32 deletions(-)

Patch LGTM, but it would have been easier to review if split into
multiple patches (not saying you should do it). Something like:

1. Change the various nexthop remove functions to return an indication if
flushing is required, but keep doing the flushing in
__remove_nexthop_fib(). Referring to these functions:

remove_nexthop()
__remove_nexthop()
__remove_nexthop_fib()
remove_nexthop_from_groups()
remove_nh_grp_entry()

2. Act upon the flushing indication in the various callers of
remove_nexthop() and remove the flushing from __remove_nexthop_fib().

3. Add __must_check annotations.

See one comment below

> @@ -2592,7 +2609,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
>  	if (!err) {
>  		nh_rt_cache_flush(net, old, new);
>  
> -		__remove_nexthop(net, new, NULL);
> +		WARN_ON(__remove_nexthop(net, new, NULL));
>  		nexthop_put(new);
>  	}

[...]

>  static struct nexthop *nexthop_create_group(struct net *net,
> @@ -2994,7 +3018,7 @@ static struct nexthop *nexthop_add(struct net *net, struct nh_config *cfg,
>  
>  	err = insert_nexthop(net, nh, cfg, extack);
>  	if (err) {
> -		__remove_nexthop(net, nh, NULL);
> +		WARN_ON(__remove_nexthop(net, nh, NULL));
>  		nexthop_put(nh);
>  		nh = ERR_PTR(err);
>  	}

Please change both to WARN_ON_ONCE(). See
Documentation/process/coding-style.rst ("Use WARN_ON_ONCE() rather than
WARN() or WARN_ON()"). I do realize we already have WARN_ON() in the
file, but let's avoid adding more.

next prev parent reply	other threads:[~2026-05-06 13:01 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-06  9:27 [PATCH net-next v2] ipv4: Flush the FIB once per dev nexthop removal Cosmin Ratiu
2026-05-06 13:01 ` Ido Schimmel [this message]
2026-05-06 16:26   ` David Ahern
2026-05-06 17:53     ` Cosmin Ratiu
2026-05-06 18:05       ` David Ahern

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260506130129.GA665477@shredder \
    --to=idosch@nvidia.com \
    --cc=cratiu@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.