Netdev List
 help / color / mirror / Atom feed
From: Ido Schimmel <idosch@nvidia.com>
To: Cosmin Ratiu <cratiu@nvidia.com>
Cc: netdev@vger.kernel.org, David Ahern <dsahern@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Simon Horman <horms@kernel.org>,
	Paolo Abeni <pabeni@redhat.com>
Subject: Re: [PATCH net-next v2] ipv4: Flush the FIB once per dev nexthop removal
Date: Wed, 6 May 2026 16:01:29 +0300	[thread overview]
Message-ID: <20260506130129.GA665477@shredder> (raw)
In-Reply-To: <20260506092704.150349-1-cratiu@nvidia.com>

On Wed, May 06, 2026 at 12:27:04PM +0300, Cosmin Ratiu wrote:
> When a device is going down, all nexthops on it are removed, and for
> each nexthop being removed the FIB table is flushed, which does a full
> trie traversal looking for entries marked RTNH_F_DEAD and removing them.
> The performance of this is O(N x R), with N being number of dev nexthops
> and R being number of IPv4 routes.
> 
> The RTNL is held the entire time.
> 
> When there are many nexthops to be removed and many routing entries,
> this can result in the RTNL being held for multiple minutes, which
> causes unhappiness in other processes trying to acquire the RTNL (e.g.
> systemd-networkd for DHCP renewals).
> 
> In a complicated deployment with multiple vxlan devices, each having
> 16K nexthops and a total of 128K ipv4 routes, this is exactly what
> happens:
> 
> nexthop_flush_dev()                # loops over 16K nexthops
>   -> remove_nexthop()
>     -> __remove_nexthop()
>       -> __remove_nexthop_fib()    # marks fi->fib_flags |= RTNH_F_DEAD
>         -> fib_flush()             # for EACH nexthop!
> 	  -> fib_table_flush()     # walks the ENTIRE FIB, 128K entries
> 
> Change that so that a nexthop_flush_dev() does a single fib_flush()
> after all nexthops are removed. This is done with:
> - __remove_nexthop_fib() no longer flushes the FIB, instead returns
>   whether a flush is needed and is marked with __must_check.
> - __remove_nexthop() and remove_nexthop() propagate this return value up
>   with __must_check, which was also added to remove_nexthop_from_groups.
> - A new wrapper is defined, remove_one_nexthop() which calls
>   remove_nexthop() and flushes if necessary.
> - The two direct callers of __remove_nexthop() get a WARN_ON, since the
>   nh about to be removed should not have any FIB entries referencing it
>   when replacing or inserting a new one.
> - Callers which need to remove a single nexthop were migrated to
>   remove_one_nexthop().
> - Callers which need to remove multiple nexthops keep track in a local
>   bool whether a flush is needed and call flush once at the end.
> - This is plumbed through group removal as well, so when removing a leaf
>   nh causes a parent group to lose its last member, the group's flush is
>   also deferred, accumulated via remove_nexthop_from_groups() ->
>   remove_nh_grp_entry() -> remove_nexthop(). remove_nh_grp_entry() gets
>   a __must_check as well.
> 
> This dramatically improves performance from O(N x R) to O(N + R).
> 
> Releasing a nexthop reference in remove_nexthop() now no longer frees
> it. Instead, it is deleted when the last fib_info pointing to it gets
> freed via free_fib_info_rcu(). All routing code is already careful not
> to take into consideration routes marked with RTNH_F_DEAD.
> 
> Tested with:
> DEV=eth2
> ip link set up dev $DEV
> ip link add testnh0 link $DEV type macvlan mode bridge
> ip addr add 198.51.100.1/24 dev testnh0
> ip link set testnh0 up
> 
> seq 1 65536 | \
> sed 's/.*/nexthop add id & via 198.51.100.2 dev testnh0/' | \
> ip -batch -
> 
> i=1
> for a in $(seq 0 255); do
>   for b in $(seq 0 255); do
>     echo "route add 10.${a}.${b}.0/32 nhid $i"
>     i=$((i + 1))
>   done
> done | ip -batch -
> 
> time ip link set testnh0 down
> ip link del testnh0
> 
> Without this patch:
> real	0m32.601s
> user	0m0.000s
> sys	0m32.511s
> 
> With this patch:
> real	0m0.209s
> user	0m0.000s
> sys	0m0.153s
> 
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 88 +++++++++++++++++++++++++++++-----------------
>  1 file changed, 56 insertions(+), 32 deletions(-)

Patch LGTM, but it would have been easier to review if split into
multiple patches (not saying you should do it). Something like:

1. Change the various nexthop remove functions to return an indication if
flushing is required, but keep doing the flushing in
__remove_nexthop_fib(). Referring to these functions:

remove_nexthop()
__remove_nexthop()
__remove_nexthop_fib()
remove_nexthop_from_groups()
remove_nh_grp_entry()

2. Act upon the flushing indication in the various callers of
remove_nexthop() and remove the flushing from __remove_nexthop_fib().

3. Add __must_check annotations.

See one comment below

> @@ -2592,7 +2609,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
>  	if (!err) {
>  		nh_rt_cache_flush(net, old, new);
>  
> -		__remove_nexthop(net, new, NULL);
> +		WARN_ON(__remove_nexthop(net, new, NULL));
>  		nexthop_put(new);
>  	}

[...]

>  static struct nexthop *nexthop_create_group(struct net *net,
> @@ -2994,7 +3018,7 @@ static struct nexthop *nexthop_add(struct net *net, struct nh_config *cfg,
>  
>  	err = insert_nexthop(net, nh, cfg, extack);
>  	if (err) {
> -		__remove_nexthop(net, nh, NULL);
> +		WARN_ON(__remove_nexthop(net, nh, NULL));
>  		nexthop_put(nh);
>  		nh = ERR_PTR(err);
>  	}

Please change both to WARN_ON_ONCE(). See
Documentation/process/coding-style.rst ("Use WARN_ON_ONCE() rather than
WARN() or WARN_ON()"). I do realize we already have WARN_ON() in the
file, but let's avoid adding more.

  reply	other threads:[~2026-05-06 13:01 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-06  9:27 [PATCH net-next v2] ipv4: Flush the FIB once per dev nexthop removal Cosmin Ratiu
2026-05-06 13:01 ` Ido Schimmel [this message]
2026-05-06 16:26   ` David Ahern
2026-05-06 17:53     ` Cosmin Ratiu
2026-05-06 18:05       ` David Ahern

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260506130129.GA665477@shredder \
    --to=idosch@nvidia.com \
    --cc=cratiu@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox