From: Ido Schimmel <idosch@nvidia.com>
To: Cosmin Ratiu <cratiu@nvidia.com>
Cc: netdev@vger.kernel.org, David Ahern <dsahern@kernel.org>,
"David S . Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Simon Horman <horms@kernel.org>,
Paolo Abeni <pabeni@redhat.com>
Subject: Re: [PATCH net-next v2] ipv4: Flush the FIB once per dev nexthop removal
Date: Wed, 6 May 2026 16:01:29 +0300 [thread overview]
Message-ID: <20260506130129.GA665477@shredder> (raw)
In-Reply-To: <20260506092704.150349-1-cratiu@nvidia.com>
On Wed, May 06, 2026 at 12:27:04PM +0300, Cosmin Ratiu wrote:
> When a device is going down, all nexthops on it are removed, and for
> each nexthop being removed the FIB table is flushed, which does a full
> trie traversal looking for entries marked RTNH_F_DEAD and removing them.
> The performance of this is O(N x R), with N being number of dev nexthops
> and R being number of IPv4 routes.
>
> The RTNL is held the entire time.
>
> When there are many nexthops to be removed and many routing entries,
> this can result in the RTNL being held for multiple minutes, which
> causes unhappiness in other processes trying to acquire the RTNL (e.g.
> systemd-networkd for DHCP renewals).
>
> In a complicated deployment with multiple vxlan devices, each having
> 16K nexthops and a total of 128K ipv4 routes, this is exactly what
> happens:
>
> nexthop_flush_dev() # loops over 16K nexthops
> -> remove_nexthop()
> -> __remove_nexthop()
> -> __remove_nexthop_fib() # marks fi->fib_flags |= RTNH_F_DEAD
> -> fib_flush() # for EACH nexthop!
> -> fib_table_flush() # walks the ENTIRE FIB, 128K entries
>
> Change that so that a nexthop_flush_dev() does a single fib_flush()
> after all nexthops are removed. This is done with:
> - __remove_nexthop_fib() no longer flushes the FIB, instead returns
> whether a flush is needed and is marked with __must_check.
> - __remove_nexthop() and remove_nexthop() propagate this return value up
> with __must_check, which was also added to remove_nexthop_from_groups.
> - A new wrapper is defined, remove_one_nexthop() which calls
> remove_nexthop() and flushes if necessary.
> - The two direct callers of __remove_nexthop() get a WARN_ON, since the
> nh about to be removed should not have any FIB entries referencing it
> when replacing or inserting a new one.
> - Callers which need to remove a single nexthop were migrated to
> remove_one_nexthop().
> - Callers which need to remove multiple nexthops keep track in a local
> bool whether a flush is needed and call flush once at the end.
> - This is plumbed through group removal as well, so when removing a leaf
> nh causes a parent group to lose its last member, the group's flush is
> also deferred, accumulated via remove_nexthop_from_groups() ->
> remove_nh_grp_entry() -> remove_nexthop(). remove_nh_grp_entry() gets
> a __must_check as well.
>
> This dramatically improves performance from O(N x R) to O(N + R).
>
> Releasing a nexthop reference in remove_nexthop() now no longer frees
> it. Instead, it is deleted when the last fib_info pointing to it gets
> freed via free_fib_info_rcu(). All routing code is already careful not
> to take into consideration routes marked with RTNH_F_DEAD.
>
> Tested with:
> DEV=eth2
> ip link set up dev $DEV
> ip link add testnh0 link $DEV type macvlan mode bridge
> ip addr add 198.51.100.1/24 dev testnh0
> ip link set testnh0 up
>
> seq 1 65536 | \
> sed 's/.*/nexthop add id & via 198.51.100.2 dev testnh0/' | \
> ip -batch -
>
> i=1
> for a in $(seq 0 255); do
> for b in $(seq 0 255); do
> echo "route add 10.${a}.${b}.0/32 nhid $i"
> i=$((i + 1))
> done
> done | ip -batch -
>
> time ip link set testnh0 down
> ip link del testnh0
>
> Without this patch:
> real 0m32.601s
> user 0m0.000s
> sys 0m32.511s
>
> With this patch:
> real 0m0.209s
> user 0m0.000s
> sys 0m0.153s
>
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> ---
> net/ipv4/nexthop.c | 88 +++++++++++++++++++++++++++++-----------------
> 1 file changed, 56 insertions(+), 32 deletions(-)
Patch LGTM, but it would have been easier to review if split into
multiple patches (not saying you should do it). Something like:
1. Change the various nexthop remove functions to return an indication if
flushing is required, but keep doing the flushing in
__remove_nexthop_fib(). Referring to these functions:
remove_nexthop()
__remove_nexthop()
__remove_nexthop_fib()
remove_nexthop_from_groups()
remove_nh_grp_entry()
2. Act upon the flushing indication in the various callers of
remove_nexthop() and remove the flushing from __remove_nexthop_fib().
3. Add __must_check annotations.
See one comment below
> @@ -2592,7 +2609,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
> if (!err) {
> nh_rt_cache_flush(net, old, new);
>
> - __remove_nexthop(net, new, NULL);
> + WARN_ON(__remove_nexthop(net, new, NULL));
> nexthop_put(new);
> }
[...]
> static struct nexthop *nexthop_create_group(struct net *net,
> @@ -2994,7 +3018,7 @@ static struct nexthop *nexthop_add(struct net *net, struct nh_config *cfg,
>
> err = insert_nexthop(net, nh, cfg, extack);
> if (err) {
> - __remove_nexthop(net, nh, NULL);
> + WARN_ON(__remove_nexthop(net, nh, NULL));
> nexthop_put(nh);
> nh = ERR_PTR(err);
> }
Please change both to WARN_ON_ONCE(). See
Documentation/process/coding-style.rst ("Use WARN_ON_ONCE() rather than
WARN() or WARN_ON()"). I do realize we already have WARN_ON() in the
file, but let's avoid adding more.
next prev parent reply other threads:[~2026-05-06 13:01 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-06 9:27 [PATCH net-next v2] ipv4: Flush the FIB once per dev nexthop removal Cosmin Ratiu
2026-05-06 13:01 ` Ido Schimmel [this message]
2026-05-06 16:26 ` David Ahern
2026-05-06 17:53 ` Cosmin Ratiu
2026-05-06 18:05 ` David Ahern
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260506130129.GA665477@shredder \
--to=idosch@nvidia.com \
--cc=cratiu@nvidia.com \
--cc=davem@davemloft.net \
--cc=dsahern@kernel.org \
--cc=edumazet@google.com \
--cc=horms@kernel.org \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox