From: Ido Schimmel <idosch@idosch.org>
To: cpaasch@openai.com
Cc: David Ahern <dsahern@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Simon Horman <horms@kernel.org>,
netdev@vger.kernel.org
Subject: Re: [PATCH net-next] net: Make nexthop-dumps scale linearly with the number of nexthops
Date: Fri, 25 Jul 2025 17:05:52 +0300 [thread overview]
Message-ID: <aIOPQH-S5LAPCb1u@shredder> (raw)
In-Reply-To: <20250724-nexthop_dump-v1-1-6b43fffd5bac@openai.com>
On Thu, Jul 24, 2025 at 05:10:36PM -0700, Christoph Paasch via B4 Relay wrote:
> From: Christoph Paasch <cpaasch@openai.com>
>
> When we have a (very) large number of nexthops, they do not fit within a
> single message. rtm_dump_walk_nexthops() thus will be called repeatedly
> and ctx->idx is used to avoid dumping the same nexthops again.
>
> The approach in which we avoid dumpint the same nexthops is by basically
s/dumpint/dumping/
> walking the entire nexthop rb-tree from the left-most node until we find
> a node whose id is >= s_idx. That does not scale well.
>
> Instead of this non-efficient approach, rather go directly through the
^ double space
s/non-efficient/inefficient/ ?
> tree to the nexthop that should be dumped (the one whose nh_id >=
> s_idx). This allows us to find the relevant node in O(log(n)).
>
> We have quite a nice improvement with this:
>
> Before:
> =======
>
> --> ~1M nexthops:
> $ time ~/libnl/src/nl-nh-list | wc -l
> 1050624
>
> real 0m21.080s
> user 0m0.666s
> sys 0m20.384s
>
> --> ~2M nexthops:
> $ time ~/libnl/src/nl-nh-list | wc -l
> 2101248
>
> real 1m51.649s
> user 0m1.540s
> sys 1m49.908s
>
> After:
> ======
>
> --> ~1M nexthops:
> $ time ~/libnl/src/nl-nh-list | wc -l
> 1050624
>
> real 0m1.157s
> user 0m0.926s
> sys 0m0.259s
>
> --> ~2M nexthops:
> $ time ~/libnl/src/nl-nh-list | wc -l
> 2101248
>
> real 0m2.763s
> user 0m2.042s
> sys 0m0.776s
I was able to reproduce these results.
>
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
> net/ipv4/nexthop.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index 29118c43ebf5f1e91292fe227d4afde313e564bb..226447b1c17d22eab9121bed88c0c2b9148884ac 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -3511,7 +3511,39 @@ static int rtm_dump_walk_nexthops(struct sk_buff *skb,
> int err;
>
> s_idx = ctx->idx;
> - for (node = rb_first(root); node; node = rb_next(node)) {
> +
> + /*
> + * If this is not the first invocation, ctx->idx will contain the id of
> + * the last nexthop we processed. Instead of starting from the very first
> + * element of the red/black tree again and linearly skipping the
> + * (potentially large) set of nodes with an id smaller than s_idx, walk the
> + * tree and find the left-most node whose id is >= s_idx. This provides an
> + * efficient O(log n) starting point for the dump continuation.
> + */
Please try to keep lines at 80 characters.
> + if (s_idx != 0) {
> + struct rb_node *tmp = root->rb_node;
> +
> + node = NULL;
> + while (tmp) {
> + struct nexthop *nh;
> +
> + nh = rb_entry(tmp, struct nexthop, rb_node);
> + if (nh->id < s_idx) {
> + tmp = tmp->rb_right;
> + } else {
> + /* Track current candidate and keep looking on
> + * the left side to find the left-most
> + * (smallest id) that is still >= s_idx.
> + */
I'm aware that netdev now accepts both comment styles, but it's a bit
weird to mix both in the same commit and at the same function.
> + node = tmp;
> + tmp = tmp->rb_left;
> + }
> + }
> + } else {
> + node = rb_first(root);
> + }
> +
> + for (; node; node = rb_next(node)) {
> struct nexthop *nh;
>
> nh = rb_entry(node, struct nexthop, rb_node);
The code below is:
if (nh->id < s_idx)
continue;
Can't it be removed given the above code means we start at a nexthop
whose identifier is at least s_idx ?
next prev parent reply other threads:[~2025-07-25 14:05 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-25 0:10 [PATCH net-next] net: Make nexthop-dumps scale linearly with the number of nexthops Christoph Paasch
2025-07-25 0:10 ` Christoph Paasch via B4 Relay
2025-07-25 14:05 ` Ido Schimmel [this message]
2025-07-25 17:47 ` Christoph Paasch
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aIOPQH-S5LAPCb1u@shredder \
--to=idosch@idosch.org \
--cc=cpaasch@openai.com \
--cc=davem@davemloft.net \
--cc=dsahern@kernel.org \
--cc=edumazet@google.com \
--cc=horms@kernel.org \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.