[PATCH net-next] net: Make nexthop-dumps scale linearly with the number of nexthops

public inbox for netdev@vger.kernel.org

* [PATCH net-next] net: Make nexthop-dumps scale linearly with the number of nexthops
@ 2025-07-25  0:10 Christoph Paasch via B4 Relay
  2025-07-25 14:05 ` Ido Schimmel
  0 siblings, 1 reply; 3+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-07-25  0:10 UTC
  To: David Ahern, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: netdev, Christoph Paasch

From: Christoph Paasch <cpaasch@openai.com>

When we have a (very) large number of nexthops, they do not fit within a
single message. rtm_dump_walk_nexthops() thus will be called repeatedly
and ctx->idx is used to avoid dumping the same nexthops again.

The approach in which we avoid dumpint the same nexthops is by basically
walking the entire nexthop rb-tree from the left-most node until we find
a node whose id is >= s_idx. That does not scale well.

Instead of this non-efficient  approach, rather go directly through the
tree to the nexthop that should be dumped (the one whose nh_id >=
s_idx). This allows us to find the relevant node in O(log(n)).

We have quite a nice improvement with this:

Before:
=======

--> ~1M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
1050624

real	0m21.080s
user	0m0.666s
sys	0m20.384s

--> ~2M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
2101248

real	1m51.649s
user	0m1.540s
sys	1m49.908s

After:
======

--> ~1M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
1050624

real	0m1.157s
user	0m0.926s
sys	0m0.259s

--> ~2M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
2101248

real	0m2.763s
user	0m2.042s
sys	0m0.776s

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 net/ipv4/nexthop.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 29118c43ebf5f1e91292fe227d4afde313e564bb..226447b1c17d22eab9121bed88c0c2b9148884ac 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -3511,7 +3511,39 @@ static int rtm_dump_walk_nexthops(struct sk_buff *skb,
 	int err;
 
 	s_idx = ctx->idx;
-	for (node = rb_first(root); node; node = rb_next(node)) {
+
+	/*
+	 * If this is not the first invocation, ctx->idx will contain the id of
+	 * the last nexthop we processed.  Instead of starting from the very first
+	 * element of the red/black tree again and linearly skipping the
+	 * (potentially large) set of nodes with an id smaller than s_idx, walk the
+	 * tree and find the left-most node whose id is >= s_idx.  This provides an
+	 * efficient O(log n) starting point for the dump continuation.
+	 */
+	if (s_idx != 0) {
+		struct rb_node *tmp = root->rb_node;
+
+		node = NULL;
+		while (tmp) {
+			struct nexthop *nh;
+
+			nh = rb_entry(tmp, struct nexthop, rb_node);
+			if (nh->id < s_idx) {
+				tmp = tmp->rb_right;
+			} else {
+				/* Track current candidate and keep looking on
+				 * the left side to find the left-most
+				 * (smallest id) that is still >= s_idx.
+				 */
+				node = tmp;
+				tmp = tmp->rb_left;
+			}
+		}
+	} else {
+		node = rb_first(root);
+	}
+
+	for (; node; node = rb_next(node)) {
 		struct nexthop *nh;
 
 		nh = rb_entry(node, struct nexthop, rb_node);

---
base-commit: 8b5a19b4ff6a2096225d88cf24cfeef03edc1bed
change-id: 20250724-nexthop_dump-f6c32472bcdf

Best regards,
-- 
Christoph Paasch <cpaasch@openai.com>
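
The descent added by the patch is a lower-bound search on a binary search
tree. A minimal userspace sketch of that idea follows, assuming a plain
BST in place of the kernel rb-tree. struct demo_node, demo_lower_bound()
and demo_sample_tree() are hypothetical names used only for illustration
and do not exist in net/ipv4/nexthop.c.

```c
#include <stddef.h>

/* Hypothetical stand-in for a struct nexthop linked into an rb-tree. */
struct demo_node {
	unsigned int id;
	struct demo_node *left;
	struct demo_node *right;
};

/*
 * Return the node with the smallest id that is >= s_idx, or NULL if every
 * id in the tree is smaller than s_idx. Each step descends one level, so
 * the cost is bounded by the height of the tree.
 */
struct demo_node *demo_lower_bound(struct demo_node *root, unsigned int s_idx)
{
	struct demo_node *best = NULL;

	while (root) {
		if (root->id < s_idx) {
			/* Everything on the left is smaller still. */
			root = root->right;
		} else {
			/* Candidate found; a smaller one may be on the left. */
			best = root;
			root = root->left;
		}
	}
	return best;
}

/* Build a fixed, balanced sample tree holding ids 10, 20, ..., 70. */
struct demo_node *demo_sample_tree(void)
{
	static struct demo_node n[7];
	size_t i;

	for (i = 0; i < 7; i++) {
		n[i].id = (unsigned int)(i + 1) * 10;
		n[i].left = NULL;
		n[i].right = NULL;
	}
	n[1].left = &n[0];	/* 20 -> 10 */
	n[1].right = &n[2];	/* 20 -> 30 */
	n[5].left = &n[4];	/* 60 -> 50 */
	n[5].right = &n[6];	/* 60 -> 70 */
	n[3].left = &n[1];	/* 40 -> 20 */
	n[3].right = &n[5];	/* 40 -> 60 */
	return &n[3];
}
```

Calling demo_lower_bound(demo_sample_tree(), 35) returns the node with
id 40. A start index above 70 returns NULL, which corresponds to the dump
loop having nothing left to emit.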




* Re: [PATCH net-next] net: Make nexthop-dumps scale linearly with the number of nexthops
  2025-07-25  0:10 [PATCH net-next] net: Make nexthop-dumps scale linearly with the number of nexthops Christoph Paasch via B4 Relay
@ 2025-07-25 14:05 ` Ido Schimmel
  2025-07-25 17:47   ` Christoph Paasch
  0 siblings, 1 reply; 3+ messages in thread
From: Ido Schimmel @ 2025-07-25 14:05 UTC
  To: cpaasch
  Cc: David Ahern, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netdev

On Thu, Jul 24, 2025 at 05:10:36PM -0700, Christoph Paasch via B4 Relay wrote:
> From: Christoph Paasch <cpaasch@openai.com>
> 
> When we have a (very) large number of nexthops, they do not fit within a
> single message. rtm_dump_walk_nexthops() thus will be called repeatedly
> and ctx->idx is used to avoid dumping the same nexthops again.
> 
> The approach in which we avoid dumpint the same nexthops is by basically

s/dumpint/dumping/

> walking the entire nexthop rb-tree from the left-most node until we find
> a node whose id is >= s_idx. That does not scale well.
> 
> Instead of this non-efficient  approach, rather go directly through the
                               ^ double space
s/non-efficient/inefficient/ ?

> tree to the nexthop that should be dumped (the one whose nh_id >=
> s_idx). This allows us to find the relevant node in O(log(n)).
> 
> We have quite a nice improvement with this:
> 
> Before:
> =======
> 
> --> ~1M nexthops:
> $ time ~/libnl/src/nl-nh-list | wc -l
> 1050624
> 
> real	0m21.080s
> user	0m0.666s
> sys	0m20.384s
> 
> --> ~2M nexthops:
> $ time ~/libnl/src/nl-nh-list | wc -l
> 2101248
> 
> real	1m51.649s
> user	0m1.540s
> sys	1m49.908s
> 
> After:
> ======
> 
> --> ~1M nexthops:
> $ time ~/libnl/src/nl-nh-list | wc -l
> 1050624
> 
> real	0m1.157s
> user	0m0.926s
> sys	0m0.259s
> 
> --> ~2M nexthops:
> $ time ~/libnl/src/nl-nh-list | wc -l
> 2101248
> 
> real	0m2.763s
> user	0m2.042s
> sys	0m0.776s

I was able to reproduce these results.

> 
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
>  net/ipv4/nexthop.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index 29118c43ebf5f1e91292fe227d4afde313e564bb..226447b1c17d22eab9121bed88c0c2b9148884ac 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -3511,7 +3511,39 @@ static int rtm_dump_walk_nexthops(struct sk_buff *skb,
>  	int err;
>  
>  	s_idx = ctx->idx;
> -	for (node = rb_first(root); node; node = rb_next(node)) {
> +
> +	/*
> +	 * If this is not the first invocation, ctx->idx will contain the id of
> +	 * the last nexthop we processed.  Instead of starting from the very first
> +	 * element of the red/black tree again and linearly skipping the
> +	 * (potentially large) set of nodes with an id smaller than s_idx, walk the
> +	 * tree and find the left-most node whose id is >= s_idx.  This provides an
> +	 * efficient O(log n) starting point for the dump continuation.
> +	 */

Please try to keep lines at 80 characters.

> +	if (s_idx != 0) {
> +		struct rb_node *tmp = root->rb_node;
> +
> +		node = NULL;
> +		while (tmp) {
> +			struct nexthop *nh;
> +
> +			nh = rb_entry(tmp, struct nexthop, rb_node);
> +			if (nh->id < s_idx) {
> +				tmp = tmp->rb_right;
> +			} else {
> +				/* Track current candidate and keep looking on
> +				 * the left side to find the left-most
> +				 * (smallest id) that is still >= s_idx.
> +				 */

I'm aware that netdev now accepts both comment styles, but it's a bit
odd to mix both in the same commit and in the same function.

> +				node = tmp;
> +				tmp = tmp->rb_left;
> +			}
> +		}
> +	} else {
> +		node = rb_first(root);
> +	}
> +
> +	for (; node; node = rb_next(node)) {
>  		struct nexthop *nh;
>  
>  		nh = rb_entry(node, struct nexthop, rb_node);

The code below is:

if (nh->id < s_idx)
	continue;

Can't it be removed, given that the code above means we start at a
nexthop whose identifier is at least s_idx?


* Re: [PATCH net-next] net: Make nexthop-dumps scale linearly with the number of nexthops
  2025-07-25 14:05 ` Ido Schimmel
@ 2025-07-25 17:47   ` Christoph Paasch
  0 siblings, 0 replies; 3+ messages in thread
From: Christoph Paasch @ 2025-07-25 17:47 UTC
  To: Ido Schimmel
  Cc: David Ahern, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netdev

On Fri, Jul 25, 2025 at 7:05 AM Ido Schimmel <idosch@idosch.org> wrote:
> The code below is:
>
> if (nh->id < s_idx)
>         continue;
>
> Can't it be removed, given that the code above means we start at a
> nexthop whose identifier is at least s_idx?

Yes, we can drop this check.

Thanks for all your feedback. Will resubmit when net-next reopens.


Christoph
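
Why the old walk scaled badly can be illustrated with a rough cost model.
The sketch below counts how many tree nodes are touched over a whole
dump. It assumes that every dump call emits a fixed batch of nexthops
(the batch parameter is hypothetical; the real value depends on the
netlink buffer size) and it approximates the tree height as
floor(log2(n)). The function names are illustrative only.

```c
#include <stdint.h>

/*
 * Old scheme: every call restarts at the left-most node, skips the 'done'
 * nodes that were already dumped, then dumps the next chunk.
 */
uint64_t demo_visits_linear_skip(uint64_t n, uint64_t batch)
{
	uint64_t total = 0, done;

	if (batch == 0)
		return 0;
	for (done = 0; done < n; done += batch) {
		uint64_t chunk = (n - done < batch) ? n - done : batch;

		total += done + chunk;
	}
	return total;
}

/*
 * New scheme: every call descends the tree once (about log2(n) steps in
 * this model) and then dumps the next chunk.
 */
uint64_t demo_visits_lower_bound(uint64_t n, uint64_t batch)
{
	uint64_t total = 0, done, depth = 0, m;

	if (batch == 0)
		return 0;
	for (m = n; m > 1; m >>= 1)
		depth++;	/* floor(log2(n)) */
	for (done = 0; done < n; done += batch) {
		uint64_t chunk = (n - done < batch) ? n - done : batch;

		total += depth + chunk;
	}
	return total;
}
```

With a batch of 10, the model gives 50500 visits for 1000 nodes and
201000 for 2000 nodes under the old scheme, roughly 4x the work for 2x
the nodes. That is in line with the superlinear growth of the "Before"
timings in the patch description. Under the new scheme the same inputs
give 1900 and 4000 visits.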
