netdev.vger.kernel.org archive mirror
* Re: IPv4 and IPv6 stack multi-FIB, scalable in the million of entries.
       [not found] <1IJuR-8qH-39@gated-at.bofh.it>
@ 2004-04-08 17:18 ` Andi Kleen
  2004-04-08 18:10   ` Mathieu Giguere
  0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2004-04-08 17:18 UTC (permalink / raw)
  To: Mathieu Giguere; +Cc: netdev, linux-kernel

"Mathieu Giguere" <Mathieu.Giguere@ericsson.ca> writes:

[you should probably discuss that on netdev@oss.sgi.com instead, cc'ed]

>     We are currently looking for a multi-FIB routing table that scales to
> millions of entries, with no routing cache, for IPv4 and IPv6.  We want an IP stack

No routing cache? Doesn't sound like a good idea.

> that can have log(n) (or better) insertion/deletion and lookup
> performance.  Predictable performance, even with millions of entries.

And even more vast overkill for most linux users than the existing
routing code already is.  Linux has at least the beginnings of a pluggable
FIB interface (fib_table), which has slightly bit rotted, but probably
not too bad. I would suggest you clean that up, make the existing
hash table really optional and then you can just plug in anything you want.

>     I enclose a patch that replaces fib_hash in IPv4 with a Patricia tree,
> ready for multi-FIB, based on a 2.4.22 kernel.  This is the beginning of a
> long cleanup.

What do you consider dirty in the current stack? 

-Andi


* Re: IPv4 and IPv6 stack multi-FIB, scalable in the million of entries.
  2004-04-08 17:18 ` Andi Kleen
@ 2004-04-08 18:10   ` Mathieu Giguere
  2004-04-08 18:33     ` David S. Miller
  2004-04-08 18:34     ` alex
  0 siblings, 2 replies; 8+ messages in thread
From: Mathieu Giguere @ 2004-04-08 18:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: netdev, linux-kernel

Hi,

Why do we need a routing cache?  Is a Patricia tree or radix tree not fast
enough with the kind of memory speed we have in PC technology now?
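
For readers unfamiliar with the structure under discussion, here is a minimal sketch of longest-prefix matching over a plain binary trie. It is not the code from the patch: the class and method names are illustrative, and a real Patricia tree would additionally collapse chains of single-child nodes. The point it shows is that lookup cost is bounded by the address width (32 or 128 bit tests), not by the number of routes.

```python
import ipaddress


class TrieNode:
    """One bit position in the trie; nexthop is set only where a prefix ends."""
    __slots__ = ("children", "nexthop")

    def __init__(self):
        self.children = [None, None]
        self.nexthop = None


class RouteTrie:
    """Uncompressed binary trie for one address family (do not mix v4 and v6)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix, nexthop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        width = net.max_prefixlen
        node = self.root
        # Walk (and create) one node per prefix bit, most significant first.
        for i in range(net.prefixlen):
            bit = (bits >> (width - 1 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.nexthop = nexthop

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        bits = int(addr)
        width = addr.max_prefixlen
        node = self.root
        best = node.nexthop  # default route, if one was inserted
        for i in range(width):
            node = node.children[(bits >> (width - 1 - i)) & 1]
            if node is None:
                break
            if node.nexthop is not None:
                best = node.nexthop  # remember the longest match so far
        return best
```

Usage would look like `t = RouteTrie(); t.insert("10.0.0.0/8", "gw-a"); t.lookup("10.1.2.3")`. Whether such a walk is fast enough without a cache in front of it is exactly the question being argued in this thread; the sketch takes no position on that.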

The main goal of removing the routing cache is to avoid the 4096-route
limitation, plus all the problems that arise when the cache is not flushed
correctly: when the default route or the next hop for a particular route
changes in the middle of a TCP connection, the change is not taken into
account at all.  Also, when the routing cache is finally flushed, all the
PMTU information for the other entries is lost and must be rebuilt.  So a
lot of information is rebuilt for nothing when you have a lot of peers to
talk to.

It may look like overkill for a home user, but on a commercial server 4k
routes can be exhausted really fast.  For us, we are talking about millions
of routes in the system.  Also, the patch I provide already exploits the
plug-and-pray interface of the FIB, but that interface gives no flexibility
at all to remove the _unclean_ hack: the routing cache.

What is dirty in the current implementation?  A quick example: spaghetti
code with a goto going back to the beginning of the function in route.c in
IPv6.  All the MTU/PMTU information is put in the cache entries instead of
in the FIB itself.  Just to name a few examples.

/mathieu


* Re: IPv4 and IPv6 stack multi-FIB, scalable in the million of entries.
  2004-04-08 18:10   ` Mathieu Giguere
@ 2004-04-08 18:33     ` David S. Miller
  2004-04-08 18:34     ` alex
  1 sibling, 0 replies; 8+ messages in thread
From: David S. Miller @ 2004-04-08 18:33 UTC (permalink / raw)
  To: Mathieu Giguere; +Cc: ak, netdev, linux-kernel

On Thu, 8 Apr 2004 14:10:43 -0400
"Mathieu Giguere" <Mathieu.Giguere@ericsson.ca> wrote:

> The main goal of removing the routing cache is to avoid the 4096-route
> limitation

This 4K routes limitation is news to everyone who works on this
code.

When nexthop changes we _MUST_ flush PMTU etc. information because that
could have changed.  If however, such information is locked into the
route itself, it will propagate immediately into the routing cache
entry once recreated.

You seem to be talking about a lot of non-problems, but this may be because
you're not providing enough details.


* Re: IPv4 and IPv6 stack multi-FIB, scalable in the million of entries.
  2004-04-08 18:10   ` Mathieu Giguere
  2004-04-08 18:33     ` David S. Miller
@ 2004-04-08 18:34     ` alex
  1 sibling, 0 replies; 8+ messages in thread
From: alex @ 2004-04-08 18:34 UTC (permalink / raw)
  Cc: netdev

> Why do we need a routing cache?  Is a Patricia tree or radix tree not fast
> enough with the kind of memory speed we have in PC technology now?
I absolutely agree. To paraphrase Linus, "solution to slow route lookup is
fast route lookup, not another kernel abstraction".

> The main goal of removing the routing cache is to avoid the 4096-route
> limitation, plus all the problems that arise when the cache is not flushed
> correctly: when the default route or the next hop for a particular route
> changes in the middle of a TCP connection, the change is not taken into
> account at all.  Also, when the routing cache is finally flushed, all the
> PMTU information for the other entries is lost and must be rebuilt.  So a
> lot of information is rebuilt for nothing when you have a lot of peers to
> talk to.
Well, there is no limitation of 4096 routes in route-cache or RIB. If you 
change the hash size, you can have 32k routes in route-cache easily. And I 
have >100k routes in the RIB.

> It may look like overkill for a home user, but on a commercial server 4k
> routes can be exhausted really fast.  For us, we are talking about millions
> of routes in the system.  Also, the patch I provide already exploits the
> plug-and-pray interface of the FIB, but that interface gives no flexibility
> at all to remove the _unclean_ hack: the routing cache.
>
> What is dirty in the current implementation?  A quick example: spaghetti
> code with a goto going back to the beginning of the function in route.c in
> IPv6.  All the MTU/PMTU information is put in the cache entries instead of
> in the FIB itself.  Just to name a few examples.
Unfortunately, a bunch of things in Linux depend on the existence of the
per-flow route cache to store transient flow information. Just do 'ip -s -o
route list cache'. You get arp-unreachable info, mtu, etc.

It isn't as easy as just disabling the route cache and recomputing the route
from the RIB (even if RIB lookup was O(1)).

Removing the route cache (or making it optional) is extremely important to
those who are running internet-connected routers. It simply does not matter
that you can handle 300kpps in a single flow, when a 50kpps flood of 1
packet-per-flow will kill you. Unfortunately, there's no easy solution...
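
The contrast between one long flow and a one-packet-per-flow flood can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the cache is a bounded LRU keyed by an opaque flow identifier, a miss stands in for the full RIB lookup plus cache insertion, and none of the names correspond to kernel code.

```python
from collections import OrderedDict


def cache_hit_rate(flow_ids, capacity):
    """Fraction of packets served from a bounded LRU cache of flows."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for flow in flow_ids:
        total += 1
        if flow in cache:
            hits += 1
            cache.move_to_end(flow)        # mark as most recently used
        else:
            cache[flow] = True             # stands in for slow lookup + insert
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used flow
    return hits / total if total else 0.0
```

With a single flow nearly every packet hits the cache, while a stream in which every packet belongs to a new flow never hits it and pays the miss cost plus the insertion and eviction overhead on every packet.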


* RE: IPv4 and IPv6 stack multi-FIB, scalable in the million of entries.
@ 2004-04-08 19:53 Mathieu Giguere
  2004-04-09  1:05 ` jamal
  0 siblings, 1 reply; 8+ messages in thread
From: Mathieu Giguere @ 2004-04-08 19:53 UTC (permalink / raw)
  To: linux-kernel, David S. Miller; +Cc: ak, netdev

 Hi,

     Just run the enclosed script on your favorite 2.4 kernel.

 RTNETLINK answers: Cannot allocate memory
 RTNETLINK answers: Cannot allocate memory
 RTNETLINK answers: Cannot allocate memory
 [root@tom tmp]# ip -6 route | wc -l
    4094
 [root@tom tmp]#


     After 4094 IPv6 routes you will get the same error.

     As for the PMTU, the information can't be kept in the route itself.  The
 PMTU depends on the destination IP (DIP), not on the current network routing.
 So it must be kept in a separate hash structure with an expiry time.  _BUT_
 you must not flush all the entries each time a route is added or deleted, as
 the current implementation does.
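
 A rough sketch of the kind of structure described in the paragraph above: a table keyed by destination address, with a per-entry expiry time, that is left alone when routes change. Everything here is hypothetical (class name, method names, the 600-second default lifetime), and it models only this proposal; it does not address the objection raised earlier in the thread that PMTU information must be flushed when the next hop changes.

```python
import time


class PmtuCache:
    """Per-destination PMTU table with expiry, independent of the route table."""

    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self._ttl = ttl_seconds      # assumed lifetime, not a kernel value
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # destination -> (pmtu, expiry time)

    def update(self, dst, pmtu):
        """Record a PMTU learned for dst (e.g. from an ICMP 'too big' message)."""
        self._entries[dst] = (pmtu, self._clock() + self._ttl)

    def lookup(self, dst, default_mtu):
        """Return the cached PMTU for dst, or default_mtu if none or expired."""
        entry = self._entries.get(dst)
        if entry is None:
            return default_mtu
        pmtu, expires = entry
        if self._clock() >= expires:
            del self._entries[dst]   # expire lazily, one entry at a time
            return default_mtu
        return min(pmtu, default_mtu)

    def on_route_change(self):
        """Deliberately a no-op: adding or deleting a route flushes nothing."""
```

 The design choice being illustrated is that expiry is per entry and lazy, so learning or losing one route never forces every peer's PMTU to be rediscovered.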

 /mathieu


 -------------------------------
 #!/bin/csh
 echo '#!/bin/sh'
 set addr1=0
 set addr2=1
 while ($addr1 < 256)
   while ($addr2 < 256)
     echo ip -f inet6 route add 2000:${addr1}::${addr2}/128 dev eth0
     @ addr2++
   end
   set addr2=0
   @ addr1++
 end



* RE: IPv4 and IPv6 stack multi-FIB, scalable in the million of entries.
@ 2004-04-08 21:16 Krishna Kumar
  0 siblings, 0 replies; 8+ messages in thread
From: Krishna Kumar @ 2004-04-08 21:16 UTC (permalink / raw)
  To: Mathieu Giguere; +Cc: ak, David S. Miller, linux-kernel, netdev, netdev-bounce


Hi Mathieu,

It is a sysctl-changeable parameter; look at ip_rt_max_size, or look for
max_size in the output of sysctl -a.
Linux supports infinite routes :-)
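
For anyone who wants to check the value programmatically, a small sketch follows. It assumes the sysctl is exposed at the conventional location /proc/sys/net/<family>/route/max_size; the function name and the proc_root parameter are illustrative, and the function returns None rather than guessing when the file is missing or unreadable.

```python
from pathlib import Path


def read_route_max_size(family="ipv4", proc_root="/proc"):
    """Return the route max_size sysctl for 'ipv4' or 'ipv6', or None."""
    path = Path(proc_root) / "sys" / "net" / family / "route" / "max_size"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Not Linux, not permitted, or the kernel does not expose this knob.
        return None
```

Since the failing test earlier in the thread used IPv6 routes, calling `read_route_max_size("ipv6")` is probably the more relevant check there.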

- KK




* RE: IPv4 and IPv6 stack multi-FIB, scalable in the million of entries.
  2004-04-08 19:53 IPv4 and IPv6 stack multi-FIB, scalable in the million of entries Mathieu Giguere
@ 2004-04-09  1:05 ` jamal
  0 siblings, 0 replies; 8+ messages in thread
From: jamal @ 2004-04-09  1:05 UTC (permalink / raw)
  To: Mathieu Giguere; +Cc: linux-kernel, David S. Miller, ak, netdev

Mathieu,

What I would recommend to you is the following: make your algorithm
changes, test them, and come back with some performance numbers in
comparison with what Linux already does for the same kernel version. Then
you have ammunition to use for an argument.

cheers,
jamal


* Re: IPv4 and IPv6 stack multi-FIB, scalable in the million of entries.
@ 2004-04-09 13:16 Ronnie Sahlberg
  0 siblings, 0 replies; 8+ messages in thread
From: Ronnie Sahlberg @ 2004-04-09 13:16 UTC (permalink / raw)
  To: netdev

For fast routing lookups for IPv4, has anyone considered the following?

Treat all addresses as class C networks; don't route on anything other than
class C networks.
Limit the number of next-hop router indices to 256 (one byte), with index 0
meaning no route to that network.

Use a 16 MByte lookup table of bytes, where each byte represents the
next-hop router.

Let the route to A.B.C.x be represented by the next-hop router stored in
   table[A*65536+B*256+C]

Then a route lookup would be O(1), a simple table lookup.

Route insertions/deletions would take longer, but the lookup would be
fast: essentially a shift by 8 bits and one memory read from the table, at
the cost of 16 MByte of kernel memory.

Wasteful, yes, and only really useful if you have enormous routing tables; it
cannot do policy routing, but it is fast.
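
A user-space sketch of the table described above, to make the indexing concrete. The class and method names are made up for illustration. Expanding prefixes shorter than /24 across a range of slots is an addition not spelled out in the message, and it means routes have to be installed from shortest to longest prefix so that more specific entries are written last.

```python
import ipaddress

NO_ROUTE = 0  # next-hop index 0 means "no route to that network"


class Slash24Table:
    """One byte per /24 network: 2**24 entries, 16 MByte in total."""

    def __init__(self):
        self.table = bytearray(1 << 24)  # zero-filled, i.e. all NO_ROUTE

    def set_route(self, prefix, nexthop_index):
        if not 0 <= nexthop_index <= 255:
            raise ValueError("next-hop index must fit in one byte")
        net = ipaddress.ip_network(prefix)
        if net.version != 4 or net.prefixlen > 24:
            raise ValueError("only IPv4 prefixes of /24 or shorter fit")
        first = int(net.network_address) >> 8    # A*65536 + B*256 + C
        count = 1 << (24 - net.prefixlen)        # number of /24s covered
        # Overwrites every covered slot, including more specific routes.
        self.table[first:first + count] = bytes([nexthop_index]) * count

    def lookup(self, address):
        # One shift and one memory read, independent of the number of routes.
        return self.table[int(ipaddress.IPv4Address(address)) >> 8]
```

Lookup is a single indexed read as the message says; the cost shows up in `set_route`, where a short prefix touches up to 2**24 bytes.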
