* [PATCH 00/16] Remove the ipv4 routing cache
@ 2012-07-20 21:25 David Miller
2012-07-20 22:05 ` Eric Dumazet
` (3 more replies)
0 siblings, 4 replies; 51+ messages in thread
From: David Miller @ 2012-07-20 21:25 UTC (permalink / raw)
To: netdev
[ Ok I'm going to be a little bit cranky, and I think I deserve it.
I'm basically not going to go through the multi-hour rebase and
retest process again, as it's hit the point of diminishing returns
as NOBODY is giving me test results but I can guarantee that
EVERYONE will bitch and complain when I push this into net-next and
it breaks their favorite feature. If you can't be bothered to test
these changes, I'm honestly going to tell people to take a hike and
fix it themselves. I simply don't care if you don't care enough to
test changes of this magnitude to make sure your favorite setup
still works.
To say that I'm disappointed with the amount of testing feedback
after posting more than a dozen iterations of this delicate patch
set would be an understatement. I can think of only one person who
actually tested one iteration of these patches and gave feedback.
And meanwhile I've personally reviewed, tested, and signed off on
everyone else's work WITHOUT DELAY during this entire process.
I've pulled 25 hour long hacking shifts to make that a reality, so
that my routing cache removal work absolutely would not impact or
delay the patch submissions of any other networking developer. And
I can't even get a handful of testers with some feedback? You
really have to be kidding me.. ]
The ipv4 routing cache is non-deterministic, performance wise, and is
subject to reasonably easy to launch denial of service attacks.
The routing cache works great for well behaved traffic, and the world
was a much friendlier place when the tradeoffs that led to the routing
cache's design were considered.
What it boils down to is that the performance of the routing cache is
a product of the traffic patterns seen by a system rather than being a
product of the contents of the routing tables. The former is
controllable by external entities.
Even for "well behaved" legitimate traffic, high volume sites can see
hit rates in the routing cache of only ~10%.
The general flow of this patch series is that first the routing cache
is removed. We build a completely new rtable entry every lookup
request.
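Roughly, the per-lookup path now looks like this (an illustrative sketch
only, not code from the series; the addresses and names are made up):

	/* Output-route lookup after the series: ip_route_output_key()
	 * builds a fresh rtable instead of probing a global cache.
	 */
	static int example_output_lookup(struct net *net)
	{
		struct flowi4 fl4 = {
			.daddr = htonl(0x0a02020b),	/* 10.2.2.11, arbitrary */
		};
		struct rtable *rt;

		rt = ip_route_output_key(net, &fl4);
		if (IS_ERR(rt))
			return PTR_ERR(rt);
		/* ... transmit using rt ... */
		ip_rt_put(rt);	/* drop the reference the lookup took */
		return 0;
	}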
Next we make some simplifications due to the fact that removing the
routing cache causes several members of struct rtable to become no
longer necessary.
Then we need to make some preparatory changes so that we can legally
cache pre-constructed routes in the FIB nexthops. Firstly, we need to
invalidate routes which are hit with nexthop exceptions. Secondly we
have to change the semantics of rt->rt_gateway such that zero means
that the destination is on-link and non-zero otherwise.
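As an illustration of that convention (a sketch, not a hunk from the
series), choosing the neighbour key on output reduces to a single test:

	/* New rt_gateway convention: zero means the destination is
	 * on-link, so we resolve the packet's own daddr; non-zero
	 * means resolve the gateway instead.
	 */
	static __be32 example_nexthop(const struct rtable *rt, __be32 daddr)
	{
		return rt->rt_gateway ? rt->rt_gateway : daddr;
	}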
Now that the preparations are ready, we start caching precomputed
routes in the FIB nexthops. Output and input routes need different
kinds of care when determining if we can legally do such caching or
not. The details are in the commit log messages for those changes.
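Conceptually the caching step looks like the sketch below (illustrative
only; the real logic is the rt_cache_route() code quoted later in this
thread, and the cached pointers are the nh_rth_input/nh_rth_output
members of struct fib_nh):

	/* Publish a precomputed route into a nexthop's cache slot. */
	static void example_cache_route(struct rtable **slot, struct rtable *rt)
	{
		struct rtable *orig, *prev;

		orig = *slot;
		prev = cmpxchg(slot, orig, rt);
		if (prev == orig) {
			if (orig)
				rt_free(orig);	/* old route freed after an RCU grace period */
		} else {
			/* Lost the race: mark rt so normal refcounting
			 * and freeing apply (DST_NOCACHE in the real code).
			 */
			rt->dst.flags |= DST_NOCACHE;
		}
	}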
The patch series then winds down with some more struct rtable
simplifications and other tidy ups that remove unnecessary overhead.
On a SPARC-T3 output route lookups are ~876 cycles. Input route
lookups are ~1169 cycles with rpfilter disabled, and about ~1468
cycles with rpfilter enabled.
These measurements were taken with the kbench_mod test module in the
net_test_tools GIT tree:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git
That GIT tree also includes a udpflood tester tool, which stresses
route lookups on packet output.
For example, on the same SPARC-T3 system we can run:
time ./udpflood -l 10000000 10.2.2.11
with routing cache:
real 1m21.955s user 0m6.530s sys 1m15.390s
without routing cache:
real 1m31.678s user 0m6.520s sys 1m25.140s
Performance undoubtedly can easily be improved further.
For example fib_table_lookup() performs a lot of excessive
computations with all the masking and shifting, some of it
conditionalized to deal with edge cases.
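The flavor of that arithmetic is roughly (an illustrative sketch of the
bit extraction the trie lookup keeps doing, not the exact kernel code):

	/* Extract 'bits' bits of a 32-bit key starting at bit 'offset',
	 * with conditionals guarding the edge cases.
	 */
	static inline u32 example_extract_bits(u32 key, unsigned int offset,
					       unsigned int bits)
	{
		if (!bits || offset >= 32)
			return 0;
		return (key << offset) >> (32 - bits);
	}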
Also, Eric's no-ref optimization for input route lookups can be
re-instated for the FIB nexthop caching code path. I would be really
pleased if someone would work on that.
In fact anyone suitably motivated can just fire up perf on the loading
of the net_test_tools benchmark kernel module. I spend much of
my time going:
bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
bash# perf report
Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben
Hutchings, and others.
Signed-off-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-20 21:25 [PATCH 00/16] Remove the ipv4 routing cache David Miller
@ 2012-07-20 22:05 ` Eric Dumazet
2012-07-20 22:42 ` Eric Dumazet
2012-07-22 7:47 ` Vijay Subramanian
` (2 subsequent siblings)
3 siblings, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-20 22:05 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Fri, 2012-07-20 at 14:25 -0700, David Miller wrote:
> [ Ok I'm going to be a little bit cranky, and I think I deserve it.
>
> I'm basically not going to go through the multi-hour rebase and
> retest process again, as it's hit the point of diminishing returns
> as NOBODY is giving me test results but I can guarantee that
> EVERYONE will bitch and complain when I push this into net-next and
> it breaks their favorite feature. If you can't be bothered to test
> these changes, I'm honestly going to tell people to take a hike and
> fix it themselves. I simply don't care if you don't care enough to
> test changes of this magnitude to make sure your favorite setup
> still works.
>
> To say that I'm disappointed with the amount of testing feedback
> after posting more than a dozen iterations of this delicate patch
> set would be an understatement. I can think of only one person who
> actually tested one iteration of these patches and gave feedback.
>
> And meanwhile I've personally reviewed, tested, and signed off on
> everyone else's work WITHOUT DELAY during this entire process.
>
> I've pulled 25 hour long hacking shifts to make that a reality, so
> that my routing cache removal work absolutely would not impact or
> delay the patch submissions of any other networking developer. And
> I can't even get a handful of testers with some feedback? You
> really have to be kidding me.. ]
Hmm, ok, please give me a few hours to make some tests ;)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-20 22:05 ` Eric Dumazet
@ 2012-07-20 22:42 ` Eric Dumazet
2012-07-20 22:50 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-20 22:42 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Sat, 2012-07-21 at 00:06 +0200, Eric Dumazet wrote:
> Hmm, ok, please give me a few hours to make some tests ;)
>
It seems we have a big regression somewhere with net-next,
but it is already there...
(Apparently we choke on neighbour entries count.
entries = atomic_inc_return(&tbl->entries) - 1;
We need a percpu_counter ? Or something is wrong ?
We also choke on write_lock_bh(&tbl->lock); (__write_lock_failed())
in __neigh_create()
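A percpu_counter-based counter would look roughly like this (a sketch
only; all names here are illustrative):

	static struct percpu_counter example_entries;
	/* ... percpu_counter_init(&example_entries, 0) at table init ... */

	static void example_neigh_created(void)
	{
		percpu_counter_inc(&example_entries);	/* no shared cache line bounced */
	}

	static bool example_over_threshold(long thresh)
	{
		/* cheap approximate read; good enough for a GC threshold */
		return percpu_counter_read_positive(&example_entries) > thresh;
	}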
current 'linux' tree :
tbench 24 -t 60
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 8433514 0.023 1.566
Close 6195255 0.023 1.450
Rename 357080 0.022 1.457
Unlink 1702925 0.023 1.409
Deltree 240 0.000 0.001
Mkdir 120 0.024 0.032
Qpathinfo 7643560 0.023 1.565
Qfileinfo 1340393 0.023 1.566
Qfsinfo 1401593 0.023 1.425
Sfileinfo 686932 0.023 0.237
Find 2955412 0.023 1.566
WriteX 4209695 0.043 1.468
ReadX 13218668 0.029 1.614
LockX 27458 0.024 0.059
UnlockX 27458 0.024 0.056
Flush 591126 0.023 0.317
Throughput 4418.83 MB/sec 24 clients 24 procs max_latency=2.433 ms
net-next tree with your 16 patches :
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 6545220 0.031 14.433
Close 4808070 0.031 14.105
Rename 277171 0.030 0.737
Unlink 1321711 0.031 2.370
Deltree 172 0.000 0.001
Mkdir 86 0.033 0.134
Qpathinfo 5932577 0.031 11.607
Qfileinfo 1039922 0.031 6.075
Qfsinfo 1087803 0.031 12.178
Sfileinfo 533226 0.031 0.993
Find 2293696 0.031 11.059
WriteX 3264634 0.054 19.164
ReadX 10260208 0.038 11.857
LockX 21319 0.032 0.168
UnlockX 21319 0.032 0.162
Flush 458724 0.032 1.774
Throughput 3425.42 MB/sec 24 clients 24 procs max_latency=19.174 ms
perf output for linux tree :
Samples: 6M of event 'cycles', Event count (approx.): 4966119889380
4,18% tbench tbench [.] 0x0000000000001f49
4,09% tbench libc-2.15.so [.] 0x000000000003cb08
3,10% tbench_srv [kernel.kallsyms] [k] copy_user_generic_string
2,05% tbench_srv [kernel.kallsyms] [k] ipt_do_table
2,04% tbench [kernel.kallsyms] [k] ipt_do_table
1,48% tbench [kernel.kallsyms] [k] copy_user_generic_string
1,43% tbench_srv [kernel.kallsyms] [k] tcp_ack
1,08% tbench_srv [kernel.kallsyms] [k] tcp_recvmsg
1,06% tbench [kernel.kallsyms] [k] nf_iterate
1,00% tbench_srv [kernel.kallsyms] [k] nf_iterate
0,94% tbench_srv [nf_conntrack] [k] tcp_packet
0,94% tbench [nf_conntrack] [k] tcp_packet
0,90% tbench_srv [kernel.kallsyms] [k] __schedule
0,87% tbench [kernel.kallsyms] [k] __schedule
0,87% tbench_srv [kernel.kallsyms] [k] _raw_spin_lock_bh
0,85% tbench_srv [kernel.kallsyms] [k] tcp_sendmsg
0,80% tbench [kernel.kallsyms] [k] __switch_to
0,79% tbench [kernel.kallsyms] [k] _raw_spin_lock_bh
0,77% tbench libc-2.15.so [.] vfprintf
0,76% tbench [kernel.kallsyms] [k] tcp_sendmsg
0,74% tbench [kernel.kallsyms] [k] select_task_rq_fair
0,72% tbench_srv tbench_srv [.] 0x0000000000001840
0,70% tbench_srv libc-2.15.so [.] recv
0,65% tbench_srv [kernel.kallsyms] [k] tcp_rcv_established
0,65% tbench [kernel.kallsyms] [k] tcp_transmit_skb
0,64% tbench [vdso] [.] 0x00007fffd93459e8
0,63% tbench_srv [kernel.kallsyms] [k] tcp_transmit_skb
0,63% tbench [kernel.kallsyms] [k] tcp_recvmsg
0,55% tbench_srv [nf_conntrack] [k] nf_conntrack_in
perf for net-next tree :
Samples: 6M of event 'cycles', Event count (approx.): 4685309724658
3,42% tbench tbench [.] 0x00000000000035ab
3,32% tbench libc-2.15.so [.] 0x00000000000913f0
2,52% tbench_srv [kernel.kallsyms] [k] copy_user_generic_string
1,75% tbench [kernel.kallsyms] [k] ipt_do_table
1,71% tbench_srv [kernel.kallsyms] [k] ipt_do_table
1,31% tbench [kernel.kallsyms] [k] __neigh_create
1,25% tbench_srv [kernel.kallsyms] [k] __neigh_create
1,23% tbench [kernel.kallsyms] [k] nf_iterate
1,19% tbench [kernel.kallsyms] [k] copy_user_generic_string
1,19% tbench_srv [kernel.kallsyms] [k] nf_iterate
1,02% tbench_srv [kernel.kallsyms] [k] tcp_ack
0,96% tbench_srv [kernel.kallsyms] [k] tcp_recvmsg
0,88% tbench_srv [kernel.kallsyms] [k] __write_lock_failed
0,88% tbench [kernel.kallsyms] [k] __write_lock_failed
0,82% tbench [kernel.kallsyms] [k] __schedule
0,77% tbench_srv [kernel.kallsyms] [k] tcp_sendmsg
0,76% tbench_srv [kernel.kallsyms] [k] __schedule
0,76% tbench [nf_conntrack] [k] tcp_packet
0,74% tbench_srv [nf_conntrack] [k] tcp_packet
0,74% tbench [kernel.kallsyms] [k] __switch_to
0,71% tbench_srv [kernel.kallsyms] [k] _raw_spin_lock_bh
0,68% tbench [kernel.kallsyms] [k] tcp_sendmsg
0,66% tbench [kernel.kallsyms] [k] _raw_spin_lock_bh
0,63% tbench [kernel.kallsyms] [k] ip_finish_output
0,63% tbench [kernel.kallsyms] [k] tcp_recvmsg
0,61% tbench_srv [kernel.kallsyms] [k] ip_finish_output
0,61% tbench [vdso] [.] 0x00007fffb57ff8d1
0,60% tbench_srv libc-2.15.so [.] recv
0,59% tbench [kernel.kallsyms] [k] neigh_destroy
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-20 22:42 ` Eric Dumazet
@ 2012-07-20 22:50 ` David Miller
2012-07-20 22:54 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-20 22:50 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 21 Jul 2012 00:42:46 +0200
> (Apparently we choke on neighbour entries count.
>
> entries = atomic_inc_return(&tbl->entries) - 1;
>
> We need a percpu_counter ? Or something is wrong ?
What do you mean we choke on it? Does it exceed the thresholds
and we start garbage-collecting?
That would indicate a leak, or we are creating new neigh entries when
we shouldn't be, ie. we're not comparing the keys in the hash table
entries correctly during the lookup in net/ipv4/ip_output.c
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-20 22:50 ` David Miller
@ 2012-07-20 22:54 ` David Miller
2012-07-20 23:13 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-20 22:54 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
From: David Miller <davem@davemloft.net>
Date: Fri, 20 Jul 2012 15:50:21 -0700 (PDT)
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 21 Jul 2012 00:42:46 +0200
>
>> (Apparently we choke on neighbour entries count.
>>
>> entries = atomic_inc_return(&tbl->entries) - 1;
>>
>> We need a percpu_counter ? Or something is wrong ?
>
> What do you mean we choke on it? Does it exceed the thresholds
> and we start garbage-collecting?
>
> That would indicate a leak, or we are creating new neigh entries when
> we shouldn't be, ie. we're not comparing the keys in the hash table
> entries correctly during the lookup in net/ipv4/ip_output.c
I see the problem, we get the key wrong during neigh creation for
loopback.
I'll fix this, thanks.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-20 22:54 ` David Miller
@ 2012-07-20 23:13 ` David Miller
2012-07-21 5:40 ` Eric Dumazet
0 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-20 23:13 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
From: David Miller <davem@davemloft.net>
Date: Fri, 20 Jul 2012 15:54:12 -0700 (PDT)
> From: David Miller <davem@davemloft.net>
> Date: Fri, 20 Jul 2012 15:50:21 -0700 (PDT)
>
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Sat, 21 Jul 2012 00:42:46 +0200
>>
>>> (Apparently we choke on neighbour entries count.
>>>
>>> entries = atomic_inc_return(&tbl->entries) - 1;
>>>
>>> We need a percpu_counter ? Or something is wrong ?
>>
>> What do you mean we choke on it? Does it exceed the thresholds
>> and we start garbage-collecting?
>>
>> That would indicate a leak, or we are creating new neigh entries when
>> we shouldn't be, ie. we're not comparing the keys in the hash table
>> entries correctly during the lookup in net/ipv4/ip_output.c
>
> I see the problem, we get the key wrong during neigh creation for
> loopback.
>
> I'll fix this, thanks.
This should do it:
====================
[PATCH] ipv4: Fix neigh lookup keying over loopback/point-to-point devices.
We were using a special key "0" for all loopback and point-to-point
device neigh lookups under ipv4, but we wouldn't use that special
key for the neigh creation.
So basically we'd make a new neigh at each and every lookup :-)
This special case to use only one neigh for these device types
is of dubious value, so just remove it entirely.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/net/arp.h | 3 ---
1 file changed, 3 deletions(-)
diff --git a/include/net/arp.h b/include/net/arp.h
index 4617d98..7f7df93 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -21,9 +21,6 @@ static inline struct neighbour *__ipv4_neigh_lookup_noref(struct net_device *dev
struct neighbour *n;
u32 hash_val;
- if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
- key = 0;
-
hash_val = arp_hashfn(key, dev, nht->hash_rnd[0]) >> (32 - nht->hash_shift);
for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
n != NULL;
--
1.7.10.4
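For context, the output-path pattern that makes the keying matter looks
roughly like the sketch below (the assumed shape of the code this series
touches, not a quote from it): the noref lookup and the create must use
the same key, otherwise every lookup misses and a fresh neighbour is
created.

	static int example_output_neigh(struct rtable *rt, struct sk_buff *skb,
					struct net_device *dev)
	{
		struct neighbour *neigh;
		__be32 nexthop;
		int ret = -EINVAL;

		rcu_read_lock_bh();
		/* identical keying rule for lookup and create */
		nexthop = rt->rt_gateway ? rt->rt_gateway : ip_hdr(skb)->daddr;
		neigh = __ipv4_neigh_lookup_noref(dev, (__force u32) nexthop);
		if (!neigh)
			neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
		if (!IS_ERR(neigh))
			ret = neigh->output(neigh, skb);
		rcu_read_unlock_bh();
		return ret;
	}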
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-20 23:13 ` David Miller
@ 2012-07-21 5:40 ` Eric Dumazet
0 siblings, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2012-07-21 5:40 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Fri, 2012-07-20 at 16:13 -0700, David Miller wrote:
>
> ====================
> [PATCH] ipv4: Fix neigh lookup keying over loopback/point-to-point devices.
>
> We were using a special key "0" for all loopback and point-to-point
> device neigh lookups under ipv4, but we wouldn't use that special
> key for the neigh creation.
>
> So basically we'd make a new neigh at each and every lookup :-)
>
> This special case to use only one neigh for these device types
> is of dubious value, so just remove it entirely.
>
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> include/net/arp.h | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/include/net/arp.h b/include/net/arp.h
> index 4617d98..7f7df93 100644
> --- a/include/net/arp.h
> +++ b/include/net/arp.h
> @@ -21,9 +21,6 @@ static inline struct neighbour *__ipv4_neigh_lookup_noref(struct net_device *dev
> struct neighbour *n;
> u32 hash_val;
>
> - if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
> - key = 0;
> -
> hash_val = arp_hashfn(key, dev, nht->hash_rnd[0]) >> (32 - nht->hash_shift);
> for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
> n != NULL;
Excellent
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 8024082 0.024 2.443
Close 5894419 0.025 1.426
Rename 339732 0.024 0.097
Unlink 1620260 0.024 1.438
Deltree 228 0.000 0.001
Mkdir 114 0.025 0.032
Qpathinfo 7272340 0.024 1.436
Qfileinfo 1275230 0.024 1.412
Qfsinfo 1333479 0.025 0.887
Sfileinfo 653595 0.025 0.160
Find 2811797 0.024 1.380
WriteX 4005005 0.045 2.272
ReadX 12576354 0.031 6.542
LockX 26120 0.026 0.059
UnlockX 26120 0.025 0.052
Flush 562378 0.025 1.372
Throughput 4202.27 MB/sec 24 clients 24 procs max_latency=2.343 ms
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-20 21:25 [PATCH 00/16] Remove the ipv4 routing cache David Miller
2012-07-20 22:05 ` Eric Dumazet
@ 2012-07-22 7:47 ` Vijay Subramanian
2012-07-22 19:42 ` David Miller
2012-07-23 0:39 ` David Miller
2012-07-25 23:02 ` Alexander Duyck
3 siblings, 1 reply; 51+ messages in thread
From: Vijay Subramanian @ 2012-07-22 7:47 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On 20 July 2012 14:25, David Miller <davem@davemloft.net> wrote:
> On a SPARC-T3 output route lookups are ~876 cycles. Input route
> lookups are ~1169 cycles with rpfilter disabled, and about ~1468
> cycles with rpfilter enabled.
>
> These measurements were taken with the kbench_mod test module in the
> net_test_tools GIT tree:
...
> In fact anyone suitable motivated can just fire up perf on the loading
> of the test net_test_tools benchmark kernel module. I spend much of
> my time going:
>
> bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
> bash# perf report
>
Dave,
I have been running your routing removal patches for the past 3 days
(upgraded yesterday to latest set including the 17th patch you sent in
response to Eric's comment) and have not seen any issues (crashes
etc).
I used the kbench_mod module in net_test_tools for testing. Averaging
after 16 runs (with 4 samples in each run), I get the following for
output route lookups
(ip_route_output_key):
with route-removal patches: average of 544 cycles, with min and max
of 511 and 721
without patches (commit fa0afcd10): average of 211 cycles, with min
and max of 196 and 266
Apart from time spent in fib_table_lookup(), it seems time is also
spent in check_leaf(). I assume this is expected behavior.
Here are 2 sample perf outputs (I have appended the kbench outputs to
each file)
With patches applied:
# ========
# captured on: Sat Jul 21 23:49:43 2012
# hostname : vijaynsu
# os release : 3.5.0-rc7vns+
# perf version : 3.4.13690.gfda9f
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Xeon(R) CPU E5320 @ 1.86GHz
# cpuid : GenuineIntel,6,15,11
# total memory : 16407032 kB
# cmdline : /usr/src/net-next/tools/perf/perf record insmod
./kbench_mod.ko dst=172.27.231.128 src=172.27.231.28
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0,
config2 = 0x0, excl_usr = 0, excl_kern = 0, id = { 1, 2, 3, 4, 5, 6,
7, 8 }
# HEADER_CPU_TOPOLOGY info available, use -I to display
# ========
#
# Samples: 458 of event 'cycles'
# Event count (approx.): 209175177
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. .............................
#
40.62% insmod [kernel.kallsyms] [k] fib_table_lookup
24.38% insmod [kernel.kallsyms] [k] __ip_route_output_key
12.33% insmod [kernel.kallsyms] [k] check_leaf
9.29% insmod [kernel.kallsyms] [k] dst_release
6.43% insmod [kernel.kallsyms] [k] __ip_dev_find
2.67% insmod [edac_core] [k] 0x000000000000c1d7
1.33% insmod [kernel.kallsyms] [k] ip_route_output_flow
0.70% insmod [kernel.kallsyms] [k] copy_user_generic_string
0.46% insmod [kernel.kallsyms] [k] path_openat
0.41% insmod [kernel.kallsyms] [k] clear_page_c
0.38% insmod libc-2.11.1.so [.] _dl_addr
0.23% insmod [kernel.kallsyms] [k] native_write_msr_safe
0.22% insmod [kernel.kallsyms] [k] rcu_irq_enter
0.22% insmod [kernel.kallsyms] [k] mem_cgroup_count_vm_event
0.19% insmod [kernel.kallsyms] [k] do_mpage_readpage
0.12% insmod [kernel.kallsyms] [k] math_state_restore
0.02% insmod [kernel.kallsyms] [k] stop_one_cpu
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
Jul 21 23:49:43 vijaynsu kernel: [ 146.382382] kbench:
Jul 21 23:49:43 vijaynsu kernel: [ 146.382382] flow
[IIF(0),OIF(0),MARK(0x00000000),D(172.27.231.128),S(172.27.231.28),TOS(0x00)]
Jul 21 23:49:43 vijaynsu kernel: [ 146.382388] kbench: sizeof(struct
rtable)==176
Jul 21 23:49:43 vijaynsu kernel: [ 146.409943] kbench:
ip_route_output_key tdiff: 546
Jul 21 23:49:43 vijaynsu kernel: [ 146.437490] kbench:
ip_route_output_key tdiff: 546
Jul 21 23:49:43 vijaynsu kernel: [ 146.465064] kbench:
ip_route_output_key tdiff: 511
Jul 21 23:49:43 vijaynsu kernel: [ 146.492615] kbench:
ip_route_output_key tdiff: 672
--------------
Without patches applied : (omitting hardware info)
# Overhead Command Shared Object Symbol
# ........ ....... ................. ............................
#
46.47% insmod [kernel.kallsyms] [k] __ip_route_output_key
17.71% insmod [kernel.kallsyms] [k] dst_release
15.51% insmod [kernel.kallsyms] [k] local_bh_enable
5.92% insmod [kvm] [k] 0x00000000000771ea
4.92% insmod [kernel.kallsyms] [k] local_bh_disable
2.37% insmod [kernel.kallsyms] [k] ip_route_output_flow
1.80% insmod [kernel.kallsyms] [k] clear_page_c
1.32% insmod [kernel.kallsyms] [k] find_get_page
1.31% insmod [kernel.kallsyms] [k] page_remove_rmap
1.15% insmod [kernel.kallsyms] [k] copy_user_generic_string
0.59% insmod [kernel.kallsyms] [k] trace_module_notify
0.59% insmod [kernel.kallsyms] [k] free_pcppages_bulk
0.25% insmod [kernel.kallsyms] [k] __do_fault
0.05% insmod [kernel.kallsyms] [k] wait_for_common
0.02% insmod [kernel.kallsyms] [k] finish_task_switch
0.00% insmod [kernel.kallsyms] [k] native_write_msr_safe
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
Jul 21 23:33:09 vijaynsu kernel: [13465.736545] kbench:
Jul 21 23:33:09 vijaynsu kernel: [13465.736545] flow
[IIF(0),OIF(0),MARK(0x00000000),D(172.27.231.128),S(172.27.231.28),TOS(0x00)]
Jul 21 23:33:09 vijaynsu kernel: [13465.736551] kbench: sizeof(struct
rtable)==216
Jul 21 23:33:09 vijaynsu kernel: [13465.746394] kbench:
ip_route_output_key tdiff: 231
Jul 21 23:33:09 vijaynsu kernel: [13465.756229] kbench:
ip_route_output_key tdiff: 210
Jul 21 23:33:09 vijaynsu kernel: [13465.766062] kbench:
ip_route_output_key tdiff: 196
Jul 21 23:33:09 vijaynsu kernel: [13465.775894] kbench:
ip_route_output_key tdiff: 196
Thanks,
Vijay
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-22 7:47 ` Vijay Subramanian
@ 2012-07-22 19:42 ` David Miller
0 siblings, 0 replies; 51+ messages in thread
From: David Miller @ 2012-07-22 19:42 UTC (permalink / raw)
To: subramanian.vijay; +Cc: netdev
From: Vijay Subramanian <subramanian.vijay@gmail.com>
Date: Sun, 22 Jul 2012 00:47:27 -0700
> I have been running your routing removal patches for the past 3 days
> (upgraded yesterday to latest set including the 17th patch you sent in
> response to Eric's comment) and have not seen any issues (crashes
> etc).
Thanks for testing.
> Apart from time spent in fib_table_lookup(), it seems time is also
> spent in check_leaf(). I assume this is expected behavior.
> Here are 2 sample perf outputs (I have appended the kbench outputs to
> each file)
Yes, the two biggest hogs will be fib_table_lookup() and check_leaf().
check_leaf() is expensive largely because that's where we write the
fib_result block, which is a structure on the fib_lookup() caller's
stack.
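For reference, the calling convention looks roughly like this (a sketch,
not a quote from the tree):

	/* Every fib_lookup() caller provides the result block on its own
	 * stack; check_leaf() fills it in on a successful match.
	 */
	static int example_lookup(struct net *net, __be32 daddr, u8 tos)
	{
		struct fib_result res;
		struct flowi4 fl4 = {
			.daddr	    = daddr,
			.flowi4_tos = tos,
		};
		int err;

		err = fib_lookup(net, &fl4, &res);
		if (err)
			return err;
		/* res.type, res.prefixlen, FIB_RES_NH(res), ... are now valid */
		return 0;
	}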
Your perf traces roughly approximate mine.
Thanks.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-20 21:25 [PATCH 00/16] Remove the ipv4 routing cache David Miller
2012-07-20 22:05 ` Eric Dumazet
2012-07-22 7:47 ` Vijay Subramanian
@ 2012-07-23 0:39 ` David Miller
2012-07-23 7:15 ` Eric Dumazet
2012-07-25 23:02 ` Alexander Duyck
3 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-23 0:39 UTC (permalink / raw)
To: netdev
Just FYI, I'm pushing this work out to net-next now.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-23 0:39 ` David Miller
@ 2012-07-23 7:15 ` Eric Dumazet
2012-07-23 17:54 ` Paweł Staszewski
0 siblings, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-23 7:15 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Sun, 2012-07-22 at 17:39 -0700, David Miller wrote:
> Just FYI, I'm pushing this work out to net-next now.
> --
Excellent !
Thanks a lot David
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-23 7:15 ` Eric Dumazet
@ 2012-07-23 17:54 ` Paweł Staszewski
2012-07-23 20:10 ` David Miller
2012-07-26 17:02 ` Eric Dumazet
0 siblings, 2 replies; 51+ messages in thread
From: Paweł Staszewski @ 2012-07-23 17:54 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
W dniu 2012-07-23 09:15, Eric Dumazet pisze:
> On Sun, 2012-07-22 at 17:39 -0700, David Miller wrote:
>> Just FYI, I'm pushing this work out to net-next now.
>> --
> Excellent !
>
> Thanks a lot David
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Yes
Excellent work.
I made some real-life tests with 10Gbit traffic and ran some rDoS (route
DoS) tests.
With kernel 3.4.5, when I reach 6M routing cache entries, traffic is no
longer forwarded - on my hardware with 64G RAM.
With kernel 3.5.0-rc7-next-20120720 + David M.'s patches (route cache
removal + fix), traffic is forwarded all the time at 8.2Mpps RX from the
traffic generator and 8.2Mpps TX to the sink.
Also, I see a performance improvement.
With kernel 3.4.5 I can reach a maximum of 7.5Mpps with my hardware,
and with kernel 3.5.0-rc7-next-20120720 I can reach 8.2Mpps UDP
forwarding performance - it could be more, but my TX pktgen using 6
cores can't do more.
Really thanks for this :)
BR
Paweł Staszewski
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-23 17:54 ` Paweł Staszewski
@ 2012-07-23 20:10 ` David Miller
2012-07-26 17:02 ` Eric Dumazet
1 sibling, 0 replies; 51+ messages in thread
From: David Miller @ 2012-07-23 20:10 UTC (permalink / raw)
To: pstaszewski; +Cc: eric.dumazet, netdev
From: Paweł Staszewski <pstaszewski@itcare.pl>
Date: Mon, 23 Jul 2012 19:54:54 +0200
> Really thanks for this :)
Thanks for testing.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-20 21:25 [PATCH 00/16] Remove the ipv4 routing cache David Miller
` (2 preceding siblings ...)
2012-07-23 0:39 ` David Miller
@ 2012-07-25 23:02 ` Alexander Duyck
2012-07-25 23:17 ` David Miller
3 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2012-07-25 23:02 UTC (permalink / raw)
To: David Miller, Eric Dumazet; +Cc: netdev
On Fri, Jul 20, 2012 at 2:25 PM, David Miller <davem@davemloft.net> wrote:
>
> [ Ok I'm going to be a little bit cranky, and I think I deserve it.
>
> I'm basically not going to go through the multi-hour rebase and
> retest process again, as it's hit the point of diminishing returns
> as NOBODY is giving me test results but I can guarantee that
> EVERYONE will bitch and complain when I push this into net-next and
> it breaks their favorite feature. If you can't be bothered to test
> these changes, I'm honestly going to tell people to take a hike and
> fix it themselves. I simply don't care if you don't care enough to
> test changes of this magnitude to make sure your favorite setup
> still works.
>
> To say that I'm disappointed with the amount of testing feedback
> after posting more than a dozen iterations of this delicate patch
> set would be an understatement. I can think of only one person who
> actually tested one iteration of these patches and gave feedback.
>
> And meanwhile I've personally reviewed, tested, and signed off on
> everyone else's work WITHOUT DELAY during this entire process.
>
> I've pulled 25 hour long hacking shifts to make that a reality, so
> that my routing cache removal work absolutely would not impact or
> delay the patch submissions of any other networking developer. And
> I can't even get a handful of testers with some feedback? You
> really have to be kidding me.. ]
Sorry for not responding sooner but I have been on vacation for the
last few days.
I had been testing the patches over the last couple of weeks but I
didn't really feel like I could provide any input of value since I
don't have a strong understanding of the routing stack internals, and
because my test case tends to focus on small packet routing with a
very artificial workflow. I saw an overall drop in performance. I
had attributed it to the fact that with so few flows I was exploiting
the routing cache to its maximum potential, and had not explored it
much further.
My test consists of a SmartBits w/ a 10Gb/s port connected back to
back with one port on an 82599 adapter. I have the SmartBits
generating up to 16 64byte packet flows, each flow is filtered through
an ntuple filter to a specific queue, and each queue is pinned to a
specific CPU. The flows all have a unique source address but the same
destination address. The port doing the routing has two subnets. We
receive packets on the first subnet and then transmit them back out on
the second. I have set-up a static ARP entry for the destination
address in order to avoid the need for ARP address translation since
we are sending such a heavy packet load. My kernel config is stripped
down and does not include netfilter support.
Since your patches are in I have started to re-run my tests. I am
seeing a significant drop in throughput with 8 flows which I expected,
however it looks like one of the biggest issues I am seeing is that
the dst_hold and dst_release calls seem to be causing some serious
cache thrash. I was at 12.5Mpps w/ 8 flows before the patches, after
your patches it drops to 8.3Mpps. If I increase the number of queues
I am using to 16 the throughput drops off to something like 3.3Mpps.
Prior to your patches being applied the top 3 CPU consumers were
ixgbe_poll at around 10%, ixgbe_xmit_frame_ring at around 5%, and
__netif_receive_skb at around 5%. Below is the latest perf results
for 8 flows/queues after your patches:
14.52% [k] __write_lock_failed
10.68% [k] ip_route_input (75% of hits on dst_hold call)
10.18% [k] fib_table_lookup
6.04% [k] ixgbe_poll
5.80% [k] dst_release
4.14% [k] __netif_receive_skb
3.55% [k] _raw_spin_lock
2.84% [k] ip_forward
2.58% [k] ixgbe_xmit_frame_ring
I am also seeing routing fail periodically. I will be moving at rates
listed above and suddenly drop to single digits packets per second.
When this occurs the trace completely changes and __write_lock_failed
jumps to over 90% of the CPU cycles. It seems to occur more often if
I increase the number of CPUs in use while routing. Below is the call
graph I recorded for the function from perf to show the function calls
that are leading to the issue:
14.52% [k] __write_lock_failed
|
|--99.92%-- _raw_write_lock_bh
| __neigh_event_send
| neigh_resolve_output
| ip_finish_output
| ip_output
| ip_forward
| ip_rcv
| __netif_receive_skb
| netif_receive_skb
| napi_skb_finish
| napi_gro_receive
| ixgbe_poll
| net_rx_action
| __do_softirq
| run_ksoftirqd
| kthread
| kernel_thread_helper
--0.08%-- [...]
I am trying to figure out what can be done, but as I said I am not
that familiar with the internals of the IP routing stack itself. If
you need more data, let me know and I can see about performing whatever
tests, or altering my configuration, as needed.
Thanks,
Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-25 23:02 ` Alexander Duyck
@ 2012-07-25 23:17 ` David Miller
2012-07-25 23:39 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-25 23:17 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Wed, 25 Jul 2012 16:02:45 -0700
> Since your patches are in I have started to re-run my tests. I am
> seeing a significant drop in throughput with 8 flows which I expected,
> however it looks like one of the biggest issues I am seeing is that
> the dst_hold and dst_release calls seem to be causing some serious
> cache thrash. I was at 12.5Mpps w/ 8 flows before the patches, after
> your patches it drops to 8.3Mpps.
Yes, this is something we knew would start happening.
One idea is to make cached dsts be per-cpu in the nexthops.
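Roughly like this sketch (all names hypothetical, nothing from the tree):

	/* One cached output route per nexthop per CPU, so concurrent CPUs
	 * stop bouncing a single dst's refcount cache line.
	 */
	struct example_fib_nh {
		struct rtable __rcu * __percpu *nh_pcpu_rth_output;
		/* ... */
	};

	static int example_init(struct example_fib_nh *nh)
	{
		nh->nh_pcpu_rth_output = alloc_percpu(struct rtable __rcu *);
		return nh->nh_pcpu_rth_output ? 0 : -ENOMEM;
	}

	static struct rtable *example_get_cached(struct example_fib_nh *nh)
	{
		struct rtable __rcu **slot = this_cpu_ptr(nh->nh_pcpu_rth_output);

		return rcu_dereference(*slot);	/* caller holds rcu_read_lock() */
	}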
> I am also seeing routing fail periodically.
Every 30 seconds by chance? :-)
> I will be moving at rates listed above and suddenly drop to single
> digits packets per second. When this occurs the trace completely
> changes and __write_lock_failed jumps to over 90% of the CPU cycles.
It's probably happening when the nexthop ARP entry expires.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-25 23:17 ` David Miller
@ 2012-07-25 23:39 ` David Miller
2012-07-26 0:54 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-25 23:39 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: David Miller <davem@davemloft.net>
Date: Wed, 25 Jul 2012 16:17:32 -0700 (PDT)
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Wed, 25 Jul 2012 16:02:45 -0700
>
>> Since your patches are in I have started to re-run my tests. I am
>> seeing a significant drop in throughput with 8 flows which I expected,
>> however it looks like one of the biggest issues I am seeing is that
>> the dst_hold and dst_release calls seem to be causing some serious
>> cache thrash. I was at 12.5Mpps w/ 8 flows before the patches, after
>> your patches it drops to 8.3Mpps.
>
> Yes, this is something we knew would start happening.
>
> One idea is to make cached dsts be per-cpu in the nexthops.
Actually I think what really kills your case is the removal of the
noref path for route lookups. I'll work on a patch to restore that
in the case where we use cached routes from the FIB nexthops.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-25 23:39 ` David Miller
@ 2012-07-26 0:54 ` David Miller
2012-07-26 2:30 ` Alexander Duyck
2012-07-26 8:13 ` Eric Dumazet
0 siblings, 2 replies; 51+ messages in thread
From: David Miller @ 2012-07-26 0:54 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: David Miller <davem@davemloft.net>
Date: Wed, 25 Jul 2012 16:39:39 -0700 (PDT)
> From: David Miller <davem@davemloft.net>
> Date: Wed, 25 Jul 2012 16:17:32 -0700 (PDT)
>
>> From: Alexander Duyck <alexander.duyck@gmail.com>
>> Date: Wed, 25 Jul 2012 16:02:45 -0700
>>
>>> Since your patches are in I have started to re-run my tests. I am
>>> seeing a significant drop in throughput with 8 flows which I expected,
>>> however it looks like one of the biggest issues I am seeing is that
>>> the dst_hold and dst_release calls seem to be causing some serious
>>> cache thrash. I was at 12.5Mpps w/ 8 flows before the patches, after
>>> your patches it drops to 8.3Mpps.
>>
>> Yes, this is something we knew would start happening.
>>
>> One idea is to make cached dsts be per-cpu in the nexthops.
>
> Actually I think what really kills your case is the removal of the
> noref path for route lookups. I'll work on a patch to restore that
> in the case where we use cached routes from the FIB nexthops.
Alex, here is something I tossed together, does it help with the
dst_hold()/dst_release() overhead at all?
diff --git a/include/net/route.h b/include/net/route.h
index c29ef27..8c52bc6 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -30,6 +30,7 @@
#include <net/inet_sock.h>
#include <linux/in_route.h>
#include <linux/rtnetlink.h>
+#include <linux/rcupdate.h>
#include <linux/route.h>
#include <linux/ip.h>
#include <linux/cache.h>
@@ -157,8 +158,22 @@ static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4
return ip_route_output_key(net, fl4);
}
-extern int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
- u8 tos, struct net_device *devin);
+extern int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
+ u8 tos, struct net_device *devin);
+
+static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
+ u8 tos, struct net_device *devin)
+{
+ int err;
+
+ rcu_read_lock();
+ err = ip_route_input_noref(skb, dst, src, tos, devin);
+ if (!err)
+ skb_dst_force(skb);
+ rcu_read_unlock();
+
+ return err;
+}
extern void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu,
int oif, u32 mark, u8 protocol, int flow_flags);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index a0124eb..77e87af 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -827,7 +827,7 @@ static int arp_process(struct sk_buff *skb)
}
if (arp->ar_op == htons(ARPOP_REQUEST) &&
- ip_route_input(skb, tip, sip, 0, dev) == 0) {
+ ip_route_input_noref(skb, tip, sip, 0, dev) == 0) {
rt = skb_rtable(skb);
addr_type = rt->rt_type;
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7ad88e5..8d07c97 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -258,8 +258,8 @@ static void ip_expire(unsigned long arg)
/* skb dst is stale, drop it, and perform route lookup again */
skb_dst_drop(head);
iph = ip_hdr(head);
- err = ip_route_input(head, iph->daddr, iph->saddr,
- iph->tos, head->dev);
+ err = ip_route_input_noref(head, iph->daddr, iph->saddr,
+ iph->tos, head->dev);
if (err)
goto out_rcu_unlock;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 93134b0..bda8cac 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -339,8 +339,8 @@ static int ip_rcv_finish(struct sk_buff *skb)
* how the packet travels inside Linux networking.
*/
if (!skb_dst(skb)) {
- int err = ip_route_input(skb, iph->daddr, iph->saddr,
- iph->tos, skb->dev);
+ int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+ iph->tos, skb->dev);
if (unlikely(err)) {
if (err == -EXDEV)
NET_INC_STATS_BH(dev_net(skb->dev),
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 3f7bb71..68887e3 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1199,10 +1199,9 @@ restart:
fnhe->fnhe_stamp = jiffies;
}
-static inline void rt_release_rcu(struct rcu_head *head)
+static inline void rt_free(struct rtable *rt)
{
- struct dst_entry *dst = container_of(head, struct dst_entry, rcu_head);
- dst_release(dst);
+ call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
}
static void rt_cache_route(struct fib_nh *nh, struct rtable *rt)
@@ -1216,9 +1215,15 @@ static void rt_cache_route(struct fib_nh *nh, struct rtable *rt)
prev = cmpxchg(p, orig, rt);
if (prev == orig) {
- dst_clone(&rt->dst);
if (orig)
- call_rcu_bh(&orig->dst.rcu_head, rt_release_rcu);
+ rt_free(orig);
+ } else {
+ /* Routes we intend to cache in the FIB nexthop have
+ * the DST_NOCACHE bit set. However, if we are
+ * unsuccessful at storing this route into the cache
+ * we really need to clear that bit.
+ */
+ rt->dst.flags &= ~DST_NOCACHE;
}
}
@@ -1245,7 +1250,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
#ifdef CONFIG_IP_ROUTE_CLASSID
rt->dst.tclassid = nh->nh_tclassid;
#endif
- if (!(rt->dst.flags & DST_HOST))
+ if (!(rt->dst.flags & DST_NOCACHE))
rt_cache_route(nh, rt);
}
@@ -1261,7 +1266,7 @@ static struct rtable *rt_dst_alloc(struct net_device *dev,
bool nopolicy, bool noxfrm, bool will_cache)
{
return dst_alloc(&ipv4_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK,
- (will_cache ? 0 : DST_HOST) | DST_NOCACHE |
+ (will_cache ? 0 : (DST_HOST | DST_NOCACHE)) |
(nopolicy ? DST_NOPOLICY : 0) |
(noxfrm ? DST_NOXFRM : 0));
}
@@ -1588,8 +1593,9 @@ local_input:
if (!itag) {
rth = FIB_RES_NH(res).nh_rth_input;
if (rt_cache_valid(rth)) {
- dst_hold(&rth->dst);
- goto set_and_out;
+ skb_dst_set_noref(skb, &rth->dst);
+ err = 0;
+ goto out;
}
do_cache = true;
}
@@ -1620,7 +1626,6 @@ local_input:
}
if (do_cache)
rt_cache_route(&FIB_RES_NH(res), rth);
-set_and_out:
skb_dst_set(skb, &rth->dst);
err = 0;
goto out;
@@ -1658,8 +1663,8 @@ martian_source_keep_err:
goto out;
}
-int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
- u8 tos, struct net_device *dev)
+int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
+ u8 tos, struct net_device *dev)
{
int res;
@@ -1702,7 +1707,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
rcu_read_unlock();
return res;
}
-EXPORT_SYMBOL(ip_route_input);
+EXPORT_SYMBOL(ip_route_input_noref);
/* called with rcu_read_lock() */
static struct rtable *__mkroute_output(const struct fib_result *res,
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index 58d23a5..06814b6 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -27,8 +27,8 @@ static inline int xfrm4_rcv_encap_finish(struct sk_buff *skb)
if (skb_dst(skb) == NULL) {
const struct iphdr *iph = ip_hdr(skb);
- if (ip_route_input(skb, iph->daddr, iph->saddr,
- iph->tos, skb->dev))
+ if (ip_route_input_noref(skb, iph->daddr, iph->saddr,
+ iph->tos, skb->dev))
goto drop;
}
return dst_input(skb);
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 0:54 ` David Miller
@ 2012-07-26 2:30 ` Alexander Duyck
2012-07-26 5:32 ` David Miller
2012-07-26 8:13 ` Eric Dumazet
1 sibling, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2012-07-26 2:30 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Wed, Jul 25, 2012 at 5:54 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 25 Jul 2012 16:39:39 -0700 (PDT)
>
>> From: David Miller <davem@davemloft.net>
>> Date: Wed, 25 Jul 2012 16:17:32 -0700 (PDT)
>>
>>> From: Alexander Duyck <alexander.duyck@gmail.com>
>>> Date: Wed, 25 Jul 2012 16:02:45 -0700
>>>
>>>> Since your patches are in I have started to re-run my tests. I am
>>>> seeing a significant drop in throughput with 8 flows which I expected,
>>>> however it looks like one of the biggest issues I am seeing is that
>>>> the dst_hold and dst_release calls seem to be causing some serious
>>>> cache thrash. I was at 12.5Mpps w/ 8 flows before the patches, after
>>>> your patches it drops to 8.3Mpps.
>>>
>>> Yes, this is something we knew would start happening.
>>>
>>> One idea is to make cached dsts be per-cpu in the nexthops.
>>
>> Actually I think what really kills your case is the removal of the
>> noref path for route lookups. I'll work on a patch to restore that
>> in the case where we use cached routes from the FIB nexthops.
>
> Alex, here is something I tossed together, does it help with the
> dst_hold()/dst_release() overhead at all?
>
I downloaded your patch, applied it, rebuilt the kernel, and then
rebooted. However after the reboot I am unable to access the system
remotely. It looks like I will have to wait until I get into the
office tomorrow to figure out what happened to the system.
I will get you an update on if it helped or not once I figure out what
happened in the morning.
Thanks,
Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 2:30 ` Alexander Duyck
@ 2012-07-26 5:32 ` David Miller
0 siblings, 0 replies; 51+ messages in thread
From: David Miller @ 2012-07-26 5:32 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Wed, 25 Jul 2012 19:30:13 -0700
> I downloaded your patch, applied it, rebuilt the kernel, and then
> rebooted. However after the reboot I am unable to access the system
> remotely. It looks like I will have to wait until I get into the
> office tomorrow to figure out what happened to the system.
>
> I will get you an update on if it helped or not once I figure out what
> happened in the morning.
Ok, I may have flubbed the patch, I'll let you know if I find any bugs
in it.
Thanks.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 0:54 ` David Miller
2012-07-26 2:30 ` Alexander Duyck
@ 2012-07-26 8:13 ` Eric Dumazet
2012-07-26 8:18 ` David Miller
1 sibling, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-26 8:13 UTC (permalink / raw)
To: David Miller; +Cc: alexander.duyck, netdev
On Wed, 2012-07-25 at 17:54 -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 25 Jul 2012 16:39:39 -0700 (PDT)
>
> > From: David Miller <davem@davemloft.net>
> > Date: Wed, 25 Jul 2012 16:17:32 -0700 (PDT)
> >
> >> From: Alexander Duyck <alexander.duyck@gmail.com>
> >> Date: Wed, 25 Jul 2012 16:02:45 -0700
> >>
> >>> Since your patches are in I have started to re-run my tests. I am
> >>> seeing a significant drop in throughput with 8 flows which I expected,
> >>> however it looks like one of the biggest issues I am seeing is that
> >>> the dst_hold and dst_release calls seem to be causing some serious
> >>> cache thrash. I was at 12.5Mpps w/ 8 flows before the patches, after
> >>> your patches it drops to 8.3Mpps.
> >>
> >> Yes, this is something we knew would start happening.
> >>
> >> One idea is to make cached dsts be per-cpu in the nexthops.
> >
> > Actually I think what really kills your case is the removal of the
> > noref path for route lookups. I'll work on a patch to restore that
> > in the case where we use cached routes from the FIB nexthops.
>
> Alex, here is something I tossed together, does it help with the
> dst_hold()/dst_release() overhead at all?
>
seems good to me, only one question :
> static void rt_cache_route(struct fib_nh *nh, struct rtable *rt)
> @@ -1216,9 +1215,15 @@ static void rt_cache_route(struct fib_nh *nh, struct rtable *rt)
>
> prev = cmpxchg(p, orig, rt);
> if (prev == orig) {
> - dst_clone(&rt->dst);
> if (orig)
> - call_rcu_bh(&orig->dst.rcu_head, rt_release_rcu);
> + rt_free(orig);
> + } else {
> + /* Routes we intend to cache in the FIB nexthop have
> + * the DST_NOCACHE bit set. However, if we are
> + * unsuccessful at storing this route into the cache
> + * we really need to clear that bit.
> + */
> + rt->dst.flags &= ~DST_NOCACHE;
> }
> }
>
Not sure why you removed the dst_clone(&rt->dst) ?
If it is not needed, we might need to release a reference in the else {}
clause, no ?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 8:13 ` Eric Dumazet
@ 2012-07-26 8:18 ` David Miller
2012-07-26 8:27 ` Eric Dumazet
0 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-26 8:18 UTC (permalink / raw)
To: eric.dumazet; +Cc: alexander.duyck, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 26 Jul 2012 10:13:34 +0200
> On Wed, 2012-07-25 at 17:54 -0700, David Miller wrote:
>> @@ -1216,9 +1215,15 @@ static void rt_cache_route(struct fib_nh *nh, struct rtable *rt)
>>
>> prev = cmpxchg(p, orig, rt);
>> if (prev == orig) {
>> - dst_clone(&rt->dst);
>> if (orig)
>> - call_rcu_bh(&orig->dst.rcu_head, rt_release_rcu);
>> + rt_free(orig);
>> + } else {
>> + /* Routes we intend to cache in the FIB nexthop have
>> + * the DST_NOCACHE bit set. However, if we are
>> + * unsuccessful at storing this route into the cache
>> + * we really need to clear that bit.
>> + */
>> + rt->dst.flags &= ~DST_NOCACHE;
>> }
>> }
>>
>
> Not sure why you removed the dst_clone(&rt->dst) ?
Because "cached" dst objects have no reference count. When such a
cached dst is "released", it is dst_free()'d instead of
dst_release()'d.
> If it is not needed, we might need to release a reference in the else {}
> clause, no ?
Nope. 'rt' has a reference count of one, which is for the caller, not
for this cached location.
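To make the convention concrete, a quick sketch using the existing
helpers (illustrative only, not a hunk from the patch):

	static void example_attach_ref(struct sk_buff *skb, struct rtable *rt)
	{
		dst_hold(&rt->dst);		/* skb owns a reference now */
		skb_dst_set(skb, &rt->dst);	/* dropped via dst_release() on free */
	}

	static void example_attach_noref(struct sk_buff *skb, struct rtable *rt)
	{
		/* No atomic op; only valid under rcu_read_lock().  Callers
		 * whose skb outlives the RCU section call skb_dst_force().
		 */
		skb_dst_set_noref(skb, &rt->dst);
	}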
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 8:18 ` David Miller
@ 2012-07-26 8:27 ` Eric Dumazet
2012-07-26 8:47 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-26 8:27 UTC (permalink / raw)
To: David Miller; +Cc: alexander.duyck, netdev
On Thu, 2012-07-26 at 01:18 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 26 Jul 2012 10:13:34 +0200
>
> > On Wed, 2012-07-25 at 17:54 -0700, David Miller wrote:
> >> @@ -1216,9 +1215,15 @@ static void rt_cache_route(struct fib_nh *nh, struct rtable *rt)
> >>
> >> prev = cmpxchg(p, orig, rt);
> >> if (prev == orig) {
> >> - dst_clone(&rt->dst);
> >> if (orig)
> >> - call_rcu_bh(&orig->dst.rcu_head, rt_release_rcu);
> >> + rt_free(orig);
> >> + } else {
> >> + /* Routes we intend to cache in the FIB nexthop have
> >> + * the DST_NOCACHE bit set. However, if we are
> >> + * unsuccessful at storing this route into the cache
> >> + * we really need to clear that bit.
> >> + */
> >> + rt->dst.flags &= ~DST_NOCACHE;
> >> }
> >> }
> >>
> >
> > Not sure why you removed the dst_clone(&rt->dst) ?
>
> Because "cached" dst objects have no reference count. When such a
> cached dst is "released", it is dst_free()'d instead of
> dst_release()'d.
>
> > If it is not needed, we might need to release a reference in the else {}
> > clause, no ?
>
> Nope. 'rt' has a reference count of one, which is for the caller, not
> for this cached location.
But isn't the DST_NOCACHE intent inverted then ?
like you meant :
+ } else {
+ /* Routes we intend to cache in the FIB nexthop have
+ * the DST_NOCACHE bit unset. However, if we are
+ * unsuccessful at storing this route into the cache
+ * we really need to set that bit.
+ */
+ rt->dst.flags |= DST_NOCACHE;
}
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 8:27 ` Eric Dumazet
@ 2012-07-26 8:47 ` David Miller
2012-07-26 9:12 ` Eric Dumazet
2012-07-26 17:18 ` Alexander Duyck
0 siblings, 2 replies; 51+ messages in thread
From: David Miller @ 2012-07-26 8:47 UTC (permalink / raw)
To: eric.dumazet; +Cc: alexander.duyck, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 26 Jul 2012 10:27:58 +0200
> But isnt DST_NOCACHE intent reverted then ?
>
> like you meant :
>
> + } else {
> + /* Routes we intend to cache in the FIB nexthop have
> + * the DST_NOCACHE bit unset. However, if we are
> + * unsuccessful at storing this route into the cache
> + * we really need to set that bit.
> + */
> + rt->dst.flags |= DST_NOCACHE;
> }
Indeed, thanks for catching this bug.
Here's a new version of the patch, as I found another error. In
fib_semantics.c, we have to change dst_release() to dst_free() when
freeing the nexthop cached routes.
diff --git a/include/net/route.h b/include/net/route.h
index c29ef27..8c52bc6 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -30,6 +30,7 @@
#include <net/inet_sock.h>
#include <linux/in_route.h>
#include <linux/rtnetlink.h>
+#include <linux/rcupdate.h>
#include <linux/route.h>
#include <linux/ip.h>
#include <linux/cache.h>
@@ -157,8 +158,22 @@ static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4
return ip_route_output_key(net, fl4);
}
-extern int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
- u8 tos, struct net_device *devin);
+extern int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
+ u8 tos, struct net_device *devin);
+
+static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
+ u8 tos, struct net_device *devin)
+{
+ int err;
+
+ rcu_read_lock();
+ err = ip_route_input_noref(skb, dst, src, tos, devin);
+ if (!err)
+ skb_dst_force(skb);
+ rcu_read_unlock();
+
+ return err;
+}
extern void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu,
int oif, u32 mark, u8 protocol, int flow_flags);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index a0124eb..77e87af 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -827,7 +827,7 @@ static int arp_process(struct sk_buff *skb)
}
if (arp->ar_op == htons(ARPOP_REQUEST) &&
- ip_route_input(skb, tip, sip, 0, dev) == 0) {
+ ip_route_input_noref(skb, tip, sip, 0, dev) == 0) {
rt = skb_rtable(skb);
addr_type = rt->rt_type;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index e55171f..da0cc2e 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -172,9 +172,9 @@ static void free_fib_info_rcu(struct rcu_head *head)
if (nexthop_nh->nh_exceptions)
free_nh_exceptions(nexthop_nh);
if (nexthop_nh->nh_rth_output)
- dst_release(&nexthop_nh->nh_rth_output->dst);
+ dst_free(&nexthop_nh->nh_rth_output->dst);
if (nexthop_nh->nh_rth_input)
- dst_release(&nexthop_nh->nh_rth_input->dst);
+ dst_free(&nexthop_nh->nh_rth_input->dst);
} endfor_nexthops(fi);
release_net(fi->fib_net);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7ad88e5..8d07c97 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -258,8 +258,8 @@ static void ip_expire(unsigned long arg)
/* skb dst is stale, drop it, and perform route lookup again */
skb_dst_drop(head);
iph = ip_hdr(head);
- err = ip_route_input(head, iph->daddr, iph->saddr,
- iph->tos, head->dev);
+ err = ip_route_input_noref(head, iph->daddr, iph->saddr,
+ iph->tos, head->dev);
if (err)
goto out_rcu_unlock;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 93134b0..bda8cac 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -339,8 +339,8 @@ static int ip_rcv_finish(struct sk_buff *skb)
* how the packet travels inside Linux networking.
*/
if (!skb_dst(skb)) {
- int err = ip_route_input(skb, iph->daddr, iph->saddr,
- iph->tos, skb->dev);
+ int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+ iph->tos, skb->dev);
if (unlikely(err)) {
if (err == -EXDEV)
NET_INC_STATS_BH(dev_net(skb->dev),
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 3f7bb71..7a591aa 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1199,10 +1199,9 @@ restart:
fnhe->fnhe_stamp = jiffies;
}
-static inline void rt_release_rcu(struct rcu_head *head)
+static inline void rt_free(struct rtable *rt)
{
- struct dst_entry *dst = container_of(head, struct dst_entry, rcu_head);
- dst_release(dst);
+ call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
}
static void rt_cache_route(struct fib_nh *nh, struct rtable *rt)
@@ -1216,9 +1215,15 @@ static void rt_cache_route(struct fib_nh *nh, struct rtable *rt)
prev = cmpxchg(p, orig, rt);
if (prev == orig) {
- dst_clone(&rt->dst);
if (orig)
- call_rcu_bh(&orig->dst.rcu_head, rt_release_rcu);
+ rt_free(orig);
+ } else {
+ /* Routes we intend to cache in the FIB nexthop have
+ * the DST_NOCACHE bit clear. However, if we are
+ * unsuccessful at storing this route into the cache
+ * we really need to set it.
+ */
+ rt->dst.flags |= DST_NOCACHE;
}
}
@@ -1245,7 +1250,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
#ifdef CONFIG_IP_ROUTE_CLASSID
rt->dst.tclassid = nh->nh_tclassid;
#endif
- if (!(rt->dst.flags & DST_HOST))
+ if (!(rt->dst.flags & DST_NOCACHE))
rt_cache_route(nh, rt);
}
@@ -1261,7 +1266,7 @@ static struct rtable *rt_dst_alloc(struct net_device *dev,
bool nopolicy, bool noxfrm, bool will_cache)
{
return dst_alloc(&ipv4_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK,
- (will_cache ? 0 : DST_HOST) | DST_NOCACHE |
+ (will_cache ? 0 : (DST_HOST | DST_NOCACHE)) |
(nopolicy ? DST_NOPOLICY : 0) |
(noxfrm ? DST_NOXFRM : 0));
}
@@ -1588,8 +1593,9 @@ local_input:
if (!itag) {
rth = FIB_RES_NH(res).nh_rth_input;
if (rt_cache_valid(rth)) {
- dst_hold(&rth->dst);
- goto set_and_out;
+ skb_dst_set_noref(skb, &rth->dst);
+ err = 0;
+ goto out;
}
do_cache = true;
}
@@ -1620,7 +1626,6 @@ local_input:
}
if (do_cache)
rt_cache_route(&FIB_RES_NH(res), rth);
-set_and_out:
skb_dst_set(skb, &rth->dst);
err = 0;
goto out;
@@ -1658,8 +1663,8 @@ martian_source_keep_err:
goto out;
}
-int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
- u8 tos, struct net_device *dev)
+int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
+ u8 tos, struct net_device *dev)
{
int res;
@@ -1702,7 +1707,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
rcu_read_unlock();
return res;
}
-EXPORT_SYMBOL(ip_route_input);
+EXPORT_SYMBOL(ip_route_input_noref);
/* called with rcu_read_lock() */
static struct rtable *__mkroute_output(const struct fib_result *res,
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index 58d23a5..06814b6 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -27,8 +27,8 @@ static inline int xfrm4_rcv_encap_finish(struct sk_buff *skb)
if (skb_dst(skb) == NULL) {
const struct iphdr *iph = ip_hdr(skb);
- if (ip_route_input(skb, iph->daddr, iph->saddr,
- iph->tos, skb->dev))
+ if (ip_route_input_noref(skb, iph->daddr, iph->saddr,
+ iph->tos, skb->dev))
goto drop;
}
return dst_input(skb);
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 8:47 ` David Miller
@ 2012-07-26 9:12 ` Eric Dumazet
2012-07-26 17:18 ` Alexander Duyck
1 sibling, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2012-07-26 9:12 UTC (permalink / raw)
To: David Miller; +Cc: alexander.duyck, netdev
On Thu, 2012-07-26 at 01:47 -0700, David Miller wrote:
> Indeed, thanks for catching this bug.
>
> Here's a new version of the patch, as I found another error. In
> fib_semantics.c, we have to change dst_release() to dst_free() for the
> liberation of the nexthop cached routes.
Excellent, I did tests here and this seems OK to me.
Tested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-23 17:54 ` Paweł Staszewski
2012-07-23 20:10 ` David Miller
@ 2012-07-26 17:02 ` Eric Dumazet
1 sibling, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2012-07-26 17:02 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: David Miller, netdev
On Mon, 2012-07-23 at 19:54 +0200, Paweł Staszewski wrote:
> I ran some real-life tests with 10Gbit traffic and some rdos.
> With kernel 3.4.5, when I reach 6M routing cache entries, traffic is no
> longer forwarded on my hardware with 64G RAM.
>
> With kernel 3.5.0-rc7-next-20120720 + David M.'s patches (route cache
> removal + fix), traffic is forwarded all the time at 8.2Mpps RX from the
> traffic generator and 8.2Mpps TX to the sink.
>
> I also see a performance improvement.
>
> With kernel 3.4.5 I can reach a maximum of 7.5Mpps with my hardware,
> and with kernel 3.5.0-rc7-next-20120720 I can reach 8.2Mpps of UDP
> forwarding performance. It could be more, but my TX pktgen, which uses
> 6 cores, can't generate more.
>
> Really, thanks for this :)
In real life, with TCP traffic, make sure you
set /proc/sys/net/ipv4/ip_early_demux to 0
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 8:47 ` David Miller
2012-07-26 9:12 ` Eric Dumazet
@ 2012-07-26 17:18 ` Alexander Duyck
2012-07-26 17:30 ` Eric Dumazet
2012-07-26 20:59 ` David Miller
1 sibling, 2 replies; 51+ messages in thread
From: Alexander Duyck @ 2012-07-26 17:18 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Thu, Jul 26, 2012 at 1:47 AM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 26 Jul 2012 10:27:58 +0200
>
>> But isn't the DST_NOCACHE intent inverted then?
>>
>> like you meant:
>>
>> + } else {
>> + /* Routes we intend to cache in the FIB nexthop have
>> + * the DST_NOCACHE bit unset. However, if we are
>> + * unsuccessful at storing this route into the cache
>> + * we really need to set that bit.
>> + */
>> + rt->dst.flags |= DST_NOCACHE;
>> }
>
> Indeed, thanks for catching this bug.
>
> Here's a new version of the patch, as I found another error. In
> fib_semantics.c, we have to change dst_release() to dst_free() for the
> liberation of the nexthop cached routes.
I tested this patch and it looks like it runs, but still has the same
performance issue. I did some digging into the annotation for
ip_route_input_noref and it seems like the issue is that I am hitting
the dst_hold call in __mkroute_input.
You were correct about the ARP entry. It looks like I had messed up
my test script and it wasn't putting the static ARP entry back in.
I'm amazed it was even working since my SmartBits system doesn't
normally even reply to ARPs.
Thanks,
Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 17:18 ` Alexander Duyck
@ 2012-07-26 17:30 ` Eric Dumazet
2012-07-26 17:36 ` Eric Dumazet
2012-07-26 20:59 ` David Miller
1 sibling, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-26 17:30 UTC (permalink / raw)
To: Alexander Duyck; +Cc: David Miller, netdev
On Thu, 2012-07-26 at 10:18 -0700, Alexander Duyck wrote:
> I tested this patch and it looks like it runs, but still has the same
> performance issue. I did some digging into the annotation for
> ip_route_input_noref and it seems like the issue is that I am hitting
> the dst_hold call in __mkroute_input.
David suggested a percpu cache.
nh_rth_input would be allocated by alloc_percpu(struct dst *)
I can work on this.
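Roughly, something like this (the field and helper names below are only an
illustration of the idea, not an actual patch):

/* Assumed new member of struct fib_nh:
 *	struct rtable * __percpu *nh_pcpu_rth_input;
 */
static int nh_alloc_pcpu_input_cache(struct fib_nh *nh)
{
	nh->nh_pcpu_rth_input = alloc_percpu(struct rtable *);
	return nh->nh_pcpu_rth_input ? 0 : -ENOMEM;
}

static struct rtable *nh_get_pcpu_input_route(struct fib_nh *nh)
{
	/* Each CPU reads and writes only its own slot, so no shared
	 * refcount is taken and no cache line bounces between CPUs.
	 */
	struct rtable *rt = *this_cpu_ptr(nh->nh_pcpu_rth_input);

	if (rt && rt_cache_valid(rt))
		return rt;
	return NULL;
}

The input path would then check this CPU's slot first and only build a new
route on a miss.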
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 17:30 ` Eric Dumazet
@ 2012-07-26 17:36 ` Eric Dumazet
2012-07-26 17:43 ` Eric Dumazet
0 siblings, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-26 17:36 UTC (permalink / raw)
To: Alexander Duyck; +Cc: David Miller, netdev
On Thu, 2012-07-26 at 19:31 +0200, Eric Dumazet wrote:
> On Thu, 2012-07-26 at 10:18 -0700, Alexander Duyck wrote:
>
> > I tested this patch and it looks like it runs, but still has the same
> > performance issue. I did some digging into the annotation for
> > ip_route_input_noref and it seems like the issue is that I am hitting
> > the dst_hold call in __mkroute_input.
>
> David suggested a percpu cache.
>
> nh_rth_input would be allocated by alloc_percpu(struct dst *)
>
> I can work on this.
Wait a minute, on input we should use the noref trick too.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 17:36 ` Eric Dumazet
@ 2012-07-26 17:43 ` Eric Dumazet
2012-07-26 17:48 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Eric Dumazet @ 2012-07-26 17:43 UTC (permalink / raw)
To: Alexander Duyck; +Cc: David Miller, netdev
On Thu, 2012-07-26 at 19:36 +0200, Eric Dumazet wrote:
> On Thu, 2012-07-26 at 19:31 +0200, Eric Dumazet wrote:
> > On Thu, 2012-07-26 at 10:18 -0700, Alexander Duyck wrote:
> >
> > > I tested this patch and it looks like it runs, but still has the same
> > > performance issue. I did some digging into the annotation for
> > > ip_route_input_noref and it seems like the issue is that I am hitting
> > > the dst_hold call in __mkroute_input.
> >
> > David suggested a percpu cache.
> >
> > nh_rth_input would be allocated by alloc_percpu(struct dst *)
> >
> > I can work on this.
>
> Wait a minute, on input we should use the noref trick too.
>
Something like this (on top of David's latest patch):
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7a591aa..d5d2ad1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1371,8 +1371,7 @@ static void ip_handle_martian_source(struct net_device *dev,
static int __mkroute_input(struct sk_buff *skb,
const struct fib_result *res,
struct in_device *in_dev,
- __be32 daddr, __be32 saddr, u32 tos,
- struct rtable **result)
+ __be32 daddr, __be32 saddr, u32 tos)
{
struct rtable *rth;
int err;
@@ -1423,7 +1422,7 @@ static int __mkroute_input(struct sk_buff *skb,
if (!itag) {
rth = FIB_RES_NH(*res).nh_rth_input;
if (rt_cache_valid(rth)) {
- dst_hold(&rth->dst);
+ skb_dst_set_noref(skb, &rth->dst);
goto out;
}
do_cache = true;
@@ -1451,7 +1450,6 @@ static int __mkroute_input(struct sk_buff *skb,
rt_set_nexthop(rth, daddr, res, NULL, res->fi, res->type, itag);
out:
- *result = rth;
err = 0;
cleanup:
return err;
@@ -1463,21 +1461,13 @@ static int ip_mkroute_input(struct sk_buff *skb,
struct in_device *in_dev,
__be32 daddr, __be32 saddr, u32 tos)
{
- struct rtable *rth = NULL;
- int err;
-
#ifdef CONFIG_IP_ROUTE_MULTIPATH
if (res->fi && res->fi->fib_nhs > 1)
fib_select_multipath(res);
#endif
/* create a routing cache entry */
- err = __mkroute_input(skb, res, in_dev, daddr, saddr, tos, &rth);
- if (err)
- return err;
-
- skb_dst_set(skb, &rth->dst);
- return 0;
+ return __mkroute_input(skb, res, in_dev, daddr, saddr, tos);
}
/*
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 17:43 ` Eric Dumazet
@ 2012-07-26 17:48 ` Eric Dumazet
2012-07-26 18:26 ` Alexander Duyck
2012-07-26 18:06 ` Alexander Duyck
2012-07-26 21:00 ` David Miller
2 siblings, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-26 17:48 UTC (permalink / raw)
To: Alexander Duyck; +Cc: David Miller, netdev
On Thu, 2012-07-26 at 19:43 +0200, Eric Dumazet wrote:
> On Thu, 2012-07-26 at 19:36 +0200, Eric Dumazet wrote:
> > On Thu, 2012-07-26 at 19:31 +0200, Eric Dumazet wrote:
> > > On Thu, 2012-07-26 at 10:18 -0700, Alexander Duyck wrote:
> > >
> > > > I tested this patch and it looks like it runs, but still has the same
> > > > performance issue. I did some digging into the annotation for
> > > > ip_route_input_noref and it seems like the issue is that I am hitting
> > > > the dst_hold call in __mkroute_input.
> > >
> > > David suggested a percpu cache.
> > >
> > > nh_rth_input would be allocated by alloc_percpu(struct dst *)
> > >
> > > I can work on this.
> >
> > Wait a minute, on input we should use the noref trick too.
> >
>
> Something like this (on top of David's latest patch):
Sorry, updated patch: (missing skb_dst_set() before the 'out' label)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7a591aa..fc1a81c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1371,8 +1371,7 @@ static void ip_handle_martian_source(struct net_device *dev,
static int __mkroute_input(struct sk_buff *skb,
const struct fib_result *res,
struct in_device *in_dev,
- __be32 daddr, __be32 saddr, u32 tos,
- struct rtable **result)
+ __be32 daddr, __be32 saddr, u32 tos)
{
struct rtable *rth;
int err;
@@ -1423,7 +1422,7 @@ static int __mkroute_input(struct sk_buff *skb,
if (!itag) {
rth = FIB_RES_NH(*res).nh_rth_input;
if (rt_cache_valid(rth)) {
- dst_hold(&rth->dst);
+ skb_dst_set_noref(skb, &rth->dst);
goto out;
}
do_cache = true;
@@ -1450,8 +1449,8 @@ static int __mkroute_input(struct sk_buff *skb,
rth->dst.output = ip_output;
rt_set_nexthop(rth, daddr, res, NULL, res->fi, res->type, itag);
+ skb_dst_set(skb, &rth->dst);
out:
- *result = rth;
err = 0;
cleanup:
return err;
@@ -1463,21 +1462,13 @@ static int ip_mkroute_input(struct sk_buff *skb,
struct in_device *in_dev,
__be32 daddr, __be32 saddr, u32 tos)
{
- struct rtable *rth = NULL;
- int err;
-
#ifdef CONFIG_IP_ROUTE_MULTIPATH
if (res->fi && res->fi->fib_nhs > 1)
fib_select_multipath(res);
#endif
/* create a routing cache entry */
- err = __mkroute_input(skb, res, in_dev, daddr, saddr, tos, &rth);
- if (err)
- return err;
-
- skb_dst_set(skb, &rth->dst);
- return 0;
+ return __mkroute_input(skb, res, in_dev, daddr, saddr, tos);
}
/*
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 17:43 ` Eric Dumazet
2012-07-26 17:48 ` Eric Dumazet
@ 2012-07-26 18:06 ` Alexander Duyck
2012-07-26 21:00 ` David Miller
2 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2012-07-26 18:06 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
On Thu, Jul 26, 2012 at 10:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-07-26 at 19:36 +0200, Eric Dumazet wrote:
>> On Thu, 2012-07-26 at 19:31 +0200, Eric Dumazet wrote:
>> > On Thu, 2012-07-26 at 10:18 -0700, Alexander Duyck wrote:
>> >
>> > > I tested this patch and it looks like it runs, but still has the same
>> > > performance issue. I did some digging into the annotation for
>> > > ip_route_input_noref and it seems like the issue is that I am hitting
>> > > the dst_hold call in __mkroute_input.
>> >
>> > David suggested a percpu cache.
>> >
>> > nh_rth_input would be allocated by alloc_percpu(struct dst *)
>> >
>> > I can work on this.
>>
>> Wait a minute, on input we should use the noref trick too.
>>
>
> Something like this (on top of David's latest patch):
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 7a591aa..d5d2ad1 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1371,8 +1371,7 @@ static void ip_handle_martian_source(struct net_device *dev,
> static int __mkroute_input(struct sk_buff *skb,
> const struct fib_result *res,
> struct in_device *in_dev,
> - __be32 daddr, __be32 saddr, u32 tos,
> - struct rtable **result)
> + __be32 daddr, __be32 saddr, u32 tos)
> {
> struct rtable *rth;
> int err;
> @@ -1423,7 +1422,7 @@ static int __mkroute_input(struct sk_buff *skb,
> if (!itag) {
> rth = FIB_RES_NH(*res).nh_rth_input;
> if (rt_cache_valid(rth)) {
> - dst_hold(&rth->dst);
> + skb_dst_set_noref(skb, &rth->dst);
> goto out;
> }
> do_cache = true;
> @@ -1451,7 +1450,6 @@ static int __mkroute_input(struct sk_buff *skb,
>
> rt_set_nexthop(rth, daddr, res, NULL, res->fi, res->type, itag);
> out:
> - *result = rth;
> err = 0;
> cleanup:
> return err;
> @@ -1463,21 +1461,13 @@ static int ip_mkroute_input(struct sk_buff *skb,
> struct in_device *in_dev,
> __be32 daddr, __be32 saddr, u32 tos)
> {
> - struct rtable *rth = NULL;
> - int err;
> -
> #ifdef CONFIG_IP_ROUTE_MULTIPATH
> if (res->fi && res->fi->fib_nhs > 1)
> fib_select_multipath(res);
> #endif
>
> /* create a routing cache entry */
> - err = __mkroute_input(skb, res, in_dev, daddr, saddr, tos, &rth);
> - if (err)
> - return err;
> -
> - skb_dst_set(skb, &rth->dst);
> - return 0;
> + return __mkroute_input(skb, res, in_dev, daddr, saddr, tos);
> }
>
> /*
>
With your changes in place I see an increase from 7.5Mpps to 9.9Mpps
for 8 queues, and increasing the queues to 9 gets me up to 11Mpps even
if the 9th queue is on another node. This is a HUGE improvement over
what we had before.
The only remaining overhead introduced by the recent changes appears to
be fib_table_lookup, which doesn't have any hot spots that jump out at
me. The performance is in line with what I was seeing when I randomly
generated source IPs from a fairly large set, so I suspect this is just
the expected behaviour without a routing cache in place.
Thanks,
Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 17:48 ` Eric Dumazet
@ 2012-07-26 18:26 ` Alexander Duyck
2012-07-26 21:06 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2012-07-26 18:26 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
On Thu, Jul 26, 2012 at 10:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-07-26 at 19:43 +0200, Eric Dumazet wrote:
>> On Thu, 2012-07-26 at 19:36 +0200, Eric Dumazet wrote:
>> > On Thu, 2012-07-26 at 19:31 +0200, Eric Dumazet wrote:
>> > > On Thu, 2012-07-26 at 10:18 -0700, Alexander Duyck wrote:
>> > >
>> > > > I tested this patch and it looks like it runs, but still has the same
>> > > > performance issue. I did some digging into the annotation for
>> > > > ip_route_input_noref and it seems like the issue is that I am hitting
>> > > > the dst_hold call in __mkroute_input.
>> > >
>> > > David suggested a percpu cache.
>> > >
>> > > nh_rth_input would be allocated by alloc_percpu(struct dst *)
>> > >
>> > > I can work on this.
>> >
>> > Wait a minute, on input we should use the noref trick too.
>> >
>>
>> Something like this (on top of David's latest patch):
>
> Sorry, updated patch: (missing skb_dst_set() before the 'out' label)
The previous results were with slight modifications to your earlier
patch. With this patch applied I am seeing 10.4Mpps with 8 queues,
reaching a maximum of 11.6Mpps with 9 queues.
When we have a final version of this patch, let me know and I will
give it a quick run and throw in my Tested-by.
Thanks,
Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 17:18 ` Alexander Duyck
2012-07-26 17:30 ` Eric Dumazet
@ 2012-07-26 20:59 ` David Miller
1 sibling, 0 replies; 51+ messages in thread
From: David Miller @ 2012-07-26 20:59 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 26 Jul 2012 10:18:03 -0700
> I did some digging into the annotation for ip_route_input_noref and
> it seems like the issue is that I am hitting the dst_hold call in
> __mkroute_input.
We shouldn't be building any routes in __mkroute_input except for the
very first packet; instead we should be using the cached route in the
FIB info nexthop.
Something's not right.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 17:43 ` Eric Dumazet
2012-07-26 17:48 ` Eric Dumazet
2012-07-26 18:06 ` Alexander Duyck
@ 2012-07-26 21:00 ` David Miller
2 siblings, 0 replies; 51+ messages in thread
From: David Miller @ 2012-07-26 21:00 UTC (permalink / raw)
To: eric.dumazet; +Cc: alexander.duyck, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 26 Jul 2012 19:43:53 +0200
> On Thu, 2012-07-26 at 19:36 +0200, Eric Dumazet wrote:
>> On Thu, 2012-07-26 at 19:31 +0200, Eric Dumazet wrote:
>> > On Thu, 2012-07-26 at 10:18 -0700, Alexander Duyck wrote:
>> >
>> > > I tested this patch and it looks like it runs, but still has the same
>> > > performance issue. I did some digging into the annotation for
>> > > ip_route_input_noref and it seems like the issue is that I am hitting
>> > > the dst_hold call in __mkroute_input.
>> >
>> > David suggested a percpu cache.
>> >
>> > nh_rth_input would be allocated by alloc_percpu(struct dst *)
>> >
>> > I can work on this.
>>
>> Wait a minute, on input we should use the noref trick too.
>>
>
> Something like this (on top of David's latest patch):
Grrr, I only got the local routes, didn't I? :-)
That would explain everything.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 18:26 ` Alexander Duyck
@ 2012-07-26 21:06 ` David Miller
2012-07-26 22:03 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-26 21:06 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 26 Jul 2012 11:26:26 -0700
> The previous results were with slight modifications to your earlier
> patch. With this patch applied I am seeing 10.4Mpps with 8 queues,
> reaching a maximum of 11.6Mpps with 9 queues.
For fun you might want to see what this patch does for your tests,
it should cut the number of fib_table_lookup() calls roughly in half.
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index b64a19c..fc7eade 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -204,9 +204,7 @@ static inline struct fib_table *fib_get_table(struct net *net, u32 id)
{
struct hlist_head *ptr;
- ptr = id == RT_TABLE_LOCAL ?
- &net->ipv4.fib_table_hash[TABLE_LOCAL_INDEX] :
- &net->ipv4.fib_table_hash[TABLE_MAIN_INDEX];
+ ptr = &net->ipv4.fib_table_hash[TABLE_MAIN_INDEX];
return hlist_entry(ptr->first, struct fib_table, tb_hlist);
}
@@ -220,10 +218,6 @@ static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
{
struct fib_table *table;
- table = fib_get_table(net, RT_TABLE_LOCAL);
- if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF))
- return 0;
-
table = fib_get_table(net, RT_TABLE_MAIN);
if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF))
return 0;
@@ -245,10 +239,6 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
{
if (!net->ipv4.fib_has_custom_rules) {
res->tclassid = 0;
- if (net->ipv4.fib_local &&
- !fib_table_lookup(net->ipv4.fib_local, flp, res,
- FIB_LOOKUP_NOREF))
- return 0;
if (net->ipv4.fib_main &&
!fib_table_lookup(net->ipv4.fib_main, flp, res,
FIB_LOOKUP_NOREF))
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 44bf82e..bdc0231 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -160,7 +160,7 @@ struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
/* Fallback to FIB local table so that communication
* over loopback subnets work.
*/
- local = fib_get_table(net, RT_TABLE_LOCAL);
+ local = fib_get_table(net, RT_TABLE_MAIN);
if (local &&
!fib_table_lookup(local, &fl4, &res, FIB_LOOKUP_NOREF) &&
res.type == RTN_LOCAL)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c1fde53..ddfe398 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -52,16 +52,10 @@ static int __net_init fib4_rules_init(struct net *net)
{
struct fib_table *local_table, *main_table;
- local_table = fib_trie_table(RT_TABLE_LOCAL);
- if (local_table == NULL)
- return -ENOMEM;
-
main_table = fib_trie_table(RT_TABLE_MAIN);
if (main_table == NULL)
goto fail;
- hlist_add_head_rcu(&local_table->tb_hlist,
- &net->ipv4.fib_table_hash[TABLE_LOCAL_INDEX]);
hlist_add_head_rcu(&main_table->tb_hlist,
&net->ipv4.fib_table_hash[TABLE_MAIN_INDEX]);
return 0;
@@ -169,7 +163,7 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
if (ipv4_is_multicast(addr))
return RTN_MULTICAST;
- local_table = fib_get_table(net, RT_TABLE_LOCAL);
+ local_table = fib_get_table(net, RT_TABLE_MAIN);
if (local_table) {
ret = RTN_UNICAST;
rcu_read_lock();
@@ -712,11 +706,7 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
},
};
- if (type == RTN_UNICAST)
- tb = fib_new_table(net, RT_TABLE_MAIN);
- else
- tb = fib_new_table(net, RT_TABLE_LOCAL);
-
+ tb = fib_new_table(net, RT_TABLE_MAIN);
if (tb == NULL)
return;
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index a83d74e..65135dd 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -284,9 +284,6 @@ static int fib_default_rules_init(struct fib_rules_ops *ops)
{
int err;
- err = fib_default_rule_add(ops, 0, RT_TABLE_LOCAL, 0);
- if (err < 0)
- return err;
err = fib_default_rule_add(ops, 0x7FFE, RT_TABLE_MAIN, 0);
if (err < 0)
return err;
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 21:06 ` David Miller
@ 2012-07-26 22:03 ` Alexander Duyck
2012-07-26 22:13 ` Stephen Hemminger
2012-07-26 22:53 ` David Miller
0 siblings, 2 replies; 51+ messages in thread
From: Alexander Duyck @ 2012-07-26 22:03 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Thu, Jul 26, 2012 at 2:06 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Thu, 26 Jul 2012 11:26:26 -0700
>
>> The previous results were with slight modifications to your earlier
>> patch. With this patch applied I am seeing 10.4Mpps with 8 queues,
>> reaching a maximum of 11.6Mpps with 9 queues.
>
> For fun you might want to see what this patch does for your tests,
> it should cut the number of fib_table_lookup() calls roughly in half.
So with your patch, Eric's patch, and this most recent patch we are
now at 11.8Mpps with 8 or 9 queues. At this point I am starting to hit
the hardware limits since 82599 will typically max out at about 12Mpps
w/ 9 queues.
Here are the latest perf results with all of these patches in place.
As you predicted your patch essentially cut the lookup overhead in
half:
10.65% [k] ixgbe_poll
7.77% [k] fib_table_lookup
6.21% [k] ixgbe_xmit_frame_ring
6.08% [k] __netif_receive_skb
4.41% [k] _raw_spin_lock
3.95% [k] kmem_cache_free
3.30% [k] build_skb
3.17% [k] memcpy
2.96% [k] dev_queue_xmit
2.79% [k] ip_finish_output
2.66% [k] kmem_cache_alloc
2.57% [k] check_leaf
2.52% [k] ip_route_input_noref
2.50% [k] netdev_alloc_frag
2.17% [k] ip_rcv
2.16% [k] __phys_addr
I will probably do some more poking around over the next few days in
order to get my head around the fib_table_lookup overhead.
Thanks,
Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 22:03 ` Alexander Duyck
@ 2012-07-26 22:13 ` Stephen Hemminger
2012-07-26 22:19 ` Eric Dumazet
2012-07-26 22:48 ` David Miller
2012-07-26 22:53 ` David Miller
1 sibling, 2 replies; 51+ messages in thread
From: Stephen Hemminger @ 2012-07-26 22:13 UTC (permalink / raw)
To: Alexander Duyck; +Cc: David Miller, eric.dumazet, netdev
On Thu, 26 Jul 2012 15:03:39 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:
> On Thu, Jul 26, 2012 at 2:06 PM, David Miller <davem@davemloft.net> wrote:
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Date: Thu, 26 Jul 2012 11:26:26 -0700
> >
> >> The previous results were with slight modifications to your earlier
> >> patch. With this patch applied I am seeing 10.4Mpps with 8 queues,
> >> reaching a maximum of 11.6Mpps with 9 queues.
> >
> > For fun you might want to see what this patch does for your tests,
> > it should cut the number of fib_table_lookup() calls roughly in half.
>
> So with your patch, Eric's patch, and this most recent patch we are
> now at 11.8Mpps with 8 or 9 queues. At this point I am starting to hit
> the hardware limits since 82599 will typically max out at about 12Mpps
> w/ 9 queues.
>
> Here are the latest perf results with all of these patches in place.
> As you predicted your patch essentially cut the lookup overhead in
> half:
> 10.65% [k] ixgbe_poll
> 7.77% [k] fib_table_lookup
> 6.21% [k] ixgbe_xmit_frame_ring
> 6.08% [k] __netif_receive_skb
> 4.41% [k] _raw_spin_lock
> 3.95% [k] kmem_cache_free
> 3.30% [k] build_skb
> 3.17% [k] memcpy
> 2.96% [k] dev_queue_xmit
> 2.79% [k] ip_finish_output
> 2.66% [k] kmem_cache_alloc
> 2.57% [k] check_leaf
> 2.52% [k] ip_route_input_noref
> 2.50% [k] netdev_alloc_frag
> 2.17% [k] ip_rcv
> 2.16% [k] __phys_addr
>
> I will probably do some more poking around over the next few days in
> order to get my head around the fib_table_lookup overhead.
>
> Thanks,
>
> Alex
The fib trie stats are global, you may want to either disable CONFIG_IP_FIB_TRIE_STATS
or convert them to per-cpu.
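Roughly, a per-cpu conversion could look like this (the assumed field change
and helper name are only an illustration, not a patch from this thread):

#ifdef CONFIG_IP_FIB_TRIE_STATS
/* Assumed change to struct trie: replace the embedded trie_use_stats
 * with a per-cpu pointer, allocated once with
 *	t->stats = alloc_percpu(struct trie_use_stats);
 */
static void trie_stats_inc_gets(struct trie *t)
{
	/* this_cpu_inc() bumps a counter on this CPU's own cache line
	 * instead of having every CPU dirty one shared global line.
	 */
	this_cpu_inc(t->stats->gets);
}
#endif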
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 22:13 ` Stephen Hemminger
@ 2012-07-26 22:19 ` Eric Dumazet
2012-07-26 22:48 ` David Miller
1 sibling, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2012-07-26 22:19 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Alexander Duyck, David Miller, netdev
On Thu, 2012-07-26 at 15:13 -0700, Stephen Hemminger wrote:
> The fib trie stats are global, you may want to either disable CONFIG_IP_FIB_TRIE_STATS
> or convert them to per-cpu.
I guess it's already disabled in Alex's case ;)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 22:13 ` Stephen Hemminger
2012-07-26 22:19 ` Eric Dumazet
@ 2012-07-26 22:48 ` David Miller
1 sibling, 0 replies; 51+ messages in thread
From: David Miller @ 2012-07-26 22:48 UTC (permalink / raw)
To: shemminger; +Cc: alexander.duyck, eric.dumazet, netdev
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Thu, 26 Jul 2012 15:13:12 -0700
> The fib trie stats are global, you may want to either disable
> CONFIG_IP_FIB_TRIE_STATS or convert them to per-cpu.
Oh yeah, turn that stuff off when testing :-)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 22:03 ` Alexander Duyck
2012-07-26 22:13 ` Stephen Hemminger
@ 2012-07-26 22:53 ` David Miller
2012-07-27 2:14 ` Alexander Duyck
1 sibling, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-26 22:53 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 26 Jul 2012 15:03:39 -0700
> Here are the latest perf results with all of these patches in place.
> As you predicted your patch essentially cut the lookup overhead in
> half:
Ok good.
That patch is hard to make legitimate; I'd have to do a bit of work
before we could realize that change.
We can only combine the LOCAL and MAIN tables like that so long as
there are no overlaps in the routes covered by the two tables. We'd
also have to be sure to report the routes properly in dumps too.
I really wish we had never segregated these two tables, it's
completely pointless and hurts performance. But now we have to
accommodate this legacy.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-26 22:53 ` David Miller
@ 2012-07-27 2:14 ` Alexander Duyck
2012-07-27 3:08 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2012-07-27 2:14 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Thu, Jul 26, 2012 at 3:53 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Thu, 26 Jul 2012 15:03:39 -0700
>
>> Here are the latest perf results with all of these patches in place.
>> As you predicted your patch essentially cut the lookup overhead in
>> half:
>
> Ok good.
>
> That patch is hard to make legitimate; I'd have to do a bit of work
> before we could realize that change.
>
> We can only combine the LOCAL and MAIN tables like that so long as
> there are no overlaps in the routes covered by the two tables. We'd
> also have to be sure to report the routes properly in dumps too.
>
> I really wish we had never segregated these two tables, it's
> completely pointless and hurts performance. But now we have to
> accommodate this legacy.
Any idea why these look-ups are so expensive in the first place? When
I dump fib_trie it doesn't look like I have much there. I would have
thought the table would be pretty static with just 8 flows all going
to the same destination address, but it seems like I was getting hit
with cache misses for some reason.
Thanks,
Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-27 2:14 ` Alexander Duyck
@ 2012-07-27 3:08 ` David Miller
2012-07-27 6:02 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-27 3:08 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 26 Jul 2012 19:14:55 -0700
> Any idea why these look-ups are so expensive in the first place? When
> I dump fib_trie it doesn't look like I have much there. I would have
> thought the table would be pretty static with just 8 flows all going
> to the same destination address, but it seems like I was getting hit
> with cache misses for some reason.
A lot of the overhead comes from write traffic that results from
filling in the "fib_result" structure onto the caller's stack.
The return value from a fib_lookup() has far too many components. But
simplifying things is not easy.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-27 3:08 ` David Miller
@ 2012-07-27 6:02 ` David Miller
2012-07-27 10:01 ` Eric Dumazet
2012-07-28 4:15 ` David Miller
0 siblings, 2 replies; 51+ messages in thread
From: David Miller @ 2012-07-27 6:02 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: David Miller <davem@davemloft.net>
Date: Thu, 26 Jul 2012 20:08:46 -0700 (PDT)
> A lot of the overhead comes from write traffic that results from
> filling in the "fib_result" structure onto the caller's stack.
Here's the longer analysis of how things are now.
There are several components to a route lookup result, and struct
fib_result tries to encapsulate all of this.
Another aspect is that our route tables are broken up into different
data structures which reference each other, in order to save space.
So the actual objects in the FIB trie are fib_alias structures, and
those point to fib_info. There is a many to one relationship between
FIB trie nodes and fib_info objects.
The idea is that many routes have the same set of nexthops, metrics,
preferred source address, etc.
So one thing we return in the fib_result is a pointer to the fib_info
and an index into the nexthop array (nh_sel). That's why we have all
of these funny accessors FIB_RES_X(res) which essentially provide
res.fi->fib_nh[res.nh_sel].X
Therefore one area of simplification would be to just return a pointer
to the FIB nexthop, rather than the fib_info pointer and the nexthop
index. We can get to the fib_info, if we need to, via the nh_parent
pointer of the nexthop.
It seems also that the res->scope value can be cribbed from the
fib_info as well.
res->type is embedded in the fib_alias we select hanging off of the
FIB trie node. And the res->prefixlen is taken from the FIB trie
node.
res->tclassid is problematic, because it comes from the FIB rules
tables rather than the FIB trie. We used to store a full FIB rules
pointer in the fib_result, but I reduced it down to just the u32
tclassid.
This whole area, as well as the FIB trie lookup itself, is an area
ripe for a large number of small micro-optimizations that in the end
make its overhead much more reasonable.
Another thing I haven't mentioned is that another part of FIB trie's
overhead is that it does backtracking. The shorter prefixes sit at
the top of the trie, so when it traverses down it does so until it
can't get a match, then it walks back up to the root until it does
have a match.
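Taken together, the slimmed-down lookup result could look roughly like this
(an illustration only, reusing the existing FIB_RES_* naming and the real
fib_nh fields; it is not the RFC patch set mentioned later in the thread):

struct fib_result {
	u32		tclassid;	/* still filled in from the FIB rules */
	struct fib_nh	*nh;		/* the selected nexthop */
};

/* The fib_info (and through it the scope, metrics, preferred source
 * address, ...) stays reachable via nh_parent instead of being copied
 * onto the caller's stack.
 */
#define FIB_RES_FI(res)		((res).nh->nh_parent)
#define FIB_RES_GW(res)		((res).nh->nh_gw)
#define FIB_RES_DEV(res)	((res).nh->nh_dev)
#define FIB_RES_OIF(res)	((res).nh->nh_oif)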
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-27 6:02 ` David Miller
@ 2012-07-27 10:01 ` Eric Dumazet
2012-07-27 14:53 ` Eric W. Biederman
2012-07-28 4:15 ` David Miller
1 sibling, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-27 10:01 UTC (permalink / raw)
To: David Miller; +Cc: alexander.duyck, netdev
From: Eric Dumazet <edumazet@google.com>
On Thu, 2012-07-26 at 23:02 -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Thu, 26 Jul 2012 20:08:46 -0700 (PDT)
>
> > A lot of the overhead comes from write traffic that results from
> > filling in the "fib_result" structure onto the caller's stack.
>
> Here's the longer analysis of how things are now.
>
> There are several components to a route lookup result, and struct
> fib_result tries to encapsulate all of this.
>
> Another aspect is that our route tables are broken up into different
> data structures which reference each other, in order to save space.
>
> So the actual objects in the FIB trie are fib_alias structures, and
> those point to fib_info. There is a many to one relationship between
> FIB trie nodes and fib_info objects.
>
> The idea is that many routes have the same set of nexthops, metrics,
> preferred source address, etc.
>
> So one thing we return in the fib_result is a pointer to the fib_info
> and an index into the nexthop array (nh_sel). That's why we have all
> of these funny accessors FIB_RES_X(res) which essentially provide
> res.fi->fib_nh[res.nh_sel].X
>
> Therefore one area of simplification would be to just return a pointer
> to the FIB nexthop, rather than the fib_info pointer and the nexthop
> index. We can get to the fib_info, if we need to, via the nh_parent
> pointer of the nexthop.
>
> It seems also that the res->scope value can be cribbed from the
> fib_info as well.
>
> res->type is embedded in the fib_alias we select hanging off of the
> FIB trie node. And the res->prefixlen is taken from the FIB trie
> node.
>
> res->tclassid is problematic, because it comes from the FIB rules
> tables rather than the FIB trie. We used to store a full FIB rules
> pointer in the fib_result, but I reduced it down to just the u32
> tclassid.
>
> This whole area, as well as the FIB trie lookup itself, is an area
> ripe for a large number of small micro-optimizations that in the end
> make its overhead much more reasonable.
>
> Another thing I haven't mentioned is that another part of FIB trie's
> overhead is that it does backtracking. The shorter prefixes sit at
> the top of the trie, so when it traverses down it does so until it
> can't get a match, then it walks back up to the root until it does
> have a match.
We also have cache line misses in this code, and we shouldn't have them
at all.
The following patch helps a lot in my case.
Thanks
[PATCH] ipv4: fib: avoid false sharing
Now that the IP route cache is removed, we should make sure fib structures
can't share cache lines with possibly often-dirtied objects.
On x86, the kmalloc-96 cache can be a source of such problems.
Problem spotted with perf ... -e cache-misses ... while doing
a forwarding benchmark.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/ip_fib.h | 6 ++++++
net/ipv4/fib_frontend.c | 5 +----
net/ipv4/fib_semantics.c | 15 ++++++++-------
net/ipv4/fib_trie.c | 26 +++++++++-----------------
4 files changed, 24 insertions(+), 28 deletions(-)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e69c3a4..5b6a9b3 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -295,6 +295,12 @@ extern void fib_select_multipath(struct fib_result *res);
/* Exported by fib_trie.c */
extern void fib_trie_init(void);
extern struct fib_table *fib_trie_table(u32 id);
+static inline void *fib_zalloc(size_t size)
+{
+ /* We want to avoid possible false sharing */
+ return kzalloc(max_t(size_t, 128, size), GFP_KERNEL);
+}
+
static inline void fib_combine_itag(u32 *itag, const struct fib_result *res)
{
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 8732cc7..a3d0285 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1089,10 +1089,7 @@ static int __net_init ip_fib_net_init(struct net *net)
int err;
size_t size = sizeof(struct hlist_head) * FIB_TABLE_HASHSZ;
- /* Avoid false sharing : Use at least a full cache line */
- size = max_t(size_t, size, L1_CACHE_BYTES);
-
- net->ipv4.fib_table_hash = kzalloc(size, GFP_KERNEL);
+ net->ipv4.fib_table_hash = fib_zalloc(size);
if (net->ipv4.fib_table_hash == NULL)
return -ENOMEM;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da0cc2e..a9f5e87 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -651,17 +651,17 @@ static inline unsigned int fib_laddr_hashfn(__be32 val)
((__force u32)val >> 14)) & mask;
}
-static struct hlist_head *fib_info_hash_alloc(int bytes)
+static struct hlist_head *fib_info_hash_alloc(size_t bytes)
{
if (bytes <= PAGE_SIZE)
- return kzalloc(bytes, GFP_KERNEL);
+ return fib_zalloc(bytes);
else
return (struct hlist_head *)
__get_free_pages(GFP_KERNEL | __GFP_ZERO,
get_order(bytes));
}
-static void fib_info_hash_free(struct hlist_head *hash, int bytes)
+static void fib_info_hash_free(struct hlist_head *hash, size_t bytes)
{
if (!hash)
return;
@@ -678,7 +678,8 @@ static void fib_info_hash_move(struct hlist_head *new_info_hash,
{
struct hlist_head *old_info_hash, *old_laddrhash;
unsigned int old_size = fib_info_hash_size;
- unsigned int i, bytes;
+ unsigned int i;
+ size_t bytes;
spin_lock_bh(&fib_info_lock);
old_info_hash = fib_info_hash;
@@ -766,10 +767,10 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
unsigned int new_size = fib_info_hash_size << 1;
struct hlist_head *new_info_hash;
struct hlist_head *new_laddrhash;
- unsigned int bytes;
+ size_t bytes;
if (!new_size)
- new_size = 1;
+ new_size = 16;
bytes = new_size * sizeof(struct hlist_head *);
new_info_hash = fib_info_hash_alloc(bytes);
new_laddrhash = fib_info_hash_alloc(bytes);
@@ -783,7 +784,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
goto failure;
}
- fi = kzalloc(sizeof(*fi)+nhs*sizeof(struct fib_nh), GFP_KERNEL);
+ fi = fib_zalloc(sizeof(*fi) + nhs*sizeof(struct fib_nh));
if (fi == NULL)
goto failure;
if (cfg->fc_mx) {
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 18cbc15..664152c 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -379,7 +379,7 @@ static inline void free_leaf_info(struct leaf_info *leaf)
static struct tnode *tnode_alloc(size_t size)
{
if (size <= PAGE_SIZE)
- return kzalloc(size, GFP_KERNEL);
+ return fib_zalloc(size);
else
return vzalloc(size);
}
@@ -449,7 +449,7 @@ static struct leaf *leaf_new(void)
static struct leaf_info *leaf_info_new(int plen)
{
- struct leaf_info *li = kmalloc(sizeof(struct leaf_info), GFP_KERNEL);
+ struct leaf_info *li = fib_zalloc(sizeof(struct leaf_info));
if (li) {
li->plen = plen;
li->mask_plen = ntohl(inet_make_mask(plen));
@@ -1969,32 +1969,24 @@ void __init fib_trie_init(void)
{
fn_alias_kmem = kmem_cache_create("ip_fib_alias",
sizeof(struct fib_alias),
- 0, SLAB_PANIC, NULL);
+ 0, SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
max(sizeof(struct leaf),
sizeof(struct leaf_info)),
- 0, SLAB_PANIC, NULL);
+ 0, SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
}
struct fib_table *fib_trie_table(u32 id)
{
struct fib_table *tb;
- struct trie *t;
-
- tb = kmalloc(sizeof(struct fib_table) + sizeof(struct trie),
- GFP_KERNEL);
- if (tb == NULL)
- return NULL;
-
- tb->tb_id = id;
- tb->tb_default = -1;
- tb->tb_num_default = 0;
-
- t = (struct trie *) tb->tb_data;
- memset(t, 0, sizeof(*t));
+ tb = fib_zalloc(sizeof(struct fib_table) + sizeof(struct trie));
+ if (tb) {
+ tb->tb_id = id;
+ tb->tb_default = -1;
+ }
return tb;
}
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-27 10:01 ` Eric Dumazet
@ 2012-07-27 14:53 ` Eric W. Biederman
2012-07-27 15:12 ` Eric Dumazet
0 siblings, 1 reply; 51+ messages in thread
From: Eric W. Biederman @ 2012-07-27 14:53 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, alexander.duyck, netdev
Eric Dumazet <eric.dumazet@gmail.com> writes:
> Now that the IP route cache is removed, we should make sure fib structures
> can't share cache lines with possibly often-dirtied objects.
>
> On x86, the kmalloc-96 cache can be a source of such problems.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> +static inline void *fib_zalloc(size_t size)
> +{
> + /* We want to avoid possible false sharing */
> + return kzalloc(max_t(size_t, 128, size), GFP_KERNEL);
Why the hard coded 128 here?
It seems more portable and obvious to do
return kzalloc(round_up(size, L1_CACHE_BYTES), GFP_KERNEL);
> +}
Eric
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-27 14:53 ` Eric W. Biederman
@ 2012-07-27 15:12 ` Eric Dumazet
2012-07-27 16:23 ` Eric W. Biederman
0 siblings, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-27 15:12 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: David Miller, alexander.duyck, netdev
On Fri, 2012-07-27 at 07:53 -0700, Eric W. Biederman wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> > Now that the IP route cache is removed, we should make sure fib structures
> > can't share cache lines with possibly often-dirtied objects.
> >
> > On x86, the kmalloc-96 cache can be a source of such problems.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
>
> > +static inline void *fib_zalloc(size_t size)
> > +{
> > + /* We want to avoid possible false sharing */
> > + return kzalloc(max_t(size_t, 128, size), GFP_KERNEL);
>
> Why the hard coded 128 here?
>
> It seems more portable and obvious to do
> return kzalloc(round_up(size, L1_CACHE_BYTES), GFP_KERNEL);
>
It's not that obvious, because some machines have an apparent
L1_CACHE_BYTES of 64, but the hardware prefetches 128 bytes.
But using 2*L1_CACHE_BYTES as the minimum allocation size might be overkill
on some arches with 256-byte cache lines.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-27 15:12 ` Eric Dumazet
@ 2012-07-27 16:23 ` Eric W. Biederman
2012-07-27 16:28 ` Eric Dumazet
0 siblings, 1 reply; 51+ messages in thread
From: Eric W. Biederman @ 2012-07-27 16:23 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, alexander.duyck, netdev
Eric Dumazet <eric.dumazet@gmail.com> writes:
> On Fri, 2012-07-27 at 07:53 -0700, Eric W. Biederman wrote:
>> Eric Dumazet <eric.dumazet@gmail.com> writes:
>>
>> > Now that the IP route cache is removed, we should make sure fib structures
>> > can't share cache lines with possibly often-dirtied objects.
>> >
>> > On x86, the kmalloc-96 cache can be a source of such problems.
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>>
>>
>> > +static inline void *fib_zalloc(size_t size)
>> > +{
>> > + /* We want to avoid possible false sharing */
>> > + return kzalloc(max_t(size_t, 128, size), GFP_KERNEL);
>>
>> Why the hard coded 128 here?
>>
>> It seems more portable and obvious to do
>> return kzalloc(round_up(size, L1_CACHE_BYTES), GFP_KERNEL);
>>
>
> It's not that obvious, because some machines have an apparent
> L1_CACHE_BYTES of 64, but the hardware prefetches 128 bytes.
I am familiar. But does hardware prefetching make a difference
if your object is less than 64 bytes?
I don't believe only allocating 64 bytes will be a problem,
as no one else will be dirtying your cache line.
I suppose you could run into pathologies where your object
is 3*64 bytes in size, but your expression doesn't handle
that case either.
> But using 2*L1_CACHE_BYTES as the minimum allocation size might be overkill
> on some arches with 256-byte cache lines.
The other alternative to guarantee very good cache behavior is
to ensure you are allocating a power of two size up to some limit,
perhaps page size.
My point is the magic 128 likely requires an explicatory comment and I
think the net result is you have encoded something fragile that is good
for testing but that will in the fullness of time do strange things that
will be easy to overlook.
Eric
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-27 16:23 ` Eric W. Biederman
@ 2012-07-27 16:28 ` Eric Dumazet
2012-07-27 19:06 ` Alexander Duyck
0 siblings, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2012-07-27 16:28 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: David Miller, alexander.duyck, netdev
On Fri, 2012-07-27 at 09:23 -0700, Eric W. Biederman wrote:
> I am familiar. But does hardware prefetching make a difference
> if your object is less than 64 bytes?
>
Apparently yes, if the prefetch touches a dirtied neighbour cache line.
> I don't believe only allocating 64 bytes will be a problem,
> as no one else will be dirtying your cache line.
>
> I suppose you could run into pathologies where your object
> is 3*64 bytes in size, but your expression doesn't handle
> that case either.
>
Sure, but in most cases fib objects are under 128 bytes.
> The other alternative to guarantee very good cache behavior is
> to ensure you are allocating a power of two size up to some limit,
> perhaps page size.
>
Good idea.
> My point is the magic 128 likely requires an explicatory comment and I
> think the net result is you have encoded something fragile that is good
> for testing but that will in the fullness of time do strange things that
> will be easy to overlook.
Sure, I'll send a v2, thanks.
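Roughly, that power-of-two alternative could look like this, reusing the
fib_zalloc() name from the v1 patch above (an illustration only, not the v2
promised here):

static inline void *fib_zalloc(size_t size)
{
	/* Power-of-two sizes up to PAGE_SIZE keep objects naturally
	 * aligned, so a 64- or 128-byte object never shares a cache
	 * line with a neighbour the way 96-byte kmalloc objects can.
	 */
	if (size < PAGE_SIZE)
		size = roundup_pow_of_two(size);
	return kzalloc(size, GFP_KERNEL);
}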
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-27 16:28 ` Eric Dumazet
@ 2012-07-27 19:06 ` Alexander Duyck
0 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2012-07-27 19:06 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Eric W. Biederman, David Miller, netdev
On Fri, Jul 27, 2012 at 9:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-07-27 at 09:23 -0700, Eric W. Biederman wrote:
>
>> I am familiar. But does hardware prefetching make a difference
>> if your object is less than 64 bytes?
>>
>
> Apparently yes, if the prefetch touches a dirtied neighbour cache line.
>
>> I don't believe only allocating 64 bytes will be a problem,
>> as no one else well be dirtying your cache line.
>>
>> I suppose you could run into pathologies where your object
>> is 3*64 bytes in size, but your expression doesn't handle
>> that case either.
>>
>
> Sure, but in most cases fib objects are under 128 bytes.
>
>
>> The other alternative to guarantee very good cache behavior is
>> to ensure you are allocating a power of two size up to some limit,
>> perhaps page size.
>>
>
> Good idea.
>
>> My point is the magic 128 likely requires an explicatory comment and I
>> think the net result is you have encoded something fragile that is good
>> for testing but that will in the fullness of time do strange things that
>> will be easy to overlook.
>
> Sure, I'll send a v2, thanks.
>
>
I tested out v1 of your patch and didn't see much of a change in my
test environment. I figure I will spend most of today going through
the code trying to figure out what is causing the issues I am seeing
and why your changes had no effect on my setup.
Thanks,
Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-27 6:02 ` David Miller
2012-07-27 10:01 ` Eric Dumazet
@ 2012-07-28 4:15 ` David Miller
2012-07-28 5:45 ` Alexander Duyck
1 sibling, 1 reply; 51+ messages in thread
From: David Miller @ 2012-07-28 4:15 UTC (permalink / raw)
To: alexander.duyck; +Cc: eric.dumazet, netdev
From: David Miller <davem@davemloft.net>
Date: Thu, 26 Jul 2012 23:02:46 -0700 (PDT)
> Therefore one area of simplification would be to just return a pointer
> to the FIB nexthop, rather than the fib_info pointer and the nexthop
> index. We can get to the fib_info, if we need to, via the nh_parent
> pointer of the nexthop.
So I'm about to post an RFC set of patches which show this kind
of simplification. It gets fib_result down to two members:
u32 tclassid;
struct fib_nh *nh;
If I could get rid of that tclassid it would be really nice. But
that's hard because the tclassid is fetched from the fib_rule and
all of that lookup path is abstracted behind a common layer
that's shared between ipv4 and ipv6 so it's a bit of work changing
arg conventions.
These changes help, but only ever so slightly, in my testing.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/16] Remove the ipv4 routing cache
2012-07-28 4:15 ` David Miller
@ 2012-07-28 5:45 ` Alexander Duyck
0 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2012-07-28 5:45 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Fri, Jul 27, 2012 at 9:15 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Thu, 26 Jul 2012 23:02:46 -0700 (PDT)
>
>> Therefore one area of simplification would be to just return a pointer
>> to the FIB nexthop, rather than the fib_info pointer and the nexthop
>> index. We can get to the fib_info, if we need to, via the nh_parent
>> pointer of the nexthop.
>
> So I'm about to post an RFC set of patches which show this kind
> of simplification. It gets fib_result down to two members:
>
> u32 tclassid;
> struct fib_nh *nh;
>
> If I could get rid of that tclassid it would be really nice. But
> that's hard because the tclassid is fetched from the fib_rule and
> all of that lookup path is abstracted behind a common layer
> that's shared between ipv4 and ipv6 so it's a bit of work changing
> arg conventions.
>
> These changes help, but only ever so slightly, in my testing.
I probably won't be able to do any real testing for the patches until
Monday, or at least I may let somebody else test the patches first
just to make sure they don't panic the system before I do anything
with them.
Thanks,
Alex
^ permalink raw reply [flat|nested] 51+ messages in thread
Thread overview: 51+ messages
2012-07-20 21:25 [PATCH 00/16] Remove the ipv4 routing cache David Miller
2012-07-20 22:05 ` Eric Dumazet
2012-07-20 22:42 ` Eric Dumazet
2012-07-20 22:50 ` David Miller
2012-07-20 22:54 ` David Miller
2012-07-20 23:13 ` David Miller
2012-07-21 5:40 ` Eric Dumazet
2012-07-22 7:47 ` Vijay Subramanian
2012-07-22 19:42 ` David Miller
2012-07-23 0:39 ` David Miller
2012-07-23 7:15 ` Eric Dumazet
2012-07-23 17:54 ` Paweł Staszewski
2012-07-23 20:10 ` David Miller
2012-07-26 17:02 ` Eric Dumazet
2012-07-25 23:02 ` Alexander Duyck
2012-07-25 23:17 ` David Miller
2012-07-25 23:39 ` David Miller
2012-07-26 0:54 ` David Miller
2012-07-26 2:30 ` Alexander Duyck
2012-07-26 5:32 ` David Miller
2012-07-26 8:13 ` Eric Dumazet
2012-07-26 8:18 ` David Miller
2012-07-26 8:27 ` Eric Dumazet
2012-07-26 8:47 ` David Miller
2012-07-26 9:12 ` Eric Dumazet
2012-07-26 17:18 ` Alexander Duyck
2012-07-26 17:30 ` Eric Dumazet
2012-07-26 17:36 ` Eric Dumazet
2012-07-26 17:43 ` Eric Dumazet
2012-07-26 17:48 ` Eric Dumazet
2012-07-26 18:26 ` Alexander Duyck
2012-07-26 21:06 ` David Miller
2012-07-26 22:03 ` Alexander Duyck
2012-07-26 22:13 ` Stephen Hemminger
2012-07-26 22:19 ` Eric Dumazet
2012-07-26 22:48 ` David Miller
2012-07-26 22:53 ` David Miller
2012-07-27 2:14 ` Alexander Duyck
2012-07-27 3:08 ` David Miller
2012-07-27 6:02 ` David Miller
2012-07-27 10:01 ` Eric Dumazet
2012-07-27 14:53 ` Eric W. Biederman
2012-07-27 15:12 ` Eric Dumazet
2012-07-27 16:23 ` Eric W. Biederman
2012-07-27 16:28 ` Eric Dumazet
2012-07-27 19:06 ` Alexander Duyck
2012-07-28 4:15 ` David Miller
2012-07-28 5:45 ` Alexander Duyck
2012-07-26 18:06 ` Alexander Duyck
2012-07-26 21:00 ` David Miller
2012-07-26 20:59 ` David Miller