* Re: [PATCH] dst_entry structure use,lastuse and refcnt abstraction
From: David S. Miller @ 2005-06-24 3:36 UTC
To: christoph; +Cc: linux-kernel, linux-net, shai, akpm, netdev, herbert
From: Christoph Lameter <christoph@lameter.com>
Date: Thu, 23 Jun 2005 20:04:52 -0700 (PDT)
> The patch also removes the atomic_dec_and_test in dst_destroy.
That is the only part of this patch I'm willing to entertain
at this time, as long as Herbert Xu ACKs it. How much of that
quoted 3% gain does this change alone give you?
Also, please post networking patches to netdev@vger.kernel.org.
Because you didn't, the majority of the networking maintainers
did not see your dst patch submissions. It's mentioned in
linux/MAINTAINERS for a reason:
NETWORKING [GENERAL]
P: Networking Team
M: netdev@vger.kernel.org
L: netdev@vger.kernel.org
S: Maintained
Thanks.
* Re: [PATCH] dst_entry structure use,lastuse and refcnt abstraction
From: Christoph Lameter @ 2005-06-24 3:40 UTC
To: David S. Miller; +Cc: linux-kernel, linux-net, shai, akpm, netdev, herbert
On Thu, 23 Jun 2005, David S. Miller wrote:
> From: Christoph Lameter <christoph@lameter.com>
> Date: Thu, 23 Jun 2005 20:04:52 -0700 (PDT)
>
> > The patch also removes the atomic_dec_and_test in dst_destroy.
>
> That is the only part of this patch I'm willing to entertain
> at this time, as long as Herbert Xu ACKs it. How much of that
> quoted 3% gain does this change alone give you?
Nothing as far as I can tell. The main benefit may be reorganization of
the code.
> Also, please post networking patches to netdev@vger.kernel.org.
> Because you didn't, the majority of the networking maintainers
> did not see your dst patch submissions. It's mentioned in
> linux/MAINTAINERS for a reason:
>
> NETWORKING [GENERAL]
> P: Networking Team
> M: netdev@vger.kernel.org
> L: netdev@vger.kernel.org
> S: Maintained
>
> Thanks.
Yes, and it was recently changed. Typical use is linux-xxx@vger.kernel.org
* Re: [PATCH] dst_entry structure use,lastuse and refcnt abstraction
From: David S. Miller @ 2005-06-24 3:47 UTC
To: christoph; +Cc: linux-kernel, linux-net, shai, akpm, netdev, herbert
From: Christoph Lameter <christoph@lameter.com>
Date: Thu, 23 Jun 2005 20:40:25 -0700 (PDT)
> Nothing as far as I can tell. The main benefit may be reorganization of
> the code.
You told us way back in the original thread that the final atomic dec
shows up very much on NUMA, and if it could be avoided (and changed
into a read test), it would help a lot on NUMA.
> > NETWORKING [GENERAL]
> > P: Networking Team
> > M: netdev@vger.kernel.org
> > L: netdev@vger.kernel.org
> > S: Maintained
> >
> > Thanks.
>
> Yes, and it was recently changed. Typical use is linux-xxx@vger.kernel.org
netdev@oss.sgi.com is what used to be the place for networking
stuff, it's not netdev@vger.kernel.org
* Re: [PATCH] dst_entry structure use,lastuse and refcnt abstraction
From: Christoph Lameter @ 2005-06-24 3:49 UTC
To: David S. Miller; +Cc: linux-kernel, linux-net, shai, akpm, netdev, herbert
On Thu, 23 Jun 2005, David S. Miller wrote:
> From: Christoph Lameter <christoph@lameter.com>
> Date: Thu, 23 Jun 2005 20:40:25 -0700 (PDT)
>
> > Nothing as far as I can tell. The main benefit may be reorganization of
> > the code.
>
> You told us way back in the original thread that the final atomic dec
> shows up very much on NUMA, and if it could be avoided (and changed
> into a read test), it would help a lot on NUMA.
No, I told you that we need to disassemble the atomic dec_and_test
in order to be able to split the counters.
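A minimal sketch of what that disassembly enables, with hypothetical names (this illustrates the idea, it is not the submitted patch): the decrement becomes node-local, and the zero test in dst_destroy() becomes a read over the per-node counts.

/* Illustration only: per-node counters instead of one shared atomic_t. */
struct dst_noderef {
        atomic_t count ____cacheline_aligned_in_smp;    /* one per NUMA node */
};

struct dst_counts {
        struct dst_noderef node[MAX_NUMNODES];
};

static inline void dst_counts_put(struct dst_counts *c)
{
        /* Plain decrement on node-local memory: no shared cacheline
         * bounces, but also no fused "did it reach zero?" answer. */
        atomic_dec(&c->node[numa_node_id()].count);
}

static inline int dst_counts_read(struct dst_counts *c)
{
        int node, sum = 0;

        /* The read test that replaces atomic_dec_and_test() in
         * dst_destroy(): sum the per-node counts.  Only meaningful
         * once no new references can appear (entry already unhashed). */
        for (node = 0; node < MAX_NUMNODES; node++)
                sum += atomic_read(&c->node[node].count);
        return sum;
}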
> > Yes, and it was recently changed. Typical use is linux-xxx@vger.kernel.org
>
> netdev@oss.sgi.com is what used to be the place for networking
> stuff, it's not netdev@vger.kernel.org
s/not/now/ right?
* Re: [PATCH] dst_entry structure use,lastuse and refcnt abstraction
From: David S. Miller @ 2005-06-24 3:54 UTC
To: christoph; +Cc: linux-kernel, linux-net, shai, akpm, netdev, herbert
From: Christoph Lameter <christoph@lameter.com>
Date: Thu, 23 Jun 2005 20:49:17 -0700 (PDT)
> No, I told you that we need to disassemble the atomic dec_and_test
> in order to be able to split the counters.
Ok. You're going to have to come up with something better
than a 3% AIM benchmark increase with 5000 threads to justify
those invasive NUMA changes, and thus this infrastructure for
it.
In order to convince me on this NUMA dst stuff, you need to put away
the benchmark microscope and instead use a "macroscope" to analyze
the side effects of these changes on a higher level.
Really, what are the effects on other things? For example, what
does this do to routing performance? Does the packets-per-second
rate go down, and if so by how much?
What happens if you bind network interfaces, and the processes that
use them, to CPUs? Why would that be too intrusive to set up for the
administrator of such large NUMA machines?
What about loads that have low route locality, and thus touch many
different dst entries, not necessarily contending for a single one
amongst several nodes? Does performance go up or down for such a
case?
I'm picking those examples because I am rather certain your patches
will hurt performance in those cases. The data structure size
expansion and extra memory allocations alone for the per-node dst
stuff should be enough to do that.
> > > Yes and it was recently changed. Typical use is linux-xxx@vger.kernel.org
> >
> > netdev@oss.sgi.com is what used to be the place for networking
> > stuff, it's not netdev@vger.kernel.org
>
> s/not/now/ right?
Right. :)
* Re: [PATCH] dst_entry structure use,lastuse and refcnt abstraction
From: Christoph Lameter @ 2005-06-24 4:06 UTC
To: David S. Miller; +Cc: linux-kernel, linux-net, shai, akpm, netdev, herbert
On Thu, 23 Jun 2005, David S. Miller wrote:
> Ok. You're going to have to come up with something better
> than a 3% AIM benchmark increase with 5000 threads to justify
> those invasive NUMA changes, and thus this infrastructure for
> it.
I am sure one can do better than that. The x445 is the smallest NUMA box
that I know of and the only one available in the context of that project.
But I am also not sure that this will reach proportions that will
outweigh the complexity added by these patches. What would be
the performance benefit that we would need to see to feel that it
is right to get such a change in?
> I'm picking those examples, because I am rather certain your patches
> will hurt performance in those cases. The data structure size
> expansion and extra memory allocations alone for the per-node dst
> stuff should be good about doing that.
Hmm. I like the idea of a separate routing cache per node. It may actually
reach higher performance than splitting the counters. Let's think a bit
about that.
* Re: [PATCH] dst_entry structure use,lastuse and refcnt abstraction
From: David S. Miller @ 2005-06-24 4:11 UTC
To: christoph; +Cc: linux-kernel, linux-net, shai, akpm, netdev, herbert
From: Christoph Lameter <christoph@lameter.com>
Date: Thu, 23 Jun 2005 21:06:30 -0700 (PDT)
> I am sure one can do better than that. The x445 is the smallest NUMA box
> that I know of and the only one available in the context of that project.
> But I am also not sure that this will reach proportions that will
> outweigh the complexity added by these patches. What would be
> the performance benefit that we would need to see to feel that it
> is right to get such a change in?
Something on the order of 10%. If I recall correctly, that's
basically what we got on SMP from the RCU changes to the routing
cache.
> Hmm. I like the idea of a separate routing cache per node. It may actually
> reach higher performance than splitting the counters. Let's think a bit
> about that.
One thing that may happen down the line is that the flow cache
becomes a more unified thing. I.e., local sockets get found there,
netfilter rules can match, as well as normal "routes".
The idea is that on a miss, you go down into the "slow" lookup path
where the data structure is less distributed (SMP and NUMA wise) and
slower to search. And this populates the flow tables.
But that seems to work best on a per-cpu basis.
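As a rough sketch of that per-cpu shape (hypothetical names throughout; flow_slow_resolve() stands in for the slower shared lookup path and is not a real function):

/* Illustration only: a per-CPU flow table populated on misses. */
#define FLOW_HASH_SIZE 1024

struct flow_entry {
        u32 key;                        /* flow hash key */
        void *object;                   /* socket, route, or filter verdict */
        struct flow_entry *next;
};

static DEFINE_PER_CPU(struct flow_entry **, flow_table); /* allocated at init */

static void *flow_lookup(u32 key)
{
        struct flow_entry *fle;
        void *object = NULL;

        preempt_disable();
        fle = __get_cpu_var(flow_table)[key & (FLOW_HASH_SIZE - 1)];
        while (fle && fle->key != key)
                fle = fle->next;        /* fast path: CPU-local memory only */

        if (!fle)
                fle = flow_slow_resolve(key);   /* miss: slow path populates */
        if (fle)
                object = fle->object;
        preempt_enable();
        return object;
}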
Unfortunately, it really doesn't address your problem in and of
itself. This is because flow entries, especially when used on
sockets, would still get accessed by other cpus and thus nodes.
We would need to build something on top of the per-cpu flowcache,
such as the socket node array idea, to get something that helps
this NUMA issue as well.
Just tossing out some forward-looking ideas :)
To be honest, the unified flow cache idea has been tossed around
a lot, and it's still intellectual masturbation. But then again,
so was the design of the IPSEC stuff we have now, for several years.
* Re: [PATCH] dst numa: Avoid dst counter cacheline bouncing
From: Dipankar Sarma @ 2005-06-24 4:58 UTC
To: Christoph Lameter; +Cc: linux-kernel, netdev, davem, shai, akpm
On Thu, Jun 23, 2005 at 08:10:06PM -0700, Christoph Lameter wrote:
> The dst_entry structure contains an atomic counter that is incremented and
> decremented for each network packet being sent. If there is a high number of
> packets being sent from multiple processors then cacheline bouncing between
> NUMA nodes can limit the TCP performance in a NUMA system.
>
> The following patch splits the usage counter into counters per node.
>
Do we really need to do a distributed reference counter implementation
inside the dst cache code? If you are willing to wait for a while,
we should soon have a modified version of Rusty's bigref implementation
on top of the interleaving dynamic per-cpu allocator. We can then look
at a distributed reference counter for the dst refcount and see how
that can be worked out.
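For background, the bigref pattern being referred to looks roughly like this (a sketch of the general pattern, not Rusty Russell's actual code): per-CPU counts while the object is healthy, collapsing to a single atomic only once it starts dying.

/* Illustration only; real code must also pin the CPU (get_cpu()) and
 * handle the transition to 'dying' safely. */
struct bigref_sketch {
        int *ctr;               /* per-CPU slots from a per-cpu allocator */
        atomic_t slow_count;    /* authoritative once 'dying' is set */
        int dying;
};

static inline void bigref_sketch_get(struct bigref_sketch *ref)
{
        if (likely(!ref->dying))
                ref->ctr[smp_processor_id()]++; /* CPU-local, no bouncing */
        else
                atomic_inc(&ref->slow_count);
}

static inline void bigref_sketch_put(struct bigref_sketch *ref,
                                     void (*release)(struct bigref_sketch *))
{
        if (likely(!ref->dying))
                ref->ctr[smp_processor_id()]--;
        else if (atomic_dec_and_test(&ref->slow_count))
                release(ref);   /* last reference is gone */
}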
Thanks
Dipankar
* Re: [PATCH] dst numa: Avoid dst counter cacheline bouncing
From: Christoph Lameter @ 2005-06-24 5:05 UTC
To: Dipankar Sarma; +Cc: linux-kernel, netdev, davem, shai, akpm
On Fri, 24 Jun 2005, Dipankar Sarma wrote:
> Do we really need to do a distributed reference counter implementation
> inside the dst cache code? If you are willing to wait for a while,
> we should soon have a modified version of Rusty's bigref implementation
> on top of the interleaving dynamic per-cpu allocator. We can then look
> at a distributed reference counter for the dst refcount and see how
> that can be worked out.
Is that code available somewhere?
* Re: [PATCH] dst_entry structure use,lastuse and refcnt abstraction
From: Christoph Lameter @ 2005-06-24 6:03 UTC
To: David S. Miller; +Cc: linux-kernel, linux-net, shai, akpm, netdev, herbert
On Thu, 23 Jun 2005, David S. Miller wrote:
> > outweigh the complexity added by these patches. What would be
> > the performance benefit that we would need to see to feel that it
> > is right to get such a change in?
>
> Something on the order of 10%. If I recall correctly, that's
> basically what we got on SMP from the RCU changes to the routing
> cache.
Ok. Then we are done. With 58 Itanium processors and 200GB RAM I get
more than 10% improvement ;-). With 500 tasks we have 453 vs. 499 j/m/t.
That is 9.21%. For 300 tasks it is 9.4%, etc. I am sure that I can push
this further with larger processor counts and by addressing some other
NUMA-related performance issues.
58p SGI Altix, 200GB RAM, running 2.6.12-mm1.
With patch:
AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
1 2439.02 100 2439.0244 2.46 0.02 Fri Jun 24 00:28:01 2005
100 138664.20 92 1386.6420 4.33 106.04 Fri Jun 24 00:28:19 2005
200 191113.23 87 955.5662 6.28 218.09 Fri Jun 24 00:28:45 2005
300 219968.23 85 733.2274 8.18 329.09 Fri Jun 24 00:29:19 2005
400 237037.04 85 592.5926 10.12 442.07 Fri Jun 24 00:30:01 2005
500 249418.02 82 498.8360 12.03 551.93 Fri Jun 24 00:30:51 2005
600 257879.66 80 429.7994 13.96 663.44 Fri Jun 24 00:31:49 2005
700 264550.26 84 377.9289 15.88 774.88 Fri Jun 24 00:32:54 2005
800 269587.19 81 336.9840 17.80 886.85 Fri Jun 24 00:34:07 2005
900 273736.50 83 304.1517 19.73 997.90 Fri Jun 24 00:35:28 2005
1000 277225.89 82 277.2259 21.64 1109.18 Fri Jun 24 00:36:57 2005
Without patch:
AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
1 2439.02 100 2439.0244 2.46 0.02 Fri Jun 24 00:46:03 2005
100 131955.14 92 1319.5514 4.55 114.52 Fri Jun 24 00:46:22 2005
200 177409.82 88 887.0491 6.76 246.62 Fri Jun 24 00:46:50 2005
300 201027.47 85 670.0916 8.95 373.49 Fri Jun 24 00:47:27 2005
400 216313.65 87 540.7841 11.09 497.36 Fri Jun 24 00:48:13 2005
500 226637.46 83 453.2749 13.24 620.68 Fri Jun 24 00:49:07 2005
600 233781.41 85 389.6357 15.40 746.18 Fri Jun 24 00:50:10 2005
700 239561.94 84 342.2313 17.53 869.34 Fri Jun 24 00:51:22 2005
800 243630.09 85 304.5376 19.70 996.39 Fri Jun 24 00:52:43 2005
900 246992.64 85 274.4363 21.86 1121.90 Fri Jun 24 00:54:13 2005
1000 249979.17 84 249.9792 24.00 1245.55 Fri Jun 24 00:55:52 2005
* Re: [PATCH] dst_entry structure use,lastuse and refcnt abstraction
From: David S. Miller @ 2005-06-24 6:16 UTC
To: clameter; +Cc: linux-kernel, linux-net, shai, akpm, netdev, herbert
From: Christoph Lameter <clameter@engr.sgi.com>
Date: Thu, 23 Jun 2005 23:03:45 -0700 (PDT)
> Ok. Then we are done. With 58 Itanium processors and 200GB RAM I get
> more than 10% improvement ;-). With 500 tasks we have 453 vs. 499 j/m/t.
> That is 9.21%. For 300 tasks it is 9.4%, etc. I am sure that I can push
> this further with larger processor counts and by addressing some other
> NUMA-related performance issues.
So it took 7 times more processors to increase the performance gain by
a factor of just over 3 on a microscopic synthetic benchmark. That's not
impressive at all.
And you still haven't shown what happens for the workloads I
suggested. A web benchmark, with say a thousand unique clients, would
be sufficient for one of those btw. That case has very low dst
locality, yet dsts are useful because you'll have about 2 or 3
concurrent connections per dst.
* Re: [PATCH] dst numa: Avoid dst counter cacheline bouncing
From: Dipankar Sarma @ 2005-06-24 7:29 UTC
To: Christoph Lameter; +Cc: linux-kernel, netdev, davem, shai, akpm
On Thu, Jun 23, 2005 at 10:05:09PM -0700, Christoph Lameter wrote:
> On Fri, 24 Jun 2005, Dipankar Sarma wrote:
>
> > Do we really need to do a distributed reference counter implementation
> > inside the dst cache code? If you are willing to wait for a while,
> > we should soon have a modified version of Rusty's bigref implementation
> > on top of the interleaving dynamic per-cpu allocator. We can then look
> > at a distributed reference counter for the dst refcount and see how
> > that can be worked out.
>
> Is that code available somewhere?
Various places in lkml discussions. Search for discussions on the
dynamic per-cpu allocator. Currently Bharata is adding CPU hotplug
awareness to it, but the basic patches work.

BTW, I am not saying that bigref has what you need. What I am trying
to say is that you should see if something like bigref can be tweaked
for use in your case before implementing a new type of ref counting
wholly in dst code.
Thanks
Dipankar
* [PATCH] bugfix and scalability changes in net/ipv4/route.c
From: Eric Dumazet @ 2005-06-24 10:57 UTC
To: David S. Miller, netdev; +Cc: linux-net, herbert
Hi David
This is based on a previous patch I sent 3 months ago, reduced to only
the essential part, since the previous version got no traction on netdev
(some people said they were working on this area, but there has been no
progress since).
Reminder of the bugfix:

rt_check_expire() has a serious problem on machines with large
route caches and a standard HZ value of 1000.

With default values, i.e. ip_rt_gc_interval = 60*HZ = 60000,
the loop counter in:

for (t = ip_rt_gc_interval << rt_hash_log; t >= 0;

overflows (t is a 31-bit signed value) as soon as rt_hash_log is >= 16
(65536 slots in the route cache hash table).

In this case, rt_check_expire() does nothing at all.
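To spell out the arithmetic (a worked example using the defaults above, added for illustration): 60000 << 16 = 3,932,160,000, which exceeds INT_MAX = 2,147,483,647, so the 32-bit signed result wraps negative and the t >= 0 test fails before the first iteration. A standalone demonstration:

/* Demonstration of the overflow; illustration only. */
#include <stdio.h>

int main(void)
{
        int ip_rt_gc_interval = 60 * 1000;      /* 60*HZ with HZ=1000 */
        int rt_hash_log = 16;                   /* 2^16 = 65536 hash slots */
        int t;

        /* 60000 << 16 overflows a signed 32-bit int; the result wraps
         * negative on two's-complement machines (formally undefined). */
        t = ip_rt_gc_interval << rt_hash_log;
        printf("t = %d\n", t);                  /* prints t = -362807296 */

        if (t < 0)
                printf("the t >= 0 loop body never runs\n");
        return 0;
}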
Thank you
Eric Dumazet
[NET] Scalability fixes in net/ipv4/route.c, bugfix in rt_check_expire
- Locking abstraction
- Spinlocks moved out of the rt hash table: less memory (50%) used by the rt hash table; it's a win even on UP.
- Sizing of the spinlock table depends on NR_CPUS
- rt hash table allocated using the alloc_large_system_hash() function
- rt_check_expire() fixes (an overflow was possible)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
--- linux-2.6.12/net/ipv4/route.c 2005-06-17 21:48:29.000000000 +0200
+++ linux-2.6.12-ed/net/ipv4/route.c 2005-06-24 11:10:06.000000000 +0200
@@ -54,6 +54,7 @@
* Marc Boucher : routing by fwmark
* Robert Olsson : Added rt_cache statistics
* Arnaldo C. Melo : Convert proc stuff to seq_file
+ * Eric Dumazet : hashed spinlocks and rt_check_expire() fixes.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -70,6 +71,7 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/mm.h>
+#include <linux/bootmem.h>
#include <linux/string.h>
#include <linux/socket.h>
#include <linux/sockios.h>
@@ -201,8 +203,37 @@
struct rt_hash_bucket {
struct rtable *chain;
- spinlock_t lock;
-} __attribute__((__aligned__(8)));
+};
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+/*
+ * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
+ * The size of this table is a power of two and depends on the number of CPUS.
+ */
+#if NR_CPUS >= 32
+#define RT_HASH_LOCK_SZ 4096
+#elif NR_CPUS >= 16
+#define RT_HASH_LOCK_SZ 2048
+#elif NR_CPUS >= 8
+#define RT_HASH_LOCK_SZ 1024
+#elif NR_CPUS >= 4
+#define RT_HASH_LOCK_SZ 512
+#else
+#define RT_HASH_LOCK_SZ 256
+#endif
+
+ static spinlock_t *rt_hash_locks;
+# define rt_hash_lock_addr(slot) &rt_hash_locks[slot & (RT_HASH_LOCK_SZ - 1)]
+# define rt_hash_lock_init() { \
+ int i; \
+ rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ, GFP_KERNEL); \
+ if (!rt_hash_locks) panic("IP: failed to allocate rt_hash_locks\n"); \
+ for (i = 0; i < RT_HASH_LOCK_SZ; i++) \
+ spin_lock_init(&rt_hash_locks[i]); \
+ }
+#else
+# define rt_hash_lock_addr(slot) NULL
+# define rt_hash_lock_init()
+#endif
static struct rt_hash_bucket *rt_hash_table;
static unsigned rt_hash_mask;
@@ -575,19 +606,24 @@
/* This runs via a timer and thus is always in BH context. */
static void rt_check_expire(unsigned long dummy)
{
- static int rover;
- int i = rover, t;
+ static unsigned int rover;
+ int unsigned i = rover, goal;
struct rtable *rth, **rthp;
unsigned long now = jiffies;
-
- for (t = ip_rt_gc_interval << rt_hash_log; t >= 0;
- t -= ip_rt_gc_timeout) {
+ u64 mult ;
+ mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
+ if (ip_rt_gc_timeout > 1)
+ do_div(mult, ip_rt_gc_timeout);
+ goal = (unsigned int)mult;
+ if (goal > rt_hash_mask) goal = rt_hash_mask + 1;
+ for ( ; goal > 0 ; goal--) {
unsigned long tmo = ip_rt_gc_timeout;
i = (i + 1) & rt_hash_mask;
rthp = &rt_hash_table[i].chain;
- spin_lock(&rt_hash_table[i].lock);
+ if (*rthp == 0) continue ;
+ spin_lock(rt_hash_lock_addr(i));
while ((rth = *rthp) != NULL) {
if (rth->u.dst.expires) {
/* Entry is expired even if it is in use */
@@ -620,8 +656,7 @@
rt_free(rth);
#endif /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */
}
- spin_unlock(&rt_hash_table[i].lock);
-
+ spin_unlock(rt_hash_lock_addr(i));
/* Fallback loop breaker. */
if (time_after(jiffies, now))
break;
@@ -643,11 +678,11 @@
get_random_bytes(&rt_hash_rnd, 4);
for (i = rt_hash_mask; i >= 0; i--) {
- spin_lock_bh(&rt_hash_table[i].lock);
+ spin_lock_bh(rt_hash_lock_addr(i));
rth = rt_hash_table[i].chain;
if (rth)
rt_hash_table[i].chain = NULL;
- spin_unlock_bh(&rt_hash_table[i].lock);
+ spin_unlock_bh(rt_hash_lock_addr(i));
for (; rth; rth = next) {
next = rth->u.rt_next;
@@ -780,7 +815,7 @@
k = (k + 1) & rt_hash_mask;
rthp = &rt_hash_table[k].chain;
- spin_lock_bh(&rt_hash_table[k].lock);
+ spin_lock_bh(rt_hash_lock_addr(k));
while ((rth = *rthp) != NULL) {
if (!rt_may_expire(rth, tmo, expire)) {
tmo >>= 1;
@@ -812,7 +847,7 @@
goal--;
#endif /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */
}
- spin_unlock_bh(&rt_hash_table[k].lock);
+ spin_unlock_bh(rt_hash_lock_addr(k));
if (goal <= 0)
break;
}
@@ -882,7 +917,7 @@
rthp = &rt_hash_table[hash].chain;
- spin_lock_bh(&rt_hash_table[hash].lock);
+ spin_lock_bh(rt_hash_lock_addr(hash));
while ((rth = *rthp) != NULL) {
#ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED
if (!(rth->u.dst.flags & DST_BALANCED) &&
@@ -908,7 +943,7 @@
rth->u.dst.__use++;
dst_hold(&rth->u.dst);
rth->u.dst.lastuse = now;
- spin_unlock_bh(&rt_hash_table[hash].lock);
+ spin_unlock_bh(rt_hash_lock_addr(hash));
rt_drop(rt);
*rp = rth;
@@ -949,7 +984,7 @@
if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
int err = arp_bind_neighbour(&rt->u.dst);
if (err) {
- spin_unlock_bh(&rt_hash_table[hash].lock);
+ spin_unlock_bh(rt_hash_lock_addr(hash));
if (err != -ENOBUFS) {
rt_drop(rt);
@@ -990,7 +1025,7 @@
}
#endif
rt_hash_table[hash].chain = rt;
- spin_unlock_bh(&rt_hash_table[hash].lock);
+ spin_unlock_bh(rt_hash_lock_addr(hash));
*rp = rt;
return 0;
}
@@ -1058,7 +1093,7 @@
{
struct rtable **rthp;
- spin_lock_bh(&rt_hash_table[hash].lock);
+ spin_lock_bh(rt_hash_lock_addr(hash));
ip_rt_put(rt);
for (rthp = &rt_hash_table[hash].chain; *rthp;
rthp = &(*rthp)->u.rt_next)
@@ -1067,7 +1102,7 @@
rt_free(rt);
break;
}
- spin_unlock_bh(&rt_hash_table[hash].lock);
+ spin_unlock_bh(rt_hash_lock_addr(hash));
}
void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
@@ -3069,12 +3104,14 @@
int __init ip_rt_init(void)
{
- int i, order, goal, rc = 0;
+ int rc = 0;
rt_hash_rnd = (int) ((num_physpages ^ (num_physpages>>8)) ^
(jiffies ^ (jiffies >> 7)));
#ifdef CONFIG_NET_CLS_ROUTE
+ {
+ int order;
for (order = 0;
(PAGE_SIZE << order) < 256 * sizeof(struct ip_rt_acct) * NR_CPUS; order++)
/* NOTHING */;
@@ -3082,6 +3119,7 @@
if (!ip_rt_acct)
panic("IP: failed to allocate ip_rt_acct\n");
memset(ip_rt_acct, 0, PAGE_SIZE << order);
+ }
#endif
ipv4_dst_ops.kmem_cachep = kmem_cache_create("ip_dst_cache",
@@ -3092,39 +3130,22 @@
if (!ipv4_dst_ops.kmem_cachep)
panic("IP: failed to allocate ip_dst_cache\n");
- goal = num_physpages >> (26 - PAGE_SHIFT);
- if (rhash_entries)
- goal = (rhash_entries * sizeof(struct rt_hash_bucket)) >> PAGE_SHIFT;
- for (order = 0; (1UL << order) < goal; order++)
- /* NOTHING */;
-
- do {
- rt_hash_mask = (1UL << order) * PAGE_SIZE /
- sizeof(struct rt_hash_bucket);
- while (rt_hash_mask & (rt_hash_mask - 1))
- rt_hash_mask--;
- rt_hash_table = (struct rt_hash_bucket *)
- __get_free_pages(GFP_ATOMIC, order);
- } while (rt_hash_table == NULL && --order > 0);
-
- if (!rt_hash_table)
- panic("Failed to allocate IP route cache hash table\n");
-
- printk(KERN_INFO "IP: routing cache hash table of %u buckets, %ldKbytes\n",
- rt_hash_mask,
- (long) (rt_hash_mask * sizeof(struct rt_hash_bucket)) / 1024);
-
- for (rt_hash_log = 0; (1 << rt_hash_log) != rt_hash_mask; rt_hash_log++)
- /* NOTHING */;
-
+ rt_hash_table = (struct rt_hash_bucket *)
+ alloc_large_system_hash("IP route cache",
+ sizeof(struct rt_hash_bucket),
+ rhash_entries,
+ (num_physpages >= 128 * 1024) ?
+ (27 - PAGE_SHIFT) :
+ (29 - PAGE_SHIFT),
+ HASH_HIGHMEM,
+ &rt_hash_log,
+ &rt_hash_mask,
+ 0);
+ memset(rt_hash_table, 0, rt_hash_mask * sizeof(struct rt_hash_bucket));
+ rt_hash_lock_init();
+ ipv4_dst_ops.gc_thresh = rt_hash_mask;
+ ip_rt_max_size = rt_hash_mask * 16;
rt_hash_mask--;
- for (i = 0; i <= rt_hash_mask; i++) {
- spin_lock_init(&rt_hash_table[i].lock);
- rt_hash_table[i].chain = NULL;
- }
-
- ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
- ip_rt_max_size = (rt_hash_mask + 1) * 16;
rt_cache_stat = alloc_percpu(struct rt_cache_stat);
if (!rt_cache_stat)
* Re: [PATCH] bugfix and scalability changes in net/ipv4/route.c
From: David S. Miller @ 2005-06-28 20:14 UTC
To: dada1; +Cc: netdev, linux-net, herbert
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 24 Jun 2005 12:57:47 +0200
> Reminder of the bugfix:
>
> rt_check_expire() has a serious problem on machines with large
> route caches and a standard HZ value of 1000.
>
> With default values, i.e. ip_rt_gc_interval = 60*HZ = 60000,
> the loop counter in:
>
> for (t = ip_rt_gc_interval << rt_hash_log; t >= 0;
>
> overflows (t is a 31-bit signed value) as soon as rt_hash_log is >= 16
> (65536 slots in the route cache hash table).
>
> In this case, rt_check_expire() does nothing at all.
I'd like this bug fix as a separate patch.
Also, coding style problems in the spinlock part of the patch:
+ static spinlock_t *rt_hash_locks;
That's a file static variable, no need to tab it at all.
+ if (*rthp == 0) continue ;
Please put the continue on a separate line, properly
tabbed, and without a space between the continue and the
closing semicolon.
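That is, the flagged hunk would conventionally read:

        if (*rthp == 0)
                continue;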
Please scan the rest of your patch for problems like this.
So, again, please submit a separate patch for the overflow
bug, then one for the locking changes, so they may be evaluated
and applied separately.
Thanks a lot.
^ permalink raw reply [flat|nested] 14+ messages in thread