* ARP does not scale
@ 2004-03-13 18:29 Tim Gardner
2004-03-13 23:32 ` Anton Blanchard
2004-03-26 2:30 ` Mike Fedyk
0 siblings, 2 replies; 4+ messages in thread
From: Tim Gardner @ 2004-03-13 18:29 UTC (permalink / raw)
To: linux-kernel
ARP has hash table and garbage collection scalability limitations.
Please find attached a patch for 2.4.24 (also works for 2.4.25) that allows
you to choose the size of your ARP hash table. Simply choosing the ARP hash
table size that suits your network can cut the ARP list search time by an
order of magnitude. For example, the default hash table size is 32. My
network has over 1000 nodes. If you assume the hash function spreads those
evenly, then there are approximately 31 hash collisions per bucket.
Increasing the hash table size to 512 cuts down the collisions to
approximately 2 per bucket.
Related to scalability is the ARP entry garbage collector. It is currently
coded such that it runs at periodic intervals and processes all ARP entries
in all hash table buckets all within one invocation. As the number of ARP
entries becomes large, the amount of time spent processing all of the ARP
entries also becomes large. At a certain point I think it will begin to cause
significant jitter in packet delivery times. In an effort to smooth out the
garbage collector impact I've recoded it to only process one hash bucket per
invocation. The garbage collector resets its timer so that it processes all
buckets at least once every 60 seconds.
rtg
<--SNIP-->
diff -r -u --new-file linux-2.4.24/Documentation/Configure.help
linux-2.4.24.arp/Documentation/Configure.help
--- linux-2.4.24/Documentation/Configure.help 2004-03-12 07:34:37.000000000
-0700
+++ linux-2.4.24.arp/Documentation/Configure.help 2004-03-12
09:07:01.000000000 -0700
@@ -6921,6 +6921,13 @@
If unsure, say N.
+ARP hash table size power of 2
+CONFIG_NEIGH_NUM_HASHBITS
+ This option defines the size of the ARP hash table for each protocol. The
default size of 5
+ bits (32) is adequate for small LANs. Larger LANs may have enough ARP
entries that there are
+ multiple hash table collisions. In that case, increasing the number of hash
table entries
+ improves ARP entry access performance at the expense of some RAM.
+
Packet socket
CONFIG_PACKET
The Packet protocol is used by applications which communicate
diff -r -u --new-file linux-2.4.24/include/net/neighbour.h
linux-2.4.24.arp/include/net/neighbour.h
--- linux-2.4.24/include/net/neighbour.h 2004-03-12 07:34:44.000000000 -0700
+++ linux-2.4.24.arp/include/net/neighbour.h 2004-03-12 09:35:07.000000000
-0700
@@ -128,7 +128,7 @@
u8 key[0];
};
-#define NEIGH_HASHMASK 0x1F
+#define NEIGH_HASHMASK ((1<<CONFIG_NEIGH_NUM_HASHBITS)-1)
#define PNEIGH_HASHMASK 0xF
/*
@@ -166,6 +166,7 @@
struct tasklet_struct gc_task;
struct neigh_statistics stats;
struct neighbour *hash_buckets[NEIGH_HASHMASK+1];
+ int curr_hash_bucket;
struct pneigh_entry *phash_buckets[PNEIGH_HASHMASK+1];
};
diff -r -u --new-file linux-2.4.24/net/Config.in
linux-2.4.24.arp/net/Config.in
--- linux-2.4.24/net/Config.in 2004-03-12 07:33:30.000000000 -0700
+++ linux-2.4.24.arp/net/Config.in 2004-03-12 09:05:10.000000000 -0700
@@ -8,6 +8,8 @@
bool ' Packet socket: mmapped IO' CONFIG_PACKET_MMAP
fi
+int 'ARP hash table size power of 2' CONFIG_NEIGH_NUM_HASHBITS 5
+
tristate 'Netlink device emulation' CONFIG_NETLINK_DEV
bool 'Network packet filtering (replaces ipchains)' CONFIG_NETFILTER
diff -r -u --new-file linux-2.4.24/net/core/neighbour.c
linux-2.4.24.arp/net/core/neighbour.c
--- linux-2.4.24/net/core/neighbour.c 2004-03-12 07:33:13.000000000 -0700
+++ linux-2.4.24.arp/net/core/neighbour.c 2004-03-12 09:36:35.000000000 -0700
@@ -566,9 +566,8 @@
static void SMP_TIMER_NAME(neigh_periodic_timer)(unsigned long arg)
{
struct neigh_table *tbl = (struct neigh_table*)arg;
+ struct neighbour *n, **np;
unsigned long now = jiffies;
- int i;
-
write_lock(&tbl->lock);
@@ -583,46 +582,48 @@
p->reachable_time = neigh_rand_reach_time(p->base_reachable_time);
}
- for (i=0; i <= NEIGH_HASHMASK; i++) {
- struct neighbour *n, **np;
+ tbl->curr_hash_bucket &= NEIGH_HASHMASK;
+ np = &tbl->hash_buckets[tbl->curr_hash_bucket++];
- np = &tbl->hash_buckets[i];
- while ((n = *np) != NULL) {
- unsigned state;
-
- write_lock(&n->lock);
-
- state = n->nud_state;
- if (state&(NUD_PERMANENT|NUD_IN_TIMER)) {
- write_unlock(&n->lock);
- goto next_elt;
- }
+ while ((n = *np) != NULL) {
+ unsigned state;
- if ((long)(n->used - n->confirmed) < 0)
- n->used = n->confirmed;
+ write_lock(&n->lock);
- if (atomic_read(&n->refcnt) == 1 &&
- (state == NUD_FAILED || now - n->used > n->parms->gc_staletime)) {
- *np = n->next;
- n->dead = 1;
- write_unlock(&n->lock);
- neigh_release(n);
- continue;
- }
+ state = n->nud_state;
+ if (state&(NUD_PERMANENT|NUD_IN_TIMER)) {
+ write_unlock(&n->lock);
+ goto next_elt;
+ }
- if (n->nud_state&NUD_REACHABLE &&
- now - n->confirmed > n->parms->reachable_time) {
- n->nud_state = NUD_STALE;
- neigh_suspect(n);
- }
+ if ((long)(n->used - n->confirmed) < 0)
+ n->used = n->confirmed;
+
+ if (atomic_read(&n->refcnt) == 1 &&
+ (state == NUD_FAILED || now - n->used > n->parms->gc_staletime)) {
+ *np = n->next;
+ n->dead = 1;
write_unlock(&n->lock);
+ neigh_release(n);
+ continue;
+ }
-next_elt:
- np = &n->next;
+ if (n->nud_state&NUD_REACHABLE &&
+ now - n->confirmed > n->parms->reachable_time) {
+ n->nud_state = NUD_STALE;
+ neigh_suspect(n);
}
+ write_unlock(&n->lock);
+
+next_elt:
+ np = &n->next;
}
- mod_timer(&tbl->gc_timer, now + tbl->gc_interval);
+ /*
+ * Cycle through all hash buckets every 60 seconds.
+ */
+ mod_timer(&tbl->gc_timer, now+((HZ*60)/(NEIGH_HASHMASK+1)));
+
write_unlock(&tbl->lock);
}
--
Tim Gardner - timg@tpi.com
www.tpi.com 406-443-5357
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: ARP does not scale
2004-03-13 18:29 ARP does not scale Tim Gardner
@ 2004-03-13 23:32 ` Anton Blanchard
2004-03-14 3:56 ` Tim Gardner
2004-03-26 2:30 ` Mike Fedyk
1 sibling, 1 reply; 4+ messages in thread
From: Anton Blanchard @ 2004-03-13 23:32 UTC (permalink / raw)
To: Tim Gardner; +Cc: netdev
[moving this to netdev]
> ARP has hash table and garbage collection scalability limitations.
>
> Please find attached a patch for 2.4.24 (also works for 2.4.25) that allows
> you to choose the size of your ARP hash table. Simply choosing the ARP hash
> table size that suits your network can cut the ARP list search time by an
> order of magnitude. For example, the default hash table size is 32. My
> network has over 1000 nodes. If you assume the hash function spreads those
> evenly, then there are approximately 31 hash collisions per bucket.
> Increasing the hash table size to 512 cuts down the collisions to
> approximately 2 per bucket.
It would be nice to fix this properly in 2.6, a CONFIG option for this sort
of thing is painful for distros.
Check out net/ipv4/tcp.c and how it makes an effort to scale tcp_ehash and
tcp_bhash with memory and also allows an override with a cmdline option
thash_entries.
Anton
> Related to scalability is the ARP entry garbage collector. It is currently
> coded such that it runs at periodic intervals and processes all ARP entries
> in all hash table buckets all within one invocation. As the number of ARP
> entries becomes large, the amount of time spent processing all of the ARP
> entries also becomes large. At a certain point I think it will begin to cause
> significant jitter in packet delivery times. In an effort to smooth out the
> garbage collector impact I've recoded it to only process one hash bucket per
> invocation. The garbage collector resets its timer so that it processes all
> buckets at least once every 60 seconds.
>
> rtg
>
> <--SNIP-->
>
> diff -r -u --new-file linux-2.4.24/Documentation/Configure.help
> linux-2.4.24.arp/Documentation/Configure.help
> --- linux-2.4.24/Documentation/Configure.help 2004-03-12 07:34:37.000000000
> -0700
> +++ linux-2.4.24.arp/Documentation/Configure.help 2004-03-12
> 09:07:01.000000000 -0700
> @@ -6921,6 +6921,13 @@
>
> If unsure, say N.
>
> +ARP hash table size power of 2
> +CONFIG_NEIGH_NUM_HASHBITS
> + This option defines the size of the ARP hash table for each protocol. The
> default size of 5
> + bits (32) is adequate for small LANs. Larger LANs may have enough ARP
> entries that there are
> + multiple hash table collisions. In that case, increasing the number of hash
> table entries
> + improves ARP entry access performance at the expense of some RAM.
> +
> Packet socket
> CONFIG_PACKET
> The Packet protocol is used by applications which communicate
> diff -r -u --new-file linux-2.4.24/include/net/neighbour.h
> linux-2.4.24.arp/include/net/neighbour.h
> --- linux-2.4.24/include/net/neighbour.h 2004-03-12 07:34:44.000000000 -0700
> +++ linux-2.4.24.arp/include/net/neighbour.h 2004-03-12 09:35:07.000000000
> -0700
> @@ -128,7 +128,7 @@
> u8 key[0];
> };
>
> -#define NEIGH_HASHMASK 0x1F
> +#define NEIGH_HASHMASK ((1<<CONFIG_NEIGH_NUM_HASHBITS)-1)
> #define PNEIGH_HASHMASK 0xF
>
> /*
> @@ -166,6 +166,7 @@
> struct tasklet_struct gc_task;
> struct neigh_statistics stats;
> struct neighbour *hash_buckets[NEIGH_HASHMASK+1];
> + int curr_hash_bucket;
> struct pneigh_entry *phash_buckets[PNEIGH_HASHMASK+1];
> };
>
> diff -r -u --new-file linux-2.4.24/net/Config.in
> linux-2.4.24.arp/net/Config.in
> --- linux-2.4.24/net/Config.in 2004-03-12 07:33:30.000000000 -0700
> +++ linux-2.4.24.arp/net/Config.in 2004-03-12 09:05:10.000000000 -0700
> @@ -8,6 +8,8 @@
> bool ' Packet socket: mmapped IO' CONFIG_PACKET_MMAP
> fi
>
> +int 'ARP hash table size power of 2' CONFIG_NEIGH_NUM_HASHBITS 5
> +
> tristate 'Netlink device emulation' CONFIG_NETLINK_DEV
>
> bool 'Network packet filtering (replaces ipchains)' CONFIG_NETFILTER
> diff -r -u --new-file linux-2.4.24/net/core/neighbour.c
> linux-2.4.24.arp/net/core/neighbour.c
> --- linux-2.4.24/net/core/neighbour.c 2004-03-12 07:33:13.000000000 -0700
> +++ linux-2.4.24.arp/net/core/neighbour.c 2004-03-12 09:36:35.000000000 -0700
> @@ -566,9 +566,8 @@
> static void SMP_TIMER_NAME(neigh_periodic_timer)(unsigned long arg)
> {
> struct neigh_table *tbl = (struct neigh_table*)arg;
> + struct neighbour *n, **np;
> unsigned long now = jiffies;
> - int i;
> -
>
> write_lock(&tbl->lock);
>
> @@ -583,46 +582,48 @@
> p->reachable_time = neigh_rand_reach_time(p->base_reachable_time);
> }
>
> - for (i=0; i <= NEIGH_HASHMASK; i++) {
> - struct neighbour *n, **np;
> + tbl->curr_hash_bucket &= NEIGH_HASHMASK;
> + np = &tbl->hash_buckets[tbl->curr_hash_bucket++];
>
> - np = &tbl->hash_buckets[i];
> - while ((n = *np) != NULL) {
> - unsigned state;
> -
> - write_lock(&n->lock);
> -
> - state = n->nud_state;
> - if (state&(NUD_PERMANENT|NUD_IN_TIMER)) {
> - write_unlock(&n->lock);
> - goto next_elt;
> - }
> + while ((n = *np) != NULL) {
> + unsigned state;
>
> - if ((long)(n->used - n->confirmed) < 0)
> - n->used = n->confirmed;
> + write_lock(&n->lock);
>
> - if (atomic_read(&n->refcnt) == 1 &&
> - (state == NUD_FAILED || now - n->used > n->parms->gc_staletime)) {
> - *np = n->next;
> - n->dead = 1;
> - write_unlock(&n->lock);
> - neigh_release(n);
> - continue;
> - }
> + state = n->nud_state;
> + if (state&(NUD_PERMANENT|NUD_IN_TIMER)) {
> + write_unlock(&n->lock);
> + goto next_elt;
> + }
>
> - if (n->nud_state&NUD_REACHABLE &&
> - now - n->confirmed > n->parms->reachable_time) {
> - n->nud_state = NUD_STALE;
> - neigh_suspect(n);
> - }
> + if ((long)(n->used - n->confirmed) < 0)
> + n->used = n->confirmed;
> +
> + if (atomic_read(&n->refcnt) == 1 &&
> + (state == NUD_FAILED || now - n->used > n->parms->gc_staletime)) {
> + *np = n->next;
> + n->dead = 1;
> write_unlock(&n->lock);
> + neigh_release(n);
> + continue;
> + }
>
> -next_elt:
> - np = &n->next;
> + if (n->nud_state&NUD_REACHABLE &&
> + now - n->confirmed > n->parms->reachable_time) {
> + n->nud_state = NUD_STALE;
> + neigh_suspect(n);
> }
> + write_unlock(&n->lock);
> +
> +next_elt:
> + np = &n->next;
> }
>
> - mod_timer(&tbl->gc_timer, now + tbl->gc_interval);
> + /*
> + * Cycle through all hash buckets every 60 seconds.
> + */
> + mod_timer(&tbl->gc_timer, now+((HZ*60)/(NEIGH_HASHMASK+1)));
> +
> write_unlock(&tbl->lock);
> }
>
>
> --
> Tim Gardner - timg@tpi.com
> www.tpi.com 406-443-5357
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: ARP does not scale
2004-03-13 23:32 ` Anton Blanchard
@ 2004-03-14 3:56 ` Tim Gardner
0 siblings, 0 replies; 4+ messages in thread
From: Tim Gardner @ 2004-03-14 3:56 UTC (permalink / raw)
To: Anton Blanchard; +Cc: netdev
On Saturday 13 March 2004 16:32, Anton Blanchard wrote:
> [moving this to netdev]
Anton,
I figured out how to subscribe.
> It would be nice to fix this properly in 2.6, a CONFIG option for this sort
> of thing is painful for distros.
>
> Check out net/ipv4/tcp.c and how it makes an effort to scale tcp_ehash and
> tcp_bhash with memory and also allows an override with a cmdline option
> thash_entries.
>
> Anton
>
I checked out how tcp_ehash and tcp_bhash auto-size based on the amount of
RAM. I think one can usually make the same assumption about the relative
number of ARP entries. I say usually because I have some core routers that
have over a thousand ARP entries. In that case I would like to override the
auto-size calculation so that I can force enough hash buckets such that there
is never (or rarely) a hash collision.
How about a compromise. Leave in the config option, but default it to 0 which
enables the auto-size algorithm. Otherwise, use the size specified in the
config option.
rtg
--
Tim Gardner - timg@tpi.com
www.tpi.com 406-443-5357
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: ARP does not scale
2004-03-13 18:29 ARP does not scale Tim Gardner
2004-03-13 23:32 ` Anton Blanchard
@ 2004-03-26 2:30 ` Mike Fedyk
1 sibling, 0 replies; 4+ messages in thread
From: Mike Fedyk @ 2004-03-26 2:30 UTC (permalink / raw)
To: timg; +Cc: linux-kernel
Tim Gardner wrote:
> ARP has hash table and garbage collection scalability limitations.
>
Is this needed for 2.6 also?
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2004-03-26 2:30 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-03-13 18:29 ARP does not scale Tim Gardner
2004-03-13 23:32 ` Anton Blanchard
2004-03-14 3:56 ` Tim Gardner
2004-03-26 2:30 ` Mike Fedyk
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.