netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: ARP does not scale
       [not found] <200403131129.51083.timg@tpi.com>
@ 2004-03-13 23:32 ` Anton Blanchard
  2004-03-14  3:56   ` Tim Gardner
  0 siblings, 1 reply; 2+ messages in thread
From: Anton Blanchard @ 2004-03-13 23:32 UTC (permalink / raw)
  To: Tim Gardner; +Cc: netdev


[moving this to netdev]

> ARP has hash table and garbage collection scalability limitations.
> 
> Please find attached a patch for 2.4.24 (also works for 2.4.25) that allows
> you to choose the size of your ARP hash table. Simply choosing the ARP hash
> table size that suits your network can cut the ARP list search time by an
> order of magnitude. For example, the default hash table size is 32. My
> network has over 1000 nodes. If you assume the hash function spreads those
> evenly, then there are approximately 31 hash collisions per bucket.
> Increasing the hash table size to 512 cuts down the collisions to
> approximately 2 per bucket.

It would be nice to fix this properly in 2.6, a CONFIG option for this sort 
of thing is painful for distros.

Check out net/ipv4/tcp.c and how it makes an effort to scale tcp_ehash and 
tcp_bhash with memory and also allows an override with a cmdline option 
thash_entries.

Anton

> Related to scalability is the ARP entry garbage collector. It is currently
> coded such that it runs at periodic intervals and processes all ARP entries
> in all hash table buckets all within one invocation. As the number of ARP
> entries becomes large, the amount of time spent processing all of the ARP
> entries also becomes large. At a certain point I think it will begin to cause
> significant jitter in packet delivery times. In an effort to smooth out the
> garbage collector impact I've recoded it to only process one hash bucket per
> invocation. The garbage collector resets its timer so that it processes all
> buckets at least once every 60 seconds.
> 
> rtg
> 
> <--SNIP-->
> 
> diff -r -u --new-file linux-2.4.24/Documentation/Configure.help 
> linux-2.4.24.arp/Documentation/Configure.help
> --- linux-2.4.24/Documentation/Configure.help	2004-03-12 07:34:37.000000000 
> -0700
> +++ linux-2.4.24.arp/Documentation/Configure.help	2004-03-12 
> 09:07:01.000000000 -0700
> @@ -6921,6 +6921,13 @@
>  
>    If unsure, say N.
>  
> +ARP hash table size power of 2
> +CONFIG_NEIGH_NUM_HASHBITS
> +  This option defines the size of the ARP hash table for each protocol. The 
> default size of 5
> +  bits (32) is adequate for small LANs. Larger LANs may have enough ARP 
> entries that there are
> +  multiple hash table collisions. In that case, increasing the number of hash 
> table entries
> +  improves ARP entry access performance at the expense of some RAM.
> +
>  Packet socket
>  CONFIG_PACKET
>    The Packet protocol is used by applications which communicate
> diff -r -u --new-file linux-2.4.24/include/net/neighbour.h 
> linux-2.4.24.arp/include/net/neighbour.h
> --- linux-2.4.24/include/net/neighbour.h	2004-03-12 07:34:44.000000000 -0700
> +++ linux-2.4.24.arp/include/net/neighbour.h	2004-03-12 09:35:07.000000000 
> -0700
> @@ -128,7 +128,7 @@
>  	u8			key[0];
>  };
>  
> -#define NEIGH_HASHMASK		0x1F
> +#define NEIGH_HASHMASK		((1<<CONFIG_NEIGH_NUM_HASHBITS)-1)
>  #define PNEIGH_HASHMASK		0xF
>  
>  /*
> @@ -166,6 +166,7 @@
>  	struct tasklet_struct	gc_task;
>  	struct neigh_statistics	stats;
>  	struct neighbour	*hash_buckets[NEIGH_HASHMASK+1];
> +	int					curr_hash_bucket;
>  	struct pneigh_entry	*phash_buckets[PNEIGH_HASHMASK+1];
>  };
>  
> diff -r -u --new-file linux-2.4.24/net/Config.in 
> linux-2.4.24.arp/net/Config.in
> --- linux-2.4.24/net/Config.in	2004-03-12 07:33:30.000000000 -0700
> +++ linux-2.4.24.arp/net/Config.in	2004-03-12 09:05:10.000000000 -0700
> @@ -8,6 +8,8 @@
>     bool '  Packet socket: mmapped IO' CONFIG_PACKET_MMAP
>  fi
>  
> +int 'ARP hash table size power of 2' CONFIG_NEIGH_NUM_HASHBITS 5
> +
>  tristate 'Netlink device emulation' CONFIG_NETLINK_DEV
>  
>  bool 'Network packet filtering (replaces ipchains)' CONFIG_NETFILTER
> diff -r -u --new-file linux-2.4.24/net/core/neighbour.c 
> linux-2.4.24.arp/net/core/neighbour.c
> --- linux-2.4.24/net/core/neighbour.c	2004-03-12 07:33:13.000000000 -0700
> +++ linux-2.4.24.arp/net/core/neighbour.c	2004-03-12 09:36:35.000000000 -0700
> @@ -566,9 +566,8 @@
>  static void SMP_TIMER_NAME(neigh_periodic_timer)(unsigned long arg)
>  {
>  	struct neigh_table *tbl = (struct neigh_table*)arg;
> +	struct neighbour *n, **np;
>  	unsigned long now = jiffies;
> -	int i;
> -
>  
>  	write_lock(&tbl->lock);
>  
> @@ -583,46 +582,48 @@
>  			p->reachable_time = neigh_rand_reach_time(p->base_reachable_time);
>  	}
>  
> -	for (i=0; i <= NEIGH_HASHMASK; i++) {
> -		struct neighbour *n, **np;
> +	tbl->curr_hash_bucket &= NEIGH_HASHMASK;
> +	np = &tbl->hash_buckets[tbl->curr_hash_bucket++];
>  
> -		np = &tbl->hash_buckets[i];
> -		while ((n = *np) != NULL) {
> -			unsigned state;
> -
> -			write_lock(&n->lock);
> -
> -			state = n->nud_state;
> -			if (state&(NUD_PERMANENT|NUD_IN_TIMER)) {
> -				write_unlock(&n->lock);
> -				goto next_elt;
> -			}
> +	while ((n = *np) != NULL) {
> +		unsigned state;
>  
> -			if ((long)(n->used - n->confirmed) < 0)
> -				n->used = n->confirmed;
> +		write_lock(&n->lock);
>  
> -			if (atomic_read(&n->refcnt) == 1 &&
> -			    (state == NUD_FAILED || now - n->used > n->parms->gc_staletime)) {
> -				*np = n->next;
> -				n->dead = 1;
> -				write_unlock(&n->lock);
> -				neigh_release(n);
> -				continue;
> -			}
> +		state = n->nud_state;
> +		if (state&(NUD_PERMANENT|NUD_IN_TIMER)) {
> +			write_unlock(&n->lock);
> +			goto next_elt;
> +		}
>  
> -			if (n->nud_state&NUD_REACHABLE &&
> -			    now - n->confirmed > n->parms->reachable_time) {
> -				n->nud_state = NUD_STALE;
> -				neigh_suspect(n);
> -			}
> +		if ((long)(n->used - n->confirmed) < 0)
> +			n->used = n->confirmed;
> +
> +		if (atomic_read(&n->refcnt) == 1 &&
> +		    (state == NUD_FAILED || now - n->used > n->parms->gc_staletime)) {
> +			*np = n->next;
> +			n->dead = 1;
>  			write_unlock(&n->lock);
> +			neigh_release(n);
> +			continue;
> +		}
>  
> -next_elt:
> -			np = &n->next;
> +		if (n->nud_state&NUD_REACHABLE &&
> +		    now - n->confirmed > n->parms->reachable_time) {
> +			n->nud_state = NUD_STALE;
> +			neigh_suspect(n);
>  		}
> +		write_unlock(&n->lock);
> +
> +next_elt:
> +		np = &n->next;
>  	}
>  
> -	mod_timer(&tbl->gc_timer, now + tbl->gc_interval);
> +	/*
> +	 * Cycle through all hash buckets every 60 seconds.
> +	 */
> +	mod_timer(&tbl->gc_timer, now+((HZ*60)/(NEIGH_HASHMASK+1)));
> +
>  	write_unlock(&tbl->lock);
>  }
>  
> 
> --
> Tim Gardner - timg@tpi.com
> www.tpi.com 406-443-5357
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: ARP does not scale
  2004-03-13 23:32 ` ARP does not scale Anton Blanchard
@ 2004-03-14  3:56   ` Tim Gardner
  0 siblings, 0 replies; 2+ messages in thread
From: Tim Gardner @ 2004-03-14  3:56 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: netdev

On Saturday 13 March 2004 16:32, Anton Blanchard wrote:
> [moving this to netdev]

Anton,

I figured out how to subscribe.

> It would be nice to fix this properly in 2.6, a CONFIG option for this sort
> of thing is painful for distros.
>
> Check out net/ipv4/tcp.c and how it makes an effort to scale tcp_ehash and
> tcp_bhash with memory and also allows an override with a cmdline option
> thash_entries.
>
> Anton
>

I checked out how tcp_ehash and tcp_bhash auto-size based on the amount of 
RAM. I think one can usually make the same assumption about the relative 
number of ARP entries. I say usually because I have some core routers that 
have over a thousand ARP entries. In that case I would like to override the 
auto-size calculation so that I can force enough hash buckets such that there 
is never (or rarely) a hash collision.

How about a compromise. Leave in the config option, but default it to 0 which 
enables the auto-size algorithm. Otherwise, use the size specified in the 
config option. 

rtg
-- 
Tim Gardner - timg@tpi.com
www.tpi.com 406-443-5357 

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2004-03-14  3:56 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <200403131129.51083.timg@tpi.com>
2004-03-13 23:32 ` ARP does not scale Anton Blanchard
2004-03-14  3:56   ` Tim Gardner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).