* TODO list before feature freeze
From: Rusty Russell @ 2002-07-18  9:34 UTC
To: netfilter-devel; +Cc: netdev, netfilter-core

Hi all,
With four months to go before the feature freeze, it's
important to compile a feature list for netfilter-related things. I
see the following coming up:
Connection tracking:
o TCP window tracking finally goes in.
o Fix the extremely low TCP RST timeout.
o Fix the UDP timeout calculations to be per-port.
o Improve hashing.
o Fix the massive timer performance problem.
o Zero-copy-safe the connection tracking framework.
o ctnetlink support.
iptables:
o Change over to a netlink interface.
o Back to add/delete/replace interface + commit.
o Rewrite libiptc to use netlink (to port iptables).
o Write a new ip extension for iptables.
o Zero-copy-safe the iptables framework.
NAT:
o Zero-copy-safe the NAT framework.
Please add feature requests: note that I have not been following the
lists, so "obvious" things may not be obvious to me.
Thanks for your patience,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
* Re: TODO list before feature freeze
From: Balazs Scheidler @ 2002-07-19  7:39 UTC
To: Rusty Russell; +Cc: netfilter-devel, netdev, netfilter-core

On Thu, Jul 18, 2002 at 07:34:53PM +1000, Rusty Russell wrote:
> Hi all,
>
> 	With four months to go before the feature freeze, it's
> important to compile a feature list for netfilter-related things. I
> see the following coming up:
>
> Connection tracking:
> o TCP window tracking finally goes in.
> o Fix the extremely low TCP RST timeout.
> o Fix the UDP timeout calculations to be per-port.
> o Improve hashing.
> o Fix the massive timer performance problem.
> o Zero-copy-safe the connection tracking framework.
> o ctnetlink support.
>
> iptables:
> o Change over to a netlink interface.
> o Back to add/delete/replace interface + commit.
> o Rewrite libiptc to use netlink (to port iptables).
> o Write a new ip extension for iptables.
> o Zero-copy-safe the iptables framework.
>
> NAT:
> o Zero-copy-safe the NAT framework.
>
> Please add feature requests: note that I have not been following the
> lists, so "obvious" things may not be obvious to me.

I think conntrack exemptions and transparent proxy support should be
added to the list. The latter is working for me in production, at least
for TCP connections. UDP support depends on conntrack exemptions, so it
is not yet implemented (at least the sendmsg side; the recvmsg side
should be working).

--
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
* Re: TODO list before feature freeze
From: Michael Richardson @ 2002-07-19 17:43 UTC
To: Rusty Russell; +Cc: netfilter-devel, netdev, netfilter-core

>>>>> "Rusty" == Rusty Russell <rusty@rustcorp.com.au> writes:

 Rusty> With four months to go before the feature freeze, it's
 Rusty> important to compile a feature list for netfilter-related things. I
 Rusty> see the following coming up:

 Rusty> Connection tracking:
 Rusty> o TCP window tracking finally goes in.
 Rusty> o Fix the extremely low TCP RST timeout.
 Rusty> o Fix the UDP timeout calculations to be per-port.
 Rusty> o Improve hashing.
 Rusty> o Fix the massive timer performance problem.
 Rusty> o Zero-copy-safe the connection tracking framework.
 Rusty> o ctnetlink support.

Permit CT to deal with multihomed hosts.
* Re: TODO list before feature freeze
From: jamal @ 2002-07-29 10:57 UTC
To: Rusty Russell; +Cc: netfilter-devel, netdev, netfilter-core

On Thu, 18 Jul 2002, Rusty Russell wrote:

> Hi all,
>
> 	With four months to go before the feature freeze,

Really? ;->

> Connection tracking:

Fix the performance problems with this thing. You may have seen reports
of the performance degradation it introduces. I was hoping to take a
look at some point, but time hasn't been visiting this side.

> iptables:
> o Change over to a netlink interface.
> o Back to add/delete/replace interface + commit.
> o Rewrite libiptc to use netlink (to port iptables).

I hope this resolves the current scheme where the whole
add/delete/replace interface + commit happens in user space?
If you use netlink it would make sense to do incremental updates to the
kernel.

cheers,
jamal
* Re: TODO list before feature freeze
From: Andi Kleen @ 2002-07-29 11:12 UTC
To: jamal; +Cc: Rusty Russell, netfilter-devel, netdev, netfilter-core

> > Connection tracking:
>
> Fix the performance problems with this thing. You may have seen reports
> of the performance degradation it introduces. I was hoping to take a
> look at some point, but time hasn't been visiting this side.

One obvious problem is that it uses vmalloc to allocate its big hash
table. This will likely lead to TLB thrashing on a busy box. It should
try to allocate the hash table with get_free_pages() first and only
fall back to vmalloc if that fails. This way it would run with large
pages.

(Case in point: we have at least one report that routing performance
breaks down with ip_conntrack when memory size is increased over 1GB on
P3s. The hash table size depends on the memory size. The problem does
not occur on P4s; P4s have larger TLBs than P3s.)

-Andi
* Re: TODO list before feature freeze
From: jamal @ 2002-07-29 11:23 UTC
To: Andi Kleen; +Cc: Rusty Russell, netfilter-devel, netdev, netfilter-core

On Mon, 29 Jul 2002, Andi Kleen wrote:

> One obvious problem is that it uses vmalloc to allocate its big hash
> table. This will likely lead to TLB thrashing on a busy box. It should
> try to allocate the hash table with get_free_pages() first and only
> fall back to vmalloc if that fails. This way it would run with large
> pages.
>
> (Case in point: we have at least one report that routing performance
> breaks down with ip_conntrack when memory size is increased over 1GB on
> P3s. The hash table size depends on the memory size. The problem does
> not occur on P4s; P4s have larger TLBs than P3s.)

They also have a lot of problems with their per-packet computations.
Robert and I spent a short time looking at "this thing that is making
us look bad" (performance-wise) and talked to Harald. Something that
looked like it needs improvement at first glance was the aging and
hashing schemes.

cheers,
jamal
* Re: TODO list before feature freeze
From: Andi Kleen @ 2002-07-29 11:56 UTC
To: jamal; +Cc: Andi Kleen, Rusty Russell, netfilter-devel, netdev, netfilter-core

On Mon, Jul 29, 2002 at 07:23:49AM -0400, jamal wrote:
>
> On Mon, 29 Jul 2002, Andi Kleen wrote:
>
> > One obvious problem is that it uses vmalloc to allocate its big hash
> > table. This will likely lead to TLB thrashing on a busy box. It should
> > try to allocate the hash table with get_free_pages() first and only
> > fall back to vmalloc if that fails. This way it would run with large
> > pages.
> >
> > (Case in point: we have at least one report that routing performance
> > breaks down with ip_conntrack when memory size is increased over 1GB on
> > P3s. The hash table size depends on the memory size. The problem does
> > not occur on P4s; P4s have larger TLBs than P3s.)
>
> They also have a lot of problems with their per-packet computations.
> Robert and I spent a short time looking at "this thing that is making
> us look bad" (performance-wise) and talked to Harald. Something that
> looked like it needs improvement at first glance was the aging and
> hashing schemes.

Yes, some more tuning is probably needed. Here is a patch for 2.4 that
just makes it use get_free_pages to test the TLB theory. Another obvious
improvement would be to not use list_heads for the hash table buckets -
a single pointer would likely suffice, and it would cut the hash table
in half, saving cache, TLB and memory.

-Andi

--- linux-work/net/ipv4/netfilter/ip_conntrack_core.c-CONNTRACK	Thu Jul 25 13:36:42 2002
+++ linux-work/net/ipv4/netfilter/ip_conntrack_core.c	Mon Jul 29 13:48:33 2002
@@ -50,6 +50,7 @@
 LIST_HEAD(protocol_list);
 static LIST_HEAD(helpers);
 unsigned int ip_conntrack_htable_size = 0;
+static int ip_conntrack_vmalloc;
 static int ip_conntrack_max = 0;
 static atomic_t ip_conntrack_count = ATOMIC_INIT(0);
 struct list_head *ip_conntrack_hash;
@@ -1053,6 +1054,15 @@
 	return 1;
 }
 
+static void free_conntrack_hash(void)
+{
+	if (ip_conntrack_vmalloc)
+		vfree(ip_conntrack_hash);
+	else
+		free_pages((unsigned long)ip_conntrack_hash,
+			   get_order(sizeof(struct list_head) * ip_conntrack_htable_size));
+}
+
 /* Mishearing the voices in his head, our hero wonders how he's
    supposed to kill the mall. */
 void ip_conntrack_cleanup(void)
@@ -1075,7 +1085,7 @@
 	}
 	kmem_cache_destroy(ip_conntrack_cachep);
-	vfree(ip_conntrack_hash);
+	free_conntrack_hash();
 	nf_unregister_sockopt(&so_getorigdst);
 }
@@ -1109,8 +1119,17 @@
 	if (ret != 0)
 		return ret;
 
-	ip_conntrack_hash = vmalloc(sizeof(struct list_head)
-				    * ip_conntrack_htable_size);
+	/* AK: the hash table is twice as big as needed because it uses
+	   list_head. it would be much nicer to caches to use a single
+	   pointer list head here. */
+	ip_conntrack_vmalloc = 0;
+	ip_conntrack_hash = (void *)__get_free_pages(GFP_KERNEL,
+		get_order(sizeof(struct list_head) * ip_conntrack_htable_size));
+	if (!ip_conntrack_hash) {
+		ip_conntrack_vmalloc = 1;
+		printk("ip_conntrack: falling back to vmalloc. performance may be degraded.\n");
+		ip_conntrack_hash = vmalloc(sizeof(struct list_head) * ip_conntrack_htable_size);
+	}
 	if (!ip_conntrack_hash) {
 		nf_unregister_sockopt(&so_getorigdst);
 		return -ENOMEM;
@@ -1121,7 +1140,7 @@
 					SLAB_HWCACHE_ALIGN, NULL, NULL);
 	if (!ip_conntrack_cachep) {
 		printk(KERN_ERR "Unable to create ip_conntrack slab cache\n");
-		vfree(ip_conntrack_hash);
+		free_conntrack_hash();
 		nf_unregister_sockopt(&so_getorigdst);
 		return -ENOMEM;
 	}
@@ -1145,7 +1164,7 @@
 		= register_sysctl_table(ip_conntrack_root_table, 0);
 	if (ip_conntrack_sysctl_header == NULL) {
 		kmem_cache_destroy(ip_conntrack_cachep);
-		vfree(ip_conntrack_hash);
+		free_conntrack_hash();
 		nf_unregister_sockopt(&so_getorigdst);
 		return -ENOMEM;
 	}
* Re: TODO list before feature freeze
From: Martin Josefsson @ 2002-07-29 15:40 UTC
To: Andi Kleen
Cc: jamal, Rusty Russell, Netfilter-devel, netdev, netfilter-core, Patrick Schaaf

On Mon, 2002-07-29 at 13:56, Andi Kleen wrote:

> Here is a patch for 2.4 that just makes it use get_free_pages to test
> the TLB theory. Another obvious improvement would be to not use
> list_heads for the hash table buckets - a single pointer would likely
> suffice, and it would cut the hash table in half, saving cache, TLB and
> memory.

I think the list_heads are currently used for only one thing: the early
eviction in case of overload. It scans backwards in the chains to find
unreplied connections to evict - or so the comment in early_drop() says:

/* Traverse backwards: gives us oldest, which is roughly LRU */

but then it uses the normal LIST_FIND macro, which I think traverses the
list in the normal forward direction. I haven't looked into the rest of
the code, but I can't seem to remember anything else that needs
list_heads.

I think Patrick Schaaf is looking into conntrack as we speak. Maybe he
has some ideas?

I know I've had plans to rewrite the locking in conntrack, which is
quite frankly horrible: one giant rwlock used for almost everything
(including the hashtable). One idea that has come to mind is using RCU
(I need to learn more about it), or maybe using one rwlock per N buckets
or something. Looking at some stats from one of my routers, I see that
an average connection is over 130 packets long, so the ratio between
reads and writes is quite good.

And the eviction which occurs at overload needs to be redone; we can't
go around dropping one unreplied connection at a time, we need
gang-eviction of unreplied connections. We've had some nasty DDoS's here
in which our routers have been spending all cputime in conntrack trying
to evict connections to make room for the SYN floods coming in at
>130kpps.

--
/Martin

Never argue with an idiot. They drag you down to their level, then beat
you with experience.
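[For illustration, a minimal sketch of the "one rwlock per N buckets"
idea floated above. This is not a tested patch; the sector count,
names, and helper are made up, and as Patrick notes in the next mail,
writers would have to take both sector locks (one per tuple direction)
in a fixed order to avoid deadlock.]

#define CT_LOCK_SECTORS 64	/* illustrative; would need tuning */

static rwlock_t ct_bucket_locks[CT_LOCK_SECTORS];

static inline rwlock_t *ct_bucket_lock(unsigned int hash)
{
	return &ct_bucket_locks[hash % CT_LOCK_SECTORS];
}

/* The per-packet lookup then read-locks only one sector instead of
 * the single global ip_conntrack_lock: */
static struct ip_conntrack_tuple_hash *
ct_lookup(const struct ip_conntrack_tuple *tuple, unsigned int hash)
{
	struct ip_conntrack_tuple_hash *h;

	read_lock(ct_bucket_lock(hash));
	h = ct_walk_bucket(hash, tuple);	/* assumed chain-walk helper */
	read_unlock(ct_bucket_lock(hash));
	return h;
}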
* Re: TODO list before feature freeze
From: Patrick Schaaf @ 2002-07-29 16:15 UTC
To: Martin Josefsson
Cc: Andi Kleen, jamal, Rusty Russell, Netfilter-devel, netdev, netfilter-core, Patrick Schaaf

(warning: crystal ball engaged to parse from the quoted mail snippets.
Maybe missing context. I'm just reading netfilter-devel.)

> On Mon, 2002-07-29 at 13:56, Andi Kleen wrote:
>
> > Here is a patch for 2.4 that just makes it use get_free_pages to test
> > the TLB theory.

I presume this is about the vmalloc()ed hash bucket table? If yes, it's
certainly an interesting experiment to try allocating it from an area
without TLB issues. We can expect a TLB miss on every packet with the
current setup; allocating the bucket table from large-TLB memory would
be a clear win of one memory roundtrip. The netfilter hook statistics
patch I mentioned in the other mail should be able to show the
difference. If my guess is right, you could see a 5-10% improvement on
the ip_conntrack hook functions.

> > Another obvious improvement would be to not use list_heads for the
> > hash table buckets - a single pointer would likely suffice, and it
> > would cut the hash table in half, saving cache, TLB and memory.
>
> I think the list_heads are currently used for only one thing: the early
> eviction in case of overload.

Don't forget the nonscanning list_del(), called whenever a conntrack is
unhashed at its death. However, with a suitable bucket number, i.e. low
chain lengths, the scan on conntrack removal would be OK.

The early_drop() scanning, if it wants to work backward, may as well
work forward, keeping a "last unreplied found" pointer and returning
that when falling off the end of the single list. Thus, I also think
that the list could be simple.

From the top of my head, here are other fields that we could get rid of:

- the ctrack backpointer in each tuple.
- the protocol field in each tuple.
- the 20-byte infos[] array in ip_conntrack.
- we could out-of-line ip_nat_info.

With the current layout, when lists must be walked on a 32-byte-cacheline
box, we are sure to always read two cache lines for each skipped-over
tuple.

> I know I've had plans to rewrite the locking in conntrack, which is
> quite frankly horrible: one giant rwlock used for almost everything
> (including the hashtable).

I'd like to see lockmeter statistics before this change. When you split
the one lock into a sectored lock: each conntrack is hashed twice, so
you need to be careful with lock order when adding or removing.
(Well, there is another possibility, but I won't go into that now.)

> One idea that has come to mind is using RCU

I don't see RCU solving hash link list update problems. Care to explain
how that would work?

> And the eviction which occurs at overload needs to be redone; we can't
> go around dropping one unreplied connection at a time, we need
> gang-eviction of unreplied connections.

I propose to put them all on a separate LRU list, and reap the oldest.

best regards
Patrick
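[A minimal sketch of the single-pointer bucket and the forward "last
unreplied found" scan described above - illustrative only, with an
assumed predicate. New entries go in at the head of a chain, so the
last match found before falling off the end is roughly the oldest, i.e.
the LRU victim the backwards-traversal comment was after.]

struct ct_slist {
	struct ct_slist *next;	/* NULL-terminated chain per bucket */
};

/* One pointer per bucket instead of a two-pointer list_head:
 * half the table size, as Andi notes above. */
static struct ct_slist **ct_hash_buckets;

static struct ct_slist *early_drop_victim(struct ct_slist *head)
{
	struct ct_slist *e, *last = NULL;

	for (e = head; e != NULL; e = e->next)
		if (!ct_entry_replied(e))	/* assumed predicate */
			last = e;		/* oldest unreplied survives the scan */
	return last;
}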
* Re: TODO list before feature freeze
From: Martin Josefsson @ 2002-07-29 17:12 UTC
To: Patrick Schaaf
Cc: Andi Kleen, jamal, Rusty Russell, Netfilter-devel, netdev, netfilter-core

On Mon, 2002-07-29 at 18:15, Patrick Schaaf wrote:

> > I think the list_heads are currently used for only one thing: the early
> > eviction in case of overload.
>
> Don't forget the nonscanning list_del(), called whenever a conntrack is
> unhashed at its death. However, with a suitable bucket number, i.e. low
> chain lengths, the scan on conntrack removal would be OK.

Oh, I forgot about that.

> The early_drop() scanning, if it wants to work backward, may as well
> work forward, keeping a "last unreplied found" pointer and returning
> that when falling off the end of the single list.
>
> Thus, I also think that the list could be simple.

The comment says that we are scanning backwards, but the code says we
are scanning forward without keeping a "last unreplied found" pointer;
we just return the first unreplied found. So going to a singly linked
list and adding that would probably be better than the current
behaviour.

> > I know I've had plans to rewrite the locking in conntrack, which is
> > quite frankly horrible: one giant rwlock used for almost everything
> > (including the hashtable).
>
> I'd like to see lockmeter statistics before this change. When you split
> the one lock into a sectored lock: each conntrack is hashed twice, so
> you need to be careful with lock order when adding or removing.
> (Well, there is another possibility, but I won't go into that now.)

The main reason is that I'd like conntrack to hold up better under
attacks. I'll see if I can produce some lockmeter statistics later this
week.

> > One idea that has come to mind is using RCU
>
> I don't see RCU solving hash link list update problems. Care to explain
> how that would work?

Have you seen the rtcache RCU patch? It almost halved the time spent
doing the lookups, because there is no lock bouncing between CPUs. But
RCU is best suited for things that can tolerate stale data on reads,
something which we can not. I've spoken to Hanna Linder, who is one of
the RCU people, and she said that the dcache RCU patch uses some
techniques to solve this, as the dcache can't tolerate stale data
either. I haven't investigated this yet, but it got my attention.

But the fact is that not many routers are SMP machines. Maybe it could
help some very busy SMP servers?

> > And the eviction which occurs at overload needs to be redone; we can't
> > go around dropping one unreplied connection at a time, we need
> > gang-eviction of unreplied connections.
>
> I propose to put them all on a separate LRU list, and reap the oldest.

I proposed this as well, and then James Morris shot me down :)
What about the case where someone tries to DoS a specific chain?

--
/Martin

Never argue with an idiot. They drag you down to their level, then beat
you with experience.
* Re: TODO list before feature freeze
From: Nivedita Singhvi @ 2002-07-29 17:35 UTC
To: Martin Josefsson
Cc: Patrick Schaaf, Andi Kleen, jamal, Rusty Russell, Netfilter-devel, netdev, netfilter-core

Quoting Martin Josefsson <gandalf@wlug.westbo.se>:

> > I don't see RCU solving hash link list update problems. Care to
> > explain how that would work?
>
> Have you seen the rtcache RCU patch? It almost halved the time spent
> doing the lookups, because there is no lock bouncing between CPUs. But
> RCU is best suited for things that can tolerate stale data on reads,
> something which we can not. I've spoken to Hanna Linder, who is one of
> the RCU people, and she said that the dcache RCU patch uses some
> techniques to solve this, as the dcache can't tolerate stale data
> either.

The other environment parameter RCU does best in is a high read:write
ratio, needed in order to swallow the overhead. I haven't looked into
the netfilter code and don't have much experience in that area, so I
can't suggest what your normal traffic would consist of. However, I've
seen some posts and data from Dipankar and Hanna which suggest the
tradeoff is at a lower ratio than we were used to (I was in Sequent's
ptx/TCP/IP team, where we used it heavily, but our machines were NUMA
boxes with a staggering 10:1 penalty for going off-quad, back in the
olden days).

> I haven't investigated this yet, but it got my attention.
>
> But the fact is that not many routers are SMP machines. Maybe it could
> help some very busy SMP servers?

Hope so :)

thanks,
Nivedita
* Re: TODO list before feature freeze
From: Martin Josefsson @ 2002-07-29 22:43 UTC
To: Andi Kleen; +Cc: jamal, Rusty Russell, Netfilter-devel, netdev, netfilter-core

On Mon, 2002-07-29 at 13:56, Andi Kleen wrote:

> Here is a patch for 2.4 that just makes it use get_free_pages to test
> the TLB theory. Another obvious improvement would be to not use
> list_heads for the hash table buckets - a single pointer would likely
> suffice, and it would cut the hash table in half, saving cache, TLB and
> memory.

ip_nat_core is also allocating its hashtable via vmalloc, and it's twice
as large as the one in ip_conntrack (or rather, it's two hashtables
allocated at once; maybe they should be split up into two allocations?).

diff -x *.orig -x *.rej -urN linux-2.4.19-rc3.old/net/ipv4/netfilter/ip_nat_core.c linux-2.4.19-rc3/net/ipv4/netfilter/ip_nat_core.c
--- linux-2.4.19-rc3.old/net/ipv4/netfilter/ip_nat_core.c	Thu Jul 25 18:26:42 2002
+++ linux-2.4.19-rc3/net/ipv4/netfilter/ip_nat_core.c	Tue Jul 30 00:14:12 2002
@@ -43,6 +43,8 @@
 /* Calculated at init based on memory size */
 static unsigned int ip_nat_htable_size;
 
+static int ip_nat_vmalloc;
+
 static struct list_head *bysource;
 static struct list_head *byipsproto;
 LIST_HEAD(protos);
@@ -958,8 +960,16 @@
 	/* Leave them the same for the moment. */
 	ip_nat_htable_size = ip_conntrack_htable_size;
 
-	/* One vmalloc for both hash tables */
-	bysource = vmalloc(sizeof(struct list_head) * ip_nat_htable_size*2);
+	/* One allocation for both hash tables */
+	ip_nat_vmalloc = 0;
+	bysource = (void *)__get_free_pages(GFP_KERNEL,
+			get_order(sizeof(struct list_head) *
+				  ip_nat_htable_size * 2));
+	if (!bysource) {
+		ip_nat_vmalloc = 1;
+		printk("ip_nat: falling back to vmalloc. performance may be degraded.\n");
+		bysource = vmalloc(sizeof(struct list_head) * ip_nat_htable_size * 2);
+	}
 	if (!bysource) {
 		return -ENOMEM;
 	}
@@ -999,5 +1009,10 @@
 {
 	ip_ct_selective_cleanup(&clean_nat, NULL);
 	ip_conntrack_destroyed = NULL;
-	vfree(bysource);
+
+	if (ip_nat_vmalloc)
+		vfree(bysource);
+	else
+		free_pages((unsigned long)bysource,
+			   get_order(sizeof(struct list_head) * ip_nat_htable_size * 2));
 }

--
/Martin

Never argue with an idiot. They drag you down to their level, then beat
you with experience.
* Re: TODO list before feature freeze
From: Patrick Schaaf @ 2002-07-29 16:26 UTC
To: jamal; +Cc: Andi Kleen, Rusty Russell, netfilter-devel, netdev, netfilter-core

Jamal,

> They also have a lot of problems with their per-packet computations.
> Robert and I spent a short time looking at "this thing that is making
> us look bad" (performance-wise) and talked to Harald.

Do you have a writeup somewhere of what kind of performance problems you
were seeing, and under which conditions (hash bucket count, number of
tracked connections, packet load)?

> Something that looked like it needs improvement at first glance was the
> aging and hashing schemes.

Regarding the hashing schemes, please see the discussions on
netfilter-devel over the last weeks:

http://lists.netfilter.org/pipermail/netfilter-devel/2002-July/thread.html

and a small presentation of various bucket sizes / hash functions for
some real-world scenarios:

http://bei.bof.de/ex6/

This presentation, a bit terse on comments, links to a tarball which
allows you to recreate the same presentation for any dump of
/proc/net/ip_conntrack, varying bucket counts and hash functions.

best regards
Patrick
* Re: TODO list before feature freeze
From: Andi Kleen @ 2002-07-29 16:31 UTC
To: Patrick Schaaf
Cc: jamal, Andi Kleen, Rusty Russell, netfilter-devel, netdev, netfilter-core

> Regarding the hashing schemes, please see the discussions on
> netfilter-devel over the last weeks:
>
> http://lists.netfilter.org/pipermail/netfilter-devel/2002-July/thread.html
>
> and a small presentation of various bucket sizes / hash functions for
> some real-world scenarios:
>
> http://bei.bof.de/ex6/

Have you done any profiling of real workloads to see where the actual
overhead comes from?

-Andi
* Re: TODO list before feature freeze
From: Patrick Schaaf @ 2002-07-29 16:42 UTC
To: Andi Kleen
Cc: Patrick Schaaf, jamal, Rusty Russell, netfilter-devel, netdev, netfilter-core

> Have you done any profiling of real workloads to see where the actual
> overhead comes from?

Not yet. I've spent the last weeks learning enough about the code to
make sense of profiles :)

This week (probably Wednesday) I'll put both my netfilter hook
statistics patch and enabled kernel profiling onto a production box (the
transproxy thing from the bucket occupation analysis). Right now I have
a totally undersized bucket count on that machine (7168 buckets for 10
times as many tuples), so I'll first measure the "accidental long list
walk" situation, and then retry with a suitable bucket size.

best regards
Patrick
* Re: TODO list before feature freeze
From: Patrick Schaaf @ 2002-07-29 16:45 UTC
To: Patrick Schaaf
Cc: Andi Kleen, jamal, Rusty Russell, netfilter-devel, netdev, netfilter-core

> This week (probably Wednesday) I'll put both my netfilter hook
> statistics patch and enabled kernel profiling onto a production box (the
> transproxy thing from the bucket occupation analysis). Right now I have
> a totally undersized bucket count on that machine (7168 buckets for 10
> times as many tuples), so I'll first measure the "accidental long list
> walk" situation, and then retry with a suitable bucket size.

Before somebody gets the wrong idea: the machine I mentioned serves as a
squid proxy for over 3000 narrowband dialup users (all web traffic), and
it has no performance problems at all with that. For all I know, any
optimization we may make regarding netfilter won't make the squids on
that box work perceivably better. I have permanent average and median
service time monitoring to prove or disprove this assertion :-)

best regards
Patrick
* Re: TODO list before feature freeze
From: jamal @ 2002-07-30 11:58 UTC
To: Patrick Schaaf
Cc: Andi Kleen, Rusty Russell, netfilter-devel, netdev, netfilter-core

On Mon, 29 Jul 2002, Patrick Schaaf wrote:

> Jamal,
>
> > They also have a lot of problems with their per-packet computations.
> > Robert and I spent a short time looking at "this thing that is making
> > us look bad" (performance-wise) and talked to Harald.
>
> Do you have a writeup somewhere of what kind of performance problems you
> were seeing, and under which conditions (hash bucket count, number of
> tracked connections, packet load)?

Just a standard setup in forwarding. In fact you only need one or two
flows active to verify this -- which leads me to believe hashing may
play a smaller role in the whole problem. What I have seen has also been
reported by many people (someone seems to have gone one step further and
documented numbers, but I can't find his email right now). Take Linux as
a router: it routes at x% of wire rate. Load conntracking and watch it
go down by another 25% at least.

> > Something that looked like it needs improvement at first glance was the
> > aging and hashing schemes.
>
> Regarding the hashing schemes, please see the discussions on
> netfilter-devel over the last weeks:

I think hashing is one of the problems. What performance improvement are
you seeing? (I couldn't tell from looking at your data.)

cheers,
jamal
* Re: TODO list before feature freeze
From: Patrick Schaaf @ 2002-07-30 12:27 UTC
To: jamal
Cc: Patrick Schaaf, Andi Kleen, Rusty Russell, netfilter-devel, netdev, netfilter-core

> What I have seen has also been reported by many people (someone seems
> to have gone one step further and documented numbers, but I can't find
> his email right now). Take Linux as a router: it routes at x% of wire
> rate. Load conntracking and watch it go down by another 25% at least.

Unfortunately, this is insufficient information to pin down what was
happening. As Andi Kleen mentioned, a simple kernel profile from such a
test would be a good start. The most likely things leading to such a
result, in no specific order:

- skb linearization
- always-defragment
- ip_conntrack_lock contention
- per-packet timer management

I'm not personally interested in line-rate routing, but I look forward
to further results from such setups. I concentrate on real server
workloads, because that's where my job is.

> I think hashing is one of the problems. What performance improvement are
> you seeing? (I couldn't tell from looking at your data.)

We found that the autosizing tends to make the bucket count a multiple
of two, and we found that the currently used hash function does not like
that, resulting in longer average bucket list lengths than necessary.
The crc32 hashes, and suitably modified abcd hashes, don't suffer from
this deficiency, and they are almost identical to random (a pseudohash I
used to depict the optimum).

However, the "badness" of the current hash, given my datasets, results
in less than one additional list element on average. So we could save
one memory roundtrip. Given that with my netfilter hook statistics patch
I see >3000 cycles (on a 1GHz processor) spent in ip_conntrack_in -
about 10 memory round-trips - I don't think that you could measure the
hash function improvement, except for artificial test cases.

We can improve here, but not much. Changing the hash function is mostly
interesting to make hash bucket length attacks more unlikely. The abcd
hash family, with boot-time chosen multipliers, could be of use here.

best regards
Patrick
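[For illustration, a minimal sketch of the kind of boot-time-randomized
multiplicative ("abcd"-style) tuple hash discussed above. The names,
constants, and tuple fields are assumptions, not the actual netfilter
code. The point is that an attacker who does not know the multipliers
can no longer precompute tuples that all land in one bucket.]

static u32 hash_mult[4];	/* a, b, c, d: chosen once at boot/init */

static void abcd_hash_init(void)
{
	get_random_bytes(hash_mult, sizeof(hash_mult));
	/* odd multipliers mix better under truncation */
	hash_mult[0] |= 1;
	hash_mult[1] |= 1;
	hash_mult[2] |= 1;
}

static unsigned int abcd_hash(u32 saddr, u32 daddr, u16 sport, u16 dport,
			      u8 proto, unsigned int htable_size)
{
	u32 h = saddr * hash_mult[0]
	      + daddr * hash_mult[1]
	      + (((u32)sport << 16) | dport) * hash_mult[2]
	      + proto + hash_mult[3];

	return h % htable_size;
}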
* Re: TODO list before feature freeze
From: jamal @ 2002-07-30 12:29 UTC
To: Patrick Schaaf
Cc: Andi Kleen, Rusty Russell, netfilter-devel, netdev, netfilter-core

On Tue, 30 Jul 2002, Patrick Schaaf wrote:

> Unfortunately, this is insufficient information to pin down what was
> happening. As Andi Kleen mentioned, a simple kernel profile from such a
> test would be a good start.

Don't have time -- if I did, you'd probably see a patch from me. I gave
you the test case; you go do it. In fact you don't even need to be
forwarding to reproduce this. Marc Boucher has a small UDP traffic
generator; you can use that on the lo device and reproduce this.

> The most likely things leading to such a result, in no specific order:
>
> - skb linearization
> - always-defragment
> - ip_conntrack_lock contention
> - per-packet timer management
>
> I'm not personally interested in line-rate routing, but I look forward
> to further results from such setups. I concentrate on real server
> workloads, because that's where my job is.

It has nothing to do with forwarding (Marc's tool is a client-server
setup); you only need one or two flows. I am almost certain that if you
solve that simple case, you end up solving the larger problem.

> We found that the autosizing tends to make the bucket count a multiple
> of two, and we found that the currently used hash function does not like
> that, resulting in longer average bucket list lengths than necessary.
> The crc32 hashes, and suitably modified abcd hashes, don't suffer from
> this deficiency, and they are almost identical to random (a pseudohash I
> used to depict the optimum).
>
> However, the "badness" of the current hash, given my datasets, results
> in less than one additional list element on average. So we could save
> one memory roundtrip. Given that with my netfilter hook statistics patch
> I see >3000 cycles (on a 1GHz processor) spent in ip_conntrack_in -
> about 10 memory round-trips - I don't think that you could measure the
> hash function improvement, except for artificial test cases.
>
> We can improve here, but not much. Changing the hash function is mostly
> interesting to make hash bucket length attacks more unlikely. The abcd
> hash family, with boot-time chosen multipliers, could be of use here.

I think this is a good start, but insufficient; in general I think you
are looking at the wrong thing. Hashing is an issue, no doubt, but I
don't think it is the main issue. Did you start chasing hashing because
you saw some large numbers in profiles? If I were to use instinct, I
would say the last two items you list are probably the things you may
want to chase.

cheers,
jamal
* Re: TODO list before feature freeze
From: Patrick Schaaf @ 2002-07-30 13:06 UTC
To: jamal
Cc: Patrick Schaaf, Andi Kleen, Rusty Russell, netfilter-devel, netdev, netfilter-core

> Hashing is an issue, no doubt, but I don't think it is the main issue.

I fully agree. I mention this all the time.

> Did you start chasing hashing because you saw some large numbers
> in profiles?

No. I looked at the hash function and bucket sizes because I now have
machines in the field running conntrack, and I wanted to understand how
I should choose an appropriate size. So I took real-world conntrack
snapshots on those machines and analysed them. I learned all there is to
learn about the question, and present the results for others to use as
they see fit. End of story.

Again, I have no performance trouble with conntracking on any of my
production systems.

best regards
Patrick
--
everything is under control. (Hardware)
* Re: TODO list before feature freeze
From: jamal @ 2002-07-30 13:42 UTC
To: Patrick Schaaf
Cc: Andi Kleen, Rusty Russell, netfilter-devel, netdev, netfilter-core

On Tue, 30 Jul 2002, Patrick Schaaf wrote:

> Again, I have no performance trouble with conntracking on any of my
> production systems.

Then you have embarked on a meaningless science project, I am afraid.
This is where Rusty's comment that he needs numbers makes a lot of
sense.

cheers,
jamal
* Re: TODO list before feature freeze
From: Martin Josefsson @ 2002-07-30 13:08 UTC
To: jamal
Cc: Patrick Schaaf, Andi Kleen, Rusty Russell, Netfilter-devel, netdev, netfilter-core

On Tue, 2002-07-30 at 14:29, jamal wrote:

> If I were to use instinct, I would say the last two items you list are
> probably the things you may want to chase.

Here are two small patches.

The first is a small patch to avoid updating the per-connection timer
for every packet. With this patch you get one update per second per
connection. Things are complicated by the fact that connections can
change timeouts. This patch isn't verified for correctness, YMMV.
(The pptp helper needs updating to work in combination with this patch.)

The second patch changes the hashtable lookup slightly so we don't hash
the tuple on each iteration; once is enough.

I don't have any numbers for these patches, and I can't find the URL to
the tests one of the netfilter-devel people has done.

diff -x *.orig -urN linux.orig/net/ipv4/netfilter/ip_conntrack_core.c linux/net/ipv4/netfilter/ip_conntrack_core.c
--- linux.orig/net/ipv4/netfilter/ip_conntrack_core.c	Tue Jul 30 14:38:41 2002
+++ linux/net/ipv4/netfilter/ip_conntrack_core.c	Tue Jul 30 14:40:06 2002
@@ -855,8 +855,10 @@
 	if (!is_confirmed(ct))
 		ct->timeout.expires = extra_jiffies;
 	else {
-		/* Need del_timer for race avoidance (may already be dying). */
-		if (del_timer(&ct->timeout)) {
+		/* Don't update timer for each packet, only if it's been >HZ
+		 * ticks since last update or change is negative.
+		 * Need del_timer for race avoidance (may already be dying). */
+		if ((unsigned long)(jiffies + extra_jiffies - ct->timeout.expires) >= HZ && del_timer(&ct->timeout)) {
 			ct->timeout.expires = jiffies + extra_jiffies;
 			add_timer(&ct->timeout);
 		}

--- linux-2.4.19-pre10/net/ipv4/netfilter/ip_conntrack_core.c.orig	Sat Jun  8 00:48:59 2002
+++ linux-2.4.19-pre10/net/ipv4/netfilter/ip_conntrack_core.c	Sat Jun  8 00:49:56 2002
@@ -292,9 +292,10 @@
 		    const struct ip_conntrack *ignored_conntrack)
 {
 	struct ip_conntrack_tuple_hash *h;
+	size_t hash = hash_conntrack(tuple);
 
 	MUST_BE_READ_LOCKED(&ip_conntrack_lock);
-	h = LIST_FIND(&ip_conntrack_hash[hash_conntrack(tuple)],
+	h = LIST_FIND(&ip_conntrack_hash[hash],
 		      conntrack_tuple_cmp,
 		      struct ip_conntrack_tuple_hash *,
 		      tuple, ignored_conntrack);

--
/Martin

Never argue with an idiot. They drag you down to their level, then beat
you with experience.
* Re: TODO list before feature freeze
From: Filip Sneppe (Cronos) @ 2002-07-30 15:54 UTC
To: Martin Josefsson
Cc: jamal, Patrick Schaaf, Andi Kleen, Rusty Russell, Netfilter-devel, netdev, netfilter-core

On Tue, 2002-07-30 at 15:08, Martin Josefsson wrote:

> Here are two small patches.
>
> The first is a small patch to avoid updating the per-connection timer
> for every packet. With this patch you get one update per second per
> connection. Things are complicated by the fact that connections can
> change timeouts. This patch isn't verified for correctness, YMMV.
> (The pptp helper needs updating to work in combination with this patch.)
>
> The second patch changes the hashtable lookup slightly so we don't hash
> the tuple on each iteration; once is enough.
>
> I don't have any numbers for these patches, and I can't find the URL to
> the tests one of the netfilter-devel people has done.

Hi Martin,

These may be the patches I did some very basic testing (and readprofile
profiling) on:

http://www.filip.sneppe.yucom.be/linux/netfilter/performance/benchmarks.htm

I don't have a lot to add to the discussion, except that I can make time
to test patches/ideas, provided someone tells me *how* to test, what to
look for, etc. For instance, a lot of the current numbers on that page
with the varying MTU sizes are, in retrospect, basically stupid tests
that don't reveal a lot of new stuff :-/

Regards,
Filip
* Re: TODO list before feature freeze
From: Michael Richardson @ 2002-07-29 15:25 UTC
To: netfilter-devel, netdev, netfilter-core

>>>>> "Andi" == Andi Kleen <ak@suse.de> writes:

 Andi> (Case in point: we have at least one report that routing
 Andi> performance breaks down with ip_conntrack when memory size is
 Andi> increased over 1GB on P3s. The hash table size depends on the
 Andi> memory size. The problem does not occur on P4s; P4s have larger
 Andi> TLBs than P3s.)

That's a non-obvious result.

I'll bet that most of the memory-size-based hash tables in the kernel
suffer from similar problems. A good topic for a paper, I'd say.

] Internet Security. Have encryption, will travel          |1 Fish/2 Fish [
] Michael Richardson, Sandelman Software Works, Ottawa, ON |Red F./Blow F [
] mcr@sandelman.ottawa.on.ca http://www.sandelman.ottawa.on.ca/ |strong crypto [
] At the far end of some dark fiber - wait that's dirt!    |for everyone  [
* Re: TODO list before feature freeze
From: Patrick Schaaf @ 2002-07-29 15:52 UTC
To: Michael Richardson; +Cc: netfilter-devel, netdev, netfilter-core

> >>>>> "Andi" == Andi Kleen <ak@suse.de> writes:
>  Andi> (Case in point: we have at least one report that routing
>  Andi> performance breaks down with ip_conntrack when memory size is
>  Andi> increased over 1GB on P3s. The hash table size depends on the
>  Andi> memory size. The problem does not occur on P4s; P4s have larger
>  Andi> TLBs than P3s.)
>
> That's a non-obvious result.
>
> I'll bet that most of the memory-size-based hash tables in the kernel
> suffer from similar problems. A good topic for a paper, I'd say.

That's for sure - but I don't see the relevance of TLBs. The only place
where I expect any in-CPU caches to matter is for synthetic test cases
where there's a very small number of conntracks (fitting into CPU
caches) and a huge load of packets (to look good in a benchmark).

Under real-life operation, we either have very light loads - then
conntrack lookup does not matter at all - or we have high load, several
tens of thousands of packets per second. Then, things may get slow in
conntrack when you don't have enough hash buckets - two times the number
of concurrent connections is appropriate. Or, if that's not the problem,
you will already spread lookups so far across the hash table, in a
random fashion, that you'll incur at least two TLB faults plus several
cache line loads for each packet. When that point is reached, further
increases in packet load should not make things worse.

Andi, what report are you referring to? Any specifics I can read?

In case somebody isn't aware, we have been over the hash function and
hash bucket thing during the last month. See lots of mails in the
netfilter-devel archive. I'm prepared to take on any presumed
inefficiency in the current conntracking code. I know some things that
may be relevant that I did not write about during the last weeks, but I
have no real-life indication that they matter - I'd love to have the
opportunity to see such a situation. So if anybody has the time to work
on such a perceived performance problem, please come to the
netfilter-devel mailing list, and let's talk specifics.

As it were, I published a small netfilter performance counter patch over
the weekend, which you can find in the archive at

http://lists.netfilter.org/pipermail/netfilter-devel/2002-July/008792.html

I hope to see some really worrying output from you :)

best regards
Patrick
* Re: TODO list before feature freeze
From: Andi Kleen @ 2002-07-29 20:51 UTC
To: Michael Richardson; +Cc: netfilter-devel, netdev, netfilter-core

On Mon, Jul 29, 2002 at 11:25:28AM -0400, Michael Richardson wrote:

> That's a non-obvious result.
>
> I'll bet that most of the memory-size-based hash tables in the kernel
> suffer from similar problems. A good topic for a paper, I'd say.

In fact there have been papers about this, like

http://www.citi.umich.edu/projects/linux-scalability/reports/hash.html

but the results were unfortunately not adopted. This has been discussed
for a long time. Linux hash tables often suffer from poor hash functions
(some are good, but others are not so great), excessive sizing to cover
for the poor functions, and the use of double-pointer heads when they
are not needed (which makes the hash table twice as big). Excessive
sizing wastes memory: several MB just for hash tables on a standard
system, including some gems like an incredibly big mount hash table that
nearly nobody needs to manage their 5-10 mounts.

I wrote an slist.h some time ago that works like the Linux list.h, but
uses singly linked lists instead of rings, with a single-pointer head,
to at least avoid the last problem. In the following discussion the
preference was for a more generic hash table ADT instead of another list
abstraction.

So if you wanted to put some work into this, I would:

- Develop a simple, tasteful, Linux-like generic hash table abstraction
  (most of the existing ones in other software packages fail at least
  one of these criteria).
- Convert all the big hash tables to this generic code.
- Let it use single-pointer heads.
- Make it implement the sizing based on memory size in common code, with
  a single knob to tune it per system. In fact I think it should
  definitely take L2 cache size into account, not only main memory.
- Add generic statistics as a CONFIG option, so that you can see hit
  rates for all your hash tables and how much space they need.
- Make it either use the existing hash function per table, or a generic
  good hash function like http://burtleburtle.net/bob/hash/evahash.html
  Try out all these knobs and write a paper about it.
- Try to get it merged, with the best results as default options.

Unfortunately the netfilter hash is a bad example for this, because its
DoS handling requirements (LRU etc.) are more complex than what most
other Linux hash tables need, and I am not sure it would make sense to
put them into generic code. On the other hand, if the generic code is
flexible enough, it would be possible to implement them on top of it.

-Andi
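[To make the single-pointer-head idea concrete, a minimal sketch of
what such an slist could look like - illustrative only, not Andi's
actual slist.h. Deletion pays an O(chain length) scan to find the
predecessor, which is the trade-off Patrick calls acceptable above as
long as buckets stay short.]

struct slist_node {
	struct slist_node *next;
};

struct slist_head {
	struct slist_node *first;	/* one pointer per bucket, not two */
};

static inline void slist_add(struct slist_node *n, struct slist_head *h)
{
	n->next = h->first;
	h->first = n;
}

/* No back pointer, so deletion walks the chain to unlink the node. */
static inline void slist_del(struct slist_node *n, struct slist_head *h)
{
	struct slist_node **pp = &h->first;

	while (*pp != n)
		pp = &(*pp)->next;
	*pp = n->next;
}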
* Re: TODO list before feature freeze
From: Patrick Schaaf @ 2002-07-30  7:26 UTC
To: Andi Kleen; +Cc: netfilter-devel, netdev, netfilter-core

> Unfortunately the netfilter hash is a bad example for this, because its
> DoS handling requirements (LRU etc.) are more complex than what most
> other Linux hash tables need, and I am not sure it would make sense to
> put them into generic code.

There is actually one issue with the current netfilter hash code: the
code intentionally does not zero out the next pointer when a conntrack
is removed from the hashes; only new, never-yet-hashed conntracks have
their next field be 0, and the confirm logic relies on that. This could
easily be changed to use an appropriate flag bit in struct ip_conntrack.

As a consequence, the singly linked list I'm prototyping must be a ring
list, with the hash bucket pointer within the list - the same scheme as
with the doubly linked list. It's oopsing on me as I type :)

A non-ring implementation would be smaller, so I think we really want
that flag bit for the confirmations. Rusty?

All other cases could be handled by a general hash implementation with a
per-list-entry user-supplied comparison callback and a per-table hash
function. I'm sure that any real DoS handling will work by varying
constants used in the hash function. That's the result of the recent
"abcd" hashing work.

The thing that worries me, even with the current setup, is the idea of a
general boot-time sizing of all such general hash tables. The things are
hard to override once loaded, so sizes must fit what's needed in the
real world, and that's over a _mix_ of various tables that all play
together under this or that workload. Maybe runtime rehashing is the way
to go here, to make this fully adaptive.

best regards
Patrick
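[A minimal sketch of the flag-bit change Patrick asks for. The
IPS_HASHED name and bit number are made up for illustration; the sketch
only assumes that struct ip_conntrack carries a status bitfield, as the
2.4 code does for its other IPS_* flags.]

#define IPS_HASHED_BIT 7		/* illustrative, assumed-unused bit */
#define IPS_HASHED (1 << IPS_HASHED_BIT)

/* Instead of "next pointer is still 0, so this conntrack was never
 * hashed", the confirm logic would test an explicit status bit... */
static inline int ct_is_hashed(const struct ip_conntrack *ct)
{
	return test_bit(IPS_HASHED_BIT, &ct->status);
}

/* ...which frees the bucket chains to be plain NULL-terminated singly
 * linked lists rather than rings, saving one pointer per bucket. */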
* Re: TODO list before feature freeze
From: Rusty Russell @ 2002-07-29 22:14 UTC
To: jamal; +Cc: netfilter-devel, netdev

In message <Pine.GSO.4.30.0207290648020.12604-100000@shell.cyberus.ca>
you write:

> > Connection tracking:
>
> Fix the performance problems with this thing. You may have seen reports
> of the performance degradation it introduces. I was hoping to take a
> look at some point, but time hasn't been visiting this side.

There are several simple things to do here. One is to improve the
hashing (fine for internet traffic, but it frequently sucks under LAN
conditions), which is easy. The other is to modify the
one-timer-per-connection approach to a "sweep once a second, or when
full" approach.

Both these are simple patches, but I want to see benchmarks showing that
they improve things.

> > iptables:
> > o Change over to a netlink interface.
> > o Back to add/delete/replace interface + commit.
> > o Rewrite libiptc to use netlink (to port iptables).
>
> I hope this resolves the current scheme where the whole
> add/delete/replace interface + commit happens in user space?
> If you use netlink it would make sense to do incremental updates to the
> kernel.

Yes, that's exactly the plan. It'd be more like the old-style
insert/delete (probably not replace), except with a "commit" interface,
implemented by copying the rules when they start modifying.

Hope that helps,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
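[A minimal sketch of the "sweep once a second, or when full" expiry
scheme Rusty describes, replacing one kernel timer per connection. The
names and the eviction helper are assumptions; the key point is that the
per-packet refresh shrinks to a plain store of a jiffies stamp instead
of a del_timer()/add_timer() pair.]

static struct timer_list ct_sweep_timer;

static void ct_sweep(unsigned long data)
{
	/* Walk the tracked connections and evict any whose plain
	 * jiffies expiry stamp has passed. */
	ct_evict_expired(jiffies);		/* assumed helper */

	mod_timer(&ct_sweep_timer, jiffies + HZ);	/* once a second */
}

/* The "or when full" half: when the table hits ip_conntrack_max, the
 * allocation path runs the same sweep immediately rather than waiting
 * for the next timer tick. */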
* Re: TODO list before feature freeze
From: jamal @ 2002-07-30 12:04 UTC
To: Rusty Russell; +Cc: netfilter-devel, netdev

On Tue, 30 Jul 2002, Rusty Russell wrote:

> There are several simple things to do here. One is to improve the
> hashing (fine for internet traffic, but it frequently sucks under LAN
> conditions), which is easy. The other is to modify the
> one-timer-per-connection approach to a "sweep once a second, or when
> full" approach.

That's the right direction. From code inspection, fixing the latter
problem would give you a lot more punch.

> Both these are simple patches, but I want to see benchmarks showing that
> they improve things.

Indeed.

> Yes, that's exactly the plan. It'd be more like the old-style
> insert/delete (probably not replace), except with a "commit" interface,
> implemented by copying the rules when they start modifying.

Why not take a look at the way tc does things and emulate that?

cheers,
jamal