All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 4/4] first conntrack ID must be 1 not 2
@ 2006-02-13  2:41 Pablo Neira Ayuso
  2006-02-13 11:20 ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Pablo Neira Ayuso @ 2006-02-13  2:41 UTC (permalink / raw)
  To: Netfilter Development Mailinglist
  Cc: Harald Welte, Patrick McHardy, Yasuyuki Kozakai

[-- Attachment #1: Type: text/plain, Size: 474 bytes --]

[NF_CONNTRACK] first conntrack ID must be 1 not 2

The first conntrack ID must be 1. If a new conntrack is created, the
general ID counter must be post-incremented instead pre-incremented
since [ip|nf]_conntrack_next_id is initialized to 1. Same applies for
expectations.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

-- 
The dawn of the fourth age of Linux firewalling is coming; a time of
great struggle and heroic deeds -- J.Kadlecsik got inspired by J.Morris

[-- Attachment #2: id.patch --]
[-- Type: text/plain, Size: 2225 bytes --]

[NETFILTER] first conntrack ID must be 1 not 2

The first conntrack ID must be 1. If a new conntrack is created, the
general ID counter must be post-incremented instead pre-incremented since
[ip|nf]_conntrack_next_id is initialized to 1. Same applies for expectations.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Index: net-2.6.git/net/netfilter/nf_conntrack_core.c
===================================================================
--- net-2.6.git.orig/net/netfilter/nf_conntrack_core.c	2006-02-13 01:05:04.000000000 +0100
+++ net-2.6.git/net/netfilter/nf_conntrack_core.c	2006-02-13 01:47:04.000000000 +0100
@@ -682,7 +682,7 @@ static void __nf_conntrack_hash_insert(s
 				       unsigned int hash,
 				       unsigned int repl_hash) 
 {
-	ct->id = ++nf_conntrack_next_id;
+	ct->id = nf_conntrack_next_id++;
 	list_prepend(&nf_conntrack_hash[hash],
 		     &ct->tuplehash[IP_CT_DIR_ORIGINAL].list);
 	list_prepend(&nf_conntrack_hash[repl_hash],
@@ -1247,7 +1247,7 @@ static void nf_conntrack_expect_insert(s
 	exp->timeout.expires = jiffies + exp->master->helper->timeout * HZ;
 	add_timer(&exp->timeout);
 
-	exp->id = ++nf_conntrack_expect_next_id;
+	exp->id = nf_conntrack_expect_next_id++;
 	atomic_inc(&exp->use);
 	NF_CT_STAT_INC(expect_create);
 }
Index: net-2.6.git/net/ipv4/netfilter/ip_conntrack_core.c
===================================================================
--- net-2.6.git.orig/net/ipv4/netfilter/ip_conntrack_core.c	2006-02-04 14:34:40.000000000 +0100
+++ net-2.6.git/net/ipv4/netfilter/ip_conntrack_core.c	2006-02-13 01:46:42.000000000 +0100
@@ -417,7 +417,7 @@ static void __ip_conntrack_hash_insert(s
 					unsigned int hash,
 					unsigned int repl_hash) 
 {
-	ct->id = ++ip_conntrack_next_id;
+	ct->id = ip_conntrack_next_id++;
 	list_prepend(&ip_conntrack_hash[hash],
 		     &ct->tuplehash[IP_CT_DIR_ORIGINAL].list);
 	list_prepend(&ip_conntrack_hash[repl_hash],
@@ -971,7 +971,7 @@ static void ip_conntrack_expect_insert(s
 	exp->timeout.expires = jiffies + exp->master->helper->timeout * HZ;
 	add_timer(&exp->timeout);
 
-	exp->id = ++ip_conntrack_expect_next_id;
+	exp->id = ip_conntrack_expect_next_id++;
 	atomic_inc(&exp->use);
 	CONNTRACK_STAT_INC(expect_create);
 }

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-13  2:41 [PATCH 4/4] first conntrack ID must be 1 not 2 Pablo Neira Ayuso
@ 2006-02-13 11:20 ` Harald Welte
  2006-02-16  8:33   ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Harald Welte @ 2006-02-13 11:20 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Netfilter Development Mailinglist, Patrick McHardy,
	Yasuyuki Kozakai

[-- Attachment #1: Type: text/plain, Size: 854 bytes --]

On Mon, Feb 13, 2006 at 03:41:52AM +0100, Pablo Neira Ayuso wrote:
> [NF_CONNTRACK] first conntrack ID must be 1 not 2
> 
> The first conntrack ID must be 1. If a new conntrack is created, the
> general ID counter must be post-incremented instead pre-incremented
> since [ip|nf]_conntrack_next_id is initialized to 1. Same applies for
> expectations.

looks clean to me.  Interesting question though: Does it matter whether
we start with 1 or 2?  Am I missing something?

-- 
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-13 11:20 ` Harald Welte
@ 2006-02-16  8:33   ` Patrick McHardy
  2006-02-16  8:47     ` Jozsef Kadlecsik
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2006-02-16  8:33 UTC (permalink / raw)
  To: Harald Welte
  Cc: Netfilter Development Mailinglist, Yasuyuki Kozakai,
	Pablo Neira Ayuso

Harald Welte wrote:
> On Mon, Feb 13, 2006 at 03:41:52AM +0100, Pablo Neira Ayuso wrote:
> 
>>[NF_CONNTRACK] first conntrack ID must be 1 not 2
>>
>>The first conntrack ID must be 1. If a new conntrack is created, the
>>general ID counter must be post-incremented instead pre-incremented
>>since [ip|nf]_conntrack_next_id is initialized to 1. Same applies for
>>expectations.
> 
> 
> looks clean to me.  Interesting question though: Does it matter whether
> we start with 1 or 2?  Am I missing something?

Its irrelevant since the number doesn't have any meaning beyond beeing
an identifier. Anyway, I'd prefer a patch that just removes the
initialization to 1.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-16  8:33   ` Patrick McHardy
@ 2006-02-16  8:47     ` Jozsef Kadlecsik
  2006-02-16  9:02       ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-02-16  8:47 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Harald Welte, Netfilter Development Mailinglist,
	Pablo Neira Ayuso, Yasuyuki Kozakai

On Thu, 16 Feb 2006, Patrick McHardy wrote:

> Harald Welte wrote:
> > On Mon, Feb 13, 2006 at 03:41:52AM +0100, Pablo Neira Ayuso wrote:
> >
> >>[NF_CONNTRACK] first conntrack ID must be 1 not 2
> >>
> >>The first conntrack ID must be 1. If a new conntrack is created, the
> >>general ID counter must be post-incremented instead pre-incremented
> >>since [ip|nf]_conntrack_next_id is initialized to 1. Same applies for
> >>expectations.
> >
> >
> > looks clean to me.  Interesting question though: Does it matter whether
> > we start with 1 or 2?  Am I missing something?
>
> Its irrelevant since the number doesn't have any meaning beyond beeing
> an identifier. Anyway, I'd prefer a patch that just removes the
> initialization to 1.

Couldn't the id be simply equal to the jiffies when creating the conntrack
entry? (And adding time_after/before when comparing ids, of course.)

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-16  8:47     ` Jozsef Kadlecsik
@ 2006-02-16  9:02       ` Patrick McHardy
  2006-02-16  9:11         ` Jozsef Kadlecsik
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2006-02-16  9:02 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Harald Welte, Netfilter Development Mailinglist,
	Pablo Neira Ayuso, Yasuyuki Kozakai

Jozsef Kadlecsik wrote:
> Couldn't the id be simply equal to the jiffies when creating the conntrack
> entry? (And adding time_after/before when comparing ids, of course.)

We can create more than one entry per jiffy, so that wouldn't be unique
anymore.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-16  9:02       ` Patrick McHardy
@ 2006-02-16  9:11         ` Jozsef Kadlecsik
  2006-02-16  9:14           ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-02-16  9:11 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Harald Welte, Netfilter Development Mailinglist,
	Pablo Neira Ayuso, Yasuyuki Kozakai

On Thu, 16 Feb 2006, Patrick McHardy wrote:

> Jozsef Kadlecsik wrote:
> > Couldn't the id be simply equal to the jiffies when creating the conntrack
> > entry? (And adding time_after/before when comparing ids, of course.)
>
> We can create more than one entry per jiffy, so that wouldn't be unique
> anymore.

(jiffies, tuples) would be unique even in that case.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-16  9:11         ` Jozsef Kadlecsik
@ 2006-02-16  9:14           ` Patrick McHardy
  2006-02-16  9:36             ` Jozsef Kadlecsik
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2006-02-16  9:14 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Harald Welte, Netfilter Development Mailinglist,
	Pablo Neira Ayuso, Yasuyuki Kozakai

Jozsef Kadlecsik wrote:
> On Thu, 16 Feb 2006, Patrick McHardy wrote:
> 
> 
>>Jozsef Kadlecsik wrote:
>>
>>>Couldn't the id be simply equal to the jiffies when creating the conntrack
>>>entry? (And adding time_after/before when comparing ids, of course.)
>>
>>We can create more than one entry per jiffy, so that wouldn't be unique
>>anymore.
> 
> 
> (jiffies, tuples) would be unique even in that case.

Thats true. But what is the advantage over using the counter?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-16  9:14           ` Patrick McHardy
@ 2006-02-16  9:36             ` Jozsef Kadlecsik
  2006-02-16 20:09               ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-02-16  9:36 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Harald Welte, Netfilter Development Mailinglist,
	Pablo Neira Ayuso, Yasuyuki Kozakai

On Thu, 16 Feb 2006, Patrick McHardy wrote:

> Jozsef Kadlecsik wrote:
> > On Thu, 16 Feb 2006, Patrick McHardy wrote:
> >
> >>Jozsef Kadlecsik wrote:
> >>
> >>>Couldn't the id be simply equal to the jiffies when creating the conntrack
> >>>entry? (And adding time_after/before when comparing ids, of course.)
> >>
> >>We can create more than one entry per jiffy, so that wouldn't be unique
> >>anymore.
> >
> > (jiffies, tuples) would be unique even in that case.
>
> Thats true. But what is the advantage over using the counter?

Currently nothing :-).

However, the counter poses an internal bottleneck in the overall system
when one attempts to replace the hashtable by hashtrie. (What I really
like about hashtrie is that it naturally implements the 'region locking'
suggested by Patrick Schaaf, compared to the 'brute force' per bucket
locking.)

As I see, we have got three bottlenecks in conntrack in SMP systems:
the hashtable approach, the id and the expectation list. The first two
could be eliminated by hashtrie and jiffies as ids.

Some clever solution should only be found to spread the expectations
over multiple, separatedly locked buckets...

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-16  9:36             ` Jozsef Kadlecsik
@ 2006-02-16 20:09               ` Patrick McHardy
  2006-02-17  8:18                 ` Jozsef Kadlecsik
  2006-03-30  8:31                 ` Amin Azez
  0 siblings, 2 replies; 40+ messages in thread
From: Patrick McHardy @ 2006-02-16 20:09 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Harald Welte, Netfilter Development Mailinglist,
	Pablo Neira Ayuso, Yasuyuki Kozakai

Jozsef Kadlecsik wrote:
> On Thu, 16 Feb 2006, Patrick McHardy wrote:
> 
>>>(jiffies, tuples) would be unique even in that case.
>>
>>Thats true. But what is the advantage over using the counter?

Actually it would still not be unique if connections live
shorter than a jiffy and are resurrected.

> Currently nothing :-).
> 
> However, the counter poses an internal bottleneck in the overall system
> when one attempts to replace the hashtable by hashtrie. (What I really
> like about hashtrie is that it naturally implements the 'region locking'
> suggested by Patrick Schaaf, compared to the 'brute force' per bucket
> locking.)
> 
> As I see, we have got three bottlenecks in conntrack in SMP systems:
> the hashtable approach, the id and the expectation list. The first two
> could be eliminated by hashtrie and jiffies as ids.
> 
> Some clever solution should only be found to spread the expectations
> over multiple, separatedly locked buckets...

I've talked to Harald about this and as a start we can start by
allowing masks only for the source part of the tuple. That
means we can hash by destinations.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-16 20:09               ` Patrick McHardy
@ 2006-02-17  8:18                 ` Jozsef Kadlecsik
  2006-02-17  8:45                   ` Martin Josefsson
  2006-02-17  8:50                   ` [PATCH 4/4] first conntrack ID must be 1 not 2 Patrick McHardy
  2006-03-30  8:31                 ` Amin Azez
  1 sibling, 2 replies; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-02-17  8:18 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Harald Welte, Netfilter Development Mailinglist,
	Pablo Neira Ayuso, Yasuyuki Kozakai

On Thu, 16 Feb 2006, Patrick McHardy wrote:

> Jozsef Kadlecsik wrote:
> > On Thu, 16 Feb 2006, Patrick McHardy wrote:
> >
> >>>(jiffies, tuples) would be unique even in that case.
> >>
> >>Thats true. But what is the advantage over using the counter?
>
> Actually it would still not be unique if connections live
> shorter than a jiffy and are resurrected.

I can imagine such a situation only when the conntrack table is full and
we are forced to drop unassured connections - when we are in trouble
anyway.

> > Some clever solution should only be found to spread the expectations
> > over multiple, separatedly locked buckets...
>
> I've talked to Harald about this and as a start we can start by
> allowing masks only for the source part of the tuple. That
> means we can hash by destinations.

That's actually fine! Now I should complete that hashtrie/hashtable
testing I have been planning...

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-17  8:18                 ` Jozsef Kadlecsik
@ 2006-02-17  8:45                   ` Martin Josefsson
  2006-02-17  9:30                     ` Jozsef Kadlecsik
  2006-02-17  8:50                   ` [PATCH 4/4] first conntrack ID must be 1 not 2 Patrick McHardy
  1 sibling, 1 reply; 40+ messages in thread
From: Martin Josefsson @ 2006-02-17  8:45 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Harald Welte, Netfilter Development Mailinglist, Patrick McHardy,
	Pablo Neira Ayuso, Yasuyuki Kozakai

On Fri, 17 Feb 2006, Jozsef Kadlecsik wrote:

> > > Some clever solution should only be found to spread the expectations
> > > over multiple, separatedly locked buckets...
> >
> > I've talked to Harald about this and as a start we can start by
> > allowing masks only for the source part of the tuple. That
> > means we can hash by destinations.
>
> That's actually fine! Now I should complete that hashtrie/hashtable
> testing I have been planning...

Great that you are testing hashtrie, it works fine in all my simple tests
but there might still be a lot of bugs in it, especially in the forced
eviction, it might not be optimal, me and rusty tried to come up with a
way to make it fair wrt the depth in the tree we are evicting at.

If you have any suggestions and/or patches, please let me know :)

Are you using the 060104 version, that's the latest version I have of the
userspace code.
I have an old untested patch against nf_conntrack as well but it needs
some rewriting of the conntrack locking in order to avoid an SMP deadlock.

/Martin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-17  8:18                 ` Jozsef Kadlecsik
  2006-02-17  8:45                   ` Martin Josefsson
@ 2006-02-17  8:50                   ` Patrick McHardy
  1 sibling, 0 replies; 40+ messages in thread
From: Patrick McHardy @ 2006-02-17  8:50 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Harald Welte, Netfilter Development Mailinglist,
	Pablo Neira Ayuso, Yasuyuki Kozakai

Jozsef Kadlecsik wrote:
> On Thu, 16 Feb 2006, Patrick McHardy wrote:
> 
>>Actually it would still not be unique if connections live
>>shorter than a jiffy and are resurrected.
> 
> 
> I can imagine such a situation only when the conntrack table is full and
> we are forced to drop unassured connections - when we are in trouble
> anyway.

I guess it depends on how fast the everything is. Someone at the
workshop made the point that given how fast speed increases, it
will probably be a realistic problem at some point. But I was
never convinced of the current half-assed approach to a unique ID
anyway, its either unique or its not. So in my opinion, we could
eliminate it if we find a better solution for the dumping problem.

>>>Some clever solution should only be found to spread the expectations
>>>over multiple, separatedly locked buckets...
>>
>>I've talked to Harald about this and as a start we can start by
>>allowing masks only for the source part of the tuple. That
>>means we can hash by destinations.
> 
> 
> That's actually fine! Now I should complete that hashtrie/hashtable
> testing I have been planning...

I wanted to work on the expectation lookup thing soon, I'll let you
know once I have something working.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-17  8:45                   ` Martin Josefsson
@ 2006-02-17  9:30                     ` Jozsef Kadlecsik
  2006-02-17 18:41                       ` Jozsef Kadlecsik
  2006-03-04 20:11                       ` Hashtrie testing2 " Martin Josefsson
  0 siblings, 2 replies; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-02-17  9:30 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Harald Welte, Netfilter Development Mailinglist, Patrick McHardy,
	Pablo Neira Ayuso, Yasuyuki Kozakai

On Fri, 17 Feb 2006, Martin Josefsson wrote:

> > That's actually fine! Now I should complete that hashtrie/hashtable
> > testing I have been planning...
>
> Great that you are testing hashtrie, it works fine in all my simple tests
> but there might still be a lot of bugs in it, especially in the forced
> eviction, it might not be optimal, me and rusty tried to come up with a
> way to make it fair wrt the depth in the tree we are evicting at.
>
> If you have any suggestions and/or patches, please let me know :)

Here are the tests I'd like to see against hashtrie:

- larger hashentry, i.e. when HASHSIFT is equal to 6, 7 or 8
- other HASHALIGN, hashbit_t values: (32, u8), (64, u16) and (64, u32).
  The current (32, u8) doesn't look optimal on 64bit CPUs, (64, u16) seems
  to be the best, but without testing it's hard to choose.
- DoS against hashtrie by non-random tuples: single fixed destination IP,
  port and successive source IP, port numbers. (I don't think the current
  max 7 levels (childs) can survive such an attack.)

Somehow I don't really like the eviction algorithm. What about some lazy
auto-eviction instead: say, if there are more than 90% of the max
elements, then drop a (any) unassured connection which can be found on the
path when inserting a new one. Thus the current fixed stack could be
eliminated and there were no builtin limit in the depth.

I'm collecting the ideas, so can't submit patches yet ;-).

> I have an old untested patch against nf_conntrack as well but it needs
> some rewriting of the conntrack locking in order to avoid an SMP deadlock.

Rusty's lock-ordering is a sure solution, but penalizes the process.
I have been thinking on wether we could use simply, separated, unordered
locking:

1. lock bucket according to tuple1
	add element
   unlock bucket
2. lock bucket according to tuple2,
	add element
   unlock bucket
   if (add element failed)
	undo 1.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-17  9:30                     ` Jozsef Kadlecsik
@ 2006-02-17 18:41                       ` Jozsef Kadlecsik
  2006-03-04 16:23                         ` Hashtrie testing (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2) Martin Josefsson
  2006-03-04 20:11                       ` Hashtrie testing2 " Martin Josefsson
  1 sibling, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-02-17 18:41 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Harald Welte, Netfilter Development Mailinglist, Patrick McHardy,
	Pablo Neira Ayuso, Yasuyuki Kozakai

Hi Martin,

On Fri, 17 Feb 2006, Jozsef Kadlecsik wrote:

> I'm collecting the ideas, so can't submit patches yet ;-).

Two more things:

- the Jenkins hash internally produces 96 bits, which we could
  use in the hashtrie
- there should be some (whatever artifical) tests to study the
  dynamical behaviour of the hashtrie: say all the entries are
  added, then half of them deleted and added back again, but
  in reverse order. Will the maxdepth grow?

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Hashtrie testing (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-02-17 18:41                       ` Jozsef Kadlecsik
@ 2006-03-04 16:23                         ` Martin Josefsson
  2006-03-05  9:49                           ` Jozsef Kadlecsik
  0 siblings, 1 reply; 40+ messages in thread
From: Martin Josefsson @ 2006-03-04 16:23 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 3499 bytes --]

On Fri, 2006-02-17 at 19:41 +0100, Jozsef Kadlecsik wrote:

(Changing subject and removing some cc's)

> Hi Martin,

Hi Jozsef

> On Fri, 17 Feb 2006, Jozsef Kadlecsik wrote:
> 
> > I'm collecting the ideas, so can't submit patches yet ;-).

:)

I'll move my svn tree to the netfilter svn sometime soon.

> Two more things:
> 
> - the Jenkins hash internally produces 96 bits, which we could
>   use in the hashtrie

That's a great idea, will look into it.

> - there should be some (whatever artifical) tests to study the
>   dynamical behaviour of the hashtrie: say all the entries are
>   added, then half of them deleted and added back again, but
>   in reverse order. Will the maxdepth grow?

I've performed some simple tests with this and here are the results:

First I tried removing the upper half of the conntrack entries (it's the
conntrack entries index in the test array that's printed out in the
"Removing" line below)

Number of entries in hashtrie: 819200
Number of children in hashtrie: 27345
Maxdepth of hashtrie: 3 (0 == root)
Removing entries between 204800 and 409599.
Adding entries between 409599 and 204800.
insert (half reverse): time: 1484 cyc, 929 ns (1076184/s)
Number of entries in hashtrie: 819200
Number of children in hashtrie: 27345
Maxdepth of hashtrie: 3 (0 == root)

No diffrence at all which is a bit weird, I had to doublecheck the code
to see that I really readded the entries in reverse order and I did.

Then I tried it with the lower half of the conntrack entries, those
lower in the tree and got this:

Number of entries in hashtrie: 819200
Number of children in hashtrie: 27273
Maxdepth of hashtrie: 3 (0 == root)
Removing entries between 0 and 204799.
Adding entries between 204799 and 0.
insert (half reverse): time: 1627 cyc, 1018 ns (981869/s)
Number of entries in hashtrie: 819200
Number of children in hashtrie: 29189
Maxdepth of hashtrie: 3 (0 == root)

Here wee see that the number of child-nodes has increased from 27273 to
29189, that's a 7.0% increase, but the max depth hasn't increased in
this particular test. Given the right sitation the maxdepth will
probably increase, it's likely to occur sometimes since the number of
child-nodes has increased.

Then I tried to remove every other entry and then readding them in
reverse order (this test is performed after the previous readd of the
lower half, not the same testrun as above so the absolute numbers
doesn't match those above):

Number of entries in hashtrie: 819200
Number of children in hashtrie: 27329
Maxdepth of hashtrie: 3 (0 == root)
Removing entries between 0 and 204799.
Adding entries between 204799 and 0.
insert (half reverse): time: 1590 cyc, 996 ns (1004079/s)
Number of entries in hashtrie: 819200
Number of children in hashtrie: 29253
Maxdepth of hashtrie: 3 (0 == root)
Removing every other entry from 0 to 409599
Adding every other entry from 409599 to 0
insert (other reverse): time: 1637 cyc, 1025 ns (975239/s)
Number of entries in hashtrie: 819200
Number of children in hashtrie: 29376
Maxdepth of hashtrie: 3 (0 == root)

Here we see that the number of child-nodes increased in the
remove_lower_half_and_readd_in_reverse_order test just as it did before,
and then the number of child-nodes increased again after removing every
other entry and adding them back in reverse order. The total increase
after both tests is 7.5%

More testing is clearly needed.

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-02-17  9:30                     ` Jozsef Kadlecsik
  2006-02-17 18:41                       ` Jozsef Kadlecsik
@ 2006-03-04 20:11                       ` Martin Josefsson
  2006-03-05 11:24                         ` Jozsef Kadlecsik
  1 sibling, 1 reply; 40+ messages in thread
From: Martin Josefsson @ 2006-03-04 20:11 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 4717 bytes --]

On Fri, 2006-02-17 at 10:30 +0100, Jozsef Kadlecsik wrote:

> Here are the tests I'd like to see against hashtrie:
> 
> - larger hashentry, i.e. when HASHSIFT is equal to 6, 7 or 8

I have seen good results when sizeof(struct hashentry) = 64 on my
laptop. But on the other hand my laptop likes the padding of struct
hashentry at the beginning with gives unaligned pointers. It's an
Pentium M processor in my laptop.

> - other HASHALIGN, hashbit_t values: (32, u8), (64, u16) and (64, u32).

Yes this needs more testing as well.

>   The current (32, u8) doesn't look optimal on 64bit CPUs, (64, u16) seems
>   to be the best, but without testing it's hard to choose.

Currently I don't have any 64bit cpus to test things on.

> - DoS against hashtrie by non-random tuples: single fixed destination IP,
>   port and successive source IP, port numbers. (I don't think the current
>   max 7 levels (childs) can survive such an attack.)

I just ran some tests on this with this code:

dstip = 192 << 24 | 168 << 16 | 1 << 8 | 1;
srcip = 200 << 24 | 100 << 16 | 50 << 8 | 0;

conntrack[i]->tuple[IP_CT_DIR_ORIGINAL].src.ip = srcip + rand() % 1024;
conntrack[i]->tuple[IP_CT_DIR_ORIGINAL].src.u.tcp.port = (u16)rand();
conntrack[i]->tuple[IP_CT_DIR_ORIGINAL].dst.ip = dstip;
conntrack[i]->tuple[IP_CT_DIR_ORIGINAL].dst.u.tcp.port = 80;

And this is the result I got:

sizeof struct hashentry: 32
sizeof struct ip_conntrack: 48
number of conntrack entries: 1228800
number of slots per hashtrie bucket: 5
number of pad bytes: 3
number of bytes per child: 2048
insert: time: 11299 cyc, 18854 ns (53040/s)
Number of failed inserts: 22232
Number of entries in hashtrie: 2435368
Number of children in hashtrie: 238756
Maxdepth of hashtrie: 3 (0 == root)
Maximum memory usage: 488972288

2.4 million entries in the hashtrie with fixed dstip, port and 1024
srcip's with random ports and a maxdepth of 3. I'd say the jenkins hash
is doing its job quite nicely, wasn't expecting such good results.

In the results above you see my main worry regarding hashtrie, the
memory usage, 200 bytes per entry in hashtrie (400 bytes per conntrack)
in this scenario, which means that the child-nodes aren't very populated
in the leaf nodes.

2435368 entries / 238756 child-nodes = 10.2 entries/child-node

And there's 2048 / 32 = 64 struct hashentry per child-node.
That's 64 * 5 = 320 entries/child-node

10.2 / 320 = 3.2% usage of child-nodes in average which is simply
horrible.

I'm almost starting to think there's a major bug in there somewhere. 
I need to write some code to walk the tree and calculate the usage of
each level to see that it's actually 100% for all buckets that has a
child.

> Somehow I don't really like the eviction algorithm. What about some lazy
> auto-eviction instead: say, if there are more than 90% of the max
> elements, then drop a (any) unassured connection which can be found on the
> path when inserting a new one. Thus the current fixed stack could be
> eliminated and there were no builtin limit in the depth.

Couldn't this lead to the situation where we evict an entry early on the
path, and then that slot gets reused for another entry that's also
unassured, and it repeats...
The problem with eviction in the hashtrie is that the depth of the entry
has no correlation to the age of the entry like in the current hashtable
(that's only true if you have long linked lists which leads to poor
performance, so no real-world installation has properly working
evication anyway today so maybe it doesn't matter too much)

> > I have an old untested patch against nf_conntrack as well but it needs
> > some rewriting of the conntrack locking in order to avoid an SMP deadlock.
> 
> Rusty's lock-ordering is a sure solution, but penalizes the process.
> I have been thinking on wether we could use simply, separated, unordered
> locking:
> 
> 1. lock bucket according to tuple1
> 	add element
>    unlock bucket
> 2. lock bucket according to tuple2,
> 	add element
>    unlock bucket
>    if (add element failed)
> 	undo 1.

Yes this is the way I've been thinking about. If we get into the case
where we have conntrack A and B that have the exact same tuples but in
reverse it doesn't really matter if one gets dropped or both, this is
strictly best effort, no need to bend over backwards in order to
minimize racewindows if it isn't anything serious.
The only case I can think about where it might matter is the case of
simultaneous open from both sides with the same source/destination
ports, dns (some clients issue requests from port 53), games and ipsec
isakmp come to mind. 

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-04 16:23                         ` Hashtrie testing (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2) Martin Josefsson
@ 2006-03-05  9:49                           ` Jozsef Kadlecsik
  2006-03-05 13:24                             ` Martin Josefsson
  0 siblings, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-03-05  9:49 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist

Hi Martin,

On Sat, 4 Mar 2006, Martin Josefsson wrote:

> I'll move my svn tree to the netfilter svn sometime soon.

That'll be great!

> > - there should be some (whatever artifical) tests to study the
> >   dynamical behaviour of the hashtrie: say all the entries are
> >   added, then half of them deleted and added back again, but
> >   in reverse order. Will the maxdepth grow?
>
> I've performed some simple tests with this and here are the results:
>
> First I tried removing the upper half of the conntrack entries (it's the
> conntrack entries index in the test array that's printed out in the
> "Removing" line below)
>
> Number of entries in hashtrie: 819200
> Number of children in hashtrie: 27345
> Maxdepth of hashtrie: 3 (0 == root)
> Removing entries between 204800 and 409599.
> Adding entries between 409599 and 204800.
> insert (half reverse): time: 1484 cyc, 929 ns (1076184/s)
> Number of entries in hashtrie: 819200
> Number of children in hashtrie: 27345
> Maxdepth of hashtrie: 3 (0 == root)
>
> No diffrence at all which is a bit weird, I had to doublecheck the code
> to see that I really readded the entries in reverse order and I did.

That's really weird: same maxdepth and exactly the same number of
children!

> Then I tried it with the lower half of the conntrack entries, those
> lower in the tree and got this:
>
> Number of entries in hashtrie: 819200
> Number of children in hashtrie: 27273
> Maxdepth of hashtrie: 3 (0 == root)
> Removing entries between 0 and 204799.
> Adding entries between 204799 and 0.
> insert (half reverse): time: 1627 cyc, 1018 ns (981869/s)
> Number of entries in hashtrie: 819200
> Number of children in hashtrie: 29189
> Maxdepth of hashtrie: 3 (0 == root)
>
> Here wee see that the number of child-nodes has increased from 27273 to
> 29189, that's a 7.0% increase, but the max depth hasn't increased in
> this particular test. Given the right sitation the maxdepth will
> probably increase, it's likely to occur sometimes since the number of
> child-nodes has increased.

Half of the entries were removed then re-added and it produced 7% increase
in the number of the child nodes. That might be good or bad as well.

[...]
> More testing is clearly needed.

Yes, one is somehow uneasy. Hm. What about filling up the hashtrie and
then some long loop of deleting 10% at random and adding *new* random
entries? We could measure the change in maxdepth/childnodes after every
delete/add cycle. What were the peak numbers? At what numbers would it
stabilize?

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-04 20:11                       ` Hashtrie testing2 " Martin Josefsson
@ 2006-03-05 11:24                         ` Jozsef Kadlecsik
  2006-03-05 17:48                           ` Martin Josefsson
  0 siblings, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-03-05 11:24 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist

Hi Martin,

On Sat, 4 Mar 2006, Martin Josefsson wrote:

> On Fri, 2006-02-17 at 10:30 +0100, Jozsef Kadlecsik wrote:
>
> > Here are the tests I'd like to see against hashtrie:
> >
> > - larger hashentry, i.e. when HASHSIFT is equal to 6, 7 or 8
>
> I have seen good results when sizeof(struct hashentry) = 64 on my
> laptop. But on the other hand my laptop likes the padding of struct
> hashentry at the beginning with gives unaligned pointers. It's an
> Pentium M processor in my laptop.

HASHSIFT (i.e. HASHNUM) defines the hash size of the hashtrie.
But the hashtrie - according to your tests - are much "wider" than
"deep". So there might be no point in widening the hash.

> > - other HASHALIGN, hashbit_t values: (32, u8), (64, u16) and (64, u32).
>
> Yes this needs more testing as well.
>
> >   The current (32, u8) doesn't look optimal on 64bit CPUs, (64, u16) seems
> >   to be the best, but without testing it's hard to choose.
>
> Currently I don't have any 64bit cpus to test things on.

With the default (32, u8) settings NUMENTRY looks too small on 64bit
CPUs:

HASHALIGN = 32
hashbits_t = u8

    32bit CPU:  NUMENTRY = (32 - 4)/(1+4) = 5
                PADNUM = 32 - 4 - 5*(1+4) = 3

    64bit CPU:  NUMENTRY = (32 - 8)/(1+8) = 2
                PADNUM = 32 - 8 - 2*(1+8) = 6

HASHALIGN = 64
hashbits_t = u16

    32bit CPU:  NUMENTRY = (64 - 4)/(2+4) = 10
                PADNUM = 64 - 4 - 10*(2+4) = 0

    64bit CPU:  NUMENTRY = (64 - 8)/(2+8) = 5
                PADNUM = 64 - 8 - 5*(2+8) = 6

HASHALIGN = 64
hashbits_t = u32

    32bit CPU:  NUMENTRY = (64 - 4)/(4+4) = 7
                PADNUM = 64 - 4 - 7*(4+4) = 4

    64bit CPU:  NUMENTRY = (64 - 8)/(4+8) = 4
                PADNUM = 64 - 8 - 4*(4+8) = 8

> > - DoS against hashtrie by non-random tuples: single fixed destination IP,
> >   port and successive source IP, port numbers. (I don't think the current
> >   max 7 levels (childs) can survive such an attack.)
>
> I just ran some tests on this with this code:
[...]
> And this is the result I got:
>
> sizeof struct hashentry: 32
> sizeof struct ip_conntrack: 48
> number of conntrack entries: 1228800
> number of slots per hashtrie bucket: 5
> number of pad bytes: 3
> number of bytes per child: 2048
> insert: time: 11299 cyc, 18854 ns (53040/s)
> Number of failed inserts: 22232
> Number of entries in hashtrie: 2435368
> Number of children in hashtrie: 238756
> Maxdepth of hashtrie: 3 (0 == root)
> Maximum memory usage: 488972288
>
> 2.4 million entries in the hashtrie with fixed dstip, port and 1024
> srcip's with random ports and a maxdepth of 3. I'd say the jenkins hash
> is doing its job quite nicely, wasn't expecting such good results.

The hashtrie tends to grow wider and not deeper:

level:		childs (max)	entries (max)
0		64		320
1		4160		20800
2		266304		1331520
3		17043520	85217600

> In the results above you see my main worry regarding hashtrie, the
> memory usage, 200 bytes per entry in hashtrie (400 bytes per conntrack)
> in this scenario, which means that the child-nodes aren't very populated
> in the leaf nodes.
>
> 2435368 entries / 238756 child-nodes = 10.2 entries/child-node
>
> And there's 2048 / 32 = 64 struct hashentry per child-node.
> That's 64 * 5 = 320 entries/child-node
>
> 10.2 / 320 = 3.2% usage of child-nodes in average which is simply
> horrible.
>
> I'm almost starting to think there's a major bug in there somewhere.
> I need to write some code to walk the tree and calculate the usage of
> each level to see that it's actually 100% for all buckets that has a
> child.

Yes, that could help to spot the problem. But if there are clashes
(due to using to small parts of the hash key, i.e  hashbits_t = u8), then
the node can be fairly low utilized and still there is a need to expand by
new childs.

> > Somehow I don't really like the eviction algorithm. What about some lazy
> > auto-eviction instead: say, if there are more than 90% of the max
> > elements, then drop a (any) unassured connection which can be found on the
> > path when inserting a new one. Thus the current fixed stack could be
> > eliminated and there were no builtin limit in the depth.
>
> Couldn't this lead to the situation where we evict an entry early on the
> path, and then that slot gets reused for another entry that's also
> unassured, and it repeats...

You mean, we could always evict the newest unassured entries instead of
the oldest ones? The algorithm could be extended so that we'd walk the
whole tree and register the oldest unassured entry in the path. I'm more
worried that we'd check too litle number of entries as the tree is shorter
than I thought.

> > I have been thinking on wether we could use simply, separated, unordered
> > locking:
[...]
> The only case I can think about where it might matter is the case of
> simultaneous open from both sides with the same source/destination
> ports, dns (some clients issue requests from port 53), games and ipsec
> isakmp come to mind.

That can be fixed by checking that we undo the insertion for the same
conntrack entry we were unable to add at the second step.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-05  9:49                           ` Jozsef Kadlecsik
@ 2006-03-05 13:24                             ` Martin Josefsson
  0 siblings, 0 replies; 40+ messages in thread
From: Martin Josefsson @ 2006-03-05 13:24 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 3144 bytes --]

On Sun, 2006-03-05 at 10:49 +0100, Jozsef Kadlecsik wrote:
> Hi Martin,

Hi Jozsef

> > I'll move my svn tree to the netfilter svn sometime soon.
> 
> That'll be great!

After a small struggle with 'svnadmin dump', 'svndumpfilter' and 'sed'
it's finally there with all history.

trunk/hashtrie

Enjoy and don't laugh too much at all the early experimentation :)

> > First I tried removing the upper half of the conntrack entries (it's the
> > conntrack entries index in the test array that's printed out in the
> > "Removing" line below)
> >
> > Number of entries in hashtrie: 819200
> > Number of children in hashtrie: 27345
> > Maxdepth of hashtrie: 3 (0 == root)
> > Removing entries between 204800 and 409599.
> > Adding entries between 409599 and 204800.
> > insert (half reverse): time: 1484 cyc, 929 ns (1076184/s)
> > Number of entries in hashtrie: 819200
> > Number of children in hashtrie: 27345
> > Maxdepth of hashtrie: 3 (0 == root)
> >
> > No diffrence at all which is a bit weird, I had to doublecheck the code
> > to see that I really readded the entries in reverse order and I did.
> 
> That's really weird: same maxdepth and exactly the same number of
> children!

Yes I thought so too, I even added a bunch of debug printf's to the code
to verify that it actually readded them in reverse order, and it did.

> > Then I tried it with the lower half of the conntrack entries, those
> > lower in the tree and got this:
> >
> > Number of entries in hashtrie: 819200
> > Number of children in hashtrie: 27273
> > Maxdepth of hashtrie: 3 (0 == root)
> > Removing entries between 0 and 204799.
> > Adding entries between 204799 and 0.
> > insert (half reverse): time: 1627 cyc, 1018 ns (981869/s)
> > Number of entries in hashtrie: 819200
> > Number of children in hashtrie: 29189
> > Maxdepth of hashtrie: 3 (0 == root)
> >
> > Here wee see that the number of child-nodes has increased from 27273 to
> > 29189, that's a 7.0% increase, but the max depth hasn't increased in
> > this particular test. Given the right sitation the maxdepth will
> > probably increase, it's likely to occur sometimes since the number of
> > child-nodes has increased.
> 
> Half of the entries were removed then re-added and it produced 7% increase
> in the number of the child nodes. That might be good or bad as well.

Yes I don't know either :)
In these tests both source and destination ipaddresses are random,
that's usually not the case in most installations.
With fixed dst ip, port and 1024 sourceipaddresses and random src ports
it increases by 7.2%

> [...]
> > More testing is clearly needed.
> 
> Yes, one is somehow uneasy. Hm. What about filling up the hashtrie and
> then some long loop of deleting 10% at random and adding *new* random
> entries? We could measure the change in maxdepth/childnodes after every
> delete/add cycle. What were the peak numbers? At what numbers would it
> stabilize?

I'll test this as well during the evening.
Removing 10% of the entries at random, and then give those entries new
random values and readd them.

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-05 11:24                         ` Jozsef Kadlecsik
@ 2006-03-05 17:48                           ` Martin Josefsson
  2006-03-06 13:15                             ` Jozsef Kadlecsik
  0 siblings, 1 reply; 40+ messages in thread
From: Martin Josefsson @ 2006-03-05 17:48 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 7276 bytes --]

On Sun, 2006-03-05 at 12:24 +0100, Jozsef Kadlecsik wrote:
> Hi Martin,

Hi Jozsef

> HASHSIFT (i.e. HASHNUM) defines the hash size of the hashtrie.
> But the hashtrie - according to your tests - are much "wider" than
> "deep". So there might be no point in widening the hash.

HASHSHIFT and HASHNUM controls the number of struct hashentry per child.
The original idea was to allocate 1 page (4kB) for each child but I got
better results with 2kB, which means it isn't as wide as it was from the
start.

> > > - other HASHALIGN, hashbit_t values: (32, u8), (64, u16) and (64, u32).
> >
> > Yes this needs more testing as well.
> >
> > >   The current (32, u8) doesn't look optimal on 64bit CPUs, (64, u16) seems
> > >   to be the best, but without testing it's hard to choose.
> >
> > Currently I don't have any 64bit cpus to test things on.

HASHALIGN is the size of struct hashentry, the idea is that HASHALIGN
should be equal to the cacheline size of the cpu it's running on. That
way you only get one cachemiss per level in the tree.
HASHALIGN is set to 32 bytes just because my laptop has 32byte
cachelines, other cpus has 64bytes or even 128bytes.

hashbit_t is an u8 because in my tests I havn't see any gain by going to
more bits.
The only purpose of hashbits is to "cheat" a bit during lookup, store a
part of the hash-value in the datastructure so you can easily determine
if a member even can be a possible match by comparing the hashbits to
the hashvalue you are searching for, this eliminates a lot of
cachemisses (without this the datastructure will behave more like a
linked list).

I've tested with diffrent number of bits for hashbits but I havn't seen
any real gain by going over 7 bits which is what is currently used, and
the last bit is used as status bit for ASSURED.

> With the default (32, u8) settings NUMENTRY looks too small on 64bit
> CPUs:

Yes it's too small, luckily I don't know of any 64bit cpus that have
32byte cachelines :)
Lets say we keep hashbits as an u8 then we get the following results for
diffrent cacheline sizes (I believe the intel p4 has 128byte cachelines)

HASHALIGN = 32
hashbits_t = u8

    32bit CPU:  NUMENTRY = (32 - 4)/(1+4) = 5
                PADNUM = 32 - 4 - 5*(1+4) = 3

    64bit CPU:  NUMENTRY = (32 - 8)/(1+8) = 2
                PADNUM = 32 - 8 - 2*(1+8) = 6


HASHALIGN = 64
hashbits_t = u8

    32bit CPU: NUMENTRY = (64 - 4)/(1 + 4) = 12
                       PADNUM = 64 - 4 - 12*(1 + 4) = 0

    64bit CPU: NUMENTRY = (64 - 8)/(1 + 8) = 6
                       PADNUM = 64 - 8 - 6*(1 + 8) = 2

HASHALIGN = 128
hashbits_t = u8

    32bit CPU: NUMENTRY = (128 - 4)/(1 +4) = 24
                       PADNUM = 128 - 4 - 24*(1 + 4) = 4

    64bit CPU: NUMENTRY = (128 - 8)/(1 + 8) = 13
                       PADNUM = 128 - 8 - 13*(1 + 8) = 3

When HASHALIGN goes up we need to reduce HASHSHIFT in order to make sure
we don't try to allocate > 1 page (usually 4kB) of memory.

> > > - DoS against hashtrie by non-random tuples: single fixed destination IP,
> > >   port and successive source IP, port numbers. (I don't think the current
> > >   max 7 levels (childs) can survive such an attack.)
[snip]
> > 2.4 million entries in the hashtrie with fixed dstip, port and 1024
> > srcip's with random ports and a maxdepth of 3. I'd say the jenkins hash
> > is doing its job quite nicely, wasn't expecting such good results.
> 
> The hashtrie tends to grow wider and not deeper:
> 
> level:		childs (max)	entries (max)
> 0		64		320
> 1		4160		20800
> 2		266304		1331520
> 3		17043520	85217600

It grows wider if and only if we get hashvalues that aren't too similar
in the high bits. If that happens we grow deeper instead of wider. But
it seems the jenkins hash is really good at distributing the changes
over all of the bits in the hashvalue. (it's just a little bit on the
heavy side, it uses plenty of cpu)

> > I'm almost starting to think there's a major bug in there somewhere.
> > I need to write some code to walk the tree and calculate the usage of
> > each level to see that it's actually 100% for all buckets that has a
> > child.
> 
> Yes, that could help to spot the problem. But if there are clashes
> (due to using to small parts of the hash key, i.e  hashbits_t = u8), then
> the node can be fairly low utilized and still there is a need to expand by
> new childs.

Small hashbits_t doesn't have anything to do with it, hashbits_t is just
for accelerating lookups. If we have a low number of struct hashentry
per child-node we'll probably fill them up faster since there's fewer of
them to be filled. And if we have a low NUMENTRY they will also be
filled faster.

> > Couldn't this lead to the situation where we evict an entry early on the
> > path, and then that slot gets reused for another entry that's also
> > unassured, and it repeats...
> 
> You mean, we could always evict the newest unassured entries instead of
> the oldest ones? The algorithm could be extended so that we'd walk the
> whole tree and register the oldest unassured entry in the path. I'm more
> worried that we'd check too litle number of entries as the tree is shorter
> than I thought.

The problem is how do you know which entry along a path is the oldest?
Entries aren't always added to the front like with the current linked
lists in the hashtable.

> > > I have been thinking on wether we could use simply, separated, unordered
> > > locking:
> [...]
> > The only case I can think about where it might matter is the case of
> > simultaneous open from both sides with the same source/destination
> > ports, dns (some clients issue requests from port 53), games and ipsec
> > isakmp come to mind.
> 
> That can be fixed by checking that we undo the insertion for the same
> conntrack entry we were unable to add at the second step.

What I meant is the following situation:

clients A & B are both located behind NAT and are trying to connect to
each other.
Both bind to the same port, lets say X and try to connect to each others
external ip.
Currently if the packets from both clients arrive at the same time at an
SMP firewall we add a conntrack entry for one of the clients but drop
the second one because of the global lock.
With Rustys lock ordering we keep this behaviour since both buckets are
locked in the same order for both packets.
With the "unordered independent locking" of buckets we might end up
with:

these happen at the same time:
packet from client A hashes into bucket C, lock and add entry.
packet from client B hashes into bucket D, lock and add entry.

and after that these happen at the same time:
packet from client A hashes into bucket D, lock and find entry, remove
entry in bucket C
packet from client B hashes into bucket C, lock and find entry, remove
entry in bucket D

This leads to no conntrack entry created for any of the clients
connection attempts.

The solution would be "ordered independent locking" which locks and adds
entries based on bucket number that the tuplex hash into.
That way we still keep the current behaviour while simplifying the
locking, no chance of deadlocks.

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-05 17:48                           ` Martin Josefsson
@ 2006-03-06 13:15                             ` Jozsef Kadlecsik
  2006-03-07 18:33                               ` Martin Josefsson
  0 siblings, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-03-06 13:15 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist

Hi Martin

On Sun, 5 Mar 2006, Martin Josefsson wrote:

> HASHSHIFT and HASHNUM controls the number of struct hashentry per child.
> The original idea was to allocate 1 page (4kB) for each child but I got
> better results with 2kB, which means it isn't as wide as it was from the
> start.

Did you get faster lookups/insertions/deletions with half of the page
size?

> HASHALIGN is the size of struct hashentry, the idea is that HASHALIGN
> should be equal to the cacheline size of the cpu it's running on. That
> way you only get one cachemiss per level in the tree.
> HASHALIGN is set to 32 bytes just because my laptop has 32byte
> cachelines, other cpus has 64bytes or even 128bytes.

As far as I see, the structure should satisfy "contradicting"
requirements. We want an as low tree as possible, thus minimising the cost
of the operations. That includes the number of levels of the childs and
the number of slots in a bucket and the latter depends on the HASHALIGN,
hashbits_t parameters and the members of the hashentry structure. So in
theory we should maximize HASHSHIFT and minimize NUMENTRY. However we do
not want a preallocated, huge, sparse hash either, and therefore it is
hard to tell the best values.

> hashbit_t is an u8 because in my tests I havn't see any gain by going to
> more bits.
> The only purpose of hashbits is to "cheat" a bit during lookup, store a
> part of the hash-value in the datastructure so you can easily determine
> if a member even can be a possible match by comparing the hashbits to
> the hashvalue you are searching for, this eliminates a lot of
> cachemisses (without this the datastructure will behave more like a
> linked list).

This structure is quite tricky :-).

> > Yes, that could help to spot the problem. But if there are clashes
> > (due to using to small parts of the hash key, i.e  hashbits_t = u8), then
> > the node can be fairly low utilized and still there is a need to expand by
> > new childs.
>
> Small hashbits_t doesn't have anything to do with it, hashbits_t is just
> for accelerating lookups. If we have a low number of struct hashentry
> per child-node we'll probably fill them up faster since there's fewer of
> them to be filled. And if we have a low NUMENTRY they will also be
> filled faster.

But how can we explain the so few entries in the buckets? Your numbers
suggest that most of the buckets are pretty empty, as if there were two
slots full from the 64 in the DoS case.

Or we can say that the DoS has got an effect: the jenkins hash were
unable to produce good enough keys. Or using the 32 bits from the 96
internal ones is not sufficient.

But in the non-DoS random pattern case, there was 819200 entries / 27345
child-nodes =~ 29 entries/child-nodes, that's still around 10%
utilization. So it looks it's not the jenkins hash which produces the
sparse tree.

> The problem is how do you know which entry along a path is the oldest?
> Entries aren't always added to the front like with the current linked
> lists in the hashtable.

We could check the timer of the entry: the oldest one is that which would
time out nearest in the future.

[...]
> these happen at the same time:
> packet from client A hashes into bucket C, lock and add entry.
> packet from client B hashes into bucket D, lock and add entry.
>
> and after that these happen at the same time:
> packet from client A hashes into bucket D, lock and find entry, remove
> entry in bucket C
> packet from client B hashes into bucket C, lock and find entry, remove
> entry in bucket D
>
> This leads to no conntrack entry created for any of the clients
> connection attempts.
>
> The solution would be "ordered independent locking" which locks and adds
> entries based on bucket number that the tuplex hash into.
> That way we still keep the current behaviour while simplifying the
> locking, no chance of deadlocks.

That looks perfect!

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-06 13:15                             ` Jozsef Kadlecsik
@ 2006-03-07 18:33                               ` Martin Josefsson
  2006-03-08  6:34                                 ` Patrick Schaaf
                                                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Martin Josefsson @ 2006-03-07 18:33 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 5339 bytes --]

On Mon, 2006-03-06 at 14:15 +0100, Jozsef Kadlecsik wrote:
> Hi Martin

Hi Jozsef

> > HASHSHIFT and HASHNUM controls the number of struct hashentry per child.
> > The original idea was to allocate 1 page (4kB) for each child but I got
> > better results with 2kB, which means it isn't as wide as it was from the
> > start.
> 
> Did you get faster lookups/insertions/deletions with half of the page
> size?

Yes I did, but I don't trust my testresults 100% (run on my laptop)
because of the fact that my laptop gets better results if I put the
padding at the start of struct hashentry than if I put the padding at
the end.

> As far as I see, the structure should satisfy "contradicting"
> requirements. We want an as low tree as possible, thus minimising the cost
> of the operations. That includes the number of levels of the childs and
> the number of slots in a bucket and the latter depends on the HASHALIGN,
> hashbits_t parameters and the members of the hashentry structure. So in
> theory we should maximize HASHSHIFT and minimize NUMENTRY. However we do
> not want a preallocated, huge, sparse hash either, and therefore it is
> hard to tell the best values.

Maximizing HASHSHIFT makes us allocate a lot for each node.
And minimizing HASHNUM makes struct hashentry smaller, thus giving a
_lot_ of buckets in each node.
Will having say 200 buckets with 3 entries in each bucket be better than
50 buckets with 12 entries in each bucket? Same number of entries in the
node but one child-pointer per bucket, which might lead to even more
child-nodes in the case where we have 200 buckets that fill up rather
quick since they only hold a small number of entries. But on the other
hand we hash to 200 buckets instead of 50 buckets so the total fill-rate
of the node is the same for both cases.
And having lots of entries in a bucket doesn't really cause a
performance degradation as long as you keep the bucket not larger than
the cacheline size and it has to be cacheline aligned.

Having lots of buckets will give an even flatter tree.
I think I've tested this as well about a year ago but I can't remember
the results.

> This structure is quite tricky :-).

Yes it is :) Image how I felt when I got it described to me in
conversations on irc.
There's been a lot of experimentations and failures along the path and
there's still a lot of things to test and rewrite.

> > Small hashbits_t doesn't have anything to do with it, hashbits_t is just
> > for accelerating lookups. If we have a low number of struct hashentry
> > per child-node we'll probably fill them up faster since there's fewer of
> > them to be filled. And if we have a low NUMENTRY they will also be
> > filled faster.
> 
> But how can we explain the so few entries in the buckets? Your numbers
> suggest that most of the buckets are pretty empty, as if there were two
> slots full from the 64 in the DoS case.

Yes, it's weird. Havn't had time to write the code to traverse the
produce stats from the tree yet, who knows, tonight may be the night
when that happens :)

> Or we can say that the DoS has got an effect: the jenkins hash were
> unable to produce good enough keys. Or using the 32 bits from the 96
> internal ones is not sufficient.

That might be the case.

> But in the non-DoS random pattern case, there was 819200 entries / 27345
> child-nodes =~ 29 entries/child-nodes, that's still around 10%
> utilization. So it looks it's not the jenkins hash which produces the
> sparse tree.

I'll see what I can dig out from the entries in the nodes, it might be
very unbalanced.

> > The problem is how do you know which entry along a path is the oldest?
> > Entries aren't always added to the front like with the current linked
> > lists in the hashtable.
> 
> We could check the timer of the entry: the oldest one is that which would
> time out nearest in the future.

Ouch, this is exactly what we don't want to do. If we are under attack
we don't want to start looking at entries to determine which entries to
evict, that would result in lots of cachemisses which slows things down
a lot.
The current eviction algorithm tries to evict entries along the path
that the new entry will use, and respect the level of the entries so you
don't end up evicting entries from the root-node all the time, those
free slots will get filled up with new entries rather quickly and we
don't want to evict things we probably just added.
But that doesn't say the current eviction algorithm does the right
thing, it was just something me and Rusty came up with during the
workshop as an aproximation.
Fast batch-killing of entries, but try to not evict from the often
walked parts of the path too often. It is best-effort, no need to bend
over backwards just to do the right thing in all situations, it's not
worth it in the end.

> > The solution would be "ordered independent locking" which locks and adds
> > entries based on bucket number that the tuplex hash into.
> > That way we still keep the current behaviour while simplifying the
> > locking, no chance of deadlocks.
> 
> That looks perfect!

It came to me after writing the description of the "unordered
independent locking", it hit me like a punch in the face, it was so
simple.

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-07 18:33                               ` Martin Josefsson
@ 2006-03-08  6:34                                 ` Patrick Schaaf
  2006-03-12 18:49                                 ` Martin Josefsson
  2006-03-30  8:28                                 ` Hashtrie testing2, dancing trees Amin Azez
  2 siblings, 0 replies; 40+ messages in thread
From: Patrick Schaaf @ 2006-03-08  6:34 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist, Jozsef Kadlecsik

> > This structure is quite tricky :-).
> 
> Yes it is :) Image how I felt when I got it described to me in
> conversations on irc.

Hmm. Evoke this feeling again. :) Now I'll describe an easy way (haha)
to get rid of half of the tuples, for the case of no-NAT. I did that
before, but if you are now revisiting conntracking basics, maybe
it's time to retell:

Notice that for the non-NAT case, the two tuples of a connection
are fully symmetric in src(ip|port) vs. dst(ip|port). For ease
of exposition, imagine each (ip|port) to be an 48-bit number.

Given two numbers A and B, a function can easily be implemented:

	symsort: N x N -> N x N x Z2
		given (a, b), if a <= b, result is (a, b, 0)
		given (a, b), if a  > b, result is (b, a, 1)

Given the symmetric tuples from both halves of one connection,
this function returns the same A/B results in both cases,
with a different flip bit (third result).

So, when doing the hash lookup, run symsort first, and look for the resulting
"sorted" tuple.

The flip bit output of symsort needs to be taken into
account when classifying original / other destination. Details...

One would have a single tuple, instead of two, in struct ip_conntrack.
For NATted connections, an additional tuple would be allocated on
demand, referenced from ip_conntrack by a pointer, and also put into
the hashes.

When a non-NAT conntrack gets deleted, only one hash entry needs
to be removed, avoiding the need for complex lock ordering schemes
in that place. You can always tell you have this case, by looking
whether the NAT-tuple-pointer is != 0.

Downsides:
- complexity
- second tuple in separate memory, in the NAT case. Maybe more cache
  misses in some places

Upsides in the non-NAT case:
- half the memory for tuples
- half the number of hash insertions and deletions with their assorted
cache influences

best regards
  Patrick

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-07 18:33                               ` Martin Josefsson
  2006-03-08  6:34                                 ` Patrick Schaaf
@ 2006-03-12 18:49                                 ` Martin Josefsson
  2006-03-14 11:35                                   ` Jozsef Kadlecsik
  2006-03-23 11:27                                   ` Jozsef Kadlecsik
  2006-03-30  8:28                                 ` Hashtrie testing2, dancing trees Amin Azez
  2 siblings, 2 replies; 40+ messages in thread
From: Martin Josefsson @ 2006-03-12 18:49 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 2922 bytes --]

On Tue, 2006-03-07 at 19:33 +0100, Martin Josefsson wrote:

Hi Jozsef

> > But in the non-DoS random pattern case, there was 819200 entries / 27345
> > child-nodes =~ 29 entries/child-nodes, that's still around 10%
> > utilization. So it looks it's not the jenkins hash which produces the
> > sparse tree.
> 
> I'll see what I can dig out from the entries in the nodes, it might be
> very unbalanced.

I've added some very simple debug-code and here's what I've found so far
when testing with random src/dst ip/port.

number of slots per hashtrie bucket: 5
...
Depth 4 - Number of children 0
Depth 4 - Number of buckets 0
Depth 4 - Number of not full buckets 0

Depth 3 - Number of children 0
Depth 3 - Number of buckets 7801856
Depth 3 - Number of entries 291689 (0%)
Depth 3 - Number of not full buckets 7801856

Depth 2 - Number of children 121904
Depth 2 - Number of buckets 262144
Depth 2 - Number of entries 1121175 (85%)
Depth 2 - Number of not full buckets 99698

Depth 1 - Number of children 4096
Depth 1 - Number of buckets 4096
Depth 1 - Number of entries 20480 (100%)
Depth 1 - Number of not full buckets 0

Depth 0 - Number of children 64
Depth 0 - Number of buckets 64
Depth 0 - Number of entries 256 (100%)
Depth 0 - Number of not full buckets 0

If we look at depth 2 we can see the following:

We have 121904 buckets that are "overfull", thus leading to an
expansion.
(121904 * 5 + 291689) / (121904 * 5) = 1.48 = 148%

And we have 262144 - 121904 - 99698 = 40542 buckets that are 100% full.

And we have 99698 not full buckets with a number of entries:
1121175 - (121904 + 40542) * 5 = 308945 entries
308945 / (99698 * 5) = 0.62 = 62% usage

Summary:
121904 buckets that are 148% used.
40542 buckets that are 100% used.
99698 buckets that are 62% used.

So the jenkins hash doesn't seem to be able to distribute the entries
very well.

This leads to the allocation of 121904 children for just 291689 entries,
that's a huge waste of memory. But even if we are able to come up with a
better hash algorithm for this we'll still have this problem when we
start expanding deeper, the size of the tree simply explodes. One idea
I've played with earlier is to change the number of buckets per child
depending on the depth, so we have larger children at the top of the
tree and then they get smaller further down. I'll have to revive that
code (should be somewhere in svn) in order to say if it helped the
memory usage or not, I know it made things a bit slower.

I've also tried making the buckets larger and decreasing HASHSHIFT in
order to not exceed 4kB allocations, this allows the buckets to absorb
the imperfectness of the hashing and memory usage goes down a _lot_.
But... this is will simply destroy performance on machines with small
cachelines as it will lead to lots of cachemisses per bucket.

Any ideas?

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-12 18:49                                 ` Martin Josefsson
@ 2006-03-14 11:35                                   ` Jozsef Kadlecsik
  2006-03-23 11:27                                   ` Jozsef Kadlecsik
  1 sibling, 0 replies; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-03-14 11:35 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist

Hi Martin,

On Sun, 12 Mar 2006, Martin Josefsson wrote:

> I've added some very simple debug-code and here's what I've found so far
> when testing with random src/dst ip/port.
[...]
> Summary:
> 121904 buckets that are 148% used.
> 40542 buckets that are 100% used.
> 99698 buckets that are 62% used.
>
> So the jenkins hash doesn't seem to be able to distribute the entries
> very well.

I wrote first such words out of frustation, but actually I think the
jenkins hash is fine. It was selected after a lot of testing, checking,
comparison. There might of course be flaws in it, but

> This leads to the allocation of 121904 children for just 291689 entries,
> that's a huge waste of memory. But even if we are able to come up with a
> better hash algorithm for this we'll still have this problem when we
> start expanding deeper, the size of the tree simply explodes.

we may simply see the consequences of the inherent behaviour of hashtree
*and* the goodness of the jenkins hash: as part of the hash keys clash
(they do because HASHSHIFT is relatively small) and as the tree grows,
there will be too many empty branches (the whole hash key is pretty
unique, there are to *few* near-clashes at the higher levels). But that's
an assumption only.

> One idea I've played with earlier is to change the number of buckets per
> child depending on the depth, so we have larger children at the top of
> the tree and then they get smaller further down. I'll have to revive
> that code (should be somewhere in svn) in order to say if it helped the
> memory usage or not, I know it made things a bit slower.

Usually we trade speed for memory. Double hashing does not help really
either, because it slows down the lookups too :-( - I learnt it at ipset.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-12 18:49                                 ` Martin Josefsson
  2006-03-14 11:35                                   ` Jozsef Kadlecsik
@ 2006-03-23 11:27                                   ` Jozsef Kadlecsik
  2006-03-23 21:07                                     ` Martin Josefsson
  1 sibling, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-03-23 11:27 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist

Hi Martin,

On Sun, 12 Mar 2006, Martin Josefsson wrote:

[...]
> I've also tried making the buckets larger and decreasing HASHSHIFT in
> order to not exceed 4kB allocations, this allows the buckets to absorb
> the imperfectness of the hashing and memory usage goes down a _lot_.
> But... this is will simply destroy performance on machines with small
> cachelines as it will lead to lots of cachemisses per bucket.
>
> Any ideas?

I took your code and messed with it: added input parameters
(hashsize/number of elements), HASHNUM-dependent variants of hashtrie,
added Paul Hsiesh's superfasthash function and an implementation of the
shared-node fast hash from a SIGCOMM article. Then ran the different
methods with increasing number of elements (table is half, full, two and
four times more entry than hashsize) and created some graphs by gnuplot.

The results are at http://people.netfilter.org/kadlec/hashtrie/

What I could conclude is that hashtable requires the least memory and the
fastest at inserting elements. Hashtrie is better in looking up entries
when there are more than two times entries in the hash than the hashsize
of the hashtable.

Something is required to improve hashtrie, so that it'd require less
memory. Without losing speed, but possibly gaining, at insertion.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-23 11:27                                   ` Jozsef Kadlecsik
@ 2006-03-23 21:07                                     ` Martin Josefsson
  2006-03-25  8:39                                       ` Jozsef Kadlecsik
  0 siblings, 1 reply; 40+ messages in thread
From: Martin Josefsson @ 2006-03-23 21:07 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 3674 bytes --]

On Thu, 2006-03-23 at 12:27 +0100, Jozsef Kadlecsik wrote:
> Hi Martin,

Hi Jozsef

> I took your code and messed with it: added input parameters
> (hashsize/number of elements), HASHNUM-dependent variants of hashtrie,
> added Paul Hsiesh's superfasthash function and an implementation of the
> shared-node fast hash from a SIGCOMM article. Then ran the different
> methods with increasing number of elements (table is half, full, two and
> four times more entry than hashsize) and created some graphs by gnuplot.

Wonderful, finally someone playing with it :)

> The results are at http://people.netfilter.org/kadlec/hashtrie/

Wow, that's some nice work. Must have taken quite a bit of time to
rewrite my crappy code (started out as a single small .c file :) into
something useful for testing and then performing all those tests.

A lot of your changes should go into svn, you have full freedom to
change whatever you want.

I see that you have tested with HASHSHIFT 3, 4, 5, 6 and a 32byte
cachelinesize.
That gives:

HASHSHIFT = 3
	HASHNUM (buckets per child) = 8
	childsize = HASHNUM * bucketsize (cachelinesize) = 256 bytes

HASHSHIFT = 4
	HASHNUM (buckets per child) = 16
	childsize = HASHNUM * bucketsize (cachelinesize) = 512 bytes

HASHSHIFT = 5
	HASHNUM (buckets per child) = 32
	childsize = HASHNUM * bucketsize (cachelinesize) = 1024 bytes

HASHSHIFT = 6
	HASHNUM (buckets per child) = 64
	childsize = HASHNUM * bucketsize (cachelinesize) = 2048 bytes

The original idea was to allocate PAGE_SIZE (4096 on i386) bytes at a
time so it would be interesting to test HASHSHIFT = 7 (assuming a 32byte
cachelinesize)

Another idea is to have varible number of buckets (HASHNUM) in the
nodes, more buckets at the top of the trie and then decrease the number
of buckets as we descend deeper into the trie. I've tried this earlier
but with lowered performance, I don't remember anything about the change
in memory-usage, and I don't trust my memory so it should be tested
again.

HASHSHIFT=8 or 9 or even larger at depth 0 would be interesting to test.

Allocating more than PAGE_SIZE makes it an order-1 or larger allocation
in kernel which is slower and more likely to fail but the the root-node
it doesn't matter if it takes a little longer to allocate, and it's
usually performed just after bootup when memory isn't very fragmented so
it shouldn't fail.

> What I could conclude is that hashtable requires the least memory and the
> fastest at inserting elements. Hashtrie is better in looking up entries
> when there are more than two times entries in the hash than the hashsize
> of the hashtable.

Yes, this matches the results I've seen as well. I've also run it under
the cachegrind module of valgrind and among other things it keeps
statistics on the number of instructions executed, and hashtrie executes
a lot more instructions than the plain hashtable during lookup, yet
hashtrie gets better performance in some cases due to the fact that it
gets less cachemisses.

> Something is required to improve hashtrie, so that it'd require less
> memory. Without losing speed, but possibly gaining, at insertion.

Yes, something is needed, right now I don't really see any solution to
the memory-usage problem. The problem is less severe with larger
cachelinesizes than 32bytes. 64byte cachelines gives a lot lower memory
usage, 128bytes like the P4 has decreases the memoryusage even more (you
have to decrease HASHSHIFT in order to not allocate too much memory for
each node).

Now I'm going to think about the other hashtrie email I havn't responded
to yet :)

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-23 21:07                                     ` Martin Josefsson
@ 2006-03-25  8:39                                       ` Jozsef Kadlecsik
  2006-03-28 12:26                                         ` Jozsef Kadlecsik
  0 siblings, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-03-25  8:39 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist

Hi Martin,

On Thu, 23 Mar 2006, Martin Josefsson wrote:

[...]
> A lot of your changes should go into svn, you have full freedom to
> change whatever you want.

OK, I'll commit it sometime at the weekend. :-)

> I see that you have tested with HASHSHIFT 3, 4, 5, 6 and a 32byte
> cachelinesize.
> That gives:
>
> HASHSHIFT = 3
> 	HASHNUM (buckets per child) = 8
> 	childsize = HASHNUM * bucketsize (cachelinesize) = 256 bytes
[...]
> The original idea was to allocate PAGE_SIZE (4096 on i386) bytes at a
> time so it would be interesting to test HASHSHIFT = 7 (assuming a 32byte
> cachelinesize)

I just wanted to see the effect of different HASHSHIFT values. But I'll
add HASHSHIFT = 7, that's just some new lines.

> Another idea is to have varible number of buckets (HASHNUM) in the
> nodes, more buckets at the top of the trie and then decrease the number
> of buckets as we descend deeper into the trie. I've tried this earlier
> but with lowered performance, I don't remember anything about the change
> in memory-usage, and I don't trust my memory so it should be tested
> again.
>
> HASHSHIFT=8 or 9 or even larger at depth 0 would be interesting to test.

I'm going into that direction: probably even higher HASHSHIF value at
depth = 0 was fairer. The slow insert time compared to hashtable may come
from the preallocated large hash in hashtable.

> Yes, this matches the results I've seen as well. I've also run it under
> the cachegrind module of valgrind and among other things it keeps
> statistics on the number of instructions executed, and hashtrie executes
> a lot more instructions than the plain hashtable during lookup, yet
> hashtrie gets better performance in some cases due to the fact that it
> gets less cachemisses.

I'm thinking on an array-version, where instead of chaining of pointers,
basically an array of ip_conntrack_tuple is used in the hashtable to
avoid clashing. And IPv6 should be taken into account, too.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)
  2006-03-25  8:39                                       ` Jozsef Kadlecsik
@ 2006-03-28 12:26                                         ` Jozsef Kadlecsik
  0 siblings, 0 replies; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-03-28 12:26 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist

Hi Martin,

On Sat, 25 Mar 2006, Jozsef Kadlecsik wrote:

> > A lot of your changes should go into svn, you have full freedom to
> > change whatever you want.
>
> OK, I'll commit it sometime at the weekend. :-)
[...]

The new version finally is in svn :-) The new results and comments can be
found at

http://people.netfilter.org/kadlec/hashtrie

A short summary:

- sfhash does indeed faster than jhash when CPU-dependent optimization is
  enabled
- hastrie with variable hash sizes at the different level can help,
  surprisingly both in memory and speed. I tested three variants, there
  may easily be other ones which are even better.
- hasharray is very good on the price of some memory. But very good,
  and simple.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2, dancing trees
  2006-03-07 18:33                               ` Martin Josefsson
  2006-03-08  6:34                                 ` Patrick Schaaf
  2006-03-12 18:49                                 ` Martin Josefsson
@ 2006-03-30  8:28                                 ` Amin Azez
  2006-03-31 18:43                                   ` Jozsef Kadlecsik
  2 siblings, 1 reply; 40+ messages in thread
From: Amin Azez @ 2006-03-30  8:28 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: Netfilter Development Mailinglist

Just for interest, and I don't know how well tuned for processor cache 
it is, but Hans Reisers dancing trees are build to solve some of this 
problem; dancing trees keep themselves balanced.

I realise that filesystem timescales are not the same as packet 
processing timecales, I merely mention it for completeness.

Sam

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-02-16 20:09               ` Patrick McHardy
  2006-02-17  8:18                 ` Jozsef Kadlecsik
@ 2006-03-30  8:31                 ` Amin Azez
  2006-03-31  1:11                   ` Patrick McHardy
  1 sibling, 1 reply; 40+ messages in thread
From: Amin Azez @ 2006-03-30  8:31 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Harald Welte, Netfilter Development Mailinglist, Yasuyuki Kozakai,
	Pablo Neira Ayuso

Patrick McHardy wrote:
> Jozsef Kadlecsik wrote:
> 
>>On Thu, 16 Feb 2006, Patrick McHardy wrote:
>>
>>
>>>>(jiffies, tuples) would be unique even in that case.
>>>
>>>Thats true. But what is the advantage over using the counter?
> 
> 
> Actually it would still not be unique if connections live
> shorter than a jiffy and are resurrected.

In which case would there be any harm in it not being unqiue, if it were 
the same connection resurrected, why try to avoid having the same id?

Sam

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-03-30  8:31                 ` Amin Azez
@ 2006-03-31  1:11                   ` Patrick McHardy
  2006-03-31 18:35                     ` Jozsef Kadlecsik
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2006-03-31  1:11 UTC (permalink / raw)
  To: Amin Azez
  Cc: Harald Welte, Netfilter Development Mailinglist, Yasuyuki Kozakai,
	Pablo Neira Ayuso

Amin Azez wrote:
> Patrick McHardy wrote:
> 
>> Jozsef Kadlecsik wrote:
>>
>>> On Thu, 16 Feb 2006, Patrick McHardy wrote:
>>>
>>>
>>>>> (jiffies, tuples) would be unique even in that case.
>>>>
>>>>
>>>> Thats true. But what is the advantage over using the counter?
>>
>>
>>
>> Actually it would still not be unique if connections live
>> shorter than a jiffy and are resurrected.
> 
> 
> In which case would there be any harm in it not being unqiue, if it were
> the same connection resurrected, why try to avoid having the same id?

Besides solving an implementation problem, so ID was meant as a
long-term unique identifier. For the life-time of a connection
the tuples are already a unique identifier.

It's amazing, I've never had so many discussions about a single
32 bit field :)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-03-31  1:11                   ` Patrick McHardy
@ 2006-03-31 18:35                     ` Jozsef Kadlecsik
  2006-03-31 18:44                       ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-03-31 18:35 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Harald Welte, Pablo Neira Ayuso,
	Netfilter Development Mailinglist, Amin Azez, Yasuyuki Kozakai

On Fri, 31 Mar 2006, Patrick McHardy wrote:

> >> Actually it would still not be unique if connections live
> >> shorter than a jiffy and are resurrected.
> >
> > In which case would there be any harm in it not being unqiue, if it were
> > the same connection resurrected, why try to avoid having the same id?
>
> Besides solving an implementation problem, so ID was meant as a
> long-term unique identifier. For the life-time of a connection
> the tuples are already a unique identifier.
>
> It's amazing, I've never had so many discussions about a single
> 32 bit field :)

It's amazingly hard to accept we need it :-).

Putting aside the implementation problem, there are so many information
already available in the conntrack entries: tuples, packet and byte
counters, state, timeout value. Those together can serve as a short-term
unique identifier, even after the entry is actually deleted. Not
bulletproof, but in practice those can be sufficient.

I could not resist... :-))

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Hashtrie testing2, dancing trees
  2006-03-30  8:28                                 ` Hashtrie testing2, dancing trees Amin Azez
@ 2006-03-31 18:43                                   ` Jozsef Kadlecsik
  0 siblings, 0 replies; 40+ messages in thread
From: Jozsef Kadlecsik @ 2006-03-31 18:43 UTC (permalink / raw)
  To: Amin Azez; +Cc: Netfilter Development Mailinglist, Martin Josefsson

On Thu, 30 Mar 2006, Amin Azez wrote:

> Just for interest, and I don't know how well tuned for processor cache
> it is, but Hans Reisers dancing trees are build to solve some of this
> problem; dancing trees keep themselves balanced.
>
> I realise that filesystem timescales are not the same as packet
> processing timecales, I merely mention it for completeness.

I pondered over trees but excluded the whole family as too slow for our
purposes.

We aim to process packets near at wire speed and that implies the fastest
algorithms. A little complexity can have dire consequences - I was quite
surprised that the shared node method was so weak compared to the
hashes/tries when it required just a little bit more internal maintenance.

I believe keeping a tree balanced requires too expensive operations which
we cannot afford.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-03-31 18:35                     ` Jozsef Kadlecsik
@ 2006-03-31 18:44                       ` Patrick McHardy
  2006-04-01 19:31                         ` Harald Welte
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2006-03-31 18:44 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Harald Welte, Pablo Neira Ayuso,
	Netfilter Development Mailinglist, Amin Azez, Yasuyuki Kozakai

Jozsef Kadlecsik wrote:
> On Fri, 31 Mar 2006, Patrick McHardy wrote:
> 
> 
>>>>Actually it would still not be unique if connections live
>>>>shorter than a jiffy and are resurrected.
>>>
>>>In which case would there be any harm in it not being unqiue, if it were
>>>the same connection resurrected, why try to avoid having the same id?
>>
>>Besides solving an implementation problem, so ID was meant as a
>>long-term unique identifier. For the life-time of a connection
>>the tuples are already a unique identifier.
>>
>>It's amazing, I've never had so many discussions about a single
>>32 bit field :)
> 
> 
> It's amazingly hard to accept we need it :-).
> 
> Putting aside the implementation problem, there are so many information
> already available in the conntrack entries: tuples, packet and byte
> counters, state, timeout value. Those together can serve as a short-term
> unique identifier, even after the entry is actually deleted. Not
> bulletproof, but in practice those can be sufficient.
> 
> I could not resist... :-))

Me neither :) I was just pointing out the idea behind the ID, not
necessarily condoning it. After discussing it with Rusty in Sevilla,
it became clear that we can provide guarantees similar to filesystems
without it, which seems to be good enough. So I have absolutely
nothing against removing it. In fact, IIRC the last discussion
stopped after I had sent a patch to remove it. I need to dig out that
patch again :)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-03-31 18:44                       ` Patrick McHardy
@ 2006-04-01 19:31                         ` Harald Welte
  2006-04-06 11:02                           ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Harald Welte @ 2006-04-01 19:31 UTC (permalink / raw)
  To: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 1151 bytes --]

On Fri, Mar 31, 2006 at 08:44:56PM +0200, Patrick McHardy wrote:

> Me neither :) I was just pointing out the idea behind the ID, not
> necessarily condoning it. After discussing it with Rusty in Sevilla,
> it became clear that we can provide guarantees similar to filesystems
> without it, which seems to be good enough. So I have absolutely
> nothing against removing it. In fact, IIRC the last discussion
> stopped after I had sent a patch to remove it. I need to dig out that
> patch again :)

I didn't contribute to any of these discussions anymore, since I already
made my opinion on this quite clear very early on:  I don't have any
problem whatsoever with non-uniqueness ;)

So if there's now a majority of people who want to delete the ID: Go for
it :)

-- 
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 191 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-04-01 19:31                         ` Harald Welte
@ 2006-04-06 11:02                           ` Patrick McHardy
  2006-04-11 16:09                             ` Amin Azez
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2006-04-06 11:02 UTC (permalink / raw)
  To: Harald Welte; +Cc: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 645 bytes --]

Harald Welte wrote:
> So if there's now a majority of people who want to delete the ID: Go for
> it :)

Found the patch again. What it does is:

- note entry of next conntrack to be dumped and keep a reference to it
- when continuing, look for the conntrack and continue at it if its
  still there
- if not, dump the entire bucket again

In theory we could end up in an endless loop if the conntrack entry
we're keeping the reference to is deleted everytime we want to
continue dumping. It shouldn't be triggerable intentionally because
of the jenkins hash though. If there are no objections I'll port
it to nf_conntrack_netlink and submit it.


[-- Attachment #2: x --]
[-- Type: text/plain, Size: 1853 bytes --]

diff --git a/net/ipv4/netfilter/ip_conntrack_netlink.c b/net/ipv4/netfilter/ip_conntrack_netlink.c
index e0b5926..5a1769d 100644
--- a/net/ipv4/netfilter/ip_conntrack_netlink.c
+++ b/net/ipv4/netfilter/ip_conntrack_netlink.c
@@ -387,38 +387,52 @@ nfattr_failure:
 static int ctnetlink_done(struct netlink_callback *cb)
 {
 	DEBUGP("entered %s\n", __FUNCTION__);
+	if (cb->args[1])
+		ip_conntrack_put(cb->args[1]);
 	return 0;
 }
 
 static int
 ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	struct ip_conntrack *ct = NULL;
+	struct ip_conntrack *ct;
 	struct ip_conntrack_tuple_hash *h;
 	struct list_head *i;
-	u_int32_t *id = (u_int32_t *) &cb->args[1];
 
 	DEBUGP("entered %s, last bucket=%lu id=%u\n", __FUNCTION__, 
 			cb->args[0], *id);
 
 	read_lock_bh(&ip_conntrack_lock);
-	for (; cb->args[0] < ip_conntrack_htable_size; cb->args[0]++, *id = 0) {
+	for (; cb->args[0] < ip_conntrack_htable_size; cb->args[0]++) {
+restart:
 		list_for_each_prev(i, &ip_conntrack_hash[cb->args[0]]) {
 			h = (struct ip_conntrack_tuple_hash *) i;
 			if (DIRECTION(h) != IP_CT_DIR_ORIGINAL)
 				continue;
 			ct = tuplehash_to_ctrack(h);
-			if (ct->id <= *id)
-				continue;
+			if (cb->args[1]) {
+				if (ct == cb->args[1]) {
+					ip_conntrack_put(cb->args[1]);
+					cb->args[1] = NULL;
+				} else
+					continue;
+			}
 			if (ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).pid,
 		                        	cb->nlh->nlmsg_seq,
 						IPCTNL_MSG_CT_NEW,
-						1, ct) < 0)
+						1, ct) < 0) {
+				nf_conntrack_get(&ct->ct_general);
+				cb->args[1] = ct;
 				goto out;
-			*id = ct->id;
+			}
+		}
+		if (cb->args[1]) {
+			ip_conntrack_put(cb->args[1]);
+			cb->args[1] = NULL;
+			goto restart;
 		}
 	}
-out:	
+out:
 	read_unlock_bh(&ip_conntrack_lock);
 
 	DEBUGP("leaving, last bucket=%lu id=%u\n", cb->args[0], *id);

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-04-06 11:02                           ` Patrick McHardy
@ 2006-04-11 16:09                             ` Amin Azez
  2006-04-11 16:17                               ` Patrick McHardy
  0 siblings, 1 reply; 40+ messages in thread
From: Amin Azez @ 2006-04-11 16:09 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Netfilter Development Mailinglist

Patrick McHardy wrote:
> Harald Welte wrote:
> 
>>So if there's now a majority of people who want to delete the ID: Go for
>>it :)
> 
> 
> Found the patch again. What it does is:
> 
> - note entry of next conntrack to be dumped and keep a reference to it
> - when continuing, look for the conntrack and continue at it if its
>   still there
> - if not, dump the entire bucket again
> 
> In theory we could end up in an endless loop if the conntrack entry
> we're keeping the reference to is deleted everytime we want to
> continue dumping. 

Why not defer the ip_conntrack_put until after ctnetlink_fill_info, 
would that avoid the problem altogether?

It's no longer the entry of the next conntrack but the entry of the last 
conntrack by the time it gets deleted then...? Or have I misunderstood 
the way in which the loop would occur?

Sam

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
  2006-04-11 16:09                             ` Amin Azez
@ 2006-04-11 16:17                               ` Patrick McHardy
       [not found]                                 ` <443CA579.3030908@ufomechanic.net>
  0 siblings, 1 reply; 40+ messages in thread
From: Patrick McHardy @ 2006-04-11 16:17 UTC (permalink / raw)
  To: Amin Azez; +Cc: Netfilter Development Mailinglist

Amin Azez wrote:
> Patrick McHardy wrote:
> 
>> Found the patch again. What it does is:
>>
>> - note entry of next conntrack to be dumped and keep a reference to it
>> - when continuing, look for the conntrack and continue at it if its
>>   still there
>> - if not, dump the entire bucket again
>>
>> In theory we could end up in an endless loop if the conntrack entry
>> we're keeping the reference to is deleted everytime we want to
>> continue dumping. 
> 
> 
> Why not defer the ip_conntrack_put until after ctnetlink_fill_info,
> would that avoid the problem altogether?
> 
> It's no longer the entry of the next conntrack but the entry of the last
> conntrack by the time it gets deleted then...? Or have I misunderstood
> the way in which the loop would occur?

Yes, the loop is interrupted when the skb is full and continued when
userspace issues the next recvmsg(). The reference is kept to find
the point where to continue.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] first conntrack ID must be 1 not 2
       [not found]                                 ` <443CA579.3030908@ufomechanic.net>
@ 2006-04-12 18:30                                   ` Patrick McHardy
  0 siblings, 0 replies; 40+ messages in thread
From: Patrick McHardy @ 2006-04-12 18:30 UTC (permalink / raw)
  To: Amin Azez; +Cc: Netfilter Development Mailinglist

Amin Azez wrote:
> Patrick McHardy wrote:
> 
>>> It's no longer the entry of the next conntrack but the entry of the last
>>> conntrack by the time it gets deleted then...? Or have I misunderstood
>>> the way in which the loop would occur?
>>>   
>>
>>
>> Yes, the loop is interrupted when the skb is full and continued when
>> userspace issues the next recvmsg(). The reference is kept to find
>> the point where to continue.
>>  
>>
> Did you mean "yes, you misunderstood" or "yes, good idea"?

Yes, you misunderstood :)

> I thought that what would theoretically cause the problem if the
> next-conntrack is deleted was the fact that conntrack-put is called
> before ctnetlink_fill_info.

Its only a marker, once we have found it in the list we don't
need that reference anymore.

> The problem may not be entirely theoretical if the conntrack is being
> deleted by a userspace conntrack client in response to what it reads
> when enumerating conntracks. (The conntrack may spring back into
> existance as an resurrected established connection - and also may
> therefore be likely in the case of a full contrack table where
> connections flip-flop their resurrected conntrack entry,
> Rather than issue warnings to conntrack developers to avoid certain
> kinds of design, it would be better to avoid this.

The marker points to the next entry that beed dumped yet, so
enumerate-and-delete won't affect it. It is also not the
ID of the connection but the pointer to the conntrack entry
that is used, so resurrections don't affect it in any way.

The problem is in my opinion purely theoretical so far:

- the bucket beeing dumped must contain enough entries to exceed
  the size of a single netlink message (very unlikely)

- when continuing to dump, the entry we wanted to dump next
  must be gone - and there must still be enough entries in the
  bucket to exceed the skb.

For an endless loop we must have a steady refill of the bucket
with new conntrack entries, otherwise at some point it won't
exceed the skb anymore and we can move on to the next bucket.
The jenkins hash should prevent this situation.

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2006-04-12 18:30 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-02-13  2:41 [PATCH 4/4] first conntrack ID must be 1 not 2 Pablo Neira Ayuso
2006-02-13 11:20 ` Harald Welte
2006-02-16  8:33   ` Patrick McHardy
2006-02-16  8:47     ` Jozsef Kadlecsik
2006-02-16  9:02       ` Patrick McHardy
2006-02-16  9:11         ` Jozsef Kadlecsik
2006-02-16  9:14           ` Patrick McHardy
2006-02-16  9:36             ` Jozsef Kadlecsik
2006-02-16 20:09               ` Patrick McHardy
2006-02-17  8:18                 ` Jozsef Kadlecsik
2006-02-17  8:45                   ` Martin Josefsson
2006-02-17  9:30                     ` Jozsef Kadlecsik
2006-02-17 18:41                       ` Jozsef Kadlecsik
2006-03-04 16:23                         ` Hashtrie testing (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2) Martin Josefsson
2006-03-05  9:49                           ` Jozsef Kadlecsik
2006-03-05 13:24                             ` Martin Josefsson
2006-03-04 20:11                       ` Hashtrie testing2 " Martin Josefsson
2006-03-05 11:24                         ` Jozsef Kadlecsik
2006-03-05 17:48                           ` Martin Josefsson
2006-03-06 13:15                             ` Jozsef Kadlecsik
2006-03-07 18:33                               ` Martin Josefsson
2006-03-08  6:34                                 ` Patrick Schaaf
2006-03-12 18:49                                 ` Martin Josefsson
2006-03-14 11:35                                   ` Jozsef Kadlecsik
2006-03-23 11:27                                   ` Jozsef Kadlecsik
2006-03-23 21:07                                     ` Martin Josefsson
2006-03-25  8:39                                       ` Jozsef Kadlecsik
2006-03-28 12:26                                         ` Jozsef Kadlecsik
2006-03-30  8:28                                 ` Hashtrie testing2, dancing trees Amin Azez
2006-03-31 18:43                                   ` Jozsef Kadlecsik
2006-02-17  8:50                   ` [PATCH 4/4] first conntrack ID must be 1 not 2 Patrick McHardy
2006-03-30  8:31                 ` Amin Azez
2006-03-31  1:11                   ` Patrick McHardy
2006-03-31 18:35                     ` Jozsef Kadlecsik
2006-03-31 18:44                       ` Patrick McHardy
2006-04-01 19:31                         ` Harald Welte
2006-04-06 11:02                           ` Patrick McHardy
2006-04-11 16:09                             ` Amin Azez
2006-04-11 16:17                               ` Patrick McHardy
     [not found]                                 ` <443CA579.3030908@ufomechanic.net>
2006-04-12 18:30                                   ` Patrick McHardy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.