conntrack performance/DoS formula

All of lore.kernel.org
 help / color / mirror / Atom feed

* conntrack performance/DoS formula
@ 2002-06-27 20:46 Don Cohen
  2002-06-28  6:23 ` Patrick Schaaf
  2002-07-02 14:38 ` conntrack performance/DoS formula Harald Welte
  0 siblings, 2 replies; 55+ messages in thread
From: Don Cohen @ 2002-06-27 20:46 UTC (permalink / raw)
  To: netfilter-devel

I had mentioned that I was hoping to remember/find the appropriate
formula for the probability of a hash bucket reaching its limit.
I got out my prob-stat book from I-won't-mention-how-many years ago.

Suppose there's a limit of L connections to be put in a hash table
with B buckets, each bucket allowed to hold a max of M entries.
I assume that the hash function maps a connection to each bucket
with equal probability, P = 1/B.
Suppose the table is full, meaning it has L connections in it.
The probability that a given bucket has n connections in it is:

c(L,n) (P ^ n) ((1-P) ^ (L-n))  [called binomial distribution]

where c(L,n) is the number of ways of choosing n things out of L,
which is L! / (L-n)! n!, i.e., 10x9x8/3x2x1 ways to choose 3 out of 10.
The binomial formula comes from the fact that we've chosen n
connections out of L to put in this bucket, each of those with
probability P, and of course the other L-n went into other buckets
with probability 1-P. 

BTW, this is closely approximated by a Poisson distribution.

Anyhow, on to some answers.  My initial guess was that with L/2
buckets (so the average bucket length is 2), max bucket size M=10
would result in only a microscopic chance of overflow.
The chance is bigger than I expected, about 1E-5.
Raising M to 15 reduces it to about 1E-10 and raising M to 20 makes it
about 1E-15, which seems quite respectable. 

There's a pretty straight forward tradeoff between space and time.
Instead of increasing the time (# iterations = M) you can increase 
the space (# buckets).  Changing from L/2 buckets to L we get
M=10 probability 1E-9 of filling a bucket
M=15 probability 1E-14
M=20 probability 1E-20

And while we're at it, with 2L buckets (average occupancy .5)
M=10 prob ~ 1E-11
M=15 prob ~ 1E-18
M=20 prob ~ 1E-25

My guess is that the amount of time required to do 20 iterations
comparing connection data with elements of a list in a bucket is
very small compared to the amount of time otherwise spent on a
packet.  On the other hand, a hash bucket is also pretty cheap.
The data for a connection takes much more space than 2 hash buckets, 
or even 10.  And with 10L buckets the chance of overflow at M=10
is only 1E-19, for M=20 it's somewhere around 1E-40.

So, there you have it.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-27 20:46 conntrack performance/DoS formula Don Cohen
@ 2002-06-28  6:23 ` Patrick Schaaf
  2002-06-28 17:53   ` Don Cohen
  2002-07-02 14:38 ` conntrack performance/DoS formula Harald Welte
  1 sibling, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-28  6:23 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

Hello Don,

> I had mentioned that I was hoping to remember/find the appropriate
> formula for the probability of a hash bucket reaching its limit.

[SNIP calculation - thanks for sharing your research!]

> I assume that the hash function maps a connection to each bucket
> with equal probability, P = 1/B.

Can you find an analysis that permits some non-uniformity, given as
a parameter?  I am pretty sure that the hash function we use for
conntracking is nonuniform in lots of real world cases. Thus, if
you want to propose using that probability derived length limit,
then you must prove uniformity of our hash function.

I have the vague feeling (being a programmer, not a mathematician :)
that even small nonuniformities will have large impact on your
probability distribution.

May I suggest a different, experimental approach: extract the hash function
from the code, creating a small userlevel tool which reads in the saved
contents of some /proc/net/ip_conntrack, and with a hash bucket and limit
specified on the commandline, counts the distribution for that specific
set of connections, maybe comparing it with the theoretical prediction,
resulting in a "how does that real world conntrack table fit to theory"
metric? Once done, such a tool could permit command line selection of
different hash functions, for comparison.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28  6:23 ` Patrick Schaaf
@ 2002-06-28 17:53   ` Don Cohen
  2002-06-28 18:36     ` Patrick Schaaf
  0 siblings, 1 reply; 55+ messages in thread
From: Don Cohen @ 2002-06-28 17:53 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

Patrick Schaaf writes:

 > > I assume that the hash function maps a connection to each bucket
 > > with equal probability, P = 1/B.
 > 
 > Can you find an analysis that permits some non-uniformity, given as
 > a parameter?  I am pretty sure that the hash function we use for
 > conntracking is nonuniform in lots of real world cases. Thus, if
 > you want to propose using that probability derived length limit,
 > then you must prove uniformity of our hash function.

The effect of nonuniformity in the hash function is very easy to
analyze.  It just changes the probability of different buckets.
For any bucket b the same formula works, just replacing P with P[b].
So if we take the most commonly used bucket, call its probability P',
then a pessimistic approximation is that the chance of a new
connection encountering a full bucket is given by the earlier formula
with the number of buckets reduced to 1/P'.

 > I have the vague feeling (being a programmer, not a mathematician :)
 > that even small nonuniformities will have large impact on your
 > probability distribution.
I guess it depends on what you consider "small" non-uniformities.
I think the opposite is true.
The danger in a hash function is that there will be a high correlation
among the hashes of the items that arrive.  In this case I don't see
much danger.  

 > May I suggest a different, experimental approach: extract the hash function
 > from the code, creating a small userlevel tool which reads in the saved
 > contents of some /proc/net/ip_conntrack, and with a hash bucket and limit
 > specified on the commandline, counts the distribution for that specific
 > set of connections, maybe comparing it with the theoretical prediction,
 > resulting in a "how does that real world conntrack table fit to theory"
 > metric? Once done, such a tool could permit command line selection of
 > different hash functions, for comparison.

Sounds easy enough.  You have some "real" (or at least "realistic") data?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 17:53   ` Don Cohen
@ 2002-06-28 18:36     ` Patrick Schaaf
  2002-06-28 19:03       ` Don Cohen
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-28 18:36 UTC (permalink / raw)
  To: Don Cohen; +Cc: Patrick Schaaf, netfilter-devel

>  > May I suggest a different, experimental approach: extract the hash function
>  > from the code, creating a small userlevel tool which reads in the saved
>  > contents of some /proc/net/ip_conntrack, and with a hash bucket and limit
>  > specified on the commandline, counts the distribution for that specific
>  > set of connections, maybe comparing it with the theoretical prediction,
>  > resulting in a "how does that real world conntrack table fit to theory"
>  > metric? Once done, such a tool could permit command line selection of
>  > different hash functions, for comparison.
> 
> Sounds easy enough.  You have some "real" (or at least "realistic") data?

I have real data from an IRC server (one of the german IRCnet hubs), and
from several boxen providing transparent proxy service to dialup customers,
3000 customers per box peak, running DNS and squid towards the Internet.
With the peak load, the proxy boxen have up to 100000 conntrack entries
(according to slabinfo; /proc/net/ip_conntrack is too lame to show them all).

Others will certainly be able to contribute data from production
routing firewalls. I have one of those, too, but they only protect
an ~50 workstation LAN, so these are not that significant.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 18:36     ` Patrick Schaaf
@ 2002-06-28 19:03       ` Don Cohen
  2002-06-28 19:35         ` Patrick Schaaf
  2002-07-02 14:40         ` Harald Welte
  0 siblings, 2 replies; 55+ messages in thread
From: Don Cohen @ 2002-06-28 19:03 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

Patrick Schaaf writes:
 > I have real data from an IRC server (one of the german IRCnet hubs), and
 > from several boxen providing transparent proxy service to dialup customers,
 > 3000 customers per box peak, running DNS and squid towards the Internet.
 > With the peak load, the proxy boxen have up to 100000 conntrack entries
 > (according to slabinfo; /proc/net/ip_conntrack is too lame to show them all).
 > 
 > Others will certainly be able to contribute data from production
 > routing firewalls. I have one of those, too, but they only protect
 > an ~50 workstation LAN, so these are not that significant.

If you can put some data on a web page that I could download I'd be
happy to write and run the program that analyzes it and compares it
to the results expected for a uniform hash function.

Let me know what hash functions particularly interest you other than
the one currently in conntrack and the slight modification I
suggested. 

Just in case you consider this to be a drawback, I might as well
mention up front that this program will be written in lisp.
(I consider it to be a benefit.)

I suppose /proc/net/ip_conntrack is the preferred data format, but
if you have data in another format that others can easily generate
that would also be fine.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 19:03       ` Don Cohen
@ 2002-06-28 19:35         ` Patrick Schaaf
  2002-06-28 19:39           ` Patrick Schaaf
  2002-06-28 21:10           ` Don Cohen
  2002-07-02 14:40         ` Harald Welte
  1 sibling, 2 replies; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-28 19:35 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1542 bytes --]

Hi Don,

>  > I have real data [...] from several boxen providing transparent proxy
>  > service to dialup customers, 3000 customers per box peak, running DNS
>  > and squid towards the Internet.
>  > With the peak load, the proxy boxen have up to 100000 conntrack entries
> 
> If you can put some data on a web page that I could download I'd be
> happy to write and run the program that analyzes it and compares it
> to the results expected for a uniform hash function.

I have put a current sample, with ~33000 entries from one of those
proxy boxen, at http://bei.bof.de/ip_conntrack.1.gz

To protect the innoccent (i.e. my customers and their surfing habits),
I have modified the top 16 bit of the IP addresses (keeping a random
but unique-once-determined mapping). The low order 16 bit are unmodified.
I hope this is sufficient to combine realism wrt hashing with privacy.
The Perl script I used for transforming the ip_conntrack input, is appended
for review.

> Just in case you consider this to be a drawback, I might as well
> mention up front that this program will be written in lisp.
> (I consider it to be a benefit.)

That's your choice. I would write it in Perl or C.

> I suppose /proc/net/ip_conntrack is the preferred data format

The data format should be nearly /proc/net/ip_conntrack, the only modification
is that I compressed multiple whitespace down to a single character, and the
abovementioned mangling of the IP addresses.

I'm looking forward to your results. Tell me if you want more data.

best regards
  Patrick

[-- Attachment #2: ctmangle --]
[-- Type: text/plain, Size: 680 bytes --]

#! /usr/bin/perl -w

use strict;

# tcp      6 114 TIME_WAIT src=XX.XXX.211.64 dst=YYY.Y.144.66 sport=34685 dport=80 src=YYY.Y.144.66 dst=XX.XXX.211.64 sport=80 dport=34685 [ASSURED] use=1 

my %ipmap = ();

sub domap {
	my $ip = $_[0];
	if (!defined($ipmap{$ip})) {
		my $r = rand (2**16)-1;
		my $m = '';
		$m .= ($r & 255); $r >>= 8; $m .= '.' . ($r & 255);
		$ipmap{$ip} = $m;
	}
	return $ipmap{$ip};
}

while (<>) {
	chop;
	my @F = split;
	my $out = '';
	my $sep = '';
	for my $token (@F) {
		if ($token =~ /^([^=]+)=(\d+\.\d+)(\.\d+\.\d+)$/) {
			my $ip = domap($2);
			$out .= "$sep${1}=$ip${3}";
		} else {
			$out .= "$sep$token";
		}
		$sep = ' ';
	}
	print "$out\n";
}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 19:35         ` Patrick Schaaf
@ 2002-06-28 19:39           ` Patrick Schaaf
  2002-06-28 21:10           ` Don Cohen
  1 sibling, 0 replies; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-28 19:39 UTC (permalink / raw)
  To: netfilter-devel

(I wrote)
> I have put a current sample, with ~33000 entries from one of those
> proxy boxen, at http://bei.bof.de/ip_conntrack.1.gz

I forgot to mention that on this box, I run with the default module
load time calculated hash bucket count of 7168 (much too low, I know),
and ip_conntrack_max at 200000.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 19:35         ` Patrick Schaaf
  2002-06-28 19:39           ` Patrick Schaaf
@ 2002-06-28 21:10           ` Don Cohen
  2002-06-28 21:28             ` Patrick Schaaf
  1 sibling, 1 reply; 55+ messages in thread
From: Don Cohen @ 2002-06-28 21:10 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

Patrick Schaaf writes:
 > I have put a current sample, with ~33000 entries from one of those
 > proxy boxen, at http://bei.bof.de/ip_conntrack.1.gz

I started to test and found that I got longer buckets than I expected.
The problem seems to be that you have a lot of duplicate entries!
This should not happen in a connection table, right?

(setf ht (make-hash-table :test 'equal))
(loop for d in data do (incf (gethash d ht 0)))
(maphash (lambda (x y) (when (> y 1) (print (list x y)))) ht)
 ... 620 lines !!!
((6 161731392 2776778530 3228 1770) 2) = 9.163.211.64 165.130.71.34
((6 2776779313 974068 1241 80) 2) 
((6 2776779190 2386095140 2596 80) 2)
these are protocol, src/dst ip, src/dst port

I predict that when I remove the duplicates I'll get the expected
distribution.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 21:10           ` Don Cohen
@ 2002-06-28 21:28             ` Patrick Schaaf
  2002-06-28 21:49               ` Don Cohen
  2002-06-28 22:30               ` Don Cohen
  0 siblings, 2 replies; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-28 21:28 UTC (permalink / raw)
  To: Don Cohen; +Cc: Patrick Schaaf, netfilter-devel

On Fri, Jun 28, 2002 at 02:10:50PM -0700, Don Cohen wrote:
> Patrick Schaaf writes:
>  > I have put a current sample, with ~33000 entries from one of those
>  > proxy boxen, at http://bei.bof.de/ip_conntrack.1.gz
> 
> I started to test and found that I got longer buckets than I expected.
> The problem seems to be that you have a lot of duplicate entries!
> This should not happen in a connection table, right?

I see about 550 duplicates in 33000 entries. This is most likely due to
'cat /proc/net/ip_conntrack' being terminally broken. But that's not news
to me. Sorry I did not check before.

> I predict that when I remove the duplicates I'll get the expected
> distribution.

sort ip_conntrack.1|uniq

regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 21:28             ` Patrick Schaaf
@ 2002-06-28 21:49               ` Don Cohen
  2002-06-28 22:30               ` Don Cohen
  1 sibling, 0 replies; 55+ messages in thread
From: Don Cohen @ 2002-06-28 21:49 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel


Below is some code along with some results after I removed the
duplicates.
You'll see I just simulated tables of different sizes with the first 
N lines of your data, always using buckets with expected occupancy=1.

The buckets are still a little longer than expected.
Perhaps this is related to
  ... 'cat /proc/net/ip_conntrack' being terminally broken.
Or perhaps something else strange in your data.  Or maybe something 
inherent in "real" data that has not yet occurred to me.


================
(in-package :user)

(defun c(n m &aux (ans 1))
  "number of ways of choosing m out of n"
  (loop for i from 1 to m do (setf ans (/ (* ans (- (+ 1 n) i)) i))) ans)

#+clisp
(setf (ext:long-float-digits) 64)

(defun f (n m p)
  "prob. that exactly m out of n entries end up in a bucket which gets
   an entry with prob. p"
  (* (c n m) (expt p m)(expt (- 1l0 p) (- n m))))

(defun parse-data-line (string)
  "input is a line from /proc/net/ip_conntrack,
   output is (proto src dst sport dport)"
  (let (proto src dst sport dport)
    (with-input-from-string (s string)
      (read s) (setf proto (read s)))
    (unless (integerp proto) (error "strange proto?"))
    (setf src (find-ip string "src=")
	  dst (find-ip string "dst=")
	  sport (find-port string "sport=")
	  dport (find-port string "dport="))
    (list proto src dst sport dport)))

(defun find-port (string label)
  (let (pos)
    (unless (setf pos (search label string)) (error "can't find ~A" label))
    (parse-integer string :start (+ pos (length label)) :junk-allowed t)))

(defun find-ip (string label)
  (let (pos int (ans 0))
    (unless (setf pos (search label string)) (error "can't find ~A" label))
    (incf pos (length label))
    (loop for i below 4 do
	  (multiple-value-setq (int pos)
	    (parse-integer string :start pos :junk-allowed t))
	  (setf ans (+ (* ans 256) int) pos (1+ pos)))
    ans))

(defun hash1 (tablesize proto src dst sport dport)
  (mod (+ proto src dst sport dport) tablesize))

(defun hash2 (tablesize proto src dst sport dport)
  (mod (+ proto src dst sport dport sport) tablesize))
;; is this the code now in conntrack?

(defun test-hash (hashfn data tablesize entries)
  (let ((table (make-array tablesize :initial-element 0)) max table2)
    (loop for i below entries as d in data do
	  (incf (aref table (apply hashfn tablesize d))))
    (setf max (loop for i below tablesize maximize (aref table i)))
    (setf table2 (make-array (1+ max) :initial-element 0))
    (loop for i below tablesize do (incf (aref table2 (aref table i))))
    (loop for i from 0 to max collect (list i (aref table2 i)))))

(length (setf data (with-open-file (f "/tmp/ip_conntrack.1")
		     (loop while (setf l (read-line f nil nil))
			 collect (parse-data-line l)))))
;; => 33465
;; now remove the duplicates !!!
(setf ht (make-hash-table :test 'equal))
(loop for d in data do (incf (gethash d ht 0)))
;; (maphash (lambda (x y) (when (> y 1) (print (list x y)))) ht)
;;  ... 620 lines !!!
(setf realdata nil)
(maphash (lambda (x y) (push x) realdata) ht)

#| results
(test-hash 'hash1 realdata 1024 1024)
((0 458) (1 291) (2 150) (3 86) (4 25) (5 10) (6 3) (7 1))
;; meaning 458 buckets with no entries, 291 buckets with 1 entry ...
(f 1023 7 (/ 1l0 1024))
7.154...L-5 - so we should get 7 in a bucket less than once in 10K
meaning this hash is not doing a good job with this data

(test-hash 'hash1 realdata 1023 1023)
((0 372) (1 381) (2 188) (3 64) (4 16) (5 2))
(f 1023 5 (/ 1l0 1023))
0.00304 [ i.e., 3 buckets out of 1000 should have 5 entries
so this is more reasonable ]

(test-hash 'hash1 realdata 16384 16384)
((0 6755) (1 5533) (2 2509) (3 955) (4 354) (5 169) (6 68) (7 34) (8 4) (9 1) (10 2))
10 in a bucket should happen only once in 1E7

(test-hash 'hash1 realdata 16383 16383)
((0 6692) (1 5496) (2 2647) (3 956) (4 367) (5 142) (6 50) (7 22) (8 8) (9 2) (10 0) (11 1))
and 11 only once in 1E8

Now adding sport twice
(test-hash 'hash2 realdata 1024 1024)
((0 391) (1 361) (2 184) (3 63) (4 19) (5 6))
(f 1023 5 (/ 1l0 1024))
0.003 (as above)

(test-hash 'hash2 realdata 1023 1023)
((0 365) (1 392) (2 190) (3 57) (4 17) (5 1) (6 0) (7 1))
above we saw 7E-5

(test-hash 'hash2 realdata 16384 16384)
((0 6518) (1 5667) (2 2678) (3 1005) (4 338) (5 104) (6 52) (7 15) (8 6) (9 1))
(test-hash 'hash2 realdata 16383 16383)
((0 6608) (1 5583) (2 2655) (3 984) (4 347) (5 128) (6 53) (7 11) (8 12) (9 1) (10 1))

|#

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 21:28             ` Patrick Schaaf
  2002-06-28 21:49               ` Don Cohen
@ 2002-06-28 22:30               ` Don Cohen
  2002-06-29  9:03                 ` Patrick Schaaf
                                   ` (2 more replies)
  1 sibling, 3 replies; 55+ messages in thread
From: Don Cohen @ 2002-06-28 22:30 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel


  The buckets are still a little longer than expected.
  Perhaps this is related to
    ... 'cat /proc/net/ip_conntrack' being terminally broken.
  Or perhaps something else strange in your data.  Or maybe something 
  inherent in "real" data that has not yet occurred to me.

Here's the most extreme case:
  (test-hash 'hash1 realdata 16383 16383)
  ((0 6692) (1 5496) (2 2647) (3 956) (4 367) (5 142) (6 50) (7 22) (8 8) (9 2) (10 0) (11 1))
  and 11 only once in 1E8

I changed the code to find the data that went into this bucket with
11 connections.  Here it is:
 (loop for x in
 '(
 (6 161731392 2776778790 3228 1045)
 (6 161731392 2776778584 3128 1351)
 (6 161731392 2776778230 3228 1605)
 (6 2776778862 161726449 1064 8080)
 (6 161731392 2776778518 3228 1317)
 (6 161731392 2776778560 3128 1375)
 (6 161731392 2776778406 3228 1429)
 (6 161731392 2776778148 3128 1787)
 (6 161731392 2776779212 3128 33489)
 (6 2776778556 161726449 1370 8080)
 (6 161731392 2776778534 3228 1301)) collect (reduce '+ x))
(2938514461
 2938514461
 2938514461
 2938514461
 2938514461
 2938514461
 2938514461
 2938514461
 2938547227
 2938514461
 2938514461)
And what we see is that the sum of the data (used in hash1) is
almost always the same, so the table size doesn't matter.
Now, let's see how the components of the first differ from those
of the others:
(loop for x in <data above> collect 
 (loop for y in ' (6 161731392 2776778790 3228 1045) as z in x 
 collect (- y z)))
((0 0 0 0 0)
 (0 0 206 100 -306)
 (0 0 560 0 -560)
 (0 -2615047470 2615052341 2164 -7035)
 (0 0 272 0 -272)
 (0 0 230 100 -330)
 (0 0 384 0 -384)
 (0 0 642 100 -742)
 (0 0 -422 100 -32444)
 (0 -2615047164 2615052341 1858 -7035)
 (0 0 256 0 -256))

So there are 4 cases where the source IP and port are the same
and the dest IP and dest port happen to cancel out, and
four more where the source port differs by 100 and the dest
IP and port differ by enough to cancel that 100.
That's 8 out of the 11.

Is this "normal" ?  Do you have some idea about all these
ports 3128 and 3228 ?  

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 22:30               ` Don Cohen
@ 2002-06-29  9:03                 ` Patrick Schaaf
  2002-06-29 16:48                   ` Don Cohen
  2002-06-29  9:29                 ` Patrick Schaaf
  2002-06-29 12:07                 ` Patrick Schaaf
  2 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-29  9:03 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

> Is this "normal" ?  Do you have some idea about all these
> ports 3128 and 3228 ?  

The machine where that ip_conntrack came from, is running two squid
processes, one on each processor, and clients are distributed evenly
over the two processes. 3128 and 3228 are the two listening ports
of those squid processes.

Such a conntrack shape will be the normal case for iptables running
on a server, more extreme than when it's running on a routing firewall.

As you asked for ideas about "better" hashes, could you possibly try
using CRC32 over the concatenation of the key values?

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-29  9:03                 ` Patrick Schaaf
@ 2002-06-29 16:48                   ` Don Cohen
  2002-06-29 17:22                     ` Patrick Schaaf
  2002-06-29 17:33                     ` Patrick Schaaf
  0 siblings, 2 replies; 55+ messages in thread
From: Don Cohen @ 2002-06-29 16:48 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

Patrick Schaaf writes:
 > > Is this "normal" ?  Do you have some idea about all these
 > > ports 3128 and 3228 ?  
 > 
 > The machine where that ip_conntrack came from, is running two squid
 > processes, one on each processor, and clients are distributed evenly
 > over the two processes. 3128 and 3228 are the two listening ports
 > of those squid processes.
 > 
 > Such a conntrack shape will be the normal case for iptables running
 > on a server, more extreme than when it's running on a routing firewall.

The fact that many connections have the same source IP and two common 
source ports does not seem at all abnormal.  The interesting correlation
in this case is that the client with IP address differing by n is
using a port that differs by -n.  This is just the sort of correlation
that would be bad for this hash function, and the sort that I would
not expect.

 (6 161731392 2776778790 3228 1045)
 (6 161731392 2776778584 3128 1351) ;;  (0 0 206 100 -306)
 (6 161731392 2776778230 3228 1605) ;;  (0 0 560 0 -560)
 (6 2776778862 161726449 1064 8080) ;; different machines
 (6 161731392 2776778518 3228 1317) ;;  (0 0 272 0 -272)
 (6 161731392 2776778560 3128 1375) ;;  (0 0 230 100 -330)
 (6 161731392 2776778406 3228 1429) ;;  (0 0 384 0 -384)
 (6 161731392 2776778148 3128 1787) ;;  (0 0 642 100 -742)
 (6 161731392 2776779212 3128 33489) ;; different sum
 (6 2776778556 161726449 1370 8080) ;; different machines
 (6 161731392 2776778534 3228 1301) ;;  (0 0 256 0 -256)

 > As you asked for ideas about "better" hashes, could you possibly try
 > using CRC32 over the concatenation of the key values?

Looks like you may have done this already, but if not then perhaps
you could supply the code or tell me how to compute it.  Instead,
I just look at your words above "concatenation of the key values".
Here's my first order approximation:
Instead of adding IPs to ports I concatenate IP and port into one
48 bit number.  Then add the two 48 bit numbers (and the protocol)
and take the result mod table size.
(defun hash3 (tablesize proto src dst sport dport)
  (mod (+ proto (ash (+ src dst) 16) sport dport) tablesize))

The result:
 (test-hash 'hash3 realdata 16383 16383)
 ((0 6284) (1 5796) (2 2867) (3 1015) (4 319) (5 84) (6 14) (7 4))
Much closer to theory

BTW it turns out this leads to a good demonstration of my point about
making the modulus relatively prime to the size of the data.
If I change 16383 to 16384 (a power of 2) then almost all of the IP
address data above is ignored and we get this monster:

 (test-hash 'hash3 realdata 16384 16384)
 ((0 8757) (1 4875) (2 1192) (3 490) (4 293) (5 179) (6 131) (7 95) (8
 92) (9 60) (10 49) (11 42) (12 37) (13 28) (14 20) (15 13) (16 11) (17
 6) (18 6) (19 2) (20 1) (21 1) (22 0) (23 1) (24 1) (25 0) (26 0) (27
 0) (28 1) (29 0) (30 0) (31 0) (32 0) (33 0) (34 0) (35 0) (36 0) (37
 0) (38 0) (39 0) (40 0) (41 0) (42 0) (43 0) (44 0) (45 0) (46 0) (47
 0) (48 0) (49 0) (50 0) (51 0) (52 0) (53 0) (54 0) (55 0) (56 0) (57
 0) (58 0) (59 0) (60 0) (61 0) (62 0) (63 0) (64 0) (65 0) (66 0) (67
 0) (68 0) (69 0) (70 0) (71 0) (72 0) (73 0) (74 0) (75 0) (76 0) (77
 0) (78 0) (79 0) (80 0) (81 0) (82 0) (83 0) (84 0) (85 0) (86 1))

[reactions to later messages]

  one more thing to observe for your tests: each line from ip_conntrack
  is hashed twice, with mirrored source / destination addresses. 
I was only reading the first src/dst/sport/dport from each line.
I didn't check for such mirrors in general, but they don't appear
in the sample bucket above containing 11 entries. 

  This version adds a '-b X' option. If you specify that, all TCP conntracks
  with timeouts less than X get ignored. For example, use "-b 431940" to only
  count the conntracks active within the last 60 seconds.

I guess I just don't understand conntrack well enough.
How does -b 431940 map to active in the last 60 sec?
Also, why does this matter?  If you use my suggestion to treat a "full
bucket" like a full table then the issue is whether buckets are full,
and I don't see how that's affected by how recently used the entries
are. 

  Incidentally, this makes the current ip_conntrack hash function look a lot
  less bad than before. I appears that the "long bucket" packets mostly belong
  to some kind of port scan. They will time out eventually, and they will NOT
  contribute significantly to CPU usage, because they will be rarely hit.

They do contribute to cpu cost because they lengthen the search for
other connections.

  Everybody should see the results for their own system. Really. I mean it.

My feeling is that we should be able to find a hash function that is
"good enough" in almost all cases, meaning something like this:
if we want probability of full bucket < 1E-10 and the formula tells
us that this indicates we need to make max bucket length >= 12 then
to be on the safe side we increase it, say 50% to 18, and then we
log full buckets.  When these appear in the log then there are two
possible explanations.  One is that we're under attack from someone
purposely trying to fill buckets.  The other is that our hash function
is bad for the real data.  In either case we should use your program
to look at the connections in the overflowing buckets, and this will
tell us what's going on (and how to fix it if it can be fixed).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-29 16:48                   ` Don Cohen
@ 2002-06-29 17:22                     ` Patrick Schaaf
  2002-07-05 13:47                       ` Harald Welte
  2002-06-29 17:33                     ` Patrick Schaaf
  1 sibling, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-29 17:22 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

>   This version adds a '-b X' option. If you specify that, all TCP conntracks
>   with timeouts less than X get ignored. For example, use "-b 431940" to only
>   count the conntracks active within the last 60 seconds.
> 
> I guess I just don't understand conntrack well enough.
> How does -b 431940 map to active in the last 60 sec?

Whenever a new packet arrives, the timeout of its conntrack is set to
the maximum value. For ESTABLISHED TCP connections, that's 5 days,
i.e. 432000 seconds.

> Also, why does this matter?

When worrying about chain lengths, we are worrying about lookup performance.
By restricting the accounting to conntracks active during the last 60 seconds,
I tried to capture the part of the buckets "where the action was".

My finding is that the really long chain lengths, which also prevail when
I use the crc32 hash, are all for very old inactive (and UNREPLIED)
connections  - which look like portscans from one or two days ago.
And indeed, those show the +n -n pattern on the ports that you observed,
and maybe, just maybe this is a sign of some malicious conntrack attacking
scan tool. If my further analysis below is correct, however, that attack
is almost futile.

> If you use my suggestion to treat a "full
> bucket" like a full table then the issue is whether buckets are full,
> and I don't see how that's affected by how recently used the entries
> are. 

I think my analysis shows a possible problem with your approach: when such
a port scan results in an overly long single bucket list, that list will
prevail for up to 5 days. If N is the number of buckets, and newly arriving
connections are evenly spread over the buckets, then every Nth connection
will fall into that overly long bucket - being blocked if we apply your
cutting logic unmodified.

However, in my observation of those problematic buckets, the connections
are UNREPLIED. Then we can use the same trick that is used when
ip_conntrack_max is reached: recycle one of those UNREPLIED
connections immediately for the new conntrack.

(If you are not aware of that ip_conntrack_max handling, have a look at
the early_drop() logic in ip_conntrack_core.c)

>   Incidentally, this makes the current ip_conntrack hash function look a lot
>   less bad than before. I appears that the "long bucket" packets mostly belong
>   to some kind of port scan. They will time out eventually, and they will NOT
>   contribute significantly to CPU usage, because they will be rarely hit.
> 
> They do contribute to cpu cost because they lengthen the search for
> other connections.

NO. They were "alone in their bucket". In other words, no active connection
fell into the same bucket, so they won't come into play at all during lookup.
And, as each entry has its individual timer, there is no scanning of the
overall table, so they would be really untouched in real life, until they
timeout after 5 days.

At this point, I would like to stop for some time, and await results
from other's real world ip_conntrack setups.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-29 17:22                     ` Patrick Schaaf
@ 2002-07-05 13:47                       ` Harald Welte
  0 siblings, 0 replies; 55+ messages in thread
From: Harald Welte @ 2002-07-05 13:47 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: Don Cohen, netfilter-devel

On Sat, Jun 29, 2002 at 07:22:18PM +0200, Patrick Schaaf wrote:
> I think my analysis shows a possible problem with your approach: when such
> a port scan results in an overly long single bucket list, that list will
> prevail for up to 5 days. If N is the number of buckets, and newly arriving
> connections are evenly spread over the buckets, then every Nth connection
> will fall into that overly long bucket - being blocked if we apply your
> cutting logic unmodified.
> 
> However, in my observation of those problematic buckets, the connections
> are UNREPLIED. Then we can use the same trick that is used when
> ip_conntrack_max is reached: recycle one of those UNREPLIED
> connections immediately for the new conntrack.

this sounds like a sane way to go.

> NO. They were "alone in their bucket". In other words, no active connection
> fell into the same bucket, so they won't come into play at all during lookup.
> And, as each entry has its individual timer, there is no scanning of the
> overall table, so they would be really untouched in real life, until they
> timeout after 5 days.

but how high is the probability that they are [and remain] 'alone in
their bucket' ?  This could just be a coincidence...

> At this point, I would like to stop for some time, and await results
> from other's real world ip_conntrack setups.

I'm going to do this on a couple of my firewalls on Jul 08 [that's the
single day I'm going to be in .de between my different trips...].  Right
now I only have dialup modem internet... So please wait another couple
of days.

Thanks.

> best regards
>   Patrick

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-29 16:48                   ` Don Cohen
  2002-06-29 17:22                     ` Patrick Schaaf
@ 2002-06-29 17:33                     ` Patrick Schaaf
  1 sibling, 0 replies; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-29 17:33 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

> BTW it turns out this leads to a good demonstration of my point about
> making the modulus relatively prime to the size of the data.

Missed that point on first reading. I totally agree with you on this
point. We already pay for the modulus in the current code, so we may
as well take advantage of it.

In an unrelated piece of software (phone call correlation), I use a
precomputed table of primes that are all just a bit smaller than a
power of two, and for any bucket count desired by the user, I really
use the next largest almost-power-of-two prime from that table.

best regards
  Patrick

static unsigned sizes[] = {
        3,	/* 2^2-1 */
        7,      /* 2^3-1 */
        13,
        31,     /* 2^5-1 */
        61,
        127,    /* 2^7-1 */
        251,
        509,
        1021,
        2039,
        4093,
        8191,   /* 2^13-1 */
        16381,
        32749,
        65521,
        131071, /* 2^17-1 */
        262139,
        524287, /* 2^19-1 */
        1048571,
        2097157,
        4194301,
        8388581,
};

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 22:30               ` Don Cohen
  2002-06-29  9:03                 ` Patrick Schaaf
@ 2002-06-29  9:29                 ` Patrick Schaaf
  2002-06-29 12:07                 ` Patrick Schaaf
  2 siblings, 0 replies; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-29  9:29 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

Don,

one more thing to observe for your tests: each line from ip_conntrack
is hashed twice, with mirrored source / destination addresses. Ideally,
for a sufficiently oversized bucket count, that would be immaterial.
For the too-small-bucket case, hash list lengths should double.
With a too-bad hash function, things could get even worse.

(My apologies if you already did this doubling in your test code; my meatware
lisp parser is not very good...)

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 22:30               ` Don Cohen
  2002-06-29  9:03                 ` Patrick Schaaf
  2002-06-29  9:29                 ` Patrick Schaaf
@ 2002-06-29 12:07                 ` Patrick Schaaf
  2002-06-29 12:34                   ` Patrick Schaaf
  2 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-29 12:07 UTC (permalink / raw)
  To: Don Cohen; +Cc: Patrick Schaaf, netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 2146 bytes --]

Hi all,

find appended, for use and review, my C version of a /proc/net/ip_conntrack
based hash function and bucket occupation test program. Here is a short
usage summary:

- extract crc32.c and cttest.c from the attachments
- compile with "gcc -o cttest cttest.c"
- by default, /proc/net/ip_conntrack is read. Use the "-f filename" option
  to read the given file, instead.
- use '-u' to enable unidirectional operation; the reverse tuple is ignored.
- by default, the hash bucket count is 7168. Use "-s NUMBER" to modify this.
  Look at the kernel output from ip_conntrack load time to find out the
  bucket count that is active on your system.
- use '-h NAME' to select the hash function to use. Right now you can use:
	o original
		this should be the hash function used by ip_conntrack
	o crc32
		a crc32 over proto/sip/dip/sport/dport
	o random
		no relation to the input at all; this still reads the
		input, but only to get at the number of entries.
		Instead of calculating an input based hash, random()
		is called.
- use '-i bucket-number' to show the input lines corresponding to the
  given hash bucket number. default is to not do this.
- use '-t count' to output the bucket-number corresponding to the first
  bucket with 'count' entries in it.

For example, if you find that the maximum bucket list length is 131,
use '-t 131', and look for "occupation target 131 is at XXX".
Then, use '-i XXX' on the next run to see all input lines falling into
bucket number XXX.

NOTE: as we have seen, it is likely that reading /proc/net/ip_conntrack
does produce duplicate lines. This test program does not try to cope
with that situation; if it applies to you, first save the output to
some file, and use "sort somefile | uniq >testfile" to generate the
input file for this test program.

Oh, maybe I should mention what the program outputs... The main output
comes after the "Count	Length	CT" heading. Each lines gives the COUNT
of all buckets having list LENGTH, sorted by decreasing length. CT gives
the cummulative total over the counts.

Everybody should see the results for their own system. Really. I mean it.

best regards
  Patrick

[-- Attachment #2: crc32.c --]
[-- Type: text/plain, Size: 4400 bytes --]

/* crc32.c -- compute the CRC-32 of a data stream
 * Copyright (C) 1995-2002 Mark Adler
 * stolen from zlib CVS
 */

/* @(#) $Id: crc32.c,v 1.2.2.1 2002/03/12 09:34:29 werner Exp $ */

/* ========================================================================
 * Table of CRC-32's of all single-byte values (made by make_crc_table)
 */
static const u32 crc_table[256] = {
  0x00000000L, 0x77073096L, 0xee0e612cL, 0x990951baL, 0x076dc419L,
  0x706af48fL, 0xe963a535L, 0x9e6495a3L, 0x0edb8832L, 0x79dcb8a4L,
  0xe0d5e91eL, 0x97d2d988L, 0x09b64c2bL, 0x7eb17cbdL, 0xe7b82d07L,
  0x90bf1d91L, 0x1db71064L, 0x6ab020f2L, 0xf3b97148L, 0x84be41deL,
  0x1adad47dL, 0x6ddde4ebL, 0xf4d4b551L, 0x83d385c7L, 0x136c9856L,
  0x646ba8c0L, 0xfd62f97aL, 0x8a65c9ecL, 0x14015c4fL, 0x63066cd9L,
  0xfa0f3d63L, 0x8d080df5L, 0x3b6e20c8L, 0x4c69105eL, 0xd56041e4L,
  0xa2677172L, 0x3c03e4d1L, 0x4b04d447L, 0xd20d85fdL, 0xa50ab56bL,
  0x35b5a8faL, 0x42b2986cL, 0xdbbbc9d6L, 0xacbcf940L, 0x32d86ce3L,
  0x45df5c75L, 0xdcd60dcfL, 0xabd13d59L, 0x26d930acL, 0x51de003aL,
  0xc8d75180L, 0xbfd06116L, 0x21b4f4b5L, 0x56b3c423L, 0xcfba9599L,
  0xb8bda50fL, 0x2802b89eL, 0x5f058808L, 0xc60cd9b2L, 0xb10be924L,
  0x2f6f7c87L, 0x58684c11L, 0xc1611dabL, 0xb6662d3dL, 0x76dc4190L,
  0x01db7106L, 0x98d220bcL, 0xefd5102aL, 0x71b18589L, 0x06b6b51fL,
  0x9fbfe4a5L, 0xe8b8d433L, 0x7807c9a2L, 0x0f00f934L, 0x9609a88eL,
  0xe10e9818L, 0x7f6a0dbbL, 0x086d3d2dL, 0x91646c97L, 0xe6635c01L,
  0x6b6b51f4L, 0x1c6c6162L, 0x856530d8L, 0xf262004eL, 0x6c0695edL,
  0x1b01a57bL, 0x8208f4c1L, 0xf50fc457L, 0x65b0d9c6L, 0x12b7e950L,
  0x8bbeb8eaL, 0xfcb9887cL, 0x62dd1ddfL, 0x15da2d49L, 0x8cd37cf3L,
  0xfbd44c65L, 0x4db26158L, 0x3ab551ceL, 0xa3bc0074L, 0xd4bb30e2L,
  0x4adfa541L, 0x3dd895d7L, 0xa4d1c46dL, 0xd3d6f4fbL, 0x4369e96aL,
  0x346ed9fcL, 0xad678846L, 0xda60b8d0L, 0x44042d73L, 0x33031de5L,
  0xaa0a4c5fL, 0xdd0d7cc9L, 0x5005713cL, 0x270241aaL, 0xbe0b1010L,
  0xc90c2086L, 0x5768b525L, 0x206f85b3L, 0xb966d409L, 0xce61e49fL,
  0x5edef90eL, 0x29d9c998L, 0xb0d09822L, 0xc7d7a8b4L, 0x59b33d17L,
  0x2eb40d81L, 0xb7bd5c3bL, 0xc0ba6cadL, 0xedb88320L, 0x9abfb3b6L,
  0x03b6e20cL, 0x74b1d29aL, 0xead54739L, 0x9dd277afL, 0x04db2615L,
  0x73dc1683L, 0xe3630b12L, 0x94643b84L, 0x0d6d6a3eL, 0x7a6a5aa8L,
  0xe40ecf0bL, 0x9309ff9dL, 0x0a00ae27L, 0x7d079eb1L, 0xf00f9344L,
  0x8708a3d2L, 0x1e01f268L, 0x6906c2feL, 0xf762575dL, 0x806567cbL,
  0x196c3671L, 0x6e6b06e7L, 0xfed41b76L, 0x89d32be0L, 0x10da7a5aL,
  0x67dd4accL, 0xf9b9df6fL, 0x8ebeeff9L, 0x17b7be43L, 0x60b08ed5L,
  0xd6d6a3e8L, 0xa1d1937eL, 0x38d8c2c4L, 0x4fdff252L, 0xd1bb67f1L,
  0xa6bc5767L, 0x3fb506ddL, 0x48b2364bL, 0xd80d2bdaL, 0xaf0a1b4cL,
  0x36034af6L, 0x41047a60L, 0xdf60efc3L, 0xa867df55L, 0x316e8eefL,
  0x4669be79L, 0xcb61b38cL, 0xbc66831aL, 0x256fd2a0L, 0x5268e236L,
  0xcc0c7795L, 0xbb0b4703L, 0x220216b9L, 0x5505262fL, 0xc5ba3bbeL,
  0xb2bd0b28L, 0x2bb45a92L, 0x5cb36a04L, 0xc2d7ffa7L, 0xb5d0cf31L,
  0x2cd99e8bL, 0x5bdeae1dL, 0x9b64c2b0L, 0xec63f226L, 0x756aa39cL,
  0x026d930aL, 0x9c0906a9L, 0xeb0e363fL, 0x72076785L, 0x05005713L,
  0x95bf4a82L, 0xe2b87a14L, 0x7bb12baeL, 0x0cb61b38L, 0x92d28e9bL,
  0xe5d5be0dL, 0x7cdcefb7L, 0x0bdbdf21L, 0x86d3d2d4L, 0xf1d4e242L,
  0x68ddb3f8L, 0x1fda836eL, 0x81be16cdL, 0xf6b9265bL, 0x6fb077e1L,
  0x18b74777L, 0x88085ae6L, 0xff0f6a70L, 0x66063bcaL, 0x11010b5cL,
  0x8f659effL, 0xf862ae69L, 0x616bffd3L, 0x166ccf45L, 0xa00ae278L,
  0xd70dd2eeL, 0x4e048354L, 0x3903b3c2L, 0xa7672661L, 0xd06016f7L,
  0x4969474dL, 0x3e6e77dbL, 0xaed16a4aL, 0xd9d65adcL, 0x40df0b66L,
  0x37d83bf0L, 0xa9bcae53L, 0xdebb9ec5L, 0x47b2cf7fL, 0x30b5ffe9L,
  0xbdbdf21cL, 0xcabac28aL, 0x53b39330L, 0x24b4a3a6L, 0xbad03605L,
  0xcdd70693L, 0x54de5729L, 0x23d967bfL, 0xb3667a2eL, 0xc4614ab8L,
  0x5d681b02L, 0x2a6f2b94L, 0xb40bbe37L, 0xc30c8ea1L, 0x5a05df1bL,
  0x2d02ef8dL
};

/* ========================================================================= */
#define DO1(buf) crc = crc_table[((int)crc ^ (*buf++)) & 0xff] ^ (crc >> 8);
#define DO2(buf)  DO1(buf); DO1(buf);
#define DO4(buf)  DO2(buf); DO2(buf);
#define DO8(buf)  DO4(buf); DO4(buf);

/* ========================================================================= */
u32 crc32(crc, buf, len)
    u32 crc;
    const char *buf;
    int len;
{
    crc = crc ^ 0xffffffffL;
    while (len >= 8)
    {
	DO8(buf);
	len -= 8;
    }
    if (len) do {
	DO1(buf);
    } while (--len);
    return crc ^ 0xffffffffL;
}

[-- Attachment #3: cttest.c --]
[-- Type: text/plain, Size: 5614 bytes --]

/* cttest.c - parse /proc/net/ip_conntrack, compute bucket occupation
 *
 * This is how lines look:
 *
 * udp      17 0 src=213.6.73.215 dst=62.104.191.241 sport=1858 dport=53 src=62.104.191.241 dst=213.6.73.215 sport=53 dport=1858 use=1 
 * tcp      6 114 TIME_WAIT src=62.104.211.64 dst=212.7.144.66 sport=34685 dport=80 src=212.7.144.66 dst=62.104.211.64 sport=80 dport=34685 [ASSURED] use=1 
 */

#include <stdio.h>
#include <string.h>
#include <malloc.h>
#include <stdlib.h>
#include <getopt.h>
#include <errno.h>
#include <netinet/in.h>
#include <arpa/inet.h>

typedef unsigned long u32;
typedef unsigned short u16;
typedef unsigned char u8;

struct ct_key {
	u32 sip;
	u32 dip;
	u16 sport;
	u16 dport;
	u8 proto;
};

static u32 hash_conntrack(struct ct_key *key)
{
	return ntohl(key->sip + key->dip
		+ key->sport + key->dport
		+ key->proto)
		+ ntohs(key->sport);
}

#include "crc32.c"

static u32 hash_crc32(struct ct_key *key)
{
	u32 crc = crc32(0, (const char *) &key->proto, sizeof(key->proto));
	u32 v32;
	u16 v16;
	v32 = ntohl(key->sip);
	crc = crc32(crc, (const char *) &v32, sizeof(v32));
	v32 = ntohl(key->dip);
	crc = crc32(crc, (const char *) &v32, sizeof(v32));
	v16 = ntohs(key->sport);
	crc = crc32(crc, (const char *) &v16, sizeof(v16));
	v16 = ntohs(key->dport);
	crc = crc32(crc, (const char *) &v16, sizeof(v16));
	return crc;
}

static u32 hash_random(struct ct_key *key)
{
	return (u32) random();
}

struct hash_def {
	char *name;
	u32 (*hash_f)(struct ct_key *);
};

#define NR_HASHES (sizeof(hashes)/sizeof(hashes[0]))

static struct hash_def hashes[] = {
	{ "original", hash_conntrack },
	{ "random", hash_random },
	{ "crc32", hash_crc32 },
};

static u32 htable_size = 7168;
static u32 *occupation;
static struct hash_def *test_hash = hashes;

static void do_init(void)
{
	u32 idx;
	occupation = malloc(sizeof(*occupation) * htable_size);
	for (idx=0; idx<htable_size; idx++)
		occupation[idx] = 0;
}

static int display_target = -1;

static int do_count(struct ct_key *key)
{
	u32 idx = test_hash->hash_f(key) % htable_size;
	occupation[idx]++;
	return ((int)idx) == display_target;
}

static u32 parse_ip(char *t)
{
	char *p = strchr(t, '=');
	if (!p++) return 0;
	return inet_addr(p);
}

static u16 parse_port(char *t)
{
	char *p = strchr(t, '=');
	if (!p++) return 0;
	return (u16) strtoul(p, 0, 10);
}

static int unidir;

static int process(char *p)
{
	struct ct_key key[1];
	char *tok;
	int res;

	strtok(p, " \t\n");					/* prot text */
	tok = strtok(0, " \t\n");
	key->proto = (u8) strtoul(tok, 0, 10);			/* prot num */
	strtok(0, " \t\n");					/* timeout */
	switch (key->proto) {
		case 6:  strtok(0, " \t\n");			/* tcp state */
		case 17: break;
		default: return 0;
	}
	/* forward */
	tok = strtok(0, " \t\n"); key->sip = parse_ip(tok);	/* src ip */
	tok = strtok(0, " \t\n"); key->dip = parse_ip(tok);	/* dst ip */
	tok = strtok(0, " \t\n"); key->sport = parse_port(tok);	/* src port */
	tok = strtok(0, " \t\n"); key->dport = parse_port(tok);	/* dst port */
	res = do_count(key);
	if (unidir) return res;
	/* reverse */
	tok = strtok(0, " \t\n"); key->sip = parse_ip(tok);	/* src ip */
	tok = strtok(0, " \t\n"); key->dip = parse_ip(tok);	/* dst ip */
	tok = strtok(0, " \t\n"); key->sport = parse_port(tok);	/* src port */
	tok = strtok(0, " \t\n"); key->dport = parse_port(tok);	/* dst port */
	return res | do_count(key);
}

static int occupation_target = -1;

static void show_occupation(void)
{
	u32 idx, maxlen, total, partial, *agg;

	for (maxlen=0, total=0, idx=0; idx<htable_size; idx++) {
		total += occupation[idx];
		if (occupation[idx] > maxlen)
			maxlen = occupation[idx];
	}
	printf("hash function: %s\n", test_hash->name);
	printf("maximum bucket list length: %lu\n", maxlen);
	printf("total number of entries: %lu\n", total);
	agg = malloc(sizeof(*agg) * (maxlen+1));
	for (idx=0; idx<=maxlen; idx++)
		agg[idx] = 0;
	for (idx=0; idx<htable_size; idx++) {
		agg[occupation[idx]]++;
		if (((int)occupation[idx]) == occupation_target) {
			printf("occupation target %d at %lu\n",
				occupation_target, idx);
			occupation_target = -1;
		}
	}
	printf("Count\tLength\tCT\n");
	for (partial=0, idx=maxlen; idx>0; idx--) {
		double d;
		if (agg[idx] == 0) continue;
		partial += idx * agg[idx];
		d = (100.0 * partial) / (1.0 * total);
		printf("%lu\t%lu\t%.02f%%\n", agg[idx], idx, d);
	}
	printf("%lu\t0\t100.00%%\n", agg[idx]);
}

static struct hash_def *find_hash(char *name)
{
	u32 idx;
	for (idx=0; idx<NR_HASHES; idx++)
		if (0 == strcmp(name, hashes[idx].name))
			return hashes+idx;
	fprintf(stderr, "no hash function named '%s'\n", name);
	return 0;
}

static void usage(void)
{
}

int main(int argc, char **argv)
{
	int c;
	char buf[512], copy[512];
	char *filename = "/proc/net/ip_conntrack";
	FILE *fp;

	while (EOF != (c = getopt(argc, argv, "f:us:h:i:t:"))) switch(c) {
		case 'f': filename = optarg; break;
		case 'u': unidir = 1; break;
		case 's': htable_size = strtoul(optarg, 0, 10); break;
		case 'h': test_hash = find_hash(optarg); break;
		case 'i': display_target = strtol(optarg, 0, 10); break;
		case 't': occupation_target = strtol(optarg, 0, 10); break;
		default: usage();
	}
	if (!test_hash) return 1;
	fp = (0 == strcmp(filename, "-")) ? stdin : fopen(filename, "r");
	if (!fp) {
		fprintf(stderr, "%s: %s\n", filename, strerror(errno));
		exit(1);
	}
	printf("filename: %s\n", filename);
	do_init();
	while (fgets(buf, sizeof(buf), fp)) {
		if (display_target != -1)
			strcpy(copy, buf);
		if (process(buf))
			printf("[%d]\t%s", display_target, copy);
	}
	if (fp != stdin) fclose(fp);
	show_occupation();
	return 0;
}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-29 12:07                 ` Patrick Schaaf
@ 2002-06-29 12:34                   ` Patrick Schaaf
  2002-06-30  8:31                     ` Patrick Schaaf
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-29 12:34 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: Don Cohen, netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 811 bytes --]

Hi again,

one more elaboration of the cttest program, before I go offline in preparation
for the fourth German football world championship...

This version adds a '-b X' option. If you specify that, all TCP conntracks
with timeouts less than X get ignored. For example, use "-b 431940" to only
count the conntracks active within the last 60 seconds.

Incidentally, this makes the current ip_conntrack hash function look a lot
less bad than before. I appears that the "long bucket" packets mostly belong
to some kind of port scan. They will time out eventually, and they will NOT
contribute significantly to CPU usage, because they will be rarely hit.

All is well, it seems. At least looking at my 33000 entry sample from
yesterday. I'm looking forward to results from other workloads!

best regards
  Patrick

[-- Attachment #2: cttest.c --]
[-- Type: text/plain, Size: 6095 bytes --]

/* cttest.c - parse /proc/net/ip_conntrack, compute bucket occupation
 *
 * This is how lines look:
 *
 * udp      17 0 src=213.6.73.215 dst=62.104.191.241 sport=1858 dport=53 src=62.104.191.241 dst=213.6.73.215 sport=53 dport=1858 use=1 
 * tcp      6 114 TIME_WAIT src=62.104.211.64 dst=212.7.144.66 sport=34685 dport=80 src=212.7.144.66 dst=62.104.211.64 sport=80 dport=34685 [ASSURED] use=1 
 */

#include <stdio.h>
#include <string.h>
#include <malloc.h>
#include <stdlib.h>
#include <getopt.h>
#include <errno.h>
#include <netinet/in.h>
#include <arpa/inet.h>

typedef unsigned long u32;
typedef unsigned short u16;
typedef unsigned char u8;

struct ct_key {
	u32 sip;
	u32 dip;
	u16 sport;
	u16 dport;
	u8 proto;
};

static u32 hash_conntrack(struct ct_key *key)
{
	return ntohl(key->sip + key->dip
		+ key->sport + key->dport
		+ key->proto)
		+ ntohs(key->sport);
}

#include "crc32.c"

static u32 hash_crc32(struct ct_key *key)
{
	u32 crc = crc32(0, (const char *) &key->proto, sizeof(key->proto));
	u32 v32;
	u16 v16;
	v32 = ntohl(key->sip);
	crc = crc32(crc, (const char *) &v32, sizeof(v32));
	v32 = ntohl(key->dip);
	crc = crc32(crc, (const char *) &v32, sizeof(v32));
	v16 = ntohs(key->sport);
	crc = crc32(crc, (const char *) &v16, sizeof(v16));
	v16 = ntohs(key->dport);
	crc = crc32(crc, (const char *) &v16, sizeof(v16));
	return crc;
}

static u32 hash_random(struct ct_key *key)
{
	return (u32) random();
}

struct hash_def {
	char *name;
	u32 (*hash_f)(struct ct_key *);
};

#define NR_HASHES (sizeof(hashes)/sizeof(hashes[0]))

static struct hash_def hashes[] = {
	{ "original", hash_conntrack },
	{ "random", hash_random },
	{ "crc32", hash_crc32 },
};

static u32 htable_size = 7168;
static u32 *occupation;
static struct hash_def *test_hash = hashes;

static void do_init(void)
{
	u32 idx;
	occupation = malloc(sizeof(*occupation) * htable_size);
	for (idx=0; idx<htable_size; idx++)
		occupation[idx] = 0;
}

static int display_target = -1;

static int do_count(struct ct_key *key)
{
	u32 idx = test_hash->hash_f(key) % htable_size;
	occupation[idx]++;
	return ((int)idx) == display_target;
}

static u32 parse_ip(char *t)
{
	char *p = strchr(t, '=');
	if (!p++) return 0;
	return inet_addr(p);
}

static u16 parse_port(char *t)
{
	char *p = strchr(t, '=');
	if (!p++) return 0;
	return (u16) strtoul(p, 0, 10);
}

static u32 ignore_tcp_timeout_bound = 0;

static int ignore_by_timeout(struct ct_key *key, char *tok)
{
	return	key->proto == 6
		&& strtoul(tok, 0, 10) < ignore_tcp_timeout_bound;
}

static int unidir;

static int process(char *p)
{
	struct ct_key key[1];
	char *tok;
	int res;

	strtok(p, " \t\n");					/* prot text */
	tok = strtok(0, " \t\n");
	key->proto = (u8) strtoul(tok, 0, 10);			/* prot num */
	tok = strtok(0, " \t\n");				/* timeout */
	if (ignore_by_timeout(key, tok)) return 0;
	switch (key->proto) {
		case 6:  strtok(0, " \t\n");			/* tcp state */
		case 17: break;
		default: return 0;
	}
	/* forward */
	tok = strtok(0, " \t\n"); key->sip = parse_ip(tok);	/* src ip */
	tok = strtok(0, " \t\n"); key->dip = parse_ip(tok);	/* dst ip */
	tok = strtok(0, " \t\n"); key->sport = parse_port(tok);	/* src port */
	tok = strtok(0, " \t\n"); key->dport = parse_port(tok);	/* dst port */
	res = do_count(key);
	if (unidir) return res;
	/* reverse */
	tok = strtok(0, " \t\n"); key->sip = parse_ip(tok);	/* src ip */
	tok = strtok(0, " \t\n"); key->dip = parse_ip(tok);	/* dst ip */
	tok = strtok(0, " \t\n"); key->sport = parse_port(tok);	/* src port */
	tok = strtok(0, " \t\n"); key->dport = parse_port(tok);	/* dst port */
	return res | do_count(key);
}

static int occupation_target = -1;

static void show_occupation(void)
{
	u32 idx, maxlen, total, partial, *agg;

	for (maxlen=0, total=0, idx=0; idx<htable_size; idx++) {
		total += occupation[idx];
		if (occupation[idx] > maxlen)
			maxlen = occupation[idx];
	}
	printf("hash function: %s\n", test_hash->name);
	printf("maximum bucket list length: %lu\n", maxlen);
	printf("total number of entries: %lu\n", total);
	agg = malloc(sizeof(*agg) * (maxlen+1));
	for (idx=0; idx<=maxlen; idx++)
		agg[idx] = 0;
	for (idx=0; idx<htable_size; idx++) {
		agg[occupation[idx]]++;
		if (((int)occupation[idx]) == occupation_target) {
			printf("occupation target %d at %lu\n",
				occupation_target, idx);
			occupation_target = -1;
		}
	}
	printf("Count\tLength\tCT\n");
	for (partial=0, idx=maxlen; idx>0; idx--) {
		double d;
		if (agg[idx] == 0) continue;
		partial += idx * agg[idx];
		d = (100.0 * partial) / (1.0 * total);
		printf("%lu\t%lu\t%.02f%%\n", agg[idx], idx, d);
	}
	printf("%lu\t0\t100.00%%\n", agg[idx]);
}

static struct hash_def *find_hash(char *name)
{
	u32 idx;
	for (idx=0; idx<NR_HASHES; idx++)
		if (0 == strcmp(name, hashes[idx].name))
			return hashes+idx;
	fprintf(stderr, "no hash function named '%s'\n", name);
	return 0;
}

static void usage(void)
{
	fprintf(stderr, "Usage: not written\n");
	exit(1);
}

int main(int argc, char **argv)
{
	int c;
	char buf[512], copy[512];
	char *filename = "/proc/net/ip_conntrack";
	FILE *fp;

	while (EOF != (c = getopt(argc, argv, "f:us:h:i:t:b:"))) switch(c) {
		case 'f': filename = optarg; break;
		case 'u': unidir = 1; break;
		case 's': htable_size = strtoul(optarg, 0, 10); break;
		case 'h': test_hash = find_hash(optarg); break;
		case 'i': display_target = strtol(optarg, 0, 10); break;
		case 't': occupation_target = strtol(optarg, 0, 10); break;
		case 'b': ignore_tcp_timeout_bound = strtoul(optarg, 0, 10); break;
		default: usage();
	}
	if (!test_hash) return 1;
	fp = (0 == strcmp(filename, "-")) ? stdin : fopen(filename, "r");
	if (!fp) {
		fprintf(stderr, "%s: %s\n", filename, strerror(errno));
		exit(1);
	}
	printf("filename: %s\n", filename);
	if (ignore_tcp_timeout_bound)
		printf("ignore TCP conntracks with timeout less than %lu\n",
			ignore_tcp_timeout_bound);
	do_init();
	while (fgets(buf, sizeof(buf), fp)) {
		if (display_target != -1)
			strcpy(copy, buf);
		if (process(buf))
			printf("[%d]\t%s", display_target, copy);
	}
	if (fp != stdin) fclose(fp);
	show_occupation();
	return 0;
}

[-- Attachment #3: crc32.c --]
[-- Type: text/plain, Size: 4400 bytes --]

/* crc32.c -- compute the CRC-32 of a data stream
 * Copyright (C) 1995-2002 Mark Adler
 * stolen from zlib CVS
 */

/* @(#) $Id: crc32.c,v 1.2.2.1 2002/03/12 09:34:29 werner Exp $ */

/* ========================================================================
 * Table of CRC-32's of all single-byte values (made by make_crc_table)
 */
static const u32 crc_table[256] = {
  0x00000000L, 0x77073096L, 0xee0e612cL, 0x990951baL, 0x076dc419L,
  0x706af48fL, 0xe963a535L, 0x9e6495a3L, 0x0edb8832L, 0x79dcb8a4L,
  0xe0d5e91eL, 0x97d2d988L, 0x09b64c2bL, 0x7eb17cbdL, 0xe7b82d07L,
  0x90bf1d91L, 0x1db71064L, 0x6ab020f2L, 0xf3b97148L, 0x84be41deL,
  0x1adad47dL, 0x6ddde4ebL, 0xf4d4b551L, 0x83d385c7L, 0x136c9856L,
  0x646ba8c0L, 0xfd62f97aL, 0x8a65c9ecL, 0x14015c4fL, 0x63066cd9L,
  0xfa0f3d63L, 0x8d080df5L, 0x3b6e20c8L, 0x4c69105eL, 0xd56041e4L,
  0xa2677172L, 0x3c03e4d1L, 0x4b04d447L, 0xd20d85fdL, 0xa50ab56bL,
  0x35b5a8faL, 0x42b2986cL, 0xdbbbc9d6L, 0xacbcf940L, 0x32d86ce3L,
  0x45df5c75L, 0xdcd60dcfL, 0xabd13d59L, 0x26d930acL, 0x51de003aL,
  0xc8d75180L, 0xbfd06116L, 0x21b4f4b5L, 0x56b3c423L, 0xcfba9599L,
  0xb8bda50fL, 0x2802b89eL, 0x5f058808L, 0xc60cd9b2L, 0xb10be924L,
  0x2f6f7c87L, 0x58684c11L, 0xc1611dabL, 0xb6662d3dL, 0x76dc4190L,
  0x01db7106L, 0x98d220bcL, 0xefd5102aL, 0x71b18589L, 0x06b6b51fL,
  0x9fbfe4a5L, 0xe8b8d433L, 0x7807c9a2L, 0x0f00f934L, 0x9609a88eL,
  0xe10e9818L, 0x7f6a0dbbL, 0x086d3d2dL, 0x91646c97L, 0xe6635c01L,
  0x6b6b51f4L, 0x1c6c6162L, 0x856530d8L, 0xf262004eL, 0x6c0695edL,
  0x1b01a57bL, 0x8208f4c1L, 0xf50fc457L, 0x65b0d9c6L, 0x12b7e950L,
  0x8bbeb8eaL, 0xfcb9887cL, 0x62dd1ddfL, 0x15da2d49L, 0x8cd37cf3L,
  0xfbd44c65L, 0x4db26158L, 0x3ab551ceL, 0xa3bc0074L, 0xd4bb30e2L,
  0x4adfa541L, 0x3dd895d7L, 0xa4d1c46dL, 0xd3d6f4fbL, 0x4369e96aL,
  0x346ed9fcL, 0xad678846L, 0xda60b8d0L, 0x44042d73L, 0x33031de5L,
  0xaa0a4c5fL, 0xdd0d7cc9L, 0x5005713cL, 0x270241aaL, 0xbe0b1010L,
  0xc90c2086L, 0x5768b525L, 0x206f85b3L, 0xb966d409L, 0xce61e49fL,
  0x5edef90eL, 0x29d9c998L, 0xb0d09822L, 0xc7d7a8b4L, 0x59b33d17L,
  0x2eb40d81L, 0xb7bd5c3bL, 0xc0ba6cadL, 0xedb88320L, 0x9abfb3b6L,
  0x03b6e20cL, 0x74b1d29aL, 0xead54739L, 0x9dd277afL, 0x04db2615L,
  0x73dc1683L, 0xe3630b12L, 0x94643b84L, 0x0d6d6a3eL, 0x7a6a5aa8L,
  0xe40ecf0bL, 0x9309ff9dL, 0x0a00ae27L, 0x7d079eb1L, 0xf00f9344L,
  0x8708a3d2L, 0x1e01f268L, 0x6906c2feL, 0xf762575dL, 0x806567cbL,
  0x196c3671L, 0x6e6b06e7L, 0xfed41b76L, 0x89d32be0L, 0x10da7a5aL,
  0x67dd4accL, 0xf9b9df6fL, 0x8ebeeff9L, 0x17b7be43L, 0x60b08ed5L,
  0xd6d6a3e8L, 0xa1d1937eL, 0x38d8c2c4L, 0x4fdff252L, 0xd1bb67f1L,
  0xa6bc5767L, 0x3fb506ddL, 0x48b2364bL, 0xd80d2bdaL, 0xaf0a1b4cL,
  0x36034af6L, 0x41047a60L, 0xdf60efc3L, 0xa867df55L, 0x316e8eefL,
  0x4669be79L, 0xcb61b38cL, 0xbc66831aL, 0x256fd2a0L, 0x5268e236L,
  0xcc0c7795L, 0xbb0b4703L, 0x220216b9L, 0x5505262fL, 0xc5ba3bbeL,
  0xb2bd0b28L, 0x2bb45a92L, 0x5cb36a04L, 0xc2d7ffa7L, 0xb5d0cf31L,
  0x2cd99e8bL, 0x5bdeae1dL, 0x9b64c2b0L, 0xec63f226L, 0x756aa39cL,
  0x026d930aL, 0x9c0906a9L, 0xeb0e363fL, 0x72076785L, 0x05005713L,
  0x95bf4a82L, 0xe2b87a14L, 0x7bb12baeL, 0x0cb61b38L, 0x92d28e9bL,
  0xe5d5be0dL, 0x7cdcefb7L, 0x0bdbdf21L, 0x86d3d2d4L, 0xf1d4e242L,
  0x68ddb3f8L, 0x1fda836eL, 0x81be16cdL, 0xf6b9265bL, 0x6fb077e1L,
  0x18b74777L, 0x88085ae6L, 0xff0f6a70L, 0x66063bcaL, 0x11010b5cL,
  0x8f659effL, 0xf862ae69L, 0x616bffd3L, 0x166ccf45L, 0xa00ae278L,
  0xd70dd2eeL, 0x4e048354L, 0x3903b3c2L, 0xa7672661L, 0xd06016f7L,
  0x4969474dL, 0x3e6e77dbL, 0xaed16a4aL, 0xd9d65adcL, 0x40df0b66L,
  0x37d83bf0L, 0xa9bcae53L, 0xdebb9ec5L, 0x47b2cf7fL, 0x30b5ffe9L,
  0xbdbdf21cL, 0xcabac28aL, 0x53b39330L, 0x24b4a3a6L, 0xbad03605L,
  0xcdd70693L, 0x54de5729L, 0x23d967bfL, 0xb3667a2eL, 0xc4614ab8L,
  0x5d681b02L, 0x2a6f2b94L, 0xb40bbe37L, 0xc30c8ea1L, 0x5a05df1bL,
  0x2d02ef8dL
};

/* ========================================================================= */
#define DO1(buf) crc = crc_table[((int)crc ^ (*buf++)) & 0xff] ^ (crc >> 8);
#define DO2(buf)  DO1(buf); DO1(buf);
#define DO4(buf)  DO2(buf); DO2(buf);
#define DO8(buf)  DO4(buf); DO4(buf);

/* ========================================================================= */
u32 crc32(crc, buf, len)
    u32 crc;
    const char *buf;
    int len;
{
    crc = crc ^ 0xffffffffL;
    while (len >= 8)
    {
	DO8(buf);
	len -= 8;
    }
    if (len) do {
	DO1(buf);
    } while (--len);
    return crc ^ 0xffffffffL;
}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-29 12:34                   ` Patrick Schaaf
@ 2002-06-30  8:31                     ` Patrick Schaaf
  2002-06-30 19:40                       ` Don Cohen
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-06-30  8:31 UTC (permalink / raw)
  To: don-nf; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1682 bytes --]

Hi all,

here is the next version of my cttest.c conntrack test program.

Thanks go to Don Cohen for noticing that my parser was fucked up.
I did not notice that we sometimes have the [XXX] state marker
between the two tuples, and sometimes we have it after both tuples.
How ugly... This is now fixed by skipping a field when the second
tuple starts with '['.

Incidentally, the occasional overly long chains (140 entry lists etc.)
are now gone gone gone. They were an artifact of the misparsing...

A small fix I also did, was to (correctly) parse the ports into network
byte order. Now the "original" hash function should really match what's
in the kernel.

We now see a nice "spread" of list lengths for both the original hash
function, as well as the crc32 one, without strange outliers. For my
bucket count of 7168 (default for a 1GB RAM non-highmem machine...),
the original hash is still much worse than the crc32 one.

At this point, I implemented some new options to the cttest program:

Use the new '-P' flag (no arguments) to automatically choose a prime
number of hash buckets, just a bit larger than the number you selected
with '-s NUMBER'. When I do that, the large difference between the
original and crc32 hashes almost vanishes - sometimes our original
hash even looks _better_ (by chance)! This is a strong indication
that we want the "prime select" in the kernel!

You can now use '-c NUMBER' to set a seed for the CRC32 hash. You can now
use '-C' (without parameter) to set a random seed for the CRC32 hash.
These options are for testing the idea of "attack compensation by
inserting local variance". I'd like to hear opinions on this.

best regards
  Patrick


[-- Attachment #2: cttest.c --]
[-- Type: text/plain, Size: 7297 bytes --]

/* cttest.c - parse /proc/net/ip_conntrack, compute bucket occupation
 *
 * This is how lines look:
 *
 * udp      17 0 src=213.6.73.215 dst=62.104.191.241 sport=1858 dport=53 src=62.104.191.241 dst=213.6.73.215 sport=53 dport=1858 use=1 
 * tcp      6 114 TIME_WAIT src=62.104.211.64 dst=212.7.144.66 sport=34685 dport=80 src=212.7.144.66 dst=62.104.211.64 sport=80 dport=34685 [ASSURED] use=1 
 */

#include <stdio.h>
#include <string.h>
#include <malloc.h>
#include <stdlib.h>
#include <getopt.h>
#include <errno.h>
#include <time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

typedef unsigned long u32;
typedef unsigned short u16;
typedef unsigned char u8;

struct ct_key {
	u32 sip;
	u32 dip;
	u16 sport;
	u16 dport;
	u8 proto;
};

static u32 hash_conntrack(struct ct_key *key)
{
	return ntohl(key->sip + key->dip
		+ key->sport + key->dport
		+ key->proto)
		+ ntohs(key->sport);
}

#include "crc32.c"

static u32 crc32_seed;

static u32 hash_crc32(struct ct_key *key)
{
	u32 crc = crc32_seed;
	u32 v32;
	u16 v16;
	crc = crc32(crc, (const char *) &key->proto, sizeof(key->proto));
	v32 = ntohl(key->sip);
	crc = crc32(crc, (const char *) &v32, sizeof(v32));
	v32 = ntohl(key->dip);
	crc = crc32(crc, (const char *) &v32, sizeof(v32));
	v16 = ntohs(key->sport);
	crc = crc32(crc, (const char *) &v16, sizeof(v16));
	v16 = ntohs(key->dport);
	crc = crc32(crc, (const char *) &v16, sizeof(v16));
	return crc;
}

static u32 hash_random(struct ct_key *key)
{
	return (u32) random();
}

struct hash_def {
	char *name;
	u32 (*hash_f)(struct ct_key *);
};

#define NR_HASHES (sizeof(hashes)/sizeof(hashes[0]))

static struct hash_def hashes[] = {
	{ "original", hash_conntrack },
	{ "random", hash_random },
	{ "crc32", hash_crc32 },
};

static u32 htable_size = 7168;
static u32 *occupation;
static struct hash_def *test_hash = hashes;

static void do_init(void)
{
	u32 idx;
	occupation = malloc(sizeof(*occupation) * htable_size);
	for (idx=0; idx<htable_size; idx++)
		occupation[idx] = 0;
}

static int display_target = -1;

static int do_count(struct ct_key *key)
{
	u32 idx = test_hash->hash_f(key) % htable_size;
	occupation[idx]++;
	return ((int)idx) == display_target;
}

static u32 parse_ip(char *t)
{
	char *p = strchr(t, '=');
	if (!p++) return 0;
	return inet_addr(p);
}

static u16 parse_port(char *t)
{
	char *p = strchr(t, '=');
	if (!p++) return 0;
	return htons((u16) strtoul(p, 0, 10));
}

static u32 ignore_tcp_timeout_bound = 0;

static int ignore_by_timeout(struct ct_key *key, char *tok)
{
	return	key->proto == 6
		&& strtoul(tok, 0, 10) < ignore_tcp_timeout_bound;
}

static int unidir;

static int process(char *p)
{
	struct ct_key key[1];
	char *tok;
	int res;

	strtok(p, " \t\n");					/* prot text */
	tok = strtok(0, " \t\n");
	key->proto = (u8) strtoul(tok, 0, 10);			/* prot num */
	tok = strtok(0, " \t\n");				/* timeout */
	if (ignore_by_timeout(key, tok)) return 0;
	switch (key->proto) {
		case 6:  strtok(0, " \t\n");			/* tcp state */
		case 17: break;
		default: return 0;
	}
	/* forward */
	tok = strtok(0, " \t\n"); key->sip = parse_ip(tok);	/* src ip */
	tok = strtok(0, " \t\n"); key->dip = parse_ip(tok);	/* dst ip */
	tok = strtok(0, " \t\n"); key->sport = parse_port(tok);	/* src port */
	tok = strtok(0, " \t\n"); key->dport = parse_port(tok);	/* dst port */
	res = do_count(key);
	if (unidir) return res;
	/* reverse */
	tok = strtok(0, " \t\n");
	if (*tok == '[') tok = strtok(0, " \t\n");
	key->sip = parse_ip(tok);				/* src ip */
	tok = strtok(0, " \t\n"); key->dip = parse_ip(tok);	/* dst ip */
	tok = strtok(0, " \t\n"); key->sport = parse_port(tok);	/* src port */
	tok = strtok(0, " \t\n"); key->dport = parse_port(tok);	/* dst port */
	return res | do_count(key);
}

static int occupation_target = -1;

static void show_occupation(void)
{
	u32 idx, maxlen, total, partial, *agg;

	for (maxlen=0, total=0, idx=0; idx<htable_size; idx++) {
		total += occupation[idx];
		if (occupation[idx] > maxlen)
			maxlen = occupation[idx];
	}
	printf("maximum bucket list length: %lu\n", maxlen);
	printf("total number of entries: %lu\n", total);
	agg = malloc(sizeof(*agg) * (maxlen+1));
	for (idx=0; idx<=maxlen; idx++)
		agg[idx] = 0;
	for (idx=0; idx<htable_size; idx++) {
		agg[occupation[idx]]++;
		if (((int)occupation[idx]) == occupation_target) {
			printf("occupation target %d at %lu\n",
				occupation_target, idx);
			occupation_target = -1;
		}
	}
	printf("Count\tLength\tCT\n");
	for (partial=0, idx=maxlen; idx>0; idx--) {
		double d;
		if (agg[idx] == 0) continue;
		partial += idx * agg[idx];
		d = (100.0 * partial) / (1.0 * total);
		printf("%lu\t%lu\t%.02f%%\n", agg[idx], idx, d);
	}
	printf("%lu\t0\t100.00%%\n", agg[idx]);
}

static struct hash_def *find_hash(char *name)
{
	u32 idx;
	for (idx=0; idx<NR_HASHES; idx++)
		if (0 == strcmp(name, hashes[idx].name))
			return hashes+idx;
	fprintf(stderr, "no hash function named '%s'\n", name);
	return 0;
}

static void usage(void)
{
	fprintf(stderr, "Usage: not written\n");
	exit(1);
}

static u32 prime_select(u32);

int main(int argc, char **argv)
{
	int c;
	char buf[512], copy[512];
	char *filename = "/proc/net/ip_conntrack";
	FILE *fp;
	int select_prime = 0;

	while (EOF!=(c=getopt(argc, argv, "f:s:h:i:t:b:c:CPu"))) switch(c) {
		case 'f': filename = optarg; break;
		case 'u': unidir = 1; break;
		case 's': htable_size = strtoul(optarg, 0, 10); break;
		case 'P': select_prime = 1; break;
		case 'h': test_hash = find_hash(optarg); break;
		case 'i': display_target = strtol(optarg, 0, 10); break;
		case 't': occupation_target = strtol(optarg, 0, 10); break;
		case 'b': ignore_tcp_timeout_bound = strtoul(optarg, 0, 10); break;
		case 'c': crc32_seed = strtoul(optarg, 0, 10); break;
		case 'C': srandom(time(0)); crc32_seed = random(); break;
		default: usage();
	}
	if (!test_hash) return 1;
	fp = (0 == strcmp(filename, "-")) ? stdin : fopen(filename, "r");
	if (!fp) {
		fprintf(stderr, "%s: %s\n", filename, strerror(errno));
		exit(1);
	}
	if (select_prime) htable_size = prime_select(htable_size);
	printf("filename: %s\n", filename);
	printf("hash function: %s\n", test_hash->name);
	printf("number of buckets: %lu\n", htable_size);
	if (ignore_tcp_timeout_bound)
		printf("ignore TCP conntracks with timeout less than %lu\n",
			ignore_tcp_timeout_bound);
	if (crc32_seed)
		printf("CRC32 seed %lu\n", crc32_seed);
	do_init();
	while (fgets(buf, sizeof(buf), fp)) {
		if (display_target != -1)
			strcpy(copy, buf);
		if (process(buf))
			printf("[%d]\t%s", display_target, copy);
	}
	if (fp != stdin) fclose(fp);
	show_occupation();
	return 0;
}

static u32 prime_sizes[] = {
        3,      /* 2^2-1 */
        7,      /* 2^3-1 */
        13,
        31,     /* 2^5-1 */
        61,
        127,    /* 2^7-1 */
        251,
        509,
        1021,
        2039,
        4093,
        8191,   /* 2^13-1 */
        16381,
        32749,
        65521,
        131071, /* 2^17-1 */
        262139,
        524287, /* 2^19-1 */
        1048571,
        2097157,
        4194301,
        8388581,
};

#define NR_PRIME_SIZES (sizeof(prime_sizes)/sizeof(prime_sizes[0]))

static u32 prime_select(u32 val)
{
	u32 idx;

	for (idx=0; idx<NR_PRIME_SIZES; idx++)
		if (val < prime_sizes[idx])
			return prime_sizes[idx];
	return prime_sizes[NR_PRIME_SIZES-1];
}

[-- Attachment #3: crc32.c --]
[-- Type: text/plain, Size: 4400 bytes --]

/* crc32.c -- compute the CRC-32 of a data stream
 * Copyright (C) 1995-2002 Mark Adler
 * stolen from zlib CVS
 */

/* @(#) $Id: crc32.c,v 1.2.2.1 2002/03/12 09:34:29 werner Exp $ */

/* ========================================================================
 * Table of CRC-32's of all single-byte values (made by make_crc_table)
 */
static const u32 crc_table[256] = {
  0x00000000L, 0x77073096L, 0xee0e612cL, 0x990951baL, 0x076dc419L,
  0x706af48fL, 0xe963a535L, 0x9e6495a3L, 0x0edb8832L, 0x79dcb8a4L,
  0xe0d5e91eL, 0x97d2d988L, 0x09b64c2bL, 0x7eb17cbdL, 0xe7b82d07L,
  0x90bf1d91L, 0x1db71064L, 0x6ab020f2L, 0xf3b97148L, 0x84be41deL,
  0x1adad47dL, 0x6ddde4ebL, 0xf4d4b551L, 0x83d385c7L, 0x136c9856L,
  0x646ba8c0L, 0xfd62f97aL, 0x8a65c9ecL, 0x14015c4fL, 0x63066cd9L,
  0xfa0f3d63L, 0x8d080df5L, 0x3b6e20c8L, 0x4c69105eL, 0xd56041e4L,
  0xa2677172L, 0x3c03e4d1L, 0x4b04d447L, 0xd20d85fdL, 0xa50ab56bL,
  0x35b5a8faL, 0x42b2986cL, 0xdbbbc9d6L, 0xacbcf940L, 0x32d86ce3L,
  0x45df5c75L, 0xdcd60dcfL, 0xabd13d59L, 0x26d930acL, 0x51de003aL,
  0xc8d75180L, 0xbfd06116L, 0x21b4f4b5L, 0x56b3c423L, 0xcfba9599L,
  0xb8bda50fL, 0x2802b89eL, 0x5f058808L, 0xc60cd9b2L, 0xb10be924L,
  0x2f6f7c87L, 0x58684c11L, 0xc1611dabL, 0xb6662d3dL, 0x76dc4190L,
  0x01db7106L, 0x98d220bcL, 0xefd5102aL, 0x71b18589L, 0x06b6b51fL,
  0x9fbfe4a5L, 0xe8b8d433L, 0x7807c9a2L, 0x0f00f934L, 0x9609a88eL,
  0xe10e9818L, 0x7f6a0dbbL, 0x086d3d2dL, 0x91646c97L, 0xe6635c01L,
  0x6b6b51f4L, 0x1c6c6162L, 0x856530d8L, 0xf262004eL, 0x6c0695edL,
  0x1b01a57bL, 0x8208f4c1L, 0xf50fc457L, 0x65b0d9c6L, 0x12b7e950L,
  0x8bbeb8eaL, 0xfcb9887cL, 0x62dd1ddfL, 0x15da2d49L, 0x8cd37cf3L,
  0xfbd44c65L, 0x4db26158L, 0x3ab551ceL, 0xa3bc0074L, 0xd4bb30e2L,
  0x4adfa541L, 0x3dd895d7L, 0xa4d1c46dL, 0xd3d6f4fbL, 0x4369e96aL,
  0x346ed9fcL, 0xad678846L, 0xda60b8d0L, 0x44042d73L, 0x33031de5L,
  0xaa0a4c5fL, 0xdd0d7cc9L, 0x5005713cL, 0x270241aaL, 0xbe0b1010L,
  0xc90c2086L, 0x5768b525L, 0x206f85b3L, 0xb966d409L, 0xce61e49fL,
  0x5edef90eL, 0x29d9c998L, 0xb0d09822L, 0xc7d7a8b4L, 0x59b33d17L,
  0x2eb40d81L, 0xb7bd5c3bL, 0xc0ba6cadL, 0xedb88320L, 0x9abfb3b6L,
  0x03b6e20cL, 0x74b1d29aL, 0xead54739L, 0x9dd277afL, 0x04db2615L,
  0x73dc1683L, 0xe3630b12L, 0x94643b84L, 0x0d6d6a3eL, 0x7a6a5aa8L,
  0xe40ecf0bL, 0x9309ff9dL, 0x0a00ae27L, 0x7d079eb1L, 0xf00f9344L,
  0x8708a3d2L, 0x1e01f268L, 0x6906c2feL, 0xf762575dL, 0x806567cbL,
  0x196c3671L, 0x6e6b06e7L, 0xfed41b76L, 0x89d32be0L, 0x10da7a5aL,
  0x67dd4accL, 0xf9b9df6fL, 0x8ebeeff9L, 0x17b7be43L, 0x60b08ed5L,
  0xd6d6a3e8L, 0xa1d1937eL, 0x38d8c2c4L, 0x4fdff252L, 0xd1bb67f1L,
  0xa6bc5767L, 0x3fb506ddL, 0x48b2364bL, 0xd80d2bdaL, 0xaf0a1b4cL,
  0x36034af6L, 0x41047a60L, 0xdf60efc3L, 0xa867df55L, 0x316e8eefL,
  0x4669be79L, 0xcb61b38cL, 0xbc66831aL, 0x256fd2a0L, 0x5268e236L,
  0xcc0c7795L, 0xbb0b4703L, 0x220216b9L, 0x5505262fL, 0xc5ba3bbeL,
  0xb2bd0b28L, 0x2bb45a92L, 0x5cb36a04L, 0xc2d7ffa7L, 0xb5d0cf31L,
  0x2cd99e8bL, 0x5bdeae1dL, 0x9b64c2b0L, 0xec63f226L, 0x756aa39cL,
  0x026d930aL, 0x9c0906a9L, 0xeb0e363fL, 0x72076785L, 0x05005713L,
  0x95bf4a82L, 0xe2b87a14L, 0x7bb12baeL, 0x0cb61b38L, 0x92d28e9bL,
  0xe5d5be0dL, 0x7cdcefb7L, 0x0bdbdf21L, 0x86d3d2d4L, 0xf1d4e242L,
  0x68ddb3f8L, 0x1fda836eL, 0x81be16cdL, 0xf6b9265bL, 0x6fb077e1L,
  0x18b74777L, 0x88085ae6L, 0xff0f6a70L, 0x66063bcaL, 0x11010b5cL,
  0x8f659effL, 0xf862ae69L, 0x616bffd3L, 0x166ccf45L, 0xa00ae278L,
  0xd70dd2eeL, 0x4e048354L, 0x3903b3c2L, 0xa7672661L, 0xd06016f7L,
  0x4969474dL, 0x3e6e77dbL, 0xaed16a4aL, 0xd9d65adcL, 0x40df0b66L,
  0x37d83bf0L, 0xa9bcae53L, 0xdebb9ec5L, 0x47b2cf7fL, 0x30b5ffe9L,
  0xbdbdf21cL, 0xcabac28aL, 0x53b39330L, 0x24b4a3a6L, 0xbad03605L,
  0xcdd70693L, 0x54de5729L, 0x23d967bfL, 0xb3667a2eL, 0xc4614ab8L,
  0x5d681b02L, 0x2a6f2b94L, 0xb40bbe37L, 0xc30c8ea1L, 0x5a05df1bL,
  0x2d02ef8dL
};

/* ========================================================================= */
#define DO1(buf) crc = crc_table[((int)crc ^ (*buf++)) & 0xff] ^ (crc >> 8);
#define DO2(buf)  DO1(buf); DO1(buf);
#define DO4(buf)  DO2(buf); DO2(buf);
#define DO8(buf)  DO4(buf); DO4(buf);

/* ========================================================================= */
u32 crc32(crc, buf, len)
    u32 crc;
    const char *buf;
    int len;
{
    crc = crc ^ 0xffffffffL;
    while (len >= 8)
    {
	DO8(buf);
	len -= 8;
    }
    if (len) do {
	DO1(buf);
    } while (--len);
    return crc ^ 0xffffffffL;
}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-30  8:31                     ` Patrick Schaaf
@ 2002-06-30 19:40                       ` Don Cohen
  2002-07-01  8:07                         ` Henrik Nordstrom
  2002-07-02 14:55                         ` Harald Welte
  0 siblings, 2 replies; 55+ messages in thread
From: Don Cohen @ 2002-06-30 19:40 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

Patrick Schaaf writes:
 > Use the new '-P' flag (no arguments) to automatically choose a prime
 > number of hash buckets, just a bit larger than the number you selected
 > with '-s NUMBER'. When I do that, the large difference between the
 > original and crc32 hashes almost vanishes - sometimes our original
 > hash even looks _better_ (by chance)! This is a strong indication
 > that we want the "prime select" in the kernel!

Actually, I believe that this is overkill.  All we really need is for
the components that go into the data to have sizes relatively prime to
the modulus, and since all the data has sizes of form 2^n it should be
sufficient to make the modulus odd.  That's why I suggested the
modulus be "if table size odd then table size otherwise one less".
You might try that to convince yourself that it works.

 > You can now use '-c NUMBER' to set a seed for the CRC32 hash. You can now
 > use '-C' (without parameter) to set a random seed for the CRC32 hash.
 > These options are for testing the idea of "attack compensation by
 > inserting local variance". I'd like to hear opinions on this.

I'm sorry I don't know enough about CRC to know whether this will
help.  How does the -c argument affect the answer?  My first test
is to try specifying -c and not, and I'm encouraged that I get
different distributions.

At a higher level, though, I don't see much value in preventing an
attacker from filling a single bucket when he can almost as easily do
much more harm by filling the entire conntrack database!  The reason
for limiting the size of each bucket is to prevent any such attack
from killing performance by consuming time, and it does that, but it
doesn't prevent a resource consumption attack on space.

Clearly one easy defense against one easy attack (as was mentioned in 
private communication) is that whenever you want to add to a bucket
that is full, you should feel free to throw out the oldest UNREPLIED
connection in that bucket.  

If you're interested in preventing attacks that consume connection
table space devoted to "real" connections I have some ideas, but
that's another large discussion.  If lots of people out there want to
hear it I'll send to the list, otherwise perhaps it's better off the
list.

On a related subject, I'm worried that UNREPLIED might not mean
what I think it does.  Your data contains things like:
 tcp 6 387070 ESTABLISHED src=9.163.211.64 dst=165.130.71.38 sport=3228
 dport=1301 [UNREPLIED] src=165.130.71.38 dst=9.163.211.64 sport=1301
 dport=3228 use=1
How can one half of the connection be established while the other half
is unreplied?  Is this cause in one direction a syn was sent and a
syn-ack was received, and that makes it established, while in the
other direction a syn-ack was sent and nothing more was received so
it's unreplied?  That doesn't make much sense.  They should both be
established only if the entire handshake is complete, right?

You were describing unreplied connections that hang around for 5 days.
When I send some random syn's I see stuff like
 cat /proc/net/ip_conntrack
 tcp      6 113 SYN_SENT src=10.0.3.2 dst=10.0.2.3 sport=4096 ...
I'm guessing the 113 is the timeout, and that it was originally set to
120, which still seems way too long to me, but a lot better than
5 days.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-30 19:40                       ` Don Cohen
@ 2002-07-01  8:07                         ` Henrik Nordstrom
  2002-07-01 17:49                           ` Don Cohen
       [not found]                           ` <15652.38084.704660.234319@isis.cs3-inc.com>
  2002-07-02 14:55                         ` Harald Welte
  1 sibling, 2 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2002-07-01  8:07 UTC (permalink / raw)
  To: Don Cohen, Patrick Schaaf; +Cc: netfilter-devel

Don Cohen wrote:

> On a related subject, I'm worried that UNREPLIED might not mean
> what I think it does.  Your data contains things like:
>  tcp 6 387070 ESTABLISHED src=9.163.211.64 dst=165.130.71.38 sport=3228
>  dport=1301 [UNREPLIED] src=165.130.71.38 dst=9.163.211.64 sport=1301
>  dport=3228 use=1
> How can one half of the connection be established while the other half
> is unreplied?

The ESTABLISHED indicates the TCP state, UNREPLIED indicates the conntrack 
state. This is a TCP session that has only seen ACK in one direction, no 
packets in the other.

Almost related note: The connection is not ASSURED.

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-01  8:07                         ` Henrik Nordstrom
@ 2002-07-01 17:49                           ` Don Cohen
  2002-07-02  7:58                             ` Henrik Nordstrom
       [not found]                           ` <15652.38084.704660.234319@isis.cs3-inc.com>
  1 sibling, 1 reply; 55+ messages in thread
From: Don Cohen @ 2002-07-01 17:49 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: netfilter-devel

Henrik Nordstrom writes:
 > Don Cohen wrote:
 > 
 > > On a related subject, I'm worried that UNREPLIED might not mean
 > > what I think it does.  Your data contains things like:
 > >  tcp 6 387070 ESTABLISHED src=9.163.211.64 dst=165.130.71.38 sport=3228
 > >  dport=1301 [UNREPLIED] src=165.130.71.38 dst=9.163.211.64 sport=1301
 > >  dport=3228 use=1
 > > How can one half of the connection be established while the other half
 > > is unreplied?
 > 
 > The ESTABLISHED indicates the TCP state, UNREPLIED indicates the conntrack 
 > state. This is a TCP session that has only seen ACK in one direction, no 
 > packets in the other.
 > 
 > Almost related note: The connection is not ASSURED.

I'm having trouble making sense of your explanation above.
This line is supposed to describe a single connection, right?
Established as a tcp state means the three packet handshake is
complete?  But that seems to contradict the unreplied.

Is there any doc for stuff like this?
- how to read the lines above
- what exactly these things (unreplied, assured, established ...) mean
- can I match on ASSURED ?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-01 17:49                           ` Don Cohen
@ 2002-07-02  7:58                             ` Henrik Nordstrom
  0 siblings, 0 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2002-07-02  7:58 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

On Monday 01 July 2002 19.49, Don Cohen wrote:

>  > The ESTABLISHED indicates the TCP state, UNREPLIED indicates the
>  > conntrack state. This is a TCP session that has only seen ACK in
>  > one direction, no packets in the other.
>  >
>  > Almost related note: The connection is not ASSURED.
>
> I'm having trouble making sense of your explanation above.
> This line is supposed to describe a single connection, right?
> Established as a tcp state means the three packet handshake is
> complete?  But that seems to contradict the unreplied.

See the archives. This was discussed to death some days ago.

Summary in short: TCP state only indicates what kind of packets are 
currently seen on the connection. This can be derived from a single 
packet due to "connection pickup".

> Is there any doc for stuff like this?
> - how to read the lines above
> - what exactly these things (unreplied, assured, established ...)
> mean - can I match on ASSURED ?

ASSURED can be matched using the new conntrack match found in 
patch-o-matic. Normally this flag is only used by conntrack to 
garbagecollect invalid entries in case of a DoS attempt. There isn't 
really much use of matching it in rulesets.

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

[parent not found: <15652.38084.704660.234319@isis.cs3-inc.com>]

* Re: conntrack performance/DoS formula
       [not found]                           ` <15652.38084.704660.234319@isis.cs3-inc.com>
@ 2002-07-04 21:53                             ` Henrik Nordstrom
  2002-07-05  7:08                               ` Don Cohen
  0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2002-07-04 21:53 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

On Thursday 04 July 2002 20.32, Don Cohen wrote:
> [off list, but feel free to put it there if you think appropriate]

It has been discussed to death on the list. Try reading the archives 
some weeks back (always a healthy exercise). You are welcome to 
forward this response to the netfilter-users mailinglist where it 
really belongs. But here we go again..

> My current understanding is that there are two matching conntrack
> records here.  The ESTABLISHED is the conntrack state of one, not
> the tcp state.  If it were the tcp state then this would be assured
> and the other conntrack state would not be unreplied but also
> established, right?
> What's use, btw?

ESTABLISHED is the TCP tracking state, not the conntrack state. The 
TCP tracking state ESTABLISHED is derived from the last packet seen 
on the connection (ACK). This is not to be confused with the 
conntrack "state" ESTABLISHED as used in the state match which is a 
actually packet state describing how THIS packet relates to the 
connection, not a connection state as such.

What you have in conntrack for a TCP session is

a) A set of conntrack state bits, describing the state of the 
connection:

   - SEEN_REPLY
   - ASSURED
   - EXPECTED

b) A TCP tracking state, SYN_RECV, SYN_SENT, ESTABLISHED, TIME_WAIT 
and many more. Also keeps track of TCP window information and other 
crucial information for tracking a TCP session properly.

The TCP tracking state mainly controls the timeouts, but also has some 
influence on what kind of packets conntrack expects to see on the 
connection. The TCP tracking state is not visible in your rulesets.

The conntrack state bit SEEN_REPLY is pretty obvious what is 
represents.

ASSURED is a little trickier. What it represents is that conntrack are 
sure this connection is for real. For TCP this requires that 
conntrack has seen the whole SYN->SYNACK->ACK handshake.

EXPECTED is simply that this connection was expected by a conntrack 
protocol helper, for example FTP data channels expected due to 
commands on their FTP control channel.

Now, if we take a look at the "state" match, it indicates four 
different packet states

   - INVALID
   - NEW
   - ESTABLISHED
   - RELATED

These states has nothing to do with TCP, only packet flows. If 
conntrack cannot identify a session key for the packet or otherwise 
fails to establish connection tracking (i.e. OOM) then the paket 
state is INVALID.

If the packet is the first packet conntrack has seen for a connection 
then it is a NEW (or RELATED where applicable). Note that this may 
well be an ACK or even a FIN, conntrack does not care, it only cares 
that it did not know about the connection before seeing the packet.

If there has been any reply packets (including the first reply packet) 
then the packet state is ESTABLISHED.

The ASSURED bit deserves a little more attention. As you may see from 
the above description of the four "state" types this bit is not used 
in any rulesets. The ASSURED bit is used by conntrack to defeat 
almost all DoS attempts. When conntrack runs out of entries it will 
throw away not ASSURED connections to make room for new connections. 
This very effectively stops almost all DoS attempts in filling the 
connection table.

This is the last time I will answer this question. If anyone feels 
like using any of this in a FAQ then you are welcome. If I see much 
more chatter about these fundamental aspects of netfilter connection 
tracking that really SHOULD be common knowledge to all users (note: I 
am not talking about developers here.. developers looking at 
conntrack MUST understand these aspects) then I don't know what I 
will do, but I suspect I will stop following the netfilter-devel 
traffic. I am damn tired that so few people seem to understand what 
"iptables -m state --state NEW" or "--state ESTABLISED" means.

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-04 21:53                             ` Henrik Nordstrom
@ 2002-07-05  7:08                               ` Don Cohen
  2002-07-05 11:41                                 ` Henrik Nordstrom
  0 siblings, 1 reply; 55+ messages in thread
From: Don Cohen @ 2002-07-05  7:08 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: netfilter-devel

Henrik Nordstrom writes:
 > On Thursday 04 July 2002 20.32, Don Cohen wrote:
 > > [off list, but feel free to put it there if you think appropriate]
 > 
 > It has been discussed to death on the list. Try reading the archives 
 > some weeks back (always a healthy exercise). You are welcome to 
 > forward this response to the netfilter-users mailinglist where it 
 > really belongs. But here we go again..

I did read that, and it lead me to my mistaken conclusion.

 > ESTABLISHED is the TCP tracking state, not the conntrack state. 

I believe that now that I see the code:
ip_conntrack_standalone.c
static unsigned int
print_conntrack(char *buffer, const struct ip_conntrack *conntrack)
{
  ... proto name, proto number, timeout,
        len += print_tuple(buffer + len,
                           &conntrack->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
                           proto);

   The 
 > TCP tracking state ESTABLISHED is derived from the last packet seen 
 > on the connection (ACK). This is not to be confused with the 
 > conntrack "state" ESTABLISHED as used in the state match which is a 
 > actually packet state describing how THIS packet relates to the 
 > connection, not a connection state as such.

This is what I was missing.  I had seen and understood that conntrack
states and tcp states were different.  I had also understood, but
evidently *incorrectly* that TCP states were the states as described 
in rfc793.  The term "established" there means something different
from what you describe above, more like what you call assured below.

 > What you have in conntrack for a TCP session is
 > 
 > a) A set of conntrack state bits, describing the state of the 
 > connection:
 > 
 >    - SEEN_REPLY
 >    - ASSURED
 >    - EXPECTED
 > 
 > b) A TCP tracking state, SYN_RECV, SYN_SENT, ESTABLISHED, TIME_WAIT 
 > and many more. Also keeps track of TCP window information and other 
 > crucial information for tracking a TCP session properly.

And these are the same names (or very similar) that show up in rfc793,
which is why I thought they meant the same thing.  If they don't mean
the same thing it would be nice to know what they do mean.  Especially
since they are displayed from /proc.

 > The TCP tracking state mainly controls the timeouts, but also has some 
 > influence on what kind of packets conntrack expects to see on the 
 > connection. The TCP tracking state is not visible in your rulesets.
 > 
 > The conntrack state bit SEEN_REPLY is pretty obvious what is 
 > represents.
 > 
 > ASSURED is a little trickier. What it represents is that conntrack are 
 > sure this connection is for real. For TCP this requires that 
 > conntrack has seen the whole SYN->SYNACK->ACK handshake.

This is what I thought the tcp state ESTABLISHED meant !

 > EXPECTED is simply that this connection was expected by a conntrack 
 > protocol helper, for example FTP data channels expected due to 
 > commands on their FTP control channel.
 > 
 > 
 > Now, if we take a look at the "state" match, it indicates four 
 > different packet states
 > 
 >    - INVALID
 >    - NEW
 >    - ESTABLISHED
 >    - RELATED
 > 
 > These states has nothing to do with TCP, only packet flows. If 
 > conntrack cannot identify a session key for the packet or otherwise 
 > fails to establish connection tracking (i.e. OOM) then the paket 
 > state is INVALID.
 > 
 > If the packet is the first packet conntrack has seen for a connection 
 > then it is a NEW (or RELATED where applicable). Note that this may 
 > well be an ACK or even a FIN, conntrack does not care, it only cares 
 > that it did not know about the connection before seeing the packet.
 > 
 > If there has been any reply packets (including the first reply packet) 
 > then the packet state is ESTABLISHED.
 > 
 > 
 > The ASSURED bit deserves a little more attention. As you may see from 
 > the above description of the four "state" types this bit is not used 
 > in any rulesets. The ASSURED bit is used by conntrack to defeat 
 > almost all DoS attempts. When conntrack runs out of entries it will 
 > throw away not ASSURED connections to make room for new connections. 
 > This very effectively stops almost all DoS attempts in filling the 
 > connection table.
 > 
 > 
 > This is the last time I will answer this question. If anyone feels 
 > like using any of this in a FAQ then you are welcome. If I see much 
 > more chatter about these fundamental aspects of netfilter connection 
 > tracking that really SHOULD be common knowledge to all users (note: I 
 > am not talking about developers here.. developers looking at 
 > conntrack MUST understand these aspects) then I don't know what I 
 > will do, but I suspect I will stop following the netfilter-devel 
 > traffic. I am damn tired that so few people seem to understand what 
 > "iptables -m state --state NEW" or "--state ESTABLISED" means.

Clearly this stuff should be in the manual.  That's what I read when I
was trying to understand it.  Perhaps in
http://netfilter.samba.org/documentation/HOWTO//packet-filtering-HOWTO-7.html#ss7.3
?

So, returning to the original question:
tcp 6 341427 ESTABLISHED src=165.130.74.240 dst=194.243.69.137 sport=1255 dport=80 [UNREPLIED] src=9.163.211.64 dst=165.130.74.240 sport=3128 dport=1255 use=1

This evidently means that only one packet in this connection has been
seen, but it had the TCP ACK bit on.  Hence "established".  And it was
also NATed, since the reply tuple is not just the reverse of the
original. 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-05  7:08                               ` Don Cohen
@ 2002-07-05 11:41                                 ` Henrik Nordstrom
  2002-07-06  2:49                                   ` Don Cohen
  0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2002-07-05 11:41 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

On Friday 05 July 2002 09.08, Don Cohen wrote:

>  > TCP tracking state ESTABLISHED is derived from the last packet
>  > seen on the connection (ACK). This is not to be confused with
>  > the conntrack "state" ESTABLISHED as used in the state match
>  > which is a actually packet state describing how THIS packet
>  > relates to the connection, not a connection state as such.
>
> This is what I was missing.  I had seen and understood that
> conntrack states and tcp states were different.  I had also
> understood, but evidently *incorrectly* that TCP states were the
> states as described in rfc793.  The term "established" there means
> something different from what you describe above, more like what
> you call assured below.

The TCP tracking states are approximations of RFC793. However, 
conntrack_tcp does not implement TCP, it only tries to derive the 
states of the involved TCP endpoints by looking at the transmitted 
packets.

>  > b) A TCP tracking state, SYN_RECV, SYN_SENT, ESTABLISHED,
>  > TIME_WAIT and many more. Also keeps track of TCP window
>  > information and other crucial information for tracking a TCP
>  > session properly.
>
> And these are the same names (or very similar) that show up in
> rfc793, which is why I thought they meant the same thing.  If they
> don't mean the same thing it would be nice to know what they do
> mean.  Especially since they are displayed from /proc.

They mean mostly the same thing, except for the small fact that 
conntrack is not a TCP endpoint so it has to make some assumptions 
and guesses, and a great deal of simplifications in the state 
transitions... and as it is tracking both TCP endpoints at the same 
time the states may differ slightly..

As conntrack is looking at packets it has to make state transitions is 
based on the packets it sees rather than actions taken by the 
application, and selects the state closest matching the packets it 
sees. To be on the safe side (in terms of tracking) conntrack does 
not enforce a strict TCP state diagram but just tries to gess "If I 
see this kind of packet, what can the current TCP state be".

> So, returning to the original question:
> tcp 6 341427 ESTABLISHED src=165.130.74.240 dst=194.243.69.137
> sport=1255 dport=80 [UNREPLIED] src=9.163.211.64 dst=165.130.74.240
> sport=3128 dport=1255 use=1
>
> This evidently means that only one packet in this connection has
> been seen, but it had the TCP ACK bit on.  Hence "established". 
> And it was also NATed, since the reply tuple is not just the
> reverse of the original.

Correct, except for the minor detail that it can only have been a 
normal ACK packet (ACK,!SYN,!FIN,!RST). As the connection is not 
ASSURED it will be thrown away if there is a shortage of conntrack 
entries.. but as long as there is no shortage of conntrack entries it 
will stay around for a while in case there really is a TCP 
connection.

Based on the port numbers used the NAT rule probably is a REDIRECT of 
HTTP to Squid..

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-05 11:41                                 ` Henrik Nordstrom
@ 2002-07-06  2:49                                   ` Don Cohen
  0 siblings, 0 replies; 55+ messages in thread
From: Don Cohen @ 2002-07-06  2:49 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: netfilter-devel

Henrik Nordstrom writes:

 > The TCP tracking states are approximations of RFC793. However, 
 > conntrack_tcp does not implement TCP, it only tries to derive the 
 > states of the involved TCP endpoints by looking at the transmitted 
 > packets.
I understand that there are limits to what conntrack can do.
However, someone has taken the trouble to compute assured, and
this seems like a *much* better approximation to tcp established
than what is actually presented as the intended approximation. 
I guess now that you can match on assured the right functionality
is there, but the current tcp established still seems like false
advertising.

While I'm at it,
  so what happens in case of NAT?  The tuples are not the same in this
  case...
Yep, that's what I realized later.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-30 19:40                       ` Don Cohen
  2002-07-01  8:07                         ` Henrik Nordstrom
@ 2002-07-02 14:55                         ` Harald Welte
  1 sibling, 0 replies; 55+ messages in thread
From: Harald Welte @ 2002-07-02 14:55 UTC (permalink / raw)
  To: Don Cohen; +Cc: Patrick Schaaf, netfilter-devel

On Sun, Jun 30, 2002 at 12:40:09PM -0700, Don Cohen wrote:
> Clearly one easy defense against one easy attack (as was mentioned in 
> private communication) is that whenever you want to add to a bucket
> that is full, you should feel free to throw out the oldest UNREPLIED
> connection in that bucket.  

if we're going to change the overall strategy into limiting the number
of list-entries per bucket, this is definitely the way to go.  just
deleting unreplied in the case ip_conntrack_max is reached doesn't make
sense anymore.

> If you're interested in preventing attacks that consume connection
> table space devoted to "real" connections I have some ideas, but
> that's another large discussion.  If lots of people out there want to
> hear it I'll send to the list, otherwise perhaps it's better off the
> list.

please send it to the list, this way people can later read it for
reference purpose, and you encourage more people to discuss it.

> On a related subject, I'm worried that UNREPLIED might not mean
> what I think it does.  Your data contains things like:
>  tcp 6 387070 ESTABLISHED src=9.163.211.64 dst=165.130.71.38 sport=3228
>  dport=1301 [UNREPLIED] src=165.130.71.38 dst=9.163.211.64 sport=1301
>  dport=3228 use=1
> How can one half of the connection be established while the other half
> is unreplied?  Is this cause in one direction a syn was sent and a
> syn-ack was received, and that makes it established, while in the
> other direction a syn-ack was sent and nothing more was received so
> it's unreplied?  That doesn't make much sense.  They should both be
> established only if the entire handshake is complete, right?

The problem here is most likely that cat /proc/net/ip_conntrack does not
read out the conntrack entries atomically.  

We really should have ctnetlink for gathering the contents of the
conntrack hash table... I'll try to have it ready soon.

> You were describing unreplied connections that hang around for 5 days.
> When I send some random syn's I see stuff like
>  cat /proc/net/ip_conntrack
>  tcp      6 113 SYN_SENT src=10.0.3.2 dst=10.0.2.3 sport=4096 ...
> I'm guessing the 113 is the timeout, and that it was originally set to
> 120, which still seems way too long to me, but a lot better than
> 5 days.

so now send a single ACK packet and see what happens... This is needed
for connection pickup.

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-28 19:03       ` Don Cohen
  2002-06-28 19:35         ` Patrick Schaaf
@ 2002-07-02 14:40         ` Harald Welte
  2002-07-02 16:32           ` Patrick Schaaf
  1 sibling, 1 reply; 55+ messages in thread
From: Harald Welte @ 2002-07-02 14:40 UTC (permalink / raw)
  To: Don Cohen; +Cc: Patrick Schaaf, netfilter-devel

On Fri, Jun 28, 2002 at 12:03:38PM -0700, Don Cohen wrote:
> Patrick Schaaf writes:
>  > I have real data from an IRC server (one of the german IRCnet hubs), and
>  > from several boxen providing transparent proxy service to dialup customers,
>  > 3000 customers per box peak, running DNS and squid towards the Internet.
>  > With the peak load, the proxy boxen have up to 100000 conntrack entries
>  > (according to slabinfo; /proc/net/ip_conntrack is too lame to show them all).
>  > 
>  > Others will certainly be able to contribute data from production
>  > routing firewalls. I have one of those, too, but they only protect
>  > an ~50 workstation LAN, so these are not that significant.
> 
> If you can put some data on a web page that I could download I'd be
> happy to write and run the program that analyzes it and compares it
> to the results expected for a uniform hash function.

great. I'd definitley want to encourage the two of you to do some
furtherr research on the hashing function.

It's just a pity that I'm currently travelling and away from my test
boxes, so I won't be of much help.

Good work.

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-02 14:40         ` Harald Welte
@ 2002-07-02 16:32           ` Patrick Schaaf
  2002-07-02 16:35             ` Patrick Schaaf
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-07-02 16:32 UTC (permalink / raw)
  To: Harald Welte, Don Cohen, Patrick Schaaf, netfilter-devel

> great. I'd definitley want to encourage the two of you to do some
> furtherr research on the hashing function.

It's good to hear the audience :)

Everybody, have a look at http://bei.bof.de/ex1/. There's a bunch of
pictures, which may become legible together with reading my recent
mails on this thread.

The "cost" listed for each line in each picture, is the sum of the
squares of the bucket list lengths. It's a lame attempt at getting a
single number out of the spread.

The links below the pictures lead you to the bucket statistics
in ASCII text form (output of my cttest program).

The data these pictures are based on, is my sample from some days ago.

Hope this is interesting to somebody :)

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-02 16:32           ` Patrick Schaaf
@ 2002-07-02 16:35             ` Patrick Schaaf
  2002-07-02 16:53               ` Henrik Nordstrom
  2002-07-02 17:48               ` Don Cohen
  0 siblings, 2 replies; 55+ messages in thread
From: Patrick Schaaf @ 2002-07-02 16:35 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: Harald Welte, Don Cohen, netfilter-devel

> Everybody, have a look at http://bei.bof.de/ex1/. There's a bunch of
> pictures, which may become legible together with reading my recent
> mails on this thread.

Ah - maybe I should at least note that the x-axis is bucket length,
on a log-2 scale, The y-axis is the number of occurences, log-10.

later
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-02 16:35             ` Patrick Schaaf
@ 2002-07-02 16:53               ` Henrik Nordstrom
  2002-07-02 17:48               ` Don Cohen
  1 sibling, 0 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2002-07-02 16:53 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

Patrick Schaaf wrote:
> > Everybody, have a look at http://bei.bof.de/ex1/. There's a bunch of
> > pictures, which may become legible together with reading my recent
> > mails on this thread.
>
> Ah - maybe I should at least note that the x-axis is bucket length,
> on a log-2 scale, The y-axis is the number of occurences, log-10.

Any explanation for the odd tails with very long bucket chains in the original 
hash?

In many cases the rest of the graph looks quite smooth not making one expect 
such tail, so I wonder if these may be an artefect of tomething else (bug, or 
incorrect input data).

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-02 16:35             ` Patrick Schaaf
  2002-07-02 16:53               ` Henrik Nordstrom
@ 2002-07-02 17:48               ` Don Cohen
  2002-07-02 18:31                 ` Patrick Schaaf
  1 sibling, 1 reply; 55+ messages in thread
From: Don Cohen @ 2002-07-02 17:48 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: Harald Welte, netfilter-devel

Patrick Schaaf writes:
 > > Everybody, have a look at http://bei.bof.de/ex1/. There's a bunch of
 > > pictures, which may become legible together with reading my recent
 > > mails on this thread.

The only thing that looks really suspicious to me is the lump in
original for large table sizes.

I tried this myself:
 # time ./cttest -u -f ../ip_conntrack.2 -s 131073
 filename: ../ip_conntrack.2
 hash function: original
 number of buckets: 131073
 maximum bucket list length: 24
 total number of entries: 32866
 Count	Length	CT
 1	24	0.07%
 1	19	0.13%
 1	14	0.17%
 2	8	0.22%
 13	7	0.50%
 31	6	1.06%
 130	5	3.04%
 276	4	6.40%
 821	3	13.90%
 3392	2	34.54%
 21515	1	100.00%
 104890	0	100.00%

BTW 5 is already at the low end of the probability, as you see for the
other two hashes.

Then, as usual, show the data for the longest bucket, read it into my
lisp program and take differences ...
  > (loop for x in data do (print (loop for z in (first data) as y in x collect (- z y))))

  (0 0 0 0 0) 
  (0 0 0 2 2) 
  (0 0 0 4 4) 
  (0 0 0 6 6) 
  (0 0 0 8 8) 
  (0 0 0 10 10) 
  (0 0 0 12 12) 
  (0 0 0 14 14) 
  (0 -498354813 738840907 37187 38280) 
  (0 0 0 -6 -6) 
  (0 0 0 -20 -20) 
  (0 0 0 -27 -27) 
  (0 0 0 -2 -2) 
  (0 0 0 -16 -16) 
  (0 0 0 -37 -37) 
  (0 0 0 -12 -12) 
  (0 0 0 -33 -33) 
  (0 0 0 -8 -8) 
  (0 0 0 -29 -29) 
  (0 0 0 -18 -18) 
  (0 0 0 -25 -25) 
  (0 0 0 -4 -4) 
  (0 0 0 -35 -35) 
  (0 0 0 -31 -31) 

I also changed cttest to print hash data (all hex):
sip 100ce87 dip 100ce87 sport d995 dport d895 proto 6 result 3e4f98db
sip 100ce87 dip 100ce87 sport d795 dport d695 proto 6 result 3e4b98d9
sip 100ce87 dip 100ce87 sport d595 dport d495 proto 6 result 3e4798d7
sip 100ce87 dip 100ce87 sport d395 dport d295 proto 6 result 3e4398d5
sip 100ce87 dip 100ce87 sport d195 dport d095 proto 6 result 3e3f98d3
sip 100ce87 dip 100ce87 sport cf95 dport ce95 proto 6 result 3e3b98d1
sip 100ce87 dip 100ce87 sport cd95 dport cc95 proto 6 result 3e3798cf
sip 100ce87 dip 100ce87 sport cb95 dport ca95 proto 6 result 3e3398cd
sip 7e4a82a5 dip b62ec45b sport 9604 dport 5000 proto 6 result a2d7eca
sip 100ce87 dip 100ce87 sport df95 dport de95 proto 6 result 3e5b98e1
sip 100ce87 dip 100ce87 sport ed95 dport ec95 proto 6 result 3e7798ef
sip 100ce87 dip 100ce87 sport f495 dport f395 proto 6 result 3e8598f6
sip 100ce87 dip 100ce87 sport db95 dport da95 proto 6 result 3e5398dd
sip 100ce87 dip 100ce87 sport e995 dport e895 proto 6 result 3e6f98eb
sip 100ce87 dip 100ce87 sport fe95 dport fd95 proto 6 result 3e999900
sip 100ce87 dip 100ce87 sport e595 dport e495 proto 6 result 3e6798e7
sip 100ce87 dip 100ce87 sport fa95 dport f995 proto 6 result 3e9198fc
sip 100ce87 dip 100ce87 sport e195 dport e095 proto 6 result 3e5f98e3
sip 100ce87 dip 100ce87 sport f695 dport f595 proto 6 result 3e8998f8
sip 100ce87 dip 100ce87 sport eb95 dport ea95 proto 6 result 3e7398ed
sip 100ce87 dip 100ce87 sport f295 dport f195 proto 6 result 3e8198f4
sip 100ce87 dip 100ce87 sport dd95 dport dc95 proto 6 result 3e5798df
sip 100ce87 dip 100ce87 sport fc95 dport fb95 proto 6 result 3e9598fe
sip 100ce87 dip 100ce87 sport f895 dport f795 proto 6 result 3e8d98fa


It seems that adding n to the two ports (which is what happens almost
every time here) gives a difference in the hash function of n*131073 !!

static u32 hash_conntrack(struct ct_key *key)
{
  u32 ans;
  ans= ntohl(key->sip + key->dip
		+ key->sport + key->dport
		+ key->proto)
		+ ntohs(key->sport);
  printf("sip %x dip %x sport %x dport %x proto %x result %x\n", 
	 key->sip, key->dip, key->sport, key->dport, key->proto, ans);
  return ans;
}

I think this is related to the ntohl and ntohs.  If all the data were 0
the sum would be 0.  When you add 1 to sport and dport you get something
like 2^16 + 2 ?
which is 2 x our unlucky table size, right?
Or something like that.
And similar things happen with table size 2^n-1.
Basically the problem with this sort of hash function is that you get 
collisions when you have a lot of IP and port pairs that are close to
each other with the wrong sort of correlation.

Have you tried the ABCD suggestion I posted?
Just as a quick test:
static u32 hash_conntrack(struct ct_key *key)
{
 return
   0x47441DFB * key->sip
   + 0x57655A7D * key->dip
   + 0x1C7F1D2D * key->sport
   + 0xDF91D738 * key->dport
   + key->proto;

coefficients chosen at random from [0, 2^32)

# time ./cttest -u -f ../ip_conntrack.2 -s 131073
filename: ../ip_conntrack.2
hash function: original
number of buckets: 131073
maximum bucket list length: 5
total number of entries: 32866
Count	Length	CT
1	5	0.02%
15	4	0.20%
253	3	2.51%
3180	2	21.86%
25682	1	100.00%
101942	0	100.00%

Longest bucket with 131071 is also 5, with 131072 is 36 (as
advertised!).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-02 17:48               ` Don Cohen
@ 2002-07-02 18:31                 ` Patrick Schaaf
  2002-07-02 21:52                   ` cttest-0.1 Patrick Schaaf
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-07-02 18:31 UTC (permalink / raw)
  To: Don Cohen; +Cc: Patrick Schaaf, Harald Welte, netfilter-devel

>  > > Everybody, have a look at http://bei.bof.de/ex1/. There's a bunch of
>  > > pictures, which may become legible together with reading my recent
>  > > mails on this thread.
> 
> The only thing that looks really suspicious to me is the lump in
> original for large table sizes.

Note that due to the y-axis log-10 scale, the tail is more visible; the far
out tail does not consist of many buckets. I agree with your correlation
observation for the original hash. portscans, again?

With an overcrowded bucket table (the ~7000 entry cases), there is almost
no difference between the original hash, and the alternatives. They are
all equally bad (because of the overcrowding). With a wide bucket table
(130000 and 260000 buckets for 65000 entries), the general badness of
the original hash clearly shows in the shape of the curve.

> Have you tried the ABCD suggestion I posted?
> Just as a quick test:
> static u32 hash_conntrack(struct ct_key *key)
> {
>  return
>    0x47441DFB * key->sip
>    + 0x57655A7D * key->dip
>    + 0x1C7F1D2D * key->sport
>    + 0xDF91D738 * key->dport
>    + key->proto;

I now added this to cttest (new hash called 'abcd'), and updated the
pictures on the website accordingly. Looks good for odd table sizes,
and detoriates badly for even table sizes - like you said.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* cttest-0.1
  2002-07-02 18:31                 ` Patrick Schaaf
@ 2002-07-02 21:52                   ` Patrick Schaaf
  2002-07-03  4:15                     ` cttest-0.1 Joakim Axelsson
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-07-02 21:52 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

Hi all,

I have put a tarball at http://bei.bof.de/cttest-0.1.tar.gz
Unpack, look at README, and reproduce the gnuplot pictures I have
mentioned earlier today (at http://bei.bof.de/ex1/)

I would love to see results from other kinds of workloads.

thanks in advance
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-02 21:52                   ` cttest-0.1 Patrick Schaaf
@ 2002-07-03  4:15                     ` Joakim Axelsson
  2002-07-05 15:37                       ` cttest-0.1 Martin Josefsson
  2002-07-05 16:10                       ` cttest-0.1 Joakim Axelsson
  0 siblings, 2 replies; 55+ messages in thread
From: Joakim Axelsson @ 2002-07-03  4:15 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

2002-07-02 23:52:06+0200, Patrick Schaaf <bof@bof.de> ->
> Hi all,
> 
> I have put a tarball at http://bei.bof.de/cttest-0.1.tar.gz
> Unpack, look at README, and reproduce the gnuplot pictures I have
> mentioned earlier today (at http://bei.bof.de/ex1/)
> 
> I would love to see results from other kinds of workloads.
> 
> thanks in advance
>   Patrick

I've tested it on our router data. The router holds one mayor DALnet server
(~7000K users) and one minor EFnet-server (~700 users) behind. Also about
1000 student computers. The DALnet-server was attack today and yesterday
with 70K pps icmp reply. However I filtered that (in mangle PREROUTING) fast
and I don't know how much impact is show in the conntrack. I also have
limiters on state NEW to prevent attacks to fill the conntrack.

Graphs at: http://aaricia.hemmet.chalmers.se/~gozem/cttest/

Stats for the actuall running conntrack is:
$MODPROBE ip_conntrack hashsize=131072 
echo 262144 > /proc/sys/net/ipv4/ip_conntrack_max
(Yes meaning average bucket-length to be 2 not 8).

Conntrack was using about 50K slabs according to /proc/slabinfo

However looking at the stats, I will change hashsize to a primenumber 
fast :-)

I collected this data 5am during absolut low time. I'll try again later
at primetime :-)

One comment: In your script ctplot you have an absolute path to gnuplot
which I guess not everyone has the same as you :-)

-- 
/Joakim Axelsson A.K.A Gozem@EFnet & OPN

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-03  4:15                     ` cttest-0.1 Joakim Axelsson
@ 2002-07-05 15:37                       ` Martin Josefsson
  2002-07-05 16:10                       ` cttest-0.1 Joakim Axelsson
  1 sibling, 0 replies; 55+ messages in thread
From: Martin Josefsson @ 2002-07-05 15:37 UTC (permalink / raw)
  To: Joakim Axelsson; +Cc: Patrick Schaaf, netfilter-devel

Hi,

I've also done some plotting.

about 400 student-machines behind router, it's summer now so there's not
as much traffic as the rest of the year, only ~28k entries.

hashsize is 16384 with default ip_conntrack_max of 131072.

http://gandalf.hjorten.nu/kna-gw/

I added 16384 as size in ctreport as this is what's currently used.

There's a huge diffrence between 16384 and 1638[35], from a maximum
length of 181 down to 9.

-- 
/Martin

Never argue with an idiot. They drag you down to their level, then beat
you with experience.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-03  4:15                     ` cttest-0.1 Joakim Axelsson
  2002-07-05 15:37                       ` cttest-0.1 Martin Josefsson
@ 2002-07-05 16:10                       ` Joakim Axelsson
  2002-07-05 16:54                         ` cttest-0.1 Patrick Schaaf
  1 sibling, 1 reply; 55+ messages in thread
From: Joakim Axelsson @ 2002-07-05 16:10 UTC (permalink / raw)
  To: Patrick Schaaf, netfilter-devel

2002-07-03 06:15:02+0200, Joakim Axelsson <gozem@aaricia.hemmet.chalmers.se> ->
> 
> I collected this data 5am during absolut low time. I'll try again later
> at primetime :-)
> 

Here is a new stat with about 85K entries:
http://aaricia.hemmet.chalmers.se/~gozem/cttest-2002-07-05_1739/

Look at the 131072 original hash (which we are running until i dare reboot the
router :-)

"maximum bucket list length: 9471"


-- 
/Joakim Axelsson A.K.A Gozem@EFnet & OPN

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-05 16:10                       ` cttest-0.1 Joakim Axelsson
@ 2002-07-05 16:54                         ` Patrick Schaaf
  2002-07-05 16:53                           ` cttest-0.1 Joakim Axelsson
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-07-05 16:54 UTC (permalink / raw)
  To: netfilter-devel

Martin, Joakim,

thank you for your cttest runs. They both provide new data points,
coming from a router instead of a server. They confirm that we should
at least strongly avoid even hash bucket counts. In all examples we have
now, both the original hash, and Don's abcd hash detoriate badly.

The interesting new theme, when comparing my "transproxy server"
statistics with your router statistics, is that for the odd
bucket sizes, the original hash does a lot better in the router
case than it does in the server case.

Today, I wrote a new comparison, using rdtsc() to time the CPU usage
of the hash calculation within cttest. I'll make a cttest-0.2 soon,
so everybody can reproduce. However, here's the preliminary result:
on a 1Ghz P-III machine, compiled with march=i686, the original hash
function takes 55ns per call. The abcd hash is better, at 43ns per call.
After inlining the crc32() function, the crc32 hash 185ns per call.
My experience is that a one-longer hash list, when scanned, needs one
more memory roundtrip at about 200ns latency. Maybe it's double that
with the current ip_conntrack layout, but I'll whine about this at
another time. It is clear that even the cost of the crc32 calculation
is negligible if we can save one on the list length on average.

Overall, these results seem to call for this set of actions:

- make the hash bucket count at least individible by 2. This should go
  as a strong suggestion into the documentation, and should be implemented
  in the default initialization code. Anybody volunteering for one or
  the other? I'll do the code part, but I won't do the docs.
- given the clear advantage of the abcd hash over the original hash in
  the server case, make the abcd hash at least a compile time option,
  or better yet, a module load time option. I can do that, too.

For the code, would one or two patch-o-matic type patches against 2.4.18-rc1
be the desired deliverable?

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-05 16:54                         ` cttest-0.1 Patrick Schaaf
@ 2002-07-05 16:53                           ` Joakim Axelsson
  2002-07-06  6:10                             ` cttest-0.1 Andrew Smith
  0 siblings, 1 reply; 55+ messages in thread
From: Joakim Axelsson @ 2002-07-05 16:53 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

2002-07-05 18:54:15+0200, Patrick Schaaf <bof@bof.de> ->
> 
> - make the hash bucket count at least individible by 2. This should go
>   as a strong suggestion into the documentation, and should be implemented
>   in the default initialization code. Anybody volunteering for one or
>   the other? I'll do the code part, but I won't do the docs.

Well, I think we even can force people to use an odd count.

if (hashsize%2 == 0)
	hashsize--;


-- 
/Joakim Axelsson A.K.A Gozem@EFnet & OPN

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-05 16:53                           ` cttest-0.1 Joakim Axelsson
@ 2002-07-06  6:10                             ` Andrew Smith
  2002-07-06  7:12                               ` cttest-0.1 Patrick Schaaf
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Smith @ 2002-07-06  6:10 UTC (permalink / raw)
  To: netfilter-devel

> 2002-07-05 18:54:15+0200, Patrick Schaaf <bof@bof.de> ->
>> 
>> - make the hash bucket count at least individible by 2. This should go
>>   as a strong suggestion into the documentation, and should be
>>   implemented in the default initialization code. Anybody volunteering
>>   for one or the other? I'll do the code part, but I won't do the
>>   docs.
> 
> Well, I think we even can force people to use an odd count.
> 
> if (hashsize%2 == 0)
> 	hashsize--;
> 
> 
> -- 
> /Joakim Axelsson A.K.A Gozem@EFnet & OPN

Just so you don't underestimate a requirement ... :-)

if ((hashsize & 1) == 0)
	hashsize++;

(assume we are checking positive numbers ;-)

-- 
-Cheers
-Andrew

MS ... if only he hadn't been hang gliding!

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-06  6:10                             ` cttest-0.1 Andrew Smith
@ 2002-07-06  7:12                               ` Patrick Schaaf
  2002-07-06 15:23                                 ` cttest-0.1 Patrick Schaaf
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-07-06  7:12 UTC (permalink / raw)
  To: Andrew Smith; +Cc: netfilter-devel

> > if (hashsize%2 == 0)
> > 	hashsize--;
> 
> Just so you don't underestimate a requirement ... :-)

Underestimating is desirable here. User gives upper bound - observe it!

> (assume we are checking positive numbers ;-)

We are, the variable in question, is unsigned.

Actually, I now have this nice <128-byte-icache snippet which computes the
next-smaller-prime, given any 32 bit unsigned. And it's fast.

If the core team doesn't puke, that's what I would like to do at
conntrack module load time.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-06  7:12                               ` cttest-0.1 Patrick Schaaf
@ 2002-07-06 15:23                                 ` Patrick Schaaf
  2002-07-06 21:14                                   ` cttest-0.1 Joakim Axelsson
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick Schaaf @ 2002-07-06 15:23 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

Hi all,

I have released http://bei.bof.de/cttest-0.2.tar.gz. It is now easier
to modify the list of sizes and hashes that should be plotted, and
there's timing code for the hash functions, together with average
used list length analysis.

> Actually, I now have this nice <128-byte-icache snippet which computes the
> next-smaller-prime, given any 32 bit unsigned. And it's fast.

If somebody wants to review this code, have a look at cttest-0.2/prime.c

The pictures at http://bei.bof.de/ex1/ have been updated to the produce
of cttest-0.2.

I'd like to see some runs from Athlon and Pentium4 systems, to compare
the hash function timings.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-06 15:23                                 ` cttest-0.1 Patrick Schaaf
@ 2002-07-06 21:14                                   ` Joakim Axelsson
  2002-07-06 22:41                                     ` cttest-0.1 Joakim Axelsson
  2002-07-06 22:54                                     ` cttest-0.1 Joakim Axelsson
  0 siblings, 2 replies; 55+ messages in thread
From: Joakim Axelsson @ 2002-07-06 21:14 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netfilter-devel

2002-07-06 17:23:42+0200, Patrick Schaaf <bof@bof.de> ->
> Hi all,
> 
> I have released http://bei.bof.de/cttest-0.2.tar.gz. It is now easier
> to modify the list of sizes and hashes that should be plotted, and
> there's timing code for the hash functions, together with average
> used list length analysis.
> 

Me and Martin Josefsson has tested the new cttest-0.2. Martin also took the
idea of using the hash-function that the routingcache uses in Linux:

res = ((key->dip & 0xF0F0F0F0) >> 4) | ((key->dip & 0x0F0F0F0F) << 4);
res ^= key->sip ^ key->tos;
res ^= (res >> 16);
res ^= (res >> 8););

Together we tried to make one that fitted for conntrack. Since we don't
understand completly why the things are done as they are, there might be
some more tuning where can be done to this function:

static u32 hash_rt(struct ct_key *key)
{
        u32 res;

        PER_HASH_TIMER_1(       
                res = ((key->dip & 0xF0F0F0F0) >> 4) | ((key->dip & 0x0F0F0F0F) << 4);
                res ^= key->sip ^ key->proto;
                res ^= key->dport ^ key->sport;
                res ^= (res >> 16);
                res ^= (res >> 8);
        );

        return res;
}

Also with this idea of using power of (^), we tested the abcd hash using
power of instead of just adding the values up:

static u32 hash_abcd_power(struct ct_key *key)
{
        u32 res;

        PER_HASH_TIMER_1(
                res = (0x47441DFB * key->sip)
                ^ (0x57655A7D * key->dip)
                ^ (0x1C7F1D2D * key->sport)
                ^ (0xDF91D738 * key->dport)
                ^ key->proto;
        );
        return res;
}

Results can be viewed at the following url with my 85K entry conntrack:
http://aaricia.hemmet.chalmers.se/~gozem/cttest-0.2/rt/

Conculsions are that the rt-hash is just as fast as abcd and has a good
distribusion, even on the 2^n hash-sizes. This concludes us and with the
fact that we are going to use prime-hashsizes that this rt-hash is the best
alternitive we have right now. rt uses about 21 cycles on my CPU and abcd
about 22. crc32 is up on 134.

The tests was computed on my (Dual) Athlon MP 1800+ (1533Mhz), with ECC/REG
DDR266 memory. Ofcouse only using one CPU.

For the test of your prime.c I can't say more than its not even measureable
on my computer. But I got a very fast CPU.

>time ./prime 3457675589
3457675579
0.000u 0.000s 0:00.00 0.0%	0+0k 0+0io 87pf+0w

-- 
/Joakim Axelsson A.K.A Gozem@EFnet & OPN

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-06 21:14                                   ` cttest-0.1 Joakim Axelsson
@ 2002-07-06 22:41                                     ` Joakim Axelsson
  2002-07-06 23:16                                       ` cttest-0.1 Joakim Axelsson
  2002-07-06 22:54                                     ` cttest-0.1 Joakim Axelsson
  1 sibling, 1 reply; 55+ messages in thread
From: Joakim Axelsson @ 2002-07-06 22:41 UTC (permalink / raw)
  To: Patrick Schaaf, netfilter-devel

2002-07-06 23:14:51+0200, Joakim Axelsson <gozem@aaricia.hemmet.chalmers.se> ->
>
> Me and Martin Josefsson has tested the new cttest-0.2. Martin also took the
> idea of using the hash-function that the routingcache uses in Linux:
> 

Some more on this issue. Martin discovered that changing "res ^= (res >>
16);" into "res ^= (res >> 24);" gave a much better result. Actually better
than any of the hash-functions we have. To the smalest calculation price.
The somewhat bad distribution on 2^n hashsizes has now vanished.

Now to make it even more impossible to attack I added a A,B factor to it:
"res ^= A ^ B;" Same A and B as in abcd*

static u32 hash_rt_ab(struct ct_key *key)
{
 	u32 res;

 	PER_HASH_TIMER_1(       
 	  	res = ((key->dip & 0xF0F0F0F0) >> 4) | ((key->dip & 0x0F0F0F0F) << 4);
 	  	res ^= key->sip ^ key->proto;  
 	  	res ^= key->dport ^ key->sport;
 	  	res ^= 0x47441DFB ^ 0x57655A7D;
 	  	res ^= (res >> 24);
 	  	res ^= (res >> 8);
 	);

 	return res;
}

This addition of A,B only adds 2 more cycles on my CPU (from about 20 to
about 22). In my believe is this hashfunction what we are looking for. Two
random constanst makes sure that a studied attack can't fill just any
bucket, not knowing which one as might be possible with only one random
constant.

I also tested Don's abcdef. Conclusion is that abcdef is about the same as
abcd or worse and takes longer time to caluclate. From about 22 to about 25. On my
test-data. It still has a bad distribution on 2^n hashsizes.

Results are from my calculations:
http://aaricia.hemmet.chalmers.se/~gozem/cttest-0.2/rt_ab/

And from Martin who did test alot of attacking tools against his lab-router:
Look at each index.html to what type of attack it was.

http://gandalf.hjorten.nu/cttest/kna-gw-rt24-rt_ab-abcdef/
http://gandalf.hjorten.nu/cttest/labbrouter-172000-randip-randport-rt24-rt_ab-abcdef/
http://gandalf.hjorten.nu/cttest/labbrouter-172000-same-rt24-rt_ab-abcdef/

The modified cttest.c we have used can be found at:
http://aaricia.hemmet.chalmers.se/~gozem/cttest-0.2/cttest.c

-- 
/Joakim Axelsson A.K.A Gozem@EFnet & OPN

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-06 22:41                                     ` cttest-0.1 Joakim Axelsson
@ 2002-07-06 23:16                                       ` Joakim Axelsson
  2002-07-07  2:30                                         ` cttest-0.1 Svenning Sorensen
  0 siblings, 1 reply; 55+ messages in thread
From: Joakim Axelsson @ 2002-07-06 23:16 UTC (permalink / raw)
  To: Patrick Schaaf, netfilter-devel

2002-07-07 00:41:09+0200, Joakim Axelsson <gozem@aaricia.hemmet.chalmers.se> ->
> 
> Now to make it even more impossible to attack I added a A,B factor to it:
> "res ^= A ^ B;" Same A and B as in abcd*
> 
> static u32 hash_rt_ab(struct ct_key *key)
> {
>  	u32 res;
> 
>  	PER_HASH_TIMER_1(       
>  	  	res = ((key->dip & 0xF0F0F0F0) >> 4) | ((key->dip & 0x0F0F0F0F) << 4);
>  	  	res ^= key->sip ^ key->proto;  
>  	  	res ^= key->dport ^ key->sport;
>  	  	res ^= 0x47441DFB ^ 0x57655A7D;
>  	  	res ^= (res >> 24);
>  	  	res ^= (res >> 8);
>  	);
>  	
>  	return res;
> }
> 

I guess you all are begining to get a little tired of my mails :-). Anyhow
on our little misstake what ^ really does in C (should have known better
:-). I guess I seldom use xor in my c-code.) res ^= 0x47441DFB ^ 0x57655A7D
is kinda of useless then. So I changed it into:

res = ((key->dip & 0xF0F0F0F0) >> 4) | ((key->dip & 0x0F0F0F0F) << 4);
res ^= key->sip ^ key->proto;
res ^= key->dport ^ key->sport;
res ^= 0x47441DFB;
res ^= (res >> 24);
res ^= (res >> 8);
res ^= 0x57655A7D;

Results:
http://aaricia.hemmet.chalmers.se/~gozem/cttest-0.2/rt_ab_xor/

New cttest.c (same url as before):
http://aaricia.hemmet.chalmers.se/~gozem/cttest-0.2/cttest.c

-- 
/Joakim Axelsson A.K.A Gozem@EFnet & OPN

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-06 23:16                                       ` cttest-0.1 Joakim Axelsson
@ 2002-07-07  2:30                                         ` Svenning Sorensen
  2002-07-07  4:23                                           ` cttest-0.1 Joakim Axelsson
  0 siblings, 1 reply; 55+ messages in thread
From: Svenning Sorensen @ 2002-07-07  2:30 UTC (permalink / raw)
  To: Joakim Axelsson, Patrick Schaaf, netfilter-devel

At 01:16 07-07-2002, Joakim Axelsson wrote:
>I guess you all are begining to get a little tired of my mails :-). Anyhow
>on our little misstake what ^ really does in C (should have known better
>:-). I guess I seldom use xor in my c-code.) res ^= 0x47441DFB ^ 0x57655A7D
>is kinda of useless then. So I changed it into:
>
>res = ((key->dip & 0xF0F0F0F0) >> 4) | ((key->dip & 0x0F0F0F0F) << 4);
>res ^= key->sip ^ key->proto;
>res ^= key->dport ^ key->sport;
>res ^= 0x47441DFB;
>res ^= (res >> 24);
>res ^= (res >> 8);
>res ^= 0x57655A7D;

Heh, still as useless... :)
Two random 32 bit values xor'ed gives, well, 32 random bits.
And you don't get more randomness into the first value by shifting and
xor'ing with itself.

But actually, I don't see how it buys you anything to xor with a random value.
It doesn't effect the distribution of the hash values, so an attacker can
still attack a single bucket. He just doesn't know which one, but I bet he
doesn't care :)

A multiplication would be slightly better (and more expensive, cpu-wise),
since that would make it harder to predict which input bits change which
output bits. Furthermore, you need to perform the two prf functions on
different parts of the input, so any variation in one part of the input
can't be easily nullified by the inverse variation in the other part.

Maybe it would also make sense to distribute the source & destination ports
across all 32 bits, and similarly mix the high and low parts better,
like this (completely untested):

        res = ((key->dip & 0xF0F0F0F0) >> 4) | ((key->dip & 0x0F0F0F0F) << 4);
        res ^= (res * secret_value_a);
        res ^= key->sip ^ key->proto;
        res ^= (key->dport << 16) ^ key->sport;
        res ^= (res * secret_value_b);
        res ^= ((res >> 25) | (res << 7));

The compiler should optimize the two last shifts into a single rotate instruction.

Svenning

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-07  2:30                                         ` cttest-0.1 Svenning Sorensen
@ 2002-07-07  4:23                                           ` Joakim Axelsson
  2002-07-07  5:46                                             ` cttest-0.1 Joakim Axelsson
  0 siblings, 1 reply; 55+ messages in thread
From: Joakim Axelsson @ 2002-07-07  4:23 UTC (permalink / raw)
  To: Svenning Sorensen; +Cc: Patrick Schaaf, netfilter-devel

2002-07-07 04:30:00+0200, Svenning Sorensen <sss@sss.dnsalias.net> ->
> At 01:16 07-07-2002, Joakim Axelsson wrote:
> >I guess you all are begining to get a little tired of my mails :-). Anyhow
> >on our little misstake what ^ really does in C (should have known better
> >:-). I guess I seldom use xor in my c-code.) res ^= 0x47441DFB ^ 0x57655A7D
> >is kinda of useless then. So I changed it into:
> >
> >res = ((key->dip & 0xF0F0F0F0) >> 4) | ((key->dip & 0x0F0F0F0F) << 4);
> >res ^= key->sip ^ key->proto;
> >res ^= key->dport ^ key->sport;
> >res ^= 0x47441DFB;
> >res ^= (res >> 24);
> >res ^= (res >> 8);
> >res ^= 0x57655A7D;
> 
> Heh, still as useless... :)
> Two random 32 bit values xor'ed gives, well, 32 random bits.
> And you don't get more randomness into the first value by shifting and
> xor'ing with itself.
> 

Thx for the input!

> But actually, I don't see how it buys you anything to xor with a random value.
> It doesn't effect the distribution of the hash values, so an attacker can
> still attack a single bucket. He just doesn't know which one, but I bet he
> doesn't care :)
> 

Well, since the value is later % with the hashsize it might matters somewhat,
but not as much as a multiplication. Anyhow, you are right.

> A multiplication would be slightly better (and more expensive, cpu-wise),
> since that would make it harder to predict which input bits change which
> output bits. Furthermore, you need to perform the two prf functions on
> different parts of the input, so any variation in one part of the input
> can't be easily nullified by the inverse variation in the other part.
> 
> Maybe it would also make sense to distribute the source & destination ports
> across all 32 bits, and similarly mix the high and low parts better,
> like this (completely untested):
> 

Yepp, the spread of ports can be good.

>         res = ((key->dip & 0xF0F0F0F0) >> 4) | ((key->dip & 0x0F0F0F0F) << 4);
>         res ^= (res * secret_value_a);
>         res ^= key->sip ^ key->proto;
>         res ^= (key->dport << 16) ^ key->sport;
>         res ^= (res * secret_value_b);
>         res ^= ((res >> 25) | (res << 7));
> 
> The compiler should optimize the two last shifts into a single rotate instruction.
>

Why this rotate? 

I tested your code and concluded two things:

1. res ^= (res * value) is not good. You can't bring "parts" of the res it
self to xor with. A plain "res *= value" was better.

2. res ^= ((res >> 25) | (res << 7)); did not do the same fine thing as our
>> 24 followed by >> 8. It gave a second "hill" as a tail. Not giving a fine
line.

I also noticed that its using dip in the first shifts. Now we (me and
Martin) only got router conntrack-data where dip is a wide range. But on an
end machine/server there will to 99% only be one dip present. Considering
packets coming TO the machine. It will take more packets than it gives on an
attack. Remember also the orignal routing cache wants to know where the
packet are heading. We want to know where it came from. If anyone has more
ideas on this one I'll happly hear them. So, I swaped placed on dip and sip.
The swaping didn't seams to affect anything on my input data. Ending in this
function:

res = ((key->sip & 0xF0F0F0F0) >> 4) | ((key->sip & 0x0F0F0F0F) << 4);
res *= 0x47441DFB;
res ^= key->dip ^ key->proto;
res ^= (key->dport << 16) ^ key->sport;
res *= 0x57655A7D;
res ^= (res >> 24);
res ^= (res >> 8);

I think this solution will make the two random constanst enough independent
to make it impossible to attack one, any, bucket. Better solutions on the
two constants are welcome.

This hash takes about 25 cycles on my CPU compared to the plain first
rt-hash which takes about 20 cycles. The first rt_ab I made takes about 22
cyles.

Results on:
http://aaricia.hemmet.chalmers.se/~gozem/cttest-0.2/rt_new

I called your idea rt_sven, my new idea rt_new. I only displayed the 4
rt-hashes and random to get some resolutions in the grafs. abcd and original
only kills them with long tails.

-- 
/Joakim Axelsson A.K.A Gozem@EFnet & OPN

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-07  4:23                                           ` cttest-0.1 Joakim Axelsson
@ 2002-07-07  5:46                                             ` Joakim Axelsson
  2002-07-07 11:00                                               ` cttest-0.1 Henrik Nordstrom
  0 siblings, 1 reply; 55+ messages in thread
From: Joakim Axelsson @ 2002-07-07  5:46 UTC (permalink / raw)
  To: Svenning Sorensen, Patrick Schaaf, netfilter-devel

2002-07-07 06:23:41+0200, Joakim Axelsson <gozem@aaricia.hemmet.chalmers.se> ->
> 
> res = ((key->sip & 0xF0F0F0F0) >> 4) | ((key->sip & 0x0F0F0F0F) << 4);
> res *= 0x47441DFB;
> res ^= key->dip ^ key->proto;
> res ^= (key->dport << 16) ^ key->sport;
> res *= 0x57655A7D;
> res ^= (res >> 24);
> res ^= (res >> 8);
> 
> I think this solution will make the two random constanst enough independent
> to make it impossible to attack one, any, bucket. Better solutions on the
> two constants are welcome.
> 

It can still be calculated backwards to attack i think. However leting one
constant multiply (the first) and one xor should make it harder to attack
one any bucket. It even gives a slightly better distribution and removing
one muliplication is good. I also made an or of the ports. Makes no
difference but in reading.

res = ((key->sip & 0xF0F0F0F0) >> 4) | ((key->sip & 0x0F0F0F0F) << 4);
res *= 0x47441DFB;
res ^= key->dip ^ key->proto;
res ^= ((key->dport << 16) | key->sport);
res ^= 0x57655A7D;
res ^= (res >> 24);
res ^= (res >> 8); 

Now someone else have to take a look at it. I digged my head into a corner.

-- 
/Joakim Axelsson A.K.A Gozem@EFnet & OPN

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-07  5:46                                             ` cttest-0.1 Joakim Axelsson
@ 2002-07-07 11:00                                               ` Henrik Nordstrom
  0 siblings, 0 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2002-07-07 11:00 UTC (permalink / raw)
  To: Joakim Axelsson, netfilter-devel

On Sunday 07 July 2002 07.46, Joakim Axelsson wrote:
> res = ((key->sip & 0xF0F0F0F0) >> 4) | ((key->sip & 0x0F0F0F0F) <<
> 4); res *= 0x47441DFB;
> res ^= key->dip ^ key->proto;
> res ^= ((key->dport << 16) | key->sport);
> res ^= 0x57655A7D;
> res ^= (res >> 24);
> res ^= (res >> 8);
>
>
> Now someone else have to take a look at it. I digged my head into a
> corner.

Hmm...

> res ^= key->dip ^ key->proto;
> res ^= ((key->dport << 16) | key->sport);

can easily nullify itself.. here you are just xor:ing the destination 
IP with the source and destination ports, which makes a highly 
predictable hash pattern..

To always hit the same hash bucket you just need to keep the source IP 
constant, and the result of the above XOR constant..

> res ^= 0x57655A7D;
> res ^= (res >> 24);
> res ^= (res >> 8);

And this is twofold.. it both obfuscates the bits and probably makes 
it likelier an attacker can find bits that nullify eachother, 
especially so if the hash size is not prime (or at a minimum odd).. 
These 5 xor statements xor:s quite many sources into the lower bits, 
always in the same order.. I am pretty sure you will find very 
interesting iterference patterns if you unwind these xors.

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cttest-0.1
  2002-07-06 21:14                                   ` cttest-0.1 Joakim Axelsson
  2002-07-06 22:41                                     ` cttest-0.1 Joakim Axelsson
@ 2002-07-06 22:54                                     ` Joakim Axelsson
  1 sibling, 0 replies; 55+ messages in thread
From: Joakim Axelsson @ 2002-07-06 22:54 UTC (permalink / raw)
  To: Patrick Schaaf, netfilter-devel

2002-07-06 23:14:51+0200, Joakim Axelsson <gozem@aaricia.hemmet.chalmers.se> ->
>
> Also with this idea of using power of (^), we tested the abcd hash using
> power of instead of just adding the values up:
> 
> static u32 hash_abcd_power(struct ct_key *key)
> {
>         u32 res;
> 
>         PER_HASH_TIMER_1(
>                 res = (0x47441DFB * key->sip)
>                 ^ (0x57655A7D * key->dip)
>                 ^ (0x1C7F1D2D * key->sport)
>                 ^ (0xDF91D738 * key->dport)
>                 ^ key->proto;
>         );
>         return res;
> }
> 

Oops. ^ is xor in C, not "power of". Hehe, thinking in too much mathemaical
terms here. Guess that tells you that we really dont know what we're doing.
Anyhow. Our rt_ab seams to be very good, thats what matters :-)

-- 
/Joakim Axelsson A.K.A Gozem@EFnet & OPN

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-06-27 20:46 conntrack performance/DoS formula Don Cohen
  2002-06-28  6:23 ` Patrick Schaaf
@ 2002-07-02 14:38 ` Harald Welte
  1 sibling, 0 replies; 55+ messages in thread
From: Harald Welte @ 2002-07-02 14:38 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

On Thu, Jun 27, 2002 at 01:46:28PM -0700, Don Cohen wrote:
> 
> I had mentioned that I was hoping to remember/find the appropriate
> formula for the probability of a hash bucket reaching its limit.
> I got out my prob-stat book from I-won't-mention-how-many years ago.

[thanks for your detailad analysis, I will try to look at all the
details ASAP].

> My guess is that the amount of time required to do 20 iterations
> comparing connection data with elements of a list in a bucket is
> very small compared to the amount of time otherwise spent on a
> packet.  On the other hand, a hash bucket is also pretty cheap.
> The data for a connection takes much more space than 2 hash buckets, 
> or even 10.  And with 10L buckets the chance of overflow at M=10
> is only 1E-19, for M=20 it's somewhere around 1E-40.

ok the question is now to test this in the real world, which can be
fairly easily done, as the hashsize is configurable.

> So, there you have it.

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)

^ permalink raw reply	[flat|nested] 55+ messages in thread

[parent not found: <20020701121404.B78724512@lists.samba.org>]

* Re: conntrack performance/DoS formula
       [not found] <20020701121404.B78724512@lists.samba.org>
@ 2002-07-01 21:30 ` Don Cohen
  2002-07-02  6:05   ` Patrick Schaaf
  0 siblings, 1 reply; 55+ messages in thread
From: Don Cohen @ 2002-07-01 21:30 UTC (permalink / raw)
  To: netfilter-devel, bof


Patrick Schaaf writes:
 >  > You can now use '-c NUMBER' to set a seed for the CRC32 hash. You can now
 >  > use '-C' (without parameter) to set a random seed for the CRC32 hash.
 >  > These options are for testing the idea of "attack compensation by
 >  > inserting local variance". I'd like to hear opinions on this.

 > At a higher level, though, I don't see much value in preventing an
 > attacker from filling a single bucket when he can almost as easily do
 > much more harm by filling the entire conntrack database!

On the other hand, if we do have a way to prevent an attacker from
filling the conntrack database, this does become interesting again.
Assuming, of course, the attacker can't skew the crc hash results by
feeding you particular data without knowing your seed.  I still hope
to hear from some experts out there on this subject.

And now, since I think this is a useful approach it's worth while to
mention the following:

  static u32 hash_crc32(struct ct_key *key)
  {
	  u32 crc = crc32_seed;
	  u32 v32;
	  u16 v16;
	  crc = crc32(crc, (const char *) &key->proto, sizeof(key->proto));
	  v32 = ntohl(key->sip);
	  crc = crc32(crc, (const char *) &v32, sizeof(v32));
	  v32 = ntohl(key->dip);
	  crc = crc32(crc, (const char *) &v32, sizeof(v32));
	  v16 = ntohs(key->sport);
	  crc = crc32(crc, (const char *) &v16, sizeof(v16));
	  v16 = ntohs(key->dport);
	  crc = crc32(crc, (const char *) &v16, sizeof(v16));
	  return crc;
  }

Isn't that the same (modulo order of data) as this?
     return crc32(crc, (const char *) key, sizeof(struct ct_key)); 

On my 200MHz machine
 your crc    my change	hash_conntrack
 1.5usec     1.1usec    .2usec		   gcc -o cttest cttest.c
 1.1usec     .53usec    .14usec		   gcc -O9 -o cttest cttest.c

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: conntrack performance/DoS formula
  2002-07-01 21:30 ` Don Cohen
@ 2002-07-02  6:05   ` Patrick Schaaf
  0 siblings, 0 replies; 55+ messages in thread
From: Patrick Schaaf @ 2002-07-02  6:05 UTC (permalink / raw)
  To: Don Cohen; +Cc: netfilter-devel

Hi Don,

> And now, since I think this is a useful approach it's worth while to
> mention the following:

I'm doing some gnuplot pictures right now, and crc32 really does come
out as preferrable regarding list length distributions for varying
bucket sizes. I'll post a link to a presentation of that analysis
later today.

[regarding the hash_crc32() in my test program]
> 
> Isn't that the same (modulo order of data) as this?
>      return crc32(crc, (const char *) key, sizeof(struct ct_key)); 

The problem with this is not so much order of data, which would be
the same for all calls on a local machine (i.e. at least consistent).
The problem is possible padding in the structure. While such padding
in the real in-kernel tuple data structure would be space wastage,
it _is_ possible in principle for a C compiler to introduce such padding.

> On my 200MHz machine
>  your crc    my change	hash_conntrack
>  1.5usec     1.1usec    .2usec		   gcc -o cttest cttest.c
>  1.1usec     .53usec    .14usec		   gcc -O9 -o cttest cttest.c

I'm sure that there's much more speed to be had over the crc32() function
I borrowed from zlibc. That one is a general N-byte crc32, and here we
always have a fixed amount of bytes. Thus, for production, we would look
into completely unrolling the crc32 loop. I'll look into microoptimizating
this if and only if it is decided to have crc32 as a (possibly alternative)
hash function for conntrack. It's not worth the effort for the test program
alone.

Thanks nevertheless for bringing this up; it is an important issue when
we want to deploy crc32.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2002-07-07 11:00 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-06-27 20:46 conntrack performance/DoS formula Don Cohen
2002-06-28  6:23 ` Patrick Schaaf
2002-06-28 17:53   ` Don Cohen
2002-06-28 18:36     ` Patrick Schaaf
2002-06-28 19:03       ` Don Cohen
2002-06-28 19:35         ` Patrick Schaaf
2002-06-28 19:39           ` Patrick Schaaf
2002-06-28 21:10           ` Don Cohen
2002-06-28 21:28             ` Patrick Schaaf
2002-06-28 21:49               ` Don Cohen
2002-06-28 22:30               ` Don Cohen
2002-06-29  9:03                 ` Patrick Schaaf
2002-06-29 16:48                   ` Don Cohen
2002-06-29 17:22                     ` Patrick Schaaf
2002-07-05 13:47                       ` Harald Welte
2002-06-29 17:33                     ` Patrick Schaaf
2002-06-29  9:29                 ` Patrick Schaaf
2002-06-29 12:07                 ` Patrick Schaaf
2002-06-29 12:34                   ` Patrick Schaaf
2002-06-30  8:31                     ` Patrick Schaaf
2002-06-30 19:40                       ` Don Cohen
2002-07-01  8:07                         ` Henrik Nordstrom
2002-07-01 17:49                           ` Don Cohen
2002-07-02  7:58                             ` Henrik Nordstrom
     [not found]                           ` <15652.38084.704660.234319@isis.cs3-inc.com>
2002-07-04 21:53                             ` Henrik Nordstrom
2002-07-05  7:08                               ` Don Cohen
2002-07-05 11:41                                 ` Henrik Nordstrom
2002-07-06  2:49                                   ` Don Cohen
2002-07-02 14:55                         ` Harald Welte
2002-07-02 14:40         ` Harald Welte
2002-07-02 16:32           ` Patrick Schaaf
2002-07-02 16:35             ` Patrick Schaaf
2002-07-02 16:53               ` Henrik Nordstrom
2002-07-02 17:48               ` Don Cohen
2002-07-02 18:31                 ` Patrick Schaaf
2002-07-02 21:52                   ` cttest-0.1 Patrick Schaaf
2002-07-03  4:15                     ` cttest-0.1 Joakim Axelsson
2002-07-05 15:37                       ` cttest-0.1 Martin Josefsson
2002-07-05 16:10                       ` cttest-0.1 Joakim Axelsson
2002-07-05 16:54                         ` cttest-0.1 Patrick Schaaf
2002-07-05 16:53                           ` cttest-0.1 Joakim Axelsson
2002-07-06  6:10                             ` cttest-0.1 Andrew Smith
2002-07-06  7:12                               ` cttest-0.1 Patrick Schaaf
2002-07-06 15:23                                 ` cttest-0.1 Patrick Schaaf
2002-07-06 21:14                                   ` cttest-0.1 Joakim Axelsson
2002-07-06 22:41                                     ` cttest-0.1 Joakim Axelsson
2002-07-06 23:16                                       ` cttest-0.1 Joakim Axelsson
2002-07-07  2:30                                         ` cttest-0.1 Svenning Sorensen
2002-07-07  4:23                                           ` cttest-0.1 Joakim Axelsson
2002-07-07  5:46                                             ` cttest-0.1 Joakim Axelsson
2002-07-07 11:00                                               ` cttest-0.1 Henrik Nordstrom
2002-07-06 22:54                                     ` cttest-0.1 Joakim Axelsson
2002-07-02 14:38 ` conntrack performance/DoS formula Harald Welte
     [not found] <20020701121404.B78724512@lists.samba.org>
2002-07-01 21:30 ` Don Cohen
2002-07-02  6:05   ` Patrick Schaaf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.