performance issues (nat / conntrack)

* performance issues (nat / conntrack)
@ 2002-06-20 19:48 Jean-Michel Hemstedt
  2002-06-22 16:51 ` Harald Welte
  2002-06-25 10:35 ` Jozsef Kadlecsik
  0 siblings, 2 replies; 33+ messages in thread
From: Jean-Michel Hemstedt @ 2002-06-20 19:48 UTC (permalink / raw)
  To: netfilter-devel

dear netdevels,

I'm doing some TCP benchmarks on a netfilter-enabled box and noticed a
huge and surprising performance drop when loading the iptable_nat module.

- ip_conntrack of course also loads the system, but with plenty of memory
and a large bucket count, the problem can be solved. The big issue with
ip_conntrack is the state timeouts: with the default ones it simply kills
the system and drops all the traffic, because the ip_conntrack table
quickly becomes full, and there seems to be no way to recover from that
situation... Keeping unused entries (time_close) in the cache for even
1 minute is really not suitable for configurations handling a (relatively)
large number of connections/s.
o The cumulative effect should be reconsidered.
o Are there ways/plans to tune the timeouts dynamically? and what are
  the valid/invalid ranges of timeouts?
o looking at the code, it seems that one timer is started per tuple...
  wouldn't it be more efficient to have a single periodic callback
  scanning all or part of the table for aged entries?

- The annoying point is iptable_nat: normally the number of entries in
the nat table is much lower than the number of entries in the conntrack
table. So even if the hash function itself could be less efficient than
the ip_conntrack one (because it takes fewer arguments: src+dst+proto),
the load of nat should be much lower than the load of conntrack.
o So... why is it the opposite??
o Are there ways to tune NAT performance?

- Another (old) question: why are conntrack or nat active when there are
no rules configured (using them or not)? If not fixed, it should at
least be documented... Somebody doing "iptables -t nat -L" takes the risk
of killing their system if it's already under load... In the same spirit,
iptables -F should unload all unused modules (the ip_tables module
doesn't hurt). Just one quick fix: replace the 'iptables' executable with
an 'iptables' script calling the exe (located somewhere else) and
doing an rmmod at the end...
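
The single periodic callback suggested in the timeout item above could look
roughly like the following userspace sketch. It is only an illustration of
the idea, not the kernel's code: struct entry, struct table, table_insert()
and sweep_expired() are hypothetical names, entries carry an absolute expiry
time instead of owning a timer, and each pass visits a bounded number of
buckets so that the cost per call stays predictable.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified entry; not the kernel's struct ip_conntrack. */
struct entry {
    struct entry *next;
    unsigned long expires;      /* absolute expiry time, in ticks */
};

struct table {
    struct entry **buckets;     /* nbuckets chain heads */
    size_t nbuckets;
    size_t cursor;              /* next bucket the sweeper will visit */
};

/* Push a new entry onto the chain of `bucket`. Returns 0, or -1 if
 * the bucket index is out of range or the allocation fails. */
int table_insert(struct table *t, size_t bucket, unsigned long expires)
{
    struct entry *e;

    if (bucket >= t->nbuckets)
        return -1;
    e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->expires = expires;
    e->next = t->buckets[bucket];
    t->buckets[bucket] = e;
    return 0;
}

/* One periodic pass: visit up to `budget` buckets starting at the cursor,
 * unlink and free every entry whose expiry time has passed, and return
 * the number of entries removed. Clock wraparound is ignored here. */
size_t sweep_expired(struct table *t, unsigned long now, size_t budget)
{
    size_t removed = 0;

    if (t->nbuckets == 0)
        return 0;
    for (size_t i = 0; i < budget; i++) {
        struct entry **link = &t->buckets[t->cursor];

        while (*link != NULL) {
            struct entry *e = *link;

            if (e->expires <= now) {
                *link = e->next;    /* unlink the aged entry */
                free(e);
                removed++;
            } else {
                link = &e->next;
            }
        }
        t->cursor = (t->cursor + 1) % t->nbuckets;
    }
    return removed;
}
```

A caller would invoke sweep_expired() from one recurring timer, choosing
`budget` so that the whole table is covered within the shortest timeout.
The trade-off is that an entry can outlive its timeout by up to one full
scan period, whereas a per-entry timer fires at the exact expiry time.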

comments are welcome;


here is my test bed:

tested target:
 -kernel 2.4.18 + non_local_bind + small conntrack timeouts...
 -PIII~500MHz, RAM=256MB
 -2*100Mb/s NIC

The target acts as a forwarding gateway between a load generator client
running httperf, and an apache proxy serving cached pages. 100Mb/s NICs
and request/response sizes ensure that bandwidth and packet collisions are
not an issue.

Since in my test each connection is ephemeral (<10ms), I recompiled the
kernel with very short conntrack timeouts (i.e. 1 sec for close_wait,
and about 60 sec for established!). This was also the only way to restrict
the conntrack hash table size (given my RAM) and avoid excessive hash
collisions. Another limitation comes from my load generator creating traffic
from one source to one destination IP address, with only source port
variation (but given my configured hash table size and the hash function
itself it shouldn't have been an issue).

results are averages from procinfo -n10 [d]

test results:

1) target = forwarding only (no iptables module or rule)
 -  rate          : 100        conn/s (=request-response/s)
 -> CPU load      : 0%         system
 -> context       : 7          context/s
 -> irq(eth0/eth1): 0.9 / 0.9  kpps   (# of packet/sec = #irq/s)

 -  rate          : 500        conn/s
 -> CPU load      : 10%        system
 -> context       : 18->100    context/s (varying!)
 -> irq(eth0/eth1): 4.4 / 4.4  kpps

 -  rate (max)    : 1050       conn/s (max from my load generator)
 -> CPU load      : 25%        system
 -> context       : 1000       context/s
 -> irq(eth0/eth1): 10 / 10    kpps

2) (1) + insmod ip_conntrack 16384 (no rules)

 -  rate          : 100        conn/s
 -> CPU load      : 0.8%       system
 -> context       : 7          context/s
 -> irq(eth0/eth1): 0.9 / 0.9  kpps
 -> conntrack size: 970        concurrent entries

 -  rate          : 250        conn/s
 -> CPU load      : 10%        system
 -> context       : 12         context/s
 -> irq(eth0/eth1): 2.2 / 2.2  kpps
 -> conntrack size: 2390       concurrent entries

 -  rate          : 500        conn/s
 -> CPU load      : 30-70%     system  (varying)
 -> context       : 45-90      context/s
 -> irq(eth0/eth1): 4 / 4      kpps
 -> conntrack size: 4770       concurrent entries

3) (2) + iptables -t nat -L  [=iptable_nat] (no rules)
 -  rate          : 100        conn/s
 -> CPU load      : 1%         system
 -> context       : 8          context/s
 -> irq(eth0/eth1): 0.9 / 0.9  kpps
 -> conntrack size: 970        concurrent entries

 -  rate          : 250        conn/s
 -> CPU load      : 40%        system
 -> context       : 20         context/s
 -> irq(eth0/eth1): 2.2 / 2.2  kpps
 -> conntrack size: 2390       concurrent entries

 -  rate  (max)   : 420        conn/s (all failed)
 -> CPU load      : 97%        system
 -> context       : 28         context/s
 -> irq(eth0/eth1): 3.1 / 4.1  kpps
 -> conntrack size: 4050       concurrent entries

 -  rate (killing): [500]->0   conn/s (all failed)
 -> CPU load      : 100%       system (no response)
 -> context       : ?          context/s
 -> irq(eth0/eth1): ?          kpps
 -> conntrack size: 10500???   concurrent entries

other results with active rules (i.e. REDIRECT) depend on the load
generated by the local process handling the traffic, and are thus not
relevant (FYI: max conn/s < 200 with one process handling the REDIRECTed
traffic)

kr,
_______________________________________________________________________

-jmhe-

* Re: performance issues (nat / conntrack)
  2002-06-20 19:48 performance issues (nat / conntrack) Jean-Michel Hemstedt
@ 2002-06-22 16:51 ` Harald Welte
  2002-06-23  9:15   ` Jean-Michel Hemstedt
  2002-06-25 10:35 ` Jozsef Kadlecsik
  1 sibling, 1 reply; 33+ messages in thread
From: Harald Welte @ 2002-06-22 16:51 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: netfilter-devel

On Thu, Jun 20, 2002 at 09:48:27PM +0200, Jean-Michel Hemstedt wrote:
> dear netdevels,
> 
> I'm doing some tcp benches on a netfilter enabled box and noticed
> huge and surprising perf decrease when loading iptable_nat module. 

Sounds as expected.

> - ip_conntrack is of course also loading the system, but with huge memory
> and a large bucket size, the problem can be solved. The big issue with
> ip_conntrack are the state timeouts: it simply kill the system and drops
> all the traffic with the default ones, because the ip_conntrack table
> becomes quickly full, and it seems that there is no way to recover from
> that  situation... Keeping unused entries (time_close) even 1 minute in
> the cache is really not suitable for configurations handling (relatively)
> large number of connections/s. 

what is a 'relatively' large number of connections? I've seen a couple
of netfilter firewalls dealing with 200000+ tracked connections.

> o The cumulative effect should be reconsidered.

could you please try to explain what you mean?

> o Are there ways/plans to tune the timeouts dynamically? and what are
>   the valid/invalid ranges of timeouts?

No, see the mailing list archives for the reason why.

> o looking at the code, it seems that one timer is started by tuple...
>   wouldn't it be more efficient to have a unique periodic callback
>   scanning the whole or part of the table for aged entries?

I think somebody (Martin Josefsson?) is currently looking into optimizing

> - The annoying point is iptable_nat: normally the number of entries in
> the nat table is much lower than the number of entries in the conntrack
> table. So even if the hash function itself could be less efficient than
> the ip_conntrack one (because it takes less arguments: src+dst+proto),
> the load of nat, should be much lower than the load of conntrack.
> o So... why is it the opposite??

? What 'nat table' are  you talking about?  Do you understand how NAT
works and how it interacts with connection tracking?

> o Are there ways to tune the nat performances?

no. NAT (and esp. NAT performance) is not a very strong point of netfilter.
Everybody agrees that NAT is evil and it should be avoided in all circumstances.
Rusty didn't want to become NAT/masquerading maintainer in the first place,
but rather concentrate on packet filtering.

The NAT subsystem has a number of shortcomings, some of which have been
fixed, others still remain.

> - Another (old) question: why are conntrack or nat active when there are
> no rules configured (using them or not)? If not fixed it should be at
> least documented... 

This is standard behaviour.  Does your network driver unload if you 
'ifconfig down' an interface?  Does a TC qdisc module unload if you 
delete all instances of the queue?

conntrack is _not_ related/intermingled with iptables at all.  Conntrack
does not know if anybody is using conntrack state in the system.

> Somebody doing "iptables -t nat -L" takes the risk
> of killing its system if it's already under load... 

?  Please explain why. I see no reason for this.

> In the same spirit,
> iptables -F should unload all unused modules (the ip_tables modules 
> doesn't hurt). Just one quick fix: replace the 'iptables' executable by
> one 'iptables' script calling the exe (located somewhere else) and 
> doing an rmmod at the end...

no. this is considered a feature. The current [and past] behaviour is
intended by design.

> -jmhe-

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M+ 
V-- PS++ PE-- Y++ PGP++ t+ 5-- !X !R tv-- b+++ !DI !D G+ e* h--- r++ y+(*)

* Re: performance issues (nat / conntrack)
  2002-06-22 16:51 ` Harald Welte
@ 2002-06-23  9:15   ` Jean-Michel Hemstedt
  2002-06-25 11:33     ` Jozsef Kadlecsik
  0 siblings, 1 reply; 33+ messages in thread
From: Jean-Michel Hemstedt @ 2002-06-23  9:15 UTC (permalink / raw)
  To: Harald Welte; +Cc: netfilter-devel

I know this debate is not new... I just didn't expect such a (90%, see
below) perf drop and unavailability risk. That's why I'm only reporting
it, hoping secretly that experienced hackers will consider it seriously.
;o)

Note: I don't want to play with words, but if you prefer, consider
     'load generator' as 'malicious DoS user', and 'perf issue' as
      'DoS vulnerability' as Don Cohen cleverly suggested :-/
     (for me it's the same problem, except that a DoS is a one-off event
      while perf is what we may expect in a normal situation)

> >
> > I'm doing some tcp benches on a netfilter enabled box and noticed
> > huge and surprising perf decrease when loading iptable_nat module.
>
> Sounds as expected.

loading a module doesn't mean using it (lsmod reports it as 'unused'
in my tests). So, does it really 'sound as expected' when you see
your CPU load hitting 100% and most packets dropped just after having
done 'iptables -t nat -L' on a system with 1% CPU load handling 'only'
10kpps and forwarding about 1000 new TCP connections/s?

>
> > - ip_conntrack is of course also loading the system, but with huge memory
> > and a large bucket size, the problem can be solved. The big issue with
> > ip_conntrack are the state timeouts: it simply kill the system and drops
> > all the traffic with the default ones, because the ip_conntrack table
> > becomes quickly full, and it seems that there is no way to recover from
> > that  situation... Keeping unused entries (time_close) even 1 minute in
> > the cache is really not suitable for configurations handling (relatively)
> > large number of connections/s.
>
> what is a 'relatively' large number of connections? I've seen a couple
> of netfilter firewalls dealing with 200000+ tracked connections.

200K concurrent established connections, maybe... but surely not NEW
connections/second.
See previous results: with only ip_conntrack loaded (no nat), I hardly
reached 500 (new) conn/s.

>
> > o The cumulative effect should be reconsidered.
>
> could you please try to explain what you mean?

There are 3 aspects:
- table exhaustion (can be fixed with large memory) as long as the
  hash is correctly distributed (few collisions)
- concurrent timers (1 per conntrack tuple??)
- I can't explain the last one, but when the table is exhausted
  conntrack drops new packets, right? What I noticed is that at that
  moment, the cpu load suddenly hit 100%, and the machine did not
  recover, unless I killed the load generator

>
> > o Are there ways/plans to tune the timeouts dynamically? and what are
> >   the valid/invalid ranges of timeouts?
>
> No, see the mailinglist archives for th reason why.

If you refer to your mail of 18 January 2001, I think that this timeout
should also be reviewed ;o)... Waiting for somebody who has the time and
is able to do a redesign was quite idealistic, while a quick patch
for configurable timeouts per rule (i.e. http timeouts different from smtp
ones, as suggested by Denis Ducamp) would have been more realistic.

>
> > o looking at the code, it seems that one timer is started by tuple...
> >   wouldn't it be more efficient to have a unique periodic callback
> >   scanning the whole or part of the table for aged entries?
>
> I think somebody (Martin Josefsson?) is currently looking into optimizing
>
> > - The annoying point is iptable_nat: normally the number of entries in
> > the nat table is much lower than the number of entries in the conntrack
> > table. So even if the hash function itself could be less efficient than
> > the ip_conntrack one (because it takes less arguments: src+dst+proto),
> > the load of nat, should be much lower than the load of conntrack.
> > o So... why is it the opposite??
>
> ? What 'nat table' are  you talking about?  Do you understand how NAT
> works and how it interacts with connection tracking?

Actually, that's also what I would like to know ;o)
bysource or byipsproto hash tables, pointing to ip_nat_hash tuples
pointing to the ip_conntrack entry. But I don't understand where the
extra processing comes from when there are no (nat) rules defined.
Just to recall my test: I generated an amount of new connections
per second passing through a forwarding machine without any iptables
module and measured the cpu load/responsiveness and other things...
Then while the machine was sustaining this amount of new conn/s, i did
'insmod ip_conntrack [size]', saw the cpu load increasing, and finally
just did 'iptables -t nat -L' to load the nat module without any rule,
and saw again the cpu load increasing. With 500conn/s, the cpu load went
from 10% -> ~50/70% -> 100% (machine unavailable).

>
> > o Are there ways to tune the nat performances?
>
> no. NAT (and esp. NAT performance) is not a very strong point of netfilter.
> Everybody agrees that NAT is evil and it should be avoided in all
> circumstances.
> Rusty didn't want to become NAT/masquerading maintainer in the first place,
> but rather concentrate on packet filtering.

wow! what is the alternative for 'Everybody' using REDIRECT?

>
> The NAT subsystem has a number of shortcomings, some of which have been
> fixed, other still remain.
>
> > - Another (old) question: why are conntrack or nat active when there are
> > no rules configured (using them or not)? If not fixed it should be at
> > least documented...
>
> This is standard behaviour.  Does your network driver unload if you
> 'ifconfig down' an interface?  Does a TC qdisc module unload if you
> delete all instances of the queue?

ok, but does your interface send IRQs when it is down? I don't care
about having an 'unused' module in memory as long as it is doing
nothing and not (over)loading the system.

>
> conntrack is _not_ related/intermangled with iptables at all.  Conntrack
> does not know if anybody is using conntrack state in the system.
>
> > Somebody doing "iptables -t nat -L" takes the risk
> > of killing its system if it's already under load...
>
> ?  Please explain why. I see no reason for this.

We agree, I also don't see any reason for it.
See above: a 'clean' machine without iptables modules or rules which
is handling 500conn/s hits 100% CPU and becomes unavailable if you do
'iptables -t nat -L'.

>
> > In the same spirit,
> > iptables -F should unload all unused modules (the ip_tables modules
> > doesn't hurt). Just one quick fix: replace the 'iptables' executable by
> > one 'iptables' script calling the exe (located somewhere else) and
> > doing an rmmod at the end...
>
> no. this is considered a feature. The current [and past] behaviour is wanted
> like this by design.

that's a... choice.

> - Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/

_______________________________________________________________________
-jmhe-               He who expects nothing shall never be disappointed

* Re: performance issues (nat / conntrack)
  2002-06-20 19:48 performance issues (nat / conntrack) Jean-Michel Hemstedt
  2002-06-22 16:51 ` Harald Welte
@ 2002-06-25 10:35 ` Jozsef Kadlecsik
  2002-06-25 12:42   ` Jean-Michel Hemstedt
  1 sibling, 1 reply; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 10:35 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: netfilter-devel

On Thu, 20 Jun 2002, Jean-Michel Hemstedt wrote:

> I'm doing some tcp benches on a netfilter enabled box and noticed
> huge and surprising perf decrease when loading iptable_nat module.
>
> - ip_conntrack is of course also loading the system, but with huge memory
> and a large bucket size, the problem can be solved. The big issue with
> ip_conntrack are the state timeouts: it simply kill the system and drops
> all the traffic with the default ones, because the ip_conntrack table
> becomes quickly full, and it seems that there is no way to recover from
> that  situation... Keeping unused entries (time_close) even 1 minute in
> the cache is really not suitable for configurations handling (relatively)
> large number of connections/s.

Please note: the role of the conntrack subsystem is to keep track of the
connections. As good as possible. If the conntrack table becomes full,
there are two possibilities:

- conntrack table size is underestimated for the real traffic flowing
  through. Get more RAM and increase the table size.
- conntrack is under a (DoS) attack. Then protect conntrack by appropriate
  rules using the recent/limit/psd etc modules.

I'm against changing the *default* timeout values, except when it is
based on real-life, well established cases.

> o The cumulative effect should be reconsidered.
> o Are there ways/plans to tune the timeouts dynamically? and what are
>   the valid/invalid ranges of timeouts?

There is already a patch in p-o-m which makes it possible to *tune* the
timeouts dynamically via /proc. Actually, the only reason why that part of
the patch was written was to make it possible to dynamically *increase* the
timeout value of the close_wait state.

> - The annoying point is iptable_nat: normally the number of entries in
> the nat table is much lower than the number of entries in the conntrack
> table. So even if the hash function itself could be less efficient than
> the ip_conntrack one (because it takes less arguments: src+dst+proto),
> the load of nat, should be much lower than the load of conntrack.

If there is no explicit NAT rule for a connection, then automatic NULL
mapping happens. (Also, because NAT keeps two additional hashes, the total
amount of memory required for the data is 3*ip_conntrack_htable_size.)

The book-keeping overhead is at least doubled compared to the
conntrack-only case - this explains pretty well the results you got.

> - Another (old) question: why are conntrack or nat active when there are
> no rules configured (using them or not)? If not fixed it should be at
> least documented... Somebody doing "iptables -t nat -L" takes the risk

conntrack and nat are subsystems. If somebody loads them in, then they
start to work.

But why would anyone type in "iptables -t nat -L" when in reality he/she
does not use nat and the nat table itself??

> here is my test bed:
>
> tested target:
>  -kernel 2.4.18 + non_local_bind + small conntrack timeouts...
>  -PIII~500MHz, RAM=256MB
>  -2*100Mb/s NIC
>
> The target acts as a forwarding gateway between a load generator client
> running httperf, and an apache proxy serving cached pages. 100Mb/s NICs
> and requests/response sizes insure that BW and packet collisions is not
> an issue.
>
> Since in my test, each connection is ephemeral (<10ms), i recompiled the
> kernel with very short conntrack timeouts (i.e: 1 sec for close_wait,
> and about 60 sec for established!) This was also the only way to restrict
> the conntrack hash table size (given my RAM) and avoid exagerated hash
> collisions. Another limitation comes from my load generator creating traffic
> from one source to one destination ipa, with only source port variation
> (but given my configured hash table size and the hash function itself
> it shouldn't have been an issue).

I think because only the source port varies, this is an important issue in
your setup. You actually tested the hash functions and could bomb some
hash entries. The overall effect was a DoS against conntrack.

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

* Re: performance issues (nat / conntrack)
  2002-06-23  9:15   ` Jean-Michel Hemstedt
@ 2002-06-25 11:33     ` Jozsef Kadlecsik
  2002-06-25 12:47       ` Harald Welte
  2002-06-25 13:21       ` Jean-Michel Hemstedt
  0 siblings, 2 replies; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 11:33 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: Harald Welte, netfilter-devel

On Sun, 23 Jun 2002, Jean-Michel Hemstedt wrote:

> > > I'm doing some tcp benches on a netfilter enabled box and noticed
> > > huge and surprising perf decrease when loading iptable_nat module.
> >
> > Sounds as expected.
>
> loading a module, doesn't mean using it (lsmod reports it as 'unused'
> in my tests). So, does it really 'sounds as expected', when you see

Where did you get the idea that the module usage counter reports how many
packets/connections are handled (currently? in total?) by the module?
There is no connection whatsoever!

> > > o The cumulative effect should be reconsidered.
>
> - I can't explain the last one, but when the table is exhausted
>   conntrack drops new packets, right? What I noticed is that at that
>   moment, the cpu load suddenly hit 100%, and the machine did not
>   recover, unless I killed the load generator

That is unusual and should be tested further.

> > ? What 'nat table' are  you talking about?  Do you understand how NAT
> > works and how it interacts with connection tracking?
>
> Just to recall my test: I generated an amount of new connections
> per second passing through a forwarding machine without any iptables
> module and measured the cpu load/responsiveness and other things...
> Then while the machine was sustaining this amount of new conn/s, i did
> 'insmod ip_conntrack [size]', saw the cpu load increasing, and finally
> just did 'iptables -t nat -L' to load the nat module without any rule,
> and saw again the cpu load increasing. With 500conn/s, the cpu load went
> from 10% -> ~50/70% -> 100% (machine unavailable).

According to your first mail, the machine has 256M RAM and you issued

insmod ip_conntrack 16384

That requires 16384*8*~600byte ~= 75MB non-swappable RAM.

When you issued "iptables -t nat -L", the system tried to reserve an
additional 2x75MB. That in total is pretty near to all your available
physical RAM, and the machine might have died swapping.
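
The arithmetic in this estimate can be checked with a few lines of C. This
is only a sketch that takes the figures of this message at face value: the
factor 8 (tracked connections allowed per bucket), the rough 600 bytes per
entry, and the assumption that each of the two additional NAT hashes costs
as much as conntrack itself. The function and macro names are hypothetical.

```c
#include <stdint.h>

/* Figures from the estimate above; the per-entry size is a rough guess. */
#define ENTRIES_PER_BUCKET 8u
#define BYTES_PER_ENTRY    600u

/* Worst-case memory for a completely full conntrack table. */
uint64_t conntrack_bytes(uint64_t buckets)
{
    return buckets * ENTRIES_PER_BUCKET * BYTES_PER_ENTRY;
}

/* Conntrack plus the two additional NAT hashes, each assumed to cost
 * as much as conntrack itself (the "3x" figure). */
uint64_t conntrack_nat_bytes(uint64_t buckets)
{
    return 3 * conntrack_bytes(buckets);
}
```

For 16384 buckets, conntrack_bytes() gives 78643200 bytes (75 MiB) and
conntrack_nat_bytes() gives 235929600 bytes (225 MiB), to be compared with
the 256MB of RAM in the test machine. Both are upper bounds that are only
reached when the table is full.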

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

* Re: performance issues (nat / conntrack)
  2002-06-25 10:35 ` Jozsef Kadlecsik
@ 2002-06-25 12:42   ` Jean-Michel Hemstedt
  2002-06-25 13:50     ` Patrick Schaaf
                       ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Jean-Michel Hemstedt @ 2002-06-25 12:42 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

> > I'm doing some tcp benches on a netfilter enabled box and noticed
> > huge and surprising perf decrease when loading iptable_nat module.
> >
> > - ip_conntrack is of course also loading the system, but with huge memory
> > and a large bucket size, the problem can be solved. The big issue with
> > ip_conntrack are the state timeouts: it simply kill the system and drops
> > all the traffic with the default ones, because the ip_conntrack table
> > becomes quickly full, and it seems that there is no way to recover from
> > that  situation... Keeping unused entries (time_close) even 1 minute in
> > the cache is really not suitable for configurations handling (relatively)
> > large number of connections/s.
>
> Please note: the role of the conntrack subsystem is to keep track of the
> connections. As good as possible. If the conntrack table becomes full,
> there are two possibilities:
>
> - conntrack table size is underestimated for the real traffic flowing
>   trough. Get more RAM and increase the table size.
> - conntrack is under a (DoS) attack. Then protect conntrack by appropriate
>   rules using the recent/limit/psd etc modules.

And what if, under load conditions, your table becomes full because 90% of
its entries, which are unused, are not aged out because of timeouts?
We don't even need to have a full table to get into trouble. If at one
point, the vast majority of the conntrack entries are unused, but still
in the hash, then you get more and more collisions, which decreases the
hash efficiency.

There's another side effect: when the system gets loaded (because of
hash exhaustion or hash collisions), it can't process all packets arriving,
which means that conntrack will not see some of the FIN or RST packets that
would allow it to recover... This is a kind of 'vicious circle', or point
of failure.
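
The link between timeouts and table occupancy can be made concrete with
Little's law: in steady state, the number of tracked entries is the rate of
new connections multiplied by the time each entry stays in the table. The
sketch below is only that arithmetic; the function names are hypothetical
and the inputs are the figures reported for the conntrack-only test earlier
in the thread.

```c
/* Steady-state number of tracked connections: arrival rate times the
 * time each entry stays in the table (Little's law). */
double steady_state_entries(double conn_per_sec, double lifetime_sec)
{
    return conn_per_sec * lifetime_sec;
}

/* Average entry lifetime implied by an observed table size at a
 * given connection rate. */
double implied_lifetime_sec(double entries, double conn_per_sec)
{
    return entries / conn_per_sec;
}
```

The reported table sizes (970 entries at 100 conn/s, 2390 at 250 conn/s,
4770 at 500 conn/s) imply an average lifetime of roughly 9.5 to 9.7 seconds
per entry with the shortened timeouts. Holding every entry for 60 seconds
at 500 conn/s would instead mean about 30000 entries in the table.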

In my opinion, a first step should be to reconsider timeout values but
also timer mechanisms.

>
> I'm against in changing the *default* timeout values, except when it is
> based on real-life, well established cases.

Which is the most significant: 'TCP timeouts' or 'application timeouts'?
Should (e.g.) HTTP, FTP and Telnet have the same lifetime in the hash?

>
> > o The cumulative effect should be reconsidered.
> > o Are there ways/plans to tune the timeouts dynamically? and what are
> >   the valid/invalid ranges of timeouts?
>
> There is already a patch in p-o-m which makes possible to *tune* the
> timeouts dynamically via /proc. Actually, the only reason why that part of
> the patch was written was to make possible to dynamically *increase* the
> timeout value of the close_wait state.

I didn't know that. thanks for the info.
But unfortunately it doesn't meet my 'timeout per protocol' needs.

>
> > - The annoying point is iptable_nat: normally the number of entries in
> > the nat table is much lower than the number of entries in the conntrack
> > table. So even if the hash function itself could be less efficient than
> > the ip_conntrack one (because it takes less arguments: src+dst+proto),
> > the load of nat, should be much lower than the load of conntrack.
>
> If there is no explicit NAT rule for a connection, then automatic NULL
> mapping happens. (Also, because NAT keeps two additional hashes, the total
> amount of memory required for the data is 3*ip_conntrack_htable_size.)

indeed, this dimensioning is quite conservative, and it assumes that
conntrack is distributed on src+dst+proto, not on ports. But we can
live with that, since it's only a memory overhead (except if we start
considering memory pages swapping).

>
> The book-keeping overhead is at least doubled compared to the
> conntrack-only case - this explains pretty well the results you got.

what do you mean by 'book-keeping' ?
Does NAT do a lookup even if there are no rules?

>
> > - Another (old) question: why are conntrack or nat active when there are
> > no rules configured (using them or not)? If not fixed it should be at
> > least documented... Somebody doing "iptables -t nat -L" takes the risk
>
> conntrack and nat are subsystems. If somebody loads them in, then they
> start to work.
>

work on what, since NAT has nothing to translate?

> But why would anyone type in "iptables -t nat -L" when in reality he/she
> does not use nat and the nat table itself??

(why do we live if it's for dying in the end?)

>
> > here is my test bed:
> >
> > tested target:
> >  -kernel 2.4.18 + non_local_bind + small conntrack timeouts...
> >  -PIII~500MHz, RAM=256MB
> >  -2*100Mb/s NIC
> >
> > The target acts as a forwarding gateway between a load generator client
> > running httperf, and an apache proxy serving cached pages. 100Mb/s NICs
> > and requests/response sizes insure that BW and packet collisions is not
> > an issue.
> >
> > Since in my test, each connection is ephemeral (<10ms), i recompiled the
> > kernel with very short conntrack timeouts (i.e: 1 sec for close_wait,
> > and about 60 sec for established!) This was also the only way to restrict
> > the conntrack hash table size (given my RAM) and avoid exagerated hash
> > collisions. Another limitation comes from my load generator creating traffic
> > from one source to one destination ipa, with only source port variation
> > (but given my configured hash table size and the hash function itself
> > it shouldn't have been an issue).
>
> I think because only the source port varies, this is an important issue in
> your setup. You actually tested the hash functions and could bomb some
> hash entries. The overall effect was a DoS against conntrack.

ok, here we go:

static inline u_int32_t
hash_conntrack(const struct ip_conntrack_tuple *tuple)
{
#if 0
        dump_tuple(tuple);
#endif
        /* ntohl because more differences in low bits. */
        /* To ensure that halves of the same connection don't hash
           clash, we add the source per-proto again. */
        return (ntohl(tuple->src.ip + tuple->dst.ip
                     + tuple->src.u.all + tuple->dst.u.all
                     + tuple->dst.protonum)
                + ntohs(tuple->src.u.all))
                % ip_conntrack_htable_size;
}

src.u.all & dst.u.all refer (unless there's a bug) to src.tcp.port
and dst.tcp.port respectively. So, if only src.port varies linearly
(let's say between 32000 and 64000), and if ip_conntrack_htable_size
= 32768 (kernel: ip_conntrack (32768 buckets, 262144 max)), then
we should have a maximum of 2 collisions per bucket (unless there's a
type overflow somewhere).

This was my test setup, but since I haven't verified the conntrack hash
distribution, I didn't want to argue on that. To measure that, we should
maintain hash counters such as max collisions, average collisions per
key, hit/miss depth average, number of hit/miss per second, etc...
I've planned to do that along with profiling, but unfortunately not in
the 2 coming weeks.

--

last points I wanted to clarify:

> From: "Patrick Schaaf" <bof@bof.de>
> On Sun, Jun 23, 2002 at 09:46:29PM -0700, Don Cohen wrote:
> >  > From: "Jean-Michel Hemstedt" <jean-michel.hemstedt@alcatel.be>
> >  > >  > Since in my test, each connection is ephemeral (<10ms) ...
> >
> > One question here is whether the traffic generator is acting like
> > a real set of users or like an attacker.  A real user would not keep
> > trying to make connections at the same rate if the previous attempts
> > were not being served.  I suspect you're acting more like an attacker.
>
> He definitely is. The test he described is completely artificial, and does
> not represent any normal real world workload.
>
> Nevertheless, it does point out a valid optimization chance. We discussed
> that months ago, and it's still there.

No, I don't think so.
1) the hash is not the cause (see above)
   (btw, as discussed in 'connection tracking scaling' [19 March 2002]
    I don't see ways to really optimize it unless you go for
    multidimensional hashes described in theoretical papers, or
    you make traffic assumptions, which is most likely impossible
    in such a generic framework...) However, I don't understand
    why we are adding the src.port twice in the hash function.
2) My test was artificial, but not unrealistic: one endpoint sustaining
   1000 conn/s whatever the responsiveness of the target, or 10000 users
   trying to connect through the gw in a time lapse of 10 seconds is
   similar.
   Now, if some of you are telling me that I'm not allowed, or that I'm nuts
   to place my box in front of 10000 users, that's another debate.
   I'm not talking about dimensioning, I'm talking about relative performances,
   and strange weaknesses.

kr,
-jmhe-


* Re: performance issues (nat / conntrack)
  2002-06-25 11:33     ` Jozsef Kadlecsik
@ 2002-06-25 12:47       ` Harald Welte
  2002-06-25 14:23         ` Jozsef Kadlecsik
                           ` (2 more replies)
  2002-06-25 13:21       ` Jean-Michel Hemstedt
  1 sibling, 3 replies; 33+ messages in thread
From: Harald Welte @ 2002-06-25 12:47 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, Jun 25, 2002 at 01:33:13PM +0200, Jozsef Kadlecsik wrote:
> From where do you think that the module usage counter reports how many
> packets/connections are handled (currently? totally?) by the module.
> There is no whatsoever connection!

one should also consider the performance impact this would have !!!

> According to your first mail, the machine has 256M RAM and you issued
> 
> insmod ip_conntrack 16384
> 
> That requires 16384*8*~600byte ~= 75MB non-swappable RAM.
> 
> When you issued "iptables -t nat -L", the system tried to reserve plus
> 2x75MB. That's in total pretty near to all your available physical RAM
> and the machine might have died in swapping.

??? Why should listing an IP table try to reserve twice the size of the
conntrack table?

> Regards,
> Jozsef

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)


* Re: performance issues (nat / conntrack)
  2002-06-25 11:33     ` Jozsef Kadlecsik
  2002-06-25 12:47       ` Harald Welte
@ 2002-06-25 13:21       ` Jean-Michel Hemstedt
  2002-06-25 13:51         ` Harald Welte
                           ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Jean-Michel Hemstedt @ 2002-06-25 13:21 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Harald Welte, netfilter-devel

> > loading a module, doesn't mean using it (lsmod reports it as 'unused'
> > in my tests). So, does it really 'sounds as expected', when you see
> 
> From where do you think that the module usage counter reports how many
> packets/connections are handled (currently? totally?) by the module.
> There is no whatsoever connection!

module usage counter increases when a TARGET needs it (e.g. ipt_REDIRECT).
In this test, no rule was defined, and no target module was loaded.
So I did not expect NAT to process any packet.

> 
> > > > o The cumulative effect should be reconsidered.
> >
> > - I can't explain the last one, but when the table is exhausted
> >   conntrack drops new packets, right? What I noticed is that at that
> >   moment, the cpu load suddenly hit 100%, and the machine did not
> >   recover, unless I killed the load generator
> 
> That is unusual and should be tested further.

I suppose that due to the load, packets are dropped not because of conntrack
but because they simply can't be processed, and thus conntrack misses packets
of existing connections (such as FIN, RST) and can't thus recover due to its 
timeouts.

> 
> > > ? What 'nat table' are  you talking about?  Do you understand how NAT
> > > works and how it interacts with connection tracking?
> >
> > Just to recall my test: I generated an amount of new connections
> > per second passing through a forwarding machine without any iptables
> > module and measured the cpu load/responsiveness and other things...
> > Then while the machine was sustaining this amount of new conn/s, i did
> > 'insmod ip_conntrack [size]', saw the cpu load increasing, and finally
> > just did 'iptables -t nat -L' to load the nat module without any rule,
> > and saw again the cpu load increasing. With 500conn/s, the cpu load went
> > from 10% -> ~50/70% -> 100% (machine unavailable).
> 
> According to your first mail, the machine has 256M RAM and you issued
> 
> insmod ip_conntrack 16384
> 
> That requires 16384*8*~600byte ~= 75MB non-swappable RAM.
> 
> When you issued "iptables -t nat -L", the system tried to reserve plus
> 2x75MB. That's in total pretty near to all your available physical RAM
> and the machine might have died in swapping.
> 

Exactly!
That's why I looked (but not closely) at swap-in/swap-out in procinfo,
but didn't notice anything (0 most of the time on 10 sec average).
But I agree that I was close to the limit, and even over when I tried 32K.
Even so, it is not so surprising to see so few swaps, since my table
was not full (max 4000 up to 10000 concurrent tuples).

But this raises one additional problem: 
1) the hash index size and the hash total size should be configurable 
separately (get rid of that factor 8, and use a free list for the tuple 
allocation).
2) NAT hash sizes should also be configurable independently from conntrack.
Normally the nat hashes are smaller than conntrack hash, since conntrack
is based on ports, while nat is not.

PS: could anybody redo similar tests so that we can compare the results
    and stop killing the messenger, please? ;o)

> Regards,
> Jozsef
> -
> E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
> WWW-Home: http://www.kfki.hu/~kadlec
> Address : KFKI Research Institute for Particle and Nuclear Physics
>           H-1525 Budapest 114, POB. 49, Hungary
> 
> 
> 
> 


* Re: performance issues (nat / conntrack)
  2002-06-25 12:42   ` Jean-Michel Hemstedt
@ 2002-06-25 13:50     ` Patrick Schaaf
  2002-06-25 19:03       ` Harald Welte
  2002-06-25 13:56     ` Alex Bennee
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Patrick Schaaf @ 2002-06-25 13:50 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: netfilter-devel

> In my opinion, a first step should be to reconsider timeout values but
> also timer mechanisms.

No. A first step MUST be pointing out that the current timeouts become
a problem in REAL LIFE. Right now you are speculating.  On all setups
I personally know, the timeouts are NOT a problem.

Regarding timer _mechanisms_ I have seen no indication at all that the
current mechanism is a problem. If you want to insist, _please_ learn
about kernel profiling, and start posting FACT.

regards
  Patrick


* Re: performance issues (nat / conntrack)
  2002-06-25 13:21       ` Jean-Michel Hemstedt
@ 2002-06-25 13:51         ` Harald Welte
  2002-06-25 14:33           ` Jozsef Kadlecsik
  2002-06-25 14:51           ` Jean-Michel Hemstedt
  2002-06-25 13:52         ` Patrick Schaaf
  2002-06-25 14:53         ` Jozsef Kadlecsik
  2 siblings, 2 replies; 33+ messages in thread
From: Harald Welte @ 2002-06-25 13:51 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: Jozsef Kadlecsik, netfilter-devel

On Tue, Jun 25, 2002 at 03:21:56PM +0200, Jean-Michel Hemstedt wrote:
> > > loading a module, doesn't mean using it (lsmod reports it as 'unused'
> > > in my tests). So, does it really 'sounds as expected', when you see
> > 
> > From where do you think that the module usage counter reports how many
> > packets/connections are handled (currently? totally?) by the module.
> > There is no whatsoever connection!
> 
> module usage counter increases when a TARGET needs it (i.e. ipt_REDIRECT).
> In this test, no rule was defined, and no target module was loaded.
> So I did not expect NAT to process any packet.

the way NAT is implemented currently, it always processes every packet
the same way.  For a NEW packet where we don't find a nat rule, we
allocate a 'null binding' telling the nat code that there is no nat
transformation to be made.

> But this raises one additional problem: 
> 1) the hash index size and the hash total size should be configurable 
> separately (get rid of that factor 8, and use a free list for the tuple 
> allocation).
> 2) NAT hash sizes should also be configurable independently from conntrack.
> Normally the nat hashes are smaller than conntrack hash, since conntrack
> is based on ports, while nat is not.

both of these are already true. look at the module load-time parameters of
ip_conntrack.o and iptable_nat.o

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)


* Re: performance issues (nat / conntrack)
  2002-06-25 13:21       ` Jean-Michel Hemstedt
  2002-06-25 13:51         ` Harald Welte
@ 2002-06-25 13:52         ` Patrick Schaaf
  2002-06-25 14:53         ` Jozsef Kadlecsik
  2 siblings, 0 replies; 33+ messages in thread
From: Patrick Schaaf @ 2002-06-25 13:52 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: netfilter-devel

Jean-Michel,

> PS: could anybody redo similar tests so that we can compare the results
>     and stop killing the messenger, please? ;o)

Just so you don't get the wrong impression: I am not trying to shoot
the messenger, I'm trying to shoot incomplete messages. Please, don't
become discouraged in further investigating the situation!

best regards
  Patrick


* Re: performance issues (nat / conntrack)
  2002-06-25 12:42   ` Jean-Michel Hemstedt
  2002-06-25 13:50     ` Patrick Schaaf
@ 2002-06-25 13:56     ` Alex Bennee
  2002-06-25 14:17     ` Jozsef Kadlecsik
  2002-06-25 19:01     ` Harald Welte
  3 siblings, 0 replies; 33+ messages in thread
From: Alex Bennee @ 2002-06-25 13:56 UTC (permalink / raw)
  To: netfilter-devel

Jean-Michel Hemstedt said:
> In my opinion, a first step should be to reconsider timeout values but
> also timer mechanisms.

I've been following this thread with interest as I recently also had
conntrack related problems (failing to establish new connections due to the
table being full).

My machine is resource constrained (28M RAM) as it's only an ADSL gateway, yet
when I count the number of connections it's tracking, it varies between
300->600 connections, which bears little relation to what it should be.

I exacerbate the problem by running gtk-gnutella, which entertains a lot of
short-lived incoming connections that get closed by the application but
still create long-lived conntrack entries.

>> I'm against in changing the *default* timeout values, except when it
>> is based on real-life, well established cases.
>
> What sounds the most significant: 'TCP timeouts' or 'application
> timeouts'? Should (i.e) HTTP, FTP and Telnet have the same lifetime in
> hash?

Maybe an iptables marking approach (a-la tc)?

Alex
www.bennee.com/~alex/


* Re: performance issues (nat / conntrack)
  2002-06-25 12:42   ` Jean-Michel Hemstedt
  2002-06-25 13:50     ` Patrick Schaaf
  2002-06-25 13:56     ` Alex Bennee
@ 2002-06-25 14:17     ` Jozsef Kadlecsik
  2002-06-25 15:13       ` Balazs Scheidler
  2002-06-27  2:21       ` Andrew Smith
  2002-06-25 19:01     ` Harald Welte
  3 siblings, 2 replies; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 14:17 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: netfilter-devel

On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:

> > connections. As good as possible. If the conntrack table becomes full,
> > there are two possibilities:
> >
> > - conntrack table size is underestimated for the real traffic flowing
> >   through. Get more RAM and increase the table size.
> > - conntrack is under a (DoS) attack. Then protect conntrack by appropriate
> >   rules using the recent/limit/psd etc modules.
>
> And what if, under load conditions, your table becomes full because 90% of
> its entries, which are unused, are not aged because of timeouts?

The only case when that might happen is a DoS. You did not consider the
second point above.

> We don't even need to have a full table to get into troubles. If at one
> point, the vast majority of the conntrack entries are unused, but still
> in hash, then you get more and more collisions, which decreases the
> hash efficiency.

What kind of collisions? Do you mean that we end up in the same hash
entry and the linked list in the entry becomes too long? There is not much
wizardry we can do about it:

- increase the hash size (i.e buy more RAM) if the hash is small
- create better hash function, if one can deliberately hit the same entry.

By the way, so far nobody has ever proved that the hash function is
not good enough.

> There's another side effect: when the system gets loaded (because of
> hash exhaustion or hash collisions), it can't process all packets arriving
> which means that conntrack will not see some FIN or RST packets allowing
> it to recover... This is a kind of 'vicious circle', or point of failure.

This is not true. If those FIN/RST packets belong to already existing
connections, then those are in the conntrack hash and data can be updated.
If those packets do not belong to an existing connection, then either they
can create a new entry and we are fine, or conntrack is full and the
packets will be dropped - we are fine again.

> In my opinion, a first step should be to reconsider timeout values but
> also timer mechanisms.

As Patrick already wrote: there is still no proof that the timeout values
are wrong.

> > I'm against in changing the *default* timeout values, except when it is
> > based on real-life, well established cases.
>
> What sounds the most significant: 'TCP timeouts' or 'application timeouts'?
> Should (i.e) HTTP, FTP and Telnet have the same lifetime in hash?

Sorry, I have the impression that you do not know how conntrack works, how
conntrack entries are created, updated and destroyed.

Applications get the same timeouts, but their lifetime (and even that of
the different connections of the same application) can be quite different.

> > The book-keeping overhead is at least doubled compared to the
> > conntrack-only case - this explains pretty well the results you got.
>
> what do you mean by 'book-keeping' ?
> Does NAT do a lookup even if there are no rules?

I have to write again: even if there are no rules at all, a NULL
mapping happens and new connections must be put into both nat hashes.

> > conntrack and nat are subsystems. If somebody loads them in, then they
> > start to work.
>
> work on what, since NAT has nothing to translate?

See above.

> > But why would anyone type in "iptables -t nat -L" when in reality he/she
> > does not use nat and the nat table itself??
>
> (why do we live if it's for dying in the end?)

If somebody wants to shoot himself in the foot, we can give him even more
rope :-).

> > I think because only the source port varies, this is an important issue in
> > your setup. You actually tested the hash functions and could bomb some
> > hash entries. The overall effect was a DoS against conntrack.

> This was my test setup, but since I haven't verified the conntrack hash
> distribution, I didn't want to argue on that. To measure that, we should
> maintain hash counters such as max collisions, average collisions per
> key, hit/miss depth average, number of hit/miss per second, etc...
> I've planned to do that along with profiling, but unfortunately not in
> the 2 coming weeks.

In my opinion, this is the real question. But I repeat again, nobody
proved that the hash function is not good enough. It's only speculation.

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: performance issues (nat / conntrack)
  2002-06-25 12:47       ` Harald Welte
@ 2002-06-25 14:23         ` Jozsef Kadlecsik
       [not found]         ` <025001c21c50$763fa880$0489cb8a@etbx180>
  2002-06-25 21:08         ` Jozsef Kadlecsik
  2 siblings, 0 replies; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 14:23 UTC (permalink / raw)
  To: Harald Welte; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, 25 Jun 2002, Harald Welte wrote:

> > According to your first mail, the machine has 256M RAM and you issued
> >
> > insmod ip_conntrack 16384
> >
> > That requires 16384*8*~600byte ~= 75MB non-swappable RAM.
> >
> > When you issued "iptables -t nat -L", the system tried to reserve plus
> > 2x75MB. That's in total pretty near to all your available physical RAM
> > and the machine might have died in swapping.
>
> ??? Why should listing an IP table try to reserve twice the size of the
> conntrack table?

By entering the command above, he loads the iptable_nat kernel module,
which, when initializing itself, tries to allocate memory for the bysource
and byipsproto hashes (each the same size as ip_conntrack_hash).

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: performance issues (nat / conntrack)
  2002-06-25 13:51         ` Harald Welte
@ 2002-06-25 14:33           ` Jozsef Kadlecsik
  2002-06-25 14:51           ` Jean-Michel Hemstedt
  1 sibling, 0 replies; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 14:33 UTC (permalink / raw)
  To: Harald Welte; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, 25 Jun 2002, Harald Welte wrote:

> > But this raises one additional problem:
> > 1) the hash index size and the hash total size should be configurable
> > separately (get rid of that factor 8, and use a free list for the tuple
> > allocation).
> > 2) NAT hash sizes should also be configurable independently from conntrack.
> > Normally the nat hashes are smaller than conntrack hash, since conntrack
> > is based on ports, while nat is not.
>
> both of this is already true. look at the module loadtime parameters of
> ip_conntrack.o and iptable_nat.o

One must set hashsize for the ip_conntrack module and then tweak
/proc/sys/net/ipv4/ip_conntrack_max in order to get rid of the factor 8.

But we do not have a module parameter yet for setting the hashsizes
of iptable_nat independently.

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: performance issues (nat / conntrack)
  2002-06-25 13:51         ` Harald Welte
  2002-06-25 14:33           ` Jozsef Kadlecsik
@ 2002-06-25 14:51           ` Jean-Michel Hemstedt
  2002-06-25 16:11             ` Harald Welte
  1 sibling, 1 reply; 33+ messages in thread
From: Jean-Michel Hemstedt @ 2002-06-25 14:51 UTC (permalink / raw)
  To: Harald Welte; +Cc: Jozsef Kadlecsik, netfilter-devel


> > But this raises one additional problem:
> > 1) the hash index size and the hash total size should be configurable
> > separately (get rid of that factor 8, and use a free list for the tuple
> > allocation).
> > 2) NAT hash sizes should also be configurable independently from conntrack.
> > Normally the nat hashes are smaller than conntrack hash, since conntrack
> > is based on ports, while nat is not.
>
> both of this is already true. look at the module loadtime parameters of
> ip_conntrack.o and iptable_nat.o

right for conntrack, but i can't find something similar for nat:

conntrack:
----------
- ip_conntrack_htable_size : load time param
                           : allocated at init
                           : 16? bytes per list head
- ip_conntrack_max: /proc setting only after the module is loaded
                  : tuples allocated on demand (kmem_cache_alloc)
                  : 392 bytes per tuple.

=> that's why i'm not swapping when my table is not full...

but in ip_conntrack_init():
        ip_conntrack_max = 8 * ip_conntrack_htable_size;
=> when the module is loaded, it is loaded with this default value.
   could be good to have it as loadable parameter in order to
   save it and restore in modules.conf

nat:
----
(from ip_nat_init):
- ip_nat_htable_size = ip_conntrack_htable_size; (not configurable)
                     : allocated at init twice
                       (for bysource and byipsproto hashes)
- max tuples??? haven't found any value nor any config data.
                (is it in patch-o-matic)?
                but the tuples are allocated on demand.


PS: the fact that tuples are allocated on demand (392 bytes/tuple) and not at
    init also explains why I was not swapping. (just facts ;o))


* Re: performance issues (nat / conntrack)
  2002-06-25 13:21       ` Jean-Michel Hemstedt
  2002-06-25 13:51         ` Harald Welte
  2002-06-25 13:52         ` Patrick Schaaf
@ 2002-06-25 14:53         ` Jozsef Kadlecsik
  2002-06-25 15:22           ` Balazs Scheidler
  2 siblings, 1 reply; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 14:53 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: Harald Welte, netfilter-devel

On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:

> > From where do you think that the module usage counter reports how many
> > packets/connections are handled (currently? totally?) by the module.
> > There is no whatsoever connection!
>
> module usage counter increases when a TARGET needs it (i.e. ipt_REDIRECT).

Yes, this is true for netfilter target/match modules. But even in that
case, the number refers to how many rules use the module and not how many
packets were processed.

> In this test, no rule was defined, and no target module was loaded.

There is always an implicit rule in the case of NAT. Being an implicit
rule, it is not counted in the module usage counter. :-)

> So I did not expect NAT to process any packet.

No, NAT always processes all packets, the same way as conntrack does.

> I suppose that due to the load, packets are dropped not because of conntrack
> but because they simply can't be processed, and thus conntrack misses packets
> of existing connections (such as FIN, RST) and can't thus recover due to its
> timeouts.

If conntrack missed packets in such a way, then the destination would miss
them as well and the sender should resend them. No problem.

> > When you issued "iptables -t nat -L", the system tried to reserve plus
> > 2x75MB. That's in total pretty near to all your available physical RAM
> > and the machine might have died in swapping.
>
> exact!
> That's why I looked (but not closely) at swap-in/swap-out in procinfo,
> but didn't notice anything (0 most of the time on 10 sec average).
> But I agree that I was close to the limit, and even over when I tried 32K.
> Despite that, nothing so surprising to have so few swaps, since my table
> was not full (max 4000 up to 10000 concurrent tuples).

But the whole space gets reserved! Immediately, as soon as the module is loaded!
And it is non-swappable RAM, so everything else has to make do with the rest.

> PS: could anybody redo similar tests so that we can compare the results
>     and stop killing the messenger, please? ;o)

Sorry if I look harsh, it's not my intention at all. We have simply been over
almost exactly the same arguments several times, and those resulted in
neither pinpointing real flaws in the system nor better algorithms.

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: performance issues (nat / conntrack)
  2002-06-25 14:17     ` Jozsef Kadlecsik
@ 2002-06-25 15:13       ` Balazs Scheidler
  2002-06-25 19:06         ` Harald Welte
  2002-06-27  2:21       ` Andrew Smith
  1 sibling, 1 reply; 33+ messages in thread
From: Balazs Scheidler @ 2002-06-25 15:13 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, Jun 25, 2002 at 04:17:54PM +0200, Jozsef Kadlecsik wrote:
> On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:
> > > The book-keeping overhead is at least doubled compared to the
> > > conntrack-only case - this explains pretty well the results you got.
> >
> > what do you mean by 'book-keeping' ?
> > Does NAT do a lookup even if there are no rules?
> 
> I have to write again: even if there are no any rules, NULL
> mapping happens and new connections must be put into both nat hashes.

This should not explain the performance degradation others found. If no
rules are found in the table, the conntrack entry is added to the NAT
hashes (place_in_hashes() function). This involves adding the entry to two
linked lists (changes two pointers per list), and then calling do_bindings(),
which does nothing (num_manips == 0) except for calling helpers, of which
there should be none if helper modules are not loaded.

Adding entries to the NAT hashes doesn't involve memory allocation (NAT info
is stored in ip_conntrack), therefore I don't see the reason for the 50%
performance decrease.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1


* Re: performance issues (nat / conntrack)
  2002-06-25 14:53         ` Jozsef Kadlecsik
@ 2002-06-25 15:22           ` Balazs Scheidler
  0 siblings, 0 replies; 33+ messages in thread
From: Balazs Scheidler @ 2002-06-25 15:22 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Jean-Michel Hemstedt, Harald Welte, netfilter-devel

On Tue, Jun 25, 2002 at 04:53:33PM +0200, Jozsef Kadlecsik wrote:
> On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:
> > PS: could anybody redo similar tests so that we can compare the results
> >     and stop killing the messenger, please? ;o)
> 
> Sorry if I look harsh, it's not my intention at all. We were simply over
> almost exactly the same arguments several times. And those resulted
> neither pinpointing real flaws in the system, nor better algorithms.

No, only the head pointers for the hashes are preallocated. conntrack
structures themselves are allocated by the slab allocator: kmem_cache_alloc()
is called in init_conntrack(), which initializes a single conntrack entry.

So the initial memory allocations for conntrack and nat are

conntrack: htable_size * 8 

(8 is sizeof(list_head))

nat: 2 * htable_size * 8

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1


* Re: performance issues (nat / conntrack)
       [not found]         ` <025001c21c50$763fa880$0489cb8a@etbx180>
@ 2002-06-25 16:07           ` Harald Welte
  0 siblings, 0 replies; 33+ messages in thread
From: Harald Welte @ 2002-06-25 16:07 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: netfilter-devel

On Tue, Jun 25, 2002 at 03:59:01PM +0200, Jean-Michel Hemstedt wrote:
> > ??? Why should listing an IP table try to reserve twice the size of the
> > conntrack table?
> 
> this is in nat_init (or so): nat takes the conntrack hash size to
> allocate 2 additional nat hashes 'bysource' and 'byipsproto'.

Ah. I was not aware that you didn't have iptable_nat loaded before the 
command.  Just issuing the '-L' command with no nat loaded does not 
allocate anything big inside the kernel.

> The question is, why do we init it, if we don't use it (on a rule
> point of view)? This init step should occur only if we insert a rule
> using nat.

no. This is again something I regard as a feature, not a bug. don't load
the module if you don't use it.

It's the same behaviour as conntrack.

> kr,
> -jmhe-

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)


* Re: performance issues (nat / conntrack)
  2002-06-25 14:51           ` Jean-Michel Hemstedt
@ 2002-06-25 16:11             ` Harald Welte
  0 siblings, 0 replies; 33+ messages in thread
From: Harald Welte @ 2002-06-25 16:11 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: Jozsef Kadlecsik, netfilter-devel

On Tue, Jun 25, 2002 at 04:51:37PM +0200, Jean-Michel Hemstedt wrote:
> > both of this is already true. look at the module loadtime parameters of
> > ip_conntrack.o and iptable_nat.o
> 
> right for conntrack, but i can't find something similar for nat:

strange. I thought we already had that.

> conntrack:
> ----------
> 
> but in ip_conntrack_init():
>         ip_conntrack_max = 8 * ip_conntrack_htable_size;
> => when the module is loaded, it is loaded with this default value.
>    could be good to have it as loadable parameter in order to
>    save it and restore in modules.conf

where's the problem with having an 'echo 12345 >
/proc/sys/net/ipv4/ip_conntrack_max' in the post-load script in
modules.conf?

> nat:
> ----
> (from ip_nat_init):
> - ip_nat_htable_size = ip_conntrack_htable_size; (not configurable)
>                      : allocated at init twice
>                        (for bysource and byipsproto hashes)
> - max tuples??? haven't found any value neither any config data.
>                 (is it in patch-o-matic)?
>                 but the tuples are allocated on demand.
> 
> 
> PS: the fact that tuples are allocated on demand (392bytes/tuple) and not at
> init

a tuple does not have 392 bytes. this sounds more like the size of a
struct ip_conntrack.

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/

* Re: performance issues (nat / conntrack)
  2002-06-25 12:42   ` Jean-Michel Hemstedt
                       ` (2 preceding siblings ...)
  2002-06-25 14:17     ` Jozsef Kadlecsik
@ 2002-06-25 19:01     ` Harald Welte
  2002-06-25 20:53       ` conntrack DoS Henrik Nordstrom
  3 siblings, 1 reply; 33+ messages in thread
From: Harald Welte @ 2002-06-25 19:01 UTC (permalink / raw)
  To: Jean-Michel Hemstedt; +Cc: Jozsef Kadlecsik, netfilter-devel

On Tue, Jun 25, 2002 at 02:42:36PM +0200, Jean-Michel Hemstedt wrote:
> There's another side effect: when the system gets loaded (because of
> hash exhaustion or hash collisions), it can't process all packets arriving
> which means that conntrack will not see some FIN or RST packets allowing
> it to recover... This is a kind of 'vicious circle', or point of failure.

if conntrack  doesn't see a FIN or RST packet, it won't be forwarded by
the machine and thus never arrive at the receiver.  The sender will thus
retransmit, and hope the packet makes it next time.

> In my opinion, a first step should be to reconsider timeout values but
> also timer mechanisms.

no, the timeout values are reasonable.

> > I'm against in changing the *default* timeout values, except when it is
> > based on real-life, well established cases.
> 
> What sounds the most significant: 'TCP timeouts' or 'application timeouts'?
> Should (e.g.) HTTP, FTP and Telnet have the same lifetime in hash?

yes, they should.  They are TCP connections.  We shouldn't impose
any application-protocol specific layer4 timeouts, that sounds horrible.

the mapping of ports to application protocols (e.g. 80 == http) is by
convention, not by protocol design.

> But unfortunately it doesn't meet my 'timeout per protocol' needs.

well, so go ahead and implement it. nobody prevents you from doing that.

> indeed, this dimensioning is quite conservative, and it assumes that
> conntrack is distributed on src+dst+proto, not on ports. But we can
> live with that, since it's only a memory overhead (except if we start
> considering memory pages swapping).

kernel memory is never swapped out.

> > conntrack and nat are subsystems. If somebody loads them in, then they
> > start to work.
> 
> work on what, since NAT has nothing to translate?

they start the work necessary to be prepared to nat packets/connections.

> > But why would anyone type in "iptables -t nat -L" when in reality he/she
> > does not use nat and the nat table itself??
> 
> (why do we live if it's for dying in the end?)

I don't know what kind of weird position you are claiming.  I think it
is now clear that you have a different perspective on how conntrack/nat
should work.

If the netfilter people respond to this as 'this is by design and not a
bug', you will have to live with that or implement a different system.
That's something different from improving load under DoS situations or
improving conntrack performance in general, where we have the same goal.

> This was my test setup, but since I haven't verified the conntrack hash
> distribution, I didn't want to argue on that. To measure that, we should
> maintain hash counters such as max collisions, average collisions per
> key, hit/miss depth average, number of hit/miss per second, etc...
> I've planned to do that along with profiling, but unfortunately not in
> the 2 coming weeks.

this sounds very constructive and we're looking forward to the results.
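
The counters proposed in the quoted paragraph (maximum collisions, average collisions per key, hit/miss depth) can be illustrated with a small userspace model. This is only a sketch in Python, not kernel code: `chain_stats`, its parameters, and the example tuples are hypothetical names chosen for illustration, and the depth figures assume that every stored entry is looked up equally often and that missed keys hash uniformly over the buckets.

```python
from collections import Counter


def chain_stats(keys, nbuckets, hash_fn=hash):
    """Summarise how `keys` spread over `nbuckets` chained hash buckets."""
    if nbuckets <= 0:
        raise ValueError("nbuckets must be positive")
    # Number of entries that land in each (non-empty) bucket.
    chains = Counter(hash_fn(k) % nbuckets for k in keys)
    lengths = list(chains.values())
    total = sum(lengths)
    return {
        "entries": total,
        "used_buckets": len(lengths),
        "max_chain": max(lengths, default=0),
        "avg_chain": total / len(lengths) if lengths else 0.0,
        # A hit on the i-th entry of a chain walks i entries, so a chain
        # of length c costs 1 + 2 + ... + c = c * (c + 1) / 2 in total.
        "avg_hit_depth": (sum(c * (c + 1) / 2 for c in lengths) / total
                          if total else 0.0),
        # A miss walks the whole chain of the bucket it hashes to.
        "avg_miss_depth": total / nbuckets,
    }


if __name__ == "__main__":
    # Hypothetical tuples (src, dst, proto, sport, dport): one client
    # opening 1000 connections to the same server port.
    tuples = [(1, 2, 6, 1024 + i, 80) for i in range(1000)]
    # Hashing on src+dst+proto only puts every tuple in the same chain.
    print(chain_stats(tuples, 1024, hash_fn=lambda t: hash(t[:3])))
    # Hashing on the full tuple, ports included, for comparison.
    print(chain_stats(tuples, 1024))
```

Comparing the two printed summaries shows why the choice of hashed fields matters for a single-client benchmark like the one described in this thread.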

> last points I wanted to clarify:
> 
> 2) My test was artificial, but not unrealistic: one endpoint sustaining
>    1000 conn/s whatever the responsiveness of the target, or 10000 users
>    trying to connect through the gw in a time lapse of 10 seconds is
>    similar.
>    Now, if some of you are telling me that I'm not allowed, or that I'm nuts
>    to place my box in front of 10000 users, that's another debate.
>    I'm not talking about dimensioning, I'm talking about relative
>    performances, and strange weaknesses.

conntrack should definitely be able to handle this case and I'm looking
forward to seeing detailed results.

I'm away from my testing equipment for almost three weeks, so I cannot
really reproduce or try to verify any of your claims, nor reject
them.

It should at least deal with 10k conn/s.

> kr,
> -jmhe-

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/

* Re: performance issues (nat / conntrack)
  2002-06-25 13:50     ` Patrick Schaaf
@ 2002-06-25 19:03       ` Harald Welte
  0 siblings, 0 replies; 33+ messages in thread
From: Harald Welte @ 2002-06-25 19:03 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, Jun 25, 2002 at 03:50:38PM +0200, Patrick Schaaf wrote:
> > In my opinion, a first step should be to reconsider timeout values but
> > also timer mechanisms.
> 
> No. A first step MUST be pointing out that the current timeouts become
> a problem in REAL LIFE. Right now you are speculating.  On all setups
> I personally know, the timeouts are NOT a problem.
> 
> Regarding timer _mechanisms_ I have seen no indication at all that the
> current mechanism is a problem. If you want to insist, _please_ learn
> about kernel profiling, and start posting FACT.

I've been talking about this with a couple of people here at the kernel
summit, and it looks like the per-packet del_timer/add_timer in
ip_ct_refresh should be a severe performance hit on SMP boxes.

Changing this to 'do not update timer if update would be < HZ different
than current timer' is a two-line patch.  
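
The rule described above can be modelled in a few lines. This is a userspace sketch in Python, not the actual two-line kernel patch: `HZ = 100`, the `Conn` class, and the function names are assumptions made for illustration, and the real change would sit in `ip_ct_refresh` and use the kernel timer functions.

```python
HZ = 100  # assumed timer ticks per second; the real value is a kernel build constant


class Conn:
    """Hypothetical stand-in for a conntrack entry with one pending timer."""

    def __init__(self, expires):
        self.expires = expires      # jiffy at which the entry times out
        self.timer_updates = 0      # how often the timer was re-armed


def refresh(conn, now, extra_jiffies):
    """Re-arm the timer only if it would move by at least HZ jiffies."""
    new_expires = now + extra_jiffies
    if abs(new_expires - conn.expires) < HZ:
        return False                # skip the del_timer/add_timer pair
    conn.expires = new_expires
    conn.timer_updates += 1
    return True


def count_timer_updates(packets, timeout):
    """Feed one packet per jiffy and count how often the timer is touched."""
    conn = Conn(expires=timeout)    # entry created at jiffy 0
    for now in range(1, packets + 1):
        refresh(conn, now, timeout)
    return conn.timer_updates
```

In this model, 1000 packets arriving one jiffy apart re-arm the timer 10 times rather than 1000. The trade-off is that an entry can expire up to HZ jiffies (one second) earlier than it would with per-packet updates.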

As stated before, I'm currently away from my testing equipment, so if 
anybody wants to give it a try...

> regards
>   Patrick

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/

* Re: performance issues (nat / conntrack)
  2002-06-25 15:13       ` Balazs Scheidler
@ 2002-06-25 19:06         ` Harald Welte
  2002-06-26  8:18           ` Balazs Scheidler
  0 siblings, 1 reply; 33+ messages in thread
From: Harald Welte @ 2002-06-25 19:06 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: Jozsef Kadlecsik, Jean-Michel Hemstedt, netfilter-devel

On Tue, Jun 25, 2002 at 05:13:02PM +0200, Balazs Scheidler wrote:
> On Tue, Jun 25, 2002 at 04:17:54PM +0200, Jozsef Kadlecsik wrote:
> > On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:
> > > > The book-keeping overhead is at least doubled compared to the
> > > > conntrack-only case - this explains pretty well the results you got.
> > >
> > > what do you mean by 'book-keeping' ?
> > > Does NAT do a lookup even if there are no rules?
> > 
> > I have to write again: even if there are no rules at all, NULL
> > mapping happens and new connections must be put into both nat hashes.
> 
> This should not explain the performance degradation others found. If no
> rules are found in the table, the conntrack entry is added to the NAT
> hashes. (place_in_hashes() function), this involves adding the entry to two
> linked lists (changes two pointers per list), and then calling do_bindings()
> which does nothing (num_manips == 0) except for calling helpers, which
> should be none, if helper modules are not loaded.
> 
> Adding entries to the NAT hashes doesn't involve memory allocation (NAT info
> is stored in ip_conntrack), therefore I don't see the reason for the 50%
> performance decrease.

think about the lock contention on an SMP system. The 'null binding'
approach for nat (and for example, that nat helpers are called for
connections with 'null binding') is a poor design.  

I've recently done some testing with changes that try to avoid the null
binding, but as I'm not entirely sure they don't break something else,
I haven't released them yet.

> Bazsi

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/

* Re: conntrack DoS
  2002-06-25 19:01     ` Harald Welte
@ 2002-06-25 20:53       ` Henrik Nordstrom
  2002-06-25 21:47         ` Jozsef Kadlecsik
  0 siblings, 1 reply; 33+ messages in thread
From: Henrik Nordstrom @ 2002-06-25 20:53 UTC (permalink / raw)
  To: Harald Welte, Jean-Michel Hemstedt; +Cc: Jozsef Kadlecsik, netfilter-devel

Harald Welte wrote:

> if conntrack  doesn't see a FIN or RST packet, it won't be forwarded by
> the machine and thus never arrive at the receiver.  The sender will thus
> retransmit, and hope the packet makes it next time.

FIN will be retransmitted a couple of times, but RST won't. RST will only be 
retransmitted indirectly if data arrives from the other end.

This brings me back to the important cases where TCP closes down without 
conntrack noticing. Have tried to bring this up a couple of times before but 
I do not recall seeing much of a response..

There exist at least two real-life scenarios where conntrack won't notice that 
a TCP connection is gone:

If the RST is lost on an aborted connection where the other endpoint has 
disappeared.

When the TCP connection is dropped due to retransmission timeouts after one 
of the endpoints has disappeared.

The two cases above are only slight variations of the same scenario. A TCP 
connection is established, then one of the endpoints disappears from the 
network, preventing the connection from being shut down in a normal manner.

The first is very uncommon and probably not easily exploitable.

The second case is a real problem and can quite easily be used to DoS 
conntrack with a relatively small number of packets (~20 minimum-size 
TCP packets in total per wasted conntrack entry).

The good news is that the second case can be detected by detecting a long-term 
uni-directional packet flow, and we probably do not need to care about the 
first.

A simple test case illustrating the second case:

Have three machines

A <-> conntrack server <-> B

On A, set up the following two simple rules to simulate connection dropout
-A OUTPUT -p tcp --tcp-flags RST RST -j DROP
-A OUTPUT -p tcp --tcp-flags FIN FIN -j DROP

On B, enable the chargen TCP service to have a simple TCP data source. To 
accelerate the test, conserve B's resources, and illustrate the point more 
clearly, lower tcp_retries2 to something like 5.

Then run the following silly test program on A

while true; do telnet B chargen </dev/null >/dev/null; done

Note: This is only a superficial twist on the traditional TCP connection flood 
DoS. The attacker finishes the SYN handshake and then ignores the connection. 
It can be launched against conntrack via any TCP service that returns data: the 
data queued on the connection prevents a FIN from being sent when the server 
aborts the connection.

> yes, they should.  They are TCP connections.  We shouldn't impose
> any application-protocol specific layer4 timeouts, that sounds horrible.
>
> the mapping of ports to application protocols (e.g. 80 == http) is by
> convention, not by protocol design.

I agree.

Should also note that there exist fully valid HTTP applications utilizing 
very long idle periods on an HTTP connection. How long the HTTP connection is 
kept open is a matter between the user-agent and the server only, nobody 
else. The fact that most HTTP connections are short-lived does not mean that 
all are, or that it is OK to drop idle HTTP connections.

Regards
Henrik

* Re: performance issues (nat / conntrack)
  2002-06-25 12:47       ` Harald Welte
  2002-06-25 14:23         ` Jozsef Kadlecsik
       [not found]         ` <025001c21c50$763fa880$0489cb8a@etbx180>
@ 2002-06-25 21:08         ` Jozsef Kadlecsik
  2 siblings, 0 replies; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 21:08 UTC (permalink / raw)
  To: Harald Welte; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, 25 Jun 2002, Harald Welte wrote:

> > According to your first mail, the machine has 256M RAM and you issued
> >
> > insmod ip_conntrack 16384
> >
> > That requires 16384*8*~600byte ~= 75MB non-swappable RAM.
> >
> > When you issued "iptables -t nat -L", the system tried to reserve plus
> > 2x75MB. That's in total pretty near to all your available physical RAM
> > and the machine might died in swapping.
>
> ??? Why should listing an IP table try to reserve twice the size of the
> conntrack table?

Harald, Bazsi, of course you are totally right and I wrote bullshit above.

Of course loading iptable_nat does *not* involve a memory requirement
comparable to ip_conntrack. Hash element sizes are equal but structure
sizes in the hashes are far from each other.

Sorry for spreading false information. :-(
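
The conntrack estimate in the quoted mail can be checked with a few lines of Python. This is only a worked calculation: the factor 8 is the default `ip_conntrack_max = 8 * ip_conntrack_htable_size` ratio quoted earlier in the thread, and ~600 bytes per entry is the rough figure from the quoted mail, not a measured structure size. The function name and parameters are hypothetical. As the correction above says, the figure applies to ip_conntrack only, not to iptable_nat.

```python
def conntrack_memory_bytes(hashsize, entries_per_bucket=8, bytes_per_entry=600):
    """Worst-case memory for a full conntrack table, in bytes."""
    return hashsize * entries_per_bucket * bytes_per_entry


if __name__ == "__main__":
    total = conntrack_memory_bytes(16384)
    print(f"{total} bytes = {total / 2**20:.1f} MiB")
```

For a hash size of 16384 this gives 78,643,200 bytes, which is 75 MiB, matching the ~75MB in the quoted estimate.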

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

* Re: conntrack DoS
  2002-06-25 20:53       ` conntrack DoS Henrik Nordstrom
@ 2002-06-25 21:47         ` Jozsef Kadlecsik
  2002-06-25 22:42           ` Henrik Nordstrom
  0 siblings, 1 reply; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 21:47 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Harald Welte, Jean-Michel Hemstedt, netfilter-devel

On Tue, 25 Jun 2002, Henrik Nordstrom wrote:

> > if conntrack  doesn't see a FIN or RST packet, it won't be forwarded by
> > the machine and thus never arrive at the receiver.  The sender will thus
> > retransmit, and hope the packet makes it next time.
>
> FIN will be retransmitted a couple of times, but RST won't. RST will only be
> retransmitted indirectly if data arrives from the other end.

Yes.

> This brings me back to the important cases where TCP closes down without
> conntrack noticing. Have tried to bring this up a couple of times before but
> I do not recall seeing much of a response..
>
> There exist at least two real-life scenarios where conntrack won't notice that
> a TCP connection is gone:
>
> If the RST is lost on an aborted connection where the other endpoint has
> disappeared.
>
> When the TCP connection is dropped due to retransmission timeouts after one
> of the endpoints has disappeared.
>
> The two cases above are only slight variations of the same scenario. A TCP
> connection is established, then one of the endpoints disappears from the
> network, preventing the connection from being shut down in a normal manner.
>
> The first is very uncommon and probably not easily exploitable.
>
> The second case is a real problem and can quite easily be used to DoS
> conntrack with a relatively small number of packets (~20 minimum-size
> TCP packets in total per wasted conntrack entry).
>
> The good news is that the second case can be detected by detecting a long-term
> uni-directional packet flow, and we probably do not need to care about the
> first.

How could such connections (second case) be told apart from
legitimate uni-directional (even half-closed) connections?

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

* Re: conntrack DoS
  2002-06-25 21:47         ` Jozsef Kadlecsik
@ 2002-06-25 22:42           ` Henrik Nordstrom
  2002-06-27  8:26             ` Jozsef Kadlecsik
  0 siblings, 1 reply; 33+ messages in thread
From: Henrik Nordstrom @ 2002-06-25 22:42 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

Jozsef Kadlecsik wrote:

> > The good news is that second case can be detected by detecting a long term
> > uni-directional packet flow, and we probably do not need to care about the
> > first..
> 
> How could such connections (second case) be told apart from
> legitimate uni-directional (even half-closed) connections?

A running TCP packet flow (even for a "half-closed" uni-directional TCP)
is never uni-directional. If there is data flowing in one direction
then there are ACKs in the other direction.

Idea on how conntrack could deal with such connections: If several
retransmissions (let's say 5) are seen in one direction and no ACKs in the
other within a reasonable timeframe (let's say 10 minutes) then the TCP
connection is most likely dead and a low inactivity timeout can be assigned
(let's say 20 minutes) to have it cleaned out from conntrack.

At first glance this can be simplified into a RETRANSMIT/ACK timeout
state machinery, but there is a significant race window making a simple
packet driven state machine unsuitable. Must not trigger on a delayed
retransmission followed by a lost ACK, or delayed retransmissions not
resulting in ACK (out of window).
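
The basic counting idea can be sketched as follows. This Python model is deliberately the naive, packet-driven version, so it does not handle the race conditions just mentioned (a delayed retransmission followed by a lost ACK, or out-of-window retransmissions). The class and function names are hypothetical, the caller is assumed to already know which packets are retransmissions, and reading "within a reasonable timeframe" as "at least 10 minutes since the first unanswered retransmission" is one possible interpretation.

```python
RETRANS_THRESHOLD = 5        # "let's say 5" retransmissions
NO_ACK_WINDOW = 10 * 60      # seconds without an ACK from the peer
DEAD_TIMEOUT = 20 * 60       # reduced inactivity timeout, in seconds


class FlowState:
    """Hypothetical per-connection bookkeeping for one direction."""

    def __init__(self):
        self.retrans = 0              # retransmissions since the last peer ACK
        self.first_retrans_at = None  # time of the first unanswered one
        self.timeout = None           # None means "keep the normal timeout"


def on_retransmission(flow, now):
    """Record a retransmission; shorten the timeout if the flow looks dead."""
    if flow.first_retrans_at is None:
        flow.first_retrans_at = now
    flow.retrans += 1
    if (flow.retrans >= RETRANS_THRESHOLD
            and now - flow.first_retrans_at >= NO_ACK_WINDOW):
        flow.timeout = DEAD_TIMEOUT
    return flow.timeout


def on_ack_from_peer(flow):
    """Any ACK from the other direction shows the peer is still alive."""
    flow.retrans = 0
    flow.first_retrans_at = None
    flow.timeout = None
```

Because the check only runs when a packet arrives, a flow that stops sending altogether is still left to the ordinary inactivity timeout.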

Regards
Henrik

* Re: performance issues (nat / conntrack)
  2002-06-25 19:06         ` Harald Welte
@ 2002-06-26  8:18           ` Balazs Scheidler
  0 siblings, 0 replies; 33+ messages in thread
From: Balazs Scheidler @ 2002-06-26  8:18 UTC (permalink / raw)
  To: Harald Welte, Jozsef Kadlecsik, Jean-Michel Hemstedt,
	netfilter-devel

On Tue, Jun 25, 2002 at 09:06:47PM +0200, Harald Welte wrote:
> On Tue, Jun 25, 2002 at 05:13:02PM +0200, Balazs Scheidler wrote:
> > On Tue, Jun 25, 2002 at 04:17:54PM +0200, Jozsef Kadlecsik wrote:
> > > On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:
> > > > what do you mean by 'book-keeping' ?
> > > > Does NAT do a lookup even if there are no rules?
> > > 
> > > I have to write again: even if there are no rules at all, NULL
> > > mapping happens and new connections must be put into both nat hashes.
> > 
> > This should not explain the performance degradation others found. If no
> > rules are found in the table, the conntrack entry is added to the NAT
> > hashes. (place_in_hashes() function), this involves adding the entry to two
> > linked lists (changes two pointers per list), and then calling do_bindings()
> > which does nothing (num_manips == 0) except for calling helpers, which
> > should be none, if helper modules are not loaded.
> > 
> > Adding entries to the NAT hashes doesn't involve memory allocation (NAT info
> > is stored in ip_conntrack), therefore I don't see the reason for the 50%
> > performance decrease.
> 
> think about the lock contention on an SMP system. The 'null binding'
> approach for nat (and for example, that nat helpers are called for
> connections with 'null binding') is a poor design.  
> 
> I've recently done some testing with changes that try to avoid the null
> binding, but as I'm not entirely sure they don't break something else,
> I haven't released them yet.

The original test machine used to gather performance information was not
SMP:

"
here is my test bed:

tested target:
 -kernel 2.4.18 + non_local_bind + small conntrack timeouts...
 -PIII~500MHz, RAM=256MB
 -2*100Mb/s NIC

"

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1

* Re: performance issues (nat / conntrack)
  2002-06-25 14:17     ` Jozsef Kadlecsik
  2002-06-25 15:13       ` Balazs Scheidler
@ 2002-06-27  2:21       ` Andrew Smith
  2002-06-27 11:24         ` Harald Welte
  1 sibling, 1 reply; 33+ messages in thread
From: Andrew Smith @ 2002-06-27  2:21 UTC (permalink / raw)
  To: netfilter-devel

> On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:
> 
>> > connections. As good as possible. If the conntrack table becomes
>> > full, there are two possibilities:
>> >
>> > - conntrack table size is underestimated for the real traffic
>> > flowing
>> >   through. Get more RAM and increase the table size.
>> > - conntrack is under a (DoS) attack. Then protect conntrack by
>> > appropriate
>> >   rules using the recent/limit/psd etc modules.
>>
>> And what if, under load conditions, your table becomes full because
>> 90% of its entries, which are unused, are not aged because of
>> timeouts?
> 
> The only case when that might happen is a DoS. You did not consider the
> second point above.

<snip>

I've mentioned this before but since I'm not an actual developer
in the netfilter arena I assume it got ignored (and will again)
but I can suggest what appears to me to be a common cause of this
problem - online gaming.
The specific game that causes this the most is a game called
CounterStrike.
It is a mod of a game called Half-Life which is handled online by
Sierra.
When you want to play online your computer will talk to one of
the Sierra servers (there are 3 of them I think) that controls any
known games that are created via the same process and the Sierra
server will reply with a list of IP addresses of online game servers
- anywhere from about 5,000 to 20,000 during peak times (my guess
at an average would be around 10,000)
Your PC will then usually 'ping' each of the game servers (yes all
X thousand of them) as quickly as possible to determine the response
times you will get if you play on that server.
This 'ping' connection does end up in the conntrack table
(I call it a 'ping' coz I've never bothered to check what it really
is and it doesn't matter anyway - it ends up in the conntrack table
is all that matters)
There are plenty of other similar games but CounterStrike is the most
popular and thus its numbers are larger than any other game but most
are only an order of magnitude smaller - e.g. Quake I, II & III,
Tribes 2, Medal Of Honour etc.
The number of players online is usually between 5 & 10 times the number
of game servers running.

This gives a good example when being able to set the timeout dependant
upon specific factors (e.g. port/protocol) would be good rather than a
global timeout that suits specific cases and does not match many cases
- and causes a severe problem for a limited set of cases

-- 
-Cheers
-Andrew

MS ... if only he hadn't been hang gliding!

* Re: conntrack DoS
  2002-06-25 22:42           ` Henrik Nordstrom
@ 2002-06-27  8:26             ` Jozsef Kadlecsik
  0 siblings, 0 replies; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-27  8:26 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: netfilter-devel

On Wed, 26 Jun 2002, Henrik Nordstrom wrote:

> A running TCP packet flow (even for a "half-closed" uni-directional TCP)
> is never uni-directional. If there is data flowing in one direction
> then there are ACKs in the other direction.

Yes, right.

> Idea on how conntrack could deal with such connections: If several
> retransmissions (let's say 5) are seen in one direction and no ACKs in the
> other within a reasonable timeframe (let's say 10 minutes) then the TCP
> connection is most likely dead and a low inactivity timeout can be assigned
> (let's say 20 minutes) to have it cleaned out from conntrack.
>
> At a first glance this can be simplified into a RETRANSMIT/ACK timeout
> state machinery, but there is a significant race window making a simple
> packet driven state machine unsuitable. Must not trigger on a delayed
> retransmission followed by a lost ACK, or delayed retransmissions not
> resulting in ACK (out of window).

I believe it is a good approach and can be implemented. But first the
NOTRACK patch...

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

* Re: performance issues (nat / conntrack)
  2002-06-27  2:21       ` Andrew Smith
@ 2002-06-27 11:24         ` Harald Welte
  2002-06-29  5:25           ` Andrew Smith
  0 siblings, 1 reply; 33+ messages in thread
From: Harald Welte @ 2002-06-27 11:24 UTC (permalink / raw)
  To: Andrew Smith; +Cc: netfilter-devel

On Thu, Jun 27, 2002 at 12:21:45PM +1000, Andrew Smith wrote:
> This gives a good example when being able to set the timeout dependant
> upon specific factors (e.g. port/protocol) would be good rather than a
> global timeout that suits specific cases and does not match many cases
> - and causes a severe problem for a limited set of cases

Sorry, but we've had this discussion over and over again. Go to the list
archives and look for tuneable timeouts.

The conclusion of this discussion was that we need to cope with all
cases without any tuning being necessary. 

btw: For the 'ping' case, the icmp echo reply is closing the connection
anyway.

conntrack is mostly about tracking layer 3+4 protocol state.  And this
should happen as transparently as possible, so no assumptions about the
application are made.  [conntrack helpers are an exception, and be sure
I would be much happier if we didn't need to have them].

> -Cheers
> -Andrew

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/

* Re: performance issues (nat / conntrack)
  2002-06-27 11:24         ` Harald Welte
@ 2002-06-29  5:25           ` Andrew Smith
  0 siblings, 0 replies; 33+ messages in thread
From: Andrew Smith @ 2002-06-29  5:25 UTC (permalink / raw)
  To: netfilter-devel

> On Thu, Jun 27, 2002 at 12:21:45PM +1000, Andrew Smith wrote:
>> This gives a good example when being able to set the timeout dependant
>> upon specific factors (e.g. port/protocol) would be good rather than a
>> global timeout that suits specific cases and does not match many cases
>> - and causes a severe problem for a limited set of cases
> 
> Sorry, but we've had this discussion over and over again. Go to the
> list archives and look for tuneable timeouts.
> 
> The conclusion of this discussion was that we need to cope with all
> cases without any tuning being necessary. 

Well either there is a language mistake or that statement is rubbish.
It does NOT cope with all cases.
It fails dismally with the case I've given.
It is not POSSIBLE to cope with all cases without any tuning being
necessary unless the code tuned itself.

Pity that the conclusion is flawed.

> btw: For the 'ping' case, the icmp echo reply is closing the connection
> anyway.

So I guess I need to look in detail what is happening in my case
- but at a guess the problem might be that a large number of the
connections fail to get a fast enough response and thus do not get
closed for a 'long' time.

> conntrack is mostly about tracking layer 3+4 protocol state.  And this
> should happen as transparently as possible, so no assumptions about the
> application are made.  [conntrack helpers are an exception, and be sure
> I would be much happier if we didn't need to have them].
> - Harald Welte / laforge@gnumonks.org              

Yes but the problem is that it causes problems at a higher protocol
level and though it works for most cases - it fails on at least a
few specific cases.

Anyway - this argument will not get anywhere.
I guess some time (in the far distant future :-) when I have the
time and inclination I'll fix it myself and then just have to keep
patching it every time it's updated - coz the comments certainly
suggest that a patch would not be accepted here.

-- 
-Cheers
-Andrew

MS ... if only he hadn't been hang gliding!
